LIS 601: Week 2
Evaluating Information Retrieval Systems: Downie's personal
study notes on Tague-Sutcliffe (1992).
Welcome and caveat:
Find below my personal study notes for:
Tague-Sutcliffe, Jean. 1992. The pragmatics of information retrieval
experimentation, revisited. Information Processing and Management 28
(4): 467-90.
While these notes are rather long, they are by no means complete. I have
also added some extra information where appropriate to help me with the
lecture. You might want to adopt a similar approach to note taking for
the remainder of the readings. These notes run long because the article
forms the basis for how I have structured this course. There is no
substitute for reading the article, however. Remember, these are my
personal notes; the article itself is the authority for future marking
purposes (hint, hint, nudge, nudge).
- Introduction
- Validity
- "the extent to which the experiment actually determines
what the experimenter wishes to determine."
- Reliability
- "the extent to which the experimental results can be replicated"
- Efficiency
- "the extent to which the experiment is effective (i.e., valid
and reliable, relative to resources consumed)"
- Need for testing
- Identify the problem needing answers
- Review literature
- Type of test
- Treatments, factor(s) - independent variables
- Outcomes - dependent variables
- Experiments (laboratory) control sources of variation:
- independent, environmental, and concomitant variables
- users, databases, searchers, search constraints controlled
- Advantages:
- generalizability (good validity and reliability)
- Disadvantages:
- more expensive
- artificial
- Operational tests (real-life):
- Minimal control of sources of variation
- (Control actually lies on a continuum from strong to none)
- Advantages:
- real life data and insights
- less expensive
- Disadvantages:
- need cooperation of administrators, users, etc.
- not generalizable
- difficult to attribute causes
- Definition of variables
- Define independent, dependent, and environmental variables
- Operationalization: deciding what to observe and how to measure what
one is observing
- Key variables:
- Database
- collection of documents
- measured by: number of documents, number of records, size in blocks
or bytes, etc.
- can use concentration measures
- qualitative properties problematic:
- representational issues (e.g. completeness, medium, etc.)
- defining "documents"
- Information Representation
- exhaustivity of indexing
- specificity of indexing
- degree of vocabulary control
- degree of vocabulary linkage
- vocabulary accommodation
- term discrimination
- degree of post-coordination
- degree of syntactic control
- indexing accuracy
- inter-indexer consistency (see the sketch after this list)
- type of classification or clustering used
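To make two of these measures concrete, here is a minimal Python sketch (my own illustration, not from the article): exhaustivity taken as the mean number of index terms assigned per document, and inter-indexer consistency computed with Hooper's measure (terms assigned in common, divided by all distinct terms either indexer assigned). The toy data are invented.

    def exhaustivity(index):
        # 'index' maps document id -> set of assigned index terms.
        return sum(len(terms) for terms in index.values()) / len(index)

    def hooper_consistency(terms_a, terms_b):
        # Hooper's measure: common terms / (terms A + terms B - common terms),
        # i.e. the overlap of the two indexers' term sets.
        common = len(terms_a & terms_b)
        return common / (len(terms_a) + len(terms_b) - common)

    index = {"doc1": {"cats", "pets"}, "doc2": {"dogs", "pets", "training"}}
    print(exhaustivity(index))                      # 2.5 terms per document
    print(hooper_consistency({"cats", "pets"},
                             {"cats", "felines"})) # 1/3, about 0.33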
- Users
- understanding users is growing in importance
- example categories:
- type of user
- context of use
- kinds of information needed
- immediacy of need
- sense-making, problem solving viewpoints
- various methods are used to understand users:
- psychometric scales
- Likert scales
- educational ability, etc.
- Queries
- Query: the user's verbalized statement of need
- Search statement: expression of the Query in the form usable by the
IR system
- dependent on the type of system (Boolean, vector, natural language, etc.)
- can be measured: size, exhaustivity, specificity, etc.
- real-life sources
- artificial sources (derived from titles, etc.)
- Search Intermediaries
- delegated vs end-user searches
- Retrieval Process
- search logic used
- access modes (commands, hypertext, menus, etc.)
- time spent retrieving can be important
- user interaction with system (finding help, revising queries, etc.)
- Retrieval Evaluation
- precision, recall, fallout
- for ranked output, measures can be made at predetermined rank levels
- the E measure and MZ metric were developed to give overall performance indications (a computational sketch follows this section)
- recall measurement problematic (i.e. finding all relevant documents):
- arbitrarily define relevant documents
- use a small document set
- random sampling of unretrieved documents
- use comparative measures
- use Blair and Maron's (1985) technique (broad searches)
- Tague and Schultz's (1989) four dimensions of evaluation:
- Informativeness of retrieved set (precision)
- Completeness of retrieved set (recall)
- Time expended by user (contact time)
- Quality of user's experience (user friendliness)
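To make the set-based measures above concrete, here is a minimal Python sketch (my own illustration, not from the article). It computes precision, recall, fallout, van Rijsbergen's E measure (one overall indicator of the kind mentioned above; the weighting parameter b and the toy collection are my assumptions), and precision at a predetermined rank level.

    # 'retrieved'/'relevant' are sets of document ids; 'ranked' is an ordered list.
    def precision(retrieved, relevant):
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

    def fallout(retrieved, relevant, collection_size):
        nonrelevant = collection_size - len(relevant)
        return len(retrieved - relevant) / nonrelevant if nonrelevant else 0.0

    def e_measure(retrieved, relevant, b=1.0):
        # van Rijsbergen's E combines precision and recall; b > 1 favours
        # recall, b < 1 favours precision. Lower E is better.
        p, r = precision(retrieved, relevant), recall(retrieved, relevant)
        if p == 0.0 or r == 0.0:
            return 1.0
        return 1.0 - ((1 + b**2) * p * r) / (b**2 * p + r)

    def precision_at(ranked, relevant, k):
        # Precision measured at a predetermined rank level k.
        return precision(set(ranked[:k]), relevant)

    # Toy example: 10-document collection, documents 1-4 are relevant.
    ranked = [3, 7, 1, 9, 2]
    relevant = {1, 2, 3, 4}
    print(precision(set(ranked), relevant))    # 0.6
    print(recall(set(ranked), relevant))       # 0.75
    print(fallout(set(ranked), relevant, 10))  # 2/6, about 0.33
    print(precision_at(ranked, relevant, 3))   # 2/3, about 0.67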
- THE RELEVANCE PROBLEM
- What scale to use (binary, ranked, weighted)?
- Who judges relevance (users, experts, etc.)?
- What is the distinction between subject relevance
(aboutness) and usefulness (pertinence)?
- One could adopt the Dervin and Nilan (1986) approach, which does not
attempt to evaluate individual relevance decisions but instead holistically
and qualitatively evaluates the user's experience of interacting with
the system
- cost effectiveness: how efficiently the system
operates
- cost benefit: how much value the system gives the users, institution, community, etc.
- Database decisions (these deal with the information and its structure;
see Retrieval Software below for the systems side).
- The build or buy decision:
- building gives one greater control but is expensive, time consuming,
etc.
- build for testing specific theories about retrieval basics (e.g. a
new indexing method, etc.)
- buy if investigating practical concerns
- Key features:
- coverage of database
- source of documents
- source of vocabulary
- form of documents
- fields of records
- display formats
- interface issues (e.g. windows, etc.)
- design of inverted files: stoplists, truncation, stemming, etc. (see the sketch after this list)
- Use of standard test databases:
- Cranfield, MEDLARS, ONTAP ERIC, etc.
- might be overused, prone to idiosyncrasies
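As a concrete illustration of the inverted-file decisions above, here is a minimal Python sketch (mine, not from the article; the stoplist, the crude suffix-stripping "stemmer", and the toy documents are all assumptions). A real system would use a proper stemming algorithm such as Porter's.

    import re
    from collections import defaultdict

    STOPLIST = {"the", "a", "an", "of", "and", "to", "in"}

    def stem(term):
        # Crude stand-in for a real stemmer: strip a few common suffixes.
        for suffix in ("ing", "ed", "es", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[: -len(suffix)]
        return term

    def build_inverted_file(docs):
        # docs: doc id -> text. Returns term -> sorted doc ids (posting list).
        inverted = defaultdict(set)
        for doc_id, text in docs.items():
            for token in re.findall(r"[a-z]+", text.lower()):
                if token not in STOPLIST:
                    inverted[stem(token)].add(doc_id)
        return {term: sorted(ids) for term, ids in inverted.items()}

    docs = {1: "Indexing of documents", 2: "The indexed document collection"}
    print(build_inverted_file(docs))
    # {'index': [1, 2], 'document': [1, 2], 'collection': [2]}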
- Finding queries
- Major problem is just finding queries
- Real users:
- Advantages:
- realistic queries
- Disadvantages:
- not willing to participate
- drop out before tests are finished
- can be expensive
- determining relevance is a problem
- participants might not be representative (the classic second-year undergrad
or white rat)
- Canned queries: test collections, previous users' queries, etc.
- Advantages:
- cost-efficient
- some have predetermined relevance judgements
- experimenter can control the search process (i.e. can give the system the
best possible search statements, etc.)
- Disadvantages:
- artificial both in nature and approach
- loss of end-user input
- Retrieval Software and the Search Process (think
Search Engines)
- build or buy decisions here also (could also add modify to list of
decisions)
- whatever decision, training can be important
- in general, operational tests involve buying or modifying
- procedures must be standardized where possible
- observation/avoidance of fatigue, breakdowns, etc. important (i.e.
extraneous sources of variance)
- Experimental design
- Note in this section how the experimental elements are organized (a Latin-square sketch follows this list):
- Experimental strategies (treatments)
- Searchers
- Queries
- Blocks of similar queries
- Time periods
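One standard way to organize these elements is a Latin square, so that each searcher tries every strategy (treatment) once, in a different order across the time periods; this balances learning and fatigue effects against the treatments. A minimal Python sketch (my illustration; the strategy names are invented):

    def latin_square(treatments):
        # Row i gives searcher i's treatment order, cyclically shifted so
        # every treatment appears once per row and once per column (period).
        n = len(treatments)
        return [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]

    strategies = ["Boolean", "vector", "natural-language"]
    for searcher, row in enumerate(latin_square(strategies), start=1):
        print(f"searcher {searcher}: {row}")
    # searcher 1: ['Boolean', 'vector', 'natural-language']
    # searcher 2: ['vector', 'natural-language', 'Boolean']
    # searcher 3: ['natural-language', 'Boolean', 'vector']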
- Data collection
- data can be categorized by the component under investigation:
- database
- contact the builder, run statistical evaluations
- doing it yourself can be costly
- people
- observations
- obtrusive (can influence results)
- unobtrusive (can have ethical consequences, e.g. privacy and consent issues)
- surveys, questionnaires
- low response rate a nuisance
- self-reporting a problem
- pretesting important
- protocols (e.g. talk-throughs, interviews, etc.)
- can be obtrusive
- pretesting also important
- computer logs (see the sketch after this list)
- ethical issues
- gaining access
- processes
- getting relevance judgements a problem
- for bibliographic databases, make sure to have access to the real documents
- use full-text databases where possible (allows users to judge relevance
more easily)
- results
- determine in advance what analyses are to be done so data is collected
and encoded properly
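Computer logs are among the cheapest unobtrusive measures. As a sketch of the idea (Python; the pipe-delimited log format here is entirely hypothetical), one can recover the contact time mentioned earlier by pairing each session's first and last timestamped events:

    from datetime import datetime

    LOG = [
        # hypothetical "timestamp|session|event" transaction-log lines
        "1997-09-16T10:00:05|s1|query",
        "1997-09-16T10:03:41|s1|display",
        "1997-09-16T10:07:12|s1|logoff",
        "1997-09-16T10:01:00|s2|query",
        "1997-09-16T10:02:30|s2|logoff",
    ]

    def contact_times(log_lines):
        # Track the earliest and latest event time seen for each session.
        sessions = {}
        for line in log_lines:
            stamp, session, _event = line.split("|")
            t = datetime.fromisoformat(stamp)
            first, last = sessions.get(session, (t, t))
            sessions[session] = (min(first, t), max(last, t))
        return {s: (last - first).total_seconds()
                for s, (first, last) in sessions.items()}

    print(contact_times(LOG))  # {'s1': 427.0, 's2': 90.0}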
- Data analysis
- quantitative or qualitative analyses (or a mixture of both)?
- quantitative analyses:
- descriptive statistics
- summarizes data
- tables, histograms, etc.
- recall-precision graphs are very useful
- measures of central tendency, variance, and association are good supplements
to tables and graphs
- make sure to determine whether data is categorical, ordinal, or interval/ratio
so proper measures can be applied
- aggregation of different variables where possible
- inferential statistics
- dependent on type of data
- used for estimation (of parameters or population characteristics)
- for example, average recall, average costs
- comparison
- for example, whether there really is a different recall rate for different
search strategies
- ANOVA widely used (see the sketch after this list)
- exploration of relationships
- for example, building a model of a system that depicts the outcome
of the system as a function of various independent factors (i.e. time spent
searching, length of query, etc.)
- forecasting
- for example, predicting future use patterns based upon past (time-related)
data
- understanding the structure of multivariate data
- for example, when each user has a profile consisting of values for a wide
range of variables (e.g. age, education, income, gender, language ability,
etc.), the users can be clustered into discernible groups
- remember to distinguish between discrete and continuous data (determines
appropriate test)
- SPSS, SAS, and Minitab are all commonly used statistics packages
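As a sketch of the comparison use of inferential statistics described above (whether recall really differs across search strategies), here is a one-way ANOVA in Python with SciPy, standing in for the packages just mentioned; the recall figures are invented:

    from scipy import stats

    # Recall observed on the same queries under three strategies (made-up data).
    boolean = [0.42, 0.51, 0.38, 0.45, 0.40]
    vector = [0.55, 0.60, 0.49, 0.58, 0.52]
    natural = [0.47, 0.50, 0.44, 0.53, 0.46]

    f_stat, p_value = stats.f_oneway(boolean, vector, natural)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
    # A small p (say, < 0.05) suggests at least one strategy's mean recall
    # differs; post hoc comparisons would identify which one(s).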
- Presenting results
- Report should follow this outline:
- purpose of test (why was it done and what were the expectations)
- background of test (what have others done in regard to this specific
problem)
- methods used (variables, equipment, environment, etc., so others can
replicate)
- presentation of results (textual, tabular and graphical)
- conclusions (review and summarize, explain the contribution made, and
indicate implications for future research)
- Conferences (SIGIR, CAIS, ASIS, Digital Libraries, ALA, etc.)
- Journals (JASIS, JDOC, Communications of the ACM, Information Processing
and Management, etc.).
Page creator: J. Stephen Downie
Page created: 16 September 1997
Page updated: 18 September 1997