LIS 601: Week 2
Evaluating Information Retrieval Systems: Downie's personal
study notes on Tague-Sutcliffe (1992).
Welcome and caveat:
Find below my personal study notes for:
Tague-Sutcliffe, Jean. 1992. The pragmatics of information retrieval
experimentation, revisited. Information Processing and Management 28
(4): 467-90.
While these notes are rather long, they are by no means complete. I have
also added some extra information where appropriate to help me with the
lecture. You might want to adopt a similar approach to note taking for
the remainder of the readings. These notes run long because the article
forms the basis for how I have structured this course. There is no
substitute for reading the article, however. Remember, these are my
personal notes; the article itself is the authority for future marking
purposes (hint, hint, nudge, nudge).
- Introduction
- Validity
- "the extent to which the experiment actually determines
what the experimenter wishes to determine."
- Reliability
- "the extent to which the experimental results can be replicated"
- Efficiency
- "the extent to which the experiment is effective (i.e., valid
and reliable, relative to resources consumed)"
- Need for testing
- Identify the problem needing answers
- Review literature
- Type of test
- Treatments, factor(s) - independent variables
- Outcomes - dependent variables
- Experiments (laboratory) control sources of variation:
- independent, environmental, and concomitant variables
- users, databases, searchers, search constraints controlled
- Advantages:
- generalizability (good validity and reliability)
- Disadvantages:
- more expensive
- artificial
- Operational tests (real-life):
- Minimal control of sources of variation
- (Control actually lies on a continuum from strong to none)
- Advantages:
- real life data and insights
- less expensive
- Disadvantages:
- need cooperation of administrators, users, etc.
- not generalizable
- difficult to attribute causes
- Definition of variables
- Define independent, dependent, and environmental variables
- Operationalization: deciding what to observe and how to measure what
one is observing
- Key variables:
- Database
- collection of documents
- measured by: number of documents, number of records, size in blocks
or bytes, etc.
- can use concentration measures
- qualitative properties problematic:
- representational issues (e.g. completeness, medium, etc.)
- defining "documents"
- Information Representation
- exhaustivity of indexing
- specificity of indexing
- degree of vocabulary control
- degree of vocabulary linkage
- vocabulary accommodation
- term discrimination
- degree of post-coordination
- degree of syntactic control
- indexing accuracy
- inter-indexer consistency (see the sketch after this list)
- type of classification or clustering used
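To make two of these measures concrete, here is a minimal Python sketch (my own illustration, not from the article): exhaustivity taken as the mean number of index terms assigned per document, and inter-indexer consistency computed with Hooper's measure (terms assigned in common, divided by all distinct terms either indexer assigned). The toy data are invented.

    def exhaustivity(index):
        # 'index' maps document id -> set of assigned index terms.
        return sum(len(terms) for terms in index.values()) / len(index)

    def hooper_consistency(terms_a, terms_b):
        # Hooper's measure: common terms / (terms A + terms B - common terms),
        # i.e. the overlap of the two indexers' term sets.
        common = len(terms_a & terms_b)
        return common / (len(terms_a) + len(terms_b) - common)

    index = {"doc1": {"cats", "pets"}, "doc2": {"dogs", "pets", "training"}}
    print(exhaustivity(index))                      # 2.5 terms per document
    print(hooper_consistency({"cats", "pets"},
                             {"cats", "felines"})) # 1/3, about 0.33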
- Users
- understanding users is growing in importance
- example categories:
- type of user
- context of use
- kinds of information needed
- immediacy of need
- sense-making, problem solving viewpoints
- various methods are used to understand users:
- psychometric scales
- Likert scales
- educational ability, etc.
- Queries
- Query: the user's verbalized statement of need
- Search statement: expression of the Query in the form usable by the
IR system
- dependent on the type of system (Boolean, vector, natural language, etc.)
- can be measured: size, exhaustivity, specificity, etc.
- real-life sources
- artificial sources (derived from titles, etc.)
- Search Intermediaries
- delegated vs end-user searches
- Retrieval Process
- search logic used
- access modes (commands, hypertext, menus, etc.)
- time spent retrieving can be important
- user interaction with system (finding help, revising queries, etc.)
- Retrieval Evaluation
- precision, recall, fallout
- for ranked output, measures can be made at predetermined rank levels
- the E measure and MZ metric were developed to give overall performance indications (a computational sketch follows this section)
- recall measurement problematic (i.e. finding all relevant documents):
- arbitrarily define relevant documents
- use a small document set
- random sampling of unretrieved documents
- use comparative measures
- use Blair and Maron's (1985) technique (broad searches)
- Tague and Schultz's (1989) four dimensions of evaluation:
- Informativeness of retrieved set (precision)
- Completeness of retrieved set (recall)
- Time expended by user (contact time)
- Quality of user's experience (user friendliness)
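To make the set-based measures above concrete, here is a minimal Python sketch (my own illustration, not from the article). It computes precision, recall, fallout, van Rijsbergen's E measure (one overall indicator of the kind mentioned above; the weighting parameter b and the toy collection are my assumptions), and precision at a predetermined rank level.

    # 'retrieved'/'relevant' are sets of document ids; 'ranked' is an ordered list.
    def precision(retrieved, relevant):
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

    def fallout(retrieved, relevant, collection_size):
        nonrelevant = collection_size - len(relevant)
        return len(retrieved - relevant) / nonrelevant if nonrelevant else 0.0

    def e_measure(retrieved, relevant, b=1.0):
        # van Rijsbergen's E combines precision and recall; b > 1 favours
        # recall, b < 1 favours precision. Lower E is better.
        p, r = precision(retrieved, relevant), recall(retrieved, relevant)
        if p == 0.0 or r == 0.0:
            return 1.0
        return 1.0 - ((1 + b**2) * p * r) / (b**2 * p + r)

    def precision_at(ranked, relevant, k):
        # Precision measured at a predetermined rank level k.
        return precision(set(ranked[:k]), relevant)

    # Toy example: 10-document collection, documents 1-4 are relevant.
    ranked = [3, 7, 1, 9, 2]
    relevant = {1, 2, 3, 4}
    print(precision(set(ranked), relevant))    # 0.6
    print(recall(set(ranked), relevant))       # 0.75
    print(fallout(set(ranked), relevant, 10))  # 2/6, about 0.33
    print(precision_at(ranked, relevant, 3))   # 2/3, about 0.67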
- THE RELEVANCE PROBLEM
- What scale to use (binary, ranked, weighted)?
- Who judges relevance (users, experts, etc.)?
- What is the distinction between subject relevance
(aboutness) and usefulness (pertinence)?
- One could adopt the Dervin and Nilan (1986) approach, which does not
attempt to evaluate individual relevance decisions but instead holistically
and qualitatively evaluates the user's experience of interacting with
the system
- cost effectiveness: how efficiently the system
operates
- cost benefit: how much value the system gives the users, institution, community, etc.
- Database decisions (these deal with the information and its structure;
see Retrieval Software below for the systems side).
- The build or buy decision:
- building gives one greater control but is expensive, time consuming,
etc.
- build for testing specific theories about retrieval basics (e.g. a
new indexing method, etc.)
- buy if investigating practical concerns
- Key features:
- coverage of database
- source of documents
- source of vocabulary
- form of documents
- fields of records
- display formats
- interface issues (e.g. windows, etc.)
- design of inverted files: stoplists, truncation, stemming, etc. (see the sketch after this list)
- Use of standard test databases:
- Cranfield, MEDLARS, ONTAP ERIC, etc.
- might be overused, prone to idiosyncrasies
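As a concrete illustration of the inverted-file decisions above, here is a minimal Python sketch (mine, not from the article; the stoplist, the crude suffix-stripping "stemmer", and the toy documents are all assumptions). A real system would use a proper stemming algorithm such as Porter's.

    import re
    from collections import defaultdict

    STOPLIST = {"the", "a", "an", "of", "and", "to", "in"}

    def stem(term):
        # Crude stand-in for a real stemmer: strip a few common suffixes.
        for suffix in ("ing", "ed", "es", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[: -len(suffix)]
        return term

    def build_inverted_file(docs):
        # docs: doc id -> text. Returns term -> sorted doc ids (posting list).
        inverted = defaultdict(set)
        for doc_id, text in docs.items():
            for token in re.findall(r"[a-z]+", text.lower()):
                if token not in STOPLIST:
                    inverted[stem(token)].add(doc_id)
        return {term: sorted(ids) for term, ids in inverted.items()}

    docs = {1: "Indexing of documents", 2: "The indexed document collection"}
    print(build_inverted_file(docs))
    # {'index': [1, 2], 'document': [1, 2], 'collection': [2]}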
- Finding queries
- Major problem is just finding queries
- Real users:
- Advantages:
- realistic queries
- Disadvantages:
- not willing to participate
- drop out before tests are finished
- can be expensive
- determining relevance is a problem
- participants might not be representative (the classic second-year undergrad
or white rat)
- Canned queries: test collections, previous users' queries, etc.
- Advantages:
- cost-efficient
- some have predetermined relevance judgements
- experimenter can control the search process (i.e. can give the system the
best possible search statements, etc.)
- Disadvantages:
- artificial both in nature and approach
- loss of end-user input
- Retrieval Software and the Search Process (think
Search Engines)
- build or buy decisions here also (could also add modify to list of
decisions)
- whatever decision, training can be important
- in general, operational tests involve buying or modifying
- procedures must be standardized where possible
- observation/avoidance of fatigue, breakdowns, etc. important (i.e.
extraneous sources of variance)
- Experimental design
- Note in this section how the experimental elements are organized (a Latin-square sketch follows this list):
- Experimental strategies (treatments)
- Searchers
- Queries
- Blocks of similar queries
- Time periods
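One standard way to organize these elements is a Latin square, so that each searcher tries every strategy (treatment) once, in a different order across the time periods; this balances learning and fatigue effects against the treatments. A minimal Python sketch (my illustration; the strategy names are invented):

    def latin_square(treatments):
        # Row i gives searcher i's treatment order, cyclically shifted so
        # every treatment appears once per row and once per column (period).
        n = len(treatments)
        return [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]

    strategies = ["Boolean", "vector", "natural-language"]
    for searcher, row in enumerate(latin_square(strategies), start=1):
        print(f"searcher {searcher}: {row}")
    # searcher 1: ['Boolean', 'vector', 'natural-language']
    # searcher 2: ['vector', 'natural-language', 'Boolean']
    # searcher 3: ['natural-language', 'Boolean', 'vector']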
- Data collection
- data can be categorized by the component under investigation:
- database
- contact the builder, run statistical evaluations
- doing it yourself can be costly
- people
- observations
- obtrusive (can influence results)
- unobtrusive (can have ethical consequences, e.g. privacy and consent issues)
- surveys, questionnaires
- low response rate a nuisance
- self-reporting a problem
- pretesting important
- protocols (e.g. talk-throughs, interviews, etc.)
- can be obtrusive
- pretesting also important
- computer logs (see the sketch after this list)
- ethical issues
- gaining access
- processes
- getting relevance judgements a problem
- for bibliographic databases, make sure to have access to the real documents
- use full-text databases where possible (allows users to judge relevance
more easily)
- results
- determine in advance what analyses are to be done so data is collected
and encoded properly
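Computer logs are among the cheapest unobtrusive measures. As a sketch of the idea (Python; the pipe-delimited log format here is entirely hypothetical), one can recover the contact time mentioned earlier by pairing each session's first and last timestamped events:

    from datetime import datetime

    LOG = [
        # hypothetical "timestamp|session|event" transaction-log lines
        "1997-09-16T10:00:05|s1|query",
        "1997-09-16T10:03:41|s1|display",
        "1997-09-16T10:07:12|s1|logoff",
        "1997-09-16T10:01:00|s2|query",
        "1997-09-16T10:02:30|s2|logoff",
    ]

    def contact_times(log_lines):
        # Track the earliest and latest event time seen for each session.
        sessions = {}
        for line in log_lines:
            stamp, session, _event = line.split("|")
            t = datetime.fromisoformat(stamp)
            first, last = sessions.get(session, (t, t))
            sessions[session] = (min(first, t), max(last, t))
        return {s: (last - first).total_seconds()
                for s, (first, last) in sessions.items()}

    print(contact_times(LOG))  # {'s1': 427.0, 's2': 90.0}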
- Data analysis
- quantitative or qualitative analyses (or a mixture of both)?
- quantitative analyses:
- descriptive statistics
- summarizes data
- tables, histograms, etc.
- recall-precision graphs are very useful
- measures of central tendency, variance, and association are good supplements
to tables and graphs
- make sure to determine whether data is categorical, ordinal, or interval/ratio
so proper measures can be applied
- aggregation of different variables where possible
- inferential statistics
- dependent on type of data
- used for estimation (of parameters or population characteristics)
- for example, average recall, average costs
- comparison
- for example, whether there really is a different recall rate for different
search strategies
- ANOVA widely used (see the sketch after this list)
- exploration of relationships
- for example, building a model of a system that depicts the outcome
of the system as a function of various independent factors (i.e. time spent
searching, length of query, etc.)
- forecasting
- for example, predicting future use patterns based upon past (time-related)
data
- understanding the structure of multivariate data
- for example, when each user has a profile consisting of values for a wide
range of variables (e.g. age, education, income, gender, language ability,
etc.), the users can be clustered into discernible groups
- remember to distinguish between discrete and continuous data (determines
appropriate test)
- SPSS, SAS, and Minitab are all commonly used statistics packages
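As a sketch of the comparison use of inferential statistics described above (whether recall really differs across search strategies), here is a one-way ANOVA in Python with SciPy, standing in for the packages just mentioned; the recall figures are invented:

    from scipy import stats

    # Recall observed on the same queries under three strategies (made-up data).
    boolean = [0.42, 0.51, 0.38, 0.45, 0.40]
    vector = [0.55, 0.60, 0.49, 0.58, 0.52]
    natural = [0.47, 0.50, 0.44, 0.53, 0.46]

    f_stat, p_value = stats.f_oneway(boolean, vector, natural)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
    # A small p (say, < 0.05) suggests at least one strategy's mean recall
    # differs; post hoc comparisons would identify which one(s).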
- Presenting results
- Report should follow this outline:
- purpose of test (why was it done and what were the expectations)
- background of test (what have others done in regard to this specific
problem)
- methods used (variables, equipment, environment, etc., so others can
replicate)
- presentation of results (textual, tabular and graphical)
- conclusions (review and summarize, explain the contribution made, and
indicate implications for future research)
- Conferences (SIGIR, CAIS, ASIS, Digital Libraries, ALA, etc.)
- Journals (JASIS, JDOC, Communications of the ACM, Information Processing
and Management, etc.).
Page creator: J. Stephen Downie
Page created: 16 September 1997
Page updated: 18 September 1997