LIS 601: Week 2
Evaluating Information Retrieval Systems: Downie's personal study notes on Tague-Sutcliffe (1992).


Welcome and caveat:

Find below my personal study notes for Tague-Sutcliffe (1992).

While these notes are rather long, they are by no means complete. I have also added some extra information where appropriate to help me with the lecture. I advise that you adopt a similar approach to note taking for the remainder of the readings. These notes are long because the article forms the basis for my structuring of this course. There is no substitute for reading the article, however. Remember, these are my personal notes; the article itself is the authority for future marking purposes (hint, hint, nudge, nudge).


  1. Introduction
    1. Validity
      1. "the extent to which the experiment actually determines what the experimenter wishes to determine."
    2. Reliability
      1. "the extent to which the experimental results can be replicated"
    3. Efficiency
      1. "the extent to which the experiment is effective (i.e., valid and reliable, relative to resources consumed)"
  1. Need for testing
    1. Identify the problem needing answers
    2. Review literature
  2. Type of test
    1. Treatments, factor(s) - independent variables
    2. Outcomes - dependent variables
    3. Experiments (laboratory) control sources of variation:
      1. independent, environmental, and concomitant variables
      2. users, databases, searchers, search constraints controlled
      3. Advantages:
        1. generalizability (good validity and reliability)
      4. Disadvantages:
        1. more expensive
        2. artificial
    4. Operational tests (real-life):
      1. Minimal control of sources of variation
      2. (Control actually lies on a continuum from strong to none)
      3. Advantages:
        1. real life data and insights
        2. less expensive
      4. Disadvantages:
        1. need cooperation of administrators, users, etc.
        2. not generalizable
        3. difficult to attribute causes
  3. Definition of variables
    1. Define independent, dependent, and environmental variables
    2. Operationalization: deciding what to observe and how to measure what one is observing
    3. Key variables:
      1. Database
        1. collection of documents
        2. measured by: number of documents, number of records, size in blocks or bytes, etc.
        3. can use concentration measures
        4. qualitative properties problematic:
          1. representational issues (e.g. completeness, medium, etc.)
          2. defining "documents"
      2. Information Representation
        1. exhaustivity of indexing
        2. specificity of indexing
        3. degree of vocabulary control
        4. degree of vocabulary linkage
        5. vocabulary accommodation
        6. term discrimination
        7. degree of post-coordination
        8. degree of syntactic control
        9. indexing accuracy
        10. inter-indexer consistency
        11. type of classification or clustering used
      3. Users
        1. understanding users is growing in importance
        2. example categories:
          1. type of user
          2. context of use
          3. kinds of information needed
          4. immediacy of need
        3. sense-making and problem-solving viewpoints
        4. various methods used to understand users:
          1. psychometric scales
          2. Likert scales
          3. educational ability, etc.
      4. Queries
        1. Query: the user's verbalized statement of need
        2. Search statement: expression of the Query in the form usable by the IR system
          1. dependent on the type of system (Boolean, vector, natural language, etc.)
          2. can be measured: size, exhaustivity, specificity, etc.
        3. real-life sources
        4. artificial sources (derived from titles, etc.)
      5. Search Intermediaries
        1. delegated vs end-user searches
      6. Retrieval Process
        1. search logic used
        2. access modes (commands, hypertext, menus, etc.)
        3. time spent retrieving can be important
        4. user interaction with system (finding help, revising queries, etc.)
      7. Retrieval Evaluation
        1. precision, recall, fallout (see the sketch at the end of this section)
        2. for ranked output, measures can be computed at predetermined rank (cut-off) levels
        3. the E measure and the MZ metric were developed to give overall indications of performance
        4. recall measurement is problematic (it requires identifying all relevant documents); workarounds include:
          1. arbitrarily define relevant documents
          2. use a small document set
          3. random sampling of unretrieved documents
          4. use comparative measures
          5. use Blair and Maron's (1985) technique (broad searches)
        5. Tague and Schultz's (1989) four dimensions of evaluation:
          1. Informativeness of retrieved set (precision)
          2. Completeness of retrieved set (recall)
          3. Time expended by user (contact time)
          4. Quality of user's experience (user friendliness)
        6. THE RELEVANCE PROBLEM
          1. What scale to use (binary, ranked, weighted)?
          2. Who judges relevance (users, experts, etc.)?
          3. What is the distinction between subject relevance (aboutness) and usefulness (pertinence)?
          4. One could adopt the Dervin and Nilan (1986) approach, which does not attempt to evaluate individual relevance decisions but instead holistically and qualitatively evaluates the user's experience of interacting with the system
        7. cost effectiveness: how efficiently the system operates
        8. cost benefit: how much value does the system give the users, institution, community, etc.
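
(An aside on the measures above: the sketch below is a minimal illustration of my own in Python, not anything from the article. It computes precision, recall, fallout, and precision at a fixed rank cut-off, assuming binary relevance judgements, a known collection size, and made-up document identifiers and numbers.)

      # Minimal sketch: core retrieval-evaluation measures with binary relevance.
      # "relevant" and "retrieved" are sets of document ids; collection_size is the
      # total number of documents in the database.

      def precision(retrieved, relevant):
          return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

      def recall(retrieved, relevant):
          return len(retrieved & relevant) / len(relevant) if relevant else 0.0

      def fallout(retrieved, relevant, collection_size):
          # proportion of the non-relevant documents that were retrieved
          non_relevant = collection_size - len(relevant)
          return len(retrieved - relevant) / non_relevant if non_relevant else 0.0

      def precision_at(ranking, relevant, k):
          # precision over the top k documents of a ranked output
          return len(set(ranking[:k]) & relevant) / k

      # Invented example data:
      relevant = {"d1", "d4", "d7", "d9"}
      ranking = ["d1", "d3", "d4", "d8", "d9", "d2"]   # ranked output
      retrieved = set(ranking)

      print(precision(retrieved, relevant))        # 3/6 = 0.50
      print(recall(retrieved, relevant))           # 3/4 = 0.75
      print(fallout(retrieved, relevant, 100))     # 3/96, roughly 0.03
      print(precision_at(ranking, relevant, 3))    # 2/3, roughly 0.67
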
  4. Database decisions (this concerns the information and its structure; see Retrieval Software for the system side).
    1. The build or buy decision:
      1. building gives one greater control but is expensive, time consuming, etc.
      2. build for testing specific theories about retrieval basics (e.g. a new indexing method, etc.)
      3. buy if investigating practical concerns
    2. Key features:
      1. coverage of database
      2. source of documents
      3. source of vocabulary
      4. form of documents
      5. fields of records
      6. display formats
      7. interface issues (i.e. windows, etc.)
      8. design of inverted files: stoplists, truncation, stemming, etc. (see the sketch at the end of this section)
    3. Use of standard test databases:
      1. Cranfield, Medlars, ONTAP ERIC, etc.
      2. might be overused, prone to idiosyncrasies
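
(As a concrete illustration of the inverted-file decisions above, here is a minimal sketch in Python. The toy stoplist and the crude suffix-stripping "stemmer" are stand-ins of my own; real systems use much larger stoplists and proper stemming algorithms such as Porter's.)

      # Minimal sketch: building an inverted file with a stoplist and crude stemming.
      from collections import defaultdict

      STOPLIST = {"the", "a", "an", "of", "and", "in", "on", "for"}   # toy stoplist

      def stem(term):
          # crude suffix stripping, for illustration only
          for suffix in ("ing", "ed", "es", "s"):
              if term.endswith(suffix) and len(term) > len(suffix) + 2:
                  return term[: -len(suffix)]
          return term

      def build_inverted_file(docs):
          # docs: {doc_id: text}; returns {stemmed term: set of doc_ids containing it}
          index = defaultdict(set)
          for doc_id, text in docs.items():
              for token in text.lower().split():
                  if token not in STOPLIST:
                      index[stem(token)].add(doc_id)
          return index

      docs = {
          "d1": "indexing and searching of documents",
          "d2": "the searcher indexes documents for retrieval",
      }
      index = build_inverted_file(docs)
      print(index["index"])      # both documents: "indexing" and "indexes" stem to "index"
      print(index["document"])   # both documents
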
  5. Finding queries
    1. Major problem is just finding queries
    2. Real users:
      1. Advantages:
        1. realistic queries
      2. Disadvantages:
        1. may not be willing to participate
        2. may drop out before tests are finished
        3. can be expensive
        4. determining relevance is a problem
        5. participants might not be representative (the classic second-year undergrad or white rat)
    3. Canned queries: test collections, previous users' queries, etc.
      1. Advantages:
        1. cost-efficient
        2. some have predetermined relevance judgements
        3. experimenter can control search process (e.g. giving the system the best possible search statements)
      2. Disadvantages:
        1. artificial both in nature and approach
        2. loss of end-user input
  6. Retrieval Software and the Search Process (think Search Engines)
    1. build or buy decisions apply here also (one could add modify to the list of options)
    2. whatever decision, training can be important
    3. in general, operational tests involve buying or modifying
    4. procedures must be standardized where possible
    5. observation/avoidance of fatigue, breakdowns, etc. is important (these are extraneous sources of variance)
  7. Experimental design
    1. Note in this section how the experimental elements are organized (a small scheduling sketch follows this list):
      1. Experimental strategies (treatments)
      2. Searchers
      3. Queries
      4. Blocks of similar queries
      5. Time periods
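
(A minimal sketch, in Python and entirely of my own invention, of how these elements might be crossed: each searcher applies each experimental strategy once, with the order rotated across time periods so that no searcher always meets a strategy first. The article does not prescribe this particular arrangement; real designs, e.g. full Latin squares, depend on the test.)

      # Minimal sketch: rotating treatments (search strategies) across searchers
      # and time periods so each searcher uses each strategy exactly once.
      strategies = ["Boolean", "vector", "natural language"]    # treatments
      searchers = ["S1", "S2", "S3"]
      query_blocks = ["block A", "block B", "block C"]           # blocks of similar queries

      schedule = []
      for period in range(len(strategies)):
          for i, searcher in enumerate(searchers):
              strategy = strategies[(i + period) % len(strategies)]   # rotate the order
              schedule.append((f"period {period + 1}", searcher, strategy, query_blocks[period]))

      for row in schedule:
          print(row)
      # e.g. ('period 1', 'S1', 'Boolean', 'block A'), ('period 1', 'S2', 'vector', 'block A'), ...
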
  8. Data collection
    1. data can be categorized by the component under investigation:
      1. database
        1. contact the builder, or run statistical evaluations
        2. doing it yourself can be costly
      2. people
        1. observations
          1. obtrusive (can influence results)
          2. unobtrusive (can have ethical consequences, e.g. privacy and consent issues)
        2. surveys, questionnaires
          1. low response rate a nuisance
          2. self-reporting a problem
          3. pretesting important
        3. protocols (e.g. talk-throughs, interviews, etc.)
          1. can be obtrusive
          2. pretesting also important
        4. computer logs
          1. ethical issues
          2. gaining access
      3. processes
        1. getting relevance judgements a problem
        2. for bibliographic databases make sure to have access to real documents
        3. use full-text databases where possible (allows users to judge relevance more easily)
      4. results
        1. determine in advance what analyses are to be done so data is collected and encoded properly
  9. Data analysis
    1. quantitative or qualitative analyses (or a mixture of both)?
    2. quantitative analyses:
      1. descriptive statistics
        1. summarizes data
        2. tables, histograms, etc.
        3. recall-precision graphs are very useful
        4. measures of central tendency, variance, and association are good supplements to tables and graphs
        5. make sure to determine whether data are categorical, ordinal, or interval/ratio so that proper measures can be applied
        6. aggregation of different variables where possible
      2. inferential statistics
        1. dependent on type of data
        2. used for estimation (of parameters or population characteristics)
          1. for example, average recall, average costs
        3. comparison
          1. for example, whether there really is a different recall rate for different search strategies
          2. ANOVA widely used (see the sketch at the end of this section)
        4. exploration of relationships
          1. for example, building a model of a system that depicts the outcome of the system as a function of various independent factors (e.g. time spent searching, length of query, etc.)
        5. forecasting
          1. for example, predicting future use patterns based upon past (time-related) data
        6. understanding the structure of multivariate data
          1. for example, when each user has a profile consisting of attributes to a wide range of variables (e.g. age, education, income, gender, language ability, etc.) the users can be clustered into discernible groups
        7. remember to distinguish between discrete and continuous data (determines appropriate test)
        8. SPSS, SAS, and Minitab are all commonly used statistics packages
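
(An entirely made-up illustration of the descriptive and comparison steps above, sketched in Python with SciPy: summarize recall by search strategy, then run a one-way ANOVA to ask whether the strategies really differ. The recall values are invented, and in practice the appropriate test depends on the type of data; SPSS, SAS, Minitab, R, etc. offer the same facilities.)

      # Minimal sketch: descriptive statistics plus a one-way ANOVA comparing
      # recall across search strategies (all numbers invented for illustration).
      from statistics import mean, stdev
      from scipy.stats import f_oneway

      recall_by_strategy = {
          "Boolean":          [0.42, 0.55, 0.48, 0.60, 0.51],
          "vector":           [0.58, 0.63, 0.57, 0.70, 0.66],
          "natural language": [0.50, 0.47, 0.61, 0.55, 0.52],
      }

      # Descriptive statistics: central tendency and variability per treatment.
      for strategy, values in recall_by_strategy.items():
          print(f"{strategy:18s} mean = {mean(values):.3f}  sd = {stdev(values):.3f}")

      # Inferential statistics: is mean recall really different across strategies?
      f_stat, p_value = f_oneway(*recall_by_strategy.values())
      print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
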
  10. Presenting results
    1. Report should follow this outline:
      1. purpose of test (why was it done and what were the expectations)
      2. background of test (what have others done in regards to this specific problem)
      3. methods used (variables, equipment, environment, etc., so others can replicate)
      4. presentation of results (textual, tabular and graphical)
      5. conclusions (review and summarize, explain the contribution made, indicate implications for future research)
    2. Conferences (SIGIR, CAIS, ASIS, Digital Libraries, ALA, etc.)
    3. Journals (JASIS, JDOC, Communications of the ACM, Information Processing and Management, etc.).


Page creator: J. Stephen Downie
Page created: 16 September 1997
Page updated: 18 September 1997