LIS 601: Week3
Important Automatic Indexing Concepts

  1. Stemming
    1. used to conflate, or reduce, morphological variants of a word to a single index term
    2. should increase recall by grouping similar concepts under a single term
    3. achieved through the use of a stemming algorithm (e.g. the Porter algorithm)
    4. prone to overkill (conflating terms too severely):
      1. e.g. "morals" and "more" stemmed to "mor"
    5. weak stemming seems most effective
      1. e.g. removing only plurals, "-ing", or "-ed"
    6. questionable utility
      1. manual truncation (i.e. user wildcarding) about as effective
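The weak stemming idea above can be sketched as follows. This is an illustrative function (the name, suffix list, and minimum-length guard are assumptions, not the Porter algorithm or any standard library routine):

```python
# A minimal "weak" stemmer: strip only a plural "-s", "-ing", or "-ed"
# ending, and leave short words alone to avoid over-conflation.
def weak_stem(word):
    """Strip at most one common suffix; keep at least a 3-letter stem."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(weak_stem("morals"))    # moral (not the overkill "mor")
print(weak_stem("indexing"))  # index
print(weak_stem("is"))        # is (too short to stem)
```

Note how "morals" conflates to "moral" rather than the overkill "mor" produced by stronger stemming.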
  2. Statistical Analysis
    1. based on the theory that a term's frequency properties (i.e. how many times it appears in a document and/or within the database), also known as its informetric properties, can determine its utility as an index term
    2. high frequency terms and very low frequency terms deemed to be poor candidates
    3. relatively simple to perform because the basic operation is that of counting tokens and types
    4. quite effective in the creation of single term indexes
    5. phrase indexes generated statistically tend to contain many nonsensical phrases
      1. single term indexes are quite effective, so phrase indexing is not as popular
    6. context sensitive
      1. thus no universal set of statistics to generate a universal index
      2. works best with single subject databases
    7. sensitive to document length
      1. works best with abstracts and/or paragraphs as document surrogates
    8. words can be assigned weights that represent the value of the word as an indexing term
      1. weights usually:
        1. binary (0 or 1) [not present, present]
        2. ranged (0 <= x <= y) [not present, present to some degree]
      2. weights can be determined in a variety of ways, which in turn can be combined
        1. Within-Document term frequency (commonly denoted as tf)
        2. Inverse Document frequency (IDF)
          1. inversely related to the proportion of documents within the collection in which the term can be found
          2. the combination tf * IDF has proven to be quite effective
        3. Term Discrimination Value
          1. measures, via a similarity measure, how the use of a term increases (lower discrimination) or decreases (higher discrimination) the average similarity of the documents in the collection
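The tf * IDF combination can be sketched on a toy collection. One common formulation (variants abound; this one uses the natural log of N over document frequency) is weight(t, d) = tf(t, d) * log(N / df(t)). The collection and function names below are invented for illustration:

```python
import math

# Toy collection: three short "documents" (e.g. abstracts as surrogates).
docs = [
    "stemming conflates word variants",
    "term frequency measures term importance",
    "frequency statistics guide term selection",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def df(term):
    """Document frequency: how many documents contain the term."""
    return sum(1 for doc in tokenized if term in doc)

def tf_idf(term, doc_index):
    """Within-document term frequency times inverse document frequency."""
    tf = tokenized[doc_index].count(term)
    d = df(term)
    return tf * math.log(N / d) if d else 0.0

# "term" appears twice in document 1 and in 2 of the 3 documents overall:
print(round(tf_idf("term", 1), 3))
```

A term appearing in every document gets IDF = log(1) = 0, so high-frequency terms are weighted down, matching the point above that such terms are poor candidates.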
  3. Syntactic Analysis
    1. important for phrase indexing
    2. intended to overcome the ambiguities created via statistical phrase indexing
    3. use of Finite State Automata (e.g. Recursive Transition Networks (RTNs)) to process text
      1. RTNs contain rules that identify the parts of speech
      2. match text to rules
      3. phrases are indexed (created) if the text successfully traverses the network (i.e. satisfies the rules embedded in the RTN)
    4. computationally expensive
    5. rules complex and thus prone to failure (like all Natural Language Processing)
    6. anaphoric statements problematic
    7. only slight improvement over statistically generated phrases
    8. interesting to note that syntactic and statistical phrase indexes of the same text can be quite different, yet retrieval effectiveness does not seem to be affected
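The traversal idea can be illustrated with a much-simplified, non-recursive transition network that accepts noun phrases of the shape optional determiner, any adjectives, then one or more nouns. The tiny lexicon, state table, and function name are all invented for this sketch; real RTNs are recursive and rely on full part-of-speech tagging:

```python
# Toy part-of-speech lexicon (an assumption for the sketch).
LEXICON = {"the": "DET", "fast": "ADJ", "retrieval": "NOUN",
           "system": "NOUN", "indexes": "VERB"}

# States: 0 = start, 1 = after DET/ADJ, 2 = after a NOUN (accepting state).
TRANSITIONS = {
    (0, "DET"): 1, (0, "ADJ"): 1, (0, "NOUN"): 2,
    (1, "ADJ"): 1, (1, "NOUN"): 2,
    (2, "NOUN"): 2,
}

def is_noun_phrase(tokens):
    """Accept only if the token sequence traverses the network to state 2."""
    state = 0
    for tok in tokens:
        nxt = TRANSITIONS.get((state, LEXICON.get(tok)))
        if nxt is None:
            return False  # no rule matches: traversal fails
        state = nxt
    return state == 2

print(is_noun_phrase(["the", "fast", "retrieval", "system"]))  # True
print(is_noun_phrase(["indexes", "fast"]))                     # False
```

A phrase is indexed only when the traversal succeeds, which is the sense in which the rules embedded in the network filter out the nonsensical phrases that purely statistical methods admit.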
  4. Probabilistic Analysis
    1. based on the notion that a term is a good indexing term if it occurs in relevant documents
    2. big problem is the classic relevance problem (i.e. actually knowing the relevant documents)
    3. Salton (1989, 287) shows that, in the end, the properties of probabilistic indexing can be obtained through the tf * IDF equation, so probabilistic methods are not that important
  5. Thesaurus Use and Construction
    1. intertwined with above concepts and methods
      1. stems can be grouped into classes via statistical or syntactic methods
    2. pre-existing thesauri can be used to validate candidate terms
    3. thesauri can be generated automatically via
      1. various similarity measures, for example:
        1. Dice
        2. Jaccard
        3. Salton's Cosine Correlation
      2. clustering algorithms (each has unique results), for example:
        1. Single Link
        2. Average Link
        3. Complete Link
        4. Ward's Method
    4. useful for synonym control
    5. weak for hierarchical and semantic relations
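The three similarity measures named above can be sketched on binary occurrence data, where each term is represented by the set of documents it occurs in (the toy sets below are invented for illustration):

```python
import math

# Dice, Jaccard, and Salton's cosine similarity between two terms,
# each represented as the set of documents in which the term occurs.
def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b))

# Toy data: term X occurs in documents {1, 2, 3}, term Y in {2, 3, 4}.
x, y = {1, 2, 3}, {2, 3, 4}
print(round(dice(x, y), 3), jaccard(x, y), round(cosine(x, y), 3))
```

Term pairs scoring above some threshold on such a measure would then be fed to a clustering algorithm (single link, complete link, etc.) to form thesaurus classes; each algorithm groups the pairs differently, as noted above.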

Page creator: J. Stephen Downie
Page created: 24 September 1997
Page updated: 25 September 1997