LIS 601: Week 3
Important Automatic Indexing Concepts
- Stemming
- used to conflate, or reduce, morphological variants of a word to a single index term
- should increase recall by grouping similar concepts under single term
- achieved through the use of a stemming algorithm (e.g. the Porter algorithm)
- prone to overkill (too severely conflating terms):
- e.g. "morals" and "more" stemmed to "mor"
- weak stemming seems most effective
- e.g. removing only plurals, "-ing", or "-ed" (see the sketch after this list)
- questionable utility
- manual truncation (i.e. user wildcarding) about as effective
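- a minimal Python sketch of the weak stemming idea above (the suffix list and the minimum remaining stem length are illustrative assumptions, not the Porter algorithm's rules):

    def weak_stem(word):
        # Weak stemming: strip only a few common suffixes; require at
        # least a 3-letter stem so short words pass through untouched.
        word = word.lower()
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    # "indexing", "indexed", and "indexes" conflate to "index", while
    # "morals" -> "moral" and "more" stays "more" (no overkill).
    print([weak_stem(w) for w in
           ("indexing", "indexed", "indexes", "morals", "more")])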
- Statistical Analysis
- based on the theory that a term's frequency properties (i.e. how many times it appears in a document and/or within the database; these are its informetric properties) can determine its utility as an index term
- high frequency terms and very low frequency terms deemed to be poor
candidates
- relatively simple to perform because the basic operation is that of
counting tokens and types
- quite effective in the creation of single term indexes
- phrase indexes generated statistically tend to contain many nonsensical
phrases
- single term indexes are quite effective, so phrase indexing is not as popular
- context sensitive
- thus no universal set of statistics to generate a universal index
- works best with single subject databases
- sensitive to document length
- works best with abstracts and/or paragraphs as document surrogates
- words can be assigned weights that represent the value of the
word as an indexing term
- weights usually:
- binary (0 or 1) [not present, present]
- ranged (0 <= x <= y) [not present, present in some way]
- weights can be determined in a variety of ways, which in turn can be combined (a tf * IDF sketch follows this list)
- Within-Document term frequency (commonly denoted as tf)
- Inverse Document frequency (IDF)
- inversely related to the proportion of documents within the collection in which the term can be found, so terms confined to fewer documents receive higher weights
- Term Discrimination Value
- uses a similarity measure to gauge whether indexing a term increases (lower discrimination value) or decreases (higher discrimination value) the overall similarity of the documents in the collection
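- a minimal Python sketch of tf * IDF weighting over a toy collection (the log(N / df) form of IDF is one common textbook variant, assumed here since the notes do not fix an exact formula):

    import math
    from collections import Counter

    # Toy collection of document surrogates (e.g. abstracts), tokenized.
    docs = [
        "automatic indexing uses term frequency statistics".split(),
        "indexing assigns index terms to documents".split(),
        "stemming conflates variants before indexing".split(),
    ]

    N = len(docs)
    df = Counter()                    # document frequency of each term
    for doc in docs:
        df.update(set(doc))

    def tf_idf(term, doc):
        # Within-document term frequency times inverse document
        # frequency; rarer terms receive higher weights.
        return doc.count(term) * math.log(N / df[term])

    # "indexing" occurs in every document, so IDF = log(1) = 0 and its
    # weight is 0 -- echoing the point above that very high-frequency
    # terms are poor candidates; "stemming" is weighted highly.
    for term in ("indexing", "stemming"):
        print(term, round(tf_idf(term, docs[2]), 3))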
- Syntactic Analysis
- important for phrase indexing
- intended to overcome the ambiguities created via statistical phrase
indexing
- use of Finite State Automata (e.g. Recursive Transition Networks (RTNs)) to process text
- RTNs contain rules that identify the parts of speech
- match text to rules
- phrases are indexed (created) if the text successfully traverses the network (i.e. satisfies the rules embedded in the RTN; see the sketch after this list)
- computationally expensive
- rules complex and thus prone to failure (like all Natural Language
Processing)
- anaphoric statements problematic
- only slight improvement over statistically generated phrases
- interesting to note that syntactic and statistical phrase indexes of the same text can be quite different, yet retrieval effectiveness does not seem to be affected
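- a minimal Python sketch of the traversal idea (the tag set and the single "any adjectives, then one or more nouns" pattern are illustrative assumptions; real RTNs are recursive and contain far richer rules):

    # A tiny network accepting phrases of the form ADJ* NOUN+. This is
    # a plain finite-state sketch of traversal only; a true Recursive
    # Transition Network also lets networks call one another.
    TRANSITIONS = {
        ("start", "ADJ"): "start",    # any number of adjectives
        ("start", "NOUN"): "noun",    # at least one noun required
        ("noun", "NOUN"): "noun",     # compound nouns extend the phrase
    }
    ACCEPTING = {"noun"}

    def traverses(tagged_phrase):
        state = "start"
        for _word, tag in tagged_phrase:
            state = TRANSITIONS.get((state, tag))
            if state is None:         # no rule matches: traversal fails
                return False
        return state in ACCEPTING

    # Pre-tagged candidates (tags assumed given, for illustration).
    for phrase in ([("automatic", "ADJ"), ("indexing", "NOUN")],
                   [("index", "NOUN"), ("term", "NOUN")],
                   [("indexing", "NOUN"), ("automatic", "ADJ")]):
        words = " ".join(w for w, _ in phrase)
        print(words, "->", "index" if traverses(phrase) else "reject")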
- Probabilistic Analysis
- based on the notion that a term is a good indexing term if it occurs
in relevant documents
- big problem is the classic relevance problem (i.e. actually knowing
the relevant documents)
- Salton (1989, 287) shows that, in the end, the properties of probabilistic indexing can be obtained through the tf * IDF equation, so probabilistic methods are not that important
- Thesaurus Use and Construction
- intertwined with above concepts and methods
- stems can be grouped into classes via statistical or syntactic methods
- pre-existing thesauri can be used to validate candidate terms
- automatically generated thesauri are built via
- various similarity measures (see the sketch after this section), for example:
- Dice
- Jaccard
- Salton's Cosine Correlation
- clustering algorithms (each has unique results), for example:
- Single Link
- Average Link
- Complete Link
- Ward's Method
- useful for synonym control
- weak for hierarchical and semantic relations
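- a minimal Python sketch of the similarity measures named above, applied to binary term-occurrence sets, followed by single-link grouping (the toy data and threshold are assumptions; at a fixed threshold, single-link clusters are exactly the connected components of the "similar enough" graph):

    import math
    from itertools import combinations

    def dice(a, b):
        return 2 * len(a & b) / (len(a) + len(b))   # Dice coefficient

    def jaccard(a, b):
        return len(a & b) / len(a | b)              # Jaccard coefficient

    def cosine(a, b):
        # Salton's cosine correlation on binary occurrence sets.
        return len(a & b) / math.sqrt(len(a) * len(b))

    # Each term is represented by the set of documents it occurs in.
    occurs = {
        "boat": {1, 2, 5, 7},
        "ship": {1, 2, 5, 8},
        "vessel": {2, 5, 8},
        "tax": {3, 4},
        "tariff": {3, 4, 6},
    }
    print(dice(occurs["boat"], occurs["ship"]),     # 0.75
          jaccard(occurs["boat"], occurs["ship"]),  # 0.6
          cosine(occurs["boat"], occurs["ship"]))   # 0.75

    # Single-link clustering at threshold 0.5: merge any two terms
    # whose similarity reaches the threshold.
    clusters = {t: {t} for t in occurs}
    for t1, t2 in combinations(occurs, 2):
        if (cosine(occurs[t1], occurs[t2]) >= 0.5
                and clusters[t1] is not clusters[t2]):
            merged = clusters[t1] | clusters[t2]
            for t in merged:
                clusters[t] = merged

    # Candidate thesaurus classes, e.g. synonym groups.
    for cluster in {id(c): c for c in clusters.values()}.values():
        print(sorted(cluster))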
Page creator: J. Stephen Downie
Page created: 24 September 1997
Page updated: 25 September 1997