LIS 601: Week 3
Important Automatic Indexing Concepts
- Stemming
- used to conflate, or reduce, morphological variants of a word to a single index term
- should increase recall by grouping similar concepts under single term
- achieved through the use of a stemming algorithm (e.g. the Porter algorithm)
- prone to overkill (too severely conflating terms):
- e.g. "morals" and "more" stemmed to "mor"
- weak stemming seems most effective
- e.g. removing only plurals, "-ing", or "-ed" (see the sketch after this list)
- questionable utility
- manual truncation (i.e. user wildcarding) about as effective
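- a minimal Python sketch of the weak stemming idea above (the suffix list and the minimum remaining stem length are illustrative assumptions, not the Porter algorithm's rules):

    def weak_stem(word):
        # Weak stemming: strip only a few common suffixes; require at
        # least a 3-letter stem so short words pass through untouched.
        word = word.lower()
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    # "indexing", "indexed", and "indexes" conflate to "index", while
    # "morals" -> "moral" and "more" stays "more" (no overkill).
    print([weak_stem(w) for w in
           ("indexing", "indexed", "indexes", "morals", "more")])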
- Statistical Analysis
- based on the theory that a term's frequency properties (i.e. how many times it appears in a document and/or within the database; these are its informetric properties) can determine its utility as an index term
- high frequency terms and very low frequency terms deemed to be poor
candidates
- relatively simple to perform because the basic operation is that of
counting tokens and types
- quite effective in the creation of single term indexes
- phrase indexes generated statistically tend to contain many nonsensical
phrases
- single term indexes are quite effective, so phrase indexing is not as popular
- context sensitive
- thus no universal set of statistics to generate a universal index
- works best with single subject databases
- sensitive to document length
- works best with abstracts and/or paragraphs as document surrogates
- words can be assigned weights that represent the value of the
word as an indexing term
- weights usually:
- binary (0 or 1) [not present, present]
- ranged (0 <= x <= y) [not present, present in some way]
- weights can be determined in a variety of ways, which in turn can be combined (a tf * IDF sketch follows this list)
- Within-Document term frequency (commonly denoted as tf)
- Inverse Document frequency (IDF)
- inversely related to the proportion of documents within the collection in which the term can be found, so terms confined to fewer documents receive higher weights
- Term Discrimination Value
- uses a similarity measure to gauge whether indexing a term increases (lower discrimination value) or decreases (higher discrimination value) the overall similarity of the documents in the collection
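- a minimal Python sketch of tf * IDF weighting over a toy collection (the log(N / df) form of IDF is one common textbook variant, assumed here since the notes do not fix an exact formula):

    import math
    from collections import Counter

    # Toy collection of document surrogates (e.g. abstracts), tokenized.
    docs = [
        "automatic indexing uses term frequency statistics".split(),
        "indexing assigns index terms to documents".split(),
        "stemming conflates variants before indexing".split(),
    ]

    N = len(docs)
    df = Counter()                    # document frequency of each term
    for doc in docs:
        df.update(set(doc))

    def tf_idf(term, doc):
        # Within-document term frequency times inverse document
        # frequency; rarer terms receive higher weights.
        return doc.count(term) * math.log(N / df[term])

    # "indexing" occurs in every document, so IDF = log(1) = 0 and its
    # weight is 0 -- echoing the point above that very high-frequency
    # terms are poor candidates; "stemming" is weighted highly.
    for term in ("indexing", "stemming"):
        print(term, round(tf_idf(term, docs[2]), 3))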
- Syntactic Analysis
- important for phrase indexing
- intended to overcome the ambiguities created via statistical phrase
indexing
- use of Finite State Automata (e.g. Recursive Transition Networks (RTNs)) to process text
- RTNs contain rules that identify the parts of speech
- match text to rules
- phrases are indexed (created) if the text successfully traverses the network (i.e. satisfies the rules embedded in the RTN; see the sketch after this list)
- computationally expensive
- rules complex and thus prone to failure (like all Natural Language
Processing)
- anaphoric statements problematic
- only slight improvement over statistically generated phrases
- interesting to note that syntactic and statistical phrase indexes of the same text can be quite different, yet retrieval effectiveness does not seem to be affected
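- a minimal Python sketch of the traversal idea (the tag set and the single "any adjectives, then one or more nouns" pattern are illustrative assumptions; real RTNs are recursive and contain far richer rules):

    # A tiny network accepting phrases of the form ADJ* NOUN+. This is
    # a plain finite-state sketch of traversal only; a true Recursive
    # Transition Network also lets networks call one another.
    TRANSITIONS = {
        ("start", "ADJ"): "start",    # any number of adjectives
        ("start", "NOUN"): "noun",    # at least one noun required
        ("noun", "NOUN"): "noun",     # compound nouns extend the phrase
    }
    ACCEPTING = {"noun"}

    def traverses(tagged_phrase):
        state = "start"
        for _word, tag in tagged_phrase:
            state = TRANSITIONS.get((state, tag))
            if state is None:         # no rule matches: traversal fails
                return False
        return state in ACCEPTING

    # Pre-tagged candidates (tags assumed given, for illustration).
    for phrase in ([("automatic", "ADJ"), ("indexing", "NOUN")],
                   [("index", "NOUN"), ("term", "NOUN")],
                   [("indexing", "NOUN"), ("automatic", "ADJ")]):
        words = " ".join(w for w, _ in phrase)
        print(words, "->", "index" if traverses(phrase) else "reject")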
- Probabilistic Analysis
- based on the notion that a term is a good indexing term if it occurs
in relevant documents
- big problem is the classic relevance problem (i.e. actually knowing
the relevant documents)
- Salton (1989, 287) shows that, in the end, the properties of probabilistic indexing can be obtained through the tf * IDF equation, so probabilistic methods are not that important
- Thesaurus Use and Construction
- intertwined with above concepts and methods
- stems can be grouped into classes via statistical or syntactic methods
- pre-existing thesauri can be used to validate candidate terms
- automatically generated thesauri are built via
- various similarity measures (see the sketch after this section), for example:
- Dice
- Jaccard
- Salton's Cosine Correlation
- clustering algorithms (each has unique results), for example:
- Single Link
- Average Link
- Complete Link
- Ward's Method
- useful for synonym control
- weak for hierarchical and semantic relations
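- a minimal Python sketch of the similarity measures named above, applied to binary term-occurrence sets, followed by single-link grouping (the toy data and threshold are assumptions; at a fixed threshold, single-link clusters are exactly the connected components of the "similar enough" graph):

    import math
    from itertools import combinations

    def dice(a, b):
        return 2 * len(a & b) / (len(a) + len(b))   # Dice coefficient

    def jaccard(a, b):
        return len(a & b) / len(a | b)              # Jaccard coefficient

    def cosine(a, b):
        # Salton's cosine correlation on binary occurrence sets.
        return len(a & b) / math.sqrt(len(a) * len(b))

    # Each term is represented by the set of documents it occurs in.
    occurs = {
        "boat": {1, 2, 5, 7},
        "ship": {1, 2, 5, 8},
        "vessel": {2, 5, 8},
        "tax": {3, 4},
        "tariff": {3, 4, 6},
    }
    print(dice(occurs["boat"], occurs["ship"]),     # 0.75
          jaccard(occurs["boat"], occurs["ship"]),  # 0.6
          cosine(occurs["boat"], occurs["ship"]))   # 0.75

    # Single-link clustering at threshold 0.5: merge any two terms
    # whose similarity reaches the threshold.
    clusters = {t: {t} for t in occurs}
    for t1, t2 in combinations(occurs, 2):
        if (cosine(occurs[t1], occurs[t2]) >= 0.5
                and clusters[t1] is not clusters[t2]):
            merged = clusters[t1] | clusters[t2]
            for t in merged:
                clusters[t] = merged

    # Candidate thesaurus classes, e.g. synonym groups.
    for cluster in {id(c): c for c in clusters.values()}.values():
        print(sorted(cluster))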
Page creator: J. Stephen Downie
Page created: 24 September 1997
Page updated: 25 September 1997