Course: MLIS, 2ND Year
Automatic indexing is indexing made by algorithmic procedures. The
algorithm works on a database containing document representations (which
may be full text representations or bibliographical records or partial text
representations and in principle also value added databases). Automatic
indexing may also be performed on non-text databases,
e.g. images or music.
The statistical technique involves
(1) the determination of certain probability relationships between individual
content-bearing words and subject categories, and
(2) the use of these relationships to predict the category to which a
document containing the words belongs.
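The two steps above can be sketched in a few lines. This is only an illustration, assuming a naive Bayes style estimate with add-one smoothing; the notes do not specify the exact probability model, and the names train and predict are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Step 1: count how often each clue word occurs in each category.

    labeled_docs is an iterable of (list_of_words, category) pairs.
    """
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    for words, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
    return word_counts, doc_counts


def predict(words, word_counts, doc_counts):
    """Step 2: pick the category with the highest estimated log-probability."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, n_docs in doc_counts.items():
        # Prior: share of training documents in this category.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[category].values())
        for word in words:
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing (an assumption) avoids zero probabilities.
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[category] = score
    # Sorting first makes ties resolve alphabetically.
    return max(sorted(scores), key=scores.get)
```

Given a handful of documents already assigned to categories, train builds the word-category statistics and predict applies them to a new document's words.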
The simplest and most basic concept of automatic indexing, developed in
the 1950s, was the KWIC (Keyword in Context) index, based on
permutations of significant words in titles, abstracts or full text,
manipulated by machine. The first major report on the application of this
indexing concept was presented at the International Conference on Scientific
Information (ICSI), held in Washington, D.C. in November 1958. It was
not the paper itself but the actual demonstration of the method that was
the sensation of the conference.
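A minimal sketch of a KWIC index over titles follows. The stop-word list, the function names kwic and format_entry, and the column width are assumptions made for illustration; they are not taken from the 1958 demonstration.

```python
# Words treated as non-significant; this short list is only an example.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic(titles, stop_words=STOP_WORDS):
    """Return one (keyword, left_context, word, right_context) entry per
    significant word in each title, sorted by keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            key = word.lower().strip(".,;:")
            if not key or key in stop_words:
                continue  # skip non-significant words
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            entries.append((key, left, word, right))
    # Sort on the keyword only; the sort is stable, so entries with the
    # same keyword keep the order in which the titles were given.
    entries.sort(key=lambda entry: entry[0])
    return entries


def format_entry(entry, width=30):
    """Line up the keyword in a fixed column, with its context on each side."""
    _, left, word, right = entry
    return f"{left[-width:]:>{width}}  {word}  {right}"
```

Printing format_entry(e) for each entry gives the familiar layout in which every significant word appears once in the central column, surrounded by the rest of its title.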
At the risk of getting ahead of ourselves and in view of the obvious
information explosion that our scientific and intelligence communities surely
face, let us point out what successful automatic indexing could mean.
First, we seem to be rapidly approaching the time when along with the
printed page there will be an associated tape of corresponding information
ready for direct input to a computing machine.
This means that as each organization receives its daily incoming documents,
a machine could read them and route them directly to the proper users. The
users could describe their information needs in terms of "standing" requests,
and on the basis of these a machine could determine how the incoming "take"
should be disseminated. Since automatic dissemination is only a special aspect
of an information retrieval system, it follows that automatic indexing would
also allow incoming documents to be indexed and thus identified for
subsequent retrieval.
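Basic Notions: This approach to the problem of automatic indexing is a
statistical one. It is based on the rather straightforward notion that the
individual words in a document function as clues to the subject category to
which the document belongs. The fundamental thesis says, in effect, that
statistics on the kind, frequency, location, order, etc., of selected words
are adequate to make reasonably good predictions about the subject matter of
documents containing those words.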
Words and Predictions: Concerning the selection of clue words, how
shall we decide which words convey the most information, how many
different words should be used, etc.? Clearly, certain content-bearing words
such as "electron" and "transistor" are better clues than logical-type words
such as "if" and "then".
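One simple way to approach the selection question is sketched below: discard logical-type words and rank the remaining words by frequency. The word list, the cut-off max_words and the function name select_clue_words are assumptions for illustration; the notes do not state which selection rule was actually used.

```python
from collections import Counter

# A small, hypothetical list of logical-type (function) words.
FUNCTION_WORDS = {"if", "then", "and", "or", "the", "a", "an",
                  "of", "is", "in", "to"}


def select_clue_words(documents, max_words=5, function_words=FUNCTION_WORDS):
    """Return up to max_words candidate clue words, most frequent first."""
    counts = Counter()
    for text in documents:
        for token in text.lower().split():
            token = token.strip(".,;:!?\"'()")
            if token and token not in function_words:
                counts[token] += 1
    # Rank by descending frequency, then alphabetically so the
    # result is deterministic when counts are tied.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_words]]
```

Raw frequency is only a first filter. A fuller treatment would also ask how unevenly each word is distributed across the subject categories, since a word that occurs equally often in every category predicts nothing.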
The Empirical Test: First, a corpus of documents was selected and
indexed using a set of subject categories created for the purposes of the
experiment. The design, execution, results and evaluation of this test are
examined in the following sections.
Automatic indexing is the process of analyzing an item to extract the
information to be permanently kept in an index. This text categorizes the
indexing techniques into statistical, natural language, concept, and hypertext
linkages.
Statistical strategies: Statistical strategies cover the broadest range of
indexing techniques and are the most prevalent in commercial systems. The
words/phrases are the domain of searchable values.
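As a rough illustration of words serving as the searchable values, the sketch below builds an inverted index that records how often each word occurs in each item. The function name build_index and the whitespace tokenization are simplifying assumptions, not the behaviour of any particular commercial system.

```python
from collections import defaultdict


def build_index(items):
    """Map each word to {item_id: occurrence count}.

    items is a dict of item_id -> text.
    """
    index = defaultdict(dict)
    for item_id, text in items.items():
        for token in text.lower().split():
            # Count occurrences of the token within this item.
            index[token][item_id] = index[token].get(item_id, 0) + 1
    return dict(index)
```

A search for a word is then a dictionary lookup, and the stored counts are the kind of frequency statistics on which statistical weighting schemes are built.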
Natural Language: Natural language approaches perform processing-token
identification similar to that in statistical techniques, but then
additionally perform varying levels of natural language parsing of the item
(e.g., to distinguish present, past, and future actions).
Concept index: Concept indexing uses the words within an item to
correlate to concepts discussed in the item. This is a generalization of the
specific words to values used to index the item.
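The generalization from words to concepts can be pictured with a toy example. The hand-built CONCEPT_MAP table and the function name concept_index are hypothetical; real concept indexing systems typically derive the word-to-concept relationships automatically rather than from a fixed table.

```python
from collections import Counter

# Hypothetical word-to-concept table, written by hand for illustration.
CONCEPT_MAP = {
    "electron": "electronics",
    "transistor": "electronics",
    "circuit": "electronics",
    "compiler": "computing",
    "algorithm": "computing",
}


def concept_index(text, concept_map=CONCEPT_MAP):
    """Return {concept: weight}, where the weight is the number of words
    in the text that map to that concept."""
    weights = Counter()
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        concept = concept_map.get(token)
        if concept is not None:
            weights[concept] += 1
    return dict(weights)
```

The item is then indexed under the concept values rather than under the specific words, so a search on a concept can find items that never use the searcher's own vocabulary.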
Hypertext linkages: Finally, a special class of indexing can be defined
by the creation of hypertext linkages. These linkages provide virtual threads
of concepts between items, rather than directly defining the concept within
an item.
Automatic indexing is the preprocessing stage allowing search of items
in an Information Retrieval System. Its role is critical to the success of
searches in finding relevant items. If the concepts within an item are not
located and represented in the index during this stage, the item is not
found during search. Some techniques allow for the combinations of
data at search time to equate to particular concepts (i.e.post co-