Text Databases and Information Retrieval
ELLEN RILOFF and LEE HOLLAAR
Department of Computer Science, University of Utah ⟨riloff,hollaar@cs.utah.edu⟩

The goal of a traditional information
retrieval (IR) system is to search an
information repository, such as a text
database, and retrieve documents that
are potentially relevant to a query.
Since query-based IR systems must operate in real time, they must be able to
search large volumes of text quickly and
efficiently. Other information-retrieval
applications, such as text categorization, text routing, and text filtering, are
also becoming increasingly important.
These applications are generally concerned with long-term information
needs, where a topic is expected to be of
interest for an extended period of time.
Text categorization systems assign predefined category labels to texts. For example, a text categorization system for
computer science might use categories
such as operating systems, programming languages, artificial intelligence,
or information retrieval. Text routing
systems typically accept a set of user
profiles and automatically classify texts
so that relevant texts can be routed to
appropriate users [Harman 1994]. Text
filtering systems accept a list of topics
that are, or are not, of interest and
allow only texts that satisfy the filter to
pass through to the user [Belkin and
Croft 1992]. Text categorization systems
are typically applied to static databases,
while text routing and text filtering systems are usually applied to incoming
data streams.
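As a concrete, if greatly simplified, illustration of routing and filtering, the following Python sketch delivers an incoming text only to those users whose keyword profiles it overlaps. The profiles, user names, and overlap threshold are invented for illustration and do not correspond to any system cited here; operational routing systems use much richer profiles and scoring.

def route(text, profiles, threshold=1):
    # Deliver the text to every user whose profile shares at least
    # `threshold` keywords with it; texts matching no profile are filtered out.
    words = set(text.lower().split())
    return [user for user, keywords in profiles.items()
            if len(words & keywords) >= threshold]

profiles = {"alice": {"operating", "systems"},       # hypothetical user profiles
            "bob": {"information", "retrieval"}}
print(route("new results in information retrieval", profiles))   # ['bob']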
Information-retrieval systems must
grapple with all of the ambiguities and
idiosyncrasies inherent in natural language, such as synonymy (e.g., “start”,
“begin”, and “initiate” have essentially
the same meaning) and polysemy (e.g.,

“shot” has many different meanings, including the act of shooting, an injection,
a quantity of liquor, a photograph, pellets, or an attempt). Phrases also require special attention because multiword expressions often have a composite meaning different from the
individual words. For example, a “hot
dog” does not usually refer to a warm
canine, and an “operating system” does
not usually refer to a system that is
simply operating.
Most information-retrieval systems
preprocess a document collection into an
inverted file that allows the system to
determine quickly which words appear in
each document. Stopword lists are commonly used to remove highly frequent
words, such as “the” and “of,” under the
assumption that they don’t contribute
much to the meaning of a text. Stemming
algorithms are sometimes used to reduce
a word to its root form so that different
morphological variations will match
[Frakes and Baeza-Yates 1992]. An alternative text-representation scheme uses
superimposed codewords to produce a
fixed-length vector from the binary representations of words. The fixed-length vector is especially useful for parallel and
hardware systems, but this method can
sometimes falsely indicate the presence of words that do not actually appear in the original document (so-called false drops).
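To make these preprocessing steps concrete, the Python sketch below builds a small inverted file using stopword removal and a naive suffix-stripping stemmer, and also computes a toy superimposed-codeword signature. The stopword list, the suffix rules, and the 32-bit signature width are illustrative choices only; practical systems use larger stopword lists, a stemmer such as Porter's, and carefully tuned signature parameters.

import zlib
from collections import defaultdict

STOPWORDS = {"the", "of", "a", "an", "and", "is", "to"}   # illustrative list

def stem(word):
    # Naive suffix stripping; real systems typically use the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokenize(text):
    return [stem(w) for w in text.lower().split() if w not in STOPWORDS]

def build_inverted_file(docs):
    # Map each indexed term to the set of document ids containing it,
    # so the system can quickly determine which words appear where.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def signature(text, bits=32):
    # Superimposed codewords: OR one hashed bit per term into a
    # fixed-length vector (here a 32-bit integer).
    sig = 0
    for term in tokenize(text):
        sig |= 1 << (zlib.crc32(term.encode()) % bits)
    return sig

def signature_match(doc_sig, query_sig):
    # A document "matches" if every query bit is set in its signature.
    # Distinct terms can hash to the same bits, so a match may be spurious.
    return doc_sig & query_sig == query_sig

docs = {1: "operating systems and programming languages",
        2: "retrieving documents from text databases"}
print(build_inverted_file(docs)["system"])                     # {1}
print(signature_match(signature(docs[2]), signature("text")))  # True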
Traditional information-retrieval methods retrieve documents by searching for
relevant words or phrases. Most commercial IR systems allow the user to define a
query using keywords and standard Boolean operators. These systems retrieve
documents that precisely match the
query. The vector-space model [Salton 1971] is a well-known method for automatic indexing that views each document
and query as a vector in an N-dimensional space, where N is the number of
distinct index terms in the database. The
query vector is compared to all of the
document vectors using a similarity metric. Another retrieval model for automatic
indexing uses probability estimates to determine whether a document satisfies a
user’s query. For example, Bayesian inference networks have been used to compute the belief associated with a query for
each document in a database.
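A minimal sketch of the vector-space model follows, using raw term-frequency weights and the cosine similarity metric; operational systems normally apply tf-idf or comparable weighting schemes, and the tokenizer and example documents here are purely illustrative.

import math
from collections import Counter

def vectorize(text):
    # Raw term-frequency vector; practical systems usually apply tf-idf weights.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs):
    # Rank every document by the similarity of its vector to the query vector.
    q = vectorize(query)
    return sorted(((cosine(q, vectorize(text)), doc_id)
                   for doc_id, text in docs.items()), reverse=True)

docs = {1: "information retrieval from text databases",
        2: "operating systems and programming languages"}
print(retrieve("text retrieval", docs))   # document 1 ranks first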
Relevance feedback techniques can
improve performance by asking the user
for feedback about the retrieved texts
[Salton 1989; Van Rijsbergen 1979]. The
user labels a subset of the retrieved
texts as relevant, and this information
is fed back into the system to modify the
original query, usually by adding new
terms or by changing the weights of the
original query terms. Relevance feedback has consistently been shown to
improve the performance of IR systems.
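One widely used instance of this query-modification scheme is Rocchio's feedback formula, sketched below in Python. The particular weights (alpha, beta, gamma), the term-frequency representation, and the example documents are conventional illustrative choices rather than values prescribed in the text.

from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query toward the centroid of user-labeled relevant documents
    # and away from the nonrelevant ones; negative weights are clipped to zero.
    new_q = Counter({t: alpha * w for t, w in query_vec.items()})
    for doc in relevant:
        for t, w in vectorize(doc).items():
            new_q[t] += beta * w / len(relevant)
    for doc in nonrelevant:
        for t, w in vectorize(doc).items():
            new_q[t] -= gamma * w / len(nonrelevant)
    return Counter({t: w for t, w in new_q.items() if w > 0})

q = vectorize("text retrieval")
expanded = rocchio(q,
                   relevant=["retrieval from large text databases"],
                   nonrelevant=["dog retrieval training"])
print(expanded)   # gains terms such as "databases"; "dog" is excluded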
Experiments with richer text representations have also been conducted using natural-language processing (NLP)
techniques. Syntactic approaches have
been used to generate more complex
indexing terms consisting of phrases
and head-modifier structures. Knowledge-based NLP systems have been
used to generate conceptual meaning
representations of queries and documents. Information extraction techniques [Lehnert and Sundheim 1991]
have also been shown to be effective for
text classification problems, and represent a compromise between word-based
techniques and in-depth natural-language processing.
The future holds great promise for
integrating information-retrieval techniques with natural-language processing systems. The strengths of these
methodologies are largely complementary. IR systems use shallow text representations, which allows them to process large amounts of text quickly and
efficiently. But the accuracy of these
systems often suffers because of a lack
of semantic analysis, especially for complex information requests. Natural-language processing systems, on the other
hand, usually perform conceptual analyses, which allows them to produce richer meaning representations.
However, NLP techniques are more
computationally expensive and therefore are more difficult to scale up to
large text collections.
The information-retrieval community is
facing new challenges posed by larger
and more heterogeneous text databases,
which have led to an explosion of new
approaches and methodologies. As longer
texts become available on-line, new approaches are needed to process texts that
discuss multiple topics. A variety of techniques for subtopic identification and passage-based retrieval are actively being explored. Another area of active research is
intelligent information retrieval, which
draws upon techniques from artificial intelligence to generate richer text representations. Natural-language processing
methods (such as information extraction),
case-based reasoning techniques, and machine learning algorithms are all being
applied to information retrieval tasks in
the hopes of building more effective retrieval systems (for example, see ACM
[1995]). Intelligent information retrieval
is an exciting new direction for IR research.
REFERENCES
ACM. 1995. Proceedings of the 18th Annual
International ACM SIGIR Conference on
Research and Development in Information Retrieval. ACM, New York.
BELKIN, N. AND CROFT, W. B. 1992. Information
filtering and information retrieval: Two sides
of the same coin? Commun. ACM 35, 12,
29–38.
FRAKES, W. B. AND BAEZA-YATES, R., EDS.
1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ.
HARMAN, D., ED. 1994. The Second Text REtrieval Conference (TREC-2). National Institute of Standards and Technology Special Publication 500-215, Gaithersburg, MD.
LEHNERT, W. G. AND SUNDHEIM, B. 1991. A performance evaluation of text analysis technologies. AI Mag. 12, 3, 81–94.
SALTON, G., ED. 1971. The SMART Retrieval
System: Experiments in Automatic Document
Processing. Prentice-Hall, Englewood Cliffs,
NJ.
SALTON, G. 1989. Automatic Text Processing:
The Transformation, Analysis, and Retrieval
of Information by Computer. Addison-Wesley,
Reading, MA.
VAN RIJSBERGEN, C. J. 1979. Information Retrieval (2nd Ed.). Butterworths, London.
