2. Introduction.
Traditional information retrieval systems
usually adopt index terms to index and retrieve
documents.
An index term is a keyword (or group of related
words) which has some meaning of its own
(usually a noun).
3. The advantage of using index
terms
Simple
The semantics of the documents and of the
user information need can be naturally
expressed through sets of index terms.
Ranking algorithms, which predict which documents are
relevant and which are not, are at the core of
information retrieval systems.
4. A taxonomy of information retrieval
models
User task:
Retrieval (Ad hoc, Filtering)
  Classic Models: Boolean, Vector, Probabilistic
  Set Theoretic: Fuzzy, Extended Boolean
  Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  Probabilistic: Inference Network, Belief Network
  Structured Models: Non-overlapping Lists, Proximal Nodes
Browsing
  Flat, Structure Guided, Hypertext
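5. Logical view of documents: Index Terms, Full Text, Full Text + Structure
Retrieval
  Index Terms: Classic, Set Theoretic, Algebraic, Probabilistic
  Full Text: Classic, Set Theoretic, Algebraic, Probabilistic
  Full Text + Structure: Structured
Browsing
  Index Terms: Flat
  Full Text: Flat, Hypertext
  Full Text + Structure: Structure Guided, Hypertext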
Figure 2.2 Retrieval models most frequently associated with distinct
combinations of a document logical view and a user task.
6. Retrieval : Ad hoc and Filtering
Ad hoc : The documents in the collection
remain relatively static while new queries
are submitted to the system.
Filtering : The queries remain relatively
static while new documents come into the
system.
7. Filtering
Typically, the filtering task simply
indicates to the user the documents
which might be of interest to them.
Routing : Rank the filtered documents
and show this ranking to the user.
User profiles can be constructed in two ways.
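The difference between filtering and routing can be sketched in Python. This is only a minimal illustration: it assumes a user profile is a set of keywords and that a document is scored by how many profile keywords it contains. The names `score`, `filter_docs`, and `route_docs` are hypothetical and do not come from the slides.

```python
def score(profile, doc_terms):
    """Assumed measure: number of profile keywords that occur in the document."""
    return len(profile & set(doc_terms))

def filter_docs(profile, incoming, threshold=1):
    """Filtering: indicate which incoming documents might interest the user (no ranking)."""
    return [name for name, terms in incoming if score(profile, terms) >= threshold]

def route_docs(profile, incoming, threshold=1):
    """Routing: the same selection, but shown as a ranking by decreasing score."""
    kept = [(score(profile, terms), name) for name, terms in incoming
            if score(profile, terms) >= threshold]
    return [name for s, name in sorted(kept, key=lambda pair: (-pair[0], pair[1]))]

# Illustrative data: the profile stays fixed while documents arrive.
profile = {"retrieval", "ranking"}
incoming = [
    ("d1", ["sports", "news"]),
    ("d2", ["ranking", "models"]),
    ("d3", ["retrieval", "ranking", "index"]),
]
```

Here `filter_docs(profile, incoming)` keeps d2 and d3 in arrival order, while `route_docs(profile, incoming)` puts d3 first because it shares more keywords with the profile.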
8. A formal characterization of IR models
D : A set composed of logical views (or
representations) of the documents in the
collection.
Q : A set composed of logical views (or
representations) of the user information
needs (queries).
F : A framework for modeling document
representations, queries, and their relationships.
R(qi, dj) : A ranking function which defines an
ordering among the documents with regard to the
query qi.
9. Classic information retrieval
model
Basic concepts : Each document is
described by a set of representative
keywords called index terms.
Numerical weights are assigned to index terms
to capture how relevant each term is for
describing a document.
10. Define
ki : A generic index term
K : The set of all index terms {k1,…,kt}
wi,j : A weight associated with index term
ki of a document dj
gi : A function that returns the weight associated
with ki in any t-dimensional vector (gi(dj) = wi,j)
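These definitions map directly onto simple data structures. The sketch below is an illustration only; the example terms and weight values are made up, and the function name `g` is a hypothetical stand-in for gi.

```python
# K: the set of all index terms, kept in a fixed order k1..kt (example terms are made up).
K = ["information", "retrieval", "model"]

# A document dj seen as a t-dimensional vector of weights (w1,j, ..., wt,j).
# The values are illustrative, not computed from any real collection.
d_j = [0.5, 0.0, 1.2]

def g(i, vector):
    """gi: return the weight associated with index term ki (1-based, as in the slides)."""
    return vector[i - 1]
```

With these values, `g(1, d_j)` returns the weight of "information" in dj, and a weight of zero (as for "retrieval") means the term does not describe the document.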
11. Boolean model
Based on a binary decision criterion without any
notion of a grading scale.
Boolean expressions have precise semantics, but it is
not simple to translate an information need into
a Boolean expression.
A query can be represented as a disjunction of
conjunctive vectors (i.e., in disjunctive normal form,
DNF).
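The binary decision can be sketched as follows. The example assumes three index terms (ka, kb, kc) and the query ka AND (kb OR NOT kc), whose DNF over those terms is (1,1,1) OR (1,1,0) OR (1,0,0). The function names are hypothetical.

```python
# Index terms in a fixed order; the names are illustrative.
VOCAB = ("ka", "kb", "kc")

# Query q = ka AND (kb OR NOT kc), written in DNF as binary vectors over (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def binary_vector(doc_terms, vocabulary):
    """Binary weights: 1 if the index term occurs in the document, else 0."""
    return tuple(1 if k in doc_terms else 0 for k in vocabulary)

def boolean_sim(doc_terms, q_dnf, vocabulary):
    """Return 1 if the document matches some conjunctive component, else 0."""
    return 1 if binary_vector(doc_terms, vocabulary) in q_dnf else 0
```

A document containing only ka is predicted relevant, while one containing ka and kc is not. There is no partial match and no grading between the two outcomes.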
12. Vector model
Assign non-binary weights to index
terms in queries and in documents.
Compute the similarity between
documents and query.
Documents are ranked by degree of similarity,
so the answer is more precise than with the
Boolean model.
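One common way to compute this similarity is the cosine of the angle between the document and query vectors. The slide does not name a specific measure, so the cosine is an assumption here, and the weights below are illustrative.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares nothing with any query
    return dot / (norm_d * norm_q)

def rank(docs, q):
    """Order document names by decreasing similarity to the query."""
    return sorted(docs, key=lambda name: cosine_sim(docs[name], q), reverse=True)

# Illustrative weights over three index terms.
docs = {"d1": [1.0, 0.0, 0.0], "d2": [1.0, 1.0, 0.0], "d3": [0.0, 0.0, 1.0]}
query = [1.0, 1.0, 0.0]
```

Unlike the Boolean model, d1 is still retrieved as a partial match: it ranks below d2 and above d3, which shares no term with the query.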
13. Idea
We think of the documents as a collection C
of objects and think of the user query as a
specification of a set A of objects. In this
scenario, the IR problem can be reduced to
the problem of determining which documents
are in the set A and which ones are not (i.e.,
the IR problem can be viewed as a
clustering problem).
14. Intra-cluster : One needs to determine
which features better
describe the objects in the set A.
Inter-cluster : One needs to determine
which features better
distinguish the objects in the set A from the
remaining objects in the collection C.
15. tf : Intra-cluster similarity is quantified by
measuring the raw frequency of a term ki
inside a document dj. This term frequency is
usually referred to as the tf factor and
provides one measure of how well the term
describes the document contents.
idf : Inter-cluster dissimilarity is quantified by
measuring the inverse of the frequency of a
term ki among the documents in the
collection. This factor is usually referred to
as the inverse document frequency (idf).
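The two factors are usually combined into a single weight per term and document. The sketch below assumes one widely used form, which the slide does not spell out: tf is the raw frequency divided by the largest frequency in the document, and idf is log(N / ni), where N is the number of documents and ni the number of documents containing ki. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict name -> list of terms. Returns dict name -> {term: weight}."""
    n_docs = len(docs)
    # ni: number of documents in which term ki appears.
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for name, terms in docs.items():
        freq = Counter(terms)            # raw frequency of each term in dj
        if not freq:
            weights[name] = {}           # empty document: no weights
            continue
        max_freq = max(freq.values())    # assumed normalization of tf
        weights[name] = {
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        }
    return weights

# Illustrative two-document collection.
docs = {
    "d1": ["retrieval", "retrieval", "model"],
    "d2": ["model", "index"],
}
w = tf_idf(docs)
```

In this collection "model" appears in every document, so its idf is zero and it does not help distinguish d1 from d2, whereas "retrieval" gets a positive weight in d1.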
16. The vector model is simple and fast, and it is a
popular retrieval model.
Disadvantage : Index terms are
assumed to be mutually independent, so the
model does not account for index term
dependencies.
17. Probabilistic model
We can think of the querying process
as a process of specifying the properties
of an ideal answer set. The problem is
that we do not know exactly what these
properties are.
18. Structured text retrieval model
Retrieval models which combine information on
text content with information on the document
structure are called structured text retrieval
models.
Match point : refers to the position in the text
of a sequence of words which matches the user
query.
Region : refers to a contiguous portion of the
text.
Node : refers to a structural component of the
document such as a chapter, a section, or a
subsection.
19. Model based on Non-overlapping
lists
Divide the whole text of each document
into non-overlapping text regions which
are collected in a list.
Text regions in the same list do not
overlap, but text regions from
distinct lists might overlap.
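The constraint can be checked with a short sketch. It assumes each region is encoded as a (start, end) pair of character offsets with start < end and the end exclusive; this encoding and the function names are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if two (start, end) regions share at least one position (end is exclusive)."""
    return a[0] < b[1] and b[0] < a[1]

def is_non_overlapping(regions):
    """True if no two regions in the same list overlap.

    After sorting by start, it is enough to compare neighbouring regions.
    """
    ordered = sorted(regions)
    return all(not overlaps(x, y) for x, y in zip(ordered, ordered[1:]))

# Two illustrative lists over the same text: one of chapters, one of sections.
chapters = [(0, 100), (100, 250)]
sections = [(0, 40), (40, 100), (100, 180), (180, 250)]
```

Each list is valid on its own, yet `overlaps(chapters[0], sections[1])` is true: regions from distinct lists are allowed to overlap.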
20. Model based on Proximal
nodes
A model which allows the definition of
independent hierarchical indexing
structures over the same document text.
Each of these indexing structures is a strict
hierarchy composed of chapters,
sections, paragraphs, pages, and lines,
which are called nodes.
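A strict hierarchy of nodes can be sketched as a tree in which each node covers a region of the text. This is a rough illustration under assumed names (`Node`, `nodes_containing`) and an assumed (start, end) offset encoding; it is not the model's actual query language.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                  # e.g. "chapter", "section", "paragraph"
    start: int                 # first text position covered by the node
    end: int                   # one past the last position (exclusive)
    children: list = field(default_factory=list)

def nodes_containing(node, position):
    """Return the kinds of the nested nodes (outermost first) that contain a text position."""
    if not (node.start <= position < node.end):
        return []
    path = [node.kind]
    for child in node.children:
        sub = nodes_containing(child, position)
        if sub:                # in a strict hierarchy at most one child can contain it
            path.extend(sub)
            break
    return path

# Illustrative hierarchy: one chapter with two sections, the second holding a paragraph.
chapter = Node("chapter", 0, 200, [
    Node("section", 0, 80),
    Node("section", 80, 200, [Node("paragraph", 80, 120)]),
])
```

For a match point at position 90, `nodes_containing(chapter, 90)` walks down from the chapter to the section and the paragraph that enclose it.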
22. Flat browsing
The documents might be represented
as dots in a plane or as elements in a list.
Relevance feedback
Disadvantage : In a given page or
screen there may not be any indication
of the context the user is in.
23. Structure guided browsing
Documents are organized in a directory-like
structure that groups documents covering
related topics.
The same idea can be applied to a
single document.
A history map can be used to show where
the user has been.
24. The hypertext model
Written text is usually conceived to be
read sequentially.
The reader should not expect to fully
understand the message conveyed by
the writer by randomly reading pieces
of text here and there.