UNIT-II (Modelling and Retrieval Evaluation)
IV Year / VIII Semester
By
P.THENMOZHI AP/CSE
KNCET.
KONGUNADU COLLEGE OF ENGINEERING AND
TECHNOLOGY
(Autonomous)
NAMAKKAL- TRICHY MAIN ROAD, THOTTIAM
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CS8080 – Information Retrieval Techniques
MODELING AND RETRIEVAL EVALUATION
• Basic Retrieval Models
• An IR model governs how a document and a
query are represented and how the relevance
of a document to a user query is defined.
• There are three main IR models:
– Boolean model
– Vector space model
– Probabilistic model
• Each term is associated with a weight. Given a
collection of documents D, let
• V = {t1, t2, ..., t|V|} be the set of distinct
terms in the collection, where ti is a term.
• The set V is usually called the vocabulary of
the collection, and |V| is its size,
• i.e., the number of terms in V.
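Building the vocabulary V amounts to taking the union of the distinct terms across the collection. A minimal sketch in Python, where the two toy documents are assumptions for illustration:

```python
# Minimal sketch of extracting the vocabulary V from a document
# collection D (the two toy documents below are assumptions).
docs = [
    "information retrieval models rank documents",
    "boolean retrieval uses exact matching",
]

# V is the set of distinct terms across the whole collection.
vocabulary = sorted({term for doc in docs for term in doc.split()})
print(len(vocabulary))  # |V|, the number of distinct terms
```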
• An IR model is a quadruple [D, Q, F, R(qi, dj)]
where
• 1. D is a set of logical views for the documents
in the collection
• 2. Q is a set of logical views for the user
queries
• 3. F is a framework for modeling documents
and queries
• 4. R(qi, dj) is a ranking function
Boolean Model
• The Boolean model is one of the earliest and
simplest information retrieval models.
• It uses the notion of exact matching to match
documents to the user query.
• Both the query and the retrieval are based on
Boolean algebra.
• In the Boolean model, documents and queries
are represented as sets of terms.
• That is, each term is only considered present
or absent in a document.
• Boolean Queries:
• Query terms are combined logically using the Boolean
operators AND, OR, and NOT, which have their usual
semantics in logic.
• Thus, a Boolean query has a precise semantics.
• For instance, the query, ((x AND y) AND (NOT z)) says
that a retrieved document must contain both the terms
x and y but not z.
• As another example, the query expression (x OR y)
means that at least one of these terms must be in each
retrieved document.
• Here, we assume that x, y and z are terms. In general,
they can be Boolean expressions themselves.
• Document Retrieval:
• Given a Boolean query, the system retrieves
every document that makes the query
logically true.
• Thus, the retrieval is based on the binary
decision criterion, i.e., a document is either
relevant or irrelevant. Intuitively, this is called
exact match.
• Most search engines support some limited
forms of Boolean retrieval using explicit
inclusion and exclusion operators.
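Exact-match Boolean retrieval over set-of-terms representations can be sketched as follows, evaluating the query ((x AND y) AND (NOT z)) discussed earlier; the three toy documents are assumptions:

```python
# Sketch of Boolean retrieval: each document is a set of terms, and
# a document is retrieved iff it makes the query logically true.
# The three toy documents are assumptions for illustration.
docs = {
    "d1": {"x", "y", "w"},
    "d2": {"x", "z"},
    "d3": {"x", "y", "z"},
}

def matches(terms):
    # ((x AND y) AND (NOT z))
    return "x" in terms and "y" in terms and "z" not in terms

retrieved = sorted(d for d, terms in docs.items() if matches(terms))
print(retrieved)  # only d1 contains both x and y but not z
```

Note the binary decision criterion: each document either satisfies the query or it does not, with no ranking among the retrieved documents.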
• Drawbacks of the Boolean Model
• No ranking of the documents is provided
(absence of a grading scale)
• Information need has to be translated into a
Boolean expression, which most users find
awkward
• The Boolean queries formulated by the users
are most often too simplistic.
TF-IDF (Term Frequency/Inverse
Document Frequency) Weighting
• We assign to each term in a document a
weight for that term that depends on the
number of occurrences of the term in the
document.
• We would like to compute a score between a
query term t and a document d, based on the
weight of t in d. The simplest approach is to
assign the weight to be equal to the number
of occurrences of term t in document d.
• This weighting scheme is referred to as term
frequency and is denoted tf_{t,d}, with the
subscripts denoting the term and the
document in order.
• For a document d, the set of weights
determined by the tf weights above (or indeed
any weighting function that maps the number
of occurrences of t in d to a positive real
value) may be viewed as a quantitative digest
of that document.
• How is the document frequency df of a term
used to scale its weight? Denoting as usual the
total number of documents in a collection by
N, we define the inverse document frequency
(idf) of a term t as follows:
• idf_t = log(N / df_t)
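As a quick sketch of this definition (N and the df values below are assumed, and base-10 logarithms are used):

```python
import math

# Sketch of idf_t = log(N / df_t); N and the df values are assumed.
N = 1000
df = {"car": 100, "the": 1000, "insurance": 10}

idf = {t: math.log10(N / df_t) for t, df_t in df.items()}
print(idf)  # a term occurring in every document gets idf 0
```

Rare terms (like "insurance" here) receive high idf, while a term present in every document contributes nothing to the score.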
• Tf-idf weighting
• We now combine the definitions of term
frequency and inverse document frequency, to
produce a composite weight for each term in
each document.
• The tf-idf weighting scheme assigns to term t
a weight in document d given by
• tf-idf_{t,d} = tf_{t,d} × idf_t
• The score of a document d is the sum, over
all query terms, of the number of times each
of the query terms occurs in d.
• We can refine this idea so that we add up not
the number of occurrences of each query
term t in d, but instead the tf-idf weight of
each term in d.
• Score(q, d) = Σ_{t∈q} tf-idf_{t,d}
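This refined score can be sketched as follows, using raw term counts for tf and log(N/df) for idf; the two-document corpus is an assumption:

```python
import math

# Sketch of Score(q, d) = sum over t in q of tf-idf_{t,d}, with
# tf as a raw count and idf = log10(N / df). Toy corpus assumed.
docs = {
    "d1": "the car is in the garage".split(),
    "d2": "car insurance covers the car".split(),
}
N = len(docs)

def doc_freq(term):
    # Number of documents containing the term.
    return sum(term in d for d in docs.values())

def score(query, doc):
    total = 0.0
    for t in query:
        tf = doc.count(t)
        dft = doc_freq(t)
        if tf and dft:
            total += tf * math.log10(N / dft)
    return total

# "car" occurs in both documents, so its idf (and contribution) is 0;
# the whole score comes from the rarer term "insurance".
print(score(["car", "insurance"], docs["d2"]))
```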
Cosine similarity
• Documents could be ranked by computing the distance between
the points representing the documents and the query.
• More commonly, a similarity measure is used (rather than a
distance or dissimilarity measure), so that the documents with the
highest scores are the most similar to the query.
• A number of similarity measures have been proposed and tested
for this purpose.
• The most successful of these is the cosine correlation similarity
measure.
• The cosine correlation measures the cosine of the angle between
the query and the document vectors.
• When the vectors are normalized so that all documents and queries
are represented by vectors of equal length, the cosine of the angle
between two identical vectors will be 1 (the angle is zero), and for
two vectors that do not share any non-zero terms, the cosine will
be 0.
• The cosine measure is defined as:
• Cosine(D_i, Q) = (Σ_{j=1}^{t} d_ij · q_j) / (√(Σ_{j=1}^{t} d_ij²) · √(Σ_{j=1}^{t} q_j²))
• The numerator of this measure is the sum of the products
of the term weights for the matching query and document
terms (known as the dot product or inner product).
• The denominator normalizes this score by dividing by the
product of the lengths of the two vectors. There is no
theoretical reason why the cosine correlation should be
preferred to other similarity measures, but it does perform
somewhat better in evaluations of search quality.
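The cosine measure can be sketched directly on term-weight vectors (the vectors below are illustrative):

```python
import math

# Sketch of Cosine(D_i, Q): the dot product divided by the product
# of the two vector lengths. The weight vectors are assumptions.
def cosine(d, q):
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)

print(cosine([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))  # identical vectors -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))            # no shared terms -> 0.0
```

The two printed cases correspond to the extremes described above: identical vectors give a cosine of 1 (zero angle), and vectors sharing no non-zero terms give 0.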