Lecture 09
Information Retrieval
The Vector Space Model for Scoring
Variant tf-idf Functions
 For assigning a weight for each term in each document, a number
of alternatives to tf and tf-idf have been considered.
 Sublinear tf scaling:
 It seems unlikely that twenty occurrences of a term in a document
truly carry twenty times the significance of a single occurrence.
Accordingly, there has been considerable research into variants of
term frequency that go beyond counting the number of
occurrences of a term.
 A common modification is to use instead the logarithm of the term
frequency, which assigns a weight given by:
$$
wf_{t,d} = \begin{cases} 1 + \log_{10} tf_{t,d}, & \text{if } tf_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}
$$
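A minimal sketch of this sublinear scaling (the function name is illustrative, not from the lecture):

```python
import math

def sublinear_tf(tf: int) -> float:
    """Log-scaled term frequency: 1 + log10(tf) when tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

# sublinear_tf(1) -> 1.0; sublinear_tf(20) -> ~2.3, far less than 20x the weight of one occurrence
```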
The Vector Space Model for Scoring
Variant tf-idf Functions
 In this form, we may replace tf by some other function wf as in
(6.13), to obtain:
$$
\text{wf-idf}_{t,d} = wf_{t,d} \times idf_{t}
$$
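A small sketch of the resulting wf-idf weight, assuming the usual inverse document frequency $idf_t = \log_{10}(N/df_t)$ from the standard tf-idf scheme:

```python
import math

def wf_idf(tf: int, df: int, n_docs: int) -> float:
    """wf-idf: sublinear (log-scaled) tf multiplied by inverse document frequency."""
    wf = 1.0 + math.log10(tf) if tf > 0 else 0.0
    idf = math.log10(n_docs / df) if df > 0 else 0.0
    return wf * idf
```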
 Maximum tf normalization:
 One well-studied technique is to normalize the tf weights of all
terms occurring in a document by the maximum tf in that
document. For each document $d$, let $tf_{\max}(d) = \max_{\tau \in d} tf_{\tau,d}$, where $\tau$ ranges over all terms in $d$.
 Then, we compute a normalized term frequency for each term t in
document d by:
$$
ntf_{t,d} = a + (1 - a)\,\frac{tf_{t,d}}{tf_{\max}(d)}
$$
 where 𝒂 is a value between 0 and 1 and is generally set to 0.4,
although some early work used the value 0.5.
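A minimal sketch of maximum tf normalization for a single tokenized document (names and the default a = 0.4 follow the description above; the tokenization is assumed to be done already):

```python
from collections import Counter

def ntf_weights(tokens, a=0.4):
    """Maximum tf normalization: a + (1 - a) * tf / tf_max for every term in one document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    tf_max = max(counts.values())
    return {t: a + (1.0 - a) * tf / tf_max for t, tf in counts.items()}
```

With a = 0.4 every weight lies between 0.4 and 1.0, so a modest change in a term's count moves its weight only slightly instead of scaling it linearly.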
The Vector Space Model for Scoring
Variant tf-idf Functions
 The term 𝒂 is a smoothing term whose role is to damp the contribution
of the second term – which may be viewed as a scaling down of tf by
the largest tf value in d.
 We will encounter smoothing further when discussing classification.
 The basic idea is to avoid a large swing in $ntf_{t,d}$ from modest changes in $tf_{t,d}$ (say from 1 to 2).
 The main idea of maximum tf normalization is to mitigate the following
anomaly: we observe higher term frequencies in longer documents,
merely because longer documents tend to repeat the same words over
and over again. To appreciate this, consider the following extreme
example:
 Suppose we were to take a document $d$ and create a new document $d'$ by simply appending a copy of $d$ to itself. While $d'$ should be no more relevant to any query than $d$ is, the use of $\text{Score}(d, q) = \sum_{t \in q} \text{tf-idf}_{t,d}$ would assign $d'$ twice as high a score as $d$.
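A small worked sketch of this doubling anomaly; the idf values are arbitrary placeholders chosen only for the demonstration:

```python
from collections import Counter

def raw_tfidf_score(doc_tokens, query_terms, idf):
    """Sum of tf * idf over the query terms, using raw (unnormalized) term counts."""
    tf = Counter(doc_tokens)
    return sum(tf[t] * idf.get(t, 0.0) for t in query_terms)

d = ["jaguar", "speed", "animal", "jaguar"]
idf = {"jaguar": 1.5, "speed": 1.0}   # placeholder idf values

print(raw_tfidf_score(d, ["jaguar", "speed"], idf))       # 4.0
print(raw_tfidf_score(d + d, ["jaguar", "speed"], idf))   # 8.0 -- exactly double
```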
The Vector Space Model for Scoring
Variant tf-idf Functions
 Maximum tf normalization does suffer from the following issues:
1. The method is unstable in the following sense: a change in the
stop word list can dramatically alter term weightings (and
therefore ranking). Thus, it is hard to tune.
2. A document may contain an outlier term with an unusually large
number of occurrences of that term, not representative of the
content of that document.
3. More generally, a document in which the most frequent term
appears roughly as often as many other terms should be treated
differently from one with a more skewed distribution.
The Vector Space Model for Scoring
Document and Query Weighting Schemes
 The vector space scoring equation (the cosine similarity of 𝑽(𝒅) and 𝑽(𝒒)) is fundamental to information retrieval systems that use any form of vector space scoring.
 Variations from one vector space scoring method to another hinge on
the specific choices of weights in the vectors 𝑽(𝒅) and 𝑽(𝒒), together
with a mnemonic for representing a specific combination of weights;
this system of mnemonics is sometimes called SMART notation,
following the authors of an early text retrieval system.
 The mnemonic for representing a combination of weights takes the
form ddd.qqq where the first triplet gives the term weighting of the
document vector, while the second triplet gives the weighting in the
query vector.
 The first letter in each triplet specifies the term frequency component
of the weighting, the second the document frequency component, and
the third the form of normalization used.
The Vector Space Model for Scoring
Document and Query Weighting Schemes
 It is quite common to apply different normalization functions to 𝑽(𝒅) and 𝑽(𝒒).
 For example, a very standard weighting scheme is lnc.ltc
 where the document vector has log-weighted term frequency, no idf (for both effectiveness and efficiency reasons), and cosine normalization,
 while the query vector uses log-weighted term frequency, idf
weighting, and cosine normalization.
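A minimal sketch of lnc.ltc scoring for one document/query pair; the term-count and document-frequency dictionaries are assumed to come from an existing index:

```python
import math

def lnc_ltc_score(doc_tf, query_tf, df, n_docs):
    """Cosine score under lnc.ltc: document = log tf, no idf, cosine-normalized;
    query = log tf, idf, cosine-normalized."""
    # Document vector: l (log tf), n (no idf), c (cosine normalization)
    d_w = {t: 1.0 + math.log10(tf) for t, tf in doc_tf.items() if tf > 0}
    d_norm = math.sqrt(sum(w * w for w in d_w.values())) or 1.0
    # Query vector: l (log tf), t (idf), c (cosine normalization)
    q_w = {t: (1.0 + math.log10(tf)) * math.log10(n_docs / df[t])
           for t, tf in query_tf.items() if tf > 0 and df.get(t, 0) > 0}
    q_norm = math.sqrt(sum(w * w for w in q_w.values())) or 1.0
    # Dot product of the two cosine-normalized vectors
    return sum(d_w.get(t, 0.0) * w for t, w in q_w.items()) / (d_norm * q_norm)
```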
Evaluation in information retrieval
 How do we know which of the previously discussed IR techniques
are effective in which applications?
 Should we use stop lists?
 Should we use stemming?
 Should we use inverse document frequency weighting?
Evaluation in information retrieval
 To measure ad hoc information retrieval effectiveness in the
standard way, we need a test collection consisting of three things:
1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of either
relevant or nonrelevant for each query-document pair.
Evaluation in information retrieval
 A document in the test collection is given a binary classification as
either relevant or nonrelevant.
 This decision is referred to as the gold standard or ground truth
judgment of relevance.
 Relevance is assessed relative to an information need, not a query. This means that a document is relevant if it addresses the stated information need, not merely because it happens to contain all the words in the query.
Evaluation in information retrieval
 Information Need:
Query = Jaguar
Query = Jaguar Speed
Query = Speed of Jaguar
Query = Speed of Jaguar animal
Query = Speed of impala
Query = Speed of impala animal
Query = impala
Evaluation in information retrieval
Standard Test Collections
1. Cranfield Collection:
 Abstracts of 1398 articles
 A set of 225 queries, and their respective relevance judgments.
2. TREC (Text Retrieval Conference):
 6 CDs containing 1.89 million documents
 Relevance judgments for 450 information needs, which are called
topics and specified in detailed text passages.
3. CLEF (Cross Language Evaluation Forum):
 This evaluation series has concentrated on European languages and
cross-language information retrieval.
 Precision: What fraction of the returned results are relevant to
the information need?
$$
P = \frac{\text{Number of relevant documents retrieved}}{\text{Number of documents retrieved}}
$$
 Recall: What fraction of the relevant documents in the collection
were returned by the system?
$$
R = \frac{\text{Number of relevant documents retrieved}}{\text{Number of relevant documents in the collection}}
$$
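A minimal sketch of both measures over sets of document identifiers (names are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved result set against the judged-relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# precision_recall({"d1", "d2", "d3"}, {"d2", "d3", "d5", "d8"}) -> (0.667, 0.5)
```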
Evaluation in information retrieval
Precision/Recall . . . Revisited
Evaluation in information retrieval
General Types of Evaluation
 Offline Evaluation a.k.a. Manual Judgments
 Ask experts or users to explicitly evaluate your retrieval system.
 Online Evaluation a.k.a. Observing Users
 See how normal users interact with your retrieval system when just using it.
 Offline Evaluation involves:
 Select queries to evaluate on
 Get results for those queries
 Assess the relevance of those results to the queries
 Compute your offline metric
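A hedged sketch of how these steps might be wired together, using mean precision as the example offline metric; the run and judgment formats are assumptions for illustration, not from the lecture:

```python
def mean_precision(runs, judgments):
    """runs: {query_id: [retrieved doc ids]}; judgments: {query_id: set of relevant doc ids}."""
    scores = []
    for qid, retrieved in runs.items():
        relevant = judgments.get(qid, set())
        hits = sum(1 for d in retrieved if d in relevant)
        scores.append(hits / len(retrieved) if retrieved else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```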
Evaluation in information retrieval
General Types of Evaluation
 Online Evaluation involves:
 Capture users’ search behavior:
 Search queries
 Results and Clicks
 Mouse Movement
 Assess the relevance of those results to the queries
 Compute your online metric
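One possible online metric is click-through rate over the captured interaction log; this is a hedged sketch and the log format is an assumption for illustration:

```python
def click_through_rate(log):
    """log: list of (query, shown_doc_ids, clicked_doc_ids) tuples captured from real user sessions."""
    shown = sum(len(shown_docs) for _, shown_docs, _ in log)
    clicked = sum(len(clicked_docs) for _, _, clicked_docs in log)
    return clicked / shown if shown else 0.0
```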