Lecture 09
Information Retrieval
The Vector Space Model for Scoring
Variant tf-idf Functions
 For assigning a weight for each term in each document, a number
of alternatives to tf and tf-idf have been considered.
 Sublinear tf scaling:
 It seems unlikely that twenty occurrences of a term in a document
truly carry twenty times the significance of a single occurrence.
Accordingly, there has been considerable research into variants of
term frequency that go beyond counting the number of
occurrences of a term.
 A common modification is to use instead the logarithm of the term
frequency, which assigns a weight given by:
$$
wf_{t,d} = \begin{cases} 1 + \log_{10} tf_{t,d}, & \text{if } tf_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}
$$
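A minimal sketch of this sublinear scaling (the function name is illustrative, not from the lecture):

```python
import math

def sublinear_tf(tf: int) -> float:
    """Log-scaled term frequency: 1 + log10(tf) when tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

# sublinear_tf(1) -> 1.0; sublinear_tf(20) -> ~2.3, far less than 20x the weight of one occurrence
```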
The Vector Space Model for Scoring
Variant tf-idf Functions
 In this form, we may replace tf by some other function wf as in
(6.13), to obtain:
$$
\text{wf-idf}_{t,d} = wf_{t,d} \times idf_{t}
$$
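A small sketch of the resulting wf-idf weight, assuming the usual inverse document frequency $idf_t = \log_{10}(N/df_t)$ from the standard tf-idf scheme:

```python
import math

def wf_idf(tf: int, df: int, n_docs: int) -> float:
    """wf-idf: sublinear (log-scaled) tf multiplied by inverse document frequency."""
    wf = 1.0 + math.log10(tf) if tf > 0 else 0.0
    idf = math.log10(n_docs / df) if df > 0 else 0.0
    return wf * idf
```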
 Maximum tf normalization:
 One well-studied technique is to normalize the tf weights of all
terms occurring in a document by the maximum tf in that
document. For each document $d$, let $tf_{\max}(d) = \max_{\tau \in d} tf_{\tau,d}$, where $\tau$ ranges over all terms in $d$.
 Then, we compute a normalized term frequency for each term t in
document d by:
$$
ntf_{t,d} = a + (1 - a)\,\frac{tf_{t,d}}{tf_{\max}(d)}
$$
 where 𝒂 is a value between 0 and 1 and is generally set to 0.4,
although some early work used the value 0.5.
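A minimal sketch of maximum tf normalization for a single tokenized document (names and the default a = 0.4 follow the description above; the tokenization is assumed to be done already):

```python
from collections import Counter

def ntf_weights(tokens, a=0.4):
    """Maximum tf normalization: a + (1 - a) * tf / tf_max for every term in one document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    tf_max = max(counts.values())
    return {t: a + (1.0 - a) * tf / tf_max for t, tf in counts.items()}
```

With a = 0.4 every weight lies between 0.4 and 1.0, so a modest change in a term's count moves its weight only slightly instead of scaling it linearly.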
The Vector Space Model for Scoring
Variant tf-idf Functions
 The term 𝒂 is a smoothing term whose role is to damp the contribution
of the second term – which may be viewed as a scaling down of tf by
the largest tf value in d.
 We will encounter smoothing further when discussing classification.
 The basic idea is to avoid a large swing in $ntf_{t,d}$ from modest changes in $tf_{t,d}$ (say from 1 to 2).
 The main idea of maximum tf normalization is to mitigate the following
anomaly: we observe higher term frequencies in longer documents,
merely because longer documents tend to repeat the same words over
and over again. To appreciate this, consider the following extreme
example:
 Suppose we were to take a document $d$ and create a new document $d'$ by simply appending a copy of $d$ to itself. While $d'$ should be no more relevant to any query than $d$ is, the use of $\text{Score}(d, q) = \sum_{t \in q} \text{tf-idf}_{t,d}$ would assign $d'$ twice as high a score as $d$.
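A small worked sketch of this doubling anomaly; the idf values are arbitrary placeholders chosen only for the demonstration:

```python
from collections import Counter

def raw_tfidf_score(doc_tokens, query_terms, idf):
    """Sum of tf * idf over the query terms, using raw (unnormalized) term counts."""
    tf = Counter(doc_tokens)
    return sum(tf[t] * idf.get(t, 0.0) for t in query_terms)

d = ["jaguar", "speed", "animal", "jaguar"]
idf = {"jaguar": 1.5, "speed": 1.0}   # placeholder idf values

print(raw_tfidf_score(d, ["jaguar", "speed"], idf))       # 4.0
print(raw_tfidf_score(d + d, ["jaguar", "speed"], idf))   # 8.0 -- exactly double
```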
The Vector Space Model for Scoring
Variant tf-idf Functions
 Maximum tf normalization does suffer from the following issues:
1. The method is unstable in the following sense: a change in the
stop word list can dramatically alter term weightings (and
therefore ranking). Thus, it is hard to tune.
2. A document may contain an outlier term with an unusually large
number of occurrences of that term, not representative of the
content of that document.
3. More generally, a document in which the most frequent term
appears roughly as often as many other terms should be treated
differently from one with a more skewed distribution.
The Vector Space Model for Scoring
Document and Query Weighting Schemes
 The vector space scoring equation (the cosine similarity of 𝑽(𝒅) and 𝑽(𝒒)) is fundamental to information retrieval systems that use any form of vector space scoring.
 Variations from one vector space scoring method to another hinge on
the specific choices of weights in the vectors 𝑽(𝒅) and 𝑽(𝒒), together
with a mnemonic for representing a specific combination of weights;
this system of mnemonics is sometimes called SMART notation,
following the authors of an early text retrieval system.
 The mnemonic for representing a combination of weights takes the
form ddd.qqq where the first triplet gives the term weighting of the
document vector, while the second triplet gives the weighting in the
query vector.
 The first letter in each triplet specifies the term frequency component
of the weighting, the second the document frequency component, and
the third the form of normalization used.
The Vector Space Model for Scoring
Document and Query Weighting Schemes
 It is quite common to apply different normalization functions to 𝑽(𝒅) and 𝑽(𝒒).
 For example, a very standard weighting scheme is lnc.ltc
 where the document vector has log-weighted term frequency, no idf (for both effectiveness and efficiency reasons), and cosine normalization,
 while the query vector uses log-weighted term frequency, idf
weighting, and cosine normalization.
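A minimal sketch of lnc.ltc scoring for one document/query pair; the term-count and document-frequency dictionaries are assumed to come from an existing index:

```python
import math

def lnc_ltc_score(doc_tf, query_tf, df, n_docs):
    """Cosine score under lnc.ltc: document = log tf, no idf, cosine-normalized;
    query = log tf, idf, cosine-normalized."""
    # Document vector: l (log tf), n (no idf), c (cosine normalization)
    d_w = {t: 1.0 + math.log10(tf) for t, tf in doc_tf.items() if tf > 0}
    d_norm = math.sqrt(sum(w * w for w in d_w.values())) or 1.0
    # Query vector: l (log tf), t (idf), c (cosine normalization)
    q_w = {t: (1.0 + math.log10(tf)) * math.log10(n_docs / df[t])
           for t, tf in query_tf.items() if tf > 0 and df.get(t, 0) > 0}
    q_norm = math.sqrt(sum(w * w for w in q_w.values())) or 1.0
    # Dot product of the two cosine-normalized vectors
    return sum(d_w.get(t, 0.0) * w for t, w in q_w.items()) / (d_norm * q_norm)
```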
Evaluation in information retrieval
 How do we know which of the previously discussed IR techniques
are effective in which applications?
 Should we use stop lists?
 Should we use stemming?
 Should we use inverse document frequency weighting?
Evaluation in information retrieval
 To measure ad hoc information retrieval effectiveness in the
standard way, we need a test collection consisting of three things:
1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of either
relevant or nonrelevant for each query-document pair.
Evaluation in information retrieval
 A document in the test collection is given a binary classification as
either relevant or nonrelevant.
 This decision is referred to as the gold standard or ground truth
judgment of relevance.
 Relevance is assessed relative to an information need, not a query. This means that a document is relevant if it addresses the stated information need, not merely because it happens to contain all the words in the query.
Evaluation in information retrieval
 Information Need:
Query = Jaguar
Query = Jaguar Speed
Query = Speed of Jaguar
Query = Speed of Jaguar animal
Query = Speed of impala
Query = Speed of impala animal
Query = impala
Evaluation in information retrieval
Standard Test Collections
1. Cranfield Collection:
 Abstracts of 1398 articles
 A set of 225 queries, and their respective relevance judgments.
2. TREC (Text Retrieval Conference):
 6 CDs containing 1.89 million documents
 Relevance judgments for 450 information needs, which are called
topics and specified in detailed text passages.
3. CLEF (Cross Language Evaluation Forum):
 This evaluation series has concentrated on European languages and
cross-language information retrieval.
 Precision: What fraction of the returned results are relevant to
the information need?
$$
P = \frac{\text{Number of relevant documents retrieved}}{\text{Number of documents retrieved}}
$$
 Recall: What fraction of the relevant documents in the collection
were returned by the system?
$$
R = \frac{\text{Number of relevant documents retrieved}}{\text{Number of relevant documents in the collection}}
$$
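A minimal sketch of both measures over sets of document identifiers (names are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved result set against the judged-relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# precision_recall({"d1", "d2", "d3"}, {"d2", "d3", "d5", "d8"}) -> (0.667, 0.5)
```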
Evaluation in information retrieval
Precision/Recall . . . Revisited
Evaluation in information retrieval
General Types of Evaluation
 Offline Evaluation a.k.a. Manual Judgments
 Ask experts or users to explicitly evaluate your retrieval system.
 Online Evaluation a.k.a. Observing Users
 See how normal users interact with your retrieval system when just using it.
 Offline Evaluation involves:
 Select queries to evaluate on
 Get results for those queries
 Assess the relevance of those results to the queries
 Compute your offline metric
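A hedged sketch of how these steps might be wired together, using mean precision as the example offline metric; the run and judgment formats are assumptions for illustration, not from the lecture:

```python
def mean_precision(runs, judgments):
    """runs: {query_id: [retrieved doc ids]}; judgments: {query_id: set of relevant doc ids}."""
    scores = []
    for qid, retrieved in runs.items():
        relevant = judgments.get(qid, set())
        hits = sum(1 for d in retrieved if d in relevant)
        scores.append(hits / len(retrieved) if retrieved else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```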
Evaluation in information retrieval
General Types of Evaluation
 Online Evaluation involves:
 Capture users’ search behavior:
 Search queries
 Results and Clicks
 Mouse Movement
 Assess the relevance of those results to the queries
 Compute your online metric
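One possible online metric is click-through rate over the captured interaction log; this is a hedged sketch and the log format is an assumption for illustration:

```python
def click_through_rate(log):
    """log: list of (query, shown_doc_ids, clicked_doc_ids) tuples captured from real user sessions."""
    shown = sum(len(shown_docs) for _, shown_docs, _ in log)
    clicked = sum(len(clicked_docs) for _, _, clicked_docs in log)
    return clicked / shown if shown else 0.0
```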