Scoring, term weighting and the vector space

 Is a document simply a sequence of words?
 Documents have many structural components, such as author, title and date of publication
 Metadata – data about documents
 Fields – document features where possible values are finite. Example – dates,
ISBN
 Zones – document features whose content can be arbitrary free text. Example – title, abstract

A user may specify requirements on fields and zones

One parametric index for each field
The dictionary of a parametric index comes from a fixed vocabulary
A separate inverted index is built for each zone of the document

 The dictionary of a zone index holds whatever vocabulary stems from the text of that zone
 Advantages:
 Reduced size of the dictionary
 Efficient query answering using weighted zone scoring
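
The idea of one inverted index per zone can be sketched as follows. This is a minimal illustration, not a production index: the function name `build_zone_indexes`, the two sample documents and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


def build_zone_indexes(docs):
    """Build one inverted index per zone: zone -> term -> sorted doc IDs.

    `docs` maps a document ID to a dict of zone name -> zone text.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            for term in text.lower().split():
                indexes[zone][term].add(doc_id)
    # Freeze the postings into sorted lists, one dictionary per zone.
    return {
        zone: {term: sorted(ids) for term, ids in terms.items()}
        for zone, terms in indexes.items()
    }


# Hypothetical two-document collection with author, title and body zones.
docs = {
    1: {"author": "William Shakespeare", "title": "Hamlet",
        "body": "to be or not to be"},
    2: {"author": "Jane Doe", "title": "Reading Shakespeare",
        "body": "notes on hamlet"},
}
zone_indexes = build_zone_indexes(docs)
```

Each zone ends up with its own dictionary, drawn only from the text of that zone, so a lookup such as `zone_indexes["author"]["shakespeare"]` is answered from the author index alone.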

 Different fields and zones have different importance in evaluating how well a document matches a query
 For a query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1] by computing a linear combination of zone scores
 Let each document have $l$ zones. Let $g_1, \ldots, g_l \in [0, 1]$ such that $\sum_{i=1}^{l} g_i = 1$
 Each field/zone contributes a Boolean value: let $s_i$ be the Boolean score denoting a match (1) or the absence of a match (0) between q and the $i$th zone
 The weighted zone score is $\sum_{i=1}^{l} g_i s_i$
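
The weighted zone score defined above can be sketched in a few lines. The function name `weighted_zone_score` and the dict-based representation of weights and matches are assumptions chosen for illustration.

```python
def weighted_zone_score(weights, matches):
    """Return sum_i g_i * s_i for one (query, document) pair.

    `weights` maps zone name -> g_i (values in [0, 1] that sum to 1).
    `matches` maps zone name -> Boolean s_i (does the query match the zone?).
    """
    if set(weights) != set(matches):
        raise ValueError("weights and matches must cover the same zones")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("zone weights must sum to 1")
    return sum(g * (1 if matches[zone] else 0) for zone, g in weights.items())
```

Because the weights sum to 1 and each $s_i$ is 0 or 1, the result always lies in [0, 1].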

 Consider the query Shakespeare in a collection in which each document has three
zones: author, title and body
 The Boolean score function takes the value 1 if the query term Shakespeare is present in the zone, and 0 otherwise
 Weighted zone scoring requires three weights $g_{body}$, $g_{title}$ and $g_{author}$
 Let $g_{body} = 0.5$, $g_{title} = 0.3$ and $g_{author} = 0.2$
 Here a match in the author zone is least important, the title zone counts somewhat more, and the body contributes the most
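
As a worked case, assume a hypothetical document in which Shakespeare appears in the title and body zones but not in the author zone. With the weights above, the score works out as:

```latex
\begin{align*}
\mathrm{score}(q, d)
  &= g_{author}\, s_{author} + g_{title}\, s_{title} + g_{body}\, s_{body} \\
  &= 0.2 \cdot 0 + 0.3 \cdot 1 + 0.5 \cdot 1 \\
  &= 0.8
\end{align*}
```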

 The weights could be specified by an expert
 Alternatively, they can be learned from training examples that have been judged editorially
 Each training example is a tuple consisting of a query q and a document d and a
relevance judgment of q on d
 The judgment can be binary
 A judgment score can also be used
 Compute the weights such that the learned scores approximate the relevance judgments as closely as possible
 This is an optimization problem

 Consider two zones: title and body
 Compute Boolean variables $s_T(d, q)$ and $s_B(d, q)$, indicating whether the query matches the title and the body of the document respectively
 Compute a score between 0 and 1 by using the relation:
 $\mathrm{score}(d, q) = g \cdot s_T(d, q) + (1 - g) \cdot s_B(d, q)$
 The constant $g$ is determined from a set of training examples $\mu_j = (d_j, q_j, r(d_j, q_j))$
 In each training example, a training document $d_j$ and a training query $q_j$ are assessed by a human editor who delivers a relevance judgment $r(d_j, q_j)$.
 For each training example $\mu_j$, we have Boolean values $s_T(d_j, q_j)$ and $s_B(d_j, q_j)$ that we use to compute a score
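 Error of the scoring function on example $\mu_j$, with the judgment $r$ quantized as 1 (relevant) or 0 (non-relevant):
 $\varepsilon(g, \mu_j) = (r(d_j, q_j) - \mathrm{score}(d_j, q_j))^2$
 Total error: $\sum_j \varepsilon(g, \mu_j)$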
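
The total squared error for a candidate weight can be computed directly from the training set. In this sketch each training example is reduced to a tuple `(s_title, s_body, relevance)` of 0/1 values; that representation, the function names and the five sample judgments are assumptions made for illustration.

```python
def score(g, s_title, s_body):
    """Two-zone weighted score: g * s_T + (1 - g) * s_B."""
    return g * s_title + (1 - g) * s_body


def total_error(g, examples):
    """Sum of squared differences between judgment and score.

    `examples` is an iterable of (s_title, s_body, relevance) tuples,
    each component being 0 or 1.
    """
    return sum((r - score(g, s_t, s_b)) ** 2 for s_t, s_b, r in examples)


# Hypothetical editorial judgments: (title match, body match, relevant?)
examples = [(1, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0)]
error_at_half = total_error(0.5, examples)
```

Evaluating `total_error` at several values of `g` shows how the choice of weight trades off the examples where only one zone matches.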

 Let $n_{01r}$ ($n_{01n}$) be the number of training examples for which $s_T = 0$ and $s_B = 1$ and the judgment is relevant (non-relevant); the other counts $n_{10r}$, $n_{10n}$, $n_{00r}$ and $n_{11n}$ are defined analogously. The contribution of the examples with $s_T = 0$ and $s_B = 1$ to the total error is
 $[1 - (1 - g)]^2 n_{01r} + [0 - (1 - g)]^2 n_{01n}$
 The total error is $(n_{01r} + n_{10n}) g^2 + (n_{10r} + n_{01n})(1 - g)^2 + n_{00r} + n_{11n}$
 By differentiating with respect to g and setting the result to 0, the optimal value of g is

$g = \dfrac{n_{10r} + n_{01n}}{n_{10r} + n_{10n} + n_{01r} + n_{01n}}$
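
Setting the derivative of the total error, $2g(n_{01r} + n_{10n}) - 2(1 - g)(n_{10r} + n_{01n})$, to zero gives the closed form above. A minimal sketch of computing it from judged examples follows; the tuple representation `(s_title, s_body, relevance)`, the function name `optimal_g` and the sample data are assumptions made for illustration.

```python
from collections import Counter


def optimal_g(examples):
    """Closed-form weight g that minimizes the total squared error.

    `examples` is an iterable of (s_title, s_body, relevance) tuples with
    0/1 components. Only examples where exactly one zone matches affect g.
    """
    counts = Counter(examples)
    n10r = counts[(1, 0, 1)]  # title-only match, judged relevant
    n10n = counts[(1, 0, 0)]  # title-only match, judged non-relevant
    n01r = counts[(0, 1, 1)]  # body-only match, judged relevant
    n01n = counts[(0, 1, 0)]  # body-only match, judged non-relevant
    denominator = n10r + n10n + n01r + n01n
    if denominator == 0:
        raise ValueError("g is undetermined: no example matches exactly one zone")
    return (n10r + n01n) / denominator


# Hypothetical editorial judgments: (title match, body match, relevant?)
examples = [(1, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0)]
g_star = optimal_g(examples)
```

Examples in which both zones match or neither matches contribute a constant to the error, which is why they do not appear in the formula and why the function refuses to return a value when no other examples exist.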
