This document introduces several common methods for evaluating information retrieval systems, including precision, recall, F-measure, and discounted cumulative gain. It explains that evaluation requires a test collection consisting of documents, queries, and human relevance judgments. Precision measures the fraction of retrieved documents that are relevant, while recall measures the fraction of relevant documents that are retrieved. F-measure combines precision and recall as their harmonic mean.
2. INFORMATION RETRIEVAL (IR)
IR is the activity of obtaining information resources
relevant to an information need from a collection of
information resources.
IR: select the most relevant documents (precision),
and preferably all the relevant ones (recall), from a
set of documents.
Ex: search strings in web search engines.
3. IR EVALUATION
The evaluation of an IR system is the process of
assessing how well a system meets the information
needs of its users.
To measure information retrieval effectiveness in
the standard way, we need a test collection
consisting of three things:
1. A document collection
2. A test suite of information needs, (queries)
3. A set of relevance judgments, a binary
assessment of either relevant or nonrelevant for
each query-document pair.
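As a concrete illustration, a minimal sketch of such a test collection in Python is shown below; the documents, queries, and judgments are invented for illustration only.

# A tiny, invented test collection: documents, information needs (queries),
# and binary relevance judgments for each query-document pair.
documents = {
    "d1": "how to clean a green swimming pool",
    "d2": "pool cleaner robots reviewed",
    "d3": "history of competitive swimming",
}
queries = {
    "q1": "pool cleaner",
}
# qrels: for each query, the set of documents judged relevant
qrels = {
    "q1": {"d1", "d2"},
}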
4. HUMAN LABELED CORPORA (GOLD STANDARD)
Start with a corpus of documents.
Collect a set of queries for this corpus.
Have one or more human experts
exhaustively label the relevant documents for
each query.
Typically assumes binary relevance
judgments.
Requires considerable human effort for large
document/query corpora.
6. PRECISION
Precision is the ability to retrieve top-ranked
documents that are mostly relevant.
Precision = (Number of relevant documents retrieved) /
(Total number of documents retrieved)
= P(relevant | retrieved)
7. RECALL
Recall is the fraction of the documents relevant to
the query that are successfully retrieved
The ability of the search to find all of the relevant
items in the corpus.
Recall = (Number of relevant documents retrieved) /
(Total number of relevant documents)
= P(retrieved | relevant)
8. F-MEASURE
F-measure is the harmonic mean of precision and
recall:
F = 2 · Precision · Recall / (Precision + Recall)
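All three measures can be computed directly from the retrieved set and the judged-relevant set. The Python sketch below uses made-up document IDs.

def precision_recall_f1(retrieved, relevant):
    # Set-based precision, recall, and F-measure (their harmonic mean).
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented IDs: 3 of the 5 retrieved documents are relevant,
# out of 4 relevant documents in the collection.
print(precision_recall_f1(["d1", "d2", "d3", "d4", "d5"], ["d1", "d3", "d5", "d9"]))
# -> (0.6, 0.75, 0.666...)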
9. IR RELEVANCE
Relevance denotes how well a retrieved document
or set of documents meets the information need of
the user.
Relevance may include concerns such as
timeliness, authority or novelty of the result.
Relevance is assessed relative to the user need,
not the query
E.g., Information need: My swimming pool is
becoming black and needs to be cleaned.
Query: pool cleaner
10. DIFFICULTIES IN EVALUATING IR SYSTEMS
Effectiveness is related to the relevancy of
retrieved items.
Relevancy is not typically binary but
continuous.
Even if relevancy is binary, it can be a
difficult judgment to make.
Relevancy, from a human standpoint, is:
Subjective: Depends upon a specific user’s
judgment.
Situational: Relates to user’s current needs.
Cognitive: Depends on human perception and
behavior.
Dynamic: Changes over time.
11. IR JUDGEMENT
Evaluating the performance of information retrieval
systems usually takes a lot of human effort.
Relevance judgement is a laborious task when we
have a large set of retrieved documents.
One of the most interesting evaluation techniques
used in TREC is the pooling method employed to
deal with relevance judgements, so as to reduce
human efforts.
In TREC, each participating system reports the
1000 top-ranked documents for each topic. Of
these, only the top 100 from each system are
collected into a pool for human assessment.
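A rough sketch of the pooling idea in Python, assuming each run is a ranked list of document IDs (the system names and IDs below are hypothetical):

def build_pool(runs, depth=100):
    # Union of the top-`depth` documents from each system's ranking for one topic;
    # only the pooled documents are then judged by human assessors.
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Hypothetical runs, truncated for brevity.
runs = {
    "systemA": ["d12", "d7", "d3", "d44"],
    "systemB": ["d7", "d99", "d12", "d5"],
}
print(sorted(build_pool(runs, depth=3)))  # documents sent for human assessment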
12. PRECISION@K
Set a rank threshold K
Compute % relevant in top K
Ignores documents ranked lower than K
Ex (a ranking with relevant documents at ranks 1, 3, and 5):
Prec@3 of 2/3
Prec@4 of 2/4
Prec@5 of 3/5
In similar fashion we have Recall@K
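A minimal Precision@K sketch over a ranked list of binary relevance labels (1 = relevant, 0 = not); the ranking below reproduces the numbers in the example above.

def precision_at_k(relevance, k):
    # Fraction of the top-k ranked documents that are relevant.
    return sum(relevance[:k]) / k

ranking = [1, 0, 1, 0, 1]  # relevant documents at ranks 1, 3, and 5
print(precision_at_k(ranking, 3))  # 2/3
print(precision_at_k(ranking, 4))  # 2/4
print(precision_at_k(ranking, 5))  # 3/5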
13. MEAN AVERAGE PRECISION
Consider rank position of each relevant doc
K1, K2, … KR
Compute Precision@K for each K1, K2, … KR
Average precision = average of P@K
Ex: Avg precision = 1/3 (1/1 + 2/3 + 3/5) ≈ 0.76
MAP score is Average Precision averaged across
multiple queries/rankings.
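Average precision and MAP can be sketched in Python as below, following the slide's definition (averaging P@K at the rank of each retrieved relevant document); the rankings are invented, and the first one matches the example above.

def average_precision(relevance):
    # Mean of Precision@K taken at the rank K of each relevant document.
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    # MAP: average precision averaged over the rankings of several queries.
    return sum(average_precision(r) for r in rankings) / len(rankings)

print(average_precision([1, 0, 1, 0, 1]))  # 1/3 (1/1 + 2/3 + 3/5) ≈ 0.756
print(mean_average_precision([[1, 0, 1, 0, 1], [0, 1, 1, 0, 0]]))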
14. DISCOUNTED CUMULATIVE GAIN
Popular measure for evaluating web search and
related tasks
Two assumptions:
1. Highly relevant documents are more useful than
marginally relevant documents
2. The lower the ranked position of a relevant
document, the less useful it is for the user, since it
is less likely to be examined
15. DISCOUNTED CUMULATIVE GAIN
Uses graded relevance as a measure of usefulness, or
gain, from examining a document
Gain is accumulated starting at the top of the ranking
and may be reduced, or discounted, at lower ranks
Typical discount is 1/log(rank)
With base 2, the discount at rank 4 is 1/2, and at rank 8
it is 1/3
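A small DCG sketch under the formulation on this slide: the gain at rank 1 is not discounted, and the gain at rank i > 1 is divided by log2(i). The graded relevance values are invented.

import math

def dcg(gains):
    # Discounted cumulative gain for a ranked list of graded relevance values.
    # Rank 1 is undiscounted; rank i > 1 is discounted by 1 / log2(i).
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank == 1 else gain / math.log2(rank)
    return total

# Invented graded judgments (0 = not relevant ... 3 = highly relevant).
print(dcg([3, 2, 3, 0, 1, 2]))  # ≈ 8.10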
16. HUMAN JUDGMENTS ARE
Expensive
Inconsistent between raters and over time
Decay in value as the document/query mix evolves
Not always representative of “real users”
17. IR JUDGEMENT
How fast does it index?
How fast does it search?
Does it recommend related information?
Facilities for crowdsourcing relevance judgments