This document introduces several common methods for evaluating information retrieval systems, including precision, recall, F-measure, and discounted cumulative gain. It explains that evaluation requires a test collection consisting of documents, queries, and human relevance judgments. Precision measures the fraction of retrieved documents that are relevant, while recall measures the fraction of relevant documents that are retrieved. F-measure combines precision and recall as their harmonic mean.
2. INFORMATION RETRIEVAL (IR)
IR is the activity of obtaining information resources
relevant to an information need from a collection of
information resources.
IR: select the most relevant documents (precision),
and preferably all the relevant ones (recall), from a
set of documents.
Ex: search strings in web search engines.
3. IR EVALUATION
The evaluation of an IR system is the process of
assessing how well a system meets the information
needs of its users.
To measure information retrieval effectiveness in
the standard way, we need a test collection
consisting of three things:
1. A document collection
2. A test suite of information needs, (queries)
3. A set of relevance judgments, a binary
assessment of either relevant or nonrelevant for
each query-document pair.
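As a concrete illustration, a minimal sketch of such a test collection in Python is shown below; the documents, queries, and judgments are invented for illustration only.

# A tiny, invented test collection: documents, information needs (queries),
# and binary relevance judgments for each query-document pair.
documents = {
    "d1": "how to clean a green swimming pool",
    "d2": "pool cleaner robots reviewed",
    "d3": "history of competitive swimming",
}
queries = {
    "q1": "pool cleaner",
}
# qrels: for each query, the set of documents judged relevant
qrels = {
    "q1": {"d1", "d2"},
}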
4. HUMAN LABELED CORPORA (GOLD STANDARD)
Start with a corpus of documents.
Collect a set of queries for this corpus.
Have one or more human experts
exhaustively label the relevant documents for
each query.
Typically assumes binary relevance
judgments.
Requires considerable human effort for large
document/query corpora.
6. PRECISION
Precision is the ability to retrieve top-ranked
documents that are mostly relevant.
Precision = (Number of relevant documents retrieved) /
(Total number of documents retrieved)
= P(relevant | retrieved)
7. RECALL
Recall is the fraction of the documents relevant to
the query that are successfully retrieved
The ability of the search to find all of the relevant
items in the corpus.
Recall = (Number of relevant documents retrieved) /
(Total number of relevant documents)
= P(retrieved | relevant)
8. F-MEASURE
F-measure is the harmonic mean of precision and
recall:
F = 2 · Precision · Recall / (Precision + Recall)
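All three measures can be computed directly from the retrieved set and the judged-relevant set. The Python sketch below uses made-up document IDs.

def precision_recall_f1(retrieved, relevant):
    # Set-based precision, recall, and F-measure (their harmonic mean).
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented IDs: 3 of the 5 retrieved documents are relevant,
# out of 4 relevant documents in the collection.
print(precision_recall_f1(["d1", "d2", "d3", "d4", "d5"], ["d1", "d3", "d5", "d9"]))
# -> (0.6, 0.75, 0.666...)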
9. IR RELEVANCE
Relevance denotes how well a retrieved document
or set of documents meets the information need of
the user.
Relevance may include concerns such as
timeliness, authority or novelty of the result.
Relevance is assessed relative to the user need,
not the query
E.g., Information need: My swimming pool is
becoming black and needs to be cleaned.
Query: pool cleaner
10. DIFFICULTIES IN EVALUATING IR SYSTEMS
Effectiveness is related to the relevancy of
retrieved items.
Relevancy is not typically binary but
continuous.
Even if relevancy is binary, it can be a
difficult judgment to make.
Relevancy, from a human standpoint, is:
Subjective: Depends upon a specific user’s
judgment.
Situational: Relates to user’s current needs.
Cognitive: Depends on human perception and
behavior.
Dynamic: Changes over time.
11. IR JUDGEMENT
Evaluating the performance of information retrieval
systems usually takes a lot of human effort.
Relevance judgement is a laborious task when we
have a large set of retrieved documents.
One of the most interesting evaluation techniques
used in TREC is the pooling method employed to
deal with relevance judgements, so as to reduce
human efforts.
In TREC, each participating system reports the
1000 top-ranked documents for each topic. Of
these, only the top 100 from each system are
collected into a pool for human assessment.
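A rough sketch of the pooling idea in Python, assuming each run is a ranked list of document IDs (the system names and IDs below are hypothetical):

def build_pool(runs, depth=100):
    # Union of the top-`depth` documents from each system's ranking for one topic;
    # only the pooled documents are then judged by human assessors.
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Hypothetical runs, truncated for brevity.
runs = {
    "systemA": ["d12", "d7", "d3", "d44"],
    "systemB": ["d7", "d99", "d12", "d5"],
}
print(sorted(build_pool(runs, depth=3)))  # documents sent for human assessment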
12. PRECISION@K
Set a rank threshold K
Compute % relevant in top K
Ignores documents ranked lower than K
Ex (a ranking with relevant documents at ranks 1, 3, and 5):
Prec@3 of 2/3
Prec@4 of 2/4
Prec@5 of 3/5
In similar fashion we have Recall@K
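A minimal Precision@K sketch over a ranked list of binary relevance labels (1 = relevant, 0 = not); the ranking below reproduces the numbers in the example above.

def precision_at_k(relevance, k):
    # Fraction of the top-k ranked documents that are relevant.
    return sum(relevance[:k]) / k

ranking = [1, 0, 1, 0, 1]  # relevant documents at ranks 1, 3, and 5
print(precision_at_k(ranking, 3))  # 2/3
print(precision_at_k(ranking, 4))  # 2/4
print(precision_at_k(ranking, 5))  # 3/5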
13. MEAN AVERAGE PRECISION
Consider rank position of each relevant doc
K1, K2, … KR
Compute Precision@K for each K1, K2, … KR
Average precision = average of P@K
Ex: Avg precision = 1/3 (1/1 + 2/3 + 3/5) ≈ 0.76
MAP score is Average Precision averaged across
multiple queries/rankings.
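Average precision and MAP can be sketched in Python as below, following the slide's definition (averaging P@K at the rank of each retrieved relevant document); the rankings are invented, and the first one matches the example above.

def average_precision(relevance):
    # Mean of Precision@K taken at the rank K of each relevant document.
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    # MAP: average precision averaged over the rankings of several queries.
    return sum(average_precision(r) for r in rankings) / len(rankings)

print(average_precision([1, 0, 1, 0, 1]))  # 1/3 (1/1 + 2/3 + 3/5) ≈ 0.756
print(mean_average_precision([[1, 0, 1, 0, 1], [0, 1, 1, 0, 0]]))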
14. DISCOUNTED CUMULATIVE GAIN
Popular measure for evaluating web search and
related tasks
Two assumptions:
1. Highly relevant documents are more useful than
marginally relevant documents
2. The lower the ranked position of a relevant
document, the less useful it is for the user, since it
is less likely to be examined
15. DISCOUNTED CUMULATIVE GAIN
Uses graded relevance as a measure of usefulness, or
gain, from examining a document
Gain is accumulated starting at the top of the ranking
and may be reduced, or discounted, at lower ranks
Typical discount is 1/log(rank)
With base 2, the discount at rank 4 is 1/2, and at rank 8
it is 1/3
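A small DCG sketch under the formulation on this slide: the gain at rank 1 is not discounted, and the gain at rank i > 1 is divided by log2(i). The graded relevance values are invented.

import math

def dcg(gains):
    # Discounted cumulative gain for a ranked list of graded relevance values.
    # Rank 1 is undiscounted; rank i > 1 is discounted by 1 / log2(i).
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank == 1 else gain / math.log2(rank)
    return total

# Invented graded judgments (0 = not relevant ... 3 = highly relevant).
print(dcg([3, 2, 3, 0, 1, 2]))  # ≈ 8.10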
16. HUMAN JUDGMENTS ARE
Expensive
Inconsistent between raters and over time
Decay in value as the document/query mix evolves
Not always representative of “real users”
17. IR JUDGEMENT
How fast does it index?
How fast does it search?
Does it recommend related information?
Facilities for crowdsourcing relevance judgments