Chapter Five
Retrieval Evaluation
 Evaluation of IR systems
 Relevance judgments
 Performance measures (Recall,
Precision, etc.)
Why System Evaluation?
•It provides the ability to measure the
difference between IR systems
–How well do our search engines work?
–Is system A better than B?
–Under what conditions?
•Evaluation drives research on existing IR systems and
identifies what to improve
–Identify techniques that work and do not work
–There are many retrieval models/ algorithms/ systems
• which one is the best?
–What is the best component for:
• Similarity measures (dot-product, cosine, …)
• Index term selection (stop-word removal, stemming…)
• Term weighting (TF, TF-IDF,…)
Types of Evaluation Strategies
•System-centered evaluation
–Given documents, queries, & relevance judgments
• Try several variations of the system
• Measure which system returns the “best” matching list
of documents
•User-centered evaluation
–Given several users, and at least two IR systems
• Have each user try the same task on each system
• Measure which system works the “best” for the users’
information need
• How do we measure user satisfaction? How do we know
their impression of the IR system?
Major Evaluation Criteria
What are some of the main measures for evaluating an
IR system’s performance?
• Efficiency: time, space
–Speed in terms of retrieval time and indexing time
–Speed of query processing
–The space taken by corpus vs. index
• Is there a need for compression?
–Index size: Index/corpus size ratio
• Effectiveness
–How capable is the system of retrieving relevant
documents from the collection?
–Is a system better than another one?
–User satisfaction: How “good” are the documents
that are returned as a response to user query?
–“Relevance” of results to meet information need of
users
Difficulties in Evaluating IR System
 IR systems essentially facilitate communication
between a user and document collections
 Relevance is a measure of the effectiveness of
communication
 Effectiveness is related to the relevancy of
retrieved items.
 Relevance: relates information need (query)
and a document or surrogate
 Relevancy is not typically binary but continuous.
 Even if relevancy is binary, it is a difficult
judgment to make.
 Relevance is the degree of a correspondence
existing between a document and a query as
determined by requester / information
specialist/ external judge / other users
Difficulties in Evaluating IR System
 Relevance judgments are made by
 The user who posed the retrieval problem
 An external judge, information specialist, or
system developer
 Are the relevance judgments made by users,
information specialists, and external judges the
same? Why?
 Relevance judgment is usually:
 Subjective: Depends upon a specific user’s
judgment.
 Situational: Relates to user’s current needs.
 Cognitive: Depends on human perception and
behavior.
 Dynamic: Changes over time.
Retrieval scenario
[Figure: six retrieval scenarios, A–F, each showing the 13 results
returned by a different system for the same query; relevant documents
are marked.]
•Which scenario is best when 13 results are retrieved by different
systems for a given query?
Measuring Retrieval Effectiveness
• Retrieval of documents may result in:
–False negative (false drop): some relevant documents
may not be retrieved.
–False positive: some irrelevant documents may be
retrieved.
–For many applications a good index should not permit
any false drops, but may permit a few false positives.
–False negatives are also called “Type II errors” or
“errors of omission”; false positives are also called
“Type I errors” or “errors of commission”.
 Metrics used to evaluate the effectiveness of the
system are based on the contingency table:

                relevant    irrelevant
retrieved          A            B
not retrieved      C            D
Relevant performance metrics
•Recall: The ability of the search to find all of the relevant
items in the corpus
– Recall is the percentage of relevant documents
retrieved from the database in response to a user’s
query.
= No. of relevant documents retrieved
Total no. of relevant documents in database
•Precision: The ability to retrieve top-ranked documents that
are mostly relevant.
–Precision is the percentage of retrieved documents
that are relevant to the query.
= No. of relevant documents retrieved
Total number of documents retrieved
Measuring Retrieval Effectiveness
• When is precision important? When is recall important?
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

                Relevant    Not relevant
Retrieved          A             B
Not retrieved      C             D

Collection size = A + B + C + D
Relevant = A + C
Retrieved = A + B
Relevant ∩ Retrieved = A

So Recall = A / (A + C) and Precision = A / (A + B).
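The following is a minimal Python sketch (not from the slides) of these
two formulas; the document IDs are made up for illustration:

```python
# Minimal sketch: recall and precision from sets of document IDs.
# The example IDs are hypothetical.

def recall(relevant: set, retrieved: set) -> float:
    """|Relevant ∩ Retrieved| / |Relevant|."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

def precision(relevant: set, retrieved: set) -> float:
    """|Relevant ∩ Retrieved| / |Retrieved|."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

relevant = {1, 3, 5, 7}       # documents judged relevant for the query
retrieved = {1, 2, 3, 4}      # documents returned by the system

print(recall(relevant, retrieved))     # 2/4 = 0.5
print(precision(relevant, retrieved))  # 2/4 = 0.5
```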
Example 1
[Figure: a ranked list of 20 retrieved documents; relevant documents
are marked.]
Assume there are 14 relevant documents in the corpus. Compute
precision and recall at the given cutoff points:

Hits 1-10
  Precision: 1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
  Recall:    1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14

Hits 11-20
  Precision: 5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
  Recall:    5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14
Example 2
•Let the total number of relevant documents be 6. Compute recall
and precision at each cutoff point n (a code sketch follows the
table):

n   doc #  relevant  Recall  Precision
1   588    x         0.17    1
2   589    x         0.33    1
3   576              0.33    0.666667
4   590    x         0.5     0.75
5   986              0.5     0.6
6   592    x         0.67    0.667
7   984              0.67    0.571429
8   988              0.67    0.5
9   578              0.67    0.444444
10  985              0.67    0.4
11  103              0.67    0.363636
12  591              0.67    0.333333
13  772    x         0.83    0.384615
14  990              0.83    0.357143
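A minimal Python sketch of the cutoff computation in Example 2; the
relevance flags mirror the table, and the total of 6 relevant documents
is taken from the example:

```python
# Sketch: precision and recall at each cutoff n for a ranked list.
# rel_flags[i] is True if the document at rank i+1 is relevant (Example 2).
rel_flags = [True, True, False, True, False, True, False,
             False, False, False, False, False, True, False]
TOTAL_RELEVANT = 6  # relevant documents in the whole collection

hits = 0
for n, is_rel in enumerate(rel_flags, start=1):
    hits += is_rel
    precision_at_n = hits / n
    recall_at_n = hits / TOTAL_RELEVANT
    print(f"n={n:2d}  recall={recall_at_n:.2f}  precision={precision_at_n:.2f}")
```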
Graphing Precision and Recall
•Plot each (recall, precision) point on a graph. Two
ways of plotting:
– Cutoff vs. Recall/Precision graph
– Recall vs. Precision graph
•Recall and Precision are inversely related
– Recall is a non-decreasing function of the number
of documents retrieved,
– precision usually decreases (in a good system)
•The plot is usually for a single query. How do we
plot for two or more queries?
Precision/Recall tradeoff
•Can increase recall by retrieving many documents (down
to a low level of relevance ranking),
–but many irrelevant documents would be fetched,
reducing precision
•Can get high recall (but low precision) by retrieving all
documents for all queries
[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1).
The upper-right corner is the ideal: high recall and high precision.
A high-precision, low-recall system returns relevant documents but
misses many useful ones; a high-recall, low-precision system returns
most relevant documents but includes lots of junk.]
Compare Two or More Systems
• The curve closest to the upper right-hand
corner of the graph indicates the best
performance
[Figure: precision–recall curves for two systems, “Stem” and “NoStem”,
with recall (0.1–1.0) on the x-axis and precision (0–1) on the y-axis.]
Need for Interpolation
[Figure: raw precision–recall points for a single query, joined into a
sawtooth-shaped curve; recall (0–1) on the x-axis, precision (0–1) on
the y-axis.]
•Two issues:
–How do you compare performance across queries?
–Is the sawtooth shape intuitive of what’s going on?
Solution: Interpolation!
Interpolate a precision value for each standard recall level
Interpolation
•It is a general form of precision/recall calculation
•Precision change w.r.t. Recall (not a fixed point)
–It is an empirical fact that on average as recall
increases, precision decreases
•Interpolate precision at 11 standard recall levels:
–rj {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
where j = 0 …. 10
•The interpolated precision at the j-th standard recall
level is the maximum known precision at any recall level
between the jth and (j + 1)th level:
P(r_j) = max { P(r) : r_j ≤ r ≤ r_(j+1) }
Result of Interpolation
[Figure: the interpolated precision values plotted against the standard
recall levels (0–0.5 on the x-axis), with precision (0–1) on the y-axis.]
Calculating Precision at Standard Recall Levels
Assume that there are a total of 10 relevant documents (a code sketch
for the interpolation follows the table):
Ranking Relevant Recall Precision
1. Doc. 50 Rel 10% 100%
2. Doc. 34 Not rel ? ?
3. Doc. 45 Rel 20% 67%
4. Doc. 8 Not rel ? ?
5. Doc. 23 Not rel ? ?
6. Doc. 16 Rel 30% 50%
7. Doc. 63 Not rel ? ?
8. Doc. 119 Rel 40% 50%
9. Doc. 10 Not rel ? ?
10. Doc. 2 Not rel ? ?
11. Doc. 9 Rel 50% 45%
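A minimal Python sketch of interpolated precision, assuming the
relevance pattern from the table above. It uses the common TREC-style
convention (maximum precision at any recall at or above each standard
level), which is slightly broader than the interval definition given
earlier:

```python
# Sketch: 11-point interpolated precision from a ranked list of relevance flags.
# Relevance flags follow the table above; total relevant in the collection = 10.
rel_flags = [True, False, True, False, False, True, False, True, False, False, True]
TOTAL_RELEVANT = 10

# (recall, precision) observed at each rank
points = []
hits = 0
for n, is_rel in enumerate(rel_flags, start=1):
    hits += is_rel
    points.append((hits / TOTAL_RELEVANT, hits / n))

# Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0:
# maximum precision at any observed recall >= the standard level.
for j in range(11):
    level = j / 10
    candidates = [p for r, p in points if r >= level]
    interp = max(candidates) if candidates else 0.0
    print(f"recall={level:.1f}  interpolated precision={interp:.2f}")
```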
Interpolating across queries
• For each query calculate precision at 11 standard
recall levels
• Compute average precision at each standard recall
level across all queries.
• Plot average precision/recall curves to evaluate
overall system performance on a document/query
corpus.
• Average precision at seen relevant documents
–Typically average performance over a large set of
queries.
–Favors systems which produce relevant documents
high in rankings
Single-valued measures
• Single value measures: may want a single value for each
query to evaluate performance
• Such single valued measures include:
– Average precision: calculated by averaging the precision
values at the ranks where recall increases (i.e. where a
relevant document is retrieved).
– Mean average precision
– R-precision, etc.
Average precision
• Average precision at each retrieved relevant document
–Relevant documents not retrieved contribute zero to
score
• Example: Assume a total of 14 relevant documents and compute the
average precision of the ranking below (a code sketch follows).
[Figure: a ranked list of 20 retrieved documents; relevant documents
are marked.]

Precision, hits 1-10:  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
Precision, hits 11-20: 5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20

AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 ≈ 0.231
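A minimal Python sketch of the average-precision calculation above; the
relevant ranks (1, 5, 6, 8, 11, 16) are read off the precision values,
and relevant documents that were never retrieved contribute zero because
the sum is divided by 14:

```python
# Sketch: (non-interpolated) average precision for one ranked list.
relevant_ranks = [1, 5, 6, 8, 11, 16]   # ranks of retrieved relevant documents
TOTAL_RELEVANT = 14                     # relevant documents in the collection

# precision at each relevant rank: (number of relevant seen so far) / rank
precisions = [(i + 1) / rank for i, rank in enumerate(relevant_ranks)]

# dividing by TOTAL_RELEVANT (not by len(relevant_ranks)) makes the
# 8 relevant documents that were never retrieved contribute zero
ap = sum(precisions) / TOTAL_RELEVANT
print(round(ap, 3))   # 0.231
```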
MAP (Mean Average Precision)
 rij = rank of the j-th relevant
document for Qi
 |Ri| = #rel. doc. for Qi
 n = # test queries
MAP = (1/n) Σ_i [ (1/|R_i|) Σ_{D_ij ∈ R_i} ( j / r_ij ) ]
Example (using the ranks of the relevant documents in the table below):
MAP = 1/2 × [ 1/3 × (1/1 + 2/5 + 3/10) + 1/2 × (1/4 + 2/8) ] ≈ 0.41
MAP
• Computing mean average precision for more than one query
(a code sketch follows):

Rank of relevant doc.   Query 1   Query 2
1st rel. doc.              1         4
2nd rel. doc.              5         8
3rd rel. doc.             10         -

• E.g. Assume that for query 1 and 2, there are 3 and 2 relevant
documents in the collection, respectively.
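A minimal Python sketch of the MAP calculation for the two queries
above; the rank lists and the relevant-document counts are taken from
the table:

```python
# Sketch: mean average precision (MAP) over several queries.
# Each entry: (ranks of retrieved relevant docs, total relevant docs in collection).
queries = [
    ([1, 5, 10], 3),   # query 1
    ([4, 8], 2),       # query 2
]

average_precisions = []
for ranks, total_relevant in queries:
    # the j-th relevant document, retrieved at rank r_ij, contributes j / r_ij
    ap = sum(j / r for j, r in enumerate(ranks, start=1)) / total_relevant
    average_precisions.append(ap)

map_score = sum(average_precisions) / len(average_precisions)
print(round(map_score, 3))   # ≈ 0.408
```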
R-Precision
 Precision at the R-th position in the ranking of
results for a query, where R is the total number
of relevant documents.
 Calculate precision after the first R retrieved documents are seen
(a code sketch follows the example below)
 Can be averaged over all queries
n doc # relevant
1 588 x
2 589 x
3 576
4 590 x
5 986
6 592 x
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x
14 990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
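A minimal Python sketch of R-precision, reusing the relevance flags
from the ranking above:

```python
# Sketch: R-precision = precision after the first R retrieved documents,
# where R is the total number of relevant documents for the query.
rel_flags = [True, True, False, True, False, True, False,
             False, False, False, False, False, True, False]
R = 6  # total relevant documents in the collection

r_precision = sum(rel_flags[:R]) / R
print(round(r_precision, 2))   # 4/6 ≈ 0.67
```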
More Example
[Figure: two example rankings, Ranking #1 and Ranking #2, of retrieved
documents for a query with 5 relevant documents.]
 Average precision is calculated by averaging precision at the points
where recall increases. Hence
 the average precision at these recall points is 62.2% for Ranking #1
and 52.0% for Ranking #2. Thus, using this measure, we can say that
Ranking #1 is better than Ranking #2.
• Mean Average Precision (MAP): Often we have a number
of queries to evaluate for a given system.
• For each query, we can calculate average precision, and if
we take average of those averages for a given system, it
gives us Mean Average Precision (MAP), which is a very
popular measure to compare two systems.
• R-precision: It is defined as precision after R documents
retrieved, where R is the total number of relevant
documents for a given query.
• Average precision and R-precision are shown to be highly
correlated.
• In the previous example, since the number of relevant
documents (R) is 5, R-precision for both the rankings is
0.4 (value of precision after 5 documents retrieved).
F-Measure
• One measure of performance that takes into account
both recall and precision.
• Harmonic mean of recall and precision:
• Compared to arithmetic mean, both need to be high
for harmonic mean to be high.
• What if no relevant documents exist?
F = 2PR / (P + R) = 2 / (1/R + 1/P)
E-Measure
•Associated with Van Rijsbergen
•Allows user to specify importance of recall and precision
•It is a parameterized F-measure: a variant of the F-measure that
allows weighting the emphasis on precision versus recall (a code
sketch follows the formula):
•The value of β controls the trade-off:
–β = 1: Equal weight for precision and recall (E = F).
–β > 1: Weights recall more (emphasizes recall).
–β < 1: Weights precision more (emphasizes precision).
E = (1 + β²) PR / (β²P + R) = (1 + β²) / (β²/R + 1/P)
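A minimal Python sketch of the F-measure and the parameterized E/F_β
measure above; the β value and the example precision/recall numbers are
illustrative:

```python
# Sketch: harmonic-mean F-measure and the parameterized E / F_beta measure.

def f_measure(precision: float, recall: float) -> float:
    """F = 2PR / (P + R); defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def e_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """E = (1 + beta^2) P R / (beta^2 P + R); beta > 1 emphasizes recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

print(f_measure(0.5, 0.4))              # 0.444...
print(e_measure(0.5, 0.4, beta=2.0))    # beta > 1 weights recall more
```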
Problems with both precision and recall
• Number of irrelevant documents in the collection is not taken
into account.
• Recall is undefined when there is no relevant document in the
collection.
• Precision is undefined when no document is retrieved.
Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence/Miss = non-retrieved relevant docs / relevant docs
– Noise = 1 – Precision; Silence = 1 – Recall
Miss = |{Relevant} ∩ {NotRetrieved}| / |{Relevant}|

Fallout = |{Retrieved} ∩ {NotRelevant}| / |{NotRelevant}|
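A minimal Python sketch of noise, silence (miss), and fallout; the
collection and document-ID sets are hypothetical:

```python
# Sketch: noise, silence (miss), and fallout from document-ID sets.
# collection, relevant, and retrieved are hypothetical example sets.
collection = set(range(1, 21))
relevant = {1, 3, 5, 7}
retrieved = {1, 2, 3, 4}
not_relevant = collection - relevant
not_retrieved = collection - retrieved

noise = len(retrieved - relevant) / len(retrieved)          # = 1 - precision
silence = len(relevant & not_retrieved) / len(relevant)     # = 1 - recall (miss)
fallout = len(retrieved & not_relevant) / len(not_relevant)
print(noise, silence, fallout)   # 0.5 0.5 0.125
```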
Thank you