Chapter Five
Retrieval Evaluation
 Evaluation of IR systems
 Relevance judgments
 Performance measures (Recall,
Precision, etc.)
Why System Evaluation?
•It provides the ability to measure the
difference between IR systems
–How well do our search engines work?
–Is system A better than B?
–Under what conditions?
•Evaluation drives research on existing IR systems and
identifies what to improve
–Identify techniques that work and do not work
–There are many retrieval models/ algorithms/ systems
• which one is the best?
–What is the best component for:
• Similarity measures (dot-product, cosine, …)
• Index term selection (stop-word removal, stemming…)
• Term weighting (TF, TF-IDF,…)
Types of Evaluation Strategies
•System-centered evaluation
–Given documents, queries, & relevance judgments
• Try several variations of the system
• Measure which system returns the “best” matching list
of documents
•User-centered evaluation
–Given several users, and at least two IR systems
• Have each user try the same task on each system
• Measure which system works the “best” for the users’
information need
• How do we measure user satisfaction? How do we know
their impression of the IR system?
Major Evaluation Criteria
What are some of the main measures for evaluating an
IR system’s performance?
• Efficiency: time, space
–Speed in terms of retrieval time and indexing time
–Speed of query processing
–The space taken by corpus vs. index
• Is there a need for compression?
–Index size: Index/corpus size ratio
• Effectiveness
–How capable is the system of retrieving relevant
documents from the collection?
–Is a system better than another one?
–User satisfaction: How “good” are the documents
that are returned as a response to user query?
–“Relevance” of results to meet information need of
users
Difficulties in Evaluating IR System
 IR systems essentially facilitate communication
between a user and document collections
 Relevance is a measure of the effectiveness of
communication
 Effectiveness is related to the relevancy of
retrieved items.
 Relevance: relates information need (query)
and a document or surrogate
 Relevancy is not typically binary but continuous.
 Even if relevancy is binary, it is a difficult
judgment to make.
 Relevance is the degree of a correspondence
existing between a document and a query as
determined by requester / information
specialist/ external judge / other users
Difficulties in Evaluating IR System
 Relevance judgments are made by
 The user who posed the retrieval problem
 An external judge, information specialist, or
system developer
 Are the relevance judgments made by users,
information specialists, and external judges the
same? Why?
 Relevance judgment is usually:
 Subjective: Depends upon a specific user’s
judgment.
 Situational: Relates to user’s current needs.
 Cognitive: Depends on human perception and
behavior.
 Dynamic: Changes over time.
Retrieval scenario
[Figure: six retrieval scenarios, A–F, each showing the 13 results
returned by a different system for the same query; relevant documents
are marked.]
•Which scenario is best when 13 results are retrieved by different
systems for a given query?
Measuring Retrieval Effectiveness
• Retrieval of documents may result in:
–False negative (false drop): some relevant documents
may not be retrieved.
–False positive: some irrelevant documents may be
retrieved.
–For many applications a good index should not permit
any false drops, but may permit a few false positives.
–False negatives are also called “Type II errors” or
“errors of omission”; false positives are also called
“Type I errors” or “errors of commission”.
 Metrics used to evaluate the effectiveness of the
system are based on the contingency table:

                relevant    irrelevant
retrieved          A            B
not retrieved      C            D
Relevant performance metrics
•Recall: The ability of the search to find all of the relevant
items in the corpus
– Recall is the percentage of relevant documents
retrieved from the database in response to a user’s
query.
= No. of relevant documents retrieved
Total no. of relevant documents in database
•Precision: The ability to retrieve top-ranked documents that
are mostly relevant.
–Precision is the percentage of retrieved documents
that are relevant to the query.
= No. of relevant documents retrieved
Total number of documents retrieved
Measuring Retrieval Effectiveness
• When is precision important? When is recall important?
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

                Relevant    Not relevant
Retrieved          A             B
Not retrieved      C             D

Collection size = A + B + C + D
Relevant = A + C
Retrieved = A + B
Relevant ∩ Retrieved = A

So Recall = A / (A + C) and Precision = A / (A + B).
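The following is a minimal Python sketch (not from the slides) of these
two formulas; the document IDs are made up for illustration:

```python
# Minimal sketch: recall and precision from sets of document IDs.
# The example IDs are hypothetical.

def recall(relevant: set, retrieved: set) -> float:
    """|Relevant ∩ Retrieved| / |Relevant|."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

def precision(relevant: set, retrieved: set) -> float:
    """|Relevant ∩ Retrieved| / |Retrieved|."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

relevant = {1, 3, 5, 7}       # documents judged relevant for the query
retrieved = {1, 2, 3, 4}      # documents returned by the system

print(recall(relevant, retrieved))     # 2/4 = 0.5
print(precision(relevant, retrieved))  # 2/4 = 0.5
```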
Example 1
[Figure: a ranked list of 20 retrieved documents; relevant documents
are marked.]
Assume there are 14 relevant documents in the corpus. Compute
precision and recall at the given cutoff points:

Hits 1-10
  Precision: 1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
  Recall:    1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14

Hits 11-20
  Precision: 5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
  Recall:    5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14
Example 2
•Let the total number of relevant documents be 6. Compute recall
and precision at each cutoff point n (a code sketch follows the
table):

n   doc #  relevant  Recall  Precision
1   588    x         0.17    1
2   589    x         0.33    1
3   576              0.33    0.666667
4   590    x         0.5     0.75
5   986              0.5     0.6
6   592    x         0.67    0.667
7   984              0.67    0.571429
8   988              0.67    0.5
9   578              0.67    0.444444
10  985              0.67    0.4
11  103              0.67    0.363636
12  591              0.67    0.333333
13  772    x         0.83    0.384615
14  990              0.83    0.357143
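A minimal Python sketch of the cutoff computation in Example 2; the
relevance flags mirror the table, and the total of 6 relevant documents
is taken from the example:

```python
# Sketch: precision and recall at each cutoff n for a ranked list.
# rel_flags[i] is True if the document at rank i+1 is relevant (Example 2).
rel_flags = [True, True, False, True, False, True, False,
             False, False, False, False, False, True, False]
TOTAL_RELEVANT = 6  # relevant documents in the whole collection

hits = 0
for n, is_rel in enumerate(rel_flags, start=1):
    hits += is_rel
    precision_at_n = hits / n
    recall_at_n = hits / TOTAL_RELEVANT
    print(f"n={n:2d}  recall={recall_at_n:.2f}  precision={precision_at_n:.2f}")
```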
Graphing Precision and Recall
•Plot each (recall, precision) point on a graph. Two
ways of plotting:
– Cutoff vs. Recall/Precision graph
– Recall vs. Precision graph
•Recall and Precision are inversely related
– Recall is a non-decreasing function of the number
of documents retrieved,
– precision usually decreases (in a good system)
•The plot is usually for a single query. How do we
plot for two or more queries?
Precision/Recall tradeoff
•Can increase recall by retrieving many documents (down
to a low level of relevance ranking),
–but many irrelevant documents would be fetched,
reducing precision
•Can get high recall (but low precision) by retrieving all
documents for all queries
[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1).
The upper-right corner is the ideal: high recall and high precision.
A high-precision, low-recall system returns relevant documents but
misses many useful ones; a high-recall, low-precision system returns
most relevant documents but includes lots of junk.]
Compare Two or More Systems
• The curve closest to the upper right-hand
corner of the graph indicates the best
performance
[Figure: precision–recall curves for two systems, “Stem” and “NoStem”,
with recall (0.1–1.0) on the x-axis and precision (0–1) on the y-axis.]
Need for Interpolation
[Figure: raw precision–recall points for a single query, joined into a
sawtooth-shaped curve; recall (0–1) on the x-axis, precision (0–1) on
the y-axis.]
•Two issues:
–How do you compare performance across queries?
–Is the sawtooth shape intuitive of what’s going on?
Solution: Interpolation!
Interpolate a precision value for each standard recall level
Interpolation
•It is a general form of precision/recall calculation
•Precision change w.r.t. Recall (not a fixed point)
–It is an empirical fact that on average as recall
increases, precision decreases
•Interpolate precision at 11 standard recall levels:
–rj {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
where j = 0 …. 10
•The interpolated precision at the j-th standard recall
level is the maximum known precision at any recall level
between the jth and (j + 1)th level:
P(r_j) = max { P(r) : r_j ≤ r ≤ r_(j+1) }
Result of Interpolation
[Figure: the interpolated precision values plotted against the standard
recall levels (0–0.5 on the x-axis), with precision (0–1) on the y-axis.]
Calculating Precision at Standard Recall Levels
Assume that there are a total of 10 relevant documents (a code sketch
for the interpolation follows the table):
Ranking Relevant Recall Precision
1. Doc. 50 Rel 10% 100%
2. Doc. 34 Not rel ? ?
3. Doc. 45 Rel 20% 67%
4. Doc. 8 Not rel ? ?
5. Doc. 23 Not rel ? ?
6. Doc. 16 Rel 30% 50%
7. Doc. 63 Not rel ? ?
8. Doc. 119 Rel 40% 50%
9. Doc. 10 Not rel ? ?
10. Doc. 2 Not rel ? ?
11. Doc. 9 Rel 50% 45%
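A minimal Python sketch of interpolated precision, assuming the
relevance pattern from the table above. It uses the common TREC-style
convention (maximum precision at any recall at or above each standard
level), which is slightly broader than the interval definition given
earlier:

```python
# Sketch: 11-point interpolated precision from a ranked list of relevance flags.
# Relevance flags follow the table above; total relevant in the collection = 10.
rel_flags = [True, False, True, False, False, True, False, True, False, False, True]
TOTAL_RELEVANT = 10

# (recall, precision) observed at each rank
points = []
hits = 0
for n, is_rel in enumerate(rel_flags, start=1):
    hits += is_rel
    points.append((hits / TOTAL_RELEVANT, hits / n))

# Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0:
# maximum precision at any observed recall >= the standard level.
for j in range(11):
    level = j / 10
    candidates = [p for r, p in points if r >= level]
    interp = max(candidates) if candidates else 0.0
    print(f"recall={level:.1f}  interpolated precision={interp:.2f}")
```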
Interpolating across queries
• For each query calculate precision at 11 standard
recall levels
• Compute average precision at each standard recall
level across all queries.
• Plot average precision/recall curves to evaluate
overall system performance on a document/query
corpus.
• Average precision at seen relevant documents
–Typically average performance over a large set of
queries.
–Favors systems which produce relevant documents
high in rankings
Single-valued measures
• Single value measures: may want a single value for each
query to evaluate performance
• Such single valued measures include:
– Average precision: calculated by averaging the precision
values at the ranks where recall increases (i.e. where a
relevant document is retrieved).
– Mean average precision
– R-precision, etc.
Average precision
• Average precision at each retrieved relevant document
–Relevant documents not retrieved contribute zero to
score
• Example: Assume a total of 14 relevant documents and compute the
average precision of the ranking below (a code sketch follows).
[Figure: a ranked list of 20 retrieved documents; relevant documents
are marked.]

Precision, hits 1-10:  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
Precision, hits 11-20: 5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20

AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 ≈ 0.231
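A minimal Python sketch of the average-precision calculation above; the
relevant ranks (1, 5, 6, 8, 11, 16) are read off the precision values,
and relevant documents that were never retrieved contribute zero because
the sum is divided by 14:

```python
# Sketch: (non-interpolated) average precision for one ranked list.
relevant_ranks = [1, 5, 6, 8, 11, 16]   # ranks of retrieved relevant documents
TOTAL_RELEVANT = 14                     # relevant documents in the collection

# precision at each relevant rank: (number of relevant seen so far) / rank
precisions = [(i + 1) / rank for i, rank in enumerate(relevant_ranks)]

# dividing by TOTAL_RELEVANT (not by len(relevant_ranks)) makes the
# 8 relevant documents that were never retrieved contribute zero
ap = sum(precisions) / TOTAL_RELEVANT
print(round(ap, 3))   # 0.231
```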
MAP (Mean Average Precision)
 rij = rank of the j-th relevant
document for Qi
 |Ri| = #rel. doc. for Qi
 n = # test queries
MAP = (1/n) Σ_i [ (1/|R_i|) Σ_{D_ij ∈ R_i} ( j / r_ij ) ]
Example (using the ranks of the relevant documents in the table below):
MAP = 1/2 × [ 1/3 × (1/1 + 2/5 + 3/10) + 1/2 × (1/4 + 2/8) ] ≈ 0.41
MAP
• Computing mean average precision for more than one query
(a code sketch follows):

Rank of relevant doc.   Query 1   Query 2
1st rel. doc.              1         4
2nd rel. doc.              5         8
3rd rel. doc.             10         -

• E.g. Assume that for query 1 and 2, there are 3 and 2 relevant
documents in the collection, respectively.
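A minimal Python sketch of the MAP calculation for the two queries
above; the rank lists and the relevant-document counts are taken from
the table:

```python
# Sketch: mean average precision (MAP) over several queries.
# Each entry: (ranks of retrieved relevant docs, total relevant docs in collection).
queries = [
    ([1, 5, 10], 3),   # query 1
    ([4, 8], 2),       # query 2
]

average_precisions = []
for ranks, total_relevant in queries:
    # the j-th relevant document, retrieved at rank r_ij, contributes j / r_ij
    ap = sum(j / r for j, r in enumerate(ranks, start=1)) / total_relevant
    average_precisions.append(ap)

map_score = sum(average_precisions) / len(average_precisions)
print(round(map_score, 3))   # ≈ 0.408
```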
R-Precision
 Precision at the R-th position in the ranking of
results for a query, where R is the total number
of relevant documents.
 Calculate precision after the first R retrieved documents are seen
(a code sketch follows the example below)
 Can be averaged over all queries
n doc # relevant
1 588 x
2 589 x
3 576
4 590 x
5 986
6 592 x
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x
14 990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
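A minimal Python sketch of R-precision, reusing the relevance flags
from the ranking above:

```python
# Sketch: R-precision = precision after the first R retrieved documents,
# where R is the total number of relevant documents for the query.
rel_flags = [True, True, False, True, False, True, False,
             False, False, False, False, False, True, False]
R = 6  # total relevant documents in the collection

r_precision = sum(rel_flags[:R]) / R
print(round(r_precision, 2))   # 4/6 ≈ 0.67
```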
More Example
[Figure: two example rankings, Ranking #1 and Ranking #2, of retrieved
documents for a query with 5 relevant documents.]
 Average precision is calculated by averaging precision at the points
where recall increases. Hence
 the average precision at these recall points is 62.2% for Ranking #1
and 52.0% for Ranking #2. Thus, using this measure, we can say that
Ranking #1 is better than Ranking #2.
• Mean Average Precision (MAP): Often we have a number
of queries to evaluate for a given system.
• For each query, we can calculate average precision, and if
we take average of those averages for a given system, it
gives us Mean Average Precision (MAP), which is a very
popular measure to compare two systems.
• R-precision: It is defined as precision after R documents
retrieved, where R is the total number of relevant
documents for a given query.
• Average precision and R-precision are shown to be highly
correlated.
• In the previous example, since the number of relevant
documents (R) is 5, R-precision for both the rankings is
0.4 (value of precision after 5 documents retrieved).
F-Measure
• One measure of performance that takes into account
both recall and precision.
• Harmonic mean of recall and precision:
• Compared to arithmetic mean, both need to be high
for harmonic mean to be high.
• What if no relevant documents exist?
F = 2PR / (P + R) = 2 / (1/R + 1/P)
E-Measure
•Associated with Van Rijsbergen
•Allows user to specify importance of recall and precision
•It is a parameterized F-measure: a variant of the F-measure that
allows weighting the emphasis on precision versus recall (a code
sketch follows the formula):
•The value of β controls the trade-off:
–β = 1: Equal weight for precision and recall (E = F).
–β > 1: Weights recall more (emphasizes recall).
–β < 1: Weights precision more (emphasizes precision).
E = (1 + β²) PR / (β²P + R) = (1 + β²) / (β²/R + 1/P)
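A minimal Python sketch of the F-measure and the parameterized E/F_β
measure above; the β value and the example precision/recall numbers are
illustrative:

```python
# Sketch: harmonic-mean F-measure and the parameterized E / F_beta measure.

def f_measure(precision: float, recall: float) -> float:
    """F = 2PR / (P + R); defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def e_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """E = (1 + beta^2) P R / (beta^2 P + R); beta > 1 emphasizes recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

print(f_measure(0.5, 0.4))              # 0.444...
print(e_measure(0.5, 0.4, beta=2.0))    # beta > 1 weights recall more
```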
Problems with both precision and recall
• Number of irrelevant documents in the collection is not taken
into account.
• Recall is undefined when there is no relevant document in the
collection.
• Precision is undefined when no document is retrieved.
Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence/Miss = non-retrieved relevant docs / relevant docs
– Noise = 1 – Precision; Silence = 1 – Recall
Miss = |{Relevant} ∩ {NotRetrieved}| / |{Relevant}|

Fallout = |{Retrieved} ∩ {NotRelevant}| / |{NotRelevant}|
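A minimal Python sketch of noise, silence (miss), and fallout; the
collection and document-ID sets are hypothetical:

```python
# Sketch: noise, silence (miss), and fallout from document-ID sets.
# collection, relevant, and retrieved are hypothetical example sets.
collection = set(range(1, 21))
relevant = {1, 3, 5, 7}
retrieved = {1, 2, 3, 4}
not_relevant = collection - relevant
not_retrieved = collection - retrieved

noise = len(retrieved - relevant) / len(retrieved)          # = 1 - precision
silence = len(relevant & not_retrieved) / len(relevant)     # = 1 - recall (miss)
fallout = len(retrieved & not_relevant) / len(not_relevant)
print(noise, silence, fallout)   # 0.5 0.5 0.125
```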
Thank you