2. Why System Evaluation?
Evaluation provides the ability to measure the difference between IR systems:
How well do our search engines work?
Is system A better than system B?
Under what conditions?
Evaluation drives what to research and what to improve in existing IR systems:
Identify techniques that work and those that do not
There are many retrieval models/algorithms/systems; which one is the best?
What is the best component for:
Similarity measures (dot product, cosine, …)
Index term selection (stop-word removal, stemming, …)
Term weighting (TF, TF-IDF, …)
3. Types of Evaluation Strategies
System-centered evaluation
Given documents, queries, & relevance judgments
Try several variations of the system
Measure which system returns the “best” matching list of
documents
User-centered evaluation
Given several users, and at least two IR systems
Have each user try the same task on each system
Measure which system works “best” for the users' information needs
How do we measure user satisfaction? How do we know users' impressions of the IR system?
4. Major Evaluation Criteria
What are some of the main measures for evaluating
an IR system’s performance?
Efficiency: time, space
Speed in terms of retrieval time and indexing time
Speed of query processing
The space taken by corpus vs. index
Is there a need for compression?
Index size: Index/corpus size ratio
Effectiveness
How well can the system retrieve relevant documents from the collection?
Is one system better than another?
User satisfaction: how “good” are the documents returned in response to a user's query?
“Relevance” of the results to the information need of the user
5. Difficulties in Evaluating IR Systems
IR systems essentially facilitate communication
between a user and document collections
Relevance is a measure of the effectiveness of
communication
Effectiveness is related to the relevancy of retrieved items.
Relevance: relates information need (query) and a
document or surrogate
Relevancy is not typically binary but continuous.
Even if relevancy is binary, it is a difficult judgment to make.
Relevance is the degree of correspondence between a document and a query, as determined by the requester, an information specialist, an external judge, or other users
6. Difficulties in Evaluating IR Systems
Relevance judgments are made by:
The user who posed the retrieval problem
An external judge, an information specialist, or the system developer
Are the relevance judgments made by users, information specialists, and external judges the same? Why or why not?
Relevance judgment is usually:
Subjective: Depends upon a specific user’s judgment.
Situational: Relates to user’s current needs.
Cognitive: Depends on human perception and
behavior.
Dynamic: Changes over time.
7. Retrieval Scenario
[Figure: six ranked result lists (A-F) showing the 13 results retrieved by different systems for a given query; the marked boxes indicate the relevant documents.]
8. Measuring Retrieval Effectiveness
Retrieval of documents may result in:
False negative (false drop): some relevant documents may
not be retrieved.
False positive: some irrelevant documents may be retrieved.
For many applications a good index should not permit any
false drops, but may permit a few false positives.
False negatives are also called “Type II errors” or “errors of omission”; false positives are also called “Type I errors” or “errors of commission”.
• The following contingency table is often used to define metrics for evaluating the effectiveness of the system:

                 relevant   irrelevant
retrieved            A           B
not retrieved        C           D
9. Relevant performance metrics
Recall: the ability of the search to find all of the relevant items in the corpus.
Recall is the percentage of relevant documents retrieved from the database in response to a user's query:
Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the database)
Precision: the ability to retrieve top-ranked documents that are mostly relevant.
Precision is the percentage of retrieved documents that are relevant to the query:
Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved)
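As a minimal illustration of these two definitions, the sketch below computes precision and recall from sets of document IDs. The slides contain no code; the function names and the document IDs are invented for illustration.

```python
# Precision and recall from sets of document IDs (illustrative sketch;
# the IDs below are hypothetical, not from the slides).

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = ["d1", "d2", "d3", "d4"]   # documents returned for a query
relevant  = ["d2", "d4", "d7"]         # relevance judgments for that query
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(recall(retrieved, relevant))     # 2/3 ≈ 0.67
```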
10. Measuring Retrieval Effectiveness
When is precision important? When is recall important?
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

                 Relevant   Not relevant
Retrieved            A           B
Not retrieved        C           D

Collection size = A + B + C + D
Relevant = A + C
Retrieved = A + B
Relevant ∩ Retrieved = A
so that Recall = A / (A + C) and Precision = A / (A + B).
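The same quantities can be computed directly from the A/B/C/D cells of the contingency table. A small sketch with made-up counts:

```python
# Precision and recall from the contingency-table counts.
# A = relevant & retrieved, B = irrelevant & retrieved,
# C = relevant & not retrieved, D = irrelevant & not retrieved.
A, B, C, D = 30, 20, 10, 940          # hypothetical counts, for illustration only

precision = A / (A + B)               # relevant share of what was retrieved
recall    = A / (A + C)               # retrieved share of what is relevant
print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.60, 0.75
```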
11. Example 1
[Figure: a ranked list of 20 hits; the marked boxes indicate relevant documents.]
Assume there are 14 relevant documents in the corpus; compute precision and recall at each cutoff point.

Hits 1-10:
Precision: 1/1  1/2  1/3  1/4  2/5  3/6  3/7  4/8  4/9  4/10
Recall:    1/14 1/14 1/14 1/14 2/14 3/14 3/14 4/14 4/14 4/14

Hits 11-20:
Precision: 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20
Recall:    5/14 5/14 5/14 5/14 5/14 6/14 6/14 6/14 6/14 6/14
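A short sketch that reproduces the precision and recall values above, assuming the relevant hits sit at ranks 1, 5, 6, 8, 11 and 16 (as read off the precision values):

```python
# Precision and recall at each cutoff for Example 1: a ranked list of 20 hits,
# with 14 relevant documents in the whole corpus.
relevant_ranks = {1, 5, 6, 8, 11, 16}
total_relevant = 14

rel_so_far = 0
for k in range(1, 21):
    if k in relevant_ranks:
        rel_so_far += 1
    print(f"cutoff {k:2d}: precision = {rel_so_far}/{k}, "
          f"recall = {rel_so_far}/{total_relevant}")
```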
12. Example 2
• Let the total number of relevant documents be 6; compute recall and precision at each cutoff point n:

 n   doc #   relevant   Recall   Precision
 1   588        x        0.17      1.00
 2   589        x        0.33      1.00
 3   576
 4   590        x        0.50      0.75
 5   986
 6   592        x        0.67      0.67
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772        x        0.83      0.38
14   990
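A small sketch that reproduces the filled-in rows of this table from the document list and relevance marks shown on the slide:

```python
# Recall and precision at each cutoff n for Example 2.
# The 14 document numbers and relevance marks are taken from the slide;
# total relevant documents in the collection = 6.
docs = [(588, True), (589, True), (576, False), (590, True), (986, False),
        (592, True), (984, False), (988, False), (578, False), (985, False),
        (103, False), (591, False), (772, True), (990, False)]
total_relevant = 6

hits = 0
for n, (doc_id, is_relevant) in enumerate(docs, start=1):
    hits += is_relevant
    if is_relevant:  # the slide only fills in rows where a relevant doc appears
        print(f"n={n:2d} doc={doc_id}  recall={hits/total_relevant:.2f}  "
              f"precision={hits/n:.2f}")
```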
13. Average Precision
Average the precision values obtained at each retrieved relevant document.
Relevant documents that are not retrieved contribute zero to the score.
Example: assume a total of 14 relevant documents and the ranked list of 20 hits whose precision values are shown below; compute the average precision.

Precision, hits 1-10:  1/1  1/2  1/3  1/4  2/5  3/6  3/7  4/8  4/9  4/10
Precision, hits 11-20: 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20

AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 ≈ 0.231
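The same computation as a sketch in Python, using the ranks of the retrieved relevant documents implied by the precision values above:

```python
# Average precision (AP): average the precision at the rank of each retrieved
# relevant document; relevant documents that are never retrieved contribute
# zero, hence the division by all 14 relevant documents.
relevant_ranks = [1, 5, 6, 8, 11, 16]   # ranks of the relevant hits in the top 20
total_relevant = 14

ap = sum((j + 1) / rank for j, rank in enumerate(relevant_ranks)) / total_relevant
print(round(ap, 3))   # 0.231
```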
14. MAP (Mean Average Precision)
• Computing the mean of the average precision over more than one query:

MAP = (1/n) · Σ_i (1/|R_i|) · Σ_j ( j / r_ij )

where r_ij = rank of the j-th relevant document for query Q_i, |R_i| = number of relevant documents for Q_i, and n = number of test queries.

• E.g. assume that for query 1 and query 2 there are 3 and 2 relevant documents in the collection, respectively, retrieved at the following ranks:

Relevant docs retrieved   Query 1   Query 2
1st rel. doc.                 1         4
2nd rel. doc.                 5         8
3rd rel. doc.                10         -

MAP = (1/2) · [ (1/3)(1/1 + 2/5 + 3/10) + (1/2)(1/4 + 2/8) ] ≈ 0.41
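A sketch of the same two-query MAP computation; the query labels and the helper function name are invented for illustration:

```python
# MAP for the two-query example: r_ij = rank of the j-th relevant document
# for query Q_i, |R_i| = number of relevant documents for Q_i.
queries = {
    "Q1": {"relevant_total": 3, "relevant_ranks": [1, 5, 10]},
    "Q2": {"relevant_total": 2, "relevant_ranks": [4, 8]},
}

def average_precision(ranks, total_relevant):
    # precision at the j-th relevant document is j / r_ij
    return sum((j + 1) / r for j, r in enumerate(ranks)) / total_relevant

map_score = sum(average_precision(q["relevant_ranks"], q["relevant_total"])
                for q in queries.values()) / len(queries)
print(round(map_score, 3))   # ≈ 0.408
```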
15. R-Precision
Precision at the R-th position in the ranking of results
for a query, where R is the total number of relevant
documents.
Calculate precision after R documents are seen
Can be averaged over all queries
n doc # relevant
1 588 x
2 589 x
3 576
4 590 x
5 986
6 592 x
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x
14 990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
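A minimal sketch of the R-precision computation for this ranked list:

```python
# R-precision for the ranked list of slide 15: precision over the first R
# results, where R is the total number of relevant documents (here R = 6).
ranked_relevance = [True, True, False, True, False, True, False,
                    False, False, False, False, False, True, False]
R = 6

r_precision = sum(ranked_relevance[:R]) / R
print(r_precision)   # 4/6 ≈ 0.67
```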
16. More Examples
• Average precision is calculated by averaging the precision values obtained at the points where recall increases (i.e. at each retrieved relevant document).
• [Figure: precision at each recall point for two rankings, Ranking #1 and Ranking #2.] The average precision at these recall values is 62.2% for Ranking #1 and 52.0% for Ranking #2. Thus, using this measure, we can say that Ranking #1 is better than Ranking #2.
17. Precision/Recall tradeoff
Can increase recall by retrieving many documents
(down to a low level of relevance ranking),
but many irrelevant documents would be fetched, reducing
precision
Can get high recall (but low precision) by retrieving all
documents for all queries
[Figure: precision (y-axis) versus recall (x-axis), both on a 0-1 scale. The ideal system sits in the top-right corner, with both high recall and high precision. A high-precision / low-recall system returns relevant documents but misses many useful ones; a high-recall / low-precision system returns most of the relevant documents but also includes lots of junk.]
18. F-Measure
One measure of performance that takes into
account both recall and precision.
Harmonic mean of recall and precision:
Compared to arithmetic mean, both need to be
high for harmonic mean to be high.
What if no relevant documents exist?
F = 2PR / (P + R) = 2 / (1/P + 1/R)
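A small sketch of the F-measure as the harmonic mean of precision and recall; returning 0 when both are zero is one common convention for the "no relevant documents" case, not necessarily the one intended by the slide:

```python
# F-measure: harmonic mean of precision P and recall R.
def f_measure(p, r):
    if p + r == 0:            # no relevant documents retrieved (or none exist)
        return 0.0
    return 2 * p * r / (p + r)

print(f_measure(0.5, 0.4))    # harmonic mean ≈ 0.444, lower than the
print((0.5 + 0.4) / 2)        # arithmetic mean 0.45
```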
19. E-Measure
Associated with Van Rijsbergen
Allows user to specify importance of recall and
precision
It is a parameterized F-measure: a variant of the F-measure that allows the emphasis to be shifted between precision and recall.
The value of β controls the trade-off:
β = 1: equal weight for precision and recall (E = F).
β > 1: weights recall more heavily (emphasizes recall).
β < 1: weights precision more heavily (emphasizes precision).
E = (β² + 1)PR / (β²P + R)
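A sketch following the slide's definition of E as the β-weighted F-measure; the handling of zero precision or recall is an assumption:

```python
# Parameterized E-measure: beta controls the precision/recall trade-off.
def e_measure(p, r, beta):
    if p == 0 or r == 0:      # assumed convention for degenerate cases
        return 0.0
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = 0.5, 0.8
print(e_measure(p, r, beta=1))    # equals the F-measure
print(e_measure(p, r, beta=2))    # beta > 1: result is pulled toward recall
print(e_measure(p, r, beta=0.5))  # beta < 1: result is pulled toward precision
```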
20. Quiz (5%)
Write short answers to the following questions.
1. List and explain the two IR system evaluation strategies. (2 pts)
2. Suppose the XY IR system returns 8 relevant documents and 10 irrelevant documents for a search, and there are a total of 20 relevant documents in the collection/dataset. What is the precision of the system on this search, and what is its recall? (3 pts)