2. Why System Evaluation?
Evaluation provides the ability to measure the difference between IR systems:
How well do our search engines work?
Is system A better than system B?
Under what conditions?
Evaluation drives what to research and what to improve in existing IR systems:
Identify techniques that work and those that do not
There are many retrieval models/algorithms/systems; which one is the best?
What is the best component for:
Similarity measures (dot product, cosine, …)
Index term selection (stop-word removal, stemming, …)
Term weighting (TF, TF-IDF, …)
3. Types of Evaluation Strategies
System-centered evaluation
Given documents, queries, & relevance judgments
Try several variations of the system
Measure which system returns the “best” matching list of
documents
User-centered evaluation
Given several users, and at least two IR systems
Have each user try the same task on each system
Measure which system works “best” for the users' information needs
How do we measure user satisfaction? How do we know users' impressions of the IR system?
4. Major Evaluation Criteria
What are some of the main measures for evaluating
an IR system’s performance?
Efficiency: time, space
Speed in terms of retrieval time and indexing time
Speed of query processing
The space taken by corpus vs. index
Is there a need for compression?
Index size: Index/corpus size ratio
Effectiveness
How well can the system retrieve relevant documents from the collection?
Is one system better than another?
User satisfaction: how “good” are the documents returned in response to a user's query?
“Relevance” of the results to the information need of the user
5. Difficulties in Evaluating IR Systems
IR systems essentially facilitate communication
between a user and document collections
Relevance is a measure of the effectiveness of
communication
Effectiveness is related to the relevancy of retrieved items.
Relevance: relates information need (query) and a
document or surrogate
Relevancy is not typically binary but continuous.
Even if relevancy is binary, it is a difficult judgment to make.
Relevance is the degree of correspondence between a document and a query, as determined by the requester, an information specialist, an external judge, or other users
6. Difficulties in Evaluating IR Systems
Relevance judgments are made by:
The user who posed the retrieval problem
An external judge, an information specialist, or the system developer
Are the relevance judgments made by users, information specialists, and external judges the same? Why or why not?
Relevance judgment is usually:
Subjective: Depends upon a specific user’s judgment.
Situational: Relates to user’s current needs.
Cognitive: Depends on human perception and
behavior.
Dynamic: Changes over time.
7. Retrieval Scenario
[Figure: six ranked result lists (A-F) showing the 13 results retrieved by different systems for a given query; the marked boxes indicate the relevant documents.]
8. Measuring Retrieval Effectiveness
Retrieval of documents may result in:
False negative (false drop): some relevant documents may
not be retrieved.
False positive: some irrelevant documents may be retrieved.
For many applications a good index should not permit any
false drops, but may permit a few false positives.
False negatives are also called “Type II errors” or “errors of omission”; false positives are also called “Type I errors” or “errors of commission”.
• The following contingency table is often used to define metrics for evaluating the effectiveness of the system:

                 relevant   irrelevant
retrieved            A           B
not retrieved        C           D
9. Relevant performance metrics
Recall: the ability of the search to find all of the relevant items in the corpus.
Recall is the percentage of relevant documents retrieved from the database in response to a user's query:
Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the database)
Precision: the ability to retrieve top-ranked documents that are mostly relevant.
Precision is the percentage of retrieved documents that are relevant to the query:
Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved)
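As a minimal illustration of these two definitions, the sketch below computes precision and recall from sets of document IDs. The slides contain no code; the function names and the document IDs are invented for illustration.

```python
# Precision and recall from sets of document IDs (illustrative sketch;
# the IDs below are hypothetical, not from the slides).

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = ["d1", "d2", "d3", "d4"]   # documents returned for a query
relevant  = ["d2", "d4", "d7"]         # relevance judgments for that query
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(recall(retrieved, relevant))     # 2/3 ≈ 0.67
```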
10. Measuring Retrieval Effectiveness
When is precision important? When is recall important?
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

                 Relevant   Not relevant
Retrieved            A           B
Not retrieved        C           D

Collection size = A + B + C + D
Relevant = A + C
Retrieved = A + B
Relevant ∩ Retrieved = A
so that Recall = A / (A + C) and Precision = A / (A + B).
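The same quantities can be computed directly from the A/B/C/D cells of the contingency table. A small sketch with made-up counts:

```python
# Precision and recall from the contingency-table counts.
# A = relevant & retrieved, B = irrelevant & retrieved,
# C = relevant & not retrieved, D = irrelevant & not retrieved.
A, B, C, D = 30, 20, 10, 940          # hypothetical counts, for illustration only

precision = A / (A + B)               # relevant share of what was retrieved
recall    = A / (A + C)               # retrieved share of what is relevant
print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.60, 0.75
```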
11. Example 1
[Figure: a ranked list of 20 hits; the marked boxes indicate relevant documents.]
Assume there are 14 relevant documents in the corpus; compute precision and recall at each cutoff point.

Hits 1-10:
Precision: 1/1  1/2  1/3  1/4  2/5  3/6  3/7  4/8  4/9  4/10
Recall:    1/14 1/14 1/14 1/14 2/14 3/14 3/14 4/14 4/14 4/14

Hits 11-20:
Precision: 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20
Recall:    5/14 5/14 5/14 5/14 5/14 6/14 6/14 6/14 6/14 6/14
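A short sketch that reproduces the precision and recall values above, assuming the relevant hits sit at ranks 1, 5, 6, 8, 11 and 16 (as read off the precision values):

```python
# Precision and recall at each cutoff for Example 1: a ranked list of 20 hits,
# with 14 relevant documents in the whole corpus.
relevant_ranks = {1, 5, 6, 8, 11, 16}
total_relevant = 14

rel_so_far = 0
for k in range(1, 21):
    if k in relevant_ranks:
        rel_so_far += 1
    print(f"cutoff {k:2d}: precision = {rel_so_far}/{k}, "
          f"recall = {rel_so_far}/{total_relevant}")
```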
12. Example 2
• Let the total number of relevant documents be 6; compute recall and precision at each cutoff point n:

 n   doc #   relevant   Recall   Precision
 1   588        x        0.17      1.00
 2   589        x        0.33      1.00
 3   576
 4   590        x        0.50      0.75
 5   986
 6   592        x        0.67      0.67
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772        x        0.83      0.38
14   990
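A small sketch that reproduces the filled-in rows of this table from the document list and relevance marks shown on the slide:

```python
# Recall and precision at each cutoff n for Example 2.
# The 14 document numbers and relevance marks are taken from the slide;
# total relevant documents in the collection = 6.
docs = [(588, True), (589, True), (576, False), (590, True), (986, False),
        (592, True), (984, False), (988, False), (578, False), (985, False),
        (103, False), (591, False), (772, True), (990, False)]
total_relevant = 6

hits = 0
for n, (doc_id, is_relevant) in enumerate(docs, start=1):
    hits += is_relevant
    if is_relevant:  # the slide only fills in rows where a relevant doc appears
        print(f"n={n:2d} doc={doc_id}  recall={hits/total_relevant:.2f}  "
              f"precision={hits/n:.2f}")
```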
13. Average Precision
Average the precision values obtained at each retrieved relevant document.
Relevant documents that are not retrieved contribute zero to the score.
Example: assume a total of 14 relevant documents and the ranked list of 20 hits whose precision values are shown below; compute the average precision.

Precision, hits 1-10:  1/1  1/2  1/3  1/4  2/5  3/6  3/7  4/8  4/9  4/10
Precision, hits 11-20: 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20

AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 ≈ 0.231
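The same computation as a sketch in Python, using the ranks of the retrieved relevant documents implied by the precision values above:

```python
# Average precision (AP): average the precision at the rank of each retrieved
# relevant document; relevant documents that are never retrieved contribute
# zero, hence the division by all 14 relevant documents.
relevant_ranks = [1, 5, 6, 8, 11, 16]   # ranks of the relevant hits in the top 20
total_relevant = 14

ap = sum((j + 1) / rank for j, rank in enumerate(relevant_ranks)) / total_relevant
print(round(ap, 3))   # 0.231
```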
14. MAP (Mean Average Precision)
• Computing the mean of the average precision over more than one query:

MAP = (1/n) · Σ_i (1/|R_i|) · Σ_j ( j / r_ij )

where r_ij = rank of the j-th relevant document for query Q_i, |R_i| = number of relevant documents for Q_i, and n = number of test queries.

• E.g. assume that for query 1 and query 2 there are 3 and 2 relevant documents in the collection, respectively, retrieved at the following ranks:

Relevant docs retrieved   Query 1   Query 2
1st rel. doc.                 1         4
2nd rel. doc.                 5         8
3rd rel. doc.                10         -

MAP = (1/2) · [ (1/3)(1/1 + 2/5 + 3/10) + (1/2)(1/4 + 2/8) ] ≈ 0.41
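A sketch of the same two-query MAP computation; the query labels and the helper function name are invented for illustration:

```python
# MAP for the two-query example: r_ij = rank of the j-th relevant document
# for query Q_i, |R_i| = number of relevant documents for Q_i.
queries = {
    "Q1": {"relevant_total": 3, "relevant_ranks": [1, 5, 10]},
    "Q2": {"relevant_total": 2, "relevant_ranks": [4, 8]},
}

def average_precision(ranks, total_relevant):
    # precision at the j-th relevant document is j / r_ij
    return sum((j + 1) / r for j, r in enumerate(ranks)) / total_relevant

map_score = sum(average_precision(q["relevant_ranks"], q["relevant_total"])
                for q in queries.values()) / len(queries)
print(round(map_score, 3))   # ≈ 0.408
```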
15. R-Precision
Precision at the R-th position in the ranking of results
for a query, where R is the total number of relevant
documents.
Calculate precision after R documents are seen
Can be averaged over all queries
n doc # relevant
1 588 x
2 589 x
3 576
4 590 x
5 986
6 592 x
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x
14 990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
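A minimal sketch of the R-precision computation for this ranked list:

```python
# R-precision for the ranked list of slide 15: precision over the first R
# results, where R is the total number of relevant documents (here R = 6).
ranked_relevance = [True, True, False, True, False, True, False,
                    False, False, False, False, False, True, False]
R = 6

r_precision = sum(ranked_relevance[:R]) / R
print(r_precision)   # 4/6 ≈ 0.67
```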
16. More Examples
• Average precision is calculated by averaging the precision values obtained at the points where recall increases (i.e. at each retrieved relevant document).
• [Figure: precision at each recall point for two rankings, Ranking #1 and Ranking #2.] The average precision at these recall values is 62.2% for Ranking #1 and 52.0% for Ranking #2. Thus, using this measure, we can say that Ranking #1 is better than Ranking #2.
17. Precision/Recall tradeoff
Can increase recall by retrieving many documents
(down to a low level of relevance ranking),
but many irrelevant documents would be fetched, reducing
precision
Can get high recall (but low precision) by retrieving all
documents for all queries
[Figure: precision (y-axis) versus recall (x-axis), both on a 0-1 scale. The ideal system sits in the top-right corner, with both high recall and high precision. A high-precision / low-recall system returns relevant documents but misses many useful ones; a high-recall / low-precision system returns most of the relevant documents but also includes lots of junk.]
18. F-Measure
One measure of performance that takes into
account both recall and precision.
Harmonic mean of recall and precision:
Compared to arithmetic mean, both need to be
high for harmonic mean to be high.
What if no relevant documents exist?
F = 2PR / (P + R) = 2 / (1/P + 1/R)
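A small sketch of the F-measure as the harmonic mean of precision and recall; returning 0 when both are zero is one common convention for the "no relevant documents" case, not necessarily the one intended by the slide:

```python
# F-measure: harmonic mean of precision P and recall R.
def f_measure(p, r):
    if p + r == 0:            # no relevant documents retrieved (or none exist)
        return 0.0
    return 2 * p * r / (p + r)

print(f_measure(0.5, 0.4))    # harmonic mean ≈ 0.444, lower than the
print((0.5 + 0.4) / 2)        # arithmetic mean 0.45
```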
19. E-Measure
Associated with Van Rijsbergen
Allows user to specify importance of recall and
precision
It is a parameterized F-measure: a variant of the F-measure that allows the emphasis to be shifted between precision and recall.
The value of β controls the trade-off:
β = 1: equal weight for precision and recall (E = F).
β > 1: weights recall more heavily (emphasizes recall).
β < 1: weights precision more heavily (emphasizes precision).
E = (β² + 1)PR / (β²P + R)
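A sketch following the slide's definition of E as the β-weighted F-measure; the handling of zero precision or recall is an assumption:

```python
# Parameterized E-measure: beta controls the precision/recall trade-off.
def e_measure(p, r, beta):
    if p == 0 or r == 0:      # assumed convention for degenerate cases
        return 0.0
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = 0.5, 0.8
print(e_measure(p, r, beta=1))    # equals the F-measure
print(e_measure(p, r, beta=2))    # beta > 1: result is pulled toward recall
print(e_measure(p, r, beta=0.5))  # beta < 1: result is pulled toward precision
```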
20. Quiz (5%)
Write short answers to the following questions.
1. List and explain the two IR system evaluation strategies. (2 pts)
2. Suppose the XY IR system returns 8 relevant documents and 10 irrelevant documents for a search, and there are a total of 20 relevant documents in the collection/dataset. What is the precision of the system on this search, and what is its recall? (3 pts)