Time-aware Evaluation of Cumulative Citation Recommendation Systems
Krisztian Balog
University of Stavanger
SIGIR 2013 workshop on Time-aware Information Access (#TAIA2013) | Dublin, Ireland, Aug 2013
Laura Dietz, Jeffrey Dalton
CIIR, University of Massachusetts, Amherst
CCR @TREC 2012 KBA
Evaluation methodology
Target entity: Aharon Barak
[Figure: ranked run output for the target entity, annotated with the labels Positive, Negative, and Cutoff]

urlname        stream_id                                      score
Aharon_Barak   1328055120-f6462409e60d2748a0adef82fe68b86d    1000
Aharon_Barak   1328057880-79cdee3c9218ec77f6580183cb16e045    500
Aharon_Barak   1328057280-80fb850c089caa381a796c34e23d9af8    500
Aharon_Barak   1328056560-450983d117c5a7903a3a27c959cc682a    480
Aharon_Barak   1328056560-450983d117c5a7903a3a27c959cc682a    450
Aharon_Barak   1328056260-684e2f8fc90de6ef949946f5061a91e0    430
Aharon_Barak   1328056560-be417475cca57b6557a7d5db0bbc6959    428
Aharon_Barak   1328057520-4e92eb721bfbfdfa0b1d9476b1ecb009    428
Aharon_Barak   1328058660-807e4aaeca58000f6889c31c24712247    380
Aharon_Barak   1328060040-7a8c209ad36bbb9c946348996f8c616b    380
Aharon_Barak   1328063280-1ac4b6f3a58004d1596d6e42c4746e21    375
Aharon_Barak   1328064660-1a0167925256b32d715c1a3a2ee0730c    315
Aharon_Barak   1328062980-7324a71469556bcd1f3904ba090ab685    263
CCR @TREC 2012 KBA
- Cumulative citation recommendation
- Filter a time-ordered corpus for documents that are
highly relevant to a predefined set of entities
- For each entity, provide a ranked list of documents
based on their “citation-worthiness”
Results are evaluated in a single batch
(temporal aspects are not considered)
Evaluation metrics are set-based
(using a confidence cut-off)
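A set-based evaluation with a confidence cut-off can be illustrated with a minimal Python sketch. It assumes a run is a list of (doc_id, score) pairs and that documents scoring at or above the cut-off form the returned set. The function name `set_metrics_at_cutoff`, the "at or above" convention, and the toy data are illustrative assumptions, not the official TREC KBA scorer.

```python
def set_metrics_at_cutoff(run, relevant, cutoff):
    """Precision, recall and F1 of the documents scoring at or above `cutoff`.

    run:      list of (doc_id, confidence_score) pairs (hypothetical format)
    relevant: set of doc_ids judged relevant for the target entity
    """
    # Everything at or above the cut-off counts as "returned" (assumed convention).
    returned = {doc_id for doc_id, score in run if score >= cutoff}
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy data, not taken from any real run.
run = [("d1", 1000), ("d2", 500), ("d3", 430), ("d4", 263)]
relevant = {"d1", "d3", "d5"}
precision, recall, f1 = set_metrics_at_cutoff(run, relevant, cutoff=400)
```

Note that the order of documents above the cut-off and their timestamps play no role in these numbers, which is the limitation the time-aware approach targets.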
Aims
- Develop a time-aware evaluation paradigm for
streaming collections
- Capture how retrieval effectiveness changes over time
- Deal with ground truth of bursty nature
- Accommodate various underlying user models
- Test the ideas on CCR
Overview
1. Slicing time
2. Measuring slice relevance
3. Aggregating slice relevance
[Figure: timeline divided into slices; each slice carries a relevance score (e.g., .87, .65) and a slice importance]
Slicing time
- Simplifying assumptions
  - Slices are non-overlapping
  - Unconcerned about slices that don’t contain any relevant documents
(A) Uniform slicing
  - Slices of equal length
(B) Non-uniform slicing
  - Slices of varying length
[Figure: number of relevant documents (#relevant) over time, with slice boundaries t_i shown for (A) uniform and (B) non-uniform slicing]
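Uniform slicing, option (A), can be sketched in a few lines of Python. The sketch assumes each document comes with an epoch timestamp in seconds and that slice boundaries are aligned to epoch 0. The function name `uniform_slices` and the toy documents are hypothetical.

```python
from collections import defaultdict


def uniform_slices(docs, slice_seconds):
    """Group (doc_id, epoch_seconds) pairs into non-overlapping, equal-length slices.

    Returns {slice_index: [doc_id, ...]}, where slice i covers the interval
    [i * slice_seconds, (i + 1) * slice_seconds). Empty slices are never created.
    """
    slices = defaultdict(list)
    for doc_id, timestamp in docs:
        # Integer division assigns every timestamp to exactly one slice.
        slices[timestamp // slice_seconds].append(doc_id)
    return dict(slices)


HOUR = 3600
# Toy documents with illustrative epoch timestamps.
docs = [("a", 1328055120), ("b", 1328056560), ("c", 1328058660), ("d", 1328064660)]
hourly = uniform_slices(docs, HOUR)
```

Passing `7 * 24 * HOUR` or `24 * HOUR` as `slice_seconds` would give weekly or daily slices. Non-uniform slicing (B) would need explicit boundaries instead of a fixed length and is not covered by this sketch.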
Measuring slice relevance
- Ranked list of documents within a given slice
- Evaluation metric
- Standard IR metrics
- MAP, R-Prec, NDCG
$d = \langle d_1, \ldots, d_n \rangle$
$m(d_i, q)$
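As an illustration of a per-slice metric m(d_i, q), the following Python sketch computes average precision (the per-query component of MAP) for the ranking inside a single slice. Binary relevance is assumed, and the function name and toy data are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc_ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, accumulated only at relevant documents.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Toy ranking and judgments for a single slice.
slice_ranking = ["d3", "d7", "d1", "d9"]
slice_relevant = {"d3", "d1"}
ap = average_precision(slice_ranking, slice_relevant)
```

Any other rank-based metric (R-Prec, NDCG) could be substituted as long as it is computed only over the documents and judgments that fall inside the slice.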
Aggregating slice relevance
- Probabilistic formulation to estimate the
likelihood of relevance
$P(r = 1 \mid d, q, m) = \sum_{i \in I} P(r = 1 \mid d_i, q, i) \, P(i \mid q)$
- Slice-based relevance: $P(r = 1 \mid d_i, q, i) \approx m(d_i, q)$
- Slice importance: $P(i \mid q)$
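The aggregation can be read as a weighted sum of per-slice metric values. Below is a minimal Python sketch under that reading; the function name `aggregate` and the dictionary layout are assumptions, and the two slice scores reuse the example values .87 and .65 from the overview figure.

```python
def aggregate(slice_scores, slice_weights):
    """Sum over slices i of m(d_i, q) * P(i | q).

    slice_scores:  {slice_id: metric value m(d_i, q)}
    slice_weights: {slice_id: slice importance P(i | q)}, expected to sum to 1
    """
    total_weight = sum(slice_weights.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("slice weights must sum to 1")
    return sum(slice_scores[i] * slice_weights[i] for i in slice_scores)


slice_scores = {0: 0.87, 1: 0.65}
slice_weights = {0: 0.5, 1: 0.5}  # equal importance, chosen for illustration
overall = aggregate(slice_scores, slice_weights)
```

With equal weights the result is simply the mean of the slice scores; other choices of P(i | q) shift the overall score toward the slices considered more important.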
Slice importance
- Uniform slicing
  - All slices are equally important
  $P(i \mid q) = \frac{1}{|I|}$
- Non-uniform slicing
  - Bursty periods (i.e., slices with more relevant documents) are more important
  $P(i \mid q) = \frac{\#R(i, q)}{\sum_{i' \in I} \#R(i', q)}$
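The two slice importance estimates can be compared on a toy example. This Python sketch assumes #R(i, q) is the number of relevant documents in slice i; the function names, the counts, and the slice scores are hypothetical.

```python
def uniform_importance(slice_ids):
    """P(i | q) = 1 / |I|: every slice gets the same weight."""
    n = len(slice_ids)
    return {i: 1.0 / n for i in slice_ids}


def burst_importance(relevant_counts):
    """P(i | q) proportional to the number of relevant documents in slice i."""
    total = sum(relevant_counts.values())
    if total == 0:
        raise ValueError("at least one slice must contain a relevant document")
    return {i: count / total for i, count in relevant_counts.items()}


# Toy setup: slice 0 is a burst with 6 relevant documents, slice 1 has 2.
relevant_counts = {0: 6, 1: 2}
slice_scores = {0: 0.87, 1: 0.65}

w_uniform = uniform_importance(list(relevant_counts))
w_burst = burst_importance(relevant_counts)

score_uniform = sum(slice_scores[i] * w_uniform[i] for i in slice_scores)
score_burst = sum(slice_scores[i] * w_burst[i] for i in slice_scores)
```

In this toy case the burst-weighted score is higher than the uniformly weighted one because the system does better in the bursty slice; a system that performs poorly during bursts would be penalized instead.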
Experiments
- Official TREC 2012 KBA CCR runs
- 8 systems, best run for each system
- Only uniform time slicing
- Binary relevance
Results
Atemporal vs. temporal ranking (MAP, weekly slicing)
[Figure: bar chart of MAP per system (y-axis from 0 to 0.6) for UvA, udel_fang, LSIS, CWI, UMass_CIIR, uiucGSLIS, hltcoe, igpi2012, helsinki. Series: Atemporal; Temporal (uniform slice weighting); Temporal (non-uniform slice weighting)]
Results
Atemporal vs. temporal ranking (MAP, daily slicing)
[Figure: bar chart of MAP per system (y-axis from 0 to 0.7) for UvA, udel_fang, LSIS, CWI, UMass_CIIR, uiucGSLIS, hltcoe, igpi2012, helsinki. Series: Atemporal; Temporal (uniform slice weighting); Temporal (non-uniform slice weighting)]
Zooming in
                     temporal (MAP)
       atemporal     weekly slicing           daily slicing
       (MAP)         uniform   non-uniform    uniform   non-uniform
LSIS   0.48          0.52      0.54           0.60      0.62
CWI    0.45          0.48      0.51           0.62      0.63
Findings
- Top performing teams are (almost) always the
same, independent of the metric
- Temporal evaluation provides additional
insights
Wrap-up
- Framework for temporal evaluation
- Applied to the evaluation of TREC 2012 KBA CCR
systems
- Future work
- Non-uniform slice weighting
- Other streaming tasks/collections (e.g., microblog
search)
- Generalize to other time-aware information access
tasks
Questions?
Online appendix:
http://ciir.cs.umass.edu/~dietz/streameval/
