This talk surveys current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples include web archives, news archives, blogs, micro-blogs, personal emails, and enterprise documents. Unfortunately, traditional IR approaches based only on term matching can give unsatisfactory results when searching the temporal web. The reasons are manifold: 1) the collection is strongly time-dependent, i.e., it contains multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed, namely: 1) How to understand the temporal search intent represented by time-sensitive queries? 2) How to handle the temporal dynamics of queries and documents? 3) How to explicitly model temporal information in retrieval and ranking models? To this end, we will present current approaches to these problems and outline directions for future research.
Searching the Temporal Web: Challenges and Current Approaches
1. Searching the Temporal Web: Challenges and Current Approaches
Nattiya Kanhabua
IR Group, University of Glasgow
4 February 2013
2. Outline
• Evolution of the Web
• Temporal IR system
• Current Approaches
– Content Analysis
– Query Analysis
– Retrieval and Ranking
• Open Issues
3. Evolution of the Web
• Web is changing over time in many aspects:
– Size: web pages are added/deleted all the time
– Content: web pages are edited/modified
– Query: users’ information needs change
[Ke et al., CN 2006; Risvik et al., CN 2002]
[Dumais, SIAM-SDM 2012; WebDyn 2010]
4. Web and Index Sizes
• Timeline from 1995 to 2012:
– 2000: first billion-URL index; the world’s largest, on ≈5000 PCs in clusters
– 2004: index grows to 4.2 billion pages
– 2008: Google counts 1 trillion unique URLs
– 2009: TBs or PBs of data/index; tens of thousands of PCs
• Source: http://www.worldwidewebsize.com/
• Impacts: crawling, indexing, and caching
9. Time-sensitive Queries
• Represent temporal information needs
– E.g., 2006 FIFA World Cup, Thailand tsunami
• Queries and the relevance depend on time
– Query popularities change over time
– Documents are about events at particular times
10. Time Distribution of Qrel
[Figure: time distributions of relevance judgments (qrels) for three query types: recency query, time-sensitive query, and time-insensitive query]
[Li et al., CIKM 2003]
13. Two Time Aspects
Two time dimensions
1. Publication or modified time
2. Content or event time
14. Problem Statements
• It is difficult to find a trustworthy time for web documents
– Time gap between crawling and indexing
– Decentralization and relocation of web documents
– No standard metadata for time/date
Document Dating
• “I found a bible-like document, but I have no idea when it was created.”
• “Let me see… This document was probably written in 850 A.C., with 95% confidence.”
• Research question: “For a given document with an uncertain timestamp, can the contents be used to determine the timestamp with sufficiently high confidence?”
16. Content-based Approach
Temporal Language Models
• Based on the statistical usage of words over time
• Compare each word of a non-timestamped document with a reference corpus
• Tentative timestamp: the time partition that overlaps most in word usage
Temporal language model (reference corpus):
Partition   Word         Freq
1999        tsunami      1
1999        Japan        1
1999        tidal wave   1
2004        tsunami      1
2004        Thailand     1
2004        earthquake   1
A non-timestamped document: tsunami, Thailand
[de Jong et al., AHC 2005; Kanhabua et al., ECDL 2008]
20. Content-based Approach
Temporal language model (reference corpus):
Partition   Word         Freq
1999        tsunami      1
1999        Japan        1
1999        tidal wave   1
2004        tsunami      1
2004        Thailand     1
2004        earthquake   1
A non-timestamped document: tsunami, Thailand
Similarity Scores
• Score(1999) = 1
• Score(2004) = 1 + 1 = 2
• Most likely timestamp is 2004
[de Jong et al., AHC 2005; Kanhabua et al., ECDL 2008]
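The overlap scoring illustrated on this slide can be sketched in a few lines of Python. This is a minimal illustration of the toy example only, not the authors' implementation; the names `TEMPORAL_LM`, `partition_scores`, and `most_likely_partition` are hypothetical, and the score is assumed to be the plain sum of matching word frequencies per partition.

```python
from collections import defaultdict

# Toy temporal language model copied from the slide:
# (time partition, word) -> frequency in the reference corpus.
TEMPORAL_LM = {
    ("1999", "tsunami"): 1,
    ("1999", "Japan"): 1,
    ("1999", "tidal wave"): 1,
    ("2004", "tsunami"): 1,
    ("2004", "Thailand"): 1,
    ("2004", "earthquake"): 1,
}


def partition_scores(doc_words, lm=TEMPORAL_LM):
    """Sum, per partition, the frequencies of words shared with the document."""
    scores = defaultdict(int)
    for (partition, word), freq in lm.items():
        if word in doc_words:
            scores[partition] += freq
    return dict(scores)


def most_likely_partition(doc_words, lm=TEMPORAL_LM):
    """Return the best-overlapping partition, or None if nothing overlaps."""
    scores = partition_scores(doc_words, lm)
    return max(scores, key=scores.get) if scores else None


doc = {"tsunami", "Thailand"}
print(partition_scores(doc))       # {'1999': 1, '2004': 2}
print(most_likely_partition(doc))  # 2004
```

The result matches the slide: the document overlaps with one word in 1999 and two words in 2004, so 2004 is the tentative timestamp.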
21. Normalized Log-likelihood Ratio
Normalized log-likelihood ratio
• Variant of Kullback-Leibler divergence
• Similarity of a document and time partitions
• C is the background model estimated on the corpus
• Linear interpolation smoothing to avoid the zero probability of unseen words
[Kraaij, SIGIR Forum 2005]
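The formula itself did not survive in this transcript. As a hedged sketch, the code below assumes the commonly used form of the normalized log-likelihood ratio, Score(d, p) = Σ_w P(w|d) · log(P(w|p) / P(w|C)), where p is a time partition and C is the background model, with P(w|p) smoothed by linear interpolation against C. The function name `nllr`, the smoothing weight `lam`, and the toy token lists are assumptions made for illustration.

```python
import math
from collections import Counter


def nllr(doc_tokens, partition_tokens, corpus_tokens, lam=0.9):
    """Normalized log-likelihood ratio of a document against one time partition.

    lam is an assumed interpolation weight: P(w|p) is mixed with the
    background model P(w|C) so that unseen words keep a non-zero probability.
    """
    doc = Counter(doc_tokens)
    part = Counter(partition_tokens)
    corpus = Counter(corpus_tokens)
    doc_len = sum(doc.values())
    part_len = sum(part.values())
    corpus_len = sum(corpus.values())

    score = 0.0
    for word, count in doc.items():
        p_w_c = corpus[word] / corpus_len
        if p_w_c == 0.0:
            continue  # word unknown to the reference corpus: skipped here
        p_w_p = lam * (part[word] / part_len) + (1 - lam) * p_w_c
        score += (count / doc_len) * math.log(p_w_p / p_w_c)
    return score


# Toy partitions mirroring the tsunami example.
p1999 = ["tsunami", "japan", "tidal_wave"]
p2004 = ["tsunami", "thailand", "earthquake"]
corpus = p1999 + p2004
doc = ["tsunami", "thailand"]

print(nllr(doc, p1999, corpus))
print(nllr(doc, p2004, corpus))
```

On this toy data the 2004 partition scores higher than the 1999 partition, which agrees with the simple overlap count.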
22. Link-based Approach
• Dating a document using its neighbors
1. Web pages linking to the document
• Incoming links
2. Web pages pointed to by the document
• Outgoing links
3. Media assets associated with the document
• E.g., images
• The last-modified dates of the neighbors are averaged to estimate the timestamp
[Nunes et al., WIDM 2007]
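The averaging step can be sketched as follows. This is only an illustration under simple assumptions: the last-modified dates of the neighbors are already available as timezone-aware `datetime` objects, all neighbors are weighted equally, and the function name `estimate_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def estimate_timestamp(neighbor_last_modified):
    """Average the last-modified dates of a document's neighbors.

    Returns None when the document has no dated neighbors.
    """
    if not neighbor_last_modified:
        return None
    epoch_seconds = [dt.timestamp() for dt in neighbor_last_modified]
    mean_seconds = sum(epoch_seconds) / len(epoch_seconds)
    return datetime.fromtimestamp(mean_seconds, tz=timezone.utc)


# Hypothetical neighbors: one incoming link and one outgoing link.
neighbors = [
    datetime(2007, 1, 1, tzinfo=timezone.utc),
    datetime(2007, 1, 3, tzinfo=timezone.utc),
]
print(estimate_timestamp(neighbors))  # 2007-01-02 00:00:00+00:00
```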
23. Hybrid Approach
• Inferring timestamps using machine learning
– Exploit links and the contents of a web page and its neighbors
– Features: linguistic, position, page formats, and tags
[Chen et al., SIGIR 2010]
24. Content Time Extraction
• Three types of temporal expressions
1. Explicit: time mentions that map directly to a time point or interval, e.g., “July 4, 2012”
2. Implicit: an imprecise time point or interval, e.g., “Independence Day 2012”
3. Relative: resolved to a time point or interval using other types or the publication date, e.g., “next month”
[Alonso et al., SIGIR Forum 2007]
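The three expression types can be illustrated with a small sketch. It is not a temporal tagger; real systems rely on dedicated tools. The holiday lookup `HOLIDAYS` and the function names are hypothetical, and each function handles only the single pattern named on the slide.

```python
import calendar
from datetime import date, datetime

# Hypothetical lookup for implicit expressions: name -> (month, day).
HOLIDAYS = {"independence day": (7, 4)}  # US Independence Day


def resolve_explicit(expr):
    """Explicit: maps directly to a date, e.g. 'July 4, 2012'."""
    return datetime.strptime(expr, "%B %d, %Y").date()


def resolve_implicit(expr):
    """Implicit: needs background knowledge, e.g. 'Independence Day 2012'."""
    name, year = expr.rsplit(" ", 1)
    month, day = HOLIDAYS[name.lower()]
    return date(int(year), month, day)


def resolve_next_month(publication_date):
    """Relative: 'next month' resolved against the publication date.

    Returns the (first day, last day) interval of the following month.
    """
    year = publication_date.year + publication_date.month // 12
    month = publication_date.month % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


print(resolve_explicit("July 4, 2012"))
print(resolve_implicit("Independence Day 2012"))
print(resolve_next_month(date(2012, 12, 15)))
```

Both the explicit and the implicit example resolve to the same date, while the relative expression depends entirely on the publication date supplied.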
25. Identifying Relevant Time
• How to determine the relevant temporal expressions tagged in a document?
– Not all temporal expressions associated with an event are equally relevant
• Example: “Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak in Uganda since the beginning of July 2012”
26. Current Approaches
1. Ranking temporal expressions using different features
2. Identifying relevant time is regarded as a classification problem
– Two classes: (1) relevant and (2) irrelevant
– Definition: a relevant expression refers to the starting, ending, or ongoing time of the event
[Strötgen et al., TempWeb 2012; Kanhabua et al., TAIA 2012]
28. Determining Time of Queries
• Two types of temporal queries:
1. Explicit: time is provided, e.g., “Presidential election 2012”
2. Implicit: time is not provided, e.g., “Germany World Cup”
• Temporal intent can be implicitly inferred
• Previous studies on temporal queries:
– 1.5% of web queries are explicit
– ~7% of web queries are implicit
[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009]
30. Query Log Analysis
• Features from query logs
– Analyze query frequencies over time for identifying
the relevant time of queries
• Time-series analysis
– Leverage time-series decomposition for detecting
seasonal queries
[Metzler et al., SIGIR 2009; Shokouhi, SIGIR 2011]
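As a rough illustration of spotting seasonal queries in a query log, the sketch below uses autocorrelation at the seasonal lag. This is a deliberately simpler stand-in for the time-series decomposition named on the slide, not the cited method. The monthly granularity, the 12-month period, the 0.5 threshold, and both function names are assumptions.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at the given lag."""
    n = len(series)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a flat series carries no seasonal signal
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


def looks_seasonal(monthly_counts, period=12, threshold=0.5):
    """Flag a query whose monthly frequency repeats with the given period."""
    return autocorrelation(monthly_counts, period) >= threshold


# Hypothetical query that peaks once a year, observed over three years.
yearly_pattern = [10] + [0] * 11
print(looks_seasonal(yearly_pattern * 3))  # True
print(looks_seasonal([5] * 36))            # False
```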
32. Analyzing Top-k Documents
• Using temporal language models
– Determine time of queries when no time is given explicitly
• Exploiting time from search snippets
– Extract temporal expressions (i.e., years) from the
contents of top-k retrieved web snippets for a given query
[Kanhabua et al., ECDL 2010; Campos et al., CIKM 2012]
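Extracting candidate years from top-k snippets can be sketched with a regular expression and a frequency count. This is a simplified illustration, not the method of the cited papers; the year range matched by `YEAR_RE`, the function name `candidate_years`, and the example snippets are assumptions.

```python
import re
from collections import Counter

# Assumed pattern: four-digit years from 1000 to 2099.
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def candidate_years(snippets):
    """Count year mentions across snippets, most frequent first."""
    counts = Counter()
    for snippet in snippets:
        counts.update(int(year) for year in YEAR_RE.findall(snippet))
    return counts.most_common()


# Hypothetical snippets for the implicit query "Germany World Cup".
snippets = [
    "Germany hosted the 2006 FIFA World Cup",
    "World Cup 2006 in Germany: results",
    "Germany won the World Cup in 1990",
]
print(candidate_years(snippets))  # [(2006, 2), (1990, 1)]
```

The most frequent year serves as a candidate for the implicit time of the query; a real system would also need to filter out numbers that are not dates.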
35. Searching the Past
• Searching documents created/edited over time
– E.g., web archives, news archives, blogs, or emails
– A journalist wants to write a timeline of a news article
– A Wikipedia contributor searches for historical information about an entity of interest
• “Temporal document collections”: web archives, news archives, blogs, emails
• Example: retrieve documents about Pope Benedict XVI written before 2005
• Term-based IR approaches may give unsatisfactory results
36. Challenges
• Time must be explicitly modeled in order to increase the effectiveness of ranking
– To order search results so that the most relevant ones are ranked higher
• Time uncertainty should be taken into account
– Two temporal expressions can refer to the same time period even though they are written differently
– E.g., the query “Independence Day 2011”
• A retrieval model relying on term-matching only will fail to retrieve documents mentioning “July 4, 2011”
37. Query/Document Models
• A temporal query consists of:
– Query keywords
– Temporal expressions
• A document consists of:
– Terms, i.e., bag-of-words
– Publication time and temporal expressions
38. Temporal Query Examples
[Berberich et al., ECIR 2010]
39. Time-aware Ranking
• Two main approaches
1. Mixture model
• Linearly combining textual and temporal similarity
2. Probabilistic model
• Generating a query from the textual part and the temporal part of a document independently
[Li et al., CIKM 2003; Baeza-Yates, SIGIR Forum 2005]
[Berberich et al., ECIR 2010; Kanhabua et al., ECDL 2010]
40. Mixture Model
• Linearly combine textual and temporal similarity
– α indicates the relative importance of the two similarity scores
• Both scores are normalized before combining
– Textual similarity can be determined using any term-based retrieval model
• E.g., tf.idf or a unigram language model
[Li et al., CIKM 2003; Baeza-Yates, SIGIR Forum 2005]
[Kanhabua et al., ECDL 2010]
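The formula is missing from this transcript, so the sketch below assumes the usual linear form, score(q, d) = (1 − α) · S_text(q, d) + α · S_time(q, d). Which term carries α, the max-based normalization, and the names `normalize` and `mixture_rank` are assumptions made for illustration. The textual and temporal scores are taken as given inputs.

```python
def normalize(scores):
    """Scale scores into [0, 1] by dividing by the maximum (assumed scheme)."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {doc: 0.0 for doc in scores}
    return {doc: score / top for doc, score in scores.items()}


def mixture_rank(text_scores, time_scores, alpha=0.5):
    """Rank documents by a linear mix of textual and temporal similarity."""
    text = normalize(text_scores)
    time = normalize(time_scores)
    combined = {
        doc: (1 - alpha) * text[doc] + alpha * time.get(doc, 0.0)
        for doc in text
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: d1 matches the keywords best, d2 matches the time best.
text_scores = {"d1": 2.0, "d2": 1.0}
time_scores = {"d1": 0.0, "d2": 1.0}
print(mixture_rank(text_scores, time_scores, alpha=0.5))
```

With α = 0 the ranking is purely textual and d1 comes first; with α = 0.5 the temporal match lifts d2 above d1.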
41. Mixture Model
How to determine temporal similarity?
43. Probabilistic Model
• Assume that the temporal expressions in the query are generated independently by a two-step generative model:
– P(tq|td) can be estimated as the likelihood of each time interval tq that td can refer to
– Linear interpolation smoothing is applied to eliminate zero probabilities
• I.e., for a temporal expression tq unseen in d
[Berberich et al., ECIR 2010]
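The equations are not preserved in this transcript. The sketch below assumes a simplified reading of the two-step model: each query expression tq is generated by first picking a temporal expression td uniformly from the document and then generating tq from td, and the document estimate is interpolated with a collection-wide estimate. The exact-match definition of P(tq|td), the weight `lam`, and all function names are assumptions; the uncertainty-aware variant in the cited paper is more involved.

```python
def exact_match(tq, td):
    """Assumed P(tq|td): 1 if the two intervals are identical, else 0."""
    return 1.0 if tq == td else 0.0


def p_tq_given_times(tq, times, match=exact_match):
    """Pick td uniformly from `times`, then generate tq from td."""
    if not times:
        return 0.0
    return sum(match(tq, td) for td in times) / len(times)


def temporal_likelihood(query_times, doc_times, collection_times, lam=0.1):
    """P(q_time | d_time) with linear interpolation smoothing.

    lam is the assumed weight of the collection-wide background estimate.
    """
    probability = 1.0
    for tq in query_times:
        p_doc = p_tq_given_times(tq, doc_times)
        p_background = p_tq_given_times(tq, collection_times)
        probability *= (1 - lam) * p_doc + lam * p_background
    return probability


# Intervals as (start_year, end_year); hypothetical documents d1 and d2.
d1_times = [(2011, 2011), (2010, 2010)]
d2_times = [(2005, 2005)]
collection = d1_times + d2_times
query_times = [(2011, 2011)]

print(temporal_likelihood(query_times, d1_times, collection))
print(temporal_likelihood(query_times, d2_times, collection))
```

Thanks to the smoothing, d2 receives a small non-zero probability even though it never mentions the query's time interval.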
44. Comparison of Time-aware Ranking
• Five time-aware ranking models
– LMT [Berberich et al., ECIR 2010]
– LMTU [Berberich et al., ECIR 2010]
– TS [Kanhabua et al., ECDL 2010]
– TSU [Kanhabua et al., ECDL 2010]
– FuzzySet [Kalczynski et al., Inf. Process. 2005]
[Kanhabua et al., SIGIR 2011a]
46. Problem Statements
• Queries of named entities (people, companies, places)
– Highly dynamic in appearance, i.e., relationships between terms change over time
– E.g., changes of roles, name alterations, or semantic shift
Named Entity Evolution
Scenario 1
Query: “Pope Benedict XVI” and written before 2005
Documents about “Joseph Alois Ratzinger” are relevant
Scenario 2
Query: “Hillary R. Clinton” and written from 1997 to 2002
Documents about “New York Senator” and “First Lady of the United States” are relevant
47. Examples of Name Changes
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
48. Current Approaches
• Temporal co-occurrence
• Mining Wikipedia revisions
• Temporal association rule mining
[Berberich et al., WebDB 2009; Kanhabua et al., JCDL 2010]
[Kaluarachchi et al., CIKM 2010; Tahmasebi et al., COLING 2012]
49. Query Prediction Problems
1. Performance prediction
– Predict the retrieval effectiveness with respect to a ranking model
[Figure: query → predict → precision = ?, recall = ?, MAP = ?]
[Cronen-Townsend et al., SIGIR 2002; Diaz et al., SIGIR 2004]
[Hauff et al., ECIR 2010; Carmel et al., 2010; Kanhabua et al., SIGIR 2011b]
50. Query Prediction Problems
1. Performance prediction
– Predict the retrieval effectiveness with respect to a ranking model
2. Retrieval model prediction
– Predict the retrieval model that is most suitable
[Figure: query → predict → ranking = ?, with max(precision), max(recall), max(MAP)]
[Peng et al., ECIR 2010; Kanhabua et al., SIGIR 2012]
51. References
• [Alonso et al., SIGIR Forum 2007] Omar Alonso, Michael Gertz, Ricardo A. Baeza-Yates: On the
value of temporal information in information retrieval. SIGIR Forum 41(2): 35-41 (2007)
• [Baeza-Yates, SIGIR Forum 2005] Ricardo A. Baeza-Yates: Searching the future. SIGIR workshop
MF/IR 2005
• [Berberich et al., WebDB 2009] Klaus Berberich, Srikanta J. Bedathur, Mauro Sozio, Gerhard
Weikum: Bridging the Terminology Gap in Web Archive Search. WebDB 2009
• [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum:
A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
• [Campos et al., CIKM 2012] Ricardo Campos, Gaël Dias, Alípio Jorge, Celia Nunes: GTE: A
Distributional Second-Order Co-Occurrence Approach to Improve the Identification of Top Relevant
Dates in Web Snippets. CIKM 2012
• [Carmel et al., 2010] David Carmel, Elad Yom-Tov: Estimating the Query Difficulty for Information
Retrieval. Morgan & Claypool Publishers 2010
• [Chen et al., SIGIR 2010] Zhumin Chen, Jun Ma, Chaoran Cui, Hongxing Rui, Shaomang Huang: Web
page publication time detection and its application for page rank. SIGIR 2010: 859-860
• [Cronen-Townsend et al., SIGIR 2002] Stephen Cronen-Townsend, Yun Zhou, W. Bruce Croft:
Predicting query performance. SIGIR 2002: 299-306
• [Diaz et al., SIGIR 2004] Fernando Diaz, Rosie Jones: Using temporal profiles of queries for precision
prediction. SIGIR 2004: 18-24
• [Dumais, SIAM-SDM 2012] Susan T. Dumais: Temporal Dynamics and Information Retrieval. SIAM-SDM 2012
52. References (cont’)
• [Hauff et al., ECIR 2010] Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, Franciska de Jong: Query
Performance Prediction: Evaluation Contrasted with Effectiveness. ECIR 2010: 204-216
• [de Jong et al., AHC 2005] Franciska de Jong, Henning Rode, Djoerd Hiemstra: Temporal language
models for the disclosure of historical text. AHC 2005: 161-168
• [Kaluarachchi et al., CIKM 2010] Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J.
Bedathur, Gerhard Weikum, Jing Peng, Anna Feldman: Incorporating terminology evolution for query
translation in text retrieval with association rules. CIKM 2010: 1789-1792
• [Kalczynski et al., Inf. Process. 2005] Pawel Jan Kalczynski, Amy Chou: Temporal Document
Retrieval Model for business news archives. Inf. Process. Manage. 41(3): 635-650 (2005)
• [Kanhabua et al., JCDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Exploiting time-based synonyms in
searching document archives. JCDL 2010: 79-88
• [Kanhabua et al., ECDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Determining Time of Queries for Re-ranking Search Results. ECDL 2010: 261-272
• [Kanhabua et al., SIGIR 2011a] Nattiya Kanhabua, Kjetil Nørvåg: A comparison of time-aware ranking
methods. SIGIR 2011: 1257-1258
• [Kanhabua et al., SIGIR 2011b] Nattiya Kanhabua, Kjetil Nørvåg: Time-based query performance
predictors. SIGIR 2011: 1181-1182
• [Kanhabua et al., SIGIR 2012] Nattiya Kanhabua, Klaus Berberich, Kjetil Nørvåg: Learning to select a
time-aware retrieval model. SIGIR 2012
• [Kanhabua et al., TAIA 2012] Nattiya Kanhabua, Sara Romano, Avaré Stewart: Identifying Relevant
Temporal Expressions for Real-World Events. Time-aware Information Access Workshop 2012
53. References (cont’)
• [Ke et al., CN 2006] Yiping Ke, Lin Deng, Wilfred Ng, Dik Lun Lee: Web dynamics and their
ramifications for the development of Web search engines. Computer Networks 50(10): 1430-1447
(2006)
• [Kraaij, SIGIR Forum 2005] Wessel Kraaij: Variations on language modeling for information
retrieval. SIGIR Forum 39(1): 61 (2005)
• [Li et al., CIKM 2003] Xiaoyan Li, W. Bruce Croft: Time-based language models. CIKM 2003:
469-475
• [Nunes et al., WIDM 2007] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Using neighbors to
date web documents. WIDM 2007: 129-136
• [Metzler et al., SIGIR 2009] Donald Metzler, Rosie Jones, Fuchun Peng, Ruiqiang Zhang:
Improving search relevance for implicitly temporal queries. SIGIR 2009: 700-701
• [Mazeika et al., CIKM 2011] Arturas Mazeika, Tomasz Tylenda, Gerhard Weikum: Entity
timelines: visual analytics and named entity evolution. CIKM 2011: 2585-2588
• [Peng et al., ECIR 2010] Jie Peng, Craig Macdonald, Iadh Ounis: Learning to Select a Ranking
Function. ECIR 2010: 114-126
• [Risvik et al., CN 2002] Knut Magne Risvik, Rolf Michelsen: Search engines and Web dynamics.
Computer Networks 39(3): 289-302 (2002)
• [Shokouhi, SIGIR 2011] Milad Shokouhi: Detecting Seasonal Queries by Time-Series Analysis.
SIGIR 2011: 1171-1172
• [Strötgen et al., SemEval 2010] Jannik Strötgen, Michael Gertz: Heideltime: High quality rule-based extraction and normalization of temporal expressions. SemEval 2010: 321-324
54. References (cont’)
• [Strötgen et al., TempWeb 2012] Jannik Strötgen, Omar Alonso, Michael Gertz: Identification of
top relevant temporal expressions in documents. Temporal Web Workshop 2012.
• [Tahmasebi et al., COLING 2012] Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge
Holzmann, Thomas Risse: NEER: An Unsupervised Method for Named Entity Evolution
Recognition. COLING 2012
• [Verhagen et al., ACL 2005] Marc Verhagen, Inderjeet Mani, Roser Sauri, Jessica Littman,
Robert Knippen, Seok Bae Jang, Anna Rumshisky, John Phillips, James Pustejovsky: Automating
Temporal Annotation with TARSQI. ACL 2005
• [WebDyn 2010] Web Dynamics course: http://www.mpi-inf.mpg.de/departments/d5/teaching/ss10/dyn/, Max-Planck Institute for Informatics, Saarbrücken, Germany, 2010
• [Zhang et al., EMNLP 2010] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang,
Zhaohui Zheng: Learning Recurrent Event Queries for Web Search. EMNLP 2010: 1129-1139