In this talk, we will give a survey of current approaches to searching the
temporal web. In such a web collection, the contents are created and/or
edited over time; examples are web archives, news archives, blogs,
micro-blogs, personal emails and enterprise documents. Unfortunately,
traditional IR approaches based only on term matching can give
unsatisfactory results when searching the temporal web. The reasons are
manifold: 1) the collection is strongly time-dependent, i.e., it contains
multiple versions of documents, 2) the contents of documents are about
events that happened at particular time periods, 3) the meanings of semantic
annotations can change over time, and 4) a query representing an information
need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed,
namely, 1) How to understand temporal search intent represented by
time-sensitive queries? 2) How to handle the temporal dynamics of queries
and documents? and 3) How to explicitly model temporal information in retrieval and ranking models? To this end, we will present current approaches to these problems and outline directions for future research.
Temporal Web Dynamics and Implications for Information Retrieval
1. Temporal Web Dynamics
Implications for Information Retrieval
Nattiya Kanhabua
1st ALEXANDRIA Workshop
L3S Research Center, Hannover, Germany
15 September 2014
2. Outline
• What are temporal web dynamics?
• Why do the dynamics impact search?
• Overview of time-aware approaches
– Temporal Information Extraction
– Temporal Query Analysis
– Time-aware Retrieval and Ranking
• Conclusion and outlook
3. Temporal Web Dynamics
• The Web changes over time in many aspects,
e.g., size, content, structure and how it is
accessed through user interactions or queries.
– Size: web pages are added/deleted all the time
– Content: web pages are edited/modified
– Query: users' information needs change
[Ke et al., CN 2006; Risvik et al., CN 2002]
[Dumais, SIAM-SDM 2012; WebDyn 2010]
5. Changes in User Behavior
Implications: Query Analysis, Ranking
Fig. 2 Categorization of queries with temporal information needs.
http://www.google.com/insights/search
6. Temporal Query Examples
• A temporal query consists of:
– Query keywords
– Temporal expressions
• A document consists of:
– Terms, i.e., bag-of-words
– Publication time and temporal expressions
[Berberich et al., ECIR 2010]
9. Two Time Aspects
Two time dimensions
1. Publication or modified time
2. Content or event time
content time
publication time
10. Problem Statements
• It is difficult to find a trustworthy timestamp for web documents
– Time gap between crawling and indexing
– Decentralization and relocation of web documents
– No standard metadata for time/date
Document Dating
“I found a bible-like document, but I have no idea when it was created.”
“Let me see… This document was probably written in 850 A.C., with 95% confidence.”
“For a given document with an uncertain timestamp, can the contents be used to
determine the timestamp with sufficiently high confidence?”
11. Probabilistic Approach
Temporal Language Models
• Based on the statistical usage of words over time
• Compare each word of a non-timestamped document with a reference corpus
• Tentative timestamp: the time partition whose word usage overlaps most with the document
Timestamp   Word         Freq
1999        tsunami      1
1999        Japan        1
1999        tidal wave   1
2004        tsunami      1
2004        Thailand     1
2004        earthquake   1
A non-timestamped document: tsunami, Thailand
Similarity Scores
Score(1999) = 1
Score(2004) = 1 + 1 = 2
Most likely timestamp is 2004
[de Jong et al., AHC 2005; Kraaij, SIGIR Forum 2005; Kanhabua et al., ECDL 2008]
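The toy example on this slide can be reproduced with a minimal sketch. It assumes the simplest possible scoring, namely summing the reference frequencies of the document's words per time partition, exactly as in the slide's Score(1999) and Score(2004). The cited papers use more refined scoring functions; the function names below are hypothetical.

```python
from collections import Counter, defaultdict

# Toy reference corpus from the slide: (time partition, word, frequency).
REFERENCE = [
    (1999, "tsunami", 1),
    (1999, "Japan", 1),
    (1999, "tidal wave", 1),
    (2004, "tsunami", 1),
    (2004, "Thailand", 1),
    (2004, "earthquake", 1),
]


def build_models(rows):
    """Group word frequencies by time partition (one Counter per partition)."""
    models = defaultdict(Counter)
    for partition, word, freq in rows:
        models[partition][word] += freq
    return models


def score_partitions(doc_words, models):
    """Assumed scoring: sum the reference frequency of each document word."""
    return {p: sum(m[w] for w in doc_words) for p, m in models.items()}


def date_document(doc_words, models):
    """Return the best-scoring partition together with all scores."""
    scores = score_partitions(doc_words, models)
    return max(scores, key=scores.get), scores


models = build_models(REFERENCE)
best, scores = date_document(["tsunami", "Thailand"], models)
print(best, scores)
```

With the slide's data, 2004 scores 2 and 1999 scores 1, so 2004 is chosen as the tentative timestamp.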
12. Extracting Content Time
• How to determine which temporal expressions tagged in a document are relevant?
– Not all temporal expressions associated with an event
are equally relevant
• Approaches: machine learning; rule-based
Reported by World Health Organization (WHO) on
29 July 2012 about an ongoing Ebola outbreak
in Uganda since the beginning of July 2012
[Kanhabua et al., TAIA 2012; Strötgen et al., TempWeb 2012; Hoffart et al., AIJ 2012]
14. Temporal Queries
• Temporal queries exist in
the Web and archives
– Relevance depends on time
– Documents are about events at
a particular time
– Users: historians, librarians or
journalists
[Li et al., CIKM 2003; Jones and Diaz, ACM TOIS 2007; Berberich et al., ECIR 2010;
Peetz et al., Information Retrieval 2014]
15. Challenges
• Searching temporal document collections
– E.g., digital libraries, web/news archives
• Problems: semantic gaps or lacking knowledge
1. possibly relevant time of queries
2. terminology changes over time
17. Challenges
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
[Figure: for a query, the system suggests candidate times time1, time2, …, timek]
How to determine the time of an implicit temporal query?
19. Query Log Analysis
• Mining query logs
– Analyze query frequencies over time to identify
the relevant time of queries
– Re-rank search results of implicit temporal queries
using the determined time
[Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]
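As a rough illustration of the first step, the sketch below counts how often a query was issued per month and takes the most frequent months as its candidate relevant times. The log entries, the "YYYY-MM" granularity and the function names are assumptions made for this example; the cited papers use more elaborate models.

```python
from collections import Counter

# Hypothetical query log: (query string, month of submission as "YYYY-MM").
LOG = [
    ("tsunami", "2004-12"), ("tsunami", "2004-12"), ("tsunami", "2004-12"),
    ("tsunami", "2005-01"), ("tsunami", "2005-01"),
    ("tsunami", "2006-07"),
    ("world cup", "2006-07"),
]


def time_profile(log, query):
    """Count how often the query was issued in each period."""
    return Counter(period for q, period in log if q == query)


def relevant_times(log, query, k=2):
    """Return the k periods with the highest query frequency."""
    return [period for period, _ in time_profile(log, query).most_common(k)]


times = relevant_times(LOG, "tsunami")
print(times)
```

The resulting periods could then serve as the determined time when re-ranking results for the implicit temporal query.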
20. Search Result Analysis
• Use temporal bursts for query
modeling
– Identify temporal bursts in the ranked
lists of documents
– Sample terms from the documents and
update the query model
• Use temporal language models
– Determine tentative time for a query
– Re-rank search results using the
determined time
[Kanhabua et al., ECDL 2010; Peetz et al., Information Retrieval 2014]
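The burst-based idea can be sketched as follows. This is a simplified illustration, not the method of the cited papers: a period counts as a burst when its number of retrieved documents exceeds the mean plus one standard deviation over the observed periods (periods with no documents are ignored), and expansion terms are simply the most frequent terms in burst documents. The data, threshold and function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical top-ranked results: (publication year, document text).
DOCS = [
    (2004, "tsunami thailand earthquake"),
    (2004, "tsunami relief thailand"),
    (2004, "tsunami warning"),
    (2004, "tsunami aid"),
    (2004, "tsunami thailand"),
    (2004, "tsunami victims"),
    (2001, "tidal wave"),
    (2002, "wave energy"),
    (2003, "storm surge"),
    (2005, "tsunami anniversary"),
]


def find_bursts(periods, threshold=1.0):
    """Periods whose document count exceeds mean + threshold * std."""
    counts = Counter(periods)
    values = list(counts.values())
    cutoff = mean(values) + threshold * pstdev(values)
    return sorted(p for p, c in counts.items() if c > cutoff)


def burst_terms(docs, bursts, k=2):
    """Most frequent terms in documents published during a burst."""
    terms = Counter()
    for period, text in docs:
        if period in bursts:
            terms.update(text.lower().split())
    return [term for term, _ in terms.most_common(k)]


bursts = find_bursts([period for period, _ in DOCS])
expansion = burst_terms(DOCS, bursts)
print(bursts, expansion)
```

The selected terms would then be used to update the query model before a second retrieval round.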
22. Re-rank Search Results
• Intuition: documents published close to the
time of queries are more relevant
– Assign document priors based on publication dates
[Figure: for a query over a news archive, the determined times are 2005, 2004, 2006, ...; D2009 appears in the initial retrieved results and D2005 in the re-ranked results]
[Kanhabua et al., ECDL 2010]
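One way to picture the re-ranking step is the sketch below. It assumes a deliberately simple prior that is not taken from the cited paper: a document whose publication year is among the determined query times gets a prior of 1/(rank + 1), any other document gets a small default prior, and the final score is the retrieval score times the prior. Document identifiers, scores and function names are hypothetical.

```python
def publication_prior(pub_year, query_times, default=0.1):
    """Hypothetical prior: higher for years ranked earlier in query_times."""
    if pub_year in query_times:
        return 1.0 / (query_times.index(pub_year) + 1)
    return default


def rerank(results, query_times):
    """results: list of (doc_id, publication year, retrieval score)."""
    rescored = [
        (doc_id, score * publication_prior(year, query_times))
        for doc_id, year, score in results
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Initial ranking puts the 2009 document first.
initial = [("D2009", 2009, 0.9), ("D2005", 2005, 0.8), ("D2004", 2004, 0.7)]
reranked = rerank(initial, [2005, 2004, 2006])
print([doc_id for doc_id, _ in reranked])
```

With the determined times 2005, 2004, 2006, the document published in 2005 moves ahead of the one published in 2009.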
23. Challenges
• Semantic gaps: lacking knowledge about
1. Possibly relevant time of queries
2. Named entity changes over time
[Figure: for a query, the system suggests time-dependent synonyms synonym@2001, synonym@2002, …, synonym@2011]
26. Problem Statements
• Queries of named entities (people, companies, places)
– Highly dynamic in appearance, i.e., relationships between
terms change over time
– E.g., changes of roles, name alterations, or semantic shift
Named Entity Evolution
Scenario 1
Query: “Pope Benedict XVI” and written before 2005
Documents about “Joseph Alois Ratzinger” are relevant
Scenario 2
Query: “Hillary R. Clinton” and written from 1997 to 2002
Documents about “New York Senator” and “First Lady of
the United States” are relevant
28. Find Temporal Synonyms
• Extract time-based synonyms from Wikipedia
• Find a set of entity-synonym relationships at time t_k
• For each e_i ∈ E_tk, extract anchor texts from article
links:
– Entity: President_of_the_United_States
– Synonym: George W. Bush
– Time: 11/2004
[Figure: article President_of_the_United_States with the anchor texts “George W. Bush”, “President George W. Bush” and “President Bush (43)”]
[Kanhabua et al., JCDL 2010]
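The entity-synonym-time triples can be pictured with a small sketch. It assumes that links have already been extracted from Wikipedia snapshots as (snapshot time, anchor text, target article) tuples; that input format, the second data row and the function names are hypothetical and only serve to show how anchor texts become time-dependent synonyms.

```python
from collections import defaultdict

# (snapshot time, anchor text, target article). The first row is the example
# from the slide; the second row is a hypothetical addition for illustration.
SNAPSHOT_LINKS = [
    ("11/2004", "George W. Bush", "President_of_the_United_States"),
    ("11/1998", "Bill Clinton", "President_of_the_United_States"),
]


def build_synonym_index(links):
    """Map (entity, time) to the set of anchor texts pointing at the entity."""
    index = defaultdict(set)
    for time, anchor, entity in links:
        index[(entity, time)].add(anchor)
    return index


def synonyms_at(index, entity, time):
    """Synonyms of an entity at a given snapshot time (empty set if unknown)."""
    return index.get((entity, time), set())


index = build_synonym_index(SNAPSHOT_LINKS)
print(synonyms_at(index, "President_of_the_United_States", "11/2004"))
```

A query for the entity restricted to a time period could then be expanded with the synonyms valid in that period.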
31. Searching the Past
• Time must be explicitly modeled in order to
increase the effectiveness of ranking
– To order search results so that the most relevant ones
are ranked higher
[Figure: “temporal document collections”: web archives, news archives, blogs, emails]
Example: retrieve documents about Pope Benedict XVI written before 2005
Term-based IR approaches may give unsatisfactory results
32. Query/Document Models
• A temporal query consists of:
– Query keywords
– Temporal expressions
• A document consists of:
– Terms, i.e., bag-of-words
– Publication time and temporal expressions
33. Time-aware Ranking Models
• Two main approaches
1. Mixture model [Kanhabua et al., ECDL 2010]
• Linearly combining textual and temporal similarity
2. Probabilistic model [Berberich et al., ECIR 2010]
• Generating a query from the textual part and temporal part
of a document independently
35. Mixture Model
• Linearly combine textual and temporal similarity
– α weights the relative importance of the two similarity scores
• Both scores are normalized before combining
– Textual similarity can be determined using any term-based
retrieval model
• E.g., tf.idf or a unigram language model
How to determine temporal similarity?
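The linear combination can be sketched as below. The slide text does not preserve the formula, so two points are assumptions: α is taken to weight the temporal score and (1 - α) the textual score, and min-max scaling stands in for the unspecified normalization. All names and scores are hypothetical.

```python
def min_max(scores):
    """Scale a dict of scores to [0, 1]; all-equal scores map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def mixture_scores(text_scores, time_scores, alpha=0.5):
    """Assumed form: (1 - alpha) * textual + alpha * temporal, after scaling."""
    text_n, time_n = min_max(text_scores), min_max(time_scores)
    return {
        doc: (1 - alpha) * text_n[doc] + alpha * time_n[doc]
        for doc in text_n
    }


# Hypothetical scores: d1 matches the keywords better, d2 matches the time better.
text = {"d1": 2.0, "d2": 1.0}
time = {"d1": 0.0, "d2": 1.0}
print(mixture_scores(text, time, alpha=0.0))
print(mixture_scores(text, time, alpha=1.0))
```

Setting alpha to 0 reduces the ranking to pure textual similarity, and alpha of 1 ranks by temporal similarity alone.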
36. Temporal Similarity
• Assume that temporal expressions in the query are
generated independently from a two-step
generative model:
– P(tq|td) can be estimated based on publication time
using an exponential decay function [Kanhabua et al.,
ECDL 2010]
– Linear interpolation smoothing is applied to eliminate
zero probabilities
• I.e., for a temporal expression tq unseen in d
[Figure: similarity score plotted over time; documents d1 and d2 lie at distances Dist(d1,q) and Dist(d2,q) from the query time <q>]
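A minimal sketch of such a temporal similarity follows. The slide does not preserve the formulas, so the concrete form is an assumption: times are plain numbers (years), P(tq|td) decays exponentially with the distance |tq - td|, a document's probability for tq averages over its time points, and linear interpolation with a constant background probability removes zeros. The parameters `decay_rate`, `unit`, `lam` and `background` are hypothetical.

```python
import math


def p_tq_given_td(tq, td, decay_rate=0.5, unit=1.0):
    """Assumed exponential decay: halves for every `unit` of distance."""
    return decay_rate ** (abs(tq - td) / unit)


def p_tq_given_doc(tq, doc_times, lam=0.1, background=0.01):
    """Average decay over the document's times, smoothed by interpolation."""
    if doc_times:
        foreground = sum(p_tq_given_td(tq, td) for td in doc_times) / len(doc_times)
    else:
        foreground = 0.0  # no time information: only the background remains
    return (1 - lam) * foreground + lam * background


def temporal_similarity(query_times, doc_times):
    """Query time expressions are generated independently, so multiply."""
    score = 1.0
    for tq in query_times:
        score *= p_tq_given_doc(tq, doc_times)
    return score


near = temporal_similarity([2005], [2005])
far = temporal_similarity([2005], [2009])
print(near, far)
```

A document published at the query time scores highest, one published four years away scores much lower, and a document without any time information still receives a small non-zero score through smoothing.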
37. Conclusion and Outlook
• Temporal web dynamics and their impact
• State-of-the-art temporal IR techniques
• Future work:
– Search in versioned document collections
– Efficient methods for document processing
– Effective retrieval and ranking, e.g., return
aggregated results or summaries
– Support exploratory search in Web archives
38. References
• [Berberich et al., WebDB 2009] Klaus Berberich, Srikanta J. Bedathur, Mauro Sozio, Gerhard Weikum: Bridging the Terminology Gap in Web Archive Search. WebDB 2009
• [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
• [Dumais, SIAM-SDM 2012] Susan T. Dumais: Temporal Dynamics and Information Retrieval. SIAM-SDM 2012
• [de Jong et al., AHC 2005] Franciska de Jong, Henning Rode, Djoerd Hiemstra: Temporal language models for the disclosure of historical text. AHC 2005: 161-168
• [Kaluarachchi et al., CIKM 2010] Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J. Bedathur, Gerhard Weikum, Jing Peng, Anna Feldman: Incorporating terminology evolution for query translation in text retrieval with association rules. CIKM 2010: 1789-1792
• [Kanhabua et al., JCDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010: 79-88
• [Kanhabua et al., ECDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Determining Time of Queries for Re-ranking Search Results. ECDL 2010: 261-272
• [Kanhabua et al., TAIA 2012] Nattiya Kanhabua, Sara Romano, Avaré Stewart: Identifying Relevant Temporal Expressions for Real-World Events. Time-aware Information Access Workshop 2012
• [Ke et al., CN 2006] Yiping Ke, Lin Deng, Wilfred Ng, Dik Lun Lee: Web dynamics and their ramifications for the development of Web search engines. Computer Networks 50(10): 1430-1447 (2006)
39. References (cont'd)
• [Metzler et al., SIGIR 2009] Donald Metzler, Rosie Jones, Fuchun Peng, Ruiqiang Zhang: Improving search relevance for implicitly temporal queries. SIGIR 2009: 700-701
• [Nunes et al., ECIR 2008] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Use of Temporal Expressions in Web Search. ECIR 2008: 580-584
• [Peetz et al., Information Retrieval 2014] Maria-Hendrike Peetz, Edgar Meij, Maarten de Rijke: Using temporal bursts for query modeling. Information Retrieval 17(1): 74-108 (2014)
• [Risvik et al., CN 2002] Knut Magne Risvik, Rolf Michelsen: Search engines and Web dynamics. Computer Networks 39(3): 289-302 (2002)
• [Shokouhi, SIGIR 2011] Milad Shokouhi: Detecting Seasonal Queries by Time-Series Analysis. SIGIR 2011: 1171-1172
• [Strötgen et al., TempWeb 2012] Jannik Strötgen, Omar Alonso, Michael Gertz: Identification of top relevant temporal expressions in documents. Temporal Web Workshop 2012
• [Tahmasebi et al., COLING 2012] Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge Holzmann, Thomas Risse: NEER: An Unsupervised Method for Named Entity Evolution Recognition. COLING 2012
• [WebDyn 2010] Web Dynamics course: http://www.mpi-inf.mpg.de/departments/d5/teaching/ss10/dyn/, Max-Planck Institute for Informatics, Saarbrücken, Germany, 2010
• [Zhang et al., EMNLP 2010] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang, Zhaohui Zheng: Learning Recurrent Event Queries for Web Search. EMNLP 2010: 1129-1139
Editor's Notes
The Web evolves over time and shows temporal dynamics in many aspects.
With Google Insights for Search, you can compare search volume patterns across specific regions, categories, time frames and properties.
Note that the actual value of any time point, e.g., tbl, tbu, tel, or teu, is an integer: the number of time units (e.g., milliseconds or days) that have passed since (or remain until) a reference point in time (e.g., the UNIX epoch).
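The integer representation of time points can be illustrated with Python's standard `datetime` module. This is only a sketch of the convention described in the note; the helper names are hypothetical, and days and milliseconds are just two possible time units.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DATETIME = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_since_epoch(d):
    """Whole days between the UNIX epoch and d (negative before the epoch)."""
    return (d - EPOCH_DATE).days


def millis_since_epoch(dt):
    """Whole milliseconds between the UNIX epoch and a timezone-aware dt."""
    return (dt - EPOCH_DATETIME) // timedelta(milliseconds=1)


print(days_since_epoch(date(2014, 9, 15)))
print(millis_since_epoch(datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)))
```

Once time points are integers in a common unit, interval boundaries can be compared and subtracted with ordinary arithmetic.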