We address major challenges in searching temporal document collections. In such collections, documents are created and/or edited over time. Examples of temporal document collections are web archives, news archives, blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching only can give unsatisfactory results when searching temporal document collections. The reason for this is twofold: the contents of documents are strongly time-dependent, i.e., documents are about events happened at particular time periods, and a query representing an information need can be time-dependent as well, i.e., a temporal query. Our contributions are different time-aware approaches within three topics in IR: content analysis, query analysis, and retrieval and ranking models. In particular, we aim at improving the retrieval effectiveness by 1) analyzing the contents of temporal document collections, 2) performing an analysis of temporal queries, and 3) explicitly modeling the time dimension into retrieval and ranking.
Leveraging the time dimension in ranking can improve the retrieval effectiveness if information about the creation or publication time of documents is available. We analyze the contents of documents in order to determine the time of non-timestamped documents using temporal language models. We subsequently employ the temporal language models for determining the time of implicit temporal queries, and the determined time is used for re-ranking search results in order to improve the retrieval effectiveness. We study the effect of terminology changes over time and propose an approach to handling terminology changes using time-based synonyms.
In addition, we propose different methods for predicting the effectiveness of temporal queries, so that a particular query enhancement technique can be performed to improve the overall performance. When the time dimension is incorporated into ranking, documents will be ranked according to both textual and temporal similarity. In this case, time uncertainty should also be taken into account. Thus, we propose a ranking model that considers the time uncertainty, and improve ranking by combining multiple features using learning-to-rank techniques. Through extensive evaluation, we show that our proposed time-aware approaches outperform traditional retrieval methods and improve the retrieval effectiveness in searching temporal document collections.
Pollinator Ambassador Earth Steward Day Presentation 2024-05-22
Time-aware Approaches to Information Retrieval
1. Time-aware Approaches to
Information Retrieval
Nattiya Kanhabua
Department of Computer and Information Science
Norwegian University of Science and Technology
24 February 2012
2. Nattiya Kanhabua 2PhD defense
• Searching documents created/edited over time
– E.g., web archives, news archives, blogs, or emails
Motivation
Web
archives
news
archives
blogs emails
“temporal document
collections”
Retrieve documents about
Pope Benedict XVI written
before 2005
Term-based IR approaches
may give unsatisfied results
3. Nattiya Kanhabua 3PhD defense
• A web archive search tool by the Internet Archive
– Query by a URL, e.g., http://www.ntnu.no
Wayback Machine1
No keyword query
No relevance ranking
1Retrieved on 15 January 2011
4. Nattiya Kanhabua 4PhD defense
• A news archive search tool by Google
– Query by keywords
– Rank results by relevance or date
Google News Archive Search
Not consider
terminology
changes over time
5. Nattiya Kanhabua 5PhD defense
• Study problems of temporal search
• Propose approaches to solve the problems
• Main research question
“How to exploit temporal information in documents, queries,
and external sources in order to improve the retrieval
effectiveness?”
Objective of PhD thesis
6. Nattiya Kanhabua 6PhD defense
Part I - Content Analysis
RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis
RQ2: How to determine time of queries?
RQ3: How to handle terminology changes over time?
RQ4: How to predict the effectiveness of temporal queries?
RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models
RQ6: How to model time into retrieval and ranking?
RQ7: How to combine different features and time for ranking?
Outline contributions
7. PART I - CONTENT ANALYSIS
PhD defense Nattiya Kanhabua 7
8. Nattiya Kanhabua 8PhD defense
Problem Statements
• Difficult to find the trustworthy time for web documents
– Time gap between crawling and indexing
– Decentralization and relocation of web documents
– No standard metadata for time/date
RQ1: Determining time of documents
I found a bible-like
document. But I have
no idea when it was
created?
Let’s me see…
This document is probably
written in 850 A.C.
with 95% confidence.
“ For a given document with uncertain timestamp,
can the contents be used to determine the timestamp
with a sufficiently high confidence? ”
9. Nattiya Kanhabua 9PhD defense
Preliminaries
Partition Word
1999 tsunami
1999 Japan
1999 tidal wave
2004 tsunami
2004 Thailand
2004 earthquake
Temporal Language Models
tsunami
Thailand
A non-timestamped
document
Similarity Scores
Score(1999) = 1
Score(2004) = 1 + 1 = 2 Most likely timestamp is 2004
Temporal Language Models
[de Jong 2005]
• Based on the statistic usage
of words over time
• Compare each word of a
non-timestamped document
with a reference corpus
• Tentative timestamp -- a
time partition mostly
overlaps in word usage
10. Nattiya Kanhabua 10PhD defense
Improving document dating
Three enhancement techniques:
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-
Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for
Digital Libraries (ECDL), 2008.
11. Nattiya Kanhabua 11PhD defense
Improving document dating
Three enhancement techniques:
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
Intuition: Direct comparison between extracted words
and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into
document preprocessing
12. Nattiya Kanhabua 12PhD defense
Improving document dating
Three enhancement techniques:
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
Intuition: Direct comparison between extracted words
and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into
document preprocessing
Intuition: Search statistics Google Zeitgeist (GZ) can
increase the probability of a tentative time partition
Approach: Linearly combine a GZ score with the
normalized log-likelihood ratio
13. Nattiya Kanhabua 13PhD defense
Improving document dating
Three enhancement techniques:
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
Intuition: Direct comparison between extracted words
and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into
document preprocessing
Intuition: Search statistics Google Zeitgeist (GZ) can
increase the probability of a tentative time partition
Approach: Linearly combine a GZ score with the
normalized log-likelihood ratio
Intuition: A term weight depends on how good the
term is for separating time partitions (discriminative)
Approach: Propose temporal entropy, based on a term
selection presented in Lochbaum and Streeter
14. Nattiya Kanhabua 14PhD defense
Experiments
• Collection
– 9,000 documents collected from the Internet Archive
– 8 years time span, 15 news sources
– Randomly select 1,000 documents for testing
• Results
– Proposed techniques gain improvement over the baseline
• Precision = the fraction of documents correctly dated
• Open issue
– The effectiveness of document dating is still limited
• Highly dependent on the quality of a reference corpus
15. PART II - QUERY ANALYSIS
PhD defense Nattiya Kanhabua 15
16. Nattiya Kanhabua 16PhD defense
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
Challenges with temporal queries
17. Nattiya Kanhabua 17PhD defense
Challenges with temporal queries
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
query
time1
time2
…
timek
suggest
18. Nattiya Kanhabua 18PhD defense
Challenges with temporal queries
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
query
time1
time2
…
timek
suggest
19. Nattiya Kanhabua 19PhD defense
Challenges with temporal queries
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
query
synonym@2001
synonym@2002
…
synonym@2011
suggest
20. RQ2: Determining time of queries
PhD defense Nattiya Kanhabua 20
Problem Statements
• 1.5% of web queries are explicitly provided with temporal
expression [Nunes 2008]
– Time is a part of query, “U.S. Presidential election 2008”
• About 7% of web queries have temporal intent implicitly
provided [Metzler 2009]
– Time is not given in queries, e.g., “Germany World Cup” or
“tsunami”
– Difficult to achieve high accuracy using only keywords
– Relevant documents associated to particular time not given
21. Nattiya Kanhabua 21PhD defense
1. Determining the time of queries when no time is given
2. Re-ranking search results using the determined time
Our contributions
Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In
Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries
(ECDL), 2010.
22. Nattiya Kanhabua 22PhD defense
Approach I. Dating using keywords*
Approach II. Dating using top-k documents*
– Queries are short keywords
– Inspired by pseudo-relevance feedback
Approach III. Using timestamp of top-k documents
– No temporal language models are used
*Using Temporal Language Models proposed by de Jong et al.
Determining time of queries
23. Nattiya Kanhabua 23PhD defense
• Intuition: documents published closely to the time of
queries are more relevant
– Assign document priors based on publication dates
Re-ranking search results
query
News archive
Determine time 2005, 2004, 2006, ...
D2009
Initial retrieved results
D2005
Re-ranked results
24. Nattiya Kanhabua 24PhD defense
Determining the time of queries
• Collection
– NYT Corpus contains over 1.8M (1987-2007)
– 30 time-sensitive queries from the TREC Robust2004
• Results
– The smaller top-k, the better precision (k=5 > k=10 > k=15)
– The larger g (granularity), the better precision (g=12-month > g=6-month)
Experiments: Part 1 Precision = the fraction of
queries correctly dated
25. Nattiya Kanhabua 25PhD defense
Re-ranking of search results
• Collection
– TREC Robust2004, 30 time-sensitive queries
– NYT Corpus, 24 queries from Google zeitgeist
• Results
– Approach III (no TMLs) outperforms all other approaches
• Using publication dates is more accurate than the dating process
• Open issue
– Time can improve the effectiveness (if the query dating is improved
with a higher accuracy)
Experiments: Part 2
26. Nattiya Kanhabua 26PhD defense
Challenges of temporal search
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
query
synonym@2001
synonym@2002
…
synonym@2011
suggest
27. Nattiya Kanhabua 27PhD defense
Problem Statements
• Queries composed of named entities (people, organization,
location)
– Highly dynamic in appearance, i.e., relationships between terms
changes over time
– E.g. changes of roles, name alterations, or semantic shift
RQ3: Handling terminology changes
Scenario 1
Query: “Pope Benedict XVI” and written before 2005
Documents about “Joseph Alois Ratzinger” are relevant
Scenario 2
Query: “Hillary R. Clinton” and written from 1997 to 2002
Documents about “New York Senator” and “First Lady of
the United States” are relevant
29. Nattiya Kanhabua 29PhD defense
• Discover time-based synonyms over time using Wikipedia
– Generally, synonyms are words with similar meanings
– This work refers synonyms as alternative names of an entity
• Improve the accuracy of time of synonyms
• Query expansion using time-based synonyms
Our contributions
Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In
Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
33. Nattiya Kanhabua 33PhD defense
Find synonyms
• Find a set of entity-synonym relationships at time tk
• For each ei ϵ Etk , extract anchor texts from article links:
– Entity: President_of_the_United_States
– Synonym: George W. Bush
– Time: 11/2004
President_of_the_
United_States
George W.
Bush
George
W. Bush
President
George W.
Bush
President
Bush (43)
34. Initial results
• Time periods are not accurate
Note: the time of synonyms are timestamps of Wikipedia articles (8 years)
PhD defense Nattiya Kanhabua 34
35. • Analyze NYT Corpus to discover more accurate time
– 20-year time span (1987-2007)
• Use the burst detection algorithm [Kleinberg 2003]
– Time periods of synonyms = burst intervals
Enhancement using NYT
PhD defense Nattiya Kanhabua 35
Initial results
36. Nattiya Kanhabua 36PhD defense
Query expansion
1. A user enters an entity as a query
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
37. Nattiya Kanhabua 37PhD defense
Query expansion
1. A user enters an entity as a query
2. The system retrieves synonyms wrt. the query
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
38. Nattiya Kanhabua 38PhD defense
Query expansion
1. A user enters an entity as a query
2. The system retrieves synonyms wrt. the query
3. The user select synonyms to expand the query
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
39. Nattiya Kanhabua 39PhD defense
Part 1- Synonym detection
• Collection
– The whole history of English Wikipedia
• all pages and revisions 03/2001 to 03/2008
• 85 month snapshots about 2.8 Terabytes
• Result
– Randomly selected 500 entity-synonym relationships for evaluating
• Accuracy 51% for all types of entities
• Accuracy 73% for people, organization, and company
Part 2 - Query expansion
• Collection
– TREC Robust2004 Track (250 queries)
– NewsLibrary.com over 100M U.S. news articles (20 temporal queries)
– Result
• Baseline: Probabilistic Model without query expansion
• QE significantly improves the effectiveness over the baseline for both collections
• Open issues
– Only the name changes of famous persons can be discovered
Experiments
40. Nattiya Kanhabua 40PhD defense
Two problems are addressed
1. Performance prediction
• Predict the retrieval effectiveness wrt. a ranking model
Query prediction problems
query
precision = ?
recall = ?
MAP = ?
predict
41. Nattiya Kanhabua 41PhD defense
Two problems are addressed
1. Performance prediction
• Predict the retrieval effectiveness wrt. a ranking model
2. Ranking prediction
• Predict the ranking model that is most suitable
Query prediction problems
query ranking = ?
predict
max(precision)
max(recall)
max(MAP)
42. Nattiya Kanhabua 42PhD defense
Problem Statement
• Predict the effectiveness (e.g., MAP) that a query will achieve
in advance of, or during retrieval [Hauff 2010]
– high MAP “good”
– low MAP “poor”
Objective
• Apply query enhancement techniques to improve the
overall performance
– Query suggestion is applied for “poor” queries
• To best of our knowledge, predicting the performance of
temporal queries has never done before
RQ4: Query performance prediction
43. Nattiya Kanhabua 43PhD defense
• Contributions
– First study of performance prediction for temporal queries
– Propose 10 time-based pre-retrieval predictors
• Both text and time are considered
• Experiment
– Collection: NYT Corpus and 40 temporal queries [Berberich 2010]
• Results
– Time-based predictors outperform keyword-based predictors
– Combined predictors outperform single predictors in most cases
• Open issue
– Increase the number of queries
– Consider time uncertainty
Discussion
Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster),
In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
44. Nattiya Kanhabua 44PhD defense
• Problem statement
– Two time dimensions: publication time and content time
• Content time = temporal expressions mentioned in documents
– Difference in effectiveness for temporal queries when
ranking using publication time or content time
RQ5: Time-aware ranking prediction
45. Nattiya Kanhabua 45PhD defense
• Contributions
– First study of the impact on effectiveness of ranking models
using the two time dimensions
– Three features from analyzing top-k documents
• Temporal KL-divergence [Diaz 2004]
• Content Clarity [Cronen-Townsend 2002]
• Divergence of retrieval scores [Peng 2010]
• Results
– A small number of top-k documents achieves better
performance
– The larger number k, the more irrelevant documents are
introduced into the analysis
• Open issue
– When comparing with the optimal case there is still room for
further improvements
Discussion
Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, (under submission).
46. PART III - RETRIEVAL AND RANKING
MODELS
PhD defense Nattiya Kanhabua 46
47. Nattiya Kanhabua 47PhD defense
• Problem statements
– Time must be explicitly modeled in order to increase
the effectiveness
– Time uncertainty should be taken into account
• Two temporal expressions can refer to the same time period
even though they are not equally written
• Example
– Given the query “Independence Day 2011”, a retrieval
model relying on term-matching will fail to retrieve
documents mentioning “July 4, 2011”
RQ6: Time-aware ranking models
48. Nattiya Kanhabua 48PhD defense
• Contributions
– Analyze and compare five ranking methods
• Experiment
– Collection: NYT Corpus and 40 temporal queries[Berberich 2010]
• Result
– TSU outperforms other methods significantly for most metrics
• Conclusions
– Although TSU gains the best performance, it is limited for a
document collection with no time metadata
– LMT, LMTU can be applied to any collection without time
metadata, but extraction of temporal expressions is needed.
Discussion
Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods
(poster), In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
49. Nattiya Kanhabua 49PhD defense
• Problem statement
– Can the combination of time and other features help improving
the retrieval effectiveness?
RQ7:Ranking related news predictions
• A new task called ranking related news predictions
– Retrieve predictions related to a news story in news archives
– Rank them according to their relevance to the news story
51. Nattiya Kanhabua 51PhD defense
• Define the task ranking related news predictions
– Searching the future is proposed in [Baeza-Yates 2005]
• Propose four classes of features
– Term similarity, entity-based similarity, topic similarity and
temporal similarity
• Rank predictions using learning-to-rank [Liu 2009]
• Make available the dataset with over 6000 judgments
Contributions
Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings
of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
52. Nattiya Kanhabua 52PhD defense
• NYT Corpus
– More than 25% contain at least one prediction
• Feature analysis
– Topic features play an important role in ranking
– Features in top-5 features with lowest weights are entity-
based features
• Open issues
– Extract predictions from other sources, e.g., Wikipedia,
blogs, comments, etc.
– Sentiment analysis for future-related information.
Experiments
53. Nattiya Kanhabua 53PhD defense
Solutions to all research questions:
Part I - Content Analysis
RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis
RQ2: How to determine time of queries?
RQ3: How to handle terminology changes over time?
RQ4: How to predict the effectiveness of temporal queries?
RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models
RQ6: How to model time into retrieval and ranking?
RQ7: How to combine different features and time for ranking?
Conclusions
54. Nattiya Kanhabua 54PhD defense
• Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-
Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for
Digital Libraries (ECDL), 2008.
• Nattiya Kanhabua and Kjetil Nørvåg, Using temporal language models for document dating, In Proceedings
of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2009
• Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In
Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries
(ECDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In
Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, QUEST: query expansion using synonyms over time, In Proceedings of
the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of
the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings
of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of
the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, Technical Report.
Publications
55. Nattiya Kanhabua 55PhD defense
• [Baeza-Yates 2005] R. A. Baeza-Yates. Searching the future. In Proceedings of SIGIR workshop on
mathematical/formal methods in information retrieval MF/IR, SIGIR ’05, 2005.
• [Berberich 2010] K. Berberich, S. J. Bedathur, O. Alonso, and G. Weikum. A language modeling approach for
temporal information needs. In Proceedings of the 32nd European Conference on IR Research on Advances in
Information Retrieval, ECIR ’10, pp. 13-25, 2010.
• [Cronen-Townsend 2002] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In
Proceedings of the 25th annual international ACM SIGIR conference on Research and development in
information retrieval, SIGIR ’02, pp. 299-306, 2002.
• [Diaz 2004] F. Diaz and R. Jones. Using temporal profiles of queries for precision prediction. In Proceedings of
the 27th annual international ACM SIGIR conference on Research and development in information retrieval,
SIGIR ’04, pp. 18-24, 2004.
• [Hauff 2010] C. Hauff, L. Azzopardi, D. Hiemstra, and F. de Jong. Query performance prediction: Evaluation
contrasted with effectiveness. In Proceedings of the 32nd European Conference on IR Research on Advances
in Information Retrieval, ECIR ’10, pp. 204-216, April 2010.
• [de Jong 2005] F. de Jong, H. Rode, and D. Hiemstra. Temporal language models for the disclosure of historical
text. In Humanities, computers and cultural heritage: Proceedings of the 16th International Conference of the
Association for History and Computing, AHC '05, pp. 161-168, 2005.
• [Kleinberg 2003] J. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 7:pp.
373-397, October 2003.
• [Liu 2009] T-Y. Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):pp. 225-331, March
2009.
• [Metzler 2009] D. Metzler, R. Jones, F. Peng, and R. Zhang. Improving search relevance for implicitly temporal
queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in
information retrieval, SIGIR ’09, pp. 700-701, 2009.
• [Nunes 2008] S. Nunes, C. Ribeiro, and G. David. Use of temporal expressions in web search. In Proceedings of
the 30th European Conference on IR Research on Advances in Information Retrieval, ECIR ’08, pp. 580-584,
2008.
• [Peng 2010] J. Peng, C. Macdonald, and I. Ounis. Learning to select a ranking function. In Proceedings of the
32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 114-126, 2010.
References