Time-aware Approaches to Information Retrieval

Time-aware Approaches to
Information Retrieval
Nattiya Kanhabua
Department of Computer and Information Science
Norwegian University of Science and Technology
24 February 2012

Nattiya Kanhabua 2PhD defense
• Searching documents created/edited over time
– E.g., web archives, news archives, blogs, or emails
Motivation
Web
archives
news
archives
blogs emails
“temporal document
collections”
Retrieve documents about
Pope Benedict XVI written
before 2005
Term-based IR approaches
may give unsatisfied results

• A web archive search tool by the Internet Archive
– Query by a URL, e.g., http://www.ntnu.no
Wayback Machine1
No keyword query
No relevance ranking
1Retrieved on 15 January 2011

• A news archive search tool by Google
– Query by keywords
– Rank results by relevance or date
Google News Archive Search
Not consider
terminology
changes over time

• Study problems of temporal search
• Propose approaches to solve the problems
• Main research question
“How to exploit temporal information in documents, queries,
and external sources in order to improve the retrieval
effectiveness?”
Objective of PhD thesis

Part I - Content Analysis
RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis
RQ2: How to determine time of queries?
RQ3: How to handle terminology changes over time?
RQ4: How to predict the effectiveness of temporal queries?
RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models
RQ6: How to model time into retrieval and ranking?
RQ7: How to combine different features and time for ranking?
Outline contributions

PART I - CONTENT ANALYSIS
PhD defense Nattiya Kanhabua 7

Problem Statements
• Difficult to find the trustworthy time for web documents
– Time gap between crawling and indexing
– Decentralization and relocation of web documents
– No standard metadata for time/date
RQ1: Determining time of documents
I found a bible-like
document. But I have
no idea when it was
created?
Let’s me see…
This document is probably
written in 850 A.C.
with 95% confidence.
“ For a given document with uncertain timestamp,
can the contents be used to determine the timestamp
with a sufficiently high confidence? ”

Preliminaries
Partition Word
1999 tsunami
1999 Japan
1999 tidal wave
2004 tsunami
2004 Thailand
2004 earthquake
Temporal Language Models
tsunami
Thailand
A non-timestamped
document
Similarity Scores
Score(1999) = 1
Score(2004) = 1 + 1 = 2 Most likely timestamp is 2004
Temporal Language Models
[de Jong 2005]
• Based on the statistic usage
of words over time
• Compare each word of a
non-timestamped document
with a reference corpus
• Tentative timestamp -- a
time partition mostly
overlaps in word usage

Improving document dating
Three enhancement techniques:
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-
Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for
Digital Libraries (ECDL), 2008.

Intuition: Direct comparison between extracted words
and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into
document preprocessing

Intuition: Search statistics Google Zeitgeist (GZ) can
increase the probability of a tentative time partition
Approach: Linearly combine a GZ score with the
normalized log-likelihood ratio

Intuition: Search statistics Google Zeitgeist (GZ) can
increase the probability of a tentative time partition
Approach: Linearly combine a GZ score with the
normalized log-likelihood ratio
Intuition: A term weight depends on how good the
term is for separating time partitions (discriminative)
Approach: Propose temporal entropy, based on a term
selection presented in Lochbaum and Streeter

Experiments
• Collection
– 9,000 documents collected from the Internet Archive
– 8 years time span, 15 news sources
– Randomly select 1,000 documents for testing
• Results
– Proposed techniques gain improvement over the baseline
• Precision = the fraction of documents correctly dated
• Open issue
– The effectiveness of document dating is still limited
• Highly dependent on the quality of a reference corpus

PART II - QUERY ANALYSIS

• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
Challenges with temporal queries

query
time1
time2
…
timek
suggest

query
synonym@2001
synonym@2002
…
synonym@2011
suggest

RQ2: Determining time of queries
Problem Statements
• 1.5% of web queries are explicitly provided with temporal
expression [Nunes 2008]
– Time is a part of query, “U.S. Presidential election 2008”
• About 7% of web queries have temporal intent implicitly
provided [Metzler 2009]
– Time is not given in queries, e.g., “Germany World Cup” or
“tsunami”
– Difficult to achieve high accuracy using only keywords
– Relevant documents associated to particular time not given

1. Determining the time of queries when no time is given
2. Re-ranking search results using the determined time
Our contributions
Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In
Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries
(ECDL), 2010.

Approach I. Dating using keywords*
Approach II. Dating using top-k documents*
– Queries are short keywords
– Inspired by pseudo-relevance feedback
Approach III. Using timestamp of top-k documents
– No temporal language models are used
*Using Temporal Language Models proposed by de Jong et al.
Determining time of queries

• Intuition: documents published closely to the time of
queries are more relevant
– Assign document priors based on publication dates
Re-ranking search results
query
News archive
Determine time 2005, 2004, 2006, ...
D2009
Initial retrieved results
D2005
Re-ranked results

Determining the time of queries
• Collection
– NYT Corpus contains over 1.8M (1987-2007)
– 30 time-sensitive queries from the TREC Robust2004
• Results
– The smaller top-k, the better precision (k=5 > k=10 > k=15)
– The larger g (granularity), the better precision (g=12-month > g=6-month)
Experiments: Part 1 Precision = the fraction of
queries correctly dated

Re-ranking of search results
• Collection
– TREC Robust2004, 30 time-sensitive queries
– NYT Corpus, 24 queries from Google zeitgeist
• Results
– Approach III (no TMLs) outperforms all other approaches
• Using publication dates is more accurate than the dating process
• Open issue
– Time can improve the effectiveness (if the query dating is improved
with a higher accuracy)
Experiments: Part 2

Challenges of temporal search
query
synonym@2001
synonym@2002
…
synonym@2011
suggest

Problem Statements
• Queries composed of named entities (people, organization,
location)
– Highly dynamic in appearance, i.e., relationships between terms
changes over time
– E.g. changes of roles, name alterations, or semantic shift
RQ3: Handling terminology changes
Scenario 1
Query: “Pope Benedict XVI” and written before 2005
Documents about “Joseph Alois Ratzinger” are relevant
Scenario 2
Query: “Hillary R. Clinton” and written from 1997 to 2002
Documents about “New York Senator” and “First Lady of
the United States” are relevant

QUEST Demo: http://research.idi.ntnu.no/wislab/quest/

• Discover time-based synonyms over time using Wikipedia
– Generally, synonyms are words with similar meanings
– This work refers synonyms as alternative names of an entity
• Improve the accuracy of time of synonyms
• Query expansion using time-based synonyms
Our contributions
Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In
Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.

Recognize named entities

Find synonyms
• Find a set of entity-synonym relationships at time tk
• For each ei ϵ Etk , extract anchor texts from article links:
– Entity: President_of_the_United_States
– Synonym: George W. Bush
– Time: 11/2004
President_of_the_
United_States
George W.
Bush
George
W. Bush
President
George W.
Bush
President
Bush (43)

Initial results
• Time periods are not accurate
Note: the time of synonyms are timestamps of Wikipedia articles (8 years)

• Analyze NYT Corpus to discover more accurate time
– 20-year time span (1987-2007)
• Use the burst detection algorithm [Kleinberg 2003]
– Time periods of synonyms = burst intervals
Enhancement using NYT
Initial results

Query expansion
1. A user enters an entity as a query

Query expansion
2. The system retrieves synonyms wrt. the query

Query expansion
2. The system retrieves synonyms wrt. the query
3. The user select synonyms to expand the query

Part 1- Synonym detection
• Collection
– The whole history of English Wikipedia
• all pages and revisions 03/2001 to 03/2008
• 85 month snapshots about 2.8 Terabytes
• Result
– Randomly selected 500 entity-synonym relationships for evaluating
• Accuracy 51% for all types of entities
• Accuracy 73% for people, organization, and company
Part 2 - Query expansion
• Collection
– TREC Robust2004 Track (250 queries)
– NewsLibrary.com over 100M U.S. news articles (20 temporal queries)
– Result
• Baseline: Probabilistic Model without query expansion
• QE significantly improves the effectiveness over the baseline for both collections
• Open issues
– Only the name changes of famous persons can be discovered
Experiments

Two problems are addressed
1. Performance prediction
• Predict the retrieval effectiveness wrt. a ranking model
Query prediction problems
query
precision = ?
recall = ?
MAP = ?
predict

Two problems are addressed
1. Performance prediction
• Predict the retrieval effectiveness wrt. a ranking model
2. Ranking prediction
• Predict the ranking model that is most suitable
Query prediction problems
query ranking = ?
predict
max(precision)
max(recall)
max(MAP)

Problem Statement
• Predict the effectiveness (e.g., MAP) that a query will achieve
in advance of, or during retrieval [Hauff 2010]
– high MAP  “good”
– low MAP  “poor”
Objective
• Apply query enhancement techniques to improve the
overall performance
– Query suggestion is applied for “poor” queries
• To best of our knowledge, predicting the performance of
temporal queries has never done before
RQ4: Query performance prediction

• Contributions
– First study of performance prediction for temporal queries
– Propose 10 time-based pre-retrieval predictors
• Both text and time are considered
• Experiment
– Collection: NYT Corpus and 40 temporal queries [Berberich 2010]
• Results
– Time-based predictors outperform keyword-based predictors
– Combined predictors outperform single predictors in most cases
• Open issue
– Increase the number of queries
– Consider time uncertainty
Discussion
Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster),
In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.

• Problem statement
– Two time dimensions: publication time and content time
• Content time = temporal expressions mentioned in documents
– Difference in effectiveness for temporal queries when
ranking using publication time or content time
RQ5: Time-aware ranking prediction

• Contributions
– First study of the impact on effectiveness of ranking models
using the two time dimensions
– Three features from analyzing top-k documents
• Temporal KL-divergence [Diaz 2004]
• Content Clarity [Cronen-Townsend 2002]
• Divergence of retrieval scores [Peng 2010]
• Results
– A small number of top-k documents achieves better
performance
– The larger number k, the more irrelevant documents are
introduced into the analysis
• Open issue
– When comparing with the optimal case there is still room for
further improvements
Discussion
Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, (under submission).

PART III - RETRIEVAL AND RANKING
MODELS

• Problem statements
– Time must be explicitly modeled in order to increase
the effectiveness
– Time uncertainty should be taken into account
• Two temporal expressions can refer to the same time period
even though they are not equally written
• Example
– Given the query “Independence Day 2011”, a retrieval
model relying on term-matching will fail to retrieve
documents mentioning “July 4, 2011”
RQ6: Time-aware ranking models

• Contributions
– Analyze and compare five ranking methods
• Experiment
– Collection: NYT Corpus and 40 temporal queries[Berberich 2010]
• Result
– TSU outperforms other methods significantly for most metrics
• Conclusions
– Although TSU gains the best performance, it is limited for a
document collection with no time metadata
– LMT, LMTU can be applied to any collection without time
metadata, but extraction of temporal expressions is needed.
Discussion
Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods
(poster), In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.

• Problem statement
– Can the combination of time and other features help improving
the retrieval effectiveness?
RQ7:Ranking related news predictions
• A new task called ranking related news predictions
– Retrieve predictions related to a news story in news archives
– Rank them according to their relevance to the news story

Related news predictions

• Define the task ranking related news predictions
– Searching the future is proposed in [Baeza-Yates 2005]
• Propose four classes of features
– Term similarity, entity-based similarity, topic similarity and
temporal similarity
• Rank predictions using learning-to-rank [Liu 2009]
• Make available the dataset with over 6000 judgments
Contributions
Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings
of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.

• NYT Corpus
– More than 25% contain at least one prediction
• Feature analysis
– Topic features play an important role in ranking
– Features in top-5 features with lowest weights are entity-
based features
• Open issues
– Extract predictions from other sources, e.g., Wikipedia,
blogs, comments, etc.
– Sentiment analysis for future-related information.
Experiments

Solutions to all research questions:
Part I - Content Analysis
RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis
RQ2: How to determine time of queries?
RQ3: How to handle terminology changes over time?
RQ4: How to predict the effectiveness of temporal queries?
RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models
RQ6: How to model time into retrieval and ranking?
RQ7: How to combine different features and time for ranking?
Conclusions

• Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-
Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for
Digital Libraries (ECDL), 2008.
• Nattiya Kanhabua and Kjetil Nørvåg, Using temporal language models for document dating, In Proceedings
of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2009
• Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In
Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries
(ECDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In
Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, QUEST: query expansion using synonyms over time, In Proceedings of
the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of
the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings
of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of
the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, Technical Report.
Publications

• [Baeza-Yates 2005] R. A. Baeza-Yates. Searching the future. In Proceedings of SIGIR workshop on
mathematical/formal methods in information retrieval MF/IR, SIGIR ’05, 2005.
• [Berberich 2010] K. Berberich, S. J. Bedathur, O. Alonso, and G. Weikum. A language modeling approach for
temporal information needs. In Proceedings of the 32nd European Conference on IR Research on Advances in
Information Retrieval, ECIR ’10, pp. 13-25, 2010.
• [Cronen-Townsend 2002] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In
Proceedings of the 25th annual international ACM SIGIR conference on Research and development in
information retrieval, SIGIR ’02, pp. 299-306, 2002.
• [Diaz 2004] F. Diaz and R. Jones. Using temporal profiles of queries for precision prediction. In Proceedings of
the 27th annual international ACM SIGIR conference on Research and development in information retrieval,
SIGIR ’04, pp. 18-24, 2004.
• [Hauff 2010] C. Hauff, L. Azzopardi, D. Hiemstra, and F. de Jong. Query performance prediction: Evaluation
contrasted with effectiveness. In Proceedings of the 32nd European Conference on IR Research on Advances
in Information Retrieval, ECIR ’10, pp. 204-216, April 2010.
• [de Jong 2005] F. de Jong, H. Rode, and D. Hiemstra. Temporal language models for the disclosure of historical
text. In Humanities, computers and cultural heritage: Proceedings of the 16th International Conference of the
Association for History and Computing, AHC '05, pp. 161-168, 2005.
• [Kleinberg 2003] J. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 7:pp.
373-397, October 2003.
• [Liu 2009] T-Y. Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):pp. 225-331, March
2009.
• [Metzler 2009] D. Metzler, R. Jones, F. Peng, and R. Zhang. Improving search relevance for implicitly temporal
queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in
information retrieval, SIGIR ’09, pp. 700-701, 2009.
• [Nunes 2008] S. Nunes, C. Ribeiro, and G. David. Use of temporal expressions in web search. In Proceedings of
the 30th European Conference on IR Research on Advances in Information Retrieval, ECIR ’08, pp. 580-584,
2008.
• [Peng 2010] J. Peng, C. Macdonald, and I. Ounis. Learning to select a ranking function. In Proceedings of the
32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 114-126, 2010.
References

Thank you

Time-aware Approaches to Information Retrieval

Recommended

Recommended

More Related Content

Similar to Time-aware Approaches to Information Retrieval

Similar to Time-aware Approaches to Information Retrieval (20)

More from Nattiya Kanhabua

More from Nattiya Kanhabua (14)

Recently uploaded

Recently uploaded (14)

Time-aware Approaches to Information Retrieval