SlideShare a Scribd company logo
Time-aware Approaches to
Information Retrieval
Nattiya Kanhabua
Department of Computer and Information Science
Norwegian University of Science and Technology
24 February 2012
Nattiya Kanhabua 2PhD defense
• Searching documents created/edited over time
– E.g., web archives, news archives, blogs, or emails
Motivation
Web
archives
news
archives
blogs emails
“temporal document
collections”
Retrieve documents about
Pope Benedict XVI written
before 2005
Term-based IR approaches
may give unsatisfied results
Nattiya Kanhabua 3PhD defense
• A web archive search tool by the Internet Archive
– Query by a URL, e.g., http://www.ntnu.no
Wayback Machine1
No keyword query
No relevance ranking
1Retrieved on 15 January 2011
Nattiya Kanhabua 4PhD defense
• A news archive search tool by Google
– Query by keywords
– Rank results by relevance or date
Google News Archive Search
Not consider
terminology
changes over time
Nattiya Kanhabua 5PhD defense
• Study problems of temporal search
• Propose approaches to solve the problems
• Main research question
“How to exploit temporal information in documents, queries,
and external sources in order to improve the retrieval
effectiveness?”
Objective of PhD thesis
Nattiya Kanhabua 6PhD defense
Part I - Content Analysis
RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis
RQ2: How to determine time of queries?
RQ3: How to handle terminology changes over time?
RQ4: How to predict the effectiveness of temporal queries?
RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models
RQ6: How to model time into retrieval and ranking?
RQ7: How to combine different features and time for ranking?
Outline contributions
PART I - CONTENT ANALYSIS
PhD defense Nattiya Kanhabua 7
Nattiya Kanhabua 8PhD defense
Problem Statements
• Difficult to find the trustworthy time for web documents
– Time gap between crawling and indexing
– Decentralization and relocation of web documents
– No standard metadata for time/date
RQ1: Determining time of documents
I found a bible-like
document. But I have
no idea when it was
created?
Let’s me see…
This document is probably
written in 850 A.C.
with 95% confidence.
“ For a given document with uncertain timestamp,
can the contents be used to determine the timestamp
with a sufficiently high confidence? ”
Nattiya Kanhabua 9PhD defense
Preliminaries
Partition Word
1999 tsunami
1999 Japan
1999 tidal wave
2004 tsunami
2004 Thailand
2004 earthquake
Temporal Language Models
tsunami
Thailand
A non-timestamped
document
Similarity Scores
Score(1999) = 1
Score(2004) = 1 + 1 = 2 Most likely timestamp is 2004
Temporal Language Models
[de Jong 2005]
• Based on the statistic usage
of words over time
• Compare each word of a
non-timestamped document
with a reference corpus
• Tentative timestamp -- a
time partition mostly
overlaps in word usage
Nattiya Kanhabua 10PhD defense
Improving document dating
Three enhancement techniques:
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-
Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for
Digital Libraries (ECDL), 2008.
Nattiya Kanhabua 11PhD defense
Improving document dating
Three enhancement techniques:
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
Intuition: Direct comparison between extracted words
and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into
document preprocessing
Nattiya Kanhabua 12PhD defense
Improving document dating
Three enhancement techniques:
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
Intuition: Direct comparison between extracted words
and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into
document preprocessing
Intuition: Search statistics Google Zeitgeist (GZ) can
increase the probability of a tentative time partition
Approach: Linearly combine a GZ score with the
normalized log-likelihood ratio
Nattiya Kanhabua 13PhD defense
Improving document dating
Three enhancement techniques:
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
Intuition: Direct comparison between extracted words
and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into
document preprocessing
Intuition: Search statistics Google Zeitgeist (GZ) can
increase the probability of a tentative time partition
Approach: Linearly combine a GZ score with the
normalized log-likelihood ratio
Intuition: A term weight depends on how good the
term is for separating time partitions (discriminative)
Approach: Propose temporal entropy, based on a term
selection presented in Lochbaum and Streeter
Nattiya Kanhabua 14PhD defense
Experiments
• Collection
– 9,000 documents collected from the Internet Archive
– 8 years time span, 15 news sources
– Randomly select 1,000 documents for testing
• Results
– Proposed techniques gain improvement over the baseline
• Precision = the fraction of documents correctly dated
• Open issue
– The effectiveness of document dating is still limited
• Highly dependent on the quality of a reference corpus
PART II - QUERY ANALYSIS
PhD defense Nattiya Kanhabua 15
Nattiya Kanhabua 16PhD defense
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
Challenges with temporal queries
Nattiya Kanhabua 17PhD defense
Challenges with temporal queries
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
query
time1
time2
…
timek
suggest
Nattiya Kanhabua 18PhD defense
Challenges with temporal queries
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
query
time1
time2
…
timek
suggest
Nattiya Kanhabua 19PhD defense
Challenges with temporal queries
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
query
synonym@2001
synonym@2002
…
synonym@2011
suggest
RQ2: Determining time of queries
PhD defense Nattiya Kanhabua 20
Problem Statements
• 1.5% of web queries are explicitly provided with temporal
expression [Nunes 2008]
– Time is a part of query, “U.S. Presidential election 2008”
• About 7% of web queries have temporal intent implicitly
provided [Metzler 2009]
– Time is not given in queries, e.g., “Germany World Cup” or
“tsunami”
– Difficult to achieve high accuracy using only keywords
– Relevant documents associated to particular time not given
Nattiya Kanhabua 21PhD defense
1. Determining the time of queries when no time is given
2. Re-ranking search results using the determined time
Our contributions
Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In
Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries
(ECDL), 2010.
Nattiya Kanhabua 22PhD defense
Approach I. Dating using keywords*
Approach II. Dating using top-k documents*
– Queries are short keywords
– Inspired by pseudo-relevance feedback
Approach III. Using timestamp of top-k documents
– No temporal language models are used
*Using Temporal Language Models proposed by de Jong et al.
Determining time of queries
Nattiya Kanhabua 23PhD defense
• Intuition: documents published closely to the time of
queries are more relevant
– Assign document priors based on publication dates
Re-ranking search results
query
News archive
Determine time 2005, 2004, 2006, ...
D2009
Initial retrieved results
D2005
Re-ranked results
Nattiya Kanhabua 24PhD defense
Determining the time of queries
• Collection
– NYT Corpus contains over 1.8M (1987-2007)
– 30 time-sensitive queries from the TREC Robust2004
• Results
– The smaller top-k, the better precision (k=5 > k=10 > k=15)
– The larger g (granularity), the better precision (g=12-month > g=6-month)
Experiments: Part 1 Precision = the fraction of
queries correctly dated
Nattiya Kanhabua 25PhD defense
Re-ranking of search results
• Collection
– TREC Robust2004, 30 time-sensitive queries
– NYT Corpus, 24 queries from Google zeitgeist
• Results
– Approach III (no TMLs) outperforms all other approaches
• Using publication dates is more accurate than the dating process
• Open issue
– Time can improve the effectiveness (if the query dating is improved
with a higher accuracy)
Experiments: Part 2
Nattiya Kanhabua 26PhD defense
Challenges of temporal search
• Semantic gaps: lacking knowledge about
1. possibly relevant time of queries
2. terminology changes over time
query
synonym@2001
synonym@2002
…
synonym@2011
suggest
Nattiya Kanhabua 27PhD defense
Problem Statements
• Queries composed of named entities (people, organization,
location)
– Highly dynamic in appearance, i.e., relationships between terms
changes over time
– E.g. changes of roles, name alterations, or semantic shift
RQ3: Handling terminology changes
Scenario 1
Query: “Pope Benedict XVI” and written before 2005
Documents about “Joseph Alois Ratzinger” are relevant
Scenario 2
Query: “Hillary R. Clinton” and written from 1997 to 2002
Documents about “New York Senator” and “First Lady of
the United States” are relevant
Nattiya Kanhabua 28PhD defense
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
Nattiya Kanhabua 29PhD defense
• Discover time-based synonyms over time using Wikipedia
– Generally, synonyms are words with similar meanings
– This work refers synonyms as alternative names of an entity
• Improve the accuracy of time of synonyms
• Query expansion using time-based synonyms
Our contributions
Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In
Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
Nattiya Kanhabua 30PhD defense
Recognize named entities
Nattiya Kanhabua 31PhD defense
Recognize named entities
Nattiya Kanhabua 32PhD defense
Recognize named entities
Nattiya Kanhabua 33PhD defense
Find synonyms
• Find a set of entity-synonym relationships at time tk
• For each ei ϵ Etk , extract anchor texts from article links:
– Entity: President_of_the_United_States
– Synonym: George W. Bush
– Time: 11/2004
President_of_the_
United_States
George W.
Bush
George
W. Bush
President
George W.
Bush
President
Bush (43)
Initial results
• Time periods are not accurate
Note: the time of synonyms are timestamps of Wikipedia articles (8 years)
PhD defense Nattiya Kanhabua 34
• Analyze NYT Corpus to discover more accurate time
– 20-year time span (1987-2007)
• Use the burst detection algorithm [Kleinberg 2003]
– Time periods of synonyms = burst intervals
Enhancement using NYT
PhD defense Nattiya Kanhabua 35
Initial results
Nattiya Kanhabua 36PhD defense
Query expansion
1. A user enters an entity as a query
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
Nattiya Kanhabua 37PhD defense
Query expansion
1. A user enters an entity as a query
2. The system retrieves synonyms wrt. the query
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
Nattiya Kanhabua 38PhD defense
Query expansion
1. A user enters an entity as a query
2. The system retrieves synonyms wrt. the query
3. The user select synonyms to expand the query
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
Nattiya Kanhabua 39PhD defense
Part 1- Synonym detection
• Collection
– The whole history of English Wikipedia
• all pages and revisions 03/2001 to 03/2008
• 85 month snapshots about 2.8 Terabytes
• Result
– Randomly selected 500 entity-synonym relationships for evaluating
• Accuracy 51% for all types of entities
• Accuracy 73% for people, organization, and company
Part 2 - Query expansion
• Collection
– TREC Robust2004 Track (250 queries)
– NewsLibrary.com over 100M U.S. news articles (20 temporal queries)
– Result
• Baseline: Probabilistic Model without query expansion
• QE significantly improves the effectiveness over the baseline for both collections
• Open issues
– Only the name changes of famous persons can be discovered
Experiments
Nattiya Kanhabua 40PhD defense
Two problems are addressed
1. Performance prediction
• Predict the retrieval effectiveness wrt. a ranking model
Query prediction problems
query
precision = ?
recall = ?
MAP = ?
predict
Nattiya Kanhabua 41PhD defense
Two problems are addressed
1. Performance prediction
• Predict the retrieval effectiveness wrt. a ranking model
2. Ranking prediction
• Predict the ranking model that is most suitable
Query prediction problems
query ranking = ?
predict
max(precision)
max(recall)
max(MAP)
Nattiya Kanhabua 42PhD defense
Problem Statement
• Predict the effectiveness (e.g., MAP) that a query will achieve
in advance of, or during retrieval [Hauff 2010]
– high MAP  “good”
– low MAP  “poor”
Objective
• Apply query enhancement techniques to improve the
overall performance
– Query suggestion is applied for “poor” queries
• To best of our knowledge, predicting the performance of
temporal queries has never done before
RQ4: Query performance prediction
Nattiya Kanhabua 43PhD defense
• Contributions
– First study of performance prediction for temporal queries
– Propose 10 time-based pre-retrieval predictors
• Both text and time are considered
• Experiment
– Collection: NYT Corpus and 40 temporal queries [Berberich 2010]
• Results
– Time-based predictors outperform keyword-based predictors
– Combined predictors outperform single predictors in most cases
• Open issue
– Increase the number of queries
– Consider time uncertainty
Discussion
Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster),
In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
Nattiya Kanhabua 44PhD defense
• Problem statement
– Two time dimensions: publication time and content time
• Content time = temporal expressions mentioned in documents
– Difference in effectiveness for temporal queries when
ranking using publication time or content time
RQ5: Time-aware ranking prediction
Nattiya Kanhabua 45PhD defense
• Contributions
– First study of the impact on effectiveness of ranking models
using the two time dimensions
– Three features from analyzing top-k documents
• Temporal KL-divergence [Diaz 2004]
• Content Clarity [Cronen-Townsend 2002]
• Divergence of retrieval scores [Peng 2010]
• Results
– A small number of top-k documents achieves better
performance
– The larger number k, the more irrelevant documents are
introduced into the analysis
• Open issue
– When comparing with the optimal case there is still room for
further improvements
Discussion
Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, (under submission).
PART III - RETRIEVAL AND RANKING
MODELS
PhD defense Nattiya Kanhabua 46
Nattiya Kanhabua 47PhD defense
• Problem statements
– Time must be explicitly modeled in order to increase
the effectiveness
– Time uncertainty should be taken into account
• Two temporal expressions can refer to the same time period
even though they are not equally written
• Example
– Given the query “Independence Day 2011”, a retrieval
model relying on term-matching will fail to retrieve
documents mentioning “July 4, 2011”
RQ6: Time-aware ranking models
Nattiya Kanhabua 48PhD defense
• Contributions
– Analyze and compare five ranking methods
• Experiment
– Collection: NYT Corpus and 40 temporal queries[Berberich 2010]
• Result
– TSU outperforms other methods significantly for most metrics
• Conclusions
– Although TSU gains the best performance, it is limited for a
document collection with no time metadata
– LMT, LMTU can be applied to any collection without time
metadata, but extraction of temporal expressions is needed.
Discussion
Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods
(poster), In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
Nattiya Kanhabua 49PhD defense
• Problem statement
– Can the combination of time and other features help improving
the retrieval effectiveness?
RQ7:Ranking related news predictions
• A new task called ranking related news predictions
– Retrieve predictions related to a news story in news archives
– Rank them according to their relevance to the news story
Nattiya Kanhabua 50PhD defense
Related news predictions
Nattiya Kanhabua 51PhD defense
• Define the task ranking related news predictions
– Searching the future is proposed in [Baeza-Yates 2005]
• Propose four classes of features
– Term similarity, entity-based similarity, topic similarity and
temporal similarity
• Rank predictions using learning-to-rank [Liu 2009]
• Make available the dataset with over 6000 judgments
Contributions
Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings
of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
Nattiya Kanhabua 52PhD defense
• NYT Corpus
– More than 25% contain at least one prediction
• Feature analysis
– Topic features play an important role in ranking
– Features in top-5 features with lowest weights are entity-
based features
• Open issues
– Extract predictions from other sources, e.g., Wikipedia,
blogs, comments, etc.
– Sentiment analysis for future-related information.
Experiments
Nattiya Kanhabua 53PhD defense
Solutions to all research questions:
Part I - Content Analysis
RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis
RQ2: How to determine time of queries?
RQ3: How to handle terminology changes over time?
RQ4: How to predict the effectiveness of temporal queries?
RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models
RQ6: How to model time into retrieval and ranking?
RQ7: How to combine different features and time for ranking?
Conclusions
Nattiya Kanhabua 54PhD defense
• Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-
Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for
Digital Libraries (ECDL), 2008.
• Nattiya Kanhabua and Kjetil Nørvåg, Using temporal language models for document dating, In Proceedings
of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2009
• Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In
Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries
(ECDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In
Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, QUEST: query expansion using synonyms over time, In Proceedings of
the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of
the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings
of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of
the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, Technical Report.
Publications
Nattiya Kanhabua 55PhD defense
• [Baeza-Yates 2005] R. A. Baeza-Yates. Searching the future. In Proceedings of SIGIR workshop on
mathematical/formal methods in information retrieval MF/IR, SIGIR ’05, 2005.
• [Berberich 2010] K. Berberich, S. J. Bedathur, O. Alonso, and G. Weikum. A language modeling approach for
temporal information needs. In Proceedings of the 32nd European Conference on IR Research on Advances in
Information Retrieval, ECIR ’10, pp. 13-25, 2010.
• [Cronen-Townsend 2002] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In
Proceedings of the 25th annual international ACM SIGIR conference on Research and development in
information retrieval, SIGIR ’02, pp. 299-306, 2002.
• [Diaz 2004] F. Diaz and R. Jones. Using temporal profiles of queries for precision prediction. In Proceedings of
the 27th annual international ACM SIGIR conference on Research and development in information retrieval,
SIGIR ’04, pp. 18-24, 2004.
• [Hauff 2010] C. Hauff, L. Azzopardi, D. Hiemstra, and F. de Jong. Query performance prediction: Evaluation
contrasted with effectiveness. In Proceedings of the 32nd European Conference on IR Research on Advances
in Information Retrieval, ECIR ’10, pp. 204-216, April 2010.
• [de Jong 2005] F. de Jong, H. Rode, and D. Hiemstra. Temporal language models for the disclosure of historical
text. In Humanities, computers and cultural heritage: Proceedings of the 16th International Conference of the
Association for History and Computing, AHC '05, pp. 161-168, 2005.
• [Kleinberg 2003] J. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 7:pp.
373-397, October 2003.
• [Liu 2009] T-Y. Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):pp. 225-331, March
2009.
• [Metzler 2009] D. Metzler, R. Jones, F. Peng, and R. Zhang. Improving search relevance for implicitly temporal
queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in
information retrieval, SIGIR ’09, pp. 700-701, 2009.
• [Nunes 2008] S. Nunes, C. Ribeiro, and G. David. Use of temporal expressions in web search. In Proceedings of
the 30th European Conference on IR Research on Advances in Information Retrieval, ECIR ’08, pp. 580-584,
2008.
• [Peng 2010] J. Peng, C. Macdonald, and I. Ounis. Learning to select a ranking function. In Proceedings of the
32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 114-126, 2010.
References
Thank you
PhD defense Nattiya Kanhabua 56

More Related Content

Similar to Time-aware Approaches to Information Retrieval

Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and futureRoi Blanco
 
Wise #LAK15 It's About Time Workshop
Wise #LAK15 It's About Time WorkshopWise #LAK15 It's About Time Workshop
Wise #LAK15 It's About Time Workshopalywise
 
Determining Time of Queries for Re-ranking Search Results
Determining Time of Queries for Re-ranking Search ResultsDetermining Time of Queries for Re-ranking Search Results
Determining Time of Queries for Re-ranking Search ResultsNattiya Kanhabua
 
Update of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agentsUpdate of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agentsVrije Universiteit Amsterdam
 
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...Nattiya Kanhabua
 
Data Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesData Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesIUPUI
 
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...Association for Computational Linguistics
 
Shebanq roma-2013-10-01
Shebanq roma-2013-10-01Shebanq roma-2013-10-01
Shebanq roma-2013-10-01Dirk Roorda
 
Exploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document ArchivesExploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document ArchivesNattiya Kanhabua
 
Harmonising Research between South and North: Results from ROER4D’s Question ...
Harmonising Research between South and North: Results from ROER4D’s Question ...Harmonising Research between South and North: Results from ROER4D’s Question ...
Harmonising Research between South and North: Results from ROER4D’s Question ...Open Education Global (OEGlobal)
 
Support Your Data, Kyoto University
Support Your Data, Kyoto UniversitySupport Your Data, Kyoto University
Support Your Data, Kyoto UniversityStephanie Simms
 
AP Exam Question Types Overview.pptx
AP Exam Question Types Overview.pptxAP Exam Question Types Overview.pptx
AP Exam Question Types Overview.pptxDave Phillips
 
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)Nattiya Kanhabua
 

Similar to Time-aware Approaches to Information Retrieval (20)

Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and future
 
Wise #LAK15 It's About Time Workshop
Wise #LAK15 It's About Time WorkshopWise #LAK15 It's About Time Workshop
Wise #LAK15 It's About Time Workshop
 
Determining Time of Queries for Re-ranking Search Results
Determining Time of Queries for Re-ranking Search ResultsDetermining Time of Queries for Re-ranking Search Results
Determining Time of Queries for Re-ranking Search Results
 
A spatio-temporal visual analysis tool for historical dictionaries.
A spatio-temporal visual analysis tool for historical dictionaries. A spatio-temporal visual analysis tool for historical dictionaries.
A spatio-temporal visual analysis tool for historical dictionaries.
 
Update of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agentsUpdate of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agents
 
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
 
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
 
3_Indexing.ppt
3_Indexing.ppt3_Indexing.ppt
3_Indexing.ppt
 
Data Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesData Management Lab: Session 2 slides
Data Management Lab: Session 2 slides
 
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
 
Shebanq roma-2013-10-01
Shebanq roma-2013-10-01Shebanq roma-2013-10-01
Shebanq roma-2013-10-01
 
Cue Forum2008
Cue Forum2008Cue Forum2008
Cue Forum2008
 
Exploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document ArchivesExploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document Archives
 
Television News Search and Analysis with Lucene/Solr
Television News Search and Analysis with Lucene/SolrTelevision News Search and Analysis with Lucene/Solr
Television News Search and Analysis with Lucene/Solr
 
Preparing Your Research Material for the Future - 2015-02-23 - Humanities Div...
Preparing Your Research Material for the Future - 2015-02-23 - Humanities Div...Preparing Your Research Material for the Future - 2015-02-23 - Humanities Div...
Preparing Your Research Material for the Future - 2015-02-23 - Humanities Div...
 
Harmonising Research between South and North: Results from ROER4D’s Question ...
Harmonising Research between South and North: Results from ROER4D’s Question ...Harmonising Research between South and North: Results from ROER4D’s Question ...
Harmonising Research between South and North: Results from ROER4D’s Question ...
 
Support Your Data, Kyoto University
Support Your Data, Kyoto UniversitySupport Your Data, Kyoto University
Support Your Data, Kyoto University
 
AP Exam Question Types Overview.pptx
AP Exam Question Types Overview.pptxAP Exam Question Types Overview.pptx
AP Exam Question Types Overview.pptx
 
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
 
Preparing Your Research Material for the Future - 2014-06-09 - Humanities Div...
Preparing Your Research Material for the Future - 2014-06-09 - Humanities Div...Preparing Your Research Material for the Future - 2014-06-09 - Humanities Div...
Preparing Your Research Material for the Future - 2014-06-09 - Humanities Div...
 

More from Nattiya Kanhabua

Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...Nattiya Kanhabua
 
Understanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of OutbreaksUnderstanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of OutbreaksNattiya Kanhabua
 
Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?Nattiya Kanhabua
 
On the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in WikipediaOn the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in WikipediaNattiya Kanhabua
 
Ranking Related News Predictions
Ranking Related News PredictionsRanking Related News Predictions
Ranking Related News PredictionsNattiya Kanhabua
 
Temporal summarization of event related updates
Temporal summarization of event related updatesTemporal summarization of event related updates
Temporal summarization of event related updatesNattiya Kanhabua
 
Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?Nattiya Kanhabua
 
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...Nattiya Kanhabua
 
Can Twitter & Co. Save Lives?
Can Twitter & Co. Save Lives?Can Twitter & Co. Save Lives?
Can Twitter & Co. Save Lives?Nattiya Kanhabua
 
Supporting Exploration and Serendipity in Information Retrieval
Supporting Exploration and Serendipity in Information RetrievalSupporting Exploration and Serendipity in Information Retrieval
Supporting Exploration and Serendipity in Information RetrievalNattiya Kanhabua
 
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)Nattiya Kanhabua
 
Identifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world EventsIdentifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world EventsNattiya Kanhabua
 
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...Nattiya Kanhabua
 
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...Nattiya Kanhabua
 

More from Nattiya Kanhabua (14)

Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
 
Understanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of OutbreaksUnderstanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of Outbreaks
 
Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?
 
On the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in WikipediaOn the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in Wikipedia
 
Ranking Related News Predictions
Ranking Related News PredictionsRanking Related News Predictions
Ranking Related News Predictions
 
Temporal summarization of event related updates
Temporal summarization of event related updatesTemporal summarization of event related updates
Temporal summarization of event related updates
 
Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?
 
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
 
Can Twitter & Co. Save Lives?
Can Twitter & Co. Save Lives?Can Twitter & Co. Save Lives?
Can Twitter & Co. Save Lives?
 
Supporting Exploration and Serendipity in Information Retrieval
Supporting Exploration and Serendipity in Information RetrievalSupporting Exploration and Serendipity in Information Retrieval
Supporting Exploration and Serendipity in Information Retrieval
 
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
 
Identifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world EventsIdentifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world Events
 
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
 
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
 

Recently uploaded

05232024 Joint Meeting - Community Networking
05232024 Joint Meeting - Community Networking05232024 Joint Meeting - Community Networking
05232024 Joint Meeting - Community NetworkingMichael Orias
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesIP ServerOne
 
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic AbusersOWASP Beja
 
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfOracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfSkillCertProExams
 
Eureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationEureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationAccess Innovations, Inc.
 
527598851-ppc-due-to-various-govt-policies.pdf
527598851-ppc-due-to-various-govt-policies.pdf527598851-ppc-due-to-various-govt-policies.pdf
527598851-ppc-due-to-various-govt-policies.pdfrajpreetkaur75080
 
The Canoga Gardens Development Project. PDF
The Canoga Gardens Development Project. PDFThe Canoga Gardens Development Project. PDF
The Canoga Gardens Development Project. PDFRahsaan L. Browne
 
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Orkestra
 
123445566544333222333444dxcvbcvcvharsh.pptx
123445566544333222333444dxcvbcvcvharsh.pptx123445566544333222333444dxcvbcvcvharsh.pptx
123445566544333222333444dxcvbcvcvharsh.pptxgargh1099
 
Writing Sample 2 -Bridging the Divide: Enhancing Public Engagement in Urban D...
Writing Sample 2 -Bridging the Divide: Enhancing Public Engagement in Urban D...Writing Sample 2 -Bridging the Divide: Enhancing Public Engagement in Urban D...
Writing Sample 2 -Bridging the Divide: Enhancing Public Engagement in Urban D...Rahsaan L. Browne
 
Introduction of Biology in living organisms
Introduction of Biology in living organismsIntroduction of Biology in living organisms
Introduction of Biology in living organismssoumyapottola
 
Getting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerGetting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerVladimir Samoylov
 
Hi-Tech Industry 2024-25 Prospective.pptx
Hi-Tech Industry 2024-25 Prospective.pptxHi-Tech Industry 2024-25 Prospective.pptx
Hi-Tech Industry 2024-25 Prospective.pptxShivamM16
 
Pollinator Ambassador Earth Steward Day Presentation 2024-05-22
Pollinator Ambassador Earth Steward Day Presentation 2024-05-22Pollinator Ambassador Earth Steward Day Presentation 2024-05-22
Pollinator Ambassador Earth Steward Day Presentation 2024-05-22LHelferty
 

Recently uploaded (14)

05232024 Joint Meeting - Community Networking
05232024 Joint Meeting - Community Networking05232024 Joint Meeting - Community Networking
05232024 Joint Meeting - Community Networking
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutes
 
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
 
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfOracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
 
Eureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationEureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 Presentation
 
527598851-ppc-due-to-various-govt-policies.pdf
527598851-ppc-due-to-various-govt-policies.pdf527598851-ppc-due-to-various-govt-policies.pdf
527598851-ppc-due-to-various-govt-policies.pdf
 
The Canoga Gardens Development Project. PDF
The Canoga Gardens Development Project. PDFThe Canoga Gardens Development Project. PDF
The Canoga Gardens Development Project. PDF
 
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
 
123445566544333222333444dxcvbcvcvharsh.pptx
123445566544333222333444dxcvbcvcvharsh.pptx123445566544333222333444dxcvbcvcvharsh.pptx
123445566544333222333444dxcvbcvcvharsh.pptx
 
Writing Sample 2 -Bridging the Divide: Enhancing Public Engagement in Urban D...
Writing Sample 2 -Bridging the Divide: Enhancing Public Engagement in Urban D...Writing Sample 2 -Bridging the Divide: Enhancing Public Engagement in Urban D...
Writing Sample 2 -Bridging the Divide: Enhancing Public Engagement in Urban D...
 
Introduction of Biology in living organisms
Introduction of Biology in living organismsIntroduction of Biology in living organisms
Introduction of Biology in living organisms
 
Getting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerGetting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control Tower
 
Hi-Tech Industry 2024-25 Prospective.pptx
Hi-Tech Industry 2024-25 Prospective.pptxHi-Tech Industry 2024-25 Prospective.pptx
Hi-Tech Industry 2024-25 Prospective.pptx
 
Pollinator Ambassador Earth Steward Day Presentation 2024-05-22
Pollinator Ambassador Earth Steward Day Presentation 2024-05-22Pollinator Ambassador Earth Steward Day Presentation 2024-05-22
Pollinator Ambassador Earth Steward Day Presentation 2024-05-22
 

Time-aware Approaches to Information Retrieval

  • 1. Time-aware Approaches to Information Retrieval Nattiya Kanhabua Department of Computer and Information Science Norwegian University of Science and Technology 24 February 2012
  • 2. Nattiya Kanhabua 2PhD defense • Searching documents created/edited over time – E.g., web archives, news archives, blogs, or emails Motivation Web archives news archives blogs emails “temporal document collections” Retrieve documents about Pope Benedict XVI written before 2005 Term-based IR approaches may give unsatisfied results
  • 3. Nattiya Kanhabua 3PhD defense • A web archive search tool by the Internet Archive – Query by a URL, e.g., http://www.ntnu.no Wayback Machine1 No keyword query No relevance ranking 1Retrieved on 15 January 2011
  • 4. Nattiya Kanhabua 4PhD defense • A news archive search tool by Google – Query by keywords – Rank results by relevance or date Google News Archive Search Not consider terminology changes over time
  • 5. Nattiya Kanhabua 5PhD defense • Study problems of temporal search • Propose approaches to solve the problems • Main research question “How to exploit temporal information in documents, queries, and external sources in order to improve the retrieval effectiveness?” Objective of PhD thesis
  • 6. Nattiya Kanhabua 6PhD defense Part I - Content Analysis RQ1: How to determine time of non-timestamped documents? Part II - Query Analysis RQ2: How to determine time of queries? RQ3: How to handle terminology changes over time? RQ4: How to predict the effectiveness of temporal queries? RQ5: How to predict the suitable time-aware ranking? Part III - Retrieval and Ranking Models RQ6: How to model time into retrieval and ranking? RQ7: How to combine different features and time for ranking? Outline contributions
  • 7. PART I - CONTENT ANALYSIS PhD defense Nattiya Kanhabua 7
  • 8. Nattiya Kanhabua 8PhD defense Problem Statements • Difficult to find the trustworthy time for web documents – Time gap between crawling and indexing – Decentralization and relocation of web documents – No standard metadata for time/date RQ1: Determining time of documents I found a bible-like document. But I have no idea when it was created? Let’s me see… This document is probably written in 850 A.C. with 95% confidence. “ For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence? ”
  • 9. Nattiya Kanhabua 9PhD defense Preliminaries Partition Word 1999 tsunami 1999 Japan 1999 tidal wave 2004 tsunami 2004 Thailand 2004 earthquake Temporal Language Models tsunami Thailand A non-timestamped document Similarity Scores Score(1999) = 1 Score(2004) = 1 + 1 = 2 Most likely timestamp is 2004 Temporal Language Models [de Jong 2005] • Based on the statistic usage of words over time • Compare each word of a non-timestamped document with a reference corpus • Tentative timestamp -- a time partition mostly overlaps in word usage
  • 10. Nattiya Kanhabua 10PhD defense Improving document dating Three enhancement techniques: 1. Semantic-based data preprocessing 2. Search statistics to enhance similarity scores 3. Temporal entropy as term weights Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non- Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2008.
  • 11. Nattiya Kanhabua 11PhD defense Improving document dating Three enhancement techniques: 1. Semantic-based data preprocessing 2. Search statistics to enhance similarity scores 3. Temporal entropy as term weights Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy Approach: Integrate semantic-based techniques into document preprocessing
  • 12. Nattiya Kanhabua 12PhD defense Improving document dating Three enhancement techniques: 1. Semantic-based data preprocessing 2. Search statistics to enhance similarity scores 3. Temporal entropy as term weights Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy Approach: Integrate semantic-based techniques into document preprocessing Intuition: Search statistics Google Zeitgeist (GZ) can increase the probability of a tentative time partition Approach: Linearly combine a GZ score with the normalized log-likelihood ratio
  • 13. Nattiya Kanhabua 13PhD defense Improving document dating Three enhancement techniques: 1. Semantic-based data preprocessing 2. Search statistics to enhance similarity scores 3. Temporal entropy as term weights Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy Approach: Integrate semantic-based techniques into document preprocessing Intuition: Search statistics Google Zeitgeist (GZ) can increase the probability of a tentative time partition Approach: Linearly combine a GZ score with the normalized log-likelihood ratio Intuition: A term weight depends on how good the term is for separating time partitions (discriminative) Approach: Propose temporal entropy, based on a term selection presented in Lochbaum and Streeter
  • 14. Nattiya Kanhabua 14PhD defense Experiments • Collection – 9,000 documents collected from the Internet Archive – 8 years time span, 15 news sources – Randomly select 1,000 documents for testing • Results – Proposed techniques gain improvement over the baseline • Precision = the fraction of documents correctly dated • Open issue – The effectiveness of document dating is still limited • Highly dependent on the quality of a reference corpus
  • 15. PART II - QUERY ANALYSIS PhD defense Nattiya Kanhabua 15
  • 16. Nattiya Kanhabua 16PhD defense • Semantic gaps: lacking knowledge about 1. possibly relevant time of queries 2. terminology changes over time Challenges with temporal queries
  • 17. Nattiya Kanhabua 17PhD defense Challenges with temporal queries • Semantic gaps: lacking knowledge about 1. possibly relevant time of queries 2. terminology changes over time query time1 time2 … timek suggest
  • 18. Nattiya Kanhabua 18PhD defense Challenges with temporal queries • Semantic gaps: lacking knowledge about 1. possibly relevant time of queries 2. terminology changes over time query time1 time2 … timek suggest
  • 19. Nattiya Kanhabua 19PhD defense Challenges with temporal queries • Semantic gaps: lacking knowledge about 1. possibly relevant time of queries 2. terminology changes over time • Semantic gaps: lacking knowledge about 1. possibly relevant time of queries 2. terminology changes over time query synonym@2001 synonym@2002 … synonym@2011 suggest
  • 20. RQ2: Determining time of queries PhD defense Nattiya Kanhabua 20 Problem Statements • 1.5% of web queries are explicitly provided with temporal expression [Nunes 2008] – Time is a part of query, “U.S. Presidential election 2008” • About 7% of web queries have temporal intent implicitly provided [Metzler 2009] – Time is not given in queries, e.g., “Germany World Cup” or “tsunami” – Difficult to achieve high accuracy using only keywords – Relevant documents associated to particular time not given
  • 21. Nattiya Kanhabua 21PhD defense 1. Determining the time of queries when no time is given 2. Re-ranking search results using the determined time Our contributions Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2010.
  • 22. Nattiya Kanhabua 22PhD defense Approach I. Dating using keywords* Approach II. Dating using top-k documents* – Queries are short keywords – Inspired by pseudo-relevance feedback Approach III. Using timestamp of top-k documents – No temporal language models are used *Using Temporal Language Models proposed by de Jong et al. Determining time of queries
  • 23. Nattiya Kanhabua 23PhD defense • Intuition: documents published closely to the time of queries are more relevant – Assign document priors based on publication dates Re-ranking search results query News archive Determine time 2005, 2004, 2006, ... D2009 Initial retrieved results D2005 Re-ranked results
  • 24. Nattiya Kanhabua 24PhD defense Determining the time of queries • Collection – NYT Corpus contains over 1.8M (1987-2007) – 30 time-sensitive queries from the TREC Robust2004 • Results – The smaller top-k, the better precision (k=5 > k=10 > k=15) – The larger g (granularity), the better precision (g=12-month > g=6-month) Experiments: Part 1 Precision = the fraction of queries correctly dated
  • 25. Nattiya Kanhabua 25PhD defense Re-ranking of search results • Collection – TREC Robust2004, 30 time-sensitive queries – NYT Corpus, 24 queries from Google zeitgeist • Results – Approach III (no TMLs) outperforms all other approaches • Using publication dates is more accurate than the dating process • Open issue – Time can improve the effectiveness (if the query dating is improved with a higher accuracy) Experiments: Part 2
  • 26. Nattiya Kanhabua 26PhD defense Challenges of temporal search • Semantic gaps: lacking knowledge about 1. possibly relevant time of queries 2. terminology changes over time query synonym@2001 synonym@2002 … synonym@2011 suggest
  • 27. Nattiya Kanhabua 27PhD defense Problem Statements • Queries composed of named entities (people, organization, location) – Highly dynamic in appearance, i.e., relationships between terms changes over time – E.g. changes of roles, name alterations, or semantic shift RQ3: Handling terminology changes Scenario 1 Query: “Pope Benedict XVI” and written before 2005 Documents about “Joseph Alois Ratzinger” are relevant Scenario 2 Query: “Hillary R. Clinton” and written from 1997 to 2002 Documents about “New York Senator” and “First Lady of the United States” are relevant
  • 28. Nattiya Kanhabua 28PhD defense QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
  • 29. Nattiya Kanhabua 29PhD defense • Discover time-based synonyms over time using Wikipedia – Generally, synonyms are words with similar meanings – This work refers synonyms as alternative names of an entity • Improve the accuracy of time of synonyms • Query expansion using time-based synonyms Our contributions Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
  • 30. Nattiya Kanhabua 30PhD defense Recognize named entities
  • 31. Nattiya Kanhabua 31PhD defense Recognize named entities
  • 32. Nattiya Kanhabua 32PhD defense Recognize named entities
  • 33. Nattiya Kanhabua 33PhD defense Find synonyms • Find a set of entity-synonym relationships at time tk • For each ei ϵ Etk , extract anchor texts from article links: – Entity: President_of_the_United_States – Synonym: George W. Bush – Time: 11/2004 President_of_the_ United_States George W. Bush George W. Bush President George W. Bush President Bush (43)
  • 34. Initial results • Time periods are not accurate Note: the time of synonyms are timestamps of Wikipedia articles (8 years) PhD defense Nattiya Kanhabua 34
  • 35. • Analyze NYT Corpus to discover more accurate time – 20-year time span (1987-2007) • Use the burst detection algorithm [Kleinberg 2003] – Time periods of synonyms = burst intervals Enhancement using NYT PhD defense Nattiya Kanhabua 35 Initial results
  • 36. Nattiya Kanhabua 36PhD defense Query expansion 1. A user enters an entity as a query QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
  • 37. Nattiya Kanhabua 37PhD defense Query expansion 1. A user enters an entity as a query 2. The system retrieves synonyms wrt. the query QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
  • 38. Nattiya Kanhabua 38PhD defense Query expansion 1. A user enters an entity as a query 2. The system retrieves synonyms wrt. the query 3. The user select synonyms to expand the query QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
  • 39. Nattiya Kanhabua 39PhD defense Part 1- Synonym detection • Collection – The whole history of English Wikipedia • all pages and revisions 03/2001 to 03/2008 • 85 month snapshots about 2.8 Terabytes • Result – Randomly selected 500 entity-synonym relationships for evaluating • Accuracy 51% for all types of entities • Accuracy 73% for people, organization, and company Part 2 - Query expansion • Collection – TREC Robust2004 Track (250 queries) – NewsLibrary.com over 100M U.S. news articles (20 temporal queries) – Result • Baseline: Probabilistic Model without query expansion • QE significantly improves the effectiveness over the baseline for both collections • Open issues – Only the name changes of famous persons can be discovered Experiments
  • 40. Nattiya Kanhabua 40PhD defense Two problems are addressed 1. Performance prediction • Predict the retrieval effectiveness wrt. a ranking model Query prediction problems query precision = ? recall = ? MAP = ? predict
  • 41. Nattiya Kanhabua 41PhD defense Two problems are addressed 1. Performance prediction • Predict the retrieval effectiveness wrt. a ranking model 2. Ranking prediction • Predict the ranking model that is most suitable Query prediction problems query ranking = ? predict max(precision) max(recall) max(MAP)
  • 42. Nattiya Kanhabua 42PhD defense Problem Statement • Predict the effectiveness (e.g., MAP) that a query will achieve in advance of, or during retrieval [Hauff 2010] – high MAP  “good” – low MAP  “poor” Objective • Apply query enhancement techniques to improve the overall performance – Query suggestion is applied for “poor” queries • To best of our knowledge, predicting the performance of temporal queries has never done before RQ4: Query performance prediction
  • 43. Nattiya Kanhabua 43PhD defense • Contributions – First study of performance prediction for temporal queries – Propose 10 time-based pre-retrieval predictors • Both text and time are considered • Experiment – Collection: NYT Corpus and 40 temporal queries [Berberich 2010] • Results – Time-based predictors outperform keyword-based predictors – Combined predictors outperform single predictors in most cases • Open issue – Increase the number of queries – Consider time uncertainty Discussion Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
  • 44. Nattiya Kanhabua 44PhD defense • Problem statement – Two time dimensions: publication time and content time • Content time = temporal expressions mentioned in documents – Difference in effectiveness for temporal queries when ranking using publication time or content time RQ5: Time-aware ranking prediction
  • 45. Nattiya Kanhabua 45PhD defense • Contributions – First study of the impact on effectiveness of ranking models using the two time dimensions – Three features from analyzing top-k documents • Temporal KL-divergence [Diaz 2004] • Content Clarity [Cronen-Townsend 2002] • Divergence of retrieval scores [Peng 2010] • Results – A small number of top-k documents achieves better performance – The larger number k, the more irrelevant documents are introduced into the analysis • Open issue – When comparing with the optimal case there is still room for further improvements Discussion Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, (under submission).
  • 46. PART III - RETRIEVAL AND RANKING MODELS PhD defense Nattiya Kanhabua 46
  • 47. Nattiya Kanhabua 47PhD defense • Problem statements – Time must be explicitly modeled in order to increase the effectiveness – Time uncertainty should be taken into account • Two temporal expressions can refer to the same time period even though they are not equally written • Example – Given the query “Independence Day 2011”, a retrieval model relying on term-matching will fail to retrieve documents mentioning “July 4, 2011” RQ6: Time-aware ranking models
  • 48. Nattiya Kanhabua 48PhD defense • Contributions – Analyze and compare five ranking methods • Experiment – Collection: NYT Corpus and 40 temporal queries[Berberich 2010] • Result – TSU outperforms other methods significantly for most metrics • Conclusions – Although TSU gains the best performance, it is limited for a document collection with no time metadata – LMT, LMTU can be applied to any collection without time metadata, but extraction of temporal expressions is needed. Discussion Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
  • 49. Nattiya Kanhabua 49PhD defense • Problem statement – Can the combination of time and other features help improving the retrieval effectiveness? RQ7:Ranking related news predictions • A new task called ranking related news predictions – Retrieve predictions related to a news story in news archives – Rank them according to their relevance to the news story
  • 50. Nattiya Kanhabua 50PhD defense Related news predictions
  • 51. Nattiya Kanhabua 51PhD defense • Define the task ranking related news predictions – Searching the future is proposed in [Baeza-Yates 2005] • Propose four classes of features – Term similarity, entity-based similarity, topic similarity and temporal similarity • Rank predictions using learning-to-rank [Liu 2009] • Make available the dataset with over 6000 judgments Contributions Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011.
  • 52. Nattiya Kanhabua 52PhD defense • NYT Corpus – More than 25% contain at least one prediction • Feature analysis – Topic features play an important role in ranking – Features in top-5 features with lowest weights are entity- based features • Open issues – Extract predictions from other sources, e.g., Wikipedia, blogs, comments, etc. – Sentiment analysis for future-related information. Experiments
  • 53. Nattiya Kanhabua 53PhD defense Solutions to all research questions: Part I - Content Analysis RQ1: How to determine time of non-timestamped documents? Part II - Query Analysis RQ2: How to determine time of queries? RQ3: How to handle terminology changes over time? RQ4: How to predict the effectiveness of temporal queries? RQ5: How to predict the suitable time-aware ranking? Part III - Retrieval and Ranking Models RQ6: How to model time into retrieval and ranking? RQ7: How to combine different features and time for ranking? Conclusions
  • 54. Nattiya Kanhabua 54PhD defense • Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non- Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2008. • Nattiya Kanhabua and Kjetil Nørvåg, Using temporal language models for document dating, In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2009 • Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2010. • Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010. • Nattiya Kanhabua and Kjetil Nørvåg, QUEST: query expansion using synonyms over time, In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010. • Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011. • Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011. • Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of the 34th Annual ACMSIGIR Conference (SIGIR), 2011. • Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, Technical Report. Publications
  • 55. Nattiya Kanhabua 55PhD defense • [Baeza-Yates 2005] R. A. Baeza-Yates. Searching the future. In Proceedings of SIGIR workshop on mathematical/formal methods in information retrieval MF/IR, SIGIR ’05, 2005. • [Berberich 2010] K. Berberich, S. J. Bedathur, O. Alonso, and G. Weikum. A language modeling approach for temporal information needs. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 13-25, 2010. • [Cronen-Townsend 2002] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’02, pp. 299-306, 2002. • [Diaz 2004] F. Diaz and R. Jones. Using temporal profiles of queries for precision prediction. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’04, pp. 18-24, 2004. • [Hauff 2010] C. Hauff, L. Azzopardi, D. Hiemstra, and F. de Jong. Query performance prediction: Evaluation contrasted with effectiveness. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 204-216, April 2010. • [de Jong 2005] F. de Jong, H. Rode, and D. Hiemstra. Temporal language models for the disclosure of historical text. In Humanities, computers and cultural heritage: Proceedings of the 16th International Conference of the Association for History and Computing, AHC '05, pp. 161-168, 2005. • [Kleinberg 2003] J. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 7:pp. 373-397, October 2003. • [Liu 2009] T-Y. Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):pp. 225-331, March 2009. • [Metzler 2009] D. Metzler, R. Jones, F. Peng, and R. Zhang. Improving search relevance for implicitly temporal queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pp. 700-701, 2009. • [Nunes 2008] S. Nunes, C. Ribeiro, and G. David. Use of temporal expressions in web search. In Proceedings of the 30th European Conference on IR Research on Advances in Information Retrieval, ECIR ’08, pp. 580-584, 2008. • [Peng 2010] J. Peng, C. Macdonald, and I. Ounis. Learning to select a ranking function. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 114-126, 2010. References
  • 56. Thank you PhD defense Nattiya Kanhabua 56