Leveraging Dynamic Query Subtopics for
Time-aware Search Result Diversification
Tu Ngoc Nguyen and Nattiya Kanhabua
Motivation
• The underlying aspects of a query change over time
Outline
I. Subtopic Mining
A. From Query Logs
B. From Document Collection
II. Time-aware Diversifying Models
III. Experiment
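System Pipeline
[Figure: temporally ambiguous, multi-faceted queries feed subtopic mining (from the query log and from the document collection), which yields dynamic subtopics at time t; time-aware diversification is then applied to the documents d1, d2, ..., dn. Stage shown: 1.a]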
Subtopic Mining Approach: Query Log
[Figure: at each querying time (t, t + 1), related queries q1, q2, ..., qn are taken from the co-click bipartite graph and clustered into subtopics]
Subtopic Dynamics: Query Log
query: ncaa
[Figure: timeline of subtopics for the query]
- 14/03/2006: march madness began
- 18/03/2006: ncaa women tournament began
- 01/04/2006: final four began
System Pipeline
[Figure: system pipeline diagram, stage 1.b]
Subtopic Mining: Document Collection
– Probabilistic subtopic modeling: Latent Dirichlet Allocation (LDA)
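As a rough illustration of LDA-based subtopic mining, the sketch below runs a small collapsed Gibbs sampler over tokenized documents and lists the most probable words of each topic. It is not the authors' implementation: the function names (`lda_gibbs`, `top_words`), the hyperparameters `alpha` and `beta`, and the toy documents are assumptions made for this example.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one word distribution per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # n(d, k)
    topic_word = [Counter() for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                        # n(k)
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    smooth = beta * len(vocab)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + smooth)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of Pr(word | topic).
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + smooth) for w in vocab}
        for k in range(num_topics)
    ]


def top_words(topic, n=10):
    """The n most probable words of a topic."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


# Toy documents for the query "apple" (invented for illustration).
docs = [
    ["apple", "iphone", "ipad", "mac"],
    ["apple", "jam", "quince", "fruit"],
    ["iphone", "mac", "apple", "ipad"],
    ["fruit", "jam", "apple", "quince"],
]
topics = lda_gibbs(docs, num_topics=2)
for topic in topics:
    print(top_words(topic, n=3))
```

On a real collection the sampler would be run once per time slice, so that the topics can differ from one slice to the next.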
Subtopic Dynamics: Document Collection
query: apple
[Figure: subtopic dynamics (Blogs08) compared with query volume (Google Trends)]
System Pipeline
[Figure: system pipeline diagram, stage 2]
IA-Select Model [Vallet and Castells. 2012]
- Probabilistic model:
- Pr(c|q): weight of a given subtopic c in a query q
- Pr(q|d), Pr(d|q): relation between document and query
- Pr(c|d), Pr(d|c): relation between document and subtopic
- IA-Select objective function: [equation shown as an image; its annotated terms are document relevance and novelty]
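A minimal sketch of greedy IA-Select, assuming the commonly cited form of the objective: the score of a document is the sum over subtopics of Pr(c|q) · Pr(d|q,c) · Π over already selected d' of (1 − Pr(d'|q,c)). The function name `ia_select`, the dictionary layout, and the toy probabilities are assumptions for illustration and are not taken from the slides.

```python
def ia_select(candidates, subtopic_weight, rel, k):
    """Greedy IA-Select.

    candidates      -- list of document ids
    subtopic_weight -- {subtopic: Pr(c|q)}
    rel             -- {(doc, subtopic): Pr(d|q,c)}, missing pairs count as 0
    k               -- number of documents to select
    """
    residual = dict(subtopic_weight)  # subtopic weight not yet covered by the selection
    pool = list(candidates)
    selected = []
    while pool and len(selected) < k:
        # Document relevance weighted by how uncovered each subtopic still is.
        best = max(
            pool,
            key=lambda d: sum(u * rel.get((d, c), 0.0) for c, u in residual.items()),
        )
        selected.append(best)
        pool.remove(best)
        # Novelty update: discount every subtopic the chosen document covers.
        for c in residual:
            residual[c] *= 1.0 - rel.get((best, c), 0.0)
    return selected


# Toy example for the query "ncaa" (all numbers invented).
weights = {"tournament": 0.6, "women": 0.4}
rel = {("d1", "tournament"): 0.9, ("d2", "tournament"): 0.8, ("d3", "women"): 0.7}
ranking = ia_select(["d1", "d2", "d3"], weights, rel, k=3)
print(ranking)
```

After d1 is chosen, the "tournament" subtopic is almost fully covered, so d3 (the only document on "women") is ranked ahead of the more relevant but redundant d2.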
Search Result Diversification Models
• MMR [Carbonell and Goldstein. 1998]
- diversify based on similarity of document contents
• IA-Select [Agrawal et al. 2009]
- diversify based on the taxonomy of subtopic categories
• xQuaD [Santos et al. 2010]
- general form of IA-Select
- define objective function as a mixture of relevance and diversity probabilities
• Topic richness [Dou et al. 2011]
- general form of xQuaD and IA-Select models
- accepts topics from multiple sources
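To make the content-based model in the list concrete, here is a small sketch of MMR re-ranking. It assumes the usual form of the criterion, λ · sim(d, q) − (1 − λ) · max over selected d' of sim(d, d'), with Jaccard overlap of word sets standing in for document similarity. The names `mmr` and `jaccard`, the λ value, and the toy data are assumptions for this example.

```python
def jaccard(a, b):
    """Jaccard overlap of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr(candidates, query_sim, doc_sim, k, lam=0.5):
    """Greedy Maximal Marginal Relevance selection of k documents."""
    pool = list(candidates)
    selected = []
    while pool and len(selected) < k:
        def score(d):
            # Penalize similarity to the closest already selected document.
            redundancy = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


# Toy documents as word sets, with invented query similarities.
words = {
    "a": {"ncaa", "bracket", "tips"},
    "b": {"ncaa", "bracket", "picks"},
    "c": {"ncaa", "women", "final"},
}
query_sim = {"a": 0.9, "b": 0.85, "c": 0.6}
order = mmr(words, query_sim, lambda x, y: jaccard(words[x], words[y]), k=3)
print(order)
```

Document b is nearly a duplicate of a, so c is promoted to second place even though its query similarity is lower.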
xQuaD Model [Vallet and Castells. 2012]
- Probabilistic model:
- Pr(c|q): weight of a given subtopic c in a query q
- Pr(q|d), Pr(d|q): relation between document and query
- Pr(c|d), Pr(d|c): relation between document and subtopic
- xQuaD objective function: [equation shown as an image; its annotated terms are document-query relevance, document-topic relevance, and novelty]
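The following sketch shows greedy xQuaD under the standard mixture form, (1 − λ) · Pr(d|q) + λ · Σ over c of Pr(c|q) · Pr(d|q,c) · Π over selected d' of (1 − Pr(d'|q,c)). It is an illustration only: the function name `xquad`, the data layout, and every probability below are assumptions rather than values from the slides.

```python
def xquad(candidates, query_rel, subtopic_weight, rel, k, lam=0.5):
    """Greedy xQuaD: mix document-query relevance with subtopic-level diversity.

    query_rel       -- {doc: Pr(d|q)}
    subtopic_weight -- {subtopic: Pr(c|q)}
    rel             -- {(doc, subtopic): Pr(d|q,c)}, missing pairs count as 0
    """
    novelty = {c: 1.0 for c in subtopic_weight}  # prod of (1 - Pr(d'|q,c)) so far
    pool = list(candidates)
    selected = []
    while pool and len(selected) < k:
        def score(d):
            diversity = sum(
                w * rel.get((d, c), 0.0) * novelty[c]
                for c, w in subtopic_weight.items()
            )
            return (1 - lam) * query_rel[d] + lam * diversity

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
        for c in novelty:
            novelty[c] *= 1.0 - rel.get((best, c), 0.0)
    return selected


# Toy example for the query "ncaa" (all numbers invented).
query_rel = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
weights = {"tournament": 0.6, "women": 0.4}
rel = {("d1", "tournament"): 0.9, ("d2", "tournament"): 0.8, ("d3", "women"): 0.7}
balanced = xquad(["d1", "d2", "d3"], query_rel, weights, rel, k=3, lam=0.5)
diverse = xquad(["d1", "d2", "d3"], query_rel, weights, rel, k=3, lam=0.9)
print(balanced, diverse)
```

With λ = 0.5 the relevance term keeps d2 in second place; with λ = 0.9 the diversity term dominates and d3 moves up.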
Temporal Diversifying Models
• Temp-IASelect
- IA-Select objective function: [equation shown as an image]
- Temp-IASelect objective function: [equation shown as an image]
• Temp-xQuaD
- xQuaD objective function: [equation shown as an image]
- Temp-xQuaD objective function: [equation shown as an image]
• Temp-topic richness
- generalization of Temp-xQuaD and Temp-IASelect
Experiment Settings
• TREC Blogs08 Collection
- crawled from Jan 2008 to Feb 2009
- HTML tags cleaned using the HtmlCleaner and Boilerpipe libraries
- indexed using Lucene Core
- document’s publication date extracted from:
  - Blog content
  - URL
  - Retrieval date
• Retrieval baseline:
- Okapi BM25
• Relevance assessments:
- human assessment
- binary relevance judgment, following TREC Diversity Track 2009 and 2011
- two-dimensional assessment: relevance and time
- exclude topics mined from the query log (time gap between AOL and Blogs08)
- the 10 most probable words represent a topic
• Querying-time points:
- how popular the query is at a particular time t
- how different it is from the previous time slice t-1
Relevance assessments

Document 1
- Title: It’s The Most Wonderful Time Of The Year
- Content: The greatest sports week of the year is upon us, and Chris’s Sports Blog is ready. Check back daily for coverage of the ACC and NCAA Tournament, tips on how to fill out your bracket …
- Publication date: 2008-03-17
- Subtopic: ncaa basketball tournament
- Hitting time: 2008-03
- Relevance: 1

Document 2
- Title: Is there a bigger joke than the NCAA?
- Content: Each year the NCAA discovers a new way to make a bigger ass of itself than the previous one. This years specialty is to bar from NCAA playoffs in every sport any school who persists in using "unacceptable" team names and school mascots, by their exclusive definition…
- Publication date: 2005-08-22
- Subtopic: ncaa basketball tournament
- Hitting time: 2008-03
- Relevance: 0

Document 3
- Title: Apple Quince Jam
- Content: The apple quince is a fruit that ripens in the period from October to November, it is an apple with a strange shape: it looks like a pear and apple, and it is lumpy…
- Publication date: 2007-11-15
- Subtopic: apple jam
- Hitting time: 2008-03
- Relevance: 1
Evaluation Metrics
• α-NDCG
- adding diversity and novelty to nDCG
• Intent-Aware Precision (Precision-IA)
- intent-aware version of Precision
- treat each subtopic as a distinct interpretation of the query
• Intent-Aware Expected Reciprocal Rank (ERR-IA)
- based on the cascade model of search
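As an illustration of how diversity and novelty enter nDCG, the sketch below computes the unnormalized α-DCG of a ranking: each subtopic a document covers contributes a gain of (1 − α)^r, where r is the number of earlier documents that already covered that subtopic, discounted by log2(rank + 1). Dividing by the α-DCG of an ideal ranking would give α-nDCG; that step is left out here. The function name, the judgment layout, and the toy judgments are assumptions.

```python
import math


def alpha_dcg(ranking, judgments, alpha=0.5, depth=10):
    """Unnormalized alpha-DCG.

    ranking   -- list of document ids, best first
    judgments -- {doc: set of subtopics the document is relevant to}
    """
    seen = {}  # subtopic -> number of higher-ranked documents covering it
    total = 0.0
    for rank, doc in enumerate(ranking[:depth], start=1):
        gain = 0.0
        for subtopic in judgments.get(doc, ()):
            # Repeated coverage of the same subtopic is worth less each time.
            gain += (1 - alpha) ** seen.get(subtopic, 0)
            seen[subtopic] = seen.get(subtopic, 0) + 1
        total += gain / math.log2(rank + 1)
    return total


# Invented binary judgments per subtopic.
judgments = {"d1": {"tournament"}, "d2": {"tournament"}, "d3": {"women"}}
redundant = alpha_dcg(["d1", "d2", "d3"], judgments)
diversified = alpha_dcg(["d1", "d3", "d2"], judgments)
print(redundant, diversified)
```

The ranking that surfaces the second subtopic earlier receives the higher score, which is the behavior the metric is designed to reward.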
Experimental Results
[Tables: experimental results]
* baselines with dynamic subtopic mining
△ (p < 0.05), △△ (p < 0.01)
Conclusion
• studied temporally ambiguous and multi-faceted queries
- subtopic temporal variability
- subtopic mining from two different sources (query logs, document collection)
• proposed time-aware search result diversification frameworks
Future work
• Model and predict the subtopic change
• Combine diversification by subtopic and by time in a unified framework
THANK YOU.
Settings
• Estimate natural number of subtopics
– Suresh et al. 2010 view LDA as a matrix factorization mechanism
– $C_{d \times w} = M1_{d \times t} \times M2_{t \times w}$
• d: number of documents in the corpus
• w: size of the vocabulary
• t: number of topics
– the optimum t is the one with the minimal divergence value
• [divergence equation shown as an image]
• CM1 is the distribution of singular values of M1
• CM2 is obtained by normalizing the vector L · M2, where L is the 1 × D vector of the lengths of each document in C
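The selection rule can be sketched as follows, assuming the divergence between CM1 and CM2 is the symmetric Kullback-Leibler divergence, KL(CM1 || CM2) + KL(CM2 || CM1). That choice is an assumption here, since the equation itself is not available in the transcript. The distributions below are invented and would in practice come from an LDA model fitted for each candidate t.

```python
import math


def kl(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetric KL divergence (assumed form of the divergence measure)."""
    return kl(p, q) + kl(q, p)


def best_topic_count(candidates):
    """candidates: {t: (CM1, CM2)}; returns the t with minimal divergence."""
    return min(candidates, key=lambda t: symmetric_kl(*candidates[t]))


# Hypothetical (CM1, CM2) pairs for t = 2 and t = 3 topics.
candidates = {
    2: ([0.7, 0.3], [0.4, 0.6]),
    3: ([0.5, 0.3, 0.2], [0.45, 0.35, 0.2]),
}
chosen = best_topic_count(candidates)
print(chosen)
```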
Motivation
• New documents appear all the time
• Document content changes over time
• Queries and query volumes change over time
• Example: [Kulkarni et al. 2011]
[Figure: example queries "march madness" and "ncaa"]
Query Dynamic Metrics
• Kendall τ Coefficient based:
• Jaccard Coefficient based:
Cluster Subtopic Candidates
• Clustering approach [Song et al. 2011]:
– step 1: Construct a similarity matrix of the related queries
– step 2: Cluster using Affinity Propagation algorithm
– step 3: Extract a set of exemplars as subtopics of the query
• Similarity metrics:
– lexical similarity:
• keywords and cosine similarity
– co-click similarity:
• based on fraction of common clicks
– semantic similarity:
• use WordNet as external KB
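Step 1 of the clustering approach can be illustrated with co-click similarity alone. The sketch assumes that "fraction of common clicks" means the Jaccard overlap of the sets of URLs clicked for two queries; the exact definition in [Song et al. 2011] may differ. The function names, queries, and URLs are hypothetical.

```python
def co_click_similarity(clicks_a, clicks_b):
    """Shared clicked URLs as a fraction of all URLs clicked for either query."""
    union = clicks_a | clicks_b
    return len(clicks_a & clicks_b) / len(union) if union else 0.0


def similarity_matrix(clicks):
    """clicks: {query: set of clicked URLs}; returns (queries, square matrix)."""
    queries = sorted(clicks)
    matrix = [
        [co_click_similarity(clicks[a], clicks[b]) for b in queries]
        for a in queries
    ]
    return queries, matrix


# Hypothetical click data for queries related to "ncaa".
clicks = {
    "ncaa bracket": {"espn.example/bracket", "ncaa.example"},
    "march madness bracket": {"espn.example/bracket", "cbs.example/mm"},
    "ncaa women": {"ncaa.example/women"},
}
queries, matrix = similarity_matrix(clicks)
for query, row in zip(queries, matrix):
    print(query, [round(value, 2) for value in row])
```

A matrix of this kind, possibly combined with the lexical and semantic similarities, is what a clustering algorithm such as Affinity Propagation would take as input in step 2.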
Topic Similarity Metrics [Kim and Oh. 2011]
• Vector-based:
– Cosine Similarity
• Bag of words-based:
– Jaccard Coefficient
• Ranked list of words-based:
– Kendall τ Coefficient
• Multinomial distribution-based:
– Kullback-Leibler Divergence
– Jensen-Shannon Divergence
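Three of the listed metrics are sketched below for topics represented as word-to-probability dictionaries: Jaccard over the word sets, cosine over the probability vectors, and Jensen-Shannon divergence (base 2, so values lie in [0, 1]). The representation, the function names, and the two toy topics are assumptions for illustration.

```python
import math


def jaccard(topic_a, topic_b):
    """Bag-of-words view: overlap of the two topics' word sets."""
    a, b = set(topic_a), set(topic_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine(topic_a, topic_b):
    """Vector view: cosine of the two probability vectors."""
    dot = sum(p * topic_b.get(w, 0.0) for w, p in topic_a.items())
    norm_a = math.sqrt(sum(p * p for p in topic_a.values()))
    norm_b = math.sqrt(sum(p * p for p in topic_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def js_divergence(topic_a, topic_b):
    """Multinomial view: Jensen-Shannon divergence with log base 2."""
    vocab = set(topic_a) | set(topic_b)
    mean = {w: 0.5 * (topic_a.get(w, 0.0) + topic_b.get(w, 0.0)) for w in vocab}

    def kl_to_mean(topic):
        # The mean is positive wherever the topic is, so the ratio is defined.
        return sum(p * math.log2(p / mean[w]) for w, p in topic.items() if p > 0)

    return 0.5 * kl_to_mean(topic_a) + 0.5 * kl_to_mean(topic_b)


# Two toy topics for the query "apple" (invented probabilities).
tech = {"apple": 0.5, "iphone": 0.5}
food = {"apple": 0.5, "jam": 0.5}
print(jaccard(tech, food), cosine(tech, food), js_divergence(tech, food))
```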
Subtopic Mining Approach
• Dynamic queries: 57 of the 61 queries from the AOL query log are selected, i.e. those that are yearly recurrent or time-independent.
• Settings
– partition the collection into 14 one-month time slices
– training data in time slice ti is the top 2000 documents D with d ∈ D, pubDate(d) ∈ ti
– the number of subtopics is preset in the range from 5 to 20
• Subtopic weight:
– weight w(c) is the probability that a given query q implies subtopic c
– [equation shown as an image]
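The time-slicing setting can be sketched as below: Jan 2008 to Feb 2009 gives 14 monthly slices, and each slice keeps its top-scoring documents whose publication date falls inside it. The tuple layout `(doc_id, pub_date, score)`, the function names, and the sample documents are assumptions; the score stands for whatever retrieval score ranks the documents for the query.

```python
from datetime import date


def month_slices(first, last):
    """All (year, month) pairs from first to last, inclusive."""
    year, month = first
    slices = []
    while (year, month) <= last:
        slices.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return slices


def training_docs(docs, slices, top_k=2000):
    """Group documents by publication month and keep the top_k by score."""
    by_slice = {s: [] for s in slices}
    for doc_id, pub_date, score in docs:
        key = (pub_date.year, pub_date.month)
        if key in by_slice:  # documents published outside the window are dropped
            by_slice[key].append((score, doc_id))
    return {
        s: [doc_id for _, doc_id in sorted(items, reverse=True)[:top_k]]
        for s, items in by_slice.items()
    }


slices = month_slices((2008, 1), (2009, 2))
docs = [
    ("d1", date(2008, 3, 17), 12.4),
    ("d2", date(2005, 8, 22), 9.1),
    ("d3", date(2008, 3, 2), 14.0),
]
per_slice = training_docs(docs, slices)
print(len(slices), per_slice[(2008, 3)])
```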
Subtopic Evaluation
• 61 queries: 51 event-related queries, 10 standard ambiguous queries
– aspect removed e.g. march madness brackets → march madness
• Subtopic evaluation metrics [Radlinski et al. 2010]:
– coherence
– distinctness
– plausibility
– completeness
Subtopic Evaluation
- Perplexity: a measure of how well a model generalizes to unseen documents
- [equation shown as an image]
- use holdout validation with 90% of the data for training and 10% for testing
- randomly select 20 out of the 57 queries, each at a random time slice
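A small sketch of the evaluation procedure, assuming the standard definition of perplexity for topic models: exp(− Σ_d log p(w_d) / Σ_d N_d), where the sums run over held-out documents, p(w_d) is the model likelihood of document d, and N_d is its length. The helper names and the toy numbers are assumptions; the per-document log-likelihoods would come from the trained model.

```python
import math
import random


def holdout_split(docs, train_fraction=0.9, seed=0):
    """Shuffle the documents and split them into training and test parts."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def perplexity(log_likelihoods, token_counts):
    """exp of the negative average log-likelihood per held-out token."""
    return math.exp(-sum(log_likelihoods) / sum(token_counts))


train, test = holdout_split(range(20))
print(len(train), len(test))

# Sanity check: a model that assigns probability 1/4 to every token
# has perplexity 4, whatever the document lengths are.
lengths = [3, 5]
uniform = perplexity([n * math.log(0.25) for n in lengths], lengths)
print(uniform)
```

Lower perplexity on the held-out 10% indicates a model that generalizes better.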

More Related Content

Viewers also liked

Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
Nattiya Kanhabua
 
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
Nattiya Kanhabua
 
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Nattiya Kanhabua
 
Temporal summarization of event related updates
Temporal summarization of event related updatesTemporal summarization of event related updates
Temporal summarization of event related updates
Nattiya Kanhabua
 
On the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in WikipediaOn the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in Wikipedia
Nattiya Kanhabua
 
Identifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world EventsIdentifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world Events
Nattiya Kanhabua
 
Exploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document ArchivesExploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document Archives
Nattiya Kanhabua
 
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Nattiya Kanhabua
 
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Nattiya Kanhabua
 
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
Nattiya Kanhabua
 
Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?
Nattiya Kanhabua
 
Understanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of OutbreaksUnderstanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of Outbreaks
Nattiya Kanhabua
 
Exploiting temporal information in retrieval of archived documents (doctoral ...
Exploiting temporal information in retrieval of archived documents (doctoral ...Exploiting temporal information in retrieval of archived documents (doctoral ...
Exploiting temporal information in retrieval of archived documents (doctoral ...
Nattiya Kanhabua
 
Search, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving DataSearch, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving Data
Nattiya Kanhabua
 
Time-aware Approaches to Information Retrieval
Time-aware Approaches to Information RetrievalTime-aware Approaches to Information Retrieval
Time-aware Approaches to Information Retrieval
Nattiya Kanhabua
 
Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?
Nattiya Kanhabua
 
Temporal Web Dynamics and Implications for Information Retrieval
Temporal Web Dynamics and Implications for Information RetrievalTemporal Web Dynamics and Implications for Information Retrieval
Temporal Web Dynamics and Implications for Information Retrieval
Nattiya Kanhabua
 
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Nattiya Kanhabua
 

Viewers also liked (18)

Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
Improving Temporal Language Models For Determining Time of Non-Timestamped Do...
 
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
Concise Preservation by Combining Managed Forgetting and Contextualized Remem...
 
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
 
Temporal summarization of event related updates
Temporal summarization of event related updatesTemporal summarization of event related updates
Temporal summarization of event related updates
 
On the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in WikipediaOn the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in Wikipedia
 
Identifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world EventsIdentifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world Events
 
Exploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document ArchivesExploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document Archives
 
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
 
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
 
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalyst...
 
Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?
 
Understanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of OutbreaksUnderstanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of Outbreaks
 
Exploiting temporal information in retrieval of archived documents (doctoral ...
Exploiting temporal information in retrieval of archived documents (doctoral ...Exploiting temporal information in retrieval of archived documents (doctoral ...
Exploiting temporal information in retrieval of archived documents (doctoral ...
 
Search, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving DataSearch, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving Data
 
Time-aware Approaches to Information Retrieval
Time-aware Approaches to Information RetrievalTime-aware Approaches to Information Retrieval
Time-aware Approaches to Information Retrieval
 
Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?
 
Temporal Web Dynamics and Implications for Information Retrieval
Temporal Web Dynamics and Implications for Information RetrievalTemporal Web Dynamics and Implications for Information Retrieval
Temporal Web Dynamics and Implications for Information Retrieval
 
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
 

Similar to Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

Update of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agentsUpdate of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agents
Vrije Universiteit Amsterdam
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
Angelo Salatino
 
A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...
Aravind Sesagiri Raamkumar
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
Treasure Data, Inc.
 
Temporal Web Dynamics: Implications from Search Perspective
Temporal Web Dynamics: Implications from Search PerspectiveTemporal Web Dynamics: Implications from Search Perspective
Temporal Web Dynamics: Implications from Search Perspective
Nattiya Kanhabua
 
20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar
Ben Blaiszik
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
Yongyao Jiang
 
Tutorial on query auto completion
Tutorial on query auto completionTutorial on query auto completion
Tutorial on query auto completion
Yichen Feng
 
Tutorial on query auto-completion
Tutorial on query auto-completionTutorial on query auto-completion
Tutorial on query auto-completion
Yichen Feng
 
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Thanh Tran
 
Temporal models for mining, ranking and recommendation in the Web
Temporal models for mining, ranking and recommendation in the WebTemporal models for mining, ranking and recommendation in the Web
Temporal models for mining, ranking and recommendation in the Web
Tu Nguyen
 
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Aravind Sesagiri Raamkumar
 
HDF5 FastQuery
HDF5 FastQueryHDF5 FastQuery
Digital Preservation at UNM Libraries
Digital Preservation at UNM LibrariesDigital Preservation at UNM Libraries
Digital Preservation at UNM Libraries
Kevin J. Comerford, University of New Mexico
 
CiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big DataCiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big Data
Jian Wu
 
Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...
Zide Meng
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
88mooom
 
Chemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopChemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the Desktop
Marcus Hanwell
 
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...Giannis Tsakonas
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
Anubhav Jain
 

Similar to Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification (20)

Update of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agentsUpdate of time-invalid information in knowledge bases through mobile agents
Update of time-invalid information in knowledge bases through mobile agents
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
 
A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Temporal Web Dynamics: Implications from Search Perspective
Temporal Web Dynamics: Implications from Search PerspectiveTemporal Web Dynamics: Implications from Search Perspective
Temporal Web Dynamics: Implications from Search Perspective
 
20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
 
Tutorial on query auto completion
Tutorial on query auto completionTutorial on query auto completion
Tutorial on query auto completion
 
Tutorial on query auto-completion
Tutorial on query auto-completionTutorial on query auto-completion
Tutorial on query auto-completion
 
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
 
Temporal models for mining, ranking and recommendation in the Web
Temporal models for mining, ranking and recommendation in the WebTemporal models for mining, ranking and recommendation in the Web
Temporal models for mining, ranking and recommendation in the Web
 
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
Comparison of Techniques for Measuring Research Coverage of Scientific Papers...
 
HDF5 FastQuery
HDF5 FastQueryHDF5 FastQuery
HDF5 FastQuery
 
Digital Preservation at UNM Libraries
Digital Preservation at UNM LibrariesDigital Preservation at UNM Libraries
Digital Preservation at UNM Libraries
 
CiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big DataCiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big Data
 
Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
 
Chemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopChemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the Desktop
 
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
 

Recently uploaded

Getting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerGetting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control Tower
Vladimir Samoylov
 
Eureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationEureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 Presentation
Access Innovations, Inc.
 
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdfBonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
khadija278284
 
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptxsomanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
Howard Spence
 
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
OECD Directorate for Financial and Enterprise Affairs
 
Media as a Mind Controlling Strategy In Old and Modern Era
Media as a Mind Controlling Strategy In Old and Modern EraMedia as a Mind Controlling Strategy In Old and Modern Era
Media as a Mind Controlling Strategy In Old and Modern Era
faizulhassanfaiz1670
 
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Sebastiano Panichella
 
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
OWASP Beja
 
María Carolina Martínez - eCommerce Day Colombia 2024
María Carolina Martínez - eCommerce Day Colombia 2024María Carolina Martínez - eCommerce Day Colombia 2024
María Carolina Martínez - eCommerce Day Colombia 2024
eCommerce Institute
 
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXOBitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Matjaž Lipuš
 
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Sebastiano Panichella
 
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdfSupercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Access Innovations, Inc.
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutes
IP ServerOne
 
Obesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditionsObesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditions
Faculty of Medicine And Health Sciences
 
International Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software TestingInternational Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software Testing
Sebastiano Panichella
 
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Orkestra
 


Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification

  • 1. Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification. Tu Ngoc Nguyen and Nattiya Kanhabua
  • 2. Motivation: the underlying aspects of a query change over time
  • 3. Outline: I. Subtopic Mining (A. From Query Logs; B. From Document Collection); II. Time-aware Diversifying Models; III. Experiment
  • 4. System Pipeline (figure, step 1.a): temporally ambiguous, multi-faceted queries; subtopic mining from query log and document collection; dynamic subtopics over time t; time-aware diversification; ranked documents d1, d2, ..., dn
  • 5. Subtopic Mining Approach: Query Log (figure): co-click bipartite graph; related queries q1, q2, ..., qn; clustering into subtopics at querying time t and at querying time t + 1
  • 6. Subtopic Dynamics: Query Log (figure), query: ncaa. Timeline: 14/03/2006 march madness began; 18/03/2006 ncaa women tournament began; 01/04/2006 final four began
  • 7. System Pipeline (figure, step 1.b): temporally ambiguous, multi-faceted queries; subtopic mining from query log and document collection; dynamic subtopics over time t; time-aware diversification; ranked documents d1, d2, ..., dn
  • 8. Subtopic Mining: Document Collection. Probabilistic subtopic modeling with Latent Dirichlet Allocation (LDA); query: apple
  • 9. Subtopic Dynamics: Document Collection (figure): subtopic dynamics from Blogs08; query volume from GoogleTrend
  • 10. System Pipeline (figure, step 2): temporally ambiguous, multi-faceted queries; subtopic mining from query log and document collection; dynamic subtopics over time t; time-aware diversification; ranked documents d1, d2, ..., dn
  • 11. IA-Select Model [Vallet and Castells. 2012]. Probabilistic model: Pr(c|q): weight of a given subtopic in a query; Pr(q|d), Pr(d|q): relation between document and query; Pr(c|d), Pr(d|c): relation between document and subtopic. IA-Select objective function (equation shown as an image on the slide), with terms labelled document relevance and novelty
  • 12. Search Result Diversification Models. MMR [Carbonell and Goldstein. 98]: diversify based on similarity of document contents. IA-Select [Agrawal et al. 2009]: diversify based on the taxonomy of subtopic categories. xQuaD [Santos et al. 2010]: general form of IA-Select; defines the objective function as a mixture of relevance and diversity probabilities. Topic richness [Dou et al. 2011]: general form of the xQuaD and IA-Select models; accepts topics from multiple sources
  • 13. xQuaD Model [Vallet and Castells. 2012]. Probabilistic model: Pr(c|q): weight of a given subtopic in a query; Pr(q|d), Pr(d|q): relation between document and query; Pr(c|d), Pr(d|c): relation between document and subtopic. xQuaD objective function (equation shown as an image on the slide), with terms labelled novelty, document-topic relevance, and document-query relevance
  • 14. Temporal Diversifying Models: Temp-IASelect. IA-Select objective function and Temp-IASelect objective function (equations shown as images on the slide)
  • 15. Temporal Diversifying Models: Temp-xQuaD. xQuaD objective function and Temp-xQuaD objective function (equations shown as images on the slide). Temp-topic richness: generalization of temp-xQuaD and temp-IASelect
  • 16. Experiment Settings. TREC Blogs08 Collection: crawled from Jan 2008 to Feb 2009; HTML tags cleaned using the HtmlCleaner and Boilerpipe libraries; indexed using Lucene Core; document publication date extracted from blog content, URL, retrieval date
  • 17. Experiment Settings. Retrieval baseline: Okapi BM25. Relevance assessments: human assessment; binary relevance judgment, following TREC Diversity Track 2009 and 2011; two-dimensional assessment (relevance and time); topics mined from the query log are excluded (time gap between AOL and Blogs08); the top 10 most probable words represent a topic. Querying-time points: how popular the query is at a particular time t; how different it is from the previous time slice t-1
  • 18. Relevance assessments (columns: Document | Publication Date | Subtopic | Hitting time | Relevance)
    – Title: It’s The Most Wonderful Time Of The Year. Content: The greatest sports week of the year is upon us, and Chris’s Sports Blog is ready. Check back daily for coverage of the ACC and NCAA Tournament, tips on how to fill out your bracket … | 2008-03-17 | ncaa basketball tournament | 2008-03 | 1
    – Title: Is there a bigger joke than the NCAA? Content: Each year the NCAA discovers a new way to make a bigger ass of itself than the previous one. This years specialty is to bar from NCAA playoffs in every sport any school who persists in using "unacceptable" team names and school mascots, by their exclusive definition… | 2005-08-22 | ncaa basketball tournament | 2008-03 | 0
    – Title: Apple Quince Jam. Content: The apple quince is a fruit that ripens in the period from October to November, it is an apple with a strange shape: it looks like a pear and apple, and it is lumpy… | 2007-11-15 | apple jam | 2008-03 | 1
  • 19. Evaluation Metrics. α-NDCG: adds diversity and novelty to nDCG. Intent-Aware Precision (Precision-IA): intent-aware version of Precision; treats each subtopic as a distinct interpretation of the query. Intent-Aware Expected Reciprocal Rank (ERR-IA): based on the cascade model of search
  • 20. Experimental Results (table; * marks baselines with dynamic subtopic mining)
  • 21. Experimental Results (table; significance markers: △ p < 0.05, △△ p < 0.01)
  • 22. Conclusion: studied temporally ambiguous and multi-faceted queries (subtopic temporal variability; subtopic mining from two different sources: query logs and document collection); proposed time-aware search result diversification frameworks. Future work: model and predict subtopic change; combine diversification by subtopics and time in a unified framework
  • 24. Settings: estimating the natural number of subtopics. Suresh et al. 2010 view LDA as a matrix factorization mechanism: C_{d×w} = M1_{d×t} × M2_{t×w}, where d is the number of documents in the corpus, w is the size of the vocabulary, and t is the number of topics. The optimum t is the one with the minimal divergence value (divergence formula shown as an image on the slide), where C_{M1} is the distribution of singular values of M1, and C_{M2} is obtained by normalizing the vector L · M2, with L the 1 × D vector of the lengths of each document in C
  • 25. Motivation. New documents appear all the time; document content changes over time; queries and query volumes change over time. Example [Kulkarni et al. 2011]: march madness, ncaa
  • 26. Query Dynamic Metrics: Kendall τ coefficient based; Jaccard coefficient based (formulas shown as images on the slide)
  • 27. Cluster Subtopic Candidates. Clustering approach [Song et al. 2011]: step 1: construct a similarity matrix of the related queries; step 2: cluster using the Affinity Propagation algorithm; step 3: extract a set of exemplars as subtopics of the query. Similarity metrics: lexical similarity (keywords and cosine similarity); co-click similarity (based on the fraction of common clicks); semantic similarity (uses WordNet as external KB)
  • 28. Topic Similarity Metrics [Kim and Oh. 2011]. Vector-based: Cosine Similarity. Bag of words-based: Jaccard Coefficient. Ranked list of words-based: Kendall τ Coefficient. Multinomial distribution-based: Kullback-Leibler Divergence, Jensen-Shannon Divergence
  • 29. Subtopic Mining Approach. Dynamic queries: select 57 out of 61 queries from the AOL query log, i.e. yearly recurrent or time-independent. Settings: partition the collection into 14 one-month time slices; training data in time slice ti is the top 2000 documents D with d ∈ D, pubDate(d) ∈ ti; the number of subtopics is preset in the range from 5 to 20. Subtopic weight: weight w(c) is the probability that a given query q implies subtopic c (formula shown as an image on the slide)
  • 30. Temporal Document Collection. TREC Blogs08 Collection: crawled from Jan 2008 to Feb 2009; HTML tags cleaned using the HtmlCleaner and Boilerpipe libraries; indexed using Lucene Core; document publication date extracted from blog content, URL, retrieval date
  • 31. Subtopic Evaluation. 61 queries: 51 event-related queries, 10 standard ambiguous queries; aspect removed, e.g. march madness brackets → march madness. Subtopic evaluation metrics [Radlinski et al. 2010]: coherence, distinctness, plausibility, completeness
  • 32. Subtopic Evaluation. Perplexity: a measure of the ability of a model to generalize to unseen documents (formula shown as an image on the slide); holdout validation with 90% of the data for training and 10% for testing; 20 out of 57 queries randomly selected, each at a random time slice
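
The diversification models in slides 11 to 15 are described only through their probability terms, since the objective functions did not survive in the transcript. As a rough illustration, the following Python sketch implements greedy re-ranking with xQuaD as it is commonly formulated, score(d) = (1 − λ)·Pr(d|q) + λ·Σ_c Pr(c|q)·Pr(d|q,c)·Π_{d'∈S}(1 − Pr(d'|q,c)). This formulation, the function name `xquad_rerank`, the dictionary layout, and the toy probabilities are all assumptions made for illustration; they are not taken from the slides, and the temporal variants (Temp-IASelect, Temp-xQuaD) are not reproduced here.

```python
from typing import Dict, List


def xquad_rerank(
    rel: Dict[str, float],               # assumed Pr(d|q): document-query relevance
    subtopic_weight: Dict[str, float],   # assumed Pr(c|q): subtopic weight
    cover: Dict[str, Dict[str, float]],  # assumed Pr(d|q,c), stored as cover[c][d]
    k: int,
    lam: float = 0.5,                    # trade-off between relevance and diversity
) -> List[str]:
    """Greedily pick k documents, mixing relevance with subtopic novelty."""
    selected: List[str] = []
    # not_covered[c] is the product over selected d' of (1 - Pr(d'|q,c)).
    not_covered = {c: 1.0 for c in subtopic_weight}
    candidates = set(rel)

    def score(d: str) -> float:
        diversity = sum(
            subtopic_weight[c] * cover[c].get(d, 0.0) * not_covered[c]
            for c in subtopic_weight
        )
        return (1.0 - lam) * rel[d] + lam * diversity

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (lexicographic order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
        # Subtopics covered by the chosen document lose novelty.
        for c in subtopic_weight:
            not_covered[c] *= 1.0 - cover[c].get(best, 0.0)
    return selected


# Toy data (hypothetical): d1 and d2 cover the same subtopic, d3 a different one.
rel = {"d1": 0.9, "d2": 0.8, "d3": 0.3}
weights = {"basketball": 0.5, "brackets": 0.5}
cover = {"basketball": {"d1": 0.9, "d2": 0.9}, "brackets": {"d3": 0.9}}

print(xquad_rerank(rel, weights, cover, k=2, lam=0.8))  # diversity-heavy ranking
print(xquad_rerank(rel, weights, cover, k=2, lam=0.0))  # pure relevance ranking
```

With a high λ the second pick moves to the document covering the subtopic that is still unserved, while λ = 0 falls back to the relevance-only order. A time-aware variant would presumably make the subtopic weights depend on the querying time, but how the slides do this is not recoverable from the transcript.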

Editor's Notes

  1. In this work, we use a state-of-the-art technique for finding related queries: given a query that is considered dynamic, we mine its set of related queries, assuming that these related queries expose the underlying subtopics of the dynamic query. We run a Markov random walk with restart (RWR) on the weighted bipartite graph composed of two sets of nodes, queries and URLs. The bipartite graph is partitioned by time into separate parts, from which we obtain a set of (explicit and implicit) related queries at different time intervals.
  2. LDA is a Bayesian multinomial mixture model that has become a popular, state-of-the-art method in text analysis due to its ability to produce interpretable and semantically coherent topics. A topic is represented by a ranked list of word probabilities.
  3. MMR instantiates the scoring function by estimating the similarity between d ∈ R\S and its most similar document dj ∈ S. IA-Select investigates the problem of ambiguous queries with the overall objective of maximizing the probability that an average user finds at least one relevant document in the top n search results. The model assumes that an explicit taxonomy of subtopics is available, and that both documents and queries may fall into multiple subtopics. xQuaD defines the objective function for diversification as a mixture of relevance and diversity probabilities. Topic richness is a generalization of xQuaD and IA-Select that combines subtopics from different sources.
  6. There are temporal aspects to Web search queries that search engines must account for in order to provide the most relevant results to their users. As an example, in the middle of March 2008, the query march madness suddenly became very popular, occurring thousands of times when one month before it occurred infrequently. The rise in popularity was a result of the popular annual college basketball championship in the United States. In addition to changes in query frequency, there were other changes associated with the query march madness during the championship period. For one, the National Collegiate Athletic Association (NCAA) homepage (http://ncaa.com) became very relevant to the query. The page provides comprehensive coverage of US college sports, but does not typically focus on basketball, except in March during March Madness. Other results were more relevant to the query march madness during March because they provided dynamic content. For example, the CBS Sports college basketball page, which provides real-time game information, became relevant to people seeking to learn the score of a game in progress. In contrast, relatively static pages, like the Wikipedia page about March Madness, became less relevant during this period of high interest. Such pages are useful for learning about March Madness in general, but not for actively monitoring the event, and thus are better suited to satisfy the need of searchers when the query is not spiking. The changes in which pages were relevant to the query march madness during the month of March reflect the fact that people’s query intent was also changing.
  7. The random walk with restart on the clickthrough graph gives us a set of related queries. However, these related queries can be duplicates or near-duplicates in meaning. In the next step, we cluster these related queries in order to eliminate redundancy in the list. We apply Affinity Propagation (AP) as the clustering method. AP is an exemplar-based clustering method. It takes as input similarities between data points, outputs a set of data points (exemplars) that best represent the data, and assigns each non-exemplar point to its most appropriate exemplar, so that the data points are grouped into clusters.
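
The notes describe mining related queries with a random walk with restart (RWR) on a weighted query-URL click graph, but give no parameters. The Python sketch below is one minimal way to run such a walk by power iteration. The function name `rwr_related_queries`, the restart probability of 0.15, the iteration count, and the toy click counts (including the placeholder host `recipes.example`) are assumptions for illustration only; the time partitioning of the graph and the later Affinity Propagation clustering step are not shown.

```python
from collections import defaultdict
from typing import Dict, Tuple


def rwr_related_queries(
    clicks: Dict[Tuple[str, str], float],  # (query, url) -> positive click weight
    seed: str,                             # the dynamic query the walk restarts at
    restart: float = 0.15,                 # assumed restart probability
    iters: int = 50,                       # assumed number of power iterations
) -> Dict[str, float]:
    """Score every other query by its visit probability under RWR from seed."""
    # Nodes are tagged ("q", name) or ("u", name) so queries and URLs never collide.
    adj = defaultdict(dict)
    for (query, url), w in clicks.items():
        q, u = ("q", query), ("u", url)
        adj[q][u] = adj[q].get(u, 0.0) + w
        adj[u][q] = adj[u].get(q, 0.0) + w

    start = ("q", seed)
    if start not in adj:
        raise KeyError(f"seed query {seed!r} has no clicks")

    p = {node: 0.0 for node in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in adj}
        for node, mass in p.items():
            if mass == 0.0:
                continue
            total = sum(adj[node].values())
            # Move (1 - restart) of the mass along edges, proportional to weight.
            for nb, w in adj[node].items():
                nxt[nb] += (1.0 - restart) * mass * w / total
        # The remaining mass jumps back to the seed query.
        nxt[start] += restart
        p = nxt

    return {name: s for (kind, name), s in p.items() if kind == "q" and name != seed}


# Toy click graph (hypothetical counts).
clicks = {
    ("ncaa", "ncaa.com"): 10.0,
    ("march madness", "ncaa.com"): 8.0,
    ("march madness", "cbssports.com"): 5.0,
    ("final four", "cbssports.com"): 4.0,
    ("apple jam", "recipes.example"): 3.0,
}
scores = rwr_related_queries(clicks, seed="ncaa")
for query, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{query}: {s:.4f}")
```

Queries that share clicked URLs with the seed, directly or through intermediate queries, receive non-zero scores, while queries in a disconnected part of the graph stay at zero. Running the same walk on the subgraph for each time interval would yield one ranked list of related queries per interval.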