Search, Exploration and
Analytics of Evolving Data
Nattiya Kanhabua
L3S Research Center
Hannover, Germany
The 1st Keystone Training School on
Keyword Search over Big Data
23 July 2015, Malta
Lecturer
Education qualification
2007 - 2011: Ph.D. degree, Norwegian University of Science and Technology, Norway
Thesis: “Time-aware Approaches to Information Retrieval”
2003 - 2005: M.Sc. in Computer Science, Asian Institute of Technology, Thailand
Thesis: “Agent-based Simulation of Trade in Barter Trade Exchanges”
1997 - 2001: B.Eng. in Computer Engineering, Kasetsart University, Thailand
Project: “Software Process Enhancement and Control System”
Work experience
2011- now: Postdoc, L3S Research Center, Germany
05/2015: Visiting researcher, University of Trento, Italy
03-05/2010: Research intern, Yahoo! Research, Spain
2007 - 2011: Temporary Scientific Staff, NTNU, Norway
2006 - 2007: Research assistant, University of Trento, Italy
06-10/2006: Research assistant, AIT, Thailand
2005 - 2006: Analyst programmer, IFDS Group, UK
2002 - 2003: Research assistant, Kasetsart University, Thailand
2001 - 2002: System analyst, Accenture, Thailand & Singapore
Skills
• 7+ years of research experience in information
retrieval, data mining, machine learning, predictive
methods and spatio-temporal analysis
• 3+ years of research experience in Big Data, e.g., large-scale
processing and MapReduce
 Hadoop  Pig  Mahout  HBase
 Tomcat  Servlet  Lucene  MySQL
 Python  JAVA  JSP  PHP
 Weka  R  UML  JSON
 Eclipse  NLP  RDF  WARC
23 July 2015, The 1st Keystone Summer School: Keyword Search
Schedule
 9:00 – 10:30 Part I
 Introduction to Temporal Dynamics
 Temporal Information Extraction
 Temporal Query Analysis (I)
 10:30 – 11:00 Coffee break
 11:00 – 12:30 Part II
 Temporal Query Analysis (II)
 Time-aware Retrieval and Ranking
 Applications of Temporal IR
 Conclusions and Outlook
Additional Resource
 Book: Temporal Information Retrieval
 Foundations and Trends® in Information Retrieval
 Volume 9, Issue 2, pp 91-208, 2015
 Download: http://goo.gl/TunlBb
 References can be found in the book
Introduction to Temporal Dynamics
 What are temporal dynamics?
 Why do they occur and impact search?
 When and how to leverage temporal information for IR?
Temporal Dynamics
Figure: Internet Growth/Usage Phases/Tech Events
(created by Mark Schueler, used with permission)
Temporal Web Dynamics
 The Web changes over time in many aspects, e.g., size, content,
structure and how it is accessed through user interactions or queries.
 Size: web pages are added and deleted all the time
 Content: web pages are edited and modified
 Query: users’ information needs change
[Risvik et al., CN 2002; Ke et al., CN 2006]
[WebDyn 2010; Dumais, SIAM-SDM 2012]
Web and Index Sizes
Timeline (1995 to 2020):
 2000: First billion-URL index, the world’s largest, on ≈5000 PCs in clusters
 2004: Index grows to 4.2 billion pages
 2008: Google counts 1 trillion unique URLs
 2009: TBs or PBs of data/index on tens of thousands of PCs
 After 2009: ?
Figure: http://www.worldwidewebsize.com/
Content Change
 The content of the Web changes constantly over time, e.g., web
documents are added, modified or deleted continuously.
 National and international initiatives collect and preserve parts of
the Web [Gomes et al., TPDL 2011; Costa et al., TempWeb 2013]
Figure: WayBack Machine
a web archive search tool by
Internet Archive
Content Change
 Challenge:
 Document representation and retrieval
Categorization of Content Change
 Implication:
 Crawling, Indexing, Ranking
User Interaction Dynamics
 Browsing and querying (or search) behavior
 User preference, e.g., likes, comments, interests
 User’s profiles [Rybak et al., ECIR 2014]
Query Popularity Change
 Challenge:
 Time-sensitive queries
 Query understanding and processing
Google Insights for Search: http://www.google.com/insights/search/
Query: Halloween
Categorization of Web Search Queries
http://www.google.com/insights/search
 Implication:
 Query Analysis, Ranking
Temporal Information Extraction
(1) Document Creation Time
(2) Document Focus Time
(3) Entity and Event Evolution
Motivation
 Incorporating time into search can increase retrieval effectiveness
 Only when temporal information is available
 Research problem:
 How to determine the publication time of a document?
 How to extract temporal information from document contents?
Two Time Aspects
1. Publication or modified time
 Task: determining timestamps of documents
 Method: rule-based technique, or temporal language models
2. Content or focus time
 Task: temporal information extraction
 Method: natural language processing, or time and event recognition
algorithms
Figure: example document with its content time and its publication time marked
Determining Document Creation Time
 Problem Statement: Hard to find trustworthy time for a web page
 Time gap between crawling and indexing
 Decentralization and relocation of web documents
 No standard metadata for time/date
Research question: “For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”
Example dialogue:
“I found a bible-like document, but I have no idea when it was created.”
“Let me see… This document was probably written in 850 A.C., with 95% confidence.”
Current Approaches
1. Content-based
 Temporal language model [de Jong et al., AHC 2005;
Kanhabua and Nørvåg, ECDL 2008]
 Classifier using features based on text’s time expressions
[Chambers, ACL 2012; Ge et al., EMNLP 2013]
 Using burstiness of terms for estimating timestamps
[Kotsakos et al., SIGIR 2014]
2. Non content-based
 Finding the oldest version of a page in a web archive [Jatowt
et al., WIDM 2007]
 Leveraging external resources [Hauff and Azzopardi, ECIR
2005; Nunes et al., WIDM 2007; SalahEldeen and Nelson,
TempWeb 2013]
Content-based Approach
Temporal Language Models
 Based on the statistical usage of words over time
 Compare each word of a non-timestamped document with a reference corpus
 Tentative timestamp: the time partition whose word usage overlaps most with the document
Temporal language models (reference corpus):
Partition  Word        Freq
1999       tsunami     1
1999       Japan       1
1999       tidal wave  1
2004       tsunami     1
2004       Thailand    1
2004       earthquake  1
A non-timestamped document: tsunami, Thailand
Similarity Scores
Score(1999) = 1
Score(2004) = 1 + 1 = 2
Most likely timestamp is 2004
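The toy scoring on this slide can be reproduced with a few lines of Python. This is only a sketch of the word-overlap intuition, not the full temporal language model; the names `CORPUS` and `overlap_scores` are introduced here for illustration.

```python
from collections import defaultdict

# Toy reference corpus from the slide: (partition, word, frequency).
CORPUS = [
    (1999, "tsunami", 1), (1999, "Japan", 1), (1999, "tidal wave", 1),
    (2004, "tsunami", 1), (2004, "Thailand", 1), (2004, "earthquake", 1),
]

def overlap_scores(doc_words, corpus=CORPUS):
    """Sum, per time partition, the frequencies of the document's words."""
    scores = defaultdict(int)
    for partition, word, freq in corpus:
        if word in doc_words:
            scores[partition] += freq
    return dict(scores)

scores = overlap_scores({"tsunami", "Thailand"})
best = max(scores, key=scores.get)
print(scores, best)  # {1999: 1, 2004: 2} 2004
```

The partition with the highest score is taken as the tentative timestamp, matching Score(2004) = 2 above.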
Normalized Log-likelihood Ratio
Normalized log-likelihood ratio [Kraaij, SIGIR Forum 2005]
 Variant of Kullback-Leibler divergence
 Similarity of a document and time partitions
 C is the background model estimated on the corpus
 Linear interpolation smoothing to avoid the zero probability of unseen words
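A minimal sketch of this score, assuming the commonly cited form NLLR(d, p) = Σ_w P(w|d) · log(P(w|p) / P(w|C)) with P(w|p) linearly interpolated with the background model C. The function name `nllr`, the smoothing weight `lam=0.9` and the decision to skip words absent from the whole corpus are assumptions made for this example.

```python
import math
from collections import Counter

def nllr(doc_tokens, partition_tokens, corpus_tokens, lam=0.9):
    """NLLR(d, p) = sum_w P(w|d) * log(P_smoothed(w|p) / P(w|C))."""
    doc = Counter(doc_tokens)
    part = Counter(partition_tokens)
    corp = Counter(corpus_tokens)
    n_doc, n_part, n_corp = len(doc_tokens), len(partition_tokens), len(corpus_tokens)
    score = 0.0
    for w, tf in doc.items():
        p_w_c = corp[w] / n_corp
        if p_w_c == 0.0:
            continue  # assumption: words unseen in the whole corpus are skipped
        # Linear interpolation smoothing keeps P(w|p) above zero.
        p_w_p = lam * part[w] / n_part + (1 - lam) * p_w_c
        score += (tf / n_doc) * math.log(p_w_p / p_w_c)
    return score

partitions = {
    1999: ["tsunami", "Japan", "tidal wave"],
    2004: ["tsunami", "Thailand", "earthquake"],
}
corpus = [w for words in partitions.values() for w in words]
doc = ["tsunami", "Thailand"]
scores = {year: nllr(doc, words, corpus) for year, words in partitions.items()}
best = max(scores, key=scores.get)
print(best)  # 2004
```

On the toy corpus the 2004 partition scores positive and the 1999 partition negative, so 2004 is again the most likely timestamp.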
Improving Temporal LMs
Enhancement techniques
1. Semantic-based data preprocessing
2. Search statistics to enhance similarity scores
3. Temporal entropy as term weights
[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
Semantic-based Preprocessing
Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into document preprocessing
Semantic-based Preprocessing  Description
Part-of-speech tagging        Select only interesting classes of words, e.g., nouns, verbs, and adjectives
Collocation extraction        Co-occurrence of different words can alter the meaning, e.g., “United States”
Word sense disambiguation     Identify the correct sense of a word from context, e.g., “bank”
Concept extraction            Compare concepts instead of original words, e.g., “tsunami” and “tidal wave” have the common concept of “disaster”
Word filtering                Select the top-ranked words according to TF-IDF scores for a comparison
[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
Leveraging Search Statistics
Intuition: Search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
Approach: Linearly combine a GZ score with the normalized log-likelihood ratio
Components of the GZ score:
 P(wi) is the probability that wi occurs: P(wi) = 1.0 if wi is a gaining query, P(wi) = 0.5 if wi is a declining query
 f(R) converts a query’s rank into a weight; a higher-ranked query is more important
 ipf = log(N/n) is an inverse partition frequency
[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
Temporal Entropy
Intuition: A term weight depends on how good the term is for separating time partitions (discriminative)
Approach: Propose temporal entropy, based on a term selection presented in Lochbaum and Streeter
 A measure of the temporal information that a word conveys.
 Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document.
 Tells how good a term is at separating a partition from others.
 A term occurring in few partitions has higher temporal entropy than one appearing in many partitions.
 The higher a term’s temporal entropy, the better it represents a partition.
 Np is the total number of partitions in a corpus
 P(p|wi) is the probability of a partition p containing a term wi
[Kanhabua et al., ECDL 2008] (Slide provided by the authors)
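The slide's formula did not survive extraction. The sketch below assumes the form TE(w) = 1 + (1 / log Np) · Σ_p P(p|w) · log P(p|w), with P(p|w) estimated as the term's frequency in partition p divided by its total frequency. That form is consistent with the statements above (a term confined to few partitions scores high), but it should be read as an assumption rather than the authors' exact definition.

```python
import math

def temporal_entropy(term_freq_by_partition, num_partitions):
    """TE(w) = 1 + (1 / log Np) * sum_p P(p|w) * log P(p|w)  (assumed form)."""
    total = sum(term_freq_by_partition.values())
    if total == 0 or num_partitions < 2:
        return 0.0
    entropy_sum = 0.0
    for tf in term_freq_by_partition.values():
        if tf > 0:
            p = tf / total  # P(p|w): share of the term's occurrences in p
            entropy_sum += p * math.log(p)
    return 1.0 + entropy_sum / math.log(num_partitions)

# A term seen in a single partition is maximally discriminative.
te_focused = temporal_entropy({2004: 5}, num_partitions=4)
# A term spread evenly over all partitions carries no temporal signal.
te_spread = temporal_entropy({1999: 1, 2000: 1, 2003: 1, 2004: 1}, num_partitions=4)
print(round(te_focused, 6), round(abs(te_spread), 6))  # 1.0 0.0
```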
Non Content-based Approaches
 Dating a document using its neighbors
1. Web pages linking to the document
 I.e., incoming links
2. Web pages pointed to by the document
 I.e., outgoing links
3. Media assets associated with the document
 E.g., images
 Timestamp estimate: the average of the last-modified dates of its neighbors
[Hauff and Azzopardi, 2005; Nunes et al., WIDM 2007]
Non Content-based Approaches
 Drawbacks:
 Rely on the availability and accuracy of other information
 Cover only pages from most recent years
 Cannot determine the age of the actual contents
[SalahEldeen and Nelson, 2013]
Determining Document Focus Time
 Three types of temporal expressions
1. Explicit: time mentions being mapped directly to a time point or
interval, e.g., “July 4, 2012”
2. Implicit: imprecise time point or interval, e.g., “Independence Day
2012”
3. Relative: resolved to a time point or interval using other types or
the publication date, e.g., “next month”
 Time and event recognition [Mani and Wilson, ACL 2000]
 A mix of hand-crafted and machine-learnt rules
 Ranking the most relevant temporal expressions [Strötgen et al.,
TempWeb 2012]
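As an illustration of the third type, a relative expression such as “next month” can only be resolved once an anchor date (here the publication date) is known. The sketch below handles just that one expression; the function name `resolve_next_month` is hypothetical, and real time taggers cover far more patterns.

```python
from datetime import date

def resolve_next_month(publication_date):
    """Resolve "next month" to a (start, end) interval of calendar dates."""
    year = publication_date.year + (publication_date.month == 12)
    month = publication_date.month % 12 + 1
    start = date(year, month, 1)
    # The day before the first day of the following month closes the interval.
    after_year = year + (month == 12)
    after_month = month % 12 + 1
    end = date.fromordinal(date(after_year, after_month, 1).toordinal() - 1)
    return start, end

start, end = resolve_next_month(date(2012, 12, 15))
print(start, end)  # 2013-01-01 2013-01-31
```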
Time Taggers for Calculating Focus Time
HeidelTime: http://heideltime.ifi.uni-heidelberg.de/heideltime
Timestamp: 2013/7/15
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Limitations
 Document may lack any temporal expressions
 Temporal expressions may be weakly related to document’s theme
 Temporal taggers are not perfect
Estimating document focus time without using temporal expressions
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Focus Time of Documents
 Def. A document has focus time t if its content refers to t
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Estimating Focus time: Concept
 Use time-referenced documents for estimating focus time of
target document
Figure: news article collections whose texts mention terms (A, B, C) together with dates (e.g., 1935, May 2011, 2012, 1978, 1915, 1948, 2003) are combined with the target document, which contains A, B and C, to yield the target document focus time
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Word Graph
 Word co-occurrence graph from large collections of news articles
 Link weight estimated by Jaccard coefficient using sentence as unit
Figure: example word graph with nodes war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
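A minimal sketch of such a graph, assuming whitespace tokenization and treating each list element as one sentence. The Jaccard coefficient of two words is the number of sentences containing both divided by the number containing either. The function name and the three example sentences are made up for illustration.

```python
from itertools import combinations

def jaccard_link_weights(sentences):
    """Weight each co-occurring word pair by |S(a) & S(b)| / |S(a) | S(b)|,
    where S(w) is the set of sentence ids that contain w."""
    sentence_ids = {}
    for i, sentence in enumerate(sentences):
        for word in set(sentence.lower().split()):
            sentence_ids.setdefault(word, set()).add(i)
    weights = {}
    for a, b in combinations(sorted(sentence_ids), 2):
        shared = sentence_ids[a] & sentence_ids[b]
        if shared:  # only words that co-occur at least once get a link
            weights[(a, b)] = len(shared) / len(sentence_ids[a] | sentence_ids[b])
    return weights

sentences = ["war ended 1945", "hiroshima 1945", "war began 1939"]
weights = jaccard_link_weights(sentences)
print(weights[("1945", "hiroshima")])  # 0.5
```

Years are treated as ordinary nodes, so word-year links such as (1945, hiroshima) come out of the same computation as word-word links.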
Estimating Direct Word-Year Association
 Word-year associations derived from graph
Word w is strongly associated with year y if it frequently co-occurs with y
Associations are computed for every word and every year from 1900 to 2010:
A(war, 1900), A(war, 1901), …, A(war, 1944), A(war, 1945), …, A(war, 2009), A(war, 2010)
A(hiroshima, 1900), A(hiroshima, 1901), …, A(hiroshima, 1944), A(hiroshima, 1945), …, A(hiroshima, 2009), A(hiroshima, 2010)
A(word, 1900), A(word, 1901), …, A(word, 1944), A(word, 1945), …, A(word, 2009), A(word, 2010)
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Second Level Term-Year Association
Word w is strongly associated with year y if many other words that
frequently co-occur with w are also strongly associated with y
$$A_2(w_i, y) = \frac{1}{|V|} \sum_{j=1}^{|V|} A_{dir}(w_i, w_j)\, A(w_j, y)$$
Figure: word graph linking war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima and israel
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Estimating Document-Year Association
If a document contains many words strongly associated with year y,
the document is strongly associated with y
Figure: A(word, year) curves of word A, word B and word C over time (1900 to 2000); a document containing A once and B and C twice each gets the document-year association A + 2B + 2C
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
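The A + 2B + 2C example can be sketched as a sum of word-year associations, counted once per word occurrence. The association values below are invented for illustration, and the later temporal and semantic word weights are left out.

```python
from collections import Counter

def document_year_association(doc_words, word_year_assoc):
    """Sum word-year associations over the document, once per occurrence."""
    counts = Counter(doc_words)
    scores = Counter()
    for word, count in counts.items():
        for year, assoc in word_year_assoc.get(word, {}).items():
            scores[year] += count * assoc
    return dict(scores)

# Made-up A(word, year) values, for illustration only.
word_year_assoc = {
    "A": {1940: 0.2, 1980: 0.1},
    "B": {1940: 0.5},
    "C": {1940: 0.3, 1980: 0.4},
}
doc = ["A", "B", "C", "B", "C"]  # A once, B and C twice: A + 2B + 2C
scores = document_year_association(doc, word_year_assoc)
best_year = max(scores, key=scores.get)
print(best_year)  # 1940
```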
Finding Discriminative Features
 Not every word is useful for estimating text focus time
 E.g., “man”, “city” have stable associations with years
 Temporal entropy: measure of variability of word associations
 Temporal kurtosis: measure of peakedness of word associations
 E.g., “war”, “earthquake” vs. “hitler”, “stalingrad”
Figure: A(word, year) curves over 1900 to 2000 for two pairs of words, illustrating Temporal_Entropy(A) < Temporal_Entropy(B) and Temporal_Kurtosis(A) > Temporal_Kurtosis(B)
Temporal entropy and temporal kurtosis are used as temporal weights for words
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Importance of Words in Document
 Words weakly related to document theme should be skipped
 TextRank scores used as discriminatory semantic weights for words [Mihalcea and Tarau, EMNLP 2004]
Target Document:
President Obama took part in the celebrations of the Polish Independence Day. The US president met main Polish politicians in Warsaw. Poland regained independence at the end of the World War I following Bolshevik Revolution. It then lost the independence as a result of Nazi and Soviet invasions led by Hitler and Stalin. Poland is located in East Europe.
Document to graph conversion: nodes include independence, poland, war, hitler, …
TextRank scores:
0.90 independence
0.82 poland
0.74 war
0.61 nazi
0.56 hitler
0.54 …
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Estimating Focus Time
 Weighted sum of the A(word, year) curves of the document’s words (temporality and semantics)
 Interval-based focus time: obtained with a threshold
 Instant-based focus time
Figure: A(word, year) curves of word A, word B and word C (1900 to 2000) combined into a single document curve over time
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Combined Approach
 Combining estimated focus time and temporal expressions in text
 Representing dates on timeline: Gaussian Kernel Density Estimate
 Mixture of Gaussian distributions with means centered on extracted dates
 $S_{Comb}(d, y)$ is computed from $S_{Est}(d, y)$ and $S_{TempExp}(d, y)$
Figure: dates found in the target document placed on a timeline at 1932, 1935, 1940, 2001 and 2011
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Experimental Settings: Word Graphs
 News articles collected from Google News Archive using country names as queries
 Germany (87k), UK (149k), France (110k), Japan (97k), Israel (92k)
 Published within [1990, 2010]
 Dates falling in [1900, 2013] were found using regular expressions
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
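The slides do not show the regular expressions that were used. A minimal sketch that matches four-digit years in [1900, 2013] could look like this; the pattern and the function name are assumptions.

```python
import re

# 1900-1999, 2000-2009 and 2010-2013, delimited by word boundaries.
YEAR_PATTERN = re.compile(r"\b(19\d{2}|200\d|201[0-3])\b")

def extract_years(text):
    """Return all year mentions in [1900, 2013], in order of appearance."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]

text = "Poland regained independence in 1918; the page was updated in 2014."
print(extract_years(text))  # [1918]
```

A bare four-digit pattern also matches numbers that are not years (for example "1950 soldiers"), so a real pipeline would likely need extra context checks.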
Experimental Settings: Test Datasets
 Datasets on events related to countries:
 Wiki: 250 Wikipedia pages about events
 Books: 735 paragraphs from 2 text books about history (timelines)
 Web: 812 paragraphs from web pages on history (BBC timelines,
etc.)
Datasets  total #doc  avg. #sent  avg. time span of events  avg. year of events  avg. #dates
Wiki      250         179         3.4 years                 1958                 14.5
Book      735         43          4.4 years                 1982                 4.5
Web       819         18.3        1.3 years                 1957                 2.4
Experimental Settings: Baselines
 Baselines:
 Random
 Date-based (using only dates in document text)
 LDA-based
1. 100 topics over sentences containing year mentions
2. Finding topic distribution of each year
3. Calculating document-year association based on topic distribution
of documents
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Experimental Settings: Measurements
 Measures:
 Average error (in years), for instant-based representation
 Pearson Correlation Coefficient (-1..+1) between ground truth years and years in focus time, for interval-based representation
Figure: ground truth vs. estimated focus time on a timeline, for each of the two measures
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
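Both measures are easy to compute by hand. The sketch below assumes that the instant-based error is the mean absolute difference in years, and that the interval-based correlation is taken between two 0/1 vectors marking which years belong to the ground truth and to the estimated focus time. The vector encoding is an assumption, not something the slides specify.

```python
import math

def average_error(true_years, estimated_years):
    """Mean absolute difference (in years) between paired estimates."""
    return sum(abs(t - e) for t, e in zip(true_years, estimated_years)) / len(true_years)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Instant-based: one year per document.
err = average_error([1939, 1945, 1989], [1940, 1950, 1985])

# Interval-based: membership of the years 1940..1945 in each interval.
truth = [0, 1, 1, 1, 0, 0]
estimate = [0, 0, 1, 1, 1, 0]
r = pearson(truth, estimate)
print(round(err, 2), round(r, 2))  # 3.33 0.33
```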
Experimental Results
Average error (years)
Datasets  Random    LDA       Date-based  Proposed    Proposed combined
          baseline  baseline  baseline    (no dates)  (with dates)
Wiki      36.5      27.2      3           18.3        2.83
Books     39.3      37.3      48.1        23.5        20.4
Web       40.5      41.4      53.4        23.6        20.7
Pearson Correlation Coefficient
Datasets  Random    LDA       Date-based  Proposed    Proposed combined
          baseline  baseline  baseline    (no dates)  (with dates)
Wiki      0         0.1       0.65        0.29        0.66
Books     0         0.04      0.01        0.25        0.30
Web       0         0.02      -0.03       0.26        0.41
[Jatowt et al., CIKM 2013] (Slide provided by the authors)
Effect of Time Distance on Focus Time
 How well can we estimate focus time of documents about the distant past?
Figure: three panels (Wiki, Books, Web), instant-based focus time representation
Questions?
Temporal Query Analysis
(1) Temporal query intent
(2) Dynamic query subtopics
Temporal Queries
 Temporal information needs
 Searching temporal document collections
 E.g., digital libraries, web/news archives
 Users: historians, librarians, journalists or students
 Temporal queries exist in both standard collections and the Web
 Relevancy is dependent on time
 Documents are about events at particular time
Types of Temporal Queries
 Two types of temporal queries
1. Explicit: time is provided, "Presidential election 2012"
2. Implicit: time is not provided, "Germany World Cup"
 Temporal intent can be implicitly inferred
 I.e., refer to the World Cup event in 2006
 Studies of web search query logs show a significant fraction
of temporal queries
 1.5% of web queries are explicit
 ~7% of web queries are implicit
 13.8% of queries contain explicit time and 17.1% of queries have
temporal intent implicitly provided
[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]
Figure: Variances of
temporal queries and
their dynamics
Understanding Temporal Query Intent
 Current approaches:
1. Mining temporal patterns in query logs
2. Analyzing top-k search results
[Vlachos et al., SIGMOD 2004; Radinsky et al., WWW 2012]
[Jones and Diaz, TOIS 2007; Campos et al., CIKM 2012]
Motivation
 Temporal queries are a significant fraction of Web
search queries [Zhang et al., EMNLP 2010]
 13.8% of explicit temporal queries
 17.1% of implicit temporal queries
 Characteristics:
 Certain temporal patterns, i.e., spikes, periodicity
(hourly or daily), seasonality and trends
 Underlying temporal information needs without
temporal patterns observed
 Tasks:
 Understand temporal search intent
 Enable advanced enhancement techniques
 Automatic method for detecting events in search streams
Examples: US Election 2016, Brazil FIFA World Cup
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Preliminaries
 Data model:
 Set of queries Q issued at different time points
 Set of clicked URLs U and click-through data
 Temporal document collection D
 q: keywords or term(q), and hitting time(q)
 yq: time series data extracted from Q, U and D
 Two-step approach:
 Automatically extract a set of candidate queries {q1, ..., qn} from Q
 Classify candidates as event-related queries {e1, ..., em} using
machine learning techniques
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Identifying Event Candidates
 Time and keyword-based clustering:
Step 1: Partition query logs into one-week windows
• Group queries from the same event
• Possibly contain multiple, unrelated events
Step 2: Cluster queries by lexical similarity
• Pre-process and sort queries alphabetically
• Compute Jaccard similarity of a query pair
Example cluster for “Easter”: easter 2006, easter 2007, easter 20crafts,
easter activities, easter animation, easter animations,
easter background, easter basket, easter bread,
easter bucket, easter bunny, easter bunny decorations,
easter bunny lights
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
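Step 2 can be sketched as a single greedy pass over the alphabetically sorted queries of one time window. Comparing each query with the first query of the current cluster is an assumption made for this example, as are the function names. The threshold of 0.2 is the value reported in the experimental setting.

```python
def jaccard(a, b):
    """Jaccard similarity of the token sets of two non-empty queries."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def cluster_queries(queries, threshold=0.2):
    """Greedy single pass over alphabetically sorted, de-duplicated queries."""
    clusters = []
    for query in sorted(set(queries)):
        # Join the current cluster if similar enough to its first query.
        if clusters and jaccard(clusters[-1][0], query) > threshold:
            clusters[-1].append(query)
        else:
            clusters.append([query])
    return clusters

week = ["easter bunny", "easter", "easter basket",
        "world cup", "easter 2006", "world cup 2006"]
clusters = cluster_queries(week)
print(clusters)
# [['easter', 'easter 2006', 'easter basket', 'easter bunny'], ['world cup', 'world cup 2006']]
```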
Event-related Query Classification
 Classify a query as event-related or not:
 Periodic and seasonal events
 Popular and trending events
 Sporadic (rare) and unseen events
 General time-sensitive queries
 Underlying temporal information needs
 Features:
 Time-series features, e.g., seasonality or trends
 Popularity-based features, e.g., click-through and burstiness
 Statistical features of the results’ probability distribution, e.g.,
temporal KL-divergence and skewness (kurtosis)
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Seasonality
 Detect seasonal queries [Shokouhi, SIGIR 2011]
 E.g., annual events such as US Open and Easter, or a 4-year recurring event such as FIFA World Cup
 Method: time-series decomposition using Holt-Winters adaptive exponential smoothing
 Input: time-series data extracted from external document collections, YD
 Compute the cosine similarity between Y and S as the seasonality score
 Y is the original time-series data
 S is the seasonality component
Figure: time series for the queries “Easter” and “World cup”
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
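The scoring step can be sketched without a full Holt-Winters implementation. In the example below the seasonal component S is approximated by repeating the per-phase mean of the series, which is a deliberately simpler stand-in for Holt-Winters decomposition. The cosine similarity between Y and S is then the seasonality score. The function names and both toy series are illustrative.

```python
import math

def seasonal_component(series, period):
    """Repeat the mean of each phase (a simple stand-in for Holt-Winters)."""
    phase_means = []
    for phase in range(period):
        values = series[phase::period]
        phase_means.append(sum(values) / len(values))
    return [phase_means[i % period] for i in range(len(series))]

def cosine(xs, ys):
    """Cosine similarity of two equally long sequences."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    return dot / norm if norm else 0.0

def seasonality(series, period):
    """Cosine similarity between the series Y and its seasonal component S."""
    return cosine(series, seasonal_component(series, period))

periodic = [0, 0, 9, 0, 0, 0, 10, 0, 0, 0, 11, 0]  # spike every 4th step
irregular = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
print(seasonality(periodic, 4) > seasonality(irregular, 4))  # True
```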
Autocorrelation
 Detect trending events by their predictability
 Cross correlation with itself or between its
past and future values at different time lags
 The stronger inter-day dependencies, the
higher value for autocorrelation
 where lag=1, shifting the 2nd time series by
one day, called 1st-order autocorrelation
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
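A minimal sketch of 1st-order autocorrelation on a daily query-frequency series, using the standard sample estimator (lagged covariance divided by the overall variance). The slide's exact formula was lost in extraction, so this estimator is an assumption.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0 or lag >= n:
        return 0.0
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom

trend = [1, 2, 3, 4, 5, 6, 7, 8]    # each day resembles the previous one
zigzag = [1, 8, 1, 8, 1, 8, 1, 8]   # each day contradicts the previous one
print(autocorrelation(trend), autocorrelation(zigzag))  # 0.625 -0.875
```

Strong inter-day dependencies, as in the trending series, give a high positive value.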
Temporal KL-divergence
 Analyze a temporal distribution in a result set
 Measure the difference between the distribution over time
of top-k documents of q and the document collection C
 P(t|q) is the probability of generating a publication date t
given q
 P(t|C) is the probability of a publication date t in the
collection
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
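Assuming the usual definition KL(P(t|q) || P(t|C)) = Σ_t P(t|q) · log(P(t|q) / P(t|C)), the feature can be sketched from two lists of publication dates. Estimating both distributions by raw counts, with no smoothing, is an assumption made for this example.

```python
import math
from collections import Counter

def temporal_kl(result_dates, collection_dates):
    """KL divergence between the date distribution of the top-k results
    and that of the whole collection (maximum-likelihood estimates)."""
    result_counts = Counter(result_dates)
    collection_counts = Counter(collection_dates)
    n_results, n_collection = len(result_dates), len(collection_dates)
    kl = 0.0
    for t, count in result_counts.items():
        p_t_q = count / n_results
        # Assumes every result date also occurs in the collection.
        p_t_c = collection_counts[t] / n_collection
        kl += p_t_q * math.log(p_t_q / p_t_c)
    return kl

collection = [2004, 2005, 2006, 2007] * 25   # uniform over four years
bursty = [2006] * 9 + [2005]                 # results concentrated in 2006
uniform = [2004, 2005, 2006, 2007]
print(round(temporal_kl(bursty, collection), 3))  # 1.061
```

A query whose results follow the collection's own date distribution scores 0, while a query tied to a specific period scores high.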
Surprise Score
 Detect unseen events or surprisingly popular
queries [Radinsky et al. , WWW 2012]
 Assume an unplanned event happening when there is
a significant prediction error
 Compute the sum of squared errors of prediction
(SSE) using a simple linear regression model
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
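A minimal sketch of the SSE computation, fitting an ordinary least-squares line to the series itself and summing the squared residuals. Whether the original work fits on past observations only and predicts forward is not stated on the slide, so the in-sample fit used here is an assumption.

```python
def surprise_score(series):
    """Sum of squared errors of a least-squares line fitted to the series
    (time steps 0..n-1 as x; needs at least two points)."""
    n = len(series)
    mean_x, mean_y = (n - 1) / 2, sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in enumerate(series))

steady = [10, 12, 14, 16, 18, 20]   # perfectly predictable growth
spiky = [10, 12, 14, 90, 18, 20]    # one unexpected burst
print(surprise_score(steady) < surprise_score(spiky))  # True
```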
Experiments
 Query logs:
• Two datasets, i.e., AOL and MSN
• AOL: 30M queries March 1 - May 31, 2006
• MSN: 15M queries from May 2006
 Temporal collection:
• The New York Times Annotated Corpus
• 1.8M documents from 1987 - 2007
 Setting:
• HeidelTime for time extraction and OpenNLP for entity extraction
• Cleansing-step parameters: Jaccard similarity threshold>0.2; edit
distance<3; overlap n-gram=2
• For burstiness features, default parameters for the burst detection
technique provided by CISHELL
In total, 837 event-related queries
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Experimental Results (I)
 Feature selection:
• Study high-impact (best) features
• Investigate their importance independent
from classification algorithms
• InfoGainAttributeEval method in WEKA
 Main findings:
• Discriminative features are mostly derived
from D and Q
• TemporalKL and kurtosis are among
influential features
• Trend-based features, such as,
autocorrelation, burst weight, and trending
level, play an important role
• Seasonality computed from Q has less
impact than the one extracted from D
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Experimental Results (II)
 Query classification:
• Several classifiers, i.e., support vector
machine (SVM), AdaBoost, decision tree
(J48), and neural network (NN)
• Metrics: accuracy, precision, recall, F-
measure using 10-fold cross validation
 Main findings:
• J48 is the best performing algorithm
• TemporalKL achieves accuracy of 84%
• Adding autocorrelation, kurtosis, and
seasonality increases the performance
• However, the performance drops after
adding max. query frequency and further features
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
Analyzing Top-k Search Results
 Using temporal language models
 Determine time of queries when no time is given explicitly
 Re-rank search results using the determined time
 Exploiting time from search snippets
 Extract temporal expressions (i.e., years) from the contents of top-k
retrieved web snippets for a given query
 Content-based language-independent approach
[Kanhabua and Nørvåg, ECDL 2010; Campos et al., CIKM 2012]
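A minimal sketch of the snippet-based idea, assuming that temporal expressions are four-digit years between 1800 and 2099 and that candidate years are simply ranked by how often they occur. The regular expression, the function names and the frequency-based ranking are illustrative assumptions; the cited approach may weight candidate years differently.

```python
import re
from collections import Counter

# Assumption: a "year" is a standalone four-digit number from 1800 to 2099.
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def years_in_snippets(snippets):
    """Count the years mentioned in the top-k snippets of a query."""
    counts = Counter()
    for snippet in snippets:
        counts.update(int(year) for year in YEAR_PATTERN.findall(snippet))
    return counts


def dominant_years(snippets, top_n=3):
    """Return the most frequently mentioned years, most frequent first."""
    return [year for year, _ in years_in_snippets(snippets).most_common(top_n)]
```

Since only digit patterns are matched, the sketch does not depend on the language of the snippets.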
Determining Time of Queries
 Approach I. Dating using keywords*
 Approach II. Dating using top-k documents*
 Queries are short keywords
 Inspired by pseudo-relevance feedback
 Approach III. Using timestamp of top-k documents
 No temporal language models are used
*Using Temporal Language Models proposed by de Jong et al.
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
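Approach III can be sketched without any language model: the publication years of the top-k documents are aggregated into a temporal profile of the query. The document representation (a dict with a `year` key), the cut-off k and the relative-frequency weighting are assumptions made for illustration only.

```python
from collections import Counter


def query_time_profile(top_k_docs, k=10):
    """Rank candidate query years by the share of top-k documents published in them."""
    counts = Counter(doc["year"] for doc in top_k_docs[:k])
    total = sum(counts.values())
    profile = [(year, n / total) for year, n in counts.items()]
    # Most frequent year first; ties are broken by the earlier year.
    return sorted(profile, key=lambda pair: (-pair[1], pair[0]))
```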
I. Dating using Keywords
[Figure: query's temporal profiles]
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
II. Dating using Top-k Documents
[Figure: query's temporal profiles]
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
III. Using Timestamp of Documents
[Figure: query's temporal profiles]
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
Re-ranking Search Results
 Intuition: documents published close to the time of the query are
more relevant
 Assign document priors based on publication dates
[Figure: query → news archive → determine time (2005, 2004, 2006, ...); initial retrieved results (D2009) → re-ranked results (D2005)]
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
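The re-ranking idea can be sketched as follows, assuming that the determined query time is a weighted list of years, that the document prior decays exponentially with the distance between publication year and query year, and that the prior is multiplied with the initial retrieval score. All three choices, as well as the function names, are illustrative assumptions rather than the authors' exact formulation.

```python
def time_prior(doc_year, query_years, decay=0.5):
    """Prior of a document given weighted candidate query years [(year, weight), ...]."""
    return sum(weight * decay ** abs(doc_year - year) for year, weight in query_years)


def rerank(results, query_years, decay=0.5):
    """Re-rank (doc_id, score, publication_year) triples by score times time prior."""
    rescored = [
        (doc_id, score * time_prior(year, query_years, decay))
        for doc_id, score, year in results
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

With candidate query years 2005, 2004 and 2006, a document published in 2005 overtakes a slightly better-scoring document published in 2009.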
Change of Query Subtopics over Time
[Figure: timeline of subtopics for the query "ncaa"; events: march madness began, ncaa women tournament began, final four began; dates: 14/03/2006, 18/03/2006, 01/04/2006]
[Nguyen and Kanhabua, ECIR 2014]
Mining Temporal Anchor Texts
 Anchor texts are complementary descriptions
of target pages and are widely used to improve search
 Characteristics:
 Short summary (a few words) of target pages
 Collective wisdom of people other than authors
 Similar behavior to real-world queries and titles
 Capturing aboutness or what a document is about
 Main ideas:
 Temporal anchor texts mined from the edit history of
Wikipedia as a hook for tracking entity evolution
 Large-scale analysis and a more robust discovery of
evolving information using limited resources
Mining Temporal Anchor Texts
1. Partition Wikipedia revisions using
the one-month granularity
2. For each Wikipedia snapshot, identify
named entity articles/pages
3. Extract anchor texts from all articles
linking to an entity page
4. Rank aggregated entity-anchor
relationships at a particular time t
[Kanhabua and Nørvåg, JCDL 2010]
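Steps 3 and 4 above can be sketched as a simple aggregation, assuming that the links of each monthly snapshot have already been extracted as (snapshot, source page, anchor text, target entity) tuples. The tuple layout, the function names and the ranking by raw frequency are illustrative assumptions.

```python
from collections import Counter, defaultdict


def aggregate_anchors(links):
    """Count anchor texts per (entity, snapshot) pair.

    links: iterable of (snapshot, source_page, anchor_text, target_entity),
    where snapshot is a month label such as "2005-10".
    """
    table = defaultdict(Counter)
    for snapshot, _source_page, anchor, entity in links:
        table[(entity, snapshot)][anchor] += 1
    return table


def ranked_anchors(table, entity, snapshot):
    """Anchor texts for an entity at a given time, most frequent first."""
    counts = table.get((entity, snapshot), Counter())
    return [anchor for anchor, _ in counts.most_common()]
```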
[Figure: anchor texts linking to the entity page President_of_the_United_States: "President Bush (43)" (time: 10/2005), "Barack Obama" (time not shown), "George W. Bush" (time: 11/2004)]
Recognizing Named Entity Articles
1. Multi-word titles with all words capitalized,
except prepositions, determiners, etc.
E.g., President_of_the_United_States => entity
2. Single-word titles with multiple capital
letters
E.g., UNICEF and WHO => entities
3. At least 75% of the title's occurrences in the article text itself
are capitalized (not counting the beginning of a sentence)
[Bunescu and Pasca, EACL 2006]
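A minimal sketch of the three heuristics. The list of function words, the way sentence-initial occurrences are detected and the function names are simplifying assumptions made for illustration.

```python
import re

# Assumption: a small illustrative list of prepositions, determiners, etc.
FUNCTION_WORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "the", "to"}


def _capitalized_ratio(name, text):
    """Share of mid-sentence occurrences of `name` in `text` that are capitalized."""
    flags = []
    for match in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
        before = text[:match.start()].rstrip()
        if before and before[-1] not in ".!?":  # skip sentence-initial occurrences
            flags.append(match.group()[0].isupper())
    return sum(flags) / len(flags) if flags else 0.0


def is_named_entity(title, article_text=""):
    words = title.replace("_", " ").split()
    if len(words) > 1:
        # Heuristic 1: all content words of a multi-word title are capitalized.
        content = [w for w in words if w.lower() not in FUNCTION_WORDS]
        if content and all(w[0].isupper() for w in content):
            return True
    elif len(words) == 1 and sum(ch.isupper() for ch in words[0]) >= 2:
        # Heuristic 2: single-word title with multiple capital letters.
        return True
    # Heuristic 3: at least 75% of the occurrences in the text are capitalized.
    return _capitalized_ratio(" ".join(words), article_text) >= 0.75
```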
Temporal Anchor Weighting
Weight anchor texts by importance with respect
to a target entity at a particular time:
• Link-independent: inlink pages are independent and
equally important to the target page
• Computed based on the whole collection of Wikipedia
entity pages at a particular time t
• Two variants: 1) article links, and 2) distinct pages
[Dou et al., SIGIR 2009]
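The weighting formula itself is not recoverable from the transcript, so the following is only a sketch under stated assumptions: every inlink counts equally (link-independent), the weight of an anchor text is its share of all inlinks to the entity in one snapshot, and the two variants differ in whether repeated links from the same page are counted once or every time.

```python
from collections import Counter


def anchor_weights(links, distinct_pages=False):
    """Relative weight of each anchor text for one entity in one snapshot.

    links: list of (source_page, anchor_text) pairs pointing to the entity.
    distinct_pages=False counts every article link; True counts each
    (page, anchor) combination only once.
    """
    if distinct_pages:
        links = set(links)
    counts = Counter(anchor for _source_page, anchor in links)
    total = sum(counts.values())
    return {anchor: n / total for anchor, n in counts.items()} if total else {}
```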
Experiments
 Data collection:
• A dump of English Wikipedia edit history (2.8 TB)
• All pages and revisions from 03/2001 to 03/2008
• 85 snapshots + 4 additional snapshots
(24/05/2008, 27/07/2008, 08/10/2008, 06/03/2009)
 Tools:
• Preprocess/store revisions using MWDumper
http://www.mediawiki.org/wiki/Mwdumper
• Store anchor texts: MySQL databases
Top-100 Named Entities
Evolving Context
“Barack Obama”: top-ranked anchor texts over time
05/2008:
1. Senator Barack Obama
2. Senator Obama's legislative accomplishments
3. Illinois
4. U.S. Sen. Barack Obama
07/2008:
1. Senator Barack Obama
2. Illinois Senator Barack Obama
3. Barack Hussein Obama II
4. Senator Obama's legislative accomplishments
10/2008:
1. Senator Barack Obama
2. Illinois Senator Barack Obama
3. Barak Obama, U.S. Senator, Illinois, 2008 Democratic nominee for U.S. President
4. presidential candidacy announcement
03/2009:
1. President Barack Obama
2. Senator Barack Obama
3. U.S. President Barack Obama
4. 44th President of the United States
5. Obama Administration
Main Findings
Evolving information & context
• Role changes for political entities
• Geographic name changes for
locations
• Trends or things in vogue for
celebrities
• Products in demand for
technology
Questions?
Time-aware Retrieval and Ranking
(1) Recency-based Ranking
(2) Time-dependent Ranking
RECAP
 Two time dimensions
1. Publication or modified time
2. Content or focus time
Searching the past
 Historical or temporal information needs
 A journalist working on the historical story of a particular news article
 A Wikipedia contributor finding relevant information that has not
been written about yet
[Figure: “temporal document collections”, e.g., web archives, news archives, blogs, emails]
 Example: retrieve documents about Pope Benedict XVI written before 2005
 Term-based IR approaches may give unsatisfactory results
Temporal Query Examples
 A temporal query consists of:
 Query keywords
 Temporal expressions
 A document consists of:
 Terms, i.e., bag-of-words
 Publication time and temporal expressions
[Berberich et al., ECIR 2010]
Recency-based Ranking
 Assign prior probabilities using an exponential function
 E.g., a more recent creation date obtains a higher probability
 Current approaches:
 Time-based language model [Li and Croft, CIKM 2003]
 Using retention functions [Peetz and de Rijke, ECIR 2013]
 Incorporating freshness into web authority [Dai and Davison,
SIGIR 2010]
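A minimal sketch of an exponential recency prior, in the spirit of the time-based language model: the prior has the form rate * exp(-rate * age), where age is the time elapsed since the document was created. The unit of time (days), the rate value and the multiplication with a query likelihood are illustrative assumptions.

```python
import math


def recency_prior(doc_age_days, rate=0.01):
    """Exponential prior: the more recent the document, the higher the prior."""
    return rate * math.exp(-rate * doc_age_days)


def score_with_recency(query_likelihood, doc_age_days, rate=0.01):
    """Combine a content-based query likelihood with the recency prior."""
    return query_likelihood * recency_prior(doc_age_days, rate)
```

A larger rate penalizes old documents more strongly; a rate close to zero makes the prior almost uniform.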
Time-dependent Ranking
 Time must be explicitly modeled in order to increase the
effectiveness of ranking
 To order search results so that the most relevant ones come first
 Time uncertainty should be taken into account
 Two temporal expressions can refer to the same time period even
though they are written differently
 E.g., the query “Independence Day 2011”
 A retrieval model relying on term matching only will fail to
retrieve documents mentioning “July 4, 2011”
Time-dependent Ranking
 Two main approaches:
1. Mixture model [Kanhabua et al., ECDL 2010]
 Linearly combining textual and temporal similarity
2. Probabilistic model [Berberich et al., ECIR 2010]
 Generating a query from the textual part and temporal part
of a document independently
Mixture Model
 Linearly combine textual and temporal similarity
 α indicates the relative importance of the two similarity scores
 Both scores are normalized before combining
 Textual similarity can be determined using any term-based
retrieval model
 E.g., tf.idf or a unigram language model
 How to determine temporal similarity?
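The mixture model can be sketched as a weighted sum of the two normalized scores. The exact formula is not recoverable from the transcript; the sketch assumes the form (1 - α) · textual + α · temporal and min-max normalization, and the function names are illustrative.

```python
def min_max(scores):
    """Min-max normalize a {doc_id: score} mapping to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def mixture_scores(text_scores, time_scores, alpha=0.5):
    """Assumed form: (1 - alpha) * textual similarity + alpha * temporal similarity."""
    text_n, time_n = min_max(text_scores), min_max(time_scores)
    return {
        doc_id: (1 - alpha) * text_n[doc_id] + alpha * time_n.get(doc_id, 0.0)
        for doc_id in text_n
    }
```

Setting alpha to 0 reduces the model to purely textual ranking, and setting it to 1 ranks by temporal similarity alone.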
Temporal Similarity
[Figure: similarity score over time for documents d1 and d2 relative to the query time q, with distances Dist(d1,q) and Dist(d2,q)]
[Kanhabua et al., ECDL 2010]
Temporal Similarity
 Assume that temporal expressions in the query are generated
independently from a two-step generative model:
 P(tq|td) can be estimated based on publication time using an
exponential decay function [Kanhabua et al., ECDL 2010]
 Linear interpolation smoothing is applied to eliminate zero
probabilities
 I.e., for an unseen temporal expression tq in d
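One plausible reading of this slide as code, given that the formulas are missing from the transcript: each query time tq is generated independently, P(tq|d) averages P(tq|td) over the document's time points, P(tq|td) decays exponentially with the distance between tq and td, and the result is linearly interpolated with a background probability. The decay base, scale, smoothing weight and background value are all illustrative assumptions.

```python
def decay_similarity(query_year, doc_year, decay_rate=0.5, scale=1.0):
    """Assumed exponential decay: decay_rate ** (|tq - td| / scale)."""
    return decay_rate ** (abs(query_year - doc_year) / scale)


def temporal_similarity(query_years, doc_years, smoothing=0.1, background=0.01,
                        decay_rate=0.5, scale=1.0):
    """Product over query times of the smoothed probability P(tq | d)."""
    score = 1.0
    for tq in query_years:
        if doc_years:
            p_doc = sum(
                decay_similarity(tq, td, decay_rate, scale) for td in doc_years
            ) / len(doc_years)
        else:
            p_doc = 0.0
        # Linear interpolation with a background probability avoids zero scores.
        score *= (1 - smoothing) * p_doc + smoothing * background
    return score
```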
Comparison of time-aware ranking
Five time-aware ranking models
 LMT [Berberich et al., ECIR 2010]
 LMTU [Berberich et al., ECIR 2010]
 TS [Kanhabua et al., ECDL 2010]
 TSU [Kanhabua et al., ECDL 2010]
 FuzzySet [Kalczynski et al., Inf. Process. 2005]
[Kanhabua et al., SIGIR 2011]
Discussion
 Experiment:
 New York Times Annotated Corpus
 40 temporal queries [Berberich et al., ECIR 2010]
 Result:
 TSU outperforms other methods significantly for most metrics
 Conclusions:
 Although TSU achieves the best performance, it can only be applied to a
collection with time metadata
 LMT and LMTU can be applied to any collection without time metadata,
but time extraction is needed
Applications for Temporal IR
(1) Searching the Future
(2) Time-aware Recontextualization
Searching the Future
 People are naturally curious about the future
 What will happen to EU economies in the next 5 years?
 What will be the potential effects of climate change?
Previous work
 Searching the future
 Extract temporal expressions from news articles
 Retrieve future information using a probabilistic model, i.e.,
multiplying textual similarity and a time confidence
 Supporting analysis of future-related information in news and
Web
 Extract future mentions from news snippets obtained from search
engines
 Summarize and aggregate results using clustering methods, but no
ranking
[Baeza-Yates, SIGIR Forum 2005; Jatowt et al., JCDL 2009]
Recorded Future
http://www.recordedfuture.com/
Yahoo! Time Explorer
[Matthews et al., HCIR 2010]
Ranking News Predictions
 Over 32% of 2.5M documents from Yahoo! News (July’09 –
July’10) contain at least one prediction
 Retrieve predictions related to a news story in news archives and
rank by relevance
Related News Predictions
[Kanhabua et al., SIGIR 2011]
Approach
 Four classes of features
 Term similarity, entity-based similarity, topic similarity and temporal
similarity
 Rank results using a learning-to-rank technique
[Kanhabua et al., SIGIR 2011]
System Architecture
Step 1: Document annotation.
 Extract temporal expressions
using time and event recognition
 Normalize them to dates so they
can be anchored on a timeline
 Output: sentences annotated
with named entities and dates,
i.e., predictions
Step 2: Retrieving predictions.
 Automatically generate a query
from a news article being read
 Retrieve predictions that match
the query
 Rank predictions by relevance
(i.e., a prediction is “relevant” if it
is about the topics of the article)
[Kanhabua et al., SIGIR 2011]
Term Similarity
 Capture the term similarity between q and p
1. TF-IDF scoring function
 Problem: keyword matching, short texts
 Predictions may not match the query terms
2. Field-aware ranking function, e.g., BM25F
 Search the context of a prediction, i.e., surrounding sentences
[Kanhabua et al., SIGIR 2011]
Entity-based Similarity
 Measure the similarity between q
and p using annotated entities in
dp, p, q
 Features commonly employed in
entity ranking
[Kanhabua et al., SIGIR 2011]
Topic Similarity
 Compute the similarity between q and p on topic
 Latent Dirichlet allocation [Blei et al., J. Mach. Learn. Res. 2003] for
modeling topics
1. Train a topic model
2. Infer topics
3. Compute topic similarity
[Kanhabua et al., SIGIR 2011]
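Step 3 can be sketched once the topic distributions of q and p have been inferred. The slides do not state which similarity measure is used; cosine similarity between the two topic vectors is an assumption made here for illustration.

```python
import math


def cosine_similarity(p_topics, q_topics):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p_topics, q_topics))
    norm = math.sqrt(sum(a * a for a in p_topics)) * math.sqrt(sum(b * b for b in q_topics))
    return dot / norm if norm else 0.0
```

Texts dominated by the same topics score close to 1 even if they share no terms, which addresses the keyword-matching problem of short predictions.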
Temporal Similarity
Hypothesis I. Predictions that are closer in time to the query are
more relevant
Hypothesis II. Predictions extracted from more recent documents
are more relevant
[Kanhabua et al., SIGIR 2011]
Ranking Method
Learning-to-rank: Given an unseen (q, p), p is ranked using a
model trained over a set of labeled query/prediction pairs
 SVM-MAP [Yue et al., SIGIR 2007]
 RankSVM [Joachims, KDD 2002]
 SGD-SVM [Zhang, ICML 2004]
 PegasosSVM [Shalev-Shwartz et al., ICML 2007]
 PA-Perceptron [Crammer et al., J. Mach. Learn. Res. 2006]
[Kanhabua et al., SIGIR 2011]
Relevance Judgments
42 future-related topics
[Kanhabua et al., SIGIR 2011]
Experiments
 New York Times Annotated Corpus
 1.8 million articles, over 20 years
 More than 25% contain at least one prediction
 Annotation process uses several language processing tools
 OpenNLP for tokenizing, sentence splitting, part-of-speech tagging,
shallow parsing
 SuperSense tagger for named entity recognition
 TARSQI for extracting temporal expressions
 Apache Lucene for indexing and retrieving
 44,335,519 sentences and 548,491 predictions
 939,455 future dates (on average, 1.7 future dates per prediction)
[Kanhabua et al., SIGIR 2011]
Discussion
 Results:
 Topic features play an important role in ranking
 The top-5 features with the lowest weights are entity-based
features
 Open issues:
 Extract predictions from other sources, e.g., Wikipedia, blogs,
comments, etc.
 Sentiment analysis for future-related information
[Kanhabua et al., SIGIR 2011]
Time-aware Contextualization
[Figure: advertisement poster from the 1950s; label: time-aware contextualization]
Prior to 1964, many of the cigarette
companies advertised their brand by
falsely claiming that their product did not
have serious health risks. A couple of
examples would be "Play safe with Philip
Morris" and "More doctors smoke
Camels". Such claims were made both to
increase the sales of their product and to
combat the increasing public knowledge of
smoking's negative health effects.
[Tran et al., WSDM 2015] (Slide provided by the authors)
Entity Linking
Physician
http://en.wikipedia.org/wiki/Physician
Camel (cigarette)
http://en.wikipedia.org/wiki/Camel_(cigarette)
Cigarette
http://en.wikipedia.org/wiki/Cigarette
Entity linking is not sufficient
 Wikipedia pages tend to contain large amounts of content
 Relevant information might be distributed over various articles
 The crucial temporal aspect is missing in pure linking approaches
[Tran et al., WSDM 2015] (Slide provided by the authors)
Problem Statement
Time-aware contextualization aims to associate an information item
d with time-aware, concise and coherent context information c for
easing its understanding
Several sub-goals of the information search process have to
be combined with each other
 c has to be relevant for d
 c has to complement the information already available in d
 c has to consider the time of creation of d
 the context information should be concise to avoid overloading the user
[Tran et al., WSDM 2015] (Slide provided by the authors)
Approach Overview
[Figure: system components: context hook identification, query formulation, context retrieval, context ranking, contextualization units extraction, contextualization units index; input: the user's article; output: context]
[Tran et al., WSDM 2015] (Slide provided by the authors)
Query Formulation
 The goal is to generate a set of queries for a given document to
retrieve candidates as input for the re-ranking step
 We explore two families of query formulation methods
 Document-based methods: title, lead, title+lead
 Hook-based methods: each_hook, all_hooks, and query performance
prediction (qpp_r@k) with the following features
 Linguistic features
 Document frequency
 Scope
 Temporal document frequency
 Temporal scope
 Temporal similarity
[Tran et al., WSDM 2015] (Slide provided by the authors)
Context Ranking
Context retrieval:
Learning to rank context:
• The ranking algorithm needs to balance two goals, i.e., high topical and
temporal relevance as well as complementarity for providing additional
information
• Use supervised machine learning that takes as input a set of labeled
examples and various complementarity features
 Topic diversity
 Text difference
 Entity difference
 Anchor text difference
 Distributional similarity
 Cosine distance
 Relevance
 Temporal similarity
[Tran et al., WSDM 2015] (Slide provided by the authors)
Experiments
Datasets:
 51 news articles from the New York Times Corpus
 Wikipedia (2013), 26 million contextualization units (paragraphs)
 9,464 manually labeled examples (article/context pairs)
 Learning to rank algorithms: RankBoost, Random Forests and AdaRank
Baselines
 Entity linking (Milne and Witten)
 Language model (LM)
 Time-aware language model (LM-T)
[Tran et al., WSDM 2015] (Slide provided by the authors)
Evaluating Query Formulation Methods
[Figure: bar chart of P@1, P@3, P@10 and MAP (y-axis from 0 to 0.7) for title+lead, all_hooks and qpp_r@100]
Wikification technique achieves a low recall of 0.229
Hook-based approaches outperform the document-based approaches
Query performance prediction method obtains the highest results on all metrics
[Tran et al., WSDM 2015] (Slide provided by the authors)
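For reference, the reported metrics can be sketched as follows. Each ranking is given as a list of binary relevance labels in rank order; average precision is normalized here by the number of relevant items found in the list, which is an assumption (evaluations often divide by the total number of relevant items instead). The function names are illustrative.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (labels are 0 or 1)."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of the precision values at the ranks of the relevant results."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```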
The Effect of Complementarity Features
[Figure: bar chart of P@1, P@3, P@10 and MAP (y-axis from 0 to 1) for LM-T and RF]
Purely using the time dimension in context retrieval is not sufficient in the contextualization task
Complementarity plays an important role in contextualization
[Tran et al., WSDM 2015] (Slide provided by the authors)
Conclusions and Outlook
 Introduced the general topic of web evolution.
 Pinpointed a number of issues related to temporal IR.
 Focused on temporal information extraction, temporal query
analysis, as well as time-aware retrieval and ranking.
 Wrapped up with applications of temporal IR.
 Future directions:
 Real-time web mining
 Spatio-temporal search and analytics
 Brain-inspired information access
Thank you!
15923 July 2015The 1st Keystone Summer School: Keyword Search

More Related Content

What's hot

B2: Open Up: Open Data in the Public Sector
B2: Open Up: Open Data in the Public SectorB2: Open Up: Open Data in the Public Sector
B2: Open Up: Open Data in the Public SectorMarieke Guy
 
Designing a second generation of open data platforms
Designing a second generation of open data platformsDesigning a second generation of open data platforms
Designing a second generation of open data platformsYannis Charalabidis
 
WWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationWWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationStefan Dietze
 
Supporting the use of data: From data repositories to service discovery
Supporting the use of data: From data repositories to service discoverySupporting the use of data: From data repositories to service discovery
Supporting the use of data: From data repositories to service discoveryMathieu d'Aquin
 
LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)Stefan Dietze
 
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challengeScott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challengeGigaScience, BGI Hong Kong
 
VALA 2016 L-Plate session on Linked Open Data
VALA 2016 L-Plate session on Linked Open DataVALA 2016 L-Plate session on Linked Open Data
VALA 2016 L-Plate session on Linked Open DataPeter Neish
 
Learning Analytics & Linked Data – Opportunities, Challenges, Examples
Learning Analytics & Linked Data – Opportunities, Challenges, ExamplesLearning Analytics & Linked Data – Opportunities, Challenges, Examples
Learning Analytics & Linked Data – Opportunities, Challenges, ExamplesStefan Dietze
 
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)
Open Educational Data - Datasets and APIs (Athens Green Hackathon 2012)Stefan Dietze
 
Demo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataDemo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataStefan Dietze
 
Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Beth Plale
 
Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebStefan Dietze
 
The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data FrameworkThe HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data FrameworkRobert H. McDonald
 

What's hot (20)

November 18, 2015 NISO Webinar: Text Mining: Digging Deep for Knowledge
November 18, 2015 NISO Webinar: Text Mining: Digging Deep for KnowledgeNovember 18, 2015 NISO Webinar: Text Mining: Digging Deep for Knowledge
November 18, 2015 NISO Webinar: Text Mining: Digging Deep for Knowledge
 
B2: Open Up: Open Data in the Public Sector
B2: Open Up: Open Data in the Public SectorB2: Open Up: Open Data in the Public Sector
B2: Open Up: Open Data in the Public Sector
 
Designing a second generation of open data platforms
Designing a second generation of open data platformsDesigning a second generation of open data platforms
Designing a second generation of open data platforms
 
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
 
WWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationWWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & Education
 
Supporting the use of data: From data repositories to service discovery
Supporting the use of data: From data repositories to service discoverySupporting the use of data: From data repositories to service discovery
Supporting the use of data: From data repositories to service discovery
 
LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)LAK Dataset and Challenge (April 2013)
LAK Dataset and Challenge (April 2013)
 
Ziegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections LibrariesZiegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections Libraries
 
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challengeScott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
Search, Exploration and Analytics of Evolving Data

  • 1. Search, Exploration and Analytics of Evolving Data Nattiya Kanhabua L3S Research Center Hannover, Germany The 1st Keystone Training School on Keyword Search over Big Data 23 July 2015, Malta
  • 2. Lecturer
      Education qualification
      • 2007 - 2011: Ph.D. degree, Norwegian University of Science and Technology, Norway. Thesis: “Time-aware Approaches to Information Retrieval”
      • 2003 - 2005: M.Sc. in Computer Science, Asian Institute of Technology, Thailand. Thesis: “Agent-based Simulation of Trade in Barter Trade Exchanges”
      • 1997 - 2001: B.Eng. in Computer Engineering, Kasetsart University, Thailand. Project: “Software Process Enhancement and Control System”
      Work experience
      • 2011 - now: Postdoc, L3S Research Center, Germany
      • 05/2015: Visiting researcher, University of Trento, Italy
      • 03-05/2010: Research intern, Yahoo! Research, Spain
      • 2007 - 2011: Temporary Scientific Staff, NTNU, Norway
      • 2006 - 2007: Research assistant, University of Trento, Italy
      • 06-10/2006: Research assistant, AIT, Thailand
      • 2005 - 2006: Analyst programmer, IFDS Group, UK
      • 2002 - 2003: Research assistant, Kasetsart University, Thailand
      • 2001 - 2002: System analyst, Accenture, Thailand & Singapore
      Skills
      • 7+ years of research experience in information retrieval, data mining, machine learning, predictive methods and spatio-temporal analysis
      • 3+ years of research experience in Big Data, e.g., large-scale processing and MapReduce
      • Hadoop, Pig, Mahout, HBase, Tomcat, Servlet, Lucene, MySQL, Python, Java, JSP, PHP, Weka, R, UML, JSON, Eclipse, NLP, RDF, WARC
      23 July 2015, The 1st Keystone Summer School: Keyword Search
  • 3. Schedule
      • 9:00 – 10:30 Part I: Introduction to Temporal Dynamics; Temporal Information Extraction; Temporal Query Analysis (I)
      • 10:30 – 11:00 Coffee break
      • 11:00 – 12:30 Part II: Temporal Query Analysis (II); Time-aware Retrieval and Ranking; Applications of Temporal IR; Conclusions and Outlook
  • 4. Additional Resource
      • Book: Temporal Information Retrieval, Foundations and Trends® in Information Retrieval, Volume 9, Issue 2, pp. 91-208, 2015
      • Download: http://goo.gl/TunlBb
      • References can be found in the book
  • 5. Introduction to Temporal Dynamics
      • What are temporal dynamics?
      • Why do they occur and impact search?
      • When and how to leverage temporal information for IR?
  • 6. Temporal Dynamics. Figure: Internet Growth/Usage Phases/Tech Events (created by Mark Schueler, used with permission)
  • 7. Temporal Web Dynamics
      • The Web changes over time in many aspects, e.g., size, content, structure, and how it is accessed through user interactions or queries.
      • Size: web pages are added and deleted all the time
      • Content: web pages are edited and modified
      • Query: users’ information needs change
      [Risvik et al., CN 2002; Ke et al., CN 2006] [WebDyn 2010; Dumais, SIAM-SDM 2012]
  • 11. Web and Index Sizes (timeline from 1995 to 2020)
      • 2000: First billion-URL index. The world’s largest! ≈5000 PCs in clusters!
      • 2004: Index grows to 4.2 billion pages
      • 2008: Google counts 1 trillion unique URLs
      • 2009: TBs or PBs of data/index; tens of thousands of PCs
      • ?
  • 13. Content Change
      • The content of the Web changes constantly over time, e.g., web documents are added, modified or deleted continuously.
      • National and international initiatives collect and preserve parts of the Web [Gomes et al., TPDL 2011; Costa et al., TempWeb 2013]
      • Figure: Wayback Machine, a web archive search tool by the Internet Archive
  • 14. Content Change
      • Challenge: document representation and retrieval
  • 15. Categorization of Content Change
      • Implication: crawling, indexing, ranking
  • 16. User Interaction Dynamics
      • Browsing and querying (or search) behavior
      • User preferences, e.g., likes, comments, interests
      • User profiles
      [Rybak et al., ECIR 2014]
  • 17. Query Popularity Change
      • Challenge: time-sensitive queries; query understanding and processing
      • Figure: Google Insights for Search (http://www.google.com/insights/search/), query: Halloween
  • 18. Categorization of Web Search Queries
      • Implication: query analysis, ranking
      • Source: http://www.google.com/insights/search
  • 19. Temporal Information Extraction: (1) Document Creation Time, (2) Document Focus Time, (3) Entity and Event Evolution
  • 20. Motivation
      • Incorporating time into search can increase retrieval effectiveness, but only when temporal information is available
      • Research problems:
          • How to determine the publication time of a document?
          • How to extract temporal information from document contents?
  • 21. Two Time Aspects
      • 1. Publication or modified time. Task: determining timestamps of documents. Method: rule-based techniques or temporal language models
      • 2. Content or focus time. Task: temporal information extraction. Method: natural language processing, or time and event recognition algorithms
  • 22. Figure labels: content time, publication time
  • 25. Determining Document Creation Time
      • Problem statement: it is hard to find a trustworthy time for a web page
          • Time gap between crawling and indexing
          • Decentralization and relocation of web documents
          • No standard metadata for time/date
      • “For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”
      • Example dialogue: “I found a bible-like document. But I have no idea when it was created.” “Let me see… This document was probably written in 850 A.C. with 95% confidence.”
  • 26. Current Approaches
      • 1. Content-based
          • Temporal language model [de Jong et al., AHC 2005; Kanhabua and Nørvåg, ECDL 2008]
          • Classifier using features based on the text’s time expressions [Chambers, ACL 2012; Ge et al., EMNLP 2013]
          • Using burstiness of terms for estimating timestamps [Kotsakos et al., SIGIR 2014]
      • 2. Non content-based
          • Finding the oldest version of a page in a web archive [Jatowt et al., WIDM 2007]
          • Leveraging external resources [Hauff and Azzopardi, ECIR 2005; Nunes et al., WIDM 2007; SalahEldeen and Nelson, TempWeb 2013]
  • 31. Content-based Approach: Temporal Language Models
      • Based on the statistical usage of words over time
      • Compare each word of a non-timestamped document with a reference corpus
      • Tentative timestamp: the time partition that overlaps most in word usage
      Temporal language models (reference corpus):
          Partition | Word       | Freq
          1999      | tsunami    | 1
          1999      | Japan      | 1
          1999      | tidal wave | 1
          2004      | tsunami    | 1
          2004      | Thailand   | 1
          2004      | earthquake | 1
      A non-timestamped document: “tsunami”, “Thailand”
      Similarity scores: Score(1999) = 1; Score(2004) = 1 + 1 = 2. The most likely timestamp is 2004.
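The overlap scoring on this slide can be sketched in a few lines of Python. This is a minimal illustration using the toy table above; the function names `overlap_scores` and `most_likely_partition` are hypothetical and not taken from the cited papers.

```python
from collections import defaultdict

# Toy reference corpus from the slide: (partition, word, frequency).
CORPUS = [
    (1999, "tsunami", 1), (1999, "Japan", 1), (1999, "tidal wave", 1),
    (2004, "tsunami", 1), (2004, "Thailand", 1), (2004, "earthquake", 1),
]

def overlap_scores(doc_words, corpus=CORPUS):
    """Sum, per time partition, the frequencies of words shared with the document."""
    scores = defaultdict(int)
    for partition, word, freq in corpus:
        if word in doc_words:
            scores[partition] += freq
    return dict(scores)

def most_likely_partition(doc_words, corpus=CORPUS):
    """Return the partition with the highest overlap score, or None if nothing overlaps."""
    scores = overlap_scores(doc_words, corpus)
    return max(scores, key=scores.get, default=None)

scores = overlap_scores({"tsunami", "Thailand"})
best = most_likely_partition({"tsunami", "Thailand"})
print(scores, best)
```

For the slide's document this reproduces Score(1999) = 1 and Score(2004) = 2, so 2004 is selected.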
  • 32. Normalized Log-likelihood Ratio [Kraaij, SIGIR Forum 2005]
      • Variant of Kullback-Leibler divergence
      • Measures the similarity of a document and time partitions
      • C is the background model estimated on the corpus
      • Linear interpolation smoothing avoids zero probabilities for unseen words
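A minimal sketch of this score, assuming the commonly cited form NLLR(d, p) = Σ_w P(w|d) · log(P(w|p) / P(w|C)), with P(w|p) smoothed by linear interpolation against the background model C. The smoothing weight `lam` and its default value are assumptions for illustration, not values from the slides.

```python
import math
from collections import Counter

def nllr(doc_tokens, partition_tokens, corpus_tokens, lam=0.9):
    """Normalized log-likelihood ratio of a document against one time partition.

    P(w|p) is smoothed as lam * P_ml(w|p) + (1 - lam) * P(w|C), so words
    unseen in the partition do not produce a zero probability.
    """
    doc, part, bg = Counter(doc_tokens), Counter(partition_tokens), Counter(corpus_tokens)
    n_doc, n_part, n_bg = sum(doc.values()), sum(part.values()), sum(bg.values())
    score = 0.0
    for word, tf in doc.items():
        p_bg = bg[word] / n_bg
        if p_bg == 0.0:
            continue  # word unknown to the whole corpus: it carries no evidence
        p_part = lam * (part[word] / n_part) + (1 - lam) * p_bg
        score += (tf / n_doc) * math.log(p_part / p_bg)
    return score

p1999 = ["tsunami", "Japan", "tidal wave"]
p2004 = ["tsunami", "Thailand", "earthquake"]
corpus = p1999 + p2004
doc = ["tsunami", "Thailand"]
s1999 = nllr(doc, p1999, corpus)
s2004 = nllr(doc, p2004, corpus)
print(s1999, s2004)
```

On the toy partitions the 2004 partition scores higher than 1999, in line with the simple overlap count.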
  • 33. Improving Temporal LMs
      • Enhancement techniques: (1) semantic-based data preprocessing; (2) search statistics to enhance similarity scores; (3) temporal entropy as term weights
      • Intuition: direct comparison between extracted words and corpus partitions has limited accuracy
      • Approach: integrate semantic-based techniques into document preprocessing
      [Kanhabua et al., ECDL 2008] (Slide provided by the authors)
  • 34. Improving Temporal LMs
      • Enhancement techniques: (1) semantic-based data preprocessing; (2) search statistics to enhance similarity scores; (3) temporal entropy as term weights
      • Intuition: search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
      • Approach: linearly combine a GZ score with the normalized log-likelihood ratio
      [Kanhabua et al., ECDL 2008] (Slide provided by the authors)
  • 35. Improving Temporal LMs
      • Enhancement techniques: (1) semantic-based data preprocessing; (2) search statistics to enhance similarity scores; (3) temporal entropy as term weights
      • Intuition: a term weight depends on how good the term is at separating time partitions (discriminative)
      • Approach: propose temporal entropy, based on a term selection method presented in Lochbaum and Streeter
      [Kanhabua et al., ECDL 2008] (Slide provided by the authors)
  • 36. Semantic-based Preprocessing
      • Intuition: direct comparison between extracted words and corpus partitions has limited accuracy
      • Approach: integrate semantic-based techniques into document preprocessing
      Semantic-based preprocessing | Description
          Part-of-speech tagging | Select only interesting classes of words, e.g. nouns, verbs, and adjectives
          Collocation extraction | Co-occurrence of different words can alter the meaning, e.g. “United States”
          Word sense disambiguation | Identify the correct sense of a word from context, e.g. “bank”
          Concept extraction | Compare concepts instead of original words, e.g. “tsunami” and “tidal wave” have the common concept of “disaster”
          Word filtering | Select the top-ranked words according to TF-IDF scores for a comparison
      [Kanhabua et al., ECDL 2008] (Slide provided by the authors)
  • 39. Leveraging Search Statistics
      • Intuition: search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
      • Approach: linearly combine a GZ score with the normalized log-likelihood ratio
      • P(wi) is the probability that wi occurs: P(wi) = 1.0 for a gaining query, P(wi) = 0.5 for a declining query
      • f(R) converts a rank number into a weight; a higher-ranked query is more important
      • Inverse partition frequency: ipf = log(N/n)
      [Kanhabua et al., ECDL 2008] (Slide provided by the authors)
  • 40. Temporal Entropy
      • A measure of the temporal information that a word conveys
      • Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document
      • Tells how good a term is at separating a partition from the others
      • A term occurring in few partitions has higher temporal entropy than one appearing in many partitions
      • The higher the temporal entropy of a term, the better it represents a partition
      • Intuition: a term weight depends on how good the term is at separating time partitions (discriminative)
      • Approach: propose temporal entropy, based on a term selection method presented in Lochbaum and Streeter
      [Kanhabua et al., ECDL 2008] (Slide provided by the authors)
  • 43. Temporal Entropy
      • Intuition: a term weight depends on how good the term is at separating time partitions (discriminative)
      • Approach: propose temporal entropy, based on a term selection method presented in Lochbaum and Streeter
      • Formula annotations: Np is the total number of partitions in a corpus; the probability term is the probability of a partition p containing a term wi
      [Kanhabua et al., ECDL 2008] (Slide provided by the authors)
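A small sketch of how such a weight could be computed, assuming the form TE(wi) = 1 + (1 / log Np) · Σ_p P(p|wi) · log P(p|wi), with P(p|wi) estimated as the term's frequency in partition p divided by its total frequency over all partitions. Both the formula and the estimate are assumptions that match the annotations above; the function name is hypothetical.

```python
import math

def temporal_entropy(term_freq_by_partition, num_partitions):
    """Temporal entropy of one term, given its frequency in each time partition.

    Returns 1.0 for a term concentrated in a single partition and 0.0 for a
    term spread uniformly over all partitions.
    """
    total = sum(term_freq_by_partition.values())
    if total == 0 or num_partitions < 2:
        raise ValueError("need a term that occurs and at least two partitions")
    acc = 0.0
    for tf in term_freq_by_partition.values():
        if tf > 0:
            p = tf / total  # P(p | w): share of the term's occurrences in partition p
            acc += p * math.log(p)
    return 1.0 + acc / math.log(num_partitions)

te_single = temporal_entropy({2004: 12}, num_partitions=4)
te_uniform = temporal_entropy({1999: 3, 2000: 3, 2003: 3, 2004: 3}, num_partitions=4)
print(te_single, te_uniform)
```

The two extremes illustrate the slide's claim: a term confined to few partitions gets a high weight, while an evenly spread term gets a weight near zero.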
  • 44. Non Content-based Approaches
      • Dating a document using its neighbors:
          • 1. Web pages linking to the document, i.e., incoming links
          • 2. Web pages pointed to by the document, i.e., outgoing links
          • 3. Media assets associated with the document, e.g., images
      • The average of the neighbors’ last-modified dates is used as the timestamp
      [Hauff and Azzopardi, 2005; Nunes et al., WIDM 2007]
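The averaging step can be illustrated as follows. This is a sketch that assumes the neighbors' last-modified dates have already been collected as timezone-aware `datetime` objects; `estimate_timestamp` is a hypothetical helper, not code from the cited work.

```python
from datetime import datetime, timezone

def estimate_timestamp(neighbor_last_modified):
    """Average the last-modified dates of a document's neighbors.

    Returns None when no neighbor date is available, which is one of the
    drawbacks of relying on external information.
    """
    if not neighbor_last_modified:
        return None
    mean_epoch = sum(d.timestamp() for d in neighbor_last_modified) / len(neighbor_last_modified)
    return datetime.fromtimestamp(mean_epoch, tz=timezone.utc)

neighbors = [
    datetime(2010, 1, 1, tzinfo=timezone.utc),  # e.g., a page linking to the document
    datetime(2010, 1, 3, tzinfo=timezone.utc),  # e.g., an embedded image
]
estimate = estimate_timestamp(neighbors)
print(estimate)
```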
  • 45. Non Content-based Approaches
      • Drawbacks:
          • Rely on the availability and accuracy of other information
          • Cover only pages from the most recent years
          • Cannot determine the age of the actual contents
      [SalahEldeen and Nelson, 2013]
  • 46. Determining Document Focus Time
      • Three types of temporal expressions:
          • 1. Explicit: time mentions that map directly to a time point or interval, e.g., “July 4, 2012”
          • 2. Implicit: imprecise time point or interval, e.g., “Independence Day 2012”
          • 3. Relative: resolved to a time point or interval using other types or the publication date, e.g., “next month”
      • Time and event recognition [Mani and Wilson, ACL 2000]: a mix of hand-crafted and machine-learnt rules
      • Ranking the most relevant temporal expressions [Strötgen et al., TempWeb 2012]
  • 47. Time Taggers for Calculating Focus Time
      • HeidelTime: http://heideltime.ifi.uni-heidelberg.de/heideltime
      • Timestamp: 2013/7/15
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 48. Limitations
      • A document may lack any temporal expressions
      • Temporal expressions may be weakly related to the document’s theme
      • Temporal taggers are not perfect
      • Estimating document focus time without using temporal expressions
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 49. Focus Time of Documents
      • Definition: a document has focus time t if its content refers to t
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 50. Estimating Focus Time: Concept
      • Use time-referenced documents for estimating the focus time of a target document
      • Figure: news article collections; target document; target document focus time
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 51. Word Graph
      • Word co-occurrence graph built from large collections of news articles
      • Link weight estimated by the Jaccard coefficient, using the sentence as the unit
      • Example nodes: war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
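A minimal sketch of the link weight, assuming each sentence has already been tokenized into a set of terms (words and year mentions alike). The function name and the toy sentences are illustrative only.

```python
def jaccard_link_weight(sentences, a, b):
    """Jaccard coefficient of two terms, counting co-occurrence per sentence.

    weight = |sentences containing both| / |sentences containing either|
    """
    with_a = {i for i, sentence in enumerate(sentences) if a in sentence}
    with_b = {i for i, sentence in enumerate(sentences) if b in sentence}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0

# Each sentence is a set of tokens.
sentences = [
    {"war", "1945"},
    {"war", "nazi"},
    {"1945", "hiroshima"},
    {"war", "1945", "germany"},
]
weight = jaccard_link_weight(sentences, "war", "1945")
print(weight)
```

Here "war" and "1945" share two of the four sentences in which either appears, giving a weight of 0.5.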
  • 52. Estimating Direct Word-Year Association
      • Word-year associations are derived from the graph
      • Word w is strongly associated with year y if it frequently co-occurs with y
      • Example vectors: A(war, 1900), A(war, 1901), …, A(war, 1944), A(war, 1945), …, A(war, 2009), A(war, 2010); likewise A(hiroshima, 1900), …, A(hiroshima, 2010); in general A(word, 1900), …, A(word, 2010)
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 53. Second Level Term-Year Association
      • Word w is strongly associated with year y if many other words that frequently co-occur with w are also strongly associated with y
      • $A^{2}(w_i, y) = \frac{1}{V} \sum_{j=1}^{V} A(w_i, w_j)\, A_{dir}(w_j, y)$
      • Example nodes: war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima, israel
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 54. Estimating Document-Year Association
      • If a document contains many words strongly associated with year y, the document is strongly associated with y
      • Figure: A(word, year) curves over time (1900 to 2000) for words A, B and C; a document containing A, B, C, B, C has document-year association A + 2B + 2C
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 55. Finding Discriminative Features
      • Not every word is useful for estimating text focus time, e.g., “man” and “city” have stable associations with years
      • Temporal entropy: a measure of the variability of word associations
      • Temporal kurtosis: a measure of the peakedness of word associations, e.g., “war”, “earthquake” vs. “hitler”, “stalingrad”
      • Figure: two plots of A(word, year) over 1900 to 2000 for words A and B, with Temporal_Entropy(A) < Temporal_Entropy(B) and Temporal_Kurtosis(A) > Temporal_Kurtosis(B)
      • Temporal entropy and temporal kurtosis are used as temporal weights for words
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 56. Importance of Words in Document
      • Words weakly related to the document theme should be skipped
      • Target document: “President Obama took part in the celebrations of the Polish Independence Day. The US president met main Polish politicians in Warsaw. Poland regained independence at the end of the World War I following Bolshevik Revolution. It then lost the independence as a result of Nazi and Soviet invasions led by Hitler and Stalin. Poland is located in East Europe.”
      • Document-to-graph conversion, then TextRank: 0.90 independence, 0.82 poland, 0.74 war, 0.61 nazi, 0.56 hitler, 0.54 …
      • TextRank scores are used as discriminatory semantic weights for words [Mihalcea and Tarau, EMNLP 2004]
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 57. Estimating Focus Time
      • Weighted sum (temporality and semantics) of A(word, year) over the document’s words
      • Focus time, interval-based: years above a threshold
      • Focus time, instant-based
      • Figure: A(word, year) curves for words A, B and C over 1900 to 2000, and the resulting document curve
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 58. Combined Approach
      • Combining estimated focus time and temporal expressions in text
      • Representing dates on a timeline: Gaussian kernel density estimate
      • Mixture of Gaussian distributions with means centered on the extracted dates
      • $S_{Comb}(d, y)$ is computed from $S_{Est}(d, y)$ and $S_{TempExp}(d, y)$
      • Figure: target document with the extracted dates 1932, 1935, 1940, 2001 and 2011 placed on a timeline
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 59. Experimental Settings: Word Graphs
      • News articles collected from Google News Archive using country names as queries
      • Germany (87k), UK (149k), France (110k), Japan (97k), Israel (92k)
      • Published within [1990, 2010]
      • Dates falling in [1900, 2013] were found using regular expressions
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 60. Experimental Settings: Test Datasets
      • Datasets on events related to countries:
          • Wiki: 250 Wikipedia pages about events
          • Books: 735 paragraphs from 2 text books about history (timelines)
          • Web: 812 paragraphs from web pages on history (BBC timelines, etc.)
      Datasets | total #doc | avr. #sent | avr. time span of events | avr. year of events | avr. #dates
          Wiki | 250 | 179 | 3.4 years | 1958 | 14.5
          Book | 735 | 43 | 4.4 years | 1982 | 4.5
          Web | 819 | 18.3 | 1.3 years | 1957 | 2.4
  • 61. Experimental Settings: Baselines
      • Random
      • Date-based (using only dates in the document text)
      • LDA-based:
          • 1. 100 topics over sentences containing year mentions
          • 2. Finding the topic distribution of each year
          • 3. Calculating document-year association based on the topic distribution of documents
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 62. Experimental Settings: Measurements
      • Average error (in years), for the instant-based representation
      • Pearson correlation coefficient (-1..+1) between ground truth years and years in focus time, for the interval-based representation
      • Figure: ground truth vs. estimated focus time for both representations
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 63. Experimental Results
      Average error:
      Datasets | random baseline | LDA baseline | date-based baseline | Proposed (no dates) | Proposed combined (with dates)
          Wiki | 36.5 | 27.2 | 3 | 18.3 | 2.83
          Books | 39.3 | 37.3 | 48.1 | 23.5 | 20.4
          Web | 40.5 | 41.4 | 53.4 | 23.6 | 20.7
      Pearson Correlation Coefficient:
      Datasets | random baseline | LDA baseline | date-based baseline | Proposed (no dates) | Proposed combined (with dates)
          Wiki | 0 | 0.1 | 0.65 | 0.29 | 0.66
          Books | 0 | 0.04 | 0.01 | 0.25 | 0.30
          Web | 0 | 0.02 | -0.03 | 0.26 | 0.41
      [Jatowt et al., CIKM 2013] (Slide provided by the authors)
  • 64. Effect of Time Distance on Focus Time
      • How well can we estimate the focus time of documents about the distant past?
      • Figure panels: Wiki, Books, Web (instant-based focus time representation)
  • 65. Questions?
  • 66. Temporal Query Analysis: (1) Temporal query intent, (2) Dynamic query subtopics
  • 67. Temporal Queries
      • Temporal information needs
          • Searching temporal document collections, e.g., digital libraries, web/news archives
          • Users: historians, librarians, journalists or students
      • Temporal queries exist in both standard collections and the Web
          • Relevancy is dependent on time
          • Documents are about events at a particular time
  • 68. Types of Temporal Queries
      • Two types of temporal queries:
          • 1. Explicit: time is provided, e.g., “Presidential election 2012”
          • 2. Implicit: time is not provided, e.g., “Germany World Cup”; temporal intent can be inferred, i.e., the query refers to the World Cup event in 2006
      • Studies of web search query logs show a significant fraction of temporal queries:
          • 1.5% of web queries are explicit
          • ~7% of web queries are implicit
          • 13.8% of queries contain explicit time and 17.1% of queries have temporal intent implicitly provided
      [Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]
  • 69. Figure: Variances of temporal queries and their dynamics
  • 70. Understanding Temporal Query Intent
      • Current approaches:
          • 1. Mining temporal patterns in query logs [Vlachos et al., SIGMOD 2004; Radinsky et al., WWW 2012]
          • 2. Analyzing top-k search results [Jones and Diaz, TOIS 2007; Campos et al., CIKM 2012]
  • 71. Motivation
      • Temporal queries are a significant fraction of Web search queries [Zhang et al., EMNLP 2010]: 13.8% are explicit temporal queries and 17.1% are implicit temporal queries
      • Characteristics:
          • Certain temporal patterns, i.e., spikes, periodicity (hourly or daily), seasonality and trends
          • Underlying temporal information needs without observed temporal patterns
      • Tasks:
          • Understand temporal search intent
          • Enable advanced enhancement techniques
          • Automatic method for detecting events in search streams
      • Figure examples: US Election 2016, Brazil FIFA World Cup
      [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
  • 72. Preliminaries
      • Data model:
          • Set of queries Q issued at different time points
          • Set of clicked URLs U and click-through data
          • Temporal document collection D
          • q: keywords or term(q), and hitting time(q)
          • yq: time-series data extracted from Q, U and D
      • Two-step approach:
          • Automatically extract a set of candidate queries {q1, ..., qn} from Q
          • Classify candidates as event-related queries {e1, ..., em} using machine learning techniques
      [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
  • 73. Identifying Event Candidates
      • Time and keyword-based clustering:
          • Step 1: Partition query logs into one-week windows. This groups queries from the same event, but a window may contain multiple, unrelated events.
          • Step 2: Cluster queries by lexical similarity. Pre-process and sort queries alphabetically, then compute the Jaccard similarity of each query pair.
      • Example cluster “Easter”: easter 2006, easter 2007, easter 20crafts, easter activities, easter animation, easter animations, easter background, easter basket, easter bread, easter bucket, easter bunny, easter bunny decorations, easter bunny lights
      [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
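Step 2 can be sketched as a greedy pass over the alphabetically sorted queries of one weekly partition. This is an illustration under stated assumptions: the token-level Jaccard measure, the comparison against the first query of the current cluster, and the threshold of 0.2 are choices made here for the example, not a description of the authors' implementation.

```python
def jaccard(query_a, query_b):
    """Token-level Jaccard similarity of two queries."""
    tokens_a, tokens_b = set(query_a.lower().split()), set(query_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def cluster_queries(queries, threshold=0.2):
    """Greedy clustering: sort queries, then attach each one to the current
    cluster if it is similar enough to that cluster's first query."""
    clusters = []
    for query in sorted(set(queries)):
        if clusters and jaccard(clusters[-1][0], query) > threshold:
            clusters[-1].append(query)
        else:
            clusters.append([query])
    return clusters

week = ["easter bunny", "easter 2006", "easter", "world cup",
        "easter basket", "world cup 2006"]
clusters = cluster_queries(week)
print(clusters)
```

Sorting first places lexically similar queries next to each other, so a single linear pass is enough for this toy example.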
  • 74. Event-related Query Classification
      • Classify a query as event-related or not:
          • Periodic and seasonal events
          • Popular and trending events
          • Sporadic (rare) and unseen events
          • General time-sensitive queries
          • Underlying temporal information needs
      • Features:
          • Time-series features, e.g., seasonality or trends
          • Popularity-based features, e.g., click-through and burstiness
          • Statistical features of the result distribution, e.g., temporal KL-divergence and skewness (kurtosis)
      [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
  • 75. Seasonality
      • Detect seasonal queries [Shokouhi, SIGIR 2011], e.g., annual events such as the US Open and Easter, or a 4-year recurring event such as the FIFA World Cup
      • Method: time-series decomposition using Holt-Winters adaptive exponential smoothing
      • Input: time-series data extracted from external document collections, YD
      • Compute a cosine similarity as seasonality, where Y is the original time-series data and S is the seasonality component
      • Figure panels: query “Easter”; query “World cup”
      [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
  • 76. Autocorrelation
      • Detect trending events by their predictability
      • Cross-correlation of a time series with itself, i.e., between its past and future values at different time lags
      • The stronger the inter-day dependencies, the higher the autocorrelation value
      • With lag = 1, the second time series is shifted by one day; this is called 1st-order autocorrelation
      [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
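A minimal sketch of the lag-k autocorrelation for a daily query-frequency series, using the standard sample estimator (lagged covariance divided by the overall variance). The exact estimator used in the cited work is not given on the slide, so this form is an assumption.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a time series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant series carries no temporal dependency signal
    covariance = sum((series[t] - mean) * (series[t + lag] - mean)
                     for t in range(n - lag))
    return covariance / variance

trending = [1, 2, 3, 4, 5, 6, 7, 8]       # steadily growing daily frequency
alternating = [1, -1, 1, -1, 1, -1]       # values flip sign every day
r_trend = autocorrelation(trending, lag=1)
r_alt = autocorrelation(alternating, lag=1)
print(r_trend, r_alt)
```

The steadily growing series yields a clearly positive 1st-order autocorrelation, while the alternating series yields a negative one.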
  • 77. Temporal KL-divergence
      • Analyze the temporal distribution of a result set
      • Measure the difference between the distribution over time of the top-k documents of q and that of the document collection C
      • P(t|q) is the probability of generating a publication date t given q
      • P(t|C) is the probability of a publication date t in the collection
      [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
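A small sketch, assuming the usual definition KL = Σ_t P(t|q) · log(P(t|q) / P(t|C)) with both distributions estimated by simple counting over publication years. The maximum-likelihood estimates and the year granularity are assumptions for illustration.

```python
import math
from collections import Counter

def temporal_kl(result_dates, collection_dates):
    """KL divergence between the publication-date distribution of the top-k
    results of a query and that of the whole collection."""
    p_query, p_coll = Counter(result_dates), Counter(collection_dates)
    n_query, n_coll = len(result_dates), len(collection_dates)
    kl = 0.0
    for date, count in p_query.items():
        if p_coll[date] == 0:
            raise ValueError("every result date must also occur in the collection")
        p_t_q = count / n_query          # P(t | q)
        p_t_c = p_coll[date] / n_coll    # P(t | C)
        kl += p_t_q * math.log(p_t_q / p_t_c)
    return kl

collection = [2004] * 5 + [2005] * 5 + [2006] * 5 + [2007] * 5
flat = temporal_kl([2004, 2005, 2006, 2007], collection)  # mirrors the collection
peaked = temporal_kl([2006] * 4, collection)              # all results from one year
print(flat, peaked)
```

A result set that mirrors the collection gives a divergence of zero; a result set concentrated in one year gives a large value, which hints at an event-related query.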
  • 78. Surprise Score
      • Detect unseen events or surprisingly popular queries [Radinsky et al., WWW 2012]
      • Assume that an unplanned event is happening when there is a significant prediction error
      • Compute the sum of squared errors of prediction (SSE) using a simple linear regression model
      [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
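The idea can be sketched by fitting a least-squares line to a query's frequency series and summing the squared residuals. This simplified version measures in-sample errors over the time index; how the cited work splits history and prediction windows is not stated on the slide, so treat this as an assumption.

```python
def surprise_score(series):
    """Sum of squared errors of a simple linear regression over the time index."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, series))

steady = [2, 4, 6, 8, 10]    # perfectly predictable growth
spiky = [2, 4, 6, 50, 10]    # sudden burst on day 4
sse_steady = surprise_score(steady)
sse_spiky = surprise_score(spiky)
print(sse_steady, sse_spiky)
```

The predictable series has an SSE near zero, whereas the burst produces a large SSE and therefore a high surprise score.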
  • 79. Experiments ▪ Query logs: • Two datasets, i.e., AOL and MSN • AOL: 30M queries from March 1 to May 31, 2006 • MSN: 15M queries from May 2006 ▪ Temporal collection: • The New York Times Annotated Corpus • 1.8M documents from 1987 to 2007 ▪ Setting: • HeidelTime for time extraction and OpenNLP for entity extraction • Cleansing-step parameters: Jaccard similarity threshold > 0.2; edit distance < 3; overlap n-gram = 2 • For burstiness features, default parameters of the burst detection technique provided by CISHELL ▪ In total, 837 event-related queries [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
  • 80. Experimental Results (I) ▪ Feature selection: • Study high-impact (best) features • Investigate their importance independently of classification algorithms • InfoGainAttributeEval method in WEKA ▪ Main findings: • Discriminative features are mostly derived from D and Q • TemporalKL and kurtosis are among the influential features • Trend-based features, such as autocorrelation, burst weight, and trending level, play an important role • Seasonality computed from Q has less impact than seasonality extracted from D [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
  • 81. Experimental Results (II) ▪ Query classification: • Several classifiers, i.e., support vector machine (SVM), AdaBoost, decision tree (J48), and neural network (NN) • Metrics: accuracy, precision, recall, and F-measure using 10-fold cross validation ▪ Main findings: • J48 is the best performing algorithm • TemporalKL alone achieves an accuracy of 84% • Adding autocorrelation, kurtosis, and seasonality increases the performance • However, the performance drops after adding further features such as maximum query frequency [Kanhabua et al., TempWeb 2015] (Slide provided by the authors)
  • 82. Analyzing Top-k Search Results ▪ Using temporal language models ▪ Determine the time of queries when no time is given explicitly ▪ Re-rank search results using the determined time ▪ Exploiting time from search snippets ▪ Extract temporal expressions (i.e., years) from the contents of top-k retrieved web snippets for a given query ▪ Content-based language-independent approach [Kanhabua and Nørvåg, ECDL 2010; Campos et al., CIKM 2012]
  • 83. Determining Time of Queries ▪ Approach I. Dating using keywords* ▪ Approach II. Dating using top-k documents* ▪ Queries are short keywords ▪ Inspired by pseudo-relevance feedback ▪ Approach III. Using timestamp of top-k documents ▪ No temporal language models are used ▪ *Using temporal language models proposed by de Jong et al. [Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
  • 85. I. Dating using Keywords: query's temporal profiles [Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
  • 87. II. Dating using Top-k Documents: query's temporal profiles [Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
  • 89. III. Using Timestamp of Documents: query's temporal profiles [Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
  • 91. Re-ranking Search Results ▪ Intuition: documents published close to the time of queries are more relevant ▪ Assign document priors based on publication dates [Figure: query → news archive → determine time (2005, 2004, 2006, ...); initial retrieved results (D2009) vs. re-ranked results (D2005)] [Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)
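The slide does not say how the priors are computed. A minimal sketch, assuming each candidate query time carries a confidence score, a document's prior is the normalized confidence of its publication year, and the prior is multiplied into the retrieval score. All names and the `floor` value are hypothetical.

```python
def rerank_by_time(results, query_time_scores, floor=0.01):
    """Re-rank (doc_id, score, pub_year) triples with a time-based prior.

    `query_time_scores` maps a candidate year to its confidence; documents
    published in years that were not predicted receive the `floor` prior.
    """
    total = sum(query_time_scores.values()) or 1.0
    rescored = []
    for doc_id, score, pub_year in results:
        prior = max(query_time_scores.get(pub_year, 0.0) / total, floor)
        rescored.append((doc_id, score * prior, pub_year))
    return sorted(rescored, key=lambda item: item[1], reverse=True)

initial = [("D2009", 0.9, 2009), ("D2005", 0.8, 2005), ("D2004", 0.7, 2004)]
times = {2005: 0.6, 2004: 0.3, 2006: 0.1}  # determined query times
print([doc_id for doc_id, _, _ in rerank_by_time(initial, times)])
```

With these illustrative numbers the document from 2005 moves ahead of the one from 2009, mirroring the figure.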
  • 92. Change of Query Subtopics over Time ▪ Query: ncaa ▪ 14/03/2006: march madness began ▪ 18/03/2006: ncaa women tournament began ▪ 01/04/2006: final four began [Nguyen and Kanhabua, ECIR 2014]
  • 93. Mining Temporal Anchor Texts ▪ Anchor texts are complementary descriptions of target pages, widely used to improve search ▪ Characteristics: ▪ Short summary (a few words) of target pages ▪ Collective wisdom of people other than authors ▪ Similar behavior to real-world queries and titles ▪ Capturing aboutness, i.e., what a document is about ▪ Main ideas: ▪ Temporal anchor texts mined from the edit history of Wikipedia as a hook for tracking entity evolution ▪ Large-scale analysis and a more robust discovery of evolving information using limited resources
  • 96. Mining Temporal Anchor Texts 1. Partition Wikipedia revisions using the one-month granularity 2. For each Wikipedia snapshot, identify named entity articles/pages 3. Extract anchor texts from all articles linking to an entity page 4. Rank aggregated entity-anchor relationships at a particular time t [Figure: anchor texts linking to the entity page President_of_the_United_States: "President Bush (43)", Time: 10/2005; "Barack Obama", Time: (not shown); "George W. Bush", Time: 11/2004] [Kanhabua and Nørvåg, JCDL 2010]
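Steps 1, 3 and 4 can be sketched as follows. This is an illustration only: it assumes the input already holds one revision per linking article per month as (ISO timestamp, wikitext) pairs, it relies on the MediaWiki `[[Target|anchor]]` link syntax, and the function names are hypothetical. Step 2 (entity recognition) is left out.

```python
import re
from collections import Counter, defaultdict

# Matches MediaWiki internal links: [[Target]] or [[Target#Section|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")

def extract_anchors(wikitext):
    """Yield (target_title, anchor_text) pairs from one revision's wikitext."""
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().replace(" ", "_")
        anchor = (match.group(2) or match.group(1)).strip()
        if target and anchor:
            yield target, anchor

def anchors_by_month(revisions):
    """Aggregate anchor counts per (month, entity) from (timestamp, text) pairs."""
    table = defaultdict(Counter)
    for timestamp, wikitext in revisions:
        month = timestamp[:7]  # one-month granularity, e.g. "2005-10"
        for target, anchor in extract_anchors(wikitext):
            table[(month, target)][anchor] += 1
    return table

revisions = [
    ("2004-11-02T00:00:00Z",
     "The [[President of the United States|George W. Bush]] won."),
    ("2005-10-03T00:00:00Z",
     "[[President of the United States|President Bush (43)]] met [[UNICEF]]."),
]
table = anchors_by_month(revisions)
# Step 4: rank the anchors of one entity at one time point by frequency.
print(table[("2005-10", "President_of_the_United_States")].most_common())
```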
  • 97. Recognizing Named Entity Articles 1. Multi-word titles with all words capitalized, except prepositions, determiners, etc. E.g., President_of_the_United_States => entity 2. Single-word titles with multiple capital letters. E.g., UNICEF and WHO => entities 3. At least 75% of the title's occurrences in the article text itself are capitalized (not counting the beginning of a sentence) [Bunescu and Pasca, EACL 2006]
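The three rules can be sketched as below. This is a simplified reading, not the original implementation: the function-word list is illustrative, rule 3 is applied to single-word titles only, and sentence boundaries are detected naively from ".", "!" and "?".

```python
import re

# Words allowed to stay lowercase inside a multi-word title (illustrative list).
FUNCTION_WORDS = {"of", "the", "and", "in", "for", "on", "at", "to", "a", "an"}

def capitalized_ratio(word, text):
    """Share of mid-sentence occurrences of `word` that are capitalized."""
    tokens = re.findall(r"[A-Za-z]+|[.!?]", text)
    hits = capitalized = 0
    sentence_start = True
    for token in tokens:
        if token in ".!?":
            sentence_start = True
            continue
        if token.lower() == word.lower() and not sentence_start:
            hits += 1
            capitalized += token[0].isupper()
        sentence_start = False
    return capitalized / hits if hits else 0.0

def is_entity_title(title, article_text="", threshold=0.75):
    words = title.replace("_", " ").split()
    if not words:
        return False
    if len(words) > 1:
        # Rule 1: every content word starts with a capital letter.
        content = [w for w in words if w.lower() not in FUNCTION_WORDS]
        return all(w[0].isupper() for w in content)
    word = words[0]
    # Rule 2: a single word with at least two capital letters.
    if sum(ch.isupper() for ch in word) >= 2:
        return True
    # Rule 3: mostly capitalized when used inside sentences of the article.
    return capitalized_ratio(word, article_text) >= threshold

print(is_entity_title("President_of_the_United_States"), is_entity_title("UNICEF"))
```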
  • 98. Temporal Anchor Weighting ▪ Weight anchor texts by importance with respect to a target entity at a particular time: • Link-independent: inlink pages are independent and equally important to the target page • Computed over the whole collection of Wikipedia entity pages at a particular time t • Two variants: 1) article links, and 2) distinct pages [Dou et al., SIGIR 2009]
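The weighting formula is not preserved in this transcript. As a rough illustration of the two variants only, the sketch below uses relative frequency as the weight, which is an assumption and not the formula of Dou et al.: variant 1 counts every link, variant 2 counts each source page at most once per anchor.

```python
from collections import Counter

def anchor_weights(links, distinct_pages=False):
    """Weight anchors of one entity at one time point by relative frequency.

    `links` is a list of (source_page, anchor_text) pairs pointing at the
    entity. With distinct_pages=True each source page counts an anchor once.
    """
    if distinct_pages:
        links = set(links)
    counts = Counter(anchor for _, anchor in links)
    total = sum(counts.values())
    return {anchor: n / total for anchor, n in counts.items()} if total else {}

links = [
    ("Page_A", "Senator Barack Obama"),
    ("Page_A", "Senator Barack Obama"),
    ("Page_A", "Senator Barack Obama"),
    ("Page_B", "Barack Hussein Obama II"),
]
print(anchor_weights(links))
print(anchor_weights(links, distinct_pages=True))
```

Counting distinct pages keeps a single page that repeats the same link many times from dominating the weights.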
  • 100. Experiments ▪ Data collection: • A dump of the English Wikipedia edit history (2.8 TB) • All pages and revisions from 03/2001 to 03/2008 • 85 snapshots + 4 additional snapshots (24/05/2008, 27/07/2008, 08/10/2008, 06/03/2009) ▪ Tools: • Preprocess/store revisions using MWDumper http://www.mediawiki.org/wiki/Mwdumper • Store anchor texts: MySQL databases
  • 101. Top-100 Named Entities
  • 105. Evolving Context: top-ranked anchor texts for “Barack Obama” along a timeline (05/2008, 07/2008, 10/2008, 03/2009), listed in the order shown on the slide ▪ 1. Senator Barack Obama 2. Senator Obama's legislative accomplishments 3. Illinois 4. U.S. Sen. Barack Obama ▪ 1. Senator Barack Obama 2. Illinois Senator Barack Obama 3. Barack Hussein Obama II 4. Senator Obama's legislative accomplishments ▪ 1. Senator Barack Obama 2. Illinois Senator Barack Obama 3. Barak Obama, U.S. Senator, Illinois, 2008 Democratic nominee for U.S. President 4. presidential candidacy announcement ▪ 1. President Barack Obama 2. Senator Barack Obama 3. U.S. President Barack Obama 4. 44th President of the United States 5. Obama Administration
  • 106. Main Findings: Evolving information & context • Role changes for political entities • Geographic name changes for locations • Trends or things in vogue for celebrities • Products in demand for technology
  • 113. Questions?
  • 114. Time-aware Retrieval and Ranking (1) Recency-based Ranking (2) Time-dependent Ranking
  • 115. RECAP ▪ Two time dimensions 1. Publication or modified time 2. Content or focus time
  • 116. Searching the past ▪ Historical or temporal information needs ▪ A journalist working on the historical story of a particular news article ▪ A Wikipedia contributor finding relevant information that has not been written about yet ▪ “Temporal document collections”: Web archives, news archives, blogs, emails ▪ Example: retrieve documents about Pope Benedict XVI written before 2005 ▪ Term-based IR approaches may give unsatisfactory results
  • 117. Temporal Query Examples ▪ A temporal query consists of: ▪ Query keywords ▪ Temporal expressions ▪ A document consists of: ▪ Terms, i.e., bag-of-words ▪ Publication time and temporal expressions
  • 118. Temporal Query Examples [Berberich et al., ECIR 2010]
  • 119. Recency-based Ranking ▪ Assign prior probabilities using an exponential function ▪ E.g., a more recent creation date obtains a higher probability ▪ Current approaches: ▪ Time-based language model [Li and Croft, CIKM 2003] ▪ Using retention functions [Peetz and de Rijke, ECIR 2013] ▪ Incorporating freshness into web authority [Dai and Davison, SIGIR 2010]
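A sketch of an exponential recency prior in the spirit of the time-based language model: prior = rate · exp(-rate · age), added in log space to a query log-likelihood. The rate value, the day-based time unit and the function names are illustrative assumptions.

```python
import math

def recency_prior(doc_age, rate=0.01):
    """Exponential prior: rate * exp(-rate * age), with age >= 0."""
    if doc_age < 0:
        raise ValueError("document age must be non-negative")
    return rate * math.exp(-rate * doc_age)

def rank_with_recency(docs, now, rate=0.01):
    """Sort (doc_id, log_likelihood, timestamp) by log-likelihood + log-prior."""
    def score(doc):
        _, log_likelihood, timestamp = doc
        return log_likelihood + math.log(recency_prior(now - timestamp, rate))
    return sorted(docs, key=score, reverse=True)

# Timestamps are days since an arbitrary origin; "new" matches slightly worse.
docs = [("old", -4.0, 0), ("new", -4.2, 300)]
print([doc_id for doc_id, _, _ in rank_with_recency(docs, now=365)])
```

The larger the rate, the more strongly recent documents are favored over slightly better textual matches.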
  • 120. Time-dependent Ranking ▪ Time must be explicitly modeled in order to increase the effectiveness of ranking ▪ To order search results so that the most relevant ones come first ▪ Time uncertainty should be taken into account ▪ Two temporal expressions can refer to the same time period even though they are written differently ▪ E.g., the query “Independence Day 2011” ▪ A retrieval model relying on term matching only will fail to retrieve documents mentioning “July 4, 2011”
  • 121. Time-dependent Ranking ▪ Two main approaches: 1. Mixture model [Kanhabua et al., ECDL 2010] ▪ Linearly combining textual and temporal similarity 2. Probabilistic model [Berberich et al., ECIR 2010] ▪ Generating a query from the textual part and the temporal part of a document independently
  • 123. Mixture Model ▪ Linearly combine textual and temporal similarity ▪ α controls the relative importance of the two similarity scores ▪ Both scores are normalized before combining ▪ Textual similarity can be determined using any term-based retrieval model ▪ E.g., tf.idf or a unigram language model ▪ How to determine temporal similarity?
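The combination formula is missing from the transcript. A minimal sketch, assuming the form score = (1 - α) · textual + α · temporal with min-max normalization; placing α on the temporal term and the choice of normalization are assumptions.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def mixture_scores(textual, temporal, alpha=0.5):
    """(1 - alpha) * textual + alpha * temporal on normalized scores."""
    text_n, time_n = min_max(textual), min_max(temporal)
    return {
        doc: (1 - alpha) * text_n[doc] + alpha * time_n.get(doc, 0.0)
        for doc in text_n
    }

textual = {"d1": 10.0, "d2": 5.0, "d3": 0.0}    # e.g. tf.idf scores
temporal = {"d1": 0.0, "d2": 1.0, "d3": 0.5}    # temporal similarity scores
print(mixture_scores(textual, temporal, alpha=0.5))
```

With α = 0 the ranking reduces to the purely textual one; with α = 1 only temporal similarity counts.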
  • 124. Temporal Similarity [Figure: similarity score plotted over time for documents d1 and d2 around the query time q, with distances Dist(d1,q) and Dist(d2,q)] [Kanhabua et al., ECDL 2010]
  • 125. Temporal Similarity ▪ Assume that temporal expressions in the query are generated independently from a two-step generative model: ▪ P(tq|td) can be estimated based on publication time using an exponential decay function [Kanhabua et al., ECDL 2010] ▪ Linear interpolation smoothing is applied to eliminate zero probabilities ▪ I.e., for an unseen temporal expression tq in d
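The equations on this slide were lost. The sketch below is one possible reading, with every formula an assumption: P(tq|td) decays as decay_rate ** (lam · |tq - td| / mu), a document's probability for tq is the average over its time points (step 1: pick td, step 2: generate tq), query expressions are multiplied as independent events, and smoothing interpolates with a collection probability. Parameter values are illustrative.

```python
def decay_similarity(query_time, doc_time, decay_rate=0.5, lam=0.5, mu=2.0):
    """decay_rate ** (lam * |tq - td| / mu); equals 1 when the times coincide."""
    return decay_rate ** (lam * abs(query_time - doc_time) / mu)

def smoothed(p_doc, p_collection, weight=0.9):
    """Linear interpolation: weight * P(tq|d) + (1 - weight) * P(tq|C)."""
    return weight * p_doc + (1 - weight) * p_collection

def temporal_similarity(query_times, doc_times, p_collection=0.01, weight=0.9):
    """Product over query times of the smoothed, averaged decay similarity."""
    score = 1.0
    for tq in query_times:
        p_doc = (
            sum(decay_similarity(tq, td) for td in doc_times) / len(doc_times)
            if doc_times else 0.0
        )
        score *= smoothed(p_doc, p_collection, weight)
    return score

print(temporal_similarity([2005], [2005]))
print(temporal_similarity([2005], [2009]))
print(temporal_similarity([2005], []))  # stays non-zero thanks to smoothing
```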
  • 126. Comparison of time-aware ranking ▪ Five time-aware ranking models ▪ LMT [Berberich et al., ECIR 2010] ▪ LMTU [Berberich et al., ECIR 2010] ▪ TS [Kanhabua et al., ECDL 2010] ▪ TSU [Kanhabua et al., ECDL 2010] ▪ FuzzySet [Kalczynski et al., Inf. Process. 2005] [Kanhabua et al., SIGIR 2011]
  • 127. Discussion ▪ Experiment: ▪ New York Times Annotated Corpus ▪ 40 temporal queries [Berberich et al., ECIR 2010] ▪ Result: ▪ TSU outperforms the other methods significantly for most metrics ▪ Conclusions: ▪ Although TSU achieves the best performance, it can only be applied to a collection with time metadata ▪ LMT and LMTU can be applied to any collection without time metadata, but time extraction is needed
  • 128. Applications for Temporal IR (1) Searching the Future (2) Time-aware Recontextualization
  • 129. Searching the Future ▪ People are naturally curious about the future ▪ What will happen to EU economies in the next 5 years? ▪ What will be the potential effects of climate change?
  • 130. Previous work ▪ Searching the future ▪ Extract temporal expressions from news articles ▪ Retrieve future information using a probabilistic model, i.e., multiplying textual similarity and a time confidence ▪ Supporting analysis of future-related information in news and the Web ▪ Extract future mentions from news snippets obtained from search engines ▪ Summarize and aggregate results using clustering methods, but no ranking [Baeza-Yates, SIGIR Forum 2005; Jatowt et al., JCDL 2009]
  • 131. Recorded Future http://www.recordedfuture.com/
  • 132. Yahoo! Time Explorer [Matthews et al., HCIR 2010]
  • 133. Ranking News Predictions ▪ Over 32% of 2.5M documents from Yahoo! News (July’09 – July’10) contain at least one prediction ▪ Retrieve predictions related to a news story in news archives and rank them by relevance
  • 134. Related News Predictions [Kanhabua et al., SIGIR 2011]
  • 137. Approach ▪ Four classes of features ▪ Term similarity, entity-based similarity, topic similarity and temporal similarity ▪ Rank results using a learning-to-rank technique [Kanhabua et al., SIGIR 2011]
  • 138. System Architecture ▪ Step 1: Document annotation ▪ Extract temporal expressions using time and event recognition ▪ Normalize them to dates so they can be anchored on a timeline ▪ Output: sentences annotated with named entities and dates, i.e., predictions ▪ Step 2: Retrieving predictions ▪ Automatically generate a query from the news article being read ▪ Retrieve predictions that match the query ▪ Rank predictions by relevance (i.e., a prediction is “relevant” if it is about the topics of the article) [Kanhabua et al., SIGIR 2011]
  • 139. Term Similarity ▪ Capture the term similarity between q and p 1. TF-IDF scoring function ▪ Problem: keyword matching on short texts ▪ Predictions may not match the query terms 2. Field-aware ranking function, e.g., BM25F ▪ Search the context of a prediction, i.e., surrounding sentences [Kanhabua et al., SIGIR 2011]
  • 140. Entity-based Similarity ▪ Measure the similarity between q and p using annotated entities in dp, p, q ▪ Features commonly employed in entity ranking [Kanhabua et al., SIGIR 2011]
  • 141. Topic Similarity ▪ Compute the similarity between q and p on topic ▪ Latent Dirichlet allocation [Blei et al., J. Mach. Learn. 2003] for modeling topics 1. Train a topic model 2. Infer topics 3. Compute topic similarity [Kanhabua et al., SIGIR 2011]
  • 144. Temporal Similarity ▪ Hypothesis I. Predictions that are more recent to the query are more relevant ▪ Hypothesis II. Predictions extracted from more recent documents are more relevant [Kanhabua et al., SIGIR 2011]
  • 145. Ranking Method ▪ Learning-to-rank: given an unseen (q, p), p is ranked using a model trained over a set of labeled query/prediction pairs ▪ SVM-MAP [Yue et al., SIGIR 2007] ▪ RankSVM [Joachims, KDD 2002] ▪ SGD-SVM [Zhang, ICML 2004] ▪ PegasosSVM [Shalev-Shwartz et al., ICML 2007] ▪ PA-Perceptron [Crammer et al., J. Mach. Learn. 2006] [Kanhabua et al., SIGIR 2011]
  • 146. Relevance Judgments ▪ 42 future-related topics [Kanhabua et al., SIGIR 2011]
  • 147. Experiments ▪ New York Times Annotated Corpus ▪ 1.8 million articles, over 20 years ▪ More than 25% contain at least one prediction ▪ Annotation process uses several language processing tools ▪ OpenNLP for tokenizing, sentence splitting, part-of-speech tagging, shallow parsing ▪ SuperSense tagger for named entity recognition ▪ TARSQI for extracting temporal expressions ▪ Apache Lucene for indexing and retrieval ▪ 44,335,519 sentences and 548,491 predictions ▪ 939,455 future dates (on average 1.7 future dates per prediction) [Kanhabua et al., SIGIR 2011]
  • 148. Discussion ▪ Results: ▪ Topic features play an important role in ranking ▪ The top-5 features with the lowest weights are entity-based features ▪ Open issues: ▪ Extract predictions from other sources, e.g., Wikipedia, blogs, comments, etc. ▪ Sentiment analysis for future-related information [Kanhabua et al., SIGIR 2011]
  • 149. Time-aware Contextualization ▪ Example context: "Prior to 1964, many of the cigarette companies advertised their brand by falsely claiming that their product did not have serious health risks. A couple of examples would be "Play safe with Philip Morris" and "More doctors smoke Camels". Such claims were made both to increase the sales of their product and to combat the increasing public knowledge of smoking's negative health effects." [Figure: advertisement poster from the 1950s] [Tran et al., WSDM 2015] (Slide provided by the authors)
  • 150. Entity Linking ▪ Linked entities: Physician http://en.wikipedia.org/wiki/Physician, Camel (cigarette) http://en.wikipedia.org/wiki/Camel_(cigarette), Cigarette http://en.wikipedia.org/wiki/Cigarette ▪ Entity linking is not sufficient ▪ Wikipedia pages tend to contain large amounts of content ▪ Relevant information might be distributed over various articles ▪ The crucial temporal aspect is missing in pure linking approaches [Tran et al., WSDM 2015] (Slide provided by the authors)
  • 151. Problem Statement ▪ Time-aware contextualization aims to associate an information item d with time-aware, concise and coherent context information c for easing its understanding ▪ Several sub-goals of the information search process have to be combined with each other ▪ c has to be relevant for d ▪ c has to complement the information already available in d ▪ c has to consider the time of creation of d ▪ the context information should be concise to avoid overloading the user [Tran et al., WSDM 2015] (Slide provided by the authors)
  • 153. Query Formulation ▪ The goal is to generate a set of queries for a given document to retrieve candidates as input for the re-ranking step ▪ We explore two families of query formulation methods ▪ Document-based methods: title, lead, title+lead ▪ Hook-based methods: each_hook, all_hooks, and query performance prediction (qpp_r@k) with the following features ▪ Linguistic features ▪ Document frequency ▪ Scope ▪ Temporal document frequency ▪ Temporal scope ▪ Temporal similarity [Tran et al., WSDM 2015] (Slide provided by the authors)
  • 154. Context Ranking ▪ Context retrieval ▪ Learning to rank context: • The ranking algorithm needs to balance two goals: high topical and temporal relevance, and complementarity, i.e., providing additional information • Use supervised machine learning that takes as input a set of labeled examples and various complementarity features ▪ Topic diversity ▪ Text difference ▪ Entity difference ▪ Anchor text difference ▪ Distributional similarity ▪ Cosine distance ▪ Relevance ▪ Temporal similarity [Tran et al., WSDM 2015] (Slide provided by the authors)
  • 155. Experiments ▪ Datasets: ▪ 51 news articles from the New York Times Corpus ▪ Wikipedia (2013), 26 million contextualization units (paragraphs) ▪ 9464 manually labeled examples (article/context pairs) ▪ Learning to rank algorithms: RankBoost, Random Forests and AdaRank ▪ Baselines: ▪ Entity linking (Milne and Witten) ▪ Language model (LM) ▪ Time-aware language model (LM-T) [Tran et al., WSDM 2015] (Slide provided by the authors)
  • 156. Evaluating Query Formulation Methods [Chart: P@1, P@3, P@10 and MAP (y-axis 0 to 0.7) for title+lead, all_hooks and qpp_r@100] ▪ The Wikification technique achieves a low recall of 0.229 ▪ Hook-based approaches outperform the document-based approaches ▪ The query performance prediction method obtains the highest results on all metrics [Tran et al., WSDM 2015] (Slide provided by the authors)
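For reference, the metrics on the chart can be computed as sketched below, using the standard definitions of precision at k and mean average precision. The function names and the toy data are illustrative and unrelated to the reported results.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["c1", "c2", "c3", "c4"]   # context candidates for one article
relevant = {"c1", "c3"}             # candidates judged relevant
print(precision_at_k(ranked, relevant, 1), precision_at_k(ranked, relevant, 3))
print(mean_average_precision([(ranked, relevant), (["x", "y"], {"y"})]))
```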
  • 157. The Effect of Complementarity Features [Chart: P@1, P@3, P@10 and MAP (y-axis 0 to 1) for LM-T and RF] ▪ Purely using the time dimension in context retrieval is not sufficient in the contextualization task ▪ Complementarity plays an important role in contextualization [Tran et al., WSDM 2015] (Slide provided by the authors)
  • 158. Conclusions and Outlook ▪ Introduced the general topic of web evolution. ▪ Pinpointed a number of issues related to temporal IR. ▪ Focused on temporal information extraction, temporal query analysis, as well as time-aware retrieval and ranking. ▪ Wrapped up with related applications of temporal IR. ▪ Future directions: ▪ Real-time web mining ▪ Spatio-temporal search and analytics ▪ Brain-inspired information access
  • 159. Thank you!