Search, Exploration and Analytics of Evolving Data

Nattiya Kanhabua
L3S Research Center
Hannover, Germany
The 1st Keystone Training School on
Keyword Search over Big Data
23 July 2015, Malta

Lecturer
Education qualification
2007 - 2011: Ph.D. degree, Norwegian University of Science and Technology, Norway
Thesis: “Time-aware Approaches to Information Retrieval”
2003 - 2005: M.Sc. in Computer Science, Asian Institute of Technology, Thailand
Thesis: “Agent-based Simulation of Trade in Barter Trade Exchanges”
1997 - 2001: B.Eng. in Computer Engineering, Kasetsart University, Thailand
Project: “Software Process Enhancement and Control System”
Work experience
2011- now: Postdoc, L3S Research Center, Germany
05/2015: Visiting researcher, University of Trento, Italy
03-05/2010: Research intern, Yahoo! Research, Spain
2007 - 2011: Temporary Scientific Staff, NTNU, Norway
2006 - 2007: Research assistant, University of Trento, Italy
06-10/2006: Research assistant, AIT, Thailand
2005 - 2006: Analyst programmer, IFDS Group, UK
2002 - 2003: Research assistant, Kasetsart University, Thailand
2001 - 2002: System analyst, Accenture, Thailand & Singapore
Skills
• 7+ years of research experience in information
retrieval, data mining, machine learning, predictive
methods and spatio-temporal analysis
• 3+ years of research experience in BigData, e.g., large-
scale processing and MapReduce
 Hadoop  Pig  Mahout  HBase
 Tomcat  Servlet  Lucene  MySQL
 Python  JAVA  JSP  PHP
 Weka  R  UML  JSON
 Eclipse  NLP  RDF  WARC

Schedule
 9:00 – 10:30 Part I
 Introduction to Temporal Dynamics
 Temporal Information Extraction
 Temporal Query Analysis (I)
 10:30 – 11:00 Coffee break
 11:00 – 12:30 Part II
 Temporal Query Analysis (II)
 Time-aware Retrieval and Ranking
 Applications of Temporal IR
 Conclusions and Outlook

Additional Resource
 Book: Temporal Information Retrieval
 Foundations and Trends® in Information Retrieval
 Volume 9, Issue 2, pp 91-208, 2015
 Download: http://goo.gl/TunlBb
 References can be found in the book

Introduction to Temporal Dynamics
 What are temporal dynamics?
 Why do they occur and impact search?
 When and how to leverage temporal information for IR?

Temporal Dynamics
Figure: Internet Growth/Usage Phases/Tech Events
(created by Mark Schueler, used with permission)

Temporal Web Dynamics
 Web is changing over time in many aspects, e.g., size, content,
structure and how it is accessed by user interactions or queries.
 Size: web pages are added/deleted all the time
 Content: web pages are edited/modified
 Query: users’ information needs change
[Risvik et al., CN 2002; Ke et al., CN 2006]
[WebDyn 2010; Dumais, SIAM-SDM 2012]

Web and Index Sizes (timeline from 1995)
2000: First billion-URL index, the world’s largest; ≈5000 PCs in clusters
2004: Index grows to 4.2 billion pages
2008: Google counts 1 trillion unique URLs
2009: TBs or PBs of data/index; tens of thousands of PCs
2015 and beyond: ?

Web and Index Sizes
http://www.worldwidewebsize.com/

Content Change
 The content of the Web changes constantly over time, e.g., web
documents are added, modified or deleted continuously.
 National and international initiatives collect and preserve parts of
the Web [Gomes et al., TPDL 2011; Costa et al., TempWeb 2013]
Figure: WayBack Machine
a web archive search tool by
Internet Archive

Content Change
 Challenge:
 Document representation and retrieval

Categorization of Content Change
 Implication:
 Crawling, Indexing, Ranking

User Interaction Dynamics
 Browsing and querying (or search) behavior
 User preference, e.g., likes, comments, interests
 User’s profiles [Rybak et al., ECIR 2014]

Query Popularity Change
 Challenge:
 Time-sensitive queries
 Query understanding and processing
Google Insights for Search: http://www.google.com/insights/search/
Query: Halloween

Categorization of Web Search Queries
http://www.google.com/insights/search
 Implication:
 Query Analysis, Ranking

Temporal Information Extraction
(1) Document Creation Time
(2) Document Focus Time
(3) Entity and Event Evolution

Motivation
 Incorporating time into search can increase retrieval effectiveness
 Only when temporal information is available
 Research problem:
 How to determine the publication time of a document?
 How to extract temporal information from document contents?

Two Time Aspects
1. Publication or modified time
 Task: determining timestamps of documents
 Method: rule-based technique, or temporal language models
2. Content or focus time
 Task: temporal information extraction
 Method: natural language processing, or time and event recognition
algorithms

content time
publication time

Determining Document Creation Time
 Problem Statement: Hard to find trustworthy time for a web page
 Time gap between crawling and indexing
 Decentralization and relocation of web documents
 No standard metadata for time/date

“I found a bible-like document, but I have no idea when it was created.”
“Let me see… This document was probably written in 850 A.C., with 95% confidence.”
“For a given document with uncertain timestamp, can the contents be used to
determine the timestamp with a sufficiently high confidence?”

Current Approaches
1. Content-based
 Temporal language model [de Jong et al., AHC 2005;
Kanhabua and Nørvåg, ECDL 2008]
 Classifier using features based on text’s time expressions
[Chambers, ACL 2012; Ge et al., EMNLP 2013]
 Using burstiness of terms for estimating timestamps
[Kotsakos et al., SIGIR 2014]
2. Non content-based
 Finding the oldest version of a page in a web archive [Jatowt
et al., WIDM 2007]
 Leveraging external resources [Hauff and Azzopardi, ECIR
2005; Nunes et al., WIDM 2007; SalahEldeen and Nelson,
TempWeb 2013]

Content-based Approach
Temporal Language Models
 Based on the statistical usage of words over time
 Compare each word of a non-timestamped document with a reference corpus
 Tentative timestamp: the time partition that overlaps most in word usage
Reference corpus:
Partition  Word        Freq
1999       tsunami     1
1999       Japan       1
1999       tidal wave  1
2004       tsunami     1
2004       Thailand    1
2004       earthquake  1

A non-timestamped document: tsunami, Thailand
Similarity Scores
Score(1999) = 1
Score(2004) = 1 + 1 = 2
Most likely timestamp is 2004
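The toy example can be replayed in a few lines. This is a minimal sketch, assuming the reference corpus is stored as a mapping from time partition to word frequencies; the names `reference_corpus`, `overlap_score` and `date_document` are illustrative, and the plain overlap count only stands in for the similarity measure used in practice.

```python
from collections import Counter

# Toy reference corpus from the slide: time partition -> word frequencies.
reference_corpus = {
    1999: Counter({"tsunami": 1, "Japan": 1, "tidal wave": 1}),
    2004: Counter({"tsunami": 1, "Thailand": 1, "earthquake": 1}),
}

def overlap_score(doc_words, partition_counts):
    """Sum the partition frequencies of the words that occur in the document."""
    return sum(partition_counts[w] for w in doc_words)

def date_document(doc_words, corpus):
    """Return (best_partition, scores) for a non-timestamped document."""
    scores = {p: overlap_score(doc_words, counts) for p, counts in corpus.items()}
    return max(scores, key=scores.get), scores

best, scores = date_document(["tsunami", "Thailand"], reference_corpus)
print(scores)  # {1999: 1, 2004: 2}
print(best)    # 2004
```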

Normalized Log-likelihood Ratio
[Kraaij, SIGIR Forum 2005]
 Variant of Kullback-Leibler divergence
 Similarity of a document and time partitions
 C is the background model estimated on the corpus
 Linear interpolation smoothing to avoid the zero probability of unseen words
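A minimal sketch of scoring with the normalized log-likelihood ratio, assuming its usual form NLLR(d, p) = Σ_w P(w|d) · log(P(w|p) / P(w|C)), where d is the non-timestamped document, p a time partition and C the background model. The interpolation weight `lam` and the choice of pooling all partitions into C are assumptions made for illustration.

```python
import math
from collections import Counter

def nllr(doc_words, partition_counts, background_counts, lam=0.9):
    """Normalized log-likelihood ratio of a document against one time partition.

    lam is an assumed interpolation weight: P(w|p) is smoothed with the
    background model C so that unseen words do not get zero probability.
    """
    doc = Counter(doc_words)
    doc_len = sum(doc.values())
    part_len = sum(partition_counts.values())
    bg_len = sum(background_counts.values())
    score = 0.0
    for w, tf in doc.items():
        p_w_c = background_counts[w] / bg_len
        if p_w_c == 0.0:
            continue  # word unknown to the whole corpus: carries no evidence
        p_w_p = lam * partition_counts[w] / part_len + (1 - lam) * p_w_c
        score += (tf / doc_len) * math.log(p_w_p / p_w_c)
    return score

partitions = {
    1999: Counter({"tsunami": 1, "Japan": 1, "tidal wave": 1}),
    2004: Counter({"tsunami": 1, "Thailand": 1, "earthquake": 1}),
}
background = sum(partitions.values(), Counter())  # C: all partitions pooled
doc = ["tsunami", "Thailand"]
scores = {year: nllr(doc, counts, background) for year, counts in partitions.items()}
best = max(scores, key=scores.get)
print(best)  # 2004
```

"tsunami" occurs equally often in both partitions and therefore contributes almost nothing; the decision rests on "Thailand", which only the 2004 partition contains.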

Improving Temporal LMs
Enhancement techniques
1. Semantic-based data preprocessing
Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy
Approach: Integrate semantic-based techniques into document preprocessing
2. Search statistics to enhance similarity scores
Intuition: Search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
Approach: Linearly combine a GZ score with the normalized log-likelihood ratio
3. Temporal entropy as term weights
Intuition: A term weight depends on how good the term is for separating time partitions (discriminative)
Approach: Propose temporal entropy, based on a term selection method presented in Lochbaum and Streeter
[Kanhabua et al., ECDL 2008] (Slide provided by the authors)

Semantic-based Preprocessing
Semantic-based preprocessing: description
Part-of-speech tagging: Select only interesting classes of words, e.g., nouns, verbs, and adjectives
Collocation extraction: Co-occurrence of different words can alter the meaning, e.g., “United States”
Word sense disambiguation: Identify the correct sense of a word from context, e.g., “bank”
Concept extraction: Compare concepts instead of original words, e.g., “tsunami” and “tidal wave” have the common concept of “disaster”
Word filtering: Select the top-ranked words according to TF-IDF scores for a comparison

Leveraging Search Statistics
 P(wi) is the probability that wi occurs:
P(wi) = 1.0 if a gaining query
P(wi) = 0.5 if a declining query
 f(R) converts a rank into a weight; a higher-ranked query is more important
 ipf is an inverse partition frequency, ipf = log(N/n)

Temporal Entropy
A measure of temporal information which a word conveys.
Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document.
Tells how good a term is in separating a partition from others.
A term occurring in few partitions has higher temporal entropy compared to one appearing in many partitions.
The higher the temporal entropy of a term, the better it represents a partition.
Intuition: A term weight depends on how good the term is for separating time partitions (discriminative)
Np is the total number of partitions in a corpus
The probability of a partition p containing a term wi
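A minimal sketch of temporal entropy, assuming the definition TE(w) = 1 + (1 / log Np) · Σ_p P(p|w) · log P(p|w), with P(p|w) taken as the share of the term's occurrences that fall into partition p. This form is an assumption consistent with the properties listed here (a term confined to one partition scores 1, a term spread evenly over all partitions scores 0); the function and variable names are illustrative.

```python
import math

def temporal_entropy(term_freq_by_partition, num_partitions):
    """TE(w) = 1 + (1 / log Np) * sum_p P(p|w) * log P(p|w).

    term_freq_by_partition: frequency of the term in each partition where it occurs.
    num_partitions: Np, the total number of partitions in the corpus (must be > 1).
    """
    total = sum(term_freq_by_partition.values())
    entropy_sum = 0.0
    for tf in term_freq_by_partition.values():
        if tf > 0:
            p = tf / total  # P(p|w): share of the term's occurrences in partition p
            entropy_sum += p * math.log(p)
    return 1.0 + entropy_sum / math.log(num_partitions)

# "Thailand" occurs only in 2004; "tsunami" is spread evenly over both partitions.
te_thailand = temporal_entropy({2004: 1}, num_partitions=2)
te_tsunami = temporal_entropy({1999: 1, 2004: 1}, num_partitions=2)
print(te_thailand > te_tsunami)  # True
```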

Non Content-based Approaches
 Dating a document using its neighbors
1. Web pages linking to the document
 I.e., incoming links
2. Web pages pointed to by the document
 I.e., outgoing links
3. Media assets associated with the document
 E.g., images
 Averaging the last-modified dates of its neighbors as timestamps
[Hauff and Azzopardi, 2005; Nunes et al., WIDM 2007]
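The averaging step can be sketched as follows. This is an illustration only, assuming the last-modified dates of the neighbors (incoming links, outgoing links, media assets) have already been collected; `estimate_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def estimate_timestamp(neighbor_dates):
    """Average the last-modified dates of a document's neighbors.

    neighbor_dates: timezone-aware datetimes taken from incoming links,
    outgoing links and associated media assets (hypothetical input;
    gathering them is outside this sketch).
    """
    if not neighbor_dates:
        return None  # no neighbors with a usable date
    mean_epoch = sum(d.timestamp() for d in neighbor_dates) / len(neighbor_dates)
    return datetime.fromtimestamp(mean_epoch, tz=timezone.utc)

neighbors = [
    datetime(2006, 1, 1, tzinfo=timezone.utc),
    datetime(2006, 1, 3, tzinfo=timezone.utc),
    datetime(2006, 1, 5, tzinfo=timezone.utc),
]
estimate = estimate_timestamp(neighbors)
print(estimate.date())  # 2006-01-03
```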

Non Content-based Approaches
 Drawbacks:
 Rely on the availability and accuracy of other information
 Cover only pages from most recent years
 Cannot determine the age of the actual contents
[SalahEldeen and Nelson, 2013]

Determining Document Focus Time
 Three types of temporal expressions
1. Explicit: time mentions being mapped directly to a time point or
interval, e.g., “July 4, 2012”
2. Implicit: imprecise time point or interval, e.g., “Independence Day
2012”
3. Relative: resolved to a time point or interval using other types or
the publication date, e.g., “next month”
 Time and event recognition [Mani and Wilson, ACL 2000]
 A mix of hand-crafted and machine-learnt rules
 Ranking the most relevant temporal expressions [Strötgen et al.,
TempWeb 2012]

Time Taggers for Calculating Focus Time
HeidelTime: http://heideltime.ifi.uni-heidelberg.de/heideltime
Timestamp: 2013/7/15
[Jatowt et al., CIKM 2013] (Slide provided by the authors)

Limitations
 Document may lack any temporal expressions
 Temporal expressions may be weakly related to document’s theme
 Temporal taggers are not perfect
Estimating document focus time without using temporal expressions

Focus Time of Documents
 Def. A document has focus time t if its content refers to t

Estimating Focus Time: Concept
 Use time-referenced documents for estimating focus time of
target document
Figure: news article collections (documents mentioning terms A, B, C together with dates such as 1935, May 2011, 2012, 1978, 1915, 1948, 2003) combined with the target document (terms A, B, C) yield the target document focus time

Word Graph
 Word co-occurrence graph from large collections of news articles
 Link weight estimated by Jaccard coefficient using sentence as unit
Figure: example graph with nodes war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima
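A minimal sketch of building such a graph, assuming whitespace tokenization and treating year mentions as ordinary nodes; `build_word_graph` is an illustrative name and the three sentences are made-up input.

```python
from itertools import combinations

def build_word_graph(sentences):
    """Return {(w1, w2): Jaccard weight}, using the sentence as co-occurrence unit."""
    occurrences = {}  # word -> set of sentence ids containing it
    for sid, sentence in enumerate(sentences):
        for word in set(sentence.lower().split()):
            occurrences.setdefault(word, set()).add(sid)
    graph = {}
    for w1, w2 in combinations(sorted(occurrences), 2):
        shared = occurrences[w1] & occurrences[w2]
        if shared:
            union = occurrences[w1] | occurrences[w2]
            graph[(w1, w2)] = len(shared) / len(union)
    return graph

sentences = [
    "war ended 1945",
    "hiroshima 1945",
    "war began 1939",
]
graph = build_word_graph(sentences)
print(graph[("1945", "war")])        # 1/3: one shared sentence out of three
print(graph[("1945", "hiroshima")])  # 0.5
```

Edges are keyed by the alphabetically sorted word pair, so each undirected link is stored once.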

Estimating Direct Word-Year Association
 Word-year associations derived from graph
 Word w is strongly associated with year y if it frequently co-occurs with y
 One association value per word and year, from 1900 to 2010:
A(war, 1900), A(war, 1901), …, A(war, 1944), A(war, 1945), …, A(war, 2009), A(war, 2010)
A(hiroshima, 1900), A(hiroshima, 1901), …, A(hiroshima, 1944), A(hiroshima, 1945), …, A(hiroshima, 2009), A(hiroshima, 2010)
A(word, 1900), A(word, 1901), …, A(word, 1944), A(word, 1945), …, A(word, 2009), A(word, 2010)

Second Level Term-Year Association
Word w is strongly associated with year y if many other words that
frequently co-occur with w are also strongly associated with y
$$A_{2}(w_i, y) = \frac{1}{V} \sum_{j=1}^{V} A(w_i, w_j)\, A_{dir}(w_j, y)$$
Figure: word graph with nodes war, nazi, 1945, 1939, auschwitz, jews, germany, jalta, hiroshima, israel
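A minimal sketch of the second-level association, assuming it averages, over a vocabulary of V words, the product of the word-word link weight A(w_i, w_j) and the direct word-year association A_dir(w_j, y). All numeric values below are hypothetical.

```python
def second_level_association(word, year, cooccurrence, direct, vocabulary):
    """A2(w_i, y) = (1/V) * sum_j A(w_i, w_j) * A_dir(w_j, y).

    cooccurrence[(w_i, w_j)]: link weight between two words (e.g. Jaccard).
    direct[(w_j, y)]: direct association between a word and a year.
    Missing pairs count as zero.
    """
    total = sum(
        cooccurrence.get((word, other), 0.0) * direct.get((other, year), 0.0)
        for other in vocabulary
    )
    return total / len(vocabulary)

# Hypothetical values: "israel" rarely co-occurs with 1945 directly, but its
# neighbors "war" and "nazi" are strongly associated with that year.
vocabulary = ["war", "nazi", "hiroshima"]
cooccurrence = {("israel", "war"): 0.4, ("israel", "nazi"): 0.2}
direct = {("war", 1945): 0.5, ("nazi", 1945): 0.5, ("hiroshima", 1945): 1.0}
a2 = second_level_association("israel", 1945, cooccurrence, direct, vocabulary)
```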

Estimating Document-Year Association
If a document contains many words strongly associated with year y,
the document is strongly associated with y
Figure: A(word, year) curves over time (1900 to 2000) for words A, B and C; for a document containing the words A, B, C, B, C, the document-year association is A + 2B + 2C
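The "A + 2B + 2C" aggregation can be sketched as a count-weighted sum of word-year associations. The association values are hypothetical and `document_year_association` is an illustrative name.

```python
from collections import Counter

def document_year_association(doc_words, word_year_assoc, years):
    """For each year, sum A(word, year) over all word occurrences in the document."""
    counts = Counter(doc_words)
    return {
        y: sum(n * word_year_assoc.get((w, y), 0.0) for w, n in counts.items())
        for y in years
    }

# Hypothetical association values for three words and two years.
assoc = {
    ("A", 1940): 0.5, ("B", 1940): 0.25, ("C", 1940): 0.25,
    ("A", 1980): 0.0, ("B", 1980): 0.5,
}
doc = ["A", "B", "C", "B", "C"]  # A once, B and C twice -> A + 2B + 2C
profile = document_year_association(doc, assoc, [1940, 1980])
print(profile)  # {1940: 1.5, 1980: 1.0}
```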

Finding Discriminative Features
 Not every word is useful for estimating text focus time
 E.g., “man”, “city” have stable associations with years
 Temporal entropy – measure of variability of word associations
 Temporal kurtosis – measure of peakedness of word associations
 E.g., “war”, “earthquake” vs. “hitler”, “stalingrad”
Figure: two plots of A(word, year) over 1900 to 2000 for words A and B, illustrating Temporal_Entropy(A) < Temporal_Entropy(B) and Temporal_Kurtosis(A) > Temporal_Kurtosis(B)
Temporal entropy and temporal kurtosis are used as temporal weights for words

Importance of Words in Document
 Words weakly related to document theme should be skipped
Target Document:
President Obama took part in the celebrations of the Polish Independence Day. The US president met main Polish politicians in Warsaw.
Poland regained independence at the end of the World War I following Bolshevik Revolution. It then lost the independence as a result of Nazi and Soviet invasions led by Hitler and Stalin.
Poland is located in East Europe.
Document to graph conversion (nodes: independence, poland, war, hitler, …), then TextRank [Mihalcea and Tarau, EMNLP 2004]:
0.90 independence
0.82 poland
0.74 war
0.61 nazi
0.56 hitler
0.54 …
TextRank scores used as discriminatory semantic weights for words

Estimating Focus Time
Figure: A(word, year) curves (1900 to 2000) for the words A, B and C of a document, combined by a weighted sum (temporality and semantics)
 Focus time: Interval based (threshold)
 Focus time: Instant based

Combined Approach
 Combining estimated focus time and temporal expressions in text
 Representing dates on timeline - Gaussian Kernel Density Estimate
 Mixture of Gaussian distributions with means centered on extracted dates
 S_Comb(d, y) is obtained by combining S_Est(d, y) and S_TempExp(d, y)
Figure: target document with the extracted dates 1935, 2011, 1932, 1940, 1932 and 2001 placed on a timeline (1932, 1935, 1940, 2001, 2011)

Experimental Settings: Word Graphs
 News articles collected from Google News Archive using country names as queries
 Germany (87k), UK (149k), France (110k), Japan (97k), Israel (92k)
 Published within [1990, 2010]
 Dates falling in [1900, 2013] were found using regular expressions

Experimental Settings: Test Datasets
 Datasets on events related to countries:
 Wiki: 250 Wikipedia pages about events
 Books: 735 paragraphs from 2 text books about history (timelines)
 Web: 812 paragraphs from web pages on history (BBC timelines,
etc.)
Dataset  Total #doc  Avg. #sent  Avg. time span of events  Avg. year of events  Avg. #dates
Wiki     250         179         3.4 years                 1958                 14.5
Book     735         43          4.4 years                 1982                 4.5
Web      819         18.3        1.3 years                 1957                 2.4

Experimental Settings: Baselines
 Baselines:
 Random
 Date-based (using only dates in document text)
 LDA-based
1. 100 topics over sentences containing year mentions
2. Finding topic distribution of each year
3. Calculating document-year association based on topic distribution
of documents

Experimental Settings: Measurements
 Measures:
 Average error (in years), for the instant-based representation
 Pearson Correlation Coefficient (-1..+1) between ground truth years and
years in focus time, for the interval-based representation
Figure: ground truth vs. estimated focus time for both measures

Experimental Results
Average error (years):
Dataset  Random baseline  LDA baseline  Date-based baseline  Proposed (no dates)  Proposed combined (with dates)
Wiki     36.5             27.2          3                    18.3                 2.83
Books    39.3             37.3          48.1                 23.5                 20.4
Web      40.5             41.4          53.4                 23.6                 20.7
Pearson Correlation Coefficient:
Dataset  Random baseline  LDA baseline  Date-based baseline  Proposed (no dates)  Proposed combined (with dates)
Wiki     0                0.1           0.65                 0.29                 0.66
Books    0                0.04          0.01                 0.25                 0.30
Web      0                0.02          -0.03                0.26                 0.41

Effect of Time Distance on Focus Time
 How well can we estimate focus time of documents about the
distant past?
Figure: panels for Wiki, Books and Web (instant-based focus time representation)

Question?

Temporal Query Analysis
(1) Temporal query intent
(2) Dynamic query subtopics

Temporal Queries
 Temporal information needs
 Searching temporal document collections
 E.g., digital libraries, web/news archives
 Users: historians, librarians, journalists or students
 Temporal queries exist in both standard collections and the Web
 Relevancy is dependent on time
 Documents are about events at particular time

Types of Temporal Queries
 Two types of temporal queries
1. Explicit: time is provided, e.g., "Presidential election 2012"
2. Implicit: time is not provided, e.g., "Germany World Cup"
 Temporal intent can be implicitly inferred
 I.e., refers to the World Cup event in 2006
 Studies of web search query logs show a significant fraction
of temporal queries
 1.5% of web queries are explicit
 ~7% of web queries are implicit
 13.8% of queries contain explicit time and 17.1% of queries have
temporal intent implicitly provided
[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]

Figure: Variances of
temporal queries and
their dynamics

Understanding Temporal Query Intent
 Current approaches:
1. Mining temporal patterns in query logs
2. Analyzing top-k search results
[Vlachos et al., SIGMOD 2004; Radinsky et al., WWW 2012]
[Jones and Diaz, TOIS 2007; Campos et al., CIKM 2012]

Motivation
 Temporal queries are a significant fraction of Web
search queries [Zhang et al., EMNLP 2010]
 13.8% of explicit temporal queries
 17.1% of implicit temporal queries
 Characteristics:
 Certain temporal patterns, i.e., spikes, periodicity
(hourly or daily), seasonality and trends
 Underlying temporal information needs without
temporal patterns observed
 Tasks:
 Understand temporal search intent
 Enable advanced enhancement techniques
 Automatic method for detecting events in search streams
Figure: example events, US Election 2016 and Brazil FIFA World Cup
[Kanhabua et al., TempWeb 2015] (Slide provided by the authors)

Preliminaries
 Data model:
 Set of queries Q issued at different time points
 Set of clicked URLs U and click-through data
 Temporal document collection D
 q: keywords or term(q), and hitting time(q)
 yq: time series data extracted from Q, U and D
 Two-step approach:
 Automatically extract a set of candidate queries {q1, ..., qn} from Q
 Classify candidates as event-related queries {e1, ..., em} using
machine learning techniques

Identifying Event Candidates
 Time and keyword-based clustering:
Step1: Partition query logs into one-week partitions
• Group queries from the same event
• Possibly contain multiple, unrelated events
Step2: Cluster queries by lexical similarity
• Pre-process and sort queries alphabetically
• Compute Jaccard similarity of a query pair
Easter - easter 2006, easter 2007, easter 20crafts,
easter activities, easter animation, easter animations,
easter background, easter basket, easter bread,
easter bucket, easter bunny, easter bunny decorations,
easter bunny lights
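Step 2 can be sketched as a greedy pass over the alphabetically sorted queries. This is a simplified illustration: comparing each query only with its predecessor, the term-level Jaccard similarity and the threshold of 0.2 are assumptions, and `cluster_queries` is a hypothetical name.

```python
def jaccard(a, b):
    """Jaccard similarity of two queries, compared as sets of terms."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def cluster_queries(queries, threshold=0.2):
    """Greedy clustering: walk the alphabetically sorted queries and start a
    new cluster whenever a query is not similar enough to its predecessor."""
    clusters = []
    previous = None
    for query in sorted({q.lower().strip() for q in queries}):
        if previous is not None and jaccard(previous, query) > threshold:
            clusters[-1].append(query)
        else:
            clusters.append([query])
        previous = query
    return clusters

week = ["easter bunny", "easter basket", "Easter 2006", "world cup", "world cup tickets"]
clusters = cluster_queries(week)
print(clusters)
# [['easter 2006', 'easter basket', 'easter bunny'], ['world cup', 'world cup tickets']]
```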

Event-related Query Classification
 Classify a query as event-related or not:
 Periodic and seasonal events
 Popular and trending events
 Sporadic (rare) and unseen events
 General time-sensitive queries
 Underlying temporal information needs
 Features:
 Time-series features, e.g., seasonality or trends
 Popularity-based features, e.g., click-through and burstiness
 Statistical features, e.g., probability distribution of results,
temporal KL-divergence and skewness (kurtosis)

Seasonality
 Detect seasonal queries [Shokouhi, SIGIR 2011]
 E.g., annual events such as US Open and Easter,
or a 4-year recurring event such as FIFA World Cup
 Method: time-series decomposition using Holt-Winters
adaptive exponential smoothing
 Input: time-series data extracted from external
document collections, YD
 Compute a cosine similarity as seasonality
 Y is the original time-series data
 S is the seasonality component
Figure: time series for the queries "Easter" and "World cup"

Autocorrelation
 Detect trending events by their predictability
 Cross correlation with itself or between its
past and future values at different time lags
 The stronger the inter-day dependencies, the
higher the value for autocorrelation
 With lag = 1, the 2nd time series is shifted by
one day; this is called 1st-order autocorrelation
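A minimal sketch of the autocorrelation feature, assuming the standard sample estimate (lagged covariance divided by the variance of the series); the two daily query-count series are made up.

```python
def autocorrelation(series, lag=1):
    """Autocorrelation of a time series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0 or lag >= n:
        return 0.0
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance

trending = [1, 2, 3, 4, 5, 6, 7, 8]     # steadily growing daily query counts
alternating = [1, 5, 1, 5, 1, 5, 1, 5]  # no day-to-day persistence
r_trend = autocorrelation(trending)
r_alt = autocorrelation(alternating)
print(r_trend, r_alt)  # 0.625 -0.875
```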

Temporal KL-divergence
 Analyze a temporal distribution in a result set
 Measure the difference between the distribution over time
of top-k documents of q and the document collection C
 P(t|q) is the probability of generating a publication date t
given q
 P(t|C) is the probability of a publication date t in the
collection
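A minimal sketch of temporal KL-divergence, assuming the usual form Σ_t P(t|q) · log(P(t|q) / P(t|C)) with maximum-likelihood estimates from publication years. Smoothing is omitted, so every date in the result set is assumed to occur in the collection.

```python
import math
from collections import Counter

def temporal_kl(result_dates, collection_dates):
    """KL divergence between P(t|q) (top-k results) and P(t|C) (whole collection).

    Both arguments are lists of publication dates, e.g. years. Dates of the
    results are assumed to occur in the collection as well, so P(t|C) > 0.
    """
    p_q = Counter(result_dates)
    p_c = Counter(collection_dates)
    n_q, n_c = len(result_dates), len(collection_dates)
    return sum(
        (count / n_q) * math.log((count / n_q) / (p_c[t] / n_c))
        for t, count in p_q.items()
    )

collection = [2004, 2005, 2006, 2007] * 25     # uniform over four years
event_query = [2006] * 9 + [2005]              # results peak in 2006
generic_query = [2004, 2005, 2006, 2007] * 2   # results follow the collection
kl_event = temporal_kl(event_query, collection)
kl_generic = temporal_kl(generic_query, collection)
print(kl_event > kl_generic)  # True
```

A query whose results concentrate on a few dates diverges strongly from the collection, which is what makes the feature useful for spotting event-related queries.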

Surprise Score
 Detect unseen events or surprisingly popular
queries [Radinsky et al., WWW 2012]
 Assume an unplanned event happening when there is
a significant prediction error
 Compute the sum of squared errors of prediction
(SSE) using a simple linear regression model
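A minimal sketch of a surprise score, assuming a least-squares line is fitted to a history window of query frequencies and the SSE is taken over the following observations; the window layout and the function names are illustrative.

```python
def fit_line(ys):
    """Least-squares line y = a + b*x through (0, ys[0]), (1, ys[1]), ...

    Needs at least two points.
    """
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def surprise_score(history, observed):
    """SSE between observed values and the linear-regression forecast."""
    intercept, slope = fit_line(history)
    start = len(history)
    return sum(
        (y - (intercept + slope * (start + i))) ** 2 for i, y in enumerate(observed)
    )

history = [10, 12, 14, 16, 18]                   # query frequency per day
expected = surprise_score(history, [20, 22])     # follows the trend
surprising = surprise_score(history, [20, 60])   # sudden spike on the second day
print(expected, surprising)  # 0.0 1444.0
```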

Experiments
 Query logs:
• Two datasets, i.e., AOL and MSN
• AOL: 30M queries March 1 - May 31, 2006
• MSN: 15M queries from May 2006
 Temporal collection:
• The New York Times Annotated Corpus
• 1.8M documents from 1987 - 2007
 Setting:
• HeidelTime for time extraction and OpenNLP for entity extraction
• Cleansing-step parameters: Jaccard similarity threshold>0.2; edit
distance<3; overlap n-gram=2
• For burstiness features, default parameters for the burst detection
technique provided by CISHELL
In total, 837 event-related queries

Experimental Results (I)
 Feature selection:
• Study high-impact (best) features
• Investigate their importance independent
from classification algorithms
• InfoGainAttributeEval method in WEKA
 Main findings:
• Discriminative features are mostly derived
from D and Q
• TemporalKL and kurtosis are among
influential features
• Trend-based features, such as,
autocorrelation, burst weight, and trending
level, play an important role
• Seasonality computed from Q has less
impact than the one extracted from D

Experimental Results (II)
 Query classification:
• Several classifiers, i.e., support vector
machine (SVM), AdaBoost, decision tree
(J48), and neural network (NN)
• Metrics: accuracy, precision, recall, F-
measure using 10-fold cross validation
 Main findings:
• J48 is the best performing algorithm
• TemporalKL achieves accuracy of 84%
• Adding autocorrelation, kurtosis, and
seasonality increases the performance
• However, the performance drops after
adding max. query frequency, and so on

Analyzing Top-k Search Results
 Using temporal language models
 Determine time of queries when no time is given explicitly
 Re-rank search results using the determined time
 Exploiting time from search snippets
 Extract temporal expressions (i.e., years) from the contents of top-k
retrieved web snippets for a given query
 Content-based language-independent approach
[Kanhabua and Nørvåg, ECDL 2010; Campos et al., CIKM 2012]

Determining Time of Queries
 Approach I. Dating using keywords*
 Approach II. Dating using top-k documents*
 Queries are short keywords
 Inspired by pseudo-relevance feedback
 Approach III. Using timestamp of top-k documents
 No temporal language models are used
*Using Temporal Language Models proposed by de Jong et al.
[Kanhabua and Nørvåg, ECDL 2010] (Slide provided by the authors)

I. Dating using Keywords
Figure: query’s temporal profiles

II. Dating using Top-k Documents
Figure: query’s temporal profiles

III. Using Timestamp of Documents
Figure: query’s temporal profiles

Re-ranking Search Results
 Intuition: documents published close to the time of queries are
more relevant
 Assign document priors based on publication dates
Figure: a query against a news archive; the time of the query is determined (2005, 2004, 2006, ...) and the initial retrieved results (D2009) are re-ranked (D2005)

Change of Query Subtopics over Time
query: ncaa
14/03/2006: march madness began
18/03/2006: ncaa women tournament began
01/04/2006: final four began
[Nguyen and Kanhabua, ECIR 2014]

Mining Temporal Anchor Texts
 Anchor texts are complementary descriptions
for target pages, widely used to improve search
 Characteristics:
 Short summary (a few words) of target pages
 Collective wisdom of people other than authors
 Similar behavior to real-world queries and titles
 Capturing aboutness or what a document is about
 Main ideas:
 Temporal anchor texts mined from the edit history of
Wikipedia as a hook for tracking entity evolution
 Large-scale analysis and a more robust discovery of
evolving information using limited resources

1. Partition Wikipedia revisions using
the one-month granularity
2. For each Wikipedia snapshot, identify
named entity articles/pages
3. Extract anchor texts from all articles
linking to an entity page
4. Rank aggregated entity-anchor
relationships at a particular time t
[Kanhabua and Nørvåg, JCDL 2010]

Figure: example entity pages and anchor texts over time: President_of_the_United_States, President Bush (43), George W. Bush, Barack Obama (snapshot times 11/2004 and 10/2005)

Recognizing Named Entity Articles
1. Multi-word titles with all words capitalized,
except prepositions, determiners, etc.
E.g., President_of_the_United_States => entity
2. Single-word titles with multiple capital
letters
E.g., UNICEF and WHO => entities
3. At least 75% of the title's occurrences in the article text itself
are capitalized (not counting the beginning of a sentence)
[Bunescu and Pasca, EACL 2006]
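The three heuristics can be sketched as a single predicate. The list of words allowed to stay lowercase and the regular expression used to skip sentence-initial occurrences are simplifying assumptions, and `is_named_entity` is an illustrative name.

```python
import re

# Words allowed to stay lowercase in a multi-word title (illustrative list).
LOWERCASE_OK = {"of", "the", "in", "on", "for", "and", "a", "an", "to", "at", "by"}

def is_named_entity(title, article_text="", min_capitalized=0.75):
    """Apply the three title heuristics in order."""
    words = title.replace("_", " ").split()
    if len(words) > 1:
        # 1. Multi-word title: every word capitalized, except function words.
        return all(w[0].isupper() or w.lower() in LOWERCASE_OK for w in words)
    # 2. Single-word title with multiple capital letters.
    if sum(c.isupper() for c in title) > 1:
        return True
    # 3. Share of capitalized occurrences in the article text, skipping
    #    occurrences at the start of the text or right after ". ", "! ", "? ".
    pattern = r"(?<![.!?]\s)(?<!^)\b" + re.escape(title) + r"\b"
    hits = re.findall(pattern, article_text, flags=re.IGNORECASE)
    if not hits:
        return False
    return sum(h[0].isupper() for h in hits) / len(hits) >= min_capitalized

print(is_named_entity("President_of_the_United_States"))  # True
print(is_named_entity("UNICEF"))                          # True
print(is_named_entity("Tsunami", "The tsunami hit. A tsunami warning was issued."))  # False
```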

Temporal Anchor Weighting
Weight anchor texts by importance with respect
to a target entity at a particular time:
• Link-independent: inlink pages are independent and
equally important to the target page
• Computed based on the whole collection of Wikipedia
entity pages at a particular time t
• Two variants: 1) article links, and 2) distinct pages
[Dou et al., SIGIR 2009]

Experiments
 Data collection:
• A dump of English Wikipedia edit history (2.8 TB)
• All pages and revisions from 03/2001 to 03/2008
• 85 snapshots + 4 additional snapshots
(24/05/2008, 27/07/2008, 08/10/2008, 06/03/2009)
 Tools:
• Preprocess/store revisions using MWDumper
http://www.mediawiki.org/wiki/Mwdumper
• Store anchor texts: MySQL databases

Top-100 Named Entities

Evolving Context
“Barack Obama”: top-ranked anchor texts per snapshot, in timeline order (05/2008, 07/2008, 10/2008, 03/2009)
05/2008:
1. Senator Barack Obama
2. Senator Obama's legislative accomplishments
3. Illinois
4. U.S. Sen. Barack Obama
07/2008:
2. Illinois Senator Barack Obama
3. Barack Hussein Obama II
4. Senator Obama's legislative accomplishments
10/2008:
3. Barak Obama, U.S. Senator, Illinois, 2008 Democratic nominee for U.S. President
4. presidential candidacy announcement
03/2009:
1. President Barack Obama
3. U.S. President Barack Obama
4. 44th President of the United States
5. Obama Administration

Main Findings
Evolving information & context
• Role changes for political entities
• Geographic name changes for locations
• Trends or things in vogue for celebrities
• Products in demand for technology

Question?

Time-aware Retrieval and Ranking
(1) Recency-based Ranking
(2) Time-dependent Ranking

RECAP
 Two time dimensions
1. Publication or modified time
2. Content or focus time

Searching the past
 Historical or temporal information needs
 A journalist working on the historical story of a particular news article
 A Wikipedia contributor finding relevant information that has not
been written about yet
 “Temporal document collections”: web archives, news archives, blogs, emails
 Example: retrieve documents about Pope Benedict XVI written before 2005
 Term-based IR approaches may give unsatisfactory results

Temporal Query Examples
 A temporal query consists of:
 Query keywords
 Temporal expressions
 A document consists of:
 Terms, i.e., bag-of-words
 Publication time and temporal expressions

[Berberich et al., ECIR 2010]

Recency-based Ranking
 Assign prior probabilities using an exponential function
 E.g., a more recent creation date obtains a higher probability
 Current approaches:
 Time-based language model [Li and Croft, CIKM 2003]
 Using retention functions [Peetz and de Rijke, ECIR 2013]
 Incorporating freshness into web authority [Dai and Davison,
SIGIR 2010]
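A minimal sketch of recency-based ranking with an exponential document prior of the form rate · exp(-rate · age), in the spirit of the time-based language model. The rate value, the day-based time unit and the made-up query likelihoods are assumptions for illustration.

```python
import math

def recency_prior(doc_age_days, rate=0.01):
    """Exponential prior: rate * exp(-rate * age). `rate` is an assumed value."""
    return rate * math.exp(-rate * doc_age_days)

def rank_with_recency(results, now_day, rate=0.01):
    """Re-rank (doc_id, query_likelihood, publication_day) by likelihood * prior."""
    scored = [
        (doc_id, likelihood * recency_prior(now_day - pub_day, rate))
        for doc_id, likelihood, pub_day in results
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

results = [("old", 0.30, 0), ("recent", 0.25, 300)]
ranking = rank_with_recency(results, now_day=365)
print([doc_id for doc_id, _ in ranking])  # ['recent', 'old']
```

The older document has the higher query likelihood, but the prior moves the more recent one to the top.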

 Time must be explicitly modeled in order to increase the
effectiveness of ranking
 To order search results so that the most relevant ones come first
 Time uncertainty should be taken into account
 Two temporal expressions can refer to the same time period even
though they are not equally written
 E.g. the query “Independence Day 2011”
 A retrieval model relying on term-matching only will fail to
retrieve documents mentioning “July 4, 2011”
Time-dependent Ranking
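
The time uncertainty point can be illustrated with a small sketch: two differently written temporal expressions are normalized to date intervals and then compared. The lookup table is hypothetical (a real system would use a temporal tagger), and it assumes the query refers to the U.S. Independence Day.

```python
from datetime import date

# Hypothetical normalization table mapping expressions to (start, end) intervals.
NORMALIZED = {
    "independence day 2011": (date(2011, 7, 4), date(2011, 7, 4)),
    "july 4, 2011": (date(2011, 7, 4), date(2011, 7, 4)),
    "july 2011": (date(2011, 7, 1), date(2011, 7, 31)),
}

def normalize(expression):
    """Return the (start, end) interval for a known temporal expression."""
    return NORMALIZED[expression.lower()]

def overlaps(a, b):
    """True if two closed date intervals share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]

query_time = normalize("Independence Day 2011")
doc_time = normalize("July 4, 2011")
# The strings share no terms, yet they denote the same day.
print(overlaps(query_time, doc_time))
```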

Time-dependent Ranking
 Two main approaches:
1. Mixture model [Kanhabua et al., ECDL 2010]
 Linearly combining textual and temporal similarity
2. Probabilistic model [Berberich et al., ECIR 2010]
 Generating a query from the textual part and temporal part
of a document independently

Mixture Model
 Linearly combine textual and temporal similarity
 α indicates the importance of similarity scores
 Both scores are normalized before combining
 Textual similarity can be determined using any term-based
retrieval model
 E.g., tf.idf or a unigram language model
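
A minimal sketch of the linear combination, assuming the form (1 - α) · textual + α · temporal and min-max normalization. Both choices, as well as the function names and toy scores, are assumptions made for illustration; the slide does not spell out the formula.

```python
def min_max(scores):
    """Scale a dict of scores to [0, 1]; all-equal scores map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def mixture_scores(text_scores, time_scores, alpha=0.5):
    """Combine normalized textual and temporal scores per document."""
    text_n, time_n = min_max(text_scores), min_max(time_scores)
    return {d: (1 - alpha) * text_n[d] + alpha * time_n[d] for d in text_scores}

# Toy scores for three documents (hypothetical values).
text_scores = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
time_scores = {"d1": 0.1, "d2": 0.9, "d3": 0.5}
combined = mixture_scores(text_scores, time_scores, alpha=0.5)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

With α = 0 the ranking falls back to the purely textual order, which is a convenient sanity check when tuning α.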

How to determine temporal similarity?

Temporal Similarity
[Figure: similarity score plotted against time; documents d1 and d2 lie at distances Dist(d1, q) and Dist(d2, q) from the query q]
[Kanhabua et al., ECDL 2010]

Temporal Similarity
 Assume that temporal expressions in the query are generated
independently from a two-step generative model:
 P(tq|td) can be estimated based on publication time using an
exponential decay function [Kanhabua et al., ECDL 2010]
 Linear interpolation smoothing is applied to eliminate zero
probabilities
 I.e., an unseen temporal expression tq in d
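
A rough sketch of how such a score could be computed. It assumes that a document time is drawn uniformly and the query time is then generated from it, that the decay has the form decay_rate ** (lam * |tq - td| / mu), and that smoothing interpolates with a constant background probability. All parameter values and names are placeholders rather than settings from the cited papers.

```python
def decay_probability(t_q, t_d, decay_rate=0.5, lam=0.5, mu=2.0):
    """Assumed exponential decay over the time distance |t_q - t_d| (in years)."""
    return decay_rate ** (lam * abs(t_q - t_d) / mu)

def temporal_similarity(query_times, doc_times, background=0.01, gamma=0.9):
    """Product over query times of a smoothed P(t_q | d_time).

    background is an assumed collection-level probability and gamma an
    assumed interpolation weight.
    """
    score = 1.0
    for t_q in query_times:
        if doc_times:
            # Step 1: draw t_d uniformly from the document; step 2: generate t_q from t_d.
            p_doc = sum(decay_probability(t_q, t_d) for t_d in doc_times) / len(doc_times)
        else:
            p_doc = 0.0  # the document carries no time information
        # Linear interpolation keeps the score above zero.
        score *= gamma * p_doc + (1.0 - gamma) * background
    return score

near = temporal_similarity([2011], [2010])
far = temporal_similarity([2011], [1995])
empty = temporal_similarity([2011], [])
print(near > far > empty > 0.0)
```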

Comparison of time-aware ranking
Five time-aware ranking models
 LMT [Berberich et al., ECIR 2010]
 LMTU [Berberich et al., ECIR 2010]
 TS [Kanhabua et al., ECDL 2010]
 TSU [Kanhabua et al., ECDL 2010]
 FuzzySet [Kalczynski et al., Inf. Process. 2005]
[Kanhabua et al., SIGIR 2011]

 Experiment:
 New York Times Annotated Corpus
 40 temporal queries [Berberich et al., ECIR 2010]
 Result:
 TSU outperforms other methods significantly for most metrics
 Conclusions:
 Although TSU achieves the best performance, it can only be applied to a
collection with time metadata
 LMT and LMTU can be applied to any collection without time metadata,
but time extraction is needed
Discussion

Applications for Temporal IR
(1) Searching the Future
(2) Time-aware Recontextualization

Searching the Future
 People are naturally curious about the future
 What will happen to EU economies in the next 5 years?
 What will be the potential effects of climate change?

Previous work
 Searching the future
 Extract temporal expressions from news articles
 Retrieve future information using a probabilistic model, i.e.,
multiplying textual similarity and a time confidence
 Supporting analysis of future-related information in news and
Web
 Extract future mentions from news snippets obtained from search
engines
 Summarize and aggregate results using clustering methods, but no
ranking
[Baeza-Yates SIGIR Forum 2005; Jatowt et al., JCDL 2009]

Recorded Future
http://www.recordedfuture.com/

Yahoo! Time Explorer
[Matthews et al., HCIR 2010]

Ranking News Predictions
 Over 32% of 2.5M documents from Yahoo! News (July’09 –
July’10) contain at least one prediction
 Retrieve predictions related to a news story in news archives and
rank by relevance

Related News Predictions
[Kanhabua et al., SIGIR 2011]

 Four classes of features
 Term similarity, entity-based similarity, topic similarity and temporal
similarity
 Rank results using a learning-to-rank technique
Approach
[Kanhabua et al., SIGIR 2011]

Step 1: Document annotation.
 Extract temporal expressions
using time and event recognition
 Normalize them to dates so they
can be anchored on a timeline
 Output: sentences annotated
with named entities and dates,
i.e., predictions
Step 2: Retrieving predictions.
 Automatically generate a query
from a news article being read
 Retrieve predictions that match
the query
 Rank predictions by relevance
(i.e., a prediction is “relevant” if it
is about the topics of the article)
System Architecture

 Capture the term similarity between q and p
1. TF-IDF scoring function
 Problem: keyword matching, short texts
 Predictions may not match the query terms
2. Field-aware ranking function, e.g., BM25F
 Search the context of a prediction, i.e., surrounding sentences
Term Similarity
[Kanhabua et al., SIGIR 2011]
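
The short-text problem and the benefit of searching the surrounding context can be sketched with a field-weighted term overlap. This is a deliberately simplified stand-in, much simpler than BM25F; the field names, weights and example sentences are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def field_weighted_overlap(query, fields, weights):
    """Sum of weighted query-term occurrences over the named fields."""
    query_terms = set(tokenize(query))
    score = 0.0
    for name, text in fields.items():
        matches = sum(1 for tok in tokenize(text) if tok in query_terms)
        score += weights.get(name, 0.0) * matches
    return score

query = "carbon emissions climate"
prediction = {
    "sentence": "Levels are expected to double by 2050.",
    "context": "Scientists measured carbon emissions. The climate report followed.",
}
# The prediction sentence alone shares no terms with the query.
sentence_only = field_weighted_overlap(query, prediction, {"sentence": 1.0})
# Including the surrounding sentences as a second field recovers the match.
with_context = field_weighted_overlap(query, prediction, {"sentence": 1.0, "context": 0.5})
print(sentence_only, with_context)
```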

 Measure the similarity between q
and p using annotated entities in
dp, p, q
 Features commonly employed in
entity ranking
Entity-based Similarity

 Compute the similarity between q and p on topic
 Latent Dirichlet allocation [Blei et al., J. Mach. Learn. 2003] for
modeling topics
1. Train a topic model
2. Infer topics
3. Compute topic similarity
Topic Similarity
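
A minimal sketch of step 3 only, assuming topic distributions for q and p have already been inferred from a trained topic model. Cosine similarity is one possible choice of measure, and the three-topic vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length topic distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical inferred distributions over three topics.
query_topics = [0.7, 0.2, 0.1]
prediction_a = [0.6, 0.3, 0.1]
prediction_b = [0.1, 0.1, 0.8]

sim_a = cosine(query_topics, prediction_a)
sim_b = cosine(query_topics, prediction_b)
print(sim_a > sim_b)
```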

Hypothesis I. Predictions that are more recent to the query are
more relevant
Hypothesis II. Predictions extracted from more recent documents
are more relevant
Temporal Similarity

Learning-to-rank: Given an unseen (q, p), p is ranked using a
model trained over a set of labeled query/prediction pairs
 SVM-MAP [Yue et al., SIGIR 2007]
 RankSVM [Joachims, KDD 2002]
 SGD-SVM [Zhang, ICML 2004]
 PegasosSVM [Shalev-Shwartz et al., ICML 2007]
 PA-Perceptron [Crammer et al., J. Mach. Learn. 2006]
Ranking Method
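
To make the idea concrete, here is a small pairwise ranker trained with a hinge-style update on feature differences. It is a simplified illustration in the spirit of pairwise methods, not an implementation of any of the algorithms listed above; the two features and the labels are hypothetical.

```python
def train_pairwise(examples, n_features, epochs=20, lr=0.1):
    """examples: list of (features, label) for one query; higher label = more relevant."""
    w = [0.0] * n_features
    # Build all pairs in which the first item should rank above the second.
    pairs = [(a, b) for a, la in examples for b, lb in examples if la > lb]
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [x - y for x, y in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin < 1.0:  # update only when the pair is not separated by the margin
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def score(w, features):
    """Linear score used to rank an unseen (q, p) pair."""
    return sum(wi * xi for wi, xi in zip(w, features))

# Hypothetical features: [term similarity, temporal similarity].
train = [([0.9, 0.8], 2), ([0.5, 0.4], 1), ([0.2, 0.1], 0)]
w = train_pairwise(train, n_features=2)
print(score(w, [0.8, 0.7]) > score(w, [0.1, 0.3]))
```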

42 future-related topics
Relevance Judgments

 New York Times Annotated Corpus
 1.8 million articles, over 20 years
 More than 25% contain at least one prediction
 Annotation process uses several language processing tools
 OpenNLP for tokenizing, sentence splitting, part-of-speech tagging,
shallow parsing
 SuperSense tagger for named entity recognition
 TARSQI for extracting temporal expressions
 Apache Lucene for indexing and retrieving.
 44,335,519 sentences and 548,491 predictions
 939,455 future dates (avg. future date/prediction is 1.7)
Experiments
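
The reported average can be checked directly from the two counts on this slide:

```python
predictions = 548_491
future_dates = 939_455

# Average number of future dates per prediction, rounded to one decimal.
average = round(future_dates / predictions, 1)
print(average)
```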

 Results:
 Topic features play an important role in ranking
 The top-5 features with the lowest weights are entity-based
features
 Open issues:
 Extract predictions from other sources, e.g., Wikipedia, blogs,
comments, etc.
 Sentiment analysis for future-related information
Discussion

Prior to 1964, many of the cigarette
companies advertised their brand by
falsely claiming that their product did not
have serious health risks. A couple of
examples would be "Play safe with Philip
Morris" and "More doctors smoke
Camels". Such claims were made both to
increase the sales of their product and to
combat the increasing public knowledge of
smoking's negative health effects.
[Figure: advertisement poster from the 1950s]
Time-aware Contextualization
[Tran et al., WSDM 2015] (Slide provided by the authors)

Physician
http://en.wikipedia.org/wiki/Physician
Camel (cigarette)
http://en.wikipedia.org/wiki/Camel_(cigarette)
Cigarette
http://en.wikipedia.org/wiki/Cigarette
Entity linking is not sufficient
Wikipedia pages tend to contain large amounts of content
Relevant information might be distributed over various articles
The crucial temporal aspect is missing in pure linking approaches
Entity Linking

Problem Statement
Time-aware contextualization aims to associate an information item
d with time-aware, concise and coherent context information c for
easing its understanding
Several sub-goals of the information search process have to be
combined with each other
 c has to be relevant for d
 c has to complement the information already available in d
 c has to consider the time of creation of d
 the context information should be concise to avoid overloading the user
[Tran et al., WSDM 2015] (Slide provided by the authors)

[Figure: pipeline diagram. Components shown: User, Article, Context Hook Identification, Query Formulation, Context Retrieval, Context Ranking, Contextualization Units Extraction, Contextualization Units Index, Context]
Approach Overview

 The goal is to generate a set of queries for a given document to
retrieve candidates as input for the re-ranking step
 We explore two families of query formulation methods
 Document-based methods : title, lead, title+lead
 Hook-based methods: each_hook, all_hooks, and query performance
prediction (qpp_r@k) with the following features
 Linguistic features
 Document frequency
 Scope
 Temporal document frequency
 Temporal scope
 Temporal similarity
Query Formulation

Context retrieval:
Learning to rank context:
• The ranking algorithm needs to balance two goals, i.e., high topical and
temporal relevance as well as complementarity for providing additional
information
• Use supervised machine learning that takes as input a set of labeled
examples and various complementarity features
 Topic diversity
 Text difference
 Entity difference
 Anchor text difference
 Distributional similarity
 Cosine distance
 Relevance
 Temporal similarity
Context Ranking
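
One plausible way to compute two of the complementarity features is a Jaccard distance over terms and over linked entities. The slides do not define the features, so these definitions, the token lists and the entity sets are assumptions for illustration only.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 when both collections are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical article and candidate context, already tokenized and entity-linked.
article_terms = "more doctors smoke camels".split()
candidate_terms = "camel advertising claimed doctors approved smoking".split()
article_entities = {"Camel (cigarette)", "Physician"}
candidate_entities = {"Camel (cigarette)", "Tobacco advertising"}

features = {
    "text_difference": jaccard_distance(article_terms, candidate_terms),
    "entity_difference": jaccard_distance(article_entities, candidate_entities),
}
print(features)
```

Higher values indicate that the candidate adds material not already present in the article; a learned ranker would weigh them against relevance and temporal similarity.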

Experiments
Datasets:
 51 news articles from New York Times Corpus
 Wikipedia (2013), 26 million contextualization units (paragraphs)
 9464 manual labeled examples (article/context pairs)
 Learning to rank algorithms: RankBoost, Random Forests and Adarank
Baselines
 Entity linking (Milne and Witten)
 Language model (LM)
 Time-aware language model (LM-T)
[Tran et al., WSDM 2015] (Slide provided by the authors)

Evaluating Query Formulation Methods
[Figure: bar chart of P@1, P@3, P@10 and MAP (y-axis 0 to 0.7) for title+lead, all_hooks and qpp_r@100]
Wikification technique achieves a low recall of 0.229
Hook-based approaches outperform the document-based approaches
Query performance prediction method obtains the highest results on all metrics
[Tran et al., WSDM 2015] (Slide provided by the authors)
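
For reference, the metrics on this chart can be computed as follows. This is a standard formulation of precision at k and mean average precision; the ranked list and relevance set are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, relevant set), one entry per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3"}
print(precision_at_k(ranked, relevant, 1), precision_at_k(ranked, relevant, 3))
print(average_precision(ranked, relevant))
```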

The Effect of Complementarity Features
[Figure: bar chart of P@1, P@3, P@10 and MAP (y-axis 0 to 1) for LM-T and RF]
Purely using the time dimension in context retrieval is not sufficient in the contextualization task
Complementarity plays an important role in contextualization

Conclusions and Outlook
 Introduced the general topic of web evolution.
 Pinpointed a number of issues related to temporal IR.
 Focused on temporal information extraction, temporal query
analysis, as well as time-aware retrieval and ranking.
 Wrapped up with related applications to temporal IR.
 Future directions:
 Real-time web mining
 Spatio-temporal search and analytics
 Brain-inspired information access

Thank you!
