Taking the temporal dimension into account in search, i.e., using the time of content creation as part of the search condition, is now gaining increasing interest. However, in the case of web search and web warehousing, the timestamps (time of creation of a page or of its contents) of web pages and documents found on the web are in general not known or cannot be trusted, and must be determined otherwise. In this paper, we describe approaches that enhance and increase the quality of existing techniques for determining timestamps based on a temporal language model. Through a number of experiments on temporal document collections, we show how our new methods improve the accuracy of timestamping compared to the previous models.
Improving Temporal Language Models For Determining Time of Non-Timestamped Documents
1. Improving Temporal Language Models for Determining Time of Non-Timestamped Documents
Nattiya Kanhabua and Kjetil Nørvåg
Dept. of Computer Science,
Norwegian University of Science and Technology,
Trondheim, Norway
ECDL 2008 Conference, Århus, Denmark
2. ECDL 2008, Norwegian University of Science and Technology

Agenda
Motivation and Challenge
Preliminaries
Our Approaches
Evaluation
Conclusion
3. Motivation
Research Question
“How can we improve search results in long-term archives of digital documents?”
Answer
Extend keyword search with temporal information: temporal text-containment search [Nørvåg ’04]
Temporal Information
Timestamp, e.g., the creation or last-update date
In local archives, the timestamp can be found in document metadata and is trustworthy
Q: Is the document timestamp in a WWW archive also trustworthy?
A: Not always, due to:
1. Lack of metadata preservation
2. A time gap between crawling and indexing
3. Relocation of web documents
4. Challenge
I found a bible-like document, but I have no idea when it was created.
You should ask the Guru!
Let me see… This document probably originated in 850 A.D., with 95% confidence.
“For a given document with an uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”
5. Preliminaries
“A model for dating documents”
Temporal language models, presented in [de Jong et al. ’04]:
Based on the statistics of word usage over time.
Compare a non-timestamped document with a reference corpus.
The reference time partition that overlaps most in term usage gives the tentative timestamp.
Temporal Language Models

Partition  Word
2004       earthquake
2004       Thailand
2004       tsunami
1999       tidal wave
1999       Japan
1999       tsunami

A non-timestamped document contains “tsunami” and “Thailand”.
Each word votes for the partitions it occurs in:
“1999”: 1 (tsunami)
“2004”: 1 + 1 = 2 (tsunami, Thailand) → most likely timestamp
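The voting scheme above can be sketched in a few lines. This is a simplified count-based version (the full model uses smoothed word probabilities and a log-likelihood ratio, as shown on a later slide); the helper name `date_document` is ours:

```python
# Toy temporal language model: words observed in each time partition,
# taken from the slide's example.
partitions = {
    "1999": ["tidal wave", "Japan", "tsunami"],
    "2004": ["earthquake", "Thailand", "tsunami"],
}

def date_document(doc_words, partitions):
    """Score each partition by counting how many document words occur in it,
    and return the best partition (the tentative timestamp) plus all scores."""
    scores = {
        label: sum(1 for w in doc_words if w in words)
        for label, words in partitions.items()
    }
    return max(scores, key=scores.get), scores

best, scores = date_document(["tsunami", "Thailand"], partitions)
print(best, scores)  # → 2004 {'1999': 1, '2004': 2}
```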
6. Proposed Approaches
Three ways of improving temporal language models:
1) Data preprocessing
2) Word interpolation
3) Similarity score
7. Data Preprocessing
A direct comparison between the words extracted from a document and the temporal language models limits accuracy.

Semantic-based Preprocessing: Description
Part-of-speech tagging: only the most interesting word classes are selected, e.g., nouns, verbs, and adjectives
Collocation extraction: co-occurrence of different words can alter the meaning, e.g., “United States”
Word sense disambiguation: identifying the correct sense of a word by analyzing its context in a sentence, e.g., “bank”
Concept extraction: comparing two language models at the concept level avoids the low-frequency-word problem
Word filtering: only the top-ranked N words according to TF-IDF scores are selected as index terms
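As an illustration of the word-filtering step, here is a minimal standard-library sketch of selecting the top-N index terms by TF-IDF. The function name and the smoothed-IDF variant are our own choices, not from the paper:

```python
import math
from collections import Counter

def top_n_terms(doc_tokens, corpus, n=3):
    """Word-filtering step: keep only the top-n terms of a document,
    ranked by TF-IDF against a corpus given as a list of token lists."""
    tf = Counter(doc_tokens)
    num_docs = len(corpus)

    def idf(term):
        df = sum(1 for d in corpus if term in d)  # document frequency
        return math.log(num_docs / (1 + df)) + 1  # smoothed IDF

    scored = {t: (f / len(doc_tokens)) * idf(t) for t, f in tf.items()}
    return [t for t, _ in sorted(scored.items(), key=lambda x: -x[1])[:n]]

corpus = [["tsunami", "hits", "thailand"],
          ["olympics", "open", "today"],
          ["markets", "open", "today"]]
print(top_n_terms(["tsunami", "hits", "thailand", "today"], corpus, n=2))
# → ['tsunami', 'hits']  (the common word "today" is filtered out)
```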
8. Word Interpolation
When a word has zero probability for a time partition due to the limited size of the corpus collection, it could still have a non-zero frequency in that period in documents outside the corpus.

“A word is categorized into one of two classes depending on its characteristics over time: recurring or non-recurring.”

Recurring: related to periodic events, e.g., “Summer Olympics”, “World Cup”, “French Open”
Non-recurring: words that do not recur periodically, e.g., “terrorism”, “tsunami”

Identify recurring words by looking at the overlap of the word’s distribution at the (flexible) endpoints of possible periods: every year or every 4 years
9. Word Interpolation (cont’)

“How to interpolate a word depends on which category it belongs to: recurring or non-recurring.”

[Figures: yearly frequency of the non-recurring word “terrorism” (2000–2008) before and after interpolating, illustrating cases NR1–NR3, and of the recurring word “Olympic games” (1996–2008) before and after interpolating]
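The interpolation itself can be roughly sketched as follows. This is only our own simplified reading of the idea: recurring words get a missing peak filled at period endpoints, and non-recurring words get zero-frequency years between observations interpolated linearly; the paper's actual rules (cases NR1–NR3) are more fine-grained.

```python
def interpolate_recurring(freqs, period):
    """Recurring word (e.g. 'Olympic games', period=4 years): fill a zero
    at an expected peak year with the average of the observed peaks.
    A simplified sketch, not the paper's exact rule."""
    years = sorted(freqs)
    peaks = [y for y in years if freqs[y] > 0]
    out = dict(freqs)
    for y in years:
        if freqs[y] == 0 and any((y - p) % period == 0 for p in peaks):
            out[y] = sum(freqs[p] for p in peaks) / len(peaks)
    return out

def interpolate_non_recurring(freqs):
    """Non-recurring word (e.g. 'terrorism'): linearly interpolate
    zero-frequency years lying between two observed years."""
    years = sorted(freqs)
    out = dict(freqs)
    for i, y in enumerate(years):
        if freqs[y] == 0:
            left = next((p for p in reversed(years[:i]) if freqs[p] > 0), None)
            right = next((p for p in years[i + 1:] if freqs[p] > 0), None)
            if left is not None and right is not None:
                t = (y - left) / (right - left)
                out[y] = freqs[left] + t * (freqs[right] - freqs[left])
    return out

print(interpolate_recurring({1996: 4000, 2000: 5000, 2004: 0, 2008: 6000}, 4))
print(interpolate_non_recurring({2000: 1000, 2001: 0, 2002: 3000}))
```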
10. Similarity Score
Temporal entropy: a term weighting that takes temporality into account, based on the term selection method presented in [Lochbaum, Streeter ’89].
The higher the temporal entropy of a term, the better it represents a partition.
A term occurring in few partitions has higher temporal entropy than one appearing in many partitions.
Tells how good a term is at separating a partition from the others.
Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document.
A measure of the temporal information a word conveys.

TE(wi) = 1 + (1 / log NP) · Σp P(p|wi) · log P(p|wi)

where P(p|wi) is the probability that partition p contains the term wi, and NP is the total number of partitions in the corpus.
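A direct implementation of temporal entropy, assuming the standard entropy-based formula from [Lochbaum, Streeter ’89] with P(p|w) estimated from per-partition term frequencies:

```python
import math

def temporal_entropy(term_freq_per_partition):
    """Temporal entropy of a term:
        TE(w) = 1 + (1 / log N_P) * sum_p P(p|w) * log P(p|w),
    with P(p|w) = tf(w, p) / sum_k tf(w, p_k).
    A term concentrated in one partition gets TE = 1; a term spread
    evenly over all partitions gets TE = 0."""
    n_p = len(term_freq_per_partition)
    total = sum(term_freq_per_partition)
    probs = [f / total for f in term_freq_per_partition if f > 0]
    entropy = sum(p * math.log(p) for p in probs)  # always <= 0
    return 1 + entropy / math.log(n_p)

print(temporal_entropy([10, 0, 0, 0]))  # → 1.0 (perfectly concentrated)
print(temporal_entropy([5, 5, 5, 5]))   # → 0.0 (perfectly spread out)
```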
11. Similarity Score (cont’)
“By analyzing search statistics [Google Zeitgeist], we can increase the probability of a particular time partition.”

P(wi) is the probability that wi occurs:
P(wi) = 1.0 for a gaining query
P(wi) = 0.5 for a declining query
f(R) converts a rank into a weight; a higher-ranked query is more important.
ipf = log(N/n) is an inverse partition frequency.
The GZ score is linearly combined with the original similarity score [de Jong et al. ’04].
12. Experimental Setting
A reference corpus
• Documents with known dates.
• Collected from the Internet Archive.
• News history web pages, e.g., ABC News, CNN, New York Post, etc.

Build Temporal Language Models
• A list of words and their probabilities in each time partition.
• Intended to capture word usage within a certain time period.

Partition  Word        Probability
2004       earthquake  0.080
2004       Thailand    0.012
2004       tsunami     0.091
1999       tidal wave  0.009
1999       Japan       0.003
1999       tsunami     0.015
13. Experiments
Constraints on a training set:
1. Cover the domain of the document to be dated.
2. Cover the time period of the document to be dated.

A reference corpus (15 sources) is split into a training set and a testing set:
Training set: 10 news sources selected from various domains.
Testing set: 1000 documents randomly selected from the 5 remaining sources (different from the training sources).

Precision = the fraction of processed documents that are correctly dated
Recall = the fraction of correctly dated documents that are processed
14. Experiment (cont’)
Experiment  Evaluation Aspect  Description
A  Semantic-based preprocessing  Various combinations of semantics: 1) POS–WSD–CON–FILT; 2) POS–COLL–WSD–FILT; 3) POS–COLL–WSD–CON–FILT
B  Temporal Entropy, Google Zeitgeist  Combinations of TE and GZ, with or without semantic-based preprocessing
C  Dating task and confidence  As in other classification tasks, the system should tell how much confidence it has in assigning a timestamp; confidence is measured by the distance between the scores of the 1st- and 2nd-ranked partitions.
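The confidence measure of experiment C can be sketched as follows; normalizing the score gap by the top score is our own choice, made only to get a value in [0, 1]:

```python
def confidence(partition_scores):
    """Confidence of a dating decision: the distance between the scores of
    the 1st- and 2nd-ranked partitions, normalized by the top score
    (normalization is our own choice, not necessarily the paper's)."""
    ranked = sorted(partition_scores.values(), reverse=True)
    if len(ranked) < 2:
        return 1.0  # only one candidate partition: nothing to compete with
    gap = ranked[0] - ranked[1]
    return gap / abs(ranked[0]) if ranked[0] else 0.0

# A clear winner gives high confidence; a near-tie gives low confidence.
print(confidence({"1999": 1, "2004": 2}))  # → 0.5
```

An application needing high precision would then process only documents whose confidence exceeds a chosen threshold, trading recall for precision as in the results below.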
15. Results

[Figures: precision (%) at granularities 1-week, 1-month, 3-month, 6-month, and 12-month, for (a) the baseline vs. semantic-based combinations A.1–A.3, and (b) the baseline vs. TE, GZ, S-TE, and S-GZ]

Semantic-based preprocessing:
• Increases precision at almost all granularities, except 1-week
• At a small granularity, it is hard to obtain high accuracy

Temporal Entropy and Google Zeitgeist:
• Applying semantic-based preprocessing first gives TE and GZ a large improvement
• Semantic-based preprocessing generates collocations and concepts
• These are weighted highly by TE and GZ (most search statistics are noun phrases)
16. Results (cont’)

[Figure (c): precision and recall (%) as a function of confidence level, 0.0–1.0]

Confidence levels and document dating accuracy: the higher the confidence, the more reliable the results.
17. Conclusion
Our approaches considerably increase quality compared to the baseline built on the previous approach.
Applications that require high precision can select only documents whose timestamp has been determined with high confidence.
Future research:
Apply other classification algorithms to document dating
Introduce a weighting scheme for words and interpolate only significant words
18. Questions
Questions are welcome ☺
19. Related Works
There is only a small amount of work on determining the time of documents. It can be divided into two categories:
determining the time of creation of a document or its contents,
determining the time of the topic of the contents.
Two kinds of techniques are employed: learning-based and non-learning.

Learning-based: learns from a set of training documents. [Swan, Allan ’99] and [Swan, Jensen ’00] use a statistical method called hypothesis testing; [de Jong et al. ’05] is based on a statistical language model. Gives the most likely time of origin, which is close to the time the document was written.

Non-learning: does not require a corpus collection. [Mani, Wilson ’00] and [Llidó et al. ’01] require explicitly time-tagged documents, which are resolved into a concrete or absolute date. Gives a summary of the times of the events that appear in the document content.
20. Temporal Language Models
Given a collection of corpus documents C = {d1, d2, …, dn}.
A document model is defined as di = {{w1, w2, …, wn}, (ti, ti+1)}
• where ti < ti+1 and ti < Time(di) < ti+1
Similarity between two language models:
“A normalized log-likelihood ratio [Kraaij ’05]”

Score(di, pj) = Σw∈di P(w|di) · log( P(w|pj) / P(w|C) )

where P(w|di) is the probability of word w in document di, P(w|pj) is the probability of word w in time partition pj, and P(w|C) is the probability of word w in the corpus collection C.
in a corpus collection C