Taking the temporal dimension into account in search, i.e., using the time of content creation as part of the search condition, is now gaining increasing interest. However, in the case of web search and web warehousing, the timestamps (time of creation of a page or of its contents) of web pages and documents found on the web are in general not known or cannot be trusted, and must be determined otherwise. In this paper, we describe approaches that enhance and increase the quality of existing techniques for determining timestamps based on a temporal language model. Through a number of experiments on temporal document collections, we show how our new methods improve the accuracy of timestamping compared to the previous models.
Improving Temporal Language Models For Determining Time of Non-Timestamped Documents
1. Improving Temporal Language Models for Determining Time of Non-Timestamped Documents
Nattiya Kanhabua and Kjetil Nørvåg
Dept. of Computer Science,
Norwegian University of Science and Technology,
Trondheim, Norway
ECDL 2008 Conference, Århus, Denmark
2. ECDL 2008, Norwegian University of Science and Technology

Agenda
Motivation and Challenge
Preliminaries
Our Approaches
Evaluation
Conclusion
3. Motivation
Research Question
“How can we improve search results in long-term archives of digital documents?”
Answer
Extend keyword search with temporal information: temporal text-containment search [Nørvåg ’04]
Temporal Information
Timestamp, e.g., the creation or last-update date
In local archives, the timestamp can be found in document metadata and is trustworthy
Q: Is the document timestamp in a WWW archive also trustworthy?
A: Not always, due to:
1. Lack of metadata preservation
2. A time gap between crawling and indexing
3. Relocation of web documents
4. Challenge
I found a bible-like document, but I have no idea when it was created.
You should ask the Guru!
Let me see… This document probably originated in 850 A.D., with 95% confidence.
“For a given document with an uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”
5. Preliminaries
“A model for dating documents”
Temporal language models, presented in [de Jong et al. ’04]:
Based on the statistics of word usage over time.
Compare a non-timestamped document with a reference corpus.
The reference time partition that overlaps most in term usage gives the tentative timestamp.
Temporal Language Models

Partition  Word
2004       earthquake
2004       Thailand
2004       tsunami
1999       tidal wave
1999       Japan
1999       tsunami

A non-timestamped document contains “tsunami” and “Thailand”.
Each word votes for the partitions it occurs in:
“1999”: 1 (tsunami)
“2004”: 1 + 1 = 2 (tsunami, Thailand) → most likely timestamp
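The voting scheme above can be sketched in a few lines. This is a simplified count-based version (the full model uses smoothed word probabilities and a log-likelihood ratio, as shown on a later slide); the helper name `date_document` is ours:

```python
# Toy temporal language model: words observed in each time partition,
# taken from the slide's example.
partitions = {
    "1999": ["tidal wave", "Japan", "tsunami"],
    "2004": ["earthquake", "Thailand", "tsunami"],
}

def date_document(doc_words, partitions):
    """Score each partition by counting how many document words occur in it,
    and return the best partition (the tentative timestamp) plus all scores."""
    scores = {
        label: sum(1 for w in doc_words if w in words)
        for label, words in partitions.items()
    }
    return max(scores, key=scores.get), scores

best, scores = date_document(["tsunami", "Thailand"], partitions)
print(best, scores)  # → 2004 {'1999': 1, '2004': 2}
```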
6. Proposed Approaches
Three ways of improving temporal language models:
1) Data preprocessing
2) Word interpolation
3) Similarity score
7. Data Preprocessing
A direct comparison between the words extracted from a document and the temporal language models limits accuracy.

Semantic-based Preprocessing: Description
Part-of-speech tagging: only the most interesting word classes are selected, e.g., nouns, verbs, and adjectives
Collocation extraction: co-occurrence of different words can alter the meaning, e.g., “United States”
Word sense disambiguation: identifying the correct sense of a word by analyzing its context in a sentence, e.g., “bank”
Concept extraction: comparing two language models at the concept level avoids the low-frequency-word problem
Word filtering: only the top-ranked N words according to TF-IDF scores are selected as index terms
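As an illustration of the word-filtering step, here is a minimal standard-library sketch of selecting the top-N index terms by TF-IDF. The function name and the smoothed-IDF variant are our own choices, not from the paper:

```python
import math
from collections import Counter

def top_n_terms(doc_tokens, corpus, n=3):
    """Word-filtering step: keep only the top-n terms of a document,
    ranked by TF-IDF against a corpus given as a list of token lists."""
    tf = Counter(doc_tokens)
    num_docs = len(corpus)

    def idf(term):
        df = sum(1 for d in corpus if term in d)  # document frequency
        return math.log(num_docs / (1 + df)) + 1  # smoothed IDF

    scored = {t: (f / len(doc_tokens)) * idf(t) for t, f in tf.items()}
    return [t for t, _ in sorted(scored.items(), key=lambda x: -x[1])[:n]]

corpus = [["tsunami", "hits", "thailand"],
          ["olympics", "open", "today"],
          ["markets", "open", "today"]]
print(top_n_terms(["tsunami", "hits", "thailand", "today"], corpus, n=2))
# → ['tsunami', 'hits']  (the common word "today" is filtered out)
```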
8. Word Interpolation
When a word has zero probability for a time partition due to the limited size of the corpus collection, it could still have a non-zero frequency in that period in documents outside the corpus.

“A word is categorized into one of two classes depending on its characteristics over time: recurring or non-recurring.”

Recurring: related to periodic events, e.g., “Summer Olympics”, “World Cup”, “French Open”
Non-recurring: words that do not recur periodically, e.g., “terrorism”, “tsunami”

Identify recurring words by looking at the overlap of the word’s distribution at the (flexible) endpoints of possible periods: every year or every 4 years
9. Word Interpolation (cont’)

“How to interpolate a word depends on which category it belongs to: recurring or non-recurring.”

[Figures: yearly frequency of the non-recurring word “terrorism” (2000–2008) before and after interpolating, illustrating cases NR1–NR3, and of the recurring word “Olympic games” (1996–2008) before and after interpolating]
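The interpolation itself can be roughly sketched as follows. This is only our own simplified reading of the idea: recurring words get a missing peak filled at period endpoints, and non-recurring words get zero-frequency years between observations interpolated linearly; the paper's actual rules (cases NR1–NR3) are more fine-grained.

```python
def interpolate_recurring(freqs, period):
    """Recurring word (e.g. 'Olympic games', period=4 years): fill a zero
    at an expected peak year with the average of the observed peaks.
    A simplified sketch, not the paper's exact rule."""
    years = sorted(freqs)
    peaks = [y for y in years if freqs[y] > 0]
    out = dict(freqs)
    for y in years:
        if freqs[y] == 0 and any((y - p) % period == 0 for p in peaks):
            out[y] = sum(freqs[p] for p in peaks) / len(peaks)
    return out

def interpolate_non_recurring(freqs):
    """Non-recurring word (e.g. 'terrorism'): linearly interpolate
    zero-frequency years lying between two observed years."""
    years = sorted(freqs)
    out = dict(freqs)
    for i, y in enumerate(years):
        if freqs[y] == 0:
            left = next((p for p in reversed(years[:i]) if freqs[p] > 0), None)
            right = next((p for p in years[i + 1:] if freqs[p] > 0), None)
            if left is not None and right is not None:
                t = (y - left) / (right - left)
                out[y] = freqs[left] + t * (freqs[right] - freqs[left])
    return out

print(interpolate_recurring({1996: 4000, 2000: 5000, 2004: 0, 2008: 6000}, 4))
print(interpolate_non_recurring({2000: 1000, 2001: 0, 2002: 3000}))
```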
10. Similarity Score
Temporal entropy: a term weighting that takes temporality into account, based on the term selection method presented in [Lochbaum, Streeter ’89].
The higher the temporal entropy of a term, the better it represents a partition.
A term occurring in few partitions has higher temporal entropy than one appearing in many partitions.
Tells how good a term is at separating a partition from the others.
Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document.
A measure of the temporal information a word conveys.

TE(wi) = 1 + (1 / log NP) · Σp P(p|wi) · log P(p|wi)

where P(p|wi) is the probability that partition p contains the term wi, and NP is the total number of partitions in the corpus.
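A direct implementation of temporal entropy, assuming the standard entropy-based formula from [Lochbaum, Streeter ’89] with P(p|w) estimated from per-partition term frequencies:

```python
import math

def temporal_entropy(term_freq_per_partition):
    """Temporal entropy of a term:
        TE(w) = 1 + (1 / log N_P) * sum_p P(p|w) * log P(p|w),
    with P(p|w) = tf(w, p) / sum_k tf(w, p_k).
    A term concentrated in one partition gets TE = 1; a term spread
    evenly over all partitions gets TE = 0."""
    n_p = len(term_freq_per_partition)
    total = sum(term_freq_per_partition)
    probs = [f / total for f in term_freq_per_partition if f > 0]
    entropy = sum(p * math.log(p) for p in probs)  # always <= 0
    return 1 + entropy / math.log(n_p)

print(temporal_entropy([10, 0, 0, 0]))  # → 1.0 (perfectly concentrated)
print(temporal_entropy([5, 5, 5, 5]))   # → 0.0 (perfectly spread out)
```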
11. Similarity Score (cont’)
“By analyzing search statistics [Google Zeitgeist], we can increase the probability of a particular time partition.”

P(wi) is the probability that wi occurs:
P(wi) = 1.0 for a gaining query
P(wi) = 0.5 for a declining query
f(R) converts a rank into a weight; a higher-ranked query is more important.
ipf = log(N/n) is an inverse partition frequency.
The GZ score is linearly combined with the original similarity score [de Jong et al. ’04].
12. Experimental Setting
A reference corpus
• Documents with known dates.
• Collected from the Internet Archive.
• News history web pages, e.g., ABC News, CNN, New York Post, etc.

Build Temporal Language Models
• A list of words and their probabilities in each time partition.
• Intended to capture word usage within a certain time period.

Partition  Word        Probability
2004       earthquake  0.080
2004       Thailand    0.012
2004       tsunami     0.091
1999       tidal wave  0.009
1999       Japan       0.003
1999       tsunami     0.015
13. Experiments
Constraints on a training set:
1. Cover the domain of the document to be dated.
2. Cover the time period of the document to be dated.

A reference corpus (15 sources) is split into a training set and a testing set:
Training set: 10 news sources selected from various domains.
Testing set: 1000 documents randomly selected from the 5 remaining sources (different from the training sources).

Precision = the fraction of processed documents that are correctly dated
Recall = the fraction of correctly dated documents that are processed
14. Experiment (cont’)
Experiment  Evaluation Aspect  Description
A  Semantic-based preprocessing  Various combinations of semantics: 1) POS–WSD–CON–FILT; 2) POS–COLL–WSD–FILT; 3) POS–COLL–WSD–CON–FILT
B  Temporal Entropy, Google Zeitgeist  Combinations of TE and GZ, with or without semantic-based preprocessing
C  Dating task and confidence  As in other classification tasks, the system should tell how much confidence it has in assigning a timestamp; confidence is measured by the distance between the scores of the 1st- and 2nd-ranked partitions.
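The confidence measure of experiment C can be sketched as follows; normalizing the score gap by the top score is our own choice, made only to get a value in [0, 1]:

```python
def confidence(partition_scores):
    """Confidence of a dating decision: the distance between the scores of
    the 1st- and 2nd-ranked partitions, normalized by the top score
    (normalization is our own choice, not necessarily the paper's)."""
    ranked = sorted(partition_scores.values(), reverse=True)
    if len(ranked) < 2:
        return 1.0  # only one candidate partition: nothing to compete with
    gap = ranked[0] - ranked[1]
    return gap / abs(ranked[0]) if ranked[0] else 0.0

# A clear winner gives high confidence; a near-tie gives low confidence.
print(confidence({"1999": 1, "2004": 2}))  # → 0.5
```

An application needing high precision would then process only documents whose confidence exceeds a chosen threshold, trading recall for precision as in the results below.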
15. Results

[Figures: precision (%) at granularities 1-week, 1-month, 3-month, 6-month, and 12-month, for (a) the baseline vs. semantic-based combinations A.1–A.3, and (b) the baseline vs. TE, GZ, S-TE, and S-GZ]

Semantic-based preprocessing:
• Increases precision at almost all granularities, except 1-week
• At a small granularity, it is hard to obtain high accuracy

Temporal Entropy and Google Zeitgeist:
• Applying semantic-based preprocessing first gives TE and GZ a large improvement
• Semantic-based preprocessing generates collocations and concepts
• These are weighted highly by TE and GZ (most search statistics are noun phrases)
16. Results (cont’)

[Figure (c): precision and recall (%) as a function of confidence level, 0.0–1.0]

Confidence levels and document dating accuracy: the higher the confidence, the more reliable the results.
17. Conclusion
Our approaches considerably increase quality compared to the baseline built on the previous approach.
Applications that require high precision can select only documents whose timestamp has been determined with high confidence.
Future research:
Apply other classification algorithms to document dating
Introduce a weighting scheme for words and interpolate only significant words
18. Questions
Questions are welcome ☺
19. Related Works
There is only a small amount of work on determining the time of documents. It can be divided into two categories:
determining the time of creation of a document or its contents,
determining the time of the topic of the contents.
Two kinds of techniques are employed: learning-based and non-learning.

Learning-based: learns from a set of training documents. [Swan, Allan ’99] and [Swan, Jensen ’00] use a statistical method called hypothesis testing; [de Jong et al. ’05] is based on a statistical language model. Gives the most likely time of origin, which is close to the time the document was written.

Non-learning: does not require a corpus collection. [Mani, Wilson ’00] and [Llidó et al. ’01] require explicitly time-tagged documents, which are resolved into a concrete or absolute date. Gives a summary of the times of the events that appear in the document content.
20. Temporal Language Models
Given a collection of corpus documents C = {d1, d2, …, dn}.
A document model is defined as di = {{w1, w2, …, wn}, (ti, ti+1)}
• where ti < ti+1 and ti < Time(di) < ti+1
Similarity between two language models:
“A normalized log-likelihood ratio [Kraaij ’05]”

Score(di, pj) = Σw∈di P(w|di) · log( P(w|pj) / P(w|C) )

where P(w|di) is the probability of word w in document di, P(w|pj) is the probability of word w in time partition pj, and P(w|C) is the probability of word w in the corpus collection C.
in a corpus collection C