Improving Temporal Language Models for
Determining Time of Non-Timestamped Documents
Nattiya Kanhabua and Kjetil Nørvåg
Dept. of Computer Science,
Norwegian University of Science and Technology,
Trondheim, Norway
ECDL 2008 Conference, Århus, Denmark
ECDL 2008 Norwegian University of Science and Technology
Agenda
Motivation and Challenge
Preliminaries
Our Approaches
Evaluation
Conclusion
Motivation
Research Question
" How to improve search results in long-term archives of digital documents? "
Answer
Extend keyword search with temporal information --
Temporal text-containment search [Nørvåg'04]
Temporal Information
Timestamp, e.g. the creation or last-update date
In local archives, the timestamp can be found in document metadata, which is trustworthy
Q: Is the document timestamp in a WWW archive also trustworthy?
A: Not always, because of some problems:
1. A lack of metadata preservation
2. A time gap between crawling and indexing
3. Relocation of web documents
Challenge
"I found a bible-like document, but I have no idea when it was created."
"You should ask the Guru!"
"Let me see... This document probably originated in 850 A.D., with 95% confidence."
" For a given document with an uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence? "
Preliminaries
"A model for dating documents"
Temporal Language Models presented in [de Jong et al. '04]
Based on the statistics of word usage over time.
Compare a non-timestamped document with a reference corpus.
The reference time partition that mostly overlaps in term usage gives the tentative timestamp.

Temporal Language Models
Partition  Word
2004       earthquake
2004       Thailand
2004       tsunami
1999       tidal wave
1999       Japan
1999       tsunami

Example: a non-timestamped document contains "tsunami" and "Thailand".
"tsunami" occurs in both partitions:  "1999": 1        "2004": 1
"Thailand" occurs only in 2004:       "1999": 1 = 1    "2004": 1 + 1 = 2  -> most likely timestamp
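The walkthrough above can be sketched as a toy scorer. This is the simplest counting variant, where each document word found in a partition's word list adds one to that partition's score; the real models use per-partition word probabilities (shown at the end of the deck), and the data here is the slide's illustrative example:

```python
from collections import defaultdict

# Toy temporal language models: partition -> words observed in that partition
models = {
    "1999": ["tidal wave", "Japan", "tsunami"],
    "2004": ["earthquake", "Thailand", "tsunami"],
}

def date_document(doc_words, models):
    """Score each time partition by counting document words found in its model;
    the partition with the highest overlap is the tentative timestamp."""
    scores = defaultdict(int)
    for word in doc_words:
        for partition, vocab in models.items():
            if word in vocab:
                scores[partition] += 1
    return max(scores, key=scores.get), dict(scores)

best, scores = date_document(["tsunami", "Thailand"], models)
```

As in the slide's example, "tsunami" matches both partitions while "Thailand" matches only 2004, so 2004 wins with a score of 2 against 1.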
Proposed Approaches
Three ways of improving temporal language models:
1) Data preprocessing
2) Word interpolation
3) Similarity score
Data Preprocessing
A direct comparison between the words extracted from a document and the temporal language models limits accuracy.
Semantic-based Preprocessing: Description
Part-of-speech tagging: the most interesting classes of words are selected, e.g. nouns, verbs, and adjectives
Collocation extraction: co-occurrence of different words can alter the meaning, e.g. "United States"
Word sense disambiguation: identifying the correct sense of a word by analyzing its context in a sentence, e.g. "bank"
Concept extraction: comparing 2 language models on the concept level avoids the low-frequency word problem
Word filtering: only the top-ranked N words according to TF-IDF scores are selected as index terms
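The word-filtering step can be illustrated with a minimal TF-IDF ranking that keeps only the top-N terms of a document as index terms. The toy corpus, the value of N, and the unsmoothed TF-IDF formula are illustrative assumptions, not the deck's exact setup:

```python
import math
from collections import Counter

def top_n_tfidf(doc, corpus, n=2):
    """Keep only the N highest TF-IDF terms of `doc` as index terms.
    `doc` is a list of tokens; `corpus` is a list of token lists."""
    tf = Counter(doc)
    num_docs = len(corpus)
    def tfidf(term):
        df = sum(1 for d in corpus if term in d)   # document frequency
        return tf[term] * math.log(num_docs / df)
    return sorted(tf, key=tfidf, reverse=True)[:n]

corpus = [
    ["tsunami", "thailand", "news"],
    ["football", "news", "cup"],
    ["tsunami", "earthquake", "news"],
]
# "news" occurs in every corpus document, so its IDF is log(3/3) = 0
# and it never survives the filtering.
terms = top_n_tfidf(["tsunami", "tsunami", "news", "thailand"], corpus, n=2)
```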
Word Interpolation
When a word has zero probability for a time partition because of the limited size of a corpus collection, it could still have a non-zero frequency in that period in documents outside the corpus.
" A word is categorized into one of two classes depending on its characteristics over time: recurring or non-recurring. "
Recurring: related to periodic events. For example, "Summer Olympic", "World Cup", "French Open"
Non-recurring: words that are not recurring. For example, "Terrorism", "Tsunami"
Identify recurring words by looking at the overlap of word distributions at the (flexible) endpoints of possible periods: every year or every 4 years
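As a minimal sketch of the interpolation idea for recurring words (the paper's concrete smoothing rules may differ): a zero-frequency year that falls on the word's recurrence cycle is assumed to be a corpus gap and is filled from the observed peaks. The mean-of-peaks fill value is an assumption for illustration:

```python
def interpolate_recurring(freq_by_year, period):
    """Give a zero-frequency year that falls on the word's recurrence cycle a
    non-zero frequency, interpolated as the mean of the observed peaks.
    `freq_by_year` maps year -> frequency; `period` is the cycle in years."""
    peaks = [y for y, f in freq_by_year.items() if f > 0]
    anchor = min(peaks)                          # first observed occurrence
    mean_peak = sum(freq_by_year[y] for y in peaks) / len(peaks)
    out = dict(freq_by_year)
    for year, f in freq_by_year.items():
        if f == 0 and (year - anchor) % period == 0:
            out[year] = mean_peak                # the corpus just missed this cycle
    return out

# "Olympic games" seen in 1996 and 2004 but, by corpus accident, not in 2000:
freqs = {1996: 4000, 2000: 0, 2004: 5000, 2006: 0}
filled = interpolate_recurring(freqs, period=4)
```

The off-cycle year 2006 is left at zero: only years matching the 4-year period get interpolated.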
Word Interpolation (cont')
" How to interpolate words depends on which category a word belongs to: recurring or non-recurring. "
[Figures: yearly frequency of the non-recurring word "Terrorism" (2000-2008), (a) before and (b) after interpolating, and of the recurring word "Olympic games" (1996-2008), (a) before and (b) after interpolating.]
Similarity Score
"A term weighting that takes temporality into account: temporal entropy, based on the term selection method presented in [Lochbaum, Streeter '89]."
The higher the temporal entropy of a term, the better it represents a partition.
A term occurring in few partitions has a higher temporal entropy than one appearing in many partitions.
Tells how good a term is at separating a partition from the others.
Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document.
A measure of the temporal information that a word conveys.
Temporal Entropy
P(p|wi) is the probability that partition p contains term wi; Np is the total number of partitions in the corpus.
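The entropy formula itself was an image in the original deck. The sketch below follows the standard entropy-based term weighting of [Lochbaum, Streeter '89] that the slide describes; the exact normalisation used in the paper may differ. A term concentrated in one partition gets TE close to 1, while a term spread evenly over all Np partitions gets TE close to 0, matching the slide's claim:

```python
import math

def temporal_entropy(tf_by_partition, num_partitions):
    """TE(w) = 1 + (1/log Np) * sum_p P(p|w) * log P(p|w),
    where P(p|w) is the fraction of w's occurrences falling in partition p."""
    total = sum(tf_by_partition)
    probs = [tf / total for tf in tf_by_partition if tf > 0]
    return 1 + sum(p * math.log(p) for p in probs) / math.log(num_partitions)

np_ = 10
concentrated = temporal_entropy([50] + [0] * 9, np_)   # all mass in one partition -> TE = 1
uniform = temporal_entropy([5] * 10, np_)              # spread over all partitions -> TE = 0
```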
Similarity Score (cont')
" By analyzing search statistics [Google Zeitgeist], we can increase the probability for a particular time partition. "
P(wi) is the probability that wi occurs:
P(wi) = 1.0 for a gaining query
P(wi) = 0.5 for a declining query
f(R) converts a rank number into a weight; a higher-ranked query is more important.
ipf is an inverse partition frequency, ipf = log N/n.
The GZ score is linearly combined with the original similarity score [de Jong et al. '04].
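The GZ formula was also an image in the original deck. The sketch below shows one way the pieces named on the slide could fit together; the reciprocal-rank choice for f(R), the assumption n = 1 inside ipf, and the combination weight alpha are illustrative assumptions, not the paper's definitions:

```python
import math

def gz_score(doc_words, zeitgeist, num_partitions):
    """Boost for one partition from Google Zeitgeist statistics.
    `zeitgeist` maps a query word to (rank, trend) for this partition,
    where trend is "gaining" or "declining"."""
    score = 0.0
    for w in doc_words:
        if w not in zeitgeist:
            continue
        rank, trend = zeitgeist[w]
        p_w = 1.0 if trend == "gaining" else 0.5   # slide: gaining = 1.0, declining = 0.5
        f_r = 1.0 / rank                           # assumed f(R): higher rank, higher weight
        ipf = math.log(num_partitions / 1)         # ipf = log N/n, assuming n = 1 here
        score += p_w * f_r * ipf
    return score

def combined_score(nllr, gz, alpha=0.5):
    """Assumed linear combination of the original NLLR score and the GZ boost."""
    return alpha * nllr + (1 - alpha) * gz
```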
Experimental Setting
A reference corpus
• Documents with known dates.
• Collected from the Internet Archive.
• News history web pages, e.g. ABC News, CNN, New York Post, etc.

Build

Temporal Language Models
• A list of words and their probabilities in each time partition.
• Intended to capture word usage within a certain time period.

Word        Probability  Partition
earthquake  0.080        2004
Thailand    0.012        2004
tsunami     0.091        2004
tidal wave  0.009        1999
Japan       0.003        1999
tsunami     0.015        1999
Experiments
Constraints on a training set:
1. Cover the domain of a document to be dated.
2. Cover the time period of a document to be dated.
A reference corpus (15 sources) is split into a training set and a testing set:
Training: select 10 news sources from various domains.
Testing: randomly select 1000 documents from the 5 remaining news sources (different from the training sources).
Precision = the fraction of processed documents that are correctly dated
Recall = the fraction of all documents that are processed and correctly dated
Experiment (cont')
Evaluation aspects:
A. Semantic-based preprocessing: various combinations of semantics:
   1) POS - WSD - CON - FILT
   2) POS - COLL - WSD - FILT
   3) POS - COLL - WSD - CON - FILT
B. Temporal Entropy, Google Zeitgeist: combinations of TE and GZ, with or without semantic-based preprocessing.
C. Dating task and confidence: similar to other classification tasks, the system should be able to tell how much confidence it has in assigning a timestamp. Confidence is measured by the distance between the scores of the 1st and 2nd ranked partitions.
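The confidence rule in experiment C can be sketched directly: rank the partitions by score and take the gap between the top two; a document is only assigned a timestamp when that gap clears a threshold. The threshold value is illustrative:

```python
def date_with_confidence(scores, threshold=0.1):
    """`scores` maps partition -> similarity score.
    Returns (partition, confidence), with partition = None when the system
    is not confident enough to assign a timestamp."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    confidence = s1 - s2   # distance between 1st and 2nd ranked partitions
    return (best if confidence >= threshold else None, confidence)

decision, conf = date_with_confidence({"1999": 0.40, "2004": 0.65, "2006": 0.30})
```

Applications that need high precision can raise the threshold, trading recall (fewer documents dated) for precision among those that are.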
Results
[Figures: precision (%) at granularities 1-week, 1-month, 3-month, 6-month, and 12-month for (a) Baseline vs. A.1, A.2, A.3 and (b) Baseline vs. TE, GZ, S-TE, S-GZ.]
Semantic-based preprocessing:
• Increases precision at almost all granularities except 1-week.
• At a small granularity, it is hard to gain high accuracy.
Temporal Entropy, Google Zeitgeist:
• By applying semantic-based preprocessing first, TE and GZ obtain a high improvement.
• Semantic-based preprocessing generates collocations and concepts, which are weighted highly by TE and GZ (most search statistics are noun phrases).
Results (cont')
Confidence levels and document dating accuracy
[Figure: precision and recall (%) as a function of confidence level (0.00-1.00).]
The higher the confidence, the more reliable the results.
Conclusion
Our approaches considerably increase quality compared to the baseline based on the previous approach.
Applications that require high precision can select only documents whose timestamps have been determined with high confidence.
Future research:
Apply other classification algorithms to document dating
Introduce a weighting scheme for words and interpolate only significant words
Questions
Questions are welcome ☺
Related Work
There is a small amount of work on determining the time of documents.
It is divided into two categories:
determining the time of creation of a document/its contents,
determining the time of the topic of the contents.
Two techniques are employed: learning-based and non-learning.
Learning-based: learns from a set of training documents. [Swan,Allan'99] and [Swan,Jensen'00] use a statistical method called hypothesis testing; [de Jong et al.'05] is based on a statistical language model. Gives the most likely time of origin, which is similar to the written time of a document.
Non-learning: does not require a corpus collection. [Mani,Wilson'00] and [Llidó et al.'01] require explicitly time-tagged documents, which are resolved into a concrete date or an absolute date. Gives a summary of the time of events appearing in the document content.
Temporal Language Models
Given a collection of corpus documents C = {d1, d2, ..., dn}
A document model is defined as di = {{w1, w2, ..., wn}, (ti, ti+1)}
• where ti < ti+1 and ti < Time(di) < ti+1
Similarity between two language models:
"A normalized log-likelihood ratio [Kraaij'05]"
Score(di, pj) = Σ_{w ∈ di} P(w|di) · log( P(w|pj) / P(w|C) )
P(w|di) is the probability of word w in a document di
P(w|pj) is the probability of word w in a time partition pj
P(w|C) is the probability of word w in the corpus collection C
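The normalized log-likelihood ratio above can be run on the toy models from the Experimental Setting slide. A small floor probability stands in for unseen words so that log P(w|pj) is always defined; that smoothing choice is an assumption of this sketch, not the paper's:

```python
import math

def nllr_score(doc_words, partition_model, corpus_model, floor=1e-6):
    """Score(di, pj) = sum_w P(w|di) * log( P(w|pj) / P(w|C) ).
    Models map word -> probability; `floor` stands in for unseen words."""
    n = len(doc_words)
    score = 0.0
    for w in set(doc_words):
        p_doc = doc_words.count(w) / n          # P(w|di)
        p_part = partition_model.get(w, floor)  # P(w|pj)
        p_corpus = corpus_model.get(w, floor)   # P(w|C)
        score += p_doc * math.log(p_part / p_corpus)
    return score

# Toy models echoing the deck's example (corpus probabilities are assumed):
corpus_model = {"tsunami": 0.05, "thailand": 0.01, "japan": 0.01}
model_2004 = {"tsunami": 0.091, "thailand": 0.012}
model_1999 = {"tsunami": 0.015, "japan": 0.003}

doc = ["tsunami", "thailand"]
s2004 = nllr_score(doc, model_2004, corpus_model)
s1999 = nllr_score(doc, model_1999, corpus_model)
```

The 2004 partition scores higher because both document words are more probable there than in the corpus as a whole, which is exactly the "most likely timestamp" decision from the Preliminaries slide.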

More Related Content

Viewers also liked

Understanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of OutbreaksUnderstanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of Outbreaks
Nattiya Kanhabua
 
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Nattiya Kanhabua
 
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result DiversificationLeveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Nattiya Kanhabua
 
Time-aware Approaches to Information Retrieval
Time-aware Approaches to Information RetrievalTime-aware Approaches to Information Retrieval
Time-aware Approaches to Information Retrieval
Nattiya Kanhabua
 

Viewers also liked (18)

On the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in WikipediaOn the Value of Temporal Anchor Texts in Wikipedia
On the Value of Temporal Anchor Texts in Wikipedia
 
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
Leveraging Learning To Rank in an Optimization Framework for Timeline Summari...
 
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
Estimating Query Difficulty for News Prediction Retrieval (poster presentation)
 
Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?Preservation and Forgetting: Friends or Foes?
Preservation and Forgetting: Friends or Foes?
 
Can Twitter & Co. Save Lives?
Can Twitter & Co. Save Lives?Can Twitter & Co. Save Lives?
Can Twitter & Co. Save Lives?
 
Exploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document ArchivesExploiting Time-based Synonyms in Searching Document Archives
Exploiting Time-based Synonyms in Searching Document Archives
 
Determining Time of Queries for Re-ranking Search Results
Determining Time of Queries for Re-ranking Search ResultsDetermining Time of Queries for Re-ranking Search Results
Determining Time of Queries for Re-ranking Search Results
 
Understanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of OutbreaksUnderstanding the Diversity of Tweets in the Time of Outbreaks
Understanding the Diversity of Tweets in the Time of Outbreaks
 
Dynamics of Web: Analysis and Implications from Search Perspective
Dynamics of Web: Analysis and Implications from Search  PerspectiveDynamics of Web: Analysis and Implications from Search  Perspective
Dynamics of Web: Analysis and Implications from Search Perspective
 
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
Towards Concise Preservation by Managed Forgetting: Research Issues and Case ...
 
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result DiversificationLeveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
 
Identifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world EventsIdentifying Relevant Temporal Expressions for Real-world Events
Identifying Relevant Temporal Expressions for Real-world Events
 
Time-aware Approaches to Information Retrieval
Time-aware Approaches to Information RetrievalTime-aware Approaches to Information Retrieval
Time-aware Approaches to Information Retrieval
 
Exploiting temporal information in retrieval of archived documents (doctoral ...
Exploiting temporal information in retrieval of archived documents (doctoral ...Exploiting temporal information in retrieval of archived documents (doctoral ...
Exploiting temporal information in retrieval of archived documents (doctoral ...
 
Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?Why Is It Difficult to Detect Outbreaks in Twitter?
Why Is It Difficult to Detect Outbreaks in Twitter?
 
Search, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving DataSearch, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving Data
 
Temporal Web Dynamics and Implications for Information Retrieval
Temporal Web Dynamics and Implications for Information RetrievalTemporal Web Dynamics and Implications for Information Retrieval
Temporal Web Dynamics and Implications for Information Retrieval
 
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
Learning to Rank Search Results for Time-Sensitive Queries (poster presentation)
 

Similar to Improving Temporal Language Models For Determining Time of Non-Timestamped Documents

Speeding up information extraction programs: a holistic optimizer and a learn...
Speeding up information extraction programs: a holistic optimizer and a learn...Speeding up information extraction programs: a holistic optimizer and a learn...
Speeding up information extraction programs: a holistic optimizer and a learn...
INRIA-OAK
 
A pilot on Semantic Textual Similarity
A pilot on Semantic Textual SimilarityA pilot on Semantic Textual Similarity
A pilot on Semantic Textual Similarity
pathsproject
 

Similar to Improving Temporal Language Models For Determining Time of Non-Timestamped Documents (20)

エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Speeding up information extraction programs: a holistic optimizer and a learn...
Speeding up information extraction programs: a holistic optimizer and a learn...Speeding up information extraction programs: a holistic optimizer and a learn...
Speeding up information extraction programs: a holistic optimizer and a learn...
 
Automatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course LecturesAutomatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course Lectures
 
TRECVID 2016 : Video to Text Description
TRECVID 2016 : Video to Text DescriptionTRECVID 2016 : Video to Text Description
TRECVID 2016 : Video to Text Description
 
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Temporal Web Dynamics: Implications from Search Perspective
Temporal Web Dynamics: Implications from Search PerspectiveTemporal Web Dynamics: Implications from Search Perspective
Temporal Web Dynamics: Implications from Search Perspective
 
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
 
Voice Cloning
Voice CloningVoice Cloning
Voice Cloning
 
ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts
ANEA: Automated (Named) Entity Annotation for German Domain-Specific TextsANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts
ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts
 
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Improving search with neural ranking methods
Improving search with neural ranking methodsImproving search with neural ranking methods
Improving search with neural ranking methods
 
SISAP17
SISAP17SISAP17
SISAP17
 
A pilot on Semantic Textual Similarity
A pilot on Semantic Textual SimilarityA pilot on Semantic Textual Similarity
A pilot on Semantic Textual Similarity
 
ICSE20_Tao_slides.pptx
ICSE20_Tao_slides.pptxICSE20_Tao_slides.pptx
ICSE20_Tao_slides.pptx
 

Recently uploaded

Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
Kayode Fayemi
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
amilabibi1
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
Kayode Fayemi
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
raffaeleoman
 

Recently uploaded (20)

Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
Causes of poverty in France presentation.pptx
Causes of poverty in France presentation.pptxCauses of poverty in France presentation.pptx
Causes of poverty in France presentation.pptx
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatment
 
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)
 
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
 
Air breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animalsAir breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animals
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 

Improving Temporal Language Models For Determining Time of Non-Timestamped Documents

  • 1. Improving Temporal Language Models for Determining Time of Non-Timestamped Documents Nattiya Kanhabua and Kjetil NNattiya Kanhabua and Kjetil Nøørvrvåågg Dept. of Computer Science,Dept. of Computer Science, Norwegian University of Science and Technology,Norwegian University of Science and Technology, Trondheim, NorwayTrondheim, Norway ECDL 2008 Conference,ECDL 2008 Conference, ÅÅrhusrhus DenmarkDenmark
  • 2. ECDL 2008 Norwegian University of Science and Technology 2 Agenda Motivation and Challenge Preliminaries Our Approaches Evaluation Conclusion
  • 3. ECDL 2008 Norwegian University of Science and Technology 3 Motivation Research QuestionResearch Question ““ How to improve search results in long-term archives of digital documents? ”” AnswerAnswer Extend keyword search with a temporal information -- Temporal text-containment search [Nørvåg’04] Temporal Information Timestamp, e.g. the created or updated date In local archives, timestamp can be found in document metadata which is trustable Q: Is document timestamp in WWW archive also trustable ? A: Not always, some problems: 1. A lack of metadata preservation 2. A time gap between crawling and indexing 3. Relocation of web documents
  • 4. ECDL 2008 Norwegian University of Science and Technology 4 Challenge I found a bible-like document. But I have no idea when it was created? You should ask Guru! Let’s me see… This document is probably originated in 850 A.C. with 95% confidence. ““For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?””
  • 5. ECDL 2008 Norwegian University of Science and Technology 5 Preliminaries “A model for dating documents” Temporal Language Models presented in [de Jong et al. ’04] Based on the statistic usage of words over time. Compare a non-timestamped document with a reference corpus. A reference time partition mostly overlaps in term usage -- the tentative timestamp. earthquake2004 Thailand2004 tsunami2004 tidal wave1999 Japan1999 tsunami1999 WordPartition Temporal Language Models tsunami Thailand A non-timestamped document tsunami Thailand tsunami Thailand Partition score “1999”: 1 “2004”: 1 Partition score “1999”: 1 “2004”: 1 + 1 Partition score “1999”: 1 = 1 “2004”: 1 + 1 = 2 most likely timestamp
  • 6. ECDL 2008 Norwegian University of Science and Technology 6 Proposed Approaches Three ways in improving: temporal language models 1) Data preprocessing 2) Word interpolation 3) Similarity score
  • 7. ECDL 2008 Norwegian University of Science and Technology 7 Data Preprocessing A direct comparison between extracted words in a document vs. temporal language models limits accuracy.. Only the top-ranked N according to TF-IDF scores will be selected as index terms Word filteringWord filtering Comparing 2 language models on concept level avoids a less frequency word problem Concept extractionConcept extraction Identifying the correct sense of word by analyzing context in a sentence, e.g. “bank” Word sense disambiguationWord sense disambiguation Co-occurrence of different words can alter the meaning, e.g. “United States” Collocation extractionCollocation extraction Most interesting classes of words are selected, e.g. nouns, verbs, and adjectives PartPart--ofof--speech taggingspeech tagging DescriptionDescriptionSemanticSemantic--based Preprocessingbased Preprocessing
  • 8. ECDL 2008 Norwegian University of Science and Technology 8 Word Interpolation When a word has zero probability for a time partition according to a limited size of a corpus collection, it could have a non-zero frequency in that period in other documents outside a corpus. ““ A word is categorized into one of two classes depending on characteristics occurring in time: recurring or non-recurring. ”” Related to periodic events. For example, “Summer Olympic”, “World Cup”, “French Open” Words that are not recurringnot recurring. For example, “Terrorism”, “Tsunami” Recurring Non-recurring Identify recurring words by looking at overlap of wordsoverlap of words distribution at the (flexible) endpoint of possible periods: every year or 4 years
  • 9. ECDL 2008 Norwegian University of Science and Technology 9 Word Interpolation (cont’) 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 2000 2001 2002 2003 2004 2005 2006 2007 2008 Year (a) "Terrorism" before interpolating Frequency 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 2000 2001 2002 2003 2004 2005 2006 2007 2008 Year (b) "Terrorism" after interpolating Frequency NR1 NR2 NR3 Non-recurringRecurring “ How to interpolate words depends on which category a word belongs to: recurring or non-recurring. ” 0 1000 2000 3000 4000 5000 6000 1996 2000 2004 2008 Year (a) "Olympic games" before interpolating Frequency 0 1000 2000 3000 4000 5000 6000 1996 2000 2004 2008 Year (b) "Olympic games" after interpolating Frequency
  • 10. Similarity Score
Temporal entropy: a term weighting scheme that takes temporality into account, based on the term selection method presented in [Lochbaum, Streeter '89].
- A measure of the temporal information a word conveys.
- Tells how well a term separates one partition from the others: a term occurring in few partitions has higher temporal entropy than one appearing in many partitions, and the higher the temporal entropy of a term, the better it represents a partition.
- Captures the importance of a term in the document collection, whereas TF-IDF weights a term in a particular document.
TE(wi) = 1 + (1 / log Np) * Σp P(p|wi) * log P(p|wi)
where P(p|wi) is the probability that a partition p contains the term wi, and Np is the total number of partitions in the corpus.
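Temporal entropy can be computed directly from a term's per-partition counts; the function name and the raw-count input format are assumptions of this sketch:

```python
import math

def temporal_entropy(partition_counts):
    """Temporal entropy of a term over Np time partitions:
    TE(w) = 1 + (1/log Np) * sum_p P(p|w) * log P(p|w),
    with P(p|w) estimated as the term's count in partition p divided by
    its total count. A term concentrated in one partition scores 1 (best
    representative); a term spread evenly over all partitions scores 0."""
    num_partitions = len(partition_counts)
    total = sum(partition_counts)
    te = 1.0
    for c in partition_counts:
        if c > 0:
            p = c / total
            te += (p * math.log(p)) / math.log(num_partitions)
    return te
```

For example, a term appearing only in the "2004" partition gets TE = 1, while a term spread uniformly over all partitions gets TE = 0.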
  • 11. Similarity Score (cont'd)
"By analyzing search statistics [Google Zeitgeist], we can increase the probability of a particular time partition."
A GZ score is linearly combined with the original similarity score [de Jong et al. '04]:
- P(wi) is the probability that wi occurs: P(wi) = 1.0 for a gaining query, P(wi) = 0.5 for a declining query.
- f(R) converts a rank into a weight; a higher-ranked query is more important.
- An inverse partition frequency, ipf = log(N/n).
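A sketch of the Google Zeitgeist adjustment. The slides state only that P(wi) is 1.0 for gaining and 0.5 for declining queries and that higher-ranked queries weigh more, so the concrete rank-to-weight function f(R) and the mixing weight `beta` below are assumptions:

```python
def zeitgeist_weight(rank, gaining):
    """Weight of one search-statistics query term:
    P(w) * f(R), with P(w) = 1.0 for a gaining query and 0.5 for a
    declining one. f(R) = 1/rank is an assumed rank decay."""
    p_w = 1.0 if gaining else 0.5
    f_r = 1.0 / rank
    return p_w * f_r

def combined_score(similarity_score, gz_terms, beta=0.5):
    """Linear combination of the original similarity score with the
    Google Zeitgeist score; beta is a hypothetical mixing weight.
    gz_terms: list of (rank, gaining) pairs for the document's words
    that appear in the search statistics."""
    gz = sum(zeitgeist_weight(rank, gaining) for rank, gaining in gz_terms)
    return (1 - beta) * similarity_score + beta * gz
```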
  • 12. Experimental Setting
A reference corpus:
- Documents with known dates.
- Collected from the Internet Archive.
- News history web pages, e.g. ABC News, CNN, New York Post, etc.
Build temporal language models:
- A list of words and their probabilities in each time partition.
- Intended to capture word usage within a certain time period.
Word | Probability | Partition
earthquake | 0.080 | 2004
Thailand | 0.012 | 2004
tsunami | 0.091 | 2004
tidal wave | 0.009 | 1999
Japan | 0.003 | 1999
tsunami | 0.015 | 1999
  • 13. Experiments
Constraints on the training set:
1. It must cover the domain of the documents to be dated.
2. It must cover the time period of the documents to be dated.
A reference corpus (15 sources):
- Training set: 10 news sources selected from various domains.
- Testing set: 1000 documents randomly selected from the 5 remaining sources (different from the training sources).
Precision = the fraction of documents correctly dated.
Recall = the fraction of correctly dated documents processed.
  • 14. Experiments (cont'd)
Similar to other classification tasks, the system should be able to tell how much confidence it has in assigning a timestamp.
Experiment | Evaluation Aspect | Description
A | Semantic-based preprocessing | Various combinations of semantics: 1) POS - WSD - CON - FILT, 2) POS - COLL - WSD - FILT, 3) POS - COLL - WSD - CON - FILT.
B | Temporal Entropy, Google Zeitgeist | Combinations of TE and GZ, with or without semantic-based preprocessing.
C | Dating task and confidence | Confidence is measured by the distance between the scores of the 1st- and 2nd-ranked partitions.
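The confidence measure of Experiment C can be sketched as the gap between the two best partition scores; normalizing that gap by the top score is an assumption of this sketch:

```python
def dating_confidence(partition_scores):
    """Return (best_partition, confidence) where confidence is the
    distance between the scores of the 1st- and 2nd-ranked time
    partitions, normalized by the top score (assumed normalization).
    partition_scores: dict mapping partition label -> similarity score."""
    ranked = sorted(partition_scores.items(), key=lambda kv: -kv[1])
    best, second = ranked[0], ranked[1]
    conf = (best[1] - second[1]) / abs(best[1]) if best[1] else 0.0
    return best[0], conf
```

An application that requires high precision would then keep only documents whose confidence exceeds a chosen threshold.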
  • 15. Results
[Figures: precision (%) at granularities 1-week, 1-month, 3-month, 6-month, and 12-month for (a) Baseline vs. A.1, A.2, A.3 (semantic-based preprocessing) and (b) Baseline vs. TE, GZ, S-TE, S-GZ (Temporal Entropy and Google Zeitgeist).]
- Semantic-based preprocessing increases precision at almost all granularities except 1-week; at a small granularity it is hard to achieve high accuracy.
- Applying semantic-based preprocessing first gives TE and GZ a large improvement: the preprocessing generates collocations and concepts, which are weighted highly by TE and GZ (most search statistics are noun phrases).
  • 16. Results (cont'd)
[Figure: precision and recall (%) as a function of confidence level, 0.00-1.00.]
Confidence levels and document dating accuracy: the higher the confidence, the more reliable the results.
  • 17. Conclusion
- Our approaches considerably increase quality compared to the baseline based on the previous approach.
- Applications that require high precision can select only documents whose timestamps have been determined with high confidence.
- Future research: apply other classification algorithms to document dating; introduce a weighting scheme for words and interpolate only significant words.
  • 18. Questions
Questions are welcome ☺
  • 19. Related Works
There is only a small amount of work on determining the time of documents. It is divided into two categories: determining the time of creation of a document/its contents, and determining the time of the topic of the contents. Two techniques are employed: learning-based and non-learning.
Learning-based:
- Learns from a set of training documents.
- [Swan, Allan '99] and [Swan, Jensen '00] use a statistical method called hypothesis testing; [de Jong et al. '05] is based on a statistical language model.
- Gives the most likely origination time, which is similar to the written time of a document.
Non-learning:
- Does not require a corpus collection.
- [Mani, Wilson '00] and [Llidó et al. '01] require explicitly time-tagged documents, which are resolved into a concrete or absolute date.
- Gives a summary of the times of events appearing in the document content.
  • 20. Temporal Language Models
Given a collection of corpus documents C = {d1, d2, ..., dn}, a document model is defined as di = {{w1, w2, ..., wn}, (ti, ti+1)}, where ti < ti+1 and ti < Time(di) < ti+1.
Similarity between two language models is a normalized log-likelihood ratio [Kraaij '05]:
Score(di, pj) = Σ_{w ∈ di} P(w|di) · log( P(w|pj) / P(w|C) )
where P(w|di) is the probability of word w in the document di, P(w|pj) is the probability of word w in the time partition pj, and P(w|C) is the probability of word w in the corpus collection C.
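The NLLR score above, with maximum-likelihood word models, might be implemented as follows; the tiny probability floor for words unseen in a partition is an assumption needed to avoid log(0):

```python
import math
from collections import Counter

def nllr_score(doc_tokens, partition_tokens, corpus_tokens, smooth=1e-6):
    """Normalized log-likelihood ratio [Kraaij '05] between a document
    model and a time-partition model:
    score(d, p) = sum_{w in d} P(w|d) * log(P(w|p) / P(w|C)).
    Probabilities are maximum-likelihood estimates with a small floor
    (assumed smoothing) for words absent from a token list."""
    def mle(tokens):
        counts, n = Counter(tokens), len(tokens)
        return lambda w: counts[w] / n if counts[w] else smooth
    p_d, p_p, p_c = mle(doc_tokens), mle(partition_tokens), mle(corpus_tokens)
    return sum(p_d(w) * math.log(p_p(w) / p_c(w)) for w in set(doc_tokens))

def date_document(doc_tokens, partitions, corpus_tokens):
    """Assign the time partition whose language model scores highest;
    partitions: dict mapping partition label -> token list."""
    return max(partitions,
               key=lambda p: nllr_score(doc_tokens, partitions[p], corpus_tokens))
```

Words frequent in one partition but rare in the corpus overall (e.g. "tsunami" in 2004) contribute large positive terms, pulling the document toward that partition.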