Universitatea Politehnica București
Facultatea de Automatică și Calculatoare
Catedra de Calculatoare
Determine the Time Period When a Text
Was Written Using Time Series Analysis
Costin-Gabriel CHIRU, Madalina Toia
costin.chiru@cs.pub.ro
Introduction (1)
• Purpose: an application that uses time series analysis (TSA) to determine
the time period during which a text was written.
• Applications:
– Digitizing old books: their publication date cannot always be
determined;
– Web search: returning pages in chronological order.
• Methodology: identify
– The words' importance for the analyzed text and
– The words' fingerprints, to find the time interval when they were
most probably used.
28.06.2016 ITISE 2016 1
Introduction (2)
• It is difficult to determine when an undated text
was written, especially when the writer is
unknown.
• Linguists can approximate the time period
based on the events described in the text:
– Reliable when the text describes events that
happened close to the moment of writing.
– Does it work for other types of documents:
economic, medical, science fiction?
• Using language variation instead → changes are slow and spread out
over long time periods → wide estimates.
Methodology
• Determine the words' relevance for the
analyzed text based on:
– The words' frequencies in the text and
– The words' properties (capitalization, number of
characters).
• Determine the time frame when they were most
likely to be used:
– Using the Google Books N-gram corpus (~5 million books,
about 4% of all the books ever written, ~5 billion words);
– Time and storage constraints → only unigrams are considered.
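The relevance step above can be sketched as follows. This is only an illustration under stated assumptions: the slides name the inputs (frequency, capitalization, number of characters) but not how they are combined, so the function name `word_relevance`, the multiplicative formula and the `cap_bonus` and `length_bonus` weights are all hypothetical.

```python
import re
from collections import Counter


def word_relevance(text, cap_bonus=0.5, length_bonus=0.1):
    """Score each distinct word of `text`; a higher score means more relevant.

    Hypothetical scoring: relative frequency, boosted when the word
    appears capitalized and boosted further for longer words.
    """
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    if not tokens:
        return {}
    counts = Counter(t.lower() for t in tokens)
    # Words seen capitalized at least once. This also catches ordinary
    # sentence-initial words, which a real system would have to filter.
    capitalized = {t.lower() for t in tokens if t[0].isupper()}
    total = len(tokens)
    scores = {}
    for word, n in counts.items():
        score = n / total                      # relative frequency in the text
        if word in capitalized:
            score *= 1 + cap_bonus             # assumed capitalization boost
        score *= 1 + length_bonus * len(word)  # assumed length boost
        scores[word] = score
    return scores
```

With these assumed weights, a repeated, capitalized, long word such as "Hogwarts" outranks short function words, which matches the intent of the bullet points rather than any formula given by the authors.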
Related work
• Concerned with the change of words’ form or
meaning over time.
• Relies on the assumption that a document's
time stamp can be determined by the largest
overlap of its words' fingerprints.
• Resources: newspapers, Google Books N-gram Corpus,
Google Zeitgeist.
• Techniques: eliminating the time periods when
the document could not have been written,
classifiers (SVM and Naive Bayes), N-grams.
Application's Design
• 1. User interface – the user inputs the text to be processed
• 2. The application processes the text and extracts its vocabulary
• 3. Word Processing – extracts each word’s time series (TS): smoothed unigram frequencies
• 4. Simple Peak Detection (SPD) – finds the moments in time when the word was
most likely to have been used (based on the mean and standard deviation)
• 5. Earth Mover’s Distance (EMD) – filters the intervals, keeping
only the ones with the highest probability (the word’s
fingerprint)
• 6. Reducing unit – the intervals received from the EMD module for all words
are overlapped to obtain the document's time stamp
• 7. User interface – the results are presented to the user
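Steps 3 and 4 can be illustrated with a minimal sketch. The slides only state that the unigram frequencies are smoothed and that peaks are found from the mean and standard deviation, so the rest is assumed: the centered moving average, the threshold `mean + k * std`, and the names `smooth`, `peak_intervals`, `k` and `window` are hypothetical choices, not the authors' implementation.

```python
from statistics import mean, pstdev


def smooth(values, window=3):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    return [mean(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def peak_intervals(years, freqs, k=1.0, window=3):
    """Return inclusive (start_year, end_year) runs where the smoothed
    series is above mean + k * standard deviation (assumed threshold)."""
    if not freqs:
        return []
    series = smooth(freqs, window)
    threshold = mean(series) + k * pstdev(series)
    intervals, start, end = [], None, None
    for year, value in zip(years, series):
        if value > threshold:
            if start is None:
                start = year       # a new run begins
            end = year             # extend the current run
        elif start is not None:
            intervals.append((start, end))
            start = None
    if start is not None:          # series ended inside a run
        intervals.append((start, end))
    return intervals
```

A flat series yields no intervals under this threshold because nothing rises above the mean, which is one reason the "Sources of errors" slide mentions accepting plateaus as future work.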
[Diagram: User Interface, Word Processing, Simple Peak Detection, EMD Algorithm, Reducing Unit, Database]
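The Reducing Unit overlaps the intervals kept for each word to obtain the document's time stamp. One possible reading of that step is a per-year vote, sketched below. The voting scheme, the optional per-word `weights` and the function name `date_document` are assumptions made for illustration; the slides do not say how the overlap is computed.

```python
from collections import defaultdict


def date_document(word_intervals, weights=None):
    """Overlap per-word intervals and return the best-supported period.

    word_intervals: {word: [(start_year, end_year), ...]}, inclusive years.
    weights: optional {word: relevance}; missing words count as 1.0.
    Returns (start, end, score) for the earliest run of consecutive years
    with the highest total weight, or None if there are no intervals.
    """
    weights = weights or {}
    votes = defaultdict(float)
    for word, intervals in word_intervals.items():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end + 1))
        for year in covered:       # a word votes at most once per year
            votes[year] += weights.get(word, 1.0)
    if not votes:
        return None
    best = max(votes.values())
    best_years = sorted(y for y, v in votes.items() if v == best)
    start = end = best_years[0]
    for year in best_years[1:]:    # extend while the years are consecutive
        if year != end + 1:
            break
        end = year
    return start, end, best
```

Passing the word relevance scores as `weights` would let the more important words dominate the vote; whether the original application weights words this way is not stated.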
Case Study 1
• The program works best with books that introduce new words in
the language.
– Harry Potter introducing words like Quidditch or Hogwarts
– Frequency ↗ around 1998 (when the first book was published), ↘ in 2007 (after the last
book was written)
– A Game of Thrones or Lord of the Rings, with words like direwolf or hobbit
Case Study 2
Medici and Cesare: strong fingerprint around 1565
Clement VII and Henry VIII: 2 spikes, around 1520 (Catherine of Aragon) and between 1560 and 1570
• History books are easier to date due to historical events and
famous people's names.
– Niccolo Machiavelli's Il Principe (1532, in Italian), in which Lorenzo
de' Medici, Cesare Borgia, Henry VIII and Pope Clement are important
characters → detected in 1563
Other Results
Sources of errors
• Difficult or impossible to correct:
– the small number of publications from the 16th and 17th
centuries → spikes in the words’ TS during that time;
– wrongly dated books inside the corpus;
– authors tend to write about events some time after they happen →
delays in identifying the publishing year.
• To be corrected in future work:
– using only unigrams (“dark matter” – spike after 1980);
– using the whole dataset (including unreliable data before 1800);
– the way the words are fingerprinted (a different weighting
system, considering the words’ part of speech and whether they are
neologisms/archaisms);
– the way the relevant intervals are combined (also accept
plateaus, not only spikes).
Conclusions
• Automatic dating of a document is a viable
solution.
• The results are strongly influenced by the
corpus used for extracting the words' time
series.
• The words' fingerprinting works best for
documents containing words strongly linked
to a time period: historical books, science
fiction publications or newspaper articles.
Questions
Thank you very much!

More Related Content

Similar to Determine the time period when a text was written using time series analysis

Being Practical. Electronic editions of Flemish literary texts and documents ...
Being Practical. Electronic editions of Flemish literary texts and documents ...Being Practical. Electronic editions of Flemish literary texts and documents ...
Being Practical. Electronic editions of Flemish literary texts and documents ...Edward Vanhoutte
 
Communities and Ancestors Associated with Egon Börger and ASM
Communities and Ancestors Associated with Egon Börger and ASMCommunities and Ancestors Associated with Egon Börger and ASM
Communities and Ancestors Associated with Egon Börger and ASMJonathan Bowen
 
Mining, Representation and Reasoning with Temporal Expressions in the Legal D...
Mining, Representation and Reasoning with Temporal Expressions in the Legal D...Mining, Representation and Reasoning with Temporal Expressions in the Legal D...
Mining, Representation and Reasoning with Temporal Expressions in the Legal D...María Navas Loro
 
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...Digital Classicist Seminar Berlin
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and futureRoi Blanco
 
The Art of Inquiry and Cross-Text Connections
The Art of Inquiry and Cross-Text ConnectionsThe Art of Inquiry and Cross-Text Connections
The Art of Inquiry and Cross-Text ConnectionsPam Page
 
Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...
Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...
Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...Scottish Language Dictionaries
 
Detecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaDetecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaTraian Rebedea
 
Dynamics of Web: Analysis and Implications from Search Perspective
Dynamics of Web: Analysis and Implications from Search  PerspectiveDynamics of Web: Analysis and Implications from Search  Perspective
Dynamics of Web: Analysis and Implications from Search PerspectiveNattiya Kanhabua
 
Cultural Heritage: when data are much worst than one can believe
Cultural Heritage: when data are much worst than one can believe Cultural Heritage: when data are much worst than one can believe
Cultural Heritage: when data are much worst than one can believe Research Data Alliance
 
Timo Honkela: Spaces of Knowledge
Timo Honkela: Spaces of KnowledgeTimo Honkela: Spaces of Knowledge
Timo Honkela: Spaces of KnowledgeTimo Honkela
 
It services & research methods
It services & research methodsIt services & research methods
It services & research methodsAkanshShandilya
 
RDAP 15 Data Management Outreach for the Humanities: A University of Illinois...
RDAP 15 Data Management Outreach for the Humanities: A University of Illinois...RDAP 15 Data Management Outreach for the Humanities: A University of Illinois...
RDAP 15 Data Management Outreach for the Humanities: A University of Illinois...ASIS&T
 
Topic Maps for the Three Kingdoms: The Many Applications of Topic Maps
Topic Maps for the Three Kingdoms: The Many Applications of Topic MapsTopic Maps for the Three Kingdoms: The Many Applications of Topic Maps
Topic Maps for the Three Kingdoms: The Many Applications of Topic MapsSteve Pepper
 
Cork AI Meetup Number 3
Cork AI Meetup Number 3Cork AI Meetup Number 3
Cork AI Meetup Number 3Nick Grattan
 

Similar to Determine the time period when a text was written using time series analysis (20)

Being Practical. Electronic editions of Flemish literary texts and documents ...
Being Practical. Electronic editions of Flemish literary texts and documents ...Being Practical. Electronic editions of Flemish literary texts and documents ...
Being Practical. Electronic editions of Flemish literary texts and documents ...
 
Communities and Ancestors Associated with Egon Börger and ASM
Communities and Ancestors Associated with Egon Börger and ASMCommunities and Ancestors Associated with Egon Börger and ASM
Communities and Ancestors Associated with Egon Börger and ASM
 
Mining, Representation and Reasoning with Temporal Expressions in the Legal D...
Mining, Representation and Reasoning with Temporal Expressions in the Legal D...Mining, Representation and Reasoning with Temporal Expressions in the Legal D...
Mining, Representation and Reasoning with Temporal Expressions in the Legal D...
 
Data Mining Newspapers Metadata
Data Mining Newspapers MetadataData Mining Newspapers Metadata
Data Mining Newspapers Metadata
 
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and future
 
The Art of Inquiry and Cross-Text Connections
The Art of Inquiry and Cross-Text ConnectionsThe Art of Inquiry and Cross-Text Connections
The Art of Inquiry and Cross-Text Connections
 
Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...
Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...
Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...
 
Detecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaDetecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large Corpora
 
Dynamics of Web: Analysis and Implications from Search Perspective
Dynamics of Web: Analysis and Implications from Search  PerspectiveDynamics of Web: Analysis and Implications from Search  Perspective
Dynamics of Web: Analysis and Implications from Search Perspective
 
Master of Research Webinar
Master of Research WebinarMaster of Research Webinar
Master of Research Webinar
 
Television News Search and Analysis with Lucene/Solr
Television News Search and Analysis with Lucene/SolrTelevision News Search and Analysis with Lucene/Solr
Television News Search and Analysis with Lucene/Solr
 
Cultural Heritage: when data are much worst than one can believe
Cultural Heritage: when data are much worst than one can believe Cultural Heritage: when data are much worst than one can believe
Cultural Heritage: when data are much worst than one can believe
 
Timo Honkela: Spaces of Knowledge
Timo Honkela: Spaces of KnowledgeTimo Honkela: Spaces of Knowledge
Timo Honkela: Spaces of Knowledge
 
It services & research methods
It services & research methodsIt services & research methods
It services & research methods
 
RDAP 15 Data Management Outreach for the Humanities: A University of Illinois...
RDAP 15 Data Management Outreach for the Humanities: A University of Illinois...RDAP 15 Data Management Outreach for the Humanities: A University of Illinois...
RDAP 15 Data Management Outreach for the Humanities: A University of Illinois...
 
Term paper guideline
Term paper guidelineTerm paper guideline
Term paper guideline
 
Date And Time Converter
Date And Time ConverterDate And Time Converter
Date And Time Converter
 
Topic Maps for the Three Kingdoms: The Many Applications of Topic Maps
Topic Maps for the Three Kingdoms: The Many Applications of Topic MapsTopic Maps for the Three Kingdoms: The Many Applications of Topic Maps
Topic Maps for the Three Kingdoms: The Many Applications of Topic Maps
 
Cork AI Meetup Number 3
Cork AI Meetup Number 3Cork AI Meetup Number 3
Cork AI Meetup Number 3
 

More from University Politehnica Bucharest

PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic AnalysisPhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic AnalysisUniversity Politehnica Bucharest
 
Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...University Politehnica Bucharest
 
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...University Politehnica Bucharest
 
Using machine learning to generate predictions based on the information extra...
Using machine learning to generate predictions based on the information extra...Using machine learning to generate predictions based on the information extra...
Using machine learning to generate predictions based on the information extra...University Politehnica Bucharest
 
Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...University Politehnica Bucharest
 
Movie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profileMovie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profileUniversity Politehnica Bucharest
 
Tracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corporaTracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corporaUniversity Politehnica Bucharest
 
The collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case studyThe collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case studyUniversity Politehnica Bucharest
 
Unsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesisUnsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesisUniversity Politehnica Bucharest
 
Tweets topic modelling across different countries prezentarea
Tweets topic modelling across different countries   prezentareaTweets topic modelling across different countries   prezentarea
Tweets topic modelling across different countries prezentareaUniversity Politehnica Bucharest
 
Nlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chatsNlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chatsUniversity Politehnica Bucharest
 
2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...University Politehnica Bucharest
 

More from University Politehnica Bucharest (20)

PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic AnalysisPhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
 
Time series analysis for sales prediction
Time series analysis for sales predictionTime series analysis for sales prediction
Time series analysis for sales prediction
 
Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...
 
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
 
Identifying cyclic words with the help of google
Identifying cyclic words with the help of googleIdentifying cyclic words with the help of google
Identifying cyclic words with the help of google
 
Expression of Political Opinions in Press
Expression of Political Opinions in PressExpression of Political Opinions in Press
Expression of Political Opinions in Press
 
Using machine learning to generate predictions based on the information extra...
Using machine learning to generate predictions based on the information extra...Using machine learning to generate predictions based on the information extra...
Using machine learning to generate predictions based on the information extra...
 
Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...
 
Movie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profileMovie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profile
 
Tracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corporaTracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corpora
 
The collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case studyThe collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case study
 
Archaisms and neologisms identification in texts
Archaisms and neologisms identification in textsArchaisms and neologisms identification in texts
Archaisms and neologisms identification in texts
 
Unsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesisUnsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesis
 
Tweets topic modelling across different countries prezentarea
Tweets topic modelling across different countries   prezentareaTweets topic modelling across different countries   prezentarea
Tweets topic modelling across different countries prezentarea
 
Sentiment based text segmentation
Sentiment based text segmentationSentiment based text segmentation
Sentiment based text segmentation
 
Creativity detection in texts
Creativity detection in textsCreativity detection in texts
Creativity detection in texts
 
Nlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chatsNlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chats
 
Detecting discourse creativity in chat conversations
Detecting discourse creativity in chat conversationsDetecting discourse creativity in chat conversations
Detecting discourse creativity in chat conversations
 
Metaphor detection
Metaphor detectionMetaphor detection
Metaphor detection
 
2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...
 

Recently uploaded

Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |aasikanpl
 
Cytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxCytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxVarshiniMK
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxFarihaAbdulRasheed
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptArshadWarsi13
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Heredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsHeredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsCharlene Llagas
 
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaPraksha3
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555kikilily0909
 

Recently uploaded (20)

Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
 
Cytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxCytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptx
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.ppt
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Heredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsHeredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of Traits
 
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555
 

Determine the time period when a text was written using time series analysis

  • 1. Determine the Time Period When a Text Was Written Using Time Series Analysis
      Costin-Gabriel CHIRU, Madalina Toia (costin.chiru@cs.pub.ro)
      Universitatea Politehnica București, Faculty of Automatic Control and Computers, Department of Computers
  • 2. Introduction (1)
      • Purpose: an application using time series analysis (TSA) to determine the time period during which a text was written.
      • Applications:
        – Digitizing old books: their publication date cannot always be determined;
        – Web search: pages returned in chronological order.
      • Methodology: identify
        – Words' importance for the analyzed text and
        – Words' fingerprints, to find the time interval when they were most probably used.
      28.06.2016 ITISE 2016
  • 3. Introduction (2)
      • It is difficult to determine when an undated text was written, especially when the writer is unknown.
      • Linguists can approximate the period of time based on the events described in the paper
        – Reliable when the texts describe events that happened close to the moment of writing.
        – Does it work for other types of documents: economic, medical, science fiction?
      • Use language variations → slow and spread out over large time periods → wide estimates.
  • 4. Methodology
      • Determine the words' relevance for the analyzed text based on:
        – Words' frequencies in the text and
        – Words' properties (capitalization, number of characters).
      • Determine the time frame when they were most likely to be used:
        – Using the Google Books N-gram corpus (~ 5 mil. books, 4% of all the books ever written, ~ 5 bill. words);
        – Consider time & storage constraints → unigrams.
  • 5. Related work
      • Concerned with the change of words' form or meaning over time.
      • Rely on the assumption that the document's time-stamp can be determined by the biggest overlap of its words' fingerprints.
      • Resources: newspapers, Google Books N-gram Corpus, Google Zeitgeist
      • Techniques: eliminate periods of time when the document could not have been written, classifiers (SVM & Naive Bayes), N-grams
  • 6. Application's Design
      • 1. User interface: the user inputs the text to be processed
      • 2. The application processes the text and extracts its vocabulary
      • 3. Word Processing: extract each word's TS (smoothed unigram frequencies)
      • 4. Simple Peak Detection (SPD): moments of time when that word is most likely to have been used (based on mean and standard deviation)
      • 5. Earth Mover's Distance (EMD): intervals are filtered out → keep only the ones having the highest probability (the words' fingerprints)
      • 6. Reducing unit: the intervals received from the EMD module for all words are overlapped in order to obtain the document time-stamp
      • 7. User interface: results are presented to the user
      [Architecture diagram: User Interface, Word Processing, Simple Peak Detection, EMD Algorithm, Reducing Unit, Database]
  • 8. Case Study 1
      • The program works best with books that introduce new words into the language.
        – Harry Potter, introducing words like Quidditch or Hogwarts
        – Freq. ↗ around 1998 (first book was published), ↘ in 2007 (after the last book was written)
        – A Game of Thrones or Lord of the Rings, with words like direwolf or hobbit
  • 9. Case Study 2
      [Plots] Medici and Cesare: strong fingerprint around 1565. Clement VII and Henry VIII: 2 spikes, around 1520 (Catherine of Aragon) and between 1560 and 1570.
      • History books are easier to date due to historical events and famous people's names.
        – Niccolo Machiavelli's Il Principe (1532 in Italian), in which Lorenzo Medici, Cesare Borgia, Henry VIII and Pope Clement are important characters → detected in 1563
  • 11. Sources of errors
      • Difficult/impossible to correct:
        – small number of publications from the 16th and 17th centuries → spikes in words' TS during that time;
        – wrongly dated books inside the corpus;
        – authors tend to write about past events after a while → delays in identifying the publishing year.
      • To be corrected in future work:
        – using the unigrams ("dark matter": spike after 1980);
        – using the whole dataset (including unreliable data < 1800);
        – the way the words are fingerprinted (different weighting system, consider words' part-of-speech, whether they are neologisms/archaisms);
        – the way the relevant intervals are combined (also accept plateaus, not only spikes).
  • 12. Conclusions
      • Automatic dating of a document is a viable solution.
      • The results are strongly influenced by the corpus used for extracting the words' time series.
      • The words' fingerprinting works best for documents containing words strongly linked to a time period: historical books, science fiction publications or newspaper articles.
  • 13. Questions
      Thank you very much!
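Steps 3 to 5 of the application's design (smoothed unigram frequencies, Simple Peak Detection based on mean and standard deviation, and an Earth Mover's Distance comparison) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the moving-average window, the threshold rule `mean + k * std`, the parameter `k`, and all function names are hypothetical. The EMD helper relies only on the known fact that, for two histograms over the same equally spaced bins, the one-dimensional EMD equals the sum of absolute differences between their cumulative distributions.

```python
from statistics import mean, pstdev


def smooth(freqs, window=3):
    """Centered moving average; the window is truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(freqs)):
        lo, hi = max(0, i - half), min(len(freqs), i + half + 1)
        out.append(sum(freqs[lo:hi]) / (hi - lo))
    return out


def simple_peak_detection(years, freqs, k=1.0):
    """Return (start, end) year intervals where freq > mean + k * std.

    The threshold rule is an assumption; the slides only say that the
    detection is based on the mean and the standard deviation.
    """
    threshold = mean(freqs) + k * pstdev(freqs)
    intervals, start, end = [], None, None
    for year, f in zip(years, freqs):
        if f > threshold:
            if start is None:
                start = year
            end = year
        elif start is not None:
            intervals.append((start, end))
            start = None
    if start is not None:  # series ended while still above the threshold
        intervals.append((start, end))
    return intervals


def emd_1d(p, q):
    """1-D Earth Mover's Distance between two histograms on the same bins.

    Both inputs are normalized to sum to 1; the distance is the sum of
    absolute differences between the cumulative distributions.
    """
    sp, sq = sum(p), sum(q)
    cum, total = 0.0, 0.0
    for a, b in zip(p, q):
        cum += a / sp - b / sq
        total += abs(cum)
    return total
```

A candidate interval could then be scored by comparing its normalized frequencies against a reference shape with `emd_1d` and keeping the intervals with the smallest distance; how the reference shape is chosen is left open here.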

Editor's Notes

  1. 5,195,769 books 4% of all the
  2. SPD involves much more computation, O(n^3) → it tries to select only those areas of the graph that resemble a Gaussian distribution. Reducing unit: considers words' frequencies and their properties: increase the weights of capitalized words; decrease the weights of words having fewer than 4 letters.
  3. Spikes in the graph are caused by unusual uses of a word, due to the fact that it is linked to some important events. The problem with the SPD algorithm is that it detects very large time intervals → the SPD interval for "ice cream" is [1905, 2008].
  4. The first spike is explained by the fact that during that time Henry VIII was trying to divorce Catherine of Aragon and Pope Clement VII did not allow it, forcing the king to establish his own church.
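The reducing unit described in note 2 (weights raised for capitalized words, lowered for words shorter than 4 letters, then intervals overlapped) might look like the sketch below. The weight factors `cap_boost=1.5` and `short_penalty=0.5`, the per-year voting scheme, and the function names are all assumptions made for illustration; the slides do not give the actual values or the combination rule.

```python
from collections import defaultdict
from math import isclose


def word_weight(word, count, cap_boost=1.5, short_penalty=0.5):
    """Weight of a word: its frequency in the text, adjusted by its properties.

    The boost and penalty factors are hypothetical placeholders.
    """
    weight = float(count)
    if word[:1].isupper():   # capitalized words count more
        weight *= cap_boost
    if len(word) < 4:        # words with fewer than 4 letters count less
        weight *= short_penalty
    return weight


def combine_fingerprints(fingerprints, counts):
    """Overlap all words' intervals and return the best-scoring years.

    fingerprints: word -> list of (start_year, end_year) intervals
    counts:       word -> number of occurrences in the analyzed text
    """
    scores = defaultdict(float)
    for word, intervals in fingerprints.items():
        w = word_weight(word, counts.get(word, 1))
        for start, end in intervals:
            for year in range(start, end + 1):
                scores[year] += w
    if not scores:
        return []
    best = max(scores.values())
    return sorted(y for y, s in scores.items() if isclose(s, best))
```

For example, with fingerprints `{"Hogwarts": [(1998, 2007)], "the": [(1800, 2008)], "wizard": [(1995, 2000)]}` and counts `{"Hogwarts": 4, "the": 10, "wizard": 2}`, the years where all three intervals overlap receive the highest score.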