Language is a living corpus: words are created and disappear over time, and the degree to which certain words are used fluctuates due to historical events, cultural movements or scientific discoveries. These changes in the language are reflected in written texts, and thus, by tracking them, one can determine the moment when those texts were written. In this paper, we present an application that uses time series analysis built on top of the Google Books N-gram corpus to determine the time period during which a text was written. The application is based on words' fingerprinting, to find the time interval when they were most probably used, and on the words' importance for the given text. Combining the fingerprints of all the text's words according to their importance allows the time stamping of that text.
2. Introduction (1)
• Purpose: an application using TSA to determine
the time period during which a text was
written.
• Applications:
– Digitizing old books: the date of their publishing
cannot always be determined;
– Web search: pages returned in chronological order.
• Methodology: identify
– Words' importance for the analyzed text and
– Words' fingerprinting to find the time interval
when they were most probably used.
28.06.2016 ITISE 2016 1
3. Introduction (2)
• Difficult to determine when an undated text
was written, especially when the writer is
unknown.
• Linguists can approximate the period of time
based on the events described in the paper
– Reliable when the texts are describing events that
happened close to the moment of writing.
– Does it work for other types of documents:
economic, medical, science fiction?
• Using language variations: changes are slow and
spread over large time periods → wide estimations.
4. Methodology
• Determine the words' relevance for the
analyzed text based on:
– Words' frequencies from the text and
– Words' properties (capitalization, number of
characters).
• Determine the time frame when they were
most likely to be used:
– Using the Google Books N-gram corpus (~5 mil. books,
~4% of all the books ever written; ~5 bill. words);
– considering time & storage constraints → only unigrams.
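The frequency-and-properties weighting described above (and detailed in the notes: boost capitalized words, penalize words shorter than 4 letters) can be sketched minimally; the factors 1.5 and 0.5 below are illustrative assumptions, not the application's actual values:

```python
from collections import Counter

def word_relevance(words):
    """Toy relevance score: raw frequency, boosted for capitalized words
    and penalized for words shorter than 4 letters.
    The boost/penalty factors (1.5 and 0.5) are illustrative guesses."""
    counts = Counter(w.lower() for w in words)
    scores = {}
    for w in set(words):
        score = counts[w.lower()]
        if w[0].isupper():   # proper nouns are strong date markers
            score *= 1.5
        if len(w) < 4:       # short function words carry little signal
            score *= 0.5
        scores[w] = score
    return scores

text = "Cesare Borgia met the Medici in Florence and the Medici court"
scores = word_relevance(text.split())
print(scores["Medici"], scores["the"])  # 3.0 1.0
```

Proper nouns such as "Medici" end up dominating common short words such as "the", which is the intent of the weighting.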
5. Related work
• Concerned with the change of words’ form or
meaning in time.
• Rely on the assumption that the document's
time-stamp can be determined by the largest
overlap of its words' fingerprints.
• Resources: newspapers, Google Books N-
gram Corpus, Google Zeitgeist
• Techniques: eliminate periods of time when
the document could not have been written,
classifiers (SVM & Naive Bayes), N-grams
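The first technique listed above, eliminating periods when the document could not have been written, can be illustrated with a minimal sketch: a document cannot predate the first attestation of any word it contains. The `first_seen` values below are illustrative, not real corpus statistics:

```python
def earliest_possible_year(doc_words, first_seen):
    """Lower bound on the writing date: the latest 'first seen' year
    among the document's words, since a text cannot predate the first
    recorded use of any word it contains.
    `first_seen` maps word -> first year it appears in the corpus."""
    known = [first_seen[w] for w in doc_words if w in first_seen]
    return max(known) if known else None

first_seen = {"telegraph": 1794, "quidditch": 1997, "castle": 1100}
print(earliest_possible_year(["castle", "quidditch"], first_seen))  # 1997
```

A single late-attested word ("quidditch") is enough to rule out every earlier period, regardless of how old the other words are.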
6. Application's Design
• 1. User interface – user inputs the text to be processed
• 2. The application processes the text and extracts its vocabulary
• 3. Word Processing – extract the word’s TS (smoothed unigrams freq.)
• 4. Simple Peak Detection (SPD) – moments of time when that word was
most probably used (based on mean and standard dev.)
• 5. Earth Mover’s Distance (EMD) – intervals are filtered, maintaining
only the ones with the highest probability (the words’
fingerprints)
• 6. Reducing unit – intervals received from EMD module for all words
are overlapped in order to obtain the document time-stamp
• 7. User interface – results are presented to the user
[Architecture diagram: User Interface → Word Processing → Simple Peak Detection → EMD Algorithm → Reducing Unit, backed by a Database]
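A minimal sketch of the SPD step, assuming the "mean and standard dev." rule means flagging years whose smoothed frequency exceeds mean + k·σ and merging consecutive flagged years into intervals (the threshold form is an assumption about the slides' description):

```python
from statistics import mean, stdev

def simple_peak_detection(years, freqs, k=1.0):
    """Flag years whose frequency exceeds mean + k * stdev, then merge
    runs of consecutive flagged years into (start, end) intervals."""
    mu, sigma = mean(freqs), stdev(freqs)
    flagged = [y for y, f in zip(years, freqs) if f > mu + k * sigma]
    intervals, start = [], None
    for i, y in enumerate(flagged):
        if start is None:
            start = y
        # close the interval when the run of consecutive years ends
        if i == len(flagged) - 1 or flagged[i + 1] != y + 1:
            intervals.append((start, y))
            start = None
    return intervals

years = list(range(1990, 2010))
freqs = [1] * 8 + [9, 10, 9] + [1] * 9   # spike around 1998-2000
print(simple_peak_detection(years, freqs))  # [(1998, 2000)]
```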
8. Case Study 1
• The program works best with books that introduce new words in
the language.
– Harry Potter introducing words like Quidditch or Hogwarts
– Freq. ↗ around 1998 (when the first book was published), ↘ in 2007 (after
the last book was written)
– A Game of Thrones or Lord of the Rings, with words like direwolf or hobbit
9. Case Study 2
[Figure: word fingerprints – "Medici" and "Cesare" show a strong fingerprint around 1565; "Clement VII" and "Henry VIII" show two spikes: around 1520 (Katherine of Aragon) and between 1560–1570]
• History books are easier to date due to historical events and
famous people's names.
– Niccolò Machiavelli's Il Principe (1532, in Italian), in which Lorenzo
Medici, Cesare Borgia, Henry VIII and Pope Clement are important
characters, was dated to 1563
11. Sources of errors
• Difficult/impossible to correct:
– the small number of publications from the 16th and 17th
centuries → spikes in words’ TS during that time;
– wrongly dated books inside the corpus;
– authors tend to write about past events after a while
→ delays in identifying the publishing year.
• To be corrected in the future work:
– using only unigrams (“dark matter” – spike after 1980);
– using the whole dataset (including unreliable data < 1800);
– the way the words are fingerprinted (different weighting
system, consider words’ part-of-speech, if they are
neologisms/archaisms);
– the way the relevant intervals are combined (also accept
plateaus, not only spikes).
12. Conclusions
• Automatic dating of a document is a viable
solution.
• The results are strongly influenced by the
corpus used for extracting the words' time
series.
• The words' fingerprinting works best for
documents containing words strongly linked
to a time period: historical books, science
fiction publications or newspaper articles.
SPD involves much more computation – O(n³) – as it tries to select only those areas of the graph that resemble a Gaussian distribution
Reducing unit – considers words’ frequencies and their properties: it increases the weights of capitalized words and decreases the weights of words having fewer than 4 letters
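The reducing unit's weighted overlap could be sketched as follows; the additive voting scheme and the weight values are assumptions for illustration, not the application's actual implementation:

```python
from collections import defaultdict

def reduce_intervals(word_intervals, weights):
    """Overlap the words' fingerprint intervals: every year inside a
    word's interval accumulates that word's weight, and the year with
    the highest total vote is taken as the document's time-stamp."""
    votes = defaultdict(float)
    for word, (start, end) in word_intervals.items():
        for year in range(start, end + 1):
            votes[year] += weights.get(word, 1.0)
    return max(votes, key=votes.get)

word_intervals = {"Medici": (1550, 1580), "Cesare": (1555, 1575),
                  "Clement": (1560, 1570)}
weights = {"Medici": 3.0, "Cesare": 1.5, "Clement": 1.5}
print(reduce_intervals(word_intervals, weights))
```

All three hypothetical fingerprints overlap in 1560–1570, so the estimate falls in that decade; heavier words pull the estimate toward their own intervals.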
Spikes in the graph can be caused by unusual uses of a word when it is linked to some important event
A problem of the SPD algorithm is that it detects time intervals that are very large: the SPD interval for “ice cream” is [1905, 2008]
The first spike is explained by the fact that, during that time, Henry VIII was trying to divorce Katherine of Aragon and Pope Clement VII did not allow it, forcing the king to found his own church