Language is a living corpus: words are created and disappear over time, and the degree to which certain words are used fluctuates due to historical events, cultural movements or scientific discoveries. These changes in the language are reflected in written texts, and thus, by tracking them, one can determine the moment when those texts were written. In this paper, we present an application that uses time series analysis built on top of the Google Books N-gram corpus to determine the time period during which a text was written. The application is based on words' fingerprinting, to find the time interval when they were most probably used, and on the words' importance for the given text. Combining the fingerprints of all the text's words according to their importance allows the time stamping of that text.
2. Introduction (1)
• Purpose: an application using TSA to determine
the time period during which a text was
written.
• Applications:
– Digitizing old books: the date of their publishing
cannot always be determined;
– Web search: pages returned in chronological order.
• Methodology: identify
– Words' importance for the analyzed text and
– Words' fingerprinting to find the time interval
when they were most probably used.
28.06.2016 ITISE 2016 1
3. Introduction (2)
• Difficult to determine when an undated text
was written, especially when the writer is
unknown.
• Linguists can approximate the period of time
based on the events described in the paper
– Reliable when the texts are describing events that
happened close to the moment of writing.
– Does it work for other types of documents:
economic, medical, science fiction?
• Using language variations: changes are slow and
spread over large time periods → wide estimations.
4. Methodology
• Determine the words' relevance for the
analyzed text based on:
– Words' frequencies from the text and
– Words' properties (capitalization, number of
characters).
• Determine the time frame when they were
most likely to be used:
– Using the Google Books N-gram corpus (~5 mil. books,
~4% of all the books ever written; ~5 bill. words);
– considering time & storage constraints → only unigrams.
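The frequency-and-properties weighting described above (and detailed in the notes: boost capitalized words, penalize words shorter than 4 letters) can be sketched minimally; the factors 1.5 and 0.5 below are illustrative assumptions, not the application's actual values:

```python
from collections import Counter

def word_relevance(words):
    """Toy relevance score: raw frequency, boosted for capitalized words
    and penalized for words shorter than 4 letters.
    The boost/penalty factors (1.5 and 0.5) are illustrative guesses."""
    counts = Counter(w.lower() for w in words)
    scores = {}
    for w in set(words):
        score = counts[w.lower()]
        if w[0].isupper():   # proper nouns are strong date markers
            score *= 1.5
        if len(w) < 4:       # short function words carry little signal
            score *= 0.5
        scores[w] = score
    return scores

text = "Cesare Borgia met the Medici in Florence and the Medici court"
scores = word_relevance(text.split())
print(scores["Medici"], scores["the"])  # 3.0 1.0
```

Proper nouns such as "Medici" end up dominating common short words such as "the", which is the intent of the weighting.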
5. Related work
• Concerned with the change of words’ form or
meaning in time.
• Rely on the assumption that the document's
time-stamp can be determined by the largest
overlap of its words' fingerprints.
• Resources: newspapers, Google Books N-
gram Corpus, Google Zeitgeist
• Techniques: eliminate periods of time when
the document could not have been written,
classifiers (SVM & Naive Bayes), N-grams
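The first technique listed above, eliminating periods when the document could not have been written, can be illustrated with a minimal sketch: a document cannot predate the first attestation of any word it contains. The `first_seen` values below are illustrative, not real corpus statistics:

```python
def earliest_possible_year(doc_words, first_seen):
    """Lower bound on the writing date: the latest 'first seen' year
    among the document's words, since a text cannot predate the first
    recorded use of any word it contains.
    `first_seen` maps word -> first year it appears in the corpus."""
    known = [first_seen[w] for w in doc_words if w in first_seen]
    return max(known) if known else None

first_seen = {"telegraph": 1794, "quidditch": 1997, "castle": 1100}
print(earliest_possible_year(["castle", "quidditch"], first_seen))  # 1997
```

A single late-attested word ("quidditch") is enough to rule out every earlier period, regardless of how old the other words are.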
6. Application's Design
• 1. User interface – user inputs the text to be processed
• 2. The application processes the text and extracts its vocabulary
• 3. Word Processing – extract the word’s TS (smoothed unigrams freq.)
• 4. Simple Peak Detection (SPD) – moments of time when that word was
most probably used (based on mean and standard dev.)
• 5. Earth Mover’s Distance (EMD) – intervals are filtered, maintaining
only the ones with the highest probability (the words’
fingerprints)
• 6. Reducing unit – intervals received from EMD module for all words
are overlapped in order to obtain the document time-stamp
• 7. User interface – results are presented to the user
[Architecture diagram: User Interface → Word Processing → Simple Peak Detection → EMD Algorithm → Reducing Unit, backed by a Database]
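A minimal sketch of the SPD step, assuming the "mean and standard dev." rule means flagging years whose smoothed frequency exceeds mean + k·σ and merging consecutive flagged years into intervals (the threshold form is an assumption about the slides' description):

```python
from statistics import mean, stdev

def simple_peak_detection(years, freqs, k=1.0):
    """Flag years whose frequency exceeds mean + k * stdev, then merge
    runs of consecutive flagged years into (start, end) intervals."""
    mu, sigma = mean(freqs), stdev(freqs)
    flagged = [y for y, f in zip(years, freqs) if f > mu + k * sigma]
    intervals, start = [], None
    for i, y in enumerate(flagged):
        if start is None:
            start = y
        # close the interval when the run of consecutive years ends
        if i == len(flagged) - 1 or flagged[i + 1] != y + 1:
            intervals.append((start, y))
            start = None
    return intervals

years = list(range(1990, 2010))
freqs = [1] * 8 + [9, 10, 9] + [1] * 9   # spike around 1998-2000
print(simple_peak_detection(years, freqs))  # [(1998, 2000)]
```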
8. Case Study 1
• The program works best with books that introduce new words in
the language.
– Harry Potter introducing words like Quidditch or Hogwarts
– Freq. ↗ around 1998 (when the first book was published), ↘ in 2007 (after
the last book was written)
– A Game of Thrones or Lord of the Rings, with words like direwolf or hobbit
9. Case Study 2
[Figure: word fingerprints – "Medici" and "Cesare" show a strong fingerprint around 1565; "Clement VII" and "Henry VIII" show two spikes: around 1520 (Katherine of Aragon) and between 1560–1570]
• History books are easier to date due to historical events and
famous people's names.
– Niccolò Machiavelli's Il Principe (1532, in Italian), in which Lorenzo
Medici, Cesare Borgia, Henry VIII and Pope Clement are important
characters, was dated to 1563
11. Sources of errors
• Difficult/impossible to correct:
– the small number of publications from the 16th and 17th
centuries → spikes in words’ TS during that time;
– wrongly dated books inside the corpus;
– authors tend to write about past events after a while
→ delays in identifying the publishing year.
• To be corrected in the future work:
– using only unigrams (“dark matter” – spike after 1980);
– using the whole dataset (including unreliable data < 1800);
– the way the words are fingerprinted (different weighting
system, consider words’ part-of-speech, if they are
neologisms/archaisms);
– the way the relevant intervals are combined (also accept
plateaus, not only spikes).
12. Conclusions
• Automatic dating of a document is a viable
solution.
• The results are strongly influenced by the
corpus used for extracting the words' time
series.
• The words' fingerprinting works best for
documents containing words strongly linked
to a time period: historical books, science
fiction publications or newspaper articles.
SPD involves much more computation – O(n³) – as it tries to select only those areas of the graph that resemble a Gaussian distribution
Reducing unit – considers words’ frequencies and their properties: it increases the weights of capitalized words and decreases the weights of words having fewer than 4 letters
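The reducing unit's weighted overlap could be sketched as follows; the additive voting scheme and the weight values are assumptions for illustration, not the application's actual implementation:

```python
from collections import defaultdict

def reduce_intervals(word_intervals, weights):
    """Overlap the words' fingerprint intervals: every year inside a
    word's interval accumulates that word's weight, and the year with
    the highest total vote is taken as the document's time-stamp."""
    votes = defaultdict(float)
    for word, (start, end) in word_intervals.items():
        for year in range(start, end + 1):
            votes[year] += weights.get(word, 1.0)
    return max(votes, key=votes.get)

word_intervals = {"Medici": (1550, 1580), "Cesare": (1555, 1575),
                  "Clement": (1560, 1570)}
weights = {"Medici": 3.0, "Cesare": 1.5, "Clement": 1.5}
print(reduce_intervals(word_intervals, weights))
```

All three hypothetical fingerprints overlap in 1560–1570, so the estimate falls in that decade; heavier words pull the estimate toward their own intervals.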
Spikes in the graph can be caused by unusual uses of a word when it is linked to some important event
A problem of the SPD algorithm is that it detects time intervals that are very large: the SPD interval for “ice cream” is [1905, 2008]
The first spike is explained by the fact that, during that time, Henry VIII was trying to divorce Katherine of Aragon and Pope Clement VII did not allow it, forcing the king to found his own church