Diachronic Analysis
Hello!
Pierpaolo Basile
Researcher in Computer Science
Natural Language Processing, Distributional
Semantic Models, Information Retrieval
pierpaolo.basile@gmail.com
http://www.di.uniba.it/~basilepp/
Basic Concepts
Natural Language Processing
Computational Linguistics
Diachronic Linguistics
Natural Language
Refers to the language spoken by people, e.g.
English, Japanese, Swahili, Italian, as opposed to
artificial languages, like C++, Java, etc.
…Processing
Applications that deal with natural language in
one way or another
NLP Applications
▪ Classify text into categories
▪ Index and search large texts
▪ Automatic translation
▪ Speech understanding
▪ Information extraction
▪ Automatic summarization
▪ Question answering
▪ Knowledge acquisition
▪ Text generation / dialogue
Why NLP?
▪ Google, Yahoo!, Bing (3.37%), Baidu (0.79%) -> Information Retrieval
▪ LinkedIn -> Information Extraction + Information Retrieval
▪ Google Translate, Babelfish, Systran -> Machine Translation
▪ Ask, IBM Watson -> Question Answering
▪ Myspace, Facebook, Twitter -> Social Networks, Processing of User-
Generated Content
▪ All “Big Guys” have (several) strong NLP research labs: IBM, Microsoft,
AT&T, Xerox, ORACLE-Sun Microsystems, etc.
▪ Academia: research in a university environment
NLP Applications: Search
NLP Applications: Machine Translation
NLP Applications: Personal Assistant
Apple Siri (2011), Microsoft Cortana (2014), Google Now (2013)
Linguistic Levels of
Analysis
▪ Speech
▪ Written language
▫ Phonology: sounds / letters / pronunciation
▫ Morphology: the structure of words
▫ Syntax: how these sequences are structured
▫ Semantics: meaning of the strings
▪ Interaction between levels
Issues in Syntax
Identify the part of speech (POS)
dog = noun, ate = verb, homework = noun
“the dog ate my homework”
Issues in Syntax
Chunking: identify basic structures (phrases)
[the dog]-NP [ate]-VP [my homework]-NP
Shallow parsing
the dog -> subject, ate -> predicate, my homework -> object
“the dog ate my homework”
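The tagging and chunking steps above can be sketched in a few lines of Python. This is only a toy illustration: the lexicon, the coarse tag names and the helper functions `pos_tag`, `chunk` and `show` are assumptions made for this one sentence, not a real tagger or parser.

```python
# Hypothetical hand-written lexicon covering only the example sentence.
LEXICON = {"the": "DET", "dog": "NOUN", "ate": "VERB", "my": "PRON", "homework": "NOUN"}


def pos_tag(sentence):
    """Tag each token by dictionary lookup (unknown words get 'UNK')."""
    return [(token, LEXICON.get(token.lower(), "UNK")) for token in sentence.split()]


def chunk(tagged):
    """Group adjacent tokens: verbs form VP chunks, everything else NP chunks."""
    chunks = []
    for token, tag in tagged:
        label = "VP" if tag == "VERB" else "NP"
        # Extend the current chunk while the label stays the same.
        if chunks and chunks[-1][0] == label:
            chunks[-1][1].append(token)
        else:
            chunks.append((label, [token]))
    return chunks


def show(chunks):
    """Render chunks in the bracket notation used on the slide."""
    return " ".join(f"[{' '.join(tokens)}]-{label}" for label, tokens in chunks)


tagged = pos_tag("the dog ate my homework")
print(tagged)
print(show(chunk(tagged)))  # [the dog]-NP [ate]-VP [my homework]-NP
```

Real systems learn the tags from annotated corpora instead of a fixed lexicon, since many words (such as "plant") can take more than one tag.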
Issues in Syntax
Full parsing
(S (NP (DT The) (NN dog))
   (VP (VBP ate)
       (NP (PRP$ my) (NN homework))))
Issues in Semantics
“They are producing about 1,000 automobiles in the new plant”
“The sea flora consists of 1,000 different plant species”
Understand language! How?
• “plant” = industrial plant
• “plant” = living organism
Words are ambiguous
Importance of semantics?
• Machine Translation: wrong translations
• Information Retrieval: wrong information
Computational Linguistics
Doing linguistics on computers
More on the linguistic side than NLP, but closely
related (interdisciplinary field)
Diachronic Linguistics
The scientific study of language change over time
also called Historical Linguistics
Marty, in 2015
people will surf
on the web!!!
Surf!?!?!
On the
web!?!?!?
Diachronic Linguistics
Why?
▪ Observe changes in particular languages
▪ Reconstruct the pre-history of languages
▪ Develop general theories about how and why language
changes
▪ Describe the history of speech communities
▪ Etymology
Synchronic vs. Diachronic
Synchronic
It describes the rules of a language at a specific point in time,
without taking its history into account.
Diachronic
It considers the evolution of a language over time.
Google N-gram Viewer
▪ Search and visualize n-gram statistics from Google Books
▪ N-gram: sequence of n words
▪ Google Books digitalizes millions of books
N-gram
“Google Books digitalizes millions of books”
1-gram
Google, Books, digitalizes, millions, of, books
2-gram
Google Books, Books digitalizes, digitalizes millions, millions of, of books
3-gram
Google Books digitalizes, Books digitalizes millions, digitalizes millions of,
millions of books
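The n-gram lists above can be reproduced with a short Python function. This is a minimal sketch that assumes plain whitespace tokenization; the helper name `ngrams` is ours and is not part of any Google tool.

```python
def ngrams(text, n):
    """Return all n-grams (as strings) of a whitespace-tokenized text."""
    tokens = text.split()
    # Slide a window of n tokens over the sequence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Google Books digitalizes millions of books"
for n in (1, 2, 3):
    print(f"{n}-gram:", ", ".join(ngrams(sentence, n)))
```

A sentence of k tokens yields k - n + 1 n-grams, which is why the example has six 1-grams, five 2-grams and four 3-grams.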
Google N-gram Viewer
https://books.google.com/ngrams
CULTUROMICS
A form of computational lexicology that studies human behavior and
cultural trends through the quantitative analysis of digitized texts.
J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, The Google Books Team, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig,
J. Orwant, S. Pinker, M. A. Nowak and E. L. Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 2011.
Culturomics
Grammar Evolution
Culturomics
Forgetting the old
Culturomics
Popularity
Culturomics
Censorship
Marc Chagall (English)
Culturomics
Censorship
Marc Chagall (German)
Nazi censorship
Culturomics
Events
Russian Flu
Spanish Flu
Asian Flu
Culturomics
feminism (English)
Culturomics
feminism (Italian)
«suffragette»
Culturomics
men vs. women
Culturomics
God (English)
Culturomics
God (Italian)
Culturomics
Religion
Google N-gram Viewer
Part-of-speech
Google N-gram Issues
Overabundance of Scientific Literature
Google N-gram Issues
OCR Errors
The long s (ſ) in old books looks a lot like an f
Google N-gram Issues
Popularity Contests
A book only appears once, whether it’s been read once or millions of times
Moving away from mere frequentist approaches…
What’s Tezguno?
A bottle of Tezguno is on the table.
Everyone likes Tezguno.
Tezguno makes you drunk.
We make Tezguno out of corn.
What’s Tezguno?
Distributional
Semantic Models
John Rupert Firth
You shall know a
word by the
company it keeps!
Ludwig Wittgenstein
The meaning of a word
is determined by its
usage.
Zellig Harris
▪ Methods in structural linguistics
▪ Distributional structure
▪ Mathematical structures of language
Distributional
Semantic Models
Analysis of word-usage
statistics over huge corpora
Geometric space of
concepts
Similar words are
represented close in the
space
Distributional Semantics
Count co-occurrences
        dog  cat  bread  pasta  meat  mouse
dog      40    7      1      0     1      5
cat       7   32      0      1     0      8
bread     1    0     22     15     8      0
pasta     0    1     15     24    10      1
meat      1    0      8     10    30      2
mouse     5    8      0      1     2     31
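A co-occurrence matrix like the one above can be obtained by counting which words appear near each other in a corpus. The sketch below is only illustrative: the three-sentence corpus is invented, the window size of 3 is an arbitrary assumption, and the function name `cooccurrences` is ours. Unlike the slide's table, it does not fill the diagonal (a word is not counted as its own neighbour).

```python
from collections import Counter, defaultdict


def cooccurrences(sentences, window=3):
    """Count how often each pair of words appears within `window` tokens of each other."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Neighbours up to `window` positions to the left and right of i.
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


corpus = [  # toy corpus, invented for illustration
    "the dog chased the cat",
    "the cat chased the mouse",
    "we ate bread and pasta",
]
matrix = cooccurrences(corpus)
print(matrix["cat"]["mouse"], matrix["bread"]["pasta"], matrix["dog"]["pasta"])
```

Each row of the resulting matrix is the vector that represents a word in the distributional space.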
Distributional Semantics
Word similarity
        dog  cat  bread  pasta  meat  mouse
dog      40    7      1      0     1      5
cat       7   32      0      1     0      8
bread     1    0     22     15     8      0
pasta     0    1     15     24    10      1
meat      1    0      8     10    30      2
mouse     5    8      0      1     2     31
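Word similarity can be computed by comparing the rows of the matrix. The slides do not name a measure, so the sketch below assumes cosine similarity, a common choice for this kind of model; the function names `cosine` and `most_similar` are ours.

```python
import math

# Rows of the co-occurrence matrix shown above.
COUNTS = {
    "dog":   [40, 7, 1, 0, 1, 5],
    "cat":   [7, 32, 0, 1, 0, 8],
    "bread": [1, 0, 22, 15, 8, 0],
    "pasta": [0, 1, 15, 24, 10, 1],
    "meat":  [1, 0, 8, 10, 30, 2],
    "mouse": [5, 8, 0, 1, 2, 31],
}


def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the other word whose row is closest to `word`."""
    others = (w for w in COUNTS if w != word)
    return max(others, key=lambda w: cosine(COUNTS[word], COUNTS[w]))


print(most_similar("cat"))    # mouse
print(most_similar("bread"))  # pasta
```

With these counts, cat is closer to mouse than to any food word, and bread is closest to pasta.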
Geometric space
WordSpace
[Figure: dog, pasta, bread, mouse and cat plotted as points in the WordSpace; cat and mouse are close in the space]
A WordSpace is a snapshot of a specific corpus: it does not take
temporal information into account
Temporal Random Indexing (TRI)
Corpus1900 -> RI -> Space1
Corpus1920 -> RI -> Space2
Corpus1930 -> RI -> Space3
Corpus1940 -> RI -> Space4
Temporal Random Indexing
(TRI)
▪ Corpus with temporal information: split the corpus into
several time periods
▪ Build a WordSpace for each time period
▪ Words in different WordSpaces are comparable!
P. Basile, A. Caputo, G. Semeraro. Temporal random indexing: A system for analysing word meaning
over time. IJCoL vol. 1: Emerging Topics at the First Italian Conference on Computational Linguistics,
Accademia University Press.
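The three steps above can be sketched in plain Python. This is not the authors' implementation: it assumes that the spaces are comparable because every time period reuses the same random vectors, and the dimension, the number of non-zero entries, the window size, the toy corpora and all function names are illustrative choices.

```python
import math
import random

DIM = 100     # size of the reduced space (assumed value)
NONZERO = 4   # non-zero entries per random vector (assumed value)


def random_vectors(vocabulary, seed=42):
    """Give each word a sparse random vector with a few +1/-1 entries."""
    rng = random.Random(seed)
    vectors = {}
    for word in sorted(vocabulary):
        vec = [0] * DIM
        for pos in rng.sample(range(DIM), NONZERO):
            vec[pos] = rng.choice((-1, 1))
        vectors[word] = vec
    return vectors


def build_wordspace(sentences, index, window=3):
    """Represent each word as the sum of the random vectors of its neighbours."""
    space = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            vec = space.setdefault(word, [0] * DIM)
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    for k, value in enumerate(index[tokens[j]]):
                        vec[k] += value
    return space


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


corpora = {  # toy corpora, invented for illustration
    "1980": ["surf the ocean waves", "ride the ocean waves"],
    "2000": ["surf the web pages", "browse the web pages"],
}
vocabulary = {tok for sents in corpora.values() for s in sents for tok in s.split()}
index = random_vectors(vocabulary)  # built once, shared by every period
spaces = {period: build_wordspace(sents, index) for period, sents in corpora.items()}

# The same word can be compared across periods because the random vectors are shared.
print(cosine(spaces["1980"]["surf"], spaces["2000"]["surf"]))
print(cosine(spaces["2000"]["surf"], spaces["2000"]["browse"]))
```

In this toy setting "surf" has different neighbours in the two periods, so its vectors in the two spaces differ, while "surf" and "browse" share the same neighbours in the later period and end up with the same vector.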
Similarity between words can change over time
[Figure: three WordSpaces. WordSpace 1910 contains chiamare ("to call"); WordSpace 1920 and WordSpace 1930 contain both chiamare and telefonare ("to telephone")]
Google N-gram
TRI
Change point detection
▪ Track changes in word meaning over time
▪ Build a time series by taking into account the semantic shift of each word
▪ Find significant changes: Mean shift model
             1900  1910  1920  1930  1940
telefonare   0.25  0.30  0.70  0.80  0.75
change point: the jump between 1910 and 1920
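The change point in the telefonare series can be located with a simple mean shift computation. The sketch below shows one common formulation (difference between the mean after and the mean before each split, with a shuffle test for significance); it is an assumption about how such a model can be implemented, not the exact procedure used by the authors, and the function names and the number of shuffles are ours.

```python
import random


def mean_shift(series):
    """For each split j, return mean(series[j:]) - mean(series[:j])."""
    shifts = []
    for j in range(1, len(series)):
        before, after = series[:j], series[j:]
        shifts.append(sum(after) / len(after) - sum(before) / len(before))
    return shifts


def detect_change_point(years, series, n_shuffles=1000, seed=0):
    """Return (year, shift, p_value) for the largest upward mean shift."""
    shifts = mean_shift(series)
    best = max(range(len(shifts)), key=shifts.__getitem__)
    observed = shifts[best]
    # Shuffle test: how often does a random ordering give a shift this large?
    rng = random.Random(seed)
    shuffled = list(series)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if max(mean_shift(shuffled)) >= observed:
            hits += 1
    # shifts[best] belongs to the split just before years[best + 1].
    return years[best + 1], observed, hits / n_shuffles


years = [1900, 1910, 1920, 1930, 1940]
telefonare = [0.25, 0.3, 0.7, 0.8, 0.75]
year, shift, p_value = detect_change_point(years, telefonare)
print(year, round(shift, 3), p_value)
```

For this series the largest shift is found at the split before 1920. With only five points the shuffle test has little statistical power, so the p-value is mainly useful on longer series.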
Evaluation
▪ Build TRI by relying on the Italian Google Ngram corpus
▪ Build a standard benchmark for meaning shift detection in the Italian
language (“Dizionario Sabatini Coletti”, “Dizionario Etimologico Zanichelli”)
▪ Evaluate the performance of TRI: compare the system output with manual
annotations provided by experts
▪ Future work: extend the evaluation to the English language
P. Basile, A. Caputo, G. Semeraro. Diachronic Analysis of the Italian Language exploiting Google Ngram.
CLIC-it 2016: Third Italian Conference on Computational Linguistics.
Thanks!
Any questions?
You can find me at
pierpaolo.basile@gmail.com
