During the last decade, the surge in available data spanning different epochs has inspired a new analysis of cultural, social, and linguistic phenomena from a temporal perspective.
In this talk, I will describe Temporal Random Indexing (TRI), a method that enables the analysis of the evolution of word meaning over time by exploiting large corpora.
TRI builds WordSpaces that take temporal information into account. This methodology is exploited to build time series that trace how a word changes its meaning over time. I will report some experiments on the Italian language, and I will show the preliminary results obtained during my visit to the Alan Turing Institute by analysing the UK Web Archive corpus.
Detecting semantic shift in large corpora by exploiting Temporal Random Indexing
1. Detecting semantic shift in large corpora by exploiting Temporal Random Indexing
Pierpaolo Basile
pierpaolo.basile@uniba.it
2. Hello!
I am Pierpaolo Basile
Natural Language Processing
Distributional Semantics
Information Retrieval/Filtering
You can find me at pierpaolo.basile@uniba.it
7. Synchronic vs. Diachronic
Synchronic: describes the rules of a language at a specific point in time, without taking its history into account.
Diachronic: considers the evolution of a language over time.
8. Diachronic Linguistics
Why?
▪ Observe changes in particular languages
▪ Reconstruct the pre-history of languages
▪ Develop general theories about how and why language changes
▪ Describe the history of speech communities
▪ Etymology
10. Distributional Semantic Models (DSM)
“You shall know a word by the company it keeps!” (John Rupert Firth)
“The meaning of a word is its use in the language.” (Ludwig Wittgenstein)
Distributional structure; mathematical structures of language (Zellig Harris)
11. Distributional Semantic Models
● Analysis of word-usage statistics over huge corpora
● Geometric space of concepts
● Similar words are represented close in the space
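To make “close in the space” concrete, here is a toy sketch with invented vectors (not derived from any corpus): similar words have high cosine similarity.

```python
import numpy as np

# Invented toy vectors: in a DSM, words used in similar contexts end
# up with similar vectors, so their cosine similarity is high.
dog = np.array([0.9, 0.1, 0.8])
cat = np.array([0.8, 0.2, 0.7])
car = np.array([0.1, 0.9, 0.2])

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(dog, cat))  # high: similar usage
print(cos(dog, car))  # lower: different usage
```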
12. “A WordSpace is a snapshot of a specific corpus; it does not take temporal information into account.”
13. RI: Random Indexing
Building the WordSpace
▪ Assign a random vector to each term in the corpus vocabulary
▪ The semantic vector of a term is the sum of the random vectors of the terms co-occurring with it
Random Vector
… -1 0 1 0 0 0 0 0 1 0 0 0 -1 …
▪ Sparse
▪ High dimensional
▪ Ternary {-1, 0, +1}
▪ Small number of randomly distributed non-zero elements
14. RI: How it works
Sentence: the quick brown fox jumps over the lazy dog
Random vectors:
quick: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0>
brown: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0>
fox: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0>
jumps: <0, 0, 0, 0, 0, 1, 0, -1, 0, 0>
over: <0, 0, 0, 0, 0, 0, 1, 0, -1, 0>
lazy: <1, 0, 0, 0, 0, 0, 0, 0, 0, -1>
dog: <0, 0, 0, -1, 0, 1, 0, 0, 0, 0>
15–16. RI: How it works (cont.)
With a context window of 2, the semantic vector of ‘fox’ is updated with the sum of the random vectors of ‘quick’, ‘brown’, ‘jumps’, and ‘over’ (see the sketch below):
svfox += <-1, -1, 1, 1, 0, 1, 1, -1, -1, 0>
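A minimal runnable version of this update (toy vectors copied from the slides; ‘the’ is treated as a stop word with no random vector, as in the example):

```python
import numpy as np

# Toy 10-dimensional ternary random vectors from the slides.
rv = {
    'quick': np.array([-1, 0, 1, 0, 0, 0, 0, 0, 0, 0]),
    'brown': np.array([0, -1, 0, 1, 0, 0, 0, 0, 0, 0]),
    'fox':   np.array([0, 0, 0, 0, -1, 0, 1, 0, 0, 0]),
    'jumps': np.array([0, 0, 0, 0, 0, 1, 0, -1, 0, 0]),
    'over':  np.array([0, 0, 0, 0, 0, 0, 1, 0, -1, 0]),
    'lazy':  np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, -1]),
    'dog':   np.array([0, 0, 0, -1, 0, 1, 0, 0, 0, 0]),
}

tokens = 'the quick brown fox jumps over the lazy dog'.split()
window = 2

# Semantic vector of 'fox': sum of the random vectors of the words
# inside its context window.
i = tokens.index('fox')
sv_fox = np.zeros(10, dtype=int)
for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
    if j != i and tokens[j] in rv:
        sv_fox += rv[tokens[j]]

print(sv_fox)  # [-1 -1  1  1  0  1  1 -1 -1  0]
```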
18. Random Indexing (formal)
The random projection matrix B approximately preserves the Euclidean distance between points (Johnson-Lindenstrauss lemma).
Kanerva, Pentti. Sparse Distributed Memory. MIT Press, 1988.
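In my own notation (reconstructing the formula the slide refers to): if A is the n × m term-context matrix and B an m × k random matrix with k ≪ m, the projection and the distance guarantee of the Johnson-Lindenstrauss lemma read:

```latex
A' = A B, \qquad A \in \mathbb{R}^{n \times m},\ B \in \mathbb{R}^{m \times k},\ k \ll m
% For all rows u, v of A, with high probability when k = O(\log n / \epsilon^2):
(1-\epsilon)\,\lVert u - v \rVert^{2} \;\le\; \lVert uB - vB \rVert^{2} \;\le\; (1+\epsilon)\,\lVert u - v \rVert^{2}
```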
24–25. Temporal Random Indexing (TRI)
▪ Corpus with temporal information
▫ split the corpus into several time periods
▪ Build a WordSpace for each time period using TRI
▪ Words in different WordSpaces are comparable, because the random vectors are shared across time periods! (see the sketch below)
P. Basile, A. Caputo, G. Semeraro. Temporal Random Indexing: A System for Analysing Word Meaning over Time. IJCoL, vol. 1.
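A minimal sketch of the TRI idea under my own naming (not the released implementation linked later in the talk). The per-word random vector is derived from a stable hash of the word, so every time period uses the same random vectors and the resulting WordSpaces are directly comparable:

```python
import zlib
import numpy as np
from collections import defaultdict

DIM = 1000  # illustrative dimensionality, not the talk's setting

def random_vector(word, dim=DIM, nonzero=10):
    # Seed the generator with a stable hash of the word: the SAME
    # sparse ternary vector is produced in every time period, which
    # is what makes the WordSpaces comparable across periods.
    rng = np.random.default_rng(zlib.crc32(word.encode()))
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def build_wordspace(sentences, window=2):
    # WordSpace for one time period: for each target word, sum the
    # random vectors of the words inside its context window.
    space = defaultdict(lambda: np.zeros(DIM))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    space[target] += random_vector(tokens[j])
    return space

# corpus_by_period: {period: list of tokenized sentences} (hypothetical)
# spaces = {p: build_wordspace(s) for p, s in corpus_by_period.items()}
```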
26. Similarity between words can change over time
[Diagram: the positions of ‘chiamare’ (call) and ‘telefonare’ (phone) in WordSpace 1870, WordSpace 1920, and WordSpace 1930]
29. Time Series
Several time series Γ can be built at each time interval k:
▪ log frequency: the log of the word frequency in each time period k
▪ point-wise: the cosine similarity between the word vectors of two consecutive time periods
▪ cumulative: the cosine similarity with a cumulative vector built from the previous k-1 time periods
Kulkarni et al., Statistically Significant Detection of Linguistic Change. WWW 2015.
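In my own notation (reconstructing the slide's definitions; c_k(w) is the frequency of word w in period k and sv_k(w) its semantic vector):

```latex
\Gamma^{\mathrm{freq}}_{k}(w) = \log c_{k}(w), \qquad
\Gamma^{\mathrm{point}}_{k}(w) = \cos\big(sv_{k}(w),\, sv_{k-1}(w)\big), \qquad
\Gamma^{\mathrm{cum}}_{k}(w) = \cos\Big(sv_{k}(w),\, \sum_{i=1}^{k-1} sv_{i}(w)\Big)
```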
30. Change point detection
▪ Track the change of word meaning over time
▪ Build a time series by taking into account the semantic shift of each word
▪ Find significant changes: mean shift model
Example: telefonare (phone)
1900: 0.25, 1910: 0.30, 1920: 0.70 ← change point, 1930: 0.80, 1940: 0.75
31. Change point detection
▪ Mean shift of Γ pivoted at time period j
▪ Search for a statistically significant mean shift
▫ bootstrapping approach under the null hypothesis that there is no change in meaning (see the sketch below)
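A compact sketch of the mean shift model and the bootstrap test, following Kulkarni et al. (my implementation, not the talk's code):

```python
import numpy as np

def mean_shift(series, j):
    # Mean shift of the time series pivoted at period j: difference
    # between the mean after j and the mean up to j.
    return series[j:].mean() - series[:j].mean()

def p_value(series, j, n_boot=1000, seed=0):
    # Bootstrap under the null hypothesis of no change: permute the
    # series and measure how often a shift at least as large as the
    # observed one arises by chance.
    rng = np.random.default_rng(seed)
    observed = mean_shift(series, j)
    hits = sum(mean_shift(rng.permutation(series), j) >= observed
               for _ in range(n_boot))
    return hits / n_boot

gamma = np.array([0.25, 0.30, 0.70, 0.80, 0.75])  # 'telefonare' series
print(mean_shift(gamma, 2), p_value(gamma, 2))
```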
33. Build a gold standard for the evaluation
Change points are derived from the dated senses of an Italian dictionary: http://dizionari.corriere.it/dizionario_italiano/
34. Evaluation Results

Method     | Accuracy
TRI-point  | 0.3086
TRI-cum    | 0.2963
TRR1-point | 0.2716
log freq   | 0.2346
TRR2-point | 0.1728
TRR1-cum   | 0.1605
TRR2-cum   | 0.1235

Accuracy: the year predicted by the system must be equal to or greater than one of the years reported in the gold standard.
TRR1 and TRR2 are variants of TRI based on Reflective Random Indexing.
P. Basile, A. Caputo, R. Luisi, G. Semeraro. Diachronic Analysis of the Italian Language Exploiting Google Ngram. CLiC-it 2016.
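A compact sketch of this accuracy metric under hypothetical input structures (word → predicted year, word → list of gold years):

```python
def accuracy(predictions, gold):
    # A prediction counts as correct when the predicted year is equal
    # to or greater than at least one year in the gold standard (a
    # shift can only surface after the new sense has appeared).
    correct = sum(
        any(predictions[w] >= year for year in gold[w])
        for w in predictions if w in gold
    )
    return correct / len(predictions)

print(accuracy({'telefonare': 1920}, {'telefonare': [1910, 1918]}))  # 1.0
```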
35. Social media
▪ Build TRI on Twitter
▪ About 500M tweets (Feb. 2012 – Sep. 2015)
▪ Time interval = 1 month
38. Detecting semantic shift in large corpora: the UK Web Archive
Joint work with Barbara McGillivray, Research Fellow, Alan Turing Institute
39. UK Web Archive
▪ The UK Web Archive collects, makes accessible, and preserves web resources of scholarly and cultural importance from the UK domain
▪ JISC UK Web Domain Dataset (1996-2013)
▫ resources from the Internet Archive that were hosted on domains ending in ‘.uk’
40–41. Data Format
▪ ARC format: used to store web crawls as sequences of content blocks
▪ WARC format: an enhancement of the ARC format that supports metadata, duplicate-detection events, and more
We need to extract the textual content from HTML pages and discard all other types of content.
42. From ARC/WARC to WET
▪ WET format: contains the plain text extracted from the data stored in ARC/WARC archives
Pipeline: ARC/WARC → filter HTML pages → extract text → WET
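A minimal sketch of this step using the warcio and BeautifulSoup libraries (my choice of tooling; the talk does not name the implementation):

```python
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_text(archive_path):
    # Yield the plain text of every HTML response in an ARC/WARC
    # archive, discarding all other content types.
    with open(archive_path, 'rb') as stream:
        for record in ArchiveIterator(stream, arc2warc=True):
            if record.rec_type != 'response' or record.http_headers is None:
                continue
            ctype = record.http_headers.get_header('Content-Type') or ''
            if 'text/html' not in ctype:
                continue
            html = record.content_stream().read()
            yield BeautifulSoup(html, 'html.parser').get_text(separator=' ')
```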
44. From WET to tokens
▪ Extract tokens from the text using the Apache Lucene StandardTokenizer
▪ Store the tokens for each month (time period)
Pipeline: WET (Azure Blob Storage) → tokenization (Azure Batch and a pool of VMs) → tokens (Azure Blob Storage)
https://github.com/alan-turing-institute/UKWebArchive_semantic_change
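Lucene's StandardTokenizer is a Java component; as a rough, illustration-only Python stand-in for word-boundary tokenization:

```python
import re

def tokenize(text):
    # Crude approximation of Unicode word segmentation; the actual
    # pipeline uses Lucene's StandardTokenizer (UAX#29 rules).
    return [t.lower() for t in re.findall(r"\w+(?:[.']\w+)*", text)]

print(tokenize("The UK Web Archive hosts '.uk' domains."))
# ['the', 'uk', 'web', 'archive', 'hosts', 'uk', 'domains']
```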
46–47. TRI: The UK Web Archive
▪ Build the vocabulary
▪ Build co-occurrence matrices (context window = 5, to speed up TRI)
▪ Perform TRI
▪ Build time series
▪ Run change point detection
https://github.com/alan-turing-institute/temporal-random-indexing
We performed a preliminary analysis on 20% of the corpus, discarding words that appear fewer than 500 times (~1M words, ~140 billion occurrences).
48–51. Co-occurrence matrices
One matrix for each time period (in our experiment, one month), e.g. 201212_matrix.gz. Each line starts with the target word, followed by pairs of co-occurring word and count:

linux swapping 4 google 173 xp 454 manufacturer 237 job 64 install 255 security 137 cgi 47 operating 705 host 69 performance 44 sharing 56...

We plan to make the matrices freely available! (A parsing sketch follows.)
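A sketch of parsing one matrix line, assuming the target-word/pairs format shown above:

```python
def parse_line(line):
    # 'linux swapping 4 google 173 ...' ->
    # ('linux', {'swapping': 4, 'google': 173, ...})
    parts = line.split()
    target, rest = parts[0], parts[1:]
    counts = {rest[i]: int(rest[i + 1]) for i in range(0, len(rest), 2)}
    return target, counts

target, counts = parse_line("linux swapping 4 google 173 xp 454")
print(target, counts)  # linux {'swapping': 4, 'google': 173, 'xp': 454}
```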
52. TRI
How the neighborhood of the word ‘linux’ changed over time:

December 2000 | December 2012
sparc         | windows
kernel        | microsoft
pwdb          | xp
asm           | debian
unix          | netware
netinet       | macos
packlist      | suse

2000: technical terms. 2012: Linux is recognized as an operating system.
sim(linux_200012, linux_201212) = 0.228
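A sketch of how such neighbor lists and cross-period similarities can be computed, given per-period WordSpaces as dictionaries of vectors (hypothetical names, building on the earlier TRI sketch):

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def neighbors(space, word, k=7):
    # Top-k nearest neighbors of `word` within one WordSpace.
    sims = {w: cos(space[word], v) for w, v in space.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# Because the random vectors are shared, vectors from different
# periods live in the same space, so this comparison is meaningful:
# cos(spaces['200012']['linux'], spaces['201212']['linux'])  # e.g. 0.228
```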
55–56. Build a gold standard for the evaluation
[Screenshots: entries with dated senses from a historical dictionary, the Oxford English Dictionary (OED)]
57. Evaluation Metrics
▪ Precision: how many of the detected change points are correct?
▪ Recall: is TRI able to identify all the change points reported in the OED?
▪ TRI could identify correct change points that are not reported in the OED!
▪ Some words (slang) are not reported in the OED
▫ exploit other dictionaries: Urban Dictionary?
58. Current outcome
▪ WET files for the whole JISC UK Web Domain Dataset (1996-2013)
▫ tokenized content (100%, ~3 TB)
▫ co-occurrence matrices (20%, ~350 GB)
▫ WordSpaces
▸ for each month (20%, ~400 GB)
▸ for each year (20%, ~50 GB)
▪ Time series from 1996 to 2013
▫ cumulative, point-wise, by month, by year
▪ Change point detection with different p-values
59. Conclusion and Future Work
▪ TRI is able to scale up to a large corpus of billions of tokens
▪ Co-occurrence matrices can be re-used to build WordSpaces with other approaches
▪ Future work
▫ matrices and WordSpaces from the whole corpus
▫ finalize the construction of the gold standard
▫ comparison with other approaches (word embedding alignment)