During the last decade, the surge in available data spanning different epochs has inspired a new analysis of cultural, social, and linguistic phenomena from a temporal perspective.
In this talk, I will describe Temporal Random Indexing (TRI), a method that enables the analysis of the evolution of word meaning over time by exploiting large corpora.
TRI builds WordSpaces that take temporal information into account. This methodology is exploited to build time series that trace how a word changes its meaning over time. I will report some experiments on the Italian language, and I will show the preliminary results obtained during my visit to the Alan Turing Institute by analysing the UK Web Archive corpus.
Detecting semantic shift in large corpora by exploiting Temporal Random Indexing
1. Detecting semantic shift in large corpora by exploiting Temporal Random Indexing
Pierpaolo Basile
pierpaolo.basile@uniba.it
2. Hello!
I am Pierpaolo Basile
Natural Language Processing
Distributional Semantics
Information Retrieval/Filtering
You can find me at pierpaolo.basile@uniba.it
7. Synchronic vs. Diachronic
Synchronic: describes the rules of a language at a specific point in time, without taking its history into account.
Diachronic: considers the evolution of a language over time.
8. Diachronic Linguistics
Why?
▪ Observe changes in particular languages
▪ Reconstruct the pre-history of languages
▪ Develop general theories about how and why language changes
▪ Describe the history of speech communities
▪ Etymology
10. Distributional Semantic Models (DSM)
“You shall know a word by the company it keeps!” (John Rupert Firth)
“The meaning of a word is its use in the language.” (Ludwig Wittgenstein)
Distributional structure; mathematical structures of language (Zellig Harris)
11. Distributional Semantic Models
● Analysis of word-usage statistics over huge corpora
● Geometric space of concepts
● Similar words are represented close in the space
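To make “close in the space” concrete, here is a toy sketch with invented vectors (not derived from any corpus): similar words have high cosine similarity.

```python
import numpy as np

# Invented toy vectors: in a DSM, words used in similar contexts end
# up with similar vectors, so their cosine similarity is high.
dog = np.array([0.9, 0.1, 0.8])
cat = np.array([0.8, 0.2, 0.7])
car = np.array([0.1, 0.9, 0.2])

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(dog, cat))  # high: similar usage
print(cos(dog, car))  # lower: different usage
```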
12. “A WordSpace is a snapshot of a specific corpus; it does not take temporal information into account.”
13. RI: Random Indexing
Building the WordSpace
▪ Assign a random vector to each term in the corpus vocabulary
▪ The semantic vector of a term is the sum of the random vectors of the terms co-occurring with it
Random Vector
… -1 0 1 0 0 0 0 0 1 0 0 0 -1 …
▪ Sparse
▪ High dimensional
▪ Ternary {-1, 0, +1}
▪ Small number of randomly distributed non-zero elements
14. RI: How it works
Sentence: the quick brown fox jumps over the lazy dog
Random vectors:
quick: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0>
brown: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0>
fox: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0>
jumps: <0, 0, 0, 0, 0, 1, 0, -1, 0, 0>
over: <0, 0, 0, 0, 0, 0, 1, 0, -1, 0>
lazy: <1, 0, 0, 0, 0, 0, 0, 0, 0, -1>
dog: <0, 0, 0, -1, 0, 1, 0, 0, 0, 0>
15–16. RI: How it works (cont.)
With a context window of 2, the semantic vector of ‘fox’ is updated with the sum of the random vectors of ‘quick’, ‘brown’, ‘jumps’, and ‘over’ (see the sketch below):
svfox += <-1, -1, 1, 1, 0, 1, 1, -1, -1, 0>
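A minimal runnable version of this update (toy vectors copied from the slides; ‘the’ is treated as a stop word with no random vector, as in the example):

```python
import numpy as np

# Toy 10-dimensional ternary random vectors from the slides.
rv = {
    'quick': np.array([-1, 0, 1, 0, 0, 0, 0, 0, 0, 0]),
    'brown': np.array([0, -1, 0, 1, 0, 0, 0, 0, 0, 0]),
    'fox':   np.array([0, 0, 0, 0, -1, 0, 1, 0, 0, 0]),
    'jumps': np.array([0, 0, 0, 0, 0, 1, 0, -1, 0, 0]),
    'over':  np.array([0, 0, 0, 0, 0, 0, 1, 0, -1, 0]),
    'lazy':  np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, -1]),
    'dog':   np.array([0, 0, 0, -1, 0, 1, 0, 0, 0, 0]),
}

tokens = 'the quick brown fox jumps over the lazy dog'.split()
window = 2

# Semantic vector of 'fox': sum of the random vectors of the words
# inside its context window.
i = tokens.index('fox')
sv_fox = np.zeros(10, dtype=int)
for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
    if j != i and tokens[j] in rv:
        sv_fox += rv[tokens[j]]

print(sv_fox)  # [-1 -1  1  1  0  1  1 -1 -1  0]
```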
18. Random Indexing (formal)
The random projection matrix B approximately preserves the Euclidean distance between points (Johnson-Lindenstrauss lemma).
Kanerva, Pentti. Sparse Distributed Memory. MIT Press, 1988.
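In my own notation (reconstructing the formula the slide refers to): if A is the n × m term-context matrix and B an m × k random matrix with k ≪ m, the projection and the distance guarantee of the Johnson-Lindenstrauss lemma read:

```latex
A' = A B, \qquad A \in \mathbb{R}^{n \times m},\ B \in \mathbb{R}^{m \times k},\ k \ll m
% For all rows u, v of A, with high probability when k = O(\log n / \epsilon^2):
(1-\epsilon)\,\lVert u - v \rVert^{2} \;\le\; \lVert uB - vB \rVert^{2} \;\le\; (1+\epsilon)\,\lVert u - v \rVert^{2}
```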
24–25. Temporal Random Indexing (TRI)
▪ Corpus with temporal information
▫ split the corpus into several time periods
▪ Build a WordSpace for each time period using TRI
▪ Words in different WordSpaces are comparable, because the random vectors are shared across time periods! (see the sketch below)
P. Basile, A. Caputo, G. Semeraro. Temporal Random Indexing: A System for Analysing Word Meaning over Time. IJCoL, vol. 1.
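A minimal sketch of the TRI idea under my own naming (not the released implementation linked later in the talk). The per-word random vector is derived from a stable hash of the word, so every time period uses the same random vectors and the resulting WordSpaces are directly comparable:

```python
import zlib
import numpy as np
from collections import defaultdict

DIM = 1000  # illustrative dimensionality, not the talk's setting

def random_vector(word, dim=DIM, nonzero=10):
    # Seed the generator with a stable hash of the word: the SAME
    # sparse ternary vector is produced in every time period, which
    # is what makes the WordSpaces comparable across periods.
    rng = np.random.default_rng(zlib.crc32(word.encode()))
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def build_wordspace(sentences, window=2):
    # WordSpace for one time period: for each target word, sum the
    # random vectors of the words inside its context window.
    space = defaultdict(lambda: np.zeros(DIM))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    space[target] += random_vector(tokens[j])
    return space

# corpus_by_period: {period: list of tokenized sentences} (hypothetical)
# spaces = {p: build_wordspace(s) for p, s in corpus_by_period.items()}
```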
26. Similarity between words can change over time
[Diagram: the positions of ‘chiamare’ (call) and ‘telefonare’ (phone) in WordSpace 1870, WordSpace 1920, and WordSpace 1930]
29. Time Series
Several time series Γ can be built at each time interval k:
▪ log frequency: the log of the word frequency in each time period k
▪ point-wise: the cosine similarity between the word vectors of two consecutive time periods
▪ cumulative: the cosine similarity with a cumulative vector built from the previous k-1 time periods
Kulkarni et al., Statistically Significant Detection of Linguistic Change. WWW 2015.
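In my own notation (reconstructing the slide's definitions; c_k(w) is the frequency of word w in period k and sv_k(w) its semantic vector):

```latex
\Gamma^{\mathrm{freq}}_{k}(w) = \log c_{k}(w), \qquad
\Gamma^{\mathrm{point}}_{k}(w) = \cos\big(sv_{k}(w),\, sv_{k-1}(w)\big), \qquad
\Gamma^{\mathrm{cum}}_{k}(w) = \cos\Big(sv_{k}(w),\, \sum_{i=1}^{k-1} sv_{i}(w)\Big)
```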
30. Change point detection
▪ Track the change of word meaning over time
▪ Build a time series by taking into account the semantic shift of each word
▪ Find significant changes: mean shift model
Example: telefonare (phone)
1900: 0.25, 1910: 0.30, 1920: 0.70 ← change point, 1930: 0.80, 1940: 0.75
31. Change point detection
▪ Mean shift of Γ pivoted at time period j
▪ Search for a statistically significant mean shift
▫ bootstrapping approach under the null hypothesis that there is no change in meaning (see the sketch below)
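A compact sketch of the mean shift model and the bootstrap test, following Kulkarni et al. (my implementation, not the talk's code):

```python
import numpy as np

def mean_shift(series, j):
    # Mean shift of the time series pivoted at period j: difference
    # between the mean after j and the mean up to j.
    return series[j:].mean() - series[:j].mean()

def p_value(series, j, n_boot=1000, seed=0):
    # Bootstrap under the null hypothesis of no change: permute the
    # series and measure how often a shift at least as large as the
    # observed one arises by chance.
    rng = np.random.default_rng(seed)
    observed = mean_shift(series, j)
    hits = sum(mean_shift(rng.permutation(series), j) >= observed
               for _ in range(n_boot))
    return hits / n_boot

gamma = np.array([0.25, 0.30, 0.70, 0.80, 0.75])  # 'telefonare' series
print(mean_shift(gamma, 2), p_value(gamma, 2))
```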
33. Build a gold standard for the evaluation
Change points are derived from the dated senses of an Italian dictionary: http://dizionari.corriere.it/dizionario_italiano/
34. Evaluation Results

Method     | Accuracy
TRI-point  | 0.3086
TRI-cum    | 0.2963
TRR1-point | 0.2716
log freq   | 0.2346
TRR2-point | 0.1728
TRR1-cum   | 0.1605
TRR2-cum   | 0.1235

Accuracy: the year predicted by the system must be equal to or greater than one of the years reported in the gold standard.
TRR1 and TRR2 are variants of TRI based on Reflective Random Indexing.
P. Basile, A. Caputo, R. Luisi, G. Semeraro. Diachronic Analysis of the Italian Language Exploiting Google Ngram. CLiC-it 2016.
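A compact sketch of this accuracy metric under hypothetical input structures (word → predicted year, word → list of gold years):

```python
def accuracy(predictions, gold):
    # A prediction counts as correct when the predicted year is equal
    # to or greater than at least one year in the gold standard (a
    # shift can only surface after the new sense has appeared).
    correct = sum(
        any(predictions[w] >= year for year in gold[w])
        for w in predictions if w in gold
    )
    return correct / len(predictions)

print(accuracy({'telefonare': 1920}, {'telefonare': [1910, 1918]}))  # 1.0
```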
35. Social media
▪ Build TRI on Twitter
▪ About 500M tweets (Feb. 2012 – Sep. 2015)
▪ Time interval = 1 month
38. Detecting semantic shift in large corpora: the UK Web Archive
Joint work with Barbara McGillivray, Research Fellow, Alan Turing Institute
39. UK Web Archive
▪ The UK Web Archive collects, makes accessible, and preserves web resources of scholarly and cultural importance from the UK domain
▪ JISC UK Web Domain Dataset (1996-2013)
▫ resources from the Internet Archive that were hosted on domains ending in ‘.uk’
40–41. Data Format
▪ ARC format: used to store web crawls as sequences of content blocks
▪ WARC format: an enhancement of the ARC format that supports metadata, duplicate-detection events, and more
We need to extract the textual content from HTML pages and discard all other types of content.
42. From ARC/WARC to WET
▪ WET format: contains the plain text extracted from the data stored in ARC/WARC archives
Pipeline: ARC/WARC → filter HTML pages → extract text → WET
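A minimal sketch of this step using the warcio and BeautifulSoup libraries (my choice of tooling; the talk does not name the implementation):

```python
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_text(archive_path):
    # Yield the plain text of every HTML response in an ARC/WARC
    # archive, discarding all other content types.
    with open(archive_path, 'rb') as stream:
        for record in ArchiveIterator(stream, arc2warc=True):
            if record.rec_type != 'response' or record.http_headers is None:
                continue
            ctype = record.http_headers.get_header('Content-Type') or ''
            if 'text/html' not in ctype:
                continue
            html = record.content_stream().read()
            yield BeautifulSoup(html, 'html.parser').get_text(separator=' ')
```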
44. From WET to tokens
▪ Extract tokens from the text using the Apache Lucene StandardTokenizer
▪ Store the tokens for each month (time period)
Pipeline: WET (Azure Blob Storage) → tokenization (Azure Batch and a pool of VMs) → tokens (Azure Blob Storage)
https://github.com/alan-turing-institute/UKWebArchive_semantic_change
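Lucene's StandardTokenizer is a Java component; as a rough, illustration-only Python stand-in for word-boundary tokenization:

```python
import re

def tokenize(text):
    # Crude approximation of Unicode word segmentation; the actual
    # pipeline uses Lucene's StandardTokenizer (UAX#29 rules).
    return [t.lower() for t in re.findall(r"\w+(?:[.']\w+)*", text)]

print(tokenize("The UK Web Archive hosts '.uk' domains."))
# ['the', 'uk', 'web', 'archive', 'hosts', 'uk', 'domains']
```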
46–47. TRI: The UK Web Archive
▪ Build the vocabulary
▪ Build co-occurrence matrices (context window = 5, to speed up TRI)
▪ Perform TRI
▪ Build time series
▪ Run change point detection
https://github.com/alan-turing-institute/temporal-random-indexing
We performed a preliminary analysis on 20% of the corpus, discarding words that appear fewer than 500 times (~1M words, ~140 billion occurrences).
48–51. Co-occurrence matrices
One matrix for each time period (in our experiment, one month), e.g. 201212_matrix.gz. Each line starts with the target word, followed by pairs of co-occurring word and count:

linux swapping 4 google 173 xp 454 manufacturer 237 job 64 install 255 security 137 cgi 47 operating 705 host 69 performance 44 sharing 56...

We plan to make the matrices freely available! (A parsing sketch follows.)
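A sketch of parsing one matrix line, assuming the target-word/pairs format shown above:

```python
def parse_line(line):
    # 'linux swapping 4 google 173 ...' ->
    # ('linux', {'swapping': 4, 'google': 173, ...})
    parts = line.split()
    target, rest = parts[0], parts[1:]
    counts = {rest[i]: int(rest[i + 1]) for i in range(0, len(rest), 2)}
    return target, counts

target, counts = parse_line("linux swapping 4 google 173 xp 454")
print(target, counts)  # linux {'swapping': 4, 'google': 173, 'xp': 454}
```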
52. TRI
How the neighborhood of the word ‘linux’ changed over time:

December 2000 | December 2012
sparc         | windows
kernel        | microsoft
pwdb          | xp
asm           | debian
unix          | netware
netinet       | macos
packlist      | suse

2000: technical terms. 2012: Linux is recognized as an operating system.
sim(linux_200012, linux_201212) = 0.228
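A sketch of how such neighbor lists and cross-period similarities can be computed, given per-period WordSpaces as dictionaries of vectors (hypothetical names, building on the earlier TRI sketch):

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def neighbors(space, word, k=7):
    # Top-k nearest neighbors of `word` within one WordSpace.
    sims = {w: cos(space[word], v) for w, v in space.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# Because the random vectors are shared, vectors from different
# periods live in the same space, so this comparison is meaningful:
# cos(spaces['200012']['linux'], spaces['201212']['linux'])  # e.g. 0.228
```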
55–56. Build a gold standard for the evaluation
[Screenshots: entries with dated senses from a historical dictionary, the Oxford English Dictionary (OED)]
57. Evaluation Metrics
▪ Precision: how many of the detected change points are correct?
▪ Recall: is TRI able to identify all the change points reported in the OED?
▪ TRI could identify correct change points that are not reported in the OED!
▪ Some words (slang) are not reported in the OED
▫ exploit other dictionaries: Urban Dictionary?
58. Current outcome
▪ WET files for the whole JISC UK Web Domain Dataset (1996-2013)
▫ tokenized content (100%, ~3 TB)
▫ co-occurrence matrices (20%, ~350 GB)
▫ WordSpaces
▸ for each month (20%, ~400 GB)
▸ for each year (20%, ~50 GB)
▪ Time series from 1996 to 2013
▫ cumulative, point-wise, by month, by year
▪ Change point detection with different p-values
59. Conclusion and Future Work
▪ TRI is able to scale up to a large corpus of billions of tokens
▪ Co-occurrence matrices can be re-used to build WordSpaces with other approaches
▪ Future work
▫ matrices and WordSpaces from the whole corpus
▫ finalize the construction of the gold standard
▫ comparison with other approaches (word embedding alignment)