Detecting semantic
shift in large corpora
by exploiting Temporal
Random Indexing
Pierpaolo Basile
pierpaolo.basile@uniba.it
Hello!
I am Pierpaolo Basile
Natural Language Processing
Distributional Semantics
Information Retrieval/Filtering
You can find me at pierpaolo.basile@uniba.it
Words change their meaning (usage)
"Marty, in 2015 people will surf on the web!!!"
"Surf!?!?! On the web!?!?!?"
Motivation
Detect meaning shift
surf the Net/Internet: to use the Internet
When was this meaning introduced?
Diachronic Linguistics
The scientific study of language change over time
(also called Historical Linguistics)
Synchronic vs. Diachronic
Synchronic
It describes the language rules at a specific point in time without taking its history into account
Diachronic
It considers the evolution of a language over time
Diachronic Linguistics
Why?
▪ Observe changes in particular languages
▪ Reconstruct the pre-history of languages
▪ Develop general theories about how and why language
changes
▪ Describe the history of speech communities
▪ Etymology
How to represent
semantics?
Distributional
Semantic Models (DSM)
"You shall know a word by the company it keeps!" (John Rupert Firth)
"The meaning of a word is its use in the language" (Ludwig Wittgenstein)
"Distributional structure", "Mathematical structures of language" (Zellig Harris)
Distributional
Semantic Models
● Analysis of word-usage
statistics over huge
corpora
● Geometric space of
concepts
● Similar words are
represented close in the
space
“A WordSpace is a snapshot of a specific corpus; it does not take into account temporal information”
RI
Random Indexing
Building the WordSpace
▪ Assign a random vector to
each term in the corpus
vocabulary
▪ Semantic vector for a term
is the sum of the context
vectors co-occurring with
the term
Random Vector
…-1 0 1 0 0 0 0 0 1 0 0 0 -1 …
▪ Sparse
▪ High dimensional
▪ Ternary {-1, 0, +1}
▪ Small number of randomly
distributed non-zero
elements
RI
How it works
The quick brown fox jumps over the lazy dog Random vectors
quick: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0>
brown: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0>
fox: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0>
jumps: <0, 0, 0, 0, 0, 1, 0, -1, 0, 0>
over: <0, 0, 0, 0, 0, 0, 1, 0, -1, 0>
lazy: <1, 0, 0, 0, 0, 0, 0, 0, 0, -1>
dog: <0, 0, 0, -1, 0, 1, 0, 0, 0, 0>
context window = 2
sv_fox += <-1, -1, 1, 1, 0, 1, 1, -1, -1, 0>
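The worked example above can be reproduced in a few lines of Python. This is a minimal sketch, not the authors' implementation: the random vectors are copied from the slide instead of being generated, the function name `semantic_vectors` is made up for illustration, and words without a random vector (here "the") are simply skipped.

```python
# Random vectors copied from the slide; "the" has no vector and is skipped.
RANDOM_VECTORS = {
    "quick": [-1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "brown": [0, -1, 0, 1, 0, 0, 0, 0, 0, 0],
    "fox":   [0, 0, 0, 0, -1, 0, 1, 0, 0, 0],
    "jumps": [0, 0, 0, 0, 0, 1, 0, -1, 0, 0],
    "over":  [0, 0, 0, 0, 0, 0, 1, 0, -1, 0],
    "lazy":  [1, 0, 0, 0, 0, 0, 0, 0, 0, -1],
    "dog":   [0, 0, 0, -1, 0, 1, 0, 0, 0, 0],
}


def semantic_vectors(tokens, random_vectors, window=2):
    """Sum, for each term, the random vectors of the terms inside its context window."""
    dim = len(next(iter(random_vectors.values())))
    sv = {}
    for i, target in enumerate(tokens):
        if target not in random_vectors:
            continue
        vec = sv.setdefault(target, [0] * dim)
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            rv = random_vectors.get(tokens[j])
            if j == i or rv is None:
                continue
            for d, value in enumerate(rv):
                vec[d] += value
    return sv


tokens = "the quick brown fox jumps over the lazy dog".split()
sv = semantic_vectors(tokens, RANDOM_VECTORS, window=2)
print(sv["fox"])  # [-1, -1, 1, 1, 0, 1, 1, -1, -1, 0]
```

With a window of 2 the context of "fox" is quick, brown, jumps and over, so the printed vector is the sum of those four random vectors, as on the slide.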
+
math>magic
Random Indexing
(formal)
B approximately preserves the Euclidean
distances between points
(Johnson-Lindenstrauss lemma)
Kanerva, Pentti. Sparse distributed memory. MIT press, 1988.
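The reduction behind this slide is usually written as below. This is a sketch in the notation commonly used for Random Indexing: the symbols A (the n × m co-occurrence matrix), R (the m × k random matrix), u, v (two rows of A) and ε are assumptions, since the slide text only names B.

```latex
% Random projection of the co-occurrence matrix
A_{n \times m} \, R_{m \times k} = B_{n \times k}, \qquad k \ll m

% Johnson-Lindenstrauss: for 0 < \varepsilon < 1, a suitably scaled random R
% and k = O(\log n / \varepsilon^{2}), with high probability for all rows u, v of A
(1 - \varepsilon)\,\lVert u - v \rVert^{2}
  \;\le\; \lVert uR - vR \rVert^{2}
  \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^{2}
```

The rows of B are the semantic vectors: summing the random vectors of the co-occurring terms is the same as multiplying a row of A by R, without ever materializing A.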
Temporal Random Indexing
TRI
[Diagram: (1) one set of random vectors is built for the vocabulary of the whole corpus (Corpus1900, Corpus1901, Corpus1902, ..., Corpus1920); (2) RI is run on each yearly corpus with those random vectors, giving Space1900, Space1901, Space1902, ..., Space1920]
Semantic vector for a term is the sum of the context vectors co-occurring with the term in the same time period
Temporal Random Indexing
TRI
1985 ...surf the sea...
1994 ...surf the web...
Random vectors
surf: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0>
sea: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0>
web: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0>
...
sv_surf_1985 += <0, -1, 0, 1, 0, 0, 0, 0, 0, 0>
sv_surf_1994 += <0, 0, 0, 0, -1, 0, 1, 0, 0, 0>
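The surf example above can be sketched as one WordSpace per time period, all built from the same random vectors. This is an illustrative sketch only: `build_word_spaces`, the dictionary layout and the window size of 2 are assumptions, and the random vectors are copied from the slide.

```python
# Random vectors from the slide, shared by every time period.
RANDOM_VECTORS = {
    "surf": [-1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "sea":  [0, -1, 0, 1, 0, 0, 0, 0, 0, 0],
    "web":  [0, 0, 0, 0, -1, 0, 1, 0, 0, 0],
}


def build_word_spaces(corpus_by_period, random_vectors, window=2):
    """Return {period: {word: semantic vector}} using one shared set of random vectors."""
    dim = len(next(iter(random_vectors.values())))
    spaces = {}
    for period, sentences in corpus_by_period.items():
        space = {}
        for sentence in sentences:
            tokens = sentence.split()
            for i, target in enumerate(tokens):
                if target not in random_vectors:
                    continue
                vec = space.setdefault(target, [0] * dim)
                start, end = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(start, end):
                    rv = random_vectors.get(tokens[j])
                    if j != i and rv is not None:
                        for d, value in enumerate(rv):
                            vec[d] += value
        spaces[period] = space
    return spaces


corpus = {1985: ["surf the sea"], 1994: ["surf the web"]}
spaces = build_word_spaces(corpus, RANDOM_VECTORS)
print(spaces[1985]["surf"])  # [0, -1, 0, 1, 0, 0, 0, 0, 0, 0]
print(spaces[1994]["surf"])  # [0, 0, 0, 0, -1, 0, 1, 0, 0, 0]
```

Because both spaces draw on the same random vectors, the two vectors for "surf" live in the same coordinate system and can be compared directly.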
Temporal Random Indexing
TRI
▪ Corpus with temporal information
▫ split the corpus into several time periods
▪ Build a WordSpace for each time period using TRI
▪ Words in different WordSpaces are comparable!
P. Basile, A. Caputo, G. Semeraro. Temporal random indexing: A system for analysing word meaning over time. IJCoL vol. 1
Random vectors are
shared across time
periods!
Similarity between words can change over time
[Figure: WordSpace 1870: chiamare (call); WordSpace 1920: chiamare (call), telefonare (phone); WordSpace 1930: chiamare (call), telefonare (phone)]
[Plot: TRI on Google N-gram, series for "phone" and "call"]
Methodology
▪ TRI: run TRI on a corpus split in time periods
▪ Time series: provide a time series for each word
▪ Change point detection: detect significant changes in the time series
Time Series
Several time series Γ at the time interval k
▪ log frequency: log of the word frequency in each time period k
▪ point-wise: cosine similarity between word vectors across two time periods
▪ cumulative: considers a cumulative vector of the previous k-1 time periods
Kulkarni et al., Statistically significant detection of linguistic change. WWW 2015.
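The point-wise and cumulative series can be sketched as follows. The reading used here is an assumption: point-wise compares the vector of period k with the vector of period k-1, and cumulative compares the vector of period k with the sum of the vectors of all previous periods. The function names and the toy vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pointwise_series(vectors):
    """Similarity between the word vector at period k and at period k-1."""
    return [cosine(vectors[k - 1], vectors[k]) for k in range(1, len(vectors))]


def cumulative_series(vectors):
    """Similarity between the word vector at period k and the sum of all earlier ones."""
    series = []
    cumulative = list(vectors[0])
    for k in range(1, len(vectors)):
        series.append(cosine(cumulative, vectors[k]))
        cumulative = [a + b for a, b in zip(cumulative, vectors[k])]
    return series


# Toy vectors for one word over three periods (ordered by time).
word_vectors = [
    [0, -1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, -1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, -1, 0, 1, 0, 0, 0],
]
point = pointwise_series(word_vectors)
cum = cumulative_series(word_vectors)
print(point, cum)
```

A low value in either series means the word's contexts in that period differ from its earlier contexts, which is the signal passed on to change point detection.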
Change point
detection
▪ Track the word meaning change over time
▪ Build a time series by taking into account the semantic shift of each word
▪ Find significant change: Mean shift model
Example, telefonare (phone), with a change point marked on the series:
1900  1910  1920  1930  1940
0.25  0.3   0.7   0.8   0.75
Change point
detection
▪Mean shift of Γ pivoted at time period j
▪Search statistically significant mean shift
▫bootstrapping approach under the null hypothesis that
there is no change in the meaning
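A simplified sketch of the mean shift test, applied to the telefonare series shown earlier. Several things here are assumptions: the null hypothesis is simulated by shuffling the series (a permutation-style resampling), the test is one-sided (upward shifts only), no normalization across words is applied, and the names `mean_shift` and `shift_p_values` are illustrative.

```python
import math
import random


def mean(values):
    return math.fsum(values) / len(values)


def mean_shift(series, j):
    """Mean of the series from position j onwards minus the mean before position j."""
    return mean(series[j:]) - mean(series[:j])


def shift_p_values(series, n_samples=2000, seed=0):
    """For each pivot j, estimate how often a shuffled series shows a mean
    shift at least as large as the observed one (no-change null hypothesis)."""
    rng = random.Random(seed)
    pivots = range(1, len(series))
    observed = {j: mean_shift(series, j) for j in pivots}
    hits = {j: 0 for j in pivots}
    for _ in range(n_samples):
        shuffled = list(series)
        rng.shuffle(shuffled)
        for j in pivots:
            # Small tolerance so ties are not lost to floating-point rounding.
            if mean_shift(shuffled, j) >= observed[j] - 1e-12:
                hits[j] += 1
    return {j: hits[j] / n_samples for j in pivots}


series = [0.25, 0.3, 0.7, 0.8, 0.75]  # telefonare, 1900 to 1940
shifts = {j: mean_shift(series, j) for j in range(1, len(series))}
p_values = shift_p_values(series)
largest = max(shifts, key=shifts.get)
print(largest, shifts[largest], p_values[largest])
```

On this series the largest mean shift falls at position 2, that is between 1910 and 1920. With only five points the p-values stay coarse, so a real run needs longer series and a chosen significance threshold.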
Evaluation
Results about the Italian language
Build a gold standard for the evaluation
change point
http://dizionari.corriere.it/dizionario_italiano/
Evaluation
Results
Method       Accuracy
TRI_point    0.3086
TRI_cum      0.2963
TRR1_point   0.2716
log freq     0.2346
TRR2_point   0.1728
TRR1_cum     0.1605
TRR2_cum     0.1235
Accuracy: the year predicted by the system must be equal
or greater than one of the years reported in the gold
standard
TRR1 and TRR2 are variants of TRI
based on Reflective Random Indexing
P. Basile, A. Caputo, R. Luisi, G. Semeraro, Diachronic analysis of the Italian language exploiting Google Ngram, CLiC-it 2016
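The accuracy criterion stated above can be sketched as a small function. This is one possible reading of the rule; the function names, the placeholder words and the years are hypothetical and are not taken from the gold standard.

```python
def is_correct(predicted_year, gold_years):
    """A prediction counts if it is equal to or later than at least one gold year."""
    return any(predicted_year >= gold for gold in gold_years)


def accuracy(predictions, gold_standard):
    """Fraction of predicted words whose change point satisfies the criterion."""
    if not predictions:
        return 0.0
    correct = sum(
        1
        for word, year in predictions.items()
        if is_correct(year, gold_standard.get(word, []))
    )
    return correct / len(predictions)


# Hypothetical data: placeholder words with made-up years.
predictions = {"word_a": 1925, "word_b": 1900}
gold_standard = {"word_a": [1920, 1950], "word_b": [1910]}
print(accuracy(predictions, gold_standard))  # 0.5
```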
Social media
▪ Build TRI on Twitter
▪ About 500M tweets (Feb. 2012 – Sep. 2015)
▪ Time interval = 1 month
Social media
[Figure annotations: Local Election Roma; Marino (Roma Mayor) crisis]
Detecting semantic
shift in large corpora
UK Internet Web Archive
Joint work with
Barbara McGillivray
Research Fellow, Alan Turing Institute
UK Internet
Web Archive
▪ UK Web Archive collects, makes accessible and preserves web
resources of scholarly and cultural importance from the UK domain
▪ JISC UK Web Domain Dataset (1996-2013)
▫ resources from the Internet Archive that were hosted on
domains ending in ‘.uk’
Data Format
▪ ARC format: used to store "web crawls" as sequences of content blocks
▪ WARC format: an enhancement of the ARC format that supports metadata, duplicate detection events and more
Data Format
▪ ARC format: used to store "web crawls" as sequences of content
blocks
▪ WARC format: is an enhancement of ARC format for supporting
metadata, duplicate detection events and more
We need to extract the
textual content from HTML
pages and discard all other
types of content
From ARC/WARC to
WET
▪ WET format: contains extracted plaintext from the data stored in
ARC/WARC archives
[Pipeline: ARC/WARC (Azure Blob Storage) -> Filter HTML pages -> Extract text (Jsoup library) -> WET (Azure Blob Storage); run on Azure Batch and VMs pool]
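The filter-and-extract step can be illustrated with Python's standard library. This is a stand-in sketch: the pipeline on the slide uses the Jsoup library (Java), while the example below uses `html.parser`. The `extract_text` function and the rule of keeping only `text/html` records are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(content_type, payload):
    """Return plain text for HTML records and None for any other content type."""
    if not content_type.lower().startswith("text/html"):
        return None
    parser = TextExtractor()
    parser.feed(payload)
    parser.close()
    return parser.text()


page = ("<html><head><style>p{}</style></head>"
        "<body><p>Hello <b>web</b></p><script>var x=1;</script></body></html>")
print(extract_text("text/html; charset=utf-8", page))  # Hello web
print(extract_text("image/png", "..."))                # None
```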
From WET to tokens
▪ Extract tokens from text using Apache Lucene Standard Tokenizer
▪ Store tokens for each month (time period)
[Pipeline: WET (Azure Blob Storage) -> Tokenization (Azure Batch and VMs pool) -> Tokens (Azure Blob Storage)]
https://github.com/alan-turing-institute/UKWebArchive_semantic_change
ARC/WARC: ~60 Terabytes
WET: 5.5 Terabytes
Tokens: 3 Terabytes
TRI
The UK Web Archive
▪ Build the vocabulary
▪ Build co-occurrence matrices (context window=5, to speed up TRI)
▪ Perform TRI
▪ Build time series
▪ Run change point detection
https://github.com/alan-turing-institute/temporal-random-indexing
We performed a preliminary analysis on 20% of the corpus,
discarding words that appear fewer than 500 times
(~1M words, ~140 billion occurrences)
Co-occurrence
matrices
linux swapping 4 google 173 xp 454 manufacturer
237 job 64 install 255 security 137 cgi 47
operating 705 host 69 performance 44 sharing
56...
One matrix for each time period (in our experiment, one month)
201212_matrix.gz
Each line holds one target word (here linux) followed by pairs of co-occurring word and co-occurrence count (swapping 4, google 173, xp 454, ...)
We plan to make the matrices freely available!
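A line of the matrix format shown above can be read back with a short parser. This is a sketch under stated assumptions: fields are taken to be whitespace-separated with one target word per line, the file is assumed to be gzip-compressed text (suggested by the name 201212_matrix.gz), and the function names are illustrative.

```python
import gzip


def parse_matrix_line(line):
    """Split 'target w1 c1 w2 c2 ...' into (target, {w1: c1, w2: c2, ...})."""
    fields = line.split()
    target, rest = fields[0], fields[1:]
    if len(rest) % 2:
        raise ValueError("expected (word, count) pairs after the target word")
    counts = {rest[i]: int(rest[i + 1]) for i in range(0, len(rest), 2)}
    return target, counts


def read_matrix(path):
    """Yield (target, counts) for every non-empty line of a gzip-compressed matrix."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_matrix_line(line)


target, counts = parse_matrix_line("linux swapping 4 google 173 xp 454 manufacturer 237")
print(target, counts["xp"])  # linux 454
```

Keeping the counts on disk means a WordSpace for a period can be rebuilt by weighting each co-occurring word's random vector by its count, without re-reading the tokens.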
TRI
How the neighborhood of the word ‘linux’ changed over time
December, 2000    December, 2012
sparc             windows
kernel            microsoft
pwdb              xp
asm               debian
unix              netware
netinet           macos
packlist          suse
2000: technical terms
2012: linux is recognized as an operating system
sim(linux_200012, linux_201212) = 0.228
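Neighborhoods like the one above, and the similarity of a word with itself across two periods, both come down to cosine similarity. The sketch below uses tiny made-up vectors, not the real WordSpaces; `nearest_neighbours` and the toy values are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, space, n=7):
    """Return the n words of one WordSpace that are closest to `word`."""
    target = space[word]
    scored = [(cosine(target, vec), other)
              for other, vec in space.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:n]]


# Toy 3-dimensional spaces for two periods (made-up values).
space_2000 = {"linux": [1.0, 0.0, 1.0], "kernel": [1.0, 0.0, 0.9], "banana": [0.0, 1.0, 0.0]}
space_2012 = {"linux": [0.0, 1.0, 1.0], "windows": [0.0, 1.0, 0.8]}

neighbours = nearest_neighbours("linux", space_2000)
self_similarity = cosine(space_2000["linux"], space_2012["linux"])
print(neighbours, self_similarity)
```

Comparing a word with itself across periods is only meaningful because the random vectors are shared, so every period's space uses the same dimensions.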
TRI
word-word similarity
Build a gold standard for the evaluation
Historical dictionary
Evaluation
Metrics
▪ Precision: how many change points are correctly identified?
▪ Recall: is TRI able to identify all the change points reported in
OED?
▪ TRI could identify correct change points not reported in OED!
▪ Some words (slang) are not reported in OED
▫ exploit other dictionaries: Urban Dictionary?
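The two metrics above can be sketched over sets of (word, year) change points. Matching a detected change point to a dictionary entry by exact year is an assumption made only to keep the example short, and the data below is hypothetical.

```python
def precision_recall(detected, gold):
    """Precision and recall of detected change points against a gold standard."""
    detected, gold = set(detected), set(gold)
    correct = detected & gold
    precision = len(correct) / len(detected) if detected else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (word, year) change points.
detected = {("word_a", 2001), ("word_b", 2005), ("word_c", 2010)}
gold = {("word_a", 2001), ("word_d", 1999)}
precision, recall = precision_recall(detected, gold)
print(precision, recall)
```

As the slide notes, a detected change point that is missing from the dictionary lowers precision here even if it is genuine, so the numbers are a lower bound on real quality.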
Current outcome
▪ WET files for the whole JISC UK Web Domain Dataset (1996-2013)
▫ tokenized content (100% ~3TB)
▫ co-occurrence matrices (20% ~350GB)
▫ WordSpace
▸ for each month (20% ~400GB)
▸ for each year (20% ~50GB)
▪ Time series from 1996 to 2013
▫ cumulative, pointwise, month, year
▪ Change point detection with different p-values
Conclusion and
Future work
▪ TRI is able to scale up to a large corpus of billions of tokens
▪ Co-occurrence matrices can be re-used for building WordSpaces
exploiting other approaches
▪ Future work
▫ matrices and WordSpaces from the whole corpus
▫ finalize the construction of the gold standard
▫ comparison with other approaches (word embeddings
alignment)
Thanks!!
Any questions?
You can find me at @basilepp &
pierpaolo.basile@uniba.it

My Presentation "In Your Hands" by Halle Bailey
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 

Detecting semantic shift in large corpora by exploiting temporal random indexing

  • 1. Detecting semantic shift in large corpora by exploiting Temporal Random Indexing Pierpaolo Basile pierpaolo.basile@uniba.it
  • 2. Hello! I am Pierpaolo Basile Natural Language Processing Distributional Semantics Information Retrieval/Filtering You can find me at pierpaolo.basile@uniba.it
  • 3. Words change their meaning (usage) Marty, in 2015 people will surf on the web!!!
  • 4. Words change their meaning (usage) Surf!?!?! On the web!?!?!?
  • 5. Surf!?!?! On the web!?!?!? Motivation: detect meaning shift. "Surf the Net/Internet": to use the Internet. When was this meaning introduced?
  • 6. Diachronic Linguistics The scientific study of language change over time (also called Historical Linguistics)
  • 7. Synchronic vs. Diachronic. Synchronic: describes the language rules at a specific point in time, without taking its history into account. Diachronic: considers the evolution of a language over time.
  • 8. Diachronic Linguistics Why? ▪ Observe changes in particular languages ▪ Reconstruct the pre-history of languages ▪ Develop general theories about how and why language changes ▪ Describe the history of speech communities ▪ Etymology
  • 10. Distributional Semantic Models (DSM). "You shall know a word by the company it keeps!" (John Rupert Firth). "The meaning of a word is its use in the language" (Ludwig Wittgenstein). Distributional structure; Mathematical structures of language (Zellig Harris).
  • 11. Distributional Semantic Models ● Analysis of word-usage statistics over huge corpora ● Geometric space of concepts ● Similar words are represented close in the space
  • 12. A WordSpace is a snapshot of a specific corpus; it does not take temporal information into account.
  • 13. RI Random Indexing Building the WordSpace ▪ Assign a random vector to each term in the corpus vocabulary ▪ Semantic vector for a term is the sum of the context vectors co-occurring with the term Random Vector …-1 0 1 0 0 0 0 0 1 0 0 0 -1 … ▪ Sparse ▪ High dimensional ▪ Ternary {-1, 0, +1} ▪ Small number of randomly distributed non-zero elements
  • 14. RI How it works The quick brown fox jumps over the lazy dog Random vectors quick: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0> brown: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0> fox: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0> jumps: <0, 0, 0, 0, 0, 1, 0, -1, 0, 0> over: <0, 0, 0, 0, 0, 0, 1, 0, -1, 0> lazy: <1, 0, 0, 0, 0, 0, 0, 0, 0, -1> dog: <0, 0, 0, -1, 0, 1, 0, 0, 0, 0>
  • 15. RI How it works The quick brown fox jumps over the lazy dog Random vectors quick: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0> brown: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0> fox: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0> jumps: <0, 0, 0, 0, 0, 1, 0, -1, 0, 0> over: <0, 0, 0, 0, 0, 0, 1, 0, -1, 0> lazy: <1, 0, 0, 0, 0, 0, 0, 0, 0, -1> dog: <0, 0, 0, -1, 0, 1, 0, 0, 0, 0> context window = 2
  • 16. RI How it works The quick brown fox jumps over the lazy dog Random vectors quick: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0> brown: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0> fox: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0> jumps: <0, 0, 0, 0, 0, 1, 0, -1, 0, 0> over: <0, 0, 0, 0, 0, 0, 1, 0, -1, 0> lazy: <1, 0, 0, 0, 0, 0, 0, 0, 0, -1> dog: <0, 0, 0, -1, 0, 1, 0, 0, 0, 0> context window = 2 sv_fox += <-1, -1, 1, 1, 0, 1, 1, -1, -1, 0>
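The update on slide 16 can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `semantic_vectors` is hypothetical, and it assumes the stop word "the" is removed before the context window is applied, since the slide assigns it no random vector.

```python
# Random vectors copied from the slide (toy dimension 10).
RANDOM_VECTORS = {
    "quick": [-1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "brown": [0, -1, 0, 1, 0, 0, 0, 0, 0, 0],
    "fox":   [0, 0, 0, 0, -1, 0, 1, 0, 0, 0],
    "jumps": [0, 0, 0, 0, 0, 1, 0, -1, 0, 0],
    "over":  [0, 0, 0, 0, 0, 0, 1, 0, -1, 0],
    "lazy":  [1, 0, 0, 0, 0, 0, 0, 0, 0, -1],
    "dog":   [0, 0, 0, -1, 0, 1, 0, 0, 0, 0],
}


def semantic_vectors(tokens, random_vectors, window=2):
    """Sum, for each token, the random vectors of its neighbours in the window."""
    dim = len(next(iter(random_vectors.values())))
    sv = {word: [0] * dim for word in random_vectors}
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j == i:
                continue  # a word is not its own context
            for d, value in enumerate(random_vectors[tokens[j]]):
                sv[target][d] += value
    return sv


# Assumption: "the" has already been filtered out as a stop word.
tokens = ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
sv = semantic_vectors(tokens, RANDOM_VECTORS, window=2)
print(sv["fox"])  # [-1, -1, 1, 1, 0, 1, 1, -1, -1, 0]
```

The context of "fox" is quick, brown, jumps and over, so the printed vector matches the sv_fox value on the slide.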
  • 18. Random Indexing (formal) B approximately preserves the Euclidean distance between points (Johnson-Lindenstrauss lemma) Kanerva, Pentti. Sparse distributed memory. MIT Press, 1988.
  • 21. Temporal Random Indexing TRI 1985 ...surf the sea... 1994 ...surf the web... Random vectors surf: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0> sea: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0> web: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0> ...
  • 22. Temporal Random Indexing TRI 1985 ...surf the sea... 1994 ...surf the web... Random vectors surf: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0> sea: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0> web: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0> ... sv_surf_1985 += <0, -1, 0, 1, 0, 0, 0, 0, 0, 0>
  • 23. Temporal Random Indexing TRI 1985 ...surf the sea... 1994 ...surf the web... Random vectors surf: <-1, 0, 1, 0, 0, 0, 0, 0, 0, 0> sea: <0, -1, 0, 1, 0, 0, 0, 0, 0, 0> web: <0, 0, 0, 0, -1, 0, 1, 0, 0, 0> ... sv_surf_1985 += <0, -1, 0, 1, 0, 0, 0, 0, 0, 0> sv_surf_1994 += <0, 0, 0, 0, -1, 0, 1, 0, 0, 0>
  • 24. Temporal Random Indexing TRI ▪ Corpus with temporal information ▫ split the corpus into several time periods ▪ Build a WordSpace for each time period using TRI ▪ Words in different WordSpaces are comparable! P. Basile, A. Caputo, G. Semeraro. Temporal random indexing: A system for analysing word meaning over time. IJCoL vol. 1
  • 25. Temporal Random Indexing TRI ▪ Corpus with temporal information ▫ split the corpus into several time periods ▪ Build a WordSpace for each time period using TRI ▪ Words in different WordSpaces are comparable! P. Basile, A. Caputo, G. Semeraro. Temporal random indexing: A system for analysing word meaning over time. IJCoL vol. 1 Random vectors are shared across time periods!
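The key point of slides 21 to 25, one set of random vectors shared by every time period, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released TRI code: the names `make_random_vector` and `temporal_random_indexing`, the dimension, the number of non-zero elements and the toy corpus (with stop words already removed) are all hypothetical.

```python
import random


def make_random_vector(dim, nonzero, rng):
    """Sparse ternary vector: a few randomly placed +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for n, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if n % 2 == 0 else -1
    return vec


def temporal_random_indexing(corpus_by_period, dim=100, nonzero=4, window=2, seed=0):
    """Build one WordSpace per period, all using the SAME random vectors."""
    rng = random.Random(seed)
    vocab = sorted({t for docs in corpus_by_period.values() for doc in docs for t in doc})
    shared = {word: make_random_vector(dim, nonzero, rng) for word in vocab}
    spaces = {}
    for period, docs in corpus_by_period.items():
        space = {}
        for tokens in docs:
            for i, target in enumerate(tokens):
                sv = space.setdefault(target, [0] * dim)
                start, end = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(start, end):
                    if j != i:
                        for d, value in enumerate(shared[tokens[j]]):
                            sv[d] += value
        spaces[period] = space
    return shared, spaces


# Toy corpus mirroring the slide; "the" is assumed to be filtered out.
corpus = {1985: [["surf", "sea"]], 1994: [["surf", "web"]]}
shared, spaces = temporal_random_indexing(corpus)
print(spaces[1985]["surf"] == shared["sea"])  # True
print(spaces[1994]["surf"] == shared["web"])  # True
```

Because the random vectors are generated once and reused, the semantic vector of "surf" in 1985 and the one in 1994 live in the same space and can be compared directly.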
  • 26. Similarity between words can change over time WordSpace 1870 WordSpace 1920 WordSpace 1930 chiamare (call) chiamare (call) telefonare (phone) chiamare (call) telefonare (phone)
  • 28. Methodology: (1) TRI: run TRI on a corpus split into time periods. (2) Time Series: provide a time series for each word. (3) Change Point Detection: detect significant changes in the time series.
  • 29. Time Series: several time series Γ at the time interval k. Log frequency: log of the word frequency in each time period k. Point-wise: cosine similarity between word vectors across two time periods. Cumulative: considers a cumulative vector of the previous k-1 time periods. Kulkarni et al., Statistically significant detection of linguistic change. WWW 2015.
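The point-wise and cumulative series described on slide 29 can be sketched as below. The function names are hypothetical, and two details are assumptions rather than statements from the slides: the point-wise series compares consecutive periods, and the cumulative vector is the plain sum of the vectors of all previous periods.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pointwise_series(vectors):
    """Similarity between the word vector at period k and at period k-1."""
    return [cosine(vectors[k - 1], vectors[k]) for k in range(1, len(vectors))]


def cumulative_series(vectors):
    """Similarity between the vector at period k and the sum of all earlier ones."""
    series = []
    running = list(vectors[0])
    for k in range(1, len(vectors)):
        series.append(cosine(running, vectors[k]))
        running = [a + b for a, b in zip(running, vectors[k])]
    return series


# Hypothetical semantic vectors of one word over three periods.
word_over_time = [[1, 0], [0, 1], [1, 0]]
point = pointwise_series(word_over_time)
cumul = cumulative_series(word_over_time)
print(point)
print(cumul)
```

In this toy input the point-wise series is zero at both steps, while the cumulative series recovers some similarity at the last step because the accumulated history already contains the first direction.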
  • 30. Change point detection ▪ Track the word meaning change over time ▪ Build a time series by taking into account the semantic shift of each word ▪ Find significant change: Mean shift model. Example series for telefonare (phone): 1900: 0.25, 1910: 0.3, 1920: 0.7, 1930: 0.8, 1940: 0.75, with a change point marked on the series.
  • 31. Change point detection ▪Mean shift of Γ pivoted at time period j ▪Search statistically significant mean shift ▫bootstrapping approach under the null hypothesis that there is no change in the meaning
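A mean shift search with a bootstrap test can be sketched as follows. This is a simplified illustration, not the exact procedure of the talk or of Kulkarni et al.: the names `mean_shift` and `detect_change_point` are hypothetical, the absolute value of the shift is used so that both rises and drops count (an assumption), and the null distribution is built by shuffling the series.

```python
import random


def mean(values):
    return sum(values) / len(values)


def mean_shift(series, j):
    """Mean of the series after pivot j minus the mean up to and including j."""
    return mean(series[j + 1:]) - mean(series[:j + 1])


def detect_change_point(series, n_boot=1000, seed=0):
    """Return (pivot index, p-value) for the largest absolute mean shift.

    Null hypothesis: no change in meaning, so the order of the values is
    irrelevant and shuffled copies of the series give the null distribution.
    """
    rng = random.Random(seed)
    pivots = range(len(series) - 1)
    best = max(pivots, key=lambda j: abs(mean_shift(series, j)))
    observed = abs(mean_shift(series, best))
    shuffled = list(series)
    exceed = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point summation order.
        if abs(mean_shift(shuffled, best)) >= observed - 1e-12:
            exceed += 1
    return best, exceed / n_boot


# Series for "telefonare" from the slide, one value per decade.
periods = [1900, 1910, 1920, 1930, 1940]
gamma = [0.25, 0.3, 0.7, 0.8, 0.75]
pivot, p_value = detect_change_point(gamma)
print(periods[pivot], periods[pivot + 1], p_value)
```

On this series the largest shift falls between 1910 and 1920. With only five points the p-value cannot become very small, so real use needs longer series, such as the monthly ones described later in the talk.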
  • 32. Evaluation Results about the Italian language
  • 33. Build a gold standard for the evaluation change point http://dizionari.corriere.it/dizionario_italiano/
  • 34. Evaluation Results (Method: Accuracy) TRIpoint: 0.3086; TRIcum: 0.2963; TRR1point: 0.2716; log freq: 0.2346; TRR2point: 0.1728; TRR1cum: 0.1605; TRR2cum: 0.1235. Accuracy: the year predicted by the system must be equal to or greater than one of the years reported in the gold standard. TRR1 and TRR2 are variants of TRI based on Reflective Random Indexing. P. Basile, A. Caputo, R. Luisi, G. Semeraro, Diachronic analysis of the Italian language exploiting Google Ngram, CLiC-it 2016
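The accuracy criterion on slide 34 can be written down directly. The sketch below is only an illustration of that rule: the function names, the placeholder words and all years are hypothetical and are not taken from the evaluation data.

```python
def is_correct(predicted_year, gold_years):
    """A prediction counts if it is equal to or later than any gold-standard year."""
    return any(predicted_year >= gold for gold in gold_years)


def accuracy(predictions, gold_standard):
    """Fraction of predicted change points that satisfy the criterion."""
    hits = sum(
        1
        for word, year in predictions.items()
        if word in gold_standard and is_correct(year, gold_standard[word])
    )
    return hits / len(predictions)


# Hypothetical data, for illustration only.
gold_standard = {"word_a": [1920, 1950], "word_b": [1980]}
predictions = {"word_a": 1930, "word_b": 1970}
score = accuracy(predictions, gold_standard)
print(score)  # 0.5
```

Here `word_a` is a hit because 1930 is not earlier than 1920, while `word_b` is a miss because 1970 precedes the only gold year.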
  • 35. Social media ▪ Build TRI on Twitter ▪ About 500M tweets (Feb. 2012 – Sep. 2015) ▪ Time interval = 1 month
  • 38. Detecting semantic shift in large corpora UK Internet Web Archive Joint work with Barbara McGillivray Research Fellow, Alan Turing Institute
  • 39. UK Internet Web Archive ▪ UK Web Archive collects, makes accessible and preserves web resources of scholarly and cultural importance from the UK domain ▪ JISC UK Web Domain Dataset (1996-2013) ▫ resources from the Internet Archive that were hosted on domains ending in ‘.uk’
  • 40. Data Format ▪ ARC format: used to store "web crawls" as sequences of content blocks ▪ WARC format: is an enhancement of ARC format for supporting metadata, duplicate detection events and more
  • 41. Data Format ▪ ARC format: used to store "web crawls" as sequences of content blocks ▪ WARC format: is an enhancement of ARC format for supporting metadata, duplicate detection events and more We need to extract the textual content from HTML pages and discard all other types of content
  • 42. From ARC/WARC to WET ▪ WET format: contains extracted plaintext from the data stored in ARC/WARC archives WARC ARC Filter HTML pages Extract text WET
  • 43. From ARC/WARC to WET WARC ARC Filter HTML pages Extract text WET Azure Blob Storage Azure Blob Storage Azure Batch and VMs pool Jsoup library
  • 44. From WET to tokens ▪ Extract tokens from text using Apache Lucene Standard Tokenizer ▪ Store tokens for each month (time period) WET Tokenization Tokens Azure Blob Storage Azure Blob Storage Azure Batch and VMs pool https://github.com/alan-turing-institute/UKWebArchive_semantic_change
  • 46. TRI The UK Web Archive ▪ Build the vocabulary ▪ Build co-occurrence matrices (context window=5) ▪ Perform TRI ▪ Build time series ▪ Run change point detection
  • 47. TRI The UK Web Archive ▪ Build the vocabulary ▪ Build co-occurrence matrices (to speed up TRI) ▪ Perform TRI ▪ Build time series ▪ Run change point detection https://github.com/alan-turing-institute/temporal-random-indexing We performed a preliminary analysis on 20% of the corpus, discarding words that appear fewer than 500 times (~1M words, ~140 billion occurrences)
  • 48. Co-occurrence matrices linux swapping 4 google 173 xp 454 manufacturer 237 job 64 install 255 security 137 cgi 47 operating 705 host 69 performance 44 sharing 56... One matrix for each time period (in our experiment, one month) 201212_matrix.gz
  • 49. Co-occurrence matrices linux swapping 4 google 173 xp 454 manufacturer 237 job 64 install 255 security 137 cgi 47 operating 705 host 69 performance 44 sharing 56... Target word
  • 50. Co-occurrence matrices linux google 173 xp 454 manufacturer 237 job 64 install 255 security 137 cgi 47 operating 705 host 69 performance 44 sharing 56... co-occurrence co-occ. word count
  • 51. Co-occurrence matrices linux swapping 4 google 173 xp 454 manufacturer 237 job 64 install 255 security 137 cgi 47 operating 705 host 69 performance 44 sharing 56... We plan to make the matrices freely available!
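A reader for the co-occurrence files shown on slides 48 to 51 might look like the sketch below. It assumes, based only on the slide excerpt, that each line holds a target word followed by whitespace-separated pairs of co-occurring word and count, and that the file is gzip-compressed text; the function names are hypothetical and the real files may differ.

```python
import gzip


def parse_cooccurrence_line(line):
    """Split 'target word1 count1 word2 count2 ...' into (target, {word: count})."""
    fields = line.split()
    target, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected alternating word/count pairs after the target")
    counts = {rest[i]: int(rest[i + 1]) for i in range(0, len(rest), 2)}
    return target, counts


def load_matrix(path):
    """Read one period's matrix (for example 201212_matrix.gz) into nested dicts."""
    matrix = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                target, counts = parse_cooccurrence_line(line)
                matrix[target] = counts
    return matrix


# Excerpt of the line shown on the slide.
line = "linux swapping 4 google 173 xp 454 manufacturer 237 job 64 install 255"
target, counts = parse_cooccurrence_line(line)
print(target, counts["xp"])  # linux 454
```

Keeping the counts per period in this form means the TRI step only has to add each shared random vector scaled by its count, instead of re-scanning the tokenized text.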
  • 52. TRI How the neighborhood of the word ‘linux’ changed over time. December 2000: sparc, kernel, pwdb, asm, unix, netinet, packlist (technical terms). December 2012: windows, microsoft, xp, debian, netware, macos, suse (linux is recognized as an operating system). sim(linux_200012, linux_201212) = 0.228
  • 55. Build a gold standard for the evaluation Historical dictionary
  • 56. Build a gold standard for the evaluation Historical dictionary
  • 57. Evaluation Metrics ▪ Precision: how many change points are correctly identified? ▪ Recall: is TRI able to identify all the change points reported in OED? ▪ TRI could identify correct change points not reported in OED! ▪ Some words (slang) are not reported in OED ▫ exploit other dictionaries: Urban Dictionary?
  • 58. Current outcome ▪ WET files for all the JISC UK Web Domain Dataset (1996-2013) ▫ tokenized content (100% ~3TB) ▫ co-occurrence matrices (20% ~350GB) ▫ WordSpace ▸ for each month (20% ~400GB) ▸ for each year (20% ~50GB) ▪ Time series from 1996 to 2013 ▫ cumulative, pointwise, month, year ▪ Change point detection with different p-values
  • 59. Conclusion and Future work ▪ TRI is able to scale up to a large corpus of billions of tokens ▪ Co-occurrence matrices can be re-used for building WordSpaces exploiting other approaches ▪ Future work ▫ matrices and WordSpaces from the whole corpus ▫ finalize the construction of the gold standard ▫ comparison with other approaches (word embeddings alignment)
  • 60. Thanks!! Any questions? You can find me at @basilepp & pierpaolo.basile@uniba.it