Detecting language change for the digital humanities
Oct. 1, 2018
Keynote at 6th Estonian Digital Humanities Conference in Tartu.
If you wish to have the comments and text for the slides, please email me at the address given in the last slide.
Detecting language change for the digital humanities: challenges and opportunities
Nina Tahmasebi, PhD
University of Gothenburg
6th Estonian Digital Humanities Conference
Me
[Slide diagram, academic path: Mathematics (B.Sc. & M.Sc.), Electrical Engineering, Computer Science (PhD + Postdoc), NLP / Language Technology (Researcher), Digital Humanities (Postdoc at CDH)]
LiWA – Living Web Archives
[Slide diagram: LiWA adds improved capturing, noise and spam filtering, temporal coherence, and semantic & terminology evolution on top of existing web archiving technology; the goal is preparing for evolution-aware access support and dealing with terminology evolution over time.]
What is new?
- Increasing amount of historical texts in digital format
- Easy digital access for anyone, not only scholars
- Possibility to digitally analyze historical documents at large scale
- Information from primary sources, not only modern interpretations
Text-based Digital Humanities
Aims
To find word sense changes automatically by
1. modeling word senses
2. comparing these over time
To find what changes, how it changed, and when it changed.
[Slide diagram: the senses of 'rock' (stone, music, lifestyle)]
[Timeline figure, 2008–2018: published approaches to detecting word sense change, split into single-sense and sense-differentiated methods; the method families span embeddings, neural embeddings, dynamic embeddings, topic models, and word sense induction.]

Single-sense:
- 2008: Tahmasebi et al.
- 2009: Sagi et al.
- 2011: Gulordava & Baroni
- 2013: Tang et al.
- 2014: Kim et al.
- 2015: Kulkarni et al.
- 2016: Hamilton et al., Eger & Mehler, Rodda et al., Basile et al.
- 2017: Azarbonyad et al., Takamura et al., Kahmann & Heyer, Bamler & Mandt
- 2018: Yao et al., Rudolph & Blei

Sense-differentiated:
- 2011: Wijaya & Yeniterzi
- 2012: Lau et al., Mihalcea & Nastase
- 2013: Cook et al.
- 2014: Cook et al., Mitra et al.
- 2015: Mitra et al.
- 2016: Frermann & Lapata, Tang et al., Costin-Gabriel & Rebedea, Tjong Kim Sang
- 2017: Tahmasebi & Risse
Downsides
Random in:
- initialization
- the order in which the training examples are seen
100 million tokens per time span*
Typically learn one vector per word: stable or less dominant senses get lost!
[Slide diagram: the senses of 'rock' (stone, music, lifestyle)]
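To make the randomness point concrete, here is a minimal sketch (not from the talk) using gensim: two runs over identical data with different seeds generally yield different vector spaces, so a word's nearest-neighbour list need not agree across runs. Fully deterministic training would additionally require a fixed PYTHONHASHSEED.

```python
# Minimal sketch (assumption: gensim >= 4.0) of training-run instability.
# Identical data, different seeds -> different vector spaces, so the
# nearest-neighbour sets of a word need not agree across runs.
from gensim.models import Word2Vec

# Toy corpus mixing two senses of "rock" (music vs. stone).
sentences = [["rock", "music", "band", "concert"],
             ["rock", "stone", "cliff", "mountain"]] * 500

m1 = Word2Vec(sentences, sg=1, vector_size=50, min_count=1,
              seed=1, workers=1)
m2 = Word2Vec(sentences, sg=1, vector_size=50, min_count=1,
              seed=2, workers=1)

# Compare the top neighbours of "rock" across the two runs.
print(m1.wv.most_similar("rock", topn=3))
print(m2.wv.most_similar("rock", topn=3))
```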
Our study
[Figure: "Size of Kubhist in tokens", number of tokens per year, 1749–1923; y-axis up to 16,000,000 tokens.]
* https://spraakbanken.gu.se/korp/?mode=kubhist
Word2Vec (W2V): a two-layer neural net (skip-gram)
KubHist*: Swedish newspapers, 1749–1925
Trained yearly vectors
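A minimal sketch of how one skip-gram model per year could be trained with gensim; the talk does not give the exact setup, `sentences_for_year` is a hypothetical corpus reader for KubHist, and the hyperparameters are placeholders:

```python
# Sketch: one skip-gram Word2Vec model per year (gensim >= 4.0 assumed).
# sentences_for_year() is a hypothetical helper returning tokenized
# sentences for one KubHist year; it is not part of the talk.
from gensim.models import Word2Vec

models = {}
for year in range(1749, 1926):
    sentences = sentences_for_year(year)   # hypothetical corpus reader
    if not sentences:                      # some years may be empty
        continue
    models[year] = Word2Vec(
        sentences,
        sg=1,             # skip-gram architecture
        vector_size=100,  # placeholder dimensionality
        window=5,
        min_count=5,
    )

# Nearest neighbours of a word in a given year's vector space:
# print(models[1900].wv.most_similar("telefon", topn=10))
```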
What did we do?
11 (10) words over time
nyhet 'news'
tidning 'newspaper'
politik 'politics'
telefon 'telephone'
telegraf 'telegraph'
kvinna 'woman'
man 'man'
glad 'happy'
retorik 'rhetoric'
resa 'travel'
musik 'music'
A = {happy, smiling, glad}
B = {happy, joyful, cheerful, excited}
Overlap = 1
Unique = 3+4-1 = 6
Jaccard similarity = 1/6
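The same computation in a few lines of Python, reproducing the numbers on the slide:

```python
# Jaccard similarity between two nearest-neighbour sets,
# matching the worked example above.
A = {"happy", "smiling", "glad"}
B = {"happy", "joyful", "cheerful", "excited"}

overlap = len(A & B)        # intersection: {happy} -> 1
unique = len(A | B)         # union: 3 + 4 - 1 = 6
jaccard = overlap / unique  # 1/6 ≈ 0.167
print(jaccard)
```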
Result summary
Avg. Jaccard similarity, normalized frequency, and Spearman correlation:
- The more frequent the term, the more stable the vectors
- 0.11–0.19 overlap between years, i.e. 2–3 words in common each year
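A sketch of how such a stability analysis could be computed, assuming the hypothetical dict of yearly gensim models from above and SciPy: average year-to-year Jaccard similarity of a word's top-10 neighbour sets, then a Spearman correlation against frequency.

```python
# Sketch of the stability analysis: correlate word frequency with the
# average year-to-year Jaccard similarity of top-n neighbour sets.
# `models` is the hypothetical dict of yearly Word2Vec models from above.
from scipy.stats import spearmanr

def jaccard(a, b):
    return len(a & b) / len(a | b)

def avg_stability(word, models, years, topn=10):
    """Average Jaccard similarity of the word's top-n neighbour sets
    in consecutive yearly models."""
    sims = []
    for y1, y2 in zip(years, years[1:]):
        n1 = {w for w, _ in models[y1].wv.most_similar(word, topn=topn)}
        n2 = {w for w, _ in models[y2].wv.most_similar(word, topn=topn)}
        sims.append(jaccard(n1, n2))
    return sum(sims) / len(sims)

# Given per-word normalized frequencies and stabilities:
# stabilities = [avg_stability(w, models, years) for w in words]
# rho, p = spearmanr(frequencies, stabilities)
```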
I like the room but not the sheets.
The same sentence after each preprocessing choice (each step changes which words survive):
- stop word filtering
- lemmatization: I like the room but not the sheet.
- only nouns
- frequency filtering
- only verbs
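A minimal sketch of these preprocessing choices using spaCy; this assumes the English model en_core_web_sm is installed, and the exact outputs depend on the model:

```python
# Sketch of common preprocessing choices on the example sentence,
# using spaCy (assumes: pip install spacy &&
# python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like the room but not the sheets.")

# Stop word filtering: drop function words and punctuation.
print([t.text for t in doc if not t.is_stop and not t.is_punct])

# Lemmatization: map inflected forms to base forms ("sheets" -> "sheet").
print([t.lemma_ for t in doc if not t.is_punct])

# POS filtering: keep only nouns, or only verbs.
print([t.text for t in doc if t.pos_ == "NOUN"])
print([t.text for t in doc if t.pos_ == "VERB"])

# Frequency filtering would additionally drop words whose corpus
# frequency falls outside a chosen range (needs corpus-wide counts).
```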