Detecting language change
for the digital humanities;
challenges and opportunities
Nina Tahmasebi, PhD
University of Gothenburg
6th Estonian Digital Humanities conference
Digital humanities
(Postdoc at CDH)
Mathematics
(B.Sc &
M.Sc.)
Me
Electrical
Engineering
Computer science
(PhD + Postdoc)
NLP /
Language
Technology
(Researcher)
Computer science
Data
1010011010010
1001010010101
0011010010101
I love you You love me
I
love
you
Language
Language technology
Data
1010011010
0101001010
0101010011
Digital humanities
Language
Data
1010011010
0101001010
0101010011
Some
terminology
Digital Humanities, Computer Science,
Language Technology
LT  NLP
Data Science
Text and Resource
Long-term / diachronic
Token
Some
terminology
Digital Humanities, Computer Science,
Language Technology
LT  NLP
Data Science
Text
Long-term / diachronic
Token
Vector (1, 4, 3) (=3 dimensions)
Topic modeling
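To make the terms Token and Vector concrete, here is a minimal sketch (not from the talk) that splits the example sentence "I love you You love me" into tokens and builds a count vector over a small vocabulary. The whitespace tokenizer, the lowercasing, the chosen vocabulary and the function names are assumptions for illustration.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    return text.lower().split()


def count_vector(tokens, vocabulary):
    # One dimension per vocabulary word; the value is how often it occurs.
    counts = Counter(tokens)
    return tuple(counts[word] for word in vocabulary)


tokens = tokenize("I love you You love me")
vocabulary = ["i", "love", "you"]  # hypothetical 3-word vocabulary
vector = count_vector(tokens, vocabulary)

print(tokens)
print(vector, "=", len(vector), "dimensions")
```

A vocabulary of three words gives a vector with three dimensions, in the same sense as the (1, 4, 3) example.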
Language
changes
Part I
LiWA – Living Web Archives
preparing for
evolution aware
access support
dealing with
terminology
evolution
Semantic &
Terminology Evolution
Noise and Spam
Filtering
Improved Capturing
Existing Web Archiving Technology
Temporal
Coherence
What
is new?
time
Increasing amount of historical texts
in digital format
Easy digital access for anyone!
Not only scholars.
Possibility to digitally analyze
historical documents
at large scale.
Information from primary sources
Not only modern interpretations.
Text-based Digital Humanities
time
Spelling change
Teutschland Deutschland
Lexical replacement:
Named entity change
time
St. Petersburg → Petrograd → Leningrad → St. Petersburg
time
Lexical replacement:
Petrograd St. Petersburg
felitious happy
awesome
He was an
awesome leader!
He was an
awesome leader!
time
Kona Qwinna Kvinna Qvinna
time
What is the problem?
St. Petersburg
Petrograd
Finding
What is the problem?
St. Petersburg
Finding Interpreting
time
Petrograd
Sebastini’s benefit last night at the
Opera House was overflowing with
the fashionable and gay
The Times, April 27th, 1787
What is the problem?
St. Petersburg
Finding Interpreting
time
Petrograd
girl
criminal
Wolf ‘varg’
Aims
To find word sense changes automatically by
1 Modeling word senses
2 Comparing these over time
To find what changes, how it changed and when it changed
Rock: Stone, Music, Lifestyle
Related work timeline, 2008-2018
Tahmasebi et al. (2008)
Single-sense:
Sagi et al. (2009)
Gulordava & Baroni (2011)
Tang et al. (2013)
Kim et al. (2014)
Kulkarni et al. (2015)
Hamilton et al.; Eger and Mehler; Rodda et al.; Basile et al. (2016)
Azarbonyad et al.; Takamura et al.; Kahmann & Heyer; Bamler & Mandt (2017)
Yao et al.; Rudolph & Blei (2018)
Sense-differentiated:
Wijaya & Yeniterzi (2011)
Lau et al. (2012)
Cook et al. (2013)
Cook et al.; Mitra et al. (2014)
Mitra et al. (2015)
Frermann & Lapata; Tang et al. (2016)
Tahmasebi & Risse (2017)
Other entries on the timeline:
Costin-Gabriel & Rebedea; Tjong Kim Sang (2016)
Mihalcea & Nastase (2012)
Methods: embeddings, dynamic embeddings, neural embeddings, topic models, word sense induction
(Neural) Word embeddings
Word embeddings shown in 2D instead of 50-100000 dimensions
Image: Nieto Pina and Johansson, RANLP’15
Word embedding-based models
Image: Kulkarni et al. WWW’15
Downsides
Random in
• Initialization
• Order in which the training examples are seen
100 million tokens per time span*
Typically learn one vector per word
Stable/less dominant senses get lost!
Rock: Stone, Music, Lifestyle
Presented at DHN2018
A Study on Word2Vec
on a Historical Swedish
Newspaper Corpus
Our study
[Figure: Size of Kubhist in tokens. x-axis: Year, 1749 to 1923; y-axis: Number of tokens, 0 to 16 000 000]
* https://spraakbanken.gu.se/korp/?mode=kubhist
Word2Vec (W2V)
a two-layer neural net
(skip-gram)
KubHist*
Swedish Newspapers
1749-1925
Trained yearly vectors
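The skip-gram setup can be illustrated with a small sketch. It is not the study's code: it only shows how (target, context) training pairs are produced from a sentence within a window, and how sentences could be grouped per year so that one model is trained per year. The window size, the toy sentences and all function names are assumptions.

```python
from collections import defaultdict


def skipgram_pairs(tokens, window=2):
    # Pair each target word with every word at most `window` positions away.
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def pairs_by_year(dated_sentences, window=2):
    # dated_sentences: iterable of (year, list of tokens).
    # Each year gets its own training set, mirroring "yearly vectors".
    by_year = defaultdict(list)
    for year, tokens in dated_sentences:
        by_year[year].extend(skipgram_pairs(tokens, window))
    return dict(by_year)


# Toy, made-up input; the real corpus is KubHist.
corpus = [
    (1900, ["kvinna", "och", "man"]),
    (1901, ["ny", "tidning"]),
]
yearly = pairs_by_year(corpus, window=1)
print(yearly[1900])
print(yearly[1901])
```

Because each year is trained separately, a year with few tokens yields few training pairs, which is one reason small time slices give unstable vectors.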
What did we do?
11 (10) words over time
nyhet 'news'
tidning 'newspaper'
politik 'politics'
telefon 'telephone'
telegraf 'telegraph'
kvinna 'woman'
man 'man'
glad 'happy'
retorik 'rhetoric'
resa 'travel'
musik 'music'
A = {happy, smiling, glad}
B = {happy, joyful, cheerful, excited}
Overlap = 1
Unique = 3+4-1 = 6
Jaccard similarity = 1/6
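The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function name and the handling of two empty sets are assumptions, not part of the talk.

```python
def jaccard(a, b):
    # Size of the intersection divided by the size of the union.
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # Assumption: two empty sets count as similarity 0.
    return len(a & b) / len(union)


A = {"happy", "smiling", "glad"}
B = {"happy", "joyful", "cheerful", "excited"}

overlap = len(A & B)   # words in both sets
unique = len(A | B)    # distinct words across both sets
print(overlap, unique, jaccard(A, B))
```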
Some results I
Woman:
1912: 'kvinna': [valbarhet, valrätt, rösträtt, själfförsörjande, sexuell, okunnig, högerparti, politisk, radikal, vänsterparti]
1908: 'kvinna': [österåsen, ung, rösträtt, ljusglimt, flicka, iförda, knäböjande, begåfvad, värnlös, jubla]
1895: 'kvinna': [qvinna, varelse, människa, öfvermåttan, flicka, reptil, gosse, förälskade, öfvergifven, högväxt]
1879: 'kvinna': [qvarlefva, vålnad, öfvade, rättskaffens, begåfvade, skenbart, skummande, vilde, herskar, mygga]
1867: 'kvinna': [äes, kvrk, kunäe, mle, näo, nuvaranäe, äer, v«r«, uä, äig]
1868: 'kvinna': [piller, kvilken, mis, kade, klo, nde, äock, reäan, äsom, bvilken]
Some results II
Politics:
1925: 'politik': [näring, trygghet, kamp, arbetarrörelse, konservativ, nationell, strävan, europa, neutralitet, önskad]
1922: 'politik': [åskådning, socialistisk, ägnad, demokrati, utrikespolitisk, sakligt, situation, representativ, auktoritet, ärlig]
1900: 'politik': [enig, bvad, finlands, politisk, konstitutionel, revolution, armenien, citera, civiliserade, dementi]
1872: 'politik': [republikansk, opposition, kränka, reaktionär, neutral, republikan, tillbakavisa, changarniers, påfvedöme, horace]
1858: 'politik': [asylrätt, allians, frankrikes, konstitutionell, konflikt, försonlig, rysslands, press, makt, fördrag]
1844: 'politik': [tadla, allians, vägran, irländsk, frankrikes, bemedling, tribun, segra, ministeriell, fördrag]
Result summary
Avg. Jaccard similarity, normalized frequency and Spearman correlation
The more frequent the term,
the more stable the vectors
0.11-0.19 overlap
between years
2-3 words in
common each year
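The averaged overlap can be sketched as follows, assuming it is computed as the mean Jaccard similarity between the nearest-neighbour lists of adjacent time points (the slides do not spell out the exact procedure). The two lists are the 1844 and 1858 neighbours of 'politik' from the results; they are not consecutive years and only illustrate the computation. Function names are hypothetical.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_adjacent_jaccard(neighbours_by_year):
    # Compare each time point with the next available one and average.
    years = sorted(neighbours_by_year)
    scores = [
        jaccard(neighbours_by_year[y1], neighbours_by_year[y2])
        for y1, y2 in zip(years, years[1:])
    ]
    return sum(scores) / len(scores) if scores else 0.0


politik = {
    1844: ["tadla", "allians", "vägran", "irländsk", "frankrikes",
           "bemedling", "tribun", "segra", "ministeriell", "fördrag"],
    1858: ["asylrätt", "allians", "frankrikes", "konstitutionell", "konflikt",
           "försonlig", "rysslands", "press", "makt", "fördrag"],
}

common = set(politik[1844]) & set(politik[1858])
score = mean_adjacent_jaccard(politik)
print(sorted(common), round(score, 3))
```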
Next step
OCR errors
Spelling
normalization
Research
methodologies in DH
Part II
Digital, large-scale data
Data Hypothesis
Data Hypothesis
Questions
Representativeness
Image: http://first-the-trousers.com/hello-world/
The Streetlight effect
method + data = results
result
result
hypothesis
1 Data 2 Method / Preprocessing 3 Hypothesis
result
hypothesis
Reject
1 Method
2 Correct interpretation of the results
result
hypothesis
Accept
Math results, average difference: Men vs. Women
Number of individuals with different math scores, 2016: Men vs. Women
Range of math scores: Men vs. Women
Comparison of the same data
Source: Factfulness
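The contrast between an average difference and the full distributions can be reproduced with a small sketch. The histograms below are made-up numbers, not the data from the cited source; the point is only that a difference in averages says little about how much two groups overlap. The overlap measure (shared area of the two normalized histograms) is one possible choice, assumed here for illustration.

```python
def mean_score(hist):
    # hist maps a score to the number of individuals with that score.
    total = sum(hist.values())
    return sum(score * n for score, n in hist.items()) / total


def overlap_share(hist_a, hist_b):
    # Shared area of the two histograms after normalizing each to sum to 1.
    total_a, total_b = sum(hist_a.values()), sum(hist_b.values())
    scores = set(hist_a) | set(hist_b)
    return sum(
        min(hist_a.get(s, 0) / total_a, hist_b.get(s, 0) / total_b)
        for s in scores
    )


# Hypothetical score distributions for two groups.
group_a = {400: 10, 500: 30, 600: 40, 700: 20}
group_b = {400: 20, 500: 40, 600: 30, 700: 10}

difference = mean_score(group_a) - mean_score(group_b)
shared = overlap_share(group_a, group_b)
print(difference, shared)
```

Reporting only `difference` corresponds to the first view of the data; reporting `shared` alongside it corresponds to showing the full range of scores.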
1 Method
2 Correct interpretation of the results
3 Where do the results live?
result
hypothesis
NLP pipeline: From text to result
Tokenization
Lemmatization
Part-of-speech tagging
Filtering: Stopwords
Filtering: Function words
Dimensions
Text-mining method
I like the room but not the sheets.
I like the room but not the sheets. (after stop word filtering)
I like the room but not the sheet. (after lemmatization)
I like the room but not the sheet. (only nouns)
I like the room but not the sheet. (frequency filtering)
I like the room but not the sheet. (only verbs)
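The preprocessing steps can be sketched on the example sentence. Everything below is a toy stand-in: the stop word list, the lemma table and the part-of-speech table are hand-written assumptions for this one sentence, where a real pipeline would use trained taggers and lemmatizers.

```python
import re

STOPWORDS = {"i", "the", "but", "not"}   # assumed stop word list
LEMMAS = {"sheets": "sheet"}             # assumed lemma table
POS = {                                  # assumed part-of-speech tags
    "i": "PRON", "like": "VERB", "the": "DET", "room": "NOUN",
    "but": "CONJ", "not": "ADV", "sheet": "NOUN", "sheets": "NOUN",
}


def tokenize(text):
    # Lowercase and keep alphabetic runs only; punctuation is dropped.
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]


def keep_pos(tokens, tag):
    return [t for t in tokens if POS.get(t) == tag]


tokens = tokenize("I like the room but not the sheets.")
content = lemmatize(remove_stopwords(tokens))
nouns = keep_pos(content, "NOUN")
verbs = keep_pos(content, "VERB")
print(tokens)
print(content)
print(nouns, verbs)
```

With this stop word list the negation "not" disappears, so the filtered text no longer says that the sheets were disliked. Each preprocessing choice changes what the later method can see.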
Viewpoint on the data
Choosing a method
1010011010010
1001010010101
0011010010101
Results
Marie Antoinette
Queen of
France
Child of
Empress Maria Theresa
Child of
Francis I
Archduchess
Austria
Evaluation
Prof. Hans Rosling, Factfulness:
You can’t understand the world without numbers…
… and you cannot understand it only with numbers.
Thank you for listening!
Nina.tahmasebi@gu.se
nina@tahmasebi.se
