A Study on Word2Vec on a Historical Swedish Newspaper Corpus
Nina Tahmasebi, PhD
University of Gothenburg
DHN2018
2018-03-08
Background
What is new?
Increasing amount of historical texts in digital format
Easy digital access for anyone! Not only scholars.
Possibility to digitally analyze historical documents at large scale.
Information from primary resources. Not only modern interpretations.
Culturomics: Tracking the development of cultural and language phenomena over time. E.g., topics, opinions …
Background
Language Change
• Spelling variations: Teutschland → Deutschland
• Diachronic word replacement (w2w): felicitous → happy
• Named entity change: Petrograd / St. Petersburg
• Word sense change: “He was an awesome leader”
• General language change: Kona → Qwinna → Qvinna → Kvinna
What is the problem?
[Figure: timeline from 1500 to 2000 (axis: time) marking the names St. Petersburg and Petrograd]
Finding relevant content
• Spelling variations, named entity change
Interpreting content
• Word sense change, diachronic word replacements
Aligning e.g., topics or opinions across time
E.g., “iPhone 5 is awesome” → positive or negative review?
The Problem – Interpreting
“Sestini’s benefit last night at the Opera House was overflowing with the fashionable and gay”
The Times, April 27th, 1787
Aims
To find word sense changes automatically by
1. Modeling word senses
2. Comparing these over time
To find what changed, how it changed, and when it changed
We are not the first, nor the last
• Context vectors
• Topic models
• Graph-based methods
• Word embedding methods
[Figure: context vectors of a word w at two time points ti and tj, built from example text; corpora shown: BNC, ukWaC]
Word embeddings
W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")
Word embeddings shown in 2D instead of 50-100000 dimensions
Image: Nieto Pina and Johansson, RANLP’15
Iraq - Violence = Jordan
Human - Animal = Ethics
President - Power = Prime Minister
Library - Books = Hall
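The vector arithmetic on this slide can be illustrated with a small sketch. The 2-D vectors below are hypothetical toy values chosen so that the gender offset is the same for every pair; real embeddings are learned from text and only satisfy such relations approximately. The helper names (`analogy`, `cosine`) are illustrative and not taken from any library.

```python
import math

# Hypothetical 2-D toy vectors (assumption): first coordinate encodes the
# "role", second coordinate encodes gender. Real embeddings are learned.
W = {
    "man":   (1.0, 1.0),
    "woman": (1.0, 2.0),
    "uncle": (2.0, 1.0),
    "aunt":  (2.0, 2.0),
    "king":  (3.0, 1.0),
    "queen": (3.0, 2.0),
}

def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    """Component-wise sum a + b."""
    return tuple(x + y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest (by cosine) to W[b] - W[a] + W[c]."""
    target = add(sub(W[b], W[a]), W[c])
    candidates = [w for w in W if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(W[w], target))

# "man" is to "woman" as "king" is to ...?
answer_king = analogy("man", "woman", "king")
answer_uncle = analogy("man", "woman", "uncle")
```

With these toy values the offset W("woman") − W("man") equals W("queen") − W("king") exactly, so the nearest remaining word to the target is the expected one.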
Word embedding-based models
Kulkarni et al. WWW’15
• Project a word onto a vector/point (POS, frequency and embeddings)
• Track vectors over time
Kim et al. LACSS 2014
Basile et al. CLiC-it 2016
Zhang et al. WWW’16
Hamilton et al. ACL 2016
Bamler and Mandt ICML 2017
Image: Kulkarni et al. WWW’15
Downsides with word embeddings
1. Randomness in
• Initialization
• Order in which the training examples are seen
2. Need about 100 million tokens per time span
3. Typically learn one vector per word
Stable/less dominant senses get lost!
[Figure: senses of the word “Rock”: Stone, Music, Lifestyle]
Our study
• Word2Vec (W2V)
• a two-layer neural net (skip-gram), out-of-the-box, using the Deeplearning4j (DL4J) library for Java
• Kubhist
• Swedish newspapers
• 1749-1925
• Trained yearly vectors
[Figure: Size of Kubhist in tokens. x-axis: Year (1749 to 1923); y-axis: Number of tokens (0 to 16 000 000)]
What did we do?
• 11 (10) words over time: nyhet 'news', tidning 'newspaper', politik 'politics', telefon 'telephone', telegraf 'telegraph', kvinna 'woman', man 'man', glad 'happy', retorik 'rhetoric', resa 'travel' and musik 'music'.
A = {happy, smiling, glad}
B = {happy, joyful, cheerful, excited}
Overlap = 1
Unique = 3+4-1 = 6
Jaccard similarity = 1/6
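The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of the standard Jaccard similarity on sets; the function name `jaccard` is illustrative, and returning 0.0 for two empty sets is a convention assumed here, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets have similarity 0.0.
        return 0.0
    return len(a & b) / len(union)

A = {"happy", "smiling", "glad"}
B = {"happy", "joyful", "cheerful", "excited"}

overlap = len(A & B)        # words in both sets
unique = len(A | B)         # distinct words overall: 3 + 4 - 1
similarity = jaccard(A, B)  # overlap / unique
```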
Result summary
Avg. Jaccard similarity, normalized frequency and Spearman correlation
Study on W2V for Kubhist
• The more frequent the term, the more stable the vectors
• 0.11-0.19 overlap between years → 2-3 words in common each year
1 word → jacc = 0.05
2 words → jacc = 0.11
3 words → jacc = 0.18
4 words → jacc = 0.25
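The mapping from shared words to Jaccard values follows from the set sizes. The derivation below assumes that each year is represented by its n = 10 nearest neighbours; the slides do not state n at this point, but the result lists show ten words per year and the resulting values match the figures above.

```latex
% Two neighbour lists A and B with |A| = |B| = n and k shared words:
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{k}{2n - k}

% With n = 10 (assumed):
J = \frac{k}{20 - k}, \qquad
k = 1:\ \tfrac{1}{19} \approx 0.05, \quad
k = 2:\ \tfrac{2}{18} \approx 0.11, \quad
k = 3:\ \tfrac{3}{17} \approx 0.18, \quad
k = 4:\ \tfrac{4}{16} = 0.25
```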
Some Swedish results
Women:
1912: 'kvinna': [valbarhet, valrätt, rösträtt, själfförsörjande, sexuell, okunnig, högerparti, politisk, radikal, vänsterparti]
1908: 'kvinna': [österåsen, ung, rösträtt, ljusglimt, flicka, iförda, knäböjande, begåfvad, värnlös, jubla]
1895: 'kvinna': [qvinna, varelse, människa, öfvermåttan, flicka, reptil, gosse, förälskade, öfvergifven, högväxt]
1879: 'kvinna': [qvarlefva, vålnad, öfvade, rättskaffens, begåfvade, skenbart, skummande, vilde, herskar, mygga]
1867: 'kvinna': [äes, kvrk, kunäe, mle, näo, nuvaranäe, äer, v«r«, uä, äig]
1868: 'kvinna': [piller, kvilken, mis, kade, klo, nde, äock, reäan, äsom, bvilken]
Some Swedish results
Politics:
1925: 'politik': [näring, trygghet, kamp, arbetarrörelse, konservativ, nationell, strävan, europa, neutralitet, önskad]
1922: 'politik': [åskådning, socialistisk, ägnad, demokrati, utrikespolitisk, sakligt, situation, representativ, auktoritet, ärlig]
1900: 'politik': [enig, bvad, finlands, politisk, konstitutionel, revolution, armenien, citera, civiliserade, dementi]
1872: 'politik': [republikansk, opposition, kränka, reaktionär, neutral, republikan, tillbakavisa, changarniers, påfvedöme, hora]
1858: 'politik': [asylrätt, allians, frankrikes, konstitutionell, konflikt, försonlig, rysslands, press, makt, fördrag]
1844: 'politik': [tadla, allians, vägran, irländsk, frankrikes, bemedling, tribun, segra, ministeriell, fördrag]
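As an illustration of how such lists can be compared across years, the sketch below computes the shared words and the Jaccard similarity for three of the 'politik' lists shown above. The neighbour lists are copied from the slide; the pairing of each year with the next one in this small sample, and the variable names, are assumptions made for the example and not necessarily the procedure used in the study.

```python
# Nearest-neighbour lists for 'politik', copied from the slide above.
neighbours = {
    1844: ["tadla", "allians", "vägran", "irländsk", "frankrikes",
           "bemedling", "tribun", "segra", "ministeriell", "fördrag"],
    1858: ["asylrätt", "allians", "frankrikes", "konstitutionell", "konflikt",
           "försonlig", "rysslands", "press", "makt", "fördrag"],
    1872: ["republikansk", "opposition", "kränka", "reaktionär", "neutral",
           "republikan", "tillbakavisa", "changarniers", "påfvedöme", "hora"],
}

def jaccard(a, b):
    """Jaccard similarity of two word lists (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Compare each sampled year with the next sampled year (assumed pairing).
years = sorted(neighbours)
overlaps = {}
for earlier, later in zip(years, years[1:]):
    shared = set(neighbours[earlier]) & set(neighbours[later])
    score = jaccard(neighbours[earlier], neighbours[later])
    overlaps[(earlier, later)] = (shared, score)

for (earlier, later), (shared, score) in overlaps.items():
    print(earlier, later, sorted(shared), round(score, 2))
```

For 1844 and 1858 the lists share three words (allians, frankrikes, fördrag), while 1858 and 1872 share none.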
Next step
• OCR errors
• Spelling normalization
utbetalt34 1
utbetalta 15
utbetaltu 1
utbe¬ 2
utbfjudes 1
utbftte 1
utbi 1
utbiJdning 1
utbiedd 1
utbiedning 2
breflådör 1
breflåifas 1
breflörsändning 1
breflösen 5
brefmassan 1
brefmottagningsställen 1
brefmärken 4
brefmärkena 2
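A very rough sketch of what rule-based spelling normalization could look like is given below. It is not the method of the study: the three rewrite rules (qv → kv, fv → v, hv → v) are a small illustrative subset of known differences between older and modern Swedish spelling, the counts are hypothetical, and OCR errors such as those in the list above would need separate handling.

```python
from collections import Counter

# Illustrative rewrite rules only (old spelling -> modern spelling).
# A real normalizer would need many more rules and exception handling.
RULES = [("qv", "kv"), ("fv", "v"), ("hv", "v")]

def normalize(token):
    """Lowercase a token and apply each rewrite rule in order."""
    token = token.lower()
    for old, new in RULES:
        token = token.replace(old, new)
    return token

def merge_counts(counts):
    """Sum the frequencies of all tokens that share a normalized form."""
    merged = Counter()
    for token, n in counts.items():
        merged[normalize(token)] += n
    return merged

# Word forms taken from the slides; the counts are hypothetical.
counts = {"qvinna": 3, "kvinna": 5, "begåfvad": 2, "öfvergifven": 1}
merged = merge_counts(counts)
```

Merging variants in this way would let a single vector be trained for kvinna instead of separate vectors for each spelling.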
Current and Future work
• Test neural word embeddings on Kubhist
• Correct OCR errors
• Spelling variations
• Find diachronic word replacements
• Handikappad → funktionsnedsatt → funktionshindrad
• Sense-based embeddings for word sense change
Thank you!
Nina Tahmasebi, PhD
Språkbanken (The Swedish Language Bank),
Center for Digital humanities
University of Gothenburg
Nina.tahmasebi@GU.se
Algorithm Overview
Step 1: Word sense discrimination (curvature clustering) on individual time slices, O(|S|T)
Step 2: Detecting stable senses → units
Step 3: Relating units into paths
[Figure: senses of the word “Rock”: Stone, Music, Lifestyle]
Tahmasebi & Risse, RANLP2017
