Dhn2018-A Study on Word2Vec on a Historical Swedish Newspaper Corpus

Presentation from the Digital Humanities in the Nordic Countries 3rd Conference (DHN2018)


  1. A Study on Word2Vec on a Historical Swedish Newspaper Corpus. Nina Tahmasebi, PhD, University of Gothenburg. DHN2018, 2018-03-08
  2. Background: What is new? • Increasing amount of historical texts in digital format • Easy digital access for anyone, not only scholars • Possibility to digitally analyze historical documents at large scale • Information from primary resources, not only modern interpretations • Culturomics: tracking the development of cultural and language phenomena over time, e.g., topics, opinions …
  3. Background: Language change • Spelling variations (Teutschland → Deutschland) • Diachronic word replacement (w2w) (felicitous → happy)
  4. Background: Language change • Spelling variations • Diachronic word replacement (w2w) • Named entity change (Petrograd / St. Petersburg)
  5. Background: Language change • Spelling variations • Diachronic word replacement (w2w) • Named entity change • Word sense change (e.g., “He was an awesome leader”)
  6. Background: Language change • Spelling variations • Diachronic word replacement (w2w) • Named entity change • Word sense change • General language change (Kona → Qwinna → Qvinna → Kvinna)
  7. What is the problem? • Finding relevant content: spelling variations, named entity change • Interpreting content: word sense evolution, term to term evolution • Aligning e.g., topics or opinions across time. E.g., “Iphone 5 is awesome” → positive or negative review? [Timeline, 1500 to 2000: Petrograd, St. Petersburg]
  8. What is the problem? • Finding relevant content: spelling variations, named entity change • Interpreting content: word sense change, diachronic word replacements • Aligning e.g., topics or opinions across time. E.g., “Iphone 5 is awesome” → positive or negative review? [Timeline, 1500 to 2000: St. Petersburg]
  9. The Problem – Interpreting: “Sestini’s benefit last night at the Opera House was overflowing with the fashionable and gay”
  10. The Problem – Interpreting
  11. The Problem – Interpreting: “Sestini’s benefit last night at the Opera House was overflowing with the fashionable and gay” (The Times, April 27th, 1787)
  12. Aims: To find word sense changes automatically by (1) modeling word senses and (2) comparing these over time. To find what changed, how it changed, and when it changed.
  13. We are not the first, nor the last • Context vectors • Topic models • Graph-based methods • Word embedding methods [Figure: context vectors for a word w at times ti and tj, with sample sentences labelled BNC and ukWaC]
  14. Word embeddings • W("woman") − W("man") ≃ W("aunt") − W("uncle") • W("woman") − W("man") ≃ W("queen") − W("king") • Word embeddings shown in 2D instead of 50-100000 (Image: Nieto Pina and Johansson, RANLP’15) • Iraq − Violence = Jordan • Human − Animal = Ethics • President − Power = Prime Minister • Library − Books = Hall
  15. Word embedding-based models • Kulkarni et al. WWW’15: project a word onto a vector/point (POS, frequency and embeddings); track vectors over time • Kim et al. LACSS 2014 • Basile et al. CLiC-it 2016 • Zhang et al. WWW’16 • Hamilton et al. ACL 2016 • Bamler and Mandt ICML 2017 (Image: Kulkarni et al. WWW’15)
  16. Downsides with word embeddings: (1) Random in • initialization • order in which the training examples are seen. (2) 100 million tokens per time span. (3) Typically learn one vector per word: stable/less dominant senses get lost! [Figure labels: Rock: Stone, Music, Lifestyle]
  17. Our study • Word2Vec (W2V): a two-layer neural net (skip-gram), out-of-the-box using Deeplearning4j (DL4J) for Java • Kubhist: Swedish newspapers, 1749-1925 • Trained yearly vectors [Chart: Size of Kubhist in tokens; x-axis: Year, 1749 to 1923; y-axis: Number of tokens, 0 to 16,000,000]
  18. What did we do? • 11 (10) words over time: nyhet 'news', tidning 'newspaper', politik 'politics', telefon 'telephone', telegraf 'telegraph', kvinna 'woman', man 'man', glad 'happy', retorik 'rhetoric', resa 'travel' and musik 'music' • Jaccard similarity example: A = {happy, smiling, glad}, B = {happy, joyful, cheerful, excited}; Overlap = 1; Unique = 3 + 4 − 1 = 6; Jaccard similarity = 1/6
  19. Result summary: average Jaccard similarity, normalized frequency and Spearman correlation
  20. Study on W2V for Kubhist • The more frequent the term, the more stable the vectors • 0.11-0.19 overlap between years → 2-3 words in common each year • 1 word → jacc = 0.05; 2 words → jacc = 0.11; 3 words → jacc = 0.18; 4 words → jacc = 0.25
  21. Some Swedish results. Women: 1912: 'kvinna': [valbarhet, valrätt, rösträtt, själfförsörjande, sexuell, okunnig, högerparti, politisk, radikal, vänsterparti]; 1908: 'kvinna': [österåsen, ung, rösträtt, ljusglimt, flicka, iförda, knäböjande, begåfvad, värnlös, jubla]; 1895: 'kvinna': [qvinna, varelse, människa, öfvermåttan, flicka, reptil, gosse, förälskade, öfvergifven, högväxt]; 1879: 'kvinna': [qvarlefva, vålnad, öfvade, rättskaffens, begåfvade, skenbart, skummande, vilde, herskar, mygga]; 1867: 'kvinna': [äes, kvrk, kunäe, mle, näo, nuvaranäe, äer, v«r«, uä, äig]; 1868: 'kvinna': [piller, kvilken, mis, kade, klo, nde, äock, reäan, äsom, bvilken]
  22. Some Swedish results. Politics: 1925: 'politik': [näring, trygghet, kamp, arbetarrörelse, konservativ, nationell, strävan, europa, neutralitet, önskad]; 1922: 'politik': [åskådning, socialistisk, ägnad, demokrati, utrikespolitisk, sakligt, situation, representativ, auktoritet, ärlig]; 1900: 'politik': [enig, bvad, finlands, politisk, konstitutionel, revolution, armenien, citera, civiliserade, dementi]; 1872: 'politik': [republikansk, opposition, kränka, reaktionär, neutral, republikan, tillbakavisa, changarniers, påfvedöme, hora]; 1858: 'politik': [asylrätt, allians, frankrikes, konstitutionell, konflikt, försonlig, rysslands, press, makt, fördrag]; 1844: 'politik': [tadla, allians, vägran, irländsk, frankrikes, bemedling, tribun, segra, ministeriell, fördrag]
  23. Next step • OCR errors • Spelling normalization. [Vocabulary excerpt: utbetalt34 1, utbetalta 15, utbetaltu 1, utbe¬ 2, utbfjudes 1, utbftte 1, utbi 1, utbiJdning 1, utbiedd 1, utbiedning 2, breflådör 1, breflåifas 1, breflörsändning 1, breflösen 5, brefmassan 1, brefmottagningsställen 1, brefmärken 4, brefmärkena 2]
  24. Current and future work • Test neural word embeddings on Kubhist • Correct OCR errors • Spelling variations • Find diachronic word replacements, e.g., Handikappad → funktionsnedsatt → funktionshindrad • Sense-based embeddings for word sense change
  25. Thank you! Nina Tahmasebi, PhD, Språkbanken (The Swedish Language Bank), Center for Digital Humanities, University of Gothenburg. Nina.tahmasebi@GU.se
  26. Algorithm Overview • Step 1: Word sense discrimination (curvature clustering) on individual time slices, O(|S|T) • Step 2: Detecting stable senses → units • Step 3: Relating units → paths [Figure labels: Rock: Stone, Music, Lifestyle] (Tahmasebi & Risse, RANLP2017)
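
The Jaccard example on slide 18 and the overlap table on slide 20 can be reproduced with a short sketch. It is written in Python for brevity, whereas the study itself used DL4J for Java. The function names are illustrative, and the list size of ten is an assumption read off the ten-word neighbor lists on slides 21 and 22; neither is taken from the original implementation.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def jaccard_from_overlap(shared, list_size=10):
    """Jaccard similarity of two neighbor lists of equal size that
    share `shared` words. list_size=10 is an assumption based on the
    ten-word lists shown on the result slides."""
    return shared / (2 * list_size - shared)


# Worked example from slide 18: 1 shared word, 6 unique words.
A = {"happy", "smiling", "glad"}
B = {"happy", "joyful", "cheerful", "excited"}
example = jaccard(A, B)

# Overlap table from slide 20, for 1 to 4 shared words.
table = {k: jaccard_from_overlap(k) for k in range(1, 5)}

if __name__ == "__main__":
    print(f"slide 18 example: {example:.3f}")
    for k, value in table.items():
        print(f"{k} shared word(s): jacc = {value:.2f}")
```

Under the ten-word assumption, k shared words give k / (20 − k), which agrees with the rounded values listed on slide 20.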
