Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Detecting language change for the digital humanities

80 views

Published on

Keynote at 6th Estonian Digital Humanities Conference in Tartu.

If you wish to have the comments and text for the slides, please email me at the address given in the last slide.

Published in: Science
  • Be the first to comment

  • Be the first to like this

Detecting language change for the digital humanities

  1. 1. Detecting language change for the digital humanities; challenges and opportunities Nina Tahmasebi, PhD University of Gothenburg 6th Estonian Digital Humanities conference
  2. 2. Digital humanities (Postdoc at CDH) Mathematics (B.Sc & M.Sc.) Me Electrical Engineering Computer science (Phd + Postdoc) NLP / Language Technology (Researcher)
  3. 3. Computer science Data 1010011010010 1001010010101 0011010010101
  4. 4. I love you You love me I love you
  5. 5. Language Language technology Data 1010011010 0101001010 0101010011
  6. 6. Prezentio id: 1
  7. 7. Digital humanities Language Data 1010011010 0101001010 0101010011
  8. 8. Some terminology Digital Humanities, Computer Science, Language Technology LT  NLP Data Science Text and Resource Long-term / diachronic Token
  9. 9. Some terminology A Ba Digital Humanities, Computer Science, Language Technology LT  NLP Data Science Text Long-term / diachronic Token Vector (1, 4, 3) (=3 dimensions) Topic modeling
  10. 10. Language changes Part I
  11. 11. LiWA – Living Web Archives preparing for evolution aware access support dealing with terminology evolution Semantic & Terminology Evolution Noise and Spam Filtering Improved Capturing Existing Web Archiving Technology Temporal Coherence
  12. 12. What is new? time Increasing amount of historical texts in digital format Easy digital access for anyone! Not only scholars. Possibility to digitally analyze historical documents at large scale. Information from primary sources Not only modern interpretations. Text-based Digital Humanities
  13. 13. time Spelling change Teutschland Deutschland
  14. 14. Lexical replacement: Named entity change time St. Petersburg St. Petersburg Petrograd Leningrad
  15. 15. time Lexical replacement: Petrograd St. Petersburg felitious happy
  16. 16. awesome He was an awesome leader! He was an awesome leader! time
  17. 17. Kona Qwinna KvinnaQvinna
  18. 18. time What is the problem? St. Petersburg Petrograd Finding
  19. 19. What is the problem? St. Petersburg Finding Interpreting time Petrograd
  20. 20. Sebastini’s benefit last night at the Opera House was overflowing with the fashionable and gay
  21. 21. The Times, April 27th, 1787 Sebastini’s benefit last night at the Opera House was overflowing with the fashionable and gay
  22. 22. What is the problem? St. Petersburg Finding Interpreting time Petrograd
  23. 23. girl criminal Wolf ‘varg’
  24. 24. Aims To find word sense changes automatically by To find what changes, how it changed and when it changed Stone Music Lifestyle Rock 1 Modeling word senses 2 Comparing these over time
  25. 25. 20132008 201220102009 2011 2014 2015 2016 2017 2018 Tahmasebi et al. 2008 Single-sense Sense-differentiated Sagi et al 2009 Gulordava & Baroni 2011 Tang et al 2013 Kim et al 2014 Kulkarni et al 2015 Hamilton et al Eger and Mehler Rodda et al Basile et al 2016 Azarbonyad et al Takamura et al Kahnmann & Heyer Bamler & Mandt 2017 Yao et a, Rudolph & Blei 2018 Wijaya & Yentizerzi 2011 Lau et al 2012 Cook et al 2013 Cook et al Mitra et al 2014 Mitra et al 2015 Frerman & Lapata Tang et al 2016 Tahmasebi & Risse 2017 Costin-Gabriel & Rebedea Tjong Kim Sang 2016 embeddings dynamic embeddings neural embeddings topic models word sense induction Mihalcea & Nastase 2012
  26. 26. topic models word sense induction 20132008 201220102009 2011 2014 2015 2016 2017 2018 embeddings dynamic embeddings neural embeddings
  27. 27. (Neural) Word embeddings Word embeddings shown in 2D instead of 50-100000 Image: Nieto Pina and Johansson, RANLP’15
  28. 28. Word embedding-based models Image: Kulkarni et al. WWW’15
  29. 29. Downsides Random in • Initialization • Order in which the training examples are seen 100 Million tokens per time span* Typically learn one vector per word  Stable/less dominant senses get lost! Stone Music Lifestyle Rock
  30. 30. Presented at DHN2018 A Study on Word2Vec on a Historical Swedish Newspaper Corpus
  31. 31. Our study - 2000 000 4000 000 6000 000 8000 000 10000 000 12000 000 14000 000 16000 000 1749 1757 1779 1787 1795 1803 1811 1819 1827 1835 1843 1851 1859 1867 1875 1883 1891 1899 1907 1915 1923 Numberoftokens Year Size of Kubhist in tokens tokens * https://spraakbanken.gu.se/korp/?mode=kubhist Word2Vec (W2V) a two-layer neural net (skip-gram) KubHist* Swedish Newspapers 1749-1925 Trained yearly vectors
  32. 32. What did we do? 11 (10) words over time nyhet 'news' tidning 'newspaper' politik 'politics' telefon 'telephone' telegraf 'telegraph' kvinna 'woman' man 'man' glad, 'happy' retorik ‘rhetoric' resa 'travel' musik 'music' A = {happy, smiling, glad} B = {happy, joyful, cheerful, excited} Overlap = 1 Unique = 3+4-1 = 6 Jaccard similarity = 1/6
  33. 33. Some results I Woman: 1912: 'kvinna': [valbarhet, valrätt, rösträtt, själfförsörjande, sexuell, okunnig, högerparti, politisk, radikal, vänsterparti] 1908: 'kvinna': [österåsen, ung, rösträtt, ljusglimt, flicka, iförda, knäböjande, begåfvad, värnlös, jubla] 1895: 'kvinna': [qvinna, varelse, människa, öfvermåttan, flicka, reptil, gosse, förälskade, öfvergifven, högväxt] 1879: 'kvinna': [qvarlefva, vålnad, öfvade, rättskaffens, begåfvade, skenbart, skummande, vilde, herskar, mygga] 1867: 'kvinna': [äes, kvrk, kunäe, mle, näo, nuvaranäe, äer, v«r«, uä, äig] 1868: 'kvinna': [piller, kvilken, mis, kade, klo, nde, äock, reäan, äsom, bvilken]
  34. 34. Some results II Politics: 1925: 'politik': [näring, trygghet, kamp, arbetarrörelse, konservativ, nationell, strävan, europa, neutralitet, önskad] 1922: 'politik': [åskådning, socialistisk, ägnad, demokrati, utrikespolitisk, sakligt, situation, representativ, auktoritet, ärlig] 1900: 'politik': [enig, bvad, finlands, politisk, konstitutionel, revolution, armenien, citera, civiliserade, dementi] 1872: 'politik': [republikansk, opposition, kränka, reaktionär, neutral, republikan, tillbakavisa, changarniers, påfvedöme, horace] 1858: 'politik': [asylrätt, allians, frankrikes, konstitutionell, konflikt, försonlig, rysslands, press, makt, fördrag] 1844: 'politik': [tadla, allians, vägran, irländsk, frankrikes, bemedling, tribun, segra, ministeriell, fördrag]
  35. 35. Result summary Avg. Jaccard similarity, normalized frequency and Spearman correltion The more frequent the term, the more stable the vectors 0.11-0.19 overlap between years 2-3 words in common each year
  36. 36. Next step OCR errors Spelling normalization
  37. 37. Research methodologies in DH Part II
  38. 38. Digital, large-scale data
  39. 39. Data Hypothesis Data Hypothesis
  40. 40. Questions
  41. 41. Representativeness
  42. 42. Image: http://first-the-trousers.com/hello-world/ TheStreetlighteffect
  43. 43. method + data = results result
  44. 44. result hypothesis
  45. 45. 1 Data 3 Hypothesis2 Method / Preprocessing result hypothesis Reject
  46. 46. 1 Method 2 Correct interpretation of the results result hypothesis Accept
  47. 47. Math results, average difference Source: Factfullness Men Women
  48. 48. Men Women Math results, average difference Source: Factfullness
  49. 49. Source: Factfullness NUMBER OF INDIVIDUALS WITH DIFFERENT MATH SCORES 2016 Men Women Range of math scores
  50. 50. Men Women Comparison of the same data NUMBER OF INDIVIDUALS WITH DIFFERENT MATH SCORES 2016 Men Women Source: Factfullness Men Women
  51. 51. 1 Method 3 Where do the results live? 2 Correct interpretation of the results result hypothesis
  52. 52. result hypothesis 1 Method 3 Where do the results live? 2 Correct interpretation of the results
  53. 53. Text-mining method Dimensions Filtering: Function words Filtering: Stopwords Part-of-speech tagging Lemmatization Tokenization NLP pipeline: From text to result
  54. 54. I like the room but not the sheets. I like the room but not the sheets. (after stop word filtering) I like the room but not the sheet. (after lemmatization) I like the room but not the sheet. (only nouns) I like the room but not the sheet. (frequency filtering) I like the room but not the sheet. (only verbs)
  55. 55. Viewpoint on the data
  56. 56. Viewpoint on the data
  57. 57. Choosing a method
  58. 58. 1010011010010 1001010010101 0011010010101 Results
  59. 59. Marie Antoinette Queen of France Child of Empress Maria Theresa Child of Francis I Archduchess Austria
  60. 60. Evaluation
  61. 61. Prof. Hans Rosling You can’t understand the world without numbers… Factfullness … and you cannot understand it only with numbers.
  62. 62. Thank you for listening! Nina.tahmasebi@gu.se nina@tahmasebi.se

×