2020 09-28-odense-final-forpublication

Keynote at Synergies - Bridging the Gap Between Traditional and Digital Literary Studies

Published in: Science

  1. 1. The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies Nina Tahmasebi, Associate Professor University of Gothenburg Synergies: Bridging the Gap Between Traditional and Digital Literary Studies September 2020, Copenhagen
  2. 2. Views on text DH Language Data 1010011010010 1001010010101 0011010010101 Nina Tahmasebi, University of Gothenburg, Synergies 2020 2
  4. 4. Based on • Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227. • Tahmasebi, Nina, Niclas Hagen, Daniel Brodén, and Mats Malm. "A Convergence of Methodologies: Notes on a Data-intensive research methodology." DHN2019 (2019): 437-449.
  5. 5. When do we benefit from computational methods?
  6. 6. A single physical piece can be studied in detail. A few physical pieces can be studied and compared in detail. Too many physical pieces cannot be treated manually.
  10. 10. Image: https://ipec.co.zw
  12. 12. From text to answers: text; text mining method; research question; results
  13. 13. From text to answers: text; research question; text mining method; results
  14. 14. Today’s outline: 1. Digital Text 2. Data-intensive research methodology 3. Research results and interpretation
  15. 15. Digital Text
  16. 16. A book: • Empty pages at the beginning / end • Large letter at the beginning of each chapter • Images?
  17. 17. Too many physical pieces cannot be treated manually. Digital Text
  18. 18. Too many digital texts cannot be studied in too much detail either! We need to ignore a lot of formatting: • White pages • White space • Fonts • Capitalization of letters • Etc.
  19. 19. Digital text: printed texts not available digitally; printed texts born digital; other digital publications; edited text; user generated text. The older the text, the more errors: • Paper in bad quality • Different fonts • Skewed columns • (Spelling variations). Fewer errors of that kind: • OCR errors, thanks to modern fonts • Less dirty pages, younger age • Modern language. Data of the kind: • News • Professional blogs • Reviews. A lot of errors: • Spelling errors • Grammatical errors • Abbreviations • Smileys. (Automatic) metadata
  20. 20. Text mining scenario. Individual text with individual intent: researcher/group analyzing in detail. Multiple texts (dataset/corpus): NLP step; signal (topic, cluster, vector…); signal change; researcher/group analyzing in detail
  22. 22. I like the room but not the sheets. • I like the room but not the sheets. (after stop word filtering) • I like the room but not the sheet. (after lemmatization) • I like the room but not the sheet. (only nouns) • I like the room but not the sheet. (frequency filtering) • I like the room but not the sheet. (only verbs)
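Two of the filters named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the talk: the stop word list and the frequency threshold are assumptions chosen for this one sentence.

```python
from collections import Counter

# Hypothetical stop word list, chosen for this example only.
STOP_WORDS = {"i", "the", "but", "not"}

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in stripped if w]

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def frequency_filter(tokens, max_count):
    """Drop tokens that occur more than max_count times in the input."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] <= max_count]

sentence = "I like the room but not the sheets."
tokens = tokenize(sentence)
without_stop_words = remove_stop_words(tokens)
without_frequent = frequency_filter(tokens, max_count=1)

print(tokens)
print(without_stop_words)
print(without_frequent)
```

With this particular stop word list, the negation "not" is filtered out together with the function words, so the filtered tokens no longer distinguish liking the sheets from disliking them.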
  23. 23. Clean much, keep much information. A matter of economy: • We cannot afford to keep it all • So we keep what gives us the most value (= information). Figure labels: frequency; information
  24. 24. "3. Nouns. After a series of experiments, it was determined that the thematic information in this corpus could best be captured by modeling only the remaining nouns. Using the Stanford POS tagger, each word in each segment was marked up with a part of speech indicator and all but the nouns were removed." Jockers and Mimno, Significant Themes in 19th-Century Literature
  25. 25. When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton. Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his remarkable disappearance and unexpected return. The riches he had brought back from his travels had now become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark. There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth. ‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’ But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course, the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant families. But he had no close friends, until some of his younger cousins began to grow up. The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. 
‘You had better come and live here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and coming of age at thirty-three.
  29. 29. Amount of information; amount of text; text mining method
  30. 30. Data-intensive research methodology
  31. 31. Traditional research methodology: research question; text
  32. 32. Data-intensive research methodology: research question; text (digital large-scale text)
  33. 33. Data-intensive research methodology: research question; text (digital large-scale text); hypothesis
  34. 34. Data; hypothesis; data; hypothesis
  35. 35. Data-intensive research methodology: hypothesis; text mining method; text (digital large-scale text)
  37. 37. Data-intensive research methodology: hypothesis; text mining method; text (digital large-scale text)
  38. 38. NLP pipeline: from text to result. Tokenization; lemmatization; part-of-speech tagging; filtering: stopwords; filtering: function words; dimensions; text-mining method
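The pipeline stages listed on this slide can be chained as plain functions. The sketch below is illustrative only: the lemma table and the part-of-speech lexicon are hypothetical, hand-written stand-ins for a trained lemmatizer and tagger, and the tag names are assumptions.

```python
import re

# Hypothetical lookup tables, hand-written for this example; a real
# pipeline would use a trained lemmatizer and part-of-speech tagger.
LEMMAS = {"sheets": "sheet", "rooms": "room", "liked": "like"}
POS_TAGS = {"i": "PRON", "like": "VERB", "the": "DET", "room": "NOUN",
            "but": "CONJ", "not": "ADV", "sheet": "NOUN"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lemmatize(tokens):
    """Map each token to its lemma; unknown tokens are left unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]

def pos_tag(tokens):
    """Pair each token with a part-of-speech tag; unknown tokens get 'X'."""
    return [(t, POS_TAGS.get(t, "X")) for t in tokens]

def keep_tags(tagged, allowed):
    """Filtering step: keep only tokens whose tag is in the allowed set."""
    return [t for t, tag in tagged if tag in allowed]

def pipeline(text, allowed_tags):
    """Run tokenization, lemmatization, tagging and filtering in order."""
    return keep_tags(pos_tag(lemmatize(tokenize(text))), allowed_tags)

text = "I like the room but not the sheets."
nouns = pipeline(text, {"NOUN"})
verbs = pipeline(text, {"VERB"})
print(nouns)
print(verbs)
```

Each stage discards something: lemmatization merges "sheets" into "sheet", and the final filter decides which word classes remain visible to the text mining method.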
  39. 39. Data-intensive research methodology: hypothesis; text mining method; results; text (digital large-scale text)
  40. 40. Results as a window to the text
  41. 41. Viewpoint on the data
  46. 46. The better your method (with respect to the information related to your research question), the better the pieces. Amount of information; amount of text; text mining method
  47. 47. Data-intensive research methodology: hypothesis; text mining method; results; text (digital large-scale text); research question
  48. 48. Data-intensive research methodology: results; results; results; text mining method; text (digital large-scale text); research question
  49. 49. Digital research needs to be evaluated on the combination of data, method, and research question
  50. 50. Truths about data-intensive research: Not all methods fit all data. Not all data fit all questions. Not all methods can answer all questions. Nothing lives separately; everything must be evaluated together: hypothesis; text mining method; results; text (digital large-scale text)
  51. 51. Results and research questions: hypothesis; text mining method; results; text (digital large-scale text)
  52. 52. Method + Data = Results; result
  53. 53. Result; hypothesis
  54. 54. Result; hypothesis. Reject: 1. Data 2. Method / Preprocessing 3. Hypothesis
  55. 55. Result; hypothesis. Accept: 1. Method 2. Correct interpretation of the results
  56. 56. Math results, average difference: men; women. Source: Factfulness
  57. 57. Math results, average difference: men; women. Source: Factfulness
  58. 58. Number of individuals with different math scores, 2016: men; women; range of math scores. Source: Factfulness
  59. 59. Comparison of the same data: number of individuals with different math scores, 2016; men; women. Source: Factfulness
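The contrast between an average difference and the full distributions can be made concrete with a small calculation. The numbers below are invented for illustration and are not the Factfulness data; the group names, score bins, and counts are all assumptions, and the two groups are assumed to have the same size.

```python
# Hypothetical score bins and head counts per bin (invented numbers).
scores = [300, 400, 500, 600, 700]
group_a = [10, 40, 70, 40, 10]
group_b = [10, 45, 70, 35, 10]

def mean_score(score_bins, counts):
    """Weighted mean of the score bins, weighted by head count."""
    return sum(s * c for s, c in zip(score_bins, counts)) / sum(counts)

def overlap_share(counts_a, counts_b):
    """Share of one group that falls where the two histograms overlap.

    Assumes both groups have the same total size.
    """
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return shared / sum(counts_a)

mean_a = mean_score(scores, group_a)
mean_b = mean_score(scores, group_b)
overlap = overlap_share(group_a, group_b)

print(f"mean A: {mean_a:.1f}, mean B: {mean_b:.1f}")
print(f"difference of means: {mean_a - mean_b:.1f}")
print(f"overlap of distributions: {overlap:.1%}")
```

A chart that shows only the two means, on an axis cropped around them, emphasizes the gap; the histograms show that in this invented example almost all individuals fall in the shared region.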
  60. 60. Result; hypothesis. 1. Method 2. Correct interpretation of the results 3. Where do the results live?
  61. 61. "In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random, does not support the inference that it is not arbitrary." Adam Kilgarriff, "Language is never, ever, ever, random", 2005
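One way to see the point of the quotation is to hold a small frequency difference fixed and let the corpus grow. The sketch below computes the Pearson chi-square statistic for a 2×2 table by hand; the word counts are invented, and 3.841 is the standard critical value for one degree of freedom at the 5% level.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

CRITICAL_VALUE = 3.841  # chi-square, 1 degree of freedom, alpha = 0.05

def word_vs_rest(hits_a, size_a, hits_b, size_b):
    """Chi-square for one word's frequency in corpus A versus corpus B."""
    return chi_square_2x2(hits_a, size_a - hits_a, hits_b, size_b - hits_b)

# Invented counts: the word makes up 1.05% of corpus A and 1.00% of corpus B.
chi_small = word_vs_rest(105, 10_000, 100, 10_000)
# The same proportions in corpora that are 100 times larger.
chi_large = word_vs_rest(10_500, 1_000_000, 10_000, 1_000_000)

print(f"small corpora: chi2 = {chi_small:.3f}")
print(f"large corpora: chi2 = {chi_large:.3f}")
print("significant (small):", chi_small > CRITICAL_VALUE)
print("significant (large):", chi_large > CRITICAL_VALUE)
```

The proportions are identical in both comparisons; only the amount of data changes. Because the statistic grows linearly with corpus size, a large enough corpus makes almost any difference test as non-random, which says little about whether the difference matters for the research question.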
  62. 62. Experimental design: even when the math is right, we need to question the selection and the grounds on which our conclusions rest. • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  63. 63. Conclusions
  65. 65. Digital research needs to be evaluated on the combination of data, method, and research question
  66. 66. Experimental design • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  67. 67. Prof. Hans Rosling, Factfulness: "You can’t understand the world without numbers… and you cannot understand it only with numbers."
  68. 68. Thank you! Nina.tahmasebi@gu.se nina@tahmasebi.se
