Workshop on Digital Literacy - Digital text and data-intensive research

How does digital text relate to written, non-digital text? What do we need to think about when using large-scale digital methods and interpreting the results?

Published in: Data & Analytics

  1. 1. Digital Text and Data-Intensive Research Nina Tahmasebi, Associate Professor University of Gothenburg Digital Literacy | 2020-2021 Nina Tahmasebi, Digital Literacy, Sept. 2020
  2. 2. Centre for Digital Humanities (2018-2019) • Mathematics (B.Sc. & M.Sc.), 2003-2008 • Computer/Data Science (PhD + postdoc), 2008-2014 • NLP / Language Technology (Researcher, Associate Professor), 2014→
  3. 3. Views on text Language 1010011010010 1001010010101 0011010010101 Data
  5. 5. Based on • Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227. • Tahmasebi, Nina, Niclas Hagen, Daniel Brodén, and Mats Malm. "A Convergence of Methodologies: Notes on a Data-intensive research methodology." DHN2019 (2019): 437-449.
  6. 6. When do we benefit from computational methods?
  7. 7. A single physical piece can be studied in detail. A few physical pieces can be studied and compared in detail. Too many physical pieces cannot be treated manually.
  10. 10. From text to answers: text, text mining method, research question, results
  11. 11. From text to answers: text, research question, text mining method, results
  12. 12. Today’s outline: 1. Digital Text 2. Data-intensive research methodology 3. Research results and interpretation
  13. 13. Digital Text
  14. 14. A book: • Empty pages at the beginning / end • Large letter at the beginning of each chapter • Images?
  15. 15. A single physical piece can be studied in detail. A few physical pieces can be studied and compared in detail. Too many physical pieces cannot be treated manually.
  16. 16. Too many physical pieces cannot be treated manually. Digital Text
  17. 17. Too many digital texts cannot be studied in too much detail either! We need to ignore a lot of formatting: • White pages • White space • Fonts • Capitalization of letters • Etc.
  18. 18. Digital text • Printed texts not available digitally: the older the text, the more errors (paper in bad quality, different fonts, skewed columns, spelling variations) • Printed texts born digital: fewer errors of this kind (fewer OCR errors due to modern fonts, cleaner pages, younger age, modern language) • Other digital publications, edited text: data such as news, professional blogs, reviews • User-generated text: a lot of errors (spelling errors, grammatical errors, abbreviations, smileys) • (Automatic) metadata
  20. 20. Digital text: corpus / dataset • Corpus → linguistically oriented • Dataset → any collection of text! • Thematic • Time periods • Media types • Genre • … • There are certain types of questions that cannot be answered by any text
  21. 21. Smart search scenario • Individual • Individual text, with individual intent • Multiple texts – dataset/corpus • Bits and pieces from a large dataset • Researcher/group analyzing in detail
  22. 22. Smart search/selection • All interpretation and analysis is left to the human • Often, the correctness of each individual bit is simple to verify • But what happens when we have millions of bits and pieces? → We still cannot study them manually [Diagram: multiple texts – dataset/corpus; bits and pieces from a large dataset; researcher/group analyzing in detail]
  24. 24. Sources of error: • We made a bad model, e.g. lost formatting • Too many OCR errors in the text → we cannot find what we are looking for → we find much more than we need • What we are looking for semantically is not covered by the terms we use for search: kvinna ≠ quinna • Other sources of error? [Diagram: multiple texts – dataset/corpus; bits and pieces from a large dataset; researcher/group analyzing in detail]
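
The kvinna ≠ quinna mismatch on this slide can be illustrated with a small normalization step applied before matching. This is only a sketch: the single rewrite rule is an assumption chosen for the example, not a model of historical Swedish spelling, and `normalize` and `search` are hypothetical helper names.

```python
import re

# Assumed, illustrative rewrite rule: older "qu"/"qv"/"qw" spellings -> modern "kv".
RULES = [(r"q[uvw]", "kv")]

def normalize(token):
    """Lowercase a token and apply the rewrite rules."""
    token = token.lower()
    for pattern, replacement in RULES:
        token = re.sub(pattern, replacement, token)
    return token

def search(term, tokens, use_normalization):
    """Return the original tokens that match the search term."""
    if use_normalization:
        target = normalize(term)
        return [t for t in tokens if normalize(t) == target]
    return [t for t in tokens if t.lower() == term.lower()]

tokens = ["Quinna", "kvinna", "qwinna", "man", "kvinnan"]
exact = search("kvinna", tokens, use_normalization=False)
fuzzy = search("kvinna", tokens, use_normalization=True)
print(exact)  # only the modern spelling
print(fuzzy)  # modern spelling plus the older variants
```

Even with normalization, the inflected form "kvinnan" is still missed. Covering it would take lemmatization, which is a separate step with its own errors.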
  25. 25. Text mining scenario • Individual • Individual text, with individual intent • Multiple texts – dataset/corpus • Signal: topic, cluster, vector… • Signal change • Researcher/group analyzing in detail
  26. 26. • NLP evaluation, information extraction, ICALL, meaning change, change in grammar/sentiment/argumentation, temporal information retrieval, macro analysis and signal change • Word sense induction, word sense disambiguation, role labeling, event detection • IR, IE/signal/information: word vectors/word matrices, tf-idf/mutual information, models of grammar, language models • Language technology/NLP: lemmatization, part-of-speech tagging, parsing, semantic enrichment (e.g., word sense disambiguation), extracting temporal references in text, linking data • Pre-processing: filtering (e.g. boilerplate, ads, recurrent data), cleaning words from noise, normalizing/removing stop words, temporal references based on metadata • Data collection: gathering the dataset, including links, how long each passage was viewed, metadata
  27. 27. The same layers as on the previous slide, marked with the two scenarios: text mining scenario, smart search scenario
  29. 29. Clean much – keep much information • Tokenize • Remove low-frequency words • Remove veeeery high-frequency words • Remove tokens with little information: numbers, punctuation marks etc. • Remove capitalization • Normalize (é → e, eeee → e) → Choices all depend on the application and the research question. A matter of economy: • We cannot afford to keep it all • So we keep what gives us the most value (= information) [Plot axes: frequency, information]
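
The cleaning steps listed on this slide can be sketched with the Python standard library. The thresholds `min_count` and `max_doc_share`, the function names, and the three toy documents are assumptions made for illustration; as the slide says, the right choices depend on the application and the research question.

```python
import re
import unicodedata
from collections import Counter

def strip_accents(text):
    # Decompose accented characters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def squeeze_repeats(token):
    # Collapse runs of three or more identical letters (veeeery -> very).
    return re.sub(r"(.)\1{2,}", r"\1", token)

def preprocess(docs, min_count=2, max_doc_share=0.9):
    tokenized = []
    for doc in docs:
        text = strip_accents(doc.lower())          # remove capitalization, normalize
        tokens = re.findall(r"[a-z]+", text)       # tokenize; drops numbers and punctuation
        tokenized.append([squeeze_repeats(t) for t in tokens])

    counts = Counter(t for tokens in tokenized for t in tokens)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    keep = {
        t for t in counts
        if counts[t] >= min_count                       # remove low-frequency words
        and doc_freq[t] / len(docs) <= max_doc_share    # remove very high-frequency words
    }
    return [[t for t in tokens if t in keep] for tokens in tokenized]

docs = [
    "The café was veeeery nice, 10/10!",
    "The room was nice.",
    "The cafe had 2 rooms.",
]
cleaned = preprocess(docs)
print(cleaned)
```

On this toy input "the" is dropped for appearing in every document, and "very", "room", "rooms" and "had" are dropped for appearing only once. Whether that loss is acceptable is exactly the economy question the slide raises.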
  30. 30. • I like the room but not the sheet. (only verbs) • I like the room but not the sheet. (frequency filtering) • I like the room but not the sheet. (only nouns) • I like the room but not the sheet. (after lemmatization) • I like the room but not the sheets. (after stop word filtering) • I like the room but not the sheets.
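
The filters named on this slide can be sketched on the same sentence. The part-of-speech tags, lemmas, and stop word list below are written by hand for this one sentence and are assumptions; a real pipeline would obtain them from a tagger, a lemmatizer, and a curated stop word list.

```python
# (word, part of speech, lemma): hand-made for illustration only.
TAGGED = [
    ("I", "PRON", "i"), ("like", "VERB", "like"), ("the", "DET", "the"),
    ("room", "NOUN", "room"), ("but", "CONJ", "but"), ("not", "ADV", "not"),
    ("the", "DET", "the"), ("sheets", "NOUN", "sheet"),
]
STOP_WORDS = {"i", "the", "but", "not"}  # assumed list

after_stop_words = [w for w, _, _ in TAGGED if w.lower() not in STOP_WORDS]
after_lemmatization = [lemma for _, _, lemma in TAGGED]
only_nouns = [w for w, tag, _ in TAGGED if tag == "NOUN"]
only_verbs = [w for w, tag, _ in TAGGED if tag == "VERB"]

print(after_stop_words)     # ['like', 'room', 'sheets']
print(after_lemmatization)  # ['i', 'like', 'the', 'room', 'but', 'not', 'the', 'sheet']
print(only_nouns)           # ['room', 'sheets']
print(only_verbs)           # ['like']
```

Note what the stop word filter does here: with "but" and "not" removed, "like room sheets" no longer carries the negation, so a later sentiment analysis would read the sentence as positive about the sheets.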
  31. 31. "3. Nouns. After a series of experiments, it was determined that the thematic information in this corpus could best be captured by modeling only the remaining nouns. Using the Stanford POS tagger, each word in each segment was marked up with a part of speech indicator and all but the nouns were removed." (Jockers and Mimno, Significant Themes in 19th-Century Literature)
  32. 32. When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton. Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his remarkable disappearance and unexpected return. The riches he had brought back from his travels had now become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark. There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth. ‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’ But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course, the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant families. But he had no close friends, until some of his younger cousins began to grow up. The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. 
‘You had better come and live here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and coming of age at thirty-three.
  35. 35. [Figure labels: amount of information; amount of text; text mining method]
  36. 36. "In short, ladies and gentlemen, my message today is that data is gold. … Let's start mining it." Neelie Kroes, Vice-President of the European Commission responsible for the Digital Agenda, SPEECH/11/872, 2011
  37. 37. Is it true that data is gold?
  38. 38. Same data + different methods = different answers
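
A small sketch of this point: the same three toy documents, asked "which word characterizes review_a?", give different answers under two common methods. The documents and function names are invented for illustration, and the tf-idf variant used here (raw term frequency times the log of inverse document frequency) is one of several in use.

```python
import math
from collections import Counter

# Invented toy data.
docs = {
    "review_a": "the room was clean and the bed was clean".split(),
    "review_b": "the room was noisy and the street was loud".split(),
    "review_c": "the breakfast was great and the staff was kind".split(),
}

def top_by_frequency(name):
    # Method 1: the most frequent word in the document.
    return Counter(docs[name]).most_common(1)[0][0]

def top_by_tfidf(name):
    # Method 2: term frequency weighted by inverse document frequency.
    tf = Counter(docs[name])

    def idf(word):
        doc_count = sum(1 for tokens in docs.values() if word in tokens)
        return math.log(len(docs) / doc_count)

    return max(tf, key=lambda word: tf[word] * idf(word))

print(top_by_frequency("review_a"))  # a function word
print(top_by_tfidf("review_a"))      # a content word specific to this review
```

Neither answer is wrong. Each method answers a slightly different question, which is why the method has to be chosen with the research question in mind.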
  39. 39. Since there is an infinite amount of information in a text, the text becomes infinitely complex. → Currently, there are no methods to mine all the information
  40. 40. Data-intensive research methodology
  41. 41. Traditional research methodology: research question, text
  42. 42. Data-intensive research methodology: research question, text (digital large-scale text)
  43. 43. Data-intensive research methodology: research question, text (digital large-scale text), hypothesis
  44. 44. Data • Hypothesis • Data • Hypothesis
  45. 45. Data-intensive research methodology: hypothesis, text mining method, text (digital large-scale text)
  46. 46. NLP pipeline: from text to result • Tokenization • Lemmatization • Part-of-speech tagging • Filtering: stop words • Filtering: function words • Dimensions • Text-mining method
  47. 47. Data-intensive research methodology: hypothesis, text mining method, results, text (digital large-scale text)
  48. 48. Results as a window to the text
  49. 49. Viewpoint on the data
  54. 54. The better your method (with respect to the information related to your research question) → the better the pieces [Figure labels: amount of information; amount of text; text mining method]
  55. 55. Data-intensive research methodology: hypothesis, text mining method, results, text (digital large-scale text), research question
  56. 56. Data-intensive research methodology: hypothesis, text mining method, results, text (digital large-scale text), research question
  57. 57. Data-intensive research methodology: results, results, results, text mining method, text (digital large-scale text), research question
  58. 58. Results and research questions: sometimes the results do not answer the research question in full
  60. 60. Image: https://ipec.co.zw
  61. 61. Research questions • Evidence: attack/demonstrations; homicide investigation; financial irregularities; data breach • Majority: How well is our product received? Which of our issues are most/least attractive to our voters? How will people vote?
  62. 62. Digital research needs to be evaluated on the combination of data, method, and research question
  63. 63. Truths about data-intensive research • Not all methods fit all data • Not all data fit all questions • Not all methods can answer all questions • Nothing lives separately; everything must be evaluated together [Diagram: hypothesis, text mining method, results, text (digital large-scale text)]
  64. 64. Truths about data-intensive research (II) • Gives us the possibility to ask new kinds of questions [Diagram: hypothesis, text mining method, results, text (digital large-scale text)]
  67. 67. Truths about data-intensive research (II) • Gives us the possibility to ask new kinds of questions • Which kinds of questions fit your purposes? [Diagram: hypothesis, text mining method, results, text (digital large-scale text)]
  68. 68. Results and research questions [Diagram: hypothesis, text mining method, results, text (digital large-scale text)]
  69. 69. Reduction vs. representation: digitization, preprocessing, method, hypothesis choice
  70. 70. Are these (significantly) different or the same? • Store vs. pharmacy • Writer A vs. writer B • Male authors vs. female authors • Journal 1 vs. journal 2 • Written language vs. spoken language [Diagram: Sample 1 and Sample 2 under H0 and H1]
  71. 71. Inference requires random selection • Only if the selection is random can we use the sample to draw conclusions about the world • We almost NEVER have a random sample in a textual corpus! → We cannot draw conclusions about the world [Diagram: Sample 1, Sample 2; random; inference]
  72. 72. When we have little data, the uncertainty is large: is A larger than B? • When we have a lot of data, we are more certain about our observations. Still, our errors can be much larger, because our selection is biased
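
The contrast between a small random sample and a large biased one can be simulated. Everything here is synthetic: the "world" is 100,000 random numbers, and the selection rule (only values above 45 enter the corpus) is an assumption standing in for whatever decided which texts survived or were digitized.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic "world": one value per text, e.g. some score (assumed).
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def summarize(sample):
    """Return (estimate, standard error, absolute error against the world)."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, std_err, abs(mean - true_mean)

# Little data, randomly selected: uncertain but unbiased.
small_random = summarize(rng.sample(population, 30))

# Lots of data, but only values above 45 made it into the corpus.
large_biased = summarize([x for x in population if x > 45])

print("small random: mean=%.1f  std_err=%.2f  error=%.2f" % small_random)
print("large biased: mean=%.1f  std_err=%.2f  error=%.2f" % large_biased)
```

The large sample reports a very small standard error, so it looks precise, yet its estimate sits several units away from the true mean. More data narrows the uncertainty around whatever the selection favors; it does not correct the selection.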
  73. 73. "In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random, does not support the inference that it is not arbitrary." Adam Kilgarriff, "Language is never, ever, ever, random", 2005
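
A worked sketch of why enough data makes almost any difference "demonstrably non-random": the same tiny difference in a word's relative frequency is tested with Pearson's chi-square statistic at two corpus sizes. The rates (1.00% vs. 1.05%) and corpus sizes are invented for illustration; 3.841 is the standard critical value for one degree of freedom at p = 0.05.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRITICAL_05 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05

def word_test(corpus_size, rate_1=0.0100, rate_2=0.0105):
    # Two corpora of equal size; the word is slightly more frequent in the second.
    hits_1 = round(corpus_size * rate_1)
    hits_2 = round(corpus_size * rate_2)
    return chi_square_2x2(hits_1, corpus_size - hits_1,
                          hits_2, corpus_size - hits_2)

small = word_test(10_000)        # 10 thousand tokens per corpus
large = word_test(10_000_000)    # 10 million tokens per corpus

print("small corpora: chi2=%.2f significant=%s" % (small, small > CRITICAL_05))
print("large corpora: chi2=%.2f significant=%s" % (large, large > CRITICAL_05))
```

The relative difference is identical in both runs; only the amount of data changes. The test therefore says more about corpus size than about whether the difference matters for the research question.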
  74. 74. Method + data = results
  75. 75. [Diagram: result, hypothesis]
  76. 76. Reject: 1. Data 2. Method / preprocessing 3. Hypothesis [Diagram: result, hypothesis]
  77. 77. Accept: 1. Method 2. Correct interpretation of the results [Diagram: result, hypothesis]
  78. 78. [Chart: math results, average difference; men, women. Source: Factfulness]
  79. 79. [Chart: math results, average difference; men, women. Source: Factfulness]
  80. 80. [Chart: number of individuals with different math scores, 2016; men, women; range of math scores. Source: Factfulness]
  81. 81. Comparison of the same data [Charts: number of individuals with different math scores, 2016; men, women. Source: Factfulness]
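
The difference between comparing averages and comparing distributions can be sketched with two synthetic score distributions. The means and standard deviations are invented and are not the data behind the charts; `NormalDist.overlap` from the Python standard library returns the shared area under the two curves.

```python
from statistics import NormalDist

# Synthetic, assumed score distributions for two groups.
group_a = NormalDist(mu=500, sigma=100)
group_b = NormalDist(mu=510, sigma=100)

mean_gap = group_b.mean - group_a.mean
overlap = group_a.overlap(group_b)  # 0 = no shared area, 1 = identical curves

print("difference between averages: %.0f points" % mean_gap)
print("overlap between distributions: %.0f%%" % (overlap * 100))
```

A chart of the two averages alone shows a gap, and a truncated axis can make it look large. The distributions behind those averages share about 96% of their area, so most individuals in one group have a counterpart with the same score in the other.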
  82. 82. 1. Method 2. Correct interpretation of the results 3. Where do the results live? [Diagram: result, hypothesis]
  84. 84. Experimental design: even when the math is right, we need to question the selection and the grounds on which our conclusions rest. • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  85. 85. Evaluation
  86. 86. Evaluation [Diagram: individual; individual text; signal: topic, cluster, vector…; signal change; collective text; minimum, optimum, medium]
  87. 87. Representativeness
  88. 88. Conclusions
  90. 90. Digital research needs to be evaluated on the combination of data, method, and research question
  91. 91. Experimental design • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  92. 92. "You can’t understand the world without numbers… and you cannot understand it only with numbers." Prof. Hans Rosling, Factfulness
  93. 93. Thank you! Nina.tahmasebi@gu.se nina@tahmasebi.se
