The Strengths and Pitfalls of
Large-Scale Text Mining for
Literary Studies
Nina Tahmasebi, Associate Professor
University of Gothenburg
Synergies: Bridging the Gap Between Traditional
and Digital Literary Studies
September 2020, Copenhagen
Views on text
DH
Language
Data
1010011010010
1001010010101
0011010010101
Nina Tahmasebi, University of Gothenburg, Synergies 2020 2
Based on
• Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for
Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227.
• Tahmasebi, Nina, Niclas Hagen, Daniel Brodén, and Mats Malm. "A Convergence of
Methodologies: Notes on a Data-intensive Research Methodology." DHN2019 (2019): 437-449.
When do we benefit from
computational methods?
A single physical
piece can be
studied in detail.
A few physical pieces
can be studied and
compared in detail.
Too many physical
pieces cannot be
treated manually.
Image: https://ipec.co.zw
From text to answers
text
text mining
method
research question
results
From text to answers
text
research question
text mining
method
results
Today’s outline
1. Digital Text
2. Data-intensive research methodology
3. Research results and interpretation
Digital Text
A book:
• Empty pages in the
beginning / end
• Large letter at the
beginning of each chapter
• Images?
Too many physical
pieces cannot be
treated manually.
Digital Text
Too many digital texts cannot
be studied in too much
detail either!
We need to ignore a lot of formatting
• White pages
• White space
• Fonts
• Capitalization of letters
• Etc…
Digital text
Printed texts
not available digitally
Printed texts
born digital
Other digital
publications
User generated text
Edited text
Fewer errors of the kind
• OCR errors, thanks to modern fonts
• Fewer dirty pages, younger age
• Modern language
Data of the kind:
• News
• Professional blogs
• Reviews
A lot of errors
• Spelling errors
• Grammatical errors
• Abbreviations
• Smileys
(automatic) Metadata
The older the text, the more
errors
• Poor paper quality
• Different fonts
• Skewed columns
• (Spelling variations)
Researcher/group
analyzing in detail
Individual
Individual text
With individual intent
Signal change
Signal
topic, cluster, vector…
Multiple texts –
dataset/corpus
Researcher/group
analyzing in detail
Text mining scenario
NLP step
I like the room but not the sheets. (original sentence)
I like the room but not the sheets. (after stop word filtering)
I like the room but not the sheet. (after lemmatization)
I like the room but not the sheet. (only nouns)
I like the room but not the sheet. (frequency filtering)
I like the room but not the sheet. (only verbs)
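The reduction steps named on this slide can be sketched in a few lines of Python. This is a toy illustration, not the pipeline used in the talk: the stop word list, the lemma table and the part-of-speech table (`STOP_WORDS`, `LEMMAS`, `POS`) are hypothetical stand-ins for the lexicons and taggers a real pipeline would use.

```python
import re

# Hypothetical toy resources, far smaller than real stop word lists or lexicons.
STOP_WORDS = {"i", "the", "but", "not"}
LEMMAS = {"sheets": "sheet", "rooms": "room", "likes": "like", "liked": "like"}
POS = {"like": "VERB", "room": "NOUN", "sheet": "NOUN"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Map each token to its dictionary form when the table knows it."""
    return [LEMMAS.get(t, t) for t in tokens]


def keep_pos(tokens, tag):
    """Keep only the tokens whose (toy) part of speech equals `tag`."""
    return [t for t in tokens if POS.get(t) == tag]


sentence = "I like the room but not the sheets."
tokens = tokenize(sentence)
content = remove_stop_words(tokens)
lemmas = lemmatize(content)
nouns = keep_pos(lemmas, "NOUN")
verbs = keep_pos(lemmas, "VERB")
```

Under these assumptions the sentence shrinks to `['like', 'room', 'sheets']` after stop word filtering, to `['like', 'room', 'sheet']` after lemmatization, to `['room', 'sheet']` with only nouns and to `['like']` with only verbs. This toy stop word list also drops "not", so the filtered tokens no longer record the negation.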
Clean much – keep much information
Matter of economy:
• We cannot afford
to keep it all
• So we keep what gives us most value
(= information)
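One common way to make "value (= information)" concrete is the self-information (surprisal) of a word, -log2 p(w): frequent words carry few bits, rare words carry many. The sketch below assumes that reading. The slide itself does not name a measure, and the example sentence is made up.

```python
import math
from collections import Counter


def surprisal(tokens):
    """Return -log2 of each word's relative frequency, in bits."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: -math.log2(count / total) for word, count in counts.items()}


# Made-up example text: "the" occurs 4 times, "and" twice, the rest once.
tokens = "the room and the sheet and the view of the hill".split()
info = surprisal(tokens)

# Words sorted from least to most informative under this measure.
ranked = sorted(info, key=info.get)
```

Here `ranked[0]` is "the", the most frequent and least informative word, which is the kind of word a stop word filter would discard first.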
Figure: frequency vs. information
3. Nouns. After a series of experiments, it was determined that the thematic
information in this corpus could best be captured by modeling only the remaining
nouns. Using the Stanford POS tagger, each word in each segment was marked up with
a part of speech indicator and all but the nouns were removed.
Jockers and Mimno, Significant Themes in
19th-Century Literature
When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday
with a party of special magnificence, there was much talk and excitement in Hobbiton.
Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his
remarkable disappearance and unexpected return. The riches he had brought back from his travels had now
become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was
full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to
marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at
fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark.
There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that
anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth.
‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’
But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to
forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course,
the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant
families. But he had no close friends, until some of his younger cousins began to grow up.
The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted
Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally
dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. ‘You had better come and live
here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At
that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and
coming of age at thirty-three.
Amount of
information
Amount of text
Text mining
method
Data-intensive
research methodology
Traditional research methodology
Research
question
Text
Data-intensive research methodology
Research
question
Text
(digital large-scale text)
Data-intensive research methodology
Research
question
Text
(digital large-scale text)
Hypothesis
Data Hypothesis
Data Hypothesis
Hypothesis
Data-intensive research methodology
Text mining
method
Text
(digital large-scale text)
Hypothesis
Data-intensive research methodology
Text mining
method
Text
(digital large-scale text)
Text-mining method
Dimensions
Filtering: Function words
Filtering: Stopwords
Part-of-speech tagging
Lemmatization
Tokenization
NLP pipeline: From text to result
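As a rough sketch of how such a pipeline narrows the data, the snippet below counts the distinct word types left after each step. Reading "Dimensions" as the number of distinct word types is an assumption, and the lemma table, stop word list and noun list (`LEMMAS`, `STOP_WORDS`, `NOUNS`) are hypothetical toy resources, not those of any real tagger.

```python
import re

# Hypothetical toy resources; a real pipeline would use full lexicons and a tagger.
LEMMAS = {"likes": "like", "liked": "like", "hobbits": "hobbit",
          "rooms": "room", "is": "be"}
STOP_WORDS = {"the", "a", "be"}
NOUNS = {"hobbit", "room"}

text = "The hobbit likes the room. The hobbits liked the rooms. A room is liked."

tokens = re.findall(r"[a-z]+", text.lower())           # tokenization
lemmas = [LEMMAS.get(t, t) for t in tokens]            # lemmatization
content = [t for t in lemmas if t not in STOP_WORDS]   # stop word filtering
nouns = [t for t in content if t in NOUNS]             # part-of-speech filter (nouns only)

# Number of distinct word types (dimensions) remaining after each step.
sizes = {
    "tokenization": len(set(tokens)),
    "lemmatization": len(set(lemmas)),
    "stop word filtering": len(set(content)),
    "nouns only": len(set(nouns)),
}
```

For this made-up text the vocabulary shrinks from 9 types to 6, then 3, then 2. Every step discards something, so which steps to apply depends on what the research question needs to keep.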
Hypothesis
Data-intensive research methodology
Text mining
method
results
Text
(digital large-scale text)
Results as a window to the text
Viewpoint on the data
The better your method
(with respect to the information related to
your research question),
the better the pieces
Amount of information
Amount of text
Text mining
method
Data-intensive research methodology
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Research
question
Data-intensive research methodology
results
results
results
Text mining
method
Text
(digital large-scale text)
Research
question
Digital research needs to be
evaluated on the combination
of data, method, and
research question
Truths about data-
intensive research
Not all methods fit all data
Not all data fit all questions
Not all methods can answer all questions
Nothing lives separately,
it must be evaluated together:
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Results and
research questions
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Method + Data = Results
result
result
hypothesis
Reject
1. Data
2. Method / Preprocessing
3. Hypothesis
result
hypothesis
Accept
1. Method
2. Correct interpretation of the results
result
hypothesis
Figure: Math results, average difference (men vs. women). Source: Factfulness
Figure: Number of individuals with different math scores, 2016 (men vs. women; x-axis: range of math scores). Source: Factfulness
Comparison of the same data: the average difference next to the number of individuals with different math scores.
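The contrast between an average difference and the underlying distributions can be sketched with a few numbers. The scores below are invented for illustration only and are not the data behind the Factfulness charts. The overlap measure (the share of individuals whose score lies in the range covered by both groups) is one simple choice among many.

```python
from statistics import mean

# Synthetic scores, invented for illustration only.
men = [420, 470, 500, 520, 540, 560, 600, 650]
women = [410, 460, 490, 510, 530, 550, 590, 640]

# View 1: a single number, the difference between the group averages.
gap = mean(men) - mean(women)

# View 2: how much of the two distributions lies in the shared score range.
low = max(min(men), min(women))
high = min(max(men), max(women))
everyone = men + women
shared = [score for score in everyone if low <= score <= high]
overlap = len(shared) / len(everyone)
```

With these made-up numbers the averages differ by 10 points, while 87.5% of all individuals fall inside the score range that both groups cover. Both statements describe the same data, and they invite very different interpretations.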
result
hypothesis
1. Method
2. Correct interpretation of the results
3. Where do the results live?
In corpus studies, we frequently do have enough data, so
the fact that a relation between two phenomena is
demonstrably non-random, does not support the
inference that it is not arbitrary. Language is never,
ever, ever, random.
Adam Kilgarriff, 2005
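One way to illustrate the quote is to hold a small relative difference fixed and let the corpus grow. The sketch below uses the standard chi-square statistic for a 2x2 table. The word rates (1.00% vs. 1.05%) and corpus sizes are invented, and the choice of test is an assumption made for this example.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Critical value of the chi-square distribution, 1 degree of freedom, 5% level.
CRITICAL_5_PERCENT = 3.841


def compare(scale):
    """Compare two equally sized corpora in which a word makes up 1.00% vs. 1.05% of tokens."""
    size = 10_000 * scale
    hits_a, hits_b = 100 * scale, 105 * scale
    return chi_square_2x2(hits_a, size - hits_a, hits_b, size - hits_b)


small = compare(1)       # two corpora of 10,000 tokens each
large = compare(1000)    # two corpora of 10,000,000 tokens each
```

The relative difference is identical in both cases, yet `small` is about 0.12 (below the critical value) and `large` is about 123 (far above it). The test flags the difference as non-random once there is enough data, which says nothing about whether the difference matters for the research question.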
Experimental design
Even when the math is right, we need to question the
selection and the grounds on which our conclusions rest.
• What is the corresponding number elsewhere?
• What are we measuring?
• Why will this answer our questions?
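The first question above can be made concrete by normalizing a count before comparing it with a reference collection. The counts and corpus sizes below are invented, and `per_million` is a hypothetical helper name.

```python
def per_million(count, corpus_size):
    """Occurrences per one million tokens."""
    return count * 1_000_000 / corpus_size


# Invented numbers: a word counted in a small target corpus and in a large reference corpus.
target_rate = per_million(120, 2_000_000)
reference_rate = per_million(3_000, 50_000_000)
```

The raw counts (120 vs. 3,000) look very different, but both correspond to 60 occurrences per million tokens. A count only becomes meaningful once the corresponding number elsewhere is known.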
Conclusions
Digital research needs to be
evaluated on the combination
of data, method, and
research question
Experimental design
• What is the corresponding number elsewhere?
• What are we measuring?
• Why will this answer our questions?
You can’t understand
the world without
numbers…
… and you cannot
understand it
only with numbers.
Prof. Hans Rosling, Factfulness
Thank you!
Nina.tahmasebi@gu.se
nina@tahmasebi.se

More Related Content

Recently uploaded

Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
Aditi Bajpai
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
TinyAnderson
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
RASHMI M G
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Texas Alliance of Groundwater Districts
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
Sérgio Sacani
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
Hitesh Sikarwar
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
European Sustainable Phosphorus Platform
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
terusbelajar5
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
AbdullaAlAsif1
 

Recently uploaded (20)

Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 

Featured

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
Marius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
Expeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
Pixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
marketingartwork
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
Skeleton Technologies
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
SpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
Rajiv Jayarajah, MAppComm, ACC
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Christy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson36
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
2020 09-28-odense-final-forpublication

  • 1. The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies Nina Tahmasebi, Associate Professor University of Gothenburg Synergies: Bridging the Gap Between Traditional and Digital Literary Studies September 2020, Copenhagen
  • 2. Views on text DH Language Data 1010011010010 1001010010101 0011010010101 Nina Tahmasebi, University of Gothenburg, Synergies 2020 2
  • 3.
  • 4. Based on • Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227. • Tahmasebi, Nina, Hagen, Niclas, Brodén, Daniel, & Malm, Mats. (2019). "A Convergence of Methodologies: Notes on a Data-intensive research methodology." DHN2019. p. 437-449.
  • 5. When do we benefit from computational methods?
  • 6. A single physical piece can be studied in detail. A few physical pieces can be studied and compared in detail. Too many physical pieces cannot be treated manually.
  • 7.
  • 8.
  • 9.
  • 10. Image: https://ipec.co.zw
  • 11.
  • 12. From text to answers text text mining method research question results
  • 13. From text to answers text research question text mining method results
  • 14. Today’s outline 1. Digital Text 2. Data-intensive research methodology 3. Research results and interpretation
  • 15. Digital Text
  • 16. A book: • Empty pages in the beginning / end • Large letter at the beginning of each chapter • Images?
  • 17.
  • 18. Too many physical pieces cannot be treated manually. Digital Text
  • 19. Too many digital texts cannot be studied in TOO MUCH DETAIL either! We need to ignore a lot of formatting • White pages • White space • Fonts • Capitalization of letters • Etc…
  • 20. Digital text Printed texts not available digitally Printed texts born digital Other digital publications User generated text Edited text Fewer errors of the kind • OCR errors due to modern fonts, • Less dirty pages, younger age. • Modern language Data of the kind: • News • Professional blogs • Reviews A lot of errors • Spelling errors • Grammatical errors • Abbreviations • Smileys (automatic) Metadata The older the text, the more errors • Paper in bad quality • Different fonts • Skewed columns • (Spelling variations)
  • 21. Researcher/group analyzing in detail Individual Individual text With individual intent Signal change Signal topic, cluster, vector… Multiple texts – dataset/corpus Researcher/group analyzing in detail Text mining scenario NLP step
  • 22.
  • 23. I like the room but not the sheet. (only verbs) I like the room but not the sheet. (frequency filtering) I like the room but not the sheet. (only nouns) I like the room but not the sheet. (after lemmatization) I like the room but not the sheets. (after stop word filtering) I like the room but not the sheets.
  • 24. Clean much – keep much information Matter of economy: • We cannot afford to keep it all • So we keep what gives us most value (= information) frequency information
  • 25. 3. Nouns. After a series of experiments, it was determined that the thematic information in this corpus could best be captured by modeling only the remaining nouns. Using the Stanford POS tagger, each word in each segment was marked up with a part of speech indicator and all but the nouns were removed. (Jockers and Mimno, Significant Themes in 19th-Century Literature)
  • 26. When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton. Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his remarkable disappearance and unexpected return. The riches he had brought back from his travels had now become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark. There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth. ‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’ But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course, the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant families. But he had no close friends, until some of his younger cousins began to grow up. The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. 
‘You had better come and live here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and coming of age at thirty-three.
  • 27.
  • 28.
  • 29.
  • 30. Amount of information Amount of text Text mining method
  • 31. Data-intensive research methodology
  • 32. Traditional research methodology Research question Text
  • 33. Data-intensive research methodology Research question Text (digital large-scale text)
  • 34. Data-intensive research methodology Research question Text (digital large-scale text) Hypothesis
  • 35. Data Hypothesis Data Hypothesis
  • 36. Hypothesis Data-intensive research methodology Text mining method Text (digital large-scale text)
  • 37.
  • 38. Hypothesis Data-intensive research methodology Text mining method Text (digital large-scale text)
  • 39. Text-mining method Dimensions Filtering: Function words Filtering: Stopwords Part-of-speech tagging Lemmatization Tokenization NLP pipeline: From text to result
  • 40. Hypothesis Data-intensive research methodology Text mining method results Text (digital large-scale text)
  • 41. Results as a window to the text
  • 42. Viewpoint on the data
  • 43.
  • 44.
  • 45.
  • 46.
  • 47. The better your method (with respect to the information related to your research question), the better the pieces. Amount of information Amount of text Text mining method
  • 48. Data-intensive research methodology Hypothesis Text mining method results Text (digital large-scale text) Research question
  • 49. Data-intensive research methodology results results results Text mining method Text (digital large-scale text) Research question
  • 50. Digital research needs to be evaluated on the combination of data, method, and research question
  • 51. Truths about data-intensive research • Not all methods fit all data • Not all data fit all questions • Not all methods can answer all questions Nothing lives separately; everything must be evaluated together: Hypothesis Text mining method results Text (digital large-scale text)
  • 52. Results and research questions Hypothesis Text mining method results Text (digital large-scale text)
  • 53. Method + Data = Results result
  • 54. result hypothesis
  • 55. Reject 1 Data 2 Method / Preprocessing 3 Hypothesis result hypothesis
  • 56. Accept 1 Method 2 Correct interpretation of the results result hypothesis
  • 57. Math results, average difference Men Women Source: Factfulness
  • 58. Men Women Math results, average difference Source: Factfulness
  • 59. NUMBER OF INDIVIDUALS WITH DIFFERENT MATH SCORES 2016 Men Women Range of math scores Source: Factfulness
  • 60. Comparison of the same data NUMBER OF INDIVIDUALS WITH DIFFERENT MATH SCORES 2016 Men Women Source: Factfulness
  • 61. result hypothesis 1 Method 2 Correct interpretation of the results 3 Where do the results live?
  • 62. "In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random, does not support the inference that it is not arbitrary." Adam Kilgarriff, Language is never, ever, ever, random (2005)
  • 63. Experimental design Even when the math is right, we need to question the selection and the grounds on which our conclusions rest. • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  • 64. Conclusions
  • 65.
  • 66. Digital research needs to be evaluated on the combination of data, method, and research question
  • 67. Experimental design • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  • 68. Prof. Hans Rosling, Factfulness: You can’t understand the world without numbers… and you cannot understand it only with numbers.
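
The preprocessing steps named on slides 23 and 39 (tokenization, stop word filtering, lemmatization, part-of-speech filtering) can be sketched roughly as follows, using the example sentence from slide 23. This is only a toy illustration and not the pipeline used in the talk: the stop word list, lemma table, and part-of-speech lexicon are invented here, and a real study would rely on a proper tagger and lemmatizer.

```python
import re

# Toy resources, invented for this illustration only. A real study would
# use a full stop word list, a lemmatizer, and a part-of-speech tagger.
STOP_WORDS = {"i", "the", "but", "not"}
LEMMAS = {"sheets": "sheet"}
POS = {"like": "VERB", "room": "NOUN", "sheet": "NOUN"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the toy stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]


def keep_pos(tokens, tag):
    """Keep only tokens whose toy part-of-speech tag equals `tag`."""
    return [t for t in tokens if POS.get(t) == tag]


sentence = "I like the room but not the sheets."
tokens = tokenize(sentence)
content = remove_stop_words(tokens)
lemmas = lemmatize(content)
nouns = keep_pos(lemmas, "NOUN")
verbs = keep_pos(lemmas, "VERB")

for label, step in [("tokens", tokens), ("after stop word filtering", content),
                    ("after lemmatization", lemmas), ("only nouns", nouns),
                    ("only verbs", verbs)]:
    print(f"{label}: {step}")
```

Note that the toy stop word list removes "not", so the negation in the sentence disappears after filtering. Each cleaning step keeps some information and discards the rest, which is why the choice of steps has to be judged against the research question.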