In this talk, I will discuss how we can use digital methods to generate sustainable knowledge in the humanities. I will give an overview of the data-intensive research methodology and discuss how methods, results, and data relate to each other and must be evaluated as parts of a whole: there is no such thing as a good method, nor is there a way to know if the results are good, without considering the data. I will discuss results as a window from which we can see our data, and how we can reason about the results of digital methods. I will also present the Change is Key! research program and describe our efforts to connect computational research with research questions from the humanities and social sciences.
1. The Strengths and Pitfalls of Large-Scale Text Mining for DH
Nina Tahmasebi, Associate Professor
University of Gothenburg
CHR 2022
December 2022, Antwerp
2. Centre for Digital Humanities (2018-2019)
Mathematics (B.Sc. & M.Sc.), 2003-2008
Computer/Data Science (PhD + Postdoc), 2008-2014
NLP / Language Technology (Researcher, Associate Professor), 2014→
Nina Tahmasebi, University of Gothenburg, CHR 2022 2
4. Change is Key!
The study of contemporary and historical societies
using methods for synchronic semantic variation and diachronic semantic change
https://www.changeiskey.org/
5. Some facts
• 6 years
• 6 partner universities
• Members from 4 countries
• 6 countries, with advisors
• 13 people, including PM and SE
• 33.5 MSEK from Riksbankens Jubileumsfond + 5.5 MSEK from the University and Faculty
6. Our Research Questions
[Diagram: research questions numbered 1-5; labels: Computational models of meaning and change; Gender Studies]
7. Three axioms
1. There is no such thing as data-driven research
2. There is no such thing as a good computational method
3. If you do not evaluate your results, you might as well spend your time enjoying a hobby
8. From text to answers
text
9. A single physical piece can be studied in detail.
A few physical pieces can be studied and compared in detail.
Too many physical pieces cannot be treated manually.
10. From text to answers
text
text mining
method
13. From text to answers
text
text mining
method
research question
results
14. From text to answers
text
research question
text mining
method
results
15. Based on
• Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227.
• Tahmasebi, Nina, Niclas Hagen, Daniel Brodén, and Mats Malm. "A Convergence of Methodologies: Notes on a Data-intensive research methodology." DHN2019 (2019): 437-449.
16. Today’s outline
1. Research Questions
2. Digital Text
3. Data-intensive research methodology
4. Research results and interpretation
23. On the dangers of exploration I
Bible codes (Torah code):
Data: Hebrew Bible text (Torah)
Method: Equidistant Letter Sequence (ELS)
Results: names of famous rabbinic personalities and their respective birth and death dates
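The Equidistant Letter Sequence idea can be illustrated with a minimal sketch: keep only the letters of a text, then read every k-th letter from each starting position and check whether a chosen word appears. The function name `find_els`, the skip range, and the random "text" are assumptions made for illustration and are not taken from the Bible-code studies. Because the number of start and skip combinations is so large, short words tend to turn up even in random letters.

```python
import random
import string


def find_els(text, word, max_skip=50):
    """Return (start, skip) pairs where `word` occurs as an equidistant letter sequence."""
    # ELS ignores spaces, punctuation and case: only the letter stream matters.
    letters = [c for c in text.lower() if c.isalpha()]
    hits = []
    n, m = len(letters), len(word)
    for skip in range(2, max_skip + 1):
        span = (m - 1) * skip  # distance between first and last letter
        for start in range(n - span):
            if all(letters[start + i * skip] == word[i] for i in range(m)):
                hits.append((start, skip))
    return hits


# Random letters stand in for "any long text" (hypothetical data, fixed seed).
rng = random.Random(0)
noise = "".join(rng.choice(string.ascii_lowercase) for _ in range(10_000))

for word in ["cat", "dog", "sun"]:
    print(word, "found", len(find_els(noise, word)), "times in random letters")
```

With 10,000 random letters and skips up to 50 there are close to half a million places to look, so a three-letter word is expected to appear a few dozen times purely by chance.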
24. On the dangers of exploration II
President John F. Kennedy was shot in the head by an assassin who quietly waited in a concealed place. It was in Texas, November 1963, during a presidential motorcade.
Moby Dick
25. On the dangers of exploration III
“… you can find things like this anywhere. The reason it looks amazing is that the number of possible things to look for, and the number of places to look, is much greater than you imagine.”
Brendan McKay, Professor Emeritus at Australian National University
https://users.cecs.anu.edu.au/~bdm/codes/moby.html
28. A book:
• Empty pages in the beginning / end
• Large letter at the beginning of each chapter
• Images?
30. Too many physical pieces cannot be treated manually.
Digital Text
31. Too many digital texts cannot be studied in TOO MUCH DETAIL either!
We need to ignore a lot of formatting
• White pages
• White space
• Fonts
• Capitalization of letters
• Etc…
33. I like the room but not the sheet. (only verbs)
I like the room but not the sheet. (frequency filtering)
I like the room but not the sheet. (only nouns)
I like the room but not the sheet. (after lemmatization)
I like the room but not the sheets. (after stop word filtering)
I like the room but not the sheets.
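The filtering steps named on this slide can be sketched on the same sentence. This is a toy illustration only: the stop-word list, lemma table and part-of-speech tags below are hand-made assumptions for this one sentence, not the output of a real tagger or lemmatizer, and frequency filtering is left out because it needs corpus-wide counts.

```python
import re

# Toy resources made up for this one sentence (assumptions, not a real lexicon).
STOP_WORDS = {"i", "the", "but", "not"}
LEMMAS = {"sheets": "sheet"}
POS = {"i": "PRON", "like": "VERB", "the": "DET", "room": "NOUN",
       "but": "CONJ", "not": "ADV", "sheet": "NOUN"}


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (drops punctuation and case)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that is in the toy stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Map each token to its lemma if the toy table has one."""
    return [LEMMAS.get(t, t) for t in tokens]


def keep_pos(tokens, tag):
    """Keep only tokens whose toy part-of-speech tag equals `tag`."""
    return [t for t in tokens if POS.get(t) == tag]


sentence = "I like the room but not the sheets."
tokens = tokenize(sentence)
content = remove_stop_words(tokens)
lemmas = lemmatize(content)

print(tokens)                    # ['i', 'like', 'the', 'room', 'but', 'not', 'the', 'sheets']
print(content)                   # ['like', 'room', 'sheets']
print(lemmas)                    # ['like', 'room', 'sheet']
print(keep_pos(lemmas, "NOUN"))  # ['room', 'sheet']
print(keep_pos(lemmas, "VERB"))  # ['like']
```

In this sketch "not" sits in the stop-word list, so the negation disappears after the first filter. Other stop-word lists may keep it, which is one reason each filtering choice has to be judged against the research question.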
34. 3. Nouns. After a series of experiments, it was determined that the thematic information in this corpus could best be captured by modeling only the remaining nouns. Using the Stanford POS tagger, each word in each segment was marked up with a part of speech indicator and all but the nouns were removed.
Jockers and Mimno, Significant Themes in 19th-Century Literature
35. When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday
with a party of special magnificence, there was much talk and excitement in Hobbiton.
Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his
remarkable disappearance and unexpected return. The riches he had brought back from his travels had now
become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was
full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to
marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at
fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark.
There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that
anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth.
‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’
But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to
forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course,
the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant
families. But he had no close friends, until some of his younger cousins began to grow up.
The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted
Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally
dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. ‘You had better come and live
here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At
that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and
coming of age at thirty-three.
39. Culturomics
Michel, Jean-Baptiste, et al. "Quantitative analysis of culture using millions of digitized books." Science 331.6014 (2011): 176-182.
40. Fig 13. Upton Sinclair wrote 11 Lanny Budd novels set during World War II.
Pechenick EA, Danforth CM, Dodds PS (2015) Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution. PLOS ONE 10(10): e0137041. https://doi.org/10.1371/journal.pone.0137041
Lanny vs. Hitler
41. When we have little data, the uncertainty is large:
• Is A larger than B?
But when we have large data, we are more certain about our observations. Still, our errors can be much larger
• Because our selection is biased
[Figure: Sample 1 and Sample 2]
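A small simulation can make this contrast concrete. The sketch below is purely illustrative: the population, the selection rule (values above 45 are far more likely to end up in the sample) and all numbers are invented assumptions, not data from the talk. It compares a small random sample with a large biased one and reports each sample's mean, its standard error, and its distance from the true mean.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 100,000 values drawn around a mean of 50.
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)


def mean_and_stderr(sample):
    """Sample mean and its standard error (stdev / sqrt(n))."""
    n = len(sample)
    return statistics.fmean(sample), statistics.stdev(sample) / n ** 0.5


# Small random sample: wide uncertainty, but no systematic error.
small_random = rng.sample(population, 30)

# Large biased sample: values above 45 are always kept, the rest only 10% of the time.
large_biased = [x for x in population if x > 45 or rng.random() < 0.1]

for name, sample in [("small random", small_random), ("large biased", large_biased)]:
    mean, se = mean_and_stderr(sample)
    print(f"{name:13s} n={len(sample):6d} mean={mean:6.2f} +/- {se:.2f} "
          f"(error vs. true mean: {mean - true_mean:+.2f})")
```

The large sample reports a very small standard error, yet its mean sits several units away from the true mean, because the standard error only measures random variation and says nothing about how the sample was selected.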
42. Three axioms
1. There should be no such thing as data-driven research – but the text we have is important!
50. Text-mining method
Dimensions
Filtering: Function words
Filtering: Stopwords
Part-of-speech tagging
Lemmatization
Tokenization
NLP pipeline: From text to result
61. Digital research needs to be evaluated on the combination of data, method, and research question
62. Truths about data-intensive research
Not all methods fit all data
Not all data fit all questions
Not all methods can answer all questions
Nothing lives separately; everything must be evaluated together:
Hypothesis
Text mining method
Results
Text (digital large-scale text)
63. Three axioms
2. There is no such thing as a good computational method
71. [Figure: Number of individuals with different math scores, 2016; series: Men, Women; x-axis: Range of math scores]
Source: Factfulness
72. Comparison of the same data
[Figures: Number of individuals with different math scores, 2016; series: Men, Women]
Source: Factfulness
73. Trending variables
Alexander Koplenig, 2018. Using the parameters of the Zipf–Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes – a large-scale corpus analysis
75. Inference requires random selection
• Only if the selection is random can we use the sample to draw conclusions about the world
• We almost NEVER have a random sample in a textual corpus!
→ We cannot draw conclusions about the world
[Figure: Sample 1 and Sample 2; labels: random, inference]
76. Experimental design
Even when the math is right, we need to question the selection and the grounds on which our conclusions rest.
• What is the corresponding number elsewhere?
• What are we measuring?
• Why will this answer our questions?
77. Three axioms
3. If you do not evaluate your results, you might as well spend your time enjoying a hobby
80. Digital research needs to be evaluated on the combination of data, method, and research question
81. Prof. Hans Rosling
You can’t understand the world without numbers… and you cannot understand it only with numbers.
Factfulness
83. Links to study circle:
http://tahmasebi.se/studiecirkel/
Look at topic models for LB
https://hengchen.net/lb/all/
To be released, TM for Litteraturbanken