Tool Criticism
Marijn Koolen
(Huygens Institute for the History of the Netherlands)
Tools & Methods guest lecture - 2021-03-02 - Groningen
Your Interests?
● Python/Jupyter to do:
○ GIS, plotting locations on Google Maps
○ Machine learning, (un)supervised learning, visualisation techniques
○ Mining social media data
○ TF*IDF, information processing, word embedding models
○ Statistics / JASP
● Questions:
○ Why are you interested?
○ What would you like to do with this?
Online Resources
● Online (note)books for DH and Python
● Generic Jupyter:
○ https://jupyter4edu.github.io/jupyter-edu-book/
○ https://programminghistorian.org/en/lessons/jupyter-notebooks
● Specific methods and techniques
○ Cultural Analytics: https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html
■ Includes TF*IDF, Tweet mining and analysis, Geocoding
○ Named Entity Recognition: http://ner.pythonhumanities.com/intro.html
○ Deep Learning: https://course.fast.ai
○ NLP: Traditional and Deep Learning: https://www.fast.ai/2019/07/08/fastai-nlp/
■ Includes Word Embeddings, sentiment analysis, topic modelling, classification, …
○ GLAM Workbench: https://glam-workbench.github.io
■ Retrieving and analysing data from Galleries, Libraries, Archives, Museums
Tool criticism
Guiding Questions
● Starting point: (digital) source criticism
○ A method / approach in the humanities, and specifically in historical research (cf. Fickers, 2012)
○ Internal source criticism: the content of the document
○ External source criticism: the metadata of the document (its context)
■ Who created the document?
■ What kind of document is it?
■ Where was it made and distributed?
■ When was it made?
■ Why was it made?
● Digital tool criticism
○ What makes digital tool criticism different from digital source criticism?
○ Tool hermeneutics: what was the tool's intended use? Does that align with my intended use? How does it affect the digital sources/data it operates on?
Recommendations
For researchers:
- Incorporate digital source, data, and tool criticism in the research process
- Explicitly ask and answer questions about assumptions, choices, and limitations
- Document and share workarounds
- Look for “About” pages and documentation on
- Functionalities, configurations, parameter choices
- Selection criteria and transformations of data sets
- Develop a method of experimentation with a tool to test how it functions
- Look under the hood to develop better intuitions, grow your conceptual toolbox
- E.g. how can you test whether a search engine filters stopwords or applies linguistic normalization? (See the sketch below.)
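To make that last question concrete, here is a minimal, self-contained sketch of such an experiment. The toy corpus, stopword list, and index are all invented for illustration; a real probe would send the same queries to the actual search tool and compare hit counts.

```python
# Toy illustration of probing for stopword filtering (all data invented).
# Build two tiny indexes, one that filters stopwords and one that does not,
# then ask each the question an outside user could ask a real engine.
STOPWORDS = {"the", "a", "of"}
docs = ["the cat sat", "a dog ran", "cats and dogs"]

def build_index(filter_stopwords: bool) -> dict[str, set[int]]:
    index: dict[str, set[int]] = {}
    for i, doc in enumerate(docs):
        for term in doc.split():
            if filter_stopwords and term in STOPWORDS:
                continue  # the hidden behaviour we want to detect from outside
            index.setdefault(term, set()).add(i)
    return index

def probe(index: dict[str, set[int]]) -> None:
    # If a query for a stopword alone returns no hits on an English-language
    # corpus, the engine very likely filters stopwords at indexing time.
    hits = index.get("the", set())
    print("hits for 'the':", len(hits), "-> stopwords filtered?", len(hits) == 0)

probe(build_index(filter_stopwords=True))   # 0 hits -> likely filtered
probe(build_index(filter_stopwords=False))  # 1 hit  -> not filtered
```

The same pattern works for normalization: if queries for "run" and "running" return identical hit counts, the engine probably stems or lemmatizes terms before matching.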
Model: Reflection as Integrative Practice
Koolen, M., van Gorp, J., & van Ossenbruggen, J. (2018). Toward a model for digital tool criticism: Reflection as integrative practice. Digital Scholarship in the Humanities. https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqy048/5127711
Role of Reflection
● Reflection in Action
○ The process is often unpredictable and uncertain (Schön 1983, p. 40)
○ Some actions, recognitions and judgements we carry out spontaneously, without thinking about them (p. 54)
○ Use reflection to criticize tacit understanding grown from repetitive experiences (p. 61)
● This fits certain aspects of scholarly practice
○ E.g. searching, browsing, and selecting using various information systems (digital archives and libraries, catalogs, and other databases)
○ But information systems already apply pre-selection, which is rarely well documented (digital source criticism!)
Research Design as Wicked Problem
● Wicked problem
○ Design theory concept: a problem that is inherently ill-defined (Rittel in Churchman 1967)
○ Working towards a solution changes the nature of the problem
● Humanities research is designed iteratively (Bron et al. 2016)
○ It is impossible to plan where the investigation will take you
○ Engagement with the research materials shifts the goal posts
○ This affects how appropriate the design is for the research question (RQ)
● The user-friendliness of digital tools exacerbates the problem
○ Graphical User Interfaces (GUIs) often hide relevant data transformations and manipulations
○ It is difficult to look under the hood
○ This requires an active, reflective attitude
Entanglement of Data and Tools
Each step changes the underlying data!
Tools or Methods?
● How to address tool criticism questions?
○ Focus on research methods
● E.g. Social Network Analysis (SNA)
○ Understand the concepts, techniques, and applications of SNA before assessing SNA tools
○ How many of you have used SNA tools? How many of you want to use them?
○ E.g. Gephi or NetworkX (Python library)
● Before you ask...
○ Which layout algorithm should I use?
○ Which community detection algorithm should I use? What parameters are good?
● … understand the core concepts (see the sketch below):
○ Nodes, edges, node degrees, paths, connected components
○ Modularity, bridges, weak ties
○ Completeness, impact of missing data
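A minimal sketch of those core concepts in NetworkX, on a toy graph invented for illustration (not from the lecture notebooks):

```python
# Two triangles joined by a single edge: enough to show degrees, paths,
# components, bridges, and community detection on one toy graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"),   # first triangle
                  ("c", "d"),                           # the joining edge
                  ("d", "e"), ("e", "f"), ("f", "d")])  # second triangle

print(dict(G.degree()))                  # node degrees (number of links)
print(nx.shortest_path(G, "a", "f"))     # a path crossing the joining edge
print(list(nx.connected_components(G)))  # one component: the graph is connected
print(list(nx.bridges(G)))               # ("c", "d") is a bridge / weak tie

# Community detection: the result and its modularity score depend on the
# chosen algorithm and its parameters -- exactly the choices questioned above.
communities = greedy_modularity_communities(G)
print([set(c) for c in communities], modularity(G, communities))
```

Removing a single node from this toy graph changes degrees, bridges, and communities at once, which is the completeness/missing-data point above.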
(Figure: Twitter ego networks and ego communities. Source: https://towardsdatascience.com/generating-twitter-ego-networks-detecting-ego-communities-93897883d255)
TF*IDF
● Term Frequency * Inverse Document Frequency
○ Used in many methods and tools
○ What was TF*IDF originally intended for?
● Again, start from the method
○ Natural Language Processing, Information Theory
○ Concepts: Zipf’s law, tokenisation, stopwords, stemming, lemmatisation, part-of-speech tagging, mutual information
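To start from the method rather than a tool, here is a from-scratch TF*IDF sketch on invented toy documents. Real libraries differ in smoothing and normalisation choices, which is itself a tool-criticism point; this uses one simple unsmoothed variant.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]  # naive whitespace tokenisation
N = len(tokenized)
# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term: str, doc: list[str]) -> float:
    tf = doc.count(term)          # raw term frequency in this document
    idf = math.log(N / df[term])  # rarer across documents -> higher weight
    return tf * idf

for doc in tokenized:
    print({term: round(tf_idf(term, doc), 2) for term in set(doc)})
```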
Hands On with TF*IDF
● I’ve prepared a Jupyter notebook that demonstrates the workings of TF*IDF
○ Using social media data (tweets and online reviews)
○ With 7 questions to reflect on its details
● Break-out groups
○ Open the notebook and discuss the questions (take 20 mins.)
○ Afterwards we discuss your observations and your own questions
○ Also look at the Wikipedia page on TF*IDF: https://en.wikipedia.org/wiki/Tf-idf
Text Mining in Tweets
● Text mining of tweets (and other short records)
○ Tweets are peculiar textual representations
■ Minimal amount of text, low redundancy
■ The majority of terms occur only once
○ Which part of TF*IDF contributes more to the TF*IDF score of a tweet? (See the sketch below.)
○ Consequences for ranking/clustering/mining?
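A small sketch of why this matters, with an invented tweet: nearly every term occurs exactly once, so TF is a constant and the IDF component ends up driving the TF*IDF score.

```python
from collections import Counter

tweet = "just landed in groningen for the tools and methods lecture"
tf = Counter(tweet.split())
print(tf)  # every term occurs exactly once in this example
print("share of terms with tf == 1:",
      sum(1 for count in tf.values() if count == 1) / len(tf))
```

With TF flat at 1, ranking and clustering of tweets effectively compare IDF profiles, i.e. which rare words a tweet happens to contain.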
Sentiment Analysis and Emotion Lexicons
● Resources
○ NRC EmoLex: 8 basic emotions (https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm)
○ LIWC: over 70 categories, incl. emotions (https://liwc.wpengine.com)
○ VADER: Valence Aware Dictionary and sEntiment Reasoner (https://github.com/cjhutto/vaderSentiment)
● Critical questions
○ How do they work? What are they intended to measure? For what text genres?
○ How reliable are they? What do they capture well? What are typical mistakes they make?
● Lessons from 20+ years of NLP research:
○ Sentiment is domain-specific and nowadays aspect-based (reviews of hotels, restaurants, and smartphones have their own vocabularies)
● ALWAYS combine quantitative with qualitative analysis!
○ They contextualise each other
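A minimal VADER usage sketch (the library is the one linked above; the example texts are invented). Feeding it genre-atypical text is one way to probe the critical questions:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for text in [
    "I LOVE this!!! :)",                     # the social media style VADER targets
    "The mortality rate declined sharply.",  # neutral report; the lexicon may misread it
]:
    # Returns neg/neu/pos proportions plus a compound score in [-1, 1]
    print(text, analyzer.polarity_scores(text))
```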
Questions About Social Media Sentiment Mining
● A second Jupyter notebook that dissects sentiment analysis
○ Using social media data (tweets and online reviews)
○ With 9 questions to reflect on its details and output
● Break out groups
○ Open the notebook and discuss the questions (take 20 mins.)
○ Afterwards we discuss your observations and your own questions
Word Embedding Models
● Concepts:
○ N-grams, skipgrams, distributional semantics
○ Semantic vs. syntactic similarity (related to the size of the context window)
○ Generic vs. domain-specific models and text corpora
○ Pre-trained models, transfer learning
○ Corpus size
● See also (shameless self-promotion):
○ Wevers, M., & Koolen, M. (2020). Digital begriffsgeschichte: Tracing semantic change using word embeddings. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 53(4), 226-243.
○ https://www.tandfonline.com/doi/pdf/10.1080/01615440.2020.1760157
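A hedged sketch of training a small word2vec model, here with gensim (my choice of library, not named on the slide). The toy corpus is invented and far too small for meaningful vectors, which is exactly the corpus-size caveat above:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["dogs", "and", "cats", "are", "animals"],
] * 100  # repeat the toy corpus so training has something to work with

model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the embeddings
    window=2,        # a small context window leans towards syntactic similarity
    min_count=1,     # keep every term (real corpora usually prune rare ones)
    epochs=20,
)
print(model.wv.most_similar("king", topn=3))
```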
Machine Learning
● Finding patterns in data
○ But are they meaningful patterns?
○ Main point: separating the regular features (signal) from the ‘accidental features’ (noise) of a dataset
■ If I throw a 6-sided die 10 times, the average is probably close to 3.5 (regular/signal), but the particular sequence of sides is accidental (irregular/noise); see the simulation below
■ Many ‘regularities’ are artefacts introduced through selection (tweets from the last 24 hours may cover Sunday evening for one part of the world and Monday morning for another)
● Which regularities are relevant depends on your research question
○ But ML methods are oblivious to your research question and context
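The die example as a simulation: the mean is a stable regularity across runs, the particular sequence is not.

```python
import random

for run in range(3):
    rolls = [random.randint(1, 6) for _ in range(10)]
    print(rolls, "mean:", round(sum(rolls) / len(rolls), 2))
# The means hover around 3.5 (signal); the sequences share nothing
# beyond that regularity (noise).
```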
Tweet Corpora
● Existing corpora
○ Kaggle sentiment140: https://www.kaggle.com/kazanova/sentiment140
○ GESIS TweetsCOV19: https://data.gesis.org/tweetscov19/
○ GateNLP BTC: https://github.com/GateNLP/broad_twitter_corpus
○ Disaster Tweet Corpus 2020: https://zenodo.org/record/3713920
● How were they constructed?
○ Multiple layers of selection:
■ The Twitter API
■ Collection methods and period, queries, and cleaning/filtering
● For what purpose were they collected?
○ How has that shaped their construction? (See the loading sketch below.)
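As a starting point for such questions, a hedged sketch of loading sentiment140 with pandas. The file name, latin-1 encoding, and column names follow the Kaggle documentation, but verify them against your actual download; none of the selection layers above are visible in the file itself.

```python
import pandas as pd

# Assumed file name and schema from the sentiment140 Kaggle page; check
# your own download before relying on them.
cols = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=cols)

print(len(df))                      # corpus size after the collectors' filtering
print(df["target"].value_counts())  # label distribution: a construction choice
```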
Tool Criticism Recommendations (From Journal Article)
● Analyze and discuss tools at the level of data transformations.
○ How do inputs and outputs differ?
○ What does this mean for interpreting the transformed data?
● Questions to ask about digital data:
○ Where do the data come from? Who made the data? Who made the data available? What selection criteria were used? How are the data organized? What preprocessing steps were used to make the data available? If digitized from analogue sources, how do the digitized data differ from the analogue sources? Are all sources digitized or only selected materials? What are the known omissions/gaps in the data?
● Questions about digital tools:
○ Which tools are available and relevant for your research? Which tool best fits the method you want to use? How does the tool fit the method you want to use? For which phase of your research is this tool suitable? What kind of tool is it? Who made the tool, when, why, and what for? How does the tool transform the data that it works upon? What are the potential consequences of this?
● Questions about digital search tools:
○ What search strategies does the tool allow? What feedback about matching and non-matching documents does the tool provide? What ways does the tool offer for sense-making and getting an overview of the data it gives access to?
● Questions about digital analysis tools:
○ What elements of the data does the tool allow you to analyze qualitatively or quantitatively? What ways of analyzing does the tool offer, and what ways to contextualize your analysis?
References
Bron, M., Van Gorp, J., & De Rijke, M. (2016). Media studies research in the data-driven age: How research questions evolve. Journal of the Association for Information Science and Technology, 67(7), 1535-1554.
Churchman, C. W. (1967). Wicked problems. Management Science, 14(4), B141-142.
Fickers, A. (2012). Towards a new digital historicism? Doing history in the age of abundance. VIEW Journal of European Television History and Culture, 1(1), 19-26.
Hoekstra, R., & Koolen, M. (2019). Data scopes for digital history research. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 52(2), 79-94.
Koolen, M., van Gorp, J., & van Ossenbruggen, J. (2018). Toward a model for digital tool criticism: Reflection as integrative practice. Digital Scholarship in the Humanities.
Schön, D. (1983). The reflective practitioner: How professionals think in action. New York: Basic Books.
Wevers, M., & Koolen, M. (2020). Digital begriffsgeschichte: Tracing semantic change using word embeddings. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 53(4), 226-243.