Text Analysis Methods
for Digital Humanities
Helen Bailey and Sands Fish
MIT Libraries
Examples of Data Narratives
•  Visualizing Emancipation
•  Narrative Visualization of Whaling Ship Logs
•  Out of Sight, Out of Mind
Approaches to Storytelling w/ Data
•  EDA - Exploratory Data Analysis
•  Exploring data from a number of perspectives:
o  Temporal
o  Geographical
o  Statistical
o  Categorical
o  Relational
•  80% - Data Hacking, 20% - Narrative Construction, Visualization,
etc.
"To use any sort of historical data, we must above all understand the
constraints under which it was collected. In this case, that means
retelling the history of why and how the ship's logs were first collected, and
how the constraints of digitization in the punch card era radically shape the
sort of evidence we can draw from them. The important thing about this sort
of work is that it helps us understand the overall biases of a particular
data set, which is crucial for limiting our interpretive leaps."
- Ben Schmidt, “Reading digital sources: a case study in ship's logs”
Inherent Biases & Limitations
•  Data capture methods and format
•  Purpose of data collection
•  Transformation over time
•  Authenticity and trust
Understand provenance
“Rather than replace humans, computers amplify human abilities. The
most productive line of inquiry, therefore, is not in identifying how automated
methods can obviate the need for researchers to read their text. Rather, the
most productive line of inquiry is to identify the best way to use both
humans and automated methods for analyzing texts.”
- Justin Grimmer and Brandon M. Stewart, “Text as Data: The Promise and
Pitfalls of Automatic Content Analysis Methods for Political Texts”
Acquiring Text
•  Full-text resources:
o  DSpace@MIT http://dspace.mit.edu/
o  Dome http://dome.mit.edu/
o  Digital Public Library of America http://dp.la
o  Europeana http://www.europeana.eu/portal/
o  HathiTrust http://www.hathitrust.org/
•  http://libguides.mit.edu/apis - metadata only
•  http://libguides.mit.edu/digitalhumanities
Data Management and Sharing
•  Assumption of sharing and data management plan as a
funding requirement
•  Data storage options - anticipate interaction
o  Storage formats - non-proprietary and repurposable
whenever possible
o  File system storage vs. database
•  Documentation of process
http://libraries.mit.edu/guides/subjects/data-management/
Formatting / Pre-Processing
•  Tool input requirements
•  Assumptions:
o  Text as a “bag of words”
o  Unigrams, bigrams
o  Word order (or not)
o  Stop words, capitalization, punctuation
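A minimal sketch of these pre-processing assumptions, using only the Python standard library. The stop list and the sample sentence are illustrative assumptions, not taken from any particular tool; most tools ship their own stop lists and tokenizers.

```python
import re

# Tiny illustrative stop list (an assumption); real stop lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "was"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop_words]

def ngrams(tokens, n):
    """Return consecutive n-token tuples (n=1 for unigrams, n=2 for bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The ship left the harbor; the Ship returned.")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

Note that removing stop words before building bigrams pairs words that were not adjacent in the original text (here "harbor" and "ship"). Whether that is acceptable depends on whether word order matters for the analysis.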
Featurizing Text
•  Each distinct word becomes a "feature", or "dimension"
•  A vocabulary of thousands of words yields "high dimensional" data
•  Documents are represented as vectors in Euclidean space, one coordinate per feature
•  Euclidean mathematics (e.g. distance) scales beyond 3 dimensions
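The idea can be sketched in a few lines of Python. The three toy documents are assumptions for illustration. Each distinct word gets one position in a count vector, and the usual distance formula applies no matter how many positions there are.

```python
import math
from collections import Counter

def vocabulary(docs):
    """Sorted list of every distinct word; each word is one dimension."""
    return sorted({word for doc in docs for word in doc.split()})

def vectorize(doc, vocab):
    """Word-count vector for one document, ordered by the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

docs = ["whale ship whale", "ship log", "whale ship log log"]
vocab = vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
print(euclidean(vectors[0], vectors[1]))
```

With a real corpus the vocabulary runs to thousands of words, so the vectors are long and mostly zeros, but the arithmetic is unchanged.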
The Shape of Data
•  Data structures and formats
•  Informed (in part) by:
o  Tools
o  Co-occurrence
o  Data output formats
o  Entity type
o  Temporal, geographical perspective, etc.
Validation
From Ben Schmidt’s “Machine Learning at Sea”
Network Models
•  Representing data as a network
o  Types: technological, communication, transportation, energy, airplane routes,
web linking patterns
o  Social networks, e.g.:
§  non-human animal interaction
§  membership in larger groups
§  sexually transmitted diseases
§  co-authorship of scientific publications
§  trade agreements between nations
•  Mapping the News - Berkman's Controversy Work
o  Spidering
o  Influential actors over time
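As a rough illustration of representing data as a network, the sketch below builds a co-authorship graph from lists of author names. The names and papers are hypothetical, and a plain dictionary stands in for a dedicated graph library.

```python
from collections import defaultdict
from itertools import combinations

def coauthor_network(papers):
    """Undirected weighted graph: edge weight = number of papers two authors share."""
    graph = defaultdict(lambda: defaultdict(int))
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def degree(graph):
    """Number of distinct co-authors for each author."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

# Hypothetical author lists, one per paper.
papers = [["Ada", "Ben"], ["Ada", "Cy"], ["Ada", "Ben"], ["Cy", "Dee"]]
graph = coauthor_network(papers)
degrees = degree(graph)
print(dict(graph["Ada"]))
print(degrees)
```

Degree is only the simplest measure of how connected an actor is. Tracking it across dated subsets of the data is one simple way to look at influence over time.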
Topic Modeling Tools
•  MALLET
o  Can run on unstructured plain text files
o  http://mallet.cs.umass.edu/topics.php
•  Stanford Topic Modeling Toolbox
o  Requires data in a CSV or TSV file
o  http://nlp.stanford.edu/software/tmt/tmt-0.4/
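Because the two tools expect different inputs, a small conversion step is often needed. The sketch below turns a set of plain-text documents into tab-separated rows. The two-column layout (id, text) and the sample documents are assumptions for illustration; check each tool's documentation for the columns it actually expects.

```python
import csv
import io

def documents_to_tsv(documents):
    """Serialize {doc_id: text} as tab-separated rows: id, text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for doc_id, text in documents.items():
        # Collapse newlines and repeated spaces so each document stays on one row.
        writer.writerow([doc_id, " ".join(text.split())])
    return buffer.getvalue()

# Hypothetical documents keyed by identifier.
documents = {
    "log01": "Left port at dawn.\nFair winds.",
    "log02": "Sighted two whales.",
}
tsv = documents_to_tsv(documents)
print(tsv)
```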
Entity Extraction
•  Identifies known entities in specific categories
o  Locations
o  People
o  Organizations
o  Dates/times
•  Creates annotated text from unstructured text
•  Domain-specific
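A toy sketch of the idea, assuming a small hand-made lookup table of known entities. This is not how the tools listed on the next slide work internally; it only shows the input and output shape: unstructured text in, annotated text out.

```python
import re

# Hypothetical lookup table of known entities and their categories.
KNOWN_ENTITIES = {
    "Nantucket": "LOCATION",
    "Ishmael": "PERSON",
    "MIT Libraries": "ORGANIZATION",
}

def annotate(text, entities=KNOWN_ENTITIES):
    """Wrap each known entity in an inline tag such as <PERSON>...</PERSON>."""
    # Try longer names first so multi-word entities win over shorter overlaps.
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def tag(match):
        name = match.group(1)
        label = entities[name]
        return f"<{label}>{name}</{label}>"

    return pattern.sub(tag, text)

annotated = annotate("Ishmael sailed from Nantucket.")
print(annotated)
```

The lookup table is what makes this approach domain-specific: an entity missing from the table is never found, however obvious it is to a human reader.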
Entity Extraction Tools
•  Stanford Named Entity Recognizer
http://nlp.stanford.edu/software/CRF-NER.shtml
•  Illinois Named Entity Tagger
http://cogcomp.cs.illinois.edu/page/download_view/NETagger
•  DBPedia Spotlight
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
Geo-Parsing
•  Common Pitfalls
o  Set of places (GeoNames dictionary)
o  Dictionary determines how broad or narrow your
search is
•  Enhancements to CLAVIN by Civic Media
o  Aboutness (uses mention counting)
o  HTTP access used for more advanced workflows
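The two points above can be illustrated with a short sketch: a gazetteer decides which places can be found at all, and mention counting picks the place a text is most "about". The gazetteer, the approximate coordinates, and the function names are assumptions for illustration, not CLAVIN's actual API.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Boston": (42.36, -71.06),
    "Nantucket": (41.28, -70.10),
    "Lima": (-12.05, -77.04),
}

def mention_counts(text, gazetteer=GAZETTEER):
    """Count whole-word mentions of every place in the gazetteer."""
    counts = Counter()
    for place in gazetteer:
        counts[place] = len(re.findall(r"\b" + re.escape(place) + r"\b", text))
    return counts

def about(text, gazetteer=GAZETTEER):
    """Return the most-mentioned place, or None if no place is mentioned."""
    counts = mention_counts(text, gazetteer)
    if not counts:
        return None
    place, n = counts.most_common(1)[0]
    return place if n > 0 else None

text = "The ship left Nantucket for Lima. Letters from Nantucket arrived in Boston."
print(mention_counts(text))
print(about(text))
# A narrower dictionary changes the answer: only Boston can be found.
print(about(text, {"Boston": GAZETTEER["Boston"]}))
```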
