Workshop Exercise: Text Analysis Methods for Digital Humanities
Helen Bailey and Sands Fish, MIT Libraries
MALLET Topic Modeling
Pre-workshop, students should download and install the MALLET GUI on their laptops.
They should also run the topic modeler on a sample text file with the default settings to make
sure it’s working correctly.
Helpful MALLET Resources
• MALLET GUI information
• Blog post on using the GUI and displaying output in Gephi
• Using MALLET on the command line
• Intro to Topic Modeling in general
• MALLET website
• Review of MALLET in Journal of Digital Humanities
• Using HO-LDA / Finding the Number of Topics in Emergency Text Classification
1. Run MALLET on a known corpus (full-text examples used in the demo, all from Project
Gutenberg: Adventures of Huckleberry Finn, Alice’s Adventures in Wonderland,
Andersen’s Fairy Tales, Grimm’s Fairy Tales, Life on the Mississippi, On the Origin of
Species, The Wizard of Oz).
2. Change the parameters to see how they impact the results. For example:
• Does preserving case matter?
• How does changing the number of iterations impact the results?
• What about changing the topic proportion threshold?
• How many topic words should you print? (What are you trying to discover? How
much info is useful?)
• What do the results tell you about this corpus? How could you use this to learn
about a corpus you weren’t familiar with?
3. MALLET implements the LDA (latent Dirichlet allocation) algorithm; discuss its details
briefly. Mention hierarchical topic modeling as a contrast (not available via MALLET).
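To make the LDA discussion concrete, here is a minimal collapsed Gibbs sampler — the same inference strategy MALLET uses — stripped to its core. This is an illustrative toy, not MALLET's implementation; the function name, toy corpus, and default hyperparameters are ours.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics=2, iterations=100, alpha=0.1, beta=0.01, seed=42):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents.
    Returns (doc_topic_counts, topic_word_counts)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    n_dk = [[0] * num_topics for _ in docs]               # doc -> topic counts
    n_kw = [defaultdict(int) for _ in range(num_topics)]  # topic -> word counts
    n_k = [0] * num_topics                                # tokens per topic
    assignments = []
    for d, doc in enumerate(docs):                        # random initialization
        topics = []
        for word in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            n_dk[d][k] += 1
            n_kw[k][word] += 1
            n_k[k] += 1
        assignments.append(topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                n_dk[d][k] -= 1; n_kw[k][word] -= 1; n_k[k] -= 1   # remove token
                # P(topic t | rest) is proportional to (n_dk+a)(n_kw+b)/(n_k+V*b)
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][word] + beta)
                           / (n_k[t] + vocab_size * beta) for t in range(num_topics)]
                r = rng.random() * sum(weights)
                k = num_topics - 1
                for t, wgt in enumerate(weights):
                    r -= wgt
                    if r <= 0:
                        k = t
                        break
                assignments[d][i] = k
                n_dk[d][k] += 1; n_kw[k][word] += 1; n_k[k] += 1   # re-add token
    return n_dk, n_kw
```

Printing the highest-count words per topic from the returned topic-word counts mirrors MALLET's topic-keys output, and varying `iterations` here connects directly to the iteration question in step 2.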
Stanford Named Entity Recognizer
Pre-workshop, students should download and install the SNER GUI on their laptops.
They should also run SNER on a sample file using the default classifier (or, if that’s not
available, the first classifier in the classifier folder), to make sure it’s working correctly.
Helpful SNER Resources
• Basic GUI tutorial
• Run SNER on a known corpus. Change the classifier to see whether the results differ.
• Save the tagged file output and open it. What do you then need to do with that output to
make it usable?
• Discuss the difference between entity extraction and entity disambiguation.
• Do we want to have them run the output through a concordance program?
• What might you do with this data? How could it interact with other tools to tell a richer
story about the corpus?
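One concrete answer to the "what do you do with the tagged output" question: Stanford NER's default inline format tags each token with a slash (word/TAG, where O means "not an entity"), so a few lines of Python can regroup consecutive same-tag tokens into entity spans. This is a sketch; the sample sentence is made up, and real output depends on the classifier chosen.

```python
def extract_entities(tagged_text):
    """Turn Stanford NER slash-tagged output ('word/TAG ...') into
    (entity, type) spans by grouping consecutive identically tagged tokens."""
    entities = []
    span, span_tag = [], None
    for token in tagged_text.split():
        word, _, tag = token.rpartition("/")   # rpartition tolerates '/' inside words
        if tag == span_tag and tag != "O":
            span.append(word)                  # extend the current entity span
        else:
            if span_tag not in (None, "O"):
                entities.append((" ".join(span), span_tag))
            span, span_tag = [word], tag
    if span_tag not in (None, "O"):            # flush the final span
        entities.append((" ".join(span), span_tag))
    return entities

# Hypothetical sample of SNER output for illustration:
sample = "Mark/PERSON Twain/PERSON piloted/O steamboats/O near/O New/LOCATION Orleans/LOCATION"
```

Running `extract_entities(sample)` yields `[("Mark Twain", "PERSON"), ("New Orleans", "LOCATION")]`, a list ready to feed into a concordance tool or a geo-parser.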
• CLAVIN Tool by Berico Technologies
o Cartographic Location And Vicinity INdexer
• MIT Center for Civic Media open source CLAVIN Server for doing geo-parsing via HTTP
o Includes special "civic sauce" for determining the "aboutness" of a document,
narrowing down to the most likely place a document is talking about.
o According to Civic, this is the best quality geo-parsing service outside of Yahoo's
• Uses Apache OpenNLP for location entity extraction under the hood.
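The "aboutness" idea can be illustrated with a toy heuristic: among all place names extracted from a document, pick the most frequently mentioned one. CLAVIN-Server's real resolution logic is far more sophisticated; this sketch (function name ours) only illustrates the concept of narrowing a document to its single most likely place.

```python
from collections import Counter

def most_likely_place(locations):
    """Toy 'aboutness' heuristic: return the most frequently mentioned place
    name in a document's extracted locations (earliest mention wins ties)."""
    if not locations:
        return None
    counts = Counter(locations)
    return max(locations, key=lambda place: (counts[place], -locations.index(place)))
```

For example, a document whose extracted locations are ["Boston", "Cambridge", "Boston", "Somerville"] would be judged to be "about" Boston.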
1. Download source from https://github.com/sandsfish/CLAVIN-Server
2. Follow the instructions in the readme to build and set up the tool.
● We’re providing sample text to work with. What do you already know about it? What do
you know from the data itself, and what information are you lacking?
● What characteristics of the sample data are likely contributing to the results you get from
these tools? (Lack of pre-processing, for example.)
● Note how long it takes for these tools to run. Consider the size of the data set we’re
working with versus the size of possible data sets you may be interested in.
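One way to make the runtime comparison concrete is to wrap each analysis step in a small timer and extrapolate from the workshop sample to the corpus sizes you actually care about. A minimal sketch (the helper name is ours):

```python
import time

def timed(step, *args, **kwargs):
    """Run one analysis step and return (result, wall-clock seconds), so
    sample-corpus timings can be extrapolated to full-size corpora."""
    start = time.perf_counter()
    result = step(*args, **kwargs)
    return result, time.perf_counter() - start
```

For instance, `tagged, seconds = timed(tag_corpus, sample_text)` times a (hypothetical) tagging step; if 1 MB of sample text takes ten seconds, a 1 GB corpus may be a multi-hour job.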