Visualizing Textual Data

Drayton C. Benner
Founder/President, Miklal Software Solutions
PhD Candidate, Northwest Semitic Philology
University of Chicago
DraytonBenner@MiklalSoftware.com

Analyzing a word in a corpus
• Digital: search results
• Print: concordances

Searching digital Bibles
Search results from Olive Tree Bible Software (OliveTree.com) on a Samsung Galaxy S 3 smartphone. Disclaimer: I wrote the search engine for Olive Tree Bible Software, but I did not write the display of the search results.
Search results from Logos Bible Software (logos.com) on a PC.

Searching digital texts: KWIC display
From Perseus under Philologic at philologic.uchicago.edu
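A keyword-in-context (KWIC) display puts each hit on its own line with a fixed amount of text on either side. The following is a minimal Python sketch of the idea, assuming whitespace tokenization and a fixed number of words per side. The function name `kwic` and the `window` parameter are illustrative and are not taken from Philologic.

```python
def kwic(text, key, window=4):
    """Return one line per occurrence of `key`, with `window` words on each side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring punctuation attached to the word.
        if word.strip('.,;:!?"\'()').lower() == key.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


sample = ("In the beginning God created the heaven and the earth. "
          "And the earth was without form, and void.")
for line in kwic(sample, "earth", window=3):
    print(line)
```

Note that a fixed window cuts the context at arbitrary points, which is the limitation the rest of this talk addresses.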

Print concordances
• From Strong 1890
• Context chosen by hand to maximize understanding of the key word’s context given the space limitations
• Incredibly labor-intensive (tens or hundreds of person-years)

Can we unite the benefits of digital and print?
• Advantage of print
  • Context chosen carefully for maximum understanding of the key word’s context given the space constraints
• Advantages of digital
  • Ability to present many texts
  • Ability to present search results for any key word almost instantaneously
  • Ability to allow for multiple fonts and font sizes
• Uniting the advantages of print and digital: algorithmically select the best context for maximum understanding of the key word’s context given the space constraints

Texts for experiment
• Bibles
  • KJV (1769 edition)
  • ESV (2011 edition) Old Testament/Hebrew Bible
• Novel
  • Henry James, What Maisie Knew

Training data
• 500 key words chosen at random from ESV and presented to an annotator
• A width is chosen at random, ranging from approximately what would fit on a smartphone to a width three times as long
• All possible contexts for the key word are presented to the annotator
• The line is filled with as many words of context as will fit
• Punctuation is handled reasonably. E.g., the context cannot begin with a period or end with an open quotation mark.
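The candidate contexts described above can be sketched as follows. This is a hedged illustration, not the tool used in the experiment: it measures width in characters rather than rendered pixels, assumes whitespace tokenization, and does not model the punctuation rules. The name `candidate_contexts` and its parameters are hypothetical.

```python
def candidate_contexts(words, key_index, max_chars):
    """List (start, end) word spans that contain the key word, fit within
    max_chars characters, and cannot be extended by a word on either side."""
    def width(start, end):
        # Characters needed to print words[start:end] joined by single spaces.
        return sum(len(w) for w in words[start:end]) + (end - start - 1)

    spans = []
    for start in range(key_index + 1):
        end = key_index + 1
        if width(start, end) > max_chars:
            continue  # even the shortest span from this start is too wide
        # Fill the line to the right with as many words as will fit.
        while end < len(words) and width(start, end + 1) <= max_chars:
            end += 1
        # Keep the span only if no further word fits on the left either.
        if start == 0 or width(start - 1, end) > max_chars:
            spans.append((start, end))
    return spans


words = "And the earth was without form and void".split()
for start, end in candidate_contexts(words, key_index=2, max_chars=20):
    print(" ".join(words[start:end]))
```

Each returned span is a line that is as full as possible, so the annotator (or an algorithm) only has to choose where the window sits around the key word.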

Algorithm: overview
• Score each nearby word according to its relevance to understanding the key word
• Factors determining the score
  • Is the nearby word a function word or a content word?
  • How much punctuation separates the key word and the nearby word?
    • Contiguous punctuation counts as one
  • Syntax-based measures
    • How far apart are the key word and the nearby word in a phrase structure tree (= constituency-based parse tree)?
    • How far apart are the key word and the nearby word in a dependency tree?
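The two surface factors above (function word versus content word, and intervening punctuation) can be computed along these lines. This is a sketch under stated assumptions: punctuation marks are separate tokens, and the small `FUNCTION_WORDS` set is a placeholder for whatever word list or part-of-speech test a real system would use. Both helper names are hypothetical.

```python
import re

# A tiny illustrative list; a real system would use a fuller list or POS tags.
FUNCTION_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to",
                  "for", "with", "was", "is", "he", "she", "it", "that"}


def is_function_word(word):
    return word.lower() in FUNCTION_WORDS


def punctuation_between(tokens, i, j):
    """Count punctuation groups strictly between tokens i and j.
    Adjacent punctuation tokens (a comma followed by a quotation mark,
    for example) count as a single group."""
    lo, hi = sorted((i, j))
    groups = 0
    in_group = False
    for token in tokens[lo + 1:hi]:
        is_punct = re.fullmatch(r"\W+", token) is not None
        if is_punct and not in_group:
            groups += 1
        in_group = is_punct
    return groups


tokens = ["He", "said", ",", '"', "Go", ".", '"', "Then", "he", "left"]
print(punctuation_between(tokens, 1, 4))  # "said" to "Go"
print(punctuation_between(tokens, 1, 7))  # "said" to "Then"
```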

Algorithm: phrase structure and dependency trees
From http://en.wikipedia.org/wiki/Phrase_structure_rules

Algorithm: phrase structure and dependency trees (cont.)
• Generated using the Stanford Parser
• Some text pre-processing, especially to replace the major archaisms of the KJV
• Some post-processing
  • Fix repetitive parsing errors algorithmically
  • Restore the major archaisms of the KJV

Algorithm: scoring nearby words
The weight w for a nearby word n of a key word k is calculated as
follows:

Algorithm: scoring nearby words (cont.)
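[Figure: tree diagrams marking the key word $k$ and the nearby word $n$, with distances labeled $d_{pk}$, $d_{pn}$, $d_{dk}$, and $d_{dn}$]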
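The slide's labels pair a "p" and a "d" subscript with each of k and n. One plausible reading, which is an assumption and not confirmed by the slide text, is that each label is the number of edges from the key word (or the nearby word) up to the lowest common ancestor of the two words, in the phrase structure tree and the dependency tree respectively. Under that assumption, and representing a tree as a hypothetical `heads` list of parent indices, the pair of distances can be sketched as:

```python
def path_to_root(heads, node):
    """Nodes from `node` up to the root; heads[i] is the parent of i, -1 for the root."""
    path = [node]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def distances_to_common_ancestor(heads, k, n):
    """Return (edges from k, edges from n) up to their lowest common ancestor."""
    k_path = path_to_root(heads, k)
    n_depth = {node: depth for depth, node in enumerate(path_to_root(heads, n))}
    for k_depth, node in enumerate(k_path):
        if node in n_depth:
            return k_depth, n_depth[node]
    raise ValueError("k and n are not in the same tree")


# Toy dependency tree for "The dog chased the cat": "chased" (index 2) is the root.
heads = [1, 2, -1, 4, 2]
print(distances_to_common_ancestor(heads, 1, 4))  # "dog" and "cat"
print(distances_to_common_ancestor(heads, 0, 2))  # "The" and "chased"
```

The same function works for a phrase structure tree if the `heads` list also includes the non-terminal nodes.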

Algorithm: picking the best context
• Each possible context for a key word k is evaluated as the sum of w(k,n) for each nearby word n in the possible context. The context with the highest sum is chosen.
• The various constants were optimized using a Monte Carlo particle filter on the training data.
• $c_f = 1.5$; $c_p = -3.37$; $c_{pk} = 0.175$; $c_{pn} = 0.2$; $c_{dk} = 1$; $c_{dn} = 1.4$.
• The dependency tree was more important than the phrase structure tree.
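The selection step (sum the weights, take the highest-scoring context) can be sketched as below. The weight function is passed in as a parameter because the formula for w(k,n) is not reproduced in this text. The `toy_weight` function is purely illustrative and is not the talk's scoring function, and `best_context` is a hypothetical name.

```python
def best_context(spans, key_index, weight):
    """Pick the (start, end) span whose nearby words have the highest total weight.
    `weight(key_index, n)` stands in for w(k, n)."""
    def score(span):
        start, end = span
        return sum(weight(key_index, n) for n in range(start, end) if n != key_index)
    return max(spans, key=score)


words = "And the earth was without form and void".split()
spans = [(0, 4), (2, 5)]  # two candidate contexts around "earth" (index 2)


def toy_weight(k, n):
    # Illustrative only: content words count 1.0, a few function words count 0.0.
    return 0.0 if words[n].lower() in {"and", "the", "was"} else 1.0


start, end = best_context(spans, key_index=2, weight=toy_weight)
print(" ".join(words[start:end]))
```

With the toy weights, the span containing the content word "without" wins over the span padded with function words.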

Algorithm: displaying the chosen context

Demo

Results
Measure | ESV training set | ESV test set | Maisie test set
Algorithm matches user selection ($A_0$) | 67.8% | 62.5% | 47.8%
Expected algorithm matches if selections were random from a uniform distribution ($A_e$) | 27.4% | 25.5% | 21.9%
$S = \frac{A_0 - A_e}{1 - A_e}$ | 0.556 | 0.497 | 0.332
Inter-annotator agreement ($A_0$) | N/A | 65.8% | 53.0%
Expected inter-annotator agreement if selections were random from a uniform distribution ($A_e$) | N/A | 27.0% | 23.5%
$S = \frac{A_0 - A_e}{1 - A_e}$ | N/A | 0.532 | 0.386
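The chance-corrected score S follows directly from the two agreement rates. As a quick arithmetic check, this short Python sketch recomputes S from the percentages reported above (the function name is illustrative):

```python
def chance_corrected_agreement(observed, expected):
    """S = (A_0 - A_e) / (1 - A_e), with both agreement rates given as fractions."""
    return (observed - expected) / (1 - expected)


reported = {
    "Algorithm vs. user, ESV training set": (0.678, 0.274),
    "Algorithm vs. user, ESV test set": (0.625, 0.255),
    "Algorithm vs. user, Maisie test set": (0.478, 0.219),
    "Inter-annotator, ESV test set": (0.658, 0.270),
    "Inter-annotator, Maisie test set": (0.530, 0.235),
}
scores = {name: round(chance_corrected_agreement(a0, ae), 3)
          for name, (a0, ae) in reported.items()}
for name, s in scores.items():
    print(f"{name}: S = {s}")
```

The recomputed values match the reported ones (0.556, 0.497, 0.332, 0.532, 0.386).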

Conclusion
• Using these techniques, we can marry the benefits of print and digital!
  • Context chosen for maximum understanding of the key word’s context given the space constraints
  • Ability to present search results for any key word almost instantaneously
  • Ability to allow for multiple fonts and font sizes
  • Ability to present well-chosen context for many texts
• As statistical parsers improve and are extended to more languages, this technique will improve and can be applied more broadly

Possible future improvements
• Get more training data from more annotators
• Allow best context not to use all of the available space?
• Allow for ellipses?
• Handle multiple key words?

Reference
• Drayton C. Benner. “Marrying the Benefits of Print and Digital.” Proceedings of Digital Humanities 2014. http://dharchive.org/paper/DH2014/Paper-845.xml

Acknowledgements
• James Covington
  • Annotator for the training set and both test sets
• Rodelle Williams and D. Chris Benner
  • Annotators for both test sets
• Humphrey H. Hardy
  • Annotator for the ESV test set
• Samuel L. Boyd
  • Annotator for the Maisie test set
