Text into Data
Prof Alvarado
MDST 3705
19 February 2013
Business
• Quiz 1 graded
  – Let me know if you have questions
• Readings
  – Apologies for mis-posting!
Review
• Last week, we took a 30,000-foot view of the use of databases in the digital humanities
  – We found that databases are everywhere
• Databases form the foundation of all projects
  – Even if a database management system is not used
• Relational databases are sophisticated and mature choices for foundations
Overview
• We began this course by looking at code as language
  – Code structured like natural language
  – Code implies, models, and creates a world
• We then looked at the opposite process: looking at language, and the products of culture, as code
  – We called this “reverse engineering”
• Today we continue this and look specifically at text
Competing Approaches
• A common approach to modeling text is to use XML
  – XML is like HTML, but more general
  – It allows you to mark up a text
• XML assumes a text is like a tree
  – An “ordered hierarchy of content objects”
• XML was also specifically designed to work with text
XML looks like this. Notice how the element names reference units, not layout or style.
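To make the tree idea concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The markup, the element names (poem, stanza, line), and the sample lines are invented for illustration and are not taken from the slide image or from any real edition.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names describe units of text
# (poem, stanza, line), not how the text should look on the page.
SAMPLE = """
<poem title="Sample">
  <stanza n="1">
    <line n="1">the knight rode out at dawn</line>
    <line n="2">at dawn the knight returned</line>
  </stanza>
</poem>
"""

root = ET.fromstring(SAMPLE)


def walk(element, depth=0):
    """Return (depth, tag) pairs in document order, exposing the tree shape."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(walk(child, depth + 1))
    return rows


tree_rows = walk(root)
for depth, tag in tree_rows:
    print("  " * depth + tag)
```

Every element sits inside exactly one parent, which is what "ordered hierarchy of content objects" means in practice.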
XML turns out to be very useful for defining the physical or logical structure of a text, but not for figures and meanings. Texts are actually more like networks.
This image shows three "figures" in the text of an Old French poem. Note how they do not "nest" neatly into the structure of the text, but instead cross-cut it. It is hard to model this kind of data with XML.
Relational databases are a better choice for this since they are more abstract. The problem is, what data model to use? How do you model text in a relational database?
Liu and Smith argue for a radical model, in which text is parsed at the word level. Each word gets its own record.
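A word-per-record model can be sketched with Python's built-in sqlite3 module. This is only an illustration of the general idea, not Liu and Smith's actual schema: the table name word, its columns, and the two-line sample text are all assumptions.

```python
import sqlite3

# Hypothetical two-line sample text, invented for illustration.
TEXT = "the knight rode out at dawn\nat dawn the knight returned"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE word (
        word_id  INTEGER PRIMARY KEY,  -- position in the whole text
        line_no  INTEGER NOT NULL,     -- which line the word belongs to
        word_pos INTEGER NOT NULL,     -- position within that line
        form     TEXT NOT NULL         -- the word as written
    )
""")

# Parse the text at the word level: one row per word occurrence.
rows = []
word_id = 0
for line_no, line in enumerate(TEXT.splitlines(), start=1):
    for word_pos, form in enumerate(line.split(), start=1):
        word_id += 1
        rows.append((word_id, line_no, word_pos, form))
conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?)", rows)

word_count = conn.execute("SELECT COUNT(*) FROM word").fetchone()[0]
print(word_count)
```

Because each occurrence is its own row, anything else (grammar, figures, images) can later point at individual words by word_id.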
The Princeton Charrette Project used a database-driven application called Figura. It was designed to represent the critical edition of an Old French poem along with the figural annotations of the text made by scholars. A “figure” is a figure of speech or rhetorical device, like rhyming or the use of chiasmus.
The database stored information about grammar, manuscript images, figures, and other data that had been accumulated over the years prior to building the database.
At the heart of the database is the text model that links figures to text.
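One way to link figures to text is a many-to-many table between figures and words. The sketch below is an assumption about how such a link could look, not the actual Figura schema; the tables word, figure, and figure_word and the sample lines are hypothetical. It shows a single figure touching words in two different lines, which a strict tree cannot express directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word (word_id INTEGER PRIMARY KEY, line_no INTEGER, form TEXT);
    CREATE TABLE figure (figure_id INTEGER PRIMARY KEY, figure_type TEXT);
    -- Link table: a figure may cover many words, a word may sit in many figures.
    CREATE TABLE figure_word (
        figure_id INTEGER REFERENCES figure(figure_id),
        word_id   INTEGER REFERENCES word(word_id),
        PRIMARY KEY (figure_id, word_id)
    );
""")

# Hypothetical sample lines; word_id is assigned 1, 2, 3, ... in reading order.
lines = ["the knight rode out at dawn", "at dawn the knight returned"]
words = [(n, form) for n, line in enumerate(lines, 1) for form in line.split()]
conn.executemany("INSERT INTO word (line_no, form) VALUES (?, ?)", words)

# One chiasmus (knight ... dawn / dawn ... knight) that crosses the line break.
conn.execute("INSERT INTO figure VALUES (1, 'chiasmus')")
conn.executemany("INSERT INTO figure_word VALUES (1, ?)", [(2,), (6,), (8,), (10,)])

forms = [r[0] for r in conn.execute("""
    SELECT w.form FROM figure_word fw
    JOIN word w ON w.word_id = fw.word_id
    WHERE fw.figure_id = 1 ORDER BY w.word_id
""")]
spanned_lines = [r[0] for r in conn.execute("""
    SELECT DISTINCT w.line_no FROM figure_word fw
    JOIN word w ON w.word_id = fw.word_id
    WHERE fw.figure_id = 1 ORDER BY w.line_no
""")]
print(forms, spanned_lines)
```

The figure is stored independently of the line structure, so overlapping or cross-cutting figures need no special handling.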
In my model and in Liu & Smith’s, the text becomes a database. The readable text is just a query, as are the index, table of contents, etc.
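The claim that the readable text is "just a query" can be sketched as follows. The word table and sample lines are hypothetical; the point is only that both the reading text and a simple word index come out of the same rows through different queries.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word (word_id INTEGER PRIMARY KEY, line_no INTEGER, form TEXT)"
)

# Hypothetical sample text, stored one word per row in reading order.
source = ["the knight rode out at dawn", "at dawn the knight returned"]
conn.executemany(
    "INSERT INTO word (line_no, form) VALUES (?, ?)",
    [(n, form) for n, line in enumerate(source, 1) for form in line.split()],
)

# Query 1: the readable text, rebuilt by reading words back in order.
rows = conn.execute("SELECT line_no, form FROM word ORDER BY word_id").fetchall()
readable = [
    " ".join(form for _, form in group)
    for _, group in groupby(rows, key=lambda r: r[0])
]

# Query 2: a simple index, i.e. each distinct word with its frequency.
index = conn.execute(
    "SELECT form, COUNT(*) FROM word GROUP BY form ORDER BY form"
).fetchall()

print("\n".join(readable))
print(index)
```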
The database of words and figures can be read by a program to generate a visually rich and interactive edition on the web
But it can also be used to discover patterns in the text not visible to the reader. It can help us discover the cultural patterns that are “delivered” by the text to our brains.
The results of a query showing the relationship between proper nouns (agents) and figure types
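A query of this kind might look like the sketch below. The schema, the part-of-speech tag proper_noun, the placeholder names AgentA and AgentB, and all counts are invented toy data; they do not reproduce the project's results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word (word_id INTEGER PRIMARY KEY, form TEXT, pos TEXT);
    CREATE TABLE figure (figure_id INTEGER PRIMARY KEY, figure_type TEXT);
    CREATE TABLE figure_word (figure_id INTEGER, word_id INTEGER);
""")

# Toy data only: placeholder agents, not names or counts from the poem.
conn.executemany("INSERT INTO word VALUES (?, ?, ?)", [
    (1, "AgentA", "proper_noun"),
    (2, "rides", "verb"),
    (3, "AgentB", "proper_noun"),
    (4, "AgentA", "proper_noun"),
])
conn.executemany("INSERT INTO figure VALUES (?, ?)", [(1, "chiasmus"), (2, "rhyme")])
conn.executemany("INSERT INTO figure_word VALUES (?, ?)", [
    (1, 1), (1, 4),  # the chiasmus touches both AgentA occurrences
    (2, 3), (2, 4),  # the rhyme pairs AgentB with the second AgentA
])

# Count how often each proper noun takes part in each type of figure.
crosstab = conn.execute("""
    SELECT w.form, f.figure_type, COUNT(*) AS n
    FROM word w
    JOIN figure_word fw ON fw.word_id = w.word_id
    JOIN figure f ON f.figure_id = fw.figure_id
    WHERE w.pos = 'proper_noun'
    GROUP BY w.form, f.figure_type
    ORDER BY w.form, f.figure_type
""").fetchall()

for agent, figure_type, n in crosstab:
    print(agent, figure_type, n)
```

The pattern emerges from the joins and the GROUP BY; no reader would tally these co-occurrences by eye.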
Form and content are interwoven, each reinforcing the other. Form, the delivery system, is used to transmit the meaningful content, the stuff that remains in your brain after reading or hearing the story.
This is a "hypergraph" of the same data, also easily generated from the database by code.
Text is like this: http://anthonyflo.tumblr.com/post/7590868323/photographer-and-self-described-geek-of-maps