Mdst3705 2013-02-19-text-into-data



    1. 1. Text into Data. Prof. Alvarado, MDST 3705, 19 February 2013
    2. 2. Business
        • Quiz 1 graded
            – Let me know if you have questions
        • Readings
            – Apologies for mis-posting!
    3. 3. Review
        • Last week, we took a 30,000-foot view of the use of databases in the digital humanities
            – We found that databases are everywhere
        • Databases form the foundation of all projects
            – Even if a database management system is not used
        • Relational databases are sophisticated and mature choices for foundations
    4. 4. Overview
        • We began this course by looking at code as language
            – Code structured like natural language
            – Code implies, models, and creates a world
        • We then looked at the opposite process
            – Looking at language, and the products of culture, as code
            – We called this “reverse engineering”
        • Today we continue this and look specifically at text
    5. 5. What do you remember when you read a book?
    6. 6. We remember scenes, images, plot lines, values, etc. We sometimes remember verbatim passages. We don’t normally remember the words.
    7. 7. We get much of our culture through books (and other "cultural models," in Colby's words)
    8. 8. Like cigarettes, books are a delivery mechanism (not of nicotine, but of culture)
    9. 9. Colby's theory (diagram labels: CULTURE, TEXTS)
    10. 10. If texts contain cultural meanings . . . How do we get to them? How do we represent them?
    11. 11. Models of Text
    12. 12. Competing Approaches
        • A common approach to modeling text is to use XML
            – XML is like HTML, but more general
            – It allows you to mark up a text
        • XML assumes a text is like a tree
            – An “ordered hierarchy of content objects”
        • XML was also specifically designed to work with text
    13. 13. XML looks like this. Notice how the element names reference units, not layout or style.
    14. 14. Text as Tree
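
To make the "text as tree" idea concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The markup is invented for illustration: the element names (`poem`, `stanza`, `line`) and the verse are placeholders, not the schema or text shown on the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names and content are illustrative only.
SAMPLE = """
<poem>
  <stanza n="1">
    <line n="1">first line of verse</line>
    <line n="2">second line of verse</line>
  </stanza>
  <stanza n="2">
    <line n="3">third line of verse</line>
  </stanza>
</poem>
""".strip()

def outline(element, depth=0):
    """Return (depth, tag, n) for every element, in document order."""
    rows = [(depth, element.tag, element.get("n"))]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(SAMPLE)
tree_rows = outline(root)

# Print an indented outline: every element nests inside exactly one parent.
for depth, tag, n in tree_rows:
    print("  " * depth + tag + (f" n={n}" if n else ""))
```

Every element has exactly one parent, which is what "ordered hierarchy of content objects" amounts to in practice.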
    15. 15. XML turns out to be very useful for defining the physical or logical structure of a text, but not for figures and meanings. Texts are actually more like networks.
    16. 16. This image shows three "figures" in the text of an Old French poem. Note how they do not "nest" neatly into the structure of the text, but instead cross-cut it. It is hard to model this kind of data with XML.
    17. 17. Relational databases are a better choice for this since they are more abstract. The problem is, what data model to use? How do you model text in a relational database?
    18. 18. Liu and Smith argue for a radical model, in which text is parsed at the word level. Each word gets its own record.
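
A word-level model of this kind can be sketched with Python's built-in `sqlite3`. The table name `word`, its columns, and the two sample lines are assumptions made for this example; they are not Liu and Smith's actual schema or data.

```python
import re
import sqlite3

# Placeholder text; the table and column names are assumptions for this sketch.
LINES = ["the knight rides to the bridge", "the bridge is a sword"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word ("
    "word_id INTEGER PRIMARY KEY, line_no INTEGER, position INTEGER, token TEXT)"
)

# One record per word, keeping its line number and position within the line.
for line_no, line in enumerate(LINES, start=1):
    for position, token in enumerate(re.findall(r"\w+", line), start=1):
        conn.execute(
            "INSERT INTO word (line_no, position, token) VALUES (?, ?, ?)",
            (line_no, position, token),
        )

word_count = conn.execute("SELECT COUNT(*) FROM word").fetchone()[0]
the_count = conn.execute(
    "SELECT COUNT(*) FROM word WHERE token = ?", ("the",)
).fetchone()[0]
print(word_count, the_count)
```

Once each word is a row, questions such as word frequency become ordinary queries rather than text-processing tasks.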
    19. 19. The Princeton Charrette Project used a database-driven application called Figura. It was designed to represent the critical edition of an Old French poem along with the figural annotations of the text made by scholars. A “figure” is a figure of speech or rhetorical device, like rhyming or the use of chiasmus.
    20. 20. The database stored information about grammar, manuscript images, figures, and other data that had been accumulated over the years prior to building the database
    21. 21. At the heart of the database is the text model that links figures to text
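
One way to link figures to text in a relational model is a join table between figures and words. The sketch below is an assumption about what such a model could look like, not the Figura schema: the tables `word`, `figure`, and `figure_word` and all rows are invented. It shows two figures that overlap each other and span a line boundary, which a strict tree cannot express directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (word_id INTEGER PRIMARY KEY, line_no INTEGER, token TEXT);
CREATE TABLE figure (figure_id INTEGER PRIMARY KEY, figure_type TEXT);
-- Many-to-many link: a figure covers many words, a word may sit in many figures.
CREATE TABLE figure_word (
    figure_id INTEGER REFERENCES figure(figure_id),
    word_id INTEGER REFERENCES word(word_id),
    PRIMARY KEY (figure_id, word_id)
);
""")

# Invented two-line text; word_id values are assigned 1..6 in insertion order.
conn.executemany(
    "INSERT INTO word (line_no, token) VALUES (?, ?)",
    [(1, "the"), (1, "knight"), (1, "rides"),
     (2, "rides"), (2, "the"), (2, "knight")],
)
conn.executemany(
    "INSERT INTO figure (figure_id, figure_type) VALUES (?, ?)",
    [(1, "chiasmus"), (2, "repetition")],
)
# Figure 1 covers words 2-5 (crossing the line break); figure 2 covers words 3-4.
conn.executemany(
    "INSERT INTO figure_word (figure_id, word_id) VALUES (?, ?)",
    [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4)],
)

lines_per_figure = conn.execute("""
SELECT f.figure_type, COUNT(DISTINCT w.line_no)
FROM figure f
JOIN figure_word fw ON fw.figure_id = f.figure_id
JOIN word w ON w.word_id = fw.word_id
GROUP BY f.figure_id
ORDER BY f.figure_id
""").fetchall()
print(lines_per_figure)
```

Because figures point at words rather than wrapping them, nothing requires a figure to fit inside a single line or inside another figure.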
    22. 22. In my model and in Liu & Smith’s, the text becomes a database. The readable text is just a query, as is the index, table of contents, etc.
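
The claim that the readable text and the index are both "just queries" can be illustrated as follows. This is a sketch over an assumed `word` table with invented rows, deliberately inserted out of order so that the ordering comes from the query and not from storage.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word (line_no INTEGER, position INTEGER, token TEXT)")
# Invented rows, stored out of reading order on purpose.
conn.executemany(
    "INSERT INTO word VALUES (?, ?, ?)",
    [(2, 1, "the"), (1, 2, "knight"), (1, 1, "the"),
     (2, 2, "bridge"), (1, 3, "rides")],
)

# Query 1: the reading text, recovered by sorting on line and position.
ordered = conn.execute(
    "SELECT line_no, token FROM word ORDER BY line_no, position"
).fetchall()
reading_text = [
    " ".join(token for _, token in group)
    for _, group in groupby(ordered, key=lambda row: row[0])
]

# Query 2: an index, i.e. each distinct word with the lines it occurs on.
index = {}
for token, line_no in conn.execute(
    "SELECT token, line_no FROM word GROUP BY token, line_no "
    "ORDER BY token, line_no"
):
    index.setdefault(token, []).append(line_no)

print(reading_text)
print(index)
```

Both views are derived from the same rows; neither is stored as a document.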
    23. 23. The database of words and figures can be read by a program to generate a visually rich and interactive edition on the web
    24. 24. But it can also be used to discover patterns in the text not visible to the reader. It can help us discover the cultural patterns that are “delivered” by the text to our brains.
    25. 25. The results of a query showing the relationship between proper nouns (agents) and figure types
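
A query of this general shape could be written as a join plus a `GROUP BY`. The sketch below uses invented tables and rows: `AgentA` and `AgentB` are placeholder names, and the `pos` column is an assumed way of marking proper nouns. None of it reproduces the actual query or results from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (word_id INTEGER PRIMARY KEY, token TEXT, pos TEXT);
CREATE TABLE figure (figure_id INTEGER PRIMARY KEY, figure_type TEXT);
CREATE TABLE figure_word (figure_id INTEGER, word_id INTEGER);
""")
# All rows below are made up for illustration.
conn.executemany("INSERT INTO word VALUES (?, ?, ?)", [
    (1, "AgentA", "proper_noun"),
    (2, "rides", "verb"),
    (3, "AgentB", "proper_noun"),
    (4, "AgentA", "proper_noun"),
])
conn.executemany("INSERT INTO figure VALUES (?, ?)", [
    (1, "chiasmus"), (2, "rhyme"), (3, "chiasmus"),
])
conn.executemany("INSERT INTO figure_word VALUES (?, ?)", [
    (1, 1), (1, 2), (2, 3), (3, 4), (2, 4),
])

# Count how often each proper noun occurs inside each type of figure.
crosstab = conn.execute("""
SELECT w.token, f.figure_type, COUNT(*) AS n
FROM word w
JOIN figure_word fw ON fw.word_id = w.word_id
JOIN figure f ON f.figure_id = fw.figure_id
WHERE w.pos = 'proper_noun'
GROUP BY w.token, f.figure_type
ORDER BY w.token, f.figure_type
""").fetchall()

for token, figure_type, n in crosstab:
    print(token, figure_type, n)
```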
    26. 26. A structural reading of the data
    27. 27. Form and content are interwoven, each reinforcing the other. Form – the delivery system – is used to transmit the meaningful content, the stuff that remains in your brain after reading or hearing the story.
    28. 28. This is a "hypergraph" of the same data, also easily generated from the database by code
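
The slide does not say how the hypergraph was built. Under one common reading, each figure is a hyperedge that connects every node (agent or figure type) it touches. The sketch below follows that assumption, with invented membership pairs standing in for rows pulled from the database.

```python
from collections import defaultdict

# Invented (figure_id, node) pairs, standing in for rows of a join query.
memberships = [
    (1, "AgentA"), (1, "AgentB"), (1, "chiasmus"),
    (2, "AgentA"), (2, "rhyme"),
]

# A hyperedge is a set of nodes; unlike an ordinary edge it can join more than two.
hyperedges = defaultdict(set)
for figure_id, node in memberships:
    hyperedges[figure_id].add(node)

# Incidence view: for each node, the hyperedges it belongs to.
incidence = defaultdict(set)
for figure_id, nodes in hyperedges.items():
    for node in nodes:
        incidence[node].add(figure_id)

for figure_id in sorted(hyperedges):
    print(figure_id, sorted(hyperedges[figure_id]))
```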
    29. 29. Text is like this
    30. 30. A text is a signal. Culture is a transmitter.