Text into Data

  Prof Alvarado
   MDST 3705
19 February 2013
Business
• Quiz 1 graded
  – Let me know if you have questions
• Readings
  – Apologies for mis-posting!
Review
• Last week, we took a 30,000-foot view of
  the use of databases in the digital
  humanities
  – We found that databases are everywhere
• Databases form the foundation of all
  projects
  – Even if a database management system is
    not used
• Relational databases are a sophisticated
  and mature choice for that foundation
Overview
• We began this course by looking at code
  as language
  – Code structured like natural language
  – Code implies, models, and creates a world
• We then looked at the opposite process –
  looking at language, and the products of
  culture, as code
  – We called this “reverse engineering”
• Today we continue this and look
  specifically at text
What do you remember when
     you read a book?
We remember scenes, images, plot lines,
            values, etc.

  We sometimes remember verbatim
            passages

We don’t normally remember the words
We get much of our
culture through books
(and other "cultural
models" in Colby's
words)
Like cigarettes,
books are a delivery
mechanism

(not of nicotine, but
of culture)
Colby's theory
[Diagram labels: CULTURE, TEXTS]
If texts contain cultural meanings . . .

      How do we get to them?

    How do we represent them?
Models of Text
Competing Approaches
• A common approach to modeling text is to
  use XML
  – XML is like HTML, but more general
  – It allows you to mark up a text
• XML assumes a text is like a tree
  – An “ordered hierarchy of content objects”
• XML was also specifically designed to
  work with text
XML looks like this




Notice how the element names reference units, not layout or style
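The slide's own XML sample has not survived here, so the following is only a minimal sketch of the idea. The markup and its element names (poem, stanza, line) are invented for illustration and do not come from any particular schema. Parsing it with Python's standard library shows the "ordered hierarchy" directly: every element sits inside exactly one parent.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: the element names (poem, stanza, line) are
# illustrative only and name units of the text, not layout or style.
SAMPLE = """
<poem>
  <stanza n="1">
    <line n="1">The knight rode out at dawn</line>
    <line n="2">and came upon a bridge</line>
  </stanza>
  <stanza n="2">
    <line n="3">No one had crossed it twice</line>
  </stanza>
</poem>
"""

def outline(element, depth=0):
    """Return the element names in document order, indented by nesting depth."""
    rows = ["  " * depth + element.tag]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(SAMPLE)
tree_outline = outline(root)
print("\n".join(tree_outline))
```

The printed outline is the tree that XML assumes the text to be.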
Text as Tree
XML turns out to be very useful for
    defining the physical or logical
structure of a text, but not for figures
            and meanings

Texts are actually more like networks
This image shows three
"figures" in the text of
an Old French poem.
Note how they do not
"nest" neatly into the
structure of the text, but
instead cross-cut it.

It is hard to model this
kind of data with XML.
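One way to see the problem is to treat every unit as a span of word positions. This is a rough sketch with invented positions, not data from the poem. A tree can hold two spans only if they are disjoint or one contains the other; a figure that starts in one verse line and ends in the next fails that test for both lines.

```python
from typing import NamedTuple

class Span(NamedTuple):
    label: str
    start: int  # index of the first word covered
    end: int    # index of the last word covered (inclusive)

def properly_nested(a: Span, b: Span) -> bool:
    """True if the two spans are disjoint or one fully contains the other."""
    disjoint = a.end < b.start or b.end < a.start
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return disjoint or a_in_b or b_in_a

# Invented positions: two verse lines of five words each, and one figure
# that begins in the first line and ends in the second.
line_1 = Span("line 1", 0, 4)
line_2 = Span("line 2", 5, 9)
figure = Span("figure", 3, 6)

# Collect the structural units that the figure cross-cuts.
crossings = [s.label for s in (line_1, line_2) if not properly_nested(figure, s)]
print(crossings)
```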
Relational databases are a better choice
 for this since they are more abstract

The problem is, what data model to use?

 How do you model text in a relational
            database?
Liu and Smith argue for a
radical model, in which text
is parsed at the word level

Each word gets its own
record
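A word-level model can be sketched with SQLite. The table and column names below (word, line_no, position, form) are assumptions made for illustration, not the schema used by Liu and Smith or by Figura, and the sample lines are invented.

```python
import sqlite3

# Hypothetical schema: one row per word, located by line and position.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE word (
        word_id  INTEGER PRIMARY KEY,
        line_no  INTEGER NOT NULL,
        position INTEGER NOT NULL,
        form     TEXT NOT NULL
    )
""")

def load_text(conn, lines):
    """Split each line on whitespace and store every word as its own record."""
    rows = [
        (line_no, position, form)
        for line_no, line in enumerate(lines, start=1)
        for position, form in enumerate(line.split(), start=1)
    ]
    conn.executemany(
        "INSERT INTO word (line_no, position, form) VALUES (?, ?, ?)", rows
    )
    return len(rows)

# Invented sample text.
sample_lines = ["the knight rode out", "and came upon a bridge"]
word_count = load_text(conn, sample_lines)

# Any word can now be addressed by its coordinates in the text.
last_word = conn.execute(
    "SELECT form FROM word WHERE line_no = 2 AND position = 5"
).fetchone()[0]
print(word_count, last_word)
```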
The Princeton Charrette Project used a
database-driven application called Figura

 It was designed to represent the critical
edition of an Old French poem along with
the figural annotations of the text made
               by scholars

    A “figure” is a figure of speech or
rhetorical device, like rhyming or the use
                of chiasmus
The database stored information about
grammar, manuscript images, figures, and
  other data that had been accumulated
    over the years prior to building the
                 database
At the heart of the
database is the text model
that links figures to text
In my model and in Liu & Smith’s, the text
         becomes a database

    The readable text is just a query

  As is the index, table of contents, etc.
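As a rough illustration of "the readable text is just a query", the sketch below rebuilds reading lines and a simple word index from the same table of word records. The schema and the sample words are invented for the example.

```python
import sqlite3
from itertools import groupby

# Hypothetical word-level table with a handful of invented records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word (line_no INTEGER, position INTEGER, form TEXT)")
conn.executemany(
    "INSERT INTO word VALUES (?, ?, ?)",
    [
        (1, 1, "the"), (1, 2, "knight"), (1, 3, "rode"),
        (2, 1, "the"), (2, 2, "bridge"), (2, 3, "held"),
    ],
)

def reading_text(conn):
    """Reassemble the text, one string per line, from ordered word records."""
    rows = conn.execute("SELECT line_no, form FROM word ORDER BY line_no, position")
    return [
        " ".join(form for _, form in group)
        for _, group in groupby(rows, key=lambda row: row[0])
    ]

def index_of_forms(conn):
    """Build an index mapping each word form to the lines where it occurs."""
    rows = conn.execute(
        "SELECT DISTINCT form, line_no FROM word ORDER BY form, line_no"
    )
    return {
        form: [line_no for _, line_no in group]
        for form, group in groupby(rows, key=lambda row: row[0])
    }

text_lines = reading_text(conn)
word_index = index_of_forms(conn)
print(text_lines)
print(word_index)
```

Both outputs are views of one table: changing the query changes the "edition" without touching the stored words.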
The database of words and figures
   can be read by a program to
   generate a visually rich and
  interactive edition on the web
But it can also be used to discover
patterns in the text not visible to the
               reader

It can help us discover the cultural
patterns that are “delivered” by the
         text to our brains
The results of a query
showing the relationship
between proper nouns
(agents) and figure types
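A query of this kind might look like the following sketch. The three tables, their columns, the agent names, and the counts are all invented for illustration; they are not the Figura schema or its results. The link table figure_word is what lets a figure cover any set of words, regardless of line boundaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: words, figures, and a many-to-many link table.
    CREATE TABLE word (word_id INTEGER PRIMARY KEY, form TEXT, is_proper_noun INTEGER);
    CREATE TABLE figure (figure_id INTEGER PRIMARY KEY, figure_type TEXT);
    CREATE TABLE figure_word (figure_id INTEGER, word_id INTEGER);

    -- Invented sample records.
    INSERT INTO word VALUES (1, 'AgentA', 1), (2, 'rode', 0), (3, 'AgentB', 1), (4, 'AgentA', 1);
    INSERT INTO figure VALUES (10, 'chiasmus'), (11, 'rhyme'), (12, 'chiasmus');
    INSERT INTO figure_word VALUES (10, 1), (10, 2), (11, 3), (12, 4), (11, 4);
""")

# Count how often each proper noun takes part in each type of figure.
QUERY = """
    SELECT w.form, f.figure_type, COUNT(*) AS occurrences
    FROM word AS w
    JOIN figure_word AS fw ON fw.word_id = w.word_id
    JOIN figure AS f ON f.figure_id = fw.figure_id
    WHERE w.is_proper_noun = 1
    GROUP BY w.form, f.figure_type
    ORDER BY w.form, f.figure_type
"""

agent_figures = conn.execute(QUERY).fetchall()
for form, figure_type, occurrences in agent_figures:
    print(form, figure_type, occurrences)
```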
A structural reading of the data
Form and content are interwoven, each
        reinforcing the other

Form – the delivery system – is used to
 transmit the meaningful content, the
   stuff that remains in your brain after
        reading or hearing the story
This is a "hypergraph" of
the same data, also easily
        generated from the
         database by code
Text is like this




http://anthonyflo.tumblr.com/post/7590868323/photographer-and-self-described-geek-of-maps
A text is a signal

Culture is a transmitter
