Mdst3703 2013-09-17-text-models


Published in: Technology
  • Old French illuminated manuscript. What does the image mean?
  • TS Eliot, the Wasteland – note use of line breaks; what do they mean?
  • A critical edition of Jane Austen’s Persuasion
  • A dictionary entry …
  • (theoretically)

    1. 1. Text Models and Markup Prof. Alvarado MDST 3703 17 September 2013
    2. 2. Business • Plan B: If Home Directory is not working for you, please use the Hive – Go to – Install VMWare Client – Use Notepad++ – Home Directory link your Desktop (also as J drive) • Tutorials – If you feel lost about HTML let me know
    3. 3. Review 1: Textual Signals • Each of the authors last week viewed the text as a kind of signal • A signal is a pattern that contains messages • Messages can be grasped through parsing the signal • What were the messages? How were they parsed?
    4. 4. Text can be viewed as a long signal consisting of characters selected from a common set of characters
    5. 5. A model of communication. Messages get converted into signals and back into messages by means of a shared code. ENCODING DECODING SHARED CODE Person 1 Person 2
    6. 6. Author | Parsed elements | Decoded message
          Levi-Strauss | Relations and bundles | Structural oppositions
          Colby | Thesaurus words | Thematic patterns
          Ramsay | Scenes | Genres
    7. 7. Text is like this. This is a map of DC generated by thousands of individual Flickr and Twitter events. The picture is a kind of signal—collective and unconscious, yet meaningful. The patterns discerned from the signals are not intentional, but they are the products of intentional activity. [Text is like this]
    8. 8. Review 2: Semantic HTML • Also called POSH (“Plain Old Semantic HTML”) • The use of HTML to describe a text, not to format it (CSS is used to format) • DIV, SPAN, CLASS, and ID are general purpose tools to provide more flexible markup • What kinds of things can POSH be used to describe?
    9. 9. Segue: Semantic markup may be used to support the analysis of each of our authors, including Aristotle. Aristotle: Elements of drama, elements of plot <div class="plot-element" id="reversal-of-fortune"> ... </div> Levi-Strauss: Relations and bundles in myths <span class="relation"> ... </span> Colby: Theme words in folktales <span class="antagonism">fight</span> Ramsay: Scenes in plays <div class="scene"> ... </div>
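
Once theme words are tagged this way, a program can pull them back out by class name. The following is a minimal sketch using Python's standard `html.parser`; the folktale fragment and the `ClassCollector` helper are invented for illustration, and the sketch assumes every opened tag is also closed.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect the text inside every element, grouped by its class attribute."""

    def __init__(self):
        super().__init__()
        self.open_classes = []  # class (or None) of each currently open element
        self.found = {}         # class name -> list of text snippets

    def handle_starttag(self, tag, attrs):
        self.open_classes.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self.open_classes:
            self.open_classes.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to every enclosing element that carries a class.
        for cls in self.open_classes:
            if cls:
                self.found.setdefault(cls, []).append(text)


# Hypothetical folktale fragment, marked up in the style of the slide.
SAMPLE = (
    '<div class="scene">'
    '<p>The brothers <span class="antagonism">fight</span> over the herd, '
    'then <span class="antagonism">quarrel</span> again at the river.</p>'
    '</div>'
)

collector = ClassCollector()
collector.feed(SAMPLE)
theme_words = collector.found.get("antagonism", [])
print(theme_words)
```

The markup carries the analysis: the parser knows nothing about folktales, only that some spans were labeled "antagonism".
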
    10. 10. Let’s step back and look more closely at “text” Let’s look at some material examples
    11. 11. page o’ text Real world text comes packaged in documents
    12. 12. How is text conveyed in a document? A document is a material artifact— a medium with which to convey a signal
    13. 13. What is text?
    14. 14. Visual Signifiers • Small caps • Indentation • Alignment • Italics • Space All used to signify elements of text
    15. 15. Other examples
    16. 16. [Charrette]
    17. 17. [The Wasteland]
    18. 18. [Critical Edition]
    19. 19. [OED]
    20. 20. Documents have three Levels: Structure, Content, Style Structure The organization of content into units (elements) and logical relationships (e.g. reading order) Content TEXT, images, video clips, etc. Style Screen and print layout Fonts, colors, etc.
    21. 21. Descriptive markup languages allow us to define structure of documents for computational purposes Theoretically, they do not specify layout or content
    22. 22. [PDF, Procedural Markup] In contrast to procedural markup like PDF
    23. 23. So, how are documents structured?
    24. 24. Hierarchically … (theoretically)
    25. 25. Document Elements and Structures
          Play
            – Act +
              • Scene +
                – Line +
          Book
            – Chapter +
              • Verse +
          Letter
            – Heading
              • Return Address
              • Date
              • Recipient Info
                – Name
                – Title
                – Address
            – Content
              • Salutation
              • Paragraph +
              • Closing
    26. 26. These are all “trees”
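
To see the "tree" claim concretely, the play structure can be written as nested elements and walked recursively. This is only a sketch: the miniature play and the `outline` helper are invented here, with element names taken from the outline on the previous slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play; element names follow the slide's outline.
PLAY = """
<play>
  <act>
    <scene><line>First line</line><line>Second line</line></scene>
    <scene><line>Third line</line></scene>
  </act>
  <act>
    <scene><line>Fourth line</line></scene>
  </act>
</play>
"""


def outline(element, depth=0):
    """Return one (depth, tag) pair per element, in document order."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(PLAY)
rows = outline(root)
max_depth = max(depth for depth, _ in rows)
line_count = sum(1 for _, tag in rows if tag == "line")

# Print the tree as an indented outline.
for depth, tag in rows:
    print("  " * depth + tag)
```

Every element has exactly one parent and sits at a definite depth, which is what makes the structure a tree.
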
    27. 27. XML is a markup language It is a more powerful system for semantic markup than POSH
    28. 28. What is XML? • Stands for eXtensible Markup Language – Actually invented after the web – A simplification of SGML (Standard Generalized Markup Language), the language used to create HTML – It specifies a set of rules for creating specialized markup languages such as HTML and TEI • SGML was invented in the early 1970s to wrest the control of documents from computer people who were taking over industries like law and accounting
    29. 29. XML looks like this Notice how the element names reference units, not layout or style
    30. 30. Also markup for “in-line” elements
    31. 31. XML Premises 1. All documents are comprised of elements. 2. Elements contain content. 3. Elements have no layout. 4. Elements are hierarchically ordered. 5. Elements are to be indicated by “markup” – tags that define the beginning and end of an element
    32. 32. XML Markup Rules • Tags signify structural elements • Three kinds of tag – Start and end, e.g. <p> and </p> – Singleton, e.g. <br /> • Start and singleton tags can have attributes – Simple key/value pairs – <div class="stanza" style="color:red;"> • Basic rules – All attributes must be quoted – All tags must nest (no overlaps!)
    33. 33. Documents in XML that meet these rules are “well formed”
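
Any XML parser will enforce these rules. As a rough sketch, Python's standard `xml.etree.ElementTree` raises `ParseError` on markup that is not well formed; the `is_well_formed` helper and the three sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


nested = '<div class="stanza"><p>one<br /></p></div>'
overlapping = '<div><p>one</div></p>'     # tags overlap instead of nesting
unquoted = '<div class=stanza>one</div>'  # attribute value is not quoted

for sample in (nested, overlapping, unquoted):
    print(is_well_formed(sample), sample)
```

Note that well-formedness says nothing about which element names are allowed; that is the job of a schema.
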
    34. 34. XML also provides Document Types • A Document Type Definition (DTD) defines a set of tags and rules for using them – Specifies elements, attributes, and possible combinations – E.g. in HTML, the ol and ul elements must contain li elements • A DTD is just one kind of schema system used by XML • Schema express data models of/for texts – TEI is a powerful way of describing primary source materials for scholars • Documents that use a schema properly are called “valid”
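
The difference between "well formed" and "valid" can be sketched with a toy rule table. This is not a DTD and not how a real validator works (Python's standard library does not validate against DTDs); the `ALLOWED_CHILDREN` format and the `violations` helper are assumptions made for illustration, and the check is the simpler "ol and ul may contain only li" reading of the slide's rule.

```python
import xml.etree.ElementTree as ET

# Toy content model standing in for a DTD: parent tag -> allowed child tags.
# Only the ol/ul rule comes from the slide; the table format is invented here.
ALLOWED_CHILDREN = {
    "ol": {"li"},
    "ul": {"li"},
}


def violations(root):
    """List (parent, child) tag pairs that break the toy content model."""
    problems = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            continue  # no rule for this element, so anything is accepted
        for child in parent:
            if child.tag not in allowed:
                problems.append((parent.tag, child.tag))
    return problems


# Both documents are well formed; only the first obeys the rule.
valid_doc = ET.fromstring("<body><ul><li>one</li><li>two</li></ul></body>")
invalid_doc = ET.fromstring("<body><ol><p>one</p></ol></body>")

print(violations(valid_doc))
print(violations(invalid_doc))
```
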
    35. 35. Originally, DTDs defined “genres” like business letter or mortgage form They were later used to define more abstract models of textual content
    36. 36. XML is used everywhere • HTML – E.g. Embed codes • TEI (Text Encoding Initiative) • RSS • Civilization IV • Playlists (e.g. XSPF or “spiff”) • Google Maps (KML)
    37. 37. The Text Encoding Initiative created TEI to mark up scholarly documents Mainly primary sources such as books and manuscripts
    38. 38. TEI • Written in XML (was SGML) • The dominant language used to encode scholarly text • Scholars can select from a large set of elements or define their own elements to match what they are interested in
    39. 39. Examples • The TEI Header – m • TEI Prose – m • Find others at the TEI By Example Project –
    40. 40. XML and TEI both contain an implicit theory of text What is it?
    41. 41. OHCO • XML (and therefore HTML and TEI) imply a certain theory of text – A text is an OHCO • OHCO – Ordered Hierarchy of Content Objects • An OHCO is a kind of tree – Elements follow each other in sequences – Elements can contain other elements
    42. 42. What are the advantages of this view?
    43. 43. OHCO allows for easy processing • Every element has a precise address in the text – E.g. HTML/body/p[1] • Texts can be described in the language of kinship – Ancestors, parents, siblings, children, etc. • Texts can be restructured and manipulated by known patterns and algorithms – Traversing – Pruning – Cross-referencing
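
These three advantages can be tried out with Python's standard `xml.etree.ElementTree`. The sketch below is illustrative only: the small document is invented, and because ElementTree paths start from the root element, the slide's address HTML/body/p[1] is written here as body/p[1].

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<html><body>"
    "<p>First paragraph.</p>"
    "<p>Second <em>emphatic</em> paragraph.</p>"
    "</body></html>"
)

# Address: paths are relative to the root <html> element.
first_p = DOC.find("body/p[1]")

# Kinship: ElementTree elements do not store their parent, so build a lookup.
parent_of = {child: parent for parent in DOC.iter() for child in parent}

# Ancestors of <em>, nearest first.
ancestors = []
node = DOC.find(".//em")
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# Siblings of the first paragraph.
siblings = [el.tag for el in parent_of[first_p] if el is not first_p]

# Traversal: visit every element in document order.
order = [el.tag for el in DOC.iter()]

print(first_p.text, ancestors, siblings, order)
```
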
    44. 44. What are the disadvantages of OHCO?
    45. 45. Logical vs. Physical Structure THIS IS WHAT WE ENCOUNTERED AT THE END OF LAST WEEK’S STUDIO
    46. 46. Two common structures that overlap Pages and Paragraphs
    47. 47. <page n="2"> . . . <p id="foo">His good looks and his rank had one fair claim on his attachment, since to them he must have owed a wife</p> </page> <page n="3"> <p id="bar" prev_id="foo"> a very superior character to anything deserved by his own.</p> . . . </page> Solution 1: Split Elements
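
With split elements, the page hierarchy is primary and the logical paragraph has to be reassembled by following the `prev_id` links. A minimal sketch, assuming fragments appear in reading order and that a wrapping `<book>` root (not on the slide) is added so the example parses:

```python
import xml.etree.ElementTree as ET

# The slide's example, wrapped in a hypothetical <book> root so it parses.
BOOK = ET.fromstring(
    '<book>'
    '<page n="2"><p id="foo">His good looks and his rank had one fair claim '
    'on his attachment, since to them he must have owed a wife</p></page>'
    '<page n="3"><p id="bar" prev_id="foo">a very superior character to '
    'anything deserved by his own.</p></page>'
    '</book>'
)


def logical_paragraphs(root):
    """Rejoin paragraph fragments that were split at page boundaries."""
    paragraphs = {}  # id of the first fragment -> accumulated text
    head_of = {}     # id of any fragment -> id of its first fragment
    for p in root.iter("p"):
        prev = p.get("prev_id")
        # A fragment with prev_id belongs to the same paragraph as its predecessor.
        head = head_of[prev] if prev is not None else p.get("id")
        head_of[p.get("id")] = head
        text = (p.text or "").strip()
        paragraphs[head] = (paragraphs.get(head, "") + " " + text).strip()
    return paragraphs


joined = logical_paragraphs(BOOK)
print(joined["foo"])
```
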
    48. 48. <p>His good looks and his rank had one fair claim on his attachment, since to them he must have owed a wife <pb n="3" /> a very superior character to anything deserved by his own.</p> Solution 2: Use “Milestones” One structure gets backgrounded
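
With milestones the trade-off is reversed: the paragraph stays whole, and the pages must be recovered by cutting the text at each `<pb />`. The sketch below handles only the simple case on the slide (milestones as direct children of the paragraph, no other inline elements); the `pages_from_milestones` helper and the starting page number "2" are assumptions.

```python
import xml.etree.ElementTree as ET

P = ET.fromstring(
    '<p>His good looks and his rank had one fair claim on his attachment, '
    'since to them he must have owed a wife <pb n="3" /> a very superior '
    'character to anything deserved by his own.</p>'
)


def pages_from_milestones(paragraph, first_page="2"):
    """Split a paragraph's text into (page number, text) pairs at each <pb>."""
    pages = [(first_page, (paragraph.text or "").strip())]
    for child in paragraph:
        if child.tag == "pb":
            # Text that follows an empty element is stored in its .tail.
            pages.append((child.get("n"), (child.tail or "").strip()))
    return pages


pages = pages_from_milestones(P)
for number, text in pages:
    print(number, text)
```
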
    49. 49. Wittgenstein’s Manuscripts What about this?
    50. 50. The problem of overlap suggests that OHCO is not as simple as it looks How does Renear “solve” the problem?
    51. 51. Each OHCO markup schema represents an analytical perspective, an interpretive model
    52. 52. [Charrette]
    53. 53. So, XML, TEI, POSH – these allow us to impose a model on a text How does Unsworth characterize these models?
    54. 54. A markup schema is a “knowledge representation”
    55. 55. A KR is a model that comprises 1. A set of categories (aka Ontology) Names and relationships between names 2. A set of inference rules (aka Logic) A method of traversing names and relations 3. A medium for computation A medium for mechanically producing inferences 4. A language for expressing these things Such as a programming or markup language
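
A very small example may make the four parts easier to picture. In this sketch the categories are a dictionary, the inference rule is "is-a is transitive", the medium for computation is the Python interpreter, and the language is Python itself. The category names are invented and are not taken from Unsworth.

```python
# Categories (ontology): each name points to its broader category.
# The names are illustrative only.
IS_A = {
    "sonnet": "poem",
    "poem": "text",
    "letter": "text",
}


def categories_of(name):
    """Inference rule: is-a is transitive, so follow the chain upward."""
    found = []
    while name in IS_A:
        name = IS_A[name]
        found.append(name)
    return found


def is_a(name, category):
    """Mechanical inference: does `name` fall under `category`?"""
    return category in categories_of(name)


print(categories_of("sonnet"))
print(is_a("sonnet", "text"), is_a("letter", "poem"))
```

Nothing in the dictionary states that a sonnet is a text; that fact is produced by applying the rule.
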
    56. 56. What tools beside XML does Unsworth reference as useful for KR?
    57. 57. Tables
    58. 58. What are some differences between trees and tables?
    59. 59. Tables are more rigid Trees allow for indefinite depth But tables are easier to manipulate In any case, tables and trees are two major kinds of data structure that you will encounter …
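
One way to feel the difference is to flatten a tree into a table. The sketch below turns a hypothetical play into rows of (act, scene, line, text). The fixed columns show the rigidity: the table only works because the depth is known in advance. The payoff is that a question like "all lines in act 1" becomes a simple filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical play with placeholder line text.
PLAY = ET.fromstring(
    "<play>"
    "<act><scene><line>a</line><line>b</line></scene>"
    "<scene><line>c</line></scene></act>"
    "<act><scene><line>d</line></scene></act>"
    "</play>"
)


def to_rows(play):
    """Flatten the tree into (act, scene, line, text) rows."""
    rows = []
    for a, act in enumerate(play.findall("act"), start=1):
        for s, scene in enumerate(act.findall("scene"), start=1):
            for n, line in enumerate(scene.findall("line"), start=1):
                rows.append((a, s, n, line.text))
    return rows


rows = to_rows(PLAY)

# A table query: filter rows by one column.
lines_in_act_1 = [text for act, _, _, text in rows if act == 1]
print(rows)
print(lines_in_act_1)
```
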
    60. 60. How to reconcile these tools?
    61. 61. A Proposed Model • Texts are not documents – Documents are media, Texts are messages • Texts and documents are part of a system comprised of “levels” – They are effectively archaeology sites with stratigraphic layers – Erasures are like cities building on top of each other • Each level of the system is described by an appropriate set of tools – Document structures → XML – Textual structures, embedded ontologies → Tables
    62. 62. Basic Levels • Document – Physical objects (paper) – Logical objects (defined by space, style, punctuation, etc.) – Style and layout (also defined by space, color, etc.) – Can have superimposed versions • Text – Sequences of characters – Grammatical features – Figures and poetic features – Etc.