Xml

  1. 1. Introduction to XML <ul><li>Chim Bunthoeurn </li></ul><ul><li>Lecturer, RUPP </li></ul><ul><li>Department of Computer Science </li></ul><ul><li>[email_address] </li></ul><ul><li>[email_address] </li></ul>
  2. 2. What Is Markup? <ul><li>Information added to a text to make its structure comprehensible </li></ul>
  3. 3. Computer markup <ul><li>Any kind of codes added to a document </li></ul><ul><ul><li>Typesetting (presentational markup) </li></ul></ul><ul><ul><ul><li>MS Word and its ilk, TeX, Scribe, Lout, Script, nroff, XYVision </li></ul></ul></ul><ul><ul><li>Declarative markup </li></ul></ul><ul><ul><ul><li>HTML (sometimes) </li></ul></ul></ul><ul><ul><ul><li>XHTML </li></ul></ul></ul><ul><ul><ul><li>XML </li></ul></ul></ul><ul><ul><ul><li>WML </li></ul></ul></ul>
  4. 4. What do we mean by declarative? <ul><li>Names and structure </li></ul><ul><li>Framework for indirection </li></ul><ul><li>Finer level of detail (most human-legible signals are overloaded) </li></ul><ul><li>Independent of presentation (abstract) </li></ul><ul><li>People often call this “semantic” </li></ul>
  5. 5. XML <ul><li>The Extensible Markup Language </li></ul><ul><li>XML is a standard, interoperable way to represent documents for flexible processing </li></ul><ul><ul><li>Multi-format delivery </li></ul></ul><ul><ul><li>Schema-aware information retrieval </li></ul></ul><ul><ul><li>Transformation and dynamic data customization </li></ul></ul><ul><ul><li>Archival: standardized, self-describing </li></ul></ul>
  6. 6. The two worlds of XML <ul><li>Markup of documents: the original </li></ul><ul><ul><li>This perspective is our focus here </li></ul></ul><ul><ul><li>Document representation was the primary problem XML was created to solve </li></ul></ul><ul><li>Data exchange and protocol design </li></ul><ul><ul><li>XML turned out to fill important gaps </li></ul></ul><ul><ul><li>Relational databases needed a way to share records and multi-table data </li></ul></ul><ul><ul><li>Protocol designers wanted a way to encapsulate structured data </li></ul></ul>
  7. 7. What is XML? <ul><li>XML stands for EXtensible Markup Language </li></ul><ul><li>XML is designed to transport and store data. </li></ul><ul><li>XML is a markup language much like HTML </li></ul><ul><li>XML was designed to carry data, not to display data </li></ul><ul><li>XML tags are not predefined. You must define your own tags </li></ul><ul><li>XML is designed to be self-descriptive </li></ul><ul><li>XML is a W3C Recommendation </li></ul>
  8. 8. HTML vs. XML <ul><li>< h1 > Bibliography </ h1 > </li></ul><ul><li>< p > < i > Foundations of DBs</ i >, Abiteboul , Hull, Vianu </li></ul><ul><li> < br > Addison-Wesley, 1995 </li></ul><ul><li>< p > < i > Logics for DBs and ISs </ i >, Chomicki , Saake, eds. </li></ul><ul><li> < br > Kluwer, 1998 </li></ul><ul><li>< bibliography > </li></ul><ul><li>< book > < title > Foundations of DBs </ title > </li></ul><ul><li> < author > Abiteboul </ author > </li></ul><ul><li> < author > Hull </ author > </li></ul><ul><li>< author > Vianu </ author > </li></ul><ul><li> < publisher > Addison-Wesley </ publisher > </li></ul><ul><li> .... </li></ul><ul><li>.</ book > </li></ul><ul><li>< book > ... < editor > Chomicki </ editor >... </ book > ... </li></ul><ul><li></ bibliography > </li></ul>HTML tags: presentation, generic document structure XML tags: content, &quot;semantic&quot;, (DTD-) specific
  9. 9. XML vs SGML <ul><li>Origins: HTML + SGML (ISO Standard, 1986, ~600 pp) </li></ul><ul><li>W3C standard (~26 pp): XML syntax + DTDs </li></ul><ul><li>XML = HTML − presentational tags </li></ul><ul><li>+ user-defined DTD (tags + nesting) </li></ul><ul><li>=> really a metalanguage for defining other languages via DTDs </li></ul><ul><li>=> XML is more like SGML than HTML </li></ul><ul><li>XML = SGML − {complexity, document perspective} </li></ul><ul><li>+ {simplicity, data exchange perspective} </li></ul>
  10. 10. XML treats documents like databases <ul><li>XML brings benefits of DBs to documents </li></ul><ul><ul><li>Schema to model information directly </li></ul></ul><ul><ul><li>Formal validation, locking, versioning, rollback... </li></ul></ul><ul><li>But </li></ul><ul><ul><li>Not all traditional database concepts map cleanly, because documents are fundamentally different in some ways </li></ul></ul>
  11. 11. What is structure <ul><li>To Relational Database theorists, structure is: </li></ul><ul><ul><li>Tables with fixed sets of non-repeating named fields, that have little internal structure </li></ul></ul><ul><ul><li>E-R diagrams with fixed number of nodes </li></ul></ul><ul><li>Structured documents are different: </li></ul><ul><ul><li>The order of SECs, Ps, etc. matters (a lot) </li></ul></ul><ul><ul><li>Many hierarchical layers (which text crosses) </li></ul></ul><ul><ul><li>Text/graphic data mixes with aggregate objects </li></ul></ul><ul><ul><li>Optional or repeatable sub-parts abound </li></ul></ul><ul><ul><li>Interaction with natural language phenomena </li></ul></ul><ul><li>These are very different requirements </li></ul>
  12. 12. When structure is essential <ul><li>Large scale data </li></ul><ul><li>Data with individual parts you care about </li></ul><ul><ul><li>(like price-tag, tool-list, citation, author,...) </li></ul></ul><ul><li>Need for good navigation tools </li></ul><ul><li>Mission-critical information </li></ul><ul><li>Information that must last </li></ul><ul><li>Multi-author publishing process </li></ul><ul><li>Multiple delivery media </li></ul>
  13. 13. What’s the difference? <ul><li>Without structure </li></ul><ul><ul><li>Data conversion is far more expensive </li></ul></ul><ul><ul><li>Multi-platform and/or multi-media delivery requires re-authoring and hand-work </li></ul></ul><ul><ul><li>Paper production is inconsistent </li></ul></ul><ul><ul><li>Late format changes are far riskier </li></ul></ul><ul><ul><li>Retrieval is prone to many false hits </li></ul></ul><ul><li>“Pay me now, or pay me later” </li></ul>
  14. 14. XML design principles <ul><li>Straightforwardly usable over the Internet </li></ul><ul><li>Support for a wide variety of applications </li></ul><ul><li>Compatible with SGML </li></ul><ul><li>Make writing XML programs easy </li></ul><ul><li>Avoid optional features </li></ul><ul><li>Human-readable (if not terse) markup </li></ul><ul><li>Formal and concise design </li></ul><ul><li>Design produced quickly </li></ul>
  15. 15. Opportunities with XML <ul><li>Scalability and openness of Web solutions </li></ul><ul><li>“Rich clients” for complex information </li></ul><ul><ul><li>Dynamic user views </li></ul></ul><ul><li>XML as an interprocess communication protocol for “data” (as opposed to “text”) </li></ul><ul><li>eCommerce integration </li></ul><ul><li>New methods of creation </li></ul><ul><ul><li>Schema combination/composition </li></ul></ul><ul><ul><li>Free-form, schema-less data development </li></ul></ul>
  16. 16. Web usage <ul><li>XML works with familiar Web paradigms </li></ul><ul><ul><li>Locations are expressed as URIs </li></ul></ul><ul><ul><li>High interoperability because of few options </li></ul></ul><ul><ul><li>Easily implementable and usable </li></ul></ul><ul><ul><li>Robust against network failures </li></ul></ul><ul><ul><li>Avoids serving schemas every time with documents </li></ul></ul><ul><ul><ul><li>(but can do better validation anyway, when needed) </li></ul></ul></ul>
  17. 17. Some additional XML details <ul><li>Well-formedness </li></ul><ul><li>Error handling </li></ul><ul><li>Case sensitivity </li></ul><ul><li>HTML compatibility </li></ul>
  18. 18. Well-formedness <ul><li>Document has a single root element, and </li></ul><ul><li>Elements nest properly </li></ul><ul><ul><li>Try <B>foo<I>bar</B>baz</I> in your browser! </li></ul></ul><ul><li>Entities are whole subtrees (not </P><P> ) </li></ul><ul><li>No tag omission (close what you open) </li></ul><ul><li>Attributes must be quoted </li></ul><ul><li>< and & must always be escaped in some way </li></ul><ul><li>A document can be well-formed (and parsable) whether or not it fits a given schema </li></ul>
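The well-formedness rules on this slide can be tried out with any conforming parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested elements under a single root are accepted.
print(is_well_formed("<B>foo<I>bar</I>baz</B>"))
# Overlapping tags, as in the browser experiment on the slide, are rejected.
print(is_well_formed("<B>foo<I>bar</B>baz</I>"))
# Unquoted attribute values are rejected.
print(is_well_formed("<p id=x>text</p>"))
# A bare ampersand must be escaped.
print(is_well_formed("<p>a & b</p>"))
# Two top-level elements violate the single-root rule.
print(is_well_formed("<a/><b/>"))
```

No schema is involved anywhere in this check, which is the point of the last bullet: a document can be well-formed whether or not it fits a given DTD.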
  19. 19. Elements and their Content element type character content element empty element < bibliography > < paper ID=&quot;object-fusion&quot;> < authors > < author >Y.Papakonstantinou</ author > < author >S. Abiteboul</ author > < author >H. Garcia-Molina</ author > </ authors > < fullPaper source=&quot;fusion&quot;/> < title >Object Fusion in Mediator Systems</ title > < booktitle > VLDB 96 </ booktitle > </ paper > </ bibliography > element content
  20. 20. Element Attributes <bibliography> <paper pid =&quot; object-fusion &quot; > <authors> <author>Y.Papakonstantinou</author> <author>S. Abiteboul</author> <author>H. Garcia-Molina</author> </authors> <fullPaper source =&quot;fusion&quot;/> <title>Object Fusion in Mediator Systems</title> <booktitle>VLDB 96</booktitle> </paper> </bibliography> Attribute name Attribute Value
  21. 21. Pure XML -- Instance Model <ul><li>XML 1.0 Standard: </li></ul><ul><ul><li>no explicit data model </li></ul></ul><ul><ul><li>only syntax of well-formed and valid (wrt. a DTD) documents </li></ul></ul><ul><li>implicit data model: </li></ul><ul><ul><li>nested containers (&quot;boxes within boxes&quot;) </li></ul></ul><ul><ul><li>labeled ordered trees (=a semistructured data model) </li></ul></ul><ul><ul><li>relational, object-oriented, other data: easy to encode </li></ul></ul>< A > < B > foo </ B > < C > bar </ C > < C > lab </ C > </ A > A B C &quot; foo &quot; &quot; bar &quot; &quot; lab &quot; C children are ordered C : &quot; bar &quot; A : B : &quot; foo &quot; C : &quot; lab &quot;
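To make the "labeled ordered tree" reading concrete, here is a small sketch that parses the A/B/C example with Python's standard `xml.etree.ElementTree` and lists the children in document order. The variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# The example document from the slide, written without the extra spacing.
root = ET.fromstring("<A><B>foo</B><C>bar</C><C>lab</C></A>")

# Iterating over an element yields its children in document order,
# so the two C children stay distinguishable by position.
children = [(child.tag, child.text) for child in root]
print(root.tag, children)
```

The two `C` elements carry the same label, so only their order tells them apart, which is what separates this model from an unordered set of named fields.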
  22. 22. Example: Relational Data to XML Relation R with columns A, B, C and rows (a1, b1, c1), (a2, b2, c2), (a3, b3, c3). As XML: <R> <tuple> <A>a1</A> <B>b1</B> <C>c1</C> </tuple> <tuple> <A>a2</A> <B>b2</B> <C>c2</C> </tuple> … </R> As a tree: the root R has one tuple child per row, and each tuple has children A, B, C holding that row's values.
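The encoding on this slide (one `tuple` element per row, one child element per column) can be sketched in a few lines of Python with the standard `xml.etree.ElementTree`. The function name `relation_to_xml` and its parameters are illustrative assumptions, not something defined in the slides.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(name, columns, rows):
    """Encode a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for column, value in zip(columns, row):
            ET.SubElement(tup, column).text = value
    return root

r = relation_to_xml(
    "R",
    ["A", "B", "C"],
    [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")],
)
xml_text = ET.tostring(r, encoding="unicode")
print(xml_text)
```

This is only one possible encoding; column values could equally be stored as attributes of each `tuple` element.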
  24. 24. Adding Structure and Semantics <ul><li>XML Document Type Definitions (DTDs): </li></ul><ul><ul><li>define the structure of &quot;allowed&quot; documents (i.e., valid with respect to a DTD) </li></ul></ul><ul><ul><li>≈ database schema </li></ul></ul><ul><ul><li>=> improve query formulation, execution, ... </li></ul></ul><ul><li>XML Schema </li></ul><ul><ul><li>defines structure and data types </li></ul></ul><ul><ul><li>allows developers to build their own libraries of interchanged data types </li></ul></ul><ul><li>XML Namespaces </li></ul><ul><ul><li>identify your vocabulary </li></ul></ul>
  25. 25. Partial and missing DTDs <ul><li>DTDs (schemas) are needed for validation </li></ul><ul><li>DTD processing adds a burden </li></ul><ul><li>Because of Well-formedness, </li></ul><ul><ul><li>DTDs are not needed just to parse </li></ul></ul><ul><ul><li>Even subtrees can be parsed in isolation </li></ul></ul><ul><ul><ul><li>One exception: Default attributes </li></ul></ul></ul><ul><li>Very handy for development/experimentation </li></ul>
  26. 26. Case sensitivity <ul><li>HTML is </li></ul><ul><ul><li>Case-insensitive for tag names: <P> = <p> </li></ul></ul><ul><ul><li>Case-sensitive for entity names: &LT; ≠ &lt; </li></ul></ul><ul><li>XML is case-sensitive for both! </li></ul><ul><ul><li>Unicode standard advises against case-folding </li></ul></ul><ul><ul><li>Folding is not well-defined for all languages </li></ul></ul><ul><ul><ul><li>Turkish has two lower-case i’s, only one upper </li></ul></ul></ul><ul><ul><ul><li>In languages with no accented caps, can’t reverse </li></ul></ul></ul><ul><ul><ul><li>Error-prone for programmers </li></ul></ul></ul><ul><li>XHTML uses lower case </li></ul>
  27. 27. Summary <ul><li>XML has: </li></ul><ul><ul><li>Representational power and extensibility </li></ul></ul><ul><ul><ul><li>Custom tags, order constraints, etc. </li></ul></ul></ul><ul><ul><li>Validation and consistency (several ways) </li></ul></ul><ul><ul><li>Much of HTML’s simplicity for users/implementors </li></ul></ul><ul><li>XML trashes: </li></ul><ul><ul><li>SGML’s syntax/feature complexity </li></ul></ul><ul><ul><li>SGML’s high startup costs </li></ul></ul><ul><ul><li>HTML’s inflexibility </li></ul></ul><ul><ul><li>ASCII legacy </li></ul></ul>
  28. 28. XML System Architectures
  29. 29. First, an HTML system HTML document <ul><li>Web Server </li></ul>Web Client Internet Parser, formatter, interface
  30. 30. How do you get the data? Documents, stylesheets, and other data can all be expressed in XML. This model can work locally or over a network. Parsing, tree-building, and access can shift between client and server. Any application can plug in via an API called the “Document Object Model”. Diagram labels: XML data; DTD/Schema; Parser; Information structure (tree + links); DOM Interface. But their information is accessed directly.
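As a rough illustration of an application plugging in through the DOM API, the sketch below uses Python's standard `xml.dom.minidom`, which implements a subset of the W3C DOM. The `catalog` document and its `item` elements are made-up sample data.

```python
from xml.dom import minidom

# Hypothetical sample document; the parser builds the tree once,
# and the application then works only through the DOM interface.
doc = minidom.parseString(
    "<catalog>"
    "<item id='h185'>Tools</item>"
    "<item id='h186'>Computer</item>"
    "</catalog>"
)

root_name = doc.documentElement.tagName
items = doc.getElementsByTagName("item")
ids = [item.getAttribute("id") for item in items]
first_text = items[0].firstChild.data

print(root_name, ids, first_text)
```

Because the application sees only nodes and attributes, the same code works whether the parsing happened locally or the tree was built elsewhere.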
  31. 31. Server side XML publishing Server transforms to HTML/CSS; Ship to client browser for display Very common current strategy; Leverages current technology XML data XSLT http Stylesheet HTML +CSS Browser/ Interface
  32. 32. XML everywhere <ul><li>XML separates representation from structure </li></ul><ul><ul><li>So you can use the same parsers, network protocols, tree managers, and APIs to access documents, stylesheets, search and query, etc. </li></ul></ul><ul><li>XML allows separating application parts </li></ul><ul><ul><li>So you can mix and match formatters, search engines, networks and protocols, etc. </li></ul></ul><ul><li>XML separates out semantics </li></ul><ul><ul><li>So you can control style or search semantics without having to mangle your documents to do it </li></ul></ul>
  33. 33. What are the parts? <ul><li>Header stuff </li></ul><ul><ul><li>The XML Processing Instruction </li></ul></ul><ul><li><?xml version=&quot;1.0&quot; standalone=&quot;yes&quot;?> </li></ul><ul><ul><li>Schema/DTD (referenced or included) </li></ul></ul><ul><li>The DOCTYPE </li></ul><ul><li><!DOCTYPE catalog SYSTEM &quot;http://www.xyz.com/DTDs/catalog.dtd&quot;> </li></ul>
  34. 34. Main document stuff <ul><li>Elements: <title>...</title> </li></ul><ul><li>Attributes: <xref tgt=&quot;#h185&quot;> </li></ul><ul><li>Text or other content: Tools, computer </li></ul><ul><li>Entity references: &lt; … &reg; </li></ul><ul><li>Comments: <!-- Prepared by... --> </li></ul>
  35. 35. Anatomy of an element <p type=&quot;rule&quot;>Use a hyphen: ­.</p> Start-tag Content End-tag Element Element type Attribute name Attribute value (character) entity reference Element type Attribute
  36. 36. Audiences XML aims to help <ul><li>Parser writers </li></ul><ul><ul><li>The Mythical CS Grad Student </li></ul></ul><ul><li>Application writer </li></ul><ul><ul><li>The Desperate Perl Hacker </li></ul></ul><ul><li>Document creators </li></ul><ul><li>Newbies of all stripes </li></ul><ul><li>The World Wide Web itself </li></ul>
  37. 37. HTML compatibility <ul><li>XHTML is an XML application </li></ul><ul><ul><li>One schema among many (probably a popular one, of course) </li></ul></ul><ul><li>Web browsers should start supporting generic XML regardless of tag set. </li></ul><ul><ul><li>Don’t hard-code sizes and names </li></ul></ul><ul><li>The Open eBook spec has a nice compromise that accommodates XML, HTML, CSS, and MIME </li></ul>
  38. 38. What are the parts of an XML Document? <ul><li>The DTD </li></ul><ul><li>Elements </li></ul><ul><li>Attributes </li></ul><ul><li>General entities </li></ul><ul><li>Character references </li></ul><ul><li>Comments </li></ul><ul><li>Marked sections </li></ul><ul><li>Processing instructions </li></ul><ul><li>Notations </li></ul><ul><li>Identifiers and catalogs </li></ul>
  39. 39. Schema Languages <ul><li>Three leading contenders (all can win): </li></ul><ul><li>XML Schema </li></ul><ul><ul><li>Backed by the W3C </li></ul></ul><ul><ul><li>Very powerful </li></ul></ul><ul><ul><li>Very large + complex theory </li></ul></ul><ul><li>RELAX NG </li></ul><ul><ul><li>Backed by ISO </li></ul></ul><ul><ul><li>Based on tree automata </li></ul></ul><ul><ul><li>Very small </li></ul></ul><ul><li>Schematron </li></ul><ul><ul><li>Independent effort </li></ul></ul><ul><ul><li>Validation tool, not a complete language </li></ul></ul>
  40. 40. The DTD (schema) <ul><li>A DTD is a simple schema, based on SGML </li></ul><ul><li>It consists of declarations for the parts: </li></ul><ul><ul><li><!ELEMENT CHAP (TI, SEC*, SUM)> </li></ul></ul><ul><ul><li><!ATTLIST P ID ID #IMPLIED> </li></ul></ul><ul><ul><li><!ELEMENT P (#PCDATA)> </li></ul></ul><ul><li>Can reference from DOCTYPE, or include: </li></ul><ul><li><!DOCTYPE book SYSTEM &quot;book.dtd&quot; [ <!ELEMENT P (#PCDATA)>… ]> </li></ul><ul><li>Other schema languages are available </li></ul><ul><ul><li>They use XML syntax (why not?) </li></ul></ul>
  41. 41. Elements <ul><li>Identify structural/semantic components </li></ul><ul><li>Can (usually do) have children </li></ul><ul><li>Represented by start-tags and end-tags: </li></ul><ul><ul><li><P>Hello, world.</P> </li></ul></ul><ul><li>Some elements are EMPTY </li></ul><ul><ul><li>Special syntax so parser knows: <HR/> </li></ul></ul><ul><li>Schemas control what sub-element patterns can occur with any given type of element </li></ul><ul><li>Order matters / Context does not </li></ul>
  42. 42. Attributes <ul><li>Specify properties/characteristics of elements </li></ul><ul><ul><li>That generally apply to the elements as wholes </li></ul></ul><ul><li>Values are atomic strings </li></ul><ul><ul><li>Though applications may impose more structure </li></ul></ul><ul><li>Represented by assignments within start-tags: </li></ul><ul><ul><li><P TYPE=&quot;SECRET&quot; ID=&quot;FOO&quot;> </li></ul></ul><ul><li>Schemas control what attributes can occur on any given type of element </li></ul><ul><li>One special type: ID, unique per document </li></ul><ul><li>Attributes are not ordered </li></ul>
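The claim that attributes are unordered, atomic strings can be checked with a short sketch in Python's standard `xml.etree.ElementTree`. The second document, with the attributes swapped, is an illustrative assumption added for comparison.

```python
import xml.etree.ElementTree as ET

# The start-tag from the slide, plus the same tag with attributes reordered.
a = ET.fromstring('<P TYPE="SECRET" ID="FOO">text</P>')
b = ET.fromstring('<P ID="FOO" TYPE="SECRET">text</P>')

# Attributes are exposed as a name -> string mapping, so order carries no meaning.
same_attributes = a.attrib == b.attrib
secret_type = a.get("TYPE")

print(same_attributes, secret_type)
```

Note that a parser without DTD support treats `ID` as an ordinary string attribute; the uniqueness rule for the ID type is only enforced during validation.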
  43. 43. General Entities <ul><li>A lexical mechanism for inclusion </li></ul><ul><ul><li>But, constrained to including subtrees </li></ul></ul><ul><ul><li>This preserves fragment parsability </li></ul></ul><ul><ul><li>This allows lazy evaluation of structure nodes </li></ul></ul><ul><li>Also used for referring to graphic or other non-directly-XML data objects </li></ul><ul><li>References occur in the document instance: </li></ul><ul><ul><li><PROCEDURE TYPE=&quot;REPAIR&quot;> &warn37;&warn12;...</PROCEDURE> </li></ul></ul><ul><li>Declarations associate the name with a URI or a “public identifier” </li></ul>
  44. 44. Predefined entities <ul><li>Used for escaping markup characters </li></ul><ul><ul><li><p>In XML, tags start with “&lt;”.</p> </li></ul></ul><ul><li>Represented just like other entities: </li></ul><ul><ul><li>&lt; “<” </li></ul></ul><ul><ul><li>&amp; “&” </li></ul></ul><ul><ul><li>&gt; “>” (more for symmetry than need) </li></ul></ul><ul><ul><li>&apos; “'” </li></ul></ul><ul><ul><li>&quot; “&quot;” </li></ul></ul><ul><li>Schemas may not redefine these names </li></ul>
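In practice the escaping is usually left to a library. A minimal sketch with Python's standard `xml.sax.saxutils.escape`, which by default replaces `&`, `<`, and `>`, followed by a round trip through a parser; the sample sentence is an illustrative assumption.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'In XML, tags start with "<" & end with ">".'

# escape() turns &, < and > into the predefined entity references.
escaped = escape(raw)
print(escaped)

# The parser resolves the references again, so the original text comes back.
round_trip = ET.fromstring(f"<p>{escaped}</p>").text
print(round_trip == raw)
```

For attribute values, quotes need escaping as well; `xml.sax.saxutils.quoteattr` handles that case.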
  45. 45. Character references <ul><li>Can be used to obtain untypable characters </li></ul><ul><ul><li>Such as Kanji for users with English keyboards </li></ul></ul><ul><li>Map directly to a Unicode code point </li></ul><ul><li>Represented much like entity references: </li></ul><ul><ul><li>Decimal: &#13041; </li></ul></ul><ul><ul><li>Hex: &#xBEEF; </li></ul></ul><ul><li>Schemas do not affect these </li></ul>
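A short sketch showing that a decimal and a hexadecimal character reference both resolve straight to Unicode code points, using Python's standard `xml.etree.ElementTree`. The two reference values are chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# One decimal and one hexadecimal character reference.
text = ET.fromstring("<p>&#13041;&#xBEEF;</p>").text

# Each reference becomes exactly one character with that code point.
code_points = [ord(ch) for ch in text]
print(code_points)
```

No DTD is consulted for this step, which matches the slide's remark that schemas do not affect character references.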
  46. 46. Comments <ul><li>Can go most anywhere </li></ul><ul><ul><li>(though not inside tags) </li></ul></ul><ul><li>Represented as: </li></ul><ul><ul><li><!-- text of comment --> </li></ul></ul><ul><li>Have simpler syntax than in SGML/HTML </li></ul><ul><ul><li>Not <!-- foo -- -- bar -- > </li></ul></ul><ul><ul><li>Not <!-- foo -- > </li></ul></ul><ul><li>Schemas can contain comments, too </li></ul>
  47. 47. Marked sections <ul><li>Two purposes: </li></ul><ul><ul><li>Escaping a lot of markup </li></ul></ul><ul><ul><li>Conditional inclusion </li></ul></ul><ul><li>In XML: </li></ul><ul><ul><li>Escaping only in the document instance: </li></ul></ul><ul><ul><ul><li><![CDATA[ <P>Hello</P> ]]> </li></ul></ul></ul><ul><ul><li>Conditional content only in schemas: </li></ul></ul><ul><ul><ul><li><![IGNORE[ ... ]]> </li></ul></ul></ul><ul><ul><ul><li><![INCLUDE[ ... ]]> </li></ul></ul></ul>
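The escaping use of a CDATA section can be seen by parsing the slide's example inside a wrapper element. This is a sketch with Python's standard `xml.etree.ElementTree`; the wrapper element name `example` is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, "<P>" is character data, not a start-tag.
elem = ET.fromstring("<example><![CDATA[ <P>Hello</P> ]]></example>")

cdata_text = elem.text
child_count = len(elem)
print(repr(cdata_text), child_count)
```

The parser reports the content as plain text and creates no `P` child element.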
  48. 48. The “XML Declaration” PI <ul><li>At top of each XML document: </li></ul><ul><li><?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;yes&quot;?> </li></ul><ul><li>This marks the document as being XML </li></ul><ul><li>The name is lower-case “xml”, and the pseudo-attributes must appear in the order version, encoding, standalone </li></ul><ul><li>“Encoding” can be double-checked </li></ul><ul><ul><li>You can detect the encoding from the first few bytes, for many common ones (even EBCDIC) </li></ul></ul><ul><ul><li>MIME types also can signal encoding </li></ul></ul><ul><ul><li>(watch out if server re-encodes document) </li></ul></ul>
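The warning about re-encoding can be illustrated with a small sketch using Python's standard `xml.etree.ElementTree`: the same bytes parse correctly when the declared encoding matches and are rejected when it does not. The sample text and variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

body = "<p>café</p>"

# Bytes are ISO-8859-1 and the declaration says so: the parser decodes them correctly.
matching = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")
decoded_text = ET.fromstring(matching).text
print(decoded_text)

# Same ISO-8859-1 bytes, but the declaration claims UTF-8: the input is not valid UTF-8.
mismatched = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("iso-8859-1")
try:
    ET.fromstring(mismatched)
    mismatch_rejected = False
except ET.ParseError:
    mismatch_rejected = True
print(mismatch_rejected)
```

This is the failure mode to expect when a server transcodes a document without updating its declaration.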
  49. 49. Notations <ul><li>Used to name foreign data formats referenced </li></ul><ul><li>Ties a notation name to a URI (presumably pointing to the format’s specification) </li></ul><ul><li>Entities can state their data’s notation </li></ul><ul><li>Processing instructions can (should) use them as target names </li></ul><ul><li>Declared in the schema </li></ul><ul><ul><li><!NOTATION gif SYSTEM &quot;http://specs.com/gif10.html&quot;> </li></ul></ul><ul><li>Can also use PUBLIC </li></ul>
  50. 50. Identifiers <ul><li>Used in entity declarations to state where the data to be included later can be found </li></ul><ul><li><!ENTITY warning SYSTEM &quot;http://www.warnsource.com/w993.xml&quot;> </li></ul><ul><li>Uses a URI reference </li></ul><ul><ul><li>Probably will later allow referencing subtrees directly by appending an XPointer </li></ul></ul><ul><li>Accommodates persistent naming schemes under development, but doesn’t define one. </li></ul>
  51. 51. XML 1.0 DTDs <ul><li>DTDs let you say: </li></ul><ul><ul><li>What element types can occur and where </li></ul></ul><ul><ul><li>What attributes each element type can have </li></ul></ul><ul><ul><li>What notations are in use </li></ul></ul><ul><ul><li>What external entities can be referenced </li></ul></ul><ul><li>Standard DTDs exist in almost every domain </li></ul><ul><ul><li>Robin Cover’s oasis.org site has references </li></ul></ul><ul><ul><li>Some repositories exist, such as xml.org </li></ul></ul>
  52. 52. An Example DTD <ul><li><!-- DTD for Friendly Letter --> </li></ul><ul><li><!-- FPI: -//sjd//DTD Friendly letter//EN --> <!ELEMENT LETTER (DATE, GREET, BODY, SIG)> <!ELEMENT DATE (#PCDATA)> <!ELEMENT GREET (#PCDATA)> <!ELEMENT BODY (P)*> <!ELEMENT SIG (#PCDATA)> <!ELEMENT P (#PCDATA | EMPH | FIG)*> <!ELEMENT EMPH (#PCDATA)> <!ATTLIST EMPH TYPE NMTOKEN &quot;WOW&quot;> <!ELEMENT FIG EMPTY> <!ATTLIST FIG HREF CDATA #REQUIRED> </li></ul>
  53. 53. Another Example <ul><li><!ENTITY % inline &quot;emph | strong&quot;> </li></ul><ul><li><!ELEMENT doc (chap*)> </li></ul><ul><li><!ELEMENT chap (title, section*)> </li></ul><ul><li><!ELEMENT title (#PCDATA | %inline;)*> </li></ul><ul><li><!ELEMENT section (p+)> </li></ul><ul><li><!ELEMENT p (#PCDATA|%inline;)*> </li></ul><ul><li><!ATTLIST p ID ID #IMPLIED> </li></ul><ul><li><!ELEMENT emph (#PCDATA)> </li></ul><ul><li><!ELEMENT strong (#PCDATA)> </li></ul>
  54. 54. A corresponding document <ul><li><?xml version=&quot;1.0&quot;?> <!DOCTYPE LETTER PUBLIC &quot;-//sjd//DTD Friendly letter//EN&quot; </li></ul><ul><li>[]> <LETTER><DATE>October 3, 1998</DATE> <GREET>Sammy</GREET> <BODY> <P>How <EMPH>are</EMPH> you doing?</P> <P>This is my dog: <FIG HREF=&quot;http://www.me.com/dog.gif&quot;/></P> </BODY> <SIG>Todd</SIG> </LETTER> </li></ul>
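Python's standard library does not validate against a DTD, so the sketch below only hand-checks two of the Friendly Letter rules against the document instance: the child sequence of LETTER and the required HREF on FIG. It is an illustration of what a validating parser would verify, not a validator, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# The document instance from the slide, without prolog and DOCTYPE.
letter = ET.fromstring(
    "<LETTER><DATE>October 3, 1998</DATE><GREET>Sammy</GREET>"
    "<BODY><P>How <EMPH>are</EMPH> you doing?</P>"
    '<P>This is my dog: <FIG HREF="http://www.me.com/dog.gif"/></P></BODY>'
    "<SIG>Todd</SIG></LETTER>"
)

# <!ELEMENT LETTER (DATE, GREET, BODY, SIG)>: exactly this sequence.
child_names = [child.tag for child in letter]
matches_letter_model = child_names == ["DATE", "GREET", "BODY", "SIG"]

# <!ATTLIST FIG HREF CDATA #REQUIRED>: every FIG must carry HREF.
figs_missing_href = [fig for fig in letter.iter("FIG") if "HREF" not in fig.attrib]
```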
  55. 55. Content Models <ul><li>These are modeled on regular expressions </li></ul><ul><li>In a DTD, each element type has one content model for all time </li></ul><ul><li>Similarly, each element type has one set of attributes for all time </li></ul><ul><li>Attributes and content models are completely independent </li></ul>
  56. 56. Basic Operators <ul><li>Joining </li></ul><ul><ul><li>Sequence a,b,c </li></ul></ul><ul><ul><li>Alternation a | b | c </li></ul></ul><ul><li>Grouping (a) </li></ul><ul><li>Repetition </li></ul><ul><ul><li>0 or more a* </li></ul></ul><ul><ul><li>1 or more a+ </li></ul></ul><ul><ul><li>Optional a? </li></ul></ul>
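Because content models are modeled on regular expressions, a model can be translated by hand into a regex over the sequence of child element names. The sketch below does this for the model `(title, section*)`; the translation scheme (each name followed by a space) and the helper name are assumptions for illustration.

```python
import re

# (title, section*) : one title, then zero or more sections.
# Each child name is written with a trailing space so names cannot run together.
CHAP_MODEL = re.compile(r"(title )(section )*")

def matches_chap(children):
    """Return True if the list of child element names fits (title, section*)."""
    encoded = "".join(name + " " for name in children)
    return CHAP_MODEL.fullmatch(encoded) is not None
```

With this encoding, `matches_chap(["title", "section", "section"])` holds, while a chapter without a title does not match.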
  57. 57. Data <ul><li>#PCDATA </li></ul><ul><li>Element names </li></ul><ul><li>Model groups </li></ul><ul><li>Mixed content (#PCDATA | x | …)* </li></ul><ul><li>ANY </li></ul><ul><li>EMPTY </li></ul>
  58. 58. Not quite regular expressions <ul><li>Ambiguity restriction </li></ul><ul><li>The parser must never have to choose between alternatives: each element must match exactly one place in a model group, without lookahead </li></ul><ul><li>This restriction is preserved in W3C Schema, relaxed in RELAX NG </li></ul>
  59. 59. Handy terminology decoder ring <ul><li>Element: a text feature distinguished by markup </li></ul><ul><li>Tag: a string in angle brackets. <a> or </a>. Two tags delimit an element </li></ul><ul><li>Content: anything in an element (its children in the parse tree): the tags and characters between an element’s start and end tags </li></ul><ul><li>Attribute: a (name, value) pair associated with an element </li></ul><ul><li>Element Type Name: a string like “p” or “img” that identifies the type of an element </li></ul>
  60. 60. Decoder ring… <ul><li>Entity: abstraction of an item of data storage. </li></ul><ul><li>Internal entity: entity whose text is contained in its declaration. </li></ul><ul><li>External entity: entity whose content is stored externally to its declaration </li></ul><ul><li>Declaration: meta-markup that declares entities, content models, etc. </li></ul><ul><li>Document instance: the tags and content in an XML document, not counting declarations </li></ul>
  61. 61. Decoder… <ul><li>Document Type declaration (DOCTYPE): declaration of root element of a document instance, can refer to: </li></ul><ul><li>External subset: DTD (markup declarations) stored as an external entity. </li></ul><ul><li>Internal subset: declarations contained within the DOCTYPE declaration itself. ATTLIST declarations there must be parsed and interpreted, even by non-validating parsers. </li></ul>
  62. 62. Decoder… <ul><li>Content Model: description of restrictions on the content of an element </li></ul><ul><li>Model Group: content model subexpression in parentheses </li></ul><ul><li>Repetition indicator: *, +, ? </li></ul><ul><li>Prolog: All of the stuff before the document instance starts. </li></ul>
  63. 63. Ambiguity <ul><li>A content model is ambiguous if it contains an alternation (a | b) where the content models a and b cannot be distinguished by their first element. </li></ul><ul><li>A content model is ambiguous if an optional occurrence indicator is followed by a submodel whose first element is not different. </li></ul>
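The first rule above can be sketched as a check on an alternation whose branches are plain sequences of element names. This is a deliberately simplified illustration (no nesting, no occurrence indicators), and the function name is an assumption.

```python
def ambiguous_alternation(branches):
    """Return True if two branches of an alternation start with the same element.

    Each branch is a non-empty list of element names, e.g. ["a", "b"] for (a, b).
    """
    seen_first = set()
    for branch in branches:
        first = branch[0]
        if first in seen_first:
            return True   # the parser could not pick a branch on seeing `first`
        seen_first.add(first)
    return False
```

Under this check, `((a, b) | (a, c))` is ambiguous because both branches start with `a`. The usual fix is to factor out the common prefix, giving `(a, (b | c))`.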
  64. 64. Attributes <ul><li>Data types </li></ul><ul><li>Default values / omissibility </li></ul><ul><li><!ATTLIST p </li></ul><ul><li>type (summary | body) &quot;body&quot; </li></ul><ul><li>id ID #IMPLIED </li></ul><ul><li>prefix CDATA &quot;&quot;> </li></ul>
  65. 65. <!ATTLIST syntax <ul><li><!ATTLIST element-name att-name type defaults att-name type defaults …> </li></ul><ul><li><!ATTLIST element-group att-name type defaults att-name type defaults …> </li></ul>
  66. 66. Attribute Data Types <ul><li>CDATA </li></ul><ul><li>NMTOKEN / NMTOKENS </li></ul><ul><li>Enumeration Type (a | b) </li></ul><ul><li>ENTITY / ENTITIES </li></ul><ul><li>ID / IDREF / IDREFS </li></ul><ul><li>NOTATION </li></ul>
  67. 67. Attribute defaults <ul><li>#REQUIRED </li></ul><ul><li>#IMPLIED </li></ul><ul><li>#FIXED “value” </li></ul><ul><li>Literal default value </li></ul>
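Literal default values declared in the internal subset are supplied by the parser when the attribute is omitted. The sketch below assumes Python's low-level `xml.parsers.expat` interface, which by default reports defaulted attributes along with specified ones; the handler and variable names are hypothetical.

```python
import xml.parsers.expat

# The ATTLIST from the Attributes slide, placed in an internal subset.
doc = """<!DOCTYPE p [
  <!ATTLIST p
    type   (summary | body) "body"
    id     ID               #IMPLIED
    prefix CDATA            "">
]>
<p>text</p>"""

seen = {}

def on_start(name, attrs):
    # attrs includes values supplied from ATTLIST defaults, not only those
    # written in the start tag.
    seen[name] = dict(attrs)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = on_start
parser.Parse(doc, True)
```

The start tag carries no attributes, yet the handler sees `type` with its declared default, while `id` (declared `#IMPLIED`) stays absent.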
  68. 68. Parameter Entities <ul><li>Declaring </li></ul><ul><li><!ENTITY % pent &quot;value&quot;> </li></ul><ul><li><!ENTITY % include-file SYSTEM &quot;http://www.w3.org//&quot;> </li></ul><ul><li>Using </li></ul><ul><li>%include-file; </li></ul><ul><li><![ %option; [ <!… optional declaration …> ]]> </li></ul>
  69. 69. General Entities <ul><li>Simple </li></ul><ul><li><!ENTITY ent &quot;value&quot;> </li></ul><ul><li>External </li></ul><ul><li><!ENTITY include-file SYSTEM &quot;http://www.w3.org//&quot;> </li></ul>
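A simple general entity declared in the internal subset is expanded wherever it is referenced in the document instance. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the root element name `d` is hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE d [
  <!ENTITY ent "value">
]>
<d>The entity expands to &ent; here.</d>"""

# The parser replaces &ent; with the replacement text from the declaration.
expanded_text = ET.fromstring(doc).text
```

External entities such as `include-file` are a different matter: whether a parser fetches them depends on the parser and its configuration.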
  70. 70. Notations <ul><li>Declaring </li></ul><ul><ul><li><!NOTATION blob SYSTEM &quot;application/binary&quot;> </li></ul></ul><ul><li>Using (to declare entity datatypes) </li></ul><ul><ul><li><!ENTITY something SYSTEM &quot;http://blob.org/blobel&quot; </li></ul></ul><ul><ul><ul><li>NDATA blob> </li></ul></ul></ul><ul><li>Using an NDATA entity </li></ul><ul><ul><li><!ATTLIST img ref ENTITY #REQUIRED> </li></ul></ul><ul><ul><li>… in instance … </li></ul></ul><ul><ul><li><img ref=&quot;something&quot;/> </li></ul></ul><ul><li>Or one can just use URIs and MIME types in software… less validation, more simplicity </li></ul>
  71. 71. Processing instructions <ul><li>Escape to procedural markup </li></ul><ul><ul><li><!NOTATION my-app SYSTEM &quot;http://my.com/&quot;> </li></ul></ul><ul><ul><li><?my-app does something, anything …. ?> </li></ul></ul><ul><li>Escape hatch </li></ul><ul><li>Way to add declarations to XML in some cases </li></ul><ul><li>Way to “pickle” application state in a document. </li></ul>
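A processing instruction reaches the application as a target name plus an uninterpreted data string. The sketch below reads one back with Python's standard `xml.dom.minidom`; the document content is made up for illustration.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

dom = parseString("<doc><?my-app does something?></doc>")
pi = dom.documentElement.firstChild

is_pi = pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE
target = pi.target   # the application the instruction is addressed to
data = pi.data       # everything after the target, passed through as-is
```

The parser does not interpret `data`; only the application named by the target gives it meaning.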
  72. 72. Namespaces <ul><li>Helps to “uniquify” markup names </li></ul><ul><ul><li>Colon delimiter allowed in names </li></ul></ul><ul><ul><li><cals:table> <html:table xyz:key=&quot;2&quot;> </li></ul></ul><ul><ul><li>Attributes associate a prefix with a namespace URI </li></ul></ul><ul><ul><li><div xmlns:xhtml=&quot;http://www.w3.org/1999/xhtml&quot;> </li></ul></ul><ul><ul><ul><li>The binding applies to the element and its descendants </li></ul></ul></ul>
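A namespace-aware parser replaces each prefix with the URI it is bound to. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`, which writes the result as `{uri}localname`; the child elements are made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = (
    '<div xmlns:xhtml="http://www.w3.org/1999/xhtml">'
    "<xhtml:p>qualified</xhtml:p>"
    "<p>unqualified</p>"
    "</div>"
)
root = ET.fromstring(doc)

root_tag = root.tag                         # no prefix, no default namespace
child_tags = [child.tag for child in root]  # prefix resolved to its URI
```

Only the prefixed `p` lands in the XHTML namespace. Declaring a prefix does not change the namespace of unprefixed names.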
  73. 73. Things namespaces almost do <ul><li>Allow arbitrary mixing of DTDs /schemas </li></ul><ul><li>Provide a “type system” for referents of markup </li></ul><ul><li>Allow automatic processing of foreign markup </li></ul>
  74. 74. Pros and Cons of Namespaces <ul><li>You can uniquely label element types in a global way </li></ul><ul><li>You must change the element name to take advantage of this </li></ul><ul><li>Attempts to re-use large numbers of namespace-qualified elements are often clumsy/redundant </li></ul><ul><li>Detection of a namespace is very easy </li></ul><ul><li>There can only be one namespace for an instance of an element </li></ul>
  75. 75. Things that are confusing about namespaces <ul><li>The URI reference in a namespace is just a string </li></ul><ul><li>The URI reference in a namespace may not exist, it’s just a string </li></ul><ul><li>The URI reference in a namespace may exist and contain something irrelevant or unexpected: it’s just a string </li></ul><ul><li>Relative URI references in namespaces are well-defined, but don’t do what you might expect, because they are just strings… </li></ul><ul><li>Fragment identifiers are allowed in namespace URIs, if you want to use them. </li></ul>
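The "just a string" point can be seen by binding two prefixes to URIs that a web client would likely treat as the same host but that differ as strings. The sketch assumes Python's standard `xml.etree.ElementTree`; the `example.org` URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = (
    '<r xmlns:a="http://example.org/ns" xmlns:b="http://EXAMPLE.org/ns">'
    "<a:x/><b:x/>"
    "</r>"
)
first, second = list(ET.fromstring(doc))

# Namespace names are compared character by character, so these two
# elements belong to different namespaces. Nothing is ever fetched.
same_namespace = first.tag == second.tag
```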
  76. 76. Namespace URI dereferencing <ul><li>There are applications within which this has been defined </li></ul><ul><li>There isn’t anything yet which works across arbitrary domains </li></ul><ul><li>RDF, DAML/OIL, other semantic web efforts may also address this in time. </li></ul>
  77. 77. XML Information Set <ul><li>What data in an XML document “counts”? </li></ul><ul><ul><li>Elements, attributes, content </li></ul></ul><ul><ul><li>Order and hierarchy of elements </li></ul></ul><ul><ul><li>No whitespace within tags </li></ul></ul><ul><ul><li>All whitespace within elements </li></ul></ul><ul><ul><li>Not which kind of quotes around attributes </li></ul></ul><ul><li>Required for interoperability </li></ul><ul><ul><li>Applications must not count nodes differently </li></ul></ul><ul><ul><li>W3C “Document Object Model” is related </li></ul></ul><ul><ul><ul><li>DOM is an API for XML, not an O.M. </li></ul></ul></ul>
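A short sketch of what "counts", assuming Python's standard `xml.etree.ElementTree`: quote style and whitespace inside tags vanish after parsing, while whitespace inside element content survives. The sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same information, different surface syntax inside the tag.
single_quoted = ET.fromstring("<a x='1'   y=\"2\"  />")
double_quoted = ET.fromstring('<a x="1" y="2"/>')
same_attributes = single_quoted.attrib == double_quoted.attrib

# Whitespace within element content is part of the data.
padded_text = ET.fromstring("<a> b </a>").text
plain_text = ET.fromstring("<a>b</a>").text
```

Two applications that disagree on either point would not see the same document, which is why the Information Set matters for interoperability.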
  78. 78. XML and related specs <ul><li>XML: The basic syntax, plus namespaces </li></ul><ul><ul><li>XML Namespaces: disambiguation </li></ul></ul><ul><ul><li>XML-Information Set: What counts </li></ul></ul><ul><ul><li>XML-Schemas: datatyping and structure </li></ul></ul><ul><li>XPath: Expressions to find whole nodes </li></ul><ul><li>XPointer: XPath++ for hyperlink addressing </li></ul><ul><li>XLink: hypermedia </li></ul><ul><li>XML Base (relative URLs) </li></ul><ul><li>XSL: stylesheets and transforms </li></ul><ul><li>DOM: API to the Information Set </li></ul>
  79. 79. XML specification <ul><li>A “Recommendation” since 2/1998 </li></ul><ul><ul><li>The highest level for a W3C specification </li></ul></ul><ul><li>Defines the syntax/grammar </li></ul><ul><li>Schemas or DTDs then define particular applications (poetry, manuals, eCommerce,…) </li></ul><ul><ul><li>All these can be parsed by a generic XML parser, just as new words can be readily fitted into existing sentence structures </li></ul></ul><ul><ul><li>Schemas are political as well as technical </li></ul></ul>
  80. 80. The W3C standards* process <ul><li>World Wide Web Consortium (W3C) </li></ul><ul><li>Development is organized into WGs. </li></ul><ul><ul><li>Working Group (~10) - set agenda /decide </li></ul></ul><ul><ul><li>Special Interest Group (~100) - discuss/recommend </li></ul></ul><ul><ul><li>W3C members (~500) - vote </li></ul></ul><ul><ul><li>W3C Director (TimBL) - may veto </li></ul></ul><ul><li>The public--comment on public WDs; adopt/reject </li></ul>
  81. 81. The beginning of XML <ul><li>Originally chartered to work on a suite: </li></ul><ul><ul><li>XML (Extensible Markup Language) </li></ul></ul><ul><ul><li>XML-Linking (Extensible Linking Language) </li></ul></ul><ul><ul><li>XSL (Extensible Style Language) </li></ul></ul><ul><li>Founder/chair: Jon Bosak (Sun); W3C contact: Dan Connolly (W3C) </li></ul><ul><li>First presented 11/1996; ratified 2/1998 </li></ul><ul><li>Quickly added XML Namespaces spec </li></ul>
  82. 82. The current XML organization <ul><li>Work products done by several WGs </li></ul><ul><li>“XML Plenary” coordinates these WGs </li></ul>
  83. 83. Document analysis <ul><li>Cycle of steps; repeat until out of time </li></ul><ul><li>Identify project requirements/audience </li></ul><ul><li>Using those, identify information items in the document that could be important </li></ul><ul><li>Make sure you have a way to use that information </li></ul><ul><li>Identify restrictions on those items </li></ul><ul><li>Identify structural constraints that may be needed </li></ul><ul><li>Identify non-semantic features that may be important for presentation, etc. </li></ul>
  84. 84. Project requirements <ul><li>Know the audience/readers </li></ul><ul><li>Know the authors </li></ul><ul><li>Don’t forget the editorial/clerical staff </li></ul><ul><li>These 3 groups are the experts, you are the detail person </li></ul><ul><li>Don’t make a lifetime commitment to your processing model, but have one in mind; analysis without limitations is dangerous </li></ul>
  85. 85. Identifying information items <ul><li>This is pretty much a manual process </li></ul><ul><li>Often best done with paper and highlighters and post-its </li></ul><ul><li>In later stages, adding tags to a text transcript can be useful. </li></ul><ul><li>The more documents you’ve looked at and thought about, the easier this becomes. </li></ul>
  86. 86. Issues to think about <ul><li>Cross-references </li></ul><ul><li>Structural divisions (headings, blurbs, ambiguities) </li></ul><ul><li>Tradeoff between freedom and processing </li></ul><ul><li>Normalization of data items </li></ul><ul><li>What external data and catalogs may exist </li></ul>
  87. 87. Restrictions on data items <ul><li>Content model </li></ul><ul><li>Data values (are there controlled or semi-controlled vocabularies?) </li></ul><ul><li>Are there “authority files” for large open sets (like lists of authors) </li></ul><ul><li>How variable is the content, and how realistic the idea to normalize it. </li></ul>
  88. 88. Presentation issues <ul><li>Some text can be auto-generated, some cannot </li></ul><ul><li>Some text can be “almost” auto-generated (you can’t avoid special cases) </li></ul><ul><li>Punctuation can kill you, either when you leave it to authors, or when you take it away from them </li></ul>
