Xml Overview

Transcript

  • 1. XML for Catalogers in 2009: Emerging Technologies, Tools, and Trends Kevin Reiss [email_address] Systems Librarian Office of Library Services City University of New York AJL-NYMA's 2009 Cataloging Workshop 4/22/2009
  • 2. Outline
    • XML Basics
    • XML and MARC
    • XML Formats
    • Usage Scenarios
    • XML Tools
    • Experimentation & Questions
  • 8. Purpose
    • I'm not here to teach you how to catalog in XML
    • Give a basic understanding of XML syntax
    • Put XML in the context of library work, specifically cataloging
    • Highlight usage scenarios for XML
    • Discuss tools for editing XML
  • 13. XML Basics
    • Extensible Markup Language
    • World Wide Web Consortium (W3C) Standard
      • Officially a Recommendation
    • XML 1.0 became a W3C Recommendation in 1998; the first working drafts appeared in 1996
    • SGML for the Web
      • Standard Generalized Markup Language
    • Came out of the text-encoding community
      • Software Documentation (DocBook)
      • Literary Texts (TEI)
  • 17. XML is:
      So useful it has outlived its own hype. It is ubiquitous within most modern applications and on the web. It isn't even cool any longer.
  • 18. Future Proof Your Data “Data Outlasts Code” Ian Davis – Code4lib 2009
      How many of you have lived through an ILS migration?
  • 19. XML is: The best data format we have to deal with this issue at the moment since MARC, in some respects, is becoming a liability where modern software is concerned.
  • 20. XML is also:
    • Machine-readable
    • Human-readable
    • Platform Independent
    • Verbose
    • Unicode-compliant
    • Used in data-centric applications
    • Used in document-centric applications
    • Editable by any editor that can handle plain-text files
  • 28. XML is a meta-language
    • “Self describing Data”
    • Machine-readable semantic data
    • You define your application vocabulary
      • XML applications are defined with a schema
      • Example: (X)HTML is an XML application
    • Adhere to a few simple rules
  • 34. Two Approaches to Markup
    • Descriptive
      • <h1>Page Title</h1> <p>Paragraph one.</p> <p>Paragraph two.</p>
    • Procedural
      • <font size="12">Page Title</font><br/><br/> <font size="6">Paragraph one.</font><br/><br/> <font size="6">Paragraph two.</font>
  • 35. Similar Display/Different Approaches
  • 36. Descriptive Markup
    • Seeks to separate content from presentation
    • Which of the previous code snippets succeeds?
    • Descriptive markup makes data
      • More portable
      • Easier to repurpose and share
    • In many ways MARC is a partially descriptive, partially procedural markup language
      • Field/subfield definitions and validation rules
      • ISBD Punctuation
  • 41. Procedural or Descriptive?
      090 |a ML410 .S18 |b J3 2007
      24500 |a J. B. Sancho : |b compositor pioner de Califòrnia = compositor pionero de California : pioneer composer of California / |c William J. Summers ... [et. al.] ; ed. Antoni Pizà.
      250 |a 1a ed.
      260 |a Palma : |b Universitat de les Illes Balears, |c c2007.
      300 |a 366 p. : |b ill., music ; |c 30 cm. + |e 1 CD-ROM.
      500 |a Parallel text in Catalan, Spanish, and English.
      504 |a Includes bibliographical references and thematic catalogue of the works of J. B. Sancho.
      500 |a CD-ROM contains Artaserse facsimiles; transcriptions of Misa de los ángeles, Gloria, and Misa del sol; and audio recordings of Misa de los ángeles and Gloria de la Misa en sol.
      590 |a At GC, CD-ROMs shelved at Circulation Desk under call no.: CD-ROM 54
      50500 |t Sancho : l'eminent músic de l'Alta Califòrnia / |r William J. Summers -- |t Juan Bautista Sancho : a la recerca dels orígens del primer compositor de Califòrnia i de 'estil musical primitiu de les missions / |r Craig H. Russell -- |t Els Sanzo d'Artà / |r Antoni Gili -- |t Catàleg temàtic / |r William J. Summers.
      650 0 |a Composers |z California |x Biography.
      60010 |a Sancho, Juan Bautista, |d 1772-1830.
      60010 |a Sancho, Juan Bautista, |d 1772-1830 |v Thematic catalogs.
      7001 |a Pizà, Antoni.
      7001 |a Summers, William John.
      7001 |a Russell, Craig H.
      7001 |a Gili Ferrer, Antonio.
  • 42. Basic XML Syntax
    • Files end in .xml
    • Individual XML documents are “instances”
    • Documents must adhere to a nested hierarchy
    • Start with an optional XML declaration
    • <?xml version="1.0" encoding="UTF-8"?>
      • Declares XML version used
      • Declares the character encoding
  • 48. The Root Element
    • Every document instance has only one
    • All other elements nest within this one
    • For example, every XHTML document has only one “<html>” element
    • Start <tag>
    • End </tag>
  • 53. Web Page Source
  • 54. Elements
    • Sometimes called “tags”
    • Can contain other elements and text
    • Must have a start tag (<tag>) and a matching end tag (</tag>)
    • Sometimes elements are “empty”
    • These must also be “closed”
      • <empty attribute="stuff"/>
      • The image element in XHTML, <img src="mypicture.jpeg"/>, is a good example
  • 60. Elements in MODS
      <subject xmlns:xlink="http://www.w3.org/1999/xlink" authority="lcsh">
        <topic>City and town life</topic>
        <topic>Fiction</topic>
      </subject>
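To make the element discussion concrete, here is a minimal Python sketch that parses the MODS fragment from this slide with the standard library's `xml.etree.ElementTree`. The variable names are illustrative only, and the fragment is treated as a stand-alone document whose root is `<subject>`, which is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# The MODS fragment from the slide, with whitespace tidied.
MODS_SUBJECT = """
<subject xmlns:xlink="http://www.w3.org/1999/xlink" authority="lcsh">
  <topic>City and town life</topic>
  <topic>Fiction</topic>
</subject>
"""

# <subject> acts as the root element of this small instance.
subject = ET.fromstring(MODS_SUBJECT)

# Child elements nest inside the root; collect the text of each <topic>.
topics = [topic.text for topic in subject.findall("topic")]

print(subject.tag, subject.get("authority"))
print(topics)
```

The parser hands back the nested hierarchy directly: one root, two child elements, and the text inside each child.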
  • 61. Attributes
    • Attached to a specific element
    • Values must be quoted, e.g. myattribute="my attribute content"
    • Order is not important when attached to a given element
    • HTML Example
      • <a href="http://www.google.com" title="Go to Google">Visit Google</a>
    • MARCXML Example
      <datafield tag="245" ind1="1" ind2="0">
        <subfield code="a">Ulysses</subfield>
        <subfield code="c">[by] James Joyce.</subfield>
      </datafield>
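The point that attribute order does not matter can be checked with a short Python sketch. It parses two spellings of the same MARCXML `datafield` (the second ordering is invented for the illustration) and compares the attribute dictionaries that `xml.etree.ElementTree` exposes.

```python
import xml.etree.ElementTree as ET

# Same element, attributes written in two different orders.
FIELD_A = '<datafield tag="245" ind1="1" ind2="0"><subfield code="a">Ulysses</subfield></datafield>'
FIELD_B = '<datafield ind2="0" ind1="1" tag="245"><subfield code="a">Ulysses</subfield></datafield>'

attrs_a = ET.fromstring(FIELD_A).attrib  # attributes arrive as a dict
attrs_b = ET.fromstring(FIELD_B).attrib

# Dict comparison ignores order, just as XML does for attributes.
same = attrs_a == attrs_b
print(attrs_a["tag"], same)
```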
  • 65. Entities
    • Five reserved special characters – XML general entities
      • &lt; (<), &gt; (>), &amp; (&), &quot; ("), &apos; (')
    • Example <equation>2 &lt; 5</equation>
    • Authoring software should escape these for you
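As a small illustration of software doing the escaping for you, the Python sketch below uses `xml.sax.saxutils.escape` from the standard library. The helper name `escape_text` is hypothetical; `escape` handles `&`, `<` and `>` by default, and the two quote characters are passed in as extra replacements.

```python
from xml.sax.saxutils import escape

def escape_text(text):
    """Replace the five reserved characters with their XML entities."""
    # escape() covers &, < and > itself; quotes are supplied as extras.
    return escape(text, {'"': "&quot;", "'": "&apos;"})

# Rebuild the slide's example from unescaped text.
equation = "<equation>" + escape_text("2 < 5") + "</equation>"
print(equation)
```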
  • 71. Parsing XML
    • Virtually every mainstream programming language and operating system has XML parsing support
    • Most web browsers are XML parsers
    • Two levels of XML parsing
      • Well-formedness - “weak checking”
      • Validation - “strict checking”
      • An instance is valid when it adheres to the rules defined in a specific schema
  • 76. Well-Formedness
    • “Weak” check
    • Checks for adherence to basic XML syntax
    • Ensures a piece of software can parse the data
    • Test this with your web browser
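Besides a web browser, any XML parser can run the "weak" check. The sketch below is one way to do it in Python: the hypothetical helper `is_well_formed` simply asks `xml.etree.ElementTree` to parse the text and reports whether a `ParseError` was raised. It does not validate against any schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b>text</b></a>"))   # properly nested
print(is_well_formed("<a><b>text</a></b>"))   # overlapping tags
print(is_well_formed("<a>1</a><b>2</b>"))     # two root elements
```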
  • 82. Well-formed XML Document
  • 83.  
  • 84. What is a Valid XML Document?
    • Validity makes XML more than just a structured data format
    • Validity is enforced by a “schema” that defines a particular XML application
    • Schemas contain:
      • Element/Attribute definitions
      • Content model definitions
        • i.e. element order and number
      • Data validation rules
        • Enumerated values
        • Patterns, e.g. dates, MARCXML leader field
  • 89. XML Schemas
    • Schemas define the semantics/structure of your application
    • Could be called “strict checking”
    • Most major XML applications have some sort of schema
    • Data or document modeling work is done here
    • A schema supports “guided” editing
    • In practice schemas are most useful during
      • Authoring phase
      • Data migration
  • 96. Types of Schemas
    • DTDs – Document Type Definitions
      • Older form, derived from SGML
      • Non-XML syntax
    • XML Schemas
      • W3C Standard
      • Expressed in XML
      • Most database-like
    • RELAX NG Schemas
      • Most flexible
      • Expressed in XML or in a compact non-XML syntax
  • 101. Guided Editing using Oxygen
  • 102. Data Validation Example
    • Consider the MARCXML Schema
    • MARC Leader Field validation rule
    [\d ]{5}[\dA-Za-z ]{1}[\dA-Za-z]{1}[\dA-Za-z ]{3}(2| )(2| )[\d ]{5}[\dA-Za-z ]{3}(4500|    )
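The sketch below applies this leader rule in Python, writing the digit class as `\d` and assuming the final alternative is four spaces (whitespace is easily collapsed when slides are transcribed). The sample leader is a 24-character reconstruction of the Ulysses record's leader, with blanks restored as an assumption; `is_valid_leader` is a hypothetical helper name.

```python
import re

# Leader pattern as given on the slide, with \d for digits and the
# last alternative assumed to be four spaces.
LEADER_PATTERN = re.compile(
    r"[\d ]{5}[\dA-Za-z ]{1}[\dA-Za-z]{1}[\dA-Za-z ]{3}"
    r"(2| )(2| )[\d ]{5}[\dA-Za-z ]{3}(4500|    )"
)

def is_valid_leader(leader):
    """True when the whole leader string matches the pattern."""
    return LEADER_PATTERN.fullmatch(leader) is not None

# Assumed reconstruction of a 24-character leader.
SAMPLE_LEADER = "02805cam  2200589   4500"

print(len(SAMPLE_LEADER), is_valid_leader(SAMPLE_LEADER))
print(is_valid_leader("02805cam"))  # too short, so it fails
```

A schema-aware editor or validating parser runs the same kind of check automatically whenever a record is saved or loaded.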
  • 104. Creating XML
    • Any text editor
    • Special purpose tools
      • General purpose XML editors
      • Tools for a specific XML application
    • As an export format – check your ILS system
    • Can I catalog in XML?
      • Yes
      • Many of you already do, see OCLC Connexion
  • 109. Why is XML useful to Software?
    • Well-formed or valid documents make the content predictable and accessible
    • Parser and schema carry out data-checking for you
    • Very easy to manipulate for programmers
    • Multi-language support via Unicode
    • Parses to “tree” data structure
      • Think of an XML document instance as an organization chart
      • Consists of nodes
  • 115. Sample Instance
  • 116. Document Parsed as a Tree
  • 117. Manipulated via DOM
    • DOM - Document Object Model
      • Common XML processing
      • Supported by most programming languages
    • Typical DOM pseudo code:
      list = xml->GetAllElements("genre")
      foreach genre in list:
        if (isTextNode(genre.firstChildNode))
          print "Genre is " + genre.firstChildNode
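The pseudo code above maps closely onto a real DOM implementation. Here is a runnable Python sketch using the standard library's `xml.dom.minidom`; the sample record and its element names (`record`, `title`) are invented for the illustration, while `genre` follows the slide.

```python
from xml.dom import minidom

# Hypothetical instance used only for this example.
RECORD = """<record>
  <title>Ulysses</title>
  <genre>Psychological fiction</genre>
  <genre>Domestic fiction</genre>
</record>"""

document = minidom.parseString(RECORD)

genres = []
# getElementsByTagName plays the role of GetAllElements in the pseudo code.
for genre in document.getElementsByTagName("genre"):
    first = genre.firstChild
    # Only text nodes carry character data in their .data attribute.
    if first is not None and first.nodeType == first.TEXT_NODE:
        genres.append(first.data)
        print("Genre is " + first.data)
```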
  • 121. Auxiliary XML Standards
    • Many other W3C Standards
      • XML Namespaces
      • XML Transformations (XSLT)
      • XLink (linking within and between XML documents)
      • XQuery, XPath (query syntax for XML documents)
    • Transformations are the most useful and relevant for catalogers
    • XML Namespaces a close second
      • Allow you to mix different XML applications within the same document instance
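A minimal sketch of what mixing vocabularies looks like in practice: the instance below combines a Dublin Core element and a MODS element under an invented, un-namespaced `record` wrapper (the wrapper is an assumption for the example). The namespace URIs are the published ones for Dublin Core elements 1.1 and MODS version 3, and `xml.etree.ElementTree` resolves the prefixes through a mapping.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping used when searching the tree.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "mods": "http://www.loc.gov/mods/v3",
}

MIXED = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:mods="http://www.loc.gov/mods/v3">
  <dc:title>Ulysses</dc:title>
  <mods:genre>Psychological fiction</mods:genre>
</record>"""

root = ET.fromstring(MIXED)

dc_title = root.find("dc:title", NS).text
mods_genre = root.find("mods:genre", NS).text
# Without the namespace the element is not found: names are qualified.
unqualified = root.find("title", NS)

print(dc_title, "|", mods_genre, "|", unqualified)
```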
  • 126. XSLT
    • Extensible Stylesheet Language Transformations
    • W3C Standard
    • Written in XML
    • Convert an XML instance into:
      • Another XML instance
      • Another text format (.csv, .txt)
      • Most commonly takes XML to XHTML
  • 132. XSLT is a series of Templates
    • Convert Dublin Core fields to their MARCXML Equivalents
    • Language => 546
    • Publisher => 260
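The mapping idea behind such templates can be sketched in plain Python. To be clear, this is not XSLT (Python's standard library has no XSLT processor); it is a stand-in that applies the same two rules from this slide. The subfield codes (`a` for 546, `b` for 260), the `metadata` wrapper element, and the helper name `dc_to_marcxml` are assumptions for the illustration, and the output omits the MARCXML namespace for brevity.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Dublin Core element -> (MARC tag, assumed subfield code).
CROSSWALK = {"language": ("546", "a"), "publisher": ("260", "b")}

def dc_to_marcxml(dc_root):
    """Build a simplified MARCXML-style record from Dublin Core children."""
    record = ET.Element("record")
    for child in dc_root:
        # ElementTree spells namespaced tags as "{uri}localname".
        if not child.tag.startswith("{" + DC_NS + "}"):
            continue
        local = child.tag.split("}", 1)[1]
        if local not in CROSSWALK:
            continue  # no rule for this element, so skip it
        tag, code = CROSSWALK[local]
        field = ET.SubElement(record, "datafield",
                              {"tag": tag, "ind1": " ", "ind2": " "})
        subfield = ET.SubElement(field, "subfield", {"code": code})
        subfield.text = child.text
    return record

DC_RECORD = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Ulysses</dc:title>
  <dc:publisher>Random House</dc:publisher>
  <dc:language>eng</dc:language>
</metadata>"""

marc = dc_to_marcxml(ET.fromstring(DC_RECORD))
print(ET.tostring(marc, encoding="unicode"))
```

Each entry in `CROSSWALK` corresponds to one template in a stylesheet: match a source element, emit the target field.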
  • 135. Crosswalks
    • Most common XSLT application in library world
    • LOC publishes a series of stylesheets for common cataloging XML formats
      • MODS -> MARCXML
      • MARCXML -> MODS
      • MARCXML -> Dublin Core
      • MARCXML -> HTML
      • Character set conversions (MARC8 -> UTF8)
    • Programming languages support
      • RAW MARC => MARCXML
  • 141. Can MARC and XML Co-Exist? I find it telling that the first step to designing any system around data currently in MARC, is that I have to take the data out of MARC, correct it for inconsistencies, massage it to make it more straightforward — just so that the information is useful within non-library systems. Terry Reese - 2007 Creator of MarcEdit
  • 142. MARC and the future
    • The Current MARC format is:
      • The primary store for library metadata
      • A very large part of the collective intellectual effort of our profession
      • Out of step with most modern software, which uses design paradigms far different from MARC's
      • Not very interoperable outside of OCLC/ILS land
      • A factor in isolating library data
      • Supported by only a small number of software tools
  • 148. 02805cam 2200589 450000100070000000500170000700800410002403500210006590600450008601000170013104000420014804300120019005000230020205101140022505101580033905100850049705101080058205101230069005100830081305101060089608200130100210000300101524500310104526000370107630000500111350000250116350000300118850000340121850000600125265000330131265100310134565000290137665000250140565000220143065500330145265500280148565500280151371000610154195200310160295200630163398500360169698500350173299100680176799100510183599100480188699100500193499100510198499100510203599100500208699100500213609900290218627430020010920145828.0891030s1934 nyuag 000 1 eng 9(DLC) 34002348 a7bcbccorignewd1eocipf19gy-gencatlg a 34002348 aDLCcDLCdDLCdOCoLCdDLCdOCoLCdDLC ae-ie---00aPR6019.O9bU4 1934 aPR6019.O9bU4 1934bcAnother impression. &quot;Second printing, January 1934&quot;--P. [v]. Copyright deposit (#70193). aMicrofilmb75965 PRcMicrofilm of preceding impression. Washington, D.C. : Library of Congress, Photoduplication Service, 1979. 1 microfilm reel ; 35 mm. aPR6019.O9bU4 1934ccAnother impression. &quot;Fourth printing, January 1934&quot;--P. [v] aPR6019.O9bU4 1934c Copy 2cCopy 2 of the preceding impression. Gift of Willard L. Hart, Mar. 17, 1952. aPR6019.O9bU4 1934dcAnother impression. &quot;Fifth printing, February 1934--P. [v] Purchase, Mar. 20, 1934 (DLC #452310). aPR6019.O9bU4 1934ecAnother impression. &quot;Seventh printing, Nov. 1934&quot;--P. [v] aPR6019.ObU4 1934e Copy 2cCopy 2 of the preceding impression. Purchase, Feb. 25, 1936 (DLC #494329).00a823/.9121 aJoyce, James,d1882-1941.10aUlyssesc[by] James Joyce. a[New York,bRandom house,c1934] axvii, 767, [1] p.billus. (music)c21 1/2 cm. aTitle on two leaves. a&quot;First American edition.&quot; aLC copy has dust jacket.5DLC aSource: Gift of Herman Finkelstein, Dec. 30, 1980.5DLC 0aCity and town lifevFiction. 0aDublin (Ireland)vFiction. 0aMarried peoplevFiction. 0aJewish menvFiction. 0aArtistsvFiction. 
7aPsychological fiction.2lcsh 7aDomestic fiction.2lcsh 7aEpic literature.2gsafd2 aHerman Finkelstein Collection (Library of Congress)5DLC aForm AACR 2: vj05 04-26-89 aCopy 1 of 4th printing missing in inventory: vj05 04-26-89 ararebk/finkerbcfqr05 02-19-93 ararebk/rbcerbcfqr05 02-19-93 bc-RareBookhPR6019.O9iU4 1934tCopy 1mFinkelstein CollwBOOKS bc-RareBookhPR6019.O9iU4 1934btCopy 1wBOOKS bc-MicRRhMicrofilmi75965 PRtCopy 1wBOOKS bc-GenCollhPR6019.O9iU4 1934ctCopy 1wBOOKS bc-RareBookhPR6019.O9iU4 1934ctCopy 2wBOOKS bc-RareBookhPR6019.O9iU4 1934dtCopy 1wBOOKS bc-GenCollhPR6019.O9iU4 1934etCopy 1wBOOKS bc-RareBookhPR6019.OiU4 1934etCopy 2wBOOKS ajoyce-ulysses-1072275128 From Another Computing Era
  • 149. MARC | MARCXML
  • 150. Meaningful Output in Browser
  • 151. MARC => MARCXML
    • This step requires programming
    • Perl libraries can parse MARC and write MARCXML
    • PHP also has a MARC library
    • These have internal crosswalks that produce a MARCXML representation
  • 155. MARC => MARCXML
      <datafield tag="245" ind1="1" ind2="0">
        <subfield code="a">Ulysses</subfield>
        <subfield code="c">[by] James Joyce.</subfield>
      </datafield>
      <datafield tag="260" ind1=" " ind2=" ">
        <subfield code="a">[New York,</subfield>
        <subfield code="b">Random house,</subfield>
        <subfield code="c">1934]</subfield>
      </datafield>
  • 156. Tough Example
    • 24500 |a J. B. Sancho : |b compositor pioner de Califòrnia = compositor pionero de California : pioneer composer of California / |c William J. Summers ... [et. al.] ; ed. Antoni Pizà.
    • Converting this to MARCXML won't necessarily make it any easier for a piece of software to digest
    • MARCXML essentially maintains MARC as it is and puts it into a parsable XML wrapper
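To see why the wrapper alone does not help, the Python sketch below converts the 245 display line from this slide into a MARCXML-style `datafield`. The helper `mnemonic_to_datafield` is hypothetical and only understands the pipe-delimited display convention used on these slides, not raw MARC; the MARCXML namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

def mnemonic_to_datafield(line):
    """Turn a line like '24510 |a Ulysses |c ...' into a datafield element."""
    head, *parts = line.split("|")
    head = head.strip()
    tag = head[:3]
    indicators = head[3:5].ljust(2)  # pad missing indicators with blanks
    field = ET.Element("datafield",
                       {"tag": tag, "ind1": indicators[0], "ind2": indicators[1]})
    for part in parts:
        if not part.strip():
            continue
        # First character is the subfield code, the rest is its content.
        sub = ET.SubElement(field, "subfield", {"code": part[0]})
        sub.text = part[1:].strip()
    return field

TOUGH = ("24500 |a J. B. Sancho : |b compositor pioner de Califòrnia = "
         "compositor pionero de California : pioneer composer of California / "
         "|c William J. Summers ... [et. al.] ; ed. Antoni Pizà.")

field = mnemonic_to_datafield(TOUGH)
subfields = {s.get("code"): s.text for s in field.findall("subfield")}

print(ET.tostring(field, encoding="unicode"))
print(subfields["b"])
```

The result is perfectly parsable XML, yet subfield `b` still holds three parallel titles and ISBD punctuation in a single string, so software must still know MARC conventions to pull them apart.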
  • 159. Other XML Formats
    • MARC-Derivatives
      • MODS (The Semantic or Readable MARC)
      • MARCXML
    • Dublin Core
      • MARCXML's little brother
    • EAD
    • TEI
    • XHTML
    • RSS/Atom
    • RDF
  • 165. Data v. Document Centric
    • Data Centric
      • Database export formats
      • Spreadsheet export formats
      • Metadata
      • Most cataloging formats fall into this category
    • Document Centric
      • Encoding full-text resources
      • Mixed content
  • 170. MODS
    • Metadata Object Description Schema
      • http://www.loc.gov/standards/mods/
    • The “semantic” or “descriptive” XML MARC Surrogate
    • Inconsistent support
      • ILS Systems
      • Institutional Repositories
  • 173. MADS
    • Metadata Authority Description Schema
      • http://www.loc.gov/standards/mads/
      <mads .....>
        <authority>
          <topic authority="lcsh">Computer programming</topic>
        </authority>
        <related type="broader">
          <topic>Computers</topic>
        </related>
        <related type="narrower">
          <topic>Programming languages</topic>
        </related>
        <related type="other">
          <topic>Systems Analysis</topic>
        </related>
      </mads>
  • 174. Dublin Core
    • Popular simple metadata format
    • 15 basic elements
    • key=>value pairs
    • Qualified vocabulary available
    • Default metadata format for OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting)
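The key=>value character of Dublin Core shows up clearly when a record is read into a dictionary. The Python sketch below does that for an invented sample record; the `metadata` wrapper element and the sample values are assumptions, while the namespace URI is the published one for the 15 basic elements. Values are kept in lists because any element may repeat.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC_PREFIX = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record used only for this example.
DC_RECORD = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Ulysses</dc:title>
  <dc:creator>Joyce, James, 1882-1941</dc:creator>
  <dc:subject>City and town life</dc:subject>
  <dc:subject>Married people</dc:subject>
</metadata>"""

pairs = defaultdict(list)
for element in ET.fromstring(DC_RECORD):
    if element.tag.startswith(DC_PREFIX):
        key = element.tag[len(DC_PREFIX):]  # strip the namespace part
        pairs[key].append(element.text)

for key, values in pairs.items():
    print(key, "=>", values)
```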
  • 180. EAD
    • Encoded Archival Description
    • Archival finding aids
    • One of the oldest XML formats
    • Straddles the data and document-centric worlds
    • Crosswalks available in MarcEdit and other places
  • 185. TEI
    • Text Encoding Initiative
    • Designed to encode any kind of text
    • Humanities Computing Initiative
    • Support in the special collections community
    • Intellectually rich XML application
    • Many dialects ranging from:
      • Basic descriptive encoding of a text's structure
      • Detailed linguistic analysis
  • 192. XHTML
    • Extensible HTML
    • HTML that conforms to XML rules
    • Has become ubiquitous on the web
    • Used in conjunction with Cascading Style Sheets
      • XHTML provides the content
      • CSS controls how it displays
    • If your Content Management System (CMS) doesn't use XHTML you are in trouble
  • 197. RSS Syndication
    • Really Simple Syndication
    • An instance of RSS is known as a feed
    • Users can subscribe to a particular RSS feed
    • New additions to the feed are pushed out
    • RSS feeds are easily incorporated into webpages
    • Most web portals (e.g. your Yahoo or Google account) are built around RSS feeds
    • In a catalog
  • 204. RSS within a Catalog
  • 205. RSS and Repositories
    • Emerging area of functionality for RSS
    • RSS can be used as an export protocol to a repository, i.e. something like Connexion for institutional repositories
    • Any content creation tool could send items to a repository
    • SWORD (Simple Web-service Offering Repository Deposit)
    • Uses Atom, a syndication format closely related to RSS, to accomplish this
    • http://www.swordapp.org/
  • 211. RDF
    • Resource Description Framework
    • Semantic Web Technology
    • Linked Data using URI(L)s
    • Machine Readable semantics a level above what XML provides
    • RDF fragment of Project Gutenberg data
  • 216. Sample RDF Assertion describing a Person taken from RDF Primer
  • 217. RDA and XML
    • Some crosswalks in the works
    • XML versions of RDA will likely be produced in RDF
    • Early Example - Using Library of Congress MARC data
      • http://code.google.com/p/code4rda/wiki/MilestoneOne
  • 220. RDA in RDF/XML
  • 221. XML Usage Scenarios
    • Web Interfaces (AJAX)
    • Data processing (ILS go-between)
    • Crosswalks (MARCXML=>All of the Above)
    • Metadata Harvesting (OAI-PMH)
    • Full-text Indexing
  • 226. AJAX – XML Behind the Scenes
  • 227. ILS Go-between Format
    • OCLC Connexion
      • Connexion records are actually created in MARCXML
      • Get converted to MARC for export
    • ILS Example - Aleph
      • Notices
      • Reports
      • Customizable XSL stylesheets to format the XML produced by these transactions
  • 231. Crosswalks
    • Library of Congress
      • Various MARCXML crosswalks
    • Other formats
      • EAD => MARCXML
      • Anything to Dublin Core
  • 233. OAI - PMH
    • Open Archives Initiative – Protocol for Metadata Harvesting
    • Dublin Core is the default format here
    • Expose information about digital collections/repository content to the wider world
    • Participants in METRO grants have data available via OAI in XML
      • Collection List
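An OAI-PMH harvest is driven by ordinary HTTP requests whose query string names a verb and a metadata format. The Python sketch below only builds such a request URL; the base URL is a made-up placeholder and `list_records_url` is a hypothetical helper, while `ListRecords`, `metadataPrefix`, `oai_dc` and `set` are terms defined by the protocol. No network call is made.

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute a real repository's OAI base URL.
BASE_URL = "https://repository.example.org/oai"

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request; oai_dc is unqualified Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optionally restrict to one collection
    return base_url + "?" + urlencode(params)

url = list_records_url(BASE_URL)
print(url)
```

The response to such a request is an XML document containing one Dublin Core record per harvested item, which can then be parsed like any other XML instance.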
  • 237. OAI Metadata Example with Dublin Core
  • 238. Indexing XML
    • There are numerous full-text indexing tools for XML, some used by ILS systems
    • Parse XML into their own indexing format
      • Solr (actually uses its own XML format)
      • Lucene
    • Native XML Indexers
      • eXist
      • Ex Libris' Primo
        • Catalog Records are converted to OAI-PMH Dublin Core and then indexed
  • 242. MarcEdit
    • Simplest tool to integrate into existing library workflows; free to download
    • Direct MARC Support
    • Global Editing of MARC Data
    • Crosswalk utilities
    • Most useful for:
      • Special Collections Work
      • Electronic MARC Record Processing
  • 248. MarcEdit Crosswalk Options
  • 249. Harvest OAI Data
  • 250. End of OAI Harvest in MarcEdit
  • 251. Specialty Editors
    • Archivists' Toolkit
      • Useful for EAD
      • Also has MARC support
    • Oxygen
      • Most useful low-cost option for:
        • Special Collections work
        • Document-centric work
        • General XML authoring
  • 255. Oxygen
    • Low-cost
    • Complete XML Management Solution
    • Supports all types of XML Schema
    • XSLT Support w/debugger
    • Many academic users
  • 260. XML Aware Editing in Oxygen
  • 261.  
  • 262. XML and Programming Languages
    • Strong XML support in virtually all mainstream programming languages
    • Familiar data structure to programmers
      • Remember the tree structure?
    • Internationalization support via Unicode
    • Library data has a better chance of strong support in XML than not in XML
  • 265. MARC and Programming Languages
    • Full support from only a small number of software vendors
    • Perl/PHP/Python/Ruby all have MARC libraries, with varying levels of support
    • MARC tools in these languages are typically:
      • Specialty modules
      • Maintained by a small but dedicated group of programmers
      • Not part of most languages' “standard” distribution
  • 270. For Future Reference
    • A classic introduction to basic XML concepts from the TEI: “A Gentle Introduction to XML”
    • Terry Reese's Weblog
    • Watch for how RDA interacts with XML
    • Eric Lease Morgan's workshop for those with a more technical bent: “XML in Libraries”
  • 274. Conclusion
    • XML is just a tool
    • It is a useful one
    • The intellectual work of cataloging will still be the same
    • Relying on the MARC format as our primary data store is becoming problematic