XML2004

XML-CENTRIC WORKFLOW
OFFERS BENEFITS TO SCHOLARLY PUBLISHERS
Alexander (‘Sasha’) Schwarzman, AGU
<sschwarzman@agu.org>
Hyunmin Hur, DocsDoc
Shu-Li Pai, AGU
Carter M. Glass, AGU
XML 2004
Washington, D.C.
18 November 2004

PRESENTATION OVERVIEW
● Introduction
● Publishing requirements in a dual paper–electronic world
▪ AGU manuscript production modes in 2001
▪ System architecture and workflow in 2004
● Design decisions
▪ Schema language: DTD vs. W3C Schema and Relax NG
▪ Copyediting manuscript: in author-submitted format vs. in XML
▪ Converting manuscript to XML: vendor vs. in-house
▪ Validating XML instance: beyond a validating parser
▪ Extracting and loading metadata into a metadata database (MDDB)
▪ MDDB-based information products and services
▪ Choice of DB technology and programming languages
● Lessons learned
▪ DOI: what it identifies and what its format should be
▪ Version of record
▪ “Published as ready”: journal model deconstructed
▪ Page numbers and article IDs
▪ Special characters and math
● Results

INTRODUCTION
● AGU
▪ Nonprofit multidisciplinary scientific society; 41,000 members from 130 countries
▪ Focus: organization and dissemination of scientific info in interdisciplinary field of
geophysics (atmospheric, oceanic, solid Earth, hydrologic, and space sciences)
▪ Publishes 14 high-impact English language journals, ~4500 articles annually
● Manuscript life cycle
▪ Production
◦ copyediting (copy editor)
◦ proofing (proofreader and author)
◦ correcting (vendor)
◦ publishing (production coordinator)
While AGU has introduced radical changes in how a manuscript is handled at all stages of its life
cycle, in this presentation we will concentrate on the post-acceptance part of the publishing process.

PUBLISHING REQUIREMENTS IN A PAPER–ELECTRONIC WORLD
● Multiple outputs
▪ journal article has to appear in multiple formats: print, PDF, HTML, …
● Search
▪ both article’s metadata and its full text have to be searchable
● Linking
▪ bibliographic references, external datasets, inter- and intra-article linking
● Dynamic content
▪ authors have to be able to include multimedia objects, such as videos or
animations, into their articles
● Cross-journal products
▪ ability to create collections cutting across journals (“virtual” journals)
● Customization
● Metadata sharing
● Preservation of scientific content
▪ ability to preserve scientific content in a readable nonproprietary format
for the foreseeable future

AGU MANUSCRIPT PRODUCTION MODES IN 2001
● Camera-ready copy (CRC)
▪ no electronic copy → no reuse/repurposing of scientific content
▪ authors prepared production files → inconsistent quality of published product
▪ metadata-based products – issue ToCs, author and subject indices, AGU bib.
database EASI – created manually (rekeying). Still, no abstracts in EASI!
▪ no article in electronic form → printed issues mailed to A&I services, metadata
rekeyed. Delay between issue publication and A&I information availability
● Typeset manuscript
▪ two journals wholly typeset in XyVision. PDF, HTML, and ISO 12083 SGML
generated from proprietary typesetting system
▪ 1997: GRL authors given an option to submit in LaTeX, which was converted to
HTML and PDF in-house and posted on the Web
● SGML markup of the electronic manuscript file
▪ 1997: Earth Interactions – marked up in SGML in accordance with own DTD
▪ 1999: Geochemistry, Geophysics, Geosystems – marked up with a variation of
ISO 12083 SGML DTD
▪ 2000: Global Biogeochemical Cycles – partial use of AGU Article SGML DTD
CRC publishing model is a dead end. Disparate production modes counterproductive.
Solution: unified XML-centric process to cut costs, provide services, and stay competitive

SYSTEM ARCHITECTURE AND WORKFLOW: 1

SYSTEM ARCHITECTURE AND WORKFLOW: 2
Custom software
▪ AGU-article XML DTD
▪ AGU Validator
▪ XML conversion tool
▪ Metadata Loader (bib and ref metadata extractor and inserter)
▪ AGU metadata database (AGU MDDB)
▪ reporting, linking, and metadata dissemination modules
▪ full-text database and search engine

DESIGN DECISIONS: DTD vs. W3C Schema and Relax NG
DTD advantages [Beck and Lapeyre, 2003]
● Technical
▪ parameter entity mechanism: modular design, inclusion of DTDs (CALS,
MathML), maintainability, scalability, customization
▪ availability of processing tools
▪ consistency of validity checking among parsers
● Business
▪ vendor community: taggers, compositors, conversion shops
▪ aggregators, archives
● Practical
▪ XML character entities: &eacute; vs. Unicode point &#x000E9; (readability)
DTD disadvantages
▪ not in XML syntax
▪ lacks strong datatyping
DTD: publisher-specific vs. industry-standard
▪ no suitable DTD at the time – AGU developed its own to meet the Requirements
▪ emerging industry standard – NCBI/NLM DTD http://dtd.nlm.nih.gov/publishing/

DESIGN DECISIONS: non-DTD schema languages – W3C Schema
W3C Schema
● Ask yourself:
▪ schema must be in XML syntax?
▪ strong datatyping, full namespace support for elements and attributes essential?
▪ min/max mechanism for elements essential?
▪ content models similar enough to make use of derivation by
extension/restriction?
▪ developers and vendors okay with inconsistency of tools?
● Cons
▪ in scholarly publishing content models are diverse → not derivable
▪ mixed content → min/max, regular expressions not usable
▪ difficult to modularize, scale, maintain
▪ no XML character entities

DESIGN DECISIONS: non-DTD schema languages – Relax NG
Relax NG
● Pros
▪ has both XML and compact (non-XML) syntax
▪ combines intelligibility of DTD with datatyping capability of W3C Schema
▪ provides for context-sensitive content models
▪ can validate documents of different types using a combined schema
▪ DocBook and TEI converted to Relax NG in the past year
● Cons
▪ not as well-established as W3C Schema, number of tools available is limited
▪ does not support XML character entities → you must also have a DTD
▪ does not permit Formal Public Identifiers (FPI)
▪ Relax NG-specific features may not translate neatly to either DTD or W3C
Schema
Today we would still opt for a DTD but Relax NG may become a schema of
choice for “text” (as opposed to “data”) content in the future

DESIGN DECISIONS: more
Converting manuscript to XML
▪ vendor vs. in-house
Copyediting manuscript
▪ in author-submitted format vs. in XML
Validator
▪ XML instance may be valid but not correct or even meaningful [Rosenblum and
Golfman, 2001]
▪ in addition to validity, also checks datatypes, specific dependencies, and naming
conventions
▪ overall, performs 100+ checks on each XML article instance
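The kind of check that goes beyond DTD validity can be sketched as follows. This is a minimal Python illustration, not the AGU Validator (which the slides describe as a Java/XSLT application). The element and attribute names (`pub-date`, `doi`, `related-doi`, `type`), the three rules, and the sample dates in the usage note are all hypothetical stand-ins for the datatype, dependency, and naming-convention checks mentioned above.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical naming convention: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,}/\S+$")


def check_article(xml_text):
    """Return a list of rule violations for one article instance."""
    root = ET.fromstring(xml_text)
    problems = []

    # Datatype check: a DTD treats the date as plain character data.
    pub_date = root.findtext("pub-date", default="")
    try:
        date.fromisoformat(pub_date)
    except ValueError:
        problems.append(f"pub-date is not a valid calendar date: {pub_date!r}")

    # Naming-convention check on the identifier.
    doi = root.findtext("doi", default="")
    if not DOI_PATTERN.match(doi):
        problems.append(f"doi does not follow the expected pattern: {doi!r}")

    # Dependency check: a correction must name the article it corrects.
    if root.get("type") == "correction" and root.find("related-doi") is None:
        problems.append("correction article lacks a related-doi element")

    return problems
```

An instance such as `<article type="correction"><doi>2003JD004468</doi><pub-date>2004-02-30</pub-date></article>` could pass a DTD that declares these elements, yet the sketch reports three problems for it.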
Metadata Loader
▪ extracts bibliographic and reference metadata and loads them to MDDB
▪ Validator and Loader: Java/XSLT applications – portability
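The extract-and-load step can be sketched in a few lines. The following Python example is an illustration only: the actual Loader is a Java/XSLT application, and the element names, table layout, and the choice of an in-memory SQLite database are assumptions made for this sketch. The sample record reuses the citation quoted later in this presentation.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Sample instance in a hypothetical markup (not the AGU-article DTD).
SAMPLE = """<article>
  <doi>10.1029/2003JD004468</doi>
  <title>Electric field measurements in noctilucent clouds</title>
  <journal>J. Geophys. Res.</journal>
  <year>2004</year>
  <authors>
    <author>R. H. Holzworth</author>
    <author>R. A. Goldberg</author>
  </authors>
</article>"""


def make_db():
    """Create an in-memory stand-in for the metadata database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article (doi TEXT PRIMARY KEY, title TEXT, "
        "journal TEXT, year INTEGER)"
    )
    conn.execute("CREATE TABLE author (doi TEXT, position INTEGER, name TEXT)")
    return conn


def load_article(conn, xml_text):
    """Extract bibliographic metadata from one article and insert it."""
    root = ET.fromstring(xml_text)
    doi = root.findtext("doi")
    conn.execute(
        "INSERT INTO article (doi, title, journal, year) VALUES (?, ?, ?, ?)",
        (doi, root.findtext("title"), root.findtext("journal"),
         int(root.findtext("year"))),
    )
    # Keep author order, since it matters for citations and indices.
    conn.executemany(
        "INSERT INTO author (doi, position, name) VALUES (?, ?, ?)",
        [(doi, i, a.text) for i, a in enumerate(root.iter("author"), start=1)],
    )
    conn.commit()
    return doi
```

Calling `load_article(make_db(), SAMPLE)` stores one article row and two author rows keyed by the DOI.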
MDDB and its reporting, linking, and metadata dissemination modules
▪ information products and services: online ToCs (updated daily), printed issue
ToCs, author and subject indices, “virtual journals”
▪ metadata deposits, response pages, and citation linking
▪ business reports for the managers
▪ MDDB: relational vs. native XML DB

LESSONS LEARNED: 1
DOI
▪ Decide what your DOI identifies: abstraction/manifestation, format, extent,
granularity
▪ DOI format: “dumb” vs. “intelligent”, based on volume/issue/page vs. tracking
number, etc.
Version of record
▪ XML, if the goal is to separate content from presentation
▪ rendering preserved content in a variety of formats/devices/media
▪ archiving textual source and non-textual components
“Published as ready”: journal model deconstructed
▪ each article appears online as soon as its production cycle is completed,
assembled into printed issues and mailed later
▪ if article is published (online) in one calendar year and printed in the next, what is
its year and volume? Should your XML schema account for the difference?
▪ set of articles selected on the basis of user- or publisher-defined criteria may cut
across journals → “journal” is but one of many collections. Any collection is just a
query executed against MDDB
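The idea that a journal and a "virtual journal" are the same kind of object, a query result, can be illustrated with a small relational sketch. It assumes a relational store purely for illustration; the table layout, subject labels, and placeholder DOIs below are hypothetical and are not taken from the AGU MDDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (doi TEXT PRIMARY KEY, journal TEXT, "
    "year INTEGER, subject TEXT)"
)
# Placeholder records; the DOIs are deliberately not real identifiers.
conn.executemany(
    "INSERT INTO article VALUES (?, ?, ?, ?)",
    [
        ("10.0000/example-1", "JGR", 2004, "clouds"),
        ("10.0000/example-2", "GRL", 2004, "clouds"),
        ("10.0000/example-3", "JGR", 2004, "seismology"),
    ],
)

ALLOWED_COLUMNS = {"journal", "year", "subject"}


def collection(conn, column, value):
    """Return the DOIs of every article whose `column` equals `value`."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unsupported selection criterion: {column}")
    cur = conn.execute(
        f"SELECT doi FROM article WHERE {column} = ? ORDER BY doi", (value,)
    )
    return [row[0] for row in cur]


journal_view = collection(conn, "journal", "JGR")     # a traditional journal
virtual_view = collection(conn, "subject", "clouds")  # cuts across journals
```

Both views come from the same function; only the selection criterion differs.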

LESSONS LEARNED: 2
Page numbers and article IDs
▪ at the time of article publishing it is not always possible to predict accurately what
its continuous pagination within the printed issue will be
▪ waiting until a printed issue is assembled and then adding page numbers to
article representations creates discrepancy in how an article is to be cited and
runs contrary to the principle that an article must not be changed after it is
published
▪ abandoning page numbers altogether is not an option because many A&I
services may need them for the purposes of citation tracking and metadata
resolution (Thomson ISI, CrossRef)
▪ as long as a printed issue exists, the reader needs a means of finding a particular
article within it
AGU solution: smart Article ID (“citation number”)
Citation: Holzworth, R. H., and R. A. Goldberg (2004), Electric field measurements in
noctilucent clouds, J. Geophys. Res., 109, D16203, doi:10.1029/2003JD004468.
D16203 is a citation number, where
D   part D (Atmospheres) of Journal of Geophysical Research (JGRD)
16  issue number
2   “Aerosols and Clouds” subset of JGRD
03  article sequence within the subset
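The breakdown above can be expressed as a small parser. This Python sketch assumes that every citation number has the same layout as the D16203 example (one letter, two-digit issue, one-digit subset, two-digit sequence); the slides show only this one example, so the pattern and the field names are assumptions.

```python
import re
from typing import NamedTuple


class CitationNumber(NamedTuple):
    journal_part: str  # e.g. "D" for JGR part D (Atmospheres)
    issue: int         # issue number
    subset: int        # subject subset within the journal part
    sequence: int      # article sequence within the subset


# Assumed layout: letter + 2-digit issue + 1-digit subset + 2-digit sequence.
PATTERN = re.compile(r"^([A-Z])(\d{2})(\d)(\d{2})$")


def parse_citation_number(text):
    """Split a citation number such as 'D16203' into its components."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a citation number in the assumed layout: {text!r}")
    part, issue, subset, sequence = match.groups()
    return CitationNumber(part, int(issue), int(subset), int(sequence))
```

For the example citation, `parse_citation_number("D16203")` yields part `D`, issue 16, subset 2, sequence 3.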

LESSONS LEARNED: 3
Page numbers and article IDs (cont’d)
▪ all metadata needed for the citation are in the version of record (XML), as well as
in HTML and PDF → article can be consistently cited as soon as it is published!
▪ A&I services can use either page numbers (each article begins with a page
number 1, though), or a citation number, or both
▪ citation numbers appear in print → easy for the reader to locate an article within
a printed issue
▪ citation numbers in most cases follow the physical sequence of articles within an
issue or its unit, but may occasionally deviate from it → AGU has the flexibility to
deal with exceptions to a regular publishing flow
Special characters
▪ XML is the version of record, and &eacute; is more readable than &#x000E9;
▪ an XML instance with Unicode points can always be produced simply by running
a validating parser
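The entity-to-Unicode step can be demonstrated with a short Python sketch. Note the assumptions: the standard-library parser used here is not a validating parser, and the entity is declared in an internal DTD subset so that the example is self-contained. In a production setup the declarations would normally sit in the external DTD, which requires a parser that loads it.

```python
import xml.etree.ElementTree as ET

# The entity declaration is placed in the internal subset for illustration.
SOURCE = """<!DOCTYPE p [
  <!ENTITY eacute "&#xE9;">
]>
<p>Andr&eacute;</p>"""

root = ET.fromstring(SOURCE)

# After parsing, the named entity has become the Unicode character itself.
as_text = root.text

# Serializing as ASCII (the default) emits a numeric character reference.
as_ascii = ET.tostring(root)

# Serializing as a Unicode string keeps the literal character.
as_unicode = ET.tostring(root, encoding="unicode")
```

The readable `&eacute;` form stays in the version of record, while either serialized form can be generated on demand for systems that do not read the DTD.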

LESSONS LEARNED: 4
Math
● Tagging math
▪ MathML
▪ LaTeX
▪ link to an image
● Rendering math
▪ displaying MathML in a browser
▪ providing an image
● Problems with MathML [Gaylord, 2004]
▪ MathML Presentation vs. MathML Content
▪ MathML Presentation verbosity (debugging problems)
▪ Firefox & Netscape display MathML natively, IE 6.0 needs plug-in. Opera, etc.?
▪ Math Player – Windows only. Mac, Linux, UNIX?
▪ all browsers require additional fonts, yet not all characters can be displayed
MathML as a display format is not an option if multiple browsers/platforms are involved
Using image gives complete control over appearance but math can’t be searched/reused
AGU approach: tag math using LaTeX within XML; convert to GIF for presenting in HTML
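One way to picture the HTML side of this approach is sketched below in Python. The `tex` element name, the file-naming scheme, and the function itself are hypothetical. The sketch only rewrites the markup and collects the LaTeX sources; rendering each formula to a GIF would be done by a separate tool and is not shown.

```python
import xml.etree.ElementTree as ET


def math_to_img(paragraph_xml, prefix="eq"):
    """Replace each <tex> element with an <img> pointing at a GIF file.

    Returns the rewritten markup and a mapping of image file names to the
    LaTeX source that a separate rendering step would turn into images.
    """
    root = ET.fromstring(paragraph_xml)
    formulas = {}
    for n, tex in enumerate(list(root.iter("tex")), start=1):
        name = f"{prefix}{n:02d}.gif"
        latex = tex.text or ""
        formulas[name] = latex
        tail = tex.tail  # clear() also drops the tail text, so save it
        tex.clear()
        tex.tag = "img"
        tex.set("src", name)
        tex.set("alt", latex)  # the LaTeX source stays available as alt text
        tex.tail = tail
    return ET.tostring(root, encoding="unicode"), formulas
```

Because the LaTeX source remains in the XML version of record, the formula text is still available for search and reuse even though the HTML shows only an image.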

RESULTS: 1
Improved productivity
▪ since introduction of the XML-centric workflow, the number of published articles
has increased, while in the journal production department 2 full-time positions
have been eliminated and 25% of staff positions have been downgraded
▪ production time from acceptance to publication has been substantially reduced; it
is now the fastest since 1984 (when records began to be kept). GRL: 5 weeks,
semimonthlies and monthlies – 10 weeks from acceptance to publication
▪ management reporting improved significantly
Automated production of publishing products
▪ printed issue ToCs
▪ end-of-year author and subject indices
Improved quality and value of published product
▪ human error reduced
▪ authors responsible for content only, publisher responsible for accuracy of
articles’ structure and consistency of their appearance
▪ multiple outputs (PDF, HTML, print) produced automatically from the XML source
▪ previously unfeasible checks performed: accuracy of references’ metadata

RESULTS: 2
Automatic production of the Web search repository
▪ metadata and full text automatically extracted from XML into the repository
Automatic archiving
▪ XML, HTML, PDF, non-textual components, and article metadata
Direct data feeds to A&I services
▪ CrossRef, NASA’s Astrophysics Data System (ADS), AIP’s SPIN, etc.
▪ Delays of up to half a year between article publication and metadata
appearance in A&I services were common. Now delivery is instantaneous and fully automated
Reference linking implementation
▪ CrossRef inbound, outbound, and forward linking
Introduction of new information products and services
▪ “virtual journals” (cross-journal article collections)
▪ multimedia content
▪ immediate access to underlying datasets
▪ RSS
Making production process XML-centric has allowed AGU to bring its readers the results
of scientific research of the highest quality in the fastest & most cost-efficient manner
