XML-CENTRIC WORKFLOW
OFFERS BENEFITS TO SCHOLARLY PUBLISHERS
Alexander (‘Sasha’) Schwarzman, AGU
<sschwarzman@agu.org>
Hyunmin Hur, DocsDoc
Shu-Li Pai, AGU
Carter M. Glass, AGU
XML 2004
Washington, D.C.
18 November 2004
PRESENTATION OVERVIEW
● Introduction
● Publishing requirements in a dual paper–electronic world
▪ AGU manuscript production modes in 2001
▪ System architecture and workflow in 2004
● Design decisions
▪ Schema language: DTD vs. W3C Schema and Relax NG
▪ Copyediting manuscript: in author-submitted format vs. in XML
▪ Converting manuscript to XML: vendor vs. in-house
▪ Validating XML instance: beyond a validating parser
▪ Extracting and loading metadata into a metadata database (MDDB)
▪ MDDB-based information products and services
▪ Choice of DB technology and programming languages
● Lessons learned
▪ DOI: what it identifies and what its format should be
▪ Version of record
▪ “Published as ready”: journal model deconstructed
▪ Page numbers and article IDs
▪ Special characters and math
● Results
INTRODUCTION
● AGU
▪ Nonprofit multidisciplinary scientific society; 41,000 members from 130 countries
▪ Focus: organization and dissemination of scientific info in interdisciplinary field of
geophysics (atmospheric, oceanic, solid Earth, hydrologic, and space sciences)
▪ Publishes 14 high-impact English language journals, ~4500 articles annually
● Manuscript life cycle
▪ Production
◦ copyediting (copy editor)
◦ proofing (proofreader and author)
◦ correcting (vendor)
◦ publishing (production coordinator)
While AGU has introduced radical changes in how a manuscript is handled at all stages of its life
cycle, in this presentation we will concentrate on the post-acceptance part of the publishing process.
PUBLISHING REQUIREMENTS IN A PAPER–ELECTRONIC WORLD
● Multiple outputs
▪ journal article has to appear in multiple formats: print, PDF, HTML, …
● Search
▪ both article’s metadata and its full text have to be searchable
● Linking
▪ bibliographic references, external datasets, inter- and intra-article linking
● Dynamic content
▪ authors have to be able to include multimedia objects, such as videos or
animations, into their articles
● Cross-journal products
▪ ability to create collections cutting across journals (“virtual” journals)
● Customization
● Metadata sharing
● Preservation of scientific content
▪ ability to preserve scientific content in a readable nonproprietary format
for the foreseeable future
AGU MANUSCRIPT PRODUCTION MODES IN 2001
● Camera-ready copy (CRC)
▪ no electronic copy → no reuse/repurposing of scientific content
▪ authors prepared production files → inconsistent quality of published product
▪ metadata-based products – issue ToCs, author and subject indices, AGU bib.
database EASI – created manually (rekeying). Still, no abstracts in EASI!
▪ no article in electronic form → printed issues mailed to A&I services, metadata
rekeyed. Delay between issue publication and A&I information availability
● Typeset manuscript
▪ two journals wholly typeset in XyVision. PDF, HTML, and ISO 12083 SGML
generated from proprietary typesetting system
▪ 1997: GRL authors given an option to submit in LaTeX, which was converted to
HTML and PDF in-house and posted on the Web
● SGML markup of the electronic manuscript file
▪ 1997: Earth Interactions – marked up in SGML in accordance with own DTD
▪ 1999: Geochemistry, Geophysics, Geosystems – marked up with a variation of
ISO 12083 SGML DTD
▪ 2000: Global Biogeochemical Cycles – partial use of AGU Article SGML DTD
CRC publishing model is a dead end. Disparate production modes counterproductive.
Solution: unified XML-centric process to cut costs, provide services, and stay competitive
SYSTEM ARCHITECTURE AND WORKFLOW: 1
SYSTEM ARCHITECTURE AND WORKFLOW: 2
Custom software
▪ AGU-article XML DTD
▪ AGU Validator
▪ XML conversion tool
▪ Metadata Loader (bib
and ref metadata
extractor and inserter)
▪ AGU metadata
database (AGU MDDB)
▪ reporting, linking, and
metadata
dissemination modules
▪ full-text database and
search engine
DESIGN DECISIONS: DTD vs. W3C Schema and Relax NG
DTD advantages [Beck and Lapeyre, 2003]
● Technical
▪ parameter entity mechanism: modular design, inclusion of DTDs (CALS,
MathML), maintainability, scalability, customization
▪ availability of processing tools
▪ consistency of validity checking among parsers
● Business
▪ vendor community: taggers, compositors, conversion shops
▪ aggregators, archives
● Practical
▪ XML character entities: &eacute; vs. Unicode point &#x000E9; (readability)
DTD disadvantages
▪ not in XML syntax
▪ lacks strong datatyping
DTD: publisher-specific vs. industry-standard
▪ no suitable DTD at the time – AGU developed its own to meet the Requirements
▪ emerging industry standard – NCBI/NLM DTD http://dtd.nlm.nih.gov/publishing/
DESIGN DECISIONS: non-DTD schema languages – W3C Schema
W3C Schema
● Ask yourself:
▪ schema must be in XML syntax?
▪ strong datatyping, full namespace support for elements and attributes essential?
▪ min/max mechanism for elements essential?
▪ content models similar enough to make use of derivation by
extension/restriction?
▪ developers and vendors okay with inconsistency of tools?
● Cons
▪ in scholarly publishing content models are diverse → not derivable
▪ mixed content → min/max, regular expressions not usable
▪ difficult to modularize, scale, maintain
▪ no XML character entities
DESIGN DECISIONS: non-DTD schema languages – Relax NG
Relax NG
● Pros
▪ has both XML and compact (non-XML) syntax
▪ combines intelligibility of DTD with datatyping capability of W3C Schema
▪ provides for context-sensitive content models
▪ can validate documents of different types using a combined schema
▪ DocBook and TEI converted to Relax NG in the past year
● Cons
▪ not as well-established as W3C Schema, number of tools available is limited
▪ does not support XML character entities → you must also have a DTD
▪ does not permit Formal Public Identifiers (FPI)
▪ Relax NG-specific features may not translate neatly to either DTD or W3C
Schema
Today we would still opt for a DTD but Relax NG may become a schema of
choice for “text” (as opposed to “data”) content in the future
DESIGN DECISIONS: more
Converting manuscript to XML
▪ vendor vs. in-house
Copyediting manuscript
▪ in author-submitted format vs. in XML
Validator
▪ XML instance may be valid but not correct or even meaningful [Rosenblum and
Golfman, 2001]
▪ in addition to validity, also checks datatypes, specific dependencies, and naming
conventions
▪ overall, performs 100+ checks on each XML article instance
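The idea that an instance can be valid yet still wrong can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the element names (`article`, `doi`, `volume`, `year`), the DOI pattern, and the three rules are hypothetical and are not taken from the AGU DTD or from the AGU Validator, which the slides describe as a Java/XSLT application.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical naming convention for a DOI string (assumption, not AGU's rule).
DOI_PATTERN = re.compile(r"^10\.\d{4}/\S+$")


def check_article(xml_text):
    """Return a list of human-readable problems found in one article instance."""
    problems = []
    root = ET.fromstring(xml_text)

    # Datatype check: a DTD can only say that <volume> contains text.
    volume = root.findtext("volume", default="")
    if not volume.isdigit():
        problems.append(f"volume is not numeric: {volume!r}")

    # Naming-convention check on the DOI string.
    doi = root.findtext("doi", default="")
    if not DOI_PATTERN.match(doi):
        problems.append(f"DOI does not match expected pattern: {doi!r}")

    # Dependency check between two fields (assumed rule, for illustration):
    # a four-digit year at the start of the DOI suffix must not be later
    # than the publication year.
    year = root.findtext("year", default="")
    m = re.search(r"/(\d{4})", doi)
    if m and year.isdigit() and int(m.group(1)) > int(year):
        problems.append("year in DOI is later than publication year")

    return problems
```

A DTD-valid instance passes the parser in every case below; only the additional checks tell the good instance from the bad ones.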
Metadata Loader
▪ extracts bibliographic and reference metadata and loads them to MDDB
▪ Validator and Loader: Java/XSLT applications – portability
MDDB and its reporting, linking, and metadata dissemination modules
▪ information products and services: online ToCs (updated daily), printed issue
ToCs, author and subject indices, “virtual journals”
▪ metadata deposits, response pages, and citation linking
▪ business reports for the managers
▪ MDDB: relational vs. native XML DB
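The Metadata Loader step on this slide (extract bibliographic and reference metadata, load it into the MDDB) can be sketched in Python with an in-memory SQLite database standing in for the MDDB. The table layout, the element names (`doi`, `title`, `journal`, `ref`), and the function names are hypothetical; the slides state that the actual Loader is a Java/XSLT application and do not describe its schema.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_db():
    """Create an in-memory stand-in for the metadata database (MDDB)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article (doi TEXT PRIMARY KEY, title TEXT, journal TEXT)"
    )
    conn.execute(
        "CREATE TABLE cited_ref (citing_doi TEXT, seq INTEGER, citation TEXT)"
    )
    return conn


def load_article(conn, xml_text):
    """Extract bibliographic and reference metadata and insert it into the DB."""
    root = ET.fromstring(xml_text)
    doi = root.findtext("doi")

    # Bibliographic metadata: one row per article.
    conn.execute(
        "INSERT INTO article (doi, title, journal) VALUES (?, ?, ?)",
        (doi, root.findtext("title"), root.findtext("journal")),
    )

    # Reference metadata: one row per cited reference, in document order.
    conn.executemany(
        "INSERT INTO cited_ref (citing_doi, seq, citation) VALUES (?, ?, ?)",
        [(doi, i, ref.text) for i, ref in enumerate(root.iter("ref"), start=1)],
    )
    conn.commit()
    return doi
```

Once the metadata sits in relational tables, products such as ToCs and indices reduce to queries over those tables, which is one argument for the relational option named in the last bullet.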
LESSONS LEARNED: 1
DOI
▪ Decide what your DOI identifies: abstraction/manifestation, format, extent,
granularity
▪ DOI format: “dumb” vs. “intelligent”, based on volume/issue/page vs. tracking
number, etc.
Version of record
▪ XML, if the goal is to separate content from presentation
▪ rendering preserved content in a variety of formats/devices/media
▪ archiving textual source and non-textual components
“Published as ready”: journal model deconstructed
▪ each article appears online as soon as its production cycle is completed,
assembled into printed issues and mailed later
▪ if article is published (online) in one calendar year and printed in the next, what is
its year and volume? Should your XML schema account for the difference?
▪ set of articles selected on the basis of user- or publisher-defined criteria may cut
across journals → “journal” is but one of many collections. Any collection is just a
query executed against MDDB
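The statement that any collection is just a query can be made concrete with a minimal Python sketch. The table, its columns, and the three rows are made up for illustration and do not reflect the real MDDB schema; SQLite stands in for whatever database AGU uses.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory table with made-up article metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article "
        "(article_id TEXT, journal TEXT, volume INTEGER, topic TEXT)"
    )
    conn.executemany(
        "INSERT INTO article VALUES (?, ?, ?, ?)",
        [
            ("a1", "Journal A", 109, "clouds"),
            ("a2", "Journal A", 109, "seismology"),
            ("b1", "Journal B", 31, "clouds"),
        ],
    )
    return conn


def journal_volume(conn, journal, volume):
    """The traditional collection: every article in one journal volume."""
    rows = conn.execute(
        "SELECT article_id FROM article "
        "WHERE journal = ? AND volume = ? ORDER BY article_id",
        (journal, volume),
    )
    return [r[0] for r in rows]


def virtual_journal(conn, topic):
    """A cross-journal collection: every article on a topic, in any journal."""
    rows = conn.execute(
        "SELECT article_id FROM article WHERE topic = ? ORDER BY article_id",
        (topic,),
    )
    return [r[0] for r in rows]
```

In this sketch a journal volume and a "virtual journal" differ only in the WHERE clause, which is the sense in which the journal becomes one collection among many.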
LESSONS LEARNED: 2
Page numbers and article IDs
▪ at the time of article publishing it is not always possible to predict accurately what
its continuous pagination within the printed issue will be
▪ waiting until a printed issue is assembled and then adding page numbers to
article representations creates discrepancy in how an article is to be cited and
runs contrary to the principle that an article must not be changed after it is
published
▪ abandoning page numbers altogether is not an option because many A&I
services may need them for the purposes of citation tracking and metadata
resolution (Thomson ISI, CrossRef)
▪ as long as a printed issue exists, the reader needs a means of finding a particular
article within it
AGU solution: smart Article ID (“citation number”)
Citation: Holzworth, R. H., and R. A. Goldberg (2004), Electric field measurements in
noctilucent clouds, J. Geophys. Res., 109, D16203, doi:10.1029/2003JD004468.
D16203 is a citation number, where
D   part D (Atmospheres) of Journal of Geophysical Research (JGRD)
16  issue number
2   “Aerosols and Clouds” subset of JGRD
03  article sequence within the subset
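The breakdown of D16203 can be expressed as a short Python parser. The field widths (one letter, two digits, one digit, two digits) are inferred from this single example and are an assumption; citation numbers for other AGU journals may be laid out differently. The names `CitationNumber` and `parse_citation_number` are hypothetical.

```python
import re
from typing import NamedTuple


class CitationNumber(NamedTuple):
    journal_part: str  # e.g. "D" for JGR part D (Atmospheres)
    issue: int         # issue number
    subset: int        # subject subset within the journal part
    sequence: int      # article sequence within the subset


# Layout assumed from the D16203 example: letter, 2 digits, 1 digit, 2 digits.
_PATTERN = re.compile(r"^([A-Z])(\d{2})(\d)(\d{2})$")


def parse_citation_number(text):
    """Split a citation number such as 'D16203' into its four components."""
    m = _PATTERN.match(text)
    if m is None:
        raise ValueError(f"not a recognized citation number: {text!r}")
    part, issue, subset, seq = m.groups()
    return CitationNumber(part, int(issue), int(subset), int(seq))
```

For example, `parse_citation_number("D16203")` yields part "D", issue 16, subset 2, and sequence 3, matching the table above.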
LESSONS LEARNED: 3
Page numbers and article IDs (cont’d)
▪ all metadata needed for the citation are in the version of record (XML), as well as
in HTML and PDF → article can be consistently cited as soon as it is published!
▪ A&I services can use either page numbers (each article begins with a page
number 1, though), or a citation number, or both
▪ citation numbers appear in print → easy for the reader to locate an article within
a printed issue
▪ citation numbers in most cases follow the physical sequence of articles within an
issue or its unit, but may occasionally deviate from it → AGU has the flexibility to
deal with exceptions to a regular publishing flow
Special characters
▪ XML is the version of record, and &eacute; is more readable than &#x000E9;
▪ an XML instance with Unicode points can always be produced simply by running
a validating parser
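The conversion from named entities to Unicode points can be sketched in Python. This is an illustration under assumptions: the `para` element and the sample text are made up, the entity is declared in an internal DTD subset so that the example is self-contained, and the standard-library parser used here is non-validating (it expands entities declared in the internal subset, which is all this example needs).

```python
import xml.etree.ElementTree as ET

# Source instance keeps the readable named entity; the declaration maps it
# to the Unicode code point U+00E9.
SOURCE = """<?xml version="1.0"?>
<!DOCTYPE para [
  <!ENTITY eacute "&#xE9;">
]>
<para>Universit&eacute; de Montr&eacute;al</para>"""


def to_unicode_points(xml_text):
    """Parse the instance (expanding entities) and re-serialize it as ASCII.

    Characters outside ASCII come out as numeric character references.
    """
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="us-ascii")
```

Parsing yields the text "Université de Montréal"; serializing to ASCII writes each é as a decimal reference (`&#233;`), which is equivalent to the hexadecimal form `&#xE9;`.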
LESSONS LEARNED: 4
Math
● Tagging math
▪ MathML
▪ LaTeX
▪ link to an image
● Rendering math
▪ displaying MathML in a browser
▪ providing an image
● Problems with MathML [Gaylord, 2004]
▪ MathML Presentation vs. MathML Content
▪ MathML Presentation verbosity (debugging problems)
▪ Firefox & Netscape display MathML natively, IE 6.0 needs plug-in. Opera, etc.?
▪ MathPlayer – Windows only. Mac, Linux, UNIX?
▪ all browsers require additional fonts, yet not all characters can be displayed
MathML as a display format is not an option if multiple browsers/platforms are involved
Using image gives complete control over appearance but math can’t be searched/reused
AGU approach: tag math using LaTeX within XML; convert to GIF for presenting in HTML
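The approach of keeping LaTeX in the XML and showing an image in HTML can be sketched in Python. The `tex` element name, the file-naming scheme, and the function name are hypothetical, and the actual rendering of LaTeX to GIF is left to an external tool that this sketch does not call; it only shows how the searchable source and the rendered view can be kept apart.

```python
import hashlib
import xml.etree.ElementTree as ET


def extract_math(xml_text):
    """Replace each <tex> element with an <img> and collect rendering jobs.

    Returns (root, jobs), where jobs maps an image file name to the LaTeX
    source that an external renderer would have to convert to that file.
    """
    root = ET.fromstring(xml_text)
    jobs = {}
    # Materialize the list first because the elements are edited in place.
    for tex in list(root.iter("tex")):
        latex = tex.text or ""
        # Derive a stable file name from the LaTeX source itself.
        name = hashlib.sha1(latex.encode("utf-8")).hexdigest()[:12] + ".gif"
        jobs[name] = latex
        # Turn the element into an image reference; the tail text is kept.
        tex.tag = "img"
        tex.text = None
        tex.attrib.clear()
        tex.set("src", name)
        tex.set("alt", latex)
    return root, jobs
```

Because the XML version of record is never modified, the LaTeX stays available for search and reuse even though readers of the HTML see only the image.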
RESULTS: 1
Improved productivity
▪ since introduction of the XML-centric workflow, the number of published articles
has increased, while in the journal production department 2 full-time positions
have been eliminated and 25% of staff positions have been downgraded
▪ production time from acceptance to publication has been substantially reduced; it
is now the fastest since 1984 (when records began to be kept). GRL: 5 weeks,
semimonthlies and monthlies – 10 weeks from acceptance to publication
▪ management reporting improved significantly
Automated production of publishing products
▪ printed issue ToCs
▪ end-of-year author and subject indices
Improved quality and value of published product
▪ human error reduced
▪ authors responsible for content only, publisher responsible for accuracy of
articles’ structure and consistency of their appearance
▪ multiple outputs (PDF, HTML, print) produced automatically from the XML source
▪ previously unfeasible checks performed: accuracy of references’ metadata
RESULTS: 2
Automatic production of the Web search repository
▪ metadata and full text automatically extracted from XML into the repository
Automatic archiving
▪ XML, HTML, PDF, non-textual components, and article metadata
Direct data feeds to A&I services
▪ CrossRef, NASA’s Astrophysics Data System (ADS), AIP’s SPIN, etc.
▪ Previously, up to half a year could pass between article publication and the
appearance of its metadata in A&I services. Delivery is now immediate and fully automated
Reference linking implementation
▪ CrossRef inbound, outbound, and forward linking
Introduction of new information products and services
▪ “virtual journals” (cross-journal article collections)
▪ multimedia content
▪ immediate access to underlying datasets
▪ RSS
Making the production process XML-centric has allowed AGU to bring its readers the results
of scientific research of the highest quality in the fastest and most cost-efficient manner

XML2004

  • 1. XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLARLY PUBLISHERS Alexander (‘Sasha’) Schwarzman, AGU <sschwarzman@agu.org> Hyunmin Hur, DocsDoc Shu-Li Pai, AGU Carter M. Glass, AGU XML 2004 Washington, D.C. 18 November 2004
  • 2. XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLAR2 PRESENTATION OVERVIEW ● Introduction ● Publishing requirements in a dual paper–electronic world ▪ AGU manuscript production modes in 2001 ▪ System architecture and workflow in 2004 ● Design decisions ▪ Schema language: DTD vs. W3C Schema and Relax NG ▪ Copyediting manuscript: in author-submitted format vs. in XML ▪ Converting manuscript to XML: vendor vs. in-house ▪ Validating XML instance: beyond a validating parser ▪ Extracting and loading metadata into a metadata database (MDDB) ▪ MDDB-based information products and services ▪ Choice of DB technology and programming languages ● Lessons learned ▪ DOI: what does it identify and what its format should be ▪ Version of record ▪ “Published as ready”: journal model deconstructed ▪ Page numbers and article IDs ▪ Special characters and math ● Results
  • 3. XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLAR3 INTRODUCTION ● AGU ▪ Nonprofit multidisciplinary scientific society; 41,000 members from 130 countries ▪ Focus: organization and dissemination of scientific info in interdisciplinary field of geophysics (atmospheric, oceanic, solid Earth, hydrologic, and space sciences) ▪ Publishes 14 high-impact English language journals, ~4500 articles annually ● Manuscript life cycle ▪ Production ◦ copyediting (copy editor) ◦ proofing (proofreader and author) ◦ correcting (vendor) ◦ publishing (production coordinator) While AGU has introduced radical changes in how a manuscript is handled at all stages of its life cycle, in this presentation we will concentrate on the post-acceptance part of the publishing process.
  • 4. XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLAR4 PUBLISHING REQUIREMENTS IN A PAPER–ELECTRONIC WORLD ● Multiple outputs ▪ journal article has to appear in multiple formats: print, PDF, HTML, … ● Search ▪ both article’s metadata and its full text have to be searchable ● Linking ▪ bibliographic references, external datasets, inter- and intra-article linking ● Dynamic content ▪ authors have to be able to include multimedia objects, such as videos or animations, into their articles ● Cross-journal products ▪ ability to create collections cutting across journals (“virtual” journals) ● Customization ● Metadata sharing ● Preservation of scientific content ▪ ability to preserve scientific content in a readable nonproprietary format for the foreseeable future
  • 5. XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLAR5 AGU MANUSCRIPT PRODUCTION MODES IN 2001 ● Camera-ready copy (CRC) ▪ no electronic copy → no reuse/repurposing of scientific content ▪ authors prepared production files → inconsistent quality of published product ▪ metadata-based products – issue ToCs, author and subject indices, AGU bib. database EASI – created manually (rekeying). Still, no abstracts in EASI ! ▪ no article in electronic form → printed issues mailed to A&I services, metadata rekeyed. Delay between issue publication and A&I information availability ● Typeset manuscript ▪ two journals wholly typeset in XyVision. PDF, HTML, and ISO 12083 SGML generated from proprietary typesetting system ▪ 1997: GRL authors given an option to submit in LaTeX, which was converted to HTML and PDF in-house and posted on the Web ● SGML markup of the electronic manuscript file ▪ 1997: Earth Interactions – marked up in SGML in accordance with own DTD ▪ 1999: Geochemistry, Geophysics, Geosystems – marked up with a variation of ISO 12083 SGML DTD ▪ 2000: Global Biogeochemical Cycles – partial use of AGU Article SGML DTD CRC publishing model is a dead end. Disparate production modes counterproductive. Solution: unified XML-centric process to cut costs, provide services, and stay competitive
  • 6. XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLAR6 SYSTEM ARCHITECTURE AND WORKFLOW: 1
  • 7. XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLAR7 SYSTEM ARCHITECTURE AND WORKFLOW: 2 Custom software ▪ AGU-article XML DTD ▪ AGU Validator ▪ XML conversion tool ▪ Metadata Loader (bib and ref metadata extractor and inserter) ▪ AGU metadata database (AGU MDDB) ▪ reporting, linking, and metadata dissemination modules ▪ full-text database and search engine
  • 8. XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLAR8 DESIGN DECISIONS: DTD vs. W3C Schema and Relax NG DTD advantages [Beck and Lapeyre, 2003] ● Technical ▪ parameter entity mechanism: modular design, inclusion of DTDs (CALS, MathML), maintainability, scalability, customization ▪ availability of processing tools ▪ consistency of validity checking among parsers ● Business ▪ vendor community: taggers, compositors, conversion shops ▪ aggregators, archives ● Practical ▪ XML character entities: &eacute; vs. Unicode point &#x000E9 (readability) DTD disadvantages ▪ not in XML syntax ▪ lacks strong datatyping DTD: publisher-specific vs. industry-standard ▪ no suitable DTD at the time – AGU developed its own to meet the Requirements ▪ emerging industry standard – NCBI/NLM DTD http://dtd.nlm.nih.gov/publishing/
  • 9. DESIGN DECISIONS: non-DTD schema languages – W3C Schema
      W3C Schema
      ● Ask yourself:
        ▪ must the schema be in XML syntax?
        ▪ are strong datatyping and full namespace support for elements and attributes essential?
        ▪ is a min/max occurrence mechanism for elements essential?
        ▪ are the content models similar enough to make use of derivation by extension/restriction?
        ▪ are developers and vendors okay with inconsistency among tools?
      ● Cons
        ▪ in scholarly publishing content models are diverse → not derivable
        ▪ mixed content → min/max, regular expressions not usable
        ▪ difficult to modularize, scale, maintain
        ▪ no XML character entities
  • 10. DESIGN DECISIONS: non-DTD schema languages – Relax NG
      Relax NG
      ● Pros
        ▪ has both XML and compact (non-XML) syntax
        ▪ combines intelligibility of DTD with datatyping capability of W3C Schema
        ▪ provides for context-sensitive content models
        ▪ can validate documents of different types using a combined schema
        ▪ DocBook and TEI converted to Relax NG in the past year
      ● Cons
        ▪ not as well established as W3C Schema; the number of tools available is limited
        ▪ does not support XML character entities → you must also have a DTD
        ▪ does not permit Formal Public Identifiers (FPI)
        ▪ Relax NG-specific features may not translate neatly to either DTD or W3C Schema
      Today we would still opt for a DTD, but Relax NG may become a schema of choice for “text” (as opposed to “data”) content in the future
  • 11. DESIGN DECISIONS: more
      Converting manuscript to XML
        ▪ vendor vs. in-house
      Copyediting manuscript
        ▪ in author-submitted format vs. in XML
      Validator
        ▪ an XML instance may be valid but not correct or even meaningful [Rosenblum and Golfman, 2001]
        ▪ in addition to validity, the Validator also checks datatypes, specific dependencies, and naming conventions
        ▪ overall, performs 100+ checks on each XML article instance
      Metadata Loader
        ▪ extracts bibliographic and reference metadata and loads them into the MDDB
        ▪ Validator and Loader are Java/XSLT applications, chosen for portability
      MDDB and its reporting, linking, and metadata dissemination modules
        ▪ information products and services: online ToCs (updated daily), printed issue ToCs, author and subject indices, “virtual journals”
        ▪ metadata deposits, response pages, and citation linking
        ▪ business reports for the managers
        ▪ MDDB: relational vs. native XML DB
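The idea that an instance can be valid yet incorrect can be illustrated with a small sketch. It is written in Python for brevity (the AGU Validator itself is described as a Java/XSLT application), and the element names, the DOI pattern, and the three rules are hypothetical, chosen only to show one datatype check, one naming-convention check, and one dependency check of the kind a DTD cannot express.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical naming convention, inferred from the single DOI shown in
# the slides (10.1029/2003JD004468); not an official AGU rule.
DOI_PATTERN = re.compile(r"^10\.1029/\d{4}[A-Z]{2}\d{6}$")


def check_article(xml_text):
    """Return problems that a DTD-validating parser alone would not report."""
    root = ET.fromstring(xml_text)
    problems = []

    # Datatype check: a DTD can only say that <year> contains text.
    year = root.findtext("year", default="")
    if not (year.isdigit() and len(year) == 4):
        problems.append(f"year is not a four-digit number: {year!r}")

    # Naming-convention check: the DOI must follow the expected pattern.
    doi = root.findtext("doi", default="")
    if not DOI_PATTERN.match(doi):
        problems.append(f"DOI does not match the naming convention: {doi!r}")

    # Dependency check (hypothetical rule): an article that has a volume
    # must also carry a citation number.
    if root.find("volume") is not None and root.find("citation-number") is None:
        problems.append("volume present but citation-number missing")

    return problems


# Both instances could satisfy a simple DTD; only the first passes the checks.
good = """<article><doi>10.1029/2003JD004468</doi><year>2004</year>
<volume>109</volume><citation-number>D16203</citation-number></article>"""
bad = """<article><doi>2003JD004468</doi><year>04</year>
<volume>109</volume></article>"""

print(check_article(good))
for problem in check_article(bad):
    print(problem)
```

Returning a list of messages rather than stopping at the first failure mirrors how a production validator would report all problems for one article in a single pass.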
  • 12. LESSONS LEARNED: 1
      DOI
        ▪ Decide what your DOI identifies: abstraction/manifestation, format, extent, granularity
        ▪ DOI format: “dumb” vs. “intelligent”, based on volume/issue/page vs. tracking number, etc.
      Version of record
        ▪ XML, if the goal is to separate content from presentation
        ▪ rendering preserved content in a variety of formats/devices/media
        ▪ archiving textual source and non-textual components
      “Published as ready”: journal model deconstructed
        ▪ each article appears online as soon as its production cycle is completed; articles are assembled into printed issues and mailed later
        ▪ if an article is published (online) in one calendar year and printed in the next, what is its year and volume? Should your XML schema account for the difference?
        ▪ a set of articles selected on the basis of user- or publisher-defined criteria may cut across journals → a “journal” is but one of many collections. Any collection is just a query executed against the MDDB
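The point that any collection is just a query can be sketched with an in-memory relational database. Everything below is an assumption made for illustration: the table layout, the column names, and the sample rows (which use placeholder DOIs and journal names) do not come from the AGU MDDB.

```python
import sqlite3

# Hypothetical, minimal stand-in for a metadata database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE article (
           doi TEXT PRIMARY KEY,
           journal TEXT,
           online_year INTEGER,
           subject TEXT
       )"""
)

# Placeholder rows, invented purely to exercise the queries.
rows = [
    ("10.9999/example-1", "Journal A", 2004, "clouds"),
    ("10.9999/example-2", "Journal A", 2004, "oceans"),
    ("10.9999/example-3", "Journal B", 2004, "clouds"),
    ("10.9999/example-4", "Journal B", 2003, "oceans"),
]
conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", rows)


def collection(where_clause, params=()):
    """Return the DOIs of all articles matching a selection criterion."""
    sql = "SELECT doi FROM article WHERE " + where_clause + " ORDER BY doi"
    return [row[0] for row in conn.execute(sql, params)]


# A traditional journal is one selection criterion ...
journal_a = collection("journal = ?", ("Journal A",))

# ... and a "virtual journal" on a subject is another, cutting across journals.
clouds = collection("subject = ?", ("clouds",))

print(journal_a)
print(clouds)
```

In this sketch the journal and the virtual journal differ only in the WHERE clause, which is the sense in which the journal becomes one collection among many.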
  • 13. LESSONS LEARNED: 2
      Page numbers and article IDs
        ▪ at the time of article publishing it is not always possible to predict accurately what its continuous pagination within the printed issue will be
        ▪ waiting until a printed issue is assembled and then adding page numbers to article representations creates discrepancy in how an article is to be cited and runs contrary to the principle that an article must not be changed after it is published
        ▪ abandoning page numbers altogether is not an option because many A&I services may need them for the purposes of citation tracking and metadata resolution (Thomson ISI, CrossRef)
        ▪ as long as a printed issue exists, the reader needs a means of finding a particular article within it
      AGU solution: smart Article ID (“citation number”)
      Citation: Holzworth, R. H., and R. A. Goldberg (2004), Electric field measurements in noctilucent clouds, J. Geophys. Res., 109, D16203, doi:10.1029/2003JD004468.
      D16203 is a citation number, where
        D: part D (Atmospheres) of Journal of Geophysical Research (JGRD)
        16: issue number
        2: “Aerosols and Clouds” subset of JGRD
        03: article sequence within the subset
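The structure of the citation number in the example can be made explicit with a short parsing sketch. The pattern is an assumption inferred from the single example D16203 (one letter, a two-digit issue, a one-digit subset, a two-digit sequence); other journals may use a different layout, and the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass

# Layout assumed from the one example in the slides: D 16 2 03.
CITATION_NUMBER = re.compile(r"^([A-Z])(\d{2})(\d)(\d{2})$")


@dataclass(frozen=True)
class CitationNumber:
    part: str      # journal part, e.g. "D" for JGR Atmospheres
    issue: int     # issue number
    subset: int    # subject subset within the journal part
    sequence: int  # article sequence within the subset


def parse_citation_number(text):
    """Split a citation number such as 'D16203' into its components."""
    match = CITATION_NUMBER.match(text)
    if match is None:
        raise ValueError(f"not a recognized citation number: {text!r}")
    part, issue, subset, sequence = match.groups()
    return CitationNumber(part, int(issue), int(subset), int(sequence))


parsed = parse_citation_number("D16203")
print(parsed)
```

Because every component is assigned at publication time, none of them depends on the final pagination of the printed issue.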
  • 14. LESSONS LEARNED: 3
      Page numbers and article IDs (cont’d)
        ▪ all metadata needed for the citation are in the version of record (XML), as well as in HTML and PDF → an article can be consistently cited as soon as it is published!
        ▪ A&I services can use either page numbers (although each article begins with page number 1), or a citation number, or both
        ▪ citation numbers appear in print → easy for the reader to locate an article within a printed issue
        ▪ citation numbers in most cases follow the physical sequence of articles within an issue or its unit, but may occasionally deviate from it → AGU has the flexibility to deal with exceptions to a regular publishing flow
      Special characters
        ▪ XML is the version of record, and &eacute; is more readable than &#x000E9;
        ▪ an XML instance with Unicode code points can always be produced simply by running a validating parser
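The conversion from named entities to Unicode code points can be sketched with Python's standard library. Two assumptions apply: the entity is declared in an internal DTD subset so that the example is self-contained (a production DTD would normally pull in external entity sets), and the parser used here (expat, via ElementTree) is non-validating, although it still expands entities declared in the internal subset. The element name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Version-of-record style source: readable named entity, declared in the DTD.
source = """<!DOCTYPE title [
  <!ENTITY eacute "&#xE9;">
]>
<title>R&eacute;sum&eacute;</title>"""

# Parsing expands the named entity into the character it stands for.
root = ET.fromstring(source)

# Serializing to ASCII forces every non-ASCII character to be written
# as a numeric character reference.
numeric = ET.tostring(root, encoding="us-ascii").decode("ascii")

print(numeric)
```

The readable form stays in the archived source, while the numeric form is generated on demand for systems that do not read the DTD.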
  • 15. LESSONS LEARNED: 4
      Math
      ● Tagging math
        ▪ MathML
        ▪ LaTeX
        ▪ link to an image
      ● Rendering math
        ▪ displaying MathML in a browser
        ▪ providing an image
      ● Problems with MathML [Gaylord, 2004]
        ▪ MathML Presentation vs. MathML Content
        ▪ MathML Presentation verbosity (debugging problems)
        ▪ Firefox and Netscape display MathML natively; IE 6.0 needs a plug-in. Opera, etc.?
        ▪ MathPlayer is Windows only. Mac, Linux, UNIX?
        ▪ all browsers require additional fonts, yet not all characters can be displayed
      MathML as a display format is not an option if multiple browsers/platforms are involved
      Using an image gives complete control over appearance, but the math cannot be searched or reused
      AGU approach: tag math using LaTeX within XML; convert to GIF for presentation in HTML
  • 16. RESULTS: 1
      Improved productivity
        ▪ since introduction of the XML-centric workflow, the number of published articles has increased, while in the journal production department 2 full-time positions have been eliminated and 25% of staff positions have been downgraded
        ▪ production time from acceptance to publication has been substantially reduced; it is now the fastest since 1984 (when records began to be kept). GRL: 5 weeks; semimonthlies and monthlies: 10 weeks from acceptance to publication
        ▪ management reporting improved significantly
      Automated production of publishing products
        ▪ printed issue ToCs
        ▪ end-of-year author and subject indices
      Improved quality and value of published product
        ▪ human error reduced
        ▪ authors responsible for content only, publisher responsible for accuracy of articles’ structure and consistency of their appearance
        ▪ multiple outputs (PDF, HTML, print) produced automatically from the XML source
        ▪ previously unfeasible checks performed: accuracy of references’ metadata
  • 17. RESULTS: 2
      Automatic production of the Web search repository
        ▪ metadata and full text automatically extracted from XML into the repository
      Automatic archiving
        ▪ XML, HTML, PDF, non-textual components, and article metadata
      Direct data feeds to A&I services
        ▪ CrossRef, NASA’s Astrophysics Data System (ADS), AIP’s SPIN, etc.
        ▪ previously, up to half a year could pass between article publication and the appearance of its metadata in A&I services. Now delivery is instantaneous and fully automated
      Reference linking implementation
        ▪ CrossRef inbound, outbound, and forward linking
      Introduction of new information products and services
        ▪ “virtual journals” (cross-journal article collections)
        ▪ multimedia content
        ▪ immediate access to underlying datasets
        ▪ RSS
      Making the production process XML-centric has allowed AGU to bring its readers the results of scientific research of the highest quality in the fastest and most cost-efficient manner