HTML	first	Journal	Workflow
Todd	Toler,	John	Wiley	&	Sons
For	NISO,	May	16,	2017
Why	HTML	first?
• HTML is the language of the web, continuing to evolve and expand
– More than just scholarly publishing
– Thriving ecosystem of tools and APIs to author, augment, support, and utilize
– Naturally incorporates the Semantic Web
– The notion of the journal article is changing rapidly – how do we keep up?
– Allows us to potentially reset for a digital age (industry standards get bloated over time; things don't get deprecated)
– Can be the route to a rich online/offline experience (for instance, when EPUB swallows HTML as a portable format)
• HTML-based peer review gives reviewers access to the complete research output, including source data and multimedia that are not well supported in PDF-based workflows
• Enables preservation of data collected before and during the publication process
– e.g., peer review comments are permanently linked to the content
– Can pass through linked data as researchers pipeline their source content (e.g., Jupyter notebooks)
• Simplifies downstream transformation
• Normalized content format simplifies transfer workflows
• Reduces/eliminates errors associated with major format transformations
Melville
• Melville	is	Wiley’s	internal	HTML	standard
– It	is	intended	to	be	a	superset	of	an	eventual	Scholarly	HTML	standard
– Follows	many	of	the	same	principles,	 but	focused	on	the	needs	of	content	
production
• Differences:
– Polyglot	HTML	for	XML	compatibility
– Tools	for	validation	using	established	XML	standards	(rather	than	buried	in	
proprietary	code)
– Favors	established	RDF	vocabularies over	schema.org where	appropriate
– Compatible	with	WileyML,	Wiley’s	existing	XML	standard
– Supports	 conversion	to	JATS	and	other	syndication	formats
• Trade-offs:
– Polyglot markup is needed to use the powerful validation tools available for XML, but it trades off some benefits of HTML (such as iframes and scripting elements)
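The polyglot point can be made concrete: if the HTML is serialized so that it is also well-formed XML (lowercase tags, quoted attributes, every element explicitly closed), then standard XML tooling — parsers, Schematron, XSLT — can validate and transform it directly. A minimal sketch using only Python's standard library; the fragment is an invented Melville-style example, not actual Wiley markup:

```python
import xml.etree.ElementTree as ET

# A hypothetical polyglot fragment: XHTML namespace, lowercase tags,
# quoted attributes, and every element explicitly closed -- exactly the
# properties that let a strict XML parser consume HTML unchanged.
fragment = """\
<article xmlns="http://www.w3.org/1999/xhtml">
  <h1 property="schema:name">An HTML-first Workflow</h1>
  <p>See figure <a href="#fig1">1</a>.<br /></p>
</article>"""

# fromstring() raises ParseError if the markup is not XML-clean,
# so this line *is* the "XML compatibility" check in miniature.
root = ET.fromstring(fragment)
title = root.find("{http://www.w3.org/1999/xhtml}h1").text
```

The same serialization remains valid HTML in a browser, which is the point of the trade-off: one document, two toolchains.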
Output	
• PDF and EPUB generation occurs directly from the Melville HTML
– We license technology from a vendor for this (Vivliostyle in Japan)
– Uses JavaScript enhancements to CSS Paged Media
– Handles math well (MathML + MathJax)
– Creates PDFs, EPUBs, or even paginated HTML
• Some additional meta-tagging is needed before the automation works seamlessly (e.g., image meta-tagging to drive the sizing of images)
• Journal-level standardization to a few templates was a precursor
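The pagination step above is driven by CSS Paged Media rules that a renderer such as Vivliostyle interprets. A minimal illustrative fragment — not Wiley's actual stylesheet; the dimensions and class names are invented:

```css
@page {
  size: 178mm 254mm;          /* trim size; any journal template value */
  margin: 20mm 18mm;
  @top-center { content: string(journal-title); }  /* running head */
  @bottom-center { content: counter(page); }       /* page number */
}
h1.journal-title { string-set: journal-title content(); }
.figure img { max-width: 100%; }  /* image meta-tagging can refine sizing */
```

Rules like these are why journal-level template standardization had to come first: one stylesheet per template paginates every article that conforms to it.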
HTML	ASAP,	Melville	eventually
• Pre-acceptance
– Automated	conversion	to	“dirty”	HTML
– Author	review	&	validation	adds	additional	
structure
– Additional	enrichment	adds	value	that	is	
preserved	throughout	the	workflow
• Post-acceptance
– Final	Melville	prep
– Author's final review & validation (proofing)
Automated	“dirty”	HTML
• We use Aspose libraries to create a basic HTML model of the MS Word document
– Tried the open-source Apache POI, but Aspose better meets our needs
– LaTeX conversions are more straightforward because the documents already have inherent structure; authors use LaTeX templates that we supply
• Then we attempt to interpret semantic structure within the document to tag key elements (title, abstract, authors, figure sets, references, etc.), primarily using heuristics (a complex set of rules). We use the GROBID machine-learning library to parse references.
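The heuristics pass can be pictured as a rule cascade over the converted paragraphs. A deliberately toy sketch — the production rule set is far larger, the role names are invented, and reference strings would be handed to GROBID for fine-grained parsing rather than tagged wholesale:

```python
import re

# Toy heuristic tagger: given plain paragraphs from a converted manuscript,
# guess a coarse role for each one using positional and textual rules.
def tag_paragraphs(paragraphs):
    tagged = []
    in_refs = False
    for i, p in enumerate(paragraphs):
        text = p.strip()
        if i == 0:
            role = "title"                       # first block is usually the title
        elif re.match(r"(?i)^abstract\b", text):
            role = "abstract"
        elif re.match(r"(?i)^references?\s*$", text):
            role = "ref-heading"
            in_refs = True                       # everything after is a reference
        elif in_refs:
            role = "reference"                   # GROBID would parse these further
        else:
            role = "body"
        tagged.append((role, text))
    return tagged
```

Rules like these are brittle — which is exactly the limitation the next slide's computer-vision and CRF experiments are meant to address.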
“Cleaner”	HTML	in	the	future
• The	heuristics	approach	is	
limited	and	we’re	already	
experimenting	with	more	
advanced	methods:
– OpenCV (open-source) library for computer vision. With this, we're working to interpret some structure based on visual features (contours; see image)
– TensorFlow (open-source) Recurrent Neural Network (RNN) tooling and Conditional Random Field (CRF) algorithms for structure identification and entity extraction
Author	review	&	validation	
• The	Author	is	asked	to	review	
and	validate	the	structured	
items
• The	Author	has	simple	tools	to	
correct	any	errors	or	provide	
structure	in	areas	where	the	
algorithm	failed
– Algorithm	quality	improves	over	
time	based	on	Author	feedback
Additional	enrichment
• Author	is	encouraged	to	enrich	
their	submission	with	additional	
detail:
– Source	data
– Computer	code
– Other	external	links/references
– Multimedia
• Staff	validate	(or	fix)	technical	
aspects	of	the	conversion	
including	figure	sets,	tables,	
references,	etc.
• Additional	enrichment	is	
preserved	within	the	HTML	
package	as	it	moves	through	the	
review/decision	workflow
Final	Melville	prep
• After	article	acceptance,	QA	step	fixes	any	
remaining	issues	(expected	to	be	rare)	and	
sends	“final”	Melville	on	to…
• Vendor	copyediting	stage	directly	in	Melville	
(new	tools	being	developed	for	this)
• All editing and peer-review activity is stored as annotations using web annotation standards
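The W3C Web Annotation Data Model gives a concrete shape for storing that feedback against the HTML. A sketch of one copyedit annotation — the URI and text are invented, not real Wiley identifiers:

```python
import json

# One hypothetical copyedit, in W3C Web Annotation (JSON-LD) form.
# The TextQuoteSelector anchors the note to a span of article text,
# which is what makes the annotation "actionable" rather than a loose comment.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "editing",
    "body": {
        "type": "TextualBody",
        "value": "Change 'utilise' to 'utilize' per journal style.",
        "format": "text/plain",
    },
    "target": {
        "source": "https://example.org/articles/12345",  # invented article URI
        "selector": {
            "type": "TextQuoteSelector",
            "exact": "utilise",
        },
    },
}
print(json.dumps(annotation, indent=2))
```

Because the model is a standard, annotations created during copyediting, peer review, and author proofing can all live in one store and survive alongside the article.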
Author's final review/validation
• Author	proofs	an	HTML	(Melville)	
version	of	their	final	article
• No	direct	edits	permitted,	
revisions	are	noted	via	actionable	
annotations	(annotations	are	more	
than	just	comments	in	our	model)
• A	final,	technical	QA	step	follows	
author	proof	approval
Call	to	Action:	Scholarly	HTML
– Scholarly	HTML	community	group was	formed	in	
2016	to	explore	turning	this	into	a	standard;	
chaired	by	Robin	Berjon of	Standard	Analytics	
(Science.ai)
– Greater	participation	is	needed	to	make	this	real
– Need	strong	arguments	to	extend	HTML
– Need	to	extend	open	web	ontology,	schema.org
– If NISO got behind it, we could get built-in scale.
