This presentation was given by Todd Toler of John Wiley & Sons during the NISO Virtual Conference, Convergence: The Web and Publishing Onto the Web, held on May 17, 2017.
2. Why HTML first?
• HTML is the language of the web, continuing to evolve and expand
– More than just scholarly publishing
– Thriving ecosystem of tools and APIs to author, augment, support, and utilize
– Naturally incorporates the Semantic Web
– The notion of the journal article is changing rapidly – how to keep up?
– Allows us to potentially reset for a digital age (industry standards get bloated over time, and things don’t get deprecated)
– Can be the route to a rich online/offline experience (for instance, when EPUB swallows HTML as a portable format)
• HTML-based peer review gives reviewers access to the complete research output, including source data and multimedia that are not well supported in PDF-based workflows
• Enables preservation of data collected before and during the publication process
– e.g., peer review comments are permanently linked to the content (see the markup sketch after this list)
– Can pass through linked data as researchers pipeline their source content (e.g., Jupyter notebooks)
• Simplifies downstream transformation
• Normalized content format simplifies transfer workflows
• Reduces/eliminates errors associated with major format transformations
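A purely illustrative sketch of the peer-review point above (the markup and identifiers below are invented for this summary, not a Wiley format): HTML can give every element a stable identifier, so a review comment stays permanently anchored to the passage it discusses.

  <!-- Illustrative only: a review comment permanently linked to a paragraph -->
  <p id="para-12">The treated samples showed a measurable increase in yield.</p>
  <aside class="peer-review-comment">
    <a href="#para-12">Reviewer 2, on paragraph 12:</a>
    Please clarify whether the increase is relative to the untreated control.
  </aside>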
3. Melville
• Melville is Wiley’s internal HTML standard
– It is intended to be a superset of an eventual Scholarly HTML standard
– Follows many of the same principles, but focused on the needs of content
production
• Differences:
– Polyglot HTML for XML compatibility
– Tools for validation using established XML standards (rather than rules buried in proprietary code)
– Favors established RDF vocabularies over schema.org where appropriate
– Compatible with WileyML, Wiley’s existing XML standard
– Supports conversion to JATS and other syndication formats
• Trade-offs:
– Polyglot markup is needed to use the powerful validation tools available for XML, but this trades away some benefits of HTML (such as iframes and scripting elements); a minimal polyglot sketch follows after this list
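A minimal, hypothetical sketch of what a polyglot (XML-well-formed) HTML fragment with an established RDF vocabulary might look like; the element structure and the Dublin Core terms are illustrative only and do not reproduce the actual Melville or WileyML markup.

  <!-- Illustrative only: not the Melville specification -->
  <article xmlns="http://www.w3.org/1999/xhtml"
           prefix="dc: http://purl.org/dc/terms/">
    <h1 property="dc:title">Example Article Title</h1>
    <p>By <span property="dc:creator">A. Author</span></p>
    <section property="dc:abstract">
      <p>One-paragraph summary of the work.</p>
    </section>
    <!-- Polyglot constraints: lowercase tag names, quoted attribute values, and
         explicitly closed elements, so the same file parses as both HTML and XML -->
    <img src="figure1.png" alt="Figure 1"/>
  </article>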
4. Output
• PDF and EPUB generation occurs directly from the Melville HTML
– We license technology from a vendor for this (Vivliostyle in Japan)
– Uses JavaScript enhancements to CSS Paged Media (see the CSS sketch at the end of this slide)
– Handles math well (MathML + MathJax)
– Creates PDFs, EPUBs, or even paginated HTML
• Some additional meta-tagging is needed before the automation works seamlessly (e.g., image meta-tagging to drive the sizing of images)
• Journal-level standardization to a few templates was a precursor
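An illustration of the general approach only (the selectors and values are assumptions for this sketch, not Wiley’s production stylesheets): CSS Paged Media lets the same HTML drive trim size, running heads, and page numbers, with JavaScript filling the gaps browsers don’t implement.

  <!-- Illustrative CSS Paged Media rules of the kind a renderer like Vivliostyle consumes -->
  <style>
    @page {
      size: A4;                                            /* trim size for the PDF */
      margin: 25mm 20mm;
      @top-center    { content: string(journal-title); }   /* running head */
      @bottom-center { content: counter(page); }           /* page number */
    }
    h1.journal-title { string-set: journal-title content(); }  /* capture the running-head text */
    figure           { break-inside: avoid; }                  /* keep figures on one page */
  </style>
  <!-- MathML in the source is rendered by MathJax for screen and print alike -->
  <script src="MathJax.js"></script>  <!-- load path and configuration are illustrative -->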
6. Automated “dirty” HTML
• We use the Aspose libraries to create a basic HTML model of the MS Word document
– Tried the open source Apache POI, but Aspose better meets our needs
– LaTeX conversions are more straightforward because the source already has inherent structure; authors use LaTeX templates that we supply
• Then we attempt to interpret semantic structure within the document and tag key elements (title, abstract, authors, figure sets, references, etc.), primarily using heuristics (a complex set of rules). We use the GROBID machine-learning library to parse references (a before-and-after sketch follows below).
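A hypothetical before-and-after fragment showing the kind of transformation this step performs; the Word style names and the target markup below are invented for illustration and are not the actual Aspose output or Wiley tagging.

  <!-- "Dirty" HTML as it might come out of the Word conversion (illustrative) -->
  <p style="font-size:18pt; font-weight:bold">Effects of X on Y</p>
  <p class="MsoNormal">A. Author, B. Author</p>
  <p class="MsoNormal">Abstract. We report that ...</p>

  <!-- After heuristic tagging of key elements (illustrative) -->
  <header>
    <h1>Effects of X on Y</h1>
    <p class="authors"><span class="author">A. Author</span>, <span class="author">B. Author</span></p>
  </header>
  <section role="doc-abstract">
    <p>We report that ...</p>
  </section>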