Lemon8-XML in a Morning
MJ Suhonos, PKP Developer/Librarian
2nd PKP Scholarly Publishing Conference
Vancouver, Canada, Jul 8, 2009
Overview/Contents
• Lemon8-XML: rationale, history, and purpose
• Design overview and technical approach
• Demonstration
• Open Medicine and Lemon8-XML
• Submission / review with PubMed Central
• The role of XML and structured metadata
• Discussion
Current OJS Workflow
1) Author uploads article as, e.g., DOC
2) Reviewers read article as supplied
3) Reviewers/editors exchange comments in OJS
4) Copyeditor corrects information in article
5) Layout editor creates galleys, e.g., HTML/PDF
6) Copyeditor and author proofread galleys
Problems With Current Workflow
• Proprietary format tools are expensive
• Vendor lock-in, e.g., RefMan is Windows-only
• Author must explicitly enter metadata into OJS
• “Hidden” metadata can affect double-blind review
• Reviewers need the same software as the author
• Comments must be submitted in OJS or email
• “Track changes” feature is unreliable
Problems With Current Workflow
• Little assistance is available for the hard parts, e.g., checking citations against indexes
• Binary compatibility issues, e.g., “looks different on my computer”
• Format conversion software (e.g., for PDF) is expensive
• Substantial expertise required for layout editing
• Consistency across articles is difficult to maintain
• Incompatible data, e.g., WordArt, OLE objects
• HTML and PDF are not the best archival formats
Why Do These Problems Exist?
• treating articles and galleys as “black boxes”
• a familiar example:
[Diagram: “Journal”, OJS, “Articles”]
What's In a Document?
[Diagram: “Document” (article, galley, etc.)]
A Simple Document Model
• metadata: title, author information, etc.
• content: sections, images, etc.
• links to related material
Desirable Features of a Document
• non-proprietary
• human readable and editable
• long-term archival quality
• reliable version tracking
• in-context comments
• machine actionable
• composed of objects, not a black box!
Current Formats
• MS-Word: proprietary, not archival quality, unreliable, not machine actionable: FAIL
• PDF: not editable, no version tracking, poor commenting, not machine actionable: FAIL
• OpenDocument: not (really) machine actionable
What is a Machine Actionable Document?
• a “Smart Document”
• based on semantics, not appearance
• composed of objects, not a black box!
  – metadata, content, links to related material, etc.
What Can A Smart Document Do?
• direct interaction with indexes and web services
• layout presentation can be automated
• interactive, in-context user discussion
• complex bibliometrics
• things we can't even think of right now (Web 3.0)
A Machine Actionable Document Format
• HTML (a relative of XML)
• XML is basically “smart” (semantic) HTML
• XML documents can be read and manipulated as a tree of objects through the Document Object Model (DOM)
• numerous XML schemas to represent documents:
  – NLM Journal Publishing DTD, Erudit DTD, DocBook DTD, etc.
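To illustrate what “semantic” mark-up buys, here is a minimal Python sketch that pulls the title and authors straight out of an XML fragment by element name. The fragment borrows element names from the NLM Journal Publishing DTD but is simplified for illustration and is not a complete, valid NLM document; `read_metadata` is a hypothetical helper, not part of Lemon8-XML.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative fragment: element names follow the NLM DTD,
# but this is not a complete or valid NLM document.
ARTICLE_XML = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>Lemon8-XML in a Morning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Suhonos</surname><given-names>MJ</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""


def read_metadata(xml_text):
    """Hypothetical helper: look up metadata by meaning, not by appearance."""
    root = ET.fromstring(xml_text)
    authors = []
    for contrib in root.iter("contrib"):
        # Keep only contributors explicitly marked as authors.
        if contrib.get("contrib-type") == "author":
            name = contrib.find("name")
            authors.append(
                f"{name.findtext('given-names')} {name.findtext('surname')}"
            )
    return {"title": root.findtext(".//article-title"), "authors": authors}


metadata = read_metadata(ARTICLE_XML)
print(metadata)
```

Because each piece of information sits in a named element, no guessing from fonts or layout is needed, which is the sense in which such a document is machine actionable.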
How Do We Get There From Here?
• format conversion
• data extraction (“decomposition”)
• semantic mark-up
• but there are no integrated, easy-to-use tools to do this!
Lemon8-XML (L8X)
• designed to meet 5 goals (or steps):
  – import documents in (almost) any current format
  – extract document objects, e.g., metadata, sections
  – provide editability for document objects
  – assist users with citation correction
  – export to a machine actionable XML format
Format Conversion
• use existing tools to convert everything to ODT
• import filter as a “bridge” to conversion services:
  – Docvert (web-based interface to OpenOffice)
  – Google Docs (web-based API)
• or, users can convert using OpenOffice on their local computer
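One reason ODT works well as a common intermediate format is that an ODT file is a ZIP package of XML parts, so its contents can be read with standard tools. The sketch below reads the title from the package's `meta.xml`; it is an illustration only, not how Lemon8-XML itself works, and `read_odt_title` and `make_demo_odt` are hypothetical helpers (the demo package contains only `meta.xml` and is not a complete ODT file).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for dc:title inside an ODT meta.xml.
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_odt_title(odt_bytes):
    """Return the dc:title stored in the package's meta.xml, or None."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        meta_root = ET.fromstring(package.read("meta.xml"))
    return meta_root.findtext(f".//{{{DC_NS}}}title")


def make_demo_odt(title):
    """Build a tiny stand-in package holding only meta.xml (assumption:
    enough for this demo, but not a complete ODT document)."""
    meta = (
        "<office:document-meta "
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        f'xmlns:dc="{DC_NS}">'
        f"<office:meta><dc:title>{title}</dc:title></office:meta>"
        "</office:document-meta>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("meta.xml", meta)
    return buffer.getvalue()


demo_title = read_odt_title(make_demo_odt("Lemon8-XML in a Morning"))
print(demo_title)
```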
Structure Decomposition
• three kinds of document objects:
  – metadata elements: author name, email, title, etc.
  – sections: paragraphs, images, tables, etc.
  – citations: composed of metadata elements
• these objects are part of a schemaless hierarchy
• this is enough to represent most scholarly articles
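The three kinds of objects can be sketched as plain data classes. This is a minimal illustration under assumed names (`MetadataElement`, `Section`, `Citation`, `Document` and their fields are hypothetical, not Lemon8-XML's actual classes), and the citation shown is placeholder data.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataElement:
    name: str   # e.g. "title", "author", "email"
    value: str


@dataclass
class Section:
    kind: str                 # e.g. "section", "paragraph", "image", "table"
    content: str = ""
    # Sections nest freely: a hierarchy without a fixed schema.
    children: list["Section"] = field(default_factory=list)


@dataclass
class Citation:
    raw_text: str
    # A citation is itself made up of metadata elements.
    elements: list[MetadataElement] = field(default_factory=list)


@dataclass
class Document:
    metadata: list[MetadataElement] = field(default_factory=list)
    sections: list[Section] = field(default_factory=list)
    citations: list[Citation] = field(default_factory=list)


doc = Document(
    metadata=[MetadataElement("title", "Lemon8-XML in a Morning")],
    sections=[
        Section("section", "Overview", children=[Section("paragraph", "...")])
    ],
    citations=[
        Citation(
            "Placeholder A. Example title. Example Journal. 2009.",
            elements=[
                MetadataElement("author", "Placeholder A"),
                MetadataElement("journal", "Example Journal"),
            ],
        )
    ],
)
print(len(doc.sections), len(doc.citations))
```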
Structure Editing
• allows the user to ensure the document is structured correctly
• provides the ability to replace incompatible data, e.g., WordArt with PNG, EndNote with text, etc.
• lets the user add missing information, e.g., additional authors, article ID, etc.
Another Example
What is the name of the journal cited?
Citation Correction
• citations have their own complex metadata
• automatically parses citations against 400 styles (with a framework to add more)
• citations are validated in online indexes using fuzzy matching, e.g., PubMed, WorldCat, CrossRef, ISBNdb, etc.
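Fuzzy matching can be sketched with Python's standard `difflib`: a parsed citation title is compared against candidate records and accepted only above a similarity threshold. This is an assumption-laden illustration, not Lemon8-XML's actual matching algorithm; the `CANDIDATES` list is placeholder data standing in for index results, and `best_match` and the 0.85 threshold are hypothetical choices.

```python
from difflib import SequenceMatcher

# Placeholder records standing in for results from an online index.
CANDIDATES = [
    {"id": "rec-1", "title": "Open access publishing in medicine"},
    {"id": "rec-2", "title": "Structured metadata for journals"},
]


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query_title, candidates, threshold=0.85):
    """Return the most similar record, or None if nothing clears the threshold."""
    if not candidates:
        return None
    scored = [(similarity(query_title, c["title"]), c) for c in candidates]
    score, record = max(scored, key=lambda pair: pair[0])
    return record if score >= threshold else None


# A citation title with typos still finds its record.
match = best_match("Open acess publishing in medecine", CANDIDATES)
no_match = best_match("Completely unrelated title", CANDIDATES)
print(match, no_match)
```

The threshold trades missed matches against false ones, which is why looked-up citations are still worth a manual double-check.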
Preview and Export
• gives a rough idea of what the rendered HTML / PDF galley will look like
• the user can choose which XML schema to export
• the entire edit/preview process is iterative, just like the current OJS workflow
Goals for 2009
• Feature integration into OJS, OMP (via the WAL)
• Automatic document/file conversion & generation
• Automatic metadata extraction on upload
• Additional citation indexes using plugins
• Additional XML export formats (DocBook, Erudit)
• Customizable XSL and preview rendering
• Things we can't even think of right now (Web 3.0)
Future OJS Workflow
1) Author uploads article as, e.g., DOC
2) File is converted to ODT, metadata extracted
3) Reviewers read article converted to PDF
4) Reviewers/editors exchange comments in OJS
5) Copyeditor edits / corrects citations automatically
6) Copy/layout-editor and author make corrections
7) HTML/PDF galleys automatically created
8) XML exported and submitted to archive
Squeezing lemons into Lemon8 at Open Medicine
Anita Palepu, Open Medicine
Vancouver, Canada, Jul 8, 2009
1. Start with the freshest lemons
• Spending a small amount of time refining lemons means fewer problems down the road
• Try to have documents that will need few or no content changes later
• These refinements include ensuring headings and references are in a standard format
2. Upload the lemon
• Usually we try to get a document score > 75%
• If the score is lower, we usually go back and refine the lemon, since problems can be unpredictable if one proceeds past this point
3. Fix the metadata
• Metadata missed by the Lemon8 parsing is inserted (e.g., titles, author information)
4. Fix the citations
• This is the single most time-consuming part of the process (approximately 1 minute per citation in the document)
• It is also where the most time has been saved in the process (several hours per document)
• We double-check successfully looked-up citations, and manually enter / search for ones that have not been successfully parsed
5. Massage the XML
• At this point, we usually export the XML and start editing it manually
• This includes entering the publication information, dates accepted and so on in the metadata
• We also enter figures / tables manually at this stage
• We use Eclipse with the PHP extensions to edit XML mark-up
• Validate! Validate! Validate! No document leaves this stage unless it's valid NLM XML
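A cheap first gate after hand-editing is a well-formedness check, sketched below with Python's standard library. Note the limits: this only confirms the file parses as XML. Checking validity against the NLM DTD needs a DTD-aware validating parser, which the standard library does not provide. `check_well_formed` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if the text parses as XML, else (False, message).

    This is a well-formedness check only, not validation against a DTD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return False, str(error)
    return True, None


# A balanced fragment passes; an unclosed <front> element is caught.
ok, ok_message = check_well_formed("<article><front/></article>")
bad, bad_message = check_well_formed("<article><front></article>")
print(ok, bad, bad_message)
```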
6. Upload
• HTML is automatically generated from the uploaded XML by OJS
• We also push the XML through an XSL we crafted in-house to turn it into a Scribus document, which we use to produce our PDF documents
The PubMed Central Experience
• Inclusion in PMC is now the de facto requirement for inclusion in the PubMed (MEDLINE) index
• PMC has a very high standard for XML mark-up
• All articles must be submitted in XML and PDF
• All content must be 100% consistent between and within articles
• Images must be supplied at extremely high (archival) resolution (700dpi TIF)
• Average time for acceptance is 2 years, 4 reviews
The PubMed Central Experience
• Plan your journal's workflow clearly in advance
• Start with the right tools (e.g., OpenOffice)
• Choose a canonical source for edits (i.e., XML)
• Encourage authors and editors to be involved and engaged as early as possible
• Be patient and thorough; don't cut corners
• Plan a sustainability strategy: PMC expects submissions within 4 weeks of publication
Special Thanks: Anita Palepu, Tarek Loubani
Open Discussion