MediaX (Jan 2013) -- PKP XML Parsing


  • demo at during slide 6 (can only guarantee this will be functional on jan 8, 2013)
  • (5 minute demo happens here)

    1. Left to Their Own Devices: Automating XML Parsing and Rendering for Scholarly Publishing
       Alex Garnett & John Willinsky, Public Knowledge Project
    2. What do we want? XML Publishing!
       • When do we want it? 2004 would’ve been nice…
       • We’ve known the value of properly marked up documents for a few decades now
         – Unfortunately, this entails hours of marking.
       • Open-source publishers on limited budgets can’t afford the outsourcing or the grad students that normally make this possible
    3. The Public Knowledge Project
       • Developers of Open Journal Systems & Open Monograph Press
         – Open source software to support open access publishing.
       • Our userbase happens to include many such small publishers, who publish almost exclusively in PDF, given its ease.
    4. Nice things that PDF doesn’t have
       • Well-structured text mining & indexing
       • Rendering in different formats (e.g. mobile)
       • Embedded dynamic content
       • Citation parsing and lookup
       • Reliable metadata
       • So why are we still using it, again?
    5. XML Publishing Workflows
       • Are complex and underdocumented, requiring lots of manual labour, since no author will ever write in XML, and only a small fraction will use Markdown or LaTeX or some other text format that’s easy to transform, and most automated parsing tools are in deplorable condition anyhow, rant rant rant, despite the fact that there are many very good piecemeal tools available at different stages of these workflows.
       • We put some of them together.
    6. Toolchain
       • External Services:
         – LibreOffice: document conversion
         – pdfx: fuzzy parsing
         – ParsCit: fuzzy citation parsing
         – citeproc/CSL: citation transformation
    7. Future Work
       • After incorporating upstream changes from pdfx (fixes for punctuation and non-English languages), we’re aiming to have an OJS plugin by March.
       • OMP will follow soon after.
       • By the end of our initial funding period in June, we’ll have a source release (without pdfx) and plan to be supporting a set of OJS/OMP users.
    8. Future Work not done by us
       • Collaborators at Heidelberg University are working on a WYSIWYG in-browser XML editor for manually revising article formatting.
       • The University of Michigan’s mPach system will add ePub generation and HathiTrust ingest.
       • CrossRef will be contributing functionality to look up, verify, and link parsed citations.
    9. Thanks
       • Damion Dooley, our primary developer
       • Steve Pettifer and the University of Manchester for allowing us to use pdfx
       • Juan Alperin and the rest of the PKP team for their support and earlier work
       • Alf Eaton from the NLM for stylesheets
       • MediaX for funding this project
    10. Questions?
       • If you want to use our service for document preparation right now, contact me (Alex) at
       • We’ll have a stable version available by the end of January (probably free with registration)
       • OJS/OMP integration and standalone release (without pdfx) coming soon!
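
The toolchain on slide 6 chains four external services in sequence. The Python sketch below is one way such a chain might be wired together, not the PKP implementation. Only the LibreOffice command line (`soffice --headless --convert-to ... --outdir ...`) is a real interface; the function names, the dictionary-based document record, and the placeholder stages standing in for pdfx, ParsCit, and citeproc/CSL are all assumptions made for illustration. The sketch builds the conversion command but does not execute it.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List

# Hypothetical document record passed from stage to stage.
Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def libreoffice_convert_command(source: Path, out_dir: Path,
                                target: str = "pdf") -> List[str]:
    """Build (but do not run) a headless LibreOffice conversion command."""
    return ["soffice", "--headless", "--convert-to", target,
            "--outdir", str(out_dir), str(source)]


def convert_stage(doc: Document) -> Document:
    """Record the command that would convert the submission to PDF."""
    source = Path(doc["source"])
    out_dir = Path(doc["work_dir"])
    return {
        **doc,
        "convert_cmd": libreoffice_convert_command(source, out_dir),
        "pdf": out_dir / (source.stem + ".pdf"),
        "steps": [*doc.get("steps", []), "convert"],
    }


def placeholder_stage(name: str) -> Stage:
    """Stand-in for an external service; it only records that it ran.

    A real stage would call the service (pdfx, ParsCit, citeproc/CSL)
    and attach its output to the document record.
    """
    def stage(doc: Document) -> Document:
        return {**doc, "steps": [*doc.get("steps", []), name]}
    return stage


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, feeding each result to the next."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Stage order follows slide 6: convert, parse structure, parse
# citations, then transform citations.
STAGES: List[Stage] = [
    convert_stage,
    placeholder_stage("pdfx"),
    placeholder_stage("parscit"),
    placeholder_stage("citeproc"),
]

if __name__ == "__main__":
    result = run_pipeline({"source": "paper.docx", "work_dir": "out"}, STAGES)
    print(result["steps"])
    print(" ".join(result["convert_cmd"]))
```

Keeping each service behind a plain function means a stage can be swapped or dropped without touching the others, for example for a source release that ships without pdfx.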