This document summarizes Nature Publishing Group's techniques for RDF data publishing. It discusses NPG's prior semantic publishing work and linked data applications. It then describes NPG's ontology, data hosting on a cloud platform, public SPARQL endpoint, and internal Hub application. The document outlines NPG's data extraction, loading, and publishing process, as well as techniques for naming, monitoring, and providing a linked data API.
Techniques used in RDF Data Publishing at Nature Publishing Group
1. Techniques used in RDF Data Publishing at Nature Publishing Group
Tony Hammond
Data Architect, NPG
March 5, 2013
2. Nature Publishing Group
● NPG is a division of Macmillan (a privately owned company)
● Publishes ~120 titles in all
● 34 Nature branded titles
● 53 academic and society journals
● 16 magazines (incl. Scientific American)
● ~1000 employees, 17 offices (5 continents)
● ~30 society partners
● Databases, conferences/events, multimedia
3. Semantic Publishing at NPG
• Prior Work
• RSS 1.0 webfeeds
• HTML metadata
• PDF metadata (XMP)
• Urchin – RSS aggregator
• OAI-PMH, OpenSearch (SRU), OpenURL
• Linked Data Apps
• Public Data: test viability of data publishing
• Hub: application of technology internally
31. Positions Available
goo.gl/bYIt8
www.linkedin.com/jobs?jobId=4890057&viewJob
32. Information
data.nature.com
developers.nature.com/docs
datahub.io/group/npg
prefix.cc/npg
Editor's Notes
Background
NPG is a division of Macmillan Publishers Ltd, a global publishing group founded in the United Kingdom in 1843. Macmillan is itself owned by the German-based, family-run company Verlagsgruppe Georg von Holtzbrinck GmbH. Nature magazine started publication in 1869. Scientific American started publication in 1845.
Prior Work
NPG has long been receptive to RDF and to semantic publishing. It has now been publishing RSS 1.0 webfeeds for 10 years and XMP in PDFs for over 5 years. A number of digital library–focused services (OAI-PMH, OpenSearch (SRU), OpenURL) subsequently made use of public vocabularies.
Linked Data Apps
Meanwhile the linked data paradigm has grown and can be seen as RDF's coming of age. TBL introduced a set of guidelines in July 2006 – just over 6 years ago. NPG first began to explore this space early in 2010 and since then has developed two main applications: Public Data and Hub.
The public data application was intended to gauge the readiness of the marketplace for data publishing. Scientific exchange is built on the reuse of ideas. With this linked data publishing we wanted to start testing the reusability of the knowledge that NPG generates. There was some initial interest in our first release, less in the second release. We have not been actively promoting this, and have done a poor job of linking to our own data. We have also had some ongoing problems in accessing query history, which we are working to resolve.
For reference, this graph shows some relevant numbers on a log scale (i.e. each vertical division shows a 10-fold increase):
• ~120 journals
• ~1m articles
• ~10m citations
• ~300m triples (a current working number for the public dataset)
The 'Plimsoll Line' at right shows the initial banding we defined for sizing our requirements:
• 10–100m
• 100m–1b
• 1b–10b
Our public datasets appeared in the 1st and 2nd bands.
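The banding in this note can be checked with a little arithmetic. The following is a minimal sketch, assuming each band is a half-open power-of-ten range (so exactly 100m falls in the 2nd band); the function name `sizing_band` is illustrative and not something NPG describes.

```python
from typing import Optional

# Band 1 starts at 10m triples, i.e. 10**7 (assumption: bands are
# half-open power-of-ten ranges: 10m-100m, 100m-1b, 1b-10b).
BAND_ONE_EXPONENT = 7


def sizing_band(triples: int) -> Optional[int]:
    """Return 1, 2 or 3 for the sizing band, or None if outside all bands."""
    if triples <= 0:
        return None
    # Number of decimal digits minus one is floor(log10(n)) for positive ints.
    exponent = len(str(triples)) - 1
    band = exponent - BAND_ONE_EXPONENT + 1
    return band if 1 <= band <= 3 else None


if __name__ == "__main__":
    # Triple counts quoted in the notes for the two public releases.
    for label, count in [("April 2012", 22_000_000), ("July 2012", 270_000_000)]:
        print(label, "-> band", sizing_band(count))
```

With the release sizes quoted in these notes, the April distribution (~22m triples) lands in the 1st band and the July distribution (>270m triples) in the 2nd.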
In both these linked data applications NPG has focussed more on a broad – but flat – coverage. Our data model currently covers some 12 object types in our public dataset and double that number in our internal dataset. We are working to extend that number further. The slide shows the internal data model implemented at this point, which is essentially the public bibliographic dataset together with our annotations data. As can be seen, the :Article object is a local hub object. Specifically, we have focussed on the RDF data model and some minimal RDFS schemas at this point and not on OWL ontologies. Our strategy has been to put in broadband coverage and then to work up ontology 'verticals' later. Already a number of benefits may be realized by considering simple RDF linking and steering clear of inferencing.
We first began looking at vendors to host our data around Q2/Q3, 2010. We finally settled on TSO in Q1, 2011 as we already had previous relations with them and the technology they were able to offer (5store) was highly performant and scalable. TSO did not provide much in the way of API support – which actually suited us just fine – but were enthusiastic partners and were flexible in their arrangements with us. For loading the data we just had to ftp a snapshot – or set of snapshot files – and they would load this for us. Additionally – and as discussed later – we provided our own update mechanism using SPARQL Update. The SPARQL 1.0 endpoint was upgraded earlier this year with SPARQL 1.1 features. The only issue that we have not resolved satisfactorily is logging.
We had two releases of public data in 2012, April (22m) and July (270m) – both distributions were released into the public domain under a Creative Commons CC0 license (or more precisely a public waiver).
April 4 Distribution
The first release included a subset of journals with articles and no citations: ~22m triples, 450,000 articles and 10 objects.
July 16 Distribution
The second included all journals (67 new titles) with articles and with citations: >270m triples, >900,000 articles and 12 objects. This distribution effectively doubles the number of articles while also adding new :Citation and :DataCitation objects, so that the complete citation graph for all NPG titles now brings the full distribution to more than ten times the size of the previous distribution. We also added a live updating facility and RDF data dumps.
We built a web proxy – Cerberus – which was intended to perform three major tasks:
• browsing – implements a linked data browser
• linking – implements a linked data API
• lookup – implements URI dereferencing
Additionally a 4th background service – available at system and not user level – was implemented to support updates.
Cerberus also provides for the following:
• centralized user access
• a rich set of serializations (via HTTP content negotiation)
• query limits
• prequery extensions (to article full text search)
• sample queries
The hub application is aimed at providing a discovery layer over our internal content.
We have an enterprise-wide initiative at NPG to upgrade our workflow systems, in which we are aiming both to database all our production assets and to do this at the beginning of the process cycle (at acceptance) rather than at the end (at publication) as currently. All publication assets will be included and will be stored in appropriate repositories. Currently only our XML asset base – i.e. our structured data – is well maintained in a dedicated XML database – MarkLogic. The accompanying BLOBs (.jpg, .gif, .pdf, etc.) are only maintained on the filesystems – we are exploring a DAM option. The consequence would be a small set of content repositories – a distributed data warehouse. The problem is: how do we find stuff?
The proposed solution is to lay a graph over the repositories. We represent the physical (content) layer by the red nodes in the graph. There is a 1:1 relationship between physical asset and graph node. We represent the logical (context) layer by the white nodes in the graph. These are the nodes that are richly interlinked. This conceptual graph overlay may bear some similarities to Topic Maps, if you are familiar with that technology.
The basic methodology is to introduce a registration process, to associate a description (metadata) with each asset (file), and from these descriptions to generate a linked data graph of objects. This registration process can be likened to a border patrol, with the descriptions acting as passports for assets. The proposed format for the metadata descriptions is standalone XMP packets, which have the benefits of both XML and RDF, as well as providing a useful set of constraints and keeping us focussed on media management. Note that this terminology of assets and objects is used internally. Basically the hub is an asset-driven object factory.
Details:
• Adobe standard for embedding metadata in binary objects – PDF, etc.
• Standard mechanism (since PDF 1.4 in 2001) for adding metadata to PDF
• ISO Standard: ISO 16684-1:2012
• Essentially, an XML document – specifically, an RDF/XML document
• Can be embedded in a wide range of binary and non-binary media
• Can also be used as a standalone ('sidecar') document
Upsides:
• XML is easy to import into ML (MarkLogic) – RDF is easy to import into the TS (triplestore)
• NPG is already using this in PDFs
Downsides:
• Uses archaic RDF forms (rdf:Alt, rdf:Bag, rdf:Seq)
• Requires certain properties (DC, etc.) to be mapped to structured forms
• SDK is C++ only (but other tools, e.g. ExifTool, have good support)
This figure shows how the graph would be built. We have two assets represented in the graph by the red nodes – an article XML file with properties shown in blue, and an article PDF with properties shown in green. Both of the objects are built from the descriptions associated with the assets (i.e. from the XMP packets). At right is shown some sample RDF that would be contained within the XMP packets. These two objects link to a concept object (the :Article) which itself is linked to other concept objects. Note that our current XML has no link to the PDF, but going forward we represent that linkage directly in the graph.
We have been using Apache Jena TDB for local hosting for various reasons: it is open source and we are a Java shop. It supports SPARQL 1.1. Indexing is appreciably slower than with 5store.
The introduction of a triplestore changes the traditional publishing picture. Previously a CMS would access backend databases and present the data to an end-user. With the triplestore we are now publishing data which is both an output in its own right and an input to the CMS. The data is published to the web rather than stored in a database and accessed (possibly) through an API.
The Hub Finder is a set of apps in early development to provide contextual discovery services to in-house teams (Production, Editorial, Marketing, etc.). Here a simple Inspector is shown which provides a low-level view of objects and their properties. In this view the :Subject object from our subjects ontology is selected.
And here are some results for a search on the :Subject object.
Some techniques of our RDF publishing practice will be discussed:
• Naming architecture/policy
• Named graphs
• RDF filesystems for import/export
• Publishing contracts
• XMP packets for metadata packaging (discussed earlier)
• Linked Data API
We follow the usual practice of distinguishing between Information Resources and (so-called) Non-Information Resources, i.e. between descriptions of things and things themselves. To implement this we follow the well-known DBpedia pattern of 303 redirects which, although clunky to set up and requiring an extra DNS lookup, does have the merit of being unambiguous. All NPG things are named in the ns.nature.com namespace. Descriptions for those things are served from the data.nature.com namespace. Documents built using this data are served from the www.nature.com namespace.
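The 303 pattern described in this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the notes give the two hostnames, but the rule that the path is carried over unchanged, the function name `see_other`, and the example URI are all hypothetical.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit, urlunsplit

THING_HOST = "ns.nature.com"          # names of things (from the notes)
DESCRIPTION_HOST = "data.nature.com"  # descriptions of things (from the notes)


def see_other(thing_uri: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a GET on a thing URI.

    Assumption: the description lives at the same path on the
    description host. The real mapping is not given in the talk.
    """
    parts = urlsplit(thing_uri)
    if parts.netloc != THING_HOST:
        return 404, {}
    location = urlunsplit(
        (parts.scheme, DESCRIPTION_HOST, parts.path, parts.query, "")
    )
    # 303 See Other: the thing itself cannot be returned, so the client
    # is pointed at a document that describes it.
    return 303, {"Location": location}


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(see_other("http://ns.nature.com/articles/example-id"))
```

The point of the redirect is that the thing's name and its description never share a URI, which is what makes the pattern unambiguous despite the extra round trip.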
We have a strong naming policy. This allows for predictability in coding and in mapping to the API. We define a pair of namespaces: npg: for classes and properties, and npgg: for graphs. Each object type lives in its own named graph using the npgg: namespace for graphs. All objects and object properties are named in the npg: namespace, although additional classes may be added as appropriate. Datatype properties are sourced from common vocabularies wherever possible.
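To make the "one named graph per object type" rule concrete, here is a small Python sketch that emits an N-Quads type statement. The namespace expansions for npg: and npgg:, the lower-cased plural graph naming, and the example subject URI are all assumptions made for illustration; the talk only states that the two prefixes exist and what they are used for.

```python
# Assumed prefix expansions; check prefix.cc/npg for the real values.
NPG = "http://ns.nature.com/terms/"    # classes and properties
NPGG = "http://ns.nature.com/graphs/"  # named graphs
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def graph_for(class_name: str) -> str:
    """Name the graph for an object type.

    Hypothetical convention: lower-cased class name plus 's'.
    """
    return f"{NPGG}{class_name.lower()}s"


def type_quad(subject: str, class_name: str) -> str:
    """Build one N-Quads line typing `subject`, placed in its type's graph."""
    return (
        f"<{subject}> <{RDF_TYPE}> <{NPG}{class_name}> "
        f"<{graph_for(class_name)}> ."
    )


if __name__ == "__main__":
    print(type_quad("http://ns.nature.com/articles/example-id", "Article"))
```

Because the graph name is derived from the object type, code that knows the class can predict the graph, which is the predictability the naming policy is after.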
The diagram shows our data publishing operation. We support a 3-tiered development environment (live, staging and test) with dedicated triplestores for each environment. We also have the cloud-based public triplestore. Together we have eight SPARQL endpoints: four native triplestore endpoints, and four web proxy endpoints. We also have the datastore, which is not yet 3-tier enabled. This is a data prep machine and is where our triples are extracted to and indexed. We are planning to extend this across all three tiers. The triplestores are under Puppet management and are simply deployed.
The Hub Finder provides a Monitor page which provides a simple test across all our various SPARQL endpoints with three status levels: up (green), up – slow (amber), down (red). Timing info is also shown. This just issues a simple lightweight SPARQL query to each of the endpoints. We are looking to tie this back into our regular Zenoss monitoring system in order to get email alerts, etc.
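The three status levels can be sketched as a small timing wrapper in Python. This is not the Hub Finder's code: the 2-second "slow" threshold, the function name `check_endpoint`, and the idea of passing the query as a callable are assumptions made so the example stays self-contained and does not need a live endpoint.

```python
import time
from typing import Callable, Tuple

SLOW_AFTER_SECONDS = 2.0  # assumed threshold; the talk does not give one


def check_endpoint(
    probe: Callable[[], None],
    slow_after: float = SLOW_AFTER_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, float]:
    """Run one lightweight probe and return (status colour, elapsed seconds)."""
    start = clock()
    try:
        # In practice the probe would send a trivial SPARQL query,
        # e.g. ASK { ?s ?p ?o }, to one endpoint and raise on failure.
        probe()
    except Exception:
        return "red", clock() - start  # down
    elapsed = clock() - start
    if elapsed > slow_after:
        return "amber", elapsed        # up, but slow
    return "green", elapsed            # up


if __name__ == "__main__":
    print(check_endpoint(lambda: None))
```

Injecting the probe and the clock keeps the classification logic testable on its own; a real monitor would loop over all eight endpoints and pass a probe per endpoint.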
Our ETL process – or extract, transform, and load process – is fairly straightforward. Some points to note are that our extractor uses XPath and Jena. There is also a certain amount of Saxon and XQuery processing, as some early transforms wrote results out to RDF/XML as an intermediate step. Output is written out in nquads. The output of the extractor goes to the filesystem, building up, in effect, an RDF filesystem. This RDF filesystem is sourced by the assembler, which compiles a distribution for a given knowledgebase (e.g. internal and external) using publishing contracts. The contracts are essentially used as an output filter. More on this later. The assembler output can then be indexed by tdbloader for generating TDB indexes or can be tarballed for downloading. The deployment is typically an scp over to the triplestore machine (although for our public store we would ftp the tarballs for remote indexing).
We define a data tree on our datastore machine for importing from various sources. The extraction process (which runs against MarkLogic) writes out RDF in nquads format and stores these on disk by object and property. For example, under a given extraction run we will write out all the dc:title properties for an :Article object in a dedicated file under an articles folder. For this, we use the full URL name, which is normalized to alphanumeric plus a couple of punctuation chars (e.g. dash and period). These are further partitioned by restricting to a given product (i.e. journal title). So, we end up with a highly branched tree of nquads files. Other data sources (e.g. our static ontology datasets and imports from the SQL database) are likewise componentized.
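The file-naming step in this note can be illustrated with a short Python sketch. The notes say only that the full property URL is normalized to alphanumerics plus a couple of punctuation characters and that files are partitioned by object and product; the replacement character, the folder order, the `.nq` extension, and the names `run-001` and `nature` below are assumptions.

```python
import re
from pathlib import PurePosixPath


def normalize(url: str) -> str:
    """Replace every character other than alphanumerics, dash and period
    with a dash (the choice of replacement character is an assumption)."""
    return re.sub(r"[^A-Za-z0-9.-]", "-", url)


def quads_path(run: str, object_folder: str, product: str,
               property_url: str) -> PurePosixPath:
    """Build a hypothetical path: run / object type / product / property file."""
    return PurePosixPath(run, object_folder, product,
                         normalize(property_url) + ".nq")


if __name__ == "__main__":
    # dc:title for :Article objects of one (hypothetical) product.
    print(quads_path("run-001", "articles", "nature",
                     "http://purl.org/dc/elements/1.1/title"))
```

Keeping one file per object type, product and property is what lets a later step pick and choose components without parsing any RDF.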
Complementing the imports tree we have a corresponding exports tree. This is partitioned by knowledgebase. And under each knowledgebase we have folders for dumps, indexes, snapshots and updates.
Our publishing contracts are based on VoID descriptions. Initially we were maintaining VoID descriptions (which we loosely call graphs and assign a :Graph object type to) for discovery purposes, in which we listed out classes and properties and – especially – counts. We have been obsessed with counts since we have been thwarted by triplestore support – either the triplestores did not support the function, or performance for larger populations was an obstacle, and updates presented their own challenges – and we tried to implement this ourselves through our web proxy. We subsequently realized that we could dual-purpose these graph descriptions as publishing contracts. By listing out only those objects and properties to be published, our assembler could ignore any objects and properties not intended for that knowledgebase.
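The "contract as output filter" idea can be sketched in Python as follows. This is a simplified stand-in, not NPG's assembler: a real contract would be a VoID description, whereas here it is reduced to two sets, and object types are identified by their named graph (an assumption that leans on the one-graph-per-object-type policy). All URIs in the usage example are hypothetical.

```python
from typing import FrozenSet, Iterable, Iterator, NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    object: str
    graph: str


class Contract(NamedTuple):
    """Simplified publishing contract for one knowledgebase."""
    graphs: FrozenSet[str]      # object types to publish, by named graph
    properties: FrozenSet[str]  # properties to publish


def assemble(quads: Iterable[Quad], contract: Contract) -> Iterator[Quad]:
    """Yield only the quads the contract lists; everything else is ignored."""
    for quad in quads:
        if quad.graph in contract.graphs and quad.predicate in contract.properties:
            yield quad


if __name__ == "__main__":
    title = "http://purl.org/dc/elements/1.1/title"
    articles = "http://example.org/graphs/articles"  # hypothetical graph name
    public = Contract(frozenset({articles}), frozenset({title}))
    data = [
        Quad("http://example.org/a1", title, '"A title"', articles),
        Quad("http://example.org/a1", "http://example.org/internalNote",
             '"draft"', articles),
    ]
    for quad in assemble(data, public):
        print(quad)
```

The same extracted data can then feed both the internal and the public knowledgebase, with only the contract differing between them.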
Cerberus, as mentioned earlier, supports the Linked Data API (LDA). Specifically, we have embedded Elda – the Epimorphics Linked Data API – into Cerberus. From Elda's maintainer: "The LDA provides configurable mediation between a SPARQL endpoint and presentations of the data in HTML, XML, Turtle, or JSON."
In closing, we would just note some future challenges:
• We need to secure the current triplestore platform to make it resilient and performant.
• We need to consider what a next generation platform would need to support – sizing, inferencing, etc.
• We need to develop a more complete ontology with a foundational ontology footing.
This stuff is hard. Harder than publishing web pages. Harder than loading data into a database. Maybe with more practice and with better tools support we can begin to treat this more like a day-to-day operation. But there can be no doubt that the web is becoming increasingly granular.
And just to note that we are looking to recruit semweb developers with strong Java and semantic skills. So if you are interested in joining us here's the job ad.