Techniques used in RDF Data Publishing
at Nature Publishing Group

           Tony Hammond
       Data Architect, NPG

              March 5, 2013
Nature Publishing Group

● NPG is a division of Macmillan (a privately
  owned company)
● Publishes ~120 titles in all
  ● 34 Nature branded titles
  ● 53 academic and society journals
  ● 16 magazines (incl. Scientific American)
● ~1000 employees, 17 offices (5 continents)
● ~30 society partners
● Databases, conferences/events, multimedia

Semantic Publishing at NPG

• Prior Work
  •   RSS 1.0 webfeeds
  •   HTML metadata
  •   PDF metadata (XMP)
  •   Urchin – RSS aggregator
  •   OAI-PMH, OpenSearch (SRU), OpenURL
• Linked Data Apps
  • Public Data: test viability of data publishing
  • Hub: application of technology internally
Public Data



NPG by Numbers




NPG Ontology




Cloud Hosting

•   TSO OpenUp® SaaS platform
•   Offers 5store as a triplestore
•   Scale-out architecture (C/C++)
•   Supports up to a trillion triples
•   150,000 tps load speed
•   SPARQL 1.0, with 1.1 features
    (aggregates, etc.)


data.nature.com




data.nature.com/query
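The public endpoint at data.nature.com/query speaks the SPARQL Protocol. As a rough sketch with standard-library Python only, the snippet below composes a SPARQL 1.1 aggregate query of the kind the Cloud Hosting slide mentions and encodes it as a GET request; the npg:Article and npg:isPartOf names and the output=json parameter are illustrative assumptions, not NPG's documented schema.

```python
from urllib.parse import urlencode

# A SPARQL 1.1 aggregate query (COUNT + GROUP BY) of the kind noted on the
# Cloud Hosting slide. The class and property names here are illustrative
# assumptions, not the documented NPG ontology.
query = """
PREFIX npg: <http://ns.nature.com/terms/>
SELECT ?journal (COUNT(?article) AS ?n)
WHERE { ?article a npg:Article ; npg:isPartOf ?journal }
GROUP BY ?journal
ORDER BY DESC(?n)
"""

# SPARQL Protocol: the query travels as the `query` parameter of a GET
# request. `output=json` is a common store convention, assumed here.
params = urlencode({"query": query.strip(), "output": "json"})
endpoint_url = "http://data.nature.com/query?" + params
print(endpoint_url[:60])
```

Dereferencing endpoint_url (e.g. with urllib.request) would then return the store's chosen result serialization.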




Hub



Hub: Problem




Hub: Solution




Hub: Method




XMP




Building the Graph




Local Hosting

•   Apache Jena TDB
•   Single-node architecture (Java)
•   Supports up to ~1.5b triples (tested)
•   SPARQL 1.1




Data Publishing




Hub Finder




Hub Finder: Results




Techniques



Naming Architecture




Naming Policy
npg:     http://ns.nature.com/terms/
npgg:    http://ns.nature.com/graphs/

Object            Example         Usage
Graph             npgg:gadgets    gadgets:33 ex:title "Title" npgg:gadgets .
Class             npg:Gadget      gadgets:33 a npg:Gadget npgg:gadgets .
Object Property   npg:hasGadget   _:12 npg:hasGadget gadgets:33 npgg:_ .
Data Property     ex:title        gadgets:33 ex:title "Title" npgg:gadgets .
Instance          gadgets:33      gadgets:33 ex:title "Title" npgg:gadgets .
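The quad convention in the table (every statement asserted inside a named graph) can be sketched as a tiny N-Quads serializer. The gadgets: and ex: namespaces are the slide's placeholders, mapped below to hypothetical example.org URIs; only npg: and npgg: come from the stated policy.

```python
# Minimal sketch of the naming policy's quad convention. The gadgets: and
# ex: prefix URIs are assumptions for illustration, not real NPG namespaces.
PREFIXES = {
    "npg": "http://ns.nature.com/terms/",
    "npgg": "http://ns.nature.com/graphs/",
    "gadgets": "http://example.org/gadgets/",  # assumed placeholder
    "ex": "http://example.org/terms/",         # assumed placeholder
}

def expand(curie: str) -> str:
    """Expand a prefixed name such as 'npgg:gadgets' to an N-Quads IRI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"

def term(t: str) -> str:
    """Literals ("...") and blank nodes (_:...) pass through unchanged."""
    return t if t.startswith('"') or t.startswith("_:") else expand(t)

def quad(s: str, p: str, o: str, g: str) -> str:
    """Serialize one statement into named graph g as an N-Quads line."""
    return f"{term(s)} {expand(p)} {term(o)} {expand(g)} ."

print(quad("gadgets:33", "ex:title", '"Title"', "npgg:gadgets"))
print(quad("_:12", "npg:hasGadget", "gadgets:33", "npgg:_"))
```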



Publishing




Monitoring




ETL Process




Datastore: Imports




Datastore: Exports




Contracts
npgg:affiliations
    a npg:Graph, void:Dataset ;
    dcterms:description "Graph of npg:Affiliation objects" ;
    dcterms:issued "2013-02-15"^^xsd:date ;
    dcterms:modified "2013-02-15"^^xsd:date ;
    dcterms:publisher [
        a foaf:Organization ;
        foaf:mbox <mailto:developers@nature.com> ;
        foaf:name "Nature Publishing Group"
    ] ;
    dcterms:source "extractor-xml" ;
    dcterms:title "npgg:affiliations" ;
    rdfs:label "npgg:affiliations" ;
    void:classPartition [
        void:class npg:Affiliation ;
        void:entities "973208"^^xsd:int
    ] ;
    void:propertyPartition [
        void:property vcard:url ;
        void:triples "326"^^xsd:int
    ], [
        void:property vcard:street-address ;
        void:triples "82638"^^xsd:int
    ], [
        void:property vcard:region ;
        void:triples "183483"^^xsd:int
    ], [
        void:property vcard:organisation-name ;
        void:triples "694290"^^xsd:int
    ], [
        void:property vcard:locality ;
        void:triples "412042"^^xsd:int
    ], [
        void:property vcard:email ;
        void:triples "21650"^^xsd:int
    ], [
        void:property vcard:country-name ;
        void:triples 0
    ], [
        void:property rdfs:label ;
        void:triples "973208"^^xsd:int
    ], [
        void:property rdf:type ;
        void:triples "973208"^^xsd:int
    ] ;
    void:triples "3340845"^^xsd:int ;
    void:vocabulary npg:, rdf:, rdfs:, void: .
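Such a contract is more than documentation: because it carries both a graph-level triple count and per-property partition counts, a consumer can cross-check them automatically. A small Python check (counts transcribed from the contract above) confirms they agree.

```python
# Sanity check for the npgg:affiliations VoID contract: the per-property
# partition counts should sum to the declared graph-level void:triples total.
partitions = {
    "vcard:url": 326,
    "vcard:street-address": 82638,
    "vcard:region": 183483,
    "vcard:organisation-name": 694290,
    "vcard:locality": 412042,
    "vcard:email": 21650,
    "vcard:country-name": 0,
    "rdfs:label": 973208,
    "rdf:type": 973208,
}
declared_total = 3340845  # void:triples on npgg:affiliations

assert sum(partitions.values()) == declared_total
print("partition counts sum to declared total:", declared_total)
```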




Linked Data API

•   ./api/articles [.json, .rdf, .xml]
•   ./api/articles?hasProduct.pcode=ng
•   ./api/contributors?familyName=Smith
•   ./api/products.json?pcode=ng&_page=2
•   ./api/products?_view=none&_properties=pcode
•   ./api/search?title=black+hole
•   ./api/tree/subjects/children.xml?_sort=title
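A hedged sketch of composing such API calls with Python's standard library: the ./api paths and the parameter names (pcode, _page, _view, _properties, _sort) come from the list above, while the base host and the helper function itself are assumptions for illustration.

```python
from urllib.parse import urlencode

# Sketch of building Linked Data API calls like those listed above.
# BASE joins the data.nature.com host with the ./api paths shown; the
# helper is hypothetical, not NPG's client library.
BASE = "http://data.nature.com/api"

def api_url(path: str, fmt: str = "", **params) -> str:
    """Build an API URL: optional format suffix plus query parameters."""
    url = f"{BASE}/{path}" + (f".{fmt}" if fmt else "")
    return url + (f"?{urlencode(params)}" if params else "")

print(api_url("products", fmt="json", pcode="ng", _page=2))
print(api_url("search", title="black hole"))
```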



Closing



Positions Available




                   goo.gl/bYIt8

   www.linkedin.com/jobs?jobId=4890057&viewJob
Information


                data.nature.com

         developers.nature.com/docs

              datahub.io/group/npg
                  prefix.cc/npg





More Related Content

What's hot

Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...MongoDB
 
Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB MongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBantoinegirbal
 
Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationMongoDB
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorHenrik Ingo
 
Apache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG MeetingApache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG MeetingMyles Braithwaite
 
MongoDB for Analytics
MongoDB for AnalyticsMongoDB for Analytics
MongoDB for AnalyticsMongoDB
 
Social Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBSocial Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBTakahiro Inoue
 
Document Model for High Speed Spark Processing
Document Model for High Speed Spark ProcessingDocument Model for High Speed Spark Processing
Document Model for High Speed Spark ProcessingMongoDB
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLMongoDB
 
2011 Mongo FR - Indexing in MongoDB
2011 Mongo FR - Indexing in MongoDB2011 Mongo FR - Indexing in MongoDB
2011 Mongo FR - Indexing in MongoDBantoinegirbal
 
Back to Basics 2017: Mí primera aplicación MongoDB
Back to Basics 2017: Mí primera aplicación MongoDBBack to Basics 2017: Mí primera aplicación MongoDB
Back to Basics 2017: Mí primera aplicación MongoDBMongoDB
 
Getting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSGetting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSMongoDB
 
NoSQL - An introduction to CouchDB
NoSQL - An introduction to CouchDBNoSQL - An introduction to CouchDB
NoSQL - An introduction to CouchDBJonathan Weiss
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsMongoDB
 
MongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaMongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaJuan Antonio Roy Couto
 

What's hot (20)

Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
 
Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB Application
 
MongoDb and NoSQL
MongoDb and NoSQLMongoDb and NoSQL
MongoDb and NoSQL
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop Connector
 
Mondodb
MondodbMondodb
Mondodb
 
Python and MongoDB
Python and MongoDB Python and MongoDB
Python and MongoDB
 
Apache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG MeetingApache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
 
MongoDB for Analytics
MongoDB for AnalyticsMongoDB for Analytics
MongoDB for Analytics
 
Social Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBSocial Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDB
 
Document Model for High Speed Spark Processing
Document Model for High Speed Spark ProcessingDocument Model for High Speed Spark Processing
Document Model for High Speed Spark Processing
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQL
 
2011 Mongo FR - Indexing in MongoDB
2011 Mongo FR - Indexing in MongoDB2011 Mongo FR - Indexing in MongoDB
2011 Mongo FR - Indexing in MongoDB
 
Back to Basics 2017: Mí primera aplicación MongoDB
Back to Basics 2017: Mí primera aplicación MongoDBBack to Basics 2017: Mí primera aplicación MongoDB
Back to Basics 2017: Mí primera aplicación MongoDB
 
Querying mongo db
Querying mongo dbQuerying mongo db
Querying mongo db
 
Getting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSGetting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJS
 
NoSQL - An introduction to CouchDB
NoSQL - An introduction to CouchDBNoSQL - An introduction to CouchDB
NoSQL - An introduction to CouchDB
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
 
MongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaMongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de Huelva
 

Similar to Techniques used in RDF Data Publishing at Nature Publishing Group

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Startedabramsm
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkMongoDB
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Cross-Platform Mobile Apps & Drupal Web Services
Cross-Platform Mobile Apps & Drupal Web ServicesCross-Platform Mobile Apps & Drupal Web Services
Cross-Platform Mobile Apps & Drupal Web ServicesBob Sims
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2Dimitris Kontokostas
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nltieleman
 
Pig on spark
Pig on sparkPig on spark
Pig on sparkSigmoid
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDBArangoDB Database
 
Retaining globally distributed high availability
Retaining globally distributed high availabilityRetaining globally distributed high availability
Retaining globally distributed high availabilityspil-engineering
 

Similar to Techniques used in RDF Data Publishing at Nature Publishing Group (20)

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
HyperGraphQL
HyperGraphQLHyperGraphQL
HyperGraphQL
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Started
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & Spark
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Cross-Platform Mobile Apps & Drupal Web Services
Cross-Platform Mobile Apps & Drupal Web ServicesCross-Platform Mobile Apps & Drupal Web Services
Cross-Platform Mobile Apps & Drupal Web Services
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Pig on spark
Pig on sparkPig on spark
Pig on spark
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDB
 
Retaining globally distributed high availability
Retaining globally distributed high availabilityRetaining globally distributed high availability
Retaining globally distributed high availability
 

More from Tony Hammond

The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesTony Hammond
 
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWLData Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWLTony Hammond
 
Iswc 2014-hammond-pasin-presentation-final
Iswc 2014-hammond-pasin-presentation-finalIswc 2014-hammond-pasin-presentation-final
Iswc 2014-hammond-pasin-presentation-finalTony Hammond
 
nature.com OpenSearch
nature.com OpenSearchnature.com OpenSearch
nature.com OpenSearchTony Hammond
 
OpenURL - The Rough Guide
OpenURL - The Rough GuideOpenURL - The Rough Guide
OpenURL - The Rough GuideTony Hammond
 
Agile Descriptions
Agile DescriptionsAgile Descriptions
Agile DescriptionsTony Hammond
 

More from Tony Hammond (11)

XMP Inspector
XMP InspectorXMP Inspector
XMP Inspector
 
The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologies
 
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWLData Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
 
Iswc 2014-hammond-pasin-presentation-final
Iswc 2014-hammond-pasin-presentation-finalIswc 2014-hammond-pasin-presentation-final
Iswc 2014-hammond-pasin-presentation-final
 
nature.com OpenSearch
nature.com OpenSearchnature.com OpenSearch
nature.com OpenSearch
 
Handle 08
Handle 08Handle 08
Handle 08
 
OpenURL - The Rough Guide
OpenURL - The Rough GuideOpenURL - The Rough Guide
OpenURL - The Rough Guide
 
Bionlp 07
Bionlp 07Bionlp 07
Bionlp 07
 
Agile Descriptions
Agile DescriptionsAgile Descriptions
Agile Descriptions
 
Yads
YadsYads
Yads
 
Jisc
JiscJisc
Jisc
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Techniques used in RDF Data Publishing at Nature Publishing Group

  • 1. Techniques used in RDF Data Publishing at Nature Publishing Group Tony Hammond Data Architect, NPG March 5, 2013
  • 2. Nature Publishing Group ● NPG a division of Macmillan (a privately owned company) ● Publishes ~120 titles in all ● 34 Nature branded titles ● 53 academic and society journals ● 16 magazines (incl. Scientific American) ● ~1000 employees,17 offices (5 continents) ● ~30 society partners ● Databases, conferences/events, multimedia 2
  • 3. Semantic Publishing at NPG • Prior Work • RSS 1.0 webfeeds • HTML metadata • PDF metadata (XMP) • Urchin – RSS aggregator • OAI-PMH, OpenSearch (SRU), OpenURL • Linked Data Apps • Public Data: test viability of data publishing • Hub: application of technology internally 3
  • 7. Cloud Hosting • TSO OpenUp® SaaS platform • Offers 5store as a triplestore • Scale-out architecture (C/C++) • Supports up to a trillion triples • 150,000 tps load speed • SPARQL 1.0, with 1.1 features (aggregates, etc.) 7
  • 10. Hub 10
  • 14. XMP 14
  • 16. Local Hosting • Apache TDB • Single-node architecture (Java) • Supports up to ~1.5b triples (tested) • SPARQL 1.1 16
  • 22. Naming Policy
    npg:  http://ns.nature.com/terms/
    npgg: http://ns.nature.com/graphs/

    Object           Example         Usage
    Graph            npgg:gadgets    gadgets:33 ex:title "Title" npgg:gadgets .
    Class            npg:Gadget      gadgets:33 a npg:Gadget npgg:gadgets .
    Object Property  npg:hasGadget   _:12 npg:hasGadget gadgets:33 npgg:_ .
    Data Property    ex:title        gadgets:33 ex:title "Title" npgg:gadgets .
    Instance         gadgets:33      gadgets:33 ex:title "Title" npgg:gadgets .
    22
  • 28. Contracts
    npgg:affiliations
        a npg:Graph, void:Dataset ;
        dcterms:description "Graph of npg:Affiliation objects" ;
        dcterms:issued "2013-02-15"^^xsd:date ;
        dcterms:modified "2013-02-15"^^xsd:date ;
        dcterms:publisher [
            a foaf:Organization ;
            foaf:mbox <mailto:developers@nature.com> ;
            foaf:name "Nature Publishing Group"
        ] ;
        dcterms:source "extractor-xml" ;
        dcterms:title "npgg:affiliations" ;
        rdfs:label "npgg:affiliations" ;
        void:classPartition [
            void:class npg:Affiliation ;
            void:entities "973208"^^xsd:int
        ] ;
        void:propertyPartition [
            void:property rdf:type ;
            void:triples "973208"^^xsd:int
        ], [
            void:property vcard:region ;
            void:triples "183483"^^xsd:int
        ], [
            void:property vcard:organisation-name ;
            void:triples "694290"^^xsd:int
        ], [
            void:property vcard:locality ;
            void:triples "412042"^^xsd:int
        ], [
            void:property vcard:email ;
            void:triples "21650"^^xsd:int
        ], [
            void:property vcard:country-name ;
            void:triples 0
        ], [
            void:property rdfs:label ;
            void:triples "973208"^^xsd:int
        ], [
            void:property vcard:url ;
            void:triples "326"^^xsd:int
        ], [
            void:property vcard:street-address ;
            void:triples "82638"^^xsd:int
        ] ;
        void:triples "3340845"^^xsd:int ;
        void:vocabulary npg:, rdf:, rdfs:, void: .
    28
  • 29. Linked Data API • ./api/articles [.json, .rdf, .xml] • ./api/articles?hasProduct.pcode=ng • ./api/contributors?familyName=Smith • ./api/products.json?pcode=ng&_page=2 • ./api/products?_view=none&_properties=pcode • ./api/search?title=black+hole • ./api/tree/subjects/children.xml?_sort=title 29
  • 30. Closing 30
  • 31. Positions Available goo.gl/bYIt8 www.linkedin.com/jobs?jobId=4890057&viewJob 31
  • 32. Information data.nature.com developers.nature.com/docs datahub.io/group/npg prefix.cc/npg 32

Editor's Notes

  1. Background: NPG is a division of Macmillan Publishers Ltd, a global publishing group founded in the United Kingdom in 1843. Macmillan is itself owned by the German-based, family-run company Verlagsgruppe Georg von Holtzbrinck GmbH. Nature magazine started publication in 1869. Scientific American started publication in 1845.
  2. Prior Work: NPG has long been receptive to RDF and to semantic publishing. It now has 10 years publishing RSS 1.0 webfeeds and over 5 years publishing XMP in PDFs. A number of digital library–focused services – OAI-PMH, OpenSearch (SRU), OpenURL – subsequently made use of public vocabularies. Linked Data Apps: Meantime the linked data paradigm has grown and can be seen as RDF's coming of age. TBL introduced a set of guidelines in July 2006 – just over 6 years ago. NPG first began to explore this space early in 2010 and since then has developed two main applications: Public Data and Hub.
  3. The public data application was intended to gauge the readiness of the marketplace for data publishing. Scientific exchange is built on the reuse of ideas. With this linked data publishing we wanted to start testing the reusability of the knowledge that NPG generates. There was some initial interest in our first release, less in the second release. We have not been actively promoting this, and have done a poor job linking to our own data. We have also had some ongoing problems in accessing query history which we are working to resolve.
  4. As a reference, this graph shows on a log scale (i.e. each vertical division shows a 10-fold increase) some relevant numbers: ~120 journals, ~1m articles, ~10m citations, ~300m triples (a current working number for the public dataset). The 'Plimsoll Line' at right shows the initial banding we defined for sizing our requirements: 10–100m, 100m–1b, 1b–10b. Our public datasets appeared in the 1st and 2nd bands.
  5. In both these linked data applications NPG has focussed more on a broad – but flat – coverage. Our data model currently covers some 12 object types in our public dataset and double that number in our internal dataset. We are working to extend that number further. The slide shows the internal data model implemented at this point, which is generally the public bibliographic dataset together with our annotations data. As can be seen, the :Article object is a local hub object. Specifically, we have focussed on the RDF data model and some minimal RDFS schemas at this point, and not on OWL ontologies. Our strategy has been to put in broadband coverage and then to work up ontology 'verticals' later. Already a number of benefits may be realized by considering simple RDF linking and steering shy of inferencing.
  6. We first began looking at vendors to host our data around Q2/Q3 2010. We finally settled on TSO in Q1 2011, as we already had previous relations with them and the technology they were able to offer (5store) was highly performant and scalable. TSO did not provide much in the way of API support – which actually suited us just fine – but they were enthusiastic partners and were flexible in their arrangements with us. For loading the data we just had to ftp a snapshot – or set of snapshot files – and they would load this for us. Additionally – and as discussed later – we provided our own update mechanism using SPARQL Update. The SPARQL 1.0 endpoint was upgraded earlier this year with SPARQL 1.1 features. The only issue that we have not resolved satisfactorily is logging.
  7. We had two releases of public data in 2012, April (22m) and July (270m) – both distributions were released into the public domain under a Creative Commons CC0 license (or more precisely a public waiver). April 4 distribution: the first release included a subset of journals with articles and no citations – ~22m triples, 450,000 articles and 10 object types. July 16 distribution: the second included all journals (67 new titles) with articles and with citations – >270m triples, >900,000 articles and 12 object types. This distribution effectively doubles the number of articles while also adding in new :Citation and :DataCitation objects, so that the complete citation graph for all NPG titles now brings the full distribution to more than ten times the size of the previous distribution. We also added a live updating facility and RDF data dumps.
  8. We built a web proxy – Cerberus – which was intended to perform three major tasks: browsing (implements a linked data browser), linking (implements a linked data API), and lookup (implements URI dereferencing). Additionally, a 4th background service – available at system and not user level – was implemented to support updates. Cerberus also provides for the following: centralized user access, a rich set of serializations (via HTTP content negotiation), query limits, prequery extensions (to article full-text search), and sample queries.
  9. The hub application is aimed at providing a discovery layer over our internal content.
  10. We have an enterprise-wide initiative at NPG to upgrade our workflow systems, in which we are aiming both to database all our production assets, and to do this at the beginning of the process cycle (at acceptance) rather than at the end (at publication) as currently. All publication assets will be included and will be stored in appropriate repositories. Currently only our XML asset base – i.e. our structured data – is well maintained in a dedicated XML database, MarkLogic. The accompanying BLOBs (.jpg, .gif, .pdf, etc.) are only maintained on the filesystems – we are exploring a DAM option. The consequence would be a small set of content repositories – a distributed data warehouse. The problem is: how do we find stuff?
  11. The proposed solution is to lay a graph over the repositories. We represent the physical (content) layer by the red nodes in the graph. There is a 1:1 relationship between physical asset and graph node. We represent the logical (context) layer by the white nodes in the graph. These are the nodes that are richly interlinked. This conceptual graph overlay may bear some similarities to Topic Maps, if you are familiar with that technology.
  12. The basic methodology is to introduce a registration process, to associate a description (metadata) with each asset (file), and from these descriptions to generate a linked data graph of objects. This registration process can be likened to a border patrol, and the descriptions to passports for assets. The proposed format for the metadata descriptions is standalone XMP packets, which have the benefits of both XML and RDF, as well as providing a useful set of constraints and keeping us focussed on media management. Should note that this terminology of assets and objects is used internally. Basically, the hub is an asset-driven object factory.
  13. Details: XMP is the Adobe standard for embedding metadata in binary objects – PDF, etc. It has been the standard mechanism for adding metadata to PDF since PDF 1.5 in 2001, and is an ISO standard: ISO 16684-1:2012. Essentially it is an XML document – specifically, an RDF/XML document. It can be embedded in a wide range of binary and non-binary media, and can also be used as a standalone ('sidecar') document. Upsides: XML is an easy import to MarkLogic, RDF is an easy import to the triplestore, and NPG is already using this in PDFs. Downsides: it uses archaic RDF forms (rdf:Alt, rdf:Bag, rdf:Seq), it requires certain properties (DC, etc.) to be mapped to structured forms, and the SDK is C++ only (though other tools – ExifTool – have good support).
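To make the 'sidecar' idea concrete, here is a minimal standalone XMP packet built as an RDF/XML string. This is an illustrative sketch only: the `xmp_packet` helper, the DOI value, and the title are all hypothetical, not NPG code; the rdf:Alt wrapping of dc:title shows the structured form the notes mention.

```python
# Sketch (not NPG code): a minimal standalone ("sidecar") XMP packet.
from xml.etree import ElementTree

def xmp_packet(title: str, identifier: str) -> str:
    """Return a standalone XMP packet (RDF/XML) describing one asset."""
    return f'''<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="">
      <dc:identifier>{identifier}</dc:identifier>
      <!-- dc:title must take the structured rdf:Alt form in XMP,
           one of the "archaic RDF forms" noted above -->
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">{title}</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>'''

# Hypothetical asset description
packet = xmp_packet("A hypothetical article", "doi:10.1038/example.1")
```

Because the packet is plain XML, it imports directly into an XML database, and because its payload is RDF/XML, it loads directly into a triplestore, which is the dual benefit described above.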
  14. This figure shows how the graph would be built. We have two assets represented in the graph by the red nodes – an article XML file with properties shown in blue, and an article PDF with properties shown in green. Both of the objects are built from the descriptions associated with the assets (i.e. from the XMP packets). At right is shown some sample RDF that would be contained within the XMP packets. These two objects link to a concept object (the :Article) which itself is linked to other concept objects. Note that our current XML has no link to the PDF, but going forward we represent that linkage directly in the graph.
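The graph-building step described here can be sketched as a small quad generator: each asset description contributes its own properties, and each asset node is linked to the shared :Article concept node. All URIs and the linking property (npg:hasAsset) are made up for illustration and are not NPG's actual terms.

```python
# Sketch, with hypothetical URIs: link XML and PDF asset nodes
# to a shared :Article concept node, emitting N-Quads lines.
def quads_for_assets(article_uri, assets, graph_uri):
    """Yield N-Quads lines joining each asset node to the article node."""
    for asset_uri, props in assets.items():
        # Concept node -> asset node link (illustrative property name)
        yield (f"<{article_uri}> <http://ns.nature.com/terms/hasAsset> "
               f"<{asset_uri}> <{graph_uri}> .")
        # Properties carried by the asset's own description (XMP packet)
        for pred, obj in props.items():
            yield f'<{asset_uri}> <{pred}> "{obj}" <{graph_uri}> .'

quads = list(quads_for_assets(
    "http://ns.nature.com/articles/example-1",
    {
        "http://ns.nature.com/assets/example-1.xml":
            {"http://purl.org/dc/terms/format": "application/xml"},
        "http://ns.nature.com/assets/example-1.pdf":
            {"http://purl.org/dc/terms/format": "application/pdf"},
    },
    "http://ns.nature.com/graphs/assets",
))
```

Note how the XML-to-PDF linkage the notes call out is carried by the graph itself (both assets hang off the same :Article node) rather than by the XML.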
  15. We have been using Apache TDB for local hosting for various reasons – it is open source and we are a Java shop. This supports SPARQL 1.1. Indexing is appreciably slower than 5store speeds.
  16. The introduction of a triplestore changes the traditional publishing picture. Previously a CMS would access backend databases and present the data to an end user. With the triplestore we are now publishing data which is both an output in its own right as well as an input to the CMS. The data is published to the web rather than stored in a database and accessed (possibly) through an API.
  17. The Hub Finder is a set of apps in early development to provide contextual discovery services to in-house teams (Production, Editorial, Marketing, etc.). Here a simple Inspector is shown which provides a low-level view of objects and their properties. In this view the :Subject object from our subjects ontology is selected.
  18. And here are some results from searching the :Subject object.
  19. Some techniques from our RDF publishing practice will be discussed: naming architecture/policy, named graphs, RDF filesystems for import/export, publishing contracts, XMP packets for metadata packaging (discussed earlier), and the Linked Data API.
  20. We follow the usual practice of distinguishing between Information Resources and (so-called) Non-Information Resources, i.e. between descriptions of things and the things themselves. To implement this we follow the well-known DBpedia pattern of 303 redirects, which although clunky to set up and requiring an extra DNS lookup does have the merit of being unambiguous. All NPG things are named in the ns.nature.com namespace. Descriptions for those things are served from the data.nature.com namespace. Documents built using this data are served from the www.nature.com namespace.
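The 303 pattern can be sketched as a pure function: a dereference of a "thing" URI under ns.nature.com answers 303 See Other, pointing at the description served from data.nature.com. Only the ns./data. namespace split comes from the talk; the assumption that paths carry over unchanged between the two hosts is mine, for illustration.

```python
# Sketch of the 303-redirect naming architecture; the path mapping
# between ns.nature.com and data.nature.com is an assumption.
from urllib.parse import urlsplit

def resolve(uri: str):
    """Return (status, location) for a dereference of `uri`."""
    parts = urlsplit(uri)
    if parts.netloc == "ns.nature.com":
        # Non-information resource: redirect to its description
        return 303, f"http://data.nature.com{parts.path}"
    # Information resource: serve it directly
    return 200, uri

status, location = resolve("http://ns.nature.com/articles/example-1")
```

The clunkiness the notes mention is visible even here: every lookup of a thing costs a second round trip to fetch its description.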
  21. We have a strong naming policy. This allows for predictability in coding and in mapping to the API. We define a pair of namespaces: npg: for classes and properties, and npgg: for graphs. Each object type lives in its own named graph using the npgg: namespace for graphs. All objects and object properties are named in the npg: namespace, although additional classes may be added as appropriate. Datatype properties are sourced from common vocabularies wherever possible.
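The predictability this policy buys can be shown with two tiny helpers that derive URIs from an object type name. The npg:/npgg: namespace URLs come from the slides; the naive lowercase-plus-"s" pluralisation rule is my guess from the Gadget/gadgets example, not a stated NPG convention.

```python
# Sketch of the naming policy; pluralisation rule is an assumption.
NPG = "http://ns.nature.com/terms/"    # classes and properties
NPGG = "http://ns.nature.com/graphs/"  # one named graph per object type

def class_uri(name: str) -> str:
    """npg:Gadget-style class URI for an object type."""
    return NPG + name

def graph_uri(name: str) -> str:
    """npgg:gadgets-style named-graph URI for an object type."""
    return NPGG + name.lower() + "s"
```

Because both URIs are computable from the type name alone, code and API routes never need a lookup table to find where an object type's triples live.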
  22. The diagram shows our data publishing operation. We support a 3-tiered development environment (live, staging and test) with dedicated triplestores for each environment. We also have the cloud-based public triplestore. Together we have eight SPARQL endpoints: four native triplestore endpoints, and four web proxy endpoints. We also have the datastore, which is not yet 3-tier enabled. This is a data prep machine and is where our triples are extracted to and indexed. We are planning to extend this across all three tiers. The triplestores are under Puppet management and simply deployed.
  23. The Hub Finder provides a Monitor page which runs a simple test across all our various SPARQL endpoints with three status levels: up (green), up but slow (amber), down (red). Timing info is also shown. This just issues a simple lightweight SPARQL query to each of the endpoints. We are looking to tie this back into our regular Zenoss monitoring system in order to get email alerts, etc.
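The monitor's traffic-light logic can be sketched as a pure classification of one timed probe per endpoint. The ASK probe query and the 2-second slow threshold are assumed values for illustration; the talk gives only the three status levels.

```python
# Sketch of the Monitor page's status banding; threshold is assumed.
PROBE_QUERY = "ASK { ?s ?p ?o }"   # a lightweight query sent to each endpoint
SLOW_THRESHOLD = 2.0               # seconds; hypothetical cut-off

def status(elapsed_seconds):
    """Band one probe result into the monitor's three status levels."""
    if elapsed_seconds is None:          # no response at all
        return "down"                    # red
    if elapsed_seconds > SLOW_THRESHOLD:
        return "up-slow"                 # amber
    return "up"                          # green
```

Keeping the probe trivially cheap matters: the monitor itself must not load the eight endpoints it is watching.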
  24. Our ETL process – or extract, transform, and load process – is fairly straightforward. Some points to note are that our extractor uses XPath and Jena. There is also a certain amount of Saxon and XQuery processing, as some early transforms wrote results out to RDF/XML as an intermediate step. Output is written out in nquads. The output of the extractor goes to the filesystem, building up, in effect, an RDF filesystem. This RDF filesystem is sourced by the assembler, which compiles a distribution for a given knowledgebase (e.g. internal and external) using publishing contracts. The contracts are essentially used as an output filter. More on this later. The assembler output can then be indexed by tdbloader for generating TDB indexes or can be tarballed for downloading. The deployment is typically an scp over to the triplestore machine (although for our public store we would ftp the tarballs for remote indexing).
  25. We define a data tree on our datastore machine for importing from various sources. The extraction process (which runs against MarkLogic) writes out RDF in nquads format and stores these on disk by object and property. For example, under a given extraction run we will write out all the dc:title properties for an :Article object in a dedicated file under an articles folder. For this, we use the full URL name, which is normalized to alphanumeric plus a couple of punctuation chars (e.g. dash and period). These are further partitioned by restricting to a given product (i.e. journal title). So, we end up with a highly branched tree of nquads files. Other data sources (e.g. our static ontology datasets and imports from the SQL database) are likewise componentized.
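The file layout described here, one nquads file per (object type, product, property), with the property's full URL normalized to alphanumerics plus dash and period, can be sketched as a path function. The imports/ root and the .nq extension are my illustrative choices; only the partitioning scheme and the normalization rule come from the notes.

```python
# Sketch of the import-tree layout; root path and extension are assumed.
import re
from pathlib import PurePosixPath

def import_path(object_type: str, product: str, property_url: str) -> str:
    """Compute the nquads file path for one extracted property."""
    # Normalize the full property URL to alphanumerics plus dash and period
    name = re.sub(r"[^A-Za-z0-9.-]", "-", property_url)
    return str(PurePosixPath("imports") / object_type / product / (name + ".nq"))

path = import_path("articles", "ng", "http://purl.org/dc/elements/1.1/title")
```

One file per property keeps each extraction run incremental and diff-friendly: re-extracting one property for one journal rewrites exactly one leaf of the tree.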
  26. Complementing the imports tree we have a corresponding exports tree. This is partitioned by knowledgebase. And under each knowledgebase we have folders for dumps, indexes, snapshots and updates.
  27. Our publishing contracts are based on VoID descriptions. Initially we were maintaining VoID descriptions (which we loosely call graphs and assign a :Graph object type to) for discovery purposes, in which we listed out classes and properties and – especially – counts. We have been obsessed with counts since we have been thwarted by triplestore support – either the triplestores did not support the function, or performance for larger populations was an obstacle, and updates presented their own challenges – and tried to implement this ourselves through our web proxy. We subsequently realized that we could dual-purpose these graph descriptions as publishing contracts. By listing out only those objects and properties to be published, our assembler could ignore any objects and properties not intended for that knowledgebase.
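The contract-as-output-filter idea can be sketched in a few lines: the assembler keeps only quads whose predicate appears among the contract's property partitions. The 4-tuple quad representation, the prefixed names, and the `assemble` helper are all illustrative; NPG's actual assembler is not shown in the talk.

```python
# Sketch of the assembler's contract filter; quad shape is illustrative.
def assemble(quads, contract_properties):
    """Filter extractor output down to the properties a contract publishes."""
    allowed = set(contract_properties)
    return [q for q in quads if q[1] in allowed]

# Hypothetical extractor output: (subject, predicate, object, graph)
quads = [
    ("ex:a1", "rdf:type", "npg:Affiliation", "npgg:affiliations"),
    ("ex:a1", "vcard:locality", '"London"', "npgg:affiliations"),
    ("ex:a1", "internal:note", '"not for release"', "npgg:affiliations"),
]
# The contract's property partitions act as the whitelist
published = assemble(quads, ["rdf:type", "vcard:locality"])
```

The same VoID document thus serves discovery (counts, partitions) and governance (what each knowledgebase may publish) without a second configuration format.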
  28. Cerberus, as mentioned earlier, supports the Linked Data API (LDA). Specifically, we have embedded Elda – the Epimorphics Linked Data API – into Cerberus. From Elda's maintainer: "The LDA provides configurable mediation between a SPARQL endpoint and presentations of the data in HTML, XML, Turtle, or JSON."
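The LDA calls on slide 29 are plain URL patterns, so building one is just query-string assembly. The ./api/... path shape follows the slide's examples; the data.nature.com host prefix for the API is an assumption here.

```python
# Sketch of building LDA request URLs in the slide-29 style;
# the host prefix is assumed.
from urllib.parse import urlencode

def lda_url(endpoint: str, **params) -> str:
    """Build an LDA request like ./api/products.json?pcode=ng&_page=2."""
    url = "http://data.nature.com/api/" + endpoint
    if params:
        # Sort for a stable, cache-friendly query string
        url += "?" + urlencode(sorted(params.items()))
    return url

url = lda_url("products.json", pcode="ng", _page=2)
```

Underscore-prefixed parameters (_page, _view, _sort, _properties) are the LDA's own control knobs, while plain names (pcode, familyName) filter on data properties.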
  29. In closing I would just note some future challenges: We need to secure the current triplestore platform to make it resilient and performant. We need to consider what a next-generation platform would need to support – sizing, inferencing, etc. We need to develop a more complete ontology with a foundational ontology footing. This stuff is hard. Harder than publishing web pages. Harder than loading data into a database. Maybe with more practice and with better tools support we can begin to treat this more like a day-to-day operation. But there can be no doubt that the web is becoming increasingly granular.
  30. And just to note that we are looking to recruit semweb developers with strong Java and semantic skills. So if you are interested in joining us, here's the job ad.