Techniques used in RDF Data Publishing at Nature Publishing Group

This document summarizes Nature Publishing Group's techniques for RDF data publishing. It discusses NPG's prior semantic publishing work and linked data applications. It then describes NPG's ontology, data hosting on a cloud platform, public SPARQL endpoint, and internal Hub application. The document outlines NPG's data extraction, loading, and publishing process, as well as techniques for naming, monitoring, and providing a linked data API.

Tony Hammond
Data Architect, NPG

March 5, 2013

Nature Publishing Group

● NPG is a division of Macmillan (a privately owned company)
● Publishes ~120 titles in all
● 34 Nature-branded titles
● 53 academic and society journals
● 16 magazines (incl. Scientific American)
● ~1,000 employees, 17 offices (5 continents)
● ~30 society partners
● Databases, conferences/events, multimedia

Semantic Publishing at NPG

• Prior Work
  • RSS 1.0 webfeeds
  • HTML metadata
  • PDF metadata (XMP)
  • Urchin – RSS aggregator
  • OAI-PMH, OpenSearch (SRU), OpenURL
• Linked Data Apps
  • Public Data: test viability of data publishing
  • Hub: application of technology internally

Cloud Hosting

• TSO OpenUp® SaaS platform
• Offers 5store as a triplestore
• Scale-out architecture (C/C++)
• Supports up to a trillion triples
• Load speed of 150,000 triples per second (tps)
• SPARQL 1.0, with 1.1 features (aggregates, etc.)
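To illustrate what a SPARQL 1.1 aggregate looks like against such an endpoint, here is a minimal sketch in Python that only builds the request URL and sends nothing over the network. The endpoint address (taken from the "data.nature.com/query" slide, with an assumed `http` scheme), the query itself, and the helper name `build_query_url` are illustrative assumptions, not NPG's actual client code.

```python
from urllib.parse import urlencode

# Assumed endpoint address, based on the "data.nature.com/query" slide.
ENDPOINT = "http://data.nature.com/query"

# A SPARQL 1.1 aggregate: count the triples held in each named graph.
QUERY = """
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
ORDER BY DESC(?triples)
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the standard `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

The `query` parameter name comes from the SPARQL Protocol; whether a given store accepts `COUNT` and `GROUP BY` depends on which 1.1 features it supports.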

Local Hosting

• Apache Jena TDB
• Single-node architecture (Java)
• Supports up to ~1.5b triples (tested)
• SPARQL 1.1

Naming Policy

npg:  http://ns.nature.com/terms/
npgg: http://ns.nature.com/graphs/

Object           Example        Usage
Graph            npgg:gadgets   gadgets:33 ex:title "Title" npgg:gadgets .
Class            npg:Gadget     gadgets:33 a npg:Gadget npgg:gadgets .
Object Property  npg:hasGadget  _:12 npg:hasGadget gadgets:33 npgg:_ .
Data Property    ex:title       gadgets:33 ex:title "Title" npgg:gadgets .
Instance         gadgets:33     gadgets:33 ex:title "Title" npgg:gadgets .
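The usage column above is written as quads (subject, predicate, object, graph) with prefixed names. As a rough sketch of how those prefixed names expand into full IRIs, the Python below formats two of the rows as N-Quads lines. Only the `npg:` and `npgg:` namespaces come from the slide; the `gadgets:` and `ex:` namespaces and the helper names `expand` and `quad` are hypothetical placeholders.

```python
PREFIXES = {
    "npg": "http://ns.nature.com/terms/",      # from the slide
    "npgg": "http://ns.nature.com/graphs/",    # from the slide
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "gadgets": "http://example.org/gadgets/",  # hypothetical namespace
    "ex": "http://example.org/terms/",         # hypothetical namespace
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as npg:Gadget into a bracketed IRI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


def quad(s: str, p: str, o: str, g: str) -> str:
    """Format one N-Quads line; `o` is left untouched if it is a quoted literal."""
    obj = o if o.startswith('"') else expand(o)
    return f"{expand(s)} {expand(p)} {obj} {expand(g)} ."


# The "Data Property" row and the "Class" row (with `a` spelled out as rdf:type).
title_quad = quad("gadgets:33", "ex:title", '"Title"', "npgg:gadgets")
class_quad = quad("gadgets:33", "rdf:type", "npg:Gadget", "npgg:gadgets")
print(title_quad)
print(class_quad)
```

Both statements land in the same named graph, `npgg:gadgets`, which is the point of the policy: the graph name follows from the class of thing being described.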

Contracts

npgg:affiliations
    a npg:Graph, void:Dataset ;
    dcterms:description "Graph of npg:Affiliation objects" ;
    dcterms:issued "2013-02-15"^^xsd:date ;
    dcterms:modified "2013-02-15"^^xsd:date ;
    dcterms:publisher [
        a foaf:Organization ;
        foaf:mbox <mailto:developers@nature.com> ;
        foaf:name "Nature Publishing Group"
    ];
    dcterms:source "extractor-xml" ;
    dcterms:title "npgg:affiliations" ;
    rdfs:label "npgg:affiliations" ;
    void:classPartition [
        void:class npg:Affiliation ;
        void:entities "973208"^^xsd:int
    ];
    void:propertyPartition [
        void:property vcard:url ;
        void:triples "326"^^xsd:int
    ], [
        void:property vcard:street-address ;
        void:triples "82638"^^xsd:int
    ], [
        void:property vcard:region ;
        void:triples "183483"^^xsd:int
    ], [
        void:property vcard:organisation-name ;
        void:triples "694290"^^xsd:int
    ], [
        void:property vcard:locality ;
        void:triples "412042"^^xsd:int
    ], [
        void:property vcard:email ;
        void:triples "21650"^^xsd:int
    ], [
        void:property vcard:country-name ;
        void:triples 0
    ], [
        void:property rdfs:label ;
        void:triples "973208"^^xsd:int
    ], [
        void:property rdf:type ;
        void:triples "973208"^^xsd:int
    ];
    void:triples "3340845"^^xsd:int ;
    void:vocabulary npg:, rdf:, rdfs:, void: .
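One way to read a VoID description like this as a contract is to check that its figures are internally consistent. The Python sketch below copies the counts from the slide and tests two conditions: the property partitions sum to the graph's `void:triples`, and the `rdf:type` count matches `void:entities`. The second condition assumes each entity carries exactly one `rdf:type` statement, and the function name `check_contract` is illustrative; the slide does not say how NPG actually validates these descriptions.

```python
# Triple counts per property partition, copied from the slide.
PROPERTY_TRIPLES = {
    "vcard:url": 326,
    "vcard:street-address": 82638,
    "vcard:region": 183483,
    "vcard:organisation-name": 694290,
    "vcard:locality": 412042,
    "vcard:email": 21650,
    "vcard:country-name": 0,
    "rdfs:label": 973208,
    "rdf:type": 973208,
}
TOTAL_TRIPLES = 3340845   # void:triples for the whole graph
CLASS_ENTITIES = 973208   # void:entities for npg:Affiliation


def check_contract(property_triples, total_triples, class_entities):
    """Return a list of human-readable problems; an empty list means the figures agree."""
    problems = []
    if sum(property_triples.values()) != total_triples:
        problems.append("property partitions do not sum to void:triples")
    # Assumption: one rdf:type statement per entity in this graph.
    if property_triples.get("rdf:type") != class_entities:
        problems.append("rdf:type count differs from void:entities")
    return problems


problems = check_contract(PROPERTY_TRIPLES, TOTAL_TRIPLES, CLASS_ENTITIES)
print(problems or "contract holds")
```

A check of this kind also surfaces gaps in the data: here `vcard:country-name` is declared but has zero triples.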

Linked Data API

• ./api/articles [.json, .rdf, .xml]
• ./api/articles?hasProduct.pcode=ng
• ./api/contributors?familyName=Smith
• ./api/products.json?pcode=ng&_page=2
• ./api/products?_view=none&_properties=pcode
• ./api/search?title=black+hole
• ./api/tree/subjects/children.xml?_sort=title
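The URL patterns above combine a resource path, an optional format extension, and query parameters, some of which are underscore-prefixed API controls such as `_page` and `_sort`. As a small sketch of how a client might assemble them, the Python helper below rebuilds three of the listed URLs. The helper name `api_url` and its signature are hypothetical, and the paths are kept relative because the slide does not give a base URL.

```python
from urllib.parse import urlencode


def api_url(resource, fmt=None, params=None):
    """Build a relative API URL: ./api/<resource>[.<fmt>][?<query string>]."""
    path = f"./api/{resource}"
    if fmt:
        path += f".{fmt}"
    if params:
        path += "?" + urlencode(params)
    return path


# Filter on a nested property, page through JSON results, and run a text search.
by_product = api_url("articles", params={"hasProduct.pcode": "ng"})
page_two = api_url("products", fmt="json", params={"pcode": "ng", "_page": 2})
search = api_url("search", params={"title": "black hole"})
print(by_product, page_two, search, sep="\n")
```

Taking the parameters as a dictionary rather than keyword arguments is deliberate: names such as `hasProduct.pcode` are not valid Python identifiers.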

Positions Available

goo.gl/bYIt8

www.linkedin.com/jobs?jobId=4890057&viewJob

Information

data.nature.com

developers.nature.com/docs

datahub.io/group/npg
prefix.cc/npg

Techniques used in RDF Data Publishing at Nature Publishing Group

1. Techniques used in RDF Data Publishing at Nature Publishing Group Tony Hammond Data Architect, NPG March 5, 2013

2. Nature Publishing Group ● NPG a division of Macmillan (a privately owned company) ● Publishes ~120 titles in all ● 34 Nature branded titles ● 53 academic and society journals ● 16 magazines (incl. Scientific American) ● ~1000 employees,17 offices (5 continents) ● ~30 society partners ● Databases, conferences/events, multimedia 2

3. Semantic Publishing at NPG • Prior Work • RSS 1.0 webfeeds • HTML metadata • PDF metadata (XMP) • Urchin – RSS aggregator • OAI-PMH, OpenSearch (SRU), OpenURL • Linked Data Apps • Public Data: test viability of data publishing • Hub: application of technology internally 3

4. Public Data 4

5. NPG by Numbers 5

6. NPG Ontology 6

7. Cloud Hosting • TSO OpenUp® SaaS platform • Offers 5store as a triplestore • Scale-out architecture (C/C++) • Supports up to a trillion triples • 150,000tps load speed • SPARQL 1.0, with 1.1 features (aggregates, etc) 7

8. data.nature.com 8

9. data.nature.com/query 9

10. Hub 10

11. Hub: Problem 11

12. Hub: Solution 12

13. Hub: Method 13

14. XMP 14

15. Building the Graph 15

16. Local Hosting • Apache TDB • Single-node architecture (Java) • Supports up to ~1.5b triples (tested) • SPARQL 1.1 16

17. Data Publishing 17

18. Hub Finder 18

19. Hub Finder: Results 19

20. Techniques 20

21. Naming Architecture 21

22. Naming Policy npg: http://ns.nature.com/terms/ npgg: http://ns.nature.com/graphs/ Object Example Usage Graph npgg:gadgets gadgets:33 ex:title "Title" npgg:gadgets . Class npg:Gadget gadgets:33 a npg:Gadget npgg:gadgets . Object npg:hasGadget _:12 npg:hasGadget gadgets:33 npgg:_ . Property Data ex:title gadgets:33 ex:title "Title" npgg:gadgets . Property Instance gadgets:33 gadgets:33 ex:title "Title" npgg:gadgets . 22

23. Publishing 23

24. Monitoring 24

25. ETL Process 25

26. Datastore: Imports 26

27. Datastore: Exports 27

28. Contracts npgg:affiliations void:property vcard:region ; a npg:Graph, void:Dataset ; void:triples "183483"^^xsd:int dcterms:description "Graph of npg:Affiliation objects" ; ], [ dcterms:issued "2013-02-15"^^xsd:date ; void:property vcard:organisation-name ; dcterms:modified "2013-02-15"^^xsd:date ; void:triples "694290"^^xsd:int dcterms:publisher [ ], [ a foaf:Organization ; void:property vcard:locality ; foaf:mbox <mailto:developers@nature.com> ; void:triples "412042"^^xsd:int foaf:name "Nature Publishing Group" ], [ ]; void:property vcard:email ; dcterms:source "extractor-xml" ; void:triples "21650"^^xsd:int dcterms:title "npgg:affiliations" ; ], [ rdfs:label "npgg:affiliations" ; void:property vcard:country-name ; void:classPartition [ void:triples 0 void:class npg:Affiliation ; ], [ void:entities "973208"^^xsd:int void:property rdfs:label ; ]; void:triples "973208"^^xsd:int void:propertyPartition [ ], [ void:property vcard:url ; void:property rdf:type ; void:triples "326"^^xsd:int void:triples "973208"^^xsd:int ], [ ]; void:property vcard:street-address ; void:triples "3340845"^^xsd:int ; void:triples "82638"^^xsd:int void:vocabulary npg:, rdf:, rdfs:, void: . ], [ 28

29. Linked Data API • ./api/articles [.json, .rdf, .xml] • ./api/articles?hasProduct.pcode=ng • ./api/contributors?familyName=Smith • ./api/products.json?pcode=ng&_page=2 • ./api/products?_view=none&_properties=pcode • ./api/search?title=black+hole • ./api/tree/subjects/children.xml?_sort=title 29

30. Closing

31. Positions Available
goo.gl/bYIt8
www.linkedin.com/jobs?jobId=4890057&viewJob

32. Information
data.nature.com
developers.nature.com/docs
datahub.io/group/npg
prefix.cc/npg

Editor's Notes

Background: NPG is a division of Macmillan Publishers Ltd, a global publishing group founded in the United Kingdom in 1843. Macmillan is itself owned by the German-based, family-run company Verlagsgruppe Georg von Holtzbrinck GmbH. Nature magazine started publication in 1869. Scientific American started publication in 1845.
Prior Work: NPG has long been receptive to RDF and to semantic publishing. It has now been publishing RSS 1.0 webfeeds for 10 years and XMP in PDFs for over 5 years. A number of digital library–focused services (OAI-PMH, OpenSearch (SRU), OpenURL) subsequently made use of public vocabularies.
Linked Data Apps: Meanwhile the linked data paradigm has grown and can be seen as RDF's coming of age. Tim Berners-Lee (TBL) introduced a set of guidelines in July 2006, just over 6 years ago. NPG first began to explore this space early in 2010 and since then has developed two main applications: Public Data and Hub.
The public data application was intended to gauge the readiness of the marketplace for data publishing. Scientific exchange is built on the reuse of ideas. With this linked data publishing we wanted to start testing the reusability of the knowledge that NPG generates. There was some initial interest in our first release, and less in the second. We have not been actively promoting this, and have done a poor job of linking to our own data. We have also had some ongoing problems in accessing query history, which we are working to resolve.
As a reference, this graph shows on a log scale (i.e. each vertical division shows a 10-fold increase) some relevant numbers:
• ~120 journals
• ~1m articles
• ~10m citations
• ~300m triples (a current working number for the public dataset)
The 'Plimsoll Line' at right shows the initial banding we defined for sizing our requirements:
• 10–100m
• 100m–1b
• 1b–10b
Our public datasets appeared in the 1st and 2nd bands.
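The banding can be made concrete with a small lookup. This is only a sketch: the band edges come from the note above, while the function name and the choice of half-open intervals (lower bound included, upper bound excluded) are assumptions made for illustration. The two sample inputs are the release sizes quoted elsewhere in these notes (about 22m and about 270m triples).

```python
from typing import Optional

# Sizing bands from the notes: 10-100m, 100m-1b, 1b-10b triples.
BANDS = [
    (10_000_000, 100_000_000),
    (100_000_000, 1_000_000_000),
    (1_000_000_000, 10_000_000_000),
]


def sizing_band(triples: int) -> Optional[int]:
    """Return the 1-based band for a triple count, or None if it falls outside every band."""
    for index, (low, high) in enumerate(BANDS, start=1):
        # Assumption: each band includes its lower edge and excludes its upper edge.
        if low <= triples < high:
            return index
    return None


print(sizing_band(22_000_000))   # first public release
print(sizing_band(270_000_000))  # second public release
```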
In both these linked data applications NPG has focussed on broad, but flat, coverage. Our data model currently covers some 12 object types in our public dataset and double that number in our internal dataset. We are working to extend that number further. The slide shows the internal data model implemented at this point, which is essentially the public bibliographic dataset together with our annotations data. As can be seen, the :Article object is a local hub object. Specifically, we have focussed on the RDF data model and some minimal RDFS schemas at this point and not on OWL ontologies. Our strategy has been to put broad coverage in place first and then to work up ontology 'verticals' later. A number of benefits can already be realized through simple RDF linking, without relying on inferencing.
We first began looking at vendors to host our data around Q2/Q3, 2010. We finally settled on TSO in Q1, 2011, as we already had previous relations with them and the technology they were able to offer (5store) was highly performant and scalable. TSO did not provide much in the way of API support, which actually suited us just fine, but they were enthusiastic partners and were flexible in their arrangements with us. For loading the data we just had to FTP a snapshot, or a set of snapshot files, and they would load this for us. Additionally, and as discussed later, we provided our own update mechanism using SPARQL Update. The SPARQL 1.0 endpoint was upgraded earlier this year with SPARQL 1.1 features. The only issue that we have not resolved satisfactorily is logging.
We had two releases of public data in 2012, in April (22m triples) and July (270m triples). Both distributions were released into the public domain under a Creative Commons CC0 license (or, more precisely, a public waiver).
April 4 Distribution: The first release included a subset of journals with articles and no citations: ~22m triples, 450,000 articles and 10 objects.
July 16 Distribution: The second included all journals (67 new titles) with articles and with citations: >270m triples, >900,000 articles and 12 objects. This distribution effectively doubles the number of articles while also adding new :Citation and :DataCitation objects, so that the complete citation graph for all NPG titles brings the full distribution to more than ten times the size of the previous one. We also added a live updating facility and RDF data dumps.
We built a web proxy, Cerberus, which was intended to perform three major tasks:
• browsing: implements a linked data browser
• linking: implements a linked data API
• lookup: implements URI dereferencing
Additionally a 4th background service, available at system and not user level, was implemented to support updates. Cerberus also provides for the following:
• centralized user access
• a rich set of serializations (via HTTP content negotiation)
• query limits
• prequery extensions (to article full text search)
• sample queries
The hub application is aimed at providing a discovery layer over our internal content.
We have an enterprise-wide initiative at NPG to upgrade our workflow systems, in which we are aiming both to database all our production assets and to do this at the beginning of the process cycle (at acceptance) rather than at the end (at publication) as currently. All publication assets will be included and will be stored in appropriate repositories. Currently only our XML asset base, i.e. our structured data, is well maintained in a dedicated XML database, MarkLogic. The accompanying BLOBs (.jpg, .gif, .pdf, etc.) are only maintained on the filesystems; we are exploring a DAM option. The consequence would be a small set of content repositories: a distributed data warehouse. The problem is: how do we find stuff?
The proposed solution is to lay a graph over the repositories. We represent the physical (content) layer by the red nodes in the graph. There is a 1:1 relationship between physical asset and graph node. We represent the logical (context) layer by the white nodes in the graph. These are the nodes that are richly interlinked. This conceptual graph overlay may bear some similarities to Topic Maps, if you are familiar with that technology.
The basic methodology is to introduce a registration process, to associate a description (metadata) with each asset (file), and from these descriptions to generate a linked data graph of objects. This registration process can be likened to a border patrol, with the descriptions as passports for assets. The proposed format for the metadata descriptions is standalone XMP packets, which have the benefits of both XML and RDF, as well as providing a useful set of constraints and keeping us focussed on media management. Note that this terminology of assets and objects is used internally. Basically the hub is an asset-driven object factory.
Details:
• Adobe standard for embedding metadata in binary objects (PDF, etc.)
• Standard mechanism (since PDF 1.4 in 2001) for adding metadata to PDF
• ISO Standard: ISO 16684-1:2012
• Essentially an XML document, specifically an RDF/XML document
• Can be embedded in a wide range of binary and non-binary media
• Can also be used as a standalone ('sidecar') document
Upsides:
• XML is easy to import into ML (MarkLogic); RDF is easy to import into TS (the triplestore)
• NPG is already using this in PDFs
Downsides:
• Uses archaic RDF forms (rdf:Alt, rdf:Bag, rdf:Seq)
• Requires certain properties (DC, etc.) to be mapped to structured forms
• SDK is C++ only (but other tools, such as ExifTool, have good support)
This figure shows how the graph would be built. We have two assets represented in the graph by the red nodes: an article XML file with properties shown in blue, and an article PDF with properties shown in green. Both of the objects are built from the descriptions associated with the assets (i.e. from the XMP packets). At right is shown some sample RDF that would be contained within the XMP packets. These two objects link to a concept object (the :Article), which itself is linked to other concept objects. Note that our current XML has no link to the PDF, but going forward we represent that linkage directly in the graph.
We have been using Apache TDB for local hosting for two main reasons: it is open source, and we are a Java shop. It supports SPARQL 1.1. Indexing is appreciably slower than with 5store.
The introduction of a triplestore changes the traditional publishing picture. Previously a CMS would access backend databases and present the data to an end-user. With the triplestore we are now publishing data which is both an output in its own right and an input to the CMS. The data is published to the web rather than stored in a database and accessed (possibly) through an API.
The Hub Finder is a set of apps in early development to provide contextual discovery services to in-house teams (Production, Editorial, Marketing, etc.). Here a simple Inspector is shown which provides a low-level view of objects and their properties. In this view the :Subject object from our subjects ontology is selected.
And here are some results for a search on the :Subject object.
Some techniques of our RDF publishing practice will be discussed:
• Naming architecture/policy
• Named graphs
• RDF filesystems for import/export
• Publishing contracts
• XMP packets for metadata packaging (discussed earlier)
• Linked Data API
We follow the usual practice of distinguishing between Information Resources and (so-called) Non-Information Resources, i.e. between descriptions of things and things themselves. To implement this we follow the well-known DBpedia pattern of 303 redirects, which, although clunky to set up and requiring an extra DNS lookup, does have the merit of being unambiguous. All NPG things are named in the ns.nature.com namespace. Descriptions for those things are served from the data.nature.com namespace. Documents built using this data are served from the www.nature.com namespace.
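The 303 pattern can be sketched as a function that turns the name of a thing into a redirect to its description. Only the two hostnames come from the notes; the rule used here (swap the host, keep the path unchanged) is an assumption for illustration, as is the function name `see_other`, and the real mapping on NPG's servers may differ.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit, urlunsplit

THING_HOST = "ns.nature.com"          # things are named here (from the notes)
DESCRIPTION_HOST = "data.nature.com"  # descriptions are served here (from the notes)


def see_other(thing_uri: str) -> Tuple[int, Dict[str, str]]:
    """Return the status code and headers of a 303 response for a thing URI."""
    parts = urlsplit(thing_uri)
    if parts.netloc != THING_HOST:
        raise ValueError(f"not a name in the {THING_HOST} namespace: {thing_uri}")
    # Assumption: the description lives at the same path on the description host.
    location = urlunsplit((parts.scheme, DESCRIPTION_HOST, parts.path, parts.query, ""))
    return 303, {"Location": location}


status, headers = see_other("http://ns.nature.com/terms/Gadget")
print(status, headers["Location"])
```

Because the thing and its description sit on different hosts, a client can never confuse the two, which is the lack of ambiguity the notes refer to.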
We have a strong naming policy. This allows for predictability in coding and in mapping to the API. We define a pair of namespaces: npg: for classes and properties, and npgg: for graphs. Each object type lives in its own named graph using the npgg: namespace for graphs. All objects and object properties are named in the npg: namespace, although additional classes may be added as appropriate. Datatype properties are sourced from common vocabularies wherever possible.
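One benefit of such a policy is that quads can be generated mechanically from an object type and an identifier. The sketch below emits N-Quads lines for the 'gadget' example from the Naming Policy slide. The npg: and npgg: namespace URIs are taken from that slide; the expansions chosen for the gadgets: and ex: prefixes are hypothetical placeholders, and the literal escaping covers only backslashes, quotes and newlines.

```python
from typing import List

NPG = "http://ns.nature.com/terms/"    # classes and properties (from the slide)
NPGG = "http://ns.nature.com/graphs/"  # named graphs (from the slide)
GADGETS = "http://ns.nature.com/gadgets/"  # hypothetical expansion of gadgets:
EX = "http://example.org/"                 # hypothetical expansion of ex:
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value: str) -> str:
    """Wrap a URI in angle brackets as N-Quads requires."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal (simplified escaping)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def gadget_quads(local_id: str, title: str) -> List[str]:
    """Emit the class and title statements for one gadget, both in the gadgets graph."""
    subject = iri(GADGETS + local_id)
    graph = iri(NPGG + "gadgets")
    return [
        f"{subject} {iri(RDF_TYPE)} {iri(NPG + 'Gadget')} {graph} .",
        f"{subject} {iri(EX + 'title')} {literal(title)} {graph} .",
    ]


for line in gadget_quads("33", "Title"):
    print(line)
```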
The diagram shows our data publishing operation. We support a 3-tiered development environment (live, staging and test) with dedicated triplestores for each environment. We also have the cloud-based public triplestore. Together we have eight SPARQL endpoints: four native triplestore endpoints, and four web proxy endpoints. We also have the datastore, which is not yet 3-tier enabled. This is a data prep machine and is where our triples are extracted to and indexed. We are planning to extend this across all three tiers. The triplestores are under Puppet management and are simply deployed.
The Hub Finder provides a Monitor page which runs a simple test across all our various SPARQL endpoints and reports one of three status levels: up (green), up but slow (amber), down (red). Timing information is also shown. The test just issues a simple lightweight SPARQL query to each of the endpoints. We are looking to tie this back into our regular Zenoss monitoring system in order to get email alerts, etc.
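The three status levels can be sketched as a timing wrapper around a probe. Everything here is illustrative: the 2-second 'slow' threshold is an assumed value (the notes give none), and the probe is passed in as a callable so the sketch has no network dependency. In practice the probe would issue the lightweight SPARQL query to one endpoint and raise on failure.

```python
import time
from typing import Callable, Tuple

SLOW_AFTER_SECONDS = 2.0  # hypothetical threshold; the notes do not give one


def classify(succeeded: bool, elapsed: float,
             slow_after: float = SLOW_AFTER_SECONDS) -> str:
    """Map a probe outcome onto the Monitor page's three status levels."""
    if not succeeded:
        return "red"    # down
    if elapsed > slow_after:
        return "amber"  # up, but slow
    return "green"      # up


def check(probe: Callable[[], object],
          slow_after: float = SLOW_AFTER_SECONDS) -> Tuple[str, float]:
    """Run one probe, time it, and return (status, elapsed seconds)."""
    start = time.monotonic()
    try:
        probe()
        succeeded = True
    except Exception:
        # Any failure (timeout, refused connection, HTTP error) counts as down.
        succeeded = False
    elapsed = time.monotonic() - start
    return classify(succeeded, elapsed, slow_after), elapsed


status, elapsed = check(lambda: None)  # stand-in probe that always succeeds
print(status)
```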
Our ETL (extract, transform, and load) process is fairly straightforward. Some points to note are that our extractor uses XPath and Jena. There is also a certain amount of Saxon and XQuery processing, as some early transforms wrote results out to RDF/XML as an intermediate step. Output is written out as N-Quads. The output of the extractor goes to the filesystem, building up, in effect, an RDF filesystem. This RDF filesystem is sourced by the assembler, which compiles a distribution for a given knowledgebase (e.g. internal and external) using publishing contracts. The contracts are essentially used as an output filter. More on this later. The assembler output can then be indexed by tdbloader to generate TDB indexes, or can be tarballed for downloading. The deployment is typically an scp over to the triplestore machine (although for our public store we would FTP the tarballs for remote indexing).
We define a data tree on our datastore machine for importing from various sources. The extraction process (which runs against MarkLogic) writes out RDF in N-Quads format and stores the files on disk by object and property. For example, under a given extraction run we will write out all the dc:title properties for an :Article object in a dedicated file under an articles folder. For the file name, we use the full URL name of the property, normalized to alphanumerics plus a couple of punctuation characters (e.g. dash and period). These files are further partitioned by restricting to a given product (i.e. journal title). So, we end up with a highly branched tree of N-Quads files. Other data sources (e.g. our static ontology datasets and imports from the SQL database) are likewise componentized.
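The file naming described above can be sketched as follows. The notes say only that the property URL is normalized to alphanumerics plus dash and period; replacing every other character with a dash, the `.nq` extension, and the run/object/product folder order are all assumptions made for this illustration, as are the function names.

```python
import re
from pathlib import PurePosixPath


def normalize(uri: str) -> str:
    """Keep alphanumerics, dash and period; replace anything else with a dash (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9.-]", "-", uri)


def import_path(run: str, object_type: str, product: str, property_uri: str) -> PurePosixPath:
    """Locate the N-Quads file holding one property of one object type for one product."""
    # Assumption: folders nest as run / object type / product.
    return PurePosixPath(run, object_type, product, normalize(property_uri) + ".nq")


path = import_path("run-001", "articles", "ng", "http://purl.org/dc/elements/1.1/title")
print(path)
```

Keeping one file per object type, property and product is what lets a later step pick and choose components without parsing any file it does not need.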
Complementing the imports tree we have a corresponding exports tree. This is partitioned by knowledgebase. Under each knowledgebase we have folders for dumps, indexes, snapshots and updates.
Our publishing contracts are based on VoID descriptions. Initially we maintained VoID descriptions (which we loosely call graphs and to which we assign a :Graph object type) for discovery purposes, listing classes and properties and, especially, counts. We have been obsessed with counts because triplestore support has thwarted us: either the triplestores did not support the function, or performance for larger populations was an obstacle, and updates presented their own challenges. We therefore tried to implement counting ourselves through our web proxy. We subsequently realized that we could dual-purpose these graph descriptions as publishing contracts: by listing only those objects and properties to be published, we let our assembler ignore any objects and properties not intended for that knowledgebase.
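Using a contract as an output filter can be sketched in a few lines. This is not NPG's assembler: the `Quad` type, the `assemble` function, and the dictionary form of the contract (graph name mapped to the set of properties listed for publication) are all hypothetical, and prefixed names stand in for full URIs to keep the example short. The npg:internalNote property and the npgg:workflows graph are invented to show what gets dropped.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Set


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str


# A contract maps each published graph to the properties listed for it.
Contract = Dict[str, Set[str]]


def assemble(quads: Iterable[Quad], contract: Contract) -> Iterator[Quad]:
    """Yield only quads whose graph and property both appear in the contract."""
    for quad in quads:
        published = contract.get(quad.graph)
        if published is not None and quad.predicate in published:
            yield quad


public_contract: Contract = {
    "npgg:affiliations": {"rdf:type", "rdfs:label", "vcard:url", "vcard:email"},
}

extracted = [
    Quad("affiliations:1", "vcard:url", "<http://example.org/>", "npgg:affiliations"),
    Quad("affiliations:1", "npg:internalNote", '"draft"', "npgg:affiliations"),  # invented property
    Quad("workflows:1", "rdfs:label", '"WF"', "npgg:workflows"),                 # invented graph
]

released = list(assemble(extracted, public_contract))
print(len(released))
```

Anything not named in the contract is simply never copied into the distribution, so an internal and an external knowledgebase can be built from the same extracted files.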
Cerberus, as mentioned earlier, supports the Linked Data API (LDA). Specifically, we have embedded Elda, the Epimorphics Linked Data API, into Cerberus. From Elda's maintainer: "The LDA provides configurable mediation between a SPARQL endpoint and presentations of the data in HTML, XML, Turtle, or JSON."
In closing, we would just note some future challenges:
• We need to secure the current triplestore platform to make it resilient and performant.
• We need to consider what a next generation platform would need to support: sizing, inferencing, etc.
• We need to develop a more complete ontology with a foundational ontology footing.
This stuff is hard. Harder than publishing web pages. Harder than loading data into a database. Maybe with more practice and with better tools support we can begin to treat this more like a day-to-day operation. But there can be no doubt that the web is becoming increasingly granular.
And just to note that we are looking to recruit semweb developers with strong Java and semantic skills. So if you are interested in joining us here's the job ad.
