Techniques used in RDF Data Publishing
at Nature Publishing Group

           Tony Hammond
       Data Architect, NPG

              March 5, 2013
Nature Publishing Group

● NPG is a division of Macmillan (a privately
  owned company)
● Publishes ~120 titles in all
  ● 34 Nature branded titles
  ● 53 academic and society journals
  ● 16 magazines (incl. Scientific American)
● ~1000 employees, 17 offices (5 continents)
● ~30 society partners
● Databases, conferences/events, multimedia

Semantic Publishing at NPG

• Prior Work
  •   RSS 1.0 webfeeds
  •   HTML metadata
  •   PDF metadata (XMP)
  •   Urchin – RSS aggregator
  •   OAI-PMH, OpenSearch (SRU), OpenURL
• Linked Data Apps
  • Public Data: test viability of data publishing
  • Hub: application of technology internally
Public Data



NPG by Numbers




NPG Ontology




Cloud Hosting

•   TSO OpenUp® SaaS platform
•   Offers 5store as a triplestore
•   Scale-out architecture (C/C++)
•   Supports up to a trillion triples
•   150,000 tps load speed
•   SPARQL 1.0, with 1.1 features
    (aggregates, etc.)


data.nature.com




data.nature.com/query
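The public endpoint at data.nature.com/query speaks the SPARQL Protocol. As a rough sketch with standard-library Python only, the snippet below composes a SPARQL 1.1 aggregate query of the kind the Cloud Hosting slide mentions and encodes it as a GET request; the npg:Article and npg:isPartOf names and the output=json parameter are illustrative assumptions, not NPG's documented schema.

```python
from urllib.parse import urlencode

# A SPARQL 1.1 aggregate query (COUNT + GROUP BY) of the kind noted on the
# Cloud Hosting slide. The class and property names here are illustrative
# assumptions, not the documented NPG ontology.
query = """
PREFIX npg: <http://ns.nature.com/terms/>
SELECT ?journal (COUNT(?article) AS ?n)
WHERE { ?article a npg:Article ; npg:isPartOf ?journal }
GROUP BY ?journal
ORDER BY DESC(?n)
"""

# SPARQL Protocol: the query travels as the `query` parameter of a GET
# request. `output=json` is a common store convention, assumed here.
params = urlencode({"query": query.strip(), "output": "json"})
endpoint_url = "http://data.nature.com/query?" + params
print(endpoint_url[:60])
```

Dereferencing endpoint_url (e.g. with urllib.request) would then return the store's chosen result serialization.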




Hub



Hub: Problem




Hub: Solution




Hub: Method




XMP




Building the Graph




Local Hosting

•   Apache Jena TDB
•   Single-node architecture (Java)
•   Supports up to ~1.5b triples (tested)
•   SPARQL 1.1




Data Publishing




Hub Finder




Hub Finder: Results




Techniques



Naming Architecture




Naming Policy
npg:     http://ns.nature.com/terms/
npgg:    http://ns.nature.com/graphs/

Object            Example         Usage
Graph             npgg:gadgets    gadgets:33 ex:title "Title" npgg:gadgets .
Class             npg:Gadget      gadgets:33 a npg:Gadget npgg:gadgets .
Object Property   npg:hasGadget   _:12 npg:hasGadget gadgets:33 npgg:_ .
Data Property     ex:title        gadgets:33 ex:title "Title" npgg:gadgets .
Instance          gadgets:33      gadgets:33 ex:title "Title" npgg:gadgets .
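The quad convention in the table (every statement asserted inside a named graph) can be sketched as a tiny N-Quads serializer. The gadgets: and ex: namespaces are the slide's placeholders, mapped below to hypothetical example.org URIs; only npg: and npgg: come from the stated policy.

```python
# Minimal sketch of the naming policy's quad convention. The gadgets: and
# ex: prefix URIs are assumptions for illustration, not real NPG namespaces.
PREFIXES = {
    "npg": "http://ns.nature.com/terms/",
    "npgg": "http://ns.nature.com/graphs/",
    "gadgets": "http://example.org/gadgets/",  # assumed placeholder
    "ex": "http://example.org/terms/",         # assumed placeholder
}

def expand(curie: str) -> str:
    """Expand a prefixed name such as 'npgg:gadgets' to an N-Quads IRI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"

def term(t: str) -> str:
    """Literals ("...") and blank nodes (_:...) pass through unchanged."""
    return t if t.startswith('"') or t.startswith("_:") else expand(t)

def quad(s: str, p: str, o: str, g: str) -> str:
    """Serialize one statement into named graph g as an N-Quads line."""
    return f"{term(s)} {expand(p)} {term(o)} {expand(g)} ."

print(quad("gadgets:33", "ex:title", '"Title"', "npgg:gadgets"))
print(quad("_:12", "npg:hasGadget", "gadgets:33", "npgg:_"))
```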



Publishing




Monitoring




ETL Process




Datastore: Imports




Datastore: Exports




Contracts
npgg:affiliations
    a npg:Graph, void:Dataset ;
    dcterms:description "Graph of npg:Affiliation objects" ;
    dcterms:issued "2013-02-15"^^xsd:date ;
    dcterms:modified "2013-02-15"^^xsd:date ;
    dcterms:publisher [
        a foaf:Organization ;
        foaf:mbox <mailto:developers@nature.com> ;
        foaf:name "Nature Publishing Group"
    ] ;
    dcterms:source "extractor-xml" ;
    dcterms:title "npgg:affiliations" ;
    rdfs:label "npgg:affiliations" ;
    void:classPartition [
        void:class npg:Affiliation ;
        void:entities "973208"^^xsd:int
    ] ;
    void:propertyPartition [
        void:property vcard:url ;
        void:triples "326"^^xsd:int
    ], [
        void:property vcard:street-address ;
        void:triples "82638"^^xsd:int
    ], [
        void:property vcard:region ;
        void:triples "183483"^^xsd:int
    ], [
        void:property vcard:organisation-name ;
        void:triples "694290"^^xsd:int
    ], [
        void:property vcard:locality ;
        void:triples "412042"^^xsd:int
    ], [
        void:property vcard:email ;
        void:triples "21650"^^xsd:int
    ], [
        void:property vcard:country-name ;
        void:triples 0
    ], [
        void:property rdfs:label ;
        void:triples "973208"^^xsd:int
    ], [
        void:property rdf:type ;
        void:triples "973208"^^xsd:int
    ] ;
    void:triples "3340845"^^xsd:int ;
    void:vocabulary npg:, rdf:, rdfs:, void: .
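Such a contract is more than documentation: because it carries both a graph-level triple count and per-property partition counts, a consumer can cross-check them automatically. A small Python check (counts transcribed from the contract above) confirms they agree.

```python
# Sanity check for the npgg:affiliations VoID contract: the per-property
# partition counts should sum to the declared graph-level void:triples total.
partitions = {
    "vcard:url": 326,
    "vcard:street-address": 82638,
    "vcard:region": 183483,
    "vcard:organisation-name": 694290,
    "vcard:locality": 412042,
    "vcard:email": 21650,
    "vcard:country-name": 0,
    "rdfs:label": 973208,
    "rdf:type": 973208,
}
declared_total = 3340845  # void:triples on npgg:affiliations

assert sum(partitions.values()) == declared_total
print("partition counts sum to declared total:", declared_total)
```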




Linked Data API

•   ./api/articles [.json, .rdf, .xml]
•   ./api/articles?hasProduct.pcode=ng
•   ./api/contributors?familyName=Smith
•   ./api/products.json?pcode=ng&_page=2
•   ./api/products?_view=none&_properties=pcode
•   ./api/search?title=black+hole
•   ./api/tree/subjects/children.xml?_sort=title
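A hedged sketch of composing such API calls with Python's standard library: the ./api paths and the parameter names (pcode, _page, _view, _properties, _sort) come from the list above, while the base host and the helper function itself are assumptions for illustration.

```python
from urllib.parse import urlencode

# Sketch of building Linked Data API calls like those listed above.
# BASE joins the data.nature.com host with the ./api paths shown; the
# helper is hypothetical, not NPG's client library.
BASE = "http://data.nature.com/api"

def api_url(path: str, fmt: str = "", **params) -> str:
    """Build an API URL: optional format suffix plus query parameters."""
    url = f"{BASE}/{path}" + (f".{fmt}" if fmt else "")
    return url + (f"?{urlencode(params)}" if params else "")

print(api_url("products", fmt="json", pcode="ng", _page=2))
print(api_url("search", title="black hole"))
```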



Closing



Positions Available




                   goo.gl/bYIt8

   www.linkedin.com/jobs?jobId=4890057&viewJob
Information


                data.nature.com

         developers.nature.com/docs

              datahub.io/group/npg
                  prefix.cc/npg





More Related Content

What's hot

Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...MongoDB
 
Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB MongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBantoinegirbal
 
Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationMongoDB
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorHenrik Ingo
 
Apache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG MeetingApache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG MeetingMyles Braithwaite
 
MongoDB for Analytics
MongoDB for AnalyticsMongoDB for Analytics
MongoDB for AnalyticsMongoDB
 
Social Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBSocial Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBTakahiro Inoue
 
Document Model for High Speed Spark Processing
Document Model for High Speed Spark ProcessingDocument Model for High Speed Spark Processing
Document Model for High Speed Spark ProcessingMongoDB
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLMongoDB
 
2011 Mongo FR - Indexing in MongoDB
2011 Mongo FR - Indexing in MongoDB2011 Mongo FR - Indexing in MongoDB
2011 Mongo FR - Indexing in MongoDBantoinegirbal
 
Back to Basics 2017: Mí primera aplicación MongoDB
Back to Basics 2017: Mí primera aplicación MongoDBBack to Basics 2017: Mí primera aplicación MongoDB
Back to Basics 2017: Mí primera aplicación MongoDBMongoDB
 
Getting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSGetting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSMongoDB
 
NoSQL - An introduction to CouchDB
NoSQL - An introduction to CouchDBNoSQL - An introduction to CouchDB
NoSQL - An introduction to CouchDBJonathan Weiss
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsMongoDB
 
MongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaMongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaJuan Antonio Roy Couto
 

What's hot (20)

Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
 
Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB Application
 
MongoDb and NoSQL
MongoDb and NoSQLMongoDb and NoSQL
MongoDb and NoSQL
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop Connector
 
Mondodb
MondodbMondodb
Mondodb
 
Python and MongoDB
Python and MongoDB Python and MongoDB
Python and MongoDB
 
Apache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG MeetingApache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
 
MongoDB for Analytics
MongoDB for AnalyticsMongoDB for Analytics
MongoDB for Analytics
 
Social Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBSocial Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDB
 
Document Model for High Speed Spark Processing
Document Model for High Speed Spark ProcessingDocument Model for High Speed Spark Processing
Document Model for High Speed Spark Processing
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQL
 
2011 Mongo FR - Indexing in MongoDB
2011 Mongo FR - Indexing in MongoDB2011 Mongo FR - Indexing in MongoDB
2011 Mongo FR - Indexing in MongoDB
 
Back to Basics 2017: Mí primera aplicación MongoDB
Back to Basics 2017: Mí primera aplicación MongoDBBack to Basics 2017: Mí primera aplicación MongoDB
Back to Basics 2017: Mí primera aplicación MongoDB
 
Querying mongo db
Querying mongo dbQuerying mongo db
Querying mongo db
 
Getting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSGetting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJS
 
NoSQL - An introduction to CouchDB
NoSQL - An introduction to CouchDBNoSQL - An introduction to CouchDB
NoSQL - An introduction to CouchDB
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
 
MongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaMongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de Huelva
 

Similar to Techniques used in RDF Data Publishing at Nature Publishing Group

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Startedabramsm
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkMongoDB
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Cross-Platform Mobile Apps & Drupal Web Services
Cross-Platform Mobile Apps & Drupal Web ServicesCross-Platform Mobile Apps & Drupal Web Services
Cross-Platform Mobile Apps & Drupal Web ServicesBob Sims
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2Dimitris Kontokostas
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nltieleman
 
Pig on spark
Pig on sparkPig on spark
Pig on sparkSigmoid
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDBArangoDB Database
 
Retaining globally distributed high availability
Retaining globally distributed high availabilityRetaining globally distributed high availability
Retaining globally distributed high availabilityspil-engineering
 

Similar to Techniques used in RDF Data Publishing at Nature Publishing Group (20)

Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
HyperGraphQL
HyperGraphQLHyperGraphQL
HyperGraphQL
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Started
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & Spark
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Cross-Platform Mobile Apps & Drupal Web Services
Cross-Platform Mobile Apps & Drupal Web ServicesCross-Platform Mobile Apps & Drupal Web Services
Cross-Platform Mobile Apps & Drupal Web Services
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Pig on spark
Pig on sparkPig on spark
Pig on spark
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDB
 
Retaining globally distributed high availability
Retaining globally distributed high availabilityRetaining globally distributed high availability
Retaining globally distributed high availability
 

More from Tony Hammond

The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesTony Hammond
 
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWLData Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWLTony Hammond
 
Iswc 2014-hammond-pasin-presentation-final
Iswc 2014-hammond-pasin-presentation-finalIswc 2014-hammond-pasin-presentation-final
Iswc 2014-hammond-pasin-presentation-finalTony Hammond
 
nature.com OpenSearch
nature.com OpenSearchnature.com OpenSearch
nature.com OpenSearchTony Hammond
 
OpenURL - The Rough Guide
OpenURL - The Rough GuideOpenURL - The Rough Guide
OpenURL - The Rough GuideTony Hammond
 
Agile Descriptions
Agile DescriptionsAgile Descriptions
Agile DescriptionsTony Hammond
 

More from Tony Hammond (11)

XMP Inspector
XMP InspectorXMP Inspector
XMP Inspector
 
The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologies
 
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWLData Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
Data Integration & Disintegration: Managing SN SciGraph with SHACL and OWL
 
Iswc 2014-hammond-pasin-presentation-final
Iswc 2014-hammond-pasin-presentation-finalIswc 2014-hammond-pasin-presentation-final
Iswc 2014-hammond-pasin-presentation-final
 
nature.com OpenSearch
nature.com OpenSearchnature.com OpenSearch
nature.com OpenSearch
 
Handle 08
Handle 08Handle 08
Handle 08
 
OpenURL - The Rough Guide
OpenURL - The Rough GuideOpenURL - The Rough Guide
OpenURL - The Rough Guide
 
Bionlp 07
Bionlp 07Bionlp 07
Bionlp 07
 
Agile Descriptions
Agile DescriptionsAgile Descriptions
Agile Descriptions
 
Yads
YadsYads
Yads
 
Jisc
JiscJisc
Jisc
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Techniques used in RDF Data Publishing at Nature Publishing Group

  • 1. Techniques used in RDF Data Publishing at Nature Publishing Group Tony Hammond Data Architect, NPG March 5, 2013
  • 2. Nature Publishing Group ● NPG a division of Macmillan (a privately owned company) ● Publishes ~120 titles in all ● 34 Nature branded titles ● 53 academic and society journals ● 16 magazines (incl. Scientific American) ● ~1000 employees,17 offices (5 continents) ● ~30 society partners ● Databases, conferences/events, multimedia 2
  • 3. Semantic Publishing at NPG • Prior Work • RSS 1.0 webfeeds • HTML metadata • PDF metadata (XMP) • Urchin – RSS aggregator • OAI-PMH, OpenSearch (SRU), OpenURL • Linked Data Apps • Public Data: test viability of data publishing • Hub: application of technology internally 3
  • 7. Cloud Hosting • TSO OpenUp® SaaS platform • Offers 5store as a triplestore • Scale-out architecture (C/C++) • Supports up to a trillion triples • 150,000 tps load speed • SPARQL 1.0, with 1.1 features (aggregates, etc.) 7
  • 10. Hub 10
  • 14. XMP 14
  • 16. Local Hosting • Apache TDB • Single-node architecture (Java) • Supports up to ~1.5b triples (tested) • SPARQL 1.1 16
  • 22. Naming Policy
    npg:  http://ns.nature.com/terms/
    npgg: http://ns.nature.com/graphs/

    Object           Example         Usage
    Graph            npgg:gadgets    gadgets:33 ex:title "Title" npgg:gadgets .
    Class            npg:Gadget      gadgets:33 a npg:Gadget npgg:gadgets .
    Object Property  npg:hasGadget   _:12 npg:hasGadget gadgets:33 npgg:_ .
    Data Property    ex:title        gadgets:33 ex:title "Title" npgg:gadgets .
    Instance         gadgets:33      gadgets:33 ex:title "Title" npgg:gadgets .
    22
  • 28. Contracts
    npgg:affiliations
        a npg:Graph, void:Dataset ;
        dcterms:description "Graph of npg:Affiliation objects" ;
        dcterms:issued "2013-02-15"^^xsd:date ;
        dcterms:modified "2013-02-15"^^xsd:date ;
        dcterms:publisher [
            a foaf:Organization ;
            foaf:mbox <mailto:developers@nature.com> ;
            foaf:name "Nature Publishing Group"
        ] ;
        dcterms:source "extractor-xml" ;
        dcterms:title "npgg:affiliations" ;
        rdfs:label "npgg:affiliations" ;
        void:classPartition [
            void:class npg:Affiliation ;
            void:entities "973208"^^xsd:int
        ] ;
        void:propertyPartition [
            void:property rdf:type ;
            void:triples "973208"^^xsd:int
        ], [
            void:property vcard:region ;
            void:triples "183483"^^xsd:int
        ], [
            void:property vcard:organisation-name ;
            void:triples "694290"^^xsd:int
        ], [
            void:property vcard:locality ;
            void:triples "412042"^^xsd:int
        ], [
            void:property vcard:email ;
            void:triples "21650"^^xsd:int
        ], [
            void:property vcard:country-name ;
            void:triples 0
        ], [
            void:property rdfs:label ;
            void:triples "973208"^^xsd:int
        ], [
            void:property vcard:url ;
            void:triples "326"^^xsd:int
        ], [
            void:property vcard:street-address ;
            void:triples "82638"^^xsd:int
        ] ;
        void:triples "3340845"^^xsd:int ;
        void:vocabulary npg:, rdf:, rdfs:, void: .
    28
  • 29. Linked Data API • ./api/articles [.json, .rdf, .xml] • ./api/articles?hasProduct.pcode=ng • ./api/contributors?familyName=Smith • ./api/products.json?pcode=ng&_page=2 • ./api/products?_view=none&_properties=pcode • ./api/search?title=black+hole • ./api/tree/subjects/children.xml?_sort=title 29
  • 30. Closing 30
  • 31. Positions Available goo.gl/bYIt8 www.linkedin.com/jobs?jobId=4890057&viewJob 31
  • 32. Information data.nature.com developers.nature.com/docs datahub.io/group/npg prefix.cc/npg 32

Editor's Notes

  1. Background: NPG is a division of Macmillan Publishers Ltd, a global publishing group founded in the United Kingdom in 1843. Macmillan is itself owned by the German-based, family-run company Verlagsgruppe Georg von Holtzbrinck GmbH. Nature magazine started publication in 1869. Scientific American started publication in 1845.
  2. Prior Work: NPG has long been receptive to RDF and to semantic publishing. It now has 10 years publishing RSS 1.0 webfeeds and over 5 years publishing XMP in PDFs. A number of digital library–focused services – OAI-PMH, OpenSearch (SRU), OpenURL – subsequently made use of public vocabularies. Linked Data Apps: Meantime the linked data paradigm has grown and can be seen as RDF's coming of age. TBL introduced a set of guidelines in July 2006 – just over 6 years ago. NPG first began to explore this space early in 2010 and since then has developed two main applications: Public Data and Hub.
  3. The public data application was intended to gauge the readiness of the marketplace for data publishing. Scientific exchange is built on the reuse of ideas. With this linked data publishing we wanted to start testing the reusability of the knowledge that NPG generates. There was some initial interest in our first release, less in the second release. We have not been actively promoting this, and have done a poor job linking to our own data. We have also had some ongoing problems in accessing query history which we are working to resolve.
  4. As a reference, this graph shows on a log scale (i.e. each vertical division shows a 10-fold increase) some relevant numbers: ~120 journals, ~1m articles, ~10m citations, ~300m triples (a current working number for the public dataset). The 'Plimsoll Line' at right shows the initial banding we defined for sizing our requirements: 10–100m, 100m–1b, 1b–10b. Our public datasets appeared in the 1st and 2nd bands.
  5. In both these linked data applications NPG has focussed more on a broad – but flat – coverage. Our data model currently covers some 12 object types in our public dataset and double that number in our internal dataset. We are working to extend that number further. The slide shows the internal data model implemented at this point, which is generally the public bibliographic dataset together with our annotations data. As can be seen, the :Article object is a local hub object. Specifically, we have focussed on the RDF data model and some minimal RDFS schemas at this point, and not on OWL ontologies. Our strategy has been to put in broadband coverage and then to work up ontology 'verticals' later. Already a number of benefits may be realized by considering simple RDF linking and steering shy of inferencing.
  6. We first began looking at vendors to host our data around Q2/Q3 2010. We finally settled on TSO in Q1 2011, as we already had previous relations with them and the technology they were able to offer (5store) was highly performant and scalable. TSO did not provide much in the way of API support – which actually suited us just fine – but they were enthusiastic partners and were flexible in their arrangements with us. For loading the data we just had to ftp a snapshot – or set of snapshot files – and they would load this for us. Additionally – and as discussed later – we provided our own update mechanism using SPARQL Update. The SPARQL 1.0 endpoint was upgraded earlier this year with SPARQL 1.1 features. The only issue that we have not resolved satisfactorily is logging.
  7. We had two releases of public data in 2012, April (22m) and July (270m) – both distributions were released into the public domain under a Creative Commons CC0 license (or more precisely a public waiver). April 4 distribution: the first release included a subset of journals with articles and no citations – ~22m triples, 450,000 articles and 10 object types. July 16 distribution: the second included all journals (67 new titles) with articles and with citations – >270m triples, >900,000 articles and 12 object types. This distribution effectively doubles the number of articles while also adding in new :Citation and :DataCitation objects, so that the complete citation graph for all NPG titles now brings the full distribution to more than ten times the size of the previous distribution. We also added a live updating facility and RDF data dumps.
  8. We built a web proxy – Cerberus – which was intended to perform three major tasks: browsing (implements a linked data browser), linking (implements a linked data API), and lookup (implements URI dereferencing). Additionally, a 4th background service – available at system and not user level – was implemented to support updates. Cerberus also provides for the following: centralized user access, a rich set of serializations (via HTTP content negotiation), query limits, prequery extensions (to article full-text search), and sample queries.
  9. The hub application is aimed at providing a discovery layer over our internal content.
  10. We have an enterprise-wide initiative at NPG to upgrade our workflow systems, in which we are aiming both to database all our production assets, and to do this at the beginning of the process cycle (at acceptance) rather than at the end (at publication) as currently. All publication assets will be included and will be stored in appropriate repositories. Currently only our XML asset base – i.e. our structured data – is well maintained in a dedicated XML database, MarkLogic. The accompanying BLOBs (.jpg, .gif, .pdf, etc.) are only maintained on the filesystems – we are exploring a DAM option. The consequence would be a small set of content repositories – a distributed data warehouse. The problem is: how do we find stuff?
  11. The proposed solution is to lay a graph over the repositories. We represent the physical (content) layer by the red nodes in the graph. There is a 1:1 relationship between physical asset and graph node. We represent the logical (context) layer by the white nodes in the graph. These are the nodes that are richly interlinked. This conceptual graph overlay may bear some similarities to Topic Maps, if you are familiar with that technology.
  12. The basic methodology is to introduce a registration process, to associate a description (metadata) with each asset (file), and from these descriptions to generate a linked data graph of objects. This registration process can be likened to a border patrol, and the descriptions to passports for assets. The proposed format for the metadata descriptions is standalone XMP packets, which have the benefits of both XML and RDF, as well as providing a useful set of constraints and keeping us focussed on media management. Should note that this terminology of assets and objects is used internally. Basically, the hub is an asset-driven object factory.
  13. Details: XMP is the Adobe standard for embedding metadata in binary objects – PDF, etc. It has been the standard mechanism for adding metadata to PDF since PDF 1.5 in 2001, and is an ISO standard: ISO 16684-1:2012. Essentially it is an XML document – specifically, an RDF/XML document. It can be embedded in a wide range of binary and non-binary media, and can also be used as a standalone ('sidecar') document. Upsides: XML is an easy import to MarkLogic, RDF is an easy import to the triplestore, and NPG is already using this in PDFs. Downsides: it uses archaic RDF forms (rdf:Alt, rdf:Bag, rdf:Seq), it requires certain properties (DC, etc.) to be mapped to structured forms, and the SDK is C++ only (though other tools – ExifTool – have good support).
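To make the 'sidecar' idea concrete, here is a minimal standalone XMP packet built as an RDF/XML string. This is an illustrative sketch only: the `xmp_packet` helper, the DOI value, and the title are all hypothetical, not NPG code; the rdf:Alt wrapping of dc:title shows the structured form the notes mention.

```python
# Sketch (not NPG code): a minimal standalone ("sidecar") XMP packet.
from xml.etree import ElementTree

def xmp_packet(title: str, identifier: str) -> str:
    """Return a standalone XMP packet (RDF/XML) describing one asset."""
    return f'''<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="">
      <dc:identifier>{identifier}</dc:identifier>
      <!-- dc:title must take the structured rdf:Alt form in XMP,
           one of the "archaic RDF forms" noted above -->
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">{title}</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>'''

# Hypothetical asset description
packet = xmp_packet("A hypothetical article", "doi:10.1038/example.1")
```

Because the packet is plain XML, it imports directly into an XML database, and because its payload is RDF/XML, it loads directly into a triplestore, which is the dual benefit described above.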
  14. This figure shows how the graph would be built. We have two assets represented in the graph by the red nodes – an article XML file with properties shown in blue, and an article PDF with properties shown in green. Both of the objects are built from the descriptions associated with the assets (i.e. from the XMP packets). At right is shown some sample RDF that would be contained within the XMP packets. These two objects link to a concept object (the :Article) which itself is linked to other concept objects. Note that our current XML has no link to the PDF, but going forward we represent that linkage directly in the graph.
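The graph-building step described here can be sketched as a small quad generator: each asset description contributes its own properties, and each asset node is linked to the shared :Article concept node. All URIs and the linking property (npg:hasAsset) are made up for illustration and are not NPG's actual terms.

```python
# Sketch, with hypothetical URIs: link XML and PDF asset nodes
# to a shared :Article concept node, emitting N-Quads lines.
def quads_for_assets(article_uri, assets, graph_uri):
    """Yield N-Quads lines joining each asset node to the article node."""
    for asset_uri, props in assets.items():
        # Concept node -> asset node link (illustrative property name)
        yield (f"<{article_uri}> <http://ns.nature.com/terms/hasAsset> "
               f"<{asset_uri}> <{graph_uri}> .")
        # Properties carried by the asset's own description (XMP packet)
        for pred, obj in props.items():
            yield f'<{asset_uri}> <{pred}> "{obj}" <{graph_uri}> .'

quads = list(quads_for_assets(
    "http://ns.nature.com/articles/example-1",
    {
        "http://ns.nature.com/assets/example-1.xml":
            {"http://purl.org/dc/terms/format": "application/xml"},
        "http://ns.nature.com/assets/example-1.pdf":
            {"http://purl.org/dc/terms/format": "application/pdf"},
    },
    "http://ns.nature.com/graphs/assets",
))
```

Note how the XML-to-PDF linkage the notes call out is carried by the graph itself (both assets hang off the same :Article node) rather than by the XML.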
  15. We have been using Apache TDB for local hosting for various reasons – it is open source and we are a Java shop. This supports SPARQL 1.1. Indexing is appreciably slower than 5store speeds.
  16. The introduction of a triplestore changes the traditional publishing picture. Previously a CMS would access backend databases and present the data to an end user. With the triplestore we are now publishing data which is both an output in its own right as well as an input to the CMS. The data is published to the web rather than stored in a database and accessed (possibly) through an API.
  17. The Hub Finder is a set of apps in early development to provide contextual discovery services to in-house teams (Production, Editorial, Marketing, etc.). Here a simple Inspector is shown which provides a low-level view of objects and their properties. In this view the :Subject object from our subjects ontology is selected.
  18. And here are some results from searching the :Subject object.
  19. Some techniques from our RDF publishing practice will be discussed: naming architecture/policy, named graphs, RDF filesystems for import/export, publishing contracts, XMP packets for metadata packaging (discussed earlier), and the Linked Data API.
  20. We follow the usual practice of distinguishing between Information Resources and (so-called) Non-Information Resources, i.e. between descriptions of things and the things themselves. To implement this we follow the well-known DBpedia pattern of 303 redirects, which although clunky to set up and requiring an extra DNS lookup does have the merit of being unambiguous. All NPG things are named in the ns.nature.com namespace. Descriptions for those things are served from the data.nature.com namespace. Documents built using this data are served from the www.nature.com namespace.
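The 303 pattern can be sketched as a pure function: a dereference of a "thing" URI under ns.nature.com answers 303 See Other, pointing at the description served from data.nature.com. Only the ns./data. namespace split comes from the talk; the assumption that paths carry over unchanged between the two hosts is mine, for illustration.

```python
# Sketch of the 303-redirect naming architecture; the path mapping
# between ns.nature.com and data.nature.com is an assumption.
from urllib.parse import urlsplit

def resolve(uri: str):
    """Return (status, location) for a dereference of `uri`."""
    parts = urlsplit(uri)
    if parts.netloc == "ns.nature.com":
        # Non-information resource: redirect to its description
        return 303, f"http://data.nature.com{parts.path}"
    # Information resource: serve it directly
    return 200, uri

status, location = resolve("http://ns.nature.com/articles/example-1")
```

The clunkiness the notes mention is visible even here: every lookup of a thing costs a second round trip to fetch its description.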
  21. We have a strong naming policy. This allows for predictability in coding and in mapping to the API. We define a pair of namespaces: npg: for classes and properties, and npgg: for graphs. Each object type lives in its own named graph using the npgg: namespace for graphs. All objects and object properties are named in the npg: namespace, although additional classes may be added as appropriate. Datatype properties are sourced from common vocabularies wherever possible.
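The predictability this policy buys can be shown with two tiny helpers that derive URIs from an object type name. The npg:/npgg: namespace URLs come from the slides; the naive lowercase-plus-"s" pluralisation rule is my guess from the Gadget/gadgets example, not a stated NPG convention.

```python
# Sketch of the naming policy; pluralisation rule is an assumption.
NPG = "http://ns.nature.com/terms/"    # classes and properties
NPGG = "http://ns.nature.com/graphs/"  # one named graph per object type

def class_uri(name: str) -> str:
    """npg:Gadget-style class URI for an object type."""
    return NPG + name

def graph_uri(name: str) -> str:
    """npgg:gadgets-style named-graph URI for an object type."""
    return NPGG + name.lower() + "s"
```

Because both URIs are computable from the type name alone, code and API routes never need a lookup table to find where an object type's triples live.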
  22. The diagram shows our data publishing operation. We support a 3-tiered development environment (live, staging and test) with dedicated triplestores for each environment. We also have the cloud-based public triplestore. Together we have eight SPARQL endpoints: four native triplestore endpoints, and four web proxy endpoints. We also have the datastore, which is not yet 3-tier enabled. This is a data prep machine and is where our triples are extracted to and indexed. We are planning to extend this across all three tiers. The triplestores are under Puppet management and simply deployed.
  23. The Hub Finder provides a Monitor page which runs a simple test across all our various SPARQL endpoints with three status levels: up (green), up but slow (amber), down (red). Timing info is also shown. This just issues a simple lightweight SPARQL query to each of the endpoints. We are looking to tie this back into our regular Zenoss monitoring system in order to get email alerts, etc.
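The monitor's traffic-light logic can be sketched as a pure classification of one timed probe per endpoint. The ASK probe query and the 2-second slow threshold are assumed values for illustration; the talk gives only the three status levels.

```python
# Sketch of the Monitor page's status banding; threshold is assumed.
PROBE_QUERY = "ASK { ?s ?p ?o }"   # a lightweight query sent to each endpoint
SLOW_THRESHOLD = 2.0               # seconds; hypothetical cut-off

def status(elapsed_seconds):
    """Band one probe result into the monitor's three status levels."""
    if elapsed_seconds is None:          # no response at all
        return "down"                    # red
    if elapsed_seconds > SLOW_THRESHOLD:
        return "up-slow"                 # amber
    return "up"                          # green
```

Keeping the probe trivially cheap matters: the monitor itself must not load the eight endpoints it is watching.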
  24. Our ETL process – or extract, transform, and load process – is fairly straightforward. Some points to note are that our extractor uses XPath and Jena. There is also a certain amount of Saxon and XQuery processing, as some early transforms wrote results out to RDF/XML as an intermediate step. Output is written out in nquads. The output of the extractor goes to the filesystem, building up, in effect, an RDF filesystem. This RDF filesystem is sourced by the assembler, which compiles a distribution for a given knowledgebase (e.g. internal and external) using publishing contracts. The contracts are essentially used as an output filter. More on this later. The assembler output can then be indexed by tdbloader for generating TDB indexes or can be tarballed for downloading. The deployment is typically an scp over to the triplestore machine (although for our public store we would ftp the tarballs for remote indexing).
  25. We define a data tree on our datastore machine for importing from various sources. The extraction process (which runs against MarkLogic) writes out RDF in nquads format and stores these on disk by object and property. For example, under a given extraction run we will write out all the dc:title properties for an :Article object in a dedicated file under an articles folder. For this, we use the full URL name, which is normalized to alphanumeric plus a couple of punctuation chars (e.g. dash and period). These are further partitioned by restricting to a given product (i.e. journal title). So, we end up with a highly branched tree of nquads files. Other data sources (e.g. our static ontology datasets and imports from the SQL database) are likewise componentized.
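The file layout described here, one nquads file per (object type, product, property), with the property's full URL normalized to alphanumerics plus dash and period, can be sketched as a path function. The imports/ root and the .nq extension are my illustrative choices; only the partitioning scheme and the normalization rule come from the notes.

```python
# Sketch of the import-tree layout; root path and extension are assumed.
import re
from pathlib import PurePosixPath

def import_path(object_type: str, product: str, property_url: str) -> str:
    """Compute the nquads file path for one extracted property."""
    # Normalize the full property URL to alphanumerics plus dash and period
    name = re.sub(r"[^A-Za-z0-9.-]", "-", property_url)
    return str(PurePosixPath("imports") / object_type / product / (name + ".nq"))

path = import_path("articles", "ng", "http://purl.org/dc/elements/1.1/title")
```

One file per property keeps each extraction run incremental and diff-friendly: re-extracting one property for one journal rewrites exactly one leaf of the tree.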
  26. Complementing the imports tree we have a corresponding exports tree. This is partitioned by knowledgebase. And under each knowledgebase we have folders for dumps, indexes, snapshots and updates.
  27. Our publishing contracts are based on VoID descriptions. Initially we were maintaining VoID descriptions (which we loosely call graphs and assign a :Graph object type to) for discovery purposes, in which we listed out classes and properties and – especially – counts. We have been obsessed with counts since we have been thwarted by triplestore support – either the triplestores did not support the function, or performance for larger populations was an obstacle, and updates presented their own challenges – and tried to implement this ourselves through our web proxy. We subsequently realized that we could dual-purpose these graph descriptions as publishing contracts. By listing out only those objects and properties to be published, our assembler could ignore any objects and properties not intended for that knowledgebase.
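The contract-as-output-filter idea can be sketched in a few lines: the assembler keeps only quads whose predicate appears among the contract's property partitions. The 4-tuple quad representation, the prefixed names, and the `assemble` helper are all illustrative; NPG's actual assembler is not shown in the talk.

```python
# Sketch of the assembler's contract filter; quad shape is illustrative.
def assemble(quads, contract_properties):
    """Filter extractor output down to the properties a contract publishes."""
    allowed = set(contract_properties)
    return [q for q in quads if q[1] in allowed]

# Hypothetical extractor output: (subject, predicate, object, graph)
quads = [
    ("ex:a1", "rdf:type", "npg:Affiliation", "npgg:affiliations"),
    ("ex:a1", "vcard:locality", '"London"', "npgg:affiliations"),
    ("ex:a1", "internal:note", '"not for release"', "npgg:affiliations"),
]
# The contract's property partitions act as the whitelist
published = assemble(quads, ["rdf:type", "vcard:locality"])
```

The same VoID document thus serves discovery (counts, partitions) and governance (what each knowledgebase may publish) without a second configuration format.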
  28. Cerberus, as mentioned earlier, supports the Linked Data API (LDA). Specifically, we have embedded Elda – the Epimorphics Linked Data API – into Cerberus. From Elda's maintainer: "The LDA provides configurable mediation between a SPARQL endpoint and presentations of the data in HTML, XML, Turtle, or JSON."
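The LDA calls on slide 29 are plain URL patterns, so building one is just query-string assembly. The ./api/... path shape follows the slide's examples; the data.nature.com host prefix for the API is an assumption here.

```python
# Sketch of building LDA request URLs in the slide-29 style;
# the host prefix is assumed.
from urllib.parse import urlencode

def lda_url(endpoint: str, **params) -> str:
    """Build an LDA request like ./api/products.json?pcode=ng&_page=2."""
    url = "http://data.nature.com/api/" + endpoint
    if params:
        # Sort for a stable, cache-friendly query string
        url += "?" + urlencode(sorted(params.items()))
    return url

url = lda_url("products.json", pcode="ng", _page=2)
```

Underscore-prefixed parameters (_page, _view, _sort, _properties) are the LDA's own control knobs, while plain names (pcode, familyName) filter on data properties.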
  29. In closing I would just note some future challenges: We need to secure the current triplestore platform to make it resilient and performant. We need to consider what a next-generation platform would need to support – sizing, inferencing, etc. We need to develop a more complete ontology with a foundational ontology footing. This stuff is hard. Harder than publishing web pages. Harder than loading data into a database. Maybe with more practice and with better tools support we can begin to treat this more like a day-to-day operation. But there can be no doubt that the web is becoming increasingly granular.
  30. And just to note that we are looking to recruit semweb developers with strong Java and semantic skills. So if you are interested in joining us, here's the job ad.