Linked Data Experiences at Springer Nature

Michele Pasin
Lead Data Architect
Knowledge Graph Team

Linked Data Experiences at Springer Nature
Leipzig, 09/2016
Outline
• Who we are
• Why semantic technologies
• Our work so far
• The Scigraph project
• Looking ahead

Who We Are

Formed in May 2015 through the merger of Nature Publishing
Group, Palgrave Macmillan, Macmillan Education and Springer
Science+Business Media

13k employees in over 50 countries, EUR 1.5 billion turnover

[Pre-Merger] Springer Science + Business Media brands

[Pre-Merger] Macmillan Science & Education brands
Holtzbrinck Publishing Group

We publish a lot of science! (since 1815)
13M documents
7M articles, 4M chapters
4k journals, 700k books

...and generate a lot of traffic
11.5M monthly visitors (nature.com)
260M visits per year
600M downloads per year
(link.springer.com)

> Collaborative effort between Springer Nature and Digital Science
> Supporting internal use cases, but also contributing to an emerging web of linked science data
> Not just publications data but a wealth of other related information

Why Semantic Technologies

Why is Semantics Important To Us?
Challenges: Data Silos
● Data is fragmented
● Data gets duplicated
● Data is hardcoded into applications
Change Drivers
● Digital first workflow
● User-centric design
● Unified Springer Nature domain

For example: our sites are currently organised around articles, journals and issues…

However, scientists are interested in answering questions about real-world things…

Search engines do not know we have content about these things…
1st hit from nature.com…
Not linked to/from…

Vision
Today: Content base (PDF, XML, ePub, HTML, TIFF). We publish science.
Tomorrow: Knowledge Graph. We manage knowledge.

The Knowledge Graph is about collecting information about objects in the real world…
…so that we can do a better job of providing users with what they're looking for

Three areas of knowledge we care about
[Diagram: relations labelled "reads / writes", "is about", "interested in"]

[Diagram: entity relations labelled Reads / Writes, Works for, Funds, Lead researcher in, Produces, Studies, Located at, In proceedings, Contains, Cites, Has learning resource, Attends, Has topic]

Opportunities: Tools & Services Along the Publishing Life Cycle
Stages: Research / Manuscript Creation, Manuscript Submission, Peer Review / Proposal Stage, Planning, Production, Publication, Distribution / Sales, Discovery
Roles: Researcher / Author, Editorial / Publisher, Reviewer

Our Work So Far

Our Work So Far
[Timeline, 2012–2016: NPG Linked Data Platform, Nature Ontologies Portal, Springer Materials, Springer Conferences, Scigraph, Content Hub, Scigraph prototype, Nero Project, Linnaeus Project, Springer Protocols, CURI Semantic Annotation Project]
NPG Linked Data Platform (2012)
Deliverables (2012–2014)
● Prototype for external use
● SPARQL query service
● Two RDF dataset releases in 2012
– April 2012 (22m triples)
– July 2012 (270m triples)
● Live updates to query endpoint
Led to (2014–)
● Focus on internal use-cases
● Publish ontology pages
● Periodic data snapshots

NPG Content Hub (2014): Hybrid Architecture
Features
● Hybrid RDF + XML architecture
– MarkLogic for XML, RDF/XML
– Triplestore (TDB) for RDF validation
● Repos for binary assets
Layout
● Semantic RDF/XML includes in XML
● RDF objects serialized in list order
● Application XML for subject hierarchy
Indexes
● Indexes over all elements
● Range indexes for datatypes (e.g. dates)

NPG Ontologies Portal (2015): Data Publishing

Springer Conferences Portal (2015)

Scigraph Project (2016): main objectives
Data Integration
> Consolidation of existing LD efforts via a single domain model
> Ingestion and normalisation of third party datasets
Discoverability
> Better end user applications [B2C]
> Metadata delivery & validation [B2B]
> Data publishing [B2developers]

Scigraph
what’s in it
> data architecture, taxonomies, ontologies
how it works
> ETL, naming, validation, identity

Data Landscape
Citations / References: 160M
Articles: 7M
Chapters: 3.6M
Journals: 4K
Books: 700k
Subjects: 4K
Article Types
Grants: 2M
Organizations: 60K
Conferences: 10K
Funders
Publishers
Universities
Persons: 1M
Relations
Publish states
Vocabularies
Scigraph Core

Ontologies and Taxonomies: overview
Ordered by complexity (ontological depth):
● a glossary: a controlled vocabulary with NL definitions (e.g. lexicon)
- Publishers
- Relations
- Publish-states
● a taxonomy: a c.v. that captures broaderThan / narrowerThan relationships
- Subjects
- Article Types
● a thesaurus: taxonomy plus related terms; captures synonymy, homonymy etc.
● a DB/OO scheme: relational model, unconstrained use of arbitrary relations
- Scigraph Core ontology
● an axiomatized theory: arbitrary relations plus axioms, constraints and rules expressed in a logical language

The Core Ontology
- Language: OWL 2, Profile: ALCHI(D)
- Entities: ~73 classes, ~250 properties
- Principles: Incremental Formalization / Enterprise Integration / Model Coherence
http://www.nature.com/ontologies/core/

The Core Ontology: mappings
Core classes: :Thing, :Asset, :Publication, :Concept, :Event, :Subject, :Type, :Agent, :ArticleType, :PublishingEvent, :AggregationEvent, :Component, :Document, :Serial
Mapped external classes (= owl:equivalentClass):
- cidoc-crm:Information_Carrier, cidoc-crm:Conceptual_Object
- dbpedia:Agent, dc:Agent, dcterms:Agent, cidoc-crm:Agent, vcard:Agent, foaf:Agent
- event:Event, bibo:Event, schema:Event, cidoc-crm:TemporalEntity
- cidoc-crm:Type, vcard:Type, fabio:SubjectTerm
- bibo:Document, cidoc-crm:Document, foaf:Document
- bibo:Periodical, fabio:Periodical, schema:Periodical
- bibo:DocumentPart, fabio:Expression, cidoc-crm:InformationObject

SKOS taxonomies: Poolparty integration

SKOS taxonomies: Subjects
- Structure: SKOS, ~2500 concepts, multi-hierarchical tree, 6 branches, 7 levels of depth
- Mappings: 100% of terms, using skos:broadMatch or skos:closeMatch (DBpedia and MeSH)
- Document tagging: mostly manual, different workflows, often costly and inconsistent
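
A multi-hierarchical tree means a concept can have more than one skos:broader parent, so depth and ancestry have to be computed over a graph rather than a simple tree. The sketch below illustrates this in plain Python; the concept names and the BROADER table are hypothetical and are not taken from the actual Subjects taxonomy.

```python
from functools import lru_cache

# Hypothetical fragment: concept -> its skos:broader concepts.
BROADER = {
    "biological-sciences": [],
    "health-sciences": [],
    "neuroscience": ["biological-sciences"],
    "neurology": ["health-sciences", "neuroscience"],  # two parents
}


def ancestors(concept: str) -> set:
    """All concepts reachable by following skos:broader links."""
    seen = set()
    stack = list(BROADER[concept])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(BROADER[current])
    return seen


@lru_cache(maxsize=None)
def depth(concept: str) -> int:
    """Level of a concept: 1 for a branch root, else 1 + its deepest parent."""
    parents = BROADER[concept]
    return 1 if not parents else 1 + max(depth(p) for p in parents)
```

In this toy fragment the branch roots are the concepts with no broader concept, and "neurology" sits under two branches at once, which is what distinguishes a multi-hierarchy from a strict tree.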

Semi-Automatic tagging with Dimensions (from UberResearch)

Scigraph
what’s in it
> data architecture, taxonomies, ontologies
how it works
> ETL, naming, validation, identity

Naming Architecture: federated model
> Dereference and 303 redirects:
- http://name.scigraph.com/{things}/
- http://data.scigraph.com/{things}/
> Two patterns: schemas and instances
- http://name.scigraph.com/ontologies/{domain}/
- http://name.scigraph.com/{domain}/{things}/
> Prefixes for schemas and instances
- @prefix sg: <http://name.scigraph.com/ontologies/core/> .
> Entity names follow a robust convention
- camel-case for naming terms, with an initial uppercase for classes and an initial lowercase for properties.
> Named graphs used to track provenance
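
The 303 pattern and the naming convention on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Scigraph implementation: the function names are hypothetical, and the rule that a data document lives at the same path on data.scigraph.com is an assumption made for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit

NAME_HOST = "name.scigraph.com"  # identifies things (from the slide)
DATA_HOST = "data.scigraph.com"  # describes things (from the slide)


def redirect_for(name_uri: str):
    """Return (status, location) for a request on a name URI.

    Assumption: the data document lives at the same path on the data host.
    """
    parts = urlsplit(name_uri)
    if parts.netloc != NAME_HOST:
        raise ValueError(f"not a name URI: {name_uri}")
    location = urlunsplit((parts.scheme, DATA_HOST, parts.path, parts.query, ""))
    return 303, location


CLASS_RE = re.compile(r"^[A-Z][A-Za-z0-9]*$")     # e.g. PublishingEvent
PROPERTY_RE = re.compile(r"^[a-z][A-Za-z0-9]*$")  # e.g. hasKeyProperty


def is_valid_term(local_name: str, kind: str) -> bool:
    """Check the camel-case convention: classes start upper, properties lower."""
    pattern = CLASS_RE if kind == "class" else PROPERTY_RE
    return bool(pattern.match(local_name))
```

For instance, redirect_for("http://name.scigraph.com/journals/123") would answer with a 303 pointing at the same path on data.scigraph.com; the path "journals/123" is made up for illustration.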

Scigraph - Data Flow
data sources: Peer Review, DDS, Core Media, UNSILO, TARGET, Uber Research, DBPedia etc.
integration layer: JSON-LD API, DDS Adapter, TTL Loader, RDF Loader, ... feeding the KNOWLEDGE GRAPH
> data is normalised and mapped to SN ontologies
real time services: Peer Review Service, Search Service (Content Hub)
> data is extracted and denormalised so as to support applications
applications: Peer Review, Oscar, Search
> data is delivered to applications via fast APIs

ETL Architecture: main features [in evolution]
Tech stack
> Airflow framework (Airbnb)
> Amazon S3 to make backups
> GraphDB triplestore (staging and presentation)
> Elasticsearch and APIs
Components & Principles
> Graph must be ‘ephemeral’
> Data sources versioning algorithm
> Identity Persistence service
> Validation via SHACL (TopBraid API)

ETL Architecture
Sources: Persons, Articles, Publishers, Books (zip, DB, Dataset, API; XML, RDF, JSON, CSV)
Data Store (Amazon S3) → Data Staging (Triplestore) → Data Presentation (Triplestore) → Linked Data Browser, Analytics, Reporting, APIs
✴ Extraction
✴ Validation
✴ Identity Persistence
✴ Updating / Replacing named graphs
✴ Versioning service (md5 checksum, timestamps, origin version, etc.)
✴ Integration (union graph)
✴ Inference
Named Graphs
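
As a rough illustration of how the versioning and named-graph steps might fit together, the sketch below uses an md5 checksum to decide whether a source's named graph needs replacing, and builds the union graph for integration. Everything here is an assumption for the example: plain dictionaries stand in for the triplestore and the version registry, and the function names and graph URI pattern are hypothetical.

```python
import hashlib
from datetime import datetime, timezone

versions = {}      # source name -> version record (hypothetical registry)
named_graphs = {}  # graph URI -> set of triples (stand-in for the triplestore)


def ingest(source: str, payload: bytes, triples: set, origin_version: str = "") -> bool:
    """Replace the source's named graph only if the payload has changed.

    Returns True when the graph was (re)loaded, False when it was skipped.
    """
    checksum = hashlib.md5(payload).hexdigest()
    previous = versions.get(source)
    if previous is not None and previous["md5"] == checksum:
        return False  # identical payload: keep the existing named graph

    graph_uri = f"http://example.org/graphs/{source}"  # hypothetical graph URI
    named_graphs[graph_uri] = set(triples)  # replace the whole graph, no merge
    versions[source] = {
        "md5": checksum,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "origin_version": origin_version,
    }
    return True


def union_graph() -> set:
    """Integration step: the union of all named graphs."""
    merged = set()
    for triples in named_graphs.values():
        merged |= triples
    return merged
```

Replacing a whole named graph per source, rather than patching individual triples, keeps each load idempotent, which fits the principle that the graph should be rebuildable from its sources.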

Identity Persistence
[Diagram: journal records J1 (xml) and J2 (xml) pass through the RDF Extractor and the Identity Persistence Module across three ingests; the Identity Registry holds the entry "journals: 76as67fda76sd67a".]
ingest #1: J1 (xml), id: 1, DOI: 123, issn: ABC
ingest #2: J2 (xml), id: 2, issn: ABC
ingest #3: J1 (xml), id: 1, DOI: 123, issn: ABC
sgo:core Ontology
sg:Journal
    a owl:Class ;
    sg:hasKeyProperty sg:doi ;
    sg:hasKeyProperty sg:issn ;
    sg:hasKeyProperty sg:eissn ;
    ....
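
One possible reading of this slide is that records sharing a key property value (DOI, ISSN, eISSN) resolve to the same persistent identifier across ingests. The sketch below shows that idea under stated assumptions: the IdentityRegistry class, its resolve method and the use of random hex identifiers are hypothetical, and the real module may resolve identities differently.

```python
import uuid

# Mirrors the sg:hasKeyProperty declarations for sg:Journal on the slide.
KEY_PROPERTIES = {"Journal": ("doi", "issn", "eissn")}


class IdentityRegistry:
    """Maps (class, key property, value) to a persistent identifier."""

    def __init__(self):
        self._index = {}

    def resolve(self, cls: str, record: dict) -> str:
        keys = [
            (cls, prop, record[prop])
            for prop in KEY_PROPERTIES[cls]
            if record.get(prop)
        ]
        if not keys:
            raise ValueError("record has no key property; cannot assign identity")
        # Reuse the first identifier already registered for any key value.
        identity = next((self._index[k] for k in keys if k in self._index), None)
        if identity is None:
            identity = uuid.uuid4().hex  # mint a new opaque identifier
        for k in keys:
            self._index.setdefault(k, identity)
        return identity
```

With the three ingests from the slide, J1 mints an identifier, J2 reuses it through the shared ISSN, and re-ingesting J1 returns the same identifier again. The sketch does not handle conflicts where two key values already point at different identifiers; a production service would need a policy for that.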

Data Validation: from SPIN to SHACL
> SPIN SPARQL syntax (2011, TopQuadrant)
> Example: “if a Journal instance has no short title, raise an Exception”
> Main drawback: hard to maintain and hard to read for non-specialists

Data Validation: from SPIN to SHACL
> SHACL - Shapes Constraint Language (2016, TopQuadrant)
> Example: “all article instances should have a valid DOI”
> Example: “all grant instances should have max 1 start year and end year”
> Approach: polish data before entering the triplestore, use triplestore inference primarily for integration
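
In practice these constraints would be written as SHACL shapes and run through a SHACL engine. The sketch below is not SHACL; it only restates the two example constraints in plain Python to illustrate the "polish data before entering the triplestore" gate. The DOI pattern, the property names startYear and endYear, and the function names are assumptions made for the example.

```python
import re

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")  # assumed notion of a "valid" DOI


def validate_article(article: dict) -> list:
    """Every article should have a valid DOI."""
    doi = article.get("doi")
    if not doi or not DOI_RE.match(doi):
        return [f"article {article.get('id')}: missing or invalid DOI"]
    return []


def validate_grant(grant: dict) -> list:
    """A grant should have at most one start year and one end year."""
    errors = []
    for prop in ("startYear", "endYear"):
        values = grant.get(prop, [])
        if len(values) > 1:
            errors.append(f"grant {grant.get('id')}: {prop} has {len(values)} values")
    return errors


def gate(records: list, validator) -> tuple:
    """Split records into (loadable, report) before they reach the triplestore."""
    loadable, report = [], []
    for record in records:
        errors = validator(record)
        if errors:
            report.extend(errors)
        else:
            loadable.append(record)
    return loadable, report
```

Only the records in the loadable list would go on to the staging triplestore; the report would go back to whoever owns the data source.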

Next Steps

Looking Ahead
Summary
● Scigraph is our latest LD platform - public version live in late 2016
● SW tech allows for scalable enterprise-level metadata management
● It is crucial to distinguish between data integration and (real-time) data delivery
● Still a work in progress… suggestions or feedback very welcome!
Ongoing Work
● Ontology: federated model, more advanced inferencing capabilities
● Build internal/external APIs (JSON-LD), also integrating NoSQL
● Tools for analytics, reporting, visualisation, interactive exploration of the graph
● Entity extraction: scientific entities, places, people, events etc.
● We’re looking to collaborate… Crossref, W3C, building a Linked Science Web

Future: a scientific article X-ray?

The Knowledge Graph team
CORE TEAM
* Markus Kaindl: Product Owner
* Ben Kirkley: Project Manager
* Michele Pasin: Lead Data Architect
* Tony Hammond: Data Architect
* Matias Piipari: Lead Engineer
* Hilverd Reker: Software Engineer
* Artur Konczak: Software Engineer
* <blankNode>: Data Scientist
* <blankNode>: Data Engineer
DIGITAL SCIENCE
* Martin Szomszor: Data Scientist
* Richard Koks: Data Scientist
* Mario Diwersy: CTO, Uber Research
PROGRAM SPONSOR
* Henning Schoenenberger: Director Data & Metadata

Thanks
michele.pasin@nature.com
