Linked	Data	Experiences	at	Springer	
Nature
Michele	Pasin	
Lead	Data	Architect	
Knowledge	Graph	Team
Linked	Data	Experiences	at	Springer	Nature	
Leipzig,	09/2016
Outline	
•	Who	we	are	
•	Why	semantic	technologies		
•	Our	work	so	far	
•	The	Scigraph	project	
•	Looking	ahead
Who	We	Are
Formed in May 2015 through the merger of Nature Publishing
Group, Palgrave Macmillan, Macmillan Education and Springer
Science+Business Media
13k employees in over 50 countries, EUR 1.5 billion turnover
[Pre-Merger]		Springer	Science	+	Business	Media	brands
[Pre-Merger]		Macmillan	Science	&	Education	brands
Holtzbrinck
Publishing
Group
We	publish	a	lot	of	science!	(since	1815)
13M documents
7M articles, 4M chapters
4k journals, 700k books
..and	generate	a	lot	of	traffic
11.5M monthly visitors
(nature.com)
260M visits per year
600M downloads per year
(link.springer.com)
> Collaborative effort between Springer Nature and Digital Science
> Supporting internal use cases, but also contributing to an emerging web of linked science data
> Not just publications data but a wealth of other related information
Why	Semantic	Technologies
Why	is	Semantics	Important	To	Us?
Challenges: Data Silos
● Data is fragmented
● Data gets duplicated
● Data is hardcoded into applications
Change Drivers
● Digital first workflow
● User-centric design
● Unified Springer Nature domain
For	example:	our	sites	are	currently	organised	around	articles,	journals	and	issues…
However,	scientists	are	interested	in	answering	questions	about	real	world	things…
Search	engines	do	not	know	we	have	content	about	these	things…
1st	hit	from	nature.com…
Not	linked	to/from..
Today: Content base (PDF, XML, ePub, HTML, TIFF). We publish science.
Tomorrow: Knowledge Graph. We manage knowledge.
Vision
The Knowledge Graph is
about collecting
information about objects
in the real world
…so that we can do a better job of
providing users with what they're
looking for
[Diagram edge labels: reads / writes, is about, interested in]
Three areas of knowledge we care about
[Diagram relation labels: Reads / Writes, Works for, Funds, Lead researcher in, Produces, Studies, Located at, In proceedings, Contains, Cites, Has learning resource, Attends, Has topic]
Stages: Research / Manuscript Creation, Manuscript Submission, Peer Review / Proposal Stage, Planning, Production, Publication, Distribution / Sales, Discovery
Roles: Researcher / Author, Editorial / Publisher, Reviewer
Opportunities:	Tools	&	Services	Along	the	Publishing	Life	Cycle
Our	Work	So	Far
Our	Work	So	Far	
Timeline: 2012, 2013, 2014, 2015, 2016
Projects: NPG Linked Data Platform, Nature Ontologies Portal, Springer Materials, Springer Conferences, Scigraph, Content Hub, Scigraph prototype, Nero Project, Linnaeus Project, Springer Protocols, CURI Semantic Annotation Project
Deliverables (2012–2014)
● Prototype for external use
● SPARQL query service
● Two RDF dataset releases in 2012
– April 2012 (22m triples)
– July 2012 (270m triples)
● Live updates to query endpoint
Led to (2014–)
● Focus on internal use-cases
● Publish ontology pages
● Periodic data snapshots
NPG	Linked	Data	Platform	(2012)
Features
● Hybrid RDF + XML architecture
– MarkLogic for XML, RDF/XML
– Triplestore (TDB) for RDF validation
● Repos for binary assets
Layout
● Semantic RDF/XML includes in XML
● RDF objects serialized in list order
● Application XML for subject hierarchy

Indexes
● Indexes over all elements
● Range indexes for datatypes (e.g. dates)
NPG	Content	Hub	(2014):		Hybrid	Architecture
Subject	Pages	(2014)
NPG	Ontologies	Portal	(2015):	Data	Publishing
Springer	Materials	(2014)
Springer	Conferences	Portal	(2015)
Scigraph	Project	(2016):	main	objectives
Data Integration
> Consolidation of existing LD efforts via a single domain model
> Ingestion and normalisation of third party datasets
Discoverability
> Better end user applications [B2C]
> Metadata delivery & validation [B2B]
> Data publishing [B2developers]
Scigraph	
what’s	in	it	
>	data	architecture,	taxonomies,	ontologies	
how	it	works	
>	ETL,	naming,	validation,	identity
Data	Landscape
Citations / References: 160M
Articles: 7M
Chapters: 3.6M
Journals: 4K
Books: 700k
Subjects: 4K
Grants: 2M
Organizations: 60K
Conferences: 10K
Persons: 1M
Other entities (no counts given): Article Types, Funders, Publishers, Universities, Relations, Publish states, Vocabularies
Scigraph Core
Complexity (ontological depth), from lowest to highest:
- a glossary: A controlled vocabulary with NL definitions (e.g. lexicon)
  Scigraph: Publishers, Relations, Publish-states
- a taxonomy: A c.v. that captures broaderThan / narrowerThan relationships
  Scigraph: Subjects, Article Types
- a thesaurus: Taxonomy plus related terms; captures synonymy, homonymy etc.
- a DB/OO scheme: Relational model: unconstrained use of arbitrary relations
- an axiomatized theory: Arbitrary relations plus axioms, constraints and rules expressed in a logical language
Scigraph Core ontology
Ontologies	and	Taxonomies:	overview
The	Core	Ontology
- Language: OWL 2, Profile: ALCHI(D)
- Entities: ~73 classes, ~250 properties
- Principles: Incremental Formalization/ Enterprise Integration / Model Coherence
http://www.nature.com/ontologies/core/
The	Core	Ontology:	mappings
Core classes: :Asset, :Thing, :Publication, :Concept, :Event, :Subject, :Type, :Agent, :ArticleType, :PublishingEvent, :AggregationEvent, :Component, :Document, :Serial
Mapped external classes (= owl:equivalentClass): cidoc-crm:Information_Carrier, cidoc-crm:Conceptual_Object, dbpedia:Agent, dc:Agent, dcterms:Agent, cidoc-crm:Agent, vcard:Agent, foaf:Agent, event:Event, bibo:Event, schema:Event, cidoc-crm:TemporalEntity, cidoc-crm:Type, vcard:Type, fabio:SubjectTerm, bibo:Document, cidoc-crm:Document, foaf:Document, bibo:Periodical, fabio:Periodical, schema:Periodical, bibo:DocumentPart, fabio:Expression, cidoc-crm:InformationObject
SKOS	taxonomies:	Poolparty	integration
SKOS	taxonomies:	Subjects
- Structure: SKOS, ~2500 concepts, multi-hierarchical tree, 6 branches, 7 levels of depth
- Mappings: 100% of terms, using skos:broadMatch or skos:closeMatch (DBpedia and MeSH)
- Document tagging: mostly manual, different workflows, often costly and inconsistent
Semi-Automatic	tagging	with	Dimensions	(from	UberResearch)
Scigraph	
what’s	in	it	
>	data	architecture,	taxonomies,	ontologies	
how	it	works	
>	ETL,	naming,	validation,	identity
Naming	Architecture:	federated	model
> Dereference and 303 redirects:
- http://name.scigraph.com/{things}/
- http://data.scigraph.com/{things}/
> Two patterns: schemas and instances
- http://name.scigraph.com/ontologies/{domain}/
- http://name.scigraph.com/{domain}/{things}/
> Prefixes for schemas and instances
- @prefix sg: <http://name.scigraph.com/ontologies/core/> .
> Entity names follow a robust convention
- camel-case for naming terms, with an initial uppercase for
classes and an initial lowercase for properties.
> Named graphs used to track provenance
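The dereferencing pattern above can be sketched as follows. This is only an illustration, assuming that a request for a name.scigraph.com URI is answered with a 303 See Other pointing at the same path on data.scigraph.com. The function name `resolve` and the path-preserving rule are assumptions, not the platform's actual implementation.

```python
from typing import Tuple
from urllib.parse import urlsplit, urlunsplit

NAME_HOST = "name.scigraph.com"  # host used for identifiers (from the slide)
DATA_HOST = "data.scigraph.com"  # host used for descriptions (from the slide)


def resolve(uri: str) -> Tuple[int, str]:
    """Return (HTTP status, location) for a dereferenced URI.

    Hypothetical rule: a name URI gets a 303 See Other to the same path
    on the data host; any other URI is served directly with a 200.
    """
    parts = urlsplit(uri)
    if parts.netloc == NAME_HOST:
        target = urlunsplit((parts.scheme, DATA_HOST, parts.path, parts.query, ""))
        return 303, target
    return 200, uri


status, location = resolve("http://name.scigraph.com/journals/123/")
print(status, location)
```

The `journals/123/` path is a made-up example of the `{things}` pattern.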
Scigraph	-	Data	Flow
Layers:
- data sources: Peer Review, DDS, Core Media, UNSILO TARGET, Uber Research, DBPedia etc..
- integration layer: KNOWLEDGE GRAPH; JSON-LD API, DDS Adapter, TTL Loader, RDF Loader ..
- real time services: Peer Review Service, Search Service (Content Hub)
- applications: Peer Review Oscar Search
Notes:
> data is normalised and mapped to SN ontologies
> data is extracted and denormalised so as to support applications
> data is delivered to applications via fast APIs
ETL	Architecture:	main	features	[in	evolution]
Tech stack
> Airflow framework (Airbnb)
> Amazon S3 to make backups
> GraphDB triplestore (staging and presentation)
> Elasticsearch and APIs
Components & Principles
> Graph must be ‘ephemeral’
> Data sources versioning algorithm
> Identity Persistence service
> Validation via SHACL (TopBraid API)
ETL	Architecture
Sources: Persons, Articles, Publishers, Books (zip, XML, RDF, JSON, CSV, DB, Dataset, API)
Pipeline: Sources, Data Store (Amazon S3), Data Staging Triplestore, Data Presentation Triplestore, then Linked Data Browser, Analytics, Reporting, APIs
Steps:
✴ Extraction
✴ Validation
✴ Identity Persistence
✴ Updating / Replacing named graphs
✴ Versioning service (md5 checksum, timestamps, origin version, etc...)
✴ Integration (union graph)
✴ Inference
Named Graphs
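A versioning step based on md5 checksums and timestamps could look roughly like the sketch below. The record layout, the `version_record` name and the rule "bump the version only when the checksum changes" are assumptions for illustration; the slides do not describe the actual algorithm.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def version_record(origin: str, payload: bytes, previous: Optional[dict] = None) -> dict:
    """Build a version record for one source file.

    `previous` is the record from the last ingest of the same source,
    or None on the first ingest.
    """
    checksum = hashlib.md5(payload).hexdigest()
    changed = previous is None or previous["md5"] != checksum
    if previous is None:
        version = 1
    else:
        version = previous["version"] + (1 if changed else 0)
    return {
        "origin": origin,
        "md5": checksum,
        "version": version,
        "changed": changed,  # only changed sources need their named graph replaced
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


first = version_record("journals.xml", b"<journal id='1'/>")
again = version_record("journals.xml", b"<journal id='1'/>", first)
print(first["version"], again["changed"])
```

Comparing checksums before loading lets unchanged sources be skipped, so only the named graphs of changed sources are rebuilt.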
Identity	Persistence
Identity Persistence Module: RDF Extractor, Identity Registry
ingest #1: J1 (xml), id: 1, DOI: 123, issn: ABC
ingest #2: J2 (xml), id: 2, issn: ABC
ingest #3: J1 (xml), id: 1, DOI: 123, issn: ABC
Identity Registry: journals: 76as67fda76sd67a
sgo:core Ontology
sg:Journal
a owl:Class ;
sg:hasKeyProperty sg:doi ;
sg:hasKeyProperty sg:issn ;
sg:hasKeyProperty sg:eissn ;
....
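One way to read the identity persistence example is sketched below: records that share a value for any key property resolve to the same persistent identifier across ingests. The matching rule, the `IdentityRegistry` class and the use of random hex identifiers are assumptions; the slide only shows that key properties are declared on the ontology class.

```python
import uuid

# Mirrors the sg:hasKeyProperty declarations on sg:Journal.
KEY_PROPERTIES = ("doi", "issn", "eissn")


class IdentityRegistry:
    """Maps (key property, value) pairs to a persistent identifier."""

    def __init__(self) -> None:
        self._index = {}  # (property, value) -> persistent id

    def resolve(self, record: dict) -> str:
        keys = [(p, record[p]) for p in KEY_PROPERTIES if record.get(p)]
        # Reuse the identifier of the first key value seen before, if any.
        pid = next((self._index[k] for k in keys if k in self._index), None)
        if pid is None:
            pid = uuid.uuid4().hex  # hypothetical identifier scheme
        for k in keys:
            self._index.setdefault(k, pid)
        return pid


registry = IdentityRegistry()
j1 = registry.resolve({"id": 1, "doi": "123", "issn": "ABC"})  # ingest #1
j2 = registry.resolve({"id": 2, "issn": "ABC"})                # ingest #2
j1_again = registry.resolve({"id": 1, "doi": "123", "issn": "ABC"})  # ingest #3
print(j1 == j2 == j1_again)
```

A record with no key property values gets a fresh identifier on every call in this sketch, so real data would need a fallback rule.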
Data	Validation:	from	SPIN	to	SHACL
> SPIN SPARQL syntax
(2011, TopQuadrant)
> Example: “if a Journal
instance has no short
title, raise an Exception”
> Main drawback: hard to
maintain and to read by
non specialists
Data	Validation:	from	SPIN	to	SHACL
> SHACL - Shapes
Constraint Language
(2016, TopQuadrant)
> Example: “all article
instances should have a
valid DOI”
> Example: “all grants
instances should have
max 1 start year and end
year”
> Approach: polish data
before entering the
triplestore, use triplestore
inference primarily for
integration
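The two example constraints can be illustrated in plain Python before data reaches the triplestore. This is not SHACL itself, only an analogue of what a pattern constraint (sh:pattern) and a cardinality constraint (sh:maxCount) would check. The DOI regular expression, the property names and the dict-based record layout are assumptions.

```python
import re

# Assumed notion of a "valid DOI": prefix 10.<registrant>/<suffix>.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def validate(node: dict) -> list:
    """Return a list of violation messages for one article or grant.

    Each property maps to a list of values, as in RDF.
    """
    violations = []
    if node.get("type") == "Article":
        dois = node.get("doi", [])
        if not dois or not all(DOI_PATTERN.match(d) for d in dois):
            violations.append("Article must have a valid DOI")
    if node.get("type") == "Grant":
        for prop in ("startYear", "endYear"):
            if len(node.get(prop, [])) > 1:
                violations.append(f"Grant has more than one {prop}")
    return violations


print(validate({"type": "Article", "doi": ["10.1038/nature12345"]}))
print(validate({"type": "Grant", "startYear": [2014, 2015], "endYear": [2016]}))
```

Records that return a non-empty list would be held back or reported instead of being loaded.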
Next	Steps
Looking	Ahead	
Summary
● Scigraph is our latest LD platform - public version live in late 2016
● SW tech allows for scalable enterprise-level metadata management
● It is crucial to distinguish between data integration and (real-time) data delivery
● Still a work in progress… suggestions or feedback very welcome!
Ongoing Work
● Ontology: federated model, more advanced inferencing capabilities
● Build internal/external APIs (JSON-LD), also integrating NoSQL
● Tools for analytics, reporting, visualisation, interactive exploration of the graph
● Entity extraction: scientific entities, places, people, events etc.
● We’re looking to collaborate… Crossref, W3C, building a Linked Science Web
Future:	a	scientific	article	X-ray?
The	Knowledge	Graph	team
CORE TEAM
* Markus Kaindl: Product Owner
* Ben Kirkley: Project Manager
* Michele Pasin: Lead Data Architect
* Tony Hammond: Data Architect
* Matias Piipari: Lead Engineer
* Hilverd Reker: Software Engineer
* Artur Konczak: Software Engineer
* <blankNode>: Data Scientist
* <blankNode>: Data Engineer
DIGITAL SCIENCE
* Martin Szomszor: Data Scientist
* Richard Koks: Data Scientist
* Mario Diwersy: CTO, Uber Research
PROGRAM SPONSOR
* Henning Schoenenberger: Director Data & Metadata
Thanks	
michele.pasin@nature.com
