
Linked Data Experiences at Springer Nature

An overview of how we're using semantic technologies at Springer Nature, and an introduction to our latest product: www.scigraph.com

(Keynote given at http://2016.semantics.cc/, Leipzig, Sept 2016)

Published in: Data & Analytics

  1. 1. Linked Data Experiences at Springer Nature. Michele Pasin, Lead Data Architect, Knowledge Graph Team
  2. 2. Linked Data Experiences at Springer Nature, Leipzig, 09/2016. Outline: • Who we are • Why semantic technologies • Our work so far • The Scigraph project • Looking ahead
  3. 3. Who We Are
  4. 4. Formed in May 2015 through the merger of Nature Publishing Group, Palgrave Macmillan, Macmillan Education and Springer Science+Business Media
  5. 5. 13k employees in over 50 countries, EUR 1.5 billion turnover
  6. 6. [Pre-Merger] Springer Science + Business Media brands
  7. 7. [Pre-Merger] Macmillan Science & Education brands. Holtzbrinck Publishing Group
  8. 8. We publish a lot of science! (since 1815) 13M documents, 7M articles, 4M chapters, 4k journals, 700k books
  9. 9. ...and generate a lot of traffic: 11.5M monthly visitors (nature.com), 260M visits per year, 600M downloads per year (link.springer.com)
  10. 10. > Collaborative effort between Springer Nature and Digital Science > Supporting internal use cases, but also contributing to an emerging web of linked science data > Not just publications data but a wealth of other related information
  11. 11. Why Semantic Technologies
  12. 12. Why is Semantics Important To Us? Challenges: Data Silos ● Data is fragmented ● Data gets duplicated ● Data is hardcoded into applications. Change Drivers ● Digital first workflow ● User-centric design ● Unified Springer Nature domain
  13. 13. For example: our sites are currently organised around articles, journals and issues…
  14. 14. However, scientists are interested in answering questions about real world things…
  15. 15. Search engines do not know we have content about these things… 1st hit from nature.com… Not linked to/from..
  16. 16. Vision. Today: Content base (PDF, XML, ePub, HTML, TIFF), we publish science. Tomorrow: Knowledge Graph, we manage knowledge
  17. 17. The Knowledge Graph is about collecting information about objects in the real world …so that we can do a better job of providing users with what they're looking for
  18. 18. Three areas of knowledge we care about. Relations shown: reads / writes, is about, interested in
  19. 19. Relations shown: Reads / Writes, Works for, Funds, Lead researcher in, Produces, Studies, Located at, In proceedings, Contains, Cites, Has learning resource, Attends, Has topic, Produces
  20. 20. Opportunities: Tools & Services Along the Publishing Life Cycle. Stages: Research / Manuscript Creation, Manuscript Submission, Peer Review / Proposal Stage, Planning, Production, Publication, Distribution / Sales, Discovery. Roles: Researcher / Author, Editorial / Publisher, Reviewer
  21. 21. Our Work So Far
  22. 22. Our Work So Far 2014 2013 2012 2015 2016 NPG Linked Data Platform Nature Ontologies Portal Springer Materials Springer Conferences Scigraph Content Hub Scigraph prototype Nero Project Linnaeus Project Springer Protocols CURI Semantic Annotation Project
  23. 23. NPG Linked Data Platform (2012). Deliverables (2012–2014) ● Prototype for external use ● SPARQL query service ● Two RDF dataset releases in 2012: April 2012 (22m triples) and July 2012 (270m triples) ● Live updates to query endpoint. Led to (2014–) ● Focus on internal use-cases ● Publish ontology pages ● Periodic data snapshots
  24. 24. NPG Content Hub (2014): Hybrid Architecture. Features ● Hybrid RDF + XML architecture: MarkLogic for XML, RDF/XML; triplestore (TDB) for RDF validation ● Repos for binary assets. Layout ● Semantic RDF/XML includes in XML ● RDF objects serialized in list order ● Application XML for subject hierarchy. Indexes ● Indexes over all elements ● Range indexes for datatypes (e.g. dates)
  25. 25. Subject Pages (2014)
  26. 26. NPG Ontologies Portal (2015): Data Publishing
  27. 27. Springer Materials (2014)
  28. 28. Springer Conferences Portal (2015)
  29. 29. Scigraph Project (2016): main objectives. Data Integration > Consolidation of existing LD efforts via a single domain model > Ingestion and normalisation of third party datasets. Discoverability > Better end user applications [B2C] > Metadata delivery & validation [B2B] > Data publishing [B2developers]
  30. 30. Scigraph: what’s in it > data architecture, taxonomies, ontologies; how it works > ETL, naming, validation, identity
  31. 31. Data Landscape (diagram labels): Citations / References 160M; Articles 7M; Chapters 3.6M; Journals 4K; Books 700k; Subjects 4K; Article Types; Grants 2M; Organizations 60K; Conferences 10K; Funders; Publishers; Universities; Scigraph Core; Persons 1M; Relations; Publish states; Vocabularies
  32. 32. Ontologies and Taxonomies: overview. In order of increasing complexity (ontological depth): ● a glossary: a controlled vocabulary with NL definitions (e.g. lexicon), used for Publishers, Relations, Publish-states ● a taxonomy: a c.v. that captures broaderThan / narrowerThan relationships, used for Subjects, Article Types ● a thesaurus: taxonomy plus related terms; captures synonymy, homonymy etc. ● a DB/OO scheme: relational model, unconstrained use of arbitrary relations (Scigraph Core ontology) ● an axiomatized theory: arbitrary relations plus axioms, constraints and rules expressed in a logical language
  33. 33. The Core Ontology - Language: OWL 2, Profile: ALCHI(D) - Entities: ~73 classes, ~250 properties - Principles: Incremental Formalization / Enterprise Integration / Model Coherence http://www.nature.com/ontologies/core/
  34. 34. The Core Ontology: mappings :Asset :Thing :Publication :Concept :Event :Subject :Type :Agent :ArticleType :Publishing Event :Aggregation Event :Component :Document :Serial cidoc-crm: Information_Carrier cidoc-crm: Conceptual_Object dbpedia:Agent dc:Agent dcterms:Agent cidoc-crm:Agent vcard:Agent foaf:Agent event:Event bibo:Event schema:Event cidoc-crm: TemporalEntity cidoc-crm:Type vcard:Type fabio:SubjectTerm bibo:Document cidoc-crm:Document foaf:Document bibo:Periodical fabio:Periodical schema:Periodical bibo:DocumentPart fabio:Expression cidoc-crm:InformationObject = owl:equivalentClass
  35. 35. SKOS taxonomies: PoolParty integration
  36. 36. SKOS taxonomies: Subjects - Structure: SKOS, ~2500 concepts, multi-hierarchical tree, 6 branches, 7 levels of depth - Mappings: 100% of terms, using skos:broadMatch or skos:closeMatch (DBpedia and MeSH) - Document tagging: mostly manual, different workflows, often costly and inconsistent
  37. 37. Semi-Automatic tagging with Dimensions (from UberResearch)
  38. 38. Scigraph: what’s in it > data architecture, taxonomies, ontologies; how it works > ETL, naming, validation, identity
  39. 39. Naming Architecture: federated model > Dereference and 303 redirects: - http://name.scigraph.com/{things}/ - http://data.scigraph.com/{things}/ > Two patterns: schemas and instances - http://name.scigraph.com/ontologies/{domain}/ - http://name.scigraph.com/{domain}/{things}/ > Prefixes for schemas and instances - @prefix sg: <http://name.scigraph.com/ontologies/core/> . > Entity names follow a robust convention - camel-case for naming terms, with an initial uppercase for classes and an initial lowercase for properties. > Named graphs used to track provenance
  40. 40. Scigraph - Data Flow. Layers: data sources, integration layer, real time services, applications. Data sources: Peer Review, DDS, Core Media, UNSILO, TARGET, Uber Research, DBPedia etc. Integration layer: DDS Adapter, TTL Loader, RDF Loader .. feeding the KNOWLEDGE GRAPH; data is normalised and mapped to SN ontologies. Real time services: JSON-LD API, Peer Review Service, Search Service (Content Hub); data is extracted and denormalised so as to support applications. Applications: Peer Review, Oscar, Search; data is delivered to applications via fast APIs
  41. 41. ETL Architecture: main features [in evolution]. Tech stack > Airflow framework (Airbnb) > Amazon S3 to make backups > GraphDB triplestore (staging and presentation) > Elasticsearch and APIs. Components & Principles > Graph must be ‘ephemeral’ > Data sources versioning algorithm > Identity Persistence service > Validation via SHACL (TopBraid API)
  42. 42. ETL Architecture. Sources: Persons, Articles, Publishers, Books (as zip, XML, RDF, JSON, CSV, DB, Dataset, API). Stages: Sources; Data Store (Amazon S3); Data Staging (Triplestore); Data Presentation (Triplestore); Linked Data Browser, Analytics, Reporting, APIs. Steps: ✴ Extraction ✴ Validation ✴ Identity Persistence ✴ Updating / Replacing named graphs ✴ Versioning service (md5 checksum, timestamps, origin version, etc...) ✴ Integration (union graph) ✴ Inference. Named Graphs
  43. 43. Identity Persistence. Identity Persistence Module: RDF Extractor and Identity Registry (journals: 76as67fda76sd67a). Ingest #1: J1 (xml), id: 1, DOI: 123, issn: ABC. Ingest #2: J2 (xml), id: 2, issn: ABC. Ingest #3: J1 (xml), id: 1, DOI: 123, issn: ABC. sgo:core Ontology: sg:Journal a owl:Class ; sg:hasKeyProperty sg:doi . sg:hasKeyProperty sg:issn sg:hasKeyProperty sg:eissn ....
  44. 44. Data Validation: from SPIN to SHACL > SPIN SPARQL syntax (2011, TopQuadrant) > Example: “if a Journal instance has no short title, raise an Exception” > Main drawback: hard to maintain and to read by non specialists
  45. 45. Data Validation: from SPIN to SHACL > SHACL - Shapes Constraint Language (2016, TopQuadrant) > Example: “all article instances should have a valid DOI” > Example: “all grant instances should have max 1 start year and end year” > Approach: polish data before entering the triplestore, use triplestore inference primarily for integration
  46. 46. Next Steps
  47. 47. Looking Ahead. Summary ● Scigraph is our latest LD platform - public version live in late 2016 ● SW tech allows for scalable enterprise-level metadata management ● It is crucial to distinguish between data integration and (real-time) data delivery ● Still a work in progress… suggestions or feedback very welcome! Ongoing Work ● Ontology: federated model, more advanced inferencing capabilities ● Build internal/external APIs (JSON-LD) by also integrating NoSQL ● Tools for analytics, reporting, visualisation, interactive exploration of the graph ● Entity extraction: scientific entities, places, people, events etc. ● We’re looking to collaborate… Crossref, W3C, building a Linked Science Web
  48. 48. Future: a scientific article X-ray?
  49. 49. The Knowledge Graph team. CORE TEAM * Markus Kaindl: Product Owner * Ben Kirkley: Project Manager * Michele Pasin: Lead Data Architect * Tony Hammond: Data Architect * Matias Piipari: Lead Engineer * Hilverd Reker: Software Engineer * Artur Konczak: Software Engineer * <blankNode>: Data Scientist * <blankNode>: Data Engineer. DIGITAL SCIENCE * Martin Szomszor: Data Scientist * Richard Koks: Data Scientist * Mario Diwersy: CTO, Uber Research. PROGRAM SPONSOR * Henning Schoenenberger: Director Data & Metadata
  50. 50. Thanks. michele.pasin@nature.com
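
Slide 43 describes an identity persistence step: each class declares key properties (DOI, ISSN, eISSN for journals), and records arriving in different ingests are matched against an identity registry so that the same journal keeps the same identifier. The idea can be sketched roughly as below. This is only an illustration under assumptions: the `IdentityRegistry` class, its in-memory dictionary, and the use of `uuid4` to mint identifiers are hypothetical and not taken from the Scigraph implementation.

```python
import uuid

# Key properties per class, mirroring the sg:hasKeyProperty statements on slide 43.
KEY_PROPERTIES = {"Journal": ("doi", "issn", "eissn")}


class IdentityRegistry:
    """Hypothetical in-memory registry mapping key values to stable ids."""

    def __init__(self):
        self._index = {}  # (class, property, value) -> stable id

    def resolve(self, cls, record):
        """Return the stable id for a record, minting one if none is known."""
        keys = [
            (cls, prop, record[prop])
            for prop in KEY_PROPERTIES[cls]
            if record.get(prop)
        ]
        if not keys:
            raise ValueError("record has no key property to identify it")
        # Reuse the id of the first key already registered, else mint a new one.
        stable_id = next(
            (self._index[k] for k in keys if k in self._index),
            uuid.uuid4().hex,
        )
        # Register any key values not seen before under the same id.
        for k in keys:
            self._index.setdefault(k, stable_id)
        return stable_id


# The three ingests from slide 43: J1, J2 (same ISSN, no DOI), then J1 again.
registry = IdentityRegistry()
first = registry.resolve("Journal", {"doi": "123", "issn": "ABC"})
second = registry.resolve("Journal", {"issn": "ABC"})
third = registry.resolve("Journal", {"doi": "123", "issn": "ABC"})
other = registry.resolve("Journal", {"issn": "XYZ"})
```

In this sketch the first three calls return the same identifier because they share an ISSN, while the unrelated ISSN gets a fresh one. It deliberately ignores the harder case where two key values already point at different identifiers and would need merging.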

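
Slides 44 and 45 give two example constraints: "all article instances should have a valid DOI" and "all grant instances should have max 1 start year and end year". The talk expresses these in SHACL; the plain Python below is not SHACL and only illustrates the kind of check run before data enters the triplestore. The record layout (dicts with a `type` field, list-valued `startYear` / `endYear`) and the simplified DOI pattern are assumptions made for this sketch.

```python
import re

# Simplified DOI pattern; an assumption for illustration, not a full validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def validate(record):
    """Return a list of violation messages for one record (a plain dict)."""
    violations = []
    kind = record.get("type")
    if kind == "Article":
        doi = record.get("doi") or ""
        if not DOI_PATTERN.match(doi):
            violations.append("Article must have a valid DOI")
    elif kind == "Grant":
        for prop in ("startYear", "endYear"):
            # Multi-valued properties are modelled as lists in this sketch.
            if len(record.get(prop, [])) > 1:
                violations.append(f"Grant must have at most one {prop}")
    return violations


ok_article = validate({"type": "Article", "doi": "10.1038/nature12345"})
bad_article = validate({"type": "Article"})
bad_grant = validate({"type": "Grant", "startYear": [2014, 2015], "endYear": [2016]})
```

Returning a list of messages instead of raising on the first failure lets a pipeline report every violation for a record in one pass, which is closer in spirit to a validation report than to the "raise an Exception" style of the SPIN example.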