Pistoia alliance harmonizing fair data catalog approaches webinar

Multiple groups in the life sciences community have started their journey towards data FAIR-ification by implementing Data Catalogs, a clear first step towards Finding your data. While in many cases the approaches are quite similar in both origin and intent, differing implementations could end up hampering interoperability and reuse. The Pistoia Alliance and the Linked Data Community of Practice hosted a panel discussion describing three implementations and their downstream goals:

[1] Pharma cross-omics data catalogs
[2] Clinical data catalogs
[3] Bioschemas for dataset discoverability on the inter/intranet

Published in: Data & Analytics

  1. 1. November 29, 2018. Harmonizing FAIR Data Catalog Approaches: A Pistoia Alliance Debates Webinar. Moderated by Martin Romacker - Roche. Panelists: Tom Plasterer - AstraZeneca, Eric Little - OSTHUS, Kees van Bochove - The Hyve, Rafael Jimenez - ELIXIR
  2. 2. This webinar is being recorded
  3. 3. Poll Question 1: How familiar are you with FAIR principles and metrics? A. Unfamiliar with FAIR principles B. Some familiarity with FAIR principles C. Familiar with FAIR principles D. Some expertise with FAIR principles and metrics E. Highly expert with FAIR principles and metrics
  4. 4. Poll Question 2: What is the maturity level of your organization with respect to implementation of FAIR? A. Don’t know B. No plans or interest in FAIR implementation C. Thinking about FAIR implementation D. Several projects have FAIR implementation E. Systematic FAIR implementation across whole organization
  5. 5. ©PistoiaAlliance Eric Little, Chief Data Officer, OSTHUS • Martin Romacker, Principal Scientist, Roche • Tom Plasterer, US Cross-Science Director, R&D Information, AstraZeneca • Rafael Jimenez, Chief Data Architect, ELIXIR • Kees van Bochove, CEO, The Hyve
  6. 6. Harmonizing FAIR Data Catalog Approaches – Introduction Martin Romacker Data and Information Architect Pharma Research and Early Development Informatics (pREDi) Roche Innovation Center Basel Pistoia Alliance Webinar, 29th November 2018
  7. 7. Data Catalogue & FAIR Principles High Quality Data Driving Insights • Digital Transformation leading to a data-driven Industry • Data are more and more perceived as an asset
  8. 8. Doing now what patients need next
  9. 9. Findability and FAIR Data Catalogs Tom Plasterer, PhD Science & Enabling Units IT (S&EUIT), Semantic Technology Lead 29 Nov 2018 Pistoia/Linked Data Community of Practice Webinar Series Eric Little, PhD Chief Data Officer, Osthus
  10. 10. Starting Point: Questions and Iterative Data Model. Classes: core:Study, core:Project, core:Target, core:Subject, core:Drug, core:Indication, core:TherapeuticArea, core:BiologicalSample, core:Measurement, core:Technology, core:Visit, bdm:Cohort. Properties: core:hasSubject, core:hasProject, core:hasDrug, core:hasIndication, bdm:hasArm, bdm:participatesIn, core:hasTA, core:hasTarget, core:hasMeasurement, core:hasSample, core:hasVisit, core:measuredBy. How do you find datasets described with this model when you have hundreds or thousands?
  11. 11. Dataset Catalogs: Findability Starts Here. A Dataset Catalog is a collection of Dataset Records • Catalogs are needed to support FAIR (Findable) data • Catalogs can and should support Enterprise MDM strategies • Consumers can be internal or external. Dataset Catalogs are needed so data consumers can find Datasets • Dataset records need sufficient metadata to support discoverability • Dataset terms are NOT the data instance. Dataset Catalogs surface dataset provenance and enable data access. Dataset Catalogs can provide datasets for multiple consumption patterns • Analytics readiness and fit • ‘Walking’ across information models
  12. 12. 12 Best Practices: Data on the Web, Vocabulary of Interlinked Datasets Dataset Descriptions for the Open Pharmacological Space http://www.openphacts.org/specs/2012/WD-datadesc-20121019/ Data on the Web Best Practices https://www.w3.org/TR/dwbp/
  13. 13. 13 Dataset Catalogs: Find me Datasets about: Projects Study Indication/ Disease Technology Targets Cohort DatesAgent Therapeutic Area Drugs
  14. 14. Data and Datasets… Dublin-Core-Type (DCT): Dataset • A dataset is information encoded in a defined structure (for example, lists, tables, and databases), intended to be useful for direct machine processing. Data-Catalog (DCAT): Dataset • A collection of data, published or curated by a single source, and available for access or download in one or more formats. Vocabulary of Interlinked Datasets (VoID): Dataset • A set of RDF triples that are published, maintained or aggregated by a single provider. Diagram: dct:Dataset, dcat:Dataset and void:Dataset, connected by two rdfs:subClassOf links
  15. 15. The Backbone: A DCAT conformant Data Catalog. https://www.w3.org/TR/hcls-dataset/ https://www.w3.org/TR/vocab-dcat/#vocabulary-overview Semantic tagging of datasets with concepts from taxonomies: • provides context • multi-dimensional & flexible • effective for discoverability • light-weight semantics. Diagram terms, in slide order: skos:Concept, dcat:Catalog, skos:ConceptScheme, dctypes:Dataset (summary), dct:title, dct:publisher <foaf:Agent>, foaf:page, void:sparqlEndpoint, dct:accrualPeriodicity, dcat:keyword, dcat:dataset, dcat:theme, dctypes:Dataset (version), dcat:Distribution (dctypes:Dataset), void:vocabulary, dct:conformsTo, void:exampleResource, …other void properties, dcat:distribution, dcat:themeTaxonomy, dct:isVersionOf, pav:previousVersion, dct:hasPart, pav:hasCurrentVersion, dct:hasPart, dct:title, dct:publisher <foaf:Agent>, pav:version, dct:creator <foaf:Agent>, dct:created, dct:source, dct:creator <foaf:Agent>, dct:license, dct:format, pav:retrievedFrom, dct:created, pav:createdWith, dcat:accessURL, dcat:downloadURL, void:Dataset, dct:title, dct:description, dct:publisher <foaf:Agent>
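To make the DCAT backbone more concrete, here is a minimal sketch that builds one catalog with one dataset record and serializes it as JSON-LD. It uses only properties named on the slide (dct:title, dct:publisher, dcat:keyword, dcat:theme, dcat:distribution, dct:format, dcat:accessURL). The helper names, the example.org identifiers and all values are hypothetical and are not taken from the AZ catalog.

```python
import json

# Namespace IRIs for the vocabularies named on the slide.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def make_dataset(dataset_id, title, publisher, keywords, themes, access_url, fmt):
    """Build one dcat:Dataset node with a single dcat:Distribution."""
    return {
        "@id": dataset_id,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:publisher": {"@type": "foaf:Agent", "foaf:name": publisher},
        "dcat:keyword": list(keywords),
        # Themes point at taxonomy concepts (semantic tagging).
        "dcat:theme": [{"@id": theme} for theme in themes],
        "dcat:distribution": [{
            "@type": "dcat:Distribution",
            "dct:format": fmt,
            "dcat:accessURL": {"@id": access_url},
        }],
    }


def make_catalog(catalog_id, title, datasets):
    """Wrap dataset records in a dcat:Catalog node."""
    return {
        "@context": CONTEXT,
        "@id": catalog_id,
        "@type": "dcat:Catalog",
        "dct:title": title,
        "dcat:dataset": list(datasets),
    }


# Hypothetical example values, for illustration only.
record = make_dataset(
    dataset_id="https://example.org/dataset/1",
    title="Example RNAseq dataset",
    publisher="Example Research Unit",
    keywords=["RNAseq", "oncology"],
    themes=["https://example.org/taxonomy/technology/RNAseq"],
    access_url="https://example.org/data/1",
    fmt="text/csv",
)
catalog = make_catalog("https://example.org/catalog", "Example catalog", [record])
print(json.dumps(catalog, indent=2))
```

The record describes the dataset and points at a distribution; it does not contain the data itself, which matches the earlier point that dataset terms are not the data instance.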
  16. 16. Metadata Model Stack for the AZ Data Catalog. Stack labels: DCAT, VoID, DCTerms, RDF/S, OWL, SKOS/SKOS-XL, AZ Taxonomies, PAV, DCMI. AZ DataCatalog ontology and instances for catalogs, datasets, distributions (could be further modularized later). bdm-tech, core, bdm. internal / external: uniprot, umls, sio, chembl… W3C and Metadata Standards. Reference Master Data
  17. 17. Data Discoverability: Multi-phase Filtering • Data Catalog Filter: Phase 1 • Experiment Metadata Filter: Phase 2 • Ad hoc Analyses Filtering: Phase 3 • Outbound to Data Analytics, Data Science Tools. Other diagram labels: Statistical Filtering (e.g., clinical trial with > 50 participants); Dataset Catalog Descriptions
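The first two filtering phases can be sketched roughly as follows. This is an illustration only: the record structure, field names and sample records are hypothetical, and placing the "more than 50 participants" criterion in phase 2 is an assumption, since the slide text does not say which phase that example belongs to.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    title: str
    themes: set                                      # catalog-level tags (phase 1)
    experiment: dict = field(default_factory=dict)   # experiment metadata (phase 2)


def phase1_catalog_filter(records, required_themes):
    """Phase 1: keep records whose catalog tags include every required theme."""
    required = set(required_themes)
    return [r for r in records if required <= r.themes]


def phase2_experiment_filter(records, min_participants):
    """Phase 2: keep records with more than min_participants participants."""
    return [r for r in records
            if r.experiment.get("participants", 0) > min_participants]


# Hypothetical records, for illustration only.
records = [
    DatasetRecord("Study A RNAseq", {"oncology", "RNAseq"}, {"participants": 120}),
    DatasetRecord("Study B RNAseq", {"oncology", "RNAseq"}, {"participants": 30}),
    DatasetRecord("Study C imaging", {"respiratory", "imaging"}, {"participants": 200}),
]

shallow = phase1_catalog_filter(records, {"oncology"})
deep = phase2_experiment_filter(shallow, min_participants=50)
print([r.title for r in deep])
```

Phase 1 only touches catalog descriptions, so it stays cheap across thousands of records; the deeper question is asked only of the records that survive it.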
  18. 18. 18 Creating a Dataset Record
  19. 19. Validation and SHACL. Class hierarchy: az:ANYDataset, az:BDMDataset, dct:Dataset (rdfs:subClassOf). Shapes, in slide order: az:BDMDataset (Node Shape), sh:targetClass az:BDMDataset, sh:and; dctypes:Dataset (Node Shape), sh:targetClass dctypes:Dataset, sh:property; az:BDMDatasetExtension (Shape), sh:property; dct:title (Property Shape), sh:path dct:title, sh:datatype xsd:string, sh:minCount 1, sh:maxCount 1, …; az:theme (Shape), sh:path dcat:theme, sh:class bdm:Technology, sh:minCount 1, … For BDM dataset: at least one technology MUST be specified. Source: Use of SHACL for Data Catalogs & Dataset Types, @Heiner Oberkampf; <internal talk> April 18, 2018
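The two constraints visible on the slide (exactly one string dct:title, and at least one technology among the dcat:theme values) can be imitated in plain Python to show what such a validation checks. This is not a SHACL engine and does not reproduce the shapes graph; the record layout, the function name and the technology terms are hypothetical.

```python
# Hypothetical stand-ins for instances of bdm:Technology.
TECHNOLOGY_TERMS = {"bdm:RNAseq", "bdm:Proteomics"}


def validate_bdm_dataset(record, technology_terms=TECHNOLOGY_TERMS):
    """Return a list of violation messages; an empty list means the record conforms.

    `record` maps property names to lists of values.
    """
    violations = []

    # Mirrors sh:path dct:title with sh:minCount 1, sh:maxCount 1, xsd:string.
    titles = record.get("dct:title", [])
    if len(titles) != 1:
        violations.append(
            f"dct:title: expected exactly 1 value, found {len(titles)}")
    elif not isinstance(titles[0], str):
        violations.append("dct:title: value must be a string")

    # Follows the slide's prose rule: at least one technology MUST be specified.
    themes = record.get("dcat:theme", [])
    if not any(theme in technology_terms for theme in themes):
        violations.append("dcat:theme: at least one technology term is required")

    return violations


good = {"dct:title": ["Example BDM dataset"], "dcat:theme": ["bdm:RNAseq"]}
bad = {"dct:title": ["Title one", "Title two"], "dcat:theme": ["core:Oncology"]}

print(validate_bdm_dataset(good))
print(validate_bdm_dataset(bad))
```

In a real deployment the same rules would live in a shapes graph and be checked by a SHACL processor, which keeps the constraints declarative and shareable across catalogs.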
  20. 20. 20 DisQover Example
  21. 21. How FAIR is Your Data with a Catalog? FAIR metrics: https://tinyurl.com/FAIRMetrics-ALL
  22. 22. A Data-Centric World Allows Us to Utilize Data Effectively • Data Science Layer: machine learning, text analytics, NLP, clustering, matching, classification • Lightweight Semantic Integration Layer: FAIR Catalog enablers: reference master data mgt., metadata mgt., semantic indexing, linking, governance, APIs • Linked Open Data & Open APIs, Semantic Graph DB, Operational DBs, …, Unstructured Documents, Semi-structured Data, Instrument Data • Analytics Tools: simulations, statistics • Visualization: dashboards, exploration, search, … • Reporting: regulatory, internal, external
  23. 23. R&D | RDI Dataset Catalogs: Take-aways. DCTERMS, DCAT, VoID are nearly sufficient • Extend for local needs. Public Domain Ontologies should be reused • Consensus is emerging around best practices and cross-mapping. Use Multi-Phase Filtering for Shallow & Deep Questions • Balance what belongs in a catalog record vs. instance data. Lots of Activity to Learn and Shape Best Practices • Didn’t reinvent a wheel
  24. 24. Thanks. Key Influencers: David Wood, Tim Berners-Lee, Lee Harland, Jane Lomax, James Malone, Dean Allemang, Barend Mons, Carole Goble, Bernadette Hyland, Bob Stanley, Eric Little, Juan Sequeda, Michel Dumontier, John Wilbanks, Hans Constandt, Filip Pattyn, Dan Crowther, Tim Hoctor, Ian Harrow. AZ/MedImmune Linked Data Community: David Fenstermacher, Mathew Woodwark, Rajan Desai, Nic Sinibaldi, Chia-Chien Chiang, Kerstin Forsberg, Ola Engkvist, Ian Dix, Colin Wood, Ted Slater, Martin Romacker, Eric Neumann, Jeff Saltzman, Kathy Reinold, Nirmal Keshava, Bryan Takasaki
  25. 25. Clinical Data Catalogs 3 case studies 2018, NOVEMBER 29th Kees van Bochove, CEO & Founder, The Hyve @keesvanbochove
  26. 26. ● Cohort study / registry / biobank: Netherlands Twin Registry ● Academic medical center: Institut Curie ● Pharma company: Cataloguing the Data Lake Clinical data catalog case studies
  27. 27. 27 Teams Research Data Management ● FAIR / Data Governance consultancy ● Fairspace (meta)data management Cancer Genomics ● Cancer data warehouse: cBioPortal ● Knowledge base: Open Targets Data Warehousing ● Data warehouses: tranSMART, i2b2 ● Cohort selection: Glowing Bear ● Request Portals: Podium Real World Data ● Real world evidence: OMOP/OHDSI ● Wearables platform: RADAR-BASE ● Data catalogues: CKAN, DataVerse
  28. 28. 28 Case 1: Netherlands Twin Register
  29. 29. 29 Case 1: Netherlands Twin Register 1. Data Catalogue: Low barrier of entry, variable level metadata
  30. 30. 30 Case 1: Netherlands Twin Register 2. Podium Request Portal: Request access to data (or samples)
  31. 31. 31 Case 1: Netherlands Twin Register 3. Glowing Bear: Data selection tool for data manager (or researcher)
  32. 32. Case 2: Institut Curie: project goals ● Provide foundations for a comprehensive knowledge system to make data accessible ● Foster a strong extended research network within the organization ● Translated to 4 processes / propositions: ● Understand which data is available in the institute ● Collaborate easily on data analysis ● Comply with regulations (GDPR) and best practices (e.g. FAIR) ● Manage the full data lifecycle
  33. 33. Institut Curie, Data Office: Vision on FAIR Data. By: X. Fernandez and J. Guerien. Diagram labels: Research Teams A, B and C; Semantic layer; Governance layer; Virtual research desktop; Hospital Information System (HIS); Curie Data Resource; Clinical annotations, Consent, metadata; Sequencing data (WGS, WES, targeted), RNAseq, Images. FAIRSPACE: Apache Jena, FS Protocol, Kubernetes, Apps a.o. cBioPortal
  34. 34. FAIRSPACE: Collaborative science in 4 steps (numbered 01 to 04 on the slide) • Add your data to Fairspace • Automate it! • Share and publish your data • Annotate your data with FAIR metadata. Also shown: Metadata, Governance, Audit Logs
  35. 35. ● Strategic programme underway to make preclinical, clinical and real world data assets available across R&D ● Most files and legacy data warehouses moved to AWS data lake ● Computational pipelines, data analysis solutions etc. available ● Need for indexing and searching data assets on study, indication, compound, site etc. Case 3: Unlocking the enterprise data lake
  36. 36. Unlock the data lake with cataloguing • Data Lake • Data Source Adapters: S3, GCP buckets, FS mounts, Azure blobs etc. • File & Collection Metadata: DCAT, DATS etc. • Logging & Auditing • Lightweight Metadata Layer (Kubernetes microservices) • Data Governance: Access control, Request flow • FAIRSPACE • Federation • Data Sources: Dataverse, C/DKAN etc. via OAI-PMH • Semantic search (e.g. Disqover) via SPARQL • Analytics (e.g. SAS, Jupyter, Matlab etc.) • Cohort & Patient Finders (e.g. Glowing Bear)
  37. 37. ● Understand the driving use case for the catalog ● Filtering, browse and search: which criteria would users search on? ● Result sets: do your users / customers want to locate datasets, files, patients, samples etc? ● Strategy to populate the catalogue ● Favour automation, but some manual work and terminology management will be needed ● Some further reading: https://peerj.com/preprints/27151/ ● Testing the FAIR metrics on data catalogs, comparing CKAN, Dataverse, Invenio, by The Hyve team ● Let me know if you are interested in beta testing Fairspace: kees@thehyve.nl Conclusions
  38. 38. www.elixir-europe.org Bioschemas for dataset discoverability on the inter/intranet Harmonizing FAIR Data Catalog Approaches Rafael C Jimenez Chief Data Architect November 29, 2018 The European Open Science Cloud for Research pilot project is funded by the European Commission, DG Research & Innovation under contract no. 739563 ELIXIR-EXCELERATE is funded by the European Commission within the Research Infrastructures programme of Horizon 2020, grant agreement number 676559.
  39. 39. Semantic markup for web pages
  40. 40. <div itemscope itemtype="http://schema.org/Recipe"> <h1 itemprop="name">Classic potato salad</h1> <div itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation"> Nutrition facts: <span itemprop="calories">144 kcal</span>, </div> Ingredients: - <span itemprop="recipeIngredient">800g small new potato</span> - <span itemprop="recipeIngredient">3 shallot</span> . . . RDFa JSON-LD Microdata
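The slide lists JSON-LD as an alternative syntax to the Microdata shown above. As a rough sketch, the same Recipe fragment could be expressed as JSON-LD and wrapped in a script tag as follows. Only the values visible on the slide are included (the elided ingredients stay elided), and the `to_script_tag` helper is a hypothetical name introduced for this illustration.

```python
import json

# JSON-LD counterpart of the Microdata fragment, using the same schema.org terms.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Classic potato salad",
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "144 kcal",
    },
    "recipeIngredient": [
        "800g small new potato",
        "3 shallot",
    ],
}


def to_script_tag(doc):
    """Wrap a JSON-LD document in the script element used to embed it in HTML."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


snippet = to_script_tag(recipe)
print(snippet)
```

Unlike Microdata, the JSON-LD block sits apart from the visible markup, so the page layout can change without touching the metadata.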
  41. 41. Semantic markup for web pages
  42. 42. http://schema.org/docs/full.html More than 600 types
  43. 43. Tim Berners-Lee
  44. 44. Bioschemas Schema.org for life sciences Collection of specifications Community initiative
  45. 45. Bioschemas • New life sciences schema.org types (e.g. protein) • Life sciences profiles for existing types (e.g. dataset). Types: • Proposed to Schema.org • More descriptive. Profiles: • Guidelines applying constraints to existing Schema.org types • Managed by Bioschemas and specific for life sciences • Minimum properties and best practices for finding and accessing data • Focused on few types and well defined relationships. Counts shown: 22 and 10. Diagram labels: DataCatalog, Datasets, DataRecord
  46. 46. Bioschemas • Use case driven • Finding data • Presenting search results • Metadata exchange • Minimum information guidelines • Link to domain ontologies • Examples and documentation
  47. 47. Tools • Buzzbang: Open-source components for software to find, crawl and use Bioschemas markup, and for humans to search it. • GoCrawlIt: Minimal crawler and extractor of microdata and JSON-LD metadata. • GoWeb: Application to help publishing Bioschemas profiles on the Bioschemas website. • Validata: A web application for validating Bioschemas markup against the specifications. • Markup Builder: A web application for prototyping markup against the Bioschemas profiles.
  48. 48. Diagram: Providers (Data resources, Tools, Training) expose Bioschemas markup; Consumers (Search engines, Metadata Registries, Data Aggregators) use Bioschemas markup
  49. 49. Use case: data exchange. Metadata exchange without an API: MarRef -> BioSamples, MarRef <- BioSamples. https://github.com/EBIBioSamples/bioschemas_marref_demo/blob/master/Summary.md
  50. 50. Diagram: Organization “A” and Organization “B” each hold private data, with Data producers as Bioschemas Providers and Data Aggregators as Bioschemas Consumers. Metadata links the two organizations: Search, Compare, Link, Exchange • Different data models and interfaces • Minimum common metadata agreements on a selection of data types
  51. 51. Remarks • Bioschemas is focused on findability, not data modeling • It can be useful for some basic metadata exchange (MarRef example) • It is a complement, not a replacement for more tailored ways to share metadata • It is mainly used embedded in HTML, but it can be exposed via APIs or JSON files • For data providers it is a simple way to expose metadata • It is not just for Google, it is also helping our metadata catalogues to automatically ingest metadata
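The remark that catalogues can ingest Bioschemas markup automatically can be illustrated with a small extractor that pulls embedded JSON-LD out of a page and keeps the Dataset descriptions. This is a sketch using only the Python standard library, not one of the Bioschemas tools mentioned earlier; the sample page and its values are hypothetical, and it assumes each JSON-LD block holds a single object.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_datasets(html):
    """Return every embedded JSON-LD object whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [doc for doc in parser.documents
            if isinstance(doc, dict) and doc.get("@type") == "Dataset"]


# Hypothetical page, for illustration only.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example expression dataset",
 "description": "Illustrative dataset description.",
 "url": "https://example.org/datasets/1",
 "identifier": "https://example.org/datasets/1",
 "keywords": "RNAseq, oncology"}
</script>
</head><body><h1>Example expression dataset</h1></body></html>"""

datasets = extract_datasets(PAGE)
print(datasets[0]["name"])
```

Because the provider only has to embed a block like this in pages it already publishes, the consumer side needs no provider-specific API, which is the point of the MarRef example.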
  52. 52. Thanks for your attention! 200+ People • 8 Events (2018) • 10 Types • 23 Working groups • 22 Profiles • 55 Live deploys • 6M+ Pages
  53. 53. Audience Q&A Please use the Question function in GoToWebinar
  54. 54. Upcoming Webinars • Data Quality in support of AI/ML: Pistoia Alliance CoE for AI Webinar Series. Date/Time: 10 Dec 2018, 4pm - 5pm GMT / 11am ET / 5pm CET. Panel members include: Terry Stouch, Science for Solutions; Isabella Feierberg, AstraZeneca; Jamie Powers, Cambridge Semantics; Sirarat Sarntivijai, ELIXIR EU; Jabe Wilson, Elsevier • Knowledge Graphs for Pharma: A perspective from the PhUSE Project 'Clinical Trials Data as RDF'. Date/Time: January 24th, 2019, 11am ET / 4pm GMT / 5pm CET. Speaker: Tim Williams (UCB and PhUSE)
