Linked Data and Ontology Tutorial (for RD-Connect)


In this tutorial we explain the basics of a 'Linked Data and Ontology' approach for combining data, in particular for the study of rare diseases. The approach is motivated by a case study provided by health care researcher Ulrike Braisch. The main take-home lesson is that this approach can substantially lower the effort of data integration, and thereby shorten the path to new treatments for (rare) diseases.

The presentation is based on a tutorial given at the RD-Connect/NeurOmics/EURenOmics plenary meeting in Heidelberg, Germany, February 26, 2014. It was made possible by RD-Connect, a European project to support Rare Disease research (http://www.rd-connect.eu).

  • Ulrike’s problem: undefined sameness

Transcript

  • 1. LINKED DATA AND ONTOLOGY TUTORIAL. RD-Connect Tutorial, Heidelberg 2014. Marco Roos, Pedro Lopes, Mark Thompson, Rajaram Kaliyaperumal. Acknowledgements: Ulrike Braisch (ULM), Paul Groth and Frank van Harmelen (VU Amsterdam), BioSemantics group LUMC, RD-Connect Linked Data & Ontology Task Force, 2013-2014
  • 2. Agenda. 1. Basic introduction to Linked Data: (1) The problem; (2) Linked Data approach; (3) Linked Data architecture; (4) Nanopublication
  • 3. Introduction to Linked Data. Marco Roos (1), Pedro Lopes (2), Mark Thompson (1), Rajaram Kaliyaperumal (1). (1) BioSemantics Group, Human Genetics Department, Leiden University Medical Center, The Netherlands – http://biosemantics.org; (2) Bioinformatics & Computational Biology Group, University of Aveiro, Portugal – http://bioinformatics.ua.pt. Acknowledgements: Ulrike Braisch (ULM), Paul Groth (VU Amsterdam), BioSemantics group EMC/LUMC, RD-Connect Linked Data & Ontology Task Force
  • 4. Ulrike Braisch’s problem: “I wish to correlate patient characteristics”
      Characteristic | C (USA) | R2 (EU) | R3 (EU)
      Education level | C_EDUC: 7 levels | Edlevel: 9 levels | Isced: 7 levels
      Marital status | C_MARSTAT: never, now, separated, divorced, divorced | Maristat: single, married, partnership, divorced, widowed | Maristat: single, married, partnership, divorced, widowed
      Age/date of birth | Age at baseline in years | Exact age at visit | Exact age at visit
  • 5. Ulrike Braisch’s problem: “I wish to correlate patient characteristics”
      Characteristic | C (USA) | R2 (EU) | R3 (EU)
      Education level | C_EDUC: 7 levels | Edlevel: 9 levels | Isced: 7 levels
      Marital status | C_MARSTAT: never, now, separated, divorced, divorced | Maristat: single, married, partnership, divorced, widowed | Maristat: single, married, partnership, divorced, widowed
      Age/date of birth | Age at baseline in years | Exact age at visit | Exact age at visit
    Ulrike’s problem: the data in the fields pertain to very similar things, but not exactly the same. How similar, she does not know a priori.
  • 6. Ulrike Braisch’s problem: “I wish to correlate patient characteristics”. Registry 1 has columns A, B, C; Registry 2 has A’, B’, C’; Registry 3 has A’’, B’’, C’’, where A ≠ A’ ≠ A’’, B ≠ B’ ≠ B’’, C ≠ C’ ≠ C’’. Can I rely on what I think the headers mean? How to align the data?
  • 7. Solution 1: Ulrike solves the problem. Registry 1 (A, B, C) and Registry 2 (A’, B’, C’) are combined into my ‘Registry’ (A’’’, B’’’, C’’’). Ulrike has to do the alignment herself. She has to do the heavy lifting for data integration.
  • 8. Not just Ulrike’s problem. I wish to...
      • correlate patient characteristics with CAG repeat length (Ulrike)
      • correlate clinical data with genome data (Bob)
      • compare Huntington data with Alzheimer data (Alice)
      • study social aspects of clinical surveys (Christian)
      • compute the commonalities between all diseases (Don)
  • 9. Not just Ulrike’s problem. I wish to...
      • correlate patient characteristics with CAG repeat length (Ulrike)
      • correlate clinical data with genome data (Bob)
      • compare Huntington data with Alzheimer data (Alice)
      • study social aspects of clinical surveys (Christian)
      • compute the commonalities between all diseases (Don)
    The data are valuable for many people; they all face the same problem.
  • 10. Solution 1: Bob, Alice, Ulrike, Christian, Don solve the problem. Registry 1 (A, B, C); Registry 2 (A’, B’, C’). They all do the heavy lifting.
  • 11. Can computers help? – NO! Registry 1 (A, B, C); Registry 2 (A’, B’, C’). Computers cannot help; not for alignment.
  • 12. Effort for data integration. [Figure: pipeline Experiment → Data generation → Data Integration → Analysis → Application, with the labels Data, Knowledge and Gain.] The (simplified) steps of data integration. How is the pain for data integration distributed?
  • 13. Effort for data integration. [Figure: the same pipeline, annotated with five “Pain” labels and one “Gain” label.]
  • 14. Effort for data integration. [Figure: the same pipeline, annotated with five “Pain” labels and one “Gain” label.] Data are not explicitly prepared for data integration (apart from storing them in tables/files/databases). The pain of data integration is with Ulrike. Computers cannot help her with that.
  • 15. Linked Data = Redistribution of pain to enable computers to help us. [Figure: pipeline Experiment → Data generation → Integration → Analysis → Application, with the labels Data, Knowledge, “Pain” and “Gain”.] “Linked Data” moves the pain and enables computers
  • 16. Linked Data = Redistribution of pain to enable computers to help us. [Figure: pipeline Experiment → Data generation → Integration → Analysis → Application, with the labels Data, Knowledge, “Pain” and “Gain”.] The goal of “Linked Data”. Take-home message: “Linked Data” does not take the pain of data integration away; alignment remains necessary. But it moves the pain to data experts, making the overall workflow more efficient. And it enables computers to help. Next we explain how…
  • 17. Linked Data and Ontology approach
      • The three layers of data “harmonization”
      • The key role of “Uniform Resource Identifiers”
      • Saying things with Linked Data
      • Linked Data infrastructure
  • 18. Disentangling harmonization. “Harmonization” is commonly used to refer to aligning what samples and data are collected within a consortium.
  • 19. Disentangling harmonization. It is useful to discriminate three aspects of “harmonization” … and avoid conflating them.
  • 20. Disentangling harmonization
      • Harmonize what is measured and how
      • Harmonize classification and relations (meaning)
      • Harmonize how we make it computable
  • 21. Disentangling harmonization
      1) Harmonize what is measured and how (Consensus)
      2) Harmonize classification and relations (meaning) (Ontologies)
      3) Harmonize how we make it computable (Linked Data)
    (1) is about agreement between people, (2) is about what to call things in our data, (3) is about enabling computers to help.
  • 22. Disentangling harmonization
      • Harmonize what is measured and how (Consensus: agreement)
      • Harmonize classification and relations (meaning) (Ontologies: semantics)
      • Harmonize how we make it computable (Linked Data: syntax)
    Ontologies have 2 roles: (i) enforce compliance with the consensus, (ii) convey meaning to computers; they have a human-readable and a computer-readable representation.
  • 23. Use of ontologies, but not Linked Data
      Characteristic | C (USA) | R2 (EU) | R3 (EU) | Ontology
      Education level | C_EDUC: 7 levels | Edlevel: 9 levels | Isced: 7 levels | Onto:1234
      Marital status | C_MARSTAT: never, now, separated, divorced, divorced | Maristat: single, married, partnership, divorced, widowed | Maristat: single, married, partnership, divorced, widowed | Onto:2345
      Age/date of birth | Age at baseline in years | Exact age at visit | Exact age at visit | Onto:3456
    Perhaps confusing, but ontology identifiers (like GO or HPO IDs) are often not readily readable for computers...
  • 24. Use of ontologies, but not Linked Data
      Characteristic | C (USA) | R2 (EU) | R3 (EU) | Ontology
      Education level | C_EDUC: 7 levels | Edlevel: 9 levels | Isced: 7 levels | Onto:1234
      Marital status | C_MARSTAT: never, now, separated, divorced, divorced | Maristat: single, married, partnership, divorced, widowed | Maristat: single, married, partnership, divorced, widowed | Onto:2345
      Age/date of birth | Age at baseline in years | Exact age at visit | Exact age at visit | Onto:3456
    For a computer they are but a string of symbols; adding these IDs to a table is good, but it is not Linked Data yet.
  • 25. Uniform Resource Identifier. Linked Data: unique computer-readable identifiers. [Figure: a grid in which every entry is a <URI>.] This is more like it for computers!
  • 26. Uniform Resource Identifier. Linked Data: unique computer-readable identifiers. [Figure: a grid in which every entry is a <URI>.] ‘Uniform Resource Identifiers’ are identifiers for computers. The URI is an Internet standard (RFC 3986) on which World Wide Web Consortium (W3C) recommendations build.
  • 27. Uniform Resource Identifier. An example URI: http://rdf.biosemantics.org/owl/BioSemanticsConcepts#c3877... Why are they so useful?...
  • 28. Uniform Resource Identifier. http://rdf.biosemantics.org/owl/BioSemanticsConcepts#c3877... A Uniform Resource Identifier (URI) is: a unique identifier for data or a concept; a unique reference for data or a concept; computer-readable. URIs are three things at once.
  • 29. Uniform Resource Identifier. http://rdf.biosemantics.org/owl/BioSemanticsConcepts#c3877... And they look familiar…
  • 30. Reuse of technology: World Wide Web hyperlinks. <a href="http://www.ncbi.nlm.nih.gov/pubmed/18927111">
  • 31. Reuse of technology: World Wide Web hyperlinks. <a href="http://www.ncbi.nlm.nih.gov/pubmed/18927111"> For Linked Data we simply reuse what made the World Wide Web such a success: the hyperlink… What is different?...
  • 32. Documents for human consumption. Hyperlinks (URIs) link documents: Document 1 → http://www.ncbi.nlm.nih.gov/pubmed/18927111 → Document 2. The Web as we know it links documents for humans.
  • 33. Data for computer consumption. Hyperlinks (URIs) can link data: http://www.ncbi.nlm.nih.gov/pubmed/18927111. ‘Linked Data’ links data for computers (enabling them to support us).
  • 34. Uniform Resource Identifier (URI). http://rdf.biosemantics.org/owl/BioSemanticsConcepts#c3877... Annotations on the example: protocol for exchange by computers; “address”; data item; 100% unique! A URI is a computer-readable reference for data. URIs function through three main elements: [protocol][address][ID]
  • 35. Uniform Resource Identifier (URI). http://rdf.biosemantics.org/owl/BioSemanticsConcepts#c3877... Annotations on the example: protocol for exchange by computers; “address”; data item; 100% unique! A URI is a computer-readable reference for data. A URI can represent many things: a gene, a person, a value, but also a relation, such as ‘causes’.
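The [protocol][address][ID] decomposition from slide 34 can be illustrated with a short sketch. It uses Python's standard `urllib.parse` module. The URI below is a made-up example on the reserved `example.org` domain, not one from the slides, and treating the part after `#` as the ID is an assumption that fits URIs of this shape only.

```python
from urllib.parse import urlsplit

# Hypothetical URI, for illustration only (example.org is a reserved domain).
uri = "http://example.org/owl/Concepts#c1234"

parts = urlsplit(uri)
protocol = parts.scheme               # how computers exchange the data
address = parts.netloc + parts.path   # where the resource lives
identifier = parts.fragment           # the ID of the data item at that address

print(protocol, address, identifier)
```

Not every URI carries its ID in the fragment; many put it in the last path segment instead, so a real tool would need to handle both shapes.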
  • 36. Can we say things with URIs?
      Subject | Predicate | Object
      <HDAC1> | <interacts with> | <ParvB>
      <malaria> | <is transmitted by> | <mosquitos>
      <mutation X> | <has frequency> | <0.25%>
    A ‘triple’ of URIs can form a computer-readable statement.
  • 37. Can we say things with URIs?
      Subject | Predicate | Object
      <HDAC1> | <interacts with> | <ParvB>
      <malaria> | <is transmitted by> | <mosquitos>
      <mutation X> | <has frequency> | <0.25%>
    Subject, Predicate, and Object are each URIs. URIs are not for humans, but they are often supplied with a web page for humans…
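As a minimal sketch of the triple idea on slides 36 and 37, statements can be held as (subject, predicate, object) tuples of URI strings and queried by pattern. The `http://example.org/` namespace and the `objects` helper are hypothetical and only for illustration; a real application would use URIs from published resources and typically an RDF library or triple store.

```python
# Hypothetical namespace; real data would reuse URIs from published resources.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple of URIs.
triples = {
    (EX + "HDAC1", EX + "interactsWith", EX + "PARVB"),
    (EX + "malaria", EX + "isTransmittedBy", EX + "mosquitos"),
}

def objects(graph, subject, predicate):
    """Return the objects of all triples matching the given subject and predicate."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}

partners = objects(triples, EX + "HDAC1", EX + "interactsWith")
print(partners)
```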
  • 38. Computer-readable => human-readable. <HDAC1> <interacts with> <PARVB> is written as the URIs http://purl.uniprot.org/uniprot/Q13547, http://conceptwiki.org/index.php/Concept:e6559... and http://bio2rdf.org/geneid:29780. Simply copy a URI to your browser. NB: this may not always give you a human-readable web page.
  • 39. Linked Data = computer-readable knowledge. Back to our triple: http://purl.uniprot.org/uniprot/Q13547 http://conceptwiki.org/index.php/Concept:e6559... http://bio2rdf.org/geneid:29780 says “HDAC1 interacts with Parvb”. Note that we ‘said’ something meaningful! Triples allow us to say things that computers can understand.
  • 40. Linked Data = computer-readable knowledge. http://purl.uniprot.org/uniprot/Q13547 http://conceptwiki.org/index.php/Concept:e6559... http://bio2rdf.org/geneid:29780 (“HDAC1 interacts with Parvb”). URIs in one triple can point to different locations. Think of the implication for data integration.
  • 41. Linked Data = computer-readable knowledge. http://purl.uniprot.org/uniprot/Q13547 http://conceptwiki.org/index.php/Concept:e6559... http://bio2rdf.org/geneid:29780 (“HDAC1 interacts with Parvb”). Is this all we said? Remember that URIs are also references: they may refer to more information…
  • 42. “HDAC1”: http://purl.uniprot.org/uniprot/Q13547.rdf The UniProt Linked Data representation of HDAC1: many more triples!
  • 43. Things we can say. http://purl.uniprot.org/uniprot/Q13547: we said all that by just this reference. URIs are references. No need to download a whole ontology or all of UniProt into your own knowledge base. What kind of things can we say?
  • 44. Things we can say: relation. Pattern: http://purl.uniprot.org/uniprot/Q13547 (“HDAC1”) <URI for a type of relation> <URI for object of relation>. Example: http://purl.uniprot.org/uniprot/Q13547 http://conceptwiki.org/index.php/Concept:e6559... http://bio2rdf.org/geneid:29780. We already saw the (biological) relation.
  • 45. Things we can say: human-readable labels. http://purl.uniprot.org/uniprot/Q13547 <URI for “label”> “HDAC1”. Here we add a label for humans. Software engineers use this in the user interface of their tools. URIs are used ‘under the hood’.
  • 46. Things we can say: classify. http://purl.uniprot.org/uniprot/Q13547 (“HDAC1”) <URI for “is of type”> <URI for class Protein>. Here we say what type of thing a URI represents: we classify a URI.
  • 47. Things we can say: classify + human-readable labels. http://purl.uniprot.org/uniprot/Q13547 (“HDAC1”) <URI for “is of type”> <URI for class Protein>; <URI for class Protein> <URI for “has label”> “Protein”. …and we add a label for this class.
  • 48. Things we can say: classify + human-readable labels. http://purl.uniprot.org/uniprot/Q13547 (“HDAC1”) <URI for “is of type”> <URI for class Protein>; <URI for class Protein> <URI for “has label”> “Protein”. Classification is special: here is where Linked Data and ontologies meet.
  • 49. Things we can say: human-readable labels. http://purl.uniprot.org/uniprot/Q13547 <URI for “is of type”> <URI for class Protein>; <URI for class Protein> <URI for “label”> “Protein”. This is from an ontology! Good ontologies have a “URI” representation (format: OWL/RDF).
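The label and classification statements of slides 45 to 49 can be sketched in the same tuple style. The `rdf:type` and `rdfs:label` predicate URIs below are the standard W3C ones and the HDAC1 URI is taken from the slides; the class URI for Protein is a hypothetical stand-in, and the helper functions are illustrative only.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"   # "is of type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"      # "has label"
HDAC1 = "http://purl.uniprot.org/uniprot/Q13547"               # from the slides
PROTEIN = "http://example.org/classes/Protein"                 # hypothetical class URI

graph = [
    (HDAC1, RDFS_LABEL, "HDAC1"),      # label for humans
    (HDAC1, RDF_TYPE, PROTEIN),        # classification
    (PROTEIN, RDFS_LABEL, "Protein"),  # label for the class
]

def label(graph, uri):
    """Return the human-readable label of a URI, or the URI itself if none is known."""
    for s, p, o in graph:
        if s == uri and p == RDFS_LABEL:
            return o
    return uri

def describe(graph, uri):
    """Build a sentence for a user interface; URIs stay 'under the hood'."""
    types = [o for s, p, o in graph if s == uri and p == RDF_TYPE]
    return label(graph, uri) + " is of type " + ", ".join(label(graph, t) for t in types)

summary = describe(graph, HDAC1)
print(summary)
```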
  • 50. Knowledge and data represented by graphs. [Graph figure. Nodes: “parvb”, “HDAC1”, “genome location <…>”, “Homo Sapiens”, “Species”, “Genome Location”, “Protein”, “Gene”, “Biological Entity”. Edge labels: “interacts with”, “has genome location”, “in species”, “encodes”, “instance of”, “subclass of”.] With Linked Data we build knowledge graphs. NB: we decide what to include per application.
  • 51. Knowledge and data represented by graphs. [Graph figure. Nodes: “parvb”, “HDAC1”, “genome location <…>”, “Homo Sapiens”, “Species”, “Genome Location”, “Protein”, “Gene”, “Biological Entity”. Edge labels: “interacts with”, “has genome location”, “in species”, “encodes”, “instance of”, “subclass of”.]
  • 52. Things we can say: it was me! Example: acknowledgement by nanopublication. Biology (“HDAC1 interacts with Parvb”): http://purl.uniprot.org/uniprot/Q13547 http://conceptwiki.org/index.php/Concept:e6559... http://bio2rdf.org/geneid:29780. Credit (“nanopublication authored by me!”): http://nanopub.org/nschema/hasPublicationInfo, http://nanopub.org/4214adf1..., http://swan.mindinformatics.org/.../pav/Author, http://orcid.org/0000-0002-8691-772X. What we say is not limited to biology…
  • 53. Knowledge and data represented by graphs. [Graph figure. Nodes: “parvb”, “HDAC1”, “genome location <…>”, “Homo Sapiens”, “Species”, “Genome Location”, “Protein”, “Gene”, “Biological Entity”. Edge labels: “interacts with”, “has genome location”, “in species”, “encodes”, “instance of”, “subclass of”. Annotation: myNanopub:myAssertion.] Our name is on this now.
  • 54. Things we can say: mappings. http://purl.uniprot.org/uniprot/Q13547 <URI for “is same as”> <URI in other resource>. Back to Ulrike. One other type of relation: the mapping. We state what is what between resources.
  • 55. Things we can say: mappings. http://purl.uniprot.org/uniprot/Q13547 <URI for “also referred to as”> <URI in other resource>. Vocabularies exist for sophisticated mapping (also as URIs). We can do that in a precise and subtle way.
  • 56. Ulrike Braisch’s problem, by using these URIs: “I wish to correlate patient characteristics with CAG repeat length”
      Characteristic | <URI for C> | <URI for R2> | <URI for R3>
      <URI for Education level> | <URI for C_EDUC>: <URIs for 7 levels> | <URI for Edlevel>: <URIs for 9 levels> | <URI for Isced>: <URIs for 7 levels>
      <URI for Marital status> | <URI for C_MARSTAT>: <URIs for never, now, separated, divorced, divorced> | <URI for Maristat>: <URIs for single, married, partnership, divorced, widowed> | <URI for Maristat>: <URIs for single, married, partnership, divorced, widowed>
      <URI for Age/date of birth> | <URI for Age at baseline in years> | <URI for Exact age at visit> | <URI for Exact age at visit>
    If Ulrike’s table were Linked Data…
  • 57. Linked Data for Ulrike. We also say…
      <URI for C>, <URI for R2>, <URI for R3> | <URI for “is of type”> | <URI for RD resource>
      <URI for Edlevel level 3> | <URI for “is narrower than”> | <URI for C_EDUC level 2>
      <URI for Isced level 3> | <URI for “is same as”> | <URI for C_EDUC level 2>
      <URI for C_MARSTAT:divorced> | <URI for “is same as”> | <URI for Maristat:divorced>
      <URI for C_MARSTAT:never> | <URI for “is related to”> | <URI for Maristat:single>
      <URI for C_MARSTAT>, <URI for Maristat> | <URI for “subclass of”> | <URI for Marital status>
    Remember: URI = ID + Reference + Computable
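To show why the kind of mapping matters, here is a hedged sketch of slide 57's mappings used to align registry values. All URIs, including the predicates `isSameAs` and `isNarrowerThan`, are hypothetical placeholders on `example.org`; the slides do not name the actual vocabulary. The `align` helper is likewise only illustrative.

```python
EX = "http://example.org/registry/"      # hypothetical namespace
SAME_AS = EX + "isSameAs"                # hypothetical mapping predicates
NARROWER_THAN = EX + "isNarrowerThan"

# Mappings between registry codes, following the examples on slide 57.
mappings = [
    (EX + "Isced/level3", SAME_AS, EX + "C_EDUC/level2"),
    (EX + "Edlevel/level3", NARROWER_THAN, EX + "C_EDUC/level2"),
    (EX + "C_MARSTAT/divorced", SAME_AS, EX + "Maristat/divorced"),
]

def align(value, mappings, allowed=(SAME_AS,)):
    """Translate a value URI via the first mapping whose predicate is allowed.

    Values without a suitable mapping are returned unchanged.
    """
    for s, p, o in mappings:
        if s == value and p in allowed:
            return o
    return value

# A strict analysis accepts only exact matches; a looser one also accepts narrower terms.
strict = align(EX + "Edlevel/level3", mappings)
loose = align(EX + "Edlevel/level3", mappings, allowed=(SAME_AS, NARROWER_THAN))
print(strict, loose)
```

Because the mapping type is itself data, each researcher can decide which kinds of "sameness" are acceptable for their analysis without redoing the alignment.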
  • 58. Conclusions (1/2). Linked Data is not painless data integration and computer reasoning:
      • Harmonization is moved up to early data management
      • More efficient: modelling effort is reused
      • Pain: a semantic model is needed for new data
      • Early days for reasoning: we need your Linked Data first!
  • 59. Conclusions (2/2). Linked Data is a way to enable computers to help harmonize:
      • Everything has a unique reference
      • Ontologies say what data means
      • Mappings specify the relation between datasets
      • Data integration becomes (almost) trivial
      • It enables computing with knowledge
  • 60. Linked Data Architecture. 25 April 2014. In the next few slides we show (simplified) how Linked Data systems work.
  • 61. Most common use: common reference. [Figure: a Gene Expression Database and a Clinical Registry exchange Linked Data through shared terms: Smoker, Heavy smoker, Light smoker.]
  • 62. Most common use: common reference. [Figure: a Gene Expression Database and a Clinical Registry exchange Linked Data through shared terms: Smoker, Heavy smoker, Light smoker.] Ontologies in Linked Data provide a reference for systems, whatever internal structures they use.
  • 63. Most common use: common reference. [Figure: a Gene Expression Database and a Clinical Registry exchange Linked Data through shared terms: Smoker, Heavy smoker, Light smoker.] Systems do not have to agree on one fixed schema. One common link suffices to connect resources.
  • 64. Typical Linked Data architecture for data integration applications. [Figure: Source 1, Source 2 and Source 3 each expose Linked Data; a Linked Data cache (e.g. running COEUS) collects it for the case study; a user-dependent interface sits on top.]
  • 65. Typical Linked Data architecture for data integration applications. [Figure: Source 1, Source 2 and Source 3 each expose Linked Data; a Linked Data cache (e.g. running COEUS) collects it for the case study; a user-dependent interface sits on top.] Linked Data can be integrated in a cache. Integration is trivial when sources are well-formed Linked Data: when the same URIs were used for the same things, integration is instant.
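The claim on slide 65 that integration is instant when sources share URIs can be sketched with plain sets of triples: merging is then just a set union. The HDAC1 URI comes from the slides; the `example.org` predicates and objects are hypothetical, and a real cache would be a triple store rather than a Python set.

```python
HDAC1 = "http://purl.uniprot.org/uniprot/Q13547"   # from the slides
EX = "http://example.org/"                         # hypothetical namespace

# Two independent sources that happen to use the same URI for HDAC1.
source_1 = {(HDAC1, EX + "interactsWith", EX + "PARVB")}
source_2 = {(HDAC1, EX + "inSpecies", EX + "HomoSapiens")}

# Integration in the cache is a set union; no column alignment is needed.
cache = source_1 | source_2

# Everything known about HDAC1, regardless of which source said it.
about_hdac1 = sorted((p, o) for s, p, o in cache if s == HDAC1)
print(about_hdac1)
```

If the sources had used different URIs for the same protein, the union would still work, but the statements would only connect after a mapping triple is added.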
  • 66. OpenPHACTS uses Linked Data for drug discovery. [Architecture figure. Components named in the figure: Applications; Linked Data API (RDF/XML, TTL, JSON); Core Platform; Semantic Workflow Engine; Data Cache (Virtuoso Triple Store); Domain Specific Services; Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Data Import; Nanopub and VoID databases; Public Content; Commercial; Public Ontologies; User Annotations. Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]
  • 67. Nanopublication (Mark Thompson, Rajaram Kaliyaperumal). Claim your findings as nanopublications. It was me, me, me! Finally, a word about nanopublication, because in our opinion your data contributions should be acknowledged.
  • 68. Nanopublication. What do you say with a nanopublication?
      • Minimal statement for which you deserve credit
      • How you came to say it (provenance)
      • Who should be cited
      • Preferred format: Linked Data!
  • 69. Nanopublication. What do you say with a nanopublication?
      • Minimal statement for which you deserve credit (Science)
      • How you came to say it (provenance) (Good Science)
      • Who should be cited (Acknowledged Good Science)
      • Preferred format: Linked Data! (Digital)
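The three things a nanopublication says (slides 68 and 69) can be sketched as a small data structure holding three groups of triples. This is only an illustration under stated assumptions, not the nanopublication schema itself: the `Nanopublication` class, the `example.org` predicates, and the `myAssertion`, `myExperiment` and `myNanopub` resources are hypothetical. The HDAC1, PARVB and ORCID URIs are the ones shown on the slides.

```python
from dataclasses import dataclass

EX = "http://example.org/"                        # hypothetical namespace
HDAC1 = "http://purl.uniprot.org/uniprot/Q13547"  # from the slides
PARVB = "http://bio2rdf.org/geneid:29780"         # from the slides
ME = "http://orcid.org/0000-0002-8691-772X"       # ORCID shown on slide 52

@dataclass
class Nanopublication:
    assertion: list         # the minimal statement you deserve credit for
    provenance: list        # how you came to say it
    publication_info: list  # who should be cited

    def authors(self):
        """Return everyone credited in the publication info."""
        return [o for (s, p, o) in self.publication_info if p == EX + "authoredBy"]

nanopub = Nanopublication(
    assertion=[(HDAC1, EX + "interactsWith", PARVB)],
    provenance=[(EX + "myAssertion", EX + "derivedFrom", EX + "myExperiment")],
    publication_info=[(EX + "myNanopub", EX + "authoredBy", ME)],
)

credited = nanopub.authors()
print(credited)
```

Because the credit statement is itself a set of triples, it can be queried and integrated in the same way as the biological assertion.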
  • 70. Fame and glory (and reproducibility): Nanopublication! [Figure: pipeline Experiment → Data generation → Integration → Analysis → Application, with the labels Data, Knowledge, “Pain” and “Gain”, and with “Gain: Nanopublications” added twice.]
  • 71. Fame and glory (and reproducibility): Nanopublication! [Figure: pipeline Experiment → Data generation → Integration → Analysis → Application, with the labels Data, Knowledge, “Pain” and “Gain”, and with “Gain: Nanopublications” added twice.] A new type of gain is the credit you can get for data publication.
  • 72. Acknowledgements. Ulrike Braisch (University of Ulm, Germany); RD-Connect (EU-FP7); Leiden University Medical Center; Dutch Tech Centre for Life Sciences. RD-Connect Linked Data and Ontology Task Force, in particular: Pedro Lopes, Rachel Thompson, David Salgado, Peter Robinson, Manual Posada, Estrella Lopez Martin, Mark Thompson, Michael Orth, David van Enckevort. BioSemantics team LUMC: Kristina Hettne, Eleni Mina, Tareq Malas, Herman van Haagen, Peter-Bram ‘t Hoen, Rajaram Kaliyaperumal, Zuotian Tatum, Eelke van der Horst, Mark Thompson, Barend Mons. These slides are partly based on input and inspiration from Frank van Harmelen, Paul Groth, Scott Marshall, Andrew Gibson, Katy Wolstencroft, Jun Zhao, Robert Stevens, Carole Goble, and the W3C Health Care and Life Science Interest Group. Thank you for your attention…