Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

FAIR Data Prototype - Interoperability and FAIRness through a novel combination of Web technologies

533 views

Published on

This slide deck accompanies the manuscript "Interoperability and FAIRness through a novel combination of Web technologies", submitted to PeerJ Computer Science: https://doi.org/10.7287/peerj.preprints.2522v1

It describes the output of the "Skunkworks" FAIR implementation group, who were tasked with building a prototype infrastructure that would fulfill the FAIR Principles for scholarly data publishing. We show how a novel combination of the Linked Data Platform, RDF Mapping Language (RML) and Triple Pattern Fragments (TPF) can be combined to create a scholarly publishing infrastructure that is markedly interoperable, at both the metadata and the data level.

This slide deck (or something close) will be presented at the Dutch Techcenter for Life Sciences Partners Workshop, November 4, 2016.



Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R

Published in: Science
  • Be the first to comment

FAIR Data Prototype - Interoperability and FAIRness through a novel combination of Web technologies

  1. 1. Mark D. Wilkinson CBGP-UPM/INIA, Madrid markw@illuminae.com Data Interoperability and FAIRness Through Existing Web Technologies DTL Partner Wkshp November, 2016
  2. 2. The Problem ...one recent survey of 18 microarray studies found that only two were fully reproducible using the archived data. Another study of 19 papers in population genetics found that 30% of analyses could not be reproduced from the archived data and that 35% of datasets were incorrectly or insufficiently described. “ ” Dominique G. Roche , Loeske E. B. Kruuk, Robert Lanfear, Sandra A. Binning (2015) http://dx.doi.org/10.1371/journal.pbio.1002295
  3. 3. The Problem We surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong PDA policy. Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely prevented reuse. “ ” Dominique G. Roche , Loeske E. B. Kruuk, Robert Lanfear, Sandra A. Binning (2015) http://dx.doi.org/10.1371/journal.pbio.1002295
  4. 4. FAIR Findable → Globally unique, resolvable, and persistent identifiers → Machine-actionable contextual information supporting discovery Accessible → Clearly-defined access protocol → Clearly-defined rules for authorization/authentication Interoperable → Use shared vocabularies and/or ontologies → Syntactically and semantically machine-accessible format Reusable → Be compliant with the F, A, and I Principles → Contextual information, allowing proper interpretation → Rich provenance information facilitating accurate citation The Four Principles
  5. 5. “Skunkworks” Task: Build a prototype
  6. 6. Skunkworks Participants ● Mark Wilkinson ● Michel Dumontier ● Barend Mons ● Tim Clark ● Jun Zhao ● Paolo Ciccarese ● Paul Groth ● Erik van Mulligen ● Luiz Olavo Bonino da Silva Santos ● Matthew Gamble ● Carole Goble ● Joël Kuiper ● Morris Swertz ● Erik Schultes ● Erik Schultes ● Mercè Crosas ● Adrian Garcia ● Philip Durbin ● Jeffrey Grethe ● Katy Wolstencroft ● Sudeshna Das ● M. Emily Merrill
  7. 7. The Hourglass Concept We want a large ecosystem of apps that use FAIR Data
  8. 8. The Hourglass Concept We want to support a wide range of source providers
  9. 9. The Hourglass Concept The FAIR solution between them must be THIN!
  10. 10. Skunkworks participants had tons of experience v.v. metadata around scholarly publication
  11. 11. Skunkworks participants had tons of experience v.v. metadata around scholarly publication RDA, Force11, Dataverse, Research Objects, NanoPubs, Semantic Science, SADI, AlzForum, SWAN, LSID, … … … ...
  12. 12. There was very little disagreement about F, about A, or about R
  13. 13. The “I” is the big problem!
  14. 14. Interoperability is Hard!! The “I” is the big problem!
  15. 15. Keeping the history brief A series of teleconferences led to the concept of putting metadata into an iterative set of ~identical “containers”
  16. 16. Skunkworks Hackathons The “containers of containers of containers” idea was elaborated by the belief that we should also reject any solution that required a new API ProgrammableWeb.com already catalogues >16,000 different Web APIs!!!!!!
  17. 17. Skunkworks Hackathons The “containers of containers of containers” idea was elaborated by the belief that we should also reject any solution that required a new API ProgrammableWeb.com already catalogues >16,000 different Web APIs!!!!!! APIs DO NOT MAKE YOU INTEROPERABLE!
  18. 18. Skunkworks Hackathons The “containers of containers of containers” idea was elaborated by the belief that we should also reject any solution that required a new API
  19. 19. Skunkworks Hackathons The “containers of containers of containers” idea was elaborated by the belief that we should also reject any solution that required a new API REST is an architectural style for software
  20. 20. Skunkworks Hackathons The “containers of containers of containers” idea was elaborated by the belief that we should also reject any solution that required a new API Hypermedia-driven, NOT parameter-driven The phrase “my REST API” makes almost no sense, and in bioinformatics, is usually accompanied by a page explaining the key/value pairs you should add to the URL to access various functions. This Is Not REST!
  21. 21. Skunkworks Hackathons Are there existing standards that are And have the properties of ?
  22. 22. Uses machine-accessible standards and representations, following a REST paradigm LDP Useful Features I I + R F + A I Defines HTTP-resolvable URIs for each of these containers Defines the concept of a “Container” - a machine-actionable way to represent repositories, data deposits, data files, data points, and their metadata Uses a widely accepted standard (DCAT) to relate metadata to data → machine-actionable data mining
  23. 23. Caveats We used LDP as “inspiration” Some parts of LDP are not really “REST” (in our opinion) so we stripped-out those parts in favour of a 100% API-free solution For the moment we require a read-only solution; LDP defines a read-write platform so we don’t use those portions either (for now)
  24. 24. The FAIR Accessor In incremental detail 26
  25. 25. What can we describe with FAIR Accessors? FAIR Accessors provide a machine-actionable, structured, REST-oriented way to publish Metadata about a wide range of scholarly “entities”
  26. 26. What can we describe with FAIR Accessors? Warehouses (e.g. EBI) Databases (e.g. UniProt) Repositories (e.g. Zenodo, Dataverse, Leiden Repository, etc.) Datasets (e.g. output from a workflow) Research Objects (data a/o workflow a/o results a/o publications) Data “slices” (e.g. the result of a database query) Data Records (e.g. image, excel file, patient clinical record) Other…
  27. 27. Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET What does a FAIR Accessor “look like”?
  28. 28. Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... What does a FAIR Accessor “look like”?
  29. 29. Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... What does a FAIR Accessor “look like”? In REST, a “Resource” is “anything that can be identified” whether that is a single thing or an identifiable collection of things, and whether it exists in the real-world or only in the network. A “Resource” is identified by a URI
  30. 30. Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... What does a FAIR Accessor “look like”? Resources are manipulated using the HTTP protocol on the Resource URI For the FAIR Accessor, the only HTTP method we currently require is HTTP GET
  31. 31. Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... What does a FAIR Accessor “look like”? What is returned is a document full of metadata richly describing that Container (warehouse, database, dataset, slice, etc.) And a list of Resources (URIs) that represent the contained “things”
  32. 32. Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... What does a FAIR Accessor “look like”? Looking more closely at one of those contained things...
  33. 33. MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET What does a FAIR Accessor “look like”? The contained thing is a Resource
  34. 34. MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET What does a FAIR Accessor “look like”? That Resource can be resolved by HTTP GET
  35. 35. MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET What does a FAIR Accessor “look like”? To retrieve a Metadata document describing that resource (e.g. a single record)
  36. 36. MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET What does a FAIR Accessor “look like”? Which record does this Metadata describe? The foaf:primaryTopic attribute defines this
  37. 37. MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET What does a FAIR Accessor “look like”? Using the metadata structures defined by DCAT the FAIR Accessor may also tell you how to get the content of the record, and what formats are available
  38. 38. MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET What does a FAIR Accessor “look like”? In this case, the record is available in XML format By calling HTTP GET on URL_U2
  39. 39. MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET What does a FAIR Accessor “look like”? Or in RDF format by calling HTTP GET on URL_U1
  40. 40. Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET What does a FAIR Accessor “look like”?
  41. 41. Or you may add additional layers... Metadata Metadata Metadata Metadata DATA - format 1 DATA - format 2
  42. 42. Features of the FAIR Accessor 1: There is no API GET Interpret the Metadata Select the desired Resource GET
  43. 43. Features of the FAIR Accessor 1: There is no API GET Interpret the Metadata Select the desired Resource GET ANY Web agent can explore/index a FAIR Accessor (e.g. Google) An agent that understands globally-accepted vocabularies can explore it “intelligently”!
  44. 44. Features of the FAIR Accessor 46 1: There is no API It’s difficult to get thinner than nothing ;-)
  45. 45. Features of the FAIR Accessor 2: Identifiers for unidentifi-ed/-able things HTTP GET <FAIR metadata/> This is the ArrayExpress query I did for paper doi:10/1234.56 Results: MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ...
  46. 46. Features of the FAIR Accessor 2: Identifiers for unidentifi-ed/-able things HTTP GET <FAIR metadata/> This is the ArrayExpress query I did for paper doi:10/1234.56 Results: MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... Should assist with reproducibility and transparency
  47. 47. Features of the FAIR Accessor 3: A predictable “place” for metadata MetaRecordURL PrimaryTopic: record 1A445 Metadata... DATA - format 1 DATA - format 2 Different “kinds” of metadata have distinct ontological types, and distinct document structures. There is no ambiguity regarding what the metadata is describing - a repository or a record. Repository metadata MetaRecordURL
  48. 48. Features of the FAIR Accessor 3: Symmetry & predictable path to citation XXX Part of dataset XXX Metadata... DATA - format 1 DATA - format 2 The record metadata contains an “upward” link to the Repository-level metadata, which should contain license and citation information Repository metadata: Cite: doi:10/8847.384 License: cc-by
  49. 49. Features of the FAIR Accessor 4: Granularity of Access/Privacy/Security Container Resource HTTP GET <FAIR metadata/> Contains <<184 Records>> Contact Mark Wilkinson For more information about These records
  50. 50. Features of the FAIR Accessor 4: Granularity of Access/Privacy/Security Container Resource HTTP GET <FAIR metadata/> Contains <<184 Records>> Contact Mark Wilkinson For more information about These records
  51. 51. Features of the FAIR Accessor Container HTTP GET <FAIR metadata/> Contains MetaRecordResource3 MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:distribution <<NONE>> HTTP GET 4: Granularity of Access/Privacy/Security
  52. 52. Features of the FAIR Accessor Container HTTP GET <FAIR metadata/> Contains MetaRecordResource3 MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:distribution <<NONE>> HTTP GET 4: Granularity of Access/Privacy/Security
  53. 53. Container HTTP GET <FAIR metadata/> Contains MetaRecordResource3 ... MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml HTTP GET Features of the FAIR Accessor 4: Granularity of Access/Privacy/Security
  54. 54. Container HTTP GET <FAIR metadata/> Contains MetaRecordResource3 ... MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml HTTP GET Features of the FAIR Accessor 4: Granularity of Access/Privacy/Security
  55. 55. Features of the FAIR Accessor 4: Granularity of Access/Privacy/Security Thin solution - if it’s private, do nothing! Literally!
  56. 56. The Real Thing A working FAIR Accessor Serving a “Slice” of UniProt
  57. 57. A real-world scenario... Imagine you are publishing a paper describing the evolution of proteins involved in the RNA Processing machineries of the fungus Aspergillus nidulans. If you want to be a “good” scholarly publisher, interested in transparency and reproducibility, what is the first thing you need to explain to your readers and reviewers?
  58. 58. A real-world scenario... Imagine you are publishing a paper describing the evolution of proteins involved in the RNA Processing machineries of the fungus Aspergillus nidulans. If you want to be a “good” scholarly publisher, interested in transparency and reproducibility, what is the first thing you need to explain to your readers and reviewers? The list of proteins and their inclusion/exclusion criteria How would this usually be done today?
  59. 59. The query that returns the relevant proteins WHERE { ?protein a up:Protein . ?protein up:organism ?organism . ?organism rdfs:subClassOf taxon:162425 . ?protein up:classifiedWith ?go . ?go rdfs:subClassOf* <http://purl.obolibrary.org/obo/GO_0006396> . bind(replace(str(?protein), "http://purl.uniprot.org/uniprot/", "", "i") as ?id) }
  60. 60. The query that returns the relevant proteins WHERE { ?protein a up:Protein . ?protein up:organism ?organism . ?organism rdfs:subClassOf taxon:162425 . ?protein up:classifiedWith ?go . ?go rdfs:subClassOf* <http://purl.obolibrary.org/obo/GO_0006396> . bind(replace(str(?protein), "http://purl.uniprot.org/uniprot/", "", "i") as ?id) } NCBI Taxonomy: Aspergillus nidulans
  61. 61. The query that returns the relevant proteins WHERE { ?protein a up:Protein . ?protein up:organism ?organism . ?organism rdfs:subClassOf taxon:162425 . ?protein up:classifiedWith ?go . ?go rdfs:subClassOf* <http://purl.obolibrary.org/obo/GO_0006396> . bind(replace(str(?protein), "http://purl.uniprot.org/uniprot/", "", "i") as ?id) } Gene Ontology: RNA Processing
  62. 62. Create and publish a FAIR Accessor for that query http://linkeddata.systems/Accessors/UniProtAccessor Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ...
  63. 63. Create and publish a FAIR Accessor for that query http://linkeddata.systems/Accessors/UniProtAccessor Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... Resolve the URI (in software or in your browser)
  64. 64. Create and publish a FAIR Accessor for that query Returns a page of metadata (in this example, in RDF) Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ...
  65. 65. 68
  66. 66. 69 Note that this Metadata is about ME! I am the creator of this dataset, and may be credited for it.
  67. 67. I have a script that generates these… not a significant barrier!
  68. 68. Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ...
  69. 69. Container Resource HTTP GET <FAIR metadata/> Contains MetaRecordResource1 MetaRecordResource2 MetaRecordResource3 ... MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET Step down to individual Record metadata
  70. 70. Step down to individual Record metadata MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET Software calls HTTP GET on the URL representing the MetaRecord Resource for the desired record in the Container (or just click on it, or type it into your browser)
  71. 71. <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml The document that is returned
  72. 72. Note the change in metadata focus! This metadata is about the UniProt Record (not about Mark Wilkinson). The record described in this metadata was created by UniProt, so the citation and authorship information is now THEIRS, not MINE.
  73. 73. Container Resource Symmetrical Link back upward to the Accessor Container (possibly relevant metadata up there… about Me :-) )
  74. 74. <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml
  75. 75. Two ways to retrieve the record - RDF or HTML (in REST-speak, two Representations of that Resource)
  76. 76. Note that this metadata record is somewhat more FAIR, than what you can (easily) retrieve from UniProt itself! e.g. the UniProt record does not include the citation or license information - you have to manually surf around the UniProt Web page to find that. So the Accessor makes UniProt’s already notably FAIR data, even more FAIR (with respect to “R”)
  77. 77. How FAIR are we now? What does the Accessor give us?
  78. 78. What we have achieved We have created a FAIR record for something - i.e. a slice of a database - that was, historically, un-recordable and un-identifiable in any formal way.F F + A F + R Accessors are a standard approach to providing human & machine accessible metadata to facilitate appropriate discovery (contextual, biological), proper usage (license) and proper citation for any kind of data. The discovery, accessibility, and drill-down/up behaviors do not require any novel API, rather simply rely on global Web standards; this allows them to be indexed by existing Web search engines
  79. 79. What we have achieved F + AI The metadata itself uses machine-accessible syntaxes, and widely adopted ontologies and vocabularies, thus easily integrates with other metadata A Accessors provide a lightweight means to protect privacy while still providing the maximum degree of transparency possible + Accessors can be static, or dynamic. i.e. we can provide template Accessor file(s) that are edited in Notepad, then published together with the data; or Accessors can dynamically generate their output from code (e.g. layered on a database server)
  80. 80. Can we be FAIRer?
  81. 81. Not good enough for the Skunk! So far we have only addressed Interoperability and FAIRness of Metadata The Holy Grail of Interoperability is to provide interoperable Data.
  82. 82. FAIR Projection: Providing FAIR Data from non-FAIR Data
  83. 83. This is going to be a bit complicated, but please be patient Interoperability is Hard!!
  84. 84. Imagine the data we need to integrate is in a CSV file In FigShare or Zenodo How do we discover and integrate that data? CSV (usually) has no inherent semantics and semantics is ~impossible to automatically guess
  85. 85. Things we need to do: We need a way to query “opaque” data blobs (like CSV) about their content We need a way to retrieve that content in a FAIR format We need, therefore, to model semantics for that opaque data content We need to model various semantics for that content (one “size” doesn’t fit all!) We need to associate those semantic models with a record or record-sets We need a way to query those semantics determine which “size” fits our req’s We would like to reuse semantic definitions as much as possible We need to do all of this without creating a new API :-)
  86. 86. Triple Pattern Fragments + RDF Modeling/Mapping Language Ruben Verborgh Ghent University Anastasia Dimou Ghent University
  87. 87. Triple Pattern Fragments (TPF) A REST interface for requesting/retrieving RDF Triples (from any source) Ruben Verborgh “Slices” of data, from any source, are considered Resources and are therefore represented by a distinct URL: http://some.database.org/dataset?s=___;p=___;o=___ Calling HTTP GET on a TPF URL returns the set of Triples matching {?s, ?p, ?o} PLUS hypermedia instructions and Resource URLs for other relevant slices.
  88. 88. Triple Pattern Fragments (TPF) A REST interface for retrieving RDF Triples (from any source) Ruben Verborgh For example, the “BMI” column from a patient registry is a Resource with the URL: http://my.registry.org/patients?p=CMO:0000105 (CMO:0000105 = “body mass index””) HTTP GET gives me all BMI triples in the registry, together with other Resource URLs representing other “slices” that might be useful, for example: http://my.registry.org/patients?p=CMO:0000004 (CMO:0000004 = “systolic B.P.”)
  89. 89. Triple Pattern Fragments (TPF) A REST interface for retrieving RDF Triples (from any source) Ruben Verborgh For example, the “BMI” column from a patient registry is a Resource with the URL: http://my.registry.org/patients?p=CMO:0000105 (CMO:0000105 = “body mass index””) HTTP GET gives me all BMI triples in the registry, together with other Resource URLs representing other “slices” that might be useful, for example: http://my.registry.org/patients?p=CMO:0000004 (CMO:0000004 = “systolic B.P.”)
  90. 90. We have a standard, RESTful way to request triples from any data source i.e. every slice of every dataset will be considered a distinct Resource → simply call HTTP GET on that Resource to get the Triples
  91. 91. But it’s a disaster! We have no way to know what Resources are available for any given dataset or what those Resources “are” (proteins? genes? patients? articles?) ...and still no defined way to generate the associated Triples
  92. 92. RML A way to describe the structure of an RDF document Anastasia Dimou RML allows us to create models of (meta)data structures “What could this (meta)data look like?” RML fulfills similar objectives to DCAT Profiles, the Dublin Core Application Profile, and ISO 11179 - Metadata Registries; but has added advantages! http://rml.io/RMLmappingLanguage.html
  93. 93. Using RML to describe the structure and semantics of a single Triple Map1 Predicate Object Map Subject Map Object Map ex:Patient Record subjectMap template “http://example.org/patient/{id}” predicateObjectMap predicate ex:has Variant objectM ap class Map2 parentTriplesMap Subject Map2 SO:0000694 (“SNP”) subjectMap template “http://identifiers.org/dbsnp/{snp}” class T H E M O D E L
  94. 94. Using RML to describe the structure (and semantics) of a single triple Map1 Predicate Object Map Subject Map Object Map ex:Patient Record subjectMap template “http://example.org/patient/{id}” predicateObjectMap predicate ex:has Variant objectM ap class Map2 parentTriplesMap Subject Map2 SO:0000694 (“SNP”) subjectMap template “http://identifiers.org/dbsnp/{snp}” class T H E M O D E L We call this a “Triple Descriptor” These are used to describe the structure of data “slices” in which all Triples have the same structure
  95. 95. T H E M O D E L Using RML to describe the structure (and semantics) of a single triple Map1 Predicate Object Map Subject Map Object Map ex:Patient Record subjectMap template “http://example.org/patient/{id}” predicateObjectMap predicate ex:has Variant objectM ap class Map2 parentTriplesMap Subject Map2 SO:0000694 (“SNP”) subjectMap template “http://identifiers.org/dbsnp/{snp}” class Patient:123 rdf:type ex:Patient Record snp: rs0020394 ex:hasVariant rdf:type ex:Patient Record The Data
  96. 96. Where are we now? TPF - A standard, RESTful way to request Triples Triple Descriptors - A standard way to describe the structure and meaning of a Triple
  97. 97. Where are we now? TPF - A standard, RESTful way to request Triples Triple Descriptors - A standard way to describe the structure and meaning of a Triple There is still no way to associate these with the dataset that they represent There is still no way to make these triples “real”
  98. 98. Wait… that’s not right! “There is still no way to associate these with the dataset that they represent” ???!!! Of course there is! We already have that!
  99. 99. MetaRecord Resource3 <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml HTTP GET The FAIR Accessor carries this information Using the metadata structures defined by DCAT the FAIR Accessor also tells you how to get the content of the record, and what formats are available
  100. 100. If we consider the TPF Resource URL to be just another DCAT Distribution, we get... <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml dcat:Distribution_3 Source TPFrag_URL_1 format rdf+xml
  101. 101. If we consider the TPF Resource URL to be just another DCAT Distribution, we get... <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml dcat:Distribution_3 Source TPFrag_URL_1 format rdf+xml URL to the Triple Pattern Fragment Server
  102. 102. If we consider the TPF Resource URL to be just another DCAT Distribution, we get… now add the Triple Descriptor <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml dcat:Distribution_3 Source TPFrag_URL_1 format rdf+xml Model: Triple_Desc_URL
  103. 103. If we consider the TPF Resource URL to be just another DCAT Distribution, we get… now add the Triple Descriptor <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml dcat:Distribution_3 Source TPFrag_URL_1 format rdf+xml Model: Triple_Desc_URL HTTP GET on that URL returns:
  104. 104. If we consider the TPF Resource URL to be just another DCAT Distribution, we get… now add the Triple Descriptor <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml dcat:Distribution_3 Source TPFrag_URL_1 format rdf+xml Model: Triple_Desc_URL When a TPF Server is associated with a Record and a Triple Descriptor, we call the combination of technologies a FAIR Projector
  105. 105. If we consider the TPF Resource URL to be just another DCAT Distribution, we get… now add the Triple Descriptor <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml dcat:Distribution_3 Source TPFrag_URL_1 format rdf+xml Model: Triple_Desc_URL Calling HTTP GET on TPFrag_URL_1 will give you rdf+xml triples from Record R That look like VOILA!! Interoperability from
  106. 106. I hear you objecting… I skipped something important!!! There is still no way to make these triples “real”
  107. 107. I hear you objecting… I skipped something important!!! There is still no way to make these triples “real” <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml dcat:Distribution_3 Source TPFrag_URL_1 format rdf+xml Model: Triple_Desc_URL What is this??
  108. 108. Sadly, there is no magic wand to create interoperability
  109. 109. Sadly, there is no magic wand to create interoperability Someone has to write the TPF server, and host it somewhere Interoperability will never come “for free” (because semantics will never come “for free”)
  110. 110. However, there are reasons for optimism! 1. People must transform data anyway in order to integrate it - this is a daily routine in most bioinformatics labs 2. For the most common file formats (e.g. CSV or Excel), there are RML-based tools to automate these transformations; simply create an RML model of what you want/need, and point the tool at the file. 3. Investing time into creating an RML model, vs. ad hoc “re-useless” transformation, makes it ~trivial to create a TPF server, and therefore to create a FAIR Projector for your own data transformation needs… and it is reusable! AND 4. Because the RML of Triple Descriptors is itself so simple (one triple!), we can also templatize their construction, making the entire process even more trivial 5. Citations Citations Citations! FAIR Accessors/Projectors are FAIR objects! You can get credit if other people use your Projector for their own analyses!
  111. 111. Summary of FAIR Projectors FAIR Projectors create interoperable data, associated with interoperable metadata, that can be discovered by querying a FAIR Port (or potentially Google, when/if Google properly indexes RDF) F+I+R A + I I FAIR Projectors can convert non-FAIR data into FAIR data, or can change the structure, URL format, or semantics of existing FAIR data sources FAIR Projectors can be deployed over, and provide a common interface to: - Static Data Deposits (in any format!) - Databases - Triplestores - Certain (common) types of Web Services R+++ Triple Descriptors are FAIR entities, intended for reuse, & None of this required a new API
  112. 112. How does the FAIR Projector fit into the Data FAIR Port?
  113. 113. How does the FAIR Projector fit into the Data FAIR Port? <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml dcat:Distribution_3 Source TPFrag_URL_1 format rdf+xml Model: Triple_Desc_URL
  114. 114. How does the FAIR Projector fit into the Data FAIR Port? <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_1 Source URL_U1 format rdf+xml dcat:Distribution_2 Source URL_U2 format application/xml dcat:Distribution_3 Source TPFrag_URL_1 format rdf+xml Model: Triple_Desc_URL HTTP GET on that URL returns: We know that this is about Record R in some repository We know one possible representation of a slice of the data in Record R
  115. 115. Or different Accessors published by different individuals… Anyone can publish an Accessor for any publicly-visible data! Every possible representation for Record R Can be used simply for discovery of the “type” of content inside of Record R, or can be used to discover a FAIR Projector that executes the transformation
  116. 116. <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL Index all possible representations for Record R (and for every other record) FAIR Profile Map of all semantic representations for Record R
  117. 117. <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL FAIR Port Registry/ Broker Index of all FAIR Profiles in one or more of the FAIR Ports
  118. 118. <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic Record R dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL FAIR Port Registry/ Broker “I need the blood creatinine levels for patient Record R using Clinical Measurement Ontology format” ? http://hosp.nl/recs/R?s.CMO.o
  119. 119. http://hosp.nl/recs/R?s.CMO.o
  120. 120. http://hosp.nl/recs/R?s.CMO.o The data you want, in the format you want! Go ahead and integrate it with your other data!
  121. 121. <FAIR metadata/> foaf:primaryTopic u:Q2348Y dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic u:Q2348Y dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic u:Q2348Y dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic u:Q2348Y dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic u:Q2348Y dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic u:Q2348Y dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL <FAIR metadata/> foaf:primaryTopic u:Q2348Y dcat:Distribution_3 Source TPFrag_URL format rdf+xml Model: Triple_Desc_URL FAIR Port Registry/ Broker “I need the functional annotations for UniProt Q2348Y, using EDAM ontology” ? Exactly the same process for any data you are trying to retrieve... http://linkdd.sys/UP/Q2348Y?s.fa.o
  122. 122. With all of this in-place, what will we be able to do?
  123. 123. With all of this in-place, what will we be able to do? Siri? Yes, master! Please find me all known low molecular weight inhibitors of the Human p65 Protein. Please separate the list based on those that were found in curated databases, and those that were found in self-deposited data archives. Also, keep track of the license and citation information for each one. If you find data that is relevant, but not public, please provide me with the contact information for the person I need to request the data from.
  124. 124. Thanks to: Michel Dumontier - Stanford Center for Biomedical Informatics Research, Stanford, California. Ruben Verborgh – Ghent University – imec, Ghent, Belgium Luiz Olavo Bonino da Silva Santos - Dutch Techcentre for Life Sciences, Utrecht, The Netherlands - Vrije Universiteit Amsterdam, Amsterdam, The Netherlands. Tim Clark - Department of Neurology, Massachusetts General Hospital Boston MA and Harvard Medical School, Boston, MA, USA Morris A. Swertz - Genomics Coordination Center and Department of Genetics, University Medical Center Groningen, Groningen, The Netherlands Fleur D.L. Kelpin - Genomics Coordination Center and Department of Genetics, University Medical Center Groningen, Groningen, The Netherlands Alasdair J. G. Gray - Department of Computer Science, School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK Erik A. Schultes - Department of Human Genetics, Leiden University Medical Center, The Netherlands Erik M. van Mulligen - Department of Medical Informatics, Erasmus University Medical Center Rotterdam, The Netherlands Paolo Ciccarese - Perkin Elmer Innovation Lab, Cambridge MA and Harvard Medical School, Boston MA, USA Mark Thompson - Leiden University Medical Center, Leiden, The Netherlands Jerven T. Bolleman - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
  125. 125. Funding for Mark Wilkinson from: Fundacion BBVA and the UPM Isaac Peral programme, and the Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R. Additional support for FAIR Skunkworks members comes from: European Union funded projects ELIXIR-EXCELERATE (H2020 no. 676559), ADOPT BBMRI-ERIC (H2020 no. 676550) CORBEL (H2020 no. 654248) Netherlands Organisation for Scientific Research (Odex4all project) Stichting Topconsortium voor Kennis en Innovatie High Tech Systemen en Materialen (FAIRdICT project) BBMRI-NL RD-Connect and ELIXIR (Rare disease implementation study FP7 no. 305444).
  126. 126. You are not only welcome to share and reuse this presentation... ...You are encouraged to!
  127. 127. Important: All our templates are free to use under Creative Commons Attribution License. If you use the graphic assets (photos, icons and typographies) included in this Google Slides Templates you must keep the Credits slide or add all attributions in the last slide notes. FGST Free GoogleSlides Templates Some graphical elements were taken from slide templates provided by:

×