It would be useful to be able to discover what kinds of data are contained in the myriad general-purpose public data repositories. It would be even better if it were possible to query that data and/or have it conform to a particular context-dependent data format. This was the ambition of the Data FAIRport project. I will present a "strawman" demonstration of a fully functional Data FAIRport, in which the meta/data in a public repository can be "projected" into one of a number of different context-dependent formats, so that it can be cross-queried in combination with the (potentially "projected") data from other repositories.
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descriptors by Mark Wilkinson
1. This presentation is licensed CC-BY
Mark Wilkinson (markw@illuminae.com)
https://goo.gl/ts3hLW
2. Data FAIRport Skunkworks: Common repository access via meta-meta-descriptors
EU Lead: Mark Wilkinson, Isaac Peral Distinguished Researcher, CBGP-UPM, Madrid
USA Lead: Michel Dumontier, Associate Professor, Biomedical Informatics, Stanford, USA
FAIRport Project Lead: Barend Mons, Professor, Leiden University Medical Centre, Netherlands
3. What is a FAIRport?
● Findable - (meta)data should be uniquely and persistently
identifiable
● Accessible - identifiers should provide a mechanism for (meta)data
access, including authentication, access protocol, license, etc.
● Interoperable - (meta)data should be machine-accessible, using a
machine-parseable syntax and, where possible, shared common
vocabularies.
● Reusable - there should be sufficient machine-readable metadata
that it is possible to “integrate like-with-like”, and that component data
objects can be precisely and comprehensively cited post-integration.
5. End-user view of “The Problem”
Tissue rejection experimental context. Today, I’m looking
for microarray data of human liver cells on a time-course
following liver transplant.
What repositories could contain such data?
● GEO? EUDat? FigShare? Dryad? Atlas?
● What fields in those repositories would I need to
search, using what vocabularies, to find the
microarray studies that are relevant?
6. Dissecting the problem
There are a lot of repositories!
General Purpose: DataVerse, Dryad, EUDat, Figshare, etc.
Special Purpose: PDB, UniProt, NCBI, GEO, Atlas, EnsEMBL
7. Dissecting the problem
Lack of harmonized metadata structures, or even rich
descriptions of the contents of these repositories, hinders
us from (for example):
● knowing where we can look for certain types of data
● knowing if two repositories contain records about the same thing
● Cross-referencing or “joining” across repositories to integrate
disparate data about the same thing
● Knowing which repository I could/should deposit my data to (and how)
8. “Skunkworks” Challenge
If we wanted to enable this kind of FAIR discovery and
integration over myriad repositories, what infrastructure
(existing/new) would we need?
9. “Skunkworks” Challenge
Discussions with Tim Clark revealed that the core
objectives of Skunkworks were very similar to those of
Force 11 Data Citation Implementation
Working Group Team 4 - “Common repository interfaces”
...so we joined forces :-)
11. Shared Metadata Descriptors?
They already exist! (e.g. DCAT)
Are not (yet) widely implemented
But are not sufficiently rich...
...only describe “core” metadata
We need to query, e.g. experimental context and
domain-specific metadata
13. So... extend DCAT?
...extend it where?...
too many specialist domains & data
resistance to harmonization
resistance to implementation
(time, money, expertise, ‘just don’t care’)
attempting to impose standards
is a Mug’s game!
15. Common provider-implemented API?
a la TDWG/TAPIR and caBIO...
too many specialist domains & data
resistance to harmonization
resistance to implementation
(time, money, expertise, ‘just don’t care’)
attempting to impose standards
is a Mug’s game!
16. Where else could the solution be?
What exactly *is* our problem?
18-21. What exactly *is* our problem?
[Diagram, built up over slides 18-21]
Data Schema (e.g. XMLS, RDFS) --Defines--> Data Record (e.g. XML, RDF)
Metadata Record (e.g. DCAT-compliant RDF) --Describes--> Data Record
(IF the repository uses DCAT)
DCAT RDFS Schema --Defines--> Metadata Record
(IF the repository uses DCAT…)
If everyone used DCAT, we could at least query the
core metadata of all repositories…
...but they don’t...
...and core isn’t rich enough anyway...
22-28. What exactly *is* our problem?
[Diagram, repeated on each of slides 22-28: three example stacks, one per row below]
Data Record | Data Schema | Metadata Record | Metadata Schema
XML | XMLS | DCAT RDF | DCAT RDFS Schema
RDF | RDFS | UniProt RDF | UniProt RDFS Metadata Schema
ACEDB | ACEDB | DragonDB Form | DragonDB Form Metadata Schema
Captions, one per slide:
22. REALITY
23. Repositories don’t all use DCAT Schema
24. Those that use DCAT Schema use only parts of it
25. Those that don’t use DCAT use a myriad of alternatives (some very loosely defined)
26. And don’t necessarily use all elements of those alternatives either
27. So we need to find a way to do RICH queries over all of these?
28. We need a way to describe the descriptors...
29. Desiderata of meta-meta descriptors
● Must describe legacy data (i.e. not just DCAT or other “modern” data)
● Must describe a multitude of data formats (XML, RDF, Key/Value, etc.)
● Must be capable of describing any kind of value constraint, e.g. plain text,
numerical, arbitrary CV, rdf:range, or equivalent OWL construct
● Must be modular, identifiable, shareable, and reusable (to stem the
proliferation of new formats)
● Must be hierarchical to allow composite re-use of shared descriptors
● Must use standard technologies, and re-use existing vocabularies if possible
● Must be extremely lightweight and “trivial” to create
● Must NOT require the participation of the repository host (no buy-in required)
31. Exemplar use-cases:
● A piece of software that can generate a “sensible”
data submission form for any repository
(at the Force 2015 meeting a few months ago I gave a presentation of a working
example of this… so I won’t repeat that today…)
● A piece of software that can generate a “sensible”
query form/interface for any repository
(demonstration of this today!)
Skunkworks Task #1 - [F]indable
Invent harmonized cross-repository meta-descriptors
33-36. What FAIR Profiles do
[Diagram, repeated on slides 33-36: the same three example stacks, with one FAIR Profile added per metadata schema from slide 34 onwards]
Data Record | Data Schema | Metadata Record | Metadata Schema | FAIR Profile
XML | XMLS | DCAT RDF | DCAT RDFS Schema | FAIR Profile: DCAT Schema
RDF | RDFS | UniProt RDF | UniProt RDFS Metadata Schema | FAIR Profile: UniProt Metadata Schema
ACEDB | ACEDB | DragonDB Form | DragonDB Form Metadata Schema | FAIR Profile: DragonDB Metadata Schema
Though they are potentially describing very different things
(from Web FORM fields to OWL Ontologies!)
all FAIR Profiles are written using the same vocabulary and structure, defined by...
43-44. allowedValues (range: xsd:anyURI)
Describes the constraints on the possible
values for a predicate in the target
Repository’s metadata Schema
URI must resolve to:
XSD, SKOS Concept Scheme
or another FAIR Profile
NOTE: we cannot use rdfs:range because
we are meta-modelling a schema! The
predicate is a CLASS at the meta-model
level, so use of rdfs:range is not appropriate.
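A minimal sketch of how a client might apply an allowedValues constraint, in plain Python. Everything here is an assumption made for illustration: the PropertyDescriptor class, the is_valid function, and the example.org URIs are hypothetical, "XSD" is read as an XSD datatype URI, only two datatypes are handled, and the concepts of a SKOS scheme are supplied by hand rather than fetched by resolving the URI.

```python
import re
from dataclasses import dataclass

XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass
class PropertyDescriptor:
    """Hypothetical stand-in for one predicate described by a FAIR Profile."""
    predicate: str        # predicate URI in the target repository's metadata schema
    allowed_values: str   # URI of an XSD datatype, a SKOS Concept Scheme, or a Profile
    # Concept URIs of the scheme; only filled in when allowed_values is a SKOS scheme.
    concepts: frozenset = frozenset()


def is_valid(descriptor, value):
    """Check one string value against the constraint the descriptor points to."""
    target = descriptor.allowed_values
    if target == XSD + "string":
        return isinstance(value, str)
    if target == XSD + "integer":
        # Lexical form of xsd:integer: optional sign followed by digits.
        return isinstance(value, str) and re.fullmatch(r"[+-]?\d+", value) is not None
    if descriptor.concepts:
        # Controlled vocabulary: the value must be one of the scheme's concepts.
        return value in descriptor.concepts
    # Another FAIR Profile, or an unknown target: cannot be decided locally.
    raise NotImplementedError(f"cannot validate against {target}")
```

The third case (a URI that resolves to another FAIR Profile) is where the hierarchical re-use comes in; a fuller client would recurse into that Profile instead of raising.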
45. A FAIR Profile
(an RDF document that follows the FAIR Profile Schema)
[Diagram layers: Metadata Record, Metadata Schema, FAIR Profile, FAIR Profile Schema, with a “This” pointer]
46-50. What a FAIR Profile is:
A meta-description of the (meta)data
in a repository
if you were to view it
from a particular “perspective”
(also known as a “lens*” over the data)
* Scientific Lenses to Support Multiple Views over Linked Chemistry
Data; DOI:10.1007/978-3-319-11964-9_7
What a FAIR Profile is NOT:
THE meta-description of the (meta)data
in a repository
This is where the FAIRport approach becomes
distinctly powerful!
But first, look at the other
FAIRport components
51. Skunkworks Task #2 - [A]ccessible
Are there already access layer definitions?
52. Skunkworks Task #2 - [A]ccessible
Are there already access layer definitions?
A set of behaviors for providing a unified (albeit simplistic!)
access layer for “records” contained in any Web resource
60. LDP returns you
DCAT Distributions for all
available formats of that record
that the repo provides
<RDF>
<dcat:Dist.>
<format xml>
URL6a
<dcat:Dist.>
<format html>
URL6b
</RDF>
61. You directly call the
repository using the URL of
your choice
GET URL6a
62. Repository returns you the
data you requested
Content-type: application/xml
<data>
<data>
Yummy Data Here!
</data>
</data>
….
(Note: most repositories already do this!
So we’re half-way there :-) )
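The middle step above (choosing one of the returned DCAT Distributions) could be sketched roughly as follows. This is an illustration under assumptions, not the Accessor's actual output: the SAMPLE document and its example.org URLs are invented, the response is assumed to be RDF/XML, and the media type is assumed to be a plain literal in dcat:mediaType with the address in dcat:downloadURL.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCAT = "http://www.w3.org/ns/dcat#"

# Invented sample response; a real Accessor may structure its RDF differently.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:dcat="{DCAT}">
  <dcat:Distribution>
    <dcat:mediaType>application/xml</dcat:mediaType>
    <dcat:downloadURL rdf:resource="http://example.org/record6.xml"/>
  </dcat:Distribution>
  <dcat:Distribution>
    <dcat:mediaType>text/html</dcat:mediaType>
    <dcat:downloadURL rdf:resource="http://example.org/record6.html"/>
  </dcat:Distribution>
</rdf:RDF>"""


def distributions(rdf_xml):
    """Map media type -> download URL for each dcat:Distribution found."""
    root = ET.fromstring(rdf_xml)
    found = {}
    for dist in root.iter(f"{{{DCAT}}}Distribution"):
        media = dist.findtext(f"{{{DCAT}}}mediaType")
        url = dist.find(f"{{{DCAT}}}downloadURL")
        if media and url is not None:
            found[media.strip()] = url.get(f"{{{RDF}}}resource")
    return found


def pick(rdf_xml, wanted):
    """Return the URL to GET for the wanted media type, or None if not offered."""
    return distributions(rdf_xml).get(wanted)
```

The client then calls GET on whatever pick() returns, which is the step most repositories already support.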
63. The first time I wrote one of these from scratch,
it was about 170 lines of code,
and took less than 4 hours
(including reading the W3C documentation!)
64. When one of these is associated with a FAIR Profile we call it a
“FAIR Accessor”
66. Skunkworks Task #3 - [I]nteroperable
This is “the holy grail”!!
This is where the FAIR Profile reveals its utility
“what it IS” vs. “what it IS NOT”
67. What a FAIR Profile is:
A meta-description of the (meta)data
in a repository
if you were to view it
from a particular “perspective”
(also known as a “lens” over the data)
68-69. Skunkworks Task #3 - [I]nteroperable
“FAIR Projectors”
A FAIR Projector is a (potentially) small, modular,
reusable Web-based service that “projects” data
from a repository into the format
described by a FAIR Profile
http://linkeddatafragments.org/
70. RESTful access to RDF data resources
RESTful hypermedia controls (e.g. pagination)
defined by Hydra W3C Community Group
http://www.hydra-cg.com/
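As a rough illustration of what "RESTful access to RDF" looks like from the client side, the sketch below builds the GET URL for one triple pattern. The function name and the example.org address are hypothetical, and the query parameter names (subject, predicate, object, page) are an assumption based on common Linked Data Fragments servers; a well-behaved client would read the actual URL template from the Hydra controls in the server's response.

```python
from urllib.parse import urlencode


def fragment_url(base, subject=None, predicate=None, obj=None, page=None):
    """Build the GET URL for one triple pattern; None means "match anything"."""
    pattern = {"subject": subject, "predicate": predicate, "object": obj, "page": page}
    # Leave out unconstrained positions so the server treats them as wildcards.
    query = urlencode({k: v for k, v in pattern.items() if v is not None})
    return f"{base}?{query}" if query else base
```

For example, fragment_url(base, predicate=some_uri, page=2) asks for the second page of all triples that use that predicate.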
77. Stage 1: Kinds of questions we can ask
● How do I access the records in Repo X?
→ GET Accessor URL
● How do I access the records in Repo X in
XML?
→ GET Accessor URL & DCAT Dist URL
● Can I please have the “biological tissue” field of
repo X as FMA Ontology terms?
→ Search FAIRport → pick matching FAIR Profile
+ Projector → GET Projector URL
78. The first time I wrote one of these from scratch,
it was about 300 lines of Perl code,
and took about 6 hours
(including reading the LDF documentation!)
and it projected three different FAIR Profiles
83. Stage 2: Leverage the Modularity
[Diagram: two “implementedBy” links]
Merged data to be cross-queried
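A toy sketch of the "merge, then cross-query" idea, using plain Python sets of (subject, predicate, object) tuples rather than an RDF library. The triples, prefixes, and function names are invented for illustration, and the join only works because both projections are assumed to use the same identifier for the same record.

```python
def merge(*projections):
    """Union of the triples from several projections (set semantics, as in an RDF graph)."""
    merged = set()
    for triples in projections:
        merged.update(triples)
    return merged


def subjects_with(graph, predicate):
    """All subjects that have at least one triple with the given predicate."""
    return {s for (s, p, o) in graph if p == predicate}


# Invented output of two Projectors over two different repositories.
DESCRIPTIONS = {
    ("allele:cho", "ex:description", "text describing cho"),
    ("allele:def", "ex:description", "text describing def"),
}
IMAGES = {
    ("allele:cho", "ex:image", "http://example.org/img/cho.jpg"),
}

graph = merge(DESCRIPTIONS, IMAGES)
# Cross-query: records that have both a description and an image.
both = subjects_with(graph, "ex:description") & subjects_with(graph, "ex:image")
```

In practice the merged data would be loaded into a triple store and queried there; the set intersection stands in for that join.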
85. Main features of FAIR Profiles
● Do not require repository participation - anyone can write a Profile
● Provides a purpose-driven, potentially non-comprehensive “view” on a
repository
● FAIR Profiles of any given repository facet may be different! May use
different vocabularies or may interpret fields differently, depending on the
needs of the Profile author
● FAIR profiles can/should be indexed and shared (e.g. in a FAIRport
Registry), to facilitate cross-repository interoperability and integration
● There is no (obvious) reason why a FAIR profile could not be used to
describe the DATA in the repository, not just the metadata…
○ my examples on the final page of this slideshow do exactly that!
● FAIR Profiles can be used both at the “read” and at the “write” end of data
publishing… (Force 11 Oxford meeting demo was for “write” interfaces)
86. Main features of FAIRport Platform
● GET GET GET!! We didn’t invent any new technology or API :-) :-)
● All components modular, re-usable, and often will be written by 3rd parties
○ → encourages the creation of an ecosystem of these lightweight,
discoverable little data transformers
● All components identified by URL, and can be “cobbled together” in whatever
way a client needs on a particular day (and this can happen automatically!)
● Because everything is identified by a URL, and we only use HTTP GET,
components can be “chained” (e.g. the Projector calls GET on the URL of
another Projector)
○ → i.e. I simply don’t care how the Projector or Accessor work “under the
hood”. I only look at the FAIR Profile and then call GET.
87. Skunkworks Participants
● Mark Wilkinson
● Michel Dumontier
● Barend Mons
● Tim Clark
● Jun Zhao
● Paolo Ciccarese
● Paul Groth
● Erik van Mulligen
● Luiz Olavo Bonino da Silva Santos
● Matthew Gamble
● Carole Goble
● Joël Kuiper
● Morris Swertz
● Erik Schultes
● Mercè Crosas
● Adrian Garcia
● Philip Durbin
● Jeffrey Grethe
● Katy Wolstencroft
● Sudeshna Das
● M. Emily Merrill
88. Working Examples
- One (small) dataset (the Allele slice of my own DragonDB): http://antirrhinum.net
An example record in the repository's native format is here:
http://antirrhinum.net/cgi-bin/ace/generic/xml/DragonDB?name=cho;class=Allele
- Three different FAIR Profiles: one with textual descriptions and gene cross-references, the other two with phenotypic images described using the SIO ontology or the EDAM ontology (respectively). This is the "F" in FAIR, since these can (in principle) be searched and queried in order to find repositories that potentially have your data of interest, in your desired format.
* http://biordf.org/DataFairPort/ProfileSchemas/DragonDB_Allele_ProfileAlleleDescriptions.rdf
* http://biordf.org/DataFairPort/ProfileSchemas/DragonDB_Allele_ProfileImagesEDAM.rdf
* http://biordf.org/DataFairPort/ProfileSchemas/DragonDB_Allele_ProfileImagesSIO.rdf
- A "FAIR Accessor" that provides a Linked Data Platform-compliant way to retrieve all of the URIs for the Allele records, as well as their various representations (described as DCAT Distributions). This is the "A" in FAIR.
http://antirrhinum.net/cgi-bin/LDP/Alleles
- A "FAIR Projector" that takes the data from the Allele records and "projects" it as RDF that is compliant with whichever Profile you chose. This is the "I" in FAIR.
http://biordf.org/cgi-bin/DataFairPort/DragonDB_LDF_Profiler
(You won't see anything if you just surf to that endpoint. It's a RESTful web service that requires additional URL components, as described below.)
- Profiles, Accessors, and Projectors are linked by small fragments of RDF, but in principle they are all independent of one another.
This describes the Accessor for a given Profile: http://biordf.org/DataFairPort/DragonDB_Allele_Accessor.rdf
This describes the Projector for a given Profile: http://biordf.org/DataFairPort/DragonDB_FAIRDataProjector.rdf
(In this case, the same file describes all three FAIR projections, but these could be published independently just as easily.)
Three “Projections” of the DragonDB Allele Data (note that most of the process above is achieved simply by calling GET on the URLs below!!)
http://biordf.org/cgi-bin/DataFairPort/DragonDB_LDF_Profiler/DragonDB_Allele_ProfileAlleleDescriptions/
http://biordf.org/cgi-bin/DataFairPort/DragonDB_LDF_Profiler/DragonDB_Allele_ProfileImagesSIO/
http://biordf.org/cgi-bin/DataFairPort/DragonDB_LDF_Profiler/DragonDB_Allele_ProfileImagesEDAM/
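The three projection URLs appear to follow a simple pattern: the Projector endpoint, then the Profile's file name without its extension, then a trailing slash. The sketch below reproduces that pattern; it is inferred only from the URLs listed on this slide, not from a documented rule, and the function name is hypothetical. The authoritative link between a Profile and its Projector is the RDF description file mentioned above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Projector endpoint and Profile URLs as listed on this slide.
PROJECTOR = "http://biordf.org/cgi-bin/DataFairPort/DragonDB_LDF_Profiler"
PROFILES = [
    "http://biordf.org/DataFairPort/ProfileSchemas/DragonDB_Allele_ProfileAlleleDescriptions.rdf",
    "http://biordf.org/DataFairPort/ProfileSchemas/DragonDB_Allele_ProfileImagesEDAM.rdf",
    "http://biordf.org/DataFairPort/ProfileSchemas/DragonDB_Allele_ProfileImagesSIO.rdf",
]


def projection_url(profile_url, projector=PROJECTOR):
    """Derive the projection URL from a Profile URL (pattern inferred, not specified)."""
    # File name of the Profile without the ".rdf" extension.
    name = PurePosixPath(urlsplit(profile_url).path).stem
    return f"{projector}/{name}/"


projection_urls = [projection_url(p) for p in PROFILES]
```

Calling GET on each entry of projection_urls is then all a client needs to do to retrieve the data in the shape that Profile describes.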