The document discusses the objectives and outcomes of the FAIRport Skunkworks team so far. The team is exploring existing technologies to build prototype FAIRport code components using existing standards. They aim to enable findable, accessible, interoperable, and reusable data across repositories. However, repositories use different metadata schemas and standards like DCAT in incomplete ways. The team proposes "FAIR Profiles" - a generic way to describe metadata fields and constraints for any repository using a standardized vocabulary and structure. This would enable rich queries across repositories. They define a FAIR Profile Schema to serve as a lightweight meta-meta-descriptor for describing diverse repository metadata schemas in a consistent way.
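The "FAIR Profile" idea described above can be sketched minimally: a profile is a meta-metadata record declaring which fields a repository exposes, which vocabulary term each field maps to, and simple constraints. Everything here (the repository name, field list, and `validate` helper) is a hypothetical illustration, not the team's actual schema.

```python
# Hypothetical sketch of a "FAIR Profile": a meta-metadata record that
# declares the metadata fields a repository exposes, the vocabulary term
# each field maps to, and whether the field is required.
profile = {
    "repository": "ExampleRepo",  # invented repository name
    "fields": [
        {"name": "title",   "term": "dcterms:title",   "required": True},
        {"name": "creator", "term": "dcterms:creator", "required": True},
        {"name": "subject", "term": "dcterms:subject", "required": False},
    ],
}

def validate(record, profile):
    """Return the names of required profile fields missing from a record."""
    return [f["name"] for f in profile["fields"]
            if f["required"] and f["name"] not in record]

missing = validate({"title": "Rainfall 2015"}, profile)
```

Because the profile uses a shared vocabulary (Dublin Core terms here), the same query logic could be pointed at any repository that publishes such a profile.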
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript... – datascienceiqss
It would be useful to be able to discover what kinds of data are contained in the myriad general-purpose public data repositories. It would be even better if it were possible to query that data and/or have that data conform to a particular context-dependent data format. This was the ambition of the Data FAIRport project. I will be presenting a "strawman" demonstration of a fully functional Data FAIRport, where the meta/data in a public repository can be "projected" into one of a number of context-dependent formats, such that it can be cross-queried in combination with the (potentially "projected") data from other repositories.
How to describe a dataset. Interoperability issues – Valeria Pesce
Presented by Valeria Pesce during the pre-meeting of the Agricultural Data Interoperability Interest Group (IGAD) of the Research Data Alliance (RDA), held on 21 and 22 September 2015 in Paris at INRA.
20161004 “Open Data Web” – A Linked Open Data Repository Built with CKAN – andrea huang
Our team is in Madrid (#CKANCon) to introduce our #LODLAM implementation. http://data.odw.tw is just out (slides at https://goo.gl/KJApV8). If you are at #IODC16, you are also welcome to discuss with our team in person. #opendata
More introduction about data.odw.tw can be accessed at https://goo.gl/YUSI74 (Chinese) and https://goo.gl/2u07Ap (English).
Linking Open, Big Data Using Semantic Web Technologies - An Introduction – Ronald Ashri
The Physics Department of the University of Cagliari and the Linkalab Group invited me to talk about the Semantic Web and Linked Data - this is simply an introduction to the technologies involved.
This chapter introduces the semantic modeling procedure, detailing its technical characteristics, possibilities and limitations. First, we present the languages that are used for semantic description. We present RDF, RDFS and OWL, describe their expressiveness in terms of describing Web Resources, and the abilities they provide in order to describe, query, administer and manage resources at a semantic layer. Next, we present the vocabularies that are used in order to provide common grounds in understanding and communicating ideas and concepts. The technologies, together with the vocabularies used, altogether comprise the modern landscape of Semantic Web/Linked Data applications and serve as the basis for maintaining, analyzing datasets and building applications on top of them.
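The RDF data model the chapter describes can be illustrated in a few lines: every statement is a (subject, predicate, object) triple, and querying reduces to pattern matching over the set of triples. This is a stdlib-only sketch; the prefixes and resource names are invented, and a real application would use an RDF library and SPARQL.

```python
# Minimal illustration of the RDF data model: statements are
# (subject, predicate, object) triples, and a query is a pattern
# in which None acts as a wildcard.
triples = {
    ("ex:ds1", "rdf:type",      "dcat:Dataset"),
    ("ex:ds1", "dcterms:title", "Rainfall 2015"),
    ("ex:ds2", "rdf:type",      "dcat:Dataset"),
}

def match(pattern, graph):
    """Return triples matching an (s, p, o) pattern; None matches anything."""
    return {t for t in graph
            if all(p is None or p == v for p, v in zip(pattern, t))}

# "Find every resource typed as a dcat:Dataset."
datasets = match((None, "rdf:type", "dcat:Dataset"), triples)
```

RDFS and OWL then layer schema and inference on top of exactly this triple structure.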
This presentation introduces OpenLink Virtuoso – "The Prometheus of RDF" – covering Linked Data verticals and patterns involving the Web and Big Data, SPARQL and RDF, the RDF tax, and many others.
Presentation for CLARIAH IG Linked Open Data on the latest developments for Dataverse FAIR data repository. Building SEMAF workflow with external controlled vocabularies support and Semantic API.
Transient and persistent RDF views over relational databases in the context o... – Nikolaos Konstantinou
As far as digital repositories are concerned, numerous benefits emerge from publishing their contents as Linked Open Data (LOD), which leads more and more repositories in this direction. However, several factors need to be taken into account in doing so, among them whether the transition should be materialized in real time or at asynchronous intervals. In this paper we frame the problem in the context of digital repositories, discuss the benefits and drawbacks of both approaches, and draw our conclusions after evaluating a set of performance measurements. Overall, we argue that in contexts with infrequent data updates, as is the case with digital repositories, persistent RDF views are more efficient than real-time SPARQL-to-SQL rewriting systems in terms of query response times, especially when expensive SQL queries are involved.
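The trade-off the abstract argues for can be sketched with a toy counter: a persistent (materialized) view pays the relational-to-RDF export cost once and serves queries from the copy, whereas a per-query rewriting system pays it on every request. The rows, URIs, and export function here are invented stand-ins, not the paper's actual benchmark.

```python
# Sketch contrasting the two strategies: a persistent RDF view is
# exported once and re-served, while per-query rewriting would re-run
# the (expensive) export for every incoming query.
rows = [(1, "Alice"), (2, "Bob")]  # stand-in for database tuples
export_calls = 0

def export_to_triples(rows):
    """Pretend relational-to-RDF export; counts how often it runs."""
    global export_calls
    export_calls += 1
    return [("ex:person/%d" % i, "foaf:name", n) for i, n in rows]

# Persistent view: export once, then answer many queries from the copy.
view = export_to_triples(rows)
for _ in range(10):
    _ = [t for t in view if t[1] == "foaf:name"]
```

With infrequent updates, `export_calls` stays at 1 for the materialized view, versus 10 under naive per-query rewriting, which mirrors the paper's response-time argument.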
The Bounties of Semantic Data Integration for the Enterprise – Ontotext
If you are looking for solutions that allow you not only to manage all of your data (structured, semi-structured and unstructured) but to also make the most out of them, using a common language is critical.
Semantic technology added to data integration is the glue that holds together all your enterprise data and their relationships in a meaningful way.
Learn how you can quickly design data processing jobs and integrate massive amounts of data and see what semantic integration can do for your data and your business.
www.ontotext.com
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P... – andrea huang
Will the rich domain knowledge from research publications and the implicit cross-domain metadata of cultural objects be compliant with each other? A contextual framework is proposed, dynamic and relational, supporting three different contexts – Reusing, Publication and Curation – which are individually constructed but overlap in their major conceptual elements. A Relations for Reusing (R4R) ontology has been devised for modeling these overlapping conceptual components (Article, Data, Code, Provenance, and License) and for interlinking research outputs with cultural heritage data. In particular, packaging and citation relations are key to building up interpretations for dynamic contexts. Examples illustrate how the linking mechanism can be constructed and represented, revealing the data linked in different contexts.
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018 – Ontotext
These are slides from a live webinar that took place in January 2018.
GraphDB™ Fundamentals builds the basis for working with graph databases that utilize the W3C standards, particularly GraphDB™. In this webinar, we demonstrated how to install and set up GraphDB™ 8.4 and how to generate your first RDF dataset. We also showed how to quickly integrate complex and highly interconnected data using RDF and SPARQL, and much more.
With the help of GraphDB™, you can start smartly managing your data assets, visually represent your data model and get insights from them.
An Approach for the Incremental Export of Relational Databases into RDF Graphs – Nikolaos Konstantinou
Several approaches have been proposed in the literature for offering RDF views over databases. In addition, a variety of tools exist that export database contents into RDF graphs. Approaches in the latter category have often been shown to perform better than those in the former. However, when database contents are exported into RDF, it is not always optimal, or even necessary, to export (or "dump", as this procedure is often called) the whole database contents every time. This paper investigates the problem of incremental generation and storage of the RDF graph that results from exporting relational database contents. To express mappings that associate tuples from the source database with triples in the resulting RDF graph, an implementation of the R2RML standard is put to the test. Next, a methodology is proposed and described that enables incremental generation and storage of the RDF graph originating from the source relational database contents. The performance of this methodology is assessed through an extensive set of measurements. The paper concludes with a discussion of the authors' most important findings.
In this Chapter, we consider relational databases as a data source for the generation of Linked Data, given that they constitute one of the most popular data storage media, containing huge data volumes that feed the vast majority of information systems worldwide. In this context, we review the related literature and reveal the main motivations that fuel the relevant approaches, and the benefits that arise from their application. We present a categorization of approaches that map relational databases to the Semantic Web and identify tool implementations that extract RDF graphs from relational database instances. We also sketch a proof-of-concept use case scenario regarding how a repository with scholarly information can be converted to a Linked Data endpoint. The Chapter ends with a discussion of the open issues and future outlook for the problem of RDF generation from relational databases.
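The incremental-export idea from the abstracts above can be sketched as change detection over source rows: keep a digest per row and regenerate triples only for rows whose digest changed since the last dump. The row keys and data below are invented, and a real pipeline would drive the regeneration through R2RML mappings rather than this toy scheme.

```python
import hashlib

def digest(row):
    """Stable fingerprint of a source row's contents."""
    return hashlib.sha256(repr(row).encode()).hexdigest()

def incremental_export(rows, seen):
    """Return keys of new or modified rows; update `seen` in place.

    Only these rows need their RDF triples regenerated; unchanged
    rows keep the triples produced by the previous dump.
    """
    changed = []
    for key, row in rows.items():
        d = digest(row)
        if seen.get(key) != d:
            seen[key] = d
            changed.append(key)
    return changed

seen = {}
first = incremental_export({"r1": ("Alice",), "r2": ("Bob",)}, seen)
second = incremental_export({"r1": ("Alice",), "r2": ("Bobby",)}, seen)
```

The first pass exports everything; the second touches only the modified row, which is the saving the paper's measurements quantify.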
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking – Ontotext
A presentation by Ontotext’s CEO Atanas Kiryakov, given during Semantics 2018 – an annual conference that brings together researchers and professionals from all over the world to share knowledge and expertise on semantic computing.
Since 6 September 2010, the @F3Lorraine account has been present on Twitter. Why, how, and what new practices accompanied this editorial choice? Some answers from Jean-Christophe Dupuis-Rémond, journalist and community manager of the France Télévisions regional newsroom for Lorraine.
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015 – Mark Wilkinson
A discussion and demonstration of a functional Data FAIRport, using W3C's Linked Data Platform, Ruben Verborgh's Linked Data Fragments, and Hydra's hypermedia controlled vocabularies. This is the output of the "Skunkworks" working group of the larger Data FAIRport project (http://datafairport.org).
Panel presentation to a graduate class at the University of Arizona School of Information Resources and Library Science. Invited by Dr. Jana Bradley. July 2006.
Metadata & brokering - a modern approach #2 – Daniele Bailo
The second episode of metadata and brokering.
Topics covered:
1. additional definition (ontology, relational database and others)
2. the wide picture: data fabric elements from Research Data Alliance (RDA) and possible concrete implementations of those guidelines
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati... – Mark Wilkinson
This slide deck accompanies the manuscript "Interoperability and FAIRness through a novel combination of Web technologies", submitted to PeerJ Computer Science: https://doi.org/10.7287/peerj.preprints.2522v1
It describes the output of the "Skunkworks" FAIR implementation group, who were tasked with building a prototype infrastructure that would fulfill the FAIR Principles for scholarly data publishing. We show how a novel combination of the Linked Data Platform, RDF Mapping Language (RML) and Triple Pattern Fragments (TPF) can be combined to create a scholarly publishing infrastructure that is markedly interoperable, at both the metadata and the data level.
This slide deck (or something close) will be presented at the Dutch Techcenter for Life Sciences Partners Workshop, November 4, 2016.
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
This presentation was provided by Vinod Chachra of VTLS Inc. during the NISO event "Next Generation Discovery Tools: New Tools, Aging Standards," held March 27 - March 28, 2008.
Presentation by Luiz Olavo Bonino about the current state of the developments on FAIR Data supporting tools at the Dutch Techcentre for Life Sciences Partners Event on November 3-4 2016.
Data Wrangling and Visualization Using Python – MOHITKUMAR1379
Python is open source and has many libraries for data wrangling and visualization that make the life of data scientists easier. For data wrangling, pandas is used, as it represents tabular data and offers functions to parse data from different sources, clean data, handle missing values, merge data sets, and more. To visualize data, the low-level matplotlib library can be used; it is also the base package for higher-level packages such as seaborn, which can draw a well-customized plot in just one line of code. Python has the Dash framework, which is used to make interactive web applications in Python code without JavaScript and HTML. These Dash applications can be published on any server, as well as on clouds like Google Cloud, and for free on the Heroku cloud.
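Two of the wrangling steps the paragraph attributes to pandas – filling missing values and merging data sets on a shared key – can be sketched with the standard library alone. The records and keys are invented; pandas would do the same with `fillna` and `merge`.

```python
# Stdlib-only sketch of two wrangling steps usually done with pandas:
# filling missing values, then an inner merge of two record sets on "id".
left = [{"id": 1, "name": "ann"}, {"id": 2, "name": None}]
right = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.7}]

# Fill missing values with a default (pandas: df.fillna("unknown")).
for rec in left:
    if rec["name"] is None:
        rec["name"] = "unknown"

# Inner merge on "id" (pandas: left_df.merge(right_df, on="id")).
by_id = {r["id"]: r for r in right}
merged = [{**l, **by_id[l["id"]]} for l in left if l["id"] in by_id]
```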
Tech. session : Interoperability and Data FAIRness emerges from a novel combi... – Mark Wilkinson
My presentation to OAI10 - CERN - UNIGE Workshop on Innovations in Scholarly Communication, 21-23 June 2017
University of Geneva.
https://indico.cern.ch/event/405949/contributions/2487823/
A description of the FAIR Accessor and FAIR Projector technologies: REST-compliant approaches to publishing FAIR Metadata and FAIR Data (respectively)
Spanish Ministerio de Economía y Competitividad TIN2014-55993-R
Metadata management for data storage spaces:
INDEXATOR is a metadata management tool that addresses the problems of organising, documenting, storing and sharing data in a research unit or infrastructure, and fits perfectly into a data management plan of a collective.
The central idea is that the storage space becomes the data repository, so the metadata should go to the data and not the other way around.
Given the diversity of domains, the approach chosen is to be both as flexible and as pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary corresponding to the reality of its field and activities. The main idea is to be able to "capture" the user's metadata as easily as possible using their vocabulary. It is possible to define the whole terminology using a spreadsheet.
The choice was made for the JSON format, which is very appropriate for describing metadata, readable by both humans and machines.
This tool is built around a web interface coupled with a MongoDB database. The web interface allows you to: i) describe a dataset using metadata of various types (Description), and ii) search datasets by their metadata (Accessibility).
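An INDEXATOR-style record can be sketched as a JSON document that travels with the data, with field names drawn from the collective's own controlled vocabulary. The paths, vocabulary terms, and `search` helper below are hypothetical illustrations, not the tool's actual schema or API.

```python
import json

# Hypothetical metadata record: the metadata "goes to the data" as a
# JSON document whose vocabulary the collective defines itself.
record = json.loads("""{
    "path": "/storage/project-x/run-042",
    "vocabulary": {"instrument": "sensor-A", "campaign": "2021-spring"},
    "description": "Raw measurements, run 42"
}""")

def search(records, field, value):
    """Toy stand-in for the MongoDB-backed metadata search."""
    return [r for r in records if r["vocabulary"].get(field) == value]

hits = search([record], "campaign", "2021-spring")
```

Because the vocabulary lives inside the record, each collective can define its terms in a spreadsheet and still share one search mechanism.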
Dataset Catalogs as a Foundation for FAIR* Data – Tom Plasterer
BioPharma and the broader research community is faced with the challenge of simply finding the appropriate internal and external datasets for downstream analytics, knowledge-generation and collaboration. With datasets as the core asset, we wanted to promote both human and machine exploitability, using web-centric data cataloguing principles as described in the W3C Data on the Web Best Practices. To do so, we adopted DCAT (Data CATalog Vocabulary) and VoID (Vocabulary of Interlinked Datasets) for both RDF and non-RDF datasets at summary, version and distribution levels. Further, we’ve described datasets using a limited set of well-vetted public vocabularies, focused on cross-omics analytes and clinical features of the catalogued datasets.
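A DCAT-style catalog entry at the summary/version/distribution levels mentioned above might look like the sketch below. All titles, versions, and media types are invented; a real catalog would serialize such entries as RDF.

```python
# Hedged sketch of a DCAT-style dataset record with versioning and
# per-distribution media types; every value here is an invented example.
entry = {
    "@type": "dcat:Dataset",
    "dcterms:title": "Example clinical-omics dataset",
    "dcat:version": "1.2",
    "dcat:distribution": [
        {"@type": "dcat:Distribution", "dcat:mediaType": "text/csv"},
        {"@type": "dcat:Distribution",
         "dcat:mediaType": "application/n-triples"},
    ],
}

def formats(dataset):
    """List the media types a dataset is distributed in."""
    return [d["dcat:mediaType"] for d in dataset["dcat:distribution"]]
```

Describing RDF and non-RDF distributions side by side is what lets both humans and machines discover the right serialization of a dataset.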
... or how to query an RDF graph with 28 billion triples on a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics, on 25th April 2018
This is a short presentation about the FAIR Metrics Evaluator - software that automates the application of FAIR Metrics against a given resource, in order to determine its degree of "FAIRness"
An overview of the current functionality of the FAIR Evaluator - a framework for automating the evaluation of FAIRness of digital resources. The screenshots here are of the early strawman prototype, which is only available for use by the FAIR Metrics Authoring group at this time. Nevertheless, feedback on the functionality of the Evaluator would be welcome! We anticipate having a fully public version before August 2018.
This work is supported, in part, by the Ministerio de Economía y Competitividad grant number TIN2014-55993-RM
Quickly re-publish CSV/TSV files from existing repositories as FAIR Data with just a few mouse clicks!
You select the columns to "project" as Linked Data, and the associated ontology terms. The FAIR Projector Builder will create a FAIR Projector for you: a Triple Pattern Fragment server to provide the Linked Data; a published DCAT Distribution containing metadata about those triples and their source; and an RML model (the syntax and semantics of the triples) to aid third-party discovery of this novel projection.
(current status - first prototype, not ready for public consumption)
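The core "projection" step – selected CSV columns becoming triples under user-chosen ontology terms – can be sketched as below. The column-to-term mapping and subject URIs are invented; the actual builder emits an RML model and serves the triples through a Triple Pattern Fragment server rather than returning a Python list.

```python
import csv
import io

# Hypothetical user choices: which columns to project, and onto which terms.
mapping = {"name": "foaf:name", "age": "foaf:age"}

def project(csv_text, subject_col, mapping):
    """Turn selected CSV columns into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        s = "ex:row/" + row[subject_col]  # invented subject URI scheme
        triples += [(s, term, row[col]) for col, term in mapping.items()]
    return triples

triples = project("id,name,age\n7,Alice,30\n", "id", mapping)
```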
-------
Thanks to the NBDC/DBCLS for sponsoring the hackathon series.
MDW also funded by Ministerio de Economía y Competitividad grant number TIN2014-55993-RM
smartAPIs: EUDAT Semantic Working Group Presentation @ RDA 9th Plenary – Mark Wilkinson
smartAPIs are an approach to the incremental, machine-aided, semantic annotation of Web APIs. Starting from existing, popular standards, we will provide enhanced tools for authoring ever-richer metadata, guided by global community knowledge encapsulated in ontologies, and aided by "smart suggestions" based on mining the metadata from previous API specifications.
The project is led by Michel Dumontier (Maastricht University). This presentation was given on his behalf by Mark Wilkinson (UPM, Madrid; Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R)
IBC FAIR Data Prototype Implementation slideshow – Mark Wilkinson
Discussion about ways of achieving FAIRness of both metadata and data. Brute force approaches, and more elegant "projection" approaches are shown.
Relevant papers are at:
doi: 10.7717/peerj-cs.110 (https://peerj.com/articles/cs-110/)
doi: 10.3389/fpls.2016.00641 (https://doi.org/10.3389/fpls.2016.00641)
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015 – Mark Wilkinson
The primary slide deck for the SADI tutorial. We explain the motivation, simple SADI services, more complex SADI services, and then do a detailed walk-through of building a service, including the Perl service code and examples of service invocation at the command line, and using the SHARE client. You will want to look at the sample data/queries in this slide deck: http://www.slideshare.net/markmoby/sample-data-and-other-ur-ls-55737183 and the example service code in this slide deck: http://www.slideshare.net/markmoby/example-code-for-the-sadi-bmi-calculator-web-service?related=1
Example code for the SADI BMI Calculator Web Service – Mark Wilkinson
Two versions of the code for the SADI Web Service demonstrated at the Using the Semantic Web for faster (Bio-)Research workshop hosted by the Swiss Institute for Bioinformatics, Geneva, December, 2015. The first version of the code is a bare-bones service that consumes individuals with height and weight and returns individuals with a BMI. The second piece of code is functionally identical to the first, but highlights the small changes required to make the service a NanoPublisher (NanoPublishing services respond to Accept n-quads HTTP headers by returning NanoPublications, rather than just a stream of triples)
Perl code for a SADI service that calculates BMI. The first panel is the code for a traditional SADI service, the second panel highlights the minor changes required to convert the service into a service that outputs NanoPublications.
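The computation at the heart of the demo service is simply BMI = weight / height². The original service is in Perl; this is a hedged Python re-sketch of that calculation alone, with invented field names, leaving out the SADI service plumbing and the NanoPublication output entirely.

```python
# Python re-sketch of the BMI step inside the SADI demo service:
# consume an individual with height and weight, return it annotated
# with a BMI value (field names are invented for illustration).
def annotate_bmi(individual):
    """Return a copy of the record with BMI = weight_kg / height_m ** 2."""
    bmi = individual["weight_kg"] / individual["height_m"] ** 2
    return {**individual, "bmi": round(bmi, 1)}

out = annotate_bmi({"height_m": 1.8, "weight_kg": 81.0})
```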
Luke McCarthy's tutorial - originally created for the CBRASS Project, funded by CANARIE.
The slideshow takes you though the design of a SADI Service, the considerations when creating service input and output classes (where DL reasoning is used for matchmaking), and how SADI fits with other initiatives such as SAWSDL
Presentation to the J. Craig Venter Institute, Dec. 2014 – Mark Wilkinson
This is largely a compilation of various other talks that I have posted here - a summary of the past 3+ years of work on SADI/SHARE. It includes the (now well-worn!!) slides about SHARE, as well as some of the more contemporary stuff about how we extended GALEN clinical classes with richer semantic descriptions, and then used them to do automated clinical phenotype analysis. Also includes the slide-deck related to automated Measurement Unit conversion (related to our work on semantically representing Framingham clinical risk assessment rules)
So... for anyone who regularly follows my uploads, there isn't much "new" in here, but at least it's all in one place now! :-)
Enhancing Reproducibility and Transparency in Clinical Research through Seman... – Mark Wilkinson
We were interested in whether we could model well-established clinical risk guidelines in OWL, and use these to automatically classify patient data with respect to "risk" (e.g. using the Framingham risk categories). What we ended up doing, however, was wandering down a very interesting path of attempting to model clinical intuition! This reports the first phase of the experiment; a subsequent SlideShare will give part II of this investigation.
This is the work of Soroush Samadian, Ph.D. Candidate at the University of British Columbia Bioinformatics Graduate Programme.
The same story as usual, but with a bit more context (why it is absolutely necessary to move science in this direction). Presented to the University of Potsdam, Germany, and the University of New Brunswick, Canada, in December 2012.
This is a brief version of earlier talks, but I think it might explain more emphatically what I think Web Science is, and why I believe it is realistic, and how SADI/SHARE technologies (or technologies like them) are important to achieve the vision
Science in the Web, from hypothesis to result. Publishing in silico experiments IN the Web allows us to immediately and precisely disseminate new knowledge that can affect other Web Science experiments. This is the "singularity" where a new discovery is immediately put into practice
Evaluating Hypotheses using SPARQL-DL as an abstract workflow language to cho... – Mark Wilkinson
Some of the recent work we've been doing with SADI and SHARE, using SHARE as a mechanism for dynamically converting OWL Classes into workflows in a data-dependent manner; OWL, in this case, is acting as an abstract workflow model. The slides in the middle are the usual SADI/SHARE explanation; the slides at the end show how we're using these dynamically generated workflows to "personalize" medical information on the Web for a particular patient's profile.
SWAT4LS 2011: SADI Knowledge Explorer Plug-in – Mark Wilkinson
my presentation of the SADI plug-in to the IO Informatics' Knowledge Explorer. Presented at SWAT4LS (Semantic Web Applications and Tools for Life Sciences), London, UK, December, 2011. It describes how we resolve identifiers to semantic metadata in a variety of ways in order to boot-strap the semantics required to do service discovery and matching. It also describes how we convert OWL classes into approximately matching SPARQL queries, and store these queries in the SADI registry such that, after service discovery, it is simple to extract the data a service requires as its input.
SADI in Perl - Protege Plugin Tutorial (fixed Aug 24, 2011) – Mark Wilkinson
IMPORTANT CORRECTION TO THIS SLIDESHOW WAS MADE August 24, 2011. How to use the Protege SADI plugin to generate SADI-compliant semantic web services. Created for the 2011 DBCLS BioHackathon. Credits to Mark Wilkinson, Benjamin Vandervalk, Luke McCarthy, Edward Kawas.
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesSanjeev Rampal
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with focus on 4 key topics.
1) Key patterns for Multi-cluster architectures
2) Architectural comparison of several OSS/ CNCF projects to address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/ deploying these solutions.
1. FAIRport Skunkworks
EU Lead: Mark Wilkinson, Fundacion BBVA Chair in Biological Informatics; Isaac Peral Distinguished Researcher, CBGP-UPM
USA Lead: Michel Dumontier, Associate Professor, Biomedical Informatics, Stanford
FAIRport Project Lead: Barend Mons, Professor, Leiden University Medical Centre
3. What is a FAIRport?
● Findable - (meta)data should be uniquely and persistently identifiable
● Accessible - identifiers should provide a mechanism for (meta)data access, including authentication, access protocol, license, etc.
● Interoperable - (meta)data should be machine-accessible, using a machine-parseable syntax and, where possible, shared common vocabularies
● Reusable - there should be sufficient machine-readable metadata that it is possible to “integrate like-with-like”, and that component data objects can be precisely and comprehensively cited post-integration
4. “Skunkworks”
“...a group within an organization given a high
degree of autonomy and unhampered by
bureaucracy, tasked with working on advanced
or secret projects.” -- Wikipedia: http://en.wikipedia.org/wiki/Skunk_Works
5. “Skunkworks” FAIRport group
Objective (ongoing): explore existing technologies and attempt to build prototype FAIRport code components using, whenever possible, existing standards. Once desirable FAIR behaviors have been achieved, hand off to a professional coding team to ensure production-quality outcomes.
● Self-selected “hackers”
● Self-identified tasks (next few slides)
● This led to a series of Web meetings, and a joint Hackathon, with participants at venues in the Netherlands and the USA.
6. Typical Problem
I’m looking for microarray data of human liver cells on a time-course following liver transplant.
What repositories *could* contain this data?
● GEO? EUDat? NPG Scientific Data?
● What fields in those repositories would I need to search, using what vocabularies, to find what I need?
7. “Skunkworks” - initial observations
There are a lot of repositories out there!
General Purpose: Dryad, EUDat, Figshare, DataVerse, etc.
Special Purpose: PDB, UniProt, NCBI, EnsEMBL
Lack of rich, machine-readable descriptions of the contents of these repositories hinders us from (for example):
● knowing where we can look for certain types of data
● knowing if two repositories contain records about the same thing
● cross-referencing or “joining” across repositories to integrate disparate data about the same thing
● knowing which repository I could/should deposit my data to (and how)
8. Challenge
If we wanted to enable this kind of FAIR discovery and integration over myriad repositories, what infrastructure (existing/new) would we need?
9. Task: harmonized cross-repository meta-descriptors
Though self-selected as a FAIRport Skunkworks task, this significantly overlaps with the Force11 Data Citation Implementation Working Group Team 4, “Common repository interfaces”.
...so we joined forces :-)
10. Task: harmonized cross-repository meta-descriptors
Exemplar use-cases:
● A piece of software that can generate a “sensible” query form/interface for any repository
● A piece of software that can generate a “sensible” and comprehensive data submission form for any repository
11. Prior Art? The DCAT Data Catalog Vocabulary
“DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web…. By using DCAT to describe datasets in data catalogs, publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs. It further enables decentralized publishing of catalogs and facilitates federated dataset search across sites. Aggregated DCAT metadata can serve as a manifest file to facilitate digital preservation.”
http://www.w3.org/TR/vocab-dcat/
W3C Recommendation 16 January 2014
12. Prior Art? The DCAT Data Catalog Vocabulary
DCAT is an RDF Schema that defines core metadata elements describing dataset collections and the datasets within those collections. e.g.
:dataset-001
    a dcat:Dataset ;
    dct:title "Imaginary dataset" ;
    dcat:keyword "accountability", "transparency", "payments" ;
    dct:issued "2011-12-05"^^xsd:date ;
    dct:modified "2011-12-05"^^xsd:date ;
    dct:temporal <http://reference.data.gov.uk/id/quarter/2006-Q1> ;
    dct:spatial <http://www.geonames.org/6695072> ;
    dct:publisher :finance-ministry ;
    dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
    dcat:distribution :dataset-001-csv .
13. So the core metadata of a repository’s collections could be described in DCAT...
14. ...if the repositories used DCAT…
15. ...generally speaking, they don’t...
16. ...and we need more than just core metadata to enable cross-repository search anyway…
17. So DCAT itself isn’t the solution to our problem because, among other things, it does not provide sufficiently rich descriptors
20. What exactly *is* our problem?
The Data Schema (e.g. XMLS, RDFS) defines the Data Record (e.g. XML, RDF)
21. The Metadata Record (e.g. DCAT-compliant RDF) describes the Data Record
22. The DCAT RDFS Schema defines the Metadata Record
23. If everyone was using all elements of the DCAT schema to define their core metadata, then (that part of) the problem would be solved at this point
24. We could use THIS...
25. ...to build queries about THIS
(each “THIS” pointing at a layer of the diagram)
26. What exactly *is* our problem? REALITY:
● an XML Data Record, defined by an XMLS Data Schema, described by a DCAT RDF Metadata Record, defined by the DCAT RDFS Schema
● an RDF Data Record, defined by an RDFS Data Schema, described by a UniProt RDF Metadata Record, defined by the UniProt RDFS Metadata Schema
● an ACEDB Data Record, defined by an ACEDB Data Schema, described by a DragonDB Form Metadata Record, defined by the DragonDB Form Metadata Schema
(Slides 27-32 repeat the diagram above, adding one observation each:)
27. Repositories don’t all use the DCAT Schema
28. Those that use the DCAT Schema use only parts of it
29. Those that don’t use DCAT use a myriad of alternatives (some very loosely defined)
30. And don’t necessarily use all elements of those alternatives either
31. So how are we going to do RICH queries over all of these?
32. We need a way to describe the descriptors...
33. The DCAT WG suggested the same thing
They said there was a need for “DCAT Profiles”:
“A DCAT Profile is a specification for data catalogs that adds additional constraints to DCAT. Additional constraints in a profile MAY include:
● A minimum set of required metadata fields
● Classes and properties for additional metadata fields not covered in DCAT
● Controlled vocabularies or URI sets as acceptable values for properties
● Requirements for specific access mechanisms (RDF syntaxes, protocols) to the catalog’s RDF description”
http://www.w3.org/TR/vocab-dcat/
34. In other words, a DCAT Profile is a generic way to describe what metadata fields a repository has, and what the constraints on those fields are
35. But the DCAT WG also suggested... DCAT Profiles don’t exist!
36. “FAIR Profiles”
At the Hackathon, the “Skunkers” decided to invent the DCAT Profile technology. Since they are intended to allow descriptions of:
● descriptor metadata fields not included in DCAT...
● ...in many cases, descriptors with ZERO metadata fields from DCAT...
● ...and in many cases, descriptors that are not even in RDF...
we call them “FAIR Profiles” rather than DCAT Profiles.
(However, clear acknowledgements to the DCAT Working Group for conceiving of the idea!)
37. What the FAIR Profile technology accomplishes
(the diagram from slide 26 again, now with a FAIR Profile attached to each metadata schema)
38. ● FAIR Profile of the DCAT Schema
● FAIR Profile of the UniProt Metadata Schema
● FAIR Profile of the DragonDB Metadata Schema
39. Though they are potentially describing very different things (from Web FORM fields to OWL Ontologies!), all FAIR Profiles are written using the same vocabulary and structure, defined by...
40. ...the FAIR Profile Schema (described on the following slides)
42. The full stack:
● The Repo. Data Schema (e.g. XMLS, RDFS) defines the Repo. Data Record (e.g. XML, RDF)
● The Repository Metadata Record describes the Repo. Data Record
● The Repository Metadata Schema defines the Repository Metadata Record
● The Repository’s FAIR Profile describes the Repository Metadata Schema
● The FAIR Profile Schema defines the Repository’s FAIR Profile
43. “All problems in computer science can be solved by another level of indirection” -- David Wheeler, inventor of the subroutine
44. “...But that usually will create another problem.” -- David Wheeler
Diomidis Spinellis. Another level of indirection. In Andy Oram and Greg Wilson, editors, Beautiful Code: Leading Programmers Explain How They Think, chapter 17, pages 279–291. O’Reilly and Associates, Sebastopol, CA, 2007.
45. Desiderata for the FAIR Profile Schema
● Must describe legacy data (i.e. not just DCAT or other “modern” data)
● Must describe a multitude of data formats (XML, RDF, Key/Value, etc.)
● Must be capable of describing OWL-DL-governed data (still rare, but increasingly used… Classes, property-restrictions, etc.)
● Must be capable of describing any kind of value constraint, e.g. an arbitrary CV, rdfs:range, or an equivalent OWL construct
● Must be hierarchical (i.e. the value-constraint of a field can be set as an entirely separate FAIR Profile)
● Must be modular, identifiable, shareable, and reusable (to stem the proliferation of new formats)
● Must use standard technologies, and re-use existing vocabularies where possible
● Must be extremely lightweight
● Must NOT require the participation of the repository host (no buy-in required)
46. FAIR Profile Schema
A very lightweight meta-meta-descriptor, written in the RDFS language:
● FAIR Profile --hasClass--> FP Class
● FP Class --hasProperty--> FP Property
● FP Property --allowedValues--> Property Restriction Definition
● FP Class --classType--> External Ontology or RDFS Class (optional)
● FP Property --propertyType--> External Ontology or RDFS Predicate (optional)
http://github.com/DataFairPort/DataFairPort/blob/Master/Schema/DCATProfile.rdfs
47. (same schema, annotated) Open questions on the Property Restriction Definition: Requirement Status? Cardinality? Other Constraint?
49. Property Restriction Definition (XSD, FAIR Profile, SKOS)
Describes the constraints on the possible values for a predicate in the target repository’s metadata Schema.
NOTE: we cannot use rdfs:range because we are meta-modelling! The predicate is a CLASS at the meta-model level, so use of rdfs:range is not appropriate.
50. The possible values are:
● An XSD Datatype
● Another FAIR Profile (i.e. hierarchical profiles)
● A SKOS View on a set of ontology terms from one or more ontologies
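To make the shape concrete, here is a minimal, hypothetical sketch of one FP Class with one FP Property, written as plain Python data rather than RDF. All key names are illustrative stand-ins for the schema's actual RDFS terms; the DCAT and Dublin Core URIs are real, and the concept-scheme URI comes from the slides.

```python
# Hypothetical sketch of a FAIR Profile as plain data:
# Profile -> FP Class -> FP Property -> Property Restriction Definition.
# Key names are illustrative, not the real RDFS terms.
profile = {
    "label": "DemoMicroarrayProfile",
    "classes": [
        {
            "label": "CoreMicroarrayDistributionMetadata",
            # classType: optional pointer to an external ontology/RDFS class
            "class_type": "http://www.w3.org/ns/dcat#Distribution",
            "properties": [
                {
                    "label": "data format",
                    # propertyType: optional pointer to an external predicate
                    "property_type": "http://purl.org/dc/terms/format",
                    # allowedValues: a Property Restriction Definition; here a
                    # SKOS view, but it could equally be an XSD datatype or
                    # another (hierarchically embedded) FAIR Profile
                    "allowed_values": {
                        "kind": "skos_scheme",
                        "uri": "http://biordf.org/DataFairPort/ConceptSchemes/"
                               "EDAM_Microarray_Data_Format",
                    },
                },
            ],
        },
    ],
}
```

Whatever the target repository's native metadata format, a consumer only ever has to understand this one shape.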
51. A FAIR Profile
(an RDF document that follows the FAIR Profile Schema)
In the stack: the Metadata Record (e.g. DCAT-compliant RDF) is defined by the DCAT RDFS Schema, which is described by the FAIR Profile (“This!”), which in turn is defined by the FAIR Profile Schema.
52. A FAIR Profile
(Slides 52-69 walk through an example profile; each slide shows the schema diagram from slide 46 with a different element highlighted.)
53. FAIR Profiles are FAIR! (Identifiable, Re-usable, and Shareable)
56. The CoreMicroarrayDistributionMetadata Descriptor Class
58. The Class follows the “DCAT Distribution” Class model
59. It uses only 3 properties from the “DCAT Distribution” Class model
60. Property #1: let’s look at one of them in detail
61. This Meta-Descriptor element is a ‘FAIR Profile Property’ Class
62. This is its label within that organization’s metadata descriptor
63. This is the URL of the Predicate used by that descriptor
64. This is the “range” of that Predicate within the organization’s descriptor
65. Property #2: let’s look at a different property from the CoreMicroarrayDistributionMetadata Class
67. This is the label for that property
68. The URL of the predicate of this Property
69. In the Metadata Descriptor, this property is constrained by the set of ontology terms defined in the SKOS Concept Scheme EDAM_Microarray_Data_Format
70. <rdf:Description xmlns:ns1="http://www.w3.org/2002/07/owl#"
rdf:about="http://biordf.org/DataFairPort/ConceptSchemes/EDAM_Microarray_Data_Format">
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Ontology"/>
<rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#ConceptScheme"/>
<ns1:imports rdf:resource="http://purl.bioontology.org/ontology/EDAM"/>
</rdf:Description>
<rdf:Description
xmlns:ns1="http://www.w3.org/2000/01/rdf-schema#"
xmlns:ns2="http://www.w3.org/2004/02/skos/core#"
rdf:about="http://edamontology.org/format_1641">
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
<rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
<ns1:label>affymetrix-exp</ns1:label>
<ns2:broader rdf:resource="http://edamontology.org/format_2056"/>
<ns2:inScheme rdf:resource="http://biordf.org/DataFairPort/ConceptSchemes/EDAM_Microarray_Data_Format"/>
</rdf:Description>
<rdf:Description
xmlns:ns1="http://www.w3.org/2000/01/rdf-schema#"
xmlns:ns2="http://www.w3.org/2004/02/skos/core#"
rdf:about="http://edamontology.org/format_2056">
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
<rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
<ns1:label>Microarray experiment data format</ns1:label>
<ns2:broader rdf:resource="http://biordf.org/DataFairPort/ConceptSchemes/EDAM_Microarray_Data_Format"/>
<ns2:inScheme rdf:resource="http://biordf.org/DataFairPort/ConceptSchemes/EDAM_Microarray_Data_Format"/>
</rdf:Description>
http://biordf.org/DataFairPort/ConceptSchemes/EDAM_Microarray_Data_Format
This is a “SKOSified” view of the EDAM Ontology
Jupp, et al., “Taking a view on bio-ontologies” ceur-ws.org/Vol-897/session4-paper22.pdf
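Because the concept scheme is plain RDF/XML, even a standard-library XML parser can recover the machine-readable concepts. A rough sketch (a real consumer would use a proper RDF library; the document below abbreviates the scheme above):

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"

# Abbreviated version of the EDAM_Microarray_Data_Format scheme above.
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://edamontology.org/format_1641">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <rdfs:label>affymetrix-exp</rdfs:label>
  </rdf:Description>
  <rdf:Description rdf:about="http://edamontology.org/format_2056">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <rdfs:label>Microarray experiment data format</rdfs:label>
  </rdf:Description>
</rdf:RDF>"""

def concept_labels(rdfxml):
    """Collect the rdfs:label of every skos:Concept in the document."""
    root = ET.fromstring(rdfxml)
    labels = []
    for desc in root.iter(RDF + "Description"):
        types = {t.get(RDF + "resource") for t in desc.findall(RDF + "type")}
        if SKOS_CONCEPT in types:
            label = desc.findtext(RDFS + "label")
            if label is not None:
                labels.append(label)
    return labels

labels = concept_labels(DOC)
# labels -> ["affymetrix-exp", "Microarray experiment data format"]
```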
71. A FAIR Profile: return to the very top of our FAIR Profile and follow the ExtendedAuthorship Class (schema diagram)
72. ExtendedAuthorship: follow one of the properties of the ExtendedAuthorship Class
74. Author ORCID: the allowed values of this Property are constrained to be individuals that follow the FAIR Profile Schema “DemoORCIDProfileScheme”
76. http://biordf.org/DataFairPort/ProfileSchemas/DemoORCIDProfileScheme.rdf
This is parsed in exactly the same way as our original DemoMicroarrayProfileScheme, but is embedded within it as the value of the author_ORCID property.
…Arbitrary, hierarchical layers of complexity…
77. So to build an interface (e.g. query or data-capture) from a FAIR Profile:
[1] Parse all FAIR Profile classes
    Parse the properties of each class
        Determine the target predicate
        Determine the target value-restrictions
            Call [1] if the restriction is a FAIR Profile
        Create a metadata [capture/query] facet with that predicate and that restriction
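That recursion is only a few lines of code. A hypothetical sketch over a plain-data stand-in for a parsed profile (the data shape, key names, and example.org URIs are all illustrative):

```python
# Hypothetical sketch of slide 77's recursion: walk a FAIR Profile and emit
# one (facet path, predicate, restriction kind) tuple per property, recursing
# whenever a property's restriction is itself an embedded FAIR Profile.

def build_facets(profile, prefix=""):
    facets = []
    for cls in profile["classes"]:                    # [1] parse all classes
        for prop in cls["properties"]:                # parse each property
            predicate = prop["property_type"]         # target predicate
            restriction = prop["allowed_values"]      # target value-restriction
            path = prefix + prop["label"]
            if restriction["kind"] == "fair_profile": # call [1] recursively
                facets += build_facets(restriction["profile"], path + ".")
            else:                                     # create a capture/query facet
                facets.append((path, predicate, restriction["kind"]))
    return facets

# A toy profile with an embedded ORCID sub-profile, echoing slides 71-76:
orcid_profile = {"classes": [{"label": "ORCID", "properties": [
    {"label": "orcid_id",
     "property_type": "http://example.org/orcid_id",  # illustrative URI
     "allowed_values": {"kind": "xsd_datatype"}}]}]}

demo_profile = {"classes": [{"label": "ExtendedAuthorship", "properties": [
    {"label": "author_ORCID",
     "property_type": "http://example.org/author",    # illustrative URI
     "allowed_values": {"kind": "fair_profile", "profile": orcid_profile}}]}]}

facets = build_facets(demo_profile)
# facets -> [("author_ORCID.orcid_id", "http://example.org/orcid_id", "xsd_datatype")]
```

The same walk serves both use-cases from slide 10: each emitted facet becomes either a query field or a data-submission field.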
78.
79. Annotations on an example profile:
● FAIR Profile Classes #1, #2, #3, and Class #4 (embedded)
● Value constraints
● Descriptor-specific labels associated with ontology predicates (if applicable)
● “Classes” may be associated with an ontology to allow reasoning, or may just represent an “arbitrary” grouping of properties within the target metadata descriptor
● Metadata Descriptor-specific details are captured, e.g. this field is required by this target Metadata Descriptor
80. Other features of FAIR Profiles
● Do not require repository participation
● Provide a purpose-driven, potentially non-comprehensive “view” on a repository, of which there may be many, according to what the profile author needs to cross-query
● Profiles of any given repository facet are not required to be identical! e.g. a different profile might utilize a different controlled vocabulary over any given facet (e.g. a freetext facet)
● Anybody can define a profile (of course, the profile defined by the repository owner should be considered “canonical”... the rest are just purpose-built “best-guesses”)
● FAIR Profiles can/should be indexed and shared, to facilitate cross-repository interoperability and integration
● There is no (obvious) reason why a FAIR Profile could not be used to describe the DATA in the repository, not just the metadata...
81. Nothin’ ain’t worth nothin’, but it’s free! -- Kris Kristofferson
“All problems in computer science can be solved by another level of indirection ...But that usually will create another problem." -- David Wheeler
82. The FAIR Profile isn’t “a magic bean”!
It DOES NOT ACCOMPLISH SEMANTIC MAPPING between one field in one repository and a semantically-related field in another repository.
83. It does give us a standard way to identify, describe, and meta-link these fields, and a predictable place where a mapping mechanism could be injected.
84. ...we don’t inject it (yet!) because that would require invention of yet another “standard”, and we want to avoid that if possible!
85. There may be some in the audience who, like me, recognize that this problem is nearly identical to the problem faced by the WSDL -> SAWSDL community. I will be looking at their solution for guidance in the next phase of FAIR Profiles...
… so we still have problems, but at least they are now re-defined as problems for which there are solutions!
86. Skunkworks Participants
● Mark Wilkinson
● Michel Dumontier
● Barend Mons
● Tim Clark
● Jun Zhao
● Paolo Ciccarese
● Paul Groth
● Erik van Mulligen
● Luiz Olavo Bonino da Silva Santos
● Matthew Gamble
● Carole Goble
● Joël Kuiper
● Morris Swertz
● Erik Schultes
● Mercè Crosas
● Adrian Garcia
● Philip Durbin
● Jeffrey Grethe
● Katy Wolstencroft
● Sudeshna Das
● M. Emily Merrill
87. Post-presentation comments
We should look at ISO 11179: are we duplicating those efforts, or are we creating something that is an implementation of those efforts?
See also Dublin Core’s similar initiative.