The current movement toward openness and sharing of data is likely to have a profound effect on the speed of scientific research and the complexity of questions we can answer. However, a fundamental problem with currently available datasets (and their metadata) is heterogeneity in terms of implementation, organization, and representation.
To address this issue we have developed a generic scientific data model (SDM) to organize and annotate raw and processed data, and the associated metadata. This paper will present the current status of the SDM, implementation of the SDM in JSON-LD, and the associated scientific data model ontology (SDMO). Example usage of the SDM to store data from a variety of sources will be discussed along with future plans for the work.
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
1. A Generic Scientific Data Model and Ontology for Representation of Chemical Data
Stuart J. Chalk, Department of Chemistry
University of North Florida
schalk@unf.edu
CINF Paper 171 – 251st ACS Meeting Spring 2016
#ACSCINFDataSummit
2. Scientific Data Should be Open
Simple: Openness as the norm not the exception
Data made available, without restriction, so it's useful
Mechanisms/tools to make data available
Formats to allow others to get the data…
…but also so it's easy to use
Annotate the data to make it easy to find
Community driven promotion of and action on this issue
3. Research Notebook
Spectral Files (JCAMP-DX, proprietary)
Excel Spreadsheets
Personal Databases
Online Databases
PDF Files No!
RDF Yes!
Resource Description Framework
Options for Storing Data?
4. W3C Recommendation 2015
Specification - https://www.w3.org/TR/ldp/
Primer - https://www.w3.org/TR/ldp-primer/
The Linked Data
Platform
From: http://www.dataversity.net/introduction-linked-data-platform/
5. Use JavaScript Object Notation (JSON) as a text format for storing data and metadata so it can be converted to RDF
JSON for Linked Data (JSON-LD)
{
  "@context": {
    "name": "http://schema.org/name",
    "isAlive": "http://example.org/isAlive",
    "age": "http://example.org/age",
    "height": "http://schema.org/height",
    "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx"
  },
  "@id": "",
  "name": "Stuart Chalk",
  "isAlive": true,
  "age": 49,
  "height": 188.0
}
http://json-ld.org/playground/
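To make the "converted to RDF" step concrete, here is a minimal sketch in Python that turns the flat example above into subject/predicate/object triples using only the standard library. It is not a JSON-LD processor: it assumes a flat document whose keys are all defined as plain IRI strings in "@context", and the function name `to_triples` is hypothetical. A real conversion would use a conforming JSON-LD library.

```python
import json
from urllib.parse import urljoin

# The slide's example document, unchanged.
doc = json.loads("""
{
  "@context": {
    "name": "http://schema.org/name",
    "isAlive": "http://example.org/isAlive",
    "age": "http://example.org/age",
    "height": "http://schema.org/height",
    "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx"
  },
  "@id": "",
  "name": "Stuart Chalk",
  "isAlive": true,
  "age": 49,
  "height": 188.0
}
""")


def to_triples(document):
    """Simplified sketch: map each plain key to its IRI from @context.

    Assumes a flat document and a context made of simple term -> IRI strings.
    """
    context = document["@context"]
    # A relative "@id" is resolved against "@base"; an empty one is the base itself.
    subject = urljoin(context["@base"], document.get("@id", ""))
    triples = []
    for key, value in document.items():
        if key.startswith("@"):
            continue  # JSON-LD keywords are not properties
        triples.append((subject, context[key], value))
    return triples


triples = to_triples(doc)
for s, p, o in triples:
    print(s, p, repr(o))
```

Each printed line corresponds to one RDF statement about the same subject, which is the property that lets JSON-LD documents be merged into a larger graph.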
7. Nice idea, but because anything can be linked to anything else to form a graph of variable structure…
...difficult to search, hard to maintain
OK, use a regular relational database – Rigid Schema
Not good to try and make data fit the schema…
Use a hybrid approach!
Encode some structure in RDF using a framework...
...add data to the structured graph in an organized way
Store all Scientific Data in RDF?
8. Consider FAIR Principles (http://www.datafairport.org)
To be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource
To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A2. metadata are accessible, even when the data are no longer available
To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
To be Reusable:
R1. (meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
What Metadata is Important for Data?
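As an illustration of how a few of these principles could be checked mechanically, the sketch below flags sub-principles that a metadata record visibly fails. The FAIR principles do not prescribe field names, so the keys used here (`identifier`, `data_identifier`, `license`, `provenance`) and the function `fair_gaps` are assumptions made for the example only. Presence of a field is a necessary condition at best, not proof of compliance.

```python
def fair_gaps(record):
    """Return the FAIR sub-principles whose (hypothetical) field is missing or empty."""
    required = {
        "F1": "identifier",        # globally unique, persistent identifier
        "F3": "data_identifier",   # metadata names the data it describes
        "R1.1": "license",         # clear data usage license
        "R1.2": "provenance",      # detailed provenance
    }
    return [principle for principle, field in required.items()
            if not record.get(field)]


# Placeholder record; the values are illustrative only.
record = {"identifier": "https://example.org/dataset/1", "license": "CC-BY-4.0"}
gaps = fair_gaps(record)
print(gaps)
```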
9. Define scope as data obtained from an experiment, a series of experiments, or a project
Who did the work and where are they?
Metadata about the data “packet”
The raw data…
…its associated metadata (enough to properly contextualize the data)
Access rights
Published location
What Should a Data Model Represent?
10. General Framework
SciData – Scientific Data Model (SDM)
Overview – http://stuchalk.github.io/scidata/
GitHub Repo – https://github.com/stuchalk/scidata
11. General Framework - The Context
“@context” contains the context definition
Refers to other context files
Namespace abbreviations
Default vocabulary “@vocab”
“@id” links ontology term
“@type” states data type
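The pieces of a context listed above can be sketched as follows, built as a Python structure and serialized to JSON. Every IRI under example.org, the `ex` prefix, and the `temperature` term are hypothetical placeholders, not the actual SciData context; only the JSON-LD keywords (`@context`, `@vocab`, `@id`, `@type`) come from the slide.

```python
import json

context = [
    # Reference to another context file (hypothetical URL).
    "https://example.org/contexts/scidata.jsonld",
    {
        # Default vocabulary applied to terms with no other definition.
        "@vocab": "https://example.org/vocab#",
        # Namespace abbreviations.
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "ex": "https://example.org/ontology#",
        # "@id" links the term to an ontology term; "@type" states its data type.
        "temperature": {"@id": "ex:temperature", "@type": "xsd:float"},
    },
]

doc = {
    "@context": context,
    "@id": "https://example.org/data/1",
    "temperature": "298.15",
}

serialized = json.dumps(doc, indent=2)
print(serialized)
```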
14. Example Data - Literature Value
“scope” provides internal link to “@id” value
Each value of a name-value pair has a default data type that can be overridden by expanding the value to a JSON object and adding “@value” and “@type”
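The override described above can be sketched in a few lines of Python. The helper `with_type` and the keys `property` and `value` are hypothetical names chosen for the example; the `@value`/`@type` value object is standard JSON-LD.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def with_type(value, datatype):
    """Expand a plain value into a JSON-LD value object with an explicit datatype."""
    return {"@value": value, "@type": datatype}


# Before: a plain string, which takes whatever default type the context implies.
entry = {"property": "melting point", "value": "80.2"}  # placeholder value

# After: the same value, with its data type stated explicitly.
entry["value"] = with_type(entry["value"], XSD + "decimal")
print(entry)
```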
15. Example Data - NMR Spectrum
“dataseries” are JSON arrays of data on one axis
Bring them together with “datagroup” and we can represent a spectrum
“parameter” is a generic container for data or metadata
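A rough sketch of the idea follows: two one-axis series are grouped so that paired values form the points of a spectrum. Only the names "dataseries", "datagroup", and "parameter" come from the slide. The inner keys (`label`, `unit`, `values`, `name`, `value`), the helper `as_points`, and all numbers are placeholders, and the real SciData layout may differ.

```python
# Each dataseries holds the data for a single axis (placeholder numbers).
shift = {"label": "chemical shift", "unit": "ppm", "values": [0.0, 1.0, 2.0]}
intensity = {"label": "intensity", "unit": "arbitrary", "values": [10, 20, 30]}

# A datagroup brings the series together; parameter carries extra data or metadata.
spectrum = {
    "datagroup": {
        "dataseries": [shift, intensity],
        "parameter": [{"name": "observe frequency", "value": 400, "unit": "MHz"}],
    }
}


def as_points(datagroup):
    """Pair up the values of all series in a datagroup, one tuple per data point."""
    series = [s["values"] for s in datagroup["dataseries"]]
    if len({len(values) for values in series}) > 1:
        raise ValueError("all dataseries in a datagroup must have the same length")
    return list(zip(*series))


points = as_points(spectrum["datagroup"])
print(points)
```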
16. Example Data – CC Calculation
“datagroup”s are structures to aggregate data at any level
“datagroup”s can be infinitely nested
“uid” is optional and can be used to uniquely define any piece of data
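Because datagroups can nest to any depth, code that consumes them is naturally recursive. The sketch below walks a nested structure and indexes every node that carries a "uid". It assumes child groups are stored as a list under a "datagroup" key, which is a guess at the layout, and `index_by_uid` and the sample uids are hypothetical.

```python
def index_by_uid(node, index=None):
    """Recursively collect every node with a "uid", rejecting duplicates."""
    if index is None:
        index = {}
    uid = node.get("uid")  # "uid" is optional, so it may be absent
    if uid is not None:
        if uid in index:
            raise ValueError(f"duplicate uid: {uid}")
        index[uid] = node
    for child in node.get("datagroup", []):
        index_by_uid(child, index)
    return index


# Placeholder structure: a calculation containing two nested groups.
calculation = {
    "uid": "calc:1",
    "datagroup": [
        {"uid": "calc:1:geometry", "datagroup": [{"uid": "calc:1:geometry:atoms"}]},
        {"datagroup": [{"uid": "calc:1:energies"}]},  # parent has no uid
    ],
}

index = index_by_uid(calculation)
print(sorted(index))
```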
17. The SDM Ontology
SciData Ontology – Scientific Data Model Ontology (SDMO)
OWL File – https://github.com/stuchalk/scidata/blob/master/ontology/scidata.owl
18. Get community feedback, refine/extend/standardize
Generate large corpus of disparate data in JSON-LD, ingest into triple store and query (SPARQL)
Evaluate inferencing on the triple store data
Push adoption through collaboration
Run hackathons to build developer implementations
Develop Electronic Laboratory Notebook (ELN) to generate data in JSON-LD
Get feedback from data community, RDA - https://rd-alliance.org/
Test using the NDS - http://www.nationaldataservice.org/
Future Work
19. Pain Points
Challenges
Opportunities
Normalization
Tools to generate metadata automatically
User Perspective
Gaps in Data
Gaps in Ontology Coverage
Pain Points?
Gather stakeholders to work on standards
Broad knowledge domain representation
IUPAC, RDA Chemistry Research Data IG
Priorities?
Data annotation and representation
Data exchange (repo <-> repo, user <-> user)
Structure representation (chiral centers)
Curation infrastructures
Domain vocabulary translations
Units of measure
20. Reality Check
“to err is human; to forgive, divine”
Alexander Pope
“to err is human; to really screw things up requires a computer”
Paul Ehrlich
“to err is human; all hell will break loose if you don’t provide accurate semantics to a computer”
Stuart Chalk