A Generic Scientific Data Model and Ontology for Representation of Chemical Data

Stuart J. Chalk, Department of Chemistry
University of North Florida
schalk@unf.edu
CINF Paper 171 – 251st ACS Meeting Spring 2016
#ACSCINFDataSummit

Scientific Data Should be Open
 Simple: Openness as the norm not the exception
 Data made available, without restriction, so it's useful
 Mechanisms/tools to make data available
 Formats to allow others to get the data…
 …but also so it's easy to use
 Annotate the data to make it easy to find
 Community driven promotion of and action on this issue

 Research Notebook
 Spectral Files (JCAMP-DX, proprietary)
 Excel Spreadsheets
 Personal Databases
 Online Databases
 PDF Files No!
 RDF Yes!
Resource Description Framework
Options for Storing Data?

 W3C Recommendation 2015
Specification - https://www.w3.org/TR/ldp/
Primer - https://www.w3.org/TR/ldp-primer/
The Linked Data
Platform
From: http://www.dataversity.net/introduction-linked-data-platform/

 Use JavaScript Object Notation (JSON) as a text format for storing data and metadata so it can be converted to RDF
JSON for Linked Data (JSON-LD)
{
  "@context": {
    "name": "http://schema.org/name",
    "isAlive": "http://example.org/isAlive",
    "age": "http://example.org/age",
    "height": "http://schema.org/height",
    "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx"
  },
  "@id": "",
  "name": "Stuart Chalk",
  "isAlive": true,
  "age": 49,
  "height": 188.0
}
http://json-ld.org/playground/

JSON for Linked Data (JSON-LD)
<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/age> "49"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/isAlive> "true"^^<http://www.w3.org/2001/XMLSchema#boolean> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/height> "188"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/name> "Stuart Chalk" .
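
The mapping from the JSON-LD document to these triples can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a real JSON-LD processor: it assumes a flat document whose "@context" maps each term directly to an IRI, and the function names `literal` and `to_ntriples` are made up for this example. A production workflow would use a full JSON-LD library instead.

```python
import json
from urllib.parse import urljoin

XSD = "http://www.w3.org/2001/XMLSchema#"

# The JSON-LD document from the slide.
DOC = json.loads("""
{
  "@context": {
    "name": "http://schema.org/name",
    "isAlive": "http://example.org/isAlive",
    "age": "http://example.org/age",
    "height": "http://schema.org/height",
    "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx"
  },
  "@id": "",
  "name": "Stuart Chalk",
  "isAlive": true,
  "age": 49,
  "height": 188.0
}
""")


def literal(value):
    """Serialize a JSON scalar as an N-Triples literal (simplified)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int) or (isinstance(value, float) and value.is_integer()):
        # JSON-LD treats a number with no fractional part as xsd:integer
        return f'"{int(value)}"^^<{XSD}integer>'
    if isinstance(value, float):
        # Simplification: real processors emit a canonical xsd:double form
        return f'"{value!r}"^^<{XSD}double>'
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(doc):
    """Turn a flat JSON-LD node into a sorted list of N-Triples lines."""
    context = doc["@context"]
    # An empty "@id" resolves to the "@base" IRI itself.
    subject = urljoin(context.get("@base", ""), doc.get("@id", ""))
    lines = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # skip keywords such as @context and @id
        lines.append(f"<{subject}> <{context[key]}> {literal(value)} .")
    return sorted(lines)


for line in to_ntriples(DOC):
    print(line)
```

Note that the JSON value 188.0 comes out as the integer "188", which matches the triples shown on the slide.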

 Nice idea, but because anything can be linked to anything else to form a graph of variable structure…
 ...difficult to search, hard to maintain
 OK, use a regular relational database? Rigid schema:
not good to try to make data fit the schema…
 Use a hybrid approach!
 Encode some structure in RDF using a framework...
 ...add data to the structured graph in an organized way
Store all Scientific Data in RDF?

 Consider FAIR Principles (http://www.datafairport.org)
 To be Findable:
 F1. (meta)data are assigned a globally unique and persistent identifier
 F2. data are described with rich metadata (defined by R1 below)
 F3. metadata clearly and explicitly include the identifier of the data it describes
 F4. (meta)data are registered or indexed in a searchable resource
 To be Accessible:
 A1. (meta)data are retrievable by their identifier using a standardized communications protocol
 A2. metadata are accessible, even when the data are no longer available
 To be Interoperable:
 I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
 I2. (meta)data use vocabularies that follow FAIR principles
 I3. (meta)data include qualified references to other (meta)data
 To be Reusable:
 R1. (meta)data are richly described with a plurality of accurate and relevant attributes
 R1.1. (meta)data are released with a clear and accessible data usage license
 R1.2. (meta)data are associated with detailed provenance
 R1.3. (meta)data meet domain-relevant community standards
What Metadata is Important for Data?
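
A few of these principles translate into simple presence checks on a metadata record. The sketch below is only an illustration: the field names (`identifier`, `attributes`, `license`, `provenance`) and the function `fair_gaps` are hypothetical, not part of any FAIR specification, and a presence check says nothing about the quality of the metadata.

```python
def fair_gaps(record):
    """List the FAIR items (a small subset) that a metadata record is missing.

    The record keys used here are hypothetical names chosen for this sketch.
    """
    checks = {
        "F1 identifier": bool(record.get("identifier")),
        "F2/R1 descriptive attributes": len(record.get("attributes", {})) > 0,
        "R1.1 license": bool(record.get("license")),
        "R1.2 provenance": bool(record.get("provenance")),
    }
    return [name for name, ok in checks.items() if not ok]


# Illustrative record: it has an identifier and one attribute,
# but an empty license and no provenance.
record = {
    "identifier": "https://example.org/dataset/1",
    "attributes": {"technique": "NMR"},
    "license": "",
}
print(fair_gaps(record))  # prints ['R1.1 license', 'R1.2 provenance']
```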

 Define scope as data obtained from an experiment, a series of experiments, a project
 Who did the work and where are they?
 Metadata about the data “packet”
 The raw data…
 …its associated metadata (enough to properly contextualize the data)
 Access rights
 Published location
What Should a Data Model Represent?

General Framework
 SciData – Scientific Data Model (SDM)
 Overview – http://stuchalk.github.io/scidata/
 GitHub Repo – https://github.com/stuchalk/scidata

General Framework - The Context
 “@context” contains the context definition
 Refers to other context files
 Namespace abbreviations
 Default vocabulary “@vocab”
 “@id” links ontology term
 “@type” states data type
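
How these context features work together can be sketched with a toy term resolver. The context below is illustrative only: the "@vocab" IRI is a placeholder and not the actual SciData vocabulary, and `expand` is a simplified stand-in for the term expansion a real JSON-LD processor performs.

```python
# Illustrative context; the "@vocab" IRI is a placeholder, not SciData's own.
CONTEXT = {
    "@vocab": "https://example.org/scidata#",        # default vocabulary
    "xsd": "http://www.w3.org/2001/XMLSchema#",      # namespace abbreviation
    "dc": "http://purl.org/dc/terms/",               # namespace abbreviation
    # "@id" links the term to an ontology IRI, "@type" states its data type
    "title": {"@id": "dc:title", "@type": "xsd:string"},
}


def expand(term, context):
    """Return (IRI, datatype IRI or None) for a term. Simplified sketch."""
    definition = context.get(term)
    if isinstance(definition, dict):
        iri, _ = expand(definition["@id"], context)
        datatype = definition.get("@type")
        if datatype is not None:
            datatype, _ = expand(datatype, context)
        return iri, datatype
    if isinstance(definition, str):
        return definition, None
    prefix, sep, suffix = term.partition(":")
    if sep and isinstance(context.get(prefix), str) and not suffix.startswith("//"):
        return context[prefix] + suffix, None  # compact IRI such as "dc:title"
    if sep:
        return term, None  # already an absolute IRI
    return context.get("@vocab", "") + term, None  # fall back to "@vocab"


print(expand("title", CONTEXT))
print(expand("temperature", CONTEXT))
```

A term with no entry in the context, such as "temperature" here, falls back to the default vocabulary, which is what keeps the data files short.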

Methodology, System, and Dataset

Example Data - Literature Value
 “scope” provides an internal link to an “@id” value
 Each value of a name-value pair has a default data type that can be overridden by expanding the value to a JSON object and adding “@value” and “@type”
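
The two ideas on this slide, the internal “scope” link and the “@value”/“@type” override, can be sketched as follows. The key names other than “@id”, “@value”, “@type”, and “scope” are hypothetical, the values are illustrative placeholders, and the helpers `with_type` and `plain` are invented for this example.

```python
def with_type(value, datatype):
    """Expand a bare value into a JSON object that overrides the default type."""
    return {"@value": value, "@type": datatype}


def plain(value):
    """Return the bare value whether or not it was expanded."""
    if isinstance(value, dict) and "@value" in value:
        return value["@value"]
    return value


# Illustrative nodes; names and numbers are placeholders for this sketch.
compound = {"@id": "compound/1/", "name": "benzoic acid"}
literature_value = {
    "@id": "datum/1/",
    "scope": "compound/1/",                        # internal link to an "@id"
    "property": "melting point",                   # default data type applies
    "number": with_type("122.4", "xsd:decimal"),   # explicit type override
}

# Resolve the "scope" link by indexing nodes on their "@id".
nodes = {node["@id"]: node for node in (compound, literature_value)}
target = nodes[literature_value["scope"]]
print(target["name"], plain(literature_value["property"]), plain(literature_value["number"]))
```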

Example Data - NMR Spectrum
 “dataseries” are JSON arrays of data on one axis
 Bring them together with “datagroup” and we can represent a spectrum
 “parameter” is a generic container for data or metadata
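
A rough sketch of how two “dataseries” combine through a “datagroup” into a spectrum is shown below. The exact key layout (`label`, `axis`, `values`, and the list of references inside the datagroup) is an assumption made for illustration and may differ from the SciData files; the numbers are made up.

```python
# Each dataseries holds the values for one axis (illustrative numbers).
dataseries = [
    {"@id": "dataseries/1/", "label": "Chemical shift", "axis": "x",
     "values": [0.0, 1.0, 2.0, 3.0]},
    {"@id": "dataseries/2/", "label": "Intensity", "axis": "y",
     "values": [0.1, 5.2, 0.3, 2.4]},
]

# The datagroup ties the two series together by referencing their "@id"s.
datagroup = {
    "@id": "datagroup/1/",
    "title": "1H NMR spectrum",
    "dataseries": ["dataseries/1/", "dataseries/2/"],
}


def spectrum_points(group, all_series):
    """Pair up the x and y series referenced by a datagroup."""
    by_id = {series["@id"]: series for series in all_series}
    members = [by_id[ref] for ref in group["dataseries"]]
    axes = {series["axis"]: series["values"] for series in members}
    if len(axes["x"]) != len(axes["y"]):
        raise ValueError("x and y series must have the same length")
    return list(zip(axes["x"], axes["y"]))


points = spectrum_points(datagroup, dataseries)
print(points)
```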

Example Data – CC Calculation
 “datagroup”s are structures to aggregate data at any level
 “datagroup”s can be nested to any depth
 “uid” is optional and can be used to uniquely define any piece of data
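
Because datagroups nest, a “uid” is a convenient handle for pulling one piece of data out of a deep structure. The sketch below assumes a hypothetical layout (nested `datagroup` lists with `datapoint` entries) and made-up uids and values; `find_by_uid` is an illustrative helper, not part of SciData.

```python
def find_by_uid(node, uid):
    """Depth-first search for the first dict whose "uid" matches."""
    if isinstance(node, dict):
        if node.get("uid") == uid:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # scalars cannot carry a uid
    for child in children:
        found = find_by_uid(child, uid)
        if found is not None:
            return found
    return None


# Illustrative nested datagroups for a calculation result (made-up values).
calculation = {
    "uid": "calc:1",
    "datagroup": [
        {
            "uid": "calc:1:geometry",
            "datagroup": [
                {"uid": "calc:1:geometry:atom1",
                 "datapoint": {"uid": "calc:1:geometry:atom1:x", "value": 0.0}},
                {"uid": "calc:1:geometry:atom2",
                 "datapoint": {"uid": "calc:1:geometry:atom2:x", "value": 0.96}},
            ],
        }
    ],
}

hit = find_by_uid(calculation, "calc:1:geometry:atom2:x")
print(hit["value"])  # prints 0.96
```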

The SDM Ontology
 SciData Ontology – Scientific Data Model Ontology (SDMO)
 OWL File – https://github.com/stuchalk/scidata/blob/master/ontology/scidata.owl

 Get community feedback, refine/extend/standardize
 Generate large corpus of disparate data in JSON-LD, ingest into triple store and query (SPARQL)
 Evaluate inferencing on the triple store data
 Push adoption through collaboration
 Run hackathons to build developer implementations
 Develop Electronic Laboratory Notebook (ELN) to generate data in JSON-LD
 Get feedback from data community, RDA - https://rd-alliance.org/
 Test using the NDS - http://www.nationaldataservice.org/
Future Work

 Pain Points
 Challenges
 Opportunities
 Normalization
 Tools to generate metadata automatically
 User Perspective
 Gaps in Data
 Gaps in Ontology Coverage
Pain Points?
 Gather stakeholders to work on standards
 Broad knowledge domain representation
 i-UPAC, RDA Chemistry Research Data IG
 Priorities?
 Data annotation and representation
 Data exchange (repo <-> repo, user <-> user)
 Structure representation (chiral centers)
 Curation infrastructures
 Domain vocabulary translations
 Units of measure

Reality Check
“to err is human; to forgive, divine”
Alexander Pope
“to err is human; to really screw things up requires a computer”
Paul Ehrlich
“to err is human; all hell will break loose if you don’t provide accurate semantics to a computer”
Stuart Chalk

 schalk@unf.edu
 Phone: 904-620-1938
 Skype: stuartchalk
 LinkedIn/SlideShare: https://www.linkedin.com/in/stuchalk
 ORCID: http://orcid.org/0000-0002-0703-7776
 ResearcherID: http://www.researcherid.com/rid/D-8577-2013
Questions?
