Martone grethe

Methodologies for Long-Tail Data
Sharing: What Have We Learned?
Maryann E. Martone, Ph.D.
University of California, San Diego
and
Hypothesis
Jeffrey S. Grethe, Ph.D.
University of California, San Diego

[Figure: resource types tracked by year: Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource]
NIF is an initiative of the NIH Blueprint consortium of institutes
– NIF has been tracking and cataloging the biomedical resource landscape since 2008

The current “Addictome”
Query: Addiction
NIF searches across:
• Resource Registry (13,000+)
• >200 deeply integrated data sources (>800 million records)
• Literature

e-Science Ecosystem
The digital world runs on globally unique and persistent identifiers (PIDs); a PID serves as a “key” for identifying the same entity across different contexts.
[Diagram labels: PID, ORCID, RRID, DOI; People, Research resources, Data, Protocols, Concepts, Ontology; Metadata standards, Minimal Information Models, CDE; Repositories and Registries; Aggregator (e.g. NIF, Monarch, NIH Data Discovery Index); Translation; Non-digital]
eScience goal: make data Findable, Accessible, Interoperable, Re-usable (FAIR) for both humans and machines

Resource Identification Initiative: Supplying unique
identifiers for key research resources
“The following antibodies were used for immunoblotting: β-actin mAb (1:10,000 dilution, Sigma-Aldrich)…”
vs.
“The following antibodies were used for immunoblotting: β-actin mAb (1:10,000 dilution, Sigma-Aldrich, RRID:AB_262137)…”
https://scicrunch.org/resolver/RRID:AB_262137
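Because an RRID is a machine-readable token, it can be pulled out of a methods sentence and turned into a resolver link automatically. The following is a minimal sketch, not the initiative's own tooling: the regular expression is an assumed, simplified pattern based only on the example above, and the function names are hypothetical. The resolver prefix is the one shown on the slide.

```python
import re

# Assumed, simplified pattern: "RRID:" + letters + "_" + alphanumerics,
# modeled on the slide's example RRID:AB_262137. Real RRIDs may vary.
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+_[A-Za-z0-9]+)")

# Resolver prefix as it appears on the slide.
RESOLVER_BASE = "https://scicrunch.org/resolver/"


def find_rrids(text):
    """Return the RRIDs cited in a passage, in order, without duplicates."""
    found = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)
        if rrid not in found:
            found.append(rrid)
    return found


def resolver_url(rrid):
    """Build the resolver link for one RRID (no network access is made)."""
    return RESOLVER_BASE + rrid


methods = "(1:10,000 dilution, Sigma-Aldrich, RRID:AB_262137)"
for rrid in find_rrids(methods):
    print(rrid, "->", resolver_url(rrid))
```

The first version of the sentence, which names only the vendor, yields nothing to extract; that is the practical difference the comparison illustrates.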

Minimal Information Standards
http://precedings.nature.com/documents/1720/version/1
http://precedings.nature.com/documents/1720/version/1/files/npre20081720-1.pdf
A set of guidelines for reporting data that
ensures the data can be easily verified,
analysed and clearly interpreted by the
wider scientific community. The
recommendations also provide a foundation
for structured databases, public repositories
and development of data analysis tools.
https://en.wikipedia.org/wiki/Minimum_Information_Standards
MINI: Minimum Information about a Neuroscience
Investigation
[Diagram: MIM → CDE 1, CDE 2, … CDE N → Value Set]

Common Data Elements
https://cde.nlm.nih.gov/home
http://www.nlm.nih.gov/cde/
A data element that is common
to multiple datasets and is used
to improve data quality and
promote data sharing. CDEs
usually describe the following
data element properties: Name,
Definition, Instructions,
Provenance, Value Set.
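The five properties listed above map naturally onto a small record type. The sketch below is illustrative only: the class, the helper `missing_elements`, and the two example elements are hypothetical and are not taken from the NLM CDE repository. It shows how a set of CDEs could be used to check whether a dataset record reports every required element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """One CDE with the five properties named on the slide."""
    name: str
    definition: str
    instructions: str
    provenance: str
    value_set: tuple  # permissible values; empty tuple means unconstrained here


def missing_elements(record, required):
    """Names of required CDEs that a record (a dict) leaves out or leaves empty."""
    return [cde.name for cde in required
            if record.get(cde.name) in (None, "")]


# Hypothetical elements, for illustration only.
SPECIES = CommonDataElement(
    name="species",
    definition="Species of the experimental subject",
    instructions="Use the binomial name",
    provenance="example lab convention",
    value_set=("Mus musculus", "Rattus norvegicus"),
)
BRAIN_REGION = CommonDataElement(
    name="brain_region",
    definition="Anatomical region that was sampled",
    instructions="Use an ontology label",
    provenance="example lab convention",
    value_set=(),
)

record = {"species": "Mus musculus"}
print(missing_elements(record, [SPECIES, BRAIN_REGION]))
```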

Value Sets
The set of possible values or
responses. A Value Set often
includes concepts from established
Vocabularies, Ontologies or Data
Standards. A value set may also
include a range of permissible values
and indicate the required units. For a
survey question, the value set may
be a list of possible responses.
http://neurolex.org/wiki/Category:Hippocampus_CA1_pyramidal_cell
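A value set as described above is either a list of permitted responses or a numeric range with required units. A minimal sketch of both cases follows. The `ValueSet` class and the age range are assumptions made for illustration; only the CA1 pyramidal cell label comes from the NeuroLex link above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ValueSet:
    """Either an enumerated list of permitted values or a numeric range with units."""
    permitted: Tuple[str, ...] = ()
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    units: Optional[str] = None

    def accepts(self, value, units=None):
        # Enumerated case: the value must be one of the listed responses.
        if self.permitted:
            return value in self.permitted
        # Numeric case: units must match when the value set requires them.
        if self.units is not None and units != self.units:
            return False
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


# Enumerated value set; the second label is a hypothetical companion entry.
CELL_TYPE = ValueSet(permitted=("Hippocampus CA1 pyramidal cell",
                                "Hippocampus CA3 pyramidal cell"))
# Hypothetical numeric value set: subject age in days.
AGE = ValueSet(minimum=0, maximum=1000, units="days")

print(CELL_TYPE.accepts("Hippocampus CA1 pyramidal cell"))
print(AGE.accepts(30, units="weeks"))
```

Drawing the permitted labels from an ontology rather than typing them by hand is what lets two datasets that use the same CDE be compared directly.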

Neuroscience Information Framework
“a tool for analyzing and structuring information”
“a reduction in uncertainty”
• Ontologies are the major way that NIF searches for and organizes information
• Aggregate of community ontologies, e.g., Gene Ontology, ChEBI, Protein Ontology
• Still significant gaps for behavioral and physiological concepts and techniques
• Available as services through NIF so they can be built into applications
[Diagram: NIFSTD modules: Organism, Molecule, Macromolecule, Gene, Molecule Descriptors, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure, NS Function, Subcellular structure, Investigation, Protocols, Reagent, Techniques]

Concept-based query
Remove synonyms
Ontologies and their relationships let us probe the data space for related concepts
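Concept-based query can be pictured as expanding a search term with its synonyms and with the concepts related to it. The sketch below uses a toy, hand-written ontology fragment (not NIFSTD content) and a hypothetical `expand_query` function. It follows only a "has part" relationship, whereas a real ontology service would offer many more.

```python
from collections import deque

# Toy ontology fragment written for this example; not taken from NIFSTD.
ONTOLOGY = {
    "hippocampus": {"synonyms": ["cornu ammonis"], "parts": ["CA1", "CA3"]},
    "CA1": {"synonyms": ["CA1 field"], "parts": []},
    "CA3": {"synonyms": ["CA3 field"], "parts": []},
}


def expand_query(term, ontology, include_synonyms=True):
    """Expand a term with its synonyms and, transitively, its parts."""
    terms = []
    visited = set()
    queue = deque([term])
    while queue:
        concept = queue.popleft()
        if concept in visited:  # guards against cycles in the ontology
            continue
        visited.add(concept)
        entry = ontology.get(concept, {})
        labels = [concept]
        if include_synonyms:
            labels += entry.get("synonyms", [])
        for label in labels:
            if label not in terms:
                terms.append(label)
        queue.extend(entry.get("parts", []))
    return terms


print(expand_query("hippocampus", ONTOLOGY))
print(expand_query("hippocampus", ONTOLOGY, include_synonyms=False))
```

A search for "hippocampus" then also reaches records annotated only with a subregion, which is the sense in which relationships let a query probe the data space.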

What have we learned?
• The landscape is vibrant, dynamic and growing, but also littered
with abandoned and unrealized projects
• Data belongs in a data repository, not on your lab server
• People are important in this endeavor: Leaders, curators,
community engagement specialists
• Data and ontology resources become interesting when they
are comprehensive: populate!!!
• Assume that you will be resource limited and plan
accordingly: time, money, personnel
• Cost-benefit analysis; what to do now vs later
• Technology will improve
• Don’t start from square one: resources exist to help; help
support them

Dimensions of FAIR data sharing
• Discoverability
– Data can be found
– Data set has an identifier and links are stable
• Accessibility
– Data can be accessed programmatically
– Access rights are clear
• Assessability
– Provenance is known
– Reliability can be determined
• Understandability
– The data can be understood
• Usability
– The data are actionable
– Data are not in a proprietary format
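The dimensions above can be read as a checklist applied to a dataset's description. The sketch below is one hypothetical encoding: the field names and the one-check-per-dimension simplification are assumptions, not part of the cited rules. It reports which dimensions a description leaves unmet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetDescription:
    """Hypothetical metadata fields, one or two per dimension."""
    identifier: Optional[str] = None     # persistent identifier for the data set
    access_url: Optional[str] = None     # where the data can be fetched by a program
    access_rights: Optional[str] = None  # license or access statement
    provenance: Optional[str] = None     # who produced the data and how
    documentation: Optional[str] = None  # description needed to understand the data
    open_format: bool = False            # True if the file format is not proprietary


# One simplified check per dimension, in the order given on the slide.
CHECKS = {
    "Discoverability": lambda d: bool(d.identifier),
    "Accessibility": lambda d: bool(d.access_url) and bool(d.access_rights),
    "Assessability": lambda d: bool(d.provenance),
    "Understandability": lambda d: bool(d.documentation),
    "Usability": lambda d: d.open_format,
}


def unmet_dimensions(description):
    """Return the dimensions whose check fails for this description."""
    return [name for name, check in CHECKS.items() if not check(description)]


example = DatasetDescription(identifier="doi:10.0000/example",
                             access_url="https://example.org/data.csv",
                             open_format=True)
print(unmet_dimensions(example))
```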
Goodman, A. et al. Ten simple rules for the care and feeding of scientific data. PLoS Comput Biol 10, e1003542, doi:10.1371/journal.pcbi.1003542 (2014)
Science as an open enterprise, Royal Society: https://royalsociety.org/policy/projects/science-public-enterprise/Report/

FORCE11: Future of Research Communications and
e-Scholarship
• Resource Identification Initiative: https://www.force11.org/group/resource-identification-initiative
• FAIR Data Guiding principles: https://www.force11.org/group/fairgroup/fairprinciples
• Data Citation Principles: https://www.force11.org/group/joint-declaration-data-citation-principles-final
• On creating machine-readable data citations: https://peerj.com/articles/cs-1/
• 10 Simple rules for design, provision, and reuse of persistent identifiers for life science data: https://zenodo.org/record/18003#.VeOxxLQjvyA
FORCE11.org: Grass roots organization dedicated to transforming scholarship through

Mapping the data landscape: Anatomical framework
~800 million records across ~200 databases or views
[Figure: number of data sources mapped to Forebrain, Midbrain, and Hindbrain; legend bins: 0, 1-10, 11-100, >101]
