Seminar presentation for the PMB Department, UC Berkeley, for Love Data Week. The subject is how to prepare publications and associated data sets for maximum reuse.
The Center for Expanded Data Annotation and Retrieval (CEDAR) has developed a suite of tools and services that allow scientists to create and publish metadata describing scientific experiments. Using these tools and services—referred to collectively as the CEDAR Workbench—scientists can collaboratively author metadata and submit them to public repositories. A key focus of our software is semantically enriching metadata with ontology terms. The system combines emerging technologies, such as JSON-LD and graph databases, with modern software development technologies, such as microservices and container platforms. The result is a suite of user-friendly, Web-based tools and REST APIs that provide a versatile end-to-end solution to the problems of metadata authoring and management. This talk presents the architecture of the CEDAR Workbench and focuses on the technology choices made to construct an easily usable, open system that allows users to create and publish semantically enriched metadata in standard Web formats.
The metadata about scientific experiments are crucial for finding, reproducing, and reusing the data that the metadata describe. We present a study of the quality of the metadata stored in BioSample—a repository of metadata about samples used in biomedical experiments managed by the U.S. National Center for Biotechnology Information (NCBI). We tested whether 6.6 million BioSample metadata records are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the analyzed metadata. The BioSample metadata field names and their values are not standardized or controlled—15% of the metadata fields use field names not specified in the BioSample data dictionary. Only 9 out of 452 BioSample-specified fields ordinarily require ontology terms as values, and the quality of these controlled fields is better than that of uncontrolled ones, as even simple binary or numeric fields are often populated with inadequate values of different data types (e.g., only 27% of Boolean values are valid). Overall, the metadata in BioSample reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The aberrancies in the metadata are likely to impede search and secondary use of the associated datasets.
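The validation problem described above can be made concrete with a small sketch. The field names and typing rules below are invented for illustration; they are not the actual BioSample data dictionary:

```python
# Sketch of the kind of check the study performs: validate metadata field
# values against a declared data dictionary. The fields and rules here
# are illustrative, not the actual BioSample dictionary.

VALID_BOOLEANS = {"true", "false", "yes", "no"}

data_dictionary = {
    "collection_date": "date",   # expected ISO 8601, not checked here
    "host_age": "integer",
    "is_pooled": "boolean",
}

def validate_record(record):
    """Return a list of (field, problem) tuples for one metadata record."""
    problems = []
    for field, value in record.items():
        expected = data_dictionary.get(field)
        if expected is None:
            problems.append((field, "field not in data dictionary"))
        elif expected == "boolean" and value.strip().lower() not in VALID_BOOLEANS:
            problems.append((field, "invalid boolean value"))
        elif expected == "integer":
            try:
                int(value)
            except ValueError:
                problems.append((field, "invalid integer value"))
    return problems

record = {"is_pooled": "N/A", "host_age": "five", "sample_color": "red"}
```

Running `validate_record` on the sample record flags an undeclared field name, a malformed integer, and an invalid Boolean, the same classes of anomaly the study reports at scale.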
The availability of high-quality metadata is key to facilitating discovery in the large variety of scientific datasets that are increasingly becoming publicly available. However, despite the recent focus on metadata, the diversity of metadata representation formats and the poor support for semantic markup typically result in metadata that are of poor quality. There is a pressing need for a metadata representation format that provides strong interoperation capabilities together with robust semantic underpinnings. In this talk, we describe such a format, together with open-source Web-based tools that support the acquisition, search, and management of metadata. We outline an initial evaluation using metadata from a variety of biomedical repositories.
The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed, the CEDAR Workbench, is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, REST-based environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration in scientific applications and reusability on the Web. Users can leverage our APIs for validating and submitting metadata to external repositories. The CEDAR Workbench is freely available and open-source.
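As a rough illustration of what ontology-enriched metadata in JSON-LD looks like (the template field and context IRIs here are hypothetical, not CEDAR's actual schema):

```python
import json

# A minimal JSON-LD metadata record in which a field value is paired with
# an ontology term IRI and a human-readable label. The context and field
# names are illustrative, not CEDAR's actual template schema.
metadata = {
    "@context": {
        "organism": "https://example.org/schema/organism",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    },
    "organism": {
        "@id": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
        "rdfs:label": "Homo sapiens",
    },
}

print(json.dumps(metadata, indent=2))
```

Because the value carries a resolvable IRI rather than only free text, downstream tools can interpret it unambiguously as an ontology term.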
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe... (Carole Goble)
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management, retention, and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and code for pre- and post-publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) have been an effective rallying cry for EU and USA research infrastructures. The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (de.NBI, the German Virtual Liver Network, UK SynBio centres) and PIs' labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Systems Biology, highlighting the challenges of, and approaches to, sharing, credit, citation, and asset infrastructures in practice. I'll also highlight recent experiments in encouraging sharing through behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
Connecting the dots: drug information and Linked Data (Tomasz Adamusiak)
Presented as part of the AMIA 2014 Knowledge Representation + Semantics and Clinical Information Systems Working Groups Pre-Symposium "Drug Terminology Standards: Meaningful Use and Better Knowledge", November 16, 2014, Washington, DC.
What is Reproducibility? The R* brouhaha (and how Research Objects can help) (Carole Goble)
Presented at the 1st International Workshop on Reproducible Open Science @ TPDL, 9 September 2016, Hannover, Germany.
http://repscience2016.research-infrastructures.eu/
Facilitating semantic alignment of EMBL-EBI services using ontologies and semantic web technology. Presentation at the BioHackathon Symposium 2016, Japan.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article:
Carole Goble, "Better Software, Better Research," IEEE Internet Computing, vol. 18, no. 5 (Sept.-Oct. 2014), pp. 4-8, IEEE Computer Society.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
With its focus on investigating the basis for the sustained existence of living systems, modern biology has always been a fertile, if challenging, domain for formal knowledge representation and automated reasoning. With thousands of databases and hundreds of ontologies now available, there is a salient opportunity to integrate these for discovery. In this talk, I will discuss our efforts to build a rich foundational network of ontology-annotated linked data, develop methods to intelligently retrieve content of interest, uncover significant biological associations, and pursue new avenues for drug discovery. As the portfolio of Semantic Web technologies continues to mature in terms of functionality, scalability, and an understanding of how to maximize their value, researchers will be strategically poised to pursue increasingly sophisticated KR projects aimed at improving our overall understanding of human health and disease.
Bio: Dr. Michel Dumontier is an Associate Professor of Medicine (Biomedical Informatics) at Stanford University. His research aims to find new treatments for rare and complex diseases. His research interests lie in the publication, integration, and discovery of scientific knowledge. Dr. Dumontier serves as a co-chair for the World Wide Web Consortium Semantic Web in Health Care and Life Sciences Interest Group (W3C HCLSIG) and is the Scientific Director for Bio2RDF, a widely used open-source project to create and provide linked data for the life sciences.
Model organisms such as budding yeast provide a common platform to interrogate and understand cellular and physiological processes. Knowledge about model organisms, whether generated during the course of scientific investigation or extracted from published articles, is made available by model organism databases (MODs) such as the Saccharomyces Genome Database (SGD) for powerful, data-driven bioinformatic analyses. Integrative platforms such as InterMine offer a standard platform for MOD data exploration and data mining. Yet today's bioinformatic analyses also require access to a significantly broader set of structured biomedical data, such as what can be found in the emerging network of Linked Open Data (LOD). If MOD data could be provisioned as FAIR (Findable, Accessible, Interoperable, and Reusable), then scientists could leverage a greater amount of interoperable data in knowledge discovery.
The goal of this proposal is to increase the utility of MOD data by implementing standards-compliant data access interfaces that interoperate with Linked Data. We will focus our efforts on developing interfaces for data access, data retrieval, and query answering for SGD. Our software will publish InterMine data as LOD that are semantically annotated with ontologies and can be retrieved using standardized formats (e.g., JSON-LD, Turtle). We will facilitate the exploration of MOD data for hypothesis testing by implementing efficient query answering using Linked Data Fragments, and by developing a set of graphical user interfaces to search for data of interest, explore connections, and answer questions that leverage the wider LOD network. Finally, we will develop a locally and cloud-deployable image to enable the rapid deployment of the proposed infrastructure. Our efforts to increase interoperability and ease of deployment for biomedical data repositories will increase research productivity and reduce costs associated with data integration and warehouse maintenance.
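Linked Data Fragments, mentioned above, answer queries by letting a server expose only simple triple-pattern matching while clients combine the fragments. A minimal sketch of the server-side operation, over toy triples with invented identifiers (not SGD's actual IRIs):

```python
# A toy in-memory triple store and the triple-pattern matcher at the core
# of a Linked Data Fragments server. Subjects, predicates, and objects
# here are invented abbreviations, not real SGD identifiers.
TRIPLES = [
    ("ex:YFL039C", "ex:standardName", "ACT1"),
    ("ex:YFL039C", "ex:organism", "taxon:559292"),
    ("ex:YGR192C", "ex:standardName", "TDH3"),
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return [
        (s, p, o)
        for s, p, o in TRIPLES
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Fetch the fragment for the pattern (?s, ex:standardName, ?o):
names = match(predicate="ex:standardName")
```

A client answers richer queries by requesting several such fragments and joining them locally, which keeps the server inexpensive to host.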
The scientific and economic value of research data is enormous. To ensure successful subsequent usage, the scientific community needs efficient access to the data, the data must be reliable and persistent, and the quality of the data has to be demonstrable.
One solution to these preconditions is to apply the techniques of today’s scientific publishing to research data. Besides its publication in a data repository together with some metadata, the data should undergo a transparent public peer-review using a publication platform.
The presentation discusses two approaches. On the one hand, the data can be the basis for a research article and undergoes a review parallel to the review of the manuscript. The data is then a reviewed supplement to a scientific publication. On the other hand, the data itself can be the subject of a publication whose quality is then assured by peers.
The presentation provides practical experience, especially with the latter strategy, realized through an established open access journal.
Information retrieval (IR) is the retrieval of items (objects, Web pages, documents, and so forth) that satisfy explicit conditions expressed in a formal statement such as a query. While IR aims to satisfy a user's information need, generally expressed in natural language, data retrieval aims to determine which records contain the exact terms of the user's query.
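A toy sketch makes the contrast concrete (three invented one-line documents, deliberately simplistic scoring):

```python
docs = {
    1: "metadata quality in public repositories",
    2: "ontology terms improve metadata search",
    3: "repositories of biological samples",
}

def data_retrieval(term):
    """Data retrieval: return IDs of documents containing the exact term."""
    return [doc_id for doc_id, text in docs.items() if term in text.split()]

def information_retrieval(query):
    """IR: rank documents by word overlap with the query (crude relevance)."""
    words = set(query.split())
    scored = [(len(words & set(text.split())), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]
```

Data retrieval either matches a record or it doesn't; information retrieval orders partially relevant documents, which is why it better serves needs expressed in natural language.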
As BioPharma adapts to incorporate nimble networks of suppliers, collaborators, and regulators, the ability to link data is critical for dynamic interoperability. Adoption of the linked data paradigm allows BioPharma to focus on its core business: delivering valuable therapeutics in a timely manner.
Being FAIR: FAIR data and model management, SSBSS 2017 Summer School (Carole Goble)
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face, ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE), as well as in PIs' labs and centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also discuss the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the life science community, improve access to them and to related tools, support training, and create a platform for dataset interoperability. As Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform, I will show how this work relates to your projects.
[1] Wilkinson et al., "The FAIR Guiding Principles for scientific data management and stewardship," Scientific Data 3 (2016), doi:10.1038/sdata.2016.18
This session covers topics related to data archiving and sharing. This includes data formats, metadata, controlled vocabularies, preservation, archiving and repositories.
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr... (SC CTSI at USC and CHLA)
Date: Apr 4, 2018
Speaker: Hyoungjoo Park, PhD candidate, School of Information Studies, University of Wisconsin-Milwaukee, and Dietmar Wolfram, PhD
Overview: It is increasingly common for researchers to make their data freely available. This is often a requirement of funding agencies but also consistent with the principles of open science, according to which all research data should be shared and made available for reuse. Once data is reused, the researchers who have provided access to it should be acknowledged for their contributions, much as authors are recognised for their publications through citation. Hyoungjoo Park and Dietmar Wolfram have studied characteristics of data sharing, reuse, and citation and found that current data citation practices do not yet benefit data sharers, with little or no consistency in their format. More formalised citation practices might encourage more authors to make their data available for reuse.
The challenge of sharing data well: how publishers can help (Varsha Khodiyar)
Researchers, academic institutes and funders are increasingly recognizing the importance of data sharing for reproducible science. However, it is not always clear to researchers how best to share data in a useful way. At Springer Nature we are working on several initiatives to help facilitate the sharing of research data in a reusable way, with our overarching goal being to publish research that is robust and reproducible. I will talk about the effort that goes into our flagship data journal, Scientific Data, to facilitate best practices in the publication and sharing of research data, and share some of our experiences publishing Challenge datasets. I will also describe some of the newer Research Data Services that are now available to help all researchers (not only Springer Nature authors) to share their data in a useful way.
A Generic Scientific Data Model and Ontology for Representation of Chemical DataStuart Chalk
The current movement toward openness and sharing of data is likely to have a profound effect on the speed of scientific research and the complexity of questions we can answer. However, a fundamental problem with currently available datasets (and their metadata) is heterogeneity in terms of implementation, organization, and representation.
To address this issue we have developed a generic scientific data model (SDM) to organize and annotate raw and processed data, and the associated metadata. This paper will present the current status of the SDM, implementation of the SDM in JSON-LD, and the associated scientific data model ontology (SDMO). Example usage of the SDM to store data from a variety of sources with be discussed along with future plans for the work.
Case Study Life Sciences Data: Central for Integrative Systems Biology and Bi...sesrdm
Presentation by Dr Sarah Butcher, Imperial College London at Science and Engineering South (SES) Event - Helping Researchers Manage their Data - Friday 9th May 2014 held at Imperial College London
Citing data in research articles: principles, implementation, challenges - an...FAIRDOM
Prepared and presented by Jo McEntyre (EMBL_EBI) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
Similar to How to make your published data findable, accessible, interoperable and reusable (20)
Introduction to an online resource that displays pre-computed phylogenetic trees of gene families alongside experimental gene function data to facilitate inference of unknown gene function in plants. From the same team that brings you TAIR (The Arabidopsis Information Resource)
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic
harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or
unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied
students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
Synthetic Fiber Construction in lab .pptxPavel ( NSTU)
Synthetic fiber production is a fascinating and complex field that blends chemistry, engineering, and environmental science. By understanding these aspects, students can gain a comprehensive view of synthetic fiber production, its impact on society and the environment, and the potential for future innovations. Synthetic fibers play a crucial role in modern society, impacting various aspects of daily life, industry, and the environment. ynthetic fibers are integral to modern life, offering a range of benefits from cost-effectiveness and versatility to innovative applications and performance characteristics. While they pose environmental challenges, ongoing research and development aim to create more sustainable and eco-friendly alternatives. Understanding the importance of synthetic fibers helps in appreciating their role in the economy, industry, and daily life, while also emphasizing the need for sustainable practices and innovation.
2. Good Data Stewardship
• Publish data with the paper
• Describe data to your fullest ability
• Use the right words to identify data
• Deposit data in the right data repository
• Budget time for data management
• Don’t think of it as YOUR data
3. What’s in it for YOU?
We all benefit from data sharing.
• More citations of YOUR work, increasing your visibility in the research community
• Easily comply with journal and funding requirements
• Less time spent fulfilling requests for data
7. Data re-use leads to new insights
[Workflow figure: 503 datasets and 314 datasets → data processing, quality control, validation → statistical analysis → additional experiments]
Yu Zhang et al. PNAS. doi:10.1073/pnas.1716300115
NOVEL DISCOVERY: MET1 and CMT3 are independently required for the maintenance of asymmetric CHH methylation at CMT2 target sites.
8. Credit: Melissa Haendel
Wilkinson et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. doi:10.1038/sdata.2016.18. https://www.nature.com/articles/sdata201618
• Findable means data is human and machine readable and attached to persistent identifiers
• Accessible means data can be found and retrieved by humans and machines using standard formats
• Interoperable means data can be exchanged and used between systems
• Reusable means data can be used by others
9. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
10. Use Standard Formats: SNP example
A SNP (Single Nucleotide Polymorphism) record needs: a base, a chromosome number and genome position, a reference to the genome assembly used, and the genotypes of the lines tested.
Three ways the same calls appear in the wild:

CHROM   POS     REF   ALT   Line1   Line2
1       12345   A     C     A       A
3       67891   C     T     H       C
10      23456   G     T     T       U

CHROM   POS     REF   ALT   Line1   Line2
Gm01    12345   A     C     0/0     0/0
Gm03    67891   C     T     0/1     0/0
Gm10    23456   G     T     1/1     ./.

CHROM   POS     REF   ALT   Line1   Line2
Chr01   12345   A     C     AA      AA
Chr03   67891   C     T     C/T     CC
Chr10   23456   G     T     TT      NN

ALL MEAN THE SAME! BUT ARE NOT THE SAME.
VCF (Variant Call Format) is the STANDARD.
Use the file format STANDARD for your data type.
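The standard can be made concrete. Below is a minimal sketch of a VCF file carrying the same calls as the tables above, plus a tiny standard-library parser; the sample names Line1/Line2 and the reference assembly name are illustrative.

```python
# Build and parse a minimal VCF (Variant Call Format) record set using only
# the standard library. Positions and genotypes mirror the slide's example;
# sample names "Line1"/"Line2" and the reference name are illustrative.
import csv
import io

VCF_TEXT = """\
##fileformat=VCFv4.2
##reference=example_assembly_v1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tLine1\tLine2
Gm01\t12345\t.\tA\tC\t.\tPASS\t.\tGT\t0/0\t0/0
Gm03\t67891\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\t0/0
Gm10\t23456\t.\tG\tT\t.\tPASS\t.\tGT\t1/1\t./.
"""

def parse_vcf(text):
    """Return a list of dicts, one per variant line, keyed by column name."""
    lines = [l for l in text.splitlines() if not l.startswith("##")]
    header = lines[0].lstrip("#").split("\t")
    reader = csv.DictReader(io.StringIO("\n".join(lines[1:])),
                            fieldnames=header, delimiter="\t")
    return list(reader)

records = parse_vcf(VCF_TEXT)
het = [r for r in records if r["Line1"] == "0/1"]  # heterozygous in Line1
```

Because the genotype encoding (0/0, 0/1, ./.) and the column layout are fixed by the standard, any VCF-aware tool can consume this file; none of the three ad hoc tables above has that property.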
12. Use Standard Formats: Beware of Excel
If you use EXCEL, look out for data corruption and hidden Microsoft characters that impede parsing.
Ziemann et al. (2016) doi:10.1186/s13059-016-1044-7
[Fig. 1: Prevalence of gene name errors in supplementary Excel files — percentage of papers with gene lists affected; increase in supplementary files with gene name errors per year]
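The corruption Ziemann et al. documented (gene symbols such as SEPT2 silently becoming the date "2-Sep") can be screened for after export. A minimal sketch, assuming the gene list can be dumped to plain text:

```python
# Flag gene symbols that Excel has silently converted to dates
# (the error class described by Ziemann et al. 2016). This scans a list
# for values that already look like Excel date conversions, e.g.
# "2-Sep" (was SEPT2) or "1-Mar" (was MARCH1).
import re

# Day-month patterns Excel produces when mangling SEPT*/MARCH*/DEC*-style symbols
EXCEL_DATE_PATTERN = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$")

def find_mangled(gene_list):
    """Return entries that look like Excel date conversions, not gene symbols."""
    return [g for g in gene_list if EXCEL_DATE_PATTERN.match(g)]

genes = ["TP53", "2-Sep", "BRCA1", "1-Mar", "DEC1"]
mangled = find_mangled(genes)  # -> ["2-Sep", "1-Mar"]
```

The real fix is upstream: import gene columns as text, or avoid Excel for this data type entirely; a check like this only catches the damage after the fact.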
13. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
14. Supply Complete Metadata
Metadata:
• Species = xxx
• Germplasm = xxx
• Field location = xxx
• Environment = xxx
• Measurement method = xxx
Phenotype (data): plant is 170 cm tall
Metadata is data about the data, and allows understanding of the data.
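The metadata/data split above can be written down as a machine-readable record. A sketch serialized as JSON; the field names and example values below are illustrative, not a formal standard:

```python
# A phenotype observation with its metadata attached, serialized as JSON.
# Field names and values are illustrative placeholders, not a formal schema.
import json

record = {
    "metadata": {
        "species": "Zea mays",                       # illustrative
        "germplasm": "B73",                          # illustrative
        "field_location": "Davis, CA",               # illustrative
        "environment": "irrigated field, summer",    # illustrative
        "measurement_method": "ruler, soil surface to flag leaf tip",
    },
    "data": {"trait": "plant height", "value": 170, "unit": "cm"},
}

serialized = json.dumps(record, indent=2)  # human- and machine-readable
restored = json.loads(serialized)          # round-trips without loss
```

The point is that "170" alone is useless; paired with species, germplasm, environment and method, the same number becomes reusable by someone who never saw your field.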
15. Supply Complete Metadata
• Write your Materials and Methods as if you wanted someone else to be able to reproduce your work.
• Be accurate and complete about your bench and field work; include samples/stocks/lines used, accession numbers, sources of materials, exact measuring techniques, etc.
• Be AS accurate and complete about your computational pipelines. Include your created raw data files and versions. If you use reference data (e.g., a sequence assembly), include the version number, download dates, and download source.
• Include names of software applications, versions, platforms and sources. If you use a platform such as CyVerse, use its metadata reporting tools.
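Recording software names and versions can be automated rather than reconstructed from memory at writing time. A minimal sketch using only the standard library; pass in whatever packages your pipeline actually imports:

```python
# Capture the computational environment (interpreter, platform, package
# versions) as a JSON report you can archive alongside your results and
# quote verbatim in Materials and Methods.
import json
import platform
from importlib import metadata

def environment_report(packages=()):
    """Return a dict describing Python, the OS, and installed package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["packages"][name] = "not installed"
    return report

# Example: list the packages your own pipeline uses here.
print(json.dumps(environment_report(["pip"]), indent=2))
```

Run once at the end of an analysis and commit the output next to the data; "download dates and sources" for reference data still have to be recorded by hand.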
19. Supply Complete Metadata
• 597 possible attributes
• At least 50 attributes
• Genome sequence assembly: at least 100 attributes
20. Budget TIME to provide metadata
“The metadata in public databases is often confusing; a test case with Zea mays mRNA seq data reveals a high proportion of missing, misleading or incomplete metadata.” (2018) https://doi.org/10.1016/j.plantsci.2017.10.014
22. Metadata Standards for Various Data Types
• Established: Genomic Standards Consortium (http://gensc.org) — Minimum Information about any (x) Sequence (MIxS)
• Emerging: Minimum Information About a Plant Phenotyping Experiment (MIAPPE)
Supply complete metadata. Ask for help from database people.
23. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
25. Embrace Ontologies
An ontology is: a set of precisely defined terms, in a logical hierarchy, where the relationships between terms can be understood by computers.
26. Ontologies: hierarchy of terms and explicit relationships among terms
Plant Ontology (PO) example, from general to specific:
• Leaf (PO:0025034)
• Vascular leaf (PO:0009025)
• Adult vascular leaf (PO:0020103)
• Flag leaf (PO:0020103)
• Leaf sheath (PO:0020104)
• Ligule (PO:0020105)
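The hierarchy above is what makes ontology terms machine-usable: once the is_a relationships are explicit, a computer can answer "is a flag leaf a kind of leaf?" without human help. A simplified sketch; the parent links below approximate the slide's tree and are NOT the full Plant Ontology graph:

```python
# Encode an ontology-style hierarchy as parent links and answer subsumption
# queries by walking up the chain. A simplification of the real Plant
# Ontology: one parent per term, labels instead of PO IDs.

PARENT = {
    "vascular leaf": "leaf",
    "leaf sheath": "vascular leaf",
    "flag leaf": "vascular leaf",
    "adult vascular leaf": "vascular leaf",
    "ligule": "leaf sheath",
}

def ancestors(term):
    """Return the chain of is_a ancestors for a term, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

def is_a(term, ancestor):
    """True if `term` is (transitively) a kind of `ancestor`."""
    return ancestor in ancestors(term)

# A dataset annotated with "ligule" is automatically found by a query for "leaf".
assert is_a("ligule", "leaf")
```

This is exactly why annotating data with ontology terms beats free text: a query for the general term retrieves everything annotated with its descendants, across papers and species.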
27. Data from diverse types of experiments and organisms can be compared
• Franssen, H.J., et al. (2015) (Medicago) doi:10.1242/dev.120774
• Li, S., et al. (2016) (Arabidopsis) doi:10.1016/j.devcel.2016.10.012
• Zhou, X.-F., et al. (2014) doi:10.1104/pp.114.243808
28. Embracing Ontologies
• Ontologies provide a POWERFUL, MACHINE-READABLE utility for data
• Find and use existing ontologies (http://www.obofoundry.org/, Planteome):
• Gene function = Gene Ontology (GO)
• Sequences = Sequence Ontology (SO)
• Plant anatomy and development = Plant Ontology (PO)
• Phenotypes = Phenotype And Trait Ontology (PATO)
• …and many, many others
• Apply them consistently: to datasets (e.g., in metadata) and in publications (e.g., TAIR GO/PO submission)
• Ask questions!
29. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
35. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
36. Problem: Data is not findable because it is not available
Piwowar, H.A., Vision, T.J. (2013) Data reuse and the open data citation advantage. PeerJ 1:e175. https://doi.org/10.7717/peerj.175
Gibney, E. and Van Noorden, R. doi:10.1038/nature.2013.14416
37. Put your data in a stable public repository
• Large international repositories for many data types, for all species (ALL sequence data goes here)
• Large but specialized databases serving many species
• Specialized databases serving specific communities (e.g., SoyBase)
38. Submitting to a repository: SNP example
As of 9/2017, all NON-human SNPs are processed through EMBL in the European Variation Archive (EVA, https://www.ebi.ac.uk/eva/). NCBI’s dbSNP will only process human SNPs.
EVA will require:
• Data in (standard) Variant Call Format (VCF), including allele frequencies
• A SUBMITTED genome or transcriptome assembly
39. What if there is no specialized database? Or no recommendations from journals?
You should get a Digital Object Identifier (DOI):
• http://datadryad.org (** curated, metadata)
• https://zenodo.org/
• https://figshare.com/
And just for you folks at UC:
• https://datashare.ucsf.edu/stash
40. But… please, don’t forget to actually complete your submission*
*And you never have to spend time fielding requests or transferring huge data files again.
42. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
43. Cite, share freely and encourage others to be FAIR
• Include searchable and citable identifiers for your data in your papers.
• Release your data with clearly defined terms of use, e.g., Creative Commons (CC): CC0, CC-BY. If you do not specify terms, implied restrictions may limit reuse.
• Cite all of your data sources. This enhances reproducibility… and also shows value to funders!
• When reviewing papers, check them for FAIRness.
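A citable, clearly licensed dataset can carry that information in machine-readable form. A sketch whose field names loosely follow the DataCite metadata schema; the DOI, title, and names below are placeholders, not a real record:

```python
# Machine-readable citation info for a released dataset: a persistent
# identifier plus an explicit license, so both humans and machines know
# how to cite and reuse it. All values are illustrative placeholders;
# field names loosely follow the DataCite schema.
import json

dataset_citation = {
    "identifier": {"type": "DOI", "value": "10.0000/example.dataset"},  # placeholder
    "title": "SNP genotypes for example recombinant inbred lines",      # placeholder
    "creators": ["Doe, J.", "Roe, R."],                                 # placeholders
    "publisher": "Dryad",
    "publicationYear": 2018,
    "license": "CC0-1.0",
}

print(json.dumps(dataset_citation, indent=2))
```

Repositories like Dryad and Zenodo generate a record along these lines for you at deposit time; the point of the sketch is that an explicit license field is what turns "shared" into "reusable".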
45. A few simple things to remember when preparing your paper
• Include unambiguous identifiers
• Format data according to defined standards
• Keep data in (parseable) tables or text
• Include meaningful metadata
• Deposit data in a long-term stable public repository and get a DOI
• It is never too early to think about (meta)data; the best time to start is BEFORE you are writing
46. You can get help structuring, organizing and managing your data
• Contact your community database
• Don’t have one? Contact a curator (Leonore, Lisa… we live amongst you)
• UCB Research Data Management librarians (http://researchdata.berkeley.edu/)
48. What YOU can do right now to support FAIR data
• Ask your funders for increased access to FAIR data.
• When you review papers, look at the data and be sure it is well described (metadata is great).
• Change your attitude a little: your data will be more cited and more important if you make it FAIR.
• Deposit your data and get a DOI.
• Ask your institution to value good data submission and good data recycling.