Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Big Data Publishing,
Handling, & Reuse
Laurie Goodman, PhD
Editor-in-Chief, GigaScience
Laurie@gigasciencejournal.com
ORCID ID: 0000-0001-9724-5976
Beyond Data Release Mandates

What is the point of publishing?
• To disseminate
information/knowledge/ideas.
• To present material so it can be reasonably
assessed for its level of quality (and interest).
• To gain credit for career advancement.

What goes into a research article?
+ Area of Interest/
Question

+ Area of Interest/
Question
Data & Metadata Collection
Analysis/Hypothesis/Analysis
Conclusions

Conclusions
+ Area of Interest/
Question
Data & Metadata Collection

Scientific Communication
Via Publication
• Scholarly articles are merely advertisement of scholarship.
The actual scholarly artefacts, i.e. the data and
computational methods, which support the
scholarship, remain largely inaccessible. (Jon B.
Buckheit and David L. Donoho, WaveLab and Reproducible
Research, 1995)
• Core scientific statements or assertions are intertwined and
hidden in the conventional scholarly narratives
• Lack of transparency, lack of credit for anything other than
“regular” dead tree publication

Kahn, Goodman, & Mittleman. Dragging Scientific Publishing into the 21st Century 2014
http://genomebiology.com/2014/15/12/556
From Journal Delivery to PDF Delivery

Lack of Data and Software Availability
Impacts Reproducibility
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results
from 10 could not be reproduced

Retractions are on the Rise
>15X increase in last decade
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

Deconstructing a paper into accessible,
useable, trackable, interlinked units
Need to provide credit to
reward sharing and proper
organization of:
• Narrative
• Data/Metadata
availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses
Data/
MetaData
Software
Methods
Narrative

Deconstructing a paper into accessible,
useable, trackable, interlinked units
Currently we provide credit
for this:
• Narrative
• Data/Metadata
availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses
Data/
MetaData
Software
Methods
Narrative
Sometimes we publish these
as Methods Papers

Beyond the Narrative
Data And Tools

Promoting Data Release
Data Citation

But publishing ‘Data’ is “Salami Slicing”!!
What is Salami Slicing?
• Publishing research in several different papers that
should form a single cohesive paper
Why is it ‘unethical’?
• It fragments the scientific literature, wasting
researchers’ time as they try to get all the information
related to a very specific topic/dataset/method
• It can give the appearance (given there are multiple
publications) that there is large support for a particular
hypothesis
• It pads a researcher’s publication record unfairly

Publishing ‘Data’ is “Salami Slicing”!
Baloney
1. Those guidelines were developed prior to the year 2000:
• More than 15 years ago: at a time when data set sizes and data
types collected in the life sciences by a single research group
were relatively small and primarily suitable for a single or narrow
range of disciplines or hypotheses.
• Most journals were not online (which allows easier identification
of and access to closely related articles) until the late ’90s.
2. In 2005, COPE* ruled that a paper reusing data that had been used
and described, at least in part, in a previous publication was not
unethical.
*Committee on Publication Ethics. http://www.publicationethics.org/case/salami-publication
3. Data collection can be (should be!!) a scholarly pursuit:
• Data that is broadly reusable requires care, thought, training,
time, and money to be properly collected, curated, stored, and
shared.

Contrary to popular belief…
There are very few
—if any—
‘push-a-button-and-get-it’
reusable data resources

You’re not supposed to just collect samples!
*Collect ALL available metadata*
Help Develop a Digital Data Curation Team at your
Institution’s Library (they may already have one…)

Back to Darwin
Data & Metadata Collection/Experiments
Conclusions
+ Area of Interest/
Question
1839
1859
20 Yrs.

Say… was this a Data Publication?
Data & Metadata Collection/Experiments
Conclusions
+ Area of Interest/
Question
1839
1859
The most curious fact is the
perfect gradation in the
size of the beaks in the
different species of
Geospiza, from one as large
as that of a hawfinch to
that of a chaffinch, and (if
Mr. Gould is right in including his sub-group, Certhidea, in
the main group) even to that of a warbler. The largest beak
in the genus Geospiza is shown in Fig. 1, and the smallest in
Fig. 3; but instead of there being only one intermediate
species, with a beak of the size shown in Fig. 2, there are no
less than six species with insensibly graduated beaks.
(Chapter 17)

DataCite and DOIs
• Aims to “increase acceptance of research data as
legitimate, citable contributions to the scholarly
record”.
• “data generated in the course of research are
just as valuable to the ongoing academic
discourse as papers and monographs”.
Citing Data Isn’t New
The Physical Sciences have been doing this for a while…

What we’re doing:
Mandating and Aiding Data Release
Requiring all data supporting work to be Freely available in a
publicly available repository
– How we’re helping to do this:
• Journal-dedicated data and software repository GigaDB
that hosts ALL data types.
• Have Biocurators to aid in handling Metadata
• All Datasets are provided a Digital Object Identifier
(DOI) making them citable and countable
• All Material in GigaDB is available under a CC0 Waiver
• Data for which an approved public database exists must be
submitted there as well
• Provide Direct links to all associated information
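To illustrate why a DOI makes a dataset "citable and countable", here is a minimal sketch that turns a dataset record into a resolvable link and a reference-list entry. The `DatasetRecord` class, the citation layout, and the default publisher label are assumptions made for illustration, not GigaDB's actual schema or citation format. The example title, year, and DOI are taken from the GigaDB dataset shown later in this talk; authors are omitted because the slides do not list them.

```python
from dataclasses import dataclass

# Public DOI resolver; any registered DOI appended to this resolves to its landing page.
DOI_RESOLVER = "https://doi.org/"


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical minimal record for a published dataset."""
    title: str
    year: int
    doi: str
    publisher: str = "GigaDB"  # assumed label, for illustration only


def doi_url(doi: str) -> str:
    """Return a resolvable URL for a DOI, tolerating a leading 'doi:' label."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    return DOI_RESOLVER + cleaned


def format_citation(rec: DatasetRecord) -> str:
    """Format a simple reference-list entry (layout is an assumption)."""
    return f"{rec.title} ({rec.year}). {rec.publisher}. {doi_url(rec.doi)}"


example = DatasetRecord(
    title="Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis)",
    year=2011,
    doi="10.5524/100003",
)
print(format_citation(example))
```

Because the DOI is a stable identifier, the same string can appear in a reference list, be resolved by a reader, and be counted by citation-tracking services.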

What we’re doing:
Mandating and Aiding Software Release
Requiring all software and work to be Freely available in a
publicly available repository
– How we’re promoting this:
• All software created by authors must be 100% OSI
compliant
• Journal-Dedicated repository GigaDB hosts software so
it can be downloaded.
• Software and Workflows are provided a DOI making
them citable and countable (reward)
• Journal-dedicated Galaxy Platform to run tools
• Have a Data Manager and Data Scientist to wrap and
deploy software tools
• Have our own GitHub Repository

How we view publishing at GigaScience
• Paper in GigaScience (Narrative + Methods): Open-access journal
• Data Sets in GigaDB: Data Publishing Platform
• Analyses/Workflows in GigaGalaxy: Data Computation Analysis Platform

Making the Data Itself Citable
We provide a linked journal database. This links the data
directly to our papers to ease reproducibility, makes it available at the
time of review, and gives authors a place to submit data that has no
sustainable ‘home’.
Note: there are many community-available databases, so in principle
any journal can do this by taking advantage of such available
resources.
These include the usual suspects: EBI, NCBI, DDBJ etc.
Databases that take all data types and provide Data DOIs: Dryad,
FigShare, etc.
There are also numerous smaller community databases specific to
different fields or data types.

Some of the Journals Currently
Doing Data Publication
http://proj.badc.rl.ac.uk/preparde/blog/DataJournalsList

Citing Data in the
References Allows Tracking
This rewards authors for making data
available AND makes it easier to find
But is this being done?
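As a rough illustration of what tracking involves, the sketch below scans reference-list entries for DOIs and counts those that carry a dataset prefix. The prefix `10.5524` is the one seen on the GigaDB DOIs in this talk; the regular expression, the function names, and the sample reference strings are assumptions for illustration and do not describe how any citation index actually works.

```python
import re
from collections import Counter

# Loose DOI matcher: "10.", a registrant code, a slash, then a suffix without whitespace.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

# DOI prefixes treated as "data" for this sketch (10.5524 appears on GigaDB DOIs in this talk).
DATA_PREFIXES = {"10.5524"}


def find_dois(text):
    """Return all DOI-like strings in text, trimming trailing punctuation."""
    return [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]


def count_data_citations(references):
    """Count how often each dataset DOI appears across reference entries."""
    counts = Counter()
    for ref in references:
        for doi in find_dois(ref):
            prefix = doi.split("/", 1)[0]
            if prefix in DATA_PREFIXES:
                counts[doi] += 1
    return counts


# Hypothetical reference entries, invented for illustration only.
sample_references = [
    "Hypothetical analysis paper A. doi:10.1234/example.1",
    "Dataset cited by paper A. doi:10.5524/100003.",
    "Dataset cited by paper B (DOI:10.5524/100003)",
]
print(count_data_citations(sample_references))
```

The point of the sketch is that counting only works when the dataset DOI appears in the reference list itself; a DOI mentioned only in free text in the methods section is much harder to find reliably.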

Is Cited Data Being Tracked?
Yes.

Improving Quality as
Well as Availability
How Hard is Data and Software Review?
Not really that much harder than narrative
review.

GigaDB submission workflow:
• Submission: Submitter logs in to GigaDB website and
uploads Excel submission or uses online wizard
– Fail: submitter is provided error report
– Pass: dataset is uploaded to GigaDB
• Curator Review: Curator contacts submitter with
DOI citation and to arrange file transfer (and resolve any other
questions/issues)
• DOI assigned
• Files: Submitter provides files by ftp or Aspera
• DataCite XML file: XML is generated and registered with DataCite
• Curator makes dataset public (can be set as future date if
required)
• Public GigaDB dataset, e.g. DOI 10.5524/100003: Genomic data from the
crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
Data must be available for review with the manuscript
(and at the very least get a sanity check…)
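The pass/fail step at submission can be pictured with a small sketch: check a submitted record for required metadata and either accept it or return an error report. The field names in `REQUIRED_FIELDS`, the record layout, and the function name are all hypothetical; the slides do not describe what GigaDB actually validates.

```python
from dataclasses import dataclass, field

# Hypothetical required metadata fields, not GigaDB's real submission template.
REQUIRED_FIELDS = ("title", "description", "submitter_email", "files")


@dataclass
class ValidationResult:
    passed: bool
    errors: list = field(default_factory=list)  # human-readable error report


def validate_submission(record: dict) -> ValidationResult:
    """Return pass, or fail with an error report listing every problem found."""
    errors = []
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            errors.append(f"missing required field: {name}")
    # Each listed file needs at least a name so a curator can arrange transfer.
    for index, entry in enumerate(record.get("files") or []):
        if not entry.get("name"):
            errors.append(f"file {index}: missing name")
    return ValidationResult(passed=not errors, errors=errors)


good = {
    "title": "Example dataset",
    "description": "Illustrative record only",
    "submitter_email": "submitter@example.org",
    "files": [{"name": "reads.fastq.gz"}],
}
bad = {"title": "Example dataset", "files": [{"name": ""}]}

print(validate_submission(good).passed)
for line in validate_submission(bad).errors:
    print(line)
```

Collecting every error before returning, rather than stopping at the first one, matches the idea of handing the submitter a full error report to fix in a single pass.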

Reviewing Data in More Detail
Issue: We can’t ask our reviewers to do that!
Our finding: Reviewers don’t mind
Reviewer Dr. Christophe Pouzat on neuroscience
manuscript:
“In addition to making the presented research
trustworthy, the reproducible research
paradigm definitely makes the reviewers job
more fun!”
Can also use specific Data Reviewers (we have)

Reviewing Data AND Software
Code in SourceForge under GPLv3:
http://soapdenovo2.sourceforge.net/
>5000 downloads
http://homolog.us/wiki/index.php?title=SOAPdenovo2
Data sets
Analyses
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
>35,000 accesses
Open-Code
8 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Enabled code to be picked apart by bloggers in wiki

8 Reviewers! Holy Cow, that must have
taken forever!!
Submission
July 24
Final review
Aug 28
These were
reviewing
teams from
different labs,
assessing the
materials at
multiple levels

Is this really worth the effort?
Beyond Reproducibility:
REUSE
Data Availability and Tools

http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
These data
were released
THREE YEARS
before
publication of
the analysis
article

The polar bear DATA were released prepublication, in 2011.
They were used and cited in the following studies before the main paper on the
sequencing was published:
Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct
bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424.
Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting
theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345.
doi:10.1371/journal.pgen.1003345.
Morgan, CC et al., Heterogeneous models place the root of the placental mammal
phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117.
Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus
maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from
Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133.
Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene
Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109
http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/

Even though the data had
been released over 2 years
earlier and cited in other
papers, the main analysis
paper was published in Cell

Cell Press Journals had indicated that
publishing a dataset prior to publication
could be considered prior publication

• New Sequencing technology
• MinION (Oxford Nanopore)
• New Sequence Data Type
• EBI and NCBI Databases not ready
• High community interest for testing
data
• >100 GB of data
Real time use during the
publication process
• Uploaded prior to publication
• Deployed on Amazon CloudFront
• Ongoing
testing/comparison/information
sharing prior to publication
• When EBI was ready for the data, it used our
cloud to upload it
• EBI transferred the data to NCBI when
NCBI was ready

Getting past…
…look but don't touch

Reproduce and Reuse Needs Much More
• Data: GigaDB
• Software: GitHub
• Workflows
– Galaxy
– Executable Docs
– VMs
• Images: OMERO
• Cloud storage, tools, and
compute power…
• Need this to reach the smaller
labs
github.com/gigascience/gigadb-cogini
More journals have introduced, or are starting to introduce,
these and other tools, but more is needed…

Currently… it feels like this…
Well…
…because it is like this

If we want to
move
forward, we
need to go
through that
to reach this:
It will require
researchers,
institutions,
publishers,
and funders
working
together.

Thanks to:
Scott Edmunds, Executive Editor
Nicole Nogoy, Commissioning Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Rob Davidson, Data Scientist
Xiao (Jesse) Si Zhe, Database Developer
Amye Kenall, Journal Development Manager
Contact us:
editorial@gigasciencejournal.com
database@gigasciencejournal.com
Follow us:
@GigaScience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog
www.gigasciencejournal.com
www.gigadb.org
