Scott Edmunds talk on GigaScience Big-Data, Data Citation and future data handling at the International Conference of Genomics on the 15th November 2011.
Scott Edmunds talk in the "Policies and Standards for Reproducible Research" session on Revolutionizing Data Dissemination: GigaScience, at the Genomic Standards Consortium meeting at Shenzhen. 6th March 2012
Alexandra Basford, InCoB 2011: A Journal’s Perspective on Data Standards and ... (GigaScience, BGI Hong Kong)
Alexandra Basford's talk in the curation session at the InCoB meeting in Kuala Lumpur, 30/11/11 on: GigaScience: A Journal’s Perspective on Data Standards and Biocuration
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th... (GigaScience, BGI Hong Kong)
Scott Edmunds talk at the HUPO congress in Geneva, September 6th 2011 on GigaScience - a journal or a database? Lessons learned from the Genomics Tsunami.
GigaScience Editor-in-Chief Laurie Goodman's talk at the International Conference on Genomics pre-conference press-session on the release of new unpublished datasets, and a new look beta version of their database: GigaDB.org
Keynote presented to KE workshop held in conjunction with the release of the report "A Surfboard for Riding the Wave: Towards a four country action programme on research data": http://www.knowledge-exchange.info/Default.aspx?ID=469
Scott Edmunds slides for class 8 from the HKU Data Curation (module MLIM7350 from the Faculty of Education) course covering science data, medical data and ethics, and the FAIR data principles.
EZID: Easy dataset identification & management
Joan Starr, Manager, Strategic and Project Planning and EZID Service Manager, California Digital Library
Data and data curation are assuming a growing role in today’s research library. New approaches are needed both to address the resulting challenges and to take advantage of the emerging opportunities. Long-term identifiers represent one such tool. In this presentation, Joan Starr will introduce identifiers and an application designed to make them easy to create and manage: EZID. She will provide a closer look at two identifier types, DOIs and ARKs, and discuss what bringing an identifier service to your institution might mean.
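The two identifier types Starr compares follow simple published syntaxes (DOIs: `doi:PREFIX/SUFFIX`; ARKs: `ark:/NAAN/NAME`). As a hedged illustration, the hypothetical helper below splits either form into its parts; it is a sketch of the identifier anatomy only, not of the EZID service API:

```python
def parse_identifier(ident):
    """Split a DOI or ARK string into its scheme and parts.

    A DOI looks like "doi:10.PREFIX/SUFFIX"; an ARK looks like
    "ark:/NAAN/NAME" (NAAN = Name Assigning Authority Number).
    """
    ident = ident.strip()
    if ident.lower().startswith("doi:"):
        prefix, _, suffix = ident[4:].partition("/")
        return {"scheme": "doi", "prefix": prefix, "suffix": suffix}
    if ident.lower().startswith("ark:/"):
        naan, _, name = ident[5:].partition("/")
        return {"scheme": "ark", "naan": naan, "name": name}
    raise ValueError("unrecognised identifier: %r" % ident)

print(parse_identifier("doi:10.5524/100001"))
print(parse_identifier("ark:/13030/tf5p30086k"))
```

The example identifier values above are illustrative placeholders, not references to specific registered objects.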
Scott Edmunds talk at AIST: Overcoming the Reproducibility Crisis: and why I ... (GigaScience, BGI Hong Kong)
Scott Edmunds talk at the AIST Computational Biology Research Center in Tokyo: Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods), July 1st 2014
Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center
Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility--the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used, or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed references required. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.
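Parsons' point that a reference must resolve to the precise data used "or at least their scientific equivalent" can be illustrated with a toy content fingerprint: hash a canonical form of the records rather than the raw file bytes, so two files that differ only in row order or formatting still count as the same data. This is a hypothetical sketch for illustration, not a technique proposed in the talk:

```python
import hashlib
import json

def content_fingerprint(records):
    """Fingerprint a dataset by its content, not its file bytes.

    Records are sorted and serialised canonically, so two files that
    differ only in row order or key order hash identically, while any
    change to an actual value produces a different fingerprint.
    """
    canonical = json.dumps(
        sorted(records, key=lambda r: json.dumps(r, sort_keys=True)),
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same measurements, different row order: scientifically equivalent.
a = [{"station": "S1", "temp": -12.5}, {"station": "S2", "temp": -9.0}]
b = [{"station": "S2", "temp": -9.0}, {"station": "S1", "temp": -12.5}]
print(content_fingerprint(a) == content_fingerprint(b))
```

A real scheme would also need to decide what counts as equivalent (units, precision, provenance), which is exactly the open question the talk raises.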
Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
Research Objects: more than the sum of the parts
Carole Goble
Workshop on Managing Digital Research Objects in an Expanding Science Ecosystem, 15 Nov 2017, Bethesda, USA
https://www.rd-alliance.org/managing-digital-research-objects-expanding-science-ecosystem
Research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
A first step is to think of Digital Research Objects as a broadening out to embrace these artefacts or assets of research. The next is to recognise that investigations use multiple, interlinked, evolving artefacts. Multiple datasets and multiple models support a study; each model is associated with datasets for construction, validation and prediction; an analytic pipeline has multiple codes and may be made up of nested sub-pipelines, and so on. Research Objects (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described.
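As a rough, hypothetical sketch of the packaging idea described above (real Research Objects use richer vocabularies such as OAI-ORE and PROV), a manifest might aggregate the interlinked artefacts of a study and record their relationships:

```python
import json

# A minimal, illustrative Research Object manifest: one study aggregating
# a dataset, a model built from it, and a workflow that uses both.
# The identifiers and property names here are invented for illustration.
manifest = {
    "@id": "ro/example-study",
    "aggregates": [
        {"@id": "data/input.csv", "type": "Dataset",
         "role": "construction"},
        {"@id": "model/model.sbml", "type": "Model",
         "derivedFrom": "data/input.csv"},
        {"@id": "workflow/pipeline.cwl", "type": "Workflow",
         "uses": ["data/input.csv", "model/model.sbml"]},
    ],
}
print(json.dumps(manifest, indent=2))
```

The point of the framework is exactly this: the relationships (derivedFrom, uses) travel with the artefacts, so context and provenance are packaged rather than left implicit.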
Findable Accessible Interoperable Reusable < data | models | SOPs | samples | articles | * >. FAIR is a mantra; a meme; a myth; a mystery; a moan. For the past 15 years I have been working on FAIR in a range of Life Science projects and initiatives. Some are top-down, like the Life Science European Research Infrastructures ELIXIR and ISBE, and some are bottom-up, supporting research projects in Systems and Synthetic Biology (FAIRDOM), Biodiversity (BioVeL), and Pharmacology (Open PHACTS), for example. Some have become movements, like Bioschemas, the Common Workflow Language and Research Objects. Others focus on cross-cutting approaches in reproducibility, computational workflows, metadata representation and scholarly sharing & publication. In this talk I will relate a series of FAIRy tales. Some of them are Grimm. Some have happy endings. Who are the villains and who are the heroes? What are the morals we can draw from these stories?
Published on Feb 07, 2016 by PMR
Use of ContentMine tools on the Open Access subset of EuropePubMedCentral to discover new knowledge about the Zika virus. Includes clips of the software in action
Published on Feb 29, 2016 by PMR
An overview of Text and Data Mining (ContentMining) including live demonstrations. The fundamentals: discover, scrape, normalize, facet/index, analyze, publish are exemplified using the recent Zika outbreak. Mining covers textual and non-textual content, and examples from chemistry and phylogenetic trees are given.
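The fundamentals named above (discover, scrape, normalize, facet/index, analyze) can be caricatured in a few lines. The toy sketch below facets abstract-like text against a small species vocabulary; the real ContentMine tools operate on full-text XML from EuropePMC, and the example documents and vocabulary here are invented:

```python
import re
from collections import Counter

def normalize(text):
    """Normalize step: lowercase and tokenise into plain words."""
    return re.findall(r"[a-z]+", text.lower())

def facet(tokens, vocabulary):
    """Facet/index step: count only tokens from a controlled vocabulary."""
    return Counter(t for t in tokens if t in vocabulary)

# Two made-up abstract snippets standing in for scraped full text.
docs = [
    "Zika virus detected in Aedes aegypti mosquitoes.",
    "Aedes albopictus may also transmit Zika virus.",
]
species_terms = {"aedes", "aegypti", "albopictus"}

# Analyze step: aggregate facet counts across the corpus.
index = Counter()
for doc in docs:
    index += facet(normalize(doc), species_terms)
print(index.most_common())
```

Swapping the vocabulary (chemical names, taxon names) is what lets the same pipeline mine very different content types, which is the point the overview makes.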
The Role of Libraries in Data Management and Curation
Nicole Vasilevsky
The Role of Libraries in Data Management and Curation, presented at the American Library Association conference in Las Vegas, NV, 07/29/14.
Abstract:
As increasing amounts of data are being generated, applying best practices in handling data is important, and librarians are well poised to assist users. During this session, we will discuss the role of libraries in assisting with data management, application of metadata, ontologies, data standards, and the publication of data in repositories and on the Semantic Web. This talk will describe best data practices and engage the attendees in interactive activities to demonstrate these principles.
Opening Keynote: The Many and the One: BCE themes in 21st century data curation
Allen Renear, Professor and Interim Dean, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Two scientists can be using "the same data" even though the computer files involved appear to be quite different. This is familiar enough and, for the most part, in small communities with shared practices and familiar datasets, raises few problems. But these informal understandings do not scale to 21st century data curation. To get full value from cyberinfrastructure we must support huge quantities of heterogeneous data developed by diverse communities and used by diverse communities -- often with widely varying methods, tools, and purposes. To accomplish this our informal practices and understandings must be replaced, or at least supplemented, by a shared framework of standard terminology for describing complex cascades of representational levels and relationships. Fundamental problems in data curation -- and in particular problems involving provenance, identifiers, and data citation -- cannot be fully resolved without such a framework. Although the deepest problems here have ancient origins, useful practical measures are now within reach. Some recent work toward this end that is being carried out at the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign will be described.
On March 15, 33 experts from the Department’s National Nuclear Security Administration (NNSA) arrived in Japan along with more than 17,200 pounds of equipment. After initial deployments at U.S. consulates and military installations in Japan, these teams have utilized their unique skills, expertise and equipment to help assess, survey, monitor and sample areas for radiation. The 33 team members joined another six DOE personnel already in Japan.
Since arriving in Japan, NNSA teams have collected and analyzed data gathered from more than 40 hours of flights aboard Department of Defense aircraft and thousands of ground monitoring points.
Publishing of Scientific Data - Science Foundation Ireland Summit 2010
jodischneider
Slides prepared for the Publishing of Scientific Data workshop at the Science Foundation Ireland Summit 2010. I was one of three panelists. We had a lively discussion!
Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in... (GigaScience, BGI Hong Kong)
Scott Edmunds talk at the 7th International Conference on Genomics: "Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era". ICG7, Hong Kong, 1st December 2012
Data analysis & integration challenges in genomics
mikaelhuss
Presentation given at the Genomics Today and Tomorrow event in Uppsala, Sweden, 19 March 2015. (http://connectuppsala.se/events/genomics-today-and-tomorrow/) Topics include APIs, "querying by data set", machine learning.
From Deadly E. coli to Endangered Polar Bear: GigaScience Provides First Cita... (GigaScience, BGI Hong Kong)
Slides from GigaScience press-conference at BGI's Bio-IT APAC meeting on the GigaScience website launch and release of first unpublished animal genomes released from database. Genomes include polar bear, penguin, pigeon and macaque. 6th July 2011
IDW2022: A decade's experiences in transparent and interactive publication of ... (GigaScience, BGI Hong Kong)
Scott Edmunds at International Data Week 2022: A decade's experiences in transparent and interactive publication of FAIR data and software via an end-to-end XML publishing platform. 21st June 2022
GigaByte Chief Editor Scott Edmunds presents on how to prepare a data paper for the TDR and WHO sponsored call for data papers describing datasets on vectors of human diseases launched in Nov 2021. Presented at the GBIF webinar on 25th January 2022 and aimed at authors interested in submitting a manuscript to the series.
STM Week: Demonstrating bringing publications to life via an End-to-end XML p... (GigaScience, BGI Hong Kong)
Scott Edmunds at the STM Week 2020 Digital Publishing seminar on Demonstrating bringing publications to life via an End-to-end XML publishing platform. 2nd December 2020
Scott Edmunds: A new publishing workflow for rapid dissemination of genomes u... (GigaScience, BGI Hong Kong)
Scott Edmunds on a new publishing workflow for rapid dissemination of genomes using GigaByte & GigaDB. Presented at Biodiversity 2020 in the Annotation & Databases track, 9th October 2020.
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ... (GigaScience, BGI Hong Kong)
Scott Edmunds talk at CODATA2019 on Quantifying how FAIR is Hong Kong: The Hong Kong Shareability of Hong Kong University Research Experiment. 19th September 2019 in Beijing
Scott Edmunds talk at IARC: How can we make science more trustworthy and FAIR... (GigaScience, BGI Hong Kong)
Scott Edmunds talk at IARC, Lyon. How can we make science more trustworthy and FAIR? Principled publishing for more evidence based research. 8th July 2019
PAGAsia19 - The Digitalization of Ruili Botanical Garden Project: Production... (GigaScience, BGI Hong Kong)
A three-part talk presented at PAG Asia 2019 in Shenzhen: The Digitalization of Ruili Botanical Garden Project: Production, Curation and Re-Use. Presented by Huan Liu (CNGB), Scott Edmunds (GigaScience) & Stephen Tsui (CUHK). 8th June 2019
Democratising biodiversity and genomics research: open and citizen science to... (GigaScience, BGI Hong Kong)
Scott Edmunds at the China National GeneBank Youth Biodiversity MegaData Forum: Democratising biodiversity and genomics research: open and citizen science to build trust and fill the data gaps. 18th December 2018
Ricardo Wurmus at #ICG13: Reproducible genomics analysis pipelines with GNU Guix. Presented at the GigaScience Prize Track at the International Conference on Genomics, Shenzhen, 26th October 2018
Paul Pavlidis at #ICG13: Monitoring changes in the Gene Ontology and their im... (GigaScience, BGI Hong Kong)
Paul Pavlidis talk at the #ICG13 GigaScience Prize Track: Monitoring changes in the Gene Ontology and their impact on genomic data analysis (GOtrack). Shenzhen, 26th October 2018
Stefan Prost at #ICG13: Genome analyses show strong selection on coloration, ... (GigaScience, BGI Hong Kong)
Stefan Prost presentation for the #ICG13 GigaScience Prize Track: Genome analyses show strong selection on coloration, morphological and behavioral phenotypes in birds-of-paradise. Shenzhen, 26th October, 2018
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67... (GigaScience, BGI Hong Kong)
Lisa Johnson's talk at the #ICG13 GigaScience Prize Track: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Shenzhen, 26th October 2018
Reproducible method and benchmarking publishing for the data (and evidence) d... (GigaScience, BGI Hong Kong)
Scott Edmunds presentation on: Reproducible method and benchmarking publishing for the data (and evidence) driven era. The Silk Road Forensics Conference, Yantai, 18th September 2018
Mary Ann Tuli: What MODs can learn from Journals – a GigaDB curator’s perspec... (GigaScience, BGI Hong Kong)
Mary Ann Tuli's talk at the International Society of Biocuration meeting: What MODs can learn from Journals – a GigaDB curator’s perspective. Shanghai, 9th April 2018
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
DevOps and Testing slides at DASA Connect
Kari Kakkonen
Slides by Rik Marselis and me at the DASA Connect conference, 30.5.2024. We discuss what testing is, then what agile testing is, and finally what Testing in DevOps means. We closed with a lovely workshop in which the participants explored different ways to think about quality and testing in the different parts of the DevOps infinity loop.
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
Scott Edmunds: Data publication in the data deluge
1. Data publication in the data deluge
Scott Edmunds, GigaScience/BGI Hong Kong
COASP 2012, Budapest, 20th September 2012
www.gigasciencejournal.com
2. The Data Challenge:
• 1.2 zettabytes (10²¹ bytes) of electronic data generated globally each year
• Exponential growth of genomics data (with growth in imaging and MS data following)
Source: http://www.genome.gov/sequencingcosts/ (with apologies)
• Issues with reproducibility, hosting, curation, interoperability
• Need for better incentives to overcome these
Source: 1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.
3. Large-Scale Data Journal/Database
In conjunction with:
Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Commissioning Editor: Nicole Nogoy, PhD
Lead Curator: Tam Sneddon, DPhil
Data Platform: Peter Li, PhD
www.gigasciencejournal.com
4. GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological and biomedical research as it enters the era of “big data”…
6. Anatomy of a Publication
[Diagram: Idea → Study → Metadata → Data → Analysis → Answer]
7. Anatomy of a Data Publication
[Diagram: Idea → Study → Metadata → Data → Analysis → Answer]
8. Issues for Data Publication
[Diagram: Idea → Study → Metadata → Data → Analysis → Answer, annotated with cultural and technical issues]
9. Issues for Data Publication
Cultural issues: adoption held back by journal policies, citation, tracking…
[Diagram: Idea → Study → Metadata → Data → Analysis → Answer]
* T-shirts available from Graham Steel / http://www.zazzle.co.uk/steelgraham
10. Issues for Data Publication
Technical issues: what do we do with the data?
[Diagram: Idea → Study → Metadata → Data → Analysis → Answer]
11. Issues for Data Publication
Technical issues: what do we do with the data?
Lightweight:
• Metadata-only journals
• Get someone else to host
Heavyweight:
• Become a repository
12. To host or not to host?
Against: the supplementary files argument
The Journal of Neuroscience, announcement regarding supplemental material: “Beginning November 1, 2010, The Journal of Neuroscience will no longer allow authors to include supplemental material when they submit new manuscripts and will no longer host supplemental material on its web site for those articles.”
“While the size of articles has grown gradually over the past decade, the supplemental material associated with a typical Journal article appears to be growing exponentially and is rapidly approaching the size of an article. The sheer volume of supplemental material is adversely affecting peer review.”
[Figure: average size of a Journal of Neuroscience article and its supplemental material, in megabytes]
Maunsell J J. Neurosci. 2010;30:10599-10600
13. $1000 genome = million $ peer-review?
To review: (>6TBp, >1500 datasets)
S3 (storage) = $15,000
EC2 (analysis w/ BLASTx) = $500,000
Source: Folker Meyer/Wilkening et al. 2009, CLUSTER'09. IEEE International Conference on Cluster Computing and Workshops
14. $1000 genome = million $ peer-review?
To review: (>6TBp, >1500 datasets)
S3 (storage) = $15,000
EC2 (analysis w/ BLASTx) = $500,000
Source: Folker Meyer/Wilkening et al. 2009, CLUSTER'09. IEEE International Conference on Cluster Computing and Workshops
ENCODE analysis Virtual Machine, containing input data, code bundles with scripts and processing steps, and outputs:
AWS = ~$5,000
Source: James Taylor / http://encodeproject.org/ENCODE/integrativeAnalysis/VM
15. To host or not to host?
For: reproducibility
The Guardian, 14th September 2012: Replication is the only solution to scientific fraud.
http://www.guardian.co.uk/commentisfree/2012/sep/14/solution-scientific-fraud-replication
For: “data is the new oil”
William Gibson: “Information is the currency of the future world”
Sir Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves”
Move compute to the data: think EC2 rather than S3
DNAnexus raised $15 million and teamed up with Google to host 0.5 PB of SRA data
Source: DNAnexus/SRA http://techcrunch.com/2011/10/12/dnanexus-raises-15-million-teams-with-google-to-host-massive-dna-database/
18. For data citation to work, needs:
1. Proven utility/potential user base.
2. Acceptance/inclusion by journals.
3. Data+Citation: inclusion in the references.
4. Tracking by citation indexes.
5. Usage of the metrics by the community…
19. Data citation 1: utility/user base.
Establishment of data DOIs and use by databases:
Shackleton NJ, Hall MA, Vincent E (2001): Mean stable carbon isotope ratios of Cibicidoides wuellerstorfi from sediment core MD95-2042 on the Iberian margin, North Atlantic. PANGAEA - Data Publisher for Earth & Environmental Science. http://doi.pangaea.de/10.1594/PANGAEA.58229
Cited in:
Pahnke K, Zahn R: Southern Hemisphere Water Mass Conversion Linked with North Atlantic Climate Variability. Science 2005, 307:1741-1746.
Nocek B, Xu X, Savchenko A, Edwards A, Joachimiak A. 2007. PDB ID: 2P06 Crystal structure of a predicted coding region AF_0060 from Archaeoglobus fulgidus DSM 4304. doi:10.2210/pdb2p06/pdb.
Cited in:
Andreeva A, Howorth D, Chandonia J-M, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008, 36:D419-425.
20. BGI Datasets Get DOI®s
(Legend: released pre-publication; ant genome paper published in GigaScience)
Invertebrates: ants (Florida carpenter ant, Jerdon’s jumping ant, leaf-cutter ant), roundworm, Schistosoma, silkworm
Microbes: E. coli O104:H4 TY-2482, T2D gut metagenome
Vertebrates: Darwin’s finch, giant panda, macaque (Chinese rhesus, crab-eating), mini-pig, naked mole rat, parrot (Puerto Rican), penguin (emperor, Adélie), pigeon (domestic), polar bear, sheep, Tibetan antelope
Cell lines: Chinese hamster ovary, mouse methylomes
Human: Asian individual (YH): DNA methylome, genome assembly, transcriptome; cancer (14 TB); single-cell bladder cancer; HBV-infected exomes; ancient DNA (Saqqaq Eskimo, Aboriginal Australian)
Plants: Chinese cabbage, cucumber, foxtail millet, pigeonpea, potato, sorghum
21. Our first DOI:
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011): Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
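A dataset DOI like the one above can be rendered both in the `doi:` form used in the reference list and as a resolvable URL. As a rough sketch (the helper below is hypothetical, not a GigaDB or DataCite API; the citation layout is the one used on this slide):

```python
# Hypothetical helper: assemble a dataset citation string of the form
# "Authors (Year) Title. Publisher. doi:DOI", plus a resolvable URL.

def format_data_citation(authors, year, title, publisher, doi):
    """Return (citation, url) for a dataset DOI such as 10.5524/100001."""
    citation = f"{authors} ({year}) {title}. {publisher}. doi:{doi}"
    url = f"http://dx.doi.org/{doi}"  # resolves via the DOI handle system
    return citation, url

citation, url = format_data_citation(
    authors="Li, D; Xi, F; et al.",  # author list abbreviated for brevity
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)
print(citation)
print(url)
```

Either form identifies the same object; the URL form simply delegates lookup to the DOI resolver.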
22.-24. [Image-only slides]
25. 1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastrointestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin-producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.
31. Is the DOI…
* Certain types of genomics data must also be deposited in INSDC databases (SRA & GenBank).
32. And in more journals…
Hodkinson BP, Uehling JK, Smith ME (2012) Data from: Lepidostroma vilgalysii, a new basidiolichen from the New World. Dryad Digital Repository. doi:10.5061/dryad.j1g5dh23
Cited in:
Hodkinson BP, Uehling JK, Smith ME: Lepidostroma vilgalysii, a new basidiolichen from the New World. Mycological Progress 2012. Advance Online Publication.
Roberts SB (2012) Herring Hepatic Transcriptome 34300 contigs.fa. Figshare. Available: hdl.handle.net/10779/084d34370fbda29bbc67b3c5ecb02575. Accessed 2012 Jan 20.
Cited in:
Roberts SB, Hauser L, Seeb LW, Seeb JE (2012) Development of Genomic Resources for Pacific Herring through Targeted Transcriptome Pyrosequencing. PLoS ONE 7(2): e30908. doi:10.1371/journal.pone.0030908
33. For data citation to work, needs:
1. Proven utility/potential user base. ✔
2. Acceptance/inclusion by journals. ✔
3. Data+Citation: inclusion in the references. ✔
4. Tracking by citation indexes.
5. Usage of the metrics by the community…
35. Data citation 4: tracking?
✗ FAIL
DataCite metadata is available in harvestable form (OAI-PMH), and Google Scholar lists some DataCite DOIs, but says the datasets listed are the “result of approximations in the indexing algorithms” and that “Google Scholar's intended coverage is for scholarly articles. At this point, we don't include datasets.”
36. Data citation 4: tracking?
✗FAIL
DataCite metadata in harvestable form (OAI-PMH)
✗ Working on it. Coming soon…
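The harvestable form mentioned above is the standard OAI-PMH protocol. As an illustrative sketch of how an indexer could pull DataCite DOI metadata in bulk, the snippet below builds a `ListRecords` request URL (the endpoint `oai.datacite.org/oai` and the `oai_dc` metadata format follow DataCite's OAI-PMH service; the helper function itself is hypothetical):

```python
# Sketch: constructing an OAI-PMH v2.0 ListRecords request against
# DataCite's harvesting endpoint, the bulk route by which citation
# indexes could pick up dataset DOI metadata.
from urllib.parse import urlencode

OAI_BASE = "https://oai.datacite.org/oai"

def list_records_url(metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Return a ListRecords request URL for the DataCite OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvest: records changed since this date
    if set_spec:
        params["set"] = set_spec    # e.g. restrict to one data centre's records
    return f"{OAI_BASE}?{urlencode(params)}"

print(list_records_url(from_date="2012-01-01"))
```

Fetching that URL returns an XML page of records plus a resumption token for paging through the full set, per the OAI-PMH specification.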
37.
38. Data citation 5: metrics?
“As a result of diverse practices and tool
limitations, data citations are currently very
difficult to track.”
39. Data citation 5: metrics?
✗ FAIL
Research Remix, 29th May 2012: http://researchremix.wordpress.com/2012/05/29/dear-research-data-advocate-please-sign-the-petition-oamonday/
“I’m afraid we are making promises to data creators about attribution and reward that we can’t keep. ‘Make your data citeable!’ is the cry. OK. So citeable is step one. Cited is step two. But for the citation to be useful, it has to be indexed so that citation metrics can be tracked and admired and used. Who is indexing data citations right now? As far as I can tell: absolutely no one.”
40. Where data citation is in 2012:
1. Proven utility/potential user base. ✔
2. Acceptance/inclusion by journals. ✔
3. Data+Citation: inclusion in the references. ✔
4. Tracking by citation indexes. ✔/✗
5. Usage of the metrics by the community… ✗
42. Addressing the reproducibility gap:
Computable methods/workflow systems
[Diagram: bioinformatics development → biomedical and bioinformatics research → publishing]
43. Redefining what a paper is in the era of big data
Goal: executable research objects, with a citable DOI
46. Methods + Data + Publication
• Background
• Methods (DOI for workflows?)
• Results (Data): doi:10.5524/100035
• Conclusions/Discussion
Paper: doi:10.1186/2047-217X-1-3
47. Data + Methods = Analysis
doi:10.5524/100035 + DOI: X = doi:10.1186/2047-217X-1-3
DOI: A + DOI: X = DOI: 1
48. Data + Methods = Analysis
DOI: A + DOI: X = DOI: 1
DOI: B + DOI: X = DOI: 2
49. Data + Methods = Analysis
DOI: A + DOI: X = DOI: 1
DOI: B + DOI: X = DOI: 2
DOI: A + DOI: Y = DOI: 3
50. Data + Methods = Analysis
DOI: A + DOI: X = DOI: 1
DOI: B + DOI: X = DOI: 2
DOI: A + DOI: Y = DOI: 3
A, B, C… + X, Y, Z… = 4, 5, 6…
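The combinatorics on these slides, where the same data DOI reused with different method DOIs yields distinct citable analyses, can be sketched as a simple provenance lookup (illustrative only; apart from the two DOIs shown on the slides, the DOI values are placeholders):

```python
# Illustrative sketch (not a real registry): the slides' pattern of
# data DOI + methods DOI = analysis DOI as a provenance lookup.
analyses = {}  # (data_doi, methods_doi) -> analysis_doi

def register_analysis(data_doi, methods_doi, analysis_doi):
    """Record that an analysis object was produced from data + methods."""
    analyses[(data_doi, methods_doi)] = analysis_doi

# The concrete pairing shown on the slides:
register_analysis("doi:10.5524/100035", "DOI:X", "doi:10.1186/2047-217X-1-3")
# Placeholder DOIs, as on the slides: the same methods on new data, and
# new methods on the same data, each yield a new citable analysis.
register_analysis("DOI:A", "DOI:X", "DOI:1")
register_analysis("DOI:B", "DOI:X", "DOI:2")
register_analysis("DOI:A", "DOI:Y", "DOI:3")

def analyses_using_data(data_doi):
    """All analyses derived from one dataset, whatever the methods."""
    return [a for (d, _m), a in analyses.items() if d == data_doi]

print(analyses_using_data("DOI:A"))  # both reuses of dataset A
```

Because each component carries its own DOI, reuse of a dataset or a workflow stays trackable across every analysis built on it, which is exactly the incentive argument the slides are making.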
52. Different shaped publishable objects
Different levels of granularity:
• Experiment (e.g. ACRG project): e.g. doi:10.5524/100001 → Papers
• Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2 → Data/Micropubs
• Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz
• Smaller still? Facts/Assertions (~10¹³ in the literature) → Nanopubs
53. Adding “value” to data publishing
• Scope for different shaped publishable objects
• Scope for publishing methods/executable papers
• Peer review of data is problematic:
– Post-publication peer review
– Change criteria (assess on transparency/access only)
– Better use of workflows/cloud/VMs
DOIs are cheap*, data is precious: maximise its use
* ish
54. Adding “value” to data publishing
DOIs are cheap*, data is precious: maximise its use
* ish Source: Ross Mounce CC-BY http://rossmounce.co.uk/2012/09/04/the-gold-oa-plot-v0-2/
55. Thanks to:
The GigaScience team: Laurie Goodman, Tam Sneddon, Nicole Nogoy, Alexandra Basford, Peter Li, Jesse Si Zhe
Collaborators: Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST), Cogini
editorial@gigasciencejournal.com
Contact us: database@gigasciencejournal.com
@gigascience
Follow us: facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/
www.gigadb.org
www.gigasciencejournal.com
Editor's Notes
and an advanced search option…
Raw data has been submitted to the SRA, the assembly submitted to GenBank (no number), SV data to dbVar (it’s the first plant data they’ve received). Complements the traditional public databases by having all these “extra” data types, it’s all in one place, and it’s citable.
Leading on from that, current and future plans include collaborating with Tin-Lap Lee at the Chinese University of Hong Kong to integrate an instance of the Galaxy bioinformatics platform with GigaDB, so users can make full use of the data in GigaDB by linking it to other resources and we can incorporate fully executable papers. One such submission is a new SOAPdenovo pipeline: the SOAP tools have been wrapped in Galaxy, the workflow defined in MyExperiment, and the data will be issued with a DOI and accessible via GigaDB. Utilizing the BGI cloud if necessary, users will then be able to reproduce all the steps described in the GigaScience paper to test, reanalyze, compare results, etc. Since we would like GigaDB to be a host for data types that have no other home, such as imaging data, we are investigating adding other tools, such as an image viewer, to support accessibility to and usability of the data. So, if you have a large-scale biological or biomedical dataset and/or a pipeline or software that you would like to submit to GigaScience, we would love to hear from you, so please come and talk to Scott or myself.
That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse; BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan and the Cogini web design team; DataCite for providing the DOI service; and the isacommons team for their support and advocacy for best-practice use of metadata reporting and sharing. Thank you for listening.