Scott Edmunds: Data Dissemination in the era of "Big-Data"

Bio-IT World Asia Meeting, 7th June 2012 Scott Edmunds

Data dissemination in the era of “big data”
William Gibson: "Information is the currency of the future world”

Sir Tim Berners-Lee: "Data is a precious thing and will last longer than the systems
themselves”

www.gigasciencejournal.com

Is data “the new oil”?
1.2 zettabytes (1021) of electronic data generated each year1

Data
Deluge?

1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.

Global Sequencing Capacity

Data Production
5.6 Tb / day
> 1500X of human genome / day

Multiple Supercomputing Centers
157 TB Flops
20 TB Memory
14.7 PB Storage

BGI Sequencing Capacity

Sequencers Data Production
137 Illumina/HiSeq 2000 5.6 Tb / day
27 LifeTech/SOLiD 4 > 1500X of human genome / day
1 454 GS FLX+ 137

2 Illumina iScan Multiple Supercomputing Centers
1 Illumina MiSeq 157 TB Flops
1 Ion Torrent 20 TB Memory
14.7 PB Storage

Now taking submissions…

Large-Scale Data:
Journal/Database/Platform
In conjunction with:

Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD
Lead BioCurator: Tam Sneddon, Dphil
Data Platform: Peter Li, PhD

Data Silo’s

Interoperability
Paywalls
Metadata
$ ©

There are many hurdles…

?

There are many hurdles…

Technical: too large volumes
too heterogeneous
no home for many data types
too time consuming

Cultural: inertia
no incentives to share
unaware of how
?

Technical challenges…
Better handling of metadata…
Novel tools/formats for data interoperability/handling.
Cloud
solutions?

Tools making work more easily reproducible…

Interoperability/Ease of use Workflows

Data quality assessment

More efficient handling of data…

Cloud?

Do we need to keep everything?

Compression?

Data Re-use
Effort

($)

Usability

Need to lower the hurdles…
Effort

($)

Usability

Better incentives?
Effort

($)

Usability

Incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to
public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a
particular data set would enable appropriate attribution for those
who share. “
Nature Biotechnology 27, 579 (2009)

Prepublication data sharing
(Toronto International Data Release Workshop)
“Data producers benefit from creating a citable reference, as it can
later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)

Datacitation: Datacite and DOIs
Digital Object Identifiers (DOIs)


offer a solution

 Mostly widely used identifier for Dataset
scientific articles Yancheva et al (2007). Analyses on
 Researchers, authors, publishers sediment of Lake Maar. PANGAEA.
know how to use them doi:10.1594/PANGAEA.587840
 Put datasets on the same playing
field as articles

“increase acceptance of research data as
Aims to: legitimate, citable contributions to the
scholarly record”.

“data generated in the course of research
are just as valuable to the ongoing academic
discourse as papers and monographs”.

Datacitation: Datacite and DOIs
Central metadata repository:
• >1 million entries to date

• Stability

• Data discoverability

• Open & harvestable

• Potential to track &
credit use

Data publishing/DOI
New journal format combines standard manuscript
publication with an extensive database to host all
associated data, and integrated tools.
 Data hosting will follow standard funding agency
and community guidelines.
DOI assignment available for submitted data to
allow ease of finding and citing datasets, as well as for
citation tracking.

Data Publishing

www.gigaDB.org

BGI Datasets Get DOI®s
Invertebrate
Many released pre-publication…
Ant PLANTS
- Florida carpenter ant Chinese cabbage
Vertebrates
- Jerdon’s jumping ant Cucumber
Giant panda Macaque
- Leaf-cutter ant Foxtail millet
- Chinese rhesus
Roundworm Pigeonpea
- Crab-eating
Schistosoma Potato
Mini-Pig
Silkworm Sorghum
Naked mole rat
Penguin
Human - Emperor penguin
Asian individual (YH) - Adelie penguin
- DNA Methylome Pigeon, domestic
- Genome Assembly Polar bear
- Transcriptome Sheep
doi:10.5524/100004

Cancer (14TB) Tibetan antelope
Ancient DNA Microbe
- Saqqaq Eskimo E. Coli O104:H4 TY-2482
- Aboriginal Australian
Cell-Line
Chinese Hamster Ovary

For data citation to work, needs:

• Proven utility/potential user base.

• Acceptance/inclusion by journals.

• Data+Citation: inclusion in the references.

• Tracking by citation indexes.

• Usage of the metrics by the community…

Data+Citation: inclusion in the references

• Data submitted to NCBI databases:
- Raw data SRA:SRA046843
- Assemblies of 3 strains Genbank:AHAO00000000-AHAQ00000000
- SNPs dbSNP:1056306
- CNVs
-
-
InDels
SV
} dbVAR:nstd63

• Submission to public databases complemented by
its citable form in GigaDB (doi:10.5524/100012).

Datacitation: tracking?
DataCite metadata in harvestable form (OAI-PMH)

Plans in 2012 to link central metadata repository with WoS

- Will finally track and credit use!

To be continued…

Our first DOI:

To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang,
Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun,
Y; Chen,Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song,Y ;
Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482
isolate genome sequencing consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen.
doi:10.5524/100001
http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

“The way that the genetic data of the 2011 E. coli strain were disseminated
globally suggests a more effective approach for tackling public health
problems. Both groups put their sequencing data on the Internet, so scientists
the world over could immediately begin their own analysis of the bug's
makeup. BGI scientists also are using Twitter to communicate their latest
findings.”

“German scientists and their colleagues at the Beijing Genomics Institute in China have
been working on uncovering secrets of the outbreak. BGI scientists revised their draft
genetic sequence of the E. coli strain and have been sharing their data with dozens of
scientists around the world as a way to "crowdsource" this data. By publishing their data
publicy and freely, these other scientists can have a look at the genetic structure, and try
to sort it out for themselves.”

Downstream consequences:
1. Therapeutics (primers, antimicrobials) 2. Platform Comparisons (Loman et al., Nature Biotech 2012)

3. Speed/legal-freedom

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli
strain that infected roughly 4,000 people in Germany between May and July. But he knew it that might take days
for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could
use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that
allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and
publish their work without wasting time on legal wrangling.”

The era of the data consumer?

?

The era of the data consumer?
Free access to data – but analysis hubs/nodes for will form around it

?

GDSAP: Genomic Data Submission
and Analytical platform
Big data
from the
Data, Data, Data… “Sequencing
Oil Field”

Data
Modeling

Pipeline
design
Tin-Lap Lee, CUHK

Validation

Commercial
applications

“Apps”


mirror/open platform

Papers in the era of big-data
$1000 genome = million $ peer-review?

To review: (>6TBp, >1500 datasets)

S3 = $15,000
EC2 (BLASTx) = $500,000
Source: Folker Meyer/Wilkening et al. 2009, CLUSTER'09. IEEE International Conference on Cluster Computing and Workshops

goal: Executable Research Objects

Citable DOI

goal: Executable Research Objects

Stage 1: Wilson GA, Dhami P, Feber A, Cortázar D, Suzuki Y, Schulz R, Schär P, Beck S:
Resources for methylome analysis suitable for gene knockout studies of
potential epigenome modifiers. GigaScience 2012, 1:3. (in press)
GigaDB hosting all data + tools (84GB total): doi:10.5524/100035
+
Partial (~80%) integration of workflow into our data platform.
(all the data processing steps, but not the enrichment analysis)

Stage 2: Papers fully integrating all data + all workflows in our platform.

Interested in Reproducible Research?
Take part in our session on: “Cloud and workflows for reproducible bioinformatics”

Submit to:
• Rapid review/Open Access/High-visibility
• Article Processing Charge covered by BGI
• Hosting of any test datasets/workflows in GigaDB

Thanks to:
Laurie Goodman Alexandra Basford
Tam Sneddon Peter Li
Tin-Lap Lee (CUHK) Qiong Luo (HKUST)
scott@gigasciencejournal.com
Contact us:
editorial@gigasciencejournal.com

@gigascience

Follow us: facebook.com/GigaScience

blogs.openaccesscentral.com/blogs/gigablog/


Scott Edmunds: Data Dissemination in the era of "Big-Data"

More Related Content

What's hot

Similar to Scott Edmunds: Data Dissemination in the era of "Big-Data"

More from GigaScience, BGI Hong Kong

Recently uploaded

Scott Edmunds: Data Dissemination in the era of "Big-Data"

Editor's Notes