1. WHAT'S AHEAD FOR BIOLOGY? THE DATA-INTENSIVE FUTURE
C. Titus Brown
ctb@msu.edu
Assistant Professor, Michigan State University
(In January, moving to UC Davis / VetMed.)
Talk slides on slideshare.net/c.titus.brown
3. The short version
• Data gathering & storage are growing by leaps and bounds!
• Biology is completely unprepared for this at every level:
• Technical and infrastructure
• Cultural
• Training
• Our funding/incentivization/prioritization structures are also largely unprepared.
• This is a huge missed opportunity!!
(What does Titus think we should be doing?)
4. Challenges:
1. Dealing with Big Data (my current research)
2. Interpreting the unknowns (future research)
3. Accelerating research with better data/methods/results sharing.
4. Expanding the role of exploratory data analysis in biology. (career windmill)
5. 1. Dealing with Big Data
A. Lossy compression
B. Streaming algorithms
7. Some basic math:
• 1000 single cells from a tumor…
• …sequenced to 40x haploid coverage with Illumina…
• …yields 120 Gbp per cell…
• …or 120 Tbp of data.
• HiSeq X10 can do the sequencing in ~3 weeks.
• The variant calling will require 2,000 CPU weeks…
• …so, given ~2,000 computers, we can do this all in one month.
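A quick sanity check of this arithmetic, as a minimal Python sketch; the ~3 Gbp haploid human genome size is an assumption implicit in the "40x haploid" figure:

```python
# Back-of-the-envelope check of the numbers on this slide.
# Assumes a ~3 Gbp haploid human genome (implicit in "40x haploid").
haploid_genome_bp = 3e9
coverage = 40
cells = 1000

bp_per_cell = haploid_genome_bp * coverage        # 1.2e11 bp
total_bp = bp_per_cell * cells                    # 1.2e14 bp

print(f"{bp_per_cell / 1e9:.0f} Gbp per cell")    # -> 120 Gbp per cell
print(f"{total_bp / 1e12:.0f} Tbp total")         # -> 120 Tbp total

# ~3 weeks of sequencing, then 2,000 CPU weeks spread over ~2,000
# computers is ~1 week of wall-clock time: roughly a month end to end.
```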
8. Similar math applies:
• Pathogen detection in blood;
• Environmental sequencing;
• Sequencing rare DNA from circulating blood.
• Two issues:
• Volume of data & compute infrastructure;
• Latency for clinical applications.
9. Approach A: Lossy compression
(Reduce volume of data & compute infrastructure requirements)
[Figure: raw data (~10-100 GB) → compression (~2 GB) → analysis → "information" (~1 GB) → database & integration.]
Lossy compression can substantially reduce data size while retaining information needed for later (re)analysis.
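As a concrete illustration, here is a minimal sketch of the digital normalization idea behind this kind of lossy compression (Brown et al., arXiv, 2012, cited below): discard reads whose k-mers have already been seen often enough. The real khmer software uses a memory-efficient Count-Min Sketch; a plain Python dict stands in for it here.

```python
# Minimal digital-normalization sketch: keep a read only while its
# estimated coverage (median k-mer count so far) is below a cutoff.
# A dict stands in for khmer's memory-efficient Count-Min Sketch.
K = 20        # k-mer size
CUTOFF = 20   # target coverage

def kmers(read, k=K):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def median_count(read, counts):
    vals = sorted(counts.get(km, 0) for km in kmers(read))
    return vals[len(vals) // 2] if vals else 0

def diginorm(reads, cutoff=CUTOFF):
    counts = {}
    for read in reads:
        if median_count(read, counts) < cutoff:
            for km in kmers(read):
                counts[km] = counts.get(km, 0) + 1
            yield read  # retained; redundant high-coverage reads are dropped
```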
21. e.g., de novo assembly now scales with richness, not diversity.
• 10-100 fold decrease in memory requirements
• 10-100 fold speed up in analysis
Brown et al., arXiv, 2012
22. Hey, cool, our approach and software are used by Illumina for long-read sequencing!
23. Our general strategy: compressive prefilters
[Figure: the same pipeline; raw data (~10-100 GB) is saved in cold storage, the compressed version (~2 GB) is saved for reanalysis and investigation, and analysis yields "information" (~1 GB) for database & integration.]
24. Approach B: streaming data analysis
(Reduce latency for clinical applications)
[Figure: Data → 1-pass → Answer.]
See also eXpress, Roberts et al., 2013.
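A toy sketch of the one-pass pattern, reusing the k-mer counting idea from above; the stopping rule and thresholds here are invented for illustration and are not the talk's actual algorithm:

```python
# One-pass streaming sketch: each read is seen exactly once, a small
# summary is updated in place, and we stop as soon as the stream looks
# saturated -- cutting latency versus store-everything-then-analyze.
def kmers(read, k=20):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def one_pass(read_stream, cutoff=20, check_every=10_000):
    counts, seen, redundant = {}, 0, 0
    for read in read_stream:
        seen += 1
        vals = sorted(counts.get(km, 0) for km in kmers(read))
        if vals and vals[len(vals) // 2] >= cutoff:
            redundant += 1        # read carries little new information
        else:
            for km in kmers(read):
                counts[km] = counts.get(km, 0) + 1
        # stop early once nearly all incoming reads are redundant
        if seen % check_every == 0 and redundant / seen > 0.95:
            break
    return counts                 # the one-pass "answer"
```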
30. So, how do we deal with Big Data issues?
• Fairly record cost of data analysis (running software & cost of computational infrastructure)
• This incentivizes development of better approaches!
• Lossy compression, streaming, …??
• Think 5 years ahead, rather than 2 years behind!
• Pay attention to workflows, software lifecycle, etc. etc.
(See ABiC 2014 talk :)
31. 2. Dealing with the unknowns
[Figure: compare sample to control (~millions of differences) → eliminate all the things we don't know how to interpret (~10s of thousands remain) → wonder why we can only account for ~50% of the effect.]
32. “What is the function of ….?”
We can observe almost everything at a DNA/RNA level!
But,
• Experimentally based functional annotations are sparse;
• Most genes play multiple roles but are generally annotated for only one;
• Model organisms are phylogenetically quite limited and biased;
• …and there is little or no $$$ or reputation gain for characterizing novel genes (nor is it straightforward or easy to do so!)
33. The problem of lopsided gene characterization:
e.g., the brain "ignorome"
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains.
The major distinguishing characteristic between these sets of genes is date of discovery, early
discovery being associated with greater research momentum—a genomic bandwagon effect."
Slide courtesy of Erich Schwarz. Ref.: Pandey et al. (2014), PLoS ONE 9(2), e88889.
34. How do we systematically broaden our
functional understanding of genes?
1. More experimental work!
• Population studies, perturbation studies, good ol' fashioned molecular biology, etc.
2. Integrate modeling, to see where we have (or lack) sufficient knowledge for a particular phenotype.
3. Sequence it all and let the bioinformaticians sort it out!
What I think will work best: a tight integration between all three approaches (cf. physics) – hypothesis-driven investigation, modeling, and exploratory data science.
See also: ivory.idyll.org/blog/2014-function-of-unknown-genes.html
35. 3. Accelerating research with better sharing of results, data, methods.
Our current journal system is a 20th century solution to a 17th century problem.
- Paraphrased from Cameron Neylon
(Note: 20th century was LAST century)
36. 3. Accelerating research with better sharing of results, data, methods.
We could accelerate research with better sharing.
A recent example regarding rare diseases:
http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2
"The current academic publication system does patients an enormous disservice." – Daniel MacArthur
There are many barriers to better communication of results, data, and methods, but most of them are cultural, not technical. (Much harder!)
37. Preprints
• Many fields (including bioinformatics and, increasingly, genomics) routinely share papers prior to publication. This facilitates reproduction, dissemination, and ultimately progress.
• Biology is behind the times!
See:
1. Haldane's Sieve (blog discussion of preprints)
2. Evidence that preprints confer a massive citation advantage in physics (http://arxiv.org/abs/0906.5418)
38. Current model for data sharing
In a data-limited world, this kind of made sense.
[Figure: gather data → interpret data → write & publish paper → grudgingly share as little data as possible.]
39. Current model for data sharing
This model ignores the fact that data often has multiple (unrealized or serendipitous) uses.
(Among many other problems ;)
[Figure: the same pipeline: gather data → interpret data → write & publish paper → grudgingly share as little data as possible.]
40. The train wreck ahead
When data is cheap and interpretation is expensive, most data doesn't get published, and is therefore lost.
[Figure: gather data → interpret data → write & publish paper → grudgingly share as little data as possible.]
(Program managers are not fans of this.)
41. Data sharing challenges
• Little immediate or career reward for sharing data; incentives are almost entirely punitive (if you DON'T…)
• Sharing data in a usable form is still rather difficult.
• Submitting data to archival services is, in many cases, surprisingly difficult.
• Few methods exist for gaining recognition for data sharing prior to publication of conclusions.
42. The Ocean Cruise Model
One really expensive cruise, many data collectors, shared data.
DeepDOM – photo courtesy E. Kujawinski, WHOI
43. Sage Bionetworks / “walled garden”
Collaborative data sharing policy with restricted access to outsiders;
Central platform with analysis provenance tracking;
A model for the future of biomedical research?
See, e.g., "Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas," Omberg et al., 2014.
44. Distributed cyberinfrastructure to encourage sharing?
[Figure: a web interface + API sits over a graph query layer connecting public servers, a "walled garden" server, and a private server, each holding raw data sets and data/info; a compute server (Galaxy? Arvados?) runs analyses; users upload/submit to NCBI or KBase and import from MG-RAST, SRA, or EBI.]
ivory.idyll.org/blog/2014-moore-ddd-talk.html
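To make the sketch concrete, here is a hypothetical client call against such a graph query layer; the endpoint, query shape, and response fields are all invented for illustration and do not correspond to any existing service:

```python
# Hypothetical client for the proposed graph query layer. The URL,
# query format, and fields are invented for illustration only.
import json
from urllib.request import Request, urlopen

QUERY_URL = "https://graph.example.org/api/v1/query"  # placeholder

def find_samples(phenotype, min_coverage=40):
    """Ask the graph layer which servers hold matching data sets."""
    query = {
        "match": {"phenotype": phenotype, "coverage_gte": min_coverage},
        "return": ["sample_id", "server", "access"],  # public/walled/private
    }
    req = Request(QUERY_URL, data=json.dumps(query).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

# e.g. find_samples("tumor") might return pointers to public data plus
# walled-garden entries that require an access request.
```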
45. Better metadata collection is needed!
Suppose the NSA could EITHER track who was calling whom, OR what they were saying – which would be more valuable?
47. Better metadata collection is needed!
We need to track sample origin, phenotype/environmental conditions, etc.
[Figure: sample information ↔ the -omic data ↔ phenotype.]
This will facilitate discovery, serendipity, re-analysis, and cross-validation.
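As a purely hypothetical illustration, a minimal per-sample record tying together the three elements above might look like this; the field names are invented and do not follow any particular metadata standard:

```python
# Hypothetical minimal metadata record linking sample information,
# the -omic data, and phenotype. All field names are illustrative.
sample_record = {
    "sample_id": "S-0001",                 # invented identifier
    "sample_info": {
        "origin": "blood draw, human subject",
        "collected": "2014-08-01",
        "environment": {"site": "clinic", "storage": "-80C"},
    },
    "omic_data": {
        "type": "WGS",
        "coverage": 40,
        "location": "sra://SRR…",          # placeholder accession
    },
    "phenotype": {"diagnosis": "unknown"},
}
```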
48. Data and software citation
There are now methods for:
• assigning DOIs to data (which makes it citable) – figshare, Dryad;
• data publications – GigaScience, SIGS, Scientific Data;
• software citations – Zenodo, MozSciLab/GitHub;
• software publications – F1000 Research.
Will this address the need to incentivize sharing of data and methods? Probably not, but it's a good start ;)
50. New model
Your data is most useful when combined with everyone else's.
[Figure: gather data → interpret data, with your data flowing into a network of databases alongside many other people's data.]
51. Given enough publicly accessible data…
[Figure: the same network of databases and other people's data, now feeding interpretation directly; no new data gathering is required.]
52. But: we face a lack of training.
The lack of training in data science is the biggest challenge facing biology.
Students! There's a great future in data analysis!
Also see:
53. Data integration?
Once you have all the data, what do you do?
"Business as usual simply cannot work."
Looking at millions to billions of genomes.
(David Haussler, 2014)
Illumina estimate: 228,000 human genomes will be
sequenced in 2014, mostly by researchers.
http://www.technologyreview.com/news/53
1091/emtech-illumina-says-228000-
human-genomes-will-be-sequenced-this-year/
54. Looking to the future
For the senior scientists and funders amongst us:
• How do we incentivize data sharing and training?
• How do we fund the meso- and micro-scale cyberinfrastructure development that will accelerate bio?
The NIH and NSF are exploring this; the Moore and Sloan foundations are simply doing it (but at 1% of the size).
See: ivory.idyll.org/blog/2014-nih-adds-up-meeting.html
56. For Australian students and early career researchers in bioinformatics and computational biology
Annual Student Symposium
Friday 28th November 2014
Parkville, Victoria
Now accepting abstracts for talks and posters
Talk abstracts close 31st October
combine.org.au