WHAT’S AHEAD FOR BIOLOGY? 
THE DATA-INTENSIVE FUTURE 
C. Titus Brown 
ctb@msu.edu 
Assistant Professor, Michigan State University 
(In January, moving to UC Davis / VetMed.) 
Talk slides on slideshare.net/c.titus.brown
The Data Deluge 
(a traditional requirement for these talks)
The short version 
• Data gathering & storage are growing by leaps & bounds! 
• Biology is completely unprepared for this at every level: 
• Technical and infrastructure 
• Cultural 
• Training 
• Our funding/incentivization/prioritization structures are also 
largely unprepared. 
• This is a huge missed opportunity!! 
(What does Titus think we should be doing?)
Challenges: 
1. Dealing with Big Data (my current research) 
2. Interpreting the unknowns (future research) 
3. Accelerating research with better data/methods/ 
results sharing. 
4. Expanding the role of exploratory data 
analysis in biology. (career windmill)
1. Dealing with Big Data 
A. Lossy compression 
B. Streaming algorithms
Looking forward 5 years… 
Navin et al., 2011
Some basic math: 
• 1000 single cells from a tumor… 
• …sequenced to 40x haploid coverage with Illumina… 
• …yields 120 Gbp per cell… 
• …or 120 Tbp of data in total. 
• A HiSeq X Ten can do the sequencing in ~3 weeks. 
• The variant calling will require ~2,000 CPU weeks… 
• …so, given ~2,000 computers, the whole thing can be done in about a month.
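
To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch (the ~3 Gbp haploid human genome size is my assumption; the other numbers are from the slide):

```python
# Back-of-the-envelope check of the single-cell tumor example.
HAPLOID_GENOME_BP = 3e9   # assumption: ~3 Gbp human haploid genome
CELLS = 1000
COVERAGE = 40             # 40x haploid coverage per cell

bp_per_cell = HAPLOID_GENOME_BP * COVERAGE   # 1.2e11 bp = 120 Gbp
total_bp = bp_per_cell * CELLS               # 1.2e14 bp = 120 Tbp

CPU_WEEKS = 2000          # estimated variant-calling cost
COMPUTERS = 2000
compute_weeks = CPU_WEEKS / COMPUTERS        # ~1 week of wall-clock compute
sequencing_weeks = 3                         # HiSeq X Ten estimate

print(f"{bp_per_cell / 1e9:.0f} Gbp per cell, {total_bp / 1e12:.0f} Tbp total")
print(f"~{sequencing_weeks + compute_weeks:.0f} weeks end to end")
```

Roughly three weeks of sequencing plus a week of compute, hence "one month".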
Similar math applies: 
• Pathogen detection in blood; 
• Environmental sequencing; 
• Sequencing rare DNA from circulating blood. 
• Two issues: 
• Volume of data & compute infrastructure; 
• Latency for clinical applications.
Approach A: Lossy compression 
(Reduce volume of data & compute infrastructure 
requirements) 
[Diagram: raw data (~10-100 GB) → compression (~2 GB) → analysis → "information" (~1 GB each) → database & integration.] 
Lossy compression can substantially 
reduce data size while retaining 
information needed for later (re)analysis.
Lossy compression 
[Image series: progressively stronger JPEG compression – http://en.wikipedia.org/wiki/JPEG]
Digital normalization 
[Image series illustrating digital normalization step by step.]
e.g. de novo assembly now scales with richness, not 
diversity. 
• 10-100 fold decrease in memory requirements 
• 10-100 fold speed up in analysis 
Brown et al., arXiv, 2012
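
The core of digital normalization is simple enough to sketch: stream over the reads once, estimate each read's median k-mer coverage against the reads kept so far, and drop reads that no longer add information. A minimal illustrative sketch follows (an exact Python dictionary stands in for the probabilistic counting structure used in practice; K and CUTOFF are arbitrary choices here):

```python
# Minimal sketch of digital normalization (diginorm) over plain-string reads.
from collections import defaultdict
from statistics import median

K = 20          # k-mer size (illustrative)
CUTOFF = 20     # discard reads whose median k-mer count already reaches this

kmer_counts = defaultdict(int)

def kmers(read, k=K):
    """All k-length substrings of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def keep_read(read):
    """Decide, in a single pass, whether a read still adds coverage."""
    kms = kmers(read)
    if not kms:
        return False                              # read shorter than k
    if median(kmer_counts.get(km, 0) for km in kms) >= CUTOFF:
        return False                              # redundant: region already well covered
    for km in kms:                                # novel enough: count its k-mers
        kmer_counts[km] += 1
    return True

def diginorm(reads):
    """Stream over reads once, keeping only those below the coverage cutoff."""
    return [r for r in reads if keep_read(r)]
```

Downstream assembly then sees far less redundant data, which is where the memory and speed savings come from.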
Hey, cool, our approach and software are used by Illumina for long-read sequencing!
Our general strategy: compressive prefilters 
[Diagram: raw data (~10-100 GB) → compression (~2 GB; save in cold storage) → analysis → "information" (~1 GB; save for reanalysis, investigation) → database & integration.]
Approach B: streaming data analysis 
(Reduce latency for clinical applications) 
[Diagram: data → one-pass analysis → answer.] 
See also eXpress, Roberts et al., 2013.
Current variant calling approaches are multipass 
[Diagram: data → mapping → sorting → calling → answer.]
Streaming graph-based approaches can 
detect information saturation
Approach supports compute-intensive 
interludes – remapping, etc. 
Rimmer et al., 2014
Streaming with bases 
[Diagram: reads enter a graph base by base (k bases, then k+1, k+2, …), and variants are emitted as the stream progresses.]
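
As a rough sketch of how a streaming approach can notice saturation: add each read's k-mers to the graph as it arrives and track how many are new; when fresh reads stop contributing new k-mers, the sample's information content has (approximately) been captured. This is an illustrative sketch only – a plain Python set stands in for the graph, and the thresholds are arbitrary:

```python
# Sketch of one-pass information-saturation detection over a read stream.
K = 20
BATCH_SIZE = 10000
NOVELTY_THRESHOLD = 0.05      # "saturated" when <5% of k-mers in a batch are new

seen_kmers = set()

def is_saturated(read_stream):
    """Single pass over reads; return True once new reads add little new information."""
    batch_total = batch_new = 0
    for n, read in enumerate(read_stream, 1):
        for i in range(len(read) - K + 1):
            km = read[i:i + K]
            batch_total += 1
            if km not in seen_kmers:
                seen_kmers.add(km)
                batch_new += 1
        if n % BATCH_SIZE == 0:
            if batch_total and batch_new / batch_total < NOVELTY_THRESHOLD:
                return True               # saturation reached: stop, or trigger calling
            batch_total = batch_new = 0
    return False
```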
Integrate sequencing and analysis 
[Diagram: sequencing feeds analysis in a loop ("are we done yet?") to decrease latency.]
So, how do we deal with Big Data issues? 
• Fairly record cost of data analysis (running software & 
cost of computational infrastructure) 
• This incentivizes development of better approaches! 
• Lossy compression, streaming, …?? 
• Think 5 years ahead, rather than 2 years behind! 
• Pay attention to workflows, software lifecycle, etc. etc. 
(See ABiC 2014 talk :)
2. Dealing with the unknowns 
[Workflow: compare sample to control → eliminate all the things we don't know how to interpret (~millions → ~10s of thousands) → wonder why we can only account for ~50% of the effect.]
“What is the function of ….?” 
We can observe almost everything at a DNA/RNA level! 
But, 
• Experimentally based functional annotations are sparse; 
• Most genes play multiple roles and are generally 
annotated for only one; 
• Model organisms are phylogenetically quite limited and 
biased; 
• …there is little or no $$$ or reputation gain for characterizing novel genes (nor is it straightforward or easy to do so!)
The problem of lopsided gene characterization: 
e.g., the brain "ignorome" 
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression 
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. 
The major distinguishing characteristic between these sets of genes is date of discovery, early 
discovery being associated with greater research momentum—a genomic bandwagon effect." 
Slide courtesy Erich Schwarz. Ref.: Pandey et al. (2014), PLoS ONE 9(2), e88889.
How do we systematically broaden our 
functional understanding of genes? 
1. More experimental work! 
• Population studies, perturbation studies, good ol’ fashioned 
molecular biology, etc. 
2. Integrate modeling, to see where we have (or lack) 
sufficiency of knowledge for a particular phenotype. 
3. Sequence it all and let the bioinformaticians sort it out! 
What I think will work best: a tight integration of all three approaches (cf. physics) – hypothesis-driven investigation, modeling, and exploratory data science. 
See also: ivory.idyll.org/blog/2014-function-of-unknown-genes.html
3. Accelerating research with better 
sharing of results, data, methods. 
Our current journal system is a 20th century solution to a 
17th century problem. 
- Paraphrased from Cameron Neylon 
(Note: 20th century was LAST century)
3. Accelerating research with better 
sharing of results, data, methods. 
We could accelerate research with better sharing. 
Recent example re rare diseases: 
http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2 
“The current academic publication system does patients an 
enormous disservice.” – Daniel MacArthur 
There are many barriers to better communication of results, 
data, and methods, but most of them are cultural, not 
technical. (Much harder!)
Preprints 
• Many fields (including bioinformatics and increasingly 
genomics) routinely share papers prior to publication. This 
facilitates reproduction, dissemination, and ultimately 
progress. 
• Biology is behind the times! 
See: 
1. Haldane’s Sieve (blog discussion of preprints) 
2. Evidence that preprints confer massive citation 
advantage in physics (http://arxiv.org/abs/0906.5418)
Current model for data sharing 
In a data limited world, 
this kind of made sense. 
[Workflow: gather data → interpret data → write & publish paper → grudgingly share as little data as possible.]
Current model for data sharing 
This model ignores the fact that data 
often has multiple (unrealized or 
serendipitous) uses. 
(Among many other problems ;) 
The train wreck ahead 
When data is cheap and interpretation is expensive, most data doesn't get published – and is therefore lost. 
[Workflow, as above: gather data → interpret data → write & publish paper → grudgingly share as little data as possible.] 
(Program managers are not fans of this.)
Data sharing challenges: 
• Few immediate or career rewards for sharing data; incentives are almost entirely punitive (if you DON’T…) 
• Sharing data in a usable form is still rather difficult. 
• Submitting data to archival services is, in many cases, 
surprisingly difficult. 
• Few methods for gaining recognition for data sharing prior 
to publication of conclusions.
The Ocean Cruise Model 
One really expensive cruise, many data collectors, shared data. 
DeepDOM – photo courtesy E. Kujawinski, WHOI
Sage Bionetworks / “walled garden” 
Collaborative data sharing policy with restricted access to 
outsiders; 
Central platform with analysis provenance tracking; 
A model for the future of biomedical research? 
See, e.g., "Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas," Omberg et al., 2014.
Distributed cyberinfrastructure to encourage sharing? 
[Architecture diagram: a web interface + API and a graph query layer sit over a compute server (Galaxy? Arvados?), data/info stores, and raw data sets; public servers, a "walled garden" server, and private servers interoperate, with upload/submission to NCBI and KBase and import from MG-RAST, SRA, and EBI.] 
ivory.idyll.org/blog/2014-moore-ddd-talk.html
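
To make the picture concrete, here is a purely hypothetical sketch of what a client of such a graph query layer could look like: one query fanned out to public, walled-garden, and private servers that all expose the same (imaginary) endpoint, with results merged client-side. None of the URLs or the API below exist; they are placeholders.

```python
# Hypothetical federated query against a shared "graph query layer" API.
import json
from urllib.request import urlopen, Request

SERVERS = [
    "https://public.example.org/api",     # public server (placeholder URL)
    "https://garden.example.org/api",     # "walled garden" server (placeholder)
    "https://lab.example.org/api",        # private lab server (placeholder)
]

def federated_query(query):
    """Send the same graph query to every server and merge the hits."""
    hits = []
    for base in SERVERS:
        req = Request(f"{base}/query",
                      data=json.dumps({"q": query}).encode(),
                      headers={"Content-Type": "application/json"})
        try:
            with urlopen(req, timeout=10) as resp:
                hits.extend(json.load(resp).get("hits", []))
        except OSError:
            continue                       # a server being unreachable is not fatal
    return hits
```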
Better metadata collection is needed! 
Suppose the NSA could EITHER track 
who was calling whom, 
OR what they were saying – which would 
be more valuable? 
[Images labeled: Who? What? Who?]
Better metadata collection is needed! 
We need to track sample origin, 
phenotype/environmental conditions, etc. 
[Diagram: sample information ↔ the -omic data ↔ phenotype.] 
This will facilitate discovery, serendipity, re-analysis, and cross-validation.
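
As a sketch of the sort of structured record that would need to travel with every -omic data set (field names here are purely illustrative, not an existing standard):

```python
# Illustrative metadata record linking sample origin, conditions,
# phenotype, and the -omic data it belongs to.
from dataclasses import dataclass, field

@dataclass
class SampleMetadata:
    sample_id: str
    organism: str
    origin: str                                        # e.g. site, tissue, cohort
    collection_date: str                               # ISO 8601 date
    environment: dict = field(default_factory=dict)    # temperature, pH, diet, ...
    phenotype: dict = field(default_factory=dict)      # measured traits / outcomes
    omic_accessions: list = field(default_factory=list)  # archive run accessions

record = SampleMetadata(
    sample_id="S001",
    organism="Homo sapiens",
    origin="whole blood, cohort A",
    collection_date="2014-10-01",
    environment={"storage": "-80C"},
    phenotype={"disease_status": "case"},
    omic_accessions=["SRR0000000"],   # placeholder accession
)
```

Capturing even this much structure at submission time makes later re-analysis and cross-validation far easier than reconstructing it from a methods section.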
Data and software citation 
There are now mechanisms for: 
• assigning DOIs to data (which makes it citable) – figshare, Dryad; 
• data publications – GigaScience, SIGS, Scientific Data; 
• software citations – Zenodo, Mozilla Science Lab/GitHub; 
• software publications – F1000Research. 
Will this address the need to incentivize the sharing of data and methods? Probably not, but it's a good start ;)
4. Exploratory data analysis 
Old model: 
[Diagram: gather data → interpret data.]
New model 
Your data is most useful when combined with everyone 
else’s. 
[Diagram: gather data → interpret data, embedded in a web of many databases and many sets of other people's data.]
Given enough publicly accessible data… 
[Diagram: the same web of databases and other people's data, but now feeding "interpret data" directly – no "gather data" step.]
But: we face a lack of training. 
The lack of training in data science is the biggest challenge facing biology. 
Students! There’s a great future in data analysis!
Data integration? 
Once you have all the data, what do you do? 
"Business as usual simply cannot work." 
Looking at millions to billions of genomes. 
(David Haussler, 2014) 
Illumina estimate: 228,000 human genomes will be 
sequenced in 2014, mostly by researchers. 
http://www.technologyreview.com/news/531091/emtech-illumina-says-228000-human-genomes-will-be-sequenced-this-year/
Looking to the future 
For the senior scientists and funders amongst us, 
• How do we incentivize data sharing, and training? 
• How do we fund the meso- and micro-scale 
cyberinfrastructure development that will accelerate bio? 
The NIH and NSF are exploring this; the Moore and Sloan foundations are simply doing it (though at ~1% of the size). 
See: ivory.idyll.org/blog/2014-nih-adds-up-meeting.html
Thanks for listening!
For Australian students and early career researchers 
in bioinformatics and computational biology 
Annual Student Symposium 
Friday 28th November 2014 
Parkville, Victoria 
Now accepting abstracts for talks and posters 
Talk abstracts close 31st October 
combine.org.au
