2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Data, and Biology"

1. Like the Dog that Caught the Bus: Sequencing, Big Data, and Biology. C. Titus Brown, Assistant Professor, CSE, MMG, BEACON, Michigan State University. Jan 2014. ctb@msu.edu
2. 20 years in…  Started working in Dr. Koonin's group in 1993;  First publication was submitted almost exactly 20 years ago!
3. Like the Dog that Caught the Bus: Sequencing, Big Data, and Biology. C. Titus Brown, Assistant Professor, CSE, MMG, BEACON, Michigan State University. Jan 2014. ctb@msu.edu
4. Analogy: we seek an understanding of humanity via our libraries. http://eofdreams.com/library.html
5. But, our only observation tool is shredding a mixture of all of the books & digitizing the shreds. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
6. Points:  Lots of fragments needed! (Deep sampling.)  Having read and understood some books will help quite a bit (Prior knowledge.)  Rare books will be harder to reconstruct than common books.  Errors in OCR process matter quite a bit.  The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books.  A categorization system would be an invaluable but not infallible guide to book topics.  Understanding the language would help you validate & understand the books.
7. Biological analog: shotgun metagenomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. "Sequence it all and let the bioinformaticians sort it out" Wikipedia: Environmental shotgun sequencing.png
8. Investigating soil microbial communities  95% or more of soil microbes cannot be cultured in lab.  Very little transport in soil and sediment => slow mixing rates.  Estimates of immense diversity:  Billions of microbial cells per gram of soil.  Million+ microbial species per gram of soil (Gans et al., 2005)  One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory)
9. "By 'soil' we understand (Vil'yams, 1931) a loose surface layer of earth capable of yielding plant crops. In the physical sense the soil represents a complex disperse system consisting of three phases: solid, liquid, and gaseous." Microbes live in & on: • Surfaces of aggregate particles; • Pores within microaggregates. N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS, http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
10. Questions to address  Role of soil microbes in nutrient cycling:  How does agricultural soil differ from native soil?  How do soil microbial communities respond to climate perturbation?  Genome-level questions:  What kind of strain-level heterogeneity is present in the population?  What are the phage and viral populations & dynamics?  What species are where, and how much is shared between different geographical locations?
11. Must use culture independent and metagenomic approaches  Many reasons why you can't or don't want to culture:  Syntrophic relationships  Niche-specificity or unknown physiology  Dormant microbes  Abundance within communities  If you want to get at underlying function, 16s analysis alone is not sufficient. Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.
12. Shotgun metagenomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. "Sequence it all and let the bioinformaticians sort it out" Wikipedia: Environmental shotgun sequencing.png
13. Computational reconstruction of (meta)genomic content. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
14. Points:  Lots of fragments needed! (Deep sampling.)  Having read and understood some books will help quite a bit (Reference genomes.)  Rare books will be harder to reconstruct than common books.  Errors in OCR process matter quite a bit. (Sequencing error)  The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. (We don't understand most microbial function.)  A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.)  Understanding the language would help you validate & understand the books.
15. Great Prairie Grand Challenge: sampling locations (2008)
16. A "Grand Challenge" dataset (DOE/JGI). Total: 1,846 Gbp soil metagenome. [Bar chart: basepairs of sequencing (Gbp) per sample — Iowa cultivated corn, Iowa native prairie, Kansas continuous corn, Kansas native prairie, Wisconsin restored prairie, Wisconsin switchgrass, Wisconsin continuous corn, Wisconsin native prairie; GAII and HiSeq — against reference lines: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp.]
17. Why do we need so much data?!  20-40x coverage is necessary; 100x is ~sufficient.  Mixed population sampling => sensitivity driven by lowest abundance.  For example, for E. coli present at a 1/1000 dilution, you would need approximately 100x coverage of its 5 Mbp genome scaled up by the dilution factor: 100 × 5 Mbp × 1000 = 500 Gbp of sequence! (For soil, the estimate is 50 Tbp.)  Sequencing is straightforward; data analysis is not. "$1000 genome with $1m analysis"
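To make that back-of-the-envelope arithmetic concrete, here is a minimal Python sketch; the helper function and its name are illustrative, not part of any tool from the talk:

    # Total sequencing needed to reach a target coverage of one community
    # member: genome size x target coverage / relative abundance.
    def required_sequencing_bp(genome_size_bp, target_coverage, relative_abundance):
        return genome_size_bp * target_coverage / relative_abundance

    # E. coli (5 Mbp) at 1/1000 abundance, 100x coverage:
    print(required_sequencing_bp(5e6, 100, 1e-3))  # 5e11 bp = 500 Gbp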
18. Great Prairie Grand Challenge goals  How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest data sets thus far.)  What can we learn about soil from looking at the reconstructed metagenome? (See list of questions)
19. Great Prairie Grand Challenge goals  How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest data sets thus far.)  What can we learn about soil from looking at the reconstructed metagenome? (See list of questions) (For complex ecological and evolutionary systems, we're just starting to get past the first question. More on that later.)
20. So, we want to go from raw data (FASTQ records: name, sequence, "+", quality score):
@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
21. …to "assembled" original sequence. UMD assembly primer (cbcb.umd.edu)
22. De Bruijn graphs – assemble on overlaps. J.R. Miller et al. / Genomics (2010)
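The core idea fits in a few lines of Python: nodes are k-mers, and an edge connects two k-mers that overlap by k-1 bases. A minimal, illustrative sketch — real assemblers use far more compact data structures:

    # Minimal De Bruijn graph sketch: map each k-mer to the set of k-mers
    # that follow it (k-1 base overlap) somewhere in the reads.
    from collections import defaultdict

    def de_bruijn_edges(reads, k):
        graph = defaultdict(set)
        for read in reads:
            for i in range(len(read) - k):
                prefix = read[i:i + k]
                suffix = read[i + 1:i + k + 1]
                graph[prefix].add(suffix)
        return graph

    reads = ["GAGTATGTTC", "ATGTTCTCAT"]
    for node, nexts in de_bruijn_edges(reads, 5).items():
        print(node, "->", sorted(nexts))

Assembly then amounts to walking unambiguous paths through this graph; the two slides that follow show where that walk breaks down.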
23. Two problems: (1) variation/error. Single nucleotide variations cause long branches; they don't rejoin quickly.
24. Two problems: (2) No graph locality. Assembly is inherently an all-by-all process. There is no good way to subdivide the reads without potentially missing a key connection.
25. Assembly graphs scale with data size, not information. Conway T. C. & Bromage A. J., Bioinformatics 2011;27:479-486.
26. Why do k-mer assemblers scale badly? Memory usage ~ "real" variation + number of errors; number of errors ~ size of data set.
27. Practical memory measurements. Velvet measurements (Adina Howe)
28. The Problem  We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware.  No locality to the data in terms of graph structure.  Since ~2008:  The field has engaged in lots of engineering optimization…  …but the data generation rate has consistently outstripped Moore's Law.
29. Our two solutions. 1. Subdivide data. 2. Discard redundant data.
30. 1. Data partitioning (a computational version of cell sorting). Split reads into "bins" belonging to different source species. Can do this based almost entirely on connectivity of sequences. "Divide and conquer" Memory-efficient implementation helps to scale assembly. Pell et al., 2012, PNAS
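A toy sketch of connectivity-based binning, under the assumption that reads sharing an exact k-mer belong in the same bin; the real implementation (Pell et al., 2012) uses a memory-efficient probabilistic graph rather than the plain dictionaries here:

    # Group reads into connected components ("bins") via shared k-mers,
    # using union-find over read indices. Illustrative only.
    from collections import defaultdict

    def kmers(read, k):
        return {read[i:i + k] for i in range(len(read) - k + 1)}

    def partition_reads(reads, k):
        parent = list(range(len(reads)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i
        seen = {}  # k-mer -> first read index that contained it
        for idx, read in enumerate(reads):
            for km in kmers(read, k):
                if km in seen:
                    parent[find(idx)] = find(seen[km])  # union the two bins
                else:
                    seen[km] = idx
        bins = defaultdict(list)
        for idx, read in enumerate(reads):
            bins[find(idx)].append(read)
        return list(bins.values())

    print(partition_reads(["AAAT", "AATG", "CCCG"], 3))
    # [['AAAT', 'AATG'], ['CCCG']] -- first two reads share k-mer "AAT"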
31. Our two solutions. 1. Subdivide data (~20x scaling; 2 years to develop; 100x data increase). 2. Discard redundant data.
32. 2. Approach: Digital normalization (a computational version of library normalization). Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Diversity vs richness. The high-coverage reads in sample A are unnecessary for assembly, and can be discarded.
33. Shotgun sequencing and coverage. "Coverage" is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
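Equivalently, as a formula (standard, though not spelled out on the slide): coverage C = (number of reads × read length) / genome size. For example, with illustrative numbers:

    # Coverage = (number of reads x read length) / genome size.
    # A million 100 bp reads over a 10 Mbp genome:
    n_reads, read_len, genome_size = 1_000_000, 100, 10_000_000
    print(n_reads * read_len / genome_size)  # 10.0 => ~10x coverage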
34. Most shotgun data is redundant. You only need 5-10 reads at a locus to assemble or call (diploid) SNPs… but because sampling is random, and you need 5-10 reads at every locus, you end up sequencing most loci far more deeply than needed.
35.-40. Digital normalization (a sequence of six figure-only slides stepping through the normalization process).
41. Coverage estimation. If you can estimate the coverage of a read in a data set without a reference, this is straightforward:
    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            save(read)
(Trick: the read coverage estimator needs to be error-tolerant.)
42. The median k-mer count in a read is a good approximate estimator of coverage. This gives us a reference-free measure of coverage.
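A pure-Python sketch of that estimator (illustrative only; khmer's actual count table is a compact probabilistic structure). The median is also what makes it error-tolerant: a single sequencing error creates up to k novel, low-count k-mers, which would drag down a mean but rarely shift the median:

    # Estimate a read's coverage as the median count of its k-mers in a
    # running k-mer count table. Assumes reads at least K bases long.
    from statistics import median

    K = 20

    def kmers(read):
        return [read[i:i + K] for i in range(len(read) - K + 1)]

    def estimated_coverage(read, counts):
        # counts: any mapping from k-mer string -> times seen so far
        return median(counts.get(km, 0) for km in kmers(read))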
43. Diginorm builds a De Bruijn graph & then downsamples based on observed coverage. Corresponds exactly to underlying abstraction used for assembly; retains graph structure.
44. Digital normalization approach  Is streaming and single pass: looks at each read only once;  Does not "collect" the majority of errors;  Keeps all low-coverage reads;  Smooths out coverage of regions. …raw data can be retained for later abundance estimation.
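Putting slides 41-44 together, a self-contained single-pass sketch (again illustrative; khmer's normalize-by-median script is the real tool, and these parameter values are just the common defaults from the protocols):

    # Single-pass digital normalization sketch. Key property: k-mer counts
    # are updated only for reads that are kept, so the count table does not
    # accumulate the majority of sequencing errors.
    from collections import Counter
    from statistics import median

    K, CUTOFF = 20, 20

    def diginorm(reads):
        counts = Counter()
        for read in reads:  # streaming: each read is examined exactly once
            kms = [read[i:i + K] for i in range(len(read) - K + 1)]
            if median(counts[km] for km in kms) < CUTOFF:
                counts.update(kms)  # only kept reads enter the table
                yield read          # low estimated coverage: keep the read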
45. Contig assembly now scales with richness (information), not diversity (data). Most samples can be assembled in < 50 GB of memory.
46. Diginorm is widely useful: 1. Assembly of the H. contortus parasitic nematode genome, a "high polymorphism/variable coverage" problem. (Schwarz et al., 2013; pmid 23985341) 2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a "big assembly" problem. (in prep) 3. Osedax symbiont metagenome, a "contaminated metagenome" problem (Goffredi et al., 2013; pmid …)
47. Diginorm is "lossy compression"  Nearly perfect from an information theoretic perspective:  Discards 95% or more of the data for genomes.  Loses < 0.02% of the information.
48. Prospective: sequencing tumor cells  Goal: phylogenetically reconstruct causal "driver mutations" in face of passenger mutations.  1000 cells × 3 Gbp × 20x coverage: 60 Tbp of sequence.  Most of this data will be redundant and not useful.  Developing diginorm-based algorithms to eliminate data while retaining variant information.
49. Where are we taking this?  Streaming online algorithms only look at data ~once.  Diginorm is streaming, online…  Conceptually, can move many aspects of sequence analysis into streaming mode. => Extraordinary potential for computational efficiency.
50. => Streaming, online variant calling. Single pass, reference free, tunable, streaming online variant calling. Potentially quite clinically useful.
51. What about the assembly results for Iowa corn and prairie?

              Total assembly   Total contigs (>300 bp)   % reads assembled   Predicted protein coding
    Corn      2.5 billion bp   4.5 million               19%                 5.3 million
    Prairie   3.5 billion bp   5.9 million               22%                 6.8 million

Putting it in perspective: total equivalent of ~1200 bacterial genomes (human genome is ~3 billion bp). Adina Howe
52. Resulting contigs are low coverage. Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
53. So, for soil:  We really do need more data;  But at least now we can assemble what we already have.  Estimate required sequencing depth at 50 Tbp;  Now also have 2-8 Tbp from Amazon Rain Forest Microbial Observatory.  …still not saturated coverage, but getting closer. But, diginorm approach turns out to be widely useful.
54. Biogeography: Iowa sample overlap? Corn and prairie De Bruijn graphs have 51% overlap. Suggests that at greater depth, samples may have similar genomic content.
55. Concluding thoughts  Empirically effective tools, in reasonably wide use.  Diginorm provides streaming, online algorithmic basis for:  Coverage downsampling/lossy compression  Error identification (sublinear)  Error correction  Variant calling?  Enables analyses that would otherwise be hard or impossible.  Most assembly doable in cloud or on commodity hardware.
56. The real challenge: understanding  We have gotten distracted by shiny toys: sequencing!! Data!!  Data is now plentiful! But:  We typically have no knowledge of what > 50% of an environmental metagenome "means", functionally.  Most data is not openly available, so we cannot mine correlations across data sets.  Most computational science is not reproducible, so I can't reuse other people's tools or approaches.
57. Data intensive biology & hypothesis generation  My interest in biological data is to enable better hypothesis generation.
58. My interests  Open source ecosystem of analysis tools.  Loosely coupled APIs for querying databases.  Publishing reproducible and reusable analyses, openly.  Education and training. "Platform perspective"
59. Practical implications of diginorm  Data is (essentially) free;  For some problems, analysis is now cheaper than data gathering (i.e. essentially free);  …plus, we can run most of our approaches in the cloud.
60. khmer-protocols  Effort to provide standard "cheap" assembly protocols for the cloud.  Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.  Open, versioned, forkable, citable. (Pipeline stages: read cleaning → diginorm → assembly → annotation → RSEM differential expression.)
61. IPython Notebook: data + code.
62. My interests  Open source ecosystem of analysis tools.  Loosely coupled APIs for querying databases.  Publishing reproducible and reusable analyses, openly.  Education and training. "Platform perspective"
63. We practice open science! Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog ('titus brown blog')  Twitter: @ctitusbrown  Grants on Lab Web site: http://ged.msu.edu/research.html  Preprints: on arXiv, q-bio: 'diginorm arxiv'
64. Acknowledgements. Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman. Collaborators: Jim Tiedje, MSU; Susannah Tringe and Janet Jansson (JGI, LBNL); Erich Schwarz, Caltech / Cornell; Paul Sternberg, Caltech; Robin Gasser, U. Melbourne; Weiming Li, MSU; Shana Goffredi, Occidental. Funding: USDA NIFA; NSF IOS; NIH; BEACON.
