2014 sage-talk: Presentation Transcript

  • Making assembly cheap & easy, and consequences thereof. C. Titus Brown, Assistant Professor, CSE, MMG, BEACON, Michigan State University. Feb 2014. ctb@msu.edu
  • Generally, yay #openscience! Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog ('titus brown blog')  Twitter: @ctitusbrown  Grants on lab web site: http://ged.msu.edu/research.html  Preprints: on arXiv, q-bio: 'diginorm arxiv'
  • Problem under consideration: shotgun metagenomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. “Sequence it all and let the bioinformaticians sort it out” Wikipedia: Environmental shotgun sequencing.png
  • Analogy: we seek an understanding of humanity via our libraries. http://eofdreams.com/library.html;
  • But, our only observation tool is shredding a mixture of all of the books & digitizing the shreds. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • Points:  Lots of fragments needed! (Deep sampling.)  Having read and understood some books will help quite a bit. (Prior knowledge.)  Rare books will be harder to reconstruct than common books.  Errors in the OCR process matter quite a bit.  The more different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books.  A categorization system would be an invaluable but not infallible guide to book topics.  Understanding the language would help you validate & understand the books.
  • Investigating soil microbial communities  95% or more of soil microbes cannot be cultured in lab.  Very little transport in soil and sediment => slow mixing rates.  Estimates of immense diversity:  Billions of microbial cells per gram of soil.  Million+ microbial species per gram of soil (Gans et al, 2005)  One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory)
  • “By 'soil' we understand (Vil'yams, 1931) a loose surface layer of earth capable of yielding plant crops. In the physical sense the soil represents a complex disperse system consisting of three phases: solid, liquid, and gaseous.” Microbes live in & on: • Surfaces of aggregate particles; • Pores within microaggregates. N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
  • Questions to address  Role of soil microbes in nutrient cycling:  How does agricultural soil differ from native soil?  How do soil microbial communities respond to climate perturbation?  Genome-level questions:  What kind of strain-level heterogeneity is present in the population?  What are the phage and viral populations & dynamics?  What species are where, and how much is shared between different geographical locations?
  • Must use culture independent and metagenomic approaches  Many reasons why you can't or don't want to culture: cross-feeding, niche specificity, dormancy, etc.  If you want to get at underlying function, 16s analysis alone is not sufficient. Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.
  • Shotgun metagenomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. “Sequence it all and let the bioinformaticians sort it out” Wikipedia: Environmental shotgun sequencing.png
  • Computational reconstruction of (meta)genomic content. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • Points:  Lots of fragments needed! (Deep sampling.)  Having read and understood some books will help quite a bit. (Reference genomes.)  Rare books will be harder to reconstruct than common books.  Errors in the OCR process matter quite a bit. (Sequencing error.)  The more different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. (We don’t understand most microbial function.)  A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.)  Understanding the language would help you validate & understand the books.
  • Great Prairie Grand Challenge -- Sampling Locations, 2008
  • A “Grand Challenge” dataset (DOE/JGI). Total: 1,846 Gbp of soil metagenome sequence across eight samples (Iowa continuous corn, Iowa native prairie, Kansas cultivated corn, Kansas native prairie, Wisconsin continuous corn, Wisconsin native prairie, Wisconsin restored prairie, Wisconsin switchgrass), sequenced on GAII and HiSeq. For comparison: MetaHIT (Qin et al., 2011), 578 Gbp; rumen (Hess et al., 2011), 268 Gbp (111 Gbp after k-mer filtering); NCBI nr database, 37 Gbp.
  • Why do we need so much data?!  20-40x coverage is necessary; 100x is ~sufficient.  Mixed population sampling => sensitivity driven by lowest abundance.  For example, for E. coli at a 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000 abundance, or 500 Gbp of sequence, as worked through below. (For soil, the estimate is 50 Tbp.)  Sequencing is straightforward; data analysis is not. “$1000 genome with $1m analysis”
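The 500 Gbp figure follows directly from the assumptions stated on the slide; a quick back-of-the-envelope check, using only the slide's own numbers:

```python
# Sequencing depth needed to reach a target coverage for a low-abundance
# community member, using the numbers from the slide above.
genome_size_bp = 5e6            # ~E. coli-sized genome, 5 Mbp
target_coverage = 100           # ~sufficient coverage for assembly
relative_abundance = 1 / 1000   # target is 1/1000 of the community

target_bp = genome_size_bp * target_coverage   # 0.5 Gbp for the target alone
total_bp = target_bp / relative_abundance      # scale up for the dilution

print(f"Total sequencing required: {total_bp / 1e9:.0f} Gbp")  # -> 500 Gbp
```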
  • Great Prairie Grand Challenge goals  How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest soil data set ever sequenced, ~2010.)  What can we learn about soil from looking at the reconstructed metagenome? (See list of questions)
  • Assembly graphs scale with data size, not information. Conway T. C. & Bromage A. J., Bioinformatics 2011;27:479-486. © The Author 2011, Oxford University Press.
  • The Problem  We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware.  No locality to the data in terms of graph structure.  Since ~2008:  The field has engaged in lots of engineering optimization…  …but the data generation rate has consistently outstripped Moore's Law.
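To make the scaling point concrete, here is a toy simulation (not the authors' code; it uses a hypothetical random genome and a uniform per-base error model) that counts distinct k-mers, i.e. de Bruijn graph nodes, as coverage grows. With any nonzero error rate the node count keeps climbing with data volume, even though the underlying genome, and hence the information, is fixed:

```python
# Toy illustration: assembly-graph size tracks data volume, not information.
import random

random.seed(1)
K = 21
genome = "".join(random.choice("ACGT") for _ in range(10_000))  # hypothetical genome

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mutate(read, error_rate=0.01):
    # Introduce uniform substitution errors, mimicking sequencing error.
    bases = list(read)
    for i in range(len(bases)):
        if random.random() < error_rate:
            bases[i] = random.choice("ACGT".replace(bases[i], ""))
    return "".join(bases)

def sample_reads(coverage, read_len=100, error_rate=0.01):
    n_reads = coverage * len(genome) // read_len
    for _ in range(n_reads):
        start = random.randrange(len(genome) - read_len)
        yield mutate(genome[start:start + read_len], error_rate)

for cov in (10, 50, 100):
    nodes = set()
    for read in sample_reads(cov):
        nodes |= kmers(read)
    print(f"{cov:>3}x coverage -> {len(nodes):>6} graph nodes "
          f"(genome alone has {len(kmers(genome))})")
```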
  • Several solutions: 1. More efficient exploration of data. 2. Subdivide data. 3. Discard redundant data.
  • Primary approach: Digital normalization (a computational version of library normalization). Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Diversity vs. richness. The high-coverage reads in sample A are unnecessary for assembly.
  • Digital normalization
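A minimal sketch of the digital normalization idea described above: estimate a read's coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is below a fixed cutoff. khmer's real implementation streams FASTA/FASTQ and uses a memory-efficient probabilistic counter; the plain dict and the function names below are illustrative only:

```python
# Sketch of digital normalization over an iterator of DNA reads (strings).
K = 20        # k-mer size
CUTOFF = 20   # target coverage

kmer_counts = {}  # stand-in for khmer's probabilistic k-mer counter

def kmers(read, k=K):
    return (read[i:i + k] for i in range(len(read) - k + 1))

def estimated_coverage(read):
    """Median count of the read's k-mers, as seen in the reads kept so far."""
    counts = sorted(kmer_counts.get(km, 0) for km in kmers(read))
    return counts[len(counts) // 2] if counts else 0

def normalize_by_median(reads, cutoff=CUTOFF):
    """Single streaming pass: yield only reads whose coverage still looks low."""
    for read in reads:
        if estimated_coverage(read) < cutoff:
            for km in kmers(read):                     # count only kept reads
                kmer_counts[km] = kmer_counts.get(km, 0) + 1
            yield read

# Example: 100 identical copies of a read -> only the first ~CUTOFF are kept.
# kept = list(normalize_by_median(["ACGTTGCAGGTCCAGTATCGAACCAGTT"] * 100))
```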
  • Diginorm is “lossy compression”  Nearly perfect from an information theoretic perspective:  Discards 95% or more of the data for genomes.  Loses < 0.02% of the information.
  • Where are we taking this?  Streaming online algorithms only look at data ~once.  Diginorm is streaming, online…  Conceptually, can move many aspects of sequence analysis into streaming mode. => Extraordinary potential for computational efficiency.
  • => Streaming, online variant calling: single pass, reference free, tunable. Potentially quite clinically useful.
  • Prospective: sequencing tumor cells  Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.  1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence.  Most of this data will be redundant and not useful.  Developing diginorm-based algorithms to eliminate data while retaining variant information.
  • The real challenge: understanding  We have gotten distracted by shiny toys: sequencing!! Data!!  Data is now plentiful! But:  We typically have no knowledge of what > 50% of an environmental metagenome “means”, functionally.  Most data is not openly available, so we cannot mine correlations across data sets.  Most computational science is not reproducible, so I can't reuse other people's tools or approaches.
  • Data intensive biology & hypothesis generation  My interest in biological data is to enable better hypothesis generation.
  • My interests  Open source ecosystem of analysis tools.  Loosely coupled APIs for querying databases.  Publishing reproducible and reusable analyses, openly.  Education and training. “Platform perspective”
  • Practical implications of diginorm  Data is (essentially) free;  For some problems, analysis is now cheaper than data gathering (i.e. essentially free);  …plus, we can run most of our approaches in the cloud.
  • khmer-protocols  Effort to provide standard “cheap” assembly protocols for the cloud.  Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.  Open, versioned, forkable, citable. Pipeline steps: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression.
  • CC0; BSD; on github; in reStructuredText.
  • Can we incentivize data sharing?  ~$100-$150/transcriptome in the cloud.  Offer to analyze people's existing data for free, IFF they open it up within a year. See: “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post; CephSeq white paper.
  • “Research singularity” The data a researcher generates in their lab constitutes an increasingly small component of the data used to reach a conclusion. Corollary: the true value of the data an individual investigator generates should be considered in the context of aggregate data. Even if we overcome the social barriers and incentivize sharing, we are, needless to say, not remotely prepared for sharing all the data.
  • My interests  Open source ecosystem of analysis tools.  Loosely coupled APIs for querying databases.  Publishing reproducible and reusable analyses, openly.  Education and training. “Platform perspective”
  • IPython Notebook: data + code
  • Acknowledgements. Lab members involved: Adina Howe (w/ Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman. Collaborators: Jim Tiedje, MSU; Susannah Tringe and Janet Jansson (JGI, LBNL); Erich Schwarz, Caltech / Cornell; Paul Sternberg, Caltech; Robin Gasser, U. Melbourne; Weiming Li, MSU; Shana Goffredi, Occidental. Funding: USDA NIFA; NSF IOS; NIH; BEACON.