Metagenome assembly – part II C. Titus Brown ctb@msu.edu
Warnings This talk contains forward-looking statements. These forward-looking statements can be identified by terminology such as “will”, “expects”, and “believes”. -- Safe Harbor provisions of the U.S. Private Securities Litigation Act “Making predictions is difficult, especially if they’re about the future.” -- Attributed to Niels Bohr
The computational conundrum More data => better. and More data => computationally more challenging.
2. Big data sets require big machines For even relatively small data sets, metagenomic assemblers scale poorly. Memory usage ~ “real” variation + number of errors Number of errors ~ size of data set Size of data set == big!! (Estimated 6 weeks x 3 TB of RAM to do 300gb soil sample, with a slightly modified conventional assembler.)
Soil is full of uncultured microbes Randy Jackson
Great Prairie sampling design [Figure: sampling layout around a reference core, with samples spaced at 1 cm, 1 m, and 10 m.] Soil cores: 1 inch diameter, 4 inches deep (litter and roots removed) • Spatial samples: 16S rRNA, nifH • Reference sample sequenced (small unmixed sample) Reference bulk soil: stored for additional “omics” and metadata
Soil contains thousands to millions of species (“Collector’s curves” of ~species) [Figure: number of OTUs vs. number of sequences for Iowa Corn, Iowa Native Prairie, Kansas Corn, Kansas Native Prairie, Wisconsin Corn, Wisconsin Native Prairie, Wisconsin Restored Prairie, and Wisconsin Switchgrass.]
The set of questions for soil -- discovery What’s there? Is it really that complex a community? How “deep” do we need to sequence to sample thoroughly and systematically? What organisms and gene functions are present, including non-canonical carbon and nitrogen cycling pathways? What kind of organismal and functional overlap is there between different sites? (Total sampling needed?) How is ecological complexity created & maintained? How does ecological complexity respond to perturbation?
Why are we applying short-read sequencing to this problem!? Short-read sampling is deep and quantitative. Statistical argument: your ability to observe rare organisms – your sensitivity of measurement – is directly related to the number of independent sequences you take. Longer reads (PacBio, 454, Ion Torrent) are less informative. Majority of metagenome studies going forward will make use of Illumina. BUT this kind of sequence is challenging to analyze. BUT, BUT this kind of sequence is necessary for high complexity environments.
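The statistical argument can be made concrete with a minimal sketch. It assumes every read is an independent draw and that an organism with relative abundance p contributes each read with probability p; the function names are hypothetical and do not come from any tool mentioned in this talk.

```python
import math


def detection_probability(relative_abundance: float, n_reads: int) -> float:
    """Probability of seeing at least one read from an organism.

    Assumes reads are independent draws and that each read comes from
    the organism with probability `relative_abundance`.
    """
    return 1.0 - (1.0 - relative_abundance) ** n_reads


def reads_needed(relative_abundance: float, confidence: float = 0.95) -> int:
    """Smallest number of reads giving at least `confidence` detection probability."""
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - relative_abundance)
    )


if __name__ == "__main__":
    # Illustrative numbers only: an organism at one part per million.
    p = 1e-6
    for n in (10**5, 10**6, 10**7):
        print(n, round(detection_probability(p, n), 4))
    print("reads for 95% detection:", reads_needed(p))
```

Under these assumptions, sensitivity depends only on the number of independent reads, not on their length, which is the reason for favoring deep short-read sampling here.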
Challenges of short-read analysis Low signal for functional analysis; no linkage at all. High error rates. Massive volume. Rapidly changing technology. Several approaches but we have settled on assembly.
Approach 1: Partitioning Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences.
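A minimal sketch of connectivity-based partitioning, under loud assumptions: two reads are treated as connected when they share at least one exact k-mer (a simplification of de Bruijn graph connectivity), reverse complements are ignored, and an ordinary dictionary stands in for the memory-efficient graph representation used in practice. The name `partition_reads` is hypothetical.

```python
def kmers(seq, k):
    """Yield every overlapping k-mer in a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def partition_reads(reads, k):
    """Group reads into partitions of transitively connected reads.

    Assumption: reads are connected if they share an exact k-mer.
    Uses union-find over read indices.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Path-halving find.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for kmer in kmers(read, k):
            other = first_seen.setdefault(kmer, idx)
            if other != idx:
                root_a, root_b = find(idx), find(other)
                if root_a != root_b:
                    parent[root_a] = root_b

    partitions = {}
    for idx, read in enumerate(reads):
        partitions.setdefault(find(idx), []).append(read)
    return list(partitions.values())


if __name__ == "__main__":
    toy_reads = ["ACGTACGGA", "ACGGATTC", "TTTTTGGGGG", "GGGGGCCCC"]
    for part in partition_reads(toy_reads, k=5):
        print(part)
```

Each resulting partition can then be handed to an assembler on its own, which is what makes the divide-and-conquer scaling possible.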
Partitioning for scaling Can be done in ~10x less memory than assembly. Partition at low k and assemble exactly at any higher k (DBG). Partitions can then be assembled independently Multiple processors -> scaling Multiple k, coverage -> improved assembly Multiple assembly packages (tailored to high variation, etc.) Can eliminate small partitions/contigs in the partitioning phase. An incredibly convenient approach enabling divide & conquer approaches across the board.
Technical challenges met (and defeated) Novel data structure properties elucidated via percolation theory analysis (Pell et al., PNAS, 2012) Exhaustive in-memory traversal of graphs containing 5-15 billion nodes. Sequencing technology introduces false connections in graph (Howe et al., in prep.) Only 20x improvement in assembly scaling.
(NOVEL) Approach 2: Digital normalization Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
Digital normalization discards redundant reads prior to assembly. This removes reads and decreases data size, eliminates errors from removed reads, and normalizes coverage across loci.
Digital normalization algorithm

    for read in dataset:
        if median_kmer_count(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # discard read

Note: single pass; fixed memory.
Downsample based on de Bruijn graph structure (which can be derived online)
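A runnable sketch of the pseudocode above. Two things are assumptions made for illustration rather than details from the slides: an exact `Counter` replaces the fixed-memory k-mer counting structure, and the default values of `k` and `cutoff` are placeholders.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Yield reads whose median k-mer count is still below the cutoff.

    Single pass over the reads. k-mer counts are updated only from reads
    that are kept, so k-mers from discarded reads are never added.
    """
    counts = Counter()  # assumption: exact counts, not a fixed-memory sketch
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read
        # else: the locus already has enough coverage; discard the read


if __name__ == "__main__":
    toy_reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
    for kept in digital_normalization(toy_reads, k=4, cutoff=3):
        print(kept)
```

Because the decision uses only counts from reads already kept, the procedure is streaming, and the amount of retained data tracks the complexity of the sample rather than the sequencing depth.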
Shotgun data is often (1) high coverage and (2) biased in coverage. (MD amplified)
Digital normalization fixes all that. Normalizes coverage. Discards redundancy. Eliminates majority of errors. Scales assembly dramatically. Assembly is 98% identical.
Digital normalization retains information, while discarding data and errors
Other key points Virtually identical contig assembly; scaffolding works but is not yet cookie-cutter. Digital normalization changes the way de Bruijn graph assembly scales from the size of your data set to the size of the source sample. Always lower memory than assembly: we never collect most erroneous k-mers. Digital normalization can be done once – and then assembly parameter exploration can be done.
Quotable quotes. Comment: “This looks like a great solution for people who can’t afford real computers”. OK, but: “Buying ever bigger computers is a great solution for people who don’t want to think hard.” To be less snide: both kinds of scaling are needed, of course.
Why use diginorm? Use the cloud to assemble any microbial genomes incl. single-cell, many eukaryotic genomes, most mRNAseq, and many metagenomes. Seems to provide leverage on addressing many biological or sample prep problems (single-cell & genome amplification MDA; metagenome; heterozygosity). And, well, the general idea of locus specific graph analysis solves lots of things…
Some interim concluding thoughts Digital normalization-like approaches provide a path to solving the majority of assembly scaling problems, and will enable assembly on current cloud computing hardware. This is not true for highly diverse metagenome environments… For soil, we estimate that we need 50 Tbp / gram soil. Sigh. Biologists and bioinformaticians hate: Throwing away data Caveats in bioinformatics papers (which reviewers like, note) Digital normalization also discards abundance information.
Evaluating sensitivity & specificity [Figure: pipeline diagram. E. coli @ 10x + soil; digital normalization + other filters; partitioning; Velvet, k from 19-51; minimus2 merge; 98.5% of E. coli.]
Example: Dethlefsen shotgun data set / Relman lab. 251 m reads / 16gb FASTQ gzipped. ~ 24 hrs, < 32 gb of RAM for full pipeline -- $24 on Amazon EC2 (reads => final assembly + mapping) Assembly stats: 58,224 contigs > 1000 bp (average 3kb) summing to 190 mb genomic ~38 microbial genomes worth of DNA ~65% of reads mapped back to assembly
What do we get for soil?

    Total assembly   Total contigs   % reads assembled   Predicted protein coding   rplB genes
    2.5 bill         4.5 mill        19%                 5.3 mill                   391
    3.5 bill         5.9 mill        22%                 6.8 mill                   466

(The rplB gene count estimates the number of species.) Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. (Adina Howe)
Coverage of Assemblies Corn Prairie
Nearest reference in NCBI. Most abundant contigs in Iowa corn metagenome: unknown; alpha/beta hydrolase (Streptomyces sp. S4); unknown; unknown; hypothetical protein HMP (Clostridium clostridioforme). Most abundant contigs in Iowa prairie metagenome: hypothetical protein (Rhodanobacter sp. 2APBS1); hypothetical protein (Oryza sativa Japonica); outer membrane adhesin like protein (Solitalea canadensis); alcohol dehydrogenase zinc-binding domain protein (Ktedonobacter racemifer); alcohol dehydrogenase GroES domain protein (Ktedonobacter racemifer)
(Done with MEGAN)
How many soil samples do we need to sequence?? Overlap between Iowa prairie & Iowa corn is significant! (Cumulative) Adina Howe
Extracting whole genomes? So far, we have only assembled contigs, but not whole genomes. Can entire genomes be assembled from metagenomic data? Iverson et al. (2012), from the Armbrust lab, contains a technique for scaffolding metagenome contigs into ~whole genomes. YES.
Perspective: the coming infopocalypse Assembling about $20k worth of data, we can generate approximately 700 microbial genomes worth of data. (This is only going to go up in yield/$$, note.) Most of these assembled genomic contigs (and genes) do not belong to studied organisms. What the heck do they do??
More thoughts on assembly Illumina is the only game in town for sequencing complex microbial populations, but dealing with the data (volume, errors) is tricky. This problem is being solved, by us and others. We’re working to make it as close to push button as possible, with objectively argued parameters and tools, and methods for evaluating new tools and sequencing types. The community is working on dealing with data downstream of sequencing & assembly. Most pipelines were built around 454 data – long reads, and relatively few of them. With Illumina, we can get both long contigs and quantitative information about their abundance. This necessitates changes to pipelines like MG-RAST and HUMAnN.
The interpretation challenge For soil, we have generated approximately 1200 bacterial genomes worth of assembled genomic DNA from two soil samples. The vast majority of this genomic DNA contains unknown genes with largely unknown function. Most annotations of gene function & interaction are from a few phylogenetically limited model organisms Est 98% of annotations are computationally inferred: transferred from model organisms to genomic sequence, using homology. Can these annotations be transferred? (Probably not.) This will be the biggest sequence analysis challenge of the next 50 years.
Concluding thoughts on “assembly” We can handle all the data (modulo another year or so of engineering.) Bring it on! Our approaches let us (& you) assemble pretty much anything, much more easily than before. (Single cell, microbial genomes, transcriptomes, eukaryotic genomes, metagenomes, BAC sequencing…) Seriously. No more problemo. Done. Finished. Kaput. So now what? Validation. Interpretation and building general tools. Interpretation relies on annotation… (Uh oh.)
What are future needs? High-quality, medium+ throughput annotation of genomes? Extrapolating from model organisms is both immensely important and yet lacking. Strong phylogenetic sampling bias in existing annotations. Synthetic biology for investigating non-model organisms? (Cleverness in experimental biology doesn’t scale.) Integration of microbiology, community ecology/evolution modeling, and data analysis.
Replication fu In December 2011, I met Wes McKinney on a train and he convinced me that I should look at IPython Notebook. This is an interactive Web notebook for data analysis… Hey, neat! We can use this for replication! All of our figures can be regenerated from scratch, on an EC2 instance, using a Makefile (data pipeline) and IPython Notebook (figure generation). Everything is version controlled. Honestly not much work, and will be less the next time.
So… how’d that go? People who already cared thought it was nifty. http://ivory.idyll.org/blog/replication-i.html Almost nobody else cares ;( Presub enquiry to editor: “Be sure that your paper can be reproduced.” Uh, please read my letter to the end? “Could you improve your Makefile? I want to reimplement diginorm in another language and reuse your pipeline, but your Makefile is a mess.” Incredibly useful, nonetheless. Already part of undergraduate and graduate training in my lab; helping us and others with next papers; etc. etc. etc. Life is way too short to waste on unnecessarily replicating your own workflows, much less other people’s.
Acknowledgements Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald. Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI. Funding: USDA NIFA; NSF IOS; BEACON.
Current research in my lab Solving the rest of your problems Preliminary functional analysis
Search SSU rRNA gene in Illumina data 1. Randomly sequencing about 100bp long DNA in microbial genomes; 2. Everything is sequenced; 3. Not limited by primers or PCR bias; 4. Data mining is the challenge. Expected SSU rRNA gene fragments = (SSU rRNA gene length / genome length) × number of reads = (10^3 / 10^6) × 10^7 = 10^4
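The order-of-magnitude estimate on this slide can be written as a small function. This is only a sketch: it assumes reads are sampled uniformly along the genome, that there is one gene copy per genome, and it ignores reads that only partially overlap the gene. The function name is hypothetical.

```python
def expected_gene_reads(gene_length: float, genome_length: float, n_reads: float) -> float:
    """Expected number of reads that fall within a gene.

    Assumes uniform random sampling along the genome and a single gene
    copy; edge effects from partially overlapping reads are ignored.
    """
    return n_reads * gene_length / genome_length


if __name__ == "__main__":
    # Orders of magnitude from the slide: 10^3 bp gene, 10^6 bp genome, 10^7 reads.
    print(expected_gene_reads(10**3, 10**6, 10**7))
```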
Classification: Pyrotag vs shotgun RDP-pyrotag-SSU silva-pyrotag-SSU silva-shotgun-SSU
[Figure: SSU rRNA gene, 1542 bp; forward primer region start: 907; reverse primer region end: 1402.] Sequence logo of short reads at forward primer region; current forward primer: AAACTYAAAKGAATTGACGG. Sequence logo of short reads at reverse primer region; current reverse primer (reverse complement): GYACACACCGCCCGT. Primers used in 454 Titanium sequencing of SSU rRNA gene, using E. coli as an example. Consensus sequences of the primer region from Illumina reads suggest 1) searching method is good and 2) primer bias is minimal at the current E-value cutoff.
Running HMMs over de Bruijn graphs (=> cross validation) hmmgs: Assemble based on good-scoring HMM paths through the graph. Independent of other assemblers; very sensitive, specific. 95% of hmmgs rplB domains are present in our partitioned assemblies. Jordan Fish, Qiong Wang, and Jim Cole (RDP)
Streaming error correction.
First pass (all reads): Does read come from a high-coverage locus? Yes! Error-correct low-abundance k-mers in read. No! Add read to graph and save for later.
Second pass (only saved reads): Does read come from a now high-coverage locus? Yes! Error-correct low-abundance k-mers in read. No! Leave unchanged.
We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2 passes, fixed memory. We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment problem) to this.
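A minimal sketch of the two-pass flow, with several assumptions made for illustration: "high-coverage locus" is tested by median k-mer count, as in digital normalization; correction is replaced by the simpler operation of trimming a read at its first low-abundance k-mer; an exact `Counter` stands in for a fixed-memory structure; and all names and default parameter values are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def is_high_coverage(read, counts, k, coverage_cutoff):
    """True if the read's median k-mer count has reached the cutoff."""
    read_kmers = kmers(read, k)
    return bool(read_kmers) and median(counts[km] for km in read_kmers) >= coverage_cutoff


def trim_low_abundance(read, counts, k, min_count):
    """Truncate the read just before its first k-mer seen fewer than min_count times."""
    for i, km in enumerate(kmers(read, k)):
        if counts[km] < min_count:
            return read[:i + k - 1]
    return read


def streaming_trim(reads, k=20, coverage_cutoff=20, min_count=2):
    """Two-pass error trimming; the second pass touches only saved reads."""
    counts = Counter()  # assumption: exact counts, not a fixed-memory sketch
    saved, output = [], []

    # First pass over all reads.
    for read in reads:
        if is_high_coverage(read, counts, k, coverage_cutoff):
            output.append(trim_low_abundance(read, counts, k, min_count))
        else:
            counts.update(kmers(read, k))  # add read to graph
            saved.append(read)             # and save for later

    # Second pass over saved reads only.
    for read in saved:
        if is_high_coverage(read, counts, k, coverage_cutoff):
            output.append(trim_low_abundance(read, counts, k, min_count))
        else:
            output.append(read)  # still low coverage: leave unchanged
    return output


if __name__ == "__main__":
    true_read = "ACGTTGCAAGCT"
    error_read = "ACGTTGCAAGCA"  # last base differs
    for r in streaming_trim([true_read] * 3 + [error_read], k=4, coverage_cutoff=3):
        print(r)
```

Note that this sketch does not preserve the input read order, and only the reads saved in the first pass are visited twice, which is where the "< 2 passes" figure comes from.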
Side note: error correction is the biggest “data” problem left in sequencing. Both for mapping & assembly.
[Figure: 1542 bp gene; forward primer region start: 907; end: 1402.] Consensus of short reads at forward primer region and at reverse primer region; current forward primer: AAACTYAAAKGAATTGACGG. Figure. Primers used in 454 Titanium sequencing of 16S rRNA gene, using E. coli as an example. Consensus sequences of the primer region from Illumina reads suggest primer bias is minimal at the current E-value cutoff.
Supplemental: abundance filtering is very lossy. [Figure: percent loss from abundance filtering (all >= 2), in contigs and bp, for the largest partition, the 8.2x partition, the 3.8x partition, and the total; x-axis: percentage lost, 0 to 100.]