Extracting genomes from community sequencing: "What works, what will work, and what needs work"
C. Titus Brown
firstname.lastname@example.org
Computer Science; Microbiology; BEACON
Michigan State University
Warnings
This talk contains forward-looking statements. These forward-looking statements can be identified by terminology such as "will", "expects", and "believes".
-- Safe Harbor provisions of the U.S. Private Securities Litigation Act
"Making predictions is difficult, especially if they're about the future."
-- Attributed to Niels Bohr
Thanks for the invitation!
So, Linda Mansfield and I were talking one day...
Her: "It'd be great to be able to look at communities with sequencing."
Me: "Oh, yeah, we can do that now."
My overall interest is in good hypothesis generation from computational data, with a focus on sequence data. For the past three years, I have been working on this specifically for soil metagenomics (and mRNAseq, too).
Soil contains thousands to millions of species
[Figure: "collector's curves" of ~species, plotting number of OTUs (up to ~2000) against number of sequences (~100 to ~8100) for eight samples: Iowa corn, Iowa native prairie, Kansas corn, Kansas native prairie, Wisconsin corn, Wisconsin native prairie, Wisconsin restored prairie, Wisconsin switchgrass. None of the curves saturate.]
Ecology => function emphasis
What's there? Is it really that complex a community?
How "deep" do we need to sequence to sample thoroughly and systematically?
How is ecological complexity created & maintained?
How does ecological complexity respond to perturbation?
What organisms and gene functions are present, including non-canonical carbon and nitrogen cycling pathways?
What kind of organismal and functional overlap is there between communities?
The human gut is a diverse place
Dethlefsen et al., 2008
Ecology vs function in human gut We can observe recovery of diversity after Cipro treatment; but what is driving recovery at a functional level? Dethlefsen and Relman, 2011
Culture independent methods
Observation that 99% of microbes cannot easily be cultured in the lab. ("The great plate count anomaly")
While this is less true for host-associated microbes, culture independent methods are still important:
Syntrophic relationships
Niche-specificity or unknown physiology
Dormant microbes
Abundance within communities
Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.
Shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
Shotgun sequencing & assembly
Why assembly?
Assumption free (no reference needed)
Necessary for soil and marine; useful for host-associated?
Assembly can serve as reference for transcriptome interpretation
Fragment, sequence, computationally assemble. What kind of results do you get?
Almost certainly chimerism between different strains; but still useful for gene content & operon structure.
Specificity seems high, but sensitivity is dependent on sequencing depth. Because of the sampling rate needed, Illumina is the primary choice.
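To make "fragment, sequence, computationally assemble" concrete, here is a toy de Bruijn graph assembler in Python. This is only an illustrative sketch of the general idea, not the assemblers discussed in this talk: real tools additionally handle sequencing errors, reverse complements, and repeats.

```python
from collections import defaultdict

def debruijn_assemble(reads, k=4):
    """Toy de Bruijn assembly: nodes are (k-1)-mers, edges are
    observed k-mers; contigs are maximal unambiguous paths."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    # Count distinct incoming edges per node.
    indeg = defaultdict(int)
    for node, outs in edges.items():
        for nxt in outs:
            indeg[nxt] += 1
    contigs = []
    # Walks start wherever the path is not a simple continuation.
    starts = [n for n in list(edges) if indeg[n] != 1 or len(edges[n]) != 1]
    for start in starts:
        for nxt in edges[start]:
            contig = start + nxt[-1]
            node = nxt
            # Extend while the path is unambiguous.
            while indeg[node] == 1 and len(edges[node]) == 1:
                node = next(iter(edges[node]))
                contig += node[-1]
            contigs.append(contig)
    return contigs
```

With overlapping error-free reads from a single sequence, the walk recovers that sequence as one contig; strain variation and repeats would create branches and fragment the contigs, which is the chimerism/sensitivity trade-off mentioned above.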
Shotgun metagenomics: good news Cheap and easy to generate vast whole metagenome/metatranscriptome shotgun data sets from essentially any community you can sample. Such data can be quite interesting! Low hanging fruit – correlation with diet, etc. Still early days for observation of “pan genome” and functional content. Potential to illuminate or inform: Dynamics and selective pressures of antibiotic resistance, virulence genes, and pathogenicity islands Phage and viral communities Community interactions.
Shotgun metagenomics: bad news
Computational techniques are still relatively immature.
Mapping to known genomes? Discovery of unknown genomes & strain variants? Sensitivity and specificity are hard to evaluate.
Computational ecosystem is not that rich...
Interpreting the data is still the bottleneck, of course. Vast majority of genes not usefully annotated. Reliance on specific reference databases, annotations.
Tools for (e.g.) inferring community interactions from community dynamics & functional capacity are desperately needed.
The computational conundrum
More data => better. And: more data => computationally more challenging.
Big data sets require big machines
For even relatively small data sets, metagenomic assemblers scale poorly.
Memory usage ~ "real" variation + number of errors
Number of errors ~ size of data set
Size of data set == big!!
(Estimated 6 weeks x 3 TB of RAM to do a 300 Gbp soil sample, with a slightly modified conventional assembler.)
Approach 1: Partitioning
Split reads into "bins" belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
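The connectivity idea can be sketched as a union-find over reads that share a k-mer. This is a simplified illustration only: khmer partitions a probabilistic k-mer graph held in a compact data structure, not read pairs via an exact dictionary as below.

```python
def partition_reads(reads, k=4):
    """Toy partitioning: reads that share any k-mer (transitively)
    end up in the same bin.  Returns lists of read indices."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    seen = {}  # k-mer -> first read id carrying it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                union(rid, seen[kmer])
            else:
                seen[kmer] = rid

    bins = {}
    for rid in range(len(reads)):
        bins.setdefault(find(rid), []).append(rid)
    return list(bins.values())
```

Each resulting bin can then be assembled independently, which is what makes partitioning a divide-and-conquer route around the memory wall.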
Technical challenges met (and defeated)
Novel data structure properties elucidated via percolation theory analysis (Pell, Hintze, et al., in review, PNAS).
Exhaustive in-memory traversal of graphs containing 5-15 billion nodes.
Sequencing technology introduces false sequences in graph (Howe et al., in prep.)
Only 20x improvement in assembly scaling.
(NOVEL) Approach 2: Digital normalization
Suppose species A and B are present at a 10:1 ratio. To get 10x coverage of B, you need to get 100x coverage of A. Overkill!! This 100x will consume disk space and, because of errors, memory.
Digital normalization discards redundant reads prior to assembly.
This removes reads and decreases data size, eliminates errors from removed reads, and normalizes coverage across loci.
Discarded reads can be used after assembly for quantitative analysis.
A read's median k-mer count is a good estimator of "coverage".
This gives us a reference-free measure of coverage.
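The two ideas combine into a single streaming pass: estimate each read's coverage as the median count of its k-mers among reads kept so far, and discard the read if that estimate already exceeds a cutoff. Below is a sketch under simplifying assumptions: it uses an exact Counter, where khmer uses a fixed-memory probabilistic count structure, and it ignores reads shorter than k.

```python
from collections import Counter

def digital_normalize(reads, k=17, cutoff=20):
    """Single-pass digital normalization sketch: keep a read only
    if its median k-mer count (so far) is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Median k-mer count = reference-free coverage estimate.
        med = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        if med < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept
```

Because high-coverage loci stop accumulating reads once they reach the cutoff, coverage is flattened across loci and most error-containing reads from oversampled regions never enter the assembler.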
Shotgun data is often (1) high coverage and (2) biased in coverage. (MD amplified)
Digital normalization fixes all that.
Normalizes coverage
Discards redundancy
Eliminates majority of errors
Scales assembly dramatically
Assembly is 98% identical
Digital normalization retains information, while discarding data and errors
Evaluating sensitivity & specificity
[E. coli reads at 10x coverage spiked into a soil data set: 98.5% of E. coli recovered.]
How much? A mathematical interlude.
Suppose we need 10x coverage to assemble a microbial genome, and microbial genomes average 5e6 bp of DNA. Further suppose that we want to be able to assemble a microbial species that is "1 in 100,000", i.e. 1 in 1e5.
Shotgun sequencing samples randomly, so we must sample deeply to be sensitive: 10x coverage x 5e6 bp x 1e5 = 5e12, or 5 Tbp of sequence.
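The back-of-the-envelope calculation above is just a product of three factors; a small helper (name and defaults are illustrative) makes it easy to try other abundances:

```python
def required_sequencing_bp(target_coverage=10, genome_size=5e6, rarity=1e5):
    """Sequencing depth (bp) needed to reach `target_coverage` on a
    genome of `genome_size` bp that is 1-in-`rarity` in the community.
    Random shotgun sampling means the whole community must be
    oversampled by the rarity factor."""
    return target_coverage * genome_size * rarity

# 10x coverage of a 5 Mbp genome at 1-in-100,000 abundance:
# 10 * 5e6 * 1e5 = 5e12 bp = 5 Tbp of sequence.
```

Note how linear this is in rarity: a 1-in-1000 species needs only 50 Gbp, which is why host-associated communities are so much more tractable than soil.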
Example: Dethlefsen shotgun data set / Relman lab
251 million reads / 16 GB gzipped FASTQ
~24 hrs, < 32 GB of RAM for full pipeline -- $24 on Amazon EC2 (reads => final assembly + mapping)
Assembly stats:
58,224 contigs > 1000 bp (average 3 kb) summing to 190 Mb genomic
~38 microbial genomes' worth of DNA
~65% of reads mapped back to assembly
What do we get for soil?

Total Assembly | Total Contigs | % Reads Assembled | Predicted protein coding genes | rplb genes
2.5 bill       | 4.5 mill      | 19%               | 5.3 mill                       | 391
3.5 bill       | 5.9 mill      | 22%               | 6.8 mill                       | 466

(rplb gene count estimates the number of species.)
Putting it in perspective: total equivalent of ~1200 bacterial genomes; the human genome is ~3 billion bp.
(Adina Howe)
Extracting whole genomes?
So far, we have only assembled contigs, but not whole genomes. Can entire genomes be assembled from metagenomic data?
Iverson et al. (2012), from the Armbrust lab, contains a technique for scaffolding metagenome contigs into ~whole genomes. YES.
Concluding thoughts on assembly
Illumina is the only game in town for sequencing complex microbial populations, but dealing with the data (volume, errors) is tricky. This problem is being solved, by us and others. We're working to make it as close to push-button as possible, with objectively argued parameters and tools, and methods for evaluating new tools and sequencing types.
The community is working on dealing with data downstream of sequencing & assembly. Most pipelines were built around 454 data -- long reads, and relatively few of them. With Illumina, we can get both long contigs and quantitative information about their abundance. This necessitates changes to pipelines like MG-RAST and HUMAnN.
The interpretation challenge
For soil, we have generated approximately 1200 bacterial genomes' worth of assembled genomic DNA from two soil samples. The vast majority of this genomic DNA contains unknown genes with largely unknown function.
Most annotations of gene function & interaction are from a few phylogenetically limited model organisms.
An estimated 98% of annotations are computationally inferred: transferred from model organisms to genomic sequence, using homology.
Can these annotations be transferred? (Probably not.)
This will be the biggest sequence analysis challenge of the next 50 years.
How will we annotate soil??

Total Assembly | Total Contigs | % Reads Assembled | Predicted protein coding genes | rplb genes
2.5 bill       | 4.5 mill      | 19%               | 5.3 mill                       | 391
3.5 bill       | 5.9 mill      | 22%               | 6.8 mill                       | 466

(rplb gene count estimates the number of species.)
Putting it in perspective: total equivalent of ~1200 bacterial genomes; the human genome is ~3 billion bp.
(Adina Howe)
Some lessons from C. jejuni
In vivo murine transfer experiments demonstrate substantial capacity for C. jejuni 11168 to adapt solely via modification of poly-G tracts (Jerome et al., 2011).
Bell et al. (unpub.) have shown substantial variability in gene content of Campylobacter strains.
Gene content and gene expression are both important to understanding mechanisms of pathogenicity.
In vitro serial transfer experiments demonstrate that rapid genomic adaptation to new environments occurs at multiple loci, with substantial variation in genes of unknown function (Jerome et al., in preparation).
Multilocus "strain" variation in C. jejuni drives rapid adaptation
What works?
Today, from deep metagenomic data, you can get the gene and operon content (including abundance of both) from communities. You can get microarray-like expression information from metatranscriptomics.
What needs work? Assembling ultra-deep samples is going to require more engineering, but is straightforward. (“Infinite assembly.”) Building scaffolds and extracting whole genomes has been done, but I am not yet sure how feasible it is to do systematically with existing tools (c.f. Armbrust Lab).
What will work, someday? Sensitive analysis of strain variation. Both assembly and mapping approaches do a poor job detecting many kinds of biological novelty. The 1000 Genomes Project has developed some good tools that need to be evaluated on community samples. Ecological/evolutionary dynamics in vivo. Most work done on 16s, not on genomes or functional content. Here, sensitivity is really important!
What are future needs?
High-quality, medium+ throughput annotation of genomes? Extrapolating from model organisms is both immensely important and yet lacking. Strong phylogenetic sampling bias in existing annotations.
Synthetic biology for investigating non-model organisms? (Cleverness in experimental biology doesn't scale.)
Pubs, software, tutorials, etc.
Metagenome assembly / HMP tutorial: http://ged.msu.edu/angus/nih-hmp-2012/
Everything I discussed is available pre-pub -- contact email@example.com, or Google for:
khmer -- software package
k-mer percolation paper (in re-review, PNAS)
digital normalization paper (in review, PLoS One)
...a few dozen people using, one way or another.
Acknowledgements
Jason Pell, Qingpeng Zhang, Arend Hintze, and Adina Howe
Soil: Jim Tiedje (MSU), Janet Jansson (LBNL/JGI), Susannah Tringe (JGI)
Campy: Linda Mansfield, Julia Bell, JP Jerome, Jeff Barrick
Funding: USDA NIFA; NSF IOS; BEACON.