C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
ctb@msu.edu
The pro-shotgun-assembly talk.
Acknowledgements
Lab members involved:
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Jordan Fish
• Chris Welcher
Collaborators:
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI
Funding: USDA NIFA; NSF IOS; BEACON.
Open, online science
All of the software and approaches I’m talking about today are available:
• Assembling large, complex metagenomes: arxiv.org/abs/1212.2832
• khmer software: github.com/ged-lab/khmer/
• Blog: http://ivory.idyll.org/blog/
• Twitter: @ctitusbrown
Note: I am phylogenetically unconstrained…
• Chordate mRNAseq (Molgula + lamprey + chick)
• Nematode genomics
• Soil metagenomics
…but so far not microbial euks, specifically.
My goals in this work
• Interested in genes & genomes: function & evolution, but not as much taxonomy.
• Little or no marker work (16S/18S).
• Develop lightweight prefiltering techniques for other tools.
• Software & methods => democratize data analysis.
I am unambiguously pro-assembly.
• Short-read analysis can be misleading; we need more work like Doc Pollard’s showing where and why!
• Assembly reduces the data size, increases bioinformatic signal, and eliminates random errors.
• Note that the general mental frameworks (OLC or DBG) underpin virtually all sequence analysis anyway.
• So, why not?
– Assembly is HARD, SLOW, TRICKY.
– Assemblies may MISLEAD you.
– Assembly is a STRINGENT FILTER on your data <=> heuristics.
There is quite a bit of life left to sequence & assemble.
http://pacelab.colorado.edu/
Challenges of (micro-)euks
• Genomes are large and repeat rich.
• Diploidy and polymorphism will confuse assemblers.
– Note: very problematic in tandem with repeats.
• Nucleotide bias => sequencing bias.
• Scarce samples => amplification techniques => sequencing bias.
All of these confound assembly.
Can we “fix”?
Three illustrative problem cases
• H. contortus genome assembly.
• Lamprey reference-free transcriptome assembly.
• Soil metagenome assembly.
The H. contortus problem
• A sheep parasite.
• ~350 Mbp genome.
• Sequenced DNA from 6 individuals after whole genome amplification; estimated 10% heterozygosity (!?)
• Significant bacterial contamination.
(w/Robin Gasser, Paul Sternberg, and Erich Schwarz)
H. contortus life cycle
Refs.: Nikolaou and Gasser (2006), Int. J. Parasitol. 36, 859-868; Prichard and Geary (2008), Nature 452, 157-158.
The power of next-gen. sequencing: get 180x coverage ... and then watch your assemblies never finish
Libraries built and sequenced:
• 300-nt inserts, 2x75 nt paired-end reads
• 500-nt inserts, 2x75 and 2x100 nt paired-end reads
• 2-kb, 5-kb, and 10-kb inserts, 2x49 nt paired-end reads
Nothing would assemble at all until filtered for basic quality.
Filtering let the ≤500-nt insert libraries assemble in a mere week.
But the 2+ kb insert libraries would not assemble even then.
(Erich Schwarz)
So, problem 1: nematode H. contortus
• Highly polymorphic
• Whole genome amplification
• Repeat ridden
=> Assemblers DIE HORRIBLY.
The lamprey problem.
• Lamprey genome is draft quality; low contiguity, missing ~30%.
• No closely related reference.
• Full-length and exon-level gene predictions are 50-75% reliable, and rarely capture UTRs / isoforms.
• De novo assembly, if we do it well, can identify
– Novel genes
– Novel exons
– Fast evolving genes
• Somatic recombination: how much are we missing, really?
Sea lamprey in the Great Lakes
• Non-native
• Parasite of medium to large fishes
• Caused populations of host fishes to crash
(Li Lab / Y-W C-D)
Lamprey transcriptome
• Started with 5.1 billion reads from 50 different tissues.
No assembler on the planet can handle this much data.
So, problem 2: lamprey mRNAseq
Must go with reference-free approach.
TOO MUCH DATA.
Soil metagenome assembly
• Observation: 99% of microbes cannot easily be cultured in the lab. (“The great plate count anomaly”)
• Many reasons why you can’t or don’t want to culture:
– Syntrophic relationships
– Niche-specificity or unknown physiology
– Dormant microbes
– Abundance within communities
Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.
Investigating soil microbial ecology
• What ecosystem level functions are present, and how do microbes do them?
• How does agricultural soil differ from native soil?
• How does soil respond to climate perturbation?
• Questions that are not easy to answer without shotgun sequencing:
– What kind of strain-level heterogeneity is present in the population?
– What does the phage and viral population look like?
– What species are where?
“Whoa, that’s a lot of data…”
[Figure: bar chart of estimated sequencing required (bp, w/Illumina), y-axis from 0 to 5E+14, for: E. coli genome, Human genome, Vertebrate transcriptome, Human gut, Marine, Soil.]
Scaling challenges in metagenomics (and assembly, more generally)
• It is difficult to even achieve an assembly for the volume of data we can easily get. (Also see: ARMO project, ~2 TB of data.)
• Most current assemblers are quite heavyweight, perhaps partly because they are written by people with large resources.
• This fails given scaling behavior of sequencing.
So, problem 3: soil metagenomics
TOO MUCH DATA.
BAD SCALING.
Approach: Digital normalization (a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B you need to get 100x of A!
Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Is reference free;
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
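The properties listed above can be illustrated with a minimal sketch of the general idea: estimate each read's coverage from the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. This is not khmer's implementation. The function names, the default values of `k` and `cutoff`, the exact dictionary used for counting (a stand-in for a memory-efficient counting structure), and the choice to skip reads shorter than k are all assumptions made for illustration.

```python
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single-pass, reference-free read filter (illustrative sketch).

    k and cutoff are hypothetical defaults, not values from the talk.
    """
    counts = {}  # exact k-mer counts; a stand-in for a compact counter
    kept = []
    for read in reads:  # each read is examined exactly once
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # assumption: reads shorter than k are skipped
        # Median k-mer count so far approximates the read's coverage.
        estimated_coverage = median(counts.get(km, 0) for km in read_kmers)
        if estimated_coverage < cutoff:
            # Low-coverage read: keep it and record its k-mers.
            kept.append(read)
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
        # Otherwise the read is redundant; its k-mers (including any
        # error k-mers) are never added to the counts.
    return kept
```

Because discarded reads never contribute k-mers, the errors they carry are not accumulated, which is one way to read the "does not collect the majority of errors" point.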
Coverage before digital normalization: (MD amplified)
Coverage after digital normalization:
• Normalizes coverage
• Discards redundancy
• Eliminates majority of errors
Scales assembly dramatically.
Assembly is 98% identical.
Wait, that works??
Note, digital normalization is freely available, with lots of tutorials.
Derived approach now part of Trinity (Broad mRNAseq assembler).
It is, ahem, still unpublished, but available on arXiv: arxiv.org/abs/1203.4802
1. H. contortus after digital normalization
• Diginorm readily enabled assembly of a 404 Mbp genome with N50 of 15.6 kb;
• Post-processing with GapCloser and SOAPdenovo scaffolding led to final assembly of 453 Mbp with N50 of 34.2 kb.
• CEGMA estimates 73-94% complete genome.
• Diginorm helped by:
– Suppressing high polymorphism, esp. in repeats;
– Eliminating 95% of sequencing errors;
– “Squashing” coverage variation from whole genome amplification and bacterial contamination.
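For readers unfamiliar with the N50 figures quoted above, here is a short sketch of the usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly size. The function name `n50` is chosen for illustration; the assemblers and tools named on the slide may apply their own length cutoffs before computing it.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # First contig at which the cumulative length covers half
        # of the assembly.
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one contig length")
```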
Next steps with H. contortus
• Publish the genome paper.
• Identification of antibiotic targets for treatment in agricultural settings (animal husbandry).
• Serving as “reference approach” for a wide variety of parasitic nematodes, many of which have similar genomic issues.
2. Lamprey transcriptome results
• Started with 5.1 billion reads from 50 different tissues.
• Digital normalization discarded 98.7% of them as redundant, leaving 87m (!)
• These assembled into more than 100,000 transcripts > 1 kb.
• Against known full-length, 98.7% agreement (accuracy); 99.7% included (contiguity).
Evaluating de novo lamprey transcriptome
• Estimate genome is ~70% complete (gene complement).
• Majority of genome-annotated gene sets recovered by mRNAseq assembly.
• Note: method to recover transcript families w/o genome…
Assembly analysis            Gene families   Gene families in genome   Fraction in genome
mRNAseq assembly             72003           51632                     71.7%
reference gene set           8523            8134                      95.4%
combined                     73773           53137                     72.0%
intersection                 6753            6753                      100.0%
only in mRNAseq assembly     65250           44884                     68.8%
only in reference gene set   1770            1500                      84.7%
(Includes transcripts > 300 bp)
Next steps with lamprey
• Far more complete transcriptome than the one predicted from the genome!
• Enabling studies in:
– Basal vertebrate phylogeny
– Biliary atresia
– Evolutionary origin of brown fat (previously thought to be mammalian only!)
– Pheromonal response in adults
Additional Approach for Metagenomes: Data partitioning (a computational version of cell sorting)
• Split reads into “bins” belonging to different source species.
• Can do this based almost entirely on connectivity of sequences.
• “Divide and conquer”
• Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
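The connectivity idea can be sketched with a union-find structure: two reads land in the same partition whenever they share a k-mer, directly or through a chain of other reads. This is an illustration only, not the implementation from Pell et al. The function name, the default `k`, the plain dictionary (a stand-in for the memory-efficient representation the slide mentions), and the decision to ignore reverse complements are all simplifying assumptions.

```python
def partition_reads(reads, k=20):
    """Group reads into connected components by shared k-mers (sketch)."""
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        # Find the component root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_read_with = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if kmer in first_read_with:
                # Shared k-mer: the two reads are connected.
                union(first_read_with[kmer], idx)
            else:
                first_read_with[kmer] = idx

    partitions = {}
    for idx, read in enumerate(reads):
        partitions.setdefault(find(idx), []).append(read)
    return list(partitions.values())
```

Each resulting partition could then be assembled on its own, which is the "divide and conquer" step.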
Partitioning separates reads by genome.
Strain variants co-partition.
When computationally spiking HMP mock data with one E. coli genome (left) or multiple E. coli strains (right), the majority of partitions contain reads from only a single genome (blue); multi-genome partitions are shown in green.
Partitions containing spiked data are indicated with a *.
(Adina Howe)
Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome is ~3 billion bp.
Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes)
Total assembly   Total contigs (> 300 bp)   % reads assembled   Predicted protein coding
2.5 bill         4.5 mill                   19%                 5.3 mill
3.5 bill         5.9 mill                   22%                 6.8 mill
(Adina Howe)
Resulting contigs are low coverage.
Figure 11: Coverage (median base pair) distribution of assembled contigs from soil metagenomes.
…but high coverage is needed.
Low coverage is the dominant problem blocking assembly of your soil metagenome.
Strain variation?
[Figure: top two allele frequencies (y-axis) vs. position within contig (x-axis).]
Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.
Can measure by read mapping.
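One way such a measurement could be made from mapped reads is sketched below. The slide does not define "polymorphism rate", so the definition used here (the fraction of contig positions whose second-most-common allele reaches a minimum frequency) is an assumption, as are the function names, the `min_minor_freq` threshold, and the input format of one base-count dictionary per position.

```python
def top_two_allele_freqs(base_counts):
    """Return the two highest allele frequencies at one position.

    base_counts maps a base (e.g. "A") to the number of mapped reads
    supporting it at this position.
    """
    total = sum(base_counts.values())
    if total == 0:
        return 0.0, 0.0  # no reads cover this position
    freqs = sorted((n / total for n in base_counts.values()), reverse=True)
    freqs += [0.0, 0.0]  # pad so two values always exist
    return freqs[0], freqs[1]


def polymorphism_rate(pileup, min_minor_freq=0.1):
    """Fraction of positions with a second allele at >= min_minor_freq.

    pileup is a list of per-position base-count dictionaries; the
    threshold is a hypothetical choice, not a value from the talk.
    """
    if not pileup:
        return 0.0
    polymorphic = sum(
        1 for counts in pileup
        if top_two_allele_freqs(counts)[1] >= min_minor_freq
    )
    return polymorphic / len(pileup)
```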
Overconfident predictions
• We can assemble virtually anything but soil ;).
– Genomes, transcriptomes, MDA, mixtures, etc.
– Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).
• Strain variation confuses assembly, but does not prevent useful results.
– Diginorm is a systematic strategy to enable assembly.
– Banfield has shown how to deconvolve strains at differential abundance.
– Kostas K.’s results suggest that there will be a species gap sufficient to prevent contig misassembly.
– Even genes “chimeric” between strains are useful.
Reasons why you shouldn’t believe me
1) Strain variation: when we get deeper in soil, we should see more (?). Not sure what will happen, and we do not (yet) have proven approaches.
2) We, by definition, are not yet seeing anything that doesn’t assemble.
3) We have not tackled scaffolding much. Serious investigation of scaffolding will be necessary for any good genome assembly, and scaffolding is a weak point.
Some concluding thoughts on shotgun metagenomics
• Making good use of environmental metagenome data is very hard; assemblies don’t solve this, but may provide traction.
• In particular, connection to “function” and actual biology is very hard to make. (See other speakers for good positive examples.)
• Our current assembly approaches do not yet push the limits of the data.
• Illumina’s high sampling rate makes it the only game in town.
• The rate-limiting factor is increasingly bioinfo-who-can-speak-to-biologists.
• Assembly is a really stringent filter; diginorm is not.
A brief tour of forthcoming awesomeness
• Targeted-gene assembly from short reads. (Fish et al., Ribosomal Database Project.)
• rRNA search in shotgun data.
• Awesome™ techniques for comparing and evaluating different assemblies.
• Error correction for mRNAseq & metag data.
• Better diginorm.
• Strain variation collapse, assembly, & recovery.
Some specific proposals
• Include significant funding for bioinformatic investigation in anything you do.
– Everyone gets this wrong. I’m looking at you, NIH, NSF, GBMF, Sloan, DOE, USDA.
– Cleverness scales better in bioinfo than exp.
• Shotgun DNA and shotgun RNA + assembly-based approaches => gene “tags”.
– Less experimental treatment up front is good.
– Isoforms are hard, note.
The Last Slide
• All of the computational techniques are available, along with a number of preprints.
• They make assembly more possible but not necessarily easy.
• My long term goal is to make most assembly & all evaluation easy.