Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to, e.g., marine free-spawning animals.
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
ctb@msu.edu
HMP – Metagenome assembly
Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Jordan Fish
• Chris Welcher
Collaborators
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI
Funding
USDA NIFA; NSF IOS; BEACON.
Open, online science
All of the software and approaches I’m talking about today are available:
Assembling large, complex metagenomes: arxiv.org/abs/1212.2832
khmer software: github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
Illumina! De Bruijn graphs!
• Today I’ll be talking about Illumina datasets, and de Bruijn graph assembly (k-mer assembly).
• This is because my research has largely focused on scaling to large data sets (soil metagenomics!) and Illumina is the real scaling challenge.
Assembler heuristics
• In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion.
• These heuristics may not be appropriate for your sample!
– High polymorphism?
– Mixed population vs clonal?
– Genomic vs metagenomic vs mRNA
– Low coverage drives differences in assembly.
Evaluating assembly
[Diagram: Reads (noisy observations of some genome) → Assembler (a Big Black Box) → Predicted genome.]
Evaluating correctness of metagenomes is still undiscovered country.
Shotgun sequencing
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
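As a rough illustration of the coverage definition above, the Python sketch below counts how many reads overlap each base of a toy genome and averages the result. The `(start, length)` read representation and the function names are assumptions made for this example; they do not come from khmer or any other tool mentioned in this talk.

```python
def per_base_depth(genome_length, reads):
    """Return the number of reads overlapping each base.

    `reads` is a list of (start, length) tuples with 0-based starts;
    this representation is an assumption made for the example.
    """
    depth = [0] * genome_length
    for start, length in reads:
        # Clip reads that run past the end of the genome.
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth


def mean_coverage(genome_length, reads):
    """Average depth over all bases: the 'coverage' of the slide."""
    return sum(per_base_depth(genome_length, reads)) / genome_length


# Toy example: three overlapping 10-base reads on a 20-base genome.
toy_reads = [(0, 10), (5, 10), (10, 10)]
print(mean_coverage(20, toy_reads))
```

Drawing "a line straight down through all of the reads" corresponds to reading one entry of `per_base_depth`.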
Reducing to k-mer overlaps
Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
Errors create new k-mers
Each single-base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
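The "~k new k-mers" claim can be checked on a toy sequence. The sketch below is a minimal example, assuming a made-up 14-base sequence, k = 4, and no reverse-complement handling; the helper names are hypothetical. An error in the interior of a read is covered by exactly k k-mers, and each of them becomes new unless it happens to match a k-mer elsewhere in the sequence, which is why the count is approximate.

```python
def kmer_set(seq, k):
    """All distinct k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def new_kmers_from_error(seq, pos, base, k):
    """k-mers that appear only after substituting `base` at `pos`."""
    mutated = seq[:pos] + base + seq[pos + 1:]
    return kmer_set(mutated, k) - kmer_set(seq, k)


# Toy example: substitute C -> T at position 6 of a 14-base sequence.
true_seq = "ACGTTGCAAGGCTA"
novel = new_kmers_from_error(true_seq, pos=6, base="T", k=4)
print(sorted(novel))
```

Errors within k-1 bases of a read end are covered by fewer k-mers, so they create fewer than k new ones.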
So, k-mer abundance plots are mixtures of true and false k-mers.
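To make the mixture concrete, here is a small sketch that builds a k-mer abundance spectrum from a toy read set: ten error-free copies of one read plus one copy carrying a single substitution. The sequences, k = 4, and the function names are assumptions for illustration; real data would use much larger k and both strands.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_spectrum(reads, k):
    """Map abundance -> number of distinct k-mers with that abundance."""
    return dict(Counter(count_kmers(reads, k).values()))


true_read = "ACGTTGCAAGGCTA"
error_read = "ACGTTGTAAGGCTA"  # one substitution at position 6
reads = [true_read] * 10 + [error_read]
print(abundance_spectrum(reads, k=4))
```

In this toy case the erroneous k-mers sit at abundance 1, while the true k-mers sit near the coverage (10 or 11), which is the two-component shape the slide describes.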
Approach: Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
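The arithmetic behind the "100x of A" figure can be written out as a short sketch. It assumes coverage scales linearly with relative abundance, which is a simplification; the function name and the dictionary layout are hypothetical.

```python
def coverage_by_species(relative_abundance, target_rare_coverage):
    """Coverage each species receives when the rarest one reaches the target.

    Assumes coverage is proportional to relative abundance.
    """
    rarest = min(relative_abundance.values())
    return {
        name: target_rare_coverage * abundance / rarest
        for name, abundance in relative_abundance.items()
    }


# A is 10 times as abundant as B; we want 10x coverage of B.
print(coverage_by_species({"A": 10, "B": 1}, target_rare_coverage=10))
```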
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Is reference free;
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
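The properties listed above follow from a simple streaming rule: estimate each read's coverage from the k-mers seen so far, and keep the read only if that estimate is below a cutoff. The sketch below is a minimal version of that idea, assuming the median k-mer abundance is used as the coverage estimate and an exact Python `Counter` stands in for the memory-efficient counting structure a real implementation would need. It is not the khmer code; `k`, `cutoff`, and the function names are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """List the k-mers of a read in order (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=3):
    """Single-pass, reference-free subsampling of reads.

    A read is kept only if the median abundance of its k-mers,
    counted over previously kept reads, is below `cutoff`.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k; skipped in this sketch
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads are counted
    return kept


# Ten copies of an abundant read plus one rare read.
reads = ["ACGTTGCAAGGCTA"] * 10 + ["TTTTCCCCGGGGAA"]
print(len(normalize(reads, k=4, cutoff=3)))
```

Because discarded reads never update the counts, the k-mers of most redundant reads (and their errors) are never stored, while a read from a low-coverage region always passes the test.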
Coverage before digital normalization:
(MD amplified)
Coverage after digital normalization:
Normalizes coverage
Discards redundancy
Eliminates majority of errors
Scales assembly dramatically.
Assembly is 98% identical.
In our experience…
• Digital normalization produces “good” metagenome assemblies.
• Smooths out abundance variation, strain variation.
• Reduces computational requirements for assembly.
• It also kinda makes sense :)
Additional Approach for Metagenomes: Data partitioning
(a computational version of cell sorting)
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
“Divide and conquer”
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
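A toy version of connectivity-based binning can be sketched with a union-find structure: two reads land in the same partition if they are linked by a chain of shared k-mers. This is a simplification made for illustration, assuming exact k-mer sharing on the forward strand only; it is not the memory-efficient graph approach of Pell et al., and the function name is hypothetical.

```python
def partition_reads(reads, k=4):
    """Group read indices into partitions connected by shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_read_with = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_read_with:
                union(first_read_with[kmer], idx)
            else:
                first_read_with[kmer] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


reads = ["ACGTTGCA", "TTGCAAGG", "CCCCGGGG", "CGGGGTTT"]
print(partition_reads(reads, k=4))
```

Each partition can then be assembled independently, which is the "divide and conquer" step.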
Partitioning separates reads by genome.
Strain variants co-partition.
When computationally spiking HMP mock data with one E. coli genome (left) or multiple E. coli strains (right), the majority of partitions contain reads from only a single genome (blue) vs multi-genome partitions (green).
Partitions containing spiked data indicated with a *
Adina Howe
Conclusions re strain variation/chimerism (previous slide)
• When spiking in intentionally complex mixtures, only a small fraction of partitions are chimeric.
• This means that only a small fraction of contigs could be chimeric.
• Strain variants will almost certainly assemble together.
• Can separate on abundance.
See Sharon et al., 2013, PMID 22936250, for Banfield work on this.
Our experience
• Our metagenome assemblies compare well with others, but we have little in the way of ground truth with which to evaluate.
• Scaffold assembly is tricky; we believe in contig assembly for metagenomes, but not scaffolding.
• See arXiv paper, “Assembling large, complex metagenomes”, for our suggested pipeline and statistics & references.
Metagenomic assemblies are highly variable
Adina Howe et al., arXiv 1212.0159
High coverage is needed.
Low coverage is the dominant problem blocking assembly of your soil metagenome.
Strain variation (soil)
[Plot axes: top two allele frequencies; position within contig]
Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.
Can measure by read mapping.
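One way such a measurement could be set up from mapped reads is sketched below. The exact definition of "polymorphism rate" used for the slide is not given, so this example assumes one: the fraction of covered positions whose second most common base reaches a minimum frequency. The pileup representation (one string of observed bases per contig position), the threshold, and the function names are all assumptions for illustration.

```python
from collections import Counter


def top_two_allele_freqs(column):
    """Frequencies of the two most common bases in one pileup column."""
    ranked = Counter(column).most_common(2)
    total = len(column)
    first = ranked[0][1] / total
    second = ranked[1][1] / total if len(ranked) > 1 else 0.0
    return first, second


def polymorphism_rate(pileup, min_minor_freq=0.2):
    """Fraction of covered positions with a frequent second allele.

    This definition is an assumption made for the example.
    """
    covered = [column for column in pileup if column]
    if not covered:
        return 0.0
    variable = sum(
        1 for column in covered
        if top_two_allele_freqs(column)[1] >= min_minor_freq
    )
    return variable / len(covered)


# Toy pileup: four positions, ten mapped reads each.
pileup = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGTTT", "TTTTTTTTTT"]
print(top_two_allele_freqs(pileup[2]), polymorphism_rate(pileup))
```

A frequency threshold is needed because sequencing errors alone produce rare second alleles at many positions.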
Overconfident predictions
• We can assemble virtually anything but soil ;).
– Genomes, transcriptomes, MDA, mixtures, etc.
– Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth)
• Strain variation confuses assembly, but does not prevent useful results.
– Diginorm is a systematic strategy to enable assembly.
– Banfield has shown how to deconvolve strains at differential abundance.
– Kostas K. results suggest that there will be a species gap sufficient to prevent contig misassembly.
– Even genes “chimeric” between strains are useful.
Reasons why you shouldn’t believe me
1) Strain variation – when we get deeper in soil, we should see more (?). Not sure what will happen, and we do not (yet) have proven approaches.
2) We, by definition, are not yet seeing anything that doesn’t assemble.
3) We have not tackled scaffolding much. Serious investigation of scaffolding will be necessary for any good genome assembly, and scaffolding is a weak point.
Metagenome assemblers
In addition to khmer prefiltering,
• SPAdes
• IDBA-UD
• MetaVelvet
• Ray Meta
Assembling in the cloud
• Most metagenomes require 50-150 GB of RAM.
• Many people don’t have access to computers of that size.
• Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.
• I will post instructions and sample data sets for using Amazon today at ged.msu.edu/angus/.
Current research
• Optimizing our programs => faster.
• Building an evaluation framework for metagenome assemblers.
• Error correction!
De novo metagenome error correction makes reads more mappable.
Jason Pell, unpub.
Concluding thoughts
• Achieving one or more assemblies is fairly straightforward.
• Evaluating them is challenging, however, and is where you should be thinking hardest about assembly.
• There are relatively few pipelines available for analyzing assembled metagenomic data. MG-RAST does support this; others?