Adina Howe
Michigan State University, Adjunct
Argonne National Laboratory, Postdoc
ASM Workshop, May 2013
(Image: "Visual Complexity," http://www.flickr.com/photos/maisonbisson)
MSU Lab: Titus Brown, Jim Tiedje, Jason Pell, Qingpeng Zhang, Jordan Fish, Eric McDonald, Chris Welcher, Aaron Garoutte, Jiarong Guo
Collaborators: Janet Jansson, Susannah Tringe
I will upload this on SlideShare (adinachuanghowe).

Khmer documentation:
github.com/ged-lab/khmer/
https://khmer.readthedocs.org/en/latest/guide.html

Manuscripts:
- Scaling metagenome sequence assembly with probabilistic de Bruijn graphs: http://www.pnas.org/content/early/2012/07/25/1121464109
- A reference-free algorithm for computational normalization of shotgun sequencing data: http://arxiv.org/abs/1203.4802
- Assembling large, complex metagenomes: http://arxiv.org/abs/1212.2832
[Diagram: abundance in the environment (our goal) vs. in our hands – high-abundance organisms are oversampled while low-abundance organisms are missed.]

A few gotchas of sequencing:
- Errors / artifacts (confusion)
- Diversity / complexity (scale)
How do we get from what is in our hands back to what is in the environment (our goal)?
1. Digital normalization (lossy compression)
2. Partitioning
3. Enabling usage of previously unusable assembly tools
Why assemble?
- Reduces data for analysis
- Longer sequences (increased accuracy of annotation)
- Gene order
- Does not rely on known references; access to unknowns
- Creates new references
- Lots of assembly tools available

But…
- High memory requirements
- Depends on good (~10x) sequencing coverage

[Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.]
"Coverage" is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
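The definition above can be made concrete with a small sketch: given reads as (start, length) intervals on a genome, tally how many reads overlap each base, then average. The read positions here are invented for illustration.

```python
# Minimal sketch: per-base coverage from read positions (hypothetical data).
# Each read is a (start, length) interval on a genome of length genome_len.
def per_base_coverage(reads, genome_len):
    cov = [0] * genome_len
    for start, length in reads:
        for pos in range(start, min(start + length, genome_len)):
            cov[pos] += 1  # one more read overlaps this base
    return cov

# Ten reads of length 50, tiled every 5 bp across a 100 bp region:
reads = [(i * 5, 50) for i in range(10)]
cov = per_base_coverage(reads, 100)
print(sum(cov) / len(cov))  # average coverage: 5.0
```

The "line straight down" on the slide corresponds to reading off `cov[pos]` at a single position.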
Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
Each single base error generates ~k new k-mers. Generally, erroneous k-mers show up only once – errors are random.
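This "~k new k-mers per error" effect is easy to demonstrate directly: substitute one base in a read and compare the k-mer sets before and after. The read sequence and error position are made up.

```python
# Sketch: a single base substitution creates up to k new (erroneous) k-mers,
# because every k-mer window that covers the error position changes.
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 5
true_read = "ACGTACGTACGTACGTACGT"
pos = 10                                   # error in the middle of the read
err_read = true_read[:pos] + "T" + true_read[pos + 1:]

new_kmers = kmers(err_read, k) - kmers(true_read, k)
print(len(new_kmers))  # 5, i.e. k new k-mers from one error
```

Errors near a read end change fewer than k windows, which is why the slide hedges with "~k".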
Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B, you need to get 100x of A. Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
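The arithmetic behind the slide, spelled out with the numbers it uses (a 10:1 abundance ratio and a 10x coverage target):

```python
# Worked arithmetic: with an abundance ratio A:B of 10:1, every 1x of B
# sequenced drags along 10x of A, so reaching the target coverage of B
# means massively over-sequencing A.
ratio_A_to_B = 10
target_cov_B = 10
cov_A = target_cov_B * ratio_A_to_B
print(cov_A)  # 100 - the "overkill" coverage of A that diginorm can discard
```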
A digital analog to cDNA library normalization, diginorm:
- Is reference-free;
- Is single-pass: looks at each read only once;
- Does not "collect" the majority of errors;
- Keeps all low-coverage reads;
- Smooths out coverage of regions.
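The single-pass idea above can be sketched in a few lines. This is a simplified, dict-based illustration, not the khmer implementation (khmer uses a memory-efficient probabilistic counting structure); the reads, k, and cutoff here are invented: keep a read only if the median count of its k-mers, among reads kept so far, is below a coverage cutoff.

```python
from collections import defaultdict

# Minimal sketch of digital normalization: one pass over the reads,
# keeping a read only if its estimated coverage (median k-mer count
# so far) is still below the cutoff.
def diginorm(reads, k=4, cutoff=5):
    counts = defaultdict(int)  # khmer uses a probabilistic structure instead
    kept = []
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        median = sorted(counts[km] for km in kms)[len(kms) // 2]
        if median < cutoff:
            kept.append(read)
            for km in kms:
                counts[km] += 1  # count k-mers only for kept reads
    return kept

# 50 identical high-coverage reads plus one rare read:
reads = ["ACGTACGTACGT"] * 50 + ["TTTTGGGGCCCC"]
kept = diginorm(reads)
print(len(kept))  # 4: a few copies of the abundant read, plus the rare read
```

Note how the slide's bullets fall out of the loop: each read is seen once, low-coverage reads always pass the cutoff, and errors in discarded high-coverage reads are never counted.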
Digital normalization produces "good" metagenome assemblies.
- Smooths out abundance variation, strain variation.
- Reduces computational requirements for assembly.
- It also kinda makes sense :)
Split reads into "bins" belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
"Divide and conquer" – a memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
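The connectivity idea can be sketched with union-find: reads that share any k-mer land in the same bin. This is a toy illustration of the principle only; Pell et al. (2012) achieve memory efficiency with a probabilistic (Bloom filter) de Bruijn graph representation, not the exact dict used here, and the reads below are invented.

```python
# Sketch of partitioning by connectivity: reads sharing a k-mer are
# merged into one bin via union-find.
def partition(reads, k=4):
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    seen = {}  # k-mer -> index of first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in seen:
                union(idx, seen[km])  # shared k-mer => same partition
            else:
                seen[km] = idx

    bins = {}
    for idx in range(len(reads)):
        bins.setdefault(find(idx), []).append(idx)
    return list(bins.values())

reads = ["AAAACCCC", "CCCCGGGG",  # overlap via "CCCC" -> same bin
         "TTTTTTTT"]              # no shared k-mers -> its own bin
print(len(partition(reads)))  # 2 bins
```

Each bin can then be handed to an assembler independently – the "divide and conquer" step.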
Low coverage is the dominant problem blocking assembly of your soil metagenome.
In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion. These heuristics may not be appropriate for your sample!
- High polymorphism?
- Mixed population vs. clonal?
- Genomic vs. metagenomic vs. mRNA?
- Low coverage drives differences in assembly.
- We can assemble virtually anything but soil ;) – genomes, transcriptomes, MDA, mixtures, etc.
- Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).
- Strain variation confuses assembly, but does not prevent useful results.
- Diginorm is a systematic strategy to enable assembly.
- Banfield has shown how to deconvolve strains at differential abundance.
- Kostas K.'s results suggest that there will be a species gap sufficient to prevent contig misassembly.
Most metagenomes require 50-150 GB of RAM, and many people don't have access to computers of that size. Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.
http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html
- Optimizing our programs => faster.
- Building an evaluation framework for metagenome assemblers.
- Error correction!
Achieving one or more assemblies is fairly straightforward. An assembly is a hypothesis, however: evaluating it is challenging, and that is where you should be thinking hardest about assembly. There are relatively few pipelines available for analyzing assembled metagenomic data.
How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment?
- Major efforts of data collection
- Open mind for discoveries
- Willingness to adjust to change
- Multiple efforts
- Well-designed experiments
Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes.
We receive Gb of sequences. Generally, my data is…
- Split by barcodes
- Untrimmed; adapters are present
- Two paired-end fastq files
Underestimation of computational requirements:
- Quality control steps usually require 2-3 times the amount of hard drive space
- Similarity comparison against known databases is impractical (soil metagenome: ~50 years to BLAST)
(Home Alone scream – my first slide graphic that I'm scared may date me.)
Two ways to reduce the onslaught:
- Cluster into known observances (annotate, bin)
- Assembly
- Some mix of the above
Ten of you upload 1 HiSeq flowcell into MG-RAST
- Illumina short reads from soil metagenome (~100 bp)
- 454 short reads from soil metagenome (~368 bp)
- Assembled contigs (Illumina reads) from soil metagenome (~491 bp)
Read length will increase… computational requirements? Assembly is a great way to reduce data.