ASM 2013 Metagenomic Assembly Workshop Slides



  1. Adina Howe; Michigan State University, Adjunct; Argonne National Laboratory, Postdoc. ASM Workshop, May 2013. Visual Complexity
  2. MSU Lab and Collaborators: Titus Brown, Jim Tiedje, Jason Pell, Qingpeng Zhang, Jordan Fish, Eric McDonald, Chris Welcher, Aaron Garoutte, Jiarong Guo, Janet Jansson, Susannah Tringe
  3. I will upload this on slideshare (adinachuanghowe). Khmer manuscripts:
     • Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
     • A reference-free algorithm for computational normalization of shotgun sequencing data
     • large, complex metagenomes
  4. A few gotchas of sequencing: errors / artifacts (confusion); diversity / complexity (scale). [Figure: high-abundance vs. low-abundance organisms, in the environment (our goal) vs. in our hands]
  5. [Figure: high-abundance vs. low-abundance organisms, in the environment (our goal) vs. in our hands] (1) Digital normalization (lossy compression); (2) partitioning; (3) enabling usage of current, previously unusable assembly tools
  6. Reduces data for analysis
     • Longer sequences (increased accuracy of annotation)
     • Gene order
     • Does not rely on known references, access to unknowns
     • Creates new references
     • Lots of assembly tools available
     But…
  7. Reduces data for analysis
     • Longer sequences (increased accuracy of annotation)
     • Gene order
     • Does not rely on known references, access to unknowns
     • Creates new references
     • Lots of assembly tools available
     But…
     • High memory requirements
     • Depends on good (~10x) sequencing coverage
     Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
  8. “Coverage” is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.
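
The definition on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below assumes reads are sampled uniformly, so average coverage is total sequenced bases divided by genome length; the function name and the example numbers are hypothetical and chosen only to reproduce the ~10x figure.

```python
def average_coverage(num_reads, read_length, genome_length):
    """Average number of reads overlapping each base, assuming uniform sampling."""
    return num_reads * read_length / genome_length


# Hypothetical example: 500,000 reads of 100 bp from a 5 Mbp genome.
print(average_coverage(500_000, 100, 5_000_000))  # 10.0
```
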
  9. Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
  10. Each single base error generates ~k new k-mers. Generally, erroneous k-mers show up only once, because errors are random.
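
To see why one error produces roughly k new k-mers: every k-mer that overlaps the erroneous base changes, and up to k k-mers overlap any single position. A minimal sketch, using a made-up sequence and hypothetical helper names:

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def new_kmers_from_error(seq, pos, base, k):
    """k-mers that appear only after substituting `base` at position `pos`."""
    mutated = seq[:pos] + base + seq[pos + 1:]
    return kmers(mutated, k) - kmers(seq, k)


true_read = "ACGTTGCAAGGCTTAC"  # made-up sequence for illustration
novel = new_kmers_from_error(true_read, pos=8, base="T", k=5)
print(len(novel))  # 5 new k-mers from a single substitution with k = 5
```

Errors near the ends of a read overlap fewer k-mers, which is one reason the slide says ~k rather than exactly k.
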
  11. Low-abundance peak (errors)
  12. High-abundance peak (true k-mers)
  13. Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
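
The arithmetic on this slide can be written out as a one-liner. This sketch assumes coverage scales linearly with relative abundance; the function name is hypothetical.

```python
def abundant_coverage(abundance_ratio, target_rare_coverage):
    """Coverage the abundant organism receives when the rare one reaches its target."""
    return abundance_ratio * target_rare_coverage


# A is 10 times as abundant as B, and we want 10x coverage of B.
print(abundant_coverage(10, 10))  # 100
```
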
  14. A digital analog to cDNA library normalization, diginorm:
     • Reference free.
     • Is single pass: looks at each read only once;
     • Does not “collect” the majority of errors;
     • Keeps all low-coverage reads;
     • Smooths out coverage of regions.
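
The properties listed on this slide follow from a simple rule: estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. The sketch below is a simplified illustration, not the khmer implementation: it uses an exact in-memory `Counter` rather than a memory-efficient probabilistic counting structure, ignores reverse complements, and the names and the small `k` and `cutoff` values are assumptions for the toy example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return the list of k-mers in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=4, cutoff=3):
    """Single pass: keep a read only if its median k-mer count is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k (dropped in this sketch)
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute counts
    return kept


# Toy data: one abundant sequence seen 10 times, one rare sequence seen once.
abundant = "ACGTACGGTCA"
rare = "TTGACCATGGA"
kept = normalize_by_median([abundant] * 10 + [rare])
print(kept.count(abundant), kept.count(rare))  # 3 1
```

Redundant copies of the abundant sequence are discarded once its estimated coverage reaches the cutoff, while the rare read is kept. Because discarded reads never add their k-mers to the table, most error k-mers in high-coverage regions are never stored.
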
  15. Digital normalization produces “good” metagenome assemblies.
     • Smooths out abundance variation, strain variation.
     • Reduces computational requirements for assembly.
     • It also kinda makes sense :)
  16. Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences. “Divide and conquer.” Memory-efficient implementation helps to scale assembly. Pell et al., 2012, PNAS
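
The idea of binning by connectivity can be sketched as finding connected components: two reads land in the same bin if they share a k-mer, directly or through a chain of other reads. This is only an illustration under simplifying assumptions. It stores every k-mer exactly in a dictionary (so it is not the memory-efficient implementation the slide refers to), ignores reverse complements, and uses hypothetical names and toy reads.

```python
def partition_reads(reads, k=4):
    """Group reads into bins of reads connected by shared k-mers (union-find)."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the root of i, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])  # merge the two bins
            else:
                first_seen[kmer] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Toy reads: the first two overlap each other, as do the last two.
reads = ["ACGTACGG", "ACGGTCAA", "TTGACCAT", "CCATGGAT"]
for group in partition_reads(reads):
    print(group)
```

Each bin can then be assembled on its own, which is the “divide and conquer” step.
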
  17. Low coverage is the dominant problem blocking assembly of your soil metagenome.
  18. In order to build assemblies, each assembler makes choices (uses heuristics) to reach a conclusion.
     • These heuristics may not be appropriate for your sample!
     • High polymorphism?
     • Mixed population vs clonal?
     • Genomic vs metagenomic vs mRNA
     • Low coverage drives differences in assembly.
  19. We can assemble virtually anything but soil ;).
     • Genomes, transcriptomes, MDA, mixtures, etc.
     • Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth)
     • Strain variation confuses assembly, but does not prevent useful results.
     • Diginorm is a systematic strategy to enable assembly.
     • Banfield has shown how to deconvolve strains at differential abundance.
     • Kostas K.'s results suggest that there will be a species gap sufficient to prevent contig misassembly.
  20. Most metagenomes require 50-150 GB of RAM.
     • Many people don’t have access to computers of that size.
     • Amazon Web Services will happily rent you such computers for $1-2/hr.
  21. Optimizing our programs => faster.
     • Building an evaluation framework for metagenome assemblers.
     • Error correction!
  22. Achieving one or more assemblies is fairly straightforward.
     • An assembly is a hypothesis, however. Evaluating assemblies is challenging, and it is where you should be thinking hardest about assembly.
     • There are relatively few pipelines available for analyzing assembled metagenomic data.
  23. Questions?
  24. How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment? Visual Complexity
     • Major efforts of data collection
     • Open mind for discoveries
     • Willingness to adjust to change
     • Multiple efforts
     • Well-designed experiments
     Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes
  25. We receive Gb of sequences
     • Generally, my data is… split by barcodes; untrimmed; adapters are present; two paired-end fastq files
     • Underestimation of computational requirements: quality control steps usually require 2-3 times the amount of hard drive space; similarity comparison against known databases is impractical (soil metagenome: ~50 years to BLAST)
     [Image: Home Alone scream] My first slide graphic that I’m scared may date me.
  26. Two ways to reduce the onslaught:
     • Cluster into known observances (annotate, bin)
     • Assembly
     • Some mix of the above
  27. Ten of you upload 1 HiSeq flowcell into MG-RAST
  28. Read lengths compared:
     • Illumina short reads from soil metagenome (~100 bp)
     • 454 short reads from soil metagenome (~368 bp)
     • Assembled contigs (Illumina) from soil metagenome reads (~491 bp)
     Read length will increase… computational requirements? Assembly is a great way to reduce data.