ASM 2013 Metagenomic Assembly Workshop Slides

Transcript

  • 1. Adina Howe. Michigan State University, Adjunct; Argonne National Laboratory, Postdoc. ASM Workshop, May 2013. (Visual Complexity, http://www.flickr.com/photos/maisonbisson)
  • 2. MSU Lab and collaborators: Titus Brown, Jim Tiedje, Jason Pell, Qingpeng Zhang, Jordan Fish, Eric McDonald, Chris Welcher, Aaron Garoutte, Jiarong Guo, Janet Jansson, Susannah Tringe.
  • 3. I will upload this on SlideShare (adinachuanghowe).
      Khmer documentation: github.com/ged-lab/khmer/ and https://khmer.readthedocs.org/en/latest/guide.html
      Manuscripts:
      "Scaling metagenome sequence assembly with probabilistic de Bruijn graphs", http://www.pnas.org/content/early/2012/07/25/1121464109
      "A reference-free algorithm for computational normalization of shotgun sequencing data", http://arxiv.org/abs/1203.4802
      "Assembling large, complex metagenomes", http://arxiv.org/abs/1212.2832
  • 4. A few gotchas of sequencing: errors / artifacts (confusion); diversity / complexity (scale). [Figure: high-abundance and low-abundance organisms, in the environment (our goal) versus in our hands.]
  • 5. [Figure: high-abundance and low-abundance organisms, in the environment (our goal) versus in our hands.] 1. Digital normalization (lossy compression). 2. Partitioning. 3. Enabling usage of current, previously unusable assembly tools.
  • 6. Reduces data for analysis. Longer sequences (increased accuracy of annotation). Gene order. Does not rely on known references; access to unknowns. Creates new references. Lots of assembly tools available. But…
  • 7. But… High memory requirements. Depends on good (~10x) sequencing coverage. [Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.]
  • 8. “Coverage” is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
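
The definition on slide 8 can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names and the example numbers are assumptions chosen so the mean works out to about 10x.

```python
def mean_coverage(num_reads, read_length, genome_length):
    """Average number of reads overlapping each base: total bases / genome size."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


def per_base_coverage(read_starts, read_length, genome_length):
    """Depth at every position: the 'line straight down' through the reads."""
    depth = [0] * genome_length
    for start in read_starts:
        # Clip reads that would run past the end of the genome.
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return depth


# Hypothetical example: 500,000 reads of 100 bp over a 5 Mbp genome.
print(mean_coverage(500_000, 100, 5_000_000))
# Toy example: three 5 bp reads on a 10 bp genome.
print(per_base_coverage([0, 2, 4], 5, 10))
```

The mean is what the slide calls coverage; the per-base list shows that individual positions can sit above or below that mean.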
  • 9. Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
  • 10. Each single base error generates ~k new k-mers.Generally, erroneous k-mers show up only once – errors are random.
  • 11. Low-abundance peak (errors)
  • 12. High-abundance peak (true k-mers)
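
Slides 10 to 12 can be reproduced on a toy example. The sketch below is an illustration under stated assumptions: the 20 bp sequence, k = 4, and the single substitution are made up, and exact counting with `Counter` is used only because the data is tiny.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def new_kmers_from_error(seq, pos, base, k):
    """k-mers that exist only because of a substitution at `pos`."""
    mutated = seq[:pos] + base + seq[pos + 1:]
    return set(kmers(mutated, k)) - set(kmers(seq, k))


def abundance_histogram(reads, k):
    """Map each abundance value to the number of distinct k-mers with that count."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return Counter(counts.values())


# Hypothetical toy data: 10 error-free copies of a read plus one copy with an error.
true_read = "ACGTTGCAAGCTTAGGCATC"
bad_read = true_read[:10] + "A" + true_read[11:]  # C -> A at position 10

print(len(new_kmers_from_error(true_read, 10, "A", 4)))  # one error, k new k-mers
print(sorted(abundance_histogram([true_read] * 10 + [bad_read], 4).items()))
```

In the histogram the erroneous k-mers sit at abundance 1 (the low-abundance peak) while the true k-mers sit near the coverage (the high-abundance peak). An error close to the end of a read, or one that happens to recreate an existing k-mer, yields fewer than k new k-mers, which is why the slide says "~k".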
  • 13. Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
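
The arithmetic on slide 13, written out. This sketch assumes reads are sampled in proportion to abundance and that A and B have similar genome sizes; the function names are illustrative.

```python
def abundant_coverage(target_rare_coverage, abundance_ratio):
    """Coverage the abundant organism reaches once the rare one hits its target."""
    return target_rare_coverage * abundance_ratio


def redundant_fraction(target_rare_coverage, abundance_ratio):
    """Share of the abundant organism's reads beyond the target coverage."""
    total = abundant_coverage(target_rare_coverage, abundance_ratio)
    return (total - target_rare_coverage) / total


# A is 10 times as abundant as B, and we want 10x coverage of B.
print(abundant_coverage(10, 10))
print(redundant_fraction(10, 10))
```

Under these assumptions, nine out of ten reads from A add nothing beyond the 10x needed for assembly, which is the excess the slide proposes to discard.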
  • 14. A digital analog to cDNA library normalization, diginorm: is reference free; is single pass (looks at each read only once); does not “collect” the majority of errors; keeps all low-coverage reads; smooths out coverage of regions.
  • 15. Digital normalization produces “good” metagenome assemblies. Smooths out abundance variation, strain variation. Reduces computational requirements for assembly. It also kinda makes sense :)
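
The properties listed on slide 14 can be illustrated with a short single-pass sketch: estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. This is a simplified illustration, not the khmer implementation. The exact `Counter` stands in for a memory-efficient counting structure, reverse complements are ignored, and `k`, `cutoff`, and the toy reads are assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Single pass: keep a read only if its estimated coverage is below `cutoff`."""
    counts = Counter()  # exact counts; an assumption made for clarity
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k; skipped in this sketch
        # Median k-mer count so far is the reference-free coverage estimate.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to the counts
    return kept


# Hypothetical toy data: one abundant read (50 copies) and one rare read (2 copies).
abundant = "ACGTTGCAAGCTTAGGCATC"
rare = "TTTTGGGGCCCCAAAATTGG"
kept = normalize_by_median([abundant] * 50 + [rare] * 2, k=4, cutoff=5)
print(kept.count(abundant), kept.count(rare))
```

The abundant read is capped at the cutoff while both copies of the rare read survive, matching "keeps all low-coverage reads" and "smooths out coverage". Because discarded reads never update the counts, most of the errors they carry are never stored.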
  • 16. Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences. “Divide and conquer”. Memory-efficient implementation helps to scale assembly. (Pell et al., 2012, PNAS)
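
Partitioning by connectivity, as described on slide 16, can be sketched with a union-find structure. This is a simplified stand-in, not the method from Pell et al.: two reads are treated as connected if they share an exact k-mer, the k-mer index is an ordinary dictionary rather than a memory-efficient graph representation, and reverse complements are ignored. The reads and `k` are made-up examples.

```python
def partition_reads(reads, k):
    """Group reads into connected components; reads sharing a k-mer are linked."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


# Hypothetical reads from two unrelated sources.
reads = ["ACGTTGCA", "TTGCAAGC", "GGGGCCCC", "GCCCCTTT"]
partitions = partition_reads(reads, 4)
print(partitions)
```

Each partition can then be assembled on its own, which is the "divide and conquer" step: the memory needed is set by the largest partition, not by the whole data set.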
  • 17. Low coverage is the dominant problem blocking assembly of your soil metagenome.
  • 18. In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion. These heuristics may not be appropriate for your sample! High polymorphism? Mixed population vs clonal? Genomic vs metagenomic vs mRNA? Low coverage drives differences in assembly.
  • 19. We can assemble virtually anything but soil ;). Genomes, transcriptomes, MDA, mixtures, etc. Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth). Strain variation confuses assembly, but does not prevent useful results. Diginorm is a systematic strategy to enable assembly. Banfield has shown how to deconvolve strains at differential abundance. Kostas K.'s results suggest that there will be a species gap sufficient to prevent contig misassembly.
  • 20. Most metagenomes require 50-150 GB of RAM. Many people don’t have access to computers of that size. Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr. http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html
  • 21. Optimizing our programs => faster. Building an evaluation framework for metagenome assemblers. Error correction!
  • 22. Achieving one or more assemblies is fairly straightforward. An assembly is a hypothesis, however, and evaluating assemblies is challenging; this is where you should be thinking hardest about assembly. There are relatively few pipelines available for analyzing assembled metagenomic data.
  • 23.  Questions?
  • 24. How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment? (Visual Complexity, http://www.flickr.com/photos/maisonbisson) Major efforts of data collection; open mind for discoveries; willingness to adjust to change; multiple efforts; well-designed experiments. Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes.
  • 25. We receive Gb of sequences. Generally, my data is… split by barcodes; untrimmed; adapters are present; two paired-end fastq files. Underestimation of computational requirements: quality control steps usually require 2-3 times the amount of hard drive space; similarity comparison against known databases is impractical (soil metagenome ~50 years to BLAST). [Image: Home Alone scream.] My first slide graphic that I’m scared may date me.
  • 26. Two ways to reduce the onslaught: cluster into known observances (annotate, bin); assembly; some mix of the above.
  • 27. Ten of you upload 1 HiSeq flowcell into MG-RAST
  • 28. Illumina short reads from soil metagenome (~100 bp). 454 short reads from soil metagenome (~368 bp). Assembled contigs (Illumina reads) from soil metagenome (~491 bp). Read length will increase… computational requirements? Assembly is a great way to reduce data.
