ASM 2013 Metagenomic Assembly Workshop Slides
Presentation Transcript

  • Adina Howe, Michigan State University (Adjunct); Argonne National Laboratory (Postdoc). ASM Workshop, May 2013. Visual Complexity: http://www.flickr.com/photos/maisonbisson
  • MSU Lab: Titus Brown, Jim Tiedje, Jason Pell, Qingpeng Zhang, Jordan Fish, Eric McDonald, Chris Welcher, Aaron Garoutte, Jiarong Guo. Collaborators: Janet Jansson, Susannah Tringe.
  • I will upload this to SlideShare (adinachuanghowe).
    Khmer documentation: github.com/ged-lab/khmer/ and https://khmer.readthedocs.org/en/latest/guide.html
    Manuscripts:
    • Scaling metagenome sequence assembly with probabilistic de Bruijn graphs: http://www.pnas.org/content/early/2012/07/25/1121464109
    • A reference-free algorithm for computational normalization of shotgun sequencing data: http://arxiv.org/abs/1203.4802
    • Assembling large, complex metagenomes: http://arxiv.org/abs/1212.2832
  • A few gotchas of sequencing:
    • Errors / Artifacts (confusion)
    • Diversity / Complexity (scale)
    [Diagram: high-abundance vs. low-abundance organisms in the environment (our goal) vs. in our hands]
  • 1. Digital normalization (lossy compression)
    2. Partitioning
    3. Enabling usage of current, previously unusable, assembly tools
  • Reduces data for analysis
    • Longer sequences (increased accuracy of annotation)
    • Gene order
    • Does not rely on known references; access to unknowns
    • Creates new references
    • Lots of assembly tools available
    But…
  • But…
    • High memory requirements
    • Depends on good (~10x) sequencing coverage
    [Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes]
  • “Coverage” is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
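    The arithmetic behind this definition can be made explicit with a small sketch (hypothetical inputs, not from the slides): given a genome length and a list of (start, length) read alignments, average coverage is simply total aligned bases divided by genome length.

        def average_coverage(read_alignments, genome_length):
            """Average number of reads covering each base of the genome."""
            per_base = [0] * genome_length
            for start, length in read_alignments:
                for pos in range(start, min(start + length, genome_length)):
                    per_base[pos] += 1
            return sum(per_base) / genome_length

        # 100 reads of 100 bp spread over a 1,000 bp genome -> ~10x coverage
        alignments = [(i * 10 % 900, 100) for i in range(100)]
        print(average_coverage(alignments, 1000))  # 10.0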
  • Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
  • Each single base error generates ~k new k-mers. Generally, erroneous k-mers show up only once – errors are random.
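    A small sketch with made-up reads and k = 5 shows why a single substitution creates roughly k k-mers that appear nowhere else in the data:

        K = 5

        def kmers(seq, k=K):
            """Set of all k-mers in a sequence."""
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}

        true_read = "ACGTACGTACGTACGTACGT"
        error_read = "ACGTACGTACTTACGTACGT"  # single G->T substitution at position 10

        novel = kmers(error_read) - kmers(true_read)
        print(len(novel))  # ~k novel k-mers, each typically observed only once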
  • Low-abundance peak (errors)
  • High-abundance peak (true k-mers)
  • Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
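    Written out in code, the slide's hypothetical 10:1 abundance ratio gives:

        abundance_ratio = 10        # organism A is 10x more abundant than organism B
        target_coverage_B = 10      # coverage we actually want for the rare member B

        required_coverage_A = target_coverage_B * abundance_ratio
        print(required_coverage_A)  # 100 -- 100x of A sequenced just to reach 10x of B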
  • A digital analog to cDNA library normalization, diginorm:
    • Reference free
    • Is single pass: looks at each read only once
    • Does not “collect” the majority of errors
    • Keeps all low-coverage reads
    • Smooths out coverage of regions
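    Below is a minimal, illustrative sketch of the diginorm idea (this is not khmer's actual implementation, which uses a memory-bounded probabilistic k-mer counter rather than an exact dictionary): in one pass over the reads, keep a read only if the median abundance of its k-mers, counted so far, is still below the coverage cutoff. khmer ships this idea as its normalize-by-median script.

        from collections import defaultdict
        from statistics import median

        def normalize(reads, k=20, cutoff=20):
            """Single-pass digital normalization sketch over an iterable of read strings."""
            kmer_counts = defaultdict(int)   # exact counts; khmer uses a probabilistic sketch
            kept = []
            for seq in reads:
                kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
                counts = [kmer_counts[km] for km in kms]
                if not counts or median(counts) < cutoff:
                    for km in kms:               # only kept reads add to the counts, so
                        kmer_counts[km] += 1     # most error k-mers are never "collected"
                    kept.append(seq)
            return kept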
  • Digital normalization produces “good” metagenome assemblies.
    • Smooths out abundance variation, strain variation
    • Reduces computational requirements for assembly
    • It also kinda makes sense :)
  • Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences. “Divide and conquer.” Memory-efficient implementation helps to scale assembly. (Pell et al., 2012, PNAS)
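    For illustration only, here is a naive version of partitioning by connectivity (the real method of Pell et al. uses a memory-efficient probabilistic de Bruijn graph, not an exact union-find over shared k-mers): reads that share any k-mer land in the same bin, and each bin can then be assembled on its own.

        from collections import defaultdict

        def partition(reads, k=31):
            """Group reads into bins of reads connected by shared k-mers (union-find)."""
            parent = list(range(len(reads)))

            def find(i):
                while parent[i] != i:
                    parent[i] = parent[parent[i]]   # path halving
                    i = parent[i]
                return i

            def union(i, j):
                parent[find(i)] = find(j)

            first_seen = {}                          # k-mer -> first read index containing it
            for idx, seq in enumerate(reads):
                for pos in range(len(seq) - k + 1):
                    km = seq[pos:pos + k]
                    if km in first_seen:
                        union(idx, first_seen[km])
                    else:
                        first_seen[km] = idx

            bins = defaultdict(list)
            for idx, seq in enumerate(reads):
                bins[find(idx)].append(seq)
            return list(bins.values())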
  • Low coverage is the dominant problem blocking assembly of your soil metagenome.
  • In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion. These heuristics may not be appropriate for your sample!
    • High polymorphism?
    • Mixed population vs. clonal?
    • Genomic vs. metagenomic vs. mRNA
    Low coverage drives differences in assembly.
  • We can assemble virtually anything but soil ;). Genomes, transcriptomes, MDA, mixtures, etc.
    • Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).
    • Strain variation confuses assembly, but does not prevent useful results.
    • Diginorm is a systematic strategy to enable assembly.
    • Banfield has shown how to deconvolve strains at differential abundance.
    • Kostas K.'s results suggest that there will be a species gap sufficient to prevent contig misassembly.
  • Most metagenomes require 50-150 GB of RAM. Many people don’t have access to computers of that size. Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.
    http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html
  • Optimizing our programs => faster
    • Building an evaluation framework for metagenome assemblers
    • Error correction!
  • Achieving one or more assemblies is fairly straightforward. An assembly is a hypothesis; evaluating it is challenging, however, and that is where you should be thinking hardest about assembly. There are relatively few pipelines available for analyzing assembled metagenomic data.
  •  Questions?
  • How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment? (Visual Complexity: http://www.flickr.com/photos/maisonbisson)
    • Major efforts of data collection
    • Open mind for discoveries
    • Willingness to adjust to change
    • Multiple efforts
    • Well-designed experiments
    Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes
  • We receive Gb of sequences. Generally, my data is…
    • Split by barcodes
    • Untrimmed
    • Adapters are present
    • Two paired-end fastq files
    Underestimation of computational requirements:
    • Quality control steps usually require 2-3 times the amount of hard drive space
    • Similarity comparison against known databases is impractical (soil metagenome: ~50 years to BLAST)
    [Home Alone scream: my first slide graphic that I’m scared may date me.]
  • Two ways to reduce the onslaught:
    • Cluster into known observances (annotate, bin)
    • Assembly
    • Some mix of the above
  • Ten of you upload 1 HiSeq flowcell into MG-RAST
  • Illumina short reads from soil metagenome (~100 bp)
    • 454 short reads from soil metagenome (~368 bp)
    • Assembled contigs (Illumina reads) from soil metagenome (~491 bp)
    Read length will increase… computational requirements? Assembly is a great way to reduce data.