Adina Howe
Michigan State University, Adjunct
Argonne National Laboratory, Postdoc
ASM Workshop, May 2013
Visual Complexity
http://www.flickr.com/photos/maisonbisson
MSU Lab:
• Titus Brown
• Jim Tiedje
• Jason Pell
• Qingpeng Zhang
• Jordan Fish
• Eric McDonald
• Chris Welcher
• Aaron Garoutte
• Jiarong Guo
Collaborators:
• Janet Jansson
• Susannah Tringe
• I will upload this to SlideShare (adinachuanghowe)
• khmer documentation
github.com/ged-lab/khmer/
https://khmer.readthedocs.org/en/latest/guide.html
• Manuscripts
Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
http://www.pnas.org/content/early/2012/07/25/1121464109
A reference-free algorithm for computational normalization of shotgun sequencing data
http://arxiv.org/abs/1203.4802
Assembling large, complex metagenomes
http://arxiv.org/abs/1212.2832
[Figure: high-abundance and low-abundance community members, in the environment (our goal) vs. in our hands]
A few gotchas of sequencing:
Errors / Artifacts (confusion)
Diversity / Complexity (scale)
1. Digital normalization (lossy compression)
2. Partitioning
3. Enabling use of current, previously unusable assembly tools
• Reduces data for analysis
• Longer sequences (increased accuracy of annotation)
• Gene order
• Does not rely on known references, access to unknowns
• Creates new references
• Lots of assembly tools available
But…
• High memory requirements
• Depends on good (~10x) sequencing coverage
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
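As a rough illustration of that definition, average coverage can be estimated from the read count, read length, and genome size. This is a minimal sketch; the function name and the example numbers are hypothetical and are not taken from the slides.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads overlapping each base: total bases sequenced / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical example: 500,000 reads of 100 bp from a 5 Mbp genome.
print(mean_coverage(500_000, 100, 5_000_000))  # 10.0
```

For a metagenome the same arithmetic applies per organism, which is why rare community members end up with much lower coverage than abundant ones.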
Note that k-mer abundance is not properly represented here! Each
blue k-mer will be present around 10 times.
Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
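The "~k new k-mers" claim can be checked on a toy sequence. The sketch below is illustrative only: the sequence, the error position, and the helper names are made up, and it ignores reverse complements.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def new_kmers_from_error(seq: str, pos: int, base: str, k: int) -> set:
    """k-mers that exist only because of a single-base substitution at pos."""
    mutated = seq[:pos] + base + seq[pos + 1:]
    return kmers(mutated, k) - kmers(seq, k)


# Hypothetical 20 bp "true" sequence with one substitution error in the middle.
true_seq = "ACGTTGCATGGCAATCCGTA"
novel = new_kmers_from_error(true_seq, pos=10, base="T", k=5)
print(sorted(novel))  # every k-mer window covering the error is new
```

An error away from the read ends is covered by k different windows, so it creates up to k k-mers that are not in the true sequence; errors near the ends create fewer.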
Low-abundance peak (errors)
High-abundance peak (true k-mers)
Suppose A and B are present at a 10:1 ratio. To get 10x coverage of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
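The arithmetic behind this is a single multiplication. The sketch below assumes both organisms have the same genome size and are sampled uniformly; the function name is hypothetical.

```python
def coverage_of_abundant(target_rare_coverage: float, abundance_ratio: float) -> float:
    """Coverage the abundant organism receives by the time the rare one reaches its target.

    Assumes equal genome sizes and uniform sampling (an illustrative simplification).
    """
    return target_rare_coverage * abundance_ratio


# A is 10 times more abundant than B; we want 10x coverage of B.
print(coverage_of_abundant(10, 10))  # 100
```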
A digital analog to cDNA library normalization, diginorm:
Is reference-free;
Is single pass: looks at each read only once;
Does not “collect” the majority of errors;
Keeps all low-coverage reads;
Smooths out coverage of regions.
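The properties listed above can be sketched in a few lines. This is a toy version, not the khmer implementation: it uses an exact in-memory `Counter` where khmer uses a memory-efficient probabilistic counting structure, it ignores reverse complements, and the default values for `k` and `cutoff` are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All k-mers of a read, in order (duplicates kept)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Single pass: keep a read only if its median k-mer count so far is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # Estimate the read's coverage from k-mers seen in previously kept reads.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute to the counts
            kept.append(read)
    return kept


# Toy data: one sequence seen 50 times and one seen once.
reads = ["ACGTACGGTCA"] * 50 + ["TTGACCAGTAA"]
kept = normalize_by_median(reads, k=5, cutoff=3)
print(len(kept))
```

With these toy reads the abundant sequence is kept only until its estimated coverage reaches the cutoff, while the rare read is always kept. Because discarded reads never update the counts, most erroneous k-mers from high-coverage regions are never stored.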
• Digital normalization produces “good” metagenome assemblies.
• Smooths out abundance variation, strain variation.
• Reduces computational requirements for assembly.
• It also kinda makes sense :)
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
“Divide and conquer”
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
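Binning by connectivity can be sketched with a union-find structure: reads that share a k-mer, directly or through a chain of other reads, land in the same bin. This is a simplification for illustration. Pell et al. traverse a probabilistic de Bruijn graph, whereas this toy version keeps an exact dictionary of k-mers, ignores reverse complements, and uses a hypothetical function name and default `k`.

```python
def partition_reads(reads, k=20):
    """Group reads into bins of transitively connected reads (shared k-mers)."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the representative of i's bin, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in first_seen:
                ra, rb = find(idx), find(first_seen[km])
                if ra != rb:
                    parent[ra] = rb  # merge the two bins
            else:
                first_seen[km] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Toy data: two pairs of overlapping reads from two unrelated sequences.
reads = ["ACGTACGGTC", "ACGGTCATTG", "TTTTGGGGCC", "GGGGCCAAAT"]
for b in partition_reads(reads, k=5):
    print(b)
```

Each bin can then be assembled separately, which is the "divide and conquer" step.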
Low coverage is the dominant problem blocking assembly of your soil metagenome.
• In order to build assemblies, each assembler makes choices (uses heuristics) to reach a conclusion.
• These heuristics may not be appropriate for your sample!
• High polymorphism?
• Mixed population vs clonal?
• Genomic vs metagenomic vs mRNA?
• Low coverage drives differences in assembly.
• We can assemble virtually anything but soil ;).
• Genomes, transcriptomes, MDA, mixtures, etc.
• Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).
• Strain variation confuses assembly, but does not prevent useful results.
• Diginorm is a systematic strategy to enable assembly.
• Banfield has shown how to deconvolve strains at differential abundance.
• Kostas K.'s results suggest that there will be a species gap sufficient to prevent contig misassembly.
• Most metagenomes require 50-150 GB of RAM.
• Many people don’t have access to computers of that size.
• Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.
• http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html
• Optimizing our programs => faster.
• Building an evaluation framework for metagenome assemblers.
• Error correction!
• Achieving one or more assemblies is fairly straightforward.
• An assembly is a hypothesis. Evaluating assemblies is challenging, however, and is where you should be thinking hardest.
• There are relatively few pipelines available for analyzing assembled metagenomic data.
• Questions?
• How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment?
• Major efforts of data collection
• Open mind for discoveries
• Willingness to adjust to change
• Multiple efforts
• Well-designed experiments
Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes
• We receive Gb of sequences
• Generally, my data is…
• Split by barcodes
• Untrimmed
• Adapters are present
• Two paired-end fastq files
• Underestimation of computational requirements:
• Quality control steps usually require 2-3 times the amount of hard drive space
• Similarity comparison against known databases impractical (soil metagenome ~50 years to BLAST)
Home Alone Scream
My first slide graphic that I’m scared may date me.
Two ways to reduce the onslaught:
Cluster into known observances (annotate, bin)
Assembly
Some mix of the above
Ten of you upload 1 HiSeq flowcell into MG-RAST
Illumina short reads from soil metagenome (~100 bp)
454 short reads from soil metagenome (~368 bp)
Assembled contigs (Illumina) from soil metagenome (~491 bp)
Read lengths will increase… and computational requirements? Assembly is a great way to reduce data.

ASM 2013 Metagenomic Assembly Workshop Slides
