ASM 2013 Metagenomic Assembly Workshop Slides

Adina Howe
Michigan State University, Adjunct
Argonne National Laboratory, Postdoc
ASMWorkshop, May 2013
Visual Complexity
http://www.flickr.com/photos/maisonbisson

MSU Lab:
 Titus Brown
 Jim Tiedje
 Jason Pell
 Qingpeng Zhang
 Jordan Fish
 Eric McDonald
 Chris Welcher
 Aaron Garoutte
 Jiarong Guo
Collaborators:
 Janet Jansson
 Susannah Tringe

 I will upload this on SlideShare (adinachuanghowe)
 Khmer documentation
github.com/ged-lab/khmer/
https://khmer.readthedocs.org/en/latest/guide.html
 Manuscripts
Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
http://www.pnas.org/content/early/2012/07/25/1121464109
A reference-free algorithm for computational normalization of shotgun sequencing data
http://arxiv.org/abs/1203.4802
Assembling large, complex metagenomes
http://arxiv.org/abs/1212.2832

[Figure: high-abundance and low-abundance community members, shown in the environment (our goal) and in our hands]
A few gotchas of sequencing:
Errors / Artifacts (confusion)
Diversity / Complexity (scale)

1. Digital normalization (lossy compression)
2. Partitioning
3. Enabling use of current, previously unusable assembly tools

 Reduces data for analysis
 Longer sequences (increased accuracy of annotation)
 Gene order
 Does not rely on known references, access to unknowns
 Creates new references
 Lots of assembly tools available
But…
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
 High memory requirements
 Depends on good (~10x) sequencing coverage

“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
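
The definition above can be sketched in a few lines of Python. This is only an illustration: the function names, read counts, and genome size are made-up assumptions, not values from the workshop data.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Average reads overlapping each base: total bases sequenced / genome size."""
    return num_reads * read_length / genome_size


def per_base_depth(read_starts, read_length, genome_size):
    """Count reads over each position (the 'line straight down' view)."""
    depth = [0] * genome_size
    for start in read_starts:
        # Reads that run off the end of the genome are truncated.
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


# Hypothetical numbers: 100 reads of 100 bp over a 1,000 bp genome.
print(average_coverage(100, 100, 1000))  # 10.0
```

Averaging the list returned by `per_base_depth` gives the same coverage number when reads do not run off the end of the genome.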

Note that k-mer abundance is not properly represented here! Each
blue k-mer will be present around 10 times.

Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
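
A small sketch of why one wrong base produces about k new k-mers: every k-mer window that spans the error changes. The helper names and the example sequence are hypothetical, chosen only for illustration.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def new_kmers_from_error(seq, pos, base, k):
    """k-mers that appear only after substituting `base` at position `pos`."""
    mutated = seq[:pos] + base + seq[pos + 1:]
    return kmers(mutated, k) - kmers(seq, k)


# Hypothetical 20 bp read with a single substitution (G -> A) in the middle.
read = "ACGTTGCATGGCCTAAGCTA"
novel = new_kmers_from_error(read, pos=10, base="A", k=4)
print(sorted(novel))
```

An error near the end of a read is spanned by fewer than k windows, and a new k-mer can by chance match a real one, so "~k" is an upper bound rather than an exact count.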

High-abundance peak
(true k-mers)

Suppose you have a
dilution factor of A (10) to
B (1). To get 10x of B you
need to get 100x of A!
Overkill!!
This 100x will consume disk
space and, because of
errors, memory.
We can discard it for you…
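
The arithmetic behind the A/B example, as a sketch. It assumes sequencing depth scales linearly with relative abundance; the function name is hypothetical.

```python
def depth_when_rarest_reaches(target_coverage, abundances):
    """Coverage of every member once the rarest one reaches target_coverage.

    Assumes reads are sampled in proportion to relative abundance.
    """
    rarest = min(abundances.values())
    return {name: target_coverage * a / rarest for name, a in abundances.items()}


# A is 10 times more abundant than B; we want 10x coverage of B.
print(depth_when_rarest_reaches(10, {"A": 10, "B": 1}))
# {'A': 100.0, 'B': 10.0}
```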

A digital analog to cDNA library normalization,
diginorm:
Reference free.
Is single pass: looks at each read only once;
Does not “collect” the majority of errors;
Keeps all low-coverage reads;
Smooths out coverage of regions.
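
A minimal sketch of the idea, assuming the median k-mer count rule described in the normalization manuscript listed earlier: estimate a read's coverage from the median count of its k-mers, keep the read only if that estimate is below a cutoff, and count k-mers only from kept reads. This is not khmer's code. An exact in-memory `Counter` stands in for khmer's memory-efficient counting, reverse complements are ignored, and the default `k` and `cutoff` values are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All k-mers of a read, in order (duplicates kept)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass: keep a read only if its estimated coverage is below cutoff."""
    counts = Counter()  # counts k-mers from kept reads only
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # Median k-mer count so far is the coverage estimate for this read.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy input: one sequence seen 50 times, another seen once.
reads = ["ACGTTGCATGGCCTAAGCTA"] * 50 + ["GGGGGGAAAAAA"]
kept = digital_normalization(reads, k=4, cutoff=5)
print(len(reads), "reads in,", len(kept), "reads kept")
```

In this sketch, reads from regions that have already reached the cutoff are dropped along with whatever errors they carry, while a read from a low-coverage region always passes the test.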

 Digital normalization produces “good”
metagenome assemblies.
 Smooths out abundance variation, strain
variation.
 Reduces computational requirements for
assembly.
 It also kinda makes sense :)

Split reads into “bins”
belonging to different
source species.
Can do this based almost
entirely on connectivity
of sequences.
“Divide and conquer”
Memory-efficient
implementation helps
to scale assembly.
Pell et al., 2012, PNAS
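
A toy sketch of binning by connectivity: reads that share any k-mer, directly or through a chain of other reads, end up in the same bin. This is a simplification and not the khmer implementation. It uses exact dictionaries and a union-find structure instead of the probabilistic de Bruijn graph of Pell et al., ignores reverse complements, and the function name is hypothetical.

```python
def partition_reads(reads, k=20):
    """Group reads into bins connected by shared k-mers (union-find)."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in first_seen:
                union(first_seen[km], idx)
            else:
                first_seen[km] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Two toy "species": the first two reads overlap, as do the last two.
parts = partition_reads(["ACGTTGCA", "TTGCATGG", "CCCCCCCC", "CCCCCCAA"], k=4)
print(parts)
```

Each bin can then be assembled on its own, which is the "divide and conquer" step.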

Low coverage is the dominant problem blocking assembly of
your soil metagenome.

 In order to build assemblies, each assembler
makes choices – uses heuristics – to reach a
conclusion.
 These heuristics may not be appropriate for your
sample!
 High polymorphism?
 Mixed population vs clonal?
 Genomic vs metagenomic vs mRNA
 Low coverage drives differences in assembly.

 We can assemble virtually anything but soil ;).
 Genomes, transcriptomes, MDA, mixtures, etc.
 Repeat resolution will be fundamentally limited by
sequencing technology (insert size; sampling depth)
 Strain variation confuses assembly, but does not
prevent useful results.
 Diginorm is a systematic strategy to enable assembly.
 Banfield has shown how to deconvolve strains at
differential abundance.
 Kostas K.’s results suggest that there will be a species
gap sufficient to prevent contig misassembly.

 Most metagenomes require 50-150 GB of RAM.
 Many people don’t have access to computers of
that size.
 Amazon Web Services (aws.amazon.com) will
happily rent you such computers for $1-2/hr.
 http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html

 Optimizing our programs => faster.
 Building an evaluation framework for
metagenome assemblers.
 Error correction!

 Achieving one or more assemblies is fairly
straightforward.
 An assembly is a hypothesis. Evaluating assemblies is
challenging, however, and is where you should be
thinking hardest.
 There are relatively few pipelines available
for analyzing assembled metagenomic data.

 How do we study complexity? Interactions? Diversity?
Communities? Evolution? Our environment?
• Major efforts of data collection
• Open mind for discoveries
• Willingness to adjust to change
• Multiple efforts
• Well-designed experiments
Workshop example: Illumina deep
sequencing and scaling large datasets
on soil metagenomes

 We receive Gb of sequences
 Generally, my data is…
 Split by barcodes
 Untrimmed
 Adapters are present
 Two paired end fastq files
 Underestimation of computational
requirements:
 Quality control steps usually require 2-3 times the
amount of hard drive space
 Similarity comparison against known databases
impractical (soil metagenome ~50 years to BLAST)
Home Alone Scream
My first slide graphic that I’m scared may date me.

Two ways to reduce the onslaught:
Cluster into known observances (annotate,
bin)
Assembly
Some mix of the above

Ten of you upload 1 HiSeq
flowcell into MG-RAST

Illumina short reads from soil
metagenome (~100 bp)
454 short reads from soil
metagenome (~368 bp)
Assembled contigs (Illumina)
from soil metagenome
(~491 bp)
Read length will increase… computational requirements?
Assembly is a great way to reduce data.
