Building better genomes, transcriptomes, and metagenomes
with improved techniques for de novo assembly -- an easier way to do it

C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
March 2013
ctb@msu.edu
Acknowledgements

Lab members involved:
 Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning,
 Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom,
 Kanchan Pavangadkar, Eric McDonald

Collaborators:
 Jim Tiedje, MSU; Erich Schwarz, Caltech / Cornell; Paul Sternberg, Caltech;
 Robin Gasser, U. Melbourne; Weiming Li

Funding: USDA NIFA; NSF IOS; BEACON.
We practice open science!
        “Be the change you want”

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ('titus brown blog')
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
 Preprints: on arXiv, q-bio:
  'diginorm arxiv'
Outline

1. Computational challenges associated with sequencing DNA/RNA from
   non-model organisms.

2. Three case studies:
   1. Parasitic nematode genome assembly
   2. Lamprey transcriptome assembly
   3. Soil metagenome assembly

3. Future directions
My interests
I work primarily on organisms of agricultural,
evolutionary, or ecological importance, which tend
to have poor reference genomes and
transcriptomes. Focus on:

 Improving assembly sensitivity to better recover
 genomic/transcriptomic sequence, often from
 “weird” samples.

 Scaling sequence assembly approaches so that
 huge assemblies are possible and big assemblies
 are straightforward.
There is quite a bit of life left to sequence & assemble

http://pacelab.colorado.edu/
“Weird” biological samples:

 Single genome: hard-to-sequence DNA (e.g. GC/AT bias)

 Transcriptome: differential expression!

 High polymorphism data: multiple alleles

 Whole genome amplified: often extreme amplification bias (next slide)

 Metagenome (mixed microbial community): differential abundance within
  the community.
Single genome assembly is already
challenging --
Once you start sequencing
metagenomes…
Shotgun sequencing and
coverage




 “Coverage” is simply the average number of reads that overlap each true
 base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the
                  top through all of the reads.
Random sampling => deep sampling
needed




  Typically 10-100x needed for robust recovery (300 Gbp for human)
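
A quick back-of-envelope check of those numbers, as a minimal Python sketch
(the 3 Gbp genome size and 100 bp read length are illustrative assumptions,
not values from the slide):

    # Coverage = N * L / G: average number of reads overlapping each true base.
    def expected_coverage(n_reads, read_len, genome_size):
        return n_reads * read_len / genome_size

    # Illustrative numbers: ~3 Gbp human genome, 100 bp reads, 100x target.
    genome_size = 3_000_000_000
    read_len = 100
    target_coverage = 100                      # upper end of "10-100x needed"

    reads_needed = target_coverage * genome_size / read_len
    print(f"reads needed: {reads_needed:.1e}")                 # ~3e9 reads
    print(f"basepairs needed: {reads_needed * read_len:.1e}")  # ~3e11 bp = 300 Gbp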
Various experimental treatments can
also modify coverage distribution.


                               (MDA amplified)
Non-normal coverage distributions
lead to decreased assembly
sensitivity
 Many assemblers embed a “coverage model” in
 their approach.
   Genome assemblers: abnormally low coverage is
    erroneous; abnormally high coverage is repetitive
    sequence.
   Transcriptome assemblers: isoforms should have
    same coverage across the entire isoform.
   Metagenome assemblers: differing abundances
    indicate different strains.


 Is there a different way? (Yes.)
Memory requirements (Velvet/Oases – est.)

 Bacterial genome (colony)       1-2 GB
 Human genome                    500-1000 GB
 Vertebrate mRNA                 100 GB+
 Low complexity metagenome       100 GB
 High complexity metagenome      1000 GB++
Practical memory measurements
K-mer based assemblers scale
poorly

Why do big data sets require big machines??

Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
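
To see why memory tracks the size of the data set rather than the genome,
here is a rough toy estimate of my own (assumed error rate and k): each
sequencing error creates up to k previously unseen k-mers, so erroneous
k-mers quickly outnumber genomic ones as more reads are added.

    # Rough count of distinct k-mers from errors vs. from the genome itself.
    # Assumed parameters: 1% per-base error rate, k = 31, 100x of a 3 Gbp genome.
    genome_size = 3_000_000_000          # bounds the "real" variation
    n_reads, read_len = 3_000_000_000, 100
    error_rate, k = 0.01, 31

    erroneous_kmers = n_reads * read_len * error_rate * k   # ~9e10
    print(f"genomic k-mers  : ~{genome_size:.1e}")
    print(f"erroneous k-mers: ~{erroneous_kmers:.1e}  (grows with data, not genome)")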
Why does efficiency matter?
 It is now cheaper to generate sequence than it is
 to analyze it computationally!
   Machine time
   (Wo)man power/time


 More efficient programs allow better exploration
 of analysis parameters for maximizing sensitivity.

 Better or more sensitive bioinformatic approaches
 can be developed on top of more efficient theory.
Approach: Digital normalization
(a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you
need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…
Digital normalization
(series of figure slides)
Digital normalization approach
   A digital analog to cDNA library normalization,
                       diginorm:

 Is single pass: looks at each read only once;


 Does not “collect” the majority of errors;


 Keeps all low-coverage reads;


 Smooths out coverage of regions.
Coverage before digital
normalization:


                           (MDA amplified)
Coverage after digital normalization:

 Normalizes coverage

 Discards redundancy

 Eliminates majority of errors

 Scales assembly dramatically

 Assembly is 98% identical
Digital normalization approach
   A digital analog to cDNA library normalization,
    diginorm is a read prefiltering approach that:

 Is single pass: looks at each read only once;


 Does not “collect” the majority of errors;


 Keeps all low-coverage reads;


 Smooths out coverage of regions. (A minimal code sketch follows.)
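
A minimal sketch of the idea, assuming the median-k-mer-coverage formulation
from the diginorm preprint: estimate each read's coverage as the median count
of its k-mers seen so far, and keep the read only if that estimate is still
below a cutoff. A plain Python dict stands in for khmer's compact counting
structure, and k = 20 / cutoff = 20 are illustrative defaults, not
prescriptions.

    # Minimal digital-normalization sketch: single pass, keep reads whose
    # estimated coverage is still below a cutoff.
    from collections import defaultdict
    from statistics import median

    K = 20
    COVERAGE_CUTOFF = 20
    counts = defaultdict(int)

    def kmers(seq, k=K):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def estimated_coverage(read):
        """Median count, in the table so far, of this read's k-mers."""
        kms = kmers(read)
        return median(counts[km] for km in kms) if kms else 0

    def normalize(reads):
        """Yield only reads whose estimated coverage is below the cutoff."""
        for read in reads:
            if estimated_coverage(read) < COVERAGE_CUTOFF:
                for km in kmers(read):
                    counts[km] += 1       # only kept reads update the table
                yield read

    # Usage: kept_reads = list(normalize(read_iterator))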
Contig assembly is significantly more efficient and
now scales with underlying genome size




    Transcriptomes, microbial genomes incl MDA,
     and most metagenomes can be assembled in
     under 50 GB of RAM, with identical or improved
     results.
Some diginorm examples:

1.   Assembly of the H. contortus parasitic
     nematode genome, a “high
     polymorphism/variable coverage” problem.

2.   Reference-free assembly of the lamprey (P.
     marinus) transcriptome, a “big assembly”
     problem.

3.   Assembly of two Midwest soil metagenomes,
     Iowa corn and Iowa prairie – the “impossible”
     assembly problem.
1. The H. contortus problem
 A sheep parasite.


 ~350 Mbp genome


 Sequenced DNA from 6 individuals after whole
  genome amplification; estimated 10%
  heterozygosity (!?)

 Significant bacterial contamination.


    (w/Robin Gasser, Paul Sternberg, and Erich
                    Schwarz)
H. contortus life cycle




Refs.: Nikolaou and Gasser (2006), Int. J. Parasitol. 36, 859-868;
        Prichard and Geary (2008), Nature 452, 157-158.
The power of next-gen. sequencing:
   get 180x coverage ... and then watch your
            assemblies never finish

                Libraries built and sequenced:

           300-nt inserts, 2x75 nt paired-end reads
     500-nt inserts, 2x75 and 2x100 nt paired-end reads
   2-kb, 5-kb, and 10-kb inserts, 2x49 nt paired-end reads

 Nothing would assemble at all until filtered for basic quality.

Filtering let ≤500-nt inserts assemble in a mere week.
But 2+ kb-sized inserts would not assemble even then.

                                                  Erich Schwarz
Assembly after digital normalization
 Diginorm readily enabled assembly of a 404 Mbp
  genome with N50 of 15.6 kb;
 Post-processing with GapCloser and
  SOAPdenovo scaffolding led to final assembly of
  453 Mbp with N50 of 34.2kb.
 CEGMA estimates 73-94% complete genome.


 Diginorm helped by:
   Suppressing high polymorphism, esp in repeats;
   Eliminating 95% of sequencing errors;
   “Squashing” coverage variation from whole genome
   amplification and bacterial contamination
Assembly after digital normalization
 Diginorm readily enabled assembly of a 404 Mbp
  genome with N50 of 15.6 kb;
 Post-processing with GapCloser and
  SOAPdenovo scaffolding led to final assembly of
  453 Mbp with N50 of 34.2kb.
 CEGMA estimates 73-94% complete genome.


 Diginorm helped by:
   Suppressing high polymorphism, esp in
    repeats;
   Eliminating 95% of sequencing errors;
   “Squashing” coverage variation from whole genome
    amplification and bacterial contamination
Next steps with H. contortus
 Publish the genome paper 


 Identification of antibiotic targets for treatment in
  agricultural settings (animal husbandry).

 Serving as “reference approach” for a wide
  variety of parasitic nematodes, many of which
  have similar genomic issues.
2. Lamprey transcriptome assembly.
 Lamprey genome is draft quality; low contiguity,
  missing ~30%.
 No closely related reference.
 Full-length and exon-level gene predictions are 50-
  75% reliable, and rarely capture UTRs / isoforms.

 De novo assembly, if we do it well, can identify
   Novel genes
   Novel exons
   Fast evolving genes


 Somatic recombination: how much are we missing,
  really?
Sea lamprey in the Great Lakes


                 Non-native
                 Parasite of
                  medium to large
                  fishes
                 Caused
                  populations of
                  host fishes to
                  crash


                        Li Lab / Y-W C-D
Transcriptome results
 Started with 5.1 billion reads from 50 different
 tissues.

 Digital normalization discarded 98.7% of them as
 redundant, leaving 87m (!)

 These assembled into more than 100,000
 transcripts > 1kb

 Against known full-length, 98.7% agreement
 (accuracy); 99.7% included (contiguity)
Evaluating de novo lamprey transcriptome

  Estimate genome is ~70% complete (gene complement)
  Majority of genome-annotated gene sets recovered by mRNAseq assembly.
  Note method to recover transcript families w/o genome…

Assembly analysis            Gene families   Gene families in genome   Fraction in genome
mRNAseq assembly                     72003                     51632                71.7%
reference gene set                    8523                      8134                95.4%
combined                             73773                     53137                72.0%
intersection                          6753                      6753               100.0%
only in mRNAseq assembly             65250                     44884                68.8%
only in reference gene set            1770                      1500                84.7%

                       (Includes transcripts > 300 bp)
Next steps with lamprey
 Far more complete transcriptome than the one
 predicted from the genome!

 Enabling studies in –
   Basal vertebrate phylogeny
   Biliary atresia
   Evolutionary origin of brown fat (previously thought
    to be mammalian only!)
   Pheromonal response in adults
3. Soil metagenome assembly
 Observation: 99% of microbes cannot easily be
  cultured in the lab. (“The great plate count
  anomaly”)
 Many reasons why you can't or don't want to
  culture:
   Syntrophic relationships
   Niche-specificity or unknown physiology
   Dormant microbes
   Abundance within communities


   Single-cell sequencing & shotgun metagenomics
      are two common ways to investigate microbial
                      communities.
SAMPLING LOCATIONS
Investigating soil microbial
ecology
 What ecosystem level functions are present, and
  how do microbes do them?
 How does agricultural soil differ from native soil?
 How does soil respond to climate perturbation?


 Questions that are not easy to answer without
 shotgun sequencing:
   What kind of strain-level heterogeneity is present in
    the population?
   What does the phage and viral population look like?
   What species are where?
A “Grand Challenge” dataset (DOE/JGI)

[Bar chart: basepairs of sequencing (Gbp) per sample — Iowa continuous corn,
Iowa native prairie, Kansas cultivated corn, Kansas native prairie, Wisconsin
continuous corn, Wisconsin native prairie, Wisconsin restored prairie,
Wisconsin switchgrass (GAII and HiSeq runs). Total: 1,846 Gbp of soil
metagenome. Reference points: MetaHIT (Qin et al., 2011), 578 Gbp; rumen
(Hess et al., 2011), 268 Gbp; rumen k-mer filtered, 111 Gbp; NCBI nr
database, 37 Gbp.]
“Whoa, that's a lot of data…”

[Bar chart: estimated sequencing required (bp, w/Illumina) for an E. coli
genome, human genome, vertebrate transcriptome, human gut, marine, and soil
sample; y-axis spans 0 to 5x10^14 bp.]
Additional Approach for Metagenomes: Data partitioning
(a computational version of cell sorting)

 Split reads into “bins” belonging to different source species.
 Can do this based almost entirely on connectivity of sequences
  (a toy sketch follows below).
 “Divide and conquer”
 Memory-efficient implementation helps to scale assembly.

Pell et al., 2012, PNAS
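
As a toy illustration of binning by connectivity: the published method
traverses a probabilistic, Bloom-filter-based de Bruijn graph in low memory;
this sketch substitutes an exact union-find over reads that share a k-mer
(k chosen arbitrarily) just to show the grouping idea.

    # Toy read partitioning by k-mer connectivity (exact stand-in for the
    # probabilistic de Bruijn graph traversal of Pell et al., 2012).
    K = 20

    def kmers(seq, k=K):
        return (seq[i:i + k] for i in range(len(seq) - k + 1))

    def partition(reads):
        """Group reads that are connected through shared k-mers."""
        parent = list(range(len(reads)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path compression
                i = parent[i]
            return i

        def union(i, j):
            parent[find(i)] = find(j)

        seen = {}                  # k-mer -> index of first read containing it
        for idx, read in enumerate(reads):
            for km in kmers(read):
                if km in seen:
                    union(idx, seen[km])
                else:
                    seen[km] = idx

        bins = {}
        for idx in range(len(reads)):
            bins.setdefault(find(idx), []).append(reads[idx])
        return list(bins.values())

    # Usage: each returned bin should mostly hold reads from one source genome.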
Partitioning separates reads by genome.

When computationally spiking HMP mock data with one E. coli genome (left) or
multiple E. coli strains (right), the majority of partitions contain reads
from only a single genome (blue) vs. multi-genome partitions (green).
Partitions containing spiked data are indicated with a *.

Adina Howe
Assembly results for Iowa corn and prairie
(2x ~300 Gbp soil metagenomes)

 Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding
      2.5 bill                    4.5 mill                 19%                   5.3 mill
      3.5 bill                    5.9 mill                 22%                   6.8 mill

Putting it in perspective:
 Total equivalent of ~1200 bacterial genomes
 Human genome ~3 billion bp

Adina Howe
Resulting contigs are low coverage.

Figure 11: Coverage (median basepair) distribution of assembled contigs from
soil metagenomes.
Strain variation?

[Plot: top two allele frequencies vs. position within contig.]

 Can measure by read mapping (a small sketch follows).
 Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.
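
A small sketch of the measurement itself, assuming reads have already been
mapped and a per-position pileup of base calls is available (producing the
pileup is not shown); the 5% threshold is the one quoted above.

    # Per-position allele frequencies from mapped reads, to flag contigs with
    # a minor-allele (polymorphism) rate above 5%.
    from collections import Counter

    def top_two_freqs(column):
        """column: list of base calls covering one contig position."""
        if not column:
            return 0.0, 0.0
        common = Counter(column).most_common(2)
        total = len(column)
        first = common[0][1] / total
        second = common[1][1] / total if len(common) > 1 else 0.0
        return first, second

    def polymorphic_positions(pileup_columns, min_minor_freq=0.05):
        """pileup_columns: per-position base-call lists for one contig."""
        return [i for i, col in enumerate(pileup_columns)
                if top_two_freqs(col)[1] > min_minor_freq]

    # Usage: a contig is flagged if polymorphic_positions(...) is non-empty.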
Tentative observations from our soil
samples:
 We need 100x as much data…
 Much of our sample may consist of phage.
 Phylogeny varies more than functional
  predictions.
 We see little to no strain variation within our
  samples
   Not bulk soil --
   Very small, localized, and low coverage samples
 We may be able to do selective really deep
   sequencing and then infer the rest from 16S.
   Implications for soil aggregate assembly?
Additional projects --
 Bacterial symbionts of bone eating worms – w/Shana
  Goffredi.
   Mixed sample => low-complexity metagenome
   Amplified DNA => uneven coverage
   ~50% complete => 97% complete assembly


 Molgulid ascidians – w/Billie Swalla and Lionel
  Christiaen.
   High heterozygosity transcriptomes and genomes


 Chick genome & transcriptome assembly – w/Hans
  Cheng, Jerry Dodgson, and Wes Warren.
   Hard to sequence/assemble microchromosomes
Digital normalization is enabling.
 This is a very powerful technique for “fixing” weird
  samples. (…and all samples are weird.)
 A number of real world projects are using
  diginorm successfully (~6-10 in my lab; ~70-80?
  overall).
 A diginorm-derived procedure is now a
  recommended part of the Trinity mRNAseq
  assembler.

 Diginorm is
  1. Very computationally efficient;
  2. Always “cheaper” than running an assembler in
     the first place
  3. Almost always improves results (** in our hands ;)
Next steps

 Computation is a limiting factor in dealing with NGS
    Assembly is slow, compute intensive, and sensitive to parameters
    Mapping and SNP calling are parameter sensitive
    Exploring hypotheses and finding the right balance between sensitivity
     and specificity often requires “parameter sweeps” – executing one or
     more pieces of software with multiple parameters (a toy sweep sketch
     follows).

 Can we attack this problem with lossy compression?
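
For instance, a parameter sweep can be as simple as a wrapper that reruns an
assembler across several k values. The Velvet command lines below are
illustrative placeholders (the input file name and k values are my own
assumptions, not a recommended protocol).

    # Hypothetical parameter sweep: run Velvet over several k values.
    import subprocess

    K_VALUES = [21, 25, 31, 41]
    READS = "reads.fa"                 # placeholder input file

    for k in K_VALUES:
        outdir = f"asm_k{k}"
        subprocess.run(["velveth", outdir, str(k), "-fasta", "-short", READS],
                       check=True)
        subprocess.run(["velvetg", outdir], check=True)
        print(f"k={k}: assembly written to {outdir}/")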
Digital normalization retains information, while
discarding data and errors
Lossy compression
(illustrated with JPEG image examples)

http://en.wikipedia.org/wiki/JPEG
~2 GB – 2 TB of single-chassis RAM

[Diagram: raw data (~10-100 GB) → analysis → “information” (~1 GB) →
database & integration.]
Can we use lossy compression approaches to make
downstream analysis faster and better? (Yes.)

We are embarking on a project to build theory around digital normalization
so that we can use it as a prefilter for mapping and variant calling. We
have also implemented error correction, which enables easy quantification
of references by reads.

 Streaming low-mem distributable prefilters for all
Error correction for metagenomic and
mRNAseq data.

Most error correction algorithms rely on an assumption of uniform coverage;
ours does not. Ours is also streaming. (A toy illustration follows this
slide.)




                                                 Jason Pell
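
Not the streaming graph-alignment algorithm itself, but a toy illustration of
coverage-aware correction: the trust threshold is derived per read, from that
read's own median k-mer count, rather than from a single global cutoff, which
is what lets such a scheme tolerate non-uniform coverage. Parameters and
thresholds here are my own illustrative choices.

    # Toy coverage-aware error correction (NOT the lab's graph-alignment method).
    from statistics import median

    K = 20

    def kmers(seq, k=K):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def correct_read(read, counts):
        """Try single-base fixes at positions covered only by weak k-mers."""
        kms = kmers(read)
        if not kms:
            return read
        # Per-read trust threshold: a fraction of this read's median k-mer count.
        trusted = max(2, median(counts.get(km, 0) for km in kms) // 4)

        def looks_trusted(bases, i):
            window = "".join(bases[max(0, i - K + 1):i + K])
            return all(counts.get(km, 0) >= trusted for km in kmers(window))

        bases = list(read)
        for i in range(len(bases)):
            if looks_trusted(bases, i):
                continue                       # position already looks clean
            for alt in "ACGT":
                if alt == bases[i]:
                    continue
                trial = bases[:i] + [alt] + bases[i + 1:]
                if looks_trusted(trial, i):
                    bases = trial
                    break
        return "".join(bases)

    # Usage: corrected = correct_read(read, kmer_count_table)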
Error correction via graph alignment
Where next?
 Assembly in the cloud!

 Study and formalize paired-end / mate-pair handling in
  diginorm.

 Web interface to run and evaluate assemblies.

 New methods to evaluate and improve
  assemblies, including a “meta assembly” approach for
  metagenomes.

 Fast and efficient error correction of sequencing data
   Can also address assembly of high polymorphism
    sequence, allelic mapping bias, and others;
   Can also enable fast/efficient storage and search of nucleic
    acid databases.
Four+ papers on our work, soon.
 2012 PNAS, Pell et al., pmid 22847406 (partitioning).


 Submitted, Brown et al., arXiv:1203.4802 (digital
  normalization).

 Submitted, Howe et al., arXiv:1212.0159 (artifact
  removal from Illumina metagenomes).

 Submitted, Howe et al., arXiv:1212.2832 –
  Assembling large, complex environmental
  metagenomes.

 In preparation, Zhang et al. – efficient k-mer counting.
Thanks!

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ('titus brown blog')
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
 Preprints: on arXiv, q-bio:
  'diginorm arxiv'


Editor's Notes

  • #31 Goal is to do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  • #40 Larvae live in stream bottoms for 3-6 years; the parasitic adult moves to the Great Lakes for 12-20 months of feeding (total life span 5-8 years). ~40 lbs of fish consumed per lifetime as a parasite. 98% of fish in the Great Lakes went away!
  • #53 Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to e.g. marine free-spawning animals.