Digital normalization and some consequences.
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Nov 2013
ctb@msu.edu
Acknowledgements

Lab members involved:
- Adina Howe (w/ Tiedje)
- Jason Pell
- Arend Hintze
- Rosangela Canino-Koning
- Qingpeng Zhang
- Elijah Lowe
- Likit Preeyanon
- Jiarong Guo
- Tim Brom
- Kanchan Pavangadkar
- Eric McDonald
- Chris Welcher
- Michael Crusoe

Collaborators:
- Jim Tiedje, MSU
- Erich Schwarz, Caltech / Cornell
- Paul Sternberg, Caltech
- Robin Gasser, U. Melbourne
- Weiming Li

Funding: USDA NIFA; NSF IOS; NIH; BEACON.
We practice open science!
“Be the change you want”
Everything discussed here:
- Code: github.com/ged-lab/ ; BSD license
- Blog: http://ivory.idyll.org/blog ('titus brown blog')
- Twitter: @ctitusbrown
- Grants on lab web site: http://ged.msu.edu/interests.html
- Preprints: on arXiv, q-bio: 'diginorm arxiv'
Outline
1. Digital normalization basics
2. Diginorm as streaming lossy compression of NGS data…
3. …surprisingly useful.
4. Three new directions:
   1. Reference-free data set investigation
   2. Streaming algorithms
   3. Open protocols
Philosophy: hypothesis generation is important.
- We need better methods to investigate and analyze large sequencing data sets.
- To be most useful, these methods should be fast & computationally efficient, because:
  - Data gathering rate is already quite high
  - Allows iterations
- Better methods for good computational hypothesis generation are critical to moving forward.
High-throughput sequencing
- I mostly work on ecologically and evolutionarily interesting organisms.
- This includes non-model transcriptomes and environmental metagenomes.
- Volume of data is a huge problem because of the diversity of these samples, and because assembly must be applied to them.
Why are big data sets difficult?
Need to resolve errors: the more coverage there is, the more errors there are.
Memory usage ~ "real" variation + number of errors
Number of errors ~ size of data set
There is quite a bit of life left to sequence & assemble.

http://pacelab.colorado.edu/
Shotgun sequencing and coverage

"Coverage" is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
Random sampling => deep sampling needed

Typically 10-100x coverage is needed for robust recovery (300 Gbp for human).
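For orientation, the standard back-of-the-envelope relationship behind these numbers (my own restatement, consistent with the slide): with N reads of length L over a genome of size G, the expected coverage is

C = \frac{N \cdot L}{G},
\qquad\text{so } 100\times \text{ coverage of a } \sim 3\,\mathrm{Gbp} \text{ human genome needs } N \cdot L = 100 \times 3\,\mathrm{Gbp} = 300\,\mathrm{Gbp}.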
Mixed populations.
- Approximately 20-40x coverage is required to assemble the majority of a bacterial genome from short reads. 100x is required for a "good" assembly.
- To sample a mixed population thoroughly, you need to sample 100x of the lowest-abundance species present.
- For example, for E. coli at a 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000 abundance, or 500 Gbp of sequence!
- …actually getting this much sequence is fairly easy, but it is then hard to assemble on a reasonable computer.
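Spelling out the arithmetic in that example (my restatement of the slide's numbers):

100 \;(\text{target coverage}) \times 5\,\mathrm{Mbp} \;(\text{genome size}) \times 1000 \;(\text{dilution factor}) = 500\,\mathrm{Gbp}\ \text{of total sequence.}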
Approach: Digital normalization
(a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
The high-coverage reads in sample A are unnecessary for assembly and, in fact, distract.
Digital normalization
(figure sequence over several slides)
How can this possibly work!?
All you really need is a way to estimate the coverage of a read in a data set w/o an assembly.

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        save(read)

(This read coverage estimator does need to be error-tolerant.)
The median k-mer count in a read is a good estimator of coverage.
This gives us a reference-free measure of coverage.
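A minimal sketch of that estimator (my own illustration, not khmer's code; a plain dict stands in for khmer's fixed-memory counting structure):

# Median k-mer count as an error-tolerant, reference-free coverage estimate.
# A single sequencing error creates at most k novel, low-count k-mers, which
# barely move the *median* count over all k-mers in the read.
from statistics import median

K = 20  # typical diginorm k-mer size

def kmers(seq, k=K):
    """All k-length substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def estimated_coverage(read, counts):
    """Median abundance of the read's k-mers in the counting table."""
    kms = kmers(read)
    if not kms:                      # read shorter than k
        return 0
    return median(counts.get(km, 0) for km in kms)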
Digital normalization algorithm

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read

Note: single pass; fixed memory.
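The same loop as a runnable sketch, reusing the estimated_coverage/kmers helpers above (my illustration; khmer's normalize-by-median script implements this with a Count-Min Sketch in place of the dict):

def digital_normalization(reads, cutoff=20):
    """Single-pass streaming normalization: keep a read only while its
    estimated coverage is below `cutoff`. Counts are updated only for
    kept reads, so errors in discarded reads are never even recorded."""
    counts = {}                       # khmer: fixed-memory Count-Min Sketch
    for read in reads:
        if estimated_coverage(read, counts) < cutoff:
            for km in kmers(read):
                counts[km] = counts.get(km, 0) + 1
            yield read
        # else: redundant at this coverage -- discard

# usage: kept = list(digital_normalization(sequences, cutoff=20))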
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
- Is streaming and single pass: looks at each read only once;
- Does not "collect" the majority of errors;
- Keeps all low-coverage reads;
- Smooths out coverage of regions.

Key: the underlying assembly graph structure is retained.
Diginorm as a filter

Pipeline: Reads -> Read filter/trim -> Digital normalization to C=20 -> Error trim with k-mers -> Digital normalization to C=5 -> Assemble with your favorite assembler -> Calculate abundances of contigs

- Diginorm is a pre-filter: it loads in reads & emits (some) of them.
- You can then assemble the reads however you wish.
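The filter view can be written as composed generators, each stage consuming and emitting reads (an illustrative sketch reusing digital_normalization from above; the two trim stages are hypothetical placeholders, not the khmer-protocols scripts):

def quality_trim(reads):
    """Hypothetical read filter/trim stage (adapter/quality trimming)."""
    for read in reads:
        yield read                    # placeholder: pass reads through

def kmer_error_trim(reads):
    """Hypothetical k-mer-based error-trim stage."""
    for read in reads:
        yield read                    # placeholder: pass reads through

def diginorm_pipeline(reads):
    """Reads -> filter/trim -> diginorm C=20 -> error trim -> diginorm C=5."""
    s = quality_trim(reads)
    s = digital_normalization(s, cutoff=20)
    s = kmer_error_trim(s)
    s = digital_normalization(s, cutoff=5)   # fresh counting table per pass
    return s                           # hand the survivors to any assembler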
Contig assembly now scales with underlying genome size
- Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
- Memory efficiency is improved by use of the Count-Min Sketch.
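For completeness, a toy Count-Min Sketch (my own minimal version; khmer's counting structure differs in detail). Lookups via get() match the dict usage in the earlier sketches; updates would call add() instead of the dict assignment.

class CountMinSketch:
    """Fixed-memory approximate k-mer counter. Counts are never too low;
    hash collisions can only inflate them, which diginorm tolerates."""
    def __init__(self, n_tables=4, table_size=1_000_003):
        self.tables = [[0] * table_size for _ in range(n_tables)]

    def _slots(self, kmer):
        for i, table in enumerate(self.tables):
            yield table, hash((i, kmer)) % len(table)

    def add(self, kmer):
        for table, idx in self._slots(kmer):
            table[idx] += 1

    def get(self, kmer, default=0):    # dict-like signature
        return min(table[idx] for table, idx in self._slots(kmer))

Memory stays fixed at n_tables x table_size counters regardless of how large the data set grows.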
Digital normalization retains information, while discarding data and errors.
Lossy compression
(series of JPEG example images)

http://en.wikipedia.org/wiki/JPEG
Raw data (~10-100 GB) -> Compression (~2 GB) -> Analysis -> "Information" (~1 GB) -> Database & integration

Lossy compression can substantially reduce data size while retaining the information needed for later (re)analysis.
Some diginorm examples:
1. Assembly of the H. contortus parasitic nematode genome, a "high polymorphism/variable coverage" problem. (Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a "big assembly" problem. (in prep)
3. Assembly of two Midwest soil metagenomes, Iowa corn and Iowa prairie – the "impossible" assembly problem.
Diginorm works well.
- Significantly decreases memory requirements, esp. for metagenome and transcriptome assemblies.
- Memory required for assembly now scales with richness rather than diversity.
- Works on the same underlying principle as assembly, so assembly results can be nearly identical.
Diginorm works well.
- Improves some (many?) assemblies, especially for:
  - Repeat-rich data.
  - Highly polymorphic samples.
  - Data with significant sequencing bias.
Diginorm works well.
- Nearly perfect lossy compression from an information-theoretic perspective:
  - Discards 95% or more of the data for genomes.
  - Loses < 0.02% of the information.
Drawbacks of diginorm
- Some assemblers do not perform well downstream of diginorm:
  - Altered coverage statistics.
  - Removal of repeats.
- No well-developed theory.
- …not yet published (but the paper is available as a preprint, with ~10 citations).
Diginorm is in wide (?) use
- Dozens to hundreds of labs using it.
- Seven research publications (at least) using it already.
- A diginorm-derived algorithm, in silico normalization, is now a standard part of the Trinity mRNAseq pipeline.
Whither goest our research?
1. Pre-assembly analysis of shotgun data.
2. Moving more sequence analysis onto a streaming, reference-free basis.
3. Computing in the cloud.
1. Pre-assembly analysis of shotgun data
Rationale:
- Assembly is a "big black box" – data goes in, contigs come out, ???
- In cases where assembly goes wrong, or does not yield the hoped-for results, we need methods to diagnose potential problems.
Perpetual Spouter hot spring (Yellowstone)

Eric Boyd, Montana State U.
Data gathering =? Assembly
- Estimated low-complexity hot spring (~3-6 species)
- 25M MiSeq reads (2x250), but no good assembly. Why?
- Several possible reasons:
  - Bad data
  - Significant strain variation
  - Low coverage
  - ??
Information saturation curve ("collector's curve") suggests more information is needed.

Note: saturation to C=20
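One way such a curve can be computed with diginorm itself, reusing the estimated_coverage/kmers sketch above (my illustration of the idea, not necessarily how this figure was made): track reads kept versus reads examined; the curve flattens once the data set is saturated at the chosen coverage.

def collectors_curve(reads, cutoff=20, step=100_000):
    """Yield (reads_examined, reads_kept) points; a flattening curve
    means little new information is arriving at this coverage cutoff."""
    counts, examined, kept = {}, 0, 0
    for read in reads:
        examined += 1
        if estimated_coverage(read, counts) < cutoff:
            for km in kmers(read):
                counts[km] = counts.get(km, 0) + 1
            kept += 1
        if examined % step == 0:
            yield examined, kept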
Read coverage spectrum: many reads with low coverage.

Cumulative read coverage: 60% of data < 20x coverage.

Cumulative read coverage: some very high coverage data.
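Both plots come directly from per-read median k-mer coverage, so they are reference-free; a sketch reusing the earlier helpers (two passes over the reads; the binning is my own choice, not necessarily that of the figures):

from collections import Counter

def read_coverage_spectrum(reads):
    """Histogram of per-read median k-mer coverage."""
    reads = list(reads)               # two passes, so materialize
    counts = {}
    for read in reads:                # pass 1: total k-mer abundances
        for km in kmers(read):
            counts[km] = counts.get(km, 0) + 1
    return Counter(int(estimated_coverage(r, counts)) for r in reads)

def fraction_below(spectrum, cutoff=20):
    """E.g. '60% of data < 20x coverage' from the cumulative spectrum."""
    total = sum(spectrum.values())
    return sum(n for c, n in spectrum.items() if c < cutoff) / total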
Hot spring data conclusions --
Many reads with low coverage; some very high coverage data.
- Need ~5 times more sequencing: assemblers do not work well with reads < 20x coverage.
- But! The data is there, just at low coverage.
- Many sequence reads are from small, high-coverage genomes (probably phage); this "dilutes" sequencing.
Directions for reference-free work:
- Richness estimation!
  - MM5 deep carbon: 60 Mbp
  - Great Prairie soil: 12 Gbp
  - Amazon Rain Forest Microbial Observatory: 26 Gbp
- "How much more sequencing do I need to see X?"
- Correlation with 16S

Qingpeng Zhang
2. Streaming/efficient reference-free analysis
- Streaming online algorithms only look at the data ~once.
  (This is in comparison to most algorithms, which are "offline": they require that all data be loaded completely before analysis begins.)
- Diginorm is streaming, online…
- Conceptually, many aspects of sequence analysis can be moved into streaming mode.
=> Extraordinary potential for computational efficiency.
Example: calculating read error rates by position within read
- Shotgun data is randomly sampled;
- Any variation in mismatches with the reference by position is therefore likely due to errors or bias.

Pipeline: Reads -> Assemble -> Map reads to assembly -> Calculate position-specific mismatches

Reads from Shakya et al., pmid 2338786
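The last step reduces to counting mismatches per read position; a minimal sketch (mine, assuming gap-free alignments so read position i lines up with the mapped reference position i):

def positional_mismatch_rates(alignments, read_length):
    """alignments: iterable of (read, aligned_reference_segment) pairs of
    equal length. Returns the mismatch rate at each read position."""
    mismatches = [0] * read_length
    depth = [0] * read_length
    for read, ref in alignments:
        for i, (base, ref_base) in enumerate(zip(read, ref)):
            if i >= read_length:
                break
            depth[i] += 1
            mismatches[i] += (base != ref_base)
    return [m / d if d else 0.0 for m, d in zip(mismatches, depth)]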
Diginorm can detect graph saturation
Reads from Shakya et al., pmid 2338786
Reference-free error profile analysis
1. Requires no prior information!
2. Immediate feedback on sequencing quality (for cores & users).
3. Fast, lightweight (~100 MB, ~2 minutes).
4. Works for any shotgun sample (genomic, metagenomic, transcriptomic).
5. Not affected by polymorphisms.
Reference-free error profile analysis
7. …if we know where the errors are, we can trim them.
8. …if we know where the errors are, we can correct them.
9. …if we look at differences by graph position instead of by read position, we can call variants.

=> Streaming, online variant calling.
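For item 7, the trimming itself is simple once k-mer abundances are known; a crude sketch of the idea (mine, not khmer's implementation), reusing kmers/K from the earlier sketches:

def trim_at_low_abundance(read, counts, min_count=2):
    """Truncate a read at the first k-mer with abundance < min_count; in
    otherwise well-covered data, such k-mers are almost always the
    footprint of a sequencing error near that position."""
    for i, km in enumerate(kmers(read)):
        if counts.get(km, 0) < min_count:
            # the error is typically the last base of this k-mer, so keep
            # the prefix before it
            return read[:i + K - 1]
    return read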
Streaming online reference-free variant calling.

Single pass, reference-free, tunable, streaming online variant calling.

Coverage is adjusted to retain signal.
Directions for streaming graph analysis
- Generate error profiles for shotgun reads;
- Variable-coverage error trimming;
- Streaming low-memory error correction for genomes, metagenomes, and transcriptomes;
- Strain variant detection & resolution;
- Streaming variant analysis.

Jordan Fish & Jason Pell
3. Computing in the cloud
- Rental or "cloud" computers enable expenditures on computing resources only on demand.
- Everyone is generating data, but few have the expertise or computational infrastructure to analyze it.
- Assembly has traditionally been "expensive", but diginorm makes it cheap…
khmer-protocols
- Close-to-release effort to provide standard "cheap" assembly options in the cloud.
- Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
- Open, versioned, forkable, citable.

Pipeline: Read cleaning -> Diginorm -> Assembly -> Annotation -> RSEM differential expression
Concluding thoughts
- Diginorm is a practically useful technique for enabling more/better assembly.
- However, it also offers a number of opportunities to put sequence analysis on a streaming basis.
- The underlying basis is really simple, but with (IMO) profound implications: streaming, low memory.
Acknowledgements
(Lab members, collaborators, and funding as listed at the start.)
Other interests!
- "Better Science through Superior Software"
- Open science/data/source
- Training!
  - Software Carpentry
  - "Zero-entry"
  - Advanced workshops
- Reproducible research
  - IPython Notebook!!!!!
IPython Notebook: data + code => IPython Notebook
