Streaming approaches to sequence data compression
(via normalization and error correction)

C. Titus Brown
Asst Prof, CSE and Microbiology
Michigan State University
ctb@msu.edu
What Ewan said.
What Guy said.
Side note: error correction is the biggest "data" problem left in
sequencing.*

Both for mapping & assembly.

                            *paraphrased, E. Birney
My biggest research problem – soil.

 Est. ~50 Tbp to comprehensively sample the microbial
 composition of a gram of soil.
   Bacterial species in 1:1m dilution, est. by 16S
   Does not include phage, etc. that are invisible to tagging
   approaches

 Currently we have approximately 2 Tbp spread across
 9 soil samples for one project; 1 Tbp across 10
 samples for another.

 Need 3 TB RAM on a single chassis to do assembly of
 300 Gbp (Velvet).
 …estimate 500 TB RAM for 50 Tbp of sequence.

                    That just won't do.
Online, streaming, lossy compression.
(Digital normalization)
        Much of next-gen sequencing is redundant.
Uneven coverage => even more
redundancy


Suppose you have a dilution factor of A (10) to B (1). To get 10x
coverage of B, you need to get 100x of A. Overkill!!

This 100x will consume disk space and, because of errors, memory.
Coverage before digital normalization:

(coverage plot; MD amplified)
Coverage after digital normalization:

(coverage plot)

 Normalizes coverage
 Discards redundancy
 Eliminates the majority of errors
 Scales assembly dramatically
 Assembly is 98% identical
Digital normalization algorithm

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read

              Note: single pass; fixed memory.
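A minimal, self-contained sketch of that loop, assuming a plain Python dict
for k-mer counting and the median k-mer count as the per-read coverage
estimate; the real implementation (khmer) uses a fixed-memory probabilistic
counting table, and the names K, CUTOFF, and digital_normalization below are
illustrative only:

from collections import defaultdict
from statistics import median

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # stop keeping reads once estimated coverage reaches this

kmer_counts = defaultdict(int)   # stand-in for a fixed-memory counting structure

def kmers(seq):
    return (seq[i:i + K] for i in range(len(seq) - K + 1))

def estimated_coverage(read):
    # Median count of the read's k-mers seen so far ~= coverage of its locus.
    counts = [kmer_counts.get(km, 0) for km in kmers(read)]
    return median(counts) if counts else 0

def update_kmer_counts(read):
    for km in kmers(read):
        kmer_counts[km] += 1

def digital_normalization(dataset):
    # Single pass over the reads; reads from already well-covered loci are dropped.
    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            yield read

Usage (illustrative): kept = list(digital_normalization(reads)); the kept
reads then go to the assembler in place of the full dataset.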
Digital normalization retains information, while
discarding data and errors
Little-appreciated implications!!

 Digital normalization puts both sequence and
 assembly graph analysis on a streaming and
 online basis.
   Potentially really useful for streaming variant calling
   and streaming sample categorization.

 Can implement (< 2)-pass error detection/correction
 using locus-specific coverage (see the sketch after
 this list).

 Error correction can be "tuned" to specific
 coverage retention and variant detection.
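A hedged sketch of the locus-specific coverage idea: once a read's median
k-mer coverage is high, individual k-mers whose counts fall far below that
median are likely to carry errors. The kmer_counts table, the thresholds,
and the function name below are illustrative, not the actual khmer
implementation:

def detect_errors(read, kmer_counts, k=20, min_locus_cov=20, error_ratio=0.1):
    # Flag k-mer offsets whose coverage drops far below the read's median
    # k-mer coverage.  Reads whose locus is not yet well covered are left
    # alone (a second, partial pass can revisit them).
    counts = [kmer_counts.get(read[i:i + k], 0) for i in range(len(read) - k + 1)]
    if not counts:
        return []
    locus_cov = sorted(counts)[len(counts) // 2]   # median coverage of this locus
    if locus_cov < min_locus_cov:
        return []
    return [i for i, c in enumerate(counts) if c < error_ratio * locus_cov]

Raising min_locus_cov trades correction sensitivity for safety around true
low-coverage variants, which is the "tuning" mentioned above.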
Local graph coverage

Diginorm makes it possible to measure local graph
coverage efficiently, online.
(Theory still needs to be developed.)
Alignment of reads to graph

 "Fixes" digital normalization
 Aligned reads => error-corrected reads
 Can align longer sequences (transcripts? contigs?) to graphs.
Original sequence: AGCCGGAGGTCCCGAATCTGATGGGGAGGCG
             Read: AGCCGGAGGTACCGAATCTGATGGGGAGGCG

[Figure (Jason Pell): correction by aligning a sequence read to the de
Bruijn graph. Starting from the seed k-mer CGAATCTGAT, each graph vertex is
annotated with its emission base, k-mer coverage, and vertex class (SN, MN,
ME); the branch followed by the erroneous read base has coverage 1, versus
19–20 on the trusted branch. Legend: Emission Base → A, K-mer Coverage → 19,
Vertex Class → SN.]
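A toy sketch of that correction idea, using the same illustrative k-mer
count table as above and a substitution-only search; the actual read-to-graph
aligner is considerably more involved, so treat this purely as an
illustration of "low-coverage branch vs. trusted branch":

def correct_read(read, kmer_counts, k=20, trusted=5):
    # Substitution-only correction: when a k-mer's count is below 'trusted',
    # try the three other bases at its final position and keep whichever
    # alternative k-mer is best covered.  Errors in the first k-1 bases and
    # indels are not handled by this toy version.
    seq = list(read)
    for i in range(len(seq) - k + 1):
        km = "".join(seq[i:i + k])
        best_count = kmer_counts.get(km, 0)
        if best_count >= trusted:
            continue
        pos = i + k - 1                      # last base of the current k-mer
        best_base = seq[pos]
        for base in "ACGT":
            if base == seq[pos]:
                continue
            alt_count = kmer_counts.get(km[:-1] + base, 0)
            if alt_count > best_count:
                best_base, best_count = base, alt_count
        seq[pos] = best_base
    return "".join(seq)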
1.2x pass error-corrected E. coli*
(Significantly more compressible)

               * Same approach can be used on mRNAseq and metagenomes.
Some thoughts

 Need fundamental measures of information
 retention so that we can place limits on what
 we're discarding with lossy compression.

 Compression is most useful to scientists when it
 makes analysis faster / lower memory / better.

 Variant calling, assembly, and just deleting your
 data are all just various forms of lossy
 compression :)
How compressible is soil data?

De Bruijn graph overlap: 51% of the reads in prairie
(330 Gbp) have coverage > 1 in the corn sample's
de Bruijn graph (180 Gbp).

(Figure: Corn / Prairie overlap)
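A hedged sketch of how such a cross-sample number can be computed: load one
sample's k-mers into a presence table, then count the fraction of the other
sample's reads that share at least one k-mer with it. The function names and
the plain Python set below are illustrative; at hundreds of Gbp a
fixed-memory (Bloom-filter-style) table would be needed instead, and this is
not the pipeline actually used for the corn/prairie comparison:

def build_kmer_set(reads, k=20):
    # Presence-only k-mer table for one sample (e.g. the corn reads).
    table = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            table.add(read[i:i + k])
    return table

def fraction_overlapping(reads, other_table, k=20):
    # Fraction of reads (e.g. prairie) sharing at least one k-mer with the
    # other sample's table.
    hits = total = 0
    for read in reads:
        total += 1
        if any(read[i:i + k] in other_table for i in range(len(read) - k + 1)):
            hits += 1
    return hits / total if total else 0.0

# corn_table = build_kmer_set(corn_reads)
# print(fraction_overlapping(prairie_reads, corn_table))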
Further resources

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ('titus brown blog')
 Twitter: @ctitusbrown
 Grants on lab web site: http://ged.msu.edu/interests.html
   See esp.: BIGDATA: Small: DA: DCM: Low-memory Streaming
   Prefilters for Biological Sequencing Data
 Preprints: on arXiv, q-bio:
   'diginorm arxiv', 'illumina artifacts arxiv', 'assembling