Streaming approaches to sequence data compression
(via normalization and error correction)

C. Titus Brown
Asst Prof, CSE and Microbiology
Michigan State University
ctb@msu.edu
What Ewan said.
What Guy said.
Side note: error correction is the biggest "data" problem left in
sequencing.*

Both for mapping & assembly.

                            *paraphrased, E. Birney
My biggest research problem – soil.

 Est. ~50 Tbp to comprehensively sample the microbial
 composition of a gram of soil.
   Bacterial species in 1:1m dilution, est. by 16S
   Does not include phage, etc. that are invisible to tagging
   approaches

 Currently we have approximately 2 Tbp spread across
 9 soil samples for one project; 1 Tbp across 10
 samples for another.

 Need 3 TB RAM on a single chassis to do assembly of
 300 Gbp (Velvet).
 …estimate 500 TB RAM for 50 Tbp of sequence.

                    That just won't do.
Online, streaming, lossy compression.
(Digital normalization)
        Much of next-gen sequencing is redundant.
Uneven coverage => even more
redundancy


Suppose you have a dilution factor of A (10) to B (1). To get 10x
coverage of B, you need to get 100x of A. Overkill!!

This 100x will consume disk space and, because of errors, memory.
Coverage before digital normalization:

(coverage plot; MD amplified)
Coverage after digital normalization:

(coverage plot)

 Normalizes coverage
 Discards redundancy
 Eliminates the majority of errors
 Scales assembly dramatically
 Assembly is 98% identical
Digital normalization algorithm

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read

              Note: single pass; fixed memory.
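A minimal, self-contained sketch of that loop, assuming a plain Python dict
for k-mer counting and the median k-mer count as the per-read coverage
estimate; the real implementation (khmer) uses a fixed-memory probabilistic
counting table, and the names K, CUTOFF, and digital_normalization below are
illustrative only:

from collections import defaultdict
from statistics import median

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # stop keeping reads once estimated coverage reaches this

kmer_counts = defaultdict(int)   # stand-in for a fixed-memory counting structure

def kmers(seq):
    return (seq[i:i + K] for i in range(len(seq) - K + 1))

def estimated_coverage(read):
    # Median count of the read's k-mers seen so far ~= coverage of its locus.
    counts = [kmer_counts.get(km, 0) for km in kmers(read)]
    return median(counts) if counts else 0

def update_kmer_counts(read):
    for km in kmers(read):
        kmer_counts[km] += 1

def digital_normalization(dataset):
    # Single pass over the reads; reads from already well-covered loci are dropped.
    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            yield read

Usage (illustrative): kept = list(digital_normalization(reads)); the kept
reads then go to the assembler in place of the full dataset.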
Digital normalization retains information, while
discarding data and errors
Little-appreciated implications!!

 Digital normalization puts both sequence and
 assembly graph analysis on a streaming and
 online basis.
   Potentially really useful for streaming variant calling
   and streaming sample categorization.

 Can implement (< 2)-pass error detection/correction
 using locus-specific coverage (see the sketch after
 this list).

 Error correction can be "tuned" to specific
 coverage retention and variant detection.
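A hedged sketch of the locus-specific coverage idea: once a read's median
k-mer coverage is high, individual k-mers whose counts fall far below that
median are likely to carry errors. The kmer_counts table, the thresholds,
and the function name below are illustrative, not the actual khmer
implementation:

def detect_errors(read, kmer_counts, k=20, min_locus_cov=20, error_ratio=0.1):
    # Flag k-mer offsets whose coverage drops far below the read's median
    # k-mer coverage.  Reads whose locus is not yet well covered are left
    # alone (a second, partial pass can revisit them).
    counts = [kmer_counts.get(read[i:i + k], 0) for i in range(len(read) - k + 1)]
    if not counts:
        return []
    locus_cov = sorted(counts)[len(counts) // 2]   # median coverage of this locus
    if locus_cov < min_locus_cov:
        return []
    return [i for i, c in enumerate(counts) if c < error_ratio * locus_cov]

Raising min_locus_cov trades correction sensitivity for safety around true
low-coverage variants, which is the "tuning" mentioned above.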
Local graph coverage

Diginorm makes it possible to measure local graph
coverage efficiently, online.
(Theory still needs to be developed.)
Alignment of reads to graph

 "Fixes" digital normalization
 Aligned reads => error-corrected reads
 Can align longer sequences (transcripts? contigs?) to graphs.
Original sequence: AGCCGGAGGTCCCGAATCTGATGGGGAGGCG
             Read: AGCCGGAGGTACCGAATCTGATGGGGAGGCG

[Figure (Jason Pell): correction by aligning a sequence read to the de
Bruijn graph. Starting from the seed k-mer CGAATCTGAT, each graph vertex is
annotated with its emission base, k-mer coverage, and vertex class (SN, MN,
ME); the branch followed by the erroneous read base has coverage 1, versus
19–20 on the trusted branch. Legend: Emission Base → A, K-mer Coverage → 19,
Vertex Class → SN.]
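A toy sketch of that correction idea, using the same illustrative k-mer
count table as above and a substitution-only search; the actual read-to-graph
aligner is considerably more involved, so treat this purely as an
illustration of "low-coverage branch vs. trusted branch":

def correct_read(read, kmer_counts, k=20, trusted=5):
    # Substitution-only correction: when a k-mer's count is below 'trusted',
    # try the three other bases at its final position and keep whichever
    # alternative k-mer is best covered.  Errors in the first k-1 bases and
    # indels are not handled by this toy version.
    seq = list(read)
    for i in range(len(seq) - k + 1):
        km = "".join(seq[i:i + k])
        best_count = kmer_counts.get(km, 0)
        if best_count >= trusted:
            continue
        pos = i + k - 1                      # last base of the current k-mer
        best_base = seq[pos]
        for base in "ACGT":
            if base == seq[pos]:
                continue
            alt_count = kmer_counts.get(km[:-1] + base, 0)
            if alt_count > best_count:
                best_base, best_count = base, alt_count
        seq[pos] = best_base
    return "".join(seq)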
1.2x pass error-corrected E. coli*
(Significantly more compressible)

               * Same approach can be used on mRNAseq and metagenomes.
Some thoughts

 Need fundamental measures of information
 retention so that we can place limits on what
 we're discarding with lossy compression.

 Compression is most useful to scientists when it
 makes analysis faster / lower memory / better.

 Variant calling, assembly, and just deleting your
 data are all just various forms of lossy
 compression :)
How compressible is soil data?

De Bruijn graph overlap: 51% of the reads in prairie
(330 Gbp) have coverage > 1 in the corn sample's
de Bruijn graph (180 Gbp).

(Figure: Corn / Prairie overlap)
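A hedged sketch of how such a cross-sample number can be computed: load one
sample's k-mers into a presence table, then count the fraction of the other
sample's reads that share at least one k-mer with it. The function names and
the plain Python set below are illustrative; at hundreds of Gbp a
fixed-memory (Bloom-filter-style) table would be needed instead, and this is
not the pipeline actually used for the corn/prairie comparison:

def build_kmer_set(reads, k=20):
    # Presence-only k-mer table for one sample (e.g. the corn reads).
    table = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            table.add(read[i:i + k])
    return table

def fraction_overlapping(reads, other_table, k=20):
    # Fraction of reads (e.g. prairie) sharing at least one k-mer with the
    # other sample's table.
    hits = total = 0
    for read in reads:
        total += 1
        if any(read[i:i + k] in other_table for i in range(len(read) - k + 1)):
            hits += 1
    return hits / total if total else 0.0

# corn_table = build_kmer_set(corn_reads)
# print(fraction_overlapping(prairie_reads, corn_table))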
Further resources

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ('titus brown blog')
 Twitter: @ctitusbrown
 Grants on lab web site: http://ged.msu.edu/interests.html
   See esp.: BIGDATA: Small: DA: DCM: Low-memory Streaming
   Prefilters for Biological Sequencing Data
 Preprints: on arXiv, q-bio:
   'diginorm arxiv', 'illumina artifacts arxiv', 'assembling