2012 wellcome-talk


  1. 1. Streaming approaches to sequence data compression (via normalization and error correction). C. Titus Brown, Asst Prof, CSE and Microbiology, Michigan State University, ctb@msu.edu
  2. 2. What Ewan said.
  3. 3. What Guy said.
  4. 4. Side note: error correction is the biggest "data" problem left in sequencing.* Both for mapping & assembly. *Paraphrased, E. Birney
  5. 5. My biggest research problem: soil. Est. ~50 Tbp to comprehensively sample the microbial composition of a gram of soil. Bacterial species in 1:1m dilution, est. by 16S. Does not include phage, etc., that are invisible to tagging approaches. Currently we have approximately 2 Tbp spread across 9 soil samples for one project; 1 Tbp across 10 samples for another. Need 3 TB RAM on a single chassis to do assembly of 300 Gbp (Velvet). …estimate 500 TB RAM for 50 Tbp of sequence. That just won't do.
  6. 6. Online, streaming, lossy compression. (Digital normalization) Much of next-gen sequencing is redundant.
  7. 7. Uneven coverage => even more redundancy. Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
  8. 8. Coverage before digital normalization: (MD amplified)
  9. 9. Coverage after digital normalization: Normalizes coverage. Discards redundancy. Eliminates majority of errors. Scales assembly dramatically. Assembly is 98% identical.
  10. 10. Digital normalization algorithm:
        for read in dataset:
            if estimated_coverage(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                pass  # discard read
      Note: single pass; fixed memory.
  11. 11. Digital normalization retains information, while discarding data and errors.
  12. 12. Little-appreciated implications!! Digital normalization puts both sequence and assembly graph analysis on a streaming and online basis. Potentially really useful for streaming variant calling and streaming sample categorization. Can implement (< 2)-pass error detection/correction using locus-specific coverage. Error correction can be "tuned" to specific coverage retention and variant detection.
  13. 13. Local graph coverage. Diginorm provides the ability to measure local graph coverage online, very efficiently. (Theory still needs to be developed.)
  14. 14. Alignment of reads to graph. "Fixes" digital normalization. Aligned reads => error corrected reads. Can align longer sequences (transcripts? contigs?) to graphs. Original Sequence: AGCCGGAGGTCCCGAATCTGATGGGGAGGCG. Read: AGCCGGAGGTACCGAATCTGATGGGGAGGCG. [Figure: read aligned to the graph; labels: Correction, Sequence, Read, Seed K-mer, Emission Base, K-mer Coverage, Vertex Class; the k-mer CGAATCTGAT is shown in the figure.] (Jason Pell)
  15. 15. 1.2x pass error-corrected E. coli* (Significantly more compressible) * Same approach can be used on mRNAseq and metageno…
  16. 16. Some thoughts. Need fundamental measures of information retention so that we can place limits on what we're discarding with lossy compression. Compression is most useful to scientists when it makes analysis faster/lower memory/better. Variant calling, assembly, and just deleting your data are all just various forms of lossy compression :)
  17. 17. How compressible is soil data? De Bruijn graph overlap: 51% of the reads in prairie (330 Gbp) have coverage > 1 in the corn sample's de Bruijn graph (180 Gbp). [Figure labels: Corn, Prairie]
  18. 18. Further resources. Everything discussed here: Code: github.com/ged-lab/ ; BSD license. Blog: http://ivory.idyll.org/blog ('titus brown blog'). Twitter: @ctitusbrown. Grants on Lab Web site: http://ged.msu.edu/interests.html. See esp.: BIGDATA: Small: DA: DCM: Low-memory Streaming Prefilters for Biological Sequencing Data. Preprints: on arXiv, q-bio: 'diginorm arxiv', 'illumina artifacts arxiv', 'assembling
  19. 19. Streaming Twitter analysis.
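
The memory estimate on slide 5 looks like a straight linear extrapolation from the Velvet figure. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and that RAM scales linearly with the number of bases; neither assumption is stated on the slide, and all names are illustrative:

```python
# Linear extrapolation of the slide 5 memory estimate.
# Assumption: decimal units and RAM proportional to bases sequenced.
GBP = 10**9    # bases per Gbp
TBP = 10**12   # bases per Tbp
TB = 10**12    # bytes per TB (decimal, assumed)

observed_bases = 300 * GBP      # assembly size quoted for Velvet
observed_ram_bytes = 3 * TB     # RAM quoted for that assembly

bytes_per_base = observed_ram_bytes / observed_bases

target_bases = 50 * TBP         # estimated need for a gram of soil
estimated_ram_tb = bytes_per_base * target_bases / TB

print(bytes_per_base, "bytes of RAM per base")
print(estimated_ram_tb, "TB of RAM for 50 Tbp")
```

Under these assumptions the slide's numbers work out to 10 bytes of RAM per sequenced base, which reproduces the 500 TB figure.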
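
The uneven-coverage argument on slide 7 is a one-line calculation. A small sketch, with a hypothetical helper name, assuming reads are sampled in proportion to abundance and that 10x of A would already have been enough:

```python
def coverage_of_abundant(target_rare_coverage, abundance_ratio):
    """Coverage the abundant genome reaches when the rare one hits its target.

    Assumption: reads are sampled in proportion to abundance.
    """
    return target_rare_coverage * abundance_ratio

target = 10   # desired coverage of B
ratio = 10    # A is 10 times as abundant as B

a_coverage = coverage_of_abundant(target, ratio)

# Share of A's reads beyond the target coverage, i.e. redundant data.
redundant_fraction = (a_coverage - target) / a_coverage

print(a_coverage, redundant_fraction)
```

With the slide's 10:1 example, nine out of every ten reads from A are surplus to the target coverage.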
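
The loop on slide 10 can be made runnable with a few stand-in helpers. This is a minimal sketch, not the lab's implementation: it assumes the coverage estimate is the median k-mer count of the read, and it uses an ordinary `Counter`, which grows with the data, where the slide's "fixed memory" note implies a fixed-size counting structure. `K`, `CUTOFF`, and the toy reads are illustrative values.

```python
from collections import Counter
from statistics import median

K = 4        # k-mer size; tiny, to suit the toy reads (assumed value)
CUTOFF = 3   # coverage cutoff (assumed value)

# Assumption: a plain Counter stands in for a fixed-memory counter.
kmer_counts = Counter()

def kmers(read, k=K):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def estimated_coverage(read):
    """Median count of the read's k-mers seen so far (assumed estimator)."""
    counts = [kmer_counts[km] for km in kmers(read)]
    return median(counts) if counts else 0

def update_kmer_counts(read):
    kmer_counts.update(kmers(read))

def normalize(dataset):
    """Single pass over the reads; keep a read only if its locus is under-covered."""
    kept = []
    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            kept.append(read)
        # otherwise the read is discarded and never counted
    return kept

# Toy data: one locus sequenced 10 times, another only twice.
reads = ["ACGTACGGA"] * 10 + ["TTGCATGCC"] * 2
kept = normalize(reads)
print(len(kept), "of", len(reads), "reads kept")
```

In this toy run the heavily sequenced locus is capped at `CUTOFF` reads, while both reads from the rare locus survive.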
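
Slide 12 does not spell out how locus-specific coverage drives error detection. One possible reading, sketched here purely as an assumption: once a read's median k-mer count shows its locus is well covered, any k-mer in that read that is still rare is a likely error. The thresholds and function names below are hypothetical.

```python
from collections import Counter
from statistics import median

K = 4                  # k-mer size for the toy example (assumed)
TRUSTED_COVERAGE = 3   # median count at which a locus is judged (assumed)
LOW_ABUNDANCE = 1      # counts at or below this are suspect (assumed)

def kmers(read, k=K):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def suspect_positions(read, counts):
    """Start indices of k-mers that look like errors in a well-covered read."""
    per_kmer = [counts[km] for km in kmers(read)]
    if not per_kmer or median(per_kmer) < TRUSTED_COVERAGE:
        return []  # locus not yet well covered; no judgement possible
    return [i for i, c in enumerate(per_kmer) if c <= LOW_ABUNDANCE]

good = "ACGTACGGATTC"
bad = "ACGTATGGATTC"   # same read with a substitution at index 5

counts = Counter()
for read in [good] * 5 + [bad]:
    counts.update(kmers(read))

print(suspect_positions(bad, counts))
print(suspect_positions(good, counts))
```

A single substitution taints every k-mer that overlaps it, so the suspect k-mers form a run of up to `K` consecutive start positions around the erroneous base.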
