2012 wellcome-talk


Usage Rights

CC Attribution License

    2012 wellcome-talk Presentation Transcript

    • Streaming approaches to sequence data compression (via normalization and error correction). C. Titus Brown, Asst Prof, CSE and Microbiology, Michigan State University, ctb@msu.edu
    • What Ewan said.
    • What Guy said.
    • Side note: error correction is the biggest “data” problem left in sequencing.* Both for mapping & assembly. *paraphrased, E. Birney
    • My biggest research problem: soil. Est. ~50 Tbp to comprehensively sample the microbial composition of a gram of soil. Bacterial species in 1:1m dilution, est. by 16S. Does not include phage, etc., that are invisible to tagging approaches. Currently we have approximately 2 Tbp spread across 9 soil samples for one project; 1 Tbp across 10 samples for another. Need 3 TB RAM on a single chassis to do assembly of 300 Gbp (Velvet). …estimate 500 TB RAM for 50 Tbp of sequence. That just won't do.
    • Online, streaming, lossy compression. (Digital normalization) Much of next-gen sequencing is redundant.
    • Uneven coverage => even more redundancy. Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
    • Coverage before digital normalization: (MD amplified)
    • Coverage after digital normalization: Normalizes coverage. Discards redundancy. Eliminates majority of errors. Scales assembly dramatically. Assembly is 98% identical.
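The 500 TB figure on this slide is consistent with scaling the Velvet measurement linearly. A minimal sketch of that arithmetic, assuming memory grows in proportion to sequence volume and that 1 Tbp = 1000 Gbp (both are assumptions; the slide does not state how the estimate was derived):

```python
# Linear extrapolation of assembly memory from the measured Velvet run.
# Assumption: RAM scales proportionally with the amount of sequence.
GBP_PER_TBP = 1000

measured_ram_tb = 3        # RAM needed for the measured assembly (from the slide)
measured_data_gbp = 300    # size of the measured data set (from the slide)
target_data_tbp = 50       # estimated sequence needed per gram of soil (from the slide)

target_data_gbp = target_data_tbp * GBP_PER_TBP
estimated_ram_tb = target_data_gbp * measured_ram_tb / measured_data_gbp

print(f"Estimated RAM for {target_data_tbp} Tbp: {estimated_ram_tb:.0f} TB")
```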
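The coverage arithmetic on this slide can be written out as a small sketch. The function name is hypothetical, and it assumes reads are sampled in proportion to abundance:

```python
def required_coverage_of_abundant(target_rare_coverage, abundance_ratio):
    """Coverage the abundant genome reaches by the time the rare one hits its target.

    Assumes reads are drawn in proportion to abundance (illustrative model only).
    """
    return target_rare_coverage * abundance_ratio

abundance_a, abundance_b = 10, 1   # dilution factor from the slide: A (10) to B (1)
target_b = 10                      # desired coverage of the rare genome B

coverage_a = required_coverage_of_abundant(target_b, abundance_a / abundance_b)
redundant_a = coverage_a - target_b  # coverage of A beyond what B needed

print(f"A is sequenced to {coverage_a:.0f}x; {redundant_a:.0f}x of that is beyond the target")
```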
    • Digital normalization algorithm:
          for read in dataset:
              if estimated_coverage(read) < CUTOFF:
                  update_kmer_counts(read)
                  save(read)
              else:
                  pass  # discard read
      Note: single pass; fixed memory.
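The loop on this slide can be made runnable with a few assumptions. In the sketch below, read coverage is estimated as the median count of the read's k-mers seen so far, and counts are kept in a plain `Counter`. Both choices are assumptions for illustration. A `Counter` also grows with the data, so a fixed-memory version would need a fixed-size probabilistic counter (for example a Count-Min Sketch) in its place. The values of `K` and `CUTOFF` are toy values.

```python
from collections import Counter
from statistics import median

K = 4        # k-mer size; tiny for illustration (assumed value)
CUTOFF = 3   # coverage cutoff (assumed value)

def kmers(read, k=K):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def estimated_coverage(read, counts, k=K):
    """Median count of the read's k-mers among reads kept so far."""
    kms = kmers(read, k)
    if not kms:
        return 0
    return median(counts[km] for km in kms)

def normalize(reads, cutoff=CUTOFF, k=K):
    """Single pass: keep a read only if its estimated coverage is below cutoff."""
    counts = Counter()
    for read in reads:
        if estimated_coverage(read, counts, k) < cutoff:
            counts.update(kmers(read, k))  # only kept reads update the counts
            yield read
        # otherwise the read is discarded

# Ten copies of an abundant read followed by one rare read.
reads = ["ACGTTGCAAG"] * 10 + ["TTGACCAGTA"]
kept = list(normalize(reads))
print(len(reads), "reads in,", len(kept), "reads kept")
```

With these toy inputs the abundant read stops being kept once its k-mers reach the cutoff, while the rare read is retained.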
    • Digital normalization retains information, while discarding data and errors
    • Little-appreciated implications!! Digital normalization puts both sequence and assembly graph analysis on a streaming and online basis. Potentially really useful for streaming variant calling and streaming sample categorization. Can implement (< 2)-pass error detection/correction using locus-specific coverage. Error correction can be “tuned” to specific coverage retention and variant detection.
    • Local graph coverage: Diginorm provides the ability to measure local graph coverage online, very efficiently. (Theory still needs to be developed)
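One way to read "error detection using locus-specific coverage" is: once a read's locus is known to be well covered, k-mers in that read with very low counts are likely to contain errors. The sketch below illustrates that idea only. The function name, thresholds, and median-based coverage estimate are all assumptions, not the method from the talk.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def low_coverage_kmer_positions(read, counts, k, min_read_cov=5, error_cutoff=1):
    """Start positions of suspicious k-mers in a read (hypothetical helper).

    Returns None when the read's own median coverage is too low to judge.
    """
    covs = [counts[km] for km in kmers(read, k)]
    if not covs or median(covs) < min_read_cov:
        return None
    return [i for i, c in enumerate(covs) if c <= error_cutoff]

K = 4
true_read = "ACGTTGCAAGTC"
bad_read = "ACGTTGCAAGTG"   # last base differs from true_read

# Count k-mers over ten correct copies plus the one erroneous read.
counts = Counter()
for r in [true_read] * 10 + [bad_read]:
    counts.update(kmers(r, K))

suspects = low_coverage_kmer_positions(bad_read, counts, K)
print("suspicious k-mer start positions:", suspects)
```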
    • Alignment of reads to graph: “Fixes” digital normalization. Aligned reads => error-corrected reads. Can align longer sequences (transcripts? contigs?) to graphs. Original Sequence: AGCCGGAGGTCCCGAATCTGATGGGGAGGCG Read: AGCCGGAGGTACCGAATCTGATGGGGAGGCG [Figure: alignment of the read to the graph, starting from the seed k-mer CGAATCTGAT; each vertex is labeled with its emission base, k-mer coverage, and vertex class (SN, MN, ME).] Jason Pell
    • 1.2x pass error-corrected E. coli* (Significantly more compressible) * Same approach can be used on mRNAseq and metageno
    • Some thoughts: Need fundamental measures of information retention so that we can place limits on what we're discarding with lossy compression. Compression is most useful to scientists when it makes analysis faster/lower memory/better. Variant calling, assembly, and just deleting your data are all just various forms of lossy compression :)
    • How compressible is soil data? De Bruijn graph overlap: 51% of the reads in prairie (330 Gbp) have coverage > 1 in the corn sample's de Bruijn graph (180 Gbp). [Figure labels: Corn, Prairie]
    • Further resources. Everything discussed here: Code: github.com/ged-lab/ ; BSD license. Blog: http://ivory.idyll.org/blog ('titus brown blog'). Twitter: @ctitusbrown. Grants on Lab Web site: http://ged.msu.edu/interests.html See esp: BIGDATA: Small: DA: DCM: Low-memory Streaming Prefilters for Biological Sequencing Data. Preprints: on arXiv, q-bio: 'diginorm arxiv', 'illumina artifacts arxiv', 'assembling
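A toy version of this overlap measurement: count k-mers from one sample, then ask what fraction of reads from the other sample have coverage above 1 in those counts. The sketch assumes median k-mer count as the read coverage estimate and uses an exact `Counter`. Both are illustrative choices, and the reads are made up.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_counts(reads, k):
    """k-mer counts for one sample (stand-in for its de Bruijn graph)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def fraction_covered(query_reads, ref_counts, k, min_cov=1):
    """Fraction of query reads whose median k-mer count in ref exceeds min_cov."""
    covered = 0
    for read in query_reads:
        kms = kmers(read, k)
        cov = median(ref_counts[km] for km in kms) if kms else 0
        if cov > min_cov:
            covered += 1
    return covered / len(query_reads)

K = 4
corn_reads = ["ACGTTGCAAG", "ACGTTGCAAG"]       # made-up reference sample
prairie_reads = ["ACGTTGCAAG", "TTGACCAGTA"]    # made-up query sample

overlap = fraction_covered(prairie_reads, build_counts(corn_reads, K), K)
print(f"{overlap:.0%} of prairie reads have coverage > 1 in the corn counts")
```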
    • Streaming Twitter analysis.