Side note: error correction is the biggest “data” problem left in sequencing.* Both for mapping & assembly. *paraphrased, E. Birney
My biggest research problem: soil. Est. ~50 Tbp to comprehensively sample the microbial composition of a gram of soil. Bacterial species in 1:1m dilution, est. by 16S. Does not include phage, etc., that are invisible to tagging approaches. Currently we have approximately 2 Tbp spread across 9 soil samples for one project; 1 Tbp across 10 samples for another. Need 3 TB RAM on a single chassis to do assembly of 300 Gbp (Velvet). …estimate 500 TB RAM for 50 Tbp of sequence. That just won't do.
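The 500 TB figure follows from scaling the Velvet data point linearly. A minimal sketch of that arithmetic, assuming memory grows in proportion to sequence volume (the function name and the linear model are illustrative assumptions, not a measured property of any assembler):

```python
GBP_PER_TBP = 1000

def estimated_ram_tb(sequence_gbp, ref_ram_tb=3.0, ref_sequence_gbp=300.0):
    """Linearly extrapolate RAM (TB) from the reference point on the slide:
    3 TB of RAM for a 300 Gbp assembly. Linear scaling is an assumption."""
    return ref_ram_tb * sequence_gbp / ref_sequence_gbp

soil_gbp = 50 * GBP_PER_TBP            # ~50 Tbp for a gram of soil
print(estimated_ram_tb(soil_gbp))      # 500.0
```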
Online, streaming, lossy compression. (Digital normalization) Much of next-gen sequencing is redundant.
Uneven coverage => even more redundancy. Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
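The A/B example generalizes to any abundance ratio: sequencing depth is set by the rarest member. A small sketch, assuming coverage is proportional to relative abundance (the helper name `required_depths` is hypothetical):

```python
def required_depths(abundances, target_coverage):
    """Per-species coverage obtained when sequencing deeply enough that the
    rarest species reaches target_coverage (coverage assumed proportional
    to relative abundance)."""
    rarest = min(abundances.values())
    return {name: target_coverage * a / rarest for name, a in abundances.items()}

depths = required_depths({"A": 10, "B": 1}, target_coverage=10)
print(depths)  # {'A': 100.0, 'B': 10.0}
```

Everything A receives beyond the target (90x out of 100x here) is the redundancy that digital normalization aims to discard.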
Coverage before digital normalization: (MD amplified)
Coverage after digital normalization: Normalizes coverage. Discards redundancy. Eliminates majority of errors. Scales assembly dramatically. Assembly is 98% identical.
Digital normalization algorithm

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        # discard read

Note: single pass; fixed memory.
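The loop above can be sketched as runnable Python. Several details here are assumptions rather than anything stated on the slide: coverage is estimated as the median count of a read's k-mers, `K` and `CUTOFF` are toy values, and an exact `Counter` stands in for the fixed-memory counting structure (so this sketch is single pass but not fixed memory).

```python
from collections import Counter
from statistics import median

K = 4        # toy k-mer size; real data would use a larger k (assumption)
CUTOFF = 3   # hypothetical coverage cutoff

def kmers(read, k=K):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def estimated_coverage(read, counts):
    """Median count of the read's k-mers seen so far (assumed estimator)."""
    ks = kmers(read)
    return median(counts[km] for km in ks) if ks else 0

def normalize(reads, cutoff=CUTOFF):
    """Yield reads whose estimated coverage is still below the cutoff."""
    counts = Counter()  # exact counts; a probabilistic sketch would bound memory
    for read in reads:
        if estimated_coverage(read, counts) < cutoff:
            counts.update(kmers(read))
            yield read
        # otherwise the read is redundant and is discarded

reads = ["ACGTTGCA"] * 10 + ["TTTTGGGG"]
kept = list(normalize(reads))
print(len(kept))  # 4: three copies of the abundant read plus the rare one
```

The rare read survives while seven of the ten redundant copies are dropped, which is the coverage-evening effect described on the preceding slides.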
Digital normalization retains information, while discarding data and errors
Little-appreciated implications!! Digital normalization puts both sequence and assembly graph analysis on a streaming and online basis. Potentially really useful for streaming variant calling and streaming sample categorization. Can implement (< 2)-pass error detection/correction using locus-specific coverage. Error correction can be “tuned” to specific coverage retention and variant detection.
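One way to picture coverage-based error detection: in a read whose locus is well covered, k-mers with far lower counts than the rest of the read are likely to contain an error. The sketch below is an illustration under assumptions, not the method from the talk: it counts k-mers first and flags afterwards, and the names `trusted_coverage` and `min_fraction` are hypothetical tuning knobs. The two sequences are the ones shown on the read-alignment slide.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def count_kmers(reads, k):
    counts = Counter()
    for r in reads:
        counts.update(kmers(r, k))
    return counts

def low_coverage_kmer_positions(read, counts, k, trusted_coverage, min_fraction=0.2):
    """Start positions of k-mers whose count is far below the read's median
    k-mer count. Returns [] if the locus is too thinly covered to judge."""
    ks = kmers(read, k)
    if not ks:
        return []
    cov = median(counts[km] for km in ks)
    if cov < trusted_coverage:
        return []
    return [i for i, km in enumerate(ks) if counts[km] < min_fraction * cov]

ORIGINAL = "AGCCGGAGGTCCCGAATCTGATGGGGAGGCG"
READ     = "AGCCGGAGGTACCGAATCTGATGGGGAGGCG"  # differs at index 10 (C -> A)

# Toy dataset: 19 error-free copies plus one read carrying the substitution.
counts = count_kmers([ORIGINAL] * 19 + [READ], k=10)
flagged = low_coverage_kmer_positions(READ, counts, k=10, trusted_coverage=10)
print(flagged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The flagged k-mers are exactly those overlapping index 10, so the error can be localized to the base they all share. Raising or lowering `min_fraction` trades sensitivity to errors against the risk of flagging genuine low-frequency variants.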
Local graph coverage: Diginorm provides the ability to measure local graph coverage very efficiently (online). (Theory still needs to be developed.)
Alignment of reads to graph. “Fixes” digital normalization. Aligned reads => error-corrected reads. Can align longer sequences (transcripts? contigs?) to graphs.
Original Sequence: AGCCGGAGGTCCCGAATCTGATGGGGAGGCG
Read: AGCCGGAGGTACCGAATCTGATGGGGAGGCG
[Figure: read aligned to the graph starting from seed k-mer CGAATCTGAT; each vertex is annotated with emission base, k-mer coverage (values shown: 19, 20, 1), and vertex class (SN, MN, ME). Other labels in the figure: Correction, Sequence, Read. Credit: Jason Pell]
1.2x pass error-corrected E. coli* (Significantly more compressible) * Same approach can be used on mRNAseq and metagenomes
Some thoughts: Need fundamental measures of information retention so that we can place limits on what we're discarding with lossy compression. Compression is most useful to scientists when it makes analysis faster/lower memory/better. Variant calling, assembly, and just deleting your data are all just various forms of lossy compression :)
How compressible is soil data? De Bruijn graph overlap: 51% of the reads in prairie (330 Gbp) have coverage > 1 in the corn sample's de Bruijn graph (180 Gbp). [Figure labels: Corn, Prairie]
Further resources. Everything discussed here: Code: github.com/ged-lab/ ; BSD license. Blog: http://ivory.idyll.org/blog ('titus brown blog'). Twitter: @ctitusbrown. Grants on Lab Web site: http://ged.msu.edu/interests.html See esp.: BIGDATA: Small: DA: DCM: Low-memory Streaming Prefilters for Biological Sequencing Data. Preprints: on arXiv, q-bio: 'diginorm arxiv', 'illumina artifacts arxiv', 'assembling