2012 talk to CSE department at U. Arizona
Presentation Transcript
Streaming lossy compression of biological sequence data using probabilistic data structures C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University August 2012 email@example.com
Acknowledgements
Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald
Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI
Funding: USDA NIFA; NSF IOS; BEACON.
We practice open science! “Be the change you want”
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ('titus brown blog')
Twitter: @ctitusbrown
Grants on Lab Web site: http://ged.msu.edu/interests.html
Preprints: on arXiv, q-bio: 'diginorm arxiv'
Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
Assemble based on word overlaps:
Repeats cause problems:
Sequencers also produce errors…
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
Shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
Assembly – no subdivision!
Assembly is inherently an all-by-all process. There is no good way to subdivide the reads without potentially missing a key connection.
I am, of course, lying. There were no good ways…
Four main challenges for de novo sequencing
Repeats. Low coverage. Errors. These introduce breaks in the construction of contigs.
Variation in coverage: transcriptomes and metagenomes, as well as amplified genomic DNA. This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.
Repeats Overlaps don't place sequences uniquely when there are repeats present. UMD assembly primer (cbcb.umd.edu)
Coverage
Easy calculation: (# reads x avg read length) / genome size
So, for haploid human genome: 30m reads x 100 bp = 3 bn bp, i.e. ~1x coverage
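The coverage formula on this slide can be written out as a short Python sketch. The function name and the round 3 Gbp figure for the haploid human genome are assumptions made for illustration; they are not taken from the talk's code.

```python
def average_coverage(num_reads: int, avg_read_length: int, genome_size: int) -> float:
    """Nominal coverage: total sequenced bases divided by genome size."""
    return num_reads * avg_read_length / genome_size


# 30 million reads of 100 bp against an assumed ~3 Gbp haploid human genome.
human_coverage = average_coverage(30_000_000, 100, 3_000_000_000)
print(f"{human_coverage:.1f}x")
```

With these round numbers the result is 1x, which is only an average over the whole genome.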
Coverage “1x” doesn't mean every DNA sequence is read once. It means that, if sampling were systematic, it would be. Sampling isn't systematic, it's random!
Actual coverage varies widely from the average. Low coverage introduces unavoidable breaks.
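To get a feel for how random sampling leaves gaps, here is a rough sketch under a simplifying assumption that is not stated in the talk: reads land uniformly and independently, so per-base depth is approximately Poisson with mean equal to the average coverage. Under that model the expected fraction of bases never sequenced is exp(-coverage). Real data is usually less uniform than this, so treat the numbers as optimistic.

```python
import math


def fraction_uncovered(avg_coverage: float) -> float:
    """Expected fraction of bases with zero reads, assuming Poisson-distributed depth."""
    return math.exp(-avg_coverage)


for c in (1, 5, 10, 30):
    print(f"{c}x average coverage: {fraction_uncovered(c):.2e} of bases expected to be missed")
```

Under this assumption, 1x average coverage still leaves roughly a third of the genome unsampled.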
Two basic assembly approaches
Overlap/layout/consensus
De Bruijn or k-mer graphs
The former is used for long reads, esp. all Sanger-based assemblies. The latter is used because of memory efficiency.
Overlap/layout/consensus
Essentially,
1. Calculate all overlaps (n^2)
2. Cluster based on overlap.
3. Do a multiple sequence alignment
UMD assembly primer (cbcb.umd.edu)
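Step 1 above (all overlaps, n^2) can be illustrated with a naive exact suffix/prefix comparison over every ordered pair of reads. This is a toy sketch: the function names and the `min_len` parameter are hypothetical, and real overlap assemblers use indexed, error-tolerant alignment rather than exact string matching.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads; this is the quadratic step."""
    found = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            found[(a, b)] = n
    return found


reads = ["It was the best of", "best of times, it", "times, it was the worst"]
for (a, b), n in all_overlaps(reads).items():
    print(f"{a!r} -> {b!r}: {n} characters")
```

The pairwise loop is what makes this approach expensive as the number of reads grows.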
K-mer graph
Break reads (of any length) down into multiple overlapping words of fixed length k.
ATGGACCAGATGACAC (k=12) =>
ATGGACCAGATG
TGGACCAGATGA
GGACCAGATGAC
GACCAGATGACA
ACCAGATGACAC
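The decomposition on this slide is a sliding window of width k. A minimal sketch follows; the function name is an assumption for illustration.

```python
def kmers(seq: str, k: int) -> list[str]:
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


words = kmers("ATGGACCAGATGACAC", 12)
for w in words:
    print(w)

# Neighbouring k-mers overlap by k-1 characters, which is what links nodes in the graph.
assert all(a[1:] == b[:-1] for a, b in zip(words, words[1:]))
```

A read of length L yields L - k + 1 k-mers, so the 16-base example gives five 12-mers.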
K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)
K-mer graph (k=14) Each node represents a 14-mer; Links between each node are 13-mer overlaps
K-mer graph (k=14) Branches in the graph represent partially overlapping sequences.
K-mer graph (k=14) Single nucleotide variations cause long branches; they don't rejoin quickly.
K-mer graphs – choosing paths
For decisions about which paths to follow, biology-based heuristics come into play as well.
The computational conundrum
More data => better.
and
More data => computationally more challenging.
The scale of the problem is stunning.
I estimate a worldwide capacity for DNA sequencing of 15 petabases/yr (it's probably larger).
Individual labs can generate ~100 Gbp in ~1 week for $10k.
This sequencing is at a boutique level:
Sequencing formats are semi-standard.
Basic analysis approaches are ~80% cookbook.
Every biological prep, problem, and analysis is different.
Traditionally, biologists receive no training in computation. (And computational people receive no training in biology :)
…and our computational infrastructure is optimized for high performance computing, not high throughput.
My problems are also very annoying… (From Monday seminar)
Est ~50 Tbp to comprehensively sample the microbial composition of a gram of soil.
Currently we have approximately 2 Tbp spread across 9 soil samples.
Need 3 TB RAM on single chassis to do assembly of 300 Gbp.
…estimate 500 TB RAM for 50 Tbp of sequence.
That just won't do.
Theoretical => applied solutions.
Theoretical advances in data structures and algorithms
Practically useful & usable implementations, at scale.
Demonstrated effectiveness on real data.
Three parts to our solution.
1. Adaptation of a suite of probabilistic data structures for representing set membership and counting (Bloom filters and CountMin Sketch).
2. An online streaming approach to lossy compression.
3. Compressible de Bruijn graph representation.
1. CountMin Sketch
To add element: increment associated counter at all hash locales
To get count: retrieve minimum counter across all hash locales
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
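The two operations on this slide (increment every hash location on add, take the minimum on lookup) can be sketched in a few lines of Python. This is an illustration only: the class name, the `width` and `depth` parameters, and the use of salted SHA-256 as the hash family are all assumptions, and the lab's own implementation may hash k-mers quite differently.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory counter: may overestimate a count, never underestimates it."""

    def __init__(self, width: int = 1000, depth: int = 4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _locations(self, item: str):
        # One (row, column) pair per table; each row uses a differently salted hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str) -> None:
        # Increment the associated counter at all hash locations.
        for row, col in self._locations(item):
            self.tables[row][col] += 1

    def count(self, item: str) -> int:
        # Retrieve the minimum counter across all hash locations.
        return min(self.tables[row][col] for row, col in self._locations(item))


sketch = CountMinSketch()
for kmer in ["ATGGACCAGATG"] * 3 + ["TGGACCAGATGA"]:
    sketch.add(kmer)
print(sketch.count("ATGGACCAGATG"))  # at least 3; collisions can only inflate it
```

Memory is fixed by `width * depth` regardless of how many distinct items are added; the price is that collisions can inflate counts.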
Our approach is very memory efficient…
…and does not introduce significant miscounts on NGS data sets.
2. Online, streaming, lossy compression. (NOVEL)
Much of next-gen sequencing is redundant.
Uneven coverage => even more redundancy (NOVEL)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
Downsample based on de Bruijn graph structure; this can be derived online.
Digital normalization algorithm
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read
Note, single pass; fixed memory.
The median k-mer count in a “sentence” is a good estimator of redundancy within the graph. This gives us a reference-free measure of coverage.
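Putting the median k-mer count together with the single-pass loop gives the following toy sketch of digital normalization. Several things here are assumptions for illustration: the values of `K` and `CUTOFF`, the function names, and the use of an exact `Counter` in place of the fixed-memory CountMin Sketch described earlier (so this version is single-pass but not fixed-memory).

```python
from collections import Counter
from statistics import median

K = 5        # hypothetical k, small enough for toy reads
CUTOFF = 3   # hypothetical coverage cutoff


def kmers(seq: str, k: int = K) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k: int = K, cutoff: int = CUTOFF) -> list[str]:
    """Keep a read only if its estimated coverage so far is below the cutoff."""
    counts = Counter()  # exact stand-in for a probabilistic counter
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: no coverage estimate possible
        # Median count of the read's k-mers is the reference-free coverage estimate.
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)
            kept.append(read)
        # Otherwise the read is discarded and the counts are left untouched.
    return kept


reads = ["ATGGACCAGATGACAC"] * 10 + ["TTTTTCCCCCGGGGG"]
print(len(normalize(reads)), "of", len(reads), "reads kept")
```

In this example the ten copies of the abundant read are cut down to the cutoff, while the single rare read is retained.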
Digital normalization retains information, while discarding data and errors
Contig assembly now scales with underlying genome size
Transcriptomes, microbial genomes incl MDA, and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
Memory efficiency is improved by use of CountMin Sketch.
3. Compressible de Bruijn graphs (NOVEL) Each node represents a 14-mer; links between each node are 13-mer overlaps
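Can store implicit de Bruijn graphs in a Bloom filter
[Figure: the sequence AGTCGGCATGAC broken into the 6-mers AGTCGG, GTCGGC, TCGGCA, CGGCAT, GGCATG, GCATGA, CATGAC, which are stored in a Bloom filter; starting from AGTCGG, the path is extended one base at a time (…C, …A, …T, …G, …A, …C).]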
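The idea of an implicit graph can be sketched as follows: only the k-mers (nodes) are inserted into a Bloom filter, and edges are discovered at query time by testing each of the four possible one-base extensions. The class, the function names, the filter size, and the salted SHA-256 hashing are assumptions made for this illustration, not the talk's implementation.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, size: int = 10_000, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, for readability

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


def load_kmers(seq: str, k: int, bloom: BloomFilter) -> None:
    for i in range(len(seq) - k + 1):
        bloom.add(seq[i:i + k])


def successors(kmer: str, bloom: BloomFilter) -> list[str]:
    """Edges are implicit: try each possible next base and keep the k-mers the filter reports."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if (suffix + base) in bloom]


bloom = BloomFilter()
load_kmers("AGTCGGCATGAC", 6, bloom)
print(successors("AGTCGG", bloom))  # contains "GTCGGC"; false positives could add others
```

No edge list is stored at all, which is where the memory saving comes from; the cost is that a false positive in the filter shows up as a spurious node or edge.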
False positives introduce false nodes/edges. When does this start to distort the graph?
Average component size remains low through 18% FPR.
Graph diameter remains constant through 18% FPR.
Global graph structure is retained past 18% FPR (figure panels: 1%, 5%, 10%, 15%)
Equivalent to bond percolation problem; percolation threshold independent of k (?)
This data structure is strikingly efficient for storing sparse k-mer graphs. “Exact” is for best possible information-theoretical storage.
We implemented graph partitioning on top of this probabilistic de Bruijn graph.
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
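As a rough illustration of binning by connectivity, the sketch below groups reads into connected components using union-find, treating two reads as connected when they share at least one k-mer. This is a simplification and an assumption: the talk's implementation traverses a probabilistic de Bruijn graph, whereas this toy version uses an exact dictionary and ignores connections that run only through adjacent (k-1 overlapping) k-mers. All names are hypothetical.

```python
from collections import defaultdict


def partition_reads(reads, k):
    """Group reads into connected components; reads sharing a k-mer land in the same bin."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    bins = defaultdict(list)
    for idx, read in enumerate(reads):
        bins[find(idx)].append(read)
    return list(bins.values())


reads = ["ATGGACCAGA", "CCAGATGACA", "TTTTTGGGGG", "TGGGGGCCCC"]
for i, group in enumerate(partition_reads(reads, k=5)):
    print(f"bin {i}: {group}")
```

Each resulting bin could then be handed to an assembler independently of the others.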
Partitioning scales assembly for a subset of problems.
Can be done in ~10x less memory than assembly.
Partition at low k and assemble exactly at any higher k (DBG).
Partitions can then be assembled independently:
Multiple processors -> scaling
Multiple k, coverage -> improved assembly
Multiple assembly packages (tailored to high variation, etc.)
Can eliminate small partitions/contigs in the partitioning phase.
An incredibly convenient approach enabling divide & conquer approaches across the board.
Technical challenges met (and defeated) Exhaustive in-memory traversal of graphs containing 5-15 billion nodes. Sequencing technology introduces false connections in graph (Howe et al., in prep.) Implementation lets us scale ~20x over other approaches.
Our approaches yield a variety of strategies…
[Figure: workflow diagram with boxes labeled Shotgun data, Metagenomic data, Digital normalization, Partitioning, and Assembly.]
Concluding thoughts, thus far
Our approaches provide significant and substantial practical and theoretical leverage to one of the most challenging current problems in computational biology: assembly.
They also improve quality of analysis, in some cases.
They provide a path to the future:
Many-core compatible; distributable?
Decreased memory footprint => cloud computing can be used for many analyses.
They are in use: ~dozens of labs are using digital normalization.
Future research
Many directions in the works! (see posted grant props)
Theoretical groundwork for normalization approach.
Graph search & alignment algorithms.
Error detection & correction.
Resequencing analysis.
Online (“infinite”) assembly.
Streaming Twitter analysis.
Running HMMs over de Bruijn graphs (=> cross validation)
hmmgs: Assemble based on good-scoring HMM paths through the graph.
Independent of other assemblers; very sensitive, specific.
95% of hmmgs rplB domains are present in our partitioned assemblies.
Jordan Fish, Qiong Wang, and Jim Cole (RDP)
Side note: error correction is the biggest “data” problem left in sequencing. Both for mapping & assembly.
Streaming error correction.
First pass (all reads): Does read come from a high-coverage locus? Yes! Error-correct low-abundance k-mers in read. No! Add read to graph and save for later.
Second pass (only saved reads): Does read come from a now high-coverage locus? Yes! Error-correct low-abundance k-mers in read. No! Leave unchanged.
We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2 passes, fixed memory.
We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment