# Zhang Q - A probabilistic approach to k-mer counting

Presentation at BOSC2012 by Zhang Q - A probabilistic approach to k-mer counting

### Transcript

• 1. A probabilistic approach to k-mer counting. Qingpeng Zhang, Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, USA. qingpeng@msu.edu. July 13, 2012. Qingpeng Zhang (Michigan State University) A probabilistic approach to k-mer counting July 2012 1 / 12
• 2. What is k-mer counting?
• 3. What is our k-mer counting approach? The Bloom counting hash consists of one or more hash tables of different sizes. Each entry in the hash tables is a counter representing the number of k-mers that hash to that location. It works like a Bloom filter (0/1) or a Count-Min Sketch (counting). The hash function takes the modulus of a number representing the k-mer with the table size.
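The structure described on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not khmer's actual code or API: the names `kmer_to_int` and `CountingHash`, the 2-bits-per-base encoding, and the tiny table sizes are all assumptions. Reporting the minimum counter across tables follows the usual Count-Min Sketch convention, which the slide names but does not spell out.

```python
# Hypothetical sketch of a Bloom counting hash; not khmer's API.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Encode a DNA k-mer as a base-4 integer (assumed encoding)."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_CODE[base]
    return value


class CountingHash:
    """One or more counter tables of different sizes."""

    def __init__(self, table_sizes):
        self.table_sizes = list(table_sizes)
        self.tables = [[0] * size for size in self.table_sizes]

    def add(self, kmer):
        number = kmer_to_int(kmer)
        # Hash function: the k-mer's number modulo the table size.
        for table, size in zip(self.tables, self.table_sizes):
            table[number % size] += 1

    def count(self, kmer):
        number = kmer_to_int(kmer)
        # Collisions can only inflate a counter, so the smallest
        # counter is the best available estimate (Count-Min Sketch).
        return min(
            table[number % size]
            for table, size in zip(self.tables, self.table_sizes)
        )


def count_kmers(sequence, k, counter):
    """Add every overlapping k-mer of `sequence` to `counter`."""
    for i in range(len(sequence) - k + 1):
        counter.add(sequence[i:i + k])


counter = CountingHash([11, 13, 17])  # toy sizes, chosen for illustration
count_kmers("ACGTACGTAC", 4, counter)
print(counter.count("ACGT"))  # prints 2
```

Memory use is fixed by the table sizes chosen up front, no matter how many k-mers are added. A count can be too high because of collisions, but never too low.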
• 4. What is our k-mer counting approach? Collisions introduce a certain counting false positive rate¹ as a tradeoff. The probabilistic properties are well suited to next-generation sequencing datasets. Highly scalable: counting accuracy is related to memory usage, but our approach will never break an imposed memory bound. ¹ Counting false positive rate: the probability that a count will be incorrect (off by 1 or more).
• 5. How does our k-mer counting approach perform? How many k-mers have an incorrect count? This is the counting error rate. $N$: number of unique k-mers; $Z$: number of hash tables; $H$: size of the hash tables. The probability that no collision happened in a specific entry in one hash table is $(1 - 1/H)^N$, which is approximately $e^{-N/H}$. The individual collision rate in one hash table is $1 - e^{-N/H}$. The counting error rate $f$, which is the probability that a collision happened in all the locations where a k-mer is hashed to in all $Z$ hash tables, will be $f = (1 - e^{-N/H})^Z$. Example: $N = 915898$, $Z = 4$, $H = 400000$ gives $f = (1 - e^{-N/H})^Z = 0.6523$; the observed counting error rate $f$ is 0.6566.
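The example on this slide can be checked numerically. The short sketch below evaluates $f = (1 - e^{-N/H})^Z$ for the slide's values. The function name `counting_error_rate` is made up for illustration, and the code assumes, as the formula does, that all $Z$ tables have the same size $H$.

```python
import math


def counting_error_rate(n_unique_kmers, n_tables, table_size):
    """Predicted fraction of k-mers whose count is inflated.

    Assumes every table has the same size, as in the slide's formula.
    """
    # Chance that a given entry in one table suffers a collision.
    per_table_collision = 1.0 - math.exp(-n_unique_kmers / table_size)
    # A count is wrong only if every table has a collision.
    return per_table_collision ** n_tables


f = counting_error_rate(915898, 4, 400000)
print(round(f, 4))  # prints 0.6523
```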
• 6. How does our k-mer counting approach perform? OK, some counts are incorrect. But how "incorrect"? Factors that influence the miscount: the total number of k-mers and the hash table size.
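A toy simulation can show how these two factors affect the size of the miscount. It rests on simplifying assumptions that are not from the talk: a single table, random distinct integers standing in for encoded 20-mers, and every k-mer occurring exactly once. The function name `mean_miscount` is hypothetical.

```python
import random


def mean_miscount(n_kmers, table_size, seed=0):
    """Average amount by which a single table overcounts a k-mer."""
    rng = random.Random(seed)
    # Distinct integers stand in for encoded k-mers, each seen once.
    kmers = rng.sample(range(4 ** 20), n_kmers)
    table = [0] * table_size
    for number in kmers:
        table[number % table_size] += 1
    # The true count is 1, so anything above 1 came from collisions.
    excess = sum(table[number % table_size] - 1 for number in kmers)
    return excess / n_kmers


for table_size in (1009, 10007, 100003):
    print(table_size, mean_miscount(20000, table_size))
```

Under these assumptions the mean miscount should shrink as the table grows and grow as more k-mers are loaded into a table of fixed size.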
• 7. How does our k-mer counting approach perform? Time usage. Figure: Time usage of the khmer counting approach.
• 8. How does our k-mer counting approach perform? Memory usage. Figure: Memory usage of different k-mer counting tools.
• 9. How does our k-mer counting approach perform? Disk storage usage. Figure: Disk storage usage of different k-mer counting tools.
• 10. What is the application of our approach? Filtering out reads with low-abundance k-mers for de novo assembly. Figure: Percentage of "bad" reads in the remaining reads. Low-abundance reads ("bad" reads), meaning reads that contain even a single unique k-mer, are filtered out iteratively using hash tables of different sizes (1e8 and 1e9) for a human gut microbiome metagenomic dataset (MH0001, 42,458,402 reads).
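A single filtering pass of this kind might look like the sketch below. It is a simplified illustration and not the khmer script: an exact `collections.Counter` stands in for the probabilistic counting hash, and the names `kmers` and `filter_reads` and the `min_count` parameter are assumptions.

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def filter_reads(reads, k, min_count=2):
    """Drop every read that contains a k-mer seen fewer than min_count times."""
    # Exact counter used here as a stand-in for the counting hash.
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    return [
        read
        for read in reads
        if all(counts[kmer] >= min_count for kmer in kmers(read, k))
    ]


reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
print(filter_reads(reads, 4))  # prints ['ACGTAC', 'ACGTAC']
```

With a probabilistic counter in place of `Counter`, inflated counts could let some "bad" reads through a pass. That is one plausible reason to repeat the filtering, as the slide describes.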
• 11. Summary. A simple probabilistic approach for fast and memory-efficient counting of k-mers: arbitrary-length k-mers; arbitrary-size sequence data sets; with a tradeoff of counting error. Other possible applications: digital normalization; repeat detection; diversity analysis of metagenomic samples; ... The khmer software package is written in C++ and Python and is available at https://github.com/ged-lab/khmer
• 12. Acknowledgement. Jason Pell, Rose Canino-Koning, Adina Chuang Howe; Dr. C. Titus Brown; GED lab members @ Michigan State University. Funding from USDA, DOE, MSU, BEACON, iCER. Thanks!