Zhang Q - A probabilistic approach to k-mer counting

Presentation at BOSC2012 by Zhang Q - A probabilistic approach to k-mer counting

  1. A probabilistic approach to k-mer counting. Qingpeng Zhang, Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, USA. qingpeng@msu.edu. July 13, 2012.
  2. What is k-mer counting?
  3. What is our k-mer counting approach? The Bloom counting hash consists of one or more hash tables of different sizes. Each entry in the hash tables is a counter representing the number of k-mers that hash to that location. Bloom filter (0/1) or Count-min Sketch (counting). The hash function takes the modulus of a number representing the k-mer with the table size.
  4. What is our k-mer counting approach? Comes with a certain counting false positive rate (1) as a tradeoff, because of collisions. Probabilistic properties are well suited to next-generation sequencing datasets. Highly scalable: counting accuracy is related to memory usage, but our approach will never break an imposed memory bound. (1) Counting false positive rate: the probability that a count will be incorrect (off by 1 or more).
  5. How does our k-mer counting approach perform? How many k-mers have an incorrect count? (counting error rate) N: number of unique k-mers; Z: number of hash tables; H: size of hash tables. The probability that no collision happened in a specific entry in one hash table is $(1 - 1/H)^N$, which is approximately $e^{-N/H}$. The individual collision rate in one hash table is $1 - e^{-N/H}$. The counting error rate $f$, which is the probability that a collision happened in all the locations where a k-mer is hashed to in all Z hash tables, will be $f = (1 - e^{-N/H})^Z$. Example: N = 915898, Z = 4, H = 400000 gives $f = (1 - e^{-N/H})^Z = 0.6523$; observed counting error rate: 0.6566.
  6. How does our k-mer counting approach perform? OK, some counts are incorrect. But how "incorrect"? Factors that influence the miscount: the total number of k-mers and the hash table size.
  7. How does our k-mer counting approach perform? Time usage. Figure: Time usage of the khmer counting approach.
  8. How does our k-mer counting approach perform? Memory usage. Figure: Memory usage of different k-mer counting tools.
  9. How does our k-mer counting approach perform? Disk storage usage. Figure: Disk storage usage of different k-mer counting tools.
  10. What is the application of our approach? Filtering out reads with low-abundance k-mers for de novo assembly. Figure: Percentage of "bad" reads in the remaining reads. Iteratively filtering out low-abundance reads ("bad" reads) that contain even a single unique k-mer, using hash tables of different sizes (1e8 and 1e9), for a human gut microbiome metagenomic dataset (MH0001, 42,458,402 reads).
  11. Summary: a simple probabilistic approach for fast and memory-efficient counting of k-mers; arbitrary-length k-mers; arbitrary-size sequence data sets; with a tradeoff of counting error. Other possible applications: digital normalization, repeat detection, diversity analysis of metagenomic samples, ... The khmer software package is written in C++ and Python, available at https://github.com/ged-lab/khmer
  12. Acknowledgement: Jason Pell, Rose Canino-Koning, Adina Chuang Howe; Dr. C. Titus Brown; GED lab members @ Michigan State University. Funding from USDA, DOE, MSU, BEACON, iCER. Thanks!
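
The counting structure described in slides 3 and 4, and the read filtering mentioned in slide 10, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the khmer implementation: the class name `CountingHash`, the helper `keep_read`, the 2-bit base encoding, the small prime table sizes, and the choice to report the minimum counter across tables (the usual Count-min Sketch rule) are all hypothetical choices made for this example.

```python
class CountingHash:
    """Toy counting structure: several hash tables of different sizes."""

    # Assumed 2-bit encoding used to turn a k-mer into a number.
    ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}

    def __init__(self, k, table_sizes):
        self.k = k
        self.table_sizes = list(table_sizes)
        # Memory is fixed up front: one counter per table entry.
        self.tables = [[0] * size for size in self.table_sizes]

    def _kmer_to_int(self, kmer):
        value = 0
        for base in kmer:
            value = value * 4 + self.ENCODING[base]
        return value

    def add(self, kmer):
        number = self._kmer_to_int(kmer)
        # Hash function: modulus of the k-mer's number with the table size.
        for table, size in zip(self.tables, self.table_sizes):
            table[number % size] += 1

    def count(self, kmer):
        number = self._kmer_to_int(kmer)
        # Collisions can only inflate a counter, so the smallest counter
        # across tables is the best available estimate (assumed rule).
        return min(
            table[number % size]
            for table, size in zip(self.tables, self.table_sizes)
        )

    def consume(self, sequence):
        # Count every k-mer in the sequence.
        for i in range(len(sequence) - self.k + 1):
            self.add(sequence[i:i + self.k])


def keep_read(counter, read, min_count=2):
    """Return False if the read contains any k-mer seen fewer than min_count times."""
    k = counter.k
    return all(
        counter.count(read[i:i + k]) >= min_count
        for i in range(len(read) - k + 1)
    )


counter = CountingHash(k=4, table_sizes=[97, 101, 103])
counter.consume("ACGTACGTAC")
print(counter.count("ACGT"))           # 2
print(keep_read(counter, "ACGTAC"))    # True: every 4-mer was seen twice
print(keep_read(counter, "ACGTACG"))   # False: TACG was seen only once
```

Because the tables are allocated once and never grow, memory use stays within the chosen bound no matter how many k-mers are added; the cost is that counts may be too high when collisions occur in every table.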

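The error-rate estimate from slide 5 can be checked numerically. The short Python sketch below evaluates the formula with the slide's example values; the function names are hypothetical, and it assumes, as the slide's formula does, that every table has roughly the same size H.

```python
import math


def collision_free_probability(n_unique, table_size):
    """Exact probability that a given entry is hit by none of the N k-mers."""
    return (1.0 - 1.0 / table_size) ** n_unique


def counting_error_rate(n_unique, n_tables, table_size):
    """Estimated probability that a k-mer collides in all Z tables."""
    per_table_collision = 1.0 - math.exp(-n_unique / table_size)
    return per_table_collision ** n_tables


# Example values from slide 5: N = 915898, Z = 4, H = 400000.
f = counting_error_rate(915898, 4, 400000)
print(round(f, 4))  # about 0.6523, matching the slide

# The exact term and its exponential approximation agree closely.
print(collision_free_probability(915898, 400000), math.exp(-915898 / 400000))
```

The predicted value is close to the observed counting error rate of 0.6566 reported on the slide.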