Zhang Q - A probabilistic approach to k-mer counting

Presentation at BOSC2012 by Zhang Q - A probabilistic approach to k-mer counting


Transcript

  • 1. A probabilistic approach to k-mer counting. Qingpeng Zhang, Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, USA. qingpeng@msu.edu. July 13, 2012. Qingpeng Zhang (Michigan State University), A probabilistic approach to k-mer counting, July 2012, 1 / 12
  • 2. What is k-mer counting?
  • 3. What is our k-mer counting approach? The Bloom counting hash consists of one or more hash tables of different sizes. Each entry in a hash table is a counter representing the number of k-mers that hash to that location. It can act as a Bloom filter (0/1 presence) or as a Count-min Sketch (counting). The hash function takes the modulus of a number representing the k-mer with the table size.
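The structure described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not khmer's actual code or API: the class name `CountingHash`, the 2-bit nucleotide encoding, and the table sizes are all hypothetical choices made for the example.

```python
# Toy counting hash in the spirit of the slide; NOT khmer's implementation.

# Assumed 2-bit encoding of nucleotides (an illustration choice).
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Turn a DNA k-mer into the number that will be hashed."""
    value = 0
    for base in kmer:
        value = value * 4 + ENCODING[base]
    return value


class CountingHash:
    """Hypothetical class: one counter array per table, sizes should differ."""

    def __init__(self, k, table_sizes):
        self.k = k
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        number = kmer_to_int(kmer)
        # Hash function from the slide: modulus with each table size.
        return [number % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across
        # tables is the tightest estimate (Count-min Sketch behaviour).
        slots = self._slots(kmer)
        return min(table[slot] for table, slot in zip(self.tables, slots))

    def consume(self, sequence):
        """Count every k-mer in a sequence."""
        for i in range(len(sequence) - self.k + 1):
            self.add(sequence[i:i + self.k])


counting_hash = CountingHash(k=4, table_sizes=[97, 101, 103])
counting_hash.consume("ACGTACGTAC")
print(counting_hash.count("ACGT"))
print(counting_hash.count("TTTT"))
```

Memory is fixed by `table_sizes` when the object is created, so adding more k-mers raises the chance of collisions but never the memory footprint. Reported counts can be too high but never too low.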
  • 4. What is our k-mer counting approach? Collisions introduce a counting false positive rate as a tradeoff, where the counting false positive rate is the probability that a count is incorrect (off by 1 or more). The probabilistic properties are well suited to next-generation sequencing datasets. Highly scalable: counting accuracy is related to memory usage. However, our approach will never break an imposed memory bound.
  • 5. How does our k-mer counting approach perform? How many k-mers have an incorrect count? This is the counting error rate. Let N be the number of unique k-mers, Z the number of hash tables, and H the size of each hash table. The probability that no collision happened in a specific entry of one hash table is $(1 - 1/H)^N$, which is approximately $e^{-N/H}$. The individual collision rate in one hash table is $1 - e^{-N/H}$. The counting error rate $f$, which is the probability that a collision happened in all the locations a k-mer is hashed to across all Z hash tables, is $f = (1 - e^{-N/H})^Z$. Example: N = 915898, Z = 4, H = 400000 gives a predicted $f = (1 - e^{-N/H})^Z = 0.6523$; the observed counting error rate is 0.6566.
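The slide's example can be checked numerically. The short sketch below evaluates f = (1 - e^(-N/H))^Z with the slide's values; the function and parameter names are made up for illustration, and the formula is applied as stated on the slide, that is, with a single table size H shared by all Z tables.

```python
import math


def counting_error_rate(n_unique_kmers, n_tables, table_size):
    """Evaluate f = (1 - e^(-N/H))^Z, assuming all Z tables have about H entries."""
    # Probability that a given entry in one table has been hit by a collision.
    single_table_collision = 1.0 - math.exp(-n_unique_kmers / table_size)
    # A count is wrong only if every table's entry for the k-mer collided.
    return single_table_collision ** n_tables


# Values from the slide's example: N = 915898, Z = 4, H = 400000.
f = counting_error_rate(n_unique_kmers=915898, n_tables=4, table_size=400000)
print(round(f, 4))  # 0.6523
```

The result matches the predicted 0.6523 on the slide, close to the observed 0.6566. Raising `table_size` or `n_tables` in the call shows how memory trades against the error rate.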
  • 6. How does our k-mer counting approach perform? OK, some counts are incorrect. But how "incorrect" are they? Factors that influence the miscount: the total number of k-mers and the hash table size.
  • 7. How does our k-mer counting approach perform? Time usage. Figure: Time usage of the khmer counting approach.
  • 8. How does our k-mer counting approach perform? Memory usage. Figure: Memory usage of different k-mer counting tools.
  • 9. How does our k-mer counting approach perform? Disk storage usage. Figure: Disk storage usage of different k-mer counting tools.
  • 10. What is the application of our approach? Filtering out reads with low-abundance k-mers for de novo assembly. Figure: Percentage of "bad" reads in the remaining reads, when iteratively filtering out low-abundance reads ("bad" reads, which contain even a single unique k-mer) using hash tables of different sizes (1e8 and 1e9) on a human gut microbiome metagenomic dataset (MH0001, 42,458,402 reads).
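The filtering idea on this slide can be sketched as follows. As a loudly stated assumption, the example uses an exact `collections.Counter` as a stand-in for the probabilistic counting hash, and the function names, `min_count` parameter, and toy reads are hypothetical; it shows the iteration only, not khmer's implementation.

```python
from collections import Counter


def kmers(read, k):
    """List every k-mer in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def filter_once(reads, k, min_count=2):
    """Drop reads that contain any k-mer seen fewer than min_count times."""
    # Exact counts here; the talk's approach would use the counting hash.
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    return [
        read for read in reads
        if all(counts[kmer] >= min_count for kmer in kmers(read, k))
    ]


def filter_iteratively(reads, k, min_count=2):
    """Repeat the filter until a pass removes nothing."""
    while True:
        kept = filter_once(reads, k, min_count)
        if len(kept) == len(reads):
            return kept
        reads = kept


# Toy reads, invented for the example.
reads = ["ACGTAC", "ACGTAC", "CGTACG", "TTTTGA"]
print(filter_iteratively(reads, k=4))
```

Iteration matters because removing a read lowers the counts of its other k-mers, which can expose further low-abundance reads. With a probabilistic counter in place of `Counter`, collisions can only inflate counts, so some "bad" reads may survive a pass; smaller tables would be expected to leave more of them.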
  • 11. Summary. A simple probabilistic approach for fast and memory-efficient counting of k-mers: arbitrary-length k-mers; arbitrary-size sequence data sets; with a tradeoff of counting error. Other possible applications: digital normalization, repeat detection, diversity analysis of metagenomic samples, ... The khmer software package is written in C++ and Python, available at https://github.com/ged-lab/khmer
  • 12. Acknowledgement. Jason Pell, Rose Canino-Koning, Adina Chuang Howe, Dr. C. Titus Brown, GED lab members @ Michigan State University. Funding from USDA, DOE, MSU, BEACON, iCER. Thanks!