Zhang Q - A probabilistic approach to k-mer counting

Presentation at BOSC2012 by Zhang Q - A probabilistic approach to k-mer counting

Presentation Transcript

  • A probabilistic approach to k-mer counting
    Qingpeng Zhang
    Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, USA
    qingpeng@msu.edu
    July 13, 2012
    Qingpeng Zhang (Michigan State University), A probabilistic approach to k-mer counting, July 2012, 1 / 12
  • What is k-mer counting?
  • What is our k-mer counting approach?
    The Bloom counting hash consists of one or more hash tables of different sizes.
    Each entry in the hash tables is a counter representing the number of k-mers that hash to that location.
    Bloom filter (0/1) or Count-min Sketch (counting).
    The hash function takes the modulus of a number representing the k-mer with the table size.
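The structure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the khmer implementation: the class name `CountingHash`, the helper `count_kmers`, the 2-bits-per-base encoding of a k-mer as a number, and the choice to report the minimum counter across tables (as a Count-min Sketch does) are all assumptions made for the example.

```python
class CountingHash:
    """Toy counting structure: several tables of different sizes (hypothetical names)."""

    def __init__(self, table_sizes):
        self.table_sizes = list(table_sizes)
        # One list of counters per table, all starting at zero.
        self.tables = [[0] * size for size in self.table_sizes]

    @staticmethod
    def kmer_to_int(kmer):
        # Assumed encoding: read the k-mer as a base-4 number (A=0, C=1, G=2, T=3).
        value = 0
        for base in kmer:
            value = value * 4 + "ACGT".index(base)
        return value

    def add(self, kmer):
        number = self.kmer_to_int(kmer)
        # The hash function is the modulus of the k-mer number with the table size.
        for table, size in zip(self.tables, self.table_sizes):
            table[number % size] += 1

    def get(self, kmer):
        number = self.kmer_to_int(kmer)
        # Collisions can only inflate a counter, so the minimum across
        # tables is the tightest available estimate (never an undercount).
        return min(
            table[number % size]
            for table, size in zip(self.tables, self.table_sizes)
        )


def count_kmers(sequence, k, counter):
    # Slide a window of width k along the sequence and count each k-mer.
    for i in range(len(sequence) - k + 1):
        counter.add(sequence[i:i + k])


counter = CountingHash([251, 257, 263])
count_kmers("ACGTACGT", 4, counter)
print(counter.get("ACGT"), counter.get("CGTA"), counter.get("AAAA"))
```

Memory use is fixed by the table sizes chosen up front, whatever the number of k-mers added; more k-mers only raise the chance of collisions.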
  • What is our k-mer counting approach?
    The tradeoff, because of collisions, is a certain counting false positive rate [1].
    Probabilistic properties well suited to next-generation sequencing datasets.
    Highly scalable: counting accuracy is related to memory usage. However, our approach will never break an imposed memory bound.
    [1] Counting false positive rate: the probability that a count will be incorrect (off by 1 or more).
  • How does our k-mer counting approach perform?
    How many k-mers have an incorrect count? (counting error rate)
    N: number of unique k-mers; Z: number of hash tables; H: size of the hash tables.
    The probability that no collision happens in a specific entry of one hash table is $(1 - 1/H)^N$, which is approximately $e^{-N/H}$.
    The individual collision rate in one hash table is $1 - e^{-N/H}$.
    The counting error rate $f$, which is the probability that a collision happens in all the locations a k-mer is hashed to across all $Z$ hash tables, is $f = (1 - e^{-N/H})^Z$.
    Example: N = 915898, Z = 4, H = 400000 gives $f = (1 - e^{-N/H})^Z = 0.6523$; the observed counting error rate is 0.6566.
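As a quick check of the arithmetic on this slide, the formula can be evaluated directly. The function names below are illustrative only; the inputs are the example values given on the slide.

```python
import math


def collision_rate(n_unique, table_size):
    # Probability that a given entry in one table is hit by at least one
    # other k-mer, using the approximation (1 - 1/H)^N ~ exp(-N/H).
    return 1.0 - math.exp(-n_unique / table_size)


def counting_error_rate(n_unique, n_tables, table_size):
    # A count is wrong only if a collision happened in every one of the Z tables.
    return collision_rate(n_unique, table_size) ** n_tables


f = counting_error_rate(n_unique=915898, n_tables=4, table_size=400000)
print(round(f, 4))  # the slide reports 0.6523 for these values
```

The same function shows how quickly the error rate drops as the tables grow: with N fixed, increasing H lowers the per-table collision rate, and the exponent Z compounds the effect.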
  • How does our k-mer counting approach perform?
    OK, some counts are incorrect. But how "incorrect"?
    Factors that influence the miscount: the total number of k-mers; the hash table size.
  • How does our k-mer counting approach perform?
    Time Usage
    Figure: Time usage of khmer counting approach
  • How does our k-mer counting approach perform?
    Memory Usage
    Figure: Memory usage of different k-mer counting tools
  • How does our k-mer counting approach perform?
    Disk Storage Usage
    Figure: Disk storage usage of different k-mer counting tools
  • What is the application of our approach?
    Filtering out reads with low-abundance k-mers for de novo assembly.
    Figure: Percentage of "bad" reads in the remaining reads. Iterative filtering of low-abundance reads ("bad" reads that contain even a single unique k-mer) with hash tables of different sizes (1e8 and 1e9) for a human gut microbiome metagenomic dataset (MH0001, 42,458,402 reads).
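The filtering idea on this slide can be illustrated with a small sketch. Note the substitution: for brevity it counts k-mers exactly with `collections.Counter` rather than with the probabilistic hash tables the talk describes, and the function names and the `min_count` parameter are hypothetical. With a probabilistic counter, collisions would make some unique k-mers look abundant, so some "bad" reads would survive each pass.

```python
from collections import Counter


def kmers(read, k):
    # All overlapping substrings of length k in the read.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def filter_reads(reads, k, min_count=2):
    # Count every k-mer across all reads (exact counts, an assumption here).
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    # Keep a read only if none of its k-mers falls below min_count.
    # Reads shorter than k have no k-mers and are kept unchanged.
    return [
        read for read in reads
        if all(counts[kmer] >= min_count for kmer in kmers(read, k))
    ]


def iterative_filter(reads, k, min_count=2):
    # Removing a read lowers the counts of its other k-mers, which can
    # expose new low-abundance reads, so repeat until nothing changes.
    while True:
        kept = filter_reads(reads, k, min_count)
        if len(kept) == len(reads):
            return kept
        reads = kept


reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
print(iterative_filter(reads, k=4))
```

In this toy input the third read contains k-mers seen only once (`CGTT`, `GTTT`), so it is dropped and the two identical reads remain.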
  • Summary
    A simple probabilistic approach for fast and memory-efficient counting of k-mers: arbitrary-length k-mers; arbitrary-size sequence data sets; with a tradeoff of counting error.
    Other possible applications: digital normalization; repeat detection; diversity analysis of metagenomic samples; ...
    The khmer software package is written in C++ and Python, available at https://github.com/ged-lab/khmer
  • Acknowledgement
    Jason Pell, Rose Canino-Koning, Adina Chuang Howe
    Dr. C. Titus Brown
    GED lab members @ Michigan State University
    Funding from USDA, DOE, MSU, BEACON, iCER
    Thanks!