
Count-Min Tree Sketch: Approximate counting for NLP tasks


The Count-Min Tree Sketch is a variant of the Count-Min Sketch tailored to Zipfian (power-law) data distributions. With a memory footprint 4 to 8 times smaller than other variants, and performance on par with native exact counting, the Count-Min Tree Sketch can be used in many time-critical situations. It is developed by eXenSa (www.exensa.com).



  1. Count-Min Tree Sketch: Approximate counting for NLP. Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand, Abdul Mouhamadsultane. Presenter: Guillaume Pitel, June 9, 2016. www.exensa.com [Title slide; the background shows the tree-shared-counter bit layout reused on later slides.]
  2. A bit of context: Why do we need to count? Our data analysis platform, eXenGine, processes different kinds of data (mostly text). We need to create relevant cross-features, and to do that we must count the occurrences of all possible cross-features. For text data, a well-known kind of cross-feature is the n-gram. There are many measures for deciding whether an n-gram is interesting, and all of them require counting occurrences of both the cross-feature and the underlying features (i.e., counting bigrams as well as the words inside them). Counting exactly is easy and distributable, but very slow because of memory usage: the whole structure containing the counts cannot fit in memory, so one has to resort to huge map/reduce jobs with joins.
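Exact counting itself is trivial; a toy Python sketch (hypothetical mini-corpus) shows the idea, and also why memory explodes: the hash map holds one entry per distinct n-gram.

```python
from collections import Counter

# Exact n-gram counting with a hash map: trivially correct, but the table
# grows with the number of *distinct* n-grams, which is what blows up
# memory at Google-N-grams scale.
tokens = "the quick brown fox jumps over the lazy dog the quick fox".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

print(unigrams["the"])            # 3
print(bigrams[("the", "quick")])  # 2
```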
  3. A bit of context: What kind of data are we talking about? Google N-grams:
     • tokens: 1024 billion
     • sentences: 95 billion
     • 1-grams (count > 200): 14 million
     • 2-grams (count > 40): 314 million
     • 3-grams: 977 million
     • 4-grams: 1.3 billion
     • 5-grams: 1.2 billion
  4. A bit of context: What kind of data are we talking about? A Zipfian distribution [Le Quan et al. 2003].
  5. A bit of context: What kind of measures are we talking about? PMI (pointwise mutual information), TF-IDF, LLR (log-likelihood ratio).
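PMI, for instance, only uses the counts inside a logarithm, so a good order-of-magnitude estimate of each count is enough. A minimal sketch with hypothetical toy numbers:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ), computed from raw counts."""
    return math.log((count_xy / total) /
                    ((count_x / total) * (count_y / total)))

# Toy numbers: a bigram seen 500 times among 1,000,000 bigrams, whose
# words occur 2,000 and 1,000 times respectively.
print(round(pmi(500, 2000, 1000, 1_000_000), 3))  # ln(250) ≈ 5.521
```

Because the counts appear only under the log, a relative error on a count translates into a small additive error on the PMI, which is why logarithmic counters are acceptable here.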
  6. A bit of context: Summary / goals.
     • Many counts: we need to store a large number of counts.
     • Logarithms in the measures: we only care about the order of magnitude.
     • Fast and memory-controlled: we don't want distributed memory for the counts.
     • Zipfian counts: many very small counts that will be filtered out later.
  7. A bit of context: Summary / goals (same points as the previous slide, with the conclusion): we can use probabilistic structures.
  8. Count-Min Sketch: a probabilistic data structure to store counts [Cormode & Muthukrishnan 2005].
  9. Count-Min Sketch: a probabilistic data structure to store counts. Conservative update improves the CMS by incrementing only the cells that currently hold the minimum value.
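A minimal Python sketch of a Count-Min Sketch with conservative update (the hash scheme and the depth/width values here are illustrative choices, not the ones used in the presented experiments):

```python
import random

class CountMinSketch:
    """Count-Min Sketch with conservative update: on each insert, only the
    cells equal to the current row-wise minimum are incremented, which
    reduces overestimation; reads still never underestimate."""

    def __init__(self, depth=4, width=2**16, seed=42):
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by a per-row seeded hash.
        return [(r, hash((s, item)) % self.width)
                for r, s in enumerate(self.seeds)]

    def add(self, item):
        cells = self._cells(item)
        m = min(self.table[r][c] for r, c in cells)
        for r, c in cells:
            if self.table[r][c] == m:   # conservative update
                self.table[r][c] = m + 1

    def query(self, item):
        return min(self.table[r][c] for r, c in self._cells(item))

cms = CountMinSketch()
for _ in range(3):
    cms.add("the")
print(cms.query("the"))  # 3 (exact here; in general an upper bound)
```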
  10. Count-Min Log Sketch: a probabilistic data structure to store logarithmic counts [Pitel & Fouquier, 2015]; the same idea as [Talbot, 2009], applied inside a Count-Min Sketch. Instead of regular 32-bit counters, we use 8- or 16-bit "Morris" counters that count logarithmically. Since the counts end up inside logarithms anyway, the error on PMI/TF-IDF/etc. is almost unchanged, but we can fit more counters in the same space. However, a count of 1 still uses the same amount of memory as a count of 10,000. Also, at some point the error stops improving with space (there is an inherent residual error).
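The Morris-counter idea can be sketched as follows (base 2 is used here for illustration; practical log sketches use bases closer to 1 for finer granularity):

```python
import random

def morris_increment(c, base=2.0, rng=random):
    """Probabilistic increment: bump c with probability base**-c,
    so c grows like log_base(n) and fits in a few bits."""
    if rng.random() < base ** -c:
        c += 1
    return c

def morris_estimate(c, base=2.0):
    """Unbiased estimate of the true count n from the stored value c."""
    return (base ** c - 1) / (base - 1)

rng = random.Random(0)
c = 0
for _ in range(10_000):
    c = morris_increment(c, rng=rng)
print(c, morris_estimate(c))  # c stays near log2(10000) ≈ 13
```

The trade-off is exactly the one on the slide: a tiny counter covers a huge range, at the price of a bounded relative error on the estimate.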
  11. Count-Min Tree Sketch: a Count-Min Sketch with shared counters. Idea: use a hierarchical storage layout where the most significant bits are shared between counters. Somewhat similar to TOMB counters [Van Durme, 2009], except that overflow is managed very differently.
  12. Tree-shared counters: sharing the most significant bits. Or: how can we store counts with an average approaching 4 bits per counter? An 8-counter structure. A tree is made of three kinds of storage:
     • counting bits
     • barrier bits
     • a spire (not required, except for performance)
     Several layers alternate counting and barrier bits. Here we have a <[(8,8),(4,4),(2,2),(1,1)],4> counter. [Figure: bit layout showing the base layer, barrier bits, counting bits, and the spire.]
  13. Tree-shared counters: sharing the most significant bits. 8 counters in 30 bits, plus a spire. Without a spire, n bits can count up to 3 × 2^(1+log₂(n/4)). Many small shared counters with spires are more efficient than one large shared counter. [Figure: same bit layout as the previous slide.]
  14. Tree-shared counters: reading values.
     • A counter stops at the first zero barrier bit.
     • When two barrier paths meet, there is a conflict.
     • The barrier length (b) is evaluated in unary.
     • The counter bits (c) are evaluated in the classical binary way.
     [Figure: worked examples b=2/c=110 and b=4/c=01011001, with a conflict between counters 4 and 7.]
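The read rule can be sketched in Python. This is a toy interpretation, not the paper's exact encoding: barrier bits are consumed in unary until the first zero, the counting bits along the path form a binary payload c with the layer-l bit worth 2^l, and the decoded value follows v = c + 2^(b+1) − 2, which is consistent with the per-layer bit weights (1, 2, 2, 4, 4, 8, …) shown on slide 17.

```python
def read_counter(counting, barrier, i):
    """Toy read of counter i from a tree-shared layout.

    counting[l] / barrier[l] are the bit lists of layer l; counter i's
    ancestor at layer l sits at index i >> l (bits are shared upward).
    """
    b = 0
    c_bits = [counting[0][i]]                 # base counting bit, weight 1
    while b < len(barrier) and barrier[b][i >> b] == 1:
        b += 1                                # one more unary barrier bit
        if b < len(counting):
            c_bits.append(counting[b][i >> b])
    c = sum(bit << l for l, bit in enumerate(c_bits))
    return c + (1 << (b + 1)) - 2             # v = c + 2^(b+1) - 2

# A tiny <[(4,4),(2,2),(1,1)]> structure with 4 base counters.
counting = [[1, 0, 0, 0], [1, 0], [0]]
barrier  = [[1, 0, 0, 0], [0, 0], [0]]
print(read_counter(counting, barrier, 0))  # b=1, c=0b11=3 -> 3 + 4 - 2 = 5
print(read_counter(counting, barrier, 1))  # b=0, c=0      -> 0
```

Note how counter 1 would share counter 0's upper layers, but its zero barrier bit stops it at the base layer: this sharing is what keeps the average storage low for Zipfian data, where most counters stay tiny.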
  15. Tree-shared counters: incrementing counter 5, steps 0 to 2. [Figure: bit grids showing the first increments at the base layer.]
  16. Tree-shared counters: incrementing counter 5, steps 3 to 5. [Figure: bit grids showing the next increments.]
  17. Tree-shared counters: incrementing counter 5, step 6. A bit's weight depends on its layer: 1, 2, 2, 4, 4, 8, … going up from the base counting bit. [Figure: bit grid annotated with per-layer weights.]
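Under the same toy reading used above (a hypothetical simplification, not one of the paper's four increment algorithms), incrementing amounts to re-encoding v+1: when the payload c overflows, one more barrier bit is "spent", which is why small counts remain cheap to store.

```python
def encode(v):
    """Split v into (barrier length b, payload c) so that
    v = c + 2**(b + 1) - 2 with 0 <= c < 2**(b + 1)."""
    b = 0
    while v - (2 ** (b + 1) - 2) >= 2 ** (b + 1):
        b += 1
    return b, v - (2 ** (b + 1) - 2)

def decode(b, c):
    return c + 2 ** (b + 1) - 2

# Small values need few barrier bits: 0..1 use none, 2..5 one, 6..13 two.
for v in range(8):
    b, c = encode(v)
    assert decode(b, c) == v
    print(v, (b, c))
```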
  18. Count-Min Tree Sketches: experiments. Results!
     • 140M tokens from the English Wikipedia*
     • 14.7M terms (unigrams + bigrams)
     • Reference counts stored in an UnorderedMap: 815 MiB
     Perfect storage size: assuming a perfect hash function and 32-bit counters, storing the counts for 14.7M terms amounts to 59 MiB. Performance: our implementation of a CMTS with <[(128,128),(64,64),…],32> counters matches native UnorderedMap performance. We use 3-layer sketches (a good performance/precision trade-off). * We preferred to test our counters with a large number of parameter settings rather than with a large corpus, so we limited ourselves to 5% of Wikipedia.
  19. Count-Min Tree Sketches: average relative error. Results! [Chart: average relative error versus sketch size.]
  20. Count-Min Tree Sketches: RMSE. Results! [Chart: RMSE versus sketch size.]
  21. Count-Min Tree Sketches: RMSE on PMI. Results! [Chart: RMSE on PMI versus sketch size.]
  22. Count-Min Tree Sketch. Question: are CMTS really useful in real life? (1) CMTS is better over the whole vocabulary, but what happens if we skip the least frequent words and bigrams? (2) CMTS is better on average, but what happens quantile by quantile?
  23. Count-Min Tree Sketches: PMI error per quantile (sketches at 50% of perfect size; evaluation limited to f > 10⁻⁷). Results! [Chart: per-quantile PMI error.]
  24. Count-Min Tree Sketches: relative error per log₂-quantile (sketches at 50% of perfect size; evaluation limited to f > 10⁻⁷). Results! [Chart: per-log₂-quantile relative error.]
  25. Conclusion: where are we? CMTS significantly outperforms other methods at storing and updating Zipfian counts in a very efficient way. Because most of the time in sketch accesses is spent on memory access, its speed is on par with the other methods.
     • Main drawback: at very high (and impractical anyway) pressure, i.e. less than 10% of the perfect storage size, the error skyrockets.
     • Other drawback: the implementation is not straightforward; we have devised at least four different ways to increment the counters. Merging (and thus distributing) is easy once you can read and set a counter.
  26. Conclusion: where are we going? Dynamic: we are working on a CMTS version that can grow automatically (more layers added below). Pressure control: when we detect that pressure is becoming too high, we can divide and subsample to stop collisions from cascading. An open-source Python package is on its way.
