No BS Data Salon #3: Probabilistic Sketching

Timon Karnezos' presentation on probabilistic sketching and distinct counting with HyperLogLog from the third No BS Data Salon on May 19th, 2012.

Published in: Technology, Business

  1. No BS Data Salon #3: Probabilistic Sketching. May 2012. Analytics + Attribution = Actionable Insights
  2. Outline
     • What we do at AK
     • What’s sketching?
     • Our motivation for sketching
     • Why should you sketch?
     • Our case: unique counting
       • How it works
       • How well it works
       • How we use them
  3. Here’s what we do at AK.
     • Online ad analytics
     • Compare performance of different campaigns, inventory, providers, creatives, etc.
     • Bottom line: give advertisers insight into the performance of their ads.
  4. Motivation
     • High throughput: 10s of K/s => 100s of K/s
     • High dimensionality: 100M+ reporting keys
     • Easy aggregates: counters, scalars
     • Hard aggregates: unique user counting, set operations
     • No cheap or effective “online” solutions
       • Streaming DBs (Truviso, Coral8, StreamBase) are insufficient
       • Warehouse appliances (Aster, custom PG) are too
       • Our data is immutable. Paying for unneeded ACID is silly.
     • Offline solutions are slow and operationally finicky.
     • We’re not a bank. We don’t need to be perfect, just useful.
  5. Why should you bother?
       SELECT COUNT(DISTINCT user_id)
       FROM access_logs
       GROUP BY campaign_id
  6. What is probabilistic sketching?
     • One-pass
     • “Small” memory
     • Probabilistic error
  7. Our Case Study: unique counting
     • Non-unique stream of ints
     • Want to keep a unique count, up to about a billion
     • Want to do set operations (union, intersection, set difference)
     • Straw Man #1: “Put them in a HashSet, and go away.”
     • (Maybe) Straw Man #2: “Fine, keep a sample.”
     • How we did it: HyperLogLog
  8. How it works
     The Papers:
     • LogLog Counting of Large Cardinalities. Marianne Durand and Philippe Flajolet (RIP 2011), 2003
     • HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Flajolet, Fusy, Gandouet, Meunier, 2007
     The (rudimentary, unrigorous) Intuition:
     • Flip fair coins
     • Longest streak of heads is length k, seen once
     • Probability of such a streak: p ≈ (½)^k
     • Seen once in n tries means E[x] = n·p = 1, so with p = (½)^k we get n ≈ 2^k
  9. How it works cont’d
     1. Stream of int_64 => “good” hash => random {0,1}^64
     2. Keep track of the longest run of leading zeroes
     3. Longest run of length k => cardinality ≈ 2^k
     • Crazy math business
       • Correct systematic bias with a derived constant
       • Stochastic averaging
       • Balls and bins correction
  10. Here’s what you get
     • Native: union, cardinality
     • Implies: intersection (!!!), set difference (!!!)
  11. Show me the money!
     • Used in production at AK for a year
     • Accurate: count to a billion with 1-3% error
     • Small: a few KB each, so we can keep 100s of M in memory
     • Fast: benched at 2M inserts/s, used in production at 100s of K/s
  12. Lies, damn lies, and boxplots! [Figure: “Cardinality Relative Error vs True Cardinality”, log2m=13 (5 kB). Boxplots of HLL cardinality relative error (y-axis, −4% to 4%) against true cardinality (x-axis, 10^2 to 10^9).]
  13. But wait, there’s more! [Figure: “Intersection Error vs Magnitude Difference”, log2m=13 (5 kB). HLL intersection error (y-axis, −40% to 40%) against cardinality order-of-magnitude difference (x-axis, 0 to 3), one series per overlap fraction from 0.1 to 1.]
  14. Implementation caveats
     • If you store an HLL for each key, you’ll likely be wasting space when not all of the registers are set. Use a map-based HLL or use compression.
     • Pick a good hash function!
     • Test on your data!
     • Tune parameters to suit your business needs!
  15. How we use them, in production
     • Original problem: fast, on-the-fly overlaps and unique counts
     • Solution:
       • Streaming, in-memory aggregations shipped to Postgres
       • Postgres module to do set operations on binary representations in the DB
     • Freebie: PG analytics support like GROUP BY, sliding windows, etc…
  16. UI example: To the browser, Robin!
  17. How we use them, Ad Hoc
     • Outside of production: amazing ad-hoc analysis tool
     • Example: gathering more than a year’s worth of data for an RFP, at 20B impressions/month
       • Painless and quick when we had the data as sketches
       • Much more effort to put it through Hadoop
     • Iterating on product and research is cheaper and faster. Waiting minutes instead of seconds between iterations is painful.
  18. “Soft” Caveats
     • Fixed N% error is deceiving
     • Additive error for set operations can balloon
     • Unbounded error sneaks in now and again
  19. Parting Advice
     • Test these on your data rigorously
     • Choose good hash functions
     • Tuning parameters are particularly sensitive
     • You’ll find all kinds of unexpected uses for them, so get building!
     • Bibliography blog post will be up in a bit!
  20. Questions? @timonk, timon@aggregateknowledge.com, blog.aggregateknowledge.com
  21. Credits: All the adorable cartoons you saw in this presentation were taken from http://sureilldrawthat.com/ and http://sureilldrawthat.tumblr.com/ and belong to him/her.
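As a rough illustration of the "How it works" and "Here's what you get" slides, here is a minimal HyperLogLog sketch in Python. It is a toy under stated assumptions, not AK's production implementation. The class name `HyperLogLog`, the helper `intersection_estimate`, and the use of truncated SHA-1 as the "good" hash are all hypothetical choices made for this example. The default of 2^13 registers mirrors the log2m=13 setting on the plot slides. The constants follow the 2007 HyperLogLog paper.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with m = 2**p registers (illustrative only)."""

    def __init__(self, p=13):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(value):
        # Assumption: SHA-1 truncated to 64 bits stands in for a "good" hash.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        h = self._hash64(value)
        width = 64 - self.p
        index = h >> width                # stochastic averaging: first p bits pick a register
        rest = h & ((1 << width) - 1)     # remaining bits feed the leading-zero count
        rank = width - rest.bit_length() + 1  # run of leading zeroes, plus one
        if rank > self.registers[index]:
            self.registers[index] = rank

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias-correction constant, valid for m >= 128
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            # Small-range correction (linear counting over empty registers).
            estimate = m * math.log(m / zeros)
        return estimate

    def union(self, other):
        # Native set operation: register-wise maximum of two same-sized sketches.
        if self.p != other.p:
            raise ValueError("sketches must use the same number of registers")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


def intersection_estimate(a, b):
    # Implied set operation via inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.
    return a.cardinality() + b.cardinality() - a.union(b).cardinality()
```

Because the intersection is computed from three separate estimates, their errors add. That is one way to read the "additive error for set operations can balloon" caveat: a small overlap between two large sets can be swamped by the error on the larger terms.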
