1. Cardinality Estimation for Very Large Data Sets
   Matt Abrams, VP Data and Operations
   March 25, 2013
2. THANKS FOR COMING!
   I build large scale distributed systems and work on algorithms that make sense of the data stored in them.
   Contributor to the open source project Stream-Lib, a Java library for summarizing data streams (https://github.com/clearspring/stream-lib).
   Ask me questions: @abramsm
3. HOW CAN WE COUNT THE NUMBER OF DISTINCT ELEMENTS IN LARGE DATA SETS?
4. HOW CAN WE COUNT THE NUMBER OF DISTINCT ELEMENTS IN VERY LARGE DATA SETS?
5. GOALS FOR COUNTING SOLUTION
   • Support high throughput data streams (up to many hundreds of thousands of elements per second)
   • Estimate cardinality with known error thresholds in sets up to around 1 billion (or even 1 trillion when needed)
   • Support set operations (unions and intersections)
   • Support data streams with a large number of dimensions
6. 1 UID = 128 bits
   513a71b843e54b73
7. In one month AddThis logs 5B+ UIDs
   2,500,000 * 2,000 = 5,000,000,000
8. That's 596GB of just UIDs
9. NAÏVE SOLUTIONS
   • Select count(distinct UID) from table where dimension = foo
   • HashSet<K>
   • Run a batch job for each new query request
10. WE ARE NOT A BANK
    This means an estimate rather than an exact value is acceptable.
    http://graphics8.nytimes.com/images/2008/01/30/timestopics/feddc.jpg
11. THREE INTUITIONS
    • It is possible to estimate the cardinality of a set by understanding the probability of a sequence of events occurring in a random variable (e.g., how many coins were flipped if I saw n heads in a row?)
    • Averaging the results of multiple observations can reduce the variance associated with random variables
    • Applying a good hash function effectively de-duplicates the input stream
12. INTUITION
    A good hash function produces a uniformly random string of 0s and 1s, so we can reason about the probability of particular bit patterns appearing in the hashed value.
    What is the probability that a binary string starts with '01'?
13. INTUITION
    (1/2)^2 = 25%
14. INTUITION
    (1/2)^3 = 12.5%
15. INTUITION
    Crude analysis: if a stream has 8 unique values, the hash of at least one of them should start with '001'.
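    To make the crude analysis concrete, a short worked expectation (my own arithmetic, not from the slides):

        P(\text{hash starts with } 001) = (1/2)^3 = 1/8, \qquad \mathbb{E}[\text{matches among 8 distinct hashed values}] = 8 \cdot \tfrac{1}{8} = 1

    So a hash starting with '001' is roughly what we expect to see once the stream contains about 8 distinct values.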
16. INTUITION
    Given the variability of a single random value, we cannot use a single variable for accurate cardinality estimation.
17. MULTIPLE OBSERVATIONS HELP REDUCE VARIANCE
    By averaging m independent observations we can make the error as small as desired by controlling the size of m (the number of random variables):
    error = \sigma / \sqrt{m}
18. THE PROBLEM WITH MULTIPLE HASH FUNCTIONS
    • It is too costly from a computational perspective to apply m hash functions to each data point
    • It is not clear that it is possible to generate m good hash functions that are independent
19. STOCHASTIC AVERAGING
    • Emulate the effect of m experiments with a single hash function
    • Divide the input stream h(M) into m sub-streams based on \left[\frac{1}{m}, \frac{2}{m}, \ldots, \frac{m-1}{m}, 1\right]
    • An average of the observable values for each sub-stream will yield a cardinality estimate whose accuracy improves in proportion to 1/\sqrt{m} as m increases
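    A minimal sketch of the sub-stream split in Java (my own illustration, assuming a 32-bit hash and b index bits so that m = 2^b; names like bucketIndex are hypothetical):

        // Split one 32-bit hash into a sub-stream index and the remaining bits.
        public class StochasticAveraging {
            static final int B = 11;          // b index bits -> m = 2^11 = 2048 sub-streams
            static final int M = 1 << B;

            // The top b bits select the sub-stream...
            static int bucketIndex(int hash) {
                return hash >>> (Integer.SIZE - B);
            }

            // ...and the remaining 32 - b bits are the per-sub-stream observation.
            static int remainingBits(int hash) {
                return hash << B;
            }
        }

    Averaging one observable per sub-stream then emulates m independent experiments while hashing each value only once.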
20. HASH FUNCTIONS
    32-Bit Hash | 64-Bit Hash  | 160-Bit Hash | Odds of a Collision
    77,163      | 5.06 billion | 1.42 * 10^24 | 1 in 2
    30,084      | 1.97 billion | 5.55 * 10^23 | 1 in 10
    9,292       | 609 million  | 1.71 * 10^23 | 1 in 100
    2,932       | 192 million  | 5.41 * 10^22 | 1 in 1000
    (Number of hashed values needed before a collision becomes that likely.)
    http://preshing.com/20110504/hash-collision-probabilities
21. HYPERLOGLOG (2007)
    Counts up to 1 billion in 1.5KB of space
    Philippe Flajolet (1948-2011)
22. HYPERLOGLOG (HLL)
    • Operates with a single pass over the input data set
    • Produces a typical error of 1.04 / \sqrt{m}
    • Error decreases as m increases. Error is not a function of the number of elements in the set
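    As a quick illustration of the error bound (values chosen by me, not from the slides):

        m = 2048:\quad 1.04 / \sqrt{2048} \approx 0.023 \approx 2.3\%
        m = 65536:\quad 1.04 / \sqrt{65536} = 1.04 / 256 \approx 0.4\%

    and this holds regardless of whether the stream contains thousands or billions of distinct elements.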
23. HLL SUBSTREAMS
    HLL uses a single hash function and splits the result into m buckets
    [Diagram: input values -> hash function -> Bucket 1, Bucket 2, ..., Bucket m]
24. HLL ALGORITHM BASICS
    • Each substream maintains an observable
      • The observable is the largest value of \rho(x) seen, where \rho(x) is the position of the leftmost 1-bit in the binary string x
    • 32-bit hash function with 5-bit "short bytes"
    • Harmonic mean
      • Increases the quality of estimates by reducing variance
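    A minimal sketch of the observable \rho(x) in Java, using the JDK's bit utilities (my own illustration; the rho name is hypothetical):

        public class Rho {
            // rho(x): 1-based position of the leftmost 1-bit in the 32-bit string x.
            // Example: a value whose binary form starts with 001 gives rho = 3.
            // For x == 0 this returns 33 (no 1-bit present).
            static int rho(int x) {
                return Integer.numberOfLeadingZeros(x) + 1;
            }
        }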
25. WHAT ARE "SHORT BYTES"?
    • We know a priori that the value of a given substream of the multiset M is in the range 0 .. (L + 1 - \log_2 m)
    • Assuming L = 32, we only need 5 bits to store the value of the register
    • ~85% less memory usage compared to a standard Java int
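    For example (illustrative numbers of my own), with L = 32 and m = 2048, so \log_2 m = 11:

        0 \le M[j] \le 32 + 1 - 11 = 22 < 2^5

    so 5 bits per register are enough, and 5/32 \approx 0.16, i.e. roughly 85% less memory than storing each register in a 32-bit int.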
26. ADDING VALUES TO HLL
    j = 1 + \langle x_1 x_2 \cdots x_b \rangle_2, \qquad w = x_{b+1} x_{b+2} \cdots, \qquad \text{observe } \rho(w)
    • The first b bits of the new value define the index j into the multiset M that may be updated when the new value is added
    • The remaining bits (from position b+1 onward) are used to determine the number of leading zeros, \rho
27. ADDING VALUES TO HLL
    Observations: \{M[1], M[2], \ldots, M[m]\}
    The multiset is updated using the equation:
    M[j] := \max(M[j], \rho(w))
    where \rho(w) is the number of leading zeros plus 1
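    Putting the indexing and the observable together, a minimal add-operation sketch (my own illustration, not the Stream-Lib implementation; assumes a 32-bit hash and b index bits):

        public class HllRegisters {
            static final int B = 11;                   // b -> m = 2^11 = 2048 registers
            final byte[] registers = new byte[1 << B]; // each value fits in 5 bits

            void add(int hash) {
                int j = hash >>> (Integer.SIZE - B);             // first b bits -> register index (0-based here)
                int w = hash << B;                               // remaining bits
                int rho = Integer.numberOfLeadingZeros(w) + 1;   // leading zeros + 1 (w == 0 edge case ignored in this sketch)
                if (rho > registers[j]) {
                    registers[j] = (byte) rho;                   // M[j] := max(M[j], rho(w))
                }
            }
        }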
28. INTUITION ON EXTRACTING CARDINALITY FROM HLL
    • If we add n elements to a stream, then each substream will contain roughly n/m elements
    • The MAX value in each substream should be about \log_2(n/m) (from the earlier intuition about random variables)
    • The harmonic mean mZ of the 2^{MAX} values is on the order of n/m
    • So m \cdot mZ = m^2 Z is on the order of n. That's the cardinality!
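    A rough numeric walk-through (illustrative numbers of my own): with n = 1{,}000{,}000 distinct elements and m = 1024 substreams,

        n/m \approx 977, \qquad \log_2(977) \approx 9.9

    so a typical register holds a value near 10, and m \cdot 2^{10} = 1024 \cdot 1024 \approx 1.05 \times 10^6 \approx n.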
29. HLL CARDINALITY ESTIMATE
    E := \alpha_m \cdot m^2 \cdot \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}
    (the summation term gives the harmonic mean of the 2^{M[j]} values)
    • m^2 Z has a systematic multiplicative bias that needs to be corrected. This is done by multiplying by the constant \alpha_m
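    A minimal sketch of the raw estimate in Java (my own illustration; uses the \alpha_m \approx 0.7213 / (1 + 1.079/m) approximation for large m from the HLL paper, and omits the small- and large-range corrections):

        public class HllEstimate {
            // E = alpha_m * m^2 / sum_j 2^(-M[j])
            static double rawEstimate(byte[] registers) {
                int m = registers.length;
                double alphaM = 0.7213 / (1.0 + 1.079 / m);   // approximation valid for m >= 128
                double sum = 0.0;
                for (byte r : registers) {
                    sum += Math.pow(2.0, -r);                 // 2^(-M[j])
                }
                return alphaM * m * m / sum;
            }
        }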
30. A NOTE ON LONG RANGE CORRECTIONS
    • The paper says to apply a long range correction function when the estimate is greater than:
      E > \frac{1}{30} \cdot 2^{32}
    • The correction function is:
      E^* := -2^{32} \log\left(1 - E / 2^{32}\right)
    • DON'T DO THIS! It doesn't work and increases error. A better approach is to use a bigger/better hash function
31. DEMO TIME!
    Let's look at HLL in action.
    http://www.aggregateknowledge.com/science/blog/hll.html
32. HLL UNIONS
    • Merging two or more HLL data structures is a similar process to adding a new value to a single HLL
    • For each register in the HLL, take the max value across the HLLs you are merging; the resulting register set can be used to estimate the cardinality of the combined sets
    [Diagram: a Root HLL produced by merging the MON, TUE, WED, THU, and FRI HLLs]
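    A minimal sketch of the register-wise union (my own illustration; both sketches must have been built with the same m and the same hash function):

        public class HllUnion {
            static byte[] union(byte[] a, byte[] b) {
                if (a.length != b.length) {
                    throw new IllegalArgumentException("HLLs must use the same number of registers");
                }
                byte[] merged = new byte[a.length];
                for (int j = 0; j < a.length; j++) {
                    merged[j] = (byte) Math.max(a[j], b[j]);   // take the max of each register
                }
                return merged;
            }
        }

    The merged register array is itself a valid HLL, so the union is lossless and can be estimated like any other HLL.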
33. HLL INTERSECTION
    |A \cap B| = |A| + |B| - |A \cup B|
    [Venn diagram: overlapping sets A and B, with the intersection labeled C]
    You must understand the properties of your sets to know if you can trust the resulting intersection
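    A sketch of the inclusion-exclusion estimate, built from the hypothetical rawEstimate and union helpers sketched earlier; the errors of the three estimates compound, which is why the caveat above matters:

        public class HllIntersection {
            // |A n B| ~= |A| + |B| - |A u B|
            static double intersectionEstimate(byte[] a, byte[] b) {
                double estA = HllEstimate.rawEstimate(a);
                double estB = HllEstimate.rawEstimate(b);
                double estUnion = HllEstimate.rawEstimate(HllUnion.union(a, b));
                return estA + estB - estUnion;   // can be noisy, or even negative, for small overlaps
            }
        }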
34. HYPERLOGLOG++
    • Google researchers have recently released an update to the HLL algorithm
    • Uses clever encoding/decoding techniques to create a single data structure that is very accurate for small-cardinality sets and can estimate sets that have over a trillion elements in them
    • Empirical bias correction: observations show that most of the error in HLL comes from the bias function. Using empirically derived correction values significantly reduces error
    • Already available in Stream-Lib!
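    For reference, a minimal usage sketch against Stream-Lib's HyperLogLogPlus (constructor and method names as I recall the library's API; treat the exact signatures as an assumption and check the project's README):

        import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

        public class HllPlusExample {
            public static void main(String[] args) {
                HyperLogLogPlus hll = new HyperLogLogPlus(14);   // precision p = 14 -> 2^14 registers (assumed constructor)
                for (int i = 0; i < 1_000_000; i++) {
                    hll.offer("user-" + i);                      // add elements; duplicates are absorbed
                }
                System.out.println(hll.cardinality());           // estimated number of distinct elements
            }
        }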
35. OTHER PROBABILISTIC DATA STRUCTURES
    • Bloom Filters – set membership detection
    • CountMinSketch – estimate the number of occurrences of a given element
    • TopK Estimators – estimate the frequency and top elements of a stream
36. REFERENCES
    • Stream-Lib - https://github.com/clearspring/stream-lib
    • HyperLogLog - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.9475
    • HyperLogLog In Practice - http://research.google.com/pubs/pub40671.html
    • Aggregate Knowledge HLL Blog Posts - http://blog.aggregateknowledge.com/tag/hyperloglog/
37. THANKS!
    AddThis is hiring!