© 2014 MapR Technologies
• "Decoder ring" 
• "the next thing I want to do is this" 
• Flajolet
• What's the problem? 
– speed 
– feasibility 
– communication 
– incremental computation 
– tree-based pre-computation 
• What do we need? 
– on-line version 
– associative version
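To make "on-line" and "associative" concrete, here is a minimal sketch using a running mean. The running mean is not one of the problems in this deck; it is only an easy case where both properties hold exactly, and the `MeanState` name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MeanState:
    """Hypothetical mergeable summary: enough state to report a mean."""
    total: float = 0.0
    count: int = 0

    def update(self, x):
        # On-line: absorb one sample at a time without storing the data.
        self.total += x
        self.count += 1

    def merge(self, other):
        # Associative: partial summaries combine in any grouping,
        # which is what tree-based pre-computation relies on.
        return MeanState(self.total + other.total, self.count + other.count)

    def value(self):
        return self.total / self.count


def summarize(samples):
    state = MeanState()
    for x in samples:
        state.update(x)
    return state
```

Median, k-th most common and count distinct have no exact summary of this kind in small space, which is the difficulty the deck goes on to describe.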
• Why is that hard (impossible)? 
– pathological inputs 
– median ... any element of the first half of the data could be the median 
– k-th most common ... any element could occur enough in the second half to be biggest 
– unique elements ... hashing loses information, any compact representation must have false positives or negatives.
• What can we do? 
– give up ... a slow, but exact answer may not be sooo bad 
– give up ... a fast, but inexact answer may not be sooo bad 
• The good news: 
– approximate can be very, very close to exact
The Classic Problems 
• Most common (top-40) 
• Count distinct 
• Quantiles, with focus on extremes
Classic Solutions 
• Leaky counters 
– Forget values, remember uncertainties 
• Count min sketch 
– Many small hash tables 
• Count distinct with HyperLogLog 
– Many hashes again 
• New Solution - Quantiles by t-digest 
– A new low in clustering
Classic Solutions - Leaky counters 
• Intuition: 
– Common elements are rarely rare, rare elements are always rare 
• Leaky counter: 
– new element inserted with count f = 1, error = ceiling((N-1)/w), where N is the number of samples seen so far and w is the bucket width 
– every w samples, drop all elements with f + error < ceiling(N/w) 
• Adaptation to heavy hitters is trivial
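The leaky counter rules above can be sketched as follows. This is a minimal illustration that follows the two formulas on the slide literally. The class name, the `bounds` helper and the reading of N as "samples seen so far, including the new one" are assumptions.

```python
def _ceil_div(a, b):
    # Integer ceiling division, avoids floating point rounding.
    return -(-a // b)


class LeakyCounter:
    def __init__(self, w):
        self.w = w            # bucket width: prune once every w samples
        self.n = 0            # N, samples seen so far
        self.entries = {}     # item -> [f, error]

    def add(self, item):
        self.n += 1
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # New element: count = 1, error = ceiling((N-1)/w).
            self.entries[item] = [1, _ceil_div(self.n - 1, self.w)]
        if self.n % self.w == 0:
            # Every w samples, drop entries with f + error < ceiling(N/w).
            limit = _ceil_div(self.n, self.w)
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] >= limit}

    def bounds(self, item):
        # (lower, upper) bound on the true count. For an item that is not
        # tracked, ceiling(N/w) is used as the upper bound (an assumption).
        f, error = self.entries.get(item, (0, _ceil_div(self.n, self.w)))
        return f, f + error


# One frequent item mixed with 500 items that each occur once.
lc = LeakyCounter(w=10)
for i in range(1000):
    lc.add("hot" if i % 2 == 0 else f"rare{i}")
```

After this run the table holds the frequent item plus only the most recent rare ones, so memory stays small while the count of "hot" is tracked exactly.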
Classic Solutions - Count min sketch 
• Intuition: 
– A gazillion hashed counters can't all be wrong 
• Big array of counters, each row has different hash function 
• Increment counter in each row determined by hashing 
• Probe by finding minimum hashed counter for probe key 
• Oops... finding heavy hitters is tricky ... requires keeping log n sketches
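The insert and probe steps above can be sketched as below. This is a minimal illustration, not a tuned implementation: the class name, the default sizes and the use of salted BLAKE2b as the per-row hash are all assumptions.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=4, width=1024):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # A different hash function per row, obtained by salting with the row.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, amount=1):
        # Increment the counter selected by hashing in each row.
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += amount

    def estimate(self, key):
        # Probe: the minimum over rows is the least contaminated counter.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))


cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
```

Collisions can only add to a counter, so `estimate` never undercounts. Taking the minimum across rows limits how much it overcounts.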
Increment Hashed Locations to Insert 
[Figure: key a is hashed to location h_i(a) in row i]
Probe Using min of Counts 
$\min_i k[h_i(a)]$
Classic Solutions - Count distinct with HyperLogLog 
• Intuition: 
– The smallest of n uniform samples on [0, 1] is expected to be about 1/n (exactly 1/(n+1)) 
– Hashing turns anything into a uniform distribution 
– Hashing again turns anything into a new uniform distribution 
• Best done with pictures
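The intuition above can be sketched in a few lines. This is a plain "minimum of hashes" estimator, not HyperLogLog itself. The function names, the salted BLAKE2b hash and the choice of 256 repeats are assumptions.

```python
import hashlib


def uniform_hash(item, seed=0):
    # Map any item to a pseudo-uniform value in [0, 1). A different seed
    # gives a new, independent-looking uniform value for the same item.
    digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") / 2**64


def estimate_distinct(items, repeats=256):
    # For n distinct items the minimum hash has mean 1/(n+1), so average
    # the minimum over several hash functions and invert that relation.
    mins = [min(uniform_hash(x, seed) for x in items)
            for seed in range(repeats)]
    mean_min = sum(mins) / repeats
    return 1.0 / mean_min - 1.0


data = list(range(1000))
once = estimate_distinct(data)
thrice = estimate_distinct(data * 3)   # every item repeated three times
```

Duplicates hash to the same value and cannot change a minimum, so `once` and `thrice` are identical. HyperLogLog keeps the same idea but stores only leading-zero counts in many small registers.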
What does hashing look like? 
[Figure: plot of ix, both axes from 0.0 to 1.0]
[Figure: plot of hash(ix), both axes from 0.0 to 1.0]
Hashing fixes all ills 
[Figure, two panels]
Original distribution: x ~ G(0.2, 0.2); mean = 1, median = 0.1, 5th percentile = 10^−6 
After hashing: values spread over 0.0 to 1.0
Now the trick … what is the min? 
Repeated Minimum 
10 samples 
Min is ~ 0.1
[Figure: PDF of Min(x) against observed minimum value (100 samples x 10,000 replications), overlaid with the theoretical distribution; mean = 0.0099]
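The mean reported on the plot can be checked by hand. The following is the standard order-statistics derivation for n independent samples that are uniform on [0, 1]; it is not shown on the slides.

```latex
\begin{align*}
P(\min > x) &= (1 - x)^n, \qquad 0 \le x \le 1 \\
f_{\min}(x) &= n (1 - x)^{n-1} \\
E[\min] &= \int_0^1 P(\min > x)\, dx = \int_0^1 (1 - x)^n \, dx = \frac{1}{n + 1}
\end{align*}
```

With n = 100 this gives 1/101 ≈ 0.0099, which matches the observed mean.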
Counting leading zeros is 
taking the log (almost) 
[Figure: PDF of Min(x) against observed minimum log10(value), axis from 1e−05 to 0.1; mean = −2.3, 10^−2.3 = 0.0056; second axis labeled Error, from 0.0 to 1.0]
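A small sketch of why counting leading zeros is "almost" a logarithm. The 32-bit hash width and the function name are assumptions made for illustration.

```python
import math


def leading_zeros(h, bits=32):
    # Number of zero bits before the first one bit of a bits-wide hash.
    return bits - h.bit_length()


for h in [1, 12345, 0x00F00000, 2**31, 2**32 - 1]:
    x = h / 2**32                 # the same hash read as a value in (0, 1)
    z = leading_zeros(h)
    # "Almost" the log: z <= -log2(x) <= z + 1, so z is -log2(x) rounded
    # down, except when h is an exact power of two.
    print(h, z, -math.log2(x))
```

The register value is therefore the base-2 log of the hash to within one bit, which is all the precision the estimate needs.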
© 2014 MapR Technologies 25 
T-digest for Quantiles 
• Intuition: 
– 1-d k-means with size cap 
– Make size cap depend on distance to nearest end 
• Experimental verification 
– Distribution in cluster very uniform 
– Accuracy far better than alternatives, especially at extremes
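The intuition above can be sketched as a batch, sorted-input version of the clustering. This is a simplification for illustration and not the streaming t-digest algorithm. The cap 4·N·q·(1−q)/compression, the `compression` parameter and both function names are assumptions.

```python
def build_digest(values, compression=50):
    """1-d clustering with a size cap that shrinks toward both ends."""
    data = sorted(values)
    n = len(data)
    centroids = []    # each centroid is [mean, count]
    seen = 0          # points already assigned to a centroid
    for x in data:
        if centroids:
            mean, count = centroids[-1]
            q = (seen - count / 2) / n     # quantile at the centroid centre
            cap = max(1.0, 4 * n * q * (1 - q) / compression)
            if count + 1 <= cap:
                # Room left under the cap: fold x into the last centroid.
                centroids[-1] = [mean + (x - mean) / (count + 1), count + 1]
                seen += 1
                continue
        centroids.append([x, 1])           # otherwise start a new centroid
        seen += 1
    return centroids


def quantile(centroids, q):
    # Mean of the centroid that contains rank q*N (no interpolation).
    total = sum(count for _, count in centroids)
    running = 0
    for mean, count in centroids:
        running += count
        if running >= q * total:
            return mean
    return centroids[-1][0]


digest = build_digest([i / 10000 for i in range(10000)])
```

Because q(1−q) goes to zero at both ends, centroids near the extremes hold single points while those near the median hold many. That is where the extra accuracy at extreme quantiles comes from.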

Doing-the-impossible
