Doing-the-impossible

© 2014 MapR Technologies
• "Decoder ring"
• "the next thing I want to do is this"
• Flajolet

• What's the problem?
– speed
– feasibility
– communication
– incremental computation
– tree-based pre-computation
• What do we need?
– on-line version
– associative version
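
A minimal sketch of what "on-line" and "associative" mean in practice, using a running mean as the easy case. The `MeanSummary` type and `summarize` helper are hypothetical names chosen for illustration. The harder problems below need approximate summaries with the same two operations.

```python
from dataclasses import dataclass


@dataclass
class MeanSummary:
    """Running mean kept as (count, total); a hypothetical example type."""
    count: int = 0
    total: float = 0.0

    def add(self, x: float) -> "MeanSummary":
        # On-line version: constant work and memory per new sample.
        self.count += 1
        self.total += x
        return self

    def merge(self, other: "MeanSummary") -> "MeanSummary":
        # Associative version: partial summaries combine in any grouping,
        # which is what makes tree-based pre-computation possible.
        return MeanSummary(self.count + other.count, self.total + other.total)

    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def summarize(xs) -> MeanSummary:
    s = MeanSummary()
    for x in xs:
        s.add(x)
    return s
```

Summaries built on separate machines or time windows can then be merged in any grouping, for example `summarize(a).merge(summarize(b))`.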

• Why is that hard (impossible)?
– pathological inputs
– median ... any element of the first half of the data could be the median
– k-th most common ... any element could occur often enough in the second
half of the data to end up the most common
– unique elements ... hashing loses information, so any compact
representation must have false positives or false negatives

• What can we do?
– give up ... a slow, but exact answer may not be sooo bad
– give up ... a fast, but inexact answer may not be sooo bad
• The good news:
– approximate can be very, very close to exact

The Classic Problems
• Most common (top-40)
• Count distinct
• Quantiles, with focus on extremes

Classic Solutions
• Leaky counters
– Forget values, remember uncertainties
• Count min sketch
– Many small hash tables
• Count distinct with HyperLogLog
– Many hashes again
• New Solution - Quantiles by t-digest
– A new low in clustering

Classic Solutions - Leaky counters
• Intuition:
– Common elements are rarely rare, rare elements are always rare
• Leaky counter:
– new element inserted with count f = 1 and error = ceiling((N-1)/w),
where N is the number of samples seen so far and w is the bucket width
– every w samples, drop all entries with f + error < ceiling(N/w)
• Adaptation to heavy hitters is trivial
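
The leaky counter rules above can be sketched as follows. This follows the insert and drop rules as stated on the slide. The class name, the `heavy_hitters` helper, and the choice to report items by their upper bound `f + error` are assumptions for illustration.

```python
import math


class LeakyCounter:
    """Sketch of a leaky (lossy) counter with bucket width w."""

    def __init__(self, w: int):
        self.w = w         # bucket width
        self.n = 0         # samples seen so far (N)
        self.entries = {}  # key -> [count f, error]

    def add(self, key) -> None:
        self.n += 1
        if key in self.entries:
            self.entries[key][0] += 1
        else:
            # New element: count = 1, error = ceiling((N-1)/w).
            self.entries[key] = [1, math.ceil((self.n - 1) / self.w)]
        if self.n % self.w == 0:
            # Every w samples, drop entries with f + error < ceiling(N/w).
            threshold = math.ceil(self.n / self.w)
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] >= threshold
            }

    def heavy_hitters(self, min_fraction: float) -> dict:
        # Assumed adaptation: keep keys whose upper bound f + error
        # reaches the requested fraction of the stream.
        cutoff = min_fraction * self.n
        return {k: f for k, (f, err) in self.entries.items() if f + err >= cutoff}
```

Memory stays small because entries that appear only once or twice are dropped at the next few bucket boundaries, while frequent keys keep growing their count faster than the threshold.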

Classic Solutions - Count min sketch
• Intuition:
– A gazillion hashed counters can't all be wrong
• Big array of counters, each row has different hash function
• Increment counter in each row determined by hashing
• Probe by finding minimum hashed counter for probe key
• Oops... finding heavy hitters is tricky ... requires keeping log n
sketches

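Increment Hashed Locations to Insert
[Figure: key a is hashed by each row's function h_i, and the counter at h_i(a) is incremented in every row i]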

Probe Using min of Counts
$\min_i k[h_i(a)]$
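
The insert and probe steps above can be sketched in a few lines. The per-row hash functions here are built by salting BLAKE2b with the row number, which is an illustrative choice and not something the slides specify. The `merge` method is likewise an assumed addition that shows the structure is associative.

```python
import hashlib


class CountMinSketch:
    """Sketch of a count-min sketch: depth rows, each with its own hash."""

    def __init__(self, depth: int = 4, width: int = 1024):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # h_i(key): BLAKE2b salted with the row number (an assumption).
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        # Increment one counter per row.
        for i in range(self.depth):
            self.rows[i][self._index(i, key)] += count

    def estimate(self, key: str) -> int:
        # Collisions only ever inflate a counter, so the minimum over
        # rows is the tightest available upper bound on the true count.
        return min(self.rows[i][self._index(i, key)] for i in range(self.depth))

    def merge(self, other: "CountMinSketch") -> None:
        # Element-wise sum; requires identical shape and hash functions.
        assert (self.depth, self.width) == (other.depth, other.width)
        for mine, theirs in zip(self.rows, other.rows):
            for j, c in enumerate(theirs):
                mine[j] += c
```

The estimate never undercounts. It overcounts only when the probe key collides with other keys in every row.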

Classic Solutions - Count distinct with HyperLogLog
• Intuition:
– The smallest of n uniform samples is expected to be about 1/n (exactly 1/(n+1))
– Hashing turns anything into uniform distribution
– Hashing again turns anything into a new uniform distribution
• Best done with pictures

What does hashing look like?

[Figure: two panels with both axes running from 0.0 to 1.0, labeled ix and hash(ix)]

Hashing fixes all ills

[Figure: two panels. Original distribution: x ~ G(0.2, 0.2), mean = 1, median = 0.1, 5th percentile = $10^{-6}$. After hashing: values on an axis from 0.0 to 1.0]
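
A small experiment in the spirit of this slide. It assumes G(0.2, 0.2) means a gamma distribution with shape 0.2 and rate 0.2, which matches the stated mean of 1. The hash function (BLAKE2b over `repr(value)`) and the helper names are illustrative assumptions.

```python
import hashlib
import random


def hash_to_unit(value) -> float:
    """Map any value to a pseudo-uniform number in the unit interval."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2.0**64


def fraction_below(xs, cut: float) -> float:
    return sum(x < cut for x in xs) / len(xs)


rng = random.Random(42)
# random.gammavariate takes (shape, scale); scale = 1 / rate = 5.
samples = [rng.gammavariate(0.2, 5.0) for _ in range(20000)]
hashed = [hash_to_unit(x) for x in samples]

# The raw samples are heavily skewed toward zero; the hashed values
# should split roughly evenly around 0.5.
skewed_share = fraction_below(samples, 0.5)
hashed_share = fraction_below(hashed, 0.5)
```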

Now the trick … what is the min?

Repeated Minimum
10 samples
Min is ~ 0.1

[Figure: PDF of Min(x) over 0.00 to 0.10, three panels: observed minimum value (100 samples x 10,000 replications); theoretical distribution; theoretical distribution with mean = 0.0099]
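
A short derivation of the theoretical distribution, assuming n independent samples from Uniform(0, 1). It reproduces the two numbers quoted on the slides: a minimum of about 0.1 for 10 samples and a mean of 0.0099 for 100 samples.

```latex
\begin{align*}
P(\min > x) &= (1 - x)^n, \qquad 0 \le x \le 1 \\
f_{\min}(x) &= n (1 - x)^{n-1} \\
E[\min] &= \int_0^1 P(\min > x)\, dx = \int_0^1 (1 - x)^n \, dx = \frac{1}{n + 1} \\
n = 10 &: \quad E[\min] = 1/11 \approx 0.091 \\
n = 100 &: \quad E[\min] = 1/101 \approx 0.0099
\end{align*}
```

Inverting the relation gives the count estimate: an observed minimum m suggests roughly 1/m distinct values.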

Counting leading zeros is taking the log (almost)
[Figure: PDF of the observed minimum log10(value), Min(x); mean = −2.3, $10^{-2.3}$ = 0.0056]
[Figure: Error, on a log axis from 1e−05 to 0.1]
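
Putting the pieces together, here is a compact HyperLogLog-style sketch. It is an illustration under stated assumptions and not a production implementation: the 64-bit hash comes from BLAKE2b, the bias constant is the usual large-m approximation, and only the small-range (linear counting) correction is included.

```python
import hashlib
import math


def hash64(item: str) -> int:
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p            # 2**p registers; assumes p >= 7
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = hash64(item)
        idx = h >> (64 - self.p)               # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Leading zeros of `rest` plus one: roughly -log2 of a uniform value.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLog") -> None:
        # Register-wise max, so sketches combine associatively.
        assert self.p == other.p
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

With p = 12 the sketch holds 4096 small registers, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Re-adding an item never changes a register, which is why duplicates do not affect the estimate.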

T-digest for Quantiles
• Intuition:
– 1-d k-means with size cap
– Make size cap depend on distance to nearest end
• Experimental verification
– Distribution in cluster very uniform
– Accuracy far better than alternatives, especially at extremes
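
The size-cap idea can be sketched with a simplified batch version. This is not the t-digest library's algorithm: it sorts all the data first, and it assumes the cap has the form 4·N·q·(1−q)/δ, so that clusters shrink toward either end. The function names and the default `delta` are hypothetical.

```python
def build_digest(values, delta: float = 100.0):
    """Greedy 1-d clustering with a size cap that shrinks near the ends."""
    xs = sorted(values)
    n = len(xs)
    centroids = []  # list of [mean, count], in increasing order of mean
    seen = 0        # samples in all centroids before the last one
    for x in xs:
        if centroids:
            mean, count = centroids[-1]
            # Quantile at the middle of the cluster if x were added to it.
            q = (seen + (count + 1) / 2) / n
            cap = max(1.0, 4 * n * q * (1 - q) / delta)
            if count + 1 <= cap:
                centroids[-1] = [mean + (x - mean) / (count + 1), count + 1]
                continue
            seen += count
        centroids.append([x, 1])
    return centroids


def quantile(centroids, q: float) -> float:
    """Estimate a quantile by interpolating between centroid centers."""
    n = sum(count for _, count in centroids)
    target = q * n
    cum = 0.0
    prev_mean, prev_mid = centroids[0][0], 0.0
    for mean, count in centroids:
        mid = cum + count / 2  # rank at the centroid's center
        if target <= mid:
            if mid == prev_mid:
                return mean
            t = (target - prev_mid) / (mid - prev_mid)
            return prev_mean + t * (mean - prev_mean)
        prev_mean, prev_mid = mean, mid
        cum += count
    return centroids[-1][0]
```

Because clusters near q = 0 and q = 1 hold only a handful of points, extreme quantiles are resolved almost exactly, while the large clusters in the middle keep the total number of centroids small.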
