Too Much Data? - Just Sample, Just Hash, ...

Code & Supply Pittsburgh Meetup
Too Much Data?
Just Sample, Just Hash, …
Probabilistic Data Structures and Algorithms (PDSA)
Andrii Gakhov, PhD 
Ferret Go GmbH, Berlin, Germany

About me
Andrii Gakhov
Senior Software Engineer
PhD in Mathematical Modelling
Author of the book “Probabilistic Data Structures and Algorithms for Big Data Applications”
Website: www.gakhov.com
Twitter: @gakhov

Sampling and Hashing
Lots of data: lots of opportunity, lots of tech problems
Solution: Sampling and Hashing
Sampling:
Process a subset of elements
Discard the rest
Interpolate results
Hashing:
Process all elements
Compression with information loss
Approximate results
PDSA = Hashing + Pattern observation
# 0000101001010101010

PDSA in Apache Spark SQL
Approximate number of distinct elements (HyperLogLog++)
SELECT approx_count_distinct(some_column) FROM df
Frequency estimation (Count-Min Sketch)
SELECT count_min_sketch(some_column, 0.1, 0.9, 42) FROM df 
# Estimated frequency for python = cms.estimateCount("python")
Spark SQL is Apache Spark's module for working with structured data.

PDSA in Production
and many others …

PDSA in Big Data Ecosystem
Big Data: Volume, Velocity, Variety
Membership: Bloom Filter, Quotient Filter, Cuckoo Filter
Counting: Linear Counting, FM Sketch, LogLog, HyperLogLog
Frequency: Count-Min Sketch, Count Sketch
Rank: Random Sampling, t-digest, q-digest
Similarity: MinHash, SimHash

Counting
Find the number of distinct elements

Counting: Traditional Approach
Build list of all unique elements
Sort / search to avoid listing elements twice
Count elements in the list
requires linear memory
requires O(n·log n) time (e.g., mergesort)
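The traditional approach can be sketched in a few lines of Python. This is only an illustration of the exact methods the slide describes; the function names and the sample IP addresses are made up for the example. Both variants keep every distinct element in memory, which is exactly the cost that probabilistic counting avoids.

```python
def count_distinct_sorted(items):
    """Exact distinct count via sorting: O(n log n) time, O(n) memory."""
    ordered = sorted(items)
    distinct = 0
    previous = object()  # sentinel that compares unequal to every element
    for item in ordered:
        if item != previous:  # a new value starts here, count it once
            distinct += 1
            previous = item
    return distinct


def count_distinct_set(items):
    """Exact distinct count via a hash set: still O(n) memory."""
    return len(set(items))


# Hypothetical sample data, not taken from the talk.
visitors = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
print(count_distinct_sorted(visitors))  # 3
print(count_distinct_set(visitors))     # 3
```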

Counting: 99% Memory Reduction
144 million unique IP addresses / month*
2.3 GB to store them in a set!
What if we can count them with 12 KB only?
*SimilarWeb.com data for March 2017

Counting: Approximate Counting
@katyperry has
107,287,629 followers
Would you really care if she has 107.2,
108.0, or 106.7 million followers?

Counting: Probabilistic counter
hash("hello") = 42 = 01010100 (binary, least significant bit first)
Rank = the number of leading zeros in that representation
rank(42) = 1
Idea: Store all observed ranks of indexed values
The more values we have seen, the closer we are to the theoretically expected distribution
Example:
Take a coin and toss it as many times as you can. How many tosses do you need to get an equal number of heads and tails? What happens if you toss it a few more times?
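The rank computation can be illustrated with a small Python sketch. It assumes the slide writes the bits of 42 least significant bit first, since that is the reading under which 42 = 01010100 and rank(42) = 1; the helper names `lsb_first` and `rank` and the 8-bit width are choices made for this example only.

```python
def lsb_first(value, width=8):
    """Binary string of `value`, least significant bit first."""
    return format(value, f"0{width}b")[::-1]


def rank(value, width=8):
    """Number of leading zeros in the LSB-first representation,
    i.e. the position of the least significant 1-bit."""
    if value == 0:
        return width  # no 1-bit at all
    return (value & -value).bit_length() - 1


print(lsb_first(42))  # 01010100
print(rank(42))       # 1
```

Half of all hash values are odd and get rank 0, a quarter get rank 1, and so on, which is the pattern the counter exploits.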

Counting: Probabilistic counter
Let’s remember all ranks we’ve seen so far in a binary Counter
rank:    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Counter: 1 1 1 1 0 0 0 0 0 1  0  0  1  0  0  0
                 ^ R = 4
A single element has rank j with probability 1 / 2^{j+1}, so after indexing n elements about n / 2^{j+1} of them are expected to have rank j
j << log_2 n => almost certainly we have 1 (low ranks appear often)
j >> log_2 n => almost certainly we have 0 (high ranks appear rarely)
j ≈ log_2 n => equal probability to have 1 or 0 in the Counter
Idea: the left-most position of zero (R) in the Counter after inserting n elements from the dataset can be used as an indicator of log_2 n.
n ≈ 2^{R} / 0.77351 ≈ 16 / 0.77351 ≈ 21
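A minimal Python sketch of such a counter follows. It is not the author's implementation: the class name, the SHA-256 based `hash32` helper, and the 32-bit width are assumptions made for illustration. Only the estimate n ≈ 2^R / 0.77351 and the "left-most zero" rule come from the slide.

```python
import hashlib

PHI = 0.77351  # correction constant from the slide


def hash32(element):
    """Illustrative 32-bit hash built from SHA-256 (an assumption)."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rank(value, width=32):
    """Position of the least significant 1-bit, capped at width - 1."""
    if value == 0:
        return width - 1
    return (value & -value).bit_length() - 1


class ProbabilisticCounter:
    def __init__(self, width=32):
        self.width = width
        self.bits = [0] * width  # bits[j] == 1 once rank j was observed

    def add(self, element):
        self.bits[rank(hash32(element), self.width)] = 1

    def count(self):
        # R is the left-most position that still holds a zero.
        r = self.bits.index(0) if 0 in self.bits else self.width
        return round(2 ** r / PHI)


# Reproduce the example Counter from the slide (R = 4).
example = ProbabilisticCounter(width=16)
example.bits = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(example.count())  # 21
```

A single counter like this has high variance, which is one motivation for averaging over many counters.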

Counting: HyperLogLog Algorithm
Proposed in 2007
Uses the idea of probabilistic counting
Based on a single 32-bit hash function
hash(x) = 32-bit hash value = [ p addressing bits | (32 - p) rank computation bits ]
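The split of the hash value can be sketched as follows. The sketch assumes the first (most significant) p bits are used for addressing, as drawn on the slide, and uses the left-most 1-bit rank that matches the `size - value.bit_length() + 1` expression in the Cython code of this talk. The function names are hypothetical.

```python
def split_hash(hash_value, p):
    """Split a 32-bit hash into (register index, remaining bits)."""
    rest_bits = 32 - p
    index = hash_value >> rest_bits               # first p bits
    rest = hash_value & ((1 << rest_bits) - 1)    # last (32 - p) bits
    return index, rest


def hll_rank(rest, rest_bits):
    """1-based position of the left-most 1-bit within rest_bits bits."""
    return rest_bits - rest.bit_length() + 1


index, rest = split_hash(0x3C000000, p=4)
print(index)               # 3
print(hll_rank(rest, 28))  # 1
```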

Counting: HyperLogLog Algorithm
Stores only m = 2^{p} counters (registers), about 4 bytes each
Memory is always fixed, regardless of the number of unique elements
More counters provide less error (memory/accuracy trade-off)
Idea: Instead of storing m binary Counters, store only the maximal observed rank for each of them.
R_i = max(R_i, rank(x)), i = 1…m
n ≈ α ⋅ m ⋅ 2^{AVG(R_i)}
HyperLogLog takes the average as a harmonic mean over the registers: n ≈ α ⋅ m^{2} / Σ_i 2^{-R_i}
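Putting the pieces together, a compact pure-Python sketch of the algorithm might look like this. It is a teaching sketch under stated assumptions, not the `pdsa` implementation: the class name, the SHA-256 based hash, and the choice of the first p bits for addressing are illustrative. The α constant is the standard value for m ≥ 128, so the sketch expects p ≥ 7, and the large-range correction is omitted.

```python
import hashlib
import math


class TinyHyperLogLog:
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                  # number of registers, m = 2^p
        self.registers = [0] * self.m    # maximal observed rank per register
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128

    def _hash32(self, element):
        # Illustrative 32-bit hash; real implementations use e.g. MurmurHash3.
        digest = hashlib.sha256(str(element).encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def add(self, element):
        h = self._hash32(element)
        rest_bits = 32 - self.p
        index = h >> rest_bits                     # p addressing bits
        rest = h & ((1 << rest_bits) - 1)          # rank computation bits
        rank = rest_bits - rest.bit_length() + 1   # left-most 1-bit position
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Harmonic-mean estimate: alpha * m^2 / sum(2^-R_i).
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m ** 2 / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = TinyHyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
print("Estimated distinct elements:", hll.count())
```

With 1024 registers the estimate is expected to land within a few percent of the true count of 10,000, and re-adding the same elements leaves it unchanged.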

Counting: Interactive Presentation of HyperLogLog

Counting: Accuracy vs Memory Tradeoff in HyperLogLog
hash(x) = [ p addressing bits | (32 - p) rank computation bits ]   (p = 4 … 16)
More counters require more memory (4 bytes per counter)
More counters need more bits for addressing them (m = 2^{p})
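The trade-off can be tabulated with a few lines of Python. The sketch relies on the well-known HyperLogLog standard error of about 1.04 / sqrt(m), which is not stated on this slide, and on the 4 bytes per counter that the slide mentions. The function name is hypothetical.

```python
import math


def hll_tradeoff(p, bytes_per_counter=4):
    """Return (counters, memory in bytes, standard error) for precision p."""
    m = 1 << p
    return m, m * bytes_per_counter, 1.04 / math.sqrt(m)


for p in range(4, 17, 2):
    m, memory, error = hll_tradeoff(p)
    print(f"p={p:2d}  m={m:6d}  memory={memory:7d} B  std. error ~ {error:.2%}")
```

For p = 14 the error is about 0.81%, the figure quoted for Redis, which reaches 12 KB by packing its 16384 registers into 6 bits each instead of 4 bytes.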

Counting: Distinct Count in Redis
Redis uses the HyperLogLog data structure to count unique elements in a set
requires a small constant amount of memory of 12KB for every data structure
approximates the exact cardinality with a standard error of 0.81%.
redis> PFADD hll python java ruby
(integer) 1
redis> PFADD hll python python python
(integer) 0
redis> PFADD hll java ruby
(integer) 0
redis> PFCOUNT hll
(integer) 3

Counting: Invoking HyperLogLog from Python
 
import json
from pdsa.cardinality.hyperloglog import HyperLogLog

hll = HyperLogLog(precision=10)  # 2^{10} = 1024 counters
with open('visitors.txt') as f:
    for line in f:
        ip = json.loads(line)['ip']
        hll.add(ip)

num_of_unique_visitors = hll.count()
print('Unique visitors', num_of_unique_visitors)
size_in_bytes = hll.size()
print('Size in bytes', size_in_bytes)

Counting: Implementing HyperLogLog with Cython
from cpython.array cimport array
from libc.stdint cimport uint8_t, uint16_t, uint32_t

cdef class HyperLogLog:
    def __cinit__(self, const uint8_t precision):
        self.precision = precision  # bits used for addressing
        self.num_of_counters = <uint32_t>1 << precision
        self._counter = array('L', [0] * self.num_of_counters)  # 4-byte int, all zeros

    cdef uint8_t rank(self, uint32_t value):
        cdef uint8_t size = 32 - self.precision  # bits used for rank computation
        return size - (<object>value).bit_length() + 1

    cpdef void add(self, object element) except *:
        cdef uint16_t index
        cdef uint32_t value
        # hash(element, seed) stands for a seeded 32-bit hash function
        value, index = divmod(hash(element, seed), self.num_of_counters)
        self._counter[index] = max(self._counter[index], self.rank(value))

Counting: Implementing HyperLogLog with Cython
    cpdef size_t count(self):
        cdef float R = 0
        cdef uint16_t index
        cdef bint all_zero = 1
        cdef size_t n
        for index in range(self.num_of_counters):
            if self._counter[index] > 0:
                all_zero = 0
            # every register contributes 2^{-R_i}, including empty ones
            R += 1.0 / float(<uint32_t>1 << self._counter[index])
        if all_zero:
            return 0  # all counters are zeros
        n = <size_t>round(alpha * self.num_of_counters ** 2 / R)
        if n < 2.5 * self.num_of_counters:
            pass  # *small* range correction
        elif n > 143165576:  # 2^{32} / 30
            pass  # *big* range correction
        return n

Counting: HyperLogLog++ Algorithm
HyperLogLog++ is an improved version of HyperLogLog developed at Google and proposed in 2013
uses a 64-bit hash function, which allows counting more values
better bias correction using pre-trained data
a sparse representation of the counters (registers) to reduce memory requirements

Read More about HyperLogLog and other PDSA
[book] Probabilistic Data Structures and Algorithms for Big Data Applications
https://pdsa.gakhov.com
[repo] Probabilistic Data Structures and Algorithms in Python
https://github.com/gakhov/pdsa
Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
Redis new data structure: the HyperLogLog
http://antirez.com/news/75
Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles
https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html

Thank you
Probabilistic Data Structures and Algorithms for Big Data Applications
pdsa.gakhov.com
Website: www.gakhov.com
Twitter: @gakhov
