Code & Supply

Pittsburgh MEETUP
Too Much Data?
Just Sample, Just Hash, …
Probabilistic Data Structures and Algorithms (PDSA)
Andrii Gakhov, PhD

Ferret Go GmbH, Berlin, Germany

About me
Andrii Gakhov
Senior Software Engineer
PhD in Mathematical Modelling
Author of the book “Probabilistic Data Structures and Algorithms for Big Data Applications”
Website: www.gakhov.com
Twitter: @gakhov
Sampling and Hashing
Lots of data: lots of opportunity, lots of tech problems
Solution: Sampling and Hashing
Sampling
Process a subset of elements
Discard the rest
Interpolate results
Hashing
Process all elements
Compression with information loss
Approximate results
PDSA = Hashing + Pattern observation
# 0000101001010101010
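The sampling recipe above (process a subset, discard the rest, interpolate) can be illustrated with a minimal sketch. The function name `estimate_count_by_sampling`, the 1% sampling rate, and the synthetic data are assumptions made for illustration; they do not come from the talk.

```python
import random

def estimate_count_by_sampling(items, predicate, rate=0.01, seed=42):
    """Estimate how many items satisfy `predicate` by inspecting about `rate` of them."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for item in items:
        if rng.random() < rate:   # process only a subset, discard the rest
            if predicate(item):
                hits += 1
    return hits / rate            # interpolate: scale the sampled count back up

data = range(1_000_000)
estimate = estimate_count_by_sampling(data, lambda x: x % 2 == 0)
exact = 500_000
print(f"estimated {estimate:.0f}, exact {exact}")
```

Only about 10,000 of the million elements are inspected, so the answer is approximate; a larger `rate` trades more work for a tighter estimate.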
PDSA in Apache Spark SQL
Spark SQL is Apache Spark's module for working with structured data.
Approximate number of distinct elements (HyperLogLog++)
SELECT approx_count_distinct(some_column) FROM df
Frequency estimation (Count-Min Sketch)
SELECT count_min_sketch(some_column, 0.1, 0.9, 42) FROM df
# Estimated frequency for python = cms.estimateCount("python")

PDSA in Production
and many others …
PDSA in Big Data Ecosystem
Big Data: Volume, Velocity, Variety
Membership: Bloom Filter, Quotient Filter, Cuckoo Filter
Counting: Linear Counting, FM Sketch, LogLog, HyperLogLog
Frequency: Count-Min Sketch, Count Sketch
Rank: Random Sampling, t-digest, q-digest
Similarity: MinHash, SimHash
Counting
Find the number of distinct elements
Counting: Traditional Approach
Build a list of all unique elements
Sort / search to avoid listing elements twice
Count elements in the list
Requires linear memory
Requires O(n·log n) time (e.g., with merge sort)
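The traditional approach can be sketched in a few lines of Python. Both helper names below are hypothetical; the first keeps every unique element in a set, the second follows the sort-then-scan variant described on the slide.

```python
def count_distinct_exact(items):
    """Exact distinct count; memory grows linearly with the number of unique items."""
    seen = set()
    for item in items:
        seen.add(item)
    return len(seen)

def count_distinct_sorted(items):
    """Exact distinct count via sorting: O(n log n) time, O(n) memory for the sorted copy."""
    ordered = sorted(items)
    distinct = 0
    previous = object()  # sentinel that is not equal to any real item
    for item in ordered:
        if item != previous:   # a new value starts here
            distinct += 1
            previous = item
    return distinct

visits = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
print(count_distinct_exact(visits), count_distinct_sorted(visits))  # 3 3
```

Either way, every unique element has to be kept in memory, which is the cost the probabilistic approach avoids.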
Counting: 99% Memory Reduction
144 million unique IP addresses / month*
2.3 GB to store them in a set!
What if we could count them with only 12 KB?
*SimilarWeb.com data for March 2017
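A quick back-of-the-envelope check of these figures, assuming decimal gigabytes and 1 KB = 1024 bytes (the slide does not say which units it uses):

```python
unique_ips = 144_000_000        # from the slide
set_size_bytes = 2.3 * 10**9    # 2.3 GB from the slide (decimal GB is an assumption)
sketch_size_bytes = 12 * 1024   # 12 KB

bytes_per_ip = set_size_bytes / unique_ips
reduction = 1 - sketch_size_bytes / set_size_bytes
print(f"{bytes_per_ip:.1f} bytes per stored IP")
print(f"memory reduction: {reduction:.4%}")
```

The set costs roughly 16 bytes per address, while the 12 KB sketch stays the same size however many addresses arrive, so the saving is well beyond the 99% in the slide title.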


Counting: Approximate Counting
@katyperry has 107,287,629 followers
Would you really care if she has 107.2, 108.0, or 106.7 million followers?


Counting: Probabilistic counter
Idea: Store all observed ranks of indexed values
Rank = the number of leading zeros in the hash value, with bits written least-significant bit first
hash("hello") = 42 = 01010100 (least-significant bit first)
rank(42) = 1
The more values we have seen, the closer we are to the theoretically expected distribution
Example:
Take a coin and toss it as many times as you can. How many tosses do you need to get an equal number of heads and tails? What happens if you toss it a few more times?
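The rank computation can be checked with a small Python sketch. It assumes the slide writes the bits least-significant first, which is the reading under which 42 appears as 01010100 and has rank 1. The helper names are hypothetical.

```python
def rank(value, width=32):
    """Number of leading zeros when `value` is written least-significant bit first,
    i.e. the index of its lowest set bit. Returns `width` when value == 0."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1  # isolate the lowest set bit

def lsb_first(value, width=8):
    """Binary string with the least-significant bit on the left, as on the slide."""
    return format(value, f"0{width}b")[::-1]

print(lsb_first(42), rank(42))  # 01010100 1
```

For a uniformly distributed hash, about half of all values have rank 0, a quarter have rank 1, and so on, which is the "theoretically expected distribution" the slide refers to.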


Counting: Probabilistic counter
Let’s remember all ranks we’ve seen so far in a binary Counter
rank:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Counter: 1  1  1  1  0  0  0  0  0  1  0  0  1  0  0  0
R = 4 (the left-most position of zero)
Idea: the left-most position of zero (R) in the Counter after inserting n elements from the dataset can be used as an indicator of log_2 n.
The expected number of elements with rank j after indexing n elements is n / 2^{j+1}, so:
j << log_2 n => almost certainly we have 1 (low ranks appear often)
j >> log_2 n => almost certainly we have 0 (high ranks appear rarely)
j ≈ log_2 n => roughly equal probability to have 1 or 0 in the Counter
n ≈ 2^R / 0.77351 ≈ 16 / 0.77351 ≈ 21
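The counter and its estimate can be sketched in Python as follows. The SHA-1 based `hash32` helper and the `user-…` test data are assumptions for illustration; the talk does not name a hash function. The first part reproduces the Counter shown on the slide, the second builds a Counter from data.

```python
import hashlib

PHI = 0.77351  # correction factor from the slide

def hash32(element):
    """32-bit hash taken from SHA-1 (an assumption; any well-mixing hash would do)."""
    digest = hashlib.sha1(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def rank(value, width=32):
    """Index of the lowest set bit (leading zeros in LSB-first notation)."""
    return width if value == 0 else (value & -value).bit_length() - 1

def leftmost_zero(counter, width=32):
    """Lowest position j whose bit in the Counter is still 0 (R on the slide)."""
    for j in range(width):
        if not (counter >> j) & 1:
            return j
    return width

def estimate(counter):
    return 2 ** leftmost_zero(counter) / PHI

# The Counter from the slide: ranks 0-3, 9 and 12 have been observed.
slide_counter = sum(1 << j for j in (0, 1, 2, 3, 9, 12))
print(leftmost_zero(slide_counter), round(estimate(slide_counter)))  # 4 21

# Building a Counter from data: duplicates set the same bit, so they do not matter.
counter = 0
for i in range(1000):
    counter |= 1 << min(rank(hash32(f"user-{i % 100}")), 31)
print(round(estimate(counter)))
```

A single Counter only ever answers with a power of two divided by 0.77351, so its estimate is coarse; averaging over many counters is what reduces the error.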
Counting: HyperLogLog Algorithm
Proposed in 2007
Uses the idea of probabilistic counting
Based on a single 32-bit hash function
hash(x) = 32-bit hash value, split into p addressing bits and (32 - p) rank computation bits
Counting: HyperLogLog Algorithm
Idea: Instead of storing m binary Counters, store only the maximal observed rank for each of them.
R_i = max(R_i, rank(x)), i = 1…m
n ≈ α ⋅ m ⋅ 2^{AVG(R_i)}
Stores only m = 2^p counters (registers), about 4 bytes each
The memory is always fixed, regardless of the number of unique elements
More counters provide less error (memory/accuracy trade-off)
Counting: Interactive Presentation of HyperLogLog
Counting: Accuracy vs Memory Tradeoff in HyperLogLog
More counters require more memory (4 bytes per counter)
More counters need more bits for addressing them (m = 2^p)
hash(x) = p addressing bits (p = 4 … 16) + (32 - p) rank computation bits
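The trade-off can be tabulated with a short Python sketch. It uses the standard-error approximation 1.04 / sqrt(m) from the original HyperLogLog paper, which is not stated on the slide, together with the slide's figure of 4 bytes per counter. The function name `hll_profile` is hypothetical.

```python
import math

def hll_profile(p, bytes_per_counter=4):
    """Return (number of counters, memory in bytes, relative standard error) for precision p."""
    m = 1 << p
    return m, m * bytes_per_counter, 1.04 / math.sqrt(m)

for p in (4, 10, 14, 16):
    m, memory, error = hll_profile(p)
    print(f"p={p:2d}  m={m:6d}  memory={memory:7d} B  std. error={error:.2%}")
```

Each extra addressing bit doubles the memory but shrinks the error only by a factor of about 1.41, so the choice of p depends on how much accuracy the application actually needs.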
Counting: Distinct Count in Redis
Redis uses the HyperLogLog data structure to count unique elements in a set
It requires a small, constant amount of memory (12 KB) for every data structure
It approximates the exact cardinality with a standard error of 0.81%
redis> PFADD hll python java ruby
(integer) 1
redis> PFADD hll python python python
(integer) 0
redis> PFADD hll java ruby
(integer) 0
redis> PFCOUNT hll
(integer) 3
Counting: Invoking HyperLogLog from Python


import json
from pdsa.cardinality.hyperloglog import HyperLogLog

hll = HyperLogLog(precision=10)  # 2^{10} = 1024 counters
with open('visitors.txt') as f:
    for line in f:
        ip = json.loads(line)['ip']
        hll.add(ip)

num_of_unique_visitors = hll.count()
print('Unique visitors', num_of_unique_visitors)

size_in_bytes = hll.size()
print('Size in bytes', size_in_bytes)
Counting: Implementing HyperLogLog with Cython
cdef class HyperLogLog:
    def __cinit__(self, const uint8_t precision):
        self.precision = precision  # bits used for addressing
        self.num_of_counters = <uint32_t>1 << precision
        # one counter per register, all starting at zero
        self.counter = array('L', [0] * self.num_of_counters)

    cdef uint8_t rank(self, uint32_t value):
        # position of the left-most 1-bit within the (32 - precision) rank bits
        cdef uint8_t size = 32 - self.precision
        return size - value.bit_length() + 1

    cpdef void add(self, object element) except *:
        cdef uint16_t index
        cdef uint32_t value
        # hash(element, seed) stands for a seeded 32-bit hash function
        value, index = divmod(hash(element, seed), self.num_of_counters)
        self.counter[index] = max(self.counter[index], self.rank(value))
Counting: Implementing HyperLogLog with Cython
    cpdef size_t count(self):
        cdef float R = 0
        cdef uint16_t index
        cdef bint all_zero = 1
        for index in xrange(self.num_of_counters):
            if self.counter[index] > 0:
                all_zero = 0
            # accumulate 2^{-counter} for the harmonic mean
            R += 1.0 / float(<uint32_t>1 << self.counter[index])
        if all_zero:
            return 0  # all counters are zeros
        cdef size_t n = <size_t>round(alpha * self.num_of_counters ** 2 / R)
        if n < 2.5 * self.num_of_counters:
            pass  # *small* range correction
        elif n > 143165576:  # 2^{32} / 30
            pass  # *big* range correction
        return n
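For readers who want to experiment without compiling Cython, here is a minimal pure-Python sketch of the same algorithm. It is not the pdsa implementation: the class name `SimpleHyperLogLog`, the SHA-1 based 32-bit hash, and the `visitor-…` test data are assumptions. The alpha constant and the two range corrections follow the original HyperLogLog paper and are written out here where the slide code leaves them as comments.

```python
import hashlib
import math

class SimpleHyperLogLog:
    def __init__(self, precision=12):
        self.precision = precision            # bits used for addressing
        self.m = 1 << precision               # number of registers
        self.registers = [0] * self.m
        # alpha for m >= 128, from the original HyperLogLog paper
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    @staticmethod
    def _hash32(element):
        # 32-bit hash taken from SHA-1 (an assumption; any well-mixing hash works)
        digest = hashlib.sha1(str(element).encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def _rank(self, value):
        # position of the left-most 1-bit within the (32 - precision) rank bits
        return (32 - self.precision) - value.bit_length() + 1

    def add(self, element):
        value, index = divmod(self._hash32(element), self.m)
        self.registers[index] = max(self.registers[index], self._rank(value))

    def count(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m ** 2 / harmonic
        if estimate <= 2.5 * self.m:
            # small range correction: fall back to linear counting
            zeros = self.registers.count(0)
            if zeros:
                estimate = self.m * math.log(self.m / zeros)
        elif estimate > 2 ** 32 / 30:
            # big range correction for 32-bit hash collisions
            estimate = -(2 ** 32) * math.log(1 - estimate / 2 ** 32)
        return round(estimate)

hll = SimpleHyperLogLog(precision=12)
for i in range(200_000):
    hll.add(f"visitor-{i % 50_000}")  # 50,000 distinct values, each seen 4 times
print(hll.count())
```

With 4096 registers the expected relative error is around 1.6%, so the printed value should land close to 50,000 even though no visitor identifier is ever stored.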
Counting: HyperLogLog++ Algorithm
HyperLogLog++ is an improved version of HyperLogLog, developed at Google and proposed in 2013
It uses a 64-bit hash function, which allows counting more values
Better bias correction using pre-trained data
A sparse representation of the counters (registers) to reduce memory requirements
Read More about HyperLogLog and other PDSA
[book] Probabilistic Data Structures and Algorithms for Big Data Applications
https://pdsa.gakhov.com
[repo] Probabilistic Data Structures and Algorithms in Python
https://github.com/gakhov/pdsa
Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
Redis new data structure: the HyperLogLog
http://antirez.com/news/75
Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles
https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
Thank you
Website: www.gakhov.com
Twitter: @gakhov
Probabilistic Data Structures and Algorithms for Big Data Applications
pdsa.gakhov.com

Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
 

Too Much Data? - Just Sample, Just Hash, ...

  • 1. Code & Supply Pittsburgh Meetup. Too Much Data? Just Sample, Just Hash, … Probabilistic Data Structures and Algorithms (PDSA). Andrii Gakhov, PhD, Ferret Go GmbH, Berlin, Germany
  • 2. About me: Andrii Gakhov. Senior Software Engineer. PhD in Mathematical Modelling. Author of the book “Probabilistic Data Structures and Algorithms for Big Data Applications”. Website: www.gakhov.com. Twitter: @gakhov
  • 3. Sampling and Hashing. Lots of data: lots of opportunity, lots of tech problems. Solution: Sampling and Hashing.
 Sampling: process a subset of elements, discard the rest, interpolate results.
 Hashing: process all elements, compression with information loss, approximate results.
 PDSA = Hashing + Pattern observation (# 0000101001010101010)
  • 4. PDSA in Apache Spark SQL. Spark SQL is Apache Spark's module for working with structured data.
 Approximate number of distinct elements (HyperLogLog++): SELECT approx_count_distinct(some_column) FROM df
 Frequency estimation (Count-Min Sketch): SELECT count_min_sketch(some_column, 0.1, 0.9, 42) FROM df
 # Estimated frequency for python = cms.estimateCount("python")
  • 5. PDSA in Production: and many others …
  • 6. PDSA in Big Data Ecosystem. Big Data: Volume, Velocity, Variety.
 Membership: Bloom Filter, Quotient Filter, Cuckoo Filter.
 Counting: Linear Counting, FM Sketch, LogLog, HyperLogLog.
 Frequency: Count-Min Sketch, Count Sketch.
 Rank: Random Sampling, t-digest, q-digest.
 Similarity: MinHash, SimHash.
  • 7. Counting: Find the number of distinct elements
  • 8. Counting: Traditional Approach. Build a list of all unique elements. Sort / search to avoid listing elements twice. Count the elements in the list. Requires linear memory. Requires O(n·log n) time (e.g., mergesort).
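The traditional approach can be sketched in a few lines of Python. This is only an illustration: the function names and the sample addresses are made up for this example and do not come from the slides. Both variants keep every distinct element in memory, which is what the approximate methods on the following slides avoid.

```python
def count_distinct_sorted(items):
    """Sort, then count positions where the value changes: O(n log n) time."""
    ordered = sorted(items)  # the sorted copy needs linear extra memory
    distinct = 0
    for i, item in enumerate(ordered):
        # After sorting, duplicates are adjacent, so count only the first of each run.
        if i == 0 or item != ordered[i - 1]:
            distinct += 1
    return distinct


def count_distinct_set(items):
    """Hash-set variant: also exact, and also linear in memory."""
    return len(set(items))


# Hypothetical visitor log with one repeated address.
visitors = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]
print(count_distinct_sorted(visitors), count_distinct_set(visitors))
```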
  • 9. Counting: 99% Memory Reduction. 144 million unique IP addresses / month*: 2.3 GB to store them in a set! What if we can count them with 12 KB only? (*SimilarWeb.com data for March, 2017)
  • 10. Counting: Approximate Counting. @katyperry has 107,287,629 followers. Would you really care if she has 107.2, 108.0, or 106.7 million followers?
  • 11. Counting: Probabilistic counter. Idea: store all observed ranks of indexed values. Rank = the number of leading zeros. hash("hello") = 42 = 01010100, so rank(42) = 1. The more values we have seen, the closer we are to the theoretically expected distribution.
 Example: Take a coin and toss it as many times as you can. How many times do you need to toss it to get an equal number of tails and heads? What happens if you toss it a few more times?
  • 12. Counting: Probabilistic counter. Let’s remember all ranks we’ve seen so far in a binary Counter (ranks 0 … 15): 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0. Here the left-most zero is at position R = 4.
 Idea: the left-most position of zero (R) in the Counter after inserting $n$ elements from the dataset can be used as an indicator of $\log_2 n$.
 The probability to observe 1 at position $j$ after indexing $n$ elements depends on $n / 2^{j+1}$:
 $j \ll \log_2 n$ => almost certainly we have 1 (low ranks appear often)
 $j \gg \log_2 n$ => almost certainly we have 0 (high ranks appear rarely)
 $j \approx \log_2 n$ => equal probability to have 1 or 0 in the Counter
 Estimate: $n \approx 2^R / 0.77351$. For the Counter above: $n \approx 16 / 0.77351 \approx 21$.
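The counter described on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's implementation: `hash32`, `rank`, and `ProbabilisticCounter` are hypothetical names, SHA-256 truncated to 32 bits stands in for whatever hash function is actually used, and the rank is computed as the number of trailing zero bits of the integer, which matches the slide's `rank(42) = 1` because the slide writes the bits least-significant first.

```python
import hashlib

PHI = 0.77351   # correction constant from the slide
NUM_BITS = 32   # length of the binary Counter (assumption: 32-bit hash values)


def hash32(element: str) -> int:
    """Stand-in 32-bit hash: the first 4 bytes of SHA-256 (an assumption)."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rank(value: int) -> int:
    """Number of trailing zero bits, capped at the last Counter position."""
    if value == 0:
        return NUM_BITS - 1
    return (value & -value).bit_length() - 1


class ProbabilisticCounter:
    def __init__(self):
        self.bits = [0] * NUM_BITS

    def add(self, element: str) -> None:
        # Remember that this rank has been observed at least once.
        self.bits[rank(hash32(element))] = 1

    def count(self) -> float:
        # R is the left-most zero position; n is estimated as 2^R / 0.77351.
        r = self.bits.index(0) if 0 in self.bits else NUM_BITS
        return (2 ** r) / PHI


# Reproduce the slide's example: ranks 0..3 observed, left-most zero at R = 4.
example = ProbabilisticCounter()
example.bits[:4] = [1, 1, 1, 1]
print(example.count())  # 16 / 0.77351
```

A single Counter gives a very coarse estimate (always a power of two divided by the constant), which is the motivation for averaging over many counters in HyperLogLog.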
  • 13. Counting: HyperLogLog Algorithm. Proposed in 2007. Uses the idea of probabilistic counting. Based on a single 32-bit hash function: hash(x) = 32-bit hash value, split into $p$ addressing bits and $(32 - p)$ rank computation bits.
  • 14. Counting: HyperLogLog Algorithm. Stores only $m = 2^p$ counters (registers), about 4 bytes each. The memory is always fixed, regardless of the number of unique elements. More counters provide less error (memory/accuracy trade-off). Idea: instead of storing $m$ binary Counters, store only the maximal observed rank for each of them: $R_i = \max(R_i, \mathrm{rank}(x))$, $i = 1 \ldots m$. Estimate: $n \approx \alpha \cdot m \cdot 2^{\mathrm{AVG}(R_i)}$.
  • 16. Counting: Accuracy vs Memory Tradeoff in HyperLogLog. More counters require more memory (4 bytes per counter). More counters need more bits for addressing them ($m = 2^p$). hash(x) is split into $p$ addressing bits and $(32 - p)$ rank computation bits ($p = 4 \ldots 16$).
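The trade-off can be made concrete with a short calculation. The sketch below assumes the standard error of about 1.04 / √m from the original HyperLogLog paper, which is not stated on the slide, together with the slide's 4 bytes per counter; the function name `hll_tradeoff` is made up for this example.

```python
import math


def hll_tradeoff(p: int, bytes_per_counter: int = 4):
    """Return (number of counters, memory in bytes, relative standard error)."""
    m = 1 << p                       # m = 2^p counters
    memory = m * bytes_per_counter   # 4 bytes per counter, as on the slide
    error = 1.04 / math.sqrt(m)      # assumption: error bound from the HLL paper
    return m, memory, error


for p in (4, 10, 14, 16):
    m, memory, error = hll_tradeoff(p)
    print(f"p={p:2d}  counters={m:6d}  memory={memory:7d} bytes  std. error={error:.2%}")
```

Each extra addressing bit doubles the memory but reduces the error only by a factor of √2, so precision is usually chosen as the smallest value that meets the accuracy target.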
  • 17. Counting: Distinct Count in Redis. Redis uses the HyperLogLog data structure to count unique elements in a set. It requires a small constant amount of memory, 12 KB for every data structure, and approximates the exact cardinality with a standard error of 0.81%.
      redis> PFADD hll python java ruby
      (integer) 1
      redis> PFADD hll python python python
      (integer) 0
      redis> PFADD hll java ruby
      (integer) 0
      redis> PFCOUNT hll
      (integer) 3
  • 18. Counting: Invoking HyperLogLog from Python
      import json
      from pdsa.cardinality.hyperloglog import HyperLogLog
      hll = HyperLogLog(precision=10)  # 2^{10} = 1024 counters
      with open('visitors.txt') as f:
          for line in f:
              ip = json.loads(line)['ip']
              hll.add(ip)
      num_of_unique_visitors = hll.count()
      print('Unique visitors', num_of_unique_visitors)
      size_in_bytes = hll.size()
      print('Size in bytes', size_in_bytes)
  • 19. Counting: Implementing HyperLogLog with Cython
      cdef class HyperLogLog:
          def __cinit__(self, const uint8_t precision):
              self.precision = precision  # bits used for addressing
              self.num_of_counters = <uint32_t>1 << precision
              # one counter per register, all starting at zero ('L' = unsigned long, at least 4 bytes)
              self._counter = array('L', [0] * self.num_of_counters)
          cdef uint8_t rank(self, uint32_t value):
              cdef uint8_t size = 32 - self.precision  # bits used for rank computation
              return size - value.bit_length() + 1
          cpdef void add(self, object element) except *:
              cdef uint16_t index
              cdef uint32_t value
              # hash() stands for a seeded 32-bit hash function here, not Python's built-in hash
              value, index = divmod(hash(element, seed), self.num_of_counters)
              self._counter[index] = max(self._counter[index], self.rank(value))
  • 20. Counting: Implementing HyperLogLog with Cython
      cpdef size_t count(self):
          cdef float R = 0
          cdef uint16_t index
          cdef bint all_zero = 1
          for index in xrange(self.num_of_counters):
              if self._counter[index] > 0:
                  all_zero = 0
              # sum of 2^(-counter) over all registers
              R += 1.0 / float(<uint32_t>1 << self._counter[index])
          if all_zero:
              return 0  # all counters are zeros
          cdef size_t n = <size_t>round(alpha * self.num_of_counters ** 2 / R)
          if n < 2.5 * self.num_of_counters:
              pass  # *small* range correction
          elif n > 143165576:  # 2^{32} / 30
              pass  # *big* range correction
          return n
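For readers without a Cython toolchain, the same steps can be sketched in pure Python. This is an illustration under stated assumptions rather than the code from the `pdsa` package: the class name `HyperLogLogSketch` is hypothetical, SHA-256 truncated to 32 bits stands in for the seeded hash function, and the `alpha` constants and the two range corrections, which the slide leaves out, follow the original HyperLogLog paper.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Pure-Python HyperLogLog sketch with a 32-bit hash (illustrative only)."""

    def __init__(self, precision: int = 10):
        if not 4 <= precision <= 16:
            raise ValueError("precision must be between 4 and 16")
        self.precision = precision                 # bits used for addressing
        self.num_of_counters = 1 << precision      # m = 2^p
        self.counters = [0] * self.num_of_counters
        m = self.num_of_counters
        # Bias-correction constant alpha (values from the HyperLogLog paper).
        if m == 16:
            self.alpha = 0.673
        elif m == 32:
            self.alpha = 0.697
        elif m == 64:
            self.alpha = 0.709
        else:
            self.alpha = 0.7213 / (1 + 1.079 / m)

    @staticmethod
    def _hash32(element: str) -> int:
        # Assumption: first 4 bytes of SHA-256 as a stand-in 32-bit hash.
        digest = hashlib.sha256(element.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def _rank(self, value: int) -> int:
        size = 32 - self.precision                 # bits used for rank computation
        return size - value.bit_length() + 1       # position of the left-most 1 bit

    def add(self, element: str) -> None:
        value, index = divmod(self._hash32(element), self.num_of_counters)
        self.counters[index] = max(self.counters[index], self._rank(value))

    def count(self) -> int:
        m = self.num_of_counters
        harmonic_sum = sum(2.0 ** -c for c in self.counters)
        n = self.alpha * m * m / harmonic_sum
        if n <= 2.5 * m:
            # Small range correction: fall back to linear counting on empty registers.
            zeros = self.counters.count(0)
            if zeros:
                n = m * math.log(m / zeros)
        elif n > 2 ** 32 / 30:
            # Big range correction for 32-bit hash collisions.
            n = -(2 ** 32) * math.log(1 - n / 2 ** 32)
        return round(n)


hll = HyperLogLogSketch(precision=10)
for visitor in range(10_000):
    hll.add(f"user-{visitor}")     # hypothetical visitor identifiers
print(hll.count())
```

With 1024 counters the estimate is expected to land within a few percent of the true 10,000, while the sketch itself never grows beyond its fixed list of counters.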
  • 21. Counting: HyperLogLog++ Algorithm. HyperLogLog++ is an improved version of HyperLogLog, developed at Google and proposed in 2013. It uses a 64-bit hash function, which allows counting more values; applies better bias correction using pre-trained data; and proposes a sparse representation of the counters (registers) to reduce memory requirements.
  • 22. Read More about HyperLogLog and other PDSA
 [book] Probabilistic Data Structures and Algorithms for Big Data Applications: https://pdsa.gakhov.com
 [repo] Probabilistic Data Structures and Algorithms in Python: https://github.com/gakhov/pdsa
 Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure: https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
 Redis new data structure: the HyperLogLog: http://antirez.com/news/75
 Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles: https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
  • 23. Thank you. Website: www.gakhov.com. Twitter: @gakhov. Probabilistic Data Structures and Algorithms for Big Data Applications: pdsa.gakhov.com