SlideShare a Scribd company logo
1 of 23
Download to read offline
Count me once, count me fast!
Probabilistic methods in real-time streaming
(Hyperloglog, Bloom filters)
Kendrick Lo
Insight Data Engineering, NYC
Summer 2016
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Unique
User ID
Unique
User ID
Unique
User ID
Unique
User ID
...
...
?
real-time viewing data
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Ad ID
Unique
User ID
Time
stamp
Unique
User ID
Unique
User ID
Unique
User ID
Unique
User ID
...
...
?
13 MB
100 million
uniques
bitmap
(for exact counting)
4 KB
billions of uniques
hyperloglog
real-time viewing data
Hyperloglog
Count-distinct problem
(a.k.a. cardinality estimation problem)
● counting unique elements in a data
stream with repeated elements
● calculates an approximate number
○ typical error purported to be
less than < 2%
What it can’t do:
● give an exact count
● track frequency of
occurrence
● confirm whether a certain
element was seen
Hyperloglog - a probabilistic method
General Idea: Count leading zeros in a randomly generated binary number
Given a random number,
what is the probability of seeing…?
1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
Hyperloglog - a probabilistic method
1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
Question:
I have a list of N unique numbers.
The one with the longest string
of leading zeros is
0 0 0 0 0 0 1 x x…
What is N?
General Idea: Count leading zeros in a randomly generated binary number
Given a random number,
what is the probability of seeing…?
Hyperloglog
ID
ID
ID
ID
ID
6
=> 128 unique viewers
5 6 7 4 6 8... ...
(harmonic) MEAN: 6
IDID
ID
Pipeline
Ad ID
Unique
User ID
Gender
Age
segments
Time
stamp
Algebird
4 x m4.large
1 sec mini-batches
Pushed 1 billion records
with unique user IDs
● Throughput can reach an
average of 5M records/min
● Streams of <1M records
processed within a minute
● After >1M uniques, delays
accumulate causing system
instability when using sets
Extension: counting unique viewers in a subgroup
● Associating segments with user IDs
○ Challenge: Can we avoid database accesses when
processing data in real-time?
○ Bloom filter: another fixed-size probabilistic data
structure that trades off (tunable) accuracy for size
e.g. Bloom filter + Hyperloglog count males error: 1.2%
○ needed to overcome challenges in combining
aspects of Spark (batch) and Spark Streaming
Ad ID
Unique
User ID
Gender
Age segment
(e.g. 18-34)
Time
stamp
Sample record
About me
Master of Science, Harvard University
Computational Science and Engineering
(graduated May 2016)
J.D. / MBA, University of Toronto
Bachelor of Applied Science, University of Toronto
Engineering Science (Computer)
About me
Master of Science, Harvard University
Computational Science and Engineering
(graduated May 2016)
J.D. / MBA, University of Toronto
Bachelor of Applied Science, University of Toronto
Engineering Science (Computer)
Thank you for listening!
appendix
[Set structures]
[HLL structures]
Results: error rate in counts
● Error < 2% for subgroups;
slightly higher for main group
● Error for intersection
calculation (purple) tends to
be higher on average
Use cases
● Advertising
○ ad viewership, website views, television viewership, app engagement, etc.
● Any application where you would want to count a large number of unique
things fast
○ stock trades, network traffic, twitter responses, election data, real-time voting data, etc.
● Well suited to real-time analytics
○ intermediate state of HLL structure provides for a running count
○ trivially parallelizable
Ad ID
Unique
User ID
Gender
Age segment
(e.g. 18-34)
Time
stamp
Sample record
Future exploration
● Associating segments with user IDs
○ quantifying incremental error associated with introduction of
Bloom filters
● Apache Storm versus Spark
○ Does Storm (a “pure” streaming technology) perform much
better?
● Spark DataFrames API
○ seemed to introduce significant delay: would like to quantify this
Bloom Filters
● Experiment with 1 million records
○ Employed 2 bloom filters (1 MB each), one for each segment (male, 18-34) to store segment
data to be matched with incoming user IDs, continued processing with Hyperloglog
○ estimated error for hyperloglog: 2%; estimated error for bloom filter: 3%
● Actual error:
○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9%
○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6%
● Time to process:
○ Bloom filter + Hyperloglog: 17s (+55%)
○ Hyperloglog only: 11s
Bloom Filters
Source: Wikipedia
Tuning Probabilistic Structures
Hyperloglog
(source: Twitter Algebird source code: HyperLogLog.scala)
Bloom Filters
(source: https://highlyscalable.wordpress.
com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
e.g. n = 1 M (capacity)
p = 0.03 (error)
=> k = 5 (# of hash functions)
=> m = 891 kB

More Related Content

What's hot

Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouseAltinity Ltd
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd
 
A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges...
A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges...A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges...
A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges...Altinity Ltd
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to KibanaVineet .
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...Altinity Ltd
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기NAVER D2
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOAltinity Ltd
 
10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouserpolat
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introductionleanderlee2
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howAltinity Ltd
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Altinity Ltd
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino ProjectMartin Traverso
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineLearning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineScyllaDB
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 

What's hot (20)

Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 
A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges...
A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges...A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges...
A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges...
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to Kibana
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
 
10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and how
 
Presto
PrestoPresto
Presto
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineLearning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 

Similar to Hyperloglog Project

Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4heyramzz
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEuropean Data Forum
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Marko Grobelnik
 
The Future Of Threat Intelligence Platforms
The Future Of Threat Intelligence PlatformsThe Future Of Threat Intelligence Platforms
The Future Of Threat Intelligence PlatformsDr. Paolo Di Prodi
 
Neo4j GraphDay Seattle- Sept19- graphs are ai
Neo4j GraphDay Seattle- Sept19-  graphs are aiNeo4j GraphDay Seattle- Sept19-  graphs are ai
Neo4j GraphDay Seattle- Sept19- graphs are aiNeo4j
 
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...Tuan Hoang
 
#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitterTwitter Developers
 
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)Laura Chiticariu
 
Big data paris 2011 is cool florian douetteau
Big data paris 2011 is cool florian douetteauBig data paris 2011 is cool florian douetteau
Big data paris 2011 is cool florian douetteauIsCoolEnt
 
Tokens, Complex Systems, and Nature
Tokens, Complex Systems, and NatureTokens, Complex Systems, and Nature
Tokens, Complex Systems, and NatureTrent McConaghy
 
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...Steve Omohundro
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorialeswcsummerschool
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studydeep.bi
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalJoachim Draeger
 
Towards a Practice of Token Engineering
Towards a Practice of Token EngineeringTowards a Practice of Token Engineering
Towards a Practice of Token EngineeringTrent McConaghy
 
Machine Learning and Blockchain by Director of Product at Target
Machine Learning and Blockchain by Director of Product at TargetMachine Learning and Blockchain by Director of Product at Target
Machine Learning and Blockchain by Director of Product at TargetProduct School
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Ververica
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingKostas Tzoumas
 
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesCodePolitan
 

Similar to Hyperloglog Project (20)

Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012
 
The Future Of Threat Intelligence Platforms
The Future Of Threat Intelligence PlatformsThe Future Of Threat Intelligence Platforms
The Future Of Threat Intelligence Platforms
 
Neo4j GraphDay Seattle- Sept19- graphs are ai
Neo4j GraphDay Seattle- Sept19-  graphs are aiNeo4j GraphDay Seattle- Sept19-  graphs are ai
Neo4j GraphDay Seattle- Sept19- graphs are ai
 
A Empresa na Era da Informação Extrema
A Empresa na Era da Informação ExtremaA Empresa na Era da Informação Extrema
A Empresa na Era da Informação Extrema
 
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
 
#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter
 
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
 
Big data paris 2011 is cool florian douetteau
Big data paris 2011 is cool florian douetteauBig data paris 2011 is cool florian douetteau
Big data paris 2011 is cool florian douetteau
 
Tokens, Complex Systems, and Nature
Tokens, Complex Systems, and NatureTokens, Complex Systems, and Nature
Tokens, Complex Systems, and Nature
 
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
Towards a Practice of Token Engineering
Towards a Practice of Token EngineeringTowards a Practice of Token Engineering
Towards a Practice of Token Engineering
 
Machine Learning and Blockchain by Director of Product at Target
Machine Learning and Blockchain by Director of Product at TargetMachine Learning and Blockchain by Director of Product at Target
Machine Learning and Blockchain by Director of Product at Target
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & Opportunities
 

Recently uploaded

Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming languageSmritiSharma901052
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate productionChinnuNinan
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfManish Kumar
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadaditya806802
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfisabel213075
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 

Recently uploaded (20)

Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming language
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate production
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasad
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdf
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 

Hyperloglog Project

  • 1. Count me once, count me fast! Probabilistic methods in real-time streaming (Hyperloglog, Bloom filters) Kendrick Lo Insight Data Engineering, NYC Summer 2016
  • 2. Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Unique User ID Unique User ID Unique User ID Unique User ID ... ... ? real-time viewing data
  • 3. Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Unique User ID Unique User ID Unique User ID Unique User ID ... ... ? 13 MB 100 million uniques bitmap (for exact counting) 4 KB billions of uniques hyperloglog real-time viewing data
  • 4. Hyperloglog Count-distinct problem (a.k.a. cardinality estimation problem) ● counting unique elements in a data stream with repeated elements ● calculates an approximate number ○ typical error purported to be less than < 2% What it can’t do: ● give an exact count ● track frequency of occurrence ● confirm whether a certain element was seen
  • 5. Hyperloglog - a probabilistic method General Idea: Count leading zeros in a randomly generated binary number Given a random number, what is the probability of seeing…? 1 x x x x x x x x… → 0.5 (1 out of every 2) 0 1 x x x x x x x… → 0.25 (1 out of every 4) 0 0 1 x x x x x x… → 0.125 (1 out of every 8) … 0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128) ...
  • 6. Hyperloglog - a probabilistic method 1 x x x x x x x x… → 0.5 (1 out of every 2) 0 1 x x x x x x x… → 0.25 (1 out of every 4) 0 0 1 x x x x x x… → 0.125 (1 out of every 8) … 0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128) ... Question: I have a list of N unique numbers. The one with the longest string of leading zeros is 0 0 0 0 0 0 1 x x… What is N? General Idea: Count leading zeros in a randomly generated binary number Given a random number, what is the probability of seeing…?
  • 7. Hyperloglog ID ID ID ID ID 6 => 128 unique viewers 5 6 7 4 6 8... ... (harmonic) MEAN: 6 IDID ID
  • 8. Pipeline Ad ID Unique User ID Gender Age segments Time stamp Algebird 4 x m4.large 1 sec mini-batches Pushed 1 billion records with unique user IDs
  • 9. ● Throughput can reach an average of 5M records/min ● Streams of <1M records processed within a minute
  • 10.
  • 11. ● After >1M uniques, delays accumulate causing system instability when using sets
  • 12. Extension: counting unique viewers in a subgroup ● Associating segments with user IDs ○ Challenge: Can we avoid database accesses when processing data in real-time? ○ Bloom filter: another fixed-size probabilistic data structure that trades off (tunable) accuracy for size e.g. Bloom filter + Hyperloglog count males error: 1.2% ○ needed to overcome challenges in combining aspects of Spark (batch) and Spark Streaming Ad ID Unique User ID Gender Age segment (e.g. 18-34) Time stamp Sample record
  • 13. About me Master of Science, Harvard University Computational Science and Engineering (graduated May 2016) J.D. / MBA, University of Toronto Bachelor of Applied Science, University of Toronto Engineering Science (Computer)
  • 14. About me Master of Science, Harvard University Computational Science and Engineering (graduated May 2016) J.D. / MBA, University of Toronto Bachelor of Applied Science, University of Toronto Engineering Science (Computer) Thank you for listening!
  • 18. Results: error rate in counts ● Error < 2% for subgroups; slightly higher for main group ● Error for intersection calculation (purple) tends to be higher on average
  • 19. Use cases ● Advertising ○ ad viewership, website views, television viewership, app engagement, etc. ● Any application where you would want to count a large number of unique things fast ○ stock trades, network traffic, twitter responses, election data, real-time voting data, etc. ● Well suited to real-time analytics ○ intermediate state of HLL structure provides for a running count ○ trivially parallelizable Ad ID Unique User ID Gender Age segment (e.g. 18-34) Time stamp Sample record
  • 20. Future exploration ● Associating segments with user IDs ○ quantifying incremental error associated with introduction of Bloom filters ● Apache Storm versus Spark ○ Does Storm (a “pure” streaming technology) perform much better? ● Spark DataFrames API ○ seemed to introduce significant delay: would like to quantify this
  • 21. Bloom Filters ● Experiment with 1 million records ○ Employed 2 bloom filters (1 MB each), one for each segment (male, 18-34) to store segment data to be matched with incoming user IDs, continued processing with Hyperloglog ○ estimated error for hyperloglog: 2%; estimated error for bloom filter: 3% ● Actual error: ○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9% ○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6% ● Time to process: ○ Bloom filter + Hyperloglog: 17s (+55%) ○ Hyperloglog only: 11s
  • 23. Tuning Probabilistic Structures Hyperloglog (source: Twitter Algebird source code: HyperLogLog.scala) Bloom Filters (source: https://highlyscalable.wordpress. com/2012/05/01/probabilistic-structures-web-analytics-data-mining/) e.g. n = 1 M (capacity) p = 0.03 (error) => k = 5 (# of hash functions) => m = 891 kB