Count me once, count me fast!
Probabilistic methods in real-time streaming
(Hyperloglog, Bloom filters)
Kendrick Lo
Insight Data Engineering, NYC
Summer 2016
[Figure: stream of records (Ad ID, Unique User ID, Timestamp) reduced to a list of unique user IDs ... ?]
real-time viewing data
bitmap (for exact counting): 13 MB for 100 million uniques
hyperloglog: 4 KB for billions of uniques
Hyperloglog
Count-distinct problem (a.k.a. cardinality estimation problem)
● counting unique elements in a data stream with repeated elements
● calculates an approximate number
○ typical error purported to be less than 2%
What it can’t do:
● give an exact count
● track frequency of occurrence
● confirm whether a certain element was seen
Hyperloglog - a probabilistic method
General Idea: Count leading zeros in a randomly generated binary number
Given a random number,
what is the probability of seeing…?
1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
Question:
I have a list of N unique numbers. The one with the longest string of leading zeros is
0 0 0 0 0 0 1 x x…
What is N?
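The question above can be worked through in a few lines of Python. This is only a sketch of the general idea, not the deck's implementation: `hash32` is a hypothetical helper that uses SHA-1 to turn an ID into a "random" 32-bit number, and `estimate_from_longest_run` simply inverts the probability table (a run of 6 leading zeros is seen about once in every 2^7 = 128 numbers).

```python
import hashlib


def leading_zeros(value: int, bits: int = 32) -> int:
    """Number of leading zero bits in a `bits`-wide unsigned integer."""
    return bits - value.bit_length()


def hash32(item: str) -> int:
    """Hypothetical helper: 32-bit hash from the first 4 bytes of SHA-1."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:4], "big")


def estimate_from_longest_run(longest: int) -> int:
    """A run of `longest` zeros followed by a 1 occurs once per 2**(longest+1)."""
    return 2 ** (longest + 1)


def crude_estimate(items) -> int:
    """Single-observation estimate of the number of distinct items."""
    longest = max(leading_zeros(hash32(x)) for x in items)
    return estimate_from_longest_run(longest)


# The pattern from the slide: 0 0 0 0 0 0 1 x x... (6 leading zeros).
slide_value = 1 << 25  # bit 25 set in a 32-bit word, so 6 leading zeros
answer = estimate_from_longest_run(leading_zeros(slide_value))
```

Because the estimate depends only on the maximum, repeated IDs never change it. A single maximum is very noisy, though, which is why the full method averages over many of them.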
Hyperloglog
[Figure: a stream of IDs; longest run of leading zeros seen: 6 => 128 unique viewers]
[Figure: IDs split into several groups with longest runs 5 6 7 4 6 8 ...; (harmonic) MEAN: 6]
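The averaging step pictured above can be sketched as a small HyperLogLog in Python. This is a minimal illustration under stated assumptions, not the Algebird code used in the project: the class name `TinyHLL`, the SHA-1 based 64-bit hash, and the choice of 2^12 registers are all hypothetical. The bias constant and the small-range correction follow the standard published algorithm.

```python
import hashlib
import math


class TinyHLL:
    """Minimal HyperLogLog sketch (illustrative only)."""

    def __init__(self, p: int = 12):
        self.p = p                    # number of index bits (assumed value)
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash: the first p bits pick a register, the rest are scanned.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = leading zeros in the remaining bits, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> float:
        # Harmonic mean of 2**register across all registers, bias-corrected.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


hll = TinyHLL()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # a repeated ID leaves the registers unchanged
estimate = hll.count()
```

The structure never grows: it is always `m` small integers, regardless of how many IDs pass through it.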
Pipeline
Record fields: Ad ID, Unique User ID, Gender, Age segment, Timestamp
Algebird
4 x m4.large
1 sec mini-batches
Pushed 1 billion records with unique user IDs
● Throughput can reach an average of 5M records/min
● Streams of <1M records processed within a minute
● After >1M uniques, delays accumulate, causing system instability when using sets
Extension: counting unique viewers in a subgroup
● Associating segments with user IDs
○ Challenge: Can we avoid database accesses when processing data in real-time?
○ Bloom filter: another fixed-size probabilistic data structure that trades off (tunable) accuracy for size
e.g. Bloom filter + Hyperloglog, count males: error 1.2%
○ needed to overcome challenges in combining aspects of Spark (batch) and Spark Streaming
Sample record: Ad ID | Unique User ID | Gender | Age segment (e.g. 18-34) | Timestamp
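A Bloom filter used as an in-memory stand-in for a segment lookup might look like the sketch below. Everything here is an assumption made for illustration: the class name `TinyBloom`, the SHA-256 double-hashing scheme, and the sizes are not taken from the project. The property it demonstrates is the standard one: a member is never reported missing, while a non-member is occasionally reported present.

```python
import hashlib


class TinyBloom:
    """Minimal Bloom filter (illustrative only)."""

    def __init__(self, m_bits: int, k: int):
        self.m = m_bits                         # size of the bit array
        self.k = k                              # number of hash functions
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical segment data: even-numbered users belong to the segment.
segment = TinyBloom(m_bits=10_000, k=5)
members = [f"user-{i}" for i in range(0, 1000, 2)]
for user_id in members:
    segment.add(user_id)

non_members = [f"user-{i}" for i in range(1, 1000, 2)]
false_positives = sum(1 for user_id in non_members if user_id in segment)
```

In a streaming job, each incoming user ID would be tested against the filter and, on a hit, added to that segment's Hyperloglog, with no database round trip.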
About me
Master of Science, Harvard University
Computational Science and Engineering
(graduated May 2016)
J.D. / MBA, University of Toronto
Bachelor of Applied Science, University of Toronto
Engineering Science (Computer)
Thank you for listening!
appendix
[Set structures]
[HLL structures]
Results: error rate in counts
● Error < 2% for subgroups; slightly higher for main group
● Error for intersection calculation (purple) tends to be higher on average
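One plausible reason for the higher intersection error is worth spelling out. The slides do not say how the intersection was computed; a common approach is inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|, where each term is itself an estimate. The sketch below uses made-up counts (not results from this project) to show how 1% errors on large terms can become a much larger relative error on their small difference.

```python
def intersection_estimate(count_a: float, count_b: float, count_union: float) -> float:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - count_union


# Hypothetical true counts, chosen only for illustration.
true_a, true_b, true_union = 500_000, 400_000, 800_000
true_inter = intersection_estimate(true_a, true_b, true_union)

# Suppose every estimate is off by 1%, each in the unlucky direction.
est_a = true_a * 1.01
est_b = true_b * 1.01
est_union = true_union * 0.99
est_inter = intersection_estimate(est_a, est_b, est_union)

rel_error = abs(est_inter - true_inter) / true_inter
print(f"true intersection: {true_inter}, estimate: {est_inter:.0f}, "
      f"relative error: {rel_error:.0%}")
```

The smaller the intersection is relative to the sets being combined, the stronger this amplification becomes.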
Use cases
● Advertising
○ ad viewership, website views, television viewership, app engagement, etc.
● Any application where you would want to count a large number of unique things fast
○ stock trades, network traffic, twitter responses, election data, real-time voting data, etc.
● Well suited to real-time analytics
○ intermediate state of HLL structure provides for a running count
○ trivially parallelizable
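The "trivially parallelizable" point can be illustrated with register merging. In the sketch below (assumed names and a deliberately tiny 16-register layout, not the project's code), two workers build registers over different slices of the stream, and taking the element-wise maximum gives exactly the registers a single worker would have built over the whole stream.

```python
import hashlib

P = 4           # index bits (kept tiny for illustration)
M = 1 << P      # number of registers


def registers_for(items):
    """Build HyperLogLog-style registers for a collection of string IDs."""
    regs = [0] * M
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - P)
        rest = h & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs


def merge(a, b):
    """Combine two register arrays: element-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


# Two overlapping partitions of the same hypothetical stream.
part1 = [f"user-{i}" for i in range(0, 600)]
part2 = [f"user-{i}" for i in range(400, 1000)]

merged = merge(registers_for(part1), registers_for(part2))
single_pass = registers_for(part1 + part2)
```

Because the merge is associative and commutative, partial results can be combined in any order, and the registers held at any moment already give a running count.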
Future exploration
● Associating segments with user IDs
○ quantifying incremental error associated with introduction of Bloom filters
● Apache Storm versus Spark
○ Does Storm (a “pure” streaming technology) perform much better?
● Spark DataFrames API
○ seemed to introduce significant delay: would like to quantify this
Bloom Filters
● Experiment with 1 million records
○ Employed 2 Bloom filters (1 MB each), one for each segment (male, 18-34), to store segment data to be matched with incoming user IDs; continued processing with Hyperloglog
○ estimated error for Hyperloglog: 2%; estimated error for Bloom filter: 3%
● Actual error:
○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9%
○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6%
● Time to process:
○ Bloom filter + Hyperloglog: 17s (+55%)
○ Hyperloglog only: 11s
Bloom Filters
[Figure: Bloom filter illustration. Source: Wikipedia]
Tuning Probabilistic Structures
Hyperloglog
(source: Twitter Algebird source code: HyperLogLog.scala)
Bloom Filters
(source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
e.g. n = 1 M (capacity)
p = 0.03 (error)
=> k = 5 (# of hash functions)
=> m = 891 kB
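The Bloom filter figures above can be reproduced from the standard sizing formulas, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2. The sketch below applies them to the slide's n and p. The function names are hypothetical, and the Hyperloglog line uses the usual 1.04 / sqrt(registers) standard-error approximation with an assumed 4096 registers, a value chosen for illustration rather than taken from the slides.

```python
import math


def bloom_size_bits(n: int, p: float) -> int:
    """Bits needed to hold n items at false-positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def bloom_num_hashes(m_bits: int, n: int) -> int:
    """Number of hash functions that minimises the false-positive rate."""
    return round(m_bits / n * math.log(2))


def hll_std_error(num_registers: int) -> float:
    """Approximate relative standard error of a HyperLogLog estimate."""
    return 1.04 / math.sqrt(num_registers)


n, p = 1_000_000, 0.03
m_bits = bloom_size_bits(n, p)
k = bloom_num_hashes(m_bits, n)
size_kib = m_bits / 8 / 1024

print(f"m = {m_bits} bits (about {size_kib:.0f} KiB), k = {k}")
print(f"HLL error with 4096 registers: {hll_std_error(4096):.4f}")
```

Both structures are tuned the same way: pick the error you can tolerate, and the size follows.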

Hyperloglog Project

  • 1. Count me once, count me fast! Probabilistic methods in real-time streaming (Hyperloglog, Bloom filters) Kendrick Lo Insight Data Engineering, NYC Summer 2016
  • 2. [Figure: real-time viewing data shown as repeated records of (Ad ID, Unique User ID, Time stamp), followed by a column of Unique User ID values and a "?"]
  • 3. [Figure: real-time viewing data shown as repeated records of (Ad ID, Unique User ID, Time stamp), followed by a column of Unique User ID values and a "?"] Bitmap (for exact counting): 13 MB for 100 million uniques. Hyperloglog: 4 KB for billions of uniques.
  • 4. Hyperloglog: Count-distinct problem (a.k.a. cardinality estimation problem) ● counting unique elements in a data stream with repeated elements ● calculates an approximate number ○ typical error purported to be less than 2% What it can’t do: ● give an exact count ● track frequency of occurrence ● confirm whether a certain element was seen
  • 5. Hyperloglog - a probabilistic method General Idea: Count leading zeros in a randomly generated binary number Given a random number, what is the probability of seeing…? 1 x x x x x x x x… → 0.5 (1 out of every 2) 0 1 x x x x x x x… → 0.25 (1 out of every 4) 0 0 1 x x x x x x… → 0.125 (1 out of every 8) … 0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128) ...
  • 6. Hyperloglog - a probabilistic method General Idea: Count leading zeros in a randomly generated binary number Given a random number, what is the probability of seeing…? 1 x x x x x x x x… → 0.5 (1 out of every 2) 0 1 x x x x x x x… → 0.25 (1 out of every 4) 0 0 1 x x x x x x… → 0.125 (1 out of every 8) … 0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128) ... Question: I have a list of N unique numbers. The one with the longest string of leading zeros is 0 0 0 0 0 0 1 x x… What is N?
  • 7. Hyperloglog [Figure: a stream of IDs mapped to the values 5, 6, 7, 4, 6, 8, ...] (harmonic) MEAN: 6 => 128 unique viewers
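The idea on slides 5 to 7 (hash each ID, record the longest run of leading zeros per register, combine the registers with a harmonic mean) can be sketched in a few lines of Python. This is a simplified illustration, not the Algebird implementation used in the talk: the SHA-1 hash, the register count `2**b`, and the function names are assumptions made for the example, and the small-range and large-range corrections of the full algorithm are omitted. The bias constant is the one given in the HyperLogLog paper for 128 or more registers.

```python
import hashlib


def leading_zeros(value: int, width: int) -> int:
    """Number of leading zero bits in `value`, viewed as a `width`-bit number."""
    return width - value.bit_length()


def estimate_cardinality(items, b: int = 10) -> float:
    """Simplified HyperLogLog-style estimate using 2**b registers (assumed b >= 7)."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # treat the first 8 bytes as a 64-bit hash
        idx = h >> (64 - b)                        # the first b bits choose a register
        rest = h & ((1 << (64 - b)) - 1)           # the remaining bits are the "random number"
        rank = leading_zeros(rest, 64 - b) + 1     # position of the first 1 bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)               # bias constant, valid for m >= 128
    # m * (harmonic mean of 2**register) == m * m / sum(2**-register)
    return alpha * m * m / sum(2.0 ** -r for r in registers)


if __name__ == "__main__":
    stream = list(range(50_000)) * 2               # every ID appears twice
    print(round(estimate_cardinality(stream)))     # close to 50,000, not 100,000
```

Repeated IDs hash to the same register and the same rank, so they never change the estimate; only the number of distinct IDs matters.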
  • 8. Pipeline. Record fields: Ad ID, Unique User ID, Gender, Age segments, Time stamp. Algebird; 4 x m4.large; 1 sec mini-batches. Pushed 1 billion records with unique user IDs
  • 9. ● Throughput can reach an average of 5M records/min ● Streams of <1M records processed within a minute
  • 11. ● After >1M uniques, delays accumulate causing system instability when using sets
  • 12. Extension: counting unique viewers in a subgroup ● Associating segments with user IDs ○ Challenge: Can we avoid database accesses when processing data in real-time? ○ Bloom filter: another fixed-size probabilistic data structure that trades off (tunable) accuracy for size, e.g. Bloom filter + Hyperloglog, count males, error: 1.2% ○ needed to overcome challenges in combining aspects of Spark (batch) and Spark Streaming. Sample record: Ad ID, Unique User ID, Gender, Age segment (e.g. 18-34), Time stamp
  • 14. About me Master of Science, Harvard University Computational Science and Engineering (graduated May 2016) J.D. / MBA, University of Toronto Bachelor of Applied Science, University of Toronto Engineering Science (Computer) Thank you for listening!
  • 18. Results: error rate in counts ● Error < 2% for subgroups; slightly higher for main group ● Error for intersection calculation (purple) tends to be higher on average
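One plausible reason the intersection error is higher: a common way to get an intersection from cardinality estimates is inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|, and the slides do not say whether this is the method used here, so treat that as an assumption. Under it, small errors in three large estimates land on one smaller number. The counts below are hypothetical and chosen only to show the arithmetic.

```python
def intersection_estimate(count_a: int, count_b: int, count_union: int) -> int:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - count_union


# Hypothetical true sizes: the true intersection is 100,000.
true_a, true_b, true_union = 500_000, 300_000, 700_000
true_intersection = intersection_estimate(true_a, true_b, true_union)

# Hypothetical estimates that are each off by only 1%:
# A and B are 1% too high, the union is 1% too low.
est_a, est_b, est_union = 505_000, 303_000, 693_000
est_intersection = intersection_estimate(est_a, est_b, est_union)

relative_error = abs(est_intersection - true_intersection) / true_intersection
print(est_intersection, relative_error)  # 115000 0.15
```

Here 1% errors in the inputs become a 15% error in the intersection, because the absolute errors add while the quantity being estimated is much smaller than any of the inputs.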
  • 19. Use cases ● Advertising ○ ad viewership, website views, television viewership, app engagement, etc. ● Any application where you would want to count a large number of unique things fast ○ stock trades, network traffic, Twitter responses, election data, real-time voting data, etc. ● Well suited to real-time analytics ○ intermediate state of HLL structure provides for a running count ○ trivially parallelizable. Sample record: Ad ID, Unique User ID, Gender, Age segment (e.g. 18-34), Time stamp
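The "trivially parallelizable" point follows from how HLL state combines: two register arrays built with the same hash and the same size merge by taking the element-wise maximum, and the result is identical to the registers built from the combined stream. The sketch below illustrates this under stated assumptions (SHA-1 hashing, a deliberately tiny register count, hypothetical function names); it is not the Algebird code from the talk.

```python
import hashlib

NUM_REGISTER_BITS = 4  # hypothetical: 16 registers, kept tiny for illustration


def build_registers(items, b: int = NUM_REGISTER_BITS):
    """Build HLL-style registers: per register, the largest first-1-bit position seen."""
    registers = [0] * (1 << b)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash
        idx = h >> (64 - b)                         # first b bits choose the register
        rest = h & ((1 << (64 - b)) - 1)            # remaining 64 - b bits
        rank = (64 - b) - rest.bit_length() + 1     # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    return registers


def merge_registers(left, right):
    """Merge two register arrays of equal length by element-wise maximum."""
    if len(left) != len(right):
        raise ValueError("register arrays must have the same length")
    return [max(a, b) for a, b in zip(left, right)]


if __name__ == "__main__":
    # Two workers see overlapping slices of the stream.
    part_1 = build_registers(range(0, 500))
    part_2 = build_registers(range(250, 1000))
    merged = merge_registers(part_1, part_2)
    print(merged == build_registers(range(1000)))   # True
```

Because the merge is a maximum, it is order-independent and safe to apply repeatedly, which is also why the intermediate state can serve as a running count.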
  • 20. Future exploration ● Associating segments with user IDs ○ quantifying incremental error associated with introduction of Bloom filters ● Apache Storm versus Spark ○ Does Storm (a “pure” streaming technology) perform much better? ● Spark DataFrames API ○ seemed to introduce significant delay: would like to quantify this
  • 21. Bloom Filters ● Experiment with 1 million records ○ Employed 2 Bloom filters (1 MB each), one for each segment (male, 18-34), to store segment data to be matched with incoming user IDs; continued processing with Hyperloglog ○ estimated error for Hyperloglog: 2%; estimated error for Bloom filter: 3% ● Actual error: ○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9% ○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6% ● Time to process: ○ Bloom filter + Hyperloglog: 17s (+55%) ○ Hyperloglog only: 11s
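For readers unfamiliar with the structure, a Bloom filter used for segment lookup can be sketched as follows. This is a minimal illustration, not the implementation from the experiment: the class name, the SHA-256 hash, and the double-hashing scheme that derives k bit positions from two hash values are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership test: no false negatives, tunable false positives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions from two 64-bit values (double hashing, an assumed choice).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force odd so the step is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    # Hypothetical segment data: user IDs 0..999 belong to the "male" segment.
    male_segment = BloomFilter(num_bits=10_000, num_hashes=5)
    for user_id in range(1_000):
        male_segment.add(user_id)
    print(42 in male_segment)   # True: stored IDs are always found
```

In a pipeline like the one described, an incoming user ID found in the segment's filter would be forwarded to that segment's Hyperloglog counter, so no database lookup is needed; the price is that a small fraction of non-members are counted by mistake.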
  • 23. Tuning Probabilistic Structures ● Hyperloglog (source: Twitter Algebird source code: HyperLogLog.scala) ● Bloom Filters (source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/) e.g. n = 1 M (capacity), p = 0.03 (error) => k = 5 (# of hash functions) => m = 891 kB
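The Bloom filter numbers on slide 23 can be reproduced with the standard sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below also includes the usual Hyperloglog standard-error rule of about 1.04 / sqrt(number of registers); whether the talk tuned Hyperloglog with exactly this rule is an assumption, and the function names are hypothetical.

```python
import math


def bloom_parameters(n: int, p: float):
    """Return (bits, hash_count) for capacity n and target false-positive rate p."""
    m_bits = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m_bits / n * math.log(2))
    return m_bits, k


def hll_register_bits(target_error: float) -> int:
    """Smallest b such that 1.04 / sqrt(2**b) <= target_error."""
    return math.ceil(math.log2((1.04 / target_error) ** 2))


if __name__ == "__main__":
    m_bits, k = bloom_parameters(n=1_000_000, p=0.03)
    print(k)                          # 5
    print(round(m_bits / 8 / 1024))   # 891 (reading "kB" as 1024 bytes)
    print(hll_register_bits(0.02))    # 12, i.e. 4096 registers
```

With one byte per register, 4096 registers is 4 KB, which is consistent with the 4 KB figure and the roughly 2% error quoted earlier in the deck.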