Count me once, count me fast!
Probabilistic methods in real-time streaming
(Hyperloglog, Bloom filters)
Kendrick Lo
Insight Data Engineering, NYC
Summer 2016
[Figure: a stream of records (Ad ID, Unique User ID, Timestamp) reduced to a list of unique user IDs; how many are there? Caption: real-time viewing data]
Bitmap (for exact counting): 13 MB for 100 million uniques
Hyperloglog: 4 KB for billions of uniques
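The two sizes above can be roughly reproduced with a little arithmetic. The sketch below assumes one bit per possible ID for the bitmap and 2^12 one-byte registers for Hyperloglog; both layouts are assumptions chosen to match the slide's totals, not details stated in the deck. The function names are hypothetical.

```python
import math

def bitmap_bytes(n_ids: int) -> int:
    """Bytes needed for an exact bitmap with one bit per possible ID."""
    return math.ceil(n_ids / 8)

def hll_bytes(p: int, bits_per_register: int = 8) -> int:
    """Bytes needed for 2**p registers (register width is an assumption)."""
    return (1 << p) * bits_per_register // 8

def hll_std_error(p: int) -> float:
    """Standard Hyperloglog relative error, about 1.04 / sqrt(number of registers)."""
    return 1.04 / math.sqrt(1 << p)

print(bitmap_bytes(100_000_000) / 1e6, "MB for the bitmap")   # 12.5, i.e. about 13 MB
print(hll_bytes(12), "bytes for Hyperloglog")                 # 4096, i.e. 4 KB
print(hll_std_error(12), "expected relative error")           # 0.01625, under 2%
```

The bitmap grows linearly with the number of possible IDs, while the Hyperloglog size is fixed by the chosen number of registers.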
Hyperloglog
Count-distinct problem
(a.k.a. cardinality estimation problem)
● counting unique elements in a data
stream with repeated elements
● calculates an approximate number
○ typical error purported to be less than 2%
What it can’t do:
● give an exact count
● track frequency of
occurrence
● confirm whether a certain
element was seen
Hyperloglog - a probabilistic method
General Idea: Count leading zeros in a randomly generated binary number
Given a random number,
what is the probability of seeing…?
1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
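The table above follows from each bit of a random number being 0 or 1 with equal probability. A minimal sketch of that calculation (the function name is hypothetical):

```python
from fractions import Fraction

def prob_zeros_then_one(k: int) -> Fraction:
    """Probability that a random bit string starts with exactly k zeros followed by a 1."""
    # k zeros and one 1 are k + 1 fixed bits, each with probability 1/2.
    return Fraction(1, 2 ** (k + 1))

for k in (0, 1, 2, 6):
    p = prob_zeros_then_one(k)
    print(f"{k} leading zeros: {float(p):.3f} (1 out of every {p.denominator})")
```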
Hyperloglog - a probabilistic method
Question:
I have a list of N unique numbers.
The one with the longest string
of leading zeros is
0 0 0 0 0 0 1 x x…
What is N?
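One way to read the question: a pattern seen about once in every 128 random numbers suggests that N is roughly 128. The sketch below turns that reasoning into a single-value estimator. Hashing IDs with SHA-1 to get "random" 64-bit numbers is an assumption made for illustration, and the function names are hypothetical.

```python
import hashlib

def hash64(item: str) -> int:
    """Map an ID to a pseudo-random 64-bit integer (SHA-1 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")

def leading_zeros(value: int, width: int = 64) -> int:
    """Number of leading zero bits in a width-bit integer."""
    return width - value.bit_length()

def estimate_from_longest_run(zeros: int) -> int:
    """A run of `zeros` zeros followed by a 1 appears about once per 2**(zeros + 1) values."""
    return 2 ** (zeros + 1)

def naive_estimate(items) -> int:
    """Estimate the number of distinct items from the longest run of leading zeros."""
    longest = max(leading_zeros(hash64(x)) for x in items)
    return estimate_from_longest_run(longest)

print(estimate_from_longest_run(6))  # 128, the answer suggested by the slide
```

Because duplicates hash to the same value, repeated IDs do not change the estimate. A single maximum is very noisy, though, which is why many such values are combined.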
Hyperloglog
[Figure: incoming IDs; a single longest run of 6 => 128 unique viewers. Over many such values (5 6 7 4 6 8 ...), the (harmonic) MEAN is 6.]
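The idea of combining many leading-zero counts with a harmonic mean can be sketched as a small Hyperloglog. This is a simplified illustration, not the Algebird implementation used in the project: the SHA-1 hash, the default of 2^12 registers, and the basic small-range correction are all assumptions.

```python
import hashlib
import math

def hash64(item: str) -> int:
    """Map an ID to a pseudo-random 64-bit integer (illustrative hash choice)."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")

class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        x = hash64(item)
        idx = x >> (64 - self.p)                     # first p bits select the register
        rest = x & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = leading zeros in the remaining bits, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, valid for m >= 128
        # Harmonic mean of 2**register across all registers, scaled by alpha * m.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * m and empty:
            return m * math.log(m / empty)           # linear counting for small counts
        return raw

hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")                             # repeats do not change the state
print(round(hll.count()))
```

With 2^12 registers the expected relative error is about 1.6%, in line with the "less than 2%" figure quoted earlier.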
Pipeline
Record: Ad ID | Unique User ID | Gender | Age segments | Timestamp
Algebird
4 x m4.large
1 sec mini-batches
Pushed 1 billion records
with unique user IDs
● Throughput can reach an
average of 5M records/min
● Streams of <1M records
processed within a minute
Hyperloglog Project
● After >1M uniques, delays
accumulate causing system
instability when using sets
Extension: counting unique viewers in a subgroup
● Associating segments with user IDs
○ Challenge: Can we avoid database accesses when
processing data in real-time?
○ Bloom filter: another fixed-size probabilistic data
structure that trades off (tunable) accuracy for size
e.g. Bloom filter + Hyperloglog, count of males: 1.2% error
○ needed to overcome challenges in combining
aspects of Spark (batch) and Spark Streaming
Sample record: Ad ID | Unique User ID | Gender | Age segment (e.g. 18-34) | Timestamp
About me
Master of Science, Harvard University
Computational Science and Engineering
(graduated May 2016)
J.D. / MBA, University of Toronto
Bachelor of Applied Science, University of Toronto
Engineering Science (Computer)
Thank you for listening!
appendix
[Set structures]
[HLL structures]
Results: error rate in counts
● Error < 2% for subgroups;
slightly higher for main group
● Error for intersection
calculation (purple) tends to
be higher on average
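The slides do not say how the intersection was computed. One common approach, assumed here, is inclusion-exclusion over three estimates: |A and B| = |A| + |B| - |A or B|. The sketch below uses made-up counts to show why small errors in the three inputs can become a larger relative error in the intersection.

```python
def intersection_estimate(count_a: int, count_b: int, count_union: int) -> int:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - count_union

def relative_error(estimate: float, truth: float) -> float:
    return abs(estimate - truth) / truth

# Hypothetical true counts (not taken from the project's data).
true_a, true_b, true_union = 500_000, 400_000, 700_000
true_both = intersection_estimate(true_a, true_b, true_union)

# Each input is off by only 1%, in the least favourable directions.
est_both = intersection_estimate(
    true_a * 101 // 100,      # 1% over
    true_b * 101 // 100,      # 1% over
    true_union * 99 // 100,   # 1% under
)
print(true_both, est_both, relative_error(est_both, true_both))
```

Here three 1% errors combine into an 8% error on the intersection, because the errors add up while the quantity being estimated is smaller than any of its inputs.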
Use cases
● Advertising
○ ad viewership, website views, television viewership, app engagement, etc.
● Any application where you would want to count a large number of unique
things fast
○ stock trades, network traffic, Twitter responses, election data, real-time voting data, etc.
● Well suited to real-time analytics
○ intermediate state of HLL structure provides for a running count
○ trivially parallelizable
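Parallelization works because two Hyperloglog states can be combined by taking the register-wise maximum. The sketch below checks that merging the registers built from two halves of a stream gives the same registers as processing the whole stream. The hash function, register count, and helper names are assumptions for illustration.

```python
import hashlib

P = 10            # bits used to select a register (illustrative choice)
M = 1 << P        # number of registers

def hash64(item: str) -> int:
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")

def registers_for(items) -> list:
    """Build Hyperloglog registers for a batch of items."""
    regs = [0] * M
    for item in items:
        x = hash64(item)
        idx = x >> (64 - P)
        rest = x & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs

def merge(a: list, b: list) -> list:
    """Combine two register arrays; the result represents the union of both inputs."""
    return [max(x, y) for x, y in zip(a, b)]

stream = [f"user-{i % 5000}" for i in range(20_000)]
part_a, part_b = stream[:10_000], stream[10_000:]
merged = merge(registers_for(part_a), registers_for(part_b))
print(merged == registers_for(stream))
```

The same property gives a running count: the registers kept so far can be merged with those of each new mini-batch and queried at any time.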
Future exploration
● Associating segments with user IDs
○ quantifying incremental error associated with introduction of
Bloom filters
● Apache Storm versus Spark
○ Does Storm (a “pure” streaming technology) perform much
better?
● Spark DataFrames API
○ seemed to introduce significant delay: would like to quantify this
Bloom Filters
● Experiment with 1 million records
○ Employed 2 Bloom filters (1 MB each), one per segment (male, 18-34), to store segment data to be matched against incoming user IDs; processing then continued with Hyperloglog
○ estimated error for Hyperloglog: 2%; estimated error for Bloom filter: 3%
● Actual error:
○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9%
○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6%
● Time to process:
○ Bloom filter + Hyperloglog: 17s (+55%)
○ Hyperloglog only: 11s
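The segment lookup described above can be sketched with a small Bloom filter: user IDs known to belong to a segment are added once, and each incoming ID is then tested without a database access. This is an illustrative implementation, not the one used in the experiment. The SHA-256 double-hashing scheme, the filter size, and the ID format are assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k: int):
        self.m = m_bits                       # number of bits in the filter
        self.k = k                            # number of hash positions per item
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        # Double hashing: derive k positions from two base hashes.
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))

# Hypothetical "male" segment of 1,000 user IDs, sized for roughly 3% false positives.
males = BloomFilter(m_bits=7299, k=5)
for i in range(1000):
    males.add(f"user-{i}")

print("user-42" in males)  # True: an added ID is always reported as present
```

A Bloom filter never misses an ID that was added, but it can report an ID that was never added, so segment counts that pass through the filter can only be inflated, never reduced.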
Bloom Filters
[Figure. Source: Wikipedia]
Tuning Probabilistic Structures
Hyperloglog
(source: Twitter Algebird source code: HyperLogLog.scala)
Bloom Filters
(source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
e.g. n = 1 M (capacity)
p = 0.03 (error)
=> k = 5 (# of hash functions)
=> m = 891 kB
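The example values above are consistent with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A short sketch that reproduces them, reading "kB" as 1024 bytes (an assumption):

```python
import math

def bloom_size_bits(n: int, p: float) -> int:
    """Bits needed for n items at false positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

def bloom_num_hashes(n: int, m_bits: int) -> int:
    """Number of hash functions for n items in m_bits bits."""
    return round(m_bits / n * math.log(2))

n, p = 1_000_000, 0.03
m_bits = bloom_size_bits(n, p)
k = bloom_num_hashes(n, m_bits)
print(k, round(m_bits / 8 / 1024), "kB")   # 5 891 kB
```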

Hyperloglog Project

  • 1. Count me once, count me fast! Probabilistic methods in real-time streaming (Hyperloglog, Bloom filters) Kendrick Lo Insight Data Engineering, NYC Summer 2016
  • 2. [Diagram: real-time viewing data] Records of (Ad ID, Unique User ID, Time stamp) → Unique User ID, Unique User ID, ... → ?
  • 3. [Diagram: real-time viewing data] Records of (Ad ID, Unique User ID, Time stamp) → Unique User ID, Unique User ID, ... → ? Bitmap (for exact counting): 13 MB for 100 million uniques. Hyperloglog: 4 KB for billions of uniques.
  • 4. Hyperloglog Count-distinct problem (a.k.a. cardinality estimation problem) ● counting unique elements in a data stream with repeated elements ● calculates an approximate number ○ typical error purported to be less than 2% What it can’t do: ● give an exact count ● track frequency of occurrence ● confirm whether a certain element was seen
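As a rough check on the memory figures and the sub-2% error mentioned in the slides, here is a minimal Python sketch. It assumes the 4 KB Hyperloglog corresponds to 4,096 one-byte registers (an assumption, not stated in the slides) and uses the commonly cited standard-error approximation 1.04/sqrt(m) for m registers. The function names are hypothetical.

```python
import math


def bitmap_bytes(n_ids: int) -> int:
    """Bytes needed for an exact bitmap with one bit per possible ID."""
    return math.ceil(n_ids / 8)


def hll_standard_error(num_registers: int) -> float:
    """Approximate relative standard error of a Hyperloglog with m registers."""
    return 1.04 / math.sqrt(num_registers)


# Assumption: 100 million possible IDs, one bit each.
print(bitmap_bytes(100_000_000) / 1e6)  # size in megabytes
# Assumption: 4 KB of state = 4,096 registers of one byte each.
print(hll_standard_error(4096))         # relative standard error
```

Under these assumptions the bitmap needs 12.5 MB (which the slide rounds to 13 MB) and the expected error is about 1.6%, in line with the "less than 2%" figure.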
  • 5. Hyperloglog - a probabilistic method General Idea: Count leading zeros in a randomly generated binary number Given a random number, what is the probability of seeing…? 1 x x x x x x x x… → 0.5 (1 out of every 2) 0 1 x x x x x x x… → 0.25 (1 out of every 4) 0 0 1 x x x x x x… → 0.125 (1 out of every 8) … 0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128) ...
  • 6. Hyperloglog - a probabilistic method General Idea: Count leading zeros in a randomly generated binary number Given a random number, what is the probability of seeing…? 1 x x x x x x x x… → 0.5 (1 out of every 2) 0 1 x x x x x x x… → 0.25 (1 out of every 4) 0 0 1 x x x x x x… → 0.125 (1 out of every 8) … 0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128) ... Question: I have a list of N unique numbers. The one with the longest string of leading zeros is 0 0 0 0 0 0 1 x x… What is N?
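The probability table and the question on this slide can be sketched in a few lines of Python. This is only an illustration of the reasoning, not code from the project; the function names and the fixed bit width are assumptions.

```python
def leading_zeros(value: int, width: int = 32) -> int:
    """Number of leading zero bits when `value` is written with `width` bits."""
    return width - value.bit_length()


def prob_pattern(zeros: int) -> float:
    """Probability that a random bit string starts with `zeros` zeros, then a 1."""
    return 0.5 ** (zeros + 1)


def rough_estimate(max_zeros: int) -> int:
    """Crude estimate of N from the longest run seen: 1 / probability."""
    return 2 ** (max_zeros + 1)


# The pattern 0 0 0 0 0 0 1 x x has six leading zeros.
print(leading_zeros(0b000000100, width=9))  # 6
print(prob_pattern(6))                      # 0.0078125, i.e. 1 out of every 128
print(rough_estimate(6))                    # 128
```

A pattern that occurs once in every 128 random numbers suggests that roughly 128 distinct numbers were seen, which is the intuition behind the answer to the question. A single observation like this is very noisy, so it is only a starting point.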
  • 7. Hyperloglog [Diagram: stream of IDs] 5 6 7 4 6 8 ... → (harmonic) MEAN: 6 => 128 unique viewers
  • 8. Pipeline [Diagram] Record fields: Ad ID, Unique User ID, Gender, Age segments, Time stamp. Algebird; 4 x m4.large; 1 sec mini-batches. Pushed 1 billion records with unique user IDs
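To make the averaging idea concrete, here is a minimal Hyperloglog sketch in Python. It is not the Algebird implementation used in the project. It assumes SHA-1 as the hash, 4,096 registers, and the standard estimator, which takes a harmonic mean of 2^register across registers and falls back to linear counting for small counts. The class and method names are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (4,096 for p=12)
        self.registers = [0] * self.m

    def add(self, item) -> None:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        idx = x >> (64 - self.p)                   # first p bits: register index
        rest = x & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rank = leading zeros in the remaining bits, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)         # small-range correction
        return raw


hll = HyperLogLog()
for i in range(300_000):
    hll.add(f"user-{i % 100_000}")  # 100,000 distinct IDs, each seen three times
print(round(hll.count()))
```

Repeated IDs hash to the same register and rank, so duplicates never change the state; only the 4,096 small integers are kept regardless of stream length.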
  • 9. ● Throughput can reach an average of 5M records/min ● Streams of <1M records processed within a minute
  • 11. ● After >1M uniques, delays accumulate causing system instability when using sets
  • 12. Extension: counting unique viewers in a subgroup ● Associating segments with user IDs ○ Challenge: Can we avoid database accesses when processing data in real-time? ○ Bloom filter: another fixed-size probabilistic data structure that trades off (tunable) accuracy for size, e.g. Bloom filter + Hyperloglog, count males, error: 1.2% ○ needed to overcome challenges in combining aspects of Spark (batch) and Spark Streaming. Sample record: Ad ID, Unique User ID, Gender, Age segment (e.g. 18-34), Time stamp
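The Bloom filter idea can be sketched as follows. This is a minimal Python illustration, not the project's code: the class name, the SHA-256 double-hashing scheme, the filter size, and the rule that even-numbered users are "male" are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # fixed-size bit array

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical segment data: even-numbered users belong to the "male" segment.
males = BloomFilter(num_bits=80_000, num_hashes=5)
for i in range(0, 10_000, 2):
    males.add(f"user-{i}")

print("user-42" in males)   # True: added items are always found
false_hits = sum(f"user-{i}" in males for i in range(1, 10_000, 2))
print(false_hits)           # number of false positives among 5,000 non-members
```

Membership checks never miss an item that was added, but they can report a small fraction of false positives. That is the accuracy traded for a fixed memory footprint and for avoiding a database lookup per record.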
  • 13. About me Master of Science, Harvard University Computational Science and Engineering (graduated May 2016) J.D. / MBA, University of Toronto Bachelor of Applied Science, University of Toronto Engineering Science (Computer)
  • 14. Thank you for listening!
  • 18. Results: error rate in counts ● Error < 2% for subgroups; slightly higher for main group ● Error for intersection calculation (purple) tends to be higher on average
  • 19. Use cases ● Advertising ○ ad viewership, website views, television viewership, app engagement, etc. ● Any application where you would want to count a large number of unique things fast ○ stock trades, network traffic, twitter responses, election data, real-time voting data, etc. ● Well suited to real-time analytics ○ intermediate state of HLL structure provides for a running count ○ trivially parallelizable. Sample record: Ad ID, Unique User ID, Gender, Age segment (e.g. 18-34), Time stamp
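The "trivially parallelizable" point can be illustrated with a short Python sketch: Hyperloglog registers built on separate shards can be combined with an element-wise maximum, and the result is identical to processing the whole stream in one place. The helper names, the SHA-1 hash, and the shard contents are assumptions for illustration only.

```python
import hashlib

P = 12            # bits used to choose a register (assumed)
M = 1 << P        # 4,096 registers


def registers_for(items):
    """Build Hyperloglog registers for one shard of the stream."""
    regs = [0] * M
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - P)
        rest = x & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs


def merge(a, b):
    """Combine two register arrays with an element-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


# Two hypothetical shards with overlapping user IDs.
shard_a = [f"user-{i}" for i in range(0, 6_000)]
shard_b = [f"user-{i}" for i in range(4_000, 10_000)]

merged = merge(registers_for(shard_a), registers_for(shard_b))
print(merged == registers_for(shard_a + shard_b))  # True
```

Because the merge is just a maximum, it can be applied in any order and at any time, which is also why the intermediate state can serve as a running count.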
  • 20. Future exploration ● Associating segments with user IDs ○ quantifying incremental error associated with introduction of Bloom filters ● Apache Storm versus Spark ○ Does Storm (a “pure” streaming technology) perform much better? ● Spark DataFrames API ○ seemed to introduce significant delay: would like to quantify this
  • 21. Bloom Filters ● Experiment with 1 million records ○ Employed 2 bloom filters (1 MB each), one for each segment (male, 18-34) to store segment data to be matched with incoming user IDs, continued processing with Hyperloglog ○ estimated error for hyperloglog: 2%; estimated error for bloom filter: 3% ● Actual error: ○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9% ○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6% ● Time to process: ○ Bloom filter + Hyperloglog: 17s (+55%) ○ Hyperloglog only: 11s
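One common way to estimate an intersection from Hyperloglog counts is inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|. The slides do not say which method was used, so treat the following Python sketch as an assumption. The counts in it are hypothetical numbers, chosen only to show how small errors in each term can add up to a larger relative error in the intersection.

```python
def intersection_estimate(count_a: int, count_b: int, count_union: int) -> int:
    """Inclusion-exclusion estimate of |A ∩ B| from three cardinality estimates."""
    return count_a + count_b - count_union


def relative_error(estimate: float, truth: float) -> float:
    return abs(estimate - truth) / truth


# Hypothetical true sizes: |A| = 500,000, |B| = 300,000, |A ∪ B| = 650,000,
# so the true intersection is 150,000.
true_intersection = 500_000 + 300_000 - 650_000

# Hypothetical estimates, each off by only 1% (A and B high, union low).
est = intersection_estimate(505_000, 303_000, 643_500)

print(est)                                     # 164500
print(relative_error(est, true_intersection))  # about 0.097, i.e. close to 10%
```

The absolute errors of the three estimates add up, while the intersection they are divided by is smaller than any of the sets, so the relative error grows. This is consistent with the higher intersection error reported on the slide, although it does not establish the cause.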
  • 23. Tuning Probabilistic Structures Hyperloglog (source: Twitter Algebird source code: HyperLogLog.scala) Bloom Filters (source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/) e.g. n = 1 M (capacity), p = 0.03 (error) => k = 5 (# of hash functions) => m = 891 kB
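The Bloom filter example on this slide can be reproduced with the standard sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The Python sketch below assumes these are the formulas behind the slide's numbers; the function name is hypothetical.

```python
import math


def bloom_parameters(capacity: int, error_rate: float):
    """Return (number of bits, number of hash functions) for a Bloom filter."""
    num_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    num_hashes = round(num_bits / capacity * math.log(2))
    return num_bits, num_hashes


num_bits, num_hashes = bloom_parameters(1_000_000, 0.03)
print(num_hashes)                    # number of hash functions
print(round(num_bits / 8 / 1024))    # size in units of 1,024 bytes
```

For n = 1 M and p = 0.03 this gives k = 5 and a size of about 891 when measured in units of 1,024 bytes, which matches the slide if its "kB" is read that way.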