Count me once, count me fast!
Probabilistic methods in real-time streaming
(Hyperloglog, Bloom filters)
Kendrick Lo
Insight Data Engineering, NYC
Summer 2016
[Diagram: real-time viewing data arriving as a stream of records (Ad ID, Unique User ID, Timestamp); the Unique User IDs are pulled out of the stream ... how many are there?]
bitmap (for exact counting): 13 MB for 100 million uniques
hyperloglog: 4 KB for billions of uniques
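The two sizes on this slide can be sanity-checked with simple arithmetic. The sketch below assumes one bit per possible user ID for the bitmap and, for Hyperloglog, 4096 one-byte registers; the register layout is an assumption for illustration, not something the slides state.

```python
def bitmap_bytes(n_ids: int) -> int:
    """Bytes needed for an exact-count bitmap with one bit per possible ID."""
    return (n_ids + 7) // 8


def hll_bytes(n_registers: int, bytes_per_register: int = 1) -> int:
    """Bytes needed for a fixed-size array of Hyperloglog registers (assumed layout)."""
    return n_registers * bytes_per_register


# 100 million IDs at one bit each is 12.5 million bytes, i.e. roughly 13 MB.
bitmap_mb = bitmap_bytes(100_000_000) / 1_000_000

# 4096 one-byte registers occupy 4 KB regardless of how many IDs are seen.
hll_kb = hll_bytes(4096) / 1024
```

The bitmap grows with the ID space, while the Hyperloglog structure stays the same size as the stream grows.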
Hyperloglog
Count-distinct problem
(a.k.a. cardinality estimation problem)
● counting unique elements in a data
stream with repeated elements
● calculates an approximate number
○ typical error purported to be
less than 2%
What it can’t do:
● give an exact count
● track frequency of
occurrence
● confirm whether a certain
element was seen
Hyperloglog - a probabilistic method
General Idea: Count leading zeros in a randomly generated binary number
Given a random number,
what is the probability of seeing…?
1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
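The probabilities in this table follow from each bit being an independent coin flip: a prefix of k zeros followed by a 1 occurs with probability 1/2^(k+1). A minimal sketch that computes this and checks it by simulation; the function names and the 32-bit width are illustrative choices, not part of the talk.

```python
import random


def leading_zeros(value: int, width: int = 32) -> int:
    """Number of leading zero bits when `value` is written with `width` bits."""
    return width - value.bit_length()


def prob_exact_leading_zeros(k: int) -> float:
    """Probability that a uniform random bit string starts with k zeros, then a 1."""
    return 0.5 ** (k + 1)


def simulate(k: int, trials: int = 200_000, width: int = 32, seed: int = 0) -> float:
    """Fraction of random `width`-bit numbers with exactly k leading zeros."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if leading_zeros(rng.getrandbits(width), width) == k
    )
    return hits / trials
```

For example, `prob_exact_leading_zeros(6)` is 1/128, the last row of the table.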
Hyperloglog - a probabilistic method
Question:
I have a list of N unique numbers.
The one with the longest string of leading zeros is
0 0 0 0 0 0 1 x x…
What is N?
Hyperloglog
[Diagram: user IDs feed the estimator. A single observed value of 6 => 128 unique viewers. With many values (5 6 7 4 6 8 ...), the (harmonic) MEAN is 6.]
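The idea on this slide (hash each ID, track the longest run of leading zeros per bucket, combine the buckets with a harmonic mean) can be sketched in a few lines. This is a simplified textbook-style Hyperloglog for illustration, not the Algebird implementation used in the project; the class name `TinyHLL`, the SHA-1 hash, and the default of 12 index bits are all assumptions.

```python
import hashlib
import math


class TinyHLL:
    """Simplified Hyperloglog with 2**b registers (b >= 7 for the alpha formula used)."""

    def __init__(self, b: int = 12):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item: str) -> int:
        # Take the first 64 bits of SHA-1 as a stand-in for a uniform hash.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item: str) -> None:
        h = self._hash64(item)
        rest_bits = 64 - self.b
        idx = h >> rest_bits                      # first b bits choose a register
        rest = h & ((1 << rest_bits) - 1)         # remaining bits
        rank = rest_bits - rest.bit_length() + 1  # leading zeros in `rest`, plus one
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # Harmonic-mean style combination of 2**register across all registers.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting) while many registers are empty.
            return m * math.log(m / zeros)
        return raw
```

Adding the same ID twice leaves every register unchanged, which is why repeated viewers are only counted once.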
Pipeline
Sample record: Ad ID | Unique User ID | Gender, Age segments | Timestamp
● Algebird
● 4 x m4.large
● 1 sec mini-batches
● Pushed 1 billion records with unique user IDs
● Throughput can reach an
average of 5M records/min
● Streams of <1M records
processed within a minute
Hyperloglog Project
● After >1M uniques, delays
accumulate causing system
instability when using sets
Extension: counting unique viewers in a subgroup
● Associating segments with user IDs
○ Challenge: Can we avoid database accesses when
processing data in real-time?
○ Bloom filter: another fixed-size probabilistic data
structure that trades off (tunable) accuracy for size
e.g. Bloom filter + Hyperloglog, count of males: 1.2% error
○ needed to overcome challenges in combining
aspects of Spark (batch) and Spark Streaming
Sample record: Ad ID | Unique User ID | Gender | Age segment (e.g. 18-34) | Timestamp
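To make the Bloom filter idea concrete, here is a minimal sketch of a fixed-size membership filter that could hold the user IDs belonging to one segment, so that incoming records can be tagged without a database lookup. It is not the filter used in the project; the class name `TinyBloom`, the SHA-256 based double hashing, and the parameters are assumptions for illustration.

```python
import hashlib


class TinyBloom:
    """Fixed-size Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, m_bits: int, k: int):
        self.m = m_bits          # number of bits in the filter
        self.k = k               # number of hash positions per item
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions from two 64-bit hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))
```

In use, one filter per segment would be loaded ahead of time (for example a hypothetical `males = TinyBloom(8_000_000, 5)`), and each streamed user ID tested with `user_id in males` before being added to that segment's Hyperloglog counter.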
About me
Master of Science, Harvard University
Computational Science and Engineering
(graduated May 2016)
J.D. / MBA, University of Toronto
Bachelor of Applied Science, University of Toronto
Engineering Science (Computer)
Thank you for listening!
appendix
[Set structures]
[HLL structures]
Results: error rate in counts
● Error < 2% for subgroups;
slightly higher for main group
● Error for intersection
calculation (purple) tends to
be higher on average
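The slides do not say how the intersection was computed. One common approach with Hyperloglog is inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|, and the sketch below assumes that approach to show why intersection error tends to be larger: a small relative error in a large estimate becomes a larger relative error in the smaller difference. All numbers are hypothetical.

```python
def intersection_estimate(count_a: float, count_b: float, count_union: float) -> float:
    """Estimate |A ∩ B| from three cardinality estimates via inclusion-exclusion."""
    return count_a + count_b - count_union


# Hypothetical true sizes for two segments and their union.
true_a, true_b, true_union = 500_000, 400_000, 700_000
true_inter = intersection_estimate(true_a, true_b, true_union)

# Suppose only the union estimate is off, and only by 1%.
est_inter = intersection_estimate(true_a, true_b, true_union * 1.01)

# The 1% error on the union turns into a 3.5% error on the intersection.
rel_error = abs(est_inter - true_inter) / true_inter
```

The amplification grows as the intersection gets smaller relative to the union.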
Use cases
● Advertising
○ ad viewership, website views, television viewership, app engagement, etc.
● Any application where you would want to count a large number of unique
things fast
○ stock trades, network traffic, Twitter responses, election data, real-time voting data, etc.
● Well suited to real-time analytics
○ intermediate state of HLL structure provides for a running count
○ trivially parallelizable
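The "trivially parallelizable" point comes from how Hyperloglog sketches combine: two sketches of the same size merge by taking the element-wise maximum of their registers, and the result equals the sketch of the combined stream. A minimal illustration on plain lists of register values; the function name is hypothetical.

```python
from typing import List


def merge_registers(a: List[int], b: List[int]) -> List[int]:
    """Merge two equal-length Hyperloglog register arrays by element-wise maximum."""
    if len(a) != len(b):
        raise ValueError("sketches must use the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


# Registers built independently on two partitions of a stream.
partition_1 = [1, 5, 0, 2]
partition_2 = [3, 2, 0, 4]
combined = merge_registers(partition_1, partition_2)
```

Because the merge is commutative, associative, and idempotent, partitions can be processed on separate workers and combined in any order, and a running count is available from the merged state at any time.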
Future exploration
● Associating segments with user IDs
○ quantifying incremental error associated with introduction of
Bloom filters
● Apache Storm versus Spark
○ Does Storm (a “pure” streaming technology) perform much
better?
● Spark DataFrames API
○ seemed to introduce significant delay: would like to quantify this
Bloom Filters
● Experiment with 1 million records
○ Employed 2 Bloom filters (1 MB each), one per segment (male, 18-34), to store segment data for matching against incoming user IDs; processing then continued with Hyperloglog
○ estimated error for Hyperloglog: 2%; estimated error for Bloom filter: 3%
● Actual error:
○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9%
○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6%
● Time to process:
○ Bloom filter + Hyperloglog: 17s (+55%)
○ Hyperloglog only: 11s
Bloom Filters
Source: Wikipedia
Tuning Probabilistic Structures
Hyperloglog
(source: Twitter Algebird source code: HyperLogLog.scala)
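For tuning Hyperloglog, the usual rule of thumb is that the standard error is about 1.04 / sqrt(m), where m = 2^bits is the number of registers. The sketch below uses that standard approximation; it is assumed, not confirmed by the slides, that the referenced source uses the same form.

```python
import math


def hll_std_error(bits: int) -> float:
    """Approximate standard error of Hyperloglog with 2**bits registers."""
    return 1.04 / math.sqrt(2 ** bits)


def bits_for_error(target: float) -> int:
    """Smallest number of index bits whose approximate error is at most `target`."""
    return math.ceil(2 * math.log2(1.04 / target))


bits = bits_for_error(0.02)   # index bits needed for a 2% target
error = hll_std_error(bits)   # approximate error actually achieved
```

Under this approximation a 2% target needs 12 bits, i.e. 4096 registers, with an error of about 1.6%.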
Bloom Filters
(source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
e.g. n = 1 M (capacity)
p = 0.03 (error)
=> k = 5 (# of hash functions)
=> m = 891 kB
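The example numbers above can be reproduced with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A short sketch; the function names are illustrative, and the size is reported in units of 1024 bytes, which is what matches the 891 figure.

```python
import math


def bloom_size_bits(n: int, p: float) -> int:
    """Bits needed to hold n items at a target false-positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def bloom_num_hashes(m_bits: int, n: int) -> int:
    """Number of hash functions that minimizes false positives for m bits and n items."""
    return max(1, round(m_bits / n * math.log(2)))


n, p = 1_000_000, 0.03
m_bits = bloom_size_bits(n, p)
k = bloom_num_hashes(m_bits, n)
size_kib = m_bits / 8 / 1024   # filter size in units of 1024 bytes
```

With n = 1 M and p = 0.03 this gives k = 5 and a filter of roughly 891 units of 1024 bytes, in line with the slide.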

Hyperloglog Project

  • 1. Count me once, count me fast! Probabilistic methods in real-time streaming (Hyperloglog, Bloom filters) Kendrick Lo Insight Data Engineering, NYC Summer 2016
  • 2. Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Unique User ID Unique User ID Unique User ID Unique User ID ... ... ? real-time viewing data
  • 3. Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Ad ID Unique User ID Time stamp Unique User ID Unique User ID Unique User ID Unique User ID ... ... ? 13 MB 100 million uniques bitmap (for exact counting) 4 KB billions of uniques hyperloglog real-time viewing data
  • 4. Hyperloglog Count-distinct problem (a.k.a. cardinality estimation problem) ● counting unique elements in a data stream with repeated elements ● calculates an approximate number ○ typical error purported to be less than < 2% What it can’t do: ● give an exact count ● track frequency of occurrence ● confirm whether a certain element was seen
  • 5. Hyperloglog - a probabilistic method General Idea: Count leading zeros in a randomly generated binary number Given a random number, what is the probability of seeing…? 1 x x x x x x x x… → 0.5 (1 out of every 2) 0 1 x x x x x x x… → 0.25 (1 out of every 4) 0 0 1 x x x x x x… → 0.125 (1 out of every 8) … 0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128) ...
  • 6. Hyperloglog - a probabilistic method General Idea: Count leading zeros in a randomly generated binary number Given a random number, what is the probability of seeing…? 1 x x x x x x x x… → 0.5 (1 out of every 2) 0 1 x x x x x x x… → 0.25 (1 out of every 4) 0 0 1 x x x x x x… → 0.125 (1 out of every 8) … 0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128) ... Question: I have a list of N unique numbers. The one with the longest string of leading zeros is 0 0 0 0 0 0 1 x x… What is N?
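The question on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the deck's implementation: `hash32` and `rough_estimate` are hypothetical helper names, and SHA-1 truncated to 32 bits stands in for whatever hash a real library uses to turn IDs into random-looking numbers.

```python
import hashlib

def leading_zeros(value: int, bits: int = 32) -> int:
    """Number of leading zero bits in a `bits`-wide unsigned integer."""
    return bits - value.bit_length()

def hash32(item: str) -> int:
    """Map an ID to a pseudo-random 32-bit number (assumed hash choice)."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:4], "big")

def rough_estimate(items) -> int:
    """Guess N from the longest run of leading zeros seen in the hashes.

    A pattern with k leading zeros followed by a 1 appears about once in
    every 2**(k + 1) random numbers, so that is the guess for N.
    """
    longest = max(leading_zeros(hash32(x)) for x in items)
    return 2 ** (longest + 1)

# The slide's example: six leading zeros followed by a 1.
example = 1 << 25                        # 0000001 followed by 25 zero bits
guess = 2 ** (leading_zeros(example) + 1)  # 128
```

A single maximum like this is very noisy, because one lucky hash can double the guess, which is why the method goes on to average over many groups.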
  • 7. Hyperloglog [Figure: groups of incoming IDs, each group yielding a leading-zero count: 5, 6, 7, 4, 6, 8, ...] (harmonic) MEAN: 6 => 128 unique viewers
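The averaging idea on this slide can be sketched as a small Hyperloglog in Python. This is a simplified illustration under stated assumptions, not Algebird's code: `TinyHLL` is a hypothetical class, SHA-1 truncated to 64 bits is an assumed hash, and the constants follow the published Hyperloglog estimator (bias constant for 128 or more registers, with the small-range linear-counting correction).

```python
import hashlib
import math

class TinyHLL:
    """Minimal Hyperloglog sketch with 2**p registers (use p >= 7)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Position of the first 1-bit in `rest` (leading zeros + 1).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        # Harmonic mean of 2**register across all registers, bias-corrected.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

hll = TinyHLL(p=12)
for i in range(50_000):
    hll.add(f"user-{i % 25_000}")   # every ID appears twice
estimate = hll.count()              # approximate count of the 25,000 distinct IDs
```

Repeated IDs hash to the same register and the same rank, so duplicates never change the state; that is what makes the structure a distinct counter.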
  • 8. Pipeline [Figure: pipeline diagram] Record fields: Ad ID, Unique User ID, Gender, Age segment, Time stamp. Algebird; 4 x m4.large; 1 sec mini-batches. Pushed 1 billion records with unique user IDs
  • 9. ● Throughput can reach an average of 5M records/min ● Streams of <1M records processed within a minute
  • 11. ● After >1M uniques, delays accumulate causing system instability when using sets
  • 12. Extension: counting unique viewers in a subgroup ● Associating segments with user IDs ○ Challenge: Can we avoid database accesses when processing data in real-time? ○ Bloom filter: another fixed-size probabilistic data structure that trades off (tunable) accuracy for size, e.g. Bloom filter + Hyperloglog, count males, error: 1.2% ○ needed to overcome challenges in combining aspects of Spark (batch) and Spark Streaming Sample record: Ad ID, Unique User ID, Gender, Age segment (e.g. 18-34), Time stamp
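A Bloom filter of the kind described on this slide can be sketched as follows. This is a minimal illustration, not the project's code: `TinyBloom` is a hypothetical class, and deriving the k bit positions from one SHA-256 digest by double hashing is an assumed design choice.

```python
import hashlib

class TinyBloom:
    """Fixed-size set membership with false positives but no false negatives."""

    def __init__(self, m_bits: int, k: int):
        self.m = m_bits                        # size of the bit array
        self.k = k                             # number of hash positions per item
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))

# Hypothetical usage: remember which user IDs belong to the "male" segment,
# then test incoming IDs in the stream without a database lookup.
males = TinyBloom(m_bits=73_000, k=5)
for i in range(10_000):
    males.add(f"user-{i}")
seen = "user-42" in males        # True: members are always reported as present
```

An ID that was never added is usually reported as absent, but can occasionally be reported as present; that false-positive rate is the accuracy being traded for size.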
  • 14. About me Master of Science, Harvard University Computational Science and Engineering (graduated May 2016) J.D. / MBA, University of Toronto Bachelor of Applied Science, University of Toronto Engineering Science (Computer) Thank you for listening!
  • 18. Results: error rate in counts ● Error < 2% for subgroups; slightly higher for main group ● Error for intersection calculation (purple) tends to be higher on average
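One plausible reason the intersection error runs higher is that an intersection is not counted directly by Hyperloglog; a common approach is inclusion-exclusion over three separate estimates. The deck does not say how the intersection was computed, so the sketch below is an assumption, and the counts in it are made-up numbers for illustration only.

```python
def intersection_estimate(count_a: float, count_b: float, count_union: float) -> float:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - count_union

# Hypothetical true counts (not from the deck).
true_a, true_b, true_union = 500_000, 400_000, 700_000
true_both = intersection_estimate(true_a, true_b, true_union)    # 200,000

# Suppose each estimate is off by only 1%, in unlucky directions.
est_both = intersection_estimate(505_000, 396_000, 707_000)      # 194,000
relative_error = abs(est_both - true_both) / true_both           # 0.03
```

Here three 1% errors combine into a 3% error on the intersection, because the absolute errors add up while the intersection itself is smaller than any of the three inputs.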
  • 19. Use cases ● Advertising ○ ad viewership, website views, television viewership, app engagement, etc. ● Any application where you would want to count a large number of unique things fast ○ stock trades, network traffic, Twitter responses, election data, real-time voting data, etc. ● Well suited to real-time analytics ○ intermediate state of HLL structure provides for a running count ○ trivially parallelizable Sample record: Ad ID, Unique User ID, Gender, Age segment (e.g. 18-34), Time stamp
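The "trivially parallelizable" point can be illustrated with plain register arrays: two Hyperloglog states built on different partitions merge by taking the element-wise maximum. This is a minimal sketch under assumptions, with hypothetical helper names `registers_for` and `merge`, an assumed SHA-1 based hash, and 2^10 registers picked arbitrarily.

```python
import hashlib

P = 10                 # assumed precision: 2**10 registers
M = 1 << P

def registers_for(items):
    """Build Hyperloglog registers for one partition of the stream."""
    regs = [0] * M
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - P)                        # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)           # remaining bits
        rank = (64 - P) - rest.bit_length() + 1    # position of first 1-bit
        regs[idx] = max(regs[idx], rank)
    return regs

def merge(a, b):
    """Combine two register arrays: keep the larger value in each slot."""
    return [max(x, y) for x, y in zip(a, b)]

ids = [f"user-{i}" for i in range(5_000)]
part_1 = registers_for(ids[:3_000])       # worker 1
part_2 = registers_for(ids[2_000:])       # worker 2, overlapping IDs
merged = merge(part_1, part_2)
same_as_single_pass = merged == registers_for(ids)   # True
```

Because the merge is a maximum, it does not matter how the stream is split or whether partitions overlap, and the merged state at any moment gives the running count.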
  • 20. Future exploration ● Associating segments with user IDs ○ quantifying incremental error associated with introduction of Bloom filters ● Apache Storm versus Spark ○ Does Storm (a “pure” streaming technology) perform much better? ● Spark DataFrames API ○ seemed to introduce significant delay: would like to quantify this
  • 21. Bloom Filters ● Experiment with 1 million records ○ Employed 2 bloom filters (1 MB each), one for each segment (male, 18-34) to store segment data to be matched with incoming user IDs, continued processing with Hyperloglog ○ estimated error for hyperloglog: 2%; estimated error for bloom filter: 3% ● Actual error: ○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9% ○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6% ● Time to process: ○ Bloom filter + Hyperloglog: 17s (+55%) ○ Hyperloglog only: 11s
  • 23. Tuning Probabilistic Structures Hyperloglog (source: Twitter Algebird source code: HyperLogLog.scala) Bloom Filters (source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/) e.g. n = 1 M (capacity), p = 0.03 (error) => k = 5 (# of hash functions) => m = 891 kB
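The Bloom filter numbers on this slide are consistent with the standard sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below is an assumption about how the figures were derived; `bloom_parameters` is a hypothetical helper, and "kB" is read here as 1024 bytes.

```python
import math

def bloom_parameters(n: int, p: float):
    """Return (bits, hash_count) for capacity n and false-positive rate p."""
    m_bits = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m_bits / n * math.log(2))
    return m_bits, k

m_bits, k = bloom_parameters(n=1_000_000, p=0.03)   # about 7.3 million bits, k = 5
size_kb = m_bits / 8 / 1024                         # about 891 (in units of 1024 bytes)
```

Lowering p grows the filter only logarithmically: each halving of the error rate costs about 1.44 extra bits per stored element.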