SlideShare a Scribd company logo
1 of 83
Download to read offline
Large-scale
real-time
analytics for everyone:
fast, cheap and 98% correct
Pavel Kalaidin
@facultyofwonder
we have a lot of data
memory is limited
one pass would be great
constant update time
max, min, mean is trivial
median, anyone?
Sampling?
Probabilistic algorithms
Estimate is OK
but nice to know how error is
distributed
def frugal(stream):
m = 0
for val in stream:
if val > m:
m += 1
elif val < m:
m -= 1
return m
Memory used - 1 int!
def frugal(stream):
m = 0
for val in stream:
if val > m:
m += 1
elif val < m:
m -= 1
return m
It really works
Percentiles?
Demo: bit.ly/frugalsketch
def frugal_1u(stream, m=0, q=0.5):
for val in stream:
r = np.random.random()
if val > m and r > 1 - q:
m += 1
elif val < m and r > q:
m -= 1
return m
Streaming + probabilistic =
sketch
What do we want?
Get the number of unique users
aka cardinality number
What do we want?
Get the number of unique users
grouped by host, date, segment
When do we want?
Well, right now
Data:
1010
elements,
109
unique int32
40Gb
Straight-forward approach:
hash-table
Hash-table:
4Gb
HyperLogLog:
1.5Kb, 2% error
It all starts with an algorithm
called LogLog
Imagine I tell you I spent this
morning flipping a coin
and now tell you what was
the longest non-interrupting
run of heads
2 times
or
100 times
When I flipped a coin for
longer time?
We are interested in patterns
in hashes
(namely the longest runs of
leading zeros = heads)
Hash, don’t sample!*
* need a good hash function
Expecting:
0xxxxxx hashes - ~50%
1xxxxxx hashes - ~50%
00xxxxx hashes - ~25%
estimate - 2R
,
where R - is a longest run of
leading zeros in hashes
I can perform several flipping
experiments
and average the number of
zeros
This is called stochastic
averaging
So far the estimate is 2R
,
where R is a is a longest run
of leading zeros in hashes
We will be using M buckets
where ɑ is a normalization constant
LogLog
SuperLogLog
LogLog
SuperLogLog
HyperLogLog
arithmetic mean -> harmonic mean
plus a couple of tweaks
Standard error is 1.04/sqrt
(M),
where M is the number of
buckets
LogLog
SuperLogLog
HyperLogLog
HyperLogLog++
Google, 2013
32 bit -> 64 bit + fixes for low cardinality
bit.ly/HLLGoogle
LogLog
SuperLogLog
HyperLogLog
HyperLogLog++
Discrete Max-Count
Facebook, 2014
bit.ly/DiscreteMaxCount
Large scale?
Suppose we have two HLL-
sketches, let’s take a
maximum value from
corresponding buckets
Resulting sketch has no loss
in accuracy!
What do we want?
how many unique users belong to two
segments?
HLL intersection
Inclusion-exclusion principle
credits: http://research.neustar.
biz/2012/12/17/hll-intersections-2/
Python code:
bit.ly/hloglog
What do we want?
Get the churn rate
Straight forward: feed new
data to a new sketch
Sliding-window HyperLogLog
We maintain a list of tuples
(timestamp, R), where R is a
possible maximum over
future time
Values that are no longer
make sense are
automatically discarded from
the list
One list per bucket
Take a maximum R over the
given timeframe from the
past, then estimate as we do
in a regular HLL
Extra memory is required
All the details:
bit.ly/SlidingHLL
hash, don’t sample
estimate, not precise
save memory
streaming
this slide is the sketch of the talk
Lots of sketches for various
purposes:
percentiles,
heavy hitters,
similarity,
other stream statistics
Have we seen this user
before?
Bloom filter
i
h
1
h
2
h
k
1 1 10 0 0 0 0 0 0 0 0 0 0 0 0
How many time did we see a
user?
Count-Min sketch is the
answer:
bit.ly/CountMinSketch
w
i
+1
+1
+1
h1
h4
hd
d
Estimate - take minimum from d values
Percentiles
Frugal sketching is not that
precise enough
Sorting is pain
Distribute incoming values to
buckets?
Some sort of clustering,
maybe
T-Digest
Size is log(n),
error is relative to q(1-q)
Code:
bit.ly/T-Digest-Java
bit.ly/T-Digest-Python
This is a growing field of
computer science:
stay tuned!
Thanks
and
happy sketching!
Reading list:
Neustar Research blog:
bit.ly/NRsketches
Sketches overview:
bit.ly/SketchesOverview
Lecture notes on streaming algorithms:
bit.ly/streaming-lectures
Bonus:
HyperLogLog in SQL:
bit.ly/HLLinSQL

More Related Content

Similar to Large-scale real-time analytics for everyone

Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Andrii Gakhov
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityAndrii Gakhov
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetupsource{d}
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015Fastly
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
Introduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in PythonIntroduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in PythonPeadar Coyle
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Introduction to the basics of Python programming (part 1)
Introduction to the basics of Python programming (part 1)Introduction to the basics of Python programming (part 1)
Introduction to the basics of Python programming (part 1)Pedro Rodrigues
 
Dataworkz odsc london 2018
Dataworkz odsc london 2018Dataworkz odsc london 2018
Dataworkz odsc london 2018Olaf de Leeuw
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIAI Frontiers
 
ISTA 130 Lab 21 Turtle ReviewHere are all of the turt.docx
ISTA 130 Lab 21 Turtle ReviewHere are all of the turt.docxISTA 130 Lab 21 Turtle ReviewHere are all of the turt.docx
ISTA 130 Lab 21 Turtle ReviewHere are all of the turt.docxpriestmanmable
 
HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016Ehsan Totoni
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdDatabricks
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3abramsm
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyArnab Bhadury
 

Similar to Large-scale real-time analytics for everyone (20)

HyperLogLog and friends
HyperLogLog and friendsHyperLogLog and friends
HyperLogLog and friends
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
Python faster for loop
Python faster for loopPython faster for loop
Python faster for loop
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
tutorial.ppt
tutorial.ppttutorial.ppt
tutorial.ppt
 
Introduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in PythonIntroduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in Python
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Introduction to the basics of Python programming (part 1)
Introduction to the basics of Python programming (part 1)Introduction to the basics of Python programming (part 1)
Introduction to the basics of Python programming (part 1)
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Dataworkz odsc london 2018
Dataworkz odsc london 2018Dataworkz odsc london 2018
Dataworkz odsc london 2018
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AI
 
ISTA 130 Lab 21 Turtle ReviewHere are all of the turt.docx
ISTA 130 Lab 21 Turtle ReviewHere are all of the turt.docxISTA 130 Lab 21 Turtle ReviewHere are all of the turt.docx
ISTA 130 Lab 21 Turtle ReviewHere are all of the turt.docx
 
HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors Simultaneously
 

Recently uploaded

Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 

Recently uploaded (20)

Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 

Large-scale real-time analytics for everyone