Large-scale real-time analytics for everyone
Pavel Kalaidin
My slides from Highload Strategy conference in Vilnius.
Data & Analytics
1.
Large-scale real-time analytics for everyone: fast, cheap and 98% correct
2.
Pavel Kalaidin @facultyofwonder
3.
We have a lot of data
Memory is limited
One pass would be great
Constant update time
4.
max, min, mean are trivial
5.
median, anyone?
6.
Sampling?
7.
Probabilistic algorithms
8.
Estimate is OK but
nice to know how error is distributed
9.
def frugal(stream):
    m = 0
    for val in stream:
        if val > m:
            m += 1
        elif val < m:
            m -= 1
    return m
10.
Memory used: 1 int!

def frugal(stream):
    m = 0
    for val in stream:
        if val > m:
            m += 1
        elif val < m:
            m -= 1
    return m

It really works
11.
12.
Percentiles?
13.
Demo: bit.ly/frugalsketch

import numpy as np

def frugal_1u(stream, m=0, q=0.5):
    # q is the target quantile: 0.5 is the median, 0.9 the 90th percentile
    for val in stream:
        r = np.random.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m
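A quick way to check the frugal estimator is to run it on synthetic data and compare against the exact quantiles. The sketch below is only an illustration: it swaps NumPy for the standard library's `random` module, and the `rng` parameter, the seeds and the uniform test data are assumptions made here, not part of the slides.

```python
import random

def frugal_1u(stream, m=0, q=0.5, rng=None):
    """Frugal-1U quantile estimate: one integer of state, unit steps."""
    rng = rng or random.Random()
    for val in stream:
        r = rng.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m

# Synthetic stream: uniform integers in [0, 1000], fixed seed for repeatability.
data_rng = random.Random(42)
data = [data_rng.randint(0, 1000) for _ in range(200_000)]

median_est = frugal_1u(data, q=0.5, rng=random.Random(1))
p90_est = frugal_1u(data, q=0.9, rng=random.Random(2))

exact = sorted(data)  # the exact answer needs the whole data set in memory
print("median:", median_est, "exact:", exact[len(exact) // 2])
print("p90:   ", p90_est, "exact:", exact[int(0.9 * len(exact))])
```

The estimate wanders around the true quantile rather than settling on it, which is the price of keeping a single integer of state.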
14.
Streaming + probabilistic = sketch
15.
What do we want? Get the number of unique users, a.k.a. the cardinality
16.
What do we want? Get the number of unique users grouped by host, date, segment
17.
When do we want it? Well, right now
18.
Data: 10^10 elements, 10^9 unique, int32, 40 GB
19.
Straightforward approach: hash table
20.
Hash table: 4 GB
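The two sizes follow from 4 bytes per int32. The arithmetic below is a rough check, assuming decimal gigabytes and counting only the keys of the hash table (a real hash table adds overhead on top).

```python
BYTES_PER_INT32 = 4

total_elements = 10**10   # every element in the stream
unique_elements = 10**9   # distinct values a hash table has to hold

raw_data_bytes = total_elements * BYTES_PER_INT32
hash_table_bytes = unique_elements * BYTES_PER_INT32  # keys only, no overhead

print(raw_data_bytes / 10**9, "GB")    # 40.0 GB
print(hash_table_bytes / 10**9, "GB")  # 4.0 GB
```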
21.
HyperLogLog: 1.5 KB, 2% error
22.
It all starts with an algorithm called LogLog
23.
Imagine I tell you I spent this morning flipping a coin
24.
and now I tell you the longest uninterrupted run of heads
25.
2 times or 100 times
26.
In which case did I flip the coin for longer?
27.
We are interested in patterns in hashes (namely the longest runs of leading zeros = heads)
28.
Hash, don’t sample!*
* You need a good hash function
29.
Expecting:
0xxxxxx hashes: ~50%
1xxxxxx hashes: ~50%
00xxxxx hashes: ~25%
30.
Estimate: 2^R, where R is the longest run of leading zeros in the hashes
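The single-run estimate can be sketched in a few lines. This is a minimal illustration, assuming SHA-1 truncated to 32 bits as the "good hash function"; the helper names `hash32`, `leading_zeros` and `crude_estimate` are made up for this example.

```python
import hashlib

HASH_BITS = 32

def hash32(item):
    """Map an item to a 32-bit integer (SHA-1 truncated: an assumption here)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeros(x, bits=HASH_BITS):
    """Number of leading zero bits of x when written with `bits` bits."""
    return bits - x.bit_length()

def crude_estimate(stream):
    """Return 2^R, where R is the longest run of leading zeros seen so far."""
    longest_run = 0
    for item in stream:
        longest_run = max(longest_run, leading_zeros(hash32(item)))
    return 2 ** longest_run

print(crude_estimate(range(10_000)))
```

Duplicates hash to the same value, so they cannot change R; that is why this counts unique items. On its own the estimate is always a power of two and very noisy, which motivates the averaging introduced next.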
31.
I can perform several flipping experiments
32.
and average the number of zeros
33.
This is called stochastic averaging
34.
So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes
35.
We will be using M buckets
36.
Estimate: ɑ · M · 2^(mean of R over the M buckets), where ɑ is a normalization constant
37.
LogLog -> SuperLogLog
38.
LogLog -> SuperLogLog -> HyperLogLog: arithmetic mean -> harmonic mean, plus a couple of tweaks
39.
Standard error is 1.04/sqrt(M), where M is the number of buckets
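Putting buckets, the harmonic mean and the error formula together, a compact HyperLogLog might look like the sketch below. It is not the code from the talk: it assumes a 64-bit hash taken from SHA-1, uses the usual convention of storing the position of the first 1-bit (leading zeros + 1) in each bucket, and applies the standard constant and small-range correction from the HyperLogLog paper. The class and parameter names are illustrative.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=11):
        self.p = p                    # bits used to pick a bucket
        self.m = 1 << p               # M = 2^p buckets
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")            # 64-bit hash
        bucket = h >> (64 - self.p)                      # first p bits
        rest = h & ((1 << (64 - self.p)) - 1)            # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1     # leading zeros + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)            # valid for M >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:                # small-range tweak
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog(p=11)
true_count = 100_000
for i in range(true_count):
    hll.add(i)
    hll.add(i)  # duplicates do not change the sketch

approx = hll.estimate()
print("estimate:", round(approx))
print("relative error:", abs(approx - true_count) / true_count)
print("expected standard error:", 1.04 / math.sqrt(hll.m))  # about 0.023 for M = 2048
```

With M = 2048 the formula on the slide gives a standard error of roughly 2.3%, so a single run landing a few percent off is normal.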
40.
LogLog -> SuperLogLog -> HyperLogLog -> HyperLogLog++ (Google, 2013): 32 bit -> 64 bit + fixes for low cardinality. bit.ly/HLLGoogle
41.
LogLog -> SuperLogLog -> HyperLogLog -> HyperLogLog++ -> Discrete Max-Count (Facebook, 2014). bit.ly/DiscreteMaxCount
42.
Large scale?
43.
Suppose we have two HLL sketches: let’s take the maximum value from each pair of corresponding buckets
44.
The resulting sketch has no loss in accuracy!
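The merge can be checked directly: building one sketch over the union gives exactly the same buckets as merging two sketches built separately. The snippet is a minimal illustration that works on raw bucket lists; the bucket count, the SHA-1 based hash and the function names are assumptions made for this example.

```python
import hashlib

P = 10          # bits used to pick a bucket
M = 1 << P      # number of buckets

def hll_registers(items):
    """Build the bucket array of an HLL sketch (no estimation step)."""
    regs = [0] * M
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket = h >> (64 - P)
        rest = h & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
        regs[bucket] = max(regs[bucket], rank)
    return regs

def merge(regs_a, regs_b):
    """Union of two sketches: bucket-wise maximum."""
    return [max(a, b) for a, b in zip(regs_a, regs_b)]

a = hll_registers(range(0, 6_000))        # e.g. users seen on one server
b = hll_registers(range(4_000, 10_000))   # users seen on another, overlapping

print(merge(a, b) == hll_registers(range(10_000)))  # True
```

Because each bucket only ever stores a maximum, sketches computed on different machines or for different days can be combined after the fact.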
45.
What do we want? How many unique users belong to two segments?
46.
HLL intersection
47.
Inclusion-exclusion principle
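For two segments A and B the principle reads |A ∩ B| = |A| + |B| - |A ∪ B|, where the union estimate comes from the merged sketch. The numbers below are hypothetical and only show why small intersections are hard: the errors of the three estimates do not cancel, so a small error on a large union turns into a large relative error on the intersection.

```python
def intersection_estimate(card_a, card_b, card_union):
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return card_a + card_b - card_union

# Hypothetical true cardinalities, for illustration only.
true_a, true_b, true_union = 1_000_000, 1_000_000, 1_900_000
true_intersection = intersection_estimate(true_a, true_b, true_union)  # 100_000

# Suppose only the union estimate is off, by +2% (integer arithmetic).
union_est = true_union + true_union * 2 // 100   # 1_938_000
noisy = intersection_estimate(true_a, true_b, union_est)

relative_error = abs(noisy - true_intersection) / true_intersection
print(noisy, relative_error)  # 62000 0.38
```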
48.
Credits: http://research.neustar.biz/2012/12/17/hll-intersections-2/
49.
50.
Python code: bit.ly/hloglog
51.
What do we want? Get the churn rate
52.
Straightforward: feed new data to a new sketch
53.
Sliding-window HyperLogLog
54.
We maintain a list of tuples (timestamp, R), where R is a possible maximum over some future time window
55.
Values that no longer make sense are automatically discarded from the list
56.
57.
One list per bucket
58.
Take the maximum R over the given timeframe from the past, then estimate as we do in a regular HLL
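The per-bucket list described above can be sketched as follows. This is one reading of the slides, not the reference implementation: the class name, the `max_window` parameter and the assumption that timestamps arrive in non-decreasing order are all choices made for this example. An entry is dropped when it is too old, or when a newer entry has an R at least as large (it can then never be the maximum of any future window).

```python
class BucketWindow:
    """List of (timestamp, R) pairs for one bucket of a sliding-window HLL."""

    def __init__(self, max_window):
        self.max_window = max_window
        self.entries = []  # timestamps increase, R values strictly decrease

    def add(self, timestamp, r):
        # Keep only entries that are recent enough and not dominated by r.
        self.entries = [
            (t, old_r) for t, old_r in self.entries
            if t > timestamp - self.max_window and old_r > r
        ]
        self.entries.append((timestamp, r))

    def max_r(self, now, window):
        """Largest R seen in the interval (now - window, now]."""
        return max((r for t, r in self.entries if t > now - window), default=0)

bucket = BucketWindow(max_window=100)
bucket.add(1, 5)
bucket.add(2, 3)
bucket.add(3, 4)   # discards (2, 3): it is older and not larger
print(bucket.entries)                    # [(1, 5), (3, 4)]
print(bucket.max_r(now=3, window=10))    # 5
print(bucket.max_r(now=3, window=1))     # 4
```

A full sketch would keep one such object per bucket, call `max_r` on each for the requested window, and feed the resulting values into the regular HLL estimate.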
59.
Extra memory is required
60.
All the details: bit.ly/SlidingHLL
61.
Hash, don’t sample
Estimate, not precise
Save memory
Streaming
This slide is the sketch of the talk
62.
63.
Lots of sketches for various purposes: percentiles, heavy hitters, similarity, other stream statistics
64.
Have we seen this user before?
65.
Bloom filter
66.
[Figure: Bloom filter. Item i is hashed by h1, h2, ..., hk into a bit array; the selected bits are set to 1, all others stay 0.]
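A Bloom filter along the lines of the figure can be sketched like this. The sizes, the class name and the way the k hash functions are derived (SHA-1 with a different prefix per function) are assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes      # k in the figure
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Simulate k independent hash functions by prefixing the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("user-42")
print("user-42" in seen)  # True
```

The answer "not seen" is always right, while "seen" can be wrong with a probability that depends on the number of bits, the number of hash functions and how many items were added.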
67.
How many times did we see a user?
68.
Count-Min sketch is the answer: bit.ly/CountMinSketch
69.
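[Figure: Count-Min sketch. A table of d rows and w columns; item i is hashed by h1, ..., hd, and one counter per row is incremented by 1.]
Estimate: take the minimum of the d values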
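The update and the estimate can be sketched as follows. As with the other examples, the table size, the class name and the SHA-1 based row hashes are illustrative assumptions.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=4):
        self.width = width    # w columns
        self.depth = depth    # d rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One (row, column) pair per hash function.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))

visits = CountMinSketch()
for user in ["alice", "bob", "alice", "alice"]:
    visits.add(user)

print(visits.estimate("alice"))  # at least 3, never less
```

The estimate can overcount when other items collide in all d rows, but it never undercounts.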
70.
Percentiles
71.
Frugal sketching is not precise enough
72.
Sorting is a pain
73.
Distribute incoming values to buckets?
74.
Some sort of clustering, maybe
75.
T-Digest
76.
77.
Size is log(n), error is relative to q(1-q)
78.
Code: bit.ly/T-Digest-Java bit.ly/T-Digest-Python
79.
This is a growing field of computer science: stay tuned!
80.
81.
Thanks and happy sketching!
82.
Reading list:
Neustar Research blog: bit.ly/NRsketches
Sketches overview: bit.ly/SketchesOverview
Lecture notes on streaming algorithms: bit.ly/streaming-lectures
83.
Bonus: HyperLogLog in SQL: bit.ly/HLLinSQL