Large-scale real-time analytics for everyone
Pavel Kalaidin
VK
Jan. 30, 2015
Data & Analytics
My slides from Highload Strategy conference in Vilnius.
1.
Large-scale real-time analytics for everyone: fast, cheap and 98% correct
2.
Pavel Kalaidin @facultyofwonder
3.
we have a lot of data
memory is limited
one pass would be great
constant update time
4.
max, min, mean are trivial
5.
median, anyone?
6.
Sampling?
7.
Probabilistic algorithms
8.
An estimate is OK, but it is nice to know how the error is distributed
9.
def frugal(stream):
    m = 0  # running estimate of the median
    for val in stream:
        if val > m:
            m += 1  # value above the estimate: step up
        elif val < m:
            m -= 1  # value below the estimate: step down
    return m
10.
Memory used by frugal(): 1 int! It really works.
12.
Percentiles?
13.
Demo: bit.ly/frugalsketch

import numpy as np

def frugal_1u(stream, m=0, q=0.5):
    # q is the target quantile, e.g. 0.5 for the median
    for val in stream:
        r = np.random.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m
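A small usage sketch of the frugal estimator above may help. It assumes a synthetic stream of uniform integers in 0..1000 and swaps NumPy for the standard-library `random` module with fixed seeds so that runs are repeatable; the name `frugal_quantile` and the `rng` parameter are illustrative, not part of the talk.

```python
import random


def frugal_quantile(stream, q=0.5, m=0, rng=None):
    """Frugal-1U style estimate of the q-quantile using one integer of state."""
    if rng is None:
        rng = random.Random()
    for val in stream:
        r = rng.random()
        if val > m and r > 1 - q:
            m += 1  # step up, taken with probability q
        elif val < m and r > q:
            m -= 1  # step down, taken with probability 1 - q
    return m


# Assumed test data: 200,000 uniform integers in [0, 1000].
data_rng = random.Random(42)
stream = [data_rng.randint(0, 1000) for _ in range(200_000)]

median_est = frugal_quantile(stream, q=0.5, rng=random.Random(1))
p90_est = frugal_quantile(stream, q=0.9, rng=random.Random(2))
print("median estimate:", median_est)
print("90th percentile estimate:", p90_est)
```

The estimate keeps wandering around the true quantile with step size 1, so it is a rough answer rather than an exact one.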
14.
Streaming + probabilistic = sketch
15.
What do we want? Get the number of unique users, a.k.a. the cardinality
16.
What do we want? Get the number of unique users grouped by host, date, segment
17.
When do we want it? Well, right now
18.
Data: 10^10 elements, 10^9 unique, int32: 40 GB
19.
Straightforward approach: hash table
20.
Hash table: 4 GB
21.
HyperLogLog: 1.5 KB, 2% error
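The sizes on these slides can be checked with a little arithmetic. The sketch below assumes 4 bytes per int32 and, for the HyperLogLog figure, 2048 buckets of 6 bits each; the bucket count and width are assumptions chosen to match the 1.5 KB figure, not numbers stated in the talk. The error formula 1.04/sqrt(M) is the one quoted on slide 39.

```python
import math

total_elements = 10**10
unique_elements = 10**9
bytes_per_int32 = 4

# Storing every element vs. storing only the distinct ones in a hash table
# (ignoring hash-table overhead).
raw_bytes = total_elements * bytes_per_int32
hash_table_bytes = unique_elements * bytes_per_int32

# Assumed HyperLogLog layout: 2048 buckets, 6 bits per bucket.
buckets = 2048
bits_per_bucket = 6
hll_bytes = buckets * bits_per_bucket // 8
hll_std_error = 1.04 / math.sqrt(buckets)

print(raw_bytes / 10**9, "GB raw")
print(hash_table_bytes / 10**9, "GB hash table")
print(hll_bytes / 1024, "KB HyperLogLog")
print(round(hll_std_error * 100, 1), "% standard error")
```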
22.
It all starts with an algorithm called LogLog
23.
Imagine I tell you I spent this morning flipping a coin
24.
and now I tell you the longest uninterrupted run of heads I got:
25.
2 times or 100 times
26.
In which case did I flip the coin for longer?
27.
We are interested in patterns in hashes (namely the longest runs of leading zeros = heads)
28.
Hash, don’t sample!*
* needs a good hash function
29.
Expecting:
0xxxxxx hashes: ~50%
1xxxxxx hashes: ~50%
00xxxxx hashes: ~25%
30.
Estimate: 2^R, where R is the longest run of leading zeros in the hashes
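The single-experiment estimate can be sketched in a few lines. This assumes MD5 truncated to 32 bits as the "good hash function"; the helper names `hash32`, `leading_zeros` and `naive_estimate` are made up for illustration.

```python
import hashlib


def hash32(item):
    """Map an item to a 32-bit integer (MD5 is an assumed stand-in hash)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def leading_zeros(value, width=32):
    """Number of leading zero bits of value in a width-bit word."""
    return width - value.bit_length()


def naive_estimate(stream):
    """Estimate cardinality as 2^R, R = longest run of leading zeros seen."""
    longest = max((leading_zeros(hash32(x)) for x in stream), default=0)
    return 2 ** longest


print(naive_estimate(f"user{i}" for i in range(10_000)))
```

Because the estimate is always a power of two and depends on a single lucky hash, it is very noisy, which is what motivates averaging over several experiments.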
31.
I can perform several flipping experiments
32.
and average the number of zeros
33.
This is called stochastic averaging
34.
So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes
35.
We will be using M buckets
36.
where ɑ is a normalization constant
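The transcript does not preserve the formula from this slide. A minimal sketch of a LogLog-style estimator with M buckets follows, assuming the usual form estimate = ɑ · M · 2^(mean of R over buckets), with ɑ ≈ 0.39701 (the asymptotic constant from the LogLog paper), R counted as leading zeros plus one, and SHA-1 truncated to 64 bits as the hash. All names here are illustrative.

```python
import hashlib

ALPHA_INF = 0.39701  # assumed asymptotic LogLog normalization constant


def hash64(item):
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def loglog_estimate(stream, p=10):
    """LogLog-style estimate using M = 2**p buckets (stochastic averaging)."""
    m = 1 << p
    rest_bits = 64 - p
    registers = [0] * m
    for item in stream:
        h = hash64(item)
        bucket = h >> rest_bits               # first p bits choose the bucket
        rest = h & ((1 << rest_bits) - 1)     # remaining bits are the "coin flips"
        rank = rest_bits - rest.bit_length() + 1  # leading zeros + 1
        if rank > registers[bucket]:
            registers[bucket] = rank
    mean_rank = sum(registers) / m
    return ALPHA_INF * m * 2 ** mean_rank


est = loglog_estimate(range(100_000))
print(round(est))
```

Each bucket acts as one flipping experiment, and averaging the exponents across buckets is what tames the variance of the single 2^R estimate.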
37.
LogLog, SuperLogLog
38.
LogLog, SuperLogLog, HyperLogLog: arithmetic mean -> harmonic mean, plus a couple of tweaks
39.
Standard error is 1.04/sqrt(M), where M is the number of buckets
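To see the harmonic mean and the 1.04/sqrt(M) figure in action, here is a minimal HyperLogLog sketch. It assumes SHA-1 truncated to 64 bits as the hash, the commonly used bias constant 0.7213 / (1 + 1.079 / M) (valid for M >= 128), and the linear-counting correction for small cardinalities; function names are illustrative and this is not the code linked from the talk.

```python
import hashlib
import math


def hash64(item):
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_registers(stream, p):
    """Build M = 2**p registers holding the max rank seen per bucket."""
    m = 1 << p
    rest_bits = 64 - p
    registers = [0] * m
    for item in stream:
        h = hash64(item)
        bucket = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1  # leading zeros + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers


def hll_estimate(registers):
    """Harmonic-mean estimate with a small-range (linear counting) fix."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)
    return raw


true_count = 100_000
results = {}
for p in (11, 14):
    est = hll_estimate(hll_registers(range(true_count), p))
    results[p] = abs(est - true_count) / true_count
    print(f"M={1 << p}: relative error {results[p]:.4f}, "
          f"expected about {1.04 / math.sqrt(1 << p):.4f}")
```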
40.
LogLog, SuperLogLog, HyperLogLog, HyperLogLog++ (Google, 2013): 32 bit -> 64 bit + fixes for low cardinality. bit.ly/HLLGoogle
41.
LogLog, SuperLogLog, HyperLogLog, HyperLogLog++, Discrete Max-Count (Facebook, 2014): bit.ly/DiscreteMaxCount
42.
Large scale?
43.
Suppose we have two HLL sketches: take the maximum value from each pair of corresponding buckets
44.
The resulting sketch has no loss in accuracy!
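The merge rule can be illustrated directly on register arrays. The sketch below assumes the same register layout as a basic HyperLogLog (SHA-1 truncated to 64 bits, 2^p buckets); `sketch` and `merge` are hypothetical helper names. It checks that merging two sketches gives exactly the registers of a sketch built over the combined data.

```python
import hashlib

P = 10  # assumed precision: 2**10 buckets


def hash64(item):
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(items, p=P):
    """Build HLL registers: max rank (leading zeros + 1) per bucket."""
    rest_bits = 64 - p
    registers = [0] * (1 << p)
    for item in items:
        h = hash64(item)
        bucket = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers


def merge(a, b):
    """Union of two sketches: bucket-wise maximum."""
    if len(a) != len(b):
        raise ValueError("sketches must use the same number of buckets")
    return [max(x, y) for x, y in zip(a, b)]


users_a = [f"user{i}" for i in range(0, 6000)]
users_b = [f"user{i}" for i in range(4000, 10000)]

merged = merge(sketch(users_a), sketch(users_b))
direct = sketch(users_a + users_b)
print(merged == direct)
```

This is what makes the approach scale out: sketches built on separate machines or for separate days can be combined afterwards without revisiting the raw data.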
45.
What do we want? How many unique users belong to two segments?
46.
HLL intersection
47.
Inclusion-exclusion principle
48.
credits: http://research.neustar.biz/2012/12/17/hll-intersections-2/
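Inclusion-exclusion gives |A ∩ B| = |A| + |B| - |A ∪ B|, where the union comes from merging the two sketches. A minimal sketch under assumed parameters (4096 buckets, SHA-1 as the hash, two synthetic segments overlapping in 60,000 users) is shown below; helper names are illustrative.

```python
import hashlib
import math

P = 12  # assumed precision: 4096 buckets


def hash64(item):
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(items):
    rest_bits = 64 - P
    registers = [0] * (1 << P)
    for item in items:
        h = hash64(item)
        bucket = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers


def estimate(registers):
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)  # small-range correction
    return raw


def union(a, b):
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(a, b):
    """|A ∩ B| ~ |A| + |B| - |A ∪ B| (inclusion-exclusion)."""
    return estimate(a) + estimate(b) - estimate(union(a, b))


segment_a = sketch(f"user{i}" for i in range(0, 80_000))
segment_b = sketch(f"user{i}" for i in range(20_000, 100_000))
both = intersection_estimate(segment_a, segment_b)
print(round(both), "users estimated in both segments (true value: 60000)")
```

The three estimation errors add up, so the result is only trustworthy when the intersection is reasonably large compared with the union; the credited post discusses this in more detail.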
50.
Python code: bit.ly/hloglog
51.
What do we want? Get the churn rate
52.
Straightforward: feed new data to a new sketch
53.
Sliding-window HyperLogLog
54.
We maintain a list of tuples (timestamp, R), where R is a possible maximum over a future time window
55.
Values that no longer make sense are automatically discarded from the list
57.
One list per bucket
58.
Take the maximum R over the given timeframe from the past, then estimate as we do in a regular HLL
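The per-bucket list can be sketched as follows. This is a simplified illustration for a single bucket, assuming timestamps arrive in non-decreasing order and that no query looks back further than `max_window`; the class name `SlidingBucket` and its methods are hypothetical and not taken from the linked paper.

```python
from collections import deque


class SlidingBucket:
    """One bucket of a sliding-window HLL: a list of possible future maxima."""

    def __init__(self, max_window):
        self.max_window = max_window
        # (timestamp, R) pairs, oldest first; R is strictly decreasing.
        self.entries = deque()

    def add(self, timestamp, r):
        # Entries older than the largest supported window are never queried.
        while self.entries and self.entries[0][0] < timestamp - self.max_window:
            self.entries.popleft()
        # An older entry with R <= r can never again be the maximum of a window.
        while self.entries and self.entries[-1][1] <= r:
            self.entries.pop()
        self.entries.append((timestamp, r))

    def max_r(self, now, window):
        """Maximum R among values seen in [now - window, now], or 0 if none."""
        start = now - window
        for ts, r in self.entries:
            if ts >= start:
                return r  # first entry inside the window is the largest
        return 0


bucket = SlidingBucket(max_window=10)
for ts, r in [(1, 3), (2, 5), (3, 2), (4, 4)]:
    bucket.add(ts, r)

print(list(bucket.entries))
print(bucket.max_r(now=4, window=1), bucket.max_r(now=4, window=2))
```

A full sketch would keep one such list per bucket and feed the per-bucket maxima into the regular HLL estimate.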
59.
Extra memory is required
60.
All the details: bit.ly/SlidingHLL
61.
hash, don’t sample
estimate, not precise
save memory
streaming
this slide is the sketch of the talk
63.
Lots of sketches for various purposes: percentiles, heavy hitters, similarity, other stream statistics
64.
Have we seen this user before?
65.
Bloom filter
66.
[Figure: Bloom filter. Item i is hashed by h1, h2, ..., hk, and the bits at those positions in a bit array of zeros are set to 1]
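A minimal Bloom filter sketch follows, assuming k hash functions derived from SHA-1 with different seeds and a plain Python list as the bit array; the sizes are arbitrary illustration values and the class is not from the talk.

```python
import hashlib


class BloomFilter:
    """Set membership with false positives but no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive k positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # "Maybe seen" only if every one of the k bits is set.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("alice")
seen.add("bob")
print("alice" in seen, "bob" in seen)
```

A "yes" from the filter means "probably seen"; a "no" is always correct.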
67.
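How many times did we see a user?
68.
Count-Min sketch is the answer: bit.ly/CountMinSketch
69.
[Figure: Count-Min sketch. A table of d rows (hash functions h1, ..., hd) by w columns; item i adds +1 to one counter in each row]
Estimate: take the minimum of the d values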
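The update and query rules can be sketched as below, assuming d row hashes derived from SHA-1 with a per-row seed and arbitrary illustration sizes for w and d; the class is a simplified stand-in, not the linked implementation.

```python
import hashlib


class CountMinSketch:
    """Approximate counts: never underestimates, may overestimate."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, chosen by a row-specific hash.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._columns(item))


visits = CountMinSketch()
for user in ["alice", "bob", "alice", "alice"]:
    visits.add(user)
print(visits.estimate("alice"), visits.estimate("bob"))
```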
70.
Percentiles
71.
Frugal sketching is not precise enough
72.
Sorting is a pain
73.
Distribute incoming values to buckets?
74.
Some sort of clustering, maybe
75.
T-Digest
77.
Size is log(n), error is relative to q(1-q)
78.
Code: bit.ly/T-Digest-Java, bit.ly/T-Digest-Python
79.
This is a growing field of computer science: stay tuned!
81.
Thanks and happy sketching!
82.
Reading list:
Neustar Research blog: bit.ly/NRsketches
Sketches overview: bit.ly/SketchesOverview
Lecture notes on streaming algorithms: bit.ly/streaming-lectures
83.
Bonus: HyperLogLog in SQL: bit.ly/HLLinSQL