In applications, it’s typical to have some analytics dashboard highlighting the number of unique items such as unique users or unique visits. While traditional COUNT(DISTINCT) queries work well most of the time for such use cases, it has some drawbacks while working on large data sets which result in large memory requirements and/or slow execution time. It is also not easy to use traditional COUNT(DISTINCT) queries in a distributed environment.
In this talk, we will focus on the HyperLogLog (HLL) algorithm and its PostgreSQL extension postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries within mathematically provable error bounds. It is not only fast and memory-efficient but also has very interesting properties which especially shine in a distributed environment. In this talk, Burak Y. from Citus Data (also maintainer of postgresql-hll extension) will talk about internals of HLL and how it estimates cardinality; Jarred from IronNet Cybersecurity will showcase HLL in PostgreSQL with real world examples and use cases from his experience of running HLL in production. We promise that at the end of this session, you will fall in love with this fun little data structure as the newest tool in your data science and analytics tool belt.
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
What is HyperLogLog and Why You Will Love It | PostgreSQL Conference Europe 2018 | Burak Yucesoy
1. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What is HyperLogLog and
Why You Will Love It
Burak Yücesoy
2. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
• Number of unique elements (cardinality) in given data
• Useful to find things like…
• Number of unique users visited your web page
• Number of unique products in your inventory
What is COUNT(DISTINCT)?
2
3. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What is COUNT(DISTINCT)?
3
logins
username | date
----------+-----------
Alice | 2018-10-02
Bob | 2018-10-03
Alice | 2018-10-05
Eve | 2018-10-07
Bob | 2018-10-07
Bob | 2018-10-08
• Number of logins: 6
• Number of unique users who log in: 3
4. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
• Slow
• High memory footprint
• Cannot work with appended/streaming data
Problems with Traditional COUNT(DISTINCT)
4
5. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
HyperLogLog(HLL) is faster alternative to COUNT(DISTINCT) with low
memory footprint;
• Approximation algorithm
• Estimates cardinality (i.e. COUNT(DISTINCT) ) of given data
• Mathematically provable error bounds
• It can estimate cardinalities well beyond 109
with 1% error rate using only 6 KB of memory
There is better way!
5
6. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
It depends...
Is it OK to approximate?
6
7. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Is it OK to approximate?
7
• Count # of unique felonies associated to a person; Not OK
• Count # of unique visits to my web page; OK
8. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
HLL
• Very fast
• Low memory footprint
• Can work with streaming data
• Can merge estimations of two separate datasets efficiently
8
9. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
How does HLL work?
Steps;
1. Hash all elements
a. Ensures uniform data distribution
b. Can treat all data types same
2. Observing rare bit patterns
3. Stochastic averaging
9
10. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
How does HLL work? - Observing rare bit patterns
hash
Alice 645403841
binary
0010...001
Number of leading zeros: 2
Maximum number of leading zeros: 2
10
11. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
How does HLL work? - Observing rare bit patterns
hash
Bob 1492309842
binary
0101...010
Number of leading zeros: 1
Maximum number of leading zeros: 2
11
12. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
How does HLL work? - Observing rare bit patterns
...
Maximum number of leading zeros: 7
Cardinality Estimation: 27
12
13. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
How does HLL work? Stochastic Averaging
Measuring same thing repeatedly and taking average.
13
14. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 201814
15. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 201815
16. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
How does HLL work? Stochastic Averaging
Data
Partition 1
Partition 3
Partition 2
7
5
12
228.968...
Estimation
27
25
212
16
17. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
How does HLL work? Stochastic Averaging
01000101...010
First m bits to decide
partition number
Remaining bits to
count leading zeros
17
18. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Error rate of HLL
• Typical Error Rate: 1.04 / sqrt(number of partitions)
• Memory need is number of partitions * log(log(max. value in hash space)) bit
• Can estimate cardinalities well beyond 109
with 1% error rate while using a
memory of only 6 kilobytes
• Memory vs accuracy tradeoff
18
19. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Why does HLL work?
It turns out, combination of lots of bad observation is a
good observation
19
20. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Some interesting examples
Alice
Alice
Alice
…
…
…
Alice
Partition 1
Partition 8
Partition 2
0
2
0
1.103...
Harmonic
Mean
20
22
20
hash
Alice 645403841
binary
00100110...001
... ... ...
20
21. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Some interesting examples
Charlie
Partition 1
Partition 8
Partition 2
29
0
0
1.142...
Harmonic
Mean
229
20
20
hash
Charlie 0
binary
00000000...000
... ... ...
21
22. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
HLL in PostgreSQL
● https://github.com/citusdata/postgresql-hll
22
23. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
postgresql-hll uses a data structure, also called hll to keep maximum number of
leading zeros of each partition.
• Use hll_hash_bigint to hash elements.
• There are some other functions for other common data types.
• Use hll_add_agg to aggregate hashed elements into hll data structure.
• Use hll_cardinality to materialize hll data structure to actual distinct count.
HLL in PostgreSQL
23
24. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Real Time Dashboard with
HyperLogLog
24
25. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Precomputed aggregates for period of time and set of dimensions;
What is Rollup?
25
26. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What is Rollup?
CREATE TABLE rollup_events_5min (
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
event_count bigint,
device_distinct_count bigint,
session_distinct_count bigint,
minute timestamp
);
CREATE TABLE events (
id bigint,
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
device_id bigint,
session_id bigint,
timestamp timestamp
);
26
27. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What is Rollup?
CREATE TABLE rollup_events_5min (
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
event_count bigint,
device_distinct_count bigint,
session_distinct_count bigint,
minute timestamp
);
CREATE TABLE events (
id bigint,
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
device_id bigint,
session_id bigint,
timestamp timestamp
);
27
28. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What is Rollup?
CREATE TABLE rollup_events_5min (
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
event_count bigint,
device_distinct_count bigint,
session_distinct_count bigint,
minute timestamp
);
CREATE TABLE events (
id bigint,
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
device_id bigint,
session_id bigint,
timestamp timestamp
);
28
29. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What is Rollup?
CREATE TABLE rollup_events_5min (
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
event_count bigint,
device_distinct_count bigint,
session_distinct_count bigint,
minute timestamp
);
CREATE TABLE events (
id bigint,
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
device_id bigint,
session_id bigint,
timestamp timestamp
);
29
30. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
30
INSERT INTO rollup_events_5min
SELECT
customer_id,
event_type,
country,
browser,
COUNT(*) AS event_count,
COUNT (DISTINCT device_id) AS device_distinct_count,
COUNT (DISTINCT session_id) AS session_distinct_count,
date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <=$2
GROUP BY customer_id, event_type, country, browser, minute
What is Rollup?
30
31. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
31
INSERT INTO rollup_events_5min
SELECT
customer_id,
event_type,
country,
browser,
COUNT(*) AS event_count,
COUNT (DISTINCT device_id) AS device_distinct_count,
COUNT (DISTINCT session_id) AS session_distinct_count,
date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <=$2
GROUP BY customer_id, event_type, country, browser, minute
What is Rollup?
31
32. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
32
INSERT INTO rollup_events_5min
SELECT
customer_id,
event_type,
country,
browser,
COUNT(*) AS event_count,
COUNT (DISTINCT device_id) AS device_distinct_count,
COUNT (DISTINCT session_id) AS session_distinct_count,
date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <=$2
GROUP BY customer_id, event_type, country, browser, minute
What is Rollup?
32
33. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
33
INSERT INTO rollup_events_5min
SELECT
customer_id,
event_type,
country,
browser,
COUNT(*) AS event_count,
COUNT (DISTINCT device_id) AS device_distinct_count,
COUNT (DISTINCT session_id) AS session_distinct_count,
date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <=$2
GROUP BY customer_id, event_type, country, browser, minute
What is Rollup?
33
34. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
• Fast & indexed lookups of aggregates
• Avoid expensive repeated computations
• Rollups are compact (uses less space) and can be kept over longer periods
• Rollups can be further aggregated
Benefit of Rollup Tables
34
35. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What if I want to get aggregation result for 1 hour period?
SELECT
customer_id,
event_type,
country,
browser,
SUM (event_count) AS event_count,
SUM (device_distinct_count) AS device_distinct_count,
SUM (session_distinct_count) AS session_distinct_count,
date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS
hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, minute
35
36. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What if I want to get aggregation result for 1 hour period?
SELECT
customer_id,
event_type,
country,
browser,
SUM (event_count) AS event_count,
SUM (device_distinct_count) AS device_distinct_count,
SUM (session_distinct_count) AS session_distinct_count,
date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS
hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, minute
36
37. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What if I want to get aggregation result for 1 hour period?
SELECT
customer_id,
event_type,
country,
browser,
SUM (event_count) AS event_count,
SUM (device_distinct_count) AS device_distinct_count,
SUM (session_distinct_count) AS session_distinct_count,
date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS
hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, minute
37
38. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Rollup Table with HLL
CREATE TABLE rollup_events_5min (
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
event_count bigint,
device_distinct_count hll,
session_distinct_count hll,
minute timestamp
);
CREATE TABLE rollup_events_5min (
customer_id bigint,
event_type varchar,
country varchar,
browser varchar,
event_count bigint,
device_distinct_count bigint,
session_distinct_count bigint,
minute timestamp
);
38
39. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
39
INSERT INTO rollup_events_5min
SELECT
customer_id,
event_type,
country,
browser,
COUNT(*) AS event_count,
hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count,
date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <=$2
GROUP BY customer_id, event_type, country, browser, minute
Rollup Table with HLL
39
40. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What if I want to get aggregation result for 1 hour period?
SELECT
customer_id,
event_type,
country,
browser,
SUM (event_count) AS event_count,
hll_union_agg (device_distinct_count) AS device_distinct_count,
hll_union_agg (session_distinct_count) AS session_distinct_count,
date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS
hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, minute
40
41. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Interval 1
Interval 1
Partition 1
Interval 1
Partition 3
Interval 1
Partition 2
7
5
12
HLL(7, 5, 12)
Intermediate
Result
How to Merge COUNT(DISTINCT) with HLL
41
42. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Interval 2
Interval 2
Partition 1
Interval 2
Partition 3
Interval 2
Partition 2
11
7
8
HLL(11, 7, 8)
Intermediate
Result
How to Merge COUNT(DISTINCT) with HLL
42
43. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
11
7
12
1053.255
Estimation
211
27
212
HLL(11, 7, 8)
HLL(7, 5, 12)
HLL(11, 7, 12)
hll_union_agg
How to Merge COUNT(DISTINCT) with HLL
43
44. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
Interval 1
+
Interval 2
Interval 1
Partition 1(7)
+
Interval 2
Partition 1(11)
11
7
12
1053.255
Estimation
Interval 1
Partition 2(5)
+
Interval 2
Partition 2(7)
Interval 1
Partition 3(12)
+
Interval 2
Partition 4(8)
How to Merge COUNT(DISTINCT) with HLL
44
45. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
• What if ...
• Without hll, you would have to maintain 2n
- 1 rollup tables to cover all
combinations in n columns (multiply this with number of time intervals).
45
Rollup Table with HLL
45
46. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What Happens in Distributed
Scenario?
46
47. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
1. Separate data into shards.
events_001 events_002 events_003
postgresql-hll in distributed environment
47
48. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
2. Put shards into separate nodes.
Worker
Node 1
Coordinator
Worker
Node 2
Worker
Node 3
events_001 events_002 events_003
postgresql-hll in distributed environment
48
49. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
3. For each shard, calculate hll (but do not materialize).
postgresql-hll in distributed environment
Shard 1
Shard 1
Partition 1
Shard 1
Partition 3
Shard 1
Partition 2
7
5
12
HLL(7, 5, 12)
Intermediate
Result
49
50. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
4. Pull intermediate results to a single node.
Worker
Node 1
events_001
Coordinator
Worker
Node 2
events_002
Worker
Node 3
events_003
HLL(6, 4, 11) HLL(10, 6, 7) HLL(7, 12, 5)
postgresql-hll in distributed environment
50
51. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
5. Merge separate hll data structures and materialize them
11
13
12
10532.571...
211
213
212
HLL(11, 7, 8)
HLL(7, 5, 12)
HLL(11, 13, 12)
HLL(8, 13, 6)
postgresql-hll in distributed environment
51
52. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
burak@citusdata.com
Thanks
&
Questions
@byucesoy
Burak Yücesoy
www.citusdata.com @citusdata