Burak Yucesoy | Citus Data | PGConf EU
Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL
Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL | PGConf EU 2017 | Burak Yucesoy

Running SELECT COUNT(DISTINCT) on your database is all too common. In applications, it is typical to have an analytics dashboard highlighting the number of unique items, such as unique users or unique visits. While traditional SELECT COUNT(DISTINCT) queries work well in single-machine setups, the problem is difficult to solve in distributed systems. With this type of query, you cannot just push the query to the workers and add up the results, because there will most likely be overlapping records on different workers.
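The overlap problem described above can be illustrated with a minimal Python sketch. The shard contents are copied from the logins_001, logins_002 and logins_003 tables shown later in the slides (the version in slide 8); the variable names are only illustrative.

```python
# Usernames per shard, as shown in the slide transcript (slide 8).
shards = {
    "logins_001": ["Alice", "Bob", "Charlie", "Eve"],
    "logins_002": ["Bob", "Bob", "Dave", "Alice"],
    "logins_003": ["Dave", "Eve", "Charlie", "Charlie"],
}

# COUNT(*) can be pushed down: per-shard row counts simply add up.
total_rows = sum(len(rows) for rows in shards.values())

# COUNT(DISTINCT) cannot: summing per-shard distinct counts double-counts
# every user who appears on more than one worker.
naive_distinct = sum(len(set(rows)) for rows in shards.values())

# The exact answer needs the union of all values in one place.
true_distinct = len(set().union(*shards.values()))

print(total_rows, naive_distinct, true_distinct)  # 12 10 5
```

Summing the per-worker results gives 10, while only 5 distinct usernames exist, which is why the coordinator cannot simply add up worker results.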

In this talk, we will focus on the HyperLogLog (HLL) algorithm and its PostgreSQL extension, postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries within mathematically provable error bounds. It is not only fast and memory-efficient but also has very interesting properties that especially shine in a distributed environment. During the talk, we will first look at the internals of HLL to understand why the algorithm is useful for solving the distinct count problem in a scalable way, and then at how it can be applied in a distributed fashion. Finally, we will see some examples of HLL usage.
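As a rough illustration of the internals mentioned above, here is a minimal HLL sketch in Python. It is not the postgresql-hll implementation. It assumes the textbook formulation of the algorithm: SHA-256 truncated to 64 bits as the hash, the first bits of the hash as the partition number, a register value of leading zeros plus one, the standard bias constant, and linear counting for small cardinalities. The class name `TinyHLL` and helper `hash64` are hypothetical. The slides use a simplified version (plain leading-zero counts and a plain harmonic mean), so the numbers differ slightly from theirs.

```python
import hashlib
import math


def hash64(value):
    """Map any value to a uniformly distributed 64-bit integer (assumed hash)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TinyHLL:
    def __init__(self, index_bits=10):
        self.index_bits = index_bits
        self.m = 1 << index_bits          # number of partitions (registers)
        self.registers = [0] * self.m

    def add(self, value):
        h = hash64(value)
        rest_bits = 64 - self.index_bits
        idx = h >> rest_bits              # first bits decide the partition
        rest = h & ((1 << rest_bits) - 1)
        # Rank = leading zeros in the remaining bits, plus one.
        rank = rest_bits - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank    # keep only the maximum per partition

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw
```

With 1024 partitions the typical error is about 1.04 / sqrt(1024), roughly 3%. Adding the same element repeatedly never changes the registers, so duplicates do not inflate the estimate.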


  1. Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL
  2. What is COUNT(DISTINCT)?
     ● Number of unique elements (cardinality) in given data
     ● Useful to find things like:
       ○ Number of unique users who visited your web page
       ○ Number of unique products in your inventory
  3. What is distributed COUNT(DISTINCT)?
     A coordinator and three worker nodes, each holding one shard: Worker Node 1 (logins_001), Worker Node 2 (logins_002), Worker Node 3 (logins_003).
  4. Why do we need distributed COUNT(DISTINCT)?
     ● Your data is too big to fit in the memory of a single machine
     ● The naive approach to COUNT(DISTINCT) needs too much memory
  5. Why is distributed COUNT(DISTINCT) difficult?
     Coordinator: SELECT COUNT(*) FROM logins;
     Each worker (Worker Node 1 with logins_001, Worker Node 2 with logins_002, Worker Node 3 with logins_003): SELECT COUNT(*) FROM ...;
     The workers return 100, 200 and 300, and the coordinator adds them up to 600.
  6. Why is distributed COUNT(DISTINCT) difficult?
     Coordinator: SELECT COUNT(DISTINCT username) FROM logins;
     Each worker (Worker Node 1 with logins_001, Worker Node 2 with logins_002, Worker Node 3 with logins_003): SELECT COUNT(DISTINCT username) FROM ...;
  7. Why is distributed COUNT(DISTINCT) difficult?
     Worker Node 1, logins_001:
      username | date
     ----------+------------
      Alice    | 2017-01-02
      Bob      | 2017-01-03
      Charlie  | 2017-01-05
      Eve      | 2017-01-07
     Worker Node 2, logins_002:
      username | date
     ----------+------------
      Bob      | 2017-02-11
      Bob      | 2017-02-13
      Dave     | 2017-02-17
      Alice    | 2017-02-19
     Worker Node 3, logins_003:
      username | date
     ----------+------------
      Frank    | 2017-03-23
      Eve      | 2017-03-29
      Charlie  | 2017-03-02
      Charlie  | 2017-03-03
  8. Why is distributed COUNT(DISTINCT) difficult?
     Worker Node 1, logins_001:
      username | date
     ----------+------------
      Alice    | 2017-01-02
      Bob      | 2017-01-03
      Charlie  | 2017-01-05
      Eve      | 2017-01-07
     Worker Node 2, logins_002:
      username | date
     ----------+------------
      Bob      | 2017-02-11
      Bob      | 2017-02-13
      Dave     | 2017-02-17
      Alice    | 2017-02-19
     Worker Node 3, logins_003:
      username | date
     ----------+------------
      Dave     | 2017-03-23
      Eve      | 2017-03-29
      Charlie  | 2017-03-02
      Charlie  | 2017-03-03
  9. Some Possible Approaches
     ● Pull all distinct data to one node and count there. (Doesn’t scale)
     ● Repartition data on the fly. (Scales, but it’s very slow)
     ● Use HyperLogLog. (Scales and is fast)
  10. HyperLogLog (HLL)
     HLL is:
     ● Approximation algorithm
     ● Estimates cardinality of given data
     ● Mathematically proven error bounds
  11. Is it OK to approximate?
     It depends…
  12. HLL
     ● Very fast
     ● Low memory footprint
     ● Can work with streaming data
     ● Can merge estimations of two separate datasets efficiently
  13. How does HLL work?
     Steps:
     1. Hash all elements
        a. Ensures uniform data distribution
        b. Can treat all data types the same
     2. Observing rare bit patterns
     3. Stochastic averaging
  14. How does HLL work? - Observing rare bit patterns
     Alice: hash 645403841, binary 0010...001
     Number of leading zeros: 2
     Maximum number of leading zeros: 2
  15. How does HLL work? - Observing rare bit patterns
     Bob: hash 1492309842, binary 0101...010
     Number of leading zeros: 1
     Maximum number of leading zeros: 2
  16. How does HLL work? - Observing rare bit patterns
     ...
     Maximum number of leading zeros: 7
     Cardinality estimation: 2^7
  17. How does HLL work? Stochastic Averaging
     Measuring the same thing repeatedly and taking the average.
  18. (No text on this slide.)
  19. (No text on this slide.)
  20. How does HLL work? Stochastic Averaging
     The data is split into Partition 1, Partition 2 and Partition 3.
     Maximum number of leading zeros per partition: 7, 5, 12
     Per-partition estimates: 2^7, 2^5, 2^12
     Estimation: 228.968...
  21. How does HLL work? Stochastic Averaging
     Hash in binary: 01000101...010
     First m bits decide the partition number.
     Remaining bits are used to count leading zeros.
  22. Error rate of HLL is damn good
     ● Typical error rate: 1.04 / sqrt(number of partitions)
     ● Memory need is number of partitions * log(log(max. value in hash space)) bits
     ● Can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory
     ● Memory vs. accuracy tradeoff
  23. Why does HLL work?
     It turns out that a combination of lots of bad estimations is a good estimation.
  24. Some interesting examples
     Data: Alice, Alice, Alice, …, Alice
     Alice: hash 645403841, binary 00100110...001
     Partitions shown: 1, 2, 3
     Values: 0, 2, 0 (estimates 2^0, 2^2, 2^0)
     Harmonic mean: 1.103...
  25. Some interesting examples
     Data: Charlie
     Charlie: hash 0, binary 00000000...000
     Partitions shown: 1, 2, 8
     Values: 29, 0, 0 (estimates 2^29, 2^0, 2^0)
     Harmonic mean: 1.142...
  26. postgresql-hll
     ● https://github.com/aggregateknowledge/postgresql-hll
     ● https://github.com/citusdata/postgresql-hll
     ● Companies using postgresql-hll for their dashboards:
       ● Neustar
       ● Cloudflare
  27. postgresql-hll in single node
     postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition.
     ● Use hll_hash_bigint to hash elements.
       ○ There are other functions for other common data types.
     ● Use hll_add_agg to aggregate hashed elements into the hll data structure.
     ● Use hll_cardinality to materialize the hll data structure into the actual distinct count.
  28. What Happens in Distributed Scenario?
  29. How to merge COUNT(DISTINCT) with HLL
     Shard 1: Partition 1 = 7, Partition 2 = 5, Partition 3 = 12
     Intermediate result: HLL(7, 5, 12)
  30. How to merge COUNT(DISTINCT) with HLL
     Shard 2: Partition 1 = 11, Partition 2 = 7, Partition 3 = 8
     Intermediate result: HLL(11, 7, 8)
  31. How to merge COUNT(DISTINCT) with HLL
     hll_union_agg merges HLL(11, 7, 8) and HLL(7, 5, 12) into HLL(11, 7, 12).
     Per-partition estimates: 2^11, 2^7, 2^12
     Estimation: 1053.255
  32. How to merge COUNT(DISTINCT) with HLL
     Shard 1 + Shard 2:
     ● Shard 1 Partition 1 (7) + Shard 2 Partition 1 (11): 11
     ● Shard 1 Partition 2 (5) + Shard 2 Partition 2 (7): 7
     ● Shard 1 Partition 3 (12) + Shard 2 Partition 3 (8): 12
     Estimation: 1053.255
  33. postgresql-hll in distributed environment
     1. Separate data into shards.
     Shards: logins_001, logins_002, logins_003
  34. postgresql-hll in distributed environment
     2. Put shards into separate nodes.
     Coordinator; Worker Node 1 (logins_001), Worker Node 2 (logins_002), Worker Node 3 (logins_003)
  35. postgresql-hll in distributed environment
     3. For each shard, calculate hll (but do not materialize).
     Shard 1: Partition 1 = 7, Partition 2 = 5, Partition 3 = 12
     Intermediate result: HLL(7, 5, 12)
  36. postgresql-hll in distributed environment
     4. Pull intermediate results to a single node.
     Worker Node 1 (logins_001), Worker Node 2 (logins_002) and Worker Node 3 (logins_003) send their results to the coordinator: HLL(6, 4, 11), HLL(10, 6, 7), HLL(7, 12, 5)
  37. postgresql-hll in distributed environment
     5. Merge separate hll data structures and materialize them.
     HLL(11, 7, 8), HLL(7, 5, 12) and HLL(8, 13, 6) merge into HLL(11, 13, 12).
     Per-partition estimates: 2^11, 2^13, 2^12
     Estimation: 10532.571...
  38. postgresql-hll in distributed environment
     Or use Citus :)
  39. Thank You
     Burak Yucesoy | burak@citusdata.com | @byucesoy
     citusdata.com | @citusdata
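
The merge step shown in slides 29 to 37 can be sketched in a few lines of Python. This is only an illustration of the slides' simplified model, not of postgresql-hll internals: a sketch is a tuple of per-partition maximums, merging takes the element-wise maximum, and the estimate is the number of partitions times the harmonic mean of 2^value. The function names `merge` and `slide_estimate` are hypothetical.

```python
def merge(*sketches):
    """Union of sketches: keep the largest value seen for each partition."""
    return tuple(max(column) for column in zip(*sketches))


def slide_estimate(registers):
    """Simplified estimator from the slides: m * harmonic mean of 2^register."""
    m = len(registers)
    harmonic_mean = m / sum(2.0 ** -r for r in registers)
    return m * harmonic_mean


shard1 = (7, 5, 12)    # intermediate result of Shard 1 (slide 29)
shard2 = (11, 7, 8)    # intermediate result of Shard 2 (slide 30)

merged = merge(shard1, shard2)
print(merged)                          # (11, 7, 12)
print(slide_estimate(shard1))          # about 228.97
print(slide_estimate(merged))          # about 1053.26

# Three-way merge from slide 37.
merged3 = merge((11, 7, 8), (7, 5, 12), (8, 13, 6))
print(merged3, slide_estimate(merged3))  # (11, 13, 12), about 10532.57
```

Because the merge only needs the small per-partition arrays, each worker can ship its sketch to the coordinator instead of its distinct values, and the merged sketch is the same one that would have been built over all the data on a single node.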
