- 1. Burak Yucesoy | Citus Data | PGConf EU Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL
- 2. What is COUNT(DISTINCT)?
    ● Number of unique elements (cardinality) in the given data
    ● Useful for finding things like:
        ○ Number of unique users who visited your web page
        ○ Number of unique products in your inventory
- 3. What is distributed COUNT(DISTINCT)?
    A Coordinator node in front of three worker nodes, each holding one shard of the table:
    Worker Node 1 (logins_001), Worker Node 2 (logins_002), Worker Node 3 (logins_003)
- 4. Why do we need distributed COUNT(DISTINCT)?
    ● Your data is too big to fit in the memory of a single machine
    ● The naive approach to COUNT(DISTINCT) needs too much memory
- 5. Why is distributed COUNT(DISTINCT) difficult?
    Coordinator receives: SELECT COUNT(*) FROM logins;
    Each worker (Worker Node 1 with logins_001, Worker Node 2 with logins_002, Worker Node 3 with logins_003) runs: SELECT COUNT(*) FROM ...;
    The workers return 100, 200, and 300; the Coordinator adds them up to 600.
- 6. Why is distributed COUNT(DISTINCT) difficult?
    Coordinator receives: SELECT COUNT(DISTINCT username) FROM logins;
    Each worker (Worker Node 1 with logins_001, Worker Node 2 with logins_002, Worker Node 3 with logins_003) runs: SELECT COUNT(DISTINCT username) FROM ...;
    The per-worker results cannot simply be added up.
- 7. Why is distributed COUNT(DISTINCT) difficult?
    Worker Node 1, logins_001
     username |    date
    ----------+------------
     Alice    | 2017-01-02
     Bob      | 2017-01-03
     Charlie  | 2017-01-05
     Eve      | 2017-01-07

    Worker Node 2, logins_002
     username |    date
    ----------+------------
     Bob      | 2017-02-11
     Bob      | 2017-02-13
     Dave     | 2017-02-17
     Alice    | 2017-02-19

    Worker Node 3, logins_003
     username |    date
    ----------+------------
     Frank    | 2017-03-23
     Eve      | 2017-03-29
     Charlie  | 2017-03-02
     Charlie  | 2017-03-03
- 8. Why is distributed COUNT(DISTINCT) difficult?
    Worker Node 1, logins_001
     username |    date
    ----------+------------
     Alice    | 2017-01-02
     Bob      | 2017-01-03
     Charlie  | 2017-01-05
     Eve      | 2017-01-07

    Worker Node 2, logins_002
     username |    date
    ----------+------------
     Bob      | 2017-02-11
     Bob      | 2017-02-13
     Dave     | 2017-02-17
     Alice    | 2017-02-19

    Worker Node 3, logins_003
     username |    date
    ----------+------------
     Dave     | 2017-03-23
     Eve      | 2017-03-29
     Charlie  | 2017-03-02
     Charlie  | 2017-03-03
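
To make the difficulty concrete, the sketch below replays the three shards from slide 8 in plain Python (used here only for illustration; the talk itself is about PostgreSQL). The `shards` dictionary and the variable names are assumptions made for this example.

```python
# Rows from slide 8: one list of usernames per shard (dates omitted).
shards = {
    "logins_001": ["Alice", "Bob", "Charlie", "Eve"],
    "logins_002": ["Bob", "Bob", "Dave", "Alice"],
    "logins_003": ["Dave", "Eve", "Charlie", "Charlie"],
}

# COUNT(*) distributes: the total is the sum of the per-shard counts.
total_rows = sum(len(rows) for rows in shards.values())

# COUNT(DISTINCT) does not: a user present on several shards is counted
# once per shard, so the naive sum overshoots.
per_shard_distinct = {name: len(set(rows)) for name, rows in shards.items()}
naive_sum = sum(per_shard_distinct.values())

# The exact answer needs the union of all usernames in one place.
true_distinct = len(set().union(*shards.values()))

print(total_rows, per_shard_distinct, naive_sum, true_distinct)
```

Summing per-shard distinct counts overshoots because Alice, Bob, Charlie, Dave, and Eve each appear on more than one shard.
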
- 9. Some Possible Approaches
    ● Pull all distinct data to one node and count there. (Doesn’t scale)
    ● Repartition data on the fly. (Scales, but it’s very slow)
    ● Use HyperLogLog. (Scales and is fast)
- 10. HyperLogLog (HLL)
    HLL is:
    ● An approximation algorithm
    ● An estimator of the cardinality of given data
    ● Backed by mathematically proven error bounds
- 11. Is it OK to approximate? It depends…
- 12. HLL
    ● Very fast
    ● Low memory footprint
    ● Can work with streaming data
    ● Can merge estimations of two separate datasets efficiently
- 13. How does HLL work?
    Steps:
    1. Hash all elements
        a. Ensures uniform data distribution
        b. Can treat all data types the same way
    2. Observe rare bit patterns
    3. Apply stochastic averaging
- 14. How does HLL work? Observing rare bit patterns
    Alice: hash 645403841, binary 0010...001
    Number of leading zeros: 2
    Maximum number of leading zeros: 2
- 15. How does HLL work? Observing rare bit patterns
    Bob: hash 1492309842, binary 0101...010
    Number of leading zeros: 1
    Maximum number of leading zeros: 2
- 16. How does HLL work? Observing rare bit patterns
    ...
    Maximum number of leading zeros: 7
    Cardinality estimation: 2^7
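
The leading-zero bookkeeping on slides 14 to 16 can be sketched in a few lines of Python. The 32-bit hash width is an assumption (it is consistent with the hash values and bit patterns shown on the slides), and the function names are made up for this illustration.

```python
HASH_BITS = 32  # assumed hash width; consistent with the values on the slides


def leading_zeros(h: int, width: int = HASH_BITS) -> int:
    """Number of zero bits before the first 1 bit in a fixed-width hash."""
    return width - h.bit_length()


def naive_estimate(hashes) -> int:
    """Estimate cardinality as 2 ** (maximum number of leading zeros seen)."""
    return 2 ** max(leading_zeros(h) for h in hashes)


alice, bob = 645403841, 1492309842  # hash values from slides 14 and 15

print(leading_zeros(alice), leading_zeros(bob))  # 2 1
print(naive_estimate([alice, bob]))              # 4
```

A maximum of 7 leading zeros, as on slide 16, gives an estimate of 2^7 = 128. A single observation like this is very noisy, which is what the stochastic averaging step addresses.
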
- 17. How does HLL work? Stochastic Averaging
    Measuring the same thing repeatedly and taking the average.
- 20. How does HLL work? Stochastic Averaging
    The data is split into three partitions.
    Maximum number of leading zeros: Partition 1: 7, Partition 2: 5, Partition 3: 12
    Per-partition estimates: 2^7, 2^5, 2^12
    Estimation: 228.968...
- 21. How does HLL work? Stochastic Averaging
    Hash in binary: 01000101...010
    ● First m bits decide the partition number
    ● Remaining bits are used to count leading zeros
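
A minimal Python sketch of the two ideas above: splitting a hash into a partition number plus leading-zero count, and combining per-partition maxima into one estimate. The slides give only the final number; the combining formula used here (number of partitions times the harmonic mean of the per-partition estimates) is an assumption that reproduces the 228.968... from slide 20, and it leaves out the bias-correction constant of the full HyperLogLog algorithm. The 32-bit width and all function names are also assumptions.

```python
HASH_BITS = 32  # assumed hash width


def split_hash(h: int, prefix_bits: int):
    """Return (partition number, leading zeros in the remaining bits)."""
    rest_bits = HASH_BITS - prefix_bits
    partition = h >> rest_bits             # first prefix_bits bits of the hash
    rest = h & ((1 << rest_bits) - 1)      # remaining bits
    return partition, rest_bits - rest.bit_length()


def estimate(registers) -> float:
    """m * harmonic mean of 2**register, i.e. m**2 / sum(2**-register)."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


# Alice's hash from slide 14, using 2 prefix bits (4 partitions).
print(split_hash(645403841, 2))

# Per-partition maxima from slide 20.
print(estimate([7, 5, 12]))
```

Note that m prefix bits address 2^m partitions; the three partitions drawn on the slides are a simplification for illustration.
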
- 22. Error rate of HLL is damn good
    ● Typical error rate: 1.04 / sqrt(number of partitions)
    ● Memory needed: number of partitions * log(log(max. value in hash space)) bits
    ● Can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory
    ● Memory vs. accuracy tradeoff
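
The tradeoff can be tabulated with a short Python sketch. The error formula is the one quoted on the slide; the 6 bits per partition used for the memory figure is an assumption chosen for illustration, not a value taken from the talk.

```python
import math


def typical_error(partitions: int) -> float:
    """Typical relative error from the slide: 1.04 / sqrt(number of partitions)."""
    return 1.04 / math.sqrt(partitions)


def memory_bytes(partitions: int, bits_per_partition: int = 6) -> float:
    """Memory for the registers alone; 6 bits per partition is an assumption."""
    return partitions * bits_per_partition / 8


for prefix_bits in (10, 12, 13, 14):
    m = 2 ** prefix_bits
    print(m, f"{typical_error(m):.2%}", memory_bytes(m), "bytes")
```

Under these assumptions, 8192 partitions take 6144 bytes and give a typical error slightly above 1%, which is in line with the figures on the slide. Each 4x increase in partitions halves the error.
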
- 23. Why does HLL work?
    It turns out that a combination of many bad estimates is a good estimate.
- 24. Some interesting examples
    Input: Alice, Alice, Alice, …, Alice (the same element repeated)
    Alice: hash 645403841, binary 00100110...001
    Maximum number of leading zeros: Partition 1: 0, Partition 2: 2, Partition 3: 0
    Per-partition estimates: 2^0, 2^2, 2^0
    Harmonic mean: 1.103...
- 25. Some interesting examples
    Input: Charlie
    Charlie: hash 0, binary 00000000...000
    Maximum number of leading zeros: Partition 1: 29, Partition 2: 0, ..., Partition 8: 0
    Per-partition estimates: 2^29, 2^0, ..., 2^0
    Harmonic mean: 1.142...
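
Both examples can be checked with a small Python sketch. It assumes 8 partitions for the Charlie case, with every partition other than Partition 1 still at 0 (the slide shows only Partitions 1, 2, and 8); under that assumption the harmonic mean matches the 1.142... on the slide.

```python
def harmonic_mean(values) -> float:
    return len(values) / sum(1.0 / v for v in values)


# Duplicates: folding the same leading-zero count in again and again
# never changes a register, so repeated "Alice" rows cost nothing.
register = 0
for _ in range(1000):
    register = max(register, 2)
print(register)

# Outlier: one all-zero hash drives a single partition to 29.
# Assumed setup: 8 partitions, the other 7 still at 0.
registers = [29, 0, 0, 0, 0, 0, 0, 0]
estimates = [2 ** r for r in registers]

print(harmonic_mean(estimates))          # about 1.1428, as on the slide
print(sum(estimates) / len(estimates))   # arithmetic mean: about 67 million
```

The harmonic mean is dominated by the small values, so a single freak hash barely moves the result, whereas an arithmetic mean would be ruined by it.
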
- 26. postgresql-hll
    ● https://github.com/aggregateknowledge/postgresql-hll
    ● https://github.com/citusdata/postgresql-hll
    ● Companies using postgresql-hll for their dashboards:
        ○ Neustar
        ○ Cloudflare
- 27. postgresql-hll in single node
    postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition.
    ● Use hll_hash_bigint to hash elements.
        ○ There are similar functions for other common data types.
    ● Use hll_add_agg to aggregate hashed elements into the hll data structure.
    ● Use hll_cardinality to materialize the hll data structure into the actual distinct count.
- 28. What Happens in Distributed Scenario?
- 29. How to merge COUNT(DISTINCT) with HLL
    Shard 1
    Maximum number of leading zeros: Partition 1: 7, Partition 2: 5, Partition 3: 12
    Intermediate result: HLL(7, 5, 12)
- 30. How to merge COUNT(DISTINCT) with HLL
    Shard 2
    Maximum number of leading zeros: Partition 1: 11, Partition 2: 7, Partition 3: 8
    Intermediate result: HLL(11, 7, 8)
- 31. How to merge COUNT(DISTINCT) with HLL
    hll_union_agg merges HLL(11, 7, 8) and HLL(7, 5, 12) into HLL(11, 7, 12)
    Per-partition estimates: 2^11, 2^7, 2^12
    Estimation: 1053.257
- 32. How to merge COUNT(DISTINCT) with HLL
    Shard 1 + Shard 2, taking the larger value for each partition:
    Shard 1 Partition 1 (7) + Shard 2 Partition 1 (11): 11
    Shard 1 Partition 2 (5) + Shard 2 Partition 2 (7): 7
    Shard 1 Partition 3 (12) + Shard 2 Partition 3 (8): 12
    Estimation: 1053.257
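
The merge on slides 29 to 32 amounts to a partition-wise maximum. The Python below is an illustrative sketch of that idea only; it is not the implementation of hll_union_agg, and `merge` and `estimate` are hypothetical helper names. The estimate formula is the same assumption as before: number of partitions times the harmonic mean of the per-partition estimates, without the bias correction of the full algorithm.

```python
def merge(*sketches):
    """Union of HLL sketches: keep the larger register for each partition."""
    return [max(regs) for regs in zip(*sketches)]


def estimate(registers) -> float:
    """m**2 / sum(2**-register), i.e. m times the harmonic mean."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


shard_1 = [7, 5, 12]   # intermediate result of Shard 1
shard_2 = [11, 7, 8]   # intermediate result of Shard 2

merged = merge(shard_1, shard_2)
print(merged, estimate(merged))
```

Because the maximum is commutative and associative, shards can be merged in any order and in any grouping, which is what makes the intermediate results safe to ship between nodes.
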
- 33. postgresql-hll in distributed environment
    1. Separate data into shards: logins_001, logins_002, logins_003.
- 34. postgresql-hll in distributed environment
    2. Put shards into separate nodes: logins_001 on Worker Node 1, logins_002 on Worker Node 2, logins_003 on Worker Node 3, with a Coordinator in front.
- 35. postgresql-hll in distributed environment
    3. For each shard, calculate hll (but do not materialize).
    Shard 1, maximum number of leading zeros: Partition 1: 7, Partition 2: 5, Partition 3: 12
    Intermediate result: HLL(7, 5, 12)
- 36. postgresql-hll in distributed environment
    4. Pull intermediate results to a single node.
    Worker Nodes 1 to 3 (logins_001, logins_002, logins_003) send their intermediate results, HLL(6, 4, 11), HLL(10, 6, 7), and HLL(7, 12, 5), to the Coordinator.
- 37. postgresql-hll in distributed environment
    5. Merge separate hll data structures and materialize them.
    HLL(11, 7, 8), HLL(7, 5, 12), and HLL(8, 13, 6) merge into HLL(11, 13, 12)
    Per-partition estimates: 2^11, 2^13, 2^12
    Estimation: 10532.571...
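
The five steps can be simulated end to end in Python to show that merging per-shard sketches loses nothing compared with building one sketch over all rows. Everything here is an assumption made for the demo: the SHA-1 based `hash32` is a stand-in and not the hash postgresql-hll uses, 16 partitions is an arbitrary choice, and the register rule follows the simplified "maximum number of leading zeros" description from the slides.

```python
import hashlib

PREFIX_BITS = 4   # 2**4 = 16 partitions (arbitrary choice for the demo)
HASH_BITS = 32


def hash32(value: str) -> int:
    """Stand-in hash: first 4 bytes of SHA-1 (not what postgresql-hll uses)."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:4], "big")


def build_sketch(values):
    """Step 3: per-shard maxima of leading zeros, not yet materialized."""
    rest_bits = HASH_BITS - PREFIX_BITS
    registers = [0] * (1 << PREFIX_BITS)
    for v in values:
        h = hash32(v)
        partition = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        zeros = rest_bits - rest.bit_length()
        registers[partition] = max(registers[partition], zeros)
    return registers


def merge(*sketches):
    """Step 5: partition-wise maximum of the pulled intermediate results."""
    return [max(regs) for regs in zip(*sketches)]


# Steps 1 and 2: the data is already split into shards.
shards = [
    ["Alice", "Bob", "Charlie", "Eve"],
    ["Bob", "Bob", "Dave", "Alice"],
    ["Dave", "Eve", "Charlie", "Charlie"],
]

merged = merge(*(build_sketch(s) for s in shards))          # steps 3 to 5
single_node = build_sketch([u for s in shards for u in s])  # all rows on one node
print(merged == single_node)  # True
```

Only the small fixed-size sketches cross the network, never the rows themselves, and the coordinator ends up with exactly the sketch it would have built over the full table.
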
- 38. postgresql-hll in distributed environment
    Or use Citus :)
- 39. Thank You
    Burak Yucesoy | burak@citusdata.com | @byucesoy
    citusdata.com | @citusdata