Burak Yucesoy | Citus Data | PGConf EU
Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL
What is COUNT(DISTINCT)?
● Number of unique elements (cardinality) in given data
● Useful to find things like…
○ Number of unique users who visited your web page
○ Number of unique products in your inventory
What is distributed COUNT(DISTINCT)?
Coordinator
Worker Node 1: logins_001
Worker Node 2: logins_002
Worker Node 3: logins_003
Why do we need distributed COUNT(DISTINCT)?
● Your data is too big to fit in the memory of a single machine
● The naive approach to COUNT(DISTINCT) needs too much memory
Why is distributed COUNT(DISTINCT) difficult?
Coordinator: SELECT COUNT(*) FROM logins;
Each worker: SELECT COUNT(*) FROM ...;
Worker Node 1 (logins_001): 100
Worker Node 2 (logins_002): 200
Worker Node 3 (logins_003): 300
Coordinator result: 100 + 200 + 300 = 600
Why is distributed COUNT(DISTINCT) difficult?
Coordinator: SELECT COUNT(DISTINCT username) FROM logins;
Each worker: SELECT COUNT(DISTINCT username) FROM ...;
Worker Node 1: logins_001
Worker Node 2: logins_002
Worker Node 3: logins_003
Why is distributed COUNT(DISTINCT) difficult?
Worker Node 1 (logins_001)
 username | date
----------+------------
 Alice    | 2017-01-02
 Bob      | 2017-01-03
 Charlie  | 2017-01-05
 Eve      | 2017-01-07
Worker Node 2 (logins_002)
 username | date
----------+------------
 Bob      | 2017-02-11
 Bob      | 2017-02-13
 Dave     | 2017-02-17
 Alice    | 2017-02-19
Worker Node 3 (logins_003)
 username | date
----------+------------
 Frank    | 2017-03-23
 Eve      | 2017-03-29
 Charlie  | 2017-03-02
 Charlie  | 2017-03-03
Why is distributed COUNT(DISTINCT) difficult?
Worker Node 1 (logins_001)
 username | date
----------+------------
 Alice    | 2017-01-02
 Bob      | 2017-01-03
 Charlie  | 2017-01-05
 Eve      | 2017-01-07
Worker Node 2 (logins_002)
 username | date
----------+------------
 Bob      | 2017-02-11
 Bob      | 2017-02-13
 Dave     | 2017-02-17
 Alice    | 2017-02-19
Worker Node 3 (logins_003)
 username | date
----------+------------
 Dave     | 2017-03-23
 Eve      | 2017-03-29
 Charlie  | 2017-03-02
 Charlie  | 2017-03-03
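A small sketch of the problem, using the usernames from the three shards above. It only illustrates why per-shard distinct counts cannot be added up the way per-shard COUNT(*) results can; the dictionary layout and variable names are assumptions made for this example.

```python
# Shard contents copied from the slide above (usernames only).
shards = {
    "logins_001": ["Alice", "Bob", "Charlie", "Eve"],
    "logins_002": ["Bob", "Bob", "Dave", "Alice"],
    "logins_003": ["Dave", "Eve", "Charlie", "Charlie"],
}

# COUNT(*): per-shard row counts can simply be summed on the coordinator.
total_rows = sum(len(rows) for rows in shards.values())

# COUNT(DISTINCT): summing per-shard distinct counts double-counts users
# that appear on more than one worker.
sum_of_shard_distincts = sum(len(set(rows)) for rows in shards.values())

# The exact answer needs the union of the per-shard sets.
true_distinct = len(set().union(*shards.values()))

print(total_rows, sum_of_shard_distincts, true_distinct)  # 12 10 5
```

The exact answer requires shipping every distinct value to one place, which is the part that does not scale.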
Some Possible Approaches
● Pull all distinct data to one node and count there. (Doesn’t scale)
● Repartition data on the fly. (Scales but it’s very slow)
● Use HyperLogLog. (Scales and is fast)
HyperLogLog (HLL)
HLL is:
● Approximation algorithm
● Estimates cardinality of given data
● Mathematically proven error bounds
Is it OK to approximate?
It depends…
HLL
● Very fast
● Low memory footprint
● Can work with streaming data
● Can merge estimations of two separate datasets efficiently
How does HLL work?
Steps:
1. Hash all elements
a. Ensures uniform data distribution
b. Can treat all data types the same
2. Observe rare bit patterns
3. Apply stochastic averaging
How does HLL work? - Observing rare bit patterns
Alice: hash 645403841, binary 0010...001
Number of leading zeros: 2
Maximum number of leading zeros: 2
How does HLL work? - Observing rare bit patterns
Bob: hash 1492309842, binary 0101...010
Number of leading zeros: 1
Maximum number of leading zeros: 2
How does HLL work? - Observing rare bit patterns
...
Maximum number of leading zeros: 7
Cardinality Estimation: 2^7 = 128
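The observation step can be sketched as follows. This assumes the hash values shown on the slides are 32-bit numbers (the leading-zero counts of 2 and 1 agree with that); the function names are hypothetical and chosen for this example only.

```python
HASH_BITS = 32  # assumption: the slide's hash values are treated as 32-bit


def leading_zeros(h: int, bits: int = HASH_BITS) -> int:
    """Count the zero bits before the first 1 in a fixed-width value."""
    return bits - h.bit_length()


def estimate_from_max_zeros(max_zeros: int) -> int:
    """Single-observation estimate used on the slide: 2 ** (max leading zeros)."""
    return 2 ** max_zeros


# Hash values taken from the slides.
hashes = {"Alice": 645403841, "Bob": 1492309842}

max_zeros = 0
for name, h in hashes.items():
    max_zeros = max(max_zeros, leading_zeros(h))

print(max_zeros)                   # 2
print(estimate_from_max_zeros(7))  # 128
```

A single maximum is a very noisy estimator, which is what the stochastic averaging step addresses.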
How does HLL work? Stochastic Averaging
Measuring the same thing repeatedly and taking the average.
How does HLL work? Stochastic Averaging
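Data is split into partitions; each partition keeps its own maximum number of leading zeros.
Partition 1: 7 leading zeros, estimate 2^7
Partition 2: 5 leading zeros, estimate 2^5
Partition 3: 12 leading zeros, estimate 2^12
Estimation: 228.968...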
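The slide's number can be reproduced by multiplying the number of partitions by the harmonic mean of the per-partition estimates. The sketch below assumes exactly that formula; the function name is hypothetical, and the published HyperLogLog algorithm additionally applies a bias-correction constant that is left out here so the result matches the slide.

```python
def raw_estimate(max_zeros_per_partition):
    """Number of partitions times the harmonic mean of 2**zeros (no bias correction)."""
    m = len(max_zeros_per_partition)
    harmonic_mean = m / sum(2.0 ** -z for z in max_zeros_per_partition)
    return m * harmonic_mean


# Maximum leading zeros per partition, as on the slide.
print(round(raw_estimate([7, 5, 12]), 3))  # 228.969
```

The harmonic mean is dominated by the smaller estimates, so one partition with an unusually long run of zeros (here 2^12) does not blow up the result.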
How does HLL work? Stochastic Averaging
01000101...010
First m bits to decide partition number
Remaining bits to count leading zeros
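A minimal sketch of this split, assuming a 32-bit hash and a 3-bit prefix (so 8 partitions). `split_hash` and `prefix_bits` are hypothetical names for this example; the slide calls the prefix length m.

```python
HASH_BITS = 32  # assumption: 32-bit hash values


def split_hash(h: int, prefix_bits: int):
    """Return (partition number, leading zeros in the remaining bits)."""
    rest_bits = HASH_BITS - prefix_bits
    partition = h >> rest_bits             # first prefix_bits bits
    rest = h & ((1 << rest_bits) - 1)      # remaining bits
    zeros = rest_bits - rest.bit_length()  # leading zeros within the remainder
    return partition, zeros


# Alice's hash from the slides is 00100110...001 in binary:
# prefix 001 selects partition index 1, the remainder 00110... has 2 leading zeros.
print(split_hash(645403841, 3))  # (1, 2)
```

Using part of the same hash to pick the partition means each element is hashed only once, while the partitions still behave like independent measurements.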
Error rate of HLL is damn good
● Typical Error Rate: 1.04 / sqrt(number of partitions)
● Memory needed is number of partitions * log(log(max. value in hash space)) bits
● Can estimate cardinalities well beyond 10^9 with 1% error rate while using a memory of only 6 kilobytes
● Memory vs accuracy tradeoff
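The two formulas above can be checked with a few lines of arithmetic. The concrete numbers below (8192 partitions, a 64-bit hash space, base-2 logarithms) are assumptions picked because they land close to the 1% and 6 kilobyte figures on the slide; the slide itself does not state them.

```python
import math


def typical_error(partitions: int) -> float:
    """Typical relative error: 1.04 / sqrt(number of partitions)."""
    return 1.04 / math.sqrt(partitions)


def memory_bits(partitions: int, hash_bits: int) -> int:
    """partitions * log2(log2(max hash value)); log2(max hash value) is hash_bits."""
    bits_per_partition = math.ceil(math.log2(hash_bits))
    return partitions * bits_per_partition


partitions = 8192  # assumption for illustration
hash_bits = 64     # assumption for illustration

print(round(typical_error(partitions) * 100, 2))   # 1.15 (percent)
print(memory_bits(partitions, hash_bits) // 8)     # 6144 (bytes)
```

Quadrupling the number of partitions halves the typical error and quadruples the memory, which is the tradeoff named in the last bullet.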
Why does HLL work?
It turns out that a combination of many bad estimates is a good estimate.
Some interesting examples
Input: Alice, Alice, Alice, ..., Alice
Alice: hash 645403841, binary 00100110...001
Partition 1: 0 leading zeros, estimate 2^0
Partition 2: 2 leading zeros, estimate 2^2
Partition 3: 0 leading zeros, estimate 2^0
...
Harmonic mean: 1.103...
Some interesting examples
Input: Charlie
Charlie: hash 0, binary 00000000...000
Partition 1: 29 leading zeros, estimate 2^29
Partition 2: 0 leading zeros, estimate 2^0
...
Partition 8: 0 leading zeros, estimate 2^0
Harmonic mean: 1.142...
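Both harmonic means can be reproduced under the assumption that these examples use 8 partitions (the "Partition 8" label suggests this, and the numbers agree). The helper name is hypothetical.

```python
def harmonic_mean_of_estimates(max_zeros_per_partition):
    """Harmonic mean of the per-partition estimates 2**zeros."""
    m = len(max_zeros_per_partition)
    return m / sum(2.0 ** -z for z in max_zeros_per_partition)


# Assumption: 8 partitions in both examples.
# Repeating "Alice" only ever touches one partition, whose maximum stays at 2.
alice_only = [0, 2, 0, 0, 0, 0, 0, 0]
# A single element whose hash is 0 puts 29 leading zeros into one partition.
charlie_zero_hash = [29, 0, 0, 0, 0, 0, 0, 0]

print(round(harmonic_mean_of_estimates(alice_only), 3))         # 1.103
print(round(harmonic_mean_of_estimates(charlie_zero_hash), 3))  # 1.143
```

Duplicates do not inflate the result, and even the worst possible outlier in one partition barely moves the harmonic mean.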
postgresql-hll
● https://github.com/aggregateknowledge/postgresql-hll
● https://github.com/citusdata/postgresql-hll
● Companies using postgresql-hll for their dashboards
○ Neustar
○ Cloudflare
postgresql-hll on a single node
postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition.
● Use hll_hash_bigint to hash elements.
○ There are other hash functions for other common data types.
● Use hll_add_agg to aggregate hashed elements into the hll data structure.
● Use hll_cardinality to materialize the hll data structure into the distinct count.
What happens in a distributed scenario?
How to merge COUNT(DISTINCT) with HLL
Shard 1
Partition 1: 7
Partition 2: 5
Partition 3: 12
Intermediate result: HLL(7, 5, 12)
How to merge COUNT(DISTINCT) with HLL
Shard 2
Partition 1: 11
Partition 2: 7
Partition 3: 8
Intermediate result: HLL(11, 7, 8)
How to merge COUNT(DISTINCT) with HLL
hll_union_agg: HLL(7, 5, 12) + HLL(11, 7, 8) → HLL(11, 7, 12)
Per-partition estimates: 2^11, 2^7, 2^12
Estimation: 1053.257...
How to merge COUNT(DISTINCT) with HLL
Shard 1 + Shard 2
Partition 1: Shard 1 (7) + Shard 2 (11) → 11
Partition 2: Shard 1 (5) + Shard 2 (7) → 7
Partition 3: Shard 1 (12) + Shard 2 (8) → 12
Estimation: 1053.257...
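The merge shown on these slides amounts to taking the per-partition maximum. The sketch below illustrates that idea in plain Python; it is not the postgresql-hll implementation of hll_union_agg, and `merge` and `raw_estimate` are hypothetical helper names. The estimate uses the same uncorrected formula as the earlier slides (number of partitions times the harmonic mean).

```python
def merge(a, b):
    """Merge two sketches by keeping the larger value in each partition."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of partitions")
    return [max(x, y) for x, y in zip(a, b)]


def raw_estimate(max_zeros_per_partition):
    """Number of partitions times the harmonic mean of 2**zeros (no bias correction)."""
    m = len(max_zeros_per_partition)
    return m * m / sum(2.0 ** -z for z in max_zeros_per_partition)


shard_1 = [7, 5, 12]  # HLL(7, 5, 12)
shard_2 = [11, 7, 8]  # HLL(11, 7, 8)

merged = merge(shard_1, shard_2)
print(merged)                          # [11, 7, 12]
print(round(raw_estimate(merged), 2))  # 1053.26
```

Because a maximum over the union of two datasets equals the maximum of the two per-dataset maxima, merging loses no information compared with building one sketch over all the data.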
postgresql-hll in distributed environment
1. Separate data into shards.
logins_001 logins_002 logins_003
postgresql-hll in distributed environment
2. Put shards on separate nodes.
Coordinator
Worker Node 1: logins_001
Worker Node 2: logins_002
Worker Node 3: logins_003
postgresql-hll in distributed environment
3. For each shard, calculate hll (but do not materialize).
Shard 1
Partition 1: 7
Partition 2: 5
Partition 3: 12
Intermediate result: HLL(7, 5, 12)
postgresql-hll in distributed environment
4. Pull intermediate results to a single node.
Coordinator
Worker Node 1 (logins_001): HLL(6, 4, 11)
Worker Node 2 (logins_002): HLL(10, 6, 7)
Worker Node 3 (logins_003): HLL(7, 12, 5)
postgresql-hll in distributed environment
5. Merge separate hll data structures and materialize them.
HLL(7, 5, 12) + HLL(11, 7, 8) + HLL(8, 13, 6) → HLL(11, 13, 12)
Per-partition estimates: 2^11, 2^13, 2^12
Estimation: 10532.571...
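The five steps can be sketched end to end in plain Python. This is a toy model under stated assumptions, not how postgresql-hll or Citus implement it: SHA-256 truncated to 32 bits stands in for the extension's hash functions, only 4 partitions are used for readability, and `hash32`, `sketch`, and `merge` are hypothetical names.

```python
import hashlib

HASH_BITS = 32   # assumption: 32-bit hash values
PREFIX_BITS = 2  # assumption: 2**2 = 4 partitions, kept tiny for readability


def hash32(value: str) -> int:
    """Stand-in hash: first 4 bytes of SHA-256 (not the hash postgresql-hll uses)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


def sketch(values):
    """Per-partition maximum of leading zeros, i.e. the intermediate result."""
    rest_bits = HASH_BITS - PREFIX_BITS
    registers = [0] * (1 << PREFIX_BITS)
    for v in values:
        h = hash32(v)
        partition = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        registers[partition] = max(registers[partition], rest_bits - rest.bit_length())
    return registers


def merge(sketches):
    """Coordinator side: per-partition maximum across all intermediate results."""
    return [max(column) for column in zip(*sketches)]


# Steps 1-2: data separated into shards, one per worker (usernames from the slides).
shards = {
    "logins_001": ["Alice", "Bob", "Charlie", "Eve"],
    "logins_002": ["Bob", "Bob", "Dave", "Alice"],
    "logins_003": ["Dave", "Eve", "Charlie", "Charlie"],
}

# Step 3: each worker builds its sketch without materializing a count.
intermediate = [sketch(rows) for rows in shards.values()]

# Steps 4-5: the coordinator pulls the sketches and merges them.
merged = merge(intermediate)

# The merged sketch is identical to one built over all rows on a single node.
single_node = sketch([u for rows in shards.values() for u in rows])
print(merged == single_node)  # True
```

Only the small fixed-size sketches travel over the network, never the usernames themselves, which is why this approach scales.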
postgresql-hll in distributed environment
Or use Citus :)
Burak Yucesoy
burak@citusdata.com
@byucesoy
Thank You
citusdata.com | @citusdata

More Related Content

Similar to Distributed count(distinct) with hyper loglog on postgresql | PGConf EU 2017) | Burak Yucesoy

Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Tal Bar-Zvi
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Oblu Integration Guide
Oblu Integration GuideOblu Integration Guide
Oblu Integration Guideoblu.io
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogsAlexey Grigorev
 
Proposal and Implementation of the Connected-Component Labeling of Binary Ima...
Proposal and Implementation of the Connected-Component Labeling of Binary Ima...Proposal and Implementation of the Connected-Component Labeling of Binary Ima...
Proposal and Implementation of the Connected-Component Labeling of Binary Ima...CSCJournals
 
IRJET- Design of 16 Bit Low Power Vedic Architecture using CSA & UTS
IRJET-  	  Design of 16 Bit Low Power Vedic Architecture using CSA & UTSIRJET-  	  Design of 16 Bit Low Power Vedic Architecture using CSA & UTS
IRJET- Design of 16 Bit Low Power Vedic Architecture using CSA & UTSIRJET Journal
 
Robert Haas Query Planning Gone Wrong Presentation @ Postgres Open
Robert Haas Query Planning Gone Wrong Presentation @ Postgres OpenRobert Haas Query Planning Gone Wrong Presentation @ Postgres Open
Robert Haas Query Planning Gone Wrong Presentation @ Postgres OpenPostgresOpen
 
DESIGN OF 8-BIT COMPARATORS
DESIGN OF 8-BIT COMPARATORSDESIGN OF 8-BIT COMPARATORS
DESIGN OF 8-BIT COMPARATORSIRJET Journal
 
Implementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adderImplementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adderVLSICS Design
 
GRAPHICAL STRUCTURES in our lives
GRAPHICAL STRUCTURES in our livesGRAPHICAL STRUCTURES in our lives
GRAPHICAL STRUCTURES in our livesxryuseix
 
Implementation of 32 Bit Binary Floating Point Adder Using IEEE 754 Single Pr...
Implementation of 32 Bit Binary Floating Point Adder Using IEEE 754 Single Pr...Implementation of 32 Bit Binary Floating Point Adder Using IEEE 754 Single Pr...
Implementation of 32 Bit Binary Floating Point Adder Using IEEE 754 Single Pr...iosrjce
 
IRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
IRJET - Design and Implementation of FFT using Compressor with XOR Gate TopologyIRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
IRJET - Design and Implementation of FFT using Compressor with XOR Gate TopologyIRJET Journal
 
IRJET- To Design 16 bit Synchronous Microprocessor using VHDL on FPGA
IRJET-  	  To Design 16 bit Synchronous Microprocessor using VHDL on FPGAIRJET-  	  To Design 16 bit Synchronous Microprocessor using VHDL on FPGA
IRJET- To Design 16 bit Synchronous Microprocessor using VHDL on FPGAIRJET Journal
 
IRJET- VLSI Architecture for Reversible Radix-2 FFT Algorithm using Programma...
IRJET- VLSI Architecture for Reversible Radix-2 FFT Algorithm using Programma...IRJET- VLSI Architecture for Reversible Radix-2 FFT Algorithm using Programma...
IRJET- VLSI Architecture for Reversible Radix-2 FFT Algorithm using Programma...IRJET Journal
 
An efficient hardware logarithm generator with modified quasi-symmetrical app...
An efficient hardware logarithm generator with modified quasi-symmetrical app...An efficient hardware logarithm generator with modified quasi-symmetrical app...
An efficient hardware logarithm generator with modified quasi-symmetrical app...IJECEIAES
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
IRJET- Implementation of Ternary ALU using Verilog
IRJET- Implementation of Ternary ALU using VerilogIRJET- Implementation of Ternary ALU using Verilog
IRJET- Implementation of Ternary ALU using VerilogIRJET Journal
 
BlaBlaCar Elastic Search Feedback
BlaBlaCar Elastic Search FeedbackBlaBlaCar Elastic Search Feedback
BlaBlaCar Elastic Search Feedbacksinfomicien
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520IJRAT
 
FPGA IMPLEMENTATION OF HIGH SPEED BAUGH-WOOLEY MULTIPLIER USING DECOMPOSITION...
FPGA IMPLEMENTATION OF HIGH SPEED BAUGH-WOOLEY MULTIPLIER USING DECOMPOSITION...FPGA IMPLEMENTATION OF HIGH SPEED BAUGH-WOOLEY MULTIPLIER USING DECOMPOSITION...
FPGA IMPLEMENTATION OF HIGH SPEED BAUGH-WOOLEY MULTIPLIER USING DECOMPOSITION...eeiej_journal
 

Similar to Distributed count(distinct) with hyper loglog on postgresql | PGConf EU 2017) | Burak Yucesoy (20)

Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Oblu Integration Guide
Oblu Integration GuideOblu Integration Guide
Oblu Integration Guide
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogs
 
Proposal and Implementation of the Connected-Component Labeling of Binary Ima...
Proposal and Implementation of the Connected-Component Labeling of Binary Ima...Proposal and Implementation of the Connected-Component Labeling of Binary Ima...
Proposal and Implementation of the Connected-Component Labeling of Binary Ima...
 
IRJET- Design of 16 Bit Low Power Vedic Architecture using CSA & UTS
IRJET-  	  Design of 16 Bit Low Power Vedic Architecture using CSA & UTSIRJET-  	  Design of 16 Bit Low Power Vedic Architecture using CSA & UTS
IRJET- Design of 16 Bit Low Power Vedic Architecture using CSA & UTS
 
Robert Haas Query Planning Gone Wrong Presentation @ Postgres Open
Robert Haas Query Planning Gone Wrong Presentation @ Postgres OpenRobert Haas Query Planning Gone Wrong Presentation @ Postgres Open
Robert Haas Query Planning Gone Wrong Presentation @ Postgres Open
 
DESIGN OF 8-BIT COMPARATORS
DESIGN OF 8-BIT COMPARATORSDESIGN OF 8-BIT COMPARATORS
DESIGN OF 8-BIT COMPARATORS
 
Implementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adderImplementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adder
 
GRAPHICAL STRUCTURES in our lives
GRAPHICAL STRUCTURES in our livesGRAPHICAL STRUCTURES in our lives
GRAPHICAL STRUCTURES in our lives
 
Implementation of 32 Bit Binary Floating Point Adder Using IEEE 754 Single Pr...
Implementation of 32 Bit Binary Floating Point Adder Using IEEE 754 Single Pr...Implementation of 32 Bit Binary Floating Point Adder Using IEEE 754 Single Pr...
Implementation of 32 Bit Binary Floating Point Adder Using IEEE 754 Single Pr...
 
IRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
IRJET - Design and Implementation of FFT using Compressor with XOR Gate TopologyIRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
IRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
 
IRJET- To Design 16 bit Synchronous Microprocessor using VHDL on FPGA
IRJET-  	  To Design 16 bit Synchronous Microprocessor using VHDL on FPGAIRJET-  	  To Design 16 bit Synchronous Microprocessor using VHDL on FPGA
IRJET- To Design 16 bit Synchronous Microprocessor using VHDL on FPGA
 
IRJET- VLSI Architecture for Reversible Radix-2 FFT Algorithm using Programma...
IRJET- VLSI Architecture for Reversible Radix-2 FFT Algorithm using Programma...IRJET- VLSI Architecture for Reversible Radix-2 FFT Algorithm using Programma...
IRJET- VLSI Architecture for Reversible Radix-2 FFT Algorithm using Programma...
 
An efficient hardware logarithm generator with modified quasi-symmetrical app...
An efficient hardware logarithm generator with modified quasi-symmetrical app...An efficient hardware logarithm generator with modified quasi-symmetrical app...
An efficient hardware logarithm generator with modified quasi-symmetrical app...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
IRJET- Implementation of Ternary ALU using Verilog
IRJET- Implementation of Ternary ALU using VerilogIRJET- Implementation of Ternary ALU using Verilog
IRJET- Implementation of Ternary ALU using Verilog
 
BlaBlaCar Elastic Search Feedback
BlaBlaCar Elastic Search FeedbackBlaBlaCar Elastic Search Feedback
BlaBlaCar Elastic Search Feedback
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520
 
FPGA IMPLEMENTATION OF HIGH SPEED BAUGH-WOOLEY MULTIPLIER USING DECOMPOSITION...
FPGA IMPLEMENTATION OF HIGH SPEED BAUGH-WOOLEY MULTIPLIER USING DECOMPOSITION...FPGA IMPLEMENTATION OF HIGH SPEED BAUGH-WOOLEY MULTIPLIER USING DECOMPOSITION...
FPGA IMPLEMENTATION OF HIGH SPEED BAUGH-WOOLEY MULTIPLIER USING DECOMPOSITION...
 

More from Citus Data

Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...Citus Data
 
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...Citus Data
 
JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201...
JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201...JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201...
JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201...Citus Data
 
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...Citus Data
 
Whats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig KerstiensWhats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig KerstiensCitus Data
 
When it all goes wrong | PGConf EU 2019 | Will Leinweber
When it all goes wrong | PGConf EU 2019 | Will LeinweberWhen it all goes wrong | PGConf EU 2019 | Will Leinweber
When it all goes wrong | PGConf EU 2019 | Will LeinweberCitus Data
 
Amazing SQL your ORM can (or can't) do | PGConf EU 2019 | Louise Grandjonc
Amazing SQL your ORM can (or can't) do | PGConf EU 2019 | Louise GrandjoncAmazing SQL your ORM can (or can't) do | PGConf EU 2019 | Louise Grandjonc
Amazing SQL your ORM can (or can't) do | PGConf EU 2019 | Louise GrandjoncCitus Data
 
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...Citus Data
 
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff Davis
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff DavisDeep Postgres Extensions in Rust | PGCon 2019 | Jeff Davis
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff DavisCitus Data
 
Why Postgres Why This Database Why Now | SF Bay Area Postgres Meetup | Claire...
Why Postgres Why This Database Why Now | SF Bay Area Postgres Meetup | Claire...Why Postgres Why This Database Why Now | SF Bay Area Postgres Meetup | Claire...
Why Postgres Why This Database Why Now | SF Bay Area Postgres Meetup | Claire...Citus Data
 
A story on Postgres index types | PostgresLondon 2019 | Louise Grandjonc
A story on Postgres index types | PostgresLondon 2019 | Louise GrandjoncA story on Postgres index types | PostgresLondon 2019 | Louise Grandjonc
A story on Postgres index types | PostgresLondon 2019 | Louise GrandjoncCitus Data
 
Why developers need marketing now more than ever | GlueCon 2019 | Claire Gior...
Why developers need marketing now more than ever | GlueCon 2019 | Claire Gior...Why developers need marketing now more than ever | GlueCon 2019 | Claire Gior...
Why developers need marketing now more than ever | GlueCon 2019 | Claire Gior...Citus Data
 
The Art of PostgreSQL | PostgreSQL Ukraine | Dimitri Fontaine
The Art of PostgreSQL | PostgreSQL Ukraine | Dimitri FontaineThe Art of PostgreSQL | PostgreSQL Ukraine | Dimitri Fontaine
The Art of PostgreSQL | PostgreSQL Ukraine | Dimitri FontaineCitus Data
 
Optimizing your app by understanding your Postgres | RailsConf 2019 | Samay S...
Optimizing your app by understanding your Postgres | RailsConf 2019 | Samay S...Optimizing your app by understanding your Postgres | RailsConf 2019 | Samay S...
Optimizing your app by understanding your Postgres | RailsConf 2019 | Samay S...Citus Data
 
When it all goes wrong (with Postgres) | RailsConf 2019 | Will Leinweber
When it all goes wrong (with Postgres) | RailsConf 2019 | Will LeinweberWhen it all goes wrong (with Postgres) | RailsConf 2019 | Will Leinweber
When it all goes wrong (with Postgres) | RailsConf 2019 | Will LeinweberCitus Data
 
The Art of PostgreSQL | PostgreSQL Ukraine Meetup | Dimitri Fontaine
The Art of PostgreSQL | PostgreSQL Ukraine Meetup | Dimitri FontaineThe Art of PostgreSQL | PostgreSQL Ukraine Meetup | Dimitri Fontaine
The Art of PostgreSQL | PostgreSQL Ukraine Meetup | Dimitri FontaineCitus Data
 
Using Postgres and Citus for Lightning Fast Analytics, also ft. Rollups | Liv...
Using Postgres and Citus for Lightning Fast Analytics, also ft. Rollups | Liv...Using Postgres and Citus for Lightning Fast Analytics, also ft. Rollups | Liv...
Using Postgres and Citus for Lightning Fast Analytics, also ft. Rollups | Liv...Citus Data
 
How to write SQL queries | pgDay Paris 2019 | Dimitri Fontaine
How to write SQL queries | pgDay Paris 2019 | Dimitri FontaineHow to write SQL queries | pgDay Paris 2019 | Dimitri Fontaine
How to write SQL queries | pgDay Paris 2019 | Dimitri FontaineCitus Data
 
When it all Goes Wrong |Nordic PGDay 2019 | Will Leinweber
When it all Goes Wrong |Nordic PGDay 2019 | Will LeinweberWhen it all Goes Wrong |Nordic PGDay 2019 | Will Leinweber
When it all Goes Wrong |Nordic PGDay 2019 | Will LeinweberCitus Data
 
Why PostgreSQL Why This Database Why Now | Nordic PGDay 2019 | Claire Giordano
Why PostgreSQL Why This Database Why Now | Nordic PGDay 2019 | Claire GiordanoWhy PostgreSQL Why This Database Why Now | Nordic PGDay 2019 | Claire Giordano
Why PostgreSQL Why This Database Why Now | Nordic PGDay 2019 | Claire GiordanoCitus Data
 

More from Citus Data (20)

Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
 
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
 
JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201...
JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201...JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201...
JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201...
 
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
 
Whats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig KerstiensWhats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
 
When it all goes wrong | PGConf EU 2019 | Will Leinweber
When it all goes wrong | PGConf EU 2019 | Will LeinweberWhen it all goes wrong | PGConf EU 2019 | Will Leinweber
When it all goes wrong | PGConf EU 2019 | Will Leinweber
 
Amazing SQL your ORM can (or can't) do | PGConf EU 2019 | Louise Grandjonc
Amazing SQL your ORM can (or can't) do | PGConf EU 2019 | Louise GrandjoncAmazing SQL your ORM can (or can't) do | PGConf EU 2019 | Louise Grandjonc
Amazing SQL your ORM can (or can't) do | PGConf EU 2019 | Louise Grandjonc
 
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
 
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff Davis
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff DavisDeep Postgres Extensions in Rust | PGCon 2019 | Jeff Davis
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff Davis
 
Why Postgres Why This Database Why Now | SF Bay Area Postgres Meetup | Claire...
Why Postgres Why This Database Why Now | SF Bay Area Postgres Meetup | Claire...Why Postgres Why This Database Why Now | SF Bay Area Postgres Meetup | Claire...
Why Postgres Why This Database Why Now | SF Bay Area Postgres Meetup | Claire...
 
A story on Postgres index types | PostgresLondon 2019 | Louise Grandjonc
A story on Postgres index types | PostgresLondon 2019 | Louise GrandjoncA story on Postgres index types | PostgresLondon 2019 | Louise Grandjonc
A story on Postgres index types | PostgresLondon 2019 | Louise Grandjonc
 
Why developers need marketing now more than ever | GlueCon 2019 | Claire Gior...
Why developers need marketing now more than ever | GlueCon 2019 | Claire Gior...Why developers need marketing now more than ever | GlueCon 2019 | Claire Gior...
Why developers need marketing now more than ever | GlueCon 2019 | Claire Gior...
 
The Art of PostgreSQL | PostgreSQL Ukraine | Dimitri Fontaine
The Art of PostgreSQL | PostgreSQL Ukraine | Dimitri FontaineThe Art of PostgreSQL | PostgreSQL Ukraine | Dimitri Fontaine
The Art of PostgreSQL | PostgreSQL Ukraine | Dimitri Fontaine
 
Optimizing your app by understanding your Postgres | RailsConf 2019 | Samay S...
Optimizing your app by understanding your Postgres | RailsConf 2019 | Samay S...Optimizing your app by understanding your Postgres | RailsConf 2019 | Samay S...
Optimizing your app by understanding your Postgres | RailsConf 2019 | Samay S...
 
When it all goes wrong (with Postgres) | RailsConf 2019 | Will Leinweber
When it all goes wrong (with Postgres) | RailsConf 2019 | Will LeinweberWhen it all goes wrong (with Postgres) | RailsConf 2019 | Will Leinweber
When it all goes wrong (with Postgres) | RailsConf 2019 | Will Leinweber
 
The Art of PostgreSQL | PostgreSQL Ukraine Meetup | Dimitri Fontaine
The Art of PostgreSQL | PostgreSQL Ukraine Meetup | Dimitri FontaineThe Art of PostgreSQL | PostgreSQL Ukraine Meetup | Dimitri Fontaine
The Art of PostgreSQL | PostgreSQL Ukraine Meetup | Dimitri Fontaine
 
Using Postgres and Citus for Lightning Fast Analytics, also ft. Rollups | Liv...
Using Postgres and Citus for Lightning Fast Analytics, also ft. Rollups | Liv...Using Postgres and Citus for Lightning Fast Analytics, also ft. Rollups | Liv...
Using Postgres and Citus for Lightning Fast Analytics, also ft. Rollups | Liv...
 
How to write SQL queries | pgDay Paris 2019 | Dimitri Fontaine
How to write SQL queries | pgDay Paris 2019 | Dimitri FontaineHow to write SQL queries | pgDay Paris 2019 | Dimitri Fontaine
How to write SQL queries | pgDay Paris 2019 | Dimitri Fontaine
 
When it all Goes Wrong |Nordic PGDay 2019 | Will Leinweber
When it all Goes Wrong |Nordic PGDay 2019 | Will LeinweberWhen it all Goes Wrong |Nordic PGDay 2019 | Will Leinweber
When it all Goes Wrong |Nordic PGDay 2019 | Will Leinweber
 
Why PostgreSQL Why This Database Why Now | Nordic PGDay 2019 | Claire Giordano
Why PostgreSQL Why This Database Why Now | Nordic PGDay 2019 | Claire GiordanoWhy PostgreSQL Why This Database Why Now | Nordic PGDay 2019 | Claire Giordano
Why PostgreSQL Why This Database Why Now | Nordic PGDay 2019 | Claire Giordano
 

Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL | PGConf EU 2017 | Burak Yucesoy

  • 1. Burak Yucesoy | Citus Data | PGConf EU Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL
  • 2. What is COUNT(DISTINCT)?
      ● Number of unique elements (cardinality) in given data
      ● Useful for finding things like:
          ○ Number of unique users who visited your web page
          ○ Number of unique products in your inventory
  • 3. What is distributed COUNT(DISTINCT)?
      [Diagram: a Coordinator and three workers: Worker Node 1 (logins_001), Worker Node 2 (logins_002), Worker Node 3 (logins_003)]
  • 4. Why do we need distributed COUNT(DISTINCT)?
      ● Your data is too big to fit in the memory of a single machine
      ● The naive approach to COUNT(DISTINCT) needs too much memory
  • 5. Why is distributed COUNT(DISTINCT) difficult?
      [Diagram: the Coordinator receives SELECT COUNT(*) FROM logins; and sends SELECT COUNT(*) FROM ...; to Worker Node 1 (logins_001), Worker Node 2 (logins_002), and Worker Node 3 (logins_003). The workers return 100, 200, and 300, which the Coordinator adds up to 600.]
  • 6. Why is distributed COUNT(DISTINCT) difficult?
      [Diagram: the Coordinator receives SELECT COUNT(DISTINCT username) FROM logins; and sends SELECT COUNT(DISTINCT user_id) FROM ...; to Worker Node 1 (logins_001), Worker Node 2 (logins_002), and Worker Node 3 (logins_003).]
  • 7. Why is distributed COUNT(DISTINCT) difficult?
      Worker Node 1 (logins_001)
       username | date
      ----------+------------
       Alice    | 2017-01-02
       Bob      | 2017-01-03
       Charlie  | 2017-01-05
       Eve      | 2017-01-07
      Worker Node 2 (logins_002)
       username | date
      ----------+------------
       Bob      | 2017-02-11
       Bob      | 2017-02-13
       Dave     | 2017-02-17
       Alice    | 2017-02-19
      Worker Node 3 (logins_003)
       username | date
      ----------+------------
       Frank    | 2017-03-23
       Eve      | 2017-03-29
       Charlie  | 2017-03-02
       Charlie  | 2017-03-03
  • 8. Why is distributed COUNT(DISTINCT) difficult?
      Worker Node 1 (logins_001)
       username | date
      ----------+------------
       Alice    | 2017-01-02
       Bob      | 2017-01-03
       Charlie  | 2017-01-05
       Eve      | 2017-01-07
      Worker Node 2 (logins_002)
       username | date
      ----------+------------
       Bob      | 2017-02-11
       Bob      | 2017-02-13
       Dave     | 2017-02-17
       Alice    | 2017-02-19
      Worker Node 3 (logins_003)
       username | date
      ----------+------------
       Dave     | 2017-03-23
       Eve      | 2017-03-29
       Charlie  | 2017-03-02
       Charlie  | 2017-03-03
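To make the difficulty concrete, here is a minimal Python sketch built on the usernames from slide 8 (the variable names are illustrative and not from the talk). It shows that row counts from the workers can simply be added, while per-worker distinct counts cannot, because the same username appears on several workers.

```python
# Usernames per shard, copied from slide 8 (dates omitted).
shards = {
    "logins_001": ["Alice", "Bob", "Charlie", "Eve"],
    "logins_002": ["Bob", "Bob", "Dave", "Alice"],
    "logins_003": ["Dave", "Eve", "Charlie", "Charlie"],
}

# COUNT(*) distributes: the coordinator adds up the per-shard row counts.
total_rows = sum(len(rows) for rows in shards.values())

# COUNT(DISTINCT) does not: adding per-shard distinct counts double-counts
# every username that shows up on more than one worker.
sum_of_distinct = sum(len(set(rows)) for rows in shards.values())
true_distinct = len(set().union(*shards.values()))

print(total_rows, sum_of_distinct, true_distinct)  # 12 10 5
```

Getting the exact answer requires seeing the distinct values themselves, not just their counts, which is what makes the naive approaches on the next slide expensive.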
  • 9. Some Possible Approaches
      ● Pull all distinct data to one node and count there. (Doesn’t scale)
      ● Repartition data on the fly. (Scales, but it’s very slow)
      ● Use HyperLogLog. (Scales and is fast)
  • 10. HyperLogLog (HLL)
      HLL:
      ● Is an approximation algorithm
      ● Estimates the cardinality of given data
      ● Has mathematically proven error bounds
  • 11. Is it OK to approximate? It depends…
  • 12. HLL
      ● Very fast
      ● Low memory footprint
      ● Can work with streaming data
      ● Can merge estimations of two separate datasets efficiently
  • 13. How does HLL work?
      Steps:
      1. Hash all elements
          a. Ensures uniform data distribution
          b. Can treat all data types the same
      2. Observe rare bit patterns
      3. Stochastic averaging
  • 14. How does HLL work? - Observing rare bit patterns
      hash(Alice) = 645403841, binary 0010...001
      Number of leading zeros: 2
      Maximum number of leading zeros: 2
  • 15. How does HLL work? - Observing rare bit patterns
      hash(Bob) = 1492309842, binary 0101...010
      Number of leading zeros: 1
      Maximum number of leading zeros: 2
  • 16. How does HLL work? - Observing rare bit patterns
      ...
      Maximum number of leading zeros: 7
      Cardinality estimation: 2^7
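The "observing rare bit patterns" step can be sketched in a few lines of Python. This is a minimal illustration, assuming the hash values printed on the slides are 32-bit integers; the hash function itself is not shown in the talk, so the values are copied rather than recomputed, and the helper name `leading_zeros` is made up for this example.

```python
HASH_BITS = 32  # assumption: the slide's hashes are treated as 32-bit values


def leading_zeros(h, bits=HASH_BITS):
    """Number of zero bits before the first 1 bit in a fixed-width hash."""
    return bits - h.bit_length()


# Hash values as printed on slides 14 and 15 (not recomputed here).
hashes = {"Alice": 645403841, "Bob": 1492309842}

max_zeros = 0
for name, h in hashes.items():
    zeros = leading_zeros(h)
    max_zeros = max(max_zeros, zeros)
    # Show the first four bits, the count for this element, and the running maximum.
    print(name, format(h, "032b")[:4], zeros, max_zeros)

# A prefix of k zeros appears in roughly one out of 2**k uniform hashes,
# so 2**max_zeros serves as a (very rough) cardinality estimate.
estimate = 2 ** max_zeros
```

With a maximum of 7 leading zeros, as on slide 16, the same rule gives 2^7 = 128. A single maximum is a noisy estimator, which is what the stochastic averaging step addresses.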
  • 17. How does HLL work? Stochastic Averaging
      Measuring the same thing repeatedly and taking the average.
  • 20. How does HLL work? Stochastic Averaging
      [Diagram: the data is split into Partition 1, Partition 2, and Partition 3, whose maximum leading-zero counts are 7, 5, and 12. The per-partition estimates 2^7, 2^5, and 2^12 combine into an estimation of 228.968...]
  • 21. How does HLL work? Stochastic Averaging
      01000101...010
      The first m bits decide the partition number; the remaining bits are used to count leading zeros.
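A minimal Python sketch of stochastic averaging follows. Several details are assumptions rather than the talk's implementation: hashes are taken to be 32 bits wide, `index_bits` is a made-up name for the number of leading bits that select the partition, and the combining rule (number of partitions times the harmonic mean of the per-partition estimates) is inferred from the 228.968... figure on slide 20. The full algorithm also applies a bias-correction constant and range corrections, which are omitted here.

```python
HASH_BITS = 32  # assumption: 32-bit hashes, as the slide values suggest


def split_hash(h, index_bits):
    """Return (partition number, leading zeros in the remaining bits)."""
    rest_bits = HASH_BITS - index_bits
    partition = h >> rest_bits               # first index_bits bits
    rest = h & ((1 << rest_bits) - 1)        # remaining bits
    return partition, rest_bits - rest.bit_length()


def add(registers, h, index_bits):
    """Keep the maximum leading-zero count seen in the hash's partition."""
    partition, zeros = split_hash(h, index_bits)
    registers[partition] = max(registers[partition], zeros)


def estimate(registers):
    """Partitions times the harmonic mean of 2**register (no bias correction)."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


registers = [0] * 8                  # 3 index bits -> 8 partitions
add(registers, 645403841, 3)         # hash(Alice) from slide 14
print(registers)                     # [0, 2, 0, 0, 0, 0, 0, 0]
print(estimate([7, 5, 12]))          # about 228.969, as on slide 20
```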
  • 22. Error rate of HLL is damn good
      ● Typical error rate: 1.04 / sqrt(number of partitions)
      ● Memory needed is number of partitions * log(log(max. value in hash space)) bits
      ● Can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory
      ● Memory vs. accuracy tradeoff
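The two formulas on this slide can be evaluated directly. The sketch below is only an illustration; the choice of 8192 partitions and a 64-bit hash is an assumption, picked because it reproduces the slide's "6 kilobytes" figure, and the function names are made up for this example.

```python
import math


def standard_error(partitions):
    """Typical relative error: 1.04 / sqrt(number of partitions)."""
    return 1.04 / math.sqrt(partitions)


def memory_bytes(partitions, hash_bits):
    """partitions * log2(log2(size of hash space)) bits, expressed in bytes."""
    bits_per_partition = math.ceil(math.log2(hash_bits))
    return partitions * bits_per_partition / 8


# Assumed configuration: 8192 partitions, 64-bit hashes (6 bits per partition).
for partitions in (1024, 8192, 65536):
    print(partitions,
          round(100 * standard_error(partitions), 2), "% error,",
          memory_bytes(partitions, 64), "bytes")
```

Under these assumptions 8192 partitions need 6144 bytes and give an error of roughly 1.15%; quadrupling the partitions halves the error, which is the memory versus accuracy tradeoff the slide refers to.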
  • 23. Why does HLL work?
      It turns out that a combination of lots of bad estimations is a good estimation.
  • 24. Some interesting examples
      [Diagram: the data is Alice, Alice, Alice, …, Alice. hash(Alice) = 645403841, binary 00100110...001. Partition 1, Partition 2, Partition 3, ... hold 0, 2, 0, ..., giving per-partition estimates 2^0, 2^2, 2^0, ... Harmonic mean: 1.103...]
  • 25. Some interesting examples
      [Diagram: the data is Charlie. hash(Charlie) = 0, binary 00000000...000. Partition 1, Partition 2, ..., Partition 8 hold 29, 0, ..., 0, giving per-partition estimates 2^29, 2^0, ..., 2^0. Harmonic mean: 1.142...]
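The two examples show why the harmonic mean is used: it is barely moved by a single extreme partition. A short Python sketch, assuming 8 partitions with every partition not shown on the slides holding 0 (an assumption that happens to reproduce the 1.103... and 1.142... figures):

```python
def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)


def arithmetic_mean(values):
    return sum(values) / len(values)


# Assumed register contents: 8 partitions, unlisted partitions hold 0.
only_alice = [0, 2, 0, 0, 0, 0, 0, 0]        # slide 24: one repeated element
charlie_outlier = [29, 0, 0, 0, 0, 0, 0, 0]  # slide 25: a hash of all zeros

for registers in (only_alice, charlie_outlier):
    per_partition = [2 ** r for r in registers]
    print(registers,
          round(harmonic_mean(per_partition), 3),
          arithmetic_mean(per_partition))
```

For the Charlie case the harmonic mean stays near 1.143, while the arithmetic mean of the same values is about 67 million, so one unlucky hash would ruin an ordinary average.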
  • 26. postgresql-hll
      ● https://github.com/aggregateknowledge/postgresql-hll
      ● https://github.com/citusdata/postgresql-hll
      ● Companies using postgresql-hll for their dashboards:
          ○ Neustar
          ○ Cloudflare
  • 27. postgresql-hll on a single node
      postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition.
      ● Use hll_hash_bigint to hash elements.
          ○ There are similar functions for other common data types.
      ● Use hll_add_agg to aggregate hashed elements into the hll data structure.
      ● Use hll_cardinality to materialize the hll data structure into an actual distinct count.
  • 28. What Happens in a Distributed Scenario?
  • 29. How to merge COUNT(DISTINCT) with HLL
      [Diagram: Shard 1 is split into Shard 1 Partition 1, Shard 1 Partition 2, and Shard 1 Partition 3, which hold 7, 5, and 12. Intermediate result: HLL(7, 5, 12)]
  • 30. How to merge COUNT(DISTINCT) with HLL
      [Diagram: Shard 2 is split into Shard 2 Partition 1, Shard 2 Partition 2, and Shard 2 Partition 3, which hold 11, 7, and 8. Intermediate result: HLL(11, 7, 8)]
  • 31. How to merge COUNT(DISTINCT) with HLL
      [Diagram: hll_union_agg merges HLL(11, 7, 8) and HLL(7, 5, 12) into HLL(11, 7, 12). Per-partition estimates 2^11, 2^7, 2^12; estimation 1053.255]
  • 32. How to merge COUNT(DISTINCT) with HLL
      Shard 1 + Shard 2:
      ● Shard 1 Partition 1 (7) + Shard 2 Partition 1 (11) → 11
      ● Shard 1 Partition 2 (5) + Shard 2 Partition 2 (7) → 7
      ● Shard 1 Partition 3 (12) + Shard 2 Partition 3 (8) → 12
      Estimation: 1053.255
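The merge on these slides keeps, for each partition, the larger of the two maxima. The Python sketch below illustrates why that works: merging two sketches gives exactly the sketch of the combined data. It is not the postgresql-hll implementation; the SHA-1 based `hash32`, the 3 index bits, and the helper names are all assumptions made for this example.

```python
import hashlib

HASH_BITS = 32   # assumption: 32-bit hashes
INDEX_BITS = 3   # assumption: 8 partitions


def hash32(value):
    """Stand-in hash for illustration: first 4 bytes of SHA-1 as an integer."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:4], "big")


def sketch(values):
    """Per-partition maximum of leading zeros over all hashed values."""
    registers = [0] * (1 << INDEX_BITS)
    rest_bits = HASH_BITS - INDEX_BITS
    for value in values:
        h = hash32(value)
        partition = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        zeros = rest_bits - rest.bit_length()
        registers[partition] = max(registers[partition], zeros)
    return registers


def merge(a, b):
    """Element-wise maximum, the operation shown for hll_union_agg on slide 31."""
    return [max(x, y) for x, y in zip(a, b)]


# The register values from slides 29 to 31.
print(merge([11, 7, 8], [7, 5, 12]))  # [11, 7, 12]

# Merging per-shard sketches equals sketching the combined data.
shard1 = ["Alice", "Bob", "Charlie", "Eve"]
shard2 = ["Bob", "Bob", "Dave", "Alice"]
assert merge(sketch(shard1), sketch(shard2)) == sketch(shard1 + shard2)
```

Because a maximum is unaffected by duplicates and by the order of inputs, users that appear on both shards are not double-counted, and only the small register arrays have to travel to the coordinator.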
  • 33. postgresql-hll in a distributed environment
      1. Separate data into shards.
      [Diagram: logins_001, logins_002, logins_003]
  • 34. postgresql-hll in a distributed environment
      2. Put shards on separate nodes.
      [Diagram: Coordinator; Worker Node 1 (logins_001), Worker Node 2 (logins_002), Worker Node 3 (logins_003)]
  • 35. postgresql-hll in a distributed environment
      3. For each shard, calculate hll (but do not materialize).
      [Diagram: Shard 1 is split into Shard 1 Partition 1, Shard 1 Partition 2, and Shard 1 Partition 3, which hold 7, 5, and 12. Intermediate result: HLL(7, 5, 12)]
  • 36. postgresql-hll in a distributed environment
      4. Pull intermediate results to a single node.
      [Diagram: Worker Node 1 (logins_001), Worker Node 2 (logins_002), and Worker Node 3 (logins_003) send HLL(6, 4, 11), HLL(10, 6, 7), and HLL(7, 12, 5) to the Coordinator.]
  • 37. postgresql-hll in a distributed environment
      5. Merge the separate hll data structures and materialize them.
      [Diagram: HLL(11, 7, 8), HLL(7, 5, 12), and HLL(8, 13, 6) merge into HLL(11, 13, 12). Per-partition estimates 2^11, 2^13, 2^12; estimation 10532.571...]
  • 38. postgresql-hll in a distributed environment
      Or use Citus :)
  • 39. Thank You
      Burak Yucesoy | burak@citusdata.com | @byucesoy
      citusdata.com | @citusdata