pg_shardman:
PostgreSQL sharding
via postgres_fdw,
pg_pathman and
logical replication.
Arseny Sher, Stas Kelvich
Postgres Professional
Read and write scalability
High availability
ACID transactions
What people typically expect from the cluster
2
CAP theorem: common myths
3
Informal statement: it is impossible to implement a read/write data object that provides
all three properties.
Consistency in CAP means linearizability
wow, so strict
Availability in CAP means that any node must give a non-error answer to every
query.
... but execution can take an arbitrarily long time
P in CAP means that the system continues operation after a network failure
And in real life, we always want the system to continue operation after a network
failure
CAP theorem: common myths
4
This combination of availability and consistency over the wide area is generally
considered impossible due to the CAP Theorem. We show how Spanner achieves this
combination and why it is consistent with CAP.
Eric Brewer. Spanner, TrueTime & The CAP Theorem. February 14, 2017
CAP theorem: conclusions
5
We aim for
Write (and read) horizontal scalability
Mainly OLTP workload with occasional analytical queries
Decent transactions
pg_shardman is a PG 10 extension, PostgreSQL license, available on GitHub
Some features require patched Postgres
pg_shardman
6
pg_shardman is a combination of several technologies.
Scalability: hash-sharding via partitioning and fdw
HA: logical replication
ACID: 2PC + distributed snapshot manager
pg_shardman foundations
7
Let’s build up from partitioning,
because it’s like sharding, but inside one node.
Partitioning benefits
Sequential access to a single partition (or a few) instead of random access to a huge
table
Effective cache usage when the most frequently used data is located in a few partitions
...
Sharding
8
9.6 and below:
Range and list partitioning, complex manual management
Not efficient
New declarative partitioning in 10:
+ Range and list partitioning with handy DDL
- No insertions to foreign partitions, no triggers on parent tables
- Updates moving tuples between partitions are not supported
pg_pathman extension:
Hash and range partitioning
Planning and execution optimizations
FDW support
Partitioning in PostgreSQL
9
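For reference, a minimal DDL sketch of both flavors mentioned above; table and column names are illustrative, and the pg_pathman call assumes its standard create_hash_partitions interface.

-- PostgreSQL 10 declarative partitioning: range and list only
CREATE TABLE measurements (logdate date NOT NULL, value int)
    PARTITION BY RANGE (logdate);
CREATE TABLE measurements_2017 PARTITION OF measurements
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- pg_pathman: hash partitioning plus planner/executor optimizations
-- (requires shared_preload_libraries = 'pg_pathman')
CREATE EXTENSION pg_pathman;
CREATE TABLE accounts (aid int NOT NULL, abalance int);
SELECT create_hash_partitions('accounts', 'aid', 8);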
Partitioning in PostgreSQL
10
FDW (foreign data wrappers) mechanism in PG gives access to external sources of
data. postgres_fdw extension allows querying one PG instance from another.
Going beyond one node: FDW
11
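A minimal postgres_fdw setup looks roughly like this; the server name, connection options and the customer table definition are illustrative, chosen to match the query on the next slide.

CREATE EXTENSION postgres_fdw;
CREATE SERVER shard_node_1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'node1', port '5432', dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard_node_1
    OPTIONS (user 'postgres');
CREATE SCHEMA remote;
CREATE FOREIGN TABLE remote.customer (id int, country_code int)
    SERVER shard_node_1 OPTIONS (table_name 'customer');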
Since 9.6 postgres_fdw can push down joins.
Since 10 postgres_fdw can push down aggregates and more kinds of joins.
explain (analyze, costs off) select count(*)
from remote.customer
group by country_code;
QUERY PLAN
--------------------------------------------------------------
Foreign Scan (actual time=353.786..353.896 rows=100 loops=1)
Relations: Aggregate on (remote.customer)
postgres_fdw optimizations
12
Currently parallel foreign scans are not supported :(
... and limitations
13
partitioning + postgres_fdw => sharding
14
partitioning + postgres_fdw => sharding
15
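The combination in one hedged sketch: a partitioned parent where a remote partition is simply a foreign table. PostgreSQL 10 declarative syntax is used here for brevity; pg_shardman itself drives the partitioning through pg_pathman. The shard_node_1 server is assumed to be defined as in the earlier postgres_fdw sketch.

CREATE TABLE customer (id int NOT NULL, country_code int)
    PARTITION BY RANGE (id);
-- local shard
CREATE TABLE customer_0 PARTITION OF customer
    FOR VALUES FROM (1) TO (1000000);
-- shard living on another node, reachable via postgres_fdw
CREATE FOREIGN TABLE customer_1 PARTITION OF customer
    FOR VALUES FROM (1000000) TO (2000000)
    SERVER shard_node_1 OPTIONS (table_name 'customer_1');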
pg_shardman supports only distribution by hash
It splits the load evenly
Currently it is impossible to change the number of shards; it should be chosen
wisely beforehand
Too few shards will balance poorly after node addition/removal
Too many shards bring overhead, especially for replication
~10 shards per node looks like an adequate baseline
Another common approach for resharding is consistent hashing
Data distribution schemas
16
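As a rough illustration of hash distribution (not pg_shardman’s actual hash function): the sharding key is hashed modulo the fixed number of shards, and the shard-to-node mapping is kept in metadata.

-- which of the 30 shards does this key land in?
SELECT mod(abs(hashtext('customer-42')), 30) AS shard_no;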
Possible schemas of replication
per-node, using streaming (physical) replication of PostgreSQL
High availability
17
Taken from the Citus docs
Per-node replication in Citus MX
18
per-node, using streaming (physical) replication of PostgreSQL
Requires 2x nodes, or 2x PG instances per node.
Possible schemas of replication
19
per-node, using streaming (physical) replication of PostgreSQL
Requires 2x nodes, or 2x PG instances per node.
per-shard, using logical replication
Possible schemas of replication
20
Logical replication – new in PostgreSQL 10
21
Logical replication – new in PostgreSQL 10
22
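The PostgreSQL 10 building blocks, per shard; the partition, publication and connection names are illustrative of what per-shard replication has to set up.

-- on the node holding the primary copy of a shard
CREATE PUBLICATION pgbench_accounts_0_pub FOR TABLE pgbench_accounts_0;

-- on the node holding the replica (the same table definition must exist there)
CREATE SUBSCRIPTION pgbench_accounts_0_sub
    CONNECTION 'host=node1 port=5432 dbname=postgres'
    PUBLICATION pgbench_accounts_0_pub;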
Replicas in pg_shardman
23
Synchronous replication:
We don’t lose transactions reported as committed
Writes block if a replica doesn’t respond
Slower
Currently we can reliably fail over only if we have 1 replica per shard
Asynchronous replication:
Last committed transactions might be lost
Writes don’t block
Faster
Synchronous, asynchronous replication and
availability
24
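With logical replication the choice between the two modes is made with the usual GUCs on the publishing node; a sketch, assuming the subscription from the earlier logical-replication example (by default its name is reported as application_name).

-- synchronous: commits wait for the replica of this shard
ALTER SYSTEM SET synchronous_standby_names = 'pgbench_accounts_0_sub';
SELECT pg_reload_conf();

-- asynchronous: leave synchronous_standby_names empty,
-- or relax durability for a particular session/transaction
SET synchronous_commit = local;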
Node addition with seamless rebalance
25
Node failover
26
We designate one special node, the ’shardlord’.
It holds tables with metadata.
Metadata can be synchronously replicated elsewhere so that the shardlord can be
replaced in case of failure.
Currently the shardlord can’t hold regular data itself.
How to manage this zoo
27
select shardman.add_node('port=5433');
select shardman.add_node('port=5434');
Example
28
select shardman.add_node('port=5433');
select shardman.add_node('port=5434');
create table pgbench_accounts (aid int not null, bid int, abalance int,
filler char(84));
-- 30 partitions, redundancy 1
select shardman.create_hash_partitions('pgbench_accounts', 'aid', 30, 1);
Example
29
[local]:5432 ars@ars:5434=# table shardman.partitions;
part_name | node_id | relation
---------------------+---------+------------------
pgbench_accounts_0 | 1 | pgbench_accounts
pgbench_accounts_1 | 2 | pgbench_accounts
pgbench_accounts_2 | 3 | pgbench_accounts
...
Example
30
[local]:5432 ars@ars:5434=# table shardman.replicas;
part_name | node_id | relation
---------------------+---------+------------------
pgbench_accounts_0 | 2 | pgbench_accounts
pgbench_accounts_1 | 3 | pgbench_accounts
pgbench_accounts_2 | 1 | pgbench_accounts
...
Example
31
Distributed transactions:
Distributed atomicity
Distributed isolation
Profit! (distributed)
Transactions in shardman
32
All reliable distributed systems are alike; each unreliable one is unreliable in its own way.
Kyle Kingsbury and Leo Tolstoy.
Transactions in shardman
33
Distributed transactions:
Atomicity: 2PC
Isolation: Clock-SI
Transactions in shardman
34
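The atomicity half builds on PostgreSQL’s native two-phase commit, which requires max_prepared_transactions > 0 on every participant; the global transaction id string below is illustrative.

-- phase 1, executed on each participant shard by the coordinator
BEGIN;
UPDATE pgbench_accounts_0 SET abalance = abalance - 1 WHERE aid = 42;
PREPARE TRANSACTION 'shardman_tx_42';

-- phase 2, only after every participant has prepared successfully
COMMIT PREPARED 'shardman_tx_42';
-- if any participant failed to prepare:
-- ROLLBACK PREPARED 'shardman_tx_42';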
Transactions in shardman: 2PC
35
Two-phase commit is the anti-availability protocol.
P. Helland. ACM Queue, Vol. 14, Issue 2, March-April 2016.
Transactions in shardman: 2PC
36
Transactions in shardman: 2PC
37
Transactions in shardman: 2PC
38
Transactions in shardman: 2PC
39
Transactions in shardman: 2PC
40
So what can we do about it?
Make 2PC fail-recovery tolerant: X3PC, Paxos Commit
Back up partitions!
Transactions in shardman: 2PC
41
Transactions in shardman: 2PC
42
Spanner mitigates this by having each member be a Paxos group, thus ensuring each
2PC “member” is highly available even if some of its Paxos participants are down.
Eric Brewer.
Transactions in shardman: 2PC
43
Profit? Not yet!
Transactions in shardman: isolation
44
Transactions in shardman: isolation
45
postgres_fdw.use_twophase = on
BEGIN;
UPDATE holders SET horns = horns - 1 WHERE holders.id = $id1;
UPDATE holders SET horns = horns + 1 WHERE holders.id = $id2;
COMMIT;
-- a concurrent reader without a distributed snapshot may see any of:
SELECT sum(horns) FROM holders;
-> 1
-> -2
-> 0
Transactions in shardman: isolation
46
MVCC in two sentences:
UPDATE/DELETE create a new tuple version instead of overwriting in place
Each transaction gets the current database version (xid, CSN, timestamp) at start and is
able to see only the appropriate versions.
acc1
ver 10: {1, 0}
ver 20: {1, 2}
ver 30: {1, 4}
––––– snapshot = 34 –––––
ver 40: {1, 2}
Transactions in shardman: isolation
47
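The same picture in a two-session sketch on a single node (table name is illustrative); REPEATABLE READ keeps one snapshot for the whole transaction.

-- session 1: the snapshot is taken at the first query
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT abalance FROM accounts WHERE aid = 1;  -- sees, say, ver 30: {1, 4}

-- session 2: UPDATE creates ver 40 instead of overwriting ver 30
UPDATE accounts SET abalance = 2 WHERE aid = 1;

-- session 1: still sees ver 30, because ver 40 is newer than its snapshot
SELECT abalance FROM accounts WHERE aid = 1;
COMMIT;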
BEGIN
Transactions in shardman: isolation
48
Do some serious stuff
Transactions in shardman: isolation
49
COMMIT
Transactions in shardman: isolation
50
BEGIN
Transactions in shardman: isolation
51
Do some serious web scale stuff
Transactions in shardman: isolation
52
COMMIT
Transactions in shardman: isolation
53
Transactions in shardman: Clock Skew
54
Clock-SI slightly changes the visibility rules:
version = timestamp
Visibility’: waits if a tuple came from the future. (Do not allow time-travel paradoxes!)
Visibility”: waits if a tuple is already prepared (P) but not yet committed (C).
Commit’: receives local versions from partitions on Prepare and commits with the
maximal version.
Transactions in shardman: isolation
55
[Chart: TPS (0 to 50000) vs. number of nodes (0 to 14). pgbench -N on EC2 c3.2xlarge; the client is oblivious to key distribution. Series: single node, no shardman; pg_shardman, no replication; pg_shardman, redundancy 1, async replication.]
Some benchmarks
56
pg_shardman with docs is available at github.com/postgrespro/pg_shardman
Report issues on GitHub
Some features require patched postgres
github.com/postgrespro/postgres_cluster/tree/pg_shardman
2PC and distributed snapshot manager
COPY FROM to sharded tables additionally needs a patched pg_pathman
We appreciate feedback!
57
