2. Let’s define it simply as a database that utilizes several servers.
The design space is huge, so let’s add some constraints:
Do we want to improve performance?
Do we want to improve availability?
Will the load be more like OLAP or OLTP? Or both?
What consistency guarantees do we need?
These goals can contradict each other!
What do we mean by a distributed DB or cluster?
4. Async:
The master just streams the journal (WAL) to slaves; a slave can lag.
Sync:
The master waits for the WAL sync on the replica. But we can still read
old data from the replica – there is a time gap between the WAL fsync and its
application to the data files.
Sync + remote_apply (new in 9.6):
Wait for the data to be applied to the files on the replica. Preserves
causality.
Master-slave replication
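In configuration terms, the three modes above correspond roughly to the following master-side settings (a sketch; 'replica1' is a hypothetical standby name):

```
# postgresql.conf on the master ('replica1' is a hypothetical standby name)
synchronous_standby_names = 'replica1'

# synchronous_commit controls how long commit waits:
#   off / local    -> async: WAL is merely streamed, the replica can lag
#   on             -> sync: wait for the WAL flush on the replica
#   remote_apply   -> sync + remote_apply (9.6+): wait until the replica
#                     has applied the WAL, preserving causality
synchronous_commit = remote_apply
```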
5. Pros:
Improves availability
Read-only queries can be sent to a replica – improves
read performance.
Cons:
Automatic failover is hard to achieve, and external tools are
required (Pacemaker, Patroni, etc.);
Read-only queries on a replica are limited: no temp tables, no
catalog access.
Master-slave replication
6. Postgres-XC is a distributed DB that grew out of the postgres sources, originally
built for write scalability. Development now continues at Huawei
(Postgres-XC2) and at 2ndQuadrant (Postgres-XL).
Let’s concentrate on Postgres-XL:
a postgres fork built for write scalability and OLAP performance;
distributed transactions;
3 types of nodes in the system: coordinator, datanode, GTM (+
gtm-proxy).
Postgres-XC/XC2/XL
8. Pros:
now merged with 9.5;
good for OLAP;
good for write-heavy loads.
Cons:
a fork – not maintained by the postgres core team;
not good for high TPS;
the GTM is a hot spot.
Postgres-XL
9. An MPP database by Pivotal.
Open-sourced in 2015.
Built on top of postgres 8.3.
Same niche as Postgres-XL, but more mature.
Greenplum
10. Also sharding + analytics (distributed joins).
Not a fork! An extension to ordinary postgres!
Not good for high TPS (the master node is a hot spot);
2 types of nodes: master node and data node;
No distributed transactions (and hence no proper isolation).
CitusDB
11. In-core extension to access a remote table as if it were a local table.
In core;
Some distributed planning (aggregate pushdown);
Can be used for sharding (along with the standard table
inheritance mechanism);
Neither atomic nor isolated.
Postgres_fdw
12. Bi-Directional Replication. A fork, but most of its features have been
merged into postgres core.
Multimaster replication, but without conflict detection (no
distributed transactions, no distributed queries).
Very useful if the application itself can determine which node to access.
Good for geo-replication.
BDR
13. Was part of BDR, now shipped as an extension.
Can decode the journal (WAL) into a sequence of logical operations in
a custom format.
Libpq binary and JSON output formats are included.
Server-side events (filter and send to WebSocket connections)
Replication between different versions
Replication to other databases!
pg_logical + decoding plugin
14. Started about a year ago.
Konstantin Knizhnik, Constantin Pan, Stas Kelvich
Cluster group in PgPro
15. We started by playing with Postgres-XC. 2ndQuadrant also had a project
(now finished) to port XC to 9.5.
Maintaining a fork is painful;
how can we bring the functionality of XC into core?
Cluster group in PgPro
16. Distributed transactions - nothing in-core;
Distributed planner - fdw, pg_shard, greenplum planner (?);
HA/Autofailover - can be built on top of logical decoding.
Distributed postgres
20. src server:
  create extension pg_dtm;
  select begin_glob...();
  begin;
  update ...;
  commit;
dst server:
  create extension pg_dtm;
  select join_glob...(xid);
  begin;
  update ...;
  commit;
Client’s point of view
21. Goal: achieve proper isolation for multi-node transactions.
Currently in postgres, at the start of a write tx we:
Acquire an XID;
Get the list of running tx’s;
Use that info in visibility checks.
Distributed transactions
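The visibility rule sketched above can be illustrated in a few lines of Python (a simplified model that ignores subtransactions, aborts, and hint bits; all names are hypothetical):

```python
def tuple_visible(xmin, snapshot):
    """Simplified MVCC visibility check for a row version.

    snapshot is (xmax, running): 'xmax' is the next XID at snapshot time,
    'running' is the set of XIDs that were in progress at snapshot time.
    """
    snap_xmax, running = snapshot
    if xmin >= snap_xmax:       # creator started after the snapshot was taken
        return False
    if xmin in running:         # creator was still running: not yet committed
        return False
    return True                 # creator committed before our snapshot

# A tx with XID 90 committed before the snapshot (xmax=100, running={95}):
snapshot = (100, {95})
print(tuple_visible(90, snapshot))   # True
print(tuple_visible(95, snapshot))   # False: still running
print(tuple_visible(120, snapshot))  # False: started later
```

For distributed transactions the hard part is making the XID and the running-tx list agree across nodes, which is what the XTM implementations on the next slides address.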
25. Acquire XIDs centrally (DTMd, arbiter);
No local tx possible;
The DTMd is a bottleneck.
XTM implementations
GTM or snapshot sharing
26. Based on a paper from the SAP HANA team;
A central daemon is needed, but only for multi-node tx;
Snapshots -> Commit Sequence Numbers (CSNs);
The DTMd is still a bottleneck.
XTM implementations
Incremental SI
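With CSNs the running-tx list disappears from the hot path: visibility becomes a single comparison of commit sequence numbers (a hypothetical sketch; names are made up):

```python
def csn_visible(row_commit_csn, snapshot_csn):
    # A row version is visible iff its creating tx committed (was assigned
    # a CSN) at or before the moment our snapshot was taken. Uncommitted
    # rows have no CSN yet and are represented here as None.
    return row_commit_csn is not None and row_commit_csn <= snapshot_csn

print(csn_visible(41, 42))    # True: committed before the snapshot
print(csn_visible(43, 42))    # False: committed after
print(csn_visible(None, 42))  # False: not committed yet
```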
27. XIDs/CSNs are gathered from all nodes that participate in the tx;
No central service;
Local tx are possible;
Communication can be reduced by using time (Spanner,
CockroachDB).
XTM implementations
Clock-SI or tsDTM
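The core idea of Clock-SI-style commit can be sketched as a two-phase exchange: each participant proposes a CSN from its own local clock at prepare time, and the tx commits at the maximum of the proposals, so all nodes agree on one commit point without a central service (a toy model; all names are hypothetical):

```python
import itertools

class Node:
    """A participant with its own, possibly skewed, local clock."""
    def __init__(self, skew):
        self.skew = skew
        self.ticks = itertools.count()

    def now(self):
        # Monotonic local clock, shifted by this node's skew.
        return next(self.ticks) + self.skew

def distributed_commit_csn(participants):
    # Phase 1: every node proposes a CSN from its local clock.
    proposals = [node.now() for node in participants]
    # Phase 2: commit at the maximum proposal, so the commit point is not
    # earlier than any participant's clock -- this is what lets each node
    # later decide visibility locally, without a GTM.
    return max(proposals)

nodes = [Node(skew=0), Node(skew=7), Node(skew=3)]
print(distributed_commit_csn(nodes))  # 7: the most-skewed clock wins
```

A reader whose local clock is behind the chosen snapshot timestamp simply waits until its clock catches up, which is how the real protocol avoids anomalies.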
29. The more nodes, the higher the probability of a failure in the system.
Possible problems with nodes:
A node stopped (and will not be back);
A node was down for a short time (and we should bring it
back into operation);
Network partitions (avoid split-brain).
If we want to survive network partitions, then we can tolerate no more
than ⌈N/2⌉ - 1 failures.
HA/autofailover
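The ⌈N/2⌉ − 1 bound follows from requiring a strict majority after a partition; a quick check (the function name is made up):

```python
def max_tolerated_failures(n):
    # To avoid split-brain, a partition may only keep operating if it holds
    # a strict majority of the n nodes, so we can lose at most the rest:
    # n - (n // 2 + 1), which equals ceil(n / 2) - 1.
    majority = n // 2 + 1
    return n - majority

for n in (3, 4, 5):
    print(n, max_tolerated_failures(n))
# 3 nodes tolerate 1 failure, 4 tolerate 1, 5 tolerate 2
```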
30. Possible uses of such a system:
Multimaster replication;
Tables with meta-information in sharded databases;
Sharding with redundancy.
HA/autofailover
31. By multimaster we mean a strongly coupled one that acts as a single
database, with proper isolation and no merge conflicts.
Ways to build one:
Globally order the XLOG (Postgres-R, MySQL Galera);
Wrap each tx as a distributed one – this allows parallelism while applying
tx.
Multimaster
32. Our implementation:
Built on top of pg_logical;
Makes use of tsDTM;
A pool of workers for tx replay;
Raft-based storage for dealing with failures and for distributed
deadlock detection.
Multimaster
33. Our implementation:
Approximately half the speed of standalone postgres;
Same speed for reads;
Handles node auto-recovery;
Handles network partitions (being debugged right now);
Can work as an extension (if the community accepts the XTM API into
core).
Multimaster