2. Let’s define it simply as a database that utilizes several servers.
The design space is huge, so let’s add some constraints:
Do we want to improve performance?
Do we want to improve availability?
Will the load be more like OLAP or OLTP? Or both?
What consistency guarantees do we need?
These goals can contradict each other!
What do we mean by a distributed DB or cluster?
4. Async:
The master just streams the journal (WAL) to slaves; a slave can lag.
Sync:
The master waits for the WAL sync on the replica. But we can still read
old data from the replica – there is a time gap between the WAL fsync and its
application to the data files.
Sync + remote_apply (new in 9.6):
Wait for the data to be applied to the files on the replica. Preserves
causality.
Master-slave replication
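In configuration terms, the three modes above correspond roughly to the following master-side settings (a sketch; 'replica1' is a hypothetical standby name):

```
# postgresql.conf on the master ('replica1' is a hypothetical standby name)
synchronous_standby_names = 'replica1'

# synchronous_commit controls how long commit waits:
#   off / local    -> async: WAL is merely streamed, the replica can lag
#   on             -> sync: wait for the WAL flush on the replica
#   remote_apply   -> sync + remote_apply (9.6+): wait until the replica
#                     has applied the WAL, preserving causality
synchronous_commit = remote_apply
```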
5. Pros:
Improves availability
Read-only queries can be sent to a replica – improves
read performance.
Cons:
Automatic failover is hard to achieve, and external tools are
required (Pacemaker, Patroni, etc.);
Read-only queries on a replica are limited: no temp tables, no
catalog access.
Master-slave replication
6. Postgres-XC is a distributed DB that grew out of the postgres sources, originally
built for write scalability. Development now continues at Huawei
(Postgres-XC2) and at 2ndQuadrant (Postgres-XL).
Let’s concentrate on Postgres-XL:
a postgres fork built for write scalability and OLAP performance;
distributed transactions;
3 types of nodes in the system: coordinator, datanode, GTM (+
gtm-proxy).
Postgres-XC/XC2/XL
8. Pros:
now merged with 9.5;
good for OLAP;
good for write-heavy loads.
Cons:
a fork – not maintained by the postgres core team;
not good for high TPS;
the GTM is a hot spot.
Postgres-XL
9. An MPP database by Pivotal.
Open-sourced in 2015.
Built on top of postgres 8.3.
Same niche as Postgres-XL, but more mature.
Greenplum
10. Also sharding + analytics (distributed joins).
Not a fork! An extension to ordinary postgres!
Not good for high TPS (the master node is a hot spot);
2 types of nodes: master node and data node;
No distributed transactions (and hence no proper isolation).
CitusDB
11. In-core extension to access a remote table as if it were a local table.
In core;
Some distributed planning (aggregate pushdown);
Can be used for sharding (along with the standard table
inheritance mechanism);
Neither atomic nor isolated.
Postgres_fdw
12. Bi-Directional Replication. A fork, but most of its features have been
merged into postgres core.
Multimaster replication, but without conflict detection (no
distributed transactions, no distributed queries).
Very useful if the application itself can determine which node to access.
Good for geo-replication.
BDR
13. Was part of BDR, now shipped as an extension.
Can decode the journal (WAL) into a sequence of logical operations in
a custom format.
Libpq binary and JSON output formats are included.
Server-side events (filter and send to WebSocket connections)
Replication between different versions
Replication to other databases!
pg_logical + decoding plugin
14. Started about a year ago.
Konstantin Knizhnik, Constantin Pan, Stas Kelvich
Cluster group in PgPro
15. We started by playing with Postgres-XC. 2ndQuadrant also had a project
(now finished) to port XC to 9.5.
Maintaining a fork is painful;
how can we bring the functionality of XC into core?
Cluster group in PgPro
16. Distributed transactions - nothing in-core;
Distributed planner - fdw, pg_shard, greenplum planner (?);
HA/Autofailover - can be built on top of logical decoding.
Distributed postgres
20. src server:
  create extension pg_dtm;
  select begin_glob...();
  begin;
  update ...;
  commit;
dst server:
  create extension pg_dtm;
  select join_glob...(xid);
  begin;
  update ...;
  commit;
Client’s point of view
21. Goal: achieve proper isolation for multi-node transactions.
Currently in postgres, at the start of a write tx we:
Acquire an XID;
Get the list of running tx’s;
Use that info in visibility checks.
Distributed transactions
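The visibility rule sketched above can be illustrated in a few lines of Python (a simplified model that ignores subtransactions, aborts, and hint bits; all names are hypothetical):

```python
def tuple_visible(xmin, snapshot):
    """Simplified MVCC visibility check for a row version.

    snapshot is (xmax, running): 'xmax' is the next XID at snapshot time,
    'running' is the set of XIDs that were in progress at snapshot time.
    """
    snap_xmax, running = snapshot
    if xmin >= snap_xmax:       # creator started after the snapshot was taken
        return False
    if xmin in running:         # creator was still running: not yet committed
        return False
    return True                 # creator committed before our snapshot

# A tx with XID 90 committed before the snapshot (xmax=100, running={95}):
snapshot = (100, {95})
print(tuple_visible(90, snapshot))   # True
print(tuple_visible(95, snapshot))   # False: still running
print(tuple_visible(120, snapshot))  # False: started later
```

For distributed transactions the hard part is making the XID and the running-tx list agree across nodes, which is what the XTM implementations on the next slides address.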
25. Acquire XIDs centrally (DTMd, arbiter);
No local tx possible;
The DTMd is a bottleneck.
XTM implementations
GTM or snapshot sharing
26. Based on a paper from the SAP HANA team;
A central daemon is needed, but only for multi-node tx;
Snapshots -> Commit Sequence Numbers (CSNs);
The DTMd is still a bottleneck.
XTM implementations
Incremental SI
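With CSNs the running-tx list disappears from the hot path: visibility becomes a single comparison of commit sequence numbers (a hypothetical sketch; names are made up):

```python
def csn_visible(row_commit_csn, snapshot_csn):
    # A row version is visible iff its creating tx committed (was assigned
    # a CSN) at or before the moment our snapshot was taken. Uncommitted
    # rows have no CSN yet and are represented here as None.
    return row_commit_csn is not None and row_commit_csn <= snapshot_csn

print(csn_visible(41, 42))    # True: committed before the snapshot
print(csn_visible(43, 42))    # False: committed after
print(csn_visible(None, 42))  # False: not committed yet
```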
27. XIDs/CSNs are gathered from all nodes that participate in the tx;
No central service;
Local tx are possible;
Communication can be reduced by using time (Spanner,
CockroachDB).
XTM implementations
Clock-SI or tsDTM
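The core idea of Clock-SI-style commit can be sketched as a two-phase exchange: each participant proposes a CSN from its own local clock at prepare time, and the tx commits at the maximum of the proposals, so all nodes agree on one commit point without a central service (a toy model; all names are hypothetical):

```python
import itertools

class Node:
    """A participant with its own, possibly skewed, local clock."""
    def __init__(self, skew):
        self.skew = skew
        self.ticks = itertools.count()

    def now(self):
        # Monotonic local clock, shifted by this node's skew.
        return next(self.ticks) + self.skew

def distributed_commit_csn(participants):
    # Phase 1: every node proposes a CSN from its local clock.
    proposals = [node.now() for node in participants]
    # Phase 2: commit at the maximum proposal, so the commit point is not
    # earlier than any participant's clock -- this is what lets each node
    # later decide visibility locally, without a GTM.
    return max(proposals)

nodes = [Node(skew=0), Node(skew=7), Node(skew=3)]
print(distributed_commit_csn(nodes))  # 7: the most-skewed clock wins
```

A reader whose local clock is behind the chosen snapshot timestamp simply waits until its clock catches up, which is how the real protocol avoids anomalies.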
29. The more nodes, the higher the probability of a failure in the system.
Possible problems with nodes:
A node stopped (and will not be back);
A node was down for a short time (and we should bring it
back into operation);
Network partitions (avoid split-brain).
If we want to survive network partitions, then we can tolerate no more
than ⌈N/2⌉ - 1 failures.
HA/autofailover
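The ⌈N/2⌉ − 1 bound follows from requiring a strict majority after a partition; a quick check (the function name is made up):

```python
def max_tolerated_failures(n):
    # To avoid split-brain, a partition may only keep operating if it holds
    # a strict majority of the n nodes, so we can lose at most the rest:
    # n - (n // 2 + 1), which equals ceil(n / 2) - 1.
    majority = n // 2 + 1
    return n - majority

for n in (3, 4, 5):
    print(n, max_tolerated_failures(n))
# 3 nodes tolerate 1 failure, 4 tolerate 1, 5 tolerate 2
```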
30. Possible uses of such a system:
Multimaster replication;
Tables with meta-information in sharded databases;
Sharding with redundancy.
HA/autofailover
31. By multimaster we mean a strongly coupled one that acts as a single
database, with proper isolation and no merge conflicts.
Ways to build one:
Globally order the XLOG (Postgres-R, MySQL Galera);
Wrap each tx as a distributed one – this allows parallelism while applying
tx.
Multimaster
32. Our implementation:
Built on top of pg_logical;
Makes use of tsDTM;
A pool of workers for tx replay;
Raft-based storage for dealing with failures and for distributed
deadlock detection.
Multimaster
33. Our implementation:
Approximately half the speed of standalone postgres;
Same speed for reads;
Handles node auto-recovery;
Handles network partitions (being debugged right now);
Can work as an extension (if the community accepts the XTM API into
core).
Multimaster