HighLoad++ 2017
Cape Town Hall, November 7, 15:00
Abstract:
http://www.highload.ru/2017/abstracts/3043.html
Citus is an open-source extension to Postgres that transforms it into a multi-node, distributed database. It allows you to horizontally scale out both your data and your queries.
In this session you'll learn how Citus takes care of sharding, distributed transactions, and even masterless writes. You'll learn how to transition your database from single-node Postgres in order to scale up your database to bigger and bigger sizes as your data grows.
4. A. Start with SQL, need to scale out, and migrate to NoSQL
B. Start with NoSQL, hoping you'll actually need to scale out later
C. Start with SQL, need to scale out, and stay with SQL?
Possible Paths
5. What is Citus?
1. Scales out Postgres
2. Extension to Postgres
3. Available in three ways
• Using sharding & replication
• Query engine parallelizes SQL queries across many nodes
• Using Postgres extension APIs
6. Citus, Packaged Three Ways
Open
Source
Enterprise
Software
Fully-Managed
Database as a Service
github.com/citusdata/citus
(coordinator node)=# \d
Schema | Name
--------+------------
public | cw_metrics
public | events
(worker node)=# \d
Schema | Name
--------+-------------------
public | cw_metrics_102008
public | cw_metrics_102012
public | cw_metrics_102016
public | cw_metrics_102064
public | cw_metrics_102068
public | events_102104
public | events_102108
public | events_102112
public | events_102116
...
12. Why is High Availability hard?
Postgres replication uses one primary & multiple
secondary nodes. Two challenges:
1. Most Postgres clients aren’t smart. When the
primary fails, they retry the same IP.
2. Postgres replicates entire state. This makes it
resource intensive to reconstruct new nodes from a
primary.
14. Database Failures Shouldn’t Be a Big Deal
1. Postgres streaming replication to replicate from
primary to secondary. Back up to S3.
2. Volume level replication to replicate to secondary’s
volume. Back up to S3.
3. Incremental backups to S3. Reconstruct secondary
nodes from S3.
3 Methods for HA & Backups in Postgres
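The third method (incremental backups and reconstructing secondaries from S3) can be sketched as a toy model: a node's state is a base backup plus a stream of log records replayed on top of it. This is illustrative only — the record format and function names here are hypothetical, not the real Postgres WAL format.

```python
# Toy model of "reconstruct from WAL": a new secondary is built from a
# base backup plus incremental log records, without touching the primary.

def take_base_backup(state):
    """Snapshot the full table state (think: pg_basebackup shipped to S3)."""
    return dict(state)

def replay_wal(base, wal_records):
    """Apply incremental log records on top of a base backup."""
    state = dict(base)
    for op, key, value in wal_records:
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

# Primary's life: base backup taken early, writes continue afterwards.
primary = {"a": 1}
backup = take_base_backup(primary)  # shipped to S3 once
wal = [("put", "b", 2), ("put", "a", 7), ("delete", "b", None)]  # shipped incrementally

# A new secondary starts from S3 alone, then replays the incremental records.
secondary = replay_wal(backup, wal)
print(secondary)  # {'a': 7}
```

Because the secondary is built entirely from storage, bringing nodes up (or shooting them down) places no load on the primary — the point the later slides make about this approach.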
17. Postgres – Reconstruct from WAL (3)
[Diagram: a Postgres primary and a Postgres secondary, each on a persistent volume holding table foo, table bar, and WAL logs. A backup process on each node ships base backups and WAL to encrypted S3 / blob storage; monitoring agents handle automatic node failover.]
18. How do these approaches compare?

Streaming Replication (local / ephemeral disk)
Who does this? On-prem, manual EC2
Primary benefits: Simple to set up; direct I/O: high I/O & large storage

Disk Mirroring
Who does this? RDS, Azure Preview
Primary benefits: Works for MySQL and Postgres; data durability in cloud environments

Reconstruct from WAL
Who does this? Heroku, Citus Data
Primary benefits: Enables fork and PITR; node reconstruction in background; (data durability in cloud environments)
20. Summary
• In Postgres, a database node’s state gets replicated in
its entirety. The replication can be set up in three
ways.
• Reconstructing a secondary node from S3 makes
bringing up or shooting down nodes easy.
• When you shard your database, the state you need to
replicate per node becomes smaller.
22. 3 ways to build a distributed database
1. Build a distributed database from scratch
2. Middleware sharding (mimic the parser)
3. Fork your favorite database (like Postgres)
26. Two Stage Query Optimization
1. Plan to minimize network I/O
2. Nodes talk to each other using SQL over libpq
3. Learned to cooperate with planner / executor bit by bit
(Volcano style executor)
27. Citus Architecture (Simplified)
Application sends: SELECT avg(revenue) FROM sales
Coordinator (holds table metadata) rewrites it into per-shard fragments:
SELECT sum(revenue), count(revenue) FROM table_1001
SELECT sum … FROM table_1003 → worker node 1 (table_1001, table_1003)
SELECT sum … FROM table_1002
SELECT sum … FROM table_1004 → worker node 2 (table_1002, table_1004)
… worker node N
Each node is Postgres with Citus installed
1 shard = 1 Postgres table
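The coordinator's plan above can be sketched in a few lines. Note that avg() cannot be pushed down to shards directly, so the coordinator asks each shard for sum() and count() and combines them itself. The shard names follow the slide; the data and function names are illustrative, not Citus internals.

```python
# Toy model of SELECT avg(revenue) FROM sales on a sharded table.
# Each shard is just a list of revenue values here.
SHARDS = {
    "table_1001": [10.0, 20.0],  # worker node 1
    "table_1003": [30.0],        # worker node 1
    "table_1002": [40.0, 50.0],  # worker node 2
    "table_1004": [],            # worker node 2
}

def worker_fragment(rows):
    """What each worker runs locally: SELECT sum(revenue), count(revenue)."""
    return sum(rows), len(rows)

def coordinator_avg(shards):
    """Merge the per-shard (sum, count) pairs into the final average."""
    total, count = 0.0, 0
    for rows in shards.values():
        s, c = worker_fragment(rows)
        total, count = total + s, count + c
    return total / count if count else None

print(coordinator_avg(SHARDS))  # 30.0
```

This is why the fragments in the diagram ask for sum and count rather than avg: (sum, count) pairs merge correctly across shards, while per-shard averages would not.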
29. Unfork Citus using Extension APIs
CREATE EXTENSION citus;
• System catalogs – Distributed metadata
• Planner hook – Insert, Update, Delete, Select
• Executor hook – Insert, Update, Delete, Select
• Utility hook – Alter Table, Create Index, Vacuum, etc.
• Transaction & resources handling – file descriptors, etc.
• Background worker process – Maintenance processes (distributed
deadlock detection, task tracker, etc.)
• Logical decoding – Online data migrations
32. Consistency in Distributed Databases
1. 2PC: All participating nodes need to be up
2. Paxos: Achieves consensus with quorum
3. Raft: More understandable alternative to Paxos
35. What is a Lock?
• Protects against concurrent modifications.
• Locks are released at the end of a transaction.
Deadlocks
36. Transactions Block on the 1st Conflicting Lock
Session A:
BEGIN;
UPDATE data SET y = 2 WHERE x = 1;
<obtained lock on rows with x = 1>
COMMIT;
<all locks released>

Session B:
BEGIN;
UPDATE data SET y = 5 WHERE x = 1;
<waiting for lock on rows with x = 1>
<obtained lock on rows with x = 1>
COMMIT;
37. Transactions and Concurrency
• Transactions that don’t modify the same row can run concurrently.
• Transactions block on the 1st lock that conflicts.

Session A:
BEGIN;
UPDATE data SET y = y - 1 WHERE x = 1;
COMMIT;
<all locks released>

Session B:
BEGIN;
UPDATE data SET y = y + 1 WHERE x = 2;
UPDATE data SET y = y + 1 WHERE x = 1;
<waiting for lock on rows with x = 1>
<obtained lock on rows with x = 1>
COMMIT;
38. Transactions and Concurrency
But what if they start blocking each other? (Distributed) deadlock!

Session A:
BEGIN;
UPDATE data SET y = y - 1 WHERE x = 1;
UPDATE data SET y = y + 1 WHERE x = 2;

Session B:
BEGIN;
UPDATE data SET y = y - 1 WHERE x = 2;
UPDATE data SET y = y + 1 WHERE x = 1;
39. Deadlock Detection in PostgreSQL
Deadlock detection builds a graph of processes that are waiting for each other.
40. Deadlock Detection in PostgreSQL
Transactions are cancelled until the cycle is gone.
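The wait-for graph check can be sketched with a plain depth-first search: nodes are transactions, an edge A → B means A waits on a lock held by B, and any cycle is a deadlock. This is a simplified illustration; Postgres' actual detector is more involved.

```python
# Sketch of deadlock detection over a wait-for graph.
# waits_for maps a transaction to the transactions it is blocked on.

def find_cycle(waits_for):
    """Return the set of transactions on some wait cycle, or an empty set."""
    visited, on_stack = set(), []

    def dfs(txn):
        if txn in on_stack:
            # Found a back-edge: everything from txn onward is the cycle.
            return set(on_stack[on_stack.index(txn):])
        if txn in visited:
            return set()
        visited.add(txn)
        on_stack.append(txn)
        for holder in waits_for.get(txn, ()):
            cycle = dfs(holder)
            if cycle:
                return cycle
        on_stack.pop()
        return set()

    for txn in waits_for:
        cycle = dfs(txn)
        if cycle:
            return cycle
    return set()

# The deck's example: T1 and T2 update rows x=1 and x=2 in opposite order,
# so each waits on the other — a cycle of two.
print(find_cycle({"T1": ["T2"], "T2": ["T1"]}))  # both T1 and T2 are in the cycle
print(find_cycle({"T1": ["T2"], "T2": []}))      # no cycle: empty set
```

Once a cycle is found, cancelling any one transaction on it releases that transaction's locks and breaks the cycle — which is exactly what the slide means by "cancelled until the cycle is gone".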
41. Deadlocks in Citus
Citus delegates transactions to nodes.
42. Deadlocks in Citus
PostgreSQL’s deadlock detector still works.
43. Deadlocks in Citus
When deadlocks span across nodes, PostgreSQL cannot help us.
44. Deadlock Detection in Citus 7
Citus 7 adds distributed deadlock detection.
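The distributed step can be sketched as follows: each node only sees local waits between shard-level transactions, so no single node sees a cycle. A background worker gathers every node's wait edges, maps each shard-level transaction to its distributed transaction, and checks the merged graph. Names and data shapes here are illustrative, not Citus internals.

```python
# Sketch of distributed deadlock detection: merge per-node wait-for edges
# into one graph keyed by distributed transaction id, then look for cycles.

def merged_waits(per_node_edges, txn_of):
    """Collapse per-node shard-txn edges into distributed-txn edges."""
    graph = {}
    for edges in per_node_edges.values():
        for waiter, holder in edges:
            a, b = txn_of[waiter], txn_of[holder]
            if a != b:
                graph.setdefault(a, set()).add(b)
    return graph

def has_cycle(graph):
    """DFS with the current path carried as an immutable set."""
    def dfs(node, stack):
        if node in stack:
            return True
        return any(dfs(n, stack | {node}) for n in graph.get(node, ()))
    return any(dfs(n, frozenset()) for n in graph)

# The notes' example: D1 waits on D2 via node 1's shards, while D2 waits
# on D1 via node 3's shards. Neither node sees a local cycle, but the
# merged graph has one.
txn_of = {"n1-a": "D1", "n1-b": "D2", "n3-a": "D2", "n3-b": "D1"}
edges = {"node1": [("n1-a", "n1-b")], "node3": [("n3-a", "n3-b")]}
print(has_cycle(merged_waits(edges, txn_of)))  # True
```

Cancelling one distributed transaction on the coordinator then aborts its shard transactions on the workers, breaking the cycle everywhere at once.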
46. Summary
Distributed transactions are a complex topic. Most
articles on this topic focus on data consistency.
Data consistency is only one side of the coin. If you’re
using a relational database, your application benefits
from another key feature: deadlock detection.
https://www.citusdata.com/blog/2017/08/31/databases-and-distributed-deadlocks-a-faq
Lessons learned over the years in turning Postgres into a distributed database.
lessons applicable in a broader context than just PG
Fairly technical talk. If you have questions, please feel free to ask them.
Speak slowly.
Job Posting trends on HN
Last year PG = next for combined
So, developers love Postgres.
And it’s worth learning more about the technical components that make up Postgres and how one goes about scaling them.
Horizontally scales out PG across machines, using sharding + replication
Route vs. parallelize queries
Can do it yourself, but nice to have the DB do this for you
Packaged as extension. extension APIs are new, unique to Postgres. More on them later in the talk.
In 1 sense, Citus does very little to PG
C & W nodes are PG with Citus ext
User connects to C. Manage+create dist tables
Queries ran through C, using standard PG protocol
C transforms query to smaller queries, push down to W
C merges, aggregates if necessary
C doesn’t own any data (mostly)
W each has several shards which are small tables
Metadata table
The previous diagram looks simple, but scaling out SQL is actually an extremely challenging task.
Rest of talk explains Citus by looking at 3 challenges
This part describes the most asked questions about PG.
How do you handle replication and machine failures?
What challenges do you run into when setting up HA PG clusters?
Common setup: 1 Primary writes, many read replicas
In the context of Postgres, this setup brings two challenges.
First, many Postgres clients talk to a single endpoint. When the primary node fails, they will keep retrying the same IP.
Second, Postgres replicates its entire state. This makes it hard to shoot different nodes in the head and bring new nodes into the cluster.
PG has a large ecosystem of clients.
Some can take a list of IPs to try (Java, PG 10).
A list only works if you know all possible failovers upfront
To solve generally, need Network Primitives
elastic IP, or DNS, or a load balancer.
This example is EIP failover
2nd problem not widely recognized. Most think primary/secondary is enough
In practice, 1 of 3 approaches for replication and fail-over.
When you bring up new secondary, how does it start?
1st approach is the most common
Primary node has the tables’ data and write-ahead logs.
<explain wal>
Stream WAL to secondary, from beginning
Can cause load on the primary
disk mirroring / block-based replication.
Writes go to a persistent volume. This volume then gets synchronously mirrored to another volume.
works for all RDBMS. You can use it for MySQL, Postgres, Oracle, or SQL Server.
However, this approach also requires that you replicate both table and WAL log data.
writes need to synchronously go over the network.
Missing 1 byte can cause corruption
turns the replication and disaster recovery process inside out.
base backup / incremental wal to s3
New secondary comes replays from s3
Switch to streaming replication for latency
Better for cloud, easy to bring up AND down replicas
Sync or async
Each benefit is drawback for others
1 Simple streaming replication is most common. Most on-prem. Easy to set up. Local disks ~10TBs
2 Disk Mirroring abstracts storage layer from DB. Loss of instance != loss of disk
3 Treat WAL as a first-class citizen, and certain features become trivial.
<explain + why fork, pitr>
Questions?
All three replication methods replicate a database’s state in full.
Sharding reduces the state you need to replicate per machine
So replication becomes a much easier problem to solve.
RDBMS have diverse features over many years
Distributing them introduces a lot of challenges
Middleware: routes queries (inserts, simple selects)
Fork: features diverge over time, eventually becomes a separate project
Early on, while handling queries from users, we came across this
Savepoints in PG are like nested transactions.
Making this work in a distributed DB was very difficult, so they decided not to do it at the time
Founders wanted to know who would write such a query
Turns out not a person, just the Rails testing framework
Then knew: to really scale PG, need to go all in
All features people rely on need to work
Current/New features, clients, ORMs, 100s of tools around PG
Citus started as a fork. Extension support in PG wasn’t enough
Inside, Postgres is very modular, fairly easy to hook in without a mess
To distribute CREATE INDEX, hook into the DDL and utility processing
Planner and executor are the most complex; at the time they made assumptions about the storage layer. So, we created a two-stage query planner and executor.
PG parses and semantically validates the query; Citus checks if it touches distributed tables
Use distributed table metadata to plan the query
Minimize I/O and transform the query into smaller fragments
Citus then deparses these query fragments back into SQL:
1) Distributed query planner decoupled from the distributed executor. Testing, logging
2) PG workers can optimize for local execution
Example SELECT to C. C parses the query.
Citus planner hook xforms into query fragments
Distr planner deparses these query fragments back into SQL and sends to W
W do own local planning and execution and send the results back to C
C does final computation on results and returns to application.
Over time we worked with PG to make these APIs official with extension framework
So we unforked from Postgres and made it an official extension.
An extension is a shared library that gets dynamically loaded into Postgres’ address space.
All you need is `create extension citus` to make PG a distr DB
handling distributed transactions in a relational database.
Distributed transactions big, heavy researched area
Inside your TXN, you should see your changes, but others shouldn’t until COMMIT
Or ROLLBACK
2 related challenges: Consisty and Locks
What happens when 1 or more machines that participate in a transaction fail?
Consistency is a well established problem in distributed systems. Three popular algorithms
2PC requires all nodes to be up to make progress; Paxos/Raft don't
We looked both at 2PC in Postgres and also wrote “pg_paxos”
We went with 2PC because it has been widely used in PG
With streaming rep, secondary promoted quickly
Not as popular a problem, but important:
Concurrent txns want to modify the same rows, what happens?
At core locks are simple
Prevent 2 txn from modifying same row, concurrently
Txns can get complicated, grab many locks
2 concurrent txn grab same lock
Need some way to deal with this
Almost any command you run grabs some locks.
This UPDATE gets a row level lock
any concurrent txn that tries to update the same row will block
And then after commit or abort, all locks are released and the second txn continues.
If 2 txn have different filters, run concurrent
Allows for good write throughput
If later in the 2nd txn, conflict, then you block
Both modify the same rows in a different order.
Right x=2, left x=1
And now wait for each other
No way out, neither can continue: deadlock
New txns come in and also get stuck
Escalate to full system outage
If txn stuck 1 second, runs deadlock detection.
Looks at lock graph, across all processes, builds a graph of txns
Nodes are txns, edges are waiting on other
Cycles = deadlocks
If deadlock, cancel some txns until the cycle is gone
Locks are released, others can continue and finish
1 txn dies, 1 lives
Citus has txns, delegated to the W node that has the data
If 2 txns happen to go to the same W, normal PG deadlock
See a cycle, cancel one
sends error back to C. C then aborts txn. Other txns can continue.
What if a txn spans several machines?
Txn D1 waits on D2 on node 1; D2 waits on D1 on node 3
C wait for response from both nodes
No deadlock on any node
But there is a distributed deadlock
Runs as a background worker
If distr txn is stuck, gather lock tables from nodes all over the network
Build dist txn graph, associate txn on nodes to overall txn
Notice which is waiting on what
With that graph, can see cycles
Cancel the txn on coordinator, which will then go and abort on W nodes
Other dist txns can continue
Necessary part of having dist txns
Most things you see is on consistency: 2PC, Paxos, or Raft.
Important, but only part of the story
If you want to scale txns, you also need deadlock detection
We talked about three technical problems today.
First, replication and high availability in Postgres.
Second, Postgres’s extension APIs and how Citus leverages them to introduce distributed functionality.
Last, distributed deadlock detection.
When we started everyone said “sql doesn’t scale”
Easy to dismiss an intractable problem as impossible, or to trivialize it
Scaling SQL several problems, we covered 3
“Scaling out SQL” is a very very hard, but not an impossible problem to solve.