Distributed Postgres with Citus / Will Leinweber (PostgreSQL)

HighLoad++ 2017

Cape Town Hall, November 7, 15:00

Abstract:
http://www.highload.ru/2017/abstracts/3043.html

Citus is an open-source extension to Postgres that transforms it into a multi-node, distributed database. It allows you to horizontally scale out both your data and your queries across many machines.

In this session you'll learn how Citus takes care of sharding, distributed transactions, and even masterless writes. You'll also learn how to transition from single-node Postgres so your database can keep scaling to bigger and bigger sizes as your data grows.

Distributed Postgres with Citus / Will Leinweber (PostgreSQL)

  1. Distributed Postgres with Citus. Will Leinweber
  2. Will Leinweber: Principal Cloud Engineer at Citus, previously at Heroku Postgres. @leinweber, bitfission.com (warning: autoplays MIDI)
  3. Developers Love Postgres. (Chart: job-posting trends for Postgres, MySQL, MongoDB, and SQL Server + Oracle. RDBMS category combines Postgres, MySQL, Microsoft SQL Server, and Oracle.)
  4. Possible Paths: A. Start with SQL, need to scale out, and migrate to NoSQL. B. Start with NoSQL, and hope you actually need that scale later. C. Start with SQL, need to scale out, and stay with SQL?
  5. What is Citus? 1. Scales out Postgres, using sharding & replication; the query engine parallelizes SQL queries across many nodes. 2. An extension to Postgres, using the Postgres extension APIs. 3. Available in 3 ways.
  6. Citus, Packaged Three Ways: open source (github.com/citusdata/citus), enterprise software, and a fully-managed database as a service.
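To make slide 5's "extension, not fork" point concrete, here is a minimal sketch of distributing a table. CREATE EXTENSION citus and create_distributed_table() are Citus's documented entry points; the events table and its columns are illustrative, not from the talk.

      -- minimal sketch: turn a regular table into a distributed table (illustrative schema)
      CREATE EXTENSION citus;                                -- on the coordinator node
      CREATE TABLE events (user_id bigint, payload jsonb);
      SELECT create_distributed_table('events', 'user_id');  -- shard on user_id across the workers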
  7. Simplified Citus Architecture
  8. (coordinator node)=# \d
      Schema | Name
     --------+------------
      public | cw_metrics
      public | events
     (worker node)=# \d
      Schema | Name
     --------+-------------------
      public | cw_metrics_102008
      public | cw_metrics_102012
      public | cw_metrics_102016
      public | cw_metrics_102064
      public | cw_metrics_102068
      public | events_102104
      public | events_102108
      public | events_102112
      public | events_102116
      ...
  9. citus=> select * from pg_dist_shard limit 10;
      logicalrelid | shardid | shardminvalue | shardmaxvalue
     --------------+---------+---------------+---------------
             19395 |  102040 |   -2147483648 |   -2013265921
             19395 |  102041 |   -2013265920 |   -1879048193
             19395 |  102042 |   -1879048192 |   -1744830465
             19395 |  102043 |   -1744830464 |   -1610612737
             19395 |  102044 |   -1610612736 |   -1476395009
             19395 |  102045 |   -1476395008 |   -1342177281
             19395 |  102046 |   -1342177280 |   -1207959553
             19395 |  102047 |   -1207959552 |   -1073741825
             19395 |  102048 |   -1073741824 |    -939524097
             19395 |  102049 |    -939524096 |    -805306369
      ...
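Each pg_dist_shard row maps a shard to a range of 32-bit hash values of the distribution column. Citus ships a helper for checking where a given value routes; a sketch (table name and value are illustrative):

      -- which shard holds rows with user_id = 42?
      SELECT get_shard_id_for_distribution_column('events', 42);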
  10. 3 Challenges Distributing Postgres: 1. Postgres and high availability. 2. To build a new distributed database, or to fork? 3. Distributed transactions.
  11. Postgres & High Availability (HA): designing for a cloud-native world.
  12. Why is High Availability hard? Postgres replication uses one primary & multiple secondary nodes. Two challenges: 1. Most Postgres clients aren't smart; when the primary fails, they keep retrying the same IP. 2. Postgres replicates its entire state, which makes it resource-intensive to reconstruct new nodes from a primary.
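Per the speaker notes, some clients can try a list of hosts (JDBC, and libpq as of Postgres 10). A sketch of a multi-host libpq connection string (host names illustrative):

      postgresql://db1.example.com,db2.example.com/app?target_session_attrs=read-write

This only helps if you know every possible failover target up front; the general fix is a network primitive such as an elastic IP, DNS, or a load balancer.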
  13. Database Failures Should Be Transparent
  14. Database Failures Shouldn't Be a Big Deal. 3 methods for HA & backups in Postgres: 1. Postgres streaming replication to replicate from primary to secondary; back up to S3. 2. Volume-level replication to replicate to the secondary's volume; back up to S3. 3. Incremental backups to S3; reconstruct secondary nodes from S3.
  15. Postgres: Streaming Replication (1). (Diagram: the primary streams write-ahead logs to the secondary; monitoring agents handle streaming-replication setup & auto failover; a backup process ships to encrypted S3 / blob storage.)
  16. Postgres: AWS RDS & Azure (2). (Diagram: the primary's persistent volume is mirrored to the standby's volume; monitoring agents handle auto node failover; a backup process ships to encrypted S3 / blob storage.)
  17. Postgres: Reconstruct from WAL (3). (Diagram: the primary's backup process ships tables and WAL to encrypted S3 / blob storage; secondaries are reconstructed from it; monitoring agents handle auto node failover.)
  18. How do these approaches compare?
      Approach                                        Who does this?        Primary benefits
      Streaming replication (local / ephemeral disk)  On-prem, manual EC2   Simple to set up; direct I/O: high I/O & large storage
      Disk mirroring                                  RDS, Azure (preview)  Works for MySQL and Postgres; data durability in cloud environments
      Reconstruct from WAL                            Heroku, Citus Data    Enables fork and PITR; node reconstruction in background; (data durability in cloud environments)
  19. wal-e: github.com/wal-e/wal-e, github.com/wal-g/wal-g
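wal-e implements approach 3: it pushes base backups and WAL segments to object storage, and replicas replay from there. A sketch of wiring it in, following the wal-e README's envdir convention (paths illustrative):

      -- on the primary (archive_mode requires a server restart to take effect)
      ALTER SYSTEM SET wal_level = 'replica';
      ALTER SYSTEM SET archive_mode = 'on';
      ALTER SYSTEM SET archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p';
      -- base backups are then taken with: envdir /etc/wal-e.d/env wal-e backup-push $PGDATA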
  20. Summary:
      • In Postgres, a database node's state gets replicated in its entirety. The replication can be set up in three ways.
      • Reconstructing a secondary node from S3 makes bringing nodes up, or shooting them down, easy.
      • When you shard your database, the state you need to replicate per node becomes smaller.
  21. Postgres has a huge ecosystem. How do you keep up with it?
  22. 3 ways to build a distributed database: 1. Build a distributed database from scratch. 2. Middleware sharding (mimic the parser). 3. Fork your favorite database (like Postgres).
  23. Example Transaction Block
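The transaction block itself isn't preserved in the transcript; per the speaker notes it used savepoints, which act like nested transactions and were initially hard to support in a distributed database. A representative block (illustrative table) might be:

      BEGIN;
      INSERT INTO data (x, y) VALUES (1, 1);
      SAVEPOINT before_update;               -- a nested, partially rollback-able scope
      UPDATE data SET y = 2 WHERE x = 1;
      ROLLBACK TO SAVEPOINT before_update;   -- undo the UPDATE, keep the INSERT
      COMMIT;

Per the notes, the author of such queries turned out to be not a person but the Rails testing framework, which convinced the team that every feature people rely on needs to work.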
  24. Postgres Features, Tools & Frameworks:
      • Postgres manual (US Letter)
      • Clients for different programming languages
      • ORMs, libraries, GUIs
      • Tools (dump, restore, analyze)
      • New features
  25. At First, Forked Postgres with Style
  26. Two-Stage Query Optimization: 1. Plan to minimize network I/O. 2. Nodes talk to each other using SQL over libpq. 3. Learned to cooperate with the planner / executor bit by bit (Volcano-style executor).
  27. Citus Architecture (Simplified). The coordinator receives SELECT avg(revenue) FROM sales, consults its table metadata, and sends fragment queries such as SELECT sum(revenue), count(revenue) FROM table_1001 and SELECT sum … FROM table_1003 to worker node 1 (holding table_1001 and table_1003), SELECT sum … FROM table_1002 and table_1004 to worker node 2, and so on across worker node N. Each node is Postgres with Citus installed; 1 shard = 1 Postgres table.
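Note how avg() is decomposed into per-shard sum() and count() fragments; the coordinator then recombines the partial results. A sketch of that final merge (the collected-results relation is illustrative, not a Citus object):

      -- avg(revenue) = (sum of per-shard sums) / (sum of per-shard counts)
      SELECT sum(partial_sum) / sum(partial_count) AS avg_revenue
      FROM worker_results;  -- illustrative: one row per shard fragment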
  28. Unfork Citus using Extension APIs: CREATE EXTENSION citus;
      • System catalogs: distributed metadata
      • Planner hook: insert, update, delete, select
      • Executor hook: insert, update, delete, select
      • Utility hook: ALTER TABLE, CREATE INDEX, VACUUM, etc.
      • Transaction & resource handling: file descriptors, etc.
      • Background worker process: maintenance processes (distributed deadlock detection, task tracker, etc.)
      • Logical decoding: online data migrations
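Because Citus is an ordinary extension, its distributed metadata lives in queryable catalogs on the coordinator. A small sketch:

      CREATE EXTENSION citus;      -- a shared library loaded into Postgres's address space
      SELECT citus_version();      -- confirm the extension is active
      SELECT * FROM pg_dist_node;  -- the worker nodes registered with this coordinator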
  29. Postgres has transactions. How to handle distributed transactions?
  30. BEGIN, INSERT, UPDATE, SELECT, COMMIT, ROLLBACK
  31. Consistency in Distributed Databases: 1. 2PC: all participating nodes need to be up. 2. Paxos: achieves consensus with a quorum. 3. Raft: a more understandable alternative to Paxos.
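Per the speaker notes, the team looked at both 2PC in Postgres and Paxos (writing pg_paxos along the way) and chose 2PC because it has been widely used in Postgres, and streaming replication lets a secondary be promoted quickly. Postgres exposes the 2PC primitives directly; a minimal sketch (transaction name illustrative; requires max_prepared_transactions > 0):

      BEGIN;
      UPDATE data SET y = y - 1 WHERE x = 1;
      PREPARE TRANSACTION 'citus_0_42';  -- phase 1: persist the transaction state to disk
      COMMIT PREPARED 'citus_0_42';      -- phase 2: finalize once every participant prepared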
  32. Concurrency in Distributed Databases
  33. Locks
  34. What is a Lock? • Protects against concurrent modifications. • Locks are released at the end of a transaction.
  35. Transactions Block on the 1st Conflicting Lock.
      Session 1: BEGIN; UPDATE data SET y = 2 WHERE x = 1; <obtained lock on rows with x = 1> … COMMIT; <all locks released>
      Session 2: BEGIN; UPDATE data SET y = 5 WHERE x = 1; <waiting for lock on rows with x = 1> … <obtained lock on rows with x = 1> COMMIT;
  36. Transactions and Concurrency. Transactions that don't modify the same row can run concurrently; they block on the 1st lock that conflicts.
      Session 1: BEGIN; UPDATE data SET y = y - 1 WHERE x = 1; COMMIT; <all locks released>
      Session 2: BEGIN; UPDATE data SET y = y + 1 WHERE x = 2; UPDATE data SET y = y + 1 WHERE x = 1; <waiting for lock on rows with x = 1> … <obtained lock on rows with x = 1> COMMIT;
  37. But what if they start blocking each other? (Distributed) deadlock!
      Session 1: BEGIN; UPDATE data SET y = y - 1 WHERE x = 1; UPDATE data SET y = y + 1 WHERE x = 2;
      Session 2: BEGIN; UPDATE data SET y = y - 1 WHERE x = 2; UPDATE data SET y = y + 1 WHERE x = 1;
  38. Deadlock detection in PostgreSQL: deadlock detection builds a graph of processes that are waiting for each other.
  39. Deadlock detection in PostgreSQL: transactions are cancelled until the cycle is gone.
  40. Deadlocks in Citus: Citus delegates transactions to nodes.
  41. Deadlocks in Citus: PostgreSQL's deadlock detector still works.
  42. Deadlocks in Citus: when deadlocks span across nodes, PostgreSQL cannot help us.
  43. Deadlock detection in Citus 7: Citus 7 adds distributed deadlock detection.
  44. Deadlock detection in Citus 7: Citus 7 adds distributed deadlock detection (continued).
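Postgres runs deadlock detection once a transaction has been waiting for deadlock_timeout (1 second by default, matching the speaker notes). You can watch waiting lock requests directly; a minimal sketch:

      SHOW deadlock_timeout;  -- how long a lock wait lasts before the deadlock detector runs
      -- which lock requests are currently waiting (not yet granted)?
      SELECT pid, locktype, relation::regclass, mode
      FROM pg_locks
      WHERE NOT granted;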
  45. Summary. Distributed transactions are a complex topic. Most articles on the topic focus on data consistency, but data consistency is only one side of the coin. If you're using a relational database, your application benefits from another key feature: deadlock detection. https://www.citusdata.com/blog/2017/08/31/databases-and-distributed-deadlocks-a-faq
  46. Conclusion: Postgres high availability (HA), extension APIs, distributed deadlock detection.
  47. SQL is hard, not impossible, to scale.
  48. © 2017 Citus Data. All rights reserved. Questions? Will Leinweber, will@citusdata.com, @citusdata, www.citusdata.com

Editor's Notes

  • learned over the years in turning Postgres into a distributed database.
    lessons applicable in a broader context than just PG
    Fairly technical talk. If you have questions, please feel free to ask them.
    Speak slowly.
  • Job Posting trends on HN
    Last year PG = next for combined
    So, developers love Postgres.
    And it’s worth learning more about the technical components that make up Postgres and how one goes about scaling them.

  • Horizontally scales out PG across machines, using sharding + replication
    Route vs. parallelize query
    Can do it yourself, but nice to have the DB do this for you
    Packaged as an extension. Extension APIs are new, unique to Postgres. More on them later in the talk.
  • In one sense, Citus does very little to PG
    C & W nodes are PG with the Citus extension
    User connects to C. Manage + create dist tables
    Queries run through C, using standard PG protocol
    C transforms the query into smaller queries, pushes down to W
    C merges, aggregates if necessary
    C doesn’t own any data (mostly)
    W each has several shards, which are small tables
  • Metadata table
  • The previous diagram looks simple, but scaling out SQL is actually an extremely challenging task.
    Rest of talk explains Citus by looking at 3 challenges

  • This part describes the most asked questions about PG.
    How do you handle replication and machine failures?
    What challenges do you run into when setting up HA PG clusters?
  • Common setup: 1 Primary writes, many read replicas
    In the context of Postgres, this setup brings two challenges.
    First, many Postgres clients talk to a single endpoint. When the primary node fails, they will keep retrying the same IP.
    Second, Postgres replicates its entire state. This makes it hard to shoot different nodes in the head and bring new nodes into the cluster.
  • PG has a large ecosystem of clients.
    Some can have a list of IPs to try (java, pg10).
    List only works if you know upfront all possible failovers
    To solve generally, need Network Primitives
    elastic IP, or DNS, or a load balancer.
    This example is EIP failover
  • 2nd problem not widely recognized. Most think primary/secondary is enough
    In practice, 1 of 3 approaches for replication and fail-over.
    When you bring up new secondary, how does it start?
  • 1st approach is the most common
    Primary node has the tables’ data and write-ahead logs.
    <explain wal>
    Stream WAL to secondary, from beginning
    Can cause load on the primary
  • disk mirroring / block-based replication.
    Writes go to a persistent volume. This volume then gets synchronously mirrored to another volume.
    works for all RDBMS. You can use it for MySQL, Postgres, Oracle, or SQL Server.
    However, this approach also requires that you replicate both table and WAL log data.
    writes needs to synchronously go over the network.
    Missing 1 byte can cause corruption
  • turns the replication and disaster recovery process inside out.
    base backup / incremental wal to s3
    New secondary comes replays from s3
    Switch to streaming replication for latency
    Better for cloud, easy to bring up AND down replicas
    Sync or async
  • Each benefit is drawback for others
    1 Simple streaming replication is most common. Most on-prem. Easy to set up. Local disks ~10TBs
    2 Disk Mirroring abstracts storage layer from DB. Loss of instance != loss of disk
    3 Treat WAL as a first-class citizen, and certain features become trivial.
    <explain + why fork, pitr>
  • Questions?
  • All three replication methods replicate a database’s state in full.
    Sharding reduces the state you need to replicate per machine
    So replication becomes a much easier problem to solve.
  • RDBMS have diverse features over many years
    Distributing them introduces a lot of challenges
  • Middleware: routes queries (inserts, simple selects)
    Fork: features diverge over time, eventually becomes a separate project
  • Early on, when getting queries from users, we came across this
    Savepoints in PG are like nested transactions.
    Having this work in a distributed DB was very difficult, so they decided not to do it at the time
    Founders wanted to know who would write such a query
  • Turns out not a person, just the Rails testing framework
    Then knew, to really scale PG, need to go all in
    All features people rely on need to work
    Current/New features, clients, ORMs, 100s of tools around PG
  • Citus started as a fork. Extension support in PG wasn’t enough
    Inside, postgres is very modular, easyish to hook in without mess
    To distribute the CREATE INDEX it runs, hook into the DDL and utility processing
    Planner and executor are most complex. At the time, there were assumptions about the storage layer. So, we created a two-stage query planner and executor.
  • PG parses and semantically validates query, Citus sees if touches distr. Tables
    Use distr table metadata to plan the query
    Minimize IO and xform the query into smaller fragments
    Citus then deparses these query fragments back into SQL:
    1) Distr query planner decoupled from the distr executor. Testing, logging
    2) PG workers can optimize for local execution
  • Example SELECT to C. C parses the query.
    Citus planner hook xforms into query fragments
    Distr planner unparses these query fragments back into SQL and sends to W
    W do own local planning and execution and send the results back to C
    C does final computation on results and returns to application.
  • Over time we worked with PG to make these APIs official with extension framework
    So we unforked from Postgres and made it an official extension.
    An extension is a shared library that gets dynamically loaded into Postgres’ address space.
    All you need is `create extension citus` to make PG a distr DB
  • handling distributed transactions in a relational database.
  • Distributed transactions are a big, heavily researched area
    Inside your TXN, you should see your changes, but others shouldn’t until COMMIT
    Or ROLLBACK
    2 related challenges: consistency and locks
  • What happens when 1 or more machines that participate in a transaction fail?
    Consistency is a well established problem in distributed systems. Three popular algorithms
    2PC requires all nodes to be on to make progress. Paxos/Raft not
    We looked both at 2PC in Postgres and also wrote “pg_paxos”
    We went with 2PC because it has been widely used in PG
    With streaming rep, secondary promoted quickly
  • Not as popular a problem, but important:
    Concurrent txns want to modify the same rows, what happens?
  • At core locks are simple
    Prevent 2 txn from modifying same row, concurrently
    Txns can get complicated, grab many locks
  • 2 concurrent txn grab same lock
    Need some way to deal with this

  • Almost any command you run grabs some locks.
    This UPDATE gets a row level lock
    any concurrent txn that tries to update the same row will block
    And then after commit or abort, all locks are released and the second txn continues.
  • If 2 txns have different filters, they run concurrently
    Allows for good write throughput
    If later in the 2nd txn there’s a conflict, then you block
  • Both modify the same rows in a different order.
    Right x=2, left x=1
    And now wait for each other
    No way out, neither can continue: deadlock
    New txns come in and also get stuck
    Escalate to full system outage
  • If txn stuck 1 second, runs deadlock detection.
    Looks at lock graph, across all processes, builds a graph of txns
    Nodes are txns, edges are waiting on other
    Cycles = deadlocks
  • If deadlock, cancel some txns until the cycle is gone
    Locks are released, others can continue and finish
    1 txn dies, 1 lives
  • Citus has txns, delegated to the W node that has the data
    If 2 txns happen to go to the same W, normal PG deadlock
    See a cycle, cancel one
  • Sends error back to C. C then aborts the txn. Other txns can continue.
    What if a txn spans several machines?
  • TXN D1 waits for D2 on N1; D2 waits for D1 on N3
    C waits for responses from both nodes
    No deadlock on any single node
    But there is a distributed deadlock
  • Runs as a background worker
    If distr txn is stuck, gather lock tables from nodes all over the network
    Build dist txn graph, associate txn on nodes to overall txn
    Notice which is waiting on what
  • With that graph, can see cycles
    Cancel the txn on coordinator, which will then go and abort on W nodes
    Other dist txns can continue
    Necessary part of having dist txns
  • Most of what you see is on consistency: 2PC, Paxos, or Raft.
    Important, but only part of the story
    If you want to scale txns, you also need deadlock detection
  • We talked about three technical problems today.
    First, replication and high availability in Postgres.
    Second, Postgres’s extension APIs and how Citus leverages them to introduce distributed functionality.
    Last, distributed deadlock detection.
  • When we started, everyone said “SQL doesn’t scale”
    Easy to dismiss an intractable problem as impossible, or trivialize it
    Scaling SQL involves several problems; we covered 3
    “Scaling out SQL” is a very, very hard, but not an impossible, problem to solve.
  • - Thank you
