Zero-downtime Postgres upgrades
Restarting databases without the apps noticing
@ChrisSinjo
GOCARDLESS
POST /cash/monies HTTP/1.1
{ amount: 100 }
💰💰💰
High 💵 per-request
Uptime is 🔑
Good durability guarantees
Feature-cautious
Transactions are cool
“Speak to this one node.”
–Postgres
[Diagram: Client → Postgres, with replication to a second Postgres node]
Wake a human up
[Diagram sequence: replication between the two Postgres nodes stops, then is re-established]
Awful time-to-recovery
Error-prone
You gotta perform:
- Many steps
- In the right order
- Perfectly
Don’t make a tired SRE think
Add automation
Pacemaker
A clustering tool
How do we know a node has failed?
Jepsen
https://aphyr.com/tags/jepsen
https://aphyr.com/posts/317-jepsen-elasticsearch
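One common answer to the failure-detection question, and presumably why the cluster in this talk grows from two nodes to three, is majority voting: a node only acts (for example, by promoting itself) if it can see more than half of the cluster. The sketch below is illustrative only. `has_quorum` and `partition_outcome` are hypothetical names, not Pacemaker APIs.

```python
from typing import Tuple


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if a node sees a strict majority of the cluster, itself included."""
    if not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return reachable_nodes > cluster_size // 2


def partition_outcome(cluster_size: int, side_a: int) -> Tuple[bool, bool]:
    """Split the cluster in two and report whether each side keeps quorum."""
    side_b = cluster_size - side_a
    return has_quorum(side_a, cluster_size), has_quorum(side_b, cluster_size)


if __name__ == "__main__":
    # Two nodes: a network partition leaves neither side able to act safely.
    print(partition_outcome(2, 1))
    # Three nodes: exactly one side of any partition keeps a majority.
    print(partition_outcome(3, 2))
```

With two nodes, neither side of a partition can tell "the other node died" apart from "the network between us died". With three, at most one side can hold a majority, so at most one side may promote a primary.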
[Diagram sequence: the cluster grows to three Postgres nodes linked by replication, each with a Pacemaker instance, and the client connects through a VIP. During a failover, replication is lost and then re-established while the VIP moves.]
$💯
Seems hard, right?
It kinda is
You gotta know:
- Postgres
- Distributed systems
- Pacemaker
Get someone else to run it for you
[Diagram sequence: Client → VIP → three Postgres nodes under Pacemaker; the VIP moves from node to node]
Every move means a connection reset
Every move means dropped requests
POST /cash/monies HTTP/1.1
{ amount: 100 }
💰💰💰
POST /cash/monies HTTP/1.1
{ amount: 100 }
500 Internal Server Error
What does this mean for upgrades?
[Diagram sequence: all three nodes run 9.4.9; two of them are upgraded to 9.4.10 (9.4.10, 9.4.9, 9.4.10); then the VIP moves]
Every upgrade means a connection reset
Every upgrade means dropped requests
POST /cash/monies HTTP/1.1
{ amount: 100 }
500 Internal Server Error
Solution: never upgrade
🙄
Not upgrading is never an option
Solution: ???
1 thing missing
[Diagram sequence: a PgBouncer instance is added alongside each Postgres node, with a second VIP, so the client now connects through PgBouncer]
PgBouncer has This One Weird Trick™
PAUSE;
[Diagram sequence: PAUSE; is issued to PgBouncer, and one of the two VIPs moves while traffic is paused]
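The pause-then-move sequence can be sketched as a small wrapper. This is a minimal sketch under assumptions: `paused` is a hypothetical helper, and `run_admin_command` stands in for whatever sends a statement to the PgBouncer admin console (the sketch does not open a real connection). `PAUSE;` and `RESUME;` are the real admin commands: while paused, clients stay connected to PgBouncer and their queries wait instead of failing.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def paused(run_admin_command: Callable[[str], None]) -> Iterator[None]:
    """Hold client queries in PgBouncer for the duration of the with-block."""
    run_admin_command("PAUSE;")
    try:
        yield
    finally:
        # Always resume, even if the move fails, so clients are not left waiting.
        run_admin_command("RESUME;")


if __name__ == "__main__":
    log = []  # stand-in for the admin console: just records what was sent
    with paused(log.append):
        log.append("move VIP to new primary")
    print(log)  # ['PAUSE;', 'move VIP to new primary', 'RESUME;']
```

Putting `RESUME;` in a `finally` block is a design choice for the sketch: a failed move then costs a short stall rather than an indefinite one.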
So what does this mean for upgrades?
[Diagram sequence: two nodes are upgraded to 9.4.10 (9.4.10, 9.4.9, 9.4.10); PAUSE; is issued; the VIP moves; RESUME; is issued; the last node is upgraded (9.4.10, 9.4.10, 9.4.10)]
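The ordering in the upgrade sequence above can be written down as a plan. This is a sketch of the ordering only, not the tooling GoCardless runs: `plan_minor_upgrade` and the node names `pg1`, `pg2`, `pg3` are hypothetical, and each step is a description rather than an executed command.

```python
from typing import Dict, List


def plan_minor_upgrade(versions: Dict[str, str], primary: str, target: str) -> List[str]:
    """Order the steps: standbys first, switch over under PAUSE, old primary last."""
    if primary not in versions:
        raise ValueError(f"unknown primary: {primary}")
    standbys = [node for node in versions if node != primary]
    if not standbys:
        raise ValueError("need at least one standby to switch over to")

    # Standbys can be restarted freely: no client traffic goes to them.
    steps = [f"upgrade {node} to {target}" for node in standbys if versions[node] != target]

    if versions[primary] != target:
        new_primary = standbys[0]
        steps += [
            "PAUSE;",                                   # queue client queries in PgBouncer
            f"promote {new_primary} and move VIP",      # switch to an upgraded node
            "RESUME;",                                  # release the queued queries
            f"upgrade {primary} to {target}",           # old primary is now idle
        ]
    return steps


if __name__ == "__main__":
    cluster = {"pg1": "9.4.9", "pg2": "9.4.9", "pg3": "9.4.9"}
    for step in plan_minor_upgrade(cluster, primary="pg2", target="9.4.10"):
        print(step)
```

The only step clients can notice is the window between `PAUSE;` and `RESUME;`. Every restart happens on a node that is not serving traffic at the time.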
$💯
Caveats
Minor versions
9.4.9 → 9.4.10
pglogical
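This approach relies on old and new nodes replicating to each other, which holds across minor releases because they keep the same on-disk format. Major upgrades need another route, and the pglogical mention presumably points at logical replication for that. A guard like the one below could refuse anything that is not a minor bump. It is a sketch: `major_version` and `is_minor_upgrade` are hypothetical helper names.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '9.4.10' into (9, 4, 10)."""
    return tuple(int(part) for part in version.split("."))


def major_version(version: str) -> Tuple[int, ...]:
    """Return the components that identify a PostgreSQL major release."""
    parts = parse_version(version)
    # Before PostgreSQL 10 the major version is the first two components (9.4);
    # from 10 onwards it is the first component only.
    return parts[:2] if parts[0] < 10 else parts[:1]


def is_minor_upgrade(current: str, target: str) -> bool:
    """True if target is a newer release within the same major version."""
    same_major = major_version(current) == major_version(target)
    return same_major and parse_version(target) > parse_version(current)


if __name__ == "__main__":
    print(is_minor_upgrade("9.4.9", "9.4.10"))   # True
    print(is_minor_upgrade("9.4.10", "9.5.0"))   # False
```

Comparing parsed tuples rather than strings matters here: as strings, "9.4.10" sorts before "9.4.9".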
Long-running transactions
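import time

deadline = time.monotonic() + timeout    # timeout: seconds to wait before giving up
while running_queries():                  # True while long-running transactions remain
    if time.monotonic() > deadline:
        abandon_migration()               # stop here rather than hold the pause open
        break
    time.sleep(0.1)
else:
    promote_new_primary()                 # reached only when the loop ends without break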
Pause length
7-10s total
$💯
One more thing…
(#sorrynotsorry)
github.com/gocardless/our-postgresql-setup
We’re hiring
'❤
@ChrisSinjo
@GoCardlessEng
Thank you
Questions?
