PGConf.ASIA 2019 Bali - How did PostgreSQL Write Load Balancing of Queries Us... (Equnix Business Solutions)
Atsushi Mitani from SRA Nishi-Nihon Inc. presented on how to perform write load balancing in PostgreSQL using transactions. He explained that write load distribution is important for systems with high write volumes. PostgreSQL can distribute write load using table partitioning with foreign data wrappers (FDW), which allows partitioning across database instances. Mitani created patches to automate the partitioning setup and load data in parallel to child tables to speed up benchmarking. Benchmark results showed that while increasing child databases improves performance without transactions, increasing parent databases is better with transactions to avoid lock queues. The optimal configuration depends on data size, queries, and hardware.
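The FDW-based partitioning the summary describes can be sketched as DDL generation: a hash-partitioned parent table whose partitions are foreign tables, one per remote instance. This is a minimal illustration, not Mitani's patches; table, column, and server names are hypothetical, and the remote servers are assumed to have been registered already with `CREATE SERVER ... FOREIGN DATA WRAPPER postgres_fdw`.

```python
# Hedged sketch: generate DDL that spreads a table across PostgreSQL
# instances using hash partitioning with postgres_fdw foreign partitions.
# All object names are illustrative.

def fdw_partition_ddl(table, column, servers):
    """Return DDL that hash-partitions `table` on `column`, placing one
    foreign-table partition on each remote server."""
    stmts = [f"CREATE TABLE {table} (id bigint, payload text) "
             f"PARTITION BY HASH ({column});"]
    n = len(servers)
    for i, server in enumerate(servers):
        stmts.append(
            f"CREATE FOREIGN TABLE {table}_p{i} "
            f"PARTITION OF {table} "
            f"FOR VALUES WITH (MODULUS {n}, REMAINDER {i}) "
            f"SERVER {server};"
        )
    return stmts

for stmt in fdw_partition_ddl("measurements", "id", ["child_db0", "child_db1"]):
    print(stmt)
```

With this layout, writes routed through the parent are fanned out to the child instances, which is what makes write load distribution possible at all; the benchmark trade-off in the summary (child vs. parent databases, with or without transactions) is about where lock queues form in such a tree.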
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds (PGConf APAC)
Speaker: Oskari Saarenmaa
Aiven PostgreSQL is available in five different public cloud providers' infrastructure in more than 60 regions around the world, including 18 in APAC. This has given us a unique opportunity to benchmark and compare performance of similar configurations in different environments.
We'll share our benchmark methods and results, comparing various PostgreSQL configurations and workloads across different clouds.
PGConf APAC 2018 - Monitoring PostgreSQL at Scale (PGConf APAC)
Speaker: Lukas Fittl
Your PostgreSQL database is one of the most important pieces of your architecture - yet the level of introspection available in Postgres is often hard to work with. It's easy to get very detailed information, but what should you really watch out for, report on, and alert on?
In this talk we'll discuss how query performance statistics can be made accessible to application developers, critical entries one should monitor in the PostgreSQL log files, how to collect EXPLAIN plans at scale, how to watch over autovacuum and VACUUM operations, and how to flag issues based on schema statistics.
We'll also talk a bit about monitoring multi-server setups, first going into high availability and read standbys, logical replication, and then reviewing how monitoring looks like for sharded databases like Citus.
The talk will primarily describe free/open-source tools and statistics views readily available from within Postgres.
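One of the statistics views the abstract alludes to is `pg_stat_statements`, the standard way to make query performance statistics visible. As a hedged sketch (not the speaker's tooling), a monitoring agent might pull the most expensive statements like this; connection handling is omitted and the SQL string is the point:

```python
# Hedged sketch: top queries by total execution time from pg_stat_statements.
# Column names follow PostgreSQL 13+; on older versions use total_time /
# mean_time instead of total_exec_time / mean_exec_time.

TOP_QUERIES_SQL = """
SELECT queryid,
       calls,
       total_exec_time,
       mean_exec_time,
       rows,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT %(limit)s;
"""

def top_queries(cursor, limit=10):
    """Fetch the most expensive statements; `cursor` is any DB-API cursor."""
    cursor.execute(TOP_QUERIES_SQL, {"limit": limit})
    return cursor.fetchall()
```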
PGConf.ASIA 2019 Bali - Setup a High-Availability and Load Balancing PostgreS... (Equnix Business Solutions)
PGConf.ASIA 2019 Bali - 10 September 2019
Speaker: Bo Peng
Room: SQL
Title: Setup a High-Availability and Load Balancing PostgreSQL Cluster - New Features of Pgpool-II 4.1
PGConf APAC 2018 - A PostgreSQL DBA's Toolbelt for 2018 (PGConf APAC)
There's no need to re-invent the wheel! Dozens of people have already tried...and succeeded. This talk is a categorized and illustrated overview of the most popular and/or useful PostgreSQL-specific scripts, utilities, and whole toolsets that DBAs should be aware of for solving daily tasks. Including: performance monitoring, log management/analysis, and identifying/fixing the most common administration problems around general performance metrics, tuning, locking, indexing, and bloat, leaving out high-availability topics. Covered are venerable oldies from wiki.postgresql.org as well as my newer favourites from GitHub.
PGConf.ASIA 2019 - PGSpider High Performance Cluster Engine - Shigeo Hirose (Equnix Business Solutions)
PGSpider is a high-performance SQL cluster engine developed by Toshiba Corporation. It allows distributed querying of heterogeneous data sources using standard SQL. PGSpider improves retrieval performance through parallel queries across nodes and supports multi-tenant querying to retrieve records from the same table across nodes. It utilizes techniques like pushdown of conditional expressions and aggregation functions to nodes to reduce network traffic.
PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo... (PGConf APAC)
Speaker: Ian Barwick
PostgreSQL and reliability go hand-in-hand - but your data is only truly safe with a solid and trusted backup system in place, and no matter how good your application is, it's useless if it can't talk to your database.
In this talk we'll demonstrate how to set up a reliable replication cluster using open source tools closely associated with the PostgreSQL project. The talk will cover the following areas:
- how to set up and manage a replication cluster with `repmgr`
- how to set up and manage reliable backups with `Barman`
- how to manage failover and application connections with `repmgr` and `PgBouncer`
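The three tools above meet at failover time. As a hedged sketch of that choreography (not the talk's exact procedure; ports, paths, and the config-edit step are illustrative), the sequence pauses PgBouncer, promotes the standby with `repmgr`, repoints PgBouncer, and resumes:

```python
# Hedged sketch: the failover sequence as a list of commands, returned
# rather than executed so it can be inspected. PgBouncer's admin console
# (assumed on port 6432) understands PAUSE / RELOAD / RESUME; `repmgr
# standby promote` is repmgr's promotion command.

FAILOVER_STEPS = [
    # 1. Stop PgBouncer from handing out connections
    ["psql", "-p", "6432", "-U", "pgbouncer", "pgbouncer", "-c", "PAUSE"],
    # 2. Promote the standby to primary
    ["repmgr", "standby", "promote", "-f", "/etc/repmgr.conf"],
    # 3. (edit pgbouncer.ini so the [databases] entry points at the new primary)
    # 4. Reload the config and let clients reconnect to the new primary
    ["psql", "-p", "6432", "-U", "pgbouncer", "pgbouncer", "-c", "RELOAD"],
    ["psql", "-p", "6432", "-U", "pgbouncer", "pgbouncer", "-c", "RESUME"],
]

if __name__ == "__main__":
    for cmd in FAILOVER_STEPS:
        print(" ".join(cmd))
```

Because applications connect to PgBouncer rather than to PostgreSQL directly, the pause/resume pair hides the promotion from clients as a brief stall instead of a connection error.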
Ian Barwick has worked for 2ndQuadrant since 2014, and as well as making various contributions to PostgreSQL itself, is lead `repmgr` developer. He lives in Tokyo, Japan.
PGConf APAC 2018 - PostgreSQL HA with Pgpool-II and what's been happening in P... (PGConf APAC)
Speaker: Muhammad Usama
Pgpool-II has complemented PostgreSQL for over a decade and provides many features, such as connection pooling, failover, query caching, load balancing, and HA. High availability (HA) is critical to most enterprise applications: clients need the ability to automatically reconnect to a secondary node when the master node goes down.
This is where Pgpool-II's watchdog comes in: the watchdog is the core feature of Pgpool-II that provides HA by eliminating the single point of failure (SPOF). The watchdog has been around for a while, but it went through a major overhaul and enhancements in recent releases. This talk aims to explain the watchdog feature and the recent enhancements that went into it, and to describe how it can be used to provide PostgreSQL HA and automatic failover.
There is a rising trend of enterprise deployments shifting to cloud-based environments, and Pgpool-II can be used in the cloud without any issues. In this talk we will give some ideas of how Pgpool-II is used to provide PostgreSQL HA in cloud environments.
Finally, we will summarise the major features added in the recent major release of Pgpool-II and what's in the pipeline for the next major release.
The document discusses PostgreSQL version 11 and future development. It provides a history of PostgreSQL and its predecessors, describing the development process and community. It summarizes key features committed to version 11, including improvements to partitioning, parallelization, performance and logical replication. It also outlines features proposed for future versions, with a focus on continued enhancements to partitioning and query planning.
This document provides an introduction to HeteroDB, Inc. and its chief architect, KaiGai Kohei. It discusses PG-Strom, an open source PostgreSQL extension developed by HeteroDB for high performance data processing using heterogeneous architectures like GPUs. PG-Strom uses techniques like SSD-to-GPU direct data transfer and a columnar data store to accelerate analytics and reporting workloads on terabyte-scale log data using GPUs and NVMe SSDs. Benchmark results show PG-Strom can process terabyte workloads at throughput nearing the hardware limit of the storage and network infrastructure.
About a year ago I was caught in the line of fire when a production system started behaving erratically:
- A batch process that would finish in 15 minutes started taking 1.5 hours
- OLTP read queries on the standby started being cancelled
- We faced sudden slowness on the primary server and were forced to switch over to the standby
We were able to figure out that some peculiarities of the application code and the batch process were responsible for this. But we could not fix the application code (as it is a packaged application).
In this talk I would like to share in more detail how we debugged the issue, what the problem was, and how we applied a workaround for it. We also learnt that a query returning in 10 minutes may not be as dangerous as a query returning in 10 seconds but executed hundreds of times an hour.
I will share in detail:
- How to map process/top stats from the OS to pg_stat_activity
- How to get and read an EXPLAIN plan
- How to judge whether a query is costly
- Which tools helped us
- A peculiar autovacuum/VACUUM vs. replication conflict we ran into
- Various parameters to tune the autovacuum and auto-analyze processes
- What we did to work around the problem
- What we put in place for better monitoring and information gathering
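The first item above works because each PostgreSQL backend is a separate OS process, so the PID that `top` or `ps` reports is the same PID shown in `pg_stat_activity`. A hedged sketch of that lookup (not the speaker's exact tooling):

```python
# Hedged sketch: given a hot PID from top/ps, find the backend, its state,
# what it is waiting on, and the query it is running.

PID_LOOKUP_SQL = """
SELECT pid,
       usename,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       query
FROM pg_stat_activity
WHERE pid = %(pid)s;
"""

def whoami(cursor, os_pid):
    """`os_pid` is the PID reported by the OS; `cursor` is any DB-API cursor."""
    cursor.execute(PID_LOOKUP_SQL, {"pid": os_pid})
    return cursor.fetchone()
```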
At a blistering pace and for a variety of reasons, companies are migrating their on-premise database infrastructures to cloud-based solutions - to save costs on hardware, tame the impact of disaster recovery, or even to improve security. Zalando is not an exception: more than two years ago we migrated our first production services to AWS.
In addition to fully managed database services like RDS and Aurora, Amazon offers a wide spectrum of EC2 instances with different performance and price characteristics. Without a lot of experience in running cloud databases it's not easy to make the right choice, and as a result you will either have poor database performance or will overpay for over-provisioned resources.
In this talk I will compare different ways of running PostgreSQL on AWS, explain why we decided to run most of our databases on EC2 Instances instead of RDS, how we chose EC2 Instance types and EBS Volumes, which AWS CloudWatch metrics MUST be monitored (and why), and what problems we hit plus how to avoid them.
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing (Eric Kacz...) (Spark Summit)
The document discusses tuning the Garbage First (G1) garbage collector in Java 8 to reduce garbage collection pauses for large heaps used in Spark graph computing workloads. It was found that the default G1 settings resulted in lengthy full garbage collections over 100 seconds. After analyzing the garbage collection logs, the main issue was identified as the concurrent marking phase not completing before a full collection was needed. Increasing the number of concurrent marking threads from 8 to 20 addressed this by speeding up the concurrent phase. With this tuning, no full collections occurred and total stop-the-world pause time was reduced to under a minute, a significant improvement over the original implementation.
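The tuning direction described above might correspond to Java 8 HotSpot flags like the following; this is a hedged reconstruction, and every value other than the ConcGCThreads change stated in the summary is illustrative:

```
-XX:+UseG1GC
-XX:ConcGCThreads=20                     # raised from 8 so concurrent marking finishes in time
-XX:InitiatingHeapOccupancyPercent=35    # start marking cycles earlier (illustrative value)
-XX:+PrintGCDetails -XX:+PrintGCDateStamps   # produce the GC logs used for the analysis
```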
(Nina Hanzlikova, Zalando) Kafka Summit SF 2018
My team at Zalando fell in love with KStreams and their programming model straight out of the gate. However, as a small team of developers, building out and supporting our infrastructure while still trying to deliver solutions for our business has not always resulted in a smooth journey.
Can a small team of a couple of developers run their own Kafka infrastructure confidently and still spend most of their time developing code?
In this talk, we will dive into some of the problems we experienced while running Kafka brokers and Kafka Streams applications, as well as the consultations we had with other teams around this matter. We will outline some of the pragmatic decisions we made regarding backups, monitoring and operations to minimize our time spent administering our Kafka brokers and various stream applications.
Understanding Memory Management In Spark For Fun And Profit (Spark Summit)
1) The document discusses memory management in Spark applications and summarizes different approaches tried by developers to address out of memory errors in Spark executors.
2) It analyzes the root causes of memory issues like executor overheads and data sizes, and evaluates fixes like increasing memory overhead, reducing cores, frequent garbage collection.
3) The document dives into Spark and JVM level configuration options for memory like storage pool sizes, caching formats, and garbage collection settings to improve reliability, efficiency and performance of Spark jobs.
Speaker: Alexander Kukushkin
Kubernetes is a solid leader among different cloud orchestration engines and its adoption rate is growing on a daily basis. Naturally people want to run both their applications and databases on the same infrastructure.
There are a lot of ways to deploy and run PostgreSQL on Kubernetes, but most of them are not cloud-native. Around a year ago Zalando started to run HA setups of PostgreSQL on Kubernetes managed by Patroni. Those experiments were quite successful and produced a Helm chart for Patroni. That chart was useful, albeit with a single problem: Patroni depended on Etcd, ZooKeeper or Consul.
Few people look forward to deploying two applications instead of one and supporting them later on. In this talk I would like to introduce Kubernetes-native Patroni. I will explain how Patroni uses the Kubernetes API to run leader elections and store the cluster state. I'm going to live-demo a deployment of an HA PostgreSQL cluster on Minikube and share our experience of running more than 130 clusters on Kubernetes.
Patroni is a Python open-source project developed by Zalando in cooperation with other contributors on GitHub: https://github.com/zalando/patroni
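The leader-election idea referred to above can be sketched in miniature: every member races to update a shared object, and the update succeeds only if nothing else changed the object since it was read, which is the compare-and-swap guarantee the Kubernetes API gives via `resourceVersion`. The sketch below is a hedged illustration of that pattern, not Patroni's implementation; an in-memory object stands in for the API server.

```python
# Hedged sketch: optimistic-CAS leader election. In Kubernetes the shared
# object would be an Endpoints/ConfigMap and a conflicting update returns
# HTTP 409; here a version counter plays the role of resourceVersion.

import time

class FakeKubeObject:
    def __init__(self):
        self.resource_version = 0
        self.annotations = {}

    def update(self, expected_version, annotations):
        """Compare-and-swap: fail if someone else updated the object first."""
        if expected_version != self.resource_version:
            return False            # the real API would answer 409 Conflict
        self.annotations = dict(annotations)
        self.resource_version += 1
        return True

def try_acquire_leader(obj, me, ttl=30):
    """Take (or renew) the lock if it is free, ours, or expired."""
    version = obj.resource_version
    leader = obj.annotations.get("leader")
    expired = time.time() > obj.annotations.get("renewTime", 0) + ttl
    if leader is None or leader == me or expired:
        return obj.update(version, {"leader": me, "renewTime": time.time()})
    return False

obj = FakeKubeObject()
print(try_acquire_leader(obj, "node-a"))   # True: node-a becomes leader
print(try_acquire_leader(obj, "node-b"))   # False: lock is held and fresh
```

The TTL is what makes this safe for HA: a leader that stops renewing the annotation is eventually treated as dead, and a standby can take over.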
Reduce Redundant Producers from Partitioned Producer - Pulsar Summit NA 2021 (StreamNative)
The document discusses reducing redundant Apache Pulsar producers from partitioned topics. It presents a solution that limits the number of internal producers per partitioned producer and lazily loads producers. Benchmark results show the proposed approach reduces client-side resource usage like heap and number of TCP connections compared to the existing approach, while having little impact on broker-side resources. The conclusion is that implementing these producer changes can improve efficiency.
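The lazy-loading idea in the summary can be illustrated with a small sketch: instead of opening an internal producer for every partition up front, open one only when a message is first routed to that partition. This is a hedged illustration of the pattern, not Pulsar's API; class and parameter names are made up.

```python
# Hedged sketch: partition producers created on demand. The factory stands
# in for whatever is expensive per producer (e.g. a TCP connection and heap
# buffers on the client side, which is the resource usage the benchmark
# compares).

class LazyPartitionedProducer:
    def __init__(self, n_partitions, make_producer):
        self.n = n_partitions
        self.make_producer = make_producer   # factory for one partition's producer
        self._producers = {}                 # partition -> producer, filled lazily

    def send(self, key, msg):
        p = hash(key) % self.n               # stand-in for the routing policy
        if p not in self._producers:
            self._producers[p] = self.make_producer(p)
        return self._producers[p](msg)

created = []
prod = LazyPartitionedProducer(
    100, lambda p: created.append(p) or (lambda m: (p, m)))
prod.send("order-42", b"payload")
print(len(created))   # 1 producer opened, not 100
```

A producer that only ever writes to a few partitions of a 100-partition topic thus holds a few connections instead of 100, which is the client-side saving the benchmark measures.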
hbaseconasia2017: HBase Practice At XiaoMi (HBaseCon)
Zheng Hu
We'll share some HBase experience at XiaoMi:
1. How we tuned G1GC for our HBase clusters.
2. Development and performance of the asynchronous HBase client.
Take Kafka-on-Pulsar to Production at Internet Scale: Improvements Made for P... (StreamNative)
This document discusses improvements made to Kafka-on-Pulsar (KoP) for Pulsar 2.8.0 to help take it to production at internet scale. Key updates include continuous offset support, Kafka entry formatting, heap memory optimizations, exposed metrics, and OAuth 2.0 authentication. KoP is now generally available and sees use at Tencent supporting 60 trillion daily messages through optimizations like speeding up reboots, producer throttling, and implementing continuous offsets.
SparklingWater provides the ability to pass data between Spark and H2O, allowing machine learning to be done using both Spark and H2O algorithms and at scale. It supports popular languages like R, Python, Java and Scala and common ML algorithms like generalized linear models, K-means clustering, and random forests. Models can be exported without needing H2O, simplifying the typical ML workflow by performing all stages within the same Spark/H2O environment.
HBaseCon2017 Improving HBase availability in a multi tenant environment (HBaseCon)
The document discusses improvements made by Hubspot's Big Data Team to increase the availability of HBase in a multi-tenant environment. It outlines reducing the cost of region server failures by improving mean time to recovery, addressing issues that slowed recovery, and optimizing the load balancer. It also details eliminating workload-driven failures through service limits and improving hardware monitoring to reduce impacts of failures. The changes resulted in 8-10x faster balancing, reduced recovery times from 90 to 30 seconds, and consistently achieving 99.99% availability across clusters.
While physical replication in PostgreSQL is quite robust, it doesn't fit well in the picture when:
- You need partial replication only
- You want to replicate between different major versions of PostgreSQL
- You need to replicate multiple databases to the same target
- Transformation of the data is needed
- You want to replicate in order to upgrade without downtime
The answer to these use cases is logical replication.
This talk will discuss and cover these use cases followed by a logical replication demo.
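A demo of built-in logical replication boils down to a publication on the source and a subscription on the target. As a hedged sketch (table names and connection details are hypothetical, not the talk's demo), the statements look like this; note the target may run a newer major version, which is what enables the zero-downtime-upgrade use case above:

```python
# Hedged sketch: the two sides of a logical replication setup, printed as
# SQL. The source must run with wal_level = logical.

PUBLISHER_SQL = [
    # on the source database
    "CREATE PUBLICATION orders_pub FOR TABLE orders, order_items;",
]
SUBSCRIBER_SQL = [
    # on the target database, which may be a different major version
    "CREATE SUBSCRIPTION orders_sub "
    "CONNECTION 'host=src dbname=shop user=repl' "
    "PUBLICATION orders_pub;",
]

for stmt in PUBLISHER_SQL + SUBSCRIBER_SQL:
    print(stmt)
```

Because the publication names specific tables, this also covers the partial-replication and multiple-databases-to-one-target cases in the list above.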
Webinar: PostgreSQL continuous backup and PITR with Barman (Gabriele Bartolini)
How can you achieve an RTO of 5 minutes for the backups of your PostgreSQL databases? And what about RPO=0 for zero data loss backups?
This webinar gave an answer to those questions, by providing an overview of Disaster Recovery of PostgreSQL databases with Barman, covering its major features.
Barman, Backup and Recovery Manager for PostgreSQL, is an open source tool that was conceived by 2ndQuadrant about 10 years ago and released open source in 2012 under GNU GPL 3.
It is now one of the most popular backup and recovery tools in the PostgreSQL ecosystem.
Video available at https://resources.2ndquadrant.com/en/webinar-postgresql-continuous-backup-and-pitr-with-barman
PGConf.ASIA 2019 - PGSpider High Performance Cluster Engine - Shigeo HiroseEqunix Business Solutions
PGSpider is a high-performance SQL cluster engine developed by Toshiba Corporation. It allows distributed querying of heterogeneous data sources using standard SQL. PGSpider improves retrieval performance through parallel queries across nodes and supports multi-tenant querying to retrieve records from the same table across nodes. It utilizes techniques like pushdown of conditional expressions and aggregation functions to nodes to reduce network traffic.
PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo...PGConf APAC
Speaker: Ian Barwick
PostgreSQL and reliability go hand-in-hand - but your data is only truly safe with a solid and trusted backup system in place, and no matter how good your application is, it's useless if it can't talk to your database.
In this talk we'll demonstrate how to set up a reliable replication
cluster using open source tools closely associated with the PostgreSQL project. The talk will cover following areas:
- how to set up and manage a replication cluster with `repmgr`
- how to set up and manage reliable backups with `Barman`
- how to manage failover and application connections with `repmgr` and `PgBouncer`
Ian Barwick has worked for 2ndQuadrant since 2014, and as well as making various contributions to PostgreSQL itself, is lead `repmgr` developer. He lives in Tokyo, Japan.
PGConf APAC 2018 - PostgreSQL HA with Pgpool-II and whats been happening in P...PGConf APAC
Speaker: Muhammad Usama
Pgpool-II has been around to complement PostgreSQL over a decade and provides many features like connection pooling, failover, query caching, load balancing, and HA. High Availability (HA) is very critical to most enterprise application, the clients needs the ability to automatically reconnect with a secondary node when the master nodes goes down.
This is where Pgpool-II watchdog feature comes in, the core feature of Pgpool-II provides HA by eliminating the SPOF is the Watchdog. This watchdog feature has been around for a while but it went through major overhauling and enhancements in recent releases. This talk aims to explain the watchdog feature, the recent enhancements went into the watchdog and describe how it can be used to provide PostgreSQL HA and automatic failover.
Their is rising trend of enterprise deployment shifting to cloud based environment, Pgpool II can be used in the cloud without any issues. In this talk we will give some ideas how Pgpool-II is used to provide PostgreSQL HA in cloud environment.
Finally we will summarise the major features that have been added in the recent major release of Pgpool II and whats in the pipeline for the next major release.
The document discusses PostgreSQL version 11 and future development. It provides a history of PostgreSQL and its predecessors, describing the development process and community. It summarizes key features committed to version 11, including improvements to partitioning, parallelization, performance and logical replication. It also outlines features proposed for future versions, with a focus on continued enhancements to partitioning and query planning.
This document provides an introduction to HeteroDB, Inc. and its chief architect, KaiGai Kohei. It discusses PG-Strom, an open source PostgreSQL extension developed by HeteroDB for high performance data processing using heterogeneous architectures like GPUs. PG-Strom uses techniques like SSD-to-GPU direct data transfer and a columnar data store to accelerate analytics and reporting workloads on terabyte-scale log data using GPUs and NVMe SSDs. Benchmark results show PG-Strom can process terabyte workloads at throughput nearing the hardware limit of the storage and network infrastructure.
About a year ago I was caught up in line-of-fire when a production system started behaving abruptly
- A batch process which would finish in 15minutes started taking 1.5 hours
- We started facing OLTP read queries on standby being cancelled
- We faced a sudden slowness on the Primary server and we were forced to do a forceful switch to standby.
We were able to figure out that some peculiarities of the application code and batch process were responsible for this. But we could not fix the application code (as it is packaged application).
In this talk I would like to share more details of how we debugged, what was the problem we were facing and how we applied a work around for it. We also learnt that a query returning in 10minutes may not be as dangerous as a query returning in 10sec but executed 100s of times in an hour.
I will share in detail-
- How to map the process/top stats from OS with pg_stat_activity
- How to get and read explain plan
- How to judge if a query is costly
- What tools helped us
- A peculiar autovacuum/vacuum Vs Replication conflict we ran into
- Various parameters to tune autvacuum and auto-analyze process
- What we have done to work-around the problem
- What we have put in place for better monitoring and information gathering
At a blistering pace and for a variety of reasons, companies are migrating their on-premise database infrastructures to cloud-based solutions - to save costs on hardware, tame the impact of disaster recovery, or even to improve security. Zalando is not an exception: more than two years ago we migrated our first production services to AWS.
In addition to the fully managed database services like RDS and Aurora, Amazon offers a wide spectra of EC2 Instances with different types of performance and price. Without a lot of experience in running cloud databases it’s not easy to make the right choice, and as a result you will either have pure database performance or will overpay for over-provisioned resources.
In this talk I will compare different ways of running PostgreSQL on AWS, explain why we decided to run most of our databases on EC2 Instances instead of RDS, how we chose EC2 Instance types and EBS Volumes, which AWS CloudWatch metrics MUST be monitored (and why), and what problems we hit plus how to avoid them.
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Spark Summit
The document discusses tuning the Garbage First (G1) garbage collector in Java 8 to reduce garbage collection pauses for large heaps used in Spark graph computing workloads. It was found that the default G1 settings resulted in lengthy full garbage collections over 100 seconds. After analyzing the garbage collection logs, the main issue was identified as the concurrent marking phase not completing before a full collection was needed. Increasing the number of concurrent marking threads from 8 to 20 addressed this by speeding up the concurrent phase. With this tuning, no full collections occurred and total stop-the-world pause time was reduced to under a minute, a significant improvement over the original implementation.
(Nina Hanzlikova, Zalando) Kafka Summit SF 2018
My team at Zalando fell in love with KStreams and their programming model straight out of the gate. However, as a small team of developers, building out and supporting our infrastructure while still trying to deliver solutions for our business has not always resulted in a smooth journey.
Can a small team of a couple of developers run their own Kafka infrastructure confidently and still spend most of their time developing code?
In this talk, we will dive into some of the problems we experienced while running Kafka brokers and Kafka Streams applications, as well as the consultations we had with other teams around this matter. We will outline some of the pragmatic decisions we made regarding backups, monitoring and operations to minimize our time spent administering our Kafka brokers and various stream applications.
Understanding Memory Management In Spark For Fun And ProfitSpark Summit
1) The document discusses memory management in Spark applications and summarizes different approaches tried by developers to address out of memory errors in Spark executors.
2) It analyzes the root causes of memory issues like executor overheads and data sizes, and evaluates fixes like increasing memory overhead, reducing cores, frequent garbage collection.
3) The document dives into Spark and JVM level configuration options for memory like storage pool sizes, caching formats, and garbage collection settings to improve reliability, efficiency and performance of Spark jobs.
Speaker: Alexander Kukushkin
Kubernetes is a solid leader among different cloud orchestration engines and its adoption rate is growing on a daily basis. Naturally people want to run both their applications and databases on the same infrastructure.
There are a lot of ways to deploy and run PostgreSQL on Kubernetes, but most of them are not cloud-native. Around one year ago Zalando started to run HA setup of PostgreSQL on Kubernetes managed by Patroni. Those experiments were quite successful and produced a Helm chart for Patroni. That chart was useful, albeit a single problem: Patroni depended on Etcd, ZooKeeper or Consul.
Few people look forward to deploy two applications instead of one and support them later on. In this talk I would like to introduce Kubernetes-native Patroni. I will explain how Patroni uses Kubernetes API to run a leader election and store the cluster state. I’m going to live-demo a deployment of HA PostgreSQL cluster on Minikube and share our own experience of running more than 130 clusters on Kubernetes.
Patroni is a Python open-source project developed by Zalando in cooperation with other contributors on GitHub: https://github.com/zalando/patroni
Reduce Redundant Producers from Partitioned Producer - Pulsar Summit NA 2021StreamNative
The document discusses reducing redundant Apache Pulsar producers from partitioned topics. It presents a solution that limits the number of internal producers per partitioned producer and lazily loads producers. Benchmark results show the proposed approach reduces client-side resource usage like heap and number of TCP connections compared to the existing approach, while having little impact on broker-side resources. The conclusion is that implementing these producer changes can improve efficiency.
hbaseconasia2017: HBase Practice At XiaoMiHBaseCon
Zheng Hu
We'll share some HBase experience at XiaoMi:
1. How did we tuning G1GC for HBase Clusters.
2. Development and performance of Async HBase Client.
hbaseconasia2017 hbasecon hbase xiaomi https://www.eventbrite.com/e/hbasecon-asia-2017-tickets-34935546159#
Take Kafka-on-Pulsar to Production at Internet Scale: Improvements Made for P...StreamNative
This document discusses improvements made to Kafka-on-Pulsar (KoP) for Pulsar 2.8.0 to help take it to production at internet scale. Key updates include continuous offset support, Kafka entry formatting, heap memory optimizations, exposed metrics, and OAuth 2.0 authentication. KoP is now generally available and sees use at Tencent supporting 60 trillion daily messages through optimizations like speeding up reboots, producer throttling, and implementing continuous offsets.
SparklingWater provides the ability to pass data between Spark and H2O, allowing machine learning to be done using both Spark and H2O algorithms and at scale. It supports popular languages like R, Python, Java and Scala and common ML algorithms like generalized linear models, K-means clustering, and random forests. Models can be exported without needing H2O, simplifying the typical ML workflow by performing all stages within the same Spark/H2O environment.
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon
The document discusses improvements made by Hubspot's Big Data Team to increase the availability of HBase in a multi-tenant environment. It outlines reducing the cost of region server failures by improving mean time to recovery, addressing issues that slowed recovery, and optimizing the load balancer. It also details eliminating workload-driven failures through service limits and improving hardware monitoring to reduce impacts of failures. The changes resulted in 8-10x faster balancing, reduced recovery times from 90 to 30 seconds, and consistently achieving 99.99% availability across clusters.
While the physical replication in PostgreSQL is quite robust, it doesn't fit well in the picture when:
- You need partial replication only
- You want to replicate between different major versions of PostgreSQL
- You need to replicate multiple databases to the same target
- Transformation of the data is needed
- You want to replicate in order to upgrade without downtime
The answer to these use cases is logical replication.
This talk will discuss and cover these use cases followed by a logical replication demo.
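The use cases above map onto PostgreSQL's built-in logical replication (available since PostgreSQL 10), which replicates row-level changes rather than physical blocks. A minimal sketch, assuming a hypothetical table `orders` that exists on both nodes and a placeholder connection target `replica_host`:

```sql
-- On the publisher (source) node: publish only the tables you need,
-- which is what enables partial replication of a single table or set.
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the subscriber (target) node, which may run a different major
-- version or receive several source databases side by side:
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=replica_host dbname=shop user=repuser'
    PUBLICATION orders_pub;
```

The publisher must run with `wal_level = logical`. Because changes are replayed as logical row operations, the two nodes are free to differ in major version, which is why this is the usual path for upgrades without downtime.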
Webinar: PostgreSQL continuous backup and PITR with BarmanGabriele Bartolini
How can you achieve an RTO of 5 minutes for the backups of your PostgreSQL databases? And what about RPO=0 for zero data loss backups?
This webinar gave an answer to those questions, by providing an overview of Disaster Recovery of PostgreSQL databases with Barman, covering its major features.
Barman, Backup and Recovery Manager for PostgreSQL, is an open source tool that was conceived by 2ndQuadrant about 10 years ago and released open source in 2012 under GNU GPL 3.
It is now one of the most popular backup and recovery tools in the PostgreSQL ecosystem.
Video available at https://resources.2ndquadrant.com/en/webinar-postgresql-continuous-backup-and-pitr-with-barman
Maarten Mulders - Mastering Microservices with Kong - Codemotion Amsterdam 2019Codemotion
When architecting microservice solutions, you'll often find yourself struggling with cross-cutting concerns. Think security, rate limiting, access control, monitoring, location-aware routing… Things can quickly become a nightmare. The API Gateway pattern can help you solve such problems in an elegant and uniform way. Using Kong, an open source product, you can get started today. In this session we'll look at the why and how of this approach. Disclaimer: This presentation may include live coding. No humans or animals will be hurt during the process.
Mastering Microservices with Kong (CodeMotion 2019)Maarten Mulders
This document discusses API management and API gateways. It introduces API management as the process of creating and publishing APIs, enforcing usage policies, and collecting analytics. It then discusses how API gateways can provide functionality like routing, authentication, rate limiting and analytics for microservices. The document demonstrates configuring Kong, an open source API gateway, to route requests to beer and brewery microservices and add authentication. It also discusses how plugins can extend Kong's functionality and how Kong was a good fit for its flexibility and community support.
Create a Custom Plugin in Burp Suite using the ExtensionNSConclave
This document discusses creating a custom plugin in Burp Suite using the extension framework. It provides advantages of using the extension, requirements, an overview of implementing request and response functions on the server, server helper functions, and a demo of creating a custom plugin that decrypts and encrypts requests and responses for a bank web application. The presentation agenda includes an introduction, block diagram, requirements, running the server, request and response functions, server helper functions, and a demo.
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uberconfluent
Speaker: Yupeng Fu, Staff Engineer, Uber
High availability and reliability are important requirements for Uber services, which must tolerate datacenter failures in a region and fail over to another region. In this talk, we will present the active-active Apache Kafka® deployment at Uber and how it facilitates disaster recovery across regions for Uber services. In particular, we will highlight the key components, including topic replication, topic aggregation and offset sync, and then walk through several use cases of their disaster recovery strategy using active-active Kafka. Lastly, we will present several interesting challenges and the future work planned.
Yupeng Fu is a staff engineer in Uber Data Org leading the streaming data platform. Previously, he worked at Alluxio and Palantir, building distributed data analysis and storage platforms. Yupeng holds a B.S. and an M.S. from Tsinghua University and did his Ph.D. research on databases at UCSD.
Ohio Linux Fest 2013: Provisioning VMs Quickly with Vagrant and CFEngineNick Anderson
During this hands-on tutorial you will learn how to quickly provision local test/development/demo environments using Vagrant and VirtualBox, and how CFEngine can be leveraged to automate configuration of each environment after it has been initialized. You will take away a multi-VM test environment managed by CFEngine.
This tutorial targets technical people who need repeatable test environments and are comfortable using the Linux command-line. These environments can speed developer on-boarding, play a role in continuous integration, or just provide quick sandboxes for experimentation. No previous knowledge of Vagrant or CFEngine is required.
Attendees should bring a laptop with at least 10G of available disk space (SSD strongly recommended, but not required), and have current versions of Vagrant and Virtualbox installed.
Deploying MariaDB for HA on Google Cloud PlatformMariaDB plc
This document discusses deploying MariaDB for high availability on Google Cloud Platform. It presents two solutions: 1) a MariaDB Galera cluster within a single GCP region, and 2) asynchronous multi-master replication across regions using Orchestrator for automated failover. It describes using ProxySQL or MaxScale for read/write splitting and Consul for coordinating configuration updates between Orchestrator and proxies during failovers.
An energy-domain IIoT workshop using the Serverless Framework, Elasticsearch, Kibana and AWS cloud resources to configure, stream, manage, search and analyze telemetry data.
By Simon Hong.
Slides at https://docs.google.com/presentation/d/1rtFJVbyJ3qIUq1V8Q1Hp_ZIokWt_0GgVH8djEdDnD9w/edit#slide=id.p.
(c) BlinkOn 10
Toronto, Ontario (Canada)
April 09 - 10, 2019
https://docs.google.com/document/u/1/d/e/2PACX-1vTgBrqyQ4KCchsymvssri1pN1BkOg3sEqHThqhvFDl9-zl-hLx1S5c8sc5gaZ_VzKEVaYj94H3m1vso/pub#h.igsyfaa103a0
This document discusses using TurboGears web application frameworks on both Python 2 and Python 3. It describes setting up separate Python 2 and Python 3 environments to develop TurboGears apps, installing TurboGears on both, and creating a basic app that renders templates. It also covers TurboGears features like object dispatch routing, template engines, database access using SQLAlchemy/Ming, and authentication.
Ryan Jarvinen Open Shift Talk @ Postgres Open 2013PostgresOpen
The document discusses writing portable PostgreSQL applications for the open cloud. It covers open cloud platforms like OpenStack that avoid lock-in, and platforms like OpenShift that provide portable PostgreSQL applications using cartridges and scaling hooks. Examples are given of simple applications like a TODO app using Flask, SQLAlchemy and PostgreSQL that can be easily deployed to OpenShift.
Check Yourself Before You Wreck Yourself: Auditing and Improving the Performa...Nicholas Jansma
Boomerang is an open-source Real User Monitoring (RUM) JavaScript library used by thousands of websites to measure their visitors' experiences.
Boomerang runs on billions of page loads a day, either via the open-source library or as part of Akamai's mPulse RUM service. The developers behind Boomerang take pride in building a reliable and performant third-party library that everyone can use without being concerned about its measurements affecting their site.
Recently, we performed and shared an audit of Boomerang's performance, to help communicate the "cost of doing business" of including Boomerang on a page while it takes its measurements. In doing the audit, we found several areas of code that we wanted to improve and have been making continuous improvements ever since. We've taken ideas and contributions from the OSS community, and have built a Performance Lab that helps "lock in" our improvements by continuously measuring the metrics that are important to us.
We'll discuss how we performed the audit, some of the improvements we've made, how we're testing and validating our changes, and the real-time telemetry we capture on our library to ensure we're having as little of an impact as possible on the sites we're included on.
Mailchimp to the Edge - Establishing Akamai Best Practices at MailchimpBob Strecansky
The document discusses best practices for using Akamai technologies at Mailchimp, covering three topics: Kubernetes and Akamai, continuous integration and continuous deployment with Akamai, and using BigQuery as an ingestion point for Akamai Datastream. It includes diagrams, code snippets, and glossaries to explain concepts like Kubernetes, Google Cloud Platform services, and how they interact with Akamai services.
The document outlines 15 ways to optimize Spring Boot applications for the cloud. It recommends using services provided by cloud platforms for monitoring, Spring Cloud Sleuth for request tracing, Spring Boot Actuator for metrics and health checks, and circuit breakers to prevent failures from cascading. It also suggests keeping dependencies up-to-date, using Eclipse OpenJ9 to reduce memory usage, enabling zero-downtime configuration changes, and introducing chaos testing into production environments.
Following simple patterns of good application design can allow you to scale your application for your customers easily. We'll dive into the 12 factor application design and demo how this applies to containers and deployments on Amazon ECS and Fargate. We'll take a look at tooling that can be used to simplify your work flow and help you adopt the principles of the 12 factor application.
Fluent 2018: Tracking Performance of the Web with HTTP ArchivePaul Calvano
Have you ever thought about how your site’s performance compares to the web as a whole? Or maybe you’re curious how popular a particular web feature is. How much is too much JavaScript? The HTTP Archive has been keeping track of how the web is built since 2010. It enables you to find answers to questions about the state of the web past and present.
Paul Calvano explores how the HTTP Archive works, how people are using this dataset, and some ways that Akamai has leveraged data within the HTTP Archive to help its customers.
OSCamp 2019 | #3 Ansible: Directing the Director by Martin SchurzNETWAYS
I will show how we designed our new central monitoring system based on Icinga2. Currently we are in the process of migrating our projects to this new system. While doing this, our teams learn a lot of new things and discover solutions to their specific problems. After describing how we share our acquired knowledge with all teams, I will give closer insight into one particular approach where we used Ansible to provision all our Icinga configuration via the Director interface.
An overview of the latest news and interesting developments across PHP from January 2019: new language features, RFCs, community news, upcoming conferences, and the latest versions of popular frameworks and platforms.
Movable Type 5.2 Overview at MTDDC 2012Yuji Takayama
The document provides an overview of new features in Movable Type 5.2 from the perspective of a software engineer. Key new features include a rich text editor, the ability for system administrators to restrict publishing paths, SMTP authentication, an upgraded tools interface, and support for running Movable Type on cloud platforms. It also discusses upgrades to the mail-sending backend to support SSL/TLS, submission ports, and debugging. Additional sections cover implementing a PSGI interface for Movable Type to support pluggable web servers and reloading of configurations.
Similar to PGConf.ASIA 2019 Bali - Your Business Continuity Matrix and PostgreSQL's Disaster Recovery - Muhammad Haroon (20)
Equnix Business Solutions (Equnix) is an IT Solution provider in Indonesia, providing comprehensive solution services especially on the infrastructure side for corporate business needs based on research and Open Source. Equnix has 3 (three) main services known as the Trilogy of Services: Support (Maintenance/Managed), World class level of Software Development, and Expert Consulting and Assessment for High Performance Transactions System. Equnix is customer oriented, not product or principal. Equal opportunity based on merit is our credo in managing HR development.
Data Leaks: The Work of Hackers or Criminals? How Do We Anticipate Them...Equnix Business Solutions
[EWTT2022] Database Implementation Strategy in a Microservice ArchitectureEqunix Business Solutions
Equnix Appliance: The Best Answer for Demanding Computing NeedsEqunix Business Solutions
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
Van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
What do a Lego brick and the XZ backdoor have in common?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only the fact that both are building blocks, or dependencies, of creative projects and software. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to dive into a story of interoperability, standards and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: An advocate of free software and of open, standard formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations and training courses. She previously worked on LibreOffice migrations and training for various public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when not pursuing her passion for computers and for Geeko she cultivates her curiosity about astronomy (the origin of her nickname, deneb_alpha).
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features that provide convenience and capability sacrifice security. This best-practices guide outlines steps users can take to better protect personal devices and information.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
5. https://www.2ndQuadrant.com
PGConf ASIA
Bali, Sept, 2019
psql=# \d haroon
● Working in PostgreSQL space @ 2ndQuadrant
● Part of PostgreSQL family for nearly a decade and a half
● Development, support, consulting, professional services and administration
● Past stints with PostgreSQL family include
○ EnterpriseDB
○ OpenSCG (now Amazon Web Services)
● Led Engineering Ops efforts @ IBEX Group
● Principal/Architect/Product Owner @ TRG
Email: haroon@2ndQuadrant.com
Skype: contact.haroon
7. Objectives
- Recovery Point Objective (RPO)
- How much data can I afford to lose?
- Recovery Time Objective (RTO)
- How long will it take me to recover?
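As a back-of-the-envelope example (all figures are hypothetical): with one base backup per day, a 15-minute reaction time and a measured 45-minute restore, the objectives work out as:

```shell
# Hypothetical figures, for illustration only
backup_interval_min=$((24 * 60))   # one base backup per day
reaction_min=15                    # time to notice the failure and decide
restore_min=45                     # measured time to restore the last backup

echo "Worst-case RPO: ${backup_interval_min} min"          # everything since the last backup
echo "Estimated RTO:  $((reaction_min + restore_min)) min" # reaction + recovery
```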
8. Service RELIABILITY
● Cost of downtime
○ How many €/$/£/AUD/AED/…?
● Risk management
● SLI (Service Level Indicator), SLO (Service Level Objective) and SLA
(Service Level Agreement)
15. Recap
● How do you feel now?
● Still: RPO = ∞ and RTO = n/a. Why?
● A backup is valid only if you have tested it
● Unfortunately, this is very common
18. Are we any better now?
● RPO = backup frequency
● RTO = maximum time of recovery
○ Provision another server
○ Configure another server (automated, right?)
○ Time to restore the last backup (measure it)
● Still missing something?
20. Recap
● Can this architecture work for you?
● We need reliable monitoring
○ From now on, we assume we have it in place!
● We need to reduce both RPO and RTO
22. PostgreSQL’s PITR
● Part of core (fully open source)
● Rebuild a cluster at a point in time
● From crash recovery to synchronous streaming replication (physical/logical)
● RPO = 0 (zero data loss)
● Hot base backup, continuous WAL archiving, Recovery
23. Basic concepts
● Continuous copy of WAL data (continuous archiving)
● Physical base backups
● Recovery:
○ copy base backup to another location
○ recovery mode (replay of WALs until target)
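A minimal sketch of these two phases with stock PostgreSQL tools (host name, paths and the recovery target are illustrative assumptions; pre-PostgreSQL 12 recovery.conf syntax, as used elsewhere in this deck):

```shell
# Backup phase (run regularly): physical base backup, WAL streamed alongside
pg_basebackup -h angus -U replica -D /backup/base -X stream -P

# Recovery phase: copy the base backup to the new location ...
rsync -a /backup/base/ /restore/data/

# ... and enter recovery mode, replaying archived WALs up to a target
cat >> /restore/data/recovery.conf <<'EOF'
restore_command = 'cp /var/lib/barman/angus/wals/%f %p'
recovery_target_time = '2019-09-09 12:00:00'
EOF
pg_ctl -D /restore/data start
```

The tools on the next slide automate exactly these steps.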
24. Tools
- Custom-written scripts
- pgBackRest
- pg_probackup
- WAL-G
- Barman
Any tool is fine, as long as you know what you are trying to achieve.
25. BARMAN
● Latest released version: Barman 2.9
● Open Source (GNU GPL 3)
● Written in Python
● Developed and maintained by 2ndQuadrant
● Available at www.pgbarman.org and
www.2ndquadrant.com/en/resources/barman/
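A typical Barman session, assuming a server named angus as in the configuration example later in the deck (Barman 2.x commands):

```shell
barman check angus                 # verify connections, WAL archiving, retention
barman backup angus                # take a new base backup
barman list-backup angus           # list the available backups
barman recover angus latest /var/lib/pgsql/11/data   # restore the latest backup
```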
29. Copy method
● PostgreSQL streaming (pg_basebackup)
○ Works where Rsync/SSH is impractical (e.g. Windows, Docker)
● Rsync/SSH
○ Incremental backup and recovery (via hard links)
○ Parallel backup and recovery
○ Network compression and bandwidth limitation
30. WAL shipping method
● “archiving”, through “archive_command”:
○ RPO ~ 16MB of WAL data, or
○ “archive_timeout”
● “streaming”, through streaming replication:
○ “pg_receivewal” or “pg_receivexlog”
○ continuous stream, RPO ~ 0
○ PostgreSQL 9.2+ required
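The streaming method boils down to running pg_receivewal against the primary (Barman drives this for you; the user, directory and slot name here mirror the configuration examples below and are otherwise illustrative):

```shell
# One-off: create a physical replication slot, so the primary retains
# WAL while the receiver is disconnected
pg_receivewal -h angus -U streaming_barman \
    --slot=barman_streaming_acdc --create-slot

# Then stream WAL continuously into the archive directory
pg_receivewal -h angus -U streaming_barman \
    -D /var/lib/barman/angus/streaming \
    --slot=barman_streaming_acdc
```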
31. Example from postgresql.conf
archive_mode = on
wal_level = logical
max_wal_senders = 10
max_replication_slots = 10
archive_command = 'rsync -a %p barman@HOST:/var/lib/barman/ID/incoming'
32. Example from barman.conf
[angus]
description = "Angus Young database"
ssh_command = ssh postgres@angus
conninfo = user=barman-acdc dbname=postgres host=angus
retention_policy = RECOVERY WINDOW OF 6 MONTHS
backup_method = rsync
reuse_backup = link
parallel_jobs = 4
archiver = true
streaming_archiver = true
slot_name = barman_streaming_acdc
33. RECAP
● How do you feel now?
● Still: RPO = ∞ and RTO = n/a. Why?
● A backup is valid only if you have tested it
● Barman reduces backup risks, but does not eliminate them
○ Systematic tests (especially custom scripts)
○ Business risk is very high
37. Example of Recovery script
● Write a bash script that:
○ connects to a remote server via SSH
○ stops the PostgreSQL server
○ issues a “barman recover” with target “immediate”
○ starts PostgreSQL
● Set it as post-backup script
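A sketch of such a script, assuming Barman 2.4+ (host names and paths are hypothetical; barman recover's --remote-ssh-command and --target-immediate options do the heavy lifting):

```shell
#!/bin/bash
# Hypothetical post-backup hook: restore every fresh backup on a test server
set -e
TEST_HOST=postgres@test-restore      # illustrative host
PGDATA=/var/lib/pgsql/11/data

# Stop PostgreSQL on the test server (ignore failure if it is not running)
ssh "$TEST_HOST" "pg_ctl -D $PGDATA stop -m fast" || true

# Restore the latest backup remotely, stopping at the end of the base backup
barman recover --remote-ssh-command "ssh $TEST_HOST" \
    --target-immediate angus latest "$PGDATA"

# Start PostgreSQL again: if it comes up, the backup is proven restorable
ssh "$TEST_HOST" "pg_ctl -D $PGDATA start"
```

Wired up as post_backup_script in barman.conf, every backup gets tested automatically.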
38. Some food for thought
● Outcomes:
○ Systematically test your backup
○ Measure your recovery time
○ Identical server? This is a backup server ready to start
● You can use a different data centre
● Be creative, PostgreSQL gives you infinite freedom!
39. RECAP
● RPO ~ 0 (your backups work, every time)
● RTO = Time of reaction + Recovery time
● Example: RPO ~0 and RTO < 1 day
○ Acceptable or not acceptable?
● Entry level architecture for business continuity
● Priority now: improve RTO
41. PostgreSQL’s Replication
● Part of core (fully open source)
● One master, multiple standby servers
● Evolution of PITR
○ Standby server is in continuous recovery mode
○ Hot standby (read-only)
○ Both streaming (9.0+) and file-based pulling of WAL
● Cascading from a standby
42. Synchronous replication
● Fine control (from global down to transaction level)
● 2-safe replication
○ COMMIT of a write transaction waits until it is written on both the
master and a standby (or more from 9.6)
■ More than one synchronous standby can be required
○ Read consistency of a cluster
● RPO = 0 (zero data loss)
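The per-transaction control mentioned above goes through synchronous_commit; table and values are illustrative (remote_apply requires 9.6+):

```sql
-- Default for the cluster (postgresql.conf): synchronous_commit = on

BEGIN;
SET LOCAL synchronous_commit = remote_apply;  -- wait until the standby has applied it
INSERT INTO payments (id, amount) VALUES (42, 100.00);
COMMIT;                                       -- RPO = 0 for this transaction

SET synchronous_commit = local;               -- this session stops waiting for standbys
```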
45. Excerpt from postgresql’s configuration
postgresql.conf:
hot_standby = on
recovery.conf:
standby_mode = 'on'
# Streaming
primary_conninfo = 'host=angus user=replica application_name=ha sslmode=require'
# Fallback via Barman
restore_command = 'barman-wal-restore -U barman acdc angus %f %p'
46. Switchover (planned)
● Applications are paused (start of downtime)
● Shut down the master
● Allow the standby to catch up with the master
● Promote the standby
● Switch virtual IPs
● Resume applications (end of downtime)
● Reconfigure the former master as standby
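Command by command, the switchover might look like this (data directories and host names are illustrative; "ha" matches the application_name used in the replication excerpt above):

```shell
# On the old primary: clean shutdown, so all WAL reaches the standby
pg_ctl -D /var/lib/pgsql/11/data stop -m fast

# On the standby: verify it has replayed everything it received ...
psql -c "SELECT pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn();"

# ... then promote it to primary
pg_ctl -D /var/lib/pgsql/11/data promote

# Rebuild the former primary as a standby, e.g. with pg_rewind
pg_rewind --target-pgdata=/var/lib/pgsql/11/data \
          --source-server='host=ha user=postgres dbname=postgres'
```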
49. RECAP
● RPO ~ 0 (your backups work, every time)
● RTO = Time of reaction + Time of promotion
● Criticality: manual intervention
○ Reliable monitoring
○ Trained people (practice & docs!)
50. Manual failover vs automated failover
● Risk management
○ Split brain nightmare
○ Automated is built on manual (test!)
○ Your choice
● Very good solution for business continuity
● Uptime > 99.99% in a year
53. Synchronous replication
● Primary → Barman
○ Zero data loss backup
● Primary → standby
○ Zero data loss cluster (reduces RTO)
● Just one configuration line in PostgreSQL
○ synchronous_standby_names = '1 (ha, barman_receive_wal)'
57. Push the boundaries
● Repeatable architectures in multiple data centres
● PgBouncer
● Virtual IPs
● S3 relay via Barman hook scripts
● Multiple standby servers and cascading replication
● Docker containers
● Logical replication backups
58. Conclusions
● Baby steps and KISS (keep it simple)
● New? Explore and learn
● Practice is the only way to mastery (drills)
● Plan regular healthy downtimes
○ Use switchovers to perform PostgreSQL updates
○ Smart downtimes increase long-term uptime