Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

From 0 to ~100: Business Continuity with PostgreSQL

1,694 views

Published on

A journey through the infinite combinations of processes and architectures that PostgreSQL and its open source ecosystem provide within disaster recovery and high availability to help you achieve peace of mind. Starting from zero.

Published in: Technology

From 0 to ~100: Business Continuity with PostgreSQL

  1. 1. WHO WANTS A SERVICE WITH ZERO DOWNTIME?
  2. 2. … EVERYBODY
  3. 3. IS IT THAT GOOD?
  4. 4. NOT JUST TECHNOLOGY. RISKS, PROCEDURES, PEOPLE
  5. 5. FROM 0 TO ~100: BUSINESS CONTINUITY WITH POSTGRESQL Gabriele Bartolini Head of Support @ 2ndQuadrant PgDay.IT 2017, Milan
  6. 6. 2ndquadrant.com @_GBartolini_ #PGDayIT ABOUT MYSELF ▸ Open Source passionate and programmer since 1995 ▸ First time with Postgres in 1997, regular from ~2000 ▸ Lean and DevOps practitioner ▸ Co-Founder of ITPUG and PostgreSQL Europe ▸ Entrepreneur, with 2ndQuadrant since 2008 ▸ Co-Author of “PostgreSQL Administration Cookbook” ▸ Came up with the name “Barman”
  7. 7. 2ndquadrant.com @_GBartolini_ #PGDayIT BUSINESS CONTINUITY ▸ Disaster Recovery ▸ High Availability ▸ Types of disaster/failures ▸ Availability = Uptime / (Uptime + Downtime)
  8. 8. 2ndquadrant.com @_GBartolini_ #PGDayIT OBJECTIVES ▸ Recovery Point Objective (RPO) ▸ How much data can I afford to lose? ▸ Recovery Time Objective (RTO) ▸ How long will it take me to recover?
  9. 9. 2ndquadrant.com @_GBartolini_ #PGDayIT SERVICE RELIABILITY ▸ Cost of downtime ▸ How many €/$/£/AUD/…? ▸ Risk management ▸ SLI, SLO and SLA
  10. 10. 2ndquadrant.com @_GBartolini_ #PGDayIT SOME NOTES FOR THIS PRESENTATION ▸ PostgreSQL on Linux ▸ Servers can be either physical or virtual ▸ Storage must be redundant ▸ RAID is required ▸ VOLUME: redundant disk mounted on a system
  11. 11. LET’S START
  12. 12. 0. ONE POSTGRES SERVER
  13. 13. 2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Server name: hope
  14. 14. 2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ Why is RPO = ∞? ▸ Why is RTO = n/a? ▸ “Hope is not a strategy” (cit. Google) ▸ More common than you’d expect
  15. 15. 10. ONE POSTGRES SERVER + LOGICAL BACKUPS
  16. 16. 2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Add systematic backups with pg_dump LOGICAL BACKUP LOGICAL BACKUP LOGICAL BACKUP … Day 0
 4AM Day -1 4AM Day -2 4AM
  17. 17. 2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ How do you feel now? ▸ Still: RPO = ∞ and RTO = n/a. Why? ▸ A backup is valid only if you have tested it ▸ Unfortunately, this is very common
  18. 18. 20. ONE POSTGRES SERVER + LOGICAL BACKUPS + LOGICAL RESTORES
  19. 19. 2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Test your backups with pg_restore LOGICAL BACKUP Day 0
 4AM
  20. 20. 2ndquadrant.com @_GBartolini_ #PGDayIT DEFINING SOME OBJECTIVES ▸ Measure time for pg_restore ▸ RPO = backup frequency ▸ RTO = maximum time of recovery ▸ Provision another server ▸ Configure another server (automated, right?) ▸ Time to restore the last backup (measure it)
  21. 21. HAVE WE REALLY THOUGHT ABOUT EVERYTHING?
  22. 22. TIME OF REACTION
  23. 23. 2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ Can this architecture work for you? ▸ We need reliable monitoring ▸ From now on, we assume we have it in place! ▸ We need to reduce both RPO and RTO
  24. 24. HOW? POINT-IN-TIME-RECOVERY
  25. 25. 2ndquadrant.com @_GBartolini_ #PGDayIT POSTGRESQL’S PITR ▸ Part of core (fully open source) ▸ Rebuild a cluster at a point in time ▸ From crash recovery to sync streamrep (physical/logical) ▸ RPO = 0 (zero data loss) ▸ Hot base backup, continuous WAL archiving, Recovery ▸ API
  26. 26. 2ndquadrant.com @_GBartolini_ #PGDayIT BASIC CONCEPTS ▸ Continuous copy of WAL data (continuous archiving) ▸ Physical base backups ▸ Recovery: ▸ copy base backup to another location ▸ recovery mode (replay of WALs until target)
  27. 27. 2ndquadrant.com @_GBartolini_ #PGDayIT BARMAN ▸ In this presentation: Barman 2.3 ▸ Open Source (GNU GPL 3) ▸ Written in Python ▸ Developed and maintained by 2ndQuadrant ▸ Available at www.pgbarman.org
  28. 28. 40. ONE POSTGRES SERVER + ONE BARMAN SERVER
  29. 29. 2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Continuous backup
  30. 30. 2ndquadrant.com @_GBartolini_ #PGDayIT BASIC CONCEPTS ▸ Remote backup and recovery ▸ Multiple server management ▸ Backup catalogue and WAL archive ▸ Retention policies
  31. 31. 2ndquadrant.com @_GBartolini_ #PGDayIT COPY METHOD ▸ PostgreSQL streaming ▸ Practical/Windows/Docker ▸ Rsync/SSH ▸ Incremental backup and recovery (via hard links) ▸ Parallel backup and recovery ▸ Network compression and bandwidth limitation
  32. 32. 2ndquadrant.com @_GBartolini_ #PGDayIT WAL SHIPPING METHOD ▸ “archiving”, through “archive_command”: ▸ RPO ~ 16MB of WAL data, or ▸ “archive_timeout” ▸ “streaming”, through streaming replication: ▸ “pg_receivewal” or “pg_receivexlog” ▸ continuous stream, RPO ~ 0 ▸ PostgreSQL 9.2+ required
  33. 33. 2ndquadrant.com @_GBartolini_ #PGDayIT EXAMPLE FROM POSTGRESQL.CONF archive_mode = on wal_level = logical max_wal_senders = 10 max_replication_slots = 10 archive_command = 'rsync -a %p barman@HOST:/var/lib/barman/ID/incoming'
  34. 34. 2ndquadrant.com @_GBartolini_ #PGDayIT EXAMPLE FROM BARMAN.CONF [angus] description = “Angus Young database" ssh_command = ssh postgres@angus conninfo = user=barman-acdc dbname=postgres host=angus retention_policy = RECOVERY WINDOW OF 6 MONTHS copy_method = rsync reuse_backup = link parallel_jobs = 4 archiver = true streaming_archiver = true slot_name = barman_streaming_acdc
  35. 35. 2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ How do you feel now? ▸ Still: RPO = ∞ and RTO = n/a. Why? ▸ A backup is valid only if you have tested it ▸ Barman reduces backup risks, does not exclude them ▸ Systematic tests (especially custom scripts) ▸ Business risk is very high
  36. 36. 60. ONE POSTGRES SERVER + ONE BARMAN SERVER + ONE RECOVERY SERVER
  37. 37. 2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Test your backups with barman recover
  38. 38. WHAT A WASTE!
  39. 39. TESTING OR BI? HAVE YOU EVER THOUGHT OF USING IT FOR
  40. 40. 2ndquadrant.com @_GBartolini_ #PGDayIT HOOK SCRIPTS ▸ Barman has hook scripts: ▸ pre and post backup ▸ pre and post archiving ▸ with retry option (until the script returns SUCCESS)
  41. 41. 2ndquadrant.com @_GBartolini_ #PGDayIT EXAMPLE OF RECOVERY SCRIPT ▸ Write a bash script that: ▸ connects to a remote server via SSH ▸ stops the PostgreSQL server ▸ issues a “barman recover” with target “immediate” ▸ starts the PostgreSQL ▸ Set it as post-backup script
  42. 42. 2ndquadrant.com @_GBartolini_ #PGDayIT SOME FOOD FOR THOUGHT ▸ Outcomes: ▸ Systematically test your backup ▸ Measure your recovery time ▸ Identical server? This is a backup server ready to start ▸ You can use a different data centre ▸ Be creative, PostgreSQL gives you infinite freedom!
  43. 43. 2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ RPO ~ 0 (your backups work, every time) ▸ RTO = Time of reaction + Recovery time ▸ Example: RPO ~0 and RTO < 1 day ▸ Acceptable or not acceptable? ▸ Entry level architecture for business continuity ▸ Priority now: improve RTO
  44. 44. HOW? REPLICATION
  45. 45. 2ndquadrant.com @_GBartolini_ #PGDayIT POSTGRESQL’S REPLICATION ▸ Part of core (fully open source) ▸ One master, multiple standby servers ▸ Evolution of PITR ▸ Standby server is in continuous recovery mode ▸ Hot standby (read-only) ▸ Both streaming (9.0+) and file based pulling of WAL ▸ Cascading from a standby
  46. 46. 2ndquadrant.com @_GBartolini_ #PGDayIT SYNCHRONOUS REPLICATION ▸ Fine control (from global down to transaction level) ▸ 2-safe replication ▸ COMMIT of a write transactions waits until written on both the master and a standby (or more from 9.6) ▸ More than a synchronous client is required ▸ Read consistency of a cluster ▸ RPO = 0 (zero data loss)
  47. 47. 80. TWO POSTGRES SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER
  48. 48. 2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE barman_restore_wal barman recover Symmetric Cluster master standbyANGUS MALCOLM
  49. 49. 2ndquadrant.com @_GBartolini_ #PGDayIT EXCERPT FROM POSTGRESQL’S CONFIGURATION postgresql.conf: hot_standby = on recovery.conf: standby_mode = ‘on' # Streaming primary_conninfo = 'host=angus user=replica application_name=ha sslmode=require’ # Fallback via Barman restore_command = 'barman-wal-restore -U barman acdc angus %f %p'
  50. 50. 2ndquadrant.com @_GBartolini_ #PGDayIT SWITCHOVER (PLANNED) ▸ Applications are paused (start of downtime) ▸ Shut down the master ▸ Allow the standby to catch up with the master ▸ Promote the standby ▸ Switch virtual IPs ▸ Resume applications (end of downtime) ▸ Reconfigure the former master as standby
  51. 51. 2ndquadrant.com @_GBartolini_ #PGDayIT FAILOVER (UNPLANNED) ▸ The master is down (start of downtime) ▸ Promote the standby ▸ Change the virtual IP ▸ DEGRADED SYSTEM
  52. 52. 2ndquadrant.com @_GBartolini_ #PGDayIT MANUAL SWITCHOVER AND FAILOVER ▸ Manual switchover != manual switchover procedure ▸ Manual switchover = manually triggered ▸ Automate the procedure!!! ▸ bash (good) ▸ Ansible (better) ▸ Enhance gradually
  53. 53. 2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ RPO ~ 0 (your backups work, every time) ▸ RTO = Time of reaction + Time of promotion ▸ Criticality: manual intervention ▸ Reliable monitoring ▸ Trained people (practice & docs!)
  54. 54. 2ndquadrant.com @_GBartolini_ #PGDayIT MANUAL FAILOVER VS AUTOMATED FAILOVER ▸ Risk management ▸ Split brain nightmare ▸ Automated is built on manual (test!) ▸ Your choice ▸ Very good solution for business continuity ▸ Uptime > 99.99% in a year
  55. 55. 90. TWO POSTGRES SYNC SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER
  56. 56. 2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE barman_restore_wal barman recover Potential synchronous Synchronous ZERO DATA LOSS
  57. 57. 2ndquadrant.com @_GBartolini_ #PGDayIT SYNCHRONOUS REPLICATION ▸ Primary: Barman ▸ Zero data loss backup ▸ Primary: Standby ▸ Zero data loss cluster (reduce RTO) ▸ Just one configuration line in PostgreSQL ▸ synchronous_standby_names = '1 (ha, barman_receive_wal)'
  58. 58. ~100. TWO POSTGRES SYNC SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER + REPMGR (AUTO-FAILOVER)
  59. 59. 2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Potential synchronous Synchronous repmgr repmgr repmgr witness
  60. 60. WHAT’S MORE?
  61. 61. 2ndquadrant.com @_GBartolini_ #PGDayIT PUSH THE BOUNDARIES ▸ Repeatable architectures ▸ PgBouncer ▸ Virtual IPs ▸ S3 relay via Barman hook scripts ▸ Multiple standby servers and cascading replication ▸ Docker containers ▸ Logical replication backups
  62. 62. 2ndquadrant.com @_GBartolini_ #PGDayIT CONCLUSIONS ▸ Babysteps and KISS ▸ New? Explore and learn ▸ Practice is the only way to mastery (drills) ▸ Plan regular healthy downtimes ▸ Use switchovers to perform PostgreSQL updates ▸ Smart downtimes increase long-term uptime
  63. 63. 2ndquadrant.com @_GBartolini_ #PGDayIT ANY QUESTIONS? ▸ PostgreSQL: www.postgresql.org ▸ Barman: www.pgbarman.org ▸ Barman Cli: github.com/2ndquadrant-it/barman-cli ▸ PgBouncer: pgbouncer.github.io ▸ Repmgr: www.repmgr.org ▸ Our blog: blog.2ndquadrant.com
  64. 64. 2ndquadrant.com @_GBartolini_ #PGDayIT LICENCE Attribution 4.0 International (CC BY 4.0) You are free to: ▸ Share — copy and redistribute the material in any medium or format ▸ Adapt — remix, transform, and build upon the material for any purpose, even commercially. The licensor cannot revoke these freedoms as long as you follow the license terms.

×