HowTo DR

Josh Berkus
PostgreSQL Experts
SCALE 2014
Disaster Recovery
“The process, policies and
procedures that are related to
preparing for recovery or
continuation of tech...
Disaster Recovery
Restoring services after the
unexpected.
Disaster Recovery
Limiting:
1. Downtime
2. Data Loss
Do you have
a DR Plan?
Is it fairly
complete?
Have you
tested it?
Dilbert is a copyright of Scott Adams. Used here as parody. All rights reserved to Scott Adams.
Threat
Model
server
failure

getting
hacked
natural
disaster
server
failure

getting
hacked
natural
disaster

storage
failure
traffic
spike

network
failure

admin
error

bad
update

...
server
failure

getting
hacked
natural
disaster

storage
failure
traffic
spike

network
failure

admin
error

bad
update

...
Accepting
Loss
The Nines
Nines
99.9%
99.99%
99.999%

Down/Year
9 hours
1 hour
5 minutes
The Nines
●

Treats all downtime causes as
identical
–

●
●
●

except the ones it ignores

Doesn't address data loss
Reall...
Disaster
Server Failure
Network Failure
Admin Error
Bad Update
Storage Failure
Getting Hacked
Natural Disaster

Downtime D...
Disaster

Downtime Data Loss

Server Failure

0

0

Network Failure

0

0

Admin Error

0

0

Bad Update

0

0

Storage Fa...
$$$$$$$$$$$$$$$$$$$$$$$
Disaster

Downtime Data Loss

Server Failure

5min

1min

Network Failure

3hrs

10min

Admin Error

1hr

1hr

Bad Update
...
Your
DR
Plan
Elements of a Plan
1. Backups/Replicas
2. Replacements
3. Procedures
4. People
Backups
●
●
●
●
●

pg_dump
mysqldump
rsync
zfs snapshot
SAN exportable snapshot
Backups++
●
●
●
●

Periodic
Portable
Simple
Recover point-in-time
Backups-●
●

Slow to restore
Data loss interval
Backups
●

Good for:
natural disaster
– admin error, bad update
– software bugs
– getting hacked
–

●

Bad for everything ...
Replication
●
●
●
●
●
●

Postgres binary replication
MySQL replication
Redundant HBase nodes
Redis clustering
DRBD
Gluster...
Replication++
●
●
●

Continuous
Fast failover
Low data loss
Replication-●
●
●
●
●

Extra hardware
Complex
High-maintenance
Can hurt performance
Can replicate failures
Replication
●

Good For:
–

●

server, storage, network failure

Bad For:
admin error, getting hacked
– software bugs
–
Continuous Backup
●
●
●

Also “PITR”
Continous like replication
Partial recovery like backups
… where you gonna
restore those backups
to?
Replacing Services
●
●
●
●
●

servers
network
storage
OS image
software reversion
Procedures
Written
Procedures
3AM
is not the time
to improvise
Procedures
… for each recovery step
… for deciding what steps
Database Server
Does Not Respond
1. Determine if physical server is
down
a. if network is down, use plan N1.
2. If not, tr...
Good: detailed written
procedures
Better: written procedures with
pastable commands
Best: tested single-command
scripts
Fallback Procedures
●
●
●

Sometimes recovery fails
Have fallback procedures
If the fallback fails
… time for a meeting!
Never
ever
ever
ever
improvise.
People

Who
You
Gonna
Call?
Know who to call
●
●
●
●
●

on call staff
experts in each service
consultants/contractors
vendors
required authorizations
Contact Book
●

●

Include as much contact
information as possible
Put copies in more than one
place
–

●

including paper...
Test Your DR
Good: when you create the
procedure
Better: quarterly
Best: as part of daily/weekly
provisioning
An untested
backup
is one which
doesn't work.
DR
in the
Cloud
“It's a cloud, right?
That means it's
redundant, right?”
… not necessarily for
your servers

unless you pay for it!
Some new problems
●
●
●
●

Instance failure
Resource overcommit
Zone failures
Admin error at scale
Some new solutions
●

Redundant services
–

●
●

RDS, VIP, S3

Rapid server deployment
Cheap replicas
… otherwise
pretty much
the same.
backup locations
●

shared instance storage (EBS)
–

●

fast failover for instance fail

long-term storage API (S3)
redund...
Use your rapid deploy!
●
●

Continuous backup to S3
Deploy scripts + server images
–

●

Chef/Salt/Puppet/etc. helps here
...
DR Tips
●

Have multiple copies of your plan
–

●
●

in multiple locations

A SAN is not a DR solution
One form of backup ...
Questions?
●

Josh Berkus
www.databasesoup.com
– www.pgexperts.com
–

●

Coming up:
NYC pgDay April 3-4
– pgCon May 21-24
...
HowTo DR
Upcoming SlideShare
Loading in...5
×

HowTo DR

406

Published on

A very general HowTo presentation about Disaster Recovery planning.

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
406
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
13
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

HowTo DR

  1. 1. HowTo DR Josh Berkus PostgreSQL Experts SCALE 2014
  2. 2. Disaster Recovery “The process, policies and procedures that are related to preparing for recovery or continuation of technology infrastructure which are vital to an organization after a natural or human-induced disaster.” Wikipedia, February 2014
  3. 3. Disaster Recovery Restoring services after the unexpected.
  4. 4. Disaster Recovery Limiting: 1. Downtime 2. Data Loss
  5. 5. Do you have a DR Plan?
  6. 6. Is it fairly complete?
  7. 7. Have you tested it?
  8. 8. Dilbert is a copyright of Scott Adams. Used here as parody. All rights reserved to Scott Adams.
  9. 9. Threat Model
  10. 10. server failure getting hacked natural disaster
  11. 11. server failure getting hacked natural disaster storage failure traffic spike network failure admin error bad update OS / VM problem software bugs
  12. 12. server failure getting hacked natural disaster storage failure traffic spike network failure admin error bad update OS / VM problem software bugs
  13. 13. Accepting Loss
  14. 14. The Nines Nines 99.9% 99.99% 99.999% Down/Year 9 hours 1 hour 5 minutes
  15. 15. The Nines ● Treats all downtime causes as identical – ● ● ● except the ones it ignores Doesn't address data loss Really “Business Continuity” also unrealistic
  16. 16. Disaster Server Failure Network Failure Admin Error Bad Update Storage Failure Getting Hacked Natural Disaster Downtime Data Loss Detect
  17. 17. Disaster Downtime Data Loss Server Failure 0 0 Network Failure 0 0 Admin Error 0 0 Bad Update 0 0 Storage Failure 0 0 Getting Hacked 0 0 Natural Disaster 0 0 Detect 10 yrs 10 yrs
  18. 18. $$$$$$$$$$$$$$$$$$$$$$$
  19. 19. Disaster Downtime Data Loss Server Failure 5min 1min Network Failure 3hrs 10min Admin Error 1hr 1hr Bad Update 1hr 1hr Storage Failure 5min 30min Getting Hacked 1hr 1hr Natural Disaster 6hrs 1hr Detect 3 mo 3 mo
  20. 20. Your DR Plan
  21. 21. Elements of a Plan 1. Backups/Replicas 2. Replacements 3. Procedures 4. People
  22. 22. Backups ● ● ● ● ● pg_dump mysqldump rsync zfs snapshot SAN exportable snapshot
  23. 23. Backups++ ● ● ● ● Periodic Portable Simple Recover point-in-time
  24. 24. Backups-● ● Slow to restore Data loss interval
  25. 25. Backups ● Good for: natural disaster – admin error, bad update – software bugs – getting hacked – ● Bad for everything else
  26. 26. Replication ● ● ● ● ● ● Postgres binary replication MySQL replication Redundant HBase nodes Redis clustering DRBD GlusterFS etc.
  27. 27. Replication++ ● ● ● Continuous Fast failover Low data loss
  28. 28. Replication-● ● ● ● ● Extra hardware Complex High-maintenance Can hurt performance Can replicate failures
  29. 29. Replication ● Good For: – ● server, storage, network failure Bad For: admin error, getting hacked – software bugs –
  30. 30. Continuous Backup ● ● ● Also “PITR” Continous like replication Partial recovery like backups
  31. 31. … where you gonna restore those backups to?
  32. 32. Replacing Services ● ● ● ● ● servers network storage OS image software reversion
  33. 33. Procedures
  34. 34. Written Procedures
  35. 35. 3AM is not the time to improvise
  36. 36. Procedures … for each recovery step … for deciding what steps
  37. 37. Database Server Does Not Respond 1. Determine if physical server is down a. if network is down, use plan N1. 2. If not, try to restart database using command … 3. Still down? Fail over to replica using command … 4. Check replica. 5. Not working? Restore backup to test server 1 using command ...
  38. 38. Good: detailed written procedures Better: written procedures with pastable commands Best: tested single-command scripts
  39. 39. Fallback Procedures ● ● ● Sometimes recovery fails Have fallback procedures If the fallback fails … time for a meeting!
  40. 40. Never ever ever ever improvise.
  41. 41. People Who You Gonna Call?
  42. 42. Know who to call ● ● ● ● ● on call staff experts in each service consultants/contractors vendors required authorizations
  43. 43. Contact Book ● ● Include as much contact information as possible Put copies in more than one place – ● including paper! Keep it up to date
  44. 44. Test Your DR Good: when you create the procedure Better: quarterly Best: as part of daily/weekly provisioning
  45. 45. An untested backup is one which doesn't work.
  46. 46. DR in the Cloud
  47. 47. “It's a cloud, right? That means it's redundant, right?”
  48. 48. … not necessarily for your servers unless you pay for it!
  49. 49. Some new problems ● ● ● ● Instance failure Resource overcommit Zone failures Admin error at scale
  50. 50. Some new solutions ● Redundant services – ● ● RDS, VIP, S3 Rapid server deployment Cheap replicas
  51. 51. … otherwise pretty much the same.
  52. 52. backup locations ● shared instance storage (EBS) – ● fast failover for instance fail long-term storage API (S3) redundant – large –
  53. 53. Use your rapid deploy! ● ● Continuous backup to S3 Deploy scripts + server images – ● Chef/Salt/Puppet/etc. helps here = fast recovery – with low running costs
  54. 54. DR Tips ● Have multiple copies of your plan – ● ● in multiple locations A SAN is not a DR solution One form of backup is seldom enough
  55. 55. Questions? ● Josh Berkus www.databasesoup.com – www.pgexperts.com – ● Coming up: NYC pgDay April 3-4 – pgCon May 21-24 – Copyright 2013 PostgreSQL Experts Inc. Released under the Creative Commons Share-Alike 3.0 License. All images, logos and trademarks are the property of their respective owners and are used under principles of fair use unless otherwise noted.
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×