We've all heard of and practiced variations of disaster recovery methodologies for achieving 24 X 7 availability for SaaS applications. There is a range of dependencies for optimal RPO (recovery point objective) and RTO ( recovery time objective) including where the application is running, inter-dependencies of microservices, dependencies on datastores, and storage among others. Based on our experience of SaaS-ifying applications to run on hyperscalers, we are identifying patterns for our applications to adopt to optimize RPO (recovery point objective) and RTO ( recovery time objective).
We will describe our challenges, learning, and patterns for next-gen SaaS applications, and guidance on how to adopt them in the delivery of your next SaaS applications running in single or multiple clouds with the Disaster Recovery as the focus area
1. DevOps for 24 X 7 SaaS and disaster
recovery
Shikha Srivastav
Distinguished Engineer, IBM
SaaS, Master Inventor
Amitabh Prasad
Software Architect, IBM
2. Agenda
• What is SaaS and the challenges
• Business Continuity and disaster recovery is critical
• Disaster Recovery Models
• What we learned
• Patterns based on our learning
Shikha
3. What is SaaS
IaaS
PaaS
SaaS
Virtual Machines,Network
Storage
Business Applications
Container Orchestration , CICD automation
Auto scaling ,Application development tools
• Stable - providing consistent
acceptable performance
• Available - providing 24X7
availability
• Reliable - for customers to
leverage SaaS in their mission-
critical business needs
• Resilient - allowing updates
(security, features, bugs), and
upgrades with zero downtime
• Secure and Compliant
4. What it takes
Collaboration and
Communication
Continuous Integration Continuous Delivery Observability
Change and incident
management
Security and Compliance Business Continuity Test, Quality and Scale
Plan Build Test Secure Release Operate
Ideas
Ideas
Feedbac
k
Feedbac
k
Shikha
5. What it takes
Collaboration and
Communication
Continuous Integration Continuous Delivery Observability
Change and incident
management
Security and Compliance Business Continuity Test, Quality and Scale
Plan Build Test Secure Release Operate
Ideas
Ideas
Feedbac
k
Feedbac
k
Shikha
Disaster Recovery
6. • Service needs to meet SLA
• Service Recovery most critical in
Cloud
• Recovery Point Objective (RPO): the
amount of data that will be lost
• Recovery Time Objective (RTO):
the amount of downtime a
business can tolerate.
Disaster Recovery
7. Key Considerations for Disaster Recovery
WHY backup
- Business continuity
WHAT to back up
- Do you need to back up all data ?
WHERE to back up
- Security and privacy for customer data consideration
HOW to back up
- Automation for backup
RECOVERY plans in case of disaster
- Algorithm for recovery with automated steps
9. Typical SaaS characteristics
• Multi-tenant architecture
• Multi Zone deployment
• Rely on cloud services for data
backup and recovery
10. What we learned
Highly available architecture with robust backup & restore for business continue
• HA architecture ensures 99.999% per year availability
• Backup & restore strategy ensure we have < 12 hrs of RPO and <12 hrs
of RTO (for regional failures)
• Within same region it's < 10 min of RPO and < 6 hrs RTO
11. What we learned
What to protect
• Runtime Config
• Volume
• Cloud resources (databases)
12. • Backup agents
• Valero Operator to backup up configs
• Volume backup agent
• tag volumes
• Backup of volumes & Cloud services
What we learned
How to protect
13. Patterns we established based on our learning
• Backup Runtime Kubernetes Objects
• Configure Cloud Data store backups
• Enable Volume replication
• Handle non-cloud data store
• gitOps to redeploy static and runtime objects
15. Use native data stores as much as
possible, if in cloud make sure to use
cloud native datastores
RDS, ICD etc
Tag Cloud Data Resources instances
and use them to to identify resources
to backup using Cloud Provider
Backup or any provider specific
backup techniques
This backup technique identifies all
the resources to backup using tags,
calls snapshot and other required
api's to do backup
Patterns
Backup Cloud resources
16. • Runs as pods
• Watches pvc and tags related
volumes
• Backup of volumes managed by
Cloud Provider’s api’s
Patterns
Backup Volumes (Replication)
17. • Used for application deployment
• Manage drifts between git repo and
registered clusters
• Restore data and artifacts
• Argo CD is used to deploy rest of
the application during recovery flow
Patterns
Gitops
19. References
Kubernetes volume backup for disaster recovery
Backup and Restore of Kubernetes Stateful Application Data with CSI Volume Snapshots
Speed and resiliency: two sides of the same coin
AWS backup strategies
Valero.io
Disaster Recovery with GitOps
Useful if leveraging Kubernetes as Container orchestrator:
Kubernetes & 12-factor apps
7 missing factors from 12 factor application
Editor's Notes
RPO: Max targeted period for which data will be lost in the event of disaster, this is based on how frequent you want to take backup and what’s your backup architecture. What kind of underlying backup technology and application architecture etc.
RTO: Time to restore system after disaster, again lot depends on how complex your system is and what automation is inplace to get a new system up and running. For example if we use gitops other then boot time and middleware/platform deployment, new application instance can be made up and running in no time
"An entire AWS region?? That will never happen!” (US East 1 went down in Dec 2021)
Ransomeware attack with Garmin, such risk are increasingly common and these can be easily mitigatAs part of enterprise SaaS solution, we have to ensure, there is a robust disaster recovery plan in place to restore service in the unlikely event that the entire region goes down.
Backup and restore: backing up of individual system in cloud native world this means backing up database files, volumes, configuration etc in traditional world this requires vm backup and so on (RTO in hrs)
Pilot light: Core components of the system is always running in dr region, with automation in place to quickly provision rest of the system. Like DB may be in non active state with replication enabled and in case of disaster automation takes care of rest of the application deployment (RTO in min)
Warm standby: scaled down version of fully functional env running in DR region, further reduces recovery time (RTO < 10 min)
Multisite active-active: Specialize configuration both env are in use, traffic is distributed across these env, with autoscaling capabilities to handle increased load DR can be easily managed (almost 0 RTO)
Emphasize on diff between HA deployment vs disaster recovery
HA can take care of local failures, node going down, corrupt instance etc
Runtime configuration
Volumes
This is where application may store some of the documents and in certain cases in-cluster db's
Cloud resources (databases)
- Rest of the application artifacts can be recreated as part of deployment
Velero for k8s backup filter what we want to backup based on labels
Volume backup using native snapshot
DB backup using cloud provider techniques or native DB techniques
Velero for Kubernetes objects
Data stored in s3 bucket replicated to DR region
Use cloud native or native DB
Abstractly identify services to be backed up using tags
Uses storage replication (or cloud provider techniques) to move data to DR location
Any typical k8s applications does have stateful sets
Use volume snapshot to take backups
Replication depends on cloud provider or existing storage
Gitops for application deployment
ArgoCD: k8s controller that monitors git repo and ensures all the deployed application is in sync with the application artifacts