Carlos Santana (@csantanapr)
Sr. EKS Specialist SA, AWS
CNCF Ambassador
Navigating Disaster Recovery in
Kubernetes and Crossplane
Platform Engineering
SRE Engineering
© 2023, Amazon Web Services, Inc. or its affiliates. @csantanapr
A model to think of resiliency
Resiliency
• Disaster Recovery: one-time events
• High Availability: average over time
Disaster recovery (DR)
• About business continuity
• Larger-scale, less frequent events:
  – Natural disasters
  – Technical failures
  – Human actions
• Measured for a one-time event:
  – Recovery Time
  – Recovery Point
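Recovery Objectives
• Recovery Point Objective (RPO): data loss, measured back in time from the disaster
  – How much data can you afford to recreate or lose?
• Recovery Time Objective (RTO): downtime, measured forward in time from the disaster
  – How quickly must you recover?
  – What is the cost of downtime?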
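To make the two objectives concrete, here is a minimal Python sketch that measures the achieved recovery point and recovery time for a single incident and compares them with targets. The timestamps, the target values, and the helper name `measure_recovery` are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def measure_recovery(last_backup, disaster, restored):
    """Return (recovery_point, recovery_time) for a single incident.

    recovery_point: data written between the last good backup and the
                    disaster is lost.
    recovery_time:  how long the service stayed down after the disaster.
    """
    if not (last_backup <= disaster <= restored):
        raise ValueError("expected last_backup <= disaster <= restored")
    return disaster - last_backup, restored - disaster


# Hypothetical incident timeline (illustrative values only).
last_backup = datetime(2023, 4, 18, 2, 0)
disaster = datetime(2023, 4, 18, 9, 30)
restored = datetime(2023, 4, 18, 11, 0)

# Hypothetical business targets.
rpo_target = timedelta(hours=8)
rto_target = timedelta(hours=1)

recovery_point, recovery_time = measure_recovery(last_backup, disaster, restored)
print(f"data loss window: {recovery_point}, meets RPO: {recovery_point <= rpo_target}")
print(f"downtime: {recovery_time}, meets RTO: {recovery_time <= rto_target}")
```

In this made-up timeline the data loss window fits the RPO target while the downtime exceeds the RTO target, which shows that the two objectives have to be checked separately.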
Strategies for disaster recovery
• Backup & Restore (active/passive)
  – RPO / RTO: Hours
  – Data backed up
  – No services deployed
  – Cost: $
• Pilot Light (active/passive)
  – RPO / RTO: 10s of minutes
  – Data live
  – Services idle
  – Cost: $$
• Warm standby (active/passive)
  – RPO / RTO: Minutes
  – Data live
  – Services run at reduced capacity
  – Cost: $$$
• Multi-site active/active
  – RPO / RTO: Near real-time
  – Data live
  – Live services
  – Cost: $$$$
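The trade-off between cost and recovery objectives can be sketched as a small lookup: pick the cheapest strategy whose RPO / RTO still fits the target. The slide only gives orders of magnitude, so the minute values below are assumptions standing in for "Hours", "10s of minutes", "Minutes" and "Near real-time"; the names `Strategy` and `cheapest_strategy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_case_minutes: float  # assumed numeric stand-in for the RPO / RTO label
    cost: int                  # number of "$" signs on the slide


# The minute values are assumptions, not figures from the talk.
STRATEGIES = [
    Strategy("Backup & Restore", 24 * 60, 1),
    Strategy("Pilot Light", 60, 2),
    Strategy("Warm standby", 10, 3),
    Strategy("Multi-site active/active", 1, 4),
]


def cheapest_strategy(target_minutes):
    """Return the lowest-cost strategy whose assumed RPO / RTO fits the target."""
    for strategy in sorted(STRATEGIES, key=lambda s: s.cost):
        if strategy.worst_case_minutes <= target_minutes:
            return strategy
    return None  # the target is tighter than any strategy modeled here


for target in (48 * 60, 90, 15, 0.5):
    choice = cheapest_strategy(target)
    print(target, "->", choice.name if choice else "no strategy fits")
```

Sorting by cost first reflects the point of the slide: a tighter objective is always available, but only at a higher running cost.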
Crossplane Disaster Recovery
• Crossplane upgrades and rollbacks
  – New API versions added to CRDs (e.g. 11.0 -> 1.10.2)
  – Issue #3859
  – Provider upgrades and rollbacks
    – CRD ownership
• Configuration Package
  – Provider auto upgrade
• Velero
  – --features=EnableAPIGroupVersions
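As a rough illustration of the Velero flag named on the Crossplane Disaster Recovery slide, the sketch below assembles a `velero install` command line with `--features=EnableAPIGroupVersions`, which tells Velero to back up every served version of each API group rather than only the preferred one. The bucket name, region, plugin image tag, and credentials path are placeholders, not values from the talk, and the helper `velero_install_args` is hypothetical. The command is only built and printed, never executed.

```python
import shlex


def velero_install_args(bucket, region, plugin_image, secret_file):
    """Build the argv for `velero install` with API group versions enabled.

    The command is only assembled here, not executed.
    """
    return [
        "velero", "install",
        "--provider", "aws",
        "--plugins", plugin_image,
        "--bucket", bucket,
        "--backup-location-config", f"region={region}",
        "--secret-file", secret_file,
        # Flag from the slide: back up all served versions of each API group,
        # not only the preferred version.
        "--features=EnableAPIGroupVersions",
    ]


# All values below are placeholders, not taken from the talk.
args = velero_install_args(
    bucket="example-dr-backups",
    region="us-west-2",
    plugin_image="velero/velero-plugin-for-aws:v1.7.0",
    secret_file="./credentials-velero",
)
print(shlex.join(args))
```

Backing up every API version matters for the rollback case on the slide, where the source and target clusters may serve different versions of the same CRD.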
managementPolicy (ObserveOnly)
Disaster Recovery
Scenario 2: Backup Database
[Architecture diagram, AWS Cloud. Labels in east-1: ArgoCD, Crossplane (ETCD, Claim, mutation webhooks), EKS, Amazon RDS. Backup flows: Backup-EKS (backup non-global resources) to S3 (west-2), and Backup-RDS to S3. Labels in west-2: Crossplane (ETCD), EKS, Amazon RDS, with "restore" arrows.]
Summary
• Everything fails all the time
• Shortest path to recovery
• Different failure domains
• Crossplane rollbacks
• Use automatic replication (e.g. S3) for a faster RTO
• Lower cost by recovering from a database backup (higher RTO)
Resources
https://github.com/awslabs/crossplane-on-eks
https://crossplane.io
https://go.aws/3K4ue0W
https://velero.io
Recovery When Using Crossplane for
Infrastructure Provisioning on AWS
EKS Blueprints
https://argoproj.github.io/cd
