Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery - Shivani Gupta, Elotl & Sergey Pronin, Percona
Disaster Recovery (DR) is critical for business continuity in the face of widespread outages that take down entire data centers or cloud provider regions. DR relies on deployment to multiple locations, data replication, failure monitoring, and failover. The process is typically manual, involves several moving parts, and, even in the best case, means some downtime for end users. A multi-cluster K8s control plane presents the opportunity to automate the DR setup as well as failure detection and failover. Such automation can dramatically reduce RTO and improve availability for end users. This talk (and demo) describes one such setup using the open source Percona Operator for PostgreSQL and a multi-cluster K8s orchestrator. The orchestrator uses policy-driven placement to replicate the entire workload across multiple clusters (in different regions), detects failure using pluggable logic, and performs failover by promoting the standby and redirecting application traffic.
4. Agenda
1. Problem space
a. Why Disaster Recovery (DR)
b. PostgreSQL on Kubernetes
c. DR setup in Percona Operator for PostgreSQL
2. Solution
a. Multi-cluster control planes
b. DR orchestration architecture
c. Demo w/ Elotl Nova
5. Why Disaster Recovery
“Disaster Recovery is an organization's plan to protect its IT systems and
data from disasters and recover quickly to minimize downtime and losses.”
1. Business continuity
a. SLA requirements
2. Compliance and standards
11. Why Automation of Disaster Recovery?
Myth: ‘...but DR is rarely ever needed’
• Cloud regions do fail often enough, and for long enough, to disrupt business
• On-prem data centers do fail
When it happens: need close to zero RTO for mission-critical applications
• With manual steps, runbooks often cannot be found or are not up-to-date
• A manual process comes with the risk of human error
Should be regularly tested:
• Important to fire-drill Disaster Recovery as part of the regular QA process (say once a month)
13. Agenda
1. Problem space
a. Why Disaster Recovery (DR)
b. PostgreSQL on Kubernetes
c. DR setup in Percona Operator for PostgreSQL
2. Solution
a. Multi-cluster control planes
b. DR orchestration architecture
c. Demo w/ Elotl Nova and Percona PostgreSQL
14. Multi-Cluster Control Plane aka Multi-cluster Orchestrator
• Deploy workloads to one or more clusters from a central scheduler
• Aggregate view of workload topologies
• Orchestrate actions across workloads
[Diagram: Multi-cluster Control Plane scheduling onto Workload Clusters]
Karmada, Admiralty, and Elotl Nova follow a similar architecture.
17. Spread Specification
“Cloning” a workload (e.g. a ReplicaSet) from the Control Plane cluster to the selected workload clusters
• Mode: Divide - each cluster runs a percentage of the replicas specified in the Control Plane workload
• Mode: Duplicate - each cluster runs the same number of replicas as specified in the Control Plane workload
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: postgres-spread
spec:
  spreadConstraints:
    spreadMode: Duplicate
    topologyKey: kubernetes.io/metadata.name
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: psql-operator
  clusterSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: In
      values:
      - cluster-1
      - cluster-2
18. Components of Disaster Recovery
• Setup the database on multiple K8s clusters (different cloud regions, different clouds, or different data centers)
• Challenge: getting the setup right is error-prone, e.g. keeping the same configuration and the same secrets for the backup repository (S3) or TLS across clusters
• Solution: central scheduler w/ spread scheduling
• Data Replication
• Taken care of by PostgreSQL native methods
• Failure Detection
• Needs to be flexible depending on business requirements
• Failover
• Needs to be flexible based on business requirements, e.g. a simplistic scenario for PostgreSQL is to re-configure the standby database and redirect application traffic
• Failback (optional)
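The standby side of such a setup can be sketched with the operator's CRD. This is a minimal sketch, not the talk's exact manifest: it assumes the v2 Percona Operator with a pgBackRest repository in S3 as the replication channel, and all names (cluster1, repo1, the bucket) are hypothetical placeholders.

```yaml
# Hypothetical sketch: a PerconaPGCluster running as a standby that
# replays WAL from a shared pgBackRest repository in S3.
# Cluster, repo, and bucket names are placeholders.
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster1
spec:
  postgresVersion: 16
  standby:
    enabled: true        # replay from the backup repo instead of serving writes
    repoName: repo1
  backups:
    pgbackrest:
      repos:
      - name: repo1
        s3:
          bucket: my-dr-bucket
          region: us-east-1
          endpoint: s3.us-east-1.amazonaws.com
  instances:
  - name: instance1
    replicas: 1
```

The primary cluster would carry the same manifest without the `standby` block; flipping `standby.enabled` is what promotion amounts to in this scheme.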
19. DR Orchestration Architecture
[Diagram: Nova Control Plane containing a Scheduler, Failure Webhook, and Failover Controller; multiple Workload Clusters, each running a Nova Agent and a Monitoring Tool]
Configurations:
● Register the Nova Webhook as an alert receiver in your monitoring tool.
● Supply a mapping of alert labels to a Docker image w/ failover logic.
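For a Prometheus/Alertmanager-based monitoring stack, the first configuration step could look like the snippet below. This is a hedged sketch: the webhook URL, receiver name, and alert name are hypothetical, not taken from the talk.

```yaml
# Hypothetical Alertmanager snippet: route database-down alerts to the
# Nova failure webhook. URL and alert labels are placeholders.
route:
  receiver: default
  routes:
  - receiver: nova-failure-webhook
    matchers:
    - alertname = PostgresPrimaryDown
receivers:
- name: default
- name: nova-failure-webhook
  webhook_configs:
  - url: http://nova-webhook.nova-system.svc:8080/alert
```

On receiving such an alert, the Failover Controller would look up the alert's labels in the supplied mapping and launch the matching failover image.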
20. Demo Layout: PostgreSQL automated failover to Standby
[Diagram: Workload Cluster 1 (AWS Region 1) runs the Primary; Workload Cluster 2 (AWS Region 2) runs the Standby; Workload Cluster 3 (AWS Region 3) runs HAProxy and the PSQL client; replication flows through an S3 bucket; DB monitoring and the Nova agents report to the Nova Control Plane]
Job for failover:
● Changes the manifest of the cluster-2 postgres to ‘primary’
● Re-configures HAProxy to point to postgres on cluster-2
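The failover Job's two steps can be sketched as a Kubernetes Job manifest. This is a minimal sketch under stated assumptions: it assumes the container has kubeconfig contexts for the workload clusters and RBAC to patch the database CR, and the image, resource names, and file paths are all hypothetical.

```yaml
# Hypothetical sketch of the failover Job: promote the standby on
# cluster-2, then repoint HAProxy on cluster-3. Names are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: failover-to-cluster-2
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: failover
        image: bitnami/kubectl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Step 1: promote - turn off standby mode on the cluster-2 postgres
          kubectl --context cluster-2 -n psql-operator \
            patch perconapgcluster cluster1 --type merge \
            -p '{"spec":{"standby":{"enabled":false}}}'
          # Step 2: redirect traffic - apply an HAProxy config pointing at
          # cluster-2, then restart HAProxy to pick it up
          kubectl --context cluster-3 apply -f /config/haproxy-cluster-2.yaml
          kubectl --context cluster-3 rollout restart deployment/haproxy
```

In the demo this logic is packaged as the Docker image that the Failover Controller launches when the alert fires.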
22. Takeaways
• To survive widespread outages, your database requires deployment to multiple clusters in different regions.
• Use of K8s, along with operators, makes DR setup easier and opens up opportunities for automation, in turn enabling better RTO.
• Automation of recovery can be done in a simple, low-friction way using a multi-cluster control plane such as Nova.
23. Future Work
• CRD-based definition for failure detection and failover
• Eliminate out-of-band configuration and specify everything by deploying a manifest
• High Availability of the Nova Control Plane
• Provide an option to install Nova in active-active HA mode
24. Resources
• Learn more about Percona operators: https://per.co.na/operators
• Learn more about Elotl Nova: https://www.elotl.co/nova.html
• Free trial of Elotl Nova: https://www.elotl.co/free-trial.html
• Nova HADR beta coming soon!