Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery - Shivani Gupta, Elotl & Sergey Pronin, Percona
Disaster Recovery (DR) is critical for business continuity in the face of widespread outages that take down entire data centers or cloud provider regions. DR relies on deployment to multiple locations, data replication, failure monitoring, and failover. The process is typically manual, involves several moving parts, and, even in the best case, means some downtime for end users. A multi-cluster K8s control plane presents the opportunity to automate the DR setup as well as failure detection and failover. Such automation can dramatically reduce RTO and improve availability for end users. This talk (and demo) describes one such setup using the open source Percona Operator for PostgreSQL and a multi-cluster K8s orchestrator. The orchestrator uses policy-driven placement to replicate the entire workload across multiple clusters (in different regions), detects failure using pluggable logic, and performs failover by promoting the standby and redirecting application traffic.
4. Agenda
1. Problem space
a. Why Disaster Recovery (DR)
b. PostgreSQL on Kubernetes
c. DR setup in Percona Operator for PostgreSQL
2. Solution
a. Multi-cluster control planes
b. DR orchestration architecture
c. Demo w/ Elotl Nova
5. Why Disaster Recovery
“Disaster Recovery is an organization's plan to protect its IT systems and
data from disasters and recover quickly to minimize downtime and losses.”
1. Business continuity
a. SLA requirements
2. Compliance and standards
11. Why Automation of Disaster Recovery?
Myth: ‘...but DR is rarely ever needed’
• Cloud regions do fail often enough, and for long enough, to disrupt business
• On-prem data centers do fail
When it happens: need close to zero RTO for mission-critical applications
• With manual steps, runbooks often cannot be found or are not up-to-date
• A manual process comes with the risk of human error
Should be regularly tested:
• Important to fire-drill Disaster Recovery as part of the regular QA process (say once a month)
13. Agenda
1. Problem space
a. Why Disaster Recovery (DR)
b. PostgreSQL on Kubernetes
c. DR setup in Percona Operator for PostgreSQL
2. Solution
a. Multi-cluster control planes
b. DR orchestration architecture
c. Demo w/ Elotl Nova and Percona PostgreSQL
14. Multi-Cluster Control Plane aka Multi-cluster Orchestrator
• Deploy workloads to one or more clusters from a central scheduler
• Aggregate view of workload topologies
• Orchestrate actions across workloads
[Diagram: Multi-cluster Control Plane scheduling onto Workload Clusters]
Karmada, Admiralty, and Elotl Nova follow a similar architecture.
17. Spread Specification
“Cloning” a workload (e.g. a ReplicaSet) from the Control Plane cluster to the selected workload clusters
• Mode: Divide - each cluster runs a percentage of the replicas specified in the Control Plane workload
• Mode: Duplicate - each cluster runs the same number of replicas as specified in the Control Plane workload
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: postgres-spread
spec:
  spreadConstraints:
    spreadMode: Duplicate
    topologyKey: kubernetes.io/metadata.name
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: psql-operator
  clusterSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: In
      values:
      - cluster-1
      - cluster-2
18. Components of Disaster Recovery
• Setup the database on multiple K8s clusters (different cloud regions, different clouds, or different data centers)
• Challenge: getting the setup right is error-prone, e.g. keeping the same configuration and the same secrets for the backup repository (S3) or TLS across clusters
• Solution: central scheduler w/ spread scheduling
• Data Replication
• Taken care of by PostgreSQL native methods
• Failure Detection
• Needs to be flexible depending on business requirements
• Failover
• Needs to be flexible based on business requirements, e.g. a simplistic scenario for PostgreSQL is to re-configure the standby database and redirect application traffic
• Failback (optional)
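The standby side of such a setup can be sketched with the operator's CRD. This is a minimal sketch, not the talk's exact manifest: it assumes the v2 Percona Operator with a pgBackRest repository in S3 as the replication channel, and all names (cluster1, repo1, the bucket) are hypothetical placeholders.

```yaml
# Hypothetical sketch: a PerconaPGCluster running as a standby that
# replays WAL from a shared pgBackRest repository in S3.
# Cluster, repo, and bucket names are placeholders.
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster1
spec:
  postgresVersion: 16
  standby:
    enabled: true        # replay from the backup repo instead of serving writes
    repoName: repo1
  backups:
    pgbackrest:
      repos:
      - name: repo1
        s3:
          bucket: my-dr-bucket
          region: us-east-1
          endpoint: s3.us-east-1.amazonaws.com
  instances:
  - name: instance1
    replicas: 1
```

The primary cluster would carry the same manifest without the `standby` block; flipping `standby.enabled` is what promotion amounts to in this scheme.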
19. DR Orchestration Architecture
[Diagram: Nova Control Plane containing a Scheduler, Failure Webhook, and Failover Controller; multiple Workload Clusters, each running a Nova Agent and a Monitoring Tool]
Configurations:
● Register the Nova Webhook as an alert receiver in your monitoring tool.
● Supply a mapping of alert labels to a Docker image w/ failover logic.
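For a Prometheus/Alertmanager-based monitoring stack, the first configuration step could look like the snippet below. This is a hedged sketch: the webhook URL, receiver name, and alert name are hypothetical, not taken from the talk.

```yaml
# Hypothetical Alertmanager snippet: route database-down alerts to the
# Nova failure webhook. URL and alert labels are placeholders.
route:
  receiver: default
  routes:
  - receiver: nova-failure-webhook
    matchers:
    - alertname = PostgresPrimaryDown
receivers:
- name: default
- name: nova-failure-webhook
  webhook_configs:
  - url: http://nova-webhook.nova-system.svc:8080/alert
```

On receiving such an alert, the Failover Controller would look up the alert's labels in the supplied mapping and launch the matching failover image.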
20. Demo Layout: PostgreSQL automated failover to Standby
[Diagram: Workload Cluster 1 (AWS Region 1) runs the Primary; Workload Cluster 2 (AWS Region 2) runs the Standby; Workload Cluster 3 (AWS Region 3) runs HAProxy and the PSQL client; replication flows through an S3 bucket; DB monitoring and the Nova agents report to the Nova Control Plane]
Job for failover:
● Changes the manifest of the cluster-2 postgres to ‘primary’
● Re-configures HAProxy to point to postgres on cluster-2
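The failover Job's two steps can be sketched as a Kubernetes Job manifest. This is a minimal sketch under stated assumptions: it assumes the container has kubeconfig contexts for the workload clusters and RBAC to patch the database CR, and the image, resource names, and file paths are all hypothetical.

```yaml
# Hypothetical sketch of the failover Job: promote the standby on
# cluster-2, then repoint HAProxy on cluster-3. Names are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: failover-to-cluster-2
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: failover
        image: bitnami/kubectl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Step 1: promote - turn off standby mode on the cluster-2 postgres
          kubectl --context cluster-2 -n psql-operator \
            patch perconapgcluster cluster1 --type merge \
            -p '{"spec":{"standby":{"enabled":false}}}'
          # Step 2: redirect traffic - apply an HAProxy config pointing at
          # cluster-2, then restart HAProxy to pick it up
          kubectl --context cluster-3 apply -f /config/haproxy-cluster-2.yaml
          kubectl --context cluster-3 rollout restart deployment/haproxy
```

In the demo this logic is packaged as the Docker image that the Failover Controller launches when the alert fires.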
22. Takeaways
• To survive widespread outages, your database requires deployment to multiple clusters in different regions.
• Use of K8s, along with operators, makes DR setup easier and opens up opportunities for automation, in turn enabling better RTO.
• Automation of recovery can be done in a simple, low-friction way using a multi-cluster control plane such as Nova.
23. Future Work
• CRD-based definition for failure detection and failover
• Eliminate out-of-band configuration and specify everything by deploying a manifest
• High Availability of the Nova Control Plane
• Provide an option to install Nova in active-active HA mode
24. Resources
• Learn more about Percona operators: https://per.co.na/operators
• Learn more about Elotl Nova: https://www.elotl.co/nova.html
• Free trial of Elotl Nova: https://www.elotl.co/free-trial.html
• Nova HADR beta coming soon!