Stop Worrying and Keep Querying Using
Automated Multi-Region Disaster Recovery
Sergey Pronin sergey.pronin@percona.com
Shivani Gupta shivani@elotl.co
Jan Baraniewski jan@elotl.co
Agenda
1. Problem space
2. Solution
Problem space
Agenda
1. Problem space
   a. Why Disaster Recovery (DR)
   b. PostgreSQL on Kubernetes
   c. DR setup in Percona Operator for PostgreSQL
2. Solution
   a. Multi-cluster control planes
   b. DR orchestration architecture
   c. Demo w/ Elotl Nova
Why Disaster Recovery
“Disaster Recovery is an organization's plan to protect its IT systems and data from disasters and recover quickly to minimize downtime and losses.”
1. Business continuity
   a. SLA requirements
2. Compliance and standards
Disaster recovery
PostgreSQL on Kubernetes with Operators
● Operator controls database and k8s primitives
● Day-1 simplified to one step
● Day-2 operations automated
DR through Backups
DR through Replication
Automated failover - problem space
Why Automation of Disaster Recovery?
Myth: ‘...but DR is rarely ever needed’
• Cloud regions do fail often enough, and for long enough, to disrupt business
• On-prem data centers do fail
When it happens: mission-critical applications need a close-to-zero recovery time objective (RTO)
• With manual steps, runbooks often cannot be found or are out of date
• Manual processes carry the risk of human error
Should be regularly tested:
• Fire-drill Disaster Recovery regularly as part of the QA process (say, once a month)
Solution
Agenda
1. Problem space
   a. Why Disaster Recovery (DR)
   b. PostgreSQL on Kubernetes
   c. DR setup in Percona Operator for PostgreSQL
2. Solution
   a. Multi-cluster control planes
   b. DR orchestration architecture
   c. Demo w/ Elotl Nova and Percona PostgreSQL
Multi-Cluster Control Plane aka Multi-cluster Orchestrator
• Deploy workloads to one or more clusters from a central scheduler
• Aggregate view of workload topologies
• Orchestrate actions across workloads
[Diagram labels: Multi-cluster Control Plane, Workload Clusters]
Karmada, Admiralty, and Elotl Nova follow a similar architecture.
Policy Based Scheduling
Decouple placement from workload definition
[Diagram labels: App Manifest, Schedule Policy, Multi-cluster Control Plane, Workload Clusters]
Schedule Policy Custom Resource
• Namespace selector
• Resource selector
• Cluster selector
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: microsvc-demo
  clusterSelector:
    matchLabels:
      nova.elotl.co/cluster.region: "us-east-1"
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          microServicesDemo: "yes"
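As a rough illustration of how the label selectors above behave, the sketch below applies Kubernetes-style matchLabels semantics (every listed key must be present with an equal value) to pick clusters. The cluster inventory and the helper names are hypothetical and are not part of Nova's API.

```python
def matches_labels(match_labels, labels):
    """True when every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def select_clusters(cluster_selector, clusters):
    """Return the names of clusters whose labels satisfy the selector."""
    wanted = cluster_selector.get("matchLabels", {})
    return [name for name, labels in clusters.items()
            if matches_labels(wanted, labels)]


# Hypothetical inventory of workload clusters and their labels.
CLUSTERS = {
    "cluster-1": {"nova.elotl.co/cluster.region": "us-east-1"},
    "cluster-2": {"nova.elotl.co/cluster.region": "us-west-2"},
}

selector = {"matchLabels": {"nova.elotl.co/cluster.region": "us-east-1"}}
print(select_clusters(selector, CLUSTERS))
```

An empty matchLabels map matches every cluster, which mirrors how an empty label selector behaves in Kubernetes.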
Spread Specification
“Cloning” a workload (e.g. a ReplicaSet) from the Control Plane cluster to the selected workload clusters
• Mode: Divide - each cluster runs a percentage of the replicas specified in the Control Plane workload
• Mode: Duplicate - each cluster runs the same number of replicas as specified in the Control Plane workload
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: postgres-spread
spec:
  spreadConstraints:
    spreadMode: Duplicate
    topologyKey: kubernetes.io/metadata.name
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: psql-operator
  clusterSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
          - cluster-1
          - cluster-2
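The difference between the two spread modes can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the optional weights, and the rule for handing out leftover replicas in Divide mode are hypothetical and do not describe how Nova actually computes shares.

```python
def spread_replicas(total_replicas, clusters, mode, weights=None):
    """Return {cluster: replicas} for a hypothetical spread policy."""
    if mode == "Duplicate":
        # Every selected cluster gets the full replica count.
        return {cluster: total_replicas for cluster in clusters}
    if mode == "Divide":
        # Assumption: equal weights unless the caller supplies others.
        weights = weights or {cluster: 1 for cluster in clusters}
        total_weight = sum(weights[cluster] for cluster in clusters)
        shares = {cluster: total_replicas * weights[cluster] // total_weight
                  for cluster in clusters}
        # Assumption: replicas lost to rounding go to the first clusters listed.
        remainder = total_replicas - sum(shares.values())
        for cluster in clusters[:remainder]:
            shares[cluster] += 1
        return shares
    raise ValueError(f"unknown spread mode: {mode}")


clusters = ["cluster-1", "cluster-2"]
print(spread_replicas(3, clusters, "Duplicate"))
print(spread_replicas(5, clusters, "Divide"))
```

For a database with one primary and a standby, Duplicate is the natural fit, since each cluster needs its own complete copy of the operator and its resources.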
Components of Disaster Recovery
• Set up the database on multiple K8s clusters (different cloud regions, different clouds, or different data centers)
  • Challenge: getting the setup right is error-prone, e.g. keeping the same configuration and the same secrets (backup repository (S3) secrets, TLS secrets) on every cluster
  • Solution: central scheduler w/ spread scheduling
• Data replication
  • Taken care of by PostgreSQL native methods
• Failure detection
  • Needs to be flexible depending on business requirements
• Failover
  • Needs to be flexible based on business requirements. E.g. a simplistic scenario for PostgreSQL is to reconfigure the standby database and redirect application traffic.
• Failback (optional)
DR Orchestration Architecture
[Diagram labels: Nova Control Plane (Scheduler, Failure Webhook, Failover Controller); two Workload Clusters, each with a Nova Agent; Monitoring Tool]
Configurations:
● Register the Nova Webhook as an alert receiver in your monitoring tool.
● Supply a mapping of alert labels to a Docker image w/ failover logic.
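The second configuration item, a mapping from alert labels to a failover image, might look roughly like the following lookup. The rule format, label names, and image reference are all hypothetical; the slides do not show the actual configuration syntax.

```python
def select_failover_image(alert_labels, rules):
    """Return the image of the first rule whose labels all match the alert."""
    for rule in rules:
        if all(alert_labels.get(key) == value
               for key, value in rule["match"].items()):
            return rule["image"]
    return None  # No rule matched, so no failover job is selected.


# Hypothetical mapping of alert labels to an image containing failover logic.
RULES = [
    {
        "match": {"alertname": "PostgresPrimaryDown", "cluster": "cluster-1"},
        "image": "registry.example.com/pg-failover:demo",
    },
]

alert = {"alertname": "PostgresPrimaryDown", "cluster": "cluster-1",
         "severity": "critical"}
print(select_failover_image(alert, RULES))
```

Keeping the mapping as data rather than code is what lets failure detection and failover stay flexible per business requirement: a new scenario is a new rule plus a new image.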
Demo Layout: PostgreSQL automated failover to Standby
[Diagram labels: Nova Control Plane; S3 Bucket; Workload Cluster 1 (Primary); Workload Cluster 2 (Standby); Workload Cluster 3 (HAProxy, PSQL Client); AWS Region 1, AWS Region 2, AWS Region 3]
Job for failover:
● Changes manifest of cluster-2 postgres to ‘primary’
● Re-configures HAProxy to point to postgres on cluster-2
[Diagram legend: DB Monitoring; Nova agent to CP]
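The two steps of the failover job can be sketched as plain data transformations. This is a minimal sketch, not the demo's actual job: the `spec.standby.enabled` field is an assumption about the database manifest layout, and the manifest name, host name, and backend name are hypothetical.

```python
import copy


def promote_standby(manifest):
    """Return a copy of a cluster manifest with standby mode switched off.

    The spec.standby.enabled field is an assumption about the manifest layout.
    """
    promoted = copy.deepcopy(manifest)
    promoted.setdefault("spec", {}).setdefault("standby", {})["enabled"] = False
    return promoted


def haproxy_backend(name, host, port=5432):
    """Render a minimal HAProxy TCP backend pointing at one PostgreSQL host."""
    return "\n".join([
        f"backend {name}",
        "    mode tcp",
        f"    server primary {host}:{port} check",
    ])


# Hypothetical standby manifest for the database on cluster-2.
standby = {"metadata": {"name": "pg-cluster-2"},
           "spec": {"standby": {"enabled": True}}}

primary = promote_standby(standby)
print(primary["spec"]["standby"]["enabled"])
print(haproxy_backend("postgres", "pg.cluster-2.example.internal"))
```

A real job would also have to apply the changed manifest to the cluster and reload HAProxy; both are left out here because the slides do not describe how the demo does it.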
Demo
Takeaways
• To survive widespread outages, your database requires deployment to multiple clusters in different regions.
• Use of K8s, along with operators, makes DR setup easier and opens up opportunities for automation, in turn enabling better RTO.
• Automation of recovery can be done in a simple, low-friction way using a multi-cluster control plane such as Nova.
Future Work
• CRD-based definition for failure detection and failover
  • Eliminate out-of-band configuration and specify everything by deploying a manifest
• High Availability of the Nova Control Plane
  • Provide option to install Nova in active-active HA mode
Resources
• Learn more about Percona operators: https://per.co.na/operators
• Learn more about Elotl Nova: https://www.elotl.co/nova.html
• Free trial of Elotl Nova: https://www.elotl.co/free-trial.html
• Nova HADR beta coming soon!
Thank you!
Please feel free to provide feedback using this QR code.

Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery

  • 1. Stop Worrying and Keep Querying Using Automated Multi-Region Disaster Recovery Sergey Pronin sergey.pronin@percona.com Shivani Gupta shivani@elotl.co Jan Baraniewski jan@elotl.co
  • 4. Agenda
      1. Problem space
         a. Why Disaster Recovery (DR)
         b. PostgreSQL on Kubernetes
         c. DR setup in Percona Operator for PostgreSQL
      2. Solution
         a. Multi-cluster control planes
         b. DR orchestration architecture
         c. Demo w/ Elotl Nova
  • 5. Why Disaster Recovery
      “Disaster Recovery is an organization's plan to protect its IT systems and data from disasters and recover quickly to minimize downtime and losses.”
      1. Business continuity
         a. SLA requirements
      2. Compliance and standards
  • 7. PostgreSQL on Kubernetes with Operators
      ● Operator controls database and k8s primitives
      ● Day-1 simplified to one step
      ● Day-2 operations automated
  • 10. Automated failover - problem space
  • 11. Why Automation of Disaster Recovery?
      Myth: ‘...but DR is rarely ever needed’
        • Cloud Regions do fail often enough and for long enough to disrupt business
        • On-prem data centers do fail
      When it happens: Need close to zero RTO for mission critical applications
        • With manual steps, runbooks often cannot be found or are not up-to-date
        • Manual process comes with risk of human error
      Should be regularly tested:
        • Important to regularly fire-drill Disaster Recovery as part of regular QA process (say once a month)
  • 13. Agenda
      1. Problem space
         a. Why Disaster Recovery (DR)
         b. PostgreSQL on Kubernetes
         c. DR setup in Percona Operator for PostgreSQL
      2. Solution
         a. Multi-cluster control planes
         b. DR orchestration architecture
         c. Demo w/ Elotl Nova and Percona PostgreSQL
  • 14. Multi-Cluster Control Plane aka Multi-cluster Orchestrator
      • Deploy workloads to one or more clusters from a central scheduler
      • Aggregate view of workload topologies
      • Orchestrate actions across workloads
      [Diagram: Multi-cluster Control Plane, Workload Clusters]
      Karmada, Admiralty, and Elotl Nova follow a similar architecture.
  • 15. Policy Based Scheduling
      Decouple placement from workload definition
      [Diagram: App Manifest, Schedule Policy, Multi-cluster Control Plane, Workload Clusters]
  • 16. Schedule Policy Custom Resource
      • Namespace selector
      • Resource selector
      • Cluster selector

      spec:
        namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: microsvc-demo
        clusterSelector:
          matchLabels:
            nova.elotl.co/cluster.region: "us-east-1"
        resourceSelectors:
          labelSelectors:
            - matchLabels:
                microServicesDemo: "yes"
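
The selectors on slide 16 use the familiar Kubernetes label-selector shape (`matchLabels`, and on the next slide `matchExpressions`). As a rough illustration of what such a selector means, the sketch below evaluates one against plain label dictionaries. The function names and the cluster inventory are hypothetical and this is not Nova's implementation; it only mirrors standard Kubernetes selector semantics.

```python
from typing import Any, List, Mapping

_KNOWN_OPERATORS = ("In", "NotIn", "Exists", "DoesNotExist")


def matches_selector(labels: Mapping[str, str], selector: Mapping[str, Any]) -> bool:
    """Return True when `labels` satisfy a Kubernetes-style label selector."""
    # matchLabels: every listed key must be present with exactly that value.
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    # matchExpressions: set-based requirements; all of them must hold.
    for expr in selector.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op not in _KNOWN_OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
        if op == "In" and labels.get(key) not in values:
            return False
        if op == "NotIn" and labels.get(key) in values:
            return False
        if op == "Exists" and key not in labels:
            return False
        if op == "DoesNotExist" and key in labels:
            return False
    return True


def select_clusters(
    clusters: Mapping[str, Mapping[str, str]], selector: Mapping[str, Any]
) -> List[str]:
    """Names of the clusters (hypothetical inventory) whose labels match."""
    return [name for name, labels in clusters.items() if matches_selector(labels, selector)]
```

With an inventory such as `{"cluster-1": {"nova.elotl.co/cluster.region": "us-east-1"}}`, the slide's `clusterSelector` would pick `cluster-1` and skip clusters labelled with any other region.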
  • 17. Spread Specification
      “Cloning” a workload (e.g. ReplicaSet) from the Control Plane cluster to the selected workload clusters
      • Mode: Divide - each cluster runs a % of the replicas specified in the Control Plane workload
      • Mode: Duplicate - each cluster runs the same number of replicas as specified in the Control Plane workload

      apiVersion: policy.elotl.co/v1alpha1
      kind: SchedulePolicy
      metadata:
        name: postgres-spread
      spec:
        spreadConstraints:
          spreadMode: Duplicate
          topologyKey: kubernetes.io/metadata.name
        namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: psql-operator
        clusterSelector:
          matchExpressions:
            - key: kubernetes.io/metadata.name
              operator: In
              values:
                - cluster-1
                - cluster-2
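
To make the difference between the two spread modes on slide 17 concrete, here is a minimal sketch of the replica arithmetic. The function name, the integer-percentage input, and the rounding rule (largest remainder) are assumptions made for illustration; the slides do not say how Nova rounds fractional shares.

```python
from typing import Dict, Optional, Sequence


def spread_replicas(
    replicas: int,
    clusters: Sequence[str],
    mode: str,
    percentages: Optional[Dict[str, int]] = None,
) -> Dict[str, int]:
    """Replica count per cluster for a control-plane workload (illustrative only)."""
    if mode == "Duplicate":
        # Every selected cluster runs the full replica count.
        return {cluster: replicas for cluster in clusters}
    if mode != "Divide":
        raise ValueError(f"unknown spread mode: {mode}")
    if percentages is None or sum(percentages[c] for c in clusters) != 100:
        raise ValueError("Divide needs integer percentages that sum to 100")

    # Integer arithmetic only: floor each share, remember the remainder.
    floors = {c: replicas * percentages[c] // 100 for c in clusters}
    remainders = {c: replicas * percentages[c] % 100 for c in clusters}
    leftover = replicas - sum(floors.values())
    # Assumed rounding rule: hand leftover replicas to the largest remainders.
    by_remainder = sorted(clusters, key=lambda c: remainders[c], reverse=True)
    for cluster in by_remainder[:leftover]:
        floors[cluster] += 1
    return floors
```

For a database DR setup like the one on the slide, `Duplicate` is the natural choice: each region needs a complete cluster definition, not a fraction of one.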
  • 18. Components of Disaster Recovery
      • Set up the database on multiple K8s clusters (different cloud regions, different clouds, or different data centers)
        • Challenge: getting the setup right is error-prone, e.g. same configuration, same secrets for the backup repository (S3), same TLS secrets
        • Solution: central scheduler w/ spread scheduling
      • Data Replication
        • Taken care of by PostgreSQL native methods
      • Failure Detection
        • Needs to be flexible depending on business requirements
      • Failover
        • Needs to be flexible based on business requirements. E.g. a simplistic scenario for PostgreSQL is to reconfigure the standby database and redirect application traffic.
      • Failback (optional)
  • 19. DR Orchestration Architecture
      [Diagram: Nova Control Plane (Scheduler, Failure Webhook, Failover Controller); Monitoring Tool; two Workload Clusters, each with a Nova Agent]
      Configurations:
      ● Register Nova Webhook as an alert receiver in your monitoring tool.
      ● Supply a mapping of alert labels to Docker image w/ failover logic.
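
The second configuration item on slide 19, a mapping from alert labels to an image that carries the failover logic, can be pictured as a small lookup. The sketch below is a guess at the idea, not Nova's webhook: the `FailoverRule` type, the payload shape (a list of alerts with `status` and `labels`), and the image names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Mapping, Optional, Sequence


@dataclass
class FailoverRule:
    """Hypothetical rule: alerts carrying these labels trigger this image."""
    match_labels: Dict[str, str]
    image: str


def select_failover_image(
    alert_labels: Mapping[str, str], rules: Sequence[FailoverRule]
) -> Optional[str]:
    """Return the image of the first rule whose labels all appear on the alert."""
    for rule in rules:
        if all(alert_labels.get(k) == v for k, v in rule.match_labels.items()):
            return rule.image
    return None


def images_to_run(payload: Mapping[str, Any], rules: Sequence[FailoverRule]) -> List[str]:
    """Collect failover images for every firing alert in an assumed payload shape."""
    images: List[str] = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts should not trigger a failover
        image = select_failover_image(alert.get("labels", {}), rules)
        if image is not None and image not in images:
            images.append(image)
    return images
```

Keeping the mapping as data rather than code matches the slide's point that detection and failover need to stay flexible per business requirement: a new scenario means a new rule and image, not a new controller.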
  • 20. Demo Layout: PostgreSQL automated failover to Standby
      [Diagram: Nova Control Plane; S3 Bucket; Workload Cluster 1 (Primary); Workload Cluster 2 (StandBy); Workload Cluster 3 (HAProxy, PSQL Client); AWS Region 1, AWS Region 2, AWS Region 3; DB Monitoring; Nova agent to CP]
      Job for failover:
      ● Changes manifest of cluster-2 postgres to ‘primary’
      ● Re-configures HAProxy to point to postgres on cluster-2
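
The two steps of the failover job on slide 20 can be sketched as pure functions over a manifest and a proxy config. Everything named here is an assumption: the `spec.standby.enabled` field stands in for however the cluster manifest marks a standby, and the hostnames and backend name are made up. The demo's actual job is not shown in the slides.

```python
import copy
from typing import Any, Dict


def promote_standby(cluster_manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the manifest with standby mode switched off.

    Assumes the standby flag lives at spec.standby.enabled (hypothetical here).
    """
    promoted = copy.deepcopy(cluster_manifest)  # never mutate the caller's object
    promoted.setdefault("spec", {}).setdefault("standby", {})["enabled"] = False
    return promoted


def haproxy_backend(name: str, host: str, port: int = 5432) -> str:
    """Render a minimal TCP backend stanza pointing at the new primary."""
    return "\n".join(
        [
            f"backend {name}",
            "    mode tcp",
            f"    server primary {host}:{port} check",
        ]
    )


def failover_plan(standby_manifest: Dict[str, Any], new_primary_host: str) -> Dict[str, Any]:
    """Bundle both steps: the promoted manifest and the proxy config to apply."""
    return {
        "manifest": promote_standby(standby_manifest),
        "haproxy": haproxy_backend("postgres", new_primary_host),
    }
```

Applying the results (patching the manifest on cluster-2 and reloading HAProxy) is left out on purpose, since those steps depend on the tooling inside the failover image.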
  • 21. Demo
  • 22. Takeaways
      • To survive widespread outages, your database requires deployment to multiple clusters in different regions.
      • Use of K8s, along with operators, makes DR setup easier and opens up opportunities for automation, in turn enabling better RTO.
      • Automation of recovery can be done in a simple, low-friction way using a multi-cluster control plane such as Nova.
  • 23. Future Work
      • CRD based definition for failure detection and failover
        • Eliminate out-of-band configuration and specify everything by deploying a manifest
      • High Availability of the Nova Control Plane
        • Provide option to install Nova in active-active HA mode
  • 24. Resources
      • Learn more about Percona operators: https://per.co.na/operators
      • Learn more about Elotl Nova: https://www.elotl.co/nova.html
      • Free trial of Elotl Nova: https://www.elotl.co/free-trial.html
      • Nova HADR beta coming soon!
  • 25. Thank you! Please feel free to provide feedback using this QR code.