Gaurav Midha
He is a Principal Architect working in Red Hat Hybrid Platforms. He creates Solution Patterns on OpenShift Platform Plus. He has over 13 years of experience in the IT industry, with a major focus on automation and deployments in the public cloud. He has designed, architected, and automated different CI/CD pipelines using the latest DevOps automation tools.
Disaster Recovery for OpenShift workloads
Gaurav Midha
Principal Architect
Resilience Solutions
Resiliency Solutions for different service level objectives
[Chart: resiliency solutions positioned by RPO (tolerance for data loss: minutes, hours, days) and RTO (tolerance for app downtime: minutes, an hour, a day) – Regional DR (asynchronous replication) and Metro DR (synchronous mirroring) for site disasters and HW/system failures, and snapshot-based Backup for logical failures.]
▸ Comprehensive protection solutions against a wide spectrum of failures
▸ Beyond data protection --> full application protection
▸ Resiliency built into the platform – available to all stateful and stateless applications on OpenShift
▸ The OCP + ODF + ACM integrated stack lends itself to automated and simplified application-granular protection
HA-DR Solutions Overview
Application HA
Multi-Zone spanning Cluster for local HA
▸ HA for stateful applications deployed on a cluster that is stretched across Availability Zones within a region
▸ The installer ensures that resources are deployed across all AZs, making the cluster resilient against failure of any single AZ (a hedged install-config sketch follows below)
▸ ODF provides synchronous, consistent copies in all AZs, ensuring no data loss during zone failures
▸ Suitable for public cloud platforms with regions supporting 3 or more AZs
・ Can be deployed on-prem when AZs are connected by networks with <10ms latency
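To make the multi-AZ layout concrete, a minimal install-config.yaml sketch for an AWS deployment is shown below; the domain, cluster name, region, and zone names are placeholders, not values from this deck.

```yaml
# Hedged sketch: spreading control-plane and worker machines across three AZs at install time.
# baseDomain, cluster name, region, and zone names are placeholders.
apiVersion: v1
baseDomain: example.com
metadata:
  name: multi-az-cluster
controlPlane:
  name: master
  replicas: 3
  platform:
    aws:
      zones: [us-east-1a, us-east-1b, us-east-1c]
compute:
- name: worker
  replicas: 3
  platform:
    aws:
      zones: [us-east-1a, us-east-1b, us-east-1c]
platform:
  aws:
    region: us-east-1
```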
[Diagram: a single OCP + ODF cluster fronted by HAProxy/LB, with master nodes, worker nodes, and PVs placed in Availability Zones 1, 2, and 3.]
Regional-DR with Failover Automation
Protection against geographic-scale disasters
▸ Asynchronous Volume Replication => low RPO
• ODF enables cross-cluster replication of data volumes with replication intervals as low as 1 min (a hedged DRPolicy sketch follows below)
• ODF storage operators synchronize both app data PVs and cluster metadata
▸ Automated Failover Management => low RTO
• The ACM multi-cluster manager enables failover and failback automation at application granularity
▸ Both clusters remain active, with apps distributed and protected among them
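A minimal DRPolicy sketch illustrates the 1-minute replication interval; the policy and cluster names below are placeholders, and exact fields may vary by ODF release.

```yaml
# Hedged sketch: a Regional-DR policy with a 1-minute asynchronous replication schedule.
# Cluster and policy names are placeholders.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: odr-policy-1m
spec:
  drClusters:
  - ocp-cluster-1
  - ocp-cluster-2
  schedulingInterval: 1m   # asynchronous volume replication interval
```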
[Diagram: OCP Cluster 1 (active, Region 1) and OCP Cluster 2 (passive, Region 2) behind a GTM; application resources and PVs are replicated asynchronously with ODF, and failover is automated with ACM; RPO – minutes, RTO – minutes.]
Metro-DR – Multi Cluster
No-data-loss data mirroring across multiple OCP clusters
• Multiple OCP clusters deployed in different AZs provide a completely fault-isolated configuration
• An external RHCS storage cluster provides persistent, synchronously mirrored volumes across multiple OCP clusters, enabling zero RPO
• ACM-managed automated application failover across clusters reduces RTO
• Requires an Arbiter node in a third site for the storage cluster
• The Arbiter node can be deployed over higher-latency networks provided by public clouds
[Diagram: OCP Cluster 1 (Data Center 1) and OCP Cluster 2 (Data Center 2) consume synchronously mirrored PVs from an external ODF/RHCS cluster with an Arbiter node in a neutral zone; ACM automates application failover behind a GTM; RPO – 0, RTO – minutes.]
Comprehensive & Flexible Data Protection for the desired SLO (RPO+RTO)
Multi-tier protection against various failure conditions:
• Cluster hardening with a Multi-Zone spanning OCP cluster
+
• Async replication for HW and data-center failures
+
• Backup restores from snapshots for software & logical failures
[Diagram: a multi-AZ OCP + ODF cluster in Region 1 asynchronously replicating namespace resources and PVs with ODF to a passive namespace on an OCP cluster in Region 2, with failover automated by ACM, plus snapshot backups and restores to S3 storage.]
Planning for OpenShift DR
Start with HA configuration and add site DR
Cluster HA
・ Topology: single OCP + ODF cluster deployed over multiple AZs in a single region
・ RTO (downtime): RTO = 0 (continuous)*
・ RPO (data loss exposure): RPO = 0 – no data loss due to synchronous mirroring of ODF data
・ Infra requirements: multi-AZ supported public clouds (vSphere support in OCP 4.10)
Regional-DR
・ Topology: multiple OCP + ODF clusters spread over multiple regions
・ RTO (downtime): RTO = minutes – DR automation from ACM + ODF reduces RTO
・ RPO (data loss exposure): RPO > 0, usually 5 min or higher; depends on network bandwidth and change rate
・ Infra requirements: all ODF supported platforms; no network latency limits
Metro-DR
・ Topology: multiple OCP clusters + a single external ODF stretched cluster deployed over low-latency networks
・ RTO (downtime): RTO = minutes – DR automation from ACM + ODF reduces RTO
・ RPO (data loss exposure): RPO = 0 – no data loss due to synchronous mirroring of ODF data
・ Infra requirements: on-prem only (vSphere, bare metal); <10ms network latency between sites
[Diagrams: Cluster HA (a single multi-AZ OCP + ODF cluster), Regional-DR (two OCP clusters with asynchronous volume replication), and Metro-DR (two OCP clusters sharing an external ODF cluster with synchronously mirrored PVs).]
* Subject to RWO PV limitations
Choosing the right DR solution
▸ Metro-DR is the primary choice for a no-data-loss (RPO=0) solution, but it still has these limitations:
・ ODF external mode only – you need to be comfortable with RHCS administration of storage
・ On-prem only – no public clouds
・ Requires an Arbiter on a third site, which can be on a public cloud (<100ms RTT latency)
・ Failover granularity for MDR is per OCP cluster, unlike the per-application granularity available in RDR
▸ Potential for automatic failover (RTO=0) – roadmap
▸ Metro-DR avoids stretching an OCP cluster across sites.
・ Stretching OCP clusters leads to cluster instability and complicates day-2 ops.
Guidance on DR choices
▸ Backup vs. DR solutions
▸ Both OADP and DR solutions provide site DR. Key difference: cost and protection levels
・ The OADP (backup) solution is cheaper, as there is no standby site/cluster requirement; OADP can function with cheaper remote object storage (a hedged Backup sketch follows below).
・ It offers a lower protection level, as data recovery starts only after the failure.
▸ Users should focus on application protection, not cluster protection
・ All of our backup and DR solutions target (stateful) application protection, not necessarily cluster protection
・ OpenShift makes cluster redeployment easy
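For context on the OADP option above, a minimal Velero-style Backup sketch is shown below; OADP is built on Velero, and the backup name, namespace, and storage-location name here are placeholders rather than values from this deck.

```yaml
# Hedged sketch: an OADP (Velero) Backup of a single application namespace.
# Names are placeholders; OADP exposes the velero.io/v1 Backup API.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: app1-backup
  namespace: openshift-adp
spec:
  includedNamespaces:
  - app1
  storageLocation: default   # backup storage location backed by S3-compatible object storage
  snapshotVolumes: true      # also snapshot the application's PVs
  ttl: 720h0m0s              # retention period
```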
Application/Workload-specific considerations
▸ Applications that have PVC resources as part of their git repository are fully supported
・ This enables ACM to ensure the PVCs are deleted, enabling ODR to switch image roles safely from Primary to Secondary.
▸ If multiple applications are deployed to the same namespace, their PVCs should be uniquely labelled, to enable grouping the PVCs that belong to each application on the managed cluster (see the hedged labelling sketch below).
・ The unique label further needs to be provided to the ODR DRPlacementControl resource.
▸ All application PVCs that need DR protection should use a StorageClass that is provisioned by the ODF csi-drivers; in the case of RDR, only the RBD/Block SC is supported.
▸ Applications MUST be deployed using the ACM hub application management workflow.
・ NOTE: Tested workflows use the ACM git repositories model to deploy applications, though in the wild the helm model has also been used with ODR.
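To illustrate the labelling guidance above, a hedged sketch follows; the label key/value, names, and StorageClass are placeholders, and the pvcSelector fragment assumes the DRPlacementControl API described later in this deck.

```yaml
# Hedged sketch: uniquely labelling an application's PVC so ODR can group it.
# Names and label key/value are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app1-data
  namespace: shared-namespace
  labels:
    appname: app1              # unique per application sharing the namespace
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: ocs-storagecluster-ceph-rbd   # an ODF RBD/Block StorageClass
  resources:
    requests:
      storage: 10Gi
# The same label is then referenced from the application's DRPlacementControl:
#   spec:
#     pvcSelector:
#       matchLabels:
#         appname: app1
```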
Solution Overview
● Description
● Components
● Architecture
OCP Integrated, Full Stack DR Protection
[Stack diagram: Multi-Cluster Manager (RHACM), Multi-Site, Persistence Layer, Platform.]
• ODR Hub Operator – orchestrates and automates DR operations across clusters
• ODR Cluster Operator – manages and synchronizes cluster metadata and application data
• Asynchronous replication of application volumes – or –
• Synchronous mirroring of application volumes
RHACM:
• Easy DR configuration across clusters and sites as part of application deployment
• Automated DR failover and failback operations reduce RTO
• Manage and monitor DR across clusters and apps
• Same consistent DR operations for both Metro-DR and Regional-DR
• Both application data and state are protected and used for application-granular protection
• Consistent data replication or mirroring, or both, based on infrastructure and desired protection.
Solution Components
▸ Red Hat Advanced Cluster Management for Kubernetes (RHACM)
・ RHACM provides the ability to manage multiple clusters and application lifecycles. Hence, it
serves as a control plane in a multi-cluster environment.
・ RHACM is split into two parts:
・ RHACM Hub: components that run on the multi-cluster control plane
・ Managed clusters: components that run on the clusters that are managed
・ RHACM submariner addon: provides OpenShift private network connectivity between clusters
▸ Red Hat OpenShift Data Foundation (ODF)
・ ODF provides the ability to provision and manage storage for stateful applications in an
OpenShift Container Platform cluster. It is backed by Ceph as the storage provider, whose
lifecycle is managed by Rook in the OpenShift Data Foundation component stack. Ceph-CSI
(Container Storage Interface) provides the provisioning and management of Persistent Volumes
for stateful applications.
▸ Red Hat Ceph Storage (RHCS)
・ RHCS is a massively scalable, open, software-defined storage platform combined with a Ceph
management platform, deployment utilities, and support services.
・ Metro-DR uses an external RHCS architecture called “stretch cluster mode with arbiter”.
・ ODF is configured to connect to the RHCS cluster that provides storage for OCP applications.
Components (continued)
▸ OpenShift DR (ODR)
・ OpenShift DR is a disaster recovery capability for stateful applications across a set of peer OpenShift clusters which are deployed and managed using RHACM. It provides cloud-native interfaces to orchestrate the life-cycle of an application's state on Persistent Volumes. These include:
・ Protecting an application's state relationship across OpenShift clusters
・ Failing over an application's state to a peer cluster
・ Relocating an application's state to the previously deployed cluster
・ OpenShift DR comprises three operators:
・ OpenShift DR Hub Operator
・ Installed on the hub cluster to manage failover and relocation for applications.
・ OpenShift DR Cluster Operator
・ Installed on each managed cluster to manage the lifecycle of all PVCs of an
application.
・ OpenShift Data Foundation Multicluster Orchestrator
・ Installed on the hub cluster to automate numerous configuration tasks.
・ Requires the OpenShift Data Foundation advanced entitlement.
RH-ACM Based DR Automation
Stateful Application Disaster Recovery (DR)
DR automation enables quick and error-free application recovery, lowering RTO.
Simplified DR Management with ACM
▸ Centralized DR management across clusters, across sites,
across cloud platforms
▸ Policy driven DR Configuration, governance, and
compliance
▸ Better availability control with Application granular DR
Management and operations
▸ Multicluster observability and alerting enable quick DR recovery and reduce downtime
▸ Multicluster networking simplifies cross-cluster networking
for DR configuration
Operator-based DR Automation
Single-Click Application Recovery
▸ OpenShift DR (ODR) operators reduce RTO and risk with
automation of DR operations
▸ Simplified DR orchestration via the centrally managed ACM Hub and the ODR Hub operator
▸ ODR Cluster Operator on each managed (workload) cluster
manages data replication and synchronization
▸ ACM Hub recovery in case of site disasters ensures its
availability
[Diagram: a centrally deployed ACM Hub running the ODR Hub Operator manages DR between a pair of managed clusters, each running an ODR Cluster Operator, deployed in different regions; application definitions come from GitHub/Helm/Object Store.]
Simplified DR Orchestration
▸ DR orchestration in a few easy steps with the ODR Operators
・ Choose primary and secondary clusters for your applications
・ Choose a predefined DR Policy to apply to your application
・ ACM & ODR Operators deploy the application, configure it for DR, and initiate data replication
▸ Both application state and data are replicated and protected
・ Uses the ODF Object Store to capture app metadata
▸ Provides active-active use of both clusters, with different applications deployed at each site protected by each other
▸ Flexibility for the replica copy to have a different storage configuration than the primary
[Diagram (Regional-DR): 1. the user configures App1 for DR; 2. ACM deploys App1 to the primary managed cluster; 3. replication is configured for App1; 4a. PV metadata is replicated to the ODF Object Store; 4b. App1 PV data is replicated to the secondary managed cluster; client I/Os are served via the GTM. App2 runs on the secondary cluster; application definitions come from GitHub/Helm/Object Store.]
Automated DR Failover reduces RTO
▸ ODR Operators automate the sequence of operations required for an application to recover on the secondary cluster, accomplished with a single command by the user (a hedged failover sketch follows below)
▸ Failover process automation – increases application availability and reduces user errors
▸ Application-granular failovers provide better priority-based control
▸ Failovers are always user-initiated and controlled, eliminating unintended switchover and data loss.
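As a concrete illustration of the single-command failover above, a hedged sketch follows; resource, policy, and cluster names are placeholders, and the action/failoverCluster fields follow the DRPlacementControl API as documented for ODF DR, so details may vary by release.

```yaml
# Hedged sketch: triggering a Regional-DR failover by updating the application's
# DRPlacementControl on the ACM hub (e.g. via `oc edit drpc app1-drpc -n app1`).
# Names are placeholders.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: app1-drpc
  namespace: app1
spec:
  drPolicyRef:
    name: odr-policy-1m
  placementRef:
    kind: PlacementRule
    name: app1-placement
  pvcSelector:
    matchLabels:
      appname: app1
  preferredCluster: ocp-cluster-1   # where App1 normally runs
  failoverCluster: ocp-cluster-2    # recovery target
  action: Failover                  # Failover | Relocate
```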
[Diagram (Regional-DR failover): 1. the user initiates the failover; 2. metadata of all App1 PVs is populated from the ODF Object Store; 3. the secondary PV is designated as the primary copy of App1; 4. ACM deploys App1 on the secondary cluster; 5. client I/Os are redirected via the GTM; 6. ACM un-deploys App1 from the failed primary cluster.]
DR Failback ensures smooth recovery to Primary
▸ DR failbacks are planned, controlled, and incur no data loss
▸ PV data changes and metadata changes are restored to the primary cluster before the failback is complete
▸ Failback operations are user-initiated and controlled (a hedged relocate sketch follows below)
▸ Workload migrations across hybrid platforms are also enabled with the same process
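A hedged sketch of the corresponding relocate request follows; only the fields that change relative to the failover example above are shown, and names remain placeholders.

```yaml
# Hedged sketch (fragment): failing back / relocating App1 to its preferred cluster
# by setting the DRPlacementControl action to Relocate on the ACM hub.
# Other DRPlacementControl fields remain unchanged.
spec:
  preferredCluster: ocp-cluster-1   # the original primary cluster
  action: Relocate                  # resyncs data, then moves the app back
```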
[Diagram (Regional-DR failback): 1. the user initiates the failback; 2. metadata of all App1 PVs is restored; 3. App1 PV data is resynced to the primary cluster; 4. ACM un-deploys App1 from the secondary cluster; 5. ACM re-deploys App1 on the primary cluster; 6. client I/Os are redirected via the GTM.]
Technical Details
DR Solution - Key Operations
Most of the key DR operations below are automated by the ODR Operators, so they are behind-the-scenes operations for users
▸ Configure Ceph-RBD pools for mirroring (for Regional-DR)
・ Including starting the RBD mirror service, peering the two clusters for mirroring, and ensuring cross-cluster networking is configured (Submariner)
▸ Configure an S3 store per cluster for the DR Ramen operator to store/restore PV metadata
・ As workloads move across clusters, PVCs need to bind back to specific PVs that point to the mirrored images; this requires PV metadata store/restore from each cluster where the workload is placed or moved to
▸ DR-enable the PVCs of an application
・ Enable mirroring and manage the mirroring role, Primary/Secondary, for all PVCs of an application, depending on where and how the workload is placed across the peer clusters
▸ Orchestrate application placement in cases of disaster and recovery
High-level resources (CRDs)
▸ MirrorPeer (Hub) (user/UI created)
・ Responsible for setting up RBD mirroring and per-PVC ODF mirroring capabilities across peer clusters (a hedged sketch follows below)
・ Responsible for creating the S3 buckets required by Ramen (the DR orchestrator)
▸ DRCluster (Hub) (created by MirrorPeer)
・ Defines a ManagedCluster as a DR peer, and also contains the required S3 store configuration for this cluster
▸ DRPolicy (Hub) (user/UI created)
・ Defines the Sync/Async policy between two DRClusters, and also the Async schedule as required
▸ DRPlacementControl (Hub) (user/UI created)
・ Per-application DR placement orchestrator, placing applications as per the DRPolicy
・ Responsible for application movement orchestration by scheduling the ACM PlacementRule appropriately
・ Responsible for directing PVC protection and state on a peer cluster
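A hedged MirrorPeer sketch is shown below to make the peering step concrete; the cluster names are placeholders, and the fields follow the multicluster.odf.openshift.io API used by the ODF Multicluster Orchestrator, so details may vary by release.

```yaml
# Hedged sketch: peering two managed clusters' ODF storage for Regional-DR (async).
# Cluster names are placeholders; the StorageCluster name/namespace shown are the ODF defaults.
apiVersion: multicluster.odf.openshift.io/v1alpha1
kind: MirrorPeer
metadata:
  name: mirrorpeer-cluster1-cluster2
spec:
  type: async                      # async = Regional-DR style RBD mirroring
  items:
  - clusterName: ocp-cluster-1
    storageClusterRef:
      name: ocs-storagecluster
      namespace: openshift-storage
  - clusterName: ocp-cluster-2
    storageClusterRef:
      name: ocs-storagecluster
      namespace: openshift-storage
```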
Sub-resources (CRDs)
▸ VolumeReplicationGroup (ManagedCluster) (created by DRPlacementControl)
・ Defines, based on a label selector, which PVCs need DR protection on a workload cluster
・ Further defines:
・ The desired state for the PVCs (Primary/Secondary)
・ The mode of protection (Async/Sync)
・ The replication interval, if Async
・ The S3 profiles to use for PV resources (to store/restore from)
▸ VolumeReplicationClass (ManagedCluster) (created by MirrorPeer)
・ Like a StorageClass, defines a class containing a replication interval for protection and the CSI driver that acts on VolumeReplication resources pointing to the class
▸ VolumeReplication (ManagedCluster) (created by VolumeReplicationGroup)
・ Represents the per-PVC replication desire; enabled when created and set to either the Primary or Secondary role (a hedged sketch follows below)
・ Interfaces with the Ceph-CSI driver to issue the operations required to achieve the desired state
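A hedged per-PVC VolumeReplication sketch follows; the names are placeholders, and the API group is the csi-addons replication API that ODF DR builds on, so exact fields may differ by release.

```yaml
# Hedged sketch: the per-PVC replication resource created by a VolumeReplicationGroup.
# Names are placeholders.
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: app1-data-replication
  namespace: app1
spec:
  volumeReplicationClass: rbd-volumereplicationclass   # carries the replication interval
  replicationState: primary                            # primary | secondary
  dataSource:
    kind: PersistentVolumeClaim
    name: app1-data
```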
Regional-DR Operators
"Hub" ACM Hub cluster
・ openshift-operators: ramen-hub-operator, odf-multicluster-console, odfmo-controller-manager
"Primary" ACM ManagedCluster
・ openshift-dr-cluster: ramen-dr-cluster-operator
・ openshift-storage: token-exchange-agent, rook-ceph-rbd-mirror-a, csi-rbdplugin-provisioner (additional volume-replication sidecar)
"Secondary" ACM ManagedCluster
・ openshift-dr-cluster: ramen-dr-cluster-operator
・ openshift-storage: token-exchange-agent, rook-ceph-rbd-mirror-a, csi-rbdplugin-provisioner (additional volume-replication sidecar)
Italics: Common OCP/ODF/ACM resources (like ManagedCluster, PVC)
Bold: Resources created/required for ODF DR functionality
Operator responsibilities
▸ odf-multicluster-console (Hub)
・ Provides UI functionality
・ Creates MirrorPeer and DRPolicy cluster scoped resources based on user actions
▸ odfmo-controller-manager (Hub)
・ Reconciles MirrorPeer resource
・ Creates DRCluster resource on the hub
・ Creates required VolumeReplicationClass cluster scoped resource on the ManagedCluster
▸ ramen-dr-hub-operator (Hub)
・ Reconciles DRPlacementControl, DRPolicy, DRCluster resources
・ Creates VolumeReplicationGroup namespace scoped resource on desired ManagedCluster
MetroDR + RHCS Architecture
Metro (DR) Architecture
DR automation enables quick and error-free application recovery, lowering RTO.
MetroDR + RHCS stretched
● Metro DR + RHCS Architecture Diagrams. Link
● Ceph 101 Architecture Introduction. Link
Simplified DR Orchestration
▸ DR orchestration in a few easy steps with the ODR Operators
・ Choose primary and secondary clusters for your applications
・ Choose a predefined DR Policy to apply to your application
・ ACM & ODR Operators deploy the application, configure it for DR, and initiate data replication
▸ Both application state and data are replicated and protected
・ Uses the ODF Object Store to capture PV/PVC app metadata
▸ Provides active-active use of both clusters, with different applications deployed at each site protected by each other
[Diagram (Metro-DR): 1. the user configures App1 for DR; 2. ACM deploys App1 to the primary managed cluster; 3. replication is configured for App1; 4. PV metadata is replicated to the ODF Object Store; 5. client I/Os are served via the GTM. Both managed clusters consume synchronously replicated PV data from the external ODF cluster in the neutral zone; App2 runs on the secondary cluster.]
Automated DR Failover reduces RTO
▸ ODR Operators automate the sequence of operations required for an application to recover on the secondary cluster, accomplished with a single command by the user
▸ Failover process automation – increases application availability and reduces user errors
▸ Application-granular failovers provide better priority-based control
▸ Failovers are always user-initiated and controlled, eliminating unintended switchover and data loss.
▸ OCP worker nodes on the failed cluster are I/O-fenced to prevent data corruption (a hedged fencing sketch follows below)
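A hedged sketch of the fencing step is shown below; in the documented Metro-DR flow the administrator edits the failed cluster's DRCluster resource on the hub, but the exact field names may differ by release and the cluster name is a placeholder.

```yaml
# Hedged sketch (fragment): fencing the failed cluster's storage access before a
# Metro-DR failover, e.g. via `oc edit drcluster ocp-cluster-1` on the ACM hub.
# Only the relevant field is shown; other DRCluster fields remain unchanged.
spec:
  clusterFence: Fenced   # set back to Unfenced during failback/relocation
```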
[Diagram (Metro-DR failover): 1. the user initiates the failover; 2. metadata of all App1 PVs is populated from the ODF Object Store; 3. ACM deploys App1 on the secondary cluster; 4. client I/Os are redirected via the GTM; 5. ACM un-deploys App1 from the failed primary cluster. PV data remains synchronously replicated through the external ODF cluster in the neutral zone.]
DR Failback ensures smooth recovery to Primary
▸ DR failbacks are planned, controlled, and incur no data loss
▸ PV data changes and metadata changes are restored to the primary cluster before the failback is complete
▸ Failback operations are user-initiated and controlled
[Diagram (Metro-DR failback): 1. the user initiates the failback and metadata of all App1 PVs is restored; 2. ACM un-deploys App1 from the secondary cluster; 3. ACM re-deploys App1 on the primary cluster; 4. client I/Os are redirected via the GTM. Automated data recovery runs in the background over the synchronously replicated PVs in the external ODF cluster (neutral-zone arbiter).]
Thank You!
Additional Information
Technical assets
▸ Presentations
・ HA-DR for Stateful Applications on OpenShift
▸ Demo Videos and Office Hours
・ Automated Disaster Recovery failover and failback with Red Hat
OpenShift
・ Automated Disaster Recovery for OpenShift Workloads
・ ODF Office Hour June 16, 2022
・ ODF Office Hour August 25, 2022
・ ODF Office Hour Metro DR Introduction
・ OADP Native Backups
▸ Red Hat Documentation
・ [ODF 4.11] Configuring OpenShift Data Foundation Disaster Recovery for
OpenShift Workloads
Testing and Training
▸ Ensure OpenShift clusters have sufficient capacity (CPU, Mem) for ACM/ODF installation
・ ODF resource requirements
▸ ODF training.
・ Red Hat OpenShift Container Storage for Enterprise Kubernetes (DO370)
・ https://red-hat-storage.github.io/ocs-training/training/ocs4/odf.html
▸ RHCS Training.
・ Cloud Storage with Red Hat Ceph Storage CL260
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat
Red Hat is the world's leading provider of enterprise open source software solutions. Award-winning support, training, and consulting services make Red Hat a trusted adviser to the Fortune 500.
Thank you

More Related Content

Similar to final-red-hat-te-2023-gaurav-midha open to world

Kubernetes 101 - an Introduction to Containers, Kubernetes, and OpenShift
Kubernetes 101 - an Introduction to Containers, Kubernetes, and OpenShiftKubernetes 101 - an Introduction to Containers, Kubernetes, and OpenShift
Kubernetes 101 - an Introduction to Containers, Kubernetes, and OpenShiftDevOps.com
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...DataWorks Summit
 
Oracle RAC and Your Way to the Cloud by Angelo Pruscino
Oracle RAC and Your Way to the Cloud by Angelo PruscinoOracle RAC and Your Way to the Cloud by Angelo Pruscino
Oracle RAC and Your Way to the Cloud by Angelo PruscinoMarkus Michalewicz
 
Cloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster RecoveryCloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster RecoveryAli Hodroj
 
MuleSoft Meetup Roma - CloudHub Networking Stategies
MuleSoft Meetup Roma -  CloudHub Networking StategiesMuleSoft Meetup Roma -  CloudHub Networking Stategies
MuleSoft Meetup Roma - CloudHub Networking StategiesAlfonso Martino
 
Oracle Cloud Maximum Availability Architecture
Oracle Cloud Maximum Availability ArchitectureOracle Cloud Maximum Availability Architecture
Oracle Cloud Maximum Availability ArchitectureYuri Carvalho Marques
 
[WSO2Con Asia 2018] Architecting for Container-native Environments
[WSO2Con Asia 2018] Architecting for Container-native Environments[WSO2Con Asia 2018] Architecting for Container-native Environments
[WSO2Con Asia 2018] Architecting for Container-native EnvironmentsWSO2
 
Automated Deployment and Management of Edge Clouds
Automated Deployment and Management of Edge CloudsAutomated Deployment and Management of Edge Clouds
Automated Deployment and Management of Edge CloudsJay Bryant
 
Track 2, session 3, business continuity and disaster recovery in the virtuali...
Track 2, session 3, business continuity and disaster recovery in the virtuali...Track 2, session 3, business continuity and disaster recovery in the virtuali...
Track 2, session 3, business continuity and disaster recovery in the virtuali...EMC Forum India
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
2008-01-23 Red Hat Overview to CUNY Information Managers Forum
2008-01-23 Red Hat Overview to CUNY Information Managers Forum2008-01-23 Red Hat Overview to CUNY Information Managers Forum
2008-01-23 Red Hat Overview to CUNY Information Managers ForumShawn Wells
 
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and moreAdvanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and moreinside-BigData.com
 
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver VMworld
 
Ibm power ha v7 technical deep dive workshop
Ibm power ha v7 technical deep dive workshopIbm power ha v7 technical deep dive workshop
Ibm power ha v7 technical deep dive workshopsolarisyougood
 
Ransomware: The Defendable Epidemic
Ransomware: The Defendable EpidemicRansomware: The Defendable Epidemic
Ransomware: The Defendable EpidemicSagi Brody
 
Alaeddine El Fawal DR Plan for Private-Cloud Based Data Center
Alaeddine El Fawal DR Plan for Private-Cloud Based Data Center Alaeddine El Fawal DR Plan for Private-Cloud Based Data Center
Alaeddine El Fawal DR Plan for Private-Cloud Based Data Center Alaeddine El Fawal
 
cncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetescncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetesKrishna-Kumar
 
Nutanix - Expert Session - Metro Availability
Nutanix -  Expert Session - Metro AvailabilityNutanix -  Expert Session - Metro Availability
Nutanix - Expert Session - Metro AvailabilityChristian Johannsen
 

Similar to final-red-hat-te-2023-gaurav-midha open to world (20)

Kubernetes 101 - an Introduction to Containers, Kubernetes, and OpenShift
Kubernetes 101 - an Introduction to Containers, Kubernetes, and OpenShiftKubernetes 101 - an Introduction to Containers, Kubernetes, and OpenShift
Kubernetes 101 - an Introduction to Containers, Kubernetes, and OpenShift
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
 
Oracle RAC and Your Way to the Cloud by Angelo Pruscino
Oracle RAC and Your Way to the Cloud by Angelo PruscinoOracle RAC and Your Way to the Cloud by Angelo Pruscino
Oracle RAC and Your Way to the Cloud by Angelo Pruscino
 
Madrid meetup #7 deployment models
Madrid meetup #7   deployment modelsMadrid meetup #7   deployment models
Madrid meetup #7 deployment models
 
CVx_Pilot_DR_DS
CVx_Pilot_DR_DSCVx_Pilot_DR_DS
CVx_Pilot_DR_DS
 
Cloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster RecoveryCloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster Recovery
 
MuleSoft Meetup Roma - CloudHub Networking Stategies
MuleSoft Meetup Roma -  CloudHub Networking StategiesMuleSoft Meetup Roma -  CloudHub Networking Stategies
MuleSoft Meetup Roma - CloudHub Networking Stategies
 
Oracle Cloud Maximum Availability Architecture
Oracle Cloud Maximum Availability ArchitectureOracle Cloud Maximum Availability Architecture
Oracle Cloud Maximum Availability Architecture
 
[WSO2Con Asia 2018] Architecting for Container-native Environments
[WSO2Con Asia 2018] Architecting for Container-native Environments[WSO2Con Asia 2018] Architecting for Container-native Environments
[WSO2Con Asia 2018] Architecting for Container-native Environments
 
Automated Deployment and Management of Edge Clouds
Automated Deployment and Management of Edge CloudsAutomated Deployment and Management of Edge Clouds
Automated Deployment and Management of Edge Clouds
 
Track 2, session 3, business continuity and disaster recovery in the virtuali...
Track 2, session 3, business continuity and disaster recovery in the virtuali...Track 2, session 3, business continuity and disaster recovery in the virtuali...
Track 2, session 3, business continuity and disaster recovery in the virtuali...
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
2008-01-23 Red Hat Overview to CUNY Information Managers Forum
2008-01-23 Red Hat Overview to CUNY Information Managers Forum2008-01-23 Red Hat Overview to CUNY Information Managers Forum
2008-01-23 Red Hat Overview to CUNY Information Managers Forum
 
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and moreAdvanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
Advanced Networking: The Critical Path for HPC, Cloud, Machine Learning and more
 
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
 
Ibm power ha v7 technical deep dive workshop
Ibm power ha v7 technical deep dive workshopIbm power ha v7 technical deep dive workshop
Ibm power ha v7 technical deep dive workshop
 
Ransomware: The Defendable Epidemic
Ransomware: The Defendable EpidemicRansomware: The Defendable Epidemic
Ransomware: The Defendable Epidemic
 
Alaeddine El Fawal DR Plan for Private-Cloud Based Data Center
Alaeddine El Fawal DR Plan for Private-Cloud Based Data Center Alaeddine El Fawal DR Plan for Private-Cloud Based Data Center
Alaeddine El Fawal DR Plan for Private-Cloud Based Data Center
 
cncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetescncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetes
 
Nutanix - Expert Session - Metro Availability
Nutanix -  Expert Session - Metro AvailabilityNutanix -  Expert Session - Metro Availability
Nutanix - Expert Session - Metro Availability
 

Recently uploaded

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Recently uploaded (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

final-red-hat-te-2023-gaurav-midha open to world

  • 2. CONFIDENTIAL designator 2 Gaurav Midha He is a Principal Architect, working in Red Hat Hybrid Platforms. He creates Solution Patterns on Openshift Platform Plus. He has over 13 years of experience in IT industry with major focus on Automation and Deployments in Public cloud. He has designed, architected and automated different CI/CD pipelines using latest DevOps automation tools.
  • 3. CONFIDENTIAL designator Disaster Recovery for Openshift workloads Gaurav Midha Principal Architect 3
  • 4. Resilience Solutions Resiliency Solutions for different service level objectives 4 Minutes. Hours Days Day Hour Minutes Tolerance for Data loss RPO --------------------------- > Tolerance for App downtime RTO --------------------------- > Regional DR X Metro DR X Backup Logical failures Snapshot based ▸ Comprehensive protection solutions against wide spectrum of failures ▸ Beyond Data Protection --> Full Application protection ▸ Resiliency built into the platform – Available to all stateful and stateless applications on OpenShift ▸ OCP + ODF + ACM integrated stack lends towards Automated and Simplified Application granular protection Site Disasters Asynchronous Replication HW System failures Synchronous Mirroring
  • 6. Application HA 6 Multi-Zone spanning Cluster for local HA ▸HA for Stateful Applications deployed on cluster that is stretched across Availability Zones within a region ▸Installer ensures that resources are deployed across all AZs making the cluster resilient against failures of any single AZ ▸ODF provides synchronous consistent copies in all AZs ensuring no data loss during zone failures ▸Suitable for public cloud platforms with Regions supporting 3 or more AZs ・ Can be deployed on-prem when AZs are connected by networks with <10ms latency OCP ODF Master Node HAProxy / LB PV PV Worker Nodes Master Node Worker Nodes Master Node Worker Nodes Availability Zone 1 PV Availability Zone 2 Availability Zone 3
  • 7. Regional-DR Regional-DR with Failover Automation 7 Protection against Geographic Scale Disasters ▸ Asynchronous Volume Replication => low RPO • ODF enables cross cluster replication of data volumes with replication intervals as low as 1 min • ODF Storage operators synchronizes both App data PVs and Cluster metadata ▸ Automated Failover Management => low RTO • ACM Multi-Cluster manager enables failover and failback automation at application granularity ▸ Both clusters remain active with Apps distributed and protected among them OCP Cluster 1 Application GTM OCP Cluster 2 ACTIVE PASSIVE PVs RESOURCES RESOURCES RESOURCES PVs PVs Application PVs RESOURCES RESOURCES RESOURCES PVs PVs Asynchronous Volume Replication with ODF Automated Failover Management with ACM RPO – Mins RTO – Mins Region 1 Region 2
  • 8. Metro DR Metro-DR – Multi Cluster 8 No Data loss Data Mirroring, across multiple OCP clusters • Multiple OCP clusters deployed in different AZs provide a complete fault isolated configuration • External RHCS storage cluster provides persistent synchronous mirrored volumes across multiple OCP clusters enabling zero RPO • ACM managed automated Application failover across clusters reduces RTO • Requires Arbiter node in a third site for storage cluster • Arbiter node can be deployed over higher latency networks provided by public clouds External ODF Cluster OCP Cluster 1 Resources PV Arbiter Node GTM Automated Failover Management with ACM Data Center 1 PV Data Center 2 Application Neutral Zone OCP Cluster 2 Resources Application Synchronously mirrored PVs RPO – 0 RTO – Mins
  • 9. Comprehensive & Flexible Data Protection for the desired SLO (RPO+RTO) 9 HAPr oxy GTM OCP PASSIVE NAMESPACE PVs RESOURCES RESOURCES RESOURCES PVs PVs Asynchronous Vol Replication with ODF Automated Failover Management with ACM S3 Storage Backup & Restores NAMESPACE PVs RESOURCES RESOURCES RESOURCES PVs PVs Multi-tier protection against various failure conditions: • Cluster hardening with Multi-Zone spanning OCP cluster. + • Async Replication for HW and Data Center Failures + • Backup Restores from Snapshots for Software & Logical Failures OCP ODF Master Node PV PV Worker Nodes Master Node Worker Nodes Master Node Worker Nodes Availability Zone 1 PV Availability Zone 2 Availability Zone 3 Region 1 Region 2
  • 11. Start with HA configuration and add site DR Topology Single OCP+ODF clusters deployed over multiple AZs in a single region Multi OCP + ODF clusters spread over multiple regions Multi OCP clusters + single external ODF stretched cluster deployed over low latency networks RTO (Downtime) RTO=0 (Continuous)* RTO = minutes DR Automation from ACM+ODF reduces RTO RTO = minutes DR Automation from ACM+ODF reduces RTO RPO (Data loss exposure) RPO=0 No Data loss due due to Synchronous mirroring of ODF data RPO > 0; Usually 5 min or higher Depends upon network bandwidth & change rate RPO=0 No Data loss due due to Synchronous mirroring of ODF data Infra Requirements Multi-AZ supported public clouds (vSphere support in OCP 4.10) All ODF supported platforms No network latency limits On-prem only (vSphere, bare metal) <10ms network latency between sites OCP ODF Master Node PV PV Wor ker Nod es Master Node Wor ker Nod es Master Node Wor ker Nod es PV External ODF Cluster OCP Cluster 1 Resou rces PV PV Application OCP Cluster 2 Resou rces Application Synchronous mirrored PVs OCP Cluster 1 Application OCP Cluster 2 PVs RESOURCES RESOURCES RESOURCES PVs PVs Application PVs RESOURCES RESOURCES RESOURC ES PVs PVs Asynchronous Volume Replication + Cluster HA Regional-DR Metro-DR * Subject to RWO PV limitations
  • 12. Choosing the right DR solution 12 Choose the right DR solution ▸ Metro-DR is the primary choice for no data loss (RPO=0) solution. But still has these limitations : ・ ODF external mode only – Need to be comfortable with RHCS administration of storage ・ On-Prem only – No Public Clouds. ・ Requires Arbiter on a 3rd site and can be on a public cloud (<100ms RTT latency) ・ Failover granularity for MDR is per OCP cluster, unlike per-application granularity available in RDR ▸ Potential for Automatic Failover (RTO=0) – Roadmap ▸ Metro-DR avoids stretching OCP cluster across sites. ・ Stretching OCP clusters leads to cluster instability and complicates day 2 ops.
  • 13. Guidance on DR choices 14 DR choices ▸ Backup vs. DR Solutions ▸ Both OADP and DR solutions provide site DR. Key Difference: Cost and Protection levels ・ OADP (Backup) solution is cheaper as there is no standby site/cluster requirement. OADP can function with a cheaper remote Object storage. ・ Lower protection level as the data recovery starts after the failure. ▸ Users should be focused on Application protection and not cluster protection ・ All of our Backup and DR solutions are targeting (stateful) Application protection and not necessarily cluster protection ・ OpenShift makes cluster redeployment easy
  • 14. Application/Workload specific considerations 15 ▸ Applications that have PVC resources as part of their git repository are fully supported ・ This enables ACM to ensure the PVCs are deleted, enabling ODR to switch image roles safely from Primary to Secondary. ▸ If multiple applications are deployed to the same namespace, their PVCs should be uniquely labelled, to enable grouping PVCs that belong to the application on the managed cluster. ・ The unique label would further need to be provided to the ODR DRPlacementControl resource. ▸ All application PVCs that need DR protection should use a StorageClass that is provisioned by ODF csi-drivers, in the case of RDR only RBD/Block SC is supported. ▸ Applications MUST be deployed using the ACM hub application management workflow. ・ NOTE: Tested workflows use ACM git repositories model to deploy applications, though in the wild the helm model has also been used with ODR.
  • 15. CONFIDENTIAL Designator Optional section marker or title ● Description ● Components ● Architecture Solution Overview 16
  • 16. OCP Integrated, Full Stack DR Protection Platform Persistence Layer Multi-Site Multi Cluster Manager • ODR Hub Operator – Orchestrates & Automates DR operations across clusters • ODR Cluster Operator - Manages and synchronizes cluster meta data and application data • Asynchronous replication of Application volumes – or- • Synchronous Mirroring of Application volumes RHACM • Easy configuration DR across cluster and sites as part of application deployment • Automated DR Failover and Failback operations reduces RTO • Manage and Monitor DR across clusters and Apps • Same consistent DR operations for both Metro-DR and Regional-DR • Both Application Data and State is protected and used for Application granular protection • Consistent Data replication or mirroring or both based on infrastructure and desired protection.
  • 17. ▸ Red Hat Advanced Cluster Management for Kubernetes (RHACM) ・ RHACM provides the ability to manage multiple clusters and application lifecycles. Hence, it serves as a control plane in a multi-cluster environment. ・ RHACM is split into two parts: ・ RHACM Hub: components that run on the multi-cluster control plane ・ Managed clusters: components that run on the clusters that are managed ・ RHACM submariner addon: provides OpenShift private network connectivity between clusters ▸ Red Hat OpenShift Data Foundation (ODF) ・ ODF provides the ability to provision and manage storage for stateful applications in an OpenShift Container Platform cluster. It is backed by Ceph as the storage provider, whose lifecycle is managed by Rook in the OpenShift Data Foundation component stack. Ceph-CSI (Container Storage Interface) provides the provisioning and management of Persistent Volumes for stateful applications. ▸ Red Hat Ceph Storage (RHCS) ・ RHCS is a massively scalable, open, software-defined storage platform combined with a Ceph management platform, deployment utilities, and support services. ・ Metro-DR uses an external RHCS architecture called “stretch cluster mode with arbiter”. ・ ODF configured to connect to RHCS cluster that provides storage for OCP applications. Solution Components 18
  • 18. ▸ OpenShift DR (ODR) ・ OpenShift DR is a disaster recovery capability for stateful applications across a set of peer OpenShift clusters which are deployed and managed using RHACM. Provides cloud-native interfaces to orchestrate the life-cycle of an application’s state on Persistent Volumes. These include: ・ Protecting an application state relationship across OpenShift clusters ・ Failing over an application’s state to a peer cluster ・ Relocate an application’s state to the previously deployed cluster ・ OpenShift DR is comprised of three operators: ・ OpenShift DR Hub Operator ・ Installed on the hub cluster to manage failover and relocation for applications. ・ OpenShift DR Cluster Operator ・ Installed on each managed cluster to manage the lifecycle of all PVCs of an application. ・ OpenShift Data Foundation Multicluster Orchestrator ・ Installed on the hub cluster to automate numerous configuration tasks. ・ Requires the OpenShift Data Foundation advanced entitlement. Components (continued) 19
  • 19. RH-ACM Based DR Automation Stateful Application Disaster Recovery (DR) DR Automation enables quick and error free Application recovery enabling lowering RTO
  • 20. Simplified DR Management with ACM 21 ▸ Centralized DR management across clusters, across sites, across cloud platforms ▸ Policy driven DR Configuration, governance, and compliance ▸ Better availability control with Application granular DR Management and operations ▸ Multicluster observability and alerting enables quick DR recovery and reduces downtime ▸ Multicluster networking simplifies cross-cluster networking for DR configuration
  • 21. Operator based DR Automation Single Click Application Recovery 22 ▸ OpenShift DR (ODR) operators reduce RTO and risk with automation of DR operations ▸ Simplified DR Orchestration via centrally managed ACM Hub via ODR Hub operator ▸ ODR Cluster Operator on each managed (workload) cluster manages data replication and synchronization ▸ ACM Hub recovery in case of site disasters ensures its availability ODR Hub Operator ODR Cluster Operator ODR Cluster Operator ACM Hub GitHub/Helm/ Object Store A Centrally deployed ACM Hub managing DR between a pair of Managed clusters deployed at different regions
  • 22. Simplified DR Orchestration ▸ DR Orchestration in few easy steps with ODR Operators ・ Choose primary and secondary clusters for your Applications ・ Choose predefined DR Policy to apply for your application ・ ACM & ODR Operators deploy Application, configures it for DR and initiate data replication ▸ Both Application state and Data are replicated and protected ・ Uses ODF Object Store to capture App meta data ▸ Provide active-active use of both clusters, with different applications deployed at each site protected by each other ▸ Flexibility for replica copy to have different storage configuration than primary 4a. PV Metadata replicated to ODF Object Store 6. Client I/Os GTM Clients 3. Configure Replication for App1 2. ACM deploys App1 4b. App1 PV Data replicated Admi n 1. User configures App1 for DR Managed Cluster - Primary O C App1 git or he... GitHub/Helm/ Object Store ODR Hub ACM Hub ODR Cluster ODF App2 ODR Cluster ODF Managed Cluster - Secondary Regional-DR
  • 23. Automated DR Failover reduces RTO ▸ ODR Operators automate the sequence of operations required for Application to recover on the secondary cluster and accomplished with a single command by the user ▸ Failover process automation – Increases application availability and reduces user errors ▸ Application(s) granular Failovers provides better priority based control ▸ Failovers are always user initiated and controlled, eliminates un-intended switch and data loss. 5. Redirect Client I/Os GTM Clients 3. Designate PV as primary copy of App1 4. ACM deploys App1 Admi n 1. User initiates the Failover 6. ACM Un- deploys App 1 2. Populate metadata of all App1 PVs git or he... GitHub ODF Object Store ODR Hub ACM Hub GitHub/Helm/ Object Store Managed Cluster - Primary App1 ODR Cluster ODF Managed Cluster - Secondary App1 ODR Cluster ODF Regional-DR
  • 24. DR Failback ensures smooth recovery to Primary ▸ DR Failbacks are planned, controlled and with no-data loss ▸ PV Data changes and meta data changes are restored to primary cluster before the fail-back is complete ▸ Fail-back operations are user initiated and controlled ▸ Workload migrations across hybrid platforms are also enabled with the same process 6. Client I/Os GTM Clients 4. ACM un-deploys App1 Admi n 1. User initiates the Failback 5. ACM deploys App 1 2. Restore metadata of all App1 PVs git or he... GitHub 3. Resync App1 PVs data ODR Hub ACM Hub Managed Cluster - Secondary App1 ODR Cluster ODF Managed Cluster - Primary App1 ODR Cluster ODF GitHub/Helm/ Object Store Regional-DR
  • 26. DR Solution – Key Operations
Most of the key DR operations below are automated by the ODR operators, so for users they are behind-the-scenes operations
▸ Configure Ceph-RBD pools for mirroring (for Regional-DR); a sketch follows below
・ Includes starting the RBD mirror service, peering the two clusters for mirroring, and ensuring cross-cluster networking is configured (Submariner)
▸ Configure an S3 store per cluster for the Ramen DR operator to store/restore PV metadata
・ As workloads move across clusters, PVCs need to bind back to the specific PVs that point to the mirrored images; this requires PV metadata to be stored/restored from each cluster where the workload is placed or moved to
▸ DR-enable the PVCs of an application
・ Enable mirroring and manage the mirroring role (Primary/Secondary) for all PVCs of an application, depending on where and how the workload is placed across the peer clusters
▸ Orchestrate application placement in cases of disaster and recovery
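As noted, these operations are automated, but as an illustration of the first one: enabling Ceph-RBD mirroring for Regional-DR corresponds to turning on per-image mirroring on the ODF block pool. A minimal sketch using Rook CephBlockPool fields; the default ODF pool name is an assumption:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ocs-storagecluster-cephblockpool   # default ODF pool name (assumed)
  namespace: openshift-storage
spec:
  replicated:
    size: 3
  mirroring:
    enabled: true      # start RBD mirroring for images in this pool
    mode: image        # mirror per image (i.e. per PV) rather than the whole pool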
  • 27. High-Level Resources (CRDs)
▸ MirrorPeer (Hub) (user/UI created)
・ Responsible for setting up RBD mirroring and ODF per-PVC mirroring capabilities across the peer clusters
・ Responsible for creating the S3 buckets required by Ramen (the DR orchestrator)
▸ DRCluster (Hub) (created by MirrorPeer)
・ Defines a ManagedCluster as a DR peer and contains the required S3 store configuration for that cluster
▸ DRPolicy (Hub) (user/UI created)
・ Defines the sync/async policy between two DRClusters, along with the async schedule where required (see the sketch below)
▸ DRPlacementControl (Hub) (user/UI created)
・ Per-application DR placement orchestrator that places applications according to the DRPolicy
・ Responsible for application movement orchestration by scheduling the ACM PlacementRule appropriately
・ Responsible for directing PVC protection and state on a peer cluster
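Of these, DRPolicy is the resource users typically reference directly. A minimal Regional-DR sketch, assuming two managed clusters named primary-cluster and secondary-cluster and a 5-minute replication interval (names hypothetical):

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: odr-policy-5m                # hypothetical name, matching the DRPlacementControl sketch above
spec:
  drClusters:                        # the two DR-peered managed clusters
    - primary-cluster
    - secondary-cluster
  schedulingInterval: 5m             # async replication interval for Regional-DR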
  • 28. Sub-Resources (CRDs)
▸ VolumeReplicationGroup (ManagedCluster) (created by DRPlacementControl; see the sketch below)
・ Defines, based on a label selector, which PVCs need DR protection on a workload cluster
・ Further defines:
・ The desired state for the PVCs (Primary/Secondary)
・ The mode of protection (Async/Sync)
・ The replication interval, if async
・ The S3 profiles to use for PV resources (to store/restore from)
▸ VolumeReplicationClass (ManagedCluster) (created by MirrorPeer)
・ Like a StorageClass, defines a class that carries the replication interval used for protection and identifies which CSI driver acts on VolumeReplication resources pointing to the class
▸ VolumeReplication (ManagedCluster) (created by VolumeReplicationGroup)
・ Represents the per-PVC replication intent; replication is enabled when the resource is created and set to either the Primary or Secondary role
・ Interfaces with the Ceph-CSI driver to issue the operations required to reach the desired state
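A VolumeReplicationGroup is created by DRPlacementControl rather than by users, but a sketch helps show the fields listed above. This is illustrative only; all names are hypothetical and the field layout follows the description above:

apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
  name: app1-drpc                    # normally created by DRPlacementControl on the workload cluster
  namespace: app1
spec:
  pvcSelector:
    matchLabels:
      app: app1                      # PVCs on this cluster that need DR protection
  replicationState: primary          # desired state of the PVCs on this cluster (primary or secondary)
  async:
    schedulingInterval: 5m           # replication interval when the protection mode is async
  s3Profiles:                        # S3 profiles used to store/restore PV resources
    - s3profile-primary-cluster
    - s3profile-secondary-cluster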
  • 29. Regional-DR Operators
▸ "Hub" ACM Hub cluster
・ openshift-operators namespace: ramen-hub-operator, odf-multicluster-console, odfmo-controller-manager
▸ "Primary" ACM ManagedCluster
・ openshift-dr-cluster namespace: ramen-dr-cluster-operator
・ openshift-storage namespace: token-exchange-agent, rook-ceph-rbd-mirror-a, csi-rbdplugin-provisioner (with an additional volume-replication sidecar)
▸ "Secondary" ACM ManagedCluster
・ openshift-dr-cluster namespace: ramen-dr-cluster-operator
・ openshift-storage namespace: token-exchange-agent, rook-ceph-rbd-mirror-a, csi-rbdplugin-provisioner (with an additional volume-replication sidecar)
  • 30. Operator Responsibilities
▸ odf-multicluster-console (Hub)
・ Provides the UI functionality
・ Creates the MirrorPeer and DRPolicy cluster-scoped resources based on user actions
▸ odfmo-controller-manager (Hub)
・ Reconciles the MirrorPeer resource
・ Creates the DRCluster resource on the hub
・ Creates the required cluster-scoped VolumeReplicationClass resource on the ManagedCluster
▸ ramen-hub-operator (Hub)
・ Reconciles the DRPlacementControl, DRPolicy, and DRCluster resources
・ Creates the namespace-scoped VolumeReplicationGroup resource on the desired ManagedCluster
  • 31. Metro-DR + RHCS Architecture
▸ DR automation enables quick and error-free application recovery, lowering RTO
(Diagram: Metro-DR architecture.)
  • 32. Metro-DR + RHCS Stretched Cluster – References
● Metro-DR + RHCS architecture diagrams. Link
● Ceph 101 architecture introduction. Link
  • 33. Simplified DR Orchestration (Metro-DR)
▸ DR orchestration in a few easy steps with the ODR operators
・ Choose primary and secondary clusters for your applications
・ Choose a predefined DR policy to apply to your application
・ ACM and the ODR operators deploy the application, configure it for DR, and initiate data replication
▸ Both application state and data are replicated and protected
・ Uses the ODF Object Store to capture PV/PVC application metadata (see the DRCluster sketch below)
▸ Provides active-active use of both clusters, with different applications deployed at each site and protected by the other
(Diagram flow: 1. the admin configures App1 for DR · 2. ACM deploys App1 on the primary managed cluster · 3. replication is configured for App1 · 4. PV metadata is replicated to the ODF Object Store · 5. client I/O is directed through GTM. PV data is synchronously replicated via the external ODF cluster, which also spans the neutral zone; App2 runs on the secondary cluster.)
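The per-cluster S3 store used for this metadata is referenced from the DRCluster resource on the hub. A minimal sketch, assuming an S3 profile defined in the Ramen hub configuration; all names are hypothetical:

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  name: primary-cluster                       # should match the ACM ManagedCluster name
spec:
  s3ProfileName: s3profile-primary-cluster    # S3 store holding PV metadata for this cluster
  region: metro-zone-a                        # fault-domain label; differs between the two Metro-DR sites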
  • 34. Automated DR Failover Reduces RTO (Metro-DR)
▸ ODR operators automate the sequence of operations required for an application to recover on the secondary cluster; recovery is accomplished with a single command from the user
▸ Failover process automation increases application availability and reduces user errors
▸ Application-granular failovers provide better priority-based control
▸ Failovers are always user initiated and controlled, which eliminates unintended switchovers and data loss
▸ OCP worker nodes on the failed cluster are I/O-fenced to prevent data corruption (see the fencing sketch below)
(Diagram flow: 1. the user initiates the failover · 2. metadata of all App1 PVs is populated from the ODF Object Store · 3. ACM deploys App1 on the secondary cluster · 4. client I/O is redirected via GTM · 5. ACM un-deploys App1 from the primary cluster. PV data remains synchronously mirrored through the external ODF cluster spanning the neutral zone.)
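For Metro-DR, the fencing step can also be expressed declaratively: the failed cluster is marked as fenced on its DRCluster resource so that its workers can no longer write to the shared RHCS storage. A sketch, assuming the clusterFence field of the DRCluster CRD and the hypothetical names used earlier:

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  name: primary-cluster        # the failed cluster being fenced
spec:
  s3ProfileName: s3profile-primary-cluster
  region: metro-zone-a
  clusterFence: Fenced         # block I/O from this cluster's workers to the shared storage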
  • 35. DR Failback Ensures Smooth Recovery to the Primary (Metro-DR)
▸ DR failbacks are planned, controlled, and incur no data loss
▸ PV data changes and metadata changes are restored to the primary cluster before the failback completes
▸ Failback operations are user initiated and controlled (see the unfence and relocate sketch below)
(Diagram flow: 1. the user initiates the failback · 2. ACM un-deploys App1 from the secondary cluster and the metadata of all App1 PVs is restored · 3. ACM deploys App1 on the primary cluster · 4. client I/O is redirected via GTM. Data recovery runs automatically in the background over the synchronous PV replication.)
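Once the primary site is repaired, the reverse applies: the cluster is unfenced and the application is relocated back. A sketch, again assuming the clusterFence field and the hypothetical names used earlier:

# 1. Unfence the recovered primary cluster.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  name: primary-cluster
spec:
  s3ProfileName: s3profile-primary-cluster
  region: metro-zone-a
  clusterFence: Unfenced       # allow the recovered cluster to access the shared storage again
---
# 2. Relocate App1 back to its preferred cluster.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: app1-drpc
  namespace: app1
spec:
  action: Relocate
  preferredCluster: primary-cluster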
  • 38. Technical Assets
▸ Presentations
・ HA-DR for Stateful Applications on OpenShift
▸ Demo Videos and Office Hours
・ Automated Disaster Recovery failover and failback with Red Hat OpenShift
・ Automated Disaster Recovery for OpenShift Workloads
・ ODF Office Hour June 16, 2022
・ ODF Office Hour August 25, 2022
・ ODF Office Hour Metro DR Introduction
・ OADP Native Backups
▸ Red Hat Documentation
・ [ODF 4.11] Configuring OpenShift Data Foundation Disaster Recovery for OpenShift Workloads
  • 39. Testing and Training
▸ Ensure OpenShift clusters have sufficient capacity (CPU, memory) for ACM/ODF installation
・ ODF resource requirements
▸ ODF training
・ Red Hat OpenShift Container Storage for Enterprise Kubernetes (DO370)
・ https://red-hat-storage.github.io/ocs-training/training/ocs4/odf.html
▸ RHCS training
・ Cloud Storage with Red Hat Ceph Storage (CL260)
  • 40. Thank you
Red Hat is the world's leading provider of enterprise open source software solutions. Award-winning support, training, and consulting services make Red Hat a trusted adviser to the Fortune 500.
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat