2. Gaurav Midha
Gaurav is a Principal Architect working in Red Hat Hybrid Platforms. He creates Solution Patterns on OpenShift Platform Plus. He has over 13 years of experience in the IT industry, with a major focus on automation and deployments in the public cloud. He has designed, architected, and automated CI/CD pipelines using the latest DevOps automation tools.
4. Resilience Solutions
Resiliency solutions for different service-level objectives

[Chart: RPO (tolerance for data loss) vs. RTO (tolerance for application downtime), each spanning minutes to hours to days. Snapshot-based backup covers logical failures; Metro DR with synchronous mirroring covers HW/system failures; Regional DR with asynchronous replication covers site disasters]

▸ Comprehensive protection solutions against a wide spectrum of failures
▸ Beyond data protection: full application protection
▸ Resiliency built into the platform, available to all stateful and stateless applications on OpenShift
▸ The integrated OCP + ODF + ACM stack enables automated, simplified, application-granular protection
6. Application HA
Multi-zone spanning cluster for local HA
▸ HA for stateful applications deployed on a cluster that is stretched across Availability Zones within a region
▸ The installer ensures that resources are deployed across all AZs, making the cluster resilient against the failure of any single AZ
▸ ODF provides synchronous, consistent copies in all AZs, ensuring no data loss during zone failures
▸ Suitable for public cloud platforms with regions supporting 3 or more AZs
・ Can be deployed on-prem when AZs are connected by networks with <10ms latency
[Diagram: OCP + ODF cluster with a master node and worker nodes (with PVs) in each of Availability Zones 1–3, fronted by HAProxy/LB]
7. Regional-DR
Regional-DR with failover automation
Protection against geographic-scale disasters
▸ Asynchronous volume replication => low RPO
• ODF enables cross-cluster replication of data volumes with replication intervals as low as 1 minute
• ODF storage operators synchronize both application data PVs and cluster metadata
▸ Automated failover management => low RTO
• The ACM multi-cluster manager enables failover and failback automation at application granularity
▸ Both clusters remain active, with applications distributed and protected among them
[Diagram: Application on OCP Cluster 1 (active, Region 1) and OCP Cluster 2 (passive, Region 2) behind a GTM; PVs and resources protected by asynchronous volume replication with ODF and automated failover management with ACM; RPO and RTO in minutes]
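The achievable async replication interval depends on network bandwidth and the application's change rate. As a rough, illustrative model (the numbers and the fixed per-sync overhead are assumptions, not from this deck), an interval is only sustainable if one cycle's changed data plus sync overhead fits within the interval itself:

```python
def interval_sustainable(change_mb_per_min: float, link_mb_per_min: float,
                         interval_min: float, overhead_min: float = 0.5) -> bool:
    """Rough feasibility check for an async replication interval.

    A sync cycle must finish before the next one starts: the fixed per-sync
    overhead plus the time to ship one interval's worth of changed data has
    to fit inside the interval itself.
    """
    changed_mb = change_mb_per_min * interval_min        # data dirtied per cycle
    cycle_min = overhead_min + changed_mb / link_mb_per_min
    return cycle_min <= interval_min

# 100 MB/min change rate over a 600 MB/min link: a 5-minute interval holds
print(interval_sustainable(100, 600, 5))   # True
# 550 MB/min change rate saturates the same link at a 5-minute interval
print(interval_sustainable(550, 600, 5))   # False
```

This is only a sizing sanity check; real RPO also depends on snapshot and mirroring mechanics inside ODF.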
8. Metro DR
Metro-DR – Multi-cluster
No-data-loss data mirroring across multiple OCP clusters
• Multiple OCP clusters deployed in different AZs provide a completely fault-isolated configuration
• An external RHCS storage cluster provides persistent, synchronously mirrored volumes across multiple OCP clusters, enabling zero RPO
• ACM-managed automated application failover across clusters reduces RTO
• Requires an arbiter node in a third site for the storage cluster
• The arbiter node can be deployed over the higher-latency networks provided by public clouds
[Diagram: Applications on OCP Cluster 1 (Data Center 1) and OCP Cluster 2 (Data Center 2) behind a GTM, with automated failover management by ACM; an external ODF (RHCS) cluster synchronously mirrors PVs across both data centers, with an arbiter node in a neutral zone; RPO = 0, RTO in minutes]
9. Comprehensive & Flexible Data Protection for the desired SLO (RPO + RTO)
Multi-tier protection against various failure conditions:
• Cluster hardening with a multi-zone spanning OCP cluster
+
• Async replication for HW and data center failures
+
• Backup restores from snapshots for software & logical failures

[Diagram: a multi-AZ OCP + ODF cluster in Region 1 (HAProxy, master/worker nodes and PVs across Availability Zones 1–3) asynchronously replicating volumes with ODF to a passive namespace in Region 2 under ACM-automated failover management, with backup & restores to S3 storage, fronted by a GTM]
11. Start with HA configuration and add site DR

Cluster HA
・ Topology: single OCP + ODF cluster deployed over multiple AZs in a single region
・ RTO (downtime): RTO = 0 (continuous)*
・ RPO (data loss exposure): RPO = 0; no data loss due to synchronous mirroring of ODF data
・ Infra requirements: multi-AZ supported public clouds (vSphere support in OCP 4.10)

Regional-DR
・ Topology: multiple OCP + ODF clusters spread over multiple regions
・ RTO (downtime): RTO = minutes; DR automation from ACM + ODF reduces RTO
・ RPO (data loss exposure): RPO > 0, usually 5 min or higher; depends upon network bandwidth and change rate
・ Infra requirements: all ODF-supported platforms; no network latency limits

Metro-DR
・ Topology: multiple OCP clusters + a single external ODF stretched cluster deployed over low-latency networks
・ RTO (downtime): RTO = minutes; DR automation from ACM + ODF reduces RTO
・ RPO (data loss exposure): RPO = 0; no data loss due to synchronous mirroring of ODF data
・ Infra requirements: on-prem only (vSphere, bare metal); <10ms network latency between sites

* Subject to RWO PV limitations
12. Choosing the right DR solution
▸ Metro-DR is the primary choice for a no-data-loss (RPO=0) solution, but it still has these limitations:
・ ODF external mode only – you need to be comfortable with RHCS administration of storage
・ On-prem only – no public clouds
・ Requires an arbiter on a 3rd site, which can be on a public cloud (<100ms RTT latency)
・ Failover granularity for MDR is per OCP cluster, unlike the per-application granularity available in RDR
▸ Potential for automatic failover (RTO=0) – roadmap
▸ Metro-DR avoids stretching an OCP cluster across sites
・ Stretching OCP clusters leads to cluster instability and complicates day-2 ops
13. Guidance on DR choices
▸ Backup vs. DR solutions
▸ Both OADP and the DR solutions provide site DR. Key difference: cost and protection levels
・ The OADP (backup) solution is cheaper, as there is no standby site/cluster requirement; OADP can function with cheaper remote object storage
・ It offers a lower protection level, as data recovery starts only after the failure
▸ Users should focus on application protection, not cluster protection
・ All of our backup and DR solutions target (stateful) application protection, not necessarily cluster protection
・ OpenShift makes cluster redeployment easy
14. Application/Workload specific considerations
▸ Applications that have PVC resources as part of their git repository are fully supported
・ This enables ACM to ensure the PVCs are deleted, enabling ODR to safely switch image roles from Primary to Secondary
▸ If multiple applications are deployed to the same namespace, their PVCs should be uniquely labelled, to enable grouping the PVCs that belong to each application on the managed cluster
・ The unique label further needs to be provided to the ODR DRPlacementControl resource
▸ All application PVCs that need DR protection should use a StorageClass provisioned by the ODF csi-drivers; in the case of RDR, only the RBD/Block StorageClass is supported
▸ Applications MUST be deployed using the ACM hub application management workflow
・ NOTE: Tested workflows use the ACM git repositories model to deploy applications, though in the wild the helm model has also been used with ODR
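As a hedged sketch of the labelling guidance above (the PVC name, namespace, and StorageClass name are illustrative assumptions, not from this deck), a PVC carrying a unique per-application label that the ODR DRPlacementControl's PVC selector can later match might look like:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app1-data                # hypothetical PVC name
  namespace: shared-apps         # namespace shared by several applications
  labels:
    appname: app1                # unique label grouping this app's PVCs
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ocs-storagecluster-ceph-rbd  # ODF RBD/Block SC (required for RDR)
  resources:
    requests:
      storage: 10Gi
```

The same `appname: app1` label would then be supplied to the DRPlacementControl resource so that only this application's PVCs are grouped for DR protection.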
16. OCP Integrated, Full Stack DR Protection

[Diagram: RHACM as the multi-cluster manager spanning the platform and persistence layers across multiple sites]

• ODR Hub Operator: orchestrates and automates DR operations across clusters
• ODR Cluster Operator: manages and synchronizes cluster metadata and application data
• Asynchronous replication of application volumes, or synchronous mirroring of application volumes
• Easy configuration of DR across clusters and sites as part of application deployment
• Automated DR failover and failback operations reduce RTO
• Manage and monitor DR across clusters and applications
• The same consistent DR operations for both Metro-DR and Regional-DR
• Both application data and state are protected, enabling application-granular protection
• Consistent data replication, mirroring, or both, based on the infrastructure and desired protection
17. Solution Components
▸ Red Hat Advanced Cluster Management for Kubernetes (RHACM)
・ RHACM provides the ability to manage multiple clusters and application lifecycles; hence, it serves as a control plane in a multi-cluster environment
・ RHACM is split into two parts:
・ RHACM Hub: components that run on the multi-cluster control plane
・ Managed clusters: components that run on the clusters that are managed
・ The RHACM submariner add-on provides OpenShift private network connectivity between clusters
▸ Red Hat OpenShift Data Foundation (ODF)
・ ODF provides the ability to provision and manage storage for stateful applications in an OpenShift Container Platform cluster. It is backed by Ceph as the storage provider, whose lifecycle is managed by Rook in the ODF component stack. Ceph-CSI (Container Storage Interface) provides the provisioning and management of Persistent Volumes for stateful applications
▸ Red Hat Ceph Storage (RHCS)
・ RHCS is a massively scalable, open, software-defined storage platform combined with a Ceph management platform, deployment utilities, and support services
・ Metro-DR uses an external RHCS architecture called "stretch cluster mode with arbiter"
・ ODF is configured to connect to the RHCS cluster, which provides storage for OCP applications
18. Components (continued)
▸ OpenShift DR (ODR)
・ OpenShift DR is a disaster recovery capability for stateful applications across a set of peer OpenShift clusters that are deployed and managed using RHACM. It provides cloud-native interfaces to orchestrate the lifecycle of an application's state on Persistent Volumes. These include:
・ Protecting an application's state relationship across OpenShift clusters
・ Failing over an application's state to a peer cluster
・ Relocating an application's state to the previously deployed cluster
・ OpenShift DR comprises three operators:
・ OpenShift DR Hub Operator: installed on the hub cluster to manage failover and relocation for applications
・ OpenShift DR Cluster Operator: installed on each managed cluster to manage the lifecycle of all PVCs of an application
・ OpenShift Data Foundation Multicluster Orchestrator: installed on the hub cluster to automate numerous configuration tasks
・ Requires the OpenShift Data Foundation advanced entitlement
19. RHACM-Based DR Automation
Stateful application disaster recovery (DR)
DR automation enables quick and error-free application recovery, lowering RTO
20. Simplified DR Management with ACM
▸ Centralized DR management across clusters, sites, and cloud platforms
▸ Policy-driven DR configuration, governance, and compliance
▸ Better availability control with application-granular DR management and operations
▸ Multicluster observability and alerting enable quick DR recovery and reduce downtime
▸ Multicluster networking simplifies cross-cluster networking for DR configuration
21. Operator-based DR Automation
Single-click application recovery
▸ OpenShift DR (ODR) operators reduce RTO and risk through automation of DR operations
▸ Simplified DR orchestration via the centrally managed ACM Hub through the ODR Hub operator
▸ The ODR Cluster Operator on each managed (workload) cluster manages data replication and synchronization
▸ ACM Hub recovery in case of site disasters ensures its availability

[Diagram: a centrally deployed ACM Hub (running the ODR Hub Operator) managing DR between a pair of managed clusters (each running the ODR Cluster Operator) deployed in different regions, with GitHub/Helm/Object Store as the application source]
22. Simplified DR Orchestration
▸ DR orchestration in a few easy steps with the ODR operators
・ Choose primary and secondary clusters for your applications
・ Choose a predefined DR policy to apply to your application
・ The ACM & ODR operators deploy the application, configure it for DR, and initiate data replication
▸ Both application state and data are replicated and protected
・ Uses the ODF Object Store to capture application metadata
▸ Provides active-active use of both clusters, with different applications deployed at each site and protected by the other
▸ Flexibility for the replica copy to have a different storage configuration than the primary

[Diagram, Regional-DR: (1) the user configures App1 for DR on the ACM/ODR Hub; (2) ACM deploys App1 from GitHub/Helm/Object Store to the primary managed cluster; (3) replication is configured for App1; (4a) PV metadata is replicated to the ODF Object Store; (4b) App1 PV data is replicated to the secondary managed cluster; (6) client I/Os reach App1 via the GTM]
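The steps above can be sketched with the two hub resources involved. This is an illustrative shape, not a definitive configuration: the cluster names, interval, and placement rule name are assumptions.

```yaml
# A predefined DR policy pairing the two managed clusters (Regional-DR, async)
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: drpolicy-5m
spec:
  drClusters: ["primary-cluster", "secondary-cluster"]  # hypothetical names
  schedulingInterval: 5m          # async replication interval
---
# Per-application placement control that applies the policy
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: app1-drpc
  namespace: app1
spec:
  drPolicyRef:
    name: drpolicy-5m
  placementRef:
    kind: PlacementRule           # ACM placement rule for App1
    name: app1-placement
  preferredCluster: primary-cluster
  pvcSelector:
    matchLabels:
      appname: app1               # label that groups App1's PVCs
```

Once applied, the ODR hub operator schedules the ACM placement and the cluster operators begin replicating the selected PVCs.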
23. Automated DR Failover reduces RTO
▸ The ODR operators automate the sequence of operations required for an application to recover on the secondary cluster, accomplished with a single command by the user
▸ Failover process automation increases application availability and reduces user errors
▸ Application-granular failovers provide better priority-based control
▸ Failovers are always user initiated and controlled, eliminating unintended switchovers and data loss

[Diagram, Regional-DR: (1) the user initiates the failover; (2) metadata of all App1 PVs is populated from the ODF Object Store; (3) the PV on the secondary cluster is designated the primary copy of App1; (4) ACM deploys App1 on the secondary cluster; (5) client I/Os are redirected via the GTM; (6) ACM un-deploys App1 from the failed cluster]
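The single user command above amounts to one edit of the application's DRPlacementControl; as an illustrative fragment (cluster name is an assumption):

```yaml
# Patch applied to the app1-drpc DRPlacementControl to trigger failover
spec:
  action: Failover
  failoverCluster: secondary-cluster   # hypothetical peer cluster name
```

The ODR hub operator then drives the metadata restore, role switch, and redeployment steps shown in the diagram.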
24. DR Failback ensures smooth recovery to Primary
▸ DR failbacks are planned, controlled, and involve no data loss
▸ PV data changes and metadata changes are restored to the primary cluster before the failback is complete
▸ Failback operations are user initiated and controlled
▸ Workload migrations across hybrid platforms are also enabled with the same process

[Diagram, Regional-DR: (1) the user initiates the failback; (2) metadata of all App1 PVs is restored; (3) App1 PV data is resynced to the primary cluster; (4) ACM un-deploys App1 from the secondary cluster; (5) ACM deploys App1 on the primary cluster; (6) client I/Os return via the GTM]
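Failback (relocation) is likewise a single user-initiated edit, with ODR resyncing data before the move completes; as an illustrative fragment (cluster name is an assumption):

```yaml
# Patch applied to the app1-drpc DRPlacementControl to fail back
spec:
  action: Relocate
  preferredCluster: primary-cluster   # hypothetical original cluster name
```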
26. DR Solution – Key Operations
Most of the key DR operations below are automated by the ODR operators; hence they are behind-the-scenes operations for users
▸ Configure Ceph-RBD pools for mirroring (for Regional-DR)
・ Including starting the RBD mirror service, peering the 2 clusters for mirroring, and ensuring cross-cluster networking is configured (submariner)
▸ Configure an S3 store per cluster for the DR Ramen operator to store/restore PV metadata
・ As workloads move across clusters, PVCs need to bind back to the specific PVs that point to the mirrored images; this requires PV metadata store/restore from each cluster where the workload is placed or moved to
▸ DR-enable the PVCs of an application
・ Enable mirroring and manage the mirroring role (Primary/Secondary) for all PVCs of an application, depending on where and how the workload is placed across the peer clusters
▸ Orchestrate application placement in cases of disaster and recovery
27. High-level resources (CRDs)
▸ MirrorPeer (Hub) (user/UI created)
・ Responsible for setting up RBD mirroring and ODF per-PVC mirroring capabilities across peer clusters
・ Responsible for creating the required S3 buckets for Ramen (the DR orchestrator) to use
▸ DRCluster (Hub) (created by MirrorPeer)
・ Defines a ManagedCluster as a DR peer, and contains the required S3 store configuration for that cluster
▸ DRPolicy (Hub) (user/UI created)
・ Defines the Sync/Async policy between 2 DRClusters, and the Async schedule as required
▸ DRPlacementControl (Hub) (user/UI created)
・ Per-application DR placement orchestrator that places applications as per the DRPolicy
・ Responsible for application movement orchestration by scheduling the ACM PlacementRule appropriately
・ Responsible for directing PVC protection and state on a peer cluster
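As a rough sketch of the user-created MirrorPeer (the cluster names are assumptions; the field shape follows the ODF Multicluster Orchestrator CRD as used for Regional-DR, so treat it as illustrative rather than authoritative):

```yaml
apiVersion: multicluster.odf.openshift.io/v1alpha1
kind: MirrorPeer
metadata:
  name: mirrorpeer-c1-c2
spec:
  type: async                           # Regional-DR uses async mirroring
  items:
    - clusterName: primary-cluster      # hypothetical ManagedCluster name
      storageClusterRef:
        name: ocs-storagecluster
        namespace: openshift-storage
    - clusterName: secondary-cluster
      storageClusterRef:
        name: ocs-storagecluster
        namespace: openshift-storage
```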
28. Sub-resources (CRDs)
▸ VolumeReplicationGroup (ManagedCluster) (created by DRPlacementControl)
・ Defines, based on a label selector, which PVCs need DR protection on a workload cluster
・ Further defines:
・ The desired state for the PVCs (Primary/Secondary)
・ The mode of protection (Async/Sync)
・ The replication interval, if Async
・ The S3 profiles to use for PV resources (to store to and restore from)
▸ VolumeReplicationClass (ManagedCluster) (created by MirrorPeer)
・ Like a StorageClass, defines a class containing a replication interval for protection and which CSI driver acts on the VolumeReplication resources that point to the class
▸ VolumeReplication (ManagedCluster) (created by VolumeReplicationGroup)
・ Represents the per-PVC replication desire; enabled when created and set to either the Primary or Secondary role
・ Interfaces with the Ceph-CSI driver to issue the operations required to achieve the desired state
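The per-PVC VolumeReplication resource is normally created by the VolumeReplicationGroup rather than by hand; this sketch (names are assumptions) only illustrates its shape:

```yaml
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: app1-data
  namespace: app1
spec:
  volumeReplicationClass: rbd-volumereplicationclass  # hypothetical class name
  replicationState: primary        # or "secondary" after a role switch
  dataSource:
    kind: PersistentVolumeClaim
    name: app1-data                # the PVC being protected
```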
30. Operator responsibilities
Italics: common OCP/ODF/ACM resources (like ManagedCluster, PVC). Bold: resources created/required for ODF DR functionality
▸ odf-multicluster-console (Hub)
・ Provides UI functionality
・ Creates the MirrorPeer and DRPolicy cluster-scoped resources based on user actions
▸ odfmo-controller-manager (Hub)
・ Reconciles the MirrorPeer resource
・ Creates the DRCluster resource on the hub
・ Creates the required VolumeReplicationClass cluster-scoped resource on the ManagedCluster
▸ ramen-dr-hub-operator (Hub)
・ Reconciles the DRPlacementControl, DRPolicy, and DRCluster resources
・ Creates the VolumeReplicationGroup namespace-scoped resource on the desired ManagedCluster
31. Metro-DR + RHCS Architecture
Metro (DR) architecture
DR automation enables quick and error-free application recovery, lowering RTO
32. MetroDR + RHCS stretched
● Metro DR + RHCS Architecture Diagrams. Link
● Ceph 101 Architecture Introduction. Link
33. Simplified DR Orchestration
▸ DR orchestration in a few easy steps with the ODR operators
・ Choose primary and secondary clusters for your applications
・ Choose a predefined DR policy to apply to your application
・ The ACM & ODR operators deploy the application, configure it for DR, and initiate data replication
▸ Both application state and data are replicated and protected
・ Uses the ODF Object Store to capture PV/PVC application metadata
▸ Provides active-active use of both clusters, with different applications deployed at each site and protected by the other

[Diagram, Metro-DR: (1) the user configures App1 for DR on the ACM/ODR Hub; (2) ACM deploys App1 from GitHub/Helm/Object Store to the primary managed cluster; (3) replication is configured for App1; (4) PV metadata is replicated to the ODF Object Store; (5) client I/Os reach App1 via the GTM; PV data is synchronously replicated through the external ODF cluster, with an arbiter in a neutral zone]
34. Automated DR Failover reduces RTO
▸ The ODR operators automate the sequence of operations required for an application to recover on the secondary cluster, accomplished with a single command by the user
▸ Failover process automation increases application availability and reduces user errors
▸ Application-granular failovers provide better priority-based control
▸ Failovers are always user initiated and controlled, eliminating unintended switchovers and data loss
▸ OCP worker nodes on the failed cluster are I/O-fenced to prevent data corruption

[Diagram, Metro-DR: (1) the user initiates the failover; (2) metadata of all App1 PVs is populated from the ODF Object Store; (3) ACM deploys App1 on the secondary cluster; (4) client I/Os are redirected via the GTM; (5) ACM un-deploys App1 from the failed cluster; PV data remains synchronously replicated through the external ODF cluster in the neutral zone]
35. DR Failback ensures smooth recovery to Primary
▸ DR failbacks are planned, controlled, and involve no data loss
▸ PV data changes and metadata changes are restored to the primary cluster before the failback is complete
▸ Failback operations are user initiated and controlled

[Diagram, Metro-DR: (1) the user initiates the failback; (2) ACM un-deploys App1 from the secondary cluster and the metadata of all App1 PVs is restored; (3) ACM deploys App1 on the primary cluster; (4) client I/Os return via the GTM; automated data recovery runs in the background over the synchronously replicated PVs]
38. Technical assets
▸ Presentations
・ HA-DR for Stateful Applications on OpenShift
▸ Demo videos and office hours
・ Automated Disaster Recovery failover and failback with Red Hat OpenShift
・ Automated Disaster Recovery for OpenShift Workloads
・ ODF Office Hour June 16, 2022
・ ODF Office Hour August 25, 2022
・ ODF Office Hour Metro DR Introduction
・ OADP Native Backups
▸ Red Hat documentation
・ [ODF 4.11] Configuring OpenShift Data Foundation Disaster Recovery for OpenShift Workloads
39. Testing and Training
▸ Ensure OpenShift clusters have sufficient capacity (CPU, memory) for ACM/ODF installation
・ ODF resource requirements
▸ ODF training
・ Red Hat OpenShift Container Storage for Enterprise Kubernetes (DO370)
・ https://red-hat-storage.github.io/ocs-training/training/ocs4/odf.html
▸ RHCS training
・ Cloud Storage with Red Hat Ceph Storage (CL260)