OpenStack Tokyo Talk Application Data Protection Service
1. OpenStack Summit Tokyo 2015
Wang Hao, Software Engineer, Huawei IT Product Line
Eran Gampel, Cloud Chief Architect, Huawei European Research Center
Oshrit Feder, IBM Research - Haifa
Cloud DR Orchestration:
Beyond volume replication
Why do we need disaster recovery?
Replication in Cinder
ADPaaS: Project Smaug
3. Why do we need disaster recovery?
Customers want 24x7 service availability
Accidents and Natural Disasters
5. Version 2 of replication landed in the Liberty release
Improved to make it more widely usable by other backend devices.
No driver supports it yet
Implemented for Juno release
Upstream OS code merged; support added to the IBM Storwize/SVC driver
Began at the Icehouse summit
Design summit on volume replication
Status of Replication in Cinder
6. The main use of volume replication is resiliency in the presence of failures.
Storage Backend Storage Backend
Use Case of Replication
10. Hypervisor Level Hardware Level
Replication Solution Types
Case in point: Hardware vs. Hypervisor
11. Production Site DR Site
DR Manager DR Manager
VM VM VM
Another choice: Hypervisor DR
12. IO Commands IO Completion
Write as normal
IO Forwarding, Compression and
IO cache, Decompression and
Production Site DR Site
Hypervisor DR: IO Mirroring
Start CBT Data
Finished
1. Host abnormal restart
Hypervisor DR: IO Mirroring State Machine
15. Replication Type HW Array
Multi-Vendor Hardware Agnostic
No Impact on Compute Performance
No Special Network/Storage Privileges
No Special Admin Skillset Required
Cross VM Consistency Grouping Support
Cross Array Consistency Group Support
Hypervisor DR: HW(Array) vs. Hypervisor
16. Multiple Use Cases, Multiple
Users need to be able to choose the right protection plan
Vendors need a way to plug different implementations
25. Smaug: Mission Statement
Formalize Application Data Protection in OpenStack
APIs, Services, Plugins, …
Be able to protect Any Resource in OpenStack (as well as
Allow Diversity of vendor solutions, capabilities and
implementations without compromising usability
26. Smaug: Highlights
Vendors create plugins that implement Protection mechanisms for different
User perspective: Protect App Deployment
Configure and manage custom protection plans on the deployed resources
(topology, VMs, volumes, images, …)
Admin perspective: Define Protectable Resources
Decide what plugins protect which resources, what is available for the user
Decide where users can protect their resources
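The split of roles above (vendors write plugins, admins decide which plugin protects which resource type, users build plans over their resources) can be sketched as a small registry. This is an illustration only, not Smaug's API: the names ProtectionPlugin, VolumeBackupPlugin, VMSnapshotPlugin, PluginRegistry and the plan format are all hypothetical.

```python
class ProtectionPlugin:
    """Hypothetical base class that a vendor would subclass."""
    resource_type = None

    def protect(self, resource_id):
        raise NotImplementedError


class VolumeBackupPlugin(ProtectionPlugin):
    resource_type = "volume"

    def protect(self, resource_id):
        # A real plugin would call a backup service; here we only return a label.
        return f"backup:{resource_id}"


class VMSnapshotPlugin(ProtectionPlugin):
    resource_type = "vm"

    def protect(self, resource_id):
        return f"snapshot:{resource_id}"


class PluginRegistry:
    """Admin side: decides which plugin protects which resource type."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin):
        self._plugins[plugin.resource_type] = plugin

    def protectable_types(self):
        return sorted(self._plugins)

    def run_plan(self, plan):
        """User side: a plan is a list of (resource_type, resource_id) pairs."""
        results = {}
        for resource_type, resource_id in plan:
            plugin = self._plugins.get(resource_type)
            if plugin is None:
                raise LookupError(f"no plugin registered for {resource_type!r}")
            results[resource_id] = plugin.protect(resource_id)
        return results


registry = PluginRegistry()
registry.register(VolumeBackupPlugin())
registry.register(VMSnapshotPlugin())
plan = [("vm", "web-1"), ("volume", "data-1")]
print(registry.protectable_types())
print(registry.run_plan(plan))
```

A resource type with no registered plugin raises LookupError, which mirrors the idea that the admin controls what is available to the user.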
27. How to protect?
Smaug: Application Data Protection as a Service
What is protected?
Where to protect?
What was protected?
Resource Protection Service
Swift S3 …
What is protected?
How to protect?
Volume Protection Plugin
Backup Replication Snapshot
Who protects?
VM Protection Plugin
Image Protection Plugin
Topology Protection Plugin
Where to protect?
Cinder Nova …
What was protected?
29. Help us Build Smaug – Join the project
30. Demo Time
Video -- Application DR with IBM Cloud Manager
Paris summit talk & demo
European FP7 ORBIT Research project
IBM Cloud Manager with OpenStack
Hardware can fail, sometimes
People make mistakes, sometimes
Natural Calamities, or cataclysmic events (like fire, tornado, etc.)
Replication is for critical data and has a relatively shorter lifespan
Backup has a longer lifespan, but is snapshot-based, so its RPO is not as good.
Cloud admin creates a volume type with capabilities:replication="<is> True"
End users create volumes with this volume type
The Cinder scheduler chooses a backend that supports replication
The backend creates a volume replica and sets up replication between the two volumes
Cinder runs a periodic task to update the volumes' replication status
When a disaster happens, the cloud admin promotes the replica
Users can then use those volumes in the secondary data center, on its storage
As part of the fail-back process, replication between the primary and secondary volumes is re-enabled
Users can test replication by creating a volume with --source-replica
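The scheduling step in this workflow can be sketched as a capability match between the volume type's extra spec and what each backend reports. This is a toy model under stated assumptions, not Cinder's scheduler code: the Backend class, the matches and pick_backends helpers, and the backend names are hypothetical, and only the boolean "<is>" form of the spec is handled.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Hypothetical stand-in for a storage backend and its reported capabilities."""
    name: str
    capabilities: dict = field(default_factory=dict)


def matches(extra_specs, backend):
    """Return True if every 'capabilities:<key>' = '<is> <bool>' spec matches the backend."""
    for key, want in extra_specs.items():
        scope, _, cap = key.partition(":")
        if scope != "capabilities":
            continue  # other scopes are ignored in this sketch
        op, _, value = want.partition(" ")
        if op != "<is>":
            continue  # only the boolean operator is modelled here
        wanted = value.strip().lower() == "true"
        if bool(backend.capabilities.get(cap, False)) != wanted:
            return False
    return True


def pick_backends(extra_specs, backends):
    """Return the names of the backends that satisfy the volume type."""
    return [b.name for b in backends if matches(extra_specs, b)]


volume_type = {"capabilities:replication": "<is> True"}
backends = [
    Backend("lvm-1", {"replication": False}),
    Backend("svc-1", {"replication": True}),
]
print(pick_backends(volume_type, backends))  # ['svc-1']
```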
4. According to the configuration in cinder.conf, the driver chooses a replication target device, creates the replica, and sets up replication between the two volumes
5. If replication is enabled in the driver, the replication status is updated by the driver's periodic report task
6. When a disaster happens, the cloud admin fails over a replicating volume to its secondary via the “failover_replication” API
8. The cloud admin can also enable or disable replication on a replication-capable volume for some use cases, such as maintenance
9. The cloud admin can also query a volume for its list of configured replication targets
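The admin operations listed above (enable, disable, fail over, list targets) can be pictured as transitions on a per-volume replication record. The sketch below is a toy model, not the Cinder API: the ReplicatedVolume class, its method names, and the status strings "enabled", "disabled" and "failed-over" are assumptions made for illustration.

```python
class ReplicatedVolume:
    """Hypothetical record of one replicated volume and its configured targets."""

    def __init__(self, name, targets):
        self.name = name
        self.targets = list(targets)
        self.status = "enabled"
        self.active = "primary"

    def disable(self):
        # For example, before maintenance on the replication link.
        if self.status == "failed-over":
            raise RuntimeError("cannot disable replication after failover")
        self.status = "disabled"

    def enable(self):
        if self.status == "failed-over":
            raise RuntimeError("fail back before re-enabling replication")
        self.status = "enabled"

    def failover(self, target):
        # Models the admin failing the volume over to one of its secondaries.
        if target not in self.targets:
            raise ValueError(f"{target!r} is not a configured replication target")
        self.active = target
        self.status = "failed-over"

    def list_targets(self):
        return list(self.targets)


vol = ReplicatedVolume("vol-1", ["site-b"])
vol.disable()
vol.enable()
vol.failover("site-b")
print(vol.status, vol.active, vol.list_targets())
```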
IO Mirror state machine:
CBT (Changed Block Tracking) replication: based on a bitmap
Queue replication: in this state, the user can create a snapshot of the replication data.
Setup Connection with Virtual Replication Gateway
On a normal host restart, data still in the queue at shutdown is written to disk using the CBT bitmap
CBT Data Replication
When the CBT bitmap is clear, proceed to queue-based replication
If the queue overflows, switch to CBT
On Host Abnormal Restart or Swap (re-protect)
Do a consistency check, then CBT data replication
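The transitions described above can be written down as a small table-driven state machine. This is a sketch of one reading of the slide, not the product's implementation: the state and event names are hypothetical, and treating a normal restart as a return to CBT replication is an assumption.

```python
from enum import Enum


class State(Enum):
    CONNECTING = "connecting"                # setting up the connection to the gateway
    CONSISTENCY_CHECK = "consistency_check"
    CBT = "cbt"                              # bitmap-based replication
    QUEUE = "queue"                          # queue-based replication


# (current state, event) -> next state
TRANSITIONS = {
    (State.CONNECTING, "connected"): State.CBT,
    (State.CBT, "bitmap_clear"): State.QUEUE,
    (State.QUEUE, "queue_overflow"): State.CBT,
    (State.QUEUE, "normal_restart"): State.CBT,
    (State.CBT, "abnormal_restart"): State.CONSISTENCY_CHECK,
    (State.QUEUE, "abnormal_restart"): State.CONSISTENCY_CHECK,
    (State.CBT, "swap"): State.CONSISTENCY_CHECK,
    (State.QUEUE, "swap"): State.CONSISTENCY_CHECK,
    (State.CONSISTENCY_CHECK, "check_done"): State.CBT,
}


def step(state, event):
    """Apply one event, rejecting transitions the table does not define."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}")


def run(events, state=State.CONNECTING):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state


print(run(["connected", "bitmap_clear"]).name)
print(run(["connected", "bitmap_clear", "queue_overflow"]).name)
```

Keeping the transitions in one dictionary makes it easy to check that every event in the slide has a defined outcome and that anything else is rejected.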
Install and configure a hypervisor with replication capabilities.
DR admin creates a Protected Group for VMs in the dashboard
DR admin can define the Protection Policy (encryption, compression, RPO, etc.)
When the admin creates the protected group, replication starts and IO Mirror sends IO data to the VRG.
DR admin creates a Recovery Plan for fail-over, replication test and fail-back
When a disaster happens, the DR admin chooses the fail-over recovery plan, using either a snapshot or the newest data at the DR site
DR admin can use re-protect to swap the production site and the DR site. The system then replicates data from the new production site to the new DR site.
If fail-back is needed, the DR admin chooses the recovery plan that makes the data consistent between the production site and the DR site.
So… what do we need??
Is data only storage?
If it were so, we would need just Data Protection.
For example… (move slide)
We start by defining the API and the service frameworks