My experience writing DR service for CloudStack

My experience writing DR (Disaster Recovery) service for CloudStack

Published in: Engineering

  1. My experience writing a DR service for CloudStack
     Alena Prokharchyk, Citrix, @Lemonjet
  2. What is a disaster for the cloud
     • A disaster for the cloud is a hardware/software failure, a network/power outage, or physical damage to the data center (DC)
     • A disaster can cause partial or entire DC failure
     • As a result, VMs become unresponsive and need to be restored in another data center
     • The goal of DR products is to prepare VMs for failover and recover them in a short time frame
  3. Existing DR solutions in CS
     • Recurring snapshots feature
     No out-of-the-box cross-zone recovery solution
  4. What the new DR service does
     • Lets the admin configure the recovery service without adding extra scripts and config files
     • Prepares for disaster and restores the VM and all its metadata (networks, networking rules)
     • Recovers VMs across zones
     • Updates the recovery VMs' metadata in real time, which helps keep MTTR (Mean Time to Repair) low
     • Provides a tiered DR service: the most important apps/accounts can be recovered first
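The tiered recovery idea can be sketched as an ordering of VMs by tier. This is a minimal illustration, not the DR service's code: `RecoveryVm`, the `tier` field, and `failover_order` are hypothetical names, and it assumes a lower tier number means higher priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryVm:
    name: str
    account: str
    tier: int  # hypothetical field: 1 = most important, recovered first


def failover_order(vms):
    """Return the VMs sorted so that the most important tier comes first."""
    # sorted() is stable, so VMs in the same tier keep their input order.
    return sorted(vms, key=lambda vm: vm.tier)


vms = [
    RecoveryVm("batch-worker", "dev", tier=3),
    RecoveryVm("billing-db", "finance", tier=1),
    RecoveryVm("web-frontend", "shop", tier=2),
]
ordered_names = [vm.name for vm in failover_order(vms)]
print(ordered_names)
```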
  5. Things the DR service doesn't cover
     • The DR service does no storage replication, only metadata replication
     Storage replication is handled by the admin outside of CS (NetApp SnapMirror)
  6. Which version of CloudStack is supported by DR?
     DR works with:
     • CloudStack 4.5
     • The next Citrix CloudPlatform release, based on ASF 4.4
  7. Design principles followed while writing the DR
     • Develop as a CS plugin in V1, with the ability to run as a separate service in future versions
     • No changes to core/server CS code that are specific only to DR
     • No direct access to the CS DB. All data manipulation goes through CS APIs only
     • The DR service doesn't have its own DB in Version 1. All DR data is stored in the CS DB in the form of resources' metadata
     • Rely on MTBF (Mean Time Between Failures) being high. Never fail the VM in its original zone if its preparation fails; let the admin fix things and retry
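The last principle can be sketched as follows: a failed preparation is recorded in the VM's DR metadata and reported, while the VM itself is left untouched so the admin can retry. This is a sketch under assumptions, not the service's implementation. `prepare_for_dr` and `prepare_step` are hypothetical names, a plain dict stands in for metadata that the real service writes through CS APIs, and `READY_TO_FAILOVER` is an assumed state name. The `DR_STATE` and `DR_ALERT` keys and the `FAILED_TO_PREPARE_FOR_DR` value come from the metadata slide.

```python
def prepare_for_dr(vm_details, prepare_step):
    """Run one preparation step and record the outcome in the VM's DR details.

    The original VM is never stopped or modified on failure; only its
    DR metadata changes, so the admin can fix the cause and call this again.
    """
    try:
        prepare_step()
    except Exception as exc:
        vm_details["DR_STATE"] = "FAILED_TO_PREPARE_FOR_DR"
        vm_details["DR_ALERT"] = str(exc)
        return False
    vm_details["DR_STATE"] = "READY_TO_FAILOVER"  # assumed state name
    vm_details.pop("DR_ALERT", None)  # clear any alert left by an earlier attempt
    return True


def failing_step():
    raise RuntimeError("Failed to attach Nic to the Recovery vm")


details = {}
first_try = prepare_for_dr(details, failing_step)
failed_snapshot = dict(details)
retry = prepare_for_dr(details, lambda: None)  # admin fixed the problem
print(failed_snapshot, details)
```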
  8. DR Service deployment
     [Deployment diagram. Labels: DR service, DR Server; CloudStack, CS UI, CS API, CS Orchestration engine, Event message bus, CS Services/Plugins; DR UI plugin, DR API plugin, DR Events listener]
  9. DR process
     • Configuration: configuring the DR service
     • Preparation: preparing the VM for failover
     • Failover: failing over the VM to the Recovery zone
     • Failback: failing back the VM to its Original zone
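One way to picture these phases is as a small state machine over a VM's DR state. The sketch below is illustrative only: the transition table is an assumption, `CONFIGURED` is a hypothetical state name, and the other state values are borrowed from the state strings shown elsewhere in this deck.

```python
from enum import Enum


class DrState(Enum):
    CONFIGURED = "Configured"  # hypothetical name for "DR offering assigned"
    READY_TO_FAILOVER = "Ready to Failover"
    FAILED_TO_PREPARE = "FAILED_TO_PREPARE_FOR_DR"
    FAILED_OVER = "FailedOver"


# Assumed transitions: preparation may succeed or fail (and be retried),
# failover is only allowed from a prepared VM, and failback returns the VM
# to the prepared state.
ALLOWED = {
    DrState.CONFIGURED: {DrState.READY_TO_FAILOVER, DrState.FAILED_TO_PREPARE},
    DrState.FAILED_TO_PREPARE: {DrState.READY_TO_FAILOVER, DrState.FAILED_TO_PREPARE},
    DrState.READY_TO_FAILOVER: {DrState.FAILED_OVER},
    DrState.FAILED_OVER: {DrState.READY_TO_FAILOVER},
}


def transition(current, target):
    """Return the target state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


state = transition(DrState.CONFIGURED, DrState.READY_TO_FAILOVER)  # preparation
state = transition(state, DrState.FAILED_OVER)                     # failover
state = transition(state, DrState.READY_TO_FAILOVER)               # failback
print(state.value)
```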
  10. Configuring DR
     • Set up the Active zone with the Recovery zone
     • Configure DR offerings (SLAs)
     • Tag storage for the placement of the DR VMs' volumes
  11. Preparing VM for failover
     • The DR service listens to events from CS, and deploys/updates the recovery VM's metadata in the Recovery zone
     • The recovery VM doesn't occupy physical resources on the CS side
     • The recovery VM is invisible to the end user
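The event-driven preparation described above can be sketched as a listener that mirrors VM metadata into the Recovery zone. This is a minimal in-memory sketch, assuming a simplified event payload: the `RecoveryZone` class, the dict-shaped events, and the field names are hypothetical, and the event type strings are used only as illustrative labels. The real service would call CS APIs instead of updating a dict.

```python
class RecoveryZone:
    """In-memory stand-in for recovery VM metadata kept in the Recovery zone."""

    def __init__(self):
        # Maps the active VM's id to the metadata of its recovery VM.
        self.vms = {}

    def handle_event(self, event):
        """Apply one event from the active zone to the recovery metadata."""
        vm_id = event["vm_id"]
        if event["type"] == "VM.CREATE":
            # Metadata only: no physical resources, hidden from the end user.
            self.vms[vm_id] = {
                "name": event["name"],
                "nics": [],
                "visible_to_user": False,
            }
        elif event["type"] == "NIC.CREATE":
            self.vms[vm_id]["nics"].append(event["nic"])
        elif event["type"] == "VM.DESTROY":
            self.vms.pop(vm_id, None)


zone = RecoveryZone()
for event in [
    {"type": "VM.CREATE", "vm_id": 1, "name": "UserVm"},
    {"type": "NIC.CREATE", "vm_id": 1, "nic": "Nic1"},
    {"type": "NIC.CREATE", "vm_id": 1, "nic": "Nic 2"},
]:
    zone.handle_event(event)
print(zone.vms[1])
```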
  12. Preparing VM for failover
     [Diagram. Active zone: UserVm with Nic1 and Nic 2. Recovery zone: UserVm with Nic1 and Nic 2. DR Service]
  13. Failover process
     The process of restoring a failed VM in the Recovery zone
     • DR does not automatically detect that a disaster has happened
     • The DR admin triggers failover for the VM by calling the DR API
     • The DR service performs the failover process
  14. Failover process
     [Diagram. Active zone: UserVm, CS storage1 (Volume1, Volume2), Physical storage1 (Volume1, Volume2), UUID1. Recovery zone: UserVm, CS storage2 (Volume1, Volume2), Physical storage2 (Volume1, Volume2), UUID1. NetApp SnapMirror. DR Service]
  15. Failback process
     The process of moving the VM back to its original zone
     • VM metadata is preserved in the original zone and re-used when the VM is recovered
     • The recovery VM's volumes get re-introduced to the original zone and attached to the original VM
     • The VM in the recovery zone gets disabled
     • The VM in the original zone gets enabled
     • A UUID swap happens
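The failback steps can be sketched on two in-memory VM records. This is a hedged illustration, not the service's logic: the `Vm` class and `fail_back` are hypothetical, and the sketch assumes that before failback the recovery VM carries the user-facing UUID, so the swap hands it back to the original VM.

```python
from dataclasses import dataclass, field


@dataclass
class Vm:
    uuid: str
    zone: str
    enabled: bool
    volumes: list = field(default_factory=list)


def fail_back(original, recovery):
    """Move volumes back, flip the enabled flags, and swap the UUIDs."""
    # Re-introduce the recovery VM's volumes and attach them to the original VM.
    original.volumes = list(recovery.volumes)
    recovery.volumes = []
    recovery.enabled = False
    original.enabled = True
    # Assumed direction: the user-facing UUID follows the enabled VM.
    original.uuid, recovery.uuid = recovery.uuid, original.uuid


original = Vm(uuid="UUID2", zone="Active zone", enabled=False)
recovery = Vm(uuid="UUID1", zone="Recovery zone", enabled=True,
              volumes=["Volume1", "Volume2"])
fail_back(original, recovery)
print(original, recovery)
```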
  16. DR metadata in CS DB

     user_vm (CS DB)
     id  name      zone_id
     1   VM-user1  1
     2   VM-user1  2

     user_vm_details
     vm_id  detail_name     detail_value
     1      DR_RECOVERY_ID  2
     1      DR_STATE        FAILED_TO_PREPARE_FOR_DR
     1      DR_ALERT        Failed to attach Nic to the Recovery vm
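To show how the two tables relate, the sketch below loads the rows from the slide into plain Python lists and resolves a VM's recovery VM through its `DR_RECOVERY_ID` detail. The helper names `dr_details` and `recovery_vm` are hypothetical, and per the design principles the real service would read this data through CS APIs rather than from the DB directly.

```python
# Rows copied from the slide; lists of dicts stand in for the CS DB tables.
user_vm = [
    {"id": 1, "name": "VM-user1", "zone_id": 1},
    {"id": 2, "name": "VM-user1", "zone_id": 2},
]
user_vm_details = [
    {"vm_id": 1, "detail_name": "DR_RECOVERY_ID", "detail_value": "2"},
    {"vm_id": 1, "detail_name": "DR_STATE", "detail_value": "FAILED_TO_PREPARE_FOR_DR"},
    {"vm_id": 1, "detail_name": "DR_ALERT",
     "detail_value": "Failed to attach Nic to the Recovery vm"},
]


def dr_details(vm_id):
    """Collect the detail rows of one VM into a name -> value dict."""
    return {
        row["detail_name"]: row["detail_value"]
        for row in user_vm_details
        if row["vm_id"] == vm_id
    }


def recovery_vm(vm_id):
    """Return the recovery VM row linked via DR_RECOVERY_ID, or None."""
    recovery_id = dr_details(vm_id).get("DR_RECOVERY_ID")
    if recovery_id is None:
        return None
    return next((vm for vm in user_vm if vm["id"] == int(recovery_id)), None)


print(dr_details(1)["DR_STATE"], recovery_vm(1))
```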
  17. Who controls the DR process
     • The admin controls the recovery process on behalf of users' VMs
     • The end user can monitor:
       - The DR state of their VMs: “Ready to Failover”/“FailedOver”
       - Recovery zone info: the zone to which the VM recovers in case of failure
       - Recovery public IP address(es) info, needed to reconfigure their public DNS
  18. CS API enhancements
     • Added some missing data to CS API responses
     • Added missing “resource_details” tables for some CS resources
     • Added support for CS services to publish alerts via CS APIs
     • Introduced external UUID management
     • Implemented resource creation with delayed start for some objects (VPC)
  19. Things yet to fix on CS
     • Single sign-on is missing
     • Resource creation in the DB and the actual implementation are not granular enough
  20. Summary
     If you are an API developer for an open source IaaS product:
     • Always think from the end user/customer use case perspective when adding or modifying end user APIs
     • Watch what plugins, services, and bug fixes people write for your software. This helps to identify missing pieces and common problems in your software