Building a DR solution with CloudStack
Multi-Datacenter Cloud with KVM
About me
● Solutions Architect at StorPool Storage
● 25 years in ISP, Telcos, System integrators
● Helping companies build the perfect setup for their specific project
Venko Moyankov
linkedin.com/in/venkomoyankov/
venko@storpool.com
Quadris Requirements
Quadris offer ‘disaster avoidance’ hosted single-tenant infrastructure solutions. These are always on and active-active for storage, compute and network.
● New Cloud Solution needed to provide similar availability
● Customers migrating to our cloud expect the same level of DR protection they have from our other solutions
● Automation essential
Peter Grayson
Commercial Director
Challenges
● VMware Site Recovery Manager (SRM) in the VMware stack
● No alternative in the KVM stack
● CloudStack doesn’t have built-in DR features
● Main requirements for the DR cloud:
○ Be able to switch the workload from the main to the recovery site
○ HA of the control plane
○ Preserve / replicate the VM metadata
○ Virtual network that allows service migration
○ Replicate data to the recovery site
○ Short RPO and RTO
○ Ability to test the failover process
○ Failback with minimal impact
StorPool
● Scale-out block storage software
● Runs on standard x86-64 servers
● On top of standard Linux distributions
● Stand-alone or hyper-converged
● Fully distributed with synchronous replication
System Architecture
CloudStack Configuration
● Single Zone
● Each site is a Pod
● Layer 2 connectivity between sites (VXLAN)
● All VLANs are available at both sites
● Independent connection to the Internet for each site (HSRP)
● Single Zone-wide primary storage
● Individual storage clusters in each site with async replication
● Separate storage networks
The Network
● VXLAN underlay fabric
● 20 Gbps site to site (resilient)
● 50 Gbps storage bandwidth
● 100 Gbps guest traffic
● Anycast gateways for HA of L3 gateway
● HSRP for internet range failover
Primary Storage
● One primary storage cluster at each site
● Synchronous replication within the cluster/site
● Asynchronous replication between clusters/sites
● Multi-cluster mode - both clusters are presented as a single
storage:
○ Single API
○ Common namespace
○ Single zone-wide primary storage
○ Live migrations
● Regular (15-minute) snapshots replicated to site-B
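The 15-minute snapshot schedule is what bounds the RPO. As a rough sketch of the arithmetic, assuming the only other delay is the time a snapshot needs to reach site-B (the 2-minute transfer time used in the usage note is a made-up figure, not from the talk):

```python
from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(minutes=15)  # snapshot period from the slide


def worst_case_data_loss(snapshot_interval, transfer_time):
    """Upper bound on lost writes.

    If site-A fails just before the newest snapshot finishes replicating,
    recovery falls back to the previous one, taken a full interval earlier.
    """
    return snapshot_interval + transfer_time


def data_loss_at(failure_time, replicated_snapshot_times):
    """Writes lost if site-A fails at failure_time.

    replicated_snapshot_times holds the creation times of the snapshots
    that are already complete on site-B.
    """
    usable = [t for t in replicated_snapshot_times if t <= failure_time]
    if not usable:
        raise ValueError("no replicated snapshot exists before the failure")
    return failure_time - max(usable)
```

For example, `worst_case_data_loss(SNAPSHOT_INTERVAL, timedelta(minutes=2))` gives 17 minutes, which is why the RPO is best read as approximately, not exactly, the snapshot interval.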
Secondary Storage
● Stores templates only
● Templates are read only once, on the first use
● Templates are copied to the primary storage on the first use
● Virtual machine protected by an active-active stretched vSAN cluster
Failover procedure
1. Fence the hosts at the main site
2. Mark all machines in the CloudStack database as stopped
3. Switch the ACS to use the StorPool API at site-B
4. Get the list of protected VMs
5. For each disk, create a volume on site-B from the latest available snapshot
6. Update the volume path in CloudStack to point to the volume in the storage at site-B
7. Start the VM
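Steps 5 to 7 repeat for every protected VM, so they lend themselves to a small planning function. The sketch below only builds an ordered list of actions and executes nothing; the `Snapshot`, `Disk` and `VM` classes and the action names are hypothetical and do not come from the talk or from any StorPool or CloudStack API.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    snap_id: str   # hypothetical identifier of a snapshot present on site-B
    created: int   # creation time, seconds since the epoch


@dataclass
class Disk:
    volume_id: str                                  # CloudStack volume UUID
    snapshots: list = field(default_factory=list)   # replicated snapshots


@dataclass
class VM:
    vm_id: str
    disks: list = field(default_factory=list)


def latest_snapshot(disk):
    """Pick the newest replicated snapshot of a disk (step 5)."""
    if not disk.snapshots:
        raise ValueError(f"no replicated snapshot for volume {disk.volume_id}")
    return max(disk.snapshots, key=lambda s: s.created)


def plan_vm_failover(vm):
    """Return steps 5-7 for one VM as tuples; nothing is executed here."""
    steps = []
    for disk in vm.disks:
        snap = latest_snapshot(disk)
        steps.append(("create_volume_from_snapshot", snap.snap_id))  # step 5
        steps.append(("update_volume_path", disk.volume_id))         # step 6
    steps.append(("start_vm", vm.vm_id))                             # step 7
    return steps
```

Keeping the plan separate from its execution makes it possible to print and review the actions for a selected set of VMs before anything is changed, which fits the requirement to test the failover process.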
Failover procedure - API calls
● StorPool API - create a volume from snapshot
● CloudStack - use the new volume, start the VM
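# StorPool API: create a volume on site-B from the snapshot ${SNAP_ID}
curl -X POST \
  --data '{"parent":"'"${SNAP_ID}"'","name":"","tags":{}}' \
  "http://${SP_API}/ctrl/1.0/MultiCluster/VolumeCreate"
# CloudStack: point the volume at the new device path, then start the VM
cmk update volume id=822fab0a-5fb2-4100-b2cc-b8bf9fe54a10 \
  path=/dev/storpool-byid/n2cn.b.d2
cmk start virtualmachine id=05886b5e-766f-43e3-a904-b0395e5c44ef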
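When these calls are driven from an external script, the request and the command lines can be assembled programmatically. The following is a minimal sketch that mirrors the calls shown on the slide and does not send or run anything; the function names are made up for illustration, and authentication is left out because the slide does not show it.

```python
import json
import urllib.request


def volume_create_request(sp_api, snap_id):
    """Build (but do not send) the VolumeCreate request from the slide."""
    payload = {"parent": snap_id, "name": "", "tags": {}}
    return urllib.request.Request(
        url=f"http://{sp_api}/ctrl/1.0/MultiCluster/VolumeCreate",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
    )


def cmk_commands(volume_id, new_path, vm_id):
    """Argument lists for the two cmk calls, suitable for subprocess.run."""
    return [
        ["cmk", "update", "volume", f"id={volume_id}", f"path={new_path}"],
        ["cmk", "start", "virtualmachine", f"id={vm_id}"],
    ]
```

A script would pass the request to `urllib.request.urlopen` and each argument list to `subprocess.run(cmd, check=True)`, stopping at the first failure so that a VM is never started on a volume whose path was not updated.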
Fail-back procedure
1. Make sure VMs are not running at site-A
2. Switch ACS to use the StorPool API at site-A
3. Live migrate VMs from site-B to site-A
4. Delete the old volumes on site-A of the DR-enabled VMs
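The first fail-back step is a safety check: a VM running at both sites at once would write to two diverging copies of its disks. A sketch of how a script might enforce that check before planning the remaining steps, with hypothetical action names and VM identifiers passed in as plain strings:

```python
def plan_failback(vms, running_at_site_a):
    """Return fail-back steps 2-4 in order, after checking step 1.

    vms: IDs of the DR-enabled VMs currently running at site-B.
    running_at_site_a: IDs of VMs found running at site-A.
    Nothing is executed; the action names are illustrative only.
    """
    still_running = sorted(set(vms) & set(running_at_site_a))
    if still_running:
        # Step 1 failed: refuse to continue rather than risk a split brain.
        raise RuntimeError(f"VMs still running at site-A: {still_running}")

    steps = [("switch_storage_api", "site-A")]                        # step 2
    steps += [("live_migrate", vm, "site-A") for vm in vms]           # step 3
    steps += [("delete_old_volumes", vm, "site-A") for vm in vms]     # step 4
    return steps
```

All migrations are planned before any deletion, so the old volumes on site-A are removed only once every VM has been moved back.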
Main Features
● Live migration between sites
● Active-active mode
● Manually activated failover
● RPO ~ 15 minutes
● RTO ~ 5-10 minutes
● Third control site
● No modifications to CloudStack (external scripts only)
● DR is enabled per VM
● DR at a VM level (can switch selected VMs, test)
Q&A
Thank You
