When disaster strikes the cloud: Who, what, when, where and how to recover

Accelerating Enterprise OpenStack
Michael Factor
IBM Research - Haifa
factor@il.ibm.com
Ronen Kat
ronenkat@il.ibm.com
Sean Cohen
Red Hat
scohen@redhat.com

Talk Outline
q What is disaster recovery?
q Concepts and basics
q Protecting data and applications from disasters
q OpenStack Cinder toolbox for disaster recovery
q Applications are more than just data
q The road ahead: Kilo and beyond

What is Disaster Recovery?
According to Wikipedia, Disaster Recovery (DR) is "the process, policies and procedures . . . for recovery . . . of technology infrastructure . . . after a natural or human-induced disaster."
[Figure: servers, storage, network, software, configuration]
Surviving a disaster requires geographic dispersion

Recovery Point Objective and Recovery Time Objective
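Recovery Point Objective (RPO): how far back in time a disaster takes one
Recovery Time Objective (RTO): how long until operational after a disaster
[Figure: timeline with the disaster at 0. The RPO axis extends back in time through seconds, minutes, hours, days and weeks; the RTO axis extends forward through seconds, minutes, hours, days and weeks. Labels: replication, backup, restore, active site, hot site.]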
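The two objectives can be illustrated with simple date arithmetic. The following is a minimal sketch; the function names and every timestamp are hypothetical and chosen only to show how a nightly backup and continuous replication land at different points on the RPO axis.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_protected_copy: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: everything written after the last copy is lost."""
    return disaster - last_protected_copy


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Downtime: how long until the application is operational again."""
    return service_restored - disaster


# Hypothetical timeline, for illustration only.
disaster = datetime(2014, 11, 4, 12, 0, 0)
last_nightly_backup = datetime(2014, 11, 4, 2, 0, 0)
last_replicated_write = datetime(2014, 11, 4, 11, 59, 30)
restored_from_backup = datetime(2014, 11, 5, 12, 0, 0)

backup_rpo = achieved_rpo(last_nightly_backup, disaster)
replication_rpo = achieved_rpo(last_replicated_write, disaster)
backup_rto = achieved_rto(disaster, restored_from_backup)

print("backup RPO:", backup_rpo)
print("replication RPO:", replication_rpo)
print("backup RTO:", backup_rto)
```

In this made-up timeline the backup loses hours of data while replication loses seconds, which is the trade-off the figure's axes describe.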

Data and Metadata Consistency
Data consistency
q If a modified datum is available,
all data it depends upon is also
available
Metadata consistency
q Configuration updates are seen
in the same order relative to one
another and to data updates
Application VM
DB LOG
DB LOG
Remote Site
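The data consistency rule can be expressed as a small check. This is a sketch under assumptions: writes are identified by hypothetical string ids, and the dependency map (a DB page update depending on its LOG record) is an illustration of write-ahead logging, not something taken from the slides.

```python
def is_consistent(available, depends_on):
    """Return True if every available write has all of its dependencies available.

    available:  set of write ids present at the remote site
    depends_on: dict mapping a write id to the write ids it depends upon
    """
    return all(dep in available
               for write in available
               for dep in depends_on.get(write, ()))


# Hypothetical example: the DB page update depends on the LOG record
# that describes it.
depends_on = {"db-page-7": {"log-record-42"}}

log_arrived_first = {"log-record-42"}
both_arrived = {"log-record-42", "db-page-7"}
db_arrived_without_log = {"db-page-7"}

print(is_consistent(log_arrived_first, depends_on))
print(is_consistent(both_arrived, depends_on))
print(is_consistent(db_arrived_without_log, depends_on))
```

The last case is the one to avoid: the remote site holds the DB update but not the LOG record it depends upon.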

OpenStack Cloud Metadata
Virtual networks between the cloud VMs
External network access
Attached volumes
Volume types
Virtual machine flavors
SSH keys for VM access
Virtual machine images
Identities of users

Protecting Data and Applications from Disasters

Data Protection: Cinder Backup and Restore
q Cinder backup
q Backup a volume to backup storage
Swift
backup-create
Primary Cloud

q Can Cinder restore on secondary
cloud?
q Problem: Cinder on secondary
cloud is not aware of the backup
Swift
backup-restore
Primary Cloud
Secondary Cloud

q Solution: “electronic tape shipping”
q backup-export
q backup-import
q Cinder supports since Icehouse
Swift
backup-export
Primary Cloud
Secondary Cloud
Backup reference
backup-import

q After backup-import Cinder can
restore on secondary cloud
q backup-restore
Swift
backup-restore
Primary Cloud
Secondary Cloud
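The backup, export, import and restore sequence can be written down as the four `cinder` CLI invocations it consists of. This is a sketch only: it builds the command lines and prints them rather than running them, it uses only positional arguments, and every identifier in capitals is a placeholder for a value that would come from the output of the previous command.

```python
import shlex


def backup_create(volume_id):
    """On the primary cloud: back up a volume to backup storage."""
    return ["cinder", "backup-create", volume_id]


def backup_export(backup_id):
    """On the primary cloud: print the backup reference (service and URL)."""
    return ["cinder", "backup-export", backup_id]


def backup_import(backup_service, backup_url):
    """On the secondary cloud: register the exported backup reference."""
    return ["cinder", "backup-import", backup_service, backup_url]


def backup_restore(backup_id):
    """On the secondary cloud: restore the imported backup to a volume."""
    return ["cinder", "backup-restore", backup_id]


# Placeholder identifiers; real values come from the output of each command.
steps = [
    backup_create("VOLUME_ID"),
    backup_export("BACKUP_ID"),
    backup_import("BACKUP_SERVICE", "BACKUP_URL"),
    backup_restore("IMPORTED_BACKUP_ID"),
]
for step in steps:
    print(" ".join(shlex.quote(arg) for arg in step))
```

Only the small backup reference travels between the clouds; the backup data itself stays in the shared backup storage, which is why the slides call this "electronic tape shipping".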

Data Protection: Cinder Volume replication
q Cinder has initial support for
volume replication in Juno release
q Cinder back-ends can “advertise”
support for replication
q Volume created with replication
extra-spec will be allocated on
back-end supporting replication and
will be replicated
q Supporting back ends:
q IBM Storwize, more expected in Kilo
Cinder back-end
Cinder back-end
Volume-type extra specs:
“capabilities:replication
<is> True”
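How an extra spec such as “capabilities:replication <is> True” selects a back-end can be sketched as a simple filter. This is a simplified illustration and not the Cinder scheduler code: the function names, the back-end names and the capability reports are all hypothetical, and only the `<is>` form shown on the slide is handled.

```python
def matches(spec_value, advertised):
    """Compare an extra-spec value such as '<is> True' with a reported capability."""
    if spec_value.startswith("<is> "):
        wanted = spec_value[len("<is> "):].strip().lower() == "true"
        return bool(advertised) is wanted
    # Fallback for this sketch: plain string equality.
    return str(advertised) == spec_value


def eligible_backends(extra_specs, backends):
    """Return names of back-ends whose capabilities satisfy every 'capabilities:' spec."""
    chosen = []
    for name, capabilities in backends.items():
        ok = True
        for key, value in extra_specs.items():
            scope, _, capability = key.partition(":")
            if scope != "capabilities":
                continue  # not a capability requirement; ignored in this sketch
            if capability not in capabilities or not matches(value, capabilities[capability]):
                ok = False
                break
        if ok:
            chosen.append(name)
    return chosen


extra_specs = {"capabilities:replication": "<is> True"}
# Hypothetical back-end names and capability reports.
backends = {
    "backend-a": {"replication": True},
    "backend-b": {"replication": False},
    "backend-c": {},
}
print(eligible_backends(extra_specs, backends))
```

A back-end that does not advertise the capability at all is treated the same as one that reports it as unsupported.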

Data Protection: Cinder Volume replication
q Secondary volume can become
primary when promoted
q replication-promote
q Replication can be reversed
following a replication-promote
q replication-reenable
Cinder back-end
Cinder back-end
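The promote and re-enable steps can be pictured as a small state model. This is a toy sketch of how the slide describes the behavior, not the Cinder implementation; the class, its methods and the site names are hypothetical.

```python
class ReplicatedPair:
    """Toy model of one replicated volume: two copies and a replication direction."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.replicating = True  # data flows primary -> secondary

    def promote(self):
        """Model of replication-promote: the secondary copy takes over as primary."""
        self.primary, self.secondary = self.secondary, self.primary
        self.replicating = False  # the old primary may be unreachable

    def reenable(self):
        """Model of replication-reenable: resume replication in the current direction."""
        self.replicating = True

    def direction(self):
        return f"{self.primary} -> {self.secondary}" if self.replicating else "stopped"


pair = ReplicatedPair("site-A", "site-B")
before = pair.direction()
pair.promote()
after_promote = pair.direction()
pair.reenable()
after_reenable = pair.direction()
print(before, "|", after_promote, "|", after_reenable)
```

After the promote and re-enable, replication runs in the opposite direction from where it started, which is the reversal the slide mentions.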

Consistency Groups
q New in Juno
q Support for volume grouping for consistency
q Grouping of volumes is based on the volume-type
q Supporting
q Consistency group snapshots
q Needs to be extended to support
q Cinder backup
q Cinder volume replication
DB LOG

Protecting Applications from Disasters
[Figure: servers, storage, network, software, configuration]
Disaster Recovery Orchestration

OpenStack Tools
q Applications are defined in OpenStack by
q Heat Orchestration Templates
q However
q Not all applications are template based
q Deployments (including configuration) change over time
q Some definitions are cloud specific, e.g., networks, types
q Heat templates and Stacks don’t stay consistent
q Tools that can create a template from deployment, e.g., Flame, ReHeat
q But, template will only fit the current cloud

OpenStack Tools and Beyond
q Demo:
A technology preview for disaster recovery with IBM Cloud Manager

Ceph Multi-Site & Disaster Recovery (Block) example
q Export snapshots to geographically dispersed data centers
q Provides disaster recovery
q Export incremental snapshots
q Minimize network bandwidth by only sending changes
q Kilo cycle focus to extends the multi-site and disaster recovery options
q  RBD Mirroring
q  Cinder Volume Replication

Ceph Multi-Site & Disaster Recovery (Object) example
q Zones and region support
q  Deploy topologies similar to S3
and others with a global
namespace
q Data center synchronization
q  Back-up full or partial sets of data
between regions
q Read affinity
q  Serve local copies of data to local
users

Disaster Recovery as a Service Catalog
q Pluggable Disaster Recovery policies
q Replication targets can specify different RPO/RTO levels that can be
offered based on the supported backend capabilities
q Disaster Recovery Policies
q  Active - Cold standby
q  Active - Hot standby
q  Active - Active (requires application awareness and transaction integrity)
q  Backup to Cloud / From the Cloud

Extending Heat Orchestration for Disaster Recovery
q Heat can be used to automate
q Add support for Cinder replication
q Need to make Consistency group across OpenStack projects
q Nova Cinder, Trove….
q Stack Snapshot Backup / Rollback
q Enable customization of workload components at recovery site.
q Networks, VM configurations changes, guest agent etc.

The Road Toward Application Consistency
First phase: File system consistency
q Integrate into OpenStack to allow consistent snapshots and
backups
q Nova needs to request QEMU Guest Agent to freeze the file systems
(and applications if fsfreeze-hook is installed) during the snapshot
q Patches has proposed for
Nova and Cinder, targeting
the Kilo release
Source: Hitachi
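The freeze, snapshot, thaw sequence described above can be sketched as follows. The command names `guest-fsfreeze-freeze` and `guest-fsfreeze-thaw` are QEMU guest agent commands; everything else (the `RecordingAgent` stand-in, its `execute` method and the snapshot callback) is hypothetical, and this is not the code from the proposed Nova or Cinder patches.

```python
class RecordingAgent:
    """Stand-in for a guest agent channel; it only records the commands sent."""

    def __init__(self):
        self.sent = []

    def execute(self, command):
        self.sent.append(command)


def consistent_snapshot(agent, take_snapshot):
    """Freeze guest file systems, take the snapshot, and always thaw afterwards."""
    agent.execute("guest-fsfreeze-freeze")
    try:
        return take_snapshot()
    finally:
        # Thaw even if the snapshot fails, so the guest is never left frozen.
        agent.execute("guest-fsfreeze-thaw")


agent = RecordingAgent()
snapshot_id = consistent_snapshot(agent, lambda: "snapshot-1")
print(snapshot_id, agent.sent)
```

The `try`/`finally` is the important design choice: a guest whose file systems stay frozen after a failed snapshot would be worse off than one that was never frozen.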

The Road Toward Application Consistency
Next phase: Consistency at the application level
q Application-Aware on Windows with VSS Support on qemu-ga
q Application notification via Microsoft Volume Shadow Copy Service (VSS)
q Application-Aware on Linux Using qemu-ga Hooks
q Application-consistent snapshots can be created with scripts interacting with the
QEMU guest agent
q The scripts can notify applications to flush their data

Disaster Recovery at Scale
q  Site evacuation holy grail is an automatic planned migration of the
workloads and data from one cloud-scale datacenter to another.
q  New OpenStack HA approaches to help Recovery from infrastructure
failures:
q  Leveraging Pacemaker to provide automated detection of a failed hypervisor
and the recovery of the VMs that were running there.
q  Evacuate instance to a scheduled host was added in Juno
q  Simple tagging API for instances in Nova was accepted for Kilo release
q  Can support automatic-recovery new tag

OpenStack Documentation needs to catch up…
q Join the OpenStack Disaster Recovery Guide
q We have a basic OpenStack High Availability Guide
q http://docs.openstack.org/high-availability-guide/content/
q A very outdated “Recover cloud after disaster” section in the Admin guide
http://docs.openstack.org/admin-guide-cloud/content/section_nova-disaster-
recovery-process.html

Q&A
THANK YOU
