Enterprise applications need to be able to survive large-scale disasters. While some born-on-the-cloud applications have built-in disaster recovery functionality, non-born-on-the-cloud enterprise applications typically expect the infrastructure to provide disaster recovery support. OpenStack provides various building blocks that enable an OpenStack application to survive a disaster; these building blocks are being improved in Juno and Kilo. Some of these building blocks need to be enabled by the OpenStack cloud administrator, and others need to be leveraged by the application deployer. In this presentation, we will review basic disaster recovery concepts, covering when, where, and what is done at each stage of the application cloud life cycle. We will describe the existing building blocks and explain the roles of the cloud administrator and the cloud end user in enabling OpenStack applications to survive a disaster. We will then detail new features in Juno, and features coming in Kilo, that will help enhance OpenStack's disaster recovery support. We will conclude by detailing the remaining gaps and presenting some tools that address these gaps, allowing an application to survive a disaster when running on an OpenStack cloud.
OpenStack Summit Session: https://youtu.be/Dj5sELG9keE
When disaster strikes the cloud: Who, what, when, where and how to recover
1. Accelerating Enterprise OpenStack
When Disaster Strikes the Cloud
Who, What, When, Where and How to Recover
Michael Factor
IBM Research - Haifa
factor@il.ibm.com
Ronen Kat
IBM Research - Haifa
ronenkat@il.ibm.com
Sean Cohen
Red Hat
scohen@redhat.com
2. Talk Outline
- What is disaster recovery?
- Concepts and basics
- Protecting data and applications from disasters
- OpenStack Cinder toolbox for disaster recovery
- Applications are more than just data
- The road ahead: Kilo and beyond
3. What is Disaster Recovery?
According to Wikipedia, Disaster Recovery (DR) is “the process, policies and procedures . . . for recovery . . . of technology infrastructure . . . after a natural or human-induced disaster.”
[Figure: servers, storage, network, software, configuration]
Surviving a disaster requires geographic dispersion
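4. Recovery Point Objective and Recovery Time Objective
Recovery Point Objective (RPO): how far back in time a disaster takes one
Recovery Time Objective (RTO): how long until operational after a disaster
[Figure: timeline centered on the disaster at 0. The RPO axis runs back in time through seconds, minutes, hours, days, and weeks; the RTO axis runs forward through seconds, minutes, hours, days, and weeks. Labels: Replication, Backup, restore, Active site, Hot site]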
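The two definitions above reduce to simple time arithmetic. The following is a minimal sketch, assuming made-up timestamps and objectives; the function names are illustrative and not part of any OpenStack API.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_protected_copy: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: how far back in time the disaster takes us."""
    return disaster - last_protected_copy


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Downtime: how long until the application is operational again."""
    return service_restored - disaster


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """An objective is met when the achieved value does not exceed it."""
    return achieved <= objective


# Hypothetical scenario: nightly backup at 02:00, disaster at 12:00,
# service restored on the secondary site at 18:30.
disaster = datetime(2014, 11, 5, 12, 0)
nightly_backup = datetime(2014, 11, 5, 2, 0)
restored = datetime(2014, 11, 5, 18, 30)

rpo = achieved_rpo(nightly_backup, disaster)
rto = achieved_rto(disaster, restored)

print("Achieved RPO:", rpo)
print("Achieved RTO:", rto)
print("Meets 24h RPO:", meets_objective(rpo, timedelta(hours=24)))
print("Meets 4h RTO:", meets_objective(rto, timedelta(hours=4)))
```

In this made-up scenario the nightly backup satisfies a 24-hour RPO, but a 4-hour RTO would call for a faster recovery path than restoring from backup.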
5. Data and Metadata Consistency
Data consistency
- If a modified datum is available, all data it depends upon is also available
Metadata consistency
- Configuration updates are seen in the same order relative to one another and to data updates
[Figure: Application VM with DB and LOG; DB and LOG copies at the Remote Site]
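The data consistency rule above can be checked mechanically once dependencies are known. This is a minimal sketch under an assumed, simplified model in which each datum names the data it depends on; the item names are hypothetical.

```python
from typing import Dict, Set


def is_data_consistent(available: Set[str], depends_on: Dict[str, Set[str]]) -> bool:
    """True if every available datum has all of its dependencies available."""
    return all(depends_on.get(item, set()) <= available for item in available)


# Hypothetical example: a database page update depends on the log record
# that describes it.
depends_on = {"db_page_v2": {"log_record_17"}}

# The remote site holds both the log record and the page: consistent.
replica_ok = {"log_record_17", "db_page_v2"}
# The remote site holds only the log record: still consistent.
replica_log_only = {"log_record_17"}
# The remote site holds the page without its log record: inconsistent.
replica_bad = {"db_page_v2"}

print(is_data_consistent(replica_ok, depends_on))
print(is_data_consistent(replica_log_only, depends_on))
print(is_data_consistent(replica_bad, depends_on))
```

The third case is what can happen when the DB and LOG volumes are copied to the remote site independently of each other.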
6. OpenStack Cloud Metadata
Virtual networks between the cloud VMs
External network access
Attached volumes
Volume types
Virtual machine flavors
SSH keys for VM access
Virtual machine images
Identities of users
8. Data Protection: Cinder Backup and Restore
- Cinder backup
  - Back up a volume to backup storage
[Figure: Primary Cloud, backup-create, Swift]
9. Data Protection: Cinder Backup and Restore
- Can Cinder restore on secondary cloud?
- Problem: Cinder on secondary cloud is not aware of the backup
[Figure: Primary Cloud, Swift, backup-restore, Secondary Cloud]
10. Data Protection: Cinder Backup and Restore
- Solution: “electronic tape shipping”
  - backup-export
  - backup-import
- Supported in Cinder since Icehouse
[Figure: Primary Cloud, backup-export, Swift, backup reference, backup-import, Secondary Cloud]
11. Data Protection: Cinder Backup and Restore
- After backup-import, Cinder can restore on secondary cloud
  - backup-restore
[Figure: Primary Cloud, Swift, backup-restore, Secondary Cloud]
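The four backup slides above describe one end-to-end flow. The sketch below only assembles the `cinder` command lines for that flow and prints them; it does not contact any cloud. It assumes the positional-argument forms of the `cinder` client of that era, and every ID, service name, and URL is a placeholder. The backup service and backup URL passed to `backup-import` are the values that `backup-export` reports on the primary cloud.

```python
from typing import List, Tuple


def backup_create(volume_id: str) -> List[str]:
    """Primary cloud: back up a volume to backup storage (e.g. Swift)."""
    return ["cinder", "backup-create", volume_id]


def backup_export(backup_id: str) -> List[str]:
    """Primary cloud: export the backup record (the 'backup reference')."""
    return ["cinder", "backup-export", backup_id]


def backup_import(backup_service: str, backup_url: str) -> List[str]:
    """Secondary cloud: make Cinder aware of the exported backup."""
    return ["cinder", "backup-import", backup_service, backup_url]


def backup_restore(backup_id: str) -> List[str]:
    """Secondary cloud: restore the imported backup into a volume."""
    return ["cinder", "backup-restore", backup_id]


def dr_backup_flow(volume_id: str, backup_id: str, backup_service: str,
                   backup_url: str, imported_backup_id: str) -> List[Tuple[str, List[str]]]:
    """Ordered (cloud, command) steps for 'electronic tape shipping'."""
    return [
        ("primary", backup_create(volume_id)),
        ("primary", backup_export(backup_id)),
        ("secondary", backup_import(backup_service, backup_url)),
        ("secondary", backup_restore(imported_backup_id)),
    ]


# All values below are placeholders, not real identifiers.
steps = dr_backup_flow("VOLUME_ID", "BACKUP_ID", "BACKUP_SERVICE",
                       "BACKUP_URL", "IMPORTED_BACKUP_ID")
for cloud, command in steps:
    print(f"[{cloud}] {' '.join(command)}")
```

The imported backup ID is kept as a separate parameter because the sketch makes no assumption about whether the secondary cloud reuses the primary cloud's backup ID.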
12. Data Protection: Cinder Volume replication
- Cinder has initial support for volume replication in the Juno release
- Cinder back-ends can “advertise” support for replication
- A volume created with the replication extra-spec will be allocated on a back-end supporting replication and will be replicated
- Supporting back-ends:
  - IBM Storwize, more expected in Kilo
Volume-type extra specs: “capabilities:replication <is> True”
[Figure: two Cinder back-ends]
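The matching of a volume type's extra specs against the capabilities a back-end advertises can be illustrated with a toy filter. This is a simplified sketch, not Cinder's scheduler code: it handles only the `capabilities:` scope and the `<is>` boolean operator shown above, and the back-end names are hypothetical.

```python
from typing import Dict


def backend_matches(extra_specs: Dict[str, str], capabilities: Dict[str, object]) -> bool:
    """Toy check: does a back-end satisfy every 'capabilities:' extra spec?"""
    for key, required in extra_specs.items():
        scope, _, name = key.partition(":")
        if scope != "capabilities":
            continue  # other scopes are ignored in this sketch
        if required.startswith("<is> "):
            wanted = required[len("<is> "):].strip().lower() == "true"
            if bool(capabilities.get(name, False)) != wanted:
                return False
        elif str(capabilities.get(name)) != required:
            return False
    return True


# Extra spec from the slide: request a back-end that supports replication.
volume_type_extra_specs = {"capabilities:replication": "<is> True"}

# Hypothetical back-ends and the capabilities they advertise.
backends = {
    "storwize-1": {"replication": True},
    "lvm-1": {"replication": False},
    "generic-1": {},
}

eligible = [name for name, caps in backends.items()
            if backend_matches(volume_type_extra_specs, caps)]
print(eligible)
```

Only the back-end that advertises replication is eligible, so a volume of this type would be allocated there.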
13. Data Protection: Cinder Volume replication
- Secondary volume can become primary when promoted
  - replication-promote
- Replication can be reversed following a replication-promote
  - replication-reenable
[Figure: two Cinder back-ends]
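The role swap described above can be pictured with a small state model. This is an illustrative sketch only: the class, its fields, and the assumption that replication stays paused between promote and re-enable are simplifications, not Cinder's actual data model.

```python
from dataclasses import dataclass


@dataclass
class ReplicatedVolume:
    """Toy model of a replicated volume and the roles of its two copies."""
    primary: str
    secondary: str
    replicating: bool = True

    def promote(self) -> None:
        """Secondary becomes primary (models replication-promote)."""
        self.primary, self.secondary = self.secondary, self.primary
        self.replicating = False  # assumption: paused until re-enabled

    def reenable(self) -> None:
        """Resume replication in the new direction (models replication-reenable)."""
        self.replicating = True


# Hypothetical back-end names.
volume = ReplicatedVolume(primary="backend-site-a", secondary="backend-site-b")
volume.promote()
print(volume)
volume.reenable()
print(volume)
```

After both steps the copy at site B serves as primary and replicates back toward site A, which is the "reversed" replication the slide refers to.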
14. Consistency Groups
- New in Juno
- Support for volume grouping for consistency
- Grouping of volumes is based on the volume-type
- Supporting
  - Consistency group snapshots
- Needs to be extended to support
  - Cinder backup
  - Cinder volume replication
[Figure: DB, LOG]
16. OpenStack Tools
- Applications are defined in OpenStack by
  - Heat Orchestration Templates
- However
  - Not all applications are template based
  - Deployments (including configuration) change over time
  - Some definitions are cloud specific, e.g., networks, types
- Heat templates and Stacks don’t stay consistent
  - Tools that can create a template from a deployment, e.g., Flame, ReHeat
  - But the template will only fit the current cloud
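One way to picture the "cloud specific definitions" problem is to keep those values out of the template body and pass them in as parameters, with one environment per cloud. The sketch below builds such a template as a Python dictionary and prints it as JSON (which is also valid YAML). The resource names, sizes, and parameter values are hypothetical, and this is not how Flame or ReHeat generate templates.

```python
import json


def build_template() -> dict:
    """A small HOT-style template whose cloud-specific values are parameters."""
    return {
        "heat_template_version": "2013-05-23",
        "parameters": {
            "image": {"type": "string"},
            "flavor": {"type": "string"},
            "network": {"type": "string"},
            "volume_type": {"type": "string"},
        },
        "resources": {
            "data_volume": {
                "type": "OS::Cinder::Volume",
                "properties": {
                    "size": 10,
                    "volume_type": {"get_param": "volume_type"},
                },
            },
            "app_server": {
                "type": "OS::Nova::Server",
                "properties": {
                    "image": {"get_param": "image"},
                    "flavor": {"get_param": "flavor"},
                    "networks": [{"network": {"get_param": "network"}}],
                },
            },
            "data_attachment": {
                "type": "OS::Cinder::VolumeAttachment",
                "properties": {
                    "instance_uuid": {"get_resource": "app_server"},
                    "volume_id": {"get_resource": "data_volume"},
                },
            },
        },
    }


# Hypothetical per-cloud environments: same template, different local names.
primary_env = {"parameters": {"image": "app-image", "flavor": "m1.small",
                              "network": "primary-net", "volume_type": "replicated"}}
secondary_env = {"parameters": {"image": "app-image-dr", "flavor": "m1.small",
                                "network": "recovery-net", "volume_type": "standard"}}

template = build_template()
print(json.dumps(template, indent=2))
```

Parameterization covers only part of the gap: it does not address deployments that drift from their template over time.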
17. OpenStack Tools and Beyond
- Demo: A technology preview for disaster recovery with IBM Cloud Manager
19. Ceph Multi-Site & Disaster Recovery (Block) example
- Export snapshots to geographically dispersed data centers
  - Provides disaster recovery
- Export incremental snapshots
  - Minimize network bandwidth by only sending changes
- Kilo cycle focus: extend the multi-site and disaster recovery options
  - RBD Mirroring
  - Cinder Volume Replication
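The bandwidth saving from incremental snapshots comes from shipping only the blocks that changed between two snapshots. The following is a toy illustration of that idea on in-memory lists of blocks; it is not Ceph's on-disk or wire format, and it ignores images that shrink.

```python
from typing import Dict, List


def diff_snapshots(old: List[bytes], new: List[bytes]) -> Dict[int, bytes]:
    """Return only the blocks that changed or were added since the old snapshot."""
    return {i: block for i, block in enumerate(new)
            if i >= len(old) or old[i] != block}


def apply_diff(base: List[bytes], delta: Dict[int, bytes]) -> List[bytes]:
    """Rebuild the new snapshot at the remote site from the base plus the delta."""
    result = list(base)
    for index in sorted(delta):
        if index < len(result):
            result[index] = delta[index]
        else:
            result.append(delta[index])  # appended blocks are contiguous
    return result


# Hypothetical 4-byte blocks: one block modified, one block appended.
snapshot_1 = [b"aaaa", b"bbbb", b"cccc"]
snapshot_2 = [b"aaaa", b"BBBB", b"cccc", b"dddd"]

delta = diff_snapshots(snapshot_1, snapshot_2)
remote_copy = apply_diff(snapshot_1, delta)

print("Blocks sent:", sorted(delta))
print("Remote copy matches:", remote_copy == snapshot_2)
```

Here two of four blocks cross the network instead of the whole image; the saving grows with the share of data left unchanged between snapshots.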
20. Ceph Multi-Site & Disaster Recovery (Object) example
- Zones and region support
  - Deploy topologies similar to S3 and others with a global namespace
- Data center synchronization
  - Back up full or partial sets of data between regions
- Read affinity
  - Serve local copies of data to local users
21. Disaster Recovery as a Service Catalog
- Pluggable Disaster Recovery policies
  - Replication targets can specify different RPO/RTO levels that can be offered based on the supported backend capabilities
- Disaster Recovery Policies
  - Active - Cold standby
  - Active - Hot standby
  - Active - Active (requires application awareness and transaction integrity)
  - Backup to Cloud / From the Cloud
22. Extending Heat Orchestration for Disaster Recovery
- Heat can be used to automate
  - Add support for Cinder replication
  - Need consistency groups that span OpenStack projects
    - Nova, Cinder, Trove…
- Stack Snapshot Backup / Rollback
- Enable customization of workload components at the recovery site
  - Networks, VM configuration changes, guest agent, etc.
23. The Road Toward Application Consistency
First phase: File system consistency
- Integrate into OpenStack to allow consistent snapshots and backups
  - Nova needs to request the QEMU Guest Agent to freeze the file systems (and applications if fsfreeze-hook is installed) during the snapshot
- Patches have been proposed for Nova and Cinder, targeting the Kilo release
Source: Hitachi
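The freeze, snapshot, thaw ordering described above can be sketched as follows. The guest agent here is a stand-in class that only records the messages it would send; `guest-fsfreeze-freeze` and `guest-fsfreeze-thaw` are QEMU Guest Agent commands, while the class, the helper names, and the snapshot callback are hypothetical and do not reflect the proposed Nova or Cinder patches.

```python
import json
from contextlib import contextmanager
from typing import Callable, Iterator, List


class FakeGuestAgent:
    """Stand-in for a guest agent channel; records messages instead of sending."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    def execute(self, command: str) -> None:
        self.sent.append(json.dumps({"execute": command}))

    def commands(self) -> List[str]:
        return [json.loads(message)["execute"] for message in self.sent]


@contextmanager
def frozen_filesystems(agent: FakeGuestAgent) -> Iterator[None]:
    """Freeze guest file systems, and always thaw them afterwards."""
    agent.execute("guest-fsfreeze-freeze")
    try:
        yield
    finally:
        agent.execute("guest-fsfreeze-thaw")


def consistent_snapshot(agent: FakeGuestAgent, take_snapshot: Callable[[], str]) -> str:
    """Take a snapshot while the guest file systems are frozen."""
    with frozen_filesystems(agent):
        return take_snapshot()


agent = FakeGuestAgent()
snapshot_id = consistent_snapshot(agent, lambda: "snapshot-1")
print(snapshot_id, agent.commands())
```

Thawing in a `finally` block matters: a guest left frozen after a failed snapshot would block its own writes.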
24. The Road Toward Application Consistency
Next phase: Consistency at the application level
- Application-Aware on Windows with VSS Support on qemu-ga
  - Application notification via Microsoft Volume Shadow Copy Service (VSS)
- Application-Aware on Linux Using qemu-ga Hooks
  - Application-consistent snapshots can be created with scripts interacting with the QEMU guest agent
  - The scripts can notify applications to flush their data
25. Disaster Recovery at Scale
- The holy grail of site evacuation is an automatic, planned migration of the workloads and data from one cloud-scale datacenter to another.
- New OpenStack HA approaches help recovery from infrastructure failures:
  - Leveraging Pacemaker to provide automated detection of a failed hypervisor and the recovery of the VMs that were running there.
  - Evacuating an instance to a scheduled host was added in Juno
  - A simple tagging API for instances in Nova was accepted for the Kilo release
    - Can support a new automatic-recovery tag
26. OpenStack Documentation needs to catch up…
- Join the OpenStack Disaster Recovery Guide
- We have a basic OpenStack High Availability Guide
  - http://docs.openstack.org/high-availability-guide/content/
- A very outdated “Recover cloud after disaster” section in the Admin guide
  - http://docs.openstack.org/admin-guide-cloud/content/section_nova-disaster-recovery-process.html
27. Accelerating Enterprise OpenStack
Q&A
THANK YOU