Problem: The cloud can fail. Online businesses that rely on and benefit most from the cloud don’t have the skills to handle failure.
Disaster Recovery: process, policies and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organisation after a natural or human-induced disaster (according to Wikipedia).
RPO (Recovery Point Objective): “maximum tolerable period in which data might be lost from an IT Service due to a Major incident…” (according to Wikipedia).
RTO (Recovery Time Objective): “duration of time and a service level within which a business process must be restored after a disaster…” (according to Wikipedia).
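As a rough illustration of how these two objectives turn into numbers, the sketch below estimates a worst-case RPO and an RTO for a snapshot-based standby. The model (RPO as snapshot interval plus replication time, RTO as the sum of the recovery steps) and every figure in it are assumptions chosen for illustration, not measurements from this work.

```python
from datetime import timedelta


def worst_case_rpo(snapshot_interval: timedelta,
                   replication_time: timedelta) -> timedelta:
    """Worst-case data-loss window under the assumed model.

    If a disaster strikes just before the newest snapshot finishes
    replicating, the standby only holds the previous snapshot, so the
    loss is one full interval plus the time a replication takes.
    """
    return snapshot_interval + replication_time


def estimated_rto(*recovery_steps: timedelta) -> timedelta:
    """Sum the (assumed sequential) steps needed to bring the standby up."""
    return sum(recovery_steps, timedelta())


# Hypothetical figures: hourly snapshots that take 15 minutes to replicate.
rpo = worst_case_rpo(timedelta(minutes=60), timedelta(minutes=15))

# Hypothetical recovery steps: boot instances, attach volumes, switch DNS.
rto = estimated_rto(timedelta(minutes=5),
                    timedelta(minutes=10),
                    timedelta(minutes=5))

print("worst-case RPO:", rpo)
print("estimated RTO:", rto)
```

Shortening the snapshot interval or the replication time lowers the RPO in this model, which is why the transfer optimisations later in the deck matter.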
Our Goal: Without re-architecting your application, provide a configurable warm standby solution with a known, consistent RPO that reduces RTO and minimises business impact.
Goals and Challenges
Replicate the application over to OpenStack in case of a disaster
 – Preserve the running environment of the application, including:
   • Compute instances
   • Networks
   • DNS
Minimise RTO and RPO AND cost!
mypizzashop.com.au: Public IP / Load Balanced; Web front end; Apache/Nginx/IIS
app.mypizzashop.com.au: Private IP; Application Processing/memcache
db.mypizzashop.com.au: Private IP; Database; MySQL/PostgreSQL/MSSQL
Architecting for DR in Cloud
Virtualise your servers
 – Snapshotting support in the hypervisor, primarily at the disk level
Use Dynamic DNS solutions
 – E.g. Route 53, Anycast DNS
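To make the dynamic DNS point concrete, here is a minimal sketch of the record change a failover would need when Route 53 is the DNS provider. The function name, the 60-second TTL and the standby address (taken from a documentation-only IP range) are illustrative assumptions; the dictionary follows the shape Route 53 expects for a change batch.

```python
def failover_change_batch(record_name: str, standby_ip: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that repoints an A record.

    UPSERT creates the record if it is missing and overwrites it otherwise.
    A short TTL (assumed here) lets clients pick up the change quickly.
    """
    return {
        "Comment": "DR failover: point the record at the standby site",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": standby_ip}],
                },
            }
        ],
    }


# Hypothetical standby address for the public web front end.
batch = failover_change_batch("mypizzashop.com.au.", "203.0.113.10")
print(batch["Changes"][0]["ResourceRecordSet"])
```

With boto3, a batch like this could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the hosted zone ID.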
Compatibility across IaaS Clouds
Cloud Provider | Framework  | Compute Instance | Object Store | Block Storage | Network | Security Group
AWS            | Custom     | ✓                | ✓            | ✓             | DHCP    | ✓
Rackspace      | Custom     | ✓                | ✓            | ✗             | STATIC  | ✗
Ninefold       | CloudStack | ✓                | ✓            | ✓             | DHCP    | ✓
TryStack       | OpenStack  | ✓                | ✓            | ✓             | DHCP    | ✓
HP Cloud       | OpenStack  | ✓                | ✓            | ✗             | DHCP    | ✓
 • Replication from one cloud to another is NOT always possible
 • Some clouds do not have all the technology pieces (e.g., Block Storage)
 • Minimum requirements for replicating application servers:
   • Compute instance and persistent storage, such as object store or block storage
   • Snapshot service (to ensure point-in-time consistency)
   • Hypervisor support (e.g., PVGrub)
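The compatibility table can also be read as data. The sketch below encodes the five rows and checks the first minimum requirement (a compute instance plus object store or block storage). The class and function names are illustrative assumptions, and snapshot service and hypervisor support are not checked because the table has no columns for them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cloud:
    name: str
    framework: str
    compute: bool
    object_store: bool
    block_storage: bool
    network: str
    security_group: bool


# Rows copied from the compatibility table.
CLOUDS = [
    Cloud("AWS", "Custom", True, True, True, "DHCP", True),
    Cloud("Rackspace", "Custom", True, True, False, "STATIC", False),
    Cloud("Ninefold", "CloudStack", True, True, True, "DHCP", True),
    Cloud("TryStack", "OpenStack", True, True, True, "DHCP", True),
    Cloud("HP Cloud", "OpenStack", True, True, False, "DHCP", True),
]


def meets_minimum(cloud: Cloud) -> bool:
    """Compute instance plus at least one kind of persistent storage."""
    return cloud.compute and (cloud.object_store or cloud.block_storage)


def missing_pieces(cloud: Cloud) -> List[str]:
    """List the technology pieces the table marks as unavailable."""
    checks = [
        ("compute instance", cloud.compute),
        ("object store", cloud.object_store),
        ("block storage", cloud.block_storage),
        ("security group", cloud.security_group),
    ]
    return [label for label, available in checks if not available]


for cloud in CLOUDS:
    print(cloud.name, meets_minimum(cloud), missing_pieces(cloud))
```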
Overview of DR Process
AWS: Take snapshot; Create volume; Send to storage
OpenStack: Download from storage; Mount new instance; Partition
Building DR using OpenStack
Progress:
 – Deploying OpenStack in our NICTA lab
 – Successfully replicated AWS compute instances to OpenStack
   • In Rackspace OpenStack public cloud (private beta)
   • Instances created from standard 64-bit EXT3 AWS openSUSE image
Requirements:
 – Xen support for PVGrub
 – Write access to partition table
 – Network support
Problems
Latency
Point in Time
Log and replay / transactional
How do modern databases handle broken transactions / problem disks?
Rollback
Optimisations: Incremental Backup
Typical AWS system volume is around 10GB
Replication is tricky for large data volumes
 – Initial backup:
   • Send the whole data volume (unavoidable!)
   • Optimise by compression and skipping empty space (0’s)
 – Subsequent backups:
   • Incremental – partition a volume into chunks and resend only the difference (the ‘delta’)
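The two backup phases above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the 4 MB chunk size, the SHA-256 digests used to detect changed chunks, zlib compression, and all function names are assumptions. It also assumes the volume keeps a fixed size and that the destination starts out zero-filled, so all-zero chunks can be skipped on the first pass.

```python
import hashlib
import zlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size; tune for the workload


def split_chunks(volume: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (chunk_number, chunk_bytes) pairs covering the whole volume."""
    for offset in range(0, len(volume), chunk_size):
        yield offset // chunk_size, volume[offset:offset + chunk_size]


def initial_backup(volume: bytes, chunk_size: int = CHUNK_SIZE):
    """First pass: send every chunk, compressed, except all-zero ones.

    Returns (payload, index): payload maps chunk number to compressed
    bytes, index maps chunk number to its digest for later comparison.
    """
    payload, index = {}, {}
    for number, chunk in split_chunks(volume, chunk_size):
        index[number] = hashlib.sha256(chunk).hexdigest()
        if any(chunk):  # skip empty space (chunks made only of zero bytes)
            payload[number] = zlib.compress(chunk)
    return payload, index


def incremental_backup(volume: bytes, previous_index: dict,
                       chunk_size: int = CHUNK_SIZE):
    """Later passes: resend only chunks whose digest changed (the delta)."""
    payload, index = {}, {}
    for number, chunk in split_chunks(volume, chunk_size):
        index[number] = hashlib.sha256(chunk).hexdigest()
        if previous_index.get(number) != index[number]:
            payload[number] = zlib.compress(chunk)
    return payload, index


def apply_backup(target: bytearray, payload: dict,
                 chunk_size: int = CHUNK_SIZE) -> None:
    """Write received chunks into the destination volume in place."""
    for number, blob in payload.items():
        chunk = zlib.decompress(blob)
        start = number * chunk_size
        target[start:start + len(chunk)] = chunk
```

A typical cycle would call `initial_backup` once, keep the returned index on the source side, and pass it to `incremental_backup` at each later snapshot; the destination applies every payload with `apply_backup`.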
Large Data Transfer Across Cloud Datacenters: Why so slow?
Optimisations: Large Data Transfer Across Cloud Datacenters for DR
Problem: Transferring large data volumes is slow
 – Where is the bottleneck?
   • Reading from the source volume? YES!!
   • Transferring across LAN/WAN?
   • Writing to destination volume?
Our solution: Rapidly cloning data volumes from snapshots
 – Parallel transfers
[Chart: Data Transfer Evaluations, 1 Clone vs 4 Clones; series: Volume Scan (MB/s) and End-to-end Transfer (MB/s); bar labels: 190, 140, 50, 40]
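Since reading the source volume is the bottleneck, the parallel-transfer idea can be sketched as several workers that each read a different byte range from a different clone of the volume. This is a simplified local illustration under stated assumptions: the clones are plain files with identical content, the destination is a local file, and the function names and 1 MB block size are hypothetical. A real transfer would send the ranges over the network instead.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1024 * 1024  # assumed read size per I/O call


def copy_range(src_path: str, dst_path: str, start: int, length: int) -> int:
    """Copy `length` bytes starting at `start` from one clone to the target."""
    remaining = length
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(start)
        dst.seek(start)
        while remaining > 0:
            data = src.read(min(BLOCK, remaining))
            if not data:  # clone is shorter than expected
                break
            dst.write(data)
            remaining -= len(data)
    return length - remaining


def parallel_copy(clone_paths, dst_path: str, size: int) -> int:
    """Split the volume into one contiguous range per clone and copy them
    concurrently. Returns the total number of bytes copied."""
    with open(dst_path, "wb") as dst:
        dst.truncate(size)  # pre-size the target so workers can seek into it
    workers = len(clone_paths)
    share = -(-size // workers)  # ceiling division
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = []
        for i, path in enumerate(clone_paths):
            start = i * share
            length = max(0, min(share, size - start))
            jobs.append(pool.submit(copy_range, path, dst_path, start, length))
        return sum(job.result() for job in jobs)
```

Each worker opens its own file handles, so the reads do not share a file position. Passing four clone paths corresponds to the "4 Clones" configuration on the slide.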