
99.999% Available OpenStack Cloud - A Builder's Guide


High availability is an important and frequently discussed topic for clouds at the infrastructure level. There are several concepts for providing an HA-ready OpenStack, and software-defined storage such as Ceph is designed to be highly available, with no single point of failure.

But what about HA if you bring OpenStack and Ceph together? How do they interact, and what is the impact on the availability of your OpenStack cloud infrastructure from the tenant or application point of view?

How does the design of a classic highly available data center, e.g. with two fire compartments, power backup, and redundant power and network lines, influence your cluster setup? There are many potential failure scenarios. What do they mean for building and managing failure zones, especially for technologies like Ceph, which must be able to form a quorum to keep running?

This talk will cover:
- Failure scenarios and their impact on OpenStack and Ceph availability
- Which components of the cloud need a quorum
- How to set up the infrastructure to ensure a quorum
- How the different quorum devices work together, and whether they guarantee the HA of your cloud
- Pitfalls and solutions


99.999% Available OpenStack Cloud - A Builder's Guide

  1. 99.999% Available OpenStack Cloud - A Builder's Guide. Danny Al-Gaaf (Deutsche Telekom), OpenStack Summit 2015 - Tokyo
  2. Overview ● Motivation ● Availability and SLAs ● Data centers ○ Setup and failure scenarios ● OpenStack and Ceph ○ Architecture and critical components ○ HA setup ○ Quorum? ● OpenStack and Ceph == HA? ○ Failure scenarios ○ Mitigation ● Conclusions
  3. Motivation
  4. NFV Cloud @ Deutsche Telekom ● Data center design ○ Backend DCs ■ Few, but classic DCs ■ High SLAs for infrastructure and services ■ For private/customer data and services ○ Frontend DCs (FDCs) ■ Small, but many ■ Close to the customer ■ Lower SLAs, can fail at any time ■ NFVs: ● Spread over many FDCs ● Failures are handled by the services, not the infrastructure ● Run telco core services on OpenStack/KVM/Ceph
  5. Availability
  6. Availability ● Measured relative to "100 % operational"

         availability | downtime           | classification
         99.9%        | 8.76 hours/year    | high availability
         99.99%       | 52.6 minutes/year  | very high availability
         99.999%      | 5.26 minutes/year  | highest availability
         99.9999%     | 0.526 minutes/year | disaster tolerant
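
The downtime budgets in this table follow directly from the availability figure: a year has 8,760 hours, and the allowed downtime is the unavailable fraction of that. A quick sketch of the arithmetic in Python:

    # Allowed downtime per year for each availability class.
    HOURS_PER_YEAR = 365 * 24  # 8760

    for pct, label in [(99.9, "high availability"),
                       (99.99, "very high availability"),
                       (99.999, "highest availability"),
                       (99.9999, "disaster tolerant")]:
        hours = HOURS_PER_YEAR * (1 - pct / 100.0)
        print("%-9s %8.3f hours/year (%7.3f min) %s"
              % (str(pct) + "%", hours, hours * 60, label))
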
  7. High Availability ● Continuous system availability in case of component failures ● Which availability? ○ Server ○ Network ○ Data center ○ Cloud ○ Application/Service ● End-to-end availability is the most interesting
  8. High Availability ● Calculation ○ Each component contributes to the service availability ■ Infrastructure ■ Hardware ■ Software ■ Processes ○ Likelihood of disaster and failure scenarios ○ The model can get very complex ● SLAs ○ ITIL (IT Infrastructure Library) ○ Planned maintenance may be excluded, depending on the SLA
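
Why the model gets complex: for independent components in a serial chain the availabilities multiply, while redundant instances are down only if all of them fail at once. A minimal sketch with made-up component values:

    # Serial chain: every component must be up for the service to be up.
    def serial(*availabilities):
        a = 1.0
        for x in availabilities:
            a *= x
        return a

    # Parallel redundancy: the service is down only if all replicas are down.
    def parallel(*availabilities):
        down = 1.0
        for x in availabilities:
            down *= (1.0 - x)
        return 1.0 - down

    # Hypothetical values: network, hypervisor, storage in series ...
    single_stack = serial(0.9999, 0.999, 0.9999)
    # ... versus the same stack deployed redundantly twice.
    redundant = parallel(single_stack, single_stack)
    print("single stack: %.6f  redundant: %.6f" % (single_stack, redundant))
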
  9. 9. Data centers
  10. Failure scenarios ● Power outage ○ External ○ Internal ○ Backup UPS/generator ● Network outage ○ External connectivity ○ Internal ■ Cables ■ Switches, routers ● Failure of a server or a component ● Failure of a software service
  11. Failure scenarios ● Human error is still often the leading cause of outages ○ Misconfiguration ○ Accidents ○ Emergency power-off ● Disaster ○ Fire ○ Flood ○ Earthquake ○ Plane crash ○ Nuclear accident
  12. Data Center Tiers
  13. Mitigation ● Identify potential SPoFs ● Use redundant components ● Careful planning ○ Network design (external/internal) ○ Power management (external/internal) ○ Fire suppression ○ Disaster management ○ Monitoring ● Five nines at the DC/HW level is hard to achieve ○ Tier IV is usually too expensive (compared with Tier III or III+) ○ Therefore requires an HA concept at the cloud and application level
  14. Example: Network ● Spine/leaf architecture ● Redundant ○ DC-R ○ Spine switches ○ Leaf switches (ToR) ○ OAM switches ○ Firewall ● Server ○ Redundant NICs ○ Redundant power lines and supplies
  15. Ceph and OpenStack
  16. Architecture: Ceph
  17. Architecture: Ceph Components ● OSDs ○ 10s - 1000s per cluster ○ One per device (HDD/SSD/RAID group, SAN, …) ○ Store objects ○ Handle replication and recovery ● MONs ○ Maintain cluster membership and state ○ Use the Paxos protocol to establish quorum consensus ○ Small, lightweight ○ Odd number
  18. Architecture: Ceph and OpenStack
  19. HA - Critical Components Which services need to be HA? ● Control plane ○ Provisioning, management ○ API endpoints and services ○ Admin nodes ○ Control nodes ● Data plane ○ Steady states ○ Storage ○ Network
  20. HA Setup ● Stateless services ○ No dependencies between requests ○ After the reply, no further attention is required ○ API endpoints (e.g. nova-api, glance-api, ...) or nova-scheduler ● Stateful services ○ An action typically consists of multiple requests ○ Subsequent requests depend on the results of earlier ones ○ Databases, RabbitMQ
  21. HA Setup ● Stateless ○ active/passive: load balance redundant services ○ active/active: load balance redundant services ● Stateful ○ active/passive: bring a replacement resource online ○ active/active: redundant services, all with the same state; state changes are passed to all instances
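
For the stateless active/active case, the usual building block is a load balancer in front of the redundant API endpoints. A minimal HAProxy sketch, as one might deploy for nova-api; the addresses and server names are hypothetical, not from the talk:

    # Load-balance the stateless nova-api across three controllers;
    # any instance can answer any request, so failed nodes are simply
    # taken out of rotation by the health check.
    frontend nova_api
        bind 10.0.0.10:8774
        default_backend nova_api_nodes

    backend nova_api_nodes
        balance roundrobin
        option httpchk GET /
        server ctrl1 10.0.0.11:8774 check
        server ctrl2 10.0.0.12:8774 check
        server ctrl3 10.0.0.13:8774 check
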
  22. OpenStack HA
  23. Quorum? ● Required to decide which cluster partition/member is primary, to prevent data/service corruption ● Examples: ○ Databases ■ MariaDB/Galera, MongoDB, Cassandra ○ Pacemaker/Corosync ○ Ceph monitors ■ Paxos ■ Odd number of MONs required ■ At least 3 MONs for HA, simple majority (2:3, 3:5, 4:7, …) ■ Without quorum: ● no changes to cluster membership (e.g. adding new MONs/OSDs) ● clients can't connect to the cluster
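
"Simple majority" means strictly more than half of all MONs, which is why monitor placement across failure zones decides whether a surviving room can keep the cluster running. A toy check in Python (the room layout is a made-up example):

    # Does Ceph keep its MON quorum if one failure zone is lost?
    mons = {"mon.a": "room-A", "mon.b": "room-A", "mon.c": "room-B"}

    def has_quorum(mons, failed_room):
        alive = [m for m, room in mons.items() if room != failed_room]
        return len(alive) > len(mons) // 2   # simple majority: 2:3, 3:5, ...

    print(has_quorum(mons, "room-A"))  # False: mon.c alone is 1 of 3
    print(has_quorum(mons, "room-B"))  # True: mon.a and mon.b are 2 of 3
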
  24. OpenStack and Ceph == HA?
  25. SPoF ● OpenStack HA ○ No SPoF assumed ● Ceph ○ No SPoF assumed ○ Availability of RBDs is critical to VMs ○ Availability of the RadosGW can easily be managed via HAProxy ● What about failures at a higher level? ○ Data center cores or fire compartments ○ Network ■ Physical ■ Misconfiguration ○ Power
  26. Setup - Two Rooms
  27. Failure scenarios - FC fails
  28. Failure scenarios - FC fails
  29. Failure scenarios - Split brain ● Ceph ○ Quorum selects B ○ Storage in A stops ● OpenStack HA ○ Selects B ○ VMs in B keep running ● Best-case scenario
  30. Failure scenarios - Split brain ● Ceph ○ Quorum selects B ○ Storage in A stops ● OpenStack HA ○ Selects A ○ VMs in A and B stop working ● Worst-case scenario
  31. Other issues ● Replica distribution ○ Two-room setup: ■ 2 or 3 replicas carry the risk of being left with only a single replica ■ Would require 4 replicas (2:2) ● Reduced performance ● Increased traffic and costs ○ Alternative: erasure coding ■ Reduced performance, but less space required ● Spare capacity ○ The remaining room needs spare capacity to restore redundancy ○ Depends on ■ Failure/restore scenario ■ Replication vs. erasure coding ○ Costs
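
To make the replica arithmetic concrete: with all copies split across two rooms, what survives is at most the smaller half. A toy sketch, assuming CRUSH splits the copies as evenly as possible:

    # Copies that survive when the fuller of two rooms is lost.
    def replicas_left(total, rooms=2):
        smaller_half = total // rooms   # even split across rooms
        return smaller_half             # survives if the other room fails

    print(replicas_left(2))  # 1 -> a single copy left, risky during recovery
    print(replicas_left(3))  # 1 -> same risk, hence the slide's warning
    print(replicas_left(4))  # 2 -> the 2:2 split mentioned above
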
  32. Mitigation - Three FCs ● Third FC/failure zone hosting all services ● Usually higher costs ● More resilient against failures ● Better replica distribution ● More east/west traffic
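
With three fire compartments modeled as rooms in the CRUSH hierarchy, Ceph can be told to put each replica into a different room, so losing one compartment costs at most one copy per object. A sketch in the classic CRUSH map rule syntax (the bucket names are assumptions):

    rule replicated_across_rooms {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take default                     # start at the root bucket
        step chooseleaf firstn 0 type room    # one OSD per distinct room
        step emit
    }
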
  33. Mitigation - Quorum Room ● Most DCs have backup rooms ● Only a few servers needed to host the quorum-related services ● Less cost-intensive ● Can mitigate split brain between the FCs (depending on the network layout)
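
In this layout the third Ceph MON lives in the quorum room, so either fire compartment together with the quorum room still holds a 2-of-3 majority. A minimal ceph.conf sketch; host names and addresses are made up:

    [global]
    mon initial members = a, b, c

    [mon.a]
    host = mon-fc1        ; fire compartment 1
    mon addr = 10.0.1.10

    [mon.b]
    host = mon-fc2        ; fire compartment 2
    mon addr = 10.0.2.10

    [mon.c]
    host = mon-quorum     ; tie-breaker in the quorum room
    mon addr = 10.0.3.10
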
  34. Mitigation - Pets vs. Cattle ● NO pets allowed !!! ● Only cloud-ready applications
  35. Mitigation - Failure-tolerant applications ● The tier level is not the most relevant layer ● Applications must build their own clustering mechanisms on top of the DC → increases availability significantly ● Data replication must be done across multiple regions ● In case of a disaster, route traffic to a different DC ● Many VNFs (virtual network functions) already support such setups
  36. Mitigation - Federated Object Stores ● The best way to synchronize and replicate data across multiple DCs is to use object storage ● Sync is done asynchronously ● Open issues: ○ Does not solve the replication of databases ○ Many applications don't support object storage and need to be adapted ○ Applications also need to support regions/zones
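
From the application's point of view this is ordinary S3-style access against the local RadosGW endpoint, while the federated gateways replicate asynchronously between the DCs. A minimal boto sketch; endpoint and credentials are placeholders:

    import boto
    import boto.s3.connection

    # Write to the object store in the local DC; replication to the
    # remote DC happens asynchronously behind the gateway.
    conn = boto.connect_s3(
        aws_access_key_id="ACCESS_KEY",        # placeholder
        aws_secret_access_key="SECRET_KEY",    # placeholder
        host="rgw.dc1.example.com",            # local RadosGW endpoint
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )
    bucket = conn.create_bucket("app-state")
    key = bucket.new_key("snapshots/latest.json")
    key.set_contents_from_string('{"state": "ok"}')
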
  37. Mitigation - Outlook ● “OpenStack follows storage” ○ Use RBDs as fencing devices ○ Extend Ceph MONs ■ Include information about physical placement, similar to the CRUSH map ■ Enable the HA setup to query quorum decisions and map the quorum onto the physical layout ● Passive standby Ceph MONs to ease redeployment of MONs if the quorum fails ○ http://tracker.ceph.com/projects/ceph/wiki/Passive_monitors ● Generic quorum service/library?
  38. Conclusions
  39. Conclusions ● OpenStack and Ceph provide HA if carefully planned ○ Be aware of potential failure scenarios! ○ All quorums need to be kept in sync ○ A third room must be used ○ Replica distribution and spare capacity must be considered ○ Ceph needs more extended quorum information ● The five-nines target is end-to-end ○ Five nines at the data center level is very expensive ○ No pets !!! ○ Distribute applications or services over multiple DCs
  40. Get involved! ● Ceph ○ https://ceph.com/community/contribute/ ○ ceph-devel@vger.kernel.org ○ IRC (OFTC): #ceph, #ceph-devel ○ Ceph Developer Summit ● OpenStack ○ Cinder, Glance, Manila, ...
  41. Q&A - THANK YOU! Danny Al-Gaaf, Senior Cloud Technologist ● danny.al-gaaf@telekom.de ● IRC: dalgaaf ● linkedin.com/in/dalgaaf
