OPENSTACK AT 99.999% AVAILABILITY
WITH CEPH
Danny Al-Gaaf (Deutsche Telekom)
Deutsche OpenStack Tage 2016 - Cologne
● Motivation
● Availability and SLAs
● Data centers
○ Setup and failure scenarios
● OpenStack and Ceph
○ Architecture and Critical Components
○ HA setup
○ Quorum?
● OpenStack and Ceph == HA?
○ Failure scenarios
○ Mitigation
● Conclusions
Overview
2
Motivation
NFV Cloud @ Deutsche Telekom
● Datacenter design
○ Backend DCs
■ Few but classic DCs
■ High SLAs for infrastructure and services
■ For private/customer data and services
○ Frontend DCs
■ Small but many
■ Close to the customer
■ Lower SLAs, can fail at any time
■ NFVs:
● Spread over many FDCs
● Failures are handled by services and not the infrastructure
● Run telco core services @OpenStack/KVM/Ceph
4
Availability
High Availability
● Continuous system availability in case of component failures
● Which availability?
○ Server
○ Network
○ Datacenter
○ Cloud
○ Application/Service
● End-to-end availability matters most
6
availability downtime/year classification
99.9% 8.76 hours high availability
99.99% 52.6 minutes very high availability
99.999% 5.26 minutes highest availability
99.9999% 0.526 minutes disaster tolerant
High Availability
● Calculation
○ Each component contributes to the service availability (see the sketch below)
■ Infrastructure
■ Hardware
■ Software
■ Processes
○ Likelihood of disaster and failure scenarios
○ Model can get very complex
○ Hard to get all required numbers
● SLAs
○ ITIL (IT Infrastructure Library)
○ Planned maintenance may be excluded, depending on the SLA
7
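A minimal sketch of this serial-chain calculation, with purely illustrative per-component availabilities (not Deutsche Telekom figures): when every component must be up for the service to work, the individual availabilities multiply, and a few three- and four-nine components already push the end-to-end result well below five nines.

```python
# Sketch: end-to-end availability of serially dependent components.
# The per-component numbers are illustrative assumptions only.

components = {
    "power/cooling":     0.9999,
    "network":           0.9999,
    "server hardware":   0.999,
    "Ceph storage":      0.9999,
    "OpenStack control": 0.999,
}

e2e = 1.0
for availability in components.values():
    e2e *= availability                      # serial chain: all must be up

downtime_minutes = (1.0 - e2e) * 365 * 24 * 60
print(f"end-to-end availability: {e2e:.4%}")
print(f"expected downtime: {downtime_minutes:.0f} min/year")
```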
Data centers
Failure scenarios
● Power outage
○ External
○ Internal
○ Backup UPS/Generator
● Network outage
○ External connectivity
○ Internal
■ Cables
■ Switches, routers
● Failure of:
○ Cooling
○ Server or component
○ Software services
9
Failure scenarios
● Human error
○ Misconfiguration
○ Accidents
○ Emergency power-off
○ Often the leading cause of outages
● Disaster
○ Fire
○ Flood
○ Earthquake
○ Plane crash
○ Nuclear accident
10
Data Center Tiers
11
Mitigation
● Identify potential SPoF
● Use redundant components
● Careful planning
○ Network design (external / internal)
○ Power management (external / internal)
○ Fire suppression
○ Disaster management
○ Monitoring
● 5-nines at DC/HW level is hard to achieve
○ Tier IV often too expensive (compared with Tier III or III+)
○ Even Tier IV does not provide 5-nines
○ Requires an HA concept at cloud and application level (see the redundancy math below)
12
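The math behind this, as a short sketch with illustrative numbers: components in a serial chain multiply their availabilities, while redundant, independent instances only fail together. This is why redundancy at the cloud/application level can lift a modest per-DC availability past five nines, assuming independent failures and instant failover, which real deployments only approximate.

```latex
% Serial chain: every component must be up.
A_{\text{serial}} = \prod_i A_i
\qquad
% Redundancy: the service fails only if all n independent instances fail.
A_{\text{redundant}} = 1 - \prod_{i=1}^{n} \bigl(1 - A_i\bigr)

% Illustrative example: two independent DCs at 99.9% each
A_{\text{2 DCs}} = 1 - (1 - 0.999)^2 = 0.999999
% i.e. six nines, assuming independence and instant failover
```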
Example: Network
● Spine/leaf architecture
● Redundant
○ DC-R
○ Spine switches
○ Leaf switches (ToR)
○ OAM switches
○ Firewall
● Server
○ Redundant NICs
○ Redundant power lines and supplies
13
Ceph and OpenStack
Architecture: Ceph
15
Architecture: Ceph Components
● OSDs
○ 10s - 1000s per cluster
○ One per device (HDD/SSD/RAID Group, SAN …)
○ Store objects
○ Handle replication and recovery
● MONs:
○ Maintain cluster membership and states
○ Use the Paxos protocol to establish quorum consensus
○ Small, lightweight
○ Odd number of MONs (see the client sketch below)
16
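As a small illustration of the MON role, a hedged sketch using the python-rados bindings to ask the monitors for their current quorum. It assumes a reachable cluster, /etc/ceph/ceph.conf and a client.admin keyring on the local host.

```python
import json
import rados  # python-rados bindings shipped with Ceph

# Connect via the MONs (they hand out the cluster maps) and query quorum state.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

cmd = json.dumps({"prefix": "quorum_status", "format": "json"})
ret, out, err = cluster.mon_command(cmd, b"")
if ret == 0:
    status = json.loads(out)
    print("MONs in quorum:", status.get("quorum_names"))
else:
    print("mon_command failed:", err)

cluster.shutdown()
```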
Architecture: Ceph and OpenStack
17
HA - Critical Components
Which services need to be HA?
● Control plane
○ Provisioning, management
○ API endpoints and services
○ Admin nodes
○ Control nodes
● Data plane
○ Steady states
○ Storage
○ Network
18
HA Setup
● Stateless services
○ No dependency between requests
○ After reply no further attention required
○ API endpoints (e.g. nova-api, glance-api,...) or nova-scheduler
● Stateful service
○ An action typically consists of multiple requests
○ Subsequent requests depend on the results of earlier requests
○ Databases, RabbitMQ
19
OpenStack HA
20
Quorum?
● Required to decide which cluster partition/member is primary, to prevent data/service corruption
● Examples:
○ Databases
■ MariaDB/Galera, MongoDB, Cassandra
○ Pacemaker/corosync
○ Ceph Monitors
■ Paxos
■ Odd number of MONs required
■ At least 3 MONs for HA, simple majority (2:3, 3:5, 4:7, …; see the sketch below)
■ Without quorum:
● No changes to cluster membership (e.g. adding new MONs/OSDs)
● Clients can’t connect to the cluster
21
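A minimal sketch of the simple-majority rule behind these ratios; the 4-MON row shows why adding an even monitor buys no extra failure tolerance.

```python
# Sketch: simple-majority quorum size and failure tolerance for Ceph MONs.

def quorum_size(n_mons: int) -> int:
    """Smallest simple majority of n monitors."""
    return n_mons // 2 + 1

for n in (3, 4, 5, 7):
    q = quorum_size(n)
    tolerated = n - q        # MONs that may fail while quorum survives
    print(f"{n} MONs -> quorum of {q}, tolerates {tolerated} failure(s)")

# 3 MONs -> quorum of 2, tolerates 1 failure
# 4 MONs -> quorum of 3, still tolerates only 1 failure (hence odd counts)
# 5 MONs -> quorum of 3, tolerates 2 failures
# 7 MONs -> quorum of 4, tolerates 3 failures
```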
OpenStack and Ceph == HA ?
SPoF
● OpenStack HA
○ No SPoF assumed
● Ceph
○ No SPoF assumed
○ Availability of RBDs is critical to VMs
○ Availability of RadosGW can be easily managed via HAProxy
● What about failures at a higher level?
○ Data center cores or fire compartments
○ Network
■ Physical
■ Misconfiguration
○ Power
23
Setup - Two Rooms
24
Failure scenarios - FC fails
25
Failure scenarios - FC fails
26
Failure scenarios - Split brain
27
● Ceph:
○ Quorum selects B
○ Storage in A stops
● OpenStack HA:
○ Selects B
○ VMs in B still running
● Best-case scenario
Failure scenarios - Split brain
28
● Ceph:
○ Quorum selects B
○ Storage in A stops
● OpenStack HA:
○ Selects A
○ VMs in A and B stop working
● Worst-case scenario
Other issues
● Replica distribution
○ Two room setup:
■ 2 or 3 replicas carry the risk of having only one replica left
■ Would require 4 replicas (2:2); see the sketch below
● Reduced performance
● Increased traffic and costs
○ Alternative: erasure coding
■ Reduced performance, less space required
● Spare capacity
○ Remaining room requires spare capacity to restore
○ Depends on
■ Failure/restore scenario
■ Replication vs. erasure coding
○ Costs
29
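A small sketch of the replica math behind the two-room problem (illustrative only, not a CRUSH simulation): replicas are spread as evenly as possible across rooms, then the fullest room is lost.

```python
# Sketch: replicas surviving the loss of one room for different layouts.

from itertools import cycle

def replicas_left_after_room_loss(n_rooms: int, n_replicas: int) -> int:
    """Place replicas round-robin across rooms, then drop the fullest room."""
    per_room = [0] * n_rooms
    for room, _ in zip(cycle(range(n_rooms)), range(n_replicas)):
        per_room[room] += 1
    return n_replicas - max(per_room)

for rooms, replicas in [(2, 2), (2, 3), (2, 4), (3, 3)]:
    left = replicas_left_after_room_loss(rooms, replicas)
    print(f"{rooms} rooms, {replicas} replicas -> {left} left after losing a room")

# Two rooms with 2 or 3 replicas leave a single copy; 4 replicas (2:2) or a
# third room (one replica per room) keep at least two copies available.
```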
Mitigation - Three FCs
30
● Third FC/failure zone hosting all services
● Usually higher costs
● More resistant to failures
● Better replica distribution
● More east/west traffic
Mitigation - Quorum Room
31
● Most DCs have backup rooms
● Only a few servers needed to host quorum-related services
● Less cost intensive
● Mitigates split brain between FCs
Mitigation - Applications: First Rule
32
Mitigation - Applications: Third Rule
33
Mitigation - Applications: Third Rule
34
Mitigation - Applications: Pets vs Cattle
35
Mitigation - Failure tolerant applications
36
● DC tier level is not the most relevant factor
● Applications must build their own cluster mechanisms on top of the DC
→ this increases the service availability significantly
● Data replication must be done across multiple regions
● In case of a disaster, traffic goes to the remaining DCs
Mitigation - Federated Object Stores
37
● Use object storage for persistent data
● Synchronize and replicate across multiple DCs, sync in background
Open issues:
● Replication of databases
● Applications:
○ Need to support object storage
○ Need to support regions/zones (see the sketch below)
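A hedged sketch of what region/zone support can look like on the client side, assuming two hypothetical RadosGW S3 endpoints (one per DC) and placeholder credentials; RadosGW multisite replication keeps the zones in sync in the background.

```python
# Sketch: region-aware reads against a federated object store (boto3/S3 API).
# Endpoints and credentials below are illustrative placeholders.

import boto3
from botocore.exceptions import BotoCoreError, ClientError

ENDPOINTS = [
    "https://rgw.dc-a.example.net",   # local zone (hypothetical)
    "https://rgw.dc-b.example.net",   # remote zone (hypothetical)
]

def s3_client(endpoint: str):
    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id="ACCESS_KEY",       # placeholder
        aws_secret_access_key="SECRET_KEY",   # placeholder
    )

def get_object_with_failover(bucket: str, key: str) -> bytes:
    """Read from the local zone first, fall back to the remote zone."""
    last_error = None
    for endpoint in ENDPOINTS:
        try:
            obj = s3_client(endpoint).get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()
        except (BotoCoreError, ClientError) as exc:
            last_error = exc                  # try the next DC
    raise last_error
```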
Mitigation - Outlook
● “Compute follows Storage”
○ Use RBDs as fencing devices in OpenStack HA setup
○ Extend Ceph MONs
■ Include information about physical placement similar to CRUSH map
■ Enable HA setup to monitor/query quorum decisions and map to physical layout
● Passive standby Ceph MONs to ease deployment of MONs if quorum fails
○ http://tracker.ceph.com/projects/ceph/wiki/Passive_monitors
● Generic quorum service/library?
38
Conclusions
Conclusions
● OpenStack and Ceph provide HA if carefully planned
○ Be aware of potential failure scenarios!
○ All quorum decisions must be in sync
○ Third room must be used
○ Replica distribution and spare capacity must be considered
○ Ceph needs more extended quorum information
● The five-9's target is end-to-end (E2E)
○ Five 9's at data center level is very expensive
○ NO PETS, NO PETS, NO PETS !!!
○ Distribute applications or services over multiple DCs
40
Get involved!
● Ceph
○ https://ceph.com/community/contribute/
○ ceph-devel@vger.kernel.org
○ IRC: OFTC
■ #ceph,
■ #ceph-devel
● OpenStack
○ Cinder, Glance, Manila, ...
41
danny.al-gaaf@telekom.de
dalgaaf
blog.bisect.de
@dannnyalgaaf
linkedin.com/in/dalgaaf
xing.com/profile/Danny_AlGaaf
Danny Al-Gaaf
Senior Cloud Technologist
Q&A - THANK YOU!
