OPENSTACK AT 99.999% AVAILABILITY
WITH CEPH
Danny Al-Gaaf (Deutsche Telekom)
Deutsche OpenStack Tage 2016 - Cologne
● Motivation
● Availability and SLAs
● Data centers
○ Setup and failure scenarios
● OpenStack and Ceph
○ Architecture and Critical Components
○ HA setup
○ Quorum?
● OpenStack and Ceph == HA?
○ Failure scenarios
○ Mitigation
● Conclusions
Overview
2
Motivation
NFV Cloud @ Deutsche Telekom
● Datacenter design
○ Backend DCs
■ Few but classic DCs
■ High SLAs for infrastructure and services
■ For private/customer data and services
○ Frontend DCs
■ Small but many
■ Close to the customer
■ Lower SLAs, can fail at any time
■ NFVs:
● Spread over many FDCs
● Failures are handled by services and not the infrastructure
● Run telco core services @OpenStack/KVM/Ceph
4
Availability
High Availability
● Continuous system availability in case of component failures
● Which availability?
○ Server
○ Network
○ Datacenter
○ Cloud
○ Application/Service
● End-to-end availability matters most
6
availability downtime/year classification
99.9% 8.76 hours high availability
99.99% 52.6 minutes very high availability
99.999% 5.26 minutes highest availability
99.9999% 0.526 minutes disaster tolerant
High Availability
● Calculation
○ Each component contributes to the service availability (see the sketch below)
■ Infrastructure
■ Hardware
■ Software
■ Processes
○ Likelihood of disaster and failure scenarios
○ Model can get very complex
○ Hard to get all required numbers
● SLAs
○ ITIL (IT Infrastructure Library)
○ Planned maintenance may be excluded, depending on the SLA
7
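A minimal sketch of this serial-chain calculation, with purely illustrative per-component availabilities (not Deutsche Telekom figures): when every component must be up for the service to work, the individual availabilities multiply, and a few three- and four-nine components already push the end-to-end result well below five nines.

```python
# Sketch: end-to-end availability of serially dependent components.
# The per-component numbers are illustrative assumptions only.

components = {
    "power/cooling":     0.9999,
    "network":           0.9999,
    "server hardware":   0.999,
    "Ceph storage":      0.9999,
    "OpenStack control": 0.999,
}

e2e = 1.0
for availability in components.values():
    e2e *= availability                      # serial chain: all must be up

downtime_minutes = (1.0 - e2e) * 365 * 24 * 60
print(f"end-to-end availability: {e2e:.4%}")
print(f"expected downtime: {downtime_minutes:.0f} min/year")
```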
Data centers
Failure scenarios
● Power outage
○ External
○ Internal
○ Backup UPS/Generator
● Network outage
○ External connectivity
○ Internal
■ Cables
■ Switches, routers
● Failure of:
○ Cooling
○ Server or component
○ Software services
9
Failure scenarios
● Human error
○ Misconfiguration
○ Accidents
○ Emergency power-off
○ Often the leading cause of outages
● Disaster
○ Fire
○ Flood
○ Earthquake
○ Plane crash
○ Nuclear accident
10
Data Center Tiers
11
Mitigation
● Identify potential SPoF
● Use redundant components
● Careful planning
○ Network design (external / internal)
○ Power management (external / internal)
○ Fire suppression
○ Disaster management
○ Monitoring
● 5-nines at DC/HW level is hard to achieve
○ Tier IV often too expensive (compared with Tier III or III+)
○ Even Tier IV does not provide 5-nines
○ Requires an HA concept at cloud and application level (see the redundancy math below)
12
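The math behind this, as a short sketch with illustrative numbers: components in a serial chain multiply their availabilities, while redundant, independent instances only fail together. This is why redundancy at the cloud/application level can lift a modest per-DC availability past five nines, assuming independent failures and instant failover, which real deployments only approximate.

```latex
% Serial chain: every component must be up.
A_{\text{serial}} = \prod_i A_i
\qquad
% Redundancy: the service fails only if all n independent instances fail.
A_{\text{redundant}} = 1 - \prod_{i=1}^{n} \bigl(1 - A_i\bigr)

% Illustrative example: two independent DCs at 99.9% each
A_{\text{2 DCs}} = 1 - (1 - 0.999)^2 = 0.999999
% i.e. six nines, assuming independence and instant failover
```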
Example: Network
● Spine/leaf architecture
● Redundant
○ DC-R
○ Spine switches
○ Leaf switches (ToR)
○ OAM switches
○ Firewall
● Server
○ Redundant NICs
○ Redundant power lines and supplies
13
Ceph and OpenStack
Architecture: Ceph
15
Architecture: Ceph Components
● OSDs
○ 10s - 1000s per cluster
○ One per device (HDD/SSD/RAID Group, SAN …)
○ Store objects
○ Handle replication and recovery
● MONs:
○ Maintain cluster membership and states
○ Use the Paxos protocol to establish quorum consensus
○ Small, lightweight
○ Odd number of MONs (see the client sketch below)
16
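As a small illustration of the MON role, a hedged sketch using the python-rados bindings to ask the monitors for their current quorum. It assumes a reachable cluster, /etc/ceph/ceph.conf and a client.admin keyring on the local host.

```python
import json
import rados  # python-rados bindings shipped with Ceph

# Connect via the MONs (they hand out the cluster maps) and query quorum state.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

cmd = json.dumps({"prefix": "quorum_status", "format": "json"})
ret, out, err = cluster.mon_command(cmd, b"")
if ret == 0:
    status = json.loads(out)
    print("MONs in quorum:", status.get("quorum_names"))
else:
    print("mon_command failed:", err)

cluster.shutdown()
```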
Architecture: Ceph and OpenStack
17
HA - Critical Components
Which services need to be HA?
● Control plane
○ Provisioning, management
○ API endpoints and services
○ Admin nodes
○ Control nodes
● Data plane
○ Steady states
○ Storage
○ Network
18
HA Setup
● Stateless services
○ No dependency between requests
○ After reply no further attention required
○ API endpoints (e.g. nova-api, glance-api,...) or nova-scheduler
● Stateful service
○ An action typically consists of multiple requests
○ Subsequent requests depend on the results of earlier requests
○ Databases, RabbitMQ
19
OpenStack HA
20
Quorum?
● Required to decide which cluster partition/member is primary, to prevent data/service corruption
● Examples:
○ Databases
■ MariaDB/Galera, MongoDB, Cassandra
○ Pacemaker/corosync
○ Ceph Monitors
■ Paxos
■ Odd number of MONs required
■ At least 3 MONs for HA, simple majority (2:3, 3:5, 4:7, …; see the sketch below)
■ Without quorum:
● No changes to cluster membership (e.g. adding new MONs/OSDs)
● Clients can’t connect to the cluster
21
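A minimal sketch of the simple-majority rule behind these ratios; the 4-MON row shows why adding an even monitor buys no extra failure tolerance.

```python
# Sketch: simple-majority quorum size and failure tolerance for Ceph MONs.

def quorum_size(n_mons: int) -> int:
    """Smallest simple majority of n monitors."""
    return n_mons // 2 + 1

for n in (3, 4, 5, 7):
    q = quorum_size(n)
    tolerated = n - q        # MONs that may fail while quorum survives
    print(f"{n} MONs -> quorum of {q}, tolerates {tolerated} failure(s)")

# 3 MONs -> quorum of 2, tolerates 1 failure
# 4 MONs -> quorum of 3, still tolerates only 1 failure (hence odd counts)
# 5 MONs -> quorum of 3, tolerates 2 failures
# 7 MONs -> quorum of 4, tolerates 3 failures
```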
OpenStack and Ceph == HA ?
SPoF
● OpenStack HA
○ No SPoF assumed
● Ceph
○ No SPoF assumed
○ Availability of RBDs is critical to VMs
○ Availability of RadosGW can be easily managed via HAProxy
● What about failures at a higher level?
○ Data center cores or fire compartments
○ Network
■ Physical
■ Misconfiguration
○ Power
23
Setup - Two Rooms
24
Failure scenarios - FC fails
25
Failure scenarios - FC fails
26
Failure scenarios - Split brain
27
● Ceph:
○ Quorum selects B
○ Storage in A stops
● OpenStack HA:
○ Selects B
○ VMs in B still running
● Best-case scenario
Failure scenarios - Split brain
28
● Ceph:
○ Quorum selects B
○ Storage in A stops
● OpenStack HA:
○ Selects A
○ VMs in A and B stop working
● Worst-case scenario
Other issues
● Replica distribution
○ Two room setup:
■ 2 or 3 replicas carry the risk of having only one replica left
■ Would require 4 replicas (2:2); see the sketch below
● Reduced performance
● Increased traffic and costs
○ Alternative: erasure coding
■ Reduced performance, less space required
● Spare capacity
○ Remaining room requires spare capacity to restore
○ Depends on
■ Failure/restore scenario
■ Replication vs. erasure coding
○ Costs
29
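A small sketch of the replica math behind the two-room problem (illustrative only, not a CRUSH simulation): replicas are spread as evenly as possible across rooms, then the fullest room is lost.

```python
# Sketch: replicas surviving the loss of one room for different layouts.

from itertools import cycle

def replicas_left_after_room_loss(n_rooms: int, n_replicas: int) -> int:
    """Place replicas round-robin across rooms, then drop the fullest room."""
    per_room = [0] * n_rooms
    for room, _ in zip(cycle(range(n_rooms)), range(n_replicas)):
        per_room[room] += 1
    return n_replicas - max(per_room)

for rooms, replicas in [(2, 2), (2, 3), (2, 4), (3, 3)]:
    left = replicas_left_after_room_loss(rooms, replicas)
    print(f"{rooms} rooms, {replicas} replicas -> {left} left after losing a room")

# Two rooms with 2 or 3 replicas leave a single copy; 4 replicas (2:2) or a
# third room (one replica per room) keep at least two copies available.
```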
Mitigation - Three FCs
30
● Third FC/failure zone hosting all services
● Usually higher costs
● More resistant to failures
● Better replica distribution
● More east/west traffic
Mitigation - Quorum Room
31
● Most DCs have backup rooms
● Only a few servers needed to host quorum-related services
● Less cost intensive
● Mitigates split brain between FCs
Mitigation - Applications: First Rule
32
Mitigation - Applications: Third Rule
33
Mitigation - Applications: Third Rule
34
Mitigation - Applications: Pets vs Cattle
35
Mitigation - Failure tolerant applications
36
● DC tier level is not the most relevant factor
● Applications must build their own cluster mechanisms on top of the DC
→ this increases the service availability significantly
● Data replication must be done across multiple regions
● In case of a disaster, traffic goes to the remaining DCs
Mitigation - Federated Object Stores
37
● Use object storage for persistent data
● Synchronize and replicate across multiple DCs, sync in background
Open issues:
● Replication of databases
● Applications:
○ Need to support object storage
○ Need to support regions/zones (see the sketch below)
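A hedged sketch of what region/zone support can look like on the client side, assuming two hypothetical RadosGW S3 endpoints (one per DC) and placeholder credentials; RadosGW multisite replication keeps the zones in sync in the background.

```python
# Sketch: region-aware reads against a federated object store (boto3/S3 API).
# Endpoints and credentials below are illustrative placeholders.

import boto3
from botocore.exceptions import BotoCoreError, ClientError

ENDPOINTS = [
    "https://rgw.dc-a.example.net",   # local zone (hypothetical)
    "https://rgw.dc-b.example.net",   # remote zone (hypothetical)
]

def s3_client(endpoint: str):
    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id="ACCESS_KEY",       # placeholder
        aws_secret_access_key="SECRET_KEY",   # placeholder
    )

def get_object_with_failover(bucket: str, key: str) -> bytes:
    """Read from the local zone first, fall back to the remote zone."""
    last_error = None
    for endpoint in ENDPOINTS:
        try:
            obj = s3_client(endpoint).get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()
        except (BotoCoreError, ClientError) as exc:
            last_error = exc                  # try the next DC
    raise last_error
```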
Mitigation - Outlook
● “Compute follows Storage”
○ Use RBDs as fencing devices in OpenStack HA setup
○ Extend Ceph MONs
■ Include information about physical placement similar to CRUSH map
■ Enable HA setup to monitor/query quorum decisions and map to physical layout
● Passive standby Ceph MONs to ease deployment of MONs if quorum fails
○ http://tracker.ceph.com/projects/ceph/wiki/Passive_monitors
● Generic quorum service/library?
38
Conclusions
Conclusions
● OpenStack and Ceph provide HA if carefully planned
○ Be aware of potential failure scenarios!
○ All quorum decisions must be in sync
○ Third room must be used
○ Replica distribution and spare capacity must be considered
○ Ceph needs more extended quorum information
● The five-9's target is end-to-end (E2E)
○ Five 9's at data center level is very expensive
○ NO PETS, NO PETS, NO PETS !!!
○ Distribute applications or services over multiple DCs
40
Get involved!
● Ceph
○ https://ceph.com/community/contribute/
○ ceph-devel@vger.kernel.org
○ IRC: OFTC
■ #ceph,
■ #ceph-devel
● OpenStack
○ Cinder, Glance, Manila, ...
41
danny.al-gaaf@telekom.de
dalgaaf
blog.bisect.de
@dannnyalgaaf
linkedin.com/in/dalgaaf
xing.com/profile/Danny_AlGaaf
Danny Al-Gaaf
Senior Cloud Technologist
Q&A - THANK YOU!
