
99.999% Available OpenStack Cloud - A Builder's Guide


High availability is an important and frequently discussed topic for clouds at the infrastructure level. There are several concepts for providing an HA-ready OpenStack, and software-defined storage such as Ceph is designed to be highly available, with no single point of failure.

But what about HA if you bring OpenStack and Ceph together? How do they interact, and what is the impact on the availability of your OpenStack cloud infrastructure from the tenant or application point of view?

How does the design of a classic highly available data center, e.g. with two fire compartments, power backup, and redundant power and network lines, influence your cluster setup? There are many potential failure scenarios. What do they mean for building and managing failure zones, especially for technologies like Ceph, which must be able to form a quorum to keep running?

This talk will cover:
- Failure scenarios and their impact on OpenStack and Ceph availability
- Which components of the cloud need a quorum
- How to set up the infrastructure to ensure a quorum
- How the different quorum devices work together, and whether they guarantee the HA of your cloud
- Pitfalls and solutions


99.999% Available OpenStack Cloud - A Builder's Guide

  1. 99.999% Available OpenStack Cloud - A Builder's Guide. Danny Al-Gaaf (Deutsche Telekom), OpenStack Summit 2015 - Tokyo
  2. Overview ● Motivation ● Availability and SLAs ● Data centers ○ Setup and failure scenarios ● OpenStack and Ceph ○ Architecture and critical components ○ HA setup ○ Quorum? ● OpenStack and Ceph == HA? ○ Failure scenarios ○ Mitigation ● Conclusions
  3. Motivation
  4. NFV Cloud @ Deutsche Telekom ● Data center design ○ Backend DCs ■ Few, but classic DCs ■ High SLAs for infrastructure and services ■ For private/customer data and services ○ Frontend DCs (FDCs) ■ Small, but many ■ Close to the customer ■ Lower SLAs, can fail at any time ■ NFVs: ● Spread over many FDCs ● Failures are handled by the services, not the infrastructure ● Run telco core services on OpenStack/KVM/Ceph
  5. Availability
  6. Availability ● Measured relative to "100 % operational"

         availability | downtime           | classification
         99.9%        | 8.76 hours/year    | high availability
         99.99%       | 52.6 minutes/year  | very high availability
         99.999%      | 5.26 minutes/year  | highest availability
         99.9999%     | 0.526 minutes/year | disaster tolerant
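
The downtime budgets in this table follow directly from the availability figure: a year has 8,760 hours, and the allowed downtime is the unavailable fraction of that. A quick sketch of the arithmetic in Python:

    # Allowed downtime per year for each availability class.
    HOURS_PER_YEAR = 365 * 24  # 8760

    for pct, label in [(99.9, "high availability"),
                       (99.99, "very high availability"),
                       (99.999, "highest availability"),
                       (99.9999, "disaster tolerant")]:
        hours = HOURS_PER_YEAR * (1 - pct / 100.0)
        print("%-9s %8.3f hours/year (%7.3f min) %s"
              % (str(pct) + "%", hours, hours * 60, label))
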
  7. High Availability ● Continuous system availability in case of component failures ● Which availability? ○ Server ○ Network ○ Data center ○ Cloud ○ Application/Service ● End-to-end availability is the most interesting
  8. High Availability ● Calculation ○ Each component contributes to the service availability ■ Infrastructure ■ Hardware ■ Software ■ Processes ○ Likelihood of disaster and failure scenarios ○ The model can get very complex ● SLAs ○ ITIL (IT Infrastructure Library) ○ Planned maintenance may be excluded, depending on the SLA
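
Why the model gets complex: for independent components in a serial chain the availabilities multiply, while redundant instances are down only if all of them fail at once. A minimal sketch with made-up component values:

    # Serial chain: every component must be up for the service to be up.
    def serial(*availabilities):
        a = 1.0
        for x in availabilities:
            a *= x
        return a

    # Parallel redundancy: the service is down only if all replicas are down.
    def parallel(*availabilities):
        down = 1.0
        for x in availabilities:
            down *= (1.0 - x)
        return 1.0 - down

    # Hypothetical values: network, hypervisor, storage in series ...
    single_stack = serial(0.9999, 0.999, 0.9999)
    # ... versus the same stack deployed redundantly twice.
    redundant = parallel(single_stack, single_stack)
    print("single stack: %.6f  redundant: %.6f" % (single_stack, redundant))
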
  9. 9. Data centers
  10. Failure scenarios ● Power outage ○ External ○ Internal ○ Backup UPS/generator ● Network outage ○ External connectivity ○ Internal ■ Cables ■ Switches, routers ● Failure of a server or a component ● Failure of a software service
  11. Failure scenarios ● Human error is still often the leading cause of outages ○ Misconfiguration ○ Accidents ○ Emergency power-off ● Disaster ○ Fire ○ Flood ○ Earthquake ○ Plane crash ○ Nuclear accident
  12. Data Center Tiers
  13. Mitigation ● Identify potential SPoFs ● Use redundant components ● Careful planning ○ Network design (external/internal) ○ Power management (external/internal) ○ Fire suppression ○ Disaster management ○ Monitoring ● Five nines at the DC/HW level is hard to achieve ○ Tier IV is usually too expensive (compared with Tier III or III+) ○ Therefore requires an HA concept at the cloud and application level
  14. Example: Network ● Spine/leaf architecture ● Redundant ○ DC-R ○ Spine switches ○ Leaf switches (ToR) ○ OAM switches ○ Firewall ● Server ○ Redundant NICs ○ Redundant power lines and supplies
  15. Ceph and OpenStack
  16. Architecture: Ceph
  17. Architecture: Ceph Components ● OSDs ○ 10s - 1000s per cluster ○ One per device (HDD/SSD/RAID group, SAN, …) ○ Store objects ○ Handle replication and recovery ● MONs ○ Maintain cluster membership and state ○ Use the Paxos protocol to establish quorum consensus ○ Small, lightweight ○ Odd number
  18. Architecture: Ceph and OpenStack
  19. HA - Critical Components Which services need to be HA? ● Control plane ○ Provisioning, management ○ API endpoints and services ○ Admin nodes ○ Control nodes ● Data plane ○ Steady states ○ Storage ○ Network
  20. HA Setup ● Stateless services ○ No dependencies between requests ○ After the reply, no further attention is required ○ API endpoints (e.g. nova-api, glance-api, ...) or nova-scheduler ● Stateful services ○ An action typically consists of multiple requests ○ Subsequent requests depend on the results of earlier ones ○ Databases, RabbitMQ
  21. HA Setup ● Stateless ○ active/passive: load balance redundant services ○ active/active: load balance redundant services ● Stateful ○ active/passive: bring a replacement resource online ○ active/active: redundant services, all with the same state; state changes are passed to all instances
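
For the stateless active/active case, the usual building block is a load balancer in front of the redundant API endpoints. A minimal HAProxy sketch, as one might deploy for nova-api; the addresses and server names are hypothetical, not from the talk:

    # Load-balance the stateless nova-api across three controllers;
    # any instance can answer any request, so failed nodes are simply
    # taken out of rotation by the health check.
    frontend nova_api
        bind 10.0.0.10:8774
        default_backend nova_api_nodes

    backend nova_api_nodes
        balance roundrobin
        option httpchk GET /
        server ctrl1 10.0.0.11:8774 check
        server ctrl2 10.0.0.12:8774 check
        server ctrl3 10.0.0.13:8774 check
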
  22. OpenStack HA
  23. Quorum? ● Required to decide which cluster partition/member is primary, to prevent data/service corruption ● Examples: ○ Databases ■ MariaDB/Galera, MongoDB, Cassandra ○ Pacemaker/Corosync ○ Ceph monitors ■ Paxos ■ Odd number of MONs required ■ At least 3 MONs for HA, simple majority (2:3, 3:5, 4:7, …) ■ Without quorum: ● no changes to cluster membership (e.g. adding new MONs/OSDs) ● clients can't connect to the cluster
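
"Simple majority" means strictly more than half of all MONs, which is why monitor placement across failure zones decides whether a surviving room can keep the cluster running. A toy check in Python (the room layout is a made-up example):

    # Does Ceph keep its MON quorum if one failure zone is lost?
    mons = {"mon.a": "room-A", "mon.b": "room-A", "mon.c": "room-B"}

    def has_quorum(mons, failed_room):
        alive = [m for m, room in mons.items() if room != failed_room]
        return len(alive) > len(mons) // 2   # simple majority: 2:3, 3:5, ...

    print(has_quorum(mons, "room-A"))  # False: mon.c alone is 1 of 3
    print(has_quorum(mons, "room-B"))  # True: mon.a and mon.b are 2 of 3
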
  24. OpenStack and Ceph == HA?
  25. SPoF ● OpenStack HA ○ No SPoF assumed ● Ceph ○ No SPoF assumed ○ Availability of RBDs is critical to VMs ○ Availability of the RadosGW can easily be managed via HAProxy ● What about failures at a higher level? ○ Data center cores or fire compartments ○ Network ■ Physical ■ Misconfiguration ○ Power
  26. Setup - Two Rooms
  27. Failure scenarios - FC fails
  28. Failure scenarios - FC fails
  29. Failure scenarios - Split brain ● Ceph ○ Quorum selects B ○ Storage in A stops ● OpenStack HA ○ Selects B ○ VMs in B keep running ● Best-case scenario
  30. Failure scenarios - Split brain ● Ceph ○ Quorum selects B ○ Storage in A stops ● OpenStack HA ○ Selects A ○ VMs in A and B stop working ● Worst-case scenario
  31. Other issues ● Replica distribution ○ Two-room setup: ■ 2 or 3 replicas carry the risk of being left with only a single replica ■ Would require 4 replicas (2:2) ● Reduced performance ● Increased traffic and costs ○ Alternative: erasure coding ■ Reduced performance, but less space required ● Spare capacity ○ The remaining room needs spare capacity to restore redundancy ○ Depends on ■ Failure/restore scenario ■ Replication vs. erasure coding ○ Costs
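
To make the replica arithmetic concrete: with all copies split across two rooms, what survives is at most the smaller half. A toy sketch, assuming CRUSH splits the copies as evenly as possible:

    # Copies that survive when the fuller of two rooms is lost.
    def replicas_left(total, rooms=2):
        smaller_half = total // rooms   # even split across rooms
        return smaller_half             # survives if the other room fails

    print(replicas_left(2))  # 1 -> a single copy left, risky during recovery
    print(replicas_left(3))  # 1 -> same risk, hence the slide's warning
    print(replicas_left(4))  # 2 -> the 2:2 split mentioned above
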
  32. Mitigation - Three FCs ● Third FC/failure zone hosting all services ● Usually higher costs ● More resilient against failures ● Better replica distribution ● More east/west traffic
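
With three fire compartments modeled as rooms in the CRUSH hierarchy, Ceph can be told to put each replica into a different room, so losing one compartment costs at most one copy per object. A sketch in the classic CRUSH map rule syntax (the bucket names are assumptions):

    rule replicated_across_rooms {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take default                     # start at the root bucket
        step chooseleaf firstn 0 type room    # one OSD per distinct room
        step emit
    }
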
  33. Mitigation - Quorum Room ● Most DCs have backup rooms ● Only a few servers needed to host the quorum-related services ● Less cost-intensive ● Can mitigate split brain between the FCs (depending on the network layout)
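
In this layout the third Ceph MON lives in the quorum room, so either fire compartment together with the quorum room still holds a 2-of-3 majority. A minimal ceph.conf sketch; host names and addresses are made up:

    [global]
    mon initial members = a, b, c

    [mon.a]
    host = mon-fc1        ; fire compartment 1
    mon addr = 10.0.1.10

    [mon.b]
    host = mon-fc2        ; fire compartment 2
    mon addr = 10.0.2.10

    [mon.c]
    host = mon-quorum     ; tie-breaker in the quorum room
    mon addr = 10.0.3.10
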
  34. Mitigation - Pets vs. Cattle ● NO pets allowed !!! ● Only cloud-ready applications
  35. Mitigation - Failure-tolerant applications ● The tier level is not the most relevant layer ● Applications must build their own clustering mechanisms on top of the DC → increases availability significantly ● Data replication must be done across multiple regions ● In case of a disaster, route traffic to a different DC ● Many VNFs (virtual network functions) already support such setups
  36. Mitigation - Federated Object Stores ● The best way to synchronize and replicate data across multiple DCs is to use object storage ● Sync is done asynchronously ● Open issues: ○ Does not solve the replication of databases ○ Many applications don't support object storage and need to be adapted ○ Applications also need to support regions/zones
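
From the application's point of view this is ordinary S3-style access against the local RadosGW endpoint, while the federated gateways replicate asynchronously between the DCs. A minimal boto sketch; endpoint and credentials are placeholders:

    import boto
    import boto.s3.connection

    # Write to the object store in the local DC; replication to the
    # remote DC happens asynchronously behind the gateway.
    conn = boto.connect_s3(
        aws_access_key_id="ACCESS_KEY",        # placeholder
        aws_secret_access_key="SECRET_KEY",    # placeholder
        host="rgw.dc1.example.com",            # local RadosGW endpoint
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )
    bucket = conn.create_bucket("app-state")
    key = bucket.new_key("snapshots/latest.json")
    key.set_contents_from_string('{"state": "ok"}')
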
  37. Mitigation - Outlook ● “OpenStack follows storage” ○ Use RBDs as fencing devices ○ Extend Ceph MONs ■ Include information about physical placement, similar to the CRUSH map ■ Enable the HA setup to query quorum decisions and map the quorum onto the physical layout ● Passive standby Ceph MONs to ease redeployment of MONs if the quorum fails ○ http://tracker.ceph.com/projects/ceph/wiki/Passive_monitors ● Generic quorum service/library?
  38. Conclusions
  39. Conclusions ● OpenStack and Ceph provide HA if carefully planned ○ Be aware of potential failure scenarios! ○ All quorums need to be kept in sync ○ A third room must be used ○ Replica distribution and spare capacity must be considered ○ Ceph needs more extended quorum information ● The five-nines target is end-to-end ○ Five nines at the data center level is very expensive ○ No pets !!! ○ Distribute applications or services over multiple DCs
  40. Get involved! ● Ceph ○ https://ceph.com/community/contribute/ ○ ceph-devel@vger.kernel.org ○ IRC (OFTC): #ceph, #ceph-devel ○ Ceph Developer Summit ● OpenStack ○ Cinder, Glance, Manila, ...
  41. Q&A - THANK YOU! Danny Al-Gaaf, Senior Cloud Technologist ● danny.al-gaaf@telekom.de ● IRC: dalgaaf ● linkedin.com/in/dalgaaf
