OpenStack HA -
Theory to Reality
GERD PRÜßMANN SHAMAIL TAHIR
SRIRAM SUBRAMANIAN KALIN NIKOLOV
Gerd Prüßmann – Cloud Architect, Deutsche Telekom AG – @2digitsleft
Shamail Tahir – Cloud Architect, EMC Office of the CTO – @ShamailXD
Sriram Subramanian – Founder & Cloud Specialist, CloudDon – @sriramhere
Kalin Nikolov – Cloud Engineer, PayPal
Agenda
OpenStack HA - Introduction
Active/Active
Active/Passive
DT Implementation
eBay/PayPal Implementation
Summary
OpenStack HA - Introduction
What does it mean?
Why is it not by default?
Stateless vs Stateful
Challenges
More than one way
Active/Passive
Active/Active
Is This?
Or This?
Active/Active
API Service Endpoints
Database
Networking
Active/Active
● The OpenStack High Availability (HA) concept depends on the components used,
e.g. network virtualization, storage backend, database system, etc.
● Various technologies are available to realize HA:
vendors use combinations, e.g. Pacemaker, Corosync, Galera, Keepalived,
HAProxy, VRRP, DRBD … or their own tools
The following description is derived from the generic proposal in the
OpenStack HA Guide:
http://docs.openstack.org/high-availability-guide/content/index.html
Active/Active
● Target: try to have all services of the platform highly available
Redundancy and resiliency against single service / node failure
● stateless services are load balanced (HAProxy + Keepalived) – see the sketch below
o e.g. API endpoints / nova-scheduler
● stateful services use individual HA technologies
o e.g. RabbitMQ, MySQL DB, etc.
o might be load balanced as well
● some services/agents have no built-in HA feature available
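As a minimal sketch of the HAProxy + Keepalived pattern above (interface names, IPs, and the password are illustrative placeholders, not values from this deployment):

    # /etc/keepalived/keepalived.conf on the primary load balancer node
    vrrp_instance VI_API {
        state MASTER              # the standby node uses state BACKUP
        interface eth0            # NIC that carries the VIP (placeholder)
        virtual_router_id 51
        priority 101              # standby uses a lower priority, e.g. 100
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass changeme    # placeholder shared secret
        }
        virtual_ipaddress {
            192.0.2.10            # example VIP that fronts the HAProxy pair
        }
    }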
Active/Active - API Service Endpoints
API endpoints
● deploy on multiple nodes
● configure load balancing with virtual IPs in HAproxy
● use HAproxy’s VIPs to configure respective identity endpoints
● all service configuration files refer to these VIPs only
schedulers
● nova-scheduler, nova-conductor, cinder-scheduler, neutron-server,
ceilometer-collector, heat-engine
● schedulers will be configured with clustered RabbitMQ nodes
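To illustrate the pattern above, a hedged HAProxy excerpt for the Identity API plus a service config that references only the VIP (ports are Keystone defaults; node IPs and the VIP are placeholders):

    # /etc/haproxy/haproxy.cfg (excerpt)
    listen keystone_public
        bind 192.0.2.10:5000                       # the Keepalived VIP
        balance roundrobin
        option httpchk
        server ctrl1 10.0.0.11:5000 check inter 2000 rise 2 fall 5
        server ctrl2 10.0.0.12:5000 check inter 2000 rise 2 fall 5
        server ctrl3 10.0.0.13:5000 check inter 2000 rise 2 fall 5

    # nova.conf (excerpt) – services reference the VIP, never an individual node
    [keystone_authtoken]
    auth_uri = http://192.0.2.10:5000/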
Active/Active - Databases
● MySQL or MariaDB with the Galera cluster
(wsrep) library extension
o transaction-commit-level replication
● synchronous multi-master setup
o min. 3 nodes to get quorum in
case of a network partition
● write and read to any node
● other database options possible:
Percona XtraDB, PostgreSQL, etc.
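A hedged my.cnf excerpt showing the wsrep settings involved (the provider path is distribution-specific and the node addresses are placeholders):

    # /etc/mysql/conf.d/galera.cnf (excerpt)
    [mysqld]
    binlog_format            = ROW
    default_storage_engine   = InnoDB
    innodb_autoinc_lock_mode = 2
    wsrep_provider           = /usr/lib/galera/libgalera_smm.so        # path varies by distro
    wsrep_cluster_name       = "openstack_db"
    wsrep_cluster_address    = "gcomm://10.0.0.11,10.0.0.12,10.0.0.13"  # 3 nodes for quorum
    wsrep_sst_method         = rsync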
Active/Active - RabbitMQ
● RabbitMQ nodes clustered
● mirrored queues configured via policy (e.g. ha-mode: all) – see the commands below
● all services use the RabbitMQ nodes
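For illustration, clustering and the mirrored-queue policy are typically set up along these lines (node names are placeholders):

    # on each joining node
    rabbitmqctl stop_app
    rabbitmqctl join_cluster rabbit@controller1
    rabbitmqctl start_app

    # mirror all queues across every node in the cluster (ha-mode: all)
    rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'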
Active/Active - Networking
Network
● deploy multiple network nodes
● Neutron DHCP agent – configure multiple DHCP agents
(dhcp_agents_per_network)
● Neutron L3 agent
o automatic L3 agent failover (allow_automatic_l3agent_failover)
o VRRP (l3_ha, max_l3_agents_per_router, min_l3_agents_per_router)
● Neutron L2 agent – no HA available
● Neutron metadata agent – no HA available
● Neutron LBaaS agent – no HA available
● where no HA feature is available: active/passive Pacemaker/Corosync solution
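The Neutron options named above, shown as a neutron.conf excerpt (the values are illustrative, not recommendations):

    # neutron.conf (excerpt)
    [DEFAULT]
    # run two DHCP agents per tenant network
    dhcp_agents_per_network = 2

    # reschedule routers away from a failed L3 agent ...
    allow_automatic_l3agent_failover = True

    # ... or use VRRP-based HA routers
    l3_ha = True
    max_l3_agents_per_router = 3
    min_l3_agents_per_router = 2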
Active/Active - Example
Deployment example
Active/Passive
General
Tools Overview
Controllers Overview
Active/Passive: General
● Components should leverage a Virtual IP
● The primary tools used for Active/Passive
OpenStack configurations are general
(non-OpenStack-specific): Pacemaker +
Corosync, and DRBD
Corosync
● Messaging layer used by the cluster
● Responsibilities include cluster membership and messaging
● Leverages RRP (Redundant Ring Protocol)
o Rings can be set up as A/A or A/P
o UDP only
o mcastport specifies the receive port; mcastport minus 1 is the send port
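A corosync.conf sketch with two redundant rings (network addresses and ports are placeholders):

    # /etc/corosync/corosync.conf (excerpt)
    totem {
        version: 2
        rrp_mode: passive          # or "active" to use both rings concurrently
        interface {
            ringnumber: 0
            bindnetaddr: 10.0.0.0
            mcastaddr: 239.255.42.1
            mcastport: 5405        # receives on 5405, sends on 5404
        }
        interface {
            ringnumber: 1
            bindnetaddr: 10.0.1.0
            mcastaddr: 239.255.42.2
            mcastport: 5407        # receives on 5407, sends on 5406
        }
    }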
Pacemaker
● Cluster Resource Manager
● Cluster Information Base (CIB)
o Represents current state of resources
and cluster configuration (XML)
● Cluster Resource Management Daemon
(CRMd)
o Acts as decision maker (one master)
● Policy Engine (PEngine)
o Sends instructions to the LRMd and CRMd
● STONITHd
o Fencing mechanism
[Diagram: Pacemaker internals – CRMd coordinating the CIB, PEngine, LRMd, and STONITHd]
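For context, basic cluster properties are usually set once Pacemaker is running; a short pcs sketch (values are illustrative, not prescriptive):

    # enable fencing (STONITH); relax quorum handling for a 2-node cluster
    pcs property set stonith-enabled=true
    pcs property set no-quorum-policy=ignore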
DRBD
● Distributed Replicated Block Device
● Creates logical block devices (e.g. /dev/drbdX) that
have local backing volumes
● Reads serviced locally
● Primary node writes are sent to secondary node
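A minimal DRBD resource definition as a sketch (hostnames, backing disks, and addresses are placeholders):

    # /etc/drbd.d/mysql.res
    resource mysql {
        device    /dev/drbd0;
        disk      /dev/sdb1;          # local backing volume on each host
        meta-disk internal;
        on host1 {
            address 10.0.0.11:7788;
        }
        on host2 {
            address 10.0.0.12:7788;
        }
    }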
Active/Passive: Database
[Diagram: Host1 and Host2 each run the stack MySQL → DRBD → Pacemaker → Corosync; only one side is active at a time]
● Use DRBD to back MySQL
● Leverage VIP that can float
between hosts
● Manage all resources (including the
MySQL daemon) with Pacemaker – see
the sketch below
● MySQL/Galera is an alternative, but the
current version of the HA Guide does
not recommend it
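A rough pcs sketch of the resources described above, using the standard OCF/LINBIT agents (resource names and IPs are placeholders; ordering constraints and most agent parameters are omitted for brevity):

    # floating VIP for the database
    pcs resource create p_db_vip ocf:heartbeat:IPaddr2 ip=192.0.2.20 cidr_netmask=24 op monitor interval=30s

    # DRBD as a master/slave resource, with the filesystem and MySQL on top
    pcs resource create p_drbd_mysql ocf:linbit:drbd drbd_resource=mysql op monitor interval=20s
    pcs resource master ms_drbd_mysql p_drbd_mysql master-max=1 clone-max=2 notify=true
    pcs resource create p_fs_mysql ocf:heartbeat:Filesystem device=/dev/drbd0 directory=/var/lib/mysql fstype=ext4
    pcs resource create p_mysql ocf:heartbeat:mysql op monitor interval=30s

    # keep everything on the node where DRBD is primary
    pcs constraint colocation add p_fs_mysql with master ms_drbd_mysql INFINITY
    pcs constraint colocation add p_mysql with p_fs_mysql INFINITY
    pcs constraint colocation add p_db_vip with p_mysql INFINITY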
Active/Passive: RabbitMQ
[Diagram: Host1 and Host2 each run the stack RabbitMQ → DRBD → Pacemaker → Corosync; only one side is active at a time]
● Use DRBD to back RabbitMQ
● Leverage VIP that can float
between hosts
● Ensure the Erlang cookie (.erlang.cookie) is
identical on all nodes
o enables the nodes to communicate
with each other (see the sketch below)
● RabbitMQ clustering does not
tolerate network partitions well
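As an illustration, the cookie is simply copied between nodes (the path below is the usual location for packaged installs; hostnames are placeholders):

    # copy the Erlang cookie from the first node to its peer and fix permissions
    scp /var/lib/rabbitmq/.erlang.cookie host2:/var/lib/rabbitmq/.erlang.cookie
    ssh host2 'chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie && chmod 400 /var/lib/rabbitmq/.erlang.cookie'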
Active/Passive: Overview (From Guide)
● Leverage the DB and RabbitMQ VIPs in configuration files
● Configure Pacemaker resources for OpenStack services (example below)
o Image API
o Identity
o Block Storage API
o Telemetry Central Agent
o Networking
o L3-Agent
o DHCP
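A sketch of wrapping one such service as a Pacemaker resource, here via the generic systemd resource class (the HA Guide also describes dedicated OCF agents; the unit and resource names vary by distribution and are placeholders):

    # manage the Identity (keystone) service through Pacemaker, tied to the API VIP
    pcs resource create p_keystone systemd:openstack-keystone op monitor interval=30s
    pcs constraint colocation add p_keystone with p_api_vip INFINITY   # p_api_vip: the VIP resource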
DT Implementation - Overview
● Business Market Place (BMP)
● SaaS offering
● https://portal.telekomcloud.com/
● SaaS Applications from Software Partners
(ISVs) and DT offered to SME customers
● Platform based on Open Source technologies only
(OpenStack, CEPH, Linux)
● Project started in 2012 with OpenStack Essex and CEPH
● In production since 03/2013
DT Implementation
DTAG scale out project (ongoing)
Target: Migrate production to a new DC and scale out
Requirements:
● scale out compute by 30%, storage by 40%
● eliminate all SPOFs
● Setup in two fire protection areas / physically separated DC rooms
DT Implementation
● single-region HA OpenStack instance
● all services distributed over two DC rooms
o Compute and Storage distributed equally
o All OpenStack services HA (as far as possible)
▪ OSS (DNS, NTP, puppet master, mirror etc., redundant perimeter
firewall)
● Instance distribution: 4 Availability Zones, multiple host aggregates and
scheduler filters
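For illustration, availability zones, host aggregates and scheduler filters are wired together roughly like this (aggregate, zone, and host names are placeholders, not DT's actual layout):

    # create a host aggregate exposed as an availability zone and add hosts to it
    nova aggregate-create room1-az1 room1-az1
    nova aggregate-add-host room1-az1 compute-001

    # nova.conf (excerpt) – filters the scheduler should apply
    [DEFAULT]
    scheduler_default_filters = AvailabilityZoneFilter,RamFilter,ComputeFilter,AggregateInstanceExtraSpecsFilter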
DT Implementation
● Load Balancing
o HAproxy for MySQL, services, RabbitMQ, APIs (nginx under test)
● MySQL
o Galera Multi Master Node replication (3 nodes)
● RabbitMQ
o 2 nodes cluster / mirrored queues
● Neutron
o DHCP multiple agents started; Pacemaker/Corosync
● API Endpoints
o Load balancing with round-robin distribution
● Storage
o 2 shared, distributed CEPH clusters (RBD/S3)
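A hedged HAProxy excerpt for fronting Galera with one write node and the others as backups, matching the single-writer experience described on the next slide (IPs and the check user are placeholders):

    # /etc/haproxy/haproxy.cfg (excerpt)
    listen galera
        bind 192.0.2.30:3306
        balance source
        option mysql-check user haproxy_check   # this check user must exist in MySQL
        server db1 10.0.0.21:3306 check
        server db2 10.0.0.22:3306 check backup  # only used if db1 fails
        server db3 10.0.0.23:3306 check backup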
DT Implementation
Tests/Experiences so far
● Load balancing works well
● Database: OpenStack multi-node write issues
o 1 node write / 2 nodes backup: diminishes Galera HA efficiency (monitoring)
● Specific issues with deployment in 2 DC rooms / uneven distribution of services (Galera)
o if the “wrong” room fails
▪ Galera: quorum requires a majority!
room with 2 nodes goes down → 3rd node will deactivate itself → DB outage
▪ Storage-specific:
▪ CEPH may lose 2/3 of the replicas → heavy replication load on the CEPH cluster
▪ danger of losing data (OSD/disk failure) → raise replica level / adapt CRUSH map
▪ Network: recovering from a Neutron / L3 failure: <15 minutes
o pet applications are vulnerable – may suffer hiccups during disasters anyway
● DHCP agent failures
DT Implementation
Plans for the future
● use DVR / VRRP in the future
o make network more resilient and elastic
● a third DC room would be desirable :-)
o CEPH replicas / MONs, MySQL Galera
eBay/PayPal Implementation
The scope of eBay/PayPal OpenStack Clouds
● 100% of PayPal web/mid tier
● Most of Dev/QA
● Number of hypervisors (HVs): 8,500
● Number of Virtual Machines: 70,000
● Number of users: Several thousands
● Availability zones: 10
eBay/PayPal Implementation
● Database
MySQL MMM replication, VIP with FailoverPersistence / Galera
● RabbitMQ
VIP with SingleNode FailoverPersistence or 3 nodes with mirrored queues
● Neutron DHCP / LBaaS
Corosync/Pacemaker
● API Endpoints
LB VIPs for every service with either RR or least connection
● Storage
Shared storage with NFS/iSCSI
eBay/PayPal Implementation
Successful HA Implementations
● LoadBalanced HA - VIPs for every service
● LB Single Node Failover Persistence Profile
● Galera/Percona for Identity Service
● Global Identity Service using GLB
eBay/PayPal Implementation
HA Failures
● Corosync/Pacemaker
Neutron DHCP and LBaaS - missing advanced health checks
● RabbitMQ
Single Node Failover Persistence
● MySQL Replication
Single Node Failover Persistence sometimes doesn't work well;
implemented external monitoring and disabling of the failed member
● VIPs without ECV health checks
eBay/PayPal Implementation
Future direction
● HA on Global or Regional Services
One leg in each Availability Zone
(Keystone, LBaaS, Swift)
● RabbitMQ with 3 node/mirrored queues
LB VIP with least connections
● No shared NFS for Glance
eBay/PayPal Global Identity Service
eBay/PayPal Implementation
Lessons Learned
● Try not to overcomplicate
● Simulate Failures
Before placing in production make sure HA works
● Place your services in different Availability zones
or at least different FaultZones
● Always make backups
No matter how robust your HA solution is
Call to Action
● OpenStack HA Guide Update Efforts
● WTE Work Group (now known as ‘Enterprise’)
● Share Best Practices
Reference
OpenStack HA guide:
http://docs.openstack.org/high-availability-guide/content/index.html
Percona Resources:
https://www.percona.com/resources/mysql-webinars/high-availability-using-mysql-cloud-today-tomorrow-and-keys-your-success
HAProxy Documentation:
http://www.haproxy.org/

Editor's Notes

  • #5 Explain the notion of High Availability in the context of OpenStack: ensuring high availability of OpenStack services, API services, and supporting infrastructure, including databases and message queues. HA means different things in different contexts – is it guest availability? The database? Storage? Application availability? If there is a failure, should the application fail over or should the underlying infrastructure? Broadly: protect against system downtime and prevent accidental data loss.
    There can be multiple SPOFs – services, API endpoints, network components, storage components, and infrastructure components such as power, cooling, etc. Provide redundancy at the appropriate levels.
    OpenStack is a collection of services sharing some common infrastructure. It is not a monolithic application that can be made highly available by slapping in a load balancer. These services are independent and self-contained, with some shared infrastructure among them, and they have different configurations, settings, and more. Some of the components are stateless – such as nova-api, keystone-api, glance-api, etc. Some of the components are databases / message queues. The OpenStack architecture is quite complex.
    Active/Passive – one instance is ‘active’ and, on failure, the redundant service/system is brought into action. For stateless services, very minimal configuration is needed; for stateful services, additional tools such as Pacemaker and Corosync are needed.
    Active/Active – both the active and redundant systems are maintained in the same state concurrently. For stateless services, active and redundant instances are load balanced using an LB such as HAProxy. Stateful services will need to be maintained in the same state – and again need an LB.