OpenStack HA -
Theory to Reality
GERD PRÜßMANN SHAMAIL TAHIR
SRIRAM SUBRAMANIAN KALIN NIKOLOV
Gerd Prüßmann – Cloud Architect, Deutsche Telekom AG – @2digitsleft
Shamail Tahir – Cloud Architect, EMC Office of the CTO – @ShamailXD
Sriram Subramanian – Founder & Cloud Specialist, CloudDon – @sriramhere
Kalin Nikolov – Cloud Engineer, PayPal
Agenda
OpenStack HA - Introduction
Active/Active
Active/Passive
DT Implementation
eBay/PayPal Implementation
Summary
OpenStack HA - Introduction
What does it mean?
Why is it not by default?
Stateless vs Stateful
Challenges
More than one way
Active/Passive
Active/Active
Is This?
Or This?
Active/Active
API Service Endpoints
Database
Networking
Active/Active
● The OpenStack High Availability (HA) concept depends on the components used,
e.g. network virtualization, storage backend, database system, etc.
● Various technologies are available to realize HA:
vendors use combinations, e.g. Pacemaker, Corosync, Galera, Keepalived,
HAProxy, VRRP, DRBD … or their own tools
The following description is derived from the generic proposal in the
OpenStack HA Guide:
http://docs.openstack.org/high-availability-guide/content/index.html
Active/Active
● Target: try to have all services of the platform highly available
Redundancy and resiliency against single service / node failure
● stateless services are load balanced (HAProxy + Keepalived) – see the sketch below
o e.g. API endpoints / nova-scheduler
● stateful services use individual HA technologies
o e.g. RabbitMQ, MySQL DB, etc.
o might be load balanced as well
● some services/agents have no built-in HA feature available
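As a minimal sketch of the HAProxy + Keepalived pattern above (interface names, IPs, and the password are illustrative placeholders, not values from this deployment):

    # /etc/keepalived/keepalived.conf on the primary load balancer node
    vrrp_instance VI_API {
        state MASTER              # the standby node uses state BACKUP
        interface eth0            # NIC that carries the VIP (placeholder)
        virtual_router_id 51
        priority 101              # standby uses a lower priority, e.g. 100
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass changeme    # placeholder shared secret
        }
        virtual_ipaddress {
            192.0.2.10            # example VIP that fronts the HAProxy pair
        }
    }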
Active/Active - API Service Endpoints
API endpoints
● deploy on multiple nodes
● configure load balancing with virtual IPs in HAproxy
● use HAproxy’s VIPs to configure respective identity endpoints
● all service configuration files refer to these VIPs only
schedulers
● nova-scheduler, nova-conductor, cinder-scheduler, neutron-server,
ceilometer-collector, heat-engine
● schedulers will be configured with clustered RabbitMQ nodes
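To illustrate the pattern above, a hedged HAProxy excerpt for the Identity API plus a service config that references only the VIP (ports are Keystone defaults; node IPs and the VIP are placeholders):

    # /etc/haproxy/haproxy.cfg (excerpt)
    listen keystone_public
        bind 192.0.2.10:5000                       # the Keepalived VIP
        balance roundrobin
        option httpchk
        server ctrl1 10.0.0.11:5000 check inter 2000 rise 2 fall 5
        server ctrl2 10.0.0.12:5000 check inter 2000 rise 2 fall 5
        server ctrl3 10.0.0.13:5000 check inter 2000 rise 2 fall 5

    # nova.conf (excerpt) – services reference the VIP, never an individual node
    [keystone_authtoken]
    auth_uri = http://192.0.2.10:5000/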
Active/Active - Databases
● MySQL or MariaDB with the Galera cluster
(wsrep) library extension
o transaction-commit-level replication
● synchronous multi-master setup
o min. 3 nodes to get quorum in
case of a network partition
● write and read to any node
● other database options possible:
Percona XtraDB, PostgreSQL, etc.
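A hedged my.cnf excerpt showing the wsrep settings involved (the provider path is distribution-specific and the node addresses are placeholders):

    # /etc/mysql/conf.d/galera.cnf (excerpt)
    [mysqld]
    binlog_format            = ROW
    default_storage_engine   = InnoDB
    innodb_autoinc_lock_mode = 2
    wsrep_provider           = /usr/lib/galera/libgalera_smm.so        # path varies by distro
    wsrep_cluster_name       = "openstack_db"
    wsrep_cluster_address    = "gcomm://10.0.0.11,10.0.0.12,10.0.0.13"  # 3 nodes for quorum
    wsrep_sst_method         = rsync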
Active/Active - RabbitMQ
● RabbitMQ nodes clustered
● mirrored queues configured via policy (e.g. ha-mode: all) – see the commands below
● all services use the RabbitMQ nodes
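For illustration, clustering and the mirrored-queue policy are typically set up along these lines (node names are placeholders):

    # on each joining node
    rabbitmqctl stop_app
    rabbitmqctl join_cluster rabbit@controller1
    rabbitmqctl start_app

    # mirror all queues across every node in the cluster (ha-mode: all)
    rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'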
Active/Active - Networking
Network
● deploy multiple network nodes
● Neutron DHCP agent – configure multiple DHCP agents
(dhcp_agents_per_network)
● Neutron L3 agent
o automatic L3 agent failover (allow_automatic_l3agent_failover)
o VRRP (l3_ha, max_l3_agents_per_router, min_l3_agents_per_router)
● Neutron L2 agent – no HA available
● Neutron metadata agent – no HA available
● Neutron LBaaS agent – no HA available
● where no HA feature is available: active/passive Pacemaker/Corosync solution
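The Neutron options named above, shown as a neutron.conf excerpt (the values are illustrative, not recommendations):

    # neutron.conf (excerpt)
    [DEFAULT]
    # run two DHCP agents per tenant network
    dhcp_agents_per_network = 2

    # reschedule routers away from a failed L3 agent ...
    allow_automatic_l3agent_failover = True

    # ... or use VRRP-based HA routers
    l3_ha = True
    max_l3_agents_per_router = 3
    min_l3_agents_per_router = 2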
Active/Active - Example
Deployment example
Active/Passive
General
Tools Overview
Controllers Overview
Active/Passive: General
● Components should leverage a Virtual IP
● The primary tools used for Active/Passive
OpenStack configurations are general
(non-OpenStack-specific): Pacemaker +
Corosync, and DRBD
Corosync
● Messaging layer used by the cluster
● Responsibilities include cluster membership and messaging
● Leverages RRP (Redundant Ring Protocol)
o Rings can be set up as A/A or A/P
o UDP only
o mcastport specifies the receive port; mcastport minus 1 is the send port
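A corosync.conf sketch with two redundant rings (network addresses and ports are placeholders):

    # /etc/corosync/corosync.conf (excerpt)
    totem {
        version: 2
        rrp_mode: passive          # or "active" to use both rings concurrently
        interface {
            ringnumber: 0
            bindnetaddr: 10.0.0.0
            mcastaddr: 239.255.42.1
            mcastport: 5405        # receives on 5405, sends on 5404
        }
        interface {
            ringnumber: 1
            bindnetaddr: 10.0.1.0
            mcastaddr: 239.255.42.2
            mcastport: 5407        # receives on 5407, sends on 5406
        }
    }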
Pacemaker
● Cluster Resource Manager
● Cluster Information Base (CIB)
o Represents current state of resources
and cluster configuration (XML)
● Cluster Resource Management Daemon
(CRMd)
o Acts as decision maker (one master)
● Policy Engine (PEngine)
o Sends instructions to the LRMd and CRMd
● STONITHd
o Fencing mechanism
[Diagram: Pacemaker internals – CRMd coordinating the CIB, PEngine, LRMd, and STONITHd]
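For context, basic cluster properties are usually set once Pacemaker is running; a short pcs sketch (values are illustrative, not prescriptive):

    # enable fencing (STONITH); relax quorum handling for a 2-node cluster
    pcs property set stonith-enabled=true
    pcs property set no-quorum-policy=ignore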
DRBD
● Distributed Replicated Block Device
● Creates logical block devices (e.g. /dev/drbdX) that
have local backing volumes
● Reads serviced locally
● Primary node writes are sent to secondary node
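A minimal DRBD resource definition as a sketch (hostnames, backing disks, and addresses are placeholders):

    # /etc/drbd.d/mysql.res
    resource mysql {
        device    /dev/drbd0;
        disk      /dev/sdb1;          # local backing volume on each host
        meta-disk internal;
        on host1 {
            address 10.0.0.11:7788;
        }
        on host2 {
            address 10.0.0.12:7788;
        }
    }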
Active/Passive: Database
[Diagram: Host1 and Host2 each run the stack MySQL → DRBD → Pacemaker → Corosync; only one side is active at a time]
● Use DRBD to back MySQL
● Leverage VIP that can float
between hosts
● Manage all resources (including the
MySQL daemon) with Pacemaker – see
the sketch below
● MySQL/Galera is an alternative, but the
current version of the HA Guide does
not recommend it
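A rough pcs sketch of the resources described above, using the standard OCF/LINBIT agents (resource names and IPs are placeholders; ordering constraints and most agent parameters are omitted for brevity):

    # floating VIP for the database
    pcs resource create p_db_vip ocf:heartbeat:IPaddr2 ip=192.0.2.20 cidr_netmask=24 op monitor interval=30s

    # DRBD as a master/slave resource, with the filesystem and MySQL on top
    pcs resource create p_drbd_mysql ocf:linbit:drbd drbd_resource=mysql op monitor interval=20s
    pcs resource master ms_drbd_mysql p_drbd_mysql master-max=1 clone-max=2 notify=true
    pcs resource create p_fs_mysql ocf:heartbeat:Filesystem device=/dev/drbd0 directory=/var/lib/mysql fstype=ext4
    pcs resource create p_mysql ocf:heartbeat:mysql op monitor interval=30s

    # keep everything on the node where DRBD is primary
    pcs constraint colocation add p_fs_mysql with master ms_drbd_mysql INFINITY
    pcs constraint colocation add p_mysql with p_fs_mysql INFINITY
    pcs constraint colocation add p_db_vip with p_mysql INFINITY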
Active/Passive: RabbitMQ
[Diagram: Host1 and Host2 each run the stack RabbitMQ → DRBD → Pacemaker → Corosync; only one side is active at a time]
● Use DRBD to back RabbitMQ
● Leverage VIP that can float
between hosts
● Ensure the Erlang cookie (.erlang.cookie) is
identical on all nodes
o enables the nodes to communicate
with each other (see the sketch below)
● RabbitMQ clustering does not
tolerate network partitions well
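As an illustration, the cookie is simply copied between nodes (the path below is the usual location for packaged installs; hostnames are placeholders):

    # copy the Erlang cookie from the first node to its peer and fix permissions
    scp /var/lib/rabbitmq/.erlang.cookie host2:/var/lib/rabbitmq/.erlang.cookie
    ssh host2 'chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie && chmod 400 /var/lib/rabbitmq/.erlang.cookie'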
Active/Passive: Overview (From Guide)
● Leverage the DB and RabbitMQ VIPs in configuration files
● Configure Pacemaker resources for OpenStack services (example below)
o Image API
o Identity
o Block Storage API
o Telemetry Central Agent
o Networking
o L3-Agent
o DHCP
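A sketch of wrapping one such service as a Pacemaker resource, here via the generic systemd resource class (the HA Guide also describes dedicated OCF agents; the unit and resource names vary by distribution and are placeholders):

    # manage the Identity (keystone) service through Pacemaker, tied to the API VIP
    pcs resource create p_keystone systemd:openstack-keystone op monitor interval=30s
    pcs constraint colocation add p_keystone with p_api_vip INFINITY   # p_api_vip: the VIP resource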
DT Implementation - Overview
● Business Market Place (BMP)
● SaaS offering
● https://portal.telekomcloud.com/
● SaaS Applications from Software Partners
(ISVs) and DT offered to SME customers
● Platform based on Open Source technologies only
(OpenStack, CEPH, Linux)
● Project started in 2012 with OpenStack Essex and CEPH
● In production since 03/2013
DT Implementation
DTAG scale out project (ongoing)
Target: Migrate production to a new DC and scale out
Requirements:
● scale out compute by 30%, storage by 40%
● eliminate all SPOFs
● Setup in two fire protection areas / physically separated DC rooms
DT Implementation
● single-region HA OpenStack instance
● all services distributed over two DC rooms
o Compute and Storage distributed equally
o All OpenStack services HA (as far as possible)
▪ OSS (DNS, NTP, puppet master, mirror etc., redundant perimeter
firewall)
● Instance distribution: 4 Availability Zones, multiple host aggregates and
scheduler filters
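For illustration, availability zones, host aggregates and scheduler filters are wired together roughly like this (aggregate, zone, and host names are placeholders, not DT's actual layout):

    # create a host aggregate exposed as an availability zone and add hosts to it
    nova aggregate-create room1-az1 room1-az1
    nova aggregate-add-host room1-az1 compute-001

    # nova.conf (excerpt) – filters the scheduler should apply
    [DEFAULT]
    scheduler_default_filters = AvailabilityZoneFilter,RamFilter,ComputeFilter,AggregateInstanceExtraSpecsFilter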
DT Implementation
● Load Balancing
o HAproxy for MySQL, services, RabbitMQ, APIs (nginx under test)
● MySQL
o Galera Multi Master Node replication (3 nodes)
● RabbitMQ
o 2 nodes cluster / mirrored queues
● Neutron
o DHCP multiple agents started; Pacemaker/Corosync
● API Endpoints
o Load balancing with round-robin distribution
● Storage
o 2 shared, distributed CEPH clusters (RBD/S3)
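A hedged HAProxy excerpt for fronting Galera with one write node and the others as backups, matching the single-writer experience described on the next slide (IPs and the check user are placeholders):

    # /etc/haproxy/haproxy.cfg (excerpt)
    listen galera
        bind 192.0.2.30:3306
        balance source
        option mysql-check user haproxy_check   # this check user must exist in MySQL
        server db1 10.0.0.21:3306 check
        server db2 10.0.0.22:3306 check backup  # only used if db1 fails
        server db3 10.0.0.23:3306 check backup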
DT Implementation
Tests/Experiences so far
● Load balancing works well
● Database: OpenStack multi-node write issues
o 1 node write / 2 nodes backup: diminishes Galera HA efficiency (monitoring)
● Specific issues with deployment in 2 DC rooms / uneven distribution of services (Galera)
o if the “wrong” room fails
▪ Galera: quorum requires a majority!
room with 2 nodes goes down → 3rd node will deactivate itself → DB outage
▪ Storage-specific:
▪ CEPH may lose 2/3 of the replicas → heavy replication load on the CEPH cluster
▪ danger of losing data (OSD/disk failure) → raise replica level / adapt CRUSH map
▪ Network: recovering from a Neutron / L3 failure: <15 minutes
o pet applications are vulnerable – may suffer hiccups during disasters anyway
● DHCP agent failures
DT Implementation
Plans for the future
● use DVR / VRRP in the future
o make network more resilient and elastic
● a third DC room would be desirable :-)
o CEPH replicas / MONs, MySQL Galera
eBay/PayPal Implementation
The scope of eBay/PayPal OpenStack Clouds
● 100% of PayPal web/mid tier
● Most of Dev/QA
● Number of hypervisors (HVs): 8,500
● Number of Virtual Machines: 70,000
● Number of users: Several thousands
● Availability zones: 10
eBay/PayPal Implementation
● Database
MySQL MMM replication, VIP with FailoverPersistence / Galera
● RabbitMQ
VIP with SingleNode FailoverPersistence or 3 nodes with mirrored queues
● Neutron DHCP / LBaaS
Corosync/Pacemaker
● API Endpoints
LB VIPs for every service with either RR or least connection
● Storage
Shared storage with NFS/iSCSI
eBay/PayPal Implementation
Successful HA Implementations
● LoadBalanced HA - VIPs for every service
● LB Single Node Failover Persistence Profile
● Galera/Percona for Identity Service
● Global Identity Service using GLB
eBay/PayPal Implementation
HA Failures
● Corosync/Pacemaker
Neutron DHCP and LBaaS - missing advanced health checks
● RabbitMQ
Single Node Failover Persistence
● MySQL Replication
Single Node Failover Persistence sometimes doesn't work well;
implemented external monitoring and disabling of the failed member
● VIPs without ECV health checks
eBay/PayPal Implementation
Future direction
● HA on Global or Regional Services
One leg in each Availability Zone
(Keystone, LBaaS, Swift)
● RabbitMQ with 3 node/mirrored queues
LB VIP with least connections
● No shared NFS for Glance
eBay/PayPal Global Identity Service
eBay/PayPal Implementation
Lessons Learned
● Try not to overcomplicate
● Simulate Failures
Before placing in production make sure HA works
● Place your services in different Availability zones
or at least different FaultZones
● Always make backups
No matter how robust your HA solution is
Call to Action
● OpenStack HA Guide Update Efforts
● WTE Work Group (now known as ‘Enterprise’)
● Share Best Practices
Reference
OpenStack HA guide:
http://docs.openstack.org/high-availability-guide/content/index.html
Percona Resources:
https://www.percona.com/resources/mysql-webinars/high-availability-using-mysql-cloud-today-tomorrow-and-keys-your-success
HAProxy Documentation:
http://www.haproxy.org/

Editor's Notes

  • #5 Explain the notion of High Availability in the context of OpenStack: ensuring high availability of OpenStack services, API services, and supporting infrastructure, including databases and message queues. HA means different things in different contexts – is it guest availability? The database? Storage? Application availability? If there is a failure, should the application fail over or should the underlying infrastructure? Broadly: protect against system downtime and prevent accidental data loss.
    There can be multiple SPOFs – services, API endpoints, network components, storage components, and infrastructure components such as power, cooling, etc. Provide redundancy at the appropriate levels.
    OpenStack is a collection of services sharing some common infrastructure. It is not a monolithic application that can be made highly available by slapping in a load balancer. These services are independent and self-contained, with some shared infrastructure among them, and they have different configurations, settings, and more. Some of the components are stateless – such as nova-api, keystone-api, glance-api, etc. Some of the components are databases / message queues. The OpenStack architecture is quite complex.
    Active/Passive – one instance is ‘active’ and, on failure, the redundant service/system is brought into action. For stateless services, very minimal configuration is needed; for stateful services, additional tools such as Pacemaker and Corosync are needed.
    Active/Active – both the active and redundant systems are maintained in the same state concurrently. For stateless services, active and redundant instances are load balanced using an LB such as HAProxy. Stateful services will need to be maintained in the same state – and again need an LB.