This presentation covers Avi Networks' approach to high availability (HA). It describes the move from an active/standby HA model built around physical devices to an active/active model using software-based Service Engines (SEs), which can run on bare metal, in virtual machines, in containers, or in public clouds. The Controller distributes load across SEs for near-infinite scale, and SEs are grouped so that session data is replicated within each group, providing automatic failover without impact to applications or management.
Design Best Practices for High Availability in Load Balancing
1. High Availability
Nathan McMahon
Product Management
nathan@avinetworks.com
2. High Availability
• Why change the HA model?
• How has the model changed?
• Specific examples of impact
3. Islands of Technology (Active/Standby)
[Diagram: sixteen independent active/standby load balancer pairs (LB1/LB2), each an isolated island]
App       VIP            Data Center  LB Pair    LB IP Addr
Exchange  17.234.11.10   SV1          DC1-LB07a  10.120.23.34
Exchange  219.2.40.121   Virginia     DC-V-LB01  10.8.10.241
OWA       17.234.11.11   SV1          DC1-LB07a  10.120.23.34
OWA       219.2.40.127   Virginia     DC-V-LB01  10.8.10.241
www       17.234.28.24   SV1          DC1-LB2    10.120.23.117
AppStack  17.234.28.25   SV1          DC1-LB1    10.120.23.120
• An active/standby load balancer pair has limited capacity
• Virtual services (VSes) must be manually placed onto a single pair of LBs
• Management complexity grows with the number of apps
• Automation is hard to write because it must first resolve which LB pair owns each app (see the sketch below)
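To make the last point concrete, here is a minimal, hypothetical sketch of the device-inventory lookup such automation ends up embedding. The table data mirrors the example above; every name is illustrative:

```python
# Hypothetical inventory map that automation must maintain by hand:
# (app, data center) -> the LB pair that owns the app's VIP.
LB_INVENTORY = {
    ("Exchange", "SV1"): ("DC1-LB07a", "10.120.23.34"),
    ("Exchange", "Virginia"): ("DC-V-LB01", "10.8.10.241"),
    ("www", "SV1"): ("DC1-LB2", "10.120.23.117"),
}

def lb_for(app, dc):
    """Resolve which LB pair a script must target; breaks whenever an app moves."""
    return LB_INVENTORY[(app, dc)]

print(lb_for("Exchange", "SV1"))  # ('DC1-LB07a', '10.120.23.34')
```

Every new app or migration means updating this map on every script that touches the load balancers.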
4. Islands of Technology (Active/Standby)
[Diagram: the active LB running at 15% utilization while the standby sits at 0%]
• No shared capacity pooling
• Costly overprovisioning: an active unit at 15% paired with an idle standby averages only 7.5% across the pair
• The shift from proprietary hardware to software compounds these challenges
• Average utilization of traditional LBs? 6-8%
6. Separate Control and Data Plane
• Controllers form the control plane; Service Engines form the data plane
• Manage as one, not many devices
[Diagram: Controllers (CONTROL) above Service Engines (DATA)]
7. [Diagram: Controllers (CONTROL) managing Service Engines (DATA) across bare metal, virtualized, containers (Mesos), and public cloud]
• Separate control and data plane: manage as one, not many devices
• Hybrid cloud: both traditional and modern use cases
• Automation: highly programmable, plug-n-play
• Analytics: actionable insights are key to automation
9. Avi Object Model
• Avi Controller
• Avi Service Engines
• Load balancing components: Virtual Service, Pools, Networks, Servers
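As a rough illustration of how these objects relate, here is a hedged sketch using the Avi REST API. The controller address is hypothetical and the payload shapes vary by version, so treat the field names as assumptions to verify against your API reference:

```python
# Sketch (assumed API shapes): a virtual service fronting a pool of servers.
import requests

CTRL = "https://avi-controller.example.com"   # hypothetical controller address
session = requests.Session()
session.verify = False                        # lab only; authentication omitted

# A pool groups the backend servers.
pool = {
    "name": "web-pool",
    "servers": [{"ip": {"addr": "10.0.0.10", "type": "V4"}, "port": 80}],
}
session.post(f"{CTRL}/api/pool", json=pool)

# A virtual service ties a service port and VIP to that pool.
vs = {
    "name": "web-vs",
    "services": [{"port": 80}],
    "pool_ref": f"{CTRL}/api/pool?name=web-pool",
}
session.post(f"{CTRL}/api/virtualservice", json=vs)
```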
11. Controller Redundancy
• The Controller may be deployed standalone or as a redundant three-node cluster
• High availability uses a ZooKeeper-like model: a 3-node cluster maintains quorum
• All Controllers are active, sharding workloads among themselves
• Management may be performed from any Controller in the cluster
[Diagram: a standalone Controller versus a 3-node Controller cluster with one leader and two followers]
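As a quick way to see the cluster model in practice, the sketch below polls the Controller's cluster-runtime endpoint. The endpoint path and response fields are assumptions based on the Avi API and may differ by version:

```python
# Sketch (assumed endpoint): report the state of each Controller cluster node.
import requests

CTRL = "https://avi-controller.example.com"   # hypothetical address

resp = requests.get(f"{CTRL}/api/cluster/runtime", verify=False)  # login omitted
runtime = resp.json()
print("cluster state:", runtime.get("cluster_state", {}).get("state"))
for node in runtime.get("node_states", []):
    print(node.get("name"), "->", node.get("state"), "role:", node.get("role"))
```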
12. Controller High Availability
Single Node Failure
• No impact to the data plane (the Service Engines) or to management
13. Controller High Availability
Two Node Failure
• The remaining Controller node will not take over as active without quorum (2 of 3 nodes)
  – This mitigates the split-brain problem of traditional active/active designs, e.g. when a Controller is not actually down but has merely lost connectivity to its peers
• The remaining Controller must be manually promoted to own the cluster and become active
14. Controller High Availability
Three Node Failure
• No impact to the data plane: Service Engines continue to run in headless mode until Controllers are restored
• No configuration changes are possible until the Controllers are restored or redeployed
• Service Engines buffer metrics and logs until the Controllers are back; buffer size depends on the disk allocated to each SE
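The quorum rule behind all three scenarios is simple majority; a minimal sketch:

```python
# Minimal sketch of the majority-quorum rule a 3-node cluster follows.
def has_quorum(nodes_up, cluster_size=3):
    return nodes_up > cluster_size // 2   # majority of the configured cluster

for nodes_up in (3, 2, 1):
    status = "cluster stays active" if has_quorum(nodes_up) else "manual promotion required"
    print(f"{nodes_up} node(s) up -> {status}")
```

With three nodes, losing one still leaves a majority of two; losing two leaves a single node that cannot prove it is not merely partitioned, hence the manual promotion step.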
15. Controller Process Sharding
• All Controllers are actively working, though they may be performing different tasks
• Each virtual service is hashed to a Controller to divide the workload
• Many newer environments are built around 3 availability zones
[Diagram: a leader and two followers sharding the workload of four virtual services (VS1-VS4)]
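The deck does not state the exact hash Avi uses, but the idea is straightforward; a hypothetical sketch:

```python
# Hypothetical sketch: hash each virtual service name to a cluster node.
import hashlib

NODES = ["controller-1", "controller-2", "controller-3"]  # illustrative names

def shard_owner(vs_name):
    digest = hashlib.sha256(vs_name.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for vs in ("VS1", "VS2", "VS3", "VS4"):
    print(vs, "is handled by", shard_owner(vs))
```

Because the hash is deterministic, every node agrees on which Controller owns which virtual service without extra coordination.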
17. SE Groups
Templates
• SE Groups contain sizing, scaling, placement, and HA properties
• A new SE is created from its SE Group's properties
• SE Group options vary based on the cloud / ecosystem
Folders
• An SE is always a member of the group it was created within
• Each SE Group is an isolation domain
• Apps may gracefully migrate, scale, or fail over across SEs in the group
• Client session data is automatically replicated to the other SEs in the group:
  – Persistence tables
  – SSL sessions / tickets
  – DataScript variables
[Diagram: SE Group 1 (SEs: 1 vCPU, 1 GB; HA: Active/Standby) containing Avi-Lab-123 and Avi-Lab-456; SE Group 2 (SEs: 2 vCPU, 2 GB; HA: Active/Active) containing Avi-SE-xyz, Avi-SE-abc, and Avi-SE-def]
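A hedged sketch of defining such a group through the API follows; field names like ha_mode, vcpus_per_se, and memory_per_se reflect the Avi serviceenginegroup object but should be checked against your version's API reference:

```python
# Sketch (assumed field names): an SE Group sized like "SE Group 2" above.
import requests

CTRL = "https://avi-controller.example.com"   # hypothetical address

se_group = {
    "name": "SE-Group-2",
    "vcpus_per_se": 2,
    "memory_per_se": 2048,             # MB
    "ha_mode": "HA_MODE_SHARED_PAIR",  # elastic Active/Active
}
requests.post(f"{CTRL}/api/serviceenginegroup", json=se_group, verify=False)  # login omitted
```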
18. SE High Availability Modes
[Diagram: HA-mode spectrum, from Legacy Active/Standby (fastest failover, least efficient SE utilization) through Elastic Active/Active and Elastic N+M to Elastic N+0 (longest failover, most efficient SE utilization)]
Failover Steps
• SE failure detection
• Controller determines SE to fail over to
• Controller creates new SE
• Copy VS configuration to new SE
• Configure vNIC on new SE
• Move VIP via GARP or cloud API
19. SE High Availability Modes
Legacy Active/Standby
• The VS is active on one SE and standby on another
• No VS scale-out support
• Primarily for default-gateway / non-SNAT app support
• Fastest failover, but half of the SE resources sit idle
[Diagram: steady state with Apps 1-3 active on SE 1 while SE 2 stands by; after SE 1 fails, SE 2 becomes active for Apps 1-3]
High Availability Mode                      A/S
SE failure detection                         O
Controller determines SE to fail over to    -
Controller creates new SE                   -
Copy VS configuration to new SE             -
Configure vNIC on new SE                    -
Move VIP via GARP or cloud API               O
(O = step performed at failover; - = not required)
20. SE High Availability Modes
Elastic Active/Active [best practice for production apps]
• All SEs are active
• VS must be scaled across at least 2 SEs
• SE failover decision pre-determined
• Session info proactively replicated to other scaled SEs
• Faster failover, potentially greater SE resource requirement
Elastic N+M [default mode]
• All SEs are active
• N = the number of SEs a new VS is scaled across
• M = the buffer: the number of SE failures the group can sustain
• SE failover decision is determined at the time of failure
• Session replication is done after the new SE is chosen
• Slower failover, lower SE resource requirement (see the configuration sketch after this list)
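To contrast the two elastic modes in configuration terms, here is a hedged sketch; the ha_mode values, buffer_se, and min_scaleout_per_vs mirror Avi's serviceenginegroup fields but are assumptions to verify against your API version:

```python
# Sketch (assumed field names): the two elastic HA modes as SE Group settings.
elastic_active_active = {
    "name": "SEG-elastic-aa",
    "ha_mode": "HA_MODE_SHARED_PAIR",   # elastic Active/Active
    "min_scaleout_per_vs": 2,           # each VS placed on at least 2 SEs
}

elastic_n_plus_m = {
    "name": "SEG-elastic-nm",
    "ha_mode": "HA_MODE_SHARED",        # elastic N+M buffer (the default)
    "min_scaleout_per_vs": 1,           # N: SEs each new VS is scaled across
    "buffer_se": 1,                     # M: spare capacity for failures
}
```

The trade-off is visible in the settings: Active/Active pays for standing scale-out to get pre-replicated sessions and faster failover, while N+M keeps a smaller footprint and does its replication after a failure.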
[Diagram: steady state with Apps 1-4 spread across SE 1-3, each SE utilized; one SE fails and its apps are redistributed to the surviving SEs; a new SE 4 is created to meet the HA requirement]
High Availability Mode                      A/A   N+M
SE failure detection                         O     O
Controller determines SE to fail over to    -     O
Copy VS configuration to new SE             -     O
Configure vNIC on new SE                    -     O
Move VIP via GARP or cloud API               O     O
(O = step performed at failover; - = not required)
21. SE High Availability Modes (summary)
[Diagram: the same HA-mode spectrum shown on slide 18]
High Availability Mode                      A/S   A/A   N+M   N+0
SE failure detection                         O     O     O     O
Controller determines SE to fail over to    -     -     O     O
Controller creates new SE                   -     -     -     O
Copy VS configuration to new SE             -     -     O     O
Configure vNIC on new SE                    -     -     O     O
Move VIP via GARP or cloud API               O     O     O     O
(O = step performed at failover; - = not required)
22. SE Native Scaling
Automatically Increase Service Engine Capacity
1. Traffic is steady for a virtual service. The primary SE ARPs for the VIP address.
2. Traffic increases beyond the capacity of a single SE.
3. The Controller brings new load balancers (SEs) online.
4. The primary SE delegates some traffic to the new SEs by forwarding some connections (L2 switched) to the MAC addresses of the other SEs.
5. Each SE takes a portion of the load. With SNAT, servers return traffic to the source SE's MAC, and SEs forward response traffic directly back to clients.
23. SE ECMP Scaling
Scale Service Engines via Upstream Router
• All SEs advertise the VIP to BGP via Route Health Injection
• Router hashes client flows across SEs
• ECMP mode enables scaling across 2 to 64 Service Engines
• With SNAT, servers return traffic to the source SE MAC address
• SEs send response traffic directly to clients
Failure Mitigation
• BFD may be enabled for faster detection of an SE failure
• Persistence and SSL connections are mirrored to ensure a graceful and automatic recovery in case of a router hash redistribution
• SEs forward incorrectly hashed flows to the proper SE
[Diagram: an upstream router ECMP-hashing client flows across SEs that all advertise the VIP]
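The router-side behavior is just a deterministic hash over the flow tuple; here is a minimal, hypothetical sketch of the idea (real routers hash in hardware with their own algorithms, and the SE addresses below are illustrative):

```python
# Hypothetical sketch of ECMP flow hashing: the same 5-tuple always lands
# on the same next hop (SE), so flows stick without shared router state.
import hashlib

SE_NEXT_HOPS = ["10.1.0.11", "10.1.0.12", "10.1.0.13"]  # illustrative SE addresses

def ecmp_pick(src_ip, src_port, dst_ip, dst_port, proto="TCP"):
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    return SE_NEXT_HOPS[int(hashlib.sha256(key).hexdigest(), 16) % len(SE_NEXT_HOPS)]

print(ecmp_pick("198.51.100.7", 51812, "20.1.1.1", 443))
```

When an SE is added or removed, the modulo changes and some flows rehash to a different SE, which is exactly why the slide notes that SEs forward incorrectly hashed flows to the proper SE and mirror persistence and SSL state.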
24. SE Auto Scaling
Scaling
• Scale Out
• Scale In
– Gracefully remove an SE from the active/active group
– Waits one minute for connections to close before scaling in
• Migrate
1. Scale out from SE1 to SE2
2. SE2 GARPs for the VIP
3. Scale in to SE2, removing SE1 from servicing the VIP
Manual Scaling
• Administrator-initiated scale in, scale out, and migrate (the migrate sequence is sketched below)
• Default mode
Auto Scaling
• An SE Group may be configured for manual or automatic scaling
• Avi does not [yet] recommend auto scaling
  – Works on CPU above/below a threshold
  – Auto scaling is available via CLI/API
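As an illustration of the migrate sequence described above, here is a hypothetical orchestration sketch; every function is an invented stand-in for a Controller action, not a real Avi API call:

```python
# Hypothetical sketch of the migrate sequence (all helper names invented).
def scale_out(vs, to_se):
    print(f"{vs}: scaling out to {to_se}")          # VS becomes active on both SEs

def send_garp(se, vip):
    print(f"{se}: gratuitous ARP for {vip}")        # moves the VIP at L2

def scale_in(vs, from_se, drain_seconds=60):
    # The deck notes a one-minute wait for connections to close.
    print(f"{vs}: draining {from_se} for {drain_seconds}s before scale-in")

def migrate(vs, vip, src_se, dst_se):
    scale_out(vs, dst_se)     # 1. Scale out from SE1 to SE2
    send_garp(dst_se, vip)    # 2. SE2 GARPs for the VIP
    scale_in(vs, src_se)      # 3. Scale in, removing SE1 from servicing the VIP

migrate("web-vs", "20.1.1.1", src_se="SE1", dst_se="SE2")
```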
25. Scale SE Performance Up and Out
• Scale up with more CPU cores per SE
• Scale out with more SEs
[Diagram: a single larger SE versus multiple smaller SEs]
26. Multi Availability Zones for Public Cloud
• Public clouds such as AWS split a region into Availability Zones, commonly three
• Each AZ is a separate IP network space
• AWS customers are expected to load-balance traffic across the three AZs
• Avi deploys an SE per AZ
• DNS is then used to distribute traffic across the three VIP addresses for an app
• The Avi Controller removes a VIP from DNS if that AZ or SE is down (see the sketch below)
• Multi-AZ awareness for AWS and Azure requires a DNS profile for the cloud
[Diagram: www.avi.com resolving to 20.1.1.1, 20.2.2.2, and 20.3.3.3, one VIP per AZ; traffic distribution in an AWS region with 3 AZs]
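A minimal sketch of the client-side effect, using the VIPs from the diagram; the record set is illustrative, and in a real deployment the Avi DNS service answers the query and withdraws unhealthy VIPs:

```python
# Illustrative sketch: DNS hands out one VIP per AZ, and the Controller
# withdraws a VIP from the answer set when its AZ or SE is down.
import random

dns_zone = {"www.avi.com": ["20.1.1.1", "20.2.2.2", "20.3.3.3"]}  # one VIP per AZ

def resolve(fqdn):
    return random.choice(dns_zone[fqdn])    # clients spread across healthy VIPs

dns_zone["www.avi.com"].remove("20.2.2.2")  # AZ 2 outage: Controller pulls its VIP
print(resolve("www.avi.com"))               # now only 20.1.1.1 or 20.3.3.3
```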
27. [Diagram: the same architecture overview as slide 7, with Controllers (CONTROL) and Service Engines (DATA) across bare metal, virtualized, containers, and public cloud]
• Why change the HA model?
  – Active/Standby is based on a physical, device-centric world
  – Doesn't scale, and increases management complexity
• How has the model changed?
  – NFV model, Active/Active
• Specific examples of impact
  – Nearly infinite scale
  – Easier management, easier to write automation
28. Next Steps
• Avi Tech Corner Webinars: avinetworks.com/webinars-avi-tech-corner
• Avi Knowledge Base: avinetworks.com/docs
• Avi Workshops: avinetworks.com/workshops
• Virtual Lab: email education@avinetworks.com
29. Nathan McMahon
education@avinetworks.com
avinetworks.com/workshops
Editor's Notes
Which mode is the illustration referencing? (N+M)
In N+M mode, what are the values illustrated for N & M?
Having one very large SE means we need a second SE for HA, so it starts to look like an A/S HA model. Having lots of smaller SEs means minimal overprovisioning to account for HA. Customers should typically fall somewhere in the middle ground between these two extremes.
Multi-AZ is a critical requirement for public clouds. Students need to make their site AZ-aware.
The instructor needs to add a DNS VS and set it under Avi Administration > Settings > DNS Service: "DNS VS".
Then add the AZ2 and AZ3 IPs to the VS. If the VS was created before the DNS VS existed, students will get an error message about needing to add the FQDN name to the VS.