-
1.
www.mirantis.com
Is OpenStack Neutron production ready for large scale deployments?
Oleg Bondarev, Senior Software Engineer, Mirantis
Elena Ezhova, Software Engineer, Mirantis
-
2.
Why are we here?
“We've learned from experience that the truth will come out.”
Richard Feynman
-
3.
Key highlights (Spoilers!)
● Mitaka-based OpenStack deployed by Fuel
● 2 hardware labs were used for testing
● 378 nodes was the size of the largest lab
● Line-rate throughput was achieved
● Over 24500 VMs were launched on a 200-node lab
● ...and yes, Neutron works at scale!
-
4.
Agenda
Labs overview & tools
Testing methodology
Results and analysis
Issues
Outcomes
-
5.
Deployment description
Mirantis OpenStack with Mitaka-based Neutron
ML2 OVS
VxLAN/L2 POP
DVR
rootwrap-daemon ON
ovsdb native interface OFF
ofctl native interface OFF
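For reference, here is a minimal sketch of how the options above map onto Mitaka-era Neutron configuration files (ml2_conf.ini, openvswitch_agent.ini, l3_agent.ini). The option names are stock Neutron ones; the exact file layout and the /tmp output paths are illustrative assumptions, not the configs Fuel actually generates.

```python
# Sketch only: render the deployment options above as Mitaka-era Neutron
# config files. Output paths are placeholders; a real deployment (Fuel in
# this case) writes these under /etc/neutron.
import configparser

ml2 = configparser.ConfigParser()
ml2["ml2"] = {
    "type_drivers": "vxlan",
    "tenant_network_types": "vxlan",
    "mechanism_drivers": "openvswitch,l2population",   # ML2 OVS + L2 pop
}

ovs_agent = configparser.ConfigParser()
ovs_agent["ovs"] = {
    "ovsdb_interface": "vsctl",     # ovsdb native interface OFF
    "of_interface": "ovs-ofctl",    # ofctl native interface OFF
}
ovs_agent["agent"] = {
    "tunnel_types": "vxlan",
    "l2_population": "True",
    "enable_distributed_routing": "True",               # DVR
    # rootwrap-daemon ON: keep one long-lived privileged helper process
    "root_helper_daemon": "sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf",
}

l3_agent = configparser.ConfigParser()
l3_agent["DEFAULT"] = {"agent_mode": "dvr"}             # dvr_snat on controllers

for name, cfg in [("ml2_conf.ini", ml2),
                  ("openvswitch_agent.ini", ovs_agent),
                  ("l3_agent.ini", l3_agent)]:
    with open("/tmp/" + name, "w") as f:
        cfg.write(f)
```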
-
6.
Environment description. 200 node lab
3 controllers, 196 computes, 1 node for Grafana/Prometheus
Controllers
● CPU: 2x Intel Xeon E5-2650v3, Socket 2011, 2.3 GHz, 25 MB cache, 10 cores, 105 W
● RAM: 8x 16 GB Samsung M393A2G40DB0-CPB DDR4 PC4-2133P ECC Reg. CL13
● Network: 2x Intel I350 Gigabit (public network), 2x Intel 82599ES 10-Gigabit SFI/SFP+ (rev 01)
Computes
● CPU: 1x Intel Xeon Ivy Bridge 6C E5-2620 v2, 2.1 GHz, 15 MB cache, 7.2 GT/s QPI, 80 W, Socket 2011R 1600
● RAM: 4x Samsung 8 GB DDR3-1866 1Rx4 ECC Reg RoHS M393B1G70QH0-CMA
● Network: 1x AOC-STGN-i2S 2-port 10 Gigabit Ethernet SFP+
-
7.
Environment description. 378 node lab
3 controllers, 375 computes
Dell PowerEdge R63
● CPU: 2x Intel E5-2680 v3, 2.5 GHz, 12 cores
● RAM: 256 GB, Samsung M393A2G40DB0-CPB
● Network: 2x Intel X710 Dual Port, 10-Gigabit
● Storage: 3.6 TB SSD, RAID1 (Dell PERC H730P Mini), 2x Intel S3610 disks
Lenovo RD550-1U
● CPU: 2x E5-2680 v3, 12-core CPUs
● RAM: 256 GB
● Network: 2x Intel X710 Dual Port, 10-Gigabit
● Storage: 2x Intel S3610 800 GB SSD
-
8.
Tools
● Control plane testing: Rally
● Data plane testing: Shaker
● Density testing: Heat, custom (ancillary) scripts
● System resource monitoring: Grafana/Prometheus
● Additionally: our own eyes, hands and sometimes the 6th sense
-
9.
Integrity test
A control group of resources that must stay persistent no matter what other operations are performed on the cluster.
● 2 server groups of 10 instances
● 2 subnets connected by a router
● Connectivity checks by floating IPs and fixed IPs
● Checks are run between other tests to ensure data-plane operability
-
10.
Integrity test
● From fixed IP to fixed IP in the same subnet
● From fixed IP to fixed IP in different subnets
-
11.
Integrity test
● From floating IP to floating IP
● From fixed IP to floating IP
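A minimal sketch of what one round of the integrity check boils down to, assuming the VM addresses are known up front. The IPs below are placeholders, and the real test pings from inside the VMs over SSH (so the qrouter/FIP paths on the computes are actually exercised) rather than from the test host.

```python
# Sketch: cover the four traffic routes with a minimal set of ping checks.
import itertools
import subprocess

FIXED_A = ["10.1.0.11", "10.1.0.12"]         # fixed IPs, subnet A (placeholders)
FIXED_B = ["10.2.0.11", "10.2.0.12"]         # fixed IPs, subnet B (placeholders)
FLOATING = ["172.16.0.101", "172.16.0.102"]  # floating IPs of the subnet A VMs

def ping(target):
    return subprocess.call(["ping", "-c", "1", "-W", "2", target],
                           stdout=subprocess.DEVNULL) == 0

pairs = []
pairs += itertools.combinations(FIXED_A, 2)    # fixed -> fixed, same subnet
pairs += itertools.product(FIXED_A, FIXED_B)   # fixed -> fixed, different subnets
pairs += itertools.combinations(FLOATING, 2)   # floating -> floating
pairs += itertools.product(FIXED_B, FLOATING)  # fixed -> floating

failed = [(src, dst) for src, dst in pairs if not ping(dst)]
print("integrity check:", "OK" if not failed else "FAILED: %s" % failed)
```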
-
12.
Rally control plane tests
● Basic Neutron test suite
● Tests with increased number of iterations and concurrency
● Neutron scale test with many servers/networks
-
13.
Rally basic Neutron test suite
Scenarios: create_and_update_, create_and_list_, create_and_delete_ for:
● floating_ips
● networks
● subnets
● security_groups
● routers
● ports
Goal: verify that the cloud is healthy and Neutron services are up and running.
-
14.
Rally high load tests, increased
iterations/concurrency
Concurrency 50-100
Iterations 2000-5000
API tests
create-and-list-networks
create-and-list-ports
create-and-list-routers
create-and-list-security-groups
create-and-list-subnets
Boot VMs tests
boot-and-list-server
boot-and-delete-server-with-secgroups
boot-runcommand-delete
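Here is a sketch of what one of these Rally tasks looks like at 2000 iterations and concurrency 50. The scenario name and task layout follow the standard Rally task format; the user and quota numbers are illustrative assumptions.

```python
# Sketch of a Rally task file for one high-load scenario; run it with
# `rally task start create_and_list_networks.json`.
import json

task = {
    "NeutronNetworks.create_and_list_networks": [{
        "runner": {"type": "constant", "times": 2000, "concurrency": 50},
        "context": {
            "users": {"tenants": 3, "users_per_tenant": 2},
            # create_and_list scenarios are additive (nothing is deleted per
            # iteration), so Neutron quotas have to be lifted.
            "quotas": {"neutron": {"network": -1, "subnet": -1, "port": -1}},
        },
    }],
}

with open("create_and_list_networks.json", "w") as f:
    json.dump(task, f, indent=2)
```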
-
15.
Rally high load tests, increased
iterations/concurrency
All test runs were successful, no errors.
Results on Lab 378 slightly better than on Lab 200.
API tests
create-and-list-networks
create-and-list-ports
create-and-list-routers
create-and-list-security-groups
create-and-list-subnets
Boot VMs tests
boot-and-list-server
boot-and-delete-server-with-secgroups
boot-runcommand-delete
Scenario                  Iterations/Concurrency   Time, Lab 200            Time, Lab 378
create-and-list-routers   2000/50                  avg 15.59, max 29.00     avg 12.942, max 19.398
create-and-list-subnets   2000/50                  avg 25.973, max 64.553   avg 17.415, max 50.41
-
16.
Rally high load tests, increased
iterations/concurrency
First run on Lab 200:
● 7.75% failures, concurrency 100
● 1.75% failures, concurrency 15
Fixes applied on Lab 378:
● 0% failures, concurrency 100
● 0% failures, concurrency 50
API tests
create-and-list-networks
create-and-list-ports
create-and-list-routers
create-and-list-security-groups
create-and-list-subnets
Boot VMs tests
boot-and-list-server
boot-and-delete-server-with-secgroups
boot-runcommand-delete
-
17.
Rally high load tests, increased
iterations/concurrency
Trends
create_and_list_networks
● create - slow linear growth
● list - linear growth
-
18.
create_and_list_networks trends
create network
list networks
-
19.
Rally high load tests, increased
iterations/concurrency
Trends
create_and_list_networks
● create - stable
● list - linear growth
create_and_list_routers
● create - stable
● list - linear growth (6.5 times in 2000 iterations)
-
20.
create_and_list_routers trends
create router
list routers
-
21.
Rally high load tests, increased
iterations/concurrency
Trends
create_and_list_networks
● create - stable
● list - linear growth
create_and_list_routers
● create - stable
● list - linear growth (6.5 times in 2000 iterations)
create_and_list_subnets
● create - slow linear growth
● list - linear growth (20 times in 2000 iterations)
-
22.
create_and_list_subnets trends
create subnet
list subnets
-
23.
Rally high load tests, increased
iterations/concurrency
Trends
create_and_list_networks
● create - stable
● list - linear growth
create_and_list_routers
● create - stable
● list - linear growth (6.5 times in 2000 iterations)
create_and_list_subnets
● create - slow linear growth
● list - linear growth (20 times in 2000 iterations)
create_and_list_ports
-
24.
create_and_list_ports trends
average load
-
25.
Rally high load tests, increased
iterations/concurrency
Trends
create_and_list_networks
● create - stable
● list - linear growth
create_and_list_routers
● create - stable
● list - linear growth (6.5 times in 2000 iterations)
create_and_list_subnets
● create - slow linear growth
● list - linear growth (20 times in 2000 iterations)
create_and_list_ports
● gradual growth
create_and_list_secgroups
● create 10 sec groups - stable, with peaks
● list - rapid growth (17.2 times)
-
26.
create_and_list_secgroups trends
create 10 security groups
list security groups
-
27.
Rally scale with many networks
100 networks per iteration
1 VM per network
Iterations 20, concurrency 3
-
28.
Rally scale with many VMs
1 network per iteration
100 VMs per network
Iterations 20, concurrency 3
-
29.
Shaker: Architecture
Shaker is a distributed data-plane testing tool for OpenStack.
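A sketch of how such a run is launched from the test host. The scenario paths and CLI flags are the ones we recall from the Shaker documentation and should be double-checked; the server endpoint is a placeholder address that the agent VMs must be able to reach.

```python
# Sketch: drive the three Shaker scenarios used below from the test host.
import subprocess

SCENARIOS = [
    "openstack/full_l2",
    "openstack/full_l3_east_west",
    "openstack/full_l3_north_south",
]

for scenario in SCENARIOS:
    report = scenario.rsplit("/", 1)[-1] + ".html"
    subprocess.check_call([
        "shaker",
        "--server-endpoint", "10.20.0.2:5999",  # placeholder, reachable from VMs
        "--scenario", scenario,
        "--report", report,
    ])
```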
-
30.
Shaker: L2 scenario
Tests the bandwidth between pairs of instances on different nodes in the same virtual network.
-
31.
Shaker: L3 East-West scenario
Tests the bandwidth between pairs of instances on different nodes deployed in different virtual networks plugged into the same router.
-
32.
Shaker: L3 North-South scenario
Tests the bandwidth between pairs of instances on different nodes deployed in different virtual networks.
-
33.
Shaker: Lab 200, MTU 1500
Standard configuration
Bi-directional L3 East-West scenario:
● 561 Mbits/sec upload, 528 Mbits/sec download
Intel 82599ES 10-Gigabit
-
34.
Shaker: Lab 200, MTU 9000
Enabled jumbo frames
Bi-directional L3 East-West scenario:
● 3615 Mbits/sec upload, 3844 Mbits/sec download
x7 increase in throughput
Intel 82599ES 10-Gigabit
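A sketch of the Neutron side of the jumbo-frame change, assuming the physical fabric and NICs are already set to MTU 9000. The option names are Mitaka-era Neutron ones; the output paths are placeholders.

```python
# Sketch: tell Neutron about the 9000-byte physical MTU so tenant networks
# get a correspondingly larger MTU (VxLAN still subtracts ~50 bytes).
import configparser

neutron_conf = configparser.ConfigParser()
neutron_conf["DEFAULT"] = {"global_physnet_mtu": "9000"}

ml2_conf = configparser.ConfigParser()
ml2_conf["ml2"] = {"path_mtu": "9000"}

for name, cfg in [("neutron.conf", neutron_conf), ("ml2_conf.ini", ml2_conf)]:
    with open("/tmp/" + name, "w") as f:
        cfg.write(f)
```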
-
35.
Shaker: Lab 378,
L3 East-West Bi-directional test
HW offloads-capable NIC
Hardware offloads boost with small MTU (1500):
● x3.5 throughput increase in the bi-directional test
Increasing MTU from 1500 to 9000 also gives a significant boost:
● 75% throughput increase in the bi-directional test (offloads on)
Intel X710 Dual Port 10-Gigabit
-
36.
Shaker: Lab 378,
L3 East-West Download test
HW offloads-capable NIC
Hardware offloads boost with small MTU (1500):
● x2.5 throughput increase in the download test
Increasing MTU from 1500 to 9000 also gives a significant boost:
● 41% throughput increase in the download test (offloads on)
Intel X710 Dual Port 10-Gigabit
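A sketch of how the relevant offloads can be inspected and toggled on a compute node. The interface name is a placeholder, and the exact feature set differs between the 82599ES and the X710, so toggles for features a NIC does not expose are simply ignored here.

```python
# Sketch: check and enable the NIC offloads that matter for VxLAN traffic.
import subprocess

IFACE = "ens1f0"  # placeholder: the tunnel-facing interface
FEATURES = {
    "tso": "on",                      # TCP segmentation offload
    "gso": "on",                      # generic segmentation offload
    "gro": "on",                      # generic receive offload
    "tx-udp_tnl-segmentation": "on",  # VxLAN (UDP tunnel) segmentation offload
}

print(subprocess.check_output(["ethtool", "-k", IFACE]).decode())  # current state
for feature, state in FEATURES.items():
    # returns non-zero for features the NIC does not expose; ignore those
    subprocess.call(["ethtool", "-K", IFACE, feature, state])
```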
-
37.
Shaker: Lab 378,
L3 East-West Download test
Near line-rate results in L2 and L3 east-west Shaker tests even with concurrency >50:
● 9800 Mbits/sec in download/upload tests
● 6100 Mbits/sec each direction in bi-directional tests
Intel X710 Dual Port 10-Gigabit
-
38.
Shaker: Lab 378,
Full L2 Download test
Intel X710 Dual Port 10-Gigabit
-
39.
Shaker: Lab 378,
L3 East-West Download test
Intel X710 Dual Port 10-Gigabit
-
40.
Shaker: Lab 378,
Full L3 North-South Download test
Intel X710 Dual Port 10-Gigabit
-
41.
Shaker: Lab 378,
L3 East-west Bi-directional test
Intel X710 Dual Port 10-Gigabit
-
42.
Shaker: Lab 378,
L3 East-west Bi-directional test
Intel X710 Dual Port 10-Gigabit
-
43.
Dataplane testing outcomes
Neutron DVR+VxLAN+L2pop installations are capable of almost line-rate performance.
Main bottlenecks: hardware configuration and MTU settings.
Solution:
1. Use HW offloads-capable NICs
2. Enable jumbo frames
North-South scenario needs improvement
-
44.
Density test
Aim:
● Boot the maximum number of VMs the cloud can manage.
● Make sure VMs are properly wired and have access to the external network.
● Verify that the data plane is not affected by high load on the cloud.
-
45.
Environment description. 200 node lab
3 controllers, 196 computes, 1 node for Grafana/Prometheus
Controllers
● CPU: 20 cores
● RAM: 128 GB
● Network: 2x Intel I350 Gigabit (public network), 2x Intel 82599ES 10-Gigabit SFI/SFP+ (rev 01)
Computes
● CPU: 6 cores
● RAM: 32 GB
● Network: 1x AOC-STGN-i2S 2-port 10 Gigabit Ethernet SFP+
-
46.
Density test process
Heat was used to create 1 network with a subnet, 1 DVR router, and 1 CirrOS VM per compute node (see the sketch below).
1 Heat stack == 196 VMs
Upon spawn, VMs get their IPs from metadata and send them to an external HTTP server.
Iteration 1
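A sketch of that per-iteration Heat stack, written as JSON (which Heat accepts as HOT input). Resource names, the external network name, the flavor/image names and the reporting URL are placeholders, and the real templates were parametrized rather than generated like this.

```python
# Sketch of one density-test stack: a network with a subnet, a DVR router
# uplinked to the external network, and one CirrOS VM per compute node.
import json

VMS_PER_STACK = 196  # one VM per compute node on the 200-node lab

resources = {
    "net": {"type": "OS::Neutron::Net"},
    "subnet": {"type": "OS::Neutron::Subnet",
               "properties": {"network": {"get_resource": "net"},
                              "cidr": "10.1.0.0/16"}},
    "router": {"type": "OS::Neutron::Router",
               "properties": {"external_gateway_info": {"network": "admin_floating_net"}}},
    "router_if": {"type": "OS::Neutron::RouterInterface",
                  "properties": {"router": {"get_resource": "router"},
                                 "subnet": {"get_resource": "subnet"}}},
}
# On boot, each VM reads its IP from the metadata service and reports it to
# an external HTTP server (placeholder URL), proving wiring and external access.
user_data = ("#!/bin/sh\n"
             "IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)\n"
             "curl -s -d \"ip=$IP\" http://10.20.0.100:8000/report\n")
for i in range(VMS_PER_STACK):
    resources["vm%d" % i] = {
        "type": "OS::Nova::Server",
        "properties": {"image": "TestVM", "flavor": "m1.micro",
                       "networks": [{"network": {"get_resource": "net"}}],
                       "user_data": user_data},
    }

template = {"heat_template_version": "2015-10-15", "resources": resources}
with open("density_stack.json", "w") as f:
    json.dump(template, f, indent=2)
# create one stack with: openstack stack create -t density_stack.json density-1
```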
-
47.
Density test process
Heat stacks were created in batches of 1 to 5 (5 most of the time).
1 iteration == 196*5 VMs
The integrity test was run periodically.
Constant monitoring of lab status using the Grafana dashboard.
Iteration k
-
48.
Density test results
125 Heat stacks were created
Total 24500 VMs on a cluster
Number of bugs filed and fixed: 8
Days spent: 3
People involved: 12
Data-plane connectivity lost: 0 times
-
49.
Grafana dashboard during density test
-
50.
Density test load analysis
-
51.
Issues faced
● Ceph failure!
● Bugs
● LP #1614452 Port create time grows at scale due to dvr arp update
● LP #1610303 l2pop mech fails to update_port_postcommit on a loaded cluster
● LP #1528895 Timeouts in update_device_list (too slow with large # of VIFs)
● LP #1606827 Agents might be reported as down for 10 minutes after all controllers restart
● LP #1606844 L3 agent constantly resyncing deleted router
● LP #1549311 Unexpected SNAT behavior between instances with DVR+floating ip
● LP #1609741 oslo.messaging does not redeclare exchange if it is missing
● LP #1606825 nova-compute hangs while executing a blocking call to librbd
● Limits
● ARP table size on nodes
● cpu_allocation_ratio
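A sketch of the two limits above as they are usually addressed. The exact values used on the labs are not in the slides, so the numbers below are illustrative only; the sysctl keys and the nova.conf option are standard ones.

```python
# Sketch: raise the kernel neighbor (ARP) table limits and Nova's
# cpu_allocation_ratio. Values are illustrative, not the lab's actual ones.
import configparser
import subprocess

# Larger ARP/neighbor table thresholds so thousands of VM and tunnel entries
# do not trigger "neighbour table overflow" on computes and controllers.
for key, value in [("net.ipv4.neigh.default.gc_thresh1", "4096"),
                   ("net.ipv4.neigh.default.gc_thresh2", "8192"),
                   ("net.ipv4.neigh.default.gc_thresh3", "16384")]:
    subprocess.call(["sysctl", "-w", "%s=%s" % (key, value)])

# Allow more VMs per physical core on the computes (nova.conf).
nova = configparser.ConfigParser()
nova["DEFAULT"] = {"cpu_allocation_ratio": "12.0"}
with open("/tmp/nova.conf", "w") as f:
    nova.write(f)
```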
-
52.
Outcomes
● No major issues in Neutron
● No threatening trends in control-plane tests
● Data-plane tests showed stable performance on all hardware
● Data-plane does not suffer from control-plane failures
● 24K+ VMs on 200 nodes without serious performance degradation
● Neutron is ready for large-scale production deployments on 350+ nodes
-
53.
Links
http://docs.openstack.org/developer/performance-docs/test_plans/neutron_features/vm_density/plan.html
http://docs.openstack.org/developer/performance-docs/test_results/neutron_features/vm_density/results.html
-
54.
Thank you
for your time
Good afternoon, everyone! My name is Elena Ezhova, I am a Software Engineer at Mirantis, and this is Oleg Bondarev, Senior Software Engineer at Mirantis. Today we are going to talk about Neutron performance at scale and find out whether it is ready for large deployments.
So, why are we here? For quite a long time there has been a misconception that Neutron is not production-ready and has certain performance issues. That’s why we aspired to put an end to these rumors and perform Neutron-focused performance and scale testing. And now we’d like to share our results.
Here are some key points of our testing:
First, we deployed Mirantis OpenStack 9.0 with Mitaka-based Neutron on 2 hardware labs, with the largest lab having 378 nodes in total.
Secondly, we were able to achieve line-rate throughput in dataplane tests and boot over 24 thousand VMs during density test
...and finally, that’s the major spoiler by the way, we can confirm that Neutron works at scale!
But let’s not get ahead of ourselves and stick to the agenda.
We shall start with describing the clusters we used for testing, their hardware and software configuration along with the tools that we used.
Then we’ll go on to describe tests that were performed, results we got and their analysis.
After that we’ll take a look at issues that were faced during the testing process as well as some performance considerations.
Finally, we’ll round out with the conclusions and outcomes.
We were testing the Mitaka-based Mirantis OpenStack 9.0 distribution with Neutron and the ML2/OVS plugin.
We’ve used VxLAN segmentation type as it is a common choice in production.
We were also using DVR for enhanced data-plane performance.
As to hardware, we were lucky to be able to experiment on two different hardware labs.
The first one had 200 nodes: 3 of them were controllers, 1 we used for running Prometheus with Grafana for cluster status monitoring, and the rest of the nodes were computes.
Here, as you can see, controllers were more powerful than computes, all of them having standard NICs with Intel 82599 controllers.
Now, the second lab had more nodes and had way more powerful hardware.
It had 378 nodes: 3 controllers and all the rest computes. As I said, these servers are more powerful than those on the first lab as they have more CPU, RAM and, what’s important, modern X710 Intel NICs.
Now a quick look at the tools that were used in testing process.
All the tests that we were running can be roughly classified into three groups: control plane, data plane and density tests.
For control plane testing we were using Rally.
For testing the data plane we used a specially designed tool called Shaker, and for density testing it was mostly our custom scripts plus Heat templates for creating stacks.
Prometheus with Grafana dashboard was quite useful for monitoring cluster state.
And, of course we were using our eyes, hands and sometimes even the 6th sense for tracking down issues.
So, what exactly were we doing?
The very first thing we wanted to know when we got the deployed cloud is whether it is working correctly, meaning, do we have internal and external connectivity? What's more, we needed to always have a way to check that the data plane is working after massive resource creation/deletion, heavy workloads, etc.
The solution was to create an Integrity test. It is very simple and straightforward.
We create a control group of 20 instances, all of which are located on different compute nodes. Half of them are in one subnet and have floating IPs, the other half are in another subnet and have only fixed IPs. Both subnets are plugged into a router with a gateway to an external network.
For each of the instances we check that it’s possible to:
1. SSH into it.
2. Ping an external resource (eg. Google)
3. Ping other VMs (by fixed or floating IPs)
This infrastructure should always be persistent and resources shouldn't be cleaned up after a connectivity check is made.
Lists of IPs to ping are formed in a way that checks all possible combinations with minimum redundancy. Having VMs from different subnets with and without floating IPs makes it possible to check that all possible traffic routes are working.
For example, the check validates that ping passes:
From fixed IP to fixed IP in the same subnet
From fixed IP to fixed IP in different subnets, when packets have to go through the qrouter namespace
From floating IP to floating IP, traffic goes through FIP namespace to the external network
From fixed IP to floating IP, when traffic goes through a controller.
This connectivity check is really very helpful for verifying that data-plane connectivity is not lost during testing, and it helped us spot early on when something went wrong with the data plane.
Now I’d like to pass the ball to Oleg who will tell you of control plane testing process and results.
Rally is a well known and I’d say “official” tool for testing control plane performance of OpenStack clusters. I won’t talk much about the tool itself, let’s move to the tests and results.
We started with the so-called basic Neutron test suite - it's actually Neutron API tests like create and list nets, subnets, routers, etc., which don't include spawning VMs. This test suite ships with Rally itself and we didn't modify test options much, as the main purpose is to validate cluster operability.
Secondly, we ran a "hardened" version of the same tests with increased numbers of iterations and concurrency. Plus we added several tests which spawn VMs.
Finally, we ran two tests specifically targeted at creating many networks and servers in different proportions (servers per network) - like many nets with one VM in each vs. fewer nets with many servers in each.
Not much to add here, as I already said these are basic Neutron API tests to validate cluster (and Rally) operability. The picture shows that there is no big difference between avg and max response times which is positive.
Moving on. The following tests were run with concurrency 50-100 and 2000-5000 iterations. The create_and_list tests are additive: they do not delete resources on each iteration, so the load (in terms of number of resources) grows with each iteration.
We also added tests that boot VMs, where boot_runcommand_delete is the most interesting, since it tests successful VM spawning and external connectivity through a floating IP, all at a high rate.
Speaking of results I’d like to note that all highlighted tests were successful (each iteration) and results on a more powerful lab are better, which is expected.
For boot-and-delete-server-with-secgroups and boot-runcommand-delete there were some failures initially on lab 200 (I’ll talk about failures later), after investigation and applying fixes on lab 378 we got a 100% success rate for these tests even with greater concurrency.
Speaking of trends we see that for create and list nets it is a linear growth for list and slow linear growth for create. This has a simple explanation - the more resources we have, the more time neutron server needs for processing.
create & list from 200 node lab
It’s even better for routers - no time increase for create and slow linear growth for list.
create & list from 200 node lab
Same for subnets - slow linear growth for both create and list
create & list from 200 node lab
Here is an aggregated graph for ports - gradual growth as well with some peaks
There is something to look at and profile in list security groups, as the growth seems not quite linear. For create it's more or less stable response times, not depending on the amount of resources created.
In this test on each iteration 100 networks are created with a VM in each network. There were 20 iterations with concurrency 3 and as you can see from the graph this is a really slow response time increase.
And it’s even better for so called “Rally scale with many VMs” test, where it is 1 net with 100 VMs per iteration, 20 iterations and concurrency 3 - a pretty stable time for each iteration. Probably we should’ve done more iterations but we were very limited in time and had to give a priority to other tests.
Just like with this talk! So now I’ll pass the ball to Elena and she will speak about Shaker and data plane testing.
Thanks, Oleg! Shaker is a distributed data-plane testing tool for OpenStack that was developed at Mirantis. Shaker wraps around popular system network testing tools like iperf3, netperf and others. Shaker is able to deploy OpenStack instances and networks in different topologies using Heat.
Shaker starts lightweight agents inside VMs, these agents execute tests and report the results back to a centralized server. In case of network testing only master agents are involved, while slaves are used as back-ends handling incoming packets.
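Roughly, a master agent's job boils down to something like the sketch below (Shaker's own agent is more involved): run iperf3 against its paired slave and report the throughput, which is what the charts later aggregate. The slave address is a placeholder and assumes an iperf3 server is already listening there.

```python
# Sketch: measure upload and download throughput between a master VM and its
# slave using iperf3, the way a Shaker agent pair does underneath.
import json
import subprocess

SLAVE_IP = "10.1.0.12"  # placeholder: the paired slave VM (runs `iperf3 -s`)

def run(extra):
    out = subprocess.check_output(
        ["iperf3", "--client", SLAVE_IP, "--time", "60", "--json"] + extra)
    end = json.loads(out)["end"]
    return end["sum_received"]["bits_per_second"] / 1e6

print("upload   %.0f Mbits/sec" % run([]))             # master -> slave
print("download %.0f Mbits/sec" % run(["--reverse"]))  # slave -> master
```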
There are three typical dataplane test scenarios.
The L2 scenario tests the bandwidth between pairs of VMs in the same subnet. Each instance is deployed on its own compute node. The test increases the load from 1 pair until all available computes are used.
The L3 east-west scenario is the same as the previous one, with the only difference being that the pairs of VMs are deployed in different subnets.
In the L3 north-south scenario VMs with master agents are located in one subnet, and VMs with slave agents are reached via their floating IPs
Our data plane performance testing started on the 200-node lab deployed with standard configuration, which also means that we had 1500 MTU. Having run the Shaker test suite we saw disquietingly low throughput: in east-west bi-directional tests upload was almost 500 MBits/sec!
These results suggested that it would be reasonable to update the MTU from the default 1500 to 9000, which is commonly used in production installations. By doing so we were able to increase throughput by almost 7 times, and it reached almost 4 GBits/sec in each direction in the same test case. Such a difference in results shows that performance depends to a great extent on lab configuration.
Now, if you remember, I was telling you that we actually had two hardware labs, where the second lab had more advanced hardware, most importantly - more advanced Intel X710 NICs.
Among other things, these NICs make it possible to take fuller advantage of hardware offloads, which are especially needed when VxLAN segmentation (with its 50-byte overhead) comes in. Hardware offloads allow a significant increase in throughput while reducing load on the CPU.
Let's see what difference advanced offloads-capable hardware makes.
On the 300+ node lab we ran Shaker tests with different lab configurations: MTU 1500 and 9000 and hardware offloads on and off.
As can be seen on the chart, hardware offloads are most effective with smaller MTU, mostly due to segmentation offloads:
we can see x3.5 throughput increase in bi-directional test (compare columns 1 and 2)
Increasing MTU from 1500 to 9000 also gives a significant boost:
75% throughput increase in bi-directional test (offloads on) (columns 2 and 4)
The situation is the same for unidirectional test cases (download in this example): hardware offloads give x2.5 throughput increase (compare columns 1 and 2).
And combining enabled hardware offloads with jumbo frames helps to increase throughput by 41% (columns 2 and 4).
These results prove that it makes a lot of sense to enable jumbo frames and hardware offloads in production environments whenever possible.
So, here are the real numbers that we got on this lab:
We were able to achieve near line-rate results in L2 and L3 east-west Shaker tests even with concurrency > 50, which means that there were more than 50 pairs of instances sending traffic simultaneously:
9.8 Gbits/sec in download and upload tests
Over 6 Gbits/sec each direction in bi-directional tests
Now, let’s compare the results we got on 200-node lab, that had less advanced hardware with results on 300+ node lab that had more advanced hardware.
On this chart you can see how average throughput between VMs in the same network changes with increasing concurrency. On a 300+ node lab throughput remains line-rate even when concurrency reaches 99.
Almost the same situation is with L3 east-west download test when the VMs are in different subnets connected to the same router.
Here it can be seen that running the same test on a lab with jumbo frames enabled and hardware offloads supported leads to a significant increase in throughput, which stays stable even with high concurrency.
L3 North-South performance is still far from perfect, mostly due to the fact that in this scenario, even with DVR, all the traffic goes through the controller, which in case of high concurrency may get flooded. Apart from that, the resulting throughput depends on many factors, including the configuration of the switch and the lab topology (whether nodes are situated in the same rack or not, etc.) AND the MTU in the external network, which must always be assumed to be no more than 1500.
The results of bi-directional tests are the most important as in real environments there is usually traffic going in and out and therefore it is important that throughput is stable in both directions. Here we can see that on the 300+ node lab the average throughput in both directions was almost 3 times higher than on the 200-node lab with the same MTU 9000.
The average results shown on the previous graphs are often affected by corner cases when the channel gets stuck for various reasons and throughput drops significantly. To have a fuller understanding of what throughput is achievable, you can take a look at a chart with the most successful results, where upload/download exceeds 7 Gbits/sec on the 378-node lab.
To sum up, the dataplane testing has shown that Neutron DVR+VxLAN installations are capable of very high, almost line-rate performance.
There are two major factors: hardware configuration and MTU settings. This means that to get the best results you need a modern HW-offloads-capable NIC and jumbo frames enabled. Even on older NICs that don't support ALL offloads, network performance can be improved drastically, which the results we got on the 200-node lab clearly show.
The North-South scenario clearly needs improvement as DVR is not currently truly distributed and in this scenario all traffic goes through controller which eventually gets clogged.
Now, Oleg will tell you about Density testing and share probably the most exciting results that we got.
Right! With the density test we aimed at 3 main things:
Boot as many VMs as the cloud can manage
But not only boot - make sure VMs are properly wired and have access to the external network
Verify that data-plane is not affected by high load on the cloud
So essentially the main idea was to load the cluster to death to see what the limits are and where the bottlenecks are. And additionally, check what happens to the data plane when the control plane breaks.
We only had a chance to run the density test on the 200-node lab. Just to remind you about the HW: it was 3 controllers with 20 cores and 128 gigs of RAM, and 196 computes with 6 cores and 32 gigs of RAM. One node was taken for cluster health monitoring, with Grafana/Prometheus on it.
Now about the process. We used Heat for the first version of density test on this lab.
1 Heat stack is 1 private net with a subnet connected by a router to a public net, and 1 VM per compute node. So 1 stack means 196 new VMs. To verify external connectivity and metadata access, each VM gets some metadata from the metadata server and sends this info to an external HTTP server. The server thus checks that all VMs got metadata and external access.
We created heat stacks in batches of 1-5 (5 most of the times), so 1 iteration means up to 1000 new VMs.
After each iteration we checked data-plane integrity by executing the connectivity check which Elena described earlier. We also constantly monitored cluster health to be able to detect and investigate any problem at an early stage.
I'll speak about the issues we faced a bit later. Now about the results: it was a 3 (or maybe 4) day journey with over 10 people from different teams involved, and finally we successfully created 125 stacks on this cluster, which is more than 24k VMs that were successfully spawned and got external connectivity. Data-plane connectivity for the control group of VMs was never lost.
This is how one of the Grafana pages looked during the density test. It has CPU and memory load as well as load on the DB and network. These are aggregated graphs for all controllers and computes. Here peaks correspond to batches of VMs spawned. You can also see how memory usage grows on compute nodes, while staying pretty stable on controllers. This is, by the way, close to the final iterations, as you can see memory on the computes is nearly exhausted.
And this is how CPU and memory consumption changed from the first to the last iteration. As you see, we almost reached the memory limit on computes, which we expected to be the limiting factor, but no.
Actually the bottleneck appeared to be in Ceph which was used in our deployment.
The initial failure was hitting the limit of allowed PIDs per OSD node; then Ceph monitors started to consume all (and even more) resources on the controllers in order to restart, causing all other services (Rabbit, OpenStack services) to suffer.
After this Ceph failure the cluster could not be recovered, so the density test had to be stopped before the capacity of compute nodes was exhausted.
The Ceph team commented that 3 Ceph monitors aren't enough for over 20000 VMs (each having 2 drives) and recommended to have at least 1 monitor per ~1000 client connections. It’s also better to move monitors to dedicated nodes.
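Plugging the density-test numbers into that rule of thumb gives a feel for how undersized 3 monitors were; treating every VM drive as one client connection is a rough assumption.

```python
# Back-of-the-envelope check of the Ceph team's sizing recommendation.
vms = 24500               # VMs reached in the density test
drives_per_vm = 2         # each VM had 2 RBD-backed drives
clients_per_monitor = 1000
print(vms * drives_per_vm / clients_per_monitor)  # ~49 monitors suggested vs. 3 deployed
```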
One pretty important note: Connectivity check of Integrity test passed 100% even when cluster went crazy. That is a good illustration of control plane failures not affecting data plane.
Other issues:
At some point we had to increase ARP table size on computes and then on controllers;
Later we had to increase cpu_allocation_ratio on computes. It's a Nova config option controlling how many VMs can be spawned on a certain compute node depending on the number of real cores;
Several Neutron bugs, nothing critical though; the most interesting is port creation time growth, which was fixed by a 2-line patch. Another thing that deserves attention is OVS agent restart on a loaded compute node - there might be timeouts on the agent side when trying to update the status of a large number of interfaces at once. It's a well-known issue which has two alternative patches on review and just needs to reach consensus.
A bug in oslo.messaging which affected us quite a lot and took some time to be investigated and fixed by our messaging team; the gist is that agents were reporting to queues consumed by nobody;
A Nova bug where mass VM deletion leads to nova-compute hanging; it's related to Nova-Ceph interactions;
And finally here are the main outcomes of our scale testing:
No major issues in Neutron were found during testing (all labs, all tests).
The issues found were either already fixed upstream or got fixed upstream during our testing; one is in progress and close to being fixed.
Rally tests did not reveal any significant issues.
No threatening trends in Rally tests results.
Data-plane tests showed stable performance on all hardware. It was demonstrated that high network performance can be achieved even on old hardware that doesn't support VxLAN offloads; it just needs proper MTU settings. On servers with modern NICs throughput is almost line-rate.
Data-plane connectivity is not lost even during serious issues with control plane.
Density testing clearly demonstrated that Neutron is capable of managing over 24500 VMs on 200 nodes (3 controllers) without serious performance degradation. In fact, we weren't even able to spot significant bottlenecks in the Neutron control plane, as we had to stop the test due to issues not related to Neutron.
Neutron is ready for large-scale production deployments on 350+ nodes.
Our process and results have been shared on docs.openstack.org; here are the links.