Is OpenStack Neutron production ready for large scale deployments?

  1. Copyright © 2016 Mirantis, Inc. All rights reserved www.mirantis.com Is OpenStack Neutron production ready for large scale deployments? Oleg Bondarev, Senior Software Engineer, Mirantis Elena Ezhova, Software Engineer, Mirantis
  2. Copyright © 2016 Mirantis, Inc. All rights reserved Why are we here? “We've learned from experience that the truth will come out.” Richard Feynman
  3. Copyright © 2016 Mirantis, Inc. All rights reserved Key highlights (Spoilers!) Mitaka-based OpenStack deployed by Fuel 2 hardware labs were used for testing 378 nodes was the size of the largest lab Line-rate throughput was achieved Over 24500 VMs were launched on a 200-node lab ...and yes, Neutron works at scale!
  4. Copyright © 2016 Mirantis, Inc. All rights reserved Agenda Labs overview & tools Testing methodology Results and analysis Issues Outcomes
  5. Copyright © 2016 Mirantis, Inc. All rights reserved Deployment description Mirantis OpenStack with Mitaka-based Neutron ML2 OVS VxLAN/L2 POP DVR rootwrap-daemon ON ovsdb native interface OFF ofctl native interface OFF
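For readers who want to map these bullet points to actual settings, here is a hedged sketch of the corresponding Mitaka-era ML2/OVS options. The option names are the standard upstream ones, but the exact values and file layout produced by Fuel are not shown in the slides and are assumptions here:

    # Sketch only - not the exact Fuel-generated configuration
    # ml2_conf.ini
    [ml2]
    tenant_network_types = vxlan
    mechanism_drivers = openvswitch,l2population

    # openvswitch_agent.ini
    [agent]
    tunnel_types = vxlan
    l2_population = True
    enable_distributed_routing = True
    [ovs]
    of_interface = ovs-ofctl     # "ofctl native interface OFF"
    ovsdb_interface = vsctl      # "ovsdb native interface OFF"

    # neutron.conf
    [DEFAULT]
    router_distributed = True    # DVR
    [agent]
    root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf   # "rootwrap-daemon ON"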
  6. Copyright © 2016 Mirantis, Inc. All rights reserved Environment description. 200 node lab: 3 controllers, 196 computes, 1 node for Grafana/Prometheus
Controllers: CPU 2x Intel Xeon E5-2650v3, Socket 2011, 2.3 GHz, 25MB Cache, 10 core, 105 W | RAM 8x 16GB Samsung M393A2G40DB0-CPB DDR-IV PC4-2133P ECC Reg. CL13 | Network 2x Intel I350 Gigabit (public network), 2x Intel 82599ES 10-Gigabit SFI/SFP+ (rev 01)
Computes: CPU 1x Intel Xeon Ivy Bridge 6C E5-2620 v2, 2.1 GHz, 15M cache, 7.2 GT/s QPI, 80 W, Socket 2011R, 1600 | RAM 4x Samsung DDR3 8GB DDR3-1866 1Rx4 ECC REG RoHS M393B1G70QH0-CMA | Network 1x AOC-STGN-i2S 2-port 10 Gigabit Ethernet SFP+
  7. Copyright © 2016 Mirantis, Inc. All rights reserved Environment description. 378 node lab: 3 controllers, 375 computes
Model Dell PowerEdge R63: CPU 2x Intel E5-2680 v3, 2.5 GHz, 12 core | RAM 256 GB Samsung M393A2G40DB0-CPB | Network 2x Intel X710 Dual Port, 10-Gigabit | Storage 3.6 TB SSD, RAID1 (Dell PERC H730P Mini), 2 disks Intel S3610
Model Lenovo RD550-1U: CPU 2x E5-2680v3, 12-core | RAM 256GB | Network 2x Intel X710 Dual Port, 10-Gigabit | Storage 2x Intel S3610 800GB SSD, 2x DP and 3Yr Standard Support
23 176 RD650-2
  8. Copyright © 2016 Mirantis, Inc. All rights reserved Tools: Control plane testing - Rally | Data plane testing - Shaker | Density testing - Heat, custom (ancillary) scripts | Additionally - system resource monitoring with Grafana/Prometheus
  9. Copyright © 2016 Mirantis, Inc. All rights reserved Integrity test Control group of resources that must stay persistent no matter what other operations are performed on the cluster. 2 server groups of 10 instances 2 subnets connected by router Connectivity checks by floating IPs and fixed IPs Checks are run between other tests to ensure dataplane operability
  10. Copyright © 2016 Mirantis, Inc. All rights reserved Integrity test ● From fixed IP to fixed IP in the same subnet ● From fixed IP to fixed IP in different subnets
  11. Copyright © 2016 Mirantis, Inc. All rights reserved Integrity test ● From floating IP to floating IP ● From fixed IP to floating IP
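To make the check concrete, below is a minimal Python sketch of the kind of connectivity matrix the integrity test walks through; the IP addresses, the SSH-based ping helper and the control-group layout are illustrative assumptions, not the actual test code:

    # Illustrative sketch of an integrity-style connectivity check (not the real test code).
    import itertools
    import subprocess

    # Assumed control-group layout: one subnet with fixed IPs only,
    # one subnet whose VMs also carry floating IPs.
    FIXED_IPS_SUBNET_A = ["10.0.1.11", "10.0.1.12"]
    FIXED_IPS_SUBNET_B = ["10.0.2.11", "10.0.2.12"]
    FLOATING_IPS = ["172.16.0.101", "172.16.0.102"]

    def ping_ok(src_host, dst_ip, count=3):
        """SSH into src_host and ping dst_ip; True if the ping succeeds."""
        cmd = ["ssh", src_host, "ping", "-c", str(count), "-W", "2", dst_ip]
        return subprocess.run(cmd, capture_output=True).returncode == 0

    def integrity_check(access_ips):
        """Ping every target from every accessible VM: same subnet, cross-subnet
        (via the qrouter namespace), floating-to-floating (via the FIP namespace)
        and fixed-to-floating (via a controller)."""
        targets = FIXED_IPS_SUBNET_A + FIXED_IPS_SUBNET_B + FLOATING_IPS
        return [(src, dst)
                for src, dst in itertools.product(access_ips, targets)
                if src != dst and not ping_ok(src, dst)]

    if __name__ == "__main__":
        failures = integrity_check(FLOATING_IPS)  # VMs reachable over their floating IPs
        print("connectivity failures:", failures or "none")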
  12. Copyright © 2016 Mirantis, Inc. All rights reserved Rally control plane tests Basic Neutron test suite Tests with increased number of iterations and concurrency Neutron scale test with many servers/networks
  13. Copyright © 2016 Mirantis, Inc. All rights reserved Rally basic Neutron test suite create_and_update_ create_and_list_ create_and_delete_ ● floating_ips ● networks ● subnets ● security_groups ● routers ● ports Verify that cloud is healthy, Neutron services up and running
  14. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Concurrency 50-100 Iterations 2000-5000 API tests create-and-list-networks create-and-list-ports create-and-list-routers create-and-list-security-groups create-and-list-subnets Boot VMs tests boot-and-list-server boot-and-delete-server-with-secgroups boot-runcommand-delete
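For context, a Rally run like this is described by a small task file; the sketch below shows what one such entry could look like for the create_and_list_networks scenario at concurrency 50 and 2000 iterations. The runner and context fields follow Rally's documented task format, but the tenant/user counts and quota overrides are assumptions, not the values actually used in this testing:

    {
      "NeutronNetworks.create_and_list_networks": [
        {
          "args": {"network_create_args": {}},
          "runner": {"type": "constant", "times": 2000, "concurrency": 50},
          "context": {
            "users": {"tenants": 3, "users_per_tenant": 2},
            "quotas": {"neutron": {"network": -1}}
          }
        }
      ]
    }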
  15. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency All test runs were successful, no errors. Results on Lab 378 slightly better than on Lab 200. API tests create-and-list-networks create-and-list-ports create-and-list-routers create-and-list-security-groups create-and-list-subnets Boot VMs tests boot-and-list-server boot-and-delete-server-with-secgroups boot-runcommand-delete
Scenario | Iterations/Concurrency | Time, Lab 200 | Time, Lab 378
create-and-list-routers | 2000/50 | avg 15.59, max 29.00 | avg 12.942, max 19.398
create-and-list-subnets | 2000/50 | avg 25.973, max 64.553 | avg 17.415, max 50.41
  16. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency First run on Lab 200: ● 7.75% failures, concurrency 100 ● 1.75% failures, concurrency 15 Fixes applied on Lab 378: ● 0% failures, concurrency 100 ● 0% failures, concurrency 50 API tests create-and-list-networks create-and-list-ports create-and-list-routers create-and-list-security-groups create-and-list-subnets Boot VMs tests boot-and-list-server boot-and-delete-server-with-secgroups boot-runcommand-delete
  17. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - slow linear growth ● list - linear growth
  18. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_networks trends create network list networks
  19. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - stable ● list - linear growth create_and_list_routers ● create - stable ● list - linear growth (6.5 times in 2000 iterations)
  20. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_routers trends create router list routers
  21. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - stable ● list - linear growth create_and_list_routers ● create - stable ● list - linear growth (6.5 times in 2000 iterations) create_and_list_subnets ● create - slow linear growth ● list - linear growth (20 times in 2000 iterations)
  22. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_subnets trends create subnet list subnets
  23. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - stable ● list - linear growth create_and_list_routers ● create - stable ● list - linear growth (6.5 times in 2000 iterations) create_and_list_subnets ● create - low linear growth ● list - linear growth (20 times in 2000 iterations) create_and_list_ports
  24. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_ports trends average load
  25. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - stable ● list - linear growth create_and_list_routers ● create - stable ● list - linear growth (6.5 times in 2000 iterations) create_and_list_subnets ● create - low linear growth ● list - linear growth (20 times in 2000 iterations) create_and_list_ports ● gradual growth create_and_list_secgroups ● create 10 sec groups - stable, with peaks ● list - rapid growth rate by 17.2 times
  26. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_secgroups trends create 10 security groups list security groups
  27. Copyright © 2016 Mirantis, Inc. All rights reserved Rally scale with many networks 100 networks per iteration 1 VM per network Iterations 20, concurrency 3
  28. Copyright © 2016 Mirantis, Inc. All rights reserved Rally scale with many VMs 1 network per iteration 100 VMs per network Iterations 20, concurrency 3
  29. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Architecture Shaker is a distributed data-plane testing tool for OpenStack.
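Under the hood the agents simply drive standard tools such as iperf3 between VM pairs and send the numbers back to the Shaker server. Purely as an illustration of that measurement step, here is a hedged Python sketch of an iperf3 wrapper; the peer address, duration and reporting are assumptions, and this is not Shaker's actual agent code:

    # Illustrative sketch: measure throughput towards a peer VM with iperf3
    # (conceptually what a Shaker master agent does; not Shaker's real code).
    import json
    import subprocess

    def measure_mbits(server_ip, seconds=60, reverse=False):
        """Run an iperf3 client against server_ip; reverse=True measures download."""
        cmd = ["iperf3", "-c", server_ip, "-t", str(seconds), "-J"]
        if reverse:
            cmd.append("-R")
        out = subprocess.run(cmd, capture_output=True, check=True).stdout
        bps = json.loads(out)["end"]["sum_received"]["bits_per_second"]
        return bps / 1e6

    if __name__ == "__main__":
        # Assumed slave VM address; Shaker derives the pairs from its Heat stack.
        print("download, Mbits/sec:", round(measure_mbits("10.0.0.5", reverse=True)))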
  30. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: L2 scenario Tests the bandwidth between pairs of instances on different nodes in the same virtual network.
  31. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: L3 East-West scenario Tests the bandwidth between pairs of instances on different nodes deployed in different virtual networks plugged into the same router.
  32. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: L3 North-South scenario Tests the bandwidth between pairs of instances on different nodes deployed in different virtual networks.
  33. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 200, MTU 1500 Standard configuration Bi-directional L3 East-West scenario: ● 561 Mbits/sec upload, 528 Mbits/sec download Intel 82599ES 10-Gigabit
  34. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 200, MTU 9000 Enabled jumbo frames Bi-directional L3 East-West scenario: ● 3615 Mbits/sec upload, 3844 Mbits/sec download x7 increase in throughput Intel 82599ES 10-Gigabit
  35. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-West Bi-directional test HW offloads-capable NIC Hardware offloads boost with small MTU (1500): ● x3.5 throughput increase in bi-directional test Increasing MTU from 1500 to 9000 also gives a significant boost: ● 75% throughput increase in bi-directional test (offloads on) Intel X710 Dual Port 10-Gigabit
  36. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-West Download test HW offloads-capable NIC Hardware offloads boost with small MTU (1500): ● x2.5 throughput increase in download Increasing MTU from 1500 to 9000 also gives a significant boost: ● 41% throughput increase in download test (offloads on) Intel X710 Dual Port 10-Gigabit
  37. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-West Download test Near line-rate results in L2 and L3 east-west Shaker tests even with concurrency >50: ● 9800 Mbits/sec in download/upload tests ● 6100 Mbits/sec each direction in bi-directional tests Intel X710 Dual Port 10-Gigabit
  38. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, Full L2 Download test Intel X710 Dual Port 10-Gigabit
  39. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-West Download test Intel X710 Dual Port 10-Gigabit
  40. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, Full L3 North-South Download test Intel X710 Dual Port 10-Gigabit
  41. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-west Bi-directional test Intel X710 Dual Port 10-Gigabit
  42. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-west Bi-directional test Intel X710 Dual Port 10-Gigabit
  43. Copyright © 2016 Mirantis, Inc. All rights reserved Dataplane testing outcomes Neutron DVR+VxLAN+L2pop installations are capable of almost line-rate performance. Main bottlenecks: hardware configuration and MTU settings. Solution: 1. Use HW offloads-capable NICs 2. Enable jumbo frames North-South scenario needs improvement
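A quick back-of-the-envelope calculation shows why the MTU matters so much with VxLAN: the encapsulation adds roughly 50 bytes of outer headers per packet, and with jumbo frames each byte of traffic costs about six times fewer packets (and therefore far less per-packet CPU work), which is consistent with the gains reported above. The sketch below uses the standard IPv4/UDP/VxLAN header sizes and deliberately ignores offloads and other per-frame details:

    # Back-of-the-envelope: per-frame TCP payload with VxLAN encapsulation.
    # Outer Ethernet is not counted against the MTU; no VLAN tags assumed.
    OUTER_IP, OUTER_UDP, VXLAN, INNER_ETH = 20, 8, 8, 14  # ~50 bytes of overlay overhead
    INNER_TCP_IP = 20 + 20                                # inner IPv4 + TCP headers

    def tcp_payload_per_frame(physical_mtu):
        inner_mtu = physical_mtu - (OUTER_IP + OUTER_UDP + VXLAN + INNER_ETH)
        return inner_mtu - INNER_TCP_IP

    for mtu in (1500, 9000):
        payload = tcp_payload_per_frame(mtu)
        print(f"MTU {mtu}: ~{payload} payload bytes/frame, "
              f"~{1_000_000 // payload} frames per MB of data")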
  44. Copyright © 2016 Mirantis, Inc. All rights reserved Density test Aim: Boot the maximum number of VMs the cloud can manage. Make sure VMs are properly wired and have access to the external network. Verify that data-plane is not affected by high load on the cloud.
  45. Copyright © 2016 Mirantis, Inc. All rights reserved Environment description. 200 node lab: 3 controllers, 196 computes, 1 node for Grafana/Prometheus
Controllers: CPU 20 core | RAM 128 GB | Network 2x Intel I350 Gigabit (public network), 2x Intel 82599ES 10-Gigabit SFI/SFP+ (rev 01)
Computes: CPU 6 core | RAM 32 GB | Network 1x AOC-STGN-i2S 2-port 10 Gigabit Ethernet SFP+
  46. Copyright © 2016 Mirantis, Inc. All rights reserved Density test process Heat used for creating 1 network with a subnet, 1 DVR router, and 1 cirros VM per compute node. 1 Heat stack == 196 VMs Upon spawn, VMs get their IPs from the metadata service and send them to an external HTTP server (see the sketch below) Iteration 1
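The reporting step mentioned above can be very small. Here is a hedged sketch of what such an ancillary boot script might do: read the instance's addresses from the metadata service and POST them to the collecting HTTP server. The collector URL and payload fields are assumptions (the real VMs were cirros, where the equivalent is a few shell commands):

    # Illustrative boot-time reporter (not the actual ancillary script).
    import json
    import urllib.request

    METADATA_URL = "http://169.254.169.254/latest/meta-data/"
    COLLECTOR_URL = "http://collector.example.com:8000/report"  # hypothetical collector

    def metadata(path):
        with urllib.request.urlopen(METADATA_URL + path, timeout=10) as resp:
            return resp.read().decode().strip()

    def report():
        payload = {
            "instance_id": metadata("instance-id"),
            "fixed_ip": metadata("local-ipv4"),
        }
        req = urllib.request.Request(
            COLLECTOR_URL,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        # Reaching the collector at all proves the VM has metadata and external access.
        urllib.request.urlopen(req, timeout=10)

    if __name__ == "__main__":
        report()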
  47. Copyright © 2016 Mirantis, Inc. All rights reserved Density test process Heat stacks were created in batches of 1 to 5 (usually 5) 1 iteration == 196*5 VMs Integrity test was run periodically Constant monitoring of lab status using the Grafana dashboard Iteration k
  48. Copyright © 2016 Mirantis, Inc. All rights reserved Density test results 125 Heat stacks were created Total of 24500 VMs on the cluster Number of bugs filed and fixed: 8 Days spent: 3 People involved: 12 Data-plane connectivity lost: 0 times
  49. Copyright © 2016 Mirantis, Inc. All rights reserved Grafana dashboard during density test
  50. Copyright © 2016 Mirantis, Inc. All rights reserved Density test load analysis
  51. Copyright © 2016 Mirantis, Inc. All rights reserved Issues faced ● Ceph failure! ● Bugs ● LP #1614452 Port create time grows at scale due to dvr arp update ● LP #1610303 l2pop mech fails to update_port_postcommit on a loaded cluster ● LP #1528895 Timeouts in update_device_list (too slow with large # of VIFs) ● LP #1606827 Agents might be reported as down for 10 minutes after all controllers restart ● LP #1606844 L3 agent constantly resyncing deleted router ● LP #1549311 Unexpected SNAT behavior between instances with DVR+floating ip ● LP #1609741 oslo.messaging does not redeclare exchange if it is missing ● LP #1606825 nova-compute hangs while executing a blocking call to librbd ● Limits ● ARP table size on nodes ● cpu_allocation_ratio
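For the two limits listed above, these are the knobs involved: the kernel neighbour-table thresholds that cap the number of ARP entries a node can hold, and Nova's cpu_allocation_ratio that caps how many vCPUs may be scheduled per physical core. The option names are the standard ones; the values below are illustrative, not the ones used during the test:

    # /etc/sysctl.conf on computes and controllers (illustrative values):
    # raise the ARP/neighbour table limits so thousands of VM ports fit
    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 16384

    # /etc/nova/nova.conf on computes (illustrative value):
    [DEFAULT]
    # allow more vCPUs per physical core so small cirros VMs keep packing in
    cpu_allocation_ratio = 12.0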
  52. Copyright © 2016 Mirantis, Inc. All rights reserved Outcomes ● No major issues in Neutron ● No threatening trends in control-plane tests ● Data-plane tests showed stable performance on all hardware ● Data-plane does not suffer from control-plane failures ● 24K+ VMs on 200 nodes without serious performance degradation ● Neutron is ready for large-scale production deployments on 350+ nodes
  53. Copyright © 2016 Mirantis, Inc. All rights reserved Links http://docs.openstack.org/developer/performance-docs/test_plans/neutron_features/vm_density/plan.html http://docs.openstack.org/developer/performance-docs/test_results/neutron_features/vm_density/results.html
  54. Copyright © 2016 Mirantis, Inc. All rights reserved Thank you for your time

Editor's Notes

  1. Good afternoon, everyone! My name is Elena Ezhova, I am a Software Engineer at Mirantis, and this is Oleg Bondarev, Senior Software Engineer at Mirantis. Today we are going to talk about Neutron performance at scale and find out whether it is ready for large deployments.
  2. So, why are we here? For quite a long time there has been a misconception that Neutron is not production-ready and has certain performance issues. That’s why we aspired to put an end to these rumors and perform Neutron-focused performance and scale testing. And now we’d like to share our results.
  3. Here are some key points of our testing: First, we deployed Mirantis OpenStack 9.0 with Mitaka-based Neutron on 2 hardware labs, with the largest lab having 378 nodes in total. Secondly, we were able to achieve line-rate throughput in dataplane tests and boot over 24 thousand VMs during density test ...and finally, that’s the major spoiler by the way, we can confirm that Neutron works at scale!
  4. But let’s not get ahead of ourselves and stick to the agenda. We shall start with describing the clusters we used for testing, their hardware and software configuration along with the tools that we used. Then we’ll go on to describe tests that were performed, results we got and their analysis. After that we’ll take a look at issues that were faced during the testing process as well as some performance considerations. Finally, we’ll round out with the conclusions and outcomes.
  5. We were testing the Mitaka-based Mirantis OpenStack 9.0 distribution with Neutron using the ML2/OVS plugin. We used the VxLAN segmentation type as it is a common choice in production. We were also using DVR for enhanced data-plane performance.
  6. As to hardware, we were lucky to be able to experiment on two different hardware labs. The first one had 200 nodes: 3 of them were controllers, 1 we used for running Prometheus with Grafana for cluster status monitoring, and the rest of the nodes were computes. Here, as you can see, the controllers were more powerful than the computes, all of them having standard NICs with Intel 82599 controllers.
  7. Now, the second lab had more nodes and way more powerful hardware. It had 378 nodes: 3 controllers, and all the rest were computes. As I said, these servers are more powerful than those on the first lab as they have more CPU and RAM and, what's important, modern Intel X710 NICs.
  8. Now a quick look at the tools that were used in the testing process. All the tests that we were running can be roughly classified into three groups: control plane, data plane and density tests. For control plane testing we were using Rally. For data-plane testing we used a specially designed tool called Shaker, and for density testing it was mostly our custom scripts and Heat templates for creating stacks. Prometheus with a Grafana dashboard was quite useful for monitoring cluster state. And, of course, we were using our eyes, hands and sometimes even the 6th sense for tracking down issues.
  9. So, what exactly were we doing? The very first thing we wanted to know when we got the deployed cloud is whether it was working correctly, meaning, did we have internal and external connectivity? What's more, we always needed a way to check that the data plane was working after massive resource creation/deletion, heavy workloads, etc. The solution was to create an Integrity test. It is very simple and straightforward. We create a control group of 20 instances, all of which are located on different compute nodes. Half of them are in one subnet and have floating IPs, the other half are in another subnet and have only fixed IPs. Both subnets are plugged into a router with a gateway to an external network. For each of the instances we check that it's possible to: 1. SSH into it. 2. Ping an external resource (e.g. Google) 3. Ping other VMs (by fixed or floating IPs) This infrastructure should always be persistent, and resources shouldn't be cleaned up after a connectivity check is made.
  10. Lists of IPs to ping are formed so as to check all possible combinations with minimum redundancy. Having VMs from different subnets, with and without floating IPs, allows us to check that all possible traffic routes are working. For example, the check validates that ping passes: From fixed IP to fixed IP in the same subnet; From fixed IP to fixed IP in different subnets, when packets have to go through the qrouter namespace
  11. From floating IP to floating IP, traffic goes through the FIP namespace to the external network; From fixed IP to floating IP, when traffic goes through a controller. This connectivity check is very helpful for verifying that data-plane connectivity is not lost during testing, and it really helped us spot early on when something went wrong with the data plane. Now I'd like to pass the ball to Oleg, who will tell you about the control-plane testing process and results.
  12. Rally is a well-known and, I'd say, “official” tool for testing the control-plane performance of OpenStack clusters. I won't talk much about the tool itself; let's move to the tests and results. We started with the so-called basic Neutron test suite - it's actually Neutron API tests like create and list nets, subnets, routers, etc., which don't include spawning VMs. This test suite ships with Rally itself and we didn't modify test options much, as the main purpose is to validate cluster operability. Secondly, we ran a “hardened” version of the same tests with increased numbers of iterations and concurrency. Plus we added several tests which spawn VMs. Finally, we ran two tests specially targeted at creating many networks and servers in different proportions (servers per network) - like many nets with one VM in each vs. fewer nets with many servers in each.
  13. Not much to add here, as I already said these are basic Neutron API tests to validate cluster (and Rally) operability. The picture shows that there is no big difference between avg and max response times which is positive.
  14. Moving on. The following tests were run with concurrency 50-100 and 2000-5000 iterations. Create_and_list tests are additive: they do not delete resources on each iteration, so the load (in terms of the number of resources) grows with each iteration. We also added VM-booting tests, where boot_runcommand_delete is the most interesting, since it tests successful VM spawning and external connectivity through a floating IP, all at a high rate.
  15. Speaking of results, I'd like to note that all highlighted tests were successful (every iteration) and results on the more powerful lab are better, which is expected.
  16. For boot-and-delete-server-with-secgroups and boot-runcommand-delete there were some failures initially on lab 200 (I’ll talk about failures later), after investigation and applying fixes on lab 378 we got a 100% success rate for these tests even with greater concurrency.
  17. Speaking of trends we see that for create and list nets it is a linear growth for list and slow linear growth for create. This has a simple explanation - the more resources we have, the more time neutron server needs for processing.
  18. create & list from 200 node lab
  19. It’s even better for routers - no time increase for create and slow linear growth for list.
  20. create & list from 200 node lab
  21. Same for subnets - slow linear growth for both create and list
  22. create & list from 200 node lab
  23. Here is an aggregated graph for ports - gradual growth as well with some peaks
  24. There is something to look at and profile in list security groups, as the growth does not seem quite linear. For create, response times are more or less stable and do not depend on the number of resources created.
  25. In this test on each iteration 100 networks are created with a VM in each network. There were 20 iterations with concurrency 3 and as you can see from the graph this is a really slow response time increase.
  26. And it's even better for the so-called “Rally scale with many VMs” test, where it is 1 net with 100 VMs per iteration, 20 iterations and concurrency 3 - a pretty stable time for each iteration. We probably should've done more iterations, but we were very limited in time and had to give priority to other tests. Just like with this talk! So now I'll pass the ball to Elena, and she will speak about Shaker and data-plane testing.
  27. Thanks, Oleg! Shaker is a distributed data-plane testing tool for OpenStack that was developed at Mirantis. Shaker wraps around popular system network testing tools like iperf3, netperf and others. Shaker is able to deploy OpenStack instances and networks in different topologies using Heat. Shaker starts lightweight agents inside VMs, these agents execute tests and report the results back to a centralized server. In case of network testing only master agents are involved, while slaves are used as back-ends handling incoming packets.
  28. There are three typical data-plane test scenarios. The L2 scenario tests the bandwidth between pairs of VMs in the same subnet. Each instance is deployed on its own compute node. The test increases the load from 1 pair until all available computes are used.
  29. The L3 east-west scenario is the same as the previous one, with the only difference being that pairs of VMs are deployed in different subnets.
  30. In the L3 north-south scenario VMs with master agents are located in one subnet, and VMs with slave agents are reached via their floating IPs
  31. Our data-plane performance testing started on the 200-node lab deployed with the standard configuration, which also means that we had a 1500 MTU. Having run the Shaker test suite, we saw disquietingly low throughput: in east-west bi-directional tests upload was barely above 500 Mbits/sec!
  32. These results suggested that it would be reasonable to update the MTU from the default 1500 to 9000, which is commonly used in production installations. By doing so we were able to increase throughput by almost 7 times, and it reached almost 4 Gbits/sec in each direction in the same test case. Such a difference in results shows that performance depends to a great extent on the lab configuration. Now, if you remember, I said that we actually had two hardware labs, where the second lab had more advanced hardware, most importantly more advanced Intel X710 NICs. Among other things, these NICs make fuller use of hardware offloads, which are especially needed when VxLAN segmentation (with its 50 bytes of overhead) comes in. Hardware offloads significantly increase throughput while reducing load on the CPU. Let's see what difference advanced offloads-capable hardware makes.
  33. On the 300+ node lab we ran Shaker tests with different lab configurations: MTU 1500 and 9000, and hardware offloads on and off. As can be seen on the chart, hardware offloads are most effective with the smaller MTU, mostly due to segmentation offloads: we can see a x3.5 throughput increase in the bi-directional test (compare columns 1 and 2). Increasing the MTU from 1500 to 9000 also gives a significant boost: a 75% throughput increase in the bi-directional test with offloads on (columns 2 and 4).
  34. The situation is the same for unidirectional test cases (download in this example): hardware offloads give a x2.5 throughput increase (compare columns 1 and 2). And combining enabled hardware offloads with jumbo frames helps to increase throughput by 41% (columns 2 and 4). These results prove that it makes a lot of sense to enable jumbo frames and hardware offloads in production environments whenever possible.
  35. So, here are the real numbers that we got on this lab: We were able to achieve near line-rate results in L2 and L3 east-west Shaker tests even with concurrency > 50, which means that there were more than 50 pairs of instances sending traffic simultaneously: 9.8 Gbits/sec in download and upload tests Over 6 Gbits/sec each direction in bi-directional tests
  36. Now, let's compare the results we got on the 200-node lab, which had less advanced hardware, with the results on the 300+ node lab, which had more advanced hardware. On this chart you can see how the average throughput between VMs in the same network changes with increasing concurrency. On the 300+ node lab, throughput remains line-rate even when concurrency reaches 99.
  37. The situation is almost the same with the L3 east-west download test, where the VMs are in different subnets connected to the same router. Here it can be seen that running the same test on a lab with jumbo frames enabled and hardware offloads supported leads to a significant increase in throughput, which stays stable even with high concurrency.
  38. L3 North-South performance is still far from perfect, mostly due to the fact that in this scenario, even with DVR, all the traffic goes through the controller, which in case of high concurrency may get flooded. Apart from that, the resulting throughput depends on many factors, including the configuration of the switch, the lab topology (whether nodes are situated in the same rack or not, etc.) and the MTU in the external network, which must always be assumed to be no more than 1500.
  39. The results of bi-directional tests are the most important as in real environments there is usually traffic going in and out and therefore it is important that throughput is stable in both directions. Here we can see that on the 300+ node lab the average throughput in both directions was almost 3 times higher than on the 200-node lab with the same MTU 9000.
  40. The average results that are shown on the previous graphs are often affected by corner cases, when the channel gets stuck for various reasons and throughput drops significantly. To get a fuller understanding of what throughput is achievable, you can take a look at a chart with the most successful results, where upload/download exceeds 7 Gbits/sec on the 378-node lab.
  41. To sum up, the data-plane testing has shown that Neutron DVR+VxLAN installations are capable of very high, almost line-rate performance. There are two major factors: hardware configuration and MTU settings. This means that to get the best results you need modern HW-offload-capable NICs and jumbo frames enabled. Even on older NICs that don't support ALL offloads, network performance can be improved drastically, as the results we got on the 200-node lab clearly show. The North-South scenario clearly needs improvement, as DVR is not currently truly distributed and in this scenario all traffic goes through the controller, which eventually gets clogged. Now, Oleg will tell you about density testing and share probably the most exciting results that we got.
  42. Right! With the density test we aimed at 3 main things: Boot as many VMs as the cloud can manage But not only boot - make sure VMs are properly wired and have access to the external network Verify that the data plane is not affected by high load on the cloud So essentially the main idea was to load the cluster to death to see what the limits are and where the bottlenecks are. And additionally, check what happens to the data plane when the control plane breaks.
  43. We only had a chance to run the density test on the 200-node lab. Just to recap the HW: it was 3 controllers with 20 cores and 128 gigs of RAM, and 196 computes with 6 cores and 32 gigs of RAM. One node was used for cluster health monitoring, with Grafana/Prometheus on it.
  44. Now about the process. We used Heat for the first version of the density test on this lab. 1 Heat stack is 1 private net with a subnet connected by a router to a public net, and 1 VM per compute node. So 1 stack means 196 new VMs. To verify external connectivity and metadata access of VMs, each of them gets some metadata from the metadata server and sends this info to the external HTTP server. Thus the server can check that all VMs got metadata and external access.
  45. We created Heat stacks in batches of 1-5 (usually 5), so 1 iteration means up to almost 1000 new VMs. After each iteration we checked data-plane integrity by executing the connectivity check which Elena described earlier. We also constantly monitored cluster health to be able to detect and investigate any problem at an early stage.
  46. I'll speak about the issues we faced a bit later. Now about the results: it was a 3 (or maybe 4) day journey with over 10 people from different teams involved, and finally we successfully created 125 stacks on this cluster, which is more than 24k VMs, all successfully spawned and with external connectivity. Data-plane connectivity for the control group of VMs was never lost.
  47. This is how one of the Grafana pages looked during the density test. It shows CPU and memory load as well as load on the DB and network. These are aggregated graphs for all controllers and computes. Here the peaks correspond to batches of VMs being spawned. You can also see how memory usage grows on compute nodes, while staying pretty stable on controllers. This is, by the way, close to the final iterations, as you can see memory on the computes is nearly exhausted.
  48. And this is how CPU and memory consumption changed from the first to the last iteration. As you can see, we almost reached the memory limit on the computes, which we expected to be the limiting factor, but it wasn't.
  49. Actually, the bottleneck turned out to be in Ceph, which was used in our deployment. The initial failure was the lack of allowed PIDs per OSD node; then the Ceph monitors started to consume all (and even more) resources on the controllers in order to restart, causing all other services (RabbitMQ, OpenStack services) to suffer. After this Ceph failure the cluster could not be recovered, so the density test had to be stopped before the capacity of the compute nodes was exhausted. The Ceph team commented that 3 Ceph monitors aren't enough for over 20000 VMs (each having 2 drives) and recommended having at least 1 monitor per ~1000 client connections. It's also better to move monitors to dedicated nodes. One pretty important note: the connectivity check of the Integrity test passed 100% even when the cluster went crazy. That is a good illustration of control-plane failures not affecting the data plane. Other issues: At some point we had to increase the ARP table size on computes and then on controllers. Later we had to increase cpu_allocation_ratio on computes; it's a Nova config option controlling how many VMs can be spawned on a given compute node depending on the number of real cores. Several Neutron bugs, nothing critical though; the most interesting is port creation time growth, which was fixed by a two-line patch. Another thing that deserves attention is an OVS agent restart on a loaded compute node - there might be timeouts on the agent side when trying to update the status of a large number of interfaces at once. It's a well-known issue which has two alternative patches on review and just needs to reach consensus. A bug in oslo.messaging which affected us quite a lot and took some time to be investigated and fixed by our messaging team; the gist is that agents were reporting to queues consumed by nobody. A Nova bug where massive VM deletion leads to nova-compute hanging; it's related to Nova-Ceph interactions.
  50. And finally, here are the main outcomes of our scale testing: No major issues in Neutron were found during testing (all labs, all tests). The issues found were either already fixed upstream or fixed upstream during our testing; one is in progress and close to being fixed. Rally tests did not reveal any significant issues, and there are no threatening trends in the Rally test results. Data-plane tests showed stable performance on all hardware. It was demonstrated that high network performance can be achieved even on old hardware that doesn't support VxLAN offloads; you just need proper MTU settings. On servers with modern NICs throughput is almost line-rate. Data-plane connectivity is not lost even during serious issues with the control plane. Density testing clearly demonstrated that Neutron is capable of managing over 24500 VMs on 200 nodes (3 controllers) without serious performance degradation. In fact, we weren't even able to spot significant bottlenecks in the Neutron control plane, as we had to stop the test due to issues not related to Neutron. Neutron is ready for large-scale production deployments on 350+ nodes.
  51. Our process and results have been shared on docs.openstack.org; here are the links.