Load Balancing in the Cloud: Tools, Tips, and Techniques
A TECHNICAL WHITE PAPER
Brian Adler, Solutions Architect, RightScale, Inc.
Abstract

Load Balancing is a method to distribute workload across one or more servers, network interfaces, hard drives, or other computing resources. Typical datacenter implementations rely on large, powerful (and expensive) computing hardware and network infrastructure, which are subject to the usual risks associated with any physical device, including hardware failure, power and/or network interruptions, and resource limitations in times of high demand.

Load balancing in the cloud differs from classical thinking on load-balancing architecture and implementation by using commodity servers to perform the load balancing. This provides new opportunities and economies of scale, as well as presenting its own unique set of challenges.

The discussion to follow details many of these architectural decision points and implementation considerations, while focusing on several of the cloud-ready load balancing solutions provided by RightScale, either directly from our core components or from resources provided by members of our comprehensive partner network.
1 Introduction

A large percentage of the systems (or "deployments" in RightScale vernacular) managed by the RightScale Cloud Management Platform employ some form of front-end load balancing. As a result of this customer need, we have encountered, developed, architected, and implemented numerous load balancing solutions. In the process we have accumulated experience with solutions that excelled in their application, and we have also discovered the pitfalls and shortcomings of solutions that did not meet the desired performance criteria. Some of these solutions are open source and are fully supported by RightScale, while others are commercial applications (with a free, limited version in some cases) supported by members of the RightScale partner network.

In this discussion we will focus on the following technologies that support cloud-based load balancing: HAProxy, Amazon Web Services' Elastic Load Balancer (ELB), Zeus Technologies' Load Balancer (with some additional discussion of their Traffic Manager features), and aiCache's Web Accelerator. While it may seem unusual to include a caching application in this discussion, we will describe the setup in a later section that illustrates how aiCache can be configured to perform strictly as a load balancer.

The primary goal of the load balancing tests performed in this study is to determine the maximum connection rate that the various solutions are capable of supporting. For this purpose we focused on retrieving a very small web page from backend servers via the load balancer under test. Particular use cases may see more relevance in testing for bandwidth or other metrics, but we have seen more difficulties surrounding scaling to high connection rates than any other performance criterion, hence the focus of this paper. As will be seen, the results provide insight into other operational regimes and metrics as well.

Section 2 will describe the test architecture and the method and manner of the performance tests that were executed. Application- and/or component-specific configurations will be described in each of the subsections describing the solution under test. Wherever possible, the same (or similar) configuration options were used in an attempt to maintain a compatible testing environment, with the goal being relevant and comparable test results. Section 3 will discuss the results of these tests from a pure load balancing perspective, with additional commentary on specialized configurations pertinent to each solution that may enhance its performance (with the acknowledgement that these configurations/options may not be available with the other solutions included in these evaluations). Section 4 will describe an enhanced testing scenario used to exercise the unique features of the ELB, and section 5 will summarize the results and offer suggestions with regard to best practices in the load balancing realm.

2 Test Architecture and Setup

In order to accomplish a reasonable comparison among the solutions exercised, an architecture typical of many RightScale customer deployments (and cloud-based deployments in general) was utilized. All tests were performed in the AWS EC2 US-East cloud, and all instances (application servers, server under test, and load-generation servers) were launched in a single availability zone.

A single EC2 large instance (m1.large, 2 virtual cores, 7.5GB memory, 64-bit platform) was used for the load balancer under test for each of the software appliances (HAProxy, Zeus Load Balancer, and aiCache Web Accelerator).
As the ELB is not launched as an instance, we will address it as an architectural component as opposed to a server in these discussions. A RightImage (a RightScale-created and supported machine image) utilizing CentOS 5.2 was used as the base operating system on the HAProxy and aiCache servers, while an Ubuntu 8.04 RightImage was used with the Zeus Load Balancer. A total of five identically configured web servers were used in each test to handle the responses to the HTTP requests initiated by the load-generation server. These web servers were run on EC2 small instances (m1.small, 1 virtual core, 1.7GB memory, 32-bit platform) and utilized a CentOS 5.2 RightImage. Each web server was running Apache version 2.2.3, and the web page being requested was a simple text-only page with a size of 147 bytes. The final server involved in the test was the load-generation server. This server was run on an m1.large instance, and also used a CentOS 5.2 RightImage. The server configurations used are summarized in Table 1 below.

Table 1 – Summary of server configurations

    Role                        Instance type  Virtual cores  Memory  Platform  Base image
    Load balancer under test    m1.large       2              7.5GB   64-bit    CentOS 5.2 (Ubuntu 8.04 for Zeus)
    Backend web servers (5)     m1.small       1              1.7GB   32-bit    CentOS 5.2
    Load-generation server      m1.large       2              7.5GB   64-bit    CentOS 5.2

The testing tool used to generate the load was ApacheBench, and the command used during the tests was the following:

    ab -k -n 100000 -c 100 http://<Public_DNS_name_of_EC2_server>

The full list of options available is described in the ApacheBench man page (http://httpd.apache.org/docs/2.2/programs/ab.html), but the options used in these tests were:

    -k              Enable the HTTP KeepAlive feature, i.e., perform multiple requests within one
                    HTTP session. Default is no KeepAlive.
    -n requests     Number of requests to perform for the benchmarking session. The default is to
                    perform a single request, which usually leads to non-representative benchmarking
                    results.
    -c concurrency  Number of multiple requests to perform at a time. Default is one request at a time.

Additional tests were performed on the AWS ELB and on HAProxy using httperf as an alternative to ApacheBench. These tests are described in the sections to follow.

An architectural diagram of the test setup is shown in Figure 1.
Figure 1 – Test setup architecture

In all tests, a round-robin load balancing algorithm was used. CPU utilization on all web servers was tracked during the tests to ensure this tier of the architecture was not a limiting factor on performance. The CPU idle value for each of the five web servers was consistently between 65% and 75% during the entire timeline of all tests. The CPU utilization of the load-generating server was also monitored during all tests, and the idle value was consistently above 70% on both cores (the httperf tests more fully utilized the CPU, and those configurations are discussed in detail in subsequent sections).

As an additional test, two identical load-generating servers were used to simultaneously generate load on the load balancer. In each case, the performance seen by the first load-generating server was halved compared to the single-generator case, with the second server performing equally. Thus, the overall performance of the load balancer remained the same. As a result, the series of tests that generated the results discussed herein were run with a single load-generating server to simplify the test setup and results analysis. The load-generation process was handled differently in the ELB test to more adequately test the auto-scaling aspects of the ELB. Additional details are provided in section 4.1, which discusses this test setup and results.

The metric collected and analyzed in all tests was the number of requests per second that were processed by the server under test (referred to as responses/second hereafter). Other metrics may be more relevant for a particular application, but pure connection-based performance was the desired metric for these tests.

2.1 Additional Testing Scenarios

Due to the scaling design of the AWS ELB, adequately testing this solution requires a different and more complex test architecture. The details of this test configuration are described in section 4 below. With this more involved architecture in place, additional tests of HAProxy were performed to confirm the results seen in the simpler architecture described above. The HAProxy results were consistent between the two test architectures, lending validation to the base test architecture. Additional details on these HAProxy tests are provided in section 4.2.

3 Test Results

Each of the ApacheBench tests described in section 2 was repeated a total of ten times against each load balancer under test, with the numbers quoted being the averages of those tests. Socket states were checked between tests (via the netstat command) to ensure that all sockets had closed correctly and the server had returned to a quiescent state. A summary of all test results is included in Appendix A.
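As a concrete illustration of this procedure, the repetition and socket-state checks can be wrapped in a short shell loop along the following lines. This is an illustrative sketch only: the exact scripts used in the tests are not included in this paper, and the URL placeholder simply mirrors the ab command shown in section 2.

    #!/bin/bash
    # Illustrative test harness: run ApacheBench ten times, record the
    # "Requests per second" figure from each run, and wait for lingering
    # sockets to clear (checked with netstat) before starting the next run.
    TARGET="http://<Public_DNS_name_of_EC2_server>/"
    RUNS=10
    total=0
    for i in $(seq 1 "$RUNS"); do
        rps=$(ab -k -n 100000 -c 100 "$TARGET" | awk '/^Requests per second/ {print $4}')
        echo "run $i: $rps requests/second"
        total=$(echo "$total + $rps" | bc)
        # Wait for lingering sockets on this host to drain; in the tests the
        # server under test was checked the same way via netstat.
        while [ "$(netstat -ant | grep -c TIME_WAIT)" -gt 0 ]; do
            sleep 5
        done
    done
    echo "average: $(echo "scale=1; $total / $RUNS" | bc) requests/second"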
3.1 HAProxy

HAProxy is an open-source software application that provides high-availability and load balancing features (http://haproxy.1wt.eu/). In these tests, version 1.3.19 was used and the health-check option was enabled, but no other additional features were configured. The CPU utilization was less than 50% on both cores of the HAProxy server during these tests (HAProxy does not utilize multiple cores, but monitoring was performed to ensure no other processes were active and consuming CPU cycles), and the addition of another web server did not increase the number of requests serviced, nor change the CPU utilization of the HAProxy server. HAProxy performance tuning as well as Linux kernel tuning was performed. The tuned parameters are indicated in the results below and are summarized in Appendix B. HAProxy does not support the keep-alive mode of the HTTP transactional model, thus its response rate is equal to the TCP connection rate.

3.1.1 HAProxy Baseline

In this test, HAProxy was run with the standard configuration file (Appendix C) included with the RightScale frontend ServerTemplates (a ServerTemplate is a RightScale concept, and defines the base OS image and series of scripts used to install and configure a server at boot time). The results of the initial HAProxy tests were:

    Requests per second: 4982 [#/sec]

This number will be used as a baseline for comparison with the other load balancing solutions under evaluation.

3.1.2 HAProxy with nbproc Modification

The nbproc option to HAProxy sets the number of haproxy processes when HAProxy is run in daemon mode. This is not the preferred mode in which to run HAProxy, as it makes debugging more difficult, but it may result in performance improvements on certain systems. As mentioned previously, the HAProxy server was run on an m1.large instance, which has two cores, so the nbproc value was set to 2 for this test. Results:

    Requests per second: 4885 [#/sec]

This is approximately a 2% performance reduction compared with the initial tests (in which nbproc was set to the default value of 1), so the difference is considered statistically insignificant, with the conclusion that in this test scenario, modifying the nbproc parameter has no effect on performance. This is most likely an indicator that user CPU load is not the limiting factor in this configuration. Additional tests described in section 4 add credence to this assumption.
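For reference, the nbproc change tested above can be made along the following lines. This is an illustrative sketch only: the configuration file path and the soft-restart invocation are assumptions rather than a procedure described in the paper, while the pidfile location comes from the Appendix C configuration.

    # Add "nbproc 2" to the global section of the HAProxy configuration
    # (the directive does not need to be indented).
    grep -q 'nbproc' /home/haproxy/haproxy.cfg || \
        sed -i '/^[[:space:]]*global$/a nbproc 2' /home/haproxy/haproxy.cfg

    # Soft restart: -sf tells the new process to let the old one finish its
    # existing connections and then exit.
    haproxy -f /home/haproxy/haproxy.cfg -p /home/haproxy/haproxy.pid \
        -sf "$(cat /home/haproxy/haproxy.pid)"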
3.1.3 HAProxy with Kernel Tuning

There are numerous kernel parameters that can be tuned at runtime, all of which can be found under the /proc/sys directory. The parameters mentioned below are not an exhaustive or all-inclusive list of those that would positively (or negatively) affect HAProxy performance, but they were found to be beneficial in these tests. Alternate values for these (and other) parameters may have positive performance implications depending on the traffic patterns a site encounters and the type of content being served. The following kernel parameters were modified by adding them to the /etc/sysctl.conf file and executing the 'sysctl -p' command to load them into the kernel:

    net.ipv4.conf.default.rp_filter=1
    net.ipv4.conf.all.rp_filter=1
    net.core.rmem_max = 8738000
    net.core.wmem_max = 6553600
    net.ipv4.tcp_rmem = 8192 873800 8738000
    net.ipv4.tcp_wmem = 4096 655360 6553600
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_max_tw_buckets = 360000
    vm.min_free_kbytes = 65536
    vm.swappiness = 0
    net.ipv4.ip_local_port_range = 30000 65535

With these modifications in place, the results of testing were:

    Requests per second: 5239 [#/sec]

This represents about a 5.2% improvement over the initial HAProxy baseline tests. To ensure accuracy and repeatability of these results, the same tests (the HAProxy baseline with no application or kernel tuning, and the current test) were rerun. The 5%-6% performance improvement was consistent across these tests. Additional tuning of the above-mentioned parameters was performed, with the addition of other network- and buffer-related parameters, but no significant improvements to these results were observed. Setting the haproxy process affinity also had a positive effect on performance (and negated any further gains from kernel tuning). This process affinity modification is described in section 4.2.

It is worth noting that HAProxy can be configured for both cookie-based and IP-based session stickiness (IP-based if a single HAProxy load balancer is used). This can enhance performance, and in certain application architectures it may be a necessity.
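As an illustration of the two stickiness styles, the relevant directives might look like the following in a listen section. This is a hedged sketch only: the server names and addresses are placeholders, and apart from the cookie directive (which appears in the Appendix C configuration) these lines are assumptions rather than the configuration used in these tests.

    # Illustrative only: write an example listen section showing both stickiness styles.
    cat > /tmp/haproxy-stickiness-example.cfg <<'EOF'
    listen www 0.0.0.0:80
        mode http
        balance roundrobin
        # Cookie-based stickiness: insert a SERVERID cookie naming the backend
        # that handled the first request (directive taken from Appendix C).
        cookie SERVERID insert indirect nocache
        # IP-based alternative (single-HAProxy case): replace "balance roundrobin"
        # with "balance source" to pin each client IP to the same backend.
        server web1 10.0.0.1:80 cookie web1 check inter 3000 rise 2 fall 3
        server web2 10.0.0.2:80 cookie web2 check inter 3000 rise 2 fall 3
    EOF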
3.2 Zeus Load Balancer

Zeus Technologies (http://www.zeus.com/) is a RightScale partner that has created a cloud-ready ServerTemplate available from the RightScale Dashboard. Zeus is a fee-based software application, with different features being enabled at varying price points. The ServerTemplate used in these tests utilized version 6.0 of the Zeus Traffic Manager. This application provides many advanced features that support caching, SSL termination (including the ability to terminate SSL for multiple fully-qualified domain names on the same virtual appliance), cookie- and IP-based session stickiness, frontend clustering, as well as numerous other intelligent load balancing features. In this test only the Zeus Load Balancer (a feature subset of the Zeus Traffic Manager) was used, in order to provide more feature-compatible tests with the other solutions involved in these evaluations. By default, Zeus enables both HTTP keep-alives and TCP keep-alives on the backend (the connections to the web servers), thus avoiding the overhead of unnecessary TCP handshakes and tear-downs. With a single Zeus Load Balancer running on an m1.large (consistent with all other tests), the results were:

    Requests per second: 6476 [#/sec]

This represents a 30% increase over the HAProxy baseline, and a 24% increase over the tuned HAProxy test results. As mentioned previously, the Zeus Traffic Manager is capable of many advanced load balancing and traffic managing features, so depending on the needs and architecture of the application, significantly improved performance may be achieved with appropriate tuning and configuration. For example, enabling caching would increase performance dramatically in this test, since a simple static text-based web page was used. We will see a use case for this in the following section discussing aiCache's Web Accelerator. However, for these tests standard load balancing, with particular attention to requests served per second, was the desired metric, so only the Zeus Load Balancer features were exercised, and not the extended Zeus Traffic Manager capabilities.

3.3 aiCache Web Accelerator

aiCache implements a software solution to provide frontend web server caching (http://aicache.com/). aiCache is a RightScale partner that has created a ServerTemplate to deploy their application in the cloud through the RightScale platform. The aiCache Web Accelerator is also a fee-based application. While it may seem out of place to include a caching application in an evaluation of load balancers, the implementation of aiCache lends itself nicely to this discussion. If aiCache does not find the requested object in its cache, it will load the object into the cache by accessing the "origin" servers (the web servers used in these discussions) in a round-robin fashion. aiCache does not support session stickiness by default, but it can be enabled via a simple configuration file directive. In the tests run as part of this evaluation, aiCache was configured with the same five web servers on the backend as in the other tests, and no caching was enabled, thus forcing the aiCache server to request the page from a backend server every time. With this setup and configuration in place, the results were:

    Requests per second: 4785 [#/sec]

This performance is comparable with that of HAProxy (it is 4% less than the HAProxy baseline, and 9% less than the tuned HAProxy results).
As mentioned previously, aiCache is designed as a caching application to be placed in front of the web servers of an application, and not as a load balancer per se. But as these results show, it performs this function quite well. Although it is a bit out of scope with regard to the intent of these discussions on load balancing, a simple one-line change to the aiCache configuration file allowed caching of the simple web page being used in these tests. With this one-line change in place, the same tests were run, and the results were:
    Requests per second: 15342 [#/sec]

This is a large improvement (320%) over the initial aiCache load balancing test, and a similarly large improvement over the HAProxy tests (307% over the HAProxy baseline, and 293% better than the tuned HAProxy results). Caching is most beneficial in applications that serve primarily static content. In this simple test it was applicable in that the requested object was a static, text-based web page. As mentioned above in the discussion of the Zeus solution, depending on the needs, architecture, and traffic patterns associated with an application, significantly improved results can be obtained by selecting the correct application for the task, and tuning that application correctly.

3.4 Amazon Web Services Elastic Load Balancer (ELB)

Elastic Load Balancing facilitates distributing incoming traffic among multiple AWS instances (much like HAProxy). Where ELB differs from the other solutions discussed in this white paper is that it can span Availability Zones (AZs) and can distribute traffic to different AZs. While this is possible with HAProxy, the Zeus Load Balancer, and the aiCache Web Accelerator, there is a cost associated with cross-AZ traffic (traffic within the same AZ via private IPs is at no cost, while traffic between different AZs is fee-based). However, an ELB has a cost associated with it as well (an hourly rate plus a data transfer rate), so some of this inter-AZ traffic cost may be equivalent to the ELB charges, depending on your application architecture. Multiple-AZ configurations are recommended for applications that demand high reliability and availability, but an entire application can be (and often is) run within a single AZ. AWS has not released details on how ELB is implemented, but since it is designed to scale based on load (which will be shown in the sections to follow), it is most likely a software-based virtual appliance. The initial release of ELB did not support session stickiness, but cookie-based session affinity is now supported.

AWS does not currently offer different sizes or versions of ELBs, so all tests were run with the standard ELB. Additionally, no performance tuning or configuration is currently possible on ELBs. The only configuration set with regard to the ELB used in these tests was that only a single AZ was enabled for traffic distribution.

Two sets of tests were run. The first was functionally equivalent to the tests run against the other load balancing solutions in that a single load-generating server was used to generate a total of 100,000 requests (repeated 10 times to obtain an average). The second test was designed to exercise the auto-scaling nature of ELB, and additional details are provided in section 4.1. For the first set of tests, the results were:

    Requests per second: 2293 [#/sec]

This performance is about 46% of that of the HAProxy baseline tests, and approximately 43% of the tuned HAProxy results. This result is consistent with tests several of RightScale's customers have run independently. As a comparison to this simple ELB test, a test of HAProxy on an m1.small instance was conducted. The results of this HAProxy test are as follows:

    Requests per second: 2794 [#/sec]

In this test scenario, the ELB performance is approximately 82% that of HAProxy running on an m1.small. However, due to the scaling design of the ELB solution discussed previously, another testing methodology is required to adequately test the true capabilities of ELB. This test is fundamentally different from all others performed in this investigation, so it will be addressed separately in section 4.
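For readers who want to reproduce a comparable single-AZ setup today, a classic ELB can be created and populated roughly as follows with the current AWS CLI. This is an illustrative sketch under stated assumptions: the modern "aws elb" commands shown here are not the tooling that existed when these tests were run, and the load balancer name, zone, and instance ID are placeholders.

    # Create a classic ELB that listens on port 80 and forwards to port 80,
    # restricted to a single availability zone (as in these tests).
    aws elb create-load-balancer \
        --load-balancer-name lb-test \
        --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
        --availability-zones us-east-1a

    # Register the backend web servers with the ELB (placeholder instance ID).
    aws elb register-instances-with-load-balancer \
        --load-balancer-name lb-test \
        --instances i-0123456789abcdef0

    # The DNS name returned here is the endpoint the load generators resolve and test.
    aws elb describe-load-balancers --load-balancer-names lb-test \
        --query 'LoadBalancerDescriptions[0].DNSName'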
4 Enhanced Testing Architecture

In this test, a much more complex and involved testing architecture was implemented. Instead of the five backend web servers used in the previous tests, a total of 25 identical backend web servers were used, and 45 load-generating servers were utilized instead of a single server. The reason for the change is that fully exercising ELB requires that requests be issued to a dynamically varying number of IP addresses returned by the DNS resolution of the ELB endpoint. In effect, this is the first stage of load balancing employed by ELB in order to distribute incoming requests across a number of IP addresses, which correspond to different ELB servers. Each ELB server then in turn load balances across the registered application servers.

The load-generation servers used in this test were run on AWS c1.medium instances (2 virtual cores, 1.7GB memory, 32-bit platform). From observing the load-generating servers in the previous tests, it was determined that memory was not a limiting factor and that the 7.5GB available on the m1.large was far more than necessary for the application. CPU utilization on the load generator was high, however, so the c1.medium was used to add approximately 25% more computing power. As mentioned previously, instead of a single load-generating server, up to 45 servers were used, each running the following httperf command in an endless loop:

    httperf --hog --server=$ELB_IP --num-conns=50000 --rate=500 --timeout=5

In order to spread the load among the ELB IPs that were automatically added by AWS, a DNS query was made at the beginning of each loop iteration so that subsequent runs would not necessarily use the same IP address. These 45 load-generating servers were added in groups at specific intervals, which will be detailed below.

The rate of 500 requests per second (the "--rate=500" option to httperf) was determined via experimentation on the load-generating server. With rates higher than this, non-zero fd-unavail error counts were observed, which is an indication that the client has run out of file descriptors (or, more accurately, TCP ports) and is thus overloaded. The number of total connections per iteration was set to 50,000 (--num-conns=50000) in order to keep each test run fairly short in duration (typically less than two minutes), so that DNS queries would occur at frequent intervals and spread the load as the ELB scaled.
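A minimal sketch of this per-server load-generation loop is shown below. It is illustrative only: the paper does not include the actual wrapper script, so the DNS re-resolution step (here using dig) and the variable names are assumptions, while the httperf invocation matches the command above.

    #!/bin/bash
    # Illustrative load-generation loop: re-resolve the ELB endpoint before each
    # iteration so that successive runs spread across the ELB's changing IP set.
    ELB_DNS="my-elb-1234567890.us-east-1.elb.amazonaws.com"   # placeholder endpoint
    while true; do
        # Pick one of the A records currently returned for the ELB endpoint.
        ELB_IP=$(dig +short "$ELB_DNS" | shuf -n 1)
        echo "$(date +%T) using ELB IP $ELB_IP ($(dig +short "$ELB_DNS" | wc -l) IPs advertised)"
        httperf --hog --server=$ELB_IP --num-conns=50000 --rate=500 --timeout=5
    done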
4.1 ELB Performance

The first phase of the ELB test utilized all 25 backend web servers, but only three load-generating servers were launched initially (which would generate about 1500 requests/second: three servers at 500 requests/second each). Some reset/restart time was incurred between each loop iteration running the httperf commands, so a sustained 500 requests/second per load-generating server was not quite achievable. DNS queries initially showed three IPs for the ELB. As shown in Figure 2 (label (a)), an average of about 1000 requests/second was processed by the ELB at this point.

Approximately 20 minutes into the test, an additional three load-generating servers were added, resulting in a total of six, generating about 3000 requests/second (see Figure 2, label (b)). The ELB scaled up to five IPs over the course of the next 20 minutes (c), and the response rate leveled out at about 3000/second at this point. The test was left to run in this state for the next 45 minutes, with the number of ELB IPs, as well as the response rate, monitored periodically. As Figure 2 shows (d), the response rate remained fairly stable at about 3000/second during this phase of the test. The number of IPs returned via DNS queries for the ELB varied between seven and 11 during this time.

At this point, an additional 19 load-generating servers were added (for a total of 25, see Figure 2, label (e)), which generated about 12500 requests/second. The ELB added IPs fairly quickly in response to this load, and averaged between 11 and 15 within 10 minutes.
After about 20 minutes (Figure 2, label (f)), an average of 10500 responses/second was realized (again, due to the restart time between iterations of the httperf loop, the theoretical maximum of 12500 requests/second was not quite reached). The test was left to run in this state for about 20 minutes, during which it remained fairly stable in terms of response rate, while the number of IPs for the ELB continued to vary between 11 and 15. An additional 20 load-generating servers (for a total of 45, see Figure 2, label (g)) were added at this time. About 10 minutes were required before the ELB scaled up to accommodate this new load, with a result of between 18 and 23 IPs for the ELB. The response rate at this time averaged about 19000/second (Figure 2, label (h)). The test was allowed to run for approximately another 20 minutes before all servers were terminated. The response rate during this time remained around 19000/second, and the number of ELB IPs varied between 19 and 22.

Figure 2 – httperf responses per second through the AWS ELB. Each color corresponds to the responses received from an individual ELB IP address. The quantization is due to the fact that each load-generating server is locked to a specific IP address for a 1-2 minute period during which it issues 500 requests/second.

To ensure that the backend servers were not overtaxed during these tests, the CPU activity of each was monitored. Figure 3 shows the CPU activity on a typical backend server. Additionally, the interface traffic on the load-generating servers and the number of Apache requests on the backend servers were monitored. Figures 4 and 5 show graphs for these metrics.
Figure 3 – CPU activity on a typical backend web server

Figure 4 – Interface traffic on a typical load-generating server

Figure 5 – Apache requests on a typical backend web server. Peak is with 45 load-generating servers.

It would appear that the theoretical maximum response rate using an ELB is almost limitless, assuming that the backend servers can handle the load. Practically, this would be limited by the capacity of the AWS infrastructure and/or by throttles imposed by AWS with regard to an ELB.
These test results were shared with members of the AWS network engineering team, who confirmed that there are activity thresholds that will trigger an inspection of traffic to ensure it is legitimate (and not a DoS/DDoS attack or similar). We assume that the tests performed here did not surpass this threshold and that additional requests could have been generated before the alert/inspection mechanism would have been triggered. If the alert threshold is met, and after inspection the traffic is deemed to be legitimate, the threshold is lifted to allow additional AWS resources to be allocated to meet the demand. In addition, when using multiple availability zones (as opposed to the single AZ used in this test), supplemental ELB resources become available.

While the ELB does scale up to accommodate increased traffic, the ramp-up is not instantaneous, and therefore may not be suitable for all applications. In a deployment that experiences a slow and steady load increase, an ELB is an extremely scalable solution, but in a flash-crowd or viral event, ELB scaling may not be rapid enough to accommodate the sudden influx of traffic, although artificial "pre-warming" of the ELB may be feasible.

4.2 Enhanced Test Configuration with HAProxy

In order to validate the previous HAProxy results, the enhanced test architecture described above was used to test a single instance running HAProxy on an m1.large (2 virtual cores, 7.5GB memory, 64-bit platform). In this test configuration, 16 load-generating servers were used as opposed to the 45 used in the ELB tests. (No increase in performance was seen beyond 10 load generators, so the test was halted once 16 had been added.) The backend was populated with 25 web servers as in the ELB test, and the same 147-byte text-only web page was the requested object. Figure 6 shows a graph of the responses/second handled by HAProxy. The average was just above 5000, which is consistent with the results obtained in the tests described in section 3.1 above.

Figure 6 – HAProxy responses/second

The gap in the graph was the result of a restart of HAProxy once the kernel parameters had been modified. The graph tails off at the end as DNS TTLs expired, which pushed the traffic to a different HAProxy server running on an m1.xlarge. Results of this m1.xlarge test are described below.

In the initial test run in the new configuration, an average of about 5000 responses/second was observed. During this time frame, CPU-0 was above 90% utilization (see Figure 7), while CPU-1 was essentially idle. By setting the HAProxy process affinity to a single CPU (essentially moving all system-related CPU cycles to a separate CPU), performance was increased approximately 10% to the roughly 5000 responses/second shown in Figure 6.
When the affinity was set (using the 'taskset -p 2 <haproxy_pid>' command), CPU-0's utilization dropped to less than 5%, and CPU-1 went from 0% utilized to approximately 60% utilization (because the HAProxy process had been moved exclusively to CPU-1); see Figure 8. Additionally, once the HAProxy process affinity was set, tuning the kernel parameters no longer had any noticeable effect.

Figure 7 – CPU-0 activity on HAProxy server

Figure 8 – CPU activity on CPU-0 and CPU-1 after HAProxy affinity is set to CPU-1
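The affinity change quoted above can be reproduced with a short command sequence such as the following. This is an illustrative sketch: the paper only quotes the taskset command itself, so the pidof and mpstat steps are assumptions added for context.

    # Pin the haproxy process to CPU-1 (CPU mask 0x2), as in these tests.
    HAPROXY_PID=$(pidof haproxy)
    taskset -p 2 "$HAPROXY_PID"

    # Verify the new affinity (should report CPU list "1") and watch the
    # per-CPU utilization shift with mpstat (from the sysstat package).
    taskset -pc "$HAPROXY_PID"
    mpstat -P ALL 5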
The interface on the HAProxy server averaged approximately 100 Mbits/second total (inbound and outbound combined) during the test (see Figure 9). In previous tests of m1.large instances in the same availability zone, throughput in excess of 300 Mbits/second has been observed, confirming that the instance's bandwidth was not the bottleneck in these tests.

Figure 9 – Interface utilization on HAProxy server

With unused CPU cycles on both cores and considerable bandwidth still available on the interface, the bottleneck in the HAProxy solution is not readily apparent. The HAProxy test described above was also run on an m1.xlarge (4 virtual cores, 15GB memory, 64-bit platform) with the same configuration. The results observed were identical to those of the m1.large. Since HAProxy is not memory-intensive and does not utilize additional cores, these results are not surprising, and they support the reasoning that the throttling factor may be an infrastructure- or hypervisor-related limitation.

During these HAProxy tests, it was observed that the virtual interface was peaking at approximately 110K packets per second (pps) in total throughput (input + output). As a result of this observation, the ttcp utility was run in several configurations to attempt to validate this finding. Tests accessing the instance via its internal IP and its external IP, as well as with two concurrent transmit sessions, were executed (see Figure 10).

Figure 10 – Packets per second as generated by ttcp

The results of these tests were fairly consistent in that a maximum of about 125K pps was achieved, with an average of 118K-120K being more typical. These results were shared with AWS network engineering representatives, who confirmed that we are indeed hitting limits in the virtualization layer, which involves the traversal of two network stacks.
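In practice, the packet rate on the instance's interface can be watched with a simple sampling loop along the following lines. This is an illustrative sketch only: the interface name, sampling interval, and alert threshold are assumptions, and the tests themselves relied on regular monitoring graphs rather than this script.

    #!/bin/bash
    # Illustrative monitor: sample /proc/net/dev twice and report the total
    # (receive + transmit) packets per second on the given interface.
    IFACE=eth0
    INTERVAL=10
    sample() {
        # Strip everything up to "eth0:"; field 2 is RX packets, field 10 is TX packets.
        grep "$IFACE:" /proc/net/dev | sed 's/.*://' | awk '{print $2, $10}'
    }
    read rx1 tx1 < <(sample)
    sleep "$INTERVAL"
    read rx2 tx2 < <(sample)
    pps=$(( (rx2 - rx1 + tx2 - tx1) / INTERVAL ))
    echo "total packets/second on $IFACE: $pps"
    if [ "$pps" -gt 100000 ]; then
        echo "WARNING: approaching the ~100K pps limit observed in these tests"
    fi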
The takeaway from these experiments is that in high-traffic applications, the network interface should be monitored, and additional capacity should be added when the interface approaches 100K pps, regardless of other resources that may still be available on the instance.

These findings also explain why the results for HAProxy, aiCache, and Zeus are very similar. With all three appliances the practical limit is about 100K packets per second. The minor performance differences between the three are primarily due to keep-alive versus non-keep-alive HTTP connections and internal buffer strategies that may distribute payloads over more or fewer packets in different situations.

5 Conclusions

At RightScale we have encountered numerous and varied customer architectures, applications, and use cases, and the vast majority of these deployments use, or can benefit from, the inclusion of front-end load balancing. As a result of assisting these customers both in a consulting capacity and by engaging with them at a professional services level, we have amassed a broad spectrum of experience with load balancing solutions. The intent of this discussion was to give a brief overview of the load balancing options currently available in the cloud via the RightScale platform, and to compare and contrast these solutions using a specific configuration and metric on which to rate them. Through these comparisons, we hope to have illustrated that there is no "one size fits all" when it comes to load balancing. Depending on the particular application's architecture, technology stack, traffic patterns, and numerous other variables, there may be one or more viable solutions, and the decision on which mechanism to put in place will often come down to a tradeoff between performance, functionality, and cost.
Appendices

[A] Summary of all tests performed

    Load balancer under test                               Requests per second
    HAProxy baseline (m1.large)                            4982
    HAProxy with nbproc = 2 (m1.large)                     4885
    HAProxy with kernel tuning (m1.large)                  5239
    HAProxy baseline (m1.small)                            2794
    Zeus Load Balancer (m1.large)                          6476
    aiCache Web Accelerator, caching disabled (m1.large)   4785
    aiCache Web Accelerator, caching enabled (m1.large)    15342
    AWS ELB, single load-generation server                 2293

[B] Kernel tuning parameters

    net.ipv4.conf.default.rp_filter=1
    net.ipv4.conf.all.rp_filter=1
    net.core.rmem_max = 8738000
    net.core.wmem_max = 6553600
    net.ipv4.tcp_rmem = 8192 873800 8738000
    net.ipv4.tcp_wmem = 4096 655360 6553600
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_max_tw_buckets = 360000
    vm.min_free_kbytes = 65536
    vm.swappiness = 0
    net.ipv4.ip_local_port_range = 30000 65535
[C] HAProxy configuration file

    # Copyright (c) 2007 RightScale, Inc, All Rights Reserved Worldwide.
    #
    # THIS PROGRAM IS CONFIDENTIAL AND PROPRIETARY TO RIGHTSCALE
    # AND CONSTITUTES A VALUABLE TRADE SECRET. Any unauthorized use,
    # reproduction, modification, or disclosure of this program is
    # strictly prohibited. Any use of this program by an authorized
    # licensee is strictly subject to the terms and conditions,
    # including confidentiality obligations, set forth in the applicable
    # License Agreement between RightScale.com, Inc. and
    # the licensee.

    global
        stats socket /home/haproxy/status user haproxy group haproxy
        log 127.0.0.1 local2 info
        # log 127.0.0.1 local5 info
        maxconn 4096
        ulimit-n 8250
        # typically: /home/haproxy
        chroot /home/haproxy
        user haproxy
        group haproxy
        daemon
        quiet
        pidfile /home/haproxy/haproxy.pid

    defaults
        log global
        mode http
        option httplog
        option dontlognull
        retries 3
        option redispatch
        maxconn 2000
        contimeout 5000
        clitimeout 60000
        srvtimeout 60000

    # Configuration for one application:
    # Example: listen myapp 0.0.0.0:80
    listen www 0.0.0.0:80
        mode http
        balance roundrobin
        # When acting in a reverse-proxy mode, mod_proxy from Apache adds X-Forwarded-For,
        # X-Forwarded-Host, and X-Forwarded-Server request headers in order to pass
        # information to the origin server; therefore, the following option is commented out
        # option forwardfor
        # Haproxy status page
        stats uri /haproxy-status
        # stats auth @@LB_STATS_USER@@:@@LB_STATS_PASSWORD@@
        # when cookie persistence is required
        cookie SERVERID insert indirect nocache
        # When internal servers support a status page
        # option httpchk GET @@HEALTH_CHECK_URI@@
        # Example server line (with optional cookie and check included)
        # server srv3.0 10.253.43.224:8000 cookie srv03.0 check inter 2000 rise 2 fall 3
        server i-570a243f 10.212.69.176:80 check inter 3000 rise 2 fall 3 maxconn 255
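For reference, a configuration of this form can be syntax-checked, launched, and queried for live statistics roughly as follows. This is an illustrative sketch: the configuration file path is an assumption, the stats socket and stats URI paths come from the configuration above, and the socat/curl usage is not described in the paper itself.

    # Validate the configuration, then start haproxy (the config's "daemon"
    # directive sends it to the background).
    haproxy -c -f /home/haproxy/haproxy.cfg
    haproxy -f /home/haproxy/haproxy.cfg

    # Per-backend statistics via the UNIX stats socket defined in the global section...
    echo "show stat" | socat stdio UNIX-CONNECT:/home/haproxy/status

    # ...or via the HTTP status page enabled with "stats uri /haproxy-status".
    curl -s http://localhost/haproxy-status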
