PyCon US 2012 - Web Server Bottlenecks and Performance Tuning


  • If you want to follow along with the slides on your laptop as the talk goes on, you can view them online; just search for my name.
  • Before we talk about web server performance tuning, it is important to step back and look at the bigger picture. Newbies especially have an obsession with trying to find the fastest web server, but reality is more complicated than that. Systems have many moving parts, and the actual per-request latency introduced by a web server is very small in relation to time spent in other parts of the system.
  • As far as a user is concerned, the main delays they will notice are those resulting from how long it takes their web browser to render the page returned by the web application. This is followed by network delays when talking to the web application, and when fetching static assets from media servers or content delivery networks.
  • The time spent in the web application is therefore a small percentage of the time the user perceives for a web page being served. Any delay introduced by the web server is a much smaller percentage again.
  • Steve Souders summarises this disparity between front end time and web application time in what he calls the "Performance Golden Rule": "80-90% of the end-user response time is spent on the frontend". If you are after an easy win for improving end user satisfaction with your web site, the front end is where you should start.
  • Although the big immediate gains may be won in the front end, the web application still presents plenty of opportunity to further reduce response times through means other than fiddling with the web server. Improvements can be had in the application code, but also in databases or backend services used by the web application.
  • Is running benchmarks on web servers a complete waste of time then? The answer is yes and no. The sort of benchmarks usually published on web sites to compare web servers serve little value. They generally only give newbies a false sense of security over any decision they make as to which web server to use. Worse is that people forever reference them as gospel truth when they can be far from it.
  • The main reason the typical web server benchmark is useless is that it tests only a single narrow idealised use case. Web servers are implemented using different architectures and different code. You are better off choosing a web server that you believe has the features you require and then using benchmarks to help explore the behaviour of that system.
  • Often the documented benchmarks you find are nothing more than a hello world program. The test then consists of running it at maximum request throughput with some arbitrary number of concurrent users. This does not mirror the real traffic a public facing web server would receive. It certainly doesn't show what causes the server to fail as load increases, just that it will.
  • What should you test then? There are many different use cases one could test, and how any one performs can be dictated by the architecture of the system, how the code was written and how the system is configured when the test is run. The more interesting tests are those which deliberately go out to trigger specific problems, because it is the corner cases that usually cause an issue rather than the typical use case.
  • What sort of factors can come into play and affect performance? These are varied and can arise from the hardware or virtualised system being used. They can derive from the configuration you use for the specific web server, but can also be influenced by how the Python language interpreter works. To make it hard, these can all interplay with each other in unexpected ways.
  • Some things can be out of your control altogether, such as the type of web browser and what type of network the traffic between you and the user has to traverse. Very few published benchmarks try to account for these issues in a realistic way.
  • Requirements dictated by your own web application, or how you decided to architect your overall system, can also contribute; for example, whether you try to use the same server to serve up static assets.
  • To illustrate how some of these different factors can come into play, I will go through a few specific use cases that present issues in practice and where possible relate them to those factors. These include memory usage, use of processes vs threads, impacts of long running requests, restarting of server processes and startup costs.
  • A simple place to get started is memory usage. This is always a hot topic of contention with benchmarks. It isn't hard to find people claiming that Apache is a big bloated memory hog. This benchmark in particular is representative of a poorly chosen Apache configuration. Of course it will use more memory if you configure it to have 1000 threads. If the servers tested aren't set up in a comparable way, you can hardly expect it to be a fair comparison.
  • Actually estimating the overall amount of memory used is not a difficult exercise; it is after all a simple formula which takes into consideration the number of processes, the base memory used by the web server, the memory for each additional thread and the application itself. Things get more complicated when one considers per request transient memory, but ignoring that, one can easily visualise what you are dealing with.
  • In short, adding more processes is going to see memory usage grow quicker than adding more threads to existing processes. Although some of that per process base memory usage is the web server, the majority of it will in the end be your fat Python web application. To blame a web server for using too much memory is plain silly when your web application could be using up to 50 times as much memory. The issue is really about the configuration you chose to set the web server up with.
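That back-of-envelope formula can be written down as a small Python helper. The figures used below are illustrative assumptions only, not measurements from any particular server:

```python
def estimated_memory_mb(processes, threads_per_process,
                        server_base_mb, per_thread_mb, application_mb):
    """Rough total memory for a multiprocess/multithreaded WSGI server.

    Each process carries the server's base overhead plus a full copy of
    the (fat) Python web application; each extra thread adds a little
    more on top. Per-request transient memory is deliberately ignored.
    """
    per_process = (server_base_mb + application_mb
                   + threads_per_process * per_thread_mb)
    return processes * per_process

# Illustrative numbers: a 100 MB application dwarfs the server itself,
# and each extra process adds another full copy of it.
print(estimated_memory_mb(1, 15, server_base_mb=10,
                          per_thread_mb=0.5, application_mb=100))  # 117.5
print(estimated_memory_mb(5, 15, server_base_mb=10,
                          per_thread_mb=0.5, application_mb=100))  # 587.5
```

Note how going from one process to five roughly quintuples the total, while adding threads to an existing process moves the number hardly at all.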
  • What usually happens is that people blindly use whatever the defaults are for a server. For fat Python web applications which use a lot of memory this can be disastrous. Apache with its prefork MPM can for example dynamically create up to 150 processes. That is potentially 150 copies of your fat Python web application. So of course it will use a lot of memory.
  • Those servers generally seen as faring best on memory usage are those whose default configurations use only a single process and single thread. Guess what: if you configure Apache that way then the amount of memory it uses will not be much different. Granted, it does help to also strip out unneeded Apache modules that you don't use to really get the best from it.
  • So don't start things off by using whatever the default processes/threads configurations are, especially if looking at memory usage. Do so and you can easily get the wrong impression. Also don't pick arbitrary values when you have no idea whether they are reasonable. At this scale a configuration with 1000 threads will not even fit on the chart, would almost be in the next room, and again in the red zone.
  • How many processes and threads should you use then? The total number of threads across all processes is dictated by the number of overlapping concurrent requests. How much overlap there is depends on response times and throughput. Processes are preferred over threads, but constrained by memory. The optimal number can also be dictated by how many processors are available. Gunicorn for example recommends using 2 to 4 times as many worker processes as you have CPU cores.
  • One can get a feel for how many threads you will need by looking at thread utilisation. That is, how much of the potential capacity the requests take up. In this example, by adding up the green areas representing the requests coming in over time, we have a thread utilisation of about 2.0. This means that if all requests were serialised, we would need only two threads. Requests don't arrive in such an orderly fashion though, so we need more threads to ensure they aren't delayed.
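The "adding up the green areas" intuition is just Little's law: average concurrency equals throughput multiplied by average response time. A minimal sketch, with the numbers chosen to reproduce the utilisation of about 2.0 from the example above:

```python
def thread_utilisation(requests_per_second, avg_response_time_s):
    """Average number of threads busy at once (Little's law: L = lambda * W)."""
    return requests_per_second * avg_response_time_s

# E.g. 20 requests/sec at 100 ms each keeps about 2 threads busy on
# average, so two threads would suffice if requests arrived perfectly
# serialised.
print(thread_utilisation(20, 0.100))  # 2.0

# Requests don't arrive evenly, so provision headroom above the average.
```

This also makes the later point about long running requests concrete: double the average response time at the same throughput and the number of busy threads doubles too.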
  • Because response times are generally quite short, it is actually surprising how few threads you can get away with. If the number of threads is too low and response times or throughput grow, then thread utilisation will increase. Eventually requests will start to back up as they wait for available threads and queueing time will increase. This adds to the delays end users see in their total page load time.
  • If we add processes rather than threads we can delay the onset of such problems. The reason processes work better is that the Python global interpreter lock (GIL) effectively serialises execution across the threads of a single process. Adding more processes obviously means more memory though. This has nothing to do with which web server you use; it is a choice bound by how much memory you have available.
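A minimal sketch of why processes sidestep the GIL for CPU bound work, using only the standard library. The workload and timings are illustrative; on a multi-core machine the process pool typically finishes the same work noticeably faster than the thread pool:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    """Pure-Python CPU-bound work; a thread running this holds the GIL."""
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [2_000_000] * 4
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls(max_workers=4) as pool:
            results = list(pool.map(burn, work))
        elapsed = time.perf_counter() - start
        # Threads serialise on the GIL; processes each have their own
        # interpreter and GIL, so they can use multiple cores.
        print(pool_cls.__name__, round(elapsed, 2), "s")
```

For I/O bound requests the picture flips: a thread waiting on a socket releases the GIL, which is exactly why the next two paragraphs say threads work fine when the application spends most of its time calling out to backend services.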
  • If you are memory constrained, finding the right balance and what you can get away with in order to still reduce memory usage is a tricky problem. It is all made harder when you have no idea what is going on inside your web application. If a web application has a heavy bias towards CPU bound activity within the process, then you are forced towards needing more processes.
  • If your web application makes lots of calls out to backend services, and so threads are blocked waiting on I/O more of the time than not, you can get away with using more threads because the threads aren't competing as much with each other for the CPU within the same process. If you have no idea what your web application is doing though, tuning the processes/threads balance is going to be a hit and miss affair.
  • To make such judgements even harder you also have long running requests to contend with. These can arise from issues in your own code or backend services, but also from how much data you are moving around and how slow the HTTP clients are. The basic problem is that a long running request, because it ties up a thread, reduces the maximum throughput you can achieve during that period of time.
  • The unpredictability of request times means you need to always ensure you have a good amount of extra capacity in the number of processes/threads allocated. If you don't provide sufficient headroom, then when a number of long running requests coincide you will suddenly find thread availability drops, requests start backlogging and overall response times as seen by the user increase.
  • Where your application code or backend service is slow, you obviously need to work out why. Sometimes issues come from places you least expect. For example, especially with Django, watch out for how long PostgreSQL database connections take to create. One thing you can consider in this case is a local external connection pooler such as pgbouncer.
  • If you're using Apache/mod_wsgi or gunicorn, stick nginx in front of it and proxy requests through to your WSGI server. This will make your WSGI server perform better as you will be isolated from slow clients. The threads in the backend will be tied up for less time, meaning lower thread utilisation, thus allowing you to handle a higher throughput with fewer resources. You can also offload tasks such as static file serving to nginx, which is going to do a better job of it anyway.
  • When introducing a front end, do be careful of the funnelling effect though, especially if the number of concurrent requests that can be handled reduces at each step. If your web application backlogs, users may give up, but their requests are still queued and have to be handled. Your web application wastes time on them and may have trouble catching up with the backlog. It is perhaps better to set up servers so requests time out with a 503 before reaching your web application if you can.
  • The worst case scenario here is a complete overload where the server never really recovers for an extended period, or until you can shut down the server. Request timeouts within the web application, where supported, can help a bit, but only to throw out long running requests. As already mentioned, you really need to stop requests getting to the web application if there is no longer any point handling them. Options here vary and the solutions available aren't always great.
  • You might think that doing a restart will solve a problem with backlogged requests. You have to be careful here as well though. For some servers the listener socket is preserved across a restart, so any backlog there isn't actually cleared. Further, when performing a restart, new processes have to be created and the application loaded again. This can take time and cause more requests to backlog. So choose carefully when you restart. To totally reset, it is better to do a full shutdown and clear the backlog.
  • For fat Python web applications with a large startup cost, server configurations which allow for auto scaling can also compound problems. When under load, a further throughput spike can cause the server to decide to start more processes. This slows the system down temporarily, causing backlog, and if it takes a long time to start processes, the server could decide to start even more, increasing system load again, blowing out memory and overloading your whole system.
  • To avoid unexpected surprises, you are better off starting up front the maximum number of processes you expect to require, or can fit in available memory with your web application loaded. Ensure you preload your web application when processes start and not lazily when the first request arrives. Do everything possible to keep the processes in memory all the time, avoiding restarts. Especially don't use options that restart processes when some maximum request count is reached.
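As one concrete way this advice might look in practice, here is a sketch of a Gunicorn configuration file applying it: a fixed worker count sized from the CPU count, the application preloaded in the master before forking, and no request-count recycling. The worker formula is the (2 x cores) + 1 starting point from Gunicorn's own documentation; treat the exact numbers as assumptions to tune for your application:

```python
# gunicorn.conf.py -- a sketch applying the advice above to Gunicorn.
import multiprocessing

# Fix the worker count up front rather than relying on auto scaling.
workers = multiprocessing.cpu_count() * 2 + 1

# Import the WSGI application in the master before forking, so every
# worker starts with the application already loaded instead of paying
# the startup cost lazily on its first request.
preload_app = True

# 0 disables recycling workers after N requests; restarting processes
# just to cap memory reintroduces the startup cost on live traffic.
max_requests = 0
```

Run with `gunicorn -c gunicorn.conf.py myapp:application` (the module and callable names here are placeholders).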
  • Because the suggestion is that you should preconfigure the server to its maximum capacity at the outset, it does limit the vertical scaling you can do, at least within the confines of the same hardware. The next step therefore is horizontal scaling. Keep in mind the same issues about preloading. You don't want to bring on new hosts and direct traffic to them, only for the first requests sent to them to be delayed while the application loads.
  • No matter how you set your system up, if problems do arise, the only way you are going to be able to understand what went wrong when it all crashes in a heap is through monitoring. If you treat your system as a black box, how will you know what is going on inside? One thing is for sure: all those benchmarks you may have run to find the fastest web server are not going to help you one bit.
  • Server monitoring tools, although useful, only show you the effect of the problem on the overall system. They don't necessarily provide insight into what is going on inside your web application, as they still largely treat your web application and web server like a black box. A deeper level of introspection is required.
  • When we talk about finding out what is happening inside your Python web application, the options have been limited. Tools such as the Django debug toolbar, or the Python profiler, are only suited to a development environment. Sentry can be used in production to capture errors, but performance problems aren't going to generate nice exceptions for you.
  • This historical lack of good tools for knowing what is going on inside your Python web application is why I am loving my current day job. If you had managed to miss it, I am now working at New Relic. New Relic performance monitoring provides the ability to monitor the front end, your web application and the underlying server. I am bringing all that goodness to the world of the Python web. New Relic gives you the deep introspection required to know what is going on.
  • I am of course also the author of mod_wsgi. Being able to get New Relic working with Python means I have been able to use the reporting it provides to delve quite deep into the behaviour of mod_wsgi under different situations. The results have been quite revealing. One of the areas it has helped with is understanding the funnelling effects when using daemon mode. I'll admit there is room for improvement and I will be trying to address some issues in mod_wsgi 4.0.
  • Summing things up. Pick a web server and architecture which seems to meet your requirements, then use benchmarks to evaluate its behaviour. Don't use benchmarks simply to try and compare different systems. Don't trust server defaults. Configure and tune your whole stack based on the results you get from live production monitoring. Try using New Relic for really deep introspection of what is going on in all parts of your system.
  • So, if you are doing Python web application development, do consider giving New Relic a try. If you're not sure, New Relic does provide a free trial period where you can try out all the features it has. Even when the trial ends, a free Lite subscription is available which still provides lots of useful information. If you want to work at New Relic then come talk to us. Right now we are looking for a Python developer in Portland. While you think about how cool that might be, we should have time for questions.

    1. Web Server Bottlenecks and Performance Tuning. Graham Dumpleton (@GrahamDumpleton), PyCon, March 2012. "Starting my PyCon talk. Let's hope I don't lose my voice completely while doing this."
    2. Follow along
    3. The big picture
    4. Front end
    5. Web application (front end time 3.1 seconds, web application time 0.15 seconds)
    6. Performance Golden Rule: "80-90% of the end-user response time is spent on the frontend. Start there."
    7. Application breakdown
    8. Are benchmarks stupid?
    9. Benchmarks as a tool ✴ Web server benchmarks are of more value when used as an exploratory tool to understand how a specific system works, not to compare systems.
    10. What about load tests? ✴ Hitting a site with extreme load will only show you that it will likely fail under a denial of service attack. ✴ Your typical web server load test isn't alone going to help you understand how a web server is contributing to it failing.
    11. What should you test? ✴ You should use a range of purpose built tests to trigger certain scenarios. ✴ Use the tests to explore corner cases and not just the typical use case.
    12. Environment factors ✴ Amount of memory available. ✴ Number of processors available. ✴ Use of threads vs processes. ✴ Python global interpreter lock (GIL).
    13. Client impacts ✴ Slow HTTP browsers/clients. ✴ Browser keep alive connections.
    14. Application requirements ✴ Need to handle static assets.
    15. Use cases to explore ✴ Memory used by web application. ✴ Using processes versus threads. ✴ Impacts of long running requests. ✴ Restarting of server processes. ✴ Startup costs and lazy loading.
    16. Memory usage (chart comparing 1 process/1000 threads against 1 process/1 thread)
    17. What affects memory use? ✴ Web server base memory usage. ✴ Web server per thread memory usage. ✴ Application base memory usage. ✴ Is application loaded prior to forking? ✴ Per request transient memory usage.
    18. Processes vs threads (chart: memory usage across 0-150 processes and 0-150 threads)
    19. Apache/mod_wsgi defaults (max processes x threads): Apache (prefork) + mod_wsgi (embedded): 150 x 1; Apache (worker) + mod_wsgi (embedded): 6 x 25; Apache (prefork) + mod_wsgi (daemon): 1 x 15; Apache (worker) + mod_wsgi (daemon): 1 x 15.
    20. Other WSGI servers (max processes x threads): FASTCGI flup (prefork): 50 x 1; FASTCGI flup (threaded): 1 x 5; gunicorn: 1 x 1; uWSGI: 1 x 1; tornado: 1 x 1.
    21. Less than fair (chart plotting those default configurations on the processes vs threads scale)
    22. What to use? ✴ Number of overall threads dictated by: number of concurrent users; response time for requests. ✴ Processes preferred over threads, but: restricted by amount of memory; choice influenced by number of processors.
    23. Thread utilisation (chart of requests across 6 threads over a 1 second window)
    24. Request backlog (at 60 requests per second, thread utilisation jumped from 2.5 to 7.5 and maxed out at 9; backlog occurred and queue time increased to 750 ms)
    25. Processes are better (at 75 requests per second, thread utilisation only jumped from 2.5 to 7.5 at higher throughput and didn't actually reach 9; backlog only started at higher throughput and queue time stayed mostly under 100 ms)
    26. CPU bound (bulk of time is from doing things within the process itself)
    27. I/O wait (waiting on responses from backend services a significant proportion of time)
    28. Long running requests ✴ Complex calculations. ✴ Slow backend services. ✴ Large file uploads. ✴ Large responses. ✴ Slow HTTP clients.
    29. Varying request times: average 1385 ms, minimum 4.7 ms, maximum 20184 ms, std dev 3896 ms.
    30. Performance breakdown: why is creating the connection to PostgreSQL taking up 40% of overall response time?
    31. Slow HTTP clients ✴ Add nginx as a front end to the WSGI server. ✴ Brings the following benefits to the WSGI server: isolation from slow clients; no need to handle keep alive in the WSGI server; can offload serving of static files; can use X-Accel-Redirect for dynamically generated files.
    32. Request funnelling (nginx front end, then Apache workers, then mod_wsgi daemons)
    33. Complete overload
    34. Forced restarts ✴ Triggers for restarts: manual restart to fix issues/configuration; maximum number of requests reached; reloading of new application code; individual requests block/timeout. ✴ Restarts can make things worse.
    35. Auto scaling ✴ Apache/mod_wsgi embedded mode. ✴ Apache prefork MPM defaults: initial 1 / maximum 150. ✴ Apache worker MPM defaults: initial 2 / maximum 6. ✴ Auto scaling can make things worse.
    36. Pre load everything ✴ Start maximum processes up front. ✴ Pre load your web application when the process starts and not lazily on the first request. ✴ Keep processes persistent in memory and avoid unnecessary restarts.
    37. Horizontal scaling ✴ Using more servers is fine. ✴ Load balance across dedicated hosts. ✴ Or add additional hosts as required. ✴ Ensure though that if adding more hosts you have preloaded the web application before directing traffic to it.
    38. Monitoring is key ✴ Treat your server as a black box and you will never know what is going on inside.
    39. Server monitoring ✴ Open source tools: Monit, Munin, Cacti, Nagios.
    40. Python web tools ✴ Django debug toolbar: only useful for debugging a single request in a development setting. ✴ Sentry: useful for capturing runtime errors, but performance issues don't generate exceptions.
    41. New Relic APM
    42. Apache/mod_wsgi
    43. Summing up ✴ Use benchmarks to explore a specific system, not to compare different systems. ✴ Don't trust the defaults of any server, you need to tune it for your web application. ✴ Monitor your live production systems. ✴ New Relic for really deep introspection.
    44. Try New Relic ✴ Find out more about New Relic. ✴ Extended Pro Trial for PyCon attendees. ✴ Come work for New Relic.