LCA13: Web Server and Caching Technologies



Resource: LCA13
Name: Web Server and Caching Technologies
Date: 05-03-2013
Speaker: Ard Biesheuvel

Published in: Technology


  1. ASIA 2013 (LCA13): Web server and caching technologies
     Ard Biesheuvel (ardb) <>
  2. Definitions
     • The topic is internally referred to as the "web frontend vertical".
     • Web frontend → the part of a hyperscale web service deployment that faces the external network.
       - Excludes single-server LAMP boxes.
       - Excludes optimizations specific to distributed data storage, etc.
     • Vertical → take all layers of the software stack into account:
       - networking performance at the kernel level
       - I/O performance in reverse proxies/caches
       - CPU performance in dynamic content generation
  3. Mission
     We put ourselves in the shoes of the provider of a medium-to-large web service who is looking to replace some parts of their deployment with ARM servers, and aim to:
     • determine which parts of such a stack are most suitable for replacement with ARM server parts, and
     • find and ameliorate performance bottlenecks before someone tries to do this for real.
  4. Mission
     So what is a suitable replacement?
     • The cost of the whole deployment should be less.
     • The performance of the whole deployment should not be less.
     What constitutes performance?
     • Throughput → the ability to handle # requests per unit time
     • Latency → the ability to handle an individual request within time t
  5. Performance and scalability
     Scale up (one faster node):
     • more requests handled per unit time
     • throughput improves
     • latency improves
     Scale out (multiple slower nodes in parallel):
     • more requests handled per unit time
     • throughput improves
     • latency stays the same!
     When trying to maintain the same level of performance using a larger number of less powerful nodes, latency is likely a bigger concern than throughput.
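The scale-up/scale-out asymmetry can be sketched with toy numbers (all values hypothetical; `scale_up` and `scale_out` are our own helpers, assuming requests are parallel across connections but sequential within one):

```python
# Toy model: scale-up vs scale-out. All numbers are hypothetical.

def scale_up(base_latency_ms, base_rps, speedup):
    # A faster node shortens each request AND serves more of them.
    return base_latency_ms / speedup, base_rps * speedup

def scale_out(base_latency_ms, base_rps, nodes, slowdown):
    # N slower nodes: aggregate throughput grows, but each request
    # still runs on one slow node, so per-request latency gets worse.
    return base_latency_ms * slowdown, base_rps * nodes / slowdown

lat_up, rps_up = scale_up(100.0, 1000.0, 2.0)        # one 2x-faster node
lat_out, rps_out = scale_out(100.0, 1000.0, 4, 2.0)  # four half-speed nodes

print(lat_up, rps_up)    # 50.0 2000.0  -> latency improves with throughput
print(lat_out, rps_out)  # 200.0 2000.0 -> same throughput, worse latency
```

Both configurations reach the same aggregate throughput, but only scaling up improves latency, which is the point of the slide.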
  6. Example: SSL handling
     • Facebook, Google, etc. now use https:// by default.
     • The CONNECT phase of SSL consists of RSA authentication (which involves costly signing at the server end) and Diffie-Hellman key exchange. There is no parallelism here.
     • Conventional chaining modes for symmetric payload encryption can only execute sequentially.
     • The web server workload is highly parallel in nature, but not below the connection level.
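Why chaining modes serialise can be shown with a toy "CBC-like" construction (built from SHA-256 and XOR purely for illustration; this is NOT real cryptography): encrypting block i needs ciphertext block i-1, so the encryption loop cannot be parallelised.

```python
# Toy CBC-like chaining: illustrates the data dependency that
# serialises conventional chaining modes. NOT real cryptography.
import hashlib

BLOCK = 32  # SHA-256 digest size

def _pad(data: bytes) -> bytes:
    return data + b"\x00" * (-len(data) % BLOCK)

def encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    prev, out = iv, bytearray()
    data = _pad(plaintext)
    for i in range(0, len(data), BLOCK):
        ks = hashlib.sha256(key + prev).digest()   # depends on previous block
        ct = bytes(a ^ b for a, b in zip(data[i:i + BLOCK], ks))
        out += ct
        prev = ct                                  # the serialising chain
    return bytes(out)

def decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    # Decryption only needs ciphertext blocks, which are all known up
    # front -- so (as with real CBC) decryption COULD run in parallel.
    prev, out = iv, bytearray()
    for i in range(0, len(ciphertext), BLOCK):
        ct = ciphertext[i:i + BLOCK]
        ks = hashlib.sha256(key + prev).digest()
        out += bytes(a ^ b for a, b in zip(ct, ks))
        prev = ct
    return bytes(out)
```

Note the asymmetry the comments point out: the encrypt loop carries a dependency through `prev`, while the decrypt loop's inputs are all available in advance.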
  7. Example: Wordpress
     • Unoptimized boilerplate Wordpress install on an ARM system running Ubuntu 12.04
     • ApacheBench running on an identical node
     • 1000 requests at concurrency levels 1, 2, 3, 4, 8, 16

       Connection Times (ms)
                     min  mean[+/-sd]  median   max
       Connect:      200   202    1.8     201   231
       Processing:   142   145    1.2     144   154
       Waiting:      142   144    1.2     144   154
       Total:        343   347    1.8     347   373

     Obvious optimizations stand out:
     • connection caching for SSL
     • an opcode cache for PHP
     Note: ApacheBench is single-threaded!
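Because ab is single-threaded, driving real concurrency needs a threaded client. A minimal sketch (our own code, not an ab replacement; it spins up a throwaway local server so the snippet is self-contained — in practice you would point `url` at the origin server):

```python
# Minimal multi-threaded latency probe, printing an ab-style summary.
import http.server
import statistics
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

class Quiet(http.server.SimpleHTTPRequestHandler):
    def log_message(self, *args):  # silence per-request logging
        pass

def start_server():
    srv = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Quiet)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv

def timed_get(url):
    t0 = time.perf_counter()
    urllib.request.urlopen(url).read()
    return (time.perf_counter() - t0) * 1000.0   # milliseconds

srv = start_server()
url = "http://127.0.0.1:%d/" % srv.server_port
with ThreadPoolExecutor(max_workers=8) as pool:  # concurrency level 8
    samples = list(pool.map(timed_get, [url] * 100))
srv.shutdown()

print("min %.1f  mean %.1f  median %.1f  max %.1f" %
      (min(samples), statistics.mean(samples),
       statistics.median(samples), max(samples)))
```

Unlike ab, the 8 workers here genuinely issue requests from 8 threads, so the client itself does not become the bottleneck at higher concurrency levels.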
  8. LAVA support
     • Currently, LAVA has no support at all for running test cases involving more than a single node.
     • Any web frontend test case will involve at least an origin server and 2+ request generators.
     • Jobs cannot be executed in parallel arbitrarily if the nodes have shared resources (such as the Calxeda fabric).
  9. Power efficiency as a first-class metric
     • Power efficiency is what sets ARM apart from the competition.
     • However, latency is often proportional to raw CPU performance.
     • We need a benchmark that takes both into account, and that:
       - allows us to track our progress by watching the numbers improve over time
       - lets us compare ourselves with others in a meaningful way
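One possible shape for such a benchmark score (our own construction, not a metric the slides define): sustained requests per joule, gated by a latency target so that efficiency only counts when the latency requirement is met.

```python
# Hypothetical combined metric: requests per joule, with a latency gate.

def requests_per_joule(rps, watts):
    # watts = joules/second, so rps/watts = requests per joule.
    return rps / watts

def score(rps, watts, p99_ms, sla_ms):
    # Latency as a first-class constraint: efficiency counts for
    # nothing if the node cannot meet the latency target.
    return requests_per_joule(rps, watts) if p99_ms <= sla_ms else 0.0

# Made-up numbers for an ARM node vs an x86 node, both meeting the SLA.
arm = score(rps=800.0, watts=15.0, p99_ms=180.0, sla_ms=200.0)
x86 = score(rps=3000.0, watts=120.0, p99_ms=90.0, sla_ms=200.0)
print(arm, x86)  # the ARM node wins on requests/joule despite lower rps
```

With these (entirely hypothetical) numbers the ARM node scores ~53 requests/joule against 25 for the x86 node, which is the kind of comparison the slide argues the benchmark should make visible.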
  10. Approach
      • Benchmark a 'typical' deployment running on server-grade hardware.
        - Calxeda Highbank has plenty of nodes and a fast interconnect.
        - 'Typical' means state of the art in terms of applied optimizations; anything less means we are reinventing the wheel.
      • Measure performance: latency, throughput, power consumption?
        - Systems not under test should be contention free.
        - I/O-induced CPU contention is more costly on ARM.
      • Look out for bottlenecks in the networking stack; focus on latency, not throughput.
      • Use faster code so CPU-bound tasks complete in less time.
        → Assembler dependencies (Steve McIntyre, Fri 9 am)
        → Hadoop optimization (Steve Capper, Tue 12 pm)
      • More pipelining of CPU-bound and I/O-bound tasks.
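The last point, pipelining CPU-bound and I/O-bound work, can be sketched with a bounded queue between two stages, so that generating response i+1 overlaps with writing out response i (all names and the hashing "render" step here are illustrative, not part of any real server):

```python
# Pipelining sketch: a CPU-bound producer overlapped with an
# I/O-bound consumer via a bounded queue.
import hashlib
import os
import queue
import threading

work = queue.Queue(maxsize=4)  # small buffer decouples the two stages

def generate(n):
    # CPU-bound stage: "render" n response bodies (simulated by hashing).
    for i in range(n):
        work.put(hashlib.sha256(str(i).encode()).hexdigest().encode())
    work.put(None)  # sentinel: no more work

def send(sink):
    # I/O-bound stage: write each body out as soon as it is ready.
    while (body := work.get()) is not None:
        sink.write(body)

with open(os.devnull, "wb") as sink:
    producer = threading.Thread(target=generate, args=(1000,))
    producer.start()
    send(sink)       # runs concurrently with the producer thread
    producer.join()
```

The bounded queue keeps the CPU stage at most a few responses ahead, which is the usual way to overlap the two without unbounded buffering.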
  11. More about Linaro Connect:
      More about Linaro:
      More about Linaro engineering: