Latency Trumps All

Web 2.0 Expo
Thursday, Nov. 19th, 2009
by Chris Saari

  1. Latency Trumps All. Chris Saari, twitter.com/chrissaari, blog.chrissaari.com, saari@yahoo-inc.com
  2. Packet Latency: time for a packet to get between points A and B - physical distance + time queued in devices along the way. ~60ms
  3. ...
  4. Anytime... the system is waiting for data. The system is end to end: - Human response time - Network card buffering - System bus/interconnect speed - Interrupt handling - Network stacks - Process scheduling delays - Application process waiting for data from memory to get to CPU, or from disk to memory to CPU - Routers, modems, last mile speeds - Backbone speed and operating condition - Inter-cluster/colo performance
  5. Big Picture: Disk, Network, CPU, Memory, User
  6. Tubes?
  7. Latency vs. Bandwidth: Bandwidth = bits / second; Latency = time
  8. Bandwidth of a Truck Full of Tape
  9. Latency Lags Bandwidth - David Patterson. [Slide shows an excerpt of Patterson's CACM article, including Figure 1, a log-log plot of bandwidth and latency milestones relative to the first milestone. The excerpt's argument: bandwidth has advanced faster than latency for technical reasons (e.g., Moore's Law helps bandwidth more than latency; latency improvements usually also help bandwidth, but not vice versa) and for a marketing reason (bandwidth is easier to sell, so engineering resources get thrown at it).]
  10. The Problem: Relative Data Access Latencies, Fastest to Slowest - CPU Registers (1) - L1 Cache (1-2) - L2 Cache (6-10) - Main memory (25-100) --- don't cross this line, don't go off the motherboard! --- - Hard drive (1e7) - LAN (1e7-1e8) - WAN (1e9-2e9)
  11. Relative Data Access Latency, Fast to Slow: CPU Register, L1, L2, RAM
  12. Relative Data Access Latency, Fast to Slow: CPU Register, L1, L2, RAM, Hard Disk
  13. Relative Data Access Latency, Lower to Higher: Register, L1, L2, RAM, Hard Disk, LAN, Floppy/CD-ROM, WAN
  14. CPU Register: CPU Register Latency ~ Average Human Height
  15. L1 Cache
  16. L2 Cache: x6 to x10
  17. RAM: x25 to x100
  18. Hard Drive: x10M (0.4 x equatorial circumference of Earth)
  19. WAN: x100M (0.42 x Earth-to-Moon distance)
  20. To experience pain... Mobile phone network latency is 2-10x that of wired - iPhone 3G 500ms ping: x500M (2 x Earth-to-Moon distance)
  21. 500ms isn't that long...
  22. Google SPDY: "It is designed specifically for minimizing latency through features such as multiplexed streams, request prioritization and HTTP header compression."
  23. Strategy Pattern: Move Data Up. Relative Data Access Latencies - CPU Registers (1) - L1 Cache (1-2) - L2 Cache (6-10) - Main memory (25-50) - Hard drive (1e7) - LAN (1e7-1e8) - WAN (1e9-2e9)
  24. Batching: Do it Once
  25. Batching: Maximize Data Locality
  26. Let's Dig In. Relative Data Access Latencies, Fastest to Slowest - CPU Registers (1) - L1 Cache (1-2) - L2 Cache (6-10) - Main memory (25-100) - Hard drive (1e7) - LAN (1e7-1e8) - WAN (1e9-2e9)
  27. Network: If you can't Move Data Up, minimize accesses
  28. Network: If you can't Move Data Up, minimize accesses. Souders Performance Rules: 1) Make fewer HTTP requests - Avoid going halfway to the moon whenever possible
  29. Network: If you can't Move Data Up, minimize accesses. Souders Performance Rules: 1) Make fewer HTTP requests - Avoid going halfway to the moon whenever possible. 2) Use a content delivery network - Edge caching gets data physically closer to the user
  30. Network: If you can't Move Data Up, minimize accesses. Souders Performance Rules: 1) Make fewer HTTP requests - Avoid going halfway to the moon whenever possible. 2) Use a content delivery network - Edge caching gets data physically closer to the user. 3) Add an expires header - Instead of going halfway to the moon (Network), climb Godzilla (RAM) or go 40% of the way around the Earth (Disk) instead
  31. Network: Packets and Latency. Less data = fewer packets = less packet loss = less latency
  32. Network: 1) Make fewer HTTP requests. 2) Use a content delivery network. 3) Add an expires header. 4) Gzip components
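
To make rules 3 and 4 concrete, here is a minimal sketch (my own, not from the talk) of a Python WSGI handler that sets far-future caching headers and gzips the response when the client advertises support; the host, port, and body are illustrative.

```python
# Minimal sketch of Souders rules 3 (expires header) and 4 (gzip components).
import gzip
from datetime import datetime, timedelta
from wsgiref.simple_server import make_server

BODY = b"<html><body>" + b"hello world " * 1000 + b"</body></html>"

def app(environ, start_response):
    headers = [
        ("Content-Type", "text/html"),
        # Rule 3: let repeat views be served from the browser's own cache
        # (RAM or disk) instead of going back across the WAN.
        ("Cache-Control", "public, max-age=31536000"),
        ("Expires", (datetime.utcnow() + timedelta(days=365))
            .strftime("%a, %d %b %Y %H:%M:%S GMT")),
    ]
    body = BODY
    # Rule 4: fewer bytes -> fewer packets -> less packet loss -> less latency.
    if "gzip" in environ.get("HTTP_ACCEPT_ENCODING", ""):
        body = gzip.compress(body)
        headers.append(("Content-Encoding", "gzip"))
    headers.append(("Content-Length", str(len(body))))
    start_response("200 OK", headers)
    return [body]

if __name__ == "__main__":
    make_server("localhost", 8000, app).serve_forever()
```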
  33. Disk: Falling off the Latency Cliff
  34. Jim Gray, Microsoft 2006: Tape is Dead, Disk is Tape, Flash is Disk, RAM Locality is King
  35. Strategy: Move Up: Disk to RAM. RAM gets you above the exponential latency line - Linear cost and power consumption = $$$. Main memory (25-50) vs. Hard drive (1e7)
  36. Strategy: Avoidance: Bloom Filters - Probabilistic answer to the question of whether a member is in a set - Constant time via multiple hashes - Constant space bit string - Used in BigTable, Cassandra, Squid
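
A minimal Bloom filter sketch in Python (my own illustration, not code from BigTable, Cassandra, or Squid): a fixed-size bit array plus k hash probes gives a constant-time, constant-space "definitely not present / probably present" answer, so the slow lookup can often be skipped entirely.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)   # constant-space bit string

    def _positions(self, key):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Usage: consult the filter in RAM before paying a disk or network round trip.
bf = BloomFilter()
bf.add("user:1234")
if bf.might_contain("user:9999"):
    pass  # only now go to disk or the remote store
```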
  37. In Memory Indexes: Haystack keeps file system indexes in RAM - Cut disk accesses per image from 3 to 1. Search index compression. GFS master node prefix compression of names
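
As a toy illustration of the Haystack point (the structure and field names below are my own assumptions, not Facebook's code): if the photo-id-to-location index fits in RAM, serving an image costs a single disk read.

```python
# In-memory index: photo_id -> (volume_path, offset, size), kept entirely in RAM.
photo_index = {}

def read_photo(photo_id):
    volume_path, offset, size = photo_index[photo_id]  # RAM lookup, no disk I/O
    with open(volume_path, "rb") as volume:            # a real server would keep
        volume.seek(offset)                            # the volume file open
        return volume.read(size)                       # the one disk access
```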
  38. Managing Gigabytes - Witten, Moffat, and Bell
  39. SSDs vs. Disk: I/O ops/sec ~70-100 for disk (~180-200 at 15K RPM) vs. ~10K-100K for SSD; seek times ~7-3.2 ms for disk vs. ~0.085-0.05 ms for SSD. SSDs < 1/5th the power consumption of spinning disk
  40. Sequential vs. Random Disk Access - James Hamilton
  41. 1TB Sequential Read
  42. 1TB Random Read: [calendar graphic spanning days 1 through 15, ending "Done!"]
  43. Strategy: Batching and Streaming. Fewer reads/writes of large contiguous chunks of data - GFS 64MB chunks
  44. Strategy: Batching and Streaming. Fewer reads/writes of large contiguous chunks of data - GFS 64MB chunks. Requires data locality - BigTable app-specified data layout and compression
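
A small Python sketch of the batching/streaming idea (the helper name and 64 MB chunk size, echoing GFS, are illustrative): stream a large file as a few big sequential reads rather than many small random ones.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, echoing the GFS chunk size

def stream_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield a file as large contiguous chunks: one big sequential read each."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Usage sketch:
# for chunk in stream_chunks("big.dat"):
#     process(chunk)   # hypothetical per-chunk work
```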
  45. The CPU
  46. "CPU Bound": Data in RAM, CPU access to that data
  47. The Memory Wall
  48. Latency Lags Bandwidth - Dave Patterson
  49. Multicore Makes It Worse! More cores accelerate the rate of divergence - CPU performance doubled 3x over the past 5 years - Memory performance doubled once
  50. Evolving CPU Memory Access Designs: Intel Nehalem integrated memory controller and new high-speed interconnect - 40 percent shorter latency and increased bandwidth, 4-6x faster system
  51. More CPU evolution: Intel Nehalem-EX - 8 core, 24MB of cache, 2 integrated memory controllers - ring interconnect on-die network designed to speed the movement of data among the caches used by each of the cores. IBM Power 7 - 32MB Level 3 cache. AMD Magny-Cours - 12 cores, 12MB of Level 3 cache
  52. Cache Hit Ratio
  53. Cache Line Awareness: Linked list - Each node as a separate allocation is bad
  54. Cache Line Awareness: Linked list - Each node as a separate allocation is bad. Hash table - Reprobe on collision with stride of 1
  55. Cache Line Awareness: Linked list - Each node as a separate allocation is bad. Hash table - Reprobe on collision with stride of 1. Stack allocation - Top of stack is usually in cache, top of the heap is usually not in cache
  56. Cache Line Awareness: Linked list - Each node as a separate allocation is bad. Hash table - Reprobe on collision with stride of 1. Stack allocation - Top of stack is usually in cache, top of the heap is usually not in cache. Pipeline processing - Do all stages of operations on a piece of data at once vs. each stage separately
  57. Cache Line Awareness: Linked list - Each node as a separate allocation is bad. Hash table - Reprobe on collision with stride of 1. Stack allocation - Top of stack is usually in cache, top of the heap is usually not in cache. Pipeline processing - Do all stages of operations on a piece of data at once vs. each stage separately. Optimize for size - Might be faster execution than code optimized for speed
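
For the pipeline-processing bullet, a Python sketch of the shape of the change (Python does not expose cache behavior, so this only illustrates the structure): run every stage on an item while it is hot instead of sweeping the whole dataset once per stage.

```python
# Placeholder stages standing in for real per-item work.
def parse(x): return x
def validate(x): return x
def transform(x): return x

def process_per_stage(items):
    # Three full passes: every stage streams the whole dataset through the
    # cache hierarchy again before the next stage starts.
    items = [parse(x) for x in items]
    items = [validate(x) for x in items]
    return [transform(x) for x in items]

def process_per_item(items):
    # One pass: all stages run on each item back to back, while the item is
    # (likely) still in cache.
    return [transform(validate(parse(x))) for x in items]
```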
  58. Cycles to Burn: 1) Make fewer HTTP requests. 2) Use a content delivery network. 3) Add an expires header. 4) Gzip components - Use excess compute for compression
  59. Datacenter
  60. Datacenter Storage Hierarchy: "Storage hierarchy: a different view" - Jeff Dean, Google. A bumpy ride that has been getting bumpier over time
  61. Intra-Datacenter Round Trip: ~500 miles (~NYC to Columbus, OH) - x500,000
  62. Datacenter Level Systems: RethinkDB, Facebook Haystack, HBase, memcached, Google File System, Yahoo Sherpa, Facebook Cassandra, Sawzall / Pig, Redis, Project Voldemort, MonetDB, Google BigTable
  63. Memcached Facebook Optimizations - UDP to reduce network traffic - Fewer packets
  64. Memcached Facebook Optimizations - UDP to reduce network traffic - Fewer packets - One core saturated with network interrupt handling - Opportunistic polling of the network interfaces and setting interrupt coalescing thresholds aggressively - Batching
  65. Memcached Facebook Optimizations - UDP to reduce network traffic - Fewer packets - One core saturated with network interrupt handling - Opportunistic polling of the network interfaces and setting interrupt coalescing thresholds aggressively - Batching - Contention on the network device transmit queue lock, packets added/removed from the queue one at a time - Change the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets
  66. Memcached Facebook Optimizations - UDP to reduce network traffic - Fewer packets - One core saturated with network interrupt handling - Opportunistic polling of the network interfaces and setting interrupt coalescing thresholds aggressively - Batching - Contention on the network device transmit queue lock, packets added/removed from the queue one at a time - Change the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets - More lock contention fixes
  67. Memcached Facebook Optimizations - UDP to reduce network traffic - Fewer packets - One core saturated with network interrupt handling - Opportunistic polling of the network interfaces and setting interrupt coalescing thresholds aggressively - Batching - Contention on the network device transmit queue lock, packets added/removed from the queue one at a time - Change the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets - More lock contention fixes - Result: 200,000 UDP requests/second with average latency of 173 microseconds
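
A toy user-space illustration of the batched-dequeue idea (not Facebook's kernel change; names are hypothetical): hold the contended lock once to drain a batch, then do the slow transmit work outside the critical section.

```python
import threading
from collections import deque

queue_lock = threading.Lock()
tx_queue = deque()          # shared transmit queue

def send(packet):
    pass                    # placeholder for the actual device transmit

def transmit_batch(batch_size=32):
    # Take the contended lock once and dequeue a whole batch...
    with queue_lock:
        batch = [tx_queue.popleft()
                 for _ in range(min(batch_size, len(tx_queue)))]
    # ...then transmit without holding the lock, so other cores can enqueue.
    for packet in batch:
        send(packet)
```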
  68. Google BigTable: Table contains a sequence of blocks - block index loaded into memory - Move Up
  69. Google BigTable: Table contains a sequence of blocks - block index loaded into memory - Move Up. Table can be completely mapped into memory - Move Up
  70. Google BigTable: Table contains a sequence of blocks - block index loaded into memory - Move Up. Table can be completely mapped into memory - Move Up. Bloom filters hint for data - Move Up
  71. Google BigTable: Table contains a sequence of blocks - block index loaded into memory - Move Up. Table can be completely mapped into memory - Move Up. Bloom filters hint for data - Move Up. Locality groups loaded in memory - Move Up, Batching - Clients can control compression of locality groups
  72. Google BigTable: Table contains a sequence of blocks - block index loaded into memory - Move Up. Table can be completely mapped into memory - Move Up. Bloom filters hint for data - Move Up. Locality groups loaded in memory - Move Up, Batching - Clients can control compression of locality groups. 2 levels of caching - Move Up - Scan cache of key/value pairs and block cache
  73. Google BigTable: Table contains a sequence of blocks - block index loaded into memory - Move Up. Table can be completely mapped into memory - Move Up. Bloom filters hint for data - Move Up. Locality groups loaded in memory - Move Up, Batching - Clients can control compression of locality groups. 2 levels of caching - Move Up - Scan cache of key/value pairs and block cache. Clients cache tablet server locations - 3 to 6 network trips if the cache is invalid - Move Up
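
A rough sketch of the in-memory block index idea (my own simplification, not BigTable's actual file format): the last key of each block lives in RAM, so a point read is one in-memory binary search plus at most one disk access.

```python
import bisect

class BlockIndexedTable:
    def __init__(self, path, block_last_keys, block_offsets, block_sizes):
        self.path = path
        self.last_keys = block_last_keys   # sorted, kept entirely in memory
        self.offsets = block_offsets
        self.sizes = block_sizes

    def read_block_for(self, key):
        # RAM: find the first block whose last key is >= the lookup key.
        i = bisect.bisect_left(self.last_keys, key)
        if i == len(self.last_keys):
            return None                    # key is past the last block
        with open(self.path, "rb") as f:
            f.seek(self.offsets[i])
            return f.read(self.sizes[i])   # the single disk access
```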
  74. Facebook Cassandra: Bloom filters used for keys in files on disk - Move Up
  75. Facebook Cassandra: Bloom filters used for keys in files on disk - Move Up. Sequential disk access only - Batching. Append w/o read ahead
  76. Facebook Cassandra: Bloom filters used for keys in files on disk - Move Up. Sequential disk access only - Batching. Append w/o read ahead. Log to memory and write to commit log on dedicated disk - Batching
  77. Facebook Cassandra: Bloom filters used for keys in files on disk - Move Up. Sequential disk access only - Batching. Append w/o read ahead. Log to memory and write to commit log on dedicated disk - Batching. Programmer-controlled data layout for locality - Batching
  78. Facebook Cassandra: Bloom filters used for keys in files on disk - Move Up. Sequential disk access only - Batching. Append w/o read ahead. Log to memory and write to commit log on dedicated disk - Batching. Programmer-controlled data layout for locality - Batching. Result: 2 orders of magnitude better performance than MySQL
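
A minimal append-only write path sketch (my own, not Cassandra's implementation; names are illustrative): writes go to an in-memory table plus a sequential append on a dedicated commit-log file, so the write path never seeks.

```python
class CommitLogStore:
    def __init__(self, log_path):
        self.memtable = {}                  # RAM copy of recent writes
        self.log = open(log_path, "ab")     # dedicated, append-only commit log

    def put(self, key, value):
        record = f"{key}\t{value}\n".encode()
        self.log.write(record)              # sequential append, no seek
        self.log.flush()                    # flush Python's buffer (a real
                                            # store would also fsync)
        self.memtable[key] = value          # serve reads from memory

    def get(self, key):
        return self.memtable.get(key)
```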
  79. Move the Compute to the Data: YQL Execute
  80. From the Browser Perspective: Performance bounded by 3 things
  81. From the Browser Perspective: Performance bounded by 3 things - Fetch time - Unless you're bundling everything it is a cascade of interdependent requests, at least 2 phases worth
  82. From the Browser Perspective: Performance bounded by 3 things - Fetch time - Unless you're bundling everything it is a cascade of interdependent requests, at least 2 phases worth - Parse time - HTML, CSS, Javascript
  83. From the Browser Perspective: Performance bounded by 3 things - Fetch time - Unless you're bundling everything it is a cascade of interdependent requests, at least 2 phases worth - Parse time - HTML, CSS, Javascript - Execution time - Javascript execution, DOM construction and layout, style application
  84. Recap: Move Data Up - Caching - Compression - If You Can't Move All The Data Up - Indexes - Bloom filters. Batching and Streaming - Maximize locality
  85. Take 2 And Call Me In The Morning: An Engineer's Guide to Bandwidth - http://developer.yahoo.net/blog/archives/2009/10/a_engineers_gui.html. High Performance Web Sites - Steve Souders. Even Faster Web Sites - Steve Souders. Managing Gigabytes: Compressing and Indexing Documents and Images - Witten, Moffat, Bell. Yahoo Query Language (YQL) - http://developer.yahoo.com/yql/
