Achieving Rapid Response Times in Large Online Services

Great recent talk by Googler Jeff Dean on the problems caused by occasional high latency in large-scale distributed systems. Some highlights: it is better to synchronize regularly scheduled OS tasks and log rotations across machines than to randomize them (a surprising result to me, and rarely done as far as I know); the number of shard replicas should vary with demand, for example creating more replicas of recent log data or of popular web page snippets (perhaps obvious, but I've seen people skip this and just keep three replicas of all data); requests should stop being sent to nodes showing high latency, with replacement nodes brought up instead (also perhaps obvious, but not used enough in the systems I've seen); and redundant requests should be issued for data, sending a second request if the first does not respond almost immediately (within about 2 ms), using whichever reply comes first and cancelling the other (again perhaps obvious, yet rarely done; the more common pattern is a retry loop with a much longer gap, over 50 ms, between retries).


Achieving Rapid Response Times in Large Online Services

  1. Achieving Rapid Response Times in Large Online Services. Jeff Dean, Google Fellow, jeff@google.com
  2. Faster Is Better
  3. Faster Is Better
  4. Large Fanout Services (diagram; labels: query, Ad System, Frontend Web Server, Super root, Cache servers, News, Local, Video, Images, Blogs, Books, Web)
  5. Why Does Fanout Make Things Harder?
     • Overall latency ≥ latency of slowest component
       – small blips on individual machines cause delays
       – touching more machines increases likelihood of delays
     • Server with 1 ms avg. but 1 sec 99%ile latency
       – touch 1 of these: 1% of requests take ≥ 1 sec
       – touch 100 of these: 63% of requests take ≥ 1 sec
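
The arithmetic on that slide is just the complement of "every touched server stays out of its slow tail". A minimal sketch (mine, not from the talk) that reproduces the 1% and 63% figures, assuming each server independently exceeds 1 second with 1% probability:

```python
# P(at least one of N touched servers hits its slow tail), assuming each
# server independently exceeds 1 second with probability p_slow (here 1%).
def p_any_slow(n_servers: int, p_slow: float = 0.01) -> float:
    return 1.0 - (1.0 - p_slow) ** n_servers

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"touch {n:3d} servers -> {p_any_slow(n):5.1%} of requests take >= 1 sec")
    # touch   1 servers ->  1.0% of requests take >= 1 sec
    # touch  10 servers ->  9.6% of requests take >= 1 sec
    # touch 100 servers -> 63.4% of requests take >= 1 sec
```
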
  6. One Approach: Squash All Variability
     • Carefully engineer all components of the system
     • Possible at small scale
       – dedicated resources
       – complete control over whole system
       – careful understanding of all background activities
       – less likely to have hardware fail in bizarre ways
     • System changes are difficult
       – software or hardware changes affect delicate balance
     Not tenable at large scale: need to share resources
  7. Shared Environment
     • Huge benefit: greatly increased utilization
     • ... but hard-to-predict effects increase variability
       – network congestion
       – background activities
       – bursts of foreground activity
       – not just your jobs, but everyone else’s jobs, too
     • Exacerbated by large fanout systems
  8. Shared Environment (diagram): Linux
  9. Shared Environment (diagram): + file system chunkserver
  10. Shared Environment (diagram): + scheduling system
  11. Shared Environment (diagram): + various other system services
  12. Shared Environment (diagram): + Bigtable tablet server
  13. Shared Environment (diagram): + cpu intensive job
  14. Shared Environment (diagram): + random MapReduce #1
  15. Shared Environment (diagram): + random app, random app #2
  16. Basic Latency Reduction Techniques
      • Differentiated service classes
        – prioritized request queues in servers
        – prioritized network traffic
      • Reduce head-of-line blocking
        – break large requests into sequence of small requests
      • Manage expensive background activities
        – e.g. log compaction in distributed storage systems
        – rate limit activity
        – defer expensive activity until load is lower
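
As a small illustration of the "prioritized request queues in servers" bullet (my own sketch, not code from the talk): serve strictly by service class so interactive requests never wait behind queued batch work.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1   # lower number = higher-priority service class

class PriorityRequestQueue:
    """Requests are served strictly by class, FIFO within a class."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker keeps FIFO order per class

    def push(self, service_class: int, request) -> None:
        heapq.heappush(self._heap, (service_class, next(self._seq), request))

    def pop(self):
        _, _, request = heapq.heappop(self._heap)
        return request

q = PriorityRequestQueue()
q.push(BATCH, "log-compaction chunk")
q.push(INTERACTIVE, "user query")
assert q.pop() == "user query"   # interactive work jumps ahead of batch work
```
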
  17. Synchronized Disruption
      • Large systems often have background daemons
        – various monitoring and system maintenance tasks
      • Initial intuition: randomize when each machine performs these tasks
        – actually a very bad idea for high-fanout services: at any given moment, at least one or a few machines are slow
      • Better to actually synchronize the disruptions
        – run every five minutes “on the dot”
        – one synchronized blip better than many unsynchronized ones
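
A toy simulation of that point (mine; the five-minute period is from the slide, while the 2-second blip and machine count are illustrative assumptions): with randomized offsets across 1000 machines, some machine is in its maintenance blip almost all the time, whereas synchronizing every machine to the same boundary leaves the fleet fully fast outside one short window.

```python
import random

PERIOD_S = 300      # background task runs every five minutes
BLIP_S = 2          # assume each run slows its machine for ~2 seconds
N_MACHINES = 1000

def machine_is_slow(t: float, offset: float) -> bool:
    return (t - offset) % PERIOD_S < BLIP_S

def fraction_of_time_some_machine_is_slow(offsets, samples: int = 2000) -> float:
    hits = 0
    for _ in range(samples):
        t = random.uniform(0, PERIOD_S)
        hits += any(machine_is_slow(t, o) for o in offsets)
    return hits / samples

randomized = [random.uniform(0, PERIOD_S) for _ in range(N_MACHINES)]
synchronized = [0.0] * N_MACHINES   # everyone runs "on the dot"

print(fraction_of_time_some_machine_is_slow(randomized))    # ~1.0: always someone slow
print(fraction_of_time_some_machine_is_slow(synchronized))  # ~0.007: one shared blip
```
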
  18. Tolerating Faults vs. Tolerating Variability
      • Tolerating faults:
        – rely on extra resources (RAIDed disks, ECC memory, distributed system components, etc.)
        – make a reliable whole out of unreliable parts
      • Tolerating variability:
        – use these same extra resources
        – make a predictable whole out of unpredictable parts
      • Time scales are very different:
        – variability: 1000s of disruptions/sec, scale of milliseconds
        – faults: 10s of failures per day, scale of tens of seconds
  19. Latency Tolerating Techniques
      • Cross-request adaptation
        – examine recent behavior
        – take action to improve latency of future requests
        – typically relates to balancing load across a set of servers
        – time scale: tens of seconds to minutes
      • Within-request adaptation
        – cope with slow subsystems in the context of a higher-level request
        – time scale: right now, while the user is waiting
  20. Fine-Grained Dynamic Partitioning
      • Partition large datasets/computations
        – more than 1 partition per machine (often 10-100/machine)
        – e.g. BigTable, query serving systems, GFS, ...
      (diagram: Master maps many small partitions, e.g. 17, 9, 1, 3, 2, 8, 4, 12, 7, ..., onto Machine 1 ... Machine N)
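
A sketch of the bookkeeping this implies (hypothetical names, not the BigTable/GFS master): keep many more partitions than machines, so load shifting and recovery can happen at partition granularity.

```python
class Master:
    """Maps many small partitions onto a much smaller set of machines."""
    def __init__(self, n_partitions: int, machines: list):
        self.machines = list(machines)
        self.assignment = {p: self.machines[p % len(self.machines)]
                           for p in range(n_partitions)}

    def partitions_on(self, machine) -> list:
        return [p for p, m in self.assignment.items() if m == machine]

    def move(self, partition: int, to_machine) -> None:
        self.assignment[partition] = to_machine

# 10-100 partitions per machine, as the slide suggests.
master = Master(n_partitions=400, machines=[f"machine-{i}" for i in range(10)])
print(len(master.partitions_on("machine-0")))   # 40
```
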
  21. Load Balancing
      • Can shed load in few-percent increments
        – prioritize shifting load when imbalance is more severe
      (diagram: Master with partitions spread across machines)
  22. Load Balancing (diagram: one machine marked “Overloaded!”)
  23. Load Balancing (diagram: partition 9 moved off the overloaded machine)
  24. Load Balancing (diagram: load balanced again, no machine overloaded)
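
A greedy sketch of shedding load in small increments (a hypothetical policy of my own, not Google's algorithm): move the smallest partitions off the most loaded machine until it is back within a few percent of the mean.

```python
def rebalance(assignment: dict, load: dict, threshold: float = 1.05,
              max_moves: int = 20) -> list:
    """assignment: partition -> machine; load: partition -> measured load.
    Returns the list of (partition, from_machine, to_machine) moves made."""
    moves = []
    for _ in range(max_moves):
        per_machine = {}
        for p, m in assignment.items():
            per_machine[m] = per_machine.get(m, 0.0) + load[p]
        mean = sum(per_machine.values()) / len(per_machine)
        hot = max(per_machine, key=per_machine.get)
        if per_machine[hot] <= threshold * mean:
            break                        # imbalance is mild: leave things alone
        cold = min(per_machine, key=per_machine.get)
        part = min((p for p, m in assignment.items() if m == hot), key=load.get)
        assignment[part] = cold          # shift one small partition (a few percent)
        moves.append((part, hot, cold))
    return moves

assignment = {0: "m1", 1: "m1", 2: "m1", 3: "m2", 4: "m3"}
load = {0: 40.0, 1: 30.0, 2: 30.0, 3: 10.0, 4: 10.0}
print(rebalance(assignment, load))   # [(1, 'm1', 'm2'), (2, 'm1', 'm3')]
```
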
  25. Speeds Failure Recovery
      • Many machines each recover one or a few partitions
        – e.g. BigTable tablets, GFS chunks, query serving shards
      (diagram: Master with partitions spread across machines)
  26. Speeds Failure Recovery (diagram: a machine fails and its partitions go offline)
  27. Speeds Failure Recovery (diagram: the failed machine’s partitions, e.g. 1, 3, 17, are re-served in parallel by the remaining machines)
  28. Selective Replication
      • Find heavily used items and make more replicas
        – can be static or dynamic
      • Example: query serving system
        – static: more replicas of important docs
        – dynamic: more replicas of Chinese documents as Chinese query load increases
      (diagram: Master with partitions spread across machines)
  29. Selective Replication (diagram, build frame)
  30. Selective Replication (diagram, build frame: extra replicas of hot partitions created)
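
One way to express the dynamic flavor (hypothetical numbers and policy, not Google's): derive each item's replica count from its observed query load, with a floor for fault tolerance.

```python
import math

def desired_replicas(queries_per_sec: float,
                     per_replica_capacity_qps: float = 100.0,
                     min_replicas: int = 3,
                     max_replicas: int = 50) -> int:
    """More replicas for hot items (e.g. Chinese docs as Chinese query load rises),
    never fewer than the fault-tolerance floor."""
    needed = math.ceil(queries_per_sec / per_replica_capacity_qps)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(20))      # 3: a cold item keeps only the default replicas
print(desired_replicas(2500))    # 25: a hot item gets many extra copies
```
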
  31. Latency-Induced Probation
      • Servers sometimes become slow to respond
        – could be data dependent, but...
        – often due to interference effects, e.g. a CPU or network spike from other jobs on the shared server
      • Non-intuitive: remove capacity under load to improve latency (?!)
      • Initiate corrective action
        – e.g. make copies of its partitions on other servers
        – continue sending a shadow stream of requests to the server
          • keep measuring latency
          • return it to service when latency has stayed low for long enough
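
A sketch of the probation loop as I read this slide (hypothetical thresholds, not Google's values): track a smoothed latency per server, pull a slow server out of normal rotation, keep sending it a shadow stream to measure, and reinstate it only after it has looked healthy for long enough.

```python
import time

class ProbationTracker:
    """Servers on probation receive only shadow (measurement) traffic."""
    def __init__(self, slow_ms=50.0, fast_ms=20.0, recovery_s=30.0, alpha=0.2):
        self.slow_ms, self.fast_ms = slow_ms, fast_ms
        self.recovery_s, self.alpha = recovery_s, alpha
        self.ewma_ms = {}        # server -> smoothed observed latency
        self.fast_since = {}     # probationed server -> time it became fast again
        self.on_probation = set()

    def record(self, server, latency_ms, now=None):
        now = time.monotonic() if now is None else now
        prev = self.ewma_ms.get(server, latency_ms)
        ewma = self.ewma_ms[server] = (1 - self.alpha) * prev + self.alpha * latency_ms
        if ewma > self.slow_ms:                      # interference spike: probation
            self.on_probation.add(server)
            self.fast_since.pop(server, None)
        elif server in self.on_probation and ewma < self.fast_ms:
            start = self.fast_since.setdefault(server, now)
            if now - start >= self.recovery_s:       # fast for long enough: reinstate
                self.on_probation.discard(server)
                self.fast_since.pop(server, None)

    def eligible(self, servers):
        """Servers that should receive real (non-shadow) requests."""
        return [s for s in servers if s not in self.on_probation]
```
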
  32. Handling Within-Request Variability
      • Take action within a single high-level request
      • Goals:
        – reduce overall latency
        – don’t increase resource use too much
        – keep serving systems safe
  33. Data Independent Failures (diagram: a query fanning out to many servers)
  34. Data Independent Failures (diagram, build frame)
  35. Data Independent Failures (diagram, build frame)
  36. Canary Requests (diagram: a query fanning out)
  37. Canary Requests (diagram, build frame)
  38. Canary Requests (diagram, build frame)
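
The canary-request slides here are diagram-only, so the following is my own hedged reading of the technique rather than anything stated in the transcript: a high-fanout root tries the query on one or two "canary" leaves first, and only fans out to the remaining leaves once a canary has answered safely, guarding against a bad query taking out the whole fleet. A hypothetical sketch:

```python
import concurrent.futures as futures

_pool = futures.ThreadPoolExecutor(max_workers=64)   # shared RPC worker pool

def fan_out_with_canary(query, leaves, send, canary_timeout_s=1.0, n_canaries=2):
    """`send(leaf, query)` is a caller-supplied blocking RPC stub (hypothetical).
    Try the query on a couple of leaves first; only if one answers within the
    timeout do we fan the query out to all the remaining leaves."""
    canaries, rest = leaves[:n_canaries], leaves[n_canaries:]
    canary_futs = {_pool.submit(send, leaf, query): leaf for leaf in canaries}
    done, _ = futures.wait(canary_futs, timeout=canary_timeout_s,
                           return_when=futures.FIRST_COMPLETED)
    if not done:
        raise RuntimeError("canary leaves did not respond; refusing to fan out")
    results = {canary_futs[f]: f.result() for f in done}   # raises if a canary failed
    rest_futs = {_pool.submit(send, leaf, query): leaf for leaf in rest}
    for fut in futures.as_completed(rest_futs):
        results[rest_futs[fut]] = fut.result()
    return results
```
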
  39.–48. Backup Requests (animation across slides 39–48): Replicas 1–3 each have requests queued (req 3, req 5, req 8, req 6); the client sends req 9 to one replica, then sends a backup copy of req 9 to a second replica; whichever replica replies first wins, and on receiving the reply the client sends “Cancel req 9” so the copy still queued at the other replica is dropped.
  49. Backup Requests Effects
      • In-memory BigTable lookups
        – data replicated in two in-memory tables
        – issue requests for 1000 keys spread across 100 tablets
        – measure elapsed time until data for last key arrives
  50. Backup Requests Effects (adds results table):

      Policy              | Avg   | Std Dev | 95%ile | 99%ile | 99.9%ile
      No backups          | 33 ms | 1524 ms | 24 ms  | 52 ms  | 994 ms
      Backup after 10 ms  | 14 ms | 4 ms    | 20 ms  | 23 ms  | 50 ms
      Backup after 50 ms  | 16 ms | 12 ms   | 57 ms  | 63 ms  | 68 ms

  51. Backup Requests Effects (adds cost note):
      • Modest increase in request load:
        – 10 ms delay: <5% extra requests; 50 ms delay: <1%
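
A sketch of the client side of "backup after 10 ms" (hypothetical RPC stub, not the BigTable client): issue the lookup to one replica, and if it has not answered within the backup delay, issue an identical request to the second replica and return whichever reply arrives first.

```python
import concurrent.futures as futures

_pool = futures.ThreadPoolExecutor(max_workers=32)   # shared RPC worker pool

def fetch_with_backup(key, replicas, send, backup_after_s=0.010):
    """`send(replica, key)` is a caller-supplied blocking RPC stub (hypothetical).
    Send to replicas[0]; if no reply within `backup_after_s`, also send to
    replicas[1]; return the first reply that comes back."""
    first = _pool.submit(send, replicas[0], key)
    try:
        return first.result(timeout=backup_after_s)      # fast path: no backup sent
    except futures.TimeoutError:
        backup = _pool.submit(send, replicas[1], key)
        done, not_done = futures.wait({first, backup},
                                      return_when=futures.FIRST_COMPLETED)
        for fut in not_done:
            fut.cancel()     # best effort; a real client would also cancel the RPC
        return next(iter(done)).result()
```

Because the backup is only sent when the first replica is slow, the extra traffic stays small: the slide above reports under 5% extra requests at the 10 ms delay and under 1% at 50 ms.
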
  52.–61. Backup Requests w/ Cross-Server Cancellation (animation across slides 52–61): each request identifies the other server(s) to which it might be sent; the client sends req 9 to Server 1 tagged “also: server 2” and to Server 2 tagged “also: server 1”; when one server dequeues req 9 it notifies the other (“Server 2: Starting req 9”), the other server drops its queued copy, and the reply goes back to the client.
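
A toy, in-process model of the server side (hypothetical classes, not the real GFS/BigTable RPC layer): each queued copy of a request remembers which peer may also hold it, and whichever server dequeues the request first tells that peer to drop its duplicate.

```python
import threading
from collections import deque

class Server:
    """Queued items remember which peer server may also hold a copy; when a
    server starts a request it notifies that peer so the duplicate is dropped."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()       # items: (request_id, work_fn, peer_or_None)
        self.lock = threading.Lock()

    def enqueue(self, request_id, work_fn, also_sent_to=None):
        with self.lock:
            self.queue.append((request_id, work_fn, also_sent_to))

    def cancel(self, request_id):
        """A peer is starting this request: drop our copy if it is still queued."""
        with self.lock:
            self.queue = deque(item for item in self.queue if item[0] != request_id)

    def run_next(self):
        with self.lock:
            if not self.queue:
                return None
            request_id, work_fn, peer = self.queue.popleft()
        if peer is not None:
            peer.cancel(request_id)      # "Server X: starting req N"
        return request_id, work_fn()

# The client sends req 9 to both servers, each copy tagged with the other server.
s1, s2 = Server("server-1"), Server("server-2")
s1.enqueue(9, lambda: "reply from server-1", also_sent_to=s2)
s2.enqueue(9, lambda: "reply from server-2", also_sent_to=s1)
print(s1.run_next())   # (9, 'reply from server-1'); server-2 drops its copy
print(s2.run_next())   # None: its copy of req 9 was cancelled
```
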
  62.–70. Backup Requests: Bad Case (animation across slides 62–70): both servers’ queues happen to drain at nearly the same moment, each starts req 9 before the other’s “Starting req 9” notification arrives, so both copies are executed; the work is duplicated, but the client still gets its reply.
  71.–78. Backup Requests w/ Cross-Server Cancellation: results (slides 71–78)
      • Read operations in the distributed file system client
        – send request to first replica
        – wait 2 ms, and send to second replica
        – servers cancel the request on the other replica when starting the read
      • Time for BigTable monitoring ops that touch disk:

      Cluster state | Policy            | 50%ile | 90%ile | 99%ile | 99.9%ile
      Mostly idle   | No backups        | 19 ms  | 38 ms  | 67 ms  | 98 ms
      Mostly idle   | Backup after 2 ms | 16 ms  | 28 ms  | 38 ms  | 51 ms    (99%ile: -43%)
      +Terasort     | No backups        | 24 ms  | 56 ms  | 108 ms | 159 ms
      +Terasort     | Backup after 2 ms | 19 ms  | 35 ms  | 67 ms  | 108 ms   (99%ile: -38%)

      • Backups cause about 1% extra disk reads
      • Backups with the big sort job running give the same read latencies as no backups on an idle cluster!
  79. Backup Request Variants
      • Many variants possible:
        – send to a third replica after a longer delay (sending to two gives almost all the benefit, however)
        – keep requests in other queues, but reduce their priority
        – can handle Reed-Solomon reconstruction similarly
  80. Tainted Partial Results
      • Many systems can tolerate inexact results
        – information retrieval systems: searching 99.9% of docs in 200 ms is better than 100% in 1000 ms
        – complex web pages with many sub-components: e.g. okay to skip the spelling-correction service if it is slow
      • Design to proactively abandon slow subsystems
        – set cutoffs dynamically based on recent measurements (can trade off completeness vs. responsiveness)
        – important to mark such results as tainted in caches
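
A sketch of proactively abandoning slow subsystems (hypothetical names and budgets, mine): give each optional backend a deadline, skip any that miss it, and mark the assembled result as tainted so it is not written into caches.

```python
import concurrent.futures as futures

_pool = futures.ThreadPoolExecutor(max_workers=16)   # shared worker pool

def assemble_page(query, subsystems, budgets_s):
    """`subsystems` maps name -> callable(query); `budgets_s` maps name -> deadline.
    In a real system the cutoffs would be set dynamically from recent latency
    measurements; here they are fixed illustrative numbers."""
    futs = {name: _pool.submit(fn, query) for name, fn in subsystems.items()}
    parts, tainted = {}, False
    for name, fut in futs.items():
        try:
            parts[name] = fut.result(timeout=budgets_s[name])
        except futures.TimeoutError:
            tainted = True            # e.g. skipped the slow spelling-correction service
    return {"parts": parts, "tainted": tainted}

result = assemble_page(
    "restaurants near me",
    subsystems={"web": lambda q: ["ten blue links"], "spell": lambda q: q},
    budgets_s={"web": 0.200, "spell": 0.020},
)
if not result["tainted"]:
    pass   # only untainted results should be written into the result cache
```
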
  81. Hardware Trends
      • Some good:
        – lower-latency networks make things like backup-request cancellations work better
      • Some not so good:
        – plethora of CPU and device sleep modes saves power, but adds latency variability
        – higher numbers of “wimpy” cores => higher fanout => more variability
      • Software techniques can reduce variability despite increasing variability in the underlying hardware
  82. Conclusions
      • Tolerating variability
        – important for large-scale online services
        – large fanout magnifies its importance
        – makes services more responsive
        – saves significant computing resources
      • Collection of techniques
        – general good engineering practices: prioritized server queues, careful management of background activities
        – cross-request adaptation: load balancing, micro-partitioning
        – within-request adaptation: backup requests, backup requests w/ cancellation, tainted results
  83. Thanks
      • Joint work with Luiz Barroso and many others at Google
      • Questions?
