Lessons learnt on a 2000-core cluster

Lessons learnt when testing our "embarrassingly parallel" software on a 2000-core cluster.


Running our software on a 2000-core cluster
Lessons learnt

Structure
For each problem:
Symptoms
Method of investigation
Cause
Action taken
Moral

Background
Pretty simple: distributing embarrassingly parallel computations on a cluster
Distribution fabric is RabbitMQ
Publish tasks to a queue
Pull results from a queue
Computational listeners on cluster nodes
Tasks are "fast" (~1 s CPU time) or "slow" (~15 min CPU time)
Tasks are split into parts (usually 160)
Parts also share the same data chunk – it's stored in memcached and the task input contains the "shared data id"
Requirements: 95% utilization for slow tasks, "as much as we can" for fast ones

RabbitMQ starts refusing connections to some clients when there are too many of them.

Investigation
Eventually it turned out that RabbitMQ supports max ~400 connections per process on Windows.

Solution
In RabbitMQ:
Establish a cluster of RabbitMQs
2 "eternal" connections per client, 512 connections per instance, 1600 clients → ~16 instances suffice
Instances start on the same IP, on consecutive ports (5672, 5673, ...)
In code:
Make both submitter and consumer scan ports until success

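A minimal sketch of the port-scanning connect, assuming the standard RabbitMQ .NET client (ConnectionFactory); the host name, class name and the number of ports tried are illustrative, not the project's actual code:

    using System;
    using RabbitMQ.Client;

    static class BrokerConnector
    {
        public static IConnection ConnectToAnyBroker(string host)
        {
            // The cluster instances listen on consecutive ports: 5672, 5673, ...
            for (int port = 5672; port < 5672 + 16; port++)
            {
                try
                {
                    var factory = new ConnectionFactory { HostName = host, Port = port };
                    return factory.CreateConnection();    // first instance that accepts us wins
                }
                catch (Exception)
                {
                    // this instance is full (or down) - try the next port
                }
            }
            throw new InvalidOperationException("No RabbitMQ instance accepted the connection.");
        }
    }
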
Moral
Capacity planning!
If there's a resource, plan how much of it you'll need and with what pattern of usage. Otherwise you'll exhaust it sooner or later.
Network bandwidth
Network latency
Connections
Threads
Memory
Whatever

RabbitMQConsumer uses a legacy component which can't run concurrent instances in the same directory.

Solution
Create a temporary directory.
Directory.SetCurrentDirectory() into it at startup.
Problem: the temp directories pile up.

Solution
At startup, clean up unused temp directories.
How do we know a directory is unused?
Create a lock file in the directory.
At startup, try removing lock files and dirs.
Problem
Races: several instances want to delete the same file.
All but one crash!
Several solutions with various kinds of races, "fixed" with try/ignore band-aids…
Just wrap the whole "clean-up" block in a try/ignore!
That's it.

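A sketch of that startup cleanup, under assumptions of my own: the directory names and location are hypothetical, and the convention is "hold your own lock file open for the process lifetime; any directory whose lock file can be deleted is abandoned":

    using System;
    using System.IO;

    // At startup: claim a fresh working directory and sweep abandoned ones.
    string root = Path.Combine(Path.GetTempPath(), "consumer-workdirs");   // hypothetical location
    string myDir = Path.Combine(root, Guid.NewGuid().ToString("N"));
    Directory.CreateDirectory(myDir);

    // Keep the lock file open for the whole process lifetime - that marks the dir as "in use".
    var myLock = new FileStream(Path.Combine(myDir, ".lock"),
                                FileMode.CreateNew, FileAccess.Write, FileShare.None);
    Directory.SetCurrentDirectory(myDir);

    // Best-effort sweep: anything whose lock file we can delete is abandoned.
    // The whole block is wrapped in try/ignore - cleanup is non-critical.
    try
    {
        foreach (var dir in Directory.GetDirectories(root))
        {
            if (dir == myDir) continue;
            try
            {
                File.Delete(Path.Combine(dir, ".lock"));  // throws while the owner still holds it open
                Directory.Delete(dir, recursive: true);
            }
            catch { /* owned by a live instance, or another instance beat us to it - ignore */ }
        }
    }
    catch { /* never let cleanup kill the consumer */ }
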
Moral
If it's non-critical, wrap the whole thing in try/ignore
Even if you think it will never fail
It will
(maybe in the future, after someone changes the code…)
Thinking "it won't" is unneeded complexity
Low-probability errors will happen
Each chance is small, but there are many chances
0.001 probability of error, 2000 occasions ≈ 87% chance that at least one failure occurs

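For reference, the arithmetic behind that last figure: P(at least one failure) = 1 − (1 − 0.001)^2000 = 1 − 0.999^2000 ≈ 1 − e^(−2) ≈ 0.865, i.e. roughly the 87% quoted.
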
Then the thing started working.
Kind of.
We asked for 1000 tasks "in flight", and got only about 125.

The Gateway is heavily CPU-loaded
(perhaps that's the bottleneck?)

Solution
Eliminate data compression
It was unneeded – 160 compressions of <1 KB of data per task (1 per subtask)!
Eliminate unneeded deserialization
Eliminate Guid.NewGuid() per subtask
It's not nearly as cheap as one might think
Especially if there are 160 of them per task
Turn on server GC

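For reference, "server GC" is switched on through the process's app.config, via the standard .NET <gcServer> element (a minimal sketch of the relevant fragment):

    <configuration>
      <runtime>
        <!-- use the server garbage collector instead of the default workstation GC -->
        <gcServer enabled="true" />
      </runtime>
    </configuration>
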
Solution (ctd.)
There was support for our own throttling and round-robining in the code
We didn't actually need it! (needed before, but not now)
Eliminated both
Result
Oops, RabbitMQ crashed!

Cause
3 queues per client
Remember "capacity planning"?
A RabbitMQ queue is an exhaustible resource
We didn't even remove unneeded queues
Long to explain, but we didn't actually need them in this scenario
RabbitMQ is not OK with several thousand queues
rabbitmqctl list_queues took an eternity

Solution
Have 2 queues per JOB and no cancellation queues
Just purge the request queue
OK unless several jobs share their request queue
We don't use this option.

And then it worked
Compute nodes at 100% CPU
Cluster saturated quickly and stayed saturated
Cluster fully loaded

Moral
Eliminate bloat – complexity kills
Even if "We've got feature X" sounds cool
Round-robining and throttling
Cancellation queues
Compression

Moral
Rethink what is CPU-cheap
O(1) is not enough
You're going to compete with 2000 cores
You're going to do this "cheap" stuff a zillion times

Moral
Rethink what is CPU-cheap
1 task = avg. 600 ms of computation for 2000 cores
Split into 160 parts
160 Guid.NewGuid() calls
160 gzip compressions of 1 KB of data
160 publishes to RabbitMQ
160*N serializations/deserializations
It's not cheap at all, compared to 600 ms
Especially compared to 30 ms, if you're aiming at 95% scalability

And then we tried short tasks
~1000x shorter

Oh well.
The tasks are really short, after all…

And we started getting a whole lot of memcached misses.

Investigation
Have we put so much into memcached that it evicted the tasks?
Log:
Key XXX not found
> echo "GET XXX" | telnet 123.45.76.89 11211
YYYYYYYY
Nope, it's still there.

Solution
Retry until OK (with exponential back-off)

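A sketch of that retry policy (the actual memcached fetch is passed in as a delegate; attempt count and initial delay are illustrative):

    using System;
    using System.Threading;

    static class RetryingFetch
    {
        public static byte[] GetWithRetry(Func<byte[]> fetch, int maxAttempts = 8)
        {
            var delay = TimeSpan.FromMilliseconds(50);                  // initial back-off (illustrative)
            for (int attempt = 1; ; attempt++)
            {
                byte[] data = fetch();
                if (data != null) return data;                          // hit
                if (attempt == maxAttempts)
                    throw new TimeoutException("Key still missing after " + maxAttempts + " attempts.");
                Thread.Sleep(delay);
                delay = TimeSpan.FromMilliseconds(delay.TotalMilliseconds * 2);  // exponential back-off
            }
        }
    }
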
Desperately retrying
Blue: fetching from memcached
Orange: computing
Oh.

Investigation
Memcached can't be down for that long, right?
Right.
Look into the code…
We cached the MemcachedClient objects
to avoid creating them for each request
because that is oh so slow

Investigation
There was a bug in the memcached client library (Enyim)
It took too long to discover that a server was back online
Our "retries" were not actually retrying
They were stumbling on Enyim's cached "server is down".

Solution
Do not cache the MemcachedClient objects
Result:
That helped. No more misses.

Moral
Eliminate bloat – complexity kills
I think we've already talked about this one.
Smart code is bad because you don't know what it's actually doing.

Then we saw that memcached gets take 200 ms each

Investigation
Memcached can't be that slow, right?
Right.
Then who is slow?
Who is between us and memcached?
Right, Enyim.
Creating those non-cached client objects.

Solution
Write our own fat-free "memcached client"
Just a dozen lines of code
The protocol is very simple.
Nothing stands between us and memcached (well, except for the OS TCP stack)
Result:
That helped. Now gets take ~2 ms.

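A sketch in the same spirit (not the project's actual code): the memcached text protocol really is this simple - send "get <key>", read back a "VALUE <key> <flags> <bytes>" header, the data block, and a terminating "END". The one-connection-per-call style is only to keep the sketch short.

    using System;
    using System.IO;
    using System.Net.Sockets;
    using System.Text;

    static class TinyMemcachedClient
    {
        // Returns the value bytes, or null on a miss.
        public static byte[] Get(string host, int port, string key)
        {
            using (var tcp = new TcpClient(host, port))
            using (var stream = tcp.GetStream())
            {
                byte[] request = Encoding.ASCII.GetBytes("get " + key + "\r\n");
                stream.Write(request, 0, request.Length);

                var reader = new BinaryReader(stream);
                string header = ReadLine(reader);            // "VALUE <key> <flags> <bytes>" or "END"
                if (header.StartsWith("END")) return null;   // miss

                int length = int.Parse(header.Split(' ')[3]);
                byte[] value = reader.ReadBytes(length);
                ReadLine(reader);                             // consume the "\r\n" after the data block
                ReadLine(reader);                             // consume the trailing "END"
                return value;
            }
        }

        private static string ReadLine(BinaryReader reader)
        {
            var line = new StringBuilder();
            while (true)
            {
                char c = (char)reader.ReadByte();
                if (c == '\n') break;
                if (c != '\r') line.Append(c);
            }
            return line.ToString();
        }
    }
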
Moral
Eliminate bloat – complexity kills
Should I say more?

And this is how well we scaled these short tasks.
About five 1-second tasks per second.
Terrific for a 2000-core cluster.

Investigation
These stripes are almost parallel!
Because tasks are round-robined to nodes in the same order.
And the round-robiner isn't keeping up.
Who's that?
RabbitMQ.
We must have hit RabbitMQ's limits
ORLY?
We push 160 messages per task, for a task that takes 0.25 ms on 2000 cores.
Capacity planning?

Investigation
And we also have 16 RabbitMQs.
And there's just 1 queue.
Every queue lives on 1 node.
15/16 = 93.75% of pushes and pulls are indirect.

Solution
Don't split these short tasks into parts.
Result:
That helped.
~76 tasks/s submitted to RabbitMQ.

And then this
"An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full." (during connection to the Gateway)
Spurious program crashes in Enyim code under load

Solution
Update Enyim to the latest version.
Result:
Didn't help.

Solution
Get rid of Enyim completely.
(also implement put() – another 10 LOC)
Result:
That helped
No more crashes
Post factum:
Actually, I had forgotten to destroy the Enyim client objects

Moral
Third-party libraries can fail
They're written by humans
Maybe by humans who didn't test them under these conditions (i.e. a large number of connections occupied by the rest of the program)
YOU can fail (for example, misuse a library)
You're a human
Do not fear replacing a library with a simple piece of code
Of course, only if it is simple (for memcached it luckily was)
"Why did they write a complex library?" Because it does more, but maybe not what you need.

But we're still stuck at 76 tasks/s.

Solution
A thorough and blind CPU hunt in Client and Gateway.
Didn't want to launch a profiler on the cluster nodes because RDP was laggy and I was lazy
(Most probably this was a mistake)

Solution
Fix #1
Special-case optimization for TO-0 tasks: unneeded deserialization and splitting in Gateway (don't split them at all)
Result
Gateway CPU load drops 2x
Scalability doesn't improve

Solution
Fix #2
Eliminate task GUID generation in Client
Parallelize submission of requests
To spread WCF serialization CPU overhead over cores
Turn on server GC
Result
Now it takes 14 s instead of 20 s to push 1900 tasks to the Gateway (130/s). Still not quite there.

Look at the cluster load again
Where do these pauses come from?
They appear consistently on every run.

Where do these pauses come from?
What can pause a .NET application?
The garbage collector
The OS (swapping in/out)
What's common between these runs?
Roughly the same number of tasks in memory when the pauses hit

Where did the memory go?
The node with the Client had 98-99% of physical memory occupied.
By whom?
SQL Server: >4 GB
MS HPC Server: another few GB
No wonder.

Solution
Turn off HPC Server on this node.
Result:
The pauses got much milder

Still don't know what this is.
About 170 tasks/s. Only using 1248 cores. Why? We don't know yet.

Moral
Measure your application. Eliminate interference from others. The interference can be drastic.
Do not place a latency-sensitive component together with anything heavy (throughput-sensitive) like SQL Server.

But scalability didn't improve much.

How do we understand why it's so bad?
Eliminate interference.

What interference is there?
"Normalizing" tasks
Deserialize
Extract data to memcached
Serialize
Let's remove it (prepare the tasks, then shoot them like a machine gun).
Result: almost the same – 172 tasks/s
(Unrealistic, but easier for further investigation)

So how long does it take to submit a task?
(now that it's the only thing we're doing)
Client: "Oh, quite a lot!"
Gateway: "Not much."
1 track = 1 thread. Before BeginExecute: start orange bar; after BeginExecute: end bar.

Duration of these bars
Client:
"Usually and consistently about 50 ms."
Gateway:
"Usually a couple of ms."

Very suspicious
What are those 50 ms? Too round a number.
Perhaps some protocol is enforcing it?
What's our protocol?

What's our protocol?
TCP, right?
var client = new CloudGatewayClient("BasicHttpBinding_ICloudGateway");
Oops.

Solution
Change to NetTcpBinding
Don't remember which is which :( Still looks strange, but much better.

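A sketch of what the fix amounts to on the client side. ICloudGateway is the project's own service contract; the net.tcp address here is made up, and the real project most likely just switched the named endpoint in its WCF config from basicHttpBinding to netTcpBinding.

    using System.ServiceModel;

    // Programmatic equivalent of pointing the client at a net.tcp endpoint
    // instead of the HTTP one.
    var binding = new NetTcpBinding();
    var address = new EndpointAddress("net.tcp://gateway-host:9000/CloudGateway");
    var factory = new ChannelFactory<ICloudGateway>(binding, address);
    ICloudGateway client = factory.CreateChannel();
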
About 340 tasks/s.
Only using 1083 of >1800 cores!
Why? We don't know yet.

Moral
Double-check your configuration.
Measure the "same" thing in several ways.
Time to submit a task, from the POV of the client and of the gateway

Here comes the dessert.
"Tools matter"
We've already shown how pictures (and drawing tools) matter.
We have a logger. "Greg" = "Global Registrator".
Most of the pictures wouldn't be possible without it.
Distributed (client/server)
Accounts for machine clock offset
Output is sorted on a "global time axis"
Lots of smart "scalability" tricks inside

Tools matter
And it didn't work quite well, for quite a long time.
Here's how it failed:
Ate 1-2 GB of RAM
Output was not sorted
Logged events with a 4-5 minute lag

Tools matter
Here's how its failures mattered:
We had to wait several minutes to gather all the events from a run.
Sometimes not all of them were even gathered.
After the problems were fixed, the "experiment roundtrip" (change, run, collect data, analyze) got at least 2x-3x faster.

Tools matter
Too bad it was on the last day of cluster availability.

Why was it so buggy?
The problem isn't as easy as it seemed.
Lots of clients (~2000)
Lots of messages
1 RPC request per message = unacceptable
Don't log a message until the clock is synced with the client machine
Resync the clock periodically
Log messages in order of global time, not order of arrival
Anyone might (and does) fail or come back online at any moment
Must not crash
Must not overflow RAM
Must be fast

How does it work?
The client buffers messages and sends them to the server in batches (client initiates).
Messages are marked with the client's local timestamp.
The server buffers messages from each client.
Periodically the client and server calibrate clocks (server initiates). Once a client machine is calibrated, its messages go to the global buffer with transformed timestamps.
Messages stay in the global buffer for 10 s ("if a message has been the earliest for 10 s, it will remain the earliest")
Global buffer(windowSize):
Add(time, event)
PopEarliest() : (time, event)

So, the tricks were:
Limit the global buffer (drop messages if it's full)
"Dropping message"… "Dropped 10000, 20000… messages"… "Accepting again after dropping N"
Limit the send buffer on the client
Same
Use compression for batches
(actually unused)
Ignore (but log) errors like failed calibration, failed send, failed receive, failed connect, etc.
Retry after a while
Send records to the server in bounded batches
If I've got 1 million records to say, I shouldn't keep the connection busy for a long time (the number of concurrent connections is a resource!). Cut into batches of 10000.
Prefer polling to blocking because it's simpler

So, the tricks were:
Prefer a "negative feedback" style
Wake up, see what's wrong, fix it
Not: "react to every event while preserving invariants" – much harder, sometimes impossible
Network performance tricks:
TCP NO_DELAY whenever possible
Warm up the connection before calibrating
Calibrate N times, averaging until a confidence interval is reached
(precise calibration is actually theoretically impossible; it only works if network latencies are symmetric, which they aren't…)

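For reference, TCP NO_DELAY in .NET is a one-line socket option, disabling Nagle's algorithm so small batches go out immediately (host and port here are illustrative):

    using System.Net.Sockets;

    var tcp = new TcpClient { NoDelay = true };   // send small packets immediately
    tcp.Connect("greg-server", 7777);
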
And the bugs were:
The client called the server even if it had nothing to say.
Impact: *lots* of unneeded connections.
Fix: check, poll.

And the bugs were:
The "pending records" per-client buffer was unbounded.
Impact: the server ate memory if it couldn't sync the clock.
Reason: code duplication. Should have abstracted away a "bounded buffer".
Fix: bound it.

And the bugs were:
If calibration with a client failed at the first attempt, it was never retried.
Impact: well… Especially given the previous bug.
Reason: try{loop}/ignore instead of loop{try/ignore}
Meta reason: too complex code, mixed levels of abstraction
Mixed what's being "tried" with how it's being managed (how failures are handled)
Fix: change to loop{try/ignore}.
Meta fix: go through all the code, classify methods into "spaghetti" and "flat logic", and extract the logic from the spaghetti.

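A sketch of the two shapes side by side (calibrateOnce and the interval are hypothetical stand-ins for Greg's real calibration step):

    using System;
    using System.Threading;

    static class CalibrationLoops
    {
        static readonly TimeSpan Interval = TimeSpan.FromSeconds(30);   // illustrative period

        // Buggy shape: try{loop}. The first failure escapes the loop, so calibration never retries.
        static void Buggy(Action calibrateOnce)
        {
            try
            {
                while (true) { calibrateOnce(); Thread.Sleep(Interval); }
            }
            catch { /* ignored - but the loop is already dead */ }
        }

        // Fixed shape: loop{try}. Each failure is ignored in its own iteration; the loop keeps going.
        static void Fixed(Action calibrateOnce)
        {
            while (true)
            {
                try { calibrateOnce(); }
                catch { /* log and retry on the next round */ }
                Thread.Sleep(Interval);
            }
        }
    }
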
And the bugs were:
No calibration with a machine in the scenario "start client A, start client B, kill client A".
Impact: very bad.
Reason: if a client couldn't establish a calibration TCP listener, it wouldn't try again ("someone else is listening, not my job"). Then that guy dies – and whose job is it now?
Meta reason: one-time global initialization for a globally periodic process (init; loop{action}). Global conditions change, and initialization is needed again.
Fix: transform to loop{init; action} – periodically establish the listener (ignore failure).

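A sketch of the loop{init; action} shape: re-attempt the "global" initialization on every round instead of only once. The port, retry period and handler are illustrative, not Greg's actual code.

    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Threading;

    static class CalibrationListener
    {
        public static void Run()
        {
            while (true)
            {
                try
                {
                    // init: try to become this machine's calibration listener
                    var listener = new TcpListener(IPAddress.Any, 7778);
                    listener.Start();
                    try
                    {
                        // action: serve calibration requests until something breaks
                        while (true)
                            using (var peer = listener.AcceptTcpClient())
                                HandleCalibration(peer);
                    }
                    finally { listener.Stop(); }
                }
                catch
                {
                    // someone else is listening, or we failed - wait and retry,
                    // so the role is re-acquired if the current owner dies
                }
                Thread.Sleep(TimeSpan.FromSeconds(10));
            }
        }

        private static void HandleCalibration(TcpClient peer)
        {
            // exchange timestamps with the Greg server (omitted)
        }
    }
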
And the bugs were:
Events were not coming out in order.
Impact: not critical by itself, but it casts doubt on the correctness of everything. If this doesn't work, how can we be sure that we even get all the messages? All in all, very bad.
Reason: ???
And they were also coming out with a huge lag.
Impact: dramatic (as already said).

The case of the lagging events
There were many places where they could lag.
That's already very bad by itself…
On the client? (repeatedly failing to connect to the server)
On the server? (repeatedly failing to read from the client)
In the per-client buffer? (failing to calibrate / to notice that calibration is done)
In the global buffer? (failing to notice that an event has "expired" its 10 s)

The case of the lagging events
Meta fix:
More internal logging
Didn't help.
This logging was invisible because it was done with Trace.WriteLine and viewed with DbgView, which doesn't work between sessions
My fault – I didn't cope with this.
It only failed under large load from many machines (the worst kind of error…)
But it could have helped.
Log/assert everything
If things were fine where you expect them to be, there'd be no bugs. But there are.

The case of the lagging events
Investigation by sequential elimination of reasons.
The most suspicious thing was the "time-buffered queue".
A complex piece of mud.
"Kind of" a priority queue that tracks times and sleeps/blocks on "pop"
Looked right and passed tests, but felt uncomfortable
So we rewrote it.

The case of the lagging events
Rewrote it.
Polling instead of blocking: "What's the earliest event? Has it been here for 10 s yet?"
A classic priority queue "from the book"
Peek the minimum, check expiry → pop or not.
That's it.
Now the queue definitely worked correctly.
But events still lagged.

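A sketch of that rewritten shape: a plain priority structure ordered by global timestamp, polled rather than blocked on. Names and the exact expiry rule are illustrative, not Greg's actual code; synchronization is omitted for brevity.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class TimeBufferedQueue<TEvent>
    {
        private struct Entry { public DateTime AddedAt; public TEvent Event; }

        private readonly TimeSpan window;   // e.g. 10 s: how long an item must sit before it may be popped
        private readonly SortedDictionary<DateTime, Queue<Entry>> byGlobalTime =
            new SortedDictionary<DateTime, Queue<Entry>>();

        public TimeBufferedQueue(TimeSpan window) { this.window = window; }

        public void Add(DateTime globalTime, TEvent evt)
        {
            Queue<Entry> bucket;
            if (!byGlobalTime.TryGetValue(globalTime, out bucket))
                byGlobalTime[globalTime] = bucket = new Queue<Entry>();
            bucket.Enqueue(new Entry { AddedAt = DateTime.UtcNow, Event = evt });
        }

        // Poll: "What's the earliest event? Has it been here for the full window yet?"
        public bool TryPopExpired(out TEvent evt)
        {
            evt = default(TEvent);
            if (byGlobalTime.Count == 0) return false;
            var earliest = byGlobalTime.First();                     // minimum global timestamp
            if (DateTime.UtcNow - earliest.Value.Peek().AddedAt < window)
                return false;                                        // earliest item hasn't aged enough yet
            evt = earliest.Value.Dequeue().Event;
            if (earliest.Value.Count == 0) byGlobalTime.Remove(earliest.Key);
            return true;
        }
    }
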
The case of the lagging events
What remained? Only a walk through the code.

The case of the lagging events
A while later…

The case of the lagging events
A client has 3 associated threads:
(1 per batch of records) A thread that reads them into the per-client buffer.
(1 per client) A thread that pulls from the per-client buffer and writes calibrated events to the global buffer (after calibration is done).
(1 per machine) A calibration thread.

The case of the lagging events
A client has 3 associated threads.
And they were created in the ThreadPool.
And the ThreadPool creates no more than 2 new threads per second.

The case of the lagging events
So we have 2000 clients on 250 machines.
A couple of thousand threads.
Not a big deal, the OS can handle more. And they're all doing I/O. That's what an OS is for.
Created at a rate of 2 per second.
4-5 minutes pass before the calibration thread for the last machine is created in the pool!

The case of the lagging events
Fix: start a new thread without the ThreadPool.
And suddenly everything worked.

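A minimal sketch of the fix, assuming a hypothetical CalibrationLoop worker: give each long-lived worker its own dedicated thread instead of a ThreadPool work item, so startup is not throttled by the pool's thread-injection rate.

    using System;
    using System.Threading;

    var calibrationThread = new Thread(CalibrationLoop) { IsBackground = true, Name = "greg-calibration" };
    calibrationThread.Start();

    void CalibrationLoop()
    {
        // ... the long-running per-machine calibration work ...
    }
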
The case of the lagging events
Why did it take so long to find?
Unreproducible on fewer than a dozen machines
Bad internal debugging tools (Trace.WriteLine)
And a lack of understanding of their importance
Too complex an architecture
Too many places can fail; you need to debug all of them at once

The case of the lagging events
Moral:
Functional abstractions leak in non-functional ways.
The thread pool's functional abstraction = "do something soon"
Know exactly how they leak, or don't use them.
"Soon, but no sooner than 2 per second"

Greg again
Rewrote it nearly from scratch
Calibration is now also initiated by the client
The server only accepts client connections and moves messages around the queues
Pattern "Move responsibility to the client" – the server now does a lot less calibration-related bookkeeping
Pattern "Eliminate dependency cycles / feedback loops"
Now the server doesn't care at all about a client's failure
Pattern "Do one thing and do it well"
Just serve requests.
Don't manage workflow.
It's now easier for the server to throttle the number of concurrent requests of any kind

The good parts
OK, lots of things were broken. Which weren't?
Asynchronous processing
We'd be screwed if not for the recent "fully asynchronous" rewrite
"Concurrent synchronous calls" are a very scarce resource
Reliance on a fault-tolerant abstraction: messaging
We'd be screwed if RabbitMQ didn't handle the failures for us
Good measurement tools
We'd be blindfolded without the global clock-synced logging and the drawing tools
Good deployment scripts
We'd be in configuration hell if we did that manually
Reasonably low coupling
We'd have much longer experiment roundtrips if we ran tests on "the real thing" (Huge Legacy Program + HPC Server + everything)
It was not hard to do independent performance optimizations of all the component layers involved (and there were not too many layers)

Morals

Morals
Tools matter
We would have been helpless without the graphs
We would have done much more if the logger had been fixed earlier…
Capacity planning
How much of X will you need for 2000 cores?
Complexity kills
Problems are everywhere, and if they're also complex, you can't fix them
Rethink "CPU-cheap"
Is it cheap compared to what 2000 cores can do?
Abstractions leak
Do not rely on a functional abstraction when you have non-functional requirements
Everything fails
Especially you
Planning to have failures is more robust than planning exactly how to fight them
There are no "almost improbable errors": probabilities accumulate
Explicitly ignore failures in non-critical code
Code that does this is larger, but simpler to understand than code that doesn't
Think about where to put responsibility for what
The difference in ease of implementation may be dramatic

That's all.
