7 lessons learned building high availability / performance systems - CM2015


Slides for my talk "Never gonna give you up, never gonna let you down - 7 lessons learned building high availability, high performance systems".
Presented at Codemotion Berlin 2015

Published in: Software

  1. 1. @EdMcBane 7 lessons learned building HP/HA systems Never gonna give you up Never gonna let you down
  2. 2. @EdMcBane Francesco Degrassi Enthusiastic yet pragmatic Lean Software Developer. Uppish and cynical nihilist from time to time.
  3. 3. @EdMcBane Lean Software Development Continuous Delivery - High availability - Scale-up Security sensitive & high uncertainty domains
  4. 4. @EdMcBane The challenge ● Major European client ● Innovative service for the consumer market ● Non-trivial user base (400K+ users) ● High request rate ● Low latency requirement (well below network round-trip time, << RTT)
  5. 5. @EdMcBane What we built
  6. 6. @EdMcBane What did we learn? ● Make your assumptions explicit and keep testing them ● Do not eat yellow snow
  7. 7. @EdMcBane #1 Make your assumptions explicit and keep challenging them
  8. 8. @EdMcBane Issues ● failure to properly estimate ● failure to reassess performance goals ● losing track of assumptions and implications
  9. 9. @EdMcBane #2 Performance & Availability are not extra features
  10. 10. @EdMcBane
  11. 11. @EdMcBane Challenges ● Support for required failover modes ● Support for required scale-out/scale-up modes ● Operability in general ○ and monitoring in particular ● most important of all, avoiding complexity
  12. 12. @EdMcBane #3 Keep things simple and do not reinvent the wheel
  13. 13. @EdMcBane Everything should be made as simple as possible, but not simpler — Albert Einstein
  14. 14. @EdMcBane
  15. 15. @EdMcBane LESS(1) General Commands Manual LESS(1) NAME less - opposite of more SYNOPSIS less -? less --help less -V less --version less [-[+]aABcCdeEfFgGiIJKLmMnNqQrRsSuUVwWX~] [-b space] [-h lines] [-j line] [-k keyfile] [-{oO} logfile] [-p pattern] [-P prompt] [-t tag] [-T tagsfile] [-x tab,...] [-y lines] [-[z] lines] [-# shift] [+[+]cmd] [--] [filename]... (See the OPTIONS section for alternate option syntax with long option names.) DESCRIPTION Less is similar to more(1), but has many more features. Less does not have to read the entire input file before starting, so with large input files it starts up faster than text editors like vi(1). Less uses termcap (or terminfo on some systems), so it can run on Manual page less(1) line 1 (press h for help or q to quit)
  16. 16. @EdMcBane In our case... ● Everything worked well in the single-core scenario
  17. 17. @EdMcBane SO_REUSEPORT For TCP, SO_REUSEPORT allows multiple listener sockets to be bound to the same port. Received packets are distributed to the sockets bound to that port using a 4-tuple hash. With SO_REUSEPORT the distribution is uniform.
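
A minimal sketch of the SO_REUSEPORT idea, assuming a Linux host (kernel 3.9 or later) and Python's standard `socket` module. The helper names `open_reuseport_listener` and `open_listener_group` are hypothetical and are not taken from the talk; in a real deployment each listener would typically live in its own worker process or thread.

```python
import socket


def open_reuseport_listener(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Return a listening TCP socket with SO_REUSEPORT set before bind()."""
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT is not available on this platform")
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # The option must be set on every socket that shares the port,
        # and it must be set before bind().
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(128)
    except OSError:
        sock.close()
        raise
    return sock


def open_listener_group(count: int, host: str = "127.0.0.1") -> list:
    """Open `count` listeners that all share one kernel-chosen port."""
    first = open_reuseport_listener(0, host)  # port 0: let the kernel pick
    port = first.getsockname()[1]
    rest = [open_reuseport_listener(port, host) for _ in range(count - 1)]
    return [first] + rest
```

With one listener per core, the kernel spreads incoming connections across the sockets, so no single accept loop has to hand work to the others.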
  18. 18. @EdMcBane Suggestions ● Prefer open source solutions ○ when things break, you want to be able to fix them ● Be skeptical ○ pick any software, chances are it is crap ○ +1 for open source, you can “peek under the hood” ● Do not use tools you do not fully understand ○ or as I’d rather say...
  19. 19. @EdMcBane #4 Be wary of cargo-cult software engineering
  20. 20. @EdMcBane
  21. 21. @EdMcBane TCP_TW_RECYCLE Enable fast recycling of TIME-WAIT sockets. Default value is 0. It should not be changed without advice/request of technical experts. Linux will drop any segment from the remote host whose timestamp is not strictly bigger than the latest recorded timestamp. TCP_TW_RECYCLE + NAT = MADNESS
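
Why this breaks behind NAT can be shown with a toy model. The sketch below is a deliberately simplified illustration, not kernel code: the class `PerHostTimestampCheck` and the sample addresses and timestamp values are hypothetical. It only encodes the rule quoted on the slide, that a segment is dropped unless its timestamp is strictly bigger than the last one recorded for the same remote host.

```python
class PerHostTimestampCheck:
    """Toy model of a per-remote-host TCP timestamp check."""

    def __init__(self) -> None:
        self._last_seen = {}  # remote IP -> latest accepted timestamp

    def accept(self, source_ip: str, tsval: int) -> bool:
        """Return True if the segment passes, False if it would be dropped."""
        last = self._last_seen.get(source_ip)
        if last is not None and tsval <= last:
            return False  # not strictly bigger than the recorded value
        self._last_seen[source_ip] = tsval
        return True


# Two clients behind one NAT gateway share a public IP, but each machine
# has its own, unrelated timestamp clock.
NAT_PUBLIC_IP = "203.0.113.7"
check = PerHostTimestampCheck()
client_a_ok = check.accept(NAT_PUBLIC_IP, tsval=5000)  # client A connects
client_b_ok = check.accept(NAT_PUBLIC_IP, tsval=1000)  # client B looks "old"
```

In this model client A is accepted and client B is rejected, even though both are legitimate: the server sees a single remote host whose clock appears to jump backwards.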
  22. 22. @EdMcBane
  23. 23. @EdMcBane #5 High Availability is much more than just redundancy
  24. 24. @EdMcBane Impact Frequency Time to recover
  25. 25. @EdMcBane Failure impact ● Redundant hardware ● Redundant software components But there’s more! ● Graceful degradation ● Incremental rollouts
  26. 26. @EdMcBane Failure frequency But then also: ● proven technology ● high quality hardware ● automation (to avoid errors)
  27. 27. @EdMcBane Time to recover ● Effective monitoring ○ realtime ○ reliable ○ understandable ○ thorough ○ meaningful ○ actionable ● Rollback / rollforward ● Automation (for speed)
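
One way to see how impact, frequency and time to recover combine is a back-of-the-envelope calculation. The model below is an assumption made for illustration and does not come from the talk: it treats user-perceived downtime as the product of the three factors, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def expected_user_downtime_hours(
    failures_per_year: float,
    hours_to_recover: float,
    fraction_of_users_affected: float,
) -> float:
    """Frequency x time to recover x impact, in user-weighted hours per year."""
    return failures_per_year * hours_to_recover * fraction_of_users_affected


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the service was available."""
    return 1.0 - downtime_hours / period_hours


# Redundancy alone mostly lowers frequency. Graceful degradation lowers
# impact, and monitoring plus automation lower time to recover.
baseline = expected_user_downtime_hours(4, 2.0, 1.0)
degraded_gracefully = expected_user_downtime_hours(4, 2.0, 0.5)
faster_recovery = expected_user_downtime_hours(4, 0.5, 0.5)
```

Under this simplified model, halving any one factor halves the expected downtime, which is why all three dimensions deserve attention rather than redundancy alone.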
  28. 28. @EdMcBane Our response plan goes something like this... AaaaaAAaaaah
  29. 29. @EdMcBane ...but be prepared to improvise “Processes designed for ordinary times are not resilient in a crisis and need to be changed.” Dave Snowden
  30. 30. @EdMcBane Easier said than done “No, improvising is wonderful. But, the thing is that you cannot improvise unless you know exactly what you're doing.” Christopher Walken
  31. 31. @EdMcBane Improvisation requires ● In house expertise ● Lots and lots of experience ● Developers on call ● Practice (drills, e.g. chaos monkeys)
  32. 32. @EdMcBane Also from Walken... “At its best, life is completely unpredictable.” “Everybody has to be a little lucky, I think.” “I try not to worry about things I can't do anything about.”
  33. 33. @EdMcBane #6 Embrace diversity
  34. 34. @EdMcBane
  35. 35. @EdMcBane
  36. 36. @EdMcBane #7 Monitoring is essential … and we can do way better
  37. 37. @EdMcBane No one size fits all ● “Monitor everything”, like “100% test coverage”, is a nice slogan, nothing more. ● Each environment requires a slightly different solution ● Balance data availability, cost and the ability to keep the data actionable
  38. 38. @EdMcBane
  39. 39. @EdMcBane We are doing logging wrong ● Unstructured ● Inconsistent ● Poor defaults ● Complex, obscure components ● A huge waste of computing power
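
As one possible answer to the "unstructured" and "inconsistent" points, the sketch below emits each log event as a single JSON object using Python's standard `logging` module. The class name `JsonFormatter`, the `fields` key and the sample field names are assumptions chosen for illustration; the talk does not prescribe a format.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured context passed via extra={"fields": {...}} ends up as
        # an attribute on the record; merge it into the top-level object.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("api").info("request served", extra={"fields": {"request_id": "r-1", "latency_ms": 12}})` then produces a line that a log store can index by field, which makes it easier to cross-reference logs with metrics and alerts.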
  40. 40. @EdMcBane We need a complete overview ● Logs ● Metrics ● Alerts ● Together, coherent, cross-referenced ○ correlating different stores poses challenges
  41. 41. @EdMcBane “Human beings, who are almost unique in having the ability to learn from the experience of others, are also remarkable for their apparent disinclination to do so.” Douglas Adams
  42. 42. @EdMcBane Thanks! @EdMcBane