Planning to Fail #phpne13



Slides from my Planning to Fail talk given at PHP North East conference 2013. This is a slightly longer version of the same talk given at the PHP UK conference. The talk was on how you can build resilient systems by embracing failure.

Published in: Technology
  • I’m dave!
  • I work at Hailo. This presentation draws on my experiences building Hailo into one of the world’s leading taxi companies.
  • The title of my talk is “planning to fail”
  • First PHP conf; tempting fate. Thought about this title, but sounds more like monitoring.
  • This talk is more proactive than that. I’m talking about my experiences at Hailo building reliable web services by continually failing.
  • Why do we care about reliability?
  • Advantages
  • Advantages
  • Advantages
  • Advantages
  • Advantages
  • But first, let’s rewind to the beginning
  • The pure joy of inserting a php tag in the middle of an HTML table
  • My website still follows this pattern. I’d like to think my website is quite reliable.
  • My website is reliable, but simple. Doesn’t change very often.
  • Hailo is complex!
  • Hailo is growing.
  • Key quote: less machinery is quadratically better.
  • Hailo have a lot of machinery!
  • Enter the chaos monkey… If you want to be good at something, practice often!
  • How about the “reliable” VPS that runs my website?
  • But not resilient; my website would not cope well with the chaos monkey approach.
  • We have to choose our stack appropriately if we are going to go down the chaos monkey route.
  • Hailo didn’t start out this way; but the PHP component did
  • Splitting into an SOA. This makes it much easier to change bits of code, since each service does less, has fewer lines of code and changes less frequently. It also makes it easier to work in larger teams.
  • Advantages
  • Here’s one of our services… is this reliable?
  • But Hailo is going global
  • At Hailo we are splitting out the features of MySQL and using different technologies where appropriate
  • Don’t pick things that are broken by design
  • We remove services from the critical path using the lazy-init pattern
  • We want to define timeouts so that under failure conditions we don’t hang forever
  • Instrumenting operation times: mean, upper 90th, upper bound (highest observed value)
  • Let’s aim for 95th percentile as our timeout – but instrument when we do have timeouts so that we know what’s going on
  • Yay!
  • Boo
  • Boo
  • This was after we fixed the bug, but we had the timeouts configured badly.
  • Better: memcache failure has less impact now; some features might be degraded, but the minimum viable service now works
  • Runnable .md based system tests
  • Planning to Fail #phpne13

    1. 1. Planning to fail @davegardnerisme #phpne13
    2. 2. dave
    3. 3. the taxi app
    4. 4. Planning to fail
    5. 5. Planning for failure
    6. 6. Planning to fail
    7. 7. Why?
    8. 8. 99.9% (three nines). Downtime: 43.8 minutes per month, 8.76 hours per year
    9. 9. 99.99% (four nines). Downtime: 4.32 minutes per month, 52.56 minutes per year
    10. 10. 99.999% (five nines). Downtime: 25.9 seconds per month, 5.26 minutes per year
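
The downtime figures on slides 8 to 10 follow from multiplying the unavailable fraction by the length of the period. A minimal sketch of that arithmetic is below; the function and constant names are illustrative, and the period lengths are assumptions (a 365-day year and a 30-day month). Note that the 4.32 minutes and 25.9 seconds figures match a 30-day month, while 43.8 minutes matches a month taken as one twelfth of a year.

```php
<?php
// Assumed period lengths; the slides do not state which they use.
const SECONDS_PER_DAY = 86400;
const SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY;        // non-leap year
const SECONDS_PER_MONTH_30D = 30 * SECONDS_PER_DAY;    // 30-day month

/**
 * Downtime budget in seconds for a given availability percentage
 * over a period of $periodSeconds.
 */
function allowedDowntimeSeconds(float $availabilityPercent, float $periodSeconds): float
{
    return (1 - $availabilityPercent / 100) * $periodSeconds;
}

foreach ([99.9, 99.99, 99.999] as $availability) {
    printf(
        "%s%%: %.1f seconds per 30-day month, %.2f minutes per year\n",
        $availability,
        allowedDowntimeSeconds($availability, SECONDS_PER_MONTH_30D),
        allowedDowntimeSeconds($availability, SECONDS_PER_YEAR) / 60
    );
}
```
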
    11. 11. ?
    12. 12. YOU
    13. 13. The beginning
    14. 14. <?php
    15. 15. My website: single VPS running PHP + MySQL
    16. 16. No growth, low volume, simple functionality, one engineer (me!)
    17. 17. Large growth, high volume, complex functionality, lots of engineers
    18. 18. • Launched in London, November 2011 • Now in 5 cities in 3 countries (30%+ growth every month) • A Hailo hail is accepted around the world every 5 seconds
    19. 19. “.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
    20. 20. • SOA (10+ services) • AWS (3 regions, 9 AZs, lots of instances) • 10+ engineers building services, and you? (Hailo is hiring)
    21. 21. Our overall reliability is in danger
    22. 22. Embracing failure (a coping strategy)
    23. 23. VPS (running PHP+MySQL) reliable?
    24. 24. Reliable !== Resilient
    25. 25. Choosing a stack
    26. 26. “Hailo” (running PHP+MySQL) reliable?
    27. 27. Service Oriented Architecture: Service, Service, Service, Service; each service does one job well
    28. 28. • Fewer lines of code • Fewer responsibilities • Changes less frequently • Can swap entire implementation if needed
    29. 29. Service (running PHP+MySQL) reliable?
    30. 30. Service, MySQL: MySQL running on a different box
    31. 31. MySQL, Service, MySQL: MySQL running in Multi-Master mode
    32. 32. Going global
    33. 33. Separating concerns: MySQL split into CRUD, Locking, Search, Analytics, ID generation (also queuing…)
    34. 34. At Hailo we look for technologies that are: • Distributed: run on more than one machine • Homogeneous: all nodes look the same • Resilient: can cope with the loss of node(s) with no loss of data
    35. 35. “There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
    36. 36. • Highly performant, scalable and resilient data store • Underpins much of what we do at Hailo • Makes multi-DC easy!
    37. 37. ZooKeeper • Highly reliable distributed coordination • We implement locking and leadership election on top of ZK and use sparingly
    38. 38. • Distributed, RESTful search engine built on top of Apache Lucene • Replaced basic foo LIKE ‘%bar%’ queries (so much better)
    39. 39. NSQ • Realtime message processing system designed to handle billions of messages per day • Fault tolerant, highly available, with reliable message delivery guarantee
    40. 40. • Real-time incremental analytics platform, backed by Apache Cassandra • Powerful SQL-like interface • Scalable and highly available
    41. 41. Cruftflake • Distributed ID generation with no coordination required • Rock solid
    42. 42. • All these technologies have similar properties of distribution and resilience • They are designed to cope with failure • They are not broken by design
    43. 43. Lessons learned
    44. 44. Minimise the critical path
    45. 45. What is the minimum viable service?
    46. 46. Lazy-init instances; connect on use:
        class HailoMemcacheService
        {
            private $mc = null;

            public function __call($name, $args)
            {
                $mc = $this->getInstance();
                // do stuff
            }

            private function getInstance()
            {
                // Connect only when first needed ($s is the server list)
                if ($this->mc === null) {
                    $this->mc = new Memcached;
                    $this->mc->addServers($s);
                }
                return $this->mc;
            }
        }
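
The lazy-init idea on slide 46 can be tried without a memcache server. The sketch below is not Hailo's code: `LazyClient` and `FakeCache` are hypothetical names, `FakeCache` stands in for `Memcached`, and the factory closure is an assumption used to defer construction. It assumes PHP 7.4 or later.

```php
<?php
// Hypothetical wrapper: the real client is only built on first use,
// so constructing the wrapper never touches the network.
final class LazyClient
{
    /** @var callable builds (and connects) the real client */
    private $factory;
    private ?object $client = null;

    public function __construct(callable $factory)
    {
        $this->factory = $factory;
    }

    // Forward every method call to the lazily created client.
    public function __call(string $method, array $args)
    {
        return $this->getInstance()->$method(...$args);
    }

    private function getInstance(): object
    {
        if ($this->client === null) {
            $this->client = ($this->factory)();
        }
        return $this->client;
    }
}

// Stand-in for Memcached that counts how often a "connection" is made.
final class FakeCache
{
    public static int $connections = 0;
    private array $data = [];

    public function __construct()
    {
        self::$connections++;
    }

    public function set(string $key, $value): bool
    {
        $this->data[$key] = $value;
        return true;
    }

    public function get(string $key)
    {
        return $this->data[$key] ?? false;
    }
}

$cache = new LazyClient(fn () => new FakeCache());
// Nothing has been constructed yet; the first call below triggers it.
$cache->set('greeting', 'hello');
echo $cache->get('greeting'), "\n";
```

A request that never touches the cache never pays for (or fails on) the connection, which is what takes the dependency off the critical path.
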
    47. 47. Configure clients carefully
    48. 48. Make sure timeouts are configured:
        $this->mc = new Memcached;
        $this->mc->addServers($s);
        $this->mc->setOption(Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
        $this->mc->setOption(Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
        $this->mc->setOption(Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
        $this->mc->setOption(Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
    49. 49. Choose timeouts based on data (here?)
    50. 50. “Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
    51. 51. 95th percentile (here?)
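
Choosing a timeout from measured latencies, as slides 49 to 51 suggest, comes down to a percentile calculation. The following is a minimal sketch using the nearest-rank method; the `percentile` helper and the sample latencies are made up for illustration and are not Hailo's instrumentation.

```php
<?php
/**
 * Nearest-rank percentile: the smallest sample such that at least
 * $p percent of all samples are less than or equal to it.
 */
function percentile(array $samples, float $p): float
{
    if ($samples === [] || $p <= 0 || $p > 100) {
        throw new InvalidArgumentException('Need samples and 0 < p <= 100');
    }
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[$rank - 1];
}

// Made-up operation times in milliseconds, including one pathological value.
$latenciesMs = [12, 15, 11, 14, 13, 18, 16, 12, 14, 13,
                17, 15, 12, 19, 14, 13, 16, 15, 21, 900];

$mean = array_sum($latenciesMs) / count($latenciesMs);
printf("mean: %.1f ms\n", $mean);
printf("90th percentile: %.0f ms\n", percentile($latenciesMs, 90));
printf("95th percentile: %.0f ms\n", percentile($latenciesMs, 95));
printf("upper bound: %.0f ms\n", percentile($latenciesMs, 100));
```

With data like this, a timeout near the 95th percentile cuts off the rare very slow call without failing typical requests, while the mean and the upper bound are both distorted by the single outlier.
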
    52. 52. Test
    53. 53. • Kill memcache on box A, measure impact on application • Kill memcache on box B, measure impact on application. All fine.. we’ve got this covered!
    54. 54. FAIL
    55. 55. • Box A, running in AWS, locks up• Any parts of application that touch Memcache stop working
    56. 56. Things fail in exotic ways
    57. 57. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -j REJECT
        $ php test-memcache.php
        Working OK!
        Packets rejected and source notified by ICMP. Expect fast fails.
    58. 58. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -j DROP
        $ php test-memcache.php
        Working OK!
        Packets silently dropped. Expect long timeouts.
    59. 59. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -m state --state ESTABLISHED -j DROP
        $ php test-memcache.php
        Hangs! Uh oh.
    60. 60. • When AWS instances hang they appear to accept connections but drop packets • Bug!
    61. 61. Fix, rinse, repeat
    62. 62. Service, AMQP (port 5672), RabbitMQ HA cluster (RabbitMQ, RabbitMQ, RabbitMQ)
    63. 63. $ iptables -A INPUT -i eth0 -p tcp --dport 5672 -m state --state ESTABLISHED -j DROP
        $ php test-rabbitmq.php
        Fantastic! Block AMQP port, client times out
    64. 64. FAIL
    65. 65. “RabbitMQ clusters do not tolerate network partitions well.”
    66. 66. $ epmd -names
        epmd: up and running on port 4369 with data:
        name rabbit at port 60278
        Each node listens on a port assigned by EPMD
    67. 67. $ iptables -A INPUT -i eth0 -p tcp --dport 60278 -m state --state ESTABLISHED -j DROP
        $ php test-rabbitmq.php
        Hangs! Uh oh.
    68. 68. Mnesia(rabbit@dmzutilities03-global01-test): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@dmzutilities01-global01-test}
        application: rabbitmq_management exited: shutdown type: temporary
        RabbitMQ logs show partitioned network error; nodes shut down
    69. 69. while ($read < $n
            && !feof($this->sock->real_sock())
            && (false !== ($buf = fread($this->sock->real_sock(), $n - $read)))) {
            $read += strlen($buf);
            $res .= $buf;
        }
        PHP library didn’t have any time limit on reading a frame
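
The read loop on slide 69 blocks for as long as the peer stays silent. One way to bound it, sketched below, is to wait on the socket with `stream_select` and stop at an overall deadline. This is not the fix applied to the library in the talk; `readWithDeadline` is a hypothetical helper, and the demo uses a local socket pair (so it assumes a Unix-like system) in place of a RabbitMQ connection.

```php
<?php
/**
 * Read exactly $n bytes from $stream, or throw once $timeout seconds
 * have passed in total (not per read).
 */
function readWithDeadline($stream, int $n, float $timeout): string
{
    $deadline = microtime(true) + $timeout;
    $res = '';
    while (strlen($res) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException(
                sprintf('Timed out after reading %d of %d bytes', strlen($res), $n)
            );
        }
        // Wait until data arrives or the remaining time runs out.
        $read = [$stream];
        $write = null;
        $except = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        $ready = stream_select($read, $write, $except, $sec, $usec);
        if ($ready === false) {
            throw new RuntimeException('stream_select failed');
        }
        if ($ready === 0) {
            continue; // nothing readable yet; the loop re-checks the deadline
        }
        $buf = fread($stream, $n - strlen($res));
        if ($buf === false || ($buf === '' && feof($stream))) {
            throw new RuntimeException('Connection closed before the frame was complete');
        }
        $res .= $buf;
    }
    return $res;
}

// Demo: one end of a socket pair plays a peer that sends only part of a frame.
[$client, $peer] = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
fwrite($peer, 'abc');
echo readWithDeadline($client, 3, 0.5), "\n";

fwrite($peer, 'par'); // 3 of the 8 bytes the reader expects, then silence
try {
    readWithDeadline($client, 8, 0.2);
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```
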
    70. 70. Fix, rinse, repeat
    71. 71. It would be nice if we could automate this
    72. 72. Automate!
    73. 73. • Hailo run a dedicated automated test environment• Powered by bash, JMeter and Graphite• Continuous automated testing with failure simulations
    74. 74. Fix attempt 1: bad timeouts configured
    75. 75. Fix attempt 2: better timeouts
    76. 76. Simulate in system tests
    77. 77. Simulate failure. Assert monitoring endpoint picks this up. Assert features still work.
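
The three steps on slide 77 can be shown in miniature with in-memory stand-ins. Everything in this sketch is hypothetical (`FlakyCache`, `QuoteService`, the `health()` method and the placeholder pricing); Hailo's actual system tests ran against a dedicated environment, as slide 73 describes. It assumes PHP 7.4 or later.

```php
<?php
// Hypothetical cache dependency with a switch the test uses to simulate an outage.
final class FlakyCache
{
    public bool $down = false;
    private array $data = [];

    public function get(string $key): ?int
    {
        $this->failIfDown();
        return $this->data[$key] ?? null;
    }

    public function set(string $key, int $value): void
    {
        $this->failIfDown();
        $this->data[$key] = $value;
    }

    private function failIfDown(): void
    {
        if ($this->down) {
            throw new RuntimeException('cache unavailable');
        }
    }
}

// Hypothetical service: the cache is an optimisation, not a requirement.
final class QuoteService
{
    private FlakyCache $cache;
    private int $cacheErrors = 0;

    public function __construct(FlakyCache $cache)
    {
        $this->cache = $cache;
    }

    public function quote(string $route): int
    {
        try {
            $cached = $this->cache->get($route);
            if ($cached !== null) {
                return $cached;
            }
        } catch (RuntimeException $e) {
            $this->cacheErrors++; // degrade: fall through to computing the value
        }
        $price = strlen($route) * 100; // placeholder for the real work
        try {
            $this->cache->set($route, $price);
        } catch (RuntimeException $e) {
            $this->cacheErrors++;
        }
        return $price;
    }

    // Stand-in for a monitoring endpoint.
    public function health(): array
    {
        return [
            'status' => $this->cacheErrors === 0 ? 'ok' : 'degraded',
            'cache_errors' => $this->cacheErrors,
        ];
    }
}

$cache = new FlakyCache();
$service = new QuoteService($cache);
$baseline = $service->quote('A-B');      // healthy run

$cache->down = true;                     // 1. simulate failure
$duringOutage = $service->quote('C-D');  // 3. the feature should still work
$health = $service->health();            // 2. monitoring should notice
echo $health['status'], "\n";
```
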
    78. 78. In conclusion
    79. 79. “the best way to avoid failure is to fail constantly.”
    80. 80. You should test for failure. How does the software react? How does the PHP client react?
    81. 81. Automation makes continuous failure testing feasible
    82. 82. Systems that cope well with failure are easier to operate
    84. 84. Thanks. Software used at Hailo, plus a load of other things I’ve not mentioned.
    85. 85. Further reading: Hystrix: Latency and Fault Tolerance for Distributed Systems a network simulator on distributed systems for young bloods de-duplication (relevant to NSQ) generation in distributed systems