Planning to Fail #phpuk13

How to build resilient and reliable services by embracing failure.

Published in: Technology
  • I’m Dave!
  • I work at Hailo. This presentation draws on my experiences building Hailo into one of the world’s leading taxi companies.
  • The title of my talk is “planning to fail”
  • First PHP conf; tempting fate. I thought about this title instead, but it sounds more like monitoring.
  • This talk is more proactive than that. I’m talking about my experiences at Hailo building reliable web services by continually failing.
  • But first, let’s rewind to the beginning
  • The pure joy of inserting a php tag in the middle of an HTML table
  • My website still follows this pattern. I’d like to think my website is quite reliable.
  • My website is reliable, but simple. Doesn’t change very often.
  • Hailo is complex!
  • Hailo is growing.
  • Key quote: less machinery is quadratically better.
  • Hailo have a lot of machinery!
  • Enter the chaos monkey… If you want to be good at something, practice often!
  • How about the “reliable” VPS that runs my website?
  • But not resilient; my website would not cope well with the chaos monkey approach.
  • This doesn’t matter for my website – this is not a bus timetable app – this is not life and death stuff.
  • We have to choose our stack appropriately if we are going to go down the chaos monkey route.
  • Hailo didn’t start out this way; but the PHP component did
  • Splitting into an SOA. This makes it much easier to change bits of code, since each service does less, has fewer lines of code and changes less frequently. It also makes it easier to work in larger teams.
  • Advantages
  • Here’s one of our services… is this reliable?
  • But Hailo is going global
  • At Hailo we are splitting out the features of MySQL and using different technologies where appropriate
  • Don’t pick things that are broken by design
  • We remove services from the critical path using the lazy-init pattern
  • We want to define timeouts so that under failure conditions we don’t hang forever
  • Instrumenting operation times: mean, upper 90th percentile, upper bound (highest observed value)
  • Let’s aim for 95th percentile as our timeout – but instrument when we do have timeouts so that we know what’s going on
  • Yay!
  • Boo
  • This was after we fixed the bug, but we had the timeouts configured badly.
  • Better: a memcache failure has less impact now; some features might be degraded, but the minimum viable service still works
  • Runnable .md based system tests
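The notes on timeouts above (instrument operation times, then aim for the 95th percentile) can be sketched in a few lines of PHP. This is only an illustration under stated assumptions: the `percentile()` helper and the sample timings are hypothetical and not taken from the talk, and the nearest-rank method is one of several ways to compute a percentile.

```php
<?php
/**
 * Nearest-rank percentile of a list of observed durations.
 * Hypothetical helper for illustration; not from the talk.
 */
function percentile(array $samples, float $p): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('No samples recorded');
    }
    sort($samples);
    // Nearest rank: the smallest value with at least $p percent of samples at or below it.
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[max($rank, 1) - 1];
}

// Made-up memcache operation times in milliseconds (assumption, not real data).
$observedMs = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 18, 22, 40, 250];

// Candidate timeout: the 95th percentile of what was actually observed.
$timeoutMs = percentile($observedMs, 95);
echo "Suggested timeout: {$timeoutMs} ms\n";
```

With a value chosen this way, roughly one operation in twenty would be expected to time out under normal conditions, which is why the notes also call for instrumenting the timeouts themselves.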

    1. 1. Planning to fail @davegardnerisme #phpuk2013
    2. 2. dave
    3. 3. the taxi app
    4. 4. Planning to fail
    5. 5. Planning for failure
    6. 6. Planning to fail
    7. 7. The beginning
    8. 8. <?php
    9. 9. My website: single VPS running PHP + MySQL
    10. 10. No growth, low volume, simple functionality, one engineer (me!)
    11. 11. Large growth, high volume, complex functionality, lots of engineers
    12. 12. • Launched in London November 2011
            • Now in 5 cities in 3 countries (30%+ growth every month)
            • A Hailo hail is accepted around the world every 5 seconds
    13. 13. “.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
            http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
    14. 14. • SOA (10+ services)
            • AWS (3 regions, 9 AZs, lots of instances)
            • 10+ engineers building services and you? (Hailo is hiring)
    15. 15. Our overall reliability is in danger
    16. 16. Embracing failure (a coping strategy)
    17. 17. VPS (running PHP+MySQL) reliable?
    18. 18. Reliable !== Resilient
    19. 19. Choosing a stack
    20. 20. “Hailo” (running PHP+MySQL) reliable?
    21. 21. Service Oriented Architecture: each service does one job well (diagram: four separate services)
    22. 22. • Fewer lines of code
            • Fewer responsibilities
            • Changes less frequently
            • Can swap entire implementation if needed
    23. 23. Service (running PHP+MySQL) reliable?
    24. 24. MySQL running on different box (diagram: Service, MySQL)
    25. 25. MySQL running in Multi-Master mode (diagram: MySQL, Service, MySQL)
    26. 26. Going global
    27. 27. Separating concerns (diagram: MySQL split into CRUD, Locking, Search, Analytics, ID generation; also queuing…)
    28. 28. At Hailo we look for technologies that are:
            • Distributed: run on more than one machine
            • Homogenous: all nodes look the same
            • Resilient: can cope with the loss of node(s) with no loss of data
    29. 29. “There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
            http://blog.b3k.us/2012/01/24/some-rules.html
    30. 30. • Highly performant, scalable and resilient data store
            • Underpins much of what we do at Hailo
            • Makes multi-DC easy!
    31. 31. ZooKeeper
            • Highly reliable distributed coordination
            • We implement locking and leadership election on top of ZK and use sparingly
    32. 32. • Distributed, RESTful, Search Engine built on top of Apache Lucene
            • Replaced basic foo LIKE ‘%bar%’ queries (so much better)
    33. 33. NSQ
            • Realtime message processing system designed to handle billions of messages per day
            • Fault tolerant, highly available with reliable message delivery guarantee
    34. 34. Cruftflake
            • Distributed ID generation with no coordination required
            • Rock solid
    35. 35. • All these technologies have similar properties of distribution and resilience
            • They are designed to cope with failure
            • They are not broken by design
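To make “distributed ID generation with no coordination required” more concrete, here is a minimal sketch of a Snowflake-style scheme. It is an assumption-laden illustration, not Cruftflake’s actual code: the `IdGenerator` class, the 41/10/12 bit split and the epoch are all chosen for the example. Each node only needs a unique machine ID and its own clock, so nodes never have to talk to each other.

```php
<?php
/**
 * Snowflake-style 64-bit IDs (illustrative sketch, not Cruftflake itself):
 * 41 bits of milliseconds since a custom epoch, 10 bits of machine ID,
 * 12 bits of per-millisecond sequence. Requires 64-bit PHP integers.
 */
final class IdGenerator
{
    private int $machineId;
    private int $epochMs;
    private int $lastMs = -1;
    private int $sequence = 0;

    // The default epoch (2012-01-01 UTC in milliseconds) is an arbitrary choice.
    public function __construct(int $machineId, int $epochMs = 1325376000000)
    {
        if ($machineId < 0 || $machineId > 1023) {
            throw new InvalidArgumentException('machineId must fit in 10 bits');
        }
        $this->machineId = $machineId;
        $this->epochMs = $epochMs;
    }

    public function nextId(int $nowMs): int
    {
        if ($nowMs < $this->lastMs) {
            throw new RuntimeException('Clock moved backwards; refusing to generate IDs');
        }
        if ($nowMs === $this->lastMs) {
            // Same millisecond: bump the sequence, which wraps after 4096 IDs.
            $this->sequence = ($this->sequence + 1) & 0xFFF;
            if ($this->sequence === 0) {
                throw new RuntimeException('Sequence exhausted for this millisecond');
            }
        } else {
            $this->sequence = 0;
        }
        $this->lastMs = $nowMs;

        $elapsed = $nowMs - $this->epochMs;
        if ($elapsed < 0 || $elapsed >= (1 << 41)) {
            throw new RuntimeException('Timestamp does not fit in 41 bits');
        }
        return ($elapsed << 22) | ($this->machineId << 12) | $this->sequence;
    }
}

$generator = new IdGenerator(5);
$first  = $generator->nextId(1325376000001);
$second = $generator->nextId(1325376000001); // same millisecond, next sequence
$third  = $generator->nextId(1325376000002);
```

Because the timestamp occupies the high bits, IDs from a single node sort by creation time, and IDs from different nodes cannot collide as long as machine IDs are unique.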
    36. 36. Lessons learned
    37. 37. Minimise the critical path
    38. 38. What is the minimum viable service?
    39. 39. Lazy-init instances; connect on use
            class HailoMemcacheService
            {
                private $mc = null;

                public function __call($name, $arguments)
                {
                    $mc = $this->getInstance();
                    // do stuff
                }

                private function getInstance()
                {
                    // Connect on first use rather than in the constructor
                    if ($this->mc === null) {
                        $this->mc = new Memcached;
                        $this->mc->addServers($s); // $s: server list, as on the slide
                    }
                    return $this->mc;
                }
            }
    40. 40. Configure clients carefully
    41. 41. Make sure timeouts are configured
            $this->mc = new Memcached;
            $this->mc->addServers($s);
            $this->mc->setOption(Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
            $this->mc->setOption(Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
            $this->mc->setOption(Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
            $this->mc->setOption(Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
    42. 42. Choose timeouts based on data (chart annotation: “here?”)
    43. 43. “Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
            http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
    44. 44. 95th percentile (chart annotation: “here?”)
    45. 45. Test
    46. 46. • Kill memcache on box A, measure impact on application
            • Kill memcache on box B, measure impact on application
            All fine.. we’ve got this covered!
    47. 47. FAIL
    48. 48. • Box A, running in AWS, locks up
            • Any parts of application that touch Memcache stop working
    49. 49. Things fail in exotic ways
    50. 50. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -j REJECT
            $ php test-memcache.php
            Working OK!
            Packets rejected and source notified by ICMP. Expect fast fails.
    51. 51. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -j DROP
            $ php test-memcache.php
            Working OK!
            Packets silently dropped. Expect long time outs.
    52. 52. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -m state --state ESTABLISHED -j DROP
            $ php test-memcache.php
            Hangs! Uh oh.
    53. 53. • When AWS instances hang they appear to accept connections but drop packets
            • Bug! https://bugs.launchpad.net/libmemcached/+bug/583031
    54. 54. Fix, rinse, repeat
    55. 55. It would be nice if we could automate this
    56. 56. Automate!
    57. 57. • Hailo run a dedicated automated test environment
            • Powered by bash, JMeter and Graphite
            • Continuous automated testing with failure simulations
    58. 58. Fix attempt 1: bad timeouts configured
    59. 59. Fix attempt 2: better timeouts
    60. 60. Simulate in system tests
    61. 61. • Simulate failure
            • Assert monitoring endpoint picks this up
            • Assert features still work
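The three steps on the slide above (simulate failure, assert monitoring picks it up, assert features still work) might look roughly like this at the unit level. Everything here is hypothetical and written for illustration: the `Cache` interface, `FailingCache` test double and `ProfileService` are not Hailo code, and the real system tests described in the talk ran against a full environment rather than in-process fakes.

```php
<?php
interface Cache
{
    /** @return mixed cached value, or null on a miss */
    public function get(string $key);
}

/** Test double that simulates a dead cache node. */
final class FailingCache implements Cache
{
    public function get(string $key)
    {
        throw new RuntimeException('simulated cache failure');
    }
}

/** Hypothetical service that treats the cache as optional. */
final class ProfileService
{
    /** Counter that a monitoring endpoint could expose. */
    public int $cacheFailures = 0;

    private Cache $cache;

    /** @var callable loads a profile from the source of truth */
    private $loadFromStore;

    public function __construct(Cache $cache, callable $loadFromStore)
    {
        $this->cache = $cache;
        $this->loadFromStore = $loadFromStore;
    }

    public function fetch(string $id): array
    {
        try {
            $cached = $this->cache->get("profile:$id");
            if (is_array($cached)) {
                return $cached;
            }
        } catch (RuntimeException $e) {
            // Degrade: note the failure for monitoring, then fall through to the store.
            $this->cacheFailures++;
        }
        return ($this->loadFromStore)($id);
    }
}

// Simulate failure by wiring in the failing cache.
$service = new ProfileService(
    new FailingCache(),
    fn (string $id): array => ['id' => $id, 'name' => 'example']
);
$profile = $service->fetch('42');
```

The feature still returns a profile even though the cache is down, and the failure counter gives monitoring something to alert on.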
    62. 62. In conclusion
    63. 63. “the best way to avoid failure is to fail constantly.”
            http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
    64. 64. TIMED BLOCK ALL THE THINGS
    65. 65. Thanks
            Software used at Hailo
            http://cassandra.apache.org/
            http://zookeeper.apache.org/
            http://www.elasticsearch.org/
            http://www.acunu.com/acunu-analytics.html
            https://github.com/bitly/nsq
            https://github.com/davegardnerisme/cruftflake
            https://github.com/davegardnerisme/nsqphp
            Plus a load of other things I’ve not mentioned.
    66. 66. Further reading
            Hystrix: Latency and Fault Tolerance for Distributed Systems
            https://github.com/Netflix/Hystrix
            Timelike: a network simulator
            http://aphyr.com/posts/277-timelike-a-network-simulator
            Notes on distributed systems for young bloods
            http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
            Stream de-duplication (relevant to NSQ)
            http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
            ID generation in distributed systems
            http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems
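The “timed block all the things” idea from the conclusion can be sketched as a small wrapper that measures any operation and hands the duration to a recorder. This is a minimal sketch under stated assumptions: the `timed()` function and the in-memory `$record` callback are hypothetical, and in practice the recorder would forward to whatever metrics system is in use.

```php
<?php
/**
 * Run $block, pass its wall-clock duration in milliseconds to $record under
 * $name, and return whatever the block returned. The duration is recorded
 * even when the block throws, so failures show up in the timings too.
 */
function timed(string $name, callable $block, callable $record)
{
    $start = hrtime(true); // monotonic clock in nanoseconds (PHP 7.3+)
    try {
        return $block();
    } finally {
        $record($name, (hrtime(true) - $start) / 1e6);
    }
}

// In-memory recorder for the example; a stand-in for a real metrics client.
$timings = [];
$record = function (string $name, float $ms) use (&$timings): void {
    $timings[$name][] = $ms;
};

$result = timed('cache.get', function () {
    usleep(2000); // stand-in for a real operation
    return 'value';
}, $record);
```

Collected this way, the same timings that feed dashboards can also feed timeout choices such as the 95th percentile discussed in the talk.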
