An Engineer's Guide to a Good Night's Sleep

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2xNcGlf.

Nicky Wrightson gives practical insight into handling failure in today's more complex distributed microservice systems. She looks at approaches to resiliency, understanding a system and its fault-tolerance requirements, and the developer mindset this calls for. She shares real-world examples and the occasional war story along the way. Filmed at qconlondon.com.

Nicky Wrightson is a principal engineer working at River Island. She passionately drives forward cloud native architectures and approaches that allow engineers to deliver business value quickly whilst also reducing the support overhead needed for complex distributed systems.

Published in: Technology

  1. 1. @nickywrightson An Engineer’s Guide to a Good Night’s Sleep, by Nicky Wrightson
  2. 2. InfoQ.com: News & Community Site. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/microservices-failure-insights/ • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  4. 4. @nickywrightson
  5. 5. @nickywrightson
  6. 6. @nickywrightson
  7. 7. @nickywrightson We are building REALLY complicated distributed systems
  8. 8. @nickywrightson Martin Fowler: “You need a mature operations team to manage lots of services, which are being redeployed regularly” https://martinfowler.com/articles/microservice-trade-offs.html
  9. 9. @nickywrightson
  10. 10. @nickywrightson Empowered teams means the team also controls the support
  11. 11. @nickywrightson 2014: Consumers add a caching layer to protect against our outages. 2017: Our services were given an SLA of 15 minutes' recovery time. 2018: Migration to Kubernetes completed. 2019: Out-of-hours calls to 3rd line have all but disappeared.
  12. 12. @nickywrightson 5 approaches to reduce the risk of being called
  13. 13. @nickywrightson Engineer’s mindset 1
  14. 14. @nickywrightson 1
  15. 15. @nickywrightson Enable teams to own their own support models 1
  16. 16. @nickywrightson Operations Support Team A Support Team B 1
  17. 17. @nickywrightson The team triages issues during the day 1
  18. 18. @nickywrightson Engineers need to think about that out-of-hours call with every error condition 1
  19. 19. @nickywrightson Design the severity levels within your service 1
  20. 20. @nickywrightson “The quality of a system will appear to be declining unless it is rigorously maintained” Lehman's Laws of Software Evolution, “Declining Quality” (1996) 1
  21. 21. @nickywrightson “As a system evolves, its complexity increases unless work is done to maintain or reduce it” Lehman's Laws of Software Evolution cont., “Increasing Complexity” (1974) 1
  22. 22. @nickywrightson Engineer’s mindset 1
  23. 23. @nickywrightson Don’t get called for issues that could have been caught in office hours 2
  24. 24. @nickywrightson Releases during the day should never wake you up at night 2
  25. 25. @nickywrightson Can our deployment times help this? 2
  26. 26. @nickywrightson Quick deployment 2
  27. 27. @nickywrightson VERIFY VERIFY VERIFY 2
  28. 28. @nickywrightson By Cindy Sridharan (@copyconstruct) 2
  29. 29. @nickywrightson 3am batch jobs are guaranteed to get you an overnight call at some point 2
  30. 30. @nickywrightson 2
  31. 31. @nickywrightson 2
  32. 32. @nickywrightson Don’t get called for issues that could have been caught in office hours 2
  33. 33. @nickywrightson Automate failure recovery where possible 3
  34. 34. @nickywrightson Let your platform recover for you 3
  35. 35. @nickywrightson Applications need to cope with change: graceful termination, transactional, clean restarts, stateless, queue-backed, idempotent 3
  36. 36. @nickywrightson Make your system idempotent so you can automatically replay failed events 3
  37. 37. @nickywrightson Multi region automatic system failovers 3
  38. 38. @nickywrightson Multi region automatic system failovers 3
  39. 39. @nickywrightson 3
  40. 40. @nickywrightson Our EU stack went 3
  41. 41. @nickywrightson 3
  42. 42. @nickywrightson 3
  43. 43. @nickywrightson 3
  44. 44. @nickywrightson Automate failure recovery where possible 3
  45. 45. @nickywrightson Understand what your customers really care about 4
  46. 46. @nickywrightson You want to be the first to know about a critical failure 4
  47. 47. @nickywrightson “Only have alerts that you need to action” Sarah Wells - Director of Operations and Reliability at FT 4
  48. 48. @nickywrightson Not all services are equal: service that cleans old images from the repo != service that takes payments 4
  49. 49. @nickywrightson Synthetic Requests 4
  50. 50. @nickywrightson Use tracing to monitor your critical flows (Ben Sigelman @ this morning’s keynote) 4
  51. 51. @nickywrightson 4
  52. 52. @nickywrightson 4
  53. 53. @nickywrightson 4
  54. 54. @nickywrightson 4
  55. 55. @nickywrightson We are now flagging important events close to the code 4
  56. 56. @nickywrightson Understand what your customers really care about 4
  57. 57. @nickywrightson Break things and practice everything 5
  58. 58. @nickywrightson “a method of experimenting on infrastructure that lets you expose weaknesses before they become a real problem.” 5
  59. 59. @nickywrightson Monolith to microservice timeline 5
  60. 60. @nickywrightson When can we release the chaos monkeys? 5
  61. 61. @nickywrightson Manual simulation of outages works too 5
  62. 62. @nickywrightson Spot the SPOF 5
  63. 63. @nickywrightson Multi region automatic system failovers 5
  64. 64. @nickywrightson Multi region automatic system failovers 5
  65. 65. @nickywrightson Fixing things in office hours builds the team's confidence to support out of hours 5
  66. 66. @nickywrightson Manual intervention should be simple: FIX IT! 5
  67. 67. @nickywrightson 5
  68. 68. @nickywrightson Make sure your alerts have all the relevant information to action the event 5
  69. 69. @nickywrightson Failed requests 5
  70. 70. @nickywrightson At 3am, just get the system to limp into office hours 5
  71. 71. @nickywrightson Break things and practice everything 5
  72. 72. @nickywrightson Engineer’s mindset 1
  73. 73. @nickywrightson Don’t get called for issues that could have been caught in office hours 2
  74. 74. @nickywrightson Automate failure recovery where possible 3
  75. 75. @nickywrightson Understand what your customers really care about 4
  76. 76. @nickywrightson Break things and practice everything 5
  77. 77. @nickywrightson The engineers are the ones called at 3am. We now own this!
  78. 78. @nickywrightson Thanks!
  79. 79. @nickywrightson Resources: “Testing Microservices, the sane way” by Cindy Sridharan, https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16; “Microservice Trade-Offs” by Martin Fowler, https://martinfowler.com/articles/microservice-trade-offs.html; https://medium.com/netflix-techblog/vizceral-open-source-acc0c32113fe
  80. 80. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/microservices-failure-insights/
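
Slide 36 advises making the system idempotent so that failed events can be replayed automatically. The slides do not show how this was implemented, so the following is only a minimal sketch of the general idea, assuming every event carries a unique ID that the consumer records once the event is applied. The names Event, IdempotentLedger and replay are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape; event_id is the deduplication key."""
    event_id: str
    account: str
    amount: int


class IdempotentLedger:
    """Applies each event at most once, keyed by event_id (in-memory sketch)."""

    def __init__(self):
        self.balances = {}     # account -> running total
        self.seen_ids = set()  # IDs of events already applied

    def apply(self, event):
        """Apply an event; return True only if state actually changed."""
        if event.event_id in self.seen_ids:
            return False  # duplicate delivery or replay: do nothing
        self.balances[event.account] = (
            self.balances.get(event.account, 0) + event.amount
        )
        self.seen_ids.add(event.event_id)
        return True


def replay(ledger, events):
    """Replay a whole batch; already-applied events are skipped safely."""
    return sum(ledger.apply(e) for e in events)


# Assumed scenario: the consumer applied e1, then failed before reaching e2.
ledger = IdempotentLedger()
events = [Event("e1", "alice", 10), Event("e2", "alice", 5)]
ledger.apply(events[0])

# Recovery does not need to know where the failure happened:
# it replays the full batch and only the missing event takes effect.
applied = replay(ledger, events)
print(ledger.balances, applied)
```

Because duplicates are ignored, an automated recovery step can replay everything since the last known good point without anyone working out at 3am which events succeeded. In a real service the set of seen IDs would need to live in durable storage and be updated atomically with the state change; this sketch keeps both in memory for brevity.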