The Art of Successful Failure - AWS Summit Sydney 2019

Welcome to the real world, where things don’t always go your way. You’ve designed your systems to be highly available, scalable, and resilient, and yet sometimes they fail anyway. These failures, if used correctly, can be a powerful lever for gaining a deep understanding of how your system actually works, and a tool for spreading knowledge through your engineering community. In this session we will cover some of AWS’ favourite techniques for defining and reviewing metrics – watching the systems before they fail – as well as how to do an effective post-mortem that drives both learning and meaningful improvement.

  1. AWS Summit Sydney
  2. The art of successful failure. Becky Weiss, Senior Principal Engineer, Amazon Web Services. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  3. This is a talk about failing… successfully
  4. Agenda
     • Never waste a failure: The AWS approach to post-mortems
     • Seeing your failures before your customers do
     • How AWS can help you fail successfully
  7. [Chart: Availability]
  8. COE: Correction of Error
     • Structured analysis of customer-impacting events
     • Reflection of Amazon’s peculiar culture
     • Goes well beyond “How do we prevent this from happening again”
  9. We take these very seriously
  10. COEs start with the customer and work backwards
     • Summary
     • Narrative description of what happened
     • Metrics and graphs
     • Primary impact and supporting graphs
     • If they don’t exist, that’s something to fix
     • Customer impact
     • How many customers affected
     • What was the impaired experience
     [Charts: Availability; Latency of dependency (p99, p50)]
  11. Areas of focus
     • Root cause: Why? (x 5)
     • Blast radius: How widespread was the impact?
     • Duration: For how long?
     • What can others learn?
  12. Toyota’s Five-Whys approach to root cause
     The vehicle will not start. (The problem)
     • Why? The battery is dead. (First why)
     • Why? The alternator is not functioning. (Second why)
     • Why? The alternator belt has broken. (Third why)
     • Why? The alternator belt was well beyond its useful service life and not replaced. (Fourth why)
     • Why? The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
  13. Blast radius
  14. Blast radius
  15. Blast radius containment as a core design tenet [Diagram: AWS Cloud divided into Regions]
  16. Blast radius containment as a core design tenet [Diagram: a Region divided into Availability Zones]
  17. Blast radius containment as a core design tenet [Diagram: a Region divided into cells]
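
The cell diagram in slide 17 can be illustrated with a small sketch. The slides do not say how customers are assigned to cells, so the stable-hash assignment below is an assumption, and `assign_cell` and `blast_radius` are hypothetical names. The point it shows is only that a failure in one cell touches the customers mapped to that cell and nobody else.

```python
import hashlib
from typing import List


def assign_cell(customer_id: str, cell_count: int) -> int:
    """Map a customer to one of `cell_count` cells using a stable hash.

    Assumption: hashing is one possible assignment scheme, chosen here
    because it is deterministic across processes and restarts.
    """
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def blast_radius(failed_cell: int, customers: List[str], cell_count: int) -> List[str]:
    """Return only the customers whose traffic lands on the failed cell."""
    return [c for c in customers if assign_cell(c, cell_count) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    affected = blast_radius(0, customers, cell_count=10)
    print(f"{len(affected)} of {len(customers)} customers affected")
```

With more cells, each cell carries a smaller share of customers, so a single-cell failure affects fewer of them.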
  18. Controlling event duration [Chart: Availability, with the impact period marked]
  19. Controlling duration: Improving incident response
     • “How was the event detected?”
     • “How could time to detection be improved? As a thought experiment, how would you have cut the time in half?”
     [Chart: Availability, with time to detection marked]
  20. Controlling duration: Improving incident response [Diagram: Good; Amazon CloudWatch alarm]
  21. Controlling duration: Improving incident response [Diagram: Good; Bad; ???; Amazon CloudWatch alarm]
  22. Controlling duration: Improving time to mitigation
     • “How did you reach the point where you knew how to mitigate the impact?”
     • “How could time to mitigation be improved? As a thought experiment, how would you have cut the time in half?”
     [Chart: Availability, with “Determining how to mitigate” and “Mitigation activity” marked]
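
The phases named on slides 19 and 22 can be measured from an incident timeline. This is a minimal sketch, assuming four timestamps are recorded per event; the function name, field names, and example times are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta
from typing import Dict


def incident_durations(
    impact_start: datetime,
    detected: datetime,
    mitigation_identified: datetime,
    mitigated: datetime,
) -> Dict[str, timedelta]:
    """Split an impact period into the phases named on the slides."""
    if not impact_start <= detected <= mitigation_identified <= mitigated:
        raise ValueError("timestamps must be in chronological order")
    return {
        "time_to_detection": detected - impact_start,
        "determining_how_to_mitigate": mitigation_identified - detected,
        "mitigation_activity": mitigated - mitigation_identified,
        "impact_period": mitigated - impact_start,
    }


if __name__ == "__main__":
    # Hypothetical timeline, for illustration only.
    phases = incident_durations(
        impact_start=datetime(2019, 5, 1, 10, 0),
        detected=datetime(2019, 5, 1, 10, 12),
        mitigation_identified=datetime(2019, 5, 1, 10, 40),
        mitigated=datetime(2019, 5, 1, 10, 55),
    )
    for name, duration in phases.items():
        print(f"{name}: {duration}")
```

Recording the phases separately makes the "cut the time in half" thought experiment concrete, because it shows which phase dominated the impact period.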
  23. Controlling duration: Improving time to mitigation. Example: Alarm-based automatic rollback [Diagram: AWS CodeDeploy; alarm]
  24. Controlling duration: Improving time to mitigation. Example: Alarm-based automatic rollback [Diagram: AWS CodeDeploy; alarm; rollback]
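
One way to express alarm-based automatic rollback is as deployment group settings for AWS CodeDeploy. The sketch below only builds the settings dictionary and makes no AWS call. The helper name and the example application, group, and alarm names are hypothetical; the key names follow the CodeDeploy `UpdateDeploymentGroup` request shape as I understand it, so check them against the current API reference before relying on them.

```python
from typing import Any, Dict, List


def rollback_on_alarm_settings(
    application: str, deployment_group: str, alarm_names: List[str]
) -> Dict[str, Any]:
    """Build deployment group settings that stop and roll back on alarm.

    Intended (as an assumption) to be passed to
    boto3.client("codedeploy").update_deployment_group(**settings).
    """
    if not alarm_names:
        raise ValueError("at least one CloudWatch alarm name is required")
    return {
        "applicationName": application,
        "currentDeploymentGroupName": deployment_group,
        # CloudWatch alarms monitored while a deployment is in progress.
        "alarmConfiguration": {
            "enabled": True,
            "alarms": [{"name": name} for name in alarm_names],
        },
        # Roll back automatically when an alarm stops the deployment.
        "autoRollbackConfiguration": {
            "enabled": True,
            "events": ["DEPLOYMENT_STOP_ON_ALARM"],
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(rollback_on_alarm_settings("colors-app", "prod", ["colors-5xx-alarm"]))
```

Tying the rollback to a health-metric alarm removes the "determining how to mitigate" phase for bad deployments, since the mitigation is decided in advance.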
  25. Controlling event duration [Chart: Availability, with the impact period marked]
  26. Controlling event duration [Chart: Availability, with the impact period marked]
  27. Wins are just as important as failures [Chart labels: Latency; deployment; p99]
  29. Metrics are very interesting, and that can be a problem
  30. Health metrics and diagnostic metrics
     Health metrics
     • Answer the question: Am I failing?
     • Do not answer the question: Why am I failing?
     • Always set alarms on these
     • Be conservative in defining
     Diagnostic metrics
     • Answer the question: What is the value of this thing I measured?
     • Might answer the question: Why isn’t my system working?
     • Sometimes set alarms on these
     • Be liberal in defining
  31. Health or Diagnostic? [Chart: Response count over time; series: 5XX]
  32. Health or Diagnostic? [Chart: Response count over time; series: 4XX, 5XX]
  33. Health or Diagnostic? [Chart: Database transaction rollbacks over time]
  34. Health or Diagnostic? [Chart: Latency (msec) over time; series: avg]
  35. Health or Diagnostic? [Chart: Latency (msec) over time; series: avg, 50th percentile]
  36. Health or Diagnostic? [Chart: Latency (msec) over time; series: avg, 99th percentile, 50th percentile]
  37. Percentiles >> Avg [Chart: Latency (msec) over time; series: 99th percentile, 50th percentile]
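
A small worked example of why slide 37 prefers percentiles to averages. The latency samples are made up for illustration, and the nearest-rank method used here is one common percentile definition, not necessarily the one CloudWatch uses.

```python
from statistics import mean
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # ceil(pct * n / 100) computed with integer arithmetic.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical data: 95 fast requests and 5 very slow ones.
    latencies_ms = [20.0] * 95 + [1000.0] * 5
    print("avg:", mean(latencies_ms))            # 69.0
    print("p50:", percentile(latencies_ms, 50))  # 20.0
    print("p99:", percentile(latencies_ms, 99))  # 1000.0
```

The average of 69 ms describes no request that actually happened: most callers saw 20 ms and an unlucky few saw a full second. The p50 and p99 lines show both experiences directly.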
  38. Layout of a Great Dashboard
     • Health metrics at the top: latency percentiles, faults, volume
     • Key diagnostic metrics below the fold
  39. The AWS Ops Wheel
  42. [Chart: Volume; time scale: ~one week]
  43. [Chart: Volume and p99.9 latency; time scale: ~one week]
  44. [Chart: Volume; time scale: weeks/months]
  45. [Chart: higher is worse; time scale: weeks/months]
  46. [Chart: p99 latency against the alarm threshold]
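
The alarm threshold on slide 46 can be reasoned about with a simplified model. The sketch below assumes an "M out of the last N datapoints" rule and hypothetical parameter names; real CloudWatch alarm evaluation also handles missing data and other cases that this ignores.

```python
from typing import List


def alarm_state(
    p99_latency_ms: List[float],
    threshold_ms: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> str:
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for a p99 latency series."""
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
    recent = p99_latency_ms[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in recent if value > threshold_ms)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    # Hypothetical one-minute p99 datapoints, in milliseconds.
    series = [100.0, 120.0, 500.0, 520.0, 510.0]
    print(alarm_state(series, threshold_ms=400.0, evaluation_periods=3, datapoints_to_alarm=3))
```

Requiring several breaching datapoints trades a slower time to detection for fewer false alarms from a single noisy minute.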
  47. Failures
  48. Failures
  50. Key Service: Amazon CloudWatch. Amazon CloudWatch: Metrics, Logs and Alarms. Automatically-published metrics from AWS Lambda, Amazon API Gateway, Amazon DynamoDB, Billing, Amazon Simple Notification Service, Amazon Simple Storage Service (S3), Amazon EC2
  51. Key Service: Amazon CloudWatch. Amazon CloudWatch: Metrics, Logs and Alarms. Application-specific logging and metrics from Amazon EC2 instances with CloudWatch agent and AWS Lambda functions
  52. Key Service: Amazon CloudWatch. Amazon CloudWatch: Metrics, Logs and Alarms. Application-specific logging and metrics from Amazon EC2 instances with CloudWatch agent and AWS Lambda functions
  53. Key Service: Amazon CloudWatch. Amazon CloudWatch: Metrics, Logs and Alarms. Application-specific logging and metrics from Amazon EC2 instances with CloudWatch agent and AWS Lambda functions [Diagram: VPC; VPC Endpoint]
  55. A simple serverless API [Diagram: Amazon API Gateway “Colors” (GET /blue, GET /red); proxy integration; Lambda function “GetColor”; Amazon CloudWatch: Dashboards, Logs, and Alarms; AWS CloudFormation Stack]
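
The deck does not include the source of the "GetColor" function, so the handler below is a guess at its general shape, not the speaker's code. It assumes the Lambda proxy integration event carries `httpMethod` and `path`, and it returns the `statusCode`, `headers`, and `body` fields that proxy integration expects. The colour table and error messages are hypothetical.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical data for the two routes on the slide (GET /blue, GET /red).
KNOWN_COLORS = {"blue": "#0000FF", "red": "#FF0000"}


def _response(status: int, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build a response in the Lambda proxy integration format."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def get_color(event: Dict[str, Any], context: Optional[Any] = None) -> Dict[str, Any]:
    """Handle GET /<color> requests forwarded by API Gateway."""
    if event.get("httpMethod") != "GET":
        return _response(405, {"error": "method not allowed"})
    name = (event.get("path") or "").strip("/").lower()
    if name not in KNOWN_COLORS:
        # A client mistake: this would count towards 4XX, not 5XX.
        return _response(404, {"error": f"unknown color: {name}"})
    return _response(200, {"color": name, "hex": KNOWN_COLORS[name]})


if __name__ == "__main__":
    print(get_color({"httpMethod": "GET", "path": "/blue"}))
```

Returning 4XX for caller errors and reserving 5XX for genuine faults keeps the fault count usable as a health metric.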
  56. Amazon CloudWatch default view
  57. Custom dashboard
  58. Custom dashboard: Metric math
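
The transcript does not preserve which expression slide 58 uses. A common use of metric math is to derive an availability percentage from a request count and a fault count, and the sketch below shows that arithmetic in plain Python. The function name and the treatment of zero traffic as 100% are assumptions.

```python
def availability_percent(request_count: float, fault_count: float) -> float:
    """Availability as 100 * (1 - faults / requests).

    Mirrors a metric math expression of the form 100 * (1 - m2 / m1),
    where m1 is the request count and m2 is the 5XX count (assumed ids).
    """
    if request_count < 0 or fault_count < 0:
        raise ValueError("counts must be non-negative")
    if fault_count > request_count:
        raise ValueError("fault_count cannot exceed request_count")
    if request_count == 0:
        # Design choice: no traffic is treated as fully available.
        return 100.0
    return 100.0 * (1.0 - fault_count / request_count)


if __name__ == "__main__":
    print(availability_percent(request_count=4, fault_count=1))  # 75.0
```

A derived ratio like this stays meaningful as volume changes, which a raw fault count does not.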
  59. Provisioning dashboards with AWS CloudFormation [Diagram: AWS CloudFormation Stack]
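
A dashboard provisioned through CloudFormation needs a JSON body. The sketch below generates one that follows the layout from slide 38, with health metrics in the top row and a diagnostic metric beneath. It assumes the API Gateway metric names `Latency`, `5XXError`, `Count`, and `4XXError` in the `AWS/ApiGateway` namespace; the helper names, widget sizes, and Region are illustrative. The resulting string is meant for the `DashboardBody` property of an `AWS::CloudWatch::Dashboard` resource.

```python
import json
from typing import Any, Dict


def metric_widget(
    title: str, metric: str, stat: str, api_name: str, region: str, x: int, y: int
) -> Dict[str, Any]:
    """One metric widget, 8 columns wide on the 24-column dashboard grid."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 8,
        "height": 6,
        "properties": {
            "title": title,
            "metrics": [["AWS/ApiGateway", metric, "ApiName", api_name]],
            "stat": stat,
            "period": 60,
            "region": region,
        },
    }


def dashboard_body(api_name: str, region: str) -> str:
    """Return a dashboard body with health metrics in the top row."""
    widgets = [
        # Health metrics: latency percentile, faults, volume.
        metric_widget("Latency p99", "Latency", "p99", api_name, region, 0, 0),
        metric_widget("Faults (5XX)", "5XXError", "Sum", api_name, region, 8, 0),
        metric_widget("Volume", "Count", "Sum", api_name, region, 16, 0),
        # Diagnostic metric, placed below the health row.
        metric_widget("Client errors (4XX)", "4XXError", "Sum", api_name, region, 0, 6),
    ]
    return json.dumps({"widgets": widgets}, indent=2)


if __name__ == "__main__":
    # "Colors" is the API name from the demo slide; the Region is an example.
    print(dashboard_body("Colors", "ap-southeast-2"))
```

Generating the body from code keeps the dashboard under version control alongside the rest of the stack, so every deployed copy of the service gets the same view.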
  60. Custom dashboard: Breakdown by method
  61. Custom dashboard: Diagnosing customer issues
  62. Fast diagnostics with Amazon CloudWatch Logs Insights
  63. Fast diagnostics with Amazon CloudWatch Logs Insights
  64. Fast diagnostics with Amazon CloudWatch Logs Insights
  65. Fast diagnostics with Amazon CloudWatch Logs Insights
  66. Fast diagnostics with Amazon CloudWatch Logs Insights
  68. Takeaways
     • Never waste a failure: Effective post-mortems
     • Catch failures before your customers do: Effective dashboarding and metrics-reading
     • Use AWS tools to gain visibility and insight into your application
  69. Thank you! Becky Weiss, becky@amazon.com
