Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310) - AWS re:Invent 2018

You may have heard of the buzzwords “chaos engineering” and “containers.” But what do they have to do with each other? In this session, we introduce chaos engineering and share a live demo of how to practice chaos engineering principles on AWS. We walk through chaos engineering practices, tools, and success metrics you can use to inject failures in order to make your systems more reliable.

  1. Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310). Adrian Cockcroft, AWS VP Cloud Architecture Strategy. Ana Medina, Gremlin Inc. Chaos Engineer. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. History and Future of Chaos Engineering: Adrian. Breaking Containers, Tools and Demonstrations: Ana.
  3. What should your system do when something fails? Stop? Carry on with reduced functionality?
  4. “Availability Theater”: Do you have a backup data center? How often do you fail over apps to it? How often do you fail over the whole data center at once?
  5. A fairy tale… Once upon a time, in theory, if everything works perfectly, we have a plan to survive the disasters we thought of in advance. How did that work out?
  6. Forgot to renew domain name… (SaaS vendor). Didn’t update security certificate and it expired… (entertainment site). Data center flooded in Hurricane Sandy… (finance company, Jersey City). Whoops! (YOU, tomorrow).
  7. “You can’t legislate against failure, focus on fast detection and response.” (Chris Pinkham)
  8. “The Network Is Reliable,” ACM Queue 2014, Bailis & Kingsbury (@pbailis, @aphyr). Spoiler: it isn’t…
  9. Drift Into Failure, Sidney Dekker. Everyone can do everything right at every step, and you may still get a catastrophic failure as a result…
  10. Release It! Second Edition (2017), Michael Nygard. Bulkheads, circuit breakers, and some new ideas…
  11. Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capacity to withstand turbulent conditions in production. (Principles of Chaos Engineering)
  12. Principles of Chaos Engineering, key phrases: experimenting; build confidence; capacity to withstand turbulent conditions.
  13. Experiment to ensure that the impact of failure is mitigated. (Key phrases: experimenting; build confidence; capacity to withstand turbulent conditions.)
  14. Failures: What went wrong? What kind of thing failed?
  15. Failures and impacts: What are the effects of the failure? What mitigation mechanisms are in place?
  16. You can only be as strong as your weakest link. How can we try to think of everything that might fail?
  17. Taxonomy of failures: Common language, terminology, and definitions help mitigate communication failure between people working on resiliency. I’m proposing some terminology; try to use common definitions rather than making up your own!
  18. Taxonomy of failures, failure layers: Infrastructure; Software stack; Application; Operations.
  19. Mitigation layers: Application-level replication; Storage block-level replication; Structured database replication.
  20. Resilience: Past, Present, Future.
  21. Resilience. Past: Disaster recovery. Present: Chaos engineering. Future (?): Resilient critical systems.
  22. Disaster recovery. 1978: Sungard, mainframe batch. Recovery Point Objective (RPO): time interval between recovery point snapshots; e.g., daily backups. Recovery Time Objective (RTO): time taken to recover after a failure; e.g., time to locate and restore from a backup.
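The RPO and RTO definitions on slide 22 can be turned into a small check. The following is a minimal sketch, not something shown in the talk: the `RecoveryPlan` class and both function names are hypothetical, and it assumes the worst-case data loss equals the full interval between snapshots.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryPlan:
    """Hypothetical description of a backup and restore setup."""
    snapshot_interval: timedelta  # time between recovery point snapshots
    restore_duration: timedelta   # time to locate and restore from a backup


def worst_case_data_loss(plan: RecoveryPlan) -> timedelta:
    # A failure just before the next snapshot loses one full interval of data.
    return plan.snapshot_interval


def meets_objectives(plan: RecoveryPlan, rpo: timedelta, rto: timedelta) -> bool:
    # The plan is acceptable only if both objectives are satisfied.
    return worst_case_data_loss(plan) <= rpo and plan.restore_duration <= rto


if __name__ == "__main__":
    daily = RecoveryPlan(timedelta(days=1), timedelta(hours=4))
    print(meets_objectives(daily, rpo=timedelta(hours=24), rto=timedelta(hours=8)))
```

With daily backups, an RPO tighter than 24 hours cannot be met no matter how fast the restore is, which is the kind of gap this check makes visible.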
  23. Disaster recovery. 1978: Sungard, mainframe batch (RPO, RTO). Business continuity: resilience, recovery, contingency. ISO 22301:2012 Societal security, BCM (Glossary). ISO 27031:2011 Infosec.
  24. Chaos engineering timeline. 2004: Amazon, Jesse Robbins, “Master of Disaster.” 2010: Netflix, Greg Orzell (@chaosimia), first implementation of Chaos Monkey to enforce use of auto-scaled stateless services. 2012: NetflixOSS open sources Simian Army. 2016: Gremlin Inc. founded. 2017: Netflix chaos engineering book; Chaos Toolkit open source project. 2018: Chaos concepts getting adopted widely, and this conference!
  25. Chaos architecture: Four layers, two teams, an attitude.
  26. Layers: Infrastructure; Switching; Application; People. Team: Chaos Engineering Team.
  27. Failures are a system problem: lack of safety margin. Not something with a root cause of component or human error.
  28. Hypothesis testing: We think we have safety margin in this dimension; let’s carefully test to be sure. In production. Without causing an issue.
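The hypothesis-testing idea on slide 28 can be sketched as a tiny experiment loop: check steady state, inject a failure, check again, and always roll back. This is an illustrative sketch only; `run_experiment` and `FakeService` are hypothetical names, and the in-memory service stands in for a real system.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one chaos experiment and report whether the hypothesis held."""
    # Only experiment on a system that is healthy to begin with.
    if not steady_state():
        return "aborted: system not in steady state"
    try:
        inject_failure()
        held = steady_state()
    finally:
        # Always undo the injected failure, even if a check raised.
        rollback()
    return "hypothesis held" if held else "weakness found"


class FakeService:
    """In-memory stand-in for a replicated service (illustration only)."""

    def __init__(self, replicas: int, needed: int) -> None:
        self.replicas = replicas
        self.needed = needed

    def healthy(self) -> bool:
        return self.replicas >= self.needed

    def kill_one(self) -> None:
        self.replicas -= 1

    def restore_one(self) -> None:
        self.replicas += 1


if __name__ == "__main__":
    svc = FakeService(replicas=3, needed=2)
    print(run_experiment(svc.healthy, svc.kill_one, svc.restore_one))
```

A service with one spare replica survives the injected loss, while a service with no spare reveals a missing safety margin. The rollback in `finally` is the part that keeps the test from "causing an issue."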
  29. Experienced staff; Robust applications; Dependable switching fabric; Redundant service foundation.
  30. Cloud native chaos in practice: Mechanisms for AWS and Kubernetes.
  31. AWS Isolation Model: No global network or service dependencies; completely independent regions. Regions made up of Availability Zones. Zones between 10 and 100 km apart. Separate flood plains, electric supply, etc. Close enough for synchronous replication.
  32. AWS Isolation Model: Zones made up of data center buildings. Data centers divided internally to isolate and replicate critical services. Redundant private network around the world.
  33. AWS Mechanisms: Amazon Aurora DB cluster fault injection queries. Crash master or replica. Fail a replica. Disk failure or congestion. Syntax: ALTER SYSTEM SIMULATE percentage_of_failure PERCENT READ REPLICA FAILURE [ TO ALL | TO "replica name" ] FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND }; https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html
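As a sketch of how the read replica failure syntax on slide 33 might be filled in, the helper below only builds the statement text and validates its inputs. The function name `read_replica_failure_sql` is hypothetical, the 0 to 100 range check is an assumption, and nothing here connects to a database; running the statement against a real Aurora cluster is left to whatever MySQL client is in use.

```python
from typing import Optional

# Interval units listed in the syntax shown on the slide.
_UNITS = {"YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND"}


def read_replica_failure_sql(
    percent: int, quantity: int, unit: str, replica: Optional[str] = None
) -> str:
    """Build the fault injection statement text (does not execute it)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"unsupported interval unit: {unit}")
    if replica is None:
        target = "TO ALL"
    else:
        # Reject embedded quotes rather than trying to escape them.
        if '"' in replica:
            raise ValueError("replica name must not contain double quotes")
        target = f'TO "{replica}"'
    return (
        f"ALTER SYSTEM SIMULATE {percent} PERCENT READ REPLICA FAILURE "
        f"{target} FOR INTERVAL {quantity} {unit};"
    )


if __name__ == "__main__":
    print(read_replica_failure_sql(50, 2, "minute"))
```

Keeping the interval short and the percentage below 100 is one way to start with a small experiment before widening it.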
  34. AWS Mechanisms: IAM region restriction. Simulate regional API issues by changing the list of permitted regions. { "Sid": "RegionRestricted", "Effect": "Allow", "Action": "*", "Resource": "*", "Condition": {"StringEquals": {"aws:RequestedRegion": [ "eu-west-1"]}} } https://aws.amazon.com/blogs/security/easier-way-to-control-access-to-aws-regions-using-iam-policies/
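The region restriction on slide 34 can be generated rather than hand-edited, which makes it easier to drop a region for the duration of an experiment. This is a minimal sketch: the function names are hypothetical, the surrounding policy document ("Version" and "Statement" wrapper) is added here and does not appear on the slide, and attaching the policy to a role or user is out of scope.

```python
import json
from typing import Dict, List


def region_restricted_policy(allowed_regions: List[str]) -> Dict:
    """Wrap the slide's statement in a policy document for the given regions."""
    if not allowed_regions:
        raise ValueError("at least one region must remain permitted")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RegionRestricted",
                "Effect": "Allow",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"aws:RequestedRegion": list(allowed_regions)}
                },
            }
        ],
    }


def simulate_region_outage(allowed_regions: List[str], failed_region: str) -> List[str]:
    """Return the permitted regions with the 'failed' one removed."""
    return [r for r in allowed_regions if r != failed_region]


def policy_json(policy: Dict) -> str:
    # Serialize with straight quotes; curly quotes would make the JSON invalid.
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    remaining = simulate_region_outage(["eu-west-1", "us-east-1"], "us-east-1")
    print(policy_json(region_restricted_policy(remaining)))
```

Restoring the original region list ends the experiment, so the earlier list should be kept around as the rollback.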
  35. Chaos for Kubernetes (diagram): Region eu-west-1; a VPC spanning eu-west-1a, eu-west-1b, and eu-west-1c, each with a public and a private subnet (public-a/private-a, public-b/private-b, public-c/private-c). Impact the k8s control plane; impact applications.
  36. Kubernetes Mechanisms: Gremlin Inc. attacks. Gremlin runs a daemon on each node that manages and induces controlled failure, including blocking network access.
  37. Failure as a service: Resource, network, and state attacks on hosts or containers. Application-level fault injection for serverless. Easy-to-use API and UI.
  38. Minimize the blast radius.
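One way to read "minimize the blast radius" on slide 38 is to attack only a small, capped sample of hosts. The sketch below is an assumption about how that might look, not how Gremlin or any AWS service selects targets; `pick_targets` and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def pick_targets(
    hosts: Sequence[str],
    percent: int,
    max_targets: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick a random subset of hosts, bounded by a percentage and a hard cap."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    if max_targets < 1:
        raise ValueError("max_targets must be at least 1")
    if not hosts:
        return []
    rng = rng or random.Random()
    # At least one host, so a small fleet still gets an experiment.
    count = max(1, len(hosts) * percent // 100)
    # The hard cap wins over the percentage.
    count = min(count, max_targets, len(hosts))
    return rng.sample(list(hosts), count)


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(100)]
    print(pick_targets(fleet, percent=5, max_targets=10))
```

Starting with a low percentage and raising it only after the hypothesis holds keeps each experiment's impact bounded.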
  40. https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html
  52. More resources on Chaos Engineering. Join the Chaos Engineering Community: bit.ly/chaos-eng-slack; Chaos Engineering Tutorials: gremlin.com/community; Enterprise Trial: gremlin.com/break-containers
  53. Possible future directions?
  54. Observability of systems; Epidemic failure modes; Automation and continuous chaos.
  55. Observability (scale: Low, Medium, High): Microservice that does one thing; Function with no side effects; Monolith with logging; Monolith with tracing and logging.
  56. Failures can be: Independent (common assumption); Correlated (harder to model and mitigate knock-on effects); Epidemic (everything breaks at once!).
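The difference between independent and epidemic failures on slide 56 can be shown with a toy probability model. This is an illustration under stated assumptions, not a model from the talk: each of n replicas fails independently with probability p, and a single common-cause event with probability q takes out every replica at once. Both function names are hypothetical.

```python
import math


def p_all_fail_independent(p: float, n: int) -> float:
    """Probability that all n replicas fail when failures are independent."""
    return p ** n


def p_all_fail_with_common_cause(p: float, n: int, q: float) -> float:
    """Same, plus a common-cause event (probability q) that breaks everything."""
    # Either the epidemic event happens, or it does not and all n fail anyway.
    return q + (1.0 - q) * p ** n


if __name__ == "__main__":
    independent = p_all_fail_independent(0.01, 3)
    epidemic = p_all_fail_with_common_cause(0.01, 3, 0.001)
    print(independent, epidemic, math.isclose(independent, epidemic))
```

Under these assumptions the common-cause term dominates as soon as q is much larger than p to the power n, so adding more identical replicas barely helps against an epidemic failure.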
  57. Epidemic examples: Linux leap-second bug; Sun SPARC cache bit-flip; Cloud zone or region failure; DNS failure; Security configuration syntax error.
  58. Epidemic examples: Linux leap-second bug; Sun SPARC cache bit-flip; Cloud zone or region failure; DNS failure; Security configuration syntax error. Quarantine needed.
  59. Epidemic examples and quarantine. Linux leap-second bug: maintain ability to deploy on Windows. Sun SPARC cache bit-flip: use a variety of CPU implementations. Cloud zone or region failure: cross-zone or region replication. DNS failure: multiple domains and providers. Security configuration syntax error: limit the scope of deployments.
  60. Epidemic examples and quarantine. Linux leap-second bug: maintain ability to deploy on Windows. Sun SPARC cache bit-flip: use a variety of CPU implementations. Cloud zone or region failure: cross-zone or region replication. DNS failure: multiple domains and providers. Security configuration corruption: limit scope of deployments. Diversity needs to be managed to contain an epidemic.
  61. Cloud provides the automation that leads to chaos engineering.
  62. As data centers migrate to cloud, fragile and manual disaster recovery will be replaced by chaos engineering.
  63. Testing failure mitigation will move from a scary annual experience to automated continuous chaos.
  64. Thank you! Adrian Cockcroft (@adrianco), AWS VP Cloud Architecture Strategy. Ana Medina (@Ana_M_Medina), Gremlin Inc. Chaos Engineer.