Game day & Chaos Engineering
Concepts on AWS
• Famous Disasters
• Catastrophes' common thread
• Ingredients for Catastrophes
• Classes of issues
• Fragility vs Resilience
• Resilience vs Antifragility
• GameDay Concept
• AWS GameDay with the AWS Well-Architected Framework
• Example GameDay Plan
• Chaos Experiments
• Resource Exhaustion
• The Network is Not Reliable
• Datastore saturation
• DNS Unavailability
• Chaos Engineering
• Examples of inputs for chaos experiments
• How Does Chaos Engineering Differ from Testing?
• Software Tools for Chaos Engineering
• Chaos Toolkit
• Simian Army
• Chaos Monkey
• Chaos Monkey Basic Demo on AWS
In February 2017, Cloudflare faced a major
software bug that leaked sensitive customer
data such as passwords, cookies and
authentication tokens from customer
websites.
Bitcoin Unlimited suffered a serious
memory leak which caused the number of
running nodes to fall from about 800 to
about 300. By these figures, more than 60
percent of the nodes run by Bitcoin
Unlimited at the time went offline.
Less than a week into 2016, HSBC became the first
bank of the year to suffer a major IT outage. Millions
of the bank's customers were unable to access their
online accounts. Services returned to normal only
after a two-day outage.
The bank’s chief operating officer, John Hackett,
blamed a “complex technical issue” with its internal
systems.
In June 2015, about 600,000 payments failed
to reach RBS customers' accounts overnight,
including wages and benefit payments. Many
took several days to come through. The
bank’s chief administrative officer said a
“technology fault meant we could not ingest
a file from a third-party provider”.
Catastrophes
GameDay Concept
• GameDay is a hands-on competition with live data. By setting up a “fake-real”
environment for the teams to work in, players are free to try things out in a live
environment with no actual risk. It’s unusual to have such a “carefree” experience
working in a live environment. Ironically, though, the one-day time frame is actually
“kind of punitive.” In a real situation you’d probably have more time to deal with each
problem, but the high-pressure environment adds to the excitement, and AWS
GameDay’s goal is to entertain and make the learning experience as fun as possible.
Looking at GameDay through the lens of the Well-Architected
Framework, it was obvious that there were many opportunities
for improvement. The AWS review team prioritized the findings
into two sets: critical and recommended. Most of the findings
were classified as recommended; these don’t pose an
immediate risk and will be incorporated into our roadmap.
However, three elements were identified as critical and
needed to be addressed immediately:
• How Are You Managing Keys?
• How Are You Planning for Disaster Recovery?
• How Does Your System Withstand Component Failures?
➢ Target
In any GameDay, an exact target or set of targets should be specified. Without one,
it’s impossible to bring in the right people to run and observe the GameDay.
➢ Time And Place
Things to include:
• Precise date
• War room for in-person attendance (make sure it fits enough people)
• Dial-in information, including conferencing link, phone number and code to join
➢ Goals Of The GameDay
When planning, we need a goal for the GameDay so that we create relevant test
cases. Sometimes the goal is to replay as many previous production incidents as
possible, to test whether the current systems are more or less resilient than they
were.
➢ Whiteboarding
• With so many great minds present and aligned on the goal of the GameDay, it’s
the perfect time to whiteboard out a system’s architecture. This session helps
paint a clear picture of what we’re about to break, and makes obvious some
areas worth testing.
➢ Test Cases Scoping
• Test cases are developed to help answer the question, “What could go wrong?”
or “Do we know what will happen if this breaks?” As the team looks at the
architecture on the whiteboard, it will start to identify areas of concern.
➢ Make a plan
• Role play and scope definition
• Create the simulation environment
• Set a deadline
• Create the GameDay environment
• Activate AWS CloudTrail for gameplay recording and auditing
• Simulate activity
Resources on computers are finite. A machine/VM/container will
inevitably hit a resource limit at some point, and the application will be forced to
handle the lack of a resource. Commonly, this is CPU, Memory, or I/O.
We can reproduce CPU exhaustion by conducting a chaos experiment. Running
this experiment will consume CPU cycles, leaving the application with the same
amount of customer-facing work, and less CPU to do it with. As always, we
advocate starting small on a single instance, then increasing the blast radius as
confidence grows. Common reactions to CPU exhaustion are an increase in
errors and latency and a reduction in successful requests to customers.
➢ Attack: CPU / Memory / Disk
➢ Scope: Single instance
➢ Expected Results: Rate of good responses goes down, errors increase at all
layers, brownout mode entered (if implemented), alerts fire (if configured at
single-instance level), load balancer routes traffic away (if applicable)
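A CPU attack of this kind can be approximated with a few lines of Python. This is only a minimal sketch, not a replacement for a dedicated chaos tool: the names `burn_cpu` and `cpu_attack` are hypothetical, and the duration and worker count are assumptions to be tuned for the instance under test.

```python
import multiprocessing
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop until the deadline passes; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def cpu_attack(seconds: float, workers: int = 1) -> list:
    """Run `workers` busy loops in parallel processes, one per core to saturate."""
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(burn_cpu, [seconds] * workers)


if __name__ == "__main__":
    # Start small: one core for two seconds on a single instance,
    # then widen the blast radius as confidence grows.
    cpu_attack(seconds=2.0, workers=1)
```

Separate processes are used rather than threads because CPython threads share one interpreter lock and would not saturate more than one core.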
Network Latency
Network dependencies are a fact of life in a distributed system, and as
distributed systems grow in adoption and complexity, chaos engineering becomes
an effective way to test for potential failures on the path to increasing resilience.
➢ Attack: Network Blackhole / Latency
➢ Scope: single instance
➢ Expected Results: Traffic to dependency goes to 0 (or gets slow), startup completes
without errors, application-level metrics in steady-state are unaffected, traffic to
fallback systems shows up and is successful, dependency alerts and pages may fire (if
scoped to single-instance)
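The expected result "traffic to fallback systems shows up and is successful" can be rehearsed in-process before touching the real network. The sketch below is an assumption-laden illustration: `with_latency` and `call_with_fallback` are hypothetical helpers, and the injected delay stands in for a real network latency attack.

```python
import concurrent.futures
import time


def with_latency(func, delay_seconds):
    """Wrap `func` so every call is delayed, imitating a slow network path."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return delayed


def call_with_fallback(primary, fallback, timeout_seconds):
    """Call `primary`; if it does not answer within the timeout, use `fallback`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return fallback()
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        executor.shutdown(wait=False)


def fetch_price():
    """Hypothetical call to a remote dependency."""
    return "live price"


def cached_price():
    """Hypothetical fallback that needs no network."""
    return "cached price"


if __name__ == "__main__":
    slow_dependency = with_latency(fetch_price, 0.3)
    print(call_with_fallback(slow_dependency, cached_price, timeout_seconds=0.05))
```

With a 0.3 second injected delay and a 0.05 second timeout, the caller is served by the fallback; without the delay it is served by the primary.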
All interesting applications have some sort of storage, and managing the
relationship between application and datastore is critical to overall system health.
There are a variety of ways that an application may overwhelm a data store (poor
queries, lack of indices, bad sharding, upstream caching decisions, etc.), but all of
them result in what appears to be an unresponsive data layer.
It’s important to understand how datastore saturation manifests in your
application. There are a few ways of modeling this with a Chaos Experiment. You
can blackhole your datastore, making it appear completely unavailable. You can
add latency to requests to your datastore, making it appear slow. Finally, you can
consume I/O bandwidth to simulate a congested path to the datastore.
➢ Attack: Network Blackhole / Latency / IO
➢ Scope: single instance
➢ Expected Results: Traffic to datastore is reduced or slower, application-level
metrics in steady-state are unaffected to the degree possible, traffic to fallback
systems shows up and is successful, timeouts and concurrency limits kick in
when appropriate, alerts and pages may fire (if scoped to single-instance)
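One way a concurrency limit can "kick in" is to cap the number of in-flight datastore calls and reject the excess immediately. The following is a minimal sketch under that assumption; `ConcurrencyLimiter` is a hypothetical class, not part of any datastore client.

```python
import threading


class ConcurrencyLimiter:
    """Allow at most `limit` in-flight datastore calls; reject the rest quickly."""

    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, query, on_reject):
        # Non-blocking acquire: when the datastore is saturated, shed load
        # instead of queueing more requests behind it.
        if not self._slots.acquire(blocking=False):
            return on_reject()
        try:
            return query()
        finally:
            self._slots.release()
```

During a latency or blackhole experiment against the datastore, slow queries hold their slots longer, so the limiter starts calling `on_reject` (for example, serving a cached value) rather than letting requests pile up.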
The best way forward is to induce a DNS outage and understand how
your application behaves. If you blackhole DNS traffic on a single instance, it will
appear to that instance as if DNS is unavailable. The fixes will vary depending on
the issue, but common solutions are to pass around IP addresses instead of
hostnames for internal addressing and the use of a backup DNS provider.
➢ Attack: DNS blackhole
➢ Scope: single instance
➢ Expected Results: Inbound traffic may drop, traffic to external systems may
fail, startup may not complete successfully
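The backup-provider idea can be sketched as a resolver chain that is tried in order. This is an illustration only: `resolve_with_backup` and `blackholed_dns` are hypothetical names, the hostname and address are made up, and a real setup would more likely configure backup resolvers at the operating-system or network level.

```python
import socket


def resolve_with_backup(hostname, resolvers):
    """Try each resolver in order; return the first address that comes back."""
    last_error = None
    for lookup in resolvers:
        try:
            return lookup(hostname)
        except OSError as error:  # socket.gaierror is a subclass of OSError
            last_error = error
    if last_error is None:
        raise ValueError("no resolvers configured")
    raise last_error


def blackholed_dns(hostname):
    """Stand-in for a DNS blackhole: every lookup fails."""
    raise socket.gaierror("simulated DNS outage")


def backup_dns(hostname):
    """Hypothetical backup provider that still answers during the outage."""
    return "10.0.0.5"


if __name__ == "__main__":
    print(resolve_with_backup("db.internal.example", [blackholed_dns, backup_dns]))
```

In normal operation the first entry in `resolvers` could simply be `socket.gethostbyname`; swapping in `blackholed_dns` lets the fallback path be exercised without touching real DNS.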
What happened? Was that expected? What do we do next?
After tests are run, it’s good to take some time to wind down, then hold
a follow-up recap. This should be done relatively soon after the
GameDay (days, not weeks), while the experience is still fresh for everyone.
➢ How long does it take to detect an event?
➢ Simulate failure situation
➢ Validate assumptions
➢ Prove your architecture
➢ Knowledge of Procedures
➢ How well did the communication channels and chain of command work
during the game?
Chaos Use Cases
➢ Simulating the failure of an entire region or datacenter.
➢ Partially deleting Kafka topics over a variety of instances to recreate an issue that
occurred in production.
➢ Injecting latency between services for a select percentage of traffic over a
predetermined period of time.
➢ Function-based chaos (runtime injection): randomly causing functions to throw
exceptions.
➢ Code insertion: Adding instructions to the target program and allowing fault injection
to occur prior to certain instructions.
➢ Time travel: forcing system clocks out of sync with each other.
➢ Executing a routine in driver code emulating I/O errors.
➢ Maxing out CPU cores on an Elasticsearch cluster.
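Function-based chaos from the list above can be illustrated with a decorator that makes a function fail on a chosen fraction of calls. This is a minimal sketch: `chaos` and `ChaosError` are hypothetical names, and the failure rate is an assumption to be set per experiment.

```python
import functools
import random


class ChaosError(RuntimeError):
    """Raised by injected faults so they are easy to tell apart from real ones."""


def chaos(failure_rate, rng=None):
    """Decorator: make the wrapped function raise ChaosError on a fraction of calls."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is uniform in [0, 1), so a rate of 0 never fails
            # and a rate of 1 always fails.
            if rng.random() < failure_rate:
                raise ChaosError(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.1)
def lookup_user(user_id):
    """Hypothetical application function under experiment."""
    return {"id": user_id}
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which helps when comparing how callers handle the injected exceptions.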
The primary difference between Chaos Engineering and approaches such as
testing and fault injection is that Chaos Engineering is a practice for generating
new information, while fault injection is a specific approach to testing one condition.
Tests are typically binary, and determine whether a property is true or false.
Strictly speaking, this does not generate new knowledge about the system; it just
assigns valence to a known property of it. Experimentation generates new
knowledge, and often suggests new avenues of exploration.
➢ Vary Real-World Events
➢ Canary Analysis
➢ Hypothesize about Steady State
➢ Automate Experiments to Run Continuously
➢ Run Experiments in Production
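The steady-state hypothesis can be expressed as a small experiment loop: measure, inject, measure again, always roll back, then compare. The sketch below is one possible shape under stated assumptions; `run_experiment` and its callback parameters are hypothetical, and the metric is assumed to be a single number such as a success rate.

```python
def run_experiment(measure, inject, rollback, tolerance):
    """Check a steady-state hypothesis around one injected fault.

    measure()  -> float, a steady-state metric such as success rate
    inject()   -> starts the fault
    rollback() -> stops the fault; always called once the fault has started
    tolerance  -> largest drop in the metric that still counts as steady
    """
    baseline = measure()
    inject()
    try:
        observed = measure()
    finally:
        rollback()
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": baseline - observed <= tolerance,
    }
```

Because the function only depends on callbacks, the same loop can be scheduled repeatedly, which is one way to read "automate experiments to run continuously".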
Chaos Monkey
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
AI for workflow automation Use cases applications benefits and development.pdf
AI for workflow automation Use cases applications benefits and development.pdfAI for workflow automation Use cases applications benefits and development.pdf
AI for workflow automation Use cases applications benefits and development.pdf
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
 
Electrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineElectrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission line
 
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxCloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
 
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
 
Online resume builder management system project report.pdf
Online resume builder management system project report.pdfOnline resume builder management system project report.pdf
Online resume builder management system project report.pdf
 
2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge
 
Explosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdfExplosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdf
 

Chaos engineering & Gameday on AWS

  • 1. Game day & Chaos Engineering Concepts on AWS
  • 2. Gameday & Chaos Engineering Concepts on AWS
  • 3. • Famous Disasters • Catastrophes' common thread • Ingredients for Catastrophes • Classes of issues • Fragility vs Resilience • Resilience vs Antifragility • GameDay Concept • AWS GameDay with the AWS Well-Architected Framework • Example GameDay Plan • Chaos Experiments • Resource Exhaustion • The Network is Not Reliable • Datastore Saturation • DNS Unavailability
  • 4. • Chaos Engineering • Examples of inputs for chaos experiments • How Does Chaos Engineering Differ from Testing? • Software Tools for Chaos Engineering • Chaos Toolkit • Simian Army • Chaos Monkey • Chaos Monkey Basic Demo on AWS
  • 6. In February 2017, Cloudflare suffered a major software bug that caused sensitive customer data such as passwords, cookies, and authentication tokens to leak from customer websites. Bitcoin Unlimited suffered a serious memory leak that caused the number of running nodes to fall from 800 to about 300, a loss of more than 60 percent of the nodes running Bitcoin Unlimited at the time.
  • 7. Less than a week into 2016, HSBC became the first bank of the year to suffer a major IT outage. Millions of the bank’s customers were unable to access their online accounts, and services returned to normal only after two days. The bank’s chief operating officer Jack Hackett blamed a “complex technical issue” with its internal systems. In June 2015, about 600,000 payments, including wages and benefit payments, failed to enter RBS accounts overnight. Many took several days to come through. The bank’s chief admin officer said a “technology fault meant we could not ingest a file from a third-party provider”.
  • 10-15. Catastrophes
  • 19. GameDay Concept • GameDay is a hands-on competition with live data. Because the teams work in a “fake-real” environment, players are free to try things out in a live environment with no actual risk. It is unusual to have such a “carefree” experience working in a live environment. Ironically, though, the one-day time frame is “kind of punitive”: in a real situation you would probably have more time to deal with each problem. The high-pressure environment adds to the excitement, and the goal of AWS GameDay is to entertain and make the learning experience as fun as possible.
  • 20. • How are you managing keys? • How are you planning for disaster recovery? • How does your system withstand component failures? Looking at GameDay through the lens of the Well-Architected Framework, it was obvious that there were many opportunities for improvement. The AWS review team prioritized the findings into two sets: critical and recommended. Most of the findings were classified as recommended; these do not pose an immediate risk and will be incorporated into our roadmap. The three elements identified as critical, however, needed to be addressed immediately.
  • 21. • Target: In any GameDay, an exact target or targets should be specified. Without one, it is impossible to bring in the right people to run and observe the GameDay. • Time and Place: Things to include: a precise date; a war room for in-person attendance (make sure it fits enough people); dial-in information, including the conferencing link, phone number, and code to join. • Goals of the GameDay: When planning, we need a goal for the GameDay to ensure that we create relevant test cases. Sometimes the goal is to replay as many previous production incidents as possible, to test whether the current systems are more or less resilient.
  • 22. • Whiteboarding: With so many great minds present and aligned on the goal of the GameDay, it is the perfect time to whiteboard the system’s architecture. This session paints a clear picture of what we are about to break and makes some areas worth testing obvious. • Test Case Scoping: Test cases are developed to help answer the questions “What could go wrong?” and “Do we know what will happen if this breaks?” As the team looks at the architecture on the whiteboard, it will start to identify areas of concern.
  • 23. Make a plan • Role play and scope definition • Create the simulation environment • Set a deadline • Create the GameDay environment • Activate AWS CloudTrail for gameplay recording and auditing • Simulate activity
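The planning items from slides 21 to 23 (target, time and place, goal, test cases) can be captured in a small checklist structure. The following is only a sketch under assumptions: the names `GameDayPlan`, `ExperimentCase`, and `missing_items` are hypothetical and do not come from the slides or from any AWS tooling.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ExperimentCase:
    """One scoped test case: what to break and what we expect to see."""
    attack: str
    scope: str
    expected_results: list[str]


@dataclass
class GameDayPlan:
    """Hypothetical checklist for the planning items named on the slides."""
    targets: list[str]
    day: date
    war_room: str
    dial_in: str
    goal: str
    test_cases: list[ExperimentCase] = field(default_factory=list)

    def missing_items(self) -> list[str]:
        """Return the names of plan items that are still empty."""
        checks = [
            ("targets", self.targets),
            ("war_room", self.war_room),
            ("dial_in", self.dial_in),
            ("goal", self.goal),
            ("test_cases", self.test_cases),
        ]
        return [name for name, value in checks if not value]
```

Calling `missing_items()` before the event gives a quick view of which parts of the plan still need an owner.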
  • 24. Resources on computers are finite. A machine, VM, or container will inevitably hit a resource limit at some point, and the application will be forced to handle the lack of that resource. Commonly the resource is CPU, memory, or I/O. We can reproduce CPU exhaustion by conducting a chaos experiment. Running this experiment consumes CPU cycles, leaving the application with the same amount of customer-facing work and less CPU to do it with. As always, we advocate starting small on a single instance, then increasing the blast radius as confidence grows. Common reactions to CPU exhaustion are an increase in errors and latency and a reduction in successful requests to customers. • Attack: CPU / Memory / Disk • Scope: Single instance • Expected Results: Rate of good responses goes down, errors increase at all layers, brownout mode is entered (if implemented), alerts fire (if configured at single-instance level), load balancer routes traffic away (if applicable)
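A CPU attack of the kind described on slide 24 can be approximated with a time-bounded busy loop. This is a minimal sketch, not the tool used in the deck: `burn_cpu` is a hypothetical helper, it saturates only one core of the machine it runs on, and a real experiment would normally use a dedicated chaos tool.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds` and return the iteration count.

    The loop does no useful work; it only competes with the application
    for CPU time on one core.
    """
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations
```

Starting with a short duration such as `burn_cpu(5)` on a single instance keeps the blast radius small while the dashboards and alerts are observed.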
  • 25. Network Latency: Network dependencies are a fact of life in a distributed system. As distributed systems grow in both adoption and complexity, chaos engineering becomes an effective way to test for potential failures on the path to increasing resilience. • Attack: Network Blackhole / Latency • Scope: Single instance • Expected Results: Traffic to the dependency goes to 0 (or gets slow), startup completes without errors, application-level metrics in steady state are unaffected, traffic to fallback systems shows up and is successful, dependency alerts and pages may fire (if scoped to single-instance)
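The expected result "traffic to fallback systems shows up and is successful" on slide 25 can be illustrated in-process. The sketch below is an assumption-laden simulation: `call_dependency` and `fetch_with_fallback` are hypothetical names, the latency is injected with `time.sleep` rather than at the network layer, and a blackhole is modelled as infinite latency.

```python
import time


class DependencyTimeout(Exception):
    """Raised when the simulated dependency does not answer in time."""


def call_dependency(injected_latency: float, timeout: float) -> str:
    """Simulated dependency call with chaos-injected latency (in seconds)."""
    if injected_latency >= timeout:
        # The caller waits out its timeout and then gives up.
        time.sleep(timeout)
        raise DependencyTimeout("dependency did not respond in time")
    time.sleep(injected_latency)
    return "primary"


def fetch_with_fallback(injected_latency: float, timeout: float = 0.1) -> str:
    """Use the primary dependency, switching to a fallback on timeout."""
    try:
        return call_dependency(injected_latency, timeout)
    except DependencyTimeout:
        return "fallback"
```

With no injected latency the call returns `"primary"`; with `float("inf")` (a blackhole) it returns `"fallback"` after the timeout elapses.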
  • 26. All interesting applications have some sort of storage, and managing the relationship between application and datastore is critical to overall system health. An application may overwhelm a datastore in a variety of ways (poor queries, missing indices, bad sharding, upstream caching decisions, etc.), but all of them result in what appears to be an unresponsive data layer. It is important to understand how datastore saturation manifests in your application. There are a few ways of modeling this with a chaos experiment. You can blackhole your datastore, making it appear completely unavailable. You can add latency to requests to your datastore, making it appear slow. Finally, you can consume I/O bandwidth to simulate a congested path to the datastore. • Attack: Network Blackhole / Latency / IO • Scope: Single instance • Expected Results: Traffic to the datastore is reduced or slower, application-level metrics in steady state are unaffected to the degree possible, traffic to fallback systems shows up and is successful, timeouts and concurrency limits kick in when appropriate, alerts and pages may fire (if scoped to single-instance)
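Slide 26 expects "concurrency limits" to kick in when the datastore is saturated. One way such a limit might look is sketched below; `ConcurrencyLimiter` is a hypothetical class, not part of any datastore client, and it sheds load to a caller-supplied fallback instead of queueing.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class ConcurrencyLimiter:
    """Allow at most `max_in_flight` datastore calls at the same time."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, query: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run `query` if a slot is free, otherwise return `fallback()`."""
        if not self._slots.acquire(blocking=False):
            # Saturated: shed load instead of piling up on the datastore.
            return fallback()
        try:
            return query()
        finally:
            self._slots.release()
```

During a latency attack, slow queries hold their slots longer, so the limiter starts answering from the fallback path; that shift is what the experiment should make visible in the metrics.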
  • 27. The best way forward is to induce a DNS outage and understand how your application behaves. If you blackhole DNS traffic on a single instance, it will appear to that instance as if DNS is unavailable. The fixes will vary depending on the issue, but common solutions are to pass around IP addresses instead of hostnames for internal addressing and to use a backup DNS provider. • Attack: DNS blackhole • Scope: Single instance • Expected Results: Inbound traffic may drop, traffic to external systems may fail, startup may not complete successfully
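The DNS experiment on slide 27 can be rehearsed inside a single process before blackholing real DNS traffic. The sketch below is an assumption: it patches `socket.getaddrinfo` with the standard-library `unittest.mock` so that every lookup fails, and the hypothetical `resolve` helper falls back to a static table of IP addresses, one of the fixes the slide mentions.

```python
import socket
from unittest import mock


def resolve(hostname: str, fallback_ips: dict[str, str]) -> str:
    """Resolve `hostname`, using a static IP table if DNS is unavailable."""
    try:
        infos = socket.getaddrinfo(hostname, None)
        return infos[0][4][0]  # first address of the first result
    except socket.gaierror:
        # DNS failed: a KeyError here means there is no fallback either.
        return fallback_ips[hostname]


def dns_blackhole():
    """Context manager that makes every DNS lookup in this process fail."""
    return mock.patch(
        "socket.getaddrinfo",
        side_effect=socket.gaierror("simulated DNS outage"),
    )
```

Used as `with dns_blackhole(): ...`, this affects only the current Python process; it does not replace a network-level blackhole on the instance.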
  • 28. What happened? Was that expected? What do we do next? After the tests are run, it is good to take some time to wind down and then hold a follow-up recap. This should happen relatively soon after the GameDay (days, not weeks), while the experience is still fresh for everyone. • How long does it take to detect an event? • Simulate failure situations • Validate assumptions • Prove your architecture • Knowledge of procedures • How well do the communication channels and chain of command work during the game?
  • 29. Chaos Use Cases • Simulating the failure of an entire region or datacenter. • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production. • Injecting latency between services for a select percentage of traffic over a predetermined period of time. • Function-based chaos (runtime injection): randomly causing functions to throw exceptions. • Code insertion: adding instructions to the target program and allowing fault injection to occur prior to certain instructions. • Time travel: forcing system clocks out of sync with each other. • Executing a routine in driver code emulating I/O errors. • Maxing out CPU cores on an Elasticsearch cluster.
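The "function-based chaos" use case on slide 29 (randomly causing functions to throw exceptions) might be sketched as a decorator. The names `chaos` and `InjectedFault` are hypothetical and not taken from Chaos Monkey or Chaos Toolkit; the random generator is passed in so that runs can be reproduced with a seed.

```python
import functools
import random


class InjectedFault(RuntimeError):
    """Marks an exception raised on purpose by the chaos decorator."""


def chaos(failure_rate: float, rng: random.Random):
    """Make the wrapped function raise InjectedFault with probability `failure_rate`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

A function decorated with `@chaos(0.05, random.Random(42))` would fail on roughly five percent of calls, which lets callers exercise their error handling without touching the function body.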
  • 30. The primary difference between Chaos Engineering and other approaches such as fault injection and failure testing is that Chaos Engineering is a practice for generating new information, while fault injection is a specific approach to testing one condition. Tests are typically binary and determine whether a property is true or false. Strictly speaking, this does not generate new knowledge about the system; it just assigns valence to a known property of it. Experimentation generates new knowledge and often suggests new avenues of exploration.
  • 31. • Vary Real-World Events • Canary Analysis • Hypothesize about Steady State • Automate Experiments to Run Continuously • Run Experiments in Production
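"Hypothesize about Steady State" on slide 31 implies comparing a metric measured during the experiment with its baseline. A minimal sketch, assuming a positive metric such as a success rate and a hypothetical relative `tolerance`, could look like this:

```python
from statistics import mean


def within_steady_state(baseline: list[float], observed: list[float],
                        tolerance: float = 0.05) -> bool:
    """Return True if the observed mean stays within `tolerance` (relative)
    of the baseline mean. Assumes the baseline mean is positive."""
    base = mean(baseline)
    return abs(mean(observed) - base) <= tolerance * base
```

If the check fails during an experiment, the hypothesis is disproved and the experiment should stop; the same comparison between a control group and an experiment group is the basic idea behind canary analysis.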
  • 35. Chaos Monkey