© 2021, Amazon Web Services, Inc. or its affiliates.
Let's write your AWS FIS experiment templates!!
Masao Kanamori
Solutions Architect, DevAx
About me
Masao Kanamori
• Title/Role: Solutions Architect, DevAx (Developer Acceleration) Team
• Favorite Avenger: Hawkeye (Clint Barton)
Agenda
• Chaos engineering and AWS FIS
• What are experiment templates?
• How to write experiment templates with JSON
• Why we need to write experiment templates
• Conclusion
Chaos engineering and AWS FIS
Distributed systems are complex
https://aws.amazon.com/builders-library/challenges-with-distributed-systems/
[Figure: a client and a server exchanging Message and Reply across a network]
Traditional testing is not enough
TESTING = VERIFYING A KNOWN CONDITION
• Unit testing of components: tested in isolation to ensure each function meets expectations
• Functional testing of integrations: each execution path tested to assure expected results
What Chaos Engineering is:
• Experimenting on a system
• Identifying failures
• Fixing failures before they become outages
What Chaos Engineering is meant to do:
• Improve resilience and performance
• Uncover hidden issues
• Expose blind spots (monitoring, observability, and alarms)
Chaos Engineering: Testing the Unknowns
STRESS, OBSERVE, IMPROVE
Phases of chaos engineering
1. Steady state
2. Hypothesis
3. Run experiment
4. Verify
5. Improve
Challenges in Chaos Engineering
1. Stitch together different tools and homemade scripts
2. Agents or libraries required to get started
3. Difficult to ensure safety
4. Difficult to reproduce “real-world” events (multiple failures at once)
Fully managed chaos engineering service
• Easy to get started
• Real-world conditions
• Safeguards
AWS Fault Injection Simulator: overview
[Figure: experiment templates are managed through the AWS Management Console or the AWS Command Line Interface, with access controlled by AWS Identity and Access Management. The FIS engine starts experiments against AWS resources (compute, databases, networking, storage). Monitoring (Amazon CloudWatch alarms, Amazon EventBridge, AWS and third-party tools) feeds the FIS safeguards, which can stop an experiment.]
What are experiment templates?
[Figure: components of AWS FIS: actions, targets, experiment templates, and experiments]
Actions
Actions are the fault injection actions executed during an experiment.
Each action has an identifier of the form aws:<service-name>:<action-type>
Actions include:
• Fault type
• Duration
• Targeted resources
• Timing relative to any other actions
• Fault-specific parameters, such as rollback behavior or the portion of requests to throttle
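As a small illustration of the identifier format above, here is a minimal Python sketch that splits an action identifier into its service name and action type. The function name `parse_action_id` and the `ActionId` tuple are hypothetical helpers made up for this example; they are not part of AWS FIS or any AWS SDK.

```python
from typing import NamedTuple


class ActionId(NamedTuple):
    """The two variable parts of aws:<service-name>:<action-type>."""
    service: str
    action_type: str


def parse_action_id(action_id: str) -> ActionId:
    """Split an identifier of the form aws:<service-name>:<action-type>."""
    parts = action_id.split(":")
    # Expect exactly three non-empty parts, the first being the literal "aws".
    if len(parts) != 3 or parts[0] != "aws" or not all(parts):
        raise ValueError(f"not an AWS FIS action identifier: {action_id!r}")
    return ActionId(service=parts[1], action_type=parts[2])


parsed = parse_action_id("aws:ec2:stop-instances")
print(parsed.service, parsed.action_type)
```

A check like this only validates the shape of the identifier; whether the action actually exists still has to be confirmed against the FIS actions reference.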
Targets
Targets define one or more AWS resources on which to carry out an action.
Targets include:
• Resource type
• Resource IDs, tags, and filters
• Selection mode (e.g., ALL, COUNT(n), PERCENT(n))
Experiment templates
Experiment templates define an experiment and are used in the start-experiment request.
Experiment templates include:
• Actions
• Targets
• Stop condition alarms
• IAM role
• Description
• Tags
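The list above can double as a simple checklist. The following Python sketch reports which of those parts are absent from a draft template; the key names follow the JSON shown later in this deck, while `EXPECTED_SECTIONS` and `missing_sections` are hypothetical names chosen for this example. It checks only for presence and is not a substitute for the validation that AWS FIS itself performs.

```python
# Parts of an experiment template named on the slide, using the JSON key
# names from this deck ("tags" carries the template's Name).
EXPECTED_SECTIONS = (
    "actions",
    "targets",
    "stopConditions",
    "roleArn",
    "description",
    "tags",
)


def missing_sections(template: dict) -> list:
    """Return the expected top-level keys that the template does not define."""
    return [name for name in EXPECTED_SECTIONS if name not in template]


draft = {
    "description": "FIS Stop and Restart One Random Instance",
    "targets": {},
    "actions": {},
}
print(missing_sections(draft))
```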
Experiment template A
• Stop conditions: Amazon CloudWatch alarm
• Targets: specific EC2 instances (i-aaaa, i-bbbb, i-cccc)
• Actions: Action 1, Action 2
Experiment template B
• Stop conditions: Amazon CloudWatch alarms
• Targets: all EC2 instances with the “chaos-ready” tag
• Actions: Action 1, Action 2, Action 3
Related resources (Japanese)
Video
• Chaos Engineering starting guide (AWS Summit Online Japan 2021)
https://www.youtube.com/watch?v=9M13W0sYgks
builders.flash
• Graphic recording: How to start Chaos Engineering without chaos
https://aws.amazon.com/jp/builders-flash/202110/awsgeek-fault-injection-simulator/
• Hands-on: Let’s start your first experiment with AWS FIS
https://aws.amazon.com/jp/builders-flash/202111/try-chaos-engineering/
But you need automation
Now you can create experiment templates and run experiments from the console. But…
❶ We need to iterate this process (create templates, run experiments).
❷ How do we track changes?
❸ How do we map an experiment to the template version it used?
How to write experiment templates with JSON
Experiment template as JSON
  {
    "tags": {
      "Name": "StopAndRestartRandomInstance"
    },
    "description": "FIS Stop and Restart One Random Instance",
    "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole",
    "stopConditions": [
      {
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:0123456789:alarm:No_Traffic"
      }
    ],
    "targets": {
      "myInstance": {
        "resourceTags": {
          "Purpose": "chaos-ready"
        },
        "resourceType": "aws:ec2:instance",
        "selectionMode": "COUNT(1)"
      }
    },
    "actions": {
      "StopInstances": {
        "actionId": "aws:ec2:stop-instances",
        "description": "stop the instances",
        "parameters": {
          "startInstancesAfterDuration": "PT5M"
        },
        "targets": {
          "Instances": "myInstance"
        }
      }
    }
  }
The template has six parts: Name (tags), Description, IAM role (roleArn), Stop conditions, Targets, and Actions.
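Because a template is plain JSON, it can be checked before it is ever sent to AWS. The sketch below, in Python with only the standard library, parses a trimmed copy of the template and verifies that every target name used by an action is defined under "targets" and that every "startAfter" entry names an existing action. The function `undefined_references` is a hypothetical helper for this example, and these two checks are an assumption about what is worth linting, not a description of how AWS FIS validates templates.

```python
import json

# A trimmed copy of the template from this deck (straight quotes only).
TEMPLATE_JSON = """
{
  "description": "FIS Stop and Restart One Random Instance",
  "targets": {
    "myInstance": {
      "resourceTags": {"Purpose": "chaos-ready"},
      "resourceType": "aws:ec2:instance",
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "StopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {"startInstancesAfterDuration": "PT5M"},
      "targets": {"Instances": "myInstance"}
    }
  }
}
"""


def undefined_references(template: dict) -> list:
    """List actions that point at targets or actions the template does not define."""
    problems = []
    targets = template.get("targets", {})
    actions = template.get("actions", {})
    for name, action in actions.items():
        # Each value in an action's "targets" map is the name of a target.
        for target_name in action.get("targets", {}).values():
            if target_name not in targets:
                problems.append(f"{name}: undefined target {target_name}")
        # Each "startAfter" entry is the name of another action.
        for earlier in action.get("startAfter", []):
            if earlier not in actions:
                problems.append(f"{name}: undefined action {earlier}")
    return problems


template = json.loads(TEMPLATE_JSON)
print(undefined_references(template))
```

Note that `json.loads` rejects the curly quotes that presentation software tends to insert, so simply parsing the file already catches a common copy-and-paste error.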
Name, description, and IAM role
  "tags": {
    "Name": "StopAndRestartRandomInstance"
  },
  "description": "FIS Stop and Restart One Random Instance",
  "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole"
Name: the “Name” tag gives the experiment template its name, in the same way as for EC2 instances and other resources.
Description: a description of this experiment template (required).
IAM role: the ARN of the IAM role that grants the AWS FIS service permission to perform service actions.
Actions
  "actions": {
    "StopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT5M"
      },
      "targets": {
        "Instances": "AllTaggedInstances"
      }
    },
    "TerminateInstances": {
      "actionId": "aws:ec2:terminate-instances",
      "parameters": {},
      "targets": {
        "Instances": "RandomInstancesInAZ"
      },
      "startAfter": [
        "StopInstances"
      ]
    }
  }
There are two actions: StopInstances and TerminateInstances.
actionId: specify the action identifier. Each AWS FIS action has an identifier with the following format:
aws:<service-name>:<action-type>
See the documentation for details:
https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html
parameters: some actions take parameters. You can look them up in the documentation:
https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html
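The value "PT5M" used for startInstancesAfterDuration in this deck is an ISO 8601 duration meaning five minutes. As a rough sketch, the Python below converts such a string to a `timedelta` so that a script can sanity-check durations in a template. It handles only the hours, minutes, and seconds forms (for example PT5M or PT1H30M), which is an assumption made for brevity; `parse_duration` is a hypothetical helper, not an AWS API.

```python
import re
from datetime import timedelta

# Time-only ISO 8601 durations such as PT5M, PT90S, or PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value: str) -> timedelta:
    """Convert a time-only ISO 8601 duration string to a timedelta."""
    match = _DURATION.match(value)
    # Reject strings that do not match, and a bare "PT" with no components.
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


print(parse_duration("PT5M").total_seconds())
```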
targets: each action must specify its targets. Targets are described later.
startAfter: you can specify the order of actions with this attribute.
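To see what ordering a set of startAfter entries implies, the dependencies can be treated as a graph and sorted topologically. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) on the two actions from this deck. The helper name `start_order` is hypothetical. The result is one valid sequential order for review purposes; it is not a model of the FIS engine's scheduling, where actions without startAfter may start together.

```python
from graphlib import TopologicalSorter

# The two actions from this deck, reduced to the fields that affect ordering.
actions = {
    "StopInstances": {"actionId": "aws:ec2:stop-instances"},
    "TerminateInstances": {
        "actionId": "aws:ec2:terminate-instances",
        "startAfter": ["StopInstances"],
    },
}


def start_order(actions: dict) -> list:
    """Return action names so that each comes after everything in its startAfter."""
    # Map each action to the set of actions that must come before it.
    graph = {name: set(action.get("startAfter", [])) for name, action in actions.items()}
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


print(start_order(actions))
```

Running this kind of check in a pipeline also surfaces circular startAfter chains before a template is created.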
Targets
  "targets": {
    "AllTaggedInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Purpose": "chaos-ready"
      },
      "selectionMode": "ALL"
    },
    "RandomInstancesInAZ": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Purpose": "chaos-ready"
      },
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        },
        {
          "path": "State.Name",
          "values": ["running"]
        }
      ],
      "selectionMode": "PERCENT(50)"
    }
  }
There are two targets: AllTaggedInstances and RandomInstancesInAZ.
resourceType: each target must specify exactly one resource type. When you assign a target to an action, the target's resource type must be one that the action supports.
Resource types supported by AWS FIS:
• aws:ec2:instance
• aws:ec2:spot-instance
• aws:ecs:cluster
• aws:eks:nodegroup
• aws:iam:role
• aws:rds:cluster
• aws:rds:db
resourceTags: you can use tags to identify the AWS resources to target. Alternatively, you can list resources by ARN with the resourceArns attribute instead of tags.
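The two ways of identifying resources can be sketched as a small Python helper that builds a target definition from either tags or ARNs. The function `make_target` is hypothetical, the instance ARN is a made-up placeholder, and treating tags and ARNs as mutually exclusive reflects the "instead of" wording above rather than a rule verified against the service.

```python
def make_target(resource_type, selection_mode="ALL", *, tags=None, arns=None):
    """Build one target definition identified by tags or by ARNs (assumed exclusive)."""
    if bool(tags) == bool(arns):
        raise ValueError("give either tags or arns, but not both")
    target = {"resourceType": resource_type, "selectionMode": selection_mode}
    if tags:
        target["resourceTags"] = dict(tags)
    else:
        target["resourceArns"] = list(arns)
    return target


by_tag = make_target("aws:ec2:instance", tags={"Purpose": "chaos-ready"})
by_arn = make_target(
    "aws:ec2:instance",
    # Placeholder ARN for illustration only.
    arns=["arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"],
)
print(by_tag)
print(by_arn)
```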
filters: you can use resource filters to select resources with specific attributes. A filter's path describes how to reach an attribute in the output of the Describe action for the resource (for example, for aws:ec2:instance, the DescribeInstances API action is used).
For more details, see the following document:
https://docs.aws.amazon.com/fis/latest/userguide/targets.html#target-filters
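To make the idea of a path concrete, the Python sketch below walks a dotted path such as Placement.AvailabilityZone through a nested dictionary shaped like one instance record from DescribeInstances, then applies the two filters from this deck. The sample records are invented, and `resolve_path` and `matches` are hypothetical helpers; this imitates the concept locally and is not how AWS FIS evaluates filters internally.

```python
def resolve_path(document: dict, path: str):
    """Follow a dotted path through nested dictionaries; return None if absent."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(resource: dict, filters: list) -> bool:
    """A resource matches when every filter's path resolves to one of its values."""
    return all(resolve_path(resource, f["path"]) in f["values"] for f in filters)


filters = [
    {"path": "Placement.AvailabilityZone", "values": ["us-east-1a"]},
    {"path": "State.Name", "values": ["running"]},
]

# Invented records using field names from the DescribeInstances output.
instances = [
    {"InstanceId": "i-aaaa", "Placement": {"AvailabilityZone": "us-east-1a"}, "State": {"Name": "running"}},
    {"InstanceId": "i-bbbb", "Placement": {"AvailabilityZone": "us-east-1b"}, "State": {"Name": "running"}},
    {"InstanceId": "i-cccc", "Placement": {"AvailabilityZone": "us-east-1a"}, "State": {"Name": "stopped"}},
]

print([i["InstanceId"] for i in instances if matches(i, filters)])
```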
selectionMode: you can narrow down the identified resources using selectionMode.
The default is "ALL" (every identified resource becomes a target).
There are two other ways to scope:
• COUNT(n)
• PERCENT(n)
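The effect of the three modes can be imitated locally with a short Python sketch. It picks resources at random for COUNT(n) and PERCENT(n). Two behaviors here are assumptions made for the example and not taken from this deck: PERCENT rounds down, and COUNT is capped at the number of available resources. The function `select` is a hypothetical helper.

```python
import random
import re


def select(resources, mode, rng=None):
    """Imitate ALL, COUNT(n), and PERCENT(n) over a list of resource IDs."""
    rng = rng or random.Random()
    resources = list(resources)
    if mode == "ALL":
        return resources
    match = re.fullmatch(r"(COUNT|PERCENT)\((\d+)\)", mode)
    if not match:
        raise ValueError(f"unknown selection mode: {mode!r}")
    kind, n = match.group(1), int(match.group(2))
    if kind == "COUNT":
        size = min(n, len(resources))  # assumption: cap at what is available
    else:
        size = len(resources) * min(n, 100) // 100  # assumption: round down
    return rng.sample(resources, size)


ids = ["i-aaaa", "i-bbbb", "i-cccc", "i-dddd"]
print(select(ids, "ALL"))
print(len(select(ids, "COUNT(1)")))
print(len(select(ids, "PERCENT(50)")))
```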
Stop conditions
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:0123456789:alarm:No_Traffic"
    }
  ],
You can specify a CloudWatch alarm that stops your experiment when the alarm reaches its threshold.
• source: “none” or “aws:cloudwatch:alarm”
• value: the ARN of the CloudWatch alarm (required if the source is a CloudWatch alarm)
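The two rules above (source is either "none" or "aws:cloudwatch:alarm", and value is required for an alarm) are easy to check in a script before a template is submitted. The Python sketch below does so; `check_stop_conditions` is a hypothetical helper and covers only the rules stated on this slide.

```python
def check_stop_conditions(conditions: list) -> list:
    """Return a list of problems found in a stopConditions array."""
    problems = []
    for index, condition in enumerate(conditions):
        source = condition.get("source")
        if source == "none":
            continue  # no alarm guards the experiment; nothing more to check
        if source == "aws:cloudwatch:alarm":
            if not condition.get("value"):
                problems.append(f"stopConditions[{index}]: alarm ARN (value) is missing")
        else:
            problems.append(f"stopConditions[{index}]: unexpected source {source!r}")
    return problems


good = [{
    "source": "aws:cloudwatch:alarm",
    "value": "arn:aws:cloudwatch:us-east-1:0123456789:alarm:No_Traffic",
}]
bad = [{"source": "aws:cloudwatch:alarm"}]
print(check_stop_conditions(good))
print(check_stop_conditions(bad))
```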
Why we need to write experiment templates
But you need automation (repeat)
Now you can create experiment templates and run experiments from the console. But…
❶ We need to iterate this process (create templates, run experiments).
❷ How do we track changes?
❸ How do we map an experiment to the template version it used?
Use a VCS to track changes to experiment templates
Version 1:
  {
    "tags": { "Name": "StopAndRestartRandomInstance" },
    "description": "FIS Stop and Restart One Random Instance",
    "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole",
    "stopConditions": [{
      "source": "none"
    }],
    ...
  }
Version 2 (adds a stop condition):
  {
    "tags": { "Name": "StopAndRestartRandomInstance" },
    "description": "FIS Stop and Restart One Random Instance",
    "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole",
    "stopConditions": [{
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:0123456789:alarm:No_Traffic"
    }],
    ...
  }
Git repository: GitHub, Bitbucket, AWS CodeCommit, etc.
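One possible way to answer question ❸ (which template version did an experiment use?) is to derive a short fingerprint from the template content and record it alongside the Git commit. The Python sketch below hashes a canonical JSON form of the template so that key order and whitespace do not matter. The helper `template_fingerprint` and the idea of storing the result in a tag are suggestions for illustration, not something the deck or AWS FIS prescribes.

```python
import hashlib
import json


def template_fingerprint(template: dict) -> str:
    """Return a short, stable hash of the template content."""
    # Sorting keys and fixing separators gives the same text for the same content.
    canonical = json.dumps(template, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


version_1 = {
    "description": "FIS Stop and Restart One Random Instance",
    "stopConditions": [{"source": "none"}],
}
version_2 = {
    "description": "FIS Stop and Restart One Random Instance",
    "stopConditions": [{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:0123456789:alarm:No_Traffic",
    }],
}

print(template_fingerprint(version_1))
print(template_fingerprint(version_2))
```

The Git commit hash alone already identifies a version; a content fingerprint is mainly useful when the same template file can be deployed from several branches or pipelines.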
Automate updating and running experiment templates
[Figure: pipeline architecture]
1. The user pushes a template change to AWS CodeCommit, which triggers AWS CodePipeline.
2. Template Update Stage: AWS CloudFormation, or AWS CodeBuild running the AWS Command Line Interface (AWS CLI), creates or updates the experiment template.
3. Experiment Stage: AWS CodeBuild running the AWS CLI uses the template to run the experiment.
4. The experiment runs against the target environment (instances in an Auto Scaling group inside a VPC), with a CloudWatch alarm as the stop condition.
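The two AWS CLI steps in such a pipeline could look roughly like the commands assembled below. This Python sketch only builds and prints the command lines; it does not call AWS. The file name `template.json`, the template ID `EXT123abc`, the commit value, and the `Commit` tag key are placeholders invented for the example, and the exact options should be confirmed against the AWS CLI reference for `aws fis` before use.

```python
import shlex


def create_template_command(template_path: str) -> list:
    """Command for the Template Update Stage: create a template from a JSON file."""
    return [
        "aws", "fis", "create-experiment-template",
        "--cli-input-json", f"file://{template_path}",
    ]


def start_experiment_command(template_id: str, commit: str) -> list:
    """Command for the Experiment Stage: start an experiment, tagged with the commit."""
    return [
        "aws", "fis", "start-experiment",
        "--experiment-template-id", template_id,
        "--tags", f"Commit={commit}",
    ]


# Placeholder values for illustration only.
print(shlex.join(create_template_command("template.json")))
print(shlex.join(start_experiment_command("EXT123abc", "0a1b2c3")))
```

Tagging the experiment with the commit is one way to keep the link between an experiment run and the template version in the repository.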
Conclusion
Let’s automate your experiments!
• You can define your experiments as JSON/YAML.
• That is a good starting point for automating your experiments.
• Don’t forget to define a steady state and a hypothesis.
You can try this idea in the Chaos Engineering on AWS workshop:
https://chaos-engineering.workshop.aws
You can see a good example in AWS Resilience Hub.
Thank you!
© 2021, Amazon Web Services, Inc. or its affiliates.

AWS FIS の実験テンプレートを書いてみよう!!

  • 1.
    © 2021, AmazonWeb Services, Inc. or its affiliates. © 2021, Amazon Web Services, Inc. or its affiliates. Let's write your AWS FIS experiment templates!! Masao Kanamori Solutions Architect, DevAx Masao Kanamori
  • 2.
    © 2021, AmazonWeb Services, Inc. or its affiliates. About me Masao Kanamori  Title/Role : DevAx(Developer Acceleration) Team Solutions Architect  Favorite avengers: Hawkeye: Clint Barton
  • 3.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Agenda • Chaos engineering and AWS FIS • What is experiment templates • How to write experiment templates with JSON • Why we need to write experiment template • Conclusion 3
  • 4.
    © 2021, AmazonWeb Services, Inc. or its affiliates. © 2021, Amazon Web Services, Inc. or its affiliates. Chaos engineering and AWS FIS
  • 5.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Distributed systems are complex https://aws.amazon.com/builders-library/challenges-with-distributed-systems/ Message Message Reply Reply Server Network Client
  • 6.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Traditional testing is not enough TESTING = VERIFYING A KNOWN CONDITION Unit testing of components Tested in isolation to ensure function meets expectations Functional testing of integrations Each execution path tested to assure expected results
  • 7.
    © 2021, AmazonWeb Services, Inc. or its affiliates. What Chaos Engineering is: • Experimenting on a system • Identify failures • Fix failures before they become outages Chaos Engineering is meant to do: • Improve resilience and performance • Uncover hidden issues • Expose blind spots (monitoring, observability, and alarms) Chaos Engineering: Testing the Unknowns S O I T R E S S B S E R V E M P R O V E
  • 8.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Steady state Hypothesis Run experiment Verify Improve Phases of chaos engineering
  • 9.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Challenges in Chaos Engineering Difficult to ensure safety Stitch together different tools and homemade scripts 1 Agents or libraries required to get started 3 2 Difficult to reproduce “real-world” events (multiple failures at once) 4
  • 10.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Safeguards Real-world conditions Easy to get started Fully managed chaos engineering service
  • 11.
    © 2021, AmazonWeb Services, Inc. or its affiliates. AWS Fault Injection Simulator O V E R V I E W AWS Fault Injection Simulator Experiment template AWS Command Line Interface AWS Management Console AWS Identity and Access Management FIS safeguards FIS engine Compute Start experiment Third party AWS Amazon EventBridge Amazon CloudWatch alarms AWS resources Databases Networking Storage Compute Monitoring Stop experiment
  • 12.
    © 2021, AmazonWeb Services, Inc. or its affiliates. © 2021, Amazon Web Services, Inc. or its affiliates. What is experiment templates
  • 13.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Experiment templates Experiments Actions Targets Components
  • 14.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Actions are the fault injection actions executed during an experiment aws:<service-name>:<action-type> Actions include: • Fault type • Duration • Targeted resources • Timing relative to any other actions • Fault-specific parameters, such as rollback behavior or the portion of requests to throttle Actions
  • 15.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Targets Targets define one or more AWS resources on which to carry out an action Targets include: • Resource type • Resource IDs, tags, and filters • Selection mode (e.g., ALL, RANDOM)
  • 16.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Experiment templates define an experiment and are used in the start-experiment request Experiment templates include: • Actions • Targets • Stop condition alarms • IAM role • Description • Tags Experiment templates
  • 17.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Experiment template A Stop conditions Targets Actions Action 1 Action 2 Amazon CloudWatch alarm i-aaaa i-bbbb i-cccc Specific EC2 instances Experiment template B Stop conditions Targets Actions Action 3 Action 1 Action 2 Amazon CloudWatch alarms All EC2 instances with “chaos-ready” tag
  • 18.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Video • Chaos Engineering starting guide ( AWS Summit Online Japan 2021 ) https://www.youtube.com/watch?v=9M13W0sYgks Builders.flush • Graphic recording: How to start Chaos Engineering without chaos https://aws.amazon.com/jp/builders-flash/202110/awsgeek-fault- injection-simulator/ • Hands-on: Let’s start your first experiment with AWS FIS https://aws.amazon.com/jp/builders-flash/202111/try-chaos- engineering/ Related resources (Japanese) 18
  • 19.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Now you can create and run experiment from console. But… But you need automation Experiment templates Experiments Create Run ❷How to track change? ❶We need to iterate this process. ❸How to mapping which template version?
  • 20.
    © 2021, AmazonWeb Services, Inc. or its affiliates. © 2021, Amazon Web Services, Inc. or its affiliates. How to write experiment templates with JSON
  • 21.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Experiment template as JSON { "tags": { "Name": "StopAndRestartRandomeInstance" }, "description": ”FIS Stop and Restart One Random Instance", "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole", "stopConditions": [ { "source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:0123456789:alarm:No_Traffic" } ], "targets": { "myInstance": { "resourceTags": { "Purpose": "chaos-ready" }, "resourceType": "aws:ec2:instance", "selectionMode": "COUNT(1)” } }, "actions": { "StopInstances": { "actionId": "aws:ec2:stop-instances", "description": "stop the instances", "parameters": { "startInstancesAfterDuration": ”PT5M" }, "targets": { "Instances": "myInstance" } } } } Description IAM role Stop conditions Targets Actions Name
  • 22.
    © 2021, AmazonWeb Services, Inc. or its affiliates. "tags": { "Name": "StopAndRestartRandomeInstance" } Description IAM role Name "description": ”FIS Stop and Restart One Random Instance" "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole"
  • 23.
    © 2021, AmazonWeb Services, Inc. or its affiliates. "tags": { "Name": "StopAndRestartRandomeInstance" } Description IAM role Name "description": ”FIS Stop and Restart One Random Instance" "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole" We use “Name” tag for the name of the experiment template same as EC2 instances etc.
  • 24.
    © 2021, AmazonWeb Services, Inc. or its affiliates. "tags": { "Name": "StopAndRestartRandomeInstance" } Description IAM role Name "description": ”FIS Stop and Restart One Random Instance" "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole" Description about this experiment template.(required)
  • 25.
    © 2021, AmazonWeb Services, Inc. or its affiliates. "tags": { "Name": "StopAndRestartRandomeInstance" } Description IAM role Name "description": ”FIS Stop and Restart One Random Instance" "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole" ARN of the IAM role that grants the AWS FIS service permission to perform service actions.
  • 26.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Actions "actions": { "StopInstances": { "actionId": "aws:ec2:stop-instances", "parameters": { "startInstancesAfterDuration": ”PT5M" }, "targets": { "Instances": "AllTaggedInstances" } }, "TerminateInstances": { "actionId": "aws:ec2:terminate-instances", "parameters": {}, "targets": { "Instances": "RandomInstancesInAZ" }, "startAfter": [ "StopInstances" ] } }
  • 27.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Actions "actions": { "StopInstances": { "actionId": "aws:ec2:stop-instances", "parameters": { "startInstancesAfterDuration": ”PT5M" }, "targets": { "Instances": "AllTaggedInstances" } }, "TerminateInstances": { "actionId": "aws:ec2:terminate-instances", "parameters": {}, "targets": { "Instances": "RandomInstancesInAZ" }, "startAfter": [ "StopInstances" ] } } There are two actions StopInstances TerminateInstances
  • 28.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Actions "actions": { "StopInstances": { "actionId": "aws:ec2:stop-instances", "parameters": { "startInstancesAfterDuration": ”PT5M" }, "targets": { "Instances": "AllTaggedInstances" } }, "TerminateInstances": { "actionId": "aws:ec2:terminate-instances", "parameters": {}, "targets": { "Instances": "RandomInstancesInAZ" }, "startAfter": [ "StopInstances" ] } } Specify action identifier. Each AWS FIS action has an identifier with the following format: aws:<service-name>:<action-type> See the document for details. https://docs.aws.amazon.com/fis/latest/userguide/fis- actions-reference.html
  • 29.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Actions "actions": { "StopInstances": { "actionId": "aws:ec2:stop-instances", "parameters": { "startInstancesAfterDuration": ”PT5M" }, "targets": { "Instances": "AllTaggedInstances" } }, "TerminateInstances": { "actionId": "aws:ec2:terminate-instances", "parameters": {}, "targets": { "Instances": "RandomInstancesInAZ" }, "startAfter": [ "StopInstances" ] } } Some of actions have parameters. You can check it in the document. https://docs.aws.amazon.com/fis/latest/userguide/fis- actions-reference.html
  • 30.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Actions "actions": { "StopInstances": { "actionId": "aws:ec2:stop-instances", "parameters": { "startInstancesAfterDuration": ”PT5M" }, "targets": { "Instances": "AllTaggedInstances" } }, "TerminateInstances": { "actionId": "aws:ec2:terminate-instances", "parameters": {}, "targets": { "Instances": "RandomInstancesInAZ" }, "startAfter": [ "StopInstances" ] } } You need to specify targets. What is a target will be described later.
  • 31.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Actions "actions": { "StopInstances": { "actionId": "aws:ec2:stop-instances", "parameters": { "startInstancesAfterDuration": ”PT5M" }, "targets": { "Instances": "AllTaggedInstances" } }, "TerminateInstances": { "actionId": "aws:ec2:terminate-instances", "parameters": {}, "targets": { "Instances": "RandomInstancesInAZ" }, "startAfter": [ "StopInstances" ] } } You can specify the order of actions with this attribute.
  • 32.
    © 2021, AmazonWeb Services, Inc. or its affiliates. Actions "actions": { "StopInstances": { "actionId": "aws:ec2:stop-instances", "parameters": { "startInstancesAfterDuration": ”PT5M" }, "targets": { "Instances": "AllTaggedInstances" } }, "TerminateInstances": { "actionId": "aws:ec2:terminate-instances", "parameters": {}, "targets": { "Instances": "RandomInstancesInAZ" }, "startAfter": [ "StopInstances" ] } }
  • 33.
    © 2021, AmazonWeb Services, Inc. or its affiliates. "targets": { "AllTaggedInstances": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, "selectionMode": "ALL" }, "RandomInstancesInAZ": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, filters: [ { path: 'Placement.AvailabilityZone’, values: [‘us.east.1a’] }, { path: 'State.Name’, values: ['running’] } ] "selectionMode": ”PERCENT(50)" } Targets
  • 34.
    © 2021, AmazonWeb Services, Inc. or its affiliates. "targets": { "AllTaggedInstances": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, "selectionMode": "ALL" }, "RandomInstancesInAZ": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, filters: [ { path: 'Placement.AvailabilityZone’, values: [‘us.east.1a’] }, { path: 'State.Name’, values: ['running’] } ] "selectionMode": "PERCENT(50)" } Targets There are two targets AllTarggedInstances RandomInstancesInAZ
  • 35.
    © 2021, AmazonWeb Services, Inc. or its affiliates. "targets": { "AllTaggedInstances": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, "selectionMode": "ALL" }, "RandomInstancesInAZ": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, filters: [ { path: 'Placement.AvailabilityZone’, values: [‘us.east.1a’] }, { path: 'State.Name’, values: ['running’] } ] "selectionMode": "PERCENT(50)" } Targets You must specify exactly one resource type. And when you specify a target for an action, the target must be the resource type supported by the action Resource types supported by AWS FIS • aws:ec2:instance • aws:ec2:spot-instance • aws:ecs:cluster • aws:eks:nodegroup • aws:iam:role • aws:rds:cluster • aws:rds:db
  • 36.
    © 2021, AmazonWeb Services, Inc. or its affiliates. "targets": { "AllTaggedInstances": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, "selectionMode": "ALL" }, "RandomInstancesInAZ": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, filters: [ { path: 'Placement.AvailabilityZone’, values: [‘us.east.1a’] }, { path: 'State.Name’, values: ['running’] } ] "selectionMode": "PERCENT(50)" } Targets You can use tags to specify AWS resources for target. Of course you can use ARN using resourceArns attribute instead tag.
  • 37.
    © 2021, AmazonWeb Services, Inc. or its affiliates. "targets": { "AllTaggedInstances": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, "selectionMode": "ALL" }, "RandomInstancesInAZ": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, filters: [ { path: 'Placement.AvailabilityZone’, values: [‘us.east.1a’] }, { path: 'State.Name’, values: ['running’] } ] "selectionMode": "PERCENT(50)" } Targets You can use resource filter to specify resource with specific attributes. You can describe the path to reach an attribute in the output of the Describe action for a resource. (ex: for aws:ec2:instance , DescribeInstances API action is used) More details , see following document: https://docs.aws.amazon.com/fis/latest/userguide/targets.html#target-filters
  • 38.
    Targets
    "targets": { "AllTaggedInstances": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, "selectionMode": "ALL" }, "RandomInstancesInAZ": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, "filters": [ { "path": "Placement.AvailabilityZone", "values": ["us-east-1a"] }, { "path": "State.Name", "values": ["running"] } ], "selectionMode": "PERCENT(50)" } }
    You can scope the identified resources with selectionMode. The default is "ALL" (every identified resource becomes a target). Two other modes narrow the scope: COUNT(n) and PERCENT(n).
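The three selection modes can be sketched locally as follows. This is only an illustration of the idea, not how AWS FIS is implemented: the function name is hypothetical, and rounding PERCENT(n) down to a whole number of resources is an assumption that should be checked against the FIS documentation.

```python
import random
import re


def select_targets(resources, selection_mode, rng=None):
    """Pick targets from already-identified resources according to selectionMode."""
    rng = rng or random.Random()
    resources = list(resources)
    if selection_mode == "ALL":
        return resources
    match = re.fullmatch(r"(COUNT|PERCENT)\((\d+)\)", selection_mode)
    if match is None:
        raise ValueError(f"unsupported selectionMode: {selection_mode!r}")
    kind, n = match.group(1), int(match.group(2))
    if kind == "COUNT":
        size = min(n, len(resources))
    else:
        # Assumption: a fractional result is rounded down.
        size = min(len(resources) * n // 100, len(resources))
    return rng.sample(resources, size)  # random choice without repetition


instance_ids = ["i-aaa", "i-bbb", "i-ccc", "i-ddd"]
half = select_targets(instance_ids, "PERCENT(50)")
one = select_targets(instance_ids, "COUNT(1)")
everything = select_targets(instance_ids, "ALL")
```

With four identified instances, PERCENT(50) yields two of them and COUNT(1) yields one, chosen at random on each call.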
  • 39.
    Targets
    "targets": { "AllTaggedInstances": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, "selectionMode": "ALL" }, "RandomInstancesInAZ": { "resourceType": "aws:ec2:instance", "resourceTags": { "Purpose": "chaos-ready" }, "filters": [ { "path": "Placement.AvailabilityZone", "values": ["us-east-1a"] }, { "path": "State.Name", "values": ["running"] } ], "selectionMode": "PERCENT(50)" } }
  • 40.
    Stop conditions
    "stopConditions": [ { "source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:0123456789:alarm:No_Traffic" } ],
    You can specify a CloudWatch alarm that stops your experiment when the alarm reaches its threshold.
    source: "none" or "aws:cloudwatch:alarm".
    value: the ARN of the CloudWatch alarm (required if the source is a CloudWatch alarm).
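The two rules on this slide (source is "none" or "aws:cloudwatch:alarm", and value is required for an alarm source) are easy to check before a template is submitted. The following is a minimal sketch of such a pre-flight check; the function name and message wording are hypothetical, and it validates only what the slide states.

```python
VALID_SOURCES = {"none", "aws:cloudwatch:alarm"}


def validate_stop_conditions(stop_conditions):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for index, condition in enumerate(stop_conditions):
        source = condition.get("source")
        if source not in VALID_SOURCES:
            problems.append(f"stopConditions[{index}]: unknown source {source!r}")
        elif source == "aws:cloudwatch:alarm" and not condition.get("value"):
            problems.append(f"stopConditions[{index}]: alarm source needs a value (alarm ARN)")
    return problems


# The stop condition from the slide, with its placeholder ARN kept as-is.
ok = validate_stop_conditions([
    {"source": "aws:cloudwatch:alarm",
     "value": "arn:aws:cloudwatch:0123456789:alarm:No_Traffic"}
])
missing_value = validate_stop_conditions([{"source": "aws:cloudwatch:alarm"}])
```

A check like this could run in a CI step so that a malformed stop condition is caught before the template reaches AWS FIS.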
  • 41.
    Experiment template as JSON
    { "tags": { "Name": "StopAndRestartRandomeInstance" }, "description": "FIS Stop and Restart One Random Instance", "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole", "stopConditions": [ { "source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:0123456789:alarm:No_Traffic" } ], "targets": { "myInstance": { "resourceTags": { "Purpose": "chaos-ready" }, "resourceType": "aws:ec2:instance", "selectionMode": "COUNT(1)" } }, "actions": { "StopInstances": { "actionId": "aws:ec2:stop-instances", "description": "stop the instances", "parameters": { "startInstancesAfterDuration": "PT5M" }, "targets": { "Instances": "myInstance" } } } }
    Callouts: Name, Description, IAM role, Stop conditions, Targets, Actions
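Because each action refers to a target by name ("Instances": "myInstance"), a mistyped name is an easy error to make when writing a template by hand. The sketch below parses the template from this slide and reports any action that points at an undefined target. The function name is hypothetical and the check is a local lint only; it does not replace validation by AWS FIS.

```python
import json

# The template from the slide, with placeholder account IDs kept as-is.
TEMPLATE_JSON = """
{
  "tags": { "Name": "StopAndRestartRandomeInstance" },
  "description": "FIS Stop and Restart One Random Instance",
  "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole",
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:0123456789:alarm:No_Traffic" }
  ],
  "targets": {
    "myInstance": {
      "resourceTags": { "Purpose": "chaos-ready" },
      "resourceType": "aws:ec2:instance",
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "StopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "description": "stop the instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "myInstance" }
    }
  }
}
"""


def undefined_target_references(template):
    """List (action name, target name) pairs whose target is not defined."""
    defined = set(template.get("targets", {}))
    missing = []
    for action_name, action in template.get("actions", {}).items():
        for target_name in action.get("targets", {}).values():
            if target_name not in defined:
                missing.append((action_name, target_name))
    return missing


template = json.loads(TEMPLATE_JSON)
problems = undefined_target_references(template)
```

For the slide's template the list is empty, since StopInstances refers to myInstance and myInstance is defined under targets.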
  • 42.
    Why we need to write experiment templates
  • 43.
    Now you can create and run experiments from the console. But you need automation (repetition).
    [Diagram: experiment templates are created, then experiments are run from them]
    ❶ We need to iterate this process.
    ❷ How do we track changes?
    ❸ How do we map an experiment to the template version it used?
  • 44.
    Using a VCS to track changes to experiment templates
    Version 1: { "tags": { "Name": "StopAndRestartRandomeInstance" }, "description": "FIS Stop and Restart One Random Instance", "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole", "stopConditions": [{ "source": "none" }], ... }
    Version 2 (adds a stop condition): { "tags": { "Name": "StopAndRestartRandomeInstance" }, "description": "FIS Stop and Restart One Random Instance", "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole", "stopConditions": [{ "source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:0123456789:alarm:No_Traffic" }], ... }
    Git repository: GitHub, Bitbucket, AWS CodeCommit, etc.
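What a version control system shows for the change from Version 1 to Version 2 can be approximated locally with Python's standard difflib. This is a small sketch: only the fields visible on the slide are included (the elided "..." parts are left out), and the file labels are hypothetical.

```python
import difflib
import json

version1 = {
    "tags": {"Name": "StopAndRestartRandomeInstance"},
    "description": "FIS Stop and Restart One Random Instance",
    "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole",
    "stopConditions": [{"source": "none"}],
}

# Version 2 only changes the stop condition.
version2 = dict(version1)
version2["stopConditions"] = [{
    "source": "aws:cloudwatch:alarm",
    "value": "arn:aws:cloudwatch:0123456789:alarm:No_Traffic",
}]


def as_lines(template):
    """Serialize with stable key order so the diff shows only real changes."""
    return json.dumps(template, indent=2, sort_keys=True).splitlines()


diff = list(difflib.unified_diff(
    as_lines(version1), as_lines(version2),
    fromfile="template.json (version 1)",
    tofile="template.json (version 2)",
    lineterm="",
))
print("\n".join(diff))
```

Serializing with sorted keys and fixed indentation keeps the diff limited to the stop condition, which is the same reason it helps to commit templates in a consistent format.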
  • 45.
    Automate updating and running experiment templates
    [Architecture diagram] A user pushes to AWS CodeCommit, which triggers an AWS CodePipeline pipeline.
    Template Update Stage: AWS CloudFormation, or the AWS Command Line Interface (AWS CLI) run from AWS CodeBuild, creates or updates the experiment template.
    Experiment Stage: the AWS CLI run from AWS CodeBuild runs the experiment against the target environment (a VPC with instances in an Auto Scaling group), using a CloudWatch alarm as the stop condition.
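The two AWS CLI steps in this pipeline can be sketched as plain command lists that a build script might hand to `subprocess.run`. The helper names, the file name, and the template ID below are hypothetical; the sketch assumes the `aws fis create-experiment-template` and `aws fis start-experiment` commands and builds the commands without executing them.

```python
import shlex


def create_template_command(template_path):
    """Template Update Stage: create a template from a JSON file in the repo."""
    return [
        "aws", "fis", "create-experiment-template",
        "--cli-input-json", f"file://{template_path}",
    ]


def start_experiment_command(template_id):
    """Experiment Stage: start an experiment from an existing template."""
    return [
        "aws", "fis", "start-experiment",
        "--experiment-template-id", template_id,
    ]


# Hypothetical file name and template ID, for illustration only.
create_cmd = create_template_command("experiment-template.json")
start_cmd = start_experiment_command("EXT-EXAMPLE-ID")

# Print the commands as they would appear in a build log.
print(shlex.join(create_cmd))
print(shlex.join(start_cmd))
```

In a real build step the lists would be passed to `subprocess.run(cmd, check=True)` so that a failed CLI call fails the pipeline stage.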
  • 46.
    Conclusion
  • 47.
    Let's automate your experiments!
    • You can define your experiments as JSON/YAML.
    • This is a good starting point for automating your experiments.
    • Don't forget to define a steady state and a hypothesis.
    You can try this idea in the Chaos Engineering on AWS workshop: https://chaos-engineering.workshop.aws
  • 48.
    You can see good examples in AWS Resilience Hub.
  • 49.
    Thank you!

Editor's Notes

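  • #2 Good morning. Let's begin the session "Let's write your AWS FIS experiment templates".
  • #3 First, a brief self-introduction. My name is Masao Kanamori. I work at Amazon Web Services Japan G.K. as a Solutions Architect on the Developer Acceleration team. The Developer Acceleration team supports developers, mainly by creating content and planning events for them. ----- By the way, my favorite Avenger is Hawkeye. I'm already looking forward to the drama series that starts next week.
  • #4 Here is today's agenda. I'll start with an explanation of chaos engineering and AWS FIS. -- deleted -- Next, I'll introduce experiment templates and how to write them in JSON. Finally, I'll talk about why this approach is needed.
  • #5 Now, let's talk about chaos engineering and AWS FIS. Timing: 01:21 -> 01:03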
  • #6 Many modern systems are distributed systems, and distributed systems are complex. Even a typical client-server application is made up of multiple modules, as this diagram shows. Even a single request involves several steps that can each fail. As the number of requests and modules grows, things get more complex still. Your code and systems need to handle these failures correctly. Distributed systems are complex. Engineers working on distributed systems must test for all aspects of failure from the client, network, and servers, as these do not share fate. And they must ensure that code (on both client and server) always behaves correctly in light of those failures. Taking a look at this example, we can see that there are several steps involved in ensuring success with this operation, and there are several different permutations of possible failure on this simple distributed system, across thousands or millions of requests.
  • #7 Testing is important for operating and improving such systems with confidence. However, traditional testing alone is not enough, because tests can only confirm that what you already "know" works correctly. Timing: 1min 49sec. Traditional testing such as unit tests and functional tests is required, but doesn't always address the complexity of a production environment. Running these tests in isolation often only verifies a known condition. What about the random errors that we aren't expecting, the configuration drifts, network errors, and so on: the unknown conditions?
  • #8 Chaos engineering is a method for testing and discovering the "unknown". It runs experiments in the place where the system actually operates. It aims to find unknown problems and fix them before they become outages. Chaos engineering lets you continuously strengthen resiliency and performance. It also helps you find problems you couldn't see and hidden monitoring points in your system. Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. Chaos Engineering lets you compare what you think will happen to what actually happens in your systems. You literally "break things on purpose" to learn how to build more resilient systems. You can think of it as a preventative measure, to avoid larger compounding issues down the road. The end goal is to: improve resilience and performance; uncover hidden issues; expose blind spots (monitoring, observability, and alarms).
  • #9 This diagram shows the phases of chaos engineering. First, define the steady state. Next, form a hypothesis that the steady state will be maintained even if some problem occurs. Run an experiment to verify the hypothesis and, if any part of it is disproved, improve the system. Iterating through this cycle to improve the system is the basic cycle of chaos engineering. There are 5 phases to Chaos Engineering. Steady State: define steady state as some measurable output of a system that indicates normal behavior. For example, a weather monitoring application should be able to fetch weather data and display it to the user within a certain tolerance. Hypothesis: create a hypothesis that this steady state will continue in a control group and an experimental group (our testing group). Run Experiment: introduce variables that reflect real-world events such as servers that crash, malfunctioning hard drives (returning no data, or incorrect data), breaking network connections, and so on. Verify: after running the tests, verify whether or not the hypothesis was correct (did steady state continue through experimentation when compared to the control group?). Improve: fix whatever the experiment disproved. This is very similar to the PDCA method, which is used in control and continuous improvement of processes and products: https://en.wikipedia.org/wiki/PDCA
  • #10 One of the difficulties of chaos engineering is building the mechanism for running experiments. You may need to create tools or scripts for the experiments. Chaos engineering tools often require installing agents or libraries. Especially if you aim to run in production, you also need a mechanism that ensures safety. Reproducing the variety of events that occur in real environments can also be difficult. There are many open source tools for chaos engineering; however, the processes for these tools may be complicated and support options may be limited. Additional scripting may be required, which can make it harder to get up and running with Chaos Engineering. Compatibility of the libraries/agents required by these open source tools may be limited. If we're performing testing in a production or high-profile environment, we want to be able to limit the extent of potential issues from an experiment. Without those guardrails in place, an experiment can go sideways quickly, affect the rest of the environment, and cause an outage. We want to be able to simulate failure in software as well as hardware (for example, multiple server failures at once as well as an application microservice failure).
  • #11 To solve these challenges, a fully managed chaos engineering service was needed: one that is easy to get started with, can reproduce real-world problems, and has safety mechanisms built in. AWS Fault Injection Simulator is a fully managed chaos engineering service. It is designed to be easy to get started with and to allow you to test your systems against real-world failures, whether they are simple (such as stopping an instance) or more complex. AWS Fault Injection Simulator fully embraces the idea of safeguards, which are a way to monitor the blast radius of the experiment and stop it if certain alarms are set off.
  • #12 With AWS Fault Injection Simulator you can start right away, without installing agents. Because it operates on your actual AWS environment, it is also easier to reproduce realistic problems. To run experiments safely, there is a feature called stop conditions, which integrates with CloudWatch alarms. This lets you stop an experiment before it causes an unexpected failure. At a high level, we start with an experiment template, which comprises different fault injection actions, the targets that will be affected, and the safeguards to be run during the experiment. When you start the experiment, FIS performs the actions (fault injection) on the AWS resources that are specified as the targets. You can monitor the experiment using CloudWatch, and FIS can be integrated with EventBridge, which lets you integrate with your existing monitoring tools. Once started, experiments stop automatically when all the actions are complete, or you can optionally configure the experiment to stop when an alarm or event is triggered. Once the experiment is complete, you can view its results to identify any performance, observability, or resilience issues.
  • #13 From here, I'll explain the components that make up AWS FIS, focusing on experiment templates, the theme of this session. Timing: 05:02 -> 4:59
  • #14 These are the four main components of AWS FIS. Today, to explain experiment templates, I'll start with actions and targets. Actions define what the experiment does. Targets define the resources it applies to. Experiment templates combine actions and targets to determine the content of an experiment. Experiments are run from an experiment template. Diving deep into the components of FIS, there are four main components that are part of the Fault Injection Simulator: Actions, Targets, Experiment templates, and Experiments. We will go through each of the components in the next few slides.
  • #15 An action defines the fault injection performed in an experiment. It defines the type of fault, its related parameters, its duration, and so on. An action is the fault injection activity that you run on targets using AWS Fault Injection Simulator (AWS FIS). There are multiple preconfigured actions, each aimed at specific types of targets across various AWS services. Action parameters include the action type, that is, the type of action that FIS runs. Various types of actions are available, including FIS actions (API Internal Error, Throttle Error, Unavailable Error) and EC2 actions (Stop/Reboot/Terminate Instances). When creating the action, you can also pass other parameters, such as how long the action should run (Duration) and which targets the action should apply to (Target).
  • #16 A target is the AWS resource into which an action is injected. It defines the resource type, the IDs or tags that identify the resources, and how the resources are selected. Timing: 6:33. Targets: a target can be a specific resource in your AWS environment, or one or more resources that match criteria that you specify, for example, resources that have specific tags. For example, a target can be a specific RDS instance that you want to fail over as part of the experiment, or your application server instances that all have a specific tag like "App: MyDemoAppInstances". Discovery Questions: Are your target resources already designed and configured for scalability and resilience? Do you have dev/test/staging environments that are configured "the same" as production? If they are "similar", do they differ in anything other than scale, e.g. both have autoscaling enabled but in dev/test/staging the ASG is configured for fewer instances? For your EC2 workloads, what is your mix of Linux and Windows? What types of resources are you targeting? EC2, EKS, ECS, databases, serverless?
  • #17 Combine actions and targets to define an experiment template. An experiment template also includes the IAM role that AWS FIS uses and the stop conditions. Experiment Templates: an experiment template contains one or more actions to run on specified targets during an experiment. It also contains the stop conditions that prevent the experiment from going out of bounds. After you create an experiment template, you can use it to run an experiment. An experiment template consists of the following components. Action set: the AWS FIS actions that you want to run. You must specify at least one action set in your experiment template. Actions can run in a set order that you specify, or they can run simultaneously. Targets: one or more AWS resources on which a specific action is carried out. IAM role: the ARN of an IAM role that grants the AWS FIS service permission to carry out actions on your behalf. Stop conditions: one or more CloudWatch alarms. If a stop condition is triggered while an experiment is running, AWS FIS stops the experiment. Description: a description of the experiment. Tags: optionally, you can add tags to your FIS experiment template.
  • #18 This diagram shows what an experiment template represents. As shown, you can specify multiple targets, actions, and stop conditions. Here we can see two experiment templates that showcase two different types of action sets and targets. The first targets specific EC2 instances, runs its actions sequentially, and has a single CloudWatch alarm added as a stop condition. The second targets the instances with a specific tag; actions 1 and 2 run simultaneously, and action 3 is triggered after action 2 completes. As you can see, we can add more than one CloudWatch alarm as a stop condition that can stop the FIS experiment.
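  • #19 For more details on chaos engineering and AWS FIS, see these related resources.
  • #20 With AWS FIS you can run experiments easily and safely. Running an experiment manually is something you can do right away. But to make real use of chaos engineering, you also need automation. Ideally, you can rerun the same experiment to confirm that an improvement worked. In addition, experiment templates themselves have no versioning feature. As you operate them, you will want to track changes to experiment templates and map them to the experiment history. That is why it is useful to manage experiment templates like code.
  • #21 Experiment templates can be written in JSON. If you use CloudFormation, you can also define them in YAML. Here I'll explain using JSON. Timing: total 07:48 -> 7:04; lap 02:20 -> 2:05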
  • #22 This is an example of an experiment template expressed in JSON. Let's look at it in detail. This shows a complete experiment template comprising all the components we have discussed so far. This experiment template creates an action to stop the instances that have the tag "chaos-ready" and, based on the selection mode, picks one random instance that has the tag. We also have a stop condition that monitors a CloudWatch alarm, "No_Traffic"; if the alarm is triggered, the FIS experiment will stop. Discovery Questions: Are you planning to build complex / timed failure patterns? Are you planning to re-use experiments? Are you looking to adopt templating / coding patterns in your use of FIS? Are you considering using FIS in the context of a CI/CD pipeline? If so, what are your use cases?
  • #23 First, let's look at the experiment template's name and description.
  • #24 Like many other AWS resources, an experiment template can have a Name tag. It isn't required, but I recommend setting it so the template is easy to find in the Management Console. (Point: not required, but better to add it.)
  • #25 Description holds a description of the experiment. It is a required field.
  • #26 For roleArn, specify the ARN of the IAM role that AWS FIS uses for the experiment. Create an IAM role with least-privilege permissions for your AWS FIS experiments.
  • #27 Next, Actions. This shows a sample experiment template with two actions: 1. StopInstances, which targets all the instances tagged with specific tags and is configured to run for a duration of 10 minutes. 2. TerminateInstances, which waits for the StopInstances action to complete and then targets the target group "RandomInstancesInAZ".
  • #28 Two actions are defined here: StopInstances and TerminateInstances.
  • #29 First, specify actionId. AWS FIS has predefined actions, each with its own identifier. The format is aws, colon, service name, colon, action type. See the reference in the documentation for details. Timing: 02:00 from the start of the section.
  • #30 Depending on the action type, an action may accept parameters. In this example, the startInstancesAfterDuration parameter specifies how long to wait before restarting the stopped instances. See the documentation for the parameters available for each action type.
  • #31 In targets, specify a target that is defined later in the template.
  • #32 With the startAfter attribute you can also control the order in which actions run. In this example, TerminateInstances runs after the StopInstances action.
  • #33 This is the full picture of the action definitions explained so far.
  • #34 Next, let's look at Targets. Here is an illustration of the two targets that were used in the actions earlier. The target "AllTaggedInstances" uses tags to select any instance in the region that has the tag "Purpose: chaos-ready". The second target, "RandomInstancesInAZ", selects instances that are in the running state in a particular AZ (in this case us-east-1a).
  • #35 Here too, two targets are defined.
  • #36 A target must specify exactly one resource type. It must be a resource type supported by the action that references the target.
  • #37 You can use tags to specify particular resources of that resource type. You can also specify resources directly by ARN.
  • #38 You can also use resource filters to specify resources by their attributes. This lets you specify, for example, "running EC2 instances". See the documentation for more details.
  • #39 selectionMode determines how the final targets are chosen from among the identified resources. The default is ALL, so every resource that meets the conditions becomes a target. Alternatively, COUNT() specifies a concrete number and PERCENT() specifies a percentage.
  • #40 This is the full picture of the targets explained so far.
  • #41 Finally, specify stop conditions. A stop condition automatically stops the experiment when the specified CloudWatch alarm reaches its threshold. As a guardrail, it prevents or mitigates any negative impact of the experiment on your workload. Always set one when running in production.
  • #42 That covers how to write an experiment template.
  • #43 So, what do we gain from being able to write experiment templates this way? Timing: 12:33, 11:52
  • #44 Let me restate the challenges. We needed to automate, and we needed to be able to track changes.
  • #45 First, because experiment templates can be defined in JSON, you can manage them in a Git repository such as AWS CodeCommit or GitHub. This makes managing and tracking changes easy.
  • #46 Next, automating experiments. You can define a pipeline that runs experiments, triggered by changes to the template or to the application. You can deploy the experiment template with CloudFormation, or by running the AWS CLI from CodeBuild. I recommend CloudFormation, both for its IaC capabilities and so that the CloudWatch alarm used as the stop condition is defined together with the template. Running an experiment from the deployed template can also be automated with the AWS CLI from CodeBuild. As in this example, writing experiment templates in JSON or a similar format lets chaos engineering use the same mechanisms, such as CI/CD, that you already use for application code.
  • #47 Summary. Timing: 14:11, 13:31
  • #48 Experiment templates can be defined in JSON or YAML. I think this is a good starting point for automating experiments. Of course, defining the steady state and hypothesis is important in chaos engineering, so don't forget that. If you'd like to try the automation yourself, try the Chaos Engineering on AWS workshop. Please note that it is currently not available in Japanese.
  • #49 Also, with the recently announced AWS Resilience Hub, you can test whether your application meets the RTO/RPO you specify. As part of that, experiment templates for testing the specified application are generated automatically. They are worth looking at as samples. A key part of Resilience Hub is the integration with other AWS services. We already talked a bit about Fault Injection Simulator. We're also integrated with AWS CloudFormation, AWS Systems Manager, Route 53 ARC, and Amazon CloudWatch. And that list of integrated services will continue to grow.
  • #50 Thank you for your attention. I hope this encourages you to take on chaos engineering. If you have any questions, feel free to reach out anytime. Let's keep enjoying JAWS PANKRATION.