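Most modern systems are distributed systems.
And distributed systems are complex.
Even a typical client-server application is made up of multiple modules, as this diagram shows.
Even a single request involves multiple steps, each of which can fail.
As the number of requests and modules grows, the system becomes even more complex.
Your code and systems need to handle these failures correctly.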
Distributed systems are complex.
Engineers working on distributed systems must test for every kind of failure in the client, the network, and the servers, because these components do not share fate. They must also ensure that code on both client and server always behaves correctly in the face of those failures.
Looking at this example, we can see that several steps are involved in completing this one operation. Even this simple distributed system has many permutations of possible failure, multiplied across thousands or millions of requests.
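Testing is essential for operating and improving systems like this with confidence.
However, conventional testing alone is not enough.
That is because tests can only confirm that the things you already know about behave correctly.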
Traditional testing, such as unit tests and functional tests, is required but doesn't always address the complexity of a production environment. Running these tests in isolation often verifies only a known condition.
What about the unknown conditions: the random errors we aren't expecting, configuration drift, network errors, and so on?
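Chaos engineering is a method for testing and discovering what you do not know.
Chaos engineering runs experiments in the environment where the system actually operates.
It aims to uncover unknown problems and fix them before they turn into outages.
With chaos engineering, you can continuously strengthen resiliency and performance.
You can also find problems that were not visible and monitoring points in the system that had been hidden.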
Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.
Chaos Engineering lets you compare what you think will happen with what actually happens in your systems. You literally “break things on purpose” to learn how to build more resilient systems. You can think of it as a preventative measure that avoids larger, compounding issues down the road.
The end goal is to:
Improve resilience and performance
Uncover hidden issues
Expose blind spots (monitoring, observability, and alarms)
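This diagram shows the phases of chaos engineering.
First, you define the steady state. Next, you form a hypothesis that the steady state will be maintained even when some kind of problem occurs.
You run an experiment to verify that hypothesis, and if any part of it is disproved, you make improvements.
Improving the system by repeating this cycle is the basic loop of chaos engineering.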
There are five phases to Chaos Engineering:
Steady State: Define steady state as some measurable output of a system that indicates normal behavior. For example, a weather monitoring application should be able to fetch weather data and display it to the user within a certain tolerance.
Hypothesis: In this stage, create a hypothesis that this steady state will continue in both a control group and an experimental group (our testing group).
Run Experiment: Introduce variables that reflect real-world events, such as servers that crash, malfunctioning hard drives (returning no data or incorrect data), and broken network connections.
Verify: After running the tests, verify whether the hypothesis was correct (did steady state continue through experimentation when compared to the control group?).
Improve: If any part of the hypothesis was disproved, fix the weakness and repeat the cycle.
This cycle is very similar to the PDCA (plan-do-check-act) method, which is used in the control and continuous improvement of processes and products.
https://en.wikipedia.org/wiki/PDCA
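The Verify phase can be sketched as a comparison between a control group and an experimental group. This is a minimal illustration, not a prescribed method: the function name `steady_state_holds`, the 5% relative tolerance, and the latency samples are all assumptions made up for the example.

```python
from statistics import mean


def steady_state_holds(control, experiment, tolerance=0.05):
    """Return True if the experiment group's mean metric stays within
    `tolerance` (relative difference) of the control group's mean."""
    baseline = mean(control)
    observed = mean(experiment)
    if baseline == 0:
        # Avoid dividing by zero: only an identical zero metric counts as steady.
        return observed == 0
    return abs(observed - baseline) / abs(baseline) <= tolerance


# Hypothetical latencies (ms) for fetching weather data in each group.
control_latency = [120, 118, 125, 121]
experiment_latency = [123, 127, 122, 126]

hypothesis_confirmed = steady_state_holds(control_latency, experiment_latency)
print("steady state held:", hypothesis_confirmed)
```

If the check returns False, the hypothesis was disproved and the Improve phase begins.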
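One of the difficulties of chaos engineering is building the mechanism for running experiments.
You may need to create tools or scripts for the experiments.
Chaos engineering tools often require you to install agents or libraries.
Especially if you aim to run experiments in production, you also need a mechanism that ensures safety.
It can also be difficult to reproduce the variety of events that occur in real environments.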
There are many open source tools for chaos engineering; however, the processes around these tools can be complicated and support options may be limited. Additional scripting may be required, which can make it harder to get up and running with Chaos Engineering.
The libraries and agents that these open source tools require may have limited compatibility.
If we're testing in a production or high-profile environment, we want to limit the extent of potential issues from an experiment. Without those guardrails in place, an experiment can go sideways quickly, affect the rest of the environment, and cause an outage.
We want to be able to simulate failure in software as well as hardware (for example, multiple server failures at once as well as an application microservice failure).
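To solve these challenges, a fully managed chaos engineering service was needed.
Such a service is easy to get started with, can reproduce real-world problems, and has safety mechanisms built in.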
AWS Fault Injection Simulator is a fully managed chaos engineering service. It is designed to be easy to get started with and to let you test your systems against real-world failures, whether they are simple (such as stopping an instance) or more complex.
AWS Fault Injection Simulator fully embraces the idea of safeguards, which are a way to monitor the blast radius of the experiment and stop it if certain alarms are set off.
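With AWS Fault Injection Simulator, you can get started right away without installing an agent.
Because it operates on your actual AWS environment, realistic problems are easier to reproduce.
To run experiments safely, it provides a feature called stop conditions, which integrates with CloudWatch alarms.
This lets you stop an experiment before it causes an unexpected outage.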
At a high level, we start with an experiment template, which comprises the fault injection actions, the targets that will be affected, and the safeguards that apply during the experiment.
When you start the experiment, FIS performs the actions (fault injection) on the AWS resources specified as targets. You can monitor the experiment using CloudWatch, and FIS integrates with EventBridge, which lets you connect it to your existing monitoring tools.
Once started, an experiment stops automatically when all the actions are complete. You can optionally configure it to stop when an alarm or event is triggered.
Once the experiment is complete, you can view its results to identify any performance, observability, or resilience issues.
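The lifecycle described above (start from a template, run, reach an end state) can be sketched as a small polling loop. This is a sketch under assumptions: `fis_client` is expected to behave like the boto3 FIS client (`boto3.client("fis")`), whose `start_experiment` and `get_experiment` calls are used here. The function name `run_experiment`, the polling interval, and the template ID in the usage note are hypothetical.

```python
import time

# Statuses in which the experiment is still in progress; anything else is
# treated as an end state (for example completed, stopped, or failed).
ACTIVE_STATES = {"pending", "initiating", "running", "stopping"}


def run_experiment(fis_client, template_id, client_token,
                   poll_seconds=10, sleep=time.sleep):
    """Start an experiment from a template and poll until it finishes.

    Returns the final status string reported by the service.
    """
    started = fis_client.start_experiment(
        clientToken=client_token,
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    while True:
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status not in ACTIVE_STATES:
            return status
        sleep(poll_seconds)
```

With real credentials this might be called as `run_experiment(boto3.client("fis"), "EXT-hypothetical-id", "my-unique-token")`. The `sleep` parameter exists only so the loop can be exercised without waiting.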
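These are the four main components of AWS FIS.
Today, to explain experiment templates, I will start with actions and targets.
Actions define what the experiment does.
Targets define the resources it applies to.
Experiment templates combine actions and targets to determine the content of the experiment.
Experiments are what actually runs, based on an experiment template.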
Diving deeper into FIS, we have four main components that make up the Fault Injection Simulator.
* Actions
* Targets
* Experiment templates
* Experiments.
We will go through each one of the components in the next few slides.
An action defines the fault injection performed during an experiment.
It defines the type of fault along with related parameters such as the run duration.
An action is the fault injection activity that you run on targets using AWS Fault Injection Simulator (AWS FIS).
There are multiple preconfigured actions, each aimed at specific types of targets across various AWS services.
Action parameters include:
Action type – The type of action that FIS runs. The available types include FIS actions (API Internal Error, Throttle Error, Unavailable Error) and EC2 actions (Stop, Reboot, and Terminate Instances), among others.
When creating the action, you can also pass other parameters, such as how long the action should run (Duration) and which targets it applies to (Target).
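As a rough illustration of these parameters, an action can be written as a plain dictionary in the shape an experiment template uses. The helper `make_action` and the target name `AllTaggedInstances` are assumptions for this sketch. The action ID `aws:ec2:stop-instances` and its `startInstancesAfterDuration` parameter (an ISO 8601 duration) are real FIS names, but check the action reference for the parameters each action accepts.

```python
def make_action(action_id, target_key, target_name, parameters=None, description=""):
    """Build one action entry for an experiment template.

    action_id   -- the action type, e.g. "aws:ec2:stop-instances"
    target_key  -- the target slot the action expects, e.g. "Instances"
    target_name -- the name of a target defined elsewhere in the template
    """
    action = {
        "actionId": action_id,
        "targets": {target_key: target_name},
    }
    if parameters:
        action["parameters"] = dict(parameters)
    if description:
        action["description"] = description
    return action


# Stop the targeted instances and start them again after 10 minutes.
stop_instances = make_action(
    "aws:ec2:stop-instances",
    "Instances",
    "AllTaggedInstances",  # hypothetical target name
    parameters={"startInstancesAfterDuration": "PT10M"},
    description="Stop tagged instances for 10 minutes",
)
print(stop_instances)
```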
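A target is the AWS resource into which an action injects a fault.
It defines the resource type, the IDs or tags that identify the resources, and how the resources are selected.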
Targets:
A target can be a specific resource in your AWS environment, or one or more resources that match criteria that you specify, for example, resources that have specific tags.
For example, a target can be a specific RDS instance that you want to fail over as part of the experiment, or your application server instances that all have a specific tag like “App: MyDemoAppInstances”.
Discovery Questions:
Are your target resources already designed and configured for scalability and resilience?
Do you have dev/test/staging environments that are configured "the same" as production? If they are "similar", do they differ in anything other than scale, e.g. both have autoscaling enabled but in dev/test/staging the ASG is configured for fewer instances?
For your EC2 workloads, what is your mix of Linux and Windows?
What types of resources are you targeting? EC2, EKS, ECS, databases, serverless?
You define an experiment template by combining actions and targets.
The experiment template also includes the IAM role that AWS FIS uses and the stop conditions.
Experiment Templates:
An experiment template contains one or more actions to run on specified targets during an experiment. It also contains the stop conditions that prevent the experiment from going out of bounds. After you create an experiment template, you can use it to run an experiment.
An experiment template consists of the following components:
Action set - An action set contains the AWS FIS actions that you want to run. You must specify at least one action set in your experiment template. Actions can be run in a set order that you specify, or they can be run simultaneously.
Targets - One or more AWS resources on which a specific action is carried out.
IAM Role - The ARN of an IAM role that grants the AWS FIS service permission to carry out actions on your behalf.
Stop conditions - One or more CloudWatch alarms. If a stop condition is triggered while an experiment is running, AWS FIS stops the experiment.
Description - A description of the experiment.
Tags - Optionally, you can add tags to your FIS experiment template.
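The component list above can be turned into a quick pre-flight check before a template is submitted. This is only a sketch: the helper `missing_components` is hypothetical, and the set of required keys reflects my reading of the template structure (description, role ARN, actions, and stop conditions present; every target an action refers to must be defined), not an official validator.

```python
REQUIRED_KEYS = ("description", "roleArn", "actions", "stopConditions")


def missing_components(template):
    """Return a list describing what a draft experiment template still lacks."""
    missing = [key for key in REQUIRED_KEYS if not template.get(key)]
    defined_targets = set(template.get("targets", {}))
    for action_name, action in template.get("actions", {}).items():
        # Each action maps a target slot (e.g. "Instances") to a target name.
        for target_name in action.get("targets", {}).values():
            if target_name not in defined_targets:
                missing.append(f"targets.{target_name} (used by {action_name})")
    return missing


# A draft that forgot the IAM role and the target definition.
draft = {
    "description": "Stop tagged instances",
    "actions": {
        "StopInstances": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "AllTaggedInstances"},
        }
    },
    "stopConditions": [{"source": "none"}],
}
problems = missing_components(draft)
print(problems)
```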
This diagram shows what an experiment template represents.
As the diagram shows, you can specify multiple targets, actions, and stop conditions.
Here we can see two experiment templates that showcase two different types of action sets and targets.
The first template targets specific EC2 instances, runs its actions sequentially, and has a single CloudWatch alarm as its stop condition.
The second template targets instances with a specific tag. Actions 1 and 2 run simultaneously, and action 3 is triggered after action 2 completes. As you can see, we can add more than one CloudWatch alarm as stop conditions that can stop the FIS experiment.
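The difference between the two templates comes down to each action's `startAfter` list, which is the field FIS templates use to order actions. The sketch below groups actions into "waves" to show the resulting order. The function `start_waves` and the action names are made up for illustration, and the wave model is a simplification: in a real experiment an action starts as soon as the actions it waits on complete, not when a whole wave finishes.

```python
def start_waves(actions):
    """Group action names into waves; actions in one wave can start together."""
    remaining = dict(actions)
    finished = set()
    waves = []
    while remaining:
        ready = sorted(
            name for name, action in remaining.items()
            if set(action.get("startAfter", [])) <= finished
        )
        if not ready:
            raise ValueError("startAfter references form a cycle or name an unknown action")
        waves.append(ready)
        finished.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# First template: every action waits for the previous one.
sequential = {
    "action1": {},
    "action2": {"startAfter": ["action1"]},
    "action3": {"startAfter": ["action2"]},
}
# Second template: actions 1 and 2 start together, action 3 waits for action 2.
parallel = {
    "action1": {},
    "action2": {},
    "action3": {"startAfter": ["action2"]},
}
print(start_waves(sequential))
print(start_waves(parallel))
```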
This is an example of an experiment template expressed in JSON. Let's look at it in detail.
This shows a complete experiment template, which comprises all the components we have discussed so far:
This experiment template creates an action to stop instances that have the tag “chaos-ready” and, based on the selection mode, picks one random instance that has the tag.
We also have a stop condition that monitors the CloudWatch alarm “No_Traffic”; if the alarm is triggered, the FIS experiment stops.
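The slide itself is not reproduced in these notes, so the following is a reconstruction from the description above rather than the original JSON. The account ID, role name, Region, target name, and the tag key `Purpose` are placeholders or assumptions. The field names (`targets`, `actions`, `stopConditions`, `selectionMode`, and so on) follow the FIS experiment template format.

```python
import json

ACCOUNT_ID = "111122223333"  # placeholder account ID
REGION = "us-east-1"         # assumed Region

template = {
    "description": "Stop one random chaos-ready instance",
    "roleArn": f"arn:aws:iam::{ACCOUNT_ID}:role/FisExperimentRole",  # hypothetical role
    "targets": {
        "OneTaggedInstance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Purpose": "chaos-ready"},
            # COUNT(1) picks one matching instance at random.
            "selectionMode": "COUNT(1)",
        }
    },
    "actions": {
        "StopInstances": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "OneTaggedInstance"},
        }
    },
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            "value": f"arn:aws:cloudwatch:{REGION}:{ACCOUNT_ID}:alarm:No_Traffic",
        }
    ],
    "tags": {"Name": "StopOneChaosReadyInstance"},
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

The same dictionary could be passed as keyword arguments to the boto3 FIS client's `create_experiment_template` call, together with a `clientToken`.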
Discovery Questions:
Are you planning to build complex / timed failure patterns? Are you planning to re-use experiments?
Are you looking to adopt templating / coding patterns in your use of FIS?
Are you considering using FIS in the context of a CI/CD pipeline? If so, what are your use cases?
First, let's look at the name and description of the experiment template.
This shows a sample experiment template where there are two actions:
1. StopInstances, which targets all the instances that are tagged with a specific tag and is configured to run for a duration of 10 minutes.
2. TerminateInstances, which waits for the above StopInstances action to complete and then targets the target group “RandomInstanceinAZ”.
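As with many other AWS resources, you can set a Name tag on an experiment template.
It is not required, but we recommend setting it to make the template easier to find in the Management Console.
(The point to make here: not required, but better to add one.)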
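Description holds a description of this experiment. It is a required field.
For roleArn, specify the ARN of the IAM role that AWS FIS uses for the experiment.
Create an IAM role for AWS FIS experiments that is granted only least-privilege permissions.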
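One way to picture such a role is as two policy documents: a trust policy that lets the FIS service (`fis.amazonaws.com`) assume the role, and a permissions policy limited to what the experiment needs. The sketch below assumes an experiment that only stops and restarts EC2 instances tagged `Purpose: chaos-ready`. The tag condition and the exact action list are illustrative choices, not a recommended baseline.

```python
import json

# Trust policy: allows the AWS FIS service to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "fis.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy: only the EC2 calls this hypothetical experiment needs,
# and only on instances that carry the chaos-ready tag.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:StopInstances", "ec2:StartInstances"],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringEquals": {"aws:ResourceTag/Purpose": "chaos-ready"}
            },
        },
        {
            "Effect": "Allow",
            "Action": "ec2:DescribeInstances",
            "Resource": "*",
        },
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy, indent=2))
```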
Next is Actions.
Two actions are defined here.
They are StopInstances and TerminateInstances.
This is the full picture of the action definitions described so far.
Next, let's look at Targets.
Here is an illustration of the two targets that were used in the actions earlier:
The target “AllTaggedInstances” uses tags to select any instance running in the region that has the tag “Purpose: chaos-ready”.
The second target, “RandomInstancesinAZ”, uses filters on instance state and Availability Zone to select running instances in a particular AZ (in this case, ‘us-east-1a’).
Again, two targets are defined here.
A target must specify exactly one resource type.
It must be a resource type supported by the actions that reference the target.
You can use tags to select specific resources within that resource type.
You can also specify resources directly by ARN.
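The two ways of identifying resources correspond to the `resourceTags` and `resourceArns` fields of a target. The helper below is a hypothetical convenience function. It assumes that a single target uses one approach or the other, not both, which matches my understanding of the FIS documentation but should be confirmed there. The instance ARN is a placeholder.

```python
def make_target(resource_type, selection_mode="ALL", *, tags=None, arns=None):
    """Build a target that identifies resources by tags or by ARNs (exactly one)."""
    if bool(tags) == bool(arns):
        raise ValueError("specify exactly one of tags or arns")
    target = {"resourceType": resource_type, "selectionMode": selection_mode}
    if tags:
        target["resourceTags"] = dict(tags)
    else:
        target["resourceArns"] = list(arns)
    return target


# Select by tag: every instance tagged Purpose=chaos-ready.
by_tag = make_target("aws:ec2:instance", tags={"Purpose": "chaos-ready"})

# Select by ARN: one specific instance (placeholder account and instance ID).
by_arn = make_target(
    "aws:ec2:instance",
    arns=["arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0"],
)
print(by_tag)
print(by_arn)
```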
You can also use resource filters to select resources according to their attributes.
This lets you specify, for example, "running EC2 instances".
For more details, see the documentation.
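A resource filter is a list of `path`/`values` pairs evaluated against each resource's attributes. The sketch below imitates that evaluation locally to show the semantics as I understand them: a dotted path walks into nested attributes, the values of one filter are alternatives, and all filters must match. The functions and sample instances are illustrative and are not FIS's implementation.

```python
def attribute_at(resource, path):
    """Follow a dotted path such as "State.Name" through nested dictionaries."""
    value = resource
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches_filters(resource, filters):
    """True if every filter matches; a filter matches any one of its values."""
    return all(attribute_at(resource, f["path"]) in f["values"] for f in filters)


# Filters for "running EC2 instances in us-east-1a".
filters = [
    {"path": "State.Name", "values": ["running"]},
    {"path": "Placement.AvailabilityZone", "values": ["us-east-1a"]},
]

# Made-up instance descriptions, shaped like EC2 DescribeInstances output.
instances = [
    {"InstanceId": "i-aaa", "State": {"Name": "running"},
     "Placement": {"AvailabilityZone": "us-east-1a"}},
    {"InstanceId": "i-bbb", "State": {"Name": "stopped"},
     "Placement": {"AvailabilityZone": "us-east-1a"}},
    {"InstanceId": "i-ccc", "State": {"Name": "running"},
     "Placement": {"AvailabilityZone": "us-east-1b"}},
]
matching = [i["InstanceId"] for i in instances if matches_filters(i, filters)]
print(matching)
```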
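selectionMode lets you specify how the final targets are chosen from among the resources you identified.
The default is ALL, so every resource that meets the conditions becomes a target.
Alternatively, you can specify a concrete number with COUNT(),
or a percentage with PERCENT().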
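To make the three modes concrete, here is a small local simulation of how a selection mode string narrows a list of matching resources. It is a sketch of the behavior, not FIS code. The function name is hypothetical, PERCENT is assumed to round down, and COUNT larger than the number of resources is clamped here purely for simplicity.

```python
import random
import re


def select_targets(resources, selection_mode, rng=random):
    """Pick targets from `resources` according to ALL, COUNT(n), or PERCENT(n)."""
    resources = list(resources)
    if selection_mode == "ALL":
        return resources
    match = re.fullmatch(r"(COUNT|PERCENT)\((\d+)\)", selection_mode)
    if match is None:
        raise ValueError(f"unsupported selection mode: {selection_mode}")
    kind, n = match.group(1), int(match.group(2))
    if kind == "COUNT":
        size = min(n, len(resources))     # clamp: an assumption of this sketch
    else:
        size = len(resources) * n // 100  # assumed to round down
    # Random choice without replacement among the matching resources.
    return rng.sample(resources, size)


instance_ids = ["i-1", "i-2", "i-3", "i-4", "i-5"]
print(select_targets(instance_ids, "ALL"))
print(select_targets(instance_ids, "COUNT(1)"))
print(select_targets(instance_ids, "PERCENT(50)"))
```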
This is the full picture of the targets described so far.
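Finally, specify the stop conditions.
With a stop condition, the experiment is stopped automatically when the specified CloudWatch alarm reaches its threshold.
As a guardrail, it can prevent or mitigate negative impact from the experiment on your workload.
Always configure one when you run experiments in production.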
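In a template, stop conditions are a list of entries whose `source` is either `aws:cloudwatch:alarm` (with the alarm ARN as `value`) or `none`. The helpers below are a hypothetical sketch: one builds that list from alarm ARNs, and the other mimics the guardrail decision locally by checking whether any watched alarm is in the ALARM state. The ARNs use a placeholder account ID, and `High_Error_Rate` is an invented alarm name.

```python
def stop_conditions_for(alarm_arns):
    """Build the stopConditions list for an experiment template."""
    if not alarm_arns:
        # No guardrail configured; avoid this in production.
        return [{"source": "none"}]
    return [{"source": "aws:cloudwatch:alarm", "value": arn} for arn in alarm_arns]


def should_stop(alarm_states):
    """True if any watched alarm is currently in the ALARM state."""
    return any(state == "ALARM" for state in alarm_states.values())


alarms = [
    "arn:aws:cloudwatch:us-east-1:111122223333:alarm:No_Traffic",
    "arn:aws:cloudwatch:us-east-1:111122223333:alarm:High_Error_Rate",
]
stop_conditions = stop_conditions_for(alarms)
print(stop_conditions)
print(should_stop({"No_Traffic": "OK", "High_Error_Rate": "ALARM"}))
```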
That covers how to write an experiment template.
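In addition, with the recently announced AWS Resilience Hub, you can test whether your application meets the RTO/RPO you specify.
As part of that, experiment templates for testing the specified application are generated automatically.
They are worth looking at as samples.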
A key part of Resilience Hub is its integration with other AWS services. We already talked a bit about Fault Injection Simulator. Resilience Hub is also integrated with AWS CloudFormation, AWS Systems Manager, Amazon Route 53 Application Recovery Controller (Route 53 ARC), and Amazon CloudWatch, and that list of integrated services will continue to grow.