Managing ECS hosts with AWS Lambda and Step Functions
Terraform at Comtravo
➢ Six environments maintained by Terraform.
➢ Integrated into our CI/CD pipeline.
➢ Each environment has:
○ 500+ AWS components.
○ 43 Lambdas.
○ 25 microservices.
CI/CD at Comtravo: Mono-repo Pull request
CI/CD at Comtravo: Mono-repo Merge to master
ECS at Comtravo
ECS: Many interesting challenges
One such challenge:
Update EC2 hosts in an ECS cluster
Update EC2 hosts in an ECS cluster: Use cases
➢ You have a custom AMI for your ECS cluster(s).
➢ You always want to roll out the latest ECS-optimized AMIs.
➢ You want to rotate the admin keys.
➢ You want to change the instance type.
➢ You want to use an updated user_data script.
Update EC2 hosts in an ECS cluster: The process
➢ Terraform emits an AWS CloudWatch event once a new launch
configuration has been created.
➢ Detach "old instances" from the ASG and wait for capacity.
➢ "Move" services from old instances to new instances.
➢ Terminate old instances once no tasks are running on them.
➢ Alert on failures.
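The steps above can be sketched in Python roughly as follows. This is a minimal sketch and not the Lambda code from the talk: the function names are hypothetical, the clients are assumed to be boto3 `autoscaling` and `ecs` clients passed in by the caller, and "moving" services is assumed to be done by setting the old container instances to DRAINING.

```python
def select_old_instances(asg_instances, new_launch_config_name):
    """Return IDs of instances not launched from the new launch configuration.

    `asg_instances` is the `Instances` list of one group as returned by
    describe_auto_scaling_groups.
    """
    return [
        inst["InstanceId"]
        for inst in asg_instances
        if inst.get("LaunchConfigurationName") != new_launch_config_name
    ]


def detach_old_instances(asg_client, asg_name, instance_ids):
    """Detach instances without lowering desired capacity, so the ASG
    launches replacements from its current launch configuration."""
    if not instance_ids:
        return
    asg_client.detach_instances(
        AutoScalingGroupName=asg_name,
        InstanceIds=instance_ids,
        ShouldDecrementDesiredCapacity=False,
    )


def drain(ecs_client, cluster, container_instance_arns):
    """Assumption: services are 'moved' by marking the old hosts as DRAINING."""
    ecs_client.update_container_instances_state(
        cluster=cluster,
        containerInstances=container_instance_arns,
        status="DRAINING",
    )


def is_idle(ecs_client, cluster, container_instance_arn):
    """True when list_tasks reports no tasks on the instance.

    An empty first page is enough to conclude that nothing is running.
    """
    response = ecs_client.list_tasks(
        cluster=cluster, containerInstance=container_instance_arn
    )
    return len(response["taskArns"]) == 0
```

In a Step Functions workflow, each function would typically sit behind its own state, with `is_idle` polled in a wait loop before the old instance is terminated.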
Terraform + AWS Events + AWS Step Functions = Awesome
[Figure: Terraform announces "I created a new launch configuration lc-1234 for ASG asg-1234 belonging to ECS cluster cluster-A".]
AWS CloudWatch Events
[Figure: timeline of events on the CloudWatch event bus. ECS host events ("Task A started", "Task C started", "Task B stopped") are interleaved with custom events.]
Terraform Event Emitter
resource "null_resource" "launch-config-update" {
  # Emit a custom CloudWatch event describing the new launch configuration.
  provisioner "local-exec" {
    command = "python ${path.module}/scripts/emit_launchconfig_event.py --launch_configuration_name ${aws_launch_configuration.ecs-lc.name} --autoscaling_group_name ${aws_autoscaling_group.ecs-asg.name} --ami ${var.aws_ami} --cluster ${var.cluster}"
  }

  # Re-run the provisioner whenever the launch configuration name changes.
  triggers = {
    launchConfigurationName = "${aws_launch_configuration.ecs-lc.name}"
  }
}
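The contents of `emit_launchconfig_event.py` are not shown on the slides. A minimal sketch of what such an emitter could look like is given below, assuming boto3 is available and that `--cluster` carries the cluster ARN. The `--environment` flag, its default, and the function names are hypothetical and only there to make the sketch self-contained.

```python
import argparse
import json


def build_entry(launch_configuration_name, autoscaling_group_name, ami,
                cluster_arn, environment):
    """Build one PutEvents entry describing the new launch configuration."""
    detail = {
        "ami": ami,
        "status": "ACTIVE",
        "agentConnected": False,
        "autoscalingGroupName": autoscaling_group_name,
        "environment": environment,
        "clusterArn": cluster_arn,
        "launchConfigurationName": launch_configuration_name,
    }
    return {
        "Source": "comtravo.terraform.{}".format(environment),
        "DetailType": "ECS Launch Configuration Change",
        "Resources": [launch_configuration_name],
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
    }


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Emit a launch configuration event")
    parser.add_argument("--launch_configuration_name", required=True)
    parser.add_argument("--autoscaling_group_name", required=True)
    parser.add_argument("--ami", required=True)
    parser.add_argument("--cluster", required=True)
    # Hypothetical flag: the slides do not show where the environment comes from.
    parser.add_argument("--environment", default="alpha")
    return parser.parse_args(argv)


def main(argv=None):
    import boto3  # imported here so the helpers above work without boto3

    args = parse_args(argv)
    entry = build_entry(
        args.launch_configuration_name,
        args.autoscaling_group_name,
        args.ami,
        args.cluster,
        args.environment,
    )
    response = boto3.client("events").put_events(Entries=[entry])
    if response["FailedEntryCount"]:
        raise SystemExit("failed to emit event: {}".format(response["Entries"]))
```

Calling `main()` at the bottom of the script would make it usable from the `local-exec` provisioner above.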
Terraform Event
{
  "version": "0",
  "id": "f24d8f1c-8c3f-9b62-cb3c-54430739fc55",
  "source": "comtravo.terraform.alpha",
  "account": "1234567890",
  "time": "2018-05-09T13:35:43Z",
  "region": "eu-west-1",
  "resources": [
    "ct-backend-ecs-alpha-t2.large-generic20180509133303168200000003"
  ],
  "detail": {
    "ami": "ami-bfb5fec6",
    "status": "ACTIVE",
    "agentConnected": false,
    "autoscalingGroupName": "ct-backend-ecs-alpha-t2.large-generic20180503065507554700000005",
    "environment": "alpha",
    "clusterArn": "arn:aws:ecs:eu-west-1:1234567890:cluster/ct-backend-ecs-alpha",
    "launchConfigurationName": "ct-backend-ecs-alpha-t2.large-generic20180509133303168200000003"
  },
  "detail-type": "ECS Launch Configuration Change"
}
AWS CloudWatch Event Rules
resource "aws_cloudwatch_event_rule" "ecs-manager" {
  name        = "capture-ecs-events-${terraform.workspace}"
  description = "Capture ECS related events"

  event_pattern = <<PATTERN
{
  "source": [
    "comtravo.terraform.${terraform.workspace}"
  ],
  "detail-type": [
    "ECS Launch Configuration Change"
  ],
  "detail": {
    "clusterArn": [
      "arn:aws:ecs:${var.region}:${var.ct_account_id}:cluster/ct-backend-ecs-${terraform.workspace}"
    ],
    "status": ["ACTIVE"]
  }
}
PATTERN
}
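To see why a rule like this catches the Terraform event, it can help to model the matching in a few lines of Python. The sketch below is a deliberately simplified model, not the CloudWatch implementation: it only covers exact-value matching on scalar fields and nested objects, and the function name `matches` is hypothetical.

```python
def matches(pattern, event):
    """Return True if every field named in the pattern matches the event.

    A pattern leaf is a list of accepted values; a nested dict is matched
    recursively. Fields that the pattern does not mention are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {
    "source": ["comtravo.terraform.alpha"],
    "detail-type": ["ECS Launch Configuration Change"],
    "detail": {
        "clusterArn": ["arn:aws:ecs:eu-west-1:1234567890:cluster/ct-backend-ecs-alpha"],
        "status": ["ACTIVE"],
    },
}

event = {
    "source": "comtravo.terraform.alpha",
    "detail-type": "ECS Launch Configuration Change",
    "detail": {
        "ami": "ami-bfb5fec6",
        "status": "ACTIVE",
        "clusterArn": "arn:aws:ecs:eu-west-1:1234567890:cluster/ct-backend-ecs-alpha",
    },
}

rule_fires = matches(pattern, event)
```

Because the pattern pins the `source` to a single workspace, an event emitted from another environment does not trigger this rule.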
AWS Step Functions
DEMO
Questions
You all have been awesome!!!
Extras
ECS Challenge #1
ECS AGENT DISCONNECTS
#1 ECS agent disconnects - Initial solution
➢ A cron job on each ECS host sends an SNS notification and
restarts the ECS agent.
➢ The ECS agent is likely to fail again because of some inherent
problem within the instance.
#1 ECS agent disconnects - Better solution
➢ Detect ECS agent disconnects.
➢ Boot up a new ECS host and wait for it to be healthy.
➢ "Move" all the existing containers from the problematic
instance to a new instance.
➢ Terminate the problematic instance.
➢ Alert on failures.
#1 ECS agent disconnects: Detection
How do we detect ECS agent disconnects?
AWS CloudWatch Events to the rescue!
#1 ECS agent disconnects: ECS Events
[Figure: timeline of ECS events: "Task A started", "Task C started", "Task B stopped", followed by "ECS agent disconnected", "ECS agent connected", "ECS agent disconnected".]
#1 ECS agent disconnects: Filter ECS Events
{
  "detail": {
    "agentConnected": [
      false
    ],
    "clusterArn": [
      "arn:aws:ecs:eu-west-1:1234567890:cluster/ct-backend-ecs-qa"
    ],
    "status": [
      "ACTIVE"
    ]
  },
  "detail-type": [
    "ECS Container Instance State Change"
  ],
  "source": [
    "aws.ecs"
  ]
}
#1 ECS agent disconnects: Trigger step function
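One way to trigger the state machine from a matched event is a small Lambda-style function like the sketch below. It is an illustration under assumptions, not the setup from the talk: the function names and the shape of the execution input are hypothetical, and `sfn_client` is assumed to be a boto3 `stepfunctions` client. A CloudWatch event rule can also target a state machine directly, which would make this glue code unnecessary.

```python
import json


def build_execution_input(event):
    """Pick the fields the workflow needs from an
    'ECS Container Instance State Change' event."""
    detail = event["detail"]
    return {
        "clusterArn": detail["clusterArn"],
        "containerInstanceArn": detail["containerInstanceArn"],
        "ec2InstanceId": detail["ec2InstanceId"],
    }


def start_replacement(sfn_client, state_machine_arn, event):
    """Start one execution for the disconnected host and return its ARN."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(build_execution_input(event)),
    )
    return response["executionArn"]
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, which is useful because agent disconnects are hard to reproduce on demand.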

Zero downtime ECS host updates with Terraform
