Learn more about how Airtime works with microservices and ECS from start to finish: from developing against a local Vagrant environment, working with ECR and container images, utilizing CI/CD with CircleCI, to running a production workload on Amazon's ECS.
2. Airtime is a new social experience that lets real friends share real moments
in real time through group video, messaging, and more.
3. Things I promised I’d cover
● Microservices and ECS overview
● What does the Airtime architecture look like?
● Developing locally with Docker and Vagrant
● Testing and deploying with CircleCI
● Working with ECR
● We’re live on ECS!
● Demo time.
● Questions?
4. Monolith to microservices
● Rebuilt infrastructure a couple of months ago to solve significant issues:
○ Environment inconsistencies
○ Limited velocity
○ Unhappy developers
○ Struggles with configuration management
○ Hard to debug issues
● Moved from a single monolith to containerized microservices, built on AWS ECS
5. Why microservices and containers?
● Containers are atomic
● Can change a single piece without affecting the whole
● Most requirements live at the container level, which reduces the need for
heavily customized servers
● The same container can be tested locally, and then deployed remotely to staging
and production: consistency!
6. A little bit about ECS
● Amazon’s container management service: allows you to run Docker
containers on EC2 instances, and helps with scheduling, resource
management, etc.
● Native integration with other AWS features: ELB/ALB, IAM roles for services
and tasks, CloudWatch
● Containers run as tasks: task definitions are registered to services, which are registered to clusters.
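To make that hierarchy concrete, here is a minimal task definition sketch (the family name, image URL, and resource numbers are made up for illustration, not Airtime's actual values). A service references this family by name and runs it on a cluster:

```json
{
  "family": "my-service",
  "containerDefinitions": [
    {
      "name": "my-service",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:4f2a9c1d",
      "cpu": 128,
      "memory": 256,
      "essential": true,
      "portMappings": [{ "containerPort": 3000, "hostPort": 3000 }]
    }
  ]
}
```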
8. ELB as service discovery
● ECS automatically associates cluster instances with ELBs
● Gives us service discovery for free
● Looking forward: we can do this with ALBs
10. What are we aiming for with dev environments?
● Support developer productivity
● Low learning curve for contribution
● Should not require knowledge of the backend services themselves
● Should be repeatable, and self-contained
● Should mimic the actual staging and production environments as closely as
possible.
So how’d we do this?
11. Developing locally with Docker and Vagrant
● Every feature is developed and tested locally with Vagrant environments
● Allows us to quickly describe development environments (resources, ports,
etc.), and run a provisioner that sets up service-level requirements
● Increases velocity, since a working container can be developed locally,
pushed to ECR, and then deployed to staging or production
12. Why we use Vagrant
● Developer happiness, easy to start, easy to maintain provisioners
● Protection! Services run inside Vagrant environment, inaccessible from host
unless we explicitly forward ports
● Consistent and repeatable: developers work from identically set-up
environments, which cuts down on “it works locally!” bugs
13. Setting up the environment
● All we need is the Vagrantfile in our project’s root directory
● Vagrantfile does a couple of things for us:
○ Defines the open ports we need for our containers
○ Defines the resources we allocate to the virtual environment
○ Runs our chosen provisioner (more on this next)
● Vagrant uses shared directories, so developers can work locally from their
editor, and changes will be reflected in the virtual environment
Want to see actual code? You can see a slightly edited version of
my real-life Vagrantfile here.
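For readers without the gist handy, here is a hypothetical sketch of what a Vagrantfile along those lines can look like (the box name, ports, and resource numbers are illustrative, not the real values):

```ruby
# Illustrative only: not the real Airtime Vagrantfile.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"

  # Open ports our containers need, forwarded to the host
  config.vm.network "forwarded_port", guest: 3000, host: 3000

  # Resources allocated to the virtual environment
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 2048
    vb.cpus = 2
  end

  # The project root is shared into the VM at /vagrant by default,
  # so edits made in a local editor show up inside the environment.

  # Run our chosen provisioner; prompting for the vault password
  # lets ansible-vault decrypt local secrets
  config.vm.provision "ansible" do |ansible|
    ansible.playbook = "provisioning/playbook.yml"
    ansible.ask_vault_pass = true
  end
end
```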
14. A closer look at the Ansible provisioner
● We provision service-level requirements with Ansible.
● The provisioner handles a few things for us:
○ Installs environment requirements
○ Pulls and starts dependency containers (Redis and MongoDB)
○ Pulls and starts service containers from ECR
○ Runs npm install for containers and host
○ Grabs container IPs that we can use for cross-container linking
● We handle local secrets with ansible-vault
Like the Vagrantfile, there is a gist of the provisioner here.
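As with the Vagrantfile, the real provisioner lives in the gist; here is a hypothetical excerpt of tasks like the ones described above (module usage is Ansible 2.x-era; package names, images, and variables are illustrative):

```yaml
# Illustrative Ansible tasks, not the real provisioner.
- name: Install environment requirements
  apt:
    name: "{{ item }}"
    state: present
  with_items:
    - docker.io
    - nodejs

- name: Pull and start dependency containers
  docker_container:
    name: "{{ item.name }}"
    image: "{{ item.image }}"
    state: started
  with_items:
    - { name: redis, image: "redis:3" }
    - { name: mongo, image: "mongo:3.2" }

- name: Pull and start the service container from ECR
  docker_container:
    name: my-service
    image: "{{ ecr_registry }}/my-service:develop"
    state: started

- name: Run npm install from the shared project directory
  command: npm install
  args:
    chdir: /vagrant
```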
16. So what happened here?
Let’s break it down. A few things are happening:
● Vagrant starts the environment we described and forwards the ports
● Vagrant sees that we’re running an Ansible provisioner, and that we use
ansible-vault
● This prompts us for a vault password
17. Once we’ve entered the Vault password to start decrypting our secrets, Vagrant
runs our Ansible set-up tasks.
21. CI/CD with CircleCI
Once a feature has been developed and tested locally, you’re ready to test on
staging. This process starts with merging a pull request to develop:
23. Working with ECR
● We version control all of our containers through ECR
● Lots of tagging schemes out there, but this one is ours:
○ Individual builds are tagged with the commit SHA1 from CircleCI. This allows us to tie a
specific container version to a specific commit.
○ Builds from the develop branch are tagged with both the SHA1 and :develop
○ Builds from the master branch are tagged with both the SHA1 and :latest
● We use :latest and :develop for local purposes only. ECS task definitions
exclusively use SHA tags, so a specific deployment can always be traced back to its commit
● ECR lets our cluster machines pull images directly from our repositories
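As a sketch, the tagging scheme above boils down to something like this CI step. The registry URL and image names are hypothetical, and the docker commands are printed here rather than run:

```shell
# Sketch of the CI tagging step; in CircleCI, sha1 and branch would come
# from $CIRCLE_SHA1 and $CIRCLE_BRANCH. Registry URL is hypothetical.
sha1="4f2a9c1d"
branch="develop"
repo="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service"

# Every build gets the commit SHA tag; develop/master builds also get a moving tag.
tags="$sha1"
case "$branch" in
  develop) tags="$tags develop" ;;
  master)  tags="$tags latest" ;;
esac

# Print (rather than execute) the docker commands this would produce.
for tag in $tags; do
  echo "docker tag my-service:build $repo:$tag"
  echo "docker push $repo:$tag"
done
```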
25. Deploying to ECS
The final CircleCI build step triggers the ECS deployment:
There are a couple different pieces to this, so we’ll walk through them individually.
27. Next, we create a TaskDefinition:
And finally, we register it to our cluster:
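Put together, those two calls look roughly like this. Cluster, service, and image names are hypothetical, and with DRY_RUN set the AWS CLI calls are printed instead of made:

```shell
# Sketch of the final CircleCI deploy step; not the real deploy script.
DRY_RUN=1
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

SHA1="4f2a9c1d"   # would come from $CIRCLE_SHA1 in CI
REPO="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service"

# Register a new task definition revision pointing at the SHA-tagged image
run aws ecs register-task-definition --family my-service \
  --container-definitions "[{\"name\":\"my-service\",\"image\":\"$REPO:$SHA1\",\"memory\":256,\"essential\":true}]"

# Point the service at the family's newest revision; ECS then rolls the deploy
run aws ecs update-service --cluster production --service my-service \
  --task-definition my-service
```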
28. Let’s talk about ECS, baby.
With ECS, TaskDefinitions are registered to services, which are in turn registered
to clusters. Here’s what that looks like:
30. OK, so where were we?
Once CircleCI makes the call to switch over to our new TaskDefinition, ECS takes
over the deployment process:
31. Looking more closely at connection draining
In the previous example, the deployment went smoothly: we deployed a new
revision, it passed health checks, and ECS drained connections off the old
revision, to replace it with our new one. Zero downtime deployments FTW. But
what happens if that doesn’t work?
35. The good news
In that last case, ECS doesn’t drain connections off the old task or route traffic
to the new one, since the new revision fails its health checks. From the console
side, that looks something like this:
36. Preventing bad deploys
Besides connection draining, which we get for free with ECS, we take a few
more steps to prevent bad deployments:
● TDD, and developing locally with Vagrant
● Services run individual tests on CircleCI
● NotoriousJPG (Hubot) runs an additional test suite, plus load tests
So we only make the call to deploy a service if both the local tests and the
CircleCI tests pass. Errors that sneak through can be caught by ECS
health checks, or by the automated tests. And finally...
37. Monitoring our microservices
● Lots of microservices means lots of monitoring
● ECS automatically publishes memory and CPU usage metrics to CloudWatch for
each service.
● A custom CloudWatch dashboard lets us check the health of all our
services at a glance.
38. Logging beyond CloudWatch
● We use Sumo Logic to collect application logs
● We instrument the application side with New Relic
● Errors, warnings, and other critical issues are sent to PagerDuty
39. A little bit more detail on container logging
● We run the Sumo Logic collector as a container on the cluster hosts, started at boot
● On the cluster hosts, we use syslog as the Docker log driver:
echo 'OPTIONS="--log-driver=syslog"' >> /etc/sysconfig/docker
● Sumo Logic follows /var/log/messages on the cluster host
● This lets us catch application logs from containers even if the container
process fails quickly.
40. Bonus round: autoscaling with ECS
● Lots of AWS options for keeping user experience consistent.
● We autoscale with ECS at both the cluster, and the service level
41. Scaling the cluster hosts
● As with regular Auto Scaling groups, we can scale ECS cluster hosts based on
metrics (like CPU usage)
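One way to wire that up, sketched with placeholder names (the commands are built as strings and printed, not executed): a simple scaling policy on the cluster's Auto Scaling group, triggered by a CloudWatch alarm on the cluster's CPUReservation metric.

```shell
# Sketch only: group/policy names and the policy ARN are placeholders.
scale_policy="aws autoscaling put-scaling-policy \
 --auto-scaling-group-name ecs-cluster-asg \
 --policy-name scale-out-on-cpu \
 --adjustment-type ChangeInCapacity \
 --scaling-adjustment 1"

cpu_alarm="aws cloudwatch put-metric-alarm \
 --alarm-name ecs-cpu-reservation-high \
 --namespace AWS/ECS --metric-name CPUReservation \
 --dimensions Name=ClusterName,Value=production \
 --statistic Average --period 300 --evaluation-periods 2 \
 --threshold 75 --comparison-operator GreaterThanThreshold \
 --alarm-actions <scaling-policy-arn>"

echo "$scale_policy"
echo "$cpu_alarm"
```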
42. Stop! Demo time.
What we’re going to do:
1. Change a service locally
2. Hope tests pass on CircleCI
3. PR against develop
4. Merge our branch into develop
5. Hope tests still pass on CircleCI
6. Watch our changes go to ECS
7. Check to see if our new TaskDefinition deploys successfully
8. Profit