This session gives attendees an overview of Amazon EC2 Container Service (Amazon ECS) and the benefits of running a managed container cluster on AWS, viewed from a customer perspective.
4. The Problem
Different application stacks
Different hardware deployment environments
How to run all applications across different environments?
How to easily migrate from one environment to another?
[Diagram: a matrix of application stacks (static website, web front end, background workers, user DB, analytics DB, queue) against deployment environments (development VM, QA server, single prod server, on-site cluster, public cloud, contributor's laptop, customer servers)]
6. Containers
User space running on OS kernel
Little overhead
Guest OS choices limited to host OS kernel
Been around for a while: chroot, FreeBSD jails, Solaris containers, OpenVZ, LXC
9. Benefits
Portable runtime application environment
Package application and dependencies in a single artifact
Run different application versions (different dependencies) simultaneously
Faster development & deployment cycles
Better resource utilization
10. Use Cases
Consistent environment between development & production
Service-oriented architectures / microservices
Short-lived workflows
Isolated environments for testing
11. Services Evolve to Microservices
[Diagram: a monolithic application (Order, User, and Shipping UIs and services over a shared data access layer) evolves into Services A-D spread redundantly across Hosts 1-4]
12. Containers Are Natural for Microservices
Simple to model
Any app, any language
Image is the version
Test & deploy same artifact
Stateless servers decrease change risk
19. Scheduling a Cluster Is Hard
[Diagram: a large fleet of servers, each running its own guest OS, illustrating the scale of the scheduling problem]
20. General Cluster Management: Resource Management
[Diagram: EC2 instances across AZ 1 and AZ 2, each running Docker with tasks and containers]
21. General Cluster Management: Scheduling
[Diagram: the same cluster; the scheduler places tasks onto instances across AZ 1 and AZ 2]
31. Designed for Use with Other AWS Services
Elastic Load Balancing
Amazon Elastic Block Store
Amazon Virtual Private Cloud
Amazon CloudWatch
AWS Identity and Access Management
AWS CloudTrail
41. Create Service
Load balance traffic across containers
Automatically recover unhealthy containers
Discover services
[Diagram: Elastic Load Balancing distributing traffic to containers and shared data volumes on each instance]
42. Scale Service
Scale up
Scale down
[Diagram: Elastic Load Balancing in front of the cluster as container groups are added and removed]
43. Update Service
Deploy new version
Drain connections
[Diagram: new containers launch alongside the old ones behind Elastic Load Balancing]
44. Update Service (cont.)
Deploy new version
Drain connections
[Diagram: traffic shifts to the new containers while connections drain from the old ones]
45. Update Service (cont.)
Deploy new version
Drain connections
[Diagram: only the new containers remain behind Elastic Load Balancing]
48. • California Institute of Technology
• Pasadena, CA
• Top-tier university: #1 in Times Higher Education world rankings
• Small: 6,400 people (1,000 undergrads, 1,200 grads, 300 faculty, 3,900 staff)
• 3:1 undergrad-faculty ratio
• JPL: founded by Caltech in the 1930s, managed for NASA since 1958
49. Academic Development Services
• Part of IMSS, the central IT org
• Lean: 6 people, all developers, even management
• 35 years of collective systems administration experience
• 50 years of collective development experience
• ~130 websites and web applications, including www.caltech.edu and the campus intranet portal
• Much smaller than counterparts at peer institutions
Our job: enable research and instruction through software
50. Cloud Adoption (2010-present)
Upper management, operations, and developers are all pro-cloud
Move all on-premises services to the cloud within 3 years
We've been in production in AWS since 2010
Many Caltech production workloads currently run in AWS
Strategy: DevOps, public data, low-hanging fruit, Field of Dreams model
51. Why cloud?
Leverage AWS scale, expertise, and capabilities
• AZs, APIs, infrastructure as code
• AWS is better than us at many things
• AWS allows us to do things we can't on-premises
• We don't have to run low-level services
Allows us to concentrate on how we add value
53. access.caltech: Caltech's intranet portal
• Distributed system composed of many interconnected systems
• Authenticating proxy server with around 90 applications behind it
• Covers most of the academic and administrative apps people might use
Two parts: core system and proxied apps
56. access.caltech: key requirements
• Needs to be highly available
• Be performant at variable loads
• Typical traffic: 5-10 hits/s
• Must scale to 800 hits/s during registration
• Protect and secure proxied apps and data
• Certain core components should stay up during a disaster
• Be able to easily deploy new versions of core software
• Need many DEV, TEST, QA, and production support environments
59. access.caltech in AWS: phase 1
• Move access.caltech core PROD to a VPC in AWS
• Continuous deployment system based on Jenkins, Docker containers, and Consul
• Be able to build DEV and TEST environments in AWS
• Proxy from AWS to on-premises apps via a VPN tunnel
Later phases: move proxied apps individually to AWS
61. Why Docker?
Need a more rapid, consistent deployment mechanism
• Our current process takes weeks to months to get new versions to production, and deployments are rocky
• Raw vs. cooked. Cooked: build as much as possible before deployment.
• Encapsulation of the entire OS as a software artifact
• Guaranteed same code and OS build for DEV, TEST, and PROD
• Easily replicate whole system architectures in DEV
• Docker image community
62. Deployment pipeline (Jenkins)
QA pipeline: developer pushes code → build test image → run tests → build and push final image → deploy to QA infrastructure → run integration tests → human review
Promote-to-prod pipeline: deploy to prod infrastructure → run integration tests → deploy to prod support infrastructure
63. Why ECS? (vs. Docker Swarm): PROS
No orchestration infrastructure to run
• Container scheduling and placement are implicitly at cloud scale; no need to plan for HA, throughput, etc.
• Built-in monitoring via CloudWatch and the ECS event stream
• Powerful ECS command line tools
AWS API for managing tasks and services
AWS service integration, especially for load balancers and VPCs
ECS repositories
64. Why ECS? (vs. Docker Swarm): CONS
Painful to debug container launch failures
Docker version lags behind current, sometimes significantly
No equivalent to Swarm's overlay network
Different strategies for deploying containers
• Swarm has spread, binpack, and random
• ECS has task and service strategies, which both seem similar to Swarm's "spread" strategy
• ECS does let you develop your own strategies via custom schedulers and the StartTask API
65. Docker/ECS Challenges
The entire container is your software
• Not just your own code
• OS + code becomes a software artifact
The development team will need to have or develop systems experience
• Or work closely with systems people
You'll probably need to remediate your code in order to take advantage of the container environment
66. Docker/ECS Challenges, cont.
Containers are truly disposable and anonymous
• Figuring out which container is having issues is interesting
• The entire OS is destroyed when containers are redeployed
Containers are not VMs
• No SSH interface to containers
• Containers are minimal systems: no ssh, no cron, no syslogd, etc.
Need to change your architecture and practices
Logging, monitoring
Hypervisor virtualization usually means running independent VMs on an intermediate abstraction layer, either on top of an OS or directly on hardware.
Containers differ in that the user space (the memory dedicated to running applications, etc.) runs on top of the OS kernel.
They're generally considered lighter-weight than hypervisor virtualization, since there's a lot less overhead, but they're limited by the underlying kernel. E.g., an Amazon Linux kernel could run Amazon Linux, Ubuntu, and other Linux containers, but not Windows.
In Docker, these containers were originally based on LXC, but are now based on a native “libcontainer” library.
Docker also makes use of cgroups & kernel namespaces to provide isolation of processes, resources, network & filesystems.
Each container has its own process environment, virtual network interface, and root filesystem. The filesystem is copy-on-write, meaning it's layered, very fast, and doesn't require much disk space.
Stdin, stdout, and stderr are collected, logged, and made available to you.
You can also create a pseudo-TTY and log in to an interactive shell on the container.
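For example, a couple of standard Docker CLI invocations that illustrate this (the image names here are just examples):

    # Allocate a pseudo-TTY and open an interactive shell in a container
    docker run -it --rm ubuntu /bin/bash

    # In detached mode, stdout/stderr are collected and can be read back later
    docker run -d --name web nginx
    docker logs web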
You may be thinking: this sounds interesting, but why would I want to use containers? There are four key benefits to using containers.
1.) The first is that containers are portable.
The image is consistent and immutable: no matter where I run it, or when I start it, it's the same.
This makes the dev lifecycle simpler: an image works the same on the developer's desktop and in prod, whether I start it today or scale my environment tomorrow, so there are no surprises.
The entire application is self-contained. The image is the version, which makes deployments and scaling easier because the image includes the dependencies.
Images are small, usually tens of MB, and very sharable.
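A minimal sketch of that portability story, assuming a Dockerfile in the current directory (the image name and port are hypothetical):

    # Build once; the image bundles the app and every dependency
    docker build -t myapp:1.0 .

    # The same image runs identically on a laptop, in test, or in prod
    docker run -d -p 8080:8080 myapp:1.0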
2.) Containers are flexible.
You can create clean, reproducible, and modular environments.
Whereas in the past multiple processes would share the same OS (e.g., Ruby, caching, log pushing), containers make it easy to decompose an app into smaller chunks, like microservices, reducing complexity and letting teams move faster while still running the processes on the same host (e.g., no library conflicts).
This streamlines both code deployment and infrastructure management.
3.) Simply stating that Docker images start fast sells the technology short: the speed shows up both in performance characteristics and in application lifecycle and deployment benefits.
So yes, containers start quickly because the operating system is already running, but
every container can also be a single-threaded dev stream, with fewer interdependencies.
There are ops benefits too. Example: IT updates the base image, and I just do a new docker build. I can focus on my app, meaning it's faster for me to build and release.
4.) Finally, containers are efficient. You can allocate exactly the resources you want: specific CPU, RAM, disk, and network.
Since containers share the same OS kernel and libraries, they use fewer resources than running the same processes on different virtual machines (a different way to get isolation).
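For instance, the Docker CLI lets you cap a container's resources directly (the limits and image name here are illustrative):

    # Cap this container at half a CPU and 256 MB of RAM
    docker run -d --cpus 0.5 --memory 256m myapp:1.0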
It makes it easier to build and deploy things more rapidly, because the environment is the same: build the container in dev, push to test, release to prod. This can also be useful for customers running hybrid environments.
It makes it easier to keep consistent environments for SOAs or microservices. Also, many of these services aren't very resource intensive, so you can place several of them together on one instance.
Sometimes customers have short-lived workflows that need to set up environments (e.g., queue systems, CI jobs, etc.), which don't always map well to EC2's per-hour billing model. Docker can be one workaround, allowing them to push and pop containers onto the instance.
Containers can also be useful for isolated execution when testing user code, e.g., the Go Playground or similar.
So I want to tell you a story about Amazon.com and the evolution of its architecture.
Over 10 years ago, Amazon had a large monolithic application running its website. Everything from its UI, ordering systems, recommendations engine, and shopping cart was one big application with one large code base. The problem with that was that there were a lot of code interdependencies to resolve. Another problem Amazon experienced was that it was hard to scale the website. If one service was memory intensive and another CPU intensive, the servers had to be provisioned with enough memory and CPU to handle that baseline load. So if the CPU-intensive service received a heavy load, you had to provision a large machine and leave a lot of resources underutilized.
In order to scale better, Amazon decomposed its architecture into individual services that could be deployed separately. This allowed it to scale each service independently. It was able to have smaller teams that worked on each of the services and controlled that service's codebase. This allowed the website to evolve faster, because new updates could be delivered independently of other teams. This architecture is what is now known as microservices.
Containers and Docker are natural for this pattern of microservices.
They make services simple to model: the application and all its dependencies are packaged into an image using a Dockerfile.
They support any app, any language.
The image is a versioned artifact that can be stored in a repository, just like your source code.
This makes applications easy to test and deploy, because they are the same artifacts.
Containers also simplify deployment: stateless servers are natural with Docker, and each deployment is a new set of containers.
This decreases the risk of change, since rollback is simple.
All of this makes it easy to decompose applications into microservices. Every microservice is self-contained, allowing you to reduce dependency conflicts and decouple deployments.
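A hedged sketch of that image-as-version workflow (the registry and image names are hypothetical):

    # Build and push a versioned image; the tag is the release artifact
    docker build -t registry.example.com/orders-service:1.4.2 .
    docker push registry.example.com/orders-service:1.4.2

    # Rolling back is just running the previous tag
    docker run -d registry.example.com/orders-service:1.4.1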
Docker is a platform designed to help automate the deployment of application containers.
It was built by Docker, Inc., previously known as dotCloud, a PaaS platform.
So let's talk about scheduling.
The Docker CLI is great if you want to run a container on your laptop, for example "docker run myimage".
But it's challenging to scale to hundreds of containers. Now you're suddenly managing a cluster, and cluster management is hard.
You need a way to intelligently place your containers on the instances that have the resources, and that means you need to know the state of everything in your system. For example:
Which instances have available resources, like memory and ports?
How do I know if a container dies?
How do I hook into other resources like ELB?
Can I extend whatever system I use, e.g., a CD pipeline, third-party schedulers, etc.?
Do I need to operate another piece of software?
These are the questions and challenges our customers had, which led us to build Amazon ECS.
The resource manager is responsible for keeping track of resources like memory, CPU, and storage, and their availability at any given time in the cluster.
Next, the scheduler is responsible for scheduling containers, or tasks, for execution.
The scheduler contains algorithms for assigning tasks to nodes in the cluster based on the resources required to execute the task.
To properly schedule, you need to:
Know your constraints, like memory and CPU
Find resources in your cluster that meet the constraints
Request a resource
Confirm the resource
The scheduler is also responsible for the task execution lifecycle:
Is the task alive or dead, and should it be rescheduled?
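With ECS, that resource accounting is exposed through the API. A sketch with the AWS CLI (the cluster name is hypothetical; describe-container-instances reports registered and remaining CPU, memory, and ports per instance):

    # List the instances registered to a cluster
    aws ecs list-container-instances --cluster demo-cluster

    # Inspect registered vs. remaining resources on one of them
    aws ecs describe-container-instances \
        --cluster demo-cluster \
        --container-instances <container-instance-arn>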
ECS provides a simple solution to cluster management:
We have a cluster management engine that coordinates the cluster of instances, which is just a pool of CPU, memory, storage, and networking resources.
The instances are just EC2 instances running our agent that have been checked into a cluster. You own them and can SSH into them if you want.
It's dynamically scalable: possible to have a 1-instance cluster, and then a 100- or even 1,000-instance cluster.
You can segment clusters for particular purposes, e.g., dev/test.
On each instance, the ECS agent communicates with the engine, processes ECS commands, and turns them into Docker commands.
It instructs the EC2 instance to start and stop containers and monitors the used and available resources.
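As a sketch of how an instance checks into a cluster (the cluster name is hypothetical), the agent reads its cluster assignment from a config file on the instance:

    # /etc/ecs/ecs.config on the EC2 instance
    ECS_CLUSTER=demo-cluster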
It's all open source on GitHub, and we develop in the open, so we'd love to see you involved through pull requests.
To coordinate this cluster, we need a single source of truth for all the instances in the cluster, the tasks running on the instances, the containers that make up the tasks, and the resources available. This is known as cluster state.
So at the heart of ECS is a key/value store that holds all of this cluster state.
To be robust and scalable, this key/value store needs to be distributed for durability and availability.
But because the key/value store is distributed, keeping the data consistent and handling concurrent changes becomes more difficult.
For example, if two developers request all the remaining memory on a certain EC2 instance for their containers, only one container can actually receive those resources, and the other has to be told its request could not be completed.
As such, some form of concurrency control has to be in place to make sure that multiple state changes don't conflict.
But what is unique about ECS is that we decouple container scheduling from cluster management.
We have opened up the Amazon ECS cluster manager through a set of API actions that let customers access all the cluster state information stored in our key/value store.
This set of API actions forms the basis of solutions that customers can build on top of Amazon ECS, such as connecting your CI/CD system or schedulers.
The API allows you to connect different schedulers to ECS.
A scheduler just provides logic around how, when, and where to start and stop containers.
Amazon ECS's architecture is designed to share the state of the cluster and allow customers to run as many varieties of schedulers (e.g., bin packing, spread, etc.) as needed for their applications.
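A custom scheduler built this way reads cluster state through the API, picks an instance with its own logic, and then places the task there. A hedged sketch with the AWS CLI (names are hypothetical):

    # Place a task on the specific instance your scheduler chose
    aws ecs start-task \
        --cluster demo-cluster \
        --task-definition web:1 \
        --container-instances <chosen-container-instance-arn>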
The reason we developed ECS was that customers had been running containers and Docker on EC2 for quite some time.
What customers told us about was the difficulty of running these containers at scale, which generally involved installing and managing cluster management software.
Eliminates cluster management software
Manages cluster state
Manages containers
Control and monitoring
Scale from one to tens of thousands of containers
Earlier this year we ran a load test.
Over a 3-day period we scaled our cluster from 200 to over 1,000 instances, as represented by the purple line.
The green and red lines show the p99 and p50 latencies.
As you can see, they are relatively flat, demonstrating that ECS is stable and will scale regardless of your cluster size.
So Amazon ECS has two built-in schedulers to help find the optimal instance placement based on your resource needs, isolation policies, and availability requirements:
A scheduler for long-running applications and services
A scheduler for short-running tasks like batch jobs
As discussed before, because ECS provides a powerful set of APIs, it also allows you to integrate your own custom scheduler as well as open source schedulers.
All of these give you very flexible ways to do scheduling on ECS.
Amazon ECS is built to work with the AWS services you value. You can set up each cluster in its own Virtual Private Cloud and use security groups to control network access to your EC2 instances. You can store persistent information using EBS, and you can route traffic to containers using ELB. CloudTrail integration captures every API access for security analysis, resource change tracking, and compliance auditing.
As discussed before, ECS has a simple set of APIs that make it very easy to integrate and extend.
You can use your own container scheduler or connect ECS into your existing software delivery process (e.g., continuous integration and delivery systems).
Our container agent and CLI are open source and available on GitHub. We look forward to your input and pull requests.
Summing up, ECS reduces the amount of code you need to go from idea to implementation when building distributed systems.
So, rather than having Mesos or other cluster management software manage a set of machines directly, ECS manages your instances.
Much of the undifferentiated heavy lifting and housekeeping has been abstracted behind a set of APIs.
The ability to run multiple tasks on a shared pool of resources can also lead to higher utilization and faster task completion than if compute resources were statically partitioned.
You can model your app using a file called a task definition.
This file defines the containers you want to run together.
A task definition also lets you specify Docker concepts like links, to establish network channels between the containers, and the volumes your containers need.
Task definitions are tracked by name and revision, just like source code.
To create a task definition, you can use the console to specify the Docker image to use for the containers.
You can specify resources like CPU and memory, ports, and volumes for each container.
You can specify what command to run when the container starts.
And the essential flag specifies whether the task should fail if the container stops running.
You can also type everything as JSON if you want.
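A minimal sketch of such a task definition, registered with the AWS CLI (the family, image, and resource values are hypothetical):

    cat > taskdef.json <<'EOF'
    {
      "family": "web",
      "containerDefinitions": [
        {
          "name": "nginx",
          "image": "nginx:latest",
          "cpu": 256,
          "memory": 128,
          "portMappings": [{"containerPort": 80, "hostPort": 80}],
          "essential": true
        }
      ]
    }
    EOF

    # Each registration of the same family creates a new revision, e.g. web:1
    aws ecs register-task-definition --cli-input-json file://taskdef.json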
Once your task definition is created, scheduling it onto an instance with available resources creates a task.
A task is an instantiation of a task definition.
You can have a task with just one container, or up to ten that work together on a single machine: maybe nginx in front of Rails, or Redis behind Rails.
You can run as many tasks on an instance as will fit.
People often wonder about cross-host links: those don't go in your task definition. Instead, put the pieces behind an ELB or a discovery system, and make multiple tasks.
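Starting tasks directly looks like this with the AWS CLI (the cluster name and count are illustrative):

    # Run two copies of revision 1 of the "web" task definition
    aws ecs run-task --cluster demo-cluster --task-definition web:1 --count 2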
ECS has a scheduler that is good for long-running applications, called the service scheduler.
You reference a task definition and the number of tasks you want to run, and you can optionally place the service behind an ELB.
The scheduler will then launch the number of tasks that you requested.
The scheduler will maintain the number of tasks you want running and keep traffic load balanced across them automatically.
Scaling up and down is simple: you just tell the scheduler how many tasks you need, and it will automatically launch more tasks or terminate tasks.
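A sketch of creating and then scaling a service (names and counts are hypothetical):

    # Create a service that keeps 3 copies of web:1 running
    aws ecs create-service \
        --cluster demo-cluster \
        --service-name web \
        --task-definition web:1 \
        --desired-count 3

    # Scaling is just a change to the desired count
    aws ecs update-service --cluster demo-cluster --service web --desired-count 10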
Updating a service is easy.
You register the new version and update the service, and the scheduler will launch tasks with the new application version.
It will drain connections from the old containers and then remove them.
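A minimal sketch of that rolling update (the revision numbers are hypothetical):

    # Point the service at the new revision; the scheduler starts new tasks,
    # drains connections from the old containers, and removes them
    aws ecs update-service --cluster demo-cluster --service web --task-definition web:2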