6. Services evolve to microservices
[Diagram: a monolithic application combining Order UI, User UI, and Shipping UI with the Order, User, and Shipping services over a shared data access layer, contrasted with microservices distributed across hosts: Host 1 runs Services A and B; Host 2 runs Services B and D; Host 3 runs Services A and C; Host 4 runs Services B and C]
7. Containers are natural for microservices
Simple to model
Any app, any language
Image is the version
Test and deploy same artifact
Stateless servers decrease change risk
10. Scheduling a cluster is hard
[Diagram: a large fleet of servers, each with its own guest OS, representing the many hosts a scheduler must place containers onto]
29. Designed for use with other AWS services
Elastic Load Balancing (Classic & Application)
Amazon Elastic Block Store
Amazon Virtual Private Cloud
Amazon CloudWatch
AWS Identity and Access Management
AWS CloudTrail
40. Create service
Load balance traffic across containers
Automatically recover unhealthy containers
Discover services
[Diagram: Elastic Load Balancing routing traffic to containers across instances, each with a shared data volume]
41. Scale service
Scale up
Scale down
[Diagram: Elastic Load Balancing in front of the service as container instances are added and removed, each running containers with a shared data volume]
42. Update service
Deploy new version
Drain connections
[Diagram: new-version containers starting alongside old-version containers behind Elastic Load Balancing]
43. Update service (cont.)
Deploy new version
Drain connections
[Diagram: traffic shifting to the new-version containers while connections drain from the old-version containers behind Elastic Load Balancing]
44. Update service (cont.)
Deploy new version
Drain connections
[Diagram: only the new-version containers remain behind Elastic Load Balancing]
46. Anatomy of Task Placement
Cluster Constraints: satisfy CPU, memory, and port requirements
Custom Constraints: filter for location, instance-type, AMI, or custom attribute constraints
Placement Strategies: identify instances that meet the spread or binpack placement strategy
Apply Filter: select final container instances for placement
So we are going to briefly recap why containers, the challenges you may face in production, and some of the usage patterns
We will then talk about cluster management and how Amazon ECS fits into all of this
Then we will close with a demo from my colleague _____
Abstract the OS
VMs (hardware-level virtualization): you install applications on the host using a package manager, which tightly couples applications, libraries, and configuration. You can have really good practices like building immutable images, but VMs are still very heavyweight and dependent on the underlying infrastructure (hypervisor, etc.)
Containers (operating system virtualization): isolated from each other and the host, they leverage Linux kernel primitives, namespaces for isolation and cgroups for limiting resources. Since they are decoupled from the underlying infrastructure and host, they are portable across environments and OS distributions. They can also be super lightweight if built correctly, which makes them very easy to ship; an Alpine Linux container image is about 5 MB.
Containers are similar to hardware virtualization (like EC2); however, instead of partitioning a machine, containers isolate the processes running on a single operating system
This is a useful concept that lets you use the OS kernel to create multiple isolated user space processes that can have constraints on them like cpu & memory.
The Docker CLI makes using containers easy, with commands like docker run.
Docker images make it easy to define what runs in a container and to version the entire app
These concepts enable automation – you can define your app, build & share the image, and deploy that image.
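For example (repository and tag hypothetical), the whole define, build, share, and deploy loop is a handful of commands:
  # Build a versioned image from the Dockerfile in the current directory.
  docker build -t myrepo/myapp:1.0 .
  # Share the image by pushing it to a registry.
  docker push myrepo/myapp:1.0
  # Deploy: run the exact same artifact anywhere Docker is installed.
  docker run -d -p 8080:8080 myrepo/myapp:1.0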
You may be thinking – this sounds interesting but why would I want to use containers? There are 4 key benefits to using containers.
1.) The first is that containers are portable.
the image is consistent and immutable -- no matter where I run it, or when I start it, it’s the same.
This makes the dev lifecycle simpler: an image works the same on the developer's desktop and in prod, whether I start it today or scale my environment tomorrow, so there are no surprises.
The entire Application is self-contained -- The image is the version, which makes deployments and scaling easier because the image includes the dependencies.
Images are small, usually tens of MB, and very shareable
2.) Containers are flexible.
You can create clean, reproducible, and modular environments.
Whereas in the past multiple processes would run on the same OS (e.g., Ruby, caching, log pushing), containers make it easy to decompose an app into smaller chunks, like microservices, reducing complexity and letting teams move faster while still running the processes on the same host, e.g., with no library conflicts
This streamlines both code deployment and infrastructure management
3.) Simply stating that Docker images start fast sells the technology short, because the speed shows up both in performance characteristics and in application lifecycle and deployment benefits
So yes, containers start quickly because the operating system is already running, but
Every container can be a single-threaded dev stream, with fewer interdependencies
There are ops benefits too. Example: IT updates the base image, and I just do a new docker build. I can focus on my app, meaning it's faster for me to build and release.
4.) Finally, containers are efficient. You can allocate exactly the resources you want – specific cpu, ram, disk, network
Since containers share the same OS kernel and libraries, they use fewer resources than running the same processes on separate virtual machines (a different way to get isolation)
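As a quick illustration (image name hypothetical), the Docker CLI exposes exactly these cgroup-backed limits:
  # Run a detached container with resource limits enforced by cgroups:
  # a relative CPU share and a hard 256 MB memory cap.
  docker run -d --cpu-shares 512 --memory 256m --name limited-app myimage
  # Inspect the limits Docker recorded for the container.
  docker inspect --format '{{.HostConfig.CpuShares}} {{.HostConfig.Memory}}' limited-app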
So I want to tell you a story about Amazon.com and the evolution of its architecture.
Over 10 years ago, Amazon.com had a large monolithic application running its website. Everything from its UI, ordering systems, recommendations engine, and shopping cart was one big application with one large code base. The problem with that was that there were a lot of code interdependencies to resolve. Another problem Amazon experienced was that it was hard to scale the website. If one service was memory intensive and another CPU intensive, the servers had to be provisioned with enough memory and CPU to handle that baseline load. So if the CPU-intensive service received a heavy load, you had to provision a large machine and ended up with a lot of underutilized resources.
In order to scale better, Amazon decomposed its architecture into individual services that could be deployed separately. This allowed it to scale each service independently. It was able to have smaller teams that worked on each of the services and controlled that service's codebase. This allowed the website to evolve faster because new updates could be delivered independently of other teams. This architecture is what is now known as microservices.
Containers & Docker are natural for this pattern of microservices
It makes services simple to model; the application and all its dependencies are packaged into an image using a Dockerfile.
It supports any app, any language
The Image is a versioned artifact that can be stored in a repository just like your source code.
This makes applications easy to test & deploy because they are the same artifacts
Containers also simplify deployment -- Stateless servers are natural with Docker and each deployment is a new set of containers
This Decreases risk of change – rollback is simple
This all makes it easy to decompose applications into microservices. Every microservice is self-contained, allowing you to reduce dependency conflicts and decouple deployments.
So let's talk about scheduling
The Docker CLI is great if you want to run a container on your laptop for example “docker run myimage”.
But it's challenging to scale to 100s of containers. Now you're suddenly managing a cluster, and cluster management is hard.
You need a way to intelligently place your containers on the instances that have the resources and that means you need to know the state of everything in your system. For example…
what instances have available resources like memory and ports?
How do I know if a container dies?
How do I hook into other resources like ELB?
Can I extend whatever system I use, e.g., a CD pipeline or third-party schedulers?
Do I need to operate another piece of software?
These are the questions and challenges that our customers had, which led us to build Amazon ECS
Amazon EC2 Container Service (ECS) is a highly scalable, high performance container management service that supports Docker containers and allows you to easily run applications on a managed cluster of Amazon EC2 instances. Amazon ECS eliminates the need for you to install, operate, and scale your own cluster management infrastructure.
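As a minimal sketch (cluster name hypothetical), getting a managed cluster is a single API call:
  # Create a cluster: a named pool of resources to run tasks on.
  aws ecs create-cluster --cluster-name demo
  # Instances running the ECS agent check themselves in and appear here.
  aws ecs list-container-instances --cluster demo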
Resource Manager is responsible for keeping track of resources like memory, CPU, and storage and their availability at any given time in the cluster.
[describe what’s on the slides]
Next, the Scheduler is responsible for scheduling containers or tasks for execution.
The scheduler contains algorithms for assigning tasks to nodes in the cluster based on the resources required to execute the task.
To properly schedule, you need to:
Know your constraints like memory, CPU
Find resources from your cluster that meet the constraints
Request a resource
Confirm the resource
The scheduler is also responsible for the task execution lifecycle.
Is the task alive or dead, and should it be rescheduled?
ECS provides a simple solution to cluster management:
We have a cluster management engine that coordinates the cluster of instances, which is just a pool of CPU, memory, storage, and networking resources
The instances are just EC2 instances that run our agent and have been checked into a cluster. You own them and can SSH into them if you want
Dynamically scalable: it's possible to have a 1-instance cluster, and then a 100- or even 1,000-instance cluster.
Segment for particular purposes, e.g. dev/test
On each instance, we have the ECS agent which communicates with the engine and processes ECS commands and turns them into Docker commands
It instructs the EC2 instance to start and stop containers and monitors the used and available resources
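For example, an instance joins a particular cluster through the agent's config file; a typical EC2 user-data sketch (cluster name hypothetical):
  #!/bin/bash
  # Tell the ECS agent which cluster to register this instance into
  # (without this, the agent registers into the cluster named "default").
  echo ECS_CLUSTER=demo >> /etc/ecs/ecs.config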
It's all open source on GitHub and we develop in the open, so we'd love to see you involved through pull requests.
To coordinate this cluster, we need a single source of truth for all the instances in the cluster, the tasks running on the instances, the containers that make up each task, and the resources available. This is known as cluster state
So at the heart of ECS is a key/value store that stores all of this cluster state
To be robust and scalable, this key/value store needs to be distributed for durability and availability
But because the key/value store is distributed, making sure data is consistent and handling concurrent changes becomes more difficult
For example, if two developers request all the remaining memory resources from a certain EC2 instance for their container, only one container can actually receive those resources and the other would have to be told their request could not be completed.
As such, some form of concurrency control has to be put in place in order to make sure that multiple state changes don’t conflict.
Let's talk a bit about how we achieve this concurrency control under the hood
We implemented Amazon ECS using one of Amazon’s core distributed systems primitives:
a Paxos-based transactional journal: a data store that keeps a record of every change made to a data entry.
Any write to the data store is committed as a transaction in the journal with a specific order-based ID.
The current value in a data store is the sum of all transactions made as recorded by the journal.
Any read from the data store is only a snapshot in time of the journal.
For a write to succeed, the write proposed must be the latest transaction since the last read.
So if a user made a read, a few writes happened after that, and the user then tried to write based on the last seen ID, the write wouldn't succeed
This primitive allows Amazon ECS to store its cluster state information with optimistic concurrency,
which is ideal in environments where constantly changing data is shared.
This architecture affords Amazon ECS high availability, low latency, and high throughput because the data store is never pessimistically locked.
But what is unique about ECS is that we decouple the container scheduling from the cluster management.
We have opened up the Amazon ECS cluster manager through a set of API actions that allow customers to access all the cluster state information stored in our key/value store
This set of API actions form the basis of solutions that customers can build on top of Amazon ECS such as connecting your CICD system or schedulers
This API allows you to connect different schedulers to ECS
A scheduler just provides logic around how, when, and where to start and stop containers.
Amazon ECS’ architecture is designed to share the state of the cluster and allow customers to run as many varieties of schedulers (e.g., bin packing, spread, etc) as needed for their applications.
So how it works is that each scheduler periodically queries the current cluster state to check resource availability
To schedule a task, the scheduler makes a claim for any available resources
The scheduler then updates the cluster state with the newly claimed resources in an atomic transaction.
If a resource is already claimed, ECS will reject the transaction because it maintains concurrency control
So what ECS enables is called “shared state optimistic scheduling” where all schedulers can see the current state of the cluster at all times.
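A custom scheduler built on these APIs can be sketched roughly as follows (cluster and task names hypothetical): read the shared state, pick an instance, and claim it with StartTask.
  # 1. Read the shared cluster state: which instances are in the cluster?
  arns=$(aws ecs list-container-instances --cluster demo \
    --query 'containerInstanceArns[]' --output text)
  # 2. Check remaining CPU/memory per instance and apply your own placement logic.
  aws ecs describe-container-instances --cluster demo --container-instances $arns \
    --query 'containerInstances[].remainingResources'
  # 3. Claim the resources by starting the task on the instance your logic chose
  #    ($chosen_arn); ECS rejects the transaction if another scheduler got there first.
  aws ecs start-task --cluster demo --task-definition myapp:1 \
    --container-instances "$chosen_arn"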
The reason we developed ECS was customers had been running containers and Docker on EC2 for quite some time.
What customers told us about was the difficulty of running these containers at scale, which generally involved installing and managing cluster management software
Eliminates cluster management software
Manages cluster state
Manages containers
Control and monitoring
Scale from one to tens of thousands of containers
Earlier in the life of the service, over a 3-day period, we scaled our cluster from 200 to over 1,000 instances, as represented by the purple line.
The green and red lines show the p99 and p50 latencies.
As you can see, they are relatively flat, demonstrating that ECS is stable and will scale regardless of your cluster size
So Amazon ECS has two built-in schedulers to help find the optimal instance placement based on your resource needs, isolation policies, and availability requirements:
A scheduler for long running applications and services
A scheduler for short running tasks like batch jobs
As discussed before, because ECS provides you a powerful set of APIs, it allows you to integrate your own custom scheduler as well as open source schedulers.
All of these allow you to have very flexible methods to do scheduling on ECS
Amazon ECS is built to work with the AWS services you value. You can set up each cluster in its own Virtual Private Cloud and use security groups to control network access to your EC2 instances. You can store persistent information using EBS, and you can route traffic to containers using the Classic Load Balancer or ALB. CloudTrail integration captures every API access for security analysis, resource change tracking, and compliance auditing. There is also native support for CloudWatch.
As discussed before ECS has a simple set of APIs that allows it to be very easy to integrate and extend
You can use your own container scheduler or connect ECS into your existing software delivery process (e.g., continuous integration and delivery systems)
Our container agent and CLI are open source and available on GitHub. We look forward to hearing your input and pull requests.
Summing up, ECS reduces the amount of code you need to go from idea to implementation when building distributed systems.
So, rather than having Mesos or other cluster management software manage a set of machines directly, ECS manages your instances.
Much of the undifferentiated heavy lifting and housekeeping has been abstracted behind a set of APIs.
The ability to run multiple tasks on a shared pool of resources can also lead to higher utilization and faster task completion than if compute resources are statically partitioned.
You can model your app using a file called a Task Definition
This file defines the containers you want to run together.
A task definition also lets you specify Docker concepts like links to establish network channels between the containers and the volumes your containers need.
Task definitions are tracked by name and revision, just like source code
To create a task definition, you can use the console to specify the Docker image to use for the containers
You can specify resources like CPU and memory, ports and volumes for each container.
You can specify what command to run when the container starts.
And the essential flag specifies whether the task should fail if the container stops running.
You can also type everything as JSON if you want
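A minimal sketch (family, container name, and image all hypothetical), registered straight from the JSON:
  aws ecs register-task-definition --cli-input-json '{
    "family": "webapp",
    "containerDefinitions": [{
      "name": "web",
      "image": "myrepo/myapp:1.0",
      "cpu": 256,
      "memory": 512,
      "portMappings": [{"containerPort": 8080, "hostPort": 8080}],
      "essential": true
    }]
  }'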
Once your task definition is created, scheduling it onto an instance with available resources creates a task
A task is an instantiation of a task definition.
You can have a task with just 1 container…or up to 10 that work together on a single machine. Maybe nginx in front of rails, or redis behind rails.
You can run as many tasks on an instance as will fit.
Often people wonder about cross-host links; those don't go in your task. Put the containers behind an ELB or a discovery system and make multiple tasks.
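Running tasks from a registered definition is one call (names hypothetical):
  # Start two copies of revision 1 of the webapp task definition;
  # the default scheduler places them on instances with free resources.
  aws ecs run-task --cluster demo --task-definition webapp:1 --count 2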
ECS has a scheduler that is good for long-running applications called the service scheduler
You reference a task definition and the number of tasks you want to run and then can optionally place it behind an ELB.
The scheduler will then launch the number of tasks that you requested
The scheduler will maintain the number of tasks you want to run and will automatically load balance across them
It is also AZ-aware.
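A sketch of creating such a service (cluster and service names hypothetical):
  # Run and maintain three copies of the task as a long-running service
  # (optionally attach a load balancer with --load-balancers).
  aws ecs create-service --cluster demo --service-name web \
    --task-definition webapp:1 --desired-count 3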
Scaling up and down is simple. You just tell the scheduler how many tasks you need and the scheduler will automatically launch more tasks or terminate tasks
Updating a service is easy
You deploy the new version, and the scheduler will launch tasks with the new application version
It will drain the connections from the old containers and remove those containers
Leaving the newest containers running
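Both scaling and rolling updates are a single UpdateService call; a sketch (names hypothetical):
  # Scale out: the scheduler launches tasks until the new desired count is met.
  aws ecs update-service --cluster demo --service web --desired-count 5
  # Roll out a new version: point the service at revision 2; the scheduler
  # starts new tasks, drains connections from the old ones, then stops them.
  aws ecs update-service --cluster demo --service web --task-definition webapp:2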
Prior to today, you could explicitly require CPU, memory, or ports as part of the task definition when running a task or service.
We have now extended that set of placement constraints to include
availability-zone
ami-id
instance-type
distinctInstance
Or, a custom attribute
Let’s look at a scenario where you had 10 container instances. To start you’ll make a request to run some tasks or create a service. As part of that request you’ll specify CPU, memory, or port requirements.
In addition, you'll now also provide other constraints, such as a specific Availability Zone, AMI, or instance-type.
And then last, you'll tell us the strategy you prefer for us to use when starting the tasks, which could range from spread for availability, to binpack to optimize for utilization, to placing together (affinity) or apart (anti-affinity), etc.
At the end of that process we have identified a set of instances that satisfies the requirements for the task you want to run and we place (or run) those tasks across your cluster based on the requirements specified.
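A rough sketch of such a request (names hypothetical):
  # Spread tasks across Availability Zones, restricted to t2.micro instances.
  aws ecs run-task --cluster demo --task-definition webapp:1 --count 4 \
    --placement-constraints 'type=memberOf,expression=attribute:ecs.instance-type == t2.micro' \
    --placement-strategy 'type=spread,field=attribute:ecs.availability-zone'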
Prior to today we had support for three placement strategies:
targeted instances through start-task
random placement through run-task
spread across AZs and instances through create-service
Now we have support for:
spread with placement groups (constraints)
Bin packing
Distinct instances
Affinity / Anti-Affinity
These offer customers greater control and choice over how to run their applications.