Running & Monitoring Docker at Scale

Monitoring and Running Docker Containers at Scale
Alexis Lê-Quôc, Datadog
November 12th, 2014 | Las Vegas
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Datadog
• Monitoring service
• Made for the cloud
• Aggregates everything
• Support for Docker (since 1.0)

Goals
1. Present key Docker metrics
2. Explain operational complexity
3. Rethink monitoring of Docker containers

Agenda
• A (very) brief history of containers
• Docker containers on AWS
• Key Docker metrics
• Operational complexity
• Monitoring Docker effectively
• Demo

Containers in a nutshell
• Been around for a long time
– jails, zones, cgroups
• No full-virtualization overhead
• Used for runtime isolation (e.g. jails)
• Docker: escape from dependency hell

Escape from dependency hell
a.out
shared libs
packages
omnibus
Docker ~

Container ~ single static binary
Process | Container | Host
Source | Dockerfile | Chef/Puppet, Kickstart
.TEXT | /var/lib/docker | Full distro
PID | Name/ID | Hostname

(Some) Docker use cases
• Continuous integration
– eliminate dependency variance
– same code from dev laptop to production
– git-like workflow
• Continuous delivery
– (quasi) stateless components
– web workers, video encoders, etc.
– not for data stores (Amazon RDS a better fit)

Instance types
[Chart: share by instance type. Legend: c3.2xl, m3.medium, m3.large, m3.xlarge, m1.large, the rest. Values as extracted: 20%, 20%, 19%, 13%, 8%, 21%]
Source: Datadog, October 2014

Containers per instance
• Average: 5 (October 2014)
• Highly dependent on the workload
• This is just the beginning…
• Expect higher container density going forward
Source: Datadog, October 2014

Monitoring fundamentals
Work
• Measures the amount of value created
• What your customers care about
– Database: queries answered
– Web server: requests served
– Queue: wait time distribution
Resource consumption
• Measures the amount of resources consumed to create value
• What your customers don’t care about
– Database: I/O throughput
– Web server: active connections
– OS: CPU utilization
– Container: memory footprint

Docker containers consume…
• Memory
• CPU
• I/O
• Network

Memory
Name | Why it matters
pgmajfault | Paging to/from disk is slow
pgfault | Context switches hurt application performance
resident set size (rss) | Too much RSS causes paging and swapping
swap | Swapping in/out is slow
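As an illustration of where these numbers come from, here is a minimal sketch that extracts the four metrics above from the text of a cgroup `memory.stat` file. It assumes cgroup v1, where the file typically lives under a path such as `/sys/fs/cgroup/memory/docker/<container id>/memory.stat` (the exact path depends on the distribution and Docker version, and `swap` only appears when swap accounting is enabled). The helper names and the sample values are hypothetical.

```python
KEY_METRICS = ("pgmajfault", "pgfault", "rss", "swap")


def parse_memory_stat(text):
    """Parse the 'name value' lines of a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def key_memory_metrics(text):
    """Keep only the metrics from the table; missing ones are reported as None."""
    stats = parse_memory_stat(text)
    return {name: stats.get(name) for name in KEY_METRICS}


# Hypothetical sample content, not taken from a real container.
SAMPLE = """cache 1048576
rss 52428800
swap 0
pgfault 12000
pgmajfault 3
"""

if __name__ == "__main__":
    print(key_memory_metrics(SAMPLE))
```

On a real host, the same function could be fed the contents of the file read with `open(path).read()`.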

CPU
Name | Why it matters
user | Measures work being done
system | System calls, a necessary evil
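The user and system values are cumulative counters, so they only become useful once two samples are compared. A minimal sketch, assuming the cgroup v1 `cpuacct.stat` format (two lines, `user` and `system`, counted in clock ticks) and a tick rate of 100 per second, which is common but should be checked on the host with `getconf CLK_TCK`. The function names and sample numbers are hypothetical.

```python
def parse_cpuacct_stat(text):
    """Parse 'user N' / 'system N' lines into a dict of tick counts."""
    ticks = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            ticks[parts[0]] = int(parts[1])
    return ticks


def cpu_percent(before, after, interval_s, ticks_per_second=100):
    """Convert the tick delta between two samples into percent of one CPU."""
    result = {}
    for mode in ("user", "system"):
        delta_ticks = after[mode] - before[mode]
        result[mode] = 100.0 * delta_ticks / ticks_per_second / interval_s
    return result


# Hypothetical samples taken 10 seconds apart.
BEFORE = "user 1000\nsystem 200\n"
AFTER = "user 1250\nsystem 230\n"

if __name__ == "__main__":
    usage = cpu_percent(parse_cpuacct_stat(BEFORE), parse_cpuacct_stat(AFTER), 10)
    print(usage)
```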

Block I/O
Name | Why it matters
blkio.io_service_bytes | I/O is (often) bottleneck
blkio.io_queued | Measures saturation
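A sketch of how `blkio.io_service_bytes` could be reduced to two numbers. It assumes the cgroup v1 layout, where each line reads `major:minor Operation bytes` and a final `Total` line closes the file. The function name and sample values are hypothetical.

```python
def io_bytes(text):
    """Sum Read and Write bytes across all block devices."""
    totals = {"Read": 0, "Write": 0}
    for line in text.splitlines():
        parts = line.split()
        # Per-device lines have three fields; the trailing "Total N" line has two.
        if len(parts) == 3 and parts[1] in totals:
            totals[parts[1]] += int(parts[2])
    return totals


# Hypothetical sample with two devices.
SAMPLE = """8:0 Read 4096
8:0 Write 8192
8:0 Sync 0
8:0 Async 12288
8:0 Total 12288
8:16 Read 1024
8:16 Write 0
8:16 Sync 0
8:16 Async 1024
8:16 Total 1024
Total 13312
"""

if __name__ == "__main__":
    print(io_bytes(SAMPLE))
```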

Network
Name | Why it matters
tx/rx_errors | Because… errors are bad.
tx/rx_dropped | Measures contention
tx/rx_bytes | Measures traffic

How to collect metrics
• https://github.com/google/cadvisor

Combinatorial multiplication
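[Diagram: three stacks side by side. Bare metal: your application and off-the-shelf software on one OS on hardware. Virtualized: an app and off-the-shelf software on each of two OSes, on a hypervisor, on hardware. Containerized: four app (A) and four off-the-shelf (O) containers across two OSes, on a hypervisor, on hardware.]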

Operational complexity
• Average containers per instance: N (N=5, 10/2014)
• N times as many “hosts” to manage
• Affects
– provisioning: prep’ing & building containers
– configuration: passing config to containers
– orchestration: deciding where/when containers run
– monitoring: making sure containers run properly

Monitoring: metric counts on Amazon EC2
• 1 Amazon EC2 instance
– 10 CloudWatch metrics
• 1 operating system (e.g. Linux)
– 100 metrics
• 1 Container
– 50 metrics
• 1 off-the-shelf application
– ~50 metrics

100 instances, 500 containers
(assuming only 5 containers per instance)

From 160 metrics per instance to 410 metrics per instance
(assuming only 5 containers per instance)
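One way to arrive at these figures, as a sketch: it assumes that 160 is the sum of the instance, OS, and one off-the-shelf application (10 + 100 + 50), and that each container adds its own 50 metrics on top. That reading is an inference from the counts on the previous slide, and the fleet-wide total is derived here rather than stated in the talk.

```python
# Per-unit metric counts from the slide above.
METRICS_PER_INSTANCE = 10     # CloudWatch metrics for 1 EC2 instance
METRICS_PER_OS = 100
METRICS_PER_APP = 50          # the slide says "~50"
METRICS_PER_CONTAINER = 50


def metrics_per_instance(containers):
    """Total metrics for one instance running the given number of containers."""
    base = METRICS_PER_INSTANCE + METRICS_PER_OS + METRICS_PER_APP
    return base + containers * METRICS_PER_CONTAINER


INSTANCES = 100
CONTAINERS_PER_INSTANCE = 5

if __name__ == "__main__":
    print(metrics_per_instance(0))                                    # 160
    print(metrics_per_instance(CONTAINERS_PER_INSTANCE))              # 410
    print(INSTANCES * CONTAINERS_PER_INSTANCE)                        # 500 containers
    print(INSTANCES * metrics_per_instance(CONTAINERS_PER_INSTANCE))  # 41000 metrics
```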

Velocity
EC2 instance half-life: hours, days, months
Container half-life: minutes, hours, days

Aggravating factors
• Hub-based provisioning
– new images every day
• Autonomic orchestration
– from imperative to declarative
– automated
– individual containers don’t matter
– e.g. Kubernetes, Mesos

If your monitoring is still centered on individual hosts or instances…

Host-centric monitoring
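[Diagram: the stack of apps (A) and off-the-shelf software (O) in containers, on OSes, on a hypervisor, labeled Monitor, GAP, Monitor: host-centric monitoring leaves a gap at the container layer]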

A lot more pain,
A lot faster.

Monitoring containers effectively

A new approach to container monitoring

Layers of monitoring
[Diagram: the same stack (A and O containers, OSes, hypervisor) with a single Monitor label covering it]

[Diagram: the same stack with three monitoring layers alongside it: APM, infrastructure monitoring, CloudWatch]

[Diagram: the same stack and the three monitoring layers (APM, infrastructure monitoring, CloudWatch), with example metrics: app throughput, db queries, web requests, filesystem, docker mem, docker cpu, cpu/net/io]

• Access to metrics from all the layers
• Amazon CloudWatch, OS metrics, Docker metrics, app metrics in 1 place
• Shared timeline

If your monitoring
does not cover all layers,
pain.

Tags
• Monitoring is like Auto-Scaling Groups
• Monitoring is like Docker orchestration
• From imperative to declarative
• Query-based
• Queries operate on tags

Monitoring with tags and queries
“Monitor all Docker containers running image web”
“… in region us-west-2 across all availability zones”
“… and make sure resident set size < 1GB on c3.xl”

Monitoring with tags and queries
“Monitor all Docker containers running image web”
“… in region us-west-2 across all availability zones”
“… that use more than 1.5x the average on c3.xl”
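The two queries above can be sketched in a few lines. This is an illustration of the idea only, not Datadog's query language: the inventory, the tag names (`image`, `region`, `az`, `instance_type`), and the helper functions are all hypothetical.

```python
from statistics import mean

MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical inventory: each container carries tags and a current RSS in bytes.
CONTAINERS = [
    {"name": "web-1", "rss": 400 * MIB,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2a", "instance_type": "c3.xl"}},
    {"name": "web-2", "rss": 500 * MIB,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2b", "instance_type": "c3.xl"}},
    {"name": "web-3", "rss": 450 * MIB,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2c", "instance_type": "c3.xl"}},
    {"name": "web-4", "rss": 1200 * MIB,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2a", "instance_type": "c3.xl"}},
    {"name": "web-5", "rss": 2000 * MIB,
     "tags": {"image": "web", "region": "us-east-1", "az": "us-east-1a", "instance_type": "c3.xl"}},
    {"name": "db-1", "rss": 3000 * MIB,
     "tags": {"image": "db", "region": "us-west-2", "az": "us-west-2a", "instance_type": "c3.xl"}},
]


def select(containers, **wanted):
    """Return the containers whose tags match every key=value pair in `wanted`."""
    return [c for c in containers
            if all(c["tags"].get(k) == v for k, v in wanted.items())]


def over_limit(containers, limit):
    """Names of containers that violate 'resident set size < limit'."""
    return [c["name"] for c in containers if c["rss"] >= limit]


def above_average(containers, factor=1.5):
    """Names of containers using more than `factor` times the group average."""
    if not containers:
        return []
    threshold = factor * mean(c["rss"] for c in containers)
    return [c["name"] for c in containers if c["rss"] > threshold]


if __name__ == "__main__":
    # No availability zone is given, so the scope spans all of them.
    scope = select(CONTAINERS, image="web", region="us-west-2", instance_type="c3.xl")
    print(over_limit(scope, 1 * GIB))
    print(above_average(scope))
```

Because the scope is defined by tags rather than by a list of names, containers that start or stop between two evaluations are picked up or dropped without changing the rule.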

“Dude, where’s my server?”

“Dude, where’s my container?”

If your monitoring
is not tag-based,
pain.

Take-aways
1. Docker increases operational complexity by an order of magnitude unless…
2. You have layered monitoring, from the instance to the container and to the application, and…
3. You monitor using tags and queries

Please give us your feedback on this
presentation
Join the conversation on Twitter with #reinvent