Monitoring and Running 
Docker Containers at Scale 
Alexis Lê-Quôc, Datadog 
November 12th, 2014 | Las Vegas 
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
@alq — CTO at Datadog
Datadog 
• Monitoring service 
• Made for the cloud 
• Aggregates everything 
• Support for Docker 
(since 1.0)
Goals 
1. Present key Docker metrics 
2. Explain operational complexity 
3. Rethink monitoring of Docker containers
Agenda 
• A (very) brief history of containers 
• Docker containers on AWS 
• Key Docker metrics 
• Operational complexity 
• Monitoring Docker effectively 
• Demo
A brief history of containers
Containers in a nutshell 
• Been around for a long time 
– jails, zones, cgroups 
• No full-virtualization overhead 
• Used for runtime isolation (e.g. jails) 
• Docker: escape from dependency hell
Escape from dependency hell 
a.out → shared libs → packages → omnibus → Docker
Container ~ single static binary 
Process    Container          Host
Source     Dockerfile         Chef/Puppet, Kickstart
.TEXT      /var/lib/docker    Full distro
PID        Name/ID            Hostname
Docker on AWS: some numbers
(Some) Docker use cases 
• Continuous integration 
– eliminate dependency variance 
– same code from dev laptop to production 
– git-like workflow 
• Continuous delivery 
– (quasi) stateless components 
– web workers, video encoders, etc. 
– not for data stores (Amazon RDS a better fit)
Instance types 
[Chart: share of instance types, values and labels in the order extracted]
c3.2xl    m3.medium    m3.large    m3.xlarge    m1.large    the rest
20%       20%          19%         13%          8%          21%
Source: Datadog, October 2014
Containers per instance 
• Average: 5 (October 2014) 
• Highly dependent on the workload 
• This is just the beginning… 
• Expect higher container density going forward 
Source: Datadog, October 2014
Key Docker metrics
Monitoring fundamentals 
Work                                    Resource consumption
Measures the amount of value created    Measures the amount of resources consumed to create value
What your customers care about          What your customers don’t care about
Database: queries answered              Database: I/O throughput
Web server: requests served             Web server: active connections
Queue: wait time distribution           OS: CPU utilization
                                        Container: memory footprint
Docker containers consume… 
• Memory 
• CPU 
• I/O 
• Network
Memory 
Name                       Why it matters
pgmajfault                 Paging to/from disk is slow
pgfault                    Context switches hurt application performance
resident set size (rss)    Too much RSS causes paging and swapping
swap                       Swapping in/out is slow
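
The counters in this table are exposed by the kernel as plain "name value" lines in the cgroup memory.stat file. The following is a minimal sketch of extracting them, assuming cgroup v1 naming. The sample text and its numbers are made up for illustration, and the location of the file on a real host depends on the Docker version and cgroup driver.

```python
# Sample text in the "name value" format of a cgroup v1 memory.stat file.
# The values are invented for illustration only.
SAMPLE_MEMORY_STAT = """\
cache 1048576
rss 52428800
swap 0
pgfault 120000
pgmajfault 12
"""

# The four counters listed on the slide.
KEYS = ("rss", "swap", "pgfault", "pgmajfault")


def parse_memory_stat(text, keys=KEYS):
    """Return the selected counters from memory.stat-style text as ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name value" lines for the counters we want.
        if len(parts) == 2 and parts[0] in keys:
            stats[parts[0]] = int(parts[1])
    return stats


if __name__ == "__main__":
    print(parse_memory_stat(SAMPLE_MEMORY_STAT))
```

pgfault and pgmajfault are cumulative counters, so a monitoring agent would normally report the difference between two samples rather than the raw value.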
CPU 
Name      Why it matters
user      Measures work being done
system    System calls, a necessary evil
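
User and system time are cumulative counters, so they only become useful once two samples are turned into a rate. A minimal sketch, assuming the cgroup v1 cpuacct.stat format ("user N" and "system N", counted in clock ticks) and assuming 100 ticks per second, which is common on Linux but not guaranteed. The sample values are hypothetical.

```python
def parse_cpuacct_stat(text):
    """Parse 'user N' / 'system N' lines into a dict of tick counts."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def cpu_percent(before, after, interval_s, ticks_per_s=100):
    """Percent of one CPU used between two samples taken interval_s apart.

    ticks_per_s=100 is an assumption; os.sysconf("SC_CLK_TCK") gives the
    real value on a given host.
    """
    result = {}
    for key in ("user", "system"):
        delta_ticks = after[key] - before[key]
        result[key] = 100.0 * delta_ticks / (ticks_per_s * interval_s)
    return result


if __name__ == "__main__":
    # Two hypothetical samples taken 10 seconds apart.
    first = parse_cpuacct_stat("user 1000\nsystem 200\n")
    second = parse_cpuacct_stat("user 1250\nsystem 230\n")
    print(cpu_percent(first, second, interval_s=10))
```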
Block I/O 
Name                      Why it matters
blkio.io_service_bytes    I/O is (often) bottleneck
blkio.io_queued           Measures saturation
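
In cgroup v1, blkio.io_service_bytes lists bytes per device and per operation ("major:minor Operation bytes"), followed by a grand total. A minimal sketch of summing reads and writes across devices; the sample text and its numbers are invented for illustration.

```python
# Sample in the cgroup v1 blkio.io_service_bytes layout; values are made up.
SAMPLE_IO_SERVICE_BYTES = """\
8:0 Read 4096000
8:0 Write 1024000
8:0 Sync 2048000
8:0 Async 3072000
8:0 Total 5120000
8:16 Read 1000
8:16 Write 2000
8:16 Sync 3000
8:16 Async 0
8:16 Total 3000
Total 5123000
"""


def sum_read_write(text):
    """Sum Read and Write bytes over all block devices."""
    totals = {"Read": 0, "Write": 0}
    for line in text.splitlines():
        parts = line.split()
        # Per-device lines have three fields; the final "Total N" line has two.
        if len(parts) == 3 and parts[1] in totals:
            totals[parts[1]] += int(parts[2])
    return totals


if __name__ == "__main__":
    print(sum_read_write(SAMPLE_IO_SERVICE_BYTES))
```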
Network 
Name             Why it matters
tx/rx_errors     Because… errors are bad.
tx/rx_dropped    Measures contention
tx/rx_bytes      Measures traffic
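
One place these counters can be read is /proc/net/dev, which lists per-interface receive and transmit counters. A minimal sketch of pulling out the six values from the table. The sample text is invented, and reading the file from inside the container's network namespace is an assumption about how a collector would be deployed.

```python
# Sample in the /proc/net/dev layout: two header lines, then one line per
# interface with 8 receive fields followed by 8 transmit fields.
SAMPLE_PROC_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000 10 1 2 0 0 0 0 2000 20 3 4 0 0 0 0
    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0
"""


def parse_net_dev(text):
    """Return bytes, errors and drops per interface, for rx and tx."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, _, rest = line.partition(":")
        fields = [int(value) for value in rest.split()]
        if len(fields) < 16:
            continue  # not a complete interface line
        counters[name.strip()] = {
            "rx_bytes": fields[0],
            "rx_errors": fields[2],
            "rx_dropped": fields[3],
            "tx_bytes": fields[8],
            "tx_errors": fields[10],
            "tx_dropped": fields[11],
        }
    return counters


if __name__ == "__main__":
    print(parse_net_dev(SAMPLE_PROC_NET_DEV)["eth0"])
```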
How to collect metrics 
• https://github.com/google/cadvisor
Operational complexity
Combinatorial multiplication 
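[Diagram: three deployment stacks, left to right]
Bare metal: your application and off-the-shelf components on one OS, on hardware
Virtualized: app and off-the-shelf components on each OS, several OSes on a hypervisor, on hardware
Containers: many apps (A) and off-the-shelf components (O) in containers, on several OSes, on a hypervisor, on hardware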
Operational complexity 
• Average containers per instance: N (N=5, 10/2014) 
• N-times as many “hosts” to manage 
• Affects 
– provisioning: prep’ing & building containers 
– configuration: passing config to containers 
– orchestration: deciding where/when containers run 
– monitoring: making sure containers run properly
Monitoring: metric counts on Amazon EC2 
• 1 Amazon EC2 instance 
– 10 CloudWatch metrics 
• 1 operating system (e.g. linux) 
– 100 metrics 
• 1 Container 
– 50 metrics 
• 1 off-the-shelf application 
– ~50 metrics
Combinatorial multiplication 
100 instances, 500 containers 
Assuming only 5 containers per instance
Combinatorial multiplication 
160 metrics per instance (without containers) 
410 metrics per instance (with containers) 
Assuming only 5 containers per instance
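
The two figures can be reproduced from the per-component counts on the "metric counts on Amazon EC2" slide. The slides do not spell out the breakdown, so the following is one reading that happens to match both numbers: 10 CloudWatch + 100 OS + 50 application metrics, plus 50 per container.

```python
# Per-component metric counts taken from the "metric counts" slide.
CLOUDWATCH_METRICS = 10
OS_METRICS = 100
APP_METRICS = 50        # slide says "~50" for one off-the-shelf application
CONTAINER_METRICS = 50


def metrics_per_instance(containers):
    """Metrics emitted by one instance running the given number of containers."""
    base = CLOUDWATCH_METRICS + OS_METRICS + APP_METRICS
    return base + containers * CONTAINER_METRICS


if __name__ == "__main__":
    print(metrics_per_instance(0))   # instance without containers
    print(metrics_per_instance(5))   # instance with 5 containers
    print(100 * 5)                   # containers across 100 instances
```

With these counts, an instance without containers yields 160 metrics and an instance with 5 containers yields 410, and each additional container adds another 50.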
Velocity 
EC2 instance half-life: hours, days, months 
Container half-life: minutes, hours, days
Aggravating factors 
• Hub-based provisioning 
– new images every day 
• Autonomic orchestration 
– from imperative to declarative 
– automated 
– individual containers don’t matter 
– e.g. kubernetes, mesos
A lot more, 
A lot faster.
If your monitoring is still centered on individual hosts or instances…
Host-centric monitoring 
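[Diagram: the container stack (apps A and off-the-shelf components O in containers, on OSes, on a hypervisor) with two "Monitor" labels alongside it and a "GAP" between them]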
A lot more pain, 
A lot faster.
Monitoring containers effectively
A new approach to container monitoring
Layers + Tags
Layers of monitoring 
[Diagram: the container stack (apps A and off-the-shelf components O in containers, on OSes, on a hypervisor) with monitoring alongside every layer]
Monitoring layers: APM, Infrastructure Monitoring, CloudWatch 
Example metrics: app throughput, db queries, web requests, filesystem, docker mem, docker cpu, cpu/net/io
Layers of monitoring 
• Access to metrics from all the layers 
• Amazon CloudWatch, OS metrics, Docker metrics, 
app metrics in 1 place 
• Shared timeline
If your monitoring 
does not cover all layers, 
pain.
Tags 
You use them already
Tags 
• Monitoring is like Auto-Scaling Groups 
• Monitoring is like Docker orchestration 
• From imperative to declarative 
• Query-based 
• Queries operate on tags
Monitoring with tags and queries 
“Monitor all Docker containers running image web” 
“… in region us-west-2 across all availability zones” 
“… and make sure resident set size < 1GB on c3.xl”
Monitoring with tags and queries 
“Monitor all Docker containers running image web” 
“… in region us-west-2 across all availability zones” 
“… that use more than 1.5x the average on c3.xl”
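
The quoted queries can be sketched as filters over tagged containers. This is an illustrative model only, not Datadog's query language: the container records, tag names, and helper functions below are all hypothetical.

```python
from statistics import mean

MIB = 2**20

# Hypothetical containers, each carrying tags and one memory metric.
CONTAINERS = [
    {"id": "c1", "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}, "rss_bytes": 400 * MIB},
    {"id": "c2", "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}, "rss_bytes": 1000 * MIB},
    {"id": "c3", "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}, "rss_bytes": 300 * MIB},
    {"id": "c4", "tags": {"image": "web", "region": "us-east-1", "instance_type": "c3.xl"}, "rss_bytes": 2000 * MIB},
    {"id": "c5", "tags": {"image": "db", "region": "us-west-2", "instance_type": "c3.xl"}, "rss_bytes": 3000 * MIB},
    {"id": "c6", "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}, "rss_bytes": 500 * MIB},
]


def select(containers, **wanted):
    """Keep containers whose tags match every key=value pair in the query."""
    return [c for c in containers
            if all(c["tags"].get(key) == value for key, value in wanted.items())]


def over_limit(containers, limit_bytes):
    """IDs of containers whose RSS is at or above a fixed limit."""
    return [c["id"] for c in containers if c["rss_bytes"] >= limit_bytes]


def above_average(containers, factor=1.5):
    """IDs of containers using more than factor times the group's mean RSS."""
    average = mean(c["rss_bytes"] for c in containers)
    return [c["id"] for c in containers if c["rss_bytes"] > factor * average]


if __name__ == "__main__":
    web = select(CONTAINERS, image="web", region="us-west-2", instance_type="c3.xl")
    print([c["id"] for c in web])
    print(over_limit(web, 2**30))   # fixed 1 GB rule
    print(above_average(web))       # relative 1.5x-the-average rule
```

The query never names a host or a container. Whatever currently carries the matching tags is in scope, so containers that start or stop are picked up or dropped without changing the rule.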
“Dude, where’s my server?”
“Dude, where’s my container?”
If your monitoring 
is not tag-based, 
pain.
Demo
Take-aways 
1. Docker increases operational complexity by an order 
of magnitude unless… 
2. You have layered monitoring, from the instance to 
the container and to the application, and… 
3. You monitor using tags and queries
Please give us your feedback on this 
presentation 
Join the conversation on Twitter with #reinvent 