Host Health Monitoring with Docker Run

Noah Zoschke
@nzoschke
noah@convox.com
10/28/2015

Health Monitoring
circa 1999
• Nagios Core
• Event scheduler
• Event processor
• Alert manager
• Host groups config
• Ping
• HTTP
• SSH
• Nagios Remote Plugin Executor
• SNMP
• load
• disk
photo credit:
https://en.wikipedia.org/wiki/Nagios

Health Monitoring
circa 2012
• AMI
• Chef / Ansible
• ELB / Health Check
  • Protocol: HTTP (or HTTPS, TCP, SSL)
  • Port: 80
  • Path: /index.html
  • Timeout / Interval: 5s / 30s
  • Unhealthy / Healthy Threshold: 2 / 10
• EC2 / Status Checks
  • Loss of network
  • Loss of power
  • Host software problems
  • Host hardware problems
• ASG
photo credit:
http://aws.amazon.com/architecture/
http://blog.domenech.org/2012/11/aws-ec2-auto-scaling-basic-configuration.html
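The Unhealthy / Healthy Threshold settings on this slide can be read as a small state machine: an instance flips to unhealthy after 2 consecutive failed checks and back to healthy after 10 consecutive passes. The sketch below is a simplified model of that idea, not ELB's actual implementation; the `HealthTracker` name and its fields are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Simplified model of threshold-based health checking (hypothetical)."""

    unhealthy_threshold: int = 2   # consecutive failures before marking unhealthy
    healthy_threshold: int = 10    # consecutive passes before marking healthy
    healthy: bool = True
    streak: int = 0                # consecutive results that contradict the current state

    def observe(self, passed: bool) -> bool:
        """Record one check result and return the resulting health state."""
        if passed == self.healthy:
            # The result agrees with the current state, so any streak is broken.
            self.streak = 0
            return self.healthy
        self.streak += 1
        threshold = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= threshold:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy


tracker = HealthTracker()
for result in [False, False]:
    state = tracker.observe(result)
print("healthy" if state else "unhealthy")
```

In this model, with a 30s interval, two failed checks take roughly a minute to flip an instance to unhealthy, and ten passing checks take roughly five minutes to bring it back.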

But you probably still need…
• Nagios for monitoring
• or Zabbix, Ganglia, Sensu…
• or OpsView, SolarWinds…
• or Pingdom, Datadog…
• To provide system feedback
• ASG SetInstanceHealth
photo credit:
http://itomibhaa.deviantart.com/art/Who-watches-the-Watchmen-276285938

Health Monitoring
circa 2016, the age of containers
• Generic AMI
• Docker
• ECS
• Container scheduling and re-scheduling as a service
• ASG / EC2 / Status Checks
• Simple monitoring container
photo credit:
https://github.com/docker/swarm

[Diagram: an ECS cluster on an ASG of three instances, each running ecs-agent and dockerd. Containers: api (128 MB), registry (256 MB), rails web.1, web.2, web.3 (1024 MB each), rails worker.1, worker.2, worker.3, worker.4 (256 MB each), data worker.1, worker.2 (512 MB each). An api ELB and a rails ELB sit in front of the cluster.]

Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky

Failure Scenarios
• ecs-agent fails
• dockerd fails
photo credit:
http://paper-replika.com/index.php?option=com_content&view=article&id=76&Itemid=207693
> reschedule task

Container Schedulers are the new watchmen
• Container process monitoring
• Service health check monitoring
• Automatic re-scheduling
photo credit:
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_life_cycle.html

Failure Scenarios
• ecs-agent fails
• dockerd fails
ECS
Still need to configure an ASG to maintain capacity…

Failure Scenarios
• ecs-agent fails
• dockerd fails
ECS
Still need a monitor…

Health Monitoring
circa 2016, the age of containers
• Schedule a monitor process in the container cluster
• Describe ASG and ECS membership
• Mark all instances unregistered with ECS unhealthy
• `docker run` a user space health check on every instance
• Mark instances that fail to connect to Docker unhealthy
• Mark instances that fail the user space health check unhealthy
No Nagios server + plugins!
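The steps on this slide can be sketched as one decision function. This is a minimal illustration in Python, not the open source Golang monitor: the function name, the reason strings, and the `run_health_check` callback are all hypothetical, and the AWS calls that would supply the two membership lists are left out.

```python
from typing import Callable, Dict, Iterable


def find_unhealthy_instances(
    asg_instance_ids: Iterable[str],
    ecs_instance_ids: Iterable[str],
    run_health_check: Callable[[str], bool],
) -> Dict[str, str]:
    """Map each unhealthy instance ID to the reason it was flagged.

    run_health_check is a hypothetical callback that runs the user space
    check on one instance. It returns True on a pass and raises
    ConnectionError when Docker on that instance cannot be reached.
    """
    registered = set(ecs_instance_ids)
    unhealthy: Dict[str, str] = {}
    for instance_id in sorted(set(asg_instance_ids)):
        # In the ASG but unknown to ECS: the instance never registered.
        if instance_id not in registered:
            unhealthy[instance_id] = "not registered with ECS"
            continue
        try:
            passed = run_health_check(instance_id)
        except ConnectionError:
            unhealthy[instance_id] = "cannot connect to Docker"
            continue
        if not passed:
            unhealthy[instance_id] = "failed user space health check"
    return unhealthy


def fake_check(instance_id: str) -> bool:
    # Stand-in for a real check: i-c is unreachable, i-d fails the check.
    if instance_id == "i-c":
        raise ConnectionError("docker daemon unreachable")
    return instance_id != "i-d"


flagged = find_unhealthy_instances(
    ["i-a", "i-b", "i-c", "i-d"], ["i-a", "i-c", "i-d"], fake_check
)
for instance_id, reason in flagged.items():
    print(instance_id, reason)
```

A real monitor would presumably feed each flagged instance ID to the ASG SetInstanceHealth call mentioned earlier, so the ASG replaces it.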

Partial Failure Scenarios
battle scars
• ecs-agent fails
• dockerd fails
ECS
• Disk full
• Disk partition corrupt / read-only
• Network packet loss
• CPU steal
• Kernel bugs triggered
• Security vulnerabilities
• Security breaches
• …

User Space Health Check
$ docker run busybox sh -c \
    'dmesg | grep "Remounting filesystem read-only"'
# why not:
$ docker run health-check
To package, distribute and run common top, netstat, smartmontools, etc. binaries and scripts
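One way a monitor might drive the busybox check above is to run it as a subprocess and interpret the exit code: `grep` exits 0 when it finds the message (bad news here) and 1 when it finds nothing. The sketch below assumes the `docker` CLI is on the PATH; the function names and result strings are hypothetical, and the `--rm` flag is an addition to the slide's command so finished containers are cleaned up. Depending on the Docker version and host settings, reading the kernel log from a container may need extra privileges.

```python
import subprocess
from typing import Callable, List

PATTERN = "Remounting filesystem read-only"


def build_command(pattern: str = PATTERN, image: str = "busybox") -> List[str]:
    """Build the docker run invocation from the slide as an argument list."""
    script = f'dmesg | grep "{pattern}"'
    return ["docker", "run", "--rm", image, "sh", "-c", script]


def classify(exit_code: int) -> str:
    """Translate the exit code of the check into a result."""
    if exit_code == 0:
        return "unhealthy"  # grep matched: the filesystem was remounted read-only
    if exit_code == 1:
        return "healthy"    # grep found no match
    return "error"          # the check itself could not run properly


def run_check(runner: Callable = subprocess.run) -> str:
    """Run the check; runner is injectable so the logic can be tried without Docker."""
    try:
        result = runner(build_command(), capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return "error"
    return classify(result.returncode)


if __name__ == "__main__":
    print(run_check())
```

Following the monitoring steps in the previous slide, an "error" result would likely be treated the same as "unhealthy", since it usually means Docker on that instance could not be reached.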

Thanks!
Slides available on Medium / SlideShare
https://medium.com/@nzoschke/host-health-monitoring-with-docker-run-46315eb38286
http://www.slideshare.net/nzoschke/host-health-monitoring-with-docker-run
Open source Golang monitor available on GitHub
https://github.com/convox/rack/blob/master/api/workers/cluster.go
Questions / feedback to @nzoschke or noah@convox.com
