Host Health Monitoring with `docker run`
Noah Zoschke
@nzoschke
noah@convox.com
10 / 28 / 2015
Health Monitoring
circa 1999
• Nagios Core
• Event scheduler
• Event processor
• Alert manager
• Host groups config
• Ping
• HTTP
• SSH
• Nagios Remote Plugin Executor
• SNMP
• load
• disk
photo credit:
https://en.wikipedia.org/wiki/Nagios
Health Monitoring
circa 2012
• AMI
• Chef / Ansible
• ELB / Health Check
• Protocol: HTTP (or HTTPS, TCP, SSL)
• Port: 80
• Path: /index.html
• Timeout / Interval: 5s / 30s
• Unhealthy / Healthy Threshold: 2 / 10
• EC2 / Status Checks
• Loss of network
• Loss of power
• Host software problems
• Host hardware problems
• ASG
photo credit:
http://aws.amazon.com/architecture/
http://blog.domenech.org/2012/11/aws-ec2-auto-scaling-basic-configuration.html
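As a rough illustration of what the ELB settings above imply: an instance changes state only after a run of consecutive check results, so the minimum reaction time is about interval × threshold. The helper name `detect_seconds` is hypothetical, and the estimate ignores the 5s timeout and any scheduling jitter.

```sh
#!/bin/sh
# Sketch only: approximate seconds before an ELB changes its verdict,
# assuming one check per interval and N consecutive results are required.
# detect_seconds is a made-up helper name, not part of any AWS tool.
detect_seconds() {
  interval="$1"   # seconds between health checks
  threshold="$2"  # consecutive results needed to flip state
  echo $((interval * threshold))
}
```

With the slide's values, `detect_seconds 30 2` suggests roughly a minute to mark an instance unhealthy, and `detect_seconds 30 10` roughly five minutes to mark it healthy again.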
But you probably still need…
• Nagios for monitoring
• or Zabbix, Ganglia, Sensu…
• or OpsView, SolarWinds…
• or Pingdom, Datadog…
• To provide system feedback
• ASG SetInstanceHealth
photo credit:
http://itomibhaa.deviantart.com/art/Who-watches-the-Watchmen-276285938
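The feedback step can be sketched with the AWS CLI, whose `aws autoscaling set-instance-health` command corresponds to the SetInstanceHealth API. The wrapper name `mark_unhealthy` and the `AWS` override variable are assumptions added here so the command can be dry-run; the instance ID is made up.

```sh
#!/bin/sh
# Sketch only: report an instance as unhealthy to its Auto Scaling group.
# mark_unhealthy and the AWS override variable are hypothetical; setting
# AWS=echo prints the arguments instead of calling the real CLI.
mark_unhealthy() {
  instance_id="$1"
  "${AWS:-aws}" autoscaling set-instance-health \
    --instance-id "$instance_id" \
    --health-status Unhealthy
}
```

For example, `AWS=echo mark_unhealthy i-0123456789abcdef0` prints the arguments that would be passed, which is a cheap way to check a monitor's decisions before letting it replace instances.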
Health Monitoring
circa 2016, the age of containers
• Generic AMI
• Docker
• ECS
• Container scheduling and re-scheduling as a service
• ASG / EC2 / Status Checks
• Simple monitoring container
photo credit:
https://github.com/docker/swarm
[Diagram: ECS cluster on an ASG, behind an api ELB and a rails ELB. Three instances, each running ecs-agent and dockerd, host these containers:]
• api: 128 MB
• registry: 256 MB
• rails web.1, web.2, web.3: 1024 MB each
• rails worker.1, worker.2, worker.3, worker.4: 256 MB each
• data worker.1, worker.2: 512 MB each
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
photo credit:
http://paper-replika.com/index.php?option=com_content&view=article&id=76&Itemid=207693
> reschedule task
Container Schedulers are the new watchman
• Container process monitoring
• Service health check monitoring
• Automatic re-scheduling
photo credit:
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_life_cycle.html
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need to configure an ASG to maintain capacity…
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need a monitor…
Health Monitoring
circa 2016, the age of containers
• Schedule a monitor process in the container cluster
• Describe ASG and ECS membership
• Mark all instances not registered with ECS as unhealthy
• `docker run` a user space health check on every instance
• Mark instances that fail to connect to Docker as unhealthy
• Mark instances that fail the user space health check as unhealthy
No Nagios server + plugins!
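The membership comparison in the steps above boils down to a set difference: instances the ASG knows about minus instances registered with ECS. A minimal sketch, assuming both lists are already available as newline-separated instance IDs; the function name `unregistered` and the IDs are hypothetical, and this is not the code from the linked Go monitor.

```sh
#!/bin/sh
# Sketch only: print ASG instance IDs that are missing from the ECS list.
# $1: newline-separated instance IDs in the ASG
# $2: newline-separated instance IDs registered with ECS
unregistered() {
  # -F fixed strings, -x whole-line match, -v keep only non-matching lines.
  # grep exits 1 when nothing is printed, so "|| true" keeps that quiet.
  printf '%s\n' "$1" | grep -vxF -e "$2" || true
}
```

Each ID printed is a candidate to be reported through ASG SetInstanceHealth. In practice the two lists might come from `aws autoscaling describe-auto-scaling-groups` and `aws ecs describe-container-instances`, though how the original monitor gathers them is not shown in the slides.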
Partial Failure Scenarios
battle scars
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
• Disk full
• Disk partition corrupt / read-only
• Network packet loss
• CPU steal
• Kernel bugs triggered
• Security vulnerabilities
• Security breaches
• …
User Space Health Check
$ docker run busybox sh -c \
    'dmesg | grep "Remounting filesystem read-only"'
# why not:
$ docker run health-check
To package, distribute and run common binaries and scripts: top, netstat, smartmontools, etc.
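One way the contents of such a `health-check` image could look is a pair of small checks that read command output on stdin and exit non-zero on trouble. This is a sketch under assumptions: the function names, the messages, and the 90% disk threshold are invented here and are not taken from the talk or from convox/rack.

```sh
#!/bin/sh
# Sketch only: hypothetical checks for a health-check image.

# Reads kernel log text on stdin; returns 1 if a read-only remount appears.
check_readonly() {
  if grep -q "Remounting filesystem read-only"; then
    echo "unhealthy: filesystem remounted read-only"
    return 1
  fi
  echo "ok"
}

# Reads `df -P` output on stdin; returns 1 if any filesystem is at or
# above the given percentage ($1). Column 5 is Capacity, column 6 the mount.
check_disk() {
  awk -v max="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= max) { print "unhealthy: " $6 " at " $5; bad = 1 }
    }
    END { exit bad }'
}
```

An entrypoint could then chain them, for example `dmesg | check_readonly && df -P | check_disk 90`, so that the container's exit code tells the monitor whether the instance passed. Note that `df` inside a container reports the container's own mounts unless host paths are mounted in.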
Thanks!
Slides available on Medium / SlideShare
https://medium.com/@nzoschke/host-health-monitoring-with-docker-run-46315eb38286
http://www.slideshare.net/nzoschke/host-health-monitoring-with-docker-run
Open source Golang monitor available on GitHub
https://github.com/convox/rack/blob/master/api/workers/cluster.go
Questions / feedback to @nzoschke or noah@convox.com
