Host Health
Monitoring
with `docker run`
Noah Zoschke
@nzoschke
noah@convox.com
10 / 28 / 2015
Health
Monitoring
circa 1999
• Nagios Core
• Event scheduler
• Event processor
• Alert manager
• Host groups config
• Ping
• HTTP
• SSH
• Nagios Remote Plugin Executor
• SNMP
• load
• disk
photo credit:
https://en.wikipedia.org/wiki/Nagios
Health Monitoring
circa 2012
• AMI
• Chef / Ansible
• ELB / Health Check
• Protocol: HTTP (or HTTPS, TCP, SSL)
• Port: 80
• Path: /index.html
• Timeout / Interval: 5s / 30s
• Unhealthy / Healthy Threshold: 2 / 10
• EC2 / Status Checks
• Loss of network
• Loss of power
• Host software problems
• Host hardware problems
• ASG
photo credit:
http://aws.amazon.com/architecture/
http://blog.domenech.org/2012/11/aws-ec2-auto-scaling-basic-configuration.html
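
The threshold settings above can be read as a small state machine. The following is a minimal sketch, not ELB's actual implementation: it assumes an instance is marked unhealthy after "Unhealthy Threshold" consecutive failed checks and healthy again after "Healthy Threshold" consecutive passing checks. The function name `elb_state` and the `ok` / `fail` tokens are made up for illustration.

```shell
#!/bin/sh
# elb_state UNHEALTHY_THRESHOLD HEALTHY_THRESHOLD RESULT...
# Each RESULT is "ok" or "fail", one per health check interval.
# Prints the state ("healthy" or "unhealthy") after the last result.
elb_state() {
  unhealthy_threshold=$1
  healthy_threshold=$2
  shift 2
  state=healthy   # assumption: the instance starts out in service
  ok_run=0        # current streak of passing checks
  fail_run=0      # current streak of failing checks
  for result in "$@"; do
    if [ "$result" = ok ]; then
      ok_run=$((ok_run + 1))
      fail_run=0
    else
      fail_run=$((fail_run + 1))
      ok_run=0
    fi
    if [ "$fail_run" -ge "$unhealthy_threshold" ]; then
      state=unhealthy
    elif [ "$ok_run" -ge "$healthy_threshold" ]; then
      state=healthy
    fi
  done
  echo "$state"
}
```

For example, `elb_state 2 10 ok fail fail` ends in the unhealthy state. With a 30s interval and thresholds of 2 / 10, that is roughly one minute to take an instance out of service and roughly five minutes of passing checks to put it back.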
But you probably still
need…
• Nagios for monitoring
• or Zabbix, Ganglia, Sensu…
• or OpsView, SolarWinds…
• or Pingdom, Datadog…
• To provide system feedback
• ASG SetInstanceHealth
photo credit:
http://itomibhaa.deviantart.com/art/Who-watches-the-Watchmen-276285938
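
The SetInstanceHealth feedback step might look like the sketch below, which wraps the AWS CLI command `aws autoscaling set-instance-health`. The function name `mark_unhealthy` and the `DRY_RUN` switch are hypothetical helpers for this example, and the instance ID in the usage note is a placeholder.

```shell
#!/bin/sh
# mark_unhealthy INSTANCE_ID
# Reports an instance as unhealthy to its Auto Scaling group so that the
# group can terminate and replace it.
# DRY_RUN is a hypothetical switch for this sketch: when it is set to 1
# the command is printed instead of executed.
mark_unhealthy() {
  instance_id=$1
  # Build the command as the positional parameters so it can be either
  # printed or executed without re-quoting.
  set -- aws autoscaling set-instance-health \
    --instance-id "$instance_id" \
    --health-status Unhealthy
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Running `DRY_RUN=1 mark_unhealthy i-0123456789abcdef0` prints the command that a monitor such as Nagios would otherwise execute when one of its checks fails.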
Health Monitoring
circa 2016, the age of containers
• Generic AMI
• Docker
• ECS
• Container scheduling
and re-scheduling as a
service
• ASG / EC2 / Status Checks
• Simple monitoring
container
photo credit:
https://github.com/docker/swarm
[Diagram: an ECS cluster of three instances in an ASG, each running ecs-agent and dockerd, behind an api ELB and a rails ELB. Containers and memory: api (128 MB), registry (256 MB), rails web.1 to web.3 (1024 MB each), data worker.1 and worker.2 (512 MB each), rails worker.1 to worker.4 (256 MB each).]
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with
ECS
• Instance userspace gets wacky
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
photo credit:
http://paper-replika.com/index.php?option=com_content&view=article&id=76&Itemid=207693
> reschedule task
Container Schedulers are
the new watchman
• Container process
monitoring
• Service health check
monitoring
• Automatic re-scheduling
photo credit:
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_life_cycle.html
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with
ECS
• Instance userspace gets wacky
Still need to configure an ASG
to maintain capacity…
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with
ECS
• Instance userspace gets wacky
Still need a monitor…
Health Monitoring
circa 2016, the age of containers
• Schedule a monitor process in
container cluster
• Describe ASG and ECS membership
• Mark all instances unregistered
with ECS unhealthy
• `docker run` a user space health
check on every instance
• Mark instances that fail to
connect to Docker unhealthy
• Mark instances that fail user
space health check unhealthy
No Nagios server + plugins!
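
The membership comparison in the steps above can be sketched as a set difference. This is an illustration under stated assumptions, not the code of the Convox monitor: the function name `unregistered_instances` is hypothetical, and the two ID lists are assumed to have been fetched beforehand, for example with `aws autoscaling describe-auto-scaling-groups` and `aws ecs describe-container-instances`.

```shell
#!/bin/sh
# unregistered_instances ASG_IDS ECS_IDS
# Both arguments are whitespace-separated lists of EC2 instance IDs.
# Prints, one per line, every ID that is in the ASG but not registered
# with ECS. These are the instances the monitor would mark unhealthy.
unregistered_instances() {
  for id in $1; do
    # $(echo $2) collapses tabs and newlines into single spaces so the
    # padded pattern match below works on any whitespace-separated list.
    case " $(echo $2) " in
      *" $id "*) ;;        # registered with ECS: nothing to do
      *) echo "$id" ;;     # in the ASG but unknown to ECS
    esac
  done
}
```

For example, `unregistered_instances "i-a i-b i-c" "i-a i-c"` prints `i-b`. Each printed ID would then be reported to the ASG as unhealthy.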
Partial Failure Scenarios
battle scars
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with
ECS
• Instance userspace gets wacky
• Disk full
• Disk partition corrupt / read-only
• Network packet loss
• CPU steal
• Kernel bugs triggered
• Security vulnerabilities
• Security breaches
• …
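
As one example of turning a scenario from this list into a check, the sketch below flags full disks by parsing `df -P` output. The function name `full_filesystems` and the 90 percent limit in the usage note are assumptions for illustration. Mount points containing spaces are not handled.

```shell
#!/bin/sh
# full_filesystems LIMIT
# Reads `df -P` output on stdin and prints the mount point of every
# filesystem whose usage is at or above LIMIT percent.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {       # skip the header line
    use = $5                        # capacity column, e.g. "98%"
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6
  }'
}
```

A check could run `df -P | full_filesystems 90` and treat any output as a failure. Inside a container, `df` reports the container's own view of its mounts, so the host paths of interest would have to be mounted into it.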
User Space Health Check
$ docker run busybox sh -c \
    'dmesg | grep "Remounting filesystem read-only"'
# why not:
$ docker run health-check
to package, distribute, and run common binaries and scripts
such as top, netstat, and smartmontools
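
Note that the `grep` in the command above exits 0 when the problem is found, which is the reverse of the usual "exit 0 means healthy" convention. A script inside a `health-check` image might wrap it as in the sketch below. The function name `kernel_log_healthy` is hypothetical, and the contents of such an image are an assumption.

```shell
#!/bin/sh
# kernel_log_healthy
# Reads kernel log text on stdin.
# Returns 0 (healthy) when no read-only remount message is present and
# 1 (unhealthy) when one is found.
kernel_log_healthy() {
  if grep -q "Remounting filesystem read-only"; then
    return 1
  fi
  return 0
}
```

An entrypoint could then run `dmesg | kernel_log_healthy` and exit with its status, so the monitor only needs to look at the exit code of `docker run health-check`.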
Thanks!
Slides available on Medium / SlideShare
https://medium.com/@nzoschke/host-health-monitoring-with-docker-run-46315eb38286
http://www.slideshare.net/nzoschke/host-health-monitoring-with-docker-run
Open source Golang monitor available on GitHub
https://github.com/convox/rack/blob/master/api/workers/cluster.go
Questions / feedback to @nzoschke or noah@convox.com
