SlideShare a Scribd company logo
Please stop using Nagios
(so it can die peacefully)
Andy Sykes
Devops @ Forward3D
@supersheep
andy@forward3d.com
Do you use Nagios?
Tell me why you picked it.
Go on.
If you don't, why don't you?
Reasons for choosing Nagios

•  stupid simple plugin system
•  billions* of existing plugins
•  years of development behind it
•  you can hire people who know it
"Everybody uses it."**

* may not actually be true
** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know
who you are.
Reasons for choosing Nagios

•  stupid simple plugin system
•  billions* of existing plugins
•  years of development behind it
•  you can hire people who know it
"Everybody uses it."**

* may not actually be true
** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know
who you are.
So why did you pick Nagios?
Because it's the "safe", default choice.
Because we've grown accustomed to the things
that really, really suck about it.
It's a little like we've all got Stockholm
Syndrome.
What Nagios gets right
Incredibly simple plugin model.
Fairly secure (SSL between agents + master).
Very simple conceptually.
Reliable.
Nagios, I hate thee; let me count thy ways
Doesn't scale. At all.
World's second most horrible configuration*.
Horrendous interface**.
Assumes a static infrastructure.
No decent programmatic interfaces***.
Throws away perfdata.
Stupid wire format for clients (NRPE/NSCA).
* the world's most horrible configuration is, obviously, Sendmail.
** even the paid Nagios XI one is ugly as sin and unusable.
*** if I catch you parsing status.dat, I will beat your ass.
Expansion about config
Configuration has to be in two places:
Server has to know what checks to invoke
via NRPE.
Client has to know what checks it will be
asked to invoke with NRPE.
THIS IS MADNESS.
Scaling, or lack of it
No such thing as a Nagios cluster.
More checks = more work = longer before you
know something's happened!
Every check increases your master's load
average.
Okay, yes, there’s mod_gearman
But it’s a hack at best.
No redundancy for the machine that distributes
the checks, so it’s not a real cluster.
API poverty
Can't easily integrate with other systems.
Can't easily write custom dashboards.
Can't get information out again!

Assumes a static infra
Master has to be told about a client before
things can happen.
The bandaids we make
Interface:
Opsview, Icinga, Shinken, others

API:
Parsing status.dat, NDO

Client wire format:
Opsview's NRPE, NRD

Config management:
Puppet types, Chef cookbooks
None of it is good enough.
The take-home point:

"If we keep using Nagios,
we'll never get anything
better."
(Writing monitoring systems is hard, and needs community involvement and
real world adoption. Nagios steals mindshare by being just good enough. It's
the monitoring system we deserve, but not the one we need right now.)
So, smart guy. What do we do?

Steal all the things that are great about Nagios.
(existing plugin investment, simplicity, security, reliability)

Strap them to something more awesome.
(scalable, API-ready, config management friendly, modern!)
THIS DOESN’T MEAN WRITING
YOUR OWN MONITORING SYSTEM
Points for thought:

●  What else are people using?
●  Should we greenfield or lift existing tools?
●  What tools could we go with?
My suggestion:

Like OMD, but better.
Wrap up a series of “best in breed” tools to
make one kickass monitoring tool.
What we need:
Core
Agent
Graphing
Anomaly detection
Alerting
UI
Core:
Holds configuration about hosts / services
Distributed across X masters
Check execution (poke)
Results queue (poke response)
There’s something we can use for this.
Sensu!
Sensu is often described as the “monitoring router”.
{
"checks": {
"chef_client": {
"command": "check-chef-client.rb",
"subscribers": [
"production" ],
"interval": 60,
"handlers": [
"pagerduty",
"irc"
]
}
}
}

Only on the server
Client requires no registration for the server
to know about it
Uses Nagios status return codes
Doesn’t talk to the server - talks to
RabbitMQ
Core:
Holds configuration about hosts / services
Distributed across X masters
Check execution (poke)
Results queue (poke response)
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing
Anomaly detection
Alerting
UI
Graphing is easy now.
If you’re not using Graphite, you should be.
Sensu “metric” checks can pump data to it.
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection
Alerting
UI
Anomaly detection is hard.
We’ve got all this metric data, but how do we check it?
- Skyline/Oculus (Etsy)
- Grok (very early days)
- ???
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting
UI
Alerting is tricky, but mostly solved.
Flapjack! - flapjack.io
Alerting is not the concern of your monitoring tool.
Push all alerts at Flapjack
- define gateways (PagerDuty, email)
- create relationships between checks and gateways
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
User interfaces are hard.
What do we need from it?
- What’s broken
- When it broke, when it broke in the past
- Say “OK, I know it’s broken”
- View graphs to see how quickly it broke
- See every check everywhere, and filter the list
The Sensu Dashboard sucks.
No history!
Acknowledgements aren’t easy to do.
No graphing.
Can’t see anything that’s reporting an OK status.
This won’t do.
I’m going to have to write a UI. Sigh.
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
- ???
In Summary

Nagios sucks.
There are good tools for each concern
of monitoring.
If we can package them together, we
can have something that rocks.
Thank You.

Contact
andy@forward3d.com (@supersheep)

More Related Content

What's hot

Introduction To Network Design
Introduction To Network DesignIntroduction To Network Design
Introduction To Network DesignSteven Cahill
 
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
KWON JUNHYEOK
 
Iptables the Linux Firewall
Iptables the Linux Firewall Iptables the Linux Firewall
Iptables the Linux Firewall
Syed fawad Gillani
 
Reverse shell
Reverse shellReverse shell
Reverse shell
Ilan Mindel
 
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
InfluxData
 
Autobahn primer
Autobahn primerAutobahn primer
Autobahn primer
Robbie Byrd
 
Infrastructure testing with Molecule and TestInfra
Infrastructure testing with Molecule and TestInfraInfrastructure testing with Molecule and TestInfra
Infrastructure testing with Molecule and TestInfra
Tomislav Plavcic
 
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob LisiUsing Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
InfluxData
 
Nagios intro
Nagios intro Nagios intro
Nagios intro
Hsi-Kai Wang
 
Create useful data center health visualizations with Dell iDRAC Telemetry Ref...
Create useful data center health visualizations with Dell iDRAC Telemetry Ref...Create useful data center health visualizations with Dell iDRAC Telemetry Ref...
Create useful data center health visualizations with Dell iDRAC Telemetry Ref...
Principled Technologies
 
Pong game breakdown
Pong game breakdownPong game breakdown
Pong game breakdownJamie Hyman
 
Am I being spied on: Low-tech ways of detecting high-tech surveillance (DEFCO...
Am I being spied on: Low-tech ways of detecting high-tech surveillance (DEFCO...Am I being spied on: Low-tech ways of detecting high-tech surveillance (DEFCO...
Am I being spied on: Low-tech ways of detecting high-tech surveillance (DEFCO...
Philip Polstra
 
Proxy server
Proxy serverProxy server
Proxy server
Proxies Rent
 
01 Metasploit kung fu introduction
01 Metasploit kung fu introduction01 Metasploit kung fu introduction
01 Metasploit kung fu introduction
Mostafa Abdel-sallam
 
Node.js - #6 - Core Modules - net - Rodrigo Branas
Node.js - #6 - Core Modules - net - Rodrigo BranasNode.js - #6 - Core Modules - net - Rodrigo Branas
Node.js - #6 - Core Modules - net - Rodrigo Branas
Rodrigo Branas
 
VPN Virtual Private Network
VPN Virtual Private NetworkVPN Virtual Private Network
VPN Virtual Private Network
Rama Krishna Nakka
 
Virtual training Intro to InfluxDB & Telegraf
Virtual training  Intro to InfluxDB & TelegrafVirtual training  Intro to InfluxDB & Telegraf
Virtual training Intro to InfluxDB & Telegraf
InfluxData
 
VIRTUAL PRIVATE NETWORKS BY SAIKIRAN PANJALA
VIRTUAL PRIVATE NETWORKS BY SAIKIRAN PANJALAVIRTUAL PRIVATE NETWORKS BY SAIKIRAN PANJALA
VIRTUAL PRIVATE NETWORKS BY SAIKIRAN PANJALA
Saikiran Panjala
 
Iptables presentation
Iptables presentationIptables presentation
Iptables presentation
Emin Abdul Azeez
 
Chapter11
Chapter11Chapter11
Chapter11
Muhammad Ahad
 

What's hot (20)

Introduction To Network Design
Introduction To Network DesignIntroduction To Network Design
Introduction To Network Design
 
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
 
Iptables the Linux Firewall
Iptables the Linux Firewall Iptables the Linux Firewall
Iptables the Linux Firewall
 
Reverse shell
Reverse shellReverse shell
Reverse shell
 
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
Discover How IBM Uses InfluxDB and Grafana to Help Clients Monitor Large Prod...
 
Autobahn primer
Autobahn primerAutobahn primer
Autobahn primer
 
Infrastructure testing with Molecule and TestInfra
Infrastructure testing with Molecule and TestInfraInfrastructure testing with Molecule and TestInfra
Infrastructure testing with Molecule and TestInfra
 
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob LisiUsing Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
Using Grafana with InfluxDB 2.0 and Flux Lang by Jacob Lisi
 
Nagios intro
Nagios intro Nagios intro
Nagios intro
 
Create useful data center health visualizations with Dell iDRAC Telemetry Ref...
Create useful data center health visualizations with Dell iDRAC Telemetry Ref...Create useful data center health visualizations with Dell iDRAC Telemetry Ref...
Create useful data center health visualizations with Dell iDRAC Telemetry Ref...
 
Pong game breakdown
Pong game breakdownPong game breakdown
Pong game breakdown
 
Am I being spied on: Low-tech ways of detecting high-tech surveillance (DEFCO...
Am I being spied on: Low-tech ways of detecting high-tech surveillance (DEFCO...Am I being spied on: Low-tech ways of detecting high-tech surveillance (DEFCO...
Am I being spied on: Low-tech ways of detecting high-tech surveillance (DEFCO...
 
Proxy server
Proxy serverProxy server
Proxy server
 
01 Metasploit kung fu introduction
01 Metasploit kung fu introduction01 Metasploit kung fu introduction
01 Metasploit kung fu introduction
 
Node.js - #6 - Core Modules - net - Rodrigo Branas
Node.js - #6 - Core Modules - net - Rodrigo BranasNode.js - #6 - Core Modules - net - Rodrigo Branas
Node.js - #6 - Core Modules - net - Rodrigo Branas
 
VPN Virtual Private Network
VPN Virtual Private NetworkVPN Virtual Private Network
VPN Virtual Private Network
 
Virtual training Intro to InfluxDB & Telegraf
Virtual training  Intro to InfluxDB & TelegrafVirtual training  Intro to InfluxDB & Telegraf
Virtual training Intro to InfluxDB & Telegraf
 
VIRTUAL PRIVATE NETWORKS BY SAIKIRAN PANJALA
VIRTUAL PRIVATE NETWORKS BY SAIKIRAN PANJALAVIRTUAL PRIVATE NETWORKS BY SAIKIRAN PANJALA
VIRTUAL PRIVATE NETWORKS BY SAIKIRAN PANJALA
 
Iptables presentation
Iptables presentationIptables presentation
Iptables presentation
 
Chapter11
Chapter11Chapter11
Chapter11
 

Viewers also liked

Zabbix 3.0 and beyond - FISL 2015
Zabbix 3.0 and beyond - FISL 2015Zabbix 3.0 and beyond - FISL 2015
Zabbix 3.0 and beyond - FISL 2015
Zabbix
 
Grafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and ChallengesGrafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and Challenges
Philip Wernersbach
 
Andrew Nelson - Zabbix and SNMP on Linux
Andrew Nelson - Zabbix and SNMP on LinuxAndrew Nelson - Zabbix and SNMP on Linux
Andrew Nelson - Zabbix and SNMP on Linux
Zabbix
 
Icinga Camp Barcelona - Current State of Icinga
Icinga Camp Barcelona - Current State of IcingaIcinga Camp Barcelona - Current State of Icinga
Icinga Camp Barcelona - Current State of Icinga
Icinga
 
Monitoring the #DevOps way
Monitoring the #DevOps wayMonitoring the #DevOps way
Monitoring the #DevOps way
Theo Schlossnagle
 
Alexei Vladishev - Opening Speech
Alexei Vladishev - Opening SpeechAlexei Vladishev - Opening Speech
Alexei Vladishev - Opening Speech
Zabbix
 
Fall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using GrafanaFall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using Grafana
torkelo
 

Viewers also liked (7)

Zabbix 3.0 and beyond - FISL 2015
Zabbix 3.0 and beyond - FISL 2015Zabbix 3.0 and beyond - FISL 2015
Zabbix 3.0 and beyond - FISL 2015
 
Grafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and ChallengesGrafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and Challenges
 
Andrew Nelson - Zabbix and SNMP on Linux
Andrew Nelson - Zabbix and SNMP on LinuxAndrew Nelson - Zabbix and SNMP on Linux
Andrew Nelson - Zabbix and SNMP on Linux
 
Icinga Camp Barcelona - Current State of Icinga
Icinga Camp Barcelona - Current State of IcingaIcinga Camp Barcelona - Current State of Icinga
Icinga Camp Barcelona - Current State of Icinga
 
Monitoring the #DevOps way
Monitoring the #DevOps wayMonitoring the #DevOps way
Monitoring the #DevOps way
 
Alexei Vladishev - Opening Speech
Alexei Vladishev - Opening SpeechAlexei Vladishev - Opening Speech
Alexei Vladishev - Opening Speech
 
Fall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using GrafanaFall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using Grafana
 

Similar to Stop using Nagios (so it can die peacefully)

How Yelp Uses Sensu to Monitor Services in a SOA World
How Yelp Uses Sensu to Monitor Services in a SOA WorldHow Yelp Uses Sensu to Monitor Services in a SOA World
How Yelp Uses Sensu to Monitor Services in a SOA World
Kyle Anderson
 
Monitoring with sensu
Monitoring with sensuMonitoring with sensu
Monitoring with sensu
miquelruizm
 
Automating Monitoring with Puppet
Automating Monitoring with PuppetAutomating Monitoring with Puppet
Automating Monitoring with Puppet
Christian Mague
 
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
Puppet
 
Django: Beyond Basics
Django: Beyond BasicsDjango: Beyond Basics
Django: Beyond Basicsarunvr
 
Sensu @ Yelp!: A Guided Tour
Sensu @ Yelp!: A Guided TourSensu @ Yelp!: A Guided Tour
Sensu @ Yelp!: A Guided Tour
Kyle Anderson
 
Making operations visible - Nick Gallbreath
Making operations visible - Nick GallbreathMaking operations visible - Nick Gallbreath
Making operations visible - Nick GallbreathDevopsdays
 
Making operations visible - devopsdays tokyo 2013
Making operations visible  - devopsdays tokyo 2013Making operations visible  - devopsdays tokyo 2013
Making operations visible - devopsdays tokyo 2013Nick Galbreath
 
Move out from AppEngine, and Python PaaS alternatives
Move out from AppEngine, and Python PaaS alternativesMove out from AppEngine, and Python PaaS alternatives
Move out from AppEngine, and Python PaaS alternativestzang ms
 
Advanced googling
Advanced googlingAdvanced googling
Advanced googling
sonuagain
 
Google Hacking
Google HackingGoogle Hacking
Google Hacking
Pim Piepers
 
OSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean GabèsOSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean Gabès
NETWAYS
 
Monitoring of OpenNebula installations
Monitoring of OpenNebula installationsMonitoring of OpenNebula installations
Monitoring of OpenNebula installations
NETWAYS
 
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebula Project
 
Abusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec gloryAbusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec glory
Priyanka Aash
 
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
André Goliath
 
Hacklu2011 tricaud
Hacklu2011 tricaudHacklu2011 tricaud
Hacklu2011 tricaud
stricaud
 
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Sylvain Kalache
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Demi Ben-Ari
 

Similar to Stop using Nagios (so it can die peacefully) (20)

How Yelp Uses Sensu to Monitor Services in a SOA World
How Yelp Uses Sensu to Monitor Services in a SOA WorldHow Yelp Uses Sensu to Monitor Services in a SOA World
How Yelp Uses Sensu to Monitor Services in a SOA World
 
Monitoring with sensu
Monitoring with sensuMonitoring with sensu
Monitoring with sensu
 
Automating Monitoring with Puppet
Automating Monitoring with PuppetAutomating Monitoring with Puppet
Automating Monitoring with Puppet
 
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
 
Django: Beyond Basics
Django: Beyond BasicsDjango: Beyond Basics
Django: Beyond Basics
 
Sensu @ Yelp!: A Guided Tour
Sensu @ Yelp!: A Guided TourSensu @ Yelp!: A Guided Tour
Sensu @ Yelp!: A Guided Tour
 
Making operations visible - Nick Gallbreath
Making operations visible - Nick GallbreathMaking operations visible - Nick Gallbreath
Making operations visible - Nick Gallbreath
 
Making operations visible - devopsdays tokyo 2013
Making operations visible  - devopsdays tokyo 2013Making operations visible  - devopsdays tokyo 2013
Making operations visible - devopsdays tokyo 2013
 
Move out from AppEngine, and Python PaaS alternatives
Move out from AppEngine, and Python PaaS alternativesMove out from AppEngine, and Python PaaS alternatives
Move out from AppEngine, and Python PaaS alternatives
 
Advanced googling
Advanced googlingAdvanced googling
Advanced googling
 
Google Hacking
Google HackingGoogle Hacking
Google Hacking
 
OSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean GabèsOSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean Gabès
 
Monitoring of OpenNebula installations
Monitoring of OpenNebula installationsMonitoring of OpenNebula installations
Monitoring of OpenNebula installations
 
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
 
Abusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec gloryAbusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec glory
 
Django Girls Tutorial
Django Girls TutorialDjango Girls Tutorial
Django Girls Tutorial
 
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
 
Hacklu2011 tricaud
Hacklu2011 tricaudHacklu2011 tricaud
Hacklu2011 tricaud
 
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 

Recently uploaded

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 

Recently uploaded (20)

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 

Stop using Nagios (so it can die peacefully)

  • 1. Please stop using Nagios (so it can die peacefully) Andy Sykes Devops @ Forward3D @supersheep andy@forward3d.com
  • 2. Do you use Nagios? Tell me why you picked it. Go on. If you don't, why don't you?
  • 3. Reasons for choosing Nagios •  stupid simple plugin system •  billions* of existing plugins •  years of development behind it •  you can hire people who know it "Everybody uses it."** * may not actually be true ** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know who you are.
  • 4. Reasons for choosing Nagios •  stupid simple plugin system •  billions* of existing plugins •  years of development behind it •  you can hire people who know it "Everybody uses it."** * may not actually be true ** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know who you are.
  • 5. So why did you pick Nagios? Because it's the "safe", default choice. Because we've grown accustomed to the things that really, really suck about it. It's a little like we've all got Stockholm Syndrome.
  • 6. What Nagios gets right Incredibly simple plugin model. Fairly secure (SSL between agents + master). Very simple conceptually. Reliable.
  • 7. Nagios, I hate thee; let me count thy ways Doesn't scale. At all. World's second most horrible configuration*. Horrendous interface**. Assumes a static infrastructure. No decent programmatic interfaces***. Throws away perfdata. Stupid wire format for clients (NRPE/NSCA). * the world's most horrible configuration is, obviously, Sendmail. ** even the paid Nagios XI one is ugly as sin and unusable. *** if I catch you parsing status.dat, I will beat your ass.
  • 8. Expansion about config Configuration has to be in two places: Server has to know what checks to invoke via NRPE. Client has to know what checks it will be asked to invoke with NRPE. THIS IS MADNESS.
  • 9. Scaling, or lack of it No such thing as a Nagios cluster. More checks = more work = longer before you know something's happened! Every check increases your master's load average.
  • 10. Okay, yes, there’s mod_gearman But it’s a hack at best. No redundancy for the machine that distributes the checks, so it’s not a real cluster.
  • 11. API poverty Can't easily integrate with other systems. Can't easily write custom dashboards. Can't get information out again! Assumes a static infra Master has to be told about a client before things can happen.
  • 12. The bandaids we make Interface: Opsview, Icinga, Shinken, others API: Parsing status.dat, NDO Client wire format: Opsview's NRPE, NRD Config management: Puppet types, Chef cookbooks None of it is good enough.
  • 13. The take-home point: "If we keep using Nagios, we'll never get anything better." (Writing monitoring systems is hard, and needs community involvement and real world adoption. Nagios steals mindshare by being just good enough. It's the monitoring system we deserve, but not the one we need right now.)
  • 14. So, smart guy. What do we do? Steal all the things that are great about Nagios. (existing plugin investment, simplicity, security, reliability) Strap them to something more awesome. (scalable, API-ready, config management friendly, modern!)
  • 15. THIS DOESN’T MEAN WRITING YOUR OWN MONITORING SYSTEM
  • 16. Points for thought: ●  What else are people using? ●  Should we greenfield or lift existing tools? ●  What tools could we go with?
  • 17. My suggestion: Like OMD, but better. Wrap up a series of “best in breed” tools to make one kickass monitoring tool.
  • 19. Core: Holds configuration about hosts / services Distributed across X masters Check execution (poke) Results queue (poke response)
  • 20. There’s something we can use for this. Sensu! Sensu is often described as the “monitoring router”.
  • 21.
  • 22. { "checks": { "chef_client": { "command": "check-chef-client.rb", "subscribers": [ "production" ], "interval": 60, "handlers": [ "pagerduty", "irc" ] } } } Only on the server
  • 23. Client requires no registration for the server to know about it Uses Nagios status return codes Doesn’t talk to the server - talks to RabbitMQ
  • 24. Core: Holds configuration about hosts / services Distributed across X masters Check execution (poke) Results queue (poke response)
  • 25. What we need: Core - Sensu-server Agent - Sensu-client Graphing Anomaly detection Alerting UI
  • 26. Graphing is easy now. If you’re not using Graphite, you should be. Sensu “metric” checks can pump data to it.
  • 27. What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection Alerting UI
  • 28. Anomaly detection is hard. We’ve got all this metric data, but how do we check it? - Skyline/Oculus (Etsy) - Grok (very early days) - ???
  • 29. What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection - ??? Alerting UI
  • 30. Alerting is tricky, but mostly solved. Flapjack! - flapjack.io Alerting is not the concern of your monitoring tool. Push all alerts at Flapjack - define gateways (PagerDuty, email) - create relationships between checks and gateways
  • 31. What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection - ??? Alerting - Flapjack UI
  • 32. User interfaces are hard. What do we need from it? - What’s broken - When it broke, when it broke in the past - Say “OK, I know it’s broken” - View graphs to see how quickly it broke - See every check everywhere, and filter the list
  • 33. The Sensu Dashboard sucks. No history! Acknowledgements aren’t easy to do. No graphing. Can’t see anything that’s reporting an OK status. This won’t do.
  • 34. I’m going to have to write a UI. Sigh.
  • 35. What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection - ??? Alerting - Flapjack UI - ???
  • 36. In Summary Nagios sucks. There are good tools for each concern of monitoring. If we can package them together, we can have something that rocks.