Kalin Hicks – Set up original GCL VM – countless
explanations, whiteboard sessions
Brian Cook – Set up original Sepsis Zabbix VMs
John Breese – Set up 2.0 templates spanning hosts
Brad Beam – Many dashboards, alerts and triggers
Chris Rooney – Brahe-hubble gem
Nidhi Bhargava – Low level discovery on 2.0
Dev – White Ops - Yellow
ish for me
real world case
layers, how to’s
Correlation of Alerts
• Sometimes the availability of one host
depends on another. A server that is behind
some router will become unreachable if the
router goes down. With triggers configured for
both, you might get notifications about two
hosts down - while only the router was the guilty party.
“Flap Detection” and a Grace Period
Nagios uses "flap detection" to prevent many
ERRORs and OKs being sent right after each other.
Zabbix calls this "hysteresis".
Hysteresis is the dependence of a system not
only on its current environment but also on its past environment.
Correlation of Alerts
We need to get to the point where:
100’s of Related Alerts Enter,
One Causal Alert Leaves
What if someone misses something?
With 100+ alert emails per day, they are almost
guaranteed to miss something.
“Why on earth was I not notified?!”
Trends of Flakiness
These should not be dealt with by alerts/alarms.
Rather by daily/weekly reports.
Unfortunately Zabbix is not strong in this area yet.
There is a thread:
False Alarms Due to Chef Restarts
Current – Manual
Potentially – Automated
Automate the Maintenance Periods
Highly Available Deployments
Highly Available Deployments
Doesn’t Work @ablythe
Leveraging Init.d to Manage State
case "$1" in
rm -f /var/<service>/start
rm -f /var/<service>/stop
rm -f /var/<service>/restart
This of course is messy if the service
ever hangs during a restart.
More discussion needs to be had in this
Mark Burgess – Book of Promises
Draft published on January 21st 2013
For the Project Managers
PLANS TO FAIL
FAIL TO PLAN
For the Project Managers
PLAN TO FAIL
PRACTICE LOCALIZED FAILURE
MINIMIZE RECOVERY TIME
The Phoenix Project: A Novel About
IT, DevOps, and Helping Your Business Win
The Brent Effect
Brent is the one person who understands
how the entire system fits together.
Brent is the one person who fixes most of the
Being spread so thin, Brent is also the one
person who causes most of the issues.
Dystopian Future Where The Survival of Many is
in the Hands of One Man
The system or crucial parts of the system
Man or Woman
What is OpsInfra?
A team built on enablement of DevOps.
Build an Ecosystem
The Success of:
• 4 steps
– Log a Jira with the intent to research a tool
– Write a wiki article on how to use it
– Write a blog on how it is awesome
– Record a demo of the tool
For the Architects
Monitoring is only “technical debt” if you
choose to carry it that way.
Depending on when you invest, it easily can be
Past – Hackers - Craft
Now – SysAdmin - Trade
Future – Devops - Science
The years travel fast
And time after time, I've done the tell
But this ain't one body’s tell
It's the tell of us all
And you gotta listen it and 'member
Cuz what you hears today
You gotta tell the newborn tomorrow
Blair Witch at one point held the record for the highest profit-to-cost ratio ever. <enter> But before that…
Mad Max held that record for a couple decades.
My name is Aaron Blythe, and this presentation is called Zabbix: Beyond Thunderdome.
Zabbix. By show of hands, who has logged into a Zabbix instance? And who has received email alerts from Zabbix?
We will go through where we have been, where we are, and where we can go with Zabbix. I will try not to give too many spoilers on the Mad Max series of films, and merely lay down the story line.
First I want to go through how we got here with Zabbix so far, using the original Mad Max as a guide.
Zabbix is an open source monitoring tool. The website claims: up to 100,000 monitored devices and up to 1,000,000 metrics.
Mad Max is set in Australia in a dystopian future where Earth’s oil supply has been nearly exhausted. Max Rockatansky is the top driver in the Main Force Patrol (basically the police). Gangs have taken over the highway. In a car chase, Max kills one of the gang members, so they want revenge. Honestly the story is sort of disjointed. The movie was edited in the home of one of the producers on a homemade editing machine, created by his father (an engineer).
Brian Cook told me a story of when they were first working on one of our cloud applications. It was memory bound. When a lot of data was being pumped through in batches it would actually clobber the machine. He would have to call someone in the data center at 2 in the morning to physically reboot the machine. Oh, and after doing this a few times he would always make sure to tell them to bring a pencil so they could actually get to the button.
Kalin Hicks and Brian Cook told me: Zabbix was originally installed to bridge the gap in our monitoring for the Sepsis project while we waited for a permanent solution; we just chose to use another monitoring tool instead of a bunch of scripts. It was a Skunkworks project that went viral and certainly was never intended to become such a big project.
Necessity helps us create or adapt great fun things. David Eggby, responsible for much of the footage for Mad Max, had this to say about filming: “… [Shooting from the back of the Goose bike] I couldn't have a helmet on because you can't operate a camera, it gets in the way… They put a seat belt strap around us and we went for it, and you can see on the speedo that it's cracking 180kph.” From: http://sideburnmag.blogspot.com/2012/06/mad-max.html Speedo is ‘stra’in for speedometer…
Unlike proprietary monitoring tools that we use now or have used in the past, we don’t have to worry about paying for a license for every stakeholder who has a business need to see the data. <enter> <enter> Fixes on the 2.0 line have so far been decently timely. With a community of hundreds of contributors, Linus’s law applies: given enough eyeballs, all bugs are shallow. <enter> Zabbix is community based.
Community based means there are forums where we can ask questions and get answers ourselves, or see the answers to others’ questions. <enter> Yes, that is almost 40,000 posts to over 10,000 threads. We could never expect this level of interaction and support for an internally developed monitoring tool.
The number of users in the freenode IRC channel continues to grow, to nearly 200 people on average. This is a place to ask advanced questions in real time of users around the world. Oh, and this graph was created and gathered in Zabbix over 7 years.
We provide health care solutions. If we can integrate tools that solve software and hardware problems, that gets us to our goal faster.
For those of you who now want to see the movie because of this talk, I don’t want to ruin it for you. But some bad things happen to people Max knows in this movie. This causes Max to quit the force, but he is talked into just taking a holiday instead. At this point Max is just a regular guy. He is trying to keep the peace and lead a good life with his girlfriend.
There are 4 steps to get your host connected to the Zabbix Server and use the Linux OS Template. <enter> However, 2 of them have likely been done for you on the Zabbix Server already. <enter> And soon we plan to automate application of the Template to the Host using auto-discovery of Linux nodes. So we are left with one step.
For those couple of steps you get (roughly, depending on the layout of the host): 11 applications, 90 items, 120 triggers, and 20 graphs.
As I said at the beginning, Mad Max made a ton of money for the amount of money spent: about 500 to 1000 dollars for every dollar spent. With the Zabbix Linux Template, we are talking about a couple of hours of work for 120 triggers. Once you’ve set this up before, it is really only about 10 minutes of work to set it up for future nodes.
The 80% full alerts have been extremely beneficial. In the case of disk space and inodes, these alerts give us the time and ability to troubleshoot the issue and decide whether to extend the Logical Volume or find the offending large file or process. In the case of the volume reaching 100%, the only choice is to extend the LVM. In the case that I spoke of before that Brian Cook ran into with RAM, we can make better decisions on the size and number of nodes we need for MapReduce.
The entire Mad Max series is built on car chases, which are awesome to watch. So far it has been awesome to watch Zabbix grow so prolifically throughout Cerner.
What impresses me most about Zabbix and Mad Max is that something so simple and easy could gain so much mindshare. The creators of each poured time and effort into something that has universal and worldwide appeal. We are adapters of their work and I want to thank them.
So that is where we have been and how we got started. Now let’s talk about where we are now, using Mad Max 2: The Road Warrior.
Mad Max 2: The Road Warrior picks up a few years later. Max is older and hardened from the tragedy at the end of the first movie. Oil is still scarce. There are still street gangs. Max is now a lone wolf. He is looking for more ammunition for his sawed-off.
Oh, and the villains have slightly better costumes… more budget.
We have well over 2000 nodes in the Production Zabbix 2.0 instance currently. And we believe we can scale that much higher with our current deployment structure.
A common setup for a highly available system (or HA) is to have N+1 nodes. Here we see 2 proxy layer nodes fronting 3 service layer nodes.
If one of the service layer nodes goes down, that is a problem that needs to be addressed, and likely quickly. However, the system as a whole is still functioning.
However, if all 3 nodes go down, that is a disaster that needs to be addressed immediately, and someone needs to be paged to fix it.
John Breese was able to set this up for us on Semantic Solutions using templates. We receive high alerts in the event that any single node goes down. We receive disaster alerts in the event that all of the servers or proxies are down.
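The high-versus-disaster split for one layer can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the template John built; the function name and the node names are made up for the example.

```python
from typing import Dict


def layer_severity(node_up: Dict[str, bool]) -> str:
    """Classify one layer as 'ok', 'high' (some nodes down) or 'disaster' (all down)."""
    if not node_up:
        raise ValueError("layer has no nodes")
    down = [name for name, up in node_up.items() if not up]
    if not down:
        return "ok"
    if len(down) == len(node_up):
        # Every node in the layer is gone: page someone.
        return "disaster"
    # The layer still serves traffic, but a node needs attention.
    return "high"


# Hypothetical service layer with one of three nodes down.
service_layer = {"svc1": True, "svc2": False, "svc3": True}
print(layer_severity(service_layer))
```

The same function works for the proxy layer; only the dictionary changes.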
The alerts go to a uCern Space set up specifically for monitoring our system. Associates are free to subscribe or unsubscribe from this space as they need. The discussion can occur in the open and the URL can quickly be pasted on other discussions or Jiras that are occurring on other related issues.
Brad Beam created these graphs that anyone who can access the production Zabbix system can see. Meaning if you have the need to see this, you only have to log an issue in Jira. This graph is monitoring the real-time processing of data through Storm. The Storm acknowledgement rates (or ack rates) are a way to gauge system health. A low ack rate combined with a sufficient backlog in notifications is indicative of an issue. I’ll be honest, I am not sure exactly how these graphs were created, nor do I know many details about them specifically. What I do know is that many people have been watching this information to understand the system behavior and improve it over the last couple of months.
Another dashboard created by Brad Beam. We currently have a bug in the JVM reuse for the M/R jobs. The resources for the finished JVMs wouldn't be reclaimed, which would eventually exhaust the resources on the box. So with this graph we can identify if a server has bogus JVMs out there that need to be addressed. Development of basic monitoring features can now be measured in hours or days, as opposed to months. We need the freedom to change these metrics daily/weekly as we learn more.
Brahe Hubble is a Ruby Gem created by Chris Rooney here at Cerner. <enter> Not to steal any thunder from Ben Brown and Kartik Vishwanath presenting on Brahe later in this conference, Brahe is named after the astronomer Tycho Brahe (similar to the project Kepler, which many of you may be more familiar with). Brahe Solr is a cloud-based indexing application also created here at Cerner <enter> that presents at least 2 replicas <enter> that are fronted by a Brahe REST service <enter> to manage and query their state. <enter> Brahe Hubble uses this REST service <enter> to present a JSON document <enter> to be used by a Zabbix Template. So why not have Zabbix call the REST interface directly? Basically, the logic done by Brahe Hubble is too complicated for Zabbix to complete on its own.
With the help of Kalin and Brad Beam, Nidhi Bhargava worked through this for our Brahe Hubble deployment. You have your Host or Node and a Zabbix Server. <enter> First you have to get the Zabbix Agent installed (preferably through Chef). <enter> Then a script (or in the case of Brahe Hubble, a RubyGem) that does the gathering of information and outputs a JSON document. But how will the Zabbix Agent know about the script or command line? <enter> Easy: you will have to configure the UserParameter for the Zabbix Agent (simple to do if you are using the zabbix_agent_chef cookbook). <enter> This will allow you to present a JSON document to the Zabbix Server. <enter> The Zabbix Server then uses this JSON document in a Template with a Macro.
In Templates <enter>The important part is that this is created under “discovery” <enter>In Discovery we created an item and a trigger <enter>The item <enter>
It is here where you can use the name value pairs presented in json from the script or RubyGem.
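As a rough sketch of what such a discovery script emits, here is a Python stand-in for the gem. The macro names ({#REPLICA}, {#CORE}), the item key, the script path and the sample replica are all invented for illustration; the real Brahe Hubble output is not shown in this talk. What the sketch does assume is the low-level discovery envelope: a "data" array of objects whose keys are {#MACRO} names.

```python
import json


def discovery_document(replicas):
    """Wrap name/value pairs in the {"data": [...]} envelope that low-level discovery reads."""
    return json.dumps(
        {
            "data": [
                # Each key becomes a macro usable in item and trigger prototypes.
                {"{#REPLICA}": r["name"], "{#CORE}": r["core"]}
                for r in replicas
            ]
        },
        sort_keys=True,
    )


# The agent would learn about the script through a line of this shape in its
# configuration (key and path are hypothetical):
#   UserParameter=brahe.discovery,/usr/local/bin/brahe_discovery.py
if __name__ == "__main__":
    print(discovery_document([{"name": "replica1", "core": "patients"}]))
```

An item prototype under the discovery rule can then reference {#REPLICA} in its name and key.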
Let me stop for a minute and tell you about my 2 favorite characters in Mad Max 2. Max meets this guy that we refer to as the “Gyro Captain” because no one says his name in the movie and Max never asks. Oh, and probably because he drives a gyrocopter. Character development is starting to become part of the Mad Max movie this time around, even if names are not. I personally like names and would love to celebrate things you do with Zabbix, as I just did with the cool stuff I have seen done with Zabbix.
Names I have already said so far. <enter> There are many more, but notice that there are 3 dev and 3 ops. Each of us has learned a lot from one another.
There is also The Feral Kid, named for similar reasons. Max gives the feral kid a music box. Max’s heart is starting to soften some and he decides to help this village of people protecting their oil try to get away from the road gang.Max has become more invested in the village. Over the past couple years Zabbix has moved from that side project, or Skunkworks project to an investment in the health of our system.
Max tries to leave the village once, but does not make it. He comes back after a pretty severe beating.
Remember that Max was the best driver on the Main Force Patrol. Max is the only one who is going to be able to drive the tanker of oil out of the protected village. Oh, and there is an epic oil tanker chase scene. It goes on for like 20 minutes. In software we often refer to situations where only one or a few can do something critical as having a low “Bus Factor”, which put simply is the total number of key developers who would need to be hit by a bus (or tanker) before the project would not be able to proceed.
I would describe Mad Max 2 as a READ SLIDE
The Zabbix information model has a rather steep learning curve. But I believe it is one worth climbing. From https://www.zabbix.com/forum/showthread.php?t=21030
As I often do, I asked Kalin to talk to me like I'm a 3rd grader and he boiled it down to this for me. A Host can be part of many Host Groups. A Host can have many Templates applied to it. A Template can have Graphs, Items, and Triggers. You can define Actions for Triggers. Kyle McGovern and Ben Hemphill mentioned yesterday that they are using Zabbix to restart Hadoop Region Servers. So, self-healing system of the future? We have that now.
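Kalin's 3rd-grader version maps neatly onto a few Python dataclasses. This is a teaching sketch of the relationships only, not Zabbix's actual schema; in particular, hanging actions directly off a trigger is a simplification, and every name below is made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trigger:
    name: str
    actions: List[str] = field(default_factory=list)  # e.g. "email ops"


@dataclass
class Template:
    name: str
    items: List[str] = field(default_factory=list)
    triggers: List[Trigger] = field(default_factory=list)
    graphs: List[str] = field(default_factory=list)


@dataclass
class Host:
    name: str
    groups: List[str] = field(default_factory=list)          # many Host Groups
    templates: List[Template] = field(default_factory=list)  # many Templates

    def triggers(self) -> List[Trigger]:
        """All triggers the host inherits from its templates."""
        return [t for tpl in self.templates for t in tpl.triggers]


linux = Template(
    name="Linux OS",
    items=["free disk space"],
    triggers=[Trigger("disk 80% full", actions=["email ops"])],
    graphs=["disk usage"],
)
node = Host(name="node01", groups=["Hadoop", "Production"], templates=[linux])
print([t.name for t in node.triggers()])
```

Applying a second template to the host simply adds its triggers to the list.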
The Road Warrior won critical acclaim, and is an incredibly better movie than the first. The story line is cohesive and somewhat compelling. Max truly comes out a hero. By putting in more work, we have a better story and have done some awesome stuff with Zabbix so far…
Let’s talk about where we want to go with Zabbix in the next couple years.
We want Tina Turner level success… In the third installment of the saga, Mad Max: Beyond Thunderdome, Tina Turner is the leader of Bartertown. She plays Aunty Entity.
Bartertown has regained some technology through the use of methane. Years have passed, and an aging Max has some of his supplies stolen and becomes involved in the local political power struggle.
Recently Nimesh Subramanian created a Skybox Labs virtual cluster with a Chef Server and a Zabbix Server. You can check this out, upload the cookbook for your app or service, and start playing around with Zabbix without affecting a shared domain where others are working. When you are finished you can just throw the image away.
Dashboards are an area that could use a lot of work. The way people read books is a personal decision. I personally use my library card, and each of these 4 titles is available on Safari Online, so I can read them on my iPad. How do we convey the most information in the least amount of space, so that only the real problems gain attention?
Zabbix has a full API. Many have already been pulling Jira and Splunk data into Dashing (from Shopify), which can be optimized. Pulling in Zabbix data the same way should be rather trivial.
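To give a feel for it, the API is JSON-RPC 2.0 over HTTP, so a dashboard job mostly builds small JSON bodies like the ones below and posts them to the frontend's api_jsonrpc.php. This sketch only builds the request bodies and sends nothing; the user name, password and token are placeholders, and the method names and parameters (user.login, trigger.get) should be checked against the API reference for the installed version.

```python
import json


def rpc_request(method, params, auth=None, request_id=1):
    """Build a JSON-RPC 2.0 request body of the shape the Zabbix API accepts."""
    body = {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    if auth is not None:
        # Token returned by the login call; omitted on the login request itself.
        body["auth"] = auth
    return json.dumps(body, sort_keys=True)


# Step 1: log in (placeholder credentials).
login = rpc_request("user.login", {"user": "dashboard", "password": "secret"})

# Step 2: ask for triggers currently in the PROBLEM state (value 1).
problems = rpc_request(
    "trigger.get",
    {"output": "extend", "filter": {"value": 1}},
    auth="TOKEN-FROM-LOGIN",
    request_id=2,
)
print(login)
print(problems)
```

A dashboard widget would post these two bodies in order and render the second response.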
Zabbix does have some interesting features. A couple weeks ago, in the workaround.org blog, Zabbix Maps were explained fairly well. We have not made use of this very heavily; however, this could potentially give us a graphical, relational way to reason about the data that Zabbix is gathering.
In Mad Max Beyond Thunderdome there is a cage match between Max and a huge opponent named Blaster. The crowd chants “Two men enter, one man leaves”.
Remember back to my example of High Alerts vs. Disaster for the Service Layer? In the disaster scenario I get 4 alerts: one for each of the 3 hosts, and one for the disaster. However, this is likely all from one cause. Meaning those alerts are correlated, but how do I get the system to only email me once? Sometimes a single cause can result in hundreds of emails from Zabbix. I heard one system engineer recently refer to this as “Getting Zabbixed”.
Straight from the Zabbix Documentation: https://www.zabbix.com/documentation/2.0/manual/config/triggers/dependencies
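The idea behind trigger dependencies can be sketched like this: a problem is only worth a notification if nothing it depends on is also in a problem state. This is a simplified model for illustration, not how the Zabbix server implements it, and the host names are invented.

```python
def alerts_to_send(problems, depends_on):
    """Drop any problem whose direct upstream dependency is also a problem."""
    active = set(problems)
    return sorted(
        p for p in active
        if not any(dep in active for dep in depends_on.get(p, ()))
    )


# Two servers sit behind one router; all three look down.
depends_on = {"server-a": ["router"], "server-b": ["router"]}
print(alerts_to_send(["router", "server-a", "server-b"], depends_on))
```

With the router healthy and only server-a down, the same call returns just server-a, so real single-host failures still get through.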
http://meinit.nl/zabbix-triggers-flap-detection-and-grace-period Systems can get into states where they send an Error and then immediately send an OK. A different monitoring system, Nagios, calls this “Flap detection”. In these cases real-time alerts are not of much value, because the system is doing one of two things: correcting itself somehow faster than a human can intervene, or these are just the downstream effect of the network or another factor (that we should be using the previously mentioned trigger dependency for). Zabbix calls this Hysteresis, pronounced “Historee Sis”.
Hysteresis is the dependence of a system not only on its current environment but also on its past environment. <enter> For alerts such as this we can use the pipe character (|), which acts as a logical OR in a Zabbix 2.0 trigger expression, to chain a problem condition and a recovery condition. <enter> Problem: being less than 10 GB for 5 minutes. <enter> Notice this uses a max over 5 minutes. <enter> Recovery: being more than 40 GB in the last 10 minutes. <enter> Notice the min over 10 minutes. <enter>
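Here is that problem/recovery pair written out as a small Python state function, to show why the two thresholds stop the flapping. It is a sketch of the logic only, not a Zabbix trigger expression; the function name and sample values are made up, and the 10 GB / 40 GB thresholds come from the example above.

```python
def next_state(in_problem, free_gb_last_5m, free_gb_last_10m):
    """Return True if the trigger should be in PROBLEM after this evaluation."""
    if not in_problem:
        # Enter PROBLEM only if even the best sample in 5 minutes is under 10 GB.
        return max(free_gb_last_5m) < 10
    # Stay in PROBLEM until even the worst sample in 10 minutes is at least 40 GB.
    return min(free_gb_last_10m) < 40


# Free space bounces around 10 GB: once the trigger fires, it stays fired.
state = False
state = next_state(state, [8.0, 9.0, 9.5], [8.0, 9.0, 9.5, 11.0])
print(state)
state = next_state(state, [11.0, 12.0], [9.5, 11.0, 12.0])
print(state)
```

The gap between 10 GB and 40 GB is what keeps a value hovering near the threshold from sending an Error and an OK every few minutes.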
https://www.zabbix.com/wiki/doku.php?id=howto/config/alerts/delaying_notifications From the Zabbix documentation (I have not fully tested this myself). First, check the box to Schedule Actions; this allows the actions on the right side. Next, set a period (maybe 120 seconds). Enable a recovery message. Make sure Trigger value = “PROBLEM” or you will delay the recovery message. Step 1 is not defined, so nothing happens at first; step 2 happens after 120 seconds.
We need Thunderdome for our alerts: 100’s of related alerts enter, one causal alert leaves.
In discussing these methods of correlation, suppression, and delaying messages, I often get asked, “What if someone misses something?” <enter> A monitoring system that cries wolf too often is almost guaranteed not to get listened to. When I hear a car alarm these days I unfortunately almost never think that someone is trying to steal a car. While this is a valid question, it is not the most interesting question to me. It seems like a question that could stunt progress. The Zabbix community is working through an Action Simulator that may be part of a future release of Zabbix. Look for the blog entry entitled: “Why on earth was I not notified?!”
Trends of flapping are better dealt with in a holistic manner. Zabbix is not yet great at daily/weekly reports, but it appears that the community has made a lot of headway and it will be in a near future release.
So let’s return to my previous example. <enter> If I delay the notification by 120 seconds and the node recovers in time, then I get no notification. This is good, as it will cut down on the number of notifications. If the node does not recover in that time, the system as a whole is still up and I can deal with the problematic node individually. <enter>
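The effect of the 120 second delay on a single node can be sketched as a small Python function. This models the outcome only (who gets an email), not Zabbix's escalation steps; the function and parameter names are made up, and times are plain seconds.

```python
def should_notify(started, now, recovered=None, delay=120):
    """Notify only if the problem has lasted at least `delay` seconds."""
    # The problem ends at recovery, or is still running as of `now`.
    end = now if recovered is None else min(recovered, now)
    return end - started >= delay


print(should_notify(started=0, now=200, recovered=60))  # node came back after 60 s
print(should_notify(started=0, now=200))                # node still down at 200 s
```

A node that recovers within the delay never produces an email, while one that stays down is reported two minutes late at worst.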
If all 3 nodes are down at the same time, however, I would not delay the notification of the Disaster. In this case, the system is not likely to recover in 2 minutes, so I would just be delaying the other 3 emails. <enter> I may be able to set up a trigger dependency; however, that would sort of be circular in my current opinion. Remember, trigger dependency was for a separate host. <enter>
In Beyond Thunderdome, Max is banished from Bartertown. He is found by a tribe of children who have a “tell” that prophesies his arrival. Again Max becomes a reluctant hero to this tribe of people.
When Adam Jacob from OpsCode was visiting our campus he walked through an example that we had been working through with proxies. He mentioned Promise Theory. <enter> I am going to use an example I lifted from John Willis of the DevOps Café Podcast: a promise of B from agent 1 to agent 2. http://www.socallinuxexpo.org/sites/default/files/presentations/scale11x-historyofmgmt-130222175623-phpapp01.pdf
There are promises to give and promises to receive. Let’s use + for give and – for receive. I (a1) promise to feed my neighbor’s (a2) cat. My neighbor (a2) promises to grant me access to his house. Trust comes in: that my neighbor gave me the correct code and I will not get arrested, and that I will not drink his 25-year-old scotch.
My Service promises to publish state.
If you think this subject is interesting, Mark Burgess (who wrote cfengine, a precursor to Chef, well before its time) recently published a 303 page draft of his book on the subject.
I have had the opportunity to read many books and take classes on project management. We see this quote many times: nobody plans to fail, some just fail to plan. <enter> This is cute. <enter> But it is wrong.
Read the slide. Schedule strategic iteration time to work through monitoring… so you are not scheduling weekend war rooms.
The Phoenix Project is a novel about IT and DevOps. It is about a company on the brink of complete failure.
Beyond Thunderdome is yet again a dystopian future where the survival of many is in the hands of one man. <enter> It makes a great action movie, but not a great way to do business.
Our team is built on enablement. We are structured around understanding, harnessing and providing the capabilities needed to deliver software in the Big Data world. There are many tools already in use by a large number of teams. Each of the tools used has a large open community outside of Cerner. We are focused on building an ecosystem within Cerner to solve the large scale problems we are facing with these large scale deployments.
I have been asked many times in the past couple months, “Have you seen monitoring tool X? It is awesome.” I am sure that it is. Please show me why it is awesome. We have set up a way that you can do this. Visit our Incubator link on the uCern wiki. We would like to collect the awesome DevOps tools you are looking into, in a place where you can compare the capabilities to make the best decisions on which ones should be applied to your team.
I had an architect recently refer to working on a monitoring solution as “technical debt” when his system was not yet in production. READ SLIDE
The third installment closes with yet another epic chase in all sorts of vehicles and epic explosions. Max again comes out a hero…
So to relate this back to Chris Brown’s Keynote yesterday?