Zabbix: Beyond Thunderdome

Aaron Blythe

Presentation at #cernerdevcon on 6/5/2013

Huh?
• Does anyone know what movie that was?
@ablythe

World Record
• Highest Profit to Cost Ratio Ever
• But before that…
@ablythe

Why Zabbix?
Open Source
Linus’s Law
Given enough eyeballs, all bugs are shallow
Community Based
@ablythe

Why Zabbix?
Mission Statement
To contribute to the systemic improvement of
health care delivery and the health of
communities.
@ablythe

Zabbix Linux Template - Cost
• Connect Host as Agent to Zabbix Server (Via
Chef)
• Download Template from Zabbix
• Upload Template to Zabbix Server
• Apply Template to Host
____________________
• Cost = 4 steps
2 Steps 1 Step
@ablythe

Zabbix Linux Template - Return
• ~ 11 applications
• ~ 90 items
• ~ 120 triggers
• ~ 20 graphs
@ablythe

Profit to Cost Ratio
• Mad Max
– $100 million worldwide/A$400,000
• Zabbix Linux Template
– 120 Triggers/2 Steps
@ablythe

Benefit
• 80% full alerts
– Disk space/inodes
– RAM
• Make better decisions on size needed
Decision
Find file or
process
Extend LVM
@ablythe
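
A rough sketch of the 80% rule from this slide, in shell. The function names and the OK/PROBLEM strings are made up for illustration; this is not how the Zabbix Linux template implements its triggers, and the default of 80 is simply the number from the slide.

```shell
#!/bin/sh
# Hypothetical helper: compare a usage percentage with a threshold.
# Prints PROBLEM at or above the threshold, OK below it.
usage_state() {
    used_pct=$1
    threshold=${2:-80}
    if [ "$used_pct" -ge "$threshold" ]; then
        echo "PROBLEM"
    else
        echo "OK"
    fi
}

# Hypothetical helper: percent used for a mount point, via POSIX "df -P".
# Column 5 of the second output line is the capacity, e.g. "42%".
fs_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Something like `usage_state "$(fs_used_pct /)"` would classify the root filesystem; the same comparison could be pointed at inode or RAM percentages.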

Creators
Mad Max (Australia): Byron Kennedy, George Miller
Zabbix (Latvia): Alexei Vladishev
@ablythe

Highly Available Deployments
Proxy Layer
Service Layer
@ablythe

Email Alerts to uCern Discussions
@ablythe

Brahe Hubble
{ "{INDEX_MACRO}"=>"name]}", "{VERSION_MACRO}"=>" version", "{ERROR_MACRO}"=>"#{error}" }
@ablythe

Zabbix Low Level Discovery
@ablythe
Zabbix Host
Zabbix Agent
UserParameter
Shell Script or
RubyGem
Zabbix Server
json
Document Template
w/ Macro
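
To make the flow on this slide a bit more concrete, here is a minimal sketch of the script end: something the agent could run through a UserParameter that prints the discovery JSON. It assumes the Zabbix 2.0 discovery format (a "data" array of objects keyed by {#MACRO} names). The function name, the {#NAME} macro, and the key and path in the comment are placeholders, not anything from the original deck, and the names passed in are assumed not to need JSON escaping.

```shell
#!/bin/sh
# Agent side (key and path are placeholders):
#   UserParameter=example.discovery,/path/to/discover.sh
#
# Hypothetical helper: print one {"{#NAME}": "<arg>"} object per argument,
# wrapped in the "data" array that low level discovery expects.
lld_json() {
    printf '{"data":['
    sep=''
    for name in "$@"; do
        printf '%s{"{#NAME}":"%s"}' "$sep" "$name"
        sep=','
    done
    printf ']}\n'
}
```

The template on the server side would then reference {#NAME} in its item and trigger prototypes.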

Who?
Kalin Hicks – Set up original GCL VM – countless explanations and whiteboard sessions
Brian Cook – Set up original Sepsis Zabbix VMs
John Breese – Set up 2.0 templates spanning hosts
Brad Beam – Many dashboards, alerts and triggers
Chris Rooney – Brahe-hubble gem
Nidhi Bhargava – Low level discovery on 2.0
Dev – White Ops - Yellow
@ablythe

Bus Factor
Dystopian Future Where The Survival of Many is
in the Hands of One Man
@ablythe

[Diagram: Host Group, Host, Template (0..n), Item, Trigger, Graph, Applications (0..n), Items (1..n), Action (email, command)]
… has a learning curve

Virtualization thru Skybox Labs
@ablythe

Dashboards
chapters divided by types of data rather than types of display
chapters on multi-variables, correlation and proportions
Honestly a little too textbook-ish for me
from more than two dozen experts, real world case studies, beautiful layers, how to’s
@ablythe

Zabbix Maps
http://workaround.org/zabbix/maps
@ablythe

Alert Exhaustion
Ain’t Nobody Got
@ablythe

Correlation of Alerts
Proxy Layer
Service Layer
@ablythe

Trigger Dependencies
• Sometimes the availability of one host
depends on another. A server that is behind
some router will become unreachable if the
router goes down. With triggers configured for
both, you might get notifications about two
hosts down - while only the router was the
guilty party.
@ablythe
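
The router example can be sketched as a tiny decision function. This is only an illustration of the idea, not Zabbix's implementation; the function name, the trigger names, and the notify/suppress strings are all made up.

```shell
#!/bin/sh
# Hypothetical helper.
#   $1     trigger that just fired
#   $2     trigger it depends on
#   $3...  triggers currently in a problem state
# Prints "suppress" when the dependency is itself in a problem state,
# otherwise "notify".
notify_decision() {
    trigger=$1
    depends_on=$2
    shift 2
    for active in "$@"; do
        if [ "$active" = "$depends_on" ]; then
            echo "suppress"
            return 0
        fi
    done
    echo "notify"
}
```

With both the router and the server down, `notify_decision server-down router-down router-down server-down` prints `suppress`, so only the router alert goes out.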

“Flap Detection” and a Grace Period
Nagios uses "flap detection" to prevent many
ERROR's and OK's being sent right after each
other.
Zabbix calls this "hysteresis".
@ablythe

Hysteresis
Hysteresis is the dependence of a system not
only on its current environment but also on its
past environment
@ablythe
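
One common way to get this "depends on the past" behavior is to use two thresholds: a higher one to enter the problem state and a lower one to leave it. A minimal sketch, assuming integer values; the function name and the example thresholds (80 and 70) are illustrative and not taken from any Zabbix template.

```shell
#!/bin/sh
# Hypothetical helper.
#   $1 previous state (OK or PROBLEM)
#   $2 current value
#   $3 high threshold: above it the state becomes PROBLEM
#   $4 low threshold: below it the state becomes OK
# Between the two thresholds the previous state is kept, which is what
# stops a value hovering around one limit from flapping.
next_state() {
    prev=$1
    value=$2
    high=$3
    low=$4
    if [ "$value" -gt "$high" ]; then
        echo "PROBLEM"
    elif [ "$value" -lt "$low" ]; then
        echo "OK"
    else
        echo "$prev"
    fi
}
```

For example, `next_state PROBLEM 75 80 70` stays at `PROBLEM`, while `next_state OK 75 80 70` stays at `OK`.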

Correlation of Alerts
We need to get to the point where:
100’s of Related Alerts Enter,
One Causal Alert Leaves
@ablythe

What if someone misses something?
With 100+ alert emails per day, they are almost
guaranteed to miss something.
@ablythe
“Why on earth was I not notified?!”
On http://blog.zabbix.com/

Trends of Flakiness
These should not be dealt with by alerts/alarms.
Rather by daily/weekly reports.
Unfortunately Zabbix is not strong in this area yet.
There is a thread:
https://www.zabbix.com/forum/showthread.php?t=18901
@ablythe
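
Zabbix aside, a daily flakiness report can start out as a count of state changes per host. A sketch, assuming a hypothetical event log with one "host state" pair per line in time order; this is not a Zabbix export format, and the function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: read "host state" lines on stdin and print
# "host <number of state changes>" for every host that changed state
# at least once, sorted by host name.
flap_report() {
    awk '{
             if (($1 in last) && last[$1] != $2) flaps[$1]++
             last[$1] = $2
         }
         END { for (h in flaps) print h, flaps[h] }' | sort
}
```

Fed a day's worth of events, the output is a short list of the hosts worth a look, which fits a report better than an alert.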

False Alarms Due to Chef Restarts
Current – Manual
Maintenance Periods
Potentially – Automated
Automate the Maintenance Periods
Delaying Notifications
Hysteresis
Promise Theory
@ablythe

Highly Available Deployments
Delayed Notifications/Hysteresis
Proxy Layer
Service Layer
Delay Alert
120 seconds
Works!! @ablythe
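
The 120 second delay amounts to "only alert if the problem is still there after the grace period". A minimal sketch of that check, with timestamps as epoch seconds; the function name and the alert/wait strings are illustrative, and this is not how Zabbix schedules its action steps internally.

```shell
#!/bin/sh
# Hypothetical helper.
#   $1 time the problem started (epoch seconds)
#   $2 current time (epoch seconds)
#   $3 delay in seconds (defaults to the 120 from the slide)
# Prints "alert" once the problem has lasted at least the delay,
# otherwise "wait".
should_alert() {
    problem_since=$1
    now=$2
    delay=${3:-120}
    if [ $((now - problem_since)) -ge "$delay" ]; then
        echo "alert"
    else
        echo "wait"
    fi
}
```

A Chef-triggered restart that finishes in under two minutes never gets past `wait`, so nobody is emailed.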

Highly Available Deployments
Delayed Notifications/Hysteresis
Proxy Layer
Service Layer
Delay Alert
120 seconds
Delay Alert
120 seconds
Delay Alert
120 seconds
No Delay
Doesn’t Work @ablythe

Promise Theory
+data
a1
a2
My Service
Zabbix
@ablythe

Leveraging Init.d to Manage State
…
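case "$1" in
  start)
    # marker file exists only while the start is in progress
    touch /var/<service>/start
    …
    rm -f /var/<service>/start
    ;;
  stop)
    # marker file exists only while the stop is in progress
    touch /var/<service>/stop
    …
    rm -f /var/<service>/stop
    ;;
  restart)
    # marker file covers the whole stop/start cycle
    touch /var/<service>/restart
    $0 stop
    $0 start
    rm -f /var/<service>/restart
    ;;
…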
This of course is messy if the service
ever hangs during a restart.
More discussion needs to be had in this
area.
@ablythe
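
On the monitoring side, something has to read those marker files. A sketch of what that check could look like, assuming the markers live in one directory per service as in the snippet above; the function name and the state strings are made up, and nothing here deals with a stale marker left behind by a hung restart.

```shell
#!/bin/sh
# Hypothetical helper: report what the init script says it is doing,
# based on which marker file exists in the given directory.
# "restart" is checked first because a restart also creates the stop
# and start markers along the way.
service_state() {
    dir=$1
    if [ -e "$dir/restart" ]; then
        echo "restarting"
    elif [ -e "$dir/start" ]; then
        echo "starting"
    elif [ -e "$dir/stop" ]; then
        echo "stopping"
    else
        echo "steady"
    fi
}
```

An alert could then be held back while the state is anything other than `steady`.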

Mark Burgess – Book of Promises
http://cfengine.com/markburgess/BookOfPromises.pdf
Draft published on January 21st 2013
@ablythe

For the Project Managers
Nobody
PLANS TO FAIL
Some just
FAIL TO PLAN
@ablythe

For the Project Managers
Everybody should
PLAN TO FAIL
PRACTICE LOCALIZED FAILURE
And
MINIMIZE RECOVERY TIME
@ablythe

The Phoenix Project: A Novel About
IT, DevOps, and Helping Your Business
Win
@ablythe

The Brent Effect
Brent is the one person who understands
how the entire system fits together.
Brent is the one person who fixes most of the
issues.
Being spread so thin, Brent is also the one
person who causes most of the issues.
@ablythe

Dystopian Future Where The Survival of Many is
in the Hands of One Man
The system or crucial parts of the system
Man or Woman
@ablythe

What is OpsInfra?
A team built on enablement of DevOps.
@ablythe
Other tools
As needed
Build an Ecosystem
Tool Virtualization
Repeatable Deployment
Documentation
Discussion
Auxiliary Tooling
Education
The Success of:
Population Health
Millennium+
Project Go

Incubator
• https://wiki.ucern.com/display/OPIT/Incubator
• 4 steps
– Log a Jira with the intent to research a tool
– Write a wiki article on how to use it
– Write a blog on how it is awesome
– Record a demo of the tool
@ablythe

For the Architects
Monitoring is only “technical debt” if you
choose to carry it that way.
Depending on when you invest, it easily can be
“technical capital”
@ablythe

Past – Hackers - Craft
Now – SysAdmin - Trade
Future – DevOps - Science
@ablythe

The Tell
The years travel fast
And time after time, I've done the tell
But this ain't one body’s tell
It's the tell of us all
And you gotta listen it and 'member
Cuz what you hears today
You gotta tell the newborn tomorrow
@ablythe

Zabbix: Beyond Thunderdome

1. What’s going on? @ablythe

2. Huh? @ablythe

3. Huh? • Does anyone know what movie that was? @ablythe

4. @ablythe

5. World Record • Highest Profit to Cost Ratio Ever • But before that… @ablythe

6. @ablythe

7. Zabbix: Beyond Thunderdome Aaron Blythe

8. This presentation is about… @ablythe

9. This presentation is about… @ablythe

10. This presentation is about… @ablythe

11. This presentation is about… @ablythe

12. Past Now Future @ablythe

13. Past Now Future @ablythe

14. What is Zabbix? @ablythe

15. What is Mad Max? @ablythe

16. Why Zabbix? @ablythe

17. Why Zabbix? Necessity @ablythe

18. Why Zabbix? @ablythe

19. Why Zabbix? Open Source Linus’s Law Given enough eyeballs, all bugs are shallow Community Based @ablythe

20. Why Zabbix? @ablythe

21. Why Zabbix? @ablythe

22. Why Zabbix? Mission Statement To contribute to the systemic improvement of health care delivery and the health of communities. @ablythe

23. @ablythe

24. Zabbix Linux Template - Cost • Connect Host as Agent to Zabbix Server (Via Chef) • Download Template from Zabbix • Upload Template to Zabbix Server • Apply Template to Host ____________________ • Cost = 4 steps 2 Steps 1 Step @ablythe

25. Zabbix Linux Template - Return • ~ 11 applications • ~ 90 items • ~ 120 triggers • ~ 20 graphs @ablythe

26. Profit to Cost Ratio • Mad Max – $100 million worldwide/A$400,000 • Zabbix Linux Template – 120 Triggers/2 Steps @ablythe

27. Benefit • 80% full alerts – Disk space/inodes – RAM • Make better decisions on size needed Decision Find file or process Extend LVM @ablythe

28. Chase Scenes and Crashes @ablythe

29. Creators Mad Max (Australia): Byron Kennedy, George Miller Zabbix (Latvia): Alexei Vladishev @ablythe

30. Past Now Future @ablythe

31. Mad Max 2 – The Road Warrior @ablythe

32. @ablythe

33. Scale @ablythe

34. Highly Available Deployments Proxy Layer Service Layer @ablythe

35. Highly Available Deployments Proxy Layer Service Layer @ablythe

36. Highly Available Deployments Proxy Layer Service Layer @ablythe

37. Highly Available Deployments @ablythe

38. Email Alerts to uCern Discussions @ablythe

39. Screens/Graphs – ack rates @ablythe

40. Screens/Graphs @ablythe

41. Brahe Hubble { "{INDEX_MACRO}"=>"name]}", "{VERSION_MACRO}"=>" version", "{ERROR_MACRO}"=>"#{error}" } @ablythe

42. Zabbix Low Level Discovery @ablythe Zabbix Host Zabbix Agent UserParameter Shell Script or RubyGem Zabbix Server json Document Template w/ Macro

43. Zabbix Low Level Discovery @ablythe

44. Zabbix Low Level Discovery @ablythe

45. @ablythe

46. Who? Kalin Hicks – Set up original GCL VM – countless explanations and whiteboard sessions Brian Cook – Set up original Sepsis Zabbix VMs John Breese – Set up 2.0 templates spanning hosts Brad Beam – Many dashboards, alerts and triggers Chris Rooney – Brahe-hubble gem Nidhi Bhargava – Low level discovery on 2.0 Dev – White Ops - Yellow @ablythe

47. @ablythe

48. It’s not all dogs… @ablythe

49. …and Gyrocopters @ablythe

50. Sometimes my email inbox… @ablythe

51. Has me feeling like @ablythe

52. Bus Factor @ablythe

53. Bus Factor Dystopian Future Where The Survival of Many is in the Hands of One Man @ablythe

54. The Information Model @ablythe

55. [Diagram: Host Group, Host, Template (0..n), Item, Trigger, Graph, Applications (0..n), Items (1..n), Action (email, command)] … has a learning curve

56. Mad Max 2: The Road Warrior @ablythe

57. Past Now Future @ablythe

58. We Want Tina Turner! @ablythe

59. Beyond Thunderdome @ablythe

60. Virtualization thru Skybox Labs @ablythe

61. Dashboards chapters divided by types of data rather than types of display; chapters on multi-variables, correlation and proportions; Honestly a little too textbook-ish for me; from more than two dozen experts, real world case studies, beautiful layers, how to’s @ablythe

62. Pull Data External? @ablythe

63. Zabbix Maps http://workaround.org/zabbix/maps @ablythe

64. Alert Exhaustion Ain’t Nobody Got @ablythe

65. Two Men Enter, One Man Leaves @ablythe

66. Correlation of Alerts Proxy Layer Service Layer @ablythe

67. Trigger Dependencies • Sometimes the availability of one host depends on another. A server that is behind some router will become unreachable if the router goes down. With triggers configured for both, you might get notifications about two hosts down - while only the router was the guilty party. @ablythe

68. “Flap Detection” and a Grace Period Nagios uses "flap detection" to prevent many ERROR's and OK's being sent right after each other. Zabbix calls this "hysteresis". @ablythe

69. Hysteresis Hysteresis is the dependence of a system not only on its current environment but also on its past environment @ablythe

70. Delaying Notifications @ablythe

71. Correlation of Alerts We need to get to the point where: 100’s of Related Alerts Enter, One Causal Alert Leaves @ablythe

72. What if someone misses something? With 100+ alert emails per day, they are almost guaranteed to miss something. @ablythe “Why on earth was I not notified?!” On http://blog.zabbix.com/

73. Trends of Flakiness These should not be dealt with by alerts/alarms. Rather by daily/weekly reports. Unfortunately Zabbix is not strong in this area yet. There is a thread: https://www.zabbix.com/forum/showthread.php?t=18901 @ablythe

74. False Alarms Due to Chef Restarts Current – Manual Maintenance Periods Potentially – Automated Automate the Maintenance Periods Delaying Notifications Hysteresis Promise Theory @ablythe

75. Highly Available Deployments Delayed Notifications/Hysteresis Proxy Layer Service Layer Delay Alert 120 seconds Works!! @ablythe

76. Highly Available Deployments Delayed Notifications/Hysteresis Proxy Layer Service Layer Delay Alert 120 seconds Delay Alert 120 seconds Delay Alert 120 seconds No Delay Doesn’t Work @ablythe

77. Beyond Thunderdome @ablythe

78. Promise Theory @ablythe

79. Deconstructing Promises @ablythe

80. Promise Theory +data a1 a2 My Service Zabbix @ablythe

81. Leveraging Init.d to Manage State … case "$1" in start) touch /var/<service>/start … rm -f /var/<service>/start ;; stop) touch /var/<service>/stop … rm -f /var/<service>/stop ;; restart) touch /var/<service>/restart $0 stop $0 start rm -f /var/<service>/restart ;; … This of course is messy if the service ever hangs during a restart. More discussion needs to be had in this area. @ablythe

82. Mark Burgess – Book of Promises http://cfengine.com/markburgess/BookOfPromises.pdf Draft published on January 21st 2013 @ablythe

83. For the Project Managers Nobody PLANS TO FAIL Some just FAIL TO PLAN @ablythe

84. For the Project Managers Everybody should PLAN TO FAIL PRACTICE LOCALIZED FAILURE And MINIMIZE RECOVERY TIME @ablythe

85. The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win @ablythe

86. The Brent Effect Brent is the one person who understands how the entire system fits together. Brent is the one person who fixes most of the issues. Being spread so thin, Brent is also the one person who causes most of the issues. @ablythe

87. Dystopian Future Where The Survival of Many is in the Hands of One Man The system or crucial parts of the system Man or Woman @ablythe

88. What is OpsInfra? A team built on enablement of DevOps. @ablythe Other tools As needed Build an Ecosystem Tool Virtualization Repeatable Deployment Documentation Discussion Auxiliary Tooling Education The Success of: Population Health Millennium+ Project Go

89. Incubator • https://wiki.ucern.com/display/OPIT/Incubator • 4 steps – Log a Jira with the intent to research a tool – Write a wiki article on how to use it – Write a blog on how it is awesome – Record a demo of the tool @ablythe

90. For the Architects Monitoring is only “technical debt” if you choose to carry it that way. Depending on when you invest, it easily can be “technical capital” @ablythe

91. Beyond Thunderdome @ablythe

92. Past – Hackers - Craft Now – SysAdmin - Trade Future – DevOps - Science @ablythe

93. The Tell The years travel fast And time after time, I've done the tell But this ain't one body’s tell It's the tell of us all And you gotta listen it and 'member Cuz what you hears today You gotta tell the newborn tomorrow @ablythe

94. What’d ya think? @ablythe

Editor's Notes

CLICK PLAY
READ THE SLIDE
That was the Blair Witch Project
Blair Witch at one point held the record for the highest profit to cost ratio ever. <enter> But before that…
Mad Max held that record for a couple decades.
My name is Aaron Blythe, and this presentation is called Zabbix: Beyond Thunderdome.
READ SLIDE
Mad Max
READ SLIDE
Zabbix. By show of hands, who has logged into a Zabbix instance? And who has received email alerts from Zabbix?
We will go through where we have been. Where we are. And where we can go with Zabbix. I will try not to give too many spoilers on the Mad Max series of films, merely lay down the story line.
First I want to go through how we got here with Zabbix so far, using the original Mad Max as a guide.
Zabbix is an Open Source Monitoring Tool. Website claims: up to 100,000 monitored devices; up to 1,000,000 metrics.
Mad Max is set in Australia in a dystopian future where earth’s oil supply has been nearly exhausted. Max Rockatansky is the top driver in the Main Force Patrol (basically the police). Gangs have taken over the highway. In a car chase, Max kills one of the gang members, so they want revenge. Honestly the story is sort of disjointed. The movie was edited in the home of one of the producers on a homemade editing machine, created by his father (an engineer).
Brian Cook told me a story of when they were first working on one of our cloud applications. It was memory bound. When a lot of data was being pumped through in batches it would actually clobber the machine. He would have to call someone in the data center at 2 in the morning to physically reboot the machine. Oh, and after doing this a few times he would always make sure to tell them to bring a pencil so they could actually get to the button.
Kalin Hicks and Brian Cook told me: Zabbix was originally installed to bridge the gap in our monitoring for the Sepsis project while we waited for a permanent solution. We just chose to use another monitoring tool instead of a bunch of scripts. It was a Skunkworks project that went viral and certainly was never intended to become such a big project.
Necessity helps us create or adapt great fun things. David Eggby, responsible for much of the footage for Mad Max, had this to say about filming: “… [Shooting from the back of the Goose bike] I couldn't have a helmet on because you can't operate a camera, it gets in the way… They put a seat belt strap around us and we went for it, and you can see on the speedo that it's cracking 180kph.” From: http://sideburnmag.blogspot.com/2012/06/mad-max.html Speedo is ‘stra’in for speedometer…
Unlike proprietary monitoring tools that we use now or have used in the past, we don’t have to worry about paying a license for every stakeholder that has a business need to see the data. <enter> <enter> Fixes on the 2.0 line have so far been decently timely. With a community of hundreds of contributors, Linus’s law applies: given enough eyeballs, all bugs are shallow. <enter> Zabbix is community based.
Community based means there are forums where we can ask questions and get answers ourselves or see the answers to others’ questions. <Enter> Yes, that is almost 40,000 posts to over 10,000 threads. We could never expect this level of interaction and support for an internally developed monitoring tool.
The number of users in the freenode IRC channel continues to grow to nearly 200 people on average. This is a place to ask advanced questions in real time from users around the world. Oh, and this graph was created and gathered in Zabbix over 7 years.
We provide health care solutions. If we can integrate tools that solve software and hardware problems, that gets us to our goal faster.
For those of you who now want to see the movie because of this talk, I don’t want to ruin it for you. But some bad things happen to people Max knows in this movie. This causes Max to quit the force, but he is talked into just taking a holiday instead. At this point Max is just a regular guy. He is trying to keep the peace and lead a good life with his girlfriend.
There are 4 steps to get your host connected to the Zabbix Server and use the Linux OS Template. <enter> However, 2 of them have likely been done for you on the Zabbix Server already. <enter> And soon we plan to automate application of the Template to the Host using Auto-Discovery of Linux nodes. So we are left with one step.
For those couple steps you get (roughly, depending on the layout of the host): 11 applications, 90 items, 120 triggers, and 20 graphs.
As I said at the beginning, Mad Max made a ton of money for the amount of money spent. About 500 to 1000 dollars for every dollar spent. With the Zabbix Linux Template, we are talking about a couple hours of work for 120 Triggers. Once you’ve set this up before, it is really only about 10 minutes of work to set it up for future nodes.
The 80% full alerts have been extremely beneficial. In the case of disk space and inodes, these alerts give us the time and ability to troubleshoot the issue and decide whether to extend the Logical Volume or find the offending large file or process. In the case of the volume reaching 100%, the only choice is to extend the LVM. In the case that I spoke of before that Brian Cook ran into with RAM, we can make better decisions on the size and number of nodes we need for Map Reduce.
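To make the 80% idea concrete, here is a small Python sketch of the same check done by hand. It is not how the Zabbix Linux template is implemented; the function names and the default threshold argument are assumptions for illustration, and `os.statvfs` is POSIX only.

```python
import os


def percent_used(total, free):
    """Return the percentage of a resource in use, given total and free units."""
    if total <= 0:
        return 0.0
    return 100.0 * (total - free) / total


def volume_alerts(path, threshold=80.0):
    """Return the resources on the volume holding path that crossed the threshold."""
    stats = os.statvfs(path)  # POSIX only
    usage = {
        "disk space": percent_used(stats.f_blocks, stats.f_bfree),
        "inodes": percent_used(stats.f_files, stats.f_ffree),
    }
    return {name: pct for name, pct in usage.items() if pct >= threshold}
```

Calling `volume_alerts("/")` returns an empty dict while both disk space and inodes are under the threshold, which is the window in which there is still a choice between cleaning up and extending the volume.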
The entire Mad Max series is built on car chases, which are awesome to watch. So far it has been awesome to watch Zabbix grow so prolifically throughout Cerner.
What impresses me most about Zabbix and Mad Max is that something so simple and easy could gain so much mindshare. The Creators of each poured time and effort into something that has universal and worldwide appeal. We are adapters of their work and I want to thank them.
So that is where we have been and how we got started. Now let’s talk about where we are now, using Mad Max 2: The Road Warrior.
Mad Max 2: The Road Warrior picks up a few years later. Max is older and hardened from the tragedy at the end of the first movie. Oil is still scarce. There are still street gangs. Max is now a Lone Wolf. He is looking for more ammunition for his sawed-off.
Oh, and the villains have slightly better costumes… more budget.
We currently have well over 2000 nodes in the Production Zabbix 2.0 instance. And we believe we can scale that much higher with our current deployment structure.
A common setup for a highly available (or HA) system is to have N+1 nodes. Here we see 2 proxy layer nodes fronting 3 service layer nodes.
If one of the service layer nodes goes down, that is a problem that needs to be addressed, and likely quickly. However, the system as a whole is still functioning.
However, if all 3 nodes go down, that is a disaster that needs to be addressed immediately, and someone needs to be paged to fix it.
John Breese was able to set this up for us on Semantic Solutions using templates. We receive high alerts in the event that any single node goes down. We receive disaster alerts in the event that all of the servers or proxies are down.
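The high versus disaster rule can be written down in a few lines. The sketch below is plain Python, not a Zabbix template; the function name and the node names in the usage line are made up for illustration.

```python
def alert_severity(node_up):
    """Map per-node health to one alert level for a layer.

    node_up maps node name -> True (up) or False (down).
    """
    down = [name for name, up in node_up.items() if not up]
    if not down:
        return "ok"
    if len(down) == len(node_up):
        return "disaster"  # every node in the layer is gone
    return "high"          # degraded, but the layer still serves traffic
```

For example, `alert_severity({"svc1": True, "svc2": False, "svc3": True})` gives "high", and only the all-down case escalates to "disaster".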
The alerts go to a uCern Space set up specifically for monitoring our system. Associates are free to subscribe or unsubscribe from this space as they need. The discussion can occur in the open and the URL can quickly be pasted on other discussions or Jiras that are occurring on other related issues.
Brad Beam created these graphs that anyone who can access the production Zabbix system can see. Meaning if you have the need to see this, you only have to log an issue in Jira. This graph is monitoring the Real Time processing of data through Storm. The Storm acknowledgement rates (or ack rates) are a way to gauge system health. A low ack rate together with a sufficient backlog in notifications is indicative of an issue. I’ll be honest, I am not sure exactly how these graphs were created, nor do I know many details about them specifically. What I do know is that many people have been watching this information to understand the system behavior and improve it over the last couple months.
Another Dashboard created by Brad Beam. We currently have a bug in the JVM reuse for the M/R jobs. The resources for the finished JVMs wouldn't be reclaimed, which would eventually exhaust the resources on the box. So with this graph we can identify if a server has bogus JVMs out there that need to be addressed. Development of basic monitoring features can now be measured in hours or days, as opposed to months. We need the freedom to change these metrics daily/weekly as we learn more.
Brahe Hubble is a Ruby Gem created by Chris Rooney here at Cerner. <enter> Not to steal any thunder from Ben Brown and Kartik Vishwanath presenting on Brahe later in this conference, Brahe is named after the astronomer Tycho Brahe (similar to the project Kepler, which many of you may be more familiar with). Brahe Solr is a cloud based indexing application also created here at Cerner <enter> presents at least 2 replicas <enter> that are fronted by Brahe REST services <enter> to manage and query their state <enter> Brahe Hubble uses these REST services <enter> to present a JSON document <enter> to be used by a Zabbix Template. So why not have Zabbix call the REST interface directly? Basically, the logic done by Brahe Hubble is too complicated for Zabbix to complete on its own.
With the help of Kalin and Brad Beam, Nidhi Bhargava worked through this for our Brahe Hubble deployment. You have your Host or Node and a Zabbix Server. <enter> First you have to get the Zabbix Agent installed (preferably through Chef). <enter> Then a script (or in the case of Brahe Hubble a RubyGem) that does the gathering of information and outputs a JSON document. But how will the Zabbix Agent know about the script or command line? <enter> Easy: you will have to configure the UserParameter for the Zabbix Agent (simple to do if you are using the zabbix_agent_chef cookbook). <enter> This will allow you to present a JSON document to the Zabbix Server. <enter> The Zabbix Server then uses this JSON document in a Template with a Macro.
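For anyone who has not seen a discovery script, here is a minimal Python stand-in for the "script or RubyGem" step. It is not the Brahe Hubble gem: the macro names `{#INDEX}` and `{#VERSION}` and the sample data are hypothetical. What it does follow is the Zabbix 2.0 low level discovery convention of a JSON object with a `data` array whose entries map `{#MACRO}` names to values.

```python
import json


def discovery_document(indexes):
    """Build a low level discovery payload from (name, version) pairs."""
    rows = [{"{#INDEX}": name, "{#VERSION}": version} for name, version in indexes]
    return json.dumps({"data": rows})


if __name__ == "__main__":
    # Hypothetical sample data; a real script would query the service here.
    print(discovery_document([("patients", "1.2"), ("orders", "1.3")]))
```

The agent side is then one line of the form `UserParameter=<key>,<command>` in the agent configuration, for example a made-up key such as `brahe.discovery` pointing at wherever this script is installed.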
In Templates <enter>The important part is that this is created under “discovery” <enter>In Discovery we created an item and a trigger <enter>The item <enter>
It is here where you can use the name value pairs presented in json from the script or RubyGem.
Let me stop for a minute and tell you about my 2 favorite characters in Mad Max 2. Max meets this guy that we refer to as the “Gyro Captain” because no one says his name in the movie and Max never asks. Oh, and probably because he drives a gyrocopter. Character development is starting to become part of the Mad Max movie this time around. Even if names are not. I personally like names and would love to celebrate things you do with Zabbix, as I just did with the cool stuff I have seen done with Zabbix.
Names I have already said so far. <enter> There are many more, but notice that there are 3 dev and 3 ops. Each of us has learned a lot from one another.
There is also The Feral Kid, named for similar reasons. Max gives the feral kid a music box. Max’s heart is starting to soften some and he decides to help this village of people protecting their oil try to get away from the road gang. Max has become more invested in the village. Over the past couple years Zabbix has moved from that side project, or Skunkworks project, to an investment in the health of our system.
Max tries to leave the village once, but does not make it. He comes back after a pretty severe beating.
Remember that Max was the best driver on the Main Force Patrol. Max is the only one who is going to be able to drive the tanker of oil out of the protected village. Oh, and there is an epic oil tanker chase scene. It goes on for like 20 minutes. In Software we often refer to situations where only one or a few can do something critical as having a low “Bus Factor”. Put simply, that is the total number of key developers who would need to be hit by a bus (or tanker) before the project would not be able to proceed.
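A back-of-the-envelope way to compute a bus factor is sketched below in Python. It assumes you can list the critical tasks and who is able to do each one; the function name and the sample data are invented for illustration.

```python
def bus_factor(who_can_do):
    """Return the smallest number of people whose loss stalls the project.

    who_can_do maps each critical task to the set of people able to do it.
    The project stalls as soon as one task has nobody left, so the answer is
    the size of the smallest set.
    """
    if not who_can_do:
        raise ValueError("no critical tasks given")
    return min(len(people) for people in who_can_do.values())
```

With `{"drive the tanker": {"Max"}, "defend the village": {"Pappagallo", "Warrior Woman"}}` the result is 1, which is exactly the village's problem.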
I would describe Mad Max 2 as a… READ SLIDE
The Zabbix Information model has a rather steep learning curve. But I believe it is one worth climbing. From https://www.zabbix.com/forum/showthread.php?t=21030
As I often do, I asked Kalin to talk to me like I'm a 3rd grader and he boiled it down to this for me: A Host can be part of many Host Groups. A Host can have many Templates applied to it. A Template can have Graphs, Items, and Triggers. You can define actions for Triggers. Kyle McGovern and Ben Hemphill mentioned yesterday that they are using Zabbix to restart Hadoop Region Servers. So, self-healing system of the future? We have that now.
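Kalin's summary maps fairly directly onto a couple of small classes. The Python below is only a mental model of those relationships, not the Zabbix schema or API; every class and field name is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    name: str
    items: list = field(default_factory=list)
    triggers: list = field(default_factory=list)
    graphs: list = field(default_factory=list)


@dataclass
class Host:
    name: str
    groups: list = field(default_factory=list)     # a host can sit in many host groups
    templates: list = field(default_factory=list)  # and have many templates applied

    def triggers(self):
        """All triggers the host inherits from its templates."""
        return [t for template in self.templates for t in template.triggers]
```

The point of the model is that a host rarely owns anything itself: applying a template is what gives it items, triggers, and graphs, and actions then hang off the triggers.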
The Road Warrior won critical acclaim, and is an incredibly better movie than the first. The story line is cohesive and somewhat compelling. Max truly comes out a hero. By putting in more work, we have a better story and have done some awesome stuff with Zabbix so far…
Let’s talk about where we want to go with Zabbix in the next couple years.
We want Tina Turner level success… In the third installment of the saga, Mad Max: Beyond Thunderdome, Tina Turner is the leader of Bartertown. She plays Aunty Entity.
Bartertown has regained some technology through the use of methane. Years have passed and an aging Max has some of his supplies stolen and becomes involved in the local political power struggle.
Recently Nimesh Subramanian created a Skybox Labs virtual cluster with a Chef Server and a Zabbix Server. You can check this out, upload the cookbook for your app or service, and start playing around with Zabbix without affecting a shared domain where others are working. When you are finished you can just throw the image away.
Dashboards are an area that could use a lot of work. Each of these titles is available on Safari Online. The way people read books is a personal decision. I personally use my library card, and each of these 4 is available on Safari Online so I can read them on my iPad. How do we convey the most information in the least amount of space to make only the real problems gain attention?
Zabbix has a full API. Many have already been pulling Jira and Splunk data into Dashing (from Shopify), which can be optimized. It should be rather trivial.
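As a rough idea of what a dashboard job would send, the Python sketch below builds request bodies for the Zabbix JSON-RPC API. It only builds the JSON; nothing is sent. The helper name, the credentials, and the filter shown are assumptions for illustration, while `user.login` and `trigger.get` are API methods documented for Zabbix 2.0.

```python
import json
from itertools import count

_request_ids = count(1)


def rpc_request(method, params, auth=None):
    """Build the JSON-RPC 2.0 body for one Zabbix API call."""
    body = {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": next(_request_ids),
    }
    if auth is not None:
        body["auth"] = auth  # token returned by an earlier user.login call
    return json.dumps(body)


if __name__ == "__main__":
    # Hypothetical account; the response to this call carries the auth token.
    print(rpc_request("user.login", {"user": "dashboard", "password": "secret"}))
    # Ask for triggers currently in PROBLEM (value 1), using a placeholder token.
    print(rpc_request("trigger.get", {"output": "extend", "filter": {"value": 1}}, auth="<token>"))
```

Each body would be POSTed to the server's `api_jsonrpc.php` endpoint with the `application/json-rpc` content type, and the parsed result handed to whatever widget the dashboard uses.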
Zabbix does have some interesting features. A couple weeks ago, in the workaround.org blog, Zabbix Maps were explained fairly well. We have not made use of this very heavily; however, this could potentially give us a graphical, relational way to reason about the data that Zabbix is gathering.
Seriously… http://serverfault.com/questions/327472/zabbix-server-sends-too-many-notifications
In Mad Max Beyond Thunderdome there is a cage match between Max and a huge opponent named Blaster. The crowd chants “Two men enter, one man leaves”.
Remember back to my example of High Alerts vs. Disaster for the Service Layer? In the disaster scenario I get 4 alerts: one for each of the 3 hosts, and one for the disaster. However, this is likely all from one cause. Meaning those alerts are correlated, but how do I get the system to only email me once? Sometimes a single cause can result in hundreds of emails from Zabbix. I heard one system engineer recently refer to this as “Getting Zabbixed”.
Straight from the Zabbix Documentation: https://www.zabbix.com/documentation/2.0/manual/config/triggers/dependencies
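The effect of a trigger dependency can be pictured with a few lines of Python. This is a simplified sketch of the idea (suppress a trigger while something it depends on is already in PROBLEM), not Zabbix's implementation; the function and trigger names are made up.

```python
def alerts_to_send(in_problem, depends_on):
    """Drop triggers whose upstream dependency is itself in a problem state.

    in_problem: set of trigger names currently in PROBLEM.
    depends_on: dict mapping a trigger to the triggers it depends on.
    """
    return {
        trigger
        for trigger in in_problem
        if not any(dep in in_problem for dep in depends_on.get(trigger, ()))
    }
```

If "web1 unreachable" and "web2 unreachable" both depend on "router down" and all three fire, only "router down" is left to notify on, which is one email instead of three.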
http://meinit.nl/zabbix-triggers-flap-detection-and-grace-period Systems can get into states where they send an Error then immediately send OKs. A different monitoring system, Nagios, calls this “Flap detection”. In these cases real time alerts are not of much value, because the system is doing one of two things: correcting itself somehow faster than a human can intervene, or these are just the downstream effect of the network or another factor (that we should be using the previously mentioned trigger dependency for). Zabbix calls this Hysteresis, pronounced “Historee Sis”.
Hysteresis is the dependence of a system not only on its current environment but also on its past environment. <Enter> For alerts such as this we can use the pipe character (|, a logical OR in Zabbix 2.0 trigger expressions) to chain conditions. <enter> Problem: being less than 10 GB for 5 minutes <enter> notice this uses the max over 5 minutes <enter> Recovery: being more than 40 GB in the last 10 minutes <enter> notice the min over 10 minutes <enter>
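The two thresholds are easier to see as a tiny state machine. The Python below is a sketch of the logic only, not Zabbix trigger syntax; it assumes the metric is free disk space in bytes (the notes do not name the metric) and that the caller supplies the samples for each window.

```python
GB = 1024 ** 3


def next_state(in_problem, free_last_5m, free_last_10m):
    """Return True if the trigger should be in PROBLEM after this evaluation.

    free_last_5m / free_last_10m: non-empty lists of free-space samples
    (bytes) taken during those windows.
    """
    if not in_problem:
        # Enter PROBLEM only if every sample in 5 minutes was under 10 GB,
        # which is the same as the max of the window being under 10 GB.
        return max(free_last_5m) < 10 * GB
    # Stay in PROBLEM until every sample in 10 minutes is above 40 GB,
    # which is the same as the min of the window being above 40 GB.
    return min(free_last_10m) <= 40 * GB
```

The gap between 10 GB and 40 GB is what stops the flapping: a volume hovering around 10 GB raises one problem and stays there until it has clearly recovered.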
https://www.zabbix.com/wiki/doku.php?id=howto/config/alerts/delaying_notifications From the Zabbix documentation (I have not fully tested this myself). First, check the box to Schedule Actions. This allows the actions on the right side. Next, set a period (maybe 120 seconds). Enable a recovery message. Make sure Trigger value = “PROBLEM” or you will delay the recovery message. Step 2 happens after 120 seconds; step 1 is not defined, so nothing happens until then.
We need Thunderdome for our alerts. 100’s of related alerts enter. One causal alert leaves.
In discussing these methods of correlation, suppression, and delaying messages, I often get asked, “What if someone misses something?” <enter> A monitoring system that cries wolf too often is almost guaranteed not to get listened to. When I hear a car alarm these days I unfortunately almost never think that someone is trying to steal a car. While this is a valid question, it is not the most interesting question to me. It seems like a question that could stunt progress. The Zabbix community is working through an Action Simulator that may be part of a future release of Zabbix. Look for the blog entry entitled: “Why on earth was I not notified?!”
Trends of flapping are better dealt with in a holistic manner. Zabbix is not yet great at daily/weekly reports, but it appears that the community has made a lot of headway and it will be in a near future release.
So let’s return to my previous example. <enter> If I delay the notification by 120 seconds and the node recovers in time, then I get no notification. This is good, as it will cut down on the number of notifications. If the node does not recover in that time, the system as a whole is still up and I can deal with the problematic node individually. <enter>
If all 3 nodes are down at the same time, I would not, however, delay the notifications of the Disaster. In this case, the system is not likely to recover in 2 minutes, so I would just be delaying the other 3 emails. <enter> I may be able to set up a trigger dependency; however, that would sort of be circular in my current opinion. Remember, trigger dependency was for a separate host. <enter>
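Put together, the delay rule from the last two notes looks roughly like the Python below. It is a sketch of the decision only, with an assumed function name and the 120 second figure from the example; in Zabbix itself this lives in the action's escalation steps, not in code.

```python
def should_notify(severity, problem_age, recovered, delay=120):
    """Decide whether to send the problem email right now.

    severity: "high" or "disaster".
    problem_age: seconds since the trigger went to PROBLEM.
    recovered: True if the trigger is already back to OK.
    """
    if recovered:
        return False           # it healed itself, stay quiet
    if severity == "disaster":
        return True            # never sit on a disaster
    return problem_age >= delay  # single-node problems wait out the delay
```

A node that bounces for 30 seconds produces nothing, a node that stays down gets one email after two minutes, and the all-down case goes out immediately.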
In Beyond Thunderdome, Max is banished from Bartertown. He is found by a tribe of children who have a “tell” that prophesies his arrival. Again Max becomes a reluctant hero to this tribe of people.
When Adam Jacob from Opscode was visiting our campus, he walked through an example that we had been working through with proxies. He mentioned Promise Theory. <enter> I am going to use an example I lifted from John Willis of the DevOps Café Podcast. A promise of B from agent 1 to agent 2. http://www.socallinuxexpo.org/sites/default/files/presentations/scale11x-historyofmgmt-130222175623-phpapp01.pdf
There are promises to give and promises to receive. Let’s use + for give and – for receive. I (a1) promise to feed my neighbor’s cat (a2). My neighbor (a2) promises to grant me access to his house. Trust comes in: that my neighbor gave me the correct code and I will not get arrested; that I will not drink his 25 year old scotch.
My Service promises to publish state.
If you think this subject is interesting, Mark Burgess (who wrote cfengine, a precursor to Chef, well before its time) recently published a 303 page draft of his book on the subject.
I have had the opportunity to read many books and take classes on project management. We see this quote many times: Nobody Plans to Fail, some just Fail to Plan. <Enter> This is cute. <Enter> But it is wrong.
Read the slide. Schedule strategic iteration time to work through monitoring… so you are not scheduling weekend war rooms.
The Phoenix Project is a novel about IT and DevOps. It is about a company on the brink of complete failure.
Beyond Thunderdome is yet again a Dystopian Future where the Survival of many is in the hands of one Man <enter>It makes a great action movie, but not a great way to do business.
Our team is built on enablement. We are structured around understanding, harnessing and providing the capabilities needed to deliver software in the Big Data world. There are many tools already in use by a large number of teams. Each of the tools used has a large open community outside of Cerner. We are focused on building an ecosystem within Cerner to solve the large scale problems we are facing with these large scale deployments.
I have been asked many times in the past couple months, “Have you seen monitoring tool X? It is awesome.” I am sure that it is. Please show me why it is awesome. We have set up a way that you can do this. Visit our Incubator link on the uCern wiki. We would like to collect the awesome DevOps tools you are looking into, in a place where you can compare the capabilities to make the best decisions on which ones should be applied to your team.
I had an architect recently refer to working on a monitoring solution as “technical debt” when his system was not yet in production. READ SLIDE
The third installment closes with yet another epic chase in all sorts of vehicles and epic explosions. Max again comes out a hero…
So to relate this back to Chris Brown’s Keynote yesterday?
