Zabbix: Beyond Thunderdome

Presentation at #cernerdevcon on 6/5/2013

Published in: Technology, Business

  1. 1. What’s going on?@ablythe
  2. 2. Huh?@ablythe
  3. 3. Huh? • Does anyone know what movie that was? @ablythe
  4. 4. @ablythe
  5. 5. World Record • Highest Profit to Cost Ratio Ever • But before that… @ablythe
  6. 6. @ablythe
  7. 7. Zabbix: Beyond Thunderdome – Aaron Blythe
  8. 8. This presentation is about…@ablythe
  9. 9. This presentation is about…@ablythe
  10. 10. This presentation is about…@ablythe
  11. 11. This presentation is about…@ablythe
  12. 12. Past • Now • Future @ablythe
  13. 13. Past • Now • Future @ablythe
  14. 14. What is Zabbix?@ablythe
  15. 15. What is Mad Max?@ablythe
  16. 16. Why Zabbix?@ablythe
  17. 17. Why Zabbix? Necessity @ablythe
  18. 18. Why Zabbix?@ablythe
  19. 19. Why Zabbix? Open Source • Linus’s Law: “Given enough eyeballs, all bugs are shallow” • Community Based @ablythe
  20. 20. Why Zabbix?@ablythe
  21. 21. Why Zabbix?@ablythe
  22. 22. Why Zabbix? Mission Statement: To contribute to the systemic improvement of health care delivery and the health of communities. @ablythe
  23. 23. @ablythe
  24. 24. Zabbix Linux Template - Cost • Connect Host as Agent to Zabbix Server (via Chef) • Download Template from Zabbix • Upload Template to Zabbix Server • Apply Template to Host • Cost = 4 steps • 2 Steps • 1 Step @ablythe
  25. 25. Zabbix Linux Template - Return • ~11 applications • ~90 items • ~120 triggers • ~20 graphs @ablythe
  26. 26. Profit to Cost Ratio • Mad Max – $100 million worldwide / A$400,000 • Zabbix Linux Template – 120 Triggers / 2 Steps @ablythe
  27. 27. Benefit • 80% full alerts – Disk space/inodes – RAM • Make better decisions on size needed • Decision: Find file or process, or Extend LVM @ablythe
  28. 28. Chase Scenes and Crashes@ablythe
  29. 29. Creators • Mad Max (Australia): Byron Kennedy, George Miller • Zabbix (Latvia): Alexei Vladishev @ablythe
  30. 30. Past • Now • Future @ablythe
  31. 31. Mad Max 2 – The Road Warrior@ablythe
  32. 32. @ablythe
  33. 33. Scale@ablythe
  34. 34. Highly Available Deployments • Proxy Layer • Service Layer @ablythe
  35. 35. Highly Available Deployments • Proxy Layer • Service Layer @ablythe
  36. 36. Highly Available Deployments • Proxy Layer • Service Layer @ablythe
  37. 37. Highly Available Deployments@ablythe
  38. 38. Email Alerts to uCern Discussions@ablythe
  39. 39. Screens/Graphs – ack rates@ablythe
  40. 40. Screens/Graphs@ablythe
  41. 41. Brahe Hubble: {"{INDEX_MACRO}"=>"name]}", "{VERSION_MACRO}"=>" version", "{ERROR_MACRO}"=>"#{error}"} @ablythe
  42. 42. Zabbix Low Level Discovery • Zabbix Host: Zabbix Agent, UserParameter, Shell Script or RubyGem • JSON Document • Zabbix Server: Template w/ Macro @ablythe
  43. 43. Zabbix Low Level Discovery@ablythe
  44. 44. Zabbix Low Level Discovery@ablythe
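The discovery flow on these slides (a script behind a UserParameter emits a JSON document, which the server expands into a template through macros) can be sketched roughly as follows. This is a minimal illustration, not the team's actual script or the Brahe-hubble gem: the macro names `{#SERVICE}` and `{#PORT}` and the discovered services are hypothetical, and the `"data"` array of `{#MACRO}` keys follows the low-level discovery format as I understand it for Zabbix 2.0.

```ruby
require "json"

# Build a low-level discovery document from a list of discovered entities.
# Each entity becomes one object whose keys are LLD macros.
# The macro names {#SERVICE} and {#PORT} are hypothetical.
def lld_document(services)
  rows = services.map do |svc|
    { "{#SERVICE}" => svc[:name], "{#PORT}" => svc[:port].to_s }
  end
  JSON.generate({ "data" => rows })
end

# Hypothetical discovery result; a real script would inspect the host.
discovered = [
  { name: "web", port: 8080 },
  { name: "queue", port: 5672 }
]

puts lld_document(discovered)
```

A script like this would be wired to the agent with a `UserParameter=<key>,<command>` line in the agent configuration; the key name and the script path are up to the deployment.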
  45. 45. @ablythe
  46. 46. Who? • Kalin Hicks – Set up original GCL VM, countless explanations and whiteboard sessions • Brian Cook – Set up original Sepsis Zabbix VMs • John Breese – Set up 2.0 templates spanning hosts • Brad Beam – Many dashboards, alerts and triggers • Chris Rooney – Brahe-hubble gem • Nidhi Bhargava – Low level discovery on 2.0 • Dev – White, Ops – Yellow @ablythe
  47. 47. @ablythe
  48. 48. It’s not all dogs…@ablythe
  49. 49. …and Gyrocopters@ablythe
  50. 50. Sometimes my email inbox…@ablythe
  51. 51. Has me feeling like@ablythe
  52. 52. Bus Factor@ablythe
  53. 53. Bus Factor: Dystopian Future Where The Survival of Many is in the Hands of One Man @ablythe
  54. 54. The Information Model@ablythe
  55. 55. Host Group • Host • Template (0..n) • Applications (0..n) • Items (1..n) • Trigger • Graph • Action (email, command) • … has a learning curve
  56. 56. Mad Max 2: The Road Warrior@ablythe
  57. 57. Past • Now • Future @ablythe
  58. 58. We Want Tina Turner!@ablythe
  59. 59. Beyond Thunderdome@ablythe
  60. 60. Virtualization thru Skybox Labs@ablythe
  61. 61. Dashboards • chapters divided by types of data rather than types of display • chapters on multi-variables, correlation and proportions • Honestly a little too textbook-ish for me • from more than two dozen experts, real world case studies, beautiful layers, how to’s @ablythe
  62. 62. Pull Data External?@ablythe
  63. 63. Zabbix Maps: http://workaround.org/zabbix/maps @ablythe
  64. 64. Alert Exhaustion • Ain’t Nobody Got @ablythe
  65. 65. Two Men Enter, One Man Leaves@ablythe
  66. 66. Correlation of Alerts • Proxy Layer • Service Layer @ablythe
  67. 67. Trigger Dependencies • Sometimes the availability of one host depends on another. A server that is behind some router will become unreachable if the router goes down. With triggers configured for both, you might get notifications about two hosts down - while only the router was the guilty party. @ablythe
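The router example on this slide can be sketched as a small filter. This is only an illustration of the idea, not how Zabbix implements trigger dependencies internally: the trigger names and the `depends_on` map are hypothetical, and only direct dependencies are checked.

```ruby
# Given the triggers currently in PROBLEM state and a map of
# trigger => triggers it depends on, keep only the triggers whose
# dependencies are not themselves in PROBLEM state.
def alerts_to_send(problem_triggers, depends_on)
  problem_triggers.reject do |trigger|
    depends_on.fetch(trigger, []).any? { |dep| problem_triggers.include?(dep) }
  end
end

# Hypothetical setup: the server sits behind the router.
depends_on = { "server-unreachable" => ["router-down"] }
problems   = ["router-down", "server-unreachable"]

p alerts_to_send(problems, depends_on)
```

With both triggers firing, only the router alert survives; if the server fails while the router is healthy, the server alert is sent as usual.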
  68. 68. “Flap Detection” and a Grace Period • Nagios uses "flap detection" to prevent many ERRORs and OKs being sent right after each other. Zabbix calls this "hysteresis". @ablythe
  69. 69. Hysteresis: Hysteresis is the dependence of a system not only on its current environment but also on its past environment. @ablythe
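One way to picture hysteresis in a trigger is a pair of thresholds: the trigger fires above the higher one and only clears below the lower one, so its state depends on where it has been as well as on the current value. The sketch below is a plain Ruby illustration under that assumption; the class name and the 80/70 thresholds are hypothetical and it is not Zabbix trigger-expression syntax.

```ruby
# A trigger with hysteresis: it enters :problem above `high` and only
# returns to :ok once the value falls below `low`.
class HysteresisTrigger
  attr_reader :state

  def initialize(high:, low:)
    @high = high
    @low = low
    @state = :ok
  end

  # Feed one new sample; returns the (possibly unchanged) state.
  def update(value)
    if @state == :ok && value > @high
      @state = :problem
    elsif @state == :problem && value < @low
      @state = :ok
    end
    @state
  end
end

# Hypothetical disk-usage samples hovering around the 80% mark.
trigger = HysteresisTrigger.new(high: 80, low: 70)
p [75, 82, 78, 81, 69].map { |value| trigger.update(value) }
```

A single 80% threshold would flip state three times on these samples; with the 70 to 80 band the trigger changes state only twice.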
  70. 70. Delaying Notifications@ablythe
  71. 71. Correlation of Alerts: We need to get to the point where: 100’s of Related Alerts Enter, One Causal Alert Leaves @ablythe
  72. 72. What if someone misses something? With 100+ alert emails per day, they are almost guaranteed to miss something. See “Why on earth was I not notified?!” on http://blog.zabbix.com/ @ablythe
  73. 73. Trends of Flakiness: These should not be dealt with by alerts/alarms, but rather by daily/weekly reports. Unfortunately Zabbix is not strong in this area yet. There is a thread: https://www.zabbix.com/forum/showthread.php?t=18901 @ablythe
  74. 74. False Alarms Due to Chef Restarts • Current – Manual: Maintenance Periods • Potentially – Automated: Automate the Maintenance Periods, Delaying Notifications, Hysteresis, Promise Theory @ablythe
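Automating the maintenance periods could start with a request to the Zabbix JSON-RPC API. The sketch below only builds the request body and does not send it. The `maintenance.create` method and the field names reflect my reading of the Zabbix API and should be checked against the documentation for the version in use; the host ID, auth token, period name, and times are hypothetical.

```ruby
require "json"

# Build (but do not send) a JSON-RPC request asking Zabbix to create a
# one-time maintenance period, e.g. around a Chef-driven restart.
# Field names are assumptions to verify against the Zabbix API docs.
def maintenance_request(host_ids:, start_time:, duration:, auth:, name: "chef-restart")
  {
    "jsonrpc" => "2.0",
    "method"  => "maintenance.create",
    "params"  => {
      "name"         => name,
      "active_since" => start_time,
      "active_till"  => start_time + duration,
      "hostids"      => host_ids,
      "timeperiods"  => [
        # timeperiod_type 0 is assumed to mean "one time only"
        { "timeperiod_type" => 0, "start_date" => start_time, "period" => duration }
      ]
    },
    "auth" => auth,
    "id"   => 1
  }
end

# Hypothetical values: a 10-minute window starting now for one host.
request = maintenance_request(
  host_ids: ["10105"], start_time: Time.now.to_i, duration: 600, auth: "example-token"
)
puts JSON.pretty_generate(request)
```

A Chef run could post a body like this before restarting a service, which would replace the manual step listed under "Current".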
  75. 75. Highly Available Deployments • Delayed Notifications/Hysteresis • Proxy Layer • Service Layer • Delay Alert 120 seconds • Works!! @ablythe
  76. 76. Highly Available Deployments • Delayed Notifications/Hysteresis • Proxy Layer • Service Layer • Delay Alert 120 seconds • Delay Alert 120 seconds • Delay Alert 120 seconds • No Delay • Doesn’t Work @ablythe
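The "Delay Alert 120 seconds" idea on these slides amounts to a grace period: notify only if the problem is still present once the delay has elapsed. A minimal sketch, assuming times in seconds and a hypothetical `notify?` helper (in Zabbix itself this kind of delay is configured on the action or trigger, not written in code):

```ruby
DELAY_SECONDS = 120 # the delay shown on the slides

# Return true only when a problem has persisted for the whole grace period.
# `problem_since` is nil while the check is OK; times are in seconds.
def notify?(problem_since, now, delay = DELAY_SECONDS)
  return false if problem_since.nil?
  now - problem_since >= delay
end

p notify?(1_000, 1_045) # 45 seconds in: still inside the grace period
p notify?(nil, 1_060)   # service recovered: nothing to report
p notify?(1_000, 1_130) # 130 seconds in: the problem outlived the delay
```

A restart that completes inside the grace period never produces an email, which is the "Works!!" case on the first of the two slides.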
  77. 77. Beyond Thunderdome@ablythe
  78. 78. Promise Theory@ablythe
  79. 79. Deconstructing Promises@ablythe
  80. 80. Promise Theory • +data • a1 • a2 • My Service • Zabbix @ablythe
  81. 81. Leveraging Init.d to Manage State @ablythe
        …
        case "$1" in
          start)
            # marker file: a start is in progress
            touch /var/<service>/start
            …
            rm -f /var/<service>/start
            ;;
          stop)
            # marker file: a stop is in progress
            touch /var/<service>/stop
            rm -f /var/<service>/stop
            ;;
          restart)
            # marker file: a restart is in progress
            touch /var/<service>/restart
            $0 stop
            $0 start
            rm -f /var/<service>/restart
            ;;
          …
      This of course is messy if the service ever hangs during a restart. More discussion needs to be had in this area.
  82. 82. Mark Burgess – Book of Promises: http://cfengine.com/markburgess/BookOfPromises.pdf (draft published on January 21st 2013) @ablythe
  83. 83. For the Project Managers: Nobody PLANS TO FAIL. Some just FAIL TO PLAN. @ablythe
  84. 84. For the Project Managers: Everybody should PLAN TO FAIL, PRACTICE LOCALIZED FAILURE, and MINIMIZE RECOVERY TIME @ablythe
  85. 85. The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win @ablythe
  86. 86. The Brent Effect • Brent is the one person who understands how the entire system fits together. • Brent is the one person who fixes most of the issues. • Being spread so thin, Brent is also the one person who causes most of the issues. @ablythe
  87. 87. Dystopian Future Where The Survival of Many is in the Hands of One Man • The system or crucial parts of the system • Man or Woman @ablythe
  88. 88. What is OpsInfra? A team built on enablement of DevOps. • Build an Ecosystem: Tool Virtualization, Repeatable Deployment, Documentation, Discussion, Auxiliary Tooling, Education, Other tools as needed • The Success of: Population Health, Millennium+, Project Go @ablythe
  89. 89. Incubator • https://wiki.ucern.com/display/OPIT/Incubator • 4 steps – Log a Jira with the intent to research a tool – Write a wiki article on how to use it – Write a blog on how it is awesome – Record a demo of the tool @ablythe
  90. 90. For the Architects: Monitoring is only “technical debt” if you choose to carry it that way. Depending on when you invest, it easily can be “technical capital”. @ablythe
  91. 91. Beyond Thunderdome@ablythe
  92. 92. Past – Hackers – Craft • Now – SysAdmin – Trade • Future – Devops – Science @ablythe
  93. 93. The Tell • The years travel fast / And time after time, Ive done the tell / But this aint one body’s tell / Its the tell of us all / And you gotta listen it and member / Cuz what you hears today / You gotta tell the newborn tomorrow @ablythe
  94. 94. What’d ya think?@ablythe
