Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Alexei Vladishev - Zabbix - Monitoring Solution for Everyone

37,314 views

Published on

Paris Zabbix User Group Meetup 2016
June 23, 2016

1. Open Source
2. Zabbix Architecture
3. Data Collection
4. Problem Detection
5. Problem Forecasting / Trend Prediction
6. Lifecycle and Support Policy

Published in: Technology

Alexei Vladishev - Zabbix - Monitoring Solution for Everyone

  1. 1. The Universal Open Source Enterprise Level Monitoring Solution Introduction to Zabbix
  2. 2. Who am I? Alexei Vladishev Creator of Zabbix CEO, Architect and Product Manager Twitter: @avladishev Email: alex@zabbix.com 2
  3. 3. About Zabbix team Zabbix Team: 40 members working full-time Offices in Riga (Headquarters), Tokyo and New-York Business model is based on providing services: technical support, trainings, development, turn-key solutions 3
  4. 4. My plan • Introduction to Zabbix • Problem detection • Why Zabbix? 4
  5. 5. 5 Zabbix is a universal open source enterprise level monitoring solution
  6. 6. 6 Zabbix is a universal open source enterprise level monitoring solution
  7. 7. 7 Zabbix is a universal open source enterprise level monitoring solution
  8. 8. 8 Zabbix is a universal open source enterprise level monitoring solution
  9. 9. 9 Zabbix is a universal open source enterprise level monitoring solution
  10. 10. Zabbix Is Open Source and Free! Use all enterprise features of Zabbix Monitor 100.000s of devices Collect TBs of history data per day 10
  11. 11. 11 Zabbix Architecture
  12. 12. 12 Data collection
  13. 13. 13 History Analysis Data collection Zabbix server
  14. 14. 14 History Analysis Data collection Alerts & Automatic actions Zabbix server Visualization
  15. 15. Data collection 15
  16. 16. All levels of infrastructure 16
  17. 17. All platforms and OS All platforms Most of vendors Thousands of devices 17
  18. 18. Distributed environment 18
  19. 19. Methods of data collection Pull • Service checks: VMWare, HTTP, SSH, IMAP, NTP and other • Passive agent • Script execution via SSH and Telnet Push • Active agent • Zabbix Trapper and SNMP Traps • Monitoring of log files and event logs on Windows 19 And any other custom checks!
  20. 20. Active vs Passive Pull • Service checks • Passive agent • SSH and Telnet Push • Active agent • Zabbix Trapper and SNMP Traps • Monitoring of log files 20
  21. 21. How often execute checks? Every N seconds, down to 1 check per second • Zabbix will evenly distribute checks Different frequency in different time periods • Every X seconds in working time • Every Y second in weekend At a specific time • Ready for business checks • Every hour starting from 9:00 at working hours (9:00, 10:00, …, 18:00) 21
  22. 22. Encryption and authentication Strong encryption and authentication for all components based on TLS v1.2 22 Zabbix server AgentAgent Zabbix proxy Agent TLSTLSTLS TLS Zabbix sender Zabbix get TLS TLS
  23. 23. Sensors & small devices 23
  24. 24. Problem detection 24
  25. 25. How to detect problems in this data flow? 25 Data collection ?
  26. 26. Triggers! 26
  27. 27. Trigger is problem definition 27
  28. 28. {server:system.cpu.load.last()} > 5 28
  29. 29. Triggers Example {server:system.cpu.load.last()} > 5 Operators - + / * < > = <> <= >= or and not Functions min max avg last count date time diff regexp and much more! 29
  30. 30. {node1:system.cpu.load.avg(10m)} > 5 and
 {node2:system.cpu.load.avg(10m)} > 5 and 
 {nodes:tps.min(30m)} > 5000 30 Analyse everything: any metric and any hosts
  31. 31. Art of problem detection 31
  32. 32. Junior level Performance {server:system.cpu.load.last()} > 5 32
  33. 33. False positives 33 0 2,5 5 7,5 10 10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50 {server:system.cpu.load.last()} > 5 Flapping
  34. 34. Junior level Availability {server:net.tcp.service[http].last()} = 0 34
  35. 35. Too sensitive 35 0 0,25 0,5 0,75 1 10:01 10:02 10:03 10:04 10:05 10:06 10:07 10:08 10:09 10:10 10:11 10:12 10:13 10:14 {server:net.tcp.service[http].last()} = 0
  36. 36. Too sensitive leads to false positives 36
  37. 37. How to get rid of false positives? 37
  38. 38. Properly define problem conditions and think carefully! system is overloaded running out of disk space a service is not available 38 What really means ?
  39. 39. Take advantage of history System performance {server:system.cpu.load.min(10m)} > 5 Service availability {server:net.tcp.service[http].max(5m)} = 0 {server:net.tcp.service[http].max(#3)} = 0 39
  40. 40. Analyse history 40 0 2,5 5 7,5 10 10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50 10:55 11:00 11:05 11:10 {server:system.cpu.load.min(10m)} > 5
  41. 41. Analyse history 41 0 0,25 0,5 0,75 1 10:01 10:02 10:03 10:04 10:05 10:06 10:07 10:08 10:09 10:10 10:11 10:12 10:13 10:14 10:15 {server:net.tcp.service[http].max(#3)} = 0
  42. 42. Problem disappeared != problem is resolved 42
  43. 43. A few examples Problem: free disk space < 10%
 No problem: free disk space = 10.001% Resolved? Problem: CPU load > 5
 No problem: CPU load = 4.99 Resolved? Problem: SSH check failed
 No problem: SSH is up Resolved? 43
  44. 44. A few examples Problem: free disk space < 10%
 No problem: free disk space = 10.001% Resolved? Problem: CPU load > 5
 No problem: CPU load = 4.99 Resolved? Problem: SSH check failed
 No problem: SSH is up Resolved? 44
  45. 45. A few examples Problem: free disk space < 10%
 No problem: free disk space = 10.001% Resolved? Problem: CPU load > 5
 No problem: CPU load = 4.99 Resolved? Problem: SSH check failed
 No problem: SSH is up Resolved? 45
  46. 46. Different conditions for problem and recovery Before {server:system.cpu.load.last()} > 5 Now Problem: {server:system.cpu.load.last()}>5) Recovery: {server:system.cpu.load.last()}<1) 46
  47. 47. Hysteresis 47 0 2,5 5 7,5 10 10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50 {server:system.cpu.load.last()} > 5 … {server:system.cpu.load.last()} < 1
  48. 48. No flapping! 48
  49. 49. Several examples System is overloaded Problem: {server:system.cpu.load.min(5m)}>3
 
 Recovery: {server:system.cpu.load.min(2m)}<1 No free disk space on /
 
 Problem: {server:vfs.fs.size[/,pfree].last()}<10
 
 Recovery: {server:vfs.fs.size[/,pfree].min(15m)}>30 SSH server is not available
 
 Problem: {server:net.tcp.service[ssh].max(#3)}=0
 
 Recovery: {server:net.tcp.service[ssh].min(#10)}=1 49
  50. 50. Typical Use: Problem Detection … but Zabbix also offers: Anomaly detection Problem forecasting Trend prediction 50
  51. 51. Anomalies 51
  52. 52. How to detect? Compare with a norm, where norm is system state in the past. Average number of transactions per second for the last hour is 2x less than number of transactions per second for the same period week ago {server:tps.avg(1h)} < 2 * {server:tps.avg(1h,7d)} 52
  53. 53. Anomaly detection 53 0 2,5 5 7,5 10 10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50 10:55 11:00 11:05 11:10 Compare current system state with the past Anomaly!
  54. 54. Problem forecasting 54
  55. 55. Problem Forecasting 55 0 12,5 25 37,5 50 7:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 y = -2,9455x + 48,309 Problem 10% Trigger function timeleft() 2 hours
  56. 56. Trend Prediction 56 0 12,5 25 37,5 50 7:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 y = -2,9455x + 48,309 5.2% 4 hours Trigger function forecast()
  57. 57. How Zabbix reacts on problems? 57
  58. 58. Possible reactions • Automatic problem resolution • Sending alerts to user and user group • Opening tickets in Helpdesk systems • Unlimited number of possible reactions 58
  59. 59. Escalate! • Immediate reaction • Delayed reaction • Notification if automatic
 action failed • Repeated notifications • Escalation to a new level 59
  60. 60. Escalate! • Immediate reaction • Delayed reaction • Notification if automatic
 action failed • Repeated notifications • Escalation to a new level 60
  61. 61. Example 61 Critical problem Repeated Email SMS and ticket Service restart SMS to manager 5 min 10 min 15 min 20 min 0 min
  62. 62. What if there are too many problems? 62
  63. 63. Event correlation 63 PROBLEM EVENTS EVENT CORRELATION MODULE Root cause problems Deduplication, event filtering, root cause analysis, etc
  64. 64. 64 Scalability
  65. 65. Core internet services: DNS, DNSSEC, WHOIS Management of TLDs, domain registration and registrars Billions of users 65
  66. 66. 60.000 hosts 66
  67. 67. 60.000 hosts 67 2.000.000 metrics 20.000.000 triggers 6TB history 40 locations, proxy per location Zabbix performance (avg): 21.000 checks per second
  68. 68. 68 Zabbix Lifecycle
  69. 69. Lifecycle One major release every 6 months One LTS release every 1.5 years 69 Zabbix 3.0 LTS Zabbix 3.2 Zabbix 4.0 LTSZabbix 3.4 February, 2016 August, 2016 February, 2017 August, 2017
  70. 70. Support policy LTS releases: 5 years
 Normal releases: till next release + 1 month 70
  71. 71. What users like in Zabbix? 71
  72. 72. 72 All-in-one solution Trend prediction Data collection Problem detection Automatic actions Agent based monitoring Encryption Anomaly detection Maintenance Event correlation Scalability Visualization Discovery Trigger dependencies Centralized management Service checks IoT and embedded Distributed monitoring Zabbix API Alerting Escalations User permissions Integration with AD, OpenLDAP WEB front-end All OSes … and more
  73. 73. Easy to maintain 73 All components are compatible within one major release Zabbix Agents are backward compatible since Zabbix 1.0!
  74. 74. Benefits of Zabbix All in one universal monitoring platform True Free Software Easy to adopt, use commercial services if needed No License Fees Extremely low TCO No vendor lock in 74
  75. 75. The Universal Open Source Enterprise Level Monitoring Solution Thank you! Twitter: @avladishev Email: alex@zabbix.com Learn more at www.zabbix.com

×