Alexei Vladishev - Zabbix - Monitoring Solution for Everyone

The Universal Open Source Enterprise Level Monitoring Solution
Introduction to
Zabbix

Who am I?
Alexei Vladishev
Creator of Zabbix
CEO, Architect and Product Manager
Twitter: @avladishev
Email: alex@zabbix.com
2

About Zabbix team
Zabbix Team: 40 members working full-time
Ofﬁces in Riga (Headquarters), Tokyo and New-York
Business model is based on providing services:
technical support, trainings, development, turn-key
solutions
3

My plan
• Introduction to Zabbix
• Problem detection
• Why Zabbix?
4

5
Zabbix is a universal open
source enterprise level
monitoring solution

6
monitoring solution

7
monitoring solution

8
monitoring solution

9
monitoring solution

Zabbix Is Open Source and Free!
Use all enterprise features of Zabbix
Monitor 100.000s of devices
Collect TBs of history data per day
10

13
History
Analysis
Data collection
Zabbix server

14
History
Analysis
Data collection
Alerts &
Automatic actions
Zabbix server
Visualization

All levels of infrastructure
16

All platforms and OS
All platforms
Most of vendors
Thousands of devices
17

Methods of data collection
Pull
• Service checks: VMWare, HTTP, SSH, IMAP, NTP and other
• Passive agent
• Script execution via SSH and Telnet
Push
• Active agent
• Zabbix Trapper and SNMP Traps
• Monitoring of log ﬁles and event logs on Windows
19
And any other custom checks!

Active vs Passive
Pull
• Service checks
• Passive agent
• SSH and Telnet
Push
• Active agent
• Zabbix Trapper and SNMP Traps
• Monitoring of log ﬁles
20

How often execute checks?
Every N seconds, down to 1 check per second
• Zabbix will evenly distribute checks
Different frequency in different time periods
• Every X seconds in working time
• Every Y second in weekend
At a speciﬁc time
• Ready for business checks
• Every hour starting from 9:00 at working hours (9:00, 10:00, …, 18:00)
21

Encryption and
authentication
Strong encryption and authentication for all
components based on TLS v1.2
22
Zabbix
server
AgentAgent
Zabbix
proxy
Agent
TLSTLSTLS
TLS
Zabbix
sender
Zabbix
get
TLS TLS

How to detect problems
in this data ﬂow?
25
Data collection
?

Trigger is
problem deﬁnition
27

{server:system.cpu.load.last()} > 5
28

Triggers
Example
Operators
- + / * < > = <> <= >= or and not
Functions
min max avg last count date time diff regexp and much
more!
29

{node1:system.cpu.load.avg(10m)} > 5 and 
{node2:system.cpu.load.avg(10m)} > 5 and  
{nodes:tps.min(30m)} > 5000
30
Analyse everything:
any metric and any hosts

Junior level
Performance
32

False positives
33
0
2,5
5
7,5
10
10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50
Flapping

Junior level
Availability
{server:net.tcp.service[http].last()} = 0
34

Too sensitive
35
0
0,25
0,5
0,75
1
10:01 10:02 10:03 10:04 10:05 10:06 10:07 10:08 10:09 10:10 10:11 10:12 10:13 10:14
{server:net.tcp.service[http].last()} = 0

Too sensitive leads to
false positives
36

How to get rid of
false positives?
37

Properly deﬁne problem
conditions and think
carefully!
system is overloaded
running out of disk space
a service is not available
38
What really means ?

Take advantage of history
System performance
{server:system.cpu.load.min(10m)} > 5
Service availability
{server:net.tcp.service[http].max(5m)} = 0
{server:net.tcp.service[http].max(#3)} = 0
39

Analyse history
40
0
2,5
5
7,5
10
10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50 10:55 11:00 11:05 11:10
{server:system.cpu.load.min(10m)} > 5

Analyse history
41
0
0,25
0,5
0,75
1
10:01 10:02 10:03 10:04 10:05 10:06 10:07 10:08 10:09 10:10 10:11 10:12 10:13 10:14 10:15
{server:net.tcp.service[http].max(#3)} = 0

Problem disappeared
!=
problem is resolved
42

A few examples
Problem: free disk space < 10% 
No problem: free disk space = 10.001% Resolved?
Problem: CPU load > 5 
No problem: CPU load = 4.99 Resolved?
Problem: SSH check failed 
No problem: SSH is up Resolved?
43

A few examples
44

A few examples
45

Different conditions for
problem and recovery
Before
Now
Problem: {server:system.cpu.load.last()}>5)
Recovery: {server:system.cpu.load.last()}<1)
46

Hysteresis
47
0
2,5
5
7,5
10
10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50
{server:system.cpu.load.last()} > 5 … {server:system.cpu.load.last()} < 1

Several examples
System is overloaded
Problem: {server:system.cpu.load.min(5m)}>3 
 
Recovery: {server:system.cpu.load.min(2m)}<1
No free disk space on / 
 
Problem: {server:vfs.fs.size[/,pfree].last()}<10 
 
Recovery: {server:vfs.fs.size[/,pfree].min(15m)}>30
SSH server is not available 
 
Problem: {server:net.tcp.service[ssh].max(#3)}=0 
 
Recovery: {server:net.tcp.service[ssh].min(#10)}=1
49

Typical Use: Problem Detection
… but Zabbix also offers:
Anomaly detection
Problem forecasting
Trend prediction
50

How to detect?
Compare with a norm, where norm is system state in the past.
Average number of transactions per second for the last hour
is 2x less than number of transactions per second for the
same period week ago
{server:tps.avg(1h)} < 2 * {server:tps.avg(1h,7d)}
52

Anomaly detection
53
0
2,5
5
7,5
10
10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50 10:55 11:00 11:05 11:10
Compare current system state with the past
Anomaly!

Problem Forecasting
55
0
12,5
25
37,5
50
7:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00
y = -2,9455x + 48,309
Problem
10%
Trigger function timeleft()
2 hours

Trend Prediction
56
0
12,5
25
37,5
50
7:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00
y = -2,9455x + 48,309
5.2%
4 hours
Trigger function forecast()

How Zabbix reacts
on problems?
57

Possible reactions
• Automatic problem resolution
• Sending alerts to user and user group
• Opening tickets in Helpdesk systems
• Unlimited number of possible reactions
58

Escalate!
• Immediate reaction
• Delayed reaction
• Notiﬁcation if automatic 
action failed
• Repeated notiﬁcations
• Escalation to a new level
59

Escalate!
• Immediate reaction
• Delayed reaction
• Notiﬁcation if automatic 
action failed
• Repeated notiﬁcations
• Escalation to a new level
60

Example
61
Critical problem
Repeated Email
SMS and ticket
Service restart
SMS to manager
5 min
10 min
15 min
20 min
0 min

What if there are too
many problems?
62

Event correlation
63
PROBLEM
EVENTS
EVENT
CORRELATION
MODULE
Root cause
problems
Deduplication, event ﬁltering, root cause analysis, etc

Core internet services: DNS, DNSSEC, WHOIS
Management of TLDs, domain registration and registrars
Billions of users
65

60.000 hosts
67
2.000.000 metrics
20.000.000 triggers
6TB history
40 locations, proxy per location
Zabbix performance (avg):
21.000 checks per second

Lifecycle
One major release every 6 months
One LTS release every 1.5 years
69
Zabbix 3.0 LTS Zabbix 3.2 Zabbix 4.0 LTSZabbix 3.4
February, 2016 August, 2016 February, 2017 August, 2017

Support policy
LTS releases: 5 years 
Normal releases: till next release + 1 month
70

72
All-in-one solution
Trend
prediction
Data collection
Problem
detection
Automatic
actions
Agent based
monitoring
Encryption
Anomaly
detection
Maintenance
Event
correlation
Scalability
Visualization
Discovery
Trigger
dependencies
Centralized
management
Service checks
IoT and
embedded
Distributed
monitoring
Zabbix API Alerting Escalations
User
permissions
Integration with
AD, OpenLDAP
WEB front-end
All OSes
… and more

Easy to maintain
73
All components are compatible within one major release
Zabbix Agents are backward compatible since Zabbix 1.0!

Beneﬁts of Zabbix
All in one universal monitoring platform
True Free Software
Easy to adopt, use commercial services if needed
No License Fees
Extremely low TCO
No vendor lock in
74

The Universal Open Source Enterprise Level Monitoring Solution
Thank you!
Twitter: @avladishev
Email: alex@zabbix.com
Learn more at www.zabbix.com

Alexei Vladishev - Zabbix - Monitoring Solution for Everyone

More Related Content

What's hot

Similar to Alexei Vladishev - Zabbix - Monitoring Solution for Everyone

More from Zabbix

Recently uploaded

In this document

Alexei Vladishev - Zabbix - Monitoring Solution for Everyone