The Universal Open Source Enterprise Level Monitoring Solution
Introduction to
Zabbix
Who am I?
Alexei Vladishev
Creator of Zabbix
CEO, Architect and Product Manager
Twitter: @avladishev
Email: alex@zabbix.com
2
About Zabbix team
Zabbix Team: 40 members working full-time
Offices in Riga (Headquarters), Tokyo and New-York
Business model is based on providing services:
technical support, trainings, development, turn-key
solutions
3
My plan
• Introduction to Zabbix
• Problem detection
• Why Zabbix?
4
5
Zabbix is a universal open
source enterprise level
monitoring solution
6
Zabbix is a universal open
source enterprise level
monitoring solution
7
Zabbix is a universal open
source enterprise level
monitoring solution
8
Zabbix is a universal open
source enterprise level
monitoring solution
9
Zabbix is a universal open
source enterprise level
monitoring solution
Zabbix Is Open Source and Free!
Use all enterprise features of Zabbix
Monitor 100.000s of devices
Collect TBs of history data per day
10
11
Zabbix Architecture
12
Data collection
13
History
Analysis
Data collection
Zabbix server
14
History
Analysis
Data collection
Alerts &
Automatic actions
Zabbix server
Visualization
Data collection
15
All levels of infrastructure
16
All platforms and OS
All platforms
Most of vendors
Thousands of devices
17
Distributed environment
18
Methods of data collection
Pull
• Service checks: VMWare, HTTP, SSH, IMAP, NTP and other
• Passive agent
• Script execution via SSH and Telnet
Push
• Active agent
• Zabbix Trapper and SNMP Traps
• Monitoring of log files and event logs on Windows
19
And any other custom checks!
Active vs Passive
Pull
• Service checks
• Passive agent
• SSH and Telnet
Push
• Active agent
• Zabbix Trapper and SNMP Traps
• Monitoring of log files
20
How often execute checks?
Every N seconds, down to 1 check per second
• Zabbix will evenly distribute checks
Different frequency in different time periods
• Every X seconds in working time
• Every Y second in weekend
At a specific time
• Ready for business checks
• Every hour starting from 9:00 at working hours (9:00, 10:00, …, 18:00)
21
Encryption and
authentication
Strong encryption and authentication for all
components based on TLS v1.2
22
Zabbix
server
AgentAgent
Zabbix
proxy
Agent
TLSTLSTLS
TLS
Zabbix
sender
Zabbix
get
TLS TLS
Sensors & small devices
23
Problem detection
24
How to detect problems
in this data flow?
25
Data collection
?
Triggers!
26
Trigger is
problem definition
27
{server:system.cpu.load.last()} > 5
28
Triggers
Example
{server:system.cpu.load.last()} > 5
Operators
- + / * < > = <> <= >= or and not
Functions
min max avg last count date time diff regexp and much
more!
29
{node1:system.cpu.load.avg(10m)} > 5 and

{node2:system.cpu.load.avg(10m)} > 5 and 

{nodes:tps.min(30m)} > 5000
30
Analyse everything:
any metric and any hosts
Art of problem
detection
31
Junior level
Performance
{server:system.cpu.load.last()} > 5
32
False positives
33
0
2,5
5
7,5
10
10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50
{server:system.cpu.load.last()} > 5
Flapping
Junior level
Availability
{server:net.tcp.service[http].last()} = 0
34
Too sensitive
35
0
0,25
0,5
0,75
1
10:01 10:02 10:03 10:04 10:05 10:06 10:07 10:08 10:09 10:10 10:11 10:12 10:13 10:14
{server:net.tcp.service[http].last()} = 0
Too sensitive leads to
false positives
36
How to get rid of
false positives?
37
Properly define problem
conditions and think
carefully!
system is overloaded
running out of disk space
a service is not available
38
What really means ?
Take advantage of history
System performance
{server:system.cpu.load.min(10m)} > 5
Service availability
{server:net.tcp.service[http].max(5m)} = 0
{server:net.tcp.service[http].max(#3)} = 0
39
Analyse history
40
0
2,5
5
7,5
10
10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50 10:55 11:00 11:05 11:10
{server:system.cpu.load.min(10m)} > 5
Analyse history
41
0
0,25
0,5
0,75
1
10:01 10:02 10:03 10:04 10:05 10:06 10:07 10:08 10:09 10:10 10:11 10:12 10:13 10:14 10:15
{server:net.tcp.service[http].max(#3)} = 0
Problem disappeared
!=
problem is resolved
42
A few examples
Problem: free disk space < 10%

No problem: free disk space = 10.001% Resolved?
Problem: CPU load > 5

No problem: CPU load = 4.99 Resolved?
Problem: SSH check failed

No problem: SSH is up Resolved?
43
A few examples
Problem: free disk space < 10%

No problem: free disk space = 10.001% Resolved?
Problem: CPU load > 5

No problem: CPU load = 4.99 Resolved?
Problem: SSH check failed

No problem: SSH is up Resolved?
44
A few examples
Problem: free disk space < 10%

No problem: free disk space = 10.001% Resolved?
Problem: CPU load > 5

No problem: CPU load = 4.99 Resolved?
Problem: SSH check failed

No problem: SSH is up Resolved?
45
Different conditions for
problem and recovery
Before
{server:system.cpu.load.last()} > 5
Now
Problem: {server:system.cpu.load.last()}>5)
Recovery: {server:system.cpu.load.last()}<1)
46
Hysteresis
47
0
2,5
5
7,5
10
10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50
{server:system.cpu.load.last()} > 5 … {server:system.cpu.load.last()} < 1
No flapping!
48
Several examples
System is overloaded
Problem: {server:system.cpu.load.min(5m)}>3



Recovery: {server:system.cpu.load.min(2m)}<1
No free disk space on /



Problem: {server:vfs.fs.size[/,pfree].last()}<10



Recovery: {server:vfs.fs.size[/,pfree].min(15m)}>30
SSH server is not available



Problem: {server:net.tcp.service[ssh].max(#3)}=0



Recovery: {server:net.tcp.service[ssh].min(#10)}=1
49
Typical Use: Problem Detection
… but Zabbix also offers:
Anomaly detection
Problem forecasting
Trend prediction
50
Anomalies
51
How to detect?
Compare with a norm, where norm is system state in the past.
Average number of transactions per second for the last hour
is 2x less than number of transactions per second for the
same period week ago
{server:tps.avg(1h)} < 2 * {server:tps.avg(1h,7d)}
52
Anomaly detection
53
0
2,5
5
7,5
10
10:00 10:05 10:10 10:15 10:20 10:25 10:30 10:35 10:40 10:45 10:50 10:55 11:00 11:05 11:10
Compare current system state with the past
Anomaly!
Problem forecasting
54
Problem Forecasting
55
0
12,5
25
37,5
50
7:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00
y = -2,9455x + 48,309
Problem
10%
Trigger function timeleft()
2 hours
Trend Prediction
56
0
12,5
25
37,5
50
7:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00
y = -2,9455x + 48,309
5.2%
4 hours
Trigger function forecast()
How Zabbix reacts
on problems?
57
Possible reactions
• Automatic problem resolution
• Sending alerts to user and user group
• Opening tickets in Helpdesk systems
• Unlimited number of possible reactions
58
Escalate!
• Immediate reaction
• Delayed reaction
• Notification if automatic

action failed
• Repeated notifications
• Escalation to a new level
59
Escalate!
• Immediate reaction
• Delayed reaction
• Notification if automatic

action failed
• Repeated notifications
• Escalation to a new level
60
Example
61
Critical problem
Repeated Email
SMS and ticket
Service restart
SMS to manager
5 min
10 min
15 min
20 min
0 min
What if there are too
many problems?
62
Event correlation
63
PROBLEM
EVENTS
EVENT
CORRELATION
MODULE
Root cause
problems
Deduplication, event filtering, root cause analysis, etc
64
Scalability
Core internet services: DNS, DNSSEC, WHOIS
Management of TLDs, domain registration and registrars
Billions of users
65
60.000 hosts
66
60.000 hosts
67
2.000.000 metrics
20.000.000 triggers
6TB history
40 locations, proxy per location
Zabbix performance (avg):
21.000 checks per second
68
Zabbix Lifecycle
Lifecycle
One major release every 6 months
One LTS release every 1.5 years
69
Zabbix 3.0 LTS Zabbix 3.2 Zabbix 4.0 LTSZabbix 3.4
February, 2016 August, 2016 February, 2017 August, 2017
Support policy
LTS releases: 5 years

Normal releases: till next release + 1 month
70
What users like in
Zabbix?
71
72
All-in-one solution
Trend
prediction
Data collection
Problem
detection
Automatic
actions
Agent based
monitoring
Encryption
Anomaly
detection
Maintenance
Event
correlation
Scalability
Visualization
Discovery
Trigger
dependencies
Centralized
management
Service checks
IoT and
embedded
Distributed
monitoring
Zabbix API Alerting Escalations
User
permissions
Integration with
AD, OpenLDAP
WEB front-end
All OSes
… and more
Easy to maintain
73
All components are compatible within one major release
Zabbix Agents are backward compatible since Zabbix 1.0!
Benefits of Zabbix
All in one universal monitoring platform
True Free Software
Easy to adopt, use commercial services if needed
No License Fees
Extremely low TCO
No vendor lock in
74
The Universal Open Source Enterprise Level Monitoring Solution
Thank you!
Twitter: @avladishev
Email: alex@zabbix.com
Learn more at www.zabbix.com

Alexei Vladishev - Zabbix - Monitoring Solution for Everyone