Alexei Vladishev - Open Source Monitoring With Zabbix

Open Source Enterprise Monitoring
with Zabbix
Alexei Vladishev, Founder of Zabbix
www.zabbix.com

Plan
What is Zabbix:

• Zabbix overview
• Highlights of Zabbix features
• Monitoring of large distributed environments

Future:

• Zabbix Roadmap

Why should we use monitoring?
Most important reasons:

• Warn and act in case of any problems.
• Downtime is very expensive!
• Identify and fix problems ASAP, before customers start calling.
• Make IT staff more productive.
• Automate routine tasks, such as checking the availability of resources.
• Plan hardware resources: capacity planning and trends.
• Measure and analyse the quality of provided and used services (SLA).

A good monitoring system makes us confident our business is running!

History
Zabbix is celebrating its 8th anniversary!

• Choice of 1998 — HP OpenView, IBM, BMC: expensive to buy and maintain
• How to name it? ABCDE...Zabbix!
• April 2001 — the first public release Zabbix 1.0alpha1
• April 2004 — the first stable release Zabbix 1.0
• April 2005 — the company Zabbix SIA was established: commercial support

Zabbix today. We have made good progress!

Zabbix 1.6.4, 500 downloads per day, 15,000 forum users
Zabbix company is growing, 20 Zabbix partners (Europe, Japan, the US)

What is Zabbix?
Zabbix is an Open Source distributed monitoring system capable of monitoring
the availability and performance of servers, network devices and applications.

Zabbix functionality:
• Agent-less/based monitoring
• Auto-discovery
• Escalations and repeated notifications
• Pro-active monitoring, remote actions
• WEB monitoring
• Graphs, maps, screens
• IT Services (SLA), reports
• Distributed monitoring, IPv6 and more!

Zabbix: main components
Server:
• Zabbix core, system logic
• Data processing, escalations

WEB front-end:
• Access to historical data
• Configuration

Agent:
• Server data collection, actions

Proxy:
• Remote data collection

Technical details
Important technical decisions:
• WEB front-end for data visualisation and configuration
• Written in the C language, PHP front-end. No Java/Python/Perl/Ruby on the
server and agent side! No fork(), native syscalls() are used instead.
• Support of virtually all platforms (Linux, *BSD, Solaris, AIX, HP-UX,
Windows,...)
• Choice of database engines: MySQL, PostgreSQL, Oracle, SQLite
• We do not reuse Nagios, RRD, Cacti

Key principles of Zabbix development:
• Keep things simple (KISS), yet be very flexible
• Maintain low hardware requirements; monitoring should not affect production

Why would we choose Zabbix?
What makes Zabbix so special?
• An all-in-one solution, but only when it comes to monitoring!
• All historical data, trends and configuration are stored in a database
• Ready for monitoring of small and LARGE distributed environments
• True Open Source (GPLv2) solution, no commercial versions.
• All logic is on the server side, agents are for data collection only
• Extremely flexible! Triggers, escalations, new checks, screens, and more.
• Designed to deal with unstable communications
• Full support of IPv6

How to monitor
Service checks:
• FTP, SSH, HTTP, SMTP, DNS ...

Zabbix Agent:
• Active and passive checks
• Monitoring of logs, event logs
• Easy to extend
• Remote command execution
• Extremely efficient!

SNMP v1,v2,v3:
• Network devices
• Normally NET-SNMP for servers
• Monitoring of applications (Oracle, Weblogic, Websphere, PostgreSQL, MySQL, ...)
• SNMP traps

IPMI:
• Monitoring of hardware
• Remote management (reboot, reset, halt)

Other:
• WMI, JMX, Nagios plugins

Use of Zabbix agent
Active checks:
• Highly efficient
• Buffering of collected data

Passive checks:
• Require polling on the Zabbix server side
• Additional performance hit because of polling and network bandwidth

Mmm... Triggers!
A trigger is a flexible logical expression used to define a problem condition.
• Status (value) of a trigger represents system state
• Change of trigger value generates events
• It is one of the ways to deal with flapping

CPU load is too high: {host:cpuload.last(0)}>5
CPU load is too high: {host:cpuload.min(300)}>2
CPU load is too high: {host:cpuload.min(300)}>2 & {host:cpuuser.min(300)}>50
CPU load is too high: {host:cpuload.min(300)}>2 & {host2:backup.last(0)}=0

We decide how to define «CPU load is too high», not Zabbix!
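
The difference between the first two expressions can be sketched in Python. This is an illustrative model only, not Zabbix code: the helper names `last` and `min_over`, the sample data and the one-minute sampling interval are all assumptions made for the example.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)


def last(history: List[Sample], now: int) -> float:
    """Most recent value at or before `now`, in the spirit of last(0)."""
    return max((s for s in history if s[0] <= now), key=lambda s: s[0])[1]


def min_over(history: List[Sample], seconds: int, now: int) -> float:
    """Smallest value seen in the last `seconds`, in the spirit of min(300)."""
    window = [v for t, v in history if now - seconds < t <= now]
    if not window:
        raise ValueError("no samples in the window")
    return min(window)


# Hypothetical CPU load samples, one per minute, with a single short spike.
cpuload = [(0, 0.5), (60, 0.7), (120, 6.2), (180, 0.6), (240, 0.8), (300, 0.9)]


def spike_trigger(history: List[Sample], now: int) -> bool:
    # Models {host:cpuload.last(0)}>5: fires on any single high value.
    return last(history, now) > 5


def sustained_trigger(history: List[Sample], now: int) -> bool:
    # Models {host:cpuload.min(300)}>2: fires only if the load stayed
    # above 2 for the whole 5-minute window.
    return min_over(history, 300, now) > 2
```

With these samples the `last`-based condition fires during the spike and clears a minute later, while the `min`-based condition never fires, which is how a window function helps against flapping.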

Dependencies
They are used to:

• Avoid notifications about dependent problems
• Define dependencies between different problems (related to networks,
applications, anything). No host dependencies!

Server is down → Switch1 is down → Switch2 is down

WEB App is down → MySQL is not responsive → No free disk space on /tmp
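
A minimal sketch of the idea, assuming each arrow above means "depends on" and that a problem is silenced while a problem it directly depends on is active. The `DEPENDS_ON` map and the function name are hypothetical; this is not how Zabbix stores or evaluates dependencies internally.

```python
from typing import Dict, List, Set

# Hypothetical dependency map built from the second chain on the slide:
# each problem lists the problems it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "WEB App is down": ["MySQL is not responsive"],
    "MySQL is not responsive": ["No free disk space on /tmp"],
    "No free disk space on /tmp": [],
}


def problems_to_notify(active: Set[str],
                       depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the active problems that do not depend on another active problem."""
    return {
        problem
        for problem in active
        if not any(parent in active for parent in depends_on.get(problem, []))
    }
```

When all three problems are active, only "No free disk space on /tmp" is reported, so the notification points at the likely root cause.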

Escalations
Different scenarios:
• Delayed notifications
• Repeated notifications
• Execution of commands
• Escalation to other users
• Recovery messages
• Different actions for acknowledged and not acknowledged events

Example (reaction to a failed WEB check):
Increase step every 5 minutes
Step 1-3: Send message to Unix Admins
Step 3-5: Send message to Boss if not ACK
Step 6: Restart Apache if not ACK
Step 7: Reboot server if not ACK
Step 10: Send message to all if not ACK
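
The example escalation can be modelled as a small table of operations. The sketch below is an assumption-laden illustration: the `Operation` type, the function names and the rule that step 1 starts at the moment of the problem are invented for the example, and the actions are plain strings rather than real commands.

```python
from typing import List, NamedTuple


class Operation(NamedTuple):
    first_step: int
    last_step: int
    action: str
    only_if_unacknowledged: bool


STEP_SECONDS = 300  # "Increase step every 5 minutes"

# The steps from the slide, written as data.
OPERATIONS = [
    Operation(1, 3, "Send message to Unix Admins", False),
    Operation(3, 5, "Send message to Boss", True),
    Operation(6, 6, "Restart Apache", True),
    Operation(7, 7, "Reboot server", True),
    Operation(10, 10, "Send message to all", True),
]


def current_step(seconds_since_problem: int) -> int:
    """Step 1 covers the first 5 minutes, step 2 the next 5, and so on."""
    return seconds_since_problem // STEP_SECONDS + 1


def actions_for_step(step: int, acknowledged: bool) -> List[str]:
    """Actions due at `step`, skipping 'if not ACK' operations once acknowledged."""
    return [
        op.action
        for op in OPERATIONS
        if op.first_step <= step <= op.last_step
        and not (op.only_if_unacknowledged and acknowledged)
    ]
```

For instance, ten minutes into an unacknowledged problem (step 3) both the Unix Admins and the Boss are notified; once somebody acknowledges the event, only the Unix Admins message remains.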

Visualisation: Dashboard
Favourite resources:
• Maps
• Graphs
• Screens

High-level view:
• Problems by host group
• Zabbix statistics
• List of the latest issues
• WEB monitoring info
• Auto-discovery

Visualisation: Graphs
Immediate access:
• Any period of time
• Easy time-navigation
• Zooming with two mouse clicks
• Problem conditions displayed
• Non-working time is marked
• Not generated in advance!

Graph types:
• Standard (dots, lines, colors)
• Stacked
• Pie

Visualisation: Screens
Different blocks:
• Graphs
• Maps
• Plain text data
• List of problems
• High level stats

Slide shows:
• Combination of screens
• Displayed one after another

WEB monitoring
Goals:
• Monitoring of user experience
• Support of complex scenarios
• Performance monitoring
• Availability monitoring

Example:
Step 1 Access home page
Step 2 Login (POST, GET)
Step 3 Run report
Step 4 Logout
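
A multi-step scenario like this one can be sketched as a list of steps that are executed in order and timed. Everything below is hypothetical: the `Step` and `StepResult` types, the URLs under `app.example`, and the rule that the scenario stops at the first failed step are assumptions for illustration. The HTTP call is injected as a function so the sketch runs without a network.

```python
import time
from typing import Callable, List, NamedTuple, Optional


class Step(NamedTuple):
    name: str
    method: str
    url: str
    expected_status: int = 200


class StepResult(NamedTuple):
    name: str
    ok: bool
    seconds: float


# (method, url) -> HTTP status code; supplied by the caller.
Fetch = Callable[[str, str], int]

# Hypothetical URLs for the four steps on the slide.
SCENARIO = [
    Step("Access home page", "GET", "http://app.example/"),
    Step("Login", "POST", "http://app.example/login"),
    Step("Run report", "GET", "http://app.example/report"),
    Step("Logout", "GET", "http://app.example/logout"),
]


def run_scenario(steps: List[Step], fetch: Fetch) -> List[StepResult]:
    """Run steps in order, recording availability and response time per step."""
    results: List[StepResult] = []
    for step in steps:
        started = time.monotonic()
        status: Optional[int]
        try:
            status = fetch(step.method, step.url)
        except OSError:
            status = None  # connection-level failure counts as a failed step
        elapsed = time.monotonic() - started
        ok = status == step.expected_status
        results.append(StepResult(step.name, ok, elapsed))
        if not ok:
            break  # assumption: later steps are skipped after a failure
    return results
```

The per-step `ok` flag covers availability monitoring and the per-step `seconds` value covers performance monitoring.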

IT Services
Goals:
• Business level monitoring
• SLA monitoring
• We care about services
• Escalation of problems
• Root cause of the problem

Tree structure based on:
• Dependencies
• Physical location
• Type of service, etc

User management
Authentication:
• Standard: Zabbix database
• LDAP (Active Directory)
• Apache (Kerberos, Unix, etc)

Permissions:
• Depend on user type
• User group level permissions

Also:
• Notifications-only user groups

Extending Zabbix
New Zabbix agent-side check:

UserParameter=mysql.qps,mysqladmin -uroot status|cut -f9 -d":"
UserParameter=sum[*],echo "$1+$2"|bc
Examples: mysql.qps = 456, sum[4,5] = 9
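
How a flexible key such as sum[*] turns into a command can be sketched as follows. This is a simplified model, not the agent's actual parser: the function names are hypothetical, and quoted parameters that contain commas are not handled.

```python
import re
from typing import List, Tuple


def parse_key(key: str) -> Tuple[str, List[str]]:
    """Split 'sum[4,5]' into ('sum', ['4', '5']); a plain key has no parameters."""
    match = re.fullmatch(r"([^\[\]]+)(?:\[(.*)\])?", key)
    if match is None:
        raise ValueError(f"malformed item key: {key!r}")
    name, params = match.group(1), match.group(2)
    if params is None:
        return name, []
    return name, [p.strip() for p in params.split(",")]


def expand_user_parameter(definition: str, key: str) -> str:
    """Return the shell command for `key`, given a 'key,command' definition."""
    defined_key, command = definition.split(",", 1)
    defined_name, defined_params = parse_key(defined_key)
    name, params = parse_key(key)
    if name != defined_name:
        raise KeyError(f"{key!r} does not match {defined_key!r}")
    if defined_params != ["*"]:
        return command  # fixed key: the command is used as written

    def substitute(m: "re.Match[str]") -> str:
        index = int(m.group(1)) - 1
        return params[index] if index < len(params) else ""

    # Flexible key: $1..$9 are replaced by the parameters of the requested key.
    return re.sub(r"\$([1-9])", substitute, command)
```

Here `expand_user_parameter('sum[*],echo "$1+$2"|bc', 'sum[4,5]')` yields `echo "4+5"|bc`; the sketch only builds the command string and does not execute it.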

New notification methods:
• Just a matter of writing a shell script (voice generation, Skype call, anything)

New server side checks:
• Just a matter of writing a shell script

Monitoring of large environments

Our environment
Situation:
• Several thousands of servers and network devices
• Distributed across 2-100 data centers or branches
• Centralised monitoring is required

Zabbix: several approaches
1 Server:
• One Zabbix server does everything

1 Server, Many Proxies:
• One Zabbix server
• One Proxy per data center or company branch

Distributed:
• One Zabbix server per data center
• More effort to maintain
• Can be used with Proxies

What is Proxy?
Proxy is a data collector. It is also used for auto-discovery.

Advantages:
• Makes architecture easier
• Does not require significant resources
• Offloads Zabbix server

Proxy: how does it work?
Management:
• Data collection only
• Fully managed via WEB front-end
• Configuration is stored on the Zabbix server side
• All connections are initiated by Proxy
• Collection of thousands of values per second

Connection loss processing:
• Data is buffered in the Proxy database
• Will be sent on connection recovery
• No notifications about local problems!
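
The buffering behaviour during a connection loss can be illustrated with a small store-and-forward sketch. It is not the Proxy implementation: the `ProxyBuffer` class, the table layout and the `send` callback are assumptions, and an in-memory SQLite database stands in for the Proxy database.

```python
import sqlite3
from typing import Callable, List, Tuple

Value = Tuple[str, str, int, str]  # host, item key, timestamp, value


class ProxyBuffer:
    """Keeps collected values locally until the server accepts them."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE buffer ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "host TEXT, item_key TEXT, clock INTEGER, value TEXT)"
        )

    def collect(self, host: str, item_key: str, clock: int, value: str) -> None:
        self.db.execute(
            "INSERT INTO buffer (host, item_key, clock, value) VALUES (?, ?, ?, ?)",
            (host, item_key, clock, value),
        )
        self.db.commit()

    def pending(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM buffer").fetchone()[0]

    def flush(self, send: Callable[[List[Value]], bool]) -> int:
        """Try to send everything; delete only what the server accepted."""
        rows = self.db.execute(
            "SELECT id, host, item_key, clock, value FROM buffer ORDER BY id"
        ).fetchall()
        if not rows:
            return 0
        if not send([row[1:] for row in rows]):
            return 0  # connection lost: keep the data for the next attempt
        self.db.execute("DELETE FROM buffer WHERE id <= ?", (rows[-1][0],))
        self.db.commit()
        return len(rows)
```

The proxy side calls `flush`, which matches the point that all connections are initiated by the Proxy; a failed `send` leaves the buffer untouched so nothing is lost while the link is down.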

Distributed monitoring
Basic attributes:
• Tree-like structure
• A node is a Zabbix server
• Nodes are platform independent

Management:
• Two-way replication of configuration
• Parent node controls child nodes

Processing of connection loss
What will stop working?
• Data sending to parent node
• Synchronisation of configuration

Everything else will keep working!

Thousands of devices: solutions
Problems and solutions:
• Huge data volume: use database partitions for historical data
• Integration with existing systems: LDAP authentication, notification
methods to open tickets, XML import/export for configuration
management and inventory
• Maintenance: templates, mass updates
• Upgrades: all Zabbix components are compatible within one major
release 1.6.x

Choice of the best schema
Depends on the requirements:
• Local administration
• Full-featured monitoring when no connection between data centers
(branches)
[Diagram: adoption path. Labels: Adopt Open Source; Getting used to Zabbix; 1 Server; Adding Proxies; 1 Server, Many Proxies; Distributed monitoring; Distributed]

General directions
General:
• Better integration
• REST API/RPC
• Better scalability

GUI:
• Flexible Dashboard
• Personalization (widgets)

Other:
• Infrastructure for widgets
• Business level monitoring

Questions?
I am around today and tomorrow!
