Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Oracle


Scaling monitoring in a large environment is challenging in several respects: customization of monitors, performance, and reporting. This presentation shares our experience at Dell monitoring a large number of servers in an environment that changes constantly, with many custom monitors and new servers configured every week. Drawing on three years of experience with Zabbix and Oracle, we present the positive and negative lessons from the configuration choices we made, which involved heavy use of User Macros, optimization of database queries, table partitioning, and automation.

Published in: Software

  1. Monitoring Challenges on Large Environment using Oracle
     Rodrigo Mohr - Dell IT
  2. Size of the Problem
  3. Operations
     • Monitored Hosts: 4.5k+ (Items: 75k+, Triggers: 30k+, Host Groups: 400+)
     • Values per Second: 297 (standard interval of 5 minutes)
     • Active Users: 24 avg (Registered: 200+)
     • New events: 500+/day (Incidents: 250+, Peak: 6k+, False positive: 10%)
     • New maintenances: 50+/day (created ad hoc and via automated process)
     • Configuration Changes: 200+/month (Monitors Created: 1k+, Monitors Updated: 500+, Monitors Removed: 200+)
  4. Background
  5. Biography
     Bachelor's Degree in Computer Science
  6. Zabbix Infrastructure at Dell IT
     Users
  7. Our Processes
     • Server Monitoring: Windows and Linux servers
     • Database Monitoring: Oracle and SQL databases
     • ITIL Process: Incident Mgmt, Change Mgmt, Request Mgmt, …
     • Zabbix Admin Team: focused on setup of monitoring
     • Monitoring Team (L1): focused on watching Monitoring Incidents
     • Application Team (L2): escalation for L1, define monitoring requirements
     • Custom Monitors: identified by L2, created by Zabbix Admin
     • Baseline Monitors: identified and created by Zabbix Admin
  8. Global Team
     • Brazil: Zabbix Admin, Monitoring L1, Application L2, Developers L3
     • United States: Application L2
     • India: Application L2, Developers L3
     • Malaysia: Zabbix Admin, Monitoring L1
  9. Challenges and Our Approach
  10. Main Challenges
     • Environment Maintenances
       – Frequent changes in the environment being monitored
       – Issues caused by changes
     • Performance
       – Oracle Database
       – Large environment
     • Configuration Updates
       – Constant changes on monitored items
     • Reporting
  11. Our Approach - 1) Performance: Table Partitioning
     • Partitioned tables (new column: DATE_COL): HISTORY, HISTORY_LOG, HISTORY_STR, HISTORY_TEXT, HISTORY_UINT
     • Faster queries in History
       – Daily partition
       – Daily cleanup job (deletes old partition)
     • Pros
       – Keeps size of tables under control
       – Reduces housekeeping effort
     • Cons
       – SELECT queries do not benefit from partitioning
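The daily cleanup job on this slide can be pictured roughly as follows. This is only a sketch under stated assumptions: the partition naming scheme (`PYYYYMMDD`), the retention window, and the helper names are hypothetical and do not come from the slides; only the table names and the idea of dropping old daily partitions do. The script builds Oracle `ALTER TABLE ... DROP PARTITION` statements as strings and does not connect to any database.

```python
from datetime import date, timedelta

# Zabbix history tables named on the slide.
HISTORY_TABLES = ["history", "history_log", "history_str", "history_text", "history_uint"]


def partition_name(day: date) -> str:
    """Hypothetical naming scheme: one partition per calendar day, e.g. P20160125."""
    return f"P{day:%Y%m%d}"


def expired_partitions(existing: list[date], today: date, keep_days: int) -> list[date]:
    """Return the partition days that fall outside the retention window, oldest first."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in existing if d < cutoff)


def cleanup_statements(existing: list[date], today: date, keep_days: int) -> list[str]:
    """Build one DROP PARTITION statement per expired day and history table."""
    return [
        f"ALTER TABLE {table} DROP PARTITION {partition_name(day)}"
        for day in expired_partitions(existing, today, keep_days)
        for table in HISTORY_TABLES
    ]


if __name__ == "__main__":
    today = date(2016, 1, 25)
    existing = [today - timedelta(days=n) for n in range(10)]
    for stmt in cleanup_statements(existing, today, keep_days=7):
        print(stmt)
```

Dropping a whole partition is a metadata operation rather than a row-by-row delete, which is consistent with the "reduces housekeeping effort" point on the slide.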
  12. Our Approach - 1) Performance: Query Optimization
     • Identify top offending queries
     • Debug mode in Zabbix frontend
     • SQL profiling tool inside Web servers
     • DBA Analytics
     • Optimize queries in code
     • Create new index
     • Apply SQL Profile
  13. Our Approach - 1) Performance: Query Optimization
     • Web Servers
       – File: /var/www/html/include/
       – Function: Dbselect
     • Queries
       – Last value from history with clock filter
         OLD: SELECT * FROM (SELECT * FROM history_uint h WHERE h.itemid='152604' AND h.clock>1453661848 ORDER BY h.clock DESC) WHERE rownum BETWEEN 0 AND 1
         NEW: SELECT * FROM history_uint h WHERE h.itemid='152604' AND h.clock>1453661848 AND h.clock = (SELECT MAX(h.clock) FROM history_uint h WHERE h.itemid='152604' AND h.clock>1453661848)
       – Last value from history
         OLD: SELECT * FROM (SELECT * FROM history_uint h WHERE h.itemid='137781' ORDER BY h.clock DESC) WHERE rownum BETWEEN 0 AND 1
         NEW: SELECT * FROM history_uint h WHERE h.itemid='137781' AND h.clock = (SELECT MAX(h.clock) FROM history_uint h WHERE h.itemid='137781')
     • Improvement
       – Execution time (avg): 0.9s (old) vs. 0.001s (new)
       – Hourly runs: 300k+
       – Hourly savings: 75h (parallel executions), i.e. 300k runs × ~0.9s saved per run
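To see that the two query shapes on this slide return the same "last value", here is a small sketch using Python's built-in `sqlite3` module instead of Oracle. It rests on several assumptions: the table is reduced to three columns, the sample rows are made up, and `LIMIT 1` stands in for Oracle's `rownum` filter. It says nothing about Oracle execution plans or timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the Zabbix history_uint table (assumed columns only).
conn.execute("CREATE TABLE history_uint (itemid INTEGER, clock INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO history_uint VALUES (?, ?, ?)",
    [
        (137781, 1453661800, 10),
        (137781, 1453661860, 11),
        (137781, 1453661920, 12),
        (152604, 1453661900, 7),
    ],
)

# OLD shape: sort every matching row, then keep the first one.
OLD = """SELECT itemid, clock, value FROM history_uint h
         WHERE h.itemid = ? ORDER BY h.clock DESC LIMIT 1"""

# NEW shape: look up the maximum clock first, then fetch the row(s) at that clock.
NEW = """SELECT itemid, clock, value FROM history_uint h
         WHERE h.itemid = ?
           AND h.clock = (SELECT MAX(h2.clock) FROM history_uint h2 WHERE h2.itemid = ?)"""


def last_value_old(itemid):
    return conn.execute(OLD, (itemid,)).fetchall()


def last_value_new(itemid):
    return conn.execute(NEW, (itemid, itemid)).fetchall()


if __name__ == "__main__":
    print(last_value_old(137781))
    print(last_value_new(137781))
```

One difference worth keeping in mind: if two values for the same item share the same `clock`, the MAX-based form returns both rows, while the sorted form returns only one.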
  14. Our Approach - 1) Performance: Others
     • .last(0) function
     • Active Proxy
     • Items Not Supported
     • Actions with Delay
     • Passive agents
  15. Our Approach - 2) Configuration updates
     • Baseline Templates
       – Basic monitors, valid for all servers of that type
       – Example: Windows Template with CPU Usage, Memory Usage, Disk Space monitors
       – User Macros to customize thresholds per server
     • Generic Templates
       – Specific types of monitors per template
       – All Items/Triggers are the same, changing only the macro they refer to
       – Example: service_state[{$SVC01}], service_state[{$SVC02}]
       – If a server needs a new monitor, add User Macro, link template and enable Item/Trigger
       – Limited amount of Items (covering 90% of servers)
     • Extended Templates
       – Same concept as the Generic Templates
       – Difference: number of Items/Triggers pre-configured
       – Example: Generic Service Template (7 Items/Triggers, 600+ Hosts) vs. Extended Service Template (20 Items/Triggers, 30+ Hosts)
  16. Our Approach - 2) Configuration updates
     • Generic Template
     • Extended Template
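The generic-template idea (identical items that differ only in the User Macro they reference) can be sketched as below. This is an illustration, not Zabbix code: the function names, the sample service names, and the rule "an item is usable only once its macro is defined" are assumptions made for the example. The precedence shown, with host-level macro values overriding template-level ones, matches how Zabbix resolves user macros.

```python
import re

# User macro syntax: {$NAME}, where NAME uses A-Z, 0-9, underscore and dot.
MACRO = re.compile(r"\{\$[A-Z0-9_.]+\}")


def resolve_macro(name, host_macros, template_macros):
    """Host-level values take precedence over template-level defaults."""
    if name in host_macros:
        return host_macros[name]
    return template_macros.get(name)


def expand_key(key, host_macros, template_macros):
    """Replace each {$MACRO} in an item key; unknown macros are left untouched."""
    def substitute(match):
        value = resolve_macro(match.group(0), host_macros, template_macros)
        return match.group(0) if value is None else value
    return MACRO.sub(substitute, key)


def usable_keys(template_keys, host_macros, template_macros):
    """Keep only the keys whose macros all resolve for this host."""
    keys = []
    for key in template_keys:
        expanded = expand_key(key, host_macros, template_macros)
        if not MACRO.search(expanded):
            keys.append(expanded)
    return keys


if __name__ == "__main__":
    template_keys = [
        "service_state[{$SVC01}]",
        "service_state[{$SVC02}]",
        "service_state[{$SVC03}]",
    ]
    # Hypothetical host that needs two of the pre-configured service monitors.
    host_macros = {"{$SVC01}": "W3SVC", "{$SVC02}": "Spooler"}
    print(usable_keys(template_keys, host_macros, {}))
```

With this layout, adding a monitor to a host means defining one more macro rather than editing the template, which is the workflow the slide describes.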
  17. Our Approach - 3) Environment Maintenances: Automation
     • Zabbix agent issue/installation: to manage thousands of hosts, it's very important to fix agent issues quickly
     • Integration with Change Mgmt Tool: automatically create Maintenance periods when a change is happening, which avoids alerts during code updates
     • Quick fix of common issues: Windows service restart, disk / partition space cleanup and others
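The Change Mgmt integration could, for instance, call the Zabbix API method `maintenance.create` when a change is scheduled. The sketch below only builds the JSON-RPC request body and sends nothing. The change ID format, host IDs, and function name are hypothetical, and the field layout (`hostids`, `timeperiods`, `auth` in the body) follows the API as documented for Zabbix releases of that era; newer releases changed some of these fields, so the API reference for the installed version should be checked. The slides do not say how the actual integration was implemented.

```python
import json
from datetime import datetime, timedelta, timezone


def maintenance_request(change_id, host_ids, start, duration, auth_token, request_id=1):
    """Build a JSON-RPC body for Zabbix `maintenance.create` (nothing is sent here)."""
    since = int(start.timestamp())
    seconds = int(duration.total_seconds())
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            # Hypothetical naming convention tying the maintenance to the change ticket.
            "name": f"Change {change_id}",
            "active_since": since,
            "active_till": since + seconds,
            "hostids": list(host_ids),
            # timeperiod_type 0 is a one-time period covering the whole change window.
            "timeperiods": [
                {"timeperiod_type": 0, "start_date": since, "period": seconds}
            ],
        },
        "auth": auth_token,
        "id": request_id,
    }


if __name__ == "__main__":
    start = datetime(2016, 1, 25, 2, 0, tzinfo=timezone.utc)
    body = maintenance_request("CHG0001", ["10105"], start, timedelta(hours=2), "<token>")
    print(json.dumps(body, indent=2))
```

Creating the maintenance from the change record keeps the window aligned with the approved change, so alerts are suppressed only while the code update is actually running.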
  18. Our Approach - 3) Environment Maintenances: Others
     • Load Balancer Monitor: quickly remove traffic from bad Web Server
     • Oracle Database: monitor corrupted indexes, automate for quick fix
     • Action step delay: wait 30min before sending event to Incident Mgmt tool
  19. Our Approach - 4) Reporting
     • Used
       – Availability Report: extracted weekly by one person; available in shared folder for everyone
       – Inventory Hosts: checking which groups a Host is part of
     • Not Used (92% of users are Zabbix Users, with no access to configuration)
       – IT Services and Maps: manual configuration; too many triggers (30k+), too many hosts (4.5k+), too many logical groups (400+)
  20. The Upshot
  21. Key Achievements
     • Zabbix Reporting: understand environment stability via weekly reporting
     • Avoiding Issues: fix code issues in Non-Production before they go into Production
     • Stability: enable Testers / Developers to use their systems when needed
  22. Wish List
     • Maintenances flexibility
       – More flexible permissions for configuring maintenances
       – Allowing certain user groups to set up maintenances without modifying the configuration of the hosts
     • Dashboards / Reporting
       – More dashboards allowing multi-group filtering
       – Pre-configure report before running it (availability report)
     • User Macros
       – Develop discovery based on User Macros, to enable dynamic setup/removal of the monitors
       – User Macros on Host Groups
     • Templates
       – Associate a template with a Host Group, so that all Hosts inside that group would be linked with that template as well
  23. Main Take Away: Large Environment with Oracle
     • Database partitioning in HISTORY tables
     • User macros are really helpful for managing custom monitors
     • Work with DBA to identify top offending queries, replace them in code if needed
  24. Questions?
  25. Thank you! Keep in touch!