
State of Development - Icinga Meetup Linz August 2019

Talk by Icinga 2 Lead Developer Michael Friedrich at the Icinga meetup on 22nd of August 2019 at OÖ Gesundheitsholding, Goethestraße 89, 4020 Linz.



  1. State of Development – 22nd August 2019, Icinga Meetup Linz
  2. Introduction
  3. Introduction: Michael Friedrich
  4. Michael Friedrich, Chief Evangelist. Responsibilities: Icinga 2 Lead Developer, Community Manager, Vagrant Boxes. Contact: michael.friedrich@icinga.com, @dnsmichi on Twitter. Personal: a taste of Austria, #drageekeksi, #lego & #perryrhodan
  5. Introduction
  6. Introduction: the Icinga Stack. Monitoring: availability, reliability, observability. Log management: Elastic Stack, Graylog. Automation: Director, CfgMgmt support, API. Metrics and analytics: Graphite, Grafana, InfluxDB, OpenTSDB
  7. Icinga 2 Core: scalable infrastructure monitoring. Combine high availability clusters with a distributed setup and you have a best-practice scenario for large and complex environments. Monitoring as code with dynamic configurations.
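The "monitoring as code with dynamic configurations" idea on this slide can be illustrated with a short Icinga 2 DSL fragment. This is a minimal sketch, not from the talk; host name, address and vhost data are made up:

```
// A host object; vars.* entries become custom variables usable in apply rules.
object Host "web01" {
  address = "192.0.2.10"                // example address (TEST-NET-1)
  vars.os = "Linux"
  vars.http_vhosts["example.com"] = { http_uri = "/" }
}

// Apply rules attach services dynamically to all matching hosts.
apply Service "ping4" {
  check_command = "ping4"
  assign where host.address             // every host with an address gets a ping check
}

// "apply for" generates one service per dictionary entry: config from data.
apply Service "http-" for (vhost => config in host.vars.http_vhosts) {
  check_command = "http"
  vars += config
}
```

Adding a host or a vhost entry is enough; the matching services are generated automatically, which is what makes the configuration "dynamic".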
  8. Icinga Director: our configuration and orchestration solution. The Director aims to be the favorite Icinga config deployment tool. It is designed for those who want to automate their configuration deployment and for those who want to grant their “point & click” users easy access to the configuration.
  9. Module for Elasticsearch: keep in touch with all your logs, all the time. The Elasticsearch module for Icinga Web 2 gives you access to this data, embedded in your Icinga Web 2 interface. Custom filters allow you to limit the data that should be displayed, so you can give your users access to certain data types without revealing everything stored in Elasticsearch.
  10. Graphite for Icinga: quick access to your monitoring metrics. Add graphs from your Graphite metrics backend directly into the host/service detail view. This module also provides a new menu section with two general overviews for hosts and services.
  11. Support for vSphere®: analyze your VMware vSphere® infrastructure
  12. Icinga Module for vSphere®: the easiest way to monitor a VMware vSphere® environment. Configure a connection to your vCenter® or ESXi™ host and you're ready to go. This module provides a lot of context, deep insight and great oversight: fast drill-down possibilities, valuable hints and reports.
  13. Integrations: support for leading solutions
  14. Community
  15. Events
  16. Meetups: Germany, Austria, CH, NL, USA, Russia. Thanks Nicolai, Max & Carsten for community building, and thanks Moritz and Thomas for your community investment. More to come all over the world.
  17. Camps – past: Berlin and Atlanta; next: Stockholm, Zurich, Milan; soon: something bigger …
  18. May 13–14, 2020, Amsterdam. Subscribe now and save 20%
  19. Ongoing Projects
  20. Cube
  21. Certificate Monitoring
  22. Business Process
  23. Running projects (01 Reporting, 02 Icinga DB, 03 Integrations), next to feature & bugfix releases for Icinga 2 and Icinga Web 2. Reporting: user feedback from early-adopter releases; PDF templates – our trainee project. Icinga DB: Core: writer feature to Redis; IcingaDB: daemon which syncs Redis & DB; Web: new monitoring module. Integrations: AWS Director import; Graphite; Icingabeat.
  24. Future projects (01 Core, 02 Web, 03 Integrations), to be defined in our strategy workshop – more at OSMC: performance: embedded plugins; DSL formatting; logging capabilities; metrics – plugin API; reporting based on IcingaDB; cloud modules; Director packages & core feature; plugins: Windows; Graphite, InfluxDB fields and tags; notifications, events & incidents.
  25. Icinga 2.11
  26. Icinga 2.11 network stack – rewriting core parts, the story (01 Boost, 02 I/O Engine, 03 HTTP API). Boost: Boost 1.66+ allows the usage of additional libraries for socket/network I/O, thread pools and HTTP servers/clients; package Boost on platforms which don't have it in EPEL/Backports. Status: done. I/O engine: replace the current TLS socket I/O implementation and its custom event handling (poll, epoll) with Boost ASIO; use IoBoundWork and CpuBoundWork thread pools. Status: done. HTTP API: replace custom HTTP handling with Boost ASIO & Boost Beast; use Beast buffers, HTTP verbs and more to turn runtime errors into compile-time errors; replace the HTTP clients (InfluxDB, Elasticsearch, CLI commands, check_nscp_api) with the Boost implementation. Status: done.
  27. Icinga 2.11, more goodness (01 HA & Failover, 02 Configuration, 03 Runtime Objects). HA & failover: feature HA for Elasticsearch, Graphite, InfluxDB, etc.; failover in HA zones; object authority update every 10s (was 30s); DB IDO failover_timeout 30s (was 60s); more logging. Status: done.
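The failover_timeout change mentioned above belongs to the HA-enabled DB IDO feature. A hedged sketch of what such a feature object looks like; credentials are placeholders, the values are the ones named on the slide:

```
// DB IDO feature with HA enabled: in an HA zone only one endpoint
// actively writes, the other takes over after failover_timeout.
object IdoMysqlConnection "ido-mysql" {
  host = "localhost"
  database = "icinga"
  user = "icinga"               // placeholder credentials
  password = "icinga"
  enable_ha = true              // feature HA inside the cluster zone
  failover_timeout = 30s        // 2.11 default, previously 60s
}
```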
  28. Icinga 2 cluster config sync, more goodness (01 HA & Failover, 02 Cluster Config, 03 Runtime Objects). Story: coming from #10000 😜. Tackle existing problems: staged sync, no broken config after restart; don't include deleted zones on startup; deal with race conditions on sync. Status: done.
  29. Icinga 2.11: runtime objects in API config packages (01 HA & Failover, 02 Cluster, 03 Runtime Objects). Story: runtime objects (downtimes, etc.) are missing after a restart (broken config package). Details: uses the _api package internally; the active stage is read from disk every time; race condition: it can be empty; incomplete object file path on disk. Fixes: repair a broken active stage (timer); logs & troubleshooting docs. Status: done (since Friday).
  30. Icinga 2.11 fixes, crashes, and code quality – all done. Crashes & bugs: permission filter API crashes #6874 (ref/NC); logrotate timer crash #6737; replay log not cleared #6932; Windows agent 100% CPU/logging #3029. Quality: JSON library switched from YAJL to nlohmann/json #6684; UTF8 sanitizing #4703; Boost Filesystem for I/O #7102; Boost ASIO thread pool (checks, etc.) #6988.
  31. Icinga 2.11 status in CW 30 – RC week (test, fix, profit). Customer issues: recovery notifications missing on restart (HA paused problem); problem notification after a downtime ends; killed processes on reload, KillMode=mixed. API: TLS v1.2+ & hardened cipher lists. Bugfixes: cluster staging checksums; unstable unit tests.
  32. Icinga 2.11 status in CW 30 – RC week: last minute fixes. Reload handling was broken: systemd kills process groups after reload/stop. CW28 decision: PoC and rewrite – an umbrella process managing main+helper. Bonus: runs in Docker without magic tricks. See oc/19-technical-concepts/#core-reload-handling.
  33. Icinga 2.11 status in CW 30 – RC week: docs. Service Monitoring & Plugin API (our version!); Distributed: s/client/agent/ plus images; Basics: s/custom attribute/custom variable/; command arguments; development docs for trainees; Upgrading: upgrading-icinga-2/.
  34. 2.11 RC feedback (01 Ciphers, 02 Cluster sync, 03 Reload process). Ciphers: add ciphers for non-ECDH support (el7, Windows 2.10, Debian/Ubuntu); we cannot patch older agents immediately; added detailed troubleshooting docs. Cluster sync: binary sync is NOT supported; detect and prevent this on the master with UTF8 sanitizing – the new checksums for config change detection would otherwise result in an “always changed” loop. Reload process: fix logging for systemd errors, config errors are printed again.
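The cipher feedback above concerns the ApiListener feature, which carries the cluster and REST API TLS settings. A hedged sketch; the cipher string is illustrative, not the project's recommended list:

```
// ApiListener controls TLS for the cluster protocol and the REST API.
object ApiListener "api" {
  tls_protocolmin = "TLSv1.2"   // 2.11 raises the minimum protocol version
  // Widen the cipher list if older, non-ECDH-capable agents
  // (el7, Windows 2.10, Debian/Ubuntu) still need to connect.
  cipher_list = "HIGH:MEDIUM:!aNULL:!MD5:!RC4"
}
```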
  35. 2.11 RC feedback: documentation sync (01 Troubleshooting, 02 Agents & more, 03 Technical Concepts).
  36. Downtime Cluster Loop
  37. Downtime cluster loop (01 Analysis, 02 Fix, 03 Tests). Analysis: it is not related to the object version but to object activation/deactivation in HA-enabled cluster zones; it affects all config object create/delete operations. Fix: whenever config::UpdateObject and config::DeleteObject messages are sent, ensure the “origin” handler is passed to config creation/deletion, so that ConfigObject->SetActive() resp. OnActiveChanged doesn't play “return to sender” with the cluster message. Tests: stressed an HA master with a long delay of messages (replay log and live); for a downtime which expires during a reload, ensured that the secondary master processes CREATE/DELETE after the first has finally deleted the object. All tests prove the fix works. Added to 2.11.
  38. Performance: max concurrent checks (01 Concurrent Checks, 02 Spawn Helper, 03 Ideas). Concurrent checks: fork errors with “too many open files”; raise the number of open files (systemd, Icinga); the main process has a pipe stream for the child process output.
  39. Spawn helper: the process spawn helper creates the child processes and waits for events; 4 I/O threads, 1 process. More I/O threads and processes mean more context switches – no real performance gain.
  40. Ideas: Process class with fibers & coroutines for fewer thread context switches, combined with ASIO – PoC in the works. Embedded Perl: subroutines, caching – experimental tests.
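The “too many open files” remedy above usually means raising the limits for the icinga2 service unit. A hedged sketch of a systemd drop-in; the path and value are illustrative:

```
# /etc/systemd/system/icinga2.service.d/limits.conf
[Service]
LimitNOFILE=100000
```

The number of check plugins running in parallel is itself bounded by Icinga 2's `MaxConcurrentChecks` constant (settable in constants.conf), which is the knob the “max concurrent checks” measurements on these slides refer to.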
  41. 2.11 metrics: 1061 commits, 17 contributors, +43450/-27330 lines. Sep 2018: start of the cluster config sync implementation. Oct 2018: feature HA. Feb 2019: network stack PoC by Alexander Klimov. Mar 2019: 2.10.4. Apr 2019: Boost packages by Markus Frosch (includes infra move to GitLab). Apr 2019: Windows wizard improvements by Michael Insel. Apr 2019: ongoing Boost ASIO in features, CLI commands, testing. May 2019: reload deactivates IDO hosts -> requested 2.10.5. May 2019: merge fixes for the broken _api package. May 2019: 2.10.5. Jun 2019: TLS 1.2 & cipher lists. Jun 2019: finish and merge cluster config sync. Jul 2019: rewrite failing unit tests for TPs. Jul 2019: re-send suppressed notifications in HA clusters. Jul 2019: reload would kill plugin processes with systemd, last minute fixes. Jul 2019: renaming in the docs: client->agent, custom attrs->vars. Jul 2019: 2.11.0 RC1. Aug 2019: TLS ciphers for older agents. Aug 2019: refresh Windows agent for 2.11. Aug 2019: deny syncing binaries with the cluster config sync. Aug 2019: fix logs with systemd. Aug 2019: fix cluster downtime loop. Aug 2019: analyse check performance with max concurrent checks.
  42. Thank You @dnsmichi