
Regain Control Thanks To Prometheus


At the French FedEx company, we used Prometheus to monitor the infrastructure, which hosts a CQRS architecture built with Kafka, Spark, Cassandra, Elasticsearch, and microservice APIs written in Scala.

This presentation is about using Prometheus in production: you will see why we chose Prometheus, how we integrated and configured it, and what kinds of insights we extracted from the whole infrastructure.

In addition, you will see how Prometheus changed our way of working, how we implemented self-healing based on Prometheus, how we configured systemd to trigger the AlertManager API, the integration with Slack, and other cool stuff.



  1. Regain control thanks to Prometheus
     Guillaume Lefevre, DevOps Engineer, OCTO Technology
     Etienne Coutaud, DevOps Engineer, OCTO Technology
  2. About us
     Guillaume Lefevre, DevOps Engineer, OCTO Technology, @guillaumelfv
     Etienne Coutaud, DevOps Engineer, OCTO Technology, @etiennecoutaud
  3. What we are dealing with
     • Track every event of a package's lifecycle
     • Aggregate and compute events to generate relevant business data
     • Store and serve results for consultation or further computing
  4. What we are dealing with
     • 2,000,000 packages / day
     • 20,000,000 events / day
     • 200 processes running 24/7
  5. CQRS Architecture
     [diagram: package Flashcode events flow into Kafka, are processed by Spark Streaming into Cassandra and ElasticSearch, and are served to final users through Scala front APIs]
     • ~90 servers
     • Whole configuration managed with Zookeeper
  6. Monitoring system metrics
     [diagram: the same architecture with node_exporter deployed on every server, scraped by Prometheus]
  7. Monitoring middleware metrics
     [diagram: third-party exporters added on top of node_exporter: JMX exporters for Kafka, Spark, and Cassandra, HAproxy exporters, an Elasticsearch exporter, a Zookeeper exporter, and a homemade exporter]
  8. Monitoring business metrics
     [diagram: in addition to system and middleware exporters, the Scala applications expose business metrics through the Prometheus client library]
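     The deck doesn't include the instrumentation code behind these business metrics; below is a minimal, illustrative sketch of what exposing a business counter with the Prometheus JVM client library (simpleclient) from a Scala service can look like. The object, metric name, and port are assumptions, not the project's actual code:

         // Hypothetical sketch: expose a business metric from a Scala service
         import io.prometheus.client.Counter
         import io.prometheus.client.exporter.HTTPServer

         object BusinessMetrics {
           // Counts package lifecycle events, labeled by event type
           val packageEvents: Counter = Counter.build()
             .name("package_events_total")
             .help("Package lifecycle events processed")
             .labelNames("event_type")
             .register()

           def main(args: Array[String]): Unit = {
             new HTTPServer(9400)                     // serve /metrics for Prometheus to scrape
             packageEvents.labels("delivered").inc() // increment wherever the business event occurs
           }
         }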
  9. Full stack
     [diagram: Prometheus scrapes all targets (node_exporter, third-party exporters, client libraries); Grafana provides dashboards and AlertManager routes alerts to mail and Slack]
  10. Data volume
      • ~250 endpoints scraped
      • ~15,000 different metrics
      • ~50 GB data / day
  11. Enabling metrics: node_exporter

          $ ./node_exporter --web.listen-address 0.0.0.0:9100 --collector.systemd

      Wrapped with systemd. Have a look at the "disabled by default" collectors: https://github.com/prometheus/node_exporter
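      The unit file itself isn't shown in the deck; a minimal sketch of what wrapping node_exporter with systemd might look like. Paths, the unit names, and the OnFailure hook are assumptions (the abstract mentions configuring systemd to trigger the AlertManager API):

          # Hypothetical /etc/systemd/system/node_exporter.service
          [Unit]
          Description=Prometheus node_exporter
          After=network.target
          # One way to have systemd call AlertManager when the unit fails:
          OnFailure=notify-alertmanager@%n.service

          [Service]
          ExecStart=/usr/local/bin/node_exporter \
              --web.listen-address 0.0.0.0:9100 \
              --collector.systemd
          Restart=on-failure

          [Install]
          WantedBy=multi-user.target

      The notify-alertmanager@.service unit could then POST directly to the AlertManager v1 API, for example:

          # Illustrative: raise an alert for the failed unit by hand
          curl -XPOST http://alertmanager:9093/api/v1/alerts -d \
            '[{"labels": {"alertname": "UnitFailed", "unit": "node_exporter.service", "severity": "high"}}]'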
  12. Enabling metrics: JMX
      Use the exporter as a Java agent at JVM startup:

          KAFKA_OPTS="-javaagent:/var/lib/jmx-exporter/jmx_prometheus_javaagent.jar=9999:/var/lib/jmx-exporter/kafka.yml"

      [diagram: the Kafka JVM exposes JMX beans; the embedded JMX exporter serves them over HTTP on port 9999 for Prometheus to scrape]
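      The kafka.yml mapping file is not shown in the deck; an illustrative fragment of what a jmx_exporter config for Kafka can look like (this rule is a common public example, not necessarily the one used here):

          # Illustrative jmx_exporter config (kafka.yml)
          lowercaseOutputName: true
          rules:
            # Map Kafka broker topic MBean counts to Prometheus counters
            - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+)><>Count
              name: kafka_server_brokertopicmetrics_$1_total
              type: COUNTER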
  13. Enabling metrics: Scala app
      Use the Scala library to define metrics and inject them into the Spark context:

          // Inject the metrics system into the Spark context at startup
          object InjectorReacApp extends SparkApp({ (ssc: StreamingContext) =>
            ReacUserMetricsSystem.load("InjectorReacMetrics")(ssc.sparkContext)
            // ...
          })

          // Define application counters with the Scala client library
          object SparkAppMetrics extends ContextLogger {
            private lazy val handledXMLCounter: SparkCounter        = ReacUserMetricsSystem.counter("handledXMLCounter")
            private lazy val measuredXMLCounter: SparkCounter       = ReacUserMetricsSystem.counter("measuredXMLCounter")
            private lazy val handledMessagesCounter: SparkCounter   = ReacUserMetricsSystem.counter("handledMessagesCounter")
            private lazy val rejectedMessagesCounter: SparkCounter  = ReacUserMetricsSystem.counter("rejectedMessagesCounter")
            private lazy val succeededMessagesCounter: SparkCounter = ReacUserMetricsSystem.counter("succeededMessagesCounter")
          }
  14. Self-healing using AlertManager + Jenkins
      Alert rule in the Prometheus server:

          ALERT SparkWorkerDataFull
            IF node_filesystem_free{job='spark_system_metrics'} / node_filesystem_size{job='spark_system_metrics'} < 0.05
            FOR 5m
            LABELS {
              severity = "high",
              environment = "production",
              selfhealing = "true",
              jenkins_job = "clean_data_worker"
            }
            ANNOTATIONS {
              summary = "Disk /data on {{ $labels.instance }} 95% full"
            }

      AlertManager config file:

          route:
            routes:
              - match:
                  selfhealing: "true"
                  jenkins_job: clean_data_worker
                receiver: jenkins-selfhealing-clean_data_worker
          receivers:
            - name: jenkins-selfhealing-clean_data_worker
              webhook_configs:
                - url: "http://jenkins:8080/job/production/clean_data_worker/build"
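      To sanity-check the wiring, the webhook can be exercised by hand: AlertManager POSTs a JSON payload to the URL above, and Jenkins' parameterless /build endpoint only needs the POST itself. The authentication token below is an assumption about the local Jenkins setup:

          # Hypothetical manual test of the self-healing webhook
          curl -X POST "http://jenkins:8080/job/production/clean_data_worker/build?token=SELFHEAL_TOKEN"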
  15. Our favorite Prometheus queries
      • topk: gets the largest K elements by sample value; useful to single out the worst components from the whole metric set.

            topk(3, (1 - node_memory_MemFree / node_memory_MemTotal))

        Example: get the 3 servers with the least free RAM percentage.
      • predict_linear: predicts a value with a simple linear regression; useful to extrapolate disk filling.

            predict_linear(node_filesystem_free{mountpoint='/data'}[1h], 6 * 3600) < 0

        Example: predict which filesystems will be full in less than 6h.
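      If a query like the topk above backs many dashboards, a recording rule can precompute it; a sketch in the Prometheus 1.x rule syntax in use at the time (the rule name is illustrative):

          # Hypothetical recording rule: precompute the memory-usage ratio
          instance:memory_usage:ratio = 1 - node_memory_MemFree / node_memory_MemTotal

          # The dashboard query then becomes simply:
          topk(3, instance:memory_usage:ratio)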
  16. Demo time
      Code available at: https://github.com/EtienneCoutaud/kubecon2017-demo.git
      [diagram: an app injects events into Kafka; Prometheus reads system, middleware, and service metrics; Grafana, AlertManager, Jenkins, and Slack close the loop]
  17. How do ops use it?
      • Self-healing & alerting help you focus on delivering more value
      • Predict failures and correct them to prevent outages and business loss
      • Faster troubleshooting and root-cause pinpointing
      • Tune and improve middleware configuration parameters
      • Discover bottlenecks and improve overall platform performance
  18. How do devs use it?
      • Easier performance testing and benchmarking
      • Brings visibility and confidence into the platform
      • Gives top management a comprehensible view of the platform & KPIs
      • Allows developers to add their own metrics, dashboards, and alerts at the application level
  19. Tip: never stop tweaking
      Continuous improvement is the best way to:
      • Choose the right metrics
      • Choose accurate thresholds
      • Prevent alert fatigue
  20. Tip: mind your NTP
      • Ensure all your applications, servers, and the monitoring platform are in sync
      • Server and browser clocks must not be too far apart
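      Clock drift can itself be watched from Prometheus; a sketch of an alert in the 1.x syntax of slide 14, assuming a node_exporter version that exposes node_time_seconds (older versions name it node_time) and a Prometheus version that provides the timestamp() function:

          # Illustrative: compare each node's clock at scrape time with the
          # Prometheus server's clock (timestamp() returns when the sample was scraped)
          ALERT ClockDrift
            IF abs(node_time_seconds - timestamp(node_time_seconds)) > 1
            FOR 5m
            LABELS { severity = "warning" }
            ANNOTATIONS {
              summary = "Clock on {{ $labels.instance }} is more than 1s away from the Prometheus server"
            }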
  21. Trick: scraping-interval tradeoff
      Choose wisely based on your business requirements!
      • Low scraping interval (5s): almost real-time, but CPU-intensive (grows with the number of endpoints) and needs a robust Prometheus server or federation
      • Higher scraping interval (15s): metrics "lag" effect, but uses fewer server resources and sustains more endpoints
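      This tradeoff translates directly into configuration via per-job overrides of the global interval; a sketch (the job name and targets are invented for illustration):

          # prometheus.yml (illustrative)
          global:
            scrape_interval: 15s            # default: cheaper, slight lag
          scrape_configs:
            - job_name: 'spark_system_metrics'
              scrape_interval: 5s           # near real-time for the hot path
              static_configs:
                - targets: ['spark-worker-1:9100', 'spark-worker-2:9100']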
  22. Golden tip: work with your users!
  23. What's next?
      • Upgrade to 2.0 to improve performance
      • Explore advanced features (federation, HA & third-party storage)
      • Improve trigger precision and alarms
      • Set up dashboard drill-down for faster investigation
      • Automate incident responses, because we are lazy
      • Monitor the underlying infrastructure
  24. Take-away
      • Full-stack monitoring
      • Fast to deploy, long to master
      • Share it with everybody / provide feedback to everyone
      • It changes the way you work
      • Prometheus has many different exporters; you will surely find yours!
      • The exposition format makes it easy to build a custom exporter
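      For reference, the exposition format mentioned in that last point is plain text over HTTP, which is why custom exporters are easy to write; a minimal hand-written payload (metric name and value are illustrative):

          # HELP package_events_total Package lifecycle events processed
          # TYPE package_events_total counter
          package_events_total{event_type="delivered"} 1027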
