Monitoring With Prometheus

Prometheus: Monitoring by "Pravin Magdum" from "Crevise". The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.

Monitoring With Prometheus

  1. #DOPPA17 Prometheus: Monitoring. Pravin Magdum, 9th September 2017
  2. Who am I?
       • DevOps evangelist @ Crevise Technology
       • Developer turned DevOps evangelist
       • 9+ years of development and project management experience across various technologies
       • Love resolving technical problems and debugging issues
     Copyright notice (shown on the slides): As the author of this presentation, I/we own the copyright and confirm the originality of the content. I/we allow Agile Testing Alliance to use the content for social media marketing and for publishing it on the ATA blog or ATA social media channels (provided due credit is given to me/us).
  3. What is monitoring, and why do it?
       • Continuously track the status of the system
       • Continuously track deployed applications
       • Get the earliest possible warning of failures, defects, or problems so they can be fixed
       • See trends over time, which helps decide when to upgrade or downgrade infrastructure resources
       • Know when things go wrong
       • If an issue persists, analyse the collected data to debug it and prevent it in future
       • Black box monitoring
       • White box monitoring
  4. Black box monitoring
       • Just like smoke testing
       • Examples: ping, HTTP requests
       • Checks whether the server is up and working
       • When to use: when the system is broken, and to test from outside the network
       • Does not tell you what is going on inside the machine
  5. White box monitoring
       • Complementary to black box monitoring
       • Tells you what is going on inside the system
       • Examples: CPU usage, network usage
  6. What to monitor?
       • First, understand what holds business value for you and your customers
       • Typical metrics: CPU, memory, I/O, storage
       • Application monitoring, so that the application can be run in a cluster depending on these metrics
       • Predict resource utilization to avoid downtime
  7. Prometheus
       • Inspired by Google's Borgmon monitoring system
       • Mainly written in Go; publicly launched in 2015
       • Open source monitoring and alerting system with an active ecosystem
       • Used by Docker, DigitalOcean, and CoreOS, to name a few
  8. Prometheus offers
       • Multi-dimensional data model (time series data)
           – No flat strings like "doppa.pune"
           – Key-value pairs: {event=Doppa, city=pune}
       • Powerful queries that leverage this dimensionality
       • Precise alerting
       • Pull model over HTTP
       • Scalable
       • Dashboards
       • Efficient
           – A single server can handle millions of metrics
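To make the multi-dimensional model concrete, here is a minimal Python sketch. It is not Prometheus code: it only illustrates that a series is identified by a metric name plus key-value labels, and that a query can select or aggregate along any label. The metric name `talk_attendees` and all values are made up for illustration, and only exact-match selection is modeled.

```python
# Hypothetical in-memory model of labelled time series (illustration only).
from typing import Dict, List, Tuple

# (metric name, labels, latest value)
Series = Tuple[str, Dict[str, str], float]

SERIES: List[Series] = [
    ("talk_attendees", {"event": "Doppa", "city": "pune"}, 120.0),
    ("talk_attendees", {"event": "Doppa", "city": "mumbai"}, 80.0),
    ("talk_attendees", {"event": "Other", "city": "pune"}, 40.0),
]


def select(name: str, **matchers: str) -> List[Series]:
    """Return series with the given name whose labels equal every matcher."""
    return [
        s for s in SERIES
        if s[0] == name and all(s[1].get(k) == v for k, v in matchers.items())
    ]


def sum_by(series: List[Series], label: str) -> Dict[str, float]:
    """Sum values per distinct value of one label, similar in spirit to `sum by (label)`."""
    totals: Dict[str, float] = {}
    for _, labels, value in series:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0.0) + value
    return totals
```

With a flat name such as "doppa.pune", slicing by city alone would need string parsing; with labels, `select("talk_attendees", city="pune")` is enough.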
  9. Let's understand with a simple diagram
  10. Components
       • Prometheus server: scrapes and stores time series data
       • Exporters: expose metrics from resources
       • Alert rules: conditions, written as queries, that define when an alert fires
       • Alertmanager: notifies about alerts on different communication channels
  11. Powerful queries
       • Can multiply, join, add, aggregate, and predict in the same query
       • Can evaluate current as well as historical data
       • Examples:
           – Which are the top 3 services consuming the most CPU, or more than 80%?
           – Will my storage fill up in the next 4 hours?
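The "will my storage fill up in the next 4 hours?" question is a linear extrapolation. The sketch below shows the idea in plain Python with a least-squares fit, in the spirit of PromQL's `predict_linear` function. It is an illustration under simplifying assumptions (it extrapolates from the last sample's timestamp, and the function names and sample data are hypothetical), not Prometheus's implementation.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it seconds_ahead past the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must differ")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


def will_fill(free_bytes_samples: List[Tuple[float, float]], hours: float = 4.0) -> bool:
    """True if the fitted trend predicts zero or less free space within `hours`."""
    return predict_linear(free_bytes_samples, hours * 3600) <= 0
```

For example, free space dropping by 200 units per hour from 1000 would be predicted to run out within 4 hours of the last sample, but not within 1 hour.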
  12. Some query examples
       • CPU usage (%):
           100 - (avg by (instance) (irate(node_cpu{instance="node1:9100",job="node",mode="idle"}[1m])) * 100)
       • Memory used (bytes):
           node_memory_MemTotal{job="node",instance="node1:9100"} - node_memory_MemFree{job="node",instance="node1:9100"} - node_memory_Buffers{job="node",instance="node1:9100"} - node_memory_Cached{job="node",instance="node1:9100"}
       • Disk write rate (KiB/s):
           irate(node_disk_bytes_written[60s]) / 1024
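Queries like these can also be sent to Prometheus over HTTP, since the server answers instant queries at `/api/v1/query`. The sketch below only builds the request URL for the CPU query; the helper name `instant_query_url` is hypothetical and the server address is an assumption taken from the localhost examples in this talk.

```python
from urllib.parse import urlencode

# CPU usage query from the slide, without the instance filter.
CPU_QUERY = (
    '100 - (avg by (instance) '
    '(irate(node_cpu{job="node",mode="idle"}[1m])) * 100)'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query, URL-encoding the PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
```

Fetching `instant_query_url("http://localhost:9090", CPU_QUERY)` with any HTTP client should return a JSON document, provided a server is actually listening at that address.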
  13. Out-of-the-box feature: the textfile collector
       • The textfile collector is similar to the Pushgateway in that it allows exporting statistics from batch jobs and shell scripts
       • Useful for metrics that node-exporter does not export itself
       • You can still get such metrics into Prometheus with the help of the textfile collector
       • Produce output that is compatible with the Prometheus text output format
       • Alternatively, write your own exporters to feed Prometheus
       • ./node_exporter --collector.textfile.directory=Metrics
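As a rough illustration of feeding the textfile collector, the sketch below renders one gauge in the Prometheus text format and writes it to a `.prom` file in the collector directory. The write goes through a temporary file and a rename so that node_exporter never reads a half-written file. The function names, the metric name, and the label are hypothetical; the directory would be whatever was passed to `--collector.textfile.directory`.

```python
import os
import tempfile
from typing import Dict, Optional


def _escape(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, value: float,
                 labels: Optional[Dict[str, str]] = None) -> str:
    """Render a single gauge sample with HELP and TYPE lines."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


def write_prom_file(directory: str, basename: str, text: str) -> str:
    """Write text to <directory>/<basename>.prom via a temp file and rename."""
    final_path = os.path.join(directory, basename + ".prom")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp_path, 0o644)  # let node_exporter read it when run as another user
    os.replace(tmp_path, final_path)
    return final_path
```

A nightly batch job could call these two functions as its last step to record, say, the time of its last successful run.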
  14. Running Prometheus
       • Download Prometheus from https://prometheus.io/download/#prometheus
       • Extract and run. Done.
       • Open http://localhost:9090
       • Let's see it in action
  15. Node Exporter installation
       • Again, two steps
       • Go to https://prometheus.io/download/#node_exporter and download Node Exporter
       • Extract and run. Done.
       • Exports metrics on port 9100
       • Open http://localhost:9100
  16. Configuration
       • It's time to tell Prometheus to pull metrics from Node Exporter
       • Edit prometheus.yml, the configuration file for Prometheus
       • scrape_interval: 15 seconds
       • scrape_configs: what we are scraping
       • targets: IPs/hostnames of the nodes to monitor
       • labels: logical grouping of hosts
  17. Sample configuration file (prometheus.yml)
        global:
          scrape_interval: 15s
        rule_files:
          - 'CriticalAlert.Rules'
        scrape_configs:
          - job_name: node
            static_configs:
              - labels:
                  group: 'QA-Env'
                targets:
                  - "IP:9100"
  18. Reload Prometheus
        curl -X POST http://localhost:9090/-/reload
        # The curl command above reloads the Prometheus server with the new configuration, without a restart.
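The same reload can be triggered from a deployment script. The sketch below is a hedged Python equivalent of the curl command: the function names are hypothetical and the default address is the localhost one used in this talk. Note that on Prometheus 2.x the endpoint only responds if the server was started with `--web.enable-lifecycle`.

```python
import urllib.error
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build an empty-bodied POST request to the /-/reload endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/-/reload", data=b"", method="POST"
    )


def reload_prometheus(base_url: str = "http://localhost:9090", timeout: float = 5.0) -> bool:
    """Ask Prometheus to reload its configuration; True on HTTP 200."""
    try:
        with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server unreachable, endpoint disabled, or the new configuration was rejected.
        return False
```

Returning False rather than raising keeps the caller simple; a stricter script might prefer to log the error details.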
  19. Alert rules and Alertmanager
       • Define alert rules with powerful Prometheus queries
       • Predict linear changes on nodes
       • Send alerts to the communication channel of your choice, e.g. Slack, PagerDuty, email, SMS
  20. Setting up Alertmanager
       • Alertmanager can be configured to send Prometheus alerts to your mailbox or Slack, or to trigger automated calls in critical situations
       • Download Alertmanager from https://prometheus.io/download/#alertmanager
       • Extract and configure
       • Run
  21. Alert rule - is the instance up and running?
        ALERT InstanceDown
          IF up == 0
          FOR 10m
          LABELS { severity = "CRITICAL" }
          ANNOTATIONS {
            summary = "Instance down",
            description = "{{ $labels.group }}-{{ $labels.instance }} - instance has been down for more than 10 minutes."
          }
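The FOR clause means the condition has to hold continuously for the given duration before the alert fires; until then the alert is only pending. The sketch below models that behaviour in Python as a rough illustration. The function name and the evaluation timestamps are hypothetical, and real rule evaluation in Prometheus has more details than this.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(evaluations: Iterable[Tuple[float, bool]], for_seconds: float) -> List[str]:
    """Map (timestamp, condition_is_true) evaluations to alert states.

    The alert is 'pending' while the condition has been true for less than
    for_seconds, 'firing' once it has held that long, and 'inactive' as
    soon as the condition is false, which also resets the timer.
    """
    states: List[str] = []
    active_since: Optional[float] = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states
```

With `FOR 10m`, an instance that is down for five minutes and then recovers never reaches the firing state, which is what keeps short blips from paging anyone.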
  22. Alert rule - is CPU usage going beyond 75%?
        ALERT NodeCPUUsage
          IF (100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[1m])) * 100)) > 75
          FOR 2m
          LABELS { severity = "CRITICAL" }
          ANNOTATIONS {
            summary = "{{ $labels.group }}-{{ $labels.instance }}: High CPU usage detected",
            description = "CPU usage is above 75% (current value is: {{ $value }})"
          }
  23. Alert rule - is storage more than 80% full?
        ALERT filesystem_threshold_exceeded
          IF 100 * (1 - (node_filesystem_free{mountpoint="/"} / node_filesystem_size{mountpoint="/"})) > 80
          LABELS { severity = "CRITICAL" }
          ANNOTATIONS {
            summary = "{{ $labels.group }}-{{ $labels.instance }} High filesystem usage is detected",
            description = "This device's filesystem usage has exceeded the threshold with a value of {{ $value }}.",
          }
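The expression in this rule is simple arithmetic: the used fraction is one minus free over size, scaled to a percentage. A small Python sketch of the same calculation follows, with hypothetical function names and made-up byte counts.

```python
def filesystem_used_percent(free_bytes: float, size_bytes: float) -> float:
    """Percentage of the filesystem in use: 100 * (1 - free / size)."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100 * (1 - free_bytes / size_bytes)


def exceeds_threshold(free_bytes: float, size_bytes: float, threshold: float = 80.0) -> bool:
    """True when usage is strictly above the threshold, matching the rule's `> 80`."""
    return filesystem_used_percent(free_bytes, size_bytes) > threshold
```

For instance, 15 GiB free on a 100 GiB root filesystem is 85% used, so the alert condition would hold.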
  24. AlertManager.yml - Alertmanager configuration file
        global:
        route:
          repeat_interval: 4h
          routes:
            - receiver: email-QA
              match:
                group: 'QA-trad'
            - receiver: email-prod
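The routing block can be read as "try the child routes in order and use the first one whose matchers all equal the alert's labels". The Python sketch below illustrates that reading for the two routes on this slide. It is a simplification, not Alertmanager's code: it ignores nested routes and the `continue` option, and the `default` fallback receiver is an assumption, since the slide does not show a top-level receiver.

```python
from typing import Dict, List, Optional

# The two child routes from the slide; an empty match accepts every alert.
ROUTES: List[Dict] = [
    {"receiver": "email-QA", "match": {"group": "QA-trad"}},
    {"receiver": "email-prod", "match": {}},
]


def pick_receiver(alert_labels: Dict[str, str],
                  routes: Optional[List[Dict]] = None,
                  default: str = "email-prod") -> str:
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for route in (ROUTES if routes is None else routes):
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default  # hypothetical fallback when no child route matches
```

An alert labelled `group="QA-trad"` goes to the QA mailbox; anything else falls through to the production receiver.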
  25. Receivers
        receivers:
          - name: "email-prod"
            email_configs:
              - to: 'Prodsupport@crevise.com'
                from: 'no-reply@crevise.com'
                smarthost: 'smtp.office365.com:587'
                auth_username: 'no-reply@crevise.com'
                auth_identity: 'no-reply@crevise.com'
                auth_password: 'fXXXX'
  26. Slack alerts
  27. Email alerts
  28. Set up Grafana
       • Download the tar archive from the official site, extract it, and run the binary
       • Check the URL http://<Server-IP>:3000
       • Add Prometheus as a data source:
           – Go to http://<Server-IP>:3000
           – Enter the username admin and password admin, then click "Log In"
           – Click "Data Sources" in the left menu
           – Click "Add new" in the top menu
  29. Grafana dashboard
  30. Thank you! Questions? Reachable at pravin.magdum@crevise.com, Twitter: @pravin_magdum
