2. #DOPPA17
As an author of this presentation I/we own the copyright and confirm the originality of the content. I/we allow Agile Testing Alliance to use the content for social media
marketing, publishing it on the ATA Blog or ATA social media channels (provided due credit is given to me/us).
Who Am I?
• DevOps evangelist @ Crevise Technology
• Developer turned DevOps evangelist
• 9+ years of development and project management experience in
various technologies
• Love to resolve tech problems and debug issues
3. #DOPPA17
What is monitoring, and why do it?
• Continuously keep track of the status of the system
• Continuously keep track of deployed applications
• Get the earliest possible warning of failures, defects or problems so they can be fixed
• Trending over time - helps decide when to upgrade or downgrade infra resources
• To know when things go wrong
• If an issue persists, analyse the collected data to debug it and prevent it in future
• Black box monitoring
• White box monitoring
4. #DOPPA17
Black box monitoring
• Just like smoke testing
• Examples - ping, HTTP requests
• Checks whether the server is up and responding
• When - when the system is broken, and to test from outside the network
• Won't tell you what is going on inside the machine
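A black box check can be as small as one HTTP request from outside. The sketch below is only an illustration using the Python standard library; the function name probe_http and the 3 second timeout are assumptions, not something from the talk.

```python
import urllib.error
import urllib.request


def probe_http(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP 4xx/5xx,
        # or a malformed URL: from the outside the service is "down".
        return False
```

For example, probe_http("http://localhost:9090/") would tell you whether something answers on the Prometheus port, but nothing about why it does not.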
5. #DOPPA17
White box monitoring
• Complementary to black box monitoring
• Tells you what is going on inside the system
• Example - check CPU usage, network usage
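As a rough illustration of white box data, the sketch below reads a few numbers from inside a Unix-like machine with the Python standard library. The function name and the choice of metrics are assumptions made for this example; node exporter collects far more than this.

```python
import os
import shutil


def collect_host_stats(path: str = "/") -> dict:
    """Read a few metrics from inside the machine (Unix-like systems only)."""
    load1, _load5, _load15 = os.getloadavg()  # 1, 5 and 15 minute load averages
    disk = shutil.disk_usage(path)            # total, used and free bytes
    used_percent = 100.0 * disk.used / disk.total if disk.total else 0.0
    return {
        "load1": load1,
        "cpu_count": os.cpu_count(),
        "disk_used_percent": used_percent,
    }
```

None of these values is visible to a ping or an HTTP request, which is why the two styles of monitoring complement each other.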
6. #DOPPA17
What to Monitor?
• First, understand what holds business value for
you and your customers.
• CPU, memory, IO, storage - the typical metrics
• Application monitoring - so that an application running in a cluster
can be managed based on these metrics
• Predict resource utilization to avoid downtime.
7. #DOPPA17
Prometheus
• Inspired by Google's Borgmon monitoring system
• Mainly written in Go, publicly launched in 2015
• Open source monitoring and alerting system with an active
ecosystem
• Used by Docker, DigitalOcean and CoreOS, to name a few
8. Prometheus Offers
Prometheus offers:
• Multi-dimensional data model (time series data)
– No hierarchical strings like "doppa.pune"
– Key-value pairs instead: {event="Doppa", city="pune"}
• Powerful queries - to leverage this dimensionality
• Precise alerting
• Pull model over HTTP
• Scalable
• Dashboards
• Efficient
– A single server can handle millions of metrics
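To see why key-value labels are easier to work with than dotted strings, here is a small Python sketch of selecting and aggregating by label. It is not how Prometheus stores data; the metric name attendees, the values and both helper functions are made up for illustration.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Series = Tuple[str, Labels, float]  # (metric name, labels, latest value)

# Hypothetical samples; the metric name and values are invented.
SERIES: List[Series] = [
    ("attendees", {"event": "Doppa", "city": "pune"}, 120.0),
    ("attendees", {"event": "Doppa", "city": "mumbai"}, 80.0),
    ("attendees", {"event": "Other", "city": "pune"}, 40.0),
]


def select(series: List[Series], name: str, **matchers: str) -> List[Series]:
    """Return the series with this name whose labels equal every matcher."""
    return [
        s for s in series
        if s[0] == name and all(s[1].get(k) == v for k, v in matchers.items())
    ]


def sum_by(series: List[Series], label: str) -> Dict[str, float]:
    """Add up values per distinct value of one label, like `sum by (label)`."""
    totals: Dict[str, float] = {}
    for _, labels, value in series:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0.0) + value
    return totals
```

With a name like "doppa.pune" you would have to parse the string to ask "everything in pune" or "total per event"; with labels each question is a single filter or group-by, e.g. select(SERIES, "attendees", city="pune") or sum_by(SERIES, "event").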
9. #DOPPA17
Let's understand with a simple diagram
10. #DOPPA17
Components
• Prometheus server - scrapes and stores time series data
• Exporters - expose metrics from the resources being monitored
• Alert rules - define the conditions that raise alerts
• Alertmanager - sends notifications about alerts on different
communication channels
11. #DOPPA17
Powerful Queries
• Can multiply, join, add, aggregate and predict in the same query
• Can evaluate current as well as historical data
• E.g.
• Which are the top 3 services by CPU consumption, or which are using more
than 80%?
• Will my storage get full in the next 4 hours?
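The "will my storage get full" question is a linear extrapolation. PromQL has a predict_linear function for this; the Python sketch below only imitates the idea with an ordinary least-squares fit. It extrapolates from the newest sample rather than from the query evaluation time, and the helper will_fill_up is a hypothetical name.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]], seconds_ahead: float) -> float:
    """Fit value = slope * t + intercept by least squares and extrapolate
    `seconds_ahead` past the newest sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    newest_t = max(t for t, _ in samples)
    return slope * (newest_t + seconds_ahead) + intercept


def will_fill_up(free_bytes: List[Tuple[float, float]], hours: float = 4.0) -> bool:
    """True if the fitted trend puts free space at or below zero within `hours`."""
    return predict_linear(free_bytes, hours * 3600) <= 0
```

Each sample is a (timestamp in seconds, free bytes) pair. If free space drops by a steady 200 units per hour from 1000, the fit predicts a negative value four hours after the last sample, so will_fill_up returns True.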
12. #DOPPA17
Some query examples
• CPU usage (%): 100 - (avg by (instance) (irate(node_cpu{instance="node1:9100",job="node",mode="idle"}[1m])) * 100)
• Memory used (bytes): node_memory_MemTotal{job="node",instance="node1:9100"} -
node_memory_MemFree{job="node",instance="node1:9100"} -
node_memory_Buffers{job="node",instance="node1:9100"} -
node_memory_Cached{job="node",instance="node1:9100"}
• Disk write (KiB/s): irate(node_disk_bytes_written[60s]) / 1024
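The irate() calls in these queries turn an ever-growing counter into a per-second rate using the last two samples in the range. A minimal Python sketch of that arithmetic, assuming samples are (timestamp in seconds, counter value) pairs; the helper cpu_busy_percent is a hypothetical name that mirrors the CPU query for a single core.

```python
from typing import List, Tuple


def irate(samples: List[Tuple[float, float]]) -> float:
    """Per-second rate from the last two (timestamp, counter value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the range")
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        raise ValueError("timestamps must be increasing")
    # A counter that went down is assumed to have been reset to zero.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / (t_last - t_prev)


def cpu_busy_percent(idle_seconds: List[Tuple[float, float]]) -> float:
    """100 minus the idle percentage for one core."""
    return 100.0 - irate(idle_seconds) * 100.0
```

If the idle counter grew by 12 seconds over a 15 second scrape interval, the core was idle 80% of the time, so the query reports roughly 20% CPU usage.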
13. Out of box feature
• The textfile collector is similar to the Pushgateway, in that it allows
exporting of statistics from batch jobs and shell scripts.
• Some metrics are not exported by node exporter
• You can still get such metrics into Prometheus with the help of the textfile
collector
• Produce output that is compatible with the Prometheus text exposition format
• Or write your own exporters to feed Prometheus
• ./node_exporter --collector.textfile.directory=Metrics
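A batch job can feed the textfile collector by dropping a .prom file into the configured directory. The sketch below is one possible way to do that in Python; the function name, the gauge-only output and the example metric name are assumptions. The file is written under a temporary name and then renamed, so node exporter does not read a half-written file.

```python
import os
import tempfile


def write_prom_file(directory: str, filename: str, metrics: dict) -> str:
    """Write gauges in the Prometheus text format, replacing the file atomically."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    body = "\n".join(lines) + "\n"

    # Write to a temporary file in the same directory, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by the owner only
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

A nightly backup script could, for instance, call write_prom_file("Metrics", "backup.prom", {"backup_last_success_timestamp_seconds": 1500000000}). The collector only picks up files whose names end in .prom.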
14. #DOPPA17
Running Prometheus
• Download Prometheus from
https://prometheus.io/download/#prometheus
• Extract and run - done.
• Let's hit http://localhost:9090
• Let's see it in action
15. #DOPPA17
Node Exporter installation
• Again, two steps
• Go to https://prometheus.io/download/#node_exporter
• Download node exporter, extract and run - done
• Exports metrics on port 9100
• Let's hit http://localhost:9100
16. #DOPPA17
Configuration
• It's time to tell Prometheus to pull metrics from node exporter
• Edit prometheus.yml - the configuration file for Prometheus
• scrape_interval: how often to scrape, 15 seconds here
• scrape_configs: what we are scraping
• targets: IP/hostname of the nodes to monitor
• labels: logical grouping of hosts
17. #DOPPA17
Sample configuration file
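global:
  scrape_interval: 15s        # scrape every target every 15 seconds
rule_files:                   # files containing the alert rules
  - 'CriticalAlert.Rules'
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "IP:9100"         # node exporter address; replace IP
        labels:
          group: 'QA-Env'     # lowercase, to match $labels.group in the alert rules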
18. #DOPPA17
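Reload Prometheus
curl -X POST http://localhost:9090/-/reload
# The curl command above makes the Prometheus server reload its
# configuration without a restart.
# On Prometheus 2.x this endpoint is only enabled when the server is
# started with --web.enable-lifecycle.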
19. #DOPPA17
Alert Rules and Alertmanager
• Define alert rules with powerful PromQL queries
• Predict linear changes on nodes
• Send alerts on your choice of communication channel
• e.g. Slack, PagerDuty, email, SMS etc.
20. #DOPPA17
Setting up Alertmanager
• Alertmanager can be configured to send Prometheus alerts to
your mailbox or Slack, trigger automated calls in critical situations, etc.
• Download Alertmanager from
https://prometheus.io/download/#alertmanager
• Extract and configure
• Run
21. #DOPPA17
Alert rule - is the instance up and running?
ALERT InstanceDown
  IF up == 0
  FOR 10m
  LABELS { severity = "CRITICAL" }
  ANNOTATIONS {
    summary = "Instance down",
    description = "{{ $labels.group }}-{{ $labels.instance }} - instance has been down for more than 10 minutes."
  }
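The FOR clause means the condition has to stay true for the whole duration before the alert fires; until then the alert is only pending. A minimal Python sketch of that behaviour, assuming one call per rule evaluation; the class name AlertState and its method are hypothetical and not part of Prometheus.

```python
from typing import Optional


class AlertState:
    """Tracks one alert through inactive -> pending -> firing."""

    def __init__(self, for_seconds: float) -> None:
        self.for_seconds = for_seconds
        self.active_since: Optional[float] = None

    def evaluate(self, condition_true: bool, now: float) -> str:
        """Feed one rule evaluation and return the resulting state."""
        if not condition_true:
            # Any evaluation where the condition is false resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

With AlertState(600), an instance that is down at 0s and 300s is pending, and firing from 600s onward. If it comes back even once in between, the 10 minute clock starts again, which keeps short blips from paging anyone.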
22. #DOPPA17
Alert rule - is my CPU usage going beyond 75%?
ALERT NodeCPUUsage
  # group is kept in the aggregation so that {{ $labels.group }} below is not empty
  IF (100 - (avg by (group, instance) (irate(node_cpu{job="node",mode="idle"}[1m])) * 100)) > 75
  FOR 2m
  LABELS { severity = "CRITICAL" }
  ANNOTATIONS {
    summary = "{{ $labels.group }}-{{ $labels.instance }}: High CPU usage detected",
    description = "CPU usage is above 75% (current value is: {{ $value }})"
  }
23. #DOPPA17
Alert rule - is storage more than 80% full?
ALERT filesystem_threshold_exceeded
  IF 100 * (1 - (node_filesystem_free{mountpoint="/"} / node_filesystem_size{mountpoint="/"})) > 80
  LABELS { severity = "CRITICAL" }
  ANNOTATIONS {
    summary = "{{ $labels.group }}-{{ $labels.instance }}: High filesystem usage detected",
    description = "This device's filesystem usage has exceeded the threshold with a value of {{ $value }}."
  }
24. #DOPPA17
AlertManager.yml
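Alertmanager configuration file
global:
route:
  # The root route needs a default receiver; email-prod is assumed here.
  receiver: email-prod
  repeat_interval: 4h         # re-send a still-firing alert every 4 hours
  routes:
    - receiver: email-QA
      match:
        group: 'QA-trad'
    - receiver: email-prod    # no match block, so it catches everything else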
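Routing works like a tree walk: an alert goes to the first child route whose match labels all agree with the alert's labels, and falls back to the parent's receiver if no child matches. The Python sketch below imitates that lookup for this configuration. It is a simplification (no regex matchers, no continue option), and the Route class, pick_receiver and the default receiver on the root are assumptions for illustration.

```python
from typing import Dict, List, Optional


class Route:
    def __init__(self, receiver: str, match: Optional[Dict[str, str]] = None,
                 routes: Optional[List["Route"]] = None) -> None:
        self.receiver = receiver
        self.match = match or {}
        self.routes = routes or []

    def matches(self, labels: Dict[str, str]) -> bool:
        """A route with no matchers matches every alert."""
        return all(labels.get(k) == v for k, v in self.match.items())


def pick_receiver(route: Route, labels: Dict[str, str]) -> str:
    """Return the receiver of the first, deepest matching route."""
    for child in route.routes:
        if child.matches(labels):
            return pick_receiver(child, labels)
    return route.receiver
```

Built as Route("email-prod", routes=[Route("email-QA", {"group": "QA-trad"}), Route("email-prod")]), an alert labelled group="QA-trad" lands on email-QA and every other alert lands on email-prod.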
25. #DOPPA17
Receivers
receivers:
  # The name must match the receiver used in the route (names are case-sensitive).
  - name: 'email-prod'
    email_configs:
      - to: 'Prodsupport@crevise.com'
        from: 'no-reply@crevise.com'
        smarthost: 'smtp.office365.com:587'
        auth_username: 'no-reply@crevise.com'
        auth_identity: 'no-reply@crevise.com'
        auth_password: 'fXXXX'
26. #DOPPA17
Slack Alerts
27. #DOPPA17
Email Alerts
28. #DOPPA17
Set up Grafana
• Download the tarball from the official site, extract it and run the binary
• Check the URL http://<Server-IP>:3000
• Add Prometheus as a data source:
• Go to http://<Server-IP>:3000
• Enter the username admin and password admin, and then click “Log In”.
• Click “Data Sources” on the left menu
• Click “Add new” on the top menu
30. #DOPPA17
Thank you!
Questions?
Reachable at
pravin.magdum@crevise.com
Twitter - @pravin_magdum