Prometheus Monitoring – Docker Enterprise Edition
Tim Tyler – Docker Captain
January 03, 2017
This is a training deck I originally developed in December 2016 and presented as part of a
company training plan for the Docker Enterprise Edition platform.
As of this edit it is 2019 and the deck is substantially dated: the whole tool stack has moved
forward significantly, as have some of the nuts and bolts originally described here (for
instance, it is now easy to drop HAProxy as described here in favor of Interlock). I’ve decided to
share it anyway, as it retains some intrinsic value and can form the basis for an
updated and modernized version and a potential MeetUp talk.
I’ve removed about 10% of the original content that was company specific or proprietary,
leaving only publicly available detail, and obscured some data. Many of the images worked
better on a white background, and rather than fiddle too much with them I’ve just applied
some quick picture styles.
It would be very easy to base an updated tech stack on this document and install a portable
training system on a Raspberry Pi. I am currently building a Prometheus and Grafana based
system for monitoring, alerting, and visualizing my Samsung SmartThings home automation
on a spare iMac.
@timotyler
ttyler
3
Who’s Keeping An Eye On Your
Containers?
 Monitoring Stack Overview
 Prometheus
 Exporters
 Alertmanager
 Queries
 Alerts
Nuts and Bolts
Questions
4
Agenda
Monitoring Stack Overview
5
I'm still passionately interested in what my fellow humans are up to. For me, a day
spent monitoring the passing parade is a day well-spent. - Garry Trudeau
 Monitoring containerized and microservices environments presents new challenges.
 Containers can be highly ephemeral
 Microservices are able to scale up and down to meet design and performance criteria
 Microservices may exist for seconds, or persist indefinitely
 Microservices are generally a single process
 Containers live on hosts, but hosts are just pooled resources
 Generally we don’t think about what host an application microservice is running on
 Instances of a microservice may live on multiple hosts in a Docker Swarm
 Instances of a microservice may move to different hosts within a Docker Swarm
 The Swarm is a pool, and the microservices just swim in it
 Monitoring, like the microservice architecture, needs to be elastic
6
What’s the Problem?
 We have options, and several are readily available
 Prometheus
Time series dimensional data model with Docker aware agents
 Dynatrace
Specialist in application performance monitoring with Docker support
 SignalFX
Newer offering with native Docker support
 Sysdig
Swiss army knife for infrastructure and microservices monitoring
7
Can We Solve This?
 Prometheus is a Pitts S-2A RC muscle biplane
 Prometheus is a prequel and fifth installment in the Alien franchise
 Prometheus is a Greek Titan that gave us fire and suffered an unfortunate fate
involving a hungry eagle and his liver
 Prometheus is a leading Open Source monitoring solution
Prometheus is straightforward to implement as a primary cluster monitoring
stack
A complete stack can also include the Open Source data visualization tool
Grafana
8
PROMETHEUS!
What’s Prometheus?
 Dimensional Data Metric Collector
 Interactive Query Engine
 Calculator for discrete multidimensional data streams
 Great Visualization
 Efficient Storage
 Simple Operation
 Alerting
 Many Client Libraries
 Many Integrations
9
PROMETHEUS!
But what does it do?
 Prometheus Server
 Scrapes and stores time series data
 Alertmanager
 Handles alerts generated by Prometheus Server, deduplicating, grouping, and routing alerts to configured receivers
 AM-Exporter
 Receiver to transmit alerts from Alertmanager to custom intake process
 Exporters
 Agents with specific duties that collect metrics and present them to Prometheus Server
 cAdvisor, node-exporter, blackbox-exporter
 Grafana
 Data visualization
 HaProxy
 Routes calls to Prometheus Server, Alertmanager, and Grafana within the Docker overlay network
10
A typical Prometheus Monitoring Stack
11
Prometheus Architecture
12
Prometheus Implementation
 Infrastructure as Code
 IaC is to treat the configuration of systems the same way that software code is treated
 We’re all devs now
 Automate and modularize
 Apply test pyramid
 Version control changes, patches, and releases
 Share work! (Because DevOps)
 Installed via Docker orchestration and some basic automation
 Makefile driven
 Apply environment specific customizations (hostnames, passwords, alerts, etc.) to config files
 Deploy configs across cluster
13
Stack Installation
Prometheus
14
15
Why should the thirst for knowledge be
aroused, only to be disappointed and
punished? Yet, like a second
Prometheus, I will endure this and worse
- Edwin Abbott in Flatland: A Romance of Many
Dimensions (1884)
 Open Source systems monitoring and alerting tool originally
built at SoundCloud
 Very active developer and user community
 Docs and stuff
https://prometheus.io/docs/introduction/overview/
16
Prometheus Server
 Collect and store time series data
 Scrape defined targets for functionally specific data
 Discover targets statically or dynamically
 Evaluate rulesets
 Allow vector arithmetic
 Send alerts
17
What can Prometheus Server Do?
18
Prometheus Has a Really Boring UI
We’ll go poke around for a minute
Exporters
19
Prometheus Exporters are basically agents that are responsible for
collecting application-specific time series metrics and presenting
them via an API endpoint for Prometheus to collect.
20
What Are Exporters?
Prometheus has support either directly, or via third parties, for
dozens of exporters. Some tools have been directly instrumented
to provide a Prometheus endpoint such as etcd, cAdvisor,
Kubernetes, and Docker.
Custom, business-specific exporters can easily be written in any
language, though Go seems popular.
21
A Bunch of Exporters
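To make the "agent that presents metrics on an endpoint" idea concrete, here is a minimal sketch of a custom exporter using only the Python standard library. It is not how the exporters in this stack are built; the metric names, the `collect` function, and port 9999 are all made up for illustration. A real exporter would more likely use an official Prometheus client library.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, labels, value) tuples in the Prometheus text format."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical business metrics; replace with real measurements."""
    return [
        ("demo_jobs_in_queue", {"queue": "billing"}, 3),
        ("demo_last_collect_timestamp_seconds", {}, time.time()),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes /metrics by default.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at port 9999 would be enough for Prometheus to start collecting the two demo series.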
A basic Docker monitoring stack implements 3 exporters
 cAdvisor
 Provides metrics on docker and container environment
 node-exporter
 exporter for hardware and OS metrics exposed by the kernel
 blackbox-exporter
 allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP and ICMP.
22
What is a Minimal Set of Exporters?
 Pushgateway
 allow ephemeral and batch jobs to expose their metrics
 HAProxy-exporter
periodically scrapes HAProxy stats and exports them via HTTP/JSON for Prometheus
 JMX-exporter
configurably scrape and expose mBeans of a JMX target
 Mongodb-exporter
 Rabbitmq-exporter
23
What are Some Other Exporters?
Alertmanager
24
25
Alertmanager
26
Alertmanager
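As noted earlier, Alertmanager deduplicates, groups, and routes alerts to configured receivers. The sketch below illustrates those three ideas on plain dictionaries of labels. It is a toy model, not Alertmanager's implementation; the function names and the `email` default receiver are assumptions, and real routing is configured in the Alertmanager config file rather than in code.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop exact duplicates, then bucket alerts by the given label names."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # same label set already received: deduplicate
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


def pick_receiver(labels, routes, default):
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in routes:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return default


# Hypothetical alerts and routes for illustration only.
ALERTS = [
    {"alertname": "NodeDown", "instance": "node1", "severity": "critical"},
    {"alertname": "NodeDown", "instance": "node1", "severity": "critical"},
    {"alertname": "NodeDown", "instance": "node2", "severity": "critical"},
    {"alertname": "DiskFull", "instance": "node1", "severity": "warning"},
]
ROUTES = [({"severity": "critical"}, "ChatBot")]
```

Grouping by `alertname` turns the two distinct NodeDown alerts into a single notification, which is the point: one message for one outage, not one per node.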
Queries
27
Prometheus provides a functional language that lets the user select and
aggregate time series data in real time. Results can be rendered as
follows:
 Displayed in a graph
 Viewed as tabular data
 Consumed by external systems
Grafana for instance
28
The Basics
 Instant vector
A set of time series data containing a single sample for each series
 Range vector
A set of time series data containing a range of data points over time
 Scalar
A simple numeric floating point value
29
Data Types
Prometheus has 3 basic data types
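The difference between the three data types is easier to see on sample data. The sketch below models them in plain Python under simplifying assumptions: the series and timestamps are invented, staleness handling is ignored, and the window boundaries are a simplification of what Prometheus actually does.

```python
def instant_vector(series, at):
    """One sample per series: the most recent one at or before `at`."""
    result = {}
    for name, samples in series.items():
        eligible = [s for s in samples if s[0] <= at]
        if eligible:
            result[name] = eligible[-1]
    return result


def range_vector(series, at, window):
    """Every sample per series with a timestamp in [at - window, at]."""
    result = {}
    for name, samples in series.items():
        in_window = [s for s in samples if at - window <= s[0] <= at]
        if in_window:
            result[name] = in_window
    return result


# Hypothetical data: (timestamp_seconds, value) pairs, sorted by time.
SERIES = {
    'up{instance="a"}': [(0, 1.0), (15, 1.0), (30, 0.0)],
    'up{instance="b"}': [(0, 1.0), (15, 1.0), (30, 1.0)],
}

# A scalar is just one floating point number with no labels attached.
SCALAR = 2.0
```

In query terms, `up` yields an instant vector, `up[5m]` yields a range vector, and a literal such as `2` is a scalar.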
30
Operators
Prometheus supports basic logical and arithmetic operators
Arithmetic Operators    Comparison Operators     Aggregation Operators
+ (addition)            == (equal)               sum
- (subtraction)         != (not equal)           min
* (multiplication)      > (greater than)         max
/ (division)            < (less than)            avg
% (modulo)              >= (greater or equal)    count
^ (exponentiation)      <= (less or equal)       topk
 sum
 count
 irate
 sort
 topk
 time
31
Functions
Prometheus supports about 40 built in functions
32
Simple Query
What’s up?
sort_desc(
  topk(5,
    sum by (image) (
      irate(container_cpu_usage_seconds_total{id=~"/docker/.*"}[5m])
    )
  )
)
33
Fancy Query
Top 5 Docker Images by CPU
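Reading the fancy query from the inside out: `irate` turns each container's CPU counter into a per-second rate using the last two samples in the window, `sum by (image)` adds those rates per image, and `topk` with `sort_desc` keeps the busiest few in descending order. The Python sketch below mimics those steps on invented sample data. It ignores counter resets and is only meant to show the shape of the calculation, not how Prometheus evaluates it.

```python
from collections import defaultdict


def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


def top_images_by_cpu(series, k=5):
    """Filter to /docker/ ids, sum rates per image, return the top k."""
    per_image = defaultdict(float)
    for labels, samples in series:
        # Equivalent to the anchored matcher id=~"/docker/.*"
        if not labels["id"].startswith("/docker/"):
            continue
        if len(samples) < 2:
            continue  # a rate needs at least two samples
        per_image[labels["image"]] += irate(samples)
    ranked = sorted(per_image.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Hypothetical cumulative CPU seconds per container.
SERIES = [
    ({"id": "/docker/a", "image": "redis"}, [(0, 10.0), (8, 12.0), (16, 16.0)]),
    ({"id": "/docker/b", "image": "redis"}, [(0, 0.0), (16, 4.0)]),
    ({"id": "/docker/c", "image": "nginx"}, [(8, 5.0), (16, 6.0)]),
    ({"id": "/system.slice", "image": ""}, [(0, 1.0), (16, 100.0)]),
]
```

With this data the two redis containers contribute 0.5 and 0.25 CPU seconds per second, so redis ranks ahead of nginx at 0.125, and the non-Docker cgroup is filtered out entirely.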
Alerts
34
35
Big things have small beginnings –
David, from the movie Prometheus
(2012)
Let’s build an Alert!
 Alerts are just queries with comparison operators
 Alerts are written in a simple format in a plain text file
 Alerts can be decorated with interesting metadata
 Alert metadata can be templated
 Alerts can be sent to an external service
36
First Things First
37
The Anatomy of an Alert
An alert starts with a Query – like up
38
The Anatomy of an Alert
This is more info than we want though
39
The Anatomy of an Alert
What we really want is to count how many we have
40
The Anatomy of an Alert
Or change how we count them
41
The Anatomy of an Alert
And do some math
42
The Anatomy of an Alert
Check out a quick chart
43
The Anatomy of an Alert
This is more fun
ALERT NodeDown
  IF up{job="node"} == 0
  FOR 1m
  LABELS {prdcode="0000", host="Shared_Infra", severity="critical", support="Prometheus_Critical"}
  ANNOTATIONS {
    description="{{$labels.instance}} of job {{$labels.job}} has been down for more than 1 minute.",
    rosguide="Please see Application ROS guide",
    summary="Instance {{$labels.instance}} down"
  }
44
The Anatomy of an Alert
And go back and turn our earlier query into an alert
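The `FOR 1m` clause is what keeps a single failed scrape from paging anyone: the alert sits in a pending state until the condition has held for the full duration, and only then fires. The sketch below models that state machine in Python. The 15 second evaluation interval and the sample results are assumptions for illustration. Note too that the `ALERT ... IF ... FOR` syntax is from Prometheus 1.x; later versions moved rules to YAML files, but the pending/firing behaviour is the same idea.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_true) evaluations to alert states."""
    states = []
    active_since = None
    for ts, condition in evaluations:
        if not condition:
            # Condition cleared: the alert resets and the timer starts over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_for = ts - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical results of evaluating up{job="node"} == 0 every 15 seconds.
EVALUATIONS = [
    (0, True), (15, True), (30, True), (45, True),
    (60, True), (75, False), (90, True),
]
```

Here the node has to stay down from second 0 through second 60 before the alert fires, and the brief recovery at second 75 sends it back to square one.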
 OptimusPrime (bot) 3:32 PM
 AlertManager message: [FIRING:1] NodeDown (0000 prod node.metrics Shared_Infra node app critical Prometheus_Critical). Learn more at https://somewhere.dockeralerts.company.com:8443/#/alerts?receiver=ChatBot
45
A NodeDown Alert Sent To Chat
Fate rarely calls on us at a moment of our choosing – Optimus Prime
 Rules/Alerts are segregated into functionally specific rule files
 alert.rules
 basic alert installed with 1 rule ‘IF up{job="node"} == 0’
 alert.infra.logging.rules
 Logging ruleset
 alert.infra.monitoring.rules
 Monitoring stack rules
 alert.infra.rules
 Basic infrastructure rules such as file systems, memory, and thinpool
 alert.service.app.prod.rules
 Service level rules such as redis, mongodb, rabbitmq, etc.
 alert.docker.rules
 Rules for Docker itself
 alert.0000.app.rules
 Application specific rules
46
How are Rules/Alerts Categorized?
Grafana
47
 Grafana is a leading Open Source Data Visualization Tool
 Create and share intuitive dashboards
 Rich graphing and charting
 Mixed styling within a dashboard
 Dashboard templates
 Lots of additional features
48
What is Grafana?
49
Nuts and Bolts
50
The Infrastructure Monitoring Stack is currently considered v1.0
 Prometheus v1.3.1
 Grafana v3.1.1
 Alertmanager v0.4.2 custom-v2
 HaProxy v1.6.9
 cAdvisor v0.24.1
 Node-exporter v0.12.0
 Blackbox-exporter v0.2.0
51
What’s in Your Stack?
We use Git to manage configurations and changes to the tech stack. Git is a distributed
version control system.
 Simple to use
 Enables code collaboration
 Eases deployments
 https://somewhere.company.com/git/projects/PRJ0000/repos/infra-prom-stack/browse
52
Tech Stack SOA
The Monitoring Stack is deployed and configured from 1 location in each Docker Swarm, typically the first Docker
Master Node.
 Configuration files
 /company/compose/infra-prom-stack
 Prometheus Configuration
 /company/compose/infra-prom-stack/infra/prometheus/config/prometheus.yml
 Alertmanager Configuration
 /company/compose/infra-prom-stack/infra/prometheus/config/alertmanager.conf
 Alert Files
 /company/compose/infra-prom-stack/infra/prometheus/alerts
53
Basic Stack Deployment
The Makefile simplifies stack management by reducing error-prone commands to simple make targets. It is used both to
configure and install the Monitoring Stack and to manage the stack at runtime. Some examples:
 make pushconfigs-all
 Distributes configuration to all Swarm nodes
 make hup-prometheus
 Gently restarts Prometheus Server after a configuration change
 make start
 Equivalent to a `docker compose up` with cluster specific information
 make start-all
 Starts the stack and scales all required services
54
Controlling the Stack
These commands are run from the /company/compose/infra-prom-stack directory on the first Master Node
 There are 1 or more cAdvisor containers down
 Restart via UCP
 If that fails remove the stopped containers
 Run `make scale-cadvisor` from /company/compose/infra-prom-stack
 There are 1 or more node-exporter containers down
 Restart via UCP
 If that fails remove the stopped containers
 Run `make scale-node-exporter` from /company/compose/infra-prom-stack
 Cannot connect to Prometheus Server, Grafana, or Alertmanager
 Validate they are up via UCP
 Occasionally HAProxy seems to get confused and needs a simple restart via UCP
55
Fixing Some Basic Problems
56
Prometheus UCP View
57
Prometheus UCP View
Infrastructure Monitoring and Logging services are currently
deployed as shared infrastructure services in a Docker Overlay
network.
 Overlay name: infra_netmon
Monitoring stack
Logging stack
58
Network Overlay and Shared Services
Prometheus is Federated, enabling existing Prometheus Servers to monitor other Prometheus Servers.
 north-nonprod monitors both
 east
 west
 east monitors
 west
 west monitors
 east
 Basic synthetic monitoring
59
Federation
Who monitors the monitors?
If we stick with Prometheus then there are several improvements that will need exploration and engineering
 Integrate configuration and deployment via a CI/CD pipeline
 Improve and refine Rules/Alerts
 Update Prometheus Server to latest version
 Not much to gain here at the moment
 Update Grafana to latest version
 Some interesting new features including built in alerts
 Back Grafana with a relational database
 Enables persistent annotations
 Engineer HA Prometheus and Alertmanager within a cluster
 Figure out a better persistent storage strategy
 This is bigger than Prometheus/Monitoring
60
Future Work
Since this is an Open Source solution we will have new tradeoffs vs. a fully vendored solution. The following resources are suggested for those
wanting to dive deeper into this technology stack.
 See the Prometheus docs, GitHub repo, YouTube videos, and Robust Perception blog
 https://prometheus.io/docs/introduction/overview/
 https://github.com/prometheus/prometheus
 https://www.youtube.com/watch?v=gNmWzkGViAY&t
 https://www.robustperception.io/blog/
 See the Grafana docs, GitHub repo, and Screencasts
 http://docs.grafana.org/
 https://github.com/grafana/grafana
 https://www.youtube.com/playlist?list=PLDGkOdUX1Ujo3wHw9-z5Vo12YLqXRjzg2
 See the cAdvisor GitHub repo
 https://github.com/google/cadvisor
61
Want to Learn More?
 Microservices are (intended to be) ephemeral
 We need to monitor potentially transient services and act accordingly
 This is an Open Source solution down the stack
 Prometheus is targeted to replace existing on-prem roles
Capable of very basic synthetics
Can set up service level monitoring for mongodb, rabbitmq, etc(d).
 Interface with 3rd party connectors
 Alerts are easy to create and manage
 Deployed as Infrastructure as Code
Embrace DevOps
62
Key Points
Questions, Maybe Answers
63
64
I Hope This Isn’t You Right Now
