Agile Lean Ireland - Workshop - Cloud Native Monitoring with Prometheus


Note that provided environments will not be available outside the workshop - you can follow instructions from to run the environment yourself.

In the world of cloud native and distributed applications, Prometheus has quickly become one of the leading open-source monitoring tools. In this workshop, you will learn what you need to get started with Prometheus for monitoring a service-oriented architecture.

We will cover:
- The core concepts of Prometheus
- Instrumenting your code to expose metrics
- Querying Prometheus to gain insights on how your applications behave
- Defining rules to trigger alerts based on metrics and thresholds
- Building Grafana dashboards combining multiple metrics

Published in: Software

  1. Cloud Native Monitoring with Prometheus & Grafana. April 26th, 2019, Dublin. @PierreVincent
  2. Reaching production is only the beginning
  3. Pierre Vincent, Infrastructure & Reliability Manager (@PierreVincent)
  4. Workshop Overview
      - Slides: Metrics & Prometheus basics
      - Part 1: Intro to Prometheus UI and Queries
      - Part 2: Building Grafana Dashboards
      - Part 3: Creating Prometheus Alerts
      - Part 4: Instrumenting Code (Golang)
  5. Metrics
      - System metrics (e.g. CPU usage)
      - Application metrics (e.g. error rates)
      - Business metrics (e.g. customer conversions)
  6. “Cloud Native” changes the game
      - From: monolithic architectures, long-running instances, long-running servers
      - To: loosely-coupled architectures, short-lived instances, short/medium-lived servers
      - Examples: microservices, SOA, auto-scaling deployments, multiple deploys/day, cloud VMs, auto-scaling clusters
  7. Prometheus Overview: servers / VMs, appliances / infra, and services each expose a /metrics endpoint to Prometheus.
  8. Scraping for samples: every scrape interval, Prometheus reads the /metrics endpoint of a target (here, the User Service). Each value results in a sample, which is persisted in the TSDB.
      Exposed on /metrics:
          # HELP http_requests_total Total number of http requests by response status code
          # TYPE http_requests_total counter
          http_requests_total{endpoint="/login",status="200"} 1584
          http_requests_total{endpoint="/login",status="500"} 9
          ...
      Resulting sample for the first series:
          metric:    http_requests_total
          labels:    endpoint=/login, status=200
          timestamp: 1519205931
          value:     1584
  9. Our example: http-simulator, which exposes on /metrics:
      - http_requests_total
      - http_request_duration_milliseconds
      - standard Go metrics
      Two ways to run it:
      - Option 1: Deploy on your own cluster (see instructions in kubernetes/install)
      - Option 2: Use the pre-deployed setup
  10. PierreVincent/prometheus-workshop
  11. Exercises 1 - Counters & Rates
      ● What's the overall request rate (with a 1-minute rolling window) for the http-simulator service?
            sum(rate(http_requests_total{app="http-simulator"}[1m]))
      ● How many requests per minute are errors?
            60 * sum(rate(http_requests_total{app="http-simulator", status="500"}[1m]))
      ● What's the error rate (in %) of requests to the /users endpoint?
            100 * sum(rate(http_requests_total{app="http-simulator", endpoint="/users", status="500"}[1m]))
              / sum(rate(http_requests_total{app="http-simulator", endpoint="/users"}[1m]))
  12. Exercises 2 - Latency distribution
      ● What is the median latency of all requests to the http-simulator service?
            histogram_quantile(0.5, sum by (le) (rate(http_request_duration_milliseconds_bucket{app="http-simulator"}[5m])))
      ● Does the /users endpoint fulfill the SLO of three nines (99.9%) of requests responding within 400ms?
            sum(http_request_duration_milliseconds_bucket{app="http-simulator", status="200", endpoint="/users", le="400"})
              / sum(http_request_duration_milliseconds_count{app="http-simulator", status="200", endpoint="/users"})
  13. Exercises 3 - Grafana widgets
      Some examples of widgets (or come up with your own):
      ● Graph of latency distribution
      ● Cumulative % graph of endpoint request rate
      ● Memory usage over time
      ● CPU usage over time
      ● Graph of % of requests fulfilling the SLO of 400ms for the /login endpoint
      ● ...
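
Slide 8 describes how each line of a scraped /metrics response becomes one sample (metric name, labels, timestamp, value). The sketch below illustrates that mapping in Python. It is not how Prometheus itself is implemented: `Sample` and `parse_scrape` are hypothetical names, the parser ignores escaped quotes and per-line timestamps, and the scrape time is simply passed in by the caller.

```python
import re
from typing import NamedTuple


class Sample(NamedTuple):
    """One stored data point, mirroring the fields listed on slide 8."""
    metric: str
    labels: dict
    timestamp: int
    value: float


# Matches lines such as: http_requests_total{endpoint="/login",status="200"} 1584
# Simplification: no support for escaped quotes or explicit per-line timestamps.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>[^"]*)"')


def parse_scrape(body: str, scraped_at: int) -> list:
    """Turn the text of one scrape into a list of Sample tuples."""
    samples = []
    for raw in body.splitlines():
        line = raw.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a line this simplified parser understands
        labels = {
            m.group("key"): m.group("val")
            for m in LABEL_RE.finditer(match.group("labels") or "")
        }
        samples.append(
            Sample(match.group("name"), labels, scraped_at, float(match.group("value")))
        )
    return samples
```

Calling `parse_scrape` on the two `http_requests_total` lines from slide 8 with `scraped_at=1519205931` yields two samples, one per label combination, which is why every distinct label set is stored as its own time series.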
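
Slide 9 says the http-simulator exposes `http_requests_total` on /metrics. As a rough illustration of what instrumenting code means, the sketch below keeps a labelled counter in memory and renders it in the text format shown on slide 8. The `Counter` class here is a hypothetical stand-in written for this note, not the API of any Prometheus client library; real services would normally use an official client library instead.

```python
from collections import defaultdict


class Counter:
    """A labelled counter that only goes up (hypothetical helper)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        # One running total per distinct label combination.
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        """Increase the counter for the given label values."""
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self):
        """Return the counter in the text format served on /metrics."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {self._values[key]:g}")
        return "\n".join(lines) + "\n"
```

A request handler would call something like `requests.inc(endpoint="/login", status="200")` once per request, and the /metrics handler would return `requests.render()`. The application only counts; turning counts into rates is left to queries at read time.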
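
The queries in Exercises 1 (slide 11) rely on `rate()` turning an ever-increasing counter into a per-second rate. The sketch below shows the core idea on a list of `(timestamp, value)` samples: add up the increases, treat any drop as a counter reset, and divide by the elapsed time. It is a simplification under stated assumptions: `simple_rate` and `error_percentage` are hypothetical names, and the real PromQL `rate()` also extrapolates to the edges of the window, which this version does not attempt.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter restarted from zero (e.g. process restart),
        # so the whole new value counts as increase.
        increase += value if value < previous else value - previous
        previous = value
    return increase / elapsed


def error_percentage(error_rate, total_rate):
    """Share of failing requests in percent, as in the third exercise query."""
    return 100.0 * error_rate / total_rate
```

With one sample every 15 seconds over a minute, a counter going from 1000 to 1120 gives 2 requests per second, and multiplying a per-second rate by 60 gives the per-minute figure asked for in the second exercise.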
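
Exercises 2 (slide 12) use cumulative histogram buckets, where each `le` bucket counts all requests at or below that bound. The sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank, and computes the fraction of requests within a threshold for the SLO check. It follows the general approach documented for `histogram_quantile`, but it is only an approximation of it: the function names are hypothetical, bucket bounds are assumed to be positive latencies, and the bucket counts in the usage note are made-up illustration values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count); must include math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if upper == math.inf:
        # The quantile lies beyond the last finite bound: report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    # Assume the first bucket starts at 0 (positive latencies).
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    if count == below:
        return upper
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


def fraction_within(buckets, threshold):
    """Fraction of observations at or below an existing bucket bound."""
    counts = dict(buckets)
    return counts[threshold] / counts[math.inf]
```

For example, with hypothetical buckets `[(100, 40), (200, 70), (400, 95), (800, 99), (math.inf, 100)]`, the median falls in the 100 to 200 ms bucket, and `fraction_within(buckets, 400)` is 0.95, which would miss a 99.9% SLO. Because the estimate is interpolated, its accuracy depends on how well the bucket bounds are chosen around the values of interest.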