Monitoring Microservices
with Prometheus
Tobias Schmidt - MicroCPH May 17, 2017
Monitoring microservices with Prometheus


In recent years, many companies have adopted service-oriented architectures by deploying tens to hundreds of small microservices. But with the increasing number of independent services, do you still know what’s going on in your infrastructure?

Traditional monitoring solutions focused mostly on machines and fell short of keeping track of infrastructures where services are deployed multiple times per day and instances are dynamically allocated across a multitude of nodes. Prometheus is a relatively new monitoring system that has gained a lot of popularity in the last two years, as it was explicitly designed for today’s needs of service monitoring and container infrastructure.

In this session, you’ll learn how to instrument a service with a Prometheus client library to provide information about its current health and state. To get notified automatically when the service becomes unhealthy, you’ll see how to configure alerts and notifications. Along the way, I’ll discuss a few key metrics that are essential to successfully monitoring a microservice.



  1. 1. Monitoring Microservices with Prometheus Tobias Schmidt - MicroCPH May 17, 2017 github.com/grobie @dagrobie tobidt@gmail.com
  2. 2. Monitoring
  3. 3. Monitoring
      ● Ability to observe and understand systems and their behavior.
        ○ Know when things go wrong
        ○ Understand and debug service misbehavior
        ○ Detect trends and act in advance
      ● Blackbox vs. Whitebox monitoring
        ○ Blackbox: Observes systems externally with periodic checks
        ○ Whitebox: Provides internally observed metrics
      ● Whitebox: Different levels of granularity
        ○ Logging
        ○ Tracing
        ○ Metrics
  4. 4. Prometheus
      ● Metrics monitoring system and time series database
        ○ Instrumentation (client libraries and exporters)
        ○ Metrics collection, processing and storage
        ○ Querying, alerting and dashboards
        ○ Analysis, trending, capacity planning
        ○ Focused on infrastructure, not business metrics
      ● Key features
        ○ Powerful query language for metrics with label dimensions
        ○ Stable and simple operation
        ○ Built for modern dynamic deploy environments
        ○ Easy setup
      ● What it’s not
        ○ Logging system
        ○ Designed for perfect answers
  5. 5. Instrumentation case study Gusta: a simple like service
  6. 6. Gusta overview
      ● Service to handle everything around liking a resource
        ○ List all likes on a resource
        ○ Create a like on a resource
        ○ Delete a like on a resource
      ● Implementation
        ○ Written in Go
        ○ Uses the gokit.io toolkit
  7. 7. Gusta core
      // Like represents all information of a single like.
      type Like struct {
          ResourceID string    `json:"resourceID"`
          UserID     string    `json:"userID"`
          CreatedAt  time.Time `json:"createdAt"`
      }

      // Service describes all methods provided by the gusta service.
      type Service interface {
          ListResourceLikes(resourceID string) ([]Like, error)
          LikeResource(resourceID, userID string) error
          UnlikeResource(resourceID, userID string) error
      }
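To make the interface concrete, here is a minimal sketch of an in-memory implementation. It is not the talk's actual `gusta.NewMemoryStore`; the type `memoryService` and the errors `ErrNotFound` and `ErrAlreadyExists` are hypothetical names, chosen to line up with the `err="not found"` log line and the 404/409 status codes that appear in later slides.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Like mirrors the struct shown on the slide.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}

// Hypothetical sentinel errors, not taken from the original code base.
var (
	ErrNotFound      = errors.New("not found")
	ErrAlreadyExists = errors.New("already exists")
)

// memoryService is a hypothetical in-memory implementation of the
// Service interface from the slide.
type memoryService struct {
	mu    sync.Mutex
	likes map[string][]Like // keyed by resource ID
}

func newMemoryService() *memoryService {
	return &memoryService{likes: make(map[string][]Like)}
}

// ListResourceLikes returns a copy of all likes stored for the resource.
func (m *memoryService) ListResourceLikes(resourceID string) ([]Like, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return append([]Like(nil), m.likes[resourceID]...), nil
}

// LikeResource stores a like unless the user already liked the resource.
func (m *memoryService) LikeResource(resourceID, userID string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, l := range m.likes[resourceID] {
		if l.UserID == userID {
			return ErrAlreadyExists
		}
	}
	m.likes[resourceID] = append(m.likes[resourceID], Like{
		ResourceID: resourceID,
		UserID:     userID,
		CreatedAt:  time.Now(),
	})
	return nil
}

// UnlikeResource removes the user's like or reports ErrNotFound.
func (m *memoryService) UnlikeResource(resourceID, userID string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	ls := m.likes[resourceID]
	for i, l := range ls {
		if l.UserID == userID {
			m.likes[resourceID] = append(ls[:i], ls[i+1:]...)
			return nil
		}
	}
	return ErrNotFound
}

func main() {
	s := newMemoryService()
	fmt.Println(s.LikeResource("r1", "u1"))   // <nil>
	fmt.Println(s.UnlikeResource("r1", "u2")) // not found
	likes, _ := s.ListResourceLikes("r1")
	fmt.Println(len(likes)) // 1
}
```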
  8. 8. Gusta server
      // main.go
      var store gusta.Store
      store = gusta.NewMemoryStore()

      var s gusta.Service
      s = gusta.NewService(store)
      s = gusta.LoggingMiddleware(logger)(s)

      var h http.Handler
      h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"))

      http.Handle("/", h)
      if err := http.ListenAndServe(*httpAddr, nil); err != nil {
          logger.Log("exit error", err)
      }
  9. 9. Gusta server
      ./gusta
      ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080
      ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null
      ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null
      ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null
      ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null
      ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null
      ts=2017-05-16T19:38:24.40303802Z method=ListResourceLikes ResourceID=r8669fd29 took=28.115µs err=null
      ts=2017-05-16T19:38:24.423045585Z method=ListResourceLikes ResourceID=r8669fd29 took=23.842µs err=null
      ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found"
      ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null
      ts=2017-05-16T19:38:20.883081162Z method=ListResourceLikes ResourceID=r8669fd29 took=16.999µs err=null
  10. 10. Basic Instrumentation Providing operational insight
  11. 11. What information should be provided?
      ● “Four golden signals” cover the essentials
        ○ Latency
        ○ Traffic
        ○ Errors
        ○ Saturation
      ● Similar concepts: RED and USE methods
        ○ RED: Request Rate, Errors, Duration
        ○ USE: Utilization, Saturation, Errors
      ● Information about the service itself
      ● Interaction with dependencies (other services, databases, etc.)
  12. 12. Where to get the information from?
      ● Direct instrumentation
        ○ Traffic, Latency, Errors, Saturation
        ○ Service specific metrics (and interaction with dependencies)
        ○ Prometheus client libraries provide packages to instrument HTTP requests out of the box
      ● Exporters
        ○ Utilization, Saturation
        ○ node_exporter: CPU, memory, IO utilization per host
        ○ wmi_exporter does the same for Windows
        ○ cAdvisor (Container advisor) provides similar metrics for each container
  13. 13. Initializing Prometheus client library
      // main.go
      import "github.com/prometheus/client_golang/prometheus"

      var registry = prometheus.NewRegistry()
      registry.MustRegister(
          prometheus.NewGoCollector(),
          prometheus.NewProcessCollector(os.Getpid(), ""),
      )

      // Pass down registry when creating HTTP handlers.
      h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry)
  14. 14. Counting HTTP requests
      var h http.Handler = listResourceLikesHandler
      var method, path string = "GET", "/api/v1/likes/{id}"

      requests := prometheus.NewCounterVec(
          prometheus.CounterOpts{
              Name:        "gusta_http_server_requests_total",
              Help:        "Total number of requests handled by the HTTP server.",
              ConstLabels: prometheus.Labels{"method": method, "path": path},
          },
          []string{"code"},
      )
      registry.MustRegister(requests)

      h = promhttp.InstrumentHandlerCounter(requests, h)
  15. 15. Observing HTTP request latency
      var h http.Handler = listResourceLikesHandler
      var method, path string = "GET", "/api/v1/likes/{id}"

      requestDuration := prometheus.NewHistogramVec(
          prometheus.HistogramOpts{
              Name:        "gusta_http_server_request_duration_seconds",
              Help:        "A histogram of latencies for requests.",
              Buckets:     []float64{.0025, .005, 0.01, 0.025, 0.05, 0.1},
              ConstLabels: prometheus.Labels{"method": method, "path": path},
          },
          []string{},
      )
      registry.MustRegister(requestDuration)

      h = promhttp.InstrumentHandlerDuration(requestDuration, h)
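Histogram buckets are exposed as cumulative counts: each `le` bucket counts every observation less than or equal to its upper bound. The sketch below illustrates that bookkeeping with the bucket bounds from the slide. The `histogram` type is a hypothetical simplification; the real client library stores its counts differently and is safe for concurrent use.

```go
package main

import (
	"fmt"
	"math"
)

// histogram is a hypothetical, simplified cumulative histogram.
type histogram struct {
	upperBounds []float64 // sorted ascending, without +Inf
	counts      []uint64  // cumulative count per bound; last entry is +Inf
	sum         float64   // sum of all observed values
	count       uint64    // number of observations
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{
		upperBounds: bounds,
		counts:      make([]uint64, len(bounds)+1),
	}
}

// observe adds v to every bucket whose upper bound is >= v, which keeps
// the counts cumulative, and always to the implicit +Inf bucket.
func (h *histogram) observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++
		}
	}
	h.counts[len(h.upperBounds)]++
	h.sum += v
	h.count++
}

func main() {
	h := newHistogram([]float64{.0025, .005, 0.01, 0.025, 0.05, 0.1})
	for _, seconds := range []float64{0.003, 0.004, 0.02, 0.2} {
		h.observe(seconds)
	}
	bounds := append(append([]float64{}, h.upperBounds...), math.Inf(1))
	for i, ub := range bounds {
		fmt.Printf("le=%g count=%d\n", ub, h.counts[i])
	}
	fmt.Printf("sum=%g count=%d\n", h.sum, h.count)
}
```

Keeping the counts cumulative is what allows a query to estimate quantiles from the buckets alone, without access to the individual observations.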
  16. 16. Exposing metrics Observing the current state
  17. 17. How to collect the metrics?
      ● Prometheus is a pull based monitoring system
        ○ Instances provide an HTTP endpoint to expose their metrics
        ○ Prometheus uses service discovery or static target lists to collect the state periodically
      ● Centralized management
        ○ Prometheus decides how often to scrape instances
      ● Prometheus stores the data on local disk
        ○ In a big outage, you could run Prometheus on your laptop!
  18. 18. Exposing the metrics via HTTP
      // main.go
      // ...
      http.Handle("/metrics", promhttp.HandlerFor(
          registry,
          promhttp.HandlerOpts{},
      ))
  19. 19. Request metrics
      curl -s http://localhost:8080/metrics | grep requests
      # HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
      # TYPE gusta_http_server_requests_total counter
      gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
      gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
      gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
      gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
      gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3
  20. 20. Latency metrics
      curl -s http://localhost:8080/metrics | grep request_duration
      # HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.
      # TYPE gusta_http_server_request_duration_seconds histogram
      ...
      gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414
      gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0005"} 423
      gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.001"} 429
      gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0025"} 429
      gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.005"} 429
      gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.01"} 429
      gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429
      gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984
      gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429
      ...
  21. 21. Out-of-the-box process metrics
      curl -s http://localhost:8080/metrics | grep process
      # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
      # TYPE process_cpu_seconds_total counter
      process_cpu_seconds_total 892.78
      # HELP process_max_fds Maximum number of open file descriptors.
      # TYPE process_max_fds gauge
      process_max_fds 1024
      # HELP process_open_fds Number of open file descriptors.
      # TYPE process_open_fds gauge
      process_open_fds 23
      # HELP process_resident_memory_bytes Resident memory size in bytes.
      # TYPE process_resident_memory_bytes gauge
      process_resident_memory_bytes 9.3446144e+07
      ...
  22. 22. Collecting metrics Scraping all service instances
  23. 23. Static configuration
      # Scrape all targets every 5 seconds by default.
      global:
        scrape_interval: 5s
        evaluation_interval: 5s

      scrape_configs:
        # Scrape the Prometheus server itself.
        - job_name: prometheus
          static_configs:
            - targets: [localhost:9090]

        # Scrape the Gusta service.
        - job_name: gusta
          static_configs:
            - targets: [localhost:8080]
  24. 24. Consul service discovery
      scrape_configs:
        # Scrape the Gusta service using Consul.
        - job_name: consul
          consul_sd_configs:
            - server: localhost:8500
          relabel_configs:
            - source_labels: [__meta_consul_tags]
              regex: .*,prod,.*
              action: keep
            - source_labels: [__meta_consul_service]
              target_label: job
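The `keep` rule above retains only targets whose tag list contains `prod`. The sketch below shows why the regex is written with commas on both sides, under two assumptions worth checking against the Prometheus documentation: that relabeling regexes are fully anchored, and that `__meta_consul_tags` holds the tags joined by commas with a leading and trailing comma. The helpers `consulTags` and `keep` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// consulTags joins tags the way the slide's regex expects them
// (assumption): comma-separated and wrapped in commas.
func consulTags(tags []string) string {
	return "," + strings.Join(tags, ",") + ","
}

// keep reports whether the whole value matches the pattern. The pattern
// is anchored on both ends, so a partial match is not enough.
func keep(pattern, value string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(value)
}

func main() {
	const pattern = ".*,prod,.*"
	fmt.Println(keep(pattern, consulTags([]string{"gusta", "prod"})))    // true
	fmt.Println(keep(pattern, consulTags([]string{"gusta", "preprod"}))) // false
	fmt.Println(keep(pattern, consulTags([]string{"gusta", "staging"}))) // false
}
```

The surrounding commas let the pattern match a whole tag, so a tag such as `preprod` does not count as `prod`.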
  25. 25. Target overview
  26. 26. Simple Graph UI
  27. 27. Simple Graph UI
  28. 28. Dashboards Human-readable metrics
  29. 29. Grafana example
  30. 30. Alerts Actionable metrics
  31. 31. Alert examples
      ALERT InstanceDown
        IF up == 0
        FOR 5m
        LABELS { severity = "warning" }
        ANNOTATIONS {
          summary = "Instance down for more than 5 minutes.",
          description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for >= 5 minutes.",
        }

      ALERT RunningOutOfFileDescriptors
        IF process_open_fds / process_max_fds * 100 > 95
        FOR 2m
        LABELS { severity = "warning" }
        ANNOTATIONS {
          summary = "Instance has many open file descriptors.",
          description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
        }
  32. 32. Alert examples
      ALERT GustaHighErrorRate
        IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
           / sum without(code, instance) (rate(gusta_http_server_requests_total[1m])) * 100 > 0.1
        FOR 2m
        LABELS { severity = "critical" }
        ANNOTATIONS {
          summary = "Gusta service endpoints have a high error rate.",
          description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
        }

      ALERT GustaHighLatency
        IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
        LABELS { severity = "critical" }
        ANNOTATIONS {
          summary = "Gusta service endpoints have a high latency.",
          description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.",
        }
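The error-rate expression divides the per-second rate of 5xx responses by the per-second rate of all responses and scales the result to a percentage. The arithmetic can be sketched as follows, using made-up counter samples. This is a simplification: the real `rate()` function also handles counter resets and extrapolates to the window boundaries, and the names `sample`, `perSecondRate` and `errorPercent` are hypothetical.

```go
package main

import "fmt"

// sample is one scraped counter value at a point in time.
type sample struct {
	t float64 // Unix time in seconds
	v float64 // counter value
}

// perSecondRate is a simplified rate: the counter increase between the
// first and last sample of a window divided by the elapsed seconds.
func perSecondRate(first, last sample) float64 {
	elapsed := last.t - first.t
	if elapsed <= 0 {
		return 0
	}
	return (last.v - first.v) / elapsed
}

// errorPercent returns the share of failed requests in percent. This
// sketch returns 0 when there is no traffic at all.
func errorPercent(errorRate, totalRate float64) float64 {
	if totalRate == 0 {
		return 0
	}
	return errorRate / totalRate * 100
}

func main() {
	// Made-up samples over a one-minute window.
	errs := perSecondRate(sample{0, 2}, sample{60, 8})        // 6 errors in 60s
	total := perSecondRate(sample{0, 1000}, sample{60, 4000}) // 3000 requests in 60s
	pct := errorPercent(errs, total)
	fmt.Printf("error rate: %.2f%%, above 0.1%% threshold: %v\n", pct, pct > 0.1)
}
```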
  33. 33. Alert examples
      ALERT FilesystemRunningFull
        IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
        FOR 1h
        LABELS { severity = "warning" }
        ANNOTATIONS {
          summary = "Filesystem space is filling up.",
          description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.",
        }
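This rule extrapolates the last six hours of free space 24 hours into the future and alerts if the prediction drops below zero. The idea can be sketched with an ordinary least-squares line fit over made-up samples. The helpers `point`, `fitLine` and `predict` are hypothetical, and the sketch does not claim to reproduce Prometheus's implementation of `predict_linear` in detail.

```go
package main

import "fmt"

// point is one gauge sample: time in seconds and available bytes.
type point struct{ t, v float64 }

// fitLine computes the least-squares line v = slope*t + intercept.
// ok is false when the line is undefined (fewer than two distinct times).
func fitLine(pts []point) (slope, intercept float64, ok bool) {
	if len(pts) < 2 {
		return 0, 0, false
	}
	n := float64(len(pts))
	var sumT, sumV, sumTV, sumTT float64
	for _, p := range pts {
		sumT += p.t
		sumV += p.v
		sumTV += p.t * p.v
		sumTT += p.t * p.t
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, 0, false
	}
	slope = (n*sumTV - sumT*sumV) / denom
	intercept = (sumV - slope*sumT) / n
	return slope, intercept, true
}

// predict extrapolates the fitted line to the time at.
func predict(pts []point, at float64) (float64, bool) {
	slope, intercept, ok := fitLine(pts)
	if !ok {
		return 0, false
	}
	return slope*at + intercept, true
}

func main() {
	// Made-up data: free space shrinks by 1 GB every two hours over 6h.
	pts := []point{{0, 10e9}, {7200, 9e9}, {14400, 8e9}, {21600, 7e9}}
	in24h := 21600.0 + 24*60*60
	if v, ok := predict(pts, in24h); ok {
		fmt.Printf("predicted free bytes in 24h: %.0f, alert: %v\n", v, v < 0)
	}
}
```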
  34. 34. Summary
  35. 35. Recap
      ● Monitoring is essential to run, understand and operate services.
      ● Prometheus
        ○ Client instrumentation
        ○ Scrape configuration
        ○ Querying
        ○ Dashboards
        ○ Alert rules
      ● Important Metrics
        ○ Four golden signals: Latency, Traffic, Errors, Saturation
      ● Best practices
  36. 36. Sources
      ● https://prometheus.io
      ● Talks, Articles, Videos https://www.reddit.com/r/PrometheusMonitoring/
      ● Our “StackOverflow” https://www.robustperception.io/blog/
      ● Ask the community https://prometheus.io/community/
      ● Google’s SRE book https://landing.google.com/sre/book/index.html
      ● USE method http://www.brendangregg.com/usemethod.html
      ● My philosophy on alerting https://goo.gl/UnvYhQ
  37. 37. Thank you Tobias Schmidt - MicroCPH May 17, 2017 github.com/grobie - @dagrobie
  38. 38. FAQ
      ● High availability
        ○ Run two identical servers
      ● Scaling
        ○ Shard by datacenter / team / service ( / instance )
      ● Aggregation across Prometheus servers
        ○ Federation
      ● Retention time
        ○ Generic remote storage support available.
      ● Pull vs. Push
        ○ Doesn’t matter in practice. Advantages depend on use case.
      ● Security
        ○ Focused on writing a monitoring system, left to the user.
