Grafana optimization for Prometheus
Published in: Engineering

  1. Grafana optimization for Prometheus
  2. About me
     • Mitsuhiro Tanda
     • Infrastructure Engineer @ GREE
     • Using Prometheus on AWS (1.5 years)
     • Grafana committer
     • @mtanda
  3. Our environment
     • Deploy multiple Prometheus servers for each service
     • Each service launches 100 or more instances
     • Rarely use RDS; run MySQL on EC2
     • Various services and roles, and many instances
  4. Dashboard policy
     • Adapt to a dynamic environment with Auto Scaling
     • Avoid hard-coding service-specific parameters
     • Reuse the same dashboard for several services
     • Prepare dashboards for drilldown analysis
  5. Operation flow
     • Periodic check
     • Service trend
     • Alert!
     • Find a problem!
     • Drill down to the root cause
  6. Key Grafana features
     • Templating
       – Query parameter
       – Datasource
     • Panel Repeat
     • Scripted dashboard
     • Table panel (with Annotations)
  7. Templated queries
     Name                         Description
     label_values(label)          Returns a list of label values for the label in every metric.
     label_values(metric, label)  Returns a list of label values for the label in the specified metric.
     metrics(metric)              Returns a list of metrics matching the specified metric regex.
     query_result(query)          Returns a list of Prometheus query results for the query.
     Reference: http://docs.grafana.org/datasources/prometheus/
  8. Service trend
     • Prepare a dashboard for key metrics
       – CPU utilization
       – Response time
       – Etc.
     • Filter by role to look deeper
  9. Dynamic dashboard
  10. • Refresh option
       – “On Time Range Change”
       – Re-runs the query each time the time range changes
     • label_values(metrics, label_key)
       – Gets label values from metrics
       – Only matches metrics in the current time range
       – (that is, matches currently active instances)
  11. Datasource templating
  12. • Switch datasources quickly
     • Can check several services on the same dashboard
  13. Alert
     • We use PagerDuty to call the on-call engineer
     • Alert messages are also posted to chat
     • Each message contains a shortcut link to the alert dashboard
  14. Prometheus alert view
  15. Grafana alert view
  16. • Use a scripted dashboard
     • Parse the Prometheus alert view HTML
     • Generate a dashboard from it
     • https://gist.github.com/mtanda/2aba0e96d2a8aace7b6b9a903bcd6b31
  17. Alert history
  18. • Query the “ALERTS” metric of Prometheus
     • Feed the alert annotation data into a Table panel
  19. Drilldown
     • Show graphs for the corresponding instance roles
     • Host-level metrics and system metrics
     • Dashboards need to be created dynamically
     • Use a scripted dashboard
  20. • Graph definition (JSON file)
     • Instance role
     • Scripted dashboard
     • Generate dashboard!
  21. EBS latency dashboard
  22. • Filter by role and threshold
     • Quickly find problematic instances
  23. Wrap up
     • Grafana is a very powerful visualization tool
     • It is a little tricky, but very flexible
     • Make Grafana better by contributing!
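
The drilldown idea on slides 19 and 20 (graph definitions plus an instance role go in, a dashboard comes out) can be sketched roughly as follows. This is only an illustration of the data flow, not the author's code: real Grafana scripted dashboards are JavaScript files evaluated inside Grafana, and the metric names, the `__ROLE__` placeholder, the `GRAPH_DEFINITIONS` structure, and the `build_drilldown_dashboard` helper are all hypothetical. The output loosely follows the row/panel layout of dashboard JSON from that era of Grafana.

```python
import json

# Hypothetical graph definitions keyed by instance role. In the talk these
# live in a JSON file whose real contents are not shown in the slides.
GRAPH_DEFINITIONS = {
    "web": [
        {"title": "CPU utilization",
         "expr": 'example_cpu_utilization{role="__ROLE__"}'},
        {"title": "Response time",
         "expr": 'example_response_time_seconds{role="__ROLE__"}'},
    ],
    "db": [
        {"title": "EBS latency",
         "expr": 'example_ebs_latency_seconds{role="__ROLE__"}'},
    ],
}


def build_drilldown_dashboard(role, definitions):
    """Build a dashboard-like dict containing one graph panel per definition."""
    panels = []
    # Unknown roles yield a dashboard with no panels instead of an error.
    for panel_id, graph in enumerate(definitions.get(role, []), start=1):
        panels.append({
            "id": panel_id,
            "type": "graph",
            "title": graph["title"],
            # Substitute the role into the (hypothetical) query template.
            "targets": [{"expr": graph["expr"].replace("__ROLE__", role)}],
        })
    return {"title": f"Drilldown: {role}", "rows": [{"panels": panels}]}


if __name__ == "__main__":
    print(json.dumps(build_drilldown_dashboard("web", GRAPH_DEFINITIONS), indent=2))
```

Keeping the graph definitions as data means a new role only needs a new entry in the definitions, which matches the policy on slide 4 of avoiding hard-coded, service-specific parameters.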
