Presented at DevOpsDays Phoenix 2018. In this talk I demonstrate what an end-state, developer-oriented Service Dashboard can look like and discuss what it took to get there. I cover some of the trade-offs involved, such as which system to use for Alerts, and go over ways to integrate lesser-known features so that dashboard users and alert responders can more easily get to what they need.
14. ●
Who all has been counting?
●
How high have they gotten?
●
How fast have they been counting up over
time?
●
For whom, and for which time periods, do we have time series?
21. Why not just Prometheus?
When we want to combine our queries and
interact visually
22. Why not just Prometheus?
You can query the API and put the results in a table or,
better, a chart using one of the many charting libraries
available.
Grafana is a great place to start, even just to help
you get along while you work on your custom
solutions.
https://github.com/grafana/grafana/issues/7795
https://stackoverflow.com/questions/50033085/how-to-draw-a-network-diagram-in-grafana
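As a rough illustration of "query the API and throw the results in a table", here is a minimal Python sketch. It assumes a Prometheus server at a hypothetical `http://localhost:9090`. The sample response is hand-written in the shape the HTTP API returns for an instant vector; it is not real data. The function names are made up for this example.

```python
from urllib.parse import urlencode

# Assumption: Prometheus is reachable at this base URL (hypothetical).
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def vector_to_rows(response):
    """Flatten an instant-vector API response into (labels, value) rows."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    rows = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        rows.append((series["metric"], float(value)))
    return rows


# Hand-written sample in the shape the API returns; not real data.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "api"}, "value": [1525000000, "1"]},
            {"metric": {"__name__": "up", "job": "db"}, "value": [1525000000, "0"]},
        ],
    },
}

# A very plain "table": one row per time series.
for labels, value in vector_to_rows(sample_response):
    print(labels.get("job"), value)
```

In practice the rows would come from fetching `instant_query_url(...)` with an HTTP client and would feed a table or charting library rather than `print`.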
24. Grafana
●
Grafana likes to query things too
– But not periodically like Prometheus
– Only queries to render your charts
●
Grafana doesn’t record the data;
instead it saves the dashboards that query the
data
25. Grafana
●
Multiple charts coming together to make
dashboards
●
Interactive, coordinated charts
●
Variables to make queries dynamic
26. Not all metrics are equal
How high something is right now is different
from how fast it has risen over time.
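The difference between "how high" and "how fast" can be sketched in a few lines of Python. The samples and function names below are made up for illustration, and `simple_rate` is a deliberate simplification: unlike Prometheus' own `rate()`, it ignores counter resets and does no extrapolation.

```python
def current_value(samples):
    """How high is it right now? The value of the most recent sample."""
    return samples[-1][1]


def simple_rate(samples):
    """How fast has it risen? Average per-second increase over the window.

    Simplified on purpose: no counter-reset handling, no extrapolation.
    """
    (first_time, first_value), (last_time, last_value) = samples[0], samples[-1]
    return (last_value - first_value) / (last_time - first_time)


# Made-up (unix_seconds, counter_value) samples for two services.
service_a = [(0, 1000.0), (60, 1001.0), (120, 1002.0)]  # high, but barely moving
service_b = [(0, 0.0), (60, 30.0), (120, 60.0)]         # low, but climbing quickly

for name, samples in (("a", service_a), ("b", service_b)):
    print(name, current_value(samples), simple_rate(samples))
```

Service a has the higher counter, yet service b is the one growing faster, which is why the two questions need different queries and often different charts.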
27. Instant vs Ranges
●
Prometheus represents this as instant queries
versus range queries.
●
Grafana exposes this as the "Instant"
checkbox on each Prometheus query within a chart on
a dashboard.
●
This can affect certain graphs – e.g. a table is
likely to use Instant values, a graph is likely to
use Ranges, but esoteric graphs might be less
intuitive.
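On the Prometheus side, the two query types map to two HTTP API endpoints, `/api/v1/query` and `/api/v1/query_range`. The sketch below only builds the request URLs. The base URL is a hypothetical local server and the helper names are made up for this example.

```python
from urllib.parse import urlencode

BASE = "http://localhost:9090"  # assumption: hypothetical local Prometheus


def instant_url(expr, at):
    """Instant query: one value per series, evaluated at a single timestamp."""
    return f"{BASE}/api/v1/query?" + urlencode({"query": expr, "time": at})


def range_url(expr, start, end, step):
    """Range query: one value per series for every `step` seconds in the window."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{BASE}/api/v1/query_range?" + urlencode(params)


# A table would typically want the first, a graph the second.
print(instant_url("up", 1525000000))
print(range_url("up", 1525000000, 1525003600, 60))
```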
37. ●
Min/Max are usually safe and usually the most
helpful values to look at
●
Averages can be tricky.
– https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation
– http://highscalability.com/blog/2015/10/5/your-load-generator-is-probably-lying-to-you-take-the-red-pi.html
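Two common ways averages mislead can be shown with made-up numbers. This is only an illustration of the general point, not a summary of the linked articles: a single outlier hides inside a mean, and an average of per-instance averages ignores how much traffic each instance served.

```python
from statistics import mean

# Made-up latencies in milliseconds: 99 fast requests and one very slow one.
latencies_ms = [20] * 99 + [5000]

# Min and max say exactly what happened; the mean describes no real request.
print(min(latencies_ms), max(latencies_ms), mean(latencies_ms))

# Made-up (request_count, average_latency_ms) pairs for two instances.
instances = [(1000, 10.0), (10, 1000.0)]

# Naive: treat both instances as equally important.
average_of_averages = mean(avg for _count, avg in instances)

# Weighted: account for how many requests each instance actually served.
total_requests = sum(count for count, _avg in instances)
weighted_average = sum(count * avg for count, avg in instances) / total_requests

print(average_of_averages, round(weighted_average, 2))
```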
55. Effective Variables
●
Enable compaction
●
Extra linking points between dashboards
– Requires both dashboards to use same
Variable with same name, but can drive a
powerful user experience
●
Queries can reference other Variables
56. Prometheus and Multi-select Variables
●
With Prometheus specifically, when a Variable uses
Multi-select or All, match it with =~ in your queries.
Generally you will use .* or .+ as the All value
{mything="$variable"} BAD
{mything=~"$variable"} GOOD
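A small Python sketch of why the equality matcher breaks with Multi-select. It assumes the variable is expanded into a regex alternation such as `(api|web)`; the exact expansion depends on the Grafana version and variable settings. Python's `re` stands in for Prometheus' regex engine, which differs in detail, and `re.fullmatch` mirrors the fact that Prometheus regex matchers are fully anchored. All function names are hypothetical.

```python
import re


def expand_multi_value(selected):
    """Assumed expansion of a Multi-select variable into a regex alternation."""
    return "(" + "|".join(re.escape(value) for value in selected) + ")"


def equality_matcher(label_value, pattern):
    """Behaves like {mything="..."}: plain string comparison."""
    return label_value == pattern


def regex_matcher(label_value, pattern):
    """Behaves like {mything=~"..."}: the regex must match the whole value."""
    return re.fullmatch(pattern, label_value) is not None


variable = expand_multi_value(["api", "web"])
print(variable)

# "=" compares against the literal text "(api|web)", so nothing matches.
print(equality_matcher("api", variable))

# "=~" treats it as a regex, so each selected value matches.
print(regex_matcher("api", variable), regex_matcher("web", variable))
```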
57. Latest Grafana Changes
●
Newer versions of Grafana default to sane
Prometheus Multi-select values
– Still need to use =~ in your queries
●
Global variables for referencing currently
selected time range in queries
– Enables some really cool top-N graphing
capabilities
http://docs.grafana.org/features/datasources/prometheus/#using-interval-and-range-variables
https://www.robustperception.io/graph-top-n-time-series-in-grafana
68. Combining Time Series
●
We can use matchers on labels like on() and
ignoring() to whitelist/blacklist labels
●
We can use group_left and group_right to tell
Prometheus which side of a many-to-one match
is the "many" side
●
We can reduce label sets to aggregate to only
the labels wanted with by() and without()
●
We can match disjointed time series and labels
using label_replace to provide the joining label.
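To make by(), on() and group_left more concrete, here is a small Python simulation over made-up samples. It is a sketch of the matching idea only, not how Prometheus implements it, and every name in it is hypothetical. The last step corresponds roughly to the PromQL `requests / on(instance) group_left sum by (instance) (requests)`.

```python
def sum_by(samples, by):
    """Sketch of `sum by (<labels>) (...)`: keep only `by` labels and add values."""
    totals = {}
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] = totals.get(key, 0.0) + value
    return [(dict(key), total) for key, total in totals.items()]


def divide_on_group_left(many, one, on):
    """Sketch of `many / on(<labels>) group_left one`.

    Each series on the "many" side is matched to the single "one"-side
    series sharing its `on` labels; the result keeps the "many" side's labels.
    """
    lookup = {}
    for labels, value in one:
        key = tuple(labels.get(name, "") for name in on)
        if key in lookup:
            raise ValueError("more than one match on the 'one' side")
        lookup[key] = value
    result = []
    for labels, value in many:
        key = tuple(labels.get(name, "") for name in on)
        if key in lookup:  # series without a partner are dropped
            result.append((labels, value / lookup[key]))
    return result


# Made-up request rates per instance and method.
requests = [
    ({"instance": "a", "method": "GET"}, 30.0),
    ({"instance": "a", "method": "POST"}, 10.0),
    ({"instance": "b", "method": "GET"}, 5.0),
]

totals = sum_by(requests, ["instance"])
shares = divide_on_group_left(requests, totals, ["instance"])
for labels, share in shares:
    print(labels, share)
```

Each output row is the share of an instance's traffic that a given method accounts for.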
69. Combining Time Series
●
https://www.robustperception.io/using-group_left-to-calculate-label-proportions
– demonstrates group_left with ignoring() and
then without() to reduce labels twice
●
https://www.robustperception.io/how-to-have-labels-for-machine-roles
– demonstrates group_left with on() and then
by() to reduce labels twice
●
https://www.robustperception.io/understanding-machine-cpu-usage
– demonstrates using by() to reduce labels
71. Performance | Upstream
●
Prefer changing instrumentation code or
exporter configuration if possible!
●
Consider pruning unused time series, using
outlier queries to identify them
– https://www.robustperception.io/which-are-my-biggest-metrics
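The idea behind finding the biggest metrics is simply counting time series per metric name. The Python sketch below does that over a made-up list of label sets, similar in shape to what the `/api/v1/series` endpoint returns. The function name and data are hypothetical, and the linked article does this with a PromQL query rather than in code.

```python
from collections import Counter


def biggest_metrics(series, n):
    """Return the n metric names with the most time series, largest first."""
    counts = Counter(labels["__name__"] for labels in series)
    return counts.most_common(n)


# Made-up label sets, one per time series.
series = [
    {"__name__": "http_requests_total", "code": "200"},
    {"__name__": "http_requests_total", "code": "404"},
    {"__name__": "http_requests_total", "code": "500"},
    {"__name__": "up", "job": "api"},
]

print(biggest_metrics(series, 2))
```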
84. Effective Alerts
●
How can you help the person who needs to take
action actually take it?
●
Think of what the person getting the
notification will (have to) do.
88. Thanks!
●
If you have any questions or would like to reach
out:
●
My name is Jasmine Hegman
– jasmine@jhegman.com
– http://twitter.com/hegpetz
– https://www.linkedin.com/in/jasminehegman