Prometheus for Monitoring Metrics (Fermilab 2018)

Brian Brazil
Founder

Who am I?
● One of the main developers of Prometheus
● Founder of Robust Perception
● Contributor to many open source projects

Why monitor?
● Know when things go wrong
○ To call in a human to prevent a business-level issue
● Be able to debug and gain insight
● Trending to see changes over time, and drive
technical/business decisions
● To feed into other systems/processes

Monitor as a Service, not as Machines

What is Prometheus?
Metrics monitoring system (not logs).
A time series database. A query language.
Client libraries. An Ecosystem.
A modern approach to monitoring services.

Client Libraries
Instrument your code to capture the metrics that
matter to you.
If upstream libraries are instrumented, you get that
for free!
There are also many exporters: cAdvisor, MySQL, MongoDB,
SNMP, JMX, HAProxy, Minecraft, Factorio...

Let’s Talk Code
pip install prometheus_client
from prometheus_client import Summary, start_http_server
REQUEST_DURATION = Summary('request_duration_seconds',
                           'Request duration in seconds')
@REQUEST_DURATION.time()
def my_handler(request):
    pass  # Your code here
start_http_server(8000)
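As a rough illustration of what the Summary above tracks: it exposes a count of observations and a sum of the observed values, here request durations in seconds. The stdlib-only sketch below mimics that idea with a hypothetical TinySummary class. It is an assumption-laden teaching aid, not the prometheus_client implementation.

```python
import functools
from time import perf_counter


class TinySummary:
    """Hypothetical stand-in for a Summary: tracks count and sum only."""

    def __init__(self, name):
        self.name = name
        self.count = 0    # number of observations
        self.sum = 0.0    # total of all observed values

    def observe(self, value):
        self.count += 1
        self.sum += value

    def time(self):
        """Return a decorator that observes each call's duration in seconds."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Record the duration even if the handler raised.
                    self.observe(perf_counter() - start)
            return wrapper
        return decorator


REQUEST_DURATION = TinySummary('request_duration_seconds')


@REQUEST_DURATION.time()
def my_handler(request):
    return 'ok'


my_handler(None)
my_handler(None)
print(REQUEST_DURATION.count, REQUEST_DURATION.sum)
```

Dividing the sum by the count gives the mean duration, which is why both values are worth keeping.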

Multiple Dimensions
from prometheus_client import Counter
REQUESTS = Counter('requests_total',
                   'Total requests', ['method'])
REQUESTS.labels(request.method).inc()
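To make the idea of dimensions concrete: each distinct label value gets its own time series. The sketch below uses a hypothetical TinyCounter class (not part of prometheus_client) to show how one metric name fans out into one sample per label combination, rendered in the style of the Prometheus text format.

```python
from collections import defaultdict


class TinyCounter:
    """Hypothetical labelled counter: one value per label combination."""

    def __init__(self, name, labelnames):
        self.name = name
        self.labelnames = tuple(labelnames)
        self.values = defaultdict(float)

    def inc(self, *labelvalues, amount=1):
        if len(labelvalues) != len(self.labelnames):
            raise ValueError('wrong number of label values')
        self.values[labelvalues] += amount

    def render(self):
        """Render one line per label combination, sorted for stable output."""
        lines = []
        for labelvalues, value in sorted(self.values.items()):
            labels = ','.join(
                f'{k}="{v}"' for k, v in zip(self.labelnames, labelvalues))
            lines.append(f'{self.name}{{{labels}}} {value}')
        return '\n'.join(lines)


requests = TinyCounter('requests_total', ['method'])
requests.inc('GET')
requests.inc('GET')
requests.inc('POST')
print(requests.render())
```

Because every label combination is a separate series, labels with many possible values multiply the number of series to store.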

Exceptional Circumstances In Progress
from prometheus_client import Counter, Gauge
EXCEPTIONS = Counter('exceptions_total', 'Total exceptions')
IN_PROGRESS = Gauge('inprogress_requests', 'In progress')
@EXCEPTIONS.count_exceptions()
@IN_PROGRESS.track_inprogress()
def my_handler(request):
    pass  # Your code here
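Conceptually, the two decorators above wrap the handler so that a gauge rises on entry and falls on exit, while a counter goes up whenever an exception escapes. A minimal stdlib-only sketch of that behaviour, using a hypothetical instrument() helper and a plain dict in place of real metrics:

```python
import functools


def instrument(state):
    """Hypothetical decorator: tracks in-flight calls and raised exceptions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            state['in_progress'] += 1        # gauge goes up on entry
            try:
                return func(*args, **kwargs)
            except Exception:
                state['exceptions'] += 1     # counter only ever increases
                raise
            finally:
                state['in_progress'] -= 1    # gauge goes down on exit
        return wrapper
    return decorator


state = {'in_progress': 0, 'exceptions': 0}


@instrument(state)
def my_handler(request):
    if request is None:
        raise ValueError('no request')
    return 'ok'


my_handler('GET /')
try:
    my_handler(None)
except ValueError:
    pass
print(state)
```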

Getting Data Out
from prometheus_client import start_http_server
if __name__ == '__main__':
    start_http_server(8080)
Also possible with Django, Twisted etc.

The PromQL Query Language
Arbitrary aggregation, joins and slicing all possible.
Can calculate how close you'll be to your quota in 4
hours, or the 95th percentile latency across an entire
datacenter.
If you can graph it, you can alert on it!
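The quota example can be pictured as fitting a straight line to recent samples and extending it four hours ahead, similar in spirit to PromQL's predict_linear() function. The sketch below is a simplified stand-in under stated assumptions: the sample data and quota are made up, and it extrapolates from the last sample rather than from the query evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it seconds_ahead past the most recent sample."""
    n = len(samples)
    if n < 2:
        raise ValueError('need at least two samples')
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


# Hypothetical usage in bytes, sampled once a minute for an hour,
# starting at 50 GB and growing by 1 MB per minute.
samples = [(60 * i, 50e9 + 1e6 * i) for i in range(61)]
quota = 51e9  # hypothetical quota in bytes
predicted = predict_linear(samples, 4 * 3600)
print(f'predicted fraction of quota in 4h: {predicted / quota:.3f}')
```

An alert could then fire when the predicted value exceeds the quota, well before the quota is actually reached.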

Analytics: Top 5 Docker images by CPU
topk(5,
  sum by (image)(
    rate(container_cpu_usage_seconds_total{
      id=~"/system.slice/docker.*"}[5m]
    )
  )
)
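Read from the inside out, the query takes a per-second rate of each container's cumulative CPU counter, sums those rates per image, then keeps the five largest. The Python sketch below walks through the same three steps on made-up data. It is a simplification: the real rate() also handles counter resets and extrapolation, which are ignored here.

```python
from collections import defaultdict


def top_images_by_cpu(series, k=5):
    """series: (image, first_value, last_value, window_seconds) tuples for
    cumulative CPU-seconds counters. Returns the k images with the highest
    summed per-second rate, largest first."""
    by_image = defaultdict(float)
    for image, first, last, window in series:
        # Per-second increase over the window, then summed by image.
        by_image[image] += (last - first) / window
    ranked = sorted(by_image.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Hypothetical containers: counter values at the start and end of a 5m window.
series = [
    ('nginx', 100.0, 130.0, 300),
    ('nginx', 200.0, 260.0, 300),
    ('redis', 50.0, 65.0, 300),
    ('postgres', 10.0, 160.0, 300),
]
top = top_images_by_cpu(series, k=2)
for image, cpu in top:
    print(f'{image}: {cpu:.2f} CPU cores')
```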

Heterogeneity
Not all VMs are equal.
Noisy neighbours mean different application
instances have different performance.
But PromQL can aggregate latency across
instances, allowing you to alert on overall end-user
visible latency rather than outliers.
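One way to picture this aggregation: add up the time spent and the requests served across all instances, then divide. In PromQL terms that corresponds to dividing the summed rate of a duration sum by the summed rate of its count. The sketch below uses made-up numbers to show how one slow instance shifts the overall mean without dominating it.

```python
def overall_mean_latency(instances):
    """instances: (duration_sum_seconds, request_count) increases over the
    same window, one pair per application instance."""
    total_seconds = sum(s for s, _ in instances)
    total_requests = sum(c for _, c in instances)
    if total_requests == 0:
        return float('nan')  # no traffic, so no meaningful latency
    return total_seconds / total_requests


# Hypothetical: two healthy instances and one with a noisy neighbour.
instances = [(10.0, 100), (12.0, 100), (40.0, 50)]
mean = overall_mean_latency(instances)
worst = max(s / c for s, c in instances)
print(f'overall mean: {mean:.3f}s, worst instance: {worst:.3f}s')
```

Alerting on the overall figure reflects what users experience in aggregate, while the per-instance figures remain available for debugging.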

Alert management
Not every alert results in a page.
Group similar alerts together, route them to the right
team and throttle notifications.
Designed to work reliably during network partitions.
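Grouping can be thought of as bucketing firing alerts by a chosen set of labels, so each bucket produces one notification instead of one per alert. The sketch below illustrates the idea only. The label names and alerts are hypothetical, and the real Alertmanager adds routing, timing, and deduplication on top.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, '') for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts; label names are illustrative.
alerts = [
    {'alertname': 'HighLatency', 'team': 'web', 'instance': 'a'},
    {'alertname': 'HighLatency', 'team': 'web', 'instance': 'b'},
    {'alertname': 'DiskFull', 'team': 'db', 'instance': 'c'},
]
groups = group_alerts(alerts, ['alertname', 'team'])
for key, members in groups.items():
    print(key, len(members), 'alert(s) -> one notification')
```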

Reliability is Key
Core Prometheus server is a single binary.
Each Prometheus server is independent.
No clustering or attempts to backfill "missing" data
when scrapes fail.
Optional remote storage for long-term retention.

Monitoring Approach
Service management went from manual to Chef to
Kubernetes. Need to do the same for monitoring.
Care about what matters to end users, such as
latency and error rates.
Distracting a human with alerts for everything that's
vaguely off only leads to burnout.

A Rich Community
Today there are 750+ contributors to the core
repositories, and 350+ 3rd party integrations.
There are 1000+ subscribers on our mailing lists,
600+ people in IRC and an estimated 10000+
companies using Prometheus in production.
Many companies funding Prometheus development.

Prometheus: The Book
Coming in 2018!

Resources
Official Project Website: prometheus.io
User Mailing List: prometheus-users@googlegroups.com
Dev Mailing List: prometheus-developers@googlegroups.com
IRC: #prometheus on chat.freenode.net
Robust Perception Blog: www.robustperception.io/blog
