● Know when things go wrong
○ To call in a human to prevent a business-level issue, or prevent an issue in advance
● Be able to debug and gain insight
● Trending to see changes over time, and drive technical/business decisions
● To feed into other systems/processes (e.g. QA, security, automation)
Common Monitoring Challenges
Themes common among companies I’ve talk to:
● Monitoring tools are limited, both technically and conceptually
● Tools don’t scale well and are unwieldy to manage
● Operational practices don’t align with the business
Your customers care about increased latency and it’s in your SLAs. You can only
alert on individual machine CPU usage.
Result: Engineers continuously woken up for non-issues, get fatigued
Inspired by Google’s Borgmon monitoring system.
Started in 2012 by ex-Googlers working in Soundcloud as an open source project,
mainly written in Go. Publically launched in early 2015, and continues to be
independent of any one company.
Over 100 companies have started relying on it since then.
What does Prometheus offer?
● Inclusive Monitoring
● Powerful data model
● Powerful query language
● Manageable and Reliable
● Easy to integrate with
Don’t monitor just at the edges:
● Instrument client libraries
● Instrument server libraries (e.g. HTTP/RPC)
● Instrument business logic
Library authors get information about usage.
Application developers get monitoring of common components for free.
Dashboards and alerting can be provided out of the box, customised for your
Powerful Data Model
All metrics have arbitrary multi-dimensional labels.
No need to force your model into dotted strings.
Can aggregate, cut, and slice along them.
Supports any double value, labels support full unicode.
Powerful Query Language
Can multiply, add, aggregate, join, predict, take quantiles across many metrics in
the same query. Can evaluate right now, and graph back in time.
Answer questions like:
● What’s the 95th percentile latency in the European datacenter?
● How full will the disks be in 4 hours?
● Which services are the top 5 users of CPU?
Can alert based on any query.
Manageable and Reliable
Core Prometheus server is a single binary.
Doesn’t depend on Zookeeper, Consul, Cassandra, Hadoop or the Internet.
Only requires local disk (SSD recommended). No potential for cascading failure.
Pull based, so easy to on run a workstation for testing and rogue servers can’t
push bad metrics.
Advanced service discovery finds what to monitor.
Instrumenting everything means a lot of data.
Prometheus is best in class for lossless storage efficiency, 3.5 bytes per datapoint.
A single server can handle:
● millions of metrics
● hundreds of thousands of datapoints per second
Prometheus is easy to run, can give one to each team in each datacenter.
Federation allows pulling key metrics from other Prometheus servers.
When one job is too big for a single Prometheus server, can use
sharding+federation to scale out. Needed with thousands of machines.
Easy to integrate with
Many existing integrations: Java, JMX, Python, Go, Ruby, .Net, Machine,
Cloudwatch, EC2, MySQL, PostgreSQL, Haskell, Bash, Node.js, SNMP, Consul,
HAProxy, Mesos, Bind, CouchDB, Django, Mtail, Heka, Memcached, RabbitMQ,
Redis, RethinkDB, Rsyslog, Meteor.js, Minecraft...
Graphite, Statsd, Collectd, Scollector, Munin, Nagios integrations aid transition.
It’s so easy, most of the above were written without the core team even knowing
What do we do?
Robust Perception is an independent provider of Prometheus-related services.
We can help you:
● Decide if Prometheus is for you
● Manage your transition to Prometheus
● Resolve issues that arise
● Use Prometheus to run and scale your production systems efficiently
We are proud to be among the core contributors to the Prometheus project.
Official Project Website: prometheus.io
Official Mailing List: firstname.lastname@example.org
Robust Perception Website: www.robustperception.io