Who am I?
Engineer passionate about running software reliably in production.
● TCD CS Degree
● Google SRE for 7 years, working on high-scale reliable systems such as
Adwords, Adsense, Ad Exchange, Billing, Database
● Boxever TL Systems & Infrastructure, applied processes and technology to let the
company scale and reduce operational load
● Contributor to many open source projects, including Prometheus, Ansible,
Python, Aurora and Zookeeper.
● Founder of Robust Perception, making scalability and efficiency available to
everyone
Prometheus
Inspired by Google’s Borgmon monitoring system.
Started in 2012 by ex-Googlers working at SoundCloud as an open source project.
Mainly written in Go. Publicly launched in early 2015.
100+ companies using it including Digital Ocean, GoPro, Apple, Red Hat and
Google.
Why monitor?
● Know when things go wrong
○ To call in a human to prevent a business-level issue, or prevent an issue in advance
● Be able to debug and gain insight
● Trending to see changes over time, and drive technical/business decisions
● To feed into other systems/processes (e.g. QA, security, automation)
Inclusive Monitoring
Don’t monitor just at the edges:
● Instrument client libraries
● Instrument server libraries (e.g. HTTP/RPC)
● Instrument business logic
Library authors get information about usage.
Application developers get monitoring of common components for free.
Dashboards and alerting can be provided out of the box, customised for your
organisation!
Prometheus is About Metrics, not Events
Event based monitoring such as logging is limited in how much data you can have
per event.
Each piece of data about each event needs to be stored and processed, which is
challenging to scale.
Metric based monitoring lets you have thousands of metrics, so you can track the
performance of every subsystem.
Prometheus regularly polls in-memory state of metrics.
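To illustrate the difference, here is a minimal sketch using only the Java standard library, not the Prometheus client. The class name RequestMetrics and the metric name http_requests_total are hypothetical. However many events occur, the process keeps a single number in memory, and a poller only ever reads its current value.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: a hand-rolled counter, not the Prometheus client library.
public class RequestMetrics {
    // One in-memory value, no matter how many events are counted.
    private static final LongAdder requests = new LongAdder();

    // Called on every event; nothing is stored per event.
    public static void recordRequest() {
        requests.increment();
    }

    // Current count, as a poller would see it.
    public static long total() {
        return requests.sum();
    }

    // The current state rendered as a single text line for a poller to read.
    public static String scrape() {
        return "http_requests_total " + total();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            recordRequest();
        }
        System.out.println(scrape());
    }
}
```

Memory use here depends on the number of metrics, not on the number of events, which is what makes this approach cheap to scale.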
What about Hadoop?
Batch jobs such as MapReduces are a very common way to use Hadoop.
How do you check that your regular jobs are working today?
● Checking dashboards?
● Emails about every run?
● Emails on failure?
What do you really care about?
The thing you want to know is:
Has my batch job been successful recently enough?
So let’s monitor that!
Java snippet
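import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;
import org.apache.hadoop.mapred.JobClient;
CollectorRegistry registry = new CollectorRegistry();
JobClient.runJob(job); // Submit job to Hadoop and wait for completion; throws if the job fails.
Gauge lastSuccess = Gauge.build()
    .name("my_batch_job_last_success")
    .help("Last time my batch job succeeded, in unixtime.")
    .register(registry);
lastSuccess.setToCurrentTime(); // Only reached if the job succeeded.
// Push the metric to the Pushgateway, where Prometheus can scrape it.
PushGateway pg = new PushGateway("127.0.0.1:9091");
pg.pushAdd(registry, "my_batch_job");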
Prometheus Alerts
Prometheus has a powerful expression language that can be used in graphs,
pre-calculation and alerts.
Let’s alert if our batch job hasn’t succeeded in a day:
ALERT MyBatchJobNotSuccessfulRecently
IF time() - my_batch_job_last_success{job="my_batch_job"}
> 86400
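To make the arithmetic in the rule concrete, here is a small Java sketch of the same comparison. It is not how Prometheus evaluates rules; the class and method names are hypothetical.

```java
public class BatchJobStaleness {
    public static final long ONE_DAY_SECONDS = 86400;

    // Mirrors the rule: time() - my_batch_job_last_success > 86400.
    public static boolean shouldAlert(long nowUnixSeconds, long lastSuccessUnixSeconds) {
        return nowUnixSeconds - lastSuccessUnixSeconds > ONE_DAY_SECONDS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000;
        System.out.println(shouldAlert(now, now - 3600));      // succeeded an hour ago: false
        System.out.println(shouldAlert(now, now - 2 * 86400)); // succeeded two days ago: true
    }
}
```

Because the check is on the age of the last success, it also fires when the job stops running altogether, which an email-on-failure scheme would miss.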
New World!
No longer have to manually check dashboards or emails every single day for
every single batch job.
Monitoring and alerting are now aligned with what we care about.
More reliable, and scales better too!
Aside: Idempotency and Frequency
You shouldn’t care about a single failure.
To make things even easier to manage, write your batch jobs so that if one run
fails the next run will automatically catch up.
Then run your batch jobs at least twice as often as needed.
Result: A single failure is automatically handled, and if there is a problem you run
it again. No more messing with command line flags and config files!
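One way to get this catch-up behaviour is to have each run process everything since the last successful checkpoint, rather than a fixed window. A minimal sketch, assuming the input is split into hourly partitions identified by an integer; CatchUpJob, processHour and the in-memory checkpoint are hypothetical stand-ins for a real job and durable storage.

```java
import java.util.ArrayList;
import java.util.List;

public class CatchUpJob {
    // Last hour fully processed; a real job would persist this durably.
    private long checkpoint;
    private final List<Long> processed = new ArrayList<>();

    public CatchUpJob(long initialCheckpoint) {
        this.checkpoint = initialCheckpoint;
    }

    // Process every hour after the checkpoint, up to and including currentHour.
    // The checkpoint advances only after an hour succeeds, so a failed run
    // leaves the remaining hours for the next run to pick up.
    public void run(long currentHour) {
        for (long hour = checkpoint + 1; hour <= currentHour; hour++) {
            processHour(hour);
            checkpoint = hour;
        }
    }

    // Hypothetical unit of work; here it only records which hour was handled.
    protected void processHour(long hour) {
        processed.add(hour);
    }

    public long checkpoint() {
        return checkpoint;
    }

    public List<Long> processed() {
        return processed;
    }

    public static void main(String[] args) {
        CatchUpJob job = new CatchUpJob(10);
        // Suppose the run at hour 11 never happened; the run at hour 12 covers both.
        job.run(12);
        System.out.println(job.processed()); // [11, 12]
    }
}
```

Running the job again for the same hour is a no-op, so scheduling it twice as often as needed costs little and absorbs a single failed run.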
Beyond Batch
Prometheus has integrations with 50+ other systems, including JMX, EC2,
MySQL, PostgreSQL, Redis, MongoDB, CouchDB, RethinkDB, collectd,
Graphite, Nagios, InfluxDB, Django, mtail, Heka, Memcached, RabbitMQ,
Rsyslog, HAProxy, Meteor.js, Java, Haskell, Python, Go, Ruby, .NET,
Machine, CloudWatch, Minecraft…
Easy to run, easy to use, easy to scale.
A single Prometheus can handle over 100k samples per second!
Powerful Data Model
All metrics have arbitrary multi-dimensional labels.
No need to force your model into dotted strings.
Can aggregate, cut, and slice along them.
Values can be any double; labels support full Unicode.
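As a rough illustration of slicing along labels, the sketch below models a sample as a label map plus a double value and sums by one chosen label. This is plain Java written for this example, not Prometheus code; the label names, values and the sumBy helper are all hypothetical. It uses List.of and Map.of, so it assumes Java 9 or later.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LabelAggregation {
    // A sample: a set of labels plus a double value.
    public static final class Sample {
        final Map<String, String> labels;
        final double value;

        public Sample(Map<String, String> labels, double value) {
            this.labels = labels;
            this.value = value;
        }
    }

    // Sum sample values grouped by a single label.
    // Samples missing the label are grouped under the empty string.
    public static Map<String, Double> sumBy(String label, List<Sample> samples) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sample s : samples) {
            String key = s.labels.getOrDefault(label, "");
            totals.merge(key, s.value, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
            new Sample(Map.of("datacenter", "eu", "method", "GET"), 120.0),
            new Sample(Map.of("datacenter", "eu", "method", "POST"), 30.0),
            new Sample(Map.of("datacenter", "us", "method", "GET"), 200.0));
        System.out.println(sumBy("datacenter", samples)); // {eu=150.0, us=200.0}
        System.out.println(sumBy("method", samples));     // {GET=320.0, POST=30.0}
    }
}
```

The same data answers both questions without renaming anything, which is awkward when dimensions are baked into a dotted metric name.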
Powerful Query Language
Can multiply, add, aggregate, join, predict, take quantiles across many metrics in
the same query. Can evaluate right now, and graph back in time.
Answer questions like:
● What’s the 95th percentile latency in the European datacenter?
● How full will the disks be in 4 hours?
● Which services are the top 5 users of CPU?
Can alert based on any query.
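The disk question above comes down to fitting a line through recent samples and extrapolating. The sketch below shows that idea as a simple least-squares fit in Java. It is an illustration of the arithmetic only, not Prometheus's implementation; the class name, method name and sample values are hypothetical.

```java
public class DiskPrediction {
    // Fit value = slope * t + intercept by least squares over the samples,
    // then extrapolate secondsAhead past the last sample's timestamp.
    // Assumes at least two samples with distinct timestamps.
    public static double predictLinear(double[] t, double[] v, double secondsAhead) {
        int n = t.length;
        double meanT = 0;
        double meanV = 0;
        for (int i = 0; i < n; i++) {
            meanT += t[i];
            meanV += v[i];
        }
        meanT /= n;
        meanV /= n;

        double cov = 0;
        double var = 0;
        for (int i = 0; i < n; i++) {
            cov += (t[i] - meanT) * (v[i] - meanV);
            var += (t[i] - meanT) * (t[i] - meanT);
        }
        double slope = cov / var;
        double intercept = meanV - slope * meanT;
        return slope * (t[n - 1] + secondsAhead) + intercept;
    }

    public static void main(String[] args) {
        // Disk usage in percent, sampled every 30 minutes (made-up numbers).
        double[] t = {0, 1800, 3600};
        double[] v = {50, 51, 52};
        // Growing 2 points per hour from 52%, so roughly 60% in 4 hours.
        System.out.println(predictLinear(t, v, 4 * 3600));
    }
}
```

An alert on the predicted value crossing a threshold gives warning before the disk fills, rather than after.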
What does this all mean for Hadoop?
Due to its extensive integrations, Prometheus can monitor Hadoop and the rest of
your infrastructure and applications.
With its powerful data model and query language, you can graph and alert on what
matters - not what your monitoring system limits you to.
Better alerts with fewer false positives means more sleep, higher reliability and
more confidence that your system is functioning correctly.
Resources
Official Project Website: prometheus.io
Official Mailing List: prometheus-developers@googlegroups.com
Demo: demo.robustperception.io
Robust Perception Website: www.robustperception.io
Queries: prometheus@robustperception.io