Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Founder at Robust Perception
Dec. 14, 2015

  1. Brian Brazil Founder Monitoring Hadoop with Prometheus Making batch jobs manageable
  2. Who am I? Engineer passionate about running software reliably in production. ● TCD CS Degree ● Google SRE for 7 years, working on high-scale reliable systems such as AdWords, AdSense, Ad Exchange, Billing, Database ● Boxever TL Systems & Infrastructure, applied processes and technology to allow the company to scale and reduce operational load ● Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper. ● Founder of Robust Perception, making scalability and efficiency available to everyone
  3. Prometheus Inspired by Google’s Borgmon monitoring system. Started in 2012 by ex-Googlers working at SoundCloud as an open source project. Mainly written in Go. Publicly launched in early 2015. 100+ companies using it, including DigitalOcean, GoPro, Apple, Red Hat and Google.
  4. Why monitor? ● Know when things go wrong ○ To call in a human to prevent a business-level issue, or prevent an issue in advance ● Be able to debug and gain insight ● Trending to see changes over time, and drive technical/business decisions ● To feed into other systems/processes (e.g. QA, security, automation)
  5. Your Services Shouldn’t be a Black Box
  6. Services have Internals
  7. Monitor the Internals
  8. Monitor as a Service, not as Machines
  9. Inclusive Monitoring Don’t monitor just at the edges: ● Instrument client libraries ● Instrument server libraries (e.g. HTTP/RPC) ● Instrument business logic Library authors get information about usage. Application developers get monitoring of common components for free. Dashboards and alerting can be provided out of the box, customised for your organisation!
  10. Prometheus is About Metrics, not Events Event-based monitoring such as logging is limited in how much data you can have per event. Each piece of data about each event needs to be stored and processed, which is challenging to scale. Metric-based monitoring allows you to have thousands of metrics, allowing you to track performance of every subsystem. Prometheus regularly polls in-memory state of metrics.
  11. What about Hadoop? Batch jobs such as MapReduces are a very common way to use Hadoop. How do you check that your regular jobs are working today? ● Checking dashboards? ● Emails about every run? ● Emails on failure?
  12. What do you really care about? The thing you want to know is: Has my batch job been successful recently enough? So let’s monitor that!
  13. Introducing the Pushgateway The Pushgateway holds metric state for ephemeral jobs.
  14. Java snippet CollectorRegistry registry = new CollectorRegistry(); JobClient.runJob(job); // Submit job to Hadoop and wait for completion. Gauge lastSuccess = Gauge.build() .name("my_batch_job_last_success") .help("Last time my batch job succeeded, in unixtime.") .register(registry); lastSuccess.setToCurrentTime(); PushGateway pg = new PushGateway("127.0.0.1:9091"); pg.pushAdd(registry, "my_batch_job");
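
The push on slide 14 can also be approximated with the Java standard library alone, which helps show what travels over the wire. This is a minimal sketch, assuming the Pushgateway's HTTP API (metrics sent in the text exposition format to `/metrics/job/<job>`); the class `LastSuccessPush` and its methods are hypothetical names and are not part of the Prometheus Java client.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the Prometheus client library.
final class LastSuccessPush {

    // Render a single gauge sample in the Prometheus text exposition format.
    static String renderGauge(String name, String help, double value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " gauge\n"
             + name + " " + value + "\n";
    }

    // The Pushgateway groups pushed metrics by the job name in the URL path.
    static String pushUrl(String address, String job) {
        return "http://" + address + "/metrics/job/" + job;
    }

    // POST replaces only metrics with the same name within the group,
    // which is assumed here to correspond to pushAdd on the slide.
    static int push(String address, String job, String body) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(pushUrl(address, job)).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/plain; version=0.0.4");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```

After the Hadoop job returns successfully, a call such as `LastSuccessPush.push("127.0.0.1:9091", "my_batch_job", LastSuccessPush.renderGauge("my_batch_job_last_success", "Last time my batch job succeeded, in unixtime.", System.currentTimeMillis() / 1000.0))` would record the success time in seconds.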
  15. Prometheus Alerts Prometheus has a powerful expression language that can be used in graphs, precalculation and alerts. Let’s alert if our batch job hasn’t succeeded in a day: ALERT MyBatchJobNotSuccessfulRecently IF time() - my_batch_job_last_success{job="my_batch_job"} > 86400
  16. New World! No longer have to manually check dashboards or emails every single day for every single batch job. Monitoring and alerting are now aligned with what we care about. More reliable, and scales better too!
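
The condition in the alert on slide 15 is plain arithmetic on Unix timestamps: the alert fires once the age of the last success exceeds 86400 seconds (one day). A minimal sketch of the same comparison, with the hypothetical class `BatchJobFreshness` standing in for the expression; Prometheus evaluates the real rule itself.

```java
// Hypothetical helper mirroring the comparison in the alert expression.
final class BatchJobFreshness {

    static final long ONE_DAY_SECONDS = 86400;

    // True when the last success is strictly older than maxAgeSeconds,
    // matching "time() - my_batch_job_last_success > 86400".
    static boolean notSuccessfulRecently(long nowSeconds,
                                         long lastSuccessSeconds,
                                         long maxAgeSeconds) {
        return nowSeconds - lastSuccessSeconds > maxAgeSeconds;
    }
}
```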
  17. Aside: Idempotency and Frequency You shouldn’t care about a single failure. To make things even easier to manage, write your batch jobs so that if one run fails the next run will automatically catch up. Then run your batch jobs at least twice as often as needed. Result: A single failure is automatically handled, and if there is a problem you run it again. No more messing with command line flags and config files!
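
One way to get the catch-up behaviour described on slide 17 is to have each run work out what is still outstanding rather than processing a fixed "today". The sketch below assumes a job that handles one day of data at a time and remembers the last day it processed successfully; `CatchUpPlanner` and `daysToProcess` are hypothetical names for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: decides which days a run should cover.
final class CatchUpPlanner {

    // Every day after the last successfully processed day, up to and
    // including today. If the previous run failed, its day is still
    // outstanding, so the next run picks it up automatically.
    static List<LocalDate> daysToProcess(LocalDate lastProcessed, LocalDate today) {
        List<LocalDate> days = new ArrayList<>();
        for (LocalDate d = lastProcessed.plusDays(1); !d.isAfter(today); d = d.plusDays(1)) {
            days.add(d);
        }
        return days;
    }
}
```

Because a run with nothing outstanding returns an empty list, scheduling the job more often than strictly needed costs little and leaves room for a failed run to be absorbed.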
  18. Beyond Batch Prometheus has integrations with 50+ other systems, including JMX, EC2, MySQL, PostgreSQL, Redis, MongoDB, CouchDB, RethinkDB, Collectd, Graphite, Nagios, InfluxDB, Django, Mtail, Heka, Memcached, RabbitMQ, Rsyslog, HAProxy, Meteor.js, Java, Haskell, Python, Go, Ruby, .NET, Machine, CloudWatch, Minecraft… Easy to run, easy to use, easy to scale. A single Prometheus can handle over 100k samples per second!
  19. Powerful Data Model All metrics have arbitrary multi-dimensional labels. No need to force your model into dotted strings. Can aggregate, cut, and slice along them. Supports any double value, labels support full unicode.
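
To make the label model on slide 19 concrete, the sketch below represents each sample as a set of key/value labels plus a value, then sums along one label. It is only an illustration of slicing by labels: the `LabelAggregation` class and the label names used with it are hypothetical, and Prometheus performs this kind of aggregation in its own query language.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of labelled samples.
final class LabelAggregation {

    // One sample: a label set and a double value.
    static final class Sample {
        final Map<String, String> labels;
        final double value;

        Sample(Map<String, String> labels, double value) {
            this.labels = labels;
            this.value = value;
        }
    }

    // Sum sample values grouped by a single label; samples lacking the
    // label are grouped under the empty string.
    static Map<String, Double> sumBy(String label, List<Sample> samples) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sample s : samples) {
            String key = s.labels.getOrDefault(label, "");
            totals.merge(key, s.value, Double::sum);
        }
        return totals;
    }
}
```

With dotted metric names, the same question would need the dimension to sit at a known position in every name; with labels, any dimension can be chosen at query time.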
  20. Powerful Query Language Can multiply, add, aggregate, join, predict, take quantiles across many metrics in the same query. Can evaluate right now, and graph back in time. Answer questions like: ● What’s the 95th percentile latency in the European datacenter? ● How full will the disks be in 4 hours? ● Which services are the top 5 users of CPU? Can alert based on any query.
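
The disk question on slide 20 comes down to fitting a trend to recent samples and extrapolating it. The sketch below uses an ordinary least-squares line for that purpose; it illustrates the idea only and is not Prometheus's own implementation. The class `LinearPrediction` and its parameters are hypothetical.

```java
// Hypothetical helper: linear extrapolation of a time series.
final class LinearPrediction {

    // Fit value = slope * t + intercept by least squares, then evaluate
    // the line at the last timestamp plus horizonSeconds.
    // Requires at least two samples with distinct timestamps.
    static double predict(double[] t, double[] v, double horizonSeconds) {
        int n = t.length;
        double sumT = 0, sumV = 0, sumTT = 0, sumTV = 0;
        for (int i = 0; i < n; i++) {
            sumT += t[i];
            sumV += v[i];
            sumTT += t[i] * t[i];
            sumTV += t[i] * v[i];
        }
        double slope = (n * sumTV - sumT * sumV) / (n * sumTT - sumT * sumT);
        double intercept = (sumV - slope * sumT) / n;
        return slope * (t[n - 1] + horizonSeconds) + intercept;
    }
}
```

For example, a disk at 50%, 55% and 60% full at one-hour intervals grows by 5 points per hour, so extrapolating 4 hours (14400 seconds) past the last sample gives roughly 80%.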
  21. Dashboards
  22. What does this all mean for Hadoop? Due to its extensive integrations, Prometheus can monitor Hadoop and the rest of your infrastructure and applications. With its powerful data model and query language, you can graph and alert on what matters - not what your monitoring system limits you to. Better alerts with fewer false positives mean more sleep, higher reliability and more confidence that your system is functioning correctly.
  23. Resources Official Project Website: prometheus.io Official Mailing List: prometheus-developers@googlegroups.com Demo: demo.robustperception.io Robust Perception Website: www.robustperception.io Queries: prometheus@robustperception.io