Evolution of Monitoring and Prometheus (Dublin 2018)

Brian Brazil
Brian BrazilFounder at Robust Perception
Brian Brazil
Founder
Evolution of
Monitoring and Prometheus
Who am I?
Engineer passionate about running software reliably in production.
● One of the main developers of Prometheus
● Author of Prometheus: Up&Running
● Founder of Robust Perception
● Based in Dublin
Historical Roots
A lot of what we do today for monitoring is based on tools and techniques that
were awesome decades ago.
A lot of our practices still come from that time, and often we are still feeding the
machine with human blood.
Treating systems as pets rather than cattle.
A little history - MRTG and RRD
In 1994 Tobias Oetiker created a perl script, which became MRTG 1.0 in 1995.
Used to graph metrics from SNMP or external programs. Stored data in constantly
rewritten ASCII file.
MRTG 2.0 moved some code to C, released 1997.
RRD started in 1997 by Tobias Oetiker to further improve performance, released
1999.
Many tools use/used RRD, e.g. Graphite and Munin.
A little history - MRTG and RRD
Source: Wikipedia
A little history - Nagios
Written initially by Ethan Galstad in 1996. MS-DOS application to do pings.
Started more properly in 1998, first release in 1999 as NetSaint (renamed in 2002
for legal reasons).
Runs scripts on a regular basis, and sends alerts based on their exit code.
Many projects inspired by/based off Nagios such as Icinga, Sensu and Zmon.
A little history - Nagios
Source: Wikipedia
Historical Heritage
These very popular tools and their offspring left us with a world where graphing is
whitebox and alerting is blackbox, and they are separate concerns.
They come from a world where machines are pets, and services tend to live on
one machine.
They come from a world where even slight deviance would be immediately
jumped upon by heroic engineers in a NOC.
We need a new perspective in a cloud native environment.
What is Different Now?
It's no longer one service on one machine that will live there for years.
Services are dynamically assigned to machines, and can be moved around on an
hourly basis.
Microservices rather than monoliths mean more services created more often.
More dynamic, more churn, more to monitor.
So what should be monitoring?
We need a view that goes beyond alerting, graphing and jumping on every
potential problem.
We need to consider additional data sources, such as logs and browser events.
What about looking at the problem statement rather than what the tools can give
us?
What is the ultimate goal of all this "monitoring"?
Why do we monitor?
● Know when things go wrong
● Be able to debug and gain insight
● Trending to see changes over time
● Plumbing data to other systems/processes
Knowing when things go wrong
The first thing many people think of you say monitoring is alerting.
What is the wrongness we want to detect and alert on?
A blip with no real consequence, or a latency issue affecting users?
Symptoms vs Causes
Humans are limited in what they can handle.
If you alert on every single thing that might be a problem, you'll get overwhelmed
and suffer from alert fatigue.
Key problem: You care about things like user facing latency. There are hundreds
of things that could cause that.
Alerting on every possible cause is a Sisyphean task, but alerting on the symptom
of high latency is just one alert.
Example: CPU usage
Some monitoring systems don't allow you to alert on the latency of your servers.
The closest you can get is CPU usage.
False positives due to e.g. logrotate running too long.
False negatives due to deadlocks.
End result: Spammy alerts which operators learn to ignore, missing real problems.
Human attention is limited
Alerts should require intelligent human action!
Alerts should relate to actual end-user problems!
Your users don't care if a machine has a load of 4.
They care if they can't view their cat videos.
Debugging to Gain Insight
After you receive an alert notification you need to investigate it.
How do you work from a high level symptom alert such as increased latency?
You methodically drill down through your stack with dashboards to find the
subsystem that's the cause.
Break out more tools as you drill down into the suspect process.
Complementary Debugging Tools
Trending and Reporting
Alerting and debugging is short term.
Trending is medium to long term.
How is cache hit rate changing over time?
Is anyone still using that obscure feature?
When will I need more machines?
Plumbing
When all you have is a hammer, everything starts to look like a nail.
Often it'd be really convenient to use a monitoring system as a data transport as
part of some other process (often a control loop of some form).
This isn't monitoring, but it's going to happen.
If it's ~free, it's not necessarily even a bad idea.
How do we monitor?
Now knowing the three general goals of monitoring (plus plumbing), how do we go
about it?
What data do we collect? How do we process it?
What tradeoffs do we make?
Monitoring resources aren't free, so everything all the time is rarely an option.
The core is the Event
Events are what monitoring systems work off.
An event might be a HTTP request coming in, a packet being sent out, a library
call being made or a blackbox probe failing.
Events have context, such as the customer ultimately making the request, which
machines are involved, and how much data is being processed.
A single event might accumulate thousands of pieces of context as it goes through
the system, and there can be millions of events per second.
How do we monitor?
We're going to look at 4 classes of approaches to monitoring.
● Profiling
● Metrics
● Logs
● Distributed Tracing
When someone says "monitoring" they often have one of these stuck in their
head, however they're all complementary with different tradeoffs.
Profiling
Profiling is using tools like tcpdump, gdb, strace, dtrace, BPF and pretty much
anything Brendan Gregg talks about.
They give you very detailed information about individual events.
So detailed that you can't keep it all, so use must be targeted and temporary.
Great for debugging if have an idea what's wrong, not so much for anything else.
Metrics
Metrics make the tradeoff that we're going to ignore individual events, but track
how often particular contexts show up. Examples: Prometheus, Graphite.
For example you wouldn't track the exact time each HTTP request came in, but
you do track enough to tell that there were 3594 in the past minute.
And 14 of them failed. And 85 were for the login subsystem. And 5.4MB was
transferred. With 2023 cache hits.
And one of those requests hit a weird code path everyone has forgotten about.
Metric Tradeoffs
Metrics give you breadth.
You can easily track tens of thousands of metrics, but you can't break them out by
too many dimensions. E.g. breaking out metrics by email address is rarely a good
idea.
Great for figuring out what's generally going on, writing meaningful alerts,
narrowing down the scope of what needs debugging.
Not so great for tracking individual events.
Logs
Logs take the opposite approach to metrics.
They track individual events. So you can tell that Mr. Foo visited /myendpoint
yesterday at 7pm and received a 7892 byte response with status code 200.
The downside is you're limited in how many fields you can track due, likely <100.
Data volumes involved can also mean it takes some time for data to be available.
Examples: ELK stack, Graylog
Distributed Tracing
Distributed tracing is really a special case of logging.
It gives each request an unique ID that is logged as the request is handled
through various services in your architecture.
It ties these logs back together to see how the request flowed through the system,
and where time was spent.
Essential for debugging large-scale distributed systems.
Examples: OpenTracing, OpenZipkin
Distributed Tracing
Source: OpenZipkin
Where does Prometheus fit in?
Prometheus is a metrics-based monitoring system.
It tracks overall statistics over time, not individual events.
It has a Time Series DataBase (TSDB) at its core.
Powerful Data Model and Query Language
All metrics have arbitrary multi-dimensional labels.
Supports any double value with millisecond resolution timestamps.
Can multiply, add, aggregate, join, predict, take quantiles across many metrics in
the same query. Can evaluate right now, and graph back in time.
Can alert on any query.
Prometheus and the Cloud
Dynamic environments mean that new application instances continuously appear
and disappear.
Service Discovery can automatically detect these changes, and monitor all the
current instances.
Even better as Prometheus is pull-based, we can tell the difference between an
instance being down and an instance being turned off on purpose!
Heterogeneity
Not all Cloud VMs are equal.
Noisy neighbours mean different application instances have different performance.
Alerting on individual instance latency would be spammy.
But PromQL can aggregate latency across instances, allowing you to alert on
overall end-user visible latency rather than outliers.
Reliability is Key
Core Prometheus server is a single binary.
Each Prometheus server is independent, it only relies on local SSD.
No clustering or attempts to backfill "missing" data when scrapes fail. Such
approaches are difficult/impossible to get right, and often cause the type of
outages you're trying to prevent.
Option for remote storage for long term storage.
Dashboards with Grafana
Prometheus History - 1
Prometheus started in 2012 by Matt Proud and Julius Volz in Berlin.
In 2013 developed within SoundCloud, expanded to support Bazooka (cluster
manager/scheduler), Go, Java and Ruby clients.
In 2014 other companies start using it, including myself working at Boxever.
Project matures: new v2 storage by Beorn, new text format.
In January 2015 we "publicly release", adoption increases.
Prometheus History - 2
In May 2016 Prometheus is 2nd project to join the CNCF.
In July 2016 Prometheus releases 1.0.
In September 2016, first Promcon in Berlin.
In early 2017, Fabian starts work on a new TSDB.
In November 2017, Prometheus releases 2.0.
In August 2018, Prometheus is 2nd project to graduate within the CNCF. This is
announced at the 3rd Promcon.
Community Growth
2016: 100+ users, 250+ contributors, 35+ integrations
2017: 500+ users, 300+ contributors, 100+ integrations, 600+ on lists
2018: 10k+ users, 900+ contributors, 300+ integrations, 1.2k+ on lists
Prometheus: The Book
Released in August 2018 with O'Reilly, 386
pages.
Core content was written of the course of 61
days.
Copies to give out today!
Questions?
Book: https://www.amazon.co.uk/dp/1492034142
Prometheus: prometheus.io
Demo: demo.robustperception.io
My Company Website: www.robustperception.io
1 of 39

Recommended

Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017) by
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Brian Brazil
1.1K views18 slides
Prometheus - Open Source Forum Japan by
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
549 views42 slides
What does "monitoring" mean? (FOSDEM 2017) by
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)Brian Brazil
2.4K views39 slides
Prometheus for Monitoring Metrics (Percona Live Europe 2017) by
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Brian Brazil
1.1K views17 slides
Evolving Prometheus for the Cloud Native World (FOSDEM 2018) by
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Brian Brazil
1.5K views25 slides
Evolution of the Prometheus TSDB (Percona Live Europe 2017) by
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)Brian Brazil
824 views18 slides

More Related Content

What's hot

Anatomy of a Prometheus Client Library (PromCon 2018) by
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Brian Brazil
902 views23 slides
Prometheus: A Next Generation Monitoring System (FOSDEM 2016) by
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
9.6K views28 slides
End to-end monitoring with the prometheus operator - Max Inden by
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenParis Container Day
2.4K views86 slides
Prometheus for Monitoring Metrics (Fermilab 2018) by
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Brian Brazil
2K views24 slides
An Introduction to Prometheus (GrafanaCon 2016) by
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
3.4K views48 slides
Cloud Monitoring with Prometheus by
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with PrometheusQAware GmbH
3.8K views20 slides

What's hot(20)

Anatomy of a Prometheus Client Library (PromCon 2018) by Brian Brazil
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)
Brian Brazil902 views
Prometheus: A Next Generation Monitoring System (FOSDEM 2016) by Brian Brazil
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Brian Brazil9.6K views
End to-end monitoring with the prometheus operator - Max Inden by Paris Container Day
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max Inden
Prometheus for Monitoring Metrics (Fermilab 2018) by Brian Brazil
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
Brian Brazil2K views
An Introduction to Prometheus (GrafanaCon 2016) by Brian Brazil
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
Brian Brazil3.4K views
Cloud Monitoring with Prometheus by QAware GmbH
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with Prometheus
QAware GmbH3.8K views
Prometheus (Prometheus London, 2016) by Brian Brazil
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
Brian Brazil1.2K views
Prometheus design and philosophy by Docker, Inc.
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy
Docker, Inc.2.7K views
Prometheus Overview by Brian Brazil
Prometheus OverviewPrometheus Overview
Prometheus Overview
Brian Brazil34.1K views
Ansible at FOSDEM (Ansible Dublin, 2016) by Brian Brazil
Ansible at FOSDEM (Ansible Dublin, 2016)Ansible at FOSDEM (Ansible Dublin, 2016)
Ansible at FOSDEM (Ansible Dublin, 2016)
Brian Brazil833 views
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich) by Brian Brazil
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
Brian Brazil3.6K views
Prometheus (Monitorama 2016) by Brian Brazil
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)
Brian Brazil4.5K views
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016) by Brian Brazil
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Brian Brazil9.4K views
Efficient monitoring and alerting by Tobias Schmidt
Efficient monitoring and alertingEfficient monitoring and alerting
Efficient monitoring and alerting
Tobias Schmidt2.7K views
Monitoring Cloud Native Applications with Prometheus by Jacopo Nardiello
Monitoring Cloud Native Applications with PrometheusMonitoring Cloud Native Applications with Prometheus
Monitoring Cloud Native Applications with Prometheus
Jacopo Nardiello793 views
The Open-Source Monitoring Landscape by Mike Merideth
The Open-Source Monitoring LandscapeThe Open-Source Monitoring Landscape
The Open-Source Monitoring Landscape
Mike Merideth8.6K views
Prometheus (Microsoft, 2016) by Brian Brazil
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
Brian Brazil3.3K views
Introduction to Prometheus and Cortex (WOUG) by Weaveworks
Introduction to Prometheus and Cortex (WOUG)Introduction to Prometheus and Cortex (WOUG)
Introduction to Prometheus and Cortex (WOUG)
Weaveworks619 views

Similar to Evolution of Monitoring and Prometheus (Dublin 2018)

Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud by
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSylvain Kalache
64.4K views21 slides
Red lambda FAQ's by
Red lambda FAQ'sRed lambda FAQ's
Red lambda FAQ'sIla Group
351 views14 slides
RTI Data-Distribution Service (DDS) Master Class 2011 by
RTI Data-Distribution Service (DDS) Master Class 2011RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011Gerardo Pardo-Castellote
5.3K views151 slides
Monitoring - deeper dive by
Monitoring  - deeper diveMonitoring  - deeper dive
Monitoring - deeper diveRobert Kubiś
68 views58 slides
Chaos Engineering Without Observability ... Is Just Chaos by
Chaos Engineering Without Observability ... Is Just ChaosChaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just ChaosCharity Majors
1.4K views53 slides
Cartographer, or Building A Next Generation Management Framework by
Cartographer, or Building A Next Generation Management FrameworkCartographer, or Building A Next Generation Management Framework
Cartographer, or Building A Next Generation Management Frameworkansmtug
429 views63 slides

Similar to Evolution of Monitoring and Prometheus (Dublin 2018)(20)

Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud by Sylvain Kalache
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Sylvain Kalache64.4K views
Red lambda FAQ's by Ila Group
Red lambda FAQ'sRed lambda FAQ's
Red lambda FAQ's
Ila Group351 views
Chaos Engineering Without Observability ... Is Just Chaos by Charity Majors
Chaos Engineering Without Observability ... Is Just ChaosChaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just Chaos
Charity Majors1.4K views
Cartographer, or Building A Next Generation Management Framework by ansmtug
Cartographer, or Building A Next Generation Management FrameworkCartographer, or Building A Next Generation Management Framework
Cartographer, or Building A Next Generation Management Framework
ansmtug429 views
Go Observability (in practice) by Eran Levy
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
Eran Levy1.1K views
Messaging is not just for investment banks! by elliando dias
Messaging is not just for investment banks!Messaging is not just for investment banks!
Messaging is not just for investment banks!
elliando dias256 views
Effective Data Erasure and Anti Forensics Techniques by ijtsrd
Effective Data Erasure and Anti Forensics TechniquesEffective Data Erasure and Anti Forensics Techniques
Effective Data Erasure and Anti Forensics Techniques
ijtsrd30 views
NetFlow Auditor Anomaly Detection Plus Forensics February 2010 08 by NetFlowAuditor
NetFlow Auditor Anomaly Detection Plus Forensics February 2010 08NetFlow Auditor Anomaly Detection Plus Forensics February 2010 08
NetFlow Auditor Anomaly Detection Plus Forensics February 2010 08
NetFlowAuditor822 views
The Internet Of Things Can Be Developed And Refined by Stephanie Roberts
The Internet Of Things Can Be Developed And RefinedThe Internet Of Things Can Be Developed And Refined
The Internet Of Things Can Be Developed And Refined
Infrastructure - a journey from datacentres to cloud by Equal Experts
Infrastructure - a journey from datacentres to cloudInfrastructure - a journey from datacentres to cloud
Infrastructure - a journey from datacentres to cloud
Equal Experts722 views
Microsoft Dryad by Colin Clark
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
Colin Clark1.2K views
Webinar Security: Apps of Steel transcription by Service2Media
Webinar Security:  Apps of Steel transcriptionWebinar Security:  Apps of Steel transcription
Webinar Security: Apps of Steel transcription
Service2Media373 views
Performance monitoring and call tracing in microservice environments by Martin Gutenbrunner
Performance monitoring and call tracing in microservice environmentsPerformance monitoring and call tracing in microservice environments
Performance monitoring and call tracing in microservice environments
Martin Gutenbrunner2.4K views

More from Brian Brazil

Evaluating Prometheus Knowledge in Interviews (PromCon 2018) by
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Brian Brazil
569 views17 slides
Staleness and Isolation in Prometheus 2.0 (PromCon 2017) by
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Brian Brazil
5.1K views44 slides
Rule 110 for Prometheus (PromCon 2017) by
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Brian Brazil
714 views16 slides
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017) by
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Brian Brazil
15.3K views43 slides
Provisioning and Capacity Planning (Travel Meets Big Data) by
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Brian Brazil
768 views45 slides
So You Want to Write an Exporter by
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an ExporterBrian Brazil
4.2K views24 slides

More from Brian Brazil(12)

Evaluating Prometheus Knowledge in Interviews (PromCon 2018) by Brian Brazil
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Brian Brazil569 views
Staleness and Isolation in Prometheus 2.0 (PromCon 2017) by Brian Brazil
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Brian Brazil5.1K views
Rule 110 for Prometheus (PromCon 2017) by Brian Brazil
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)
Brian Brazil714 views
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017) by Brian Brazil
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Brian Brazil15.3K views
Provisioning and Capacity Planning (Travel Meets Big Data) by Brian Brazil
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)
Brian Brazil768 views
So You Want to Write an Exporter by Brian Brazil
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an Exporter
Brian Brazil4.2K views
An Exploration of the Formal Properties of PromQL by Brian Brazil
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQL
Brian Brazil1.5K views
Life of a Label (PromCon2016, Berlin) by Brian Brazil
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)
Brian Brazil2.8K views
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016) by Brian Brazil
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Brian Brazil16.5K views
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015) by Brian Brazil
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Brian Brazil6.6K views
Prometheus and Docker (Docker Galway, November 2015) by Brian Brazil
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
Brian Brazil9.8K views
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire... by Brian Brazil
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Brian Brazil6K views

Recently uploaded

MVP and prioritization.pdf by
MVP and prioritization.pdfMVP and prioritization.pdf
MVP and prioritization.pdfrahuldharwal141
38 views8 slides
Uni Systems for Power Platform.pptx by
Uni Systems for Power Platform.pptxUni Systems for Power Platform.pptx
Uni Systems for Power Platform.pptxUni Systems S.M.S.A.
58 views21 slides
Five Things You SHOULD Know About Postman by
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About PostmanPostman
40 views43 slides
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...Bernd Ruecker
50 views69 slides
Business Analyst Series 2023 - Week 3 Session 5 by
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5DianaGray10
369 views20 slides
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueShapeBlue
46 views13 slides

Recently uploaded(20)

Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman40 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker50 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10369 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue46 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue88 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue62 views
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT by ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue91 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software344 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... by ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue57 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu141 views
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by ShapeBlue
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
ShapeBlue56 views
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue119 views
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue46 views
Why and How CloudStack at weSystems - Stephan Bienek - weSystems by ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue111 views
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays40 views
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... by ShapeBlue
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
ShapeBlue48 views

Evolution of Monitoring and Prometheus (Dublin 2018)

  • 2. Who am I? Engineer passionate about running software reliably in production. ● One of the main developers of Prometheus ● Author of Prometheus: Up&Running ● Founder of Robust Perception ● Based in Dublin
  • 3. Historical Roots A lot of what we do today for monitoring is based on tools and techniques that were awesome decades ago. A lot of our practices still come from that time, and often we are still feeding the machine with human blood. Treating systems as pets rather than cattle.
  • 4. A little history - MRTG and RRD In 1994 Tobias Oetiker created a perl script, which became MRTG 1.0 in 1995. Used to graph metrics from SNMP or external programs. Stored data in constantly rewritten ASCII file. MRTG 2.0 moved some code to C, released 1997. RRD started in 1997 by Tobias Oetiker to further improve performance, released 1999. Many tools use/used RRD, e.g. Graphite and Munin.
  • 5. A little history - MRTG and RRD Source: Wikipedia
  • 6. A little history - Nagios Written initially by Ethan Galstad in 1996. MS-DOS application to do pings. Started more properly in 1998, first release in 1999 as NetSaint (renamed in 2002 for legal reasons). Runs scripts on a regular basis, and sends alerts based on their exit code. Many projects inspired by/based off Nagios such as Icinga, Sensu and Zmon.
  • 7. A little history - Nagios Source: Wikipedia
  • 8. Historical Heritage These very popular tools and their offspring left us with a world where graphing is whitebox and alerting is blackbox, and they are separate concerns. They come from a world where machines are pets, and services tend to live on one machine. They come from a world where even slight deviance would be immediately jumped upon by heroic engineers in a NOC. We need a new perspective in a cloud native environment.
  • 9. What is Different Now? It's no longer one service on one machine that will live there for years. Services are dynamically assigned to machines, and can be moved around on an hourly basis. Microservices rather than monoliths mean more services created more often. More dynamic, more churn, more to monitor.
  • 10. So what should be monitoring? We need a view that goes beyond alerting, graphing and jumping on every potential problem. We need to consider additional data sources, such as logs and browser events. What about looking at the problem statement rather than what the tools can give us? What is the ultimate goal of all this "monitoring"?
  • 11. Why do we monitor? ● Know when things go wrong ● Be able to debug and gain insight ● Trending to see changes over time ● Plumbing data to other systems/processes
  • 12. Knowing when things go wrong The first thing many people think of you say monitoring is alerting. What is the wrongness we want to detect and alert on? A blip with no real consequence, or a latency issue affecting users?
  • 13. Symptoms vs Causes Humans are limited in what they can handle. If you alert on every single thing that might be a problem, you'll get overwhelmed and suffer from alert fatigue. Key problem: You care about things like user facing latency. There are hundreds of things that could cause that. Alerting on every possible cause is a Sisyphean task, but alerting on the symptom of high latency is just one alert.
  • 14. Example: CPU usage Some monitoring systems don't allow you to alert on the latency of your servers. The closest you can get is CPU usage. False positives due to e.g. logrotate running too long. False negatives due to deadlocks. End result: Spammy alerts which operators learn to ignore, missing real problems.
  • 15. Human attention is limited Alerts should require intelligent human action! Alerts should relate to actual end-user problems! Your users don't care if a machine has a load of 4. They care if they can't view their cat videos.
  • 16. Debugging to Gain Insight After you receive an alert notification you need to investigate it. How do you work from a high level symptom alert such as increased latency? You methodically drill down through your stack with dashboards to find the subsystem that's the cause. Break out more tools as you drill down into the suspect process.
  • 18. Trending and Reporting Alerting and debugging is short term. Trending is medium to long term. How is cache hit rate changing over time? Is anyone still using that obscure feature? When will I need more machines?
  • 19. Plumbing When all you have is a hammer, everything starts to look like a nail. Often it'd be really convenient to use a monitoring system as a data transport as part of some other process (often a control loop of some form). This isn't monitoring, but it's going to happen. If it's ~free, it's not necessarily even a bad idea.
  • 20. How do we monitor? Now knowing the three general goals of monitoring (plus plumbing), how do we go about it? What data do we collect? How do we process it? What tradeoffs do we make? Monitoring resources aren't free, so everything all the time is rarely an option.
  • 21. The core is the Event Events are what monitoring systems work off. An event might be a HTTP request coming in, a packet being sent out, a library call being made or a blackbox probe failing. Events have context, such as the customer ultimately making the request, which machines are involved, and how much data is being processed. A single event might accumulate thousands of pieces of context as it goes through the system, and there can be millions of events per second.
  • 22. How do we monitor? We're going to look at 4 classes of approaches to monitoring. ● Profiling ● Metrics ● Logs ● Distributed Tracing When someone says "monitoring" they often have one of these stuck in their head, however they're all complementary with different tradeoffs.
  • 23. Profiling Profiling is using tools like tcpdump, gdb, strace, dtrace, BPF and pretty much anything Brendan Gregg talks about. They give you very detailed information about individual events. So detailed that you can't keep it all, so use must be targeted and temporary. Great for debugging if have an idea what's wrong, not so much for anything else.
  • 24. Metrics Metrics make the tradeoff that we're going to ignore individual events, but track how often particular contexts show up. Examples: Prometheus, Graphite. For example you wouldn't track the exact time each HTTP request came in, but you do track enough to tell that there were 3594 in the past minute. And 14 of them failed. And 85 were for the login subsystem. And 5.4MB was transferred. With 2023 cache hits. And one of those requests hit a weird code path everyone has forgotten about.
  • 25. Metric Tradeoffs Metrics give you breadth. You can easily track tens of thousands of metrics, but you can't break them out by too many dimensions. E.g. breaking out metrics by email address is rarely a good idea. Great for figuring out what's generally going on, writing meaningful alerts, narrowing down the scope of what needs debugging. Not so great for tracking individual events.
  • 26. Logs Logs take the opposite approach to metrics. They track individual events. So you can tell that Mr. Foo visited /myendpoint yesterday at 7pm and received a 7892 byte response with status code 200. The downside is you're limited in how many fields you can track due, likely <100. Data volumes involved can also mean it takes some time for data to be available. Examples: ELK stack, Graylog
  • 27. Distributed Tracing Distributed tracing is really a special case of logging. It gives each request an unique ID that is logged as the request is handled through various services in your architecture. It ties these logs back together to see how the request flowed through the system, and where time was spent. Essential for debugging large-scale distributed systems. Examples: OpenTracing, OpenZipkin
  • 29. Where does Prometheus fit in? Prometheus is a metrics-based monitoring system. It tracks overall statistics over time, not individual events. It has a Time Series DataBase (TSDB) at its core.
  • 30. Powerful Data Model and Query Language All metrics have arbitrary multi-dimensional labels. Supports any double value with millisecond resolution timestamps. Can multiply, add, aggregate, join, predict, take quantiles across many metrics in the same query. Can evaluate right now, and graph back in time. Can alert on any query.
  • 31. Prometheus and the Cloud Dynamic environments mean that new application instances continuously appear and disappear. Service Discovery can automatically detect these changes, and monitor all the current instances. Even better as Prometheus is pull-based, we can tell the difference between an instance being down and an instance being turned off on purpose!
  • 32. Heterogeneity Not all Cloud VMs are equal. Noisy neighbours mean different application instances have different performance. Alerting on individual instance latency would be spammy. But PromQL can aggregate latency across instances, allowing you to alert on overall end-user visible latency rather than outliers.
  • 33. Reliability is Key Core Prometheus server is a single binary. Each Prometheus server is independent, it only relies on local SSD. No clustering or attempts to backfill "missing" data when scrapes fail. Such approaches are difficult/impossible to get right, and often cause the type of outages you're trying to prevent. Option for remote storage for long term storage.
  • 35. Prometheus History - 1 Prometheus started in 2012 by Matt Proud and Julius Volz in Berlin. In 2013 developed within SoundCloud, expanded to support Bazooka (cluster manager/scheduler), Go, Java and Ruby clients. In 2014 other companies start using it, including myself working at Boxever. Project matures: new v2 storage by Beorn, new text format. In January 2015 we "publicly release", adoption increases.
  • 36. Prometheus History - 2 In May 2016 Prometheus is 2nd project to join the CNCF. In July 2016 Prometheus releases 1.0. In September 2016, first Promcon in Berlin. In early 2017, Fabian starts work on a new TSDB. In November 2017, Prometheus releases 2.0. In August 2018, Prometheus is 2nd project to graduate within the CNCF. This is announced at the 3rd Promcon.
  • 37. Community Growth 2016: 100+ users, 250+ contributors, 35+ integrations 2017: 500+ users, 300+ contributors, 100+ integrations, 600+ on lists 2018: 10k+ users, 900+ contributors, 300+ integrations, 1.2k+ on lists
  • 38. Prometheus: The Book Released in August 2018 with O'Reilly, 386 pages. Core content was written of the course of 61 days. Copies to give out today!
  • 39. Questions? Book: https://www.amazon.co.uk/dp/1492034142 Prometheus: prometheus.io Demo: demo.robustperception.io My Company Website: www.robustperception.io