Microservices and Prometheus (Microservices NYC 2016)

Brian Brazil
Brian BrazilFounder at Robust Perception
Brian Brazil
Founder
Microservices
and
Prometheus
Who am I?
Engineer passionate about running software reliably in production.
● TCD CS Degree
● Google SRE for 7 years, working on high-scale reliable systems such as
Adwords, Adsense, Ad Exchange, Billing, Database
● Boxever TL Systems&Infrastructure, applied processes and technology to let
allow company to scale and reduce operational load
● Contributor to many open source projects, including Prometheus, Ansible,
Python, Aurora and Zookeeper.
● Founder of Robust Perception, making scalability and efficiency available to
everyone
Why monitor?
● Know when things go wrong
○ To call in a human to prevent a business-level issue, or prevent an issue in advance
● Be able to debug and gain insight
● Trending to see changes over time, and drive technical/business decisions
● To feed into other systems/processes (e.g. QA, security, automation)
Anatomy of an Microservice Architecture
Many small services
Services send requests to each other
Services can be in written in a variety of languages
Services added and removed regularly
Common Monitoring Challenges
Themes common among companies I’ve talk to and tools I've looked at:
● Monitoring tools are limited, both technically and conceptually
● Tools don’t scale well and are unwieldy to manage
● Operational practices don’t align with the business
For example:
Your customers care about increased latency and it’s in your SLAs. You can only
alert on individual machine CPU usage.
Result: Engineers continuously woken up for non-issues, get fatigued
Challenges for Microservice Monitoring
Services added and removed regularly, needs to be low-friction.
Tends to be a very dynamic environment, machines and replicas added and
removed automatically.
Many different teams with slightly different requirements and views of the world.
How to support all the languages and types of server?
What do we need?
Monitoring designed for a dynamic microservice environment.
Must be usable across multiple languages, frameworks and server types.
Allows each team to own their monitoring.
Powerful analysis features.
Takes advantage of similarity between services
Lots of Monitoring Systems Look Like This
Services have Internals
Monitor the Internals
Monitor as a Service, not as Machines
Prometheus
Inspired by Google’s Borgmon monitoring system.
Started in 2012 by ex-Googlers working in Soundcloud as an open source project,
mainly written in Go to monitor microservices. Publically launched in early 2015,
and continues to be independent of any one company.
Hundreds of companies have started relying on it since then.
Over 250 people have contributed to official repos. 50+ 3rd party integrations.
Inclusive Monitoring
Don’t monitor just at the edges:
● Instrument client libraries
● Instrument server libraries (e.g. HTTP/RPC)
● Instrument business logic
Library authors get information about usage.
Application developers get monitoring of common components for free.
Dashboards and alerting can be provided out of the box, customised for your
organisation!
Instrumentation made easy
Prometheus clients don’t just marshall data.
They take care of the nitty gritty details like concurrency and state tracking.
We take advantage of the strengths of each of the 10 languages currently
supported.
In Python for example that means context managers and decorators.
Python Instrumentation Example
pip install prometheus_client
from prometheus_client import Summary, start_http_server
REQUEST_DURATION = Summary('request_duration_seconds',
'Request duration in seconds')
@REQUEST_DURATION.time()
def my_handler(request):
pass # Your code here
start_http_server(8000)
Exceptional Circumstances In Progress
from prometheus_client import Counter, Gauge
EXCEPTIONS = Counter('exceptions_total', 'Total exceptions')
IN_PROGRESS = Gauge('inprogress_requests', 'In progress')
@EXCEPTIONS.count_exceptions()
@IN_PROGRESS.track_inprogress()
def my_handler(request):
pass // Your code here
Open ecosystem
Prometheus client libraries don’t tie you into Prometheus.
For example the Python and Java clients can output to Graphite, with no need to
run any Prometheus components.
This means that you as a library author can instrument with Prometheus, and your
users with just a few lines of code can output to whatever monitoring system they
want. No need for users to worry about a new standard.
This can be done incrementally.
It goes the other way too
It’s unlikely that everyone is going to switch to Prometheus all at once, so we’ve
integrations that can take in data from other monitoring systems and make it
useful.
Graphite, Collectd, Statsd, SNMP, JMX, Dropwizard, AWS Cloudwatch, New
Relic, Rsyslog and Scollector (Bosun) are some examples.
Prometheus and its client libraries can act a clearinghouse to convert between
monitoring systems.
For example Zalando’s Zmon uses the Python client to parse Prometheus metrics
from directly instrumented binaries such as the Node exporter.
Easy to integrate with
Many existing integrations: Cadvisor, Java, JMX, Python, Go, Ruby, .Net,
Machine, Cloudwatch, EC2, MySQL, PostgreSQL, Haskell, Bash, Node.js, SNMP,
Consul, HAProxy, Mesos, Bind, CouchDB, Django, Mtail, Heka, Memcached,
RabbitMQ, Redis, RethinkDB, Rsyslog, Meteor.js, Minecraft...
Graphite, Statsd, Collectd, Scollector, Munin, Nagios integrations aid transition.
It’s so easy, most of the above were written without the core team even knowing
about them!
Distinguishing Timeseries
Prometheus doesn’t use dotted.strings like metric.tech.dublin.
Multi-dimensional labels instead like metric{industry=”tech”,city=”dublin”}
Can aggregate, cut, and slice along them.
Can come from instrumentation, or be added based on the service you are
monitoring.
Example: Labels from Node Exporter
Adding Dimensions (No Evil Twins Please)
from prometheus_client import Counter
REQUESTS = Counter('requests_total',
'Total requests', ['method'])
def my_handler(request):
REQUESTS.labels(request.method).inc()
pass // Your code here
Counters
One type of time series Prometheus has is a Counter
It only goes up
If we miss a scrape, it's still possible to see what the increase was during that
period - lose granularity, not data. Other systems push diffs, not as reliable.
The rate() function also handles resets
Powerful Query Language
Can multiply, add, aggregate, join, predict, take quantiles across many metrics in
the same query. Can evaluate right now, and graph back in time.
Answer questions like:
● What’s the 95th percentile latency in the European datacenter?
● How full will the disks be in 4 hours?
● Which services are the top 5 users of CPU?
Can alert based on any query.
Example: Top 5 Docker images by CPU
topk(5,
sum by (image)(
rate(container_cpu_usage_seconds_total{
id=~"/system.slice/docker.*"}[5m]
)
)
)
Manageable and Reliable
Core Prometheus server is a single binary.
Doesn’t depend on Zookeeper, Consul, Cassandra, Hadoop or the Internet.
Only requires local disk (SSD recommended). No potential for cascading failure.
State not event based, so network blips only lose resolution - not data.
Decentralised
Pull based, so easy to on run a workstation for testing and rogue servers can’t
push bad metrics.
Also means each team can run their own, no need for central management.
Advanced service discovery finds what to monitor.
Relabelling lets you apply the label taxonomy that makes sense to you.
Efficient and Scalable
Prometheus is easy to run, can give one to each team in each datacenter.
New varbit encoding uses only 1.3 bytes/sample
A single Prometheus server can handle 800K samples/s
Federation allows pulling key metrics from other Prometheus servers.
When one job is too big for a single Prometheus server, can use
sharding+federation to scale out. Needed with thousands of machines.
Dashboards
What do we have?
Instrumentation for a variety of languages that's easy to use.
Data model and query language for analysis.
Flexible alerting on any query.
Decentralised architecture, each team controls their own monitoring.
Service discovery automatically picks up new services and machines.
This makes Prometheus a great way to monitor your microservices!
10 Tips for Monitoring Microservices
With potentially millions of time series across your system, can be difficult to know
what is and isn't useful.
What approaches help manage this complexity?
How do you avoid getting caught out?
#1: Choose your key statistics
Users don't care that one of your machines is short of CPU.
Users care if the service is slow or throwing errors.
For your primary dashboards focus on high-level metrics that directly impact
users.
#2: Use aggregations
Think about services, not machines.
Once you have more than a handful of machines, you should treat them as an
amorphous blob.
Looking at the key statistics is easier for 10 services, than 10 services each of
which is on 10 machines
Once you have isolated a problem to one service, then can see if one machine is
the problem
#3: Avoid the Wall of Graphs
Dashboards tend to grow without bound. Worst I've seen was 600 graphs.
It might look impressive, but humans can't deal with that much data at once.
(and they take forever to load).
Your services will have a rough tree structure, have a dashboard per service and
talk the tree from the top when you have a problem. Similarly for each service,
have dashboards per subsystem.
Rule of Thumb: Limit of 5 graphs per dashboard, and 5 lines per graph.
#4: Client-side quantiles aren't aggregatable
Many instrumentation systems calculate quantiles/percentiles inside each process,
and export it to the TSDB.
It is not statistically possible to aggregate these.
If you want meaningful quantiles, you should track histogram buckets in each
process, aggregate those in your monitoring system and then calculate the
quantile.
This is done using histogram_quantile() and rate() in Prometheus.
#5: Averages are easy to reason about
Q: Say you have a service with two backends. If 95th percentile latency goes up
due to one of the backends, what will you see in 95th percentile latency for that
backend?
A: ?
#5: Averages are easy to reason about
Q: Say you have a service with two backends. If 95th percentile latency goes up
due to one of the backends, what will you see in 95th percentile latency for that
backend?
A: It depends, could be no change. If the latencies are strongly correlated for each
request across the backends, you'll see the same latency bump.
This is tricky to reason about, especially in an emergency.
Averages don't have this problem, as they include all requests.
#6: Costs and Benefits
1s resolution monitoring of all metrics would be handy for debugging.
But is it ten time more valuable than 10s monitoring? And sixty times more
valuable than 60s monitoring?
Monitoring isn't free. It costs resources to run, and resources in the services being
monitored too. Quantiles and histograms can get expensive fast.
60s resolution is generally a good balance. Reserve 1s granularity or a literal
handful of key metrics.
#7: Nyquist-Shannon Sampling Theorem
To reconstruct a signal you need a resolution that's at least double it's frequency.
If you've got a 10s resolution time series, you can't reconstruct patterns that are
less than 20s long.
Higher frequency patterns can cause effects like aliasing, and mislead you.
If you suspect that there's something more to the data, try a higher resolution
temporarily or start profiling.
#8: Correlation is not Causation
Let's play a game called 2 4 6
I have a rule, and if you give me a sequence of numbers I'll tell you if it conforms
to the rule. Can you figure out the rule?
(If you've played this before, don't spoil it)
#8: Correlation is not Causation - Confirmation Bias
Humans are great at spotting patterns. Not all of them are actually there.
Always try to look for evidence that'd falsify your hypothesis.
If two metrics seem to correlate on a graph that doesn't mean that they're related.
They could be independent tasks running on the same schedule.
Or if you zoom out there plenty of times when one spikes but not the other.
Or one could be causing a slight increase in resource contention, pushing the
other over the edge.
#9 Know when to use Logs and Metrics
You want a metrics time series system for your primary monitoring.
Logs have information about every event. This limits the number of fields (<100),
but you have unlimited cardinality.
Metrics aggregate across events, but you can have many metrics (>10000) with
limited cardinality.
Metrics help you determine where in the system the problem is. From there, logs
can help you pinpoint which requests are tickling the problem.
#10 Have a way to deal with non-critical alerts
Most alerts don't justify waking up someone at night, but someone needs to look
at them sometime.
Often they're sent to a mailing list, where everyone promptly filters them away.
Better to have some form of ticketing system that'll assign a single owner for each
alert.
A daily email with all firing alerts that the oncall has to process can also work.
Final Word
It's easy to get in over your head when starting to appreciate to power of time
series monitoring.
The one thing I'd say is to Keep it Simple.
Just like anything else, more complex doesn't automatically mean better.
What do we do?
Robust Perception provides consulting and training to give you confidence in your
production service's ability to run efficiently and scale with your business.
We can help you:
● Decide if Prometheus is for you
● Manage your transition to Prometheus and resolve issues that arise
● With capacity planning, debugging, infrastructure etc.
We are proud to be among the core contributors to the Prometheus project.
Resources
Official Project Website: prometheus.io
Official Mailing List: prometheus-developers@googlegroups.com
Demo: demo.robustperception.io
Robust Perception Website: www.robustperception.io
Queries: prometheus@robustperception.io
1 of 46

Recommended

Prometheus - Open Source Forum Japan by
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
549 views42 slides
An Introduction to Prometheus (GrafanaCon 2016) by
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
3.4K views48 slides
Prometheus for Monitoring Metrics (Percona Live Europe 2017) by
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Brian Brazil
1.1K views17 slides
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl... by
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
3.1K views33 slides
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016) by
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
9.4K views43 slides
Cloud Monitoring with Prometheus by
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with PrometheusQAware GmbH
3.8K views20 slides

More Related Content

What's hot

Prometheus design and philosophy by
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy Docker, Inc.
2.7K views32 slides
Prometheus: A Next Generation Monitoring System (FOSDEM 2016) by
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
9.6K views28 slides
Prometheus with Grafana - AddWeb Solution by
Prometheus with Grafana - AddWeb SolutionPrometheus with Grafana - AddWeb Solution
Prometheus with Grafana - AddWeb SolutionAddWeb Solution Pvt. Ltd.
106 views20 slides
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017) by
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Brian Brazil
1.1K views18 slides
Prometheus (Monitorama 2016) by
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Brian Brazil
4.5K views23 slides
End to-end monitoring with the prometheus operator - Max Inden by
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenParis Container Day
2.4K views86 slides

What's hot(20)

Prometheus design and philosophy by Docker, Inc.
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy
Docker, Inc.2.7K views
Prometheus: A Next Generation Monitoring System (FOSDEM 2016) by Brian Brazil
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Brian Brazil9.6K views
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017) by Brian Brazil
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Brian Brazil1.1K views
Prometheus (Monitorama 2016) by Brian Brazil
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)
Brian Brazil4.5K views
End to-end monitoring with the prometheus operator - Max Inden by Paris Container Day
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max Inden
Prometheus (Prometheus London, 2016) by Brian Brazil
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
Brian Brazil1.2K views
What does "monitoring" mean? (FOSDEM 2017) by Brian Brazil
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
Brian Brazil2.4K views
Prometheus for Monitoring Metrics (Fermilab 2018) by Brian Brazil
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
Brian Brazil2K views
So You Want to Write an Exporter by Brian Brazil
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an Exporter
Brian Brazil4.2K views
Ansible at FOSDEM (Ansible Dublin, 2016) by Brian Brazil
Ansible at FOSDEM (Ansible Dublin, 2016)Ansible at FOSDEM (Ansible Dublin, 2016)
Ansible at FOSDEM (Ansible Dublin, 2016)
Brian Brazil833 views
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017) by Brian Brazil
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Brian Brazil15.3K views
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015) by Brian Brazil
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Brian Brazil2.8K views
Anatomy of a Prometheus Client Library (PromCon 2018) by Brian Brazil
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)
Brian Brazil902 views
Evolving Prometheus for the Cloud Native World (FOSDEM 2018) by Brian Brazil
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Brian Brazil1.5K views
Prometheus Training by Tim Tyler
Prometheus TrainingPrometheus Training
Prometheus Training
Tim Tyler437 views
Monitoring Cloud Native Applications with Prometheus by Jacopo Nardiello
Monitoring Cloud Native Applications with PrometheusMonitoring Cloud Native Applications with Prometheus
Monitoring Cloud Native Applications with Prometheus
Jacopo Nardiello793 views
Infrastructure & System Monitoring using Prometheus by Marco Pas
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using Prometheus
Marco Pas4.8K views
Monitoring using Prometheus and Grafana by Arvind Kumar G.S
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
Arvind Kumar G.S3.5K views

Similar to Microservices and Prometheus (Microservices NYC 2016)

Prometheus (Microsoft, 2016) by
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Brian Brazil
3.3K views36 slides
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016) by
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
16.5K views39 slides
Prometheus Overview by
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
34.1K views19 slides
Prometheus and Docker (Docker Galway, November 2015) by
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Brian Brazil
9.8K views33 slides
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf by
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdfPrometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdfKnoldus Inc.
158 views14 slides
Go Observability (in practice) by
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)Eran Levy
1.1K views29 slides

Similar to Microservices and Prometheus (Microservices NYC 2016)(20)

Prometheus (Microsoft, 2016) by Brian Brazil
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
Brian Brazil3.3K views
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016) by Brian Brazil
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Brian Brazil16.5K views
Prometheus Overview by Brian Brazil
Prometheus OverviewPrometheus Overview
Prometheus Overview
Brian Brazil34.1K views
Prometheus and Docker (Docker Galway, November 2015) by Brian Brazil
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
Brian Brazil9.8K views
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf by Knoldus Inc.
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdfPrometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Knoldus Inc.158 views
Go Observability (in practice) by Eran Levy
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
Eran Levy1.1K views
Monitoring as an entry point for collaboration by Julien Pivotto
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaboration
Julien Pivotto1.3K views
Which watcher watches CloudWatch by David Lutz
Which watcher watches CloudWatch Which watcher watches CloudWatch
Which watcher watches CloudWatch
David Lutz2.6K views
Prometheus: From technical metrics to business observability by Julien Pivotto
Prometheus: From technical metrics to business observabilityPrometheus: From technical metrics to business observability
Prometheus: From technical metrics to business observability
Julien Pivotto4.4K views
Observability for Application Developers (1)-1.pptx by OpsTree Solutions
Observability for Application Developers (1)-1.pptxObservability for Application Developers (1)-1.pptx
Observability for Application Developers (1)-1.pptx
Monitoring your Python with Prometheus (Python Ireland April 2015) by Brian Brazil
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
Brian Brazil17.3K views
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A... by IRJET Journal
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET Journal17 views
Kks sre book_ch10 by Chris Huang
Kks sre book_ch10Kks sre book_ch10
Kks sre book_ch10
Chris Huang329 views
An Introduction to Microservices by Ad van der Veer
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to Microservices
Ad van der Veer604 views
Operator-less DataCenters -- A Reality by Kishore Arya
Operator-less DataCenters -- A RealityOperator-less DataCenters -- A Reality
Operator-less DataCenters -- A Reality
Kishore Arya76 views
Operator-Less DataCenters A Near Future Reality by Kishore Arya
Operator-Less DataCenters A Near Future RealityOperator-Less DataCenters A Near Future Reality
Operator-Less DataCenters A Near Future Reality
Kishore Arya118 views
Advantages And Disadvantages Of Open Source Learning... by Sue Jones
Advantages And Disadvantages Of Open Source Learning...Advantages And Disadvantages Of Open Source Learning...
Advantages And Disadvantages Of Open Source Learning...
Sue Jones1 view
Copy of Silk performer - KT.pptx by ssuser20fcbe
Copy of Silk performer - KT.pptxCopy of Silk performer - KT.pptx
Copy of Silk performer - KT.pptx
ssuser20fcbe7 views
Extra micrometer practices with Quarkus | DevNation Tech Talk by Red Hat Developers
Extra micrometer practices with Quarkus | DevNation Tech TalkExtra micrometer practices with Quarkus | DevNation Tech Talk
Extra micrometer practices with Quarkus | DevNation Tech Talk
Red Hat Developers540 views
Osmius The Open Source, Fast and Extandable Monitoring Tool by osmius
Osmius The Open Source, Fast and Extandable Monitoring ToolOsmius The Open Source, Fast and Extandable Monitoring Tool
Osmius The Open Source, Fast and Extandable Monitoring Tool
osmius692 views

More from Brian Brazil

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich) by
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)Brian Brazil
3.6K views18 slides
Evaluating Prometheus Knowledge in Interviews (PromCon 2018) by
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Brian Brazil
569 views17 slides
Evolution of the Prometheus TSDB (Percona Live Europe 2017) by
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)Brian Brazil
824 views18 slides
Staleness and Isolation in Prometheus 2.0 (PromCon 2017) by
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Brian Brazil
5.1K views44 slides
Rule 110 for Prometheus (PromCon 2017) by
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Brian Brazil
713 views16 slides
Provisioning and Capacity Planning (Travel Meets Big Data) by
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Brian Brazil
768 views45 slides

More from Brian Brazil(9)

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich) by Brian Brazil
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
Brian Brazil3.6K views
Evaluating Prometheus Knowledge in Interviews (PromCon 2018) by Brian Brazil
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Brian Brazil569 views
Evolution of the Prometheus TSDB (Percona Live Europe 2017) by Brian Brazil
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Brian Brazil824 views
Staleness and Isolation in Prometheus 2.0 (PromCon 2017) by Brian Brazil
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Brian Brazil5.1K views
Rule 110 for Prometheus (PromCon 2017) by Brian Brazil
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)
Brian Brazil713 views
Provisioning and Capacity Planning (Travel Meets Big Data) by Brian Brazil
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)
Brian Brazil768 views
An Exploration of the Formal Properties of PromQL by Brian Brazil
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQL
Brian Brazil1.5K views
Life of a Label (PromCon2016, Berlin) by Brian Brazil
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)
Brian Brazil2.8K views
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015) by Brian Brazil
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Brian Brazil6.6K views

Recently uploaded

Opportunities for Youth in IG - Alena Muravska RIPE NCC.pdf by
Opportunities for Youth in IG - Alena Muravska RIPE NCC.pdfOpportunities for Youth in IG - Alena Muravska RIPE NCC.pdf
Opportunities for Youth in IG - Alena Muravska RIPE NCC.pdfRIPE NCC
9 views12 slides
Is Entireweb better than Google by
Is Entireweb better than GoogleIs Entireweb better than Google
Is Entireweb better than Googlesebastianthomasbejan
10 views1 slide
Building trust in our information ecosystem: who do we trust in an emergency by
Building trust in our information ecosystem: who do we trust in an emergencyBuilding trust in our information ecosystem: who do we trust in an emergency
Building trust in our information ecosystem: who do we trust in an emergencyTina Purnat
85 views18 slides
UiPath Document Understanding_Day 3.pptx by
UiPath Document Understanding_Day 3.pptxUiPath Document Understanding_Day 3.pptx
UiPath Document Understanding_Day 3.pptxUiPathCommunity
95 views25 slides
DU Series - Day 4.pptx by
DU Series - Day 4.pptxDU Series - Day 4.pptx
DU Series - Day 4.pptxUiPathCommunity
94 views28 slides
Serverless cloud architecture patterns by
Serverless cloud architecture patternsServerless cloud architecture patterns
Serverless cloud architecture patternsJimmy Dahlqvist
17 views52 slides

Recently uploaded(20)

Opportunities for Youth in IG - Alena Muravska RIPE NCC.pdf by RIPE NCC
Opportunities for Youth in IG - Alena Muravska RIPE NCC.pdfOpportunities for Youth in IG - Alena Muravska RIPE NCC.pdf
Opportunities for Youth in IG - Alena Muravska RIPE NCC.pdf
RIPE NCC9 views
Building trust in our information ecosystem: who do we trust in an emergency by Tina Purnat
Building trust in our information ecosystem: who do we trust in an emergencyBuilding trust in our information ecosystem: who do we trust in an emergency
Building trust in our information ecosystem: who do we trust in an emergency
Tina Purnat85 views
UiPath Document Understanding_Day 3.pptx by UiPathCommunity
UiPath Document Understanding_Day 3.pptxUiPath Document Understanding_Day 3.pptx
UiPath Document Understanding_Day 3.pptx
UiPathCommunity95 views
Serverless cloud architecture patterns by Jimmy Dahlqvist
Serverless cloud architecture patternsServerless cloud architecture patterns
Serverless cloud architecture patterns
Jimmy Dahlqvist17 views
google forms survey (1).pptx by MollyBrown86
google forms survey (1).pptxgoogle forms survey (1).pptx
google forms survey (1).pptx
MollyBrown8614 views
IGF UA - Dialog with I_ organisations - Alena Muavska RIPE NCC.pdf by RIPE NCC
IGF UA - Dialog with I_ organisations - Alena Muavska RIPE NCC.pdfIGF UA - Dialog with I_ organisations - Alena Muavska RIPE NCC.pdf
IGF UA - Dialog with I_ organisations - Alena Muavska RIPE NCC.pdf
RIPE NCC15 views
PORTFOLIO 1 (Bret Michael Pepito).pdf by brejess0410
PORTFOLIO 1 (Bret Michael Pepito).pdfPORTFOLIO 1 (Bret Michael Pepito).pdf
PORTFOLIO 1 (Bret Michael Pepito).pdf
brejess04107 views
Existing documentaries (1).docx by MollyBrown86
Existing documentaries (1).docxExisting documentaries (1).docx
Existing documentaries (1).docx
MollyBrown8613 views
We see everywhere that many people are talking about technology.docx by ssuserc5935b
We see everywhere that many people are talking about technology.docxWe see everywhere that many people are talking about technology.docx
We see everywhere that many people are talking about technology.docx
ssuserc5935b6 views
IETF 118: Starlink Protocol Performance by APNIC
IETF 118: Starlink Protocol PerformanceIETF 118: Starlink Protocol Performance
IETF 118: Starlink Protocol Performance
APNIC124 views
𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲 by Infosec train
𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲
𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲
Infosec train7 views

Microservices and Prometheus (Microservices NYC 2016)

  • 2. Who am I? Engineer passionate about running software reliably in production. ● TCD CS Degree ● Google SRE for 7 years, working on high-scale reliable systems such as Adwords, Adsense, Ad Exchange, Billing, Database ● Boxever TL Systems&Infrastructure, applied processes and technology to let allow company to scale and reduce operational load ● Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper. ● Founder of Robust Perception, making scalability and efficiency available to everyone
  • 3. Why monitor? ● Know when things go wrong ○ To call in a human to prevent a business-level issue, or prevent an issue in advance ● Be able to debug and gain insight ● Trending to see changes over time, and drive technical/business decisions ● To feed into other systems/processes (e.g. QA, security, automation)
  • 4. Anatomy of an Microservice Architecture Many small services Services send requests to each other Services can be in written in a variety of languages Services added and removed regularly
  • 5. Common Monitoring Challenges Themes common among companies I’ve talk to and tools I've looked at: ● Monitoring tools are limited, both technically and conceptually ● Tools don’t scale well and are unwieldy to manage ● Operational practices don’t align with the business For example: Your customers care about increased latency and it’s in your SLAs. You can only alert on individual machine CPU usage. Result: Engineers continuously woken up for non-issues, get fatigued
  • 6. Challenges for Microservice Monitoring Services added and removed regularly, needs to be low-friction. Tends to be a very dynamic environment, machines and replicas added and removed automatically. Many different teams with slightly different requirements and views of the world. How to support all the languages and types of server?
  • 7. What do we need? Monitoring designed for a dynamic microservice environment. Must be usable across multiple languages, frameworks and server types. Allows each team to own their monitoring. Powerful analysis features. Takes advantage of similarity between services
  • 8. Lots of Monitoring Systems Look Like This
  • 11. Monitor as a Service, not as Machines
  • 12. Prometheus Inspired by Google’s Borgmon monitoring system. Started in 2012 by ex-Googlers working in Soundcloud as an open source project, mainly written in Go to monitor microservices. Publically launched in early 2015, and continues to be independent of any one company. Hundreds of companies have started relying on it since then. Over 250 people have contributed to official repos. 50+ 3rd party integrations.
  • 13. Inclusive Monitoring Don’t monitor just at the edges: ● Instrument client libraries ● Instrument server libraries (e.g. HTTP/RPC) ● Instrument business logic Library authors get information about usage. Application developers get monitoring of common components for free. Dashboards and alerting can be provided out of the box, customised for your organisation!
  • 14. Instrumentation made easy Prometheus clients don’t just marshall data. They take care of the nitty gritty details like concurrency and state tracking. We take advantage of the strengths of each of the 10 languages currently supported. In Python for example that means context managers and decorators.
  • 15. Python Instrumentation Example pip install prometheus_client from prometheus_client import Summary, start_http_server REQUEST_DURATION = Summary('request_duration_seconds', 'Request duration in seconds') @REQUEST_DURATION.time() def my_handler(request): pass # Your code here start_http_server(8000)
  • 16. Exceptional Circumstances In Progress from prometheus_client import Counter, Gauge EXCEPTIONS = Counter('exceptions_total', 'Total exceptions') IN_PROGRESS = Gauge('inprogress_requests', 'In progress') @EXCEPTIONS.count_exceptions() @IN_PROGRESS.track_inprogress() def my_handler(request): pass // Your code here
  • 17. Open ecosystem Prometheus client libraries don’t tie you into Prometheus. For example the Python and Java clients can output to Graphite, with no need to run any Prometheus components. This means that you as a library author can instrument with Prometheus, and your users with just a few lines of code can output to whatever monitoring system they want. No need for users to worry about a new standard. This can be done incrementally.
  • 18. It goes the other way too It’s unlikely that everyone is going to switch to Prometheus all at once, so we’ve integrations that can take in data from other monitoring systems and make it useful. Graphite, Collectd, Statsd, SNMP, JMX, Dropwizard, AWS Cloudwatch, New Relic, Rsyslog and Scollector (Bosun) are some examples. Prometheus and its client libraries can act a clearinghouse to convert between monitoring systems. For example Zalando’s Zmon uses the Python client to parse Prometheus metrics from directly instrumented binaries such as the Node exporter.
  • 19. Easy to integrate with Many existing integrations: Cadvisor, Java, JMX, Python, Go, Ruby, .Net, Machine, Cloudwatch, EC2, MySQL, PostgreSQL, Haskell, Bash, Node.js, SNMP, Consul, HAProxy, Mesos, Bind, CouchDB, Django, Mtail, Heka, Memcached, RabbitMQ, Redis, RethinkDB, Rsyslog, Meteor.js, Minecraft... Graphite, Statsd, Collectd, Scollector, Munin, Nagios integrations aid transition. It’s so easy, most of the above were written without the core team even knowing about them!
  • 20. Distinguishing Timeseries Prometheus doesn’t use dotted.strings like metric.tech.dublin. Multi-dimensional labels instead like metric{industry=”tech”,city=”dublin”} Can aggregate, cut, and slice along them. Can come from instrumentation, or be added based on the service you are monitoring.
  • 21. Example: Labels from Node Exporter
  • 22. Adding Dimensions (No Evil Twins Please) from prometheus_client import Counter REQUESTS = Counter('requests_total', 'Total requests', ['method']) def my_handler(request): REQUESTS.labels(request.method).inc() pass // Your code here
  • 23. Counters One type of time series Prometheus has is a Counter It only goes up If we miss a scrape, it's still possible to see what the increase was during that period - lose granularity, not data. Other systems push diffs, not as reliable. The rate() function also handles resets
  • 24. Powerful Query Language Can multiply, add, aggregate, join, predict, take quantiles across many metrics in the same query. Can evaluate right now, and graph back in time. Answer questions like: ● What’s the 95th percentile latency in the European datacenter? ● How full will the disks be in 4 hours? ● Which services are the top 5 users of CPU? Can alert based on any query.
  • 25. Example: Top 5 Docker images by CPU topk(5, sum by (image)( rate(container_cpu_usage_seconds_total{ id=~"/system.slice/docker.*"}[5m] ) ) )
  • 26. Manageable and Reliable Core Prometheus server is a single binary. Doesn’t depend on Zookeeper, Consul, Cassandra, Hadoop or the Internet. Only requires local disk (SSD recommended). No potential for cascading failure. State not event based, so network blips only lose resolution - not data.
  • 27. Decentralised Pull based, so easy to on run a workstation for testing and rogue servers can’t push bad metrics. Also means each team can run their own, no need for central management. Advanced service discovery finds what to monitor. Relabelling lets you apply the label taxonomy that makes sense to you.
  • 28. Efficient and Scalable Prometheus is easy to run, can give one to each team in each datacenter. New varbit encoding uses only 1.3 bytes/sample A single Prometheus server can handle 800K samples/s Federation allows pulling key metrics from other Prometheus servers. When one job is too big for a single Prometheus server, can use sharding+federation to scale out. Needed with thousands of machines.
  • 30. What do we have? Instrumentation for a variety of languages that's easy to use. Data model and query language for analysis. Flexible alerting on any query. Decentralised architecture, each team controls their own monitoring. Service discovery automatically picks up new services and machines. This makes Prometheus a great way to monitor your microservices!
  • 31. 10 Tips for Monitoring Microservices With potentially millions of time series across your system, can be difficult to know what is and isn't useful. What approaches help manage this complexity? How do you avoid getting caught out?
  • 32. #1: Choose your key statistics Users don't care that one of your machines is short of CPU. Users care if the service is slow or throwing errors. For your primary dashboards focus on high-level metrics that directly impact users.
  • 33. #2: Use aggregations Think about services, not machines. Once you have more than a handful of machines, you should treat them as an amorphous blob. Looking at the key statistics is easier for 10 services, than 10 services each of which is on 10 machines Once you have isolated a problem to one service, then can see if one machine is the problem
  • 34. #3: Avoid the Wall of Graphs Dashboards tend to grow without bound. Worst I've seen was 600 graphs. It might look impressive, but humans can't deal with that much data at once. (and they take forever to load). Your services will have a rough tree structure, have a dashboard per service and talk the tree from the top when you have a problem. Similarly for each service, have dashboards per subsystem. Rule of Thumb: Limit of 5 graphs per dashboard, and 5 lines per graph.
  • 35. #4: Client-side quantiles aren't aggregatable Many instrumentation systems calculate quantiles/percentiles inside each process, and export it to the TSDB. It is not statistically possible to aggregate these. If you want meaningful quantiles, you should track histogram buckets in each process, aggregate those in your monitoring system and then calculate the quantile. This is done using histogram_quantile() and rate() in Prometheus.
  • 36. #5: Averages are easy to reason about Q: Say you have a service with two backends. If 95th percentile latency goes up due to one of the backends, what will you see in 95th percentile latency for that backend? A: ?
  • 37. #5: Averages are easy to reason about Q: Say you have a service with two backends. If 95th percentile latency goes up due to one of the backends, what will you see in 95th percentile latency for that backend? A: It depends, could be no change. If the latencies are strongly correlated for each request across the backends, you'll see the same latency bump. This is tricky to reason about, especially in an emergency. Averages don't have this problem, as they include all requests.
  • 38. #6: Costs and Benefits 1s resolution monitoring of all metrics would be handy for debugging. But is it ten time more valuable than 10s monitoring? And sixty times more valuable than 60s monitoring? Monitoring isn't free. It costs resources to run, and resources in the services being monitored too. Quantiles and histograms can get expensive fast. 60s resolution is generally a good balance. Reserve 1s granularity or a literal handful of key metrics.
  • 39. #7: Nyquist-Shannon Sampling Theorem To reconstruct a signal you need a resolution that's at least double it's frequency. If you've got a 10s resolution time series, you can't reconstruct patterns that are less than 20s long. Higher frequency patterns can cause effects like aliasing, and mislead you. If you suspect that there's something more to the data, try a higher resolution temporarily or start profiling.
  • 40. #8: Correlation is not Causation Let's play a game called 2 4 6 I have a rule, and if you give me a sequence of numbers I'll tell you if it conforms to the rule. Can you figure out the rule? (If you've played this before, don't spoil it)
  • 41. #8: Correlation is not Causation - Confirmation Bias Humans are great at spotting patterns. Not all of them are actually there. Always try to look for evidence that'd falsify your hypothesis. If two metrics seem to correlate on a graph that doesn't mean that they're related. They could be independent tasks running on the same schedule. Or if you zoom out there plenty of times when one spikes but not the other. Or one could be causing a slight increase in resource contention, pushing the other over the edge.
  • 42. #9 Know when to use Logs and Metrics You want a metrics time series system for your primary monitoring. Logs have information about every event. This limits the number of fields (<100), but you have unlimited cardinality. Metrics aggregate across events, but you can have many metrics (>10000) with limited cardinality. Metrics help you determine where in the system the problem is. From there, logs can help you pinpoint which requests are tickling the problem.
  • 43. #10 Have a way to deal with non-critical alerts Most alerts don't justify waking up someone at night, but someone needs to look at them sometime. Often they're sent to a mailing list, where everyone promptly filters them away. Better to have some form of ticketing system that'll assign a single owner for each alert. A daily email with all firing alerts that the oncall has to process can also work.
  • 44. Final Word It's easy to get in over your head when starting to appreciate to power of time series monitoring. The one thing I'd say is to Keep it Simple. Just like anything else, more complex doesn't automatically mean better.
  • 45. What do we do? Robust Perception provides consulting and training to give you confidence in your production service's ability to run efficiently and scale with your business. We can help you: ● Decide if Prometheus is for you ● Manage your transition to Prometheus and resolve issues that arise ● With capacity planning, debugging, infrastructure etc. We are proud to be among the core contributors to the Prometheus project.
  • 46. Resources Official Project Website: prometheus.io Official Mailing List: prometheus-developers@googlegroups.com Demo: demo.robustperception.io Robust Perception Website: www.robustperception.io Queries: prometheus@robustperception.io