Monitoring SLA with Prometheus and LibreOffice Calc
Didiet A. Pambudiono
pambudiono.os@gmail.com
didiet@dicoding.com
2. Who Am I
● Former System Administrator and Network Engineer @ catfiz.com
● DevOps @ dicoding.com
– Site Reliability
– Chaos Experiment
● Member of Kelompok Linux Arek Suroboyo (KLAS), openSUSE
Indonesia and LibreOffice Indonesia (non-dues-paying)
● Father of 2 kids
● Status : Married
● Website :
– https://medium.com/@pambudiono.os
3. “And why do we fall, Bruce? So we can
learn to pick ourselves up.”
Thomas Wayne
5. Service Level Agreement
● Essentially, “we will provide this level of service XOR
compensate you in this way”
● Relevant to lawyers and management
● An SLA is defined as an official commitment that prevails
between a service provider and a client
● Particular aspects of the service – quality, availability,
responsibilities – are agreed between the service provider
and the service user.
● The most common component of SLA is that the services
should be provided to the customer as agreed upon in the
contract
6. Service Level Agreement
1. We run software systems to serve users
2. We need to know how good the service we provide is
3. We need to understand what users care about
a. latency : how long it takes to respond
b. error rate : how often it fails
c. throughput : how much work it does
d. availability : how often can it do work
e. durability : how often does it lose data
f. correctness : does it work properly
g. and so on...
These are all indicators of the quality of service
7. Service Level Indicator
● Service Level Indicator (SLI) is a measure of the service level
provided by a service provider to a customer.
● SLIs form the basis of Service Level Objectives (SLOs),
which in turn form the basis of Service Level Agreements
(SLAs)
● An SLI is thus also called an SLA metric.
● Common SLIs include :
● latency
● throughput
● availability
● error rate
8. Service Level Indicator
● Choose SLIs judiciously : more isn't better
● Consider definitions carefully
● where & how are metrics collected
● over what period?
● is the metric aggregated? If so, how?
● prefer distributions to averages
● Standardize common SLI features & reuse
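The advice to prefer distributions to averages can be illustrated with a small sketch. The latency samples below are invented for illustration, and the nearest-rank percentile used here is only one of several common definitions.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = [20, 22, 25, 21, 23, 24, 22, 20, 900, 1200]

print("mean:", mean(latencies_ms))            # pulled far above a typical request
print("p50 :", percentile(latencies_ms, 50))  # what a typical request sees
print("p95 :", percentile(latencies_ms, 95))  # what the slowest requests see
```

With these samples the mean sits far from both the typical request and the slow tail, which is why a percentile is usually the more honest SLI.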
9. Service Level Objective
● For instance :
● search results' latency @ 95th percentile < 100ms
● system will be available between 99.9% and 99.95%
Measure SLI :
1. Is SLI within the SLO target?
● If yes, no action needed
● If no, figure out what needs to be done to meet the target
again
2. Repeat
Note :
● Don't pick targets based on current performance
● Avoid absolutes like "infinitely scalable" or "always available"
● Keep a safety margin
● Don't overachieve
10. Service Level Objective
● SLO is a mathematical relation like :
● SLI <= target
● lower bound <= SLI <= upper bound
● Use as input to control loop
● SLOs set expectations for system behaviour
● users want to know what performance / availability /
durability / ... the system will provide
● without published SLO, users will expect current
performance to continue forever
● Use a stricter internal target than you publish
● give time to respond to chronic conditions
● permit future reengineering with different performance-cost
tradeoffs
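The idea of an SLO as a mathematical relation on an SLI can be sketched as a small check. The class name, the 80 ms internal target and the measured value are assumptions made for illustration; only the 100 ms latency target and the 99.9% to 99.95% availability band come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SLO:
    """An SLO as a relation: lower <= SLI <= upper (either bound may be omitted)."""
    name: str
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_met(self, sli: float) -> bool:
        if self.lower is not None and sli < self.lower:
            return False
        if self.upper is not None and sli > self.upper:
            return False
        return True


# Published target, plus a stricter internal target (80 ms is a hypothetical value).
published = SLO("search latency p95 (ms)", upper=100)
internal = SLO("search latency p95 (ms), internal", upper=80)
availability = SLO("availability (%)", lower=99.9, upper=99.95)

measured_p95_ms = 90.0  # hypothetical measurement
print(published.is_met(measured_p95_ms))  # still within the published SLO
print(internal.is_met(measured_p95_ms))   # internal target missed: time to act
```

The gap between the two targets is the time the team has to respond before users see a violation.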
12. Define Your SLA :
How?
● Indicators?
● Objective?
● Agreement?
Example :
● Uptime?
● Availability?
● Service failures from our servers must not exceed 5%
● How many 5xx error codes are produced by the servers?
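The 5% failure target above can be turned into a simple calculation. This is a minimal sketch that assumes only 2xx and 5xx responses are counted; the request counts are invented for illustration.

```python
MAX_FAILURE_RATIO = 0.05  # "service failures must not exceed 5%"


def failure_ratio(count_2xx: int, count_5xx: int) -> float:
    """Share of counted responses that were server errors (5xx)."""
    total = count_2xx + count_5xx
    if total == 0:
        return 0.0  # no traffic: treat as no failures (a choice, not a rule)
    return count_5xx / total


# Hypothetical counts for one reporting period.
ratio = failure_ratio(count_2xx=9700, count_5xx=300)
print(f"failure ratio: {ratio:.2%}, within target: {ratio <= MAX_FAILURE_RATIO}")
```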
13. Note:
● If you want to have 5-nines of availability, you can only
afford about 5 minutes of downtime a year!!
● If __any__ humans are involved in restoring your system,
you can say bye-bye to the Infamous Nines.
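The arithmetic behind the "nines" can be checked with a few lines. This sketch assumes a 365-day year and ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

Five nines works out to roughly 5.26 minutes a year, which leaves very little room for a human to be paged, log in and fix anything.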
15. What is Prometheus?
● Open-source systems monitoring and alerting toolkit
originally built at SoundCloud.
● Since its inception in 2012, many companies and
organizations have adopted Prometheus, and the project has
a very active developer and user community.
● It is now a standalone open source project and maintained
independently of any company.
● To emphasize this, and to clarify the project's governance
structure, Prometheus joined the Cloud Native Computing
Foundation in 2016 as the second hosted project, after
Kubernetes.
16. Features
● Prometheus's main features are:
● a multi-dimensional data model with time series data
identified by metric name and key/value pairs
● a flexible query language to leverage this dimensionality
● no reliance on distributed storage; single server nodes are
autonomous
● time series collection happens via a pull model over HTTP
● pushing time series is supported via an intermediary
gateway
● targets are discovered via service discovery or static
configuration
● multiple modes of graphing and dashboarding support
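To make the data model concrete: in the text format that Prometheus scrapes over HTTP, a sample is written as a metric name, optional key/value labels, and a value. The helper below is an illustrative sketch, not part of any Prometheus client library, and the metric name and labels are hypothetical.

```python
def format_sample(name, labels, value):
    """Render one sample as: metric_name{key="value",...} value"""

    def escape(text):
        # Backslash, double quote and newline must be escaped in label values.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if labels:
        body = ",".join(f'{k}="{escape(v)}"' for k, v in sorted(labels.items()))
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


# Hypothetical metric: each distinct label combination is its own time series.
print(format_sample("http_requests_total", {"method": "GET", "code": "200"}, 1027))
print(format_sample("http_requests_total", {"method": "GET", "code": "500"}, 3))
```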
17. Components
The Prometheus ecosystem consists of multiple components,
many of which are optional:
● the main Prometheus server which scrapes and stores time
series data
● client libraries for instrumenting application code
● a push gateway for supporting short-lived jobs
● special-purpose exporters for services like HAProxy, StatsD,
Graphite, etc.
● an alertmanager to handle alerts
● various support tools
Most Prometheus components are written in Go, making them
easy to build and deploy as static binaries.
18. Measurement of Service Failure
Source :
● Apache response code : 2xx and 5xx
● Apache logs
Tools :
● Grok Exporter for Prometheus
● Python script to grab the Prometheus data, from Robust
Perception
(https://www.robustperception.io/prometheus-query-results-as-csv/)
● And of course we need LibreOffice Calc
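The query-to-CSV step can be sketched roughly as follows. This is not the Robust Perception script itself, only a minimal illustration built on the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The metric name, label and sample payload are hypothetical, and error handling is left out.

```python
import csv
import io
import json
import urllib.parse
import urllib.request


def fetch_instant_query(base_url, promql):
    """Run an instant query against a Prometheus server and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def vector_to_csv(payload):
    """Convert an instant-vector response into CSV text: one row per time series."""
    results = payload["data"]["result"]
    label_names = sorted({key for r in results for key in r["metric"]})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(label_names + ["timestamp", "value"])
    for r in results:
        timestamp, value = r["value"]
        writer.writerow([r["metric"].get(k, "") for k in label_names] + [timestamp, value])
    return out.getvalue()


# Hypothetical response, shaped like an instant-vector result.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "apache_http_response_codes_total", "code": "200"},
             "value": [1700000000, "9700"]},
            {"metric": {"__name__": "apache_http_response_codes_total", "code": "500"},
             "value": [1700000000, "300"]},
        ],
    },
}

print(vector_to_csv(SAMPLE))
# Against a live server (the address is an assumption):
# print(vector_to_csv(fetch_instant_query("http://localhost:9090", "up")))
```

The resulting CSV text can be saved to a file and opened directly in LibreOffice Calc.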
19. Apache status code from Apache Logs
-> Grok Exporter
-> Prometheus Server
-> Python script to query
-> CSV file
-> LibreOffice Calc
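To show what the first stage of this pipeline extracts, here is a rough stand-in written in plain Python. It is not Grok Exporter configuration; it only mimics the idea of pulling the status code out of each Apache access log line and counting by class. The log lines are invented, and the pattern assumes the common/combined log format.

```python
import re
from collections import Counter

# In common/combined log format the status code follows the quoted request line.
LOG_PATTERN = re.compile(r'"[^"]*" (?P<status>\d{3}) ')


def count_status_classes(lines):
    """Count responses per status class (2xx, 5xx, ...), skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            counts[match.group("status")[0] + "xx"] += 1
    return counts


# Hypothetical access log lines.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0700] "GET / HTTP/1.1" 200 2326',
    '192.0.2.2 - - [10/Oct/2023:13:55:37 +0700] "GET /api HTTP/1.1" 500 312',
    '192.0.2.3 - - [10/Oct/2023:13:55:38 +0700] "POST /login HTTP/1.1" 200 128',
]

print(dict(count_status_classes(SAMPLE_LOG)))
```

In the real pipeline these counts would be exposed as counters for the Prometheus server to scrape, rather than printed.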