Operations, Monitoring and
Observability
©2021LinkAja Indonesia
Agenda
 Operations Overview
 Monitoring Overview
 Observability Overview
Operations Excellence
• Everything is Automated
• Reduce Costs
• No Support Calls
Operations Model
Manual
• User initiated
• Interactive, command-line tools, simple scripts
• Checklist and process driven
Reactive
• Hardware-centric data collection
• Simple metric and log collection
• Siloed tools and information
• Manual analysis and remediation
Proactive
• Application-centric data collection
• End-to-end observability
• Key metrics and thresholds well understood
• Semi-automated analysis and remediation
Users care about
• Availability: Is my system online? Yes/No
• Latency: Does it take a long time to access the application?
• Reliability: Can the user rely on the application?
Agenda
 Operations Overview
 Monitoring Overview
 Observability Overview
Outline
 Monitoring… for what?
 What do we really want to monitor?
 How do we design it?
 What is not monitoring?
 Can we do it better?
Monitoring… for what?
Your monitoring system should address two questions: what's broken, and why?
The "What's broken" indicates the symptom.
"In the event of a failure, monitoring data should immediately be able to provide visibility into impact of the failure as well as the effect of any fix deployed" — Cindy Sridharan
The "Why" indicates a (possibly intermediate) cause.
Examples
Symptom (What?) → Cause (Why?)
I'm serving HTTP 500s → The DB is refusing connections
Responses are slow → The web server is queueing requests
Users can't log in → The auth client is receiving HTTP 503
Blackbox vs Whitebox
Blackbox: externally observed; what the user sees.
Whitebox: data exposed by the system that lets you act on imminent issues.
Key Distinction
Blackbox Monitoring (what?)
 User/business point of view
 SLI/SLO-based control
 Mostly easy to know
 Detects active problems
 Reactive approach
 Tends to be the last to alert
 Usually on-call resolution
 Preferably few metrics
Whitebox Monitoring (why?)
 Component point of view
 Threshold-based control
 Mostly hard to know
 Detects imminent problems
 Proactive approach
 Tends to be the early alarm
 Usually automatic resolution
 Preferably many metrics
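The blackbox side of this distinction can be sketched as a simple external probe. This is a minimal sketch, assuming stdlib `urllib` only; the "up" rule (any non-5xx answer) and the timeout are illustrative assumptions, not an actual production check:

```python
import time
import urllib.request
import urllib.error

def is_up(status):
    """Blackbox verdict: the service is 'up' if the user got a non-5xx answer."""
    return status is not None and status < 500

def blackbox_probe(url, timeout_s=5.0):
    """Externally observed check: status and latency exactly as a user sees them."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code        # 4xx/5xx responses still reach the user
    except (urllib.error.URLError, OSError):
        status = None          # unreachable: an availability failure
    latency_ms = 1000 * (time.monotonic() - start)
    return is_up(status), status, latency_ms
```

Note the probe knows nothing about the system's internals: it only sees what a user would, which is exactly why blackbox checks detect active (not imminent) problems.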
Methodology
The 4 Golden Signals
1. Traffic
2. Latency
3. Errors
4. Saturation
R.E.D (Microservice Level)
• Request Rate: the number of requests per second your services are serving
• (Request) Errors: the number (rate) of failed requests per second
• (Request) Duration: the distribution of the amount of time each request takes
U.S.E (Low Level/Infrastructure)
For every resource, check Utilization, Saturation, and Errors
• Utilization: the % of time the resource was busy
• Saturation: the amount of work the resource has queued up, often a queue length
• Errors: the count of error events
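The R.E.D. method above can be sketched over a window of completed requests. A minimal sketch; the `Request` record, the 5xx-means-error rule, and the nearest-rank p95 are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int   # HTTP status code

def red_metrics(requests, window_seconds):
    """R.E.D. over one observation window: Rate, Errors, Duration."""
    rate = len(requests) / window_seconds                # Rate: requests per second
    errors = sum(1 for r in requests if r.status >= 500)
    error_rate = errors / window_seconds                 # Errors: failures per second
    durations = sorted(r.duration_ms for r in requests)
    p95 = durations[int(0.95 * (len(durations) - 1))]    # Duration: p95 of the distribution
    return rate, error_rate, p95

# Example: four requests observed in a 2-second window
reqs = [Request(120, 200), Request(80, 200), Request(300, 500), Request(95, 200)]
rate, err_rate, p95 = red_metrics(reqs, window_seconds=2)
# rate = 2.0 req/s, err_rate = 0.5 err/s, p95 = 120 ms
```

Duration is deliberately a distribution (here a percentile), not an average: one slow outlier would otherwise hide behind many fast requests.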
Monitoring with SLI
SLI = Service Level Indicator
Quantifies meeting user expectations:
is our service working as our users expect it to?
Monitoring with SLI
Example: a backend API for user info
Availability
Specification: the % of GET requests that complete successfully
Implementation:
Latency
Specification: the % of requests returning 2xx that complete in < 500 ms
Implementation:
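One possible implementation of these two SLI specifications, as a sketch; the request-record shape (`method`, `status`, `duration_ms`) is an assumption about what your access logs or load balancer expose:

```python
def availability_sli(requests):
    """% of GET requests that complete successfully (2xx)."""
    gets = [r for r in requests if r["method"] == "GET"]
    ok = sum(1 for r in gets if 200 <= r["status"] < 300)
    return 100.0 * ok / len(gets)

def latency_sli(requests, threshold_ms=500):
    """% of 2xx requests that complete in under threshold_ms."""
    ok = [r for r in requests if 200 <= r["status"] < 300]
    fast = sum(1 for r in ok if r["duration_ms"] < threshold_ms)
    return 100.0 * fast / len(ok)

requests = [
    {"method": "GET",  "status": 200, "duration_ms": 120},
    {"method": "GET",  "status": 500, "duration_ms": 30},
    {"method": "GET",  "status": 200, "duration_ms": 610},
    {"method": "POST", "status": 201, "duration_ms": 45},
]
```

With this sample, both SLIs come out to about 66.67% — 2 of 3 GETs succeed, and 2 of 3 2xx responses finish under 500 ms.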
Monitoring with SLI + SLO
SLO = Service Level Objective
Example:
- Measured across all the backend servers from the load balancer
- Over the past 24 hours
Availability: 99.9% of GET requests complete successfully
Latency: 95% of requests that return 2xx complete in < 500 ms
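An SLO is just an SLI plus a target over a window, so checking compliance reduces to a comparison. The SLI values below are illustrative, not real measurements:

```python
def meets_slo(sli_percent, objective_percent):
    """An SLO sets a pass/fail target for an SLI over a fixed window."""
    return sli_percent >= objective_percent

# Past 24 hours, measured at the load balancer (illustrative values):
availability_sli = 99.95   # % of GET requests completing successfully
latency_sli = 96.2         # % of 2xx requests completing in < 500 ms

availability_ok = meets_slo(availability_sli, 99.9)   # objective: 99.9%
latency_ok = meets_slo(latency_sli, 95.0)             # objective: 95%
```

The window matters as much as the number: 99.9% over 24 hours tolerates far less downtime per incident than 99.9% over a quarter.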
Observability
 Operations Overview
 Monitoring Overview
 Observability Overview
Observability
Observability is how well you can understand a system's internal state from the data it exposes, measured across the entire application.
Observability captures what "monitoring" doesn't (and shouldn't), based on evidence (not conjecture).
When you lose the power to know and predict the behavior of the system, that's where observability tools come in...
Monitoring vs Observability
Monitoring tells you when something is wrong,
while Observability enables you to understand why.
Pillars of Observability
Metrics are a numeric representation of data measured
over intervals of time
Event Logging is an immutable, timestamped record of
discrete events that happened over time.
Tracing is a representation of a series of causally
related distributed events that encode the end-to-end
request flow through a distributed system.
Observability
1. Metrics
Reliability and trending in use:
o What is happening right now?
o What will happen next?
2. Tracing
A few of the critical questions that tracing can answer quickly and easily:
o Which services did a request pass through?
o Where are the bottlenecks?
o How much time is lost to network lag between services?
o What occurred in each service for a given request?
3. Logging
Good practices for more effective logs:
o Log with context (a trace-id, uuid, or similar)
o Use standardized logging levels
o Use structured logs to enable machine-readability
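The three logging practices above can be sketched together with the stdlib `logging` module. A minimal sketch: the JSON field names and the "checkout" logger name are illustrative choices, not a prescribed schema:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Structured (machine-readable) log lines that carry request context."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,                       # standardized level
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),   # context for correlation
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every log line emitted for one request shares the same trace-id,
# so logs can be joined with traces across services.
trace_id = str(uuid.uuid4())
logger.info("payment accepted", extra={"trace_id": trace_id})
```

Because the output is JSON per line, a log pipeline can filter by `level` or group by `trace_id` without fragile regex parsing.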
Thank you
#PakeLinkAja
