Operations, Monitoring and
Observability
©2021LinkAja Indonesia
Agenda
 Operations Overview
 Monitoring Overview
 Observability Overview
Operations Excellence
• Everything is Automated
• Reduce Costs
• No Support Calls
Operations Model
Manual
• User initiated
• Interactive, command-line tools, simple scripts
• Checklist and process driven
Reactive
• Hardware-centric data collection
• Simple metric and log collection
• Siloed tools and information
• Manual analysis and remediation
Proactive
• Application-centric data collection
• End-to-end observability
• Key metrics and thresholds well understood
• Semi-automated analysis and remediation
Users care about
• Availability: Is my system online? Yes/No
• Latency: Does it take a long time to access the application?
• Reliability: Can the user rely on the application?
Agenda
 Operations Overview
 Monitoring Overview
 Observability Overview
Outline
 Monitoring… for what?
 What do we really want to monitor?
 How do we design it?
 What is not monitoring?
 Can we do it better?
Monitoring… for what?
Your monitoring system should address two questions: what's broken, and why?
The "What's broken" indicates the symptom.
"In the event of a failure, monitoring data should immediately be able to provide visibility into impact of the failure as well as the effect of any fix deployed" — Cindy Sridharan
The "Why" indicates a (possibly intermediate) cause.
Examples
Symptom (What?) → Cause (Why?)
I'm serving HTTP 500s → The DB is refusing connections
Responses are slow → The web server is queueing requests
Users can't log in → The auth client is receiving HTTP 503
Blackbox vs Whitebox
Blackbox: externally observed; what the user sees.
Whitebox: data exposed by the system that lets you act on imminent issues.
Key Distinction
Blackbox Monitoring (what?)
 User/business point of view
 SLI/SLO-based control
 Mostly easy to know
 Detects active problems
 Reactive approach
 Tends to be the last to alert
 Usually on-call resolution
 Preferably few metrics
Whitebox Monitoring (why?)
 Component point of view
 Threshold-based control
 Mostly hard to know
 Detects imminent problems
 Proactive approach
 Tends to be the early alarm
 Usually automatic resolution
 Preferably many metrics
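The blackbox side of this distinction can be sketched as a simple external probe. This is a minimal sketch, assuming stdlib `urllib` only; the "up" rule (any non-5xx answer) and the timeout are illustrative assumptions, not an actual production check:

```python
import time
import urllib.request
import urllib.error

def is_up(status):
    """Blackbox verdict: the service is 'up' if the user got a non-5xx answer."""
    return status is not None and status < 500

def blackbox_probe(url, timeout_s=5.0):
    """Externally observed check: status and latency exactly as a user sees them."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code        # 4xx/5xx responses still reach the user
    except (urllib.error.URLError, OSError):
        status = None          # unreachable: an availability failure
    latency_ms = 1000 * (time.monotonic() - start)
    return is_up(status), status, latency_ms
```

Note the probe knows nothing about the system's internals: it only sees what a user would, which is exactly why blackbox checks detect active (not imminent) problems.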
Methodology
The 4 Golden Signals
1. Traffic
2. Latency
3. Errors
4. Saturation
R.E.D (Microservice Level)
• Request Rate: the number of requests per second your services are serving
• (Request) Errors: the number (rate) of failed requests per second
• (Request) Duration: the distribution of the amount of time each request takes
U.S.E (Low Level/Infrastructure)
For every resource, check Utilization, Saturation, and Errors
• Utilization: the % of time the resource was busy
• Saturation: the amount of work the resource has queued up, often a queue length
• Errors: the count of error events
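The R.E.D. method above can be sketched over a window of completed requests. A minimal sketch; the `Request` record, the 5xx-means-error rule, and the nearest-rank p95 are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int   # HTTP status code

def red_metrics(requests, window_seconds):
    """R.E.D. over one observation window: Rate, Errors, Duration."""
    rate = len(requests) / window_seconds                # Rate: requests per second
    errors = sum(1 for r in requests if r.status >= 500)
    error_rate = errors / window_seconds                 # Errors: failures per second
    durations = sorted(r.duration_ms for r in requests)
    p95 = durations[int(0.95 * (len(durations) - 1))]    # Duration: p95 of the distribution
    return rate, error_rate, p95

# Example: four requests observed in a 2-second window
reqs = [Request(120, 200), Request(80, 200), Request(300, 500), Request(95, 200)]
rate, err_rate, p95 = red_metrics(reqs, window_seconds=2)
# rate = 2.0 req/s, err_rate = 0.5 err/s, p95 = 120 ms
```

Duration is deliberately a distribution (here a percentile), not an average: one slow outlier would otherwise hide behind many fast requests.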
Monitoring with SLI
SLI = Service Level Indicator
Quantifies meeting user expectations:
is our service working as our users expect it to?
Monitoring with SLI
Example: a backend API for user info
Availability
Specification: the % of GET requests that complete successfully
Implementation:
Latency
Specification: the % of requests returning 2xx that complete in < 500 ms
Implementation:
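One possible implementation of these two SLI specifications, as a sketch; the request-record shape (`method`, `status`, `duration_ms`) is an assumption about what your access logs or load balancer expose:

```python
def availability_sli(requests):
    """% of GET requests that complete successfully (2xx)."""
    gets = [r for r in requests if r["method"] == "GET"]
    ok = sum(1 for r in gets if 200 <= r["status"] < 300)
    return 100.0 * ok / len(gets)

def latency_sli(requests, threshold_ms=500):
    """% of 2xx requests that complete in under threshold_ms."""
    ok = [r for r in requests if 200 <= r["status"] < 300]
    fast = sum(1 for r in ok if r["duration_ms"] < threshold_ms)
    return 100.0 * fast / len(ok)

requests = [
    {"method": "GET",  "status": 200, "duration_ms": 120},
    {"method": "GET",  "status": 500, "duration_ms": 30},
    {"method": "GET",  "status": 200, "duration_ms": 610},
    {"method": "POST", "status": 201, "duration_ms": 45},
]
```

With this sample, both SLIs come out to about 66.67% — 2 of 3 GETs succeed, and 2 of 3 2xx responses finish under 500 ms.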
Monitoring with SLI + SLO
SLO = Service Level Objective
Example:
- Measured across all the backend servers from the load balancer
- Over the past 24 hours
Availability: 99.9% of GET requests complete successfully
Latency: 95% of requests that return 2xx complete in < 500 ms
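An SLO is just an SLI plus a target over a window, so checking compliance reduces to a comparison. The SLI values below are illustrative, not real measurements:

```python
def meets_slo(sli_percent, objective_percent):
    """An SLO sets a pass/fail target for an SLI over a fixed window."""
    return sli_percent >= objective_percent

# Past 24 hours, measured at the load balancer (illustrative values):
availability_sli = 99.95   # % of GET requests completing successfully
latency_sli = 96.2         # % of 2xx requests completing in < 500 ms

availability_ok = meets_slo(availability_sli, 99.9)   # objective: 99.9%
latency_ok = meets_slo(latency_sli, 95.0)             # objective: 95%
```

The window matters as much as the number: 99.9% over 24 hours tolerates far less downtime per incident than 99.9% over a quarter.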
Observability
 Operations Overview
 Monitoring Overview
 Observability Overview
Observability
Observability is how well you can understand a system's internal state from the data it exposes, measured across the entire application.
Observability captures what "monitoring" doesn't (and shouldn't), based on evidence (not conjecture).
When you lose the power to know and predict the behavior of the system, that's where observability tools come in...
Monitoring vs Observability
Monitoring tells you when something is wrong,
while Observability enables you to understand why.
Pillars of Observability
Metrics are a numeric representation of data measured
over intervals of time
Event Logging is an immutable, timestamped record of
discrete events that happened over time.
Tracing is a representation of a series of causally
related distributed events that encode the end-to-end
request flow through a distributed system.
Observability
1. Metrics
Reliability and trending in use:
o What is happening right now?
o What will happen next?
2. Tracing
A few of the critical questions that tracing can answer quickly and easily:
o Which services did a request pass through?
o Where are the bottlenecks?
o How much time is lost to network lag between services?
o What occurred in each service for a given request?
3. Logging
Good practices for more effective logs:
o Log with context (a trace-id, uuid, or similar)
o Use standardized logging levels
o Use structured logs to enable machine-readability
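The three logging practices above can be sketched together with the stdlib `logging` module. A minimal sketch: the JSON field names and the "checkout" logger name are illustrative choices, not a prescribed schema:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Structured (machine-readable) log lines that carry request context."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,                       # standardized level
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),   # context for correlation
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every log line emitted for one request shares the same trace-id,
# so logs can be joined with traces across services.
trace_id = str(uuid.uuid4())
logger.info("payment accepted", extra={"trace_id": trace_id})
```

Because the output is JSON per line, a log pipeline can filter by `level` or group by `trace_id` without fragile regex parsing.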
Thank you
#PakeLinkAja
