Observability in highly distributed systems

DEVOPS INDONESIA
Akbar
Shopee Indonesia
Jakarta, 15 February 2022

Speaker's Profile
EDUCATION
Institut Teknologi Bandung
Geodetic and Geomatic Engineering
EXPERIENCE
Shopee Indonesia
Engineer
(2021-present)
PT. SIGMA SOLUSI INTEGRASI
Database Consultant
(2019-2021)
IT Group, Inc.
Senior Database Engineer
(2015-2019)
Private & Confidential

A. About Shopee
B. Observability in Highly Distributed Systems
C. Q&A Session

About Shopee

First launched in 2015, Shopee now has offices in 8 markets:
Singapore
Malaysia
Taiwan
Thailand
Indonesia
Vietnam
Philippines
Brazil

Shopee Values
We Adapt
We Run
We Commit
We Stay Humble
We Serve

Rewarding & Impactful Career Journey
Learning & Development Opportunities
Shopee Academy
E-learning Classes
Training & Development Programs

Career Development
Internal Transfer Program
Work with Other Shopee Teams Across the Globe

Career Development
Collaboration with talented young leaders across Asia,
Europe and Latin America

Observability in a Nutshell
Observability is proactively collecting, visualizing, and applying intelligence to all of your metrics, events, logs, and traces, so you can understand the behavior of your complex digital system.
Metrics tell us the “what”;
Logs tell us the “why”;
Traces tell us the “where”.

Why Observability Matters
Fundamental tension between personas:
● Developers: The job is to make changes
● Operators: Change is the enemy of stability
Bottom line:
● The environment is constantly changing
● Systems fail in innumerable, unpredictable and exotic ways
● Failure is not an option - it is inevitable
● Observability lets you handle/manage change and move forward

Observability vs Monitoring
● Monitoring: Is it working?
○ Detect problems affecting your customers
○ Black-box (e.g., health check) vs white-box monitoring (e.g., based on service metrics)
○ Fundamentally reactive in nature
● Observability: What/how is it doing?
○ Improve insight into complex systems
○ Enhance understanding of changes
○ Visibility into internal state
○ Many proactive interactions
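The black-box vs white-box distinction can be illustrated with a minimal sketch. The `Service` class and its method names are hypothetical and exist only for this example: the health check reports that the service is up, while the internal counters reveal that a quarter of the requests failed.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical in-process service used only for illustration."""
    healthy: bool = True
    requests: int = 0
    errors: int = 0

    def handle(self, ok: bool) -> None:
        # Count every request and remember the failed ones.
        self.requests += 1
        if not ok:
            self.errors += 1

    def health_check(self) -> bool:
        # Black-box view: an external probe only learns "up" or "down".
        return self.healthy

    def metrics(self) -> dict:
        # White-box view: the service exposes its own internal counters.
        return {"requests": self.requests, "errors": self.errors}


svc = Service()
for ok in (True, True, False, True):
    svc.handle(ok)

black_box = svc.health_check()
white_box = svc.metrics()
```

Here the probe alone would not show the failed request; the service metrics do.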

Pillars of Observability
Metrics - a set of data that shows the performance of a system
Logs - a written record of events occurring in the system
Trace - the entire journey of a request or action as it moves through all the nodes of a distributed system

Metrics
Golden Signals
Golden signals are an effective way of monitoring the overall state of the system and identifying problems.
● Availability: State of your system measured from the perspective of clients (e.g. percentage of errors on total requests).
● Health: State of your system measured using periodic pings.
● Request rate: Rate of incoming requests to the system.
● Saturation: How free or loaded the system is (e.g. queue depth or available memory).
● Utilization: How busy the system is (e.g. CPU load or memory usage). This is expressed as a percentage.
● Error rate: Rate of errors being produced in the system.
● Latency: Response time of the system, usually measured at the 95th or 99th percentile.
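As a rough sketch of how some of these signals could be derived from raw request records: the function names, the `(latency_ms, status_code)` record shape, and the choice to count HTTP 5xx responses as errors are all assumptions made for this example, not something the slides prescribe. Percentiles use the simple nearest-rank method.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, to avoid floating-point surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def golden_signals(requests, window_seconds):
    """requests: list of (latency_ms, status_code) seen during the window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests observed in the window")
    # Assumption for this sketch: a 5xx status code counts as an error.
    errors = sum(1 for _, status in requests if status >= 500)
    latencies = [latency for latency, _ in requests]
    return {
        "request_rate": total / window_seconds,   # requests per second
        "error_rate": errors / window_seconds,    # errors per second
        "availability": 1 - errors / total,       # share of successful requests
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
    }


# 20 requests with latencies 10..200 ms; every 10th request fails.
sample = [(10 * i, 500 if i % 10 == 0 else 200) for i in range(1, 21)]
signals = golden_signals(sample, window_seconds=10)
```

Health, saturation, and utilization are left out because they come from pings and host-level data rather than request records.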

Logs
● Logging can be used to save information about requests (duration, status code, userId), database queries, load balancer usage and more.
● You need to find the right balance between logging everything and logging nothing, so that the logs carry enough context to be useful.
Logging consists of multiple steps:
● Collecting & Ingesting: when you generate logs in different services, you need a central place to send them
● Processing: ingested logs are enriched with metadata and attributes for future use
● Indexing: logs are segmented into groups to generate metrics, patterns and dashboards
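The three steps above can be sketched in a few lines. Everything here is illustrative: the function names `emit`, `process`, and `index`, the JSON line format, and the in-memory `central_store` list standing in for a real log backend are assumptions of this example.

```python
import json
from collections import defaultdict


def emit(service, level, message, **fields):
    """Collect: each service writes one JSON line destined for a central place."""
    return json.dumps({"service": service, "level": level, "message": message, **fields})


def process(line, environment):
    """Process: parse the line and enrich it with extra metadata."""
    record = json.loads(line)
    record["env"] = environment
    return record


def index(records, key):
    """Index: group records by one attribute so they can be counted or charted."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# In-memory stand-in for the central log store.
central_store = [
    emit("checkout", "INFO", "request served", duration_ms=42, status_code=200, user_id="u1"),
    emit("checkout", "ERROR", "payment failed", duration_ms=310, status_code=502, user_id="u2"),
    emit("search", "INFO", "request served", duration_ms=17, status_code=200, user_id="u1"),
]

records = [process(line, environment="production") for line in central_store]
by_level = index(records, "level")
errors_per_service = {
    name: sum(r["level"] == "ERROR" for r in recs)
    for name, recs in index(records, "service").items()
}
```

Grouping by `level` or `service` is what turns individual log lines into a metric such as errors per service.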

Trace
Tracing acts like the black box of an aircraft: after a crash, it helps you understand how things unfolded and discover the chain of events that led to the problem.
It provides a low-level view to understand:
● what triggered what in the program
● with which arguments
● in which order
● how long each step lasted
The result of tracing can be visualized in two ways:
● Traces: a view that looks like a flame graph, with spans and their associated metadata
● Service maps: a view that looks like a cloud of nodes and the links between them, visualizing the flow of requests
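A toy in-memory tracer shows what a span records. The `Tracer` class and the span names are hypothetical; real tracers export spans to a backend, whereas this sketch only keeps them in a list. Each span captures its parent (what triggered what), its arguments, its start order, and its duration.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy tracer for illustration; spans are kept in memory only."""

    def __init__(self):
        self.spans = []    # all spans, in the order they were started
        self._stack = []   # currently open spans, innermost last

    @contextmanager
    def span(self, name, **args):
        parent = self._stack[-1]["name"] if self._stack else None
        record = {"name": name, "parent": parent, "args": args, "duration_s": None}
        self.spans.append(record)
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the duration even if the traced code raised.
            record["duration_s"] = time.perf_counter() - start
            self._stack.pop()


tracer = Tracer()
with tracer.span("checkout", order_id=7):
    with tracer.span("reserve_stock", sku="A1"):
        time.sleep(0.01)
    with tracer.span("charge_card", amount=25):
        time.sleep(0.01)

# Parent -> child edges are the raw material for a service map.
service_map = sorted({(s["parent"], s["name"]) for s in tracer.spans if s["parent"]})
```

The nested spans correspond to the flame-graph view, and the parent/child edges to the service-map view.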

Three Phases of Observability
Know about the problem
● Knowing the issue exists, ideally before it impacts any customers
Triage the problem
● Quickly understand the context and impact of an issue
Understand the problem
● Post-mortem on an incident

Highly Distributed System

Considerations for Distributed System Observability
Consideration: Description
High availability: Can you ensure the platform is available 24/7?
Scalability: Is the platform scalable while ensuring you don't lose data?
Auto discovery: Can the platform automatically discover which endpoints to monitor by connecting, for example, to a service discovery system such as Kubernetes or Consul?
Cost management: How much time and money can the company invest in an observability platform and its implementation/management?
Compliance: What is required by the company's security compliance measures?
Integration: Does the platform integrate with external services?
Compatibility: Is the platform compatible with the company's objectives? Is the solution flexible and modular enough to support those objectives over the next three years?

Monitor a Distributed System
Automating the collection of metrics from a distributed system.
A simple method for this process is to:
1. Deploy an agent on each system to collect necessary data.
2. Centralize the collected data on a remote platform.
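The two steps above can be sketched as follows. `Agent` and `CentralPlatform` are hypothetical names; the "remote platform" is an in-memory object here, whereas a real agent would send its samples over the network, and the collected fields are chosen only to keep the example dependency-free.

```python
import os
import time


class CentralPlatform:
    """Stand-in for a remote metrics backend (in memory for this sketch)."""

    def __init__(self):
        self.samples = []

    def ingest(self, sample):
        self.samples.append(sample)


class Agent:
    """Runs on one host, collects local data, and forwards it."""

    def __init__(self, host, central):
        self.host = host
        self.central = central

    def collect(self):
        # Step 1: gather the necessary data on the local system.
        return {"host": self.host, "timestamp": time.time(), "cpu_count": os.cpu_count()}

    def run_once(self):
        # Step 2: centralize the collected data on the platform.
        self.central.ingest(self.collect())


central = CentralPlatform()
agents = [Agent(f"node-{i}", central) for i in range(3)]
for agent in agents:
    agent.run_once()

hosts_seen = sorted(sample["host"] for sample in central.samples)
```

In practice `run_once` would be called on a schedule, and the platform would be reached through its ingestion endpoint.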

Log a Distributed System
It is strongly recommended to outline a global log management policy to facilitate the process.
This policy must define aspects such as:
● The logs' format and workflow;
● Their verbosity according to the environment;
● The potential transformations necessary to quickly identify each event.
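One possible way to encode such a policy with Python's standard `logging` module is sketched below. The `VERBOSITY` mapping, the JSON field names, the `event_id` attribute, and the helper `build_logger` are assumptions made for this example; the in-memory `ListHandler` replaces a real log shipper.

```python
import json
import logging

# Hypothetical policy: one JSON format everywhere, verbosity chosen per environment.
VERBOSITY = {
    "development": logging.DEBUG,
    "staging": logging.INFO,
    "production": logging.WARNING,
}


class JsonFormatter(logging.Formatter):
    """Format: every record becomes one JSON object with a fixed set of keys."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Transformation: attach an identifier so each event is easy to find.
            "event_id": getattr(record, "event_id", None),
        })


class ListHandler(logging.Handler):
    """Keeps formatted lines in memory so the sketch needs no external system."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def build_logger(name, environment):
    logger = logging.getLogger(f"{name}.{environment}")
    logger.setLevel(VERBOSITY[environment])
    logger.propagate = False
    handler = ListHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger, handler


prod_logger, prod_handler = build_logger("checkout", "production")
prod_logger.info("cart viewed")                                   # below WARNING: dropped
prod_logger.error("payment failed", extra={"event_id": "evt-1"})  # kept
```

Keeping the format and the per-environment levels in one place makes the policy easy to apply uniformly across services.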

Trace a Distributed System
Tracing gives deep visibility into your applications, covering web services, queues, and databases, so you can monitor requests, errors, and latency.
Traces start in your instrumented applications and flow
a. Service map
b. Continuous Profiler
c. Trace retention and ingestion

Best Practices in Observability
Don’t try to monitor everything. Instead, gather only the necessary data.
Focus on monitoring the essential things and fixing them if they fail.
Avoid storing every available log or data point. Instead, store those that give insight into critical events.
Set up alerts for critical events.
Create graphs that every team member can easily understand, as this will improve the usability of the information.
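Alerting only on critical events can be sketched as a small rule evaluator. The rule names, metric names, and thresholds below are made-up values for illustration, not recommendations.

```python
def evaluate_alerts(metrics, rules):
    """Return the names of rules whose metric exceeds its threshold."""
    fired = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is not None and value > rule["threshold"]:
            fired.append(rule["name"])
    return fired


# Hypothetical rules: only the signals considered critical get an alert.
RULES = [
    {"name": "high_error_rate", "metric": "error_rate", "threshold": 0.05},
    {"name": "slow_p99", "metric": "latency_p99_ms", "threshold": 500},
]

current = {"error_rate": 0.12, "latency_p99_ms": 180, "cpu_load": 0.97}
fired = evaluate_alerts(current, RULES)
```

Note that `cpu_load` is collected but has no rule attached, so it never pages anyone; it stays available for dashboards.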

Culture Driven
Observability is not about tools; it puts more focus on people
Observability is a team sport
Observability is a methodology that needs to be practiced

Q&A Session

Drop your questions:
To Akbar - [Insert Question]
Live Q&A:

Drop your CV and feedback!
Drop your CV here: bit.ly/shopeexdevops15feb
Share your feedback here: bit.ly/techwebinarfeedback

Stay Connected With Us!
t.me/iddevops
DevOps Indonesia
@iddevops

Alone we are smart, together we are brilliant
Quote by Steve Anderson
THANK YOU!
