6. First launched in 2015, Shopee now has offices in 8 markets
Private & Confidential
Singapore, Malaysia, Taiwan, Thailand, Indonesia, Vietnam, Philippines, Brazil
8. Rewarding & Impactful Career Journey
Learning & Development Opportunities
Shopee Academy
● E-learning Classes
● Training & Development Programs
9. Rewarding & Impactful Career Journey
Learning & Development Opportunities
Career Development
● Internal Transfer Program
● Work with Other Shopee Teams Across the Globe
10. Rewarding & Impactful Career Journey
Learning & Development Opportunities
Career Development
Collaboration with talented young leaders across Asia,
Europe and Latin America
11. Observability on Highly Distributed Systems
A. About Shopee
B. Observability on Highly Distributed Systems
C. Q&A Session
12. Observability in a Nutshell
Observability is proactively collecting,
visualizing, and applying intelligence to all of
your metrics, events, logs, and traces — so you
can understand the behavior of your complex
digital system.
Metrics tell us the “what”;
Logs tell us the “why”;
Traces tell us the “where”.
13. Why Observability Matters
Fundamental tension between personas:
● Developers: The job is to make changes
● Operators: Change is the enemy of stability
Bottom line:
● The environment is constantly changing
● Systems fail in innumerable, unpredictable, and exotic ways
● Failure is not an option; it is inevitable
● Observability lets you handle and manage change and move forward
14. Observability vs Monitoring
● Monitoring: Is it working?
○ Detect problems affecting your customers
○ Black-box (e.g., health check) vs white-box
monitoring (e.g., based on service metrics)
○ Fundamentally reactive in nature
● Observability: What/how is it doing?
○ Improve insight into complex systems
○ Enhance understanding of changes
○ Visibility into internal state
○ Many proactive interactions
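The black-box versus white-box distinction can be sketched in a few lines. This is a minimal illustration, not a production check: the `Service` class, its counters, and the 5% error threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Toy in-process service that records its own metrics (hypothetical)."""
    healthy: bool = True
    requests: int = 0
    errors: int = 0

    def handle(self, ok: bool = True) -> int:
        self.requests += 1
        if not ok:
            self.errors += 1
        return 200 if ok else 500

    def ping(self) -> bool:
        return self.healthy


def black_box_check(service: Service) -> str:
    # Seen from the outside: only answers "is it working?"
    return "UP" if service.ping() else "DOWN"


def white_box_check(service: Service, max_error_ratio: float = 0.05) -> str:
    # Seen from the inside: relies on metrics the service records itself.
    if service.requests == 0:
        return "NO_DATA"
    ratio = service.errors / service.requests
    return "DEGRADED" if ratio > max_error_ratio else "OK"


svc = Service()
for i in range(10):
    svc.handle(ok=(i % 2 == 0))  # every second request fails

black_box_result = black_box_check(svc)
white_box_result = white_box_check(svc)
```

Here the health check still reports the service as up, while the metrics-based check flags it as degraded, which is the gap that white-box monitoring and observability aim to close.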
15. Pillars of Observability
Metrics: a set of data that shows the performance of a system
Logs: a written record of events occurring in the system
Traces: the entire journey of a request or action as it moves through all the nodes of a distributed system
16. Metrics
Golden Signals
Golden signals are an effective way of monitoring the overall state of the system and identifying problems.
● Availability: State of your system measured from the perspective of clients (e.g. percentage of errors out of total requests).
● Health: State of your system measured using periodic pings.
● Request rate: Rate of incoming requests to the system.
● Saturation: How free or loaded the system is (e.g. queue depth or available memory).
● Utilization: How busy the system is (e.g. CPU load or memory usage), expressed as a percentage.
● Error rate: Rate of errors being produced in the system.
● Latency: Response time of the system, usually measured at the 95th or 99th percentile.
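As a rough sketch of how some of these signals could be derived from raw request records: the `Request` type, the 10-second window, the nearest-rank percentile method, and the rule that a status of 500 or above counts as an error are all assumptions made for this example, not something the slides prescribe.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds):
    """Summarize one time window of requests (assumes the list is not empty)."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = [r.latency_ms for r in requests]
    return {
        "request_rate": total / window_seconds,          # requests per second
        "error_rate": errors / window_seconds,           # errors per second
        "availability_pct": 100 * (total - errors) / total,
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
    }


# 100 synthetic requests with latencies 1..100 ms; every 20th request fails.
requests = [
    Request(latency_ms=float(i + 1), status=500 if i % 20 == 0 else 200)
    for i in range(100)
]
signals = golden_signals(requests, window_seconds=10)
```

Percentiles are used for latency because an average would hide the slow tail that a small share of users actually experience.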
18. Logs
● Logging can be used to save information about requests (duration, status code, userId), database
queries, load balancer usage and more.
● You need to find the right balance between logging everything and nothing to gain enough context for it to
be useful.
Logging consists of multiple steps:
● Collecting & Ingesting: when you generate logs in different services, you need a central place to send them
● Processing: ingested logs are enriched with metadata and attributes for future use
● Indexing: logs are segmented into groups to generate metrics, patterns and dashboards
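The three steps above can be sketched as a tiny in-memory pipeline. This is only an illustration under stated assumptions: the JSON line format, the field names (`service`, `status`, `duration_ms`), and the `env` and `is_error` attributes are hypothetical, and a real pipeline would ship logs over the network to a dedicated platform.

```python
import json
from collections import defaultdict


def ingest(raw_lines):
    """Collecting & ingesting: parse JSON lines from many services into one list."""
    return [json.loads(line) for line in raw_lines]


def process(records, environment):
    """Processing: enrich each record with metadata for later use."""
    return [{**r, "env": environment, "is_error": r["status"] >= 500} for r in records]


def index(records):
    """Indexing: group records by service so metrics and dashboards can be derived."""
    groups = defaultdict(list)
    for r in records:
        groups[r["service"]].append(r)
    return dict(groups)


raw = [
    '{"service": "cart", "status": 200, "duration_ms": 12}',
    '{"service": "cart", "status": 503, "duration_ms": 950}',
    '{"service": "search", "status": 200, "duration_ms": 40}',
]
indexed = index(process(ingest(raw), environment="prod"))

# A metric derived from the indexed logs: error count per service.
error_counts = {svc: sum(r["is_error"] for r in recs) for svc, recs in indexed.items()}
```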
20. Trace
Tracing acts like an aircraft's black box: it helps you reconstruct what happened during a crash and discover the chain of events that led to a problem.
It provides a low-level view to understand:
● what triggered what in the program
● with which arguments
● in which order
● how long each step lasted
The result of tracing can be visualized in two ways:
● Traces: they look like a flame graph with spans and their associated metadata
● Service maps: they look like a cloud of nodes and links between them that visualize the flow of requests
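A minimal, single-threaded sketch of what a span recorder might capture (what triggered what, with which arguments, in which order, and for how long). The `Tracer` class and the span names are hypothetical and written for this example; real tracing libraries offer far more than this.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy tracer: records nested spans in start order (not thread-safe)."""

    def __init__(self):
        self.spans = []    # every span, in the order it started
        self._stack = []   # spans that are currently open

    @contextmanager
    def span(self, name, **args):
        parent = self._stack[-1]["name"] if self._stack else None
        record = {"name": name, "parent": parent, "args": args,
                  "depth": len(self._stack), "duration_s": None}
        self.spans.append(record)
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - start
            self._stack.pop()

    def render(self):
        """Indented text view, a crude stand-in for a flame graph."""
        return "\n".join(f"{'  ' * s['depth']}{s['name']} {s['args']}" for s in self.spans)


tracer = Tracer()
with tracer.span("checkout", user_id=42):
    with tracer.span("load_cart"):
        pass
    with tracer.span("charge", amount=10):
        pass

print(tracer.render())
```

The parent links are what allow a viewer to draw spans as a nested timeline; the durations give each bar its width.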
22. Three Phases of Observability
Know about the problem
● Know that the issue exists, ideally before it impacts any customers
Triage the problem
● Quickly understand the context and impact of an issue
Understand the problem
● Run a post-mortem on the incident
24. Considerations for Distributed System Observability
Consideration | Description
High availability | Can you ensure the platform is available 24/7?
Scalability | Is the platform scalable while ensuring you don't lose data?
Auto discovery | Can the platform automatically discover which endpoints to monitor by connecting, for example, to a service discovery system such as Kubernetes or Consul?
Cost management | How much time and money can the company invest in an observability platform and its implementation/management?
Compliance | What is required by the company's security compliance measures?
Integration | Does the platform integrate with external services?
Compatibility | Is the platform compatible with the company's objectives? Is the solution flexible and modular enough to support them over the next three years?
25. Monitor a Distributed System
Automate the collection of metrics from a distributed system.
A simple method for this process is to:
1. Deploy an agent on each system to collect the necessary data.
2. Centralize the collected data on a remote platform.
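The two steps above can be sketched as an agent that samples its host and hands the result to a central collector. Everything here is an assumption made for illustration: the `Agent` and `Collector` classes, the host names, and the choice of disk usage and CPU count as the "necessary data". In practice the collector would be a remote service reached over the network rather than an in-process object.

```python
import os
import shutil
import time


class Collector:
    """Stand-in for the remote platform that centralizes the data."""

    def __init__(self):
        self.samples = []

    def receive(self, sample):
        self.samples.append(sample)


class Agent:
    """Runs on each system, collects local metrics, and reports them."""

    def __init__(self, host, collector):
        self.host = host
        self.collector = collector

    def collect(self):
        usage = shutil.disk_usage(os.getcwd())
        used_pct = round(100 * usage.used / usage.total, 2) if usage.total else 0.0
        return {
            "host": self.host,
            "timestamp": time.time(),
            "disk_used_pct": used_pct,
            "cpu_count": os.cpu_count(),
        }

    def report(self):
        self.collector.receive(self.collect())


collector = Collector()
agents = [Agent("web-1", collector), Agent("web-2", collector)]  # hypothetical hosts
for agent in agents:
    agent.report()
```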
26. Log a Distributed System
It is strongly recommended to outline a
global log management policy to facilitate
the process.
This policy must define aspects such as:
● The logs' format and workflow;
● Their verbosity according to the
environment;
● The potential transformations
necessary to quickly identify each
event.
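One way such a policy could be encoded is shown below with Python's standard `logging` module. The environment names, the level chosen for each, the JSON key set, and the `event` attribute are hypothetical choices for this sketch; the slides do not specify a concrete policy.

```python
import io
import json
import logging

# Hypothetical policy: verbosity depends on the environment.
LEVEL_BY_ENV = {"dev": logging.DEBUG, "staging": logging.INFO, "prod": logging.WARNING}


class JsonFormatter(logging.Formatter):
    """Format: one JSON object per line with a fixed set of keys."""

    def __init__(self, service, env):
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": self.service,
            "env": self.env,
            # Transformation: tag each line with an event name so it is easy to find.
            "event": getattr(record, "event", "unspecified"),
            "message": record.getMessage(),
        })


def build_logger(service, env, stream):
    logger = logging.getLogger(f"{service}.{env}")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(LEVEL_BY_ENV[env])
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, env))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("cart", "prod", buffer)
log.info("cart loaded")  # dropped: below the prod verbosity level
log.warning("payment failed", extra={"event": "payment_failed"})
lines = buffer.getvalue().splitlines()
```

Keeping the format and the per-environment levels in one place means every service emits logs that the central platform can parse the same way.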
27. Trace a Distributed System
Tracing gives deep visibility into your applications, covering web services, queues, and databases, so you can monitor requests, errors, and latency
Traces start in your instrumented applications and flow
a. Service map
b. Continuous Profiler
c. Trace retention and ingestion
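To illustrate how a trace can follow a request across services and how a service map can be derived from it, here is a small in-process simulation. It is a sketch under stated assumptions: the header names `x-trace-id` and `x-caller`, the three services, and the `call` helper are all hypothetical and do not correspond to any particular tracing product or standard.

```python
import uuid
from collections import Counter

TRACE_HEADER = "x-trace-id"   # hypothetical header name
CALLER_HEADER = "x-caller"    # hypothetical header name

service_map = Counter()       # (caller, callee) -> number of calls
trace_log = []                # (trace_id, service) in the order services were reached


def call(service, handler, headers):
    """Simulate one instrumented hop: reuse the incoming trace ID or start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    caller = headers.get(CALLER_HEADER, "client")
    service_map[(caller, service)] += 1
    trace_log.append((trace_id, service))
    # Headers passed downstream carry the same trace ID.
    outgoing = {TRACE_HEADER: trace_id, CALLER_HEADER: service}
    return handler(outgoing)


def database(headers):
    return "row"


def orders(headers):
    return call("database", database, headers)


def gateway(headers):
    return call("orders", orders, headers)


result = call("gateway", gateway, {})
```

Because every hop records the same trace ID, the individual entries can be stitched back into one request journey, and the caller/callee pairs form the edges of a service map.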
28. Best Practices in Observability
● Don’t try to monitor everything. Instead, gather only the necessary data.
● Focus on monitoring essential things and fixing them if they fail.
● Avoid storing every available log or data point. Rather, store those that give insight into critical events.
● Set up alerts on critical events.
● Create graphs that every team member can easily understand, as this will improve the usability of the information.
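As a small illustration of alerting only on critical events, the rule below fires when a metric stays above a threshold for several consecutive samples, which avoids paging on a single spike. The `AlertRule` type, the metric name, and the threshold values are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    threshold: float
    for_samples: int  # consecutive breaching samples required before firing


def evaluate(rule, samples):
    """Return True when the last `for_samples` values all exceed the threshold."""
    recent = samples[-rule.for_samples:]
    return len(recent) == rule.for_samples and all(v > rule.threshold for v in recent)


rule = AlertRule(metric="error_rate_pct", threshold=5.0, for_samples=3)

sustained = evaluate(rule, [1.0, 7.0, 2.0, 6.0, 8.0, 9.0])  # last three all above 5%
single_spike = evaluate(rule, [1.0, 2.0, 9.0, 1.0])         # one spike, then recovery
too_few = evaluate(rule, [9.0, 9.0])                        # not enough data yet
```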
29. Culture Driven
Observability is not about tools; it puts more focus on people
Observability is a team sport
Observability is a methodology that needs to be practiced
30. Observability on Highly Distributed Systems
A. About Shopee
B. Observability on Highly Distributed Systems
C. Q&A Session
32. Drop your CV and feedback!
Drop your CV here: bit.ly/shopeexdevops15feb
Share your feedback here: bit.ly/techwebinarfeedback