Metrics-driven development,
an observability perspective
Huy Do
LINE corp
Introduction
• Huy Do
• Software Engineer at Observability Team
• Founded kipalog.com & Ruby Vietnam group
Agenda
• Metrics-driven culture at LINE
• Introducing our observability stack
LINE
• A lot of end users (~170M active)
• A lot of traffic
• A lot of services (delivery, taxi, games,
manga…)
What we care about
• User Experience
• One important aspect of User Experience
is Reliability
RELIABILITY
• No downtime
• Low MTTR (Mean Time To Repair)
• Fast response
• Fair response time
• Fair percentile latency: p99, p95, p50
HOW
CULTURE
• EVERY engineer MUST care about their application's
status
• EVERY engineer MUST do on-call rotation
• NO "application engineers" who only write code
• We have a dedicated team that provides stable tools
so engineers can watch their application's status as well as possible
CULTURE
APPLICATION STATUS?
OBSERVABILITY
– Wikipedia
“observability is a measure of how
well internal states of a system can be
inferred from knowledge of its external
outputs”
METRICS
LOGGING
TRACING
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
METRICS
• Metrics
• The simplest form is a triple
• (name, value, timestamp)
• Can be represented as a graph
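A minimal sketch of that triple as a data structure (the field names are illustrative, not IMON's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Metric:
    """A metric in its simplest form: a (name, value, timestamp) triple."""
    name: str
    value: float
    timestamp: int  # unix epoch seconds

def as_triple(m: Metric) -> tuple:
    # A time-ordered series of these triples for one name is
    # exactly what gets drawn as a graph.
    return (m.name, m.value, m.timestamp)
```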
METRICS
• System Metrics
• CPU / Disk IO / Network / Disk Usage...
• MUST: have alerts for critical metrics by default (users
don't know what to monitor, or what a good
threshold is)
• Application Metrics
• Internal queue size, endpoint latency tail (p50, p95,
p99), request size, request count
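The latency tail percentiles above can be computed from raw samples with a simple nearest-rank method; this is a toy sketch (production systems usually use histograms or sketches for efficiency):

```python
def percentile(samples, p):
    """Nearest-rank percentile of raw latency samples; p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank of the p-th percentile, clamped to a valid index.
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

def latency_tail(samples):
    # The three tail metrics mentioned on the slide.
    return {"p50": percentile(samples, 50),
            "p95": percentile(samples, 95),
            "p99": percentile(samples, 99)}
```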
METRICS
• At LINE we care A LOT about Application Metrics
• We try to instrument every single newly added piece of logic
• Some of our heavy servers export over 10,000
metrics per server
METRICS
LOGGING
Warn / Error / Fatal logs
for alerting
• At LINE, all error / warning logs MUST be
• Permanently stored (for troubleshooting later)
• Used for alerting
• Easy to query (you should not have to go to each host
and grep access logs)
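One way to make logs "easy to query" without grepping every host is to emit structured records into a central store; a sketch with made-up field names (not our actual log schema):

```python
import json

ALERT_LEVELS = {"WARN", "ERROR", "FATAL"}

def to_log_record(level, service, host, message, ts):
    # Structured JSON so a central store can index and query it,
    # instead of grepping flat files on every host.
    return json.dumps({"level": level, "service": service, "host": host,
                       "message": message, "@timestamp": ts})

def is_alertable(record_json):
    # Only warn/error/fatal records feed the alerting pipeline.
    return json.loads(record_json)["level"] in ALERT_LEVELS
```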
LOGGING
LOGGING
Real-time error/warn log analysis with the help of
Elasticsearch / Kibana
LOGGING
Daily report on error trends
TRACING
• Not a common concept in typical services
• Very helpful in a microservice or fully async
system, where a response could come from
multiple services or multiple async threads
TRACING
TRACING
OpenZipkin
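The core idea behind Zipkin-style tracing can be sketched in a few lines: every span in a request shares one trace id, and each hop records its parent. This is illustrative only, not the OpenZipkin API:

```python
import uuid

def new_trace():
    # The root span starts a new trace; trace_id is shared by every hop.
    trace_id = uuid.uuid4().hex
    return {"trace_id": trace_id, "span_id": uuid.uuid4().hex, "parent_id": None}

def child_span(parent):
    # Each downstream service (or async thread) gets its own span,
    # linked back to the caller through parent_id, so the full request
    # path can be reassembled afterwards.
    return {"trace_id": parent["trace_id"],
            "span_id": uuid.uuid4().hex,
            "parent_id": parent["span_id"]}
```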
LINE OBSERVABILITY
STACK
• We call it IMON
• IMON can
• Aggregate metrics from tens of thousands of hosts, and
alert on them
• Aggregate warn/error logs from applications and alert on them
• (ongoing) Trace requests across services
HOW BIG?
• ~20 million metrics per minute
• And growing every day
• ~500k logs received per minute (at peak, up
to a few million)
ARCHITECTURE
DETAILS
• Sharded MySQL cluster (~50 servers)
• Partitioned by “customers”
• Batched writes for better throughput
METRICS DATABASE
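The two ideas on this slide, sharding by customer and batching writes, can be sketched like this (shard count and row shape are assumptions for illustration):

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 50  # assumption matching the ~50-server cluster

def shard_for(customer: str) -> int:
    # Stable hash so a customer always lands on the same MySQL shard.
    return zlib.crc32(customer.encode()) % NUM_SHARDS

def batch_by_shard(rows):
    # rows: (customer, name, value, ts). Grouping per shard lets each
    # shard receive one bulk INSERT instead of a round-trip per metric.
    batches = defaultdict(list)
    for row in rows:
        batches[shard_for(row[0])].append(row)
    return dict(batches)
```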
• MySQL is not a good fit for a time series database
• What makes a "good TSDB"?
• Compression
• Optimized for writes, but reads MUST be fast enough
• Flexible queries (topK, rate, delta)
• Fast aggregation
• We're moving to OpenTSDB
METRICS DATABASE
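The query primitives a good TSDB needs (rate, delta, topK) are easy to state over (timestamp, value) series; a toy sketch, not OpenTSDB's implementation:

```python
def delta(series):
    # series: [(timestamp, value), ...] in ascending time order.
    return series[-1][1] - series[0][1]

def rate(series):
    # Per-second rate of an increasing counter over the window.
    dt = series[-1][0] - series[0][0]
    return delta(series) / dt

def top_k(series_by_name, k):
    # Names of the k series with the largest latest value.
    latest = lambda name: series_by_name[name][-1][1]
    return sorted(series_by_name, key=latest, reverse=True)[:k]
```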
• Elasticsearch to store warn/error logs
• Elasticsearch is very good at writes (with the help
of batched writes from the application layer)
• However, one bad read query can kill the server
LOGGING DATABASE
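Batching writes from the application layer maps onto Elasticsearch's `_bulk` API, whose request body is newline-delimited JSON: one action line plus one source line per document. A sketch of building that payload (index name and fields are illustrative):

```python
import json

def bulk_payload(index, docs):
    # The _bulk body is newline-delimited JSON: an action line, then the
    # document itself, repeated for each doc, with a trailing newline.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```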
• Wrote our own in Go
• Similar architecture to telegraf (but with a buffer)
• Fully managed
• Monitor all agents' CPU / memory usage...
• Monitor all agents' errors
• Automatic roll-outs
TELEMETRY AGENT
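A telegraf-like agent with a buffer can be reduced to a few lines: points queue locally and drain in batches, surviving a temporarily unreachable backend. A toy sketch (buffer size and the send interface are assumptions, and our real agent is written in Go):

```python
from collections import deque

class BufferedAgent:
    def __init__(self, send, max_buffer=10000):
        self.send = send  # callable(batch) -> True on success
        # Bounded buffer: the oldest points are dropped once it fills up.
        self.buf = deque(maxlen=max_buffer)

    def collect(self, point):
        self.buf.append(point)

    def flush(self):
        # Drain in one batch; on failure, points stay buffered for the
        # next flush instead of being lost. Returns points still buffered.
        batch = list(self.buf)
        if batch and self.send(batch):
            self.buf.clear()
        return len(self.buf)
```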
• Flexible routing rules
• Dedicated collectors for big customers
• Drop requests via dynamic configuration
• Built with armeria and centraldogma
ROUTING GATEWAY
https://github.com/line/armeria
https://github.com/line/centraldogma
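The gateway's routing logic, dedicated collectors for big customers plus dynamically configured drops, might look like this in miniature (names are hypothetical; the real gateway is built on armeria with its configuration in centraldogma):

```python
DEFAULT_COLLECTOR = "shared-collector"

def route(customer, dedicated, dropped):
    # 'dropped' is reloaded dynamically (in the real gateway, from
    # centraldogma), so a noisy customer can be shed without a redeploy.
    if customer in dropped:
        return None  # request dropped by dynamic configuration
    # Big customers get their own collector; everyone else shares one.
    return dedicated.get(customer, DEFAULT_COLLECTOR)
```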
• A faster, more stable TSDB
• Wire everything together
• For every alert, see the big picture with metrics/
logs/tracing in the same place
• Autonomous alerting
• With the help of Machine Learning
FUTURE
FINALLY
• How you monitor reflects your engineering
culture
• Data-driven culture
• Stability-driven culture
• Monitoring is NOT only for devops engineers or
sysadmins, but for EVERY
ENGINEER
Thank you for listening
