Metrics-driven development,
an observability perspective
Huy Do
LINE corp
Introduction
• Huy Do
• Software Engineer at Observability Team
• Founded kipalog.com & Ruby Vietnam group
Agenda
• Metrics-driven culture at LINE
• Introducing our observability stack
LINE
• A lot of end users (~170M active)
• A lot of traffic
• A lot of services (delivery, taxi, games,
manga…)
What we care about
• User Experience
• One important aspect of User Experience
is Reliability
RELIABILITY
• No downtime
• Low MTTR (Mean Time To Repair)
• Fast response
• Fair response time
• Fair percentile latency: p99, p95, p50
HOW
CULTURE
• EVERY engineer MUST care about their application's
status
• EVERY engineer MUST do on-call rotation
• NO "application engineers" who only write code
• We have a dedicated team that provides stable tools
so engineers can watch their application's status as well as possible
CULTURE
APPLICATION STATUS?
OBSERVABILITY
– Wikipedia
“observability is a measure of how
well internal states of a system can be
inferred from knowledge of its external
outputs”
METRICS
LOGGING
TRACING
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
METRICS
• Metrics
• The simplest form is a triple
• (name, value, timestamp)
• Can be represented as a graph
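A minimal sketch of that triple as a data structure (the field names are illustrative, not IMON's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Metric:
    """A metric in its simplest form: a (name, value, timestamp) triple."""
    name: str
    value: float
    timestamp: int  # unix epoch seconds

def as_triple(m: Metric) -> tuple:
    # A time-ordered series of these triples for one name is
    # exactly what gets drawn as a graph.
    return (m.name, m.value, m.timestamp)
```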
METRICS
• System Metrics
• CPU / Disk IO / Network / Disk Usage...
• MUST: have alerts for critical metrics by default (users
don't know what to monitor, or what a good
threshold is)
• Application Metrics
• Internal queue size, endpoint latency tail (p50, p95,
p99), request size, request count
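The latency tail percentiles above can be computed from raw samples with a simple nearest-rank method; this is a toy sketch (production systems usually use histograms or sketches for efficiency):

```python
def percentile(samples, p):
    """Nearest-rank percentile of raw latency samples; p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank of the p-th percentile, clamped to a valid index.
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

def latency_tail(samples):
    # The three tail metrics mentioned on the slide.
    return {"p50": percentile(samples, 50),
            "p95": percentile(samples, 95),
            "p99": percentile(samples, 99)}
```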
METRICS
• At LINE we care A LOT about Application Metrics
• We try to instrument every single newly added piece of logic
• Some of our heavy servers export over 10,000
metrics per server
METRICS
LOGGING
Warn / Error / Fatal logs
for alerting
• At LINE, all error / warning logs MUST be
• Permanently stored (for troubleshooting later)
• Used for alerting
• Easy to query (you should not have to go to each host
and grep access logs)
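One way to make logs "easy to query" without grepping every host is to emit structured records into a central store; a sketch with made-up field names (not our actual log schema):

```python
import json

ALERT_LEVELS = {"WARN", "ERROR", "FATAL"}

def to_log_record(level, service, host, message, ts):
    # Structured JSON so a central store can index and query it,
    # instead of grepping flat files on every host.
    return json.dumps({"level": level, "service": service, "host": host,
                       "message": message, "@timestamp": ts})

def is_alertable(record_json):
    # Only warn/error/fatal records feed the alerting pipeline.
    return json.loads(record_json)["level"] in ALERT_LEVELS
```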
LOGGING
LOGGING
Real-time error/warn log analysis with the help of
Elasticsearch / Kibana
LOGGING
Daily report on error trends
TRACING
• Not a common concept in typical services
• Very helpful in a microservice or fully async
system, where a response could come from
multiple services or multiple async threads
TRACING
TRACING
OpenZipkin
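The core idea behind Zipkin-style tracing can be sketched in a few lines: every span in a request shares one trace id, and each hop records its parent. This is illustrative only, not the OpenZipkin API:

```python
import uuid

def new_trace():
    # The root span starts a new trace; trace_id is shared by every hop.
    trace_id = uuid.uuid4().hex
    return {"trace_id": trace_id, "span_id": uuid.uuid4().hex, "parent_id": None}

def child_span(parent):
    # Each downstream service (or async thread) gets its own span,
    # linked back to the caller through parent_id, so the full request
    # path can be reassembled afterwards.
    return {"trace_id": parent["trace_id"],
            "span_id": uuid.uuid4().hex,
            "parent_id": parent["span_id"]}
```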
LINE OBSERVABILITY
STACK
• We call it IMON
• IMON can
• Aggregate metrics from tens of thousands of hosts, and
alert on them
• Aggregate warn/error logs from applications and alert on them
• (ongoing) Trace requests across services
HOW BIG?
• ~20 million metrics per minute
• And growing every day
• ~500k logs received per minute (at peak, up
to a few million)
ARCHITECTURE
DETAILS
• Sharded MySQL cluster (~50 servers)
• Partitioned by “customers”
• Batched writes for better throughput
METRICS DATABASE
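The two ideas on this slide, sharding by customer and batching writes, can be sketched like this (shard count and row shape are assumptions for illustration):

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 50  # assumption matching the ~50-server cluster

def shard_for(customer: str) -> int:
    # Stable hash so a customer always lands on the same MySQL shard.
    return zlib.crc32(customer.encode()) % NUM_SHARDS

def batch_by_shard(rows):
    # rows: (customer, name, value, ts). Grouping per shard lets each
    # shard receive one bulk INSERT instead of a round-trip per metric.
    batches = defaultdict(list)
    for row in rows:
        batches[shard_for(row[0])].append(row)
    return dict(batches)
```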
• MySQL is not a good fit for a time series database
• What makes a "good TSDB"?
• Compression
• Optimized for writes, but reads MUST be fast enough
• Flexible queries (topK, rate, delta)
• Fast aggregation
• We're moving to OpenTSDB
METRICS DATABASE
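The query primitives a good TSDB needs (rate, delta, topK) are easy to state over (timestamp, value) series; a toy sketch, not OpenTSDB's implementation:

```python
def delta(series):
    # series: [(timestamp, value), ...] in ascending time order.
    return series[-1][1] - series[0][1]

def rate(series):
    # Per-second rate of an increasing counter over the window.
    dt = series[-1][0] - series[0][0]
    return delta(series) / dt

def top_k(series_by_name, k):
    # Names of the k series with the largest latest value.
    latest = lambda name: series_by_name[name][-1][1]
    return sorted(series_by_name, key=latest, reverse=True)[:k]
```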
• Elasticsearch to store warn/error logs
• Elasticsearch is very good at writes (with the help
of batched writes from the application layer)
• However, one bad read query can kill the server
LOGGING DATABASE
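Batching writes from the application layer maps onto Elasticsearch's `_bulk` API, whose request body is newline-delimited JSON: one action line plus one source line per document. A sketch of building that payload (index name and fields are illustrative):

```python
import json

def bulk_payload(index, docs):
    # The _bulk body is newline-delimited JSON: an action line, then the
    # document itself, repeated for each doc, with a trailing newline.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```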
• Wrote our own in Go
• Similar architecture to telegraf (but with a buffer)
• Fully managed
• Monitor all agents' CPU / memory usage...
• Monitor all agents' errors
• Automatic roll-outs
TELEMETRY AGENT
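A telegraf-like agent with a buffer can be reduced to a few lines: points queue locally and drain in batches, surviving a temporarily unreachable backend. A toy sketch (buffer size and the send interface are assumptions, and our real agent is written in Go):

```python
from collections import deque

class BufferedAgent:
    def __init__(self, send, max_buffer=10000):
        self.send = send  # callable(batch) -> True on success
        # Bounded buffer: the oldest points are dropped once it fills up.
        self.buf = deque(maxlen=max_buffer)

    def collect(self, point):
        self.buf.append(point)

    def flush(self):
        # Drain in one batch; on failure, points stay buffered for the
        # next flush instead of being lost. Returns points still buffered.
        batch = list(self.buf)
        if batch and self.send(batch):
            self.buf.clear()
        return len(self.buf)
```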
• Flexible routing rules
• Dedicated collectors for big customers
• Drop requests via dynamic configuration
• Built with armeria and centraldogma
ROUTING GATEWAY
https://github.com/line/armeria
https://github.com/line/centraldogma
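The gateway's routing logic, dedicated collectors for big customers plus dynamically configured drops, might look like this in miniature (names are hypothetical; the real gateway is built on armeria with its configuration in centraldogma):

```python
DEFAULT_COLLECTOR = "shared-collector"

def route(customer, dedicated, dropped):
    # 'dropped' is reloaded dynamically (in the real gateway, from
    # centraldogma), so a noisy customer can be shed without a redeploy.
    if customer in dropped:
        return None  # request dropped by dynamic configuration
    # Big customers get their own collector; everyone else shares one.
    return dedicated.get(customer, DEFAULT_COLLECTOR)
```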
• A faster, more stable TSDB
• Wire everything together
• For every alert, see the big picture with metrics/
logs/tracing in the same place
• Autonomous alerting
• With the help of Machine Learning
FUTURE
FINALLY
• How you monitor reflects your engineering
culture
• Data-driven culture
• Stability-driven culture
• Monitoring is NOT only for devops engineers or
sysadmins, but for EVERY
ENGINEER
Thank you for listening
