
Metrics driven development with dedicated Observability Team


Do Xuan Huy (LINE Corporation)
LINE Vietnam Opening Day, March 31st 2018



  1. Metrics driven development, an observability perspective Huy Do LINE corp
  2. Introduction • Huy Do • Software Engineer at Observability Team • Founded kipalog.com & Ruby Vietnam group
  3. Agenda • Metrics driven culture at LINE • Introducing our observability stack
  4. LINE • A lot of end users (~170M active) • A lot of traffic • A lot of services (delivery, taxi, games, manga…)
  5. What we care about • User Experience • One important aspect of User Experience is Reliability
  6. RELIABILITY • No downtime • Low MTTR (Mean Time To Repair) • Fast response • Fair response time • Fair percentile latency: p99, p95, p50
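The percentile latencies the slide names (p50, p95, p99) can be sketched with a small nearest-rank calculation; the sample latencies below are illustrative, not figures from the talk.

```python
# Sketch: nearest-rank percentiles over a list of request latencies (ms).
# p50 is the median; p99 exposes the latency tail that averages hide.

def percentile(samples, p):
    """Smallest value that covers at least p% of the sorted samples."""
    ordered = sorted(samples)
    # nearest-rank index: ceil(p/100 * n) - 1, clamped to a valid index
    k = max(0, -(-len(ordered) * p // 100) - 1)
    return ordered[k]

latencies_ms = [12, 15, 11, 250, 14, 13, 16, 900, 12, 14]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

Note how two slow outliers barely move p50 but dominate p95/p99, which is why tail percentiles matter for user-facing reliability.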
  7. HOW
  8. CULTURE
  9. • EVERY engineer MUST care about their application's status • EVERY engineer MUST do on-call rotation • NO "application engineers" who only write code • We have a dedicated team that provides stable tools for watching application status CULTURE
  10. APPLICATION STATUS?
  11. OBSERVABILITY
  12. – Wikipedia “observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs”
  13. METRICS LOGGING TRACING https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
  14. METRICS
  15. • Metrics • The simplest form is a triple • (name, value, timestamp) • Can be represented as a graph METRICS
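The (name, value, timestamp) triple from the slide can be written down directly; the metric name below is an invented example.

```python
# Sketch of the metric triple described on the slide: a metric is just
# a named value at a point in time. A graph is a series of these
# triples for the same name, ordered by timestamp.
import time
from typing import NamedTuple

class Metric(NamedTuple):
    name: str
    value: float
    timestamp: float

m = Metric("api.request.count", 42.0, time.time())
print(m.name, m.value)
```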
  16. • System Metrics • CPU / Disk IO / Network / Disk Usage... • MUST have alerts for critical metrics by default (users don't know what to monitor, or what a good threshold is) • Application Metrics • Internal queue size, endpoint latency tail (p50, p95, p99), request size, request count METRICS
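The "alert on critical system metrics by default" idea can be sketched as a table of default thresholds checked against incoming samples. The metric names and threshold values here are invented for illustration, not LINE's actual defaults.

```python
# Sketch: default-threshold alerting for system metrics. Users get
# sensible alerts without having to pick thresholds themselves.
DEFAULT_THRESHOLDS = {
    "cpu.usage_percent": 90.0,
    "disk.usage_percent": 85.0,
}

def check_alerts(samples):
    """Return (metric, value, threshold) for every breached default."""
    alerts = []
    for name, value in samples.items():
        limit = DEFAULT_THRESHOLDS.get(name)
        if limit is not None and value >= limit:
            alerts.append((name, value, limit))
    return alerts

print(check_alerts({"cpu.usage_percent": 95.0, "disk.usage_percent": 40.0}))
```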
  17. • At LINE we care A LOT about Application Metrics • We try to instrument every newly added piece of logic • Some of our heavy servers export over 10,000 metrics per server METRICS
  18. LOGGING
  19. Warn / Error / Fatal logs for alerting
  20. • At LINE, all error/warning logs MUST be • Permanently stored (for troubleshooting later) • Used for alerting • Easy to query (you should not have to go to each host and grep the logs) LOGGING
  21. LOGGING Real-time error/warn log analysis with the help of Elasticsearch / Kibana
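"Easy to query" here means a single centralized search instead of grepping each host. A query like the one below follows the standard Elasticsearch query DSL; the index and field names (`level`, `@timestamp`) are assumptions for illustration.

```python
# Sketch: one centralized query for recent error logs, instead of
# ssh + grep on every host. In practice this body would be POSTed to
# an Elasticsearch _search endpoint.
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"level": "ERROR"}},                    # only error-level logs
                {"range": {"@timestamp": {"gte": "now-15m"}}},   # last 15 minutes
            ]
        }
    },
    "size": 100,
}
print(query["query"]["bool"]["must"][0])
```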
  22. LOGGING Daily report for error trends
  23. TRACING
  24. • Not a common concern in a typical monolithic service • Very helpful in a microservice or fully async system, where a response can come from multiple services or multiple async threads TRACING
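The core idea of tracing can be sketched in a few lines: one trace id follows a request across services, and each hop records a span. This mirrors the Zipkin data model (trace id plus per-hop span ids) but is a toy, not the OpenZipkin API; the service names are invented.

```python
# Sketch: a single trace_id tying together spans from multiple services,
# so a slow response can be attributed to the hop that caused it.
import uuid

def new_trace():
    return uuid.uuid4().hex

def call_service(name, trace_id, spans):
    """Record a span for one hop of the request."""
    span = {"service": name, "trace_id": trace_id, "span_id": uuid.uuid4().hex}
    spans.append(span)
    return span

spans = []
trace_id = new_trace()
call_service("gateway", trace_id, spans)
call_service("payment", trace_id, spans)
# every span shares the trace id, so the whole request can be reassembled
assert all(s["trace_id"] == trace_id for s in spans)
```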
  25. TRACING OpenZipkin
  26. LINE OBSERVABILITY STACK
  27. • We call it IMON • IMON can • Aggregate metrics from tens of thousands of hosts, and alert on them • Aggregate warn/error logs from applications, and alert on them • (ongoing) Trace requests across services
  28. HOW BIG?
  29. • ~20 million metrics per minute • And growing every day • ~500k log entries received per minute (peaks can reach a few million)
  30. ARCHITECTURE
  31. DETAILS
  32. • Sharded MySQL cluster (~50 servers) • Partitioned by "customers" • Batched writes for better throughput METRICS DATABASE
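"Partition by customers" can be sketched as a stable hash of the customer id mapped to one of the ~50 shards; everything besides the shard count is invented for illustration.

```python
# Sketch: each customer's metrics always land on the same MySQL shard,
# chosen by hashing the customer id. A stable hash keeps routing
# consistent across gateway restarts.
import hashlib

NUM_SHARDS = 50  # matches the ~50 servers on the slide

def shard_for(customer_id: str) -> int:
    digest = hashlib.md5(customer_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

print(shard_for("line-delivery"))  # hypothetical customer name
```

Batched writes then group many metric rows per shard into one INSERT, which is where the throughput win on the slide comes from.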
  33. • MySQL is not a good fit for a time-series database • What makes a "good TSDB"? • Compression • Optimized for writes, but reads MUST be fast enough • Flexible queries (topK, rate, delta) • Fast aggregation • We're moving to OpenTSDB METRICS DATABASE
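Two of the "flexible query" operations the slide expects from a good TSDB, delta and rate, can be sketched over a counter series of (timestamp, value) pairs; the sample data is illustrative.

```python
# Sketch: delta (total increase) and rate (per-second increase) over
# a monotonically increasing counter, the bread-and-butter queries of
# a time-series database.

def delta(series):
    """Total increase between the first and last sample."""
    return series[-1][1] - series[0][1]

def rate(series):
    """Average per-second increase over the series."""
    dt = series[-1][0] - series[0][0]
    return delta(series) / dt if dt else 0.0

# (unix_seconds, counter_value) pairs, e.g. a request counter
counter = [(0, 100), (60, 160), (120, 280)]
print(delta(counter))  # 180
print(rate(counter))   # 1.5
```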
  34. • Elasticsearch to store warn/error logs • Elasticsearch is very good at writes (with batched writes from the application layer) • However, a bad read query can kill the server LOGGING DATABASE
  35. • Wrote our own in Go • Similar architecture to telegraf (but with a buffer) • Fully managed • Monitor every agent's CPU / memory usage... • Monitor every agent's errors • Automatic roll-out TELEMETRY AGENT
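The "telegraf-like, but with a buffer" design can be sketched as an agent that accumulates metrics in memory and flushes them in batches, so a slow collector does not block instrumentation. All names and the flush size are illustrative (the real agent is written in Go).

```python
# Sketch: a buffering telemetry agent. record() is cheap; the network
# cost is paid in batched flushes.
class BufferedAgent:
    def __init__(self, flush_size=3):
        self.buffer = []
        self.flush_size = flush_size
        self.sent = []  # stand-in for the network send

    def record(self, metric):
        self.buffer.append(metric)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        # In a real agent this would be one batched HTTP/TCP write
        self.sent.extend(self.buffer)
        self.buffer.clear()

agent = BufferedAgent()
for i in range(7):
    agent.record(("queue.size", i))
print(len(agent.sent), len(agent.buffer))  # 6 1
```

A real agent would also flush on a timer so a quiet buffer never holds data indefinitely.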
  36. • Flexible routing rules • Dedicated collectors for big customers • Drop requests via dynamic configuration • Built with Armeria and Central Dogma ROUTING GATEWAY https://github.com/line/armeria https://github.com/line/centraldogma
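The gateway's routing rules can be sketched as a lookup that sends big customers to a dedicated collector and drops blocked ones; in the slide's stack, Central Dogma would serve such rules as dynamic configuration. Customer and collector names here are invented.

```python
# Sketch: per-customer routing at the gateway. In production these
# tables would be reloaded at runtime from a config service rather
# than hard-coded.
ROUTES = {"big-messaging-app": "collector-dedicated-01"}
DROPPED = {"noisy-test-app"}
DEFAULT = "collector-shared"

def route(customer: str):
    if customer in DROPPED:
        return None  # drop the request entirely
    return ROUTES.get(customer, DEFAULT)

print(route("big-messaging-app"))  # collector-dedicated-01
print(route("unknown-app"))        # collector-shared
```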
  37. • Faster, more stable TSDB • Wire everything together • For every alert, see the big picture with metrics/logs/tracing in the same place • Autonomous alerting • With the help of Machine Learning FUTURE
  38. FINALLY • How you monitor reflects your engineering culture • Data driven culture • Stability driven culture • Monitoring IS NOT only for devops engineers or sysadmins, but for EVERY ENGINEER
  39. Thank you for listening
