When a production alert fires, how do you identify the root cause in the shortest possible time? We solve this problem with counters. A counter is a line of code that a developer inserts to count interesting events. In this talk, we'll cover the following topics: 1) what a counter looks like in Golang production code, 2) our counter pipeline, 3) the service dashboard built on counters, and 4) how to use counters to trace a production issue all the way down to a specific line of code. We'll also cover a few interesting counter use cases, including: 1) how counters help our weekly server upgrade, 2) using counters for autoscaling, and 3) case studies that show what counters can do when outages happen, along with future applications.
2. About me
- 17 Media Architect
- Past
- HTC: cloud backend
- Google: Google Fiber, embedded system
- NVIDIA: VLSI hardware
- roylou@gmail.com
3. About HTC CSI Project
- Cloud service infrastructure for mobile apps (similar to Parse.com)
- Backed 5+ apps and 3M+ users
- 50 < # of VMs < 200 (Autoscaled)
- ~15 microservices
- Team of 15 engineers
Apps pictured: One Gallery, Umadeit (Fun Fit)
14. How Frequently Should I Send Counters?
Option 1: Forward every counter to Elasticsearch
Option 2: Aggregate locally before forwarding
Cost of Option 1: 1,000 counters / container * 100 counts / second = 100k qps
We use Option 2: aggregate locally and send every 30 seconds
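Local aggregation (Option 2) can be sketched roughly as follows. This is a minimal illustration, not the pipeline from the talk: the `Aggregator` type, its `Inc` and `Flush` methods, and the counter names are all hypothetical. The idea is that each increment only touches an in-memory map, and one record per counter is forwarded when the map is flushed.

```go
package main

import (
	"fmt"
	"sync"
)

// Aggregator (hypothetical name) accumulates counter increments in memory
// so that only one record per counter is forwarded per flush interval.
type Aggregator struct {
	mu     sync.Mutex
	counts map[string]int64
}

// NewAggregator returns an empty aggregator.
func NewAggregator() *Aggregator {
	return &Aggregator{counts: make(map[string]int64)}
}

// Inc is the one line of code a developer places next to an interesting event.
func (a *Aggregator) Inc(name string) {
	a.mu.Lock()
	a.counts[name]++
	a.mu.Unlock()
}

// Flush returns the counts accumulated since the previous flush and resets them.
func (a *Aggregator) Flush() map[string]int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	snapshot := a.counts
	a.counts = make(map[string]int64)
	return snapshot
}

func main() {
	agg := NewAggregator()
	for i := 0; i < 3000; i++ {
		agg.Inc("login.success") // assumed counter name
	}
	agg.Inc("login.error") // assumed counter name

	// In a real service, a time.Ticker firing every 30 seconds would call
	// Flush and forward the snapshot; here it is called once and printed.
	batch := agg.Flush()
	fmt.Println(batch["login.success"], batch["login.error"])
}
```

With this shape, the forwarding rate depends on the number of distinct counters and the flush interval, not on how often events occur.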
16. How Long Can I Store Counters?
- 50,000 counters
- 1 record every 30 seconds
To store every counter for 1 year:
50,000 * 4 (bytes) * 2 (records/minute) * 525,600 (mins/year)
= 210,240,000,000 bytes
= 210.24 GB (raw 4-byte values only)
Need to aggregate for long-term storage
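One way to aggregate for long-term storage is to roll the 30-second records up into coarser buckets. The sketch below is an assumption about how that could look, not the talk's implementation: `Sample`, `Rollup`, and `RawBytesPerYear` are hypothetical names, and the one-hour bucket is an arbitrary choice. `RawBytesPerYear` simply repeats the slide's arithmetic for a given record interval.

```go
package main

import "fmt"

// Sample is one flushed counter value; Unix is the flush time in seconds.
type Sample struct {
	Unix  int64
	Value int64
}

// Rollup sums samples into fixed-width buckets keyed by bucket start time.
func Rollup(samples []Sample, bucketSeconds int64) map[int64]int64 {
	out := make(map[int64]int64)
	for _, s := range samples {
		start := s.Unix - s.Unix%bucketSeconds
		out[start] += s.Value
	}
	return out
}

// RawBytesPerYear repeats the slide's estimate: counters * bytes per value *
// records per year, where records per year depends on the record interval.
func RawBytesPerYear(counters, bytesPerValue, intervalSeconds int64) int64 {
	const secondsPerYear = 525600 * 60
	return counters * bytesPerValue * (secondsPerYear / intervalSeconds)
}

func main() {
	samples := []Sample{{0, 5}, {30, 7}, {3570, 1}, {3600, 4}}
	hourly := Rollup(samples, 3600)
	fmt.Println(hourly[0], hourly[3600])

	// Raw bytes per year at 30-second and at hourly resolution.
	fmt.Println(RawBytesPerYear(50000, 4, 30))
	fmt.Println(RawBytesPerYear(50000, 4, 3600))
}
```

Going from 30-second records to hourly records reduces the record count by a factor of 120, at the cost of losing sub-hour detail in old data.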
36. autoscale(workload, 'docvcs', 6, 30, 6, 'diff', 0.65, 0.2, 2/3)
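- 6: minimum # of instances
- 30: maximum # of instances
- 6: maximum # of VMs to be scaled
- 0.65: target workload
- 0.2: safeguard
[Figure: ▵Instance plotted against workload, with workload marks at 0.45, 0.65, and 0.85, a level of 6, and the safeguard band labeled.]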
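The labeled parameters on this slide (minimum and maximum # of instances, maximum # of VMs to be scaled, target workload, safeguard) suggest a policy with a no-op band around the target. The sketch below is one possible reading, not the talk's actual algorithm: the `Policy` type, the `Desired` method, and the proportional formula are assumptions, and the `'diff'` and `2/3` arguments are left out because the slide does not explain them.

```go
package main

import (
	"fmt"
	"math"
)

// Policy (hypothetical) holds the labeled parameters from the slide.
type Policy struct {
	Min, Max  int     // minimum and maximum # of instances
	MaxStep   int     // maximum # of instances added or removed per decision
	Target    float64 // target workload, e.g. 0.65
	Safeguard float64 // half-width of the no-op band around Target, e.g. 0.2
}

// clamp limits v to the range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// Desired returns the instance count to move to for the observed workload.
// The proportional formula is an assumption made for illustration.
func (p Policy) Desired(current int, workload float64) int {
	if math.Abs(workload-p.Target) <= p.Safeguard {
		return current // inside the safeguard band: do nothing
	}
	// Instance count that would bring the workload back to the target.
	want := int(math.Ceil(float64(current) * workload / p.Target))
	// Limit the change per decision, then respect the cluster bounds.
	delta := clamp(want-current, -p.MaxStep, p.MaxStep)
	return clamp(current+delta, p.Min, p.Max)
}

func main() {
	p := Policy{Min: 6, Max: 30, MaxStep: 6, Target: 0.65, Safeguard: 0.2}
	fmt.Println(p.Desired(10, 0.70)) // inside the band
	fmt.Println(p.Desired(10, 0.95)) // above the band
	fmt.Println(p.Desired(8, 0.10))  // below the band
}
```

The safeguard band keeps the cluster from oscillating on small workload changes, and the per-decision step limit bounds how far a single noisy reading can move it.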
43. Summary of Counter
A line of code. Can be used for:
- Rolling update
- Monitor / alert
- Debug cluster
- Autoscale cluster
- Simple business logic
- And many others (use your imagination)