SRE CH12 - Effective Troubleshooting

If you want to play in the World Cup, take a close look at what your opponents are doing:
● Read online: https://landing.google.com/sre/book.html
● SRE conference talks:
a. SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs
b. SREcon16 - Performance Checklists for SREs
c. SREcon16 - The Realities of the Job of Delivering Reliability
d. SouthBay SRE: Cloud Capacity Planning - August 9th 2016
e. Site Reliability Engineering at Dropbox

Chapter 12
Effective Troubleshooting
Author: Chris Jones

https://www.usenix.org/system/files/login/articles/login_june_07_jones.pdf

There are no gurus. Step on enough landmines, solve every one of them, and you are the guru.
-- Rick Hwang
Be warned that being an expert is more than understanding how a system is
supposed to work. Expertise is gained by investigating why a system doesn't work.
-- Brian Redman
A system written by a god would have no landmines.

Ways in which things go right are special cases of the ways in which things go
wrong.
-- John Allspaw

Results of a health check-up report
"Healthy" is just a special case
Jack Ma never said this

A process for troubleshooting
Problem Report → Triage → Examine → Diagnose → Test / Treat → Cure
Consider re-triaging if the situation changes.

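Pitfalls
● Causes of ineffective troubleshooting, usually in the Triage, Examine, and Diagnose steps:
○ Not understanding the system well enough
○ Looking at the wrong symptoms, or misreading what the symptoms mean
○ Jumping too early to wildly improbable causes
○ Trying to fix other problems that are merely related to the current one
● How to avoid them:
○ Learn how the system works and the basic patterns of distributed systems
○ When you hear hoofbeats, think of horses, not zebras
○ When several explanations are possible, prefer the simplest one
● Correlation is not causation
○ Packet loss and failed disks may share a single cause, such as a power failure
○ As systems grow more complex and monitor more metrics, pure coincidences become unavoidable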
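The point about pure coincidences can be illustrated with a small sketch. It assumes that unrelated metrics can be modeled as independent random walks; the helper names (`pearson`, `random_walks`, `strongest_pair`) are made up for this illustration. The more series you compare, the stronger the best "correlation" you will find, even though nothing is causally related.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def random_walks(count, length, seed=0):
    """Generate `count` independent random walks (stand-ins for unrelated metrics)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(count):
        value, walk = 0.0, []
        for _ in range(length):
            value += rng.gauss(0, 1)
            walk.append(value)
        walks.append(walk)
    return walks


def strongest_pair(series):
    """Return the largest absolute correlation found among all pairs of series."""
    return max(abs(pearson(a, b)) for a, b in itertools.combinations(series, 2))
```

Comparing `strongest_pair(random_walks(5, 50))` with `strongest_pair(random_walks(100, 50))` shows the effect: the larger set can only match or beat the smaller one, because it contains every pair of the smaller set plus many more.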


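Problem Report
● Sources:
○ Reported by a colleague
○ Raised by a system
● An effective report states:
○ How to reproduce
○ Expected result
○ Actual result
● Reports should live in a searchable system, such as a bug / issue tracker
● Reports should be backed by analysis tools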
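A minimal sketch of what such a report could look like as a data structure. The class and field names are assumptions chosen for illustration, not the schema of any particular issue tracker.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemReport:
    """Hypothetical report record with the three parts an effective report needs."""
    title: str
    steps_to_reproduce: list = field(default_factory=list)
    expected_result: str = ""
    actual_result: str = ""

    def missing_fields(self):
        """List the parts that are still empty, so the reporter can be asked for them."""
        missing = []
        if not self.steps_to_reproduce:
            missing.append("steps_to_reproduce")
        if not self.expected_result.strip():
            missing.append("expected_result")
        if not self.actual_result.strip():
            missing.append("actual_result")
        return missing


def search_reports(reports, keyword):
    """Case-insensitive keyword search over the text of every report."""
    needle = keyword.lower()
    hits = []
    for report in reports:
        text = " ".join([report.title, report.expected_result,
                         report.actual_result, *report.steps_to_reproduce])
        if needle in text.lower():
            hits.append(report)
    return hits
```

Keeping the fields structured is what makes the reports searchable and lets tooling reject a report that cannot be acted on.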

Defect - a problem found during development

Bug - a problem found after the system has gone live

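Problem Report - Automated Reporting
Example from the Shakespeare search service: for the past five minutes, searches for "the forms of things unknown" have not returned the correct result:
1. The alerting system automatically files a bug
2. It fills in the test URL and the on-call details, and assigns the bug to the responsible person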
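The two steps above can be sketched as follows. This is only an illustration: the probe format, the five-minute window constant, and the bug fields are assumptions, and a real setup would call the alerting and issue-tracking systems instead of returning a dict.

```python
from datetime import datetime, timedelta

FAILURE_WINDOW = timedelta(minutes=5)  # assumed threshold, taken from the example


def failing_for_window(probes, now, window=FAILURE_WINDOW):
    """Return True when every probe inside the window failed.

    probes: list of (timestamp, ok) tuples, oldest first.
    The oldest probe must be at least `window` old, otherwise there is
    not enough history to claim the failure lasted the whole window.
    """
    if not probes or now - probes[0][0] < window:
        return False
    recent = [ok for ts, ok in probes if now - ts <= window]
    return bool(recent) and not any(recent)


def maybe_file_bug(probes, now, query, test_url, on_call):
    """Build a pre-filled bug (as a dict) if the probe has been failing, else None."""
    if not failing_for_window(probes, now):
        return None
    return {
        "title": f"Search probe failing: {query}",
        "test_url": test_url,   # lets the assignee reproduce the failure
        "assignee": on_call,    # whoever is on call for the service
        "source": "alerting",
    }
```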

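Automated XXX - Pitfalls
● Don't fall for the "automated XXX" myth
○ Automated testing
○ Automated deployment
○ Automated operations
○ Automated mining
○ Automated pay raises
● Automation != black magic
○ Don't assume that once something is "automated" everything is handled and you can go home
● What people usually call "automation" mostly just means turning a procedure / steps into code.
○ Automation
○ Process Control
○ Steps
○ Automation = Process = Steps ??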
https://rickhw.github.io/2017/11/12/DevOps/Gossip-Automation/

Electrical engineering: automatic control (Automation Control)
● Automate (verb) ⇒ Conditions ⇒ Actions
● Control (verb) ⇒ a process (object)
AWS Summit Series 2016: Big Data Architectural Patterns and Best Practices on AWS

[Figure: Automation. Labels: Input, Process, Output, Log Collection, Analyze (realtime), AI / ML, Conditions (event driven), Actions (Do Something), Feedback]
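Why this is worth stressing
● Most people assume "automation" means work that scores 80 points
● The "automation" actually being done scores -25 points: it only carries out procedural (Process) tasks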


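Triage
● Assess the severity of the problem
● For large or critical systems:
○ Doing everything possible to restore service → First Priority
○ Starting to troubleshoot immediately to find the root cause → not the First Priority
● What to do:
○ Divert traffic to clusters that are serving normally
○ Turn off some of the system's features
○ Mitigating the impact comes first
● To locate the problem quickly afterwards:
○ Preserve the scene
○ Collect logs
Novice pilots are taught that in an emergency their first responsibility is to
fly the airplane; troubleshooting is secondary to getting the plane and everyone
on it safely onto the ground.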
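Diverting traffic to healthy clusters can be sketched as a reweighting step. The function and cluster names are hypothetical, and the sketch deliberately ignores whether the remaining clusters have enough capacity, which a real system would have to check first.

```python
def shift_traffic(weights, unhealthy):
    """Return new traffic weights with the unhealthy clusters drained.

    weights: dict of cluster name -> share of traffic (shares sum to 1.0).
    unhealthy: set of cluster names that should receive no traffic.
    The drained share is spread over the healthy clusters in proportion
    to the share they already carried.
    """
    healthy = {name: w for name, w in weights.items() if name not in unhealthy}
    total = sum(healthy.values())
    if total == 0:
        raise ValueError("no healthy cluster left to take traffic")
    drained = {name: 0.0 for name in weights if name in unhealthy}
    rebalanced = {name: w / total for name, w in healthy.items()}
    return {**drained, **rebalanced}
```

The change restores service without explaining anything about the fault, which is the order of priorities the slide describes: mitigate first, find the root cause afterwards.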


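Examine
● Use the monitoring system to see whether the system as a whole is working properly
● Time series are a good way to understand a system, but beware of spurious correlations!
● Logs matter: they show how the parts of a distributed system interact
● Different products need very different logging systems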
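One common way to see how the parts of a distributed system interact is to tag every log record with a request ID and group by it. The record fields below (`ts`, `request_id`, `service`, `status`) are assumptions for this sketch, not a format the slides prescribe.

```python
from collections import defaultdict


def trace_requests(records):
    """Group structured log records by request ID, ordered by timestamp.

    Returns a dict of request_id -> list of (service, status) hops.
    """
    traces = defaultdict(list)
    for record in sorted(records, key=lambda r: r["ts"]):
        traces[record["request_id"]].append((record["service"], record["status"]))
    return dict(traces)


def first_failure(trace):
    """Return the first service in a trace that did not report 'ok', or None."""
    for service, status in trace:
        if status != "ok":
            return service
    return None
```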

2017/08/25 (Fri): https://www.cool3c.com/article/128238

Ref: 淺談系統監控與 AWS CloudWatch 的應用 (An Introduction to System Monitoring and Using AWS CloudWatch)
Levels of Health Check
● Light / Static Health Check
● Layer Health Check
● Deep Health Check

Light / Static Health Check
[Figure: architecture diagram. Labels: Route 53, ELB (Internet-Facing), ASG, Web Servers, ELB (Internal ELB), App Servers, Third Party Services, Health-Checker, Light Health Check, Layer Health Check, Deep Health Check, Service A, Service B]
Layer Health Check
[Figure: the same architecture diagram as for the Light / Static Health Check]
Deep Health Check
[Figure: the same architecture diagram as for the Light / Static Health Check]
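● Light / Static Health Check
○ The application itself is up, e.g., Tomcat or IIS is running normally
● Layer Health Check
○ One app can talk to another, e.g., Tomcat to Redis
○ When something breaks, it pinpoints which node has the problem
● Deep Health Check
○ Confirms that the service's own business logic works, e.g., login and checkout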
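The three levels can be sketched as three groups of checks collected into one report. The check callables are injected, so the names used here (`run_checks`, `health_report`) and whatever is passed in are placeholders; real checks would ping the actual application, its dependencies, and its business flows.

```python
def run_checks(checks):
    """Run each named zero-argument check; an exception counts as unhealthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def health_report(app_checks, dependency_checks, business_checks):
    """Combine the three health check levels into one report.

    light: the application process itself responds.
    layer: the application can reach each dependency (app to app).
    deep:  business scenarios such as login or checkout succeed.
    """
    report = {
        "light": run_checks(app_checks),
        "layer": run_checks(dependency_checks),
        "deep": run_checks(business_checks),
    }
    report["healthy"] = all(ok for level in report.values() for ok in level.values())
    return report
```

Because the levels are reported separately, a failing `layer` entry with a passing `light` entry points at the connection to a dependency rather than at the application itself.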

[Figure: Service Dependencies (Internal). Labels: Service A, Service B, Service C, Service D, Service E (Third Party)]

● Light / Static Health Check - Application Self
● Layer Health Check - App to App
● Deep Health Check - Service Self
● Service Health Check - Service to Services

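Uses of Health Checks
● When a finished application is handed to another team (Test, Operation) for deployment, it is the checkpoint that confirms the deployment is correct
● Self-testing during CD
● A basic reference point for isolating problems that span many systems, especially in a microservice architecture
● The starting point for examination when the system misbehaves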


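Diagnose - Simplify and Reduce
● Every interface in the system has well-defined inputs and outputs
○ Check them with black-box testing
● Divide and Conquer
○ Split one big problem into many small problems;
○ once the small problems are solved, the original big problem is solved as well.
○ If a small problem is still hard, split it into even smaller problems.
○ Splitting the problem and defeating the pieces one by one is the spirit of Divide and Conquer.
Aside 1: A unit test exercises the smallest piece of the problem, which is why UT matters.
Aside 2: When planning a project schedule, "break down" the tasks.
Aside 3: In algorithms, Divide and Conquer is used to analyze recursion and merge sort.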
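Applied to a chain of components, divide and conquer becomes a bisection: check the data at the middle boundary, then search only the half that contains the fault. The sketch below assumes that once the data is wrong it stays wrong downstream; `first_bad_stage` and the `probe` callable are hypothetical names.

```python
def first_bad_stage(stages, probe):
    """Binary-search for the first stage whose output is wrong.

    stages: ordered list of stage names, upstream first.
    probe:  callable(stage_name) -> True if the output of that stage is
            still correct (a black-box check at the stage boundary).
    Returns the name of the first bad stage, or None if every stage is good.
    """
    lo, hi = 0, len(stages)  # stages before lo are good; stages from hi on are bad
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(stages[mid]):
            lo = mid + 1   # fault is further downstream
        else:
            hi = mid       # fault is here or further upstream
    return stages[lo] if lo < len(stages) else None
```

With five stages this needs at most three probes instead of five, and the saving grows with the length of the chain.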

Diagnose - Ask "what", "where", and "why"

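Diagnose - What touched it last
● A working system keeps working until an external force acts on it, for example:
○ A configuration change
○ A shift in user traffic
● Reviewing recent changes to the system is a big help in finding the problem
● A well-designed system keeps a complete record of deployment and configuration changes
○ Integrating deployed-version information with the monitoring system helps match anomalies to changes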
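Given a change log and the time an anomaly started, the first suspects are the changes that landed shortly before it. A minimal sketch, assuming each change is a dict with an `at` timestamp and that a one-hour lookback is a reasonable default; both are illustrative choices.

```python
from datetime import datetime, timedelta


def changes_before(changes, anomaly_start, lookback=timedelta(hours=1)):
    """Return the changes that landed within `lookback` before the anomaly, newest first.

    changes: list of dicts with at least an "at" datetime, e.g. deployments
             and configuration pushes exported from the change log.
    """
    window_start = anomaly_start - lookback
    suspects = [c for c in changes if window_start <= c["at"] <= anomaly_start]
    return sorted(suspects, key=lambda c: c["at"], reverse=True)
```

A change appearing in the list is only a lead, not proof: it shows correlation in time, and still has to be confirmed, for example by rolling it back.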



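Test and Treat
● An ideal test has mutually exclusive outcomes: it rules out one group of hypotheses and rules in another.
○ In practice this is hard to achieve
● Tests can produce misleading results
● Running a test can have side effects
● Some tests are not conclusive, only suggestive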
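The idea of an ideal test can be sketched as hypothesis elimination: each hypothesis predicts an outcome, and the observed outcome removes the hypotheses that predicted something else. The hypothesis and outcome names in the usage below are invented for illustration.

```python
def apply_test(hypotheses, predictions, observed):
    """Keep the hypotheses that are still consistent with the observed outcome.

    predictions: dict of hypothesis -> outcome that hypothesis predicts.
    A hypothesis with no prediction for this test is kept, because the
    test says nothing about it.
    """
    return {h for h in hypotheses if h not in predictions or predictions[h] == observed}


def is_discriminating(hypotheses, predictions):
    """A test is only worth running if the hypotheses disagree about its outcome."""
    outcomes = {predictions[h] for h in hypotheses if h in predictions}
    return len(outcomes) > 1
```

For example, with the hypotheses `network_down` and `db_refusing`, a ping test that `network_down` predicts will fail and `db_refusing` predicts will succeed is discriminating: whichever outcome is observed, exactly one hypothesis survives.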


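Cure
In production we can often only identify probable causes, because:
● System complexity: many factors may jointly contribute to the problem
● Reproducing the problem in production may be impossible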

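Making Troubleshooting Easier
Ways to simplify and speed up troubleshooting:
● Build in observability
● Design systems with well-understood, observable interfaces between components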

References
● https://www.usenix.org/system/files/login/articles/login_june_07_jones.pdf


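Negative Results Are Magic - when the expected effect does not appear
● For example, a new system design, heuristic, or workflow fails to improve on the old system it was meant to replace.
● Negative results should not be ignored or discounted
● Negative results from an experiment are conclusive
● Tools and methods can outlive the current experiment and help future work
● Publishing negative results improves the industry's data-driven culture
● Publish your results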
