Service quality terminology: SLI / SLO / SLA
● Service Level Indicator (SLI): a concrete, quantitative measure of service quality
● Service Level Objective (SLO): a target value or range of values for an SLI
● Service Level Agreement (SLA)
Service Level Indicator (SLI)
A concrete, quantitative measure of service quality, such as:
○ request latency
○ error rate
○ system throughput: requests per second (RPS), queries per second (QPS)
○ availability: the SLI that SRE cares about most
○ durability: how long data can be retained intact
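The indicators listed above can be derived from a plain request log. The following is a minimal sketch, not a production implementation: the `Request` record, the `compute_slis` helper, and the sample data are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # True if the request succeeded


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(requests, window_seconds):
    """Compute a few common SLIs over one measurement window."""
    total = len(requests)
    failures = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": percentile([r.latency_ms for r in requests], 99),
        "error_rate": failures / total,
        "throughput_rps": total / window_seconds,
        "availability": (total - failures) / total,
    }


# Hypothetical one-minute window: 98 fast successes, 1 slow success, 1 failure.
sample = [Request(5.0, True)] * 98 + [Request(80.0, True), Request(200.0, False)]
slis = compute_slis(sample, window_seconds=60)
print(slis)
```

Availability is approximated here as the fraction of successful requests; a time-based definition (uptime divided by total time) is an equally common choice.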
Standardizing indicators
Standardize the definitions of common SLIs so they do not have to be re-evaluated every time
● Aggregation intervals: “Averaged over 1 minute”
● Aggregation regions: “All the tasks in a cluster”
● How frequently measurements are made: “Every 10 seconds”
● Which requests are included: “HTTP GETs from black-box monitoring jobs”
● How the data is acquired: “Through our monitoring, measured at the server”
● Data-access latency: “Time to last byte”
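One way to make such a standard reusable is to record it as a shared template that individual SLIs copy and override only where they differ. The sketch below is an assumption about how that might look; `SLITemplate`, its field names, and `BATCH_LATENCY_SLI` are hypothetical, and the default values are the example phrases from the list above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SLITemplate:
    aggregation_interval: str
    aggregation_region: str
    measurement_frequency: str
    included_requests: str
    data_acquisition: str
    data_access_latency: str


# Shared defaults, taken from the example phrases in the list above.
DEFAULT_SLI = SLITemplate(
    aggregation_interval="Averaged over 1 minute",
    aggregation_region="All the tasks in a cluster",
    measurement_frequency="Every 10 seconds",
    included_requests="HTTP GETs from black-box monitoring jobs",
    data_acquisition="Through our monitoring, measured at the server",
    data_access_latency="Time to last byte",
)

# A specific SLI overrides only the fields that differ from the standard
# (the 5-minute interval is an invented example value).
BATCH_LATENCY_SLI = replace(DEFAULT_SLI, aggregation_interval="Averaged over 5 minutes")
```

Because the template is frozen, a deviation from the standard has to be spelled out explicitly through `replace`, which keeps exceptions visible in review.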
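Defining objectives
● Concrete SLO definitions (there can be more than one)
○ 90% of Get RPCs < 1 ms
○ 99% of Get RPCs < 10 ms
○ 99.9% of Get RPCs < 100 ms
● If there are both batch users and real-time users, the SLOs can be split:
○ Batch: 95% of Set RPCs < 1 s
○ Real-time: 99% of Set RPCs with payloads < 1 kB complete in < 10 ms
● To reiterate: a 100% SLO is unreasonable and very costly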
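A multi-threshold SLO like the Get RPC example above can be checked by measuring, for each threshold, the fraction of requests that finished faster than it. This is a minimal sketch under assumed names (`GET_RPC_SLO`, `fraction_within`, `evaluate_slo`) with invented latency samples, not a description of any particular monitoring system.

```python
# (target fraction, latency threshold in ms), mirroring the Get RPC targets above.
GET_RPC_SLO = [(0.90, 1.0), (0.99, 10.0), (0.999, 100.0)]


def fraction_within(latencies_ms, threshold_ms):
    """Fraction of requests that completed strictly faster than the threshold."""
    return sum(1 for v in latencies_ms if v < threshold_ms) / len(latencies_ms)


def evaluate_slo(latencies_ms, targets):
    """Return (target, threshold_ms, observed, met) for each SLO target."""
    results = []
    for target, threshold_ms in targets:
        observed = fraction_within(latencies_ms, threshold_ms)
        results.append((target, threshold_ms, observed, observed >= target))
    return results


# Invented sample of 1000 Get RPC latencies (ms).
latencies = [0.5] * 920 + [5.0] * 75 + [50.0] * 3 + [150.0] * 2
report = evaluate_slo(latencies, GET_RPC_SLO)
for target, threshold_ms, observed, met in report:
    status = "met" if met else "missed"
    print(f"{target:.1%} < {threshold_ms} ms: observed {observed:.1%} ({status})")
```

In this sample the first two targets are met, while the 99.9% target is missed because two requests out of 1000 took longer than 100 ms.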
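Choosing SLOs
● Don’t pick a target based on current performance
○ Do not look only at how the system behaves today; start from the big picture
● Keep it simple
○ Overly complex aggregations are hard to understand and can mask systemic changes
● Avoid absolutes
○ Requiring a system to scale without adding any latency, or to be available at all times, is unrealistic
● Have as few SLOs as possible
○ Choose just enough SLOs to cover the system's attributes
● Perfection can wait (imperfect is fine too)
○ Define and adjust SLOs over time as you learn more about the system.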