Implementing error budgets


Yaroslav Molochko

Practical implementation of error budgets

Engineering

Who am I
● SRE team lead @AnchorFree
● @onorua
● https://www.meetup.com/Kubernetes-Kyiv/
● Deal with production a lot
● Introduced Error Budgets @AnchorFree
Yaroslav Molochko

AnchorFree in numbers
● 650M customers
● Several thousand nodes
● 148 containers handle user traffic
● 16M metrics

Possible options for prioritisation
1. Review documentation
2. Escalate
3. Round-Robin
4. Random

SLO
(Service Level Objective)
Binding target for collections of SLIs

SLO practice
1. Obliged part
2. Validity Period
3. Expression

SLI
(Service Level Indicator)
A measure of the service level provided by a service provider
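
As a minimal sketch of what such a measure can look like, the snippet below computes a success-rate SLI from request counts and checks it against an SLO target. The function names and the sample counts are illustrative assumptions, not taken from the talk.

```python
def success_rate_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that were served successfully."""
    if total_requests == 0:
        # No traffic: treat the service as within its objective (an assumption).
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """The service is within SLO when the SLI reaches the target."""
    return sli >= slo_target


# Hypothetical counts: 999,950 good requests out of 1,000,000.
sli = success_rate_sli(999_950, 1_000_000)
print(f"SLI = {sli:.5f}, within 99.9% SLO: {meets_slo(sli, 0.999)}")
```

Keeping the SLI a plain ratio makes it easy to compare directly with the SLO target.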

[Figure: error budget over a 30-day window: 43.2m at 99.9%, 432m (7.2h) at 99%, 2160m (36h) at 95%]
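
The values 43.2m, 432m (7.2h) and 2160m (36h) are consistent with SLO targets of 99.9%, 99% and 95% measured over a 30-day window. A small sketch of that arithmetic follows; the function name and the 30-day default are assumptions made for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability in the window for a given SLO target."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for target in (0.999, 0.99, 0.95):
    budget = error_budget_minutes(target)
    print(f"SLO {target:.1%}: {budget:.1f} min ({budget / 60:.1f} h)")
```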

[Figure: error budget remaining: 100%, 10%, 100%]

[Figure: error budget remaining, with budgets of 43.2m and 2160m (36h)]
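
The share of budget remaining can be tracked as one minus the ratio of time spent out of SLO to the total budget. The sketch below assumes the budget and the downtime are both measured in minutes; the function name and the sample downtime are hypothetical.

```python
def budget_remaining(budget_minutes: float, bad_minutes: float) -> float:
    """Fraction of the error budget still unspent, clamped to the range [0, 1]."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    remaining = 1.0 - bad_minutes / budget_minutes
    return max(0.0, min(1.0, remaining))


# A 43.2-minute budget after a hypothetical 38.88 minutes of downtime.
print(f"{budget_remaining(43.2, 38.88):.0%} of the budget remains")
```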

[Figure: burn rate: 0.01%/min, 0.1%/min, 1%/min]

[Figure: error budget remaining over time at different burn rates; time axis ticks at 60m, 120m and 180m; budgets of 43.2m and 36h; other labels: 16m, 100m, 166m]
Annotations on the slide:
0.432 * 0.01 * 60 sec = 0.26 sec per second
0.432 * 0.1 * 60 sec = 2.592 sec per second
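
One way to read the rates on these slides, assuming "% min" means the share of the total budget consumed per minute, is to ask how long a budget lasts at a constant burn rate. The sketch below works under that assumption only; the function name is hypothetical.

```python
def minutes_until_exhausted(burn_percent_per_minute: float,
                            remaining_fraction: float = 1.0) -> float:
    """Minutes until the budget reaches zero at a constant burn rate."""
    if burn_percent_per_minute <= 0:
        raise ValueError("burn rate must be positive")
    return remaining_fraction * 100.0 / burn_percent_per_minute


for rate in (0.01, 0.1, 1.0):
    print(f"{rate}%/min -> {minutes_until_exhausted(rate):.0f} min")
```

Under this reading, a full budget burning at 1% per minute is gone in 100 minutes, which may be what the 100m label on the slide refers to.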

What happens when Error Budget is fully utilized
● Your team switches to maintenance mode for the service
● Your team stops onboarding new services
● Only tests and hotfixes are allowed in
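
A policy like this can be encoded as a simple gate in a release pipeline. The sketch below is only an illustration: the change types and the function name are assumptions, and the talk does not describe how the policy was enforced.

```python
from enum import Enum


class ChangeType(Enum):
    FEATURE = "feature"
    ONBOARDING = "onboarding"
    TEST = "test"
    HOTFIX = "hotfix"


# Change types still allowed once the budget is spent (tests and hotfixes).
ALLOWED_WHEN_EXHAUSTED = {ChangeType.TEST, ChangeType.HOTFIX}


def is_change_allowed(change: ChangeType, budget_remaining: float) -> bool:
    """Allow everything while budget remains; otherwise only tests and hotfixes."""
    if budget_remaining > 0:
        return True
    return change in ALLOWED_WHEN_EXHAUSTED


print(is_change_allowed(ChangeType.FEATURE, 0.0))  # feature work is blocked
print(is_change_allowed(ChangeType.HOTFIX, 0.0))   # hotfixes still go through
```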

Extra responsibility?
Don’t forget: every “contract” has at least two sides

This will not work in our unique case
Tell me that when you become Google

Sell this to your boss
1. Acknowledge responsibility for subsystem
2. Focus on user needs
3. Agree on what happens when the error budget is exhausted
4. Get sign-off from neighboring teams and management

How to merge multiple SLIs (symptoms) into one SLO?

Obvious options
● AVG
● MAX
● MIN
Non-obvious options
● An Approach for QoS-aware Service Composition based on Genetic Algorithms (link)

Logic AND over bool
● 1 * 1 * ... = 1: all services are within SLO
● 1 * 0 * ... = 0: at least one service is acting up

Logic AND over bool
min(sli:gpr_edge_message_delivery:last30d >= bool 0.9999)
*
min(sli:gpr_edge_message_95le:last30d >= bool 0.95)
== bool 1
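
The same idea can be sketched outside PromQL: each SLI is turned into a 0/1 flag by comparing it with its target, and the flags are multiplied so that a single 0 brings the merged result to 0. The function names and sample values below are hypothetical; only the two targets (0.9999 and 0.95) come from the expression above.

```python
from math import prod


def within_slo(sli_values: list[float], target: float) -> int:
    """1 if every series of this SLI meets the target, else 0."""
    return min(int(value >= target) for value in sli_values)


def merged_slo(flags: list[int]) -> int:
    """Multiply 0/1 flags: the product is 1 only when every flag is 1."""
    return prod(flags)


# Hypothetical 30-day SLI values for two instances of each indicator.
delivery = within_slo([0.99995, 0.99991], target=0.9999)
latency = within_slo([0.97, 0.94], target=0.95)
print(merged_slo([delivery, latency]))
```

Taking the minimum inside each SLI mirrors the min(...) aggregation: one instance below target is enough to mark that SLI as failing.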

Alert attributes
● Detection time
● Precision
● Reset time

Default alerting:
[Figure: error rate (0 to 1000) over time (5 to 20); one ALERT! marker]

Default alerting: FOR 10m
[Figure: error rate (0 to 1000) over time (5 to 20); no alert marker]

Burn Rate alerting:
[Figure: error rate (0 to 1000) over time (5 to 20); four ALERT! markers]

Burn Rate alerting: AND over 2 windows
[Figure: error rate (0 to 1000) over time (5 to 20); two ALERT! markers]

Burn rate calculation
BR = (W / P) * E
BR = ((30 * 24) / 1) * 2% = (720 / 1) * 0.02 = 14.4
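
Reading the slide's formula as the SLO window divided by the alert window, multiplied by the share of budget consumed, the calculation can be sketched as follows. The parameter names are assumptions; the numbers are the ones on the slide.

```python
def burn_rate(slo_window_hours: float, alert_window_hours: float,
              budget_consumed: float) -> float:
    """Burn rate at which `budget_consumed` of the budget is spent within the alert window."""
    return slo_window_hours / alert_window_hours * budget_consumed


# 30-day SLO window, 1-hour alert window, 2% of the budget consumed.
print(round(burn_rate(30 * 24, 1, 0.02), 2))
```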

Severity  Long window  Short window  Burn rate  Error budget consumed
Page      1 hour       5 minutes     14.4       2%
Page      6 hours      30 minutes    6          5%
Ticket    3 days       6 hours       1          10%
Source: Site Reliability Workbook | Ways to Alert on Significant Events | page 85
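
A rough sketch of how these rows could be evaluated: a rule fires only when the burn rate exceeds its threshold in both the long and the short window. The data structure, the window keys and the sample burn rates are assumptions for illustration; in practice this logic would normally live in the monitoring system's alert rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    severity: str
    long_window: str
    short_window: str
    threshold: float  # burn rate that must be exceeded in both windows


# Rows from the table above.
RULES = [
    BurnRateRule("page", "1h", "5m", 14.4),
    BurnRateRule("page", "6h", "30m", 6.0),
    BurnRateRule("ticket", "3d", "6h", 1.0),
]


def firing_rules(burn_rates: dict[str, float]) -> list[BurnRateRule]:
    """Return rules whose threshold is exceeded in BOTH the long and the short window."""
    return [
        rule for rule in RULES
        if burn_rates[rule.long_window] > rule.threshold
        and burn_rates[rule.short_window] > rule.threshold
    ]


# Hypothetical measured burn rates per window.
observed = {"5m": 20.0, "30m": 9.0, "1h": 15.0, "6h": 4.0, "3d": 0.8}
for rule in firing_rules(observed):
    print(rule.severity, rule.long_window, rule.short_window)
```

Requiring the short window as well means the alert stops firing soon after the problem ends, which is why no additional waiting period is needed.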

Main alerting takeaways
● The short-to-long window ratio is 1/12 (the magic ratio)
● Don’t use an extra FOR clause in the alert rules
● Burn rate is not a magic number
● A 2% threshold leaves room for 50 alerts per month within budget (100% / 2%)

Error budgets
● Rules (you know how to play, you know how to score)
● Self-Escalation
● Overcommitment protection
● Put your users first!


Implementing error budgets

1. Error budgets: practical implementation

4. Let’s talk about production

21. SLO: 99.9%, 99%, 95%

22. Error Budget

25. Error budget: 99.9%, 99%, 95%

29. Burn rate

32. SLO: 99.9%, 99%, 95%

34. Stages of Error Budgets adoption

39. Get support from your peers

43. Measure what is important to your user

47. Measure symptoms

52. Logic AND over bool 1 0 1 0 1 0

55. Burning Rate for alerts?

66. Q.A.

Editor's Notes

Main idea of the talk: an error budget is necessary for everyone who runs production. Three main reasons why this is true. Three conclusions that can be drawn from the main idea. The resulting call to action: what the audience should do.
One of these services launches a ballistic missile. The second is animations in Slack. The third processes 3 million dollars per second.
One of these services launches a ballistic missile. The second is animations in Slack. The third processes 3 million dollars per second. Which one do you fix first?
Burnout
Pain
Denial
Anger
There can be many service indicators that affect how a service works, but only a few of them are visible to the customer. For example, slow reads from the database may affect the service, but the SLO should contain the success rate or the 0.95-percentile request latency. Too many SLIs in an SLO leads to wasted time.
