Protect your users with
Circuit Breakers
Scott Triglia
EuroPython 2016
Jim’s
Yelp’s Mission:
Connecting people with great
local businesses.
Yelp Stats
As of Q1 2016
90M · 32 · 70% · 102M
Scott Triglia
@scott_triglia
Work with the $$$
Let’s talk Circuit Breakers
Our goals today:
* Introduce a basic circuit breaker
* Build a modular circuit breaker
* Test it out on several scenarios
the fundamental rule:
your systems will fail
what’s your response?
Nygard’s circuit breaker
Circuit Breaker States:
* Healthy (or “closed”)
* Recovering (or “half-open”)
* Unhealthy (or “open”)
Recovery:
* Wait for recovery_timeout seconds
* Send a trial request, trust its results
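The three states and the recovery rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation from the talk: the class name, the `max_failures` threshold, and the injectable `clock` are assumptions made for the example, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is unhealthy."""


class CircuitBreaker:
    """Minimal three-state breaker: healthy, unhealthy, recovering."""

    def __init__(self, max_failures=3, recovery_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures          # assumed trip threshold
        self.recovery_timeout = recovery_timeout  # seconds to wait before a trial
        self.clock = clock                        # injectable so tests need no sleep
        self.failures = 0
        self.opened_at = None                     # None means healthy ("closed")

    @property
    def state(self):
        if self.opened_at is None:
            return "healthy"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "recovering"
        return "unhealthy"

    def call(self, func, *args, **kwargs):
        if self.state == "unhealthy":
            raise CircuitOpenError("failing fast, backend marked unhealthy")
        # Healthy or recovering: let the request through.
        # While recovering, this call is the trial request.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.opened_at is not None or self.failures >= self.max_failures:
            self.opened_at = self.clock()  # trip or re-trip, restarting the timeout

    def _on_success(self):
        # Trust the result: any success returns the breaker to healthy.
        self.failures = 0
        self.opened_at = None
```

Callers wrap the risky operation, for example `cb.call(fetch_menu)` with a hypothetical `fetch_menu`, and handle `CircuitOpenError` as the well-defined failure mode.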
Before a circuit breaker:
* Diners wait forever to get food
* Kitchen has a growing backlog
* New diners making things worse
With a circuit breaker:
* Fewer frustrated users
* Reduced load on the backend
* A well defined failure mode
Module 1: Should our waiters all agree?
New Behavior:
* Clients inform each other
* Processes are no longer independent
* Propagate failure faster
* Requires distributed datastore
* Forces decisions about consistency
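One way to picture clients informing each other is to move the breaker state out of each process and into a shared store. The sketch below is illustrative only: `InMemoryStore` is a hypothetical stand-in for a distributed datastore, and `SharedBreaker` and its key format are assumptions, not an API from the talk.

```python
import time


class InMemoryStore:
    """Stand-in for a distributed datastore.

    A real store adds latency, failures, and consistency questions.
    """

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class SharedBreaker:
    """Each client holds one of these; all of them read and write the same store."""

    def __init__(self, name, store, recovery_timeout=30.0, clock=time.time):
        self.key = "cb:" + name
        self.store = store
        self.recovery_timeout = recovery_timeout
        self.clock = clock

    def mark_unhealthy(self):
        # Record when the failure was seen so every client applies the same timeout.
        self.store.set(self.key, self.clock())

    def mark_healthy(self):
        self.store.set(self.key, None)

    def allows_requests(self):
        opened_at = self.store.get(self.key)
        if opened_at is None:
            return True
        # After the timeout, any client may send the trial request.
        return self.clock() - opened_at >= self.recovery_timeout
```

With this shape, a failure seen by one waiter is visible to all of them on their next check. The cost is a dependency on the store and a decision about what to do when it is slow or unavailable.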
Module 2: What should we do in response?
New Behavior:
* Code can check system health in advance
* Automatic monitoring!
* Build features on top of system health status
* Requires a single source of truth?
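To make "check in advance" and "automatic monitoring" concrete, here is a small sketch in which callers query a health flag before doing any work and monitoring subscribes to state changes. The names `HealthStatus`, `subscribe`, and `take_order` are hypothetical and chosen to fit the restaurant analogy.

```python
class HealthStatus:
    """Hypothetical health flag that callers can query and monitoring can subscribe to."""

    def __init__(self):
        self._healthy = True
        self._listeners = []

    def is_healthy(self):
        return self._healthy

    def subscribe(self, callback):
        # callback(healthy) runs on every state change, e.g. to emit a metric.
        self._listeners.append(callback)

    def _set(self, healthy):
        if healthy != self._healthy:
            self._healthy = healthy
            for callback in self._listeners:
                callback(healthy)

    def mark_unhealthy(self):
        self._set(False)

    def mark_healthy(self):
        self._set(True)


def take_order(status, order, kitchen):
    """Check health before doing any work, and degrade instead of failing late."""
    if not status.is_healthy():
        return "Sorry, the kitchen is backed up. Please try again shortly."
    return kitchen(order)
```

Because listeners fire only on transitions, repeated `mark_unhealthy()` calls during an outage produce a single monitoring event rather than one per failed request.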
Module 3: Who decides we’re unhealthy?
def signal_overload(cb):
    # jobs (pending work) and THRESH (its limit) are assumed to be defined elsewhere
    if len(jobs) > THRESH:
        cb.mark_unhealthy()
New Behavior:
* CB gets signals from anywhere
* Signal combining logic
* Allows many (many) new signals
* Must combine signals
* Adds complexity to system
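Once several sources can report on health, something has to combine their verdicts. The sketch below shows one possible rule, a simple quorum over the latest verdict from each named signal. `SignalCombiner`, `min_unhealthy`, and `RecordingBreaker` are hypothetical names for illustration, and other rules (any signal, weighted votes) fit the same shape.

```python
class SignalCombiner:
    """Collects named health signals and applies one combining rule to them."""

    def __init__(self, cb, min_unhealthy=1):
        self.cb = cb                        # anything with mark_unhealthy()/mark_healthy()
        self.min_unhealthy = min_unhealthy  # assumed quorum; 1 means any signal trips
        self.signals = {}                   # signal name -> latest verdict (True = unhealthy)

    def report(self, name, unhealthy):
        self.signals[name] = bool(unhealthy)
        votes = sum(self.signals.values())  # count of signals currently unhealthy
        if votes >= self.min_unhealthy:
            self.cb.mark_unhealthy()
        else:
            self.cb.mark_healthy()


class RecordingBreaker:
    """Tiny stand-in breaker so the example runs on its own."""

    def __init__(self):
        self.healthy = True

    def mark_unhealthy(self):
        self.healthy = False

    def mark_healthy(self):
        self.healthy = True
```

A queue-length check like `signal_overload` would call `combiner.report("queue_length", len(jobs) > THRESH)` instead of marking the breaker directly. Note that this sketch also lets signals mark the breaker healthy again, which is a design choice and not something the slides prescribe.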
Module 4: How do we recover?
Dark launch:
* Reject but process normally
* Dangerous with side effects
Block User Request
Try to process anyway!
Synthetic:
* Dark launching with fake requests
* Not necessarily representative
Block User Request
Process fake requests
New Behavior:
* Traffic determines health
* Removal of recovery timeouts
* Faster(?) recovery
* No timeout tuning required
* Dark launching not always possible
* Synthetic can be unrepresentative
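Both recovery styles can be sketched with one breaker that keeps rejecting user requests while unhealthy but still runs a probe through the backend, and recovers after a streak of successful probes instead of a timeout. This is an illustrative sketch: `ProbingBreaker`, `successes_to_recover`, and `make_probe` are assumptions, and tripping on a single failure is a simplification.

```python
class Rejected(Exception):
    """The user-facing request was blocked because the breaker is unhealthy."""


class ProbingBreaker:
    def __init__(self, successes_to_recover=3):
        self.healthy = True
        self.successes_to_recover = successes_to_recover  # assumed threshold
        self.streak = 0

    def mark_unhealthy(self):
        self.healthy = False
        self.streak = 0

    def handle(self, request, process, make_probe=None):
        """process(request) does the real work.

        If make_probe is given, it builds a synthetic request. Otherwise the
        real request is dark launched, which is only safe when process has
        no side effects.
        """
        if self.healthy:
            try:
                return process(request)
            except Exception:
                self.mark_unhealthy()  # simplification: one failure trips the breaker
                raise
        probe = make_probe() if make_probe else request
        try:
            process(probe)  # result is discarded; the user is still rejected
        except Exception:
            self.streak = 0
        else:
            self.streak += 1
            if self.streak >= self.successes_to_recover:
                self.healthy = True
        raise Rejected("request blocked while backend recovers")
```

There is no recovery timeout to tune here, but the backend keeps receiving probe traffic while unhealthy, so the load reduction is smaller than with the basic breaker.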
in summary
Your system will fail. Have a plan!
The basic CB is better than nothing
Questions to ask:
* Should our waiters all agree?
* How should I deal with unhealthiness?
* Who decides we’re unhealthy?
* How do we recover?
…and much more!
Much comes down to your use case
Questions?
striglia@yelp.com
@scott_triglia
Can’t we do better than rejecting requests?
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
How do I safely test out a new circuit breaker?
https://engineering.heroku.com/blogs/2015-06-30-improved-production-stability-with-circuit-breakers/
