Chaos patterns - architecting for failure in distributed systems

CHAOS PATTERNS
Architecting for failure in distributed systems
Bruce Wong - @bruce_m_wong / Jos Boumans - @jiboumans
http://www.soponderando.com.br/

http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg
How to measure
everything
Architecting
in AWS for
resilience & cost
www.slideshare.net/jiboumans/aws-architecting-for-resilience-cost-at-scale
http://www.slideshare.net/jiboumans/how-to-measure-everything-a-million-metrics-per-second-with-minimal-developer-overhead

VP of Operations &
Infrastructure
http://www.krux.com/
3 Billion Users

ABOUT BRUCE
2010 2015
Software Engineer
Insight Engineering
Senior Engineering Manager
Chaos Engineering
Prosumers Consumers Enterprise
http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

A LOT OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/
REAL WORLD FAILURES

SEPTEMBER 20TH, 2015
Also: April 21, 2011 - June 29, 2012 - October 22, 2012 -
December 24, 2012 - August 26, 2013 <out of space>
https://twitter.com/iamDeveloper/status/645659734767329281 https://aws.amazon.com/message/5467D2/

ISOLATION & CONTAINMENT
Ideally limit failure to a single service
Stop it from spreading
http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/

[Chart: Amazon Cloud Major Outage - Issues Categories]
Software: 8
Automation: 4
Process: 14
https://steamcommunity.com/app/620/
http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg
AWS Root Cause
Analysis over time
http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis

Humans,
Software,
Processes
All likely causes
of failure
Isolation
Unlikely
2 - 4x
Yearly frequency of
catastrophic failure

THERE ARE DOWNSIDES
http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines

Complex Systems
Difficult to model, not feasible to simulate at scale

Products evolve
Embrace change

Resilience is a feature
Embrace failure

CORE EXPERIENCE
Enriched with optional enhancements
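
One way to read this slide, as a minimal sketch: the core call has to succeed, while each optional enhancement may fail without affecting the response. Every name here (`build_page`, `core`, `enhancers`, the sample services) is hypothetical and not taken from the talk.

```python
from typing import Any, Callable, Dict


def build_page(core: Callable[[], Dict[str, Any]],
               enhancers: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Return the core payload, adding each enhancement only if it succeeds."""
    page = dict(core())  # a core failure propagates: no core, no page
    for name, enhance in enhancers.items():
        try:
            page[name] = enhance()
        except Exception:
            # Optional data: drop it and keep serving the core experience.
            continue
    return page


def _broken_recommendations() -> Any:
    # Stand-in for an optional dependency that is currently down.
    raise TimeoutError("recommendation service unavailable")


page = build_page(
    core=lambda: {"title": "Home"},
    enhancers={
        "recommendations": _broken_recommendations,
        "ratings": lambda: 4.5,
    },
)
```

The page is still served with the title and ratings; only the failing enhancement is missing.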

http://usa.streetsblog.org/category/issues-campaigns/air-quality/
NAVIGATING THE CHAOS

FALLBACK PATTERNS
“Expect the Unexpected”
http://blabitcanada.com/category/twitter-2/

BASIC API CALL
3 potential points of failure

FALLBACK PATTERNS
The cost of resilience should be accuracy or latency
http://redis.io/
http://memcached.org/
http://varnish-cache.org/
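
A hedged sketch of the accuracy trade-off: when the primary call fails, serve the last known good value and flag it as stale. An in-process dict stands in for a real cache such as Redis, Memcached, or Varnish; `StaleCacheFallback` and `flaky_fetch` are illustrative names, not part of any of those tools.

```python
from typing import Any, Callable, Dict, Tuple


class StaleCacheFallback:
    """Serve the last known good value when the primary call fails."""

    def __init__(self, fetch: Callable[[str], Any]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, Any] = {}  # assumption: stands in for a real cache

    def get(self, key: str) -> Tuple[Any, bool]:
        """Return (value, is_stale)."""
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._cache:
                # Trade accuracy for availability: the answer may be out of date.
                return self._cache[key], True
            raise  # nothing cached yet, so there is no fallback to offer
        self._cache[key] = value
        return value, False


calls = {"n": 0}


def flaky_fetch(key: str) -> str:
    # Hypothetical backend: succeeds once, then fails on every later call.
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("backend down")
    return f"profile:{key}"


store = StaleCacheFallback(flaky_fetch)
first = store.get("alice")   # fresh value, is_stale is False
second = store.get("alice")  # backend is down, cached value, is_stale is True
```

Returning the staleness flag lets the caller decide whether a possibly outdated answer is acceptable for that request.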

https://www.flickr.com/photos/ichijo2009/8501266124
ENSURING DATA ACCESS

CAP THEOREM APPLIES
Your choice: sacrifice availability or consistency.
Orange is a lie.
RDBMS
BigTable Based
Master / Slave based
CouchDB
Dynamo Based
http://ferd.ca/beating-the-cap-theorem-checklist.html

http://paul-barford.blogspot.com/2015/01/sappho-pap-obbink-further-painting-into.html
SPLIT OUT YOUR CONTROL PLANE

EC2
S3
RDS
Dynamo
Cloudfront (CDN)
Route53 (DNS)
Cloudwatch (Monitoring)

Cloudfront (CDN)
Route53 (DNS)
Cloudwatch (Monitoring)

Control
plane
Separate
from workload
DNS & CDN
Your best friends
Latency or
Accuracy
Pick one to sacrifice
for resilience

USER EXPERIENCE
My tweet got posted

http://mclaughlindrums.com/wp-content/uploads/2013/04/Relativity-by-Escher.jpg
ORDERED CHAOS

CHAOS DEFINED
Intentionally introducing failure into a system
with the purpose of validating resilience design.

http://www.cnbc.com/id/102394893

BREAKING THE SYSTEM
How confident are you?
- Next week?
- Next month?
- After that “quick patch”?

CHAOS VS OUTAGE
Chaos
• Controlled
• Planned
• Intentional
• Microscopic user impact
Outages
• Uncontrolled
• Unpredictable
• Unintended
• Large impact

Single Point of Failure
Discover - Fix - Validate

https://github.com/Netflix/SimianArmy
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
CHAOS MONKEY

9am-5pm
Mon-Fri
Don’t upset
your on-call
1 Instance
Per group / per day
Detect SPOF
Intentionally
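
The scheduling rules on this slide can be sketched as follows. This is not Chaos Monkey's actual code or configuration: `in_business_hours`, `pick_victims`, and the group and instance names are hypothetical, and the sketch assumes the caller runs it once per day to respect the "per day" limit.

```python
import random
from datetime import datetime
from typing import Dict, List, Optional


def in_business_hours(now: datetime) -> bool:
    """True on Mon-Fri between 09:00 and 17:00, in the timezone of `now`."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victims(groups: Dict[str, List[str]], now: datetime,
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Choose at most one instance per group, and only during business hours."""
    if not in_business_hours(now):
        return {}  # don't upset your on-call
    rng = rng or random.Random()
    return {name: rng.choice(instances)
            for name, instances in groups.items() if instances}


groups = {"api": ["i-1", "i-2"], "web": ["i-3"]}
weekday = pick_victims(groups, datetime(2015, 9, 21, 10, 0))  # a Monday morning
weekend = pick_victims(groups, datetime(2015, 9, 20, 10, 0))  # a Sunday morning
```

Terminating the chosen instances is left out on purpose; the sketch only covers the selection policy.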

SLOW IS HARD
Product + Business + Engineering Decisions
https://pragprog.com/book/mnee/release-it

Custom
Fallback
Accuracy or latency
https://github.com/Netflix/Hystrix
Fail Silent
For optional data
Fail Fast
Keep servers healthy
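
A minimal circuit-breaker sketch of "fail fast" combined with a fallback. Hystrix is a Java library and this is not its API; the class name, parameters, and thresholds below are assumptions chosen for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()  # fail fast: do not touch the dependency
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result


attempts = []


def failing() -> Any:
    # Hypothetical dependency that is always down.
    attempts.append(1)
    raise ConnectionError("dependency down")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
# Fallback returns None: "fail silent" for optional data.
results = [breaker.call(failing, lambda: None) for _ in range(5)]
```

After the second failure the breaker opens, so the remaining three calls return the fallback without reaching the failing dependency.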

LATENCY MONKEY
Other frameworks
http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
http://www.infoq.com/presentations/failure-as-a-service-netflix

HTTP 5xx
1 minute duration
10-100ms
Sleep during request
1-100%
Of requests
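
The injection knobs listed above (error responses, a 10-100ms sleep, a percentage of requests) can be sketched as a wrapper around a request handler. This does not reproduce Latency Monkey or FIT; `inject_faults` and its parameters are hypothetical, the 503 status is one arbitrary choice of 5xx code, and the "1 minute duration" window is left out for brevity.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]


def inject_faults(handler: Callable[[], Response], *,
                  error_rate: float = 0.0, delay_rate: float = 0.0,
                  delay_ms: Tuple[int, int] = (10, 100),
                  rng: Optional[random.Random] = None,
                  sleep: Callable[[float], None] = time.sleep
                  ) -> Callable[[], Response]:
    """Wrap `handler` so a fraction of requests fail or are slowed down."""
    rng = rng or random.Random()

    def wrapped() -> Response:
        if rng.random() < error_rate:
            return 503, "injected failure"  # simulated HTTP 5xx
        if rng.random() < delay_rate:
            sleep(rng.uniform(*delay_ms) / 1000.0)  # sleep during the request
        return handler()

    return wrapped


def ok() -> Response:
    return 200, "ok"


always_fail = inject_faults(ok, error_rate=1.0)
delays = []  # records requested sleeps instead of actually sleeping
always_slow = inject_faults(ok, delay_rate=1.0, sleep=delays.append)

failed = always_fail()
slowed = always_slow()
```

Passing `sleep` and `rng` as parameters keeps the wrapper testable without real delays or nondeterminism.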

PREVENT PROPAGATION
Avoid cascading failures

CHAOS KONG
Because regions fail
http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html

GeoDNS
fallback to LatencyDNS
Proxy
Cross-Region
communication
Capacity
Cost-Benefit Decision

"ONCE IN A BLUE MOON"
Happens at least a few times a year....
https://whisperofangels.wordpress.com/2013/08/20/once-in-a-blue-moon/

TAKE AWAY
Go found chaos engineering at your company RIGHT NOW

Most enterprises hire people to fix things. Netflix hires
people to break things….
…we should embrace Netflix's culture of "chaos engineering"
throughout organizations of all shapes and sizes.
http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone

Q & A
http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
@bruce_m_wong / @jiboumans
Slides - http://www.slideshare.net/jiboumans
