Kolton Andrus (@deelyle)
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/failure-as-a-service-netflix
Presented at QCon New York
www.qconnewyork.com
Overview
1. Why is Failure Testing Important?
2. How did we build Failure as a Service?
3. How has this made our systems more resilient?
Why Failure Testing?
1. Makes our systems immune to failure
2. Prevents larger outages
3. Production verification is requisite
Failure testing is a form of hormesis - we imbibe the poison to become immune.
Validating that our defenses will work when called upon - by exercising them at scale in production.
Building Failure as a Service
FIT - Failure Injection Testing
What about the monkeys?
The 5 W’s
1. Why
2. Who - Failure Scope
3. Where - Injection Point
4. What - Injected Failure
5. When - Ad-hoc & Automated
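The five questions above map naturally onto a small record type. The following is a minimal sketch under stated assumptions, not the FIT data model: the class name `FailureScenario`, its fields, and the percentage-based scoping by hashed customer ID are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
import zlib


class FailureType(Enum):
    """What: the kind of failure to inject (illustrative values only)."""
    LATENCY = "latency"  # delay the call
    ERROR = "error"      # fail the call outright


@dataclass(frozen=True)
class FailureScenario:
    """Hypothetical record answering the 5 W's for one failure test."""
    why: str                 # Why: the reason for running the test
    scope_percent: int       # Who: share of customers affected (0-100)
    injection_point: str     # Where: name of the injection point
    failure: FailureType     # What: the injected failure
    automated: bool = False  # When: ad-hoc (False) or automated (True)

    def applies_to(self, customer_id: str) -> bool:
        # A stable hash keeps each customer consistently in or out of scope.
        bucket = zlib.crc32(customer_id.encode("utf-8")) % 100
        return bucket < self.scope_percent


scenario = FailureScenario(
    why="verify fallback when a secondary service is down",
    scope_percent=5,
    injection_point="secondary-service",
    failure=FailureType.ERROR,
)
```

Keeping the scope small and deterministic is one way to limit the blast radius of an ad-hoc test before widening it.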
[Diagram: Injection Points. Components: Zuul (Proxy), API, Critical Service, Secondary Service, Cache, C*. Labels: Circuit Breaker, Network Calls.]
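An injection point can be pictured as a thin wrapper around a network call that consults the active failures before letting the call through. The sketch below is an assumption-laden illustration, not Netflix code: `injection_point`, `InjectedFailure`, the `active` dictionary, and the example service function are all hypothetical names.

```python
import functools
import time


class InjectedFailure(Exception):
    """Raised in place of a real network error (hypothetical type)."""


def injection_point(name, active_failures):
    """Wrap a call so a failure registered under `name` is applied first."""
    def decorator(call):
        @functools.wraps(call)
        def wrapper(*args, **kwargs):
            failure = active_failures.get(name)
            if failure is not None:
                if failure["type"] == "latency":
                    time.sleep(failure["delay_s"])  # simulate a slow dependency
                elif failure["type"] == "error":
                    raise InjectedFailure(name)     # simulate a failed dependency
            return call(*args, **kwargs)
        return wrapper
    return decorator


active = {}  # injection point name -> failure description (assumed shape)


@injection_point("secondary-service", active)
def fetch_recommendations(user_id):
    # Stand-in for a real network call to a secondary service.
    return ["title-1", "title-2"]


def recommendations_with_fallback(user_id):
    # The defense being exercised: degrade gracefully instead of failing.
    try:
        return fetch_recommendations(user_id)
    except InjectedFailure:
        return []
```

Setting `active["secondary-service"] = {"type": "error"}` makes the wrapped call fail, which lets a test confirm that the fallback path actually returns a degraded response.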
“Knowing how the system behaves in the face of failure is invaluable - our assumptions are often incomplete”
[Diagram: Injected Failure. Components: Zuul (Proxy), API, Critical, Secondary, Cache, C*. Labels: Circuit Breaker, Network Calls, Failure Metadata, FIT, Failure Scope, Decorated Request.]
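The labels "Failure Scope", "Failure Metadata", and "Decorated Request" suggest that the proxy tags in-scope requests and that downstream components read the tag. The following sketch shows one way that could look; it is an assumption, not the FIT implementation. The header name `x-failure-context` and both function names are hypothetical.

```python
FIT_HEADER = "x-failure-context"  # hypothetical header name


def decorate_request(headers, failing_points, in_scope):
    """At the edge proxy: attach failure metadata to in-scope requests.

    Returns a new header mapping; the caller's headers are left unchanged.
    """
    decorated = dict(headers)
    if in_scope and failing_points:
        decorated[FIT_HEADER] = ",".join(sorted(failing_points))
    return decorated


def should_fail(headers, point_name):
    """At an injection point: check whether this request asks us to fail."""
    raw = headers.get(FIT_HEADER, "")
    return bool(raw) and point_name in raw.split(",")


incoming = {"host": "api.example.com"}
decorated = decorate_request(incoming, {"cache", "secondary-service"}, in_scope=True)
```

Carrying the failure metadata on the request itself means only the decorated requests are affected, while all other traffic passes through the same injection points untouched.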
Great, does it work?
Aggressive failure testing creates not just robust programs, but an antifragile programming culture.
Takeaways
1. Failure Testing is a worthwhile investment
2. Testing in Production is sustainable
3. It can harden your systems against failure
Resources
● Netflix Techblog - FIT
● “On Designing and Deploying Internet-Scale Services” - James Hamilton
● Drift into Failure - Sidney Dekker
● Antifragile - Nassim Nicholas Taleb

Breaking Bad at Netflix: Building Failure as a Service
