Fault Tolerance in a
High Volume, Distributed System
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen




Dozens of dependencies.

   One going down takes everything down.


99.99%^30 = 99.7% uptime
     0.3% of 1 billion = 3,000,000 failures

            2+ hours downtime/month
even if all dependencies have excellent uptime.

          Reality is generally worse.
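The arithmetic on this slide can be checked with a few lines of Java. This is a minimal sketch, assuming 30 independent dependencies that must all be up and a 30-day month; the class and method names are hypothetical and not from the talk.

```java
// Sketch: compound availability when a request needs every dependency to be up.
final class CompoundUptime {
    /** Availability of a request that depends on `dependencies` independent services. */
    static double combined(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combined(0.9999, 30);                // roughly 0.997
        double failures = (1 - uptime) * 1_000_000_000L;     // roughly 3 million per billion requests
        double hoursDownPerMonth = (1 - uptime) * 30 * 24;   // a little over 2 hours
        System.out.printf("uptime=%.2f%% failures=%.0f downtime=%.1fh%n",
                uptime * 100, failures, hoursDownPerMonth);
    }
}
```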


No single dependency should
 take down the entire app.

          Fail fast.
         Fail silent.
         Fallback.

        Shed load.



Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker



Semaphores (Tryable): Limited Concurrency


TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
        return executeCommand();
    } finally {
        executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}
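
The same idea can be tried outside the Netflix codebase. The sketch below is an assumption-laden stand-in: it uses the JDK's java.util.concurrent.Semaphore in place of TryableSemaphore, and the SemaphoreGuard class with its command and fallback suppliers is hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch: limit concurrent calls to a dependency without ever blocking the caller.
final class SemaphoreGuard<T> {
    private final Semaphore permits;
    private final Supplier<T> command;   // hypothetical: the protected dependency call
    private final Supplier<T> fallback;  // hypothetical: the degraded response

    SemaphoreGuard(int maxConcurrent, Supplier<T> command, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrent);
        this.command = command;
        this.fallback = fallback;
    }

    T execute() {
        // tryAcquire() never waits: it returns false at once when no permit is free
        if (permits.tryAcquire()) {
            try {
                return command.get();
            } finally {
                permits.release(); // always hand the permit back
            }
        }
        // no permit available: shed load and answer with the fallback
        return fallback.get();
    }
}
```

Because tryAcquire() does not queue callers, a saturated dependency costs the calling thread almost nothing; the price is that the call still runs on the caller's thread, so it cannot be timed out from outside.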




Separate Threads: Limited Concurrency


try {
    if (!threadPool.isQueueSpaceAvailable()) {
        // we are at the property-defined max, so throw RejectedExecutionException to simulate
        // reaching the real max and go through the same code path and behavior
        throw new RejectedExecutionException(
            "Rejected command because thread-pool queueSize is at rejection threshold.");
    }
    // ... define Callable that performs executeCommand() ...
    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
    return getFallback();
}
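
A rough, self-contained version of this pattern using only the JDK is sketched below. It assumes a fixed-size ThreadPoolExecutor with a bounded ArrayBlockingQueue and the default rejection policy; the BoundedPool class and its fallback parameter are hypothetical, not the Netflix implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: a dedicated, bounded pool per dependency; overflow goes to the fallback.
final class BoundedPool {
    private final ThreadPoolExecutor pool;

    BoundedPool(int threads, int queueSize) {
        // The default handler (AbortPolicy) throws RejectedExecutionException
        // once every thread is busy and the queue is full.
        pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            // rejected: answer immediately with an already-completed fallback
            return CompletableFuture.completedFuture(fallback);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With one thread and a queue of one, a third concurrent call is rejected and receives the fallback at once instead of waiting behind a slow dependency.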



Separate Threads: Timeout

                                  Override of Future.get()

public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout =
            getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(
            System.currentTimeMillis() - startTime);

        // retrieve the fallback
        return getFallback();
    }
}
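
The timeout-then-fallback step can also be written as a small helper around any Future. This is a sketch only: TimedCall and getOrFallback are hypothetical names, and the metrics call from the slide is left out.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: wait a bounded time for a result, then give up and return the fallback.
final class TimedCall {
    static <T> T getOrFallback(Future<T> future, long timeoutMs, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // the dependency is too slow: stop waiting and move on
            return fallback;
        }
    }
}
```

The caller is released after timeoutMs regardless of what the worker thread is doing, which is what separate threads make possible and semaphores do not.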




Circuit Breaker




if (circuitBreaker.allowRequest()) {
    return executeCommand();
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
    return getFallback();
}
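
The slide does not show how allowRequest() decides. One simplified possibility is sketched below: the circuit opens after a run of consecutive failures and lets traffic through again after a sleep window. The threshold, the sleep window, the injected clock, and the SimpleCircuitBreaker class are all assumptions for illustration; the Netflix breaker described later in the deck works from a rolling error percentage instead.

```java
import java.util.function.LongSupplier;

// Sketch: a deliberately simple breaker driven by consecutive failures.
final class SimpleCircuitBreaker {
    private final int failureThreshold;  // hypothetical tuning knob
    private final long sleepWindowMs;    // hypothetical tuning knob
    private final LongSupplier clock;    // injected so time can be controlled in tests
    private int consecutiveFailures = 0;
    private long openedAt = -1;          // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMs, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true; // closed: everything goes through
        }
        // open: short-circuit until the sleep window has passed, then allow a retry
        return clock.getAsLong() - openedAt >= sleepWindowMs;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        openedAt = -1; // close the circuit again
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // open, or re-open after a failed retry
        }
    }
}
```

A failed retry after the sleep window calls markFailure() again, which resets openedAt and keeps the circuit open for another window.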




Netflix uses all 4 in combination




Tryable semaphores for “trusted” clients and fallbacks

       Separate threads for “untrusted” clients

 Aggressive timeouts on threads and network calls
             to “give up and move on”

       Circuit breakers as the “release valve”



Benefits of Separate Threads

         Protection from client libraries

    Lower risk to accept new/updated clients

           Quick recovery from failure

             Client misconfiguration

Client service performance characteristic changes

              Built-in concurrency
Drawbacks of Separate Threads

    Some computational overhead

Load on machine can be pushed too far

                 ...

    Benefits outweigh drawbacks
    when clients are “untrusted”


Visualizing Circuits in Realtime
      (generally sub-second latency)




      Video available at
https://vimeo.com/33576628




Rolling 10 second counter – 1 second granularity

      Median Mean 90th 99th 99.5th

       Latent Error Timeout Rejected

Error Percentage =
    (error + timeout + rejected) /
    (success + latent success + error + timeout + rejected)
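
A rolling 10-second window with 1-second buckets could be kept roughly as follows. This is a sketch under stated assumptions: the RollingErrorRate class is hypothetical, it folds error, timeout, and rejected into a single failure flag, and the caller passes in the current time in milliseconds.

```java
import java.util.Arrays;

// Sketch: ten one-second buckets reused in a ring; stale buckets are recycled.
final class RollingErrorRate {
    private static final int BUCKETS = 10;                  // 10 seconds at 1-second granularity
    private final long[] bucketSecond = new long[BUCKETS];  // which second each bucket holds
    private final int[] failed = new int[BUCKETS];          // error + timeout + rejected
    private final int[] total = new int[BUCKETS];           // every recorded outcome

    RollingErrorRate() {
        Arrays.fill(bucketSecond, -1L); // -1 marks a bucket that was never used
    }

    synchronized void record(long nowMs, boolean isFailure) {
        long second = nowMs / 1000;
        int i = (int) (second % BUCKETS);
        if (bucketSecond[i] != second) { // bucket holds an older second: recycle it
            bucketSecond[i] = second;
            failed[i] = 0;
            total[i] = 0;
        }
        total[i]++;
        if (isFailure) {
            failed[i]++;
        }
    }

    synchronized double errorPercentage(long nowMs) {
        long second = nowMs / 1000;
        long f = 0;
        long t = 0;
        for (int i = 0; i < BUCKETS; i++) {
            boolean inWindow = bucketSecond[i] >= 0 && second - bucketSecond[i] < BUCKETS;
            if (inWindow) {
                f += failed[i];
                t += total[i];
            }
        }
        return t == 0 ? 0.0 : 100.0 * f / t; // no traffic counts as no errors
    }
}
```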

Netflix DependencyCommand Implementation




              Fallbacks

               Cache
         Eventual Consistency
            Stubbed Data
           Empty Response




Rolling Number
Realtime Stats and
 Decision Making




Request Collapsing
Take advantage of resiliency to improve efficiency
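
The slide gives no implementation detail, so the following is only one reading of request collapsing: single-key requests made close together are gathered and served by one batch call. The RequestCollapser class, the batchCall function, and the explicit flush() (which a timer would normally trigger) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Sketch: collect individual requests and serve them with a single batch call.
final class RequestCollapser<K, V> {
    private final Function<List<K>, Map<K, V>> batchCall; // hypothetical batch endpoint
    private final Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

    RequestCollapser(Function<List<K>, Map<K, V>> batchCall) {
        this.batchCall = batchCall;
    }

    /** Queue a single-key request; duplicate keys in one window share a future. */
    synchronized CompletableFuture<V> submit(K key) {
        return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
    }

    /** Issue one batch call for everything queued since the last flush. */
    void flush() {
        Map<K, CompletableFuture<V>> batch;
        synchronized (this) {
            batch = new LinkedHashMap<>(pending);
            pending.clear();
        }
        if (batch.isEmpty()) {
            return;
        }
        Map<K, V> results = batchCall.apply(new ArrayList<>(batch.keySet()));
        // hand each caller its own slice of the batch response
        batch.forEach((key, future) -> future.complete(results.get(key)));
    }
}
```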




Fail fast.
Fail silent.
Fallback.

Shed load.




Questions & More Information


Fault Tolerance in a High Volume, Distributed System
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html



         Making the Netflix API More Resilient
   http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html




                        Ben Christensen
                          @benjchristensen
                 http://www.linkedin.com/in/benjchristensen



 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 

Fault Tolerance in a High Volume, Distributed System

  • 1. Fault Tolerance in a High Volume, Distributed System. Ben Christensen, Software Engineer – API Platform at Netflix. @benjchristensen http://www.linkedin.com/in/benjchristensen
  • 2. Dozens of dependencies. One going down takes everything down. 99.99%^30 = 99.7% uptime. 0.3% of 1 billion = 3,000,000 failures. 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse.
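The arithmetic on this slide can be written out as follows. This is a sketch under stated assumptions: the exponent is read as 30 dependencies that each have 99.99% uptime and fail independently, the "1 billion" is taken to be a request count, and a month is taken as 30 days (720 hours).

```latex
\begin{align*}
0.9999^{30} &\approx 0.9970 && \text{(about 99.7\% uptime)} \\
0.003 \times 10^{9} &= 3 \times 10^{6} && \text{(failures per billion requests)} \\
0.003 \times 720\ \text{h} &\approx 2.2\ \text{h} && \text{(downtime in a 30-day month)}
\end{align*}
```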
  • 6. No single dependency should take down the entire app. Fail fast. Fail silent. Fallback. Shed load.
  • 7. Options: Aggressive Network Timeouts; Semaphores (Tryable); Separate Threads; Circuit Breaker
  • 10. Semaphores (Tryable): Limited Concurrency

        TryableSemaphore executionSemaphore = getExecutionSemaphore();
        // acquire a permit
        if (executionSemaphore.tryAcquire()) {
            try {
                return executeCommand();
            } finally {
                executionSemaphore.release();
            }
        } else {
            circuitBreaker.markSemaphoreRejection();
            // permit not available so return fallback
            return getFallback();
        }
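The tryable-semaphore pattern on this slide can be tried out with the JDK's own `java.util.concurrent.Semaphore`. The sketch below is a minimal stand-in, not the Netflix implementation: the class name `SemaphoreGuard` and its `execute` method are hypothetical, and the circuit-breaker metric call from the slide is left out.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper (not from the slides): caps concurrent calls without ever blocking. */
class SemaphoreGuard {
    private final Semaphore permits;

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the command if a permit is free right now; otherwise returns the fallback immediately. */
    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (permits.tryAcquire()) {        // non-blocking: returns false when no permit is free
            try {
                return command.get();
            } finally {
                permits.release();         // always hand the permit back
            }
        }
        // permit not available, so shed load by returning the fallback
        return fallback.get();
    }
}
```

Because `tryAcquire()` never waits, a saturated dependency costs the caller only the time it takes to produce the fallback.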
  • 14. Separate Threads: Limited Concurrency

        try {
            if (!threadPool.isQueueSpaceAvailable()) {
                // we are at the property defined max so want to throw the RejectedExecutionException to simulate
                // reaching the real max and go through the same codepath and behavior
                throw new RejectedExecutionException("Rejected command because thread-pool queueSize is at rejection threshold.");
            }
            ... define Callable that performs executeCommand() ...
            // submit the work to the thread-pool
            return threadPool.submit(command);
        } catch (RejectedExecutionException e) {
            circuitBreaker.markThreadPoolRejection();
            // rejected so return fallback
            return getFallback();
        }
  • 17. Separate Threads: Timeout. Override of Future.get()

        public K get() throws CancellationException, InterruptedException, ExecutionException {
            try {
                long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
                return get(timeout, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // report timeout failure
                circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);
                // retrieve the fallback
                return getFallback();
            }
        }
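The two "Separate Threads" ideas (reject when the pool is full, give up after a timeout) can be combined in a small self-contained sketch using only JDK classes. Everything named here is an assumption for illustration: the class `IsolatedCall`, the pool sizing (2 threads, 1 queued task), and the decision to return the fallback on any failure. The metrics calls and the `isQueueSpaceAvailable()` check from the slides are omitted.

```java
import java.util.concurrent.*;

/** Hypothetical wrapper (not from the slides): runs a command on its own bounded thread pool. */
class IsolatedCall {
    // Assumed sizing: 2 worker threads and 1 queue slot; further submissions are rejected.
    private final ExecutorService pool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(1),
            new ThreadPoolExecutor.AbortPolicy());

    <T> T execute(Callable<T> command, T fallback, long timeoutMillis) {
        Future<T> future;
        try {
            future = pool.submit(command);
        } catch (RejectedExecutionException e) {
            return fallback;                      // pool and queue are full
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                  // interrupt the worker so the thread can be reused
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                      // the command itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the caller's interrupt status
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

The calling thread waits at most `timeoutMillis`, regardless of how long the dependency takes, which is the "give up and move on" behaviour the deck describes.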
  • 21. Circuit Breaker

        if (circuitBreaker.allowRequest()) {
            return executeCommand();
        } else {
            // short-circuit and go directly to fallback
            circuitBreaker.markShortCircuited();
            return getFallback();
        }
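The slide shows only the call site, not what happens behind `allowRequest()`. One simple way such a method could behave is sketched below. This is not the Netflix implementation: the trip rule (a fixed number of consecutive failures), the sleep window, and the names `SimpleCircuitBreaker`, `markSuccess`, and `markFailure` are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical breaker: opens after N consecutive failures, lets requests retry after a sleep window. */
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final AtomicLong openedAt = new AtomicLong(-1);   // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    boolean allowRequest() {
        long opened = openedAt.get();
        if (opened < 0) {
            return true;                                      // closed: let everything through
        }
        // open: allow requests again only once the sleep window has elapsed
        return System.currentTimeMillis() - opened >= sleepWindowMillis;
    }

    void markSuccess() {
        consecutiveFailures.set(0);
        openedAt.set(-1);                                     // a success closes the circuit
    }

    void markFailure() {
        if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
            openedAt.set(System.currentTimeMillis());         // trip, or restart the sleep window
        }
    }
}
```

While the circuit is open, callers skip the dependency entirely and go straight to the fallback, which is what makes the breaker act as a release valve under sustained failure.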
  • 24. Netflix uses all 4 in combination
  • 26. Tryable semaphores for “trusted” clients and fallbacks. Separate threads for “untrusted” clients. Aggressive timeouts on threads and network calls to “give up and move on”. Circuit breakers as the “release valve”.
  • 30. Benefits of Separate Threads: Protection from client libraries; Lower risk to accept new/updated clients; Quick recovery from failure; Client misconfiguration; Client service performance characteristic changes; Built-in concurrency
  • 31. Drawbacks of Separate Threads: Some computational overhead; Load on machine can be pushed too far; ... Benefits outweigh drawbacks when clients are “untrusted”.
  • 33. Visualizing Circuits in Realtime (generally sub-second latency). Video available at https://vimeo.com/33576628
  • 34. Rolling 10 second counter – 1 second granularity. Latency: Median, Mean, 90th, 99th, 99.5th. Counts: Latent, Error, Timeout, Rejected. Error Percentage = (error+timeout+rejected) / (success+latent success+error+timeout+rejected).
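The error-percentage formula on this slide translates directly into code. The class and method names below are hypothetical, and returning 0 when there has been no traffic is an assumption, since the slide does not say how an empty window is handled.

```java
/** Hypothetical helper (not from the slides): the slide's error-percentage formula. */
class ErrorStats {
    static double errorPercentage(long success, long latentSuccess,
                                  long error, long timeout, long rejected) {
        long failed = error + timeout + rejected;
        long total = success + latentSuccess + failed;
        // Assumption: report 0% when no requests were seen, to avoid dividing by zero.
        return total == 0 ? 0.0 : 100.0 * failed / total;
    }
}
```

For example, 90 successes, 5 errors, 3 timeouts, and 2 rejections give 10 failures out of 100 requests, or 10%.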
  • 41. Netflix DependencyCommand Implementation. Fallbacks: Cache; Eventual Consistency; Stubbed Data; Empty Response
  • 44. Rolling Number: Realtime Stats and Decision Making
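A rolling counter with one-second buckets, as described earlier in the deck (10 second window, 1 second granularity), could be sketched as below. This is an illustrative assumption, not the Netflix "Rolling Number" class: the name `RollingCounter`, the ring-buffer layout, and passing the current time in as a parameter (to keep the example deterministic) are all choices made here.

```java
import java.util.Arrays;

/** Hypothetical rolling counter: one slot per second, reused in a ring. */
class RollingCounter {
    private final long[] counts;
    private final long[] slotSecond;   // which second each slot currently represents

    RollingCounter(int windowSeconds) {
        counts = new long[windowSeconds];
        slotSecond = new long[windowSeconds];
        Arrays.fill(slotSecond, -1);   // -1 marks a slot that has never been used
    }

    synchronized void increment(long nowMillis) {
        long second = nowMillis / 1000;
        int slot = (int) (second % counts.length);
        if (slotSecond[slot] != second) {   // slot holds an expired second: reset and reuse it
            slotSecond[slot] = second;
            counts[slot] = 0;
        }
        counts[slot]++;
    }

    /** Sums every bucket whose second still falls inside the window ending at nowMillis. */
    synchronized long sum(long nowMillis) {
        long second = nowMillis / 1000;
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            if (slotSecond[i] >= 0 && second - slotSecond[i] < counts.length) {
                total += counts[i];
            }
        }
        return total;
    }
}
```

Keeping one such counter per outcome (success, error, timeout, rejected) would give the inputs for an error percentage over the last ten seconds.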
  • 45. Request Collapsing: Take advantage of resiliency to improve efficiency
  • 49. Questions & More Information. Fault Tolerance in a High Volume, Distributed System: http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html. Making the Netflix API More Resilient: http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html. Ben Christensen, @benjchristensen, http://www.linkedin.com/in/benjchristensen