FRONTLINE SYSTEMS
Circuit Breaker Pattern Vikash Kodati
13th July 2016
AGENDA
4/6/2016 T-Mobile Confidential
• Problem Statement
• Circuit Breaker Definition
• Solution Landscape
• Live Demo
• Q&A
CHARACTERISTICS OF MICROSERVICES
6/13/2016 T-Mobile Confidential
• Componentization via services
• Organized around business capabilities
• Products not projects
• Smart endpoints and dumb pipes
• Decentralized Data Management
• Infrastructure Automation
• Design for failure
DESIGN FOR FAILURE
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Note: Data taken from Jeff Dean’s slides
PROBLEM STATEMENT
Given the types of failures that can occur, we need a fault-tolerant system that:
• Continues to operate when a subset of its components fails
• Is highly available (HA)
• Handles failure gracefully
SOLUTION LANDSCAPE
Development Phase
• Avoiding cascading failures
  • Circuit breaker
  • Timeouts
  • Retry
  • Bulkhead
  • Cache optimizations
• Avoiding malicious clients
  • Rate limiting
Pre-Deploy Phase
• Load test
• A/B test
• Longevity
Post-Deploy Phase
• Health check
• Metrics
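As one illustration of the development-phase techniques listed above, here is a minimal sketch of the Retry item with exponential backoff. The function name, parameters, and defaults are assumptions made for this example and do not come from the demo code.

```python
import time


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on any exception with exponential backoff.

    All names and defaults here are illustrative assumptions.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay doubles after each failed attempt: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Retries only help with transient faults. Against a dependency that is really down they add load, which is one reason to pair them with timeouts and a circuit breaker.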
CIRCUIT BREAKER PATTERN
• In electrical wiring, a breaker trips (“On” to “Off”) when a power surge occurs.
• Netflix Hystrix follows the circuit breaker pattern.
• If a service’s error rate exceeds a threshold, the circuit breaker trips and blocks requests to that service for a specific period of time.
• The threshold is configurable, for example:
  • Endpoint takes > 1 sec to respond
  • Endpoint returns a 500 error
  • Endpoint returns a 500 error 6 times in a row
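The threshold examples above can be sketched as a small trip policy: a response counts as a failure if it is slower than 1 second or returns a 5xx status, and the breaker trips after 6 failures in a row. This is a hedged illustration only. The class and field names are hypothetical, and Hystrix itself decides from an error percentage over a rolling window rather than a simple streak.

```python
from dataclasses import dataclass


@dataclass
class TripPolicy:
    """Decides when a breaker should trip. Names and defaults are hypothetical."""

    max_latency_s: float = 1.0         # slower responses count as failures
    max_consecutive_failures: int = 6  # trip after this many failures in a row
    consecutive_failures: int = 0

    def is_failure(self, status_code: int, latency_s: float) -> bool:
        return status_code >= 500 or latency_s > self.max_latency_s

    def record(self, status_code: int, latency_s: float) -> bool:
        """Record one response; return True if the breaker should trip now."""
        if self.is_failure(status_code, latency_s):
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0  # any success resets the streak
        return self.consecutive_failures >= self.max_consecutive_failures
```

A caller would feed every response into record() and open the breaker the first time it returns True.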
CIRCUIT BREAKER ILLUSTRATION
CIRCUIT BREAKER STATE TRANSITIONS
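States: Closed, Open, Half-Open
• Closed → Closed: Success
• Closed → Open: Trip Breaker
• Open → Open: Calls failing fast
• Open → Half-Open: Attempt Reset
• Half-Open → Open: Trip Breaker
• Half-Open → Closed: Reset Breaker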
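The Closed, Open, and Half-Open states can be sketched as a small state machine. This is a minimal single-threaded illustration and not how Hystrix is implemented. The class name, the consecutive-failure threshold, the reset timeout, and the injectable clock are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout_s = reset_timeout_s      # how long to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("failing fast")  # operation is not called
            self.state = "half-open"  # attempt reset: let one trial call through
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Closed stays closed; a half-open success resets the breaker.
        self.state = "closed"
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"  # trip the breaker
            self.opened_at = self.clock()
```

While the breaker is open, callers get CircuitOpenError immediately instead of waiting on a dependency that is likely still failing. A production version would also need locking for concurrent callers.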
DEMO TOPOLOGY
[Diagram components: Web browser, Zuul (Proxy), Eureka Server, Reading Service, BookStore]
ROLES
The pattern includes
• Service Discovery (Eureka),
• Circuit Breaker (Hystrix),
• Intelligent Routing & Reverse Proxy (Zuul) and
• Microservices (Spring Cloud)
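To make the discovery and routing roles concrete, here is a toy in-memory sketch of what a registry and a reverse proxy do conceptually. It does not use the Eureka or Zuul APIs. The class, the function, the addresses, and the /recommended path are hypothetical, chosen only for illustration.

```python
import itertools


class ServiceRegistry:
    """Toy in-memory registry; a conceptual stand-in, not the Eureka API."""

    def __init__(self):
        self._instances = {}
        self._cursors = {}

    def register(self, service_name, address):
        self._instances.setdefault(service_name, []).append(address)
        # Rebuild the round-robin cursor whenever the instance list changes.
        self._cursors[service_name] = itertools.cycle(list(self._instances[service_name]))

    def next_instance(self, service_name):
        if service_name not in self._instances:
            raise LookupError(f"no instances registered for {service_name!r}")
        return next(self._cursors[service_name])


def route(registry, path):
    """Map '/<service>/<rest>' to an instance URL, as a proxy might."""
    service_name, _, rest = path.lstrip("/").partition("/")
    return f"http://{registry.next_instance(service_name)}/{rest}"


registry = ServiceRegistry()
registry.register("bookstore", "10.0.0.1:8090")  # hypothetical addresses
registry.register("bookstore", "10.0.0.2:8090")
print(route(registry, "/bookstore/recommended"))
```

Because callers resolve instances by name at request time, instances can come and go without clients being reconfigured.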
HYSTRIX DASHBOARD
HYSTRIX DASHBOARD DRILL DOWN
SUMMARY
• Like a physical circuit breaker, the circuit breaker
pattern allows a subsystem to fail gracefully without
a complete system failure
• Failure is inevitable; be prepared for it
• Primarily used in aggregation scenarios
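The aggregation scenario can be sketched as follows: an aggregator calls several downstream sources and substitutes a fallback for any source that fails, so one broken subsystem degrades only its own part of the response. The function and the source names are hypothetical. In practice each fetch would typically sit behind its own circuit breaker.

```python
def aggregate(sources, fallbacks):
    """Call each source; on failure, use that source's fallback value instead."""
    page = {}
    for name, fetch in sources.items():
        try:
            page[name] = fetch()
        except Exception:
            # One failing subsystem degrades only its own section of the result.
            page[name] = fallbacks[name]
    return page


def fetch_reviews():
    raise TimeoutError("reviews service is down")  # simulated outage


result = aggregate(
    sources={"books": lambda: ["Dune"], "reviews": fetch_reviews},
    fallbacks={"books": [], "reviews": []},
)
print(result)
```

Here the books section is still served while the reviews section falls back to an empty list.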
THANK YOU & Q&A
Vikash Kodati
• Email: Vikash.Kodati@t-mobile.com
• Yammer: https://www.yammer.com/t-mobile.com/users/vikashkodati
• Github: https://github.com/vikashkodati
• LinkedIn: /in/vikashkodati
• Twitter: @vikashkodati
• Blog: https://tmobileusa.sharepoint.com/portals/hub/personal/vikashkodati

Editor's Notes

  • #2 Encourage an interactive session and informal discussion. Eat lunch. My goal is to keep us all on the same page at a conceptual level. Please stop me and ask questions.
  • #4 It's still hard to come up with a firm definition. Instead of defining, think about common characteristics. Most of those who are doing microservices will be doing most of these things. Let's go through each of these.
  • #5 Netflix randomly brings down nodes and simulates failure to check resiliency. Bring up the CAP theorem here.
  • #6 If a fault-tolerant system's operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system in which even a small failure can cause total breakdown.
  • #8 Primarily an aggregator pattern.
  • #14 You would want to use it at a service level. An aggregate service can have multiple instances. Ribbon vs. Turbine.