@atseitlin
Resiliency through failure
Netflix's Approach to Extreme Availability in the Cloud
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin
@atseitlin
About Netflix
Netflix is the world’s leading Internet television network with more than
38 million members in 40 countries enjoying more than one billion hours of
TV shows and movies per month, including original series[1]
[1] http://ir.netflix.com/
@atseitlin
A complex distributed system
@atseitlin
How Netflix Streaming Works
Customer Device
(PC, PS3, TV…)
Web Site or
Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect
CDN Boxes
CDN
Management and
Steering
Content Encoding
Consumer
Electronics
AWS Cloud
Services
CDN Edge
Locations
Browse
Play
Watch
@atseitlin
Highly Available Architecture
Micro-services, redundancy,
resiliency
@atseitlin
Web Server Dependencies Flow
(Home page business transaction as seen by AppDynamics)
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie
group chooser
Each icon is three to a few hundred instances across three AWS zones
@atseitlin
Component Micro-Services
Test With Chaos Monkey, Latency Monkey
@atseitlin
Three Balanced Availability Zones
Test with Chaos Gorilla
Cassandra and Evcache
Replicas
Zone A
Cassandra and Evcache
Replicas
Zone B
Cassandra and Evcache
Replicas
Zone C
Load Balancers
@atseitlin
Triple Replicated Persistence
Cassandra maintenance affects individual replicas
Cassandra and Evcache
Replicas
Zone A
Cassandra and Evcache
Replicas
Zone B
Cassandra and Evcache
Replicas
Zone C
Load Balancers
@atseitlin
Isolated Regions
Will someday test with Chaos Kong
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
@atseitlin
Failure Modes and Effects
Failure Mode          Probability   Current Mitigation Plan
Application Failure   High          Automatic degraded response
AWS Region Failure    Low           Wait for region to recover
AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
Datacenter Failure    Medium        Migrate more functions to cloud
Data store failure    Low           Restore from S3 backups
S3 failure            Low           Restore from remote archive
Until we got really good at mitigating high and medium
probability failures, the ROI for mitigating regional
failures didn’t make sense. Getting there…
@atseitlin
Application Resilience
Run what you wrote
Rapid detection
Rapid Response
Fail often
@atseitlin
Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
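A minimal sketch of the "short timeouts, fail fast" idea, assuming a thread-based caller. The helper name call_with_timeout, the 50 ms budget, and the slow_dependency function are illustrative assumptions, not anything shown in the talk.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it exceeds timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Fail fast: give the caller a degraded answer instead of waiting.
        return fallback
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        pool.shutdown(wait=False)


def slow_dependency():
    """Hypothetical dependency that answers far too slowly."""
    time.sleep(0.5)
    return "personalized rows"


print(call_with_timeout(slow_dependency, 0.05, "default rows"))
```

Because every hop gives up after its own short budget, a slow dependency costs the caller roughly that budget rather than the sum of all downstream waits.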
@atseitlin
Rapid Detection
• If your pilot had no instrument panel, would
you ever board the plane?
– Never run your service blind
• Monitor services, not instances
– Make instance failure a non-event
• Don’t pay people to watch screens
– Instead pay them to build alerting
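One way to read "monitor services, not instances" is to alert on an aggregate across the whole cluster. The sketch below assumes per-instance request and error counters are already collected; the function names, the 5% threshold, and the instance IDs are hypothetical.

```python
def service_error_rate(instance_stats):
    """Aggregate per-instance counters into one service-level error rate."""
    total_requests = sum(s["requests"] for s in instance_stats.values())
    total_errors = sum(s["errors"] for s in instance_stats.values())
    return total_errors / total_requests if total_requests else 0.0


def should_alert(instance_stats, threshold=0.05):
    """Alert only when the service as a whole is unhealthy."""
    return service_error_rate(instance_stats) > threshold


stats = {
    "i-aaa": {"requests": 1000, "errors": 2},
    "i-bbb": {"requests": 1000, "errors": 3},
    "i-ccc": {"requests": 10, "errors": 10},  # one instance failing outright
}
print(should_alert(stats))
```

With these numbers the single broken instance barely moves the service-level rate, so no alert fires and the instance failure stays a non-event.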
@atseitlin
Edda
AWS
Instances, ASGs, etc.
Eureka Services
metadata
AppDynamics
Request flow
Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
@atseitlin
Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b"]
Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
…
"ipRanges" : [
"10.10.1.1/32",
"10.10.1.2/32",
+ "10.10.1.3/32",
- "10.10.1.4/32"
…
}
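The two queries above follow the same shape: a collection path, an optional resource ID, and ";key=value" matrix arguments. The helper below is a small sketch that only builds such URL strings; edda_url is a hypothetical name, and the sketch assumes nothing about Edda beyond the URL patterns shown on the slide.

```python
def edda_url(base, collection, resource_id=None, **matrix_args):
    """Build an Edda-style URL with ';key=value' matrix arguments.

    A value of True emits a bare flag such as ';_diff'.
    """
    url = f"{base}/api/v2/{collection}"
    if resource_id:
        url += f"/{resource_id}"
    for key, value in matrix_args.items():
        url += f";{key}" if value is True else f";{key}={value}"
    return url


# Reproduces the two example queries from the slide.
print(edda_url("http://edda", "view/instances",
               publicIpAddress="1.2.3.4", _since=0))
print(edda_url("http://edda", "aws/securityGroups", "sg-0123456789",
               _diff=True, _all=True, _limit=2))
```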
@atseitlin
Rapid Rollback
• Use a new Autoscale Group to push code
• Leave existing ASG in place, switch traffic
• If OK, auto-delete old ASG a few hours later
• If “whoops”, switch traffic back in seconds
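The rollback flow in these bullets can be modeled as a traffic pointer that moves between two Auto Scaling Groups. This is a toy in-memory sketch; the Deployment class and the ASG names are hypothetical, and no AWS or Asgard calls are made.

```python
class Deployment:
    """Toy model of a push that keeps the old ASG around for rollback."""

    def __init__(self, live_asg):
        self.asgs = {live_asg}
        self.live = live_asg
        self.previous = None

    def push(self, new_asg):
        """Create a new ASG next to the old one and switch traffic to it."""
        self.asgs.add(new_asg)
        self.previous, self.live = self.live, new_asg

    def rollback(self):
        """Switch traffic back; nothing is rebuilt, so this is fast."""
        if self.previous is None:
            raise RuntimeError("no previous ASG to roll back to")
        self.live, self.previous = self.previous, self.live

    def cleanup(self):
        """Delete the ASG that is not taking traffic once the push looks OK."""
        if self.previous is not None:
            self.asgs.discard(self.previous)
            self.previous = None


deploy = Deployment("api-v001")
deploy.push("api-v002")
deploy.rollback()          # "whoops": traffic returns to api-v001
print(deploy.live)
```

Rollback is cheap because the old group is still running; only the traffic pointer changes.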
@atseitlin
Asgard
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
@atseitlin
Our goal is availability
• Members can stream Netflix whenever they
want
• New users can explore and sign up for the
service
• New members can activate their service and
add new devices
@atseitlin
Failure is all around us
• Disks fail
• Power goes out. And your generator fails.
• Software bugs get introduced
• People make mistakes
Failure is unavoidable
@atseitlin
We design around failure
• Exception handling
• Clusters
• Redundancy
• Fault tolerance
• Fall-back or degraded experience (Hystrix)
• All to insulate our users from failure
Is that enough?
@atseitlin
It’s not enough
• How do we know if we’ve succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we prevent drifting into failure?
The typical answer is…
@atseitlin
More testing!
• Unit testing
• Integration testing
• Stress testing
• Exhaustive test suites to simulate and test all
failure modes
Can we effectively simulate a large-scale
distributed system?
@atseitlin
Building distributed systems is hard
Testing them exhaustively is even harder
• Massive data sets and changing shape
• Internet-scale traffic
• Complex interaction and information flow
• Asynchronous nature
• 3rd party services
• All while innovating and building features
Prohibitively expensive, if not impossible,
for most large-scale systems
@atseitlin
What if we could reduce variability of failures?
@atseitlin
There is another way
• Cause failure to validate resiliency
• Test design assumptions by stressing them
• Don’t wait for random failure. Remove its
uncertainty by forcing it periodically
@atseitlin
And that’s exactly what we did
@atseitlin
Instances fail
@atseitlin
Chaos Monkey taught us…
• State is bad
• Clusters are good
• Surviving single instance failure is not enough
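The core idea of forcing instance failure can be sketched as a selection step: for each group, possibly pick one instance to terminate. This is not the actual Chaos Monkey implementation; pick_victims, the probability parameter, and the group layout are assumptions for illustration, and the termination call itself is left out.

```python
import random


def pick_victims(groups, probability=1.0, rng=random):
    """Pick at most one instance per group to terminate.

    groups maps a group name to a list of instance IDs. Each non-empty
    group is selected with the given probability.
    """
    victims = {}
    for group, instances in groups.items():
        if instances and rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


groups = {"api": ["i-1", "i-2", "i-3"], "web": ["i-4"], "empty": []}
print(pick_victims(groups, probability=0.5))
```

Limiting the selection to one instance per group keeps each exercise within what a healthy cluster is expected to absorb.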
@atseitlin
Lots of instances fail
@atseitlin
Chaos Gorilla
@atseitlin
Chaos Gorilla taught us…
• Hidden assumptions on deployment topology
• Infrastructure control plane can be a
bottleneck
• Large scale events are hard to simulate
• Rapidly shifting traffic is error prone
• Smooth recovery is a challenge
• Cassandra works as expected
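Running on 2 out of 3 zones only works if the surviving zones have room for the displaced traffic. The arithmetic below is a sketch that assumes traffic is balanced evenly across zones; the function names are hypothetical.

```python
def surviving_zone_load(total_load, zones, failed=1):
    """Load each surviving zone carries once the failed zones are evacuated."""
    if failed >= zones:
        raise ValueError("at least one zone must survive")
    return total_load / (zones - failed)


def required_headroom(zones, failed=1):
    """Fractional extra capacity each zone needs to absorb the failure."""
    if failed >= zones:
        raise ValueError("at least one zone must survive")
    return zones / (zones - failed) - 1


print(surviving_zone_load(300, zones=3))  # each survivor takes 150 instead of 100
print(required_headroom(zones=3))         # 0.5, i.e. 50% headroom per zone
```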
@atseitlin
What about larger catastrophes?
Anyone remember Sandy?
@atseitlin
Chaos Kong (*some day soon*)
@atseitlin
The Sick and Wounded
@atseitlin
Latency Monkey
@atseitlin
Resilient Design – Hystrix, RxJava
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
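The fallback pattern behind this slide can be illustrated with a small circuit breaker. This is a generic sketch in Python, not Hystrix's API; the class name, thresholds, and injected clock are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()      # open: skip the dependency entirely
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # success closes the circuit
        return result


def flaky():
    raise RuntimeError("dependency down")


breaker = CircuitBreaker(failure_threshold=2)
for _ in range(3):
    print(breaker.call(flaky, lambda: "degraded response"))
```

After the second failure the breaker opens, so the third call returns the degraded response without touching the dependency.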
@atseitlin
Latency Monkey taught us
• Startup resiliency is often missed
• An ongoing unified approach to runtime
dependency management is important (visibility &
transparency get missed otherwise)
• Know thy neighbor (unknown dependencies)
• Fallbacks can fail too
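Latency injection can be sketched as a wrapper that delays a call before letting it through. This is an illustration only, not how Latency Monkey is implemented; inject_latency, its parameters, and get_ratings are hypothetical names.

```python
import functools
import random
import time


def inject_latency(delay_s, probability=1.0, rng=random, sleep=time.sleep):
    """Decorator that delays a call by delay_s with the given probability."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_s)  # simulate a sick or slow dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_s=0.01)
def get_ratings():
    """Hypothetical dependency call."""
    return [5, 4]


print(get_ratings())
```

Pairing a wrapper like this with short client timeouts shows whether callers actually fall back when a dependency slows down rather than fails outright.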
@atseitlin
Entropy
@atseitlin
Clutter accumulates
• Complexity
• Cruft
• Vulnerabilities
• Cost
@atseitlin
Janitor Monkey
@atseitlin
Janitor Monkey taught us…
• Label everything
• Clutter builds up
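A cleanup pass in the spirit of these two lessons might flag resources that are unlabeled or long idle. The rules here (an "owner" label and a 30-day idle limit) and the record layout are assumptions for illustration, not Janitor Monkey's actual rules.

```python
from datetime import datetime, timedelta


def find_cleanup_candidates(resources, now, max_idle=timedelta(days=30)):
    """Return IDs of resources that are unlabeled or idle past max_idle."""
    candidates = []
    for res in resources:
        unlabeled = not res.get("owner")
        idle = now - res["last_used"] > max_idle
        if unlabeled or idle:
            candidates.append(res["id"])
    return candidates


now = datetime(2013, 7, 1)
resources = [
    {"id": "vol-1", "owner": "team-a", "last_used": datetime(2013, 6, 25)},
    {"id": "vol-2", "owner": "team-b", "last_used": datetime(2013, 1, 1)},
    {"id": "vol-3", "owner": None, "last_used": datetime(2013, 6, 30)},
]
print(find_cleanup_candidates(resources, now))
```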
@atseitlin
Ranks of the Simian Army
• Chaos Monkey
• Chaos Gorilla
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Circus Monkey
• Doctor Monkey
• Howler Monkey
• Security Monkey
• Chaos Kong
• Efficiency Monkey
@atseitlin
Observability is key
• Don’t exacerbate real customer issues with
failure exercises
• Deep system visibility is key to root-causing
failures and understanding the system
@atseitlin
Organizational elements
• Every engineer is an operator of the service
• Each failure is an opportunity to learn
• Blameless culture
Goal is to create a learning organization
@atseitlin
Assembling the Puzzle
@atseitlin
Netflix Highly Available Platform
now open
@NetflixOSS
@atseitlin
Open Source Projects
Github / Techblog
Apache Contributions
Techblog Post
Coming Soon
Priam
Cassandra as a Service
Astyanax
Cassandra client for Java
CassJMeter
Cassandra test suite
Cassandra
Multi-region EC2 datastore
support
Aegisthus
Hadoop ETL for Cassandra
Ice
Spend analytics
Governator
Library lifecycle and dependency
injection
Odin
Cloud orchestration
Blitz4j
Async logging
Exhibitor
Zookeeper as a Service
Curator
Zookeeper Patterns
EVCache
Memcached as a Service
Eureka / Discovery
Service Directory
Archaius
Dynamic Properties Service
Edda
Config state with history
Denominator
Ribbon
REST Client + mid-tier LB
Karyon
Instrumented REST Base Server
Servo and Autoscaling Scripts
Genie
Hadoop PaaS
Hystrix
Robust service pattern
RxJava
Reactive Patterns
Asgard
AutoScaleGroup based AWS
console
Chaos Monkey
Robustness verification
Latency Monkey
Janitor Monkey
Bakeries / Aminator
Legend
@atseitlin
How does it all fit together?
@atseitlin
Our Current Catalog of Releases
Free code available at http://netflix.github.com
@atseitlin
We’re hiring!
• Simian Army
• Cloud Tools
• NetflixOSS
• Cloud Operations
• Reliability Engineering
• Edge Services
• Many, many more
jobs.netflix.com
@atseitlin
Takeaways
Create fine-grained micro-services. Don’t trust your dependencies.
Regularly inducing failure in your production environment validates resiliency
and increases availability
Netflix has built and deployed a scalable, global, and highly available Platform
as a Service and open sourced it (NetflixOSS)
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/atseitlin
@atseitlin @NetflixOSS
@atseitlin
Thank you!
Any questions?
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin

Resiliency through Failure @ OSCON 2013

  • 1. @atseitlin Resiliency through failure Netflix's Approach to Extreme Availability in the Cloud Ariel Tseitlin http://www.linkedin.com/in/atseitlin @atseitlin
  • 2. @atseitlin About Netflix Netflix is the world’s leading Internet television network with more than 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series[1] [1] http://ir.netflix.com/
  • 4. @atseitlin How Netflix Streaming Works Customer Device (PC, PS3, TV…) Web Site or Discovery API User Data Personalization Streaming API DRM QoS Logging OpenConnect CDN Boxes CDN Management and Steering Content Encoding Consumer Electronics AWS Cloud Services CDN Edge Locations Browse Play Watch
  • 6. @atseitlin Web Server Dependencies Flow (Home page business transaction as seen by AppDynamics) Start Here memcached Cassandra Web service S3 bucket Personalization movie group chooser Each icon is three to a few hundred instances across three AWS zones
  • 7. @atseitlin Component Micro-Services Test With Chaos Monkey, Latency Monkey
  • 8. @atseitlin Three Balanced Availability Zones Test with Chaos Gorilla Cassandra and Evcache Replicas Zone A Cassandra and Evcache Replicas Zone B Cassandra and Evcache Replicas Zone C Load Balancers
  • 9. @atseitlin Triple Replicated Persistence Cassandra maintenance affects individual replicas Cassandra and Evcache Replicas Zone A Cassandra and Evcache Replicas Zone B Cassandra and Evcache Replicas Zone C Load Balancers
  • 10. @atseitlin Isolated Regions Will someday test with Chaos Kong Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers
  • 11. @atseitlin Failure Modes and Effects Failure Mode Probability Current Mitigation Plan Application Failure High Automatic degraded response AWS Region Failure Low Wait for region to recover AWS Zone Failure Medium Continue to run on 2 out of 3 zones Datacenter Failure Medium Migrate more functions to cloud Data store failure Low Restore from S3 backups S3 failure Low Restore from remote archive Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
  • 12. @atseitlin Application Resilience Run what you wrote Rapid detection Rapid Response Fail often
  • 13. @atseitlin Run What You Wrote • Make developers responsible for failures – Then they learn and write code that doesn’t fail • Use Incident Reviews to find gaps to fix – Make sure its not about finding “who to blame” • Keep timeouts short, fail fast – Don’t let cascading timeouts stack up
  • 14. @atseitlin Rapid Detection • If your pilot had no instument panel, would you ever board fly on a plane? – Never run your service blind • Monitor services, not instances – Make instance failure a non-event • Don’t pay people to watch screens – Instead pay them to build alerting
  • 15. @atseitlin Edda AWS Instances, ASGs, et c. Eureka Services metadata AppDynamics Request flow Edda – Configuration History http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
  • 16. @atseitlin Edda Query Examples Find any instances that have ever had a specific public IP address $ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0" ["i-0123456789","i-012345678a","i-012345678b”] Show the most recent change to a security group $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2" --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810 +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504 @@ -1,33 +1,33 @@ { … "ipRanges" : [ "10.10.1.1/32", "10.10.1.2/32", + "10.10.1.3/32", - "10.10.1.4/32" … }
  • 17. @atseitlin Rapid Rollback • Use a new Autoscale Group to push code • Leave existing ASG in place, switch traffic • If OK, auto-delete old ASG a few hours later • If “whoops”, switch traffic back in seconds
  • 21. @atseitlin Our goal is availability • Members can stream Netflix whenever they want • New users can explore and sign up for the service • New members can activate their service and add new devices
  • 22. @atseitlin Failure is all around us • Disks fail • Power goes out. And your generator fails. • Software bugs introduced • People make mistakes Failure is unavoidable
  • 23. @atseitlin We design around failure • Exception handling • Clusters • Redundancy • Fault tolerance • Fall-back or degraded experience (Hystrix) • All to insulate our users from failure Is that enough?
  • 24. @atseitlin It’s not enough • How do we know if we’ve succeeded? • Does the system work as designed? • Is it as resilient as we believe? • How do we prevent drifting into failure? The typical answer is…
  • 25. @atseitlin More testing! • Unit testing • Integration testing • Stress testing • Exhaustive test suites to simulate and test all failure modes Can we effectively simulate a large-scale distributed system?
  • 26. @atseitlin Building distributed systems is hard Testing them exhaustively is even harder • Massive data sets and changing shape • Internet-scale traffic • Complex interaction and information flow • Asynchronous nature • 3rd party services • All while innovating and building features Prohibitively expensive, if not impossible, for most large-scale systems
  • 27. @atseitlin What if we could reduce variability of failures?
  • 28. @atseitlin There is another way • Cause failure to validate resiliency • Test design assumptions by stressing them • Don’t wait for random failure. Remove its uncertainty by forcing it periodically
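Forcing failure periodically can be as simple as a scheduled job that picks instances to terminate. The sketch below covers only the selection step and is not the Chaos Monkey implementation; `pick_victims`, the opt-out set, and the per-group probability are illustrative assumptions, and actual termination is left out.

```python
import random


def pick_victims(groups, probability=1.0, opted_out=frozenset(), rng=None):
    """Choose at most one instance per group to terminate in this run.

    groups: mapping of group name to a list of instance ids.
    probability: chance that a given group is hit in this run.
    opted_out: group names that must never be touched.
    """
    if rng is None:
        rng = random.Random()
    victims = {}
    for name, instances in groups.items():
        if name in opted_out or not instances:
            continue
        if rng.random() < probability:
            # Sort first so a seeded rng gives reproducible choices.
            victims[name] = rng.choice(sorted(instances))
    return victims


groups = {
    "api": ["i-1", "i-2"],
    "web": ["i-3"],
    "db": ["i-4"],
}
victims = pick_victims(groups, opted_out={"db"}, rng=random.Random(0))
```

Running such a job on a fixed schedule turns rare, surprising instance loss into a routine event that every service has to survive.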
  • 32. @atseitlin Chaos Monkey taught us… • State is bad • Clusters are good • Surviving single instance failure is not enough
  • 35. @atseitlin Chaos Gorilla taught us… • Hidden assumptions on deployment topology • Infrastructure control plane can be a bottleneck • Large scale events are hard to simulate • Rapidly shifting traffic is error prone • Smooth recovery is a challenge • Cassandra works as expected
  • 36. @atseitlin What about larger catastrophes? Anyone remember Sandy?
  • 41. @atseitlin Resilient Design – Hystrix, RxJava http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
  • 42. @atseitlin Latency Monkey taught us • Startup resiliency is often missed • An ongoing unified approach to runtime dependency management is important (visibility & transparency get missed otherwise) • Know thy neighbor (unknown dependencies) • Fallbacks can fail too
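Latency injection can be illustrated with a wrapper that delays a call before letting it through. This is a sketch of the general idea rather than Latency Monkey itself; the decorator name and its parameters are assumptions, and the sleep function is injectable so the behavior can be tested without waiting.

```python
import functools
import random
import time


def inject_latency(delay_s, probability=1.0, rng=None, sleep=time.sleep):
    """Decorator that delays a fraction of calls by delay_s seconds."""
    if rng is None:
        rng = random.Random()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_s)  # simulated slow network or dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator


# Record the injected delays instead of actually sleeping.
observed_delays = []


@inject_latency(2.0, sleep=observed_delays.append)
def fetch_profile(user_id):
    """Hypothetical dependency call."""
    return {"id": user_id}
```

Wrapping a dependency this way in a test environment shows whether callers time out and fall back cleanly, or whether the fallback path itself fails.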
  • 44. @atseitlin Clutter accumulates • Complexity • Cruft • Vulnerabilities • Cost
  • 46. @atseitlin Janitor Monkey taught us… • Label everything • Clutter builds up
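"Label everything" matters because cleanup needs to know who owns a resource and whether it is still in use. A minimal sketch of such a check follows; the resource fields (`owner`, `last_used`) and the 30-day idle limit are assumptions for illustration, not Janitor Monkey's actual rules.

```python
from datetime import datetime, timedelta


def find_cleanup_candidates(resources, now, max_idle=timedelta(days=30)):
    """Return (resource id, reason) pairs for resources worth reviewing."""
    candidates = []
    for resource in resources:
        if not resource.get("owner"):
            candidates.append((resource["id"], "no owner label"))
        elif now - resource["last_used"] > max_idle:
            candidates.append((resource["id"], "unused"))
    return candidates


now = datetime(2013, 6, 1)
resources = [
    {"id": "vol-1", "owner": "team-a", "last_used": datetime(2013, 5, 30)},
    {"id": "vol-2", "owner": "", "last_used": datetime(2013, 5, 30)},
    {"id": "vol-3", "owner": "team-b", "last_used": datetime(2013, 1, 1)},
]
candidates = find_cleanup_candidates(resources, now)
```

Here `vol-2` is flagged for having no owner and `vol-3` for sitting idle, while the labeled, recently used `vol-1` is left alone.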
  • 47. @atseitlin Ranks of the Simian Army • Chaos Monkey • Chaos Gorilla • Latency Monkey • Janitor Monkey • Conformity Monkey • Circus Monkey • Doctor Monkey • Howler Monkey • Security Monkey • Chaos Kong • Efficiency Monkey
  • 48. @atseitlin Observability is key • Don’t exacerbate real customer issues with failure exercises • Deep system visibility is key to root-cause failures and understand the system
  • 49. @atseitlin Organizational elements • Every engineer is an operator of the service • Each failure is an opportunity to learn • Blameless culture Goal is to create a learning organization
  • 51. @atseitlin Netflix Highly Available Platform now open @NetflixOSS
  • 52. @atseitlin Open Source Projects (legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon) • Priam: Cassandra as a Service • Astyanax: Cassandra client for Java • CassJMeter: Cassandra test suite • Cassandra: Multi-region EC2 datastore support • Aegisthus: Hadoop ETL for Cassandra • Ice: Spend analytics • Governator: Library lifecycle and dependency injection • Odin: Cloud orchestration • Blitz4j: Async logging • Exhibitor: Zookeeper as a Service • Curator: Zookeeper Patterns • EVCache: Memcached as a Service • Eureka / Discovery: Service Directory • Archaius: Dynamic Properties Service • Edda: Config state with history • Denominator • Ribbon: REST Client + mid-tier LB • Karyon: Instrumented REST Base Serve • Servo and Autoscaling Scripts • Genie: Hadoop PaaS • Hystrix: Robust service pattern • RxJava: Reactive Patterns • Asgard: AutoScaleGroup based AWS console • Chaos Monkey: Robustness verification • Latency Monkey • Janitor Monkey • Bakeries / Aminator
  • 53. @atseitlin How does it all fit together?
  • 55. @atseitlin Our Current Catalog of Releases Free code available at http://netflix.github.com
  • 56. @atseitlin We’re hiring! • Simian Army • Cloud Tools • NetflixOSS • Cloud Operations • Reliability Engineering • Edge Services • Many, many more jobs.netflix.com
  • 57. @atseitlin Takeaways • Create fine-grained micro-services. Don’t trust your dependencies. • Regularly inducing failure in your production environment validates resiliency and increases availability. • Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS). http://netflix.github.com http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/atseitlin @atseitlin @NetflixOSS
  • 58. @atseitlin Thank you! Any questions? Ariel Tseitlin http://www.linkedin.com/in/atseitlin @atseitlin

Editor's Notes

  1. The genre box shots were chosen because we have rights to use them; we are starting to make specific logos for each project going forward.