This document discusses the principles and practices of chaos engineering. It describes how Netflix, which regularly experiments with forced server reboots, handled an Amazon maintenance update in September 2014 without downtime; the update affected about 10% of Amazon's global fleet of cloud servers and rebooted 218 of Netflix's 2700+ production Cassandra nodes. The principles of chaos engineering outlined are: build a hypothesis around steady-state behavior, vary real-world events, run experiments in production, automate experiments to run continuously, and minimize the blast radius so that fallout is contained. Chaos engineering is about testing systems to build resilience; it is not just tools, but also involves culture and understanding complexity.
2. BASEL | BERN | BRUGG | BUKAREST | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I.BR. | GENF
HAMBURG | KOPENHAGEN | LAUSANNE | MANNHEIM | MÜNCHEN | STUTTGART | WIEN | ZÜRICH
news.trivadis.com/blog @lwieske
Chaos Engineering
Here We Go
Lothar
3. Lothar
I am a solutions architect and digital disruptor.
Since 2009, I have worked at the intersection of
cloud and analytics. Digital disruption is coming
to ever more sectors, and I want to understand its
technological, societal and economic impacts.
Before 2009, I managed large project budgets,
later became an architect, built a digital
radiology system and migrated Miles & More.
6. Cloud native technologies empower organizations to
build and run scalable applications in modern,
dynamic environments such as public, private, and
hybrid clouds. Containers, service meshes,
microservices, immutable infrastructure, and
declarative APIs exemplify this approach.
9. 2012: Netflix Open Sourced Chaos Monkey.
2016: Netflix Completed Transition To a 100% AWS Infrastructure
Cloud Changed the Way Netflix Runs the Company
10. Netflix Handled Amazon Maintenance Update
• Amazon performed a major maintenance update at the end of September 2014 in order to patch a
security vulnerability in the Xen hypervisor, affecting about 10% of its global fleet of cloud servers.
• Netflix has a long history of using its Simian Army (Chaos Monkey, Gorilla and Kong) to force
reboots of its servers in order to see how the overall system reacts and what can be done to
improve resilience. The problem this time was that the operation would affect some of its
database servers, specifically 218 Cassandra nodes. It is one thing to perform a live restart of a
server streaming a video; it is much more difficult to do the same to a stateful database.
• Out of Netflix's 2700+ production Cassandra nodes, 218 were rebooted.
• 22 Cassandra nodes were on hardware that did not reboot successfully.
• They were detected and replaced with minimal human intervention.
• Netflix experienced 0 downtime that weekend.
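To put the figures above in proportion, here is a small back-of-the-envelope calculation. It uses only the numbers from this slide; since the slide says "2700+", the total is treated as a lower bound, so the first share is an upper estimate.

```python
# Figures taken from the slide; 2700 is a lower bound ("2700+").
total_nodes = 2700
rebooted = 218
failed_to_reboot = 22

# Share of the production Cassandra nodes that were rebooted (upper estimate).
rebooted_share = rebooted / total_nodes
# Share of the rebooted nodes whose hardware did not come back.
failed_share = failed_to_reboot / rebooted

print(f"rebooted: at most {rebooted_share:.1%} of the Cassandra nodes")
print(f"failed to reboot: {failed_share:.1%} of the rebooted nodes")
```

Roughly one in twelve Cassandra nodes was rebooted, and about one in ten of those had to be replaced.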
13. PRINCIPLES OF CHAOS ENGINEERING
• The following principles describe an ideal application of Chaos Engineering, applied to the processes
of experimentation. The degree to which these principles are pursued strongly
correlates to the confidence we can have in a distributed system at scale.
• Build a Hypothesis around Steady State Behavior
• Vary Real-world Events
• Run Experiments in Production
• Automate Experiments to Run Continuously
• Minimize Blast Radius
• Experimenting in production has the potential to cause unnecessary customer pain. While there
must be an allowance for some short-term negative impact, it is the responsibility and obligation of
the Chaos Engineer to ensure the fallout from experiments is minimized and contained.
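The principles above can be illustrated with a minimal sketch of a single experiment. Everything here is a simulated assumption and not a real tool's API: the `Instance` class, the `success_rate` steady-state metric, the `run_experiment` function and its `blast_radius` and `threshold` parameters are hypothetical names chosen for illustration. The fault is an in-memory flag, standing in for a real reboot.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a server in the fleet."""
    name: str
    healthy: bool = True


def success_rate(fleet):
    """Steady-state metric (assumed): fraction of instances able to serve."""
    return sum(inst.healthy for inst in fleet) / len(fleet)


def run_experiment(fleet, blast_radius=0.05, threshold=0.90, seed=0):
    """Inject a fault into a small sample and check the steady-state hypothesis.

    Hypothesis: success_rate(fleet) stays at or above `threshold`
    while a `blast_radius` fraction of instances is down.
    """
    rng = random.Random(seed)
    # Minimize blast radius: only touch a small fraction of the fleet.
    n_victims = max(1, round(len(fleet) * blast_radius))
    victims = rng.sample(fleet, n_victims)
    try:
        for inst in victims:
            inst.healthy = False  # simulated fault, not a real reboot
        observed = success_rate(fleet)
        hypothesis_holds = observed >= threshold
    finally:
        # Contain the fallout: always restore the affected instances.
        for inst in victims:
            inst.healthy = True
    return n_victims, observed, hypothesis_holds


fleet = [Instance(f"node-{i}") for i in range(100)]
n_victims, observed, ok = run_experiment(fleet)
print(f"victims={n_victims} observed={observed:.2f} hypothesis holds: {ok}")
```

In a real setup the metric would come from production monitoring, and the loop would be scheduled to run continuously; the sketch only shows how the hypothesis, the injected event and the blast-radius limit fit together.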
16. Chaos Engineering Is Not Just Tools.
Culture Is Part Of Your System.
Complexity Is Part Of Your System.
Testing In Production? Yes You Can!
You Should Chaos Engineer Everything Cloud
and Microservices – Among Others