As we architect our systems for greater demand, scale, uptime, and performance, the hardest things to control become the environment in which we deploy and the subtle but crucial interactions between complicated systems. Chaos Patterns help us establish and implement a virtuous cycle that lets us both prove and improve our system along each of these dimensions before the inevitable happens. While it may seem reckless or counter-intuitive, our experience has shown that it is a matter of how and when (not if) we will learn about the limitations and failure modes of the system. This is the story of the pitfalls we encountered and how, through architecture, convention, and common sense, we managed to build an infrastructure that is "Always Up" from the end user's perspective and incredibly economical to build, scale, and operate. Using chaos testing, we learn more about how our system fails from a 10-second controlled failure than from a multi-hour uncontrolled outage. In this session we will cover various implementation techniques, available to any developer or operator, that will vastly increase the resilience of your systems and provide a superior end user experience: from designing your use of DNS with failure in mind, to configuring your CDN to have your back, to synthetic responses and expected database outages. But why stop there? Netflix has pioneered a culture and a suite of tools that actively inject ‘once in a blue moon’ failures into its production systems. This approach lets you battle-test your resilience design and lets developers and operators sleep comfortably at night, knowing their systems can handle even the worst of worst-case scenarios.