This document discusses managing failures and building resilience into systems at Etsy. Some key points:
1. Etsy has a complex architecture with many services and data stores that are functionally partitioned. This architecture is designed to limit the impact of failures.
2. Failures cannot be prevented, but they can be mitigated through techniques like redundant systems, small code changes, feature flags, extensive metrics collection, and resilient user interfaces.
3. Rather than aiming only for 100% uptime, product design also plans for availability during failures, for example with non-blocking interfaces that adapt to technical issues.
4. Building resilience is a shared responsibility of operations, engineering, product, and design teams through
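
As a rough illustration of points 2 and 3, the sketch below combines a feature flag with a non-blocking fallback: if the flag is off or the backing service fails, the page section degrades to a default instead of failing the whole page. It is a minimal sketch, not Etsy's implementation; the names `FLAGS`, `is_enabled`, `render_section`, and `failing_backend` are hypothetical and introduced only for this example.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory flag store; a real system would load flags from
# configuration so they can be flipped without deploying code.
FLAGS: Dict[str, bool] = {"recommendations": True}


def is_enabled(name: str) -> bool:
    """Return the flag's value; unknown flags are treated as off."""
    return FLAGS.get(name, False)


def render_section(
    flag: str,
    load: Callable[[], List[str]],
    fallback: List[str],
) -> List[str]:
    """Return the section content, or the fallback if the feature is
    switched off or its backend raises an error."""
    if not is_enabled(flag):
        return fallback
    try:
        return load()
    except Exception:
        # Degrade this one section rather than failing the whole page.
        return fallback


def failing_backend() -> List[str]:
    """Stand-in for a dependency that is currently down."""
    raise ConnectionError("recommendation service unavailable")
```

With the flag on and a healthy backend, `render_section` returns the loaded items. With `failing_backend`, or after setting `FLAGS["recommendations"] = False`, it returns the fallback, so the rest of the page can still render.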