Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.
Netflix is undoubtedly the leader in this field, but much of the publicised tools and articles focus on killing EC2 instances, and the efforts in the serverless community has been largely limited to moving those tools into AWS Lambda functions.
But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?
These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and, we have less control over their runtime behaviour. In short, there are far more unknown unknowns with these systems.
Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?