With serverless applications, execution can happen everywhere. It’s hard to predict and design for all troublesome issues. Chaos engineering can help you build highly resilient systems. Join us to learn how and why this approach is especially valuable when building serverless applications.
2. Who am I?
● Developer for 6+ years
● Product guy for 2 years
● VP of Product for Thundra
● Organizing committee
● Serverlessdays İstanbul
On October 11th!
3. Agenda
● What’s chaos engineering?
● Why chaos testing on serverless?
● Best practices on chaos testing for serverless
● How to apply chaos testing on AWS Lambda
● How to apply silence in a world of chaos
4. Why chaos engineering?
Unit Tests
● My function is running properly
and meets the expectations.
Integration Tests
● My system is running properly
and meets the expectations.
UI/UX Tests
● It works like a charm!
11. Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
http://principlesofchaos.org/
12. Chaos Engineering is
● Like injecting a vaccine into your system to make it more
immune.
● Improving your system’s resilience by uncovering
weaknesses.
● Identifying failures before they become outages.
● Understanding the steady state of your system and
challenging it.
13. Chaos Engineering is not
● Breaking production on purpose.
● Blaming a group of people.
● Surprising your colleagues with partial outages.
● Taking down the whole system at once.
17. Stages of chaos engineering
● Define steady state
● Form a hypothesis about the steady state of the system under the designed failure
● Run your experiment
○ Define blast radius
○ Define halting condition
○ Have a rollback plan!
● Verify & Learn
○ If your system breaks, you have found an issue before it caused an outage. Go fix it!
○ If it is resilient, congrats! Now, inject some other failure!
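The loop above can be sketched in a few lines. This is only an illustration, not a real chaos tool: `Experiment`, `run_experiment`, and every field name are hypothetical, and the steady-state check, failure injection, and rollback are supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment. All names here are illustrative assumptions."""
    name: str
    steady_state: Callable[[], bool]   # True while the system looks healthy
    inject: Callable[[float], None]    # turns the failure on for a share of traffic
    rollback: Callable[[], None]       # turns the failure off again
    should_halt: Callable[[], bool]    # halting condition, checked at every step
    blast_radius: float = 0.01         # fraction of traffic exposed to the failure
    steps: int = 5                     # how many observations to take


def run_experiment(exp: Experiment) -> str:
    # Never start an experiment on a system that is already unhealthy.
    if not exp.steady_state():
        return "skipped: system not in steady state"
    exp.inject(exp.blast_radius)
    try:
        for _ in range(exp.steps):
            # The stop button: give up as soon as the halting condition fires.
            if exp.should_halt():
                return "halted: halting condition reached"
            # Hypothesis: the steady state survives the injected failure.
            if not exp.steady_state():
                return "weakness found: steady state violated"
        return "resilient: hypothesis held"
    finally:
        # The rollback plan runs whatever the outcome was.
        exp.rollback()
```

Putting the rollback in `finally` means the failure is switched off even when the run halts early or an observation raises.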
18. Don’t break on purpose!
● Start experimenting with the first row, the
leftmost cell: Known-knowns.
● Blast radius: Start with the
smallest possible effect.
● Put a stop button somewhere!
● Plan how you learn.
● Your first experiment doesn’t need
to run in production.
● Most important: let other people
know! Surprise chaos is not funny. Not at
all!
19. Chaos examples
● Your system keeps records in the DB.
● The DB responds too slowly for 1% of your customers.
Hypothesis: The system won’t experience an outage when the DB is barely
accessible.
Result: Users experience timeouts while waiting for results.
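One way to slow the DB down for roughly 1% of customers is to hash each customer ID into a bucket and delay only the lowest buckets. The sketch below assumes this approach; `in_blast_radius`, `query_db`, and the `run_query` callback are hypothetical names, not part of any real DB client.

```python
import hashlib
import time


def in_blast_radius(customer_id: str, percent: float) -> bool:
    """Deterministically decide whether a customer is part of the experiment."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000   # 0..9999
    return bucket < percent * 100                          # 1.0 percent -> buckets 0..99


def query_db(customer_id, run_query, percent=1.0, delay_seconds=2.0, sleep=time.sleep):
    """Run the query, adding latency first for customers inside the blast radius."""
    if in_blast_radius(customer_id, percent):
        sleep(delay_seconds)   # the injected slowness
    return run_query(customer_id)
```

Hashing instead of rolling a die per request keeps the same customers in the experiment for its whole duration, which makes the timeouts they report easier to trace back to the injection.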
27. Every service has its own failure mode
Lots of managed intermediate services, each with its own bad-day
characteristics.
Different throttling and different retry mechanisms for different services.
35. Chaos experiments in serverless
● Inject latency into downstream services
● Inject failures into resources
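Failure injection can be as small as a wrapper around the call that reaches the resource. The sketch below is an assumption about how one might do it by hand; `with_failure` and `InjectedFailure` are hypothetical names, and the wrapped function stands in for any resource access.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of calling the real resource."""


def with_failure(func, failure_rate, rng=random.random):
    """Wrap func so that a share of calls fails before reaching the resource."""
    def wrapper(*args, **kwargs):
        # rng() returns a float in [0, 1), so rate 0 never fails and rate 1 always does.
        if rng() < failure_rate:
            name = getattr(func, "__name__", "call")
            raise InjectedFailure(f"injected failure before {name}")
        return func(*args, **kwargs)
    return wrapper
```

Usage could look like `read_item = with_failure(read_item, failure_rate=0.05)`, where `read_item` is whatever function talks to the resource.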
36. Injecting latency
● Don’t attack your system.
● You don’t need to do it on prod
first.
● There is no point in injecting
latency into async calls.
Hypothesis: Entry point Lambda will
degrade gracefully when the
downstream Lambda times out or returns
really late.
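The hypothesis can be tried locally before touching a real function. In the sketch below, `call_downstream` is a stand-in for invoking the downstream Lambda, `with_latency` is a hypothetical wrapper that injects the delay, and the empty fallback response is an assumed form of graceful degradation.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def with_latency(func, delay_seconds):
    """Wrap func so that every call is delayed before it runs."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_seconds)   # the injected latency
        return func(*args, **kwargs)
    return wrapper


def call_downstream(payload):
    # Stand-in for a synchronous call to the downstream Lambda.
    return {"recommendations": ["a", "b"]}


def handler(event, context, downstream=call_downstream, timeout_seconds=0.5):
    """Entry point: wait a bounded time for downstream, then fall back."""
    future = _pool.submit(downstream, event)
    try:
        body = future.result(timeout=timeout_seconds)
        return {"statusCode": 200, "degraded": False, "body": body}
    except FutureTimeout:
        # Degrade gracefully: answer with an empty default instead of timing out.
        return {"statusCode": 200, "degraded": True, "body": {"recommendations": []}}
```

Calling `handler(event, None, downstream=with_latency(call_downstream, 2.0))` then shows whether the entry point answers within its own budget or hangs with the downstream call.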
37. Where else to inject?
Inject latency into resources, too.
44. Common fixes
● Exponential backoff
● Properly tuned timeouts
● Circuit breakers
● Use async communication when possible
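Two of these fixes are sketched below under stated assumptions: exponential backoff with full jitter (one common variant) and a simple count-based circuit breaker. `retry_with_backoff`, `CircuitBreaker`, and all thresholds are illustrative names and values, not a specific library's API.

```python
import random
import time


def retry_with_backoff(call, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry call(), waiting a random time up to a doubling, capped delay."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise   # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))   # full jitter


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Past the reset window: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Backoff protects the struggling dependency from retry storms, while the breaker protects the caller from spending its own timeout budget on a dependency that is known to be down.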
45. Don’t forget! The aim is
● Not to break but to improve
● Not to blame people but to give them room to fix
● Not to surprise your colleagues but to make your system resilient