Serverless applications are the epitome of highly distributed, microservices applications. Execution happens everywhere - both inside and outside the serverless compute environment. For example, your functions could be triggered by an external service, then execute some code within AWS Lambda, then send a request over to a database, which *then* requires AWS Lambda to perform an update in a second data store.
You might be able to predict and design for certain troublesome issues, but there are many more that you probably cannot plan for easily. How do you build a resilient system under these highly distributed circumstances? The answer is chaos engineering.
Join us as we walk through:
The unique challenges of building a highly resilient serverless app
Why you need to design for problems you cannot predict and cannot easily test for
How you can use chaos engineering to build a resilient serverless application
How you can take advantage of out-of-the-box and third-party observability solutions to measure the impact of chaos experiments
2. Who am I?
● Developer for 6+ years
● Product guy for 2 years
● VP of Product for Thundra
● Organizing committee
● Father of a chaos monkey
3. @emrahsamdan
Why chaos engineering?
Unit Tests
● My function is running properly
and meets the expectations.
Integration Tests
● My system is running properly
and meets the expectations.
UI/UX Tests
● It works like a charm!
10. @emrahsamdan
Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
http://principlesofchaos.org/
25. @emrahsamdan
Ups and downs of the
system.
If I take this server down,
maybe everything will still
run smoothly. Maybe?
Let me attack my system!
Cute! Let me break
something else.
Oh! I should fix this before it
actually happens and then
break something else.
27. @emrahsamdan
Don’t break on purpose!
● Start experimenting with the first row, the
leftmost cell: Known-knowns.
● Blast radius: Keep the effect as
small as possible.
● Put a stop button somewhere!
● Plan how you will learn.
● You don’t need to do it in production
the first time.
● Most important: Let other people
know! Surprise chaos is not funny. Not at
all!
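The blast radius and stop button from the list above can be pictured as a small guard that decides, per request, whether a fault gets injected. This is only a minimal sketch under assumptions: `ChaosGuard`, `blast_radius`, `stop`, and `should_inject` are hypothetical names for illustration, not the API of any chaos tool.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ChaosGuard:
    """Decides whether a request takes part in an experiment (hypothetical helper)."""
    blast_radius: float = 0.01   # fraction of requests affected; keep it small
    enabled: bool = True         # the "stop button" state
    rng: random.Random = field(default_factory=random.Random)

    def stop(self) -> None:
        """Stop button: no request is affected after this call."""
        self.enabled = False

    def should_inject(self) -> bool:
        """True only while the experiment runs and the request falls inside the blast radius."""
        return self.enabled and self.rng.random() < self.blast_radius
```

A fault injector would call `should_inject()` before adding an error or delay, and an operator (or an alarm) would call `stop()` to end the experiment immediately.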
29. @emrahsamdan
Chaos examples
● Your system keeps records in the DB.
● The DB responds too slowly for 1% of your customers.
Hypothesis: The system won’t experience an outage when the DB is hard to
reach.
Result: People experience timeouts while waiting for results.
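One way to picture this experiment is a simulation: a small fraction of DB calls receives extra delay, and we count how many requests would exceed the caller's timeout. The sketch below does not sleep or touch a real database. `query_latency`, `run_experiment`, and every number in it (20 ms base latency, 5 s injected delay, 3 s timeout) are made-up assumptions, not measurements from the talk.

```python
import random


def query_latency(rng, base_ms=20.0, slow_fraction=0.01, injected_ms=5000.0):
    """Simulated DB latency in ms: a small fraction of calls receives injected delay."""
    extra = injected_ms if rng.random() < slow_fraction else 0.0
    return base_ms + extra


def run_experiment(requests=10_000, timeout_ms=3000.0, seed=42):
    """Return how many simulated requests would exceed the caller's timeout."""
    rng = random.Random(seed)  # seeded so repeated runs give the same count
    timeouts = 0
    for _ in range(requests):
        if query_latency(rng) > timeout_ms:
            timeouts += 1
    return timeouts


if __name__ == "__main__":
    print(f"{run_experiment()} of 10000 simulated requests timed out")
```

With these assumed numbers, roughly 1% of requests end in a timeout, which matches the result on the slide: there is no outage, but some users still wait until they time out.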
36. @emrahsamdan
Every service has its own failure mode
Lots of managed intermediate services, each with its own bad-day
characteristics.
Different throttling, different retry mechanisms for different services.
43. @emrahsamdan
Injecting latency
● Don’t attack your system.
● You don’t need to do it on prod
first.
● There is no point in injecting
latency into async calls.
Hypothesis: Entry point Lambda will
degrade gracefully when the
downstream Lambda times out or returns
really late.
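The hypothesis above can be sketched locally: the entry point calls the downstream function with a deadline and returns a fallback when the call is late. This is an illustrative sketch, not the setup used in the talk. `downstream` and `entry_point` are hypothetical stand-ins for the two Lambda functions, the injected delay is a plain `time.sleep`, and the fallback body is a placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def downstream(injected_delay_s: float) -> str:
    """Stand-in for the downstream Lambda; the sleep is the injected latency."""
    time.sleep(injected_delay_s)
    return "fresh result"


def entry_point(injected_delay_s: float, timeout_s: float = 0.2) -> dict:
    """Call downstream with a deadline and fall back instead of failing."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(downstream, injected_delay_s)
    try:
        return {"status": "ok", "body": future.result(timeout=timeout_s)}
    except FutureTimeout:
        # Graceful degradation: answer with a default instead of hanging.
        return {"status": "degraded", "body": "cached or default result"}
    finally:
        pool.shutdown(wait=False)  # do not block on the late call
```

Calling `entry_point(0.0)` takes the normal path, while `entry_point(0.5, timeout_s=0.05)` exercises the degraded path. The call is synchronous on purpose: the caller is waiting for the answer, which is the case where injected latency is actually felt.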
51. @emrahsamdan
Don’t forget! The aim is
● Not to break but to improve
● Not to blame people but to give them room to fix
● Not to surprise your colleagues but to make your system resilient