@emrahsamdan
Applying Chaos Engineering to build resilient
serverless applications
Emrah Şamdan
(@emrahsamdan)
11/6/2019
Who am I?
● Developer for 6+ years
● Product guy for 2 years
● VP of Product for Thundra
● Organizing committee
● Father of a chaos monkey
Why chaos engineering?
Unit Tests
● My function is running properly
and meets the expectations.
Integration Tests
● My system is running properly
and meets the expectations.
UI/UX Tests
● It works like a charm!
Your third-party API slows down badly.
Some part of your system becomes unreachable.
Your cache/DB is down, so you can’t load your data.
Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
http://principlesofchaos.org/
Breaking things on purpose in production
To make them more resilient
Well, maybe in staging?
Chaos Engineering is
● A vaccine for software
● For resiliency
● To prevent outages
● To define steady state
Chaos Engineering is not
● For breaking things down
● For bad surprises
● For blaming
● For causing outages
History of chaos engineering
[Timeline: 2010, 2011, 2014, 2019]
Companies applying Chaos Engineering
Ups and downs of the system.
If I take this server down, maybe everything will still run smoothly. Maybe?
Let me attack my system!
Cute! Let me break something else.
Oh! I should fix this before it actually happens and then break something else.
Chaos experiments will (should) never end!
Don’t break on purpose!
● Start experimenting with the first row, the leftmost cell: known-knowns.
● Blast radius: keep the impact of each experiment as small as possible.
● Put a stop button somewhere!
● Plan how you will learn from the experiment.
● You don’t need to do it in production the first time.
● Most important: let other people know! Surprise chaos is not funny. Not at all!
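The blast radius and stop button above can be sketched in a few lines. This is a minimal illustration, not a real chaos tool: `ChaosExperiment`, its `blast_radius` parameter, and the `slow-db` name are hypothetical names invented for this example.

```python
import random
import threading


class ChaosExperiment:
    """Hypothetical helper that decides which calls an experiment may affect."""

    def __init__(self, name, blast_radius=0.01, rng=None):
        if not 0.0 <= blast_radius <= 1.0:
            raise ValueError("blast_radius must be between 0 and 1")
        self.name = name
        self.blast_radius = blast_radius    # share of calls the experiment may touch
        self._stopped = threading.Event()   # the "stop button"
        self._rng = rng or random.Random()
        self.injected = 0

    def stop(self):
        """Press the stop button: no further call is affected."""
        self._stopped.set()

    def should_inject(self):
        """Return True only for a small random share of calls while running."""
        if self._stopped.is_set():
            return False
        hit = self._rng.random() < self.blast_radius
        if hit:
            self.injected += 1
        return hit


# Run with a 5% blast radius, then press the stop button.
experiment = ChaosExperiment("slow-db", blast_radius=0.05, rng=random.Random(42))
for _ in range(1000):
    experiment.should_inject()
injected_while_running = experiment.injected

experiment.stop()
for _ in range(1000):
    experiment.should_inject()
injected_after_stop = experiment.injected - injected_while_running
print(injected_while_running, injected_after_stop)
```

Keeping the stop flag in one place means the experiment can be halted without redeploying the code under test.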
Chaos examples
● Your system keeps records in the DB.
● The DB responds too slowly for 1% of your customers.
Hypothesis: The system won’t experience an outage when the DB is barely accessible.
Result: People experience timeouts while waiting for results.
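A toy simulation of this experiment might look like the sketch below. All numbers (client timeout, healthy latency, injected latency) are assumptions chosen for illustration, and the DB is simulated rather than called.

```python
import random

CLIENT_TIMEOUT_S = 1.0     # assumed client-side timeout
NORMAL_LATENCY_S = 0.05    # assumed healthy DB latency
INJECTED_LATENCY_S = 3.0   # assumed extra latency added by the experiment
AFFECTED_SHARE = 0.01      # 1% of customers, as in the example above


def db_latency(customer_id, affected_customers):
    """Simulated DB response time in seconds for one request."""
    extra = INJECTED_LATENCY_S if customer_id in affected_customers else 0.0
    return NORMAL_LATENCY_S + extra


def run_experiment(customer_ids, rng):
    """Slow the DB down for a random 1% of customers and collect the timeouts."""
    sample_size = round(len(customer_ids) * AFFECTED_SHARE)
    affected = set(rng.sample(customer_ids, sample_size))
    timed_out = [c for c in customer_ids
                 if db_latency(c, affected) > CLIENT_TIMEOUT_S]
    return affected, timed_out


customers = list(range(1000))
affected, timed_out = run_experiment(customers, random.Random(7))

# The hypothesis holds only if no customer runs into a timeout.
hypothesis_holds = len(timed_out) == 0
print(f"affected={len(affected)} timed_out={len(timed_out)} "
      f"hypothesis_holds={hypothesis_holds}")
```

In this simulation every affected customer times out, which matches the result on the slide: the hypothesis is rejected and there is something to fix.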
You never fail!
Chaos when everything is more granular.
SERVERLESS
More Granular Functions
Every service has its own failure mode
Lots of managed intermediate services, each with its own bad-day characteristics.
Different throttling, different retry mechanisms for different services.
Every function has its own configuration
● Timeouts
● IAM Roles
Common weaknesses in serverless
● Nested functions with improper timeouts
● Unhandled errors from upstream services
● Failures in resources
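The first weakness, nested functions with improper timeouts, can be checked with simple arithmetic: a caller that waits synchronously needs a timeout at least as long as the worst case of its callee, including retries. The sketch below assumes a hypothetical `FunctionConfig` description of a call chain; the function names, timeout values, and the 0.5 s overhead are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FunctionConfig:
    """Hypothetical description of one function in a synchronous call chain."""
    name: str
    timeout_s: float
    retries: int = 0                            # retries this function makes on its callee
    calls: Optional["FunctionConfig"] = None    # downstream function, if any


def find_timeout_problems(entry: FunctionConfig, overhead_s: float = 0.5) -> List[str]:
    """Flag callers whose timeout is shorter than the callee's worst case."""
    problems = []
    caller = entry
    while caller.calls is not None:
        callee = caller.calls
        # Worst case: every attempt runs until the callee's own timeout.
        worst_case_s = (caller.retries + 1) * callee.timeout_s + overhead_s
        if caller.timeout_s < worst_case_s:
            problems.append(
                f"{caller.name} ({caller.timeout_s}s) can time out before "
                f"{callee.name} finishes (worst case {worst_case_s}s)"
            )
        caller = callee
    return problems


payments = FunctionConfig("payments", timeout_s=5.0)
orders = FunctionConfig("orders", timeout_s=10.0, calls=payments)
api = FunctionConfig("api", timeout_s=3.0, retries=1, calls=orders)

problems = find_timeout_problems(api)
for problem in problems:
    print(problem)
```

Here only the `api` function is flagged: it gives up after 3 s while `orders` may legitimately run for 10 s per attempt.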
Chaos experiments in serverless
● Inject latency into downstream services
● Inject failures into resources
Injecting latency
● Don’t attack your system.
● You don’t need to do it on prod first.
● There is no point in injecting latency into async calls.
Hypothesis: The entry-point Lambda will degrade gracefully when the downstream Lambda times out or returns really late.
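The hypothesis above can be tried locally before touching any real Lambda. In this sketch `downstream` and `entry_point` are plain Python stand-ins, not AWS calls, and the latency and timeout values are assumptions picked to keep the example fast.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def downstream(injected_latency_s=0.0):
    """Stand-in for a synchronous downstream function; sleeps to simulate latency."""
    time.sleep(injected_latency_s)
    return {"recommendations": ["a", "b", "c"]}


def entry_point(injected_latency_s=0.0, timeout_s=0.2):
    """Call downstream with a timeout and fall back to an empty result."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(downstream, injected_latency_s)
    try:
        body = future.result(timeout=timeout_s)
        return {"status": 200, "degraded": False, **body}
    except FutureTimeout:
        # Graceful degradation: answer without the optional data.
        return {"status": 200, "degraded": True, "recommendations": []}
    finally:
        pool.shutdown(wait=False)   # do not block the response on the slow call


healthy = entry_point()
slow = entry_point(injected_latency_s=0.5)
print(healthy["degraded"], slow["degraded"])
```

If the entry point raised or hung instead of returning the degraded response, the hypothesis would be rejected.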
Where else to inject?
Inject latency into resources, too.
How to inject latency
How to inject latency with Thundra
Injecting errors
● Connection errors with third-party services
● Cache is down
● AWS resource is unreachable
What if we lose the connection to Redis?
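One way to explore this question without a real Redis instance is to wrap the cache client in a class that always fails. The sketch below uses an in-memory stand-in; `InMemoryCache`, `FailingCache`, and `load_user` are hypothetical names, and this is not how Thundra injects errors.

```python
class InMemoryCache:
    """Stand-in for a Redis client; only get and set are modelled."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class FailingCache:
    """Chaos wrapper: every call raises, as if the connection were lost."""

    def get(self, key):
        raise ConnectionError("injected: cache unreachable")

    def set(self, key, value):
        raise ConnectionError("injected: cache unreachable")


def load_user(user_id, cache, database):
    """Read through the cache, falling back to the database if the cache fails."""
    key = f"user:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached, "cache"
    except ConnectionError:
        pass                      # cache is down: go to the source of truth
    value = database[user_id]
    try:
        cache.set(key, value)
    except ConnectionError:
        pass                      # failing to warm the cache must not fail the request
    return value, "database"


database = {1: "Ada"}
cache = InMemoryCache()
first = load_user(1, cache, database)
second = load_user(1, cache, database)
during_outage = load_user(1, FailingCache(), database)
print(first, second, during_outage)
```

The request still succeeds during the injected outage, only slower, because the cache is treated as optional.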
Let’s inject an error into Redis with Thundra
Common fixes
● Exponential backoff
● Properly tuned timeouts
● Circuit breakers
● Use async communication when possible
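Two of these fixes, exponential backoff and a circuit breaker, are sketched below. This is a simplified illustration under assumed parameters (attempt counts, delays, thresholds), not a production implementation; `sleep`, `rng`, and `clock` are injectable only so the example runs instantly.

```python
import random
import time


def retry_with_backoff(call, attempts=4, base_delay_s=0.1, max_delay_s=2.0,
                       sleep=time.sleep, rng=random.random):
    """Retry on ConnectionError, doubling the delay cap after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(delay * rng())    # jitter: wait a random share of the delay


class CircuitBreaker:
    """Fail fast after repeated failures instead of hammering a broken service."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


# Backoff demo: two failures, then success. Delays are recorded, not slept.
outcomes = iter([ConnectionError("down"), ConnectionError("down"), "ok"])

def flaky():
    item = next(outcomes)
    if isinstance(item, Exception):
        raise item
    return item

delays = []
result = retry_with_backoff(flaky, sleep=delays.append, rng=lambda: 1.0)

# Breaker demo: after two failures the third call is rejected immediately.
def always_down():
    raise ConnectionError("down")

breaker = CircuitBreaker(failure_threshold=2, clock=lambda: 0.0)
for _ in range(2):
    try:
        breaker.call(always_down)
    except ConnectionError:
        pass
try:
    breaker.call(always_down)
    tripped = False
except RuntimeError:
    tripped = True
print(result, delays, tripped)
```

Backoff protects a service that is briefly struggling; the breaker protects the caller when the service stays down.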
Don’t forget! The aim is
● Not to break but to improve
● Not to blame people but to give them room to fix
● Not to surprise your colleagues but to make your system resilient
Thank you!
