Applying Chaos Engineering to build Resilient Serverless Applications - Emrah Samdan

@emrahsamdan
Applying Chaos Engineering to build resilient
serverless applications
Emrah Şamdan
(@emrahsamdan)
11/6/2019

Who am I?
● Developer for 6+ years
● Product guy for 2 years
● VP of Product for Thundra
● Organizing committee
● Father of a chaos monkey

Why chaos engineering?
Unit Tests
● My function is running properly
and meets the expectations.
Integration Tests
● My system is running properly
and meets the expectations.
UI/UX Tests
● It works like a charm!

Your third-party API slows down badly.

Some part of your system becomes unreachable.

Your cache/DB is down so you can’t load your data.

Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
http://principlesofchaos.org/

Breaking things on purpose in production

To make them more resilient

Well, maybe in staging?

Chaos Engineering is
● A vaccine for software
● For resiliency
● To prevent outages
● To define steady state

Chaos Engineering is not
● For breaking things down
● For bad surprises
● For blaming
● For causing outages

History of chaos engineering?
Timeline: 2010, 2011, 2014, 2019

Companies applying Chaos Engineering

Ups and downs of the system.
If I take this server down, maybe everything will still run smoothly. Maybe?
Let me attack my system!
Cute! Let me break something else.
Oh! I should fix this before it actually happens and then break something else.

Chaos experiments will (and should) never end!

Don’t break on purpose!
● Start experimenting with the first row, the leftmost cell: known-knowns.
● Blast radius: keep the impact of the experiment as small as possible.
● Put a stop button somewhere!
● Plan how you will learn.
● You don’t need to run it in production the first time.
● Most important: let other people know! Surprise chaos is not funny. Not at all!

Chaos examples
● Your system keeps records in the DB.
● The DB responds too slowly for 1% of your customers.
Hypothesis: The system won’t experience an outage when the DB is barely accessible.

Result: People experience timeouts while waiting for results.
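The experiment above can be tried locally before touching any real database. The following is a minimal sketch, not the speaker's tooling: `with_injected_latency` and `query_db` are hypothetical names, and the 5-second delay is an assumed value. It delays roughly 1% of calls so you can watch how callers behave.

```python
import random
import time


def with_injected_latency(func, delay_s, fraction, rng=random.random, sleep=time.sleep):
    """Wrap func so that roughly `fraction` of calls are delayed by `delay_s` seconds.

    `rng` and `sleep` are injectable so the experiment can be made deterministic.
    """
    def wrapper(*args, **kwargs):
        if rng() < fraction:
            sleep(delay_s)  # the injected "slow DB" behaviour
        return func(*args, **kwargs)
    return wrapper


def query_db(customer_id):
    # Hypothetical stand-in for the real DB call.
    return {"customer_id": customer_id, "records": []}


# Roughly 1% of calls will now take an extra 5 seconds (assumed value).
slow_query_db = with_injected_latency(query_db, delay_s=5.0, fraction=0.01)
```

Keeping `fraction` small is one way to limit the blast radius: most customers never see the injected delay.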

Chaos when everything is more granular.
SERVERLESS

More Granular Functions

Every service has its own failure mode
Lots of managed intermediate services, each with its own bad-day characteristics.
Different throttling, different retry mechanisms for different services.

Every function has its own configuration
● Timeouts
● IAM Roles

Common weaknesses in serverless
● Nested functions with improper timeouts
● Unhandled errors from upstream services
● Failures in resources
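For the first weakness, one possible approach is to derive the timeout of a nested call from the time the calling function has left. This is a sketch under assumptions: the 500 ms safety margin and 10 s cap are made-up values, and `FakeContext` is a local stand-in that mimics only the `get_remaining_time_in_millis()` method of the AWS Lambda Python context object.

```python
def downstream_timeout_s(remaining_ms, safety_margin_ms=500, max_timeout_s=10.0):
    """Return a timeout in seconds for a nested call that fits the caller's remaining time."""
    budget_ms = remaining_ms - safety_margin_ms
    if budget_ms <= 0:
        return 0.0  # no time left to call downstream safely
    return min(budget_ms / 1000.0, max_timeout_s)


class FakeContext:
    """Local stand-in for the Lambda context object (assumption for this sketch)."""

    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms


def handler(event, context):
    timeout_s = downstream_timeout_s(context.get_remaining_time_in_millis())
    if timeout_s == 0.0:
        # Fail fast instead of letting the outer function time out mid-call.
        return {"statusCode": 503, "body": "not enough time left to call downstream"}
    # A real handler would pass timeout_s to its HTTP or SDK client here.
    return {"statusCode": 200, "downstream_timeout_s": timeout_s}
```

The idea is that the inner call always gives up before the outer function does, so the caller can still return a meaningful response.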

Chaos experiments in serverless
● Inject latency to downstream services
● Inject failure to resources

Injecting latency
● Don’t attack your system.
● You don’t need to do it in prod first.
● There is no point in injecting latency into async calls.
Hypothesis: The entry point Lambda will degrade gracefully when the downstream Lambda times out or returns really late.
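What "degrade gracefully" could look like in code: the entry point waits for the downstream call only up to a timeout and otherwise returns a fallback. This is a minimal local sketch, with hypothetical names (`slow_downstream`, `entry_point_handler`) and an assumed empty-list fallback; it uses a thread pool rather than a real Lambda invocation.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback):
    """Run func in a worker thread; return `fallback` if it does not finish in time."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The abandoned call keeps running in its thread; we just stop waiting for it.
        return fallback


def slow_downstream():
    time.sleep(0.5)  # injected latency (assumed value)
    return {"recommendations": ["a", "b"]}


def entry_point_handler(downstream=slow_downstream, timeout_s=0.1):
    body = call_with_timeout(downstream, timeout_s, fallback={"recommendations": []})
    return {"statusCode": 200, "body": body}
```

With the injected delay, `entry_point_handler()` answers with the empty fallback instead of timing out itself. In a real function the timeout would more likely be set on the HTTP or SDK client making the call.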

Where else to inject?
Inject latency to resources, too.

How to inject latency

How to inject latency with Thundra

Injecting Error
● Connection errors with third party services
● Cache down
● AWS Resource is unreachable

What if we lose the connection to Redis?
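One way to explore that question without a real Redis instance is an in-memory stand-in that raises on demand. The sketch below is an assumption-laden illustration: `FlakyCache`, `load_from_db`, and `get_value` are hypothetical names, and no Redis client library or Thundra API is used.

```python
class FlakyCache:
    """In-memory stand-in for a cache client; raises when failure is injected."""

    def __init__(self, fail=False):
        self.fail = fail
        self.store = {}

    def get(self, key):
        if self.fail:
            raise ConnectionError("injected: cache unreachable")
        return self.store.get(key)

    def set(self, key, value):
        if self.fail:
            raise ConnectionError("injected: cache unreachable")
        self.store[key] = value


def load_from_db(key):
    # Hypothetical stand-in for the slower source of truth.
    return f"value-for-{key}"


def get_value(cache, key):
    """Cache-aside read that keeps working when the cache is down."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except ConnectionError:
        return load_from_db(key)  # cache down: skip it entirely
    value = load_from_db(key)
    try:
        cache.set(key, value)
    except ConnectionError:
        pass  # failing to populate the cache must not fail the request
    return value
```

With `FlakyCache(fail=True)` the read still succeeds, only slower, which is the behaviour the experiment is meant to confirm.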

Let’s inject an error into Redis with Thundra

Common fixes
● Exponential backoff
● Properly tuned timeouts
● Circuit breakers
● Use async communication when possible
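Two of the fixes above, exponential backoff and a circuit breaker, can be sketched in a few lines. This is a simplified illustration, not a production library: the thresholds and delays are assumed values, and `retry_with_backoff`, `CircuitBreaker`, and `CircuitOpenError` are names made up for the example.

```python
import time


def retry_with_backoff(func, retries=3, base_delay_s=0.1, sleep=time.sleep):
    """Call func, retrying on any exception with delays base, 2*base, 4*base, ..."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay_s * 2 ** attempt)


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset window elapsed: allow one trial call (half-open state).
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

Backoff gives a struggling dependency time to recover, while the breaker stops callers from piling more requests onto it once it is clearly failing. Production code would usually add jitter to the delays as well.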

Don’t forget! The aim is
● Not to break but to improve
● Not to blame people but to give them room to fix
● Not to surprise your colleagues but to make your system resilient
