2. $ whoami
Senior engineer @ Reliably
Live in the UK but (thanks to COVID-19) I work anywhere, any-when
Worked in tech for 12 years, in dev, ops and product management
I was once chased up a tree by a disgruntled bovine
3. SL*
SLA (service level agreement): legal chicanery that can get you sued.
SLO (service level objective): a goal used by the team operating a
service to measure its performance.
SLI (service level indicator): a measurement of activity over time
that can be used to calculate the actual performance of the service.
A service level applies to any system that has a user.
4. The SOA connection
The world of modern tech is a bowl of service-spaghetti.
A system that depends on another service becomes beholden to
that service's service level.
5. Crunching the numbers
n = (a x b) x 100
n = the aggregate service level, as a percentage
a = your own service's level, as a fraction (99.5% is 0.995)
b = the product of the service levels (SLAs) of the services you
depend upon, each as a fraction
e.g. your service has an SLO of 99.5%, and you depend on 2 services
with SLAs of 99.95% and 99% respectively.
n = (0.995 x (0.9995 x 0.99)) x 100
n = 98.46%
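The same arithmetic can be sketched in Python. This is a minimal illustration, not a tool from the talk: the function name `aggregate_service_level` is an assumption, and it treats every dependency as sitting on every request path with independent failures, as the formula above does.

```python
from math import prod


def aggregate_service_level(own_level, dependency_levels):
    """Return the aggregate service level as a percentage.

    own_level and each entry of dependency_levels are fractions,
    so 99.5% is passed as 0.995.
    """
    return own_level * prod(dependency_levels) * 100


# The worked example: an SLO of 99.5% on top of SLAs of 99.95% and 99%.
n = aggregate_service_level(0.995, [0.9995, 0.99])
print(f"{n:.2f}%")  # prints 98.46%
```

Adding another dependency to the list only ever lowers the result, which is the point of the slide.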
6. It's not that simple...
Not every path through your system may depend on every
dependency
You may be able to engineer resiliency
Not all dependencies are equal
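One way to account for the first point is to weight each request path by its share of traffic. The sketch below is an illustration only: the function name `weighted_service_level` and the 80/20 traffic split are hypothetical, and failures are assumed to be independent.

```python
from math import prod


def weighted_service_level(own_level, paths):
    """Aggregate service level (percentage) across request paths.

    paths is a list of (traffic_share, dependency_levels) pairs, where
    the traffic shares are fractions that sum to 1.
    """
    total_share = sum(share for share, _ in paths)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    # Blend the per-path availability by how much traffic takes each path.
    blended = sum(share * prod(levels) for share, levels in paths)
    return own_level * blended * 100


# Hypothetical split: 80% of requests touch only the 99.95% dependency,
# 20% touch both the 99.95% and the 99% dependency.
n = weighted_service_level(0.995, [
    (0.8, [0.9995]),
    (0.2, [0.9995, 0.99]),
])
print(f"{n:.2f}%")  # prints 99.25%
```

Under these assumed numbers the weaker dependency hurts far less than it would if every request had to pass through it.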
7. Architecting a better SLO
Handling transient errors
The impact of a retry mechanism:
Without retry: (0.999 x 0.99) x 100 = 98.9%
With up to 3 attempts, assuming failures are independent: (0.999 x (1 - (1 - 0.99)^3)) x 100 = 99.9%
Make sure you send the correct status codes - 503 FTW!!
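A retry mechanism along these lines could be sketched as follows. Everything named here is an assumption for illustration: `ServiceUnavailable` stands in for however your client reports a 503, and `call_with_retries` and `availability_with_retries` are hypothetical helpers. The availability estimate assumes each attempt fails independently, which is optimistic when an outage lasts longer than the retry window.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a dependency answers with a 503."""


def availability_with_retries(level, attempts):
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - level) ** attempts


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on ServiceUnavailable with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait 0.1s, 0.2s, 0.4s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Only errors that are safe to retry should be mapped to `ServiceUnavailable`, which is why returning an accurate status code from the dependency matters.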
8. Architecting a better SLO
Handling transactional errors
Become transactionally transparent.
user -> request a mutation -> {time passes...} -> handle request -> inform user
Super powerful, but changes the user experience.
Existing tools and patterns, such as an event bus or webhooks, can be used to achieve an
event-driven mutation.
AVOID DISTRIBUTED TRANSACTIONS WHEREVER POSSIBLE.
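The request flow above can be sketched in a single process. This is a toy under stated assumptions: `MutationService` is a hypothetical class, the in-memory queue stands in for an event bus, and the `notify` callback stands in for a webhook call to the user.

```python
import queue
import threading
import uuid


class MutationService:
    """Hypothetical in-process stand-in for an event bus plus a worker."""

    def __init__(self, handle, notify):
        self._handle = handle  # performs the actual mutation
        self._notify = notify  # informs the user, e.g. by calling a webhook
        self._pending = queue.Queue()
        threading.Thread(target=self._work, daemon=True).start()

    def request(self, payload):
        """Accept the mutation and return a ticket straight away."""
        ticket = str(uuid.uuid4())
        self._pending.put((ticket, payload))
        return ticket

    def _work(self):
        while True:
            ticket, payload = self._pending.get()
            try:
                status, detail = "done", self._handle(payload)
            except Exception as exc:  # report the failure instead of hiding it
                status, detail = "failed", str(exc)
            try:
                self._notify(ticket, status, detail)
            finally:
                self._pending.task_done()

    def wait_until_idle(self):
        """Block until every accepted request has been handled."""
        self._pending.join()
```

The caller gets a ticket immediately and learns the outcome later, which is the change in user experience the slide warns about.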
9. Architecting a better SLO
Caching
The 2 hardest things in software engineering are naming things and cache
expiration
Please utilise Cache-Control headers
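As a small illustration, a Cache-Control value could be assembled like this. The helper name `cache_control` and the chosen lifetimes are assumptions; `max-age` and `stale-if-error` are standard directives, though support for `stale-if-error` varies between caches.

```python
def cache_control(max_age, shared=True, stale_if_error=None):
    """Build a Cache-Control header value; durations are in seconds."""
    if max_age < 0:
        raise ValueError("max_age must not be negative")
    directives = ["public" if shared else "private", f"max-age={max_age}"]
    if stale_if_error is not None:
        # Lets a cache keep serving a stale copy while the origin is failing.
        directives.append(f"stale-if-error={stale_if_error}")
    return ", ".join(directives)


# Hypothetical policy: fresh for 5 minutes, usable for a day if the origin is down.
headers = {"Cache-Control": cache_control(300, stale_if_error=86400)}
print(headers["Cache-Control"])  # prints public, max-age=300, stale-if-error=86400
```

A cached response that can still be served during an outage means a dependency's downtime does not have to become yours.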
10. And now, the end is near...
In a world of networked services, no service level stands alone.
Try to build your services in such a way that your users are given the best
opportunity to use your service well.
Don't be afraid of downtime - no system can offer 100% uptime, and it's
exceptionally expensive to try!