Serverless Days Helsinki 2019 Rolf Koski - Business Driven Availability

This talk looks at the issues at play when operating systems on public clouds. It should get you thinking about why service levels should not be treated as just a sequence of 9s, how to take a more holistic approach, and how to invest the right amount in resilience before going live and running in production. It is equally important to understand the human element, which is where most errors occur in any case, and to minimize both the occurrence and the impact of human error. The key takeaway is that everything can and will eventually fail, and that your design should handle those situations gracefully.

Published in: Internet

  1. Rolf Koski – Business driven availability, April 25 2019
  2. Who am I • Rolf Koski • CTO, Cybercom AWS Business Group • rolf.koski@cybercom.com • rolle • therolle • “Guy with the sticker” • Cloud Advisor & Evangelist • Community Leader • AWS Partner Ambassador • Well-Architected Lead
  3. Why SLAs are not an excuse for poor architectures • Disclaimer: this presentation raises more questions than it answers…
  4. Everything Fails (so if you think you can have 100%, you are lying to yourself)
  5. SL(A)? Objective vs. Agreement
  6. Quick introduction to availability arithmetic
  7. Aggregate availability – Series • 99% and 99% give ~98%
  8. Aggregate availability – Parallel • 99% and 99% give 99.99%
  9. Aggregate availability – Combination • four 99% components give 99.98%
  10. Aggregate availability – Partial failure • four 99% components, 99.98%, 20% failing
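
The aggregate figures on slides 7 to 9 can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes component failures are independent. The function names are illustrative, and the topology used for the combination case (two redundant pairs in series) is an assumption that happens to match the 99.98% on the slide. The partial-failure case on slide 10 is not modeled here.

```python
def series(*availabilities: float) -> float:
    """Every component must work, so the availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """At least one component must work: 1 minus the chance that all fail."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= 1.0 - a
    return 1.0 - all_fail


a = 0.99
print(f"series:      {series(a, a):.4%}")    # 98.0100%
print(f"parallel:    {parallel(a, a):.4%}")  # 99.9900%

# Assumed topology: two redundant pairs, chained in series.
combo = series(parallel(a, a), parallel(a, a))
print(f"combination: {combo:.4%}")           # 99.9800%
```

Chaining components lowers availability, while redundancy raises it, which is why the series result lands below 99% and the parallel result above it.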
  11. Function execution availability (chart: parallel execution over time)
  12. Function execution availability (chart: parallel execution over time; labels: Fail, 95%?)
  13. Function execution availability (chart: parallel execution over time; labels: Fail, 95%?, Retry, Success)
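
One way to read slides 12 and 13 is that a single execution succeeding 95% of the time does not cap the availability of the function if failed executions are retried. The sketch below assumes that the 95% is a per-attempt success rate and that attempts fail independently. Both are assumptions, and real failures are often correlated.

```python
def success_with_retries(p_success: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - p_success) ** attempts


for attempts in (1, 2, 3):
    print(attempts, f"{success_with_retries(0.95, attempts):.4%}")
# 1 95.0000%
# 2 99.7500%
# 3 99.9875%
```

The gain is paid for in latency, since every retry extends the time until the caller sees a result.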
  14. Service Level is not just nines
  15. Service Level is not just nines • What service is provided • How it is supported • During which time the service is to be provided • What performance is to be expected • What the responsibilities of the parties to the agreement are
  16. The Serverless Promise • Built-in availability & fault tolerance
  17. The Serverless Promise • Built-in availability & fault tolerance • What about cold starts? • What about endless retries or multiple executions? • What about “dead letters”? • What about timeouts? • What about running out of memory?
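
Two of the questions on slide 17, endless retries and multiple executions, can be illustrated with a small in-memory sketch: retries are bounded, events that keep failing are set aside as dead letters, and duplicate deliveries are skipped by event id. All names here are hypothetical, and this is not how any particular cloud service implements these features.

```python
from typing import Callable


def process_with_dlq(
    events: list[dict],
    handler: Callable[[dict], None],
    max_attempts: int = 3,
) -> tuple[set[str], list[dict]]:
    """Run `handler` on each event, trying at most `max_attempts` times.

    Returns the ids that were processed and the events that were dead-lettered.
    """
    processed: set[str] = set()    # idempotency record, in memory only
    dead_letters: list[dict] = []  # events that exhausted their attempts
    for event in events:
        if event["id"] in processed:
            continue  # duplicate delivery: do not execute twice
        for attempt in range(1, max_attempts + 1):
            try:
                handler(event)
                processed.add(event["id"])
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letters.append(event)
    return processed, dead_letters


calls: list[str] = []


def flaky_handler(event: dict) -> None:
    """Hypothetical handler that always fails for event "b"."""
    calls.append(event["id"])
    if event["id"] == "b":
        raise RuntimeError("permanent failure")


done, dlq = process_with_dlq([{"id": "a"}, {"id": "a"}, {"id": "b"}], flaky_handler)
print(done, dlq, calls)
```

Event "a" runs once even though it is delivered twice, and event "b" is attempted three times before it ends up in the dead-letter list. A production system would need a durable idempotency store, because a handler can fail after it has already produced side effects.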
  18. SLA Credits Suck (and they have no real business value whatsoever)
  19. Example: S3 SLA • Monthly Uptime Percentage equal to or greater than 99.0% but less than 99.9%: Service Credit Percentage 10% • Monthly Uptime Percentage less than 99.0%: Service Credit Percentage 25% • In literal terms: for 1 TB of data that was unavailable for up to 7 hours and 12 minutes, you get service credits worth $2.34
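
The 7 hours and 12 minutes on slide 19 is 1% of a 30-day month, which is the longest outage that still falls in the 10% tier. The sketch below assumes a 30-day month and treats uptime as wall-clock time, as the slide does. The actual SLA text should be consulted for the exact definition. The tiers are the two shown on the slide, the monthly charge is back-calculated from the $2.34 rather than taken from a price list, and the function names are illustrative.

```python
from fractions import Fraction


def monthly_uptime_pct(downtime_minutes: int, days_in_month: int = 30) -> Fraction:
    """Uptime percentage for the month, computed exactly."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (1 - Fraction(downtime_minutes, total_minutes))


def credit_pct(uptime_pct: Fraction) -> int:
    """Service credit percentage for the two tiers shown on the slide."""
    if uptime_pct < 99:
        return 25
    if uptime_pct < Fraction(999, 10):
        return 10
    return 0


downtime = 7 * 60 + 12                 # 432 minutes
uptime = monthly_uptime_pct(downtime)  # exactly 99
tier = credit_pct(uptime)              # 10

# Monthly charge implied by a $2.34 credit at that tier.
implied_monthly_charge = Fraction("2.34") / Fraction(tier, 100)
print(float(uptime), tier, float(implied_monthly_charge))  # 99.0 10 23.4
```

Exact fractions are used because 99.0% sits precisely on a tier boundary, where floating-point rounding could flip the result.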
  20. The Cost of Availability (and when enough is enough)
  21. Total Cost of Service Level (chart: Cost against Number of 9’s; curves: Cost of breach, Cost of service level target)
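
The chart on slide 21 can be read as a sum of two curves: the cost of reaching a target rises with each added 9, while the expected cost of breaches falls. The sketch below shows how one might look for the cheapest point. Every number in it is hypothetical and chosen only to show the shape of the trade-off. None of the figures come from the talk.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_hours(availability: float) -> float:
    """Expected yearly downtime for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def total_cost(availability: float, cost_of_target: float,
               downtime_cost_per_hour: float) -> float:
    """Cost of reaching the target plus the expected cost of downtime."""
    return cost_of_target + expected_downtime_hours(availability) * downtime_cost_per_hour


# Hypothetical inputs: yearly cost of achieving each target, and cost per hour down.
DOWNTIME_COST_PER_HOUR = 1_000.0
TARGET_COSTS = {0.99: 10_000.0, 0.999: 30_000.0, 0.9999: 90_000.0, 0.99999: 270_000.0}

totals = {a: total_cost(a, c, DOWNTIME_COST_PER_HOUR) for a, c in TARGET_COSTS.items()}
best = min(totals, key=totals.get)
print(best)  # 0.999 with these made-up numbers
```

With different inputs the minimum moves, which is the point of the slide: the right number of 9s depends on what downtime actually costs the business.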
  22. So, how to decide what to optimize?
  23. Analyze & classify
  24. Analysis • How much is loss or corruption of data worth to you? • How much is downtime worth to you? • How much is a malicious breach worth to you? • How much is your public image worth to you? • How much are you willing to invest in advance? • How much are you willing to set aside for corrective action? • How much risk are you willing to accumulate with regard to legislation, compliance and similar?
  25. Classification • Business criticality • Data privacy / confidentiality • Availability • Consistency • Resiliency • Original or derivative
  26. Not everything is equal
  27. Your most valuable availability metric is probably not in %
  28. Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
  29. It’s actually not IF it works, but HOW it works
  30. Some real advice
  31. Some real advice • Automation and deployment pipeline • Infrastructure as Code • Versioning and ability to roll back • Deployment scenarios (A/B, B/G, Canary) • Immutable and stateless • Origin data vs. recomputable data • Feature flags and support for partial failure • Throttling and DLQs • Multi-AZ, multi-region • Monitoring: shallow & deep
  32. Resilient Design
  33. Resilient Design • People • Application implementation • Network & Data architecture • Infrastructure
  34. Humans fail too. (Actually, more than you’d like)
  35. Who is responsible in the Cloud? (It’s You)
