Getting all the 99.(9) you always wanted
@mitemitreski

Any software running on the internet has some expectations of uptime. Many having millions of users can have high availability expectation. While running an application looks like a simple task, constantly evolving software, ever-changing clients and various request peaks can sure cause some headache. This talk will be focusing on the downsides and lessons learned from running and developing high uptime system and show you how you could also get the 99.999… uptime. We will also do some showing of concrete examples in various technologies like Docker, AWS and Java.

Published in: Engineering

  1. Getting all the 99.(9) you always wanted @mitemitreski
  2. What is Klarna?
  3. What will I talk about? ● The need for the 99.(9) ● The road to get there ● Common failures
  4. Number of active merchants Why do we need to do this? Orders placed daily via Klarna services 70 000 300 000 60 million Klarna customers 10 000 active users at any given time NOTE: Numbers are for illustration purposes only and are not accurate day-to-day numbers
  5. Impact of a production failure http://status.klarna.com/
  6. Consumer impact of outage
  7. Merchant impact of outage
  8. Internal impact of outage
  9. How to get all the 99.(9) 1. Elimination of single points of failure 2. Reliable crossover. 3. Detection of failures as they occur.
  10. Credits http://roguedudes.github.io
  11. Service ownership is key Decentralised - "You build it, you run it" model Specialised - Narrow focus leads to better knowledge of details Preventative - Move away from “quick patch job” mentality
  12. Release strategies
  13. Sample setup with reliable crossover
  14. CloudFormation Rolling release
  15. CloudFormation Rolling release
  16. Canary release
  17. Original setup
  18. Active Passive AS
  19. Active Passive AS
  20. Active Passive AS
  21. Feature toggle
  22. API Simplifications
  23. Coupling
  24. URI coupling
  25. Linked Service First Second
  26. class Customer { String name; } { "name" : "Alice" }
  27. { "name": "Alice", "links": [ { "rel": "self", "href": "http://localhost:8080/customer/1" } ] }
  28. Richardson maturity model Level 3: Hypermedia Controls Level 2: HTTP Verbs Level 1: Resources Level 0: The Swamp of Plain Old XML
  29. Service registry Registry Service Provider Client
  30. Temporal coupling
  31. Asynchronous Response Handler
  32. Request/Acknowledge ACK queue request processor Fast producing Slow
  33. Data structure and function coupling ● date time ● unicode or not ● ISO * ● number format ● phone number format
  34. Various Service patterns
  35. Stability patterns
  36. Circuit breakers
  37. Failing server Fallback Client with Circuit breaker
  38. Handshake
  39. A fast system should not override a slow one
  40. Bulkheads
  41. Bulkheads Connection pool Failing server Good server
  42. Bulkheads Connection pool Connection pool Failing server Good server
  43. Monitoring
  44. Traceability
  45. Use tracking ID Clojure Spring App NodeJS ID : 223-305
  46. Use tracking ID Clojure Spring App NodeJS ID : 223-305 ID : 223-305
  47. Use tracking ID Clojure Spring App NodeJS ID : 223-305 ID : 223-305 ID : 223-305
  48. Example tracking ID: 25cb4264-8064-4149-9922-df4477723102
  49. Detection of failures as they occur.
  50. Monks - internal health checks
  51. Monks - internal health checks
  52. External checks Internet facing service USA Europe Asia South America
  53. External checks are the real checks Pingdom Uptime
  54. Time series event monitoring Aggregation layer
  55. Time series event monitoring
  56. Problem 1: Seasonal changes
  57. Problem 2: Looking at average
  58. Problem 2: Looking at median
  59. Problem 2: Another distribution
  60. Percentiles for the win
  61. Percentiles for the win
  62. Future of time series monitoring
  63. Summary 1. Elimination of single points of failure 2. Reliable crossover. 3. Detection of failures as they occur.
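
The "Circuit breakers" slides name a client, a failing server, and a fallback, but the transcript carries no code. The following is a minimal sketch of that idea in Java, not the speaker's implementation. The class name `CircuitBreaker`, the failure threshold, and the open interval are all assumptions made for illustration. After a configurable number of consecutive failures the breaker opens and answers from the fallback without touching the failing server; once the open interval has passed, it lets one trial call through.

```java
import java.util.function.Supplier;

/** Illustrative circuit breaker; names and thresholds are hypothetical. */
class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to stay open before a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /**
     * Runs remoteCall unless the breaker is open, in which case the fallback
     * is returned immediately. Holding the lock during the remote call keeps
     * the sketch short; a production breaker would avoid that.
     */
    synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        long now = System.currentTimeMillis();
        if (state == State.OPEN) {
            if (now - openedAt < openMillis) {
                return fallback.get();   // fail fast, spare the failing server
            }
            state = State.HALF_OPEN;     // interval elapsed: allow one trial call
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;     // success closes the breaker again
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = now;
            }
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return state == State.OPEN;
    }
}
```

A caller would wrap each remote request, for example `breaker.call(() -> client.fetch(id), () -> cachedValue)`, where `client` and `cachedValue` are placeholders for whatever the service already has.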

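The monitoring slides contrast the average and the median with percentiles. As a rough illustration of why that matters, the sketch below computes the mean and nearest-rank percentiles over a set of latency samples. The class name `LatencyStats` and the sample data are invented for this example; the nearest-rank method is one common percentile definition and is not necessarily the one used in the talk.

```java
import java.util.Arrays;

/** Illustrative latency statistics; class name and sample data are hypothetical. */
final class LatencyStats {

    /** Arithmetic mean of the samples, or NaN for an empty array. */
    static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Made-up data: 95 fast requests (20 ms) and 5 slow ones (2000 ms).
        long[] latencies = new long[100];
        Arrays.fill(latencies, 0, 95, 20L);
        Arrays.fill(latencies, 95, 100, 2000L);

        System.out.println("mean = " + mean(latencies));
        System.out.println("p50  = " + percentile(latencies, 50));
        System.out.println("p99  = " + percentile(latencies, 99));
    }
}
```

With this made-up data the mean is 119 ms, a latency no request actually experienced. The median is 20 ms, which hides the slow requests entirely, and only the 99th percentile (2000 ms) shows what the slowest users see.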