
2018-05-16 Geeknight Dallas - Distributed Systems Talk


Note: with deep gratitude to Jimmy Bogard for teaching me many of these concepts.

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." - Leslie Lamport

We are increasingly living in the world of distributed systems - microservices, 3rd party APIs, NoSQL databases. But many developers haven't made the paradigm shift from the way we used to build monoliths. Not taking into account failure handling, code coupling, data latency, etc. can make our systems fragile and our customers unhappy on a massive scale. Come take a deep dive into the process of designing for a distributed world.

Published in: Software

  1. 1. Living in a Distributed World An intro to design considerations for distributed systems
  2. 2. Hello! Vishal Bardoloi @bardoloi Lead Dev at ThoughtWorks
  3. 3. Overview 1. A real world scenario 2. Distributed data 3. Distributed transactions 4. Client-side considerations 5. Feedback / Next Topics
  4. 4. 1. A Real World Scenario
  5. 5. <Disclaimer> This scenario is a work of fiction. Names, characters, businesses, places, events, locales, and incidents are either the products of the author’s imagination or used in a fictitious manner. Any resemblance to actual persons, living or dead, or actual events is purely coincidental.
  6. 6. Case study:
  7. 7. About: Online ticket sales company based in NY, with operations in many countries. Clients are event organizers: music festivals, sports stadiums, halls, theatres etc. TicketMeister acts as an agent, selling tickets that the clients make available and charging a service fee. Clients manage their available inventory via a Client Portal. Users (ticket buyers) purchase tickets through an e-commerce portal. To prevent ticket scalping, customers must have an account with a verified email address.
  8. 8. “We need a new e-commerce portal!”
  9. 9. Why? ● Customers complain that the current portal is too slow, especially during big sales events ● We’ve had incidents of inadvertent overselling (bad customer experience; plus, refunds need to be manually handled) ● We’re launching in EU, Brazil, China and South Africa soon, and are expecting 10x traffic after go-live ● Our existing data center (running MySQL) is barely handling the current load
  10. 10. Our customers don’t love the buying experience
  11. 11. Our customers really don’t love the buying experience
  12. 12. Our customers really really don’t love the buying experience
  13. 13. Our customers really really really don’t love the buying experience
  14. 14. Your Scope: only 2 API endpoints 1. Allow customers to look up current ticket prices & availability for an event a. Must be fast, accurate and reliable b. Must be globally available (i.e. you can look up US events from Australia) 2. Allow customers to purchase tickets for an event a. Add the order to the Customer’s account in our Customer Information System (CIS) b. Take payment c. Send a confirmation email d. Update inventory database e. Notify the event organizer’s systems (3rd party) f. Log the transaction
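The purchase endpoint above can be sketched as a naive sequential pipeline, which is useful later when asking what happens if any one step fails. Every function here is a hypothetical stub, not a real TicketMeister API:

```python
# Naive sequential sketch of the purchase endpoint (steps a-f).
# All step functions are illustrative stubs.

def cis_add_order(customer_id, event_id, qty):        # (a) Customer Information System
    return {"customer": customer_id, "event": event_id, "qty": qty}

def take_payment(order):            return "payment-ok"   # (b)
def send_confirmation_email(order): pass                  # (c)
def update_inventory(event, delta): pass                  # (d)
def notify_event_organizer(event, order): pass            # (e)
def log_transaction(order, payment): pass                 # (f)

def purchase_tickets(customer_id, event_id, qty):
    order = cis_add_order(customer_id, event_id, qty)
    payment = take_payment(order)
    send_confirmation_email(order)
    update_inventory(event_id, -qty)
    notify_event_organizer(event_id, order)
    log_transaction(order, payment)
    return order
```

Each call may hit a different machine, database, or 3rd-party system, and each can fail independently.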
  15. 15. Easy job for you, SuperDev!
  16. 16. Questions so far?
  17. 17. 1. Allow customers to look up current ticket prices & availability for an event 2. Allow customers to purchase ticket(s) for an event Your Scope: only 2 API endpoints
  18. 18. 2. Working with Distributed Data
  19. 19. Request #1: Look up current availability & price MySQL DB
  20. 20. Problem: our primary data center in NY can’t handle the expected user load
  21. 21. We’d like to scale to multiple global Data Centers to handle the extra load
  22. 22. Define Success: Priority # Goal Target Metric 1 Reliable ~99.999% (five 9’s) uptime 2 Fast <800ms response time 3 Accurate No overselling
  23. 23. Let’s review the requirements: Reliable (up to five 9’s uptime), Fast (<800ms response time), Accurate (no overselling), Global Scale (multiple replicated data centers)
  24. 24. Can’t do it: pick 2 out of 3 (the CAP theorem)
  25. 25. Can NoSQL help?
  26. 26. <link: Distributed Databases talk slides>
  27. 27. Early NoSQL systems: you had to make explicit tradeoffs
  28. 28. Modern alternatives give you way more flexibility
  29. 29. Example: Azure Cosmos DB
  30. 30. Cosmos DB provides 5 different consistency model choices Banking systems, Payment apps, etc. Product reviews, Social media “wall” posts, etc. Baseball scores, Blog comments, etc. Shopping carts, User profile updates, etc. Flight status tracking, Package tracking, etc.
  31. 31. Questions so far?
  32. 32. 3. Working with Distributed Transactions
  33. 33. Request #2: API endpoint to purchase ticket(s)
  34. 34. In a monolith, the database provides the transaction with ACID guarantees A - atomicity C - consistency I - isolation D - durability
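The monolith case can be demonstrated with a single local database: either every write in the transaction commits, or none do. A minimal sketch using Python's built-in sqlite3:

```python
import sqlite3

# One atomic transaction covers the whole unit of work: if anything fails
# mid-way, both writes are rolled back together (atomicity).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.execute("CREATE TABLE inventory (event TEXT, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('show', 10)")
conn.commit()

try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders VALUES (1)")
        conn.execute("UPDATE inventory SET qty = qty - 1 WHERE event = 'show'")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# Neither write survived: no order was recorded, inventory is untouched.
qty = conn.execute("SELECT qty FROM inventory").fetchone()[0]
orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Once the order table and the inventory live in different systems, no single database transaction can give you this guarantee.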
  35. 35. Distributed transactions can’t do that!
  36. 36. 3 things to consider 1. What could fail? 2. How important is the failure? 3. If it fails, how should we respond?
  37. 37. What could fail?
  38. 38. What could fail? Result: system is consistent
  39. 39. What could fail? Result: system is inconsistent
  40. 40. What could fail? Result: system is inconsistent
  41. 41. What could fail? Result: system is inconsistent
  42. 42. What could fail? Result: system is inconsistent
  43. 43. What could fail? Result: system is inconsistent
  44. 44. If something fails, what can we do?
  45. 45. Option 1: Ignore. Write off the error in resource B and proceed as normal. Good option when: ● B is non-critical (e.g. logging metrics) ● There are decent alternatives to B (e.g. the customer can re-print their confirmation email from a user portal)
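A minimal sketch of the Ignore option, with illustrative names: the failure of a non-critical resource is swallowed so the main flow still succeeds.

```python
# Option 1 (Ignore): write off a failure in a non-critical resource B.
# log_metrics / purchase are hypothetical names for illustration.

def log_metrics(order):
    raise ConnectionError("metrics service is down")  # B fails

def purchase(order):
    # ... critical steps (payment, inventory) succeed here ...
    try:
        log_metrics(order)      # non-critical resource B
    except ConnectionError:
        pass                    # write off the error and proceed as normal
    return "confirmed"
```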
  46. 46. Option 2: Retry. If resource B fails, retry a fixed number of times. Good option when: ● Retries are safe** on B ● Actions can be queued (i.e. time constraints on “being done” are not strict)
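A sketch of the Retry option, assuming the call is safe to repeat: attempt a fixed number of times, backing off between attempts, and re-raise only after the last failure.

```python
import time

# Option 2 (Retry): retry a fixed number of times. Only appropriate when
# retries are safe (idempotent) on resource B.

def with_retries(action, attempts=3, delay=0.01):
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts:
                raise            # give up after the final attempt
            time.sleep(delay)    # wait before retrying

# A stand-in for resource B that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"
```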
  47. 47. Option 3: Undo If resource B fails, perform an “undo” (compensating action) on resource A Good option when: ● An “undo” or compensating action exists ● There is no penalty for the undo operation
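The Undo option can be sketched with a compensating action: if step B fails after step A has succeeded, run A's inverse. The charge/refund names are illustrative.

```python
# Option 3 (Undo): if resource B fails, run a compensating action on A.

def charge(payments, amount):
    payments.append(amount)

def refund(payments, amount):       # the compensating action for charge()
    payments.append(-amount)

def reserve_seat(inventory):
    raise RuntimeError("no seats left")   # B fails

def purchase(payments, inventory, amount=100):
    charge(payments, amount)              # A succeeds
    try:
        reserve_seat(inventory)           # B fails...
    except RuntimeError:
        refund(payments, amount)          # ...so undo A
        return "failed-and-compensated"
    return "confirmed"
```

Note the compensation is a new, second action, not a true rollback: the customer may still see a charge and a refund on their statement.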
  48. 48. Option 4: Coordinate Coordinate the 2 actions between A and B using a separate coordinator. Prepare, coordinate, then commit (or rollback on failure) Good option when: ● A reliable coordinator is available ● Action can be broken into 2 “prepare” and “commit” phases
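A minimal two-phase-commit sketch of the Coordinate option. The coordinator asks every participant to prepare; only if all vote yes does it tell them to commit, otherwise everyone rolls back. The Participant class is illustrative:

```python
# Option 4 (Coordinate): a toy two-phase commit. Real coordinators must also
# survive their own crashes (durable logs, timeouts), omitted here.

class Participant:
    def __init__(self, can_prepare=True):
        self.can_prepare = can_prepare
        self.state = "idle"
    def prepare(self):                    # phase 1: vote yes/no
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare
    def commit(self):   self.state = "committed"
    def rollback(self): self.state = "rolled-back"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):   # phase 1
        for p in participants:                    # phase 2
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled-back"
```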
  49. 49. Ignore: write off the error in resource B If something fails, what can we do? Retry: if resource B fails, retry on B until it succeeds Coordinate: 2 phases between A and B: prepare, coordinate, then commit (or rollback on failure) Undo: if resource B fails, undo action on A
  50. 50. Decision Matrix: options for each point of failure Best option available? Step Order ? Ignore Retry Undo Coordinate CIS Payment Events Ctr Email Logging Inventory DB
  51. 51. General considerations 1. Business model constraints a. inventory >> demand, can process things offline later (asynchronously) b. TicketMeister: limited time-bound inventory, must process everything right now 2. Alternative definitions of success a. e.g. if emailing the receipt fails: can customer self-serve and print it from a user portal? b. e.g. if calling the Events Center API fails - is there a batch job to “true-up” failures? c. e.g. Logging may not be considered “critical” until you realize your Disaster Recovery system depends on the logs being accurate and complete 3. Easier to undo? Call it first! 4. Research all 3rd party APIs: assume nothing!
  52. 52. 4. Client-side Design Considerations
  53. 53. How does the user see a failure?
  54. 54. All these failure scenarios look the same to a client Retry is safe Retry may be safe Retry likely NOT safe
  55. 55. The client doesn’t know the server’s internal state Retry is safe Retry may be safe Retry likely NOT safe
  56. 56. Clients (especially humans) will almost always retry POST succeeded! +$500 POST succeeded! +$500
  57. 57. Can’t we rely on the client to do the right thing? ● “Or else what?” ● “But what if my Wi-Fi goes down?” ● “Is there another way?” ● “How will I know it’s OK to bail?” ● “Should I call customer care?” ● “Should I just go on Twitter?”
  58. 58. Idempotency: guaranteeing “Exactly Once” semantics POST succeeded! +$500 Oops! You already did that action!
  59. 59. One option: send an Idempotency Key The client (front end, or another API) uses a unique identifier on its end for the “transaction” - so that retries can be safely rejected
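A server-side sketch of idempotency-key handling: the first request with a given key performs the action, and any retry with the same key replays the stored result instead of charging twice. The in-memory dict stands in for what would need to be a durable store:

```python
# Idempotency keys: the server remembers results by key, so client retries
# are safe. 'processed' is a stand-in for a durable store (assumption).

processed = {}

def handle_payment(idempotency_key, amount, balance):
    if idempotency_key in processed:
        return processed[idempotency_key]   # replay: no double charge
    balance["total"] += amount              # perform the action exactly once
    result = {"status": "succeeded", "amount": amount}
    processed[idempotency_key] = result
    return result
```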
  60. 60. Other client-side considerations: exponential backoff
  61. 61. Exponential backoff: prevent clients from causing Denial-of-Service on a struggling service ...
  62. 62. Thundering herds: Avoiding resource contention
  63. 63. Solution: exponential back-off with random jitter Example: Stripe.rb client
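The back-off-with-jitter idea above can be sketched in a few lines (full jitter variant): the delay grows exponentially with each attempt, capped at a maximum, and the random spread keeps clients that failed together from all retrying together.

```python
import random

# Exponential back-off with full jitter: sleep a random duration between 0 and
# min(cap, base * 2**attempt). Parameter values here are illustrative.

def backoff_delay(attempt, base=0.5, cap=30.0):
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```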
  64. 64. Further reading ● Two Generals Problem ● Byzantine Consensus Protocols ○ Achieving quorum ○ 2-phase commit ○ Paxos ○ Blockchain ● “Distributed Systems Observability” - Cindy Sridharan
  65. 65. Thank You! Questions?
  66. 66. Feedback / Ideas for next time? ● NoSQL deep dive ● Monitoring & Observability in distributed systems
  67. 67. APPENDIX
  68. 68. 3 Characteristics of a Distributed System 1. Operate concurrently 2. Can fail independently 3. Don’t share a global clock
  69. 69. Distributed Data: scaling 1. Reads >> Writes? a. Read replication: 1 master, many replicated slaves i. Broken: consistency (new: eventual consistency) - even with re ii. Writes are still a bottleneck and will overwhelm the master b. Sharding (break up 1 write database into many, based on some key) i. Each one is read-replicated ii. Broken: data model; completely isolated instances (can’t join tables across shards) c. Adding indexes? i. As writes scale up, you’ll lose the benefits of this ii. May end up denormalizing the relational database (BAD!) 2. Why not go NoSQL anyway?
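The sharding idea above can be sketched as key-based routing: hash the shard key and route each read/write to one of N databases. This is exactly the trade-off noted on the slide, since rows on different shards can no longer be joined.

```python
import hashlib

# Key-based sharding sketch: a stable hash of the shard key picks the database.
# md5 is used only because it is stable across processes, not for security.

def shard_for(key, num_shards=4):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards
```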
  70. 70. Further reading: Byzantine Generals Problem; Byzantine agreement protocols (e-245f46fe8419)