2018-05-16 Geeknight Dallas - Distributed Systems Talk

Living in a Distributed World
An intro to design considerations for distributed systems

Hello!
Vishal Bardoloi
@bardoloi
Lead Dev at ThoughtWorks
medium.com/@v.bardoloi

Overview
1. A real world scenario
2. Distributed data
3. Distributed transactions
4. Client-side considerations
5. Feedback / Next Topics

<Disclaimer>
This scenario is a work of fiction. Names, characters, businesses, places, events,
locales, and incidents are either the products of the author’s imagination or used in
a fictitious manner. Any resemblance to actual persons, living or dead, or actual
events is purely coincidental.

About TicketMeister.com
Online ticket sales company based in NY, with operations in many countries
Clients: event organizers - music festivals, sports stadiums, halls, theatres etc.
TicketMeister acts as an agent, selling tickets that the clients make available, and
charging a service fee.
Clients manage their available inventory via a Client Portal
Users (ticket buyers) purchase tickets through an e-commerce portal
To prevent ticket scalping: customers must have an account with a verified email
address

TicketMeister.com: “We need a new e-commerce portal!”

Why?
● Customers complain that the current
portal is too slow, especially during
big sales events
● We’ve had incidents of inadvertent
overselling (bad customer
experience; plus, refunds need to be
manually handled)
● We’re launching in EU, Brazil, China
and South Africa soon, and are
expecting 10x traffic after go-live
● Our existing data center (running
MySQL) is barely handling the
current load

Our customers don’t love the buying experience

Our customers really don’t love the buying experience

Our customers really really don’t love the buying experience

Our customers really really really don’t love the buying experience

Your Scope: only 2 API endpoints
1. Allow customers to look up current ticket prices & availability for an event
a. Must be fast, accurate and reliable
b. Must be globally available (i.e. you can look up US events from Australia)
2. Allow customers to purchase tickets for an event
a. Add the order to the Customer’s account in our Customer Information System (CIS)
b. Take payment
c. Send a confirmation email
d. Update inventory database
e. Notify the event organizer’s systems (3rd party)
f. Log the transaction

Your Scope: only 2 API endpoints
1. Allow customers to look up current ticket prices & availability for an event
2. Allow customers to purchase ticket(s) for an event

2. Working with Distributed Data

Request #1: Look up current availability & price
MySQL DB

Problem: our primary data center in NY can’t handle the expected user load

We’d like to scale to multiple global Data Centers to handle the extra load

Define Success:
Priority #   Goal       Target Metric
1            Reliable   ~99.999% (five 9’s) uptime
2            Fast       <800ms response time
3            Accurate   No overselling
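
To make the uptime target concrete, here is a small sketch that converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; they are not from the talk.

```python
def allowed_downtime_minutes(availability_pct: float, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` by an availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common targets; five 9's leaves only a few minutes per year.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/year")
```

At five 9’s the budget is roughly five minutes per year, which is why a single data center is unlikely to meet it.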

Let’s review the requirements
Reliable
(up to five 9’s uptime)
Fast
(<500ms response time)
Accurate
(no overselling)
Global Scale
(multiple replicated data centers)

Early NoSQL systems: you had to make explicit tradeoffs

Modern alternatives give you way more flexibility

Cosmos DB provides 5 different consistency model choices
Banking systems, Payment apps, etc.
Product reviews, Social media “wall” posts, etc.
Baseball scores, Blog comments, etc.
Shopping carts, User profile updates, etc.
Flight status tracking, Package tracking, etc.

3. Working with Distributed Transactions

Request #2: API endpoint to purchase ticket(s)

In a monolith, the database provides the transaction with ACID guarantees
A - atomicity
C - consistency
I - isolation
D - durability

Distributed transactions can’t do that!

3 things to consider
1. What could fail?
2. How important is the failure?
3. If it fails, how should we respond?

What could fail?
Result: system is consistent

What could fail?
Result: system is inconsistent

If something fails, what can we do?

Option 1: Ignore
Write off the error in resource B, and
proceed as normal
Good option when:
● B is non-critical (e.g. logging metrics)
● There are decent alternatives to B
(e.g. if the customer can re-print their
confirmation email from a user portal)
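
A minimal sketch of the "ignore" option, assuming the non-critical step is an ordinary Python callable. The helper name `best_effort` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("purchase")  # hypothetical logger name


def best_effort(action, *args):
    """Run a non-critical action; log any failure and carry on.

    Returns True if the action succeeded, False if it was written off.
    """
    try:
        action(*args)
        return True
    except Exception:
        # The failure is recorded but deliberately not re-raised.
        logger.warning("non-critical step failed; continuing", exc_info=True)
        return False
```

The failure is still logged, so "ignore" here means "do not block the purchase", not "pretend it never happened".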

Option 2: Retry
If resource B fails, retry a limited number of times
Good option when:
● Retries are safe on B (i.e. the action is idempotent)
● Actions can be queued (i.e. time constraints
on “being done” are not strict)

Option 3: Undo
If resource B fails, perform an “undo”
(compensating action) on resource A
Good option when:
● An “undo” or compensating action exists
● There is no penalty for the undo
operation
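
A minimal sketch of the "undo" option, assuming resource A is a payment that can be refunded and resource B is a seat reservation. All four callables are hypothetical stand-ins passed in by the caller, not real APIs.

```python
def purchase_with_undo(charge, refund, reserve_seats, order):
    """Act on A, then on B; if B fails, compensate on A and re-raise."""
    payment_id = charge(order)      # action on resource A
    try:
        reserve_seats(order)        # action on resource B
    except Exception:
        refund(payment_id)          # compensating action on A
        raise
    return payment_id
```

The sketch assumes the refund itself succeeds. In practice the compensating action can also fail and would need its own handling.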

Option 4: Coordinate
Coordinate the 2 actions between A
and B using a separate coordinator.
Prepare, coordinate, then commit (or
rollback on failure)
Good option when:
● A reliable coordinator is available
● Action can be broken into 2
“prepare” and “commit” phases

If something fails, what can we do?
Ignore: write off the error in resource B
Retry: if resource B fails, retry on B until it succeeds
Undo: if resource B fails, undo action on A
Coordinate: 2 phases between A and B: prepare, coordinate, then commit (or rollback on failure)
Decision Matrix: options for each point of failure
Best option available?
Step Order ? Ignore Retry Undo Coordinate
CIS
Payment
Events Ctr
Email
Logging
Inventory DB

General considerations
1. Business model constraints
a. Amazon.com: inventory >> demand, can process things offline later (asynchronously)
b. TicketMeister: limited time-bound inventory, must process everything right now
2. Alternative definitions of success
a. e.g. if emailing the receipt fails: can customer self-serve and print it from a user portal?
b. e.g. if calling the Events Center API fails - is there a batch job to “true-up” failures?
c. e.g. Logging may not be considered “critical” until you realize your Disaster Recovery system
depends on the logs being accurate and complete
3. Easier to undo? Call it first!
4. Research all 3rd party APIs: assume nothing!

4. Client-side Design Considerations

How does the user see a failure?

All these failure scenarios look the same to a client
Retry is safe
Retry may be safe
Retry likely NOT safe

The client doesn’t know the server’s internal state
Retry is safe
Retry may be safe
Retry likely NOT safe

Clients (especially humans) will almost always retry
POST succeeded! +$500

Can’t we rely on the client to do the right thing?
● “Or else what?”
● “But what if my Wi-Fi goes down?”
● “Is there another way?”
● “How will I know it’s OK to bail?”
● “Should I call customer care?”
● “Should I just go on Twitter?”

Idempotency: guaranteeing “Exactly Once” semantics
Oops! You already did that action!

One option: send an Idempotency Key
The client (front end, or another API) uses a unique identifier on its end for the
“transaction” - so that retries can be safely rejected
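
A minimal server-side sketch of idempotency keys, using an in-memory dictionary as the key store. `PaymentService` and its fields are hypothetical. This version replays the stored response for a repeated key; a service could equally reject the duplicate with an error.

```python
import uuid


class PaymentService:
    """Toy payment endpoint that applies each idempotency key at most once."""

    def __init__(self):
        self._responses = {}  # idempotency key -> stored response
        self.charges = []     # side effects actually applied

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._responses:
            # Retry of a request already handled: do not charge again.
            return self._responses[idempotency_key]
        self.charges.append(amount)
        response = {"status": "charged", "amount": amount}
        self._responses[idempotency_key] = response
        return response


service = PaymentService()
key = str(uuid.uuid4())   # the client generates one key per purchase attempt
service.charge(key, 500)
service.charge(key, 500)  # client retry: same key, no second charge
print(len(service.charges))
```

A real store would need to be durable and shared across server instances, and would need to handle two requests with the same key arriving concurrently.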

Other client-side considerations: exponential backoff

Exponential backoff: prevent clients from causing Denial-of-Service on a struggling service
...

Thundering herds: Avoiding resource contention

Solution: exponential back-off with random jitter
Example: Stripe.rb client
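
A minimal Python sketch of exponential backoff with random jitter. It is not the Stripe client's implementation. The function name and the `base` and `cap` defaults are assumptions, and it uses the "full jitter" variant, where the wait is drawn uniformly between zero and the exponential ceiling.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt, up to `cap`; the actual delay
    is random within [0, ceiling], so clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


for attempt in range(5):
    print(f"attempt {attempt}: wait up to {min(30.0, 0.5 * 2 ** attempt):.1f}s, "
          f"chose {backoff_delay(attempt):.2f}s")
```

The randomness is what spreads a herd of clients out; doubling alone would still have them all return at the same instants.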

Further reading
● Two Generals Problem
● Byzantine Consensus Protocols
○ Achieving quorum
○ 2-phase commit
○ Paxos
○ Blockchain
● “Distributed Systems Observability” - Cindy Sridharan

Feedback / Ideas for next time?
● NoSQL deep dive
● Monitoring & Observability in distributed systems

3 Characteristics of a Distributed System
1. Operate concurrently
2. Can fail independently
3. Don’t share a global clock

Distributed Data: scaling
1. Reads >> Writes?
a. Read replication: 1 master, many replicated slaves
i. Broken: consistency (new: eventual consistency) - even with re
ii. Writes are still a bottleneck and will eventually overwhelm the master
b. Sharding (break up 1 write database into many based on some key)
i. Each one is read-replicated
ii. Broken: data model, completely isolated instances (can’t join tables across shards)
c. Adding indexes?
i. As writes scale up, you’ll lose the benefits of this
ii. May end up denormalizing the relational database (BAD!)
2. Why not go NoSQL anyway?
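
A minimal sketch of routing a record to a shard by hashing its key. `shard_for` is a hypothetical helper, and the event ID used as the key is an assumed example. Simple modulo placement like this forces most keys to move when the shard count changes.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# The same key always lands on the same shard.
print(shard_for("event-123", 4) == shard_for("event-123", 4))
```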

Further reading
Two Generals' Problem: https://en.wikipedia.org/wiki/Two_Generals%27_Problem
Byzantine agreement protocols:
https://en.wikipedia.org/wiki/Byzantine_fault_tolerance
https://en.wikipedia.org/wiki/Quantum_Byzantine_agreement
https://medium.com/loom-network/understanding-blockchain-fundamentals-part-1-byzantine-fault-tolerance-245f46fe8419
https://medium.com/all-things-ledger/the-byzantine-generals-problem-168553f31480
