Living in a Distributed World
An intro to design considerations for distributed systems
Hello!
Vishal Bardoloi
@bardoloi
Lead Dev at ThoughtWorks
medium.com/@v.bardoloi
Overview
1. A real world scenario
2. Distributed data
3. Distributed transactions
4. Client-side considerations
5. Feedback / Next Topics
1. A Real World Scenario
<Disclaimer>
This scenario is a work of fiction. Names, characters, businesses, places, events,
locales, and incidents are either the products of the author’s imagination or used in
a fictitious manner. Any resemblance to actual persons, living or dead, or actual
events is purely coincidental.
Case study: TicketMeister.com
About TicketMeister.com
Online ticket sales company based in NY, with operations in many countries
Clients: event organizers - music festivals, sports stadiums, halls, theatres etc.
TicketMeister acts as an agent, selling tickets that the clients make available, and
charging a service fee.
Clients manage their available inventory via a Client Portal
Users (ticket buyers) purchase tickets through an e-commerce portal
To prevent ticket scalping: customers must have an account with a verified email
address
TicketMeister.com: “We need a new e-commerce portal!”
Why?
● Customers complain that the current
portal is too slow, especially during
big sales events
● We’ve had incidents of inadvertent
overselling (bad customer
experience; plus, refunds need to be
manually handled)
● We’re launching in the EU, Brazil, China
and South Africa soon, and are
expecting 10x traffic after go-live
● Our existing data center (running
MySQL) is barely handling the
current load
Our customers don’t love the buying experience
Our customers really don’t love the buying experience
Our customers really really don’t love the buying experience
Our customers really really really don’t love the buying experience
Your Scope: only 2 API endpoints
1. Allow customers to look up current ticket prices & availability for an event
a. Must be fast, accurate and reliable
b. Must be globally available (i.e. you can look up US events from Australia)
2. Allow customers to purchase tickets for an event
a. Add the order to the Customer’s account in our Customer Information System (CIS)
b. Take payment
c. Send a confirmation email
d. Update inventory database
e. Notify the event organizer’s systems (3rd party)
f. Log the transaction
Easy job for you, SuperDev!
Questions so far?
1. Allow customers to look up current ticket prices & availability for an event
2. Allow customers to purchase ticket(s) for an event
Your Scope: only 2 API endpoints
2. Working with Distributed Data
Request #1: Look up current availability & price
MySQL DB
Problem: our primary data center in NY can’t handle the expected user load
We’d like to scale to multiple global Data Centers to handle the extra load
Define Success:
Priority #   Goal       Target Metric
1            Reliable   ~99.999% (five 9’s) uptime
2            Fast       <800ms response time
3            Accurate   No overselling
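To make the reliability target concrete, the short calculation below converts an uptime percentage into a yearly downtime budget. It is only a sketch: the function name and the 365-day year are assumptions made for illustration.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 365) -> float:
    """Return the downtime (in minutes) allowed over `days` at the given uptime %."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At five 9’s the budget is a little over five minutes per year, which leaves almost no room for a single data center to be unavailable.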
Let’s review the requirements
Reliable
(up to five 9’s uptime)
Fast
(<500ms response time)
Accurate
(no overselling)
Global Scale
(multiple replicated data centers)
Can’t Do It
Pick 2 out of 3
Can NoSQL help?
<link: Distributed Databases talk slides>
Early NoSQL systems: you had to make explicit tradeoffs
Modern alternatives give you way more flexibility
Example: Azure Cosmos DB
Cosmos DB provides 5 different consistency model choices
Banking systems, Payment apps, etc.
Product reviews, Social media “wall” posts, etc.
Baseball scores, Blog comments, etc.
Shopping carts, User profile updates, etc.
Flight status tracking, Package tracking, etc.
Questions so far?
3. Working with Distributed Transactions
Request #2: API endpoint to purchase ticket(s)
In a monolith, the database provides the transaction with ACID guarantees
A - atomicity
C - consistency
I - isolation
D - durability
Distributed transactions can’t do that!
3 things to consider
1. What could fail?
2. How important is the failure?
3. If it fails, how should we respond?
What could fail?
What could fail?
Result: system is consistent
What could fail?
Result: system is inconsistent
What could fail?
Result: system is inconsistent
What could fail?
Result: system is inconsistent
What could fail?
Result: system is inconsistent
What could fail?
Result: system is inconsistent
If something fails, what can we do?
Option 1: Ignore
Write off the error in resource B, and
proceed as if normal
Good option when:
● B is non-critical (e.g. logging metrics)
● There are decent alternatives to B
(e.g. if customer can re-print their
confirmation email from a user portal)
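The "Ignore" option can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `send_confirmation_email` and `complete_purchase` are hypothetical names, and the email step is hard-coded to fail so the pattern is visible.

```python
import logging

logger = logging.getLogger("purchase")


def send_confirmation_email(order_id: str) -> None:
    """Hypothetical non-critical step (resource B); always fails in this sketch."""
    raise ConnectionError("mail service unavailable")


def complete_purchase(order_id: str) -> dict:
    """Confirm the order even when the non-critical email step fails."""
    email_sent = True
    try:
        send_confirmation_email(order_id)
    except Exception as exc:
        # Ignore: record the failure for later follow-up and proceed as normal.
        logger.warning("confirmation email failed for %s: %s", order_id, exc)
        email_sent = False
    return {"order_id": order_id, "status": "confirmed", "email_sent": email_sent}
```

Writing off the error does not have to mean losing it: logging the failure keeps open the alternative route, such as letting the customer re-print the confirmation from a user portal.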
Option 2: Retry
If resource B fails, retry a limited number of times
Good option when:
● Retries are safe on B (i.e. the action is idempotent)
● Actions can be queued (i.e. time constraints
on “being done” are not strict)
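A bounded retry might look like the sketch below. The `retry` helper and the `FlakyInventory` class are hypothetical and exist only for illustration. The reserve call is written so that repeating it has the same effect as doing it once, which is what makes the retry safe.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Call `action` up to `attempts` times, re-raising if every attempt fails."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


class FlakyInventory:
    """Hypothetical resource B that fails a few times before succeeding."""

    def __init__(self, failures_before_success: int = 2) -> None:
        self.failures_left = failures_before_success
        self.calls = 0
        self.seats = {}

    def reserve(self, seat: str, order_id: str) -> str:
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("inventory DB timed out")
        # Assigning the same seat to the same order twice changes nothing,
        # so repeating this call is safe.
        self.seats[seat] = order_id
        return "reserved"
```

For example, `retry(lambda: inventory.reserve("A1", "o-1"))` succeeds on the third call when the first two time out.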
Option 3: Undo
If resource B fails, perform an “undo”
(compensating action) on resource A
Good option when:
● An “undo” or compensating action exists
● There is no penalty for the undo
operation
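The "Undo" option can be sketched with two in-memory services. All class and function names here are hypothetical. In this sketch resource A is the inventory, because releasing seats is assumed to carry no penalty, and resource B is the payment.

```python
class InventoryService:
    """Hypothetical resource A: reserving seats is cheap to undo."""

    def __init__(self, available: int) -> None:
        self.available = available

    def reserve(self, quantity: int) -> None:
        if quantity > self.available:
            raise ValueError("not enough tickets")
        self.available -= quantity

    def release(self, quantity: int) -> None:
        # Compensating action for reserve().
        self.available += quantity


class PaymentService:
    """Hypothetical resource B: declines any charge above `limit`."""

    def __init__(self, limit: float) -> None:
        self.limit = limit
        self.charges = {}

    def charge(self, order_id: str, amount: float) -> None:
        if amount > self.limit:
            raise RuntimeError("payment declined")
        self.charges[order_id] = amount


def purchase(inventory: InventoryService, payments: PaymentService,
             order_id: str, quantity: int, amount: float) -> bool:
    inventory.reserve(quantity)           # action on resource A
    try:
        payments.charge(order_id, amount)  # action on resource B
    except Exception:
        inventory.release(quantity)        # undo the action on A
        return False
    return True
```

If the undo itself can fail, it needs its own handling (for example a retry), which this sketch leaves out.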
Option 4: Coordinate
Coordinate the 2 actions between A
and B using a separate coordinator.
Prepare, coordinate, then commit (or
rollback on failure)
Good option when:
● A reliable coordinator is available
● The action can be broken into 2 phases:
“prepare” and “commit”
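The prepare/commit flow can be sketched as a toy two-phase commit. This is an in-memory illustration under strong assumptions: `Participant` and `two_phase_commit` are hypothetical names, and nothing here handles coordinator crashes, timeouts, or durable logging, all of which a real coordinator has to deal with.

```python
from typing import List


class Participant:
    """Hypothetical resource that can vote in the prepare phase."""

    def __init__(self, name: str, can_prepare: bool = True) -> None:
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self) -> bool:
        # Vote yes only if the work can be guaranteed to commit later.
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Commit on every participant only if all of them prepare successfully."""
    prepared = []
    # Phase 1: ask every participant to prepare.
    for participant in participants:
        if participant.prepare():
            prepared.append(participant)
        else:
            # Any "no" vote rolls back everyone who had already prepared.
            for earlier in prepared:
                earlier.rollback()
            return False
    # Phase 2: everyone voted yes, so commit everywhere.
    for participant in participants:
        participant.commit()
    return True
```

With participants for inventory and payment, a refused prepare on payment leaves the inventory rolled back rather than half-updated.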
If something fails, what can we do?
Ignore: write off the error in resource B
Retry: if resource B fails, retry on B until it succeeds
Undo: if resource B fails, undo action on A
Coordinate: 2 phases between A and B: prepare, coordinate, then commit (or rollback on failure)
Decision Matrix: options for each point of failure

                          Best option available?
Step           Order ?    Ignore   Retry   Undo   Coordinate
CIS
Payment
Events Ctr
Email
Logging
Inventory DB
General considerations
1. Business model constraints
a. Amazon.com: inventory >> demand, can process things offline later (asynchronously)
b. TicketMeister: limited time-bound inventory, must process everything right now
2. Alternative definitions of success
a. e.g. if emailing the receipt fails: can customer self-serve and print it from a user portal?
b. e.g. if calling the Events Center API fails - is there a batch job to “true-up” failures?
c. e.g. Logging may not be considered “critical” until you realize your Disaster Recovery system
depends on the logs being accurate and complete
3. Easier to undo? Call it first!
4. Research all 3rd party APIs: assume nothing!
4. Client-side Design Considerations
How does the user see a failure?
All these failure scenarios look the same to a client
Retry is safe
Retry may be safe
Retry likely NOT safe
The client doesn’t know the server’s internal state
Retry is safe
Retry may be safe
Retry likely NOT safe
Clients (especially humans) will almost always retry
POST succeeded! +$500
POST succeeded! +$500
Can’t we rely on the client to do the right thing?
● “Or else what?”
● “But what if my Wi-Fi goes down?”
● “Is there another way?”
● “How will I know it’s OK to bail?”
● “Should I call customer care?”
● “Should I just go on Twitter?”
Idempotency: guaranteeing “Exactly Once” semantics
POST succeeded! +$500
Oops! You already did that action!
One option: send an Idempotency Key
The client (front end, or another API) uses a unique identifier on its end for the
“transaction” - so that retries can be safely rejected
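A server-side view of the idempotency key could look like the sketch below. The `PaymentAPI` class, its in-memory key store, and the response shapes are all assumptions made for illustration. A production service would need a durable store and a policy for expiring keys.

```python
import uuid


class PaymentAPI:
    """Hypothetical endpoint that applies each idempotency key at most once."""

    def __init__(self) -> None:
        self.balance = 0
        self._seen = {}  # idempotency key -> response of the first request

    def post_payment(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._seen:
            # The action was already performed: reject the retry safely.
            return {"status": "duplicate", "original": self._seen[idempotency_key]}
        self.balance += amount
        response = {"status": "succeeded", "amount": amount}
        self._seen[idempotency_key] = response
        return response


if __name__ == "__main__":
    api = PaymentAPI()
    key = str(uuid.uuid4())  # the client generates the key once per "transaction"
    print(api.post_payment(key, 500))
    print(api.post_payment(key, 500))  # a retry with the same key is not applied again
    print(api.balance)
```

The client must reuse the same key for every retry of the same logical action. A fresh key per attempt would defeat the purpose.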
Other client-side considerations: exponential backoff
Exponential backoff: prevent clients from causing Denial-of-Service on a struggling service
...
Thundering herds: Avoiding resource contention
Solution: exponential back-off with random jitter
Example: Stripe.rb client
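Exponential backoff with random jitter can be sketched as follows. This is not the Stripe client's code. The function names, the 0.5 second base, and the 30 second cap are assumptions chosen for illustration, and the jitter picks a uniformly random delay between zero and the exponential ceiling.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] for a 0-based attempt."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_backoff(action: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `action`, sleeping for a jittered, growing delay between attempts."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The randomness matters as much as the growth: without it, clients that failed together retry together, and the thundering herd returns on every backoff step.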
Further reading
● Two Generals Problem
● Byzantine Consensus Protocols
○ Achieving quorum
○ 2-phase commit
○ Paxos
○ Blockchain
● “Distributed Systems Observability” - Cindy Sridharan
Thank You!
Questions?
Feedback / Ideas for next time?
● NoSQL deep dive
● Monitoring & Observability in distributed systems
APPENDIX
3 Characteristics of a Distributed System
1. Operate concurrently
2. Can fail independently
3. Don’t share a global clock
Distributed Data: scaling
1. Reads >> Writes?
  a. Read replication: 1 master, many replicated slaves
    i. Broken: consistency (new: eventual consistency) - even with re
    ii. Writes are still a bottleneck and will take over the master
  b. Sharding (break up 1 write database into many, based on some key)
    i. Each one is read-replicated
    ii. Broken: data model, completely isolated instances (can’t join tables across shards)
  c. Adding indexes?
    i. As writes scale up, you’ll lose the benefits of this
    ii. May end up denormalizing the relational database (BAD!)
2. Why not go NoSQL anyway?
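Sharding on a key can be illustrated with a simple hash-based router. This is a sketch only: `shard_for`, the event-ID keys, and the shard count of 4 are assumptions. Hash-modulo routing is one common scheme among several, and changing the shard count remaps most keys.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for event_id in ("event-1", "event-2", "event-3"):
        print(event_id, "-> shard", shard_for(event_id, 4))
```

Every read and write for a given event goes to the same shard, but a query that spans events on different shards can no longer be expressed as a single join.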
Further reading
Two Generals' Problem: https://en.wikipedia.org/wiki/Two_Generals%27_Problem
Byzantine agreement protocols:
https://en.wikipedia.org/wiki/Byzantine_fault_tolerance
https://en.wikipedia.org/wiki/Quantum_Byzantine_agreement
https://medium.com/loom-network/understanding-blockchain-fundamentals-part-1-byzantine-fault-tolerance-245f46fe8419
https://medium.com/all-things-ledger/the-byzantine-generals-problem-168553f31480

2018-05-16 Geeknight Dallas - Distributed Systems Talk
