Designing large scale distributed systems

  1. Designing Large-Scale Distributed Systems. Ashwani Priyedarshi
  2. “The network is the computer.” John Gage, Sun Microsystems
  3. “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” Leslie Lamport
  4. “Of three properties of distributed data systems – consistency, availability, partition-tolerance – choose two.” Eric Brewer, CAP Theorem, PODC 2000
  5. Agenda ● Consistency Models ● Transactions ● Why to distribute? ● Decentralized Architecture ● Design Techniques & Tradeoffs ● Few Real World Examples ● Conclusions
  6. Consistency Model
     • Restricts the possible values that a read operation on an item can return
       – Some models are very restrictive, others less so
       – The less restrictive ones are easier to implement
     • The most natural semantic for a storage system is "a read should return the last written value"
       – With concurrent accesses and multiple replicas, it is not easy to identify what the "last write" means
  7. Strict Consistency
     ● Assumes the existence of absolute global time
     ● Impossible to implement in a large distributed system
     ● No two operations (in different clients) are allowed at the same time
     ● Example: sequence (a) satisfies strict consistency, but sequence (b) does not
  8. Sequential Consistency
     ● The result of any execution is the same as if
       ● the read and write operations by all processes on the data store were executed in some sequential order, and
       ● the operations of each individual process appear in this sequence in the order specified by its program
     ● All processes see the same interleaving of operations
     ● Many interleavings are valid
     ● Different runs of a program might behave differently
     ● Example: sequence (a) satisfies sequential consistency, but sequence (b) does not
  9. Consistency vs Availability
     • In large shared-data distributed systems, network partitions are a given
     • You must choose between consistency and availability
     • Either choice requires the client developer to be aware of what the system is offering
  10. Eventual Consistency
     • An eventually consistent storage system guarantees that, if no new updates are made to the object, eventually all accesses will return the last updated value
     • If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as:
       – load on the system
       – communication delays
       – number of replicas
     • The most popular system that implements eventual consistency is DNS
  11. Quorum-based Technique
     • Used to enforce consistent operation in a distributed system
     • Consider the following parameters:
       – N = total number of replicas
       – W = number of replicas that must acknowledge a write
       – R = number of replicas accessed during a read
     • If W + R > N
       – the read set and the write set always overlap, and one can guarantee strong consistency
     • If W + R <= N
       – the read and write sets might not overlap, and consistency cannot be guaranteed
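As a rough illustration of the W + R > N rule, here is a small Python sketch; the class names, the in-memory replicas, and the random replica selection are assumptions made for the example, not anything from the slides. A write is acknowledged by W replicas, a read consults R replicas and keeps the freshest version, so overlapping quorums always observe the latest write.

```python
# Illustrative quorum read/write sketch: N in-memory replicas, a write is
# acknowledged by W of them, a read consults R of them and keeps the value
# with the highest version number.
import random

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

class QuorumStore:
    def __init__(self, n, w, r):
        assert w <= n and r <= n
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self.next_version = 0

    def write(self, value):
        # A real system would ack asynchronously and retry; here we simply
        # update W randomly chosen replicas and leave the rest stale.
        self.next_version += 1
        for rep in random.sample(self.replicas, self.w):
            rep.version, rep.value = self.next_version, value

    def read(self):
        # Read R replicas and return the freshest value seen.
        sampled = random.sample(self.replicas, self.r)
        freshest = max(sampled, key=lambda rep: rep.version)
        return freshest.value

store = QuorumStore(n=5, w=3, r=3)   # W + R > N: read and write sets overlap
store.write("v1")
store.write("v2")
print(store.read())                  # always "v2" with these settings
```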
  12. Agenda ● Consistency Models ● Transactions ● Why to distribute? ● Decentralized Architecture ● Design Techniques & Tradeoffs ● Few Real World Examples ● Conclusions
  13. Transactions
     ● Extended form of consistency across multiple operations
     ● Example: transfer money from A to B
       ● Subtract from A
       ● Add to B
     ● What if something happens in between?
       ● Another transaction on A or B
       ● Machine crashes
       ● ...
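A minimal sketch of the money-transfer example using Python's built-in sqlite3 module; the table layout and starting balances are invented for illustration, and a single-node transaction is of course much easier than the distributed case discussed later. It shows the property the slide asks for: either both updates commit or neither does.

```python
# Illustrative sketch: both updates commit together or not at all, so a
# crash between them cannot lose money.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            # If the process crashed here outside a transaction, A would be
            # debited but B never credited.
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except sqlite3.Error:
        pass  # the rollback already restored both balances

transfer(conn, "A", "B", 30)
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('A', 70), ('B', 30)]
```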
  14. Why Transactions?
     ● Correctness
     ● Consistency
     ● Enforce invariants
     ● ACID
  15. Agenda ● Consistency Models ● Transactions ● Why to distribute? ● Decentralized Architecture ● Design Techniques & Tradeoffs ● Few Real World Examples ● Conclusions
  16. Why to distribute?
     ● Catastrophic failures
     ● Expected failures
     ● Routine maintenance
     ● Geolocality
       ● CDN, edge caching
  17. Why NOT to distribute?
     ● Within a datacenter
       ● High bandwidth: 1-100 Gbps interconnects
       ● Low latency: < 1 ms within a rack, < 5 ms across
       ● Little to no cost
     ● Between datacenters
       ● Low bandwidth: 10 Mbps-1 Gbps
       ● High latency: expect 100s of ms
       ● High cost for fiber
  18. Agenda ● Consistency Models ● Transactions ● Why to distribute? ● Decentralized Architecture ● Design Techniques & Tradeoffs ● Few Real World Examples ● Conclusions
  19. Decentralized Architecture
     ● Operating from multiple datacenters simultaneously
     ● Hard problem
     ● Maintaining consistency? Harder
     ● Transactions? Hardest
  20. Option 1: Don't
     ● Most common
       ● Make sure the datacenter never goes down
     ● Bad at catastrophic failure
       ● Large-scale data loss
     ● Not great for serving
       ● No geolocation
  21. Option 2: Primary with hot failover(s)
     ● Better, but not ideal
       ● Mediocre at catastrophic failure
       ● Window of lost data
       ● Failover data may be inconsistent
     ● Geolocated for reads, not for writes
  22. Option 3: Truly Distributed
     ● Simultaneous writes in different DCs, while maintaining consistency
     ● Two-way: hard
     ● N-way: harder
     ● Handles catastrophic failure and geolocality
     ● But high latency
  23. Agenda ● Consistency Models ● Transactions ● Why to distribute? ● Decentralized Architecture ● Design Techniques & Tradeoffs ● Few Real World Examples ● Conclusions
  24. Tradeoffs
                     Backups   M/S   MM   2PC   Paxos
      Consistency
      Transactions
      Latency
      Throughput
      Data Loss
      Failover
      (the grid is filled in column by column on the following slides)
  25. Backups
     ● Make a copy
     ● Weak consistency
     ● Usually no transactions
  26. Tradeoffs – Backups
                     Backups
      Consistency    Weak
      Transactions   No
      Latency        Low
      Throughput     High
      Data Loss      High
      Failover       Down
  27. Master/Slave Replication
     ● Usually asynchronous
       ● Good for throughput and latency
     ● Weak/eventual consistency
     ● Supports transactions
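A toy sketch of asynchronous master/slave replication; the class names and the pull-based log shipping are assumptions made for the example. The master acknowledges writes before the slave has seen them, which is where the eventual consistency, and the data-loss window on failover, come from.

```python
# Illustrative asynchronous master/slave sketch: the master acknowledges a
# write as soon as it is applied locally and appended to a log; the slave
# applies the log later, so its reads can lag behind the master.
class Master:
    def __init__(self):
        self.data = {}
        self.log = []          # replication log, shipped asynchronously

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))   # ack returns before the slave sees it

class Slave:
    def __init__(self):
        self.data = {}
        self.applied = 0       # position in the master's log already replayed

    def pull(self, master):
        # In a real system this would run continuously over the network.
        for key, value in master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(master.log)

master, slave = Master(), Slave()
master.write("x", 1)
print(slave.data.get("x"))   # None: acknowledged but not yet replicated
slave.pull(master)
print(slave.data.get("x"))   # 1: consistent once the log is replayed
```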
  28. Tradeoffs – Master/Slave
                     Backups   M/S
      Consistency    Weak      Eventual
      Transactions   No        Full
      Latency        Low       Low
      Throughput     High      High
      Data Loss      High      Some
      Failover       Down      Read Only
  29. Multi-master Replication
     ● Asynchronous, eventual consistency
     ● Concurrent writes
     ● Need a serialization protocol
       ● e.g. monotonically increasing timestamps
       ● either with master election or a distributed consensus protocol
     ● No strong consistency
     ● No global transactions
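One possible serialization rule is last-write-wins on a (timestamp, node id) tag. The sketch below assumes that approach purely for illustration; it is not the only protocol, and it silently discards one of two concurrent writes, but it shows how replicas converge without a single master.

```python
# Illustrative last-write-wins sketch for multi-master replication: each
# replica accepts writes locally and tags them with (timestamp, node_id);
# when replicas exchange state, the highest tag wins, so replicas converge
# on the same value (though concurrent writes can be lost).
import time

class Replica:
    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}        # key -> (value, (timestamp, node_id))

    def write(self, key, value):
        tag = (time.time(), self.node_id)   # serialization tag for this write
        self.store[key] = (value, tag)

    def merge(self, other):
        # Anti-entropy: keep the entry with the larger tag for every key.
        for key, (value, tag) in other.store.items():
            if key not in self.store or tag > self.store[key][1]:
                self.store[key] = (value, tag)

a, b = Replica("dc-a"), Replica("dc-b")
a.write("profile", "written in DC A")
b.write("profile", "written in DC B")   # concurrent write in another DC
a.merge(b); b.merge(a)
print(a.store["profile"][0] == b.store["profile"][0])   # True: replicas converge
```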
  30. Tradeoffs – Multi-master
                     Backups   M/S        MM
      Consistency    Weak      Eventual   Eventual
      Transactions   No        Full       Local
      Latency        Low       Low        Low
      Throughput     High      High       High
      Data Loss      High      Some       Some
      Failover       Down      Read Only  Read/write
  31. Two-Phase Commit
     ● Semi-distributed consensus protocol
       ● deterministic coordinator
     ● Phase 1: request/prepare; Phase 2: commit/abort
     ● Heavyweight, synchronous, high latency
     ● 3PC: asynchronous (one extra round trip)
     ● Poor throughput
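A minimal, failure-free sketch of the two phases; the participant names are invented, and a real 2PC implementation also needs durable logging and timeout handling, which are omitted here.

```python
# Illustrative two-phase commit sketch: a deterministic coordinator asks every
# participant to prepare (phase 1) and only sends commit if all voted yes;
# any "no" vote turns into a global abort (phase 2).
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "init"

    def prepare(self):
        # Phase 1: participant persists enough to commit later, then votes.
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes (this is where 2PC blocks and latency piles up).
    votes = [p.prepare() for p in participants]
    # Phase 2: a single deterministic decision is broadcast to everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return "commit"
    for p in participants:
        p.abort()
    return "abort"

nodes = [Participant("A"), Participant("B"), Participant("C", will_vote_yes=False)]
print(two_phase_commit(nodes))    # "abort": one no vote aborts everyone
print([p.state for p in nodes])   # all participants end in "aborted"
```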
  32. Tradeoffs – 2PC
                     Backups   M/S        MM          2PC
      Consistency    Weak      Eventual   Eventual    Strong
      Transactions   No        Full       Local       Full
      Latency        Low       Low        Low         High
      Throughput     High      High       High        Low
      Data Loss      High      Some       Some        None
      Failover       Down      Read Only  Read/write  Read/write
  33. Paxos
     ● Decentralized, distributed consensus protocol
     ● Protocol similar to 2PC/3PC
       ● lighter, but still high latency
     ● Three classes of agents: proposers, acceptors, learners
     ● Phase 1: (a) prepare, (b) promise; Phase 2: (a) accept, (b) accepted
     ● Survives minority failure
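A compact, synchronous sketch of single-decree Paxos with no message loss; learners are omitted and the proposer talks to the acceptors directly, which is a simplification for illustration only. It shows the prepare/promise and accept/accepted exchanges and why a majority of acceptors is enough.

```python
# Illustrative single-decree Paxos sketch (one value chosen, synchronous calls,
# no message loss): a proposer needs a majority at each phase, which is why a
# minority of failed acceptors can be tolerated.
class Acceptor:
    def __init__(self):
        self.promised_n = -1
        self.accepted_n = -1
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1b (promise): never accept proposals numbered below n again,
        # and report any value already accepted.
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        # Phase 2b (accepted): accept unless a higher-numbered promise was made.
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors, n, value):
    majority = len(acceptors) // 2 + 1
    # Phase 1a (prepare): ask every acceptor for a promise.
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None
    # If some acceptor already accepted a value, we must adopt the value with
    # the highest accepted proposal number instead of our own.
    accepted = [(an, av) for an, av in granted if an >= 0]
    if accepted:
        value = max(accepted)[1]
    # Phase 2a (accept): the value is chosen once a majority accepts it.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="commit txn 42"))   # "commit txn 42"
print(propose(acceptors, n=2, value="something else"))  # still "commit txn 42"
```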
  34. Tradeoffs
                     Backups   M/S        MM          2PC         Paxos
      Consistency    Weak      Eventual   Eventual    Strong      Strong
      Transactions   No        Full       Local       Full        Full
      Latency        Low       Low        Low         High        High
      Throughput     High      High       High        Low         Medium
      Data Loss      High      Some       Some        None        None
      Failover       Down      Read Only  Read/write  Read/write  Read/write
  35. Agenda ● Consistency Models ● Transactions ● Why to distribute? ● Decentralized Architecture ● Design Techniques & Tradeoffs ● Few Real World Examples ● Conclusions
  36. Examples
     ● Megastore
       ● Google's scalable, highly available datastore
       ● Strong consistency, Paxos
       ● Optimized for reads
     ● Dynamo
       ● Amazon's highly available key-value store
       ● Eventual consistency, consistent hashing, vector clocks
       ● Optimized for writes
     ● PNUTS
       ● Yahoo's massively parallel and distributed database system
       ● Timeline consistency
       ● Optimized for reads
  37. Conclusions
     ● No silver bullet
       ● There are no simple solutions
     ● Design systems based on application needs
  38. The End
  39. Backup Slides
  40. Vector Clocks
     • Used to capture causality between different versions of the same object
     • A vector clock is a list of (node, counter) pairs
     • Every version of every object is associated with one vector clock
     • If the counters in the first object's clock are less than or equal to the counters of the corresponding nodes in the second clock, then the first version is an ancestor of the second and can be forgotten
  41. Vector Clock Example
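A minimal sketch of the ancestor test from the Vector Clocks slide, with clocks stored as dicts; the node names are made up for the example.

```python
# Illustrative vector clock sketch: a clock is a dict of node -> counter, and
# version v1 is an ancestor of v2 (so v1 can be forgotten) exactly when every
# counter in v1 is <= the corresponding counter in v2.
def descends(v2, v1):
    """True if clock v2 descends from (or equals) clock v1."""
    return all(v2.get(node, 0) >= count for node, count in v1.items())

def increment(clock, node):
    """Return a new clock with `node`'s counter bumped (done on each write)."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

v1 = increment({}, "sx")                 # write handled by node sx
v2 = increment(v1, "sx")                 # later write on the same node
v3 = increment(v1, "sy")                 # concurrent write on a different node

print(descends(v2, v1))                        # True: v1 can be forgotten
print(descends(v2, v3) or descends(v3, v2))    # False: v2 and v3 conflict and
                                               # must be reconciled by the client
```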
  42. Partitioning Algorithm
     • Consistent hashing:
       – The output range of a hash function is treated as a fixed circular space or "ring"
     • Virtual nodes
       – Each physical node can be responsible for more than one virtual node
       – When a new node is added, it is assigned multiple positions on the ring
       – Various advantages
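A small sketch of consistent hashing with virtual nodes; the choice of hash function, the number of virtual nodes, and the node names are assumptions made for the example. Each node is hashed to several positions on the ring, and a key is served by the first position found clockwise from the key's hash.

```python
# Illustrative consistent-hashing sketch with virtual nodes.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        self.ring = []                      # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes_per_node):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # Walk clockwise from the key's position; wrap around at the end.
        position = self._hash(key)
        index = bisect.bisect(self.ring, (position,)) % len(self.ring)
        return self.ring[index][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))    # the node responsible for this key
print(ring.lookup("user:43"))    # nearby keys may land on different nodes
```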
