
Design principles of scalable, distributed systems

Key algorithms of scalable, distributed systems

Published in: Technology

  1. Design Principles of Scalable, Distributed Systems
     Tinniam V Ganesh, tvganesh.85@gmail.com
     03/28/12, http://gigadom.wordpress.com
  2. Distributed Systems
     There are two classes of systems:
     - Monolithic
     - Distributed
  3. Traditional Client Server Architecture
     (Diagram: Client and Server.)
  4. Properties of Distributed Systems
     Distributed systems are made up of hundreds of commodity servers.
     • No machine has complete information about the system state
     • Machines make decisions based on local information
     • Failure of one machine does not bring down the system
     • There is no implicit assumption about a global clock
  5. Characteristics of Distributed Systems
     Distributed systems are made up of:
     • Commodity servers
     • A large number of servers
     • Servers that crash, networks that fail, and messages that are not sent or not received
     • New servers that can join without changing the system's behavior
  6. Examples of Distributed Systems
     • Amazon’s e-retail store
     • Google
     • Yahoo
     • Facebook
     • Twitter
     • YouTube
     etc.
  7. Key principles of distributed systems
     • Incremental scalability
     • Symmetry – All nodes are equal
     • Decentralization – No central control
     • Heterogeneity – Work is distributed in proportion to each server's capacity
  8. Transaction Processing System
     • Traditional databases have to ensure that transactions are consistent. A transaction must complete fully or not at all.
     • Successful transactions are committed.
     • Otherwise transactions are rolled back.
  9. ACID postulate
     Transactions in traditional systems must have the following properties. Earlier systems were designed for ACID properties.
     A – Atomic
     C – Consistent
     I – Isolated
     D – Durable
  10. ACID
      Atomic – Each transaction happens completely or not at all.
      Consistent – The transaction should maintain system invariants. For example, an internal bank transfer should leave the total amount in the bank the same before and after the transaction, although it may differ temporarily while the transaction is in progress.
      Isolated – Different transactions should be isolated, or serializable. It must appear that transactions happen sequentially in some particular order.
      Durable – Once the transaction commits, its effect is complete and durable going forward.
  11. Scaling
      There are 2 types of scaling:
      Vertical scaling – Scales by adding a faster CPU, more memory and a larger database. It does not scale beyond a particular point.
      Horizontal scaling – Scales laterally by adding more servers of the same capacity.
  12. System behavior on Scaling
      (Graph: throughput, in transactions per second, and response time plotted against load.)
  13. Consistency and Replication
      To increase reliability against failures, data has to be replicated across multiple servers.
      The problem with replicas is the need to keep the data consistent.
  14. Reasons for Replication
      Data is replicated in distributed systems for two reasons:
      - Reliability – Ensuring that the data stays consistent in a majority of the replicas
      - Performance – Performance can be improved by accessing a replica that is closer to the user. This also provides geographical resiliency.
  15. Downside of Replication
      • Replication of data has several advantages, but the downside is the difficulty of maintaining consistency
      • Modifying one copy makes it different from the rest, and the update has to be propagated to all copies to ensure consistency
  16. Synchronization
      No machine has a view of the global system state.
      • Problems with distributed systems
      • How can processes synchronize?
      • Clocks on different systems will be slightly different
      • Is there a way to maintain a global view of the clock?
      • Can we order events causally?
  17. Hypothetical situation
      Consider a hypothetical situation with banks:
      - A man deposits Rs 55,000/- at 10.00 am
      - The man withdraws Rs 20,000/- at 10.02 am
      What will happen if the updates are applied in a different order?
      - Operations must be idempotent. Idempotency means getting the same result no matter how many times the operation is performed.
      eCommerce site (Amazon):
      - Add to shopping cart
      - Delete from shopping cart
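The bank scenario can be tried out in a few lines. This is only a sketch under stated assumptions: the opening balance of 0, the rule that a withdrawal is rejected when funds are insufficient, and the operation ids `op-1` and `op-2` are all hypothetical and not taken from the slides. It shows that reordering the two updates changes the final balance, while replaying an update is harmless once each operation carries an id.

```python
def apply_ops(ops, balance=0):
    """Apply (op_id, kind, amount) tuples in order and return the final balance."""
    seen = set()  # ids already applied, so a replayed message becomes a no-op
    for op_id, kind, amount in ops:
        if op_id in seen:
            continue
        seen.add(op_id)
        if kind == "deposit":
            balance += amount
        elif kind == "withdraw" and balance >= amount:
            balance -= amount  # assumed rule: reject if funds are insufficient
    return balance


deposit = ("op-1", "deposit", 55_000)
withdraw = ("op-2", "withdraw", 20_000)

in_order = apply_ops([deposit, withdraw])            # deposit, then withdrawal
reordered = apply_ops([withdraw, deposit])           # withdrawal arrives first
replayed = apply_ops([deposit, deposit, withdraw])   # deposit delivered twice
```

With these assumptions the in-order and replayed runs agree, while the reordered run ends with a different balance because the early withdrawal is rejected.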
  18. Vector Clocks
      Vector clocks are used to capture causality between different versions of the same object.
      Amazon’s Dynamo uses vector clocks to reconcile different versions of objects and to determine the causal ordering of events.
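A minimal sketch of how two versions might be compared with vector clocks. The dictionary representation and the node names `Sx`, `Sy`, `Sz` are assumptions made for illustration; this is not Dynamo's actual code. One version supersedes another when its counters are at least as large for every node; if neither supersedes the other, the versions are concurrent and need reconciliation.

```python
def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify the causal relationship between two vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a descends from b"
    if descends(b, a):
        return "b descends from a"
    return "concurrent"


def merge(a, b):
    """Clock for a reconciled version: the component-wise maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


d2 = {"Sx": 2}                 # written twice through node Sx
d3 = {"Sx": 2, "Sy": 1}        # d2 updated through node Sy
d4 = {"Sx": 2, "Sz": 1}        # d2 updated independently through node Sz

print(compare(d3, d2))         # d3 supersedes d2
print(compare(d3, d4))         # neither has seen the other's update
print(merge(d3, d4))
```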
  19. Vector Clocks
      (Diagram: timelines of clock values for three processes exchanging messages; labels: OK, Adjust.)
  20. Dynamo’s reconciliation process
      (Diagram.)
  21. Problem with Relational Databases
      Relational databases let the user construct complex queries, but they do not scale well.
      Problem: Performance deteriorates as the number of records reaches several million.
      Solution: Partition the database horizontally and distribute the records across several servers.
  22. NoSQL Databases
      • Databases are horizontally partitioned
      • Simple queries based on gets() and sets()
      • Access is by key/value pairs
      • Cannot do complex queries like joins
      • A database can contain several hundred million records
  23. Databases that use Consistent Hashing
      1. Cassandra
      2. Amazon’s Dynamo
      3. NoSQL
      4. HBase
      5. CouchDB
      6. MongoDB
  24. Hash Tables
      • Distribute records among many servers
      • Distribution is based on keys, which are hashed
      • Key – 128 bits or 160 bits
      • Hash values fall into a range; the servers are visualized as lying on the circumference of a circle, going clockwise
  25. Distributed Hash Table
      • Hashing a key leads to a server; the servers are assumed to reside on the circumference of a circle
      • The highest key wraps around to the beginning of this circle
      • The movement is clockwise
  26. Distributed Hash Table
      An entity with key K falls under the jurisdiction of the node with the smallest id >= K.
      • For example, suppose we have two nodes, one at position 50 and another at position 200.
      • If we want to store a key/value pair in the DHT and the key hash is 100, it would go to node 200.
      • Another key hash of 30 would go to node 50.
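The rule above can be sketched with a sorted list and a binary search. The function name `responsible_node` is hypothetical; the wrap-around for a hash larger than every node id follows the earlier slide's statement that the highest key wraps back to the start of the circle.

```python
from bisect import bisect_left


def responsible_node(node_ids, key_hash):
    """Return the node with the smallest id >= key_hash, wrapping around the ring."""
    ids = sorted(node_ids)
    i = bisect_left(ids, key_hash)   # index of the first id that is >= key_hash
    return ids[i % len(ids)]         # past the highest id, wrap to the first node


nodes = [50, 200]
print(responsible_node(nodes, 100))  # key hash 100
print(responsible_node(nodes, 30))   # key hash 30
print(responsible_node(nodes, 250))  # beyond the highest node id
```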
  27. Consistent Hashing
      A naïve approach with 8 nodes and 100 keys could use a simple modulo algorithm.
      So key 18 would end up on node 2 and key 63 on node 7.
      But how do we handle servers crashing or new servers joining the system?
      Consistent hashing handles this issue.
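To see why the modulo scheme struggles when a server joins, the sketch below counts how many of the 100 keys change owner when a ninth node is added, and contrasts that with a simple hash ring. The node names `node-0` to `node-8`, the use of MD5 for ring positions, and one ring position per node are assumptions for illustration only; real systems typically add virtual nodes.

```python
import hashlib
from bisect import bisect_left


def ring_position(name):
    """Map a string to a position on the ring (MD5 chosen only for illustration)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


def ring_owner(nodes, key):
    """Owner of key: first node clockwise from the key's ring position."""
    points = sorted((ring_position(n), n) for n in nodes)
    i = bisect_left(points, (ring_position(str(key)), ""))
    return points[i % len(points)][1]


keys = range(100)
old_nodes = [f"node-{i}" for i in range(8)]
new_nodes = old_nodes + ["node-8"]

# Modulo placement: going from 8 to 9 nodes reshuffles most keys.
modulo_moved = sum(1 for k in keys if k % 8 != k % 9)

# Ring placement: only keys that now fall to the new node change owner.
ring_moved = [k for k in keys if ring_owner(old_nodes, k) != ring_owner(new_nodes, k)]

print(modulo_moved, len(ring_moved))
```

On the ring, every key that moves lands on the newly added node; keys owned by the other nodes stay where they were.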
  28. Consistent Hashing
      (Diagram. Source: http://offthelip.org/)
  29. Distributed Hash Table
      (Diagram.)
  30. Consistent Hashing
      (Diagram. Source: http://horicky.blogspot.in)
  31. Chord System
      (Diagram: ring of nodes with their finger tables, resolving K = 26.)
      FTp[i] = succ(p + 2^(i-1))
  32. Process of determining node
      To look up a key k, node p forwards the request to the node q with index j in p’s finger table such that
      q = FTp[j] <= k < FTp[j+1]
      To resolve k = 26:
      1. 26 > FT1[5] = 18, hence the request is forwarded to node 18
      2. FT18[2] <= 26 < FT18[3], hence it is forwarded to node 20
      3. FT20[1] <= 26 < FT20[2], hence it is forwarded to node 21
      4. 21 < 26 <= FT21[1] = 28, hence node 28 is responsible for key 26
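The lookup can be traced with a small simulation. This is a sketch under assumptions: the slide text does not list the ring's members, so the node set below (1, 4, 9, 11, 14, 18, 20, 21, 28 on a 5-bit ring) is assumed, chosen to be consistent with the values FT1[5] = 18 and FT21[1] = 28 quoted above. The function names are hypothetical, and a real Chord node would only know its own finger table rather than the full node list.

```python
M = 5                      # assumed identifier length in bits
RING = 2 ** M
NODES = sorted([1, 4, 9, 11, 14, 18, 20, 21, 28])   # assumed node set


def succ(k):
    """First node with id >= k, wrapping around the ring."""
    k %= RING
    return next((n for n in NODES if n >= k), NODES[0])


def finger_table(p):
    """FTp[i] = succ(p + 2^(i-1)) for i = 1..M (returned as a 0-based list)."""
    return [succ(p + 2 ** (i - 1)) for i in range(1, M + 1)]


def in_open(x, a, b):
    """x lies strictly between a and b going clockwise."""
    return a < x < b if a < b else (x > a or x < b)


def in_half_open(x, a, b):
    """x lies in (a, b] going clockwise."""
    return a < x <= b if a < b else (x > a or x <= b)


def lookup(start, k):
    """Return the list of nodes visited while resolving key k from node start."""
    path, p = [start], start
    while k != p:
        ft = finger_table(p)
        if in_half_open(k, p, ft[0]):        # p's successor is responsible
            path.append(ft[0])
            break
        # otherwise forward to the farthest finger that still precedes k
        p = next(f for f in reversed(ft) if in_open(f, p, k))
        path.append(p)
    return path


print(finger_table(1))
print(lookup(1, 26))
```

Each hop roughly halves the remaining distance to the key, which is where the logarithmic step count comes from.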
  33. Hashing efficiency of Chord System
      The Chord System gets to the node in O(log n) steps.
      There are other hashing techniques that get there in O(1) steps but use a larger local table. For example attains a O(1) hashing method.
  34. Joining the Chord System
      Suppose node p wants to join. It performs the following steps:
      - Requests a lookup for succ(p+1)
      - Inserts itself before this node
  35. Maintaining consistency
      Periodically each node checks its successor’s predecessor.
      Node q contacts succ(q+1) and asks it to return pred(succ(q+1)).
      If q = pred(succ(q+1)) then nothing has changed. If the successor returns another value, then q knows that a new node p has joined the system, with
      q < p < succ(q+1)
      so q updates its finger table by setting FTq[1] = p.
  36. CAP Theorem
      Databases that are designed around ACID properties have poor availability.
      Postulated by Eric Brewer of the University of California, Berkeley.
      At most 2 of the 3 properties are possible in a distributed system:
      C – Consistency
      A – Availability
      P – Partition Tolerance
  37. CAP Theorem
      • Consistency – Ability for repeated reads to provide the same value
      • Availability – Ability to be resilient to server crashes
      • Partition Tolerance – Ability to keep operating, and still get at the data, when the network splits the servers into groups that cannot communicate
  38. Real world examples of CAP Theorem
      Amazon’s Dynamo chooses availability over consistency. Dynamo implements eventual consistency, where data becomes consistent over time.
      Google’s BigTable chooses consistency over availability.
      Consistency, Partition Tolerance (CP):
      - BigTable
      - HBase
      Availability, Partition Tolerance (AP):
      - Dynamo
      - Voldemort
      - Cassandra
  39. Consistency issues
      Data replication in many commercial systems performs synchronous replica coordination to provide strongly consistent data.
      The downside of this approach is poor availability.
      These systems treat the data as unavailable if they are not able to ensure consistency.
      For example, if data is replicated on 5 servers and an update needs to be made, then the following has to be done:
      - Update all 5 copies
      - Ensure all of them are successful
      - If one of them fails, roll back the updates on the other 4
      If a read is done when one of the servers has failed, a strongly consistent system would return “data unavailable” when correctness is undetermined.
  40. Quorum Protocol
      To maintain consistency, data is replicated on many servers.
      For example, let us assume there are N servers in the system.
      Typical algorithms require a write to reach more than N/2 servers, i.e. at least N/2 + 1.
      Usually Nw > N/2.
      A write is successful if it has been committed on N/2 + 1 servers.
      This is known as the write quorum.
  41. Quorum Protocol
      Similarly, reads are done from an arbitrary number of server replicas, Nr. This is known as the read quorum.
      Reads from different servers are compared.
      A consistent design requires that Nw + Nr > N.
      With this you are assured of reading your writes.
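The two quorum conditions can be checked directly, and for small N the reason they work can be verified by brute force: when Nw + Nr > N, every possible write set shares at least one server with every possible read set, so a read always touches a replica holding the latest write. The function names below are hypothetical and the enumeration is only practical for small N.

```python
from itertools import combinations


def is_consistent_quorum(n, n_w, n_r):
    """Check the two conditions from the slides: Nw > N/2 and Nw + Nr > N."""
    return n_w > n / 2 and n_w + n_r > n


def quorums_always_overlap(n, n_w, n_r):
    """Brute-force check that every write set intersects every read set."""
    servers = range(n)
    return all(
        set(w) & set(r)
        for w in combinations(servers, n_w)
        for r in combinations(servers, n_r)
    )


print(is_consistent_quorum(5, 3, 3), quorums_always_overlap(5, 3, 3))
print(is_consistent_quorum(5, 3, 2), quorums_always_overlap(5, 3, 2))
```

With N = 5, choosing Nw = 3 and Nr = 3 satisfies both conditions, whereas Nr = 2 allows a read to miss all three servers that took the write.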
  42. Election Algorithm
      Many distributed systems have one process that acts as a coordinator. If the coordinator crashes, an election takes place to identify the new coordinator.
      1. P sends an ELECTION message to all higher-numbered processes
      2. If no one responds, P becomes the coordinator
      3. If a higher-numbered process answers, it takes over the election process
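The three steps can be sketched as a recursive function. This is a simplification under assumptions: messages and timeouts are replaced by a set of process ids known to be alive, the initiator is assumed to be alive, and the lowest-numbered responder is arbitrarily chosen to take over (in practice every responder would start its own election).

```python
def elect(initiator, alive):
    """Return the coordinator chosen when `initiator` starts an election.

    `alive` is the set of process ids that would answer an ELECTION message.
    """
    higher = [p for p in alive if p > initiator]   # step 1: ask higher-numbered processes
    if not higher:
        return initiator                           # step 2: nobody answered, initiator wins
    return elect(min(higher), alive)               # step 3: a responder takes over


# Processes 1..7 where coordinator 7 has crashed and 3 is also down.
alive = {1, 2, 4, 5, 6}
print(elect(4, alive))
```

Whichever process starts the election, the highest-numbered live process ends up as coordinator.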
  43. Traditional Fault Tolerance
      Traditional systems use redundancy to handle failures and to be fault tolerant.
      (Diagram: Active and Standby pairs.)
  44. Process Resilience
      Handling failures in distributed systems is much more difficult, as no system has a view of the global state.
  45. Byzantine Failures
      Byzantine refers to the Byzantine Generals Problem, where an army must unanimously decide whether to attack another army. The problem is complicated because the generals must use messengers to communicate, and by the presence of traitors.
      Distributed systems are prone to a type of failure known as Byzantine failure:
      Omission failures – Disk crashes, network congestion, failure to receive a request etc.
      Commission failures – Failures where the server behaves incorrectly, corrupts local state etc.
      Solution: To handle Byzantine failures where k processes are sick, have a minimum of 2k+1 processes, so that we are left with k+1 correct replies even when k processes are behaving incorrectly.
  46. Checkpointing
      In fault-tolerant distributed computing, backward error recovery requires that the system save its state at regular intervals. We need to create a consistent global state, called a distributed snapshot.
      In a distributed snapshot, if a process P has recorded the receipt of a message, then there should be a process Q that has recorded sending the corresponding message.
      Each process saves its state from time to time.
      To recover, we need to construct a consistent global state from these local states.
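The consistency condition for a snapshot can be written as a set check. In this sketch each local checkpoint is assumed to record the ids of messages it has sent and received; the structure of `local_states` and the message id `m1` are hypothetical.

```python
def is_consistent_snapshot(local_states):
    """True if every message recorded as received is also recorded as sent."""
    sent, received = set(), set()
    for state in local_states.values():
        sent |= state["sent"]
        received |= state["received"]
    return received <= sent


# Q checkpointed after sending m1, P after receiving it: consistent.
good = {
    "P": {"sent": set(), "received": {"m1"}},
    "Q": {"sent": {"m1"}, "received": set()},
}

# Q checkpointed before sending m1, yet P records its receipt: inconsistent.
bad = {
    "P": {"sent": set(), "received": {"m1"}},
    "Q": {"sent": set(), "received": set()},
}

print(is_consistent_snapshot(good), is_consistent_snapshot(bad))
```

A message that is recorded as sent but not yet received is acceptable, since it can simply be treated as still in transit.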
  47. Gossip Protocol
      Used to handle server crashes and one or more servers joining the system.
      Changes to the distributed system, such as membership changes, are spread in a way similar to gossiping:
      - A server picks another server at random and sends it a message about a server crash or a server joining
      - If the receiver has already received this message, the message is dropped
      - The receiving server similarly gossips to other servers, and the system soon reaches a steady state
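A small simulation of the spreading process, under simplifying assumptions: gossip proceeds in synchronous rounds, every informed server contacts exactly one random peer per round, and no messages are lost. The server names, the `seed` parameter and the `max_rounds` guard are hypothetical additions for the sketch.

```python
import random


def gossip(servers, origin, seed=0, max_rounds=1000):
    """Spread one membership update; return (rounds taken, set of informed servers)."""
    rng = random.Random(seed)
    informed = {origin}
    rounds = 0
    while len(informed) < len(servers) and rounds < max_rounds:
        rounds += 1
        # Only servers informed before this round gossip during it.
        for sender in sorted(informed):
            target = rng.choice([s for s in servers if s != sender])
            informed.add(target)   # a server that already knows just drops the message
    return rounds, informed


servers = [f"server-{i}" for i in range(16)]
rounds, informed = gossip(servers, "server-0")
print(rounds, len(informed))
```

Because the number of informed servers can at most double per round, 16 servers need at least 4 rounds; the random choice of targets usually makes it take a few more.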
  48. Sloppy Quorum
      The quorum protocol is applied to the first N healthy nodes, rather than strictly to the first N nodes encountered walking clockwise around the ring.
      Data meant for Node A is sent to Node D if A is temporarily down.
      Node D keeps a hinted handoff in its metadata, and uses it to update Node A when A comes back up.
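The Node A / Node D example can be sketched as follows. This is an illustration under assumptions, not Dynamo's implementation: `ring` is assumed to be the list of nodes in clockwise order starting from the key's position, `down` is the set of temporarily failed nodes, and the function names and return shapes are hypothetical.

```python
def sloppy_write(ring, down, value, n):
    """Write to the first n healthy nodes; stand-ins record whom they cover for."""
    intended = ring[:n]                                   # strict first-N owners
    skipped = [node for node in intended if node in down]
    stores = {}
    for node in ring:
        if len(stores) == n:
            break
        if node in down:
            continue
        hint = None
        if node not in intended and skipped:
            hint = skipped.pop(0)                         # hinted handoff metadata
        stores[node] = (value, hint)
    return stores


def hand_off(stores, recovered):
    """Values that stand-in nodes should deliver once `recovered` is back up."""
    return [(holder, value) for holder, (value, hint) in stores.items() if hint == recovered]


ring = ["A", "B", "C", "D", "E"]
stores = sloppy_write(ring, down={"A"}, value="cart-v1", n=3)
print(stores)
print(hand_off(stores, "A"))
```

Here the write still reaches three nodes while A is down, and D holds the hint that lets it forward the value to A later.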
  49. Thank You!
      Tinniam V Ganesh, tvganesh.85@gmail.com
      Read my blogs: http://gigadom.wordpress.com/ http://savvydom.wordpress.com/
