Design Principles of Scalable,
               Distributed Systems


                                                 Tinniam V Ganesh
                                                 tvganesh.85@gmail.com



03/28/12           Tinniam V Ganesh - http://gigadom.wordpress.com       1
Distributed Systems
There are two classes of systems
- Monolithic
- Distributed




Traditional Client Server Architecture




[Figure: Client and Server]




Properties of Distributed Systems
Distributed systems are made up of hundreds of commodity servers
• No machine has complete information about the system state
• Machines make decisions based on local information
• Failure of one machine should not bring down the system
• There is no implicit assumption about a global clock




Characteristics of Distributed Systems

Distributed Systems are made up of
• Commodity Servers
• Large number of servers
• Servers crash, networks fail, and messages may not be sent or received
• New servers can join without changing the system's behavior




Examples of Distributed Systems
• Amazon’s e-retail store
• Google
• Yahoo
• Facebook
• Twitter
• YouTube
etc.




Key principles of distributed systems

•   Incremental scalability
•   Symmetry – All nodes are equal
•   Decentralization – No central control
•   Heterogeneity – Work is distributed according to each server's capability




Transaction Processing System
•   Traditional databases have to ensure that transactions are consistent. A transaction
    must complete fully or not at all.
•   Successful transactions are committed.
•   Otherwise, transactions are rolled back.




ACID postulate
Transactions in traditional systems must have the following properties.
Earlier systems were designed around the ACID properties:
A – Atomic
C – Consistent
I – Isolated
D – Durable




ACID
Atomic – Each transaction happens completely or not at all.

Consistent – The transaction must maintain system invariants. For example, after an internal
   bank transfer the total amount in the bank must be the same as before the
   transaction. The total may differ temporarily while the transaction is in progress.

Isolated – Concurrent transactions must be isolated from one another, i.e. serializable. It must appear that
    the transactions happened sequentially in some particular order.

Durable – Once a transaction commits, its effects are permanent and survive
   subsequent failures.
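
As a rough illustration of atomicity and consistency, the sketch below models the bank transfer in Python. The `transfer` function, the account names and the in-memory dictionary are assumptions made for this example, not part of any real database API.

```python
class InsufficientFunds(Exception):
    """Raised when a debit would overdraw the source account."""


def transfer(accounts, src, dst, amount):
    """Move `amount` from src to dst; on any error restore the old balances."""
    saved = dict(accounts)              # copy used for rollback
    try:
        accounts[src] -= amount         # the total is temporarily off here
        if accounts[src] < 0:
            raise InsufficientFunds(src)
        accounts[dst] += amount         # invariant restored
    except Exception:
        accounts.clear()
        accounts.update(saved)          # roll back: as if nothing happened
        raise


accounts = {"savings": 100, "current": 50}   # hypothetical balances
transfer(accounts, "savings", "current", 30)
print(accounts)                         # {'savings': 70, 'current': 80}
```

A transfer that fails part-way leaves every balance exactly as it was, so the total in the bank never changes.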




Scaling
There are 2 types of scaling

Vertical scaling – Scales by adding a faster CPU, more memory and a
   larger database. It does not scale beyond a particular point.
Horizontal scaling – Scales laterally by adding more servers with the
   same capacity.




System behavior on Scaling




[Figure: throughput (transactions per second) and response time plotted against load]
Consistency and Replication

To increase reliability against failures, data has to be replicated across multiple
    servers.
The problem with replicas is the need to keep the data consistent.




Reasons for Replication

Data is replicated in distributed systems for two reasons
- Reliability – If one replica fails or is corrupted, the data can still be served from the others, for
   example by trusting the value held by a majority of the replicas
- Performance – Performance can be improved by accessing a replica that is closer
   to the user. Replicas in different locations also provide geographical resiliency




Downside of Replication
•   Replication of data has several advantages, but the downside is the problem of
    maintaining consistency
•   Modifying one copy makes it different from the rest, and the update has to be
    propagated to all copies to ensure consistency




Synchronization
No machine has a view of the global system state

Problems with distributed systems
•   How can processes synchronize?
•   Clocks on different systems will be slightly different
•   Is there a way to maintain a global view of the clock?
•   Can we order events causally?




Hypothetical situation
Consider a hypothetical situation with banks

- Man deposits Rs 55,000/- at 10.00 am
- Man withdraws Rs 20,000/- at 10.02 am
What will happen if the updates are applied in a different order?
- Operations must be idempotent. Idempotency means getting the same
  result no matter how many times the operation is performed.

eCommerce site – Amazon
- Add to shopping cart
- Delete from shopping cart
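
One common way to make an operation idempotent is to tag each request with a unique id and ignore repeats. The sketch below assumes hypothetical request ids (`dep-1`, `wd-1`) and an in-memory `Account` class; it is an illustration only, not how any particular bank or shopping cart is implemented.

```python
class Account:
    def __init__(self):
        self.balance = 0
        self.applied = set()        # ids of operations already applied

    def apply(self, op_id, delta):
        """Apply a deposit (+) or withdrawal (-) at most once per op_id."""
        if op_id in self.applied:   # duplicate delivery: no effect
            return self.balance
        self.applied.add(op_id)
        self.balance += delta
        return self.balance


acct = Account()
acct.apply("dep-1", 55000)
acct.apply("dep-1", 55000)          # retried message, ignored
acct.apply("wd-1", -20000)
print(acct.balance)                 # 35000
```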



Vector Clocks
Vector clocks are used to capture causality between different versions of the same
   object.
Amazon’s Dynamo uses vector clocks to reconcile different versions of the objects and
   determine the causal ordering of events.
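
A minimal sketch of how vector clocks capture causality, assuming each version carries a dictionary of per-node counters. The node names X, Y and Z and the helper names are made up for this example; Dynamo's actual data structures are not shown in these slides.

```python
def increment(clock, node):
    """Return a copy of `clock` with node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if version a has seen every event that version b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"          # neither caused the other: needs reconciliation


def merge(a, b):
    """Element-wise maximum, used after conflicting versions are reconciled."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


v1 = increment({}, "X")              # written at node X
v2 = increment(v1, "X")              # updated again at X
v3 = increment(v2, "Y")              # update handled by Y
v4 = increment(v2, "Z")              # a different update handled by Z
print(compare(v2, v1))               # a is newer
print(compare(v3, v4))               # concurrent
v5 = increment(merge(v3, v4), "X")   # reconciled version written at X
```

Versions v3 and v4 both descend from v2 but not from each other, so they are the kind of conflicting versions that have to be reconciled.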




Vector Clocks
[Figure: clock readings of three processes advancing in steps of 2, 5 and 8. One message is marked "OK"; another is marked "Adjust", after which the middle clock jumps from 30 to 41.]
Dynamo’s reconciliation process




Problem with Relational Databases
Relational databases (RDBMS) let the user construct complex queries, but they
   do not scale well.
Problem
Performance deteriorates as the number of records reaches several million

Solution
Partition the database horizontally and distribute the records across several servers.




NoSQL Databases
•   Databases are horizontally partitioned
•   Simple queries based on gets() and sets()
•   Accesses are made on key/value pairs
•   Cannot do complex queries like joins
•   Database can contain several hundred million records




Databases that use Consistent Hashing
1. Cassandra
2. Amazon’s Dynamo
3. NoSQL
4. HBase
5. CouchDB
6. MongoDB




Hash Tables
  •   Records are distributed among many servers
  •   Distribution is based on the key, which is hashed
  •   Hashed key – 128 bits or 160 bits
  •   Hash values fall into a range, and the servers are visualized as lying on the circumference of a
      circle, going clockwise.




Distributed Hash Table
•   Hashing a key leads to a server; the servers are assumed to reside on the
    circumference of a circle
•   The highest key wraps around to the beginning of this circle
•   The movement is clockwise




Distributed Hash Table
An entity with key K falls under the jurisdiction of the node
  with the smallest id >= K

•   For example, suppose we have two nodes, one at position 50 and another at position 200.
•   If we want to store a key/value pair in the DHT and the key hash is 100, it would go
    to node 200.
•   Another key hash of 30 would go to node 50.
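
The rule above can be sketched in a few lines of Python. The function name `responsible_node` is an assumption for this example, and the wrap-around case (a key hash larger than every node id goes to the lowest node) follows the circle picture from the earlier slides.

```python
from bisect import bisect_left


def responsible_node(node_ids, key_hash):
    """Return the node with the smallest id >= key_hash, wrapping around the ring."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key_hash)     # index of the first id >= key_hash
    return ring[i % len(ring)]          # past the highest id: back to the start


nodes = [50, 200]
print(responsible_node(nodes, 100))     # 200
print(responsible_node(nodes, 30))      # 50
print(responsible_node(nodes, 250))     # 50 (wraps around)
```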




Consistent Hashing
A naïve approach with 8 nodes and 100 keys could use a simple modulo algorithm.
So key 18 would end up on node 2 and key 63 on node 7.
But how do we handle servers crashing or new servers joining the system? With modulo placement, changing the number of nodes remaps most of the keys.
Consistent hashing handles this issue.
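
To make the difference concrete, the sketch below counts how many of the 100 keys change node when a ninth node joins, first with modulo placement and then with a simple hash ring. The node names (`node0` to `node8`), the use of MD5 and the absence of virtual nodes are assumptions made for this example.

```python
import hashlib
from bisect import bisect_left


def h(value):
    """128-bit hash of a key or node name."""
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)


def modulo_node(key, n_nodes):
    return key % n_nodes


def build_ring(names):
    """Place each node on the ring at the hash of its name."""
    return sorted((h(name), name) for name in names)


def ring_node(key, ring):
    """First node clockwise from the key's hash."""
    positions = [pos for pos, _ in ring]
    i = bisect_left(positions, h(key))
    return ring[i % len(ring)][1]


keys = range(100)
moved_modulo = sum(modulo_node(k, 8) != modulo_node(k, 9) for k in keys)

ring8 = build_ring([f"node{i}" for i in range(8)])
ring9 = build_ring([f"node{i}" for i in range(9)])
moved_ring = sum(ring_node(k, ring8) != ring_node(k, ring9) for k in keys)

print(moved_modulo)   # 84 of the 100 keys change node under modulo placement
print(moved_ring)     # only keys now owned by node8 move on the ring
```

On the ring, a key can only move to the newly added node; every other key stays where it was.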




Consistent Hashing




Source: http://offthelip.org/
Distributed Hash Table




Consistent Hashing




                                                       Source: http://horicky.blogspot.in



Chord System

[Figure: Chord ring with the finger tables of several nodes, showing how key K = 26 is resolved]

FTp[i] = succ(p + 2^(i-1))

Finger tables legible in the figure (entries for i = 1 to 5):
Node 1:    4,  4,  9,  9, 18
Node 18:  20, 20, 28, 28,  4
Node 20:  21, 28, 28, 28,  4
Node 21:  28, 28, 28,  1,  9
Node 28:   1,  1,  1,  4, 14

Process of determining node
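To look up a key k, node p forwards the request to the node q at index j in p’s finger table
    such that
q = FTp[j] <= k < FTp[j+1]
To resolve k = 26, starting at node 1:
1. 26 > FT1[5] = 18. Hence the request is forwarded to node 18.
2. FT18[2] = 20 <= 26 < FT18[3] = 28. Hence the request is forwarded to node 20.
3. FT20[1] = 21 <= 26 < FT20[2] = 28. Hence the request is forwarded to node 21.
4. 21 < 26 <= FT21[1] = 28. Hence node 28 is responsible for key 26.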
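
The lookup can be simulated with a short Python sketch. The 5-bit identifier space and the exact set of node ids in `NODES` are assumptions chosen so that the finger-table entries quoted in the steps above come out the same; the function names are made up for this example.

```python
M = 5                                   # identifier bits (assumed)
RING = 2 ** M                           # ring size: ids 0..31
NODES = [1, 4, 9, 11, 14, 18, 20, 21, 28]   # assumed node ids


def succ(k):
    """First node id >= k, wrapping around the ring."""
    k %= RING
    return min((n for n in NODES if n >= k), default=min(NODES))


def finger_table(p):
    """FTp[i] = succ(p + 2^(i-1)) for i = 1..M."""
    return [succ(p + 2 ** (i - 1)) for i in range(1, M + 1)]


def dist(a, b):
    """Clockwise distance from a to b on the ring."""
    return (b - a) % RING


def lookup(start, k):
    """Return the list of nodes visited; the last one is responsible for k."""
    path, p = [start], start
    while dist(p, k) != 0:
        ft = finger_table(p)
        if dist(p, k) <= dist(p, ft[0]):        # k lies between p and its successor
            path.append(ft[0])
            break
        # otherwise jump to the farthest finger that does not overshoot k
        candidates = [f for f in ft if dist(p, f) <= dist(p, k)]
        p = max(candidates, key=lambda f: dist(p, f))
        path.append(p)
    return path


print(finger_table(1))      # [4, 4, 9, 9, 18]
print(lookup(1, 26))        # [1, 18, 20, 21, 28]
```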




Hashing efficiency of Chord System
The Chord system reaches the responsible node in O(log n) steps.
Other hashing techniques reach it in O(1) steps but need a larger local table. For
   example attains a O(1) hashing method.




Joining the Chord System

Suppose node p wants to join. It performs the following steps:
- Requests a lookup for succ(p+1)
- Inserts itself before this node




Maintaining consistency
Periodically each node checks its successor’s predecessor.
Node q contacts succ(q+1) and requests it to return pred(succ(q+1)).
If q = pred(succ(q+1)) then nothing has changed. If the successor returns another value,
     then q knows that a new node p has joined the system with
q < p < succ(q+1), so q updates its finger table by
setting FTq[1] = p.




CAP Theorem
Databases that are designed based on ACID properties have poor availability.

Postulated by Eric Brewer of the University of California, Berkeley
At most 2 of the 3 properties can be guaranteed at the same time in a distributed system
C – Consistency
A – Availability
P – Partition Tolerance




CAP Theorem
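•   Consistency – Every read returns the most recent write, so repeated reads give the same value
    whichever replica serves them
•   Availability – Every request receives a response, even when some servers have crashed
•   Partition Tolerance – The system keeps working even when network failures split the servers into
    groups that cannot communicate with each other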




Real world examples of CAP Theorem

Amazon’s Dynamo chooses availability over consistency. Dynamo implements
   eventual consistency, where data becomes consistent over time.
Google’s BigTable chooses consistency over availability.

Consistency, Partition Tolerance (CP)
BigTable
HBase

Availability, Partition Tolerance (AP)
Dynamo
Voldemort
Cassandra



Consistency issues
Data replication as used in many commercial systems performs synchronous replica
    coordination to provide strongly consistent data.
The downside of this approach is poor availability.
These systems treat the data as unavailable if they are not able to ensure
    consistency.
For example:
If data is replicated on 5 servers and an update needs to be made, then the following
    has to be done
- Update all 5 copies
- Ensure all of them are successful
- If one of them fails, roll back the updates on the other 4

If a read is done when one of the servers has failed, a strongly consistent system would return
     “data unavailable” when correctness cannot be determined.
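
The all-or-nothing update can be sketched as follows. Each replica is modelled as a plain dictionary, and the `failing` argument is a made-up way of simulating a replica that rejects the write; real systems would use a commit protocol between servers.

```python
_MISSING = object()     # marks "key was not present before the update"


def replicate_all(replicas, key, value, failing=frozenset()):
    """Write to every replica; if any write fails, undo the ones already done."""
    previous = [r.get(key, _MISSING) for r in replicas]
    for i, replica in enumerate(replicas):
        if i in failing:                        # simulated failure on replica i
            for j in range(i):                  # roll back copies already updated
                if previous[j] is _MISSING:
                    del replicas[j][key]
                else:
                    replicas[j][key] = previous[j]
            return False
        replica[key] = value
    return True


replicas = [{"x": 1} for _ in range(5)]
print(replicate_all(replicas, "x", 2))                  # True: all 5 copies updated
print(replicate_all(replicas, "x", 3, failing={4}))     # False: rolled back
print(replicas[0]["x"])                                 # 2
```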

Quorum Protocol
To maintain consistency, data is replicated on many servers.
For example, let us assume there are N servers in the system.
Typical algorithms require a write to reach more than N/2 servers, i.e. at least N/2 + 1 (using integer division).
Usually Nw > N/2
A write is successful if it has been successfully committed on at least N/2 + 1 servers.
This is known as the write quorum.




Quorum Protocol
Similarly, reads are done from an arbitrary number of server replicas, Nr. This
   is known as the read quorum.
Values read from the different servers are compared.
A consistent design requires that Nw + Nr > N, so that every read quorum overlaps every write quorum.
With this you are assured of reading your writes.
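
A small sketch of why Nw + Nr > N matters: it checks the two conditions and, by brute force, whether every possible read quorum shares at least one replica with every possible write quorum. The function names are assumptions for this example.

```python
from itertools import combinations


def is_consistent_quorum(n, nw, nr):
    """Conditions from the slides: Nw > N/2 and Nw + Nr > N."""
    return nw > n / 2 and nw + nr > n


def quorums_always_overlap(n, nw, nr):
    """True if every write quorum intersects every read quorum."""
    replicas = range(n)
    return all(
        set(w) & set(r)
        for w in combinations(replicas, nw)
        for r in combinations(replicas, nr)
    )


print(is_consistent_quorum(5, 3, 3), quorums_always_overlap(5, 3, 3))   # True True
print(is_consistent_quorum(5, 3, 2), quorums_always_overlap(5, 3, 2))   # False False
```

With N = 5, Nw = 3 and Nr = 2, a read could hit exactly the two replicas the write missed, which is why the sum has to exceed N.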




Election Algorithm
Many distributed systems have one process that acts as a coordinator. If
   the coordinator crashes, an election takes place to identify the new
   coordinator. Any process P that notices the failure can start the election:
1. P sends an ELECTION message to all higher-numbered processes
2. If no one responds, P becomes the coordinator
3. If a higher-numbered process answers, it takes over the election process
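
These steps match what is usually called the bully algorithm. The sketch below is a simplified simulation, assuming process ids are integers and that `alive` is the set of processes that would answer; it follows a single chain of takeovers rather than modelling real messages.

```python
def elect(initiator, alive):
    """Return the coordinator chosen when `initiator` starts an election."""
    higher = sorted(p for p in alive if p > initiator)  # those that answer ELECTION
    if not higher:
        return initiator                # no one responded: initiator wins
    return elect(higher[0], alive)      # a higher process takes over the election


alive = {1, 2, 3, 5, 6}                 # hypothetical: old coordinator 7 has crashed
print(elect(2, alive))                  # 6
```

Whoever starts the election, the highest-numbered live process ends up as coordinator.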




Traditional Fault Tolerance
Traditional systems use redundancy to handle failures and tolerate faults, as
   shown below




[Figure: two Active/Standby server pairs]



Process Resilience
Handling failures in distributed systems is much more difficult because no machine has a
   view of the global state.




Byzantine Failures
Byzantine refers to the Byzantine Generals Problem, in which an army must unanimously
   decide whether to attack another army. The problem is complicated because the
   generals must use messengers to communicate, and by the presence of traitors.

Distributed systems are prone to a type of failure known as Byzantine failure
Omission failures – Disk crashes, network congestion, failure to receive a request, etc.
Commission failures – The server behaves incorrectly, corrupts local state, etc.

Solution: To handle Byzantine failures where k processes are faulty, have
    a minimum of 2k+1 processes, so that k+1 correct replies remain even if k processes
    are behaving incorrectly.
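
A minimal sketch of the 2k+1 idea, assuming a client simply collects one reply from each process and accepts a value only when at least k+1 replies agree. The function name and the sample replies are made up for this example.

```python
from collections import Counter


def vote(replies, k):
    """Accept a value only if at least k+1 of the replies agree on it."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= k + 1 else None


# k = 2 faulty processes, so 2k+1 = 5 replies are collected
print(vote([42, 42, 42, 7, 99], k=2))   # 42: three correct replies outvote two bad ones
print(vote([42, 42, 7, 7, 99], k=2))    # None: no value reaches k+1 votes
```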



Checkpointing
In fault-tolerant distributed computing, backward error recovery requires that the
    system save its state at regular intervals. We need to create a consistent
    global state, called a distributed snapshot.

In a distributed snapshot, if a process P has recorded the receipt of a message, then
    there should be a process Q that has recorded sending the corresponding message.

Each process saves its state from time to time.
To recover, we need to construct a consistent global state from these local states.
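
The consistency condition can be checked mechanically. The sketch below assumes each local state records the ids of the messages it has sent and received; the message ids and the dictionary layout are hypothetical.

```python
def is_consistent(snapshot):
    """True if every message recorded as received is also recorded as sent."""
    sent = set().union(*(state["sent"] for state in snapshot.values()))
    received = set().union(*(state["received"] for state in snapshot.values()))
    return received <= sent


good = {
    "P": {"sent": set(), "received": {"m1"}},
    "Q": {"sent": {"m1"}, "received": set()},
}
bad = {
    "P": {"sent": set(), "received": {"m2"}},   # P saw m2 ...
    "Q": {"sent": set(), "received": set()},    # ... but Q's saved state predates sending it
}
print(is_consistent(good))   # True
print(is_consistent(bad))    # False
```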




Gossip Protocol
Used to handle server crashes and servers joining the system.
Changes to the distributed system, such as membership changes, are spread
   in a manner similar to gossiping
- A server picks another server at random and sends it a message about a
   server crash or a server joining
- If the receiver has already received this message, the message is dropped
- Otherwise the receiving server similarly gossips to other servers, and the system
   soon reaches a steady state
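
A rough simulation of the spreading process, assuming that in each round every server that already knows the news tells one randomly chosen peer. The server names, the round-based model and the seeded random generator are assumptions made for this example.

```python
import random


def gossip(servers, origin, rng):
    """Spread one membership update; return the number of rounds until all know it."""
    informed = {origin}
    rounds = 0
    while len(informed) < len(servers):
        rounds += 1
        for sender in list(informed):       # everyone who knows gossips once per round
            peer = rng.choice([s for s in servers if s != sender])
            informed.add(peer)              # a peer that already knows just drops it
    return rounds


servers = [f"s{i}" for i in range(20)]      # hypothetical server names
rounds = gossip(servers, "s0", random.Random(1))
print(rounds)                               # number of rounds for this seed
```

Because the number of informed servers can at most double each round, 20 servers need at least 5 rounds; in practice the count stays close to that.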




Sloppy Quorum
The quorum protocol is applied to the first N healthy nodes rather than the first N nodes encountered while walking
   clockwise around the ring.

Data meant for Node A is sent to Node D if A is temporarily down.
Node D keeps a hint in its metadata (hinted handoff) and delivers the data to Node A when A is back up.
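
The choice of write targets can be sketched as below. The preference list A to E, the value N = 3 and the function name are assumptions for this example; each target is paired with a hint naming the node the data was really meant for, or `None` if the node is one of the intended N.

```python
def write_targets(preference_list, healthy, n):
    """Pick n healthy nodes; stand-ins carry a hint naming the node they cover for."""
    intended = preference_list[:n]
    down = [node for node in intended if node not in healthy]
    stand_ins = [node for node in preference_list[n:] if node in healthy]
    targets = [(node, None) for node in intended if node in healthy]
    for failed, stand_in in zip(down, stand_ins):
        targets.append((stand_in, failed))  # hint: hand the data back to `failed`
    return targets


ring_order = ["A", "B", "C", "D", "E"]      # nodes met walking clockwise from the key
targets = write_targets(ring_order, healthy={"B", "C", "D", "E"}, n=3)
print(targets)      # [('B', None), ('C', None), ('D', 'A')]
```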




Thank You !



                             Tinniam V Ganesh
                             tvganesh.85@gmail.com
                             Read my blogs: http://gigadom.wordpress.com/

                                                http://savvydom.wordpress.com/





More Related Content

What's hot

file sharing semantics by Umar Danjuma Maiwada
file sharing semantics by Umar Danjuma Maiwada file sharing semantics by Umar Danjuma Maiwada
file sharing semantics by Umar Danjuma Maiwada
umardanjumamaiwada
 
Hybrid Cloud and Its Implementation
Hybrid Cloud and Its ImplementationHybrid Cloud and Its Implementation
Hybrid Cloud and Its Implementation
Sai P Mishra
 
Storage area network
Storage area networkStorage area network
Storage area network
Syed Ubaid Ali Jafri
 
Fog Computing
Fog ComputingFog Computing
Fog Computing
Pachipulusu Giridhar
 
Storage Area Network(SAN)
Storage Area Network(SAN)Storage Area Network(SAN)
Storage Area Network(SAN)
Krishna Kahar
 
Distributed System ppt
Distributed System pptDistributed System ppt
OSI Reference Model and TCP/IP (Lecture #3 ET3003 Sem1 2014/2015)
OSI Reference Model and TCP/IP (Lecture #3 ET3003 Sem1 2014/2015)OSI Reference Model and TCP/IP (Lecture #3 ET3003 Sem1 2014/2015)
OSI Reference Model and TCP/IP (Lecture #3 ET3003 Sem1 2014/2015)
Tutun Juhana
 
Physical and Logical Clocks
Physical and Logical ClocksPhysical and Logical Clocks
Physical and Logical Clocks
Dilum Bandara
 
edge computing seminar report.pdf
edge computing seminar report.pdfedge computing seminar report.pdf
edge computing seminar report.pdf
firstlast467690
 
Database System Architectures
Database System ArchitecturesDatabase System Architectures
Database System Architectures
Information Technology
 
Seminar ppt fog comp
Seminar ppt fog compSeminar ppt fog comp
Seminar ppt fog comp
Mahantesh Hiremath
 
Networking in cloud computing
Networking in cloud computingNetworking in cloud computing
Networking in cloud computing
Barani Tharan
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
Amritanshu Mehra
 
CLOUD COMPUTING UNIT-1
CLOUD COMPUTING UNIT-1CLOUD COMPUTING UNIT-1
CLOUD COMPUTING UNIT-1
Dr K V Subba Reddy
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big Data
Mrinal Kumar
 
Distributed file system
Distributed file systemDistributed file system
Distributed file system
Anamika Singh
 
Fog computing 000
Fog computing 000Fog computing 000
Fog computing 000
pranjali rawke
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
Shubham Parmar
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS
Dr Neelesh Jain
 
Domain name system (dns) , TELNET ,FTP, TFTP
Domain name system (dns) , TELNET ,FTP, TFTPDomain name system (dns) , TELNET ,FTP, TFTP
Domain name system (dns) , TELNET ,FTP, TFTP
saurav kumar
 

What's hot (20)

file sharing semantics by Umar Danjuma Maiwada
file sharing semantics by Umar Danjuma Maiwada file sharing semantics by Umar Danjuma Maiwada
file sharing semantics by Umar Danjuma Maiwada
 
Hybrid Cloud and Its Implementation
Hybrid Cloud and Its ImplementationHybrid Cloud and Its Implementation
Hybrid Cloud and Its Implementation
 
Storage area network
Storage area networkStorage area network
Storage area network
 
Fog Computing
Fog ComputingFog Computing
Fog Computing
 
Storage Area Network(SAN)
Storage Area Network(SAN)Storage Area Network(SAN)
Storage Area Network(SAN)
 
Distributed System ppt
Distributed System pptDistributed System ppt
Distributed System ppt
 
OSI Reference Model and TCP/IP (Lecture #3 ET3003 Sem1 2014/2015)
OSI Reference Model and TCP/IP (Lecture #3 ET3003 Sem1 2014/2015)OSI Reference Model and TCP/IP (Lecture #3 ET3003 Sem1 2014/2015)
OSI Reference Model and TCP/IP (Lecture #3 ET3003 Sem1 2014/2015)
 
Physical and Logical Clocks
Physical and Logical ClocksPhysical and Logical Clocks
Physical and Logical Clocks
 
edge computing seminar report.pdf
edge computing seminar report.pdfedge computing seminar report.pdf
edge computing seminar report.pdf
 
Database System Architectures
Database System ArchitecturesDatabase System Architectures
Database System Architectures
 
Seminar ppt fog comp
Seminar ppt fog compSeminar ppt fog comp
Seminar ppt fog comp
 
Networking in cloud computing
Networking in cloud computingNetworking in cloud computing
Networking in cloud computing
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
CLOUD COMPUTING UNIT-1
CLOUD COMPUTING UNIT-1CLOUD COMPUTING UNIT-1
CLOUD COMPUTING UNIT-1
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big Data
 
Distributed file system
Distributed file systemDistributed file system
Distributed file system
 
Fog computing 000
Fog computing 000Fog computing 000
Fog computing 000
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS
 
Domain name system (dns) , TELNET ,FTP, TFTP
Domain name system (dns) , TELNET ,FTP, TFTPDomain name system (dns) , TELNET ,FTP, TFTP
Domain name system (dns) , TELNET ,FTP, TFTP
 

Similar to Design principles of scalable, distributed systems

Hbase hive pig
Hbase hive pigHbase hive pig
Hbase hive pig
Xuhong Zhang
 
Hbase hivepig
Hbase hivepigHbase hivepig
Hbase hivepig
Radha Krishna
 
HbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyHbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubey
Rohit Dubey
 
Rethinking State Management in Cloud-Native Streaming Systems
Rethinking State Management in Cloud-Native Streaming SystemsRethinking State Management in Cloud-Native Streaming Systems
Rethinking State Management in Cloud-Native Streaming Systems
Yingjun Wu
 
Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu...
Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu...Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu...
Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu...
HostedbyConfluent
 
Light sayed database_system_architecture
Light sayed database_system_architectureLight sayed database_system_architecture
Light sayed database_system_architecture
Sayed Ahmed
 
Light sayed database_system_architecture
Light sayed database_system_architectureLight sayed database_system_architecture
Light sayed database_system_architecture
Sayed Ahmed
 
MySQL Backed - Fraud Prevention
MySQL Backed - Fraud PreventionMySQL Backed - Fraud Prevention
MySQL Backed - Fraud Prevention
Ran Grushkowsky
 
Csc concepts
Csc conceptsCsc concepts
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
DataStax Academy
 
Keeping Data in Sync with Syncsort
Keeping Data in Sync with SyncsortKeeping Data in Sync with Syncsort
Keeping Data in Sync with Syncsort
Precisely
 
Database System Architecture
Database System ArchitectureDatabase System Architecture
Database System Architecture
University of Potsdam
 
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo
 
You got a couple Microservices, now what? - Adding SRE to DevOps
You got a couple Microservices, now what?  - Adding SRE to DevOpsYou got a couple Microservices, now what?  - Adding SRE to DevOps
You got a couple Microservices, now what? - Adding SRE to DevOps
Gonzalo Maldonado
 
VEDAViz for ETSAP partners
VEDAViz for ETSAP partnersVEDAViz for ETSAP partners
VEDAViz for ETSAP partners
IEA-ETSAP
 
Data data everywhere
Data data everywhereData data everywhere
Data data everywhere
Metron
 
Scaling systems using change propagation across data stores
Scaling systems using change propagation across data storesScaling systems using change propagation across data stores
Scaling systems using change propagation across data stores
Jagadeesh Huliyar
 
XPDS13: In-Guest Mechanism to Strengthen Guest Separation - Philip Tricca, Ci...
XPDS13: In-Guest Mechanism to Strengthen Guest Separation - Philip Tricca, Ci...XPDS13: In-Guest Mechanism to Strengthen Guest Separation - Philip Tricca, Ci...
XPDS13: In-Guest Mechanism to Strengthen Guest Separation - Philip Tricca, Ci...
The Linux Foundation
 
Case Study with Answers.com on Scaling with Memcached and MySQL
Case Study with Answers.com on Scaling with Memcached and MySQLCase Study with Answers.com on Scaling with Memcached and MySQL
Case Study with Answers.com on Scaling with Memcached and MySQL
answers
 
DBMS Bascis
DBMS BascisDBMS Bascis

Similar to Design principles of scalable, distributed systems (20)

Hbase hive pig
Hbase hive pigHbase hive pig
Hbase hive pig
 
Hbase hivepig
Hbase hivepigHbase hivepig
Hbase hivepig
 
HbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyHbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubey
 
Rethinking State Management in Cloud-Native Streaming Systems
Rethinking State Management in Cloud-Native Streaming SystemsRethinking State Management in Cloud-Native Streaming Systems
Rethinking State Management in Cloud-Native Streaming Systems
 
Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu...
Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu...Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu...
Rethinking State Management in Cloud-Native Streaming Systems With Yingjun Wu...
 
Light sayed database_system_architecture
Light sayed database_system_architectureLight sayed database_system_architecture
Light sayed database_system_architecture
 
Light sayed database_system_architecture
Light sayed database_system_architectureLight sayed database_system_architecture
Light sayed database_system_architecture
 
MySQL Backed - Fraud Prevention
MySQL Backed - Fraud PreventionMySQL Backed - Fraud Prevention
MySQL Backed - Fraud Prevention
 
Csc concepts
Csc conceptsCsc concepts
Csc concepts
 
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
Design principles of scalable, distributed systems

  • 1. Design Principles of Scalable, Distributed Systems Tinniam V Ganesh tvganesh.85@gmail.com 03/28/12 Tinniam V Ganesh - http://gigadom.wordpress.com 1
  • 2. Distributed Systems: There are two classes of systems: monolithic and distributed.
  • 3. Traditional Client-Server Architecture [Diagram: a client and a server]
  • 4. Properties of Distributed Systems: Distributed systems are made up of hundreds of commodity servers • No machine has complete information about the system state • Machines make decisions based on local information • Failure of one machine does not bring down the system • There is no implicit assumption about a global clock
  • 5. Characteristics of Distributed Systems: Distributed systems have the following characteristics • They are built from commodity servers • They use a large number of servers • Servers crash, networks fail, and messages may not be sent or received • New servers can join without changing the system's behavior
  • 6. Examples of Distributed Systems • Amazon’s e-retail store • Google • Yahoo • Facebook • Twitter • YouTube, etc.
  • 7. Key principles of distributed systems • Incremental scalability • Symmetry: all nodes are equal • Decentralization: no central control • Work distribution heterogeneity
  • 8. Transaction Processing System • Traditional databases have to ensure that transactions are consistent. A transaction must complete fully or not at all. • Successful transactions are committed. • Otherwise, transactions are rolled back.
  • 9. ACID postulate: Earlier systems were designed for the ACID properties, which every transaction in a traditional system must have. A: Atomic, C: Consistent, I: Isolated, D: Durable
  • 10. ACID. Atomic: each transaction happens completely or not at all. Consistent: the transaction maintains system invariants. For example, after an internal bank transfer the total amount in the bank should be the same as before the transfer, although it may differ temporarily while the transfer is in progress. Isolated: concurrent transactions are isolated from one another, or serializable; it must appear that they happen sequentially in some particular order. Durable: once the transaction commits, its effect is permanent.
  • 11. Scaling: There are 2 types of scaling. Vertical scaling: scales by adding a faster CPU, more memory and a larger database; it does not scale beyond a particular point. Horizontal scaling: scales laterally by adding more servers of the same capacity.
  • 12. System behavior on Scaling [Graph: axes labelled Response Time, Throughput (Transactions Per Second) and Load]
  • 13. Consistency and Replication: To increase reliability against failures, data has to be replicated across multiple servers. The problem with replicas is the need to keep the data consistent.
  • 14. Reasons for Replication: Data is replicated in distributed systems for two reasons - Reliability: ensuring that the data is consistent in a majority of the replicas - Performance: performance can be improved by accessing a replica that is closer to the user. Geographical resiliency
  • 15. Downside of Replication • Replication of data has several advantages, but the downside is the issue of maintaining consistency • A modification to one copy makes it different from the rest, and this update has to be propagated to all copies to ensure consistency
  • 16. Synchronization: No machine has a view of the global system state. Problems with distributed systems: • How can processes synchronize? • Clocks on different systems will be slightly different • Is there a way to maintain a global view of the clock? • Can we order events causally?
  • 17. Hypothetical situation: Consider a hypothetical situation with banks - A man deposits Rs 55,000/- at 10.00 am - The man withdraws Rs 20,000/- at 10.02 am. What will happen if the updates happen in a different order? - Operations must be idempotent. Idempotency refers to getting the same result no matter how many times the operation is performed. eCommerce site (Amazon) - add to shopping cart - delete from shopping cart
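One way to picture the idempotency requirement from slide 17 is an account that remembers which operations it has already applied. This is a minimal sketch, not the method of any particular bank or database: the `Account` class, the `apply` method and the operation ids `dep-1` and `wd-1` are all hypothetical names chosen for illustration.

```python
class Account:
    """Toy account whose updates are idempotent (hypothetical design)."""

    def __init__(self):
        self.balance = 0
        self.applied = set()   # ids of operations already applied

    def apply(self, op_id, amount):
        # A replayed operation is recognised by its id and ignored,
        # so applying it once or many times gives the same balance.
        if op_id in self.applied:
            return
        self.applied.add(op_id)
        self.balance += amount


account = Account()
account.apply("dep-1", 55000)    # deposit Rs 55,000
account.apply("wd-1", -20000)    # withdraw Rs 20,000
account.apply("dep-1", 55000)    # duplicate delivery of the deposit
print(account.balance)           # 35000
```

Because each update is tagged and the amounts simply add up, the final balance here is also the same if the withdrawal message arrives before the deposit.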
  • 18. Vector Clocks: Vector clocks are used to capture causality between different versions of the same object. Amazon’s Dynamo uses vector clocks to reconcile different versions of the objects and determine the causal ordering of events.
  • 19. Vector Clocks [Figure: clock values of three processes ticking at different rates; message arrivals are marked OK or Adjust]
  • 20. Dynamo’s reconciliation process [Figure]
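The causality check that the vector clock slides describe can be sketched as a comparison of two clocks, each stored as a mapping from node name to counter. This is an illustrative sketch only: the function name `compare` and the node names `Sx`, `Sy` and `Sz` are assumptions, and it is not taken from Dynamo's code.

```python
def compare(a, b):
    """Compare two vector clocks given as {node: counter} dicts.

    Returns "before" if a happened before b, "after" if b happened
    before a, "equal" if they are identical, and "concurrent" if
    neither version descends from the other.
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


v1 = {"Sx": 2}
v2 = {"Sx": 2, "Sy": 1}   # written via node Sy after seeing v1
v3 = {"Sx": 2, "Sz": 1}   # written via node Sz after seeing v1

print(compare(v1, v2))    # before
print(compare(v2, v3))    # concurrent
```

A "concurrent" result is the case that needs reconciliation: neither version can be discarded automatically, so both have to be kept until they are merged.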
  • 21. Problem with Relational Databases: Relational databases let the user construct complex queries, but they do not scale well. Problem: performance deteriorates as the number of records reaches several million. Solution: partition the database horizontally and distribute records across several servers.
  • 22. NoSQL Databases • Databases are horizontally partitioned • Simple queries based on gets() and sets() • Access is by key/value pairs • Cannot do complex queries like joins • A database can contain several hundred million records
  • 23. Databases that use Consistent Hashing 1. Cassandra 2. Amazon’s Dynamo 3. NoSQL 4. HBase 5. CouchDB 6. MongoDB
  • 24. Hash Tables • Distribute records among many servers • Distribution is based on hashed keys • Key: 128 or 160 bits • Hash values fall into a range, and the servers are visualized as lying on the circumference of a circle, traversed clockwise
  • 25. Distributed Hash Table • Hashing a key leads to a server; the servers are assumed to reside on the circumference of a circle • The highest key wraps around to the beginning of this circle • The movement is clockwise
  • 26. Distributed Hash Table: An entity with key K falls under the jurisdiction of the node with the smallest id >= K • For example, suppose we have two nodes, one at position 50 and another at position 200 • If we want to store a key/value pair in the DHT and the key hash is 100, it would go to node 200 • Another key hash of 30 would go to node 50
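The rule on slide 26 (the node with the smallest id >= K owns the key) can be sketched with a sorted list and a binary search. The function name `responsible_node` is a hypothetical choice, and the wrap-around for keys above the highest node id follows the circle described on slide 25.

```python
import bisect


def responsible_node(node_ids, key_hash):
    """Return the node with the smallest id >= key_hash.

    If the key is larger than every node id, the search wraps
    around to the first node on the circle.
    """
    ids = sorted(node_ids)
    i = bisect.bisect_left(ids, key_hash)   # first index with ids[i] >= key_hash
    return ids[i % len(ids)]


nodes = [50, 200]
print(responsible_node(nodes, 100))   # 200
print(responsible_node(nodes, 30))    # 50
print(responsible_node(nodes, 250))   # 50 (wraps around the circle)
```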
  • 27. Consistent Hashing: A naïve approach with 8 nodes and 100 keys could use a simple modulo algorithm, so key 18 would end up on node 2 and key 63 on node 7. But how do we handle servers crashing or new servers joining the system? Under the modulo scheme, changing the number of nodes remaps most keys. Consistent hashing handles this issue.
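To make the contrast on slide 27 concrete, the sketch below counts how many of the 100 keys change node when a ninth node joins, first with the modulo scheme and then with a simple hash ring. The ring is a bare-bones illustration under stated assumptions: node labels such as `node-0`, the helper names, and the use of MD5 (a 128-bit hash) to place nodes and keys are all choices made for this example, and there are no virtual nodes.

```python
import bisect
import hashlib


def ring_position(label):
    """Map a label to a 128-bit position on the ring (MD5 is an assumption)."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)


def build_ring(node_names):
    """Return (sorted positions, node names in the same order)."""
    pairs = sorted((ring_position(name), name) for name in node_names)
    return [pos for pos, _ in pairs], [name for _, name in pairs]


def ring_owner(positions, owners, key):
    """First node clockwise from the key's position, with wrap-around."""
    i = bisect.bisect_left(positions, ring_position(f"key-{key}"))
    return owners[i % len(owners)]


keys = range(100)

# Modulo scheme: going from 8 to 9 nodes moves every key with k % 8 != k % 9.
moved_modulo = sum(1 for k in keys if k % 8 != k % 9)
print(moved_modulo)   # 84

# Hash ring: only keys that now fall to the new node change owner.
old_ring = build_ring([f"node-{i}" for i in range(8)])
new_ring = build_ring([f"node-{i}" for i in range(9)])
moved_ring = [k for k in keys
              if ring_owner(*old_ring, k) != ring_owner(*new_ring, k)]
print(len(moved_ring), "keys moved, all to",
      {ring_owner(*new_ring, k) for k in moved_ring} or "no node")
```

The exact number of keys the ring moves depends on where the hash happens to place the new node, but every key that moves goes to the new node and nowhere else, which is the property the modulo scheme lacks.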
  • 28. Consistent Hashing Source: http://offthelip.org/
  • 29. Distributed Hash Table
  • 30. Consistent Hashing Source: http://horicky.blogspot.in
  • 31. Chord System [Figure: a Chord ring with the finger table of each node, showing the resolution of K = 26. Finger table entries are defined by FTp[i] = succ(p + 2^(i-1))]
  • 32. Process of determining node: To look up a key k, node p forwards the request to the node q with index j in p’s finger table such that q = FTp[j] <= k < FTp[j+1]. To resolve k = 26 starting at node 1: 1. 26 > FT1[5] = 18, hence the request is forwarded to node 18 2. FT18[2] <= 26 < FT18[3], hence it is forwarded to node 20 3. FT20[1] <= 26 < FT20[2], hence it is forwarded to node 21 4. 26 <= FT21[1] = 28, hence node 28 is responsible for key 26
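The lookup on slide 32 can be traced with a small simulation. This is a sketch under explicit assumptions: the slide's figure did not survive extraction, so the node ids below (1, 4, 9, 11, 14, 18, 20, 21, 28) and the 5-bit identifier space are assumed values chosen to be consistent with the finger table entries quoted on the slide (FT1[5] = 18, FT21[1] = 28). The function names are hypothetical.

```python
M = 5                                        # assumed identifier bits
RING = 2 ** M                                # ids live in [0, 32)
NODES = [1, 4, 9, 11, 14, 18, 20, 21, 28]    # assumed node ids, sorted


def succ(k):
    """Smallest node id >= k, wrapping around the ring."""
    k %= RING
    for n in NODES:
        if n >= k:
            return n
    return NODES[0]


def finger_table(p):
    """FTp[i] = succ(p + 2^(i-1)) for i = 1..M (list index 0 holds FTp[1])."""
    return [succ(p + 2 ** (i - 1)) for i in range(1, M + 1)]


def in_interval(x, a, b, closed_right=False):
    """True if x lies in the circular interval (a, b), or (a, b] if closed_right."""
    if closed_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b


def lookup(start, key):
    """Return the list of nodes visited while resolving key from start."""
    path = [start]
    p = start
    while True:
        ft = finger_table(p)
        # If the key lies between p and its successor, the successor owns it.
        if in_interval(key, p, ft[0], closed_right=True):
            path.append(ft[0])
            return path
        # Otherwise forward to the farthest finger that still precedes the key.
        p = next(f for f in reversed(ft) if in_interval(f, p, key))
        path.append(p)


print(finger_table(1))   # [4, 4, 9, 9, 18]
print(lookup(1, 26))     # [1, 18, 20, 21, 28]
```

With these assumed node ids the request visits nodes 18, 20 and 21 before reaching node 28, which matches the four steps listed on the slide.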
  • 33. Hashing efficiency of Chord System: The Chord system reaches the responsible node in O(log n) steps. There are other hashing techniques that resolve a key in O(1) steps but use a larger local table. For example attains a O(1) hashing method.
  • 34. Joining the Chord System: Suppose node p wants to join. It performs the following steps - Requests a lookup for succ(p+1) - Inserts itself before this node
  • 35. Maintaining consistency: Periodically each node checks its successor’s predecessor. Node q contacts succ(q+1) and requests it to return pred(succ(q+1)). If q = pred(succ(q+1)) then nothing has changed. If the successor returns another value, then q knows that a new node p has joined the system with q < p < succ(q+1), so q updates its finger table by setting FTq[1] = p.
  • 36. CAP Theorem: Databases that are designed based on ACID properties have poor availability. The theorem was postulated by Eric Brewer of the University of California, Berkeley. At most 2 of the 3 properties can be provided at the same time in a distributed system. C: Consistency, A: Availability, P: Partition Tolerance
  • 37. CAP Theorem • Consistency: ability for repeated reads to provide the same value • Availability: ability to be resilient to server crashes • Partition Tolerance: ability to partition data between servers and always be able to get the data
  • 38. Real world examples of CAP Theorem: Amazon’s Dynamo chooses availability over consistency. Dynamo implements eventual consistency, where data becomes consistent over time. Google’s BigTable chooses consistency over availability. Consistency, Partition Tolerance (CP): BigTable, HBase. Availability, Partition Tolerance (AP): Dynamo, Voldemort, Cassandra.
  • 39. Consistency issues: Data replication as used in many commercial systems performs synchronous replica coordination to provide strongly consistent data. The downside of this approach is poor availability: these systems treat the data as unavailable if they are not able to ensure consistency. For example, if data is replicated on 5 servers and an update needs to be made, then the following has to be done - Update all 5 copies - Ensure all of them are successful - If one of them fails, roll back the updates on the other 4. If a read is done when one of the servers has failed, a strongly consistent system would return “data unavailable” because correctness cannot be determined.
  • 40. Quorum Protocol: To maintain consistency, data is replicated on many servers. For example, let us assume there are N servers in the system. Typical algorithms require a write to reach more than N/2 servers, that is, at least N/2 + 1, so Nw > N/2. A write is successful if it has been committed on N/2 + 1 servers. This is known as a write quorum.
  • 41. Quorum Protocol: Similarly, reads are done from an arbitrary number of server replicas, Nr. This is known as a read quorum. Reads from different servers are compared. A consistent design requires that Nw + Nr > N, so that every read quorum overlaps every write quorum. With this you are assured of reading your writes.
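The condition Nw + Nr > N from the two quorum slides can be checked by brute force for a small cluster: every possible write quorum must share at least one server with every possible read quorum. The function names below are hypothetical, and the exhaustive check is only practical for small N.

```python
from itertools import combinations


def write_quorum_size(n):
    """Smallest Nw with Nw > N/2, that is N/2 + 1 using integer division."""
    return n // 2 + 1


def quorums_overlap(n, nw, nr):
    """True if every write quorum of size nw intersects every read quorum of size nr."""
    servers = range(n)
    return all(set(w) & set(r)
               for w in combinations(servers, nw)
               for r in combinations(servers, nr))


n = 5
nw = write_quorum_size(n)
print(nw)                          # 3
print(quorums_overlap(n, nw, 3))   # True:  3 + 3 > 5
print(quorums_overlap(n, nw, 2))   # False: 3 + 2 is not > 5
```

When the quorums overlap, at least one server in any read set holds the latest successful write, which is why comparing the replies lets a reader see its own writes.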
  • 42. Election Algorithm: Many distributed systems have one process that acts as a coordinator. If the coordinator crashes, an election takes place to identify the new coordinator. A process P holds an election as follows: 1. P sends an ELECTION message to all higher-numbered processes 2. If no one responds, P becomes coordinator 3. If a higher-numbered process answers, it takes over the election process
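The three election steps on slide 42 resemble what is commonly called the bully algorithm. The sketch below simulates them in a single process, with no real messaging: the set `alive` stands in for the processes that would answer an ELECTION message, and letting the lowest-numbered responder take over is a simplification chosen for this example.

```python
def elect(initiator, alive):
    """Return the id of the process that ends up as coordinator.

    initiator: id of the process P that starts the election.
    alive:     set of ids of processes that are still running.
    """
    # Step 1: P "sends" ELECTION to every higher-numbered process;
    # only the ones that are alive can respond.
    responders = sorted(p for p in alive if p > initiator)
    # Step 2: nobody responded, so P becomes coordinator.
    if not responders:
        return initiator
    # Step 3: a higher-numbered process takes over the election.
    return elect(responders[0], alive)


# Processes 1..7 existed; 4, 6 and the old coordinator 7 have crashed.
alive = {1, 2, 3, 5}
print(elect(2, alive))   # 5
```

Whichever live process starts the election, the highest-numbered live process ends up as coordinator.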
  • 43. Traditional Fault Tolerance: Traditional systems use redundancy to handle failures and tolerate faults. [Diagram: two active/standby pairs]
  • 44. Process Resilience: Handling failures in distributed systems is much more difficult, as no system has any view of the global state.
  • 45. Byzantine Failures: The name refers to the Byzantine Generals Problem, in which an army must unanimously decide whether to attack another army. The problem is complicated because the generals must use messengers to communicate, and by the presence of traitors. Distributed systems are prone to a type of failure known as Byzantine failures. Omission failures: disk crashes, network congestion, failure to receive a request, etc. Commission failures: failures in which the server behaves incorrectly, corrupts local state, etc. Solution: to handle Byzantine failures where k processes are faulty, have a minimum of 2k+1 processes, so that we are left with k+1 correct replies given that k processes are behaving incorrectly.
  • 46. Checkpointing: In fault-tolerant distributed computing, backward error recovery requires that the system save its state at regular intervals. We need to create a consistent global state, called a distributed snapshot. In a distributed snapshot, if a process P has recorded the receipt of a message, then there should be a process Q that has recorded sending the corresponding message. Each process saves its state from time to time. To recover, we need to construct a consistent global state from these local states.
  • 47. Gossip Protocol: Used to handle server crashes and servers joining the system. Changes to the distributed system, such as membership changes, are spread in a manner similar to gossiping - A server picks another server at random and sends it a message about a server crash or a server joining - If the receiver has already received this message, it is dropped - Otherwise the receiving server similarly gossips to other servers, and the system soon reaches a steady state
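The spread of a membership change described on slide 47 can be simulated in rounds. This is a rough sketch rather than any production protocol: the round-based model, the function name `gossip`, the server names and the fixed random seed are all assumptions made so the example is self-contained and repeatable.

```python
import random


def gossip(nodes, origin, seed=0, max_rounds=1000):
    """Spread one message from origin; return (informed servers, rounds used)."""
    rng = random.Random(seed)      # fixed seed only to make runs repeatable
    informed = {origin}
    rounds = 0
    while len(informed) < len(nodes) and rounds < max_rounds:
        rounds += 1
        # Every server that knows the news picks one random peer per round.
        for sender in list(informed):
            peer = rng.choice([n for n in nodes if n != sender])
            if peer in informed:
                continue           # receiver already has the message: dropped
            informed.add(peer)     # receiver will gossip from the next round on
    return informed, rounds


servers = [f"server-{i}" for i in range(10)]
informed, rounds = gossip(servers, "server-0")
print(len(informed), "servers informed after", rounds, "rounds")
```

Because the number of informed servers can roughly double each round, the news typically reaches every server in a number of rounds that grows slowly with the size of the cluster.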
  • 48. Sloppy Quorum: The quorum protocol is applied to the first N healthy nodes, rather than strictly the first N nodes encountered walking clockwise around the ring. Data meant for Node A is sent to Node D if A is temporarily down. Node D keeps a hinted handoff in its metadata and uses it to update Node A when A is back up.
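The selection of nodes under a sloppy quorum, as described on slide 48, can be sketched as a clockwise walk that skips unhealthy nodes and records a hint for each stand-in. The five-node ring, the function name and the (node, hint) tuple format are hypothetical choices for this example, not Dynamo's actual data structures.

```python
def sloppy_quorum_targets(ring, start, n, down):
    """Pick the first n healthy nodes clockwise from index start.

    Returns a list of (node, hint) pairs. hint is None for a node that
    belongs to the strict first-n list, or the name of the unavailable
    node that a stand-in is holding data for (the hinted handoff).
    """
    strict = [ring[(start + i) % len(ring)] for i in range(n)]
    missing = [node for node in strict if node in down]
    targets = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)]
        if node in down:
            continue                                  # skip unhealthy nodes
        if node in strict:
            targets.append((node, None))
        else:
            targets.append((node, missing.pop(0)))    # stand-in with a hint
        if len(targets) == n:
            break
    return targets


ring = ["A", "B", "C", "D", "E"]
print(sloppy_quorum_targets(ring, 0, 3, down={"A"}))
# [('B', None), ('C', None), ('D', 'A')]
```

In this run Node D accepts the write intended for Node A and keeps the hint "A", so that it can hand the data over once A is reachable again.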
  • 49. Thank You! Tinniam V Ganesh tvganesh.85@gmail.com Read my blogs: http://gigadom.wordpress.com/ http://savvydom.wordpress.com/