Design Principles of Scalable, Distributed Systems
Tinniam V Ganesh
tvganesh.85@gmail.com
Distributed Systems
There are two classes of systems
- Monolithic
- Distributed
Traditional Client Server Architecture
[Diagram: clients communicating with a central server]
Properties of Distributed Systems
Distributed systems are made up of hundreds of commodity servers
• No machine has complete information about the system state
• Machines make decisions based on local information
• Failure of one machine does not bring the system down
• There is no implicit assumption of a global clock
Characteristics of Distributed Systems
Distributed systems are characterized by
• Commodity servers
• A large number of servers
• Servers crash, networks fail, and messages may not be sent or received
• New servers can join without changing the system's behavior
Examples of Distributed Systems
• Amazon’s e-retail store
• Google
• Yahoo
• Facebook
• Twitter
• YouTube
etc.
Key principles of distributed systems
• Incremental scalability
• Symmetry – All nodes are equal
• Decentralization – No central control
• Heterogeneity – Work distribution must account for nodes of differing capacity
Transaction Processing System
• Traditional databases have to ensure that transactions are consistent: a transaction must complete fully or not at all
• Successful transactions are committed
• Otherwise the transaction is rolled back
ACID postulate
Transactions in traditional systems must have the following properties.
Earlier systems were designed around the ACID properties:
A – Atomic
C – Consistent
I – Isolated
D – Durable
ACID
Atomic – This property ensures that each transaction happens completely or not at all.
Consistent – A transaction must preserve system invariants. For example, an internal bank transfer must leave the total amount of money in the bank the same before and after the transaction, even though the totals may differ momentarily while the transaction is in flight.
Isolated – Concurrent transactions must be isolated from one another, i.e. serializable: it must appear as if the transactions happened sequentially in some particular order.
Durable – Once a transaction commits, its effects are complete and durable going forward.
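To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module (SQLite transactions are ACID); the two-account ledger and the amounts are hypothetical:

```python
import sqlite3

# Hypothetical two-account ledger held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # rolled back: the total-balance invariant still holds

transfer(conn, "alice", "bob", 30)    # commits
transfer(conn, "alice", "bob", 500)   # rolls back; balances unchanged
print(dict(conn.execute("SELECT name, balance FROM accounts")))  # total is still 150
```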
Scaling
There are two types of scaling:
Vertical scaling – scaling up by adding a faster CPU, more memory and a larger database. This does not scale beyond a certain point.
Horizontal scaling – scaling out laterally by adding more servers of the same capacity.
System behavior on Scaling
[Charts: throughput (transactions per second) vs. load, and response time vs. load]
Consistency and Replication
In order to increase reliability in the face of failures, data has to be replicated across multiple servers.
The problem with replicas is the need to keep the data consistent.
Reasons for Replication
Data is replicated in distributed systems for two reasons
- Reliability – a majority of the replicas hold consistent data, so the system survives individual failures
- Performance – reads can be served by a replica closer to the user; replication also provides geographical resiliency
Downside of Replication
• Replication of data has several advantages, but the downside is the difficulty of maintaining consistency
• A modification to one copy makes it different from the rest, and the update has to be propagated to all copies to restore consistency
Synchronization
No machine has a view of the global system state. This raises several problems for distributed systems:
• How can processes synchronize?
• Clocks on different systems will differ slightly
• Is there a way to maintain a global view of the clock?
• Can we order events causally?
Hypothetical situation
Consider a hypothetical situation with a bank:
- A man deposits Rs 55,000/- at 10.00 am
- The man withdraws Rs 20,000/- at 10.02 am
What happens if different replicas apply these updates in a different order?
- Operations must be idempotent. Idempotency refers to getting the same result no matter how many times the operation is performed.
eCommerce site example – Amazon (see the sketch below):
- add to shopping cart
- delete from shopping cart
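A minimal sketch contrasting the two cases, with hypothetical cart and deposit operations: modelling the cart as a set makes add/delete idempotent, while the raw balance update is not:

```python
# Hypothetical shopping cart modelled as a set: cart operations are
# idempotent, so replaying a duplicate message leaves the same state.
cart = set()

def add_to_cart(item):
    cart.add(item)       # adding twice == adding once

def delete_from_cart(item):
    cart.discard(item)   # deleting an absent item is a no-op

# A raw bank-balance update, by contrast, is NOT idempotent:
balance = 0
def deposit(amount):
    global balance
    balance += amount    # replaying this message double-counts the deposit

for _ in range(3):       # the same messages delivered three times
    add_to_cart("book")
    deposit(55000)

print(cart)     # {'book'}  - unaffected by duplicates
print(balance)  # 165000    - corrupted by duplicates
```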
Vector Clocks
Vector clocks are used to capture causality between different versions of the same
object.
Amazon’s Dynamo uses vector clocks to reconcile different versions of the objects and
determine the causal ordering of events.
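A minimal vector-clock sketch (an illustration, not Dynamo's actual implementation): each process keeps one counter per process id, and element-wise comparison reveals whether one version causally precedes another or the two conflict:

```python
# Minimal vector clocks: a clock is a dict mapping process id -> counter.

def tick(clock, pid):
    """Local event or send at process pid: increment our own entry."""
    clock = dict(clock)
    clock[pid] = clock.get(pid, 0) + 1
    return clock

def merge(local, received, pid):
    """On receive: take the element-wise max, then tick our own entry."""
    merged = {p: max(local.get(p, 0), received.get(p, 0))
              for p in set(local) | set(received)}
    return tick(merged, pid)

def happened_before(a, b):
    """a -> b iff a <= b element-wise and a != b; otherwise concurrent/equal."""
    return (all(a.get(p, 0) <= b.get(p, 0) for p in set(a) | set(b))
            and a != b)

# Two replicas of the same object diverge after version v1.
v1 = tick({}, "A")   # {'A': 1}
v2 = tick(v1, "B")   # v1 extended at B: {'A': 1, 'B': 1}
v3 = tick(v1, "A")   # concurrent update at A: {'A': 2}

print(happened_before(v1, v2))                            # True: v2 descends from v1
print(happened_before(v2, v3), happened_before(v3, v2))   # False False: a conflict to reconcile
```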
Problem with Relational Databases
Relational databases (RDBMS) give the user the ability to construct complex queries, but they do not scale well.
Problem
Performance deteriorates as the number of records reaches several million.
Solution
Partition the database horizontally and distribute the records across several servers.
NoSQL Databases
• Databases are horizontally partitioned
• Simple queries based on get() and set() operations
• Accesses are made via key/value pairs (see the sketch below)
• Complex queries such as joins are not possible
• A database can contain several hundred million records
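A minimal sketch of this access pattern, using a hypothetical ShardedStore class: keys are hashed to pick a partition, and only get()/set() are exposed, so cross-partition joins are impossible by construction:

```python
import hashlib

class ShardedStore:
    """Records horizontally partitioned across shards by a hash of the key."""

    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]  # one dict per "server"

    def _shard(self, key):
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def set(self, key, value):
        self._shard(key)[key] = value

    def get(self, key, default=None):
        return self._shard(key).get(key, default)

store = ShardedStore(n_shards=4)
store.set("user:42:name", "alice")
print(store.get("user:42:name"))  # 'alice', served entirely by one shard
```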
Databases that use Consistent Hashing
1. Cassandra
2. Amazon’s Dynamo
3. HBase
4. CouchDB
5. MongoDB
Hash Tables
• Records are distributed among many servers
• Distribution is based on a hash of each record's key
• Keys are 128 or 160 bits
• Hash values fall into a range, and the servers are visualized as lying on the circumference of a circle, moving clockwise
Distributed Hash Table
• Hashing a key maps it to a server; the servers are assumed to reside on the circumference of a circle
• Past the highest key, the range wraps around to the beginning of the circle
• Movement along the ring is clockwise
Distributed Hash Table
An entity with key K falls under the jurisdiction of the node with the smallest id >= K.
• For example, suppose there are two nodes, one at position 50 and another at position 200.
• A key/value pair whose key hashes to 100 would go to node 200.
• Another key whose hash is 30 would go to node 50 (see the sketch below).
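The worked example above can be checked with a few lines of Python (a sketch of the successor rule, not any particular system's implementation):

```python
import bisect

def responsible_node(key_hash, node_ids):
    """Successor rule: the node with the smallest id >= key_hash owns the key,
    wrapping around the ring past the highest id."""
    nodes = sorted(node_ids)
    i = bisect.bisect_left(nodes, key_hash)   # first node id >= key_hash
    return nodes[i % len(nodes)]              # wrap back to the ring's start

nodes = [50, 200]
print(responsible_node(100, nodes))  # 200
print(responsible_node(30, nodes))   # 50
print(responsible_node(201, nodes))  # 50 (wraps past the highest id)
```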
Consistent Hashing
A naïve approach with 8 nodes and 100 keys could use a simple modulo algorithm: key 18 would end up on node 2 and key 63 on node 7.
But how do we handle servers crashing, or new servers joining the system?
Consistent hashing handles this issue (see the comparison below).
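A sketch comparing the naïve modulo scheme with a consistent-hash ring when a ninth server joins (the hash function and node/key names are arbitrary choices for illustration): the modulo scheme remaps roughly 8 of every 9 keys, while the ring moves only about one-ninth of them:

```python
import bisect
import hashlib

def h(s, space=2**32):
    """Stable hash of a string onto the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % space

def ring_owner(key, nodes):
    """Consistent hashing: the successor of the key's hash owns the key."""
    points = sorted((h(n), n) for n in nodes)
    i = bisect.bisect_left(points, (h(key), ""))
    return points[i % len(points)][1]          # wrap around the ring

keys = [f"key-{i}" for i in range(1000)]
old = [f"node-{i}" for i in range(8)]
new = old + ["node-8"]                         # one server joins

moved_mod = sum(h(k) % 8 != h(k) % 9 for k in keys)
moved_ring = sum(ring_owner(k, old) != ring_owner(k, new) for k in keys)
print(moved_mod, moved_ring)  # modulo remaps ~8/9 of the keys; the ring only ~1/9
```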
Process of determining node
To look up a key k, node p forwards the request to the node q with index j in p's finger table such that
q = FT_p[j], where FT_p[j] <= k < FT_p[j+1]
To resolve k = 26, starting at node 1:
1. 26 > FT_1[5] = 18, so the request is forwarded to node 18
2. FT_18[2] <= 26 < FT_18[3], so it is forwarded to node FT_18[2] = 20
3. FT_20[1] <= 26 < FT_20[2], so it is forwarded to node FT_20[1] = 21
4. 21 < 26 <= FT_21[1] = 28, so node 28 is responsible for key 26
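Below is a sketch of this finger-table lookup. The 5-bit ring and the node set (1, 4, 9, 11, 14, 18, 20, 21, 28) are assumptions chosen so that the trace reproduces the resolution of k = 26 above:

```python
import bisect

M = 5                    # 5-bit identifier space: ids 0..31
RING = 2 ** M
NODES = sorted([1, 4, 9, 11, 14, 18, 20, 21, 28])   # assumed example node set

def succ(k):
    """Smallest node id >= k, wrapping around the ring."""
    i = bisect.bisect_left(NODES, k % RING)
    return NODES[i % len(NODES)]

def finger_table(p):
    """FT_p[i] = succ(p + 2^(i-1)) for i = 1..M; index 0 is unused."""
    return [None] + [succ((p + 2 ** (i - 1)) % RING) for i in range(1, M + 1)]

def in_open(x, a, b):    # x lies in the ring interval (a, b)
    return (a < x < b) if a < b else (x > a or x < b)

def in_half(x, a, b):    # x lies in the ring interval (a, b]
    return (a < x <= b) if a < b else (x > a or x <= b)

def lookup(p, k):
    """Resolve key k starting from node p, printing each hop."""
    ft = finger_table(p)
    if in_half(k, p, ft[1]):          # our immediate successor owns the key
        print(f"node {p}: key {k} is stored at node {ft[1]}")
        return ft[1]
    # forward to the highest finger that still precedes k on the ring
    nxt = next((f for f in reversed(ft[1:]) if in_open(f, p, k)), ft[1])
    print(f"node {p}: forwarding key {k} to node {nxt}")
    return lookup(nxt, k)

lookup(1, 26)   # hops 1 -> 18 -> 20 -> 21; answer: node 28
```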
Hashing efficiency of Chord System
The Chord system reaches the responsible node in O(log n) steps.
There are other hashing schemes that attain O(1) lookups, but at the cost of each node maintaining a larger local routing table.
Joining the Chord System
Suppose node p wants to join. It performs the following steps:
- It requests a lookup for succ(p+1)
- It inserts itself before this node
Maintaining consistency
Periodically each node checks its successor's predecessor.
Node q contacts succ(q+1) and requests it to return pred(succ(q+1)).
If q = pred(succ(q+1)), nothing has changed. If another value is returned, q knows that a new node p has joined the system with q < p < succ(q+1), so q updates its finger table by setting FT_q[1] = p (see the sketch below).
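A toy sketch of this stabilization step, assuming nodes 20, 21 and 28 where node 21 has just joined between 20 and 28:

```python
# Each node records only its successor and predecessor ids. Node 21 has
# joined between 20 and 28, but node 20 does not know this yet.
nodes = {
    20: {"succ": 28, "pred": 18},
    21: {"succ": 28, "pred": 20},
    28: {"succ": 1,  "pred": 21},   # 28 already knows its new predecessor
}

def stabilize(q):
    """q asks its successor for pred(succ(q+1)) and adopts any new node."""
    s = nodes[q]["succ"]
    p = nodes[s]["pred"]
    if p != q:                      # a node p with q < p < succ(q+1) has joined
        nodes[q]["succ"] = p        # i.e. set FT_q[1] = p
        nodes[p]["pred"] = q

stabilize(20)
print(nodes[20]["succ"])            # 21: node 20 now points at the new node
```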
CAP Theorem
Databases designed around the ACID properties tend to have poor availability.
The theorem was postulated by Eric Brewer of the University of California, Berkeley:
at most 2 of the following 3 properties are achievable in a distributed system
C – Consistency
A – Availability
P – Partition Tolerance
CAP Theorem
• Consistency – Ability for repeated reads to provide the same value
• Availability – Ability to be resilient to server crashes
• Partition Tolerance – Ability to keep operating even when the network partitions and servers cannot reach one another
Real world examples of CAP Theorem
Amazon's Dynamo chooses availability over consistency; Dynamo implements eventual consistency, where data becomes consistent over time.
Google's BigTable chooses consistency over availability.
Consistency, Partition Tolerance (CP)
- BigTable
- HBase
Availability, Partition Tolerance (AP)
- Dynamo
- Voldemort
- Cassandra
Consistency issues
Data replication in many commercial systems uses synchronous replica coordination in order to provide strongly consistent data.
The downside of this approach is poor availability: such a system makes the data unavailable whenever it cannot guarantee consistency.
For example, if data is replicated on 5 servers, then to make an update the system must
- Update all 5 copies
- Ensure all of the updates succeed
- If one of them fails, roll back the updates on the other 4
If a read arrives while one of the servers is down, a strongly consistent system returns "data unavailable" rather than a value whose correctness is undetermined.
Quorum Protocol
To maintain consistency, data is replicated on many servers.
For example, let us assume there are N servers in the system.
Typical algorithms require a write to reach a majority of the servers, i.e. Nw >= N/2 + 1 (equivalently, Nw > N/2).
A write is successful only once it has been committed on at least N/2 + 1 servers.
This is known as the write quorum.
Quorum Protocol
Similarly, reads are served from some number Nr of the server replicas. This is known as the read quorum.
The reads from the different servers are compared.
A consistent design requires that Nw + Nr > N.
With this you are assured of reading your own writes (see the sketch below).
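A small simulation of the idea, assuming N = 5, Nw = 3, Nr = 3 (so Nw + Nr > N): reads pick the highest version among the replies, and the guaranteed overlap between the read set and the last write set means they always see the latest committed write:

```python
import random

N, W, R = 5, 3, 3                     # replicas, write quorum, read quorum
replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version):
    """Commit the write on any W of the N replicas."""
    for i in random.sample(range(N), W):
        replicas[i] = {"version": version, "value": value}

def read():
    """Consult any R replicas and keep the highest-versioned reply."""
    replies = [replicas[i] for i in random.sample(range(N), R)]
    return max(replies, key=lambda r: r["version"])["value"]

write("v1", version=1)
write("v2", version=2)
print(read())   # always 'v2': since W + R > N, the R-set overlaps the last W-set
```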
Election Algorithm
Many distributed systems have one process act as a coordinator. If the coordinator crashes, an election takes place to identify the new coordinator. In the classic bully algorithm (sketched below):
1. P sends an ELECTION message to all higher-numbered processes
2. If no one responds, P becomes the coordinator
3. If a higher-numbered process answers, it takes over the election
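A simplified sketch of the bully election (real implementations exchange these messages concurrently; here the hand-off to a higher-numbered process is sequential):

```python
# Hypothetical process group: 5 was the coordinator and has crashed.
alive = {1: True, 2: True, 3: True, 4: False, 5: False}

def election(p):
    """p messages all higher-numbered processes; a live one takes over."""
    higher = [q for q in alive if q > p and alive[q]]
    if not higher:
        return p                      # no one answered: p becomes coordinator
    return election(min(higher))      # an answering process runs its own election

print(f"new coordinator: {election(1)}")  # 3, since processes 4 and 5 are down
```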
Traditional Fault Tolerance
Traditional systems use redundancy to handle failures and to tolerate faults, as shown below
[Diagram: redundant active/standby server pairs]
Process Resilience
Handling failures in distributed systems is much more difficult, since no machine has a view of the global state.
Byzantine Failures
Distributed systems are prone to a class of failures known as Byzantine failures.
The name refers to the Byzantine Generals Problem, in which an army's generals must unanimously decide whether to attack another army. The problem is complicated because the generals must communicate via messengers, and some of the generals may be traitors.
Omission failures – disk crashes, network congestion, failure to receive a request, etc.
Commission failures – the server behaves incorrectly, e.g. by corrupting its local state.
Solution: to handle Byzantine failures in which k processes are faulty, run a minimum of 2k+1 processes, so that the k+1 replies from correctly behaving processes outvote the k incorrect ones (see the sketch below).
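A sketch of the voting argument, with hypothetical reply values:

```python
from collections import Counter

def vote(replies):
    """With 2k+1 replies of which at most k are faulty, the k+1 matching
    replies from correct processes always outnumber the faulty ones."""
    value, _ = Counter(replies).most_common(1)[0]
    return value

k = 2
replies = ["42"] * (k + 1) + ["bogus"] * k   # 2k+1 = 5 replies, k of them faulty
print(vote(replies))                         # '42'
```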
Checkpointing
In fault-tolerant distributed computing, backward error recovery requires that the system regularly save its state at periodic intervals. We need to create a consistent global state called a distributed snapshot.
In a distributed snapshot, if a process P has recorded the receipt of a message, then there should be a process Q that has recorded sending the corresponding message.
Each process saves its local state from time to time.
To recover, we construct a consistent global state from these local states.
Gossip Protocol
The gossip protocol is used to handle server crashes and servers joining the system.
Changes to the distributed system, such as membership changes, are spread in the manner of gossip:
- A server picks another server at random and sends it a message about a server crash or a server joining
- If the receiver has already seen the message, the message is dropped
- The receiving server gossips to other servers in the same way, and the system soon reaches a steady state (see the simulation below)
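A small simulation of the spreading behaviour, with 100 hypothetical servers; each round, every informed server gossips the news to one randomly chosen peer:

```python
import random

servers = list(range(100))
informed = {0}                        # server 0 notices a membership change

rounds = 0
while len(informed) < len(servers):
    picks = [random.choice(servers) for _ in informed]  # each informed server picks a peer
    informed.update(picks)            # re-delivery to an informed server is a no-op
    rounds += 1

print(f"all {len(servers)} servers informed after {rounds} rounds")  # grows roughly like log N
```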
Sloppy Quorum
The quorum protocol is applied to the first N healthy nodes, rather than to the first N nodes encountered while walking clockwise around the ring.
Data meant for node A is sent to node D if A is temporarily down.
Node D records a hinted handoff in its metadata and updates node A when it comes back up (see the sketch below).
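A toy sketch of hinted handoff, with hypothetical nodes A-D where A is down and D holds the hints:

```python
# Each node is a small key/value store; D also keeps hinted writes for others.
stores = {"A": {}, "B": {}, "C": {}, "D": {}}
hints = {"D": []}
down = {"A"}

def write(node, key, value):
    """Sloppy quorum: a write for a down node lands on the next healthy node."""
    if node in down:
        stores["D"][key] = value
        hints["D"].append((node, key, value))   # remember the intended home
    else:
        stores[node][key] = value

def recover(node):
    """When the node comes back, hand its hinted writes back to it."""
    down.discard(node)
    for target, key, value in list(hints["D"]):
        if target == node:
            stores[node][key] = value
            hints["D"].remove((target, key, value))

write("A", "cart:42", ["book"])   # lands on D with a hint for A
recover("A")
print(stores["A"])                # {'cart:42': ['book']}
```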
Thank You!
Tinniam V Ganesh
tvganesh.85@gmail.com
Read my blogs: http://gigadom.wordpress.com/
http://savvydom.wordpress.com/