Topic 3: Large-scale Distributed Systems

3: Large-scale Distributed Systems

Zubair Nabi

zubair.nabi@itu.edu.pk

April 17, 2013


Outline

1 Introduction

2 Client-server Interaction

3 Characteristics

4 Message Passing Interface



Distributed Systems

Set of discrete machines which cooperate to perform computation
Give the notion of a single “machine”
Examples:
Compute clusters
Distributed storage systems, such as Dropbox, Google Drive, etc.
The Web


Advantages

Scalability:
The scale of the Internet (think how many queries Google servers
handle daily)
Only a matter of adding more machines
Cheaper than supercomputers
More machines means more parallelism, hence better performance
Sharing:
The same resource is shared between multiple users
Just like the Internet is shared between millions of users
Communication:
Communication between (potentially geographically isolated) machines
and users (via email, Facebook, etc.)
Reliability:
The service can remain active even if multiple machines go down


Challenges

Concurrency:
Concurrent execution requires some form of coordination
Fault-tolerance:
Any component can fail at any instant due to a software bug or a
hardware fault
Security:
One compromised machine can compromise the entire system
Coordination:
No global time so non-trivial to coordinate
Troubleshooting:
Hard to troubleshoot because it is hard to reason about the system


Transparency

Distributed systems give the notion of a single machine, i.e. they keep
the distribution transparent (hidden from the user)
The degree of this transparency can be mapped onto an entire
spectrum of options for both users and programmers
For instance, a web user is aware of network communication but the
number of accessed machines is transparent
Transparency can be ensured by middleware that adds a layer of
abstraction
Can span access, concurrency, failure, location, migration,
persistence, relocation, replication



Request-reply protocol

Standard operation
1 Client sends request to the server
2 Server processes the request and sends a corresponding response
In the synchronous model, the client blocks till the response is received
In case of the asynchronous model, the client continues its execution
For instance: HTTP 1.0
1 Client sends GET /index.html
2 Server responds with index.html
3 Client renders index.html
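
The synchronous and asynchronous client models above can be sketched as follows. This is only an illustration: `server` is a hypothetical in-process function standing in for a remote machine, and the sleep stands in for network and processing delay.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def server(request: str) -> str:
    """Hypothetical server: processes a request and returns a response."""
    time.sleep(0.05)  # stand-in for network and processing delay
    return f"response to {request}"


def synchronous_client(request: str) -> str:
    # The client blocks here until the response is received.
    return server(request)


def asynchronous_client(request: str, executor: ThreadPoolExecutor):
    # The request is handed off; the client gets a future back immediately
    # and continues its own execution.
    future = executor.submit(server, request)
    local_work = sum(range(1000))  # work done while the request is in flight
    return future, local_work


with ThreadPoolExecutor(max_workers=1) as pool:
    print(synchronous_client("GET /index.html"))
    pending, work = asynchronous_client("GET /index.html", pool)
    # result() blocks only at the point where the reply is actually needed.
    print(work, pending.result())
```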


Errors and failures

Errors are handled at the application-level
For instance, if the client requests a non-existent web page just return a
special reply: 404 Not Found
Failures are system-level things
For instance, lost message, client/server crash, etc.
To handle failure, the client must timeout after T
The client can retry on a timeout
Setting value of T is system-specific
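
A minimal sketch of timeout and retry on the client, under stated assumptions: `send_request` is a hypothetical unreliable channel that loses roughly half of all messages, and `Timeout`, `request_with_retry` and `max_retries` are illustrative names, not part of any real library.

```python
import random


class Timeout(Exception):
    """Raised when no reply arrives within T seconds."""


def send_request(request, timeout, rng):
    """Hypothetical unreliable channel: loses roughly half the messages."""
    if rng.random() < 0.5:
        raise Timeout(f"no reply to {request!r} within {timeout}s")
    return f"reply to {request}"


def request_with_retry(request, timeout=1.0, max_retries=5, rng=None):
    """Send a request, retrying on timeout; return (reply, attempts used)."""
    rng = rng or random.Random()
    for attempt in range(1, max_retries + 1):
        try:
            return send_request(request, timeout, rng), attempt
        except Timeout:
            continue  # system-level failure: retry the same request
    raise Timeout(f"gave up after {max_retries} attempts")
```

Note that a retry may cause the server to process the same request twice (the first request may have arrived and only the reply was lost), so blind retries are only safe for requests that can be repeated without harm.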


Remote Procedure Call

Request/response protocols are widely used but too low level
Each request must be defined separately, including its network message
representation
Remote procedure call (RPC) presents a simpler abstraction
Programmer invokes a procedure which executes on a remote machine
(the server)
RPC subsystem takes care of message formats, communication,
timeouts, etc.
Distribution of the system becomes transparent
Integrated with the programming language
RPC layer adds stubs at client end which when invoked execute a
method at the server
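
The stub idea can be sketched as below. This is an illustration under assumptions: JSON is used as the message format purely for brevity, the "network" is an in-process function call, and `Stub`, `server_dispatch` and `PROCEDURES` are hypothetical names.

```python
import json


# --- server side ---------------------------------------------------------
def add(a, b):
    return a + b


PROCEDURES = {"add": add}  # procedures the hypothetical server exposes


def server_dispatch(message: bytes) -> bytes:
    """Decode a request message, run the procedure, encode the result."""
    call = json.loads(message)
    result = PROCEDURES[call["method"]](*call["params"])
    return json.dumps({"result": result}).encode()


# --- client side ---------------------------------------------------------
class Stub:
    """Client stub: turns a local-looking call into a request message."""

    def __init__(self, transport):
        self._transport = transport  # callable: request bytes -> reply bytes

    def __getattr__(self, method):
        # Called for any attribute the stub does not have, e.g. stub.add
        def remote_call(*params):
            request = json.dumps({"method": method, "params": params}).encode()
            reply = self._transport(request)
            return json.loads(reply)["result"]
        return remote_call


server = Stub(transport=server_dispatch)  # in-process stand-in for a network
print(server.add(2, 3))  # → 5
```

The caller writes `server.add(2, 3)` as if it were a local call; message encoding and transport are hidden inside the stub.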


Example: XML-RPC

XML is used to encode method invocations (method names,
parameters, etc.)
HTTP POST used to send request and receive response (also
encoded in XML)
Looks like a regular web session on wire so plays well with
middleboxes
Language agnostic and extensible
Extended with more features (namespaces, user-defined types, etc.)
and diverse transports (TCP, UDP, etc.) to result in Simple Object
Access Protocol (SOAP)
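
To see what an XML-RPC invocation looks like on the wire, the sketch below uses the `dumps` and `loads` helpers from Python's standard `xmlrpc.client` module. The method name `add` and its parameters are made up for illustration, and no network is involved; in a real exchange these documents would be the bodies of an HTTP POST and its response.

```python
import xmlrpc.client

# Encode a call to a hypothetical remote method add(2, 3).
request_xml = xmlrpc.client.dumps((2, 3), methodname="add")
print(request_xml)

# The receiving side decodes the document back into parameters and name.
params, method = xmlrpc.client.loads(request_xml)

# A response is encoded the same way and carries exactly one value.
response_xml = xmlrpc.client.dumps((5,), methodresponse=True)
(result,), _ = xmlrpc.client.loads(response_xml)
print(method, params, result)
```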


RPC shortcomings

RPC mechanisms are typically synchronous
Client blocks till response is received
Poor responsiveness, especially in high latency networks
The mid-2000s ushered in the age of Asynchronous JavaScript and XML
(AJAX)
Update web page without reloading
For instance, Google Maps, Gmail, etc.


Representational State Transfer

AJAX still revolves around RPC (just asynchronously)
Representational State Transfer (REST) offers an alternative
All resources have a name: URL or URI
Resources are manipulated with PUT, GET, POST, and DELETE
methods
State is sent along with operations
Widely used these days (for instance, by Amazon, Twitter, etc.)
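
A minimal sketch of the REST style, assuming an in-memory dictionary in place of a real web server. The `handle` function, the `resources` store and the URIs are hypothetical; the status codes follow common HTTP usage.

```python
import itertools
from typing import Dict, Optional, Tuple

resources: Dict[str, dict] = {}   # URI -> current representation
_next_id = itertools.count(1)     # used to name resources created by POST


def handle(method: str, uri: str,
           body: Optional[dict] = None) -> Tuple[int, Optional[dict]]:
    """Apply one method to the resource named by `uri`; return (status, body)."""
    if method == "PUT":                       # create or replace at a known URI
        status = 200 if uri in resources else 201
        resources[uri] = body
        return status, body
    if method == "GET":                       # read the current representation
        return (200, resources[uri]) if uri in resources else (404, None)
    if method == "POST":                      # create under a collection URI
        new_uri = f"{uri}/{next(_next_id)}"
        resources[new_uri] = body
        return 201, {"uri": new_uri}
    if method == "DELETE":                    # remove the resource
        if uri in resources:
            del resources[uri]
            return 204, None
        return 404, None
    return 405, None                          # method not supported
```

Every call carries the resource name and the full state it needs, e.g. `handle("PUT", "/users/alice", {"name": "Alice"})`, so the server does not have to remember anything about the client between calls.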



Clocks

Distributed systems need to be able to:
Order events produced by concurrent processes
Synchronize senders and receivers of messages
Serialize concurrent accesses to shared objects
Coordinate joint activity
Clocks are employed for this
But quartz oscillators oscillate at slightly different frequencies leading
to clock drift and resulting in clock skew between clocks


Clock synchronization

Clock synchronization algorithms try to minimize skew between a set of
clocks
Decide upon a correct time
Communicate to agree (compensating for delays)
Possibly multiple servers involved
In reality, still a 1-10ms skew after sync (but we can live with that)
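
One well-known way to "compensate for delays" is Cristian's algorithm; the slides do not name a specific algorithm, so this choice is an assumption made for illustration. The client timestamps its request and the reply with its own clock and assumes the delay was the same in both directions.

```python
def estimate_offset(t_send: float, server_time: float, t_recv: float) -> float:
    """Estimate how far the local clock is behind the server clock.

    t_send, t_recv: local clock readings when the request left and the
                    reply arrived.
    server_time:    the server's clock reading carried in the reply.
    Assumes the network delay is the same in both directions.
    """
    round_trip = t_recv - t_send
    # The reply spent about half the round trip in flight, so when it
    # arrives the server clock reads roughly server_time + round_trip / 2.
    estimated_server_now = server_time + round_trip / 2
    return estimated_server_now - t_recv


# Request sent at local time 10, reply received at local time 14,
# reply says the server clock read 25.
print(estimate_offset(10, 25, 14))  # → 13.0
```

Because the real one-way delay is unknown, the estimate can be wrong by up to half the round-trip time, which is one reason some skew remains after synchronization.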


Ordering

Time is used to ensure ordering
Withdraw money at 23:59:45
Bank calculates interest at 00:00:00
The withdrawn money should not be included in the interest calculation
In most cases, only need to know that a happened before b, known as
the happens-before relation
Multiple algorithms exist to ensure the happens-before relation
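
Lamport logical clocks are one such algorithm; the slides do not single one out, so this is an illustrative choice. Each process keeps a counter, stamps outgoing messages with it, and on receipt jumps ahead of the stamp it received. The `Process` class and the teller/bank scenario are hypothetical.

```python
class Process:
    """A process with a Lamport logical clock."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock  # this timestamp travels with the message

    def receive(self, timestamp):
        # Jump past the sender's timestamp so the receive is ordered after
        # the send, whatever the local clock read before.
        self.clock = max(self.clock, timestamp) + 1
        return self.clock


teller, bank = Process("teller"), Process("bank")
for _ in range(3):
    teller.local_event()
withdraw = teller.send()           # a: the withdrawal is sent
interest = bank.receive(withdraw)  # b: the bank learns of it
print(withdraw, interest)  # → 4 5
```

If a happened before b, then a's timestamp is smaller than b's. The converse does not hold: a smaller timestamp alone does not prove that one event happened before the other.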


Distributed Mutual Exclusion

Concurrent access to shared resources needs to be synchronized
On a local machine this relies on hardware support
Locks, semaphores, etc.
But this support is not available across a distributed system


Distributed Mutual Exclusion (2)

Multiple methods exist to ensure this:
Central lock server: All lock requests are handled by a central server
Token passing: Arrange nodes into a ring and a token is passed
around
Totally-ordered multicast: Clients multicast requests to each other
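
Token passing can be simulated in a few lines. This sketch assumes a fixed ring with no failures and no lost token; the function name and parameters are hypothetical.

```python
def token_ring(nodes, wants_lock, start=0, rounds=1):
    """Pass a token around the ring; only the token holder may enter.

    nodes:      node names in ring order.
    wants_lock: names of nodes waiting to enter the critical section.
    Returns the order in which nodes entered the critical section.
    """
    pending = set(wants_lock)
    order = []
    holder = start
    for _ in range(rounds * len(nodes)):
        node = nodes[holder]
        if node in pending:
            order.append(node)  # node holds the token: critical section
            pending.discard(node)
        holder = (holder + 1) % len(nodes)  # pass the token to the neighbour
    return order


print(token_ring(["A", "B", "C", "D"], {"D", "B"}))  # → ['B', 'D']
```

Mutual exclusion holds because there is only one token. The cost is that the token keeps circulating even when no node wants the lock, and a node may wait for a full trip around the ring.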


Consensus

Getting processes in a distributed system to agree on something
Requirements for a correct solution:
Agreement: All nodes arrive at the same answer
Validity: Answer is one that was proposed by someone
Termination: All nodes eventually decide
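
The three requirements can be written down as checks over the outcome of one finished run. This is not a consensus protocol, only a hypothetical checker; "eventually" is approximated here by "decided by the end of the run".

```python
def check_consensus(proposals: dict, decisions: dict) -> dict:
    """Check the three consensus requirements for one finished run.

    proposals: node name -> value the node proposed.
    decisions: node name -> value the node decided (absent if undecided).
    """
    decided = list(decisions.values())
    return {
        # Agreement: no two nodes decided different values.
        "agreement": len(set(decided)) <= 1,
        # Validity: every decided value was proposed by some node.
        "validity": all(d in proposals.values() for d in decided),
        # Termination: every node that proposed has decided.
        "termination": set(decisions) == set(proposals),
    }


proposals = {"n1": "red", "n2": "blue", "n3": "red"}
print(check_consensus(proposals, {"n1": "red", "n2": "red", "n3": "red"}))
```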


Distributed transactions

Composite operations (i.e., a collection of reads and updates to a set of
objects)
A transaction is atomic
If it commits, all operations are applied
If it aborts, no state is mutated at all
Distributed transactions span multiple transaction processing servers
For instance, booking flights: Lahore -> Dubai -> New York
Need to book entire trip
Actions need to be coordinated across multiple parties
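
Two-phase commit is a standard protocol for this kind of coordination; the slides do not name a protocol, so its use here is an assumption. The sketch ignores crashes and timeouts, and the `Participant` class and flight legs are hypothetical.

```python
class Participant:
    """One transaction processing server, e.g. an airline booking one leg."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the local part can definitely be applied.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction committed everywhere, else False."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if all(votes):
        for p in participants:                   # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                       # phase 2: abort everywhere
        p.abort()
    return False


legs = [Participant("Lahore->Dubai"), Participant("Dubai->New York", can_commit=False)]
print(two_phase_commit(legs), [p.state for p in legs])
```

Since the second leg cannot be booked, the first leg is not booked either: the trip is all or nothing.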


Replication

A number of distributed systems involve replication
Data replication: Multiple copies of some object stored at different
servers
Computation replication: Multiple servers capable of providing an
operation
Advantages:
1 Load balancing: Work spread out across replicas
2 Lower latency: Better performance if replica close to the client
3 Fault tolerance: Failure of some replicas can be tolerated
Examples: DNS, content distribution networks, database replication,
etc.
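
The latency and fault-tolerance advantages can be illustrated with a read that prefers the closest replica and skips replicas that are down. The replica records, field names and latency figures below are invented for the example.

```python
def read(key, replicas):
    """Read `key` from the closest replica that is up.

    replicas: list of dicts with hypothetical fields
              name, latency_ms, up, data.
    """
    for replica in sorted(replicas, key=lambda r: r["latency_ms"]):
        if replica["up"]:
            return replica["name"], replica["data"][key]
    raise RuntimeError("all replicas are down")


replicas = [
    {"name": "new-york", "latency_ms": 200, "up": True, "data": {"k": 1}},
    {"name": "lahore", "latency_ms": 5, "up": False, "data": {"k": 1}},
    {"name": "dubai", "latency_ms": 40, "up": True, "data": {"k": 1}},
]
print(read("k", replicas))  # → ('dubai', 1)
```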


CAP

CAP:
1 Consistency: All nodes see the same state
2 Availability: All requests get a response
3 Partition tolerance: System continues to operate even when failures
cut some nodes off from the others
Brewer’s conjecture states that in a distributed system only 2 out of
the 3 are possible at the same time
In the current setup, partitioning is a given: Hardware/software fails all
the time
Therefore, systems need to choose between consistency and
availability
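
The choice between consistency and availability during a partition can be shown with a toy replica. The `Replica` class and its "CP"/"AP" modes are hypothetical labels for the two choices, not a real system's API.

```python
class Replica:
    """A replica that may be cut off from the node holding the latest write."""

    def __init__(self, mode):
        self.mode = mode          # "CP": prefer consistency, "AP": availability
        self.value = "v1"         # last value this replica has seen
        self.partitioned = False  # True when it cannot reach the other nodes

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot confirm that "v1" is still the latest value, so refuse:
            # consistent, but not available.
            raise RuntimeError("unavailable during partition")
        # Otherwise answer with what we have: available, but possibly stale.
        return self.value


cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
print(ap.read())  # → v1
```

While the partition lasts, the AP replica keeps answering with data that may be out of date, and the CP replica refuses to answer rather than risk returning stale data.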


References

George Coulouris, Jean Dollimore, Tim Kindberg, and Gordon Blair.
2011. Distributed Systems: Concepts and Design (5th ed.).
Addison-Wesley Publishing Company, USA.

