Distributed Systems –
An Introduction
CS4262 Distributed Systems
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides extracted from Dr. Srinath Perera & Dr. Rajkumar Buyya’s
Presentation Deck
What is a Distributed System?
Source: http://people.reed.edu/~jimfix/442ds/
What is a Distributed System?
Source: www.cs.bham.ac.uk/~nagarajs/courses/DSS/
What is a Distributed System?
“A distributed system is
one on which I cannot get
any work done because
some machine I have
never heard of has
crashed.”
--Leslie Lamport
What is a Distributed System?
“A system in which hardware or
software components located at
networked computers
communicate & coordinate their
actions only by message
passing.” - [Coulouris]
“A distributed system is a
collection of independent
computers that appear to the
users of the system as a single
coherent system.” - [Tanenbaum]
Key Characteristics
Many independent computers
Only communicate via message passing
Coordinate/communicate to fulfill a common
goal
Question
Which of the following is true?
a) Horizontal scaling is used in enterprise applications
b) Vertical scaling is used in cloud computing
c) Elasticity is the ability to scale in & out
d) Stateful servers/components are easier to scale
Why Distributed Systems?
Some systems are inherently distributed
e.g., file sharing, mobile social networks, vehicles
Some problems are too big for a single system
For scalability
e.g., Google, Amazon, YouTube
For better QoS/QoE
e.g., YouTube, Google
For reliability
e.g., Google, Amazon
For specific demographics
e.g., Amazon, Yahoo, Google, CircInfo
Economic reasons
Resource sharing, Cloud
System evolution
Systems become heterogeneous over time
Applications – Online Stores
Many items
List of all items, with
their specs
Index items by many
dimensions &
support search
Many sellers
Many buyers
Buyers' behaviour
Support checkout, delivery tracking, returns, ratings, & complaints
Supported by partitioning sellers/items across many nodes
Business analytics
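The partitioning of sellers/items across nodes can be sketched as below. This is a minimal illustration that assumes simple hash-based partitioning on a seller ID; the node names and the `node_for` helper are hypothetical and do not come from the slides.

```python
import hashlib

# Hypothetical node names, used only for illustration.
NODES = ["node-0", "node-1", "node-2", "node-3"]


def node_for(seller_id, nodes=NODES):
    """Map a seller ID to a node using a stable hash of the ID."""
    digest = hashlib.sha256(seller_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then pick a node.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    for seller in ["seller-17", "seller-42", "seller-99"]:
        print(seller, "->", node_for(seller))
```

With modulo-based placement, changing the number of nodes moves most keys to a different node; schemes such as consistent hashing are commonly used to reduce that movement.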
Applications – Desktop Grids
Many volunteers contribute their computing power
Scientists submit computing jobs to system
Broker matches resources with jobs
Resources run jobs & return results
Handle failures
Avoid free riding
Biggest computers on
earth
www.seti.cl/aprendiendo-mas-sobre-boinc-y-la-computacion-voluntaria/
Applications – IoT
Many sensors – Weather, travel, traffic, surveillance,
stock exchange, smart grid, production lines
Monitor, understand, & react to events
Usually handled with Stream Processing, Complex Event
Processing, or custom applications
Source: www.flickr.com/photos/imuttoo/4257813689/ by Ian Muttoo, www.flickr.com/photos/eastcapital/4554220770/ ,
www.flickr.com/photos/patdavid/4619331472/ by Pat David copyright CC
Applications – Mobile Crowdsourcing
Modern mobile phones
are like a weather centre
GPS, Barometer, Temperature, Light, Proximity, & Moisture sensors
Get volunteer phones to
send sensor data (Crowd
source)
Report on weather
Crop diseases (agriculture
officials)
Epidemics (from hospitals,
doctors)
Use the data to predict weather, crop disease, & epidemic spread
Source: www.fotopedia.com/items/flickr-2548697541 ,
www.geograph.org.uk/photo/1534209, and
www.yourbdnews.com/2011/10/17/samsung-files-to-halt-iphone-4s-
in-japan-australia/iphone-4s, Licensed CC
Mobile:
Solving the Last Mile
Problem
Applications – Data Storages &
Provenance (Sky Server)
Telescopes (Square Kilometer Array) keep collecting data from the sky (TBs per day)
Sky Survey lets scientists see the sky at a given location, as seen at a given time
Moving data takes a long time; 1 TB takes
100 Mbps network - 30 hrs
1 Gbps network - 3 hrs
10 Gbps network - 20 minutes
Source: www.geograph.org.uk/photo/103069, Licensed CC
http://cas.sdss.org
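The transfer times for 1 TB can be checked with a short calculation. The sketch below computes the ideal time only, assuming a decimal terabyte (10^12 bytes), a fully utilised link, and no protocol overhead; the function name is hypothetical. Under those assumptions the results are roughly 22 hrs, 2.2 hrs, and 13 minutes. The slide's figures are somewhat higher, which would be consistent with real throughput falling below the nominal link rate, though the slides do not state how they were derived.

```python
def transfer_time_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time: no protocol overhead, link fully utilised."""
    return size_bytes * 8 / link_bits_per_second


ONE_TB = 1e12  # decimal terabyte (assumption)

if __name__ == "__main__":
    for label, rate in [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
        hours = transfer_time_seconds(ONE_TB, rate) / 3600
        print(f"{label}: {hours:.2f} h")
    # 100 Mbps: 22.22 h
    # 1 Gbps: 2.22 h
    # 10 Gbps: 0.22 h
```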
Applications – Theoretical Computer
Science
Concerns with
Coordination algorithms
Leader election, multicast,
distributed locks, barriers,
snapshot algorithms
Impossibility results, upper &
lower bounds
Distributed versions of some
centralized algorithms
e.g., shortest path
A lot of work done in the 1970s laid
the groundwork for Distributed
Systems
http://www.flickr.com/photos/lodz_na_nowo/5690492370/
http://xkcd.com/384/
http://www.flickr.com/photos/quinnanya/4990131194/sizes/z/in/photostream/,
Licensed CC
Parallel vs. Distributed Systems
Convergence of concepts of parallel &
distributed systems
Differentiation with parallel systems is blurring
New middleware
Extensibility of clusters leads to heterogeneity
As new hardware is added
Parallel Systems    | Distributed Systems
Tight coupling      | Loose coupling
Physical proximity  | Server room to global
Homogeneous         | Heterogeneous
Threads & MPI       | RPC, Web Services, REST
Distributed Systems Timeline/History
Period        | Topics
1965-late 70s | Parallel Programming, Self Stabilization, Fault Tolerance, ER Model/Transactions, Time & Clocks
1980s         | Consensus & impossibility, SQL, Distributed Snapshots, Replication, Group Communication
Early 90s     | Linearizability, Parallel DB, Transactional Memory, RAID, MPI
Late 90s      | Volunteer Computing, P2P file sharing, Complex event processing
Early 2000s   | OceanStore, Web Services, Semantic Web, REST, DHT, Pub/Sub, Grid, Autonomic Computing, Google File System, Virtualization, SOA, MapReduce
2005-2010     | Cloud, NoSQL, Mobile Apps, Data Provenance
Design Goals
Transparency
Differences between computers & the way they
communicate are hidden from users
Single System View (SSV/SSI)
Users & applications interact with a distributed system
in a consistent & uniform way regardless of where or
when the interaction takes place
Scalability
Relatively easy to expand or scale
Availability
Continuously available even though certain parts may
be temporarily out of order
Fallacies of Distributed Systems
Network is reliable
Latency is zero
Bandwidth is infinite
Transport cost is zero
Network is secure
Topology doesn't change
There is one administrator
Infrastructure is homogeneous
By Arnon Rotem-Gal-Oz
Distributed System Challenges
Concurrency
No global state
Failures in different elements
Transparency
Fault tolerance
Scalability
Heterogeneity
Security
Openness
1. Concurrency Challenges
Every software or hardware component is autonomous
Components share resources, & synchronize & coordinate via message passing
A & B are concurrent if neither is required to happen before the other,
i.e., A may happen before B or B before A
Typical problems
Deadlocks
Unreliable communication
Provide & manage concurrent access to shared
resources
Preserve dependencies, e.g., distributed transactions
Fair scheduling
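One common way to test whether two events are concurrent is to compare vector clocks. The slides do not name a mechanism, so the sketch below is an illustration rather than the method the course prescribes. It assumes each event carries a vector timestamp with one counter per process; the function names are hypothetical.

```python
def happened_before(a, b):
    """True if the event stamped a is ordered before the event stamped b.

    a precedes b when every counter in a is <= the matching counter in b
    and at least one counter is strictly smaller.
    """
    pairs = list(zip(a, b))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def concurrent(a, b):
    """True if the two (distinct) events are not ordered either way."""
    return (list(a) != list(b)
            and not happened_before(a, b)
            and not happened_before(b, a))


if __name__ == "__main__":
    print(happened_before([1, 0, 0], [2, 1, 0]))  # True
    print(concurrent([2, 0, 0], [0, 1, 0]))       # True
    print(concurrent([1, 0, 0], [2, 1, 0]))       # False
```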
2. Lack of Global State
Absence of a global state
Typically, no single process knows the current
global state of the system
Due to concurrency & message passing communication
Absence of a global clock
Hard to say who’s first
There are limits on the precision with which processes in
a distributed system can synchronize their clocks
However, the problem can now be addressed in some
applications with GPS timestamps
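Without a global clock, events can still be ordered consistently with logical clocks. The sketch below shows a Lamport logical clock, a standard technique that this slide does not itself name; the class and method names are hypothetical. Each process keeps a counter, increments it on every local event or send, and on receive advances it past the timestamp carried by the message.

```python
class LamportClock:
    """Minimal Lamport logical clock for one process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Record a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, message_time):
        """Advance past the sender's timestamp on message receipt."""
        self.time = max(self.time, message_time) + 1
        return self.time


if __name__ == "__main__":
    p, q = LamportClock(), LamportClock()
    p.tick()                    # p's clock: 1
    stamp = p.send()            # p's clock: 2, message carries 2
    print(q.receive(stamp))     # 3
```

Lamport timestamps guarantee that if one event causally precedes another, it has the smaller timestamp; the converse does not hold, so they cannot by themselves show that two events are concurrent.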
3. Failures in Different Elements
Failures are more common than in centralized
systems
Processes run autonomously, in isolation
Failures of individual processes may remain
undetected
Individual processes may be unaware of failures
in the system context
Network failures isolate processes & partition
system
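A common way to notice failed processes is a heartbeat-based failure detector. The sketch below is an illustration under stated assumptions, not a mechanism taken from the slides: processes are assumed to report in periodically, the `HeartbeatMonitor` class is hypothetical, and the current time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Suspect a process if no heartbeat arrived within the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # process id -> time of last heartbeat

    def heartbeat(self, process_id, now):
        """Record a heartbeat received from process_id at time now."""
        self.last_seen[process_id] = now

    def suspected(self, now):
        """Return the sorted ids of processes that have been silent too long."""
        return sorted(p for p, seen in self.last_seen.items()
                      if now - seen > self.timeout)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=5.0)
    monitor.heartbeat("a", now=0.0)
    monitor.heartbeat("b", now=8.0)
    print(monitor.suspected(now=10.0))  # ['a']
```

A timeout only yields a suspicion: the monitor cannot tell a crashed process from a slow one or from one cut off by a network partition.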
4. Transparency
Present system to users & applications as a
single computer system
Hides the fact that resources are physically
distributed across multiple computers
There’s a trade-off between high degree of
transparency & system performance
Attempting to blindly hide all distribution aspects from
users is not always a good idea
Transparency can be applied to several aspects
Forms of Transparency
Access transparency
Access to local or remote resources is identical
e.g., Network File System
Location transparency
Access without knowledge of location
e.g., URLs, e-mail addresses
Failure transparency
Tasks can be completed despite failures
e.g., retransmission of e-mails, failure of a Web server node should
not bring down the website
Replication transparency
Access to replicated resources as if there was just one
Provide enhanced reliability & performance
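Failure transparency through retransmission can be sketched as a retry wrapper. This is a minimal illustration: the helper names are hypothetical, `ConnectionError` stands in for any transient failure, and the flaky operation is simulated rather than a real network call.

```python
import time


def call_with_retries(operation, attempts=3, delay_seconds=0.0):
    """Call operation(), retrying on ConnectionError up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # wait before the next try
    raise last_error


def make_flaky(failures):
    """Build a simulated operation that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("simulated transient failure")
        return "ok"

    return operation


if __name__ == "__main__":
    print(call_with_retries(make_flaky(2), attempts=3))  # ok
```

The caller sees a single successful call even though two attempts failed. Blind retries are generally safe only when the operation is idempotent, since a request may have been carried out even though its reply was lost.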
Forms of Transparency (Cont.)
Migration (mobility/relocation) transparency
Movement of resources & clients within a system without affecting the
operation of users or applications
e.g., switching from one name server to another, migration of a VM
from one physical server to another
Concurrency transparency
A process shouldn’t notice that there are others sharing the same
resources
Performance transparency
Allows system to be dynamically reconfigured to improve
performance as loads vary
Scaling transparency
Allows system & applications to expand in scale without change to
system structure or application algorithms
Question(s)
Migration transparency
a) Allows access without knowledge of location
b) Enables multiple instances of resources
c) Enables the movement of resources
d) Enables the concealment of faults
A higher degree of transparency is always
desirable: True / False
5. Fault Tolerance
Failure
When an offered service no longer complies with its
specification
Fault
Cause of a failure
Fault tolerance
No failure despite faults
Failures in distributed systems are partial
Some components fail while others continue to
function
Therefore, handling failure is particularly difficult
6. Scalability
At many different scales
No. of applications, users, transactions, products to be
sold, attributes of products
Goal is to remain effective when there is a significant
increase in the no. of resources & users
3 different dimensions:
Size scalability
Limitations due to centralized services, centralized data,
centralized algorithms
Geographic scalability
Unreliable communication, lack of performance guarantees
Administrative scalability
Conflicting policies for resource usage, security, etc.
Scaling Techniques
Scalability problems typically appear as
performance problems
3 basic scaling techniques
Hiding communication latencies
Distribution
Replication
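Hiding communication latencies often means not waiting for one remote call to finish before starting the next. The sketch below illustrates this with Python's `asyncio`; the `fetch` coroutine is a hypothetical stand-in for a remote request, and `asyncio.sleep` simulates network latency.

```python
import asyncio
import time


async def fetch(name, latency_seconds):
    """Stand-in for a remote call; the sleep simulates network latency."""
    await asyncio.sleep(latency_seconds)
    return f"{name}: done"


async def fetch_all(names, latency_seconds=0.1):
    """Issue all requests at once instead of waiting for each in turn."""
    return await asyncio.gather(*(fetch(n, latency_seconds) for n in names))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(fetch_all(["a", "b", "c"]))
    elapsed = time.perf_counter() - start
    # Total time is close to one latency, not the sum of three.
    print(results, f"{elapsed:.2f}s")
```

The same idea applies to the other two techniques only indirectly: distribution and replication change where work and data live, whereas this sketch only overlaps waiting time.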
Scalability Concerns
Cost of physical resources
Cost should increase no more than linearly with system size
Performance loss
Finding things in large & distributed systems is hard
Look for algorithms with O(log n) cost, where n is the size of the data
Preventing software resources from running out
Numbers used to represent resources, users, services, etc.
e.g., IPv4 to IPv6, Y2K problem, Year 2038 problem
Avoiding performance bottlenecks
Maintaining global state
Difficulty of maintaining up-to-date state
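The "running out of numbers" examples can be made concrete with a little arithmetic. The sketch below assumes time is stored as a signed 32-bit count of seconds since 1 January 1970 UTC, which is the representation behind the Year 2038 problem, and compares the IPv4 and IPv6 address spaces; the constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Largest value a signed 32-bit integer can hold.
MAX_INT32 = 2**31 - 1

# Last moment representable as signed 32-bit seconds since the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
rollover = EPOCH + timedelta(seconds=MAX_INT32)

# Address space sizes: 32-bit IPv4 vs 128-bit IPv6 addresses.
IPV4_ADDRESSES = 2**32
IPV6_ADDRESSES = 2**128

if __name__ == "__main__":
    print(rollover.isoformat())  # 2038-01-19T03:14:07+00:00
    print(IPV4_ADDRESSES)        # 4294967296
    print(IPV6_ADDRESSES // IPV4_ADDRESSES == 2**96)  # True
```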
7-9. Other Challenges
Heterogeneity
Heterogeneous components must be able to
interoperate
Applies to hardware, software, middleware, & protocols
Security
System should only be used in the way intended
Distributed resources, networks, & users
Distributed authentication, authorization, enforcing
integrity, non-repudiation, & accounting are hard
Openness
Interfaces should be publicly available to ease adding
new components
Summary
We use distributed systems without realizing they are
distributed
Goals
Transparency, single system view, scalability,
availability
Challenges
Concurrency, no global state, failures, transparency,
fault tolerance, scalability, heterogeneity, security,
debugging
Editor's Notes
Sky Server – 200 GB a night
Chilbolton Radio Telescope - 7 PB of raw data will be produced each year (~19 TB a day)
According to the Fundamentals of Software Engineering book by Ghezzi et al., software engineering principles include (1) Rigor and Formality (2) Separation of Concerns (3) Modularity (4) Abstraction (5) Anticipation of Change (6) Generality and (7) Incrementality.
Specific issues need to be resolved for the design of software for distributed systems.