CS8603-Distributed systems
Prepared By: Mr. P. J. Merbin Jose, AP/CSE
UNIT V
P2P & DISTRIBUTED SHARED MEMORY
Peer-to-peer computing and overlay graphs: Introduction – Data indexing and overlays –
Chord – Content addressable networks – Tapestry. Distributed shared memory:
Abstraction and advantages – Memory consistency models –Shared memory Mutual
Exclusion.
PEER-TO-PEER COMPUTING AND OVERLAY GRAPHS: INTRODUCTION
Characteristics
 P2P network: application-level organization of the network to flexibly share resources
 All nodes are equal; communication is directly between peers (no client-server)
 Allows location of arbitrary objects; no DNS servers required
 Large combined storage, CPU power, other resources, without scalability costs
 Dynamic insertion and deletion of nodes, as well as of resources, at low cost
Features                            Performance
self-organizing                     large combined storage, CPU power, and resources
distributed control                 fast search for machines and data objects
role symmetry for nodes             scalable
anonymity                           efficient management of churn
naming mechanism                    selection of geographically close servers
security, authentication, trust     redundancy in storage and paths
Table 5.1: Desirable characteristics and performance features of P2P systems.
Napster
A central server maintains a table with the following information about each registered client: (i) the
client’s address (IP) and port, and offered bandwidth, and (ii) information about the files that the
client is willing to share.
 A client connects to a meta-server that assigns a lightly-loaded server.
 The client connects to the assigned server and forwards its query and identity.
 The server responds to the client with information about the users connected to it and the
files they are sharing.
 On receiving the response from the server, the client chooses one of the users from whom
to download a desired file. The address to enable the P2P connection between the client
and the selected user is provided by the server to the client.
Users are generally anonymous to each other. The directory serves to provide the mapping from
a particular host that contains the required content, to the IP address needed to download from it.
Structured and Unstructured Overlays
 Search for data and placement of data depends on P2P overlay (which can be thought of
as being below the application level overlay)
 Search is data-centric, not host-centric
 Structured P2P overlays:
o E.g., hypercube, mesh, de Bruijn graphs
o rigid organizational principles for object storage and object search
 Unstructured P2P overlays:
o Loose guidelines for object search and storage
o Search mechanisms are ad-hoc, variants of flooding and random walk
 Object storage and search strategies are intricately linked to the overlay structure as well
as to the data organization mechanisms.
DATA INDEXING
Data is identified by indexing, which allows physical data independence from applications.
 Centralized indexing, e.g., versions of Napster, DNS
 Distributed indexing. Indexes to data scattered across peers. Access data through
mechanisms such as Distributed Hash Tables (DHT). These differ in hash mapping,
search algorithms, diameter for lookup, fault tolerance, churn resilience.
 Local indexing. Each peer indexes only the local objects. Remote objects need to be
searched for. Typical DHT uses flat key space. Used commonly in unstructured overlays
(E.g., Gnutella) along with flooding search or random walk search.
Another classification
 Semantic indexing - human readable, e.g., filename, keyword, database key. Supports
keyword searches, range searches, approximate searches.
 Semantic-free indexing. Not human readable. Corresponds to index obtained by use of
hash function.
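As an illustration of semantic-free indexing, the following minimal Python sketch maps both node addresses and object names into one flat m-bit key space; the hash function (SHA-1) and the value m = 7 are illustrative choices, not mandated by any particular overlay.

import hashlib

M = 7                       # bits in the identifier space (illustrative)
SPACE = 2 ** M

def semantic_free_key(name: str) -> int:
    """Map a human-readable name to a flat, semantic-free key via hashing."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % SPACE

# Node addresses and object names land in the same flat key space.
print(semantic_free_key("198.51.100.7:6881"))    # a node address
print(semantic_free_key("song.mp3"))             # an object name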
Simple Distributed Hash Table scheme
Figure 5.1 The mappings from node address space and object space in a typical DHT scheme,
e.g., Chord, CAN, Tapestry.
Mappings from node address space and object space in a simple DHT.
 Highly deterministic placement of files/data allows fast lookup. But file
insertions/deletions under churn incur some cost. Attribute search, range search,
keyword search, etc., are not possible.
Structured vs. unstructured overlays
Structured Overlays:
 Structure =⇒ placement of files is highly deterministic; file insertions and deletions
have some overhead
 Fast lookup
 Hash mapping based on a single characteristic (e.g., file name)
 Range queries, keyword queries, attribute queries difficult to support
Unstructured Overlays:
 No structure for overlay =⇒ no structure for data/file placement
 Node join/departures are easy; local overlay simply adjusted
 Only local indexing used
 File search entails high message overhead and high delays
 Complex, keyword, range, attribute queries supported
 Some overlay topologies naturally emerge:
o Power Law Random Graph (PLRG), where node degrees follow a power law.
Here, if the nodes are ranked in terms of degree, then the ith-ranked node has c/i^α
neighbors, where c is a constant.
o simple random graph: nodes typically have a uniform degree.
Unstructured Overlays: Properties
 Semantic indexing possible =⇒ keyword, range, attribute-based queries
 Easily accommodate high churn
 Efficient when data is replicated in network
 Good if user satisfied with ”best-effort” search
 Network is not so large as to cause high delays in search
Gnutella features
 A joiner connects to some standard nodes from Gnutella directory
 Ping used to discover other hosts; allows new host to announce itself
 Pong in response to Ping ; Pong contains IP, port #, max data size for download
 Query messages are used for flooding search; a Query contains the required search parameters
 QueryHit messages are the responses. If data is found, this message contains the IP, port #, file size, download rate,
etc. The path used is the reverse path of the Query.
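A minimal Python sketch of the TTL-limited flooding search just described; the overlay graph, object placement, and TTL value are all illustrative assumptions, not from the text.

def flood_search(neighbours, objects, start, wanted, ttl=3):
    """TTL-limited flooding search on an unstructured overlay.
    neighbours: dict node -> list of neighbour nodes (the overlay)
    objects:    dict node -> set of object names stored locally
    Returns the set of nodes holding `wanted` and the number of Query messages sent."""
    hits, messages = set(), 0
    visited = set()
    queue = [(start, ttl)]              # (node receiving the Query, remaining TTL)
    while queue:
        node, t = queue.pop(0)
        if node in visited:
            continue                    # duplicate Query, dropped
        visited.add(node)
        if wanted in objects.get(node, set()):
            hits.add(node)              # QueryHit returned on the reverse path
        if t > 0:
            for nbr in neighbours.get(node, []):
                messages += 1           # one Query message per overlay link used
                queue.append((nbr, t - 1))
    return hits, messages

# Illustrative 5-node overlay
nbrs = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2], 5: [3]}
objs = {4: {"song.mp3"}, 5: {"clip.avi"}}
print(flood_search(nbrs, objs, start=1, wanted="song.mp3", ttl=2))   # -> ({4}, 6)

The message count divided into visited and duplicate messages illustrates the node coverage and message duplication metrics defined below.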
Search in Unstructured Overlays
Consider a system with n nodes and m objects. Let qi be the popularity of object i, as measured
by the fraction of all queries that are queries for object i. All objects may be equally popular, or,
more realistically, a Zipf-like power law distribution of popularity exists. Thus, either qi = 1/m for all i, or qi ∝ 1/i^α, with Σ(i=1..m) qi = 1.
Let ri be the number of replicas of object i, and let pi = ri / Σj rj be the fraction of all replicas that are
replicas of i. Three static replication strategies are: uniform, proportional, and square-root. Thus,
pi = 1/m under uniform replication, pi = qi under proportional replication, and pi = √qi / Σj √qj under square-root replication.
Under uniform replication, all objects have an equal number of replicas; hence the performance
for all query rates is the same. With a uniform query rate, proportional and square-root
replication schemes reduce to the uniform replication scheme.
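The following Python sketch computes the replica fractions pi under the three static strategies for a Zipf-like popularity distribution; the object count and the exponent α are illustrative values.

def zipf_popularity(m, alpha=1.0):
    """q_i proportional to 1/i**alpha, normalised so the q_i sum to 1."""
    raw = [1.0 / (i ** alpha) for i in range(1, m + 1)]
    s = sum(raw)
    return [q / s for q in raw]

def replica_fractions(q, strategy):
    """Fraction p_i of all replicas given to object i under a static strategy."""
    if strategy == "uniform":
        return [1.0 / len(q)] * len(q)           # p_i = 1/m
    if strategy == "proportional":
        return q[:]                              # p_i = q_i
    if strategy == "square_root":
        roots = [qi ** 0.5 for qi in q]
        s = sum(roots)
        return [r / s for r in roots]            # p_i = sqrt(q_i) / sum_j sqrt(q_j)
    raise ValueError(strategy)

q = zipf_popularity(m=4, alpha=1.0)
for s in ("uniform", "proportional", "square_root"):
    print(s, [round(p, 3) for p in replica_fractions(q, s)])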
For an object search, some of the metrics of efficiency:
 probability of success of finding the queried object.
 delay or the number of hops in finding an object.
 the number of messages processed by each node in a search.
 node coverage, the fraction of (distinct) nodes visited
 message duplication, which is (#messages - #nodes visited)/#messages
 maximum number of messages at a node
 recall, the number of objects found satisfying the desired search criteria. This metric is
useful for keyword, inexact, and range queries.
 message efficiency, which is the recall per message used
Guided versus Unguided Search. In unguided or blind search, there is no history of earlier
searches. In guided search, nodes store some history of past searches to aid future searches.
Various mechanisms for caching hints are used. We focus on unguided searches in the context of
unstructured overlays.
CHORD DISTRIBUTED HASH TABLE
Overview
 Node address as well as object value is mapped to a logical identifier in a common flat
key space using a consistent hash function.
 When a node joins or leaves the network of n nodes, only a 1/n fraction of the keys have to be moved.
 Two steps involved.
o Map the object value to its key
o Map the key to the node in the native address space using lookup
 The common address space is an m-bit identifier (2^m addresses), and this space is arranged on
a logical ring mod (2^m).
 A key k gets assigned to the first node whose node identifier equals or is greater
than the key identifier k in the logical address space.
Figure 5.2 An example Chord ring with m = 7, showing mappings to the
Chord address space, and a query lookup using a simple scheme
Simple lookup
 Each node tracks its successor on the ring.
 A query for key x is forwarded on the ring until it reaches the first node whose identifier
y ≥ x (mod 2^m).
 The result, which includes the IP address of the node with identifier y, is returned to the
querying node along the reverse of the path that was followed by the query.
 This mechanism requires O(1) local space but O(n) hops.
Algorithm 5.1 A simple object location algorithm in Chord at node i
Example: The steps for the query lookup(K8) initiated at node 28 are shown in Figure 5.2
using arrows.
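A minimal Python sketch of this simple successor-based lookup; the ring membership is an illustrative assumption (m = 7), and the query mirrors the lookup(K8)-from-node-28 example above.

M = 7
SPACE = 2 ** M

def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b] modulo 2**M."""
    if a < b:
        return a < x <= b
    return x > a or x <= b               # interval wraps past 0

def simple_lookup(ring, start, key):
    """Follow successor pointers until the first node whose id >= key (mod 2**M).
    `ring` is a sorted list of node identifiers; returns (owner node, hops)."""
    idx = ring.index(start)
    hops = 0
    while True:
        node = ring[idx]
        succ = ring[(idx + 1) % len(ring)]
        if in_half_open(key, node, succ):
            return succ, hops + 1        # the successor is responsible for the key
        idx = (idx + 1) % len(ring)      # forward the query to the successor
        hops += 1

ring = sorted([5, 18, 28, 45, 63, 80, 99, 115])   # illustrative Chord ring, m = 7
print(simple_lookup(ring, start=28, key=8))        # lookup(K8) initiated at node 28

The O(n) hop count of this scheme is what the finger tables in the next subsection reduce to O(log n).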
Scalable lookup
 Each node i maintains a routing table, called the finger table, with O(log n) entries, such
that the xth entry (1 ≤ x ≤ m) is the node identifier of the node succ(i + 2^(x−1)).
 This is denoted by i.finger[x] = succ(i + 2^(x−1)). This is the first node whose key is
greater than the key of node i by at least 2^(x−1) mod 2^m.
 Complexity: O(log n) message hops at the cost of O(log n) space in the local routing
tables
 Each finger table entry contains the IP address and port number in addition to the node
identifier
 Due to the log structure of the finger table, there is more info about nodes closer by than
about nodes further away.
 Consider a query on key key at node i ,
o if key lies between i and its successor, the key would reside at the successor and
its address is returned.
o If key lies beyond the successor, then node i searches through the m entries in its
finger table to identify the node j such that j most immediately precedes key ,
among all the entries in the finger table.
o As j is the closest known node that precedes key , j is most likely to have the
most information on locating key , i.e., locating the immediate successor node to
which key has been mapped
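A minimal Python sketch of the finger-table-based lookup just described, reusing the illustrative ring from the previous sketch; the helper names (successor, finger_table) are not from the text.

M = 7
SPACE = 2 ** M

def in_half_open(x, a, b):
    if a < b:
        return a < x <= b
    return x > a or x <= b

def in_open(x, a, b):
    """True if x lies strictly between a and b on the ring."""
    if a < b:
        return a < x < b
    return x > a or x < b

def successor(ring, ident):
    """First node on the ring whose identifier >= ident (mod 2**M)."""
    for node in ring:
        if node >= ident:
            return node
    return ring[0]

def finger_table(ring, i):
    """Finger x (1 <= x <= M) of node i points to succ(i + 2**(x-1))."""
    return [successor(ring, (i + 2 ** (x - 1)) % SPACE) for x in range(1, M + 1)]

def scalable_lookup(ring, start, key):
    """Route a query using finger tables; returns (owner, hops)."""
    node, hops = start, 0
    while True:
        succ = successor(ring, (node + 1) % SPACE)
        if in_half_open(key, node, succ):
            return succ, hops            # key resides at the current node's successor
        nxt = node
        for f in reversed(finger_table(ring, node)):
            if in_open(f, node, key):    # closest finger that precedes the key
                nxt = f
                break
        node, hops = nxt, hops + 1

ring = sorted([5, 18, 28, 45, 63, 80, 99, 115])
print(scalable_lookup(ring, start=28, key=8))    # -> (18, 2): far fewer hops than the simple scheme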
Algorithm 5.2 A scalable object location algorithm in Chord at node i
Figure 5.3 An example showing a query lookup using the logarithmically-structured finger tables
Managing Churn
Algorithm 5.3 Managing churn in Chord. Code shown is for node i
Figure 5.4 Steps in the integration of node i in the ring, where j > i > k
 How are node departures and node failures handled?
 For a Chord network with n nodes, each node is responsible for at most (1 + s)K/n keys
with “high probability”, where K is the total number of keys. Using consistent hashing,
s can be shown to be bounded by O(log n).
 The search for a successor in Locate_Successor in a Chord network with n nodes requires
time complexity O(log n) with high probability.
 The size of the finger table is log(n) ≤ m.
 The average lookup time is (1/2) log(n). For details of the complexity derivations, refer to the text.
Content addressable networks (CAN)
 An indexing mechanism that maps objects to locations in CAN
 Used for object location in P2P networks, large-scale storage management, and wide-area name
resolution services that decouple name resolution from the naming scheme
 Efficient, scalable addition and location of objects using location-independent names
or keys
 Three basic operations: insertion, search, and deletion of (key, value) pairs
 A d-dimensional logical Cartesian space organized as a d-torus logical topology, i.e., a
d-dimensional mesh with wraparound
 The space is partitioned dynamically among the nodes, i.e., node i is assigned region r(i). For an object v, its
key, key(v), is mapped to a point p in the space. The (v, key(v)) tuple is stored at the node that
currently owns the region containing the point p.
 Retrieval of object v proceeds analogously.
The three components of CAN are:
o Set up the CAN virtual coordinate space, partitioned among the nodes
o Routing in the virtual coordinate space to locate the node that is assigned the region
corresponding to p
o Maintain the CAN in spite of node departures and failures.
CAN initialization
 Each CAN has a unique DNS name that maps to the IP address of a few bootstrap nodes.
Bootstrap node: tracks a partial list of the nodes that it believes are currently in the
CAN.
 A joiner node queries a bootstrap node via a DNS lookup. Bootstrap node replies with the
IP addresses of some randomly chosen nodes that it believes are in the CAN.
 The joiner chooses a random point p in the coordinate space. The joiner sends a request
to one of the nodes in the CAN that it learnt of in the previous step, asking to be assigned a
region containing p. The recipient of the request routes the request to the current owner,
old owner(p), of the region containing p, using the CAN routing algorithm.
 The old owner(p) node splits its region in half and assigns one half to the joiner. The
region splitting is done using an a priori ordering of all the dimensions. This also helps to
methodically merge regions, if necessary. The (k, v) tuples for which the key k now
maps to the zone transferred to the joiner are also transferred to the joiner.
 The joiner learns the IP addresses of its neighbours from old owner(p). The neighbours
are old owner(p) and a subset of the neighbours of old owner(p). old owner(p) also
updates its set of neighbours. The new joiner as well as old owner(p) inform their
neighbours of the changes to the space allocation. In fact, each node has to send an
immediate update of its assigned region, followed by periodic HEARTBEAT refresh
messages, to all its neighbours.
 When a node joins a CAN, only the neighbouring nodes in the coordinate space are
required to participate. The overhead is thus of the order of the number of neighbours,
which is O(d ) and independent of n.
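A minimal Python sketch of the join-time region split described above, assuming a 2-dimensional unit space and an a priori dimension ordering; the region representation and coordinates are illustrative.

def split_region(region, depth):
    """Split a d-dimensional region in half along dimension `depth % d`
    (the a priori ordering), returning (kept_half, given_half).
    A region is a list of (low, high) intervals, one per dimension."""
    d = len(region)
    axis = depth % d
    low, high = region[axis]
    mid = (low + high) / 2.0
    kept = list(region)
    kept[axis] = (low, mid)
    given = list(region)
    given[axis] = (mid, high)
    return kept, given

def owns(region, point):
    """True if `point` falls inside `region`."""
    return all(lo <= x < hi for (lo, hi), x in zip(region, point))

# Old owner's region is the whole unit square; the joiner chose p = (0.7, 0.2),
# which here falls in the half handed over to the joiner.
old_region = [(0.0, 1.0), (0.0, 1.0)]
kept, given = split_region(old_region, depth=0)      # split along dimension 0 first
print(kept, given)
# (key, value) tuples whose key maps into the transferred half move to the joiner.
tuples = {(0.3, 0.4): "a", (0.8, 0.1): "b"}
print({k: v for k, v in tuples.items() if owns(given, k)})   # -> {(0.8, 0.1): 'b'}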
CAN routing
 Straight-line path routing in Euclidean space
 Each node’s routing table maintains the list of its neighbour nodes, their IP addresses, and their
virtual coordinate regions.
 To locate value v, its key k(v) is mapped to a point p. Knowing the neighbours’ region
coordinates, each node follows simple greedy routing by forwarding the message to the
neighbour having coordinates closest to the destination (a sketch of this greedy forwarding appears below).
 The average number of neighbours of a node is O(d).
 The average path length is (d/4) · n^(1/d) hops.
Advantages over Chord:
 Each node has about the same number of neighbours and the same amount of state information, independent of n,
which gives scalability
 Multiple paths in CAN provide fault-tolerance
 Average path length of (d/4) · n^(1/d) hops
Figure 5.5 A two-dimensional CAN space. Seven regions are shown. The dashed arrows show the routing
from node 2 to the coordinate point p shown by the shaded circle.
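A minimal Python sketch of the greedy forwarding toward a point p, assuming each node knows its neighbours' region centres; the 2x2 region layout and coordinates are illustrative assumptions.

def torus_distance(a, b, size=1.0):
    """Euclidean distance on a d-dimensional torus with wraparound."""
    total = 0.0
    for x, y in zip(a, b):
        d = abs(x - y)
        d = min(d, size - d)            # wraparound along each dimension
        total += d * d
    return total ** 0.5

def greedy_route(neighbours, centres, start, p):
    """Forward towards the neighbour whose region centre is closest to p.
    neighbours: node -> list of neighbour nodes
    centres:    node -> centre coordinates of the node's region
    Returns the path taken (stops when no neighbour improves on the current node)."""
    path = [start]
    node = start
    while True:
        best = min(neighbours[node], key=lambda n: torus_distance(centres[n], p))
        if torus_distance(centres[best], p) >= torus_distance(centres[node], p):
            return path                 # the current node's region is closest to p
        node = best
        path.append(node)

centres = {1: (0.25, 0.25), 2: (0.75, 0.25), 3: (0.25, 0.75), 4: (0.75, 0.75)}
nbrs = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
print(greedy_route(nbrs, centres, start=1, p=(0.8, 0.8)))   # -> [1, 2, 4]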
CAN maintenance
When a node voluntarily departs from CAN, it hands over its region and the associated database
of (key,value) tuples to one of its neighbors.
 Neighbour choice: prefer a neighbour with which the departing node’s region merges to form a convex region.
 Otherwise, the neighbour with the smallest volume is chosen. In this case the regions are not merged; that
neighbour handles both regions until the background reassignment protocol is run.
 A node failure is detected when its periodic HEARTBEAT message is not received by its neighbours.
They then run a TAKEOVER protocol to decide which neighbour will own the dead node’s
region. This protocol favours the neighbour with the smallest region volume.
 Despite the TAKEOVER protocol, the (key, value) tuples remain lost until the background
region reassignment protocol is run.
 The background reassignment protocol performs load balancing, restores the one-to-one node-to-region
assignment, and prevents fragmentation.
Figure 5.6 Example showing region reassignment in a CAN
CAN optimizations
 Improve per-hop latency, path length, fault tolerance, availability, and load balancing.
These techniques typically demonstrate a trade-off.
 Multiple dimensions. As the path length is O(d · n^(1/d)), increasing the number of
dimensions decreases the path length and increases routing fault tolerance at the expense
of a larger state space per node.
 Multiple realities or coordinate spaces. The same node will store different (k, v ) tuples
belonging to the region assigned to it in each reality, and will also have a different
neighbour set. The data contents (k, v ) get replicated, leading to higher availability.
Furthermore, the multiple copies of each (k, v ) tuple offer a choice.
 Routing fault tolerance also improves.
 Use delay metric instead of Cartesian metric for routing
 Overloading coordinate regions by having multiple nodes assigned to each region. Path
length and latency can reduce, fault tolerance improves, and per-hop latency decreases.
 Use multiple hash functions. This is equivalent to using multiple realities.
 Topologically sensitive construction of the overlay. This can greatly reduce per-hop latency.
 CAN complexity: O(d) for a joiner; O((d/4) · n^(1/d)) hops for routing; node departure costs O(d^2).
Tapestry
 Nodes and objects are assigned IDs from a common space via distributed hashing.
 Hashed node ids are termed VIDs or vid . Hashed object identifiers are termed GUIDs or
OG .
 ID space typically has m = 160 bits, and is expressed in hexadecimal.
 If a node v such that vid = OG exists, then that v becomes the root. If such a v does
not exist, then another unique node sharing the largest common prefix with OG is chosen
to be the surrogate root.
 The object OG is stored at the root, or the root has a direct pointer to the object.
 To access object O, reach the root (real or surrogate) using prefix routing
 Prefix routing to select the next hop is done by increasing the prefix match of the next
hop’s VID with the destination OGR. Thus, a message destined for OGR = 62C35 could
be routed along nodes with VIDs 6****, then 62***, then 62C**, then 62C3*, and then
to 62C35.
Overlay and routing
 Let M = 2^m. The routing table at node vid contains b · log_b M entries, organized in log_b M
levels i = 1, ..., log_b M. Each entry is of the form (wid, IP address).
 Each entry at level i denotes some “neighbour” node VID with an (i − 1)-digit prefix match with
vid – thus, the entry’s wid matches vid in the (i − 1)-digit prefix. Further, in level i, for
each digit j in the chosen base (e.g., 0, 1, ..., E, F when b = 16), there is an entry for
which the ith digit position is j.
 For each forward pointer, there is a backward pointer.
Figure 5.7 Some example links of the Tapestry routing mesh at node with identifier “7C25”[35].
Three links from each level 1 through 4 are labeled by the level
 The jth entry in level i may not exist because no node meets the criterion. This is a hole in
the routing table.
 Surrogate routing can be used to route around holes. If the jth entry in level i should be
chosen but is missing, route to the next non-empty entry in level i , using wraparound if
needed. All the levels from 1 to log_b 2^m need to be considered in routing, thus requiring
log_b 2^m hops.
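A minimal Python sketch of prefix routing with surrogate routing around holes, using 3-digit hexadecimal VIDs and a hand-built routing mesh purely for illustration (real Tapestry IDs are 40 hex digits).

def next_hop(table, level, digit):
    """Pick the level-`level` slot for `digit`; if that slot is a hole, use
    surrogate routing: take the next non-empty slot at the same level,
    wrapping around the digit space.  table[level] maps hex digit -> VID."""
    hexdigits = "0123456789ABCDEF"
    slots = table[level]
    start = hexdigits.index(digit)
    for k in range(len(hexdigits)):
        d = hexdigits[(start + k) % len(hexdigits)]
        if d in slots:
            return slots[d]
    return None                          # the whole level is empty

def route(tables, source, dest):
    """Prefix routing: each hop extends the matched prefix by one digit.
    Levels are 0-indexed here; level L fixes digit L of the destination."""
    node, path = source, [source]
    for level in range(len(dest)):
        if node[:level + 1] == dest[:level + 1]:
            continue                     # this digit already matches
        node = next_hop(tables[node], level, dest[level])
        path.append(node)
    return path

# Tiny illustrative mesh with 3-digit hex VIDs.
tables = {
    "FAB": {0: {"6": "6D1"}},            # level-0 entry for digit '6'
    "6D1": {1: {"2": "62A"}},            # shares prefix "6"; level-1 entry for digit '2'
    "62A": {2: {"C": "62C"}},            # shares prefix "62"; level-2 entry for digit 'C'
}
print(route(tables, "FAB", "62C"))       # -> ['FAB', '6D1', '62A', '62C']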
Algorithm 5.4 Routing in Tapestry. The logic for determining the next hop at a node with node
identifier v, 1 ≤ v ≤ n, based on the ith digit of OG, i.e., based on the digit in the ith most
significant position in OG.
Figure 5.8 An example of routing from FAB11 to 62C35 [35]. The numbers on the arrows show
the level of the routing table used. The dashed arrows show some unused links.
Object publication and object search
 The unique spanning tree used to route to vid is used to publish and locate an object
whose unique root identifier OGR is vid.
 A server S that stores object O having GUID OG and root OGR periodically publishes the
object by routing a publish message from S towards OGR.
 At each hop, and including the root node OGR, the publish message creates a pointer to the
object.
 This is the directory info and is maintained in soft-state.
 To search for an object O with GUID OG, a client sends a query destined for the root OGR.
o Along the log_b 2^m hops, if a node finds a pointer to the object residing on
server S, the node redirects the query directly to S.
o Otherwise, it forwards the query towards the root OGR, which is guaranteed to
have the pointer for the location mapping.
 A query gets redirected directly to the object as soon as the query path overlaps the
publish path towards the same root.
An example showing publishing of object with identifier 72EA1 at two replicas 1F329 and
C2B40. A query for the object from 094ED will find the object pointer at 7FAB1. A query
from 7826C will find the object pointer at 72F11. A query from BCF35 will find the object
pointer at 729CC.
Node insertion
 For any node Y on the path between a publisher of object O and the root OGR, node Y
should have a pointer to O.
 Nodes which have a hole in their routing table should be notified if the insertion of node
X can fill that hole.
 If X becomes the new root of existing objects, references to those objects should now
lead to X .
 The routing table for node X must be constructed.
 The nodes near X should include X in their routing tables to perform more efficient
routing.
Node deletion
 Node A informs the nodes to which it has (routing) backpointers. It also provides them
with replacement entries for each level from its routing table. This is to prevent holes in
their routing tables. (The notified neighbours can periodically run the nearest neighbour
algorithm to fine-tune their tables.)
 The servers to which A has object pointers are also notified. The notified servers send
object republish messages.
 During the above steps, node A routes messages to objects rooted at itself to their new
roots. On completion of the above steps, node A informs the nodes reachable via its
backpointers and forward pointers that it is leaving, and then leaves.
 Node failures: Repair the object location pointers, routing tables and mesh, using the
redundancy in the Tapestry routing network. Refer to the book for the algorithms
Complexity
 A search for an object is expected to take O(log_b 2^m) hops. However, the routing tables are
optimized to identify nearest-neighbour hops (as per the space metric). Thus, the latency
for each hop is expected to be small, compared to that for the CAN and Chord protocols.
 The size of the routing table at each node is c · b · log_b 2^m, where c is the constant that
limits the size of the neighbour set that is maintained for fault-tolerance.
 The larger the Tapestry network, the more efficient is the performance. Hence, better if
different applications share the same overlay.
Distributed shared memory
Distributed Shared Memory Abstractions
 No Send and Receive primitives to be used by application
o Under covers, Send and Receive used by DSM manager
 Locking is too restrictive; need concurrent access
 With replica management, problem of consistency arises!
=⇒ weaker consistency models (weaker than the von Neumann model) are required
Advantages:
 Shields the programmer from Send/Receive primitives
 Single address space; simplifies passing-by-reference and passing of complex data structures
 Exploits locality-of-reference when a block is moved
 DSM uses simpler software interfaces and cheaper off-the-shelf hardware; hence it is cheaper than
dedicated multiprocessor systems
 No memory access bottleneck, as there is no single bus
 Large virtual memory space
 DSM programs are portable as they use a common DSM programming interface
Disadvantages:
 Programmers need to understand consistency models to write correct programs
 DSM implementations use asynchronous message-passing, and hence cannot be more efficient than
message-passing implementations
 By yielding control to the DSM manager software, programmers cannot use their own message-passing
solutions
The main issues in designing a DSM system are the following:
• Determining what semantics to allow for concurrent access to shared objects. The semantics
needs to be clearly specified so that the programmer can code his program using an appropriate
logic.
• Determining the best way to implement the semantics of concurrent access to shared data. One
possibility is to use replication. One decision to be made is the degree of replication – partial
replication at some sites, or full replication at all the sites. A further decision then is to decide on
whether to use read-replication (replication for the read operations) or write-replication
(replication for the write operations) or both.
• Selecting the locations for replication (if full replication is not used), to optimize efficiency
from the system’s viewpoint.
• Determining the location of remote data that the application needs to access, if full replication
is not used.
• Reducing communication delays and the number of messages that are involved under the
covers while implementing the semantics of concurrent access to shared data.
There are four broad dimensions along which DSM systems can be classified and implemented:
• Whether data is replicated or cached.
• Whether remote access is by hardware or by software.
• Whether the caching/replication is controlled by hardware or software.
• Whether the DSM is controlled by the distributed memory managers, by the operating system,
or by the language runtime system.
Comparison of DSM systems
Table 5.2 Comparison of DSM systems
MEMORY CONSISTENCY MODELS
Memory coherence
Memory coherence is the ability of the system to execute memory operations correctly. Assume
n processes and si memory operations per process Pi.
 si memory operations by Pi
 There are (s1 + s2 + ⋯ + sn)! / (s1! s2! ⋯ sn!) possible interleavings. For example, with n = 2 and
s1 = s2 = 2, there are 4!/(2! 2!) = 6 possible interleavings.
 The memory coherence model defines which interleavings are permitted.
 Traditionally, a Read returns the value written by the most recent Write. However, the “most recent”
Write is ambiguous with replicas and concurrent accesses.
 DSM consistency model is a contract between DSM system and application programmer
Figure 5.9 Sequential invocations and responses in a DSM system, without any pipelining.
Strict consistency/atomic consistency/linearizability
Strict consistency
1. A Read should return the most recent value written, per a global time axis. For operations
that overlap per the global time axis, the following must hold.
2. All operations appear to be atomic and sequentially executed.
3. All processors see the same order of events, equivalent to the global time ordering of
non-overlapping events.
Figure: Strict consistency
 (Liveness:) Each invocation must have a corresponding response.
 (Correctness:) The projection of Seq on any processor i, denoted Seqi, must be a
sequence of alternating invocations and responses if pipelining is disallowed.
Implementations
 Implementing linearizability is expensive because a global time scale needs to be
simulated. As all processors need to agree on a common order, the implementation needs
to use total order. For simplicity, we assume full replication of each data item at all the
processors. Hence, total ordering needs to be combined with a broadcast.
Figure 5.10 Examples to illustrate definitions of linearizability and sequential
consistency. The initial values of variables are zero.
Algorithm 5.5 Implementing linearizability (LIN) using total order broadcasts [6]. Code shown is
for
Pi, 1 ≤ i ≤ n.
When a Read is simulated at other processes, it is a no-op. Why do Reads participate in total
order broadcasts?
Reads need to be serialized w.r.t. other Reads and all Write operations. See counter-example
where Reads do not participate in total order broadcast.
Figure 5.11 A violation of linearizability (LIN) if Read operations do not participate in the total
order broadcast
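A single-process Python simulation sketch of the total-order-broadcast idea, in which one shared log stands in for the broadcast; this is an illustration of why Reads join the total order, not a reproduction of Algorithm 5.5.

class LinearizableMemory:
    """Single-process simulation of the total-order-broadcast approach: every
    Read and Write is appended to one totally ordered log (standing in for the
    broadcast), and Writes are applied to every replica in that common order."""

    def __init__(self, n_replicas, variables):
        self.log = []                                     # the common total order
        self.replicas = [dict.fromkeys(variables, 0) for _ in range(n_replicas)]

    def write(self, pid, var, value):
        self.log.append(("W", pid, var, value))           # broadcast the Write
        self._deliver()                                   # the Write completes on delivery

    def read(self, pid, var):
        self.log.append(("R", pid, var, None))            # Reads also join the total order
        self._deliver()                                   # a no-op at the other replicas
        return self.replicas[pid][var]

    def _deliver(self):
        op, pid, var, val = self.log[-1]
        if op == "W":
            for rep in self.replicas:                     # apply the Write everywhere
                rep[var] = val

mem = LinearizableMemory(n_replicas=2, variables=["x"])
mem.write(0, "x", 4)
print(mem.read(1, "x"))    # -> 4: the Read is ordered after the Write in the common order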
Sequential Consistency.
 The result of any execution is the same as if all operations of the processors were
executed in some sequential order.
 The operations of each individual processor appear in this sequence in the local program
order.
Any interleaving of the operations from the different processors is possible. But all processors
must see the same interleaving. Even if two operations from different processors (on the same or
different variables) do not overlap in a global time scale, they may appear in reverse order in the
common sequential order seen by all. See examples used for linearizability.
Only Writes participate in total order broadcasts. Reads do not because:
 all consecutive operations by the same processor are ordered in that same order (no
pipelining), and
 Read operations by different processors are independent of each other; to be ordered only
with respect to the Write operations.
 Direct simplification of the LIN algorithm.
 Reads executed atomically. Not so for Writes.
 Suitable for Read-intensive programs.
Algorithm 5.6 Implementing sequential consistency (SC) using local Read operations [6]. Code
shown is for Pi, 1 ≤ i ≤ n.
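The algorithm itself is not reproduced here; for contrast with the linearizability sketch above, below is an illustrative single-process sketch of the local-Read idea (class and method names are assumptions, not the book's pseudocode): only Writes join the total order, and a Read returns the local replica's value immediately.

class SCMemoryLocalReads:
    """Sketch of sequential consistency with local Reads: only Writes join the
    total order; a Read returns the local replica's value immediately."""

    def __init__(self, n_replicas, variables):
        self.log = []                                     # total order of Writes only
        self.replicas = [dict.fromkeys(variables, 0) for _ in range(n_replicas)]

    def write(self, pid, var, value):
        self.log.append((pid, var, value))                # only Writes are broadcast
        for rep in self.replicas:                         # delivered in the common order
            rep[var] = value

    def read(self, pid, var):
        return self.replicas[pid][var]                    # purely local, no broadcast

mem = SCMemoryLocalReads(n_replicas=2, variables=["x"])
mem.write(0, "x", 4)
print(mem.read(1, "x"))    # -> 4 here; in a real system delivery at P1 may lag,
                           #    which SC permits but linearizability would not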
Algorithm 5.7 Implementing Sequential Consistency (SC) using local Write operations [6]. Code
shown is for Pi, 1 ≤ i ≤ n.
Causal consistency
 In SC, all Write ops should be seen in a common order.
 For causal consistency, only causally related Writes need to be seen in a common order.
The causality relation for shared memory systems is defined as follows:
 Local order At a processor, the serial order of the events defines the local causal order.
 Inter-process order A Write operation causally precedes a Read operation issued by
another processor if the Read returns a value written by the Write.
 Transitive closure The transitive closure of the above two relations defines the (global)
causal order.
Figure 5.12 Examples to illustrate definitions of sequential consistency (SC), causal consistency
(CC), and PRAM consistency. The initial values of variables are zero.
PRAM (pipelined RAM) or processor consistency
PRAM memory
Only Write ops issued by the same processor are seen by others in the order they were issued,
but Writes from different processors may be seen by other processors in different orders.
PRAM can be implemented by FIFO broadcast. However, PRAM memory can exhibit counter-intuitive
behavior, as seen below.
Slow Memory
Only Write operations issued by the same processor and to the same memory location must be
seen by others in that order.
Figure 5.13 Examples to illustrate definitions of PRAM consistency and slow memory.
The initial values of variables are zero.
Hierarchy of consistency models
int: x, y;

Process 1                               Process 2
...                                     ...
(1a) x ← 4;                             (2a) y ← 6;
(1b) if y = 0 then kill(P2).            (2b) if x = 0 then kill(P1).
Figure 5.14 A strict hierarchy of the memory consistency models.
Other models based on synchronization instructions
 Consistency conditions apply only to special ”synchronization” instructions, e.g., barrier
synchronization
 Non-sync statements may be executed in any order by the various processors, e.g., weak
consistency, release consistency, entry consistency
Weak consistency:
All Writes are propagated to other processes, and all Writes done elsewhere are brought locally,
at a sync instruction.
 Accesses to sync variables are sequentially consistent
 Access to sync variable is not permitted unless all Writes elsewhere have completed
 No data access is allowed until all previous synchronization variable accesses have been
performed
 Drawback: the system cannot tell whether a process is beginning access to shared variables (entering the CS)
or has finished access to shared variables (exiting the CS).
Release Consistency and Entry Consistency
Two types of synchronization variables: Acquire and Release
Release Consistency
 Acquire indicates CS is to be entered. Hence all Writes from other processors should be
locally reflected at this instruction
 Release indicates access to CS is being completed. Hence, all Updates made locally
should be propagated to the replicas at other processors.
 Acquire and Release can be defined on a subset of the variables.
 If no CS semantics are used, then Acquire and Release act as barrier synchronization
variables.
 Lazy release consistency: propagate updates on demand, rather than at the time of the Release.
Entry Consistency
 Each ordinary shared variable is associated with a synchronization variable (e.g., lock,
barrier)
 For Acquire /Release on a synchronization variable, access to only those ordinary
variables guarded by the synchronization variables is performed.
SHARED MEMORY MUTUAL EXCLUSION
Lamport’s bakery algorithm
Algorithm 5.7 Lamport’s n-process bakery algorithm for shared memory mutual exclusion. Code
shown is for process Pi, 1 ≤ i ≤ n.
Mutual exclusion
 Role of line (1e)? Wait for others’ timestamp choices to stabilize.
 Role of line (1f)? Wait for the higher priority (lexicographically lower timestamp) processes to enter the CS.
 Bounded waiting: Pi can be overtaken by other processes at most once (each).
 Progress: the lexicographic order is a total order; the process with the lowest timestamp in lines (1d)-(1g) enters the CS.
 Space complexity: a lower bound of n registers.
 Time complexity: O(n) time for the Bakery algorithm.
Lamport’s fast mutual exclusion algorithm takes O(1) time in the absence of contention. However, it
compromises on bounded waiting. It uses a W–R–W–R sequence, which is necessary and sufficient to
check for contention and safely enter the CS.
Bounded waiting is satisfied because each other process j can “overtake” process i at most once
after i has completed choosing its timestamp. The second time j chooses a timestamp, the value
will necessarily be larger than i’s timestamp if i has not yet entered its CS.
Progress is guaranteed because the lexicographic order is a total order and the process with the
lowest timestamp at any time in the loop (lines 1d–1g) is guaranteed to enter the CS.
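A hedged Python sketch of the n-process bakery algorithm using threads; the comments tie the waits to the roles of lines (1e) and (1f) discussed above, and the demo relies on CPython making list-element reads and writes effectively atomic, which stands in for the safe shared registers assumed by the algorithm.

import threading

N = 3                                   # number of processes
ITERS = 50
choosing = [False] * N                  # "choosing" flags
number = [0] * N                        # tickets; 0 means "not competing"
count = {"value": 0}                    # shared counter protected by the CS

def lock(i):
    choosing[i] = True
    number[i] = 1 + max(number)         # pick a ticket larger than all seen so far
    choosing[i] = False
    for j in range(N):
        if j == i:
            continue
        while choosing[j]:              # wait for j's ticket choice to stabilize (role of line (1e))
            pass
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass                        # wait while j has lexicographically higher priority (role of line (1f))

def unlock(i):
    number[i] = 0

def worker(i):
    for _ in range(ITERS):
        lock(i)
        count["value"] += 1             # critical section
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(count["value"])                   # -> 150 (N * ITERS) if mutual exclusion held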
Lamport’s WRWR mechanism and fast mutual exclusion
Figure 5.15 An example showing the need for a boolean vector for fast mutual exclusion.
Hardware support for mutual exclusion
Algorithm 5.8 Definitions of synchronization operations Test&Set and Swap.
Algorithm 5.9 Mutual exclusion using Swap. Code shown is for process Pi, 1 ≤ i ≤ n.
Algorithm 5.10 Mutual exclusion with bounded waiting, using Test&Set. Code shown is for
process Pi, 1 ≤ i ≤ n.
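A minimal Python sketch of Test&Set- and Swap-style entry sections, emulating the atomic hardware primitives with a lock purely for illustration; this shows the simple (unbounded-waiting) entry section, whereas Algorithm 5.10 additionally uses a waiting vector to bound waiting.

import threading

class Register:
    """Shared register offering atomic Test&Set and Swap (emulated with a lock
    purely to stand in for the hardware primitives)."""
    def __init__(self, value=False):
        self._value = value
        self._guard = threading.Lock()

    def test_and_set(self):
        with self._guard:               # atomically: return the old value, set to True
            old, self._value = self._value, True
            return old

    def swap(self, new):
        with self._guard:               # atomically exchange the register with `new`
            old, self._value = self._value, new
            return old

    def write(self, value):
        with self._guard:
            self._value = value

reg = Register(False)
count = {"value": 0}

def worker():
    for _ in range(1000):
        while reg.test_and_set():       # entry: spin until Test&Set returns False
            pass                        # (a Swap-based entry would spin on reg.swap(True))
        count["value"] += 1             # critical section
        reg.write(False)                # exit: release the register

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(count["value"])                   # -> 4000 under mutual exclusion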

More Related Content

Similar to UNIT V.pdf

Routing Protocols of Distributed Hash Table Based Peer to Peer Networks
Routing Protocols of Distributed Hash Table Based Peer to Peer NetworksRouting Protocols of Distributed Hash Table Based Peer to Peer Networks
Routing Protocols of Distributed Hash Table Based Peer to Peer NetworksIOSR Journals
 
Data and File Structure Lecture Notes
Data and File Structure Lecture NotesData and File Structure Lecture Notes
Data and File Structure Lecture NotesFellowBuddy.com
 
Efficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted DataEfficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted DataIRJET Journal
 
UNIT III DIS.pptx
UNIT III DIS.pptxUNIT III DIS.pptx
UNIT III DIS.pptxSamPrem3
 
ast nearest neighbor search with keywords
ast nearest neighbor search with keywordsast nearest neighbor search with keywords
ast nearest neighbor search with keywordsswathi78
 
Simple Load Rebalancing For Distributed Hash Tables In Cloud
Simple Load Rebalancing For Distributed Hash Tables In CloudSimple Load Rebalancing For Distributed Hash Tables In Cloud
Simple Load Rebalancing For Distributed Hash Tables In CloudIOSR Journals
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Scienceinventy
 
Full-Text Retrieval in Unstructured P2P Networks using Bloom Cast Efficiently
Full-Text Retrieval in Unstructured P2P Networks using Bloom Cast EfficientlyFull-Text Retrieval in Unstructured P2P Networks using Bloom Cast Efficiently
Full-Text Retrieval in Unstructured P2P Networks using Bloom Cast Efficientlyijsrd.com
 
New approaches with chord in efficient p2p grid resource discovery
New approaches with chord in efficient p2p grid resource discoveryNew approaches with chord in efficient p2p grid resource discovery
New approaches with chord in efficient p2p grid resource discoveryijgca
 
NEW APPROACHES WITH CHORD IN EFFICIENT P2P GRID RESOURCE DISCOVERY
NEW APPROACHES WITH CHORD IN EFFICIENT P2P GRID RESOURCE DISCOVERYNEW APPROACHES WITH CHORD IN EFFICIENT P2P GRID RESOURCE DISCOVERY
NEW APPROACHES WITH CHORD IN EFFICIENT P2P GRID RESOURCE DISCOVERYijgca
 
Spot db consistency checking and optimization in spatial database
Spot db  consistency checking and optimization in spatial databaseSpot db  consistency checking and optimization in spatial database
Spot db consistency checking and optimization in spatial databasePratik Udapure
 
IRJET- Estimating Various DHT Protocols
IRJET- Estimating Various DHT ProtocolsIRJET- Estimating Various DHT Protocols
IRJET- Estimating Various DHT ProtocolsIRJET Journal
 
P2P Lookup Protocols
P2P Lookup ProtocolsP2P Lookup Protocols
P2P Lookup ProtocolsZubin Bhuyan
 
Bx32903907
Bx32903907Bx32903907
Bx32903907IJMER
 
A Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
A Survey on Efficient Privacy-Preserving Ranked Keyword Search MethodA Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
A Survey on Efficient Privacy-Preserving Ranked Keyword Search MethodIRJET Journal
 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. ElasticsearchSelecto
 

Similar to UNIT V.pdf (20)

Routing Protocols of Distributed Hash Table Based Peer to Peer Networks
Routing Protocols of Distributed Hash Table Based Peer to Peer NetworksRouting Protocols of Distributed Hash Table Based Peer to Peer Networks
Routing Protocols of Distributed Hash Table Based Peer to Peer Networks
 
Data and File Structure Lecture Notes
Data and File Structure Lecture NotesData and File Structure Lecture Notes
Data and File Structure Lecture Notes
 
Efficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted DataEfficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted Data
 
UNIT III DIS.pptx
UNIT III DIS.pptxUNIT III DIS.pptx
UNIT III DIS.pptx
 
ast nearest neighbor search with keywords
ast nearest neighbor search with keywordsast nearest neighbor search with keywords
ast nearest neighbor search with keywords
 
Cg4201552556
Cg4201552556Cg4201552556
Cg4201552556
 
Simple Load Rebalancing For Distributed Hash Tables In Cloud
Simple Load Rebalancing For Distributed Hash Tables In CloudSimple Load Rebalancing For Distributed Hash Tables In Cloud
Simple Load Rebalancing For Distributed Hash Tables In Cloud
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
Full-Text Retrieval in Unstructured P2P Networks using Bloom Cast Efficiently
Full-Text Retrieval in Unstructured P2P Networks using Bloom Cast EfficientlyFull-Text Retrieval in Unstructured P2P Networks using Bloom Cast Efficiently
Full-Text Retrieval in Unstructured P2P Networks using Bloom Cast Efficiently
 
New approaches with chord in efficient p2p grid resource discovery
New approaches with chord in efficient p2p grid resource discoveryNew approaches with chord in efficient p2p grid resource discovery
New approaches with chord in efficient p2p grid resource discovery
 
NEW APPROACHES WITH CHORD IN EFFICIENT P2P GRID RESOURCE DISCOVERY
NEW APPROACHES WITH CHORD IN EFFICIENT P2P GRID RESOURCE DISCOVERYNEW APPROACHES WITH CHORD IN EFFICIENT P2P GRID RESOURCE DISCOVERY
NEW APPROACHES WITH CHORD IN EFFICIENT P2P GRID RESOURCE DISCOVERY
 
Spot db consistency checking and optimization in spatial database
Spot db  consistency checking and optimization in spatial databaseSpot db  consistency checking and optimization in spatial database
Spot db consistency checking and optimization in spatial database
 
IRJET- Estimating Various DHT Protocols
IRJET- Estimating Various DHT ProtocolsIRJET- Estimating Various DHT Protocols
IRJET- Estimating Various DHT Protocols
 
P2P Lookup Protocols
P2P Lookup ProtocolsP2P Lookup Protocols
P2P Lookup Protocols
 
Bx32903907
Bx32903907Bx32903907
Bx32903907
 
A Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
A Survey on Efficient Privacy-Preserving Ranked Keyword Search MethodA Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
A Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
 
IJET-V2I6P33
IJET-V2I6P33IJET-V2I6P33
IJET-V2I6P33
 
4 026
4 0264 026
4 026
 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. Elasticsearch
 
Spatial approximate string search
Spatial approximate string searchSpatial approximate string search
Spatial approximate string search
 

Recently uploaded

Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 

Recently uploaded (20)

Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 

UNIT V.pdf

  • 1. CS8603-Distributed systems Prepared By: Mr. P. J. Merbin Jose, AP/CSE Page 1 UNIT V P2P & DISTRIBUTED SHARED MEMORY Peer-to-peer computing and overlay graphs: Introduction – Data indexing and overlays – Chord – Content addressable networks – Tapestry. Distributed shared memory: Abstraction and advantages – Memory consistency models –Shared memory Mutual Exclusion. PEER-TO-PEER COMPUTING AND OVERLAY GRAPHS: INTRODUCTION Characteristics  P2P network: application-level organization of the network to flexibly share resources  All nodes are equal; communication directly between peers (no client-server) Allow location of arbitrary objects; no DNS servers required  Large combined storage, CPU power, other resources, without scalability costs  Dynamic insertion and deletion of nodes, as well as of resources, at low cost Features Performance self-organizing largecombinedstorage,CPUpower,andresources distributed control fast search for machines and data objects role symmetry for nodes scalable anonymity efficient management of churn naming mechanism selection of geographically close servers security, authentication, trust redundancy in storage and paths Table5.1: Desirable characteristics and performance features of P2P systems. Napster Central server maintains a table with the following information of each registered client: (i) the client’s address (IP) and port, and offered bandwidth, and (ii) information about the files that the client can allow to share.  A client connects to a meta-server that assigns a lightly-loaded server.  The client connects to the assigned server and forwards its query and identity.  The server responds to the client with information about the users connected to it and the files they are sharing.  On receiving the response from the server, the client chooses one of the users from whom to download a desired file. The address to enable the P2P connection between the client and the selected user is provided by the server to the client.
  • 2. CS8603-Distributed systems Prepared By: Mr. P. J. Merbin Jose, AP/CSE Page 2 Users are generally anonymous to each other. The directory serves to provide the mapping from a particular host that contains the required content, to the IP address needed to download from it. Structured and Unstructured Overlays  Search for data and placement of data depends on P2P overlay (which can be thought of as being below the application level overlay)  Search is data-centric, not host-centric Structured P2P overlays: o E.g., hypercube, mesh, de Bruijn graphs o rigid organizational principles for object storage and object search  Unstructured P2P overlays: o Loose guidelines for object search and storage o Search mechanisms are ad-hoc, variants of flooding and random walk  Object storage and search strategies are intricately linked to the overlay structure as well as to the data organization mechanisms. DATA INDEXING Data identified by indexing, which allows physical data independence from apps.  Centralized indexing, e.g., versions of Napster, DNS  Distributed indexing. Indexes to data scattered across peers. Access data through mechanisms such as Distributed Hash Tables (DHT). These differ in hash mapping, search algorithms, diameter for lookup, fault tolerance, churn resilience.  Local indexing. Each peer indexes only the local objects. Remote objects need to be searched for. Typical DHT uses flat key space. Used commonly in unstructured overlays (E.g., Gnutella) along with flooding search or random walk search. Another classification  Semantic indexing - human readable, e.g., filename, keyword, database key. Supports keyword searches, range searches, approximate searches.  Semantic-free indexing. Not human readable. Corresponds to index obtained by use of hash function. Simple Distributed Hash Table scheme
  • 3. CS8603-Distributed systems Prepared By: Mr. P. J. Merbin Jose, AP/CSE Page 3 Figure 5.1 The mappings from node address space and object space in a typical DHT scheme, e.g., Chord, CAN, Tapestry. Mappings from node address space and object space in a simple DHT.  Highly deterministic placement of files/data allows fast lookup. But file insertions/deletions under churn incurs some cost. Attribute search, range search, keyword search etc. not possible. Structured vs. unstructured overlays Structured Overlays:  structure = placement of files is highly deterministic, file insertions and deletions have some overhead Fast lookup  Hash mapping based on a single characteristic (e.g., file name) Range queries, keyword queries, attribute queries difficult to support Unstructured Overlays:  No structure for overlay =⇒ no structure for data/file placement  Node join/departures are easy; local overlay simply adjusted  Only local indexing used  File search entails high message overhead and high delays  Complex, keyword, range, attribute queries supported  Some overlay topologies naturally emerge:
o Power-law random graph (PLRG), where node degrees follow a power law: if the nodes are ranked in terms of degree, then the ith node has c/i^α neighbors, where c is a constant.
o Simple random graph: nodes typically have a uniform degree.
Unstructured overlays: properties
 Semantic indexing possible =⇒ keyword, range, and attribute-based queries.
 Easily accommodate high churn.
 Efficient when data is replicated in the network.
 Good if the user is satisfied with "best-effort" search.
 Network is not so large as to cause high delays in search.
Gnutella features
 A joiner connects to some standard nodes obtained from a Gnutella directory.
 Ping is used to discover other hosts; it allows a new host to announce itself.
 Pong is sent in response to a Ping; a Pong contains the IP address, port number, and maximum data size for download.
 Query messages are used for flooding search; a Query contains the required search parameters.
 QueryHit messages are the responses. If data is found, this message contains the IP address, port number, file size, download rate, etc. The path used is the reverse path of the Query.
Search in unstructured overlays
Consider a system with n nodes and m objects. Let q_i be the popularity of object i, measured as the fraction of all queries that are queries for object i. All objects may be equally popular, or, more realistically, a Zipf-like power-law distribution of popularity exists. Thus,
    Σ_{i=1..m} q_i = 1;   uniform popularity: q_i = 1/m;   Zipf-like: q_i ∝ 1/i^α.
Let r_i be the number of replicas of object i, and let p_i be the fraction of all objects that are replicas of i, i.e., p_i = r_i / Σ_{j=1..m} r_j. Three static replication strategies are uniform, proportional, and square-root replication. Thus,
    uniform: p_i = 1/m;   proportional: p_i = q_i;   square-root: p_i = √q_i / Σ_{j=1..m} √q_j.
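To make the three static replication strategies concrete, here is a minimal Python sketch (all function names are illustrative, not from the source) that computes the replica fractions p_i for a Zipf-like popularity distribution:

# Minimal sketch: replica fractions under uniform, proportional, and
# square-root replication for a Zipf-like popularity distribution.
def zipf_popularity(m, alpha=1.0):
    raw = [1.0 / (i ** alpha) for i in range(1, m + 1)]
    total = sum(raw)
    return [q / total for q in raw]          # q_i, normalized so they sum to 1

def replica_fractions(q, strategy):
    if strategy == "uniform":
        return [1.0 / len(q)] * len(q)        # p_i = 1/m
    if strategy == "proportional":
        return q[:]                           # p_i = q_i
    if strategy == "square_root":
        roots = [qi ** 0.5 for qi in q]
        s = sum(roots)
        return [r / s for r in roots]         # p_i = sqrt(q_i) / sum_j sqrt(q_j)
    raise ValueError(strategy)

q = zipf_popularity(m=5)
for s in ("uniform", "proportional", "square_root"):
    print(s, [round(p, 3) for p in replica_fractions(q, s)])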
Under uniform replication, all objects have an equal number of replicas; hence the performance for all query rates is the same. With a uniform query rate, the proportional and square-root replication schemes reduce to the uniform replication scheme.
For an object search, some of the metrics of efficiency are:
 Probability of success of finding the queried object.
 Delay, or the number of hops, in finding an object.
 The number of messages processed by each node in a search.
 Node coverage, the fraction of (distinct) nodes visited.
 Message duplication, which is (#messages − #nodes visited)/#messages.
 Maximum number of messages at a node.
 Recall, the number of objects found satisfying the desired search criteria. This metric is useful for keyword, inexact, and range queries.
 Message efficiency, which is the recall per message used.
Guided versus unguided search
In unguided or blind search, there is no history of earlier searches. In guided search, nodes store some history of past searches to aid future searches. Various mechanisms for caching hints are used. We focus on unguided searches in the context of unstructured overlays.
CHORD DISTRIBUTED HASH TABLE
Overview
 Node addresses as well as object values are mapped to logical identifiers in a common flat key space using a consistent hash function.
 When a node joins or leaves a network of n nodes, only an O(1/n) fraction of the keys has to be moved.
 Two steps are involved:
o Map the object value to its key.
o Map the key to the node in the native address space using a lookup.
 The common address space is an m-bit identifier space (2^m addresses), arranged on a logical ring mod 2^m.
 A key k gets assigned to the first node whose identifier equals or follows k in the logical address space.
Figure 5.2 An example Chord ring with m = 7, showing mappings to the Chord address space, and a query lookup using a simple scheme.
Simple lookup
 Each node tracks its successor on the ring.
 A query for key x is forwarded along the ring until it reaches the first node whose identifier y satisfies y ≥ x (mod 2^m).
 The result, which includes the IP address of the node with identifier y, is returned to the querying node along the reverse of the path followed by the query.
 This mechanism requires O(1) local space but O(n) hops.
Algorithm 5.1 A simple object location algorithm in Chord at node i.
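The following is a minimal Python sketch of this simple successor-pointer lookup; it is an illustration rather than the textbook's Algorithm 5.1, and the ring size, node identifiers, and function names are assumptions made for the example.

# Sketch of simple Chord lookup: each node knows only its successor,
# so a query walks around the ring, taking O(n) hops in the worst case.
M = 7                                   # identifier space is 2**M (assumed)
RING = 2 ** M

class Node:
    def __init__(self, ident):
        self.ident = ident
        self.successor = None           # set once the ring is built

def in_range(key, a, b):
    """True if key lies in the half-open ring interval (a, b]."""
    if a < b:
        return a < key <= b
    return key > a or key <= b          # interval wraps around 0

def lookup(start, key):
    node, hops = start, 0
    while not in_range(key, node.ident, node.successor.ident):
        node = node.successor           # forward the query along the ring
        hops += 1
    return node.successor, hops         # successor of the arc holding the key

# Build a tiny ring with node identifiers 10, 28, 45, 99 (illustrative values).
ids = [10, 28, 45, 99]
nodes = [Node(i) for i in ids]
for a, b in zip(nodes, nodes[1:] + nodes[:1]):
    a.successor = b

owner, hops = lookup(nodes[1], 8)       # lookup(K8) starting at node 28
print(owner.ident, hops)                # key 8 wraps around and maps to node 10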
Example The steps for the query lookup(K8) initiated at node 28 are shown in Figure 5.2 using arrows.
Scalable lookup
 Each node i maintains a routing table, called the finger table, with O(log n) entries, such that the xth entry (1 ≤ x ≤ m) is the node identifier of the node succ(i + 2^{x−1}).
 This is denoted by i.finger[x] = succ(i + 2^{x−1}). This is the first node whose key is greater than the key of node i by at least 2^{x−1}, mod 2^m.
 Complexity: O(log n) message hops, at the cost of O(log n) space for the local routing tables.
 Each finger table entry contains the IP address and port number in addition to the node identifier.
 Due to the logarithmic structure of the finger table, a node has more information about nodes close by than about nodes further away.
 Consider a query on key key at node i:
o If key lies between i and its successor, the key resides at the successor, and the successor's address is returned.
o If key lies beyond the successor, then node i searches through the m entries in its finger table to identify the node j that most immediately precedes key, among all the entries in the finger table.
o As j is the closest known node that precedes key, j is most likely to have the most information on locating key, i.e., on locating the immediate successor node to which key has been mapped.
Algorithm 5.2 A scalable object location algorithm in Chord at node i.
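A minimal sketch of the finger-table lookup idea follows; it extends the Node class and helpers from the previous sketch and illustrates the closest-preceding-finger rule, not the exact pseudo-code of Algorithm 5.2.

# Sketch: scalable Chord lookup using finger tables, O(log n) hops expected.
def build_fingers(nodes):
    ids = sorted(n.ident for n in nodes)
    by_id = {n.ident: n for n in nodes}
    def succ(k):                         # first node identifier >= k (mod ring)
        k %= RING
        return by_id[min((i for i in ids if i >= k), default=ids[0])]
    for n in nodes:
        n.finger = [succ(n.ident + 2 ** (x - 1)) for x in range(1, M + 1)]

def scalable_lookup(node, key, hops=0):
    if in_range(key, node.ident, node.successor.ident):
        return node.successor, hops      # key lives at the successor
    # Otherwise forward to the closest finger that precedes the key.
    for f in reversed(node.finger):
        if in_range(f.ident, node.ident, key):
            return scalable_lookup(f, key, hops + 1)
    return scalable_lookup(node.successor, key, hops + 1)

build_fingers(nodes)
owner, hops = scalable_lookup(nodes[1], 8)
print(owner.ident, hops)                 # same owner as before, fewer hops on large rings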
Figure 5.3 An example showing a query lookup using the logarithmically structured finger tables.
Managing churn
Algorithm 5.3 Managing churn in Chord. Code shown is for node i.
Figure 5.4 Steps in the integration of node i in the ring, where j > i > k.
 How are node departures and node failures handled? (See Algorithm 5.3.)
 For a Chord network with n nodes, each node is responsible for at most (1 + s)K/n keys, with high probability, where K is the total number of keys. Using consistent hashing, s can be shown to be bounded by O(log n).
 The search for a successor (Locate_Successor) in a Chord network with n nodes requires time complexity O(log n) with high probability.
 The size of the finger table is log(n) ≤ m.
 The average lookup time is (1/2) log(n).
For the details of the complexity derivations, refer to the text.
CONTENT ADDRESSABLE NETWORKS (CAN)
 An indexing mechanism that maps objects to locations in the CAN.
 Used for object location in P2P networks, large-scale storage management, and wide-area name resolution services that decouple name resolution from the naming scheme.
 Efficient, scalable addition and location of objects using location-independent names or keys.
 Three basic operations: insertion, search, and deletion of (key, value) pairs.
 A d-dimensional logical Cartesian space organized as a d-torus logical topology, i.e., a d-dimensional mesh with wraparound.
 The space is partitioned dynamically among the nodes, i.e., node i is assigned a region r(i). For an object v, its key key(v) is mapped to a point p in the space. The (v, key(v)) tuple is stored at the node that currently owns the region containing the point p.
 The retrieval of object v proceeds analogously.
The three components of a CAN design are:
o Setting up the CAN virtual coordinate space and partitioning it among the nodes.
o Routing in the virtual coordinate space to locate the node that is assigned the region containing the point p.
o Maintaining the CAN in spite of node departures and failures.
CAN initialization
1. Each CAN has a unique DNS name that maps to the IP addresses of a few bootstrap nodes. A bootstrap node tracks a partial list of the nodes that it believes are currently in the CAN.
2. A joiner node queries a bootstrap node via a DNS lookup. The bootstrap node replies with the IP addresses of some randomly chosen nodes that it believes are in the CAN.
3. The joiner chooses a random point p in the coordinate space. The joiner sends a request to one of the nodes in the CAN, of which it learnt in Step 2, asking to be assigned a region containing p. The recipient of the request routes the request to the current owner old_owner(p) of the region containing p, using the CAN routing algorithm.
4. The old_owner(p) node splits its region in half and assigns one half to the joiner. The region splitting is done using an a priori ordering of all the dimensions; this also helps to methodically merge regions, if necessary. The (k, v) tuples for which the key k now maps to the zone transferred to the joiner are also transferred to the joiner.
5. The joiner learns the IP addresses of its neighbours from old_owner(p). The neighbours are old_owner(p) and a subset of the neighbours of old_owner(p). old_owner(p) also updates its own set of neighbours. The new joiner as well as old_owner(p) inform their neighbours of the changes to the space allocation. In fact, each node has to send an immediate update of its assigned region, followed by periodic HEARTBEAT refresh messages, to all its neighbours.
 When a node joins a CAN, only the neighbouring nodes in the coordinate space are required to participate. The overhead is thus of the order of the number of neighbours, which is O(d) and independent of n.
CAN routing
 Routing approximates a straight-line path through the Euclidean space.
 Each node's routing table maintains a list of its neighbour nodes, with their IP addresses and their virtual Euclidean coordinate regions.
 To locate value v, its key k(v) is mapped to a point p. Knowing the neighbours' region coordinates, each node follows simple greedy routing by forwarding the message to the neighbour whose coordinates are closest to the destination.
 The average number of neighbours of a node is O(d).
 The average path length is (d/4)·n^{1/d} hops.
Advantages over Chord:
 Each node has about the same number of neighbours and the same amount of state information, independent of n, which aids scalability.
 Multiple paths in the CAN provide fault tolerance.
 The average path length is (d/4)·n^{1/d} hops.
Figure 5.5 Two-dimensional CAN space. Seven regions are shown. The dashed arrows show the routing from node 2 to the coordinate p shown by the shaded circle.
CAN maintenance
When a node voluntarily departs from the CAN, it hands over its region and the associated database of (key, value) tuples to one of its neighbours.
 Neighbour choice: prefer a neighbour with which the merger forms a convex region; otherwise, choose the neighbour with the smallest volume. In the latter case, the regions are not merged and the neighbour handles both regions until the background reassignment protocol is run.
 A node failure is detected when its periodic HEARTBEAT messages are not received by its neighbours. The neighbours then run a TAKEOVER protocol to decide which of them will own the dead node's region; this protocol favours the neighbour with the smallest volume.
 Despite the TAKEOVER protocol, the (key, value) tuples of the failed node remain lost until the background region reassignment protocol is run.
 The background reassignment protocol aims at 1:1 load balancing: it restores the 1:1 node-to-region assignment and prevents fragmentation.
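The greedy forwarding rule can be illustrated with a small Python sketch; the torus distance function, the toy regions, and all names here are illustrative assumptions, not part of the CAN specification.

# Sketch of CAN greedy routing on a d-dimensional torus of side 1.0.
# Each node owns a region; a message is forwarded to the neighbour whose
# region centre is closest (in torus distance) to the destination point p.
def torus_dist(a, b):
    d2 = 0.0
    for x, y in zip(a, b):
        delta = abs(x - y)
        delta = min(delta, 1.0 - delta)      # wraparound in each dimension
        d2 += delta * delta
    return d2 ** 0.5

class CanNode:
    def __init__(self, name, centre):
        self.name = name
        self.centre = centre                 # centre of the owned region
        self.neighbours = []

def route(node, p, path=None):
    path = (path or []) + [node.name]
    best = min([node] + node.neighbours, key=lambda n: torus_dist(n.centre, p))
    if best is node:                         # no neighbour is closer: we own p
        return path
    return route(best, p, path)

# A toy 2-D CAN with four regions (illustrative coordinates).
a, b = CanNode("A", (0.25, 0.25)), CanNode("B", (0.75, 0.25))
c, d = CanNode("C", (0.25, 0.75)), CanNode("D", (0.75, 0.75))
a.neighbours, b.neighbours = [b, c], [a, d]
c.neighbours, d.neighbours = [a, d], [b, c]
print(route(a, (0.8, 0.8)))                  # greedy path, e.g. ['A', 'B', 'D']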
Figure 5.6 Example showing region reassignment in a CAN.
CAN optimizations
 The goals are to improve the per-hop latency, path length, fault tolerance, availability, and load balancing. These techniques typically involve a trade-off.
 Multiple dimensions. As the path length is O(d·n^{1/d}), increasing the number of dimensions decreases the path length and increases routing fault tolerance, at the expense of larger state space per node.
 Multiple realities or coordinate spaces. The same node will store different (k, v) tuples belonging to the region assigned to it in each reality, and will also have a different neighbour set. The data contents (k, v) get replicated, leading to higher availability. Furthermore, the multiple copies of each (k, v) tuple offer a choice, and routing fault tolerance also improves.
 Use a delay metric instead of the Cartesian metric for routing.
 Overloading coordinate regions by having multiple nodes assigned to each region. Path length and latency can reduce, fault tolerance improves, and per-hop latency decreases.
 Use multiple hash functions; this is equivalent to using multiple realities.
 Topologically sensitive overlay construction. This can greatly reduce per-hop latency.
 CAN complexity: O(d) for a joiner; O((d/4)·n^{1/d}) hops for routing; O(d^2) for a node departure.
TAPESTRY
 Nodes and objects are assigned identifiers from a common space via distributed hashing.
 Hashed node identifiers are termed VIDs or vid; hashed object identifiers are termed GUIDs or OG.
 The identifier space typically has m = 160 bits and is expressed in hexadecimal.
 If a node v such that vid = OG exists, then that v becomes the root for OG. If no such v exists, then another unique node sharing the largest common prefix with OG is chosen to be the surrogate root.
 The object OG is stored at the root, or the root has a direct pointer to the object.
 To access object O, reach the root (real or surrogate) using prefix routing.
 Prefix routing selects the next hop by increasing the prefix match of the next hop's VID with the destination OGR. Thus, a message destined for OGR = 62C35 could
be routed along nodes with VIDs 6****, then 62***, then 62C**, then 62C3*, and then to 62C35.
Overlay and routing
 Let M = 2^m. The routing table at node vid contains b·log_b M entries, organized in log_b M levels i = 1, . . . , log_b M. Each entry is of the form (wid, IP address).
 Each entry in level i denotes some "neighbour" node VID with an (i − 1)-digit prefix match with vid – thus, the entry's wid matches vid in the (i − 1)-digit prefix. Further, in level i, for each digit j in the chosen base (e.g., 0, 1, . . ., E, F when b = 16), there is an entry for which the ith digit position is j.
 For each forward pointer, there is a backward pointer.
Figure 5.7 Some example links of the Tapestry routing mesh at the node with identifier "7C25" [35]. Three links from each level 1 through 4 are labeled by the level.
 The jth entry in level i may not exist because no node meets the criterion. This is a hole in the routing table.
 Surrogate routing can be used to route around holes. If the jth entry in level i should be chosen but is missing, route to the next non-empty entry in level i, using wraparound if needed. All the levels from 1 to log_b 2^m need to be considered in routing, thus requiring log_b 2^m hops.
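A compact Python sketch of the prefix-routing next-hop choice, including the surrogate-routing wraparound within a level, is given below; the routing-table representation and all names are assumptions made for illustration, and the authoritative logic is Algorithm 5.4 shown after this sketch.

# Sketch of Tapestry next-hop selection with surrogate routing.
# table[level][digit] holds a neighbour VID (or None for a "hole");
# 'level' is 1-based and selects the digit at position level-1 of the GUID.
def next_hop(my_vid, guid, table):
    # Longest common prefix of my VID and the destination GUID.
    level = 0
    while level < len(guid) and my_vid[level] == guid[level]:
        level += 1
    if level == len(guid):
        return my_vid                      # this node is the (surrogate) root
    wanted = int(guid[level], 16)
    row = table[level + 1]                 # level i fixes the ith digit
    for k in range(16):                    # surrogate routing: wrap within the level
        candidate = row[(wanted + k) % 16]
        if candidate is not None:
            return candidate
    return my_vid                          # empty level: treat self as surrogate

# Tiny illustration: at node 7C25F, a message for 62C35 should leave on a
# level-1 link towards some node whose VID starts with '6'.
table = {i: [None] * 16 for i in range(1, 6)}
table[1][0x6] = "6F001"                    # hypothetical level-1 neighbour
print(next_hop("7C25F", "62C35", table))   # -> 6F001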
Algorithm 5.4 Routing in Tapestry. The logic for determining the next hop at a node with node identifier v, 1 ≤ v ≤ n, based on the ith digit of OG, i.e., based on the digit in the ith most significant position in OG.
Figure 5.8 An example of routing from FAB11 to 62C35 [35]. The numbers on the arrows show the level of the routing table used. The dashed arrows show some unused links.
Object publication and object search
 The unique spanning tree used to route to vid is used to publish and locate an object whose unique root identifier OGR is vid.
 A server S that stores an object O having GUID OG and root OGR periodically publishes the object by routing a publish message from S towards OGR.
 At each hop, and including at the root node OGR, the publish message creates a pointer to the object.
 This is the directory information and is maintained in soft state.
 To search for an object O with GUID OG, a client sends a query destined for the root OGR.
o Along the log_b 2^m hops, if a node finds a pointer to the object residing on server S, the node redirects the query directly to S.
o Otherwise, it forwards the query towards the root OGR, which is guaranteed to have the pointer for the location mapping.
 A query gets redirected directly to the object as soon as the query path overlaps the publish path towards the same root.
Figure: An example showing publishing of the object with identifier 72EA1 at two replicas, 1F329 and C2B40. A query for the object from 094ED will find the object pointer at 7FAB1. A query from 7826C will find the object pointer at 72F11. A query from BCF35 will find the object pointer at 729CC.
Node insertion
 For any node Y on the path between a publisher of object O and the root OGR, node Y should have a pointer to O.
 Nodes which have a hole in their routing table should be notified if the insertion of node X can fill that hole.
 If X becomes the new root of existing objects, references to those objects should now lead to X.
 The routing table for node X must be constructed.
 The nodes near X should include X in their routing tables to perform more efficient routing.
Node deletion
 Node A informs the nodes to which it has (routing) backpointers. It also provides them with replacement entries for each level from its routing table. This is to prevent holes in their routing tables. (The notified neighbours can periodically run the nearest-neighbour algorithm to fine-tune their tables.)
 The servers to which A has object pointers are also notified. The notified servers send object republish messages.
 During the above steps, node A routes messages destined for objects rooted at itself to their new roots. On completion of the above steps, node A informs the nodes reachable via its backpointers and forward pointers that it is leaving, and then leaves.
 Node failures: repair the object location pointers, routing tables, and routing mesh, using the redundancy in the Tapestry routing network. Refer to the book for the algorithms.
Complexity
 A search for an object is expected to take O(log_b 2^m) hops. However, the routing tables are optimized to identify nearest-neighbour hops (as per the space metric). Thus, the latency for each hop is expected to be small, compared to that of the CAN and Chord protocols.
 The size of the routing table at each node is c·b·log_b 2^m, where c is the constant that limits the size of the neighbour set maintained for fault tolerance.
 The larger the Tapestry network, the more efficient is the performance. Hence, it is better if different applications share the same overlay.
DISTRIBUTED SHARED MEMORY
Distributed shared memory abstractions
 No Send and Receive primitives are to be used by the application.
o Under the covers, Send and Receive are used by the DSM manager.
 Locking is too restrictive; concurrent access is needed.
 With replica management, the problem of consistency arises! =⇒ weaker consistency models (weaker than the von Neumann model) are required.
Advantages:
 Shields the programmer from Send/Receive primitives.
 Single address space; simplifies passing-by-reference and passing complex data structures.
 Exploits locality-of-reference when a block is moved.
 DSM uses simpler software interfaces and cheaper off-the-shelf hardware; hence it is cheaper than dedicated multiprocessor systems.
 No memory access bottleneck, as there is no single bus.
 Large virtual memory space.
 DSM programs are portable as they use a common DSM programming interface.
Disadvantages:
 Programmers need to understand consistency models in order to write correct programs.
 DSM implementations use asynchronous message-passing, and hence cannot be more efficient than message-passing implementations.
 By yielding control to the DSM manager software, programmers cannot use their own message-passing solutions.
The main issues in designing a DSM system are the following:
• Determining what semantics to allow for concurrent access to shared objects. The semantics needs to be clearly specified so that the programmer can code the program using an appropriate logic.
• Determining the best way to implement the semantics of concurrent access to shared data. One possibility is to use replication. One decision to be made is the degree of replication – partial replication at some sites, or full replication at all the sites. A further decision then is to decide on
whether to use read-replication (replication for the read operations) or write-replication (replication for the write operations) or both.
• Selecting the locations for replication (if full replication is not used), to optimize efficiency from the system's viewpoint.
• Determining the location of remote data that the application needs to access, if full replication is not used.
• Reducing communication delays and the number of messages that are involved under the covers while implementing the semantics of concurrent access to shared data.
There are four broad dimensions along which DSM systems can be classified and implemented:
• Whether data is replicated or cached.
• Whether remote access is by hardware or by software.
• Whether the caching/replication is controlled by hardware or software.
• Whether the DSM is controlled by the distributed memory managers, by the operating system, or by the language runtime system.
Comparison of DSM systems
Table 5.2 Comparison of DSM systems.
MEMORY CONSISTENCY MODELS
Memory coherence
Memory coherence is the ability of the system to execute memory operations correctly. Assume n processes and s_i memory operations per process P_i. Then, with
 s_i memory operations issued by P_i, there are
 (s_1 + s_2 + · · · + s_n)! / (s_1! s_2! · · · s_n!) possible interleavings.
 The memory coherence model defines which interleavings are permitted.
 Traditionally, a Read returns the value written by the most recent Write; however, the "most recent" Write is ambiguous with replicas and concurrent accesses.
 The DSM consistency model is a contract between the DSM system and the application programmer.
Figure 5.9 Sequential invocations and responses in a DSM system, without any pipelining.
Strict consistency/atomic consistency/linearizability
Strict consistency:
1. A Read should return the most recent value written, per a global time axis. For operations that overlap per the global time axis, the following must hold.
2. All operations appear to be atomic and sequentially executed.
3. All processors see the same order of events, equivalent to the global time ordering of non-overlapping events.
Figure: Strict consistency.
 (Liveness) Each invocation must have a corresponding response.
 (Correctness) The projection of Seq on any processor i, denoted Seq_i, must be a sequence of alternating invocations and responses if pipelining is disallowed.
Implementations
 Implementing linearizability is expensive because a global time scale needs to be simulated. As all processors need to agree on a common order, the implementation needs to use total order. For simplicity, we assume full replication of each data item at all the processors. Hence, total ordering needs to be combined with a broadcast.
Figure 5.10 Examples to illustrate the definitions of linearizability and sequential consistency. The initial values of variables are zero.
Algorithm 5.5 Implementing linearizability (LIN) using total order broadcasts [6]. Code shown is for P_i, 1 ≤ i ≤ n. When a Read is simulated at the other processes, there is a no-op.
Why do Reads participate in total order broadcasts? Reads need to be serialized with respect to other Reads and all Write operations. See the counter-example where Reads do not participate in the total order broadcast.
Figure 5.11 A violation of linearizability (LIN) if Read operations do not participate in the total order broadcast.
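To make the total-order idea concrete, here is a minimal single-machine Python sketch in the spirit of Algorithm 5.5; the class names are invented, the "total order broadcast" is modelled by a single global lock, and a real implementation would of course use a distributed total-order (atomic) broadcast.

# Toy model of LIN: both Reads and Writes go through a total-order broadcast,
# so every process sees one common order of all operations.
import threading
from collections import defaultdict

class TotalOrderBroadcast:
    def __init__(self, replicas):
        self.replicas = replicas
        self.lock = threading.Lock()       # stands in for the ordering service

    def broadcast(self, op):
        with self.lock:                    # deliver to all replicas in one order
            return [r.deliver(op) for r in self.replicas]

class Replica:
    def __init__(self, pid):
        self.pid = pid
        self.store = defaultdict(int)      # all shared variables initially 0

    def deliver(self, op):
        kind, var, val, origin = op
        if kind == 'write':
            self.store[var] = val          # every replica applies the Write
            return None
        # A Read is a no-op at other replicas; only the issuer returns a value.
        return self.store[var] if origin == self.pid else None

class LinearizableDSM:
    def __init__(self, n):
        self.replicas = [Replica(i) for i in range(n)]
        self.tob = TotalOrderBroadcast(self.replicas)

    def write(self, pid, var, val):
        self.tob.broadcast(('write', var, val, pid))

    def read(self, pid, var):
        return self.tob.broadcast(('read', var, None, pid))[pid]

dsm = LinearizableDSM(n=3)
dsm.write(0, 'x', 4)
print(dsm.read(1, 'x'))                    # 4 -- the Write is already totally ordered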
Sequential consistency
 The result of any execution is the same as if all operations of the processors were executed in some sequential order.
 The operations of each individual processor appear in this sequence in the local program order.
Any interleaving of the operations from the different processors is possible, but all processors must see the same interleaving. Even if two operations from different processors (on the same or different variables) do not overlap in a global time scale, they may appear in reverse order in the common sequential order seen by all. See the examples used for linearizability.
Only Writes participate in total order broadcasts; Reads do not, because:
 all consecutive operations by the same processor are ordered in that same order (no pipelining), and
 Read operations by different processors are independent of each other; they need to be ordered only with respect to the Write operations.
Properties of this implementation:
 It is a direct simplification of the LIN algorithm.
 Reads are executed atomically; this is not so for Writes.
 It is suitable for Read-intensive programs.
Algorithm 5.6 Implementing sequential consistency (SC) using local Read operations [6]. Code shown is for P_i, 1 ≤ i ≤ n.
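Continuing the illustrative sketch from the linearizability example above, the local-Read variant only changes the Read path; this is a structural sketch of the idea, not the figure's exact algorithm.

# Sketch of SC with local Reads: Writes still go through the total-order
# broadcast, but a Read simply returns the local replica's value.
class SequentiallyConsistentDSM(LinearizableDSM):
    def read(self, pid, var):
        return self.replicas[pid].store[var]   # purely local, no broadcast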
Algorithm 5.7 Implementing sequential consistency (SC) using local Write operations [6]. Code shown is for P_i, 1 ≤ i ≤ n.
Causal consistency
 In SC, all Write operations should be seen in a common order.
 For causal consistency, only causally related Writes need to be seen in a common order.
The causality relation for shared memory systems is defined as follows:
 Local order: at a processor, the serial order of the events defines the local causal order.
 Inter-process order: a Write operation causally precedes a Read operation issued by another processor if the Read returns the value written by the Write.
 Transitive closure: the transitive closure of the above two relations defines the (global) causal order.
Figure 5.12 Examples to illustrate the definitions of sequential consistency (SC), causal consistency (CC), and PRAM consistency. The initial values of variables are zero.
PRAM (pipelined RAM) or processor consistency
PRAM memory: only the Write operations issued by the same processor are seen by others in the order they were issued; Writes from different processors may be seen by other processors in different orders.
 PRAM can be implemented using FIFO broadcasts.
 PRAM memory can exhibit counter-intuitive behavior, as the following example shows.
        int: x, y;
        Process 1                             Process 2
        (1a) x ←− 4;                          (2a) y ←− 6;
        (1b) if y = 0 then kill(P2).          (2b) if x = 0 then kill(P1).
Under PRAM consistency, both processes may be killed: each process may perform its own Write before seeing the other's, so both conditions can evaluate to true.
Slow memory
Only Write operations issued by the same processor and to the same memory location must be seen by others in that order.
Figure 5.13 Examples to illustrate the definitions of PRAM consistency and slow memory. The initial values of variables are zero.
Hierarchy of consistency models
Figure 5.14 A strict hierarchy of the memory consistency models.
Other models based on synchronization instructions
 Consistency conditions apply only to special "synchronization" instructions, e.g., barrier synchronization.
 Non-synchronization statements may be executed in any order by the various processors, e.g., in weak consistency, release consistency, and entry consistency.
Weak consistency: all Writes are propagated to other processes, and all Writes done elsewhere are brought locally, at a synchronization instruction.
 Accesses to synchronization variables are sequentially consistent.
 Access to a synchronization variable is not permitted unless all Writes elsewhere have completed.
 No data access is allowed until all previous synchronization variable accesses have been performed.
 Drawback: the memory cannot tell whether a synchronization access marks the beginning of access to shared variables (entry to the CS) or the end of such access (exit from the CS).
Release consistency and entry consistency
Two types of synchronization variables are used: Acquire and Release.
Release consistency
 Acquire indicates that the CS is about to be entered. Hence, all Writes from other processors should be locally reflected at this instruction.
 Release indicates that access to the CS is being completed. Hence, all updates made locally should be propagated to the replicas at the other processors.
 Acquire and Release can be defined on a subset of the variables.
 If no CS semantics are used, then Acquire and Release act as barrier synchronization variables.
 Lazy release consistency: updates are propagated on demand rather than being pushed eagerly.
Entry consistency
 Each ordinary shared variable is associated with a synchronization variable (e.g., a lock or a barrier).
 For an Acquire/Release on a synchronization variable, access is performed only to those ordinary variables guarded by that synchronization variable.
SHARED MEMORY MUTUAL EXCLUSION
Lamport's bakery algorithm
Algorithm 5.7 Lamport's n-process bakery algorithm for shared memory mutual exclusion. Code shown is for process P_i, 1 ≤ i ≤ n.
Mutual exclusion
 Role of line (1e)? Wait for the others' timestamp choices to stabilize.
 Role of line (1f)? Wait for any higher-priority process (lexicographically lower timestamp) to enter the CS first.
 Bounded waiting: P_i can be overtaken by the other processes at most once (each).
 Progress: the lexicographic order is a total order; the process with the lowest timestamp in lines (1d)–(1g) enters the CS.
 Space complexity: lower bound of n registers.
 Time complexity: O(n) time for the bakery algorithm.
Lamport's fast mutual exclusion algorithm takes O(1) time in the absence of contention; however, it compromises on bounded waiting. It uses a W–R–W–R sequence that is necessary and sufficient to check for contention and safely enter the CS.
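Since the algorithm itself appears only as a figure above, here is a minimal runnable Python sketch of the bakery discipline; the thread count, iteration count, and variable names are illustrative, and the busy-waiting is kept deliberately simple.

# Sketch of Lamport's bakery algorithm with N threads sharing two arrays.
# choosing[i] plays the role of the guard checked at line (1e); the
# lexicographic test on (timestamp, id) plays the role of line (1f).
import threading

N = 3
choosing = [False] * N
timestamp = [0] * N          # 0 means "not competing"
counter = 0                  # the shared variable protected by the CS

def lock(i):
    choosing[i] = True
    timestamp[i] = max(timestamp) + 1        # take a ticket larger than any seen
    choosing[i] = False
    for j in range(N):
        while choosing[j]:                    # wait for j's ticket choice to stabilize
            pass
        while timestamp[j] != 0 and (timestamp[j], j) < (timestamp[i], i):
            pass                              # defer to lexicographically smaller (ts, id)

def unlock(i):
    timestamp[i] = 0

def worker(i):
    global counter
    for _ in range(200):
        lock(i)
        counter += 1                          # critical section
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                                # expected: 600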
Bounded waiting is satisfied because each other process j can "overtake" process i at most once after i has completed choosing its timestamp. The second time j chooses a timestamp, the value will necessarily be larger than i's timestamp if i has not yet entered its CS. Progress is guaranteed because the lexicographic order is a total order and the process with the lowest timestamp at any time in the loop (lines 1d–1g) is guaranteed to enter the CS.
Lamport's WRWR mechanism and fast mutual exclusion
Figure 5.14 An example showing the need for a boolean vector for fast mutual exclusion.
Hardware support for mutual exclusion
Algorithm 5.8 Definitions of the synchronization operations Test&Set and Swap.
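Algorithm 5.8 is shown only as a figure; as a rough emulation in Python, the two hardware primitives can be modelled as methods that read and update a register in one atomic step (the internal lock below only fakes that atomicity, purely for illustration).

# Emulation of the Test&Set and Swap primitives on a shared boolean register.
# A real machine performs each of these in a single atomic instruction.
import threading

class AtomicBool:
    def __init__(self, value=False):
        self._value = value
        self._guard = threading.Lock()

    def test_and_set(self):
        # Atomically return the old value and set the register to True.
        with self._guard:
            old, self._value = self._value, True
            return old

    def swap(self, new):
        # Atomically exchange the register's value with `new`.
        with self._guard:
            old, self._value = self._value, new
            return old

    def store(self, value):
        with self._guard:
            self._value = value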
Algorithm 5.9 Mutual exclusion using Swap. Code shown is for process P_i, 1 ≤ i ≤ n.
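A Swap-based spin lock in the spirit of Algorithm 5.9 can then be sketched as follows; this builds on the AtomicBool emulation above and is an illustration, not the figure's exact code.

# Swap-based mutual exclusion: keep swapping True into the lock register
# until the value swapped out is False, i.e., the lock was free.
lock_reg = AtomicBool(False)

def enter_cs():
    key = True
    while key:
        key = lock_reg.swap(True)   # spin until we swap out False

def exit_cs():
    lock_reg.store(False)           # release: mark the register free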
Algorithm 5.10 Mutual exclusion with bounded waiting, using Test&Set. Code shown is for process P_i, 1 ≤ i ≤ n.
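Finally, the bounded-waiting idea behind Algorithm 5.10 can be sketched as below: a waiting[] vector lets an exiting process hand the critical section to the next waiting process in cyclic order, so no process waits for more than n − 1 turns. This is a sketch of the standard technique, reusing the AtomicBool emulation above, and is not necessarily the figure's exact code; names are illustrative.

# Bounded-waiting mutual exclusion using Test&Set (n processes).
n = 4
waiting = [False] * n
lock_flag = AtomicBool(False)

def enter_cs(i):
    waiting[i] = True
    key = True
    while waiting[i] and key:
        key = lock_flag.test_and_set()   # grab the lock or wait to be handed the CS
    waiting[i] = False

def exit_cs(i):
    # Scan cyclically for the next waiting process; hand over, or free the lock.
    j = (i + 1) % n
    while j != i and not waiting[j]:
        j = (j + 1) % n
    if j == i:
        lock_flag.store(False)           # nobody waiting: release the lock
    else:
        waiting[j] = False               # hand the CS directly to process j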