Concurrent and Distributed Systems (Bechini) 
MUTUAL EXCLUSION 
 
Volatile: any write to a volatile variable establishes a happens-before relationship with subsequent reads of that same variable. This means that changes to a volatile variable are always visible to other threads. What's more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the volatile, but also the side effects of the code that led up to the change.
Reads and writes are atomic for all variables declared volatile (including long and double variables).
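A minimal sketch of the visibility guarantee (class and field names are illustrative): one thread sets a volatile flag, another spins until it observes the write.

// Hypothetical example: a volatile flag used to publish a value across threads.
public class VolatileFlag {
    private static volatile boolean ready = false;
    private static int payload = 0; // made visible by the volatile write/read below

    public static void main(String[] args) {
        Thread reader = new Thread(() -> {
            while (!ready) { /* spin until the volatile write becomes visible */ }
            // Happens-before: the write to payload precedes the write to ready,
            // so reading ready == true also makes payload == 42 visible.
            System.out.println("payload = " + payload);
        });
        reader.start();
        payload = 42;   // ordinary write
        ready = true;   // volatile write publishes it
    }
}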
 
Mutual Exclusion desired properties​: 
1. ME guaranteed in any case 
2. A process outside the Critical Section MUST NOT prevent any other process from entering it 
3. No deadlock 
4. No busy waiting 
5. No starvation 
 
Deadlock: a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does.
A deadlock situation can arise only if all of the following conditions hold simultaneously in a system:
1. Mutual Exclusion: at least one resource must be held in a non-shareable mode; only one process can use the resource at any given instant of time.
2. Hold and Wait or Resource Holding: a process is currently holding at least one resource and requesting additional resources which are being held by other processes.
3. No Preemption: a resource can be released only voluntarily by the process holding it.
4. Circular Wait: a process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource. In general, there is a set of waiting processes, P = {P1, P2, ..., PN}, such that P1 is waiting for a resource held by P2, P2 is waiting for a resource held by P3, and so on until PN is waiting for a resource held by P1.
These four conditions are known as the Coffman conditions, from their first description in a 1971 article by Edward G. Coffman, Jr. Unfulfillment of any of these conditions is enough to preclude a deadlock from occurring.
 
Busy waiting: in software engineering, busy-waiting or spinning is a technique in which a process repeatedly checks to see if a condition is true, such as whether keyboard input or a lock is available. In low-level programming, busy-waits may actually be desirable: it may not be desirable or practical to implement interrupt-driven processing for every hardware device, particularly those that are seldom accessed.
 
Sleeping lock: as opposed to a spinning lock, this technique puts a thread waiting to access a resource into sleeping/ready mode. The thread is paused and its execution stops, so the CPU can perform a context switch and keep working on another process/thread. This saves CPU time (cycles) that would be wasted by a spinning lock implementation. However, sleeping locks also have a time overhead that should be taken into account when evaluating which solution to adopt.
 
Starvation: a process is perpetually denied the resources necessary to proceed with its work. Starvation may be caused by errors in a scheduling or mutual exclusion algorithm, but can also be caused by resource leaks, and can be intentionally caused via a denial-of-service attack such as a fork bomb.
 
Race Condition​: (or ​race hazard​) is the behavior of an electronic, software or other ​system where the output                                   
is dependent on the sequence or timing of other uncontrollable events. It becomes a ​bug when events do not                                     
happen in the order the programmer intended. The term originates with the idea of two ​signals racing each                                   
other to influence the output first. Race conditions arise in software when an application depends on the                                 
sequence or timing of​ ​processes​ or​ ​threads​ for it to operate properly. 
 
Dekker’s alg.​: the first known correct solution to the​ ​mutual exclusion​ problem in​ ​concurrent programming​. 
 
Dekker's algorithm guarantees​ ​mutual exclusion​, freedom from​ ​deadlock​, and freedom from​ ​starvation​. 
One advantage of this algorithm is that it doesn't require special test-and-set (atomic read/modify/write) instructions and is therefore highly portable between languages and machine architectures. One disadvantage is that it is limited to two processes and makes use of busy waiting instead of process suspension. (The use of busy waiting suggests that processes should spend a minimum of time inside the critical section.)
Modern CPUs may reorder memory operations, so this algorithm won't work on SMP machines equipped with such CPUs without the use of memory barriers.
 
Peterson’s alg.: a concurrent programming algorithm for mutual exclusion that allows two processes to share a single-use resource without conflict, using only shared memory for communication.
 
The algorithm uses two variables, ​flag and ​turn​. A ​flag[n] value of ​true indicates that the process ​n wants to                                       
enter the ​critical section​. Entrance to the critical section is granted for process P0 if P1 does not want to enter                                         
its critical section or if P1 has given priority to P0 by setting ​turn​ to 0. 
 
The algorithm does satisfy the three essential criteria to solve the critical section problem, provided that                               
changes to the variables turn, flag[0], and flag[1] propagate immediately and atomically. The while condition                             
works even with preemption.​[1] 
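A minimal Java sketch of the two-process algorithm (class name and busy-wait loops are illustrative; all shared fields are volatile so that their updates propagate, as the correctness condition above requires):

// Illustrative two-thread Peterson lock (i must be 0 or 1).
// Separate volatile fields are used because Java array elements cannot be volatile.
public class PetersonLock {
    private volatile boolean flag0 = false;
    private volatile boolean flag1 = false;
    private volatile int turn = 0;

    public void lock(int i) {
        if (i == 0) {
            flag0 = true;                          // I want to enter
            turn = 1;                              // give priority to the other process
            while (flag1 && turn == 1) { /* busy wait */ }
        } else {
            flag1 = true;
            turn = 0;
            while (flag0 && turn == 0) { /* busy wait */ }
        }
    }

    public void unlock(int i) {
        if (i == 0) flag0 = false; else flag1 = false;
    }
}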
 
Filter alg.: a generalization of Peterson’s algorithm to N > 2 processes.
 
 
HW support to ME: an operation (or set of operations) is atomic, linearizable, indivisible or uninterruptible if it appears to the rest of the system to occur instantaneously. Atomicity is a guarantee of isolation from concurrent processes. Additionally, atomic operations commonly have a succeed-or-fail definition: they either successfully change the state of the system, or have no apparent effect.
● Test_and_Set: reads a variable's value (the old value), writes the new value into it, and returns the old value. Maurice Herlihy (1991) proved that test-and-set has a finite consensus number, in contrast to the compare-and-swap operation: the test-and-set operation can solve the wait-free consensus problem for no more than two concurrent processes. However, more than two decades before Herlihy's proof, IBM had already replaced test-and-set with compare-and-swap, which is a more general solution to this problem. It may suffer from starvation. (From notes: in a multiprocessor system it is not enough on its own to guarantee ME; interrupts must also be disabled.) A spinlock built on this primitive is sketched after this list.
 
 
● Compare_and_Swap (CAS): an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a given new value. This is done as a single atomic operation. It might suffer from starvation. On server-grade multi-processor architectures of the 2010s, compare-and-swap is relatively cheap compared to a simple load that is not served from cache: a 2013 paper points out that a CAS is only 1.15 times more expensive than a non-cached load on Intel Xeon (Westmere-EX) and 1.35 times on AMD Opteron (Magny-Cours). As of 2013, most multiprocessor architectures support CAS in hardware, and the compare-and-swap operation is the most popular synchronization primitive for implementing both lock-based and non-blocking concurrent data structures.
 
 
 
● Fetch_and_Add: a special instruction that atomically modifies the contents of a memory location. It fetches (copies) the old value of the parameter, adds to it the value given by the second parameter, and returns the old value. Maurice Herlihy (1991) proved that fetch-and-add has a finite consensus number, in contrast to the compare-and-swap operation: the fetch-and-add operation can solve the wait-free consensus problem for no more than two concurrent processes.
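A minimal sketch of the three primitives above, assuming the java.util.concurrent.atomic classes as stand-ins for the hardware instructions (getAndSet ≈ test-and-set, compareAndSet ≈ CAS, getAndIncrement ≈ fetch-and-add); class names other than the JDK ones are illustrative:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicPrimitivesSketch {

    // Test-and-set spinlock: getAndSet(true) atomically reads the old value and writes true.
    static class TasLock {
        private final AtomicBoolean locked = new AtomicBoolean(false);
        void lock()   { while (locked.getAndSet(true)) { /* spin: someone else holds it */ } }
        void unlock() { locked.set(false); }
    }

    // CAS retry loop: recompute and retry until no other thread interfered.
    static int casIncrement(AtomicInteger counter) {
        int old;
        do {
            old = counter.get();
        } while (!counter.compareAndSet(old, old + 1)); // fails if counter changed meanwhile
        return old + 1;
    }

    // Fetch-and-add ticket lock: each thread takes a ticket and waits for its turn.
    static class TicketLock {
        private final AtomicInteger nextTicket = new AtomicInteger(0);
        private final AtomicInteger nowServing = new AtomicInteger(0);
        void lock() {
            int ticket = nextTicket.getAndIncrement();      // fetch-and-add
            while (nowServing.get() != ticket) { /* spin */ }
        }
        void unlock() { nowServing.incrementAndGet(); }
    }
}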
 
 
 
 
 
Guidelines​: 
● do not use ​synchronized​ blocks if not strictly necessary 
● prefer ​atomic variables​ already implemented 
● when there is more than one variable to share, wrap critical sections with ​lock/unlock 
● synchronized ​only when there is a write operation 
 
In Java there exist three main mechanisms for ME: atomic variables, implicit locks and explicit locks (reference). 
 
Condition Variables: a thread synchronization mechanism. A thread can suspend its execution while waiting for a specific condition to become true. CVs do not guarantee ME on shared resources, so they have to be used in conjunction with mutexes.
A thread can perform these operations on a CV:
● wait: ConditionVariable c; c.wait(); = a thread is suspended while waiting for a signal to resume it
● signal: c.signal(); = a thread exiting the CS signals to a waiting thread in the waiting queue to resume its execution. The awakened thread must check again that the condition it was waiting for has actually become true
● signalAll: c.signalAll(); = wakes up all the waiting threads.
These three methods must be invoked inside a critical section.
 
Java Monitor: in Java, every object (Object) has three methods, namely wait, notify and notifyAll, with the same semantics as the three operations above. So, it is possible to implement a mechanism completely analogous to condition variables by using these three methods. Since these are methods of Object, in order to guarantee ME they must be called inside a block or method marked with the synchronized keyword.
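A minimal sketch of this Java-monitor style (a hypothetical one-slot buffer; wait is called in a loop exactly because the awakened thread must re-check its condition):

// Hypothetical one-slot buffer using the implicit monitor of 'this'.
public class OneSlotBuffer<T> {
    private T item; // null means empty

    public synchronized void put(T value) throws InterruptedException {
        while (item != null) {      // re-check the condition after every wake-up
            wait();                 // releases the monitor and suspends the thread
        }
        item = value;
        notifyAll();                // wake consumers waiting for a non-empty buffer
    }

    public synchronized T take() throws InterruptedException {
        while (item == null) {
            wait();
        }
        T value = item;
        item = null;
        notifyAll();                // wake producers waiting for an empty slot
        return value;
    }
}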
Java ReentrantLock ​(​oracle reference​): A reentrant mutual exclusion ​Lock with the same basic behavior and                             
semantics as the implicit monitor lock accessed using synchronized methods and statements, but with                           
extended capabilities. A ReentrantLock is ​owned by the thread last successfully locking, but not yet unlocking                               
it. A thread invoking lock will return, successfully acquiring the lock, when the lock is not owned by another                                     
thread. The method will return immediately if the current thread already owns the lock. The constructor for this                                   
class accepts an optional ​fairness parameter. When set true, under contention, locks favor granting access to                               
the longest­waiting thread. Otherwise this lock does not guarantee any particular access order. Programs                           
using fair locks accessed by many threads may display lower overall ​throughput (i.e., are slower; often much                                 
slower) than those using the default setting, but have smaller ​variances in times to obtain locks and guarantee                                   
lack of starvation​. 
It is recommended practice to ​always​ immediately follow a call to lock with a try block: 
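A sketch of that idiom (the method body is a placeholder):

import java.util.concurrent.locks.ReentrantLock;

// lock() before try, unlock() in finally, so the lock is released even on exception.
class X {
    private final ReentrantLock lock = new ReentrantLock();

    public void m() {
        lock.lock();            // block until the lock is acquired
        try {
            // ... critical section: access the shared resource ...
        } finally {
            lock.unlock();      // always executed, even if the body throws
        }
    }
}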
 
Since a CV must always be used together with a mutex, ReentrantLock provides the method newCondition() to obtain a CV: when a thread performs a wait (await) on this CV, the lock that created the condition is released and the thread is suspended.
The following are the most relevant methods of ReentrantLock:
● lock()
● tryLock(): acquires the lock only if it is not held by another thread at the time of invocation.
● unlock()
● newCondition()
There is also a method that permits specifying a timeout to observe while trying to lock: tryLock(long timeout, TimeUnit unit).
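A minimal sketch tying newCondition() to the CV operations described earlier (a hypothetical "gate" that threads await until another thread opens it):

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class Gate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition opened = lock.newCondition(); // CV bound to this lock
    private boolean open = false;

    public void awaitOpen() throws InterruptedException {
        lock.lock();
        try {
            while (!open) {        // re-check after every wake-up
                opened.await();    // atomically releases the lock and suspends
            }
        } finally {
            lock.unlock();
        }
    }

    public void open() {
        lock.lock();
        try {
            open = true;
            opened.signalAll();    // wake all waiting threads
        } finally {
            lock.unlock();
        }
    }
}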
 
Barging: usually locks are starvation-free, because suspended threads are put into FIFO queues, so every thread entering the queue will acquire the lock within a bounded time. There are situations, however, in which the JVM performs a performance optimization: since the suspend/resume mechanism costs overhead and time, under contention for the same lock the JVM may grant it to an already-running thread that requests it instead of resuming a sleeping one. This behaviour is called barging.
 
Monitor (oracle reference): mutex + condition variable; a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true. Only one process at a time is able to enter the monitor. In other words, a monitor is a mechanism that associates with a given data structure a set of procedures/operations that are the only ones allowed on it. Each procedure/operation is mutually exclusive: only one process/thread can be inside the monitor at a time.
Two variants of monitor exist:
● Hoare’s m.: blocking condition variable; signal-and-wait. There are three kinds of queues: enter (threads aiming to enter the monitor), urgent (threads that executed a signal and yielded the monitor to a thread that was already waiting on the condition variable) and condition (one queue for each CV). When a thread executes a signal, it suspends itself, moves to the urgent queue and wakes up the first thread waiting in the queue of that condition variable. Threads waiting in the urgent queue are awakened before the ones in the enter queue.
● MESA: non-blocking CV; signal-and-continue. A signal does not suspend the invoking thread, but moves a thread waiting on the queue of that CV to the enter queue. The signalling thread stays in the monitor and continues its execution.
Semaphore: another mechanism for ME; in practice, it is a shared integer variable p >= 0 on which increments and decrements are performed by means of two atomic operations:
● V (verhoog = increment, i.e. signal): used when a thread is exiting a portion of code.
● P (prolaag = try to decrement, i.e. wait): used when a thread wants to enter a CS.
In practice, a semaphore allows n threads to “enter” a specific portion of code, simply by setting the initial value of p to n.
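A minimal sketch using java.util.concurrent.Semaphore (acquire corresponds to P, release to V; the pool size 3 is arbitrary):

import java.util.concurrent.Semaphore;

// At most 3 threads at a time inside the guarded section.
public class BoundedSection {
    private final Semaphore permits = new Semaphore(3); // initial value of p

    public void run() throws InterruptedException {
        permits.acquire();          // P: blocks if no permit is available
        try {
            // ... at most 3 threads execute here concurrently ...
        } finally {
            permits.release();      // V: give the permit back
        }
    }
}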
 
Readers and Writers problem: n threads want to read and/or write from/to a shared variable, e.g. a buffer. ME is required since writes to the same memory must be synchronized.
Java provides the ReentrantReadWriteLock class: it maintains a pair of associated locks, one for read-only operations and one for writing, and keeps the capabilities of a ReentrantLock; in particular, lock downgrading is possible: a thread holding the write lock may acquire the read lock (and then release the write lock).
Another solution to ME for the readers and writers problem is to synchronize all the methods of a class containing a shared collection, so that the class wraps the shared variable and guarantees ME.
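A minimal sketch of read/write locking around a shared map (field and method names are illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Many concurrent readers, exclusive writers.
public class SharedTable {
    private final Map<String, String> map = new HashMap<>();
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();

    public String get(String key) {
        rw.readLock().lock();            // shared with other readers
        try {
            return map.get(key);
        } finally {
            rw.readLock().unlock();
        }
    }

    public void put(String key, String value) {
        rw.writeLock().lock();           // exclusive
        try {
            map.put(key, value);
        } finally {
            rw.writeLock().unlock();
        }
    }
}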
Concurrent collections (oracle reference): classes that contain thread-safe data structures. They can be:
● blocking: read and write operations (take and put) wait for the data structure to be non-empty or non-full.
● non-blocking
CopyOnWriteArrayList (​oracle reference​): A thread­safe variant of ​ArrayList in which all mutative operations                         
(add, set, and so on) are implemented by making a fresh copy of the underlying array. Ordinarily too costly, but                                       
may be ​more efficient than alternatives when traversal operations vastly outnumber mutations, and is useful                             
when you cannot or don't want to synchronize traversals, yet need to preclude interference among concurrent                               
threads. 
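A brief sketch of the trade-off described above (listener-style usage; names are illustrative): traversal works on a snapshot of the array, so it needs no locking and never sees concurrent mutations.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class Listeners {
    // Traversed very often (event dispatch), mutated rarely (add/remove listener).
    private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

    public void addListener(Runnable r) { listeners.add(r); }   // copies the array

    public void fire() {
        for (Runnable r : listeners) {  // iterates over a snapshot, no locking needed
            r.run();
        }
    }
}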
Deque (sometimes written “dequeue”): “double-ended queue”, a queue where elements can be added or removed at either the head or the tail.
Work stealing: suppose there are m producers and n consumers; each producer writes to a shared buffer and each consumer has its own queue into which work flows from the buffer (according to some policy). It might happen that a consumer sees its queue growing because it cannot consume all the work. In this situation, any other consumer that observes its own queue to be empty may “steal” work from the queue of the overloaded consumer.
JVM Memory: organized in three main areas:
● method area: for each class, the methods’ code, attributes and static field values are stored here
● heap: all the object instances are stored here
● thread stacks: for each thread, a data structure is placed here, containing the method call stack, PC register, etc.
 
 
Tasks and Executors: the way to manage thread execution, in terms of starting, stopping and resuming threads.
 
Executor​: An object that executes submitted ​Runnable tasks. This interface provides a way of decoupling task                               
submission from the mechanics of how each task will be run, including details of thread use, scheduling, etc.                                   
An Executor is normally used instead of explicitly creating threads. 
void ​execute​(​Runnable​ command) 
 
Runnable​: implemented by any class whose instances are intended to be executed by a thread. The class                                 
must define a ​method of no arguments called ​run​. 
 
Callable​: A task that ​returns a result and may throw an exception​. Implementors define a ​single method with                                   
no arguments called call​. The Callable interface is similar to ​Runnable​, in that both are designed for classes                                   
whose instances are potentially executed by another thread. A Runnable, however, does not return a result                               
and cannot throw a ​checked exception​. 
Future​: represents the result of an asynchronous computation. Methods are provided to check if the                             
computation is complete, to wait for its completion, and to retrieve the result of the computation. The result can                                     
only be retrieved using method get when the computation has completed, blocking if necessary until it is ready.                                   
Cancellation is performed by the cancel method. Additional methods are provided to determine if the task                               
completed normally or was cancelled. Once a computation has completed, the computation cannot be                           
cancelled. If you would like to use a Future for the sake of cancellability but not provide a usable result, you                                         
can declare types of the form Future<?> and return null as a result of the underlying task.  
boolean ​cancel​(boolean mayInterruptIfRunning) 
boolean ​isCancelled​() 
boolean ​isDone​() 
V​ ​get​() throws​ ​InterruptedException​, ​ExecutionException 
V​ ​get​(long timeout, ​TimeUnit​ unit) throws​ ​InterruptedException​, ​ExecutionException​, ​TimeoutException 
 
 
ExecutorService​: An ​Executor that provides methods to manage termination and methods that can produce a                             
Future for tracking progress of one or more asynchronous tasks. An ExecutorService can be shut down, which                                 
will cause it to reject new tasks. Two different methods are provided for shutting down an ExecutorService. The                                   
shutdown() method will allow previously submitted tasks to execute before terminating, while the                         
shutdownNow() method prevents waiting tasks from starting and attempts to stop currently executing tasks.                           
Upon termination, an executor has no tasks actively executing, no tasks awaiting execution, and no new tasks                                 
can be submitted. 
void ​shutdown​() 
List​<​Runnable​> ​shutdownNow​() 
<T> ​Future​<T> ​submit​(​Callable​<T> task) 
Future​<?> ​submit​(​Runnable​ task) 
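A minimal sketch of submitting a Callable and collecting its result through a Future (the fixed pool size and the task body are arbitrary):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        Callable<Integer> task = () -> {        // returns a result, may throw
            Thread.sleep(100);                  // pretend to do some work
            return 42;
        };

        Future<Integer> future = pool.submit(task);      // non-blocking submission
        System.out.println("result = " + future.get());  // blocks until the result is ready

        pool.shutdown();    // previously submitted tasks complete, new ones are rejected
    }
}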
 
ThreadPoolExecutor​: An ​ExecutorService that executes each submitted task using one of possibly several                         
pooled threads, normally configured using​ ​Executors​ factory methods. 
Thread pools address two different problems: they usually provide improved performance when executing                         
large numbers of asynchronous tasks, due to reduced per­task invocation overhead, and they provide a means                               
of bounding and managing the resources, including threads, consumed when executing a collection of tasks.                             
Each ThreadPoolExecutor also maintains some basic statistics, such as the number of completed tasks. 
 
ScheduledThreadPoolExecutor​: A ​ThreadPoolExecutor that can additionally schedule commands to run                   
after a given delay, or to execute periodically. 
public ScheduledThreadPoolExecutor(int corePoolSize)
public ScheduledFuture<?> schedule(Runnable command, long delay, TimeUnit unit)
public <V> ScheduledFuture<V> schedule(Callable<V> callable, long delay, TimeUnit unit)
public ScheduledFuture<?> scheduleAtFixedRate(Runnable command, long initialDelay, long period, TimeUnit unit)
public ScheduledFuture<?> scheduleWithFixedDelay(Runnable command, long initialDelay, long delay, TimeUnit unit)
public void execute(Runnable command)
public Future<?> submit(Runnable task)
public <T> Future<T> submit(Callable<T> task)
public void shutdown()
public List<Runnable> shutdownNow()
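A minimal sketch of periodic scheduling (the pool size, delays and period are arbitrary):

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class HeartbeatExample {
    public static void main(String[] args) {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);

        // Run every 5 seconds, starting after an initial delay of 1 second.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("heartbeat at " + System.currentTimeMillis()),
                1, 5, TimeUnit.SECONDS);

        // One-shot task after 10 seconds that shuts the scheduler down.
        scheduler.schedule(scheduler::shutdown, 10, TimeUnit.SECONDS);
    }
}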
 
Task lifecycle: 
 
 
Executor lifecycle​: 
 
Number of threads in the pool:

N_pool = N_CPU · U · (T_compute + T_wait) / T_compute + 1

where N_CPU is the number of CPUs, U is the target CPU utilization (0 ≤ U ≤ 1), and T_wait / T_compute is the ratio between the time a task spends waiting and the time it spends computing. The +1 is for safety reasons: it might happen that all the necessary threads block and no more threads could then enter the pool.
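A brief worked sketch of the formula (the utilization and the wait/compute times are made-up numbers):

public class PoolSizing {
    public static void main(String[] args) {
        int nCpu = Runtime.getRuntime().availableProcessors(); // e.g. 8
        double targetUtilization = 0.75;   // assumed: keep CPUs 75% busy
        double waitTime = 50.0;            // assumed ms spent blocked per task
        double computeTime = 10.0;         // assumed ms of CPU work per task

        int poolSize = (int) (nCpu * targetUtilization
                * ((computeTime + waitTime) / computeTime)) + 1;
        // With 8 CPUs: 8 * 0.75 * (60 / 10) + 1 = 37 threads.
        System.out.println("suggested pool size = " + poolSize);
    }
}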
 
Deadlock with nested monitors: suppose two nested monitors: a thread enters the outer monitor, then enters the inner one and waits on a CV of the inner monitor. Waiting releases only the inner monitor's lock, while the outer one remains held. If a second thread now wants to enter the outer monitor (for example, to produce the signal the first thread is waiting for), it will never manage to do so, because the first thread keeps holding the outer lock while being blocked on the inner CV, which will therefore never receive any signal. This situation leads to a deadlock.
 
Deadlock with executor and monitor (Thread Starvation Deadlock): suppose we have an Executor with a thread pool of size N, and that all N threads encounter the same CV and all wait on it. If there is no other thread available in the pool, then no one will ever be able to wake up the waiting threads, thus resulting in a deadlock.
 
Wait­for­graph​: a ​directed graph used for ​deadlock detection in ​operating systems and ​relational database                           
systems. 
 
See ​Coffman​. 
 
Memory barrier​: a type of ​barrier ​instruction that causes a ​central processing unit (CPU) or ​compiler to                                 
enforce an ​ordering constraint on ​memory operations issued before and after the barrier instruction; necessary                             
because most modern CPUs employ performance optimizations that can result in​ ​out­of­order execution​. 
 
Java Memory Model​ (See Java Memory Model paper): 
● Program order rule​: actions in a thread are performed in their coding order. 
● Monitor lock rule​: an unlock to a monitor ​happens­before​ any subsequent lock on the same. 
● Volatile variable rule​: a write to a v. variable ​happens­before any subsequent read to the same var.;                                 
the same holds for ​atomic var.s​. 
● Thread start rule​: a call to ​Thread.start​ ​happens­before​ any action in the started thread. 
● Thread termination rule: any action in a thread happens-before any other thread detects that it has terminated, either by a successful return from Thread.join or by Thread.isAlive returning false.
● Interruption rule​: a call to ​interrupt for a thread ​happens­before the interrupted thread detects the                             
interrupt. 
● Finalizer rule​: the end of a constructor for an object ​happens­before​ the start of its finalizer. 
● Transitivity​: if ​A happens­before B​, ​B happens­before C​, then ​A happens­before C​. 
Performance 
With Java, in order to perform accurate performance measurements it is important to know exactly how the code is run by the JVM. An important component is the Just-In-Time (JIT) compilation engine: this module performs dynamic compilation during the execution of a program – at run time – rather than prior to execution. Most often this consists of translation to machine code, which is then executed directly, but it can also refer to translation to another format. It allows adaptive optimization such as dynamic recompilation – thus in theory JIT compilation can yield faster execution than static compilation. Interpretation and JIT compilation are particularly suited for dynamic programming languages, as the runtime system can handle late-bound data types and enforce security guarantees.
Evaluating the performance of Java source code requires collecting statistics about its execution times. In particular, when evaluating multi-threaded software, it is important to have a way to synchronize the threads' start times. In other words, all the threads have to start at the same time so that the measurements are faithful to reality. Java supplies a specific class for this task.
CountDownLatch ​(​oracle reference​): a synchronization aid that allows one or more threads to wait until a set                                 
of operations being performed in other threads completes. 
A CountDownLatch is initialized with a given ​count​. The ​await methods block until the current count reaches                                 
zero due to invocations of the ​countDown() method, after which all waiting threads are released and any                                 
subsequent invocations of ​await return immediately. This is a one­shot phenomenon ­­ the count cannot be                               
reset. If you need a version that resets the count, consider using a​ ​CyclicBarrier​. 
A CountDownLatch is a versatile synchronization tool and can be used for a number of purposes. A                                 
CountDownLatch initialized with a count of one serves as a simple on/off latch, or gate: all threads invoking                                   
await wait at the gate until it is opened by a thread invoking ​countDown()​. A CountDownLatch initialized to ​N                                     
can be used to make one thread wait until ​N threads have completed some action, or some action has been                                       
completed N times. 
public ​CountDownLatch​(int count) 
public void ​await​() throws​ ​InterruptedException 
public void ​countDown​() 
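A minimal sketch of the start-gate pattern mentioned above for timing measurements (the worker body and thread count are placeholders):

import java.util.concurrent.CountDownLatch;

public class TimedRun {
    public static void main(String[] args) throws InterruptedException {
        final int nThreads = 4;
        CountDownLatch startGate = new CountDownLatch(1);        // opened once by main
        CountDownLatch endGate = new CountDownLatch(nThreads);   // one count per worker

        for (int i = 0; i < nThreads; i++) {
            new Thread(() -> {
                try {
                    startGate.await();          // all workers wait here...
                    // ... measured work goes here ...
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    endGate.countDown();        // signal completion
                }
            }).start();
        }

        long start = System.nanoTime();
        startGate.countDown();                  // ...and start at the same time
        endGate.await();                        // wait for all workers to finish
        System.out.println("elapsed ns: " + (System.nanoTime() - start));
    }
}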
 
CyclicBarrier (​oracle reference​): a synchronization aid that allows a set of threads to all wait for each other to                                     
reach a common barrier point. CyclicBarriers are useful in programs involving a ​fixed sized party of threads                                 
that must occasionally wait for each other​. The barrier is called ​cyclic because it can be re­used after the                                     
waiting threads are released. 
public ​CyclicBarrier​(int parties, ​Runnable​ barrierAction) 
public int ​await​() throws​ ​InterruptedException​, ​BrokenBarrierException 
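A brief sketch of a reusable barrier (the party count and the barrier action are arbitrary):

import java.util.concurrent.CyclicBarrier;

public class PhasedWorkers {
    public static void main(String[] args) {
        final int parties = 3;
        // The barrier action runs once per "generation", after all parties arrive.
        CyclicBarrier barrier = new CyclicBarrier(parties,
                () -> System.out.println("--- phase completed ---"));

        for (int i = 0; i < parties; i++) {
            new Thread(() -> {
                try {
                    for (int phase = 0; phase < 2; phase++) {
                        // ... work for this phase ...
                        barrier.await();    // reusable: released together, then reused next phase
                    }
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}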
 
Performance in synchronization 
Synchronization impacts performance because of: 
● context switches 
● memory synchronization 
● thread synchronization 
Considering our code, one of its performance indexes is throughput, in terms of the number of threads entering and leaving it in a given time. Here Little's Law applies: L = λ · w, where, in our case, L represents the number of threads in our system, i.e. the number of threads waiting to execute our portion of code (waiting because of synchronization, of course), λ is the arrival rate and w is the delay of the system. What we want to do is minimize L.
Possible solutions:
● CS shrinking: reduce the size of the CS so that a thread does not spend too much time in it.
● CS splitting: split a CS into smaller ones, i.e. perform lock splitting.
● JVM optimizations:
○ Lock coarsening: whenever the JVM observes that a thread moves from one waiting queue to another (because of a chain of locks) always in the same sequential order, it collapses all the CSs into a single one, thus merging the locks into a single lock. This leads to performance improvements in terms of waiting time, so improving the throughput.
○ Lock elision: if a CS is always executed by the same thread, then it is not necessary to protect it with a lock, so the JVM removes it.
● Lock granularity: when the shared data structure is large, it is important to put locks only where needed, and not on the entire data structure. For example, a hash map could be so large that threads would concurrently access different parts of it, thus not violating any ME constraint; locking the entire table would therefore penalize the performance of our software. A solution is to use several locks distributed over the data structure (lock striping). Suppose N_L = number of locks and N = table size; then bucket i can be protected by lock i % N_L. N_L should be dimensioned so that the conflict probability is minimized.
● Non-blocking algorithms: failure or suspension of any thread cannot cause failure or suspension of another thread (use of volatile and atomic variables).
○ Wait-free: the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
○ Lock-free: allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if, when the program threads are run sufficiently long, at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free. An algorithm is lock-free if every operation has a bound on the number of steps before one of the threads operating on the data structure completes its operation.
○ Obstruction-free: the weakest natural non-blocking progress guarantee. An algorithm is obstruction-free if at any point, a single thread executed in isolation (i.e., with all obstructing threads suspended) for a bounded number of steps will complete its operation. All lock-free algorithms are obstruction-free.
See Treiber's stack and Michael and Scott's queue. These two algorithms are non-blocking implementations of a stack and a queue respectively. Both are based on the use of AtomicReference variables, which make it possible to deal with thread synchronization. They yield better performance when used in programs that suffer from high contention rates. Basically, they implement a spinning solution to deal with concurrent modification of the stack's top pointer and of the head and tail references in the queue. Thus, whenever a thread tries to modify one of them, it has to be sure (see the if statements) that it is actually modifying what it expects to modify (note the use of compareAndSet). If such a check fails, it goes back (spins) and tries again from the beginning.
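A minimal sketch of Treiber's stack using AtomicReference and the CAS retry pattern just described (the node layout is the usual one; memory-reclamation concerns are omitted):

import java.util.concurrent.atomic.AtomicReference;

// Lock-free (Treiber) stack: push/pop retry their CAS until the top pointer
// is still the one they read, i.e. no other thread modified it in between.
public class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    public void push(T value) {
        Node<T> newHead = new Node<>(value);
        Node<T> oldHead;
        do {
            oldHead = top.get();
            newHead.next = oldHead;
        } while (!top.compareAndSet(oldHead, newHead)); // spin if someone interfered
    }

    public T pop() {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = top.get();
            if (oldHead == null) return null;           // empty stack
            newHead = oldHead.next;
        } while (!top.compareAndSet(oldHead, newHead));
        return oldHead.value;
    }
}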
 
JAVA: Nested Classes (​oracle reference​) 
A nested class is a member of its enclosing class. Non­static nested classes (inner classes) have access to                                   
other members of the enclosing class, even if they are declared private. Static nested classes do not have                                   
access to other members of the enclosing class. As a member of the OuterClass, a nested class can be                                     
declared private, public, protected, or ​package private​. (Recall that outer classes can only be declared public                               
or ​package private​.) 
Static nested class​: A static nested class ​interacts with the instance members of its outer class (and other                                   
classes) ​just like any other top­level class​. In effect, a static nested class is ​behaviorally a top­level class that                                     
has been ​nested​ in another top­level class ​for packaging convenience​.  
 
Inner class​: is ​associated with an instance of its enclosing class and ​has direct access to that object's                                   
methods and fields​. Also, because an inner class is associated with an instance, it cannot define any static                                   
members itself. Objects that are instances of an inner class ​exist ​within​ an instance of the outer class​. 
 
To instantiate an inner class, you must first instantiate the outer class. Then, create the inner object within the                                     
outer object with this syntax: 
OuterClass.InnerClass innerObject = outerObject.new InnerClass(); 
 
Local classes (​oracle reference​): are classes that are defined in a ​block​, which is a group of zero or more                                       
statements between balanced braces. You typically find local classes defined in the body of a method. ​A local                                   
class has access to the members of its enclosing class and to local variables​. However, a local class can only                                       
access local variables that are declared ​final​. Starting in Java SE 8, a local class can access local variables                                     
and parameters of the enclosing block that are final or ​effectively final ​(i.e. never changed in the enclosing                                   
block). 
Shadowing​: If a declaration of a type (such as a member variable or a parameter name) in a particular scope                                       
(such as an inner class or a method definition) has the same name as another declaration in the enclosing                                     
scope, then the declaration ​shadows​ the declaration of the enclosing scope. 
Anonymous classes: enable you to declare and instantiate a class at the same time. They are like local classes except that they do not have a name. Use them if you need a local class only once, e.g. when writing new InterfaceName() and then opening a brace to define the class body “on the fly”.
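A brief sketch of that "on the fly" usage (the Runnable body is a placeholder):

public class AnonymousExample {
    public static void main(String[] args) {
        // Declare and instantiate an implementation of Runnable in one expression.
        Runnable task = new Runnable() {
            @Override
            public void run() {
                System.out.println("running in " + Thread.currentThread().getName());
            }
        };
        new Thread(task).start();
    }
}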
AtomicInteger (​oracle reference​): An int value that may be updated atomically. Methods: getAndAdd(int                         
delta), getAndIncrement(), addAndGet(int delta), incrementAndGet(). 
ThreadLocal<T> (​oracle reference​): This class provides ​thread­local variables​. These variables differ from                       
their normal counterparts in that each thread that accesses one (via its get or set method) has its own,                                     
independently initialized copy of the variable. ThreadLocal instances are typically private static fields in classes                             
that wish to associate state with a thread (e.g., a user ID or Transaction ID).  
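A minimal sketch of a per-thread value (the transaction-id generator is made up for illustration):

import java.util.concurrent.atomic.AtomicInteger;

public class TransactionContext {
    private static final AtomicInteger NEXT_ID = new AtomicInteger(0);

    // Each thread gets its own, independently initialized copy of the id.
    private static final ThreadLocal<Integer> TX_ID =
            ThreadLocal.withInitial(NEXT_ID::incrementAndGet);

    public static int currentTransactionId() {
        return TX_ID.get();     // same value for the whole life of the calling thread
    }
}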
 
 
MESSAGE PASSING MODEL 
Message passing between a pair of processes can be supported by two message communication operations,                             
send and ​receive​, defined in terms of destinations and messages. A ​queue ​is associated with each ​message                                 
destination​. Sending processes cause messages to be added to remote queues and receiving processes                           
remove messages from local queues. Sending and receiving processes may be either ​synchronous or                           
asynchronous​. In the synchronous form of communication, the sending and receiving processes synchronize                         
at every message​. In this case, both send and receive are ​blocking operations​. 
In the asynchronous form of communication, ​the use of the send operation is nonblocking in that the sending                                   
process is allowed to proceed as soon as the message has been copied to a local buffer, and the transmission                                       
of the message proceeds in parallel with the sending process​. The ​receive operation can have ​blocking and                                 
non­blocking variants​. In the non­blocking variant, the receiving process proceeds with its program after                           
issuing a receive operation, which provides a buffer to be filled in the background, but it must separately                                   
receive notification that its buffer has been filled, by polling or interrupt. 
Non­blocking communication appears to be more efficient, but it involves extra complexity in the receiving                             
process associated with the need to acquire the incoming message out of its flow of control. For these                                   
reasons, today’s systems do not generally provide the nonblocking form of receive. 
Communication channel assumptions: 
● Reliability​: in terms of validity and integrity. As far as the validity property is concerned, a point­to­point                                 
message service can be described as reliable if messages are guaranteed to be delivered despite a                               
‘reasonable’ number of packets being dropped or lost. In contrast, a point­to­point message service can                             
be described as unreliable if messages are not guaranteed to be delivered in the face of even a single                                     
packet dropped or lost. For integrity, messages must arrive uncorrupted and without duplication. 
● Ordering​: Some applications require that messages be delivered in sender order – that is, the order in                                 
which they were transmitted by the sender. The delivery of messages out of sender order is regarded                                 
as a failure by such applications. 
● QoS 
● Queue policy 
 
Processes addressing​: in the Internet protocols, messages are sent to ​(Internet address, local port) pairs. A                               
local port is a message destination within a computer, specified as an integer. A port has exactly one receiver                                     
(multicast ports are an exception) but can have many senders. An alternative way to address processes is to use (Process ID, local port) pairs. 
 
Distributed systems models​: 
● Physical​: focus on hardware organization. 
● Architectural​ (behavioural): how the different components interact (client­server, peer­to­peer). 
● Fundamental (abstract): a high-level abstraction that makes it possible to represent the real system mathematically, so as to perform hypothesis validation.
 
Bounded­buffer with asynchronous message passing and Dijkstra’s guarded commands​: the most                     
important element of the guarded command language. In a guarded command, just as the name says, the                                 
command is "guarded". The guard is a ​proposition​, which must be true before the statement is ​executed​. At the                                     
start of that statement's execution, one may assume the guard to be true. Also, if the guard is false, the                                       
statement will not be executed. The use of guarded commands makes it easier to prove the ​program meets the                                     
specification​. The statement is often another guarded command. 
A guard can be in one of these three states: 
● failed​: condition is ​false 
● valid​: condition ​true​, result received 
● delayed​: condition ​true​, but ​no results available yet 
See notes from Professor about the algorithm for the bounded­buffer (readers­writers) problem solved with                           
guarded commands. 
 
Message Passing Interface (MPI, a reference): the first standardized, vendor-independent message passing library. The advantages of developing message passing software using MPI closely match the design goals of portability, efficiency, and flexibility. MPI is not an IEEE or ISO standard, but it has in fact become the "industry standard" for writing message passing programs on HPC platforms.
MPI primarily addresses the ​message­passing parallel programming model​: data is moved from the address                           
space of one process to that of another process through ​cooperative operations on each process​. 
MPI main components and features: 
● Communicators and Groups​: MPI uses objects called communicators and groups to define ​which                         
collection of processes may communicate with each other​. ​MPI_COMM_WORLD is the predefined                       
communicator that includes all of your MPI processes. 
○ A ​group ​is ​an ordered set of processes​. Each process in a group is associated with a unique                                   
integer rank. Rank values start at zero and go to N­1, where N is the number of processes in the                                       
group. In MPI, a group is represented within system memory as an object. It is accessible to the                                   
programmer only by a "handle". A group is always associated with a communicator object.  
○ A ​communicator ​encompasses a group of processes that may communicate with each other.                         
All MPI messages must specify a communicator. In the simplest sense, the communicator is an                             
extra "tag" that must be included with MPI calls. 
○ From the programmer's perspective, a group and a communicator are one. 
● Rank​: ​Within a communicator​, every ​process has its ​own unique, integer identifier assigned by the                             
system when the process initializes. A rank is sometimes also called a "task ID". ​Ranks are contiguous                                 
and begin at zero​. It is used by the programmer to specify the source and destination of messages.                                   
Often used conditionally by the application to control program execution (if rank=0 do this / if rank=1 do                                   
that).  
● MPI buffer​: since send and receive are rarely ​perfectly synchronized, the MPI architecture presents a                             
library buffer​ that is used to store transiting messages while the receiver cannot receive them. 
● Blocking and Nonblocking op.s​: Blocking is interpreted as ‘​blocked until it is safe to return​’, in the                                 
sense that ​application data has been copied into the MPI system and hence is in transit or delivered                                   
and therefore the application buffer can be reused​. ​Safe means that modifications will not affect the                               
data intended for the receive task​. Safe does not imply that the data was actually received ­ it may very                                       
well be sitting in a system buffer. Non­blocking send and receive routines behave similarly ­ they will                                 
return almost immediately. They do not wait for any communication events to complete, such as                             
message copying from user memory to system buffer space or the actual arrival of message.  
● Point­to­Point Communication​: message passing between two, and only two, different MPI tasks.                       
One task is performing a send operation and the other task is performing a matching receive operation. 
● Collective Communication​: involve ​all processes within the scope of a communicator. It is the                           
programmer's responsibility to ensure that all processes within a communicator participate in any                         
collective operations. Unexpected behavior, including program failure, can occur if even one task in the                             
communicator doesn't participate. 
● Types of Collective Operations​: 
   
● Synchronization ­ processes wait until all members of the group have reached the                         
synchronization point. 
● Data Movement​ ­ broadcast, scatter/gather, all to all. 
● Collective Computation (reductions) ­ one member of the group collects data from the other                           
members and performs an operation (min, max, add, multiply, etc.) on that data. 
 
● Order​: 
○ MPI guarantees that messages will not overtake each other.  
○ Order rules do not apply if there are multiple threads participating in the communication                           
operations. 
● Fairness​: MPI does not guarantee fairness ­ it's up to the programmer to prevent "operation                             
starvation". 
● Envelope​: source+destination+tag+communicator (see later) 
 
 
The underlying architectural model for MPI is relatively simple and captured in Figure 4.17; note the added                                 
dimension of ​explicitly having MPI library buffers in both the sender and the receiver​, managed by the MPI                                   
library and used to hold data in transit. 
 
 
MPI point­to­point routines: 
 
Buffer ​= Program (application) address space that references the data that is to be sent or received. 
Count ​= Indicates the number of data elements of a particular type to be sent. 
Type ​= For reasons of portability, MPI predefines its elementary data types. 
Dest ​= The rank of the receiving process. 
Source = The rank of the sending process. This may be set to the wild card MPI_ANY_SOURCE to receive a                                       
message from any task. 
Tag = Arbitrary non­negative integer assigned by the programmer to uniquely identify a message. Send and                               
receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be                               
used to receive any message regardless of its tag. 
Communicator = communication context, or set of processes for which the source or destination fields are                               
valid. 
Status​ = For a receive operation, indicates the source of the message and the tag of the message. 
Request​ = Used by non­blocking send and receive operations. The system issues a unique “request number”. 
 
TIME AND GLOBAL STATES 
Clocks, events and process states 
We define an event to be the occurrence of a single action that a process carries out as it executes – a                                           
communication action or a state­transforming action. The sequence of events within a single process p​i can be                                 
placed in a single, total ordering, which we denote by the relation ➝​i between the events. That is, ​e ➝​i e' if and                                             
only if the event e occurs before e’ at p​i​. This ordering is well defined, whether or not the process is                                         
multithreaded, since we have assumed that the process executes on a single processor. Now we can define                                 
the history of process p​i to be the series of events that take place within it, ordered as we have described by                                           
the relation ➝​i​: 
history(p_i) = h_i = <e_i^0, e_i^1, e_i^2, … >
 
The operating system reads the node’s hardware clock value, ​H​i​(t) , scales it and adds an offset so as to                                       
produce ​a software clock ​C​i​(t) = ​α H​i​(t) + β ​that approximately measures real, physical time t for process p​i . In                                           
other words, when the real time in an absolute frame of reference is t, ​C​i​(t)​ is the reading on the software clock. 
Successive events will correspond to different timestamps only if the clock resolution – the period between                               
updates of the clock value – is smaller than the time interval between successive events. 
 
SKEW ​between computer clocks in a distributed system Network: The instantaneous difference between the                           
readings of any two clocks is called their skew. 
 
Clock ​DRIFT​: which means that they count time at different rates, and so diverge. 
 
Clock's ​DRIFT RATE​: change in the offset between the clock and a nominal perfect reference clock per unit of                                     
time measured by the reference clock. 
 
UTC (Coordinated Universal Time): is based on atomic time, but a so-called 'leap second' is inserted – or, more rarely, deleted – occasionally to keep it in step with astronomical time. UTC signals are synchronized and broadcast regularly from land-based radio stations and satellites covering many parts of the world.
 
Synchronizing Physical Clocks 
Synchronization in a synchronous system: in general, for a synchronous system, the optimum bound that can be achieved on clock skew when synchronizing N clocks is u * (1 – 1/N) [Lundelius and Lynch 1984], where u = max – min, i.e. the difference between the maximum and minimum transmission time that a message can experience in a synchronous system.
 
Cristian’s method for synchronizing clocks: use of a time server, connected to a device that receives signals from a source of UTC, to synchronize computers externally. Although there is no upper bound on message transmission delays in an asynchronous system, the round-trip times for messages exchanged between pairs of processes are often reasonably short – a small fraction of a second. Cristian describes the algorithm as probabilistic: the method achieves synchronization only if the observed round-trip times between client and server are sufficiently short compared with the required accuracy. A simple estimate of the time to which p should set its clock is t + T_round / 2, which assumes that the elapsed time is split equally before and after S placed t in m_t (the message carrying the timestamp). This is normally a reasonably accurate assumption, unless the two messages are transmitted over different networks.
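A brief sketch of the estimate above (timestamps in milliseconds; the numbers are made up):

public class CristianEstimate {
    // t  : UTC time reported by the server (placed in the reply message m_t)
    // t0 : client clock when the request was sent
    // t1 : client clock when the reply was received
    static long estimateServerTime(long t, long t0, long t1) {
        long tRound = t1 - t0;          // observed round-trip time
        return t + tRound / 2;          // assume the delay is split evenly
    }

    public static void main(String[] args) {
        // Request sent at 1000, reply received at 1040, server reported 5000.
        System.out.println(estimateServerTime(5000, 1000, 1040)); // prints 5020
    }
}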
 
Discussion: Cristian's method suffers from the problem associated with all services implemented by a single server: the single time server might fail and thus render synchronization temporarily impossible. Cristian suggested, for this reason, that time should be provided by a group of synchronized time servers, each with a receiver for UTC time signals. Dolev et al. [1986] showed that if f is the number of faulty clocks out of a total of N, then we must have N > 3f if the other, correct, clocks are still to be able to achieve agreement.
 
Berkeley's Algorithm​: an algorithm for internal synchronization developed for collections of computers                       
running Berkeley UNIX. A coordinator computer is chosen to act as the master. Unlike in Cristian’s protocol,                                 
this computer periodically polls the other computers whose clocks are to be synchronized, called slaves. The                               
slaves send back their clock values to it. The master estimates their local clock times by observing the                                   
round­trip times (similarly to Cristian’s technique), and it averages the values obtained (including its own                             
clock’s reading). The balance of probabilities is that this average cancels out the individual clocks’ tendencies                               
to run fast or slow. The accuracy of the protocol depends upon a nominal maximum round­trip time between                                   
the master and the slaves. 
The master takes a FAULT­TOLERANT AVERAGE. That is, a subset is chosen of clocks that do not differ                                   
from one another by more than a specified amount, and the average is taken of readings from only these                                     
clocks. 
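
A minimal sketch of the master’s fault-tolerant average (the selection rule used here – discarding readings further than a threshold from the median – is only one simple assumption; the notes require just that the chosen clocks do not differ from one another by more than a specified amount):

import java.util.Arrays;

public class BerkeleyMaster {
    // readings[i] = estimated current time of clock i (the master's own reading included).
    // Readings further than `threshold` from the median are treated as faulty and ignored.
    static double faultTolerantAverage(long[] readings, long threshold) {
        long[] sorted = readings.clone();
        Arrays.sort(sorted);
        long median = sorted[sorted.length / 2];
        long sum = 0;
        int kept = 0;
        for (long r : readings) {
            if (Math.abs(r - median) <= threshold) { sum += r; kept++; }
        }
        return (double) sum / kept;
    }

    public static void main(String[] args) {
        long[] readings = {1000, 1005, 998, 5000, 1002};   // 5000 is an outlier (faulty clock)
        double agreed = faultTolerantAverage(readings, 50);
        // The master then sends each slave the amount by which to adjust its clock
        // (agreed minus that slave's reading), rather than an absolute time.
        System.out.println("Agreed time = " + agreed);
    }
}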
 
NTP​: Cristian’s method and the Berkeley algorithm are intended primarily for use within 
intranets. NTP’s chief design aims and features are as follows: 
● To provide a service enabling clients across the Internet to be synchronized accurately to UTC​:                             
Although large and variable message delays are encountered in Internet communication, NTP employs                         
statistical techniques for the filtering of timing data and it ​discriminates between the quality of timing                               
data from different servers​. 
● To provide a reliable service that can survive lengthy losses of connectivity​: There are ​redundant                             
servers and redundant paths ​between the servers​. The servers can reconfigure so as to continue to                               
provide the service if one of them becomes unreachable. 
● To enable clients to resynchronize sufficiently frequently to offset the rates of drift found in most                               
computers​: The service is designed to ​scale to large numbers of clients and servers​. 
● To provide protection against interference with the time service, whether malicious or accidental​: The                           
time service uses ​authentication techniques to check that timing data originate from the ​claimed trusted                             
sources​. It also validates the return addresses of messages sent to it. 
 
Logical Time and Logical Clocks 
As Lamport [1978] pointed out, since we cannot synchronize clocks perfectly across a distributed system, we                               
cannot in general use physical time to find out the order of any arbitrary pair of events occurring within it. Two                                         
simple and intuitively obvious points: 
● If two events occurred at the same process p​i (i = 1, 2,...N), then they occurred in the order in which p​i                                           
observes them – this is the order ​➝​i​ ​ that we defined above. 
● Whenever a message is sent between processes, the event of sending the message occurred before                             
the event of receiving the message. 
 
Lamport called the ​partial ordering obtained by generalizing these two relationships the ​happened­before                         
relation. It is also sometimes known as the ​relation of causal ordering​ or ​potential causal ordering​. 
Since ➝ is only a partial order, the sequence in which events are ordered need not be unique: concurrent events may be placed in either order. 
 
For example, a ​↛e and e ​↛a, since they occur at different processes, and there is no chain of messages                                           
intervening between them. We say that events such as a and e that are not ordered by ​➝ are concurrent and                                         
write this a || e . 
 
Logical clocks​: Lamport [1978] invented a simple mechanism by which the ​happened­before ​ordering can be                             
captured numerically, called a logical clock. A ​Lamport logical clock is a monotonically increasing software                             
counter​, whose value need bear ​no particular relationship to any physical clock​. Each process ​p​i keeps its own                                   
logical clock, ​L​i , which it uses to apply so­called Lamport timestamps to events. We denote the timestamp of                                     
event ​e​ at ​p​i​ by ​L​i​(e)​ , and by ​L(e)​ we denote the timestamp of event ​e ​at whatever process it occurred at. 
To capture the happened­before relation ​➝​, processes update their logical clocks and transmit the values of                               
their logical clocks in messages as follows: 
● LC1​: ​L​i​ is incremented before each event is issued at process ​p​i​ : L​i​ := L​i​ + 1. 
● LC2​:  
(a) When a process ​p​i​ sends a message ​m​, it piggybacks on ​m​ the value ​t = L​i​. 
(b) On receiving ​(m, t)​, a process ​p​j computes ​L​j := max(​L​j ,​t​) and then applies LC1 before timestamping                                     
the event receive(m). 
 
 
Note: e ➝ e’ ⇒ L(e) < L(e’). 
The converse is not true: if L(e) < L(e’), then we cannot infer that e ➝ e’. 
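
A minimal Java sketch of a Lamport logical clock implementing LC1 and LC2 (illustrative only):

// Minimal sketch of a Lamport logical clock (LC1/LC2).
import java.util.concurrent.atomic.AtomicLong;

public class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    // LC1: increment before each local event; the returned value timestamps the event.
    public long tick() {
        return counter.incrementAndGet();
    }

    // LC2(a): timestamp to piggyback on an outgoing message m.
    public long onSend() {
        return tick();
    }

    // LC2(b): on receiving (m, t), set L := max(L, t) and then apply LC1
    // before timestamping the event receive(m).
    public long onReceive(long t) {
        counter.updateAndGet(current -> Math.max(current, t));
        return tick();
    }
}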
 
Totally ordered logical clocks​: Some pairs of distinct events, generated by different processes, have                           
numerically identical Lamport timestamps. We can create a ​total order on the set of events – that is, one for                                       
which all pairs of distinct events are ordered – by ​taking into account the identifiers of the processes at which                                       
events occur. 
We define the ​global logical timestamps for these events to be ​(T​i​, i) and ​(T​j​, j) , respectively. ​(T​i​, i) < (T​j​, j) if                                               
and only if either ​T​i​ < T​j​  , or T​i​ = T​j​ and i < j​ . 
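
A small sketch of this total order as a Java comparator over (Ti, i) pairs (names are illustrative):

// Total order on global logical timestamps: compare the Lamport timestamp first,
// then break ties with the process identifier.
import java.util.Comparator;

public record GlobalTimestamp(long t, int processId) {
    public static final Comparator<GlobalTimestamp> TOTAL_ORDER =
            Comparator.comparingLong(GlobalTimestamp::t)
                      .thenComparingInt(GlobalTimestamp::processId);
    // e.g. TOTAL_ORDER.compare(new GlobalTimestamp(3, 1), new GlobalTimestamp(3, 2)) < 0
}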
 
Vector clocks​: Mattern [1989] and Fidge [1991] developed vector clocks to overcome the shortcoming of                             
Lamport’s clocks: the fact that from L(e) < L(e’)  we cannot conclude that e ​➝​ e’. 
A vector clock for a system of N processes is an array of N integers. Each process keeps its own vector clock,                                           
Vi , which it uses to timestamp local events. There are simple rules for updating the clocks: 
● VC1: initially, Vi(j) = 0, for i, j = 1, 2, ..., N. 
● VC2: just before pi timestamps an event, it sets Vi(i) := Vi(i) + 1. 
● VC3: pi includes the value t = Vi in every message it sends. 
● VC4: when pi receives a timestamp t in a message, it sets Vi(j) := max(Vi(j), t(j)), for j = 1, 2, ..., N (a ‘merge’ operation). 
 
For a vector clock V​i​, V​i​(i) is the number of events that p​i has timestamped, and V​i​(j) (j ≠ i) is the number of                                               
events that have occurred at p​j that have ​potentially ​affected p​i​. (Process p​j may have timestamped more                                 
events by this point, but no information has flowed to pi about them in messages as yet.) 
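
A minimal Java sketch of a vector clock following the update rules above (illustrative only):

// Minimal sketch of a vector clock for N processes.
import java.util.Arrays;

public class VectorClock {
    private final long[] v;     // v[j] = events at p_j known to have potentially affected us
    private final int self;     // index i of the owning process p_i

    public VectorClock(int n, int self) {   // VC1: all entries start at 0
        this.v = new long[n];
        this.self = self;
    }

    public long[] onLocalEvent() {          // VC2: increment own entry before timestamping
        v[self]++;
        return v.clone();                   // timestamp of the event
    }

    public long[] onSend() {                // VC3: piggyback the current vector on the message
        return onLocalEvent();
    }

    public void onReceive(long[] t) {       // VC4: component-wise merge, then VC2 for receive(m)
        for (int j = 0; j < v.length; j++) v[j] = Math.max(v[j], t[j]);
        v[self]++;
    }

    // V <= V' iff every component of V is <= the corresponding component of V'.
    public static boolean lessOrEqual(long[] a, long[] b) {
        for (int j = 0; j < a.length; j++) if (a[j] > b[j]) return false;
        return true;
    }

    @Override public String toString() { return Arrays.toString(v); }
}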
 
Figure 14.7 shows the vector timestamps of the events of Figure 14.5. It can be seen, for example, that V(a) <                                         
V(f) , which reflects the fact that a​➝ f. Similarly, we can tell when two events are concurrent by comparing their                                         
timestamps. For example, that c || e can be seen from the facts that neither V(c) ≤ V(e) nor V(e) ≤ V(c). 
Vector timestamps have the ​disadvantage​, compared with Lamport timestamps, of taking up an amount of                             
storage and message payload that is proportional to N, the number of processes. 
 
Global states​: the problem of finding out whether a particular property is true of a distributed system as it                                     
executes. 
● Distributed garbage collection​: An object is considered to be garbage if there are no longer any                               
references to it anywhere in the distributed system. To check that an object is garbage, we must verify                                   
that there are no references to it anywhere in the system. When we consider properties of a system, we                                     
must include the state of communication channels as well as the state of the processes. 
● Distributed deadlock detection​: A distributed deadlock occurs when each of a collection of processes                           
waits for another process to send it a message, and where there is a cycle in the graph of this                                       
‘​waits­for​’ relationship. 
● Distributed termination detection: The phenomena of termination and deadlock are similar in some ways, but they are different problems. First, a deadlock may affect only a subset of the processes in a system, whereas termination requires that all processes have terminated. Second, process passivity is not the same as waiting in a deadlock cycle: a deadlocked process is attempting to perform a further action, for which another process waits; a passive process is not engaged in any activity. 
● Distributed debugging​: Distributed systems are complex to debug. 
 
 
Global states and consistent cuts​: The essential problem is the absence of global time. 
 
 
A global state corresponds to initial prefixes of the individual process histories. A cut of the system’s execution is a subset of its global history that is a union of prefixes of process histories: 
C = h1^c1 ∪ h2^c2 ∪ … ∪ hN^cN 
where hi^ci denotes the prefix of process pi’s history containing its first ci events; the set of the last events of these prefixes is called the frontier of the cut. 
 
The leftmost cut is inconsistent. This is because at p​2 it includes the receipt of the message m​1​, but at p​1 it                                           
does not include the sending of that message. It shows an ‘effect’ without a ‘cause’. The actual 
execution never was in a global state corresponding to the process states at that frontier, and we can in                                     
principle tell this by examining the ➝ relation between events. By contrast, the rightmost cut is consistent. 
 
 
INTERPROCESS COMMUNICATION 
Direct comm. (direct naming)​: unique names are given to all processes comprising a program 
● symmetrical ​direct naming: ​both the sender and receiver name​ the corresponding ​process​. 
● asymmetrical​ direct naming: the receiver can receive messages from any process. 
Indirect comm.​ ​(indirect naming)​: uses intermediaries called ​channels​ or ​mailboxes 
● symmetrical​ indirect naming: ​both the sender and receiver name​ the corresponding ​channel​. 
● asymmetrical​ indirect naming: the receiver can receive messages from any channel. 
 
Request/Reply protocol​: a requestor sends a request message to a replier system which receives and                             
processes the request, ultimately returning a message in response. This is a simple, but powerful messaging                               
pattern which allows two applications to have a two­way conversation with one another over a channel. This                                 
pattern is especially common in client­server architectures.​[1] 
For simplicity, this pattern is typically implemented in a purely ​synchronous fashion, as in ​web service calls                                 
over ​HTTP​, which holds a connection open and waits until the response is delivered or the ​timeout period                                   
expires. However, request–response may also be implemented ​asynchronously​, with a response being                       
returned at some unknown later time. This is often referred to as "sync over async", or "sync/async", and is                                     
common in ​enterprise application integration (EAI) implementations where slow ​aggregations​, time­intensive                     
functions, or​ ​human workflow​ must be performed before a response can be constructed and delivered. 
 
 
 
 
 
Marshalling and Unmarshalling​: The information stored in running programs is represented as data                         
structures, whereas the ​information in messages consists of sequences of bytes​. Irrespective of the form of                               
communication used, the ​data structures must be flattened (converted to a sequence of bytes) ​before                             
transmission and rebuilt on arrival​. There are differences in data representation from a computer to another                               
one. So, when communicating, the following problems must be addressed: 
● primitive data representation (such as integers and floating­point numbers). 
● set of codes used to represent characters (ASCII  or Unicode). 
 
There are two ways for enabling any two computers to exchange binary data values: 
● The values are converted to an agreed external format before transmission and converted to the local                               
form on receipt. 
● The values are transmitted in the sender’s format, together with an indication of the format used, and                                 
the recipient converts the values if necessary. 
An agreed standard for the representation of data structures and primitive values is called an ​external data                                 
representation​. 
Marshalling is the process of taking a collection of data items and assembling them into a form suitable for                                     
transmission in a message. ​Unmarshalling is the process of disassembling them on arrival to produce an                               
equivalent collection of data items at the destination. Thus marshalling consists of the translation of structured                               
data items and primitive values into an external data representation. Similarly, unmarshalling consists of the                             
generation of primitive values from their ​external data representation​ and the rebuilding of the data structures. 
Three alternative approaches to external data representation and marshalling: 
● CORBA​’s common data representation, which is concerned with an external representation for the                         
structured and primitive types that can be passed as the arguments and results of remote method                               
invocations in CORBA. 
● Java’s object serialization​, which is concerned with the flattening and external data representation of                           
any single object or tree of objects that may need to be transmitted in a message or stored on a disk. 
● XML​ (Extensible Markup Language), which defines a textual format for representing structured data. 
In the first two approaches, the primitive data types are marshalled into a binary form. In the third approach                                     
(XML), the primitive data types are represented textually. The textual representation of a data value will                               
generally be longer than the equivalent binary representation. The HTTP protocol is another example of the                               
textual approach. 
Two main issues exist in marshalling: 
● compactness​: the resulting message should be as compact as possible. 
● data type inclusion​: CORBA’s representation includes just the values of the objects transmitted; Java                           
serialization and XML do include type information. 
Two other techniques for external data representation are worthy of mention: 
● Google uses an approach called ​protocol buffers to capture representations of both stored and                           
transmitted data. 
● JSON (JavaScript Object Notation)  
Both these last two methods represent a step towards more lightweight approaches to data representation                             
(when compared, for example, to XML). 
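
As an example of the second approach, a minimal sketch of marshalling and unmarshalling with Java object serialization (class and method names are illustrative):

// A data structure is flattened to a byte sequence for the message and rebuilt on arrival.
import java.io.*;

public class MarshalDemo {
    // The item to transmit; Serializable marks it as flattenable by the Java runtime.
    static class Person implements Serializable {
        final String name; final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
        @Override public String toString() { return name + " (" + age + ")"; }
    }

    static byte[] marshal(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);              // type information is included in the stream
        }
        return bytes.toByteArray();          // this byte sequence is what goes in the message
    }

    static Object unmarshal(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();          // rebuilds an equivalent data structure
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] wire = marshal(new Person("Alice", 30));
        System.out.println(unmarshal(wire)); // prints: Alice (30)
    }
}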
Particular attention is to be paid to ​remote object references​. A remote object reference is an identifier for a                                     
remote object that is valid throughout a distributed system. A remote object reference is passed in the                                 
invocation message to specify which object is to be invoked. Remote object references must be generated in a                                   
manner that ensures ​uniqueness over space and time​. Also, object references must be unique among all of the                                   
processes in the various computers in a distributed system. One way is to construct a remote object reference                                   
by concatenating the Internet address of its host computer and the ​port number of the process that created it                                     
with the ​time of its creation ​and a ​local object number​. The local object number is incremented each time an                                       
object is created in that process. 
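
A minimal sketch of such a remote object reference as a Java value class (field names are illustrative; the real representation is binary, as in Figure 4.13):

import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;

public final class RemoteObjectRef {
    private static final AtomicLong NEXT_OBJECT_NUMBER = new AtomicLong(0);

    final String host;          // Internet address of the creating host
    final int port;             // port of the creating process
    final long creationTime;    // time of creation, keeping references unique over time
    final long objectNumber;    // incremented for every object created in this process
    final String interfaceName; // information about the remote interface (cf. Figure 4.13)

    RemoteObjectRef(String host, int port, String interfaceName) {
        this.host = host;
        this.port = port;
        this.creationTime = Instant.now().toEpochMilli();
        this.objectNumber = NEXT_OBJECT_NUMBER.getAndIncrement();
        this.interfaceName = interfaceName;
    }
}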
 
The last field of the remote object reference shown in Figure 4.13 contains some information about the                                 
interface of the remote object, for example, the interface name. This information is relevant to any process that                                   
receives a remote object reference as an argument or as the result of a remote invocation, because it needs to                                       
know about the methods offered by the remote object. 
 
Idempotent operations: an operation that will produce the same results if executed once or multiple times.[7] In the case of methods or subroutine calls with side effects, for instance, it means that the modified state remains the same after the first call. In functional programming, though, an idempotent function is one that has the property f(f(x)) = f(x) for any value x.[8] 
 
This is a very useful property in many situations, as it means that an operation can be repeated or retried as                                         
often as necessary without causing unintended effects. With non­idempotent operations, the algorithm may                         
have to keep track of whether the operation was already performed or not. 
In the ​HyperText Transfer Protocol (HTTP)​, idempotence and ​safety are the major attributes that separate                             
HTTP verbs​. Of the major HTTP verbs, GET, PUT, and DELETE are idempotent (if implemented according to                                 
the standard), but POST is not.​[9]
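
A tiny Java illustration of the difference (names are illustrative): setBalance can be retried safely, while addToBalance cannot:

public class Account {
    private long balance;

    public void setBalance(long value) { this.balance = value; }    // idempotent
    public void addToBalance(long delta) { this.balance += delta; } // not idempotent: retries accumulate

    public long getBalance() { return balance; }                    // idempotent and safe
}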
 
 
Remote Procedure Call (RPC) 
Request­reply protocols provide relatively low­level support for requesting the execution of a remote operation,                           
and also provide direct support for RPC and RMI. 
RPC allows client programs to ​call procedures ​transparently in server programs running in separate processes                             
and generally in different computers from the client. 
RPC has the goal of making the programming of distributed systems look similar, if not identical, to                                 
conventional programming – that is, achieving a high level of ​distribution transparency​. This unification is                             
achieved in a very simple manner, by ​extending the abstraction of a procedure call to distributed environments​.                                 
In particular, in RPC, procedures on remote machines can be called as if they are procedures in the local                                     
address space​. The underlying RPC system then hides important aspects of distribution, including the                           
encoding and decoding of parameters and results, the passing of messages and the preserving of the required                                 
semantics for the procedure call. 
Three issues that are important in understanding this concept: 
● programming with interfaces ​(the style of programming): in order to control the possible interactions                           
between modules, an explicit interface is defined for each module; as long as its interface remains the                                 
same, the implementation may be changed without affecting the users of the module. 
○ service interface​: the specification of the procedures offered by a server, defining the types of                             
the arguments of each of the procedures. But why an interface? 
■ It is not possible for a client module running in one process to access the variables in a                                   
module in another process. 
■ The parameter­passing mechanisms used in local procedure calls are not suitable when                       
the caller and procedure are in different processes. In particular, call by reference is not                             
supported. Rather, the specification of a procedure in the interface of a module in a                             
distributed program describes the parameters as input or output, or sometimes both.                       
Input parameters are passed to the remote server by sending the values of the                           
arguments in the request message and output parameters are returned in the reply                         
message and are used as the result of the call. 
■ Addresses in one process are not valid in another remote one. 
○ Interface Definition Language (IDL)​: designed to allow procedures implemented in different                     
languages to invoke one another. An IDL provides a notation for defining interfaces in which                             
each of the parameters of an operation may be described as for input or output in addition to                                   
having its type specified. 
● the ​call semantics ​associated with RPC: 
○ Retry request message​: Controls whether to retransmit the request message until either a                         
reply is received or the server is assumed to have failed. 
○ Duplicate filtering​: Controls when retransmissions are used and whether to filter out duplicate                         
requests at the server. 
○ Retransmission of results​: Controls whether to keep a history of result messages to enable                           
lost results to be retransmitted without re­executing the operations at the server. 
Combinations of these choices lead to a variety of possible semantics for the ​reliability ​of remote invocations                                 
as seen by the invoker. Note that for local procedure calls, the semantics are exactly once, meaning that every                                     
procedure is executed exactly once (except in the case of process failure). The choices of RPC invocation                                 
semantics are defined as follows. 
○ Maybe semantics​: the remote procedure call ​may be executed once or not at all​. Maybe                             
semantics arises when ​no fault­tolerance measures are applied and ​can suffer from the                         
following types of failure​: 
■ omission failures​ if the request or result message is lost; 
■ crash failures​ when the server containing the remote operation fails. 
Useful only for applications in which occasional failed calls are acceptable. 
○ At­least­once semantics​: ​the invoker receives either a result​, in which case the invoker knows                           
that the procedure was executed at least once, ​or an exception informing it that no result was                                 
received. It can be achieved by the ​retransmission of request messages​, which masks the                           
omission failures of the request or result message. At­least­once semantics ​can suffer from the                           
following types of failure​: 
■ crash failures: when the server containing the remote procedure fails; 
■ arbitrary failures​: in cases when ​the request message is retransmitted​, the remote server                         
may receive it and ​execute the procedure more than once​, possibly causing wrong                         
values to be stored or returned. 
If the operations in a server can be designed so that all of the procedures in their service                                   
interfaces ​are idempotent operations​, then at­least­once call semantics may be acceptable. 
○ At­most­once semantics​: ​the caller receives either a result​, in which case the caller knows that                             
the procedure was executed exactly once, ​or an exception informing it that ​no result was                             
received, in which case the ​procedure will have been ​executed either once or not at all​. It can                                   
be achieved by using all of the fault­tolerance measures outlined in Figure 5.9. 
 
 
● Transparency​: RPC strives to offer at least ​location and access transparency​, hiding the physical                           
location of the (potentially remote) procedure and also accessing local and remote procedures in the                             
same way. Middleware can also offer additional levels of transparency to RPC. However, remote                           
procedure calls suffer the followings: 
○ vulnerability to failure due to the ​network ​and/or of the ​remote server process and no ability to                                 
distinguish among them. 
○ latency of a remote procedure call is several orders of magnitude greater than that of a local                                 
one. 
The current consensus is that remote calls should be made transparent in the sense that the syntax of                                   
a remote call is the same as that of a local invocation, but that the difference between local and remote                                       
calls should be expressed in their interfaces. 
 
RPC Implementation​: The software components required to implement RPC are shown in Figure 5.10. The                             
client that accesses a service ​includes one ​stub procedure for each procedure in the service interface​. The                                 
stub procedure behaves like a local procedure to the client​, but instead of executing the call, it ​marshals the                                     
procedure identifier and the arguments into a request message​, which it sends via its communication module                               
to the server. When the reply message arrives, it ​unmarshals the results​. The ​server process contains a                                 
dispatcher together ​with ​one server stub procedure and one service procedure for each procedure in the                               
service interface​. ​The dispatcher selects one of the server stub procedures ​according to the procedure                             
identifier in the request message. The server stub procedure then unmarshals the arguments in the request                               
message, calls the corresponding service procedure and marshals the return values for the reply message.                             
The service procedures implement the procedures in the service interface​. ​The client and server stub                             
procedures and the dispatcher can be generated automatically by an interface compiler from the interface                             
definition of the service. RPC is generally implemented over a request­reply protocol like the ones discussed                               
so far. The contents of request and reply messages are the same as those illustrated for request­reply                                 
protocols in Figure 5.4. RPC may be implemented to have one of the choices of invocation semantics                                 
discussed: at-least-once or at-most-once is generally chosen. To achieve this, the communication module will 
implement the desired design choices in terms of retransmission of requests, dealing with duplicates and                             
retransmission of results, as shown in Figure 5.9. 
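
A minimal sketch of what a generated client stub might look like for a hypothetical getBalance procedure, with retransmission giving at-least-once semantics (the RequestReplyChannel interface and the wire format are assumptions, not a real API):

import java.io.*;
import java.util.Optional;

public class AccountStub {
    // Stand-in for the communication module: sends a request and waits (with a timeout)
    // for the matching reply, returning empty if no reply arrived in time.
    interface RequestReplyChannel {
        Optional<byte[]> sendRequest(byte[] request);
    }

    private final RequestReplyChannel channel;
    private final int maxRetries = 3;          // retransmission gives at-least-once semantics

    AccountStub(RequestReplyChannel channel) { this.channel = channel; }

    public long getBalance(String accountId) throws IOException {
        // Marshal the procedure identifier and the arguments into the request message.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF("getBalance");        // procedure identifier
            out.writeUTF(accountId);           // input parameter
        }
        byte[] request = bytes.toByteArray();

        // Retransmit until a reply arrives or the server is assumed to have failed.
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            Optional<byte[]> reply = channel.sendRequest(request);
            if (reply.isPresent()) {
                try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(reply.get()))) {
                    return in.readLong();      // unmarshal the result
                }
            }
        }
        throw new IOException("no reply: server assumed to have failed");
    }
}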
 
 
 
Remote Method Invocation (RMI) 
RMI allows ​objects in different processes to communicate with one another​; it is an extension of local method                                   
invocation that allows an object living in one process to invoke the methods of an object living in another                                     
process. 
The ​commonalities between RMI and RPC​ are as follows: 
● They both support ​programming with interfaces​. 
● both typically ​constructed on top of request­reply protocols and can offer a range of ​call semantics such                                 
as ​at­least­once​ and ​at­most­once​. 
● both offer a ​similar level of transparency – that is, local and remote calls employ the same syntax but                                     
remote interfaces typically expose the distributed nature of the underlying call, for example by                           
supporting remote exceptions. 
 
The following ​differences​ lead to added expressiveness: 
● The programmer is able to use the ​full expressive power of object­oriented programming in the                             
development of distributed systems software. 
● Building on the concept of object identity in object­oriented systems, ​all objects in an RMI­based                             
system have unique object references (whether they are local or remote), such object references can                             
also be passed as parameters, thus offering ​significantly richer parameter­passing semantics than in                         
RPC​. 
The issue of parameter passing is particularly important in distributed systems. RMI allows the programmer to                               
pass parameters not only by value, as input or output parameters, but also by object reference. 
RMI shares the ​same ​design issues as RPC in terms of programming with interfaces, call semantics and level                                   
of transparency. The key design issue relates to the object model and, in particular, achieving the transition                                 
from objects to distributed objects. For example, for use in a distributed object system, an object’s data should                                   
be accessible only via its methods. 
Distributed object systems may adopt the client­server architecture. In this case, objects are managed by                             
servers and their clients invoke their methods using remote method invocation. ​In RMI, the client’s request to                                 
invoke a method of an object is sent in a message to the server managing the object​. The invocation is carried                                         
out by executing a method of the object at the server and the result is returned to the client in another                                         
message. To ​allow for chains of related invocations, objects in servers are allowed to become clients of objects                                   
in other servers​. 
Having client and server objects in different processes enforces ​encapsulation​. That is, ​the state of an object                                 
can be accessed only by the methods of the object​, which means that it is not possible for unauthorized                                     
methods to act on the state. 
Another advantage of treating the ​shared state of a distributed program as a collection of objects is that ​an                                     
object may be accessed via RMI​. 
The fact that objects are accessed only via their methods gives rise to another ​advantage of heterogeneous                                 
systems, that different data formats may be used at different sites – these formats will be unnoticed by clients                                     
that use RMI to access the methods of the objects. 
Distributed object model​: A process contains a collection of objects, some of which may receive local and                                 
remote invocations, while others only local invocations. ​Method invocations between objects in different                         
processes​, whether in the same computer or not, ​are known as ​remote method invocations​. ​Method                             
invocations between objects in the same process are ​local method invocations​. 
The following two fundamental concepts are at the heart of the distributed object model: 
● Remote object references​: Other objects can invoke the methods of a remote object if they have                               
access to its remote object reference. 
A remote object reference is an identifier that can be used throughout a distributed system to refer to ​a 
particular unique remote object​. It may be passed as arguments and results of remote method 
invocations. 
● Remote interfaces​: Every remote object has a remote interface that specifies which of its methods can                               
be invoked remotely. 
The class of a remote object implements the methods of its remote interface. 
 
 
 
Actions in a distributed object system​: As in the non­distributed case, an action is initiated by a method                                   
invocation, which may result in further invocations of methods in other objects. But in the distributed case, the                                   
objects involved in a chain of related invocations may be located in different processes or different computers.                                 
When an invocation crosses the boundary of a process or computer, RMI is used, and the remote reference of                                     
the object must be available to the invoker. 
 
 
Garbage collection in a distributed­object system​: generally achieved by ​cooperation between the existing                         
local garbage collector and an added module that carries out a form of distributed garbage collection, usually                                 
based on reference counting​. 
 
Exceptions​: Any remote invocation may fail for reasons related to the invoked object being in a different                                 
process or computer from the invoker. Remote method invocation should be able to raise exceptions such as                                 
timeouts that are due to distribution as well as those raised during the execution of the method invoked. 
 
RMI Implementation​: Several separate objects and modules are involved in achieving a remote method                           
invocation. 
 
● Communication module​: The two cooperating communication modules ​carry out the request­reply                     
protocol​, which transmits request and reply messages between the client and server. The contents of                             
request and reply messages are shown in Figure 5.4. They are ​responsible for providing a specified                               
invocation semantics​, for example ​at­most­once​. 
 
 
 
● Remote reference module​: is responsible for translating between local and remote object references                         
and for creating remote object references. In each process, it has a ​remote object table that ​records the                                   
correspondence between local object references in that process and remote object references (which                         
are system­wide). The table includes: 
○ An entry for all the remote objects held by the process​. For example, in Figure 5.15 the remote                                   
object B will be recorded in the table at the server. 
○ An entry for each ​local proxy​. For example, in Figure 5.15 the proxy for B will be recorded in the                                       
table at the client. 
The actions of the remote reference module are as follows: 
○ When a remote object is to be passed as an argument or a result for the first time, the remote                                       
reference module is asked to ​create a remote object reference, which it adds to its table​. 
○ When a remote object reference arrives in a request or reply message, the remote reference                             
module is asked for the corresponding ​local object reference​, which ​may refer either to a proxy                               
or to a remote object​. In the case that the remote object reference is not in the table, the RMI                                       
software creates a new proxy and asks the remote reference module to add it to the table. 
This module is called by components of the RMI software when they are marshalling and unmarshalling                               
remote object references. For example, when a request message arrives, the table is used to find out                                 
which local object is to be invoked. 
● Servant​: an instance of a class that provides the body of a remote object. It is the servant that                                     
eventually ​handles the remote requests passed on by the corresponding ​skeleton​. Servants live within                           
a server process. 
● The RMI software​: a layer of software between the application­level objects and the communication                           
and remote reference modules. The roles of the middleware objects shown in Figure 5.15 are as                               
follows: 
○ Proxy​: makes remote method invocation transparent to clients by ​behaving like a local object to                             
the invoker​; but instead of executing an invocation, it forwards it in a message to a remote                                 
object. It hides the details of the remote object reference, the marshalling of arguments,                           
unmarshalling of results and sending and receiving of messages from the client. ​There is one                             
proxy for each remote object for which a process holds a remote object reference​. ​The class of                                 
a proxy implements the methods in the remote interface of the remote object it represents​,                             
which ensures that remote method invocations are suitable for the type of the remote object.                             
However, the proxy implements them quite differently. ​Each method of the proxy marshals a                           
reference to the target object, its own operationId and its arguments into a request message                             
and sends it to the target. It then awaits the ​reply message, unmarshals it and returns the                                 
results to the invoker. 
○ Dispatcher​: ​A server has one dispatcher and one skeleton for each class representing a                           
remote object​. The dispatcher receives request messages from the communication module. ​It                       
uses the operationId to select the appropriate method in the skeleton​, passing on the request                             
message. The dispatcher and the proxy use the same allocation of operationIds to the methods                             
of the remote interface. 
○ Skeleton​: The class of a remote object has a skeleton, which implements the methods in the                               
remote interface. They are implemented quite differently from the methods in the servant that                           
incarnates a remote object. A skeleton method ​unmarshals the arguments in the request                         
message and invokes the corresponding method in the servant​. It waits for the invocation to                             
complete and then ​marshals the result, together with any exceptions​, in a reply message to the                               
sending proxy’s method. 
 
Remote object references are marshalled in the form shown in Figure 4.13, which includes information                             
about the remote interface of the remote object – for example, the name of the remote interface or the                                     
class of the remote object. This information enables the proxy class to be determined so that a new                                   
proxy may be created when it is needed. For example, the proxy class name may be generated by                                   
appending _proxy to the name of the remote interface. 
 
● Generation of the classes for proxies, dispatchers and skeletons​: The classes for the proxy,                           
dispatcher and skeleton used in RMI are generated automatically by an interface compiler. 
 
● Factory methods​: We noted earlier that ​remote object interfaces cannot include constructors​. This                         
means that servants cannot be created by remote invocation on constructors. Servants are created                           
either in the initialization section or ​in methods in a remote interface designed for that purpose​. The                                 
term factory method is sometimes used to refer to a method that creates servants, and a factory object                                   
is an object with factory methods. Any remote object that needs to be able to create new remote                                   
objects on demand for clients must provide methods in its remote interface for this purpose. Such                               
methods are called factory methods, although they are really just normal methods. 
● Server threads​: Whenever an object executes a remote invocation, that execution may lead to further                             
invocations of methods in other remote objects, which may take some time to return. To avoid the                                 
execution of one remote invocation delaying the execution of another, ​servers generally allocate a                           
separate thread for the execution of each remote invocation. When this is the case, ​the designer of the                                   
implementation of a remote object ​must allow for the effects on its state of concurrent executions​.                               
Some alternatives exist: 
○ thread­per­request 
○ thread­per­connection 
○ thread­per­remote object 
● Object location​: A ​location service helps clients to locate remote objects from their remote object                             
references. It uses a database that maps remote object references to their probable current locations –                               
the locations are probable because an object may have migrated again since it was last heard of. 
 
Java RMI (java.rmi package) 
An object making a remote invocation is aware that its target is remote because it ​must handle                                 
RemoteExceptions; and the implementor of a remote object is aware that it is remote because ​it must                                 
implement the Remote interface. ​The arguments and return values in a remote invocation are serialized to a                                 
stream using the method described previously, with the following modifications: 
1. Whenever an object that implements the Remote interface is serialized, it is replaced by its remote 
object reference, which contains the name of its (the remote object’s) class. 
2. When any object is serialized, its class information is annotated with the location of the class (as a 
URL), enabling the class to be downloaded by the receiver. 
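
A minimal Java RMI example tying these points together (the Greeter names are illustrative; Remote, RemoteException, UnicastRemoteObject, Naming and LocateRegistry are the real java.rmi classes, and the registry usage anticipates the RMIregistry point below):

// The remote interface extends Remote and every remote method declares RemoteException.
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.server.UnicastRemoteObject;

interface Greeter extends Remote {
    String greet(String name) throws RemoteException;   // callers must handle RemoteException
}

// The servant: the class providing the body of the remote object.
class GreeterServant extends UnicastRemoteObject implements Greeter {
    GreeterServant() throws RemoteException { super(); }
    public String greet(String name) throws RemoteException { return "Hello, " + name; }
}

public class GreeterServer {
    public static void main(String[] args) throws Exception {
        LocateRegistry.createRegistry(1099);                           // start an RMIregistry
        java.rmi.Naming.rebind("//localhost:1099/Greeter", new GreeterServant());
        System.out.println("Greeter bound in the registry");
        // A client elsewhere would call:
        //   Greeter g = (Greeter) java.rmi.Naming.lookup("//localhost:1099/Greeter");
        //   System.out.println(g.greet("world"));
    }
}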
 
● Downloading of classes: ​Java is designed to allow classes to be downloaded from one virtual                             
machine to another​. This is particularly relevant to distributed objects that communicate by means of                             
remote invocation. We have seen that non­remote objects are passed by value and remote objects are                               
passed by reference as arguments and results of RMIs. ​If the recipient does not already possess the                                 
class of an object passed by value, its code is downloaded automatically. Similarly, if the recipient of a                                   
remote object reference does not already possess the class for a proxy, its code is downloaded                               
automatically​. This has two advantages: 
○ There is no need for every user to keep the same set of classes in their working environment. 
○ Both client and server programs can make transparent use of instances of new classes                           
whenever they are added. 
● RMIregistry​: is ​the binder for Java RMI​. An instance of RMIregistry should normally run on every                               
server computer that hosts remote objects. It ​maintains a table mapping textual, URL­style names to                             
references to remote objects hosted on that computer​. It is accessed by methods of the Naming class,                                 
whose methods take as an argument a URL­formatted string of the form: 
//computerName:port/objectName 
where computerName and port refer to the location of the RMIregistry. 
Used in this way, ​clients must direct their lookup enquiries to particular hosts​. Alternatively, it is possible                                 
to set up a system­wide binding service. To achieve this, it is necessary to run an instance of the                                     
RMIregistry in the networked environment and then use the class ​LocateRegistry​, which is in                           
java.rmi.registry, ​to discover this registry​. 
 
● Distributed garbage collection​: the Java distributed garbage collection algorithm is based on                       
reference counting​. Whenever a remote object reference enters a process, a proxy will be created and                               
will stay there for as long as it is needed. The process where the object lives (its server) should be                                       
informed of the new proxy at the client. Then later when there is no longer a proxy at the client, the                                         
server should be informed. The distributed garbage collector works in cooperation with the local                           
garbage collectors as follows: 
○ Each server process maintains ​a set of the names of the processes that hold remote object                               
references for each of its remote objects​; for example, ​B.holders is the set of client processes                               
(virtual machines) that have proxies for object B. This set can be held in an additional column in                                   
the remote object table. 
○ When a client C first receives a remote reference to a particular remote object, B, it makes an                                   
addRef(B) invocation to the server of that remote object and then creates a proxy; the server                               
adds C to B.holders. 
○ When a client C’s garbage collector notices that a proxy for remote object B is no longer                                 
reachable, it makes a removeRef(B) invocation to the corresponding server and then deletes                         
the proxy; the server removes C from B.holders.  
○ When B.holders is empty, the server’s local garbage collector will reclaim the space occupied by                             
B unless there are any local holders. 
This algorithm is intended to be carried out by means of pairwise request­reply communication with 
at­most­once invocation semantics between the remote reference modules in processes – it does not 
require any global synchronization. Note also that the extra invocations made on behalf of the garbage 
collection algorithm do not affect every normal RMI; they occur only when proxies are created and 
deleted. There is a possibility that one client may make a removeRef(B) invocation at about the same 
time as another client makes an addRef(B) invocation. If the removeRef arrives first and B.holders is 
empty, the remote object B could be deleted before the addRef arrives. To avoid this situation, if the set 
B.holders is empty at the time when a remote object reference is transmitted, a temporary entry is 
added until the addRef arrives. ​The Java distributed garbage collection algorithm tolerates 
communication failures by using the following approach. The addRef and removeRef operations are 
idempotent.​ In the case that an addRef(B) call returns an exception (meaning that the method was 
either executed once or not at all), the client will not create a proxy but will make a removeRef(B) call. 
The effect of removeRef is correct whether or not the addRef succeeded. ​The case where removeRef 
fails is dealt with by ​leases​. ​Servers lease their objects to clients for a limited period of time​. The lease 
period starts when the client makes an addRef invocation to the server. It ends either when the time has expired or when the client makes a removeRef invocation to the server. The information stored by the 
server concerning each lease contains the identifier of the client’s virtual machine and the period of the 
lease. ​Clients are responsible for requesting the server to renew their leases before they expire​. 
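
A sketch of the server-side bookkeeping that such an algorithm needs (illustrative only – the real Java distributed garbage collector lives inside the RMI runtime; class and method names are assumptions):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class HoldersTable {
    // For each remote object (by its object number), the set of client processes
    // (virtual machines) currently holding proxies for it -- i.e. B.holders.
    private final Map<Long, Set<String>> holders = new ConcurrentHashMap<>();

    // Both operations are idempotent: repeating them leaves the table unchanged.
    public void addRef(long objectNumber, String clientVm) {
        holders.computeIfAbsent(objectNumber, k -> ConcurrentHashMap.newKeySet()).add(clientVm);
    }

    public void removeRef(long objectNumber, String clientVm) {
        Set<String> s = holders.get(objectNumber);
        if (s != null) {
            s.remove(clientVm);
            // When the holder set becomes empty, the local garbage collector may reclaim
            // the object, unless local references still exist (leases handled elsewhere).
        }
    }

    public boolean hasRemoteHolders(long objectNumber) {
        Set<String> s = holders.get(objectNumber);
        return s != null && !s.isEmpty();
    }
}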
 
DISTRIBUTED ALGORITHMS FOR MUTUAL EXCLUSION 
Distributed processes often need to coordinate their activities. We require a solution to distributed mutual                             
exclusion: one that is based solely on message passing. It would be convenient for these processes to be able                                     
to obtain mutual exclusion solely by communicating among themselves, eliminating the need for a separate                             
server. 
 
Essential requirements for mutual exclusion are as follows: 
ME1: (​safety​) At most one process may execute in the critical section (CS) at a time. 
ME2: (​liveness​) Requests to enter and exit the critical section eventually succeed. 
Condition ME2 implies freedom from both deadlock and starvation. 
 
DEADLOCK​: two or more of the processes becoming stuck indefinitely while attempting to 
enter or exit the critical section, by virtue of their mutual interdependence. 
 
STARVATION​: the indefinite postponement of entry for a process that has requested it. 
 
FAIRNESS: absence of STARVATION and use of the HAPPENED­BEFORE ordering between messages that                         
request entry to the critical section. 
ME3: ( ➝ ​ordering​) If one request to enter the CS happened­before another, then entry to the CS is                                     
granted in that order. 
 
SYMMETRICAL­INFORMATION ALG.: every process executes the same code 
 
Performance of algorithms for mutual exclusion: 
➔ bandwidth: proportional to the number of messages sent in each entry and exit operation 
➔ synchronization delay, which determines the throughput of the entire system: the time between one process exiting the CS and the next process entering it (the shorter this delay, the higher the throughput) 
➔ client delay: the delay incurred by a process at each entry and exit operation 
 
 
Ring-based Mutual Exclusion (ME): the N processes are arranged in a logical ring, and each one communicates only with the next process in, say, a clockwise direction. 
In order to enter the CS, a process must hold the token granting access. The token keeps moving around the ring until a process requires it; that process then holds the token until it completes and exits the CS, at which point the token is released and passed on to the next process. 
Pro: easy to implement, easy to adapt to multiple CSs, a process observes no delay to exit the CS. 
Cons: continuous bandwidth consumption, to enter the CS a process might experience a delay of between 0 and N messages, and the synchronization delay ranges from 1 to N message transmissions. 
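
A sketch of one node of the token ring in Java (message passing is abstracted behind a Neighbour interface; all names are illustrative):

// The token circulates; a node holds it only while it is inside (or entering) the CS.
public class RingNode {
    private boolean wantsCs = false;            // set by the application before entering
    private boolean hasToken = false;           // set when the token message arrives

    interface Neighbour { void sendToken(); }   // stand-in for the channel to the next node
    private final Neighbour next;

    RingNode(Neighbour next) { this.next = next; }

    // Called by the transport layer when the token arrives from the previous node.
    public synchronized void onToken() {
        hasToken = true;
        if (!wantsCs) passToken();              // not interested: forward it immediately
        else notifyAll();                       // wake the thread blocked in enterCs()
    }

    public synchronized void enterCs() throws InterruptedException {
        wantsCs = true;
        while (!hasToken) wait();               // sleep until the token reaches us
    }

    public synchronized void exitCs() {
        wantsCs = false;
        passToken();                            // release: hand the token to the next node
    }

    private void passToken() { hasToken = false; next.sendToken(); }
}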
 
 
 
RICART­AGRAWALA (Multicast and logical blocks, [1981])​: mutual exclusion between N peer processes                       
based upon ​multicast​. Processes that require entry to a critical section multicast a request message, and can                                 
enter it only when all the other processes have replied to this message.  
 
 
 
Each process uses a variable state to record whether it is outside the critical section (RELEASED), wanting entry (WANTED) or inside the critical section (HELD). 
If two or more processes request entry at the same time, then whichever process’s request bears the lowest                                   
timestamp will be the first to collect N – 1 replies, granting it entry next. If the requests bear equal Lamport                                         
timestamps, the requests are ordered according to the processes’ corresponding identifiers. 
This algorithm achieves the safety property ME1. If it were possible for two processes p​i and p​j ( i ≠ j ) to enter                                               
the critical section at the same time, then both of those processes would have to have replied to the other. But                                         
since the pairs <T​i​, p​i​ > are totally ordered, this is impossible. 
Performance​: 
● Gaining entry takes 2(N – 1) messages in this algorithm: N – 1 to multicast the request, followed by N –                                         
1 replies.  
● Client delay​ in requesting entry is again a round­trip time (with hw multicast support). 
● synchronization delay​ is only one message transmission time (with hw multicast support). 
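
A sketch of the per-process logic of Ricart-Agrawala in Java (the transport is abstracted away, LamportClock is the sketch from the logical-clocks section, and all names are illustrative):

import java.util.ArrayDeque;
import java.util.Queue;

public class RicartAgrawala {
    enum State { RELEASED, WANTED, HELD }

    interface Transport {
        void multicastRequest(long timestamp, int sender);  // to the other N-1 processes
        void reply(int to);
    }

    private final int self, n;
    private final Transport transport;
    private final LamportClock clock = new LamportClock();  // sketch from earlier section
    private State state = State.RELEASED;
    private long requestTs;                                  // T_i of our pending request
    private int repliesReceived;
    private final Queue<Integer> deferred = new ArrayDeque<>();

    RicartAgrawala(int self, int n, Transport transport) {
        this.self = self; this.n = n; this.transport = transport;
    }

    public synchronized void enter() throws InterruptedException {
        state = State.WANTED;
        requestTs = clock.onSend();                          // Lamport timestamp of the request
        repliesReceived = 0;
        transport.multicastRequest(requestTs, self);
        while (repliesReceived < n - 1) wait();              // need all N-1 replies
        state = State.HELD;
    }

    public synchronized void onRequest(long ts, int sender) {
        clock.onReceive(ts);
        // Our pending request is earlier iff (requestTs, self) < (ts, sender) in the total order.
        boolean ourRequestIsEarlier = requestTs < ts || (requestTs == ts && self < sender);
        if (state == State.HELD || (state == State.WANTED && ourRequestIsEarlier)) {
            deferred.add(sender);                            // answer only after we leave the CS
        } else {
            transport.reply(sender);
        }
    }

    public synchronized void onReply() {
        if (++repliesReceived >= n - 1) notifyAll();
    }

    public synchronized void exit() {
        state = State.RELEASED;
        while (!deferred.isEmpty()) transport.reply(deferred.poll());
    }
}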
 
 
Fault tolerance: The main points to consider when evaluating the above algorithms with respect to fault                               
tolerance are: 
● What happens when messages are lost? 
● What happens when a process crashes? 
 
 
Elections 
An algorithm for ​choosing a unique process to play a particular role is called an election algorithm. It is                                     
essential that all the processes agree on the choice. 
At any point in time, a process p​i is either a participant – meaning that it is engaged in some run of the election                                               
algorithm – or a non­participant – meaning that it is not currently engaged in any election. Our requirements                                   
are that, during any particular run of the algorithm: 
E1: (safety) A participant process pi has electedi = ⊥ or electedi = P, where P is chosen as the 
non-crashed process at the end of the run with the largest identifier. 
E2: (liveness) All processes pi participate and eventually either set electedi ≠ ⊥ – or crash. 
 
The algorithm of Chang and Roberts [1979] (A ring­based election algorithm): suitable for a collection of                               
processes arranged in a logical ring. Each process p​i has a communication channel to the next process in the                                     
ring, p​(i + 1)​mod N , and all messages are sent clockwise around the ring. We ​assume that no failures occur, and                                           
that ​the system is asynchronous​. The goal of this algorithm is to ​elect a single process called ​the ​coordinator​,                                     
which is ​the process with the largest identifier​. Initially, every process is marked as a non­participant in an                                   
election. Any process can begin an election. It proceeds by marking itself as a participant, placing its identifier                                   
in an election message and sending it to its clockwise neighbour. When a process receives an election                                 
message, it compares the identifier in the message with its own. If the arrived identifier is greater, then it                                     
forwards the message to its neighbour. If the arrived identifier is smaller and the receiver is not a participant,                                     
then it substitutes its own identifier in the message and forwards it; but it does not forward the message if it is                                           
already a participant. On forwarding an election message in any case, the process marks itself as a participant. 
If, however, the received identifier is that of the receiver itself, then this process’s identifier must be the                                   
greatest, and it becomes the coordinator. The coordinator marks itself as a non­participant once more and                               
sends an elected message to its neighbour, announcing its election and enclosing its identity. When a process                                 
pi receives an elected message, it marks itself as a nonparticipant, sets its variable elected​i to the identifier in                                     
the message and, unless it is the new coordinator, forwards the message to its neighbour. 
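
A sketch of one node’s message handling for this election in Java (the ring transport is abstracted away; names are illustrative):

public class RingElectionNode {
    interface Ring {
        void sendElection(int id);   // to the clockwise neighbour
        void sendElected(int id);
    }

    private final int myId;
    private final Ring ring;
    private boolean participant = false;
    private Integer elected = null;  // set once the coordinator becomes known

    RingElectionNode(int myId, Ring ring) { this.myId = myId; this.ring = ring; }

    public void startElection() {
        participant = true;
        ring.sendElection(myId);
    }

    public void onElection(int id) {
        if (id > myId) {
            participant = true;
            ring.sendElection(id);                 // forward the larger identifier
        } else if (id < myId && !participant) {
            participant = true;
            ring.sendElection(myId);               // substitute our own identifier
        } else if (id == myId) {
            participant = false;                   // our id came back: we are the coordinator
            elected = myId;
            ring.sendElected(myId);
        }
        // id < myId && participant: swallow the message (a duplicate election is in progress)
    }

    public void onElected(int id) {
        participant = false;
        elected = id;
        if (id != myId) ring.sendElected(id);      // forward unless we are the coordinator
    }
}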
 
It is easy to see that condition E1 is met. All identifiers are compared, since a process must receive its own                                         
identifier back before sending an elected message. For any two processes, the one with the larger identifier will                                   
not pass on the other’s identifier. It is therefore impossible that both should receive their own identifier back.                                   
Condition E2 follows immediately from the guaranteed traversals of the ring (there are no failures). Note how                                 
the non­participant and participant states are used so that duplicate messages arising when two processes                             
start an election at the same time are extinguished as soon as possible, and always before the ‘winning’                                   
election result has been announced. If only a single process starts an election, then the worst­performing case                                 
is when its anti­clockwise neighbour has the highest identifier. A total of N – 1 messages are then required to                                       
reach this neighbour, which will not announce its election until its identifier has completed another circuit,                               
taking a further N messages. The elected message is then sent N times, making 3N – 1 messages in all. The                                         
turnaround time is also 3N – 1 , since these messages are sent sequentially. 
While the ring­based algorithm is useful for understanding the properties of election algorithms in general, the                               
fact that it tolerates no failures makes it of limited practical value. However, with a reliable failure detector it is                                       
in principle possible to reconstitute the ring when a process crashes. 
Bully algorithm​: ​allows processes to crash during an election​, although it assumes that ​message delivery                             
between processes is ​reliable​. Unlike the ring­based algorithm, this algorithm assumes that the ​system is                             
synchronous​: it ​uses timeouts to detect a process failure​. Another difference is that the ring­based algorithm                               
assumed that processes have minimal a priori knowledge of one another: each knows only how to                               
communicate with its neighbour, and none knows the identifiers of the other processes. The bully algorithm, on                                 
the other hand, assumes that ​each process knows which processes have higher identifiers​, and that it can 
communicate with all such processes. 
There are ​three types of message in this algorithm: an ​election message is sent to announce an election; an                                     
answer message is sent in response to an election message and a ​coordinator message is sent to announce                                   
the identity of the elected process – the new ‘coordinator’. ​A process begins an election when it notices,                                   
through ​timeouts​, that ​the coordinator has failed​. Several processes may discover this concurrently. 
Since the system is synchronous, we can construct a reliable failure detector. There is a maximum message                                 
transmission delay, T​trans , and a maximum delay for processing a message T​process . Therefore, we can                                 
calculate a time ​T = 2T​trans ​+ T​process ​that is an upper bound on the time that can elapse between sending a                                           
message to another process and receiving a response. The process that knows it has ​the highest identifier can                                   
elect itself as the coordinator simply by sending a coordinator message to all processes with lower identifiers.                                 
On the other hand, a process with a lower identifier can begin an election by sending an election message to                                       
those processes that have a higher identifier and awaiting answer messages in response. If none arrives within                                 
time T, the process considers itself the coordinator and sends a coordinator message to all processes with                                 
lower identifiers announcing this. Otherwise, the process waits a further period T’ for a coordinator message to                                 
arrive from the new coordinator. If none arrives, it begins another election. 
When a process is started to replace a crashed process, it begins an election​. ​If it has the highest process                                       
identifier, then it will decide that it is the coordinator and announce this to the other processes. ​Thus it will                                       
become the coordinator, even though the current coordinator is functioning​. It is for this reason that the                                 
algorithm is called the ​‘bully’ algorithm​. 
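
A rough Java sketch of what one process does when it detects the coordinator's failure, under the assumptions above. The send helpers, the blocking queues standing in for incoming answer/coordinator messages, and the omission of the receiving side (answering election messages from lower processes and bullying them) are all simplifications of mine, not the full protocol. 

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;

class BullyProcess {
    final int id;
    final int[] higherIds;                                    // processes with larger identifiers
    final int[] lowerIds;                                     // processes with smaller identifiers
    final IntConsumer sendElectionTo;                         // assumed message-sending helpers
    final IntConsumer sendCoordinatorTo;
    final BlockingQueue<Integer> answers = new LinkedBlockingQueue<>();         // incoming answer messages
    final BlockingQueue<Integer> coordinatorMsgs = new LinkedBlockingQueue<>();  // incoming coordinator messages
    volatile Integer coordinator = null;

    BullyProcess(int id, int[] higherIds, int[] lowerIds,
                 IntConsumer sendElectionTo, IntConsumer sendCoordinatorTo) {
        this.id = id; this.higherIds = higherIds; this.lowerIds = lowerIds;
        this.sendElectionTo = sendElectionTo; this.sendCoordinatorTo = sendCoordinatorTo;
    }

    // Called when this process detects, through a timeout, that the coordinator has failed.
    void beginElection(long T, long Tprime) throws InterruptedException {
        if (higherIds.length == 0) { becomeCoordinator(); return; }   // highest id: announce directly
        for (int peer : higherIds) sendElectionTo.accept(peer);
        Integer answer = answers.poll(T, TimeUnit.MILLISECONDS);      // wait up to T for any answer
        if (answer == null) { becomeCoordinator(); return; }          // no higher process is alive
        Integer newCoordinator = coordinatorMsgs.poll(Tprime, TimeUnit.MILLISECONDS);
        if (newCoordinator != null) coordinator = newCoordinator;
        else beginElection(T, Tprime);                                // no announcement arrived: start again
    }

    void becomeCoordinator() {
        coordinator = id;
        for (int peer : lowerIds) sendCoordinatorTo.accept(peer);     // announce to all lower processes
    }

    public static void main(String[] args) throws InterruptedException {
        // Trivial demo: the process with the highest identifier simply announces itself.
        BullyProcess p = new BullyProcess(3, new int[]{}, new int[]{1, 2}, peer -> {}, peer -> {});
        p.beginElection(1000, 500);
        System.out.println("coordinator = " + p.coordinator);         // prints 3
    }
}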
This algorithm clearly meets the liveness condition E2, by the assumption of reliable message delivery. And if                                 
no process is replaced, then the algorithm meets condition E1. It is impossible for two processes to decide that                                     
they are the coordinator, since the process with the lower identifier will discover that the other exists and defer                                     
to it. 
But the algorithm is not guaranteed to meet the safety condition E1 if processes that have crashed are                                   
replaced by processes with the same identifiers. A process that replaces a crashed process p may decide that                                   
it has the highest identifier just as another process (which has detected p’s crash) decides that it has the                                     
highest identifier. Two processes will therefore announce themselves as the coordinator concurrently.                       
Unfortunately, there are no guarantees on message delivery order, and the recipients of these messages may                               
reach different conclusions on which is the coordinator process. 
 
COORDINATION AND AGREEMENT: 
Key coordination and agreement problems related to group communication – that is, how to achieve the                               
desired reliability and ordering properties across all members of a group. The system under consideration                             
contains a collection of processes, which can communicate reliably over one­to­one channels. As before,                           
processes may 
fail only by crashing. The processes are members of groups, which are the destinations of messages sent with                                   
the multicast operation. 
The operation ​multicast(g, m) sends the message m to all members of the group g of processes.                                 
Correspondingly, there is an operation ​deliver(m) that delivers a message sent by multicast to the calling                               
process. We use the term deliver rather than receive to make clear that a multicast message is not always                                     
handed to the application layer inside the process as soon as it is received at the process’s node. 
Every message m carries the unique identifier of the process sender(m) that sent it, and the unique destination                                   
group identifier group(m). We assume that processes do not lie about the origin or destinations of messages. 
 
BASIC MULTICAST​: a basic multicast primitive that guarantees, unlike IP multicast, that a correct process will                               
eventually deliver the message, as long as the multicaster does not crash. We call the primitive B­multicast                                 
and its corresponding basic delivery primitive B­deliver. We allow processes to belong to several groups, and                               
each 
message is destined for some particular group. A straightforward way to implement B­multicast is to use a                                 
reliable one­to­one send operation. 
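
A direct transcription of that idea in Java; the reliableSend/bDeliver hooks and integer process identifiers are assumed placeholders, not a real networking API. 

import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

class BasicMulticast<M> {
    private final BiConsumer<Integer, M> reliableSend;   // assumed reliable one-to-one send(processId, message)
    private final Consumer<M> bDeliver;                  // hands the message to the application layer

    BasicMulticast(BiConsumer<Integer, M> reliableSend, Consumer<M> bDeliver) {
        this.reliableSend = reliableSend;
        this.bDeliver = bDeliver;
    }

    // B-multicast(g, m): one reliable one-to-one send per member of g (including ourselves).
    void bMulticast(List<Integer> group, M m) {
        for (int p : group) reliableSend.accept(p, m);
    }

    // Called by the transport layer when a message addressed to this process arrives: B-deliver it.
    void onReceive(M m) {
        bDeliver.accept(m);
    }
}

Note that this only holds as long as the multicaster does not crash: if it fails partway through the send loop, some members B-deliver m and the others never will, which is exactly the gap reliable multicast closes. 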
 
RELIABLE MULTICAST​: a reliable multicast with corresponding operations ​R­multicast and ​R­deliver​.                     
Properties analogous to integrity and validity are clearly highly desirable in reliable multicast delivery, but we                               
add another: a requirement that ​all correct processes in the group must receive a message if any of them                                     
does​. 
A reliable multicast is one that satisfies the following ​properties​: 
● Integrity​: A correct process p delivers a message m at most once. Furthermore, p ∈ group(m) and m                                   
was supplied to a multicast operation by sender(m). (As with one­to­one communication, messages can                           
always be distinguished by a sequence number relative to their sender.) 
● Validity​: If a correct process multicasts message m, then it will eventually deliver m. 
● Agreement​: If a correct process delivers message m, then all other correct processes in group(m) will                               
eventually deliver m. 
 
Performance: 
● Message complexity: O(g²) 
● Round complexity: O(2) = O(1) 
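
These figures correspond to the standard implementation of R-multicast on top of B-multicast (not spelled out above): when a process B-delivers m for the first time, it re-B-multicasts m to the group (unless it is the original sender) and only then R-delivers it. Since each of the g group members re-sends each message once, O(g²) messages are used, in at most two "hops". A sketch of that per-process logic, with the group membership and delivery hooks as illustrative placeholders: 

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

class ReliableMulticast<M> {
    private final int selfId;
    private final List<Integer> group;
    private final BiConsumer<List<Integer>, M> bMulticast;   // underlying B-multicast(g, m)
    private final Consumer<M> rDeliver;                      // application-level R-deliver
    private final Set<M> received = new HashSet<>();         // messages already R-delivered here
                                                             // (assumes M has sensible equals/hashCode, e.g. a message id)

    ReliableMulticast(int selfId, List<Integer> group,
                      BiConsumer<List<Integer>, M> bMulticast, Consumer<M> rDeliver) {
        this.selfId = selfId; this.group = group;
        this.bMulticast = bMulticast; this.rDeliver = rDeliver;
    }

    // R-multicast(g, m): just B-multicast it; our own copy is R-delivered in onBDeliver.
    void rMulticast(M m) {
        bMulticast.accept(group, m);
    }

    // Called whenever the B-multicast layer B-delivers m at this process.
    void onBDeliver(M m, int senderId) {
        if (received.add(m)) {                    // first time this process sees m
            if (senderId != selfId) {
                bMulticast.accept(group, m);      // re-multicast so agreement holds even if the sender crashed
            }
            rDeliver.accept(m);
        }
    }
}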
 
CONSENSUS​: the problem is for processes to agree on a value after one or more of the processes has                                     
proposed what that value should be. In mutual exclusion, the processes agree on which process can enter the                                   
critical section. In an election, the processes agree on which is the elected process. In totally ordered multicast,                                   
the processes agree on the order of message delivery. Three related agreement problems exist: ​Byzantine                             
generals​, ​interactive consistency and ​totally ordered multicast​. We go on to examine ​under what                           
circumstances the problems can be solved​, and to sketch some solutions. In particular, we discuss the                               
well-known impossibility result of Fischer et al. [1985], which states that in an asynchronous system a collection of processes cannot be guaranteed to reach consensus if even a single process may fail by crashing. 
Our system model includes a collection of processes p​i ( i = 1, 2, …, N ) communicating by message passing.                                         
An important requirement that applies in many practical situations is for consensus to be reached even in the                                   
presence of faults. We assume, as before, that ​communication is reliable but that processes may fail​. In this                                   
section we consider Byzantine (arbitrary) process failures, as well as crash failures. We sometimes specify an                               
assumption that ​up to some number ​f of the ​N processes are faulty – that is, they exhibit some specified types                                         
of fault; the remainder of the processes are correct. 
To reach consensus, ​every process p​i begins in the ​undecided state and proposes a single value ​v​i , drawn                                     
from a set D ( i = 1, 2, … , N ). The processes communicate with one another, exchanging values. Each                                           
process then sets the value of a decision variable, ​d​i . In doing so it enters the decided state, in which it may no                                               
longer change ​d​i​ ( i = 1, 2, … , N ). 
 
 
The ​requirements of a consensus algorithm are that the following conditions should hold for every                             
execution of it: 
● Termination​: Eventually each correct process sets its decision variable. 
● Agreement​: The ​decision value of all correct processes is the same​: if ​p​i and ​p​j are correct and have                                     
entered the decided state, then ​d​i​ = d​j​ ( i, j = 1, 2, … , N ). 
● Integrity​: If the ​correct processes ​all proposed the same value​, then any correct process in the decided                                 
state has chosen that value. 
Consider a system in which processes cannot fail. It is then straightforward to solve consensus. For example,                                 
we can ​collect the processes into a group and ​have each process reliably multicast its proposed value to the                                     
members of the group​. ​Each process waits until it has collected all N values (including its own). It then                                     
evaluates the function ​majority(v​1 , v​2 , … , v​N )​, which returns the value that occurs most often among its                                         
arguments, or the special value ⊥ ∉ D if no majority exists. Termination is guaranteed by the reliability of the 
multicast operation​. ​Agreement and integrity are guaranteed by the definition of majority and the integrity                             
property of a reliable multicast​. Every process receives the same set of proposed values, and every process                                 
evaluates the same function of those values. So they must all agree, and if every process proposed the same                                     
value, then they all decide on this value. 
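
A sketch of the decision step each process runs once it has collected all N proposed values. Optional.empty() stands in for the special value ⊥, and the "value that occurs most often, else ⊥" rule is interpreted here as returning the unique most frequent value; the collection of values via reliable multicast is assumed to have happened already. 

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class MajorityConsensus {
    // majority(v1, ..., vN): the value that occurs most often, or Optional.empty()
    // (standing in for ⊥) when no single value occurs strictly more often than the others.
    static <V> Optional<V> majority(List<V> proposed) {
        Map<V, Integer> counts = new HashMap<>();
        for (V v : proposed) counts.merge(v, 1, Integer::sum);
        V best = null;
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<V, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); tie = false; }
            else if (e.getValue() == bestCount) { tie = true; }
        }
        return tie ? Optional.empty() : Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Every correct process collects the same N values and evaluates the same function,
        // so all of them decide the same value.
        System.out.println(majority(List.of("attack", "attack", "retreat")));  // Optional[attack]
        System.out.println(majority(List.of("attack", "retreat")));            // Optional.empty, i.e. ⊥
    }
}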
If processes can fail in arbitrary (Byzantine) ways​, then faulty processes can in principle communicate random                               
values to the others. This may seem unlikely in practice, but it is not beyond the bounds of possibility for a                                         
process with a bug to fail in this way. Moreover, the fault may not be accidental, but the result of mischievous                                         
or malevolent operation. Someone could deliberately make a process send different values to different peers in                               
an attempt to thwart the others, which are trying to reach consensus. In case of inconsistency, correct                                 
processes must compare what they have received with what other processes claim to have received. 
 
The Byzantine generals problem​: In the informal statement of the Byzantine generals problem [Lamport et al.                               
1982], ​three or more generals are to agree to attack or to retreat​. One, the commander, issues the order. The                                       
others, lieutenants to the commander, must decide whether to attack or retreat. But ​one or more of the                                   
generals may be ‘treacherous’ – that is, faulty. ​If the commander is treacherous, he proposes attacking to one                                   
general and retreating to another​. If a lieutenant is treacherous, he tells one of his peers that the commander                                     
told him to attack and another that they are to retreat. ​The Byzantine generals problem differs from consensus                                   
in that a distinguished process supplies a value that the others are to agree upon, instead of each of them 
proposing a value​. The requirements are: 
● Termination​: Eventually each correct process sets its decision variable. 
● Agreement​: The decision value of all correct processes is the same: if p​i and p​j are correct and have                                     
entered the decided state, then d​i​ = d​j​ ( i, j = 1, 2, … , N ). 
● Integrity: If the commander is correct, then all correct processes decide on the value that the                               
commander proposed. 
Note that, for the Byzantine generals problem, integrity implies agreement when the commander is correct; but                               
the commander need not be correct. 
 
Consensus in a synchronous system​: here is described an algorithm to solve consensus in a synchronous                               
system, although it is ​based on a modified form of the integrity requirement​. The algorithm uses only a basic                                     
multicast protocol. It assumes that ​up to f of the N processes exhibit crash failures​.  
To reach consensus, each correct process collects proposed values from the other processes. The algorithm                             
proceeds in f + 1 rounds, in each of which the correct processes B­multicast the values between themselves. 
Their modified form of the integrity requirement applies to the proposed values of all processes, not just the                                   
correct ones​: if all processes, whether correct or not, proposed the same value, then any correct process in the                                     
decided state would choose that value. Given that the algorithm assumes crash failures at worst, the proposed 
values of correct and non­correct processes would not be expected to differ​, ​at least not on the basis of                                     
failures​. The revised form of integrity enables the convenient use of the minimum function to choose a decision                                   
value from those proposed. 
After f + 1 rounds, each process chooses the minimum value it has received as its decision value. To check the 
correctness of the algorithm, we must show that ​each process arrives at the same set of values at the end of                                         
the final round​.  
Assume, to the contrary, that two processes differ in their final set of values. Without loss of generality, some                                     
correct process p​i possesses a value ​v that another correct process p​j ( i ≠ j ) does not possess. ​The only                                           
explanation for p​i possessing a proposed value ​v at the end that p​j does not possess is that any third process,                                         
p​k , say, that managed to send ​v to p​i crashed before ​v could be delivered to p​j​. In turn, any process sending ​v                                               
in the previous round must have crashed, to explain why p​k possesses ​v in that round but p​j did not receive it.                                           
Proceeding in this way, we have to posit at least one crash in each of the preceding rounds. But we have                                         
assumed that at most f crashes can occur, and there are f + 1 rounds. We have arrived at a contradiction. 
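
A sketch of that round structure for a single process, under the assumptions above (synchronous rounds, B-multicast, at most f crashes). The exchangeRound function is a stand-in of mine for "B-multicast the values learned in the previous round and collect whatever is B-delivered during this round"; the demo in main is the degenerate f = 0, single-process case. 

import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

class SynchronousConsensus {
    // Runs f + 1 rounds and decides the minimum of all values collected.
    static int decide(int ownProposal, int f, Function<Set<Integer>, Set<Integer>> exchangeRound) {
        TreeSet<Integer> values = new TreeSet<>();
        values.add(ownProposal);
        Set<Integer> newValues = Set.of(ownProposal);                // values not yet sent to the others
        for (int round = 1; round <= f + 1; round++) {
            Set<Integer> received = exchangeRound.apply(newValues);  // send the new values, collect the others'
            TreeSet<Integer> notSeenBefore = new TreeSet<>(received);
            notSeenBefore.removeAll(values);
            values.addAll(received);
            newValues = notSeenBefore;                               // only these need to be sent next round
        }
        return values.first();                                       // decide the minimum value received
    }

    public static void main(String[] args) {
        int d = decide(7, 0, newValues -> Set.of());                 // no other processes reply
        System.out.println("decided " + d);                          // prints "decided 7"
    }
}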
It turns out that any algorithm to reach consensus ​despite up to f crash failures requires at least ​f + 1                                         
rounds of message exchanges​, no matter how it is constructed [Dolev and Strong 1983]. This lower                               
bound also applies in the case of Byzantine failures [Fischer and Lynch 1982]. 
 
 
The Byzantine generals problem in a synchronous system​: here we assume that ​processes can exhibit                             
arbitrary failures​. That is, ​a faulty process may send any message with any value at any time; and it may omit                                         
to send any message​. Up to ​f of the N processes may be faulty. Correct processes can detect the absence of a                                           
message through a ​timeout​; but they cannot conclude that the sender has crashed, since it may be silent for                                     
some time and then send messages again. 
We assume that the communication channels between pairs of processes are private. Our default assumption                             
of channel reliability means that ​no faulty process can inject messages into the communication channel                             
between correct processes​. 
Lamport et al. [1982] considered the case of three processes that send unsigned messages to one another.                                 
They showed that there is no solution that guarantees to meet the conditions of the Byzantine generals                                 
problem if one process is allowed to fail. They generalized this result to show that no solution exists if N ≤ 3f.                                           
They went on to give ​an algorithm that solves the Byzantine generals problem in a synchronous system if ​N ≥                                       
3f + 1​ , for unsigned (they call them ‘oral’) messages. 
 
 
 
The algorithm takes account of the fact that a faulty process may omit to send a message. If a correct process                                         
does not receive a message within a suitable time limit (the system is synchronous), it proceeds as though the                                     
faulty process had sent it the value ⊥. 
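
For the smallest solvable case, N = 4 and f = 1, the algorithm reduces to: the commander sends its value to the three lieutenants; each lieutenant relays the value it received to the other two; each lieutenant then decides the majority of the three values it holds, using ⊥ for any message that did not arrive in time. A toy sketch of a lieutenant's decision rule (the method names and the String encoding of orders are illustrative assumptions): 

import java.util.Optional;

class ByzantineLieutenant {
    // Decision rule of a lieutenant in the 4-process, 1-fault case with "oral" messages.
    // Missing messages are passed as Optional.empty(), standing in for the default value ⊥.
    static String decide(Optional<String> fromCommander,
                         Optional<String> relayedByL2,
                         Optional<String> relayedByL3) {
        String a = fromCommander.orElse("⊥");
        String b = relayedByL2.orElse("⊥");
        String c = relayedByL3.orElse("⊥");
        // Majority of the three values; if all three differ, fall back to the default ⊥.
        if (a.equals(b) || a.equals(c)) return a;
        if (b.equals(c)) return b;
        return "⊥";
    }

    public static void main(String[] args) {
        // Treacherous commander: it tells this lieutenant "attack", while the other two
        // lieutenants relay the "retreat" they were told. Every correct lieutenant computes
        // the same majority, so they still agree on a single decision.
        System.out.println(decide(Optional.of("attack"), Optional.of("retreat"), Optional.of("retreat")));
        // prints "retreat"
    }
}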
 
Impossibility in asynchronous systems​: The algorithms seen so far assume that message exchanges take                           
place in rounds, and that processes are entitled to time out and assume that a faulty process has not sent                                       
them a message within the round, because the maximum delay has been exceeded. 
Fischer et al. [1985] proved that ​no algorithm ​can guarantee to reach consensus in an asynchronous system​,                                 
even with one process crash failure ​(​FLP restriction​). ​In an asynchronous system, processes can respond to                               
messages at arbitrary times, so a crashed process is indistinguishable from a slow one. Their proof involves                                 
showing that there is ​always some continuation of the processes’ execution that avoids consensus being                             
reached​. Note the word ‘guarantee’ in the statement of the impossibility result. The result does not mean that                                   
processes can never reach distributed consensus in an asynchronous system if one is faulty. It allows that                                 
consensus can be reached with some probability greater than zero​, confirming what we know in practice. 
Three techniques exist for working around the impossibility result: 
● fault masking​: avoid the impossibility result altogether by masking any process failures that occur; ​if a                               
process crashes, then it is restarted​. ​The process places sufficient information in persistent storage at                             
critical points in its program so that if it should crash and be restarted, it will find sufficient data to be                                         
able to continue correctly with its interrupted task. In other words, ​it will behave like a process that is                                     
correct, but that sometimes takes a long time to perform a processing step. 
● failure detectors​: processes can agree to ​deem a process that has not responded for more than a                                 
bounded time to have failed​. ​An unresponsive process may not really have failed​, but the remaining                               
processes act as if it had done. ​They make the failure ‘fail­silent’ by discarding any subsequent                               
messages that they do in fact receive from a ‘failed’ process​. In other words, we have effectively turned                                   
an asynchronous system into a synchronous one. This technique is used in the ISIS system [Birman                               
1993].  
● randomizing aspects of the processes’ behaviour​: The result of Fischer et al. [1985] depends on                             
what we can consider to be an ‘adversary’. This is a ‘character’ (actually, just a collection of random                                   
events) who can exploit the phenomena of asynchronous systems so as to foil the processes’ attempts                               
to reach consensus. The adversary manipulates the network to delay messages so that they arrive at                               
just the wrong time, and similarly it slows down or speeds up the processes just enough so that they                                     
are in the ‘wrong’ state when they receive a message. This technique that addresses the impossibility                               
result is to ​introduce an element of chance in the processes’ behaviour​, so that the adversary cannot                                 
exercise its thwarting strategy effectively. ​Consensus might still not be reached in some cases, but this                               
method enables processes to ​reach consensus in a finite expected time​. 
 
 
Technologies for Cloud Computing (Lettieri) 
 
HW assisted Virtualization 
Popek and Goldberg’s 1974 paper establishes ​three essential characteristics for system software to be                           
considered a VMM​: 
1. Fidelity​. Software on the VMM executes identically to its execution on hardware, barring timing effects. 
2. Performance​. An overwhelming majority of guest instructions are executed by the hardware without                         
the intervention of the VMM. 
3. Safety​. The VMM manages all hardware resources. 
 
We shall use the term ​classically virtualizable to describe an architecture that can be virtualized purely with                                 
trap­and­emulate. In this sense, x86 is not classically virtualizable, but it is virtualizable by Popek and                               
Goldberg’s criteria. The Pentium instruction set contains ​sensitive, unprivileged instructions​. ​The processor will                         
execute unprivileged, sensitive instructions without generating an interrupt or exception. Thus, a VMM will                           
never have the opportunity to simulate the effect of the instruction. The x86 architecture has four modes of operation, known as 
rings​, or current privilege level (CPL), 0 through 3. Ring 0, the most privileged, is occupied by operating                                   
systems. Application programs execute in Ring 3, the least privileged. The Pentium also has ​a method to                                 
control transfer of program execution between privilege levels so that nonprivileged tasks can call privileged                             
system routines: the ​call gate​. In the Intel architecture, privileged instructions cause a general protection                             
exception if the CPL is not equal to 0 (and this is a requirement for a VMM to work properly). It uses interrupts                                             
and traps to redirect program execution and allow interrupt and exception handlers to execute when a                               
privileged instruction is executed by an unprivileged task. 
It was found that seventeen instructions do not respect this requirement: the SGDT, SIDT and SLDT 
instructions are examples (PUSHF and POPF too). These instructions are normally only used by operating 
systems but ​are not privileged in the Intel architecture​. Since the ​Intel processor only has one LDTR, IDTR and                                     
GDTR​, ​a problem arises when multiple operating systems try to use the same registers​. Although ​these                               
instructions do not protect the sensitive registers from ​reading by unprivileged software​, the processor allows                             
partial protection for these registers by only allowing tasks at CPL 0 to load the registers. This means that ​if a                                         
VM tries to write to one of these registers, a trap will be generated​. The trap allows a VMM to produce the                                           
expected result for the VM. However, ​if an OS in a VM uses SGDT, SLDT, or SIDT to reference the contents of                                           
the IDTR, LDTR, or GDTR​, ​the register contents that are applicable to the host OS will be given​. ​This could                                       
cause a problem if an operating system of a virtual machine ​tries to use these values for its own operations​: ​it                                         
might see the state of a different VMOS executing within a VM running on the same VMM​. Therefore, ​a                                     
VMM must provide each VM with its own virtual set of IDTR, LDTR, and GDTR registers​. 
 
The most important ideas from classical VMM implementations are ​de­privileging​, ​shadow structures and                         
traces​. 
 
● De­privileging​: all instructions that read or write privileged state can be made to trap when executed in                                 
an unprivileged context. A classical VMM executes guest operating systems directly, but at a reduced                             
privilege level. The VMM intercepts traps from the de­privileged guest, and emulates the trapping                           
instruction against the virtual machine state. 
● Primary and shadow structures​: By definition, the privileged state of a virtual system differs from that                               
of the underlying hardware. To fill this gap, the VMM derives shadow structures from guest­level                             
primary structures. On­CPU privileged state, such as the page table pointer register or processor status                             
register, is handled trivially: ​the VMM maintains an image of the guest register, and refers to that image                                   
in instruction emulation as guest operations trap​. Off­CPU privileged data, such as page tables, may                             
reside in memory. In this case, guest accesses to the privileged state may not naturally coincide with                                 
trapping instructions. Dependencies on this privileged state are not accompanied by traps: every guest                           
virtual memory reference depends on the permissions and mappings encoded in the corresponding                         
PTE. ​Such in­memory privileged state can be modified by any store in the guest instruction stream​, ​or                                 
even implicitly modified as a side effect of a DMA I/O operation​. Memory­mapped I/O devices present a                                 
similar difficulty: reads and writes to this privileged data can originate from almost any memory                             
operation in the guest instruction stream. 
● Traces​: To protect the host from guest memory accesses, VMMs typically construct ​shadow page                           
tables in which to run the guest. ​x86 specifies hierarchical hardware­walked page tables having 2, 3 or                                 
4 levels​. The hardware page table pointer is control register ​%cr3​. The VMM distinguishes ​true page                               
faults​, caused by violations of the protection policy encoded in the guest PTEs, from ​hidden page faults​,                                 
caused by misses in the shadow page table. ​True faults are forwarded to the guest; hidden faults cause                                   
the VMM to construct an appropriate shadow PTE, and resume guest execution. The fault is “hidden”                               
because it has no guest­visible effect. 
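
A purely illustrative toy, in Java for consistency with the rest of these notes, of the true-fault/hidden-fault distinction: page tables are reduced to plain maps, and the real permission bits, multi-level walks and %cr3 handling are far richer than this sketch. 

import java.util.HashMap;
import java.util.Map;

class ShadowPageTableSketch {
    final Map<Long, Long> guestPageTable = new HashMap<>();   // guest virtual page -> guest "physical" frame
    final Map<Long, Long> shadowPageTable = new HashMap<>();  // what the hardware actually walks

    // Called by the VMM when the hardware faults on virtual page 'vpn' while running the guest.
    String handlePageFault(long vpn) {
        Long guestFrame = guestPageTable.get(vpn);
        if (guestFrame == null) {
            // True fault: the guest's own PTEs forbid the access, so the guest OS must handle it.
            return "true fault: forward #PF to the guest OS";
        }
        // Hidden fault: the guest mapping exists but the shadow entry was missing; fix it up and resume.
        shadowPageTable.put(vpn, translateToHostFrame(guestFrame));
        return "hidden fault: shadow PTE constructed, resume guest (no guest-visible effect)";
    }

    long translateToHostFrame(long guestFrame) { return guestFrame + 0x1000; }   // fake host mapping
}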
 
x86 obstacles to virtualization 
Ignoring the legacy “real” and “virtual 8086” modes of x86, even the more recently architected 32­ and 64­bit                                   
protected modes are not classically virtualizable: 
● Visibility of privileged state​. The guest can observe that it has been deprivileged when it reads its                                 
code segment selector (%cs) since the current privilege level (CPL) is stored in the low two bits of %cs. 
● Lack of traps when privileged instructions run at user­level​. For example, in privileged code popf                             
(“pop flags”) may change both ALU flags (e.g., ZF) and system flags (e.g., IF, which controls interrupt                                 
delivery). For a deprivileged guest, we need kernel mode popf to trap so that the VMM can emulate it                                     
against the virtual IF. Unfortunately, a deprivileged popf, like any user­mode popf, simply suppresses                           
attempts to modify IF; no trap happens. 
 
Simple binary translation 
VMWare software VMM uses a translator with these properties: 
● Binary​. Input is binary x86 code, not source code. 
● Dynamic​. Translation happens at runtime, interleaved with execution of the generated code. 
● On demand​. Code is translated only when it is about to execute. This laziness side­steps the problem                                 
of telling code and data apart. 
● Adaptive​. Translated code is adjusted in response to guest behavior changes to improve overall                           
efficiency. 
The translator does not attempt to “improve” the translated code. We assume that if guest code is performance                                   
critical, the OS developers have optimized it and a simple binary translator would find few remaining                               
opportunities. 
 
x86 Architecture Extension 
The hardware exports ​a number of ​new primitives to support a classical VMM for the x86​. An ​in­memory data                                     
structure​, which we will refer to as the ​virtual machine control block​, or ​VMCB​, combines ​control state with                                   
a subset of the state of a guest virtual CPU​. ​A new, less privileged execution mode​, ​guest mode​, ​supports                                     
direct execution of guest code​, ​including privileged code​. We refer to the previously architected x86                             
execution environment as ​host mode​. ​A new instruction​, ​vmrun​, ​transfers from host to guest mode​. Upon                               
execution of vmrun, the hardware loads guest state from the VMCB and continues execution in guest mode.                                 
Guest execution proceeds until some condition, expressed by the VMM using control bits of the VMCB, is                                 
reached. At this point, the hardware performs an exit operation, which is the inverse of a vmrun operation. On                                     
exit, the hardware saves guest state to the VMCB, loads VMM­supplied state into the hardware, and resumes                                 
in host mode, now executing the VMM. 
Diagnostic fields in the VMCB aid the VMM in handling the exit; e.g., exits due to guest I/O provide the port,                                         
width, and direction of I/O operation. After emulating the effect of the exiting operation in the VMCB, the VMM                                     
again executes vmrun, returning to guest mode. 
The hardware extensions provide a complete virtualization solution, essentially prescribing the structure of our                           
hardware VMM (or indeed any VMM using the extensions). When running a protected mode guest, the VMM                                 
fills in a VMCB with the current guest state and executes vmrun. On guest exits, the VMM reads the VMCB                                       
fields describing the conditions for the exit, and vectors to appropriate emulation code. 
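
A purely illustrative model, written in Java only for consistency with these notes, of the host-mode control loop that the extensions prescribe; the VMCB fields, exit reasons and emulation routines below are invented toy stand-ins, not the real AMD/Intel interface. 

class HardwareVmmSketch {
    enum ExitReason { IO, CR_WRITE, HLT }

    static class Vmcb {                     // toy "virtual machine control block"
        ExitReason exitReason;
        int ioPort;                         // diagnostic fields for an I/O exit
        boolean ioWrite;
        long guestRip;                      // a bit of guest CPU state
    }

    // Stand-in for the vmrun instruction: load guest state, run until some exit condition,
    // save guest state and diagnostics back into the VMCB. Here it just fabricates an I/O exit.
    static void vmrun(Vmcb vmcb) {
        vmcb.exitReason = ExitReason.IO;
        vmcb.ioPort = 0x3F8;
        vmcb.ioWrite = true;
        vmcb.guestRip += 2;                 // pretend the guest advanced past an OUT instruction
    }

    public static void main(String[] args) {
        Vmcb vmcb = new Vmcb();             // the VMM fills this in with the current guest state
        for (int i = 0; i < 3; i++) {       // the VMM's main loop
            vmrun(vmcb);                    // enter guest mode; control returns here on exit
            switch (vmcb.exitReason) {      // vector to the appropriate emulation code
                case IO:
                    System.out.println("emulate I/O, port 0x" + Integer.toHexString(vmcb.ioPort)
                            + (vmcb.ioWrite ? " (write)" : " (read)"));
                    break;
                case CR_WRITE:
                    System.out.println("emulate control-register write");
                    break;
                case HLT:
                    System.out.println("guest halted");
                    break;
            }
            // After emulating the exiting operation against the VMCB, loop back to vmrun.
        }
    }
}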

Notes about concurrent and distributed systems & x86 virtualization

  • 1.
    Concurrent and Distributed Systems (Bechini)  MUTUAL EXCLUSION    Volatile​: any writeto a volatile variable establishes a happens­before relationship with subsequent reads of                              that same variable. This means that changes to a volatile variable are always visible to other threads. What's                                    more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the                                          volatile, but also the side effects of the code that led up the change.  Reads and writes are ​atomic​ for ​all​ variables​ declared volatile (​including​ long and double variables).    Mutual Exclusion desired properties​:  1. ME guaranteed in any case  2. A process out of the Critical Section MUST NOT prevent any other to access  3. No deadlock  4. No busy waiting  5. No starvation    Deadlock​: a situation in which two or more competing actions are each waiting for the other to finish, and thus                                        neither ever does.  A deadlockers situation can arise if all of the following conditions hold simultaneously in a system:​[1]  1. Mutual Exclusion​: At least one resource must be held in a non­shareable mode.​[1] Only one process                                can use the resource at any given instant of time.  2. Hold and Wait or ​Resource Holding: A process is currently holding at least one resource and                                requesting additional resources which are being held by other processes.  3. No​ ​Preemption​:​ a resource can be released only voluntarily by the process holding it.  4. Circular Wait​: A process must be waiting for a resource which is being held by another process, which                                    in turn is waiting for the first process to release the resource. In general, there is a ​set of waiting                                        processes, P = {P​1​, P​2​, ..., P​N​}, such that P​1 is waiting for a resource held by P​2​, P​2 is waiting for a                                              resource held by P​3​ and so on until P​N​ is waiting for a resource held by P​1​.​[1]​[7]  These four conditions are known as the ​Coffman conditions from their first description in a 1971 article by                                    Edward G. Coffman, Jr.​[7] Unfulfillment of any of these conditions is enough to preclude a deadlock from                                  occurring.    Busy waiting​: In ​software engineering​, ​busy­waiting or ​spinning is a technique in which a ​process                              repeatedly checks to see if a condition is true, such as whether ​keyboard input or a ​lock is available. In                                        low­level programming, busy­waits may actually be desirable. It may not be desirable or practical to implement                                interrupt­driven processing for every hardware device, particularly those that are seldom accessed.    Sleeping lock​: as opposite to a spinning lock, this technique puts a thread waiting for accessing a resource in                                      sleeping/ready mode. So, a thread is paused and its execution stops. The CPU can perform a context switch                                    and can keep working on another process/thread. This saves some CPU time (cycles) that would be wasted by                                   
  • 2.
    a spinning lockimplementation. Anyway, also sleeping locks have a time overhead that should be taken into                                  account when evaluating which solution to adopt.    Starvation​: a ​process is perpetually denied necessary ​resources to proceeds its work.​[1] Starvation may be                              caused by errors in a scheduling or ​mutual exclusion algorithm, but can also be caused by ​resource leaks​, and                                      can be intentionally caused via a​ ​denial­of­service attack​ such as a​ ​fork bomb​.    Race Condition​: (or ​race hazard​) is the behavior of an electronic, software or other ​system where the output                                    is dependent on the sequence or timing of other uncontrollable events. It becomes a ​bug when events do not                                      happen in the order the programmer intended. The term originates with the idea of two ​signals racing each                                    other to influence the output first. Race conditions arise in software when an application depends on the                                  sequence or timing of​ ​processes​ or​ ​threads​ for it to operate properly.    Dekker’s alg.​: the first known correct solution to the​ ​mutual exclusion​ problem in​ ​concurrent programming​.    Dekker's algorithm guarantees​ ​mutual exclusion​, freedom from​ ​deadlock​, and freedom from​ ​starvation​.  One advantage of this algorithm is that it doesn't require special ​Test­and­set (atomic read/modify/write)                            instructions and is therefore highly portable between languages and machine architectures. One disadvantage                          is that it is ​limited to two processes ​and makes use of ​busy waiting instead of process suspension. (The use of                                          busy waiting suggests that processes should spend a minimum of time inside the critical section.)  This algorithm won't work on​ ​SMP​ machines equipped with these CPUs without the use of​ ​memory barriers​.   
  • 3.
    Peterson’s alg.​: a​concurrent programming ​algorithm for ​mutual exclusion that allows two processes to share                              a single­use resource without conflict, using only shared memory for communication.    The algorithm uses two variables, ​flag and ​turn​. A ​flag[n] value of ​true indicates that the process ​n wants to                                        enter the ​critical section​. Entrance to the critical section is granted for process P0 if P1 does not want to enter                                          its critical section or if P1 has given priority to P0 by setting ​turn​ to 0.    The algorithm does satisfy the three essential criteria to solve the critical section problem, provided that                                changes to the variables turn, flag[0], and flag[1] propagate immediately and atomically. The while condition                              works even with preemption.​[1]    Filter alg.: generalization to N>2 of Peterson’s alg      HW support to ME​: an operation (or set of operations) is ​atomic​, ​linearizable​, ​indivisible or ​uninterruptible                                if it appears to the rest of the system to occur instantaneously. Atomicity is a guarantee of ​isolation from                                      concurrent processes​. Additionally, atomic operations commonly have a ​succeed­or­fail definition — they                        either successfully change the state of the system, or have no apparent effect.  ● Test_and_Set​: ​reads a variable’s value, copies it to store the old value, ​modifies it with the new value                                    and ​writes the old value returning it. ​Maurice Herlihy (1991) proved that test­and­set has a finite                                consensus number​, in contrast to the ​compare­and­swap operation. The test­and­set operation can                        solve the wait­free ​consensus problem for no more than two concurrent processes.​[1] However, more                           
  • 4.
    than two decadesbefore Herlihy's proof, ​IBM had already replaced Test­and­set by                        Compare­and­swap, which is a more general solution to this problem. It may suffer of ​starvation​. (from                                notes: in a multiprocessor system it is not enough alone to guarantee ME, it is such that interrupts must                                      be disabled).      ● Compare_and_Swap (CAS)​: an ​atomic ​instruction used in ​multithreading to achieve ​synchronization​.                      It compares the contents of a memory location to a given value and, only if they are the same, modifies                                        the contents of that memory location to a given new value. This is done as a single atomic operation. ​It                                        might suffer ​starvation​. On server­grade multi­processor architectures of the 2010s, compare­and­swap                      is relatively cheap relative to a simple load that is not served from cache. A 2013 paper points out that a                                          CAS is only 1.15 times more expensive that a non­cached load on Intel Xeon (​Westmere­EX​) and 1.35                                  times on AMD ​Opteron (Magny­Cours).​[6] As of 2013, most ​multiprocessor architectures support CAS in                            hardware. As of 2013, ​the compare­and­swap operation is the most popular ​synchronization primitive                          for implementing both lock­based and non­blocking​ ​concurrent data structures​.​[4]  
  • 5.
        ● Fetch_and_Add​: aspecial instruction that ​atomically modifies the contents of a memory location. It                            fetches ​(copies) the old value of the parameter, ​adds to the parameter the value given by the second                                    parameter and returns the old value. ​Maurice Herlihy (1991) proved that fetch­and­add has a finite                              consensus number, in contrast to the ​compare­and­swap operation. ​The fetch­and­add operation can                        solve the wait­free consensus problem for no more than two concurrent processes​.​[1]       
  • 6.
        Guidelines​:  ● do not use ​synchronized​ blocks if not strictly necessary  ● prefer ​atomic variables​ already implemented  ●when there is more than one variable to share, wrap critical sections with ​lock/unlock  ● synchronized ​only when there is a write operation    In Java exist three main mechanisms for ME: ​atomic variables, ​implicit locks​ and ​explicit locks​ (​reference​).    Condition Variables​: a threads’ synchronization mechanism. A thread can suspend its execution while waiting                            for a specific condition to happen. CVs do not guarantee ME on shared resources, so they have to be used in                                          conjunction with mutexes.  A thread can perform on a CV two operations:  ● wait​: ConditionVariable c; c.wait(); = a thread is suspended while waiting for a signal to resume it  ● signal​: c. signal(); = a thread exiting the CS signals to a waiting thread in the waiting queue to resume                                        its execution. The awaken thread must check again that the condition it was waiting for has actually                                  happened  ● signalAll​: c.signalAll(); = wakes up all the waiting threads.  These three methods must be invoked inside a critical section.   
  • 7.
    Java Monitor​: injava, every object (Object) has three methods, namely ​wait​, ​notify and ​notifyAll​, performing                                the same as the three previous ones. So, it is possible to implement a mechanism completely similar to                                    condition variables by using these three methods. By dealing with Objects, in order to guarantee ME it is                                    necessary to use the ​synchronized​ keyword.  Java ReentrantLock ​(​oracle reference​): A reentrant mutual exclusion ​Lock with the same basic behavior and                              semantics as the implicit monitor lock accessed using synchronized methods and statements, but with                            extended capabilities. A ReentrantLock is ​owned by the thread last successfully locking, but not yet unlocking                                it. A thread invoking lock will return, successfully acquiring the lock, when the lock is not owned by another                                      thread. The method will return immediately if the current thread already owns the lock. The constructor for this                                    class accepts an optional ​fairness parameter. When set true, under contention, locks favor granting access to                                the longest­waiting thread. Otherwise this lock does not guarantee any particular access order. Programs                            using fair locks accessed by many threads may display lower overall ​throughput (i.e., are slower; often much                                  slower) than those using the default setting, but have smaller ​variances in times to obtain locks and guarantee                                    lack of starvation​.  It is recommended practice to ​always​ immediately follow a call to lock with a try block:    Since a CV must always be used together with a mutex, the ​ReentrantLock gives the method ​newCondition()                                  to obtain a CV, so that when performing a ​wait on this CV the lock used to create the same condition will be                                              released and the thread suspended.  The followings are the most relevant methods of ReentrantLock:  ● lock()  ● tryLock()​: Acquires the lock only if it is not held by another thread at the time of invocation.  ● unlock()  ● newCondition()  There is also a method that permit to specify a timeout to observe while trying to lock: ​tryLock (long timeout,                                        TimeUnit​ unit)    Barging​: usually locks are starvation free, because they use queues to put suspended threads that are FIFO,                                  so every single thread entering the queue will acquire the lock with a bounded time. There are such situation                                      when the JVM performs some performance optimization: since the suspend/resume mechanism requires                       
  • 8.
    overhead and wasteof time, when there is contention for the same lock the JVM might wake up a thread                                        already executing instead of a sleeping one. This process is called ​barging​.    Monitor ​(​oracle reference​): mutex + condition variable; a synchronization construct that allows ​threads to have                              both ​mutual exclusion and the ability to wait (block) for a certain condition to become true. Only one process at                                        a time is able to enter the monitor. In other words, a monitor is a mechanism that associate to a given data                                            structure a set of procedures/operations that are the only ones allowed. ​Each procedure/operation is mutually                              exclusive: only one process/thread can access the monitor at a time​.  Two variants of monitor exist:  ● Hoare’s m.​: blocking condition variable; ​signal­and­wait​: There are 3 kind of queues: ​enter (threads                            aiming to enter the monitor), ​urgent (threads who left the monitor lock to those ones that were already                                    waiting on the condition variable) and ​condition (a queue for each CV). When a thread executes a                                  signal, it auto­suspends, moves to the urgent queue and wakes up the first thread waiting in the waiting                                    queue of that variable. Threads waiting in the urgent queue will be awaken before the ones in the enter                                      queue.  ● MESA​: non­blocking CV; ​signal­and­continue​; a signal does not suspend the invoking thread, but                          makes a thread waiting on the waiting queue of that CV move to the entering queue. The signalling                                    thread leaves the monitor continuing with its execution.  Semaphore​: another mechanism for ME; in practice, it is a shared integer variable ​p>=0 ​on which increments                                  and decrements are  performed by means of ​two atomic operations​:  ● V ​(​verhoog ​= increment, means ​signal​): used when a thread is exiting a portion of code.  ● P​ (​prolaag​ = try, means ​wait​): used when a thread wants to enter a CS.  In practice, a semaphore allow ​n threads to “enter” a specific portion of code, just by setting up the initial value                                          of p to n.    Readers and Writers problem​: n threads want to read and/or write from/to a shared variable, namely a buffer.                                    ME is required since writes to the same memory must be synchronized.  In Java exist the ​ReentrantReadWriteLock class: maintains a pair of associated ​locks​, one for read­only                              operations and one for writing; also keeps the capabilities of a ReentrantLock; in particular, a lock downgrading                                  is possible: a thread holding the write lock might obtain a reentrant lock on the read lock.  Another solution to ME for the readers and writers problem is the synchronization of all the methods in a class                                        containing a shared ​collection​, so that the class actually wraps the shared variable and has ME guaranteed.  Concurrent collections​ (​oracle reference​): classes that contains thread­safe data structures. 
They can be:  ● blocking​: read and write operations (take and put) wait for the data structure to be non­empty or                                  non­full.  ● non­blocking  CopyOnWriteArrayList (​oracle reference​): A thread­safe variant of ​ArrayList in which all mutative operations                          (add, set, and so on) are implemented by making a fresh copy of the underlying array. Ordinarily too costly, but                                        may be ​more efficient than alternatives when traversal operations vastly outnumber mutations, and is useful                              when you cannot or don't want to synchronize traversals, yet need to preclude interference among concurrent                                threads.  Dequeue​: “​double end queue​”, a queue where an element can be put either from the head or the tail of the                                          queue. 
  • 9.
    Work stealing​: supposethere are m producers and n consumers; each producer writes to a shared buffer and                                    each consumer has its own queue where the work flows from the buffer (according to a specific policy for                                      example). It might happen that a consumer is observing its queue growing because it is not able to consume                                      all the work. In this situation, any other consumer that might observe an empty queue, might “steal” work from                                      the queue of that other consumer.  JVM Memory​: is organized in three main areas:  ● method​: for each class, methods’ code, attributes and field values are stored  ● heap​: all the instances are stored here  ● thread stack​: for each thread, a data structure is placed here, containing methods’ stack, PC register,                                etc...      Tasks and Executors​: the way to manage threads execution, in terms of starting, stopping and resuming a                                  thread    Executor​: An object that executes submitted ​Runnable tasks. This interface provides a way of decoupling task                                submission from the mechanics of how each task will be run, including details of thread use, scheduling, etc.                                    An Executor is normally used instead of explicitly creating threads.  void ​execute​(​Runnable​ command)    Runnable​: implemented by any class whose instances are intended to be executed by a thread. The class                                  must define a ​method of no arguments called ​run​.    Callable​: A task that ​returns a result and may throw an exception​. Implementors define a ​single method with                                    no arguments called call​. The Callable interface is similar to ​Runnable​, in that both are designed for classes                                   
  • 10.
    whose instances arepotentially executed by another thread. A Runnable, however, does not return a result                                and cannot throw a ​checked exception​.  Future​: represents the result of an asynchronous computation. Methods are provided to check if the                              computation is complete, to wait for its completion, and to retrieve the result of the computation. The result can                                      only be retrieved using method get when the computation has completed, blocking if necessary until it is ready.                                    Cancellation is performed by the cancel method. Additional methods are provided to determine if the task                                completed normally or was cancelled. Once a computation has completed, the computation cannot be                            cancelled. If you would like to use a Future for the sake of cancellability but not provide a usable result, you                                          can declare types of the form Future<?> and return null as a result of the underlying task.   boolean ​cancel​(boolean mayInterruptIfRunning)  boolean ​isCancelled​()  boolean ​isDone​()  V​ ​get​() throws​ ​InterruptedException​, ​ExecutionException  V​ ​get​(long timeout, ​TimeUnit​ unit) throws​ ​InterruptedException​, ​ExecutionException​, ​TimeoutException      ExecutorService​: An ​Executor that provides methods to manage termination and methods that can produce a                              Future for tracking progress of one or more asynchronous tasks. An ExecutorService can be shut down, which                                  will cause it to reject new tasks. Two different methods are provided for shutting down an ExecutorService. The                                    shutdown() method will allow previously submitted tasks to execute before terminating, while the                          shutdownNow() method prevents waiting tasks from starting and attempts to stop currently executing tasks.                            Upon termination, an executor has no tasks actively executing, no tasks awaiting execution, and no new tasks                                  can be submitted.  void ​shutdown​()  List​<​Runnable​> ​shutdownNow​()  <T> ​Future​<T> ​submit​(​Callable​<T> task)  Future​<?> ​submit​(​Runnable​ task)    ThreadPoolExecutor​: An ​ExecutorService that executes each submitted task using one of possibly several                          pooled threads, normally configured using​ ​Executors​ factory methods.  Thread pools address two different problems: they usually provide improved performance when executing                          large numbers of asynchronous tasks, due to reduced per­task invocation overhead, and they provide a means                                of bounding and managing the resources, including threads, consumed when executing a collection of tasks.                              Each ThreadPoolExecutor also maintains some basic statistics, such as the number of completed tasks.    ScheduledThreadPoolExecutor​: A ​ThreadPoolExecutor that can additionally schedule commands to run                    after a given delay, or to execute periodically.  
public ScheduledThreadPoolExecutor(int corePoolSize)
public ScheduledFuture<?> schedule(Runnable command, long delay, TimeUnit unit)
public <V> ScheduledFuture<V> schedule(Callable<V> callable, long delay, TimeUnit unit)
public ScheduledFuture<?> scheduleAtFixedRate(Runnable command, long initialDelay, long period, TimeUnit unit)
public ScheduledFuture<?> scheduleWithFixedDelay(Runnable command, long initialDelay, long delay, TimeUnit unit)
public void execute(Runnable command)
public Future<?> submit(Runnable task)
public <T> Future<T> submit(Callable<T> task)
public void shutdown()
public List<Runnable> shutdownNow()

Task lifecycle: (figure)

Executor lifecycle: (figure)

Number of threads in the pool:

N_pool = N_CPU ∙ U ∙ (T_COMPUTE + T_WAIT) / T_COMPUTE + 1

where N_CPU is the number of CPUs, U is the target CPU utilization, and the +1 is there for safety: it might happen that all the necessary threads block and no more threads could then enter the pool.
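To see how these pieces fit together, here is a minimal sketch (not from the course material) that sizes a fixed pool with the formula above, submits a Callable through an ExecutorService and collects its result via a Future; the utilization and the per-task times are purely illustrative.

import java.util.concurrent.*;

public class ExecutorDemo {
    public static void main(String[] args) throws Exception {
        // Pool sizing as in the formula above: N_CPU * U * (T_COMPUTE + T_WAIT)/T_COMPUTE + 1
        int nCpu = Runtime.getRuntime().availableProcessors();
        double u = 0.8;                      // target CPU utilization (illustrative)
        double wait = 50, compute = 5;       // illustrative per-task times (ms)
        int poolSize = (int) (nCpu * u * ((compute + wait) / compute)) + 1;

        ExecutorService pool = Executors.newFixedThreadPool(poolSize);

        // A Callable returns a result and may throw a checked exception.
        Callable<Integer> task = () -> {
            Thread.sleep((long) wait);       // simulated I/O wait
            return 42;
        };

        Future<Integer> future = pool.submit(task);   // submission decoupled from execution
        Integer result = future.get();                // blocks until the result is ready
        System.out.println("result = " + result);

        pool.shutdown();                              // previously submitted tasks still run
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}

The same pool could be replaced by a ScheduledThreadPoolExecutor when the task has to run after a delay or periodically, using the schedule/scheduleAtFixedRate methods listed above.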
Deadlock with nested monitors: suppose two nested monitors: a thread enters the outer monitor, then enters the inner one and waits on the inner CV, releasing the inner lock but still holding the outer one. If a second thread now wants to enter the outer monitor (for example, to produce the signal that the first thread is waiting for), it will never manage to do so, because the outer lock is still held by the first thread, which in turn stays blocked on the inner CV, which will never receive any signal. So, this situation leads to a deadlock.

Deadlock with executor and monitor (Thread Starving Deadlock): suppose we have an Executor with a thread pool whose size is N. Suppose that all the threads encounter the same CV and they all wait on it. If there is no other thread available in the pool, then no one will be able to wake up the waiting threads, thus resulting in a deadlock.

Wait-for-graph: a directed graph used for deadlock detection in operating systems and relational database systems.

See Coffman.

Memory barrier: a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction; necessary because most modern CPUs employ performance optimizations that can result in out-of-order execution.

Java Memory Model (see the Java Memory Model paper):
● Program order rule: actions in a thread are performed in their coding order.
● Monitor lock rule: an unlock of a monitor happens-before any subsequent lock on the same monitor.
● Volatile variable rule: a write to a volatile variable happens-before any subsequent read of the same variable; the same holds for atomic variables.
● Thread start rule: a call to Thread.start happens-before any action in the started thread.
● Thread termination rule: any action in a thread happens-before any other thread detects that it has terminated, either by a successful return from Thread.join or by Thread.isAlive returning false.
● Interruption rule: a call to interrupt for a thread happens-before the interrupted thread detects the interrupt.
● Finalizer rule: the end of a constructor for an object happens-before the start of its finalizer.
● Transitivity: if A happens-before B and B happens-before C, then A happens-before C.

Performance
With Java, in order to perform accurate performance measurements it is important to know exactly how the code is run by the JVM. An important component is the Just In Time (JIT) compilation engine: this module performs dynamic compilation during the execution of a program – at run time – rather than prior to execution.[1] Most often this consists of translation to machine code, which is then executed directly, but can also refer to translation to another format.
It allows ​adaptive optimization such as ​dynamic recompilation – thus                                  in theory JIT compilation can yield faster execution than static compilation. Interpretation and JIT compilation                             
are particularly suited for dynamic programming languages, as the runtime system can handle late-bound data types and enforce security guarantees.
Evaluating the performance of Java source code requires collecting statistics about its execution times. In particular, when evaluating multi-threaded software, it is important to have a way to synchronize the threads' starting times: all the threads have to start at the same time so that the measurements are faithful to reality. Java supplies a specific class for this task.

CountDownLatch (oracle reference): a synchronization aid that allows one or more threads to wait until a set of operations being performed in other threads completes.
A CountDownLatch is initialized with a given count. The await methods block until the current count reaches zero due to invocations of the countDown() method, after which all waiting threads are released and any subsequent invocations of await return immediately. This is a one-shot phenomenon – the count cannot be reset. If you need a version that resets the count, consider using a CyclicBarrier.
A CountDownLatch is a versatile synchronization tool and can be used for a number of purposes. A CountDownLatch initialized with a count of one serves as a simple on/off latch, or gate: all threads invoking await wait at the gate until it is opened by a thread invoking countDown(). A CountDownLatch initialized to N can be used to make one thread wait until N threads have completed some action, or some action has been completed N times.
public CountDownLatch(int count)
public void await() throws InterruptedException
public void countDown()

CyclicBarrier (oracle reference): a synchronization aid that allows a set of threads to all wait for each other to reach a common barrier point. CyclicBarriers are useful in programs involving a fixed sized party of threads that must occasionally wait for each other. The barrier is called cyclic because it can be re-used after the waiting threads are released.
public CyclicBarrier(int parties, Runnable barrierAction)
public int await() throws InterruptedException, BrokenBarrierException

Performance in synchronization
Synchronization impacts performance because of:
● context switches
● memory synchronization
● thread synchronization
Considering our code, one of its performance indexes is throughput, in terms of the number of threads entering and leaving it in a given time. Here Little's Law applies:

L = λ ∙ w

where, in our case, L represents the number of threads in our system, i.e. the number of threads waiting to execute our portion of code (waiting because of synchronization), λ is the arrival rate and w the delay of the system. What we want to do is minimize L.
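As a concrete illustration of the start-gate idea described above, here is a minimal timing harness built on two CountDownLatches (a start gate and an end gate); the class and method names are illustrative and not part of the course notes.

import java.util.concurrent.CountDownLatch;

public class TimingHarness {
    // Measures the time for nThreads to run task concurrently, all starting together.
    public static long timeTasks(int nThreads, Runnable task) throws InterruptedException {
        CountDownLatch startGate = new CountDownLatch(1);
        CountDownLatch endGate = new CountDownLatch(nThreads);
        for (int i = 0; i < nThreads; i++) {
            new Thread(() -> {
                try {
                    startGate.await();            // wait until the gate is opened
                    try {
                        task.run();
                    } finally {
                        endGate.countDown();      // signal completion
                    }
                } catch (InterruptedException ignored) { }
            }).start();
        }
        long start = System.nanoTime();
        startGate.countDown();                    // open the gate: all threads start together
        endGate.await();                          // wait until every thread has finished
        return System.nanoTime() - start;
    }
}

Dividing nThreads by the measured elapsed time gives a throughput estimate that, at steady state, plays the role of λ in Little's Law.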
Possible solutions to reduce L:
● CS shrinking: reduce the size of the critical section so that a thread does not spend too much time in it.
● CS splitting: split a CS into smaller ones, i.e. perform lock splitting.
● JVM optimizations:
○ Lock coarsening: whenever the JVM observes that a thread moves from one waiting queue to another (because of a chain of locks) always in the same sequential order, it collapses all the CSs into a single one, thus merging the locks into a single one. This leads to performance improvements in terms of waiting time, so improving the throughput.
○ Lock elision: if a CS is always executed by the same thread, then it is not necessary to protect it with a lock: so the JVM removes it.
● Lock granularity: when the shared data structure is wide, it is important to put locks only where needed, and not on the entire data structure. For example, a hashMap could be so wide that threads would concurrently access different parts of it, thus not violating any ME constraint. So, locking the entire table would penalize the performance of our software. A solution is to have more locks, distributed over the whole data structure (lock striping). Suppose N_L = number of locks and N = table size; the locks are then distributed over the table so that the entry with index i is protected by lock i % N_L. N_L should be dimensioned so that the conflict probability is minimized.
● Non-blocking algorithms: algorithms where failure or suspension of any thread cannot cause failure or suspension of another thread[1] (use of volatile and atomic variables).
○ Wait-free: the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.[11]
○ Lock-free: allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if, when the program threads are run sufficiently long, at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free. An algorithm is lock-free if every operation has a bound on the number of steps before one of the threads operating on a data structure completes its operation.[11]
○ Obstruction-free: the weakest natural non-blocking progress guarantee. An algorithm is obstruction-free if at any point, a single thread executed in isolation (i.e., with all obstructing threads suspended) for a bounded number of steps will complete its operation.[11] All lock-free algorithms are obstruction-free.
See Treiber's stack and Michael and Scott's queue. These two algorithms are non-blocking implementations of a stack and a queue respectively. Both are based on the use of AtomicReference variables, which make it possible to deal with thread synchronization. They give better performance in programs that suffer from high contention rates.
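As a sketch of the spinning compareAndSet pattern these algorithms rely on, here is a Treiber-style stack written around a single AtomicReference; the Node class and names are illustrative and error handling is omitted.

import java.util.concurrent.atomic.AtomicReference;

public class ConcurrentStack<E> {
    private static class Node<E> {
        final E item;
        Node<E> next;
        Node(E item) { this.item = item; }
    }

    private final AtomicReference<Node<E>> top = new AtomicReference<>();

    public void push(E item) {
        Node<E> newHead = new Node<>(item);
        Node<E> oldHead;
        do {
            oldHead = top.get();                 // snapshot of the current top
            newHead.next = oldHead;              // link the new node to it
        } while (!top.compareAndSet(oldHead, newHead));  // spin and retry if top changed meanwhile
    }

    public E pop() {
        Node<E> oldHead;
        Node<E> newHead;
        do {
            oldHead = top.get();
            if (oldHead == null) return null;    // empty stack
            newHead = oldHead.next;
        } while (!top.compareAndSet(oldHead, newHead));
        return oldHead.item;
    }
}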
Basically, both algorithms implement a spinning solution to deal with concurrent modification of the stack top pointer and of the head and tail references in the queue. Thus, whenever a thread tries to modify one of them, it has to be sure (see the if statements) that it is actually modifying what it expects to modify (note the use of compareAndSet). If such a check fails, it goes back (spins) and tries again from the beginning.

JAVA: Nested Classes (oracle reference)
A nested class is a member of its enclosing class. Non-static nested classes (inner classes) have access to other members of the enclosing class, even if they are declared private. Static nested classes do not have access to other members of the enclosing class. As a member of the OuterClass, a nested class can be declared private, public, protected, or package private. (Recall that outer classes can only be declared public or package private.)
    Static nested class​:A static nested class ​interacts with the instance members of its outer class (and other                                    classes) ​just like any other top­level class​. In effect, a static nested class is ​behaviorally a top­level class that                                      has been ​nested​ in another top­level class ​for packaging convenience​.     Inner class​: is ​associated with an instance of its enclosing class and ​has direct access to that object's                                    methods and fields​. Also, because an inner class is associated with an instance, it cannot define any static                                    members itself. Objects that are instances of an inner class ​exist ​within​ an instance of the outer class​.    To instantiate an inner class, you must first instantiate the outer class. Then, create the inner object within the                                      outer object with this syntax:  OuterClass.InnerClass innerObject = outerObject.new InnerClass();    Local classes (​oracle reference​): are classes that are defined in a ​block​, which is a group of zero or more                                        statements between balanced braces. You typically find local classes defined in the body of a method. ​A local                                    class has access to the members of its enclosing class and to local variables​. However, a local class can only                                        access local variables that are declared ​final​. Starting in Java SE 8, a local class can access local variables                                      and parameters of the enclosing block that are final or ​effectively final ​(i.e. never changed in the enclosing                                    block).  Shadowing​: If a declaration of a type (such as a member variable or a parameter name) in a particular scope                                        (such as an inner class or a method definition) has the same name as another declaration in the enclosing                                      scope, then the declaration ​shadows​ the declaration of the enclosing scope.  Anonymous classes​: enable you to declare and instantiate a class at the same time. They are like local                                    classes except that they do not have a name. Use them if you need to use a local class only once. Ex.: when                                              making a new of an interface and then opening { to write “on the fly” the class.  AtomicInteger (​oracle reference​): An int value that may be updated atomically. Methods: getAndAdd(int                          delta), getAndIncrement(), addAndGet(int delta), incrementAndGet().  ThreadLocal<T> (​oracle reference​): This class provides ​thread­local variables​. These variables differ from                        their normal counterparts in that each thread that accesses one (via its get or set method) has its own,                                      independently initialized copy of the variable. ThreadLocal instances are typically private static fields in classes                              that wish to associate state with a thread (e.g., a user ID or Transaction ID).       MESSAGE PASSING MODEL  Message passing between a pair of processes can be supported by two message communication operations,                              send and ​receive​, defined in terms of destinations and messages. 
A ​queue ​is associated with each ​message                                  destination​. Sending processes cause messages to be added to remote queues and receiving processes                            remove messages from local queues. Sending and receiving processes may be either ​synchronous or                            asynchronous​. In the synchronous form of communication, the sending and receiving processes synchronize                          at every message​. In this case, both send and receive are ​blocking operations​.  In the asynchronous form of communication, ​the use of the send operation is nonblocking in that the sending                                    process is allowed to proceed as soon as the message has been copied to a local buffer, and the transmission                                        of the message proceeds in parallel with the sending process​. The ​receive operation can have ​blocking and                                 
    non­blocking variants​. Inthe non­blocking variant, the receiving process proceeds with its program after                            issuing a receive operation, which provides a buffer to be filled in the background, but it must separately                                    receive notification that its buffer has been filled, by polling or interrupt.  Non­blocking communication appears to be more efficient, but it involves extra complexity in the receiving                              process associated with the need to acquire the incoming message out of its flow of control. For these                                    reasons, today’s systems do not generally provide the nonblocking form of receive.  Communication channel hypothesis​:  ● Reliability​: in terms of validity and integrity. As far as the validity property is concerned, a point­to­point                                  message service can be described as reliable if messages are guaranteed to be delivered despite a                                ‘reasonable’ number of packets being dropped or lost. In contrast, a point­to­point message service can                              be described as unreliable if messages are not guaranteed to be delivered in the face of even a single                                      packet dropped or lost. For integrity, messages must arrive uncorrupted and without duplication.  ● Ordering​: Some applications require that messages be delivered in sender order – that is, the order in                                  which they were transmitted by the sender. The delivery of messages out of sender order is regarded                                  as a failure by such applications.  ● QoS  ● Queue policy    Processes addressing​: in the Internet protocols, messages are sent to ​(Internet address, local port) pairs. A                                local port is a message destination within a computer, specified as an integer. A port has exactly one receiver                                      (multicast ports are an exception) but can have many senders. An alternative to address processes is to use                                    the values ​(Process ID, local port)​.    Distributed systems models​:  ● Physical​: focus on hardware organization.  ● Architectural​ (behavioural): how the different components interact (client­server, peer­to­peer).  ● Fundamental ​(abstract): an high level abstraction that makes possible to represent mathematically the                          real model, so to perform hypothesis validation    Bounded­buffer with asynchronous message passing and Dijkstra’s guarded commands​: the most                      important element of the guarded command language. In a guarded command, just as the name says, the                                  command is "guarded". The guard is a ​proposition​, which must be true before the statement is ​executed​. At the                                      start of that statement's execution, one may assume the guard to be true. Also, if the guard is false, the                                        statement will not be executed. The use of guarded commands makes it easier to prove the ​program meets the                                      specification​. The statement is often another guarded command.  
A guard can be in one of these three states:  ● failed​: condition is ​false  ● valid​: condition ​true​, result received  ● delayed​: condition ​true​, but ​no results available yet  See notes from Professor about the algorithm for the bounded­buffer (readers­writers) problem solved with                            guarded commands.   
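The Professor's algorithm is not reproduced here; the following is only a rough Java sketch of the guarded-command idea applied to a bounded buffer, under the assumption that producers and consumers communicate with a buffer-manager thread through asynchronous queues. Each loop iteration re-evaluates the two guards (space available for a deposit, items available for a fetch) and serves a request only when its guard is valid; a pending request whose guard is currently false simply stays delayed.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only: producers enqueue items into depositRequests, consumers
// enqueue a CompletableFuture into fetchRequests and wait on it; both queues model
// asynchronous message channels toward the buffer manager.
public class GuardedBoundedBuffer<E> implements Runnable {
    private final int capacity;
    private final Deque<E> buffer = new ArrayDeque<>();
    final ConcurrentLinkedQueue<E> depositRequests = new ConcurrentLinkedQueue<>();
    final ConcurrentLinkedQueue<CompletableFuture<E>> fetchRequests = new ConcurrentLinkedQueue<>();

    public GuardedBoundedBuffer(int capacity) { this.capacity = capacity; }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            boolean served = false;
            // Guard 1 (deposit): valid only if the buffer is not full and a request is pending.
            if (buffer.size() < capacity && !depositRequests.isEmpty()) {
                buffer.addLast(depositRequests.poll());
                served = true;
            }
            // Guard 2 (fetch): valid only if the buffer is not empty and a request is pending.
            if (!buffer.isEmpty() && !fetchRequests.isEmpty()) {
                fetchRequests.poll().complete(buffer.removeFirst());
                served = true;
            }
            // Both guards failed or delayed: nothing can be served right now.
            if (!served) Thread.onSpinWait();
        }
    }
}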
    Message Passing Interface(MPI, ​a reference​)​: the first standardized, vendor independent, message passing                          library. The advantages of developing message passing software using MPI closely match the design goals of                                portability, efficiency, and flexibility. MPI is not an IEEE or ISO standard, but has in fact, become the "industry                                      standard" for writing message passing programs on HPC platforms.   MPI primarily addresses the ​message­passing parallel programming model​: data is moved from the address                            space of one process to that of another process through ​cooperative operations on each process​.  MPI main components and features:  ● Communicators and Groups​: MPI uses objects called communicators and groups to define ​which                          collection of processes may communicate with each other​. ​MPI_COMM_WORLD is the predefined                        communicator that includes all of your MPI processes.  ○ A ​group ​is ​an ordered set of processes​. Each process in a group is associated with a unique                                    integer rank. Rank values start at zero and go to N­1, where N is the number of processes in the                                        group. In MPI, a group is represented within system memory as an object. It is accessible to the                                    programmer only by a "handle". A group is always associated with a communicator object.   ○ A ​communicator ​encompasses a group of processes that may communicate with each other.                          All MPI messages must specify a communicator. In the simplest sense, the communicator is an                              extra "tag" that must be included with MPI calls.  ○ From the programmer's perspective, a group and a communicator are one.  ● Rank​: ​Within a communicator​, every ​process has its ​own unique, integer identifier assigned by the                              system when the process initializes. A rank is sometimes also called a "task ID". ​Ranks are contiguous                                  and begin at zero​. It is used by the programmer to specify the source and destination of messages.                                    Often used conditionally by the application to control program execution (if rank=0 do this / if rank=1 do                                    that).   ● MPI buffer​: since send and receive are rarely ​perfectly synchronized, the MPI architecture presents a                              library buffer​ that is used to store transiting messages while the receiver cannot receive them.  ● Blocking and Nonblocking op.s​: Blocking is interpreted as ‘​blocked until it is safe to return​’, in the                                  sense that ​application data has been copied into the MPI system and hence is in transit or delivered                                    and therefore the application buffer can be reused​. ​Safe means that modifications will not affect the                                data intended for the receive task​. Safe does not imply that the data was actually received ­ it may very                                        well be sitting in a system buffer. Non­blocking send and receive routines behave similarly ­ they will                                  return almost immediately. 
They do not wait for any communication events to complete, such as                              message copying from user memory to system buffer space or the actual arrival of message.   ● Point­to­Point Communication​: message passing between two, and only two, different MPI tasks.                        One task is performing a send operation and the other task is performing a matching receive operation.  ● Collective Communication​: involve ​all processes within the scope of a communicator. It is the                            programmer's responsibility to ensure that all processes within a communicator participate in any                          collective operations. Unexpected behavior, including program failure, can occur if even one task in the                              communicator doesn't participate.  ● Types of Collective Operations​: 
        ● Synchronization­ processes wait until all members of the group have reached the                          synchronization point.  ● Data Movement​ ­ broadcast, scatter/gather, all to all.  ● Collective Computation (reductions) ­ one member of the group collects data from the other                            members and performs an operation (min, max, add, multiply, etc.) on that data.    ● Order​:  ○ MPI guarantees that messages will not overtake each other.   ○ Order rules do not apply if there are multiple threads participating in the communication                            operations.  ● Fairness​: MPI does not guarantee fairness ­ it's up to the programmer to prevent "operation                              starvation".  ● Envelope​: source+destination+tag+communicator (see later)      The underlying architectural model for MPI is relatively simple and captured in Figure 4.17; note the added                                  dimension of ​explicitly having MPI library buffers in both the sender and the receiver​, managed by the MPI                                    library and used to hold data in transit.   
MPI point-to-point routines take the following arguments:

Buffer = Program (application) address space that references the data that is to be sent or received.
Count = Indicates the number of data elements of a particular type to be sent.
Type = For reasons of portability, MPI predefines its elementary data types.
Dest = The rank of the receiving process.
Source = The rank of the sending process. This may be set to the wild card MPI_ANY_SOURCE to receive a message from any task.
Tag = Arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be used to receive any message regardless of its tag.
Communicator = Communication context, or set of processes for which the source or destination fields are valid.
Status = For a receive operation, indicates the source of the message and the tag of the message.
Request = Used by non-blocking send and receive operations. The system issues a unique "request number".
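For reference, these parameters are the arguments of the standard MPI point-to-point calls (C bindings), which the routine table referenced here presumably listed:

MPI_Send(&buf, count, datatype, dest, tag, comm)
MPI_Recv(&buf, count, datatype, source, tag, comm, &status)
MPI_Isend(&buf, count, datatype, dest, tag, comm, &request)
MPI_Irecv(&buf, count, datatype, source, tag, comm, &request)
MPI_Wait(&request, &status)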
TIME AND GLOBAL STATES
Clocks, events and process states
We define an event to be the occurrence of a single action that a process carries out as it executes – a communication action or a state-transforming action. The sequence of events within a single process pi can be placed in a single, total ordering, which we denote by the relation ➝i between the events. That is, e ➝i e' if and only if the event e occurs before e' at pi. This ordering is well defined, whether or not the process is multithreaded, since we have assumed that the process executes on a single processor. Now we can define the history of process pi to be the series of events that take place within it, ordered as we have described by the relation ➝i:

history(pi) = hi = < ei^0, ei^1, ei^2, … >

The operating system reads the node's hardware clock value, Hi(t), scales it and adds an offset so as to produce a software clock Ci(t) = α Hi(t) + β that approximately measures real, physical time t for process pi. In other words, when the real time in an absolute frame of reference is t, Ci(t) is the reading on the software clock. Successive events will correspond to different timestamps only if the clock resolution – the period between updates of the clock value – is smaller than the time interval between successive events.

SKEW between computer clocks in a distributed system: the instantaneous difference between the readings of any two clocks is called their skew.

Clock DRIFT: clocks count time at different rates, and so diverge.

Clock's DRIFT RATE: the change in the offset between the clock and a nominal perfect reference clock per unit of time measured by the reference clock.

UTC (Coordinated Universal Time): is based on atomic time, but a so-called 'leap second' is inserted – or, more rarely, deleted – occasionally to keep it in step with astronomical time. UTC signals are synchronized and broadcast regularly from land-based radio stations and satellites covering many parts of the world.

Synchronizing Physical Clocks
Synchronization in a synchronous system: in general, for a synchronous system, the optimum bound that can be achieved on clock skew when synchronizing N clocks is u ∙ (1 – 1/N) [Lundelius and Lynch 1984], where u = max – min, the maximum and minimum time that the transmission of a message can take in a synchronous system.

Cristian's method for synchronizing clocks: use of a time server, connected to a device that receives signals from a source of UTC, to synchronize computers externally. Although there is no upper bound on message transmission delays in an asynchronous system, the round-trip times for messages exchanged between pairs of processes are often reasonably short – a small fraction of a second.
He describes the algorithm as                                    probabilistic: the method achieves synchronization only if the observed round­trip times between client and                            server are sufficiently short compared with the required accuracy. A simple estimate of the time to which p                                   
should set its clock is t + T_round / 2, which assumes that the elapsed time is split equally before and after S placed t in m_t (the message carrying the timestamp). This is normally a reasonably accurate assumption, unless the two messages are transmitted over different networks.

Discussion: Cristian's method suffers from the problem associated with all services implemented by a single server: the single time server might fail and thus render synchronization temporarily impossible. Cristian suggested, for this reason, that time should be provided by a group of synchronized time servers, each with a receiver for UTC time signals. Dolev et al. [1986] showed that if f is the number of faulty clocks out of a total of N, then we must have N > 3f if the other, correct, clocks are still to be able to achieve agreement.

Berkeley's Algorithm: an algorithm for internal synchronization developed for collections of computers running Berkeley UNIX. A coordinator computer is chosen to act as the master. Unlike in Cristian's protocol, this computer periodically polls the other computers whose clocks are to be synchronized, called slaves. The slaves send back their clock values to it. The master estimates their local clock times by observing the round-trip times (similarly to Cristian's technique), and it averages the values obtained (including its own clock's reading). The balance of probabilities is that this average cancels out the individual clocks' tendencies to run fast or slow. The accuracy of the protocol depends upon a nominal maximum round-trip time between the master and the slaves.
The master takes a FAULT-TOLERANT AVERAGE. That is, a subset is chosen of clocks that do not differ from one another by more than a specified amount, and the average is taken of readings from only these clocks.

NTP: Cristian's method and the Berkeley algorithm are intended primarily for use within intranets. NTP's chief design aims and features are as follows:
● To provide a service enabling clients across the Internet to be synchronized accurately to UTC: although large and variable message delays are encountered in Internet communication, NTP employs statistical techniques for the filtering of timing data and it discriminates between the quality of timing data from different servers.
● To provide a reliable service that can survive lengthy losses of connectivity: there are redundant servers and redundant paths between the servers. The servers can reconfigure so as to continue to provide the service if one of them becomes unreachable.
● To enable clients to resynchronize sufficiently frequently to offset the rates of drift found in most computers: the service is designed to scale to large numbers of clients and servers.
● To provide protection against interference with the time service, whether malicious or accidental​: The                            time service uses ​authentication techniques to check that timing data originate from the ​claimed trusted                              sources​. It also validates the return addresses of messages sent to it.    Logical Time and Logical Clocks  As Lamport [1978] pointed out, since we cannot synchronize clocks perfectly across a distributed system, we                                cannot in general use physical time to find out the order of any arbitrary pair of events occurring within it. Two                                          simple and intuitively obvious points: 
● If two events occurred at the same process pi (i = 1, 2, ..., N), then they occurred in the order in which pi observes them – this is the order ➝i that we defined above.
● Whenever a message is sent between processes, the event of sending the message occurred before the event of receiving the message.

Lamport called the partial ordering obtained by generalizing these two relationships the happened-before relation. It is also sometimes known as the relation of causal ordering or potential causal ordering. The sequence of events need not be unique.

For example, a ↛ e and e ↛ a, since they occur at different processes, and there is no chain of messages intervening between them. We say that events such as a and e that are not ordered by ➝ are concurrent, and write this a || e.

Logical clocks: Lamport [1978] invented a simple mechanism by which the happened-before ordering can be captured numerically, called a logical clock. A Lamport logical clock is a monotonically increasing software counter, whose value need bear no particular relationship to any physical clock. Each process pi keeps its own logical clock, Li, which it uses to apply so-called Lamport timestamps to events. We denote the timestamp of event e at pi by Li(e), and by L(e) we denote the timestamp of event e at whatever process it occurred at.
To capture the happened-before relation ➝, processes update their logical clocks and transmit the values of their logical clocks in messages as follows:
● LC1: Li is incremented before each event is issued at process pi: Li := Li + 1.
● LC2:
(a) When a process pi sends a message m, it piggybacks on m the value t = Li.
(b) On receiving (m, t), a process pj computes Lj := max(Lj, t) and then applies LC1 before timestamping the event receive(m).
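A minimal Java sketch of rules LC1 and LC2 (illustrative, not from the notes): the clock is just a counter that is incremented before local events, piggybacked on sends, and merged with max() on receives.

import java.util.concurrent.atomic.AtomicLong;

// Sketch of a Lamport logical clock implementing LC1 and LC2.
public class LamportClock {
    private final AtomicLong time = new AtomicLong(0);

    // LC1: increment before each local event; returns the event's timestamp.
    public long tick() {
        return time.incrementAndGet();
    }

    // LC2(a): timestamp to piggyback on an outgoing message m.
    public long onSend() {
        return tick();
    }

    // LC2(b): on receiving (m, t), set L := max(L, t) and then apply LC1.
    public long onReceive(long t) {
        time.updateAndGet(current -> Math.max(current, t));
        return tick();
    }
}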
Note: e ➝ e' ⇒ L(e) < L(e'). The converse is not true: if L(e) < L(e'), we cannot infer that e ➝ e'.

Totally ordered logical clocks: some pairs of distinct events, generated by different processes, have numerically identical Lamport timestamps. We can create a total order on the set of events – that is, one for which all pairs of distinct events are ordered – by taking into account the identifiers of the processes at which events occur. We define the global logical timestamps for these events to be (Ti, i) and (Tj, j), respectively. (Ti, i) < (Tj, j) if and only if either Ti < Tj, or Ti = Tj and i < j.

Vector clocks: Mattern [1989] and Fidge [1991] developed vector clocks to overcome the shortcoming of Lamport's clocks: the fact that from L(e) < L(e') we cannot conclude that e ➝ e'.
A vector clock for a system of N processes is an array of N integers. Each process keeps its own vector clock, Vi, which it uses to timestamp local events. There are simple rules for updating the clocks (sketched below).
For a vector clock Vi, Vi(i) is the number of events that pi has timestamped, and Vi(j) (j ≠ i) is the number of events that have occurred at pj that have potentially affected pi. (Process pj may have timestamped more events by this point, but no information has flowed to pi about them in messages as yet.)
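The standard update rules (initialize all entries to zero; increment the process's own entry just before timestamping each local event; piggyback the whole vector on every message; on receipt, take the componentwise maximum and then increment the own entry) can be sketched as follows; the class is illustrative and not tied to any particular messaging layer.

import java.util.Arrays;

// Sketch of a vector clock for a system of n processes; this clock belongs to process `self`.
public class VectorClock {
    private final int[] v;
    private final int self;

    public VectorClock(int n, int self) {
        this.v = new int[n];        // initially all entries are 0
        this.self = self;
    }

    // Increment own entry just before timestamping a local event (including sends and receives).
    public void tick() { v[self]++; }

    // Timestamp to piggyback on an outgoing message.
    public int[] onSend() {
        tick();
        return Arrays.copyOf(v, v.length);
    }

    // On receiving timestamp t: componentwise maximum, then tick for the receive event.
    public void onReceive(int[] t) {
        for (int j = 0; j < v.length; j++) v[j] = Math.max(v[j], t[j]);
        tick();
    }

    // V <= W iff every entry of V is <= the corresponding entry of W;
    // two events are concurrent when neither timestamp is <= the other.
    public static boolean lessOrEqual(int[] a, int[] b) {
        for (int j = 0; j < a.length; j++) if (a[j] > b[j]) return false;
        return true;
    }
}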
      Figure 14.7 showsthe vector timestamps of the events of Figure 14.5. It can be seen, for example, that V(a) <                                          V(f) , which reflects the fact that a​➝ f. Similarly, we can tell when two events are concurrent by comparing their                                          timestamps. For example, that c || e can be seen from the facts that neither V(c) ≤ V(e) nor V(e) ≤ V(c).  Vector timestamps have the ​disadvantage​, compared with Lamport timestamps, of taking up an amount of                              storage and message payload that is proportional to N, the number of processes.    Global states​: the problem of finding out whether a particular property is true of a distributed system as it                                      executes.  ● Distributed garbage collection​: An object is considered to be garbage if there are no longer any                                references to it anywhere in the distributed system. To check that an object is garbage, we must verify                                    that there are no references to it anywhere in the system. When we consider properties of a system, we                                      must include the state of communication channels as well as the state of the processes.  ● Distributed deadlock detection​: A distributed deadlock occurs when each of a collection of processes                            waits for another process to send it a message, and where there is a cycle in the graph of this                                        ‘​waits­for​’ relationship.  ● Distributed termination detection​: The phenomena of termination and deadlock are similar in some                          ways, but they are different problems. First, a deadlock may affect only a subset of the processes in a                                      system, whereas all processes must have terminated. Second, process ​passivity is not the same as                              waiting in a deadlock cycle: a deadlocked process is attempting to perform a further action, for which                                  another process waits; a passive process is not engaged in any activity.  ● Distributed debugging​: Distributed systems are complex to debug.   
Global states and consistent cuts: the essential problem is the absence of global time. A global state corresponds to initial prefixes of the individual process histories. A cut of the system's execution is a subset of its global history that is a union of prefixes of process histories.
The leftmost cut is inconsistent. This is because at p2 it includes the receipt of the message m1, but at p1 it does not include the sending of that message. This is showing an 'effect' without a 'cause'. The actual execution never was in a global state corresponding to the process states at that frontier, and we can in principle tell this by examining the ➝ relation between events. By contrast, the rightmost cut is consistent.

INTERPROCESS COMMUNICATION
Direct comm. (direct naming): unique names are given to all processes comprising a program
● symmetrical direct naming: both the sender and receiver name the corresponding process.
    ● asymmetrical​ direct naming: the receiver can receive messages from any process.  Indirect comm.​ ​(indirect naming)​: uses intermediaries called ​channels​ or ​mailboxes  ● symmetrical​ indirect naming: ​both the sender and receiver name​ the corresponding ​channel​.  ●asymmetrical​ indirect naming: the receiver can receive messages from any channel.    Request/Reply protocol​: a requestor sends a request message to a replier system which receives and                              processes the request, ultimately returning a message in response. This is a simple, but powerful messaging                                pattern which allows two applications to have a two­way conversation with one another over a channel. This                                  pattern is especially common in client­server architectures.​[1]  For simplicity, this pattern is typically implemented in a purely ​synchronous fashion, as in ​web service calls                                  over ​HTTP​, which holds a connection open and waits until the response is delivered or the ​timeout period                                    expires. However, request–response may also be implemented ​asynchronously​, with a response being                        returned at some unknown later time. This is often referred to as "sync over async", or "sync/async", and is                                      common in ​enterprise application integration (EAI) implementations where slow ​aggregations​, time­intensive                      functions, or​ ​human workflow​ must be performed before a response can be constructed and delivered.            Marshalling and Unmarshalling​: The information stored in running programs is represented as data                          structures, whereas the ​information in messages consists of sequences of bytes​. Irrespective of the form of                                communication used, the ​data structures must be flattened (converted to a sequence of bytes) ​before                             
    transmission and rebuilton arrival​. There are differences in data representation from a computer to another                                one. So, when communicating, the following problems must be addressed:  ● primitive data representation (such as integers and floating­point numbers).  ● set of codes used to represent characters (ASCII  or Unicode).    There are two ways for enabling any two computers to exchange binary data values:  ● The values are converted to an agreed external format before transmission and converted to the local                                form on receipt.  ● The values are transmitted in the sender’s format, together with an indication of the format used, and                                  the recipient converts the values if necessary.  An agreed standard for the representation of data structures and primitive values is called an ​external data                                  representation​.  Marshalling is the process of taking a collection of data items and assembling them into a form suitable for                                      transmission in a message. ​Unmarshalling is the process of disassembling them on arrival to produce an                                equivalent collection of data items at the destination. Thus marshalling consists of the translation of structured                                data items and primitive values into an external data representation. Similarly, unmarshalling consists of the                              generation of primitive values from their ​external data representation​ and the rebuilding of the data structures.  Three alternative approaches to external data representation and marshalling:  ● CORBA​’s common data representation, which is concerned with an external representation for the                          structured and primitive types that can be passed as the arguments and results of remote method                                invocations in CORBA.  ● Java’s object serialization​, which is concerned with the flattening and external data representation of                            any single object or tree of objects that may need to be transmitted in a message or stored on a disk.  ● XML​ (Extensible Markup Language), which defines a textual format for representing structured data.  In the first two approaches, the primitive data types are marshalled into a binary form. In the third approach                                      (XML), the primitive data types are represented textually. The textual representation of a data value will                                generally be longer than the equivalent binary representation. The HTTP protocol is another example of the                                textual approach.  Two main issues exist in marshalling:  ● compactness​: the resulting message should be as compact as possible.  ● data type inclusion​: CORBA’s representation includes just the values of the objects transmitted; Java                            serialization and XML does include type information.  Two other techniques for external data representation are worthy of mention:  ● Google uses an approach called ​protocol buffers to capture representations of both stored and                            transmitted data.  ● JSON (JavaScript Object Notation)   Both these last two methods represent a step towards more lightweight approaches to data representation                              (when compared, for example, to XML).  
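A minimal sketch of marshalling and unmarshalling with Java object serialization (the Person class is illustrative): the object is flattened to a byte sequence with ObjectOutputStream and rebuilt with ObjectInputStream, type information included.

import java.io.*;

public class MarshallingDemo {
    // The object to be flattened must be Serializable.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    // Marshalling: structured data -> sequence of bytes.
    static byte[] marshal(Person p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);        // type information is included in the stream
        }
        return bytes.toByteArray();
    }

    // Unmarshalling: sequence of bytes -> equivalent data structure at the destination.
    static Person unmarshal(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Person) in.readObject();
        }
    }
}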
Particular attention is to be paid to ​remote object references​. A remote object reference is an identifier for a                                      remote object that is valid throughout a distributed system. A remote object reference is passed in the                                  invocation message to specify which object is to be invoked. Remote object references must be generated in a                                    manner that ensures ​uniqueness over space and time​. Also, object references must be unique among all of the                                    processes in the various computers in a distributed system. One way is to construct a remote object reference                                   
    by concatenating theInternet address of its host computer and the ​port number of the process that created it                                      with the ​time of its creation ​and a ​local object number​. The local object number is incremented each time an                                        object is created in that process.    The last field of the remote object reference shown in Figure 4.13 contains some information about the                                  interface of the remote object, for example, the interface name. This information is relevant to any process that                                    receives a remote object reference as an argument or as the result of a remote invocation, because it needs to                                        know about the methods offered by the remote object.    Idempotent op.s​: an operation that will produce the same results if executed once or multiple times.​[7] In the                                    case of ​methods or ​subroutine calls with ​side effects​, for instance, it means that the modified state remains the                                      same after the first call. In ​functional programming​, though, an idempotent function is one that has the property                                    f​(​f​(​x​)) = ​f​(​x​) for any value ​x​.​[8]   This is a very useful property in many situations, as it means that an operation can be repeated or retried as                                          often as necessary without causing unintended effects. With non­idempotent operations, the algorithm may                          have to keep track of whether the operation was already performed or not.  In the ​HyperText Transfer Protocol (HTTP)​, idempotence and ​safety are the major attributes that separate                              HTTP verbs​. Of the major HTTP verbs, GET, PUT, and DELETE are idempotent (if implemented according to                                  the standard), but POST is not.​[9]     Remote Procedure Call (RPC)  Request­reply protocols provide relatively low­level support for requesting the execution of a remote operation,                            and also provide direct support for RPC and RMI.  RPC allows client programs to ​call procedures ​transparently in server programs running in separate processes                              and generally in different computers from the client.  RPC has the goal of making the programming of distributed systems look similar, if not identical, to                                  conventional programming – that is, achieving a high level of ​distribution transparency​. This unification is                              achieved in a very simple manner, by ​extending the abstraction of a procedure call to distributed environments​.                                  In particular, in RPC, procedures on remote machines can be called as if they are procedures in the local                                      address space​. The underlying RPC system then hides important aspects of distribution, including the                            encoding and decoding of parameters and results, the passing of messages and the preserving of the required                                  semantics for the procedure call.  
Three issues that are important in understanding this concept:  ● programming with interfaces ​(the style of programming): in order to control the possible interactions                            between modules, an explicit interface is defined for each module; as long as its interface remains the                                  same, the implementation may be changed without affecting the users of the module. 
    ○ service interface​:the specification of the procedures offered by a server, defining the types of                              the arguments of each of the procedures. But why an interface?  ■ It is not possible for a client module running in one process to access the variables in a                                    module in another process.  ■ The parameter­passing mechanisms used in local procedure calls are not suitable when                        the caller and procedure are in different processes. In particular, call by reference is not                              supported. Rather, the specification of a procedure in the interface of a module in a                              distributed program describes the parameters as input or output, or sometimes both.                        Input parameters are passed to the remote server by sending the values of the                            arguments in the request message and output parameters are returned in the reply                          message and are used as the result of the call.  ■ Addresses in one process are not valid in another remote one.  ○ Interface Definition Language (IDL)​: designed to allow procedures implemented in different                      languages to invoke one another. An IDL provides a notation for defining interfaces in which                              each of the parameters of an operation may be described as for input or output in addition to                                    having its type specified.  ● the ​call semantics ​associated with RPC:  ○ Retry request message​: Controls whether to retransmit the request message until either a                          reply is received or the server is assumed to have failed.  ○ Duplicate filtering​: Controls when retransmissions are used and whether to filter out duplicate                          requests at the server.  ○ Retransmission of results​: Controls whether to keep a history of result messages to enable                            lost results to be retransmitted without re­executing the operations at the server.  Combinations of these choices lead to a variety of possible semantics for the ​reliability ​of remote invocations                                  as seen by the invoker. Note that for local procedure calls, the semantics are exactly once, meaning that every                                      procedure is executed exactly once (except in the case of process failure). The choices of RPC invocation                                  semantics are defined as follows.  ○ Maybe semantics​: the remote procedure call ​may be executed once or not at all​. Maybe                              semantics arises when ​no fault­tolerance measures are applied and ​can suffer from the                          following types of failure​:  ■ omission failures​ if the request or result message is lost;  ■ crash failures​ when the server containing the remote operation fails.  Useful only for applications in which occasional failed calls are acceptable.  ○ At­least­once semantics​: ​the invoker receives either a result​, in which case the invoker knows                            that the procedure was executed at least once, ​or an exception informing it that no result was                                  received. It can be achieved by the ​retransmission of request messages​, which masks the                            omission failures of the request or result message. 
At­least­once semantics ​can suffer from the                            following types of failure​:  ■ crash failures​:when the server containing the remote procedure fails;  ■ arbitrary failures​: in cases when ​the request message is retransmitted​, the remote server                          may receive it and ​execute the procedure more than once​, possibly causing wrong                          values to be stored or returned. 
If the operations in a server can be designed so that all of the procedures in their service interfaces are idempotent operations, then at-least-once call semantics may be acceptable.
  ○ At-most-once semantics: the caller receives either a result, in which case the caller knows that the procedure was executed exactly once, or an exception informing it that no result was received, in which case the procedure will have been executed either once or not at all. It can be achieved by using all of the fault-tolerance measures outlined in Figure 5.9.

● Transparency: RPC strives to offer at least location and access transparency, hiding the physical location of the (potentially remote) procedure and accessing local and remote procedures in the same way. Middleware can also offer additional levels of transparency to RPC. However, remote procedure calls suffer from the following:
  ○ vulnerability to failures of the network and/or of the remote server process, with no ability to distinguish between the two;
  ○ the latency of a remote procedure call is several orders of magnitude greater than that of a local one.
The current consensus is that remote calls should be made transparent in the sense that the syntax of a remote call is the same as that of a local invocation, but that the difference between local and remote calls should be expressed in their interfaces.

RPC Implementation: The software components required to implement RPC are shown in Figure 5.10. The client that accesses a service includes one stub procedure for each procedure in the service interface. The stub procedure behaves like a local procedure to the client, but instead of executing the call, it marshals the procedure identifier and the arguments into a request message, which it sends via its communication module to the server. When the reply message arrives, it unmarshals the results. The server process contains a dispatcher together with one server stub procedure and one service procedure for each procedure in the service interface. The dispatcher selects one of the server stub procedures according to the procedure identifier in the request message. The server stub procedure then unmarshals the arguments in the request message, calls the corresponding service procedure and marshals the return values for the reply message. The service procedures implement the procedures in the service interface.
The client and server stub procedures and the dispatcher can be generated automatically by an interface compiler from the interface definition of the service. RPC is generally implemented over a request-reply protocol like the ones discussed so far. The contents of request and reply messages are the same as those illustrated for request-reply protocols in Figure 5.4. RPC may be implemented to have one of the choices of invocation semantics discussed: at-least-once or at-most-once is generally chosen. To achieve this, the communication module will implement the desired design choices in terms of retransmission of requests, dealing with duplicates and retransmission of results, as shown in Figure 5.9.

Remote Method Invocation (RMI)
RMI allows objects in different processes to communicate with one another; it is an extension of local method invocation that allows an object living in one process to invoke the methods of an object living in another process.
The commonalities between RMI and RPC are as follows:
● They both support programming with interfaces.
● Both are typically constructed on top of request-reply protocols and can offer a range of call semantics such as at-least-once and at-most-once.
● Both offer a similar level of transparency – that is, local and remote calls employ the same syntax, but remote interfaces typically expose the distributed nature of the underlying call, for example by supporting remote exceptions.

The following differences lead to added expressiveness:
● The programmer is able to use the full expressive power of object-oriented programming in the development of distributed systems software.
● Building on the concept of object identity in object-oriented systems, all objects in an RMI-based system have unique object references (whether they are local or remote); such object references can also be passed as parameters, thus offering significantly richer parameter-passing semantics than in RPC.
The issue of parameter passing is particularly important in distributed systems. RMI allows the programmer to pass parameters not only by value, as input or output parameters, but also by object reference.
RMI shares the same design issues as RPC in terms of programming with interfaces, call semantics and level of transparency. The key design issue relates to the object model and, in particular, achieving the transition from objects to distributed objects. For example, for use in a distributed object system, an object's data should be accessible only via its methods.
Distributed object systems may adopt the client-server architecture. In this case, objects are managed by servers and their clients invoke their methods using remote method invocation. In RMI, the client's request to invoke a method of an object is sent in a message to the server managing the object. The invocation is carried out by executing a method of the object at the server and the result is returned to the client in another message. To allow for chains of related invocations, objects in servers are allowed to become clients of objects in other servers.
Having client and server objects in different processes enforces encapsulation. That is, the state of an object can be accessed only by the methods of the object, which means that it is not possible for unauthorized methods to act on the state.
Another advantage of treating the shared state of a distributed program as a collection of objects is that an object may be accessed via RMI.
The fact that objects are accessed only via their methods gives rise to another advantage for heterogeneous systems: different data formats may be used at different sites – these formats will be unnoticed by clients that use RMI to access the methods of the objects.
Distributed object model: A process contains a collection of objects, some of which may receive local and remote invocations, while others only local invocations. Method invocations between objects in different processes, whether in the same computer or not, are known as remote method invocations. Method invocations between objects in the same process are local method invocations.
The following two fundamental concepts are at the heart of the distributed object model:
● Remote object references: Other objects can invoke the methods of a remote object if they have access to its remote object reference. A remote object reference is an identifier that can be used throughout a distributed system to refer to a particular unique remote object. It may be passed as arguments and results of remote method invocations.
● Remote interfaces: Every remote object has a remote interface that specifies which of its methods can be invoked remotely. The class of a remote object implements the methods of its remote interface.
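To make the idea of a remote interface concrete, here is a minimal Java RMI-style sketch (the Account name and its two methods are hypothetical, chosen only for illustration): a remote interface extends java.rmi.Remote, and every remotely invocable method declares java.rmi.RemoteException.

import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical remote interface: only the methods declared here can be invoked remotely.
public interface Account extends Remote {
    // Parameters and results are passed by value (serialized), unless they are
    // themselves remote objects, in which case a remote object reference is passed.
    void deposit(int amount) throws RemoteException;
    int getBalance() throws RemoteException;
}

On the server, the class of the remote object (the servant) implements Account; clients only ever see the interface, which is precisely the "programming with interfaces" point made above.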
Actions in a distributed object system: As in the non-distributed case, an action is initiated by a method invocation, which may result in further invocations of methods in other objects. But in the distributed case, the objects involved in a chain of related invocations may be located in different processes or different computers. When an invocation crosses the boundary of a process or computer, RMI is used, and the remote reference of the object must be available to the invoker.

Garbage collection in a distributed-object system: generally achieved by cooperation between the existing local garbage collector and an added module that carries out a form of distributed garbage collection, usually based on reference counting.

Exceptions: Any remote invocation may fail for reasons related to the invoked object being in a different process or computer from the invoker. Remote method invocation should be able to raise exceptions such as timeouts that are due to distribution, as well as those raised during the execution of the method invoked.

RMI Implementation: Several separate objects and modules are involved in achieving a remote method invocation.
● Communication module: The two cooperating communication modules carry out the request-reply protocol, which transmits request and reply messages between the client and server. The contents of request and reply messages are shown in Figure 5.4. They are responsible for providing a specified invocation semantics, for example at-most-once.
● Remote reference module: is responsible for translating between local and remote object references and for creating remote object references. In each process, it has a remote object table that records the correspondence between local object references in that process and remote object references (which are system-wide). The table includes:
  ○ An entry for all the remote objects held by the process. For example, in Figure 5.15 the remote object B will be recorded in the table at the server.
  ○ An entry for each local proxy. For example, in Figure 5.15 the proxy for B will be recorded in the table at the client.
The actions of the remote reference module are as follows:
  ○ When a remote object is to be passed as an argument or a result for the first time, the remote reference module is asked to create a remote object reference, which it adds to its table.
  ○ When a remote object reference arrives in a request or reply message, the remote reference module is asked for the corresponding local object reference, which may refer either to a proxy or to a remote object. In the case that the remote object reference is not in the table, the RMI software creates a new proxy and asks the remote reference module to add it to the table.
This module is called by components of the RMI software when they are marshalling and unmarshalling remote object references. For example, when a request message arrives, the table is used to find out which local object is to be invoked.
● Servant: an instance of a class that provides the body of a remote object. It is the servant that eventually handles the remote requests passed on by the corresponding skeleton. Servants live within a server process.
● The RMI software: a layer of software between the application-level objects and the communication and remote reference modules. The roles of the middleware objects shown in Figure 5.15 are as follows:
  ○ Proxy: makes remote method invocation transparent to clients by behaving like a local object to the invoker; but instead of executing an invocation, it forwards it in a message to a remote object. It hides the details of the remote object reference, the marshalling of arguments, unmarshalling of results and sending and receiving of messages from the client. There is one proxy for each remote object for which a process holds a remote object reference. The class of a proxy implements the methods in the remote interface of the remote object it represents, which ensures that remote method invocations are suitable for the type of the remote object. However, the proxy implements them quite differently. Each method of the proxy marshals a reference to the target object, its own operationId and its arguments into a request message and sends it to the target. It then awaits the reply message, unmarshals it and returns the results to the invoker.
  ○ Dispatcher: A server has one dispatcher and one skeleton for each class representing a remote object. The dispatcher receives request messages from the communication module. It uses the operationId to select the appropriate method in the skeleton, passing on the request message. The dispatcher and the proxy use the same allocation of operationIds to the methods of the remote interface.
  ○ Skeleton: The class of a remote object has a skeleton, which implements the methods in the remote interface. They are implemented quite differently from the methods in the servant that incarnates a remote object. A skeleton method unmarshals the arguments in the request message and invokes the corresponding method in the servant.
It waits for the invocation to                              complete and then ​marshals the result, together with any exceptions​, in a reply message to the                                sending proxy’s method.    Remote object references are marshalled in the form shown in Figure 4.13, which includes information                              about the remote interface of the remote object – for example, the name of the remote interface or the                                      class of the remote object. This information enables the proxy class to be determined so that a new                                    proxy may be created when it is needed. For example, the proxy class name may be generated by                                    appending _proxy to the name of the remote interface.   
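As an illustration of the proxy / dispatcher / skeleton division of labour described above, here is a heavily simplified, self-contained Java sketch. Everything in it (Request, Reply, Marshaller, the operationId values, the Account names) is hypothetical; a real RMI runtime generates the equivalent code automatically and uses its own wire format and transport.

import java.nio.ByteBuffer;

class Request {                              // request message: object reference + operationId + arguments
    final String objectRef; final int operationId; final byte[] args;
    Request(String ref, int op, byte[] args) { objectRef = ref; operationId = op; this.args = args; }
}
class Reply { final byte[] result; Reply(byte[] r) { result = r; } }

interface RequestReplyChannel { Reply sendAndAwaitReply(Request req); }   // stands in for the communication module

class Marshaller {                           // toy marshalling: a 4-byte big-endian int
    static byte[] fromInt(int v) { return ByteBuffer.allocate(4).putInt(v).array(); }
    static int toInt(byte[] b)   { return ByteBuffer.wrap(b).getInt(); }
}

class AccountServant {                       // the servant: the object that really does the work
    private int balance = 42;
    int getBalance() { return balance; }
}

class AccountProxy {                         // client side: marshals the call and forwards it
    private final String targetRef; private final RequestReplyChannel channel;
    AccountProxy(String ref, RequestReplyChannel ch) { targetRef = ref; channel = ch; }
    int getBalance() {
        Request req = new Request(targetRef, 1, new byte[0]);   // operationId 1 = getBalance
        Reply rep = channel.sendAndAwaitReply(req);             // await the reply message
        return Marshaller.toInt(rep.result);                    // unmarshal the result
    }
}

class AccountDispatcher {                    // server side: operationId -> skeleton-like method -> servant
    private final AccountServant servant;
    AccountDispatcher(AccountServant s) { servant = s; }
    Reply dispatch(Request req) {
        switch (req.operationId) {
            case 1:  return new Reply(Marshaller.fromInt(servant.getBalance()));
            default: throw new IllegalArgumentException("unknown operationId " + req.operationId);
        }
    }
}

public class RmiSketch {
    public static void main(String[] args) {
        AccountDispatcher dispatcher = new AccountDispatcher(new AccountServant());
        // The "network" is collapsed into a direct call here, just to show the message flow.
        AccountProxy proxy = new AccountProxy("remote-ref-B", dispatcher::dispatch);
        System.out.println(proxy.getBalance());   // prints 42
    }
}

Collapsing the network into a direct call keeps the sketch short; the point is only the path: the proxy marshals, the dispatcher selects by operationId, the skeleton-like code unmarshals and calls the servant, then marshals the result back.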
● Generation of the classes for proxies, dispatchers and skeletons: The classes for the proxy, dispatcher and skeleton used in RMI are generated automatically by an interface compiler.
● Factory methods: We noted earlier that remote object interfaces cannot include constructors. This means that servants cannot be created by remote invocation on constructors. Servants are created either in the initialization section or in methods in a remote interface designed for that purpose. The term factory method is sometimes used to refer to a method that creates servants, and a factory object is an object with factory methods. Any remote object that needs to be able to create new remote objects on demand for clients must provide methods in its remote interface for this purpose. Such methods are called factory methods, although they are really just normal methods.
● Server threads: Whenever an object executes a remote invocation, that execution may lead to further invocations of methods in other remote objects, which may take some time to return. To avoid the execution of one remote invocation delaying the execution of another, servers generally allocate a separate thread for the execution of each remote invocation. When this is the case, the designer of the implementation of a remote object must allow for the effects on its state of concurrent executions. Some alternatives exist:
  ○ thread-per-request
  ○ thread-per-connection
  ○ thread-per-remote object
● Object location: A location service helps clients to locate remote objects from their remote object references. It uses a database that maps remote object references to their probable current locations – the locations are probable because an object may have migrated again since it was last heard of.

Java RMI (java.rmi package)
An object making a remote invocation is aware that its target is remote because it must handle RemoteExceptions; and the implementor of a remote object is aware that it is remote because it must implement the Remote interface. The arguments and return values in a remote invocation are serialized to a stream using the method described previously, with the following modifications:
1. Whenever an object that implements the Remote interface is serialized, it is replaced by its remote object reference, which contains the name of its (the remote object's) class.
2. When any object is serialized, its class information is annotated with the location of the class (as a URL), enabling the class to be downloaded by the receiver.

● Downloading of classes: Java is designed to allow classes to be downloaded from one virtual machine to another. This is particularly relevant to distributed objects that communicate by means of remote invocation. We have seen that non-remote objects are passed by value and remote objects are passed by reference as arguments and results of RMIs.
​If the recipient does not already possess the                                  class of an object passed by value, its code is downloaded automatically. Similarly, if the recipient of a                                    remote object reference does not already possess the class for a proxy, its code is downloaded                                automatically​. This has two advantages:  ○ There is no need for every user to keep the same set of classes in their working environment. 
  ○ Both client and server programs can make transparent use of instances of new classes whenever they are added.
● RMIregistry: is the binder for Java RMI. An instance of RMIregistry should normally run on every server computer that hosts remote objects. It maintains a table mapping textual, URL-style names to references to remote objects hosted on that computer. It is accessed by methods of the Naming class, whose methods take as an argument a URL-formatted string of the form:
  //computerName:port/objectName
where computerName and port refer to the location of the RMIregistry.
Used in this way, clients must direct their lookup enquiries to particular hosts. Alternatively, it is possible to set up a system-wide binding service. To achieve this, it is necessary to run an instance of the RMIregistry in the networked environment and then use the class LocateRegistry, which is in java.rmi.registry, to discover this registry.
● Distributed garbage collection: the Java distributed garbage collection algorithm is based on reference counting. Whenever a remote object reference enters a process, a proxy will be created and will stay there for as long as it is needed. The process where the object lives (its server) should be informed of the new proxy at the client. Then later, when there is no longer a proxy at the client, the server should be informed. The distributed garbage collector works in cooperation with the local garbage collectors as follows:
  ○ Each server process maintains a set of the names of the processes that hold remote object references for each of its remote objects; for example, B.holders is the set of client processes (virtual machines) that have proxies for object B. This set can be held in an additional column in the remote object table.
  ○ When a client C first receives a remote reference to a particular remote object, B, it makes an addRef(B) invocation to the server of that remote object and then creates a proxy; the server adds C to B.holders.
  ○ When a client C's garbage collector notices that a proxy for remote object B is no longer reachable, it makes a removeRef(B) invocation to the corresponding server and then deletes the proxy; the server removes C from B.holders.
  ○ When B.holders is empty, the server's local garbage collector will reclaim the space occupied by B unless there are any local holders.
This algorithm is intended to be carried out by means of pairwise request-reply communication with at-most-once invocation semantics between the remote reference modules in processes – it does not require any global synchronization. Note also that the extra invocations made on behalf of the garbage collection algorithm do not affect every normal RMI; they occur only when proxies are created and deleted.
There is a possibility that one client may make a removeRef(B) invocation at about the same  time as another client makes an addRef(B) invocation. If the removeRef arrives first and B.holders is  empty, the remote object B could be deleted before the addRef arrives. To avoid this situation, if the set  B.holders is empty at the time when a remote object reference is transmitted, a temporary entry is  added until the addRef arrives. ​The Java distributed garbage collection algorithm tolerates  communication failures by using the following approach. The addRef and removeRef operations are  idempotent.​ In the case that an addRef(B) call returns an exception (meaning that the method was  either executed once or not at all), the client will not create a proxy but will make a removeRef(B) call. 
The effect of removeRef is correct whether or not the addRef succeeded. The case where removeRef fails is dealt with by leases. Servers lease their objects to clients for a limited period of time. The lease period starts when the client makes an addRef invocation to the server. It ends either when the time has expired or when the client makes a removeRef invocation to the server. The information stored by the server concerning each lease contains the identifier of the client's virtual machine and the period of the lease. Clients are responsible for requesting the server to renew their leases before they expire.

DISTRIBUTED ALGORITHMS FOR MUTUAL EXCLUSION
Distributed processes often need to coordinate their activities. We require a solution to distributed mutual exclusion: one that is based solely on message passing. It would be convenient for these processes to be able to obtain mutual exclusion solely by communicating among themselves, eliminating the need for a separate server.

Essential requirements for mutual exclusion are as follows:
ME1: (safety) At most one process may execute in the critical section (CS) at a time.
ME2: (liveness) Requests to enter and exit the critical section eventually succeed.
Condition ME2 implies freedom from both deadlock and starvation.

DEADLOCK: two or more of the processes becoming stuck indefinitely while attempting to enter or exit the critical section, by virtue of their mutual interdependence.

STARVATION: the indefinite postponement of entry for a process that has requested it.

FAIRNESS: absence of STARVATION and use of the HAPPENED-BEFORE ordering between messages that request entry to the critical section.
ME3: (➝ ordering) If one request to enter the CS happened-before another, then entry to the CS is granted in that order.

SYMMETRICAL-INFORMATION ALG.: every process executes the same code.

Performance of algorithms for mutual exclusion:
➔ bandwidth: proportional to the number of messages sent per entry and exit of the CS
➔ throughput: of the entire system, measured via the synchronization delay, i.e. the time between one process exiting the CS and the next one entering it
➔ delay: the delay incurred by a process at each entry to and exit from the CS

Ring-based Mutual Exclusion (ME): N processes communicate with the next process in a clockwise direction, say.
In order to enter the CS, a process must hold the token granting access. The token keeps moving around the ring until a process requires it. The process then holds the token until it completes and exits the CS; at this point, the token is released and passed again to the next process.
Pro: easy to implement, easy to adapt to N CSs, to exit the CS a process observes no delay.
Cons: continuous bandwidth consumption, to enter the CS a process might experience a delay of 0 to N messages, synchronization delay goes from 1 to N messages.

RICART-AGRAWALA (Multicast and logical clocks, [1981]): mutual exclusion between N peer processes based upon multicast. Processes that require entry to a critical section multicast a request message, and can enter it only when all the other processes have replied to this message.
A variable state is used by each process to record its state of being outside the critical section (RELEASED), wanting entry (WANTED) or being in the critical section (HELD).
If two or more processes request entry at the same time, then whichever process's request bears the lowest timestamp will be the first to collect N – 1 replies, granting it entry next. If the requests bear equal Lamport timestamps, the requests are ordered according to the processes' corresponding identifiers.
This algorithm achieves the safety property ME1. If it were possible for two processes p_i and p_j (i ≠ j) to enter the critical section at the same time, then both of those processes would have to have replied to the other. But since the pairs <T_i, p_i> are totally ordered, this is impossible.
Performance:
● Gaining entry takes 2(N – 1) messages in this algorithm: N – 1 to multicast the request, followed by N – 1 replies.
● Client delay in requesting entry is again a round-trip time (with hw multicast support).
● Synchronization delay is only one message transmission time (with hw multicast support).
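A minimal Java sketch of the Ricart–Agrawala logic just described, assuming a message-passing layer that is only stubbed as an interface here (multicastRequest and reply are hypothetical names). The point is the RELEASED/WANTED/HELD state machine, the Lamport-timestamp comparison and the queue of deferred replies.

import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of Ricart–Agrawala for one process p_i. Network stands in for the transport.
public class RicartAgrawala {

    enum State { RELEASED, WANTED, HELD }

    interface Network {
        void multicastRequest(long timestamp, int senderId);   // to all other N-1 processes
        void reply(int toProcessId);                            // one-to-one reply
    }

    private final int myId;
    private final int n;                        // total number of processes
    private final Network net;
    private State state = State.RELEASED;
    private long clock = 0;                     // Lamport clock (only the essentials)
    private long requestTimestamp;              // T_i of my outstanding request
    private int repliesReceived;
    private final Queue<Integer> deferred = new ArrayDeque<>();  // requests answered later

    public RicartAgrawala(int myId, int n, Network net) {
        this.myId = myId; this.n = n; this.net = net;
    }

    // Called by the application to enter the critical section.
    public synchronized void requestEntry() throws InterruptedException {
        state = State.WANTED;
        requestTimestamp = ++clock;
        repliesReceived = 0;
        net.multicastRequest(requestTimestamp, myId);
        while (repliesReceived < n - 1) wait();   // wait for N-1 replies
        state = State.HELD;
    }

    // Called by the application to leave the critical section.
    public synchronized void release() {
        state = State.RELEASED;
        while (!deferred.isEmpty()) net.reply(deferred.poll());  // answer deferred requests
    }

    // Called by the transport when a request <T_j, p_j> arrives.
    public synchronized void onRequest(long timestamp, int senderId) {
        clock = Math.max(clock, timestamp) + 1;
        boolean myRequestHasPriority =
                requestTimestamp < timestamp ||
                (requestTimestamp == timestamp && myId < senderId);
        if (state == State.HELD || (state == State.WANTED && myRequestHasPriority)) {
            deferred.add(senderId);               // defer the reply
        } else {
            net.reply(senderId);                  // reply immediately
        }
    }

    // Called by the transport when a reply to my request arrives.
    public synchronized void onReply() {
        if (++repliesReceived == n - 1) notifyAll();
    }
}

requestEntry blocks until the N – 1 replies have arrived; release answers all requests that were deferred while this process was competing for or inside the CS.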
Fault tolerance: The main points to consider when evaluating the above algorithms with respect to fault tolerance are:
● What happens when messages are lost?
● What happens when a process crashes?
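Before moving on to elections, here is, for comparison with Ricart–Agrawala, a sketch in the same style of the ring-based (token) algorithm described earlier; the Ring interface is a hypothetical stand-in for the channel to the clockwise neighbour.

// Sketch of ring-based (token) mutual exclusion for one process in the ring.
public class TokenRingMutex {

    interface Ring { void sendTokenToNextProcess(); }   // channel to the clockwise neighbour

    private final Ring ring;
    private boolean haveToken = false;   // true only while this process holds the token
    private boolean wantCs = false;      // set while the application waits to enter the CS

    public TokenRingMutex(Ring ring) { this.ring = ring; }

    // Called by the transport when the token arrives from the anticlockwise neighbour.
    public synchronized void onTokenReceived() {
        if (wantCs) {
            haveToken = true;
            notifyAll();                     // requestEntry() can now proceed; the token is retained
        } else {
            ring.sendTokenToNextProcess();   // not interested: forward the token immediately
        }
    }

    // Blocks until the token is held; on return the caller is inside the critical section.
    public synchronized void requestEntry() throws InterruptedException {
        wantCs = true;
        while (!haveToken) wait();
    }

    // Exiting the CS passes the token on: the exiting process observes no delay.
    public synchronized void release() {
        wantCs = false;
        haveToken = false;
        ring.sendTokenToNextProcess();
    }
}

Note how release simply forwards the token, matching the earlier observation that exiting the CS costs the exiting process no delay, while the token otherwise keeps circulating and consuming bandwidth.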
Elections
An algorithm for choosing a unique process to play a particular role is called an election algorithm. It is essential that all the processes agree on the choice.
At any point in time, a process p_i is either a participant – meaning that it is engaged in some run of the election algorithm – or a non-participant – meaning that it is not currently engaged in any election. Our requirements are that, during any particular run of the algorithm:
E1: (safety) A participant process p_i has elected_i = 丄 or elected_i = P, where P is chosen as the non-crashed process at the end of the run with the largest identifier.
E2: (liveness) All processes p_i participate and eventually either set elected_i ≠ 丄 or crash.

The algorithm of Chang and Roberts [1979] (a ring-based election algorithm): suitable for a collection of processes arranged in a logical ring. Each process p_i has a communication channel to the next process in the ring, p_(i+1) mod N, and all messages are sent clockwise around the ring. We assume that no failures occur, and that the system is asynchronous. The goal of this algorithm is to elect a single process called the coordinator, which is the process with the largest identifier. Initially, every process is marked as a non-participant in an election. Any process can begin an election. It proceeds by marking itself as a participant, placing its identifier in an election message and sending it to its clockwise neighbour. When a process receives an election message, it compares the identifier in the message with its own.
If the arrived identifier is greater, then it forwards the message to its neighbour. If the arrived identifier is smaller and the receiver is not a participant, then it substitutes its own identifier in the message and forwards it; but it does not forward the message if it is already a participant. On forwarding an election message in any case, the process marks itself as a participant.
If, however, the received identifier is that of the receiver itself, then this process's identifier must be the greatest, and it becomes the coordinator. The coordinator marks itself as a non-participant once more and sends an elected message to its neighbour, announcing its election and enclosing its identity. When a process p_i receives an elected message, it marks itself as a non-participant, sets its variable elected_i to the identifier in the message and, unless it is the new coordinator, forwards the message to its neighbour.

It is easy to see that condition E1 is met. All identifiers are compared, since a process must receive its own identifier back before sending an elected message. For any two processes, the one with the larger identifier will not pass on the other's identifier. It is therefore impossible that both should receive their own identifier back. Condition E2 follows immediately from the guaranteed traversals of the ring (there are no failures). Note how the non-participant and participant states are used so that duplicate messages arising when two processes start an election at the same time are extinguished as soon as possible, and always before the 'winning' election result has been announced. If only a single process starts an election, then the worst-performing case is when its anti-clockwise neighbour has the highest identifier. A total of N – 1 messages are then required to reach this neighbour, which will not announce its election until its identifier has completed another circuit, taking a further N messages. The elected message is then sent N times, making 3N – 1 messages in all. The turnaround time is also 3N – 1, since these messages are sent sequentially.
While the ring-based algorithm is useful for understanding the properties of election algorithms in general, the fact that it tolerates no failures makes it of limited practical value. However, with a reliable failure detector it is in principle possible to reconstitute the ring when a process crashes.
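The message-handling rules of the Chang–Roberts algorithm fit in a few lines. The sketch below follows the same conventions as the earlier ones (the Ring interface and the two message handlers are hypothetical stand-ins for the real channel to the clockwise neighbour).

import java.util.OptionalInt;

// Sketch of the Chang–Roberts ring-based election for one process.
public class ChangRoberts {

    interface Ring {
        void sendElection(int candidateId);    // to the clockwise neighbour
        void sendElected(int coordinatorId);   // to the clockwise neighbour
    }

    private final int myId;
    private final Ring ring;
    private boolean participant = false;
    private OptionalInt elected = OptionalInt.empty();   // elected_i, initially undefined (丄)

    public ChangRoberts(int myId, Ring ring) { this.myId = myId; this.ring = ring; }

    // Any process may start an election.
    public synchronized void startElection() {
        participant = true;
        ring.sendElection(myId);
    }

    // An election message carrying candidateId has arrived from the anticlockwise neighbour.
    public synchronized void onElection(int candidateId) {
        if (candidateId > myId) {
            participant = true;
            ring.sendElection(candidateId);              // forward the larger identifier
        } else if (candidateId < myId) {
            if (!participant) {                          // substitute own id, but only once
                participant = true;
                ring.sendElection(myId);
            }                                            // already a participant: swallow the duplicate
        } else {                                         // own id came back: this process wins
            participant = false;
            elected = OptionalInt.of(myId);
            ring.sendElected(myId);
        }
    }

    // An elected message announcing the coordinator has arrived.
    public synchronized void onElected(int coordinatorId) {
        participant = false;
        elected = OptionalInt.of(coordinatorId);
        if (coordinatorId != myId) ring.sendElected(coordinatorId);  // forward until it returns to the winner
    }
}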
Bully algorithm: allows processes to crash during an election, although it assumes that message delivery between processes is reliable. Unlike the ring-based algorithm, this algorithm assumes that the system is synchronous: it uses timeouts to detect a process failure. Another difference is that the ring-based algorithm assumed that processes have minimal a priori knowledge of one another: each knows only how to communicate with its neighbour, and none knows the identifiers of the other processes. The bully algorithm, on the other hand, assumes that each process knows which processes have higher identifiers, and that it can communicate with all such processes.
There are three types of message in this algorithm: an election message is sent to announce an election; an answer message is sent in response to an election message; and a coordinator message is sent to announce the identity of the elected process – the new 'coordinator'. A process begins an election when it notices, through timeouts, that the coordinator has failed. Several processes may discover this concurrently.
Since the system is synchronous, we can construct a reliable failure detector. There is a maximum message transmission delay, T_trans, and a maximum delay for processing a message, T_process. Therefore, we can calculate a time T = 2T_trans + T_process that is an upper bound on the time that can elapse between sending a message to another process and receiving a response. The process that knows it has the highest identifier can elect itself as the coordinator simply by sending a coordinator message to all processes with lower identifiers. On the other hand, a process with a lower identifier can begin an election by sending an election message to those processes that have a higher identifier and awaiting answer messages in response. If none arrives within time T, the process considers itself the coordinator and sends a coordinator message to all processes with lower identifiers announcing this. Otherwise, the process waits a further period T' for a coordinator message to arrive from the new coordinator. If none arrives, it begins another election.
When a process is started to replace a crashed process, it begins an election. If it has the highest process identifier, then it will decide that it is the coordinator and announce this to the other processes. Thus it will become the coordinator, even though the current coordinator is functioning. It is for this reason that the algorithm is called the 'bully' algorithm.
This algorithm clearly meets the liveness condition E2, by the assumption of reliable message delivery. And if no process is replaced, then the algorithm meets condition E1.
It is impossible for two processes to decide that                                      they are the coordinator, since the process with the lower identifier will discover that the other exists and defer                                      to it.  But the algorithm is not guaranteed to meet the safety condition E1 if processes that have crashed are                                    replaced by processes with the same identifiers. A process that replaces a crashed process p may decide that                                    it has the highest identifier just as another process (which has detected p’s crash) decides that it has the                                      highest identifier. Two processes will therefore announce themselves as the coordinator concurrently.                        Unfortunately, there are no guarantees on message delivery order, and the recipients of these messages may                                reach different conclusions on which is the coordinator process.    CONSENSUS:  Key coordination and agreement problems related to group communication – that is, how to achieve the                                desired reliability and ordering properties across all members of a group. The system under consideration                              contains a collection of processes, which can communicate reliably over one­to­one channels. As before,                            processes may 
fail only by crashing. The processes are members of groups, which are the destinations of messages sent with the multicast operation.
The operation multicast(g, m) sends the message m to all members of the group g of processes. Correspondingly, there is an operation deliver(m) that delivers a message sent by multicast to the calling process. We use the term deliver rather than receive to make clear that a multicast message is not always handed to the application layer inside the process as soon as it is received at the process's node.
Every message m carries the unique identifier of the process sender(m) that sent it, and the unique destination group identifier group(m). We assume that processes do not lie about the origin or destinations of messages.

BASIC MULTICAST: a basic multicast primitive that guarantees, unlike IP multicast, that a correct process will eventually deliver the message, as long as the multicaster does not crash. We call the primitive B-multicast and its corresponding basic delivery primitive B-deliver. We allow processes to belong to several groups, and each message is destined for some particular group. A straightforward way to implement B-multicast is to use a reliable one-to-one send operation.

RELIABLE MULTICAST: a reliable multicast with corresponding operations R-multicast and R-deliver. Properties analogous to integrity and validity are clearly highly desirable in reliable multicast delivery, but we add another: a requirement that all correct processes in the group must receive a message if any of them does.
A reliable multicast is one that satisfies the following properties:
● Integrity: A correct process p delivers a message m at most once. Furthermore, p ∈ group(m) and m was supplied to a multicast operation by sender(m). (As with one-to-one communication, messages can always be distinguished by a sequence number relative to their sender.)
● Validity: If a correct process multicasts message m, then it will eventually deliver m.
● Agreement: If a correct process delivers message m, then all other correct processes in group(m) will eventually deliver m.

Performance:
● Message complexity O(g²)
● Round complexity O(2) = O(1)

CONSENSUS: the problem is for processes to agree on a value after one or more of the processes has proposed what that value should be. In mutual exclusion, the processes agree on which process can enter the critical section. In an election, the processes agree on which is the elected process. In totally ordered multicast, the processes agree on the order of message delivery. Three related agreement problems exist: Byzantine generals, interactive consistency and totally ordered multicast. We go on to examine under what circumstances the problems can be solved, and to sketch some solutions.
In particular, we discuss the well-known impossibility result of Fischer et al. [1985], which states that in an asynchronous system consensus cannot be guaranteed to be reached if even a single process may fail.
Our system model includes a collection of processes p_i (i = 1, 2, …, N) communicating by message passing. An important requirement that applies in many practical situations is for consensus to be reached even in the presence of faults. We assume, as before, that communication is reliable but that processes may fail.
In this section we consider Byzantine (arbitrary) process failures, as well as crash failures. We sometimes specify an assumption that up to some number f of the N processes are faulty – that is, they exhibit some specified types of fault; the remainder of the processes are correct.
To reach consensus, every process p_i begins in the undecided state and proposes a single value v_i, drawn from a set D (i = 1, 2, …, N). The processes communicate with one another, exchanging values. Each process then sets the value of a decision variable, d_i. In doing so it enters the decided state, in which it may no longer change d_i (i = 1, 2, …, N).

The requirements of a consensus algorithm are that the following conditions should hold for every execution of it:
● Termination: Eventually each correct process sets its decision variable.
● Agreement: The decision value of all correct processes is the same: if p_i and p_j are correct and have entered the decided state, then d_i = d_j (i, j = 1, 2, …, N).
● Integrity: If the correct processes all proposed the same value, then any correct process in the decided state has chosen that value.
Consider a system in which processes cannot fail. It is then straightforward to solve consensus. For example, we can collect the processes into a group and have each process reliably multicast its proposed value to the members of the group. Each process waits until it has collected all N values (including its own). It then evaluates the function majority(v_1, v_2, …, v_N), which returns the value that occurs most often among its arguments, or the special value 丄 ∉ D if no majority exists. Termination is guaranteed by the reliability of the multicast operation.
Agreement and integrity are guaranteed by the definition of majority and the integrity property of a reliable multicast. Every process receives the same set of proposed values, and every process evaluates the same function of those values. So they must all agree, and if every process proposed the same value, then they all decide on this value.
If processes can fail in arbitrary (Byzantine) ways, then faulty processes can in principle communicate random values to the others. This may seem unlikely in practice, but it is not beyond the bounds of possibility for a process with a bug to fail in this way. Moreover, the fault may not be accidental, but the result of mischievous or malevolent operation. Someone could deliberately make a process send different values to different peers in an attempt to thwart the others, which are trying to reach consensus. In case of inconsistency, correct processes must compare what they have received with what other processes claim to have received.

The Byzantine generals problem: In the informal statement of the Byzantine generals problem [Lamport et al. 1982], three or more generals are to agree to attack or to retreat. One, the commander, issues the order. The others, lieutenants to the commander, must decide whether to attack or retreat. But one or more of the generals may be 'treacherous' – that is, faulty. If the commander is treacherous, he proposes attacking to one general and retreating to another. If a lieutenant is treacherous, he tells one of his peers that the commander told him to attack and another that they are to retreat. The Byzantine generals problem differs from consensus in that a distinguished process supplies a value that the others are to agree upon, instead of each of them proposing a value. The requirements are:
● Termination: Eventually each correct process sets its decision variable.
● Agreement: The decision value of all correct processes is the same: if p_i and p_j are correct and have entered the decided state, then d_i = d_j (i, j = 1, 2, …, N).
● Integrity: If the commander is correct, then all correct processes decide on the value that the commander proposed.
Note that, for the Byzantine generals problem, integrity implies agreement when the commander is correct; but the commander need not be correct.

Consensus in a synchronous system: here is described an algorithm to solve consensus in a synchronous system, although it is based on a modified form of the integrity requirement. The algorithm uses only a basic multicast protocol. It assumes that up to f of the N processes exhibit crash failures.
To reach consensus, each correct process collects proposed values from the other processes. The algorithm proceeds in f + 1 rounds, in each of which the correct processes B-multicast the values between themselves.
Their modified form of the integrity requirement applies to the proposed values of all processes, not just the correct ones: if all processes, whether correct or not, proposed the same value, then any correct process in the decided state would choose that value. Given that the algorithm assumes crash failures at worst, the proposed values of correct and non-correct processes would not be expected to differ, at least not on the basis of failures. The revised form of integrity enables the convenient use of the minimum function to choose a decision value from those proposed.
After f + 1 rounds, each process chooses the minimum value it has received as its decision value. To check the correctness of the algorithm, we must show that each process arrives at the same set of values at the end of the final round.
Assume, to the contrary, that two processes differ in their final set of values. Without loss of generality, some correct process p_i possesses a value v that another correct process p_j (i ≠ j) does not possess. The only explanation for p_i possessing a proposed value v at the end that p_j does not possess is that any third process,
p_k, say, that managed to send v to p_i crashed before v could be delivered to p_j. In turn, any process sending v in the previous round must have crashed, to explain why p_k possesses v in that round but p_j did not receive it. Proceeding in this way, we have to posit at least one crash in each of the preceding rounds. But we have assumed that at most f crashes can occur, and there are f + 1 rounds. We have arrived at a contradiction.
It turns out that any algorithm to reach consensus despite up to f crash failures requires at least f + 1 rounds of message exchanges, no matter how it is constructed [Dolev and Strong 1983]. This lower bound also applies in the case of Byzantine failures [Fischer and Lynch 1982].

The Byzantine generals problem in a synchronous system: here we assume that processes can exhibit arbitrary failures. That is, a faulty process may send any message with any value at any time; and it may omit to send any message. Up to f of the N processes may be faulty. Correct processes can detect the absence of a message through a timeout; but they cannot conclude that the sender has crashed, since it may be silent for some time and then send messages again.
We assume that the communication channels between pairs of processes are private. Our default assumption of channel reliability means that no faulty process can inject messages into the communication channel between correct processes.
Lamport et al. [1982] considered the case of three processes that send unsigned messages to one another. They showed that there is no solution that guarantees to meet the conditions of the Byzantine generals problem if one process is allowed to fail. They generalized this result to show that no solution exists if N ≤ 3f. They went on to give an algorithm that solves the Byzantine generals problem in a synchronous system if N ≥ 3f + 1, for unsigned (they call them 'oral') messages.
The algorithm takes account of the fact that a faulty process may omit to send a message. If a correct process does not receive a message within a suitable time limit (the system is synchronous), it proceeds as though the faulty process had sent it the value 丄.
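Before turning to asynchronous systems, here is a small sketch of the f + 1-round, crash-failure consensus algorithm described a few paragraphs above. The rounds are driven from a single method and the per-round B-multicast exchange is abstracted behind an interface with hypothetical names; the structure (f + 1 rounds, forwarding only newly learned values, deciding on the minimum) is the one discussed.

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Sketch of consensus tolerating up to f crash failures in a synchronous system.
public class SynchronousConsensus {

    // Stand-in for one round of B-multicast within the group: the process sends its
    // newly learned values and gets back the values received from the other processes.
    interface RoundExchange {
        Set<Integer> exchange(int round, Set<Integer> newValuesToSend);
    }

    // Runs f+1 rounds; in each round only values not sent before are B-multicast and the
    // values heard from others are merged in. The decision is the minimum value collected.
    public static int decide(int proposedValue, int f, RoundExchange group) {
        TreeSet<Integer> values = new TreeSet<>();
        values.add(proposedValue);
        Set<Integer> alreadySent = new HashSet<>();

        for (int round = 1; round <= f + 1; round++) {
            Set<Integer> toSend = new HashSet<>(values);
            toSend.removeAll(alreadySent);               // forward only newly learned values
            Set<Integer> received = group.exchange(round, toSend);
            alreadySent.addAll(toSend);
            values.addAll(received);                     // merge what was heard this round
        }
        return values.first();                           // decide on the minimum
    }
}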
Impossibility in asynchronous systems: The algorithms seen so far assume that message exchanges take place in rounds, and that processes are entitled to time out and assume that a faulty process has not sent them a message within the round, because the maximum delay has been exceeded.
Fischer et al. [1985] proved that no algorithm can guarantee to reach consensus in an asynchronous system, even with one process crash failure (FLP restriction). In an asynchronous system, processes can respond to messages at arbitrary times, so a crashed process is indistinguishable from a slow one. Their proof involves showing that there is always some continuation of the processes' execution that avoids consensus being reached. Note the word 'guarantee' in the statement of the impossibility result. The result does not mean that processes can never reach distributed consensus in an asynchronous system if one is faulty. It allows that consensus can be reached with some probability greater than zero, confirming what we know in practice.
Three techniques exist for working around the impossibility result:
● fault masking: avoid the impossibility result altogether by masking any process failures that occur; if a process crashes, then it is restarted. The process places sufficient information in persistent storage at critical points in its program so that if it should crash and be restarted, it will find sufficient data to be able to continue correctly with its interrupted task. In other words, it will behave like a process that is correct, but that sometimes takes a long time to perform a processing step.
● failure detectors: processes can agree to deem a process that has not responded for more than a bounded time to have failed. An unresponsive process may not really have failed, but the remaining processes act as if it had done. They make the failure 'fail-silent' by discarding any subsequent messages that they do in fact receive from a 'failed' process. In other words, we have effectively turned an asynchronous system into a synchronous one. This technique is used in the ISIS system [Birman 1993].
● randomizing aspects of the processes' behaviour: The result of Fischer et al. [1985] depends on what we can consider to be an 'adversary'. This is a 'character' (actually, just a collection of random events) who can exploit the phenomena of asynchronous systems so as to foil the processes' attempts to reach consensus. The adversary manipulates the network to delay messages so that they arrive at just the wrong time, and similarly it slows down or speeds up the processes just enough so that they are in the 'wrong' state when they receive a message.
Technologies for Cloud Computing (Lettieri)

HW assisted Virtualization
Popek and Goldberg’s 1974 paper establishes three essential characteristics for system software to be considered a VMM:
1. Fidelity. Software on the VMM executes identically to its execution on hardware, barring timing effects.
2. Performance. An overwhelming majority of guest instructions are executed by the hardware without the intervention of the VMM.
3. Safety. The VMM manages all hardware resources.

We shall use the term classically virtualizable to describe an architecture that can be virtualized purely with trap-and-emulate. In this sense, x86 is not classically virtualizable, but it is virtualizable by Popek and Goldberg’s criteria. The Pentium instruction set contains sensitive, unprivileged instructions: the processor executes them without generating an interrupt or exception, so a VMM never has the opportunity to simulate their effect. x86 has four modes of operation, known as rings, or current privilege levels (CPL), 0 through 3. Ring 0, the most privileged, is occupied by operating systems; application programs execute in Ring 3, the least privileged. The Pentium also has a method to control the transfer of program execution between privilege levels, so that nonprivileged tasks can call privileged system routines: the call gate. In the Intel architecture, privileged instructions cause a general protection exception if the CPL is not equal to 0 (and this is a requirement for a VMM to work properly). The architecture uses interrupts and traps to redirect program execution, allowing interrupt and exception handlers to execute when a privileged instruction is executed by an unprivileged task.
It was found that seventeen instructions do not respect these requirements: the SGDT, SIDT and SLDT instructions are an example (PUSHF and POPF too). These instructions are normally only used by operating systems but are not privileged in the Intel architecture. Since the Intel processor has only one LDTR, IDTR and GDTR, a problem arises when multiple operating systems try to use the same registers. Although these instructions do not protect the sensitive registers from reading by unprivileged software, the processor allows partial protection for these registers by only allowing tasks at CPL 0 to load them. This means that if a VM tries to write to one of these registers, a trap will be generated, and the trap allows the VMM to produce the expected result for the VM. However, if an OS in a VM uses SGDT, SLDT or SIDT to read the contents of the GDTR, LDTR or IDTR, the register contents that apply to the host OS will be returned. This could cause a problem if the operating system of a virtual machine tries to use these values for its own operations: it might see the state of a different VM OS running on the same VMM. Therefore, a VMM must provide each VM with its own virtual set of IDTR, LDTR and GDTR registers (a user-mode sketch of the read problem follows).
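A user-mode sketch of the read problem (illustrative only, not part of the notes): on x86-64 Linux with GCC, and on CPUs/OSes that do not enable UMIP, SGDT executes at CPL 3 without any trap, so a guest issuing it would simply see the host’s GDTR.

    /* Sketch only: SGDT is sensitive but unprivileged, so at CPL 3 it executes
     * without trapping and reveals the host's GDTR (on systems without UMIP,
     * which modern kernels may enable precisely to block this).
     * Assumes x86-64 long mode and GCC. */
    #include <stdio.h>
    #include <stdint.h>

    struct __attribute__((packed)) dtr {
        uint16_t limit;
        uint64_t base;      /* 32-bit in protected mode, 64-bit in long mode */
    };

    int main(void)
    {
        struct dtr gdtr;
        /* No #GP is raised here even though we run in Ring 3: the VMM never
         * gets a chance to intervene, so a guest would see the *host* GDTR. */
        __asm__ volatile ("sgdt %0" : "=m"(gdtr));
        printf("GDTR base=%#llx limit=%#x\n",
               (unsigned long long)gdtr.base, gdtr.limit);
        return 0;
    }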
The most important ideas from classical VMM implementations are de-privileging, shadow structures and traces:
● De-privileging: all instructions that read or write privileged state can be made to trap when executed in an unprivileged context. A classical VMM executes guest operating systems directly, but at a reduced privilege level. The VMM intercepts traps from the de-privileged guest, and emulates the trapping instruction against the virtual machine state.
● Primary and shadow structures: by definition, the privileged state of a virtual system differs from that of the underlying hardware. To fill this gap, the VMM derives shadow structures from guest-level primary structures. On-CPU privileged state, such as the page table pointer register or processor status register, is handled trivially: the VMM maintains an image of the guest register, and refers to that image in instruction emulation as guest operations trap. Off-CPU privileged data, such as page tables, may reside in memory; in this case, guest accesses to the privileged state may not naturally coincide with trapping instructions. Dependencies on this privileged state are not accompanied by traps: every guest virtual memory reference depends on the permissions and mappings encoded in the corresponding PTE. Such in-memory privileged state can be modified by any store in the guest instruction stream, or even implicitly modified as a side effect of a DMA I/O operation. Memory-mapped I/O devices present a similar difficulty: reads and writes to this privileged data can originate from almost any memory operation in the guest instruction stream.
● Traces: to protect the host from guest memory accesses, VMMs typically construct shadow page tables in which to run the guest. x86 specifies hierarchical hardware-walked page tables having 2, 3 or 4 levels; the hardware page table pointer is control register %cr3. The VMM distinguishes true page faults, caused by violations of the protection policy encoded in the guest PTEs, from hidden page faults, caused by misses in the shadow page table. True faults are forwarded to the guest; hidden faults cause the VMM to construct an appropriate shadow PTE and resume guest execution. The fault is “hidden” because it has no guest-visible effect. A sketch of this dispatch follows the list.
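A sketch of the fault dispatch just described, with hypothetical helpers walk_guest_pt(), shadow_map() and inject_pf() (none of these names come from the notes): the VMM first walks the guest’s primary page tables; a violation of the guest’s own policy is a true fault and is injected back into the guest, otherwise the miss is a hidden fault and the VMM silently fills in the shadow PTE.

    /* Sketch only: classifying a guest page fault as 'true' or 'hidden'.
     * walk_guest_pt(), shadow_map() and inject_pf() are hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t gpa; bool present; bool writable; bool user; } gpte_t;

    extern bool walk_guest_pt(uint64_t guest_va, gpte_t *out);    /* read primary (guest) PTE  */
    extern void shadow_map(uint64_t guest_va, const gpte_t *g);   /* install shadow PTE        */
    extern void inject_pf(uint64_t guest_va, bool wr, bool user); /* forward fault to guest    */

    void handle_guest_page_fault(uint64_t guest_va, bool is_write, bool is_user)
    {
        gpte_t g;

        /* Consult the guest's own (primary) page tables first. */
        if (!walk_guest_pt(guest_va, &g) || !g.present ||
            (is_write && !g.writable) || (is_user && !g.user)) {
            /* True fault: the guest's protection policy forbids the access,
             * so the fault is reflected back into the guest. */
            inject_pf(guest_va, is_write, is_user);
            return;
        }

        /* Hidden fault: the access is legal by the guest PTE; the shadow page
         * table simply lacks the entry. Fill it in and resume the guest, which
         * never observes this fault. */
        shadow_map(guest_va, &g);
    }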
x86 obstacles to virtualization
Ignoring the legacy “real” and “virtual 8086” modes of x86, even the more recently architected 32- and 64-bit protected modes are not classically virtualizable:
● Visibility of privileged state. The guest can observe that it has been deprivileged when it reads its code segment selector (%cs), since the current privilege level (CPL) is stored in the low two bits of %cs.
● Lack of traps when privileged instructions run at user level. For example, in privileged code popf (“pop flags”) may change both ALU flags (e.g., ZF) and system flags (e.g., IF, which controls interrupt delivery). For a deprivileged guest, we need kernel-mode popf to trap so that the VMM can emulate it against the virtual IF. Unfortunately, a deprivileged popf, like any user-mode popf, simply suppresses attempts to modify IF; no trap happens (a sketch follows this list).
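A small user-mode sketch of the popf problem (an illustration assuming x86-64 Linux, GCC inline assembly and the usual IOPL=0 at CPL 3; not from the notes): the attempt to clear IF from Ring 3 neither traps nor takes effect.

    /* Sketch only: user-mode popf silently ignores writes to IF instead of
     * trapping. Demo quality: push/pop inside inline asm is acceptable here;
     * a careful build could use -mno-red-zone. */
    #include <stdio.h>
    #include <stdint.h>

    #define IF_BIT (1ULL << 9)   /* interrupt-enable flag in RFLAGS */

    static uint64_t read_flags(void)
    {
        uint64_t f;
        __asm__ volatile ("pushfq; popq %0" : "=r"(f));
        return f;
    }

    int main(void)
    {
        uint64_t before = read_flags();

        /* Try to clear IF from Ring 3: no #GP, no trap... */
        __asm__ volatile ("pushq %0\n\tpopfq" : : "r"(before & ~IF_BIT) : "cc", "memory");

        uint64_t after = read_flags();
        /* ...and IF is simply left unchanged, so a VMM running a deprivileged
         * guest kernel never gets the chance to emulate the write. */
        printf("IF before=%d after=%d (the write was silently dropped)\n",
               !!(before & IF_BIT), !!(after & IF_BIT));
        return 0;
    }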
Simple binary translation
The VMware software VMM uses a translator with these properties:
● Binary. Input is binary x86 code, not source code.
● Dynamic. Translation happens at runtime, interleaved with execution of the generated code.
● On demand. Code is translated only when it is about to execute. This laziness side-steps the problem of telling code and data apart.
● Adaptive. Translated code is adjusted in response to guest behavior changes to improve overall efficiency.
The translator does not attempt to “improve” the translated code. We assume that if guest code is performance critical, the OS developers have optimized it, and a simple binary translator would find few remaining opportunities.

x86 Architecture Extension
The hardware exports a number of new primitives to support a classical VMM for the x86. An in-memory data structure, which we will refer to as the virtual machine control block, or VMCB, combines control state with a subset of the state of a guest virtual CPU. A new, less privileged execution mode, guest mode, supports direct execution of guest code, including privileged code. We refer to the previously architected x86 execution environment as host mode. A new instruction, vmrun, transfers from host to guest mode. Upon execution of vmrun, the hardware loads guest state from the VMCB and continues execution in guest mode. Guest execution proceeds until some condition, expressed by the VMM using control bits of the VMCB, is reached. At this point, the hardware performs an exit operation, which is the inverse of a vmrun operation. On exit, the hardware saves guest state to the VMCB, loads VMM-supplied state into the hardware, and resumes in host mode, now executing the VMM.
Diagnostic fields in the VMCB aid the VMM in handling the exit; e.g., exits due to guest I/O provide the port, width and direction of the I/O operation. After emulating the effect of the exiting operation in the VMCB, the VMM again executes vmrun, returning to guest mode.
The hardware extensions provide a complete virtualization solution, essentially prescribing the structure of our hardware VMM (or indeed any VMM using the extensions). When running a protected-mode guest, the VMM fills in a VMCB with the current guest state and executes vmrun. On guest exits, the VMM reads the VMCB fields describing the conditions for the exit, and vectors to appropriate emulation code; a sketch of this loop follows.
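A sketch of that structure in C, using hypothetical names (vmcb_t, vmrun(), the emulate_* helpers and the exit codes are all illustrative; real VMCB/VMCS layouts and exit reasons are vendor-specific):

    /* Sketch only: the overall loop of a hardware-assisted VMM as described above. */
    #include <stdint.h>

    typedef struct vmcb vmcb_t;          /* control bits + guest CPU state (opaque here) */

    extern uint32_t vmrun(vmcb_t *vmcb); /* enters guest mode, returns an exit reason    */
    extern void emulate_io(vmcb_t *vmcb);      /* uses the port/width/direction fields   */
    extern void emulate_cr_write(vmcb_t *vmcb);
    extern void reflect_exception(vmcb_t *vmcb);

    enum { EXIT_IO, EXIT_CR_WRITE, EXIT_EXCEPTION, EXIT_HLT };   /* illustrative only */

    void run_guest(vmcb_t *vmcb)
    {
        for (;;) {
            /* Hardware loads guest state from the VMCB and runs it directly,
             * including privileged code, until a VMM-selected condition occurs. */
            uint32_t exit_reason = vmrun(vmcb);

            /* On exit the guest state is back in the VMCB; vector to emulation
             * code using the diagnostic fields, then re-enter guest mode. */
            switch (exit_reason) {
            case EXIT_IO:        emulate_io(vmcb);        break;
            case EXIT_CR_WRITE:  emulate_cr_write(vmcb);  break;
            case EXIT_EXCEPTION: reflect_exception(vmcb); break;
            case EXIT_HLT:       return;   /* e.g., the guest shut down */
            }
        }
    }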