Reliability Evaluation Techniques

Unit IV
Dr. Lenin SB
Associate Professor/ECE

 Introduction to Reliability Evaluation Techniques –
 Reliability Models for Hardware Redundancy –
 Permanent faults only - Transient faults.
 Introduction to clock synchronization –
 A Non-Fault-Tolerant Synchronization Algorithm –
 Fault-Tolerant Synchronization in Hardware –
 Completely connected zero propagation time system –
 Sparse interconnection zero propagation time system –
 Fault tolerant analysis with Signal Propagation delays.

What is Reliability Evaluation?
 The process of determining whether an existing system / entity has
achieved a specified level of operational reliability (desired, agreed upon
or contracted behaviour).

Software Reliability Definition
The probability that the software will operate as required (i.e., without failure),
for a specified time, in a specified environment.
Software Reliability – features
• Failures in software are design faults,
• Reliability during test changes continually (new problems are found as
old ones are fixed / new code is never perfect)
• Phenomenon of software reliability growth
• Environment is important (platform/inputs)
• A new environment may require the software to be retested

Hardware Reliability - features
• failure is usually due to physical deterioration
• hardware reliability tends, more than software, towards a constant value,
• hardware reliability usually follows the ‘bathtub’ principle,
• again, environment is important; a proportion of hardware faults are
design faults

When we talk of reliability measures the irony is that we invariably talk
about failure measures.
There are four general ways of measuring failures against time;
 Time of failure,
 Interval between failures,
 Cumulative failures experienced up to a given time,
 Failures experienced in a time interval.

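[Figure: relationship between mistakes, faults, errors and failures. A mistake (which a person makes) leads to zero or many faults; a fault leads to zero or many errors; an error, combined with a revealing mechanism (environment, operator or input), potentially leads to zero or many failures. In the other direction, each failure, error or fault can be attributed to one or many of the items that precede it.]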

Hardware Reliability is ensured by conducting the following tests:
 Fault Tree Analysis
 Failure Modes Effects and Criticality Analysis
 Failsafe Tests
 Fault Injection Tests
 PCB Trace Analysis and Circuit Simulation
 Environmental Tests

Software Reliability is ensured by applying the following techniques:
 Defensive Programming
 To produce programs which detect anomalous control flow, data flow
 or data values during their execution and react to these in a predetermined and
acceptable manner.
 Fault Detection & Diagnosis
 To detect faults in a system, which might lead to a failure, thus providing the basis for
countermeasures in order to minimize the consequences of failures.

Error Detecting and Correcting Codes
 To detect and correct errors in sensitive information.
Diverse Programming
 Detect and mask residual software design faults during execution of a program, in order
to prevent Safety critical failures of the system, and to continue operation for high
reliability.
Software Error Effect Analysis
 To identify software modules, their criticality; to propose means for detecting software
errors and enhancing software robustness; to evaluate the amount of validation needed
on the various software components.

 Software Quality Audit
 Software Rule Checking
 Unit Testing
 Software Integration Tests
 Software/Hardware Integration Tests
 Fault Injection Tests
 System Validation

 Computers used in life-critical applications must be so reliable that they
cannot be validated by experiment alone.
 For the products of most computer companies, a purely experimental approach
is impractical in such cases. To get around this difficulty, we use
mathematical models of reliability.
 We construct a mathematical model of the real-time computer and solve
it. By doing this, we add one more possible source of error: the
assumptions of the mathematical model.
 The correctness of the assumptions is a necessary condition for the
correctness of the predictions of the model.

 Reliability is one of the most important characteristics of a real-time
system, because real-time systems are mostly used in critical applications
where the margin for error should be as close to zero as possible, given the
potential loss of life or damage to the system or process at hand.
 Degradation of such systems is heavily monitored to minimize risks and
failures. This keeps down-time as close to zero as possible and also
limits any impact on profits.
 For example, if an embedded pacemaker were not completely accurate and
reliable, it could alter the regular heartbeat, which could cause loss of
life for the patient.

 The most difficult problem in reliability modeling is to keep the complexity of models
sufficiently small.
 Even when the various parameters of the model are exponentially distributed, the model
can become unacceptably complex. Current techniques for reducing the complexity of
such models consist largely of state aggregation and decomposition.
 In state aggregation, multiple states are grouped together and treated as a single state.
 In decomposition, the overall model is broken down into sub models, and each sub
model is solved separately.
 These techniques are approximations only, but approximations mandated by the
underlying difficulty of the problem.

 The reliability of components is usually specified through a probability distribution function
of the lifetime of those components.
For example,
 If failures occur as a Poisson process with rate $\lambda$, the lifetime distribution is given by
$F_l(t) = 1 - \exp(-\lambda t)$
 If lifetimes follow a Weibull distribution with shape parameter $\alpha$ and scale
parameter $\lambda$, the lifetime distribution is given by
$F_l(t) = 1 - \exp(-(\lambda t)^{\alpha})$
 We will denote by $f_l(t)$ the associated density function (we will assume here that $F_l(t)$ is
differentiable).

 The hazard rate h(t) of a component with age t is defined as the rate of failure at time t,
given that it has not failed up to time t.
 We can use Bayes’s law to express the hazard rate as function of the lifetime distribution
function.
 $h(t)\,dt = \mathrm{prob}\{\text{system fails in } [t, t+dt] \mid \text{system has not failed up to } t\}$
 $= \dfrac{\mathrm{prob}\{\text{system fails in } [t, t+dt] \cap \text{system has not failed up to } t\}}{\mathrm{prob}\{\text{system has not failed up to } t\}} = \dfrac{f_l(t)\,dt}{1 - F_l(t)}$
 $\therefore\ h(t) = \dfrac{f_l(t)}{1 - F_l(t)}$
 If the failure process is Poisson with rate $\lambda$, $h(t) = \dfrac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$
Note: Bayes' law describes the probability of an event based on prior knowledge of conditions that might be related to the event.

 If the failure process is Weibull with shape and scale parameters $\alpha$ and $\lambda$,
 $h(t) = \alpha \lambda (\lambda t)^{\alpha - 1}$
If $0 < \alpha < 1$, then h(t) decreases with time.
If $\alpha = 1$, the failure process is Poisson.
If $\alpha > 1$, h(t) increases with time.
[Figures: lifetime distributions for λ = 1; a bathtub curve]
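
The definition h(t) = f_l(t) / (1 − F_l(t)) can be checked numerically against the closed form for the Weibull case. The following is a minimal Python sketch; the function names are illustrative and do not come from the slides.

```python
import math

def weibull_cdf(t: float, lam: float, alpha: float) -> float:
    """Lifetime distribution F_l(t) = 1 - exp(-(lam*t)**alpha)."""
    return 1.0 - math.exp(-((lam * t) ** alpha))

def weibull_pdf(t: float, lam: float, alpha: float) -> float:
    """Density f_l(t): the derivative of F_l(t) with respect to t."""
    return alpha * lam * (lam * t) ** (alpha - 1) * math.exp(-((lam * t) ** alpha))

def hazard(t: float, lam: float, alpha: float) -> float:
    """Hazard rate from the definition h(t) = f_l(t) / (1 - F_l(t)), for t > 0."""
    return weibull_pdf(t, lam, alpha) / (1.0 - weibull_cdf(t, lam, alpha))

def hazard_closed_form(t: float, lam: float, alpha: float) -> float:
    """Closed form alpha * lam * (lam*t)**(alpha - 1)."""
    return alpha * lam * (lam * t) ** (alpha - 1)

# alpha < 1: decreasing hazard; alpha = 1: constant (Poisson); alpha > 1: increasing
for alpha in (0.5, 1.0, 2.0):
    print(alpha, [round(hazard(t, 1.0, alpha), 4) for t in (0.5, 1.0, 2.0)])
```

With alpha = 1 the Weibull hazard reduces to the constant λ of the Poisson case.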

 Many real life components have a hazard rate shaped according to the bath tub curve,
shown in figure. In the beginning the hazard rate is quite high, and then it begins to drop.
 This is known as infant-mortality phase, where components with manufacturing defects are
cleared out.
 The rate then becomes approximately constant, before aging effects set in and cause the
hazard rate to rise with age.
Note: a Weibull plot is a plot of the empirical cumulative distribution function of data on special axes, in a type of Q-Q plot.

 Series – parallel systems
 NMR clusters
 Combinatorial model
 Markov chain model
 Voter reliability

 In a series connection, the failure of any one component results in system failure.
 In a parallel connection, all the components must fail before the system fails. R(ci) denotes the
reliability of component ci over a given interval [0,t].

 Consider an N-modular redundant (NMR) cluster.
 Faulty processors are immediately identified and disconnected from the system.
 The system therefore always consists of good processors only.
 There is no repair.

 The system will fail only if there are fewer than two functional processors left in the system.
 Since there is no repair, all the failures are assumed to be permanent. The probability of
system failure over the interval [0,t] is given by
$\mathrm{Prob}\{\text{system failure in } [0,t]\} = \sum_{i=0}^{1} \mathrm{prob}\{\text{exactly } i \text{ processors functional at } t\}$
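
The sum above can be evaluated once a processor lifetime model is chosen. The sketch below assumes (this is not stated on the slide) that the N processors fail independently, each with an exponential lifetime of rate `lam`, so the number of functional processors at time t is binomial. The function name is illustrative.

```python
import math

def nmr_failure_probability(n: int, lam: float, t: float) -> float:
    """P{fewer than two of n processors are functional at time t}.

    Assumption: independent processors, each with reliability exp(-lam*t).
    """
    r = math.exp(-lam * t)  # reliability of one processor over [0, t]
    # Sum over i = 0 and i = 1 functional processors (binomial terms).
    return sum(math.comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(2))

# Three processors, each with reliability 0.9 over the interval of interest
lam = -math.log(0.9)
print(nmr_failure_probability(3, lam, 1.0))
```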

Stage | Error Sources | Error Detection
Specification & Design | Algorithm Design; Formal Specification | Consistency Checks; Simulation
Prototype | Algorithm Design; Wiring & Assembly; Timing; Component Failure | Stimulus/Response Testing
Manufacture | Wiring & Assembly; Component Failure | System Testing; Diagnostics
Installation | Assembly; Component Failure | System Testing; Diagnostics
Field Operation | Component Failure; Operator Errors; Environmental Factors | Diagnostics

 MTTF: Mean Time To (first) Failure, or expected life, is defined as the expected value of the time to failure tf:
MTTF = E(tf) = $\int_0^{\infty} R(t)\,dt = 1/\lambda$
 where λ is the (constant) failure rate
 MTTF of a system is the expected time of the first failure in a sample of identical initially
perfect systems.
 MTTR: Mean Time To Repair is defined as the expected time for repair.
 MTBF: Mean Time Between Failures

Availability = MTBF/(MTBF+MTTR)
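
As a rough numerical check of these definitions, the sketch below integrates R(t) = exp(−λt) with the trapezoidal rule and compares the result with 1/λ, then evaluates the availability formula. The function names, the integration limit and the sample values are assumptions chosen for illustration.

```python
import math

def mttf_numeric(reliability, upper: float, steps: int = 20000) -> float:
    """Trapezoidal estimate of the integral of R(t) over [0, upper]."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    total += sum(reliability(k * h) for k in range(1, steps))
    return total * h

def availability(mtbf: float, mttr: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

lam = 0.5  # hypothetical constant failure rate
print(mttf_numeric(lambda t: math.exp(-lam * t), upper=50 / lam), 1 / lam)
print(availability(mtbf=990.0, mttr=10.0))
```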

 Building a reliable serial system is extraordinarily difficult and expensive.
 For example: if one is to build a serial system with 100 components each of which had a
reliability of 0.999, the overall system reliability would be 0.999^100 ≈ 0.905
 Reliability of System of Components
 Minimal Path Set:
 Minimal set of components whose functioning ensures the functioning of the system:
{1,3,4} {2,3,4} {1,5} {2,5}

 Parallel Connected Components
 $Q_k(t)$ is $1 - R_k(t)$:
 $Q_k(t) = 1 - e^{-\lambda_k t}$
 Assuming the failures of components are statistically independent,
$Q_{par}(t) = \prod_{k=1}^{n} Q_k(t)$
 Overall system reliability: $R_{par}(t) = 1 - \prod_{i=1}^{n} (1 - R_i(t))$

 Parallel and Serial Connected Components
 Total reliability is the reliability of the first half, in serial with the second half.
 Given R1=0.9, R2=0.9, R3=0.99, R4=0.99, R5=0.87
 Rt = (1 − (1 − 0.9)(1 − 0.9))(1 − (1 − 0.87)(1 − (0.99 × 0.99))) = 0.987
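
The serial and parallel rules compose directly, so the worked example can be reproduced in a few lines. This is a minimal Python sketch that assumes independent component failures; the helper names `series` and `parallel` are illustrative.

```python
import math

def series(*reliabilities: float) -> float:
    """Series connection: every component must work."""
    return math.prod(reliabilities)

def parallel(*reliabilities: float) -> float:
    """Parallel connection: fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

r1, r2, r3, r4, r5 = 0.9, 0.9, 0.99, 0.99, 0.87
# (R1 parallel R2) in series with (R5 parallel (R3 in series with R4))
r_total = series(parallel(r1, r2), parallel(r5, series(r3, r4)))
print(round(r_total, 3))

# 100 components of reliability 0.999 in series
print(round(series(*[0.999] * 100), 3))
```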

What is a fault?
Fault is an erroneous state of software or hardware resulting from failures of its
components
• Fault Sources
• Design errors
• Manufacturing Problems
• External disturbances
• Harsh environmental conditions
• System Misuse

• Mechanical -- “wears out”
• Deterioration: wear, fatigue, corrosion
• Shock: fractures, overload, etc.
• Electronic Hardware -- “bad fabrication; wears out”
• Latent manufacturing defects
• Operating environment: noise, heat, ESD, electro-migration
• Design defects
• Software -- “bad design”
• Design defects
• “Code rot” -- accumulated run-time faults
• People
• Can take a whole lecture content...

Failure: Component does not provide service
Fault: A defect within a system
Error: A deviation from the required operation of the system or subsystem
Extent: Local (independent) or Distributed (related)
Value:
Determinate
Indeterminate (varying values)
Duration:
Transient
Intermittent
Permanent

There is a four-fold categorization of ways to deal with system faults and increase system reliability
and/or availability.
• Methods for Minimizing Faults
• Fault Avoidance: How to prevent the fault occurrence. Increase reliability by
conservative design and use high reliability components.
• Fault Tolerance: How to provide the service complying with the specification in spite
of faults having occurred or occurring.

• Fault Removal: How to minimize the presence of faults.
• Fault Forecasting: How to estimate the presence, occurrence, and the consequences
of faults.
• Fault-Tolerance is the ability of a computer system to survive in the presence of faults.

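[Figure: recovery block. The input goes to the primary version. If the result is passed, it becomes the output. If it failed, the state is rolled back using the recovery memory and an alternate version is tried. The block fails when the result has failed and the alternates are exhausted.]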

• Fault recovery technique's success depends on the detection of faults accurately and as
early as possible.
• Three classes of recovery procedures:
• Full Recovery
• It requires all the aspects of fault tolerant computing.
• Degraded recovery: Also referred as graceful degradation. Similar to full recovery but no
subsystem is switched-in.
• Defective component is taken out of service.
• Suited for multiprocessors.
• Safe Shutdown

Forward Recovery
• Produces correct results through continuation of normal processing.
• Highly application dependent
Backward Recovery
• Some redundant process and state information is recorded with the progress of
computation.
• Rollback the interrupted process to a point for which the correct information is
available.
• e.g. Retry, Check pointing, Journaling

• Reliability
• Serial Reliability, Parallel Reliability, System Reliability
• Fault Tolerance
• Hardware,Software

Issue
• Synchronization within one system is hard enough
• Semaphores
• Messages
• Monitors
• Synchronization among processes in a distributed system is much harder

• Time is an interesting and important issue
• Example: at what time of day did a particular event occur at a particular computer? This matters for
consistency (use of timestamps for serialization), e-commerce, authentication, etc.
• Algorithms that depend upon clock synchronization have been developed for several
problems.
• Due to loose synchrony, the notion of physical time is problematic in DS
• There is no absolute physical “global time” in DS

• How time is really measured?
• Earlier: Solar day, solar second, mean solar second
• Solar day: time between two consecutive transits of the sun
• Solar second: 1/86400 of a solar day
• Mean solar day: average length of a solar day
• Problem: solar day gets longer because of slowdown of earth rotation due to friction (300
million years ago there were 400 days per year)

• International Atomic Time (TAI): number of ticks of Cesium 133 atom since 1/1/58
(atomic second)
• Atom clock: one second defined as (since 1967) 9,192,631,770 transitions of the atom
Cesium 133
• Because of slowdown of earth, leap seconds have to be introduced
• TAI corrected by leap seconds is called Coordinated Universal Time (UTC): 30 leap seconds
introduced so far
• Network Time Protocol (NTP) can synchronize globally with an accuracy of up to 50
msec

 TAI seconds are of constant length, unlike solar seconds. Leap seconds are introduced
when necessary to keep in phase with the sun.

• Let C(t) be a perfect clock
• A clock Ci(t) is called correct at time t if Ci(t) = C(t)
• A clock Ci(t) is called accurate at time t if dCi(t)/dt = dC(t)/dt = 1
• Two clocks Ci(t) and Ck(t) are synchronized at time t if Ci(t) = Ck(t)

• Computers contain physical clock (crystal oscillator)
• Physical time t, hardware time Hi(t), software time Ci(t)
• The clock output can be read by software and scaled into a suitable time unit, and the value can be
used to timestamp any event: Ci(t) = αHi(t) + β
• Clock skew: The instantaneous difference between the readings of any
two clocks
• Clock drift: Crystal-based clocks count time at different rates, and so diverge.

• Underlying oscillators are subject to physical variations, with the consequence that their
frequencies of oscillation differ
• Even the same clock’s frequency varies with temperature.
• Designs exist that attempt to compensate for this variation, but they cannot eliminate it.
• The difference in the oscillations between two clocks might be small, but the difference accumulated
over many oscillations leads to an observable difference
• For clocks based on a quartz crystal, the drift is about 10^-6 sec/sec, giving a difference
of one second every 1,000,000 sec, or about 11.6 days.

You want to catch the bus at 5pm in the stop, but your watch is off by 15
minutes
• What if your watch is Late by 15 minutes?
• What if your watch is Fast by 15 minutes?
Synchronization is required for
• Correctness
• Fairness

Airline reservation system
• Server A receives a client request to purchase last ticket on flight ABC 123.
• Server A timestamps purchase using local clock 9h:15m:32.45s, and logs it. Replies ok
to client.
• That was the last seat. Server A sends message to Server B saying “flight full.”
• B enters “Flight ABC 123 full” + local clock value (which reads 9h:10m:10.11s) into its
log.
• Server C queries A’s and B’s logs. Is confused that a client purchased a ticket after the
flight became full.
• May execute incorrect or unfair actions.

• An Asynchronous Distributed System (DS) consists of a number of processes.
• Each process has a state (values of variables).
• Each process takes actions to change its state, which may be an instruction or a communication
action (send, receive).
• An event is the occurrence of an action.
• Each process has a local clock – events within a process can be assigned timestamps, and thus
ordered linearly.
• But – in a DS, we also need to know the time order of events across different processes.
 Clocks across processes are not synchronized in an asynchronous DS
(unlike in a multiprocessor/parallel system, where they are). So…
1. Process clocks can be different
2. Need algorithms for either (a) time synchronization, or (b) for telling which event happened before which

• In a DS, each process has its own clock.
• Clock Skew versus Drift
• Clock Skew = Relative Difference in clock values of two processes
• Clock Drift = Relative Difference in clock frequencies (rates) of two processes
• A non-zero clock drift causes skew to increase (eventually).
• Maximum Drift Rate (MDR) of a clock
• Absolute MDR is defined relative to Coordinated Universal Time (UTC). UTC is the
“correct” time at any point of time.
• MDR of a process depends on the environment.
• Max drift rate between two clocks with similar MDR is 2 * MDR
• Max-Synch-Interval = (MaxAcceptableSkew − CurrentSkew) / (MDR * 2)
• (i.e., time = distance/speed)

• If the UTC time is t and process i’s time is Ci(t), then ideally we would like to have Ci(t)
= t, or dC/dt = 1.
• In practice, we use a tolerance variable ρ, such that
$1 - \rho \le \dfrac{dC}{dt} \le 1 + \rho$
• In external synchronization, a clock is synchronized with an authoritative external source of time
• In internal synchronization, clocks are synchronized with one another with a known degree of
accuracy

• Ci(t): the reading of the software clock at process i when the real time is t.
• External synchronization: For a synchronization bound D > 0, and for source S of UTC
time, $|S(t) - C_i(t)| < D$ for i = 1, 2, ..., N and for all real times t. Clocks Ci are externally
accurate to within the bound D.
• Internal synchronization: For a synchronization bound D > 0, $|C_i(t) - C_j(t)| < D$ for i, j = 1, 2, ..., N
and for all real times t. Clocks Ci are internally accurate within the bound D.

• External synchronization with D ⇒ Internal synchronization with 2D
• Internal synchronization with D ⇒ External synchronization with ??

• UTC signals are synchronized and broadcast regularly from land-based
radio stations and satellites covering many parts of the world
• E.g. in the US the radio station WWV broadcasts time signals on several short-wave
frequencies
• Satellite sources include Geo-stationary Operational Environmental Satellites (GOES)
and the GPS

• Radio waves travel at near the speed of light. The propagation delay can be accounted
for if the exact speed and the distance from the source are known
• Unfortunately, the propagation speed varies with atmospheric conditions – leading to
inaccuracy
• Accuracy of a received signal is a function of both the accuracy of the source and its
distance from the source through the atmosphere

 The relation between clock time and UTC when clocks tick at different rates.
Problem: Show that, in order to guarantee that no two clocks differ by more than δ, clocks must be
resynchronized at least every δ/(2ρ) seconds.

• The constant ρ is specified by the manufacturer and is known as the maximum drift rate.
• If two clocks are drifting from Coordinated Universal Time (UTC) in opposite directions,
at a time Δt after they are synchronized, they may be as much as 2*ρ*Δt apart.
• If the operating system designer wants to guarantee that no two clocks ever differ by more
than δ, clocks must be synchronized at least every δ/(2ρ) seconds.
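
The bound follows from setting 2ρΔt = δ and solving for Δt. The sketch below computes the interval; the `current_skew` argument corresponds to the Max-Synch-Interval form given earlier, and the sample values for ρ and δ are hypothetical.

```python
def max_sync_interval(max_skew: float, max_drift_rate: float,
                      current_skew: float = 0.0) -> float:
    """Longest time between resynchronizations that keeps the skew within max_skew.

    Two clocks drifting in opposite directions diverge at 2 * max_drift_rate,
    so the remaining skew budget is used up after budget / (2 * rho) seconds.
    """
    if max_drift_rate <= 0:
        raise ValueError("max_drift_rate must be positive")
    return (max_skew - current_skew) / (2.0 * max_drift_rate)

# Hypothetical values: rho = 1e-6 s/s, delta = 1 ms
print(max_sync_interval(max_skew=1e-3, max_drift_rate=1e-6))
```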

Remember the definition of synchronous distributed system?
• Known bounds for message delay, clock drift rate and execution time.
• Clock synchronization is easy in this case
• In practice most DS are asynchronous.
• Cristian’s Algorithm
• The Berkeley Algorithm

• Consider internal synchronization between two processes in a synchronous DS
• P sends time t on its local clock to Q in a message m
• In principle, Q could set its clock to the time t + Ttrans, where Ttrans is the time taken to
transmit m between them
• The two processes would then agree (internal synch)

• Unfortunately, Ttrans is subject to variation and is unknown
• All processes are competing for resources with P and Q and other messages are
competing with m for the network
• But there is always a minimum transmission time min that would be obtained if no other
processes executed and no other network traffic existed
• min can be measured or conservatively estimated

• In synch system, by definition, there is also an upper bound max on the time taken to
transmit any message
• Let the uncertainty in the msg transmission time be u, so that u = (max – min)
• If Q sets its clock to be (t + min), then clock skew may be as much as u (since the message may
in fact have taken time max to arrive).
• If Q sets it to (t + max), the skew may again be as large as u.
• If, however, Q sets its clock to (t + (max + min)/2), then the skew is at most u/2.
• In general, for a synch system, the optimum bound that can be achieved on clock skew when
synchronizing N clocks is u(1-1/N)
• For an asynchronous system Ttrans = min + x, where x >=0

Asynchronous system
• Achieves synchronization only if the observed RTT between the client and server is sufficiently
short compared with the required accuracy.
Observations:
• RTT between processes are reasonably short in practice, yet theoretically unbounded
• Practical estimate possible if RTT is sufficiently short in comparison to required accuracy
• In a LAN, RTT should be around 1-10 ms, during which a clock with a drift rate of 10^-6 s/s varies by at
most 10^-5 ms. Hence the estimate of RTT is reasonably accurate
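
The slides name Cristian's algorithm without spelling out the computation. The sketch below follows the usual textbook form, which is an assumption here: the client measures the round-trip time of a request to the time server and sets its clock to the server's reported time plus half the RTT. The function and parameter names are illustrative.

```python
def cristian_estimate(t_request: float, t_reply: float, server_time: float,
                      min_delay: float = 0.0) -> tuple[float, float]:
    """Return (estimated current time, accuracy bound) for the client.

    t_request and t_reply are read from the client's own clock; server_time is
    the value carried in the reply. min_delay is the minimum one-way
    transmission time, if known.
    """
    rtt = t_reply - t_request
    estimate = server_time + rtt / 2.0   # assume the reply took half the RTT
    accuracy = rtt / 2.0 - min_delay     # estimate is within +/- this bound
    return estimate, accuracy

print(cristian_estimate(t_request=10.0, t_reply=10.5, server_time=100.0, min_delay=0.1))
```

A shorter observed RTT gives a tighter bound, which is why the method works only when the RTT is small compared with the required accuracy.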

• A coordinator (time server): master
• Just the opposite approach of Cristian’s algorithm
• Periodically the master polls the time of each client (slave) whose clocks are to be synchronized.
• Based on the answer (by observing the RTT as in Cristian’s algorithm), it computes the average
(including its own clock value) and broadcasts the new time.
• This method is suitable for a system in which no machine has a WWV receiver getting the
UTC.
• The time daemon’s time must be set manually by the operator periodically.
• The balance of probabilities is that the average cancels out the individual clock’s
tendencies to run fast or slow

• The accuracy depends upon a nominal maximum RTT between the master and the slaves
• The master eliminates any occasional readings associated with larger times than this
maximum
• Instead of sending the updated current time back to the computers, which would introduce
further uncertainty due to message transmission time, the master sends the amount by
which each individual slave’s clock requires adjustment (+ or -)
• The algorithm eliminates readings from faulty clocks (since these could have significant
adverse effects if an ordinary average was taken): a subset of clocks is chosen that do not
differ by more than a specified amount, and then the average is taken.

• The time daemon asks all the other machines for their clock values
• The machines answer
• The time daemon tells everyone how to adjust their clock
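
These steps can be sketched as follows. The sketch simplifies in two ways that are assumptions rather than part of the slides: RTT compensation of the polled readings is omitted, and the trusted subset is taken to be the readings within `max_deviation` of the median. All names are illustrative.

```python
from statistics import mean, median

def berkeley_adjustments(master_time: float, slave_times: dict[str, float],
                         max_deviation: float) -> dict[str, float]:
    """Return the adjustment (+ or -) that each machine should apply to its clock."""
    readings = {"master": master_time, **slave_times}
    reference = median(readings.values())
    # Ignore readings from clocks that are too far off (possibly faulty).
    trusted = [v for v in readings.values() if abs(v - reference) <= max_deviation]
    target = mean(trusted)
    # Send offsets, not absolute times, to avoid adding transmission-time error.
    return {name: target - value for name, value in readings.items()}

# Times in minutes past midnight: master 3:00, slaves 3:25 and 2:50
print(berkeley_adjustments(180.0, {"a": 205.0, "b": 170.0}, max_deviation=30.0))
```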

• Both Cristian’s and Berkeley’s methods are highly centralized, with the usual
disadvantages - single point of failure, congestion around the server, … etc.
• One class of decentralized clock synchronization algorithms works by dividing time into
fixed-length re-synchronization intervals.
• The ith interval starts at T0 + iR and runs until T0 + (i+1)R, where T0 is an agreed upon
moment in the past, and R is a system parameter.
• At the beginning of each interval, every machine broadcasts the current time according to
its clock.
• After a machine broadcasts its time, it starts a local timer to collect all other broadcasts
that arrive during some interval S.
• When all broadcasts arrive, an algorithm is run to compute a new time.

• Some algorithms:
• average out the time.
• discard the m highest and m lowest and average the rest -- this is to prevent up to m
faulty clocks sending out nonsense
• correct each message by adding to it an estimate of propagation time from the source.
This estimate can be made from the known topology of the network, or by timing how
long it takes for probe message to be echoed.
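
The second option, discarding the m highest and m lowest values before averaging, can be sketched in Python as follows. The function name and the sample readings are illustrative.

```python
def fault_tolerant_average(readings: list[float], m: int) -> float:
    """Average the clock readings after dropping the m highest and m lowest,
    so that up to m faulty clocks cannot pull the result arbitrarily far."""
    if m < 0 or len(readings) <= 2 * m:
        raise ValueError("need more than 2*m readings")
    kept = sorted(readings)[m:len(readings) - m]
    return sum(kept) / len(kept)

# One clock (50.0) is sending nonsense; m = 1 removes it along with the lowest value
print(fault_tolerant_average([10.0, 10.2, 9.9, 50.0, 10.1], m=1))
```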

Cristian’s and Berkeley algorithms ⇒ synchronization within an intranet
• NTP – defines an architecture for a time service and a protocol to distribute
time information over the Internet
• Provides a service enabling clients across the Internet to be synchronized accurately to
UTC
• Provides a reliable service that can survive lengthy losses of connectivity
• Enables client to resynchronize sufficiently frequently to offset clock drifts
• Provides protection against interference with the time service

• Uses a network of time servers to synchronize all processes on a network.
• Time servers are connected by a synchronization subnet tree. The root is in touch with
UTC. Each node synchronizes its children nodes.
[Figure: NTP synchronization subnet. Stratum 1: primary server, direct synchronization. Stratum 2: secondary servers, synchronized by the primary server. Stratum 3: servers synchronized by the secondary servers.]

• t and t′: actual transmission times for m and m′ (unknown)
• o: true offset of clock at B relative to clock at A
• o_i: estimate of the actual offset between the two clocks
• d_i: estimate of accuracy of o_i; total transmission time for m and m′; d_i = t + t′
[Figure: message exchange between Server A and Server B over time. Message m is sent at T_{i-3} and received at T_{i-2}; message m′ is sent at T_{i-1} and received at T_i.]
$T_{i-2} = T_{i-3} + t + o$
$T_i = T_{i-1} + t' - o$
This leads to
$d_i = t + t' = T_{i-2} - T_{i-3} + T_i - T_{i-1}$
$o = o_i + (t' - t)/2$, where
$o_i = (T_{i-2} - T_{i-3} + T_{i-1} - T_i)/2$.
It can then be shown that
$o_i - d_i/2 \le o \le o_i + d_i/2$.
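
The offset and delay estimates follow directly from the four timestamps. The sketch below computes them and checks the bound on the true offset for a made-up exchange. The function name and the sample values of t, t′ and o are illustrative.

```python
def ntp_offset_delay(t_i3: float, t_i2: float, t_i1: float, t_i: float) -> tuple[float, float]:
    """Return (o_i, d_i) from the timestamps T_{i-3}, T_{i-2}, T_{i-1}, T_i."""
    d_i = (t_i2 - t_i3) + (t_i - t_i1)          # total transmission time t + t'
    o_i = ((t_i2 - t_i3) + (t_i1 - t_i)) / 2.0  # offset estimate
    return o_i, d_i

# Made-up exchange: true offset o = 5, t = 2, t' = 4
o, t, t_prime = 5.0, 2.0, 4.0
T_i3 = 100.0                  # m sent (A's clock)
T_i2 = T_i3 + t + o           # m received (B's clock)
T_i1 = T_i2 + 1.0             # m' sent (B's clock)
T_i = T_i1 + t_prime - o      # m' received (A's clock)

o_i, d_i = ntp_offset_delay(T_i3, T_i2, T_i1, T_i)
print(o_i, d_i, o_i - d_i / 2 <= o <= o_i + d_i / 2)
```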

• NTP servers apply a data filtering algorithm to successive pairs < oi , di> which estimates
the offset o and calculates the quality of this estimate as a statistical quantity called the
filter dispersion.
• The eight most recent pairs < oi , di> are retained
• The value of oi that corresponds to the min value of di is chosen to estimate o.
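
A minimal sketch of the selection rule described above: keep the eight most recent pairs and pick the offset with the smallest delay. The class name is hypothetical, and the filter dispersion statistic is not computed here.

```python
from collections import deque

class OffsetFilter:
    """Keeps the most recent <o_i, d_i> pairs and selects the minimum-delay offset."""

    def __init__(self, size: int = 8) -> None:
        self.pairs: deque[tuple[float, float]] = deque(maxlen=size)

    def add(self, offset: float, delay: float) -> None:
        # When the deque is full, the oldest pair is dropped automatically.
        self.pairs.append((offset, delay))

    def best_offset(self) -> float:
        """Offset o_i of the retained pair with the smallest delay d_i."""
        return min(self.pairs, key=lambda pair: pair[1])[0]

f = OffsetFilter()
f.add(99.0, 0.1)              # old low-delay sample, will age out
for k in range(9):
    f.add(float(k), 1.0 + k)  # nine newer samples with growing delay
print(len(f.pairs), f.best_offset())
```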

• Compare time Ts provided by the time server to time Tc at computer C
• If Ts > Tc (e.g. 9:07 am vs 9:05 am), could advance C’s time to Ts
• May miss some clock ticks, probably OK
• If Ts < Tc (e.g. 9:07 am vs 9:10 am), cannot rollback C’s time to Ts
• Many applications assume that time always advances

• The solution is not to set C’s clock back – but can cause C’s clock to run slowly until it
resynchronizes with the time server
• This can be achieved in SW, w/o changing the rate at which the HW clock ticks (an
operation which is not always supported by HW clocks)
• Calculation …

• Value received from UTC receiver is only accurate to within 0.1–10 milliseconds
• At best, we can synchronize clocks to within 10–30 milliseconds of each other
• We have to synchronize frequently, to avoid local clock drift

• Time synchronization important for distributed systems
• Cristian’s algorithm
• Berkeley algorithm
• NTP
• Relative order of events enough for practical purposes
• Lamport’s logical clocks
• Vector clocks
