Sa 007 availability

Software Architecture

Quality Attributes & Tactics (4)
Availability

Vakgroep Informatietechnologie – IBCN

Availability

Availability is about system failure and its consequences.

Faults & Failures:
 Faults become failures if not corrected or masked.
 A failure is observable by the system user; a fault is not.
Areas of concern:
 Fault detection and frequency
 Reduced operations
 Recovery and Prevention

Availability = MTBF / (MTBF + MTTR)
where MTBF is the mean time between failures and MTTR is the mean time to repair.
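
As a worked example of the availability ratio, the sketch below computes availability and the implied yearly downtime. The function names and the figures (999 hours between failures, 1 hour to repair) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected unavailable hours in one year for a given availability."""
    return (1.0 - avail) * hours_per_year


# Hypothetical figures: one failure every 999 hours, one hour to repair.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"availability = {a:.3f}")                               # 0.999
print(f"downtime/year = {downtime_hours_per_year(a):.2f} h")   # 8.76 h
```

Shortening the repair time raises availability just as lengthening the time between failures does, which is why the tactics later in the deck cover both recovery and prevention.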

Availability Generic Scenario



Source of stimulus (who or what):
 We differentiate between internal and external indications of faults or
failure since the desired system response may be different.

Stimulus (does something):
A fault of one of the following classes occurs:
 Omission. A component fails to respond to an input.
 Crash. The component repeatedly suffers omission faults.
 Timing. A component responds but the response is early or late.
 Bad response. A component responds with an incorrect value.
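
The four fault classes can be told apart from what an observer sees of a component. The following is a minimal sketch, assuming a hypothetical `Observation` record (the returned value, or `None` when nothing came back, plus the measured latency) and a hypothetical threshold of three consecutive omissions before a component is treated as crashed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Fault(Enum):
    OMISSION = "omission"
    TIMING = "timing"
    BAD_RESPONSE = "bad response"


@dataclass
class Observation:
    value: Optional[int]   # None means no response arrived
    latency_ms: float      # time between request and response


def classify(obs: Observation, expected: int,
             min_ms: float, max_ms: float) -> Optional[Fault]:
    """Return the fault class of one observation, or None if it is correct."""
    if obs.value is None:
        return Fault.OMISSION
    if not (min_ms <= obs.latency_ms <= max_ms):
        return Fault.TIMING            # response came too early or too late
    if obs.value != expected:
        return Fault.BAD_RESPONSE
    return None


def is_crash(history: List[Optional[Fault]], threshold: int = 3) -> bool:
    """Crash = repeated omission: the last `threshold` observations were all omissions."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(f is Fault.OMISSION for f in recent)
```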

Artifact (to the system or part of it):
This specifies the resource that is required to be highly available:
 Processor,
 Communication channel,
 Process,
 Storage.



Environment (under certain conditions):
The state of the system affects the desired system response.
 Normal mode: if this is the first fault observed, some degradation of
response time or function may be preferred
 Degraded mode: if the system has already seen some faults it may
be desirable to shut it down totally.
 Overload mode:



Response (how the system reacts):
The system should detect the event and do one or more of the following:
 Record it
 Notify appropriate parties, including the user and other systems
 Disable sources of events that cause fault or failure according to defined rules
 Be unavailable for a specified interval, where the interval depends on the criticality of the system
 Continue to operate in normal or degraded mode



Response measure (how can you measure this):
 Time interval when the system must be available
 Availability time
 Time interval in which system can be in degraded mode
 Repair time


Availability Specific Scenario

“An unanticipated external message (DoS attack) is
received by a process during normal operation. The
process logs the receipt of the message, notifies the
operator and continues with no downtime.”


Case: Digital Signage – Public Transport

Availability QAS:

SOURCE (who or what): A random event
STIMULUS (does something): ... causes a failure
ARTIFACT (to the system or part of it): ... to the communication system
ENVIRONMENT (under certain conditions): ... during normal operations
RESPONSE (how the system reacts): All displays must start showing scheduled arrival times for all buses
MEASURE (how you can measure this): ... within 30 seconds of failure detection

Q: What is the architectural impact of this requirement?
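
One possible reading of this requirement, offered only as a discussion aid, is that each display must hold the timetable locally, because it cannot fetch the schedule once the communication system has failed. The sketch below illustrates that idea; the `Display` class and its method names are hypothetical and not part of the case description.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Display:
    scheduled: Dict[str, str]                          # bus line -> scheduled arrival, stored locally
    live: Dict[str, str] = field(default_factory=dict) # last real-time arrivals received
    failure_detected_at: Optional[float] = None

    def on_live_update(self, arrivals: Dict[str, str]) -> None:
        """A real-time update arrived, so the link is considered healthy."""
        self.live = dict(arrivals)
        self.failure_detected_at = None

    def on_link_failure(self, now: float) -> None:
        """Fault detection reported a communication failure at time `now`."""
        if self.failure_detected_at is None:
            self.failure_detected_at = now

    def shown(self) -> Dict[str, str]:
        """What the display shows: the local schedule while the link is down."""
        if self.failure_detected_at is not None:
            return self.scheduled
        return self.live or self.scheduled
```

Because the switch happens at the moment of detection, this sketch meets the 30-second measure trivially; the harder architectural questions are how the failure is detected and how the local timetable is kept up to date.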


Availability Tactics
[Figure: a fault enters the tactics to control availability and leaves as a fault masked or repaired.]

 Fault Detection
 Echo
 Heartbeat
 Exceptions
 Fault Recovery
 Preparing for recovery
 Accomplishing the recovery
 Fault Prevention
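
To make the heartbeat detection tactic concrete, here is a minimal sketch in which each component reports periodically and a monitor flags the ones that have gone silent. The class name, the component names and the 3-second timeout are hypothetical; time is passed in explicitly so that the example is deterministic. Ping/echo works the other way around: the monitor sends a request and expects a reply within a deadline.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat time per component and reports the silent ones."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def beat(self, component: str, now: float) -> None:
        """Called when a component emits its periodic heartbeat."""
        self.last_seen[component] = now

    def suspects(self, now: float) -> List[str]:
        """Components whose last heartbeat is older than the timeout."""
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat("sensor-a", now=0.0)
monitor.beat("sensor-b", now=0.0)
monitor.beat("sensor-a", now=2.5)    # sensor-b stays silent
print(monitor.suspects(now=4.0))     # ['sensor-b']
```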

Fault Recovery Tactics (1/4)
 Voting Tactic:
 Processes running on redundant processors each take the
input, compute and report the results to the “vote-counter.”
 Majority rules
 Preferred Component

 Preferred component:
 This corrects faulty operation of components, algorithms or
processors.
 The more severe the consequences of failures the more stringent
the effort to ensure that the redundancy is independent.
– Separate processors, separate implementation teams, … dissimilar
platforms
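
The vote-counter described above can be sketched as follows. This is an illustration under stated assumptions: the function name `vote` and the `preferred` parameter (the index of the replica trusted when no majority exists) are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(results: Sequence[Hashable], preferred: Optional[int] = None) -> Hashable:
    """Return the strict-majority result; otherwise fall back to the preferred replica."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) / 2:
        return value                   # majority rules
    if preferred is not None:
        return results[preferred]      # preferred component
    raise RuntimeError("no majority and no preferred component")


print(vote([42, 42, 41]))              # 42
print(vote([1, 2, 3], preferred=0))    # 1
```

Voting only masks a fault if the replicas fail independently, which is why the slide stresses separate processors, teams and platforms.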



 Active redundancy (hot restart):
 All redundant components respond to events in parallel
 Redundant components are synchronized at start; the first response to
return is used as the answer.
 This covers some faults, since a faulty processor will be
slower to respond.
 When a failure occurs the downtime is usually only
milliseconds (switching to another component).
 Often used in client-server applications involving back-
end databases.
 In high availability for LANs the redundancy may be
separate paths so that failure of a bridge or router is not
fatal. Note the synchronization demands here.
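
The "first to return is the answer" rule can be sketched with the standard library's thread pool. The replicas here are hypothetical functions that compute the same result; one is artificially slowed down to stand in for a degraded processor.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Sequence


def first_response(replicas: Sequence[Callable[[int], int]], event: int) -> int:
    """Send the event to every replica in parallel and use the first answer."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica, event) for replica in replicas]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)      # do not block on the slower replicas


def healthy(x: int) -> int:
    return x * 2


def slow(x: int) -> int:
    time.sleep(0.2)                    # stands in for a degraded processor
    return x * 2


print(first_response([slow, healthy], 21))   # 42
```

Taking the first answer masks slowness and crashes but not wrong answers; masking a bad response needs a comparison of results such as voting.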


 Passive Redundancy:
 One component responds to events and informs the standbys
of state updates.
 Upon failure the system must:
 Ensure that the backup is sufficiently fresh.
 Restart points, checkpoints, log points ???
 Remap the system to switch which system is the active
component.
 Often used in control systems
 Example : Air traffic Control
 Chapter 6: Air Traffic Control: A Case Study in
Designing for High Availability
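
A minimal sketch of passive redundancy, assuming hypothetical `Replica` and `PassiveGroup` classes: the active replica handles each event and forwards the state update to the standbys, and on failure the freshest standby is promoted. Counting applied updates is used here as a simple stand-in for checking that the backup is sufficiently fresh.

```python
from typing import Dict, List


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Dict[str, int] = {}
        self.applied = 0                   # number of updates applied so far

    def apply(self, key: str, value: int) -> None:
        self.state[key] = value
        self.applied += 1


class PassiveGroup:
    """One active replica handles events and forwards each state update to the standbys."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.active, *self.standbys = replicas

    def handle(self, key: str, value: int) -> None:
        self.active.apply(key, value)
        for standby in self.standbys:      # inform standbys of the state update
            standby.apply(key, value)

    def fail_over(self) -> Replica:
        """Promote the freshest standby and remap it as the active component."""
        if not self.standbys:
            raise RuntimeError("no standby left")
        freshest = max(self.standbys, key=lambda r: r.applied)
        self.standbys.remove(freshest)
        self.active = freshest
        return freshest


group = PassiveGroup([Replica("primary"), Replica("backup-1"), Replica("backup-2")])
group.handle("altitude", 31000)
promoted = group.fail_over()
print(promoted.name, promoted.state)       # backup-1 {'altitude': 31000}
```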


 Switchovers
 Upon failure or periodically
 Synchronization:
 is the responsibility of the primary component, broadcasting
synchronization signals to the redundant components.


Fault Prevention Tactics

 Removal from service
 To perform some preventive actions, e.g.,
rebooting to prevent slow memory leaks from
causing problems
 Transactions
 The bundling of a sequence of steps so that
they can be done, or undone, all at once
 Process monitor
 Once a fault in a process is detected, the monitor removes the
process, creates a new instance and initializes its state
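
The transaction tactic can be illustrated with Python's built-in `sqlite3` module, whose connection object commits on success and rolls back on an exception when used as a context manager. The table, the account names and the `transfer` helper are hypothetical. The credit is deliberately executed before the debit so that a failing debit shows the earlier step being undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Credit dst and debit src as one bundle; undo both if either step fails."""
    try:
        with conn:   # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


print(transfer(conn, "a", "b", 30))    # True
print(transfer(conn, "a", "b", 500))   # False; the credit to b is rolled back
print(conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# [('a', 70), ('b', 30)]
```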


Availability Tactics Hierarchy
Availability tactics (a fault arrives; the tactics leave it masked or repaired):

Fault detection: Ping/echo, Heartbeat, Exception
Recovery (preparation and repair): Voting, Active redundancy, Passive redundancy, Spare
Recovery (reintroduction): Shadow, State resynchronization, Rollback
Prevention: Removal from service, Transactions, Process monitor

