Software Architecture


Quality Attributes & Tactics (4)
Availability




  Vakgroep Informatietechnologie – IBCN
Availability

Availability is about system failure and its consequences.

  Faults & Failures:
        Faults become failures if not corrected or masked.
        A failure is observable by the system user; a fault is not.
  Areas of concern:
        Fault detection and frequency
        Reduced operations
        Recovery and prevention

  Availability = MTBF / (MTBF + MTTR)
  (MTBF: mean time between failures; MTTR: mean time to repair)
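
As a worked example of the formula above, the following minimal sketch computes availability and the yearly downtime it implies. The function names and the numbers (999 hours between failures, 1 hour to repair) are illustrative assumptions, not values from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(a: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year implied by availability a."""
    return (1.0 - a) * hours_per_year


# Hypothetical numbers: one failure every 999 hours, 1 hour to repair.
a = availability(999.0, 1.0)
print(f"availability  = {a:.3f}")                              # 0.999
print(f"downtime/year = {downtime_hours_per_year(a):.2f} h")   # 8.76 h
```

Halving MTTR has roughly the same effect on downtime as doubling MTBF, which is why recovery tactics matter as much as prevention.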
   Vakgroep Informatietechnologie – Onderzoeksgroep IBCN            p. 2
Availability Generic Scenario




Availability generic scenario (1/4)

Source of stimulus: who or what?
      We differentiate between internal and external indications of faults or
       failure since the desired system response may be different.

Stimulus: does something?
A fault of one of the following classes occurs.
      Omission. A component fails to respond to an input.
      Crash. The component repeatedly suffers omission faults.
      Timing. A component responds but the response is early or late.
      Bad response. A component responds with an incorrect value.
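
The four fault classes above can be sketched as a small classifier. This is a minimal illustration, not a definitive taxonomy: the names `Fault`, `classify`, and `is_crash`, the timing window parameters, and the threshold of three consecutive omissions for a crash are all assumptions.

```python
from enum import Enum
from typing import Optional


class Fault(Enum):
    NONE = "none"
    OMISSION = "omission"
    TIMING = "timing"
    BAD_RESPONSE = "bad response"


def classify(response: Optional[int], expected: int,
             elapsed_s: Optional[float],
             earliest_s: float, latest_s: float) -> Fault:
    """Classify one request/response exchange.

    response and elapsed_s are None when the component never answered.
    """
    if response is None or elapsed_s is None:
        return Fault.OMISSION
    if elapsed_s < earliest_s or elapsed_s > latest_s:
        return Fault.TIMING          # too early or too late
    if response != expected:
        return Fault.BAD_RESPONSE    # on time, but the value is wrong
    return Fault.NONE


def is_crash(history: list[Fault], threshold: int = 3) -> bool:
    """Treat `threshold` consecutive omissions as a crash (assumed threshold)."""
    tail = history[-threshold:]
    return len(tail) == threshold and all(f is Fault.OMISSION for f in tail)
```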

Artifact: to the system or part of it?
This specifies the resource that is required to be highly available:
      Processor,
      Communication channel,
      Process,
      Storage.

Availability generic scenario (2/4)




Environment: under certain conditions
The state of the system affects the desired system response.
   Normal mode: if this is the first fault observed, some degradation of
    response time or function may be preferred
   Degraded mode: if the system has already seen some faults it may
    be desirable to shut it down totally.
   Overload mode:




Availability generic scenario (3/4)


Response: how the system reacts?
The system should detect the event and:
      Record it
      Notify appropriate parties, including the user and other
       systems
      Disable sources of events that cause fault or failure
       according to defined rules
      Be unavailable for a specified interval, where the interval
       depends on the criticality of the system
      Continue to operate in normal or degraded mode



Availability generic scenario (4/4)

Response measure: how can you measure this?
   Time interval when the system must be available
   Availability time
   Time interval in which system can be in degraded mode
   Repair time




Availability Specific Scenario

“An unanticipated external message (DoS attack) is
received by a process during normal operation. The
process logs the receipt of the message, notifies the
operator and continues with no downtime.”




Case: Digital Signage – Public Transport

Availability QAS:

SOURCE        who or what                    A random event
STIMULUS      does something                 ... causes a failure
ARTIFACT      to the system or part of it    ... to the communication system
ENVIRONMENT   under certain conditions       ... during normal operations
RESPONSE      how the system reacts          All displays must start showing scheduled arrival times for all buses
MEASURE       how you can measure this       ... within 30 seconds of failure detection

      Q: What is the architectural impact of this requirement ?
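
One possible reading of the architectural impact, sketched under assumptions: each display has to keep the schedule locally and detect the communication failure itself, so it can fall back without the network. The class `Display`, its fields, and the 10-second detection timeout are hypothetical; the slides specify only the 30-second bound after detection.

```python
from dataclasses import dataclass, field


@dataclass
class Display:
    schedule: dict[str, str]              # bus line -> scheduled arrival, stored locally
    detect_after_s: float = 10.0          # hypothetical failure-detection timeout
    live: dict[str, str] = field(default_factory=dict)
    last_update_s: float = 0.0
    mode: str = "live"

    def on_update(self, now_s: float, arrivals: dict[str, str]) -> None:
        """Real-time data arrived over the communication system."""
        self.live = dict(arrivals)
        self.last_update_s = now_s
        self.mode = "live"

    def tick(self, now_s: float) -> dict[str, str]:
        """Return what to show; fall back once the link is considered failed."""
        if now_s - self.last_update_s > self.detect_after_s:
            self.mode = "scheduled"
        return self.schedule if self.mode == "scheduled" else self.live
```

Because the fallback data is local, the switch happens on the first `tick` after detection, well inside the 30-second measure as long as `tick` runs at least that often.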

Availability Tactics
  Fault -> Tactics to Control Availability -> Fault Masked or Repaired

   Fault Detection
    Echo
    Heartbeat
    Exceptions
   Fault Recovery
    Preparing for recovery
    Accomplishing the recovery
   Fault Prevention
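
As an illustration of the heartbeat detection tactic listed above, here is a minimal sketch: components report in periodically, and the monitor suspects any component whose last heartbeat is older than a timeout. The class name, the timeout value, and passing time in explicitly (to keep the example deterministic) are assumptions.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per component and flags the silent ones."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def beat(self, component: str, now_s: float) -> None:
        """Called whenever a component emits its periodic heartbeat."""
        self.last_seen[component] = now_s

    def suspected_failed(self, now_s: float) -> list[str]:
        """Components whose last heartbeat is older than the timeout."""
        return sorted(name for name, seen in self.last_seen.items()
                      if now_s - seen > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat("router", now_s=0.0)
monitor.beat("database", now_s=0.0)
monitor.beat("database", now_s=4.0)          # the router stays silent
print(monitor.suspected_failed(now_s=5.0))   # ['router']
```

With ping/echo the monitor would ask and wait for a reply instead; the timeout logic stays the same.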
Fault Recovery Tactics (1/4)
    Voting Tactic:
          Processes running on redundant processors each take the
           input, compute and report the results to the “vote-counter.”
               Majority rules
               Preferred Component


    Preferred component:
               This corrects faulty operation of components, algorithms or
                processors.
               The more severe the consequences of failures the more stringent
                the effort to ensure that the redundancy is independent.
                 –    Separate processors, separate implementation teams, … dissimilar
                      platforms
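
The two voting rules named above can be sketched as follows. This is an interpretation, not the slides' definition: here "majority rules" requires a strict majority, and "preferred component" is assumed to mean falling back to one trusted replica when no majority exists.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(results: Sequence[int]) -> Optional[int]:
    """Return the value reported by a strict majority, or None if there is none."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(results) else None


def preferred_component_vote(results: Sequence[int], preferred: int = 0) -> int:
    """Use the majority value if one exists, otherwise trust the preferred replica.

    `preferred` is the index of the trusted replica; results must be non-empty.
    """
    winner = majority_vote(results)
    return winner if winner is not None else results[preferred]


print(majority_vote([7, 7, 9]))             # 7: the deviant replica is outvoted
print(preferred_component_vote([1, 2, 3]))  # 1: no majority, replica 0 is preferred
```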




Fault Recovery Tactics (2/4)

   Active redundancy (hot restart):
        All redundant components respond to events in parallel
        Redundant components are synchronized at the start; the
         first to return provides the answer.
        This covers some faults. A faulty processor will be
         slower to respond.
        When a failure occurs the downtime is usually only
         milliseconds (switching to another component).
        Often used in client-server applications involving
         back-end databases.
        In high availability for LANs the redundancy may be
         separate paths so that failure of a bridge or router is not
         fatal. Note the synchronization demands here.
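
The "first to return is the answer" rule can be sketched with threads standing in for redundant processors. This is a minimal illustration under assumptions: the replica functions are hypothetical, and an exception raised by the first replica to finish is simply propagated rather than masked.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Sequence


def first_response(replicas: Sequence[Callable[[int], int]], event: int) -> int:
    """Send the event to every replica in parallel and return the first answer."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica, event) for replica in replicas]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        # Do not block on slower replicas; they finish in the background.
        pool.shutdown(wait=False)


def healthy(x: int) -> int:
    return x * 2


def sluggish(x: int) -> int:
    time.sleep(0.2)      # simulates a faulty, slow processor
    return x * 2


print(first_response([sluggish, healthy], 21))   # 42
```

Every replica does the full work for every event, which is the price paid for the near-zero switchover time.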



Fault Recovery Tactics (3/4)
   Passive Redundancy:
        One component responds to events and informs the standbys
         of state updates.
   Upon failure the system must:
        Ensure that the backup is sufficiently fresh.
        Restart points, checkpoints, log points ???
        Remap the system so that the backup becomes the active
         component.
   Often used in control systems
        Example: Air Traffic Control
             Chapter 6: Air Traffic Control: A Case Study in
              Designing for High Availability
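
A minimal sketch of passive redundancy, assuming the primary pushes every state update to its standbys and that "sufficiently fresh" can be approximated by counting applied updates. The classes `Replica` and `PassiveGroup` are hypothetical names for illustration.

```python
class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state: dict[str, int] = {}
        self.applied = 0          # number of state updates applied so far

    def apply(self, key: str, value: int) -> None:
        self.state[key] = value
        self.applied += 1


class PassiveGroup:
    def __init__(self, primary: Replica, standbys: list[Replica]) -> None:
        self.primary = primary
        self.standbys = standbys

    def handle(self, key: str, value: int) -> None:
        """Only the primary responds; it then informs the standbys."""
        self.primary.apply(key, value)
        for standby in self.standbys:
            standby.apply(key, value)

    def fail_over(self) -> Replica:
        """Promote the freshest standby to be the active component."""
        if not self.standbys:
            raise RuntimeError("no standby available")
        freshest = max(self.standbys, key=lambda r: r.applied)
        self.standbys.remove(freshest)
        self.primary = freshest
        return freshest
```

In a real system the updates would travel over a network and could lag, which is where checkpoints and logs come in to bring the promoted standby up to date.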




Fault Recovery Tactics (4/4)
   Switchovers:
        Upon failure or periodically
   Synchronization:
        Is the responsibility of the primary component, which broadcasts
         synchronization signals to the redundant components.




Fault Prevention Tactics

   Removal from service
     To perform some preventive actions, e.g.,
      rebooting to prevent slow memory leaks from
      causing problems
   Transactions
     The bundling of a sequence of steps so that
      the whole bundle can be applied, or undone, all at once
   Process monitor
     Once a fault in a process is detected:
              remove the process, reinstantiate it, reinitialize its state
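
The transaction tactic can be illustrated with an all-or-nothing runner: each step comes with an undo action, and a failure rolls back whatever already completed. This is a sketch under assumptions; `run_transaction` and the account example are hypothetical and ignore concurrency and durability.

```python
from typing import Callable

Step = tuple[Callable[[], None], Callable[[], None]]   # (do, undo)


def run_transaction(steps: list[Step]) -> bool:
    """Run all steps; if one fails, undo the completed ones in reverse order."""
    completed: list[Callable[[], None]] = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception:
        for undo in reversed(completed):
            undo()
        return False
    return True


accounts = {"a": 100, "b": 0}


def debit() -> None:
    accounts["a"] -= 30


def undo_debit() -> None:
    accounts["a"] += 30


def failing_credit() -> None:
    raise RuntimeError("credit service unavailable")


ok = run_transaction([(debit, undo_debit), (failing_credit, lambda: None)])
print(ok, accounts)   # False {'a': 100, 'b': 0}
```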


Availability Tactics Hierarchy
  Fault arrives -> Availability tactics -> Fault masked or repaired

  Fault detection
        Ping/echo
        Heartbeat
        Exception
  Recovery: preparation and repair
        Voting
        Active redundancy
        Passive redundancy
        Spare
  Recovery: reintroduction
        Shadow
        State resynchronization
        Rollback
  Prevention
        Removal from service
        Transactions
        Process monitor





Editor's Notes

  • #11 Issues with ping/echo and heartbeat: they measure “are you alive?”.
    The functionality is simple, but consider:
      1) Response time under high load?
      2) Capacity of the ping server
      3) Availability of the communication channel
    Complexity:
      - Trade-off with performance:
          - periodic
          - data content