Sa 007 availability


  • Issues with Ping/Echo/Heartbeat: these tactics measure “are you alive?”. The functionality is simple, but: 1) What is the response time under high load? 2) What is the capacity of the ping server? 3) Is the communication channel available? Complexity: there is a trade-off with performance (how periodic the messages are, and how much data content they carry).

    1. Software Architecture: Quality Attributes & Tactics (4), Availability. Vakgroep Informatietechnologie – IBCN
    2. Availability
       Availability is about system failure and its consequences.
       Faults & Failures:
         • Faults become failures if not corrected or masked.
         • A failure is observable by the system user; a fault is not.
       Areas of concern:
         • Fault detection and frequency
         • Reduced operations
         • Recovery and prevention
       Availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair.
       Vakgroep Informatietechnologie – Onderzoeksgroep IBCN
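The availability formula on this slide can be worked through numerically. The following is a minimal sketch; the function names and the sample figures (an MTBF of 999 hours and an MTTR of 1 hour) are illustrative assumptions, not values from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year_hours(avail: float) -> float:
    """Expected unavailable hours in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


# Hypothetical figures: one failure every 999 hours, one hour to repair.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"{a:.3f}")                           # 0.999
print(f"{downtime_per_year_hours(a):.2f}")  # 8.76
```

Note that the same availability can be reached either by failing less often (higher MTBF) or by repairing faster (lower MTTR), which is why the tactics later in the deck cover both prevention and recovery.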
    3. Availability Generic Scenario
    4. Availability generic scenario (1/4)
       Source of stimulus (who or what?): We differentiate between internal and external indications of faults or failures, since the desired system response may be different.
       Stimulus (does something?): A fault of one of the following classes occurs.
         • Omission: a component fails to respond to an input.
         • Crash: the component repeatedly suffers omission faults.
         • Timing: a component responds, but the response is early or late.
         • Bad response: a component responds with an incorrect value.
       Artifact (to the system or part of it?): This specifies the resource that is required to be highly available: processor, communication channel, process, or storage.
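The four fault classes above can be told apart with a few simple checks. This is a rough sketch under stated assumptions: the caller knows the expected value and the acceptable response-time window, and a "crash" is taken to mean a configurable number of consecutive omissions. All names (`Fault`, `classify`, `is_crash`) are hypothetical.

```python
from enum import Enum
from typing import Any, List, Optional


class Fault(Enum):
    OMISSION = "omission"
    TIMING = "timing"
    BAD_RESPONSE = "bad response"


def classify(response: Any, expected: Any, elapsed_s: float,
             earliest_s: float, latest_s: float) -> Optional[Fault]:
    """Classify one request/response exchange; None means no fault."""
    if response is None:                      # no answer at all
        return Fault.OMISSION
    if elapsed_s < earliest_s or elapsed_s > latest_s:
        return Fault.TIMING                   # answered too early or too late
    if response != expected:
        return Fault.BAD_RESPONSE             # answered on time, wrong value
    return None


def is_crash(history: List[Optional[Fault]], threshold: int = 3) -> bool:
    """Crash = repeated omission; `threshold` consecutive omissions is an assumption."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(f is Fault.OMISSION for f in recent)


print(classify(None, 7, 0.0, 0.1, 1.0))   # Fault.OMISSION
print(classify(7, 7, 2.5, 0.1, 1.0))      # Fault.TIMING
print(classify(8, 7, 0.5, 0.1, 1.0))      # Fault.BAD_RESPONSE
```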
    5. Availability generic scenario (2/4)
       Environment (under certain conditions): The state of the system affects the desired system response.
         • Normal mode: if this is the first fault observed, some degradation of response time or function may be preferred.
         • Degraded mode: if the system has already seen some faults, it may be desirable to shut it down totally.
         • Overload mode:
    6. Availability generic scenario (3/4)
       Response (how the system reacts?): The system should detect the event and then:
         • Record it.
         • Notify appropriate parties, including the user and other systems.
         • Disable sources of events that cause fault or failure, according to defined rules.
         • Be unavailable for a specified interval, where the interval depends on the criticality of the system.
         • Continue to operate in normal or degraded mode.
    7. Availability generic scenario (4/4)
       Response measure (how can you measure this?):
         • Time interval when the system must be available
         • Availability time
         • Time interval in which the system can be in degraded mode
         • Repair time
    8. Availability Specific Scenario
       “An unanticipated external message (DOS attack) is received by a process during normal operation. The process logs the receipt of the message, notifies the operator and continues with no downtime.”
    9. Case: Digital Signage – Public Transport
       Availability QAS:
         SOURCE (who or what): A random event
         STIMULUS (does something): ... causes a failure
         ARTIFACT (to the system or part of it): ... to the communication system
         ENVIRONMENT (under certain conditions): ... during normal operations
         RESPONSE (how the system reacts): All displays must start showing scheduled arrival times for all buses
         MEASURE (how you can measure this): ... within 30 seconds of failure detection
       Q: What is the architectural impact of this requirement?
    10. Availability Tactics
        Fault -> Tactics to Control Availability -> Fault Masked or Repaired
          • Fault Detection: Echo, Heartbeat, Exceptions
          • Fault Recovery: Preparing for recovery, Accomplishing the recovery
          • Fault Prevention
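The heartbeat tactic listed under fault detection can be sketched as a monitor that records when each component last reported in and suspects any component that has been silent for longer than a timeout. This is an illustrative sketch only: the class name, the timeout value, and the component names are assumptions, and the clock is injectable so that the example runs deterministically.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per component and flags the silent ones."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def beat(self, component: str) -> None:
        """Called whenever a component emits its periodic heartbeat."""
        self.last_seen[component] = self.clock()

    def suspected_failed(self) -> List[str]:
        """Components whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout_s)


# Simulated time so the example does not need to sleep.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=5.0, clock=lambda: now[0])
monitor.beat("display-1")
monitor.beat("display-2")
now[0] = 4.0
monitor.beat("display-1")          # display-2 stays silent
now[0] = 8.0
print(monitor.suspected_failed())  # ['display-2']
```

With ping/echo the monitor would instead send a request and wait for the reply; the timeout check stays the same. In both cases the timeout trades detection speed against false alarms under load.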
    11. Fault Recovery Tactics (1/4)
        Voting tactic:
          • Processes running on redundant processors each take the input, compute, and report the results to the “vote-counter.”
          • Majority rules
          • Preferred component
        Preferred component:
          • This corrects faulty operation of components, algorithms, or processors.
          • The more severe the consequences of failures, the more stringent the effort to ensure that the redundancy is independent: separate processors, separate implementation teams, … dissimilar platforms.
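The "majority rules" vote-counter can be sketched in a few lines. This is a minimal illustration, assuming the redundant processes report hashable result values; the function name `majority_vote` is hypothetical, and returning `None` when no strict majority exists is a design assumption rather than something the slides prescribe.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


print(majority_vote([42, 42, 41]))  # 42   (one faulty replica is outvoted)
print(majority_vote([1, 2, 3]))     # None (no majority: the fault cannot be masked)
```

A "preferred component" variant would fall back to the result of a designated replica instead of returning `None`.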
    12. Fault Recovery Tactics (2/4)
        Active redundancy (hot restart):
          • All redundant components respond to events in parallel.
          • Redundant components are synchronized at start; the first to return provides the answer.
          • This covers some faults: a faulty processor will be slower to respond.
          • When a failure occurs, the downtime is usually only milliseconds (switching to another component).
          • Often used in client-server applications involving back-end databases.
          • In high availability for LANs, the redundancy may be separate paths, so that failure of a bridge or router is not fatal.
        Note the synchronization demands here.
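The "first to return is the answer" rule of active redundancy can be sketched with Python's standard `concurrent.futures` module. This is an illustration under simplifying assumptions: replicas are plain callables running in threads of one process, and the replica functions (`fast`, `slow`, `crashed`) are hypothetical stand-ins for real redundant components.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Sequence


def first_response(replicas: Sequence[Callable[[Any], Any]], event: Any) -> Any:
    """Send the event to every replica in parallel; use the first successful answer."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica, event) for replica in replicas]
        for fut in as_completed(futures):      # yields futures in completion order
            if fut.exception() is None:
                return fut.result()            # a slow or failed replica is masked
        raise RuntimeError("all replicas failed")
    finally:
        pool.shutdown(wait=False)              # do not block on slower replicas


def fast(x: int) -> int:
    return x * 2


def slow(x: int) -> int:
    time.sleep(0.2)                            # simulates a degraded processor
    return x * 2


def crashed(x: int) -> int:
    raise RuntimeError("replica down")


print(first_response([slow, crashed, fast], 21))  # 42
```

Because every replica processes every event, their state stays in step without explicit checkpoints, at the cost of running all of them all the time.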
    13. Fault Recovery Tactics (3/4)
        Passive redundancy:
          • One component responds to events and informs the standbys of state updates.
        Upon failure the system must:
          • Ensure that the backup is sufficiently fresh.
          • Restart points, checkpoints, log points ???
          • Remap the system to switch which component is the active one.
        Often used in control systems.
          • Example: Air Traffic Control
          • Chapter 6: Air Traffic Control: A Case Study in Designing for High Availability
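Passive redundancy and the "sufficiently fresh" question can be sketched as a primary that pushes a checkpoint to its standbys every few events. This is a toy, in-memory model; the class names, the `checkpoint_every` parameter, and the key/value state are all assumptions made for illustration.

```python
import copy
from typing import Dict, List


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Dict[str, int] = {}
        self.applied = 0                      # number of events reflected in state


class PassiveGroup:
    """One active primary; standbys only receive periodic state checkpoints."""

    def __init__(self, primary: Replica, standbys: List[Replica],
                 checkpoint_every: int = 1) -> None:
        self.primary = primary
        self.standbys = standbys
        self.checkpoint_every = checkpoint_every

    def handle(self, key: str, value: int) -> None:
        """Only the primary responds to the event, then informs the standbys."""
        self.primary.state[key] = value
        self.primary.applied += 1
        if self.primary.applied % self.checkpoint_every == 0:
            for s in self.standbys:
                s.state = copy.deepcopy(self.primary.state)
                s.applied = self.primary.applied

    def fail_over(self) -> int:
        """Promote the freshest standby; return how many events were lost."""
        if not self.standbys:
            raise RuntimeError("no standby available")
        best = max(self.standbys, key=lambda s: s.applied)
        lost = self.primary.applied - best.applied
        self.standbys.remove(best)
        self.primary = best                   # remap: the standby becomes active
        return lost


group = PassiveGroup(Replica("a"), [Replica("b")], checkpoint_every=2)
for k, v in [("x", 1), ("y", 2), ("z", 3)]:
    group.handle(k, v)
print(group.fail_over())      # 1  (event "z" arrived after the last checkpoint)
print(group.primary.state)    # {'x': 1, 'y': 2}
```

A smaller `checkpoint_every` keeps the backup fresher but adds synchronization traffic, which is the trade-off the slide hints at with "restart points, checkpoints, log points".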
    14. Fault Recovery Tactics (4/4)
        Switchovers:
          • Upon failure or periodic
        Synchronization:
          • Is the responsibility of the primary component, which broadcasts synchronization signals to the redundant components.
    15. Fault Prevention Tactics
          • Removal from service: to perform some preventive actions, e.g., rebooting to prevent slow memory leaks from causing problems.
          • Transactions: the bundling of a sequence of steps so that they can be done all at once.
          • Process monitor: once a fault in a process is detected, remove the process, reinstantiate it, and reinitialize its state.
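The transactions tactic (a sequence of steps done all at once, or not at all) can be sketched with a snapshot that is restored if any step fails. This is an in-memory illustration, not a database transaction; the function `run_transaction` and the account example are hypothetical.

```python
import copy
from typing import Callable, Dict, Sequence

State = Dict[str, int]


def run_transaction(state: State, steps: Sequence[Callable[[State], None]]) -> bool:
    """Apply all steps, or roll the state back to its snapshot if any step fails."""
    snapshot = copy.deepcopy(state)
    try:
        for step in steps:
            step(state)
    except Exception:
        state.clear()
        state.update(snapshot)     # undo the partial work
        return False
    return True


def debit(s: State) -> None:
    s["a"] -= 30


def credit(s: State) -> None:
    s["b"] += 30


def fail(s: State) -> None:
    raise ValueError("simulated fault between the two steps")


accounts = {"a": 100, "b": 0}
print(run_transaction(accounts, [debit, credit]), accounts)
# True {'a': 70, 'b': 30}
print(run_transaction(accounts, [debit, fail, credit]), accounts)
# False {'a': 70, 'b': 30}
```

The second call shows the point of the tactic: the debit that ran before the fault is undone, so the state is never left half-updated.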
    16. Availability Tactics Hierarchy
        Fault arrives -> Availability tactics -> Fault masked or repaired
          • Fault detection: Ping/echo, Heartbeat, Exception
          • Recovery (preparation and repair): Voting, Active redundancy, Passive redundancy, Spare
          • Recovery (reintroduction): Shadow, State resynchronization, Rollback
          • Prevention: Removal from service, Transactions, Process monitor