Socio-technical systems failure (LSCITS EngD 2012)
Discusses socio-technical issues in systems failure

  1. Systems failure – a socio-technical perspective
     Human Failure, LSCITS, EngD course in Socio-technical Systems, 2012
  2. Complex software systems
     • Multi-purpose. Organisational systems that support different functions within an organisation.
     • System of systems. Usually distributed and normally constructed by integrating existing systems/components/services.
     • Unlimited. Not subject to limitations derived from the laws of physics, so there are no natural constraints on their size.
     • Data intensive. System data is orders of magnitude larger than code, with long-lifetime data.
     • Dynamic. Changing quickly in response to changes in the business environment.
  3. Systems of systems
     • Operational independence
     • Managerial independence
     • Multiple stakeholder viewpoints
     • Evolutionary development
     • Emergent behaviour
     • Geographic distribution
  4. Complex system realities
     • There is no definitive specification of what the system should ‘do’, and it is practically impossible to create such a specification.
     • The complexity of the system is such that it is not ‘understandable’ as a whole.
     • It is likely that, at all times, some parts of the system will not be fully operational.
     • Actors responsible for different parts of the system are likely to have conflicting goals.
  5. System failure
  6. System dependability model
     • System fault: a system characteristic that can (but need not) lead to a system error.
     • System error: an erroneous system state that can (but need not) lead to a system failure.
     • System failure: externally-observed, unexpected and undesirable system behaviour.
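The fault → error → failure chain above can be illustrated with a minimal sketch. The class, the off-by-one bug and the scenario are hypothetical, invented for illustration; only the fault/error/failure distinction comes from the slide:

```python
# Hypothetical sketch of the fault -> error -> failure chain.
# Fault: the off-by-one bug in discharge() is a latent system characteristic.
# Error: the resulting wrong bed count is an erroneous internal state.
# Failure: the error only becomes a failure when externally observed,
# e.g. when a report shows the wrong number of free beds.

class BedRegister:
    def __init__(self, free_beds):
        self.free_beds = free_beds

    def discharge(self, n):
        # FAULT: '+ n + 1' should be '+ n'; harmless until this code runs.
        self.free_beds += n + 1

    def report(self):
        # The erroneous state becomes externally visible (a failure) here.
        return f"{self.free_beds} beds free"

reg = BedRegister(free_beds=2)
reg.discharge(1)      # ERROR: internal state is now 4 instead of 3
print(reg.report())   # prints "4 beds free" - the observed failure
```

Note that if `report()` were never called, the fault and even the error would remain invisible: nothing externally observable has yet gone wrong.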
  7. A hospital system
     • A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit.
     • It is assumed that the hospital has a number of empty beds and that this changes over time. The variable B reflects the number of empty beds known to the system.
     • Sometimes the system reports that the number of empty beds is the actual number available; sometimes the system reports that fewer than the actual number are available.
     • In circumstances where the system reports that an incorrect number of beds are available, is this a failure?
  8. What is failure?
     • Technical, engineering view: a failure is ‘a deviation from a specification’.
     • An oracle can examine a specification, observe a system’s behaviour and detect failures.
     • Failure is an absolute: the system has either failed or it hasn’t.
  9. Bed management system
     • The percentage of system users who considered the system’s incorrect reporting of the number of available beds to be a failure was 0%.
     • Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital.
     • When the hospital was very busy (available beds = 0), people understood that it was practically impossible for the system to be accurate.
     • They used other methods to find out whether or not a bed was available for an incoming patient.
  10. Failure is a judgement
      • Specifications are a gross simplification of reality for complex systems.
      • Users don’t read, and don’t care about, specifications.
      • Whether or not system behaviour should be considered a failure depends on the observer’s judgement.
      • This judgement depends on:
        – The observer’s expectations
        – The observer’s knowledge and experience
        – The observer’s role
        – The observer’s context or situation
        – The observer’s authority
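The observer-relative notion of failure can be sketched in code. The two observer functions and their thresholds are hypothetical, loosely based on the bed-management example: an engineer applies the specification view, while a clinician only cares whether the admission decision is affected:

```python
# Hypothetical sketch: one system behaviour, two observers, two judgements.

def engineer_judgement(reported_beds, actual_beds):
    # Specification view: any deviation from the true value is a failure.
    return reported_beds != actual_beds

def clinician_judgement(reported_beds, actual_beds):
    # Work-centred view: a failure only if the report gives the wrong
    # answer to the question that matters - can a patient be admitted?
    return (reported_beds > 0) != (actual_beds > 0)

# Under-reporting: 3 beds reported, 5 actually free.
print(engineer_judgement(3, 5))   # True  - deviates from the specification
print(clinician_judgement(3, 5))  # False - admission decision unaffected
```

The same behaviour is a failure for one observer and unremarkable for the other, which is the point of the slide.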
  11. Failures are inevitable
      • Technical reasons
        – When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood.
        – Failures can often be considered failures in data rather than failures in behaviour.
      • Socio-technical reasons
        – Changing contexts of use mean that the judgement of what constitutes a failure changes as the effectiveness of the system in supporting work changes.
        – Different stakeholders will interpret the same behaviour in different ways because of different interpretations of ‘the problem’.
  12. Conflict inevitability
      • It is impossible to establish a set of requirements in which all stakeholder conflicts are resolved.
      • Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders.
      • Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.
  13. Normal failures
      • ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts work and means that people have to spend more time on a task than necessary.
      • A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required for some task, in response to inappropriate or unexpected system behaviour.
      • This extra work constitutes the cost of recovery from system failure.
  14. The Swiss Cheese model
  15. Failure trajectories
      • Failures rarely have a single cause. Generally, they arise because several events occur simultaneously.
        – Loss of data in a critical system:
          • User mistypes a command and instructs data to be deleted
          • System does not check and ask for confirmation of the destructive action
          • No backup of the data is available
      • A failure trajectory is a sequence of undesirable events that coincide in time, usually initiated by some human action. It represents a failure of the defensive layers in the system.
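The loss-of-data trajectory above can be sketched as a chain of defences, where data is lost only when every defence is missing or breached. The function, its parameters and the data store are hypothetical, invented to mirror the three coinciding events on the slide:

```python
# Hypothetical sketch of the loss-of-data failure trajectory: data is
# lost only when a mistyped destructive command (event 1) coincides with
# a missing confirmation check (event 2) and a missing backup (event 3).

def delete_data(store, key, confirmation_enabled, backup):
    if confirmation_enabled:
        # Defence 1: a confirmation check stops the mistyped command here.
        return "refused: destructive action needs confirmation"
    removed = store.pop(key, None)
    if backup is not None and removed is not None:
        # Defence 2: a backup allows recovery even after deletion.
        store[key] = backup[key]
        return "deleted, then restored from backup"
    return "data lost"  # every defence absent: the trajectory completes

store = {"patients": ["ward A list"]}
# Mistyped command, no confirmation check, no backup:
print(delete_data(store, "patients", confirmation_enabled=False, backup=None))
# prints "data lost"
```

Blocking any single event (enabling confirmation, or supplying a backup) interrupts the trajectory, which is why the slide frames failure as a breach of several layers rather than a single cause.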
  16. Vulnerabilities and defences
      • Vulnerabilities
        – Faults in the (socio-technical) system which, if triggered by a human or technical error, can lead to system failure
        – e.g. a missing check on input validity
      • Defences
        – System features that avoid, tolerate or recover from human error
        – e.g. type checking that disallows allocation of values of the incorrect type
      • When an adverse event happens, the key question is not ‘whose fault was it?’ but ‘why did the system defences fail?’
  17. Reason’s Swiss Cheese Model
  18. Active failures
      • Active failures
        – Active failures are the unsafe acts committed by people who are in direct contact with the system, or failures in the system technology.
        – Active failures have a direct and usually short-lived effect on the integrity of the defences.
      • Latent conditions
        – Fundamental vulnerabilities in one or more layers of the socio-technical system, such as system faults, system/process misfit, alarm overload, inadequate maintenance, etc.
        – Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.
  19. Defensive layers
      • Complex IT systems should have many defensive layers:
        – some are engineered - alarms, physical barriers, automatic shutdowns;
        – others rely on people - surgeons, anaesthetists, pilots, control room operators;
        – and others depend on procedures and administrative controls.
      • In an ideal world, each defensive layer would be intact.
      • In reality, they are more like slices of Swiss cheese, having many holes - although unlike in the cheese, these holes are continually opening, shutting, and shifting their location.
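The Swiss cheese picture can be turned into a toy simulation: an accident occurs only when a hazard finds a hole in every layer. This is an illustrative model, not from the slides; the layer count and hole probabilities are arbitrary assumptions:

```python
import random

# Toy Swiss-cheese model: each defensive layer stops a hazard unless the
# hazard happens to hit one of the layer's holes. An accident requires
# the holes in *every* layer to line up on this particular occasion.

def layer_blocks(hole_probability, rng):
    """True if this layer stops the hazard; a 'hole' lets it through."""
    return rng.random() >= hole_probability

def accident_occurs(hole_probabilities, rng):
    # The hazard propagates only while it keeps finding holes.
    return all(not layer_blocks(p, rng) for p in hole_probabilities)

rng = random.Random(42)  # fixed seed so the demo is repeatable
# Three independent layers, each with a 10% chance of a hole:
trials = 100_000
accidents = sum(accident_occurs([0.1, 0.1, 0.1], rng) for _ in range(trials))
print(f"accident rate ~ {accidents / trials:.4f}")  # near 0.1**3 = 0.001
```

Even weak layers multiply: three 90%-effective defences give roughly a 0.1% accident rate. The limitation of this toy model, noted on the slide, is that real holes are not independent or static; they open, shut and shift with context.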
  20. Dynamic vulnerabilities
      • While some vulnerabilities are static (e.g. programming errors), others are dynamic and depend on the context in which the system is used.
      • For example:
        – vulnerabilities may be related to human actions whose performance depends on workload, state of mind, etc. An operator may be distracted and forget to check something;
        – vulnerabilities may depend on configuration: checks may depend on particular programs being up and running, so if program A is running in a system then a check may be made, but if program B is running, the check is not made.
  21. Recovering from failure
  22. Coping with failure
      • People are good at coping with unexpected situations when things go wrong.
        – They can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things.
        – People can prioritise and focus on the essence of a problem.
  23. Recovery strategies
      • Local knowledge
        – Who to call; who knows what; where things are
      • Process reconfiguration
        – Doing things in a different way from that defined in the ‘standard’ process
        – Work-arounds, breaking the rules (safe violations)
      • Redundancy and diversity
        – Maintaining copies of information in forms different from those held in a software system
        – Informal information annotation
        – Using multiple communication channels
      • Trust
        – Relying on others to cope
  24. Design for recovery
      • Holistic systems engineering
        – Software systems design has to be seen as part of a wider process of socio-technical systems engineering.
      • We cannot build ‘correct’ systems
        – We must therefore design systems that allow the broader socio-technical system to recognise, diagnose and recover from failures.
      • Extend current systems to support recovery
      • Develop recovery support systems as an integral part of systems of systems
  25. Recovery strategy
      • Designing for recovery is a holistic approach to system design and not (just) the identification of ‘recovery requirements’.
      • It should support the natural ability of people and organisations to cope with problems:
        – Ensure that system design decisions do not increase the amount of recovery work required
        – Make system design decisions that make it easier to recover from problems (i.e. reduce the extra work required):
          • Earlier recognition of problems
          • Visibility to make hypotheses easier to formulate
          • Flexibility to support recovery actions
  26. Key points
      • Failures are inevitable in complex systems because multiple stakeholders see these systems in different ways and because there is no single manager of these systems.
      • Failures are a judgement, not an absolute; they depend on the system observer.
      • The Swiss cheese model is a failure model based on active failures (trigger events) and latent conditions (system vulnerabilities).
      • People have developed strategies for coping with failure, and systems should not be designed to make coping more difficult.
