CS5032 Lecture 13: Organisations and Failure
ORGANISATIONS AND DEPENDABILITY 1
DR JOHN ROOKSBY
IN THIS LECTURE…
High Reliability Organisations
These are organisations that are able to achieve high reliability from complex, critical systems
• This lecture will cover five of the key qualities said to be held by these organisations
This lecture will use Nuclear Powered Carriers as an example High Reliability Organisation, and NASA at the time of the Columbia disaster as an example of an unreliable organisation
NORMAL ACCIDENTS
Normal Accidents, by Charles Perrow, introduced the idea that failures are normal in complex systems. Perrow argued serious failures are likely when there is:
• Interactive complexity: The presence of unfamiliar, unplanned and unexpected sequences of events in a system that are not visible or immediately comprehensible
• Tight coupling: The presence of interdependent components. Tight coupling will make a system more prone to cascading errors.
So complex, tightly coupled systems shouldn't be built?
HRO researchers argue that some complex, tightly coupled systems are far more dependable than others, because of the way they are managed
PRINCIPLES
High Reliability Organisations      Low Reliability Organisations
Focus on failure                    Focus on success
Focus on reliability                Focus on efficiency
Reluctant to simplify               Rely on simplicity
Dynamic hierarchies                 Inflexible hierarchy
De-centralised decision making      Centralised decision making
Open information                    Hide information
Multiple perspectives               Single perspectives
Are committed to resilience         Are on “automatic pilot”
NUCLEAR POWERED CARRIERS
Complex, high-risk socio-technical systems
• Multiple (mechanical and digital) systems
• Dangerous objects (aircraft, fuel, and explosives) in close proximity. Aircraft taking off and landing at 48-60 second intervals.
• 6000 crew. Several different kinds of aircraft, multiple squadrons. All work interdependently and must be coordinated.
• Carriers are 24 storeys high and carry enough fuel for 15 years. 2000 telephones. 3,360 compartments and spaces
NUCLEAR POWERED CARRIERS
High risk
• Nuclear reactor accidents
• Fire, flooding, grounding, collision
• Fuel and weapons explosions
• Mistaken identification of friends and foes
• High risks both to crew and a much larger public
High reliability
• Low “crunch rates”
• Comparatively few major accidents
COLUMBIA DISASTER
Feb 1st 2003: Columbia disintegrates during re-entry into the Earth's atmosphere
The thermal protection system had been damaged during launch when a large piece of foam insulation broke off the main propellant tank and hit the shuttle
• This was a known problem.
• The majority of shuttle launches had included foam strikes, but nothing had been done about the design
• They were aware the foam had struck the wing, but it was not treated as serious
• Engineers' concerns were not listened to
NASA
NASA had repeated similar failings
• The Challenger disaster, 28th Jan 1986 (Mission STS 51-L)
• The Columbia disaster, 1st Feb 2003 (Mission STS-107)
Many of the failings were the result of deep-rooted organisational problems
NASA strived to implement HRO principles
FIVE PRINCIPLES
High Reliability Organisations      Low Reliability Organisations
Focus on failure                    Focus on success
Focus on reliability                Focus on efficiency
Reluctant to simplify               Rely on simplicity
Dynamic hierarchies                 Inflexible hierarchy
De-centralised decision making      Centralised decision making
Share information                   Hide information
Multiple perspectives               Single perspectives
Are committed to resilience         Are on “automatic pilot”
1. RELIABILITY OVER EFFICIENCY
High Reliability Organisations give reliability precedence over efficiency
• Decisions are made on the grounds of reliability first and then efficiency
• Efficiency initiatives are treated with scepticism
1. RELIABILITY OVER EFFICIENCY
High Reliability Organisations do the following:
• Managers regularly talk to staff and familiarise themselves with how and why they do their work.
• Organisations develop safety measures as well as financial measures, and include these in employee evaluations
• Organisations assign value to the avoidance of accidents
• High redundancy despite cost
• Cautious actions when necessary despite cost
• Carriers have to persuade Congress that enormous amounts of redundancy (in jobs, communication structures, parts) are necessary, and that enormous amounts of training are necessary
• Constant training despite cost. Commanding officers demand that carriers have regular sea exercises, that they are not just kept in port
NASA prioritised efficiency over reliability
• In the 1990s NASA faced drastic cuts and became overly concerned with pleasing Congress. NASA initiated the Faster, Better, Cheaper strategy in the mid-90s and wanted to stick to a strict schedule.
• With STS-107 they worried that the time needed to analyse the foam strike would delay the next mission. They didn't want to change the next mission's objectives to a rescue mission.
• Positioning the shuttle over Hawaii so that images could be made was seen as time-consuming and costly
2. PREOCCUPATION WITH FAILURE
High Reliability Organisations are preoccupied with failure (they do not focus on success)
• Workers need to be heedful of the possibility of failure
• Failures are understood to be normal (but unacceptable)
• They know there can be unexpected failure modes, even in common activities
2. PREOCCUPATION WITH FAILURE
High Reliability Organisations address failure by
• Constant training of all people (simulations, apprenticing, practice)
• Using incident reporting
• Designing in extensive redundancy
• Maintaining contingencies for critical operations
• Requiring proof that something is safe, rather than proof that it is unsafe
• There is constant tracking of issues around malfunctioning, defective and substandard equipment. They act on these by training crew how to overcome problems and pressuring vendors to make improvements
• Extensive redundancy (overlapping jobs, multiple channels and centres of communications, spare parts, multiple sources for decision making).
• Example: if an aircraft's landing gear warning light comes on, the spotter, commander and pilot all work together to establish what the issue is.
• Multiple contingencies are maintained. Example: There will always be multiple options for how to land the plane (or for the pilot to escape).
• Foam had been shed on 65 of 79 missions prior to STS-107. There were repeated resolutions to do something about this and yet nothing happened.
• After the foam strike, engineers who raised concerns were asked to prove it posed a danger rather than prove it didn't.
• There was no sustained effort to acquire images of the shuttle, or to share them internally
• A shuttle was available for a rescue mission, but a rescue was never actually considered.
3. SHARING THE BIG PICTURE
High Reliability Organisations want everyone to know the whole picture
• If people are narrowly focused they will act only in their own interest.
• People need to maintain awareness of other people and events around the organisation
3. SHARING THE BIG PICTURE
High Reliability Organisations
• Train people broadly
• Educate people about overarching objectives, and set statements of purpose
• Give people access to information on what is happening elsewhere
• Clearly specify how people and teams fit into the whole
• Maintain awareness through many communication devices of multiple kinds, and have multiple centres of communication. Each centre has direct access to information and each is vigilant.
• Have well-articulated hierarchies
• Deck hands are motivated because they are treated as core parts of teams
• People are rotated through different jobs. Top personnel are rotated to a different position every 90 days.
• Employees had little understanding of the overall organisation and its internal processes
• A team was set up with the correct expertise to assess the foam strike damage, but its objectives were fuzzy and it had no direct connection to management
• The team was not given the appropriate official category, “Tiger Team”
• The investigators did not know the process for requesting images, and were rebuked when they tried because they did not have the authority to request them or the correct approval
4. RELUCTANCE TO SIMPLIFY
All organisations have to simplify and abstract, to filter out unnecessary information (particularly for getting “big pictures”)
But High Reliability Organisations
• Use labels and categories as little as possible, as they stop you from looking further into details and events.
• Continually rework labels and categories
• Listen to wisdom, but with scepticism
• Do not focus on information that supports expectations, but on that which doesn't fit or which disconfirms desires
• There are clear responsibilities and tasks, but in practice the crew are constantly negotiating, communicating and interacting
• If there is a problem with an aircraft, multiple people take multiple views.
• NASA narrowed the foam strike down to a “tile incident”, because management had expertise in tiles. It was actually a reinforced carbon-carbon (RCC) panel incident.
• The assessment of the damage was done using simulation software called “Crater”.
• This software was designed for simulating small projectiles, but the foam debris was 640 times larger than the projectiles in the data used to calibrate Crater.
• Crater was not understood by NASA and the simulation was actually run and interpreted outside the organisation.
• The simulation was only run twice and the people who ran it did not think it was very useful, but did not communicate this well
5. MIGRATION OF DECISION MAKING
High Reliability Organisations migrate decision making as far down the organisation as possible
• Decisions are not made by one central authority. Decisions need to be made where there is expertise. This helps decisions to be made quickly and correctly
5. MIGRATION OF DECISION MAKING
In order to defer to expertise:
• Decision-making ability is migrated to the lowest appropriate levels
• People are trained in making decisions and are given the right resources to do so
• Skill levels and legitimacy are recognised throughout the organisation, and people are trusted
• There is hierarchy, but decision making is pushed to the extremes. For example, if there is debris on the runway, whoever spots it can halt operations and have it cleared
• Rank is not treated as an issue here
NASA Mission STS-107
• Decision making was centralised among managers, who ignored the expert opinions of engineers
• Authority was required for decisions to be made
• Example: When images were requested, the organisation worried about the rank of the requestor
KEY POINTS
• Organisational approaches are necessary for achieving dependable systems. Dependability is not a quality of a technology but a quality of technology-in-practice.
• Technologies are not inherently dependable, but require people to operate and manage them in ways that are dependable
• The HRO literature has identified a number of qualities of highly reliable organisations. These mainly relate to the operation of technology, although some researchers have studied software development organisations from this perspective.
READING
KH Roberts (1990) Some Characteristics of One Type of High Reliability Organisation. Organisational Science, 1, 2: 160-76.
Book: Charles Perrow (1984) Normal Accidents, Living with High Risk Technologies
Book chapter: Karl Weick (2005) Making Sense of Blurred Images. In W Starbuck and M Farjoun, Organisation at the Limit. Blackwell Publishing