Why do we fail? (And how do we stop doing that?)
Random notes, thoughts, and research starting points on how other industries ensure reliable operations in the face of unreliable human nature.

Published in: Technology, Business
    1. Why do we fail? And how do we stop doing that?
    2. "The major cause of incidents is the human factor" (folk knowledge)
    3. Outages become less frequent and shorter in duration as data centers increase in size. The smaller the data center, the longer and more common the outages.

       Root causes and responses by organizations

       Eighty percent of respondents know the root cause of the unplanned outage. The most frequently cited root causes of data center outages are:
       • UPS battery failure (65 percent)
       • UPS capacity exceeded (53 percent)
       • accidental EPO/human error (51 percent)
       • UPS equipment failure (49 percent)

       Most common responses to unplanned outages are to repair, replace or purchase additional IT or infrastructure equipment (60 and 56 percent, respectively) followed by contacting the equipment vendor for support (51 percent).

       Fifty-seven percent believe all or most of the unplanned outages could have been prevented. The most common prevention tactics to avoid downtime are investing in improved equipment (50 percent), increasing the budget and staff of the data center (34 and 20 percent, respectively), improving infrastructure design/incorporating redundant components (19 and 18 percent, respectively) as well as performing preventative maintenance of critical infrastructure (16 percent).

       Source: National Survey on Data Center Outages. Sponsored by Emerson Network Power. Independently conducted by Ponemon Institute LLC. Publication Date: 30 September 2010. Ponemon Institute© Research Report.
    4. Common patterns that were found included:
       • Human error continues to be the dominant factor in maritime accidents, contributing to 80 to 85 percent of accidents.
       • Based on USCG data, for all accidents over the reporting period of 1999 to 2001, approximately 80 to 85% of the accidents analyzed involved human error. Of these, about 50% of maritime accidents were initiated by human error, and another 30% were associated with human error.
       • In MARS reports (voluntary mariner self-reporting of accidents and near misses), mariners note human error in the majority of reports, and chiefly attribute accidents and near misses to: lack of competence, knowledge and ability; human fatigue; workload; manning; complacency; and risk tolerance.
       • USCG data on offshore pollution events in California suggests that 46% are caused by or associated with human error.
       • For accidents associated with pollution events in the State of California, accident causes are chiefly attributed to failures of situation awareness (94%).
       • Among all human error types classified in numerous databases and libraries of accident reports, failures of situation awareness and situation assessment overwhelmingly predominate, being a causal factor in about 45% (offshore) to about 70% (ships) of the recorded accidents associated with human error (see Figure 1, below).

       [Slide also shows a cropped second column describing the ABS MaRCAT approach to marine incident investigation; the text is cut off and not recoverable.]

       Source: D.B. McCafferty and C.C. Baker, American Bureau of Shipping, USA. "Trending the Causes of Marine Incidents." ABS Technical Papers 2006. Presented at the Learning From Marine Incidents 3 Conference held in London, January 25-26, 2006, and reprinted with the kind permission of the Royal Institution of Naval Architects.
    5. More sources?
    6. Other industries have figured it out
       • Aviation
       • Medicine
       • Nuclear power plants
       • Military
       • …?
    7. Usability studies in IT focus on end users.
    8. SYSADMIN
    9. How do other industries ensure reliability?
    10. • Procedures & checklists
       • Verification & redundancy
       • Eliminating the human factor where it’s unnecessary
       • Focusing human effort where it’s actually needed
    11. Procedures & checklists
       • A lot of what we do is following known procedures
       • If we write it down & follow the instructions, we can focus on what is special, not on doing the usual steps
       • Also, we won’t forget any of the usual steps when the situation goes sideways
    12. ITIL?
    13. http://gawande.com/the-checklist-manifesto
    14. …ctors 35(2), pp. 28-43.
       COCKPIT CHECKLISTS: CONCEPTS, DESIGN, AND USE
       Asaf Degani, San Jose State University Foundation, San Jose, CA
       Earl L. Wiener, University of Miami, Coral Gables, FL
    15. …task-checklists for almost all segments of the flight, i.e., PREFLIGHT, TAXI, BEFORE LANDING, etc.; and in particular before the critical segments: TAKEOFF, APPROACH, and LANDING. Two other checklists are also used on the flight-deck: the abnormal and emergency checklist. This paper will address only the normal checklist.

       We believe that normal checklists are intended to achieve the following objectives:
       1. Provide a standard foundation for verifying aircraft configuration that will attempt to defeat any reduction in the flight crew's psychological and physical condition.
       2. Provide a sequential framework to meet internal and external cockpit operational requirements.
       3. Allow mutual supervision (cross checking) among crew members.
       4. Dictate the duties of each crew member in order to facilitate optimum crew coordination as well as logical distribution of cockpit workload.
       5. Enhance a team concept for configuring the plane by keeping all crew members “in the loop.”
       6. Serve as a quality control tool by flight management and government regulators over the flight crews.

       Another objective of an effective checklist, often overlooked, is the promotion of a positive “attitude” toward the use of this procedure. For this to occur, the checklist must be well grounded within the “present day” operational environment, so that the flight crews will have a sound realization of its importance, and not regard it as a nuisance task…
    16. Various types of checklist devices have evolved over the years. Among them are the scroll, mechanical, and vocal checklist (Degani and Wiener, 1990; Turner and Huntley, 1991). More modern ones involve computer-based text displayed on a CRT and electronic checklist devices that sense sub-system’s state (Rouse, Rouse, and Hammer, 1982; Palmer and Degani, 1991).

       The paper checklist is the most common checklist device used today in commercial operation. It has a list of items written on a paper card (see Figure 1). Usually, the card is held in the pilot’s hand. Because of the wide prevalence of this device, it will be the focus of this paper.

       Sanders and McCormick (1987) state that “because humans are often the weak link in the system, it is common to see human-machine systems designed to provide parallel redundancy” (p. 18). A similar principle of backup and redundancy is applied in the checklist procedure. There are two types of redundancies embedded in this procedure.
    17. The first is the redundancy between configuring the aircraft from memory and only then using the checklist procedure to verify that all items have been accomplished properly (set-up redundancy). The second is the redundancy between the two or three pilots monitoring one another while conducting the checklist procedure (mutual redundancy).

       The Method

       There are two dominant methods of conducting (“running”) a checklist: the do-list and the challenge-response. Each is the product of a different operational philosophy.

       Do-list. This method can be better termed “call-do-response.” The checklist itself is used to lead and direct the pilot in configuring the aircraft, using a step-by-step “cookbook” approach. The set-up redundancy is eliminated here, and therefore, a skipped item can easily pass unnoticed once the sequence is interrupted.

       Challenge-response. In this method, which can be more accurately termed “challenge-verification-response,” the checklist is a backup procedure. First, the pilots configure the plane according to memory. Only then, the pilots use the checklist to verify that all the items listed on the checklist have been correctly accomplished. This is the most common checklist method used today by commercial operators.
    18. …been created by people and can be documented and understood. But when it comes to people, we are faced with a system element that comes with no operating manual and no performance specifications, and that occasionally performs in ways not anticipated by the system designers. Some of these failures can be easily explained, an arithmetic error for example, while others are harder to predict. Although individuals differ, researchers have discovered general principles of human performance that can help us to create safer and more efficient systems. The focus of this paper is on the functioning of people as elements of maintenance systems in aviation.

       The cost of maintenance error

       Since the end of World War II, human factors researchers have studied pilots and the tasks they perform, as well as air traffic control and cabin safety issues. Yet until recently, maintenance personnel were overlooked by the human factors profession. Whatever the reason for this, it is not because maintenance is insignificant. Maintenance is one of the largest costs facing airlines. It has been estimated that for every hour of flight, 12 man-hours of maintenance occur. Most significantly, maintenance errors can have grave implications for flight safety.

       Accident statistics for the worldwide commercial jet transport industry show maintenance as the ‘primary cause factor’ in a relatively low four per cent of hull loss accidents, compared with flight crew actions that are implicated as a primary cause factor in more than 60 per cent of accidents. Yet primary cause statistics may tend to understate the significance of maintenance as a contributing factor in accidents. In 2003, Flight International reported that ‘technical/maintenance failure’ emerged as the leading cause of airline accidents and fatalities, surpassing controlled flight into terrain, which had previously been the predominant cause of…
    19. SYSADMIN
    20. What do we do now?
    21. • Use procedures where it makes sense
         • at least, write stuff down
       • Automation, verification, redundancy
         • scripting not only for doing stuff, also for making sure it worked
       • Usability of our panels & dashboards
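
The last slide's point about "scripting not only for doing stuff, also for making sure it worked" can be sketched as a small act-then-verify runner. This is only an illustrative sketch, not something taken from the slides: the `Step` class, the `run_checklist` function, and the simulated `state` dictionary are all hypothetical names, and a real script would replace the simulated actions and checks with calls against the actual system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One checklist item: an action plus an independent check of its result."""
    name: str
    action: Callable[[], None]   # does the work
    verify: Callable[[], bool]   # confirms the work actually took effect


def run_checklist(steps: List[Step]) -> List[Tuple[str, bool]]:
    """Run each step in order, stopping at the first failed verification."""
    results = []
    for step in steps:
        step.action()
        ok = step.verify()
        results.append((step.name, ok))
        if not ok:
            break  # do not keep building on top of an unverified state
    return results


# Hypothetical stand-in for a real system, used only for this illustration.
state = {"config_written": False, "service_running": False}


def write_config() -> None:
    state["config_written"] = True


def restart_service() -> None:
    pass  # simulated failure: the restart silently does nothing


steps = [
    Step("write config", write_config, lambda: state["config_written"]),
    Step("restart service", restart_service, lambda: state["service_running"]),
    Step("announce completion", lambda: None, lambda: True),
]

results = run_checklist(steps)
for name, ok in results:
    print(f"{name}: {'OK' if ok else 'FAILED'}")
```

In this sketch the verification is deliberately separate from the action, so a step that "ran without errors" but had no effect is still caught, and the run stops before the later steps are attempted.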