IN THIS LECTURE…
This lecture
• Will introduce you to many of the themes I will cover on the course.
• Will characterise failure as the norm rather than the exception in systems operation.
• Will outline why critical systems engineering must address organisational and human factors as well as technical issues.
• Will build upon the idea of socio-technical systems engineering introduced in the last lecture, and will introduce the idea of resilience engineering.
A STORY
A professor has to give an important lecture. He wakes up late because his alarm clock fails to go off.
His wife has left the house already. Unfortunately she has left the kitchen tap running and it has flooded the floor. The professor rushes to clean up the mess.
He gets to his car only to realise he has locked his car and house keys inside.
He has left a spare house key with a neighbour, but the neighbour is away.
He phones his wife but she doesn’t answer.
A STORY
He calls a friend and asks for a lift, but the friend’s car has broken down.
The professor sets off for the bus, but then remembers there is a bus strike.
He calls a taxi, but the taxi company is overwhelmed because of the bus strike.
He gives up, calls work and cancels the lecture.
This story is adapted from Perrow C (1984) Normal Accidents: Living with High-Risk Technologies. Basic Books.
ABOUT FAILURE
• Failure is a judgement
• Failures are common
• Failures often have multiple causes
• Failures cascade
• Some failures are more serious than others
• Failures often have no ill effect
• Failures can often be recovered from
• Engineering cannot eliminate failure
• Success is as complex as failure
FAILURE IS A JUDGMENT
What do we judge the exact failure to be?
• Failure to get to work? Failure to give the lecture? The smaller failures that led to cancellation?
What do we judge to be a significant failure?
• Does cancelling a lecture matter?
• Can cancellation be corrected for?
Different perspectives can be taken on failure
• Different explanations often suit different purposes
• There may sometimes be no definite agreement about a failure, but this does not mean any interpretation will do.
[Figure: Passport issuing 1998/9]
Sources: Graph - The Passport Delays of Summer 1999, NAO Report. Images - BBC News.
FAILURES ARE COMMON
Errors and failures happen all the time, particularly in complex systems where there is a lot to go wrong.
How many errors have you made in the last half an hour?
If servers in a data center have 99.999% reliability, what are the odds that all will be working at any one time:
a) if it has 10,000 servers?
b) if it has 100,000 servers?
http://www.time.com/time/photogallery/0,29307,2036928_2218548,00.html
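
The server question can be worked through with a few lines of arithmetic. This is a minimal sketch under two assumptions that the slide does not state: "99.999% reliability" is read as the probability that a given server is working at any one instant, and servers fail independently of one another. The function name `prob_all_working` is made up for this illustration.

```python
def prob_all_working(n_servers: int, availability: float = 0.99999) -> float:
    """Probability that every one of n_servers is up at the same instant.

    Assumes each server is up with probability `availability` and that
    failures are independent (an assumption, not a claim about real
    data centers, where failures are often correlated).
    """
    return availability ** n_servers


for n in (10_000, 100_000):
    p = prob_all_working(n)
    print(f"{n} servers: P(all working) = {p:.3f}, "
          f"P(at least one down) = {1 - p:.3f}")
```

Under these assumptions the answers are roughly 0.905 for 10,000 servers and 0.368 for 100,000 servers. With 100,000 servers it is more likely than not that at least one is down at any given moment, which is the sense in which failure is the norm.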
FAILURES OFTEN HAVE MULTIPLE CAUSES
There were multiple (mainly mundane) causes behind the lecture cancellation:
• Human error (leaving the tap running, forgetting keys)
• Practices and procedures (waking up late, rushing)
• Technical failure (alarm clock, car)
• System design (door allows you to be locked out)
• Environment (lives too far from work)
• External failures (bus strike, lack of taxi capacity)
• Planning (relying on a single lecturer)
Who or what is responsible?
Who has responsibility?
FAILURES CASCADE
Complex systems have a high number of components and will be dependent on a high number of external factors. These interdependencies may not always be apparent.
Often the cause or causes of a failure are at some remove from the failure itself
• A simplistic view is that there are chains of failure: a domino effect where one problem leads to another
• A more complex view is that failures have complex webs of causes and influences
• We may also view failures in terms of problems with defenses
Disasters often result from unfortunate coincidences and combinations of failures.
SOME FAILURES ARE MORE SERIOUS THAN OTHERS
It is often helpful to distinguish between faults, errors, failures, disasters and catastrophes. But there is no consistently used terminology.
Failure is a judgment.
The seriousness of a failure is contextually dependent.
• Failure in a life-critical system vs in a word processor
• When is it acceptable for an aging component to fail?
• When is it acceptable to take risks (e.g. do maintenance)?
Engineers take different perspectives on failure. Some argue that all failures, no matter how small, should be taken seriously. Some argue we need systems to be “good enough”.
FAILURES OFTEN HAVE NO ILL EFFECT
An error or failure may happen many times with no ill effect.
• This can lead people to be complacent
• It may one day lead to disaster
For example, the Columbia shuttle disaster occurred when foam damaged the thermal protection on the shuttle’s wing
• Similar foam strikes had happened many times
• NASA did not believe this strike could cause the loss of Columbia
FAILURES CAN OFTEN BE RECOVERED FROM
A disaster is rarely an instantaneous event. Often a disaster results from an unfortunate combination of failures and often these take place over a period of time.
• Failures can often be mitigated
• Failures can often be recovered from
A resilient system is one that is able to recover from failures. It is the opposite of a brittle system.
We must give operators the ability to mitigate and recover from failure.
ENGINEERING CANNOT ELIMINATE FAILURES
Good engineering can greatly reduce but never eliminate the possibility of failure.
• Testing can be used to find problems but never to show their absence
• Formal methods can be used to eliminate design faults, but this does not mean problems will not emerge in manufacturing or system operation
Critical systems engineering must focus on operation as well as design.
Systems are increasingly operated as services rather than products, so the risk of operational failure increasingly falls on the developers (!)
SUCCESS IS AS COMPLEX AS FAILURE
We need to learn from success, not just failure
• But success is even harder to define than failure.
Success is a judgment
• One person’s success is another’s failure
• A successful system may just be one that hasn’t yet failed
Success can be studied in terms of
• Noteworthy success
• Ordinary operation
• “Successful failures”
SOCIO-TECHNICAL SYSTEMS ENGINEERING
[Figure: layered stack, listed top to bottom]
• Society
• Organisations
• People and Processes
• Applications
• Communications + Data Management
• Operating Systems
• Equipment
Side labels: Socio-Technical Systems Engineering; Software Engineering
RESILIENCE
Design for failure
• How can a system fail gracefully and appropriately?
Design for recovery
• How can a system be designed to support mitigation and recovery from failure?
Design for avoidance
• How can we reduce the number of failures a system will encounter?
For all of these we need to understand systems operation.
Critical systems engineering is not just about the design process, but also about understanding operation.
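
To make "design for failure" and "design for recovery" a little more concrete at the software level, here is a small sketch of one common pattern: retry the primary operation a few times (recovery), and if it keeps failing, return a degraded but usable result and say so (graceful failure). It covers only the technical layer. All names here (`call_with_recovery`, `always_down`, the "cached timetable" fallback) are hypothetical and chosen for illustration; they do not come from any particular system.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def call_with_recovery(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 0.0,
) -> Tuple[T, bool]:
    """Try `primary` up to `attempts` times, then degrade to `fallback`.

    Returns (result, degraded). The `degraded` flag lets the caller, and
    ultimately a human operator, see that the system is running in a
    reduced mode rather than hiding the failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            return primary(), False
        except Exception:
            # Recovery: wait briefly and retry, unless this was the last try.
            if attempt < attempts:
                time.sleep(delay_s)
    # Graceful failure: the primary path is unavailable, so degrade.
    return fallback(), True


def always_down() -> str:
    # Hypothetical stand-in for a service that is currently unavailable.
    raise ConnectionError("timetable service unavailable")


result, degraded = call_with_recovery(always_down, lambda: "cached timetable")
print(result, degraded)  # prints: cached timetable True
```

The design choice worth noting is the explicit `degraded` flag: recovery that is invisible to operators makes it harder for them to mitigate the underlying problem.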
SUMMARY
1. Failure is the norm, not the exception
2. Resilient systems are able to cope with, recover from and avoid failure
3. Resilience is a socio-technical problem, not a purely technical one
HOMEWORK
First read
Chapter 3 “The Human Contribution” from J Reason (2008) The Human Contribution. Farnham, Ashgate.
Then
Make a note of any interesting slips, lapses, mistakes, violations, etc. that you have made recently.