Let-it-Crash Paradigm for Safety-Related Software Development

For more details see also our blog post: http://blog.zuehlke.com/is-it-safe-to-let-it-crash/

Critical software systems are bound to perform extensive error detection and exception handling. The corresponding source code is typically implemented in a defensive programming style. Because error handling addresses a cross-cutting concern, its code fragments are often not separable from the source code realizing the core functionality. Extending exception handling in order to further improve fault-tolerance requires even more source code. The same applies when increasing software safety by pro-actively detecting hazardous situations. However, some leftover vulnerability always remains, especially in complex and distributed systems. Producing more code ultimately results in more complexity while reducing readability and maintainability. This in turn inevitably leads to programming errors.

The programming language Erlang breaks new ground in handling fault-tolerance problems. Very lightweight processes enable straightforward concurrency, with communication based solely on message passing. Processes are able to monitor and – in case of a process termination – restart each other very swiftly. The exception handling method of choice for a worker process is to terminate itself if it is unable to handle the situation locally. Supervisor hierarchies ensure appropriate error responses by starting a different process or by starting a new instance of the terminated one.

This article aims to discuss whether the let-it-crash paradigm for fault-tolerant systems may also be applicable to safety-related software projects. The industrial background of this paper is a proof-of-concept project using Erlang for implementing safety-related functionality in a close-to-reality scenario.

Transcript of "Let-it-Crash Paradigm for Safety-Related Software Development"

Assessing the Applicability of the Let-it-Crash Paradigm for Safety-Related Software Development

Christoph Woskowski, Mikolaj Trzeciecki, Florian Schwedes
Zühlke Engineering GmbH, Landshuter Allee 12, 80637 München, Germany
{christoph.woskowski,mikolaj.trzeciecki,florian.schwedes}@zuehlke.com

Abstract. Critical software systems are bound to perform extensive error detection and exception handling. The corresponding source code is typically implemented in a defensive programming style. Because error handling addresses a cross-cutting concern, its code fragments are often not separable from the source code realizing the core functionality. Extending exception handling in order to further improve fault-tolerance requires even more source code. The same applies when increasing software safety by pro-actively detecting hazardous situations. However, some leftover vulnerability always remains, especially in complex and distributed systems. Producing more code ultimately results in more complexity while reducing readability and maintainability. This in turn inevitably leads to programming errors.

The programming language Erlang breaks new ground in handling fault-tolerance problems. Very lightweight processes enable straightforward concurrency, with communication based solely on message passing. Processes are able to monitor and – in case of a process termination – restart each other very swiftly. The exception handling method of choice for a worker process is to terminate itself if it is unable to handle the situation locally. Supervisor hierarchies ensure appropriate error responses by starting a different process or by starting a new instance of the terminated one.

This article aims to discuss whether the let-it-crash paradigm for fault-tolerant systems may also be applicable to safety-related software projects.
The industrial background of this paper is a proof-of-concept project using Erlang for implementing safety-related functionality in a close-to-reality scenario.

Keywords: safety-related, fault-tolerance, supervisor hierarchies, let-it-crash, Erlang, proof of concept

1 Introduction

For decades now, the ANSI C approach for propagating locally detected error conditions to the appropriate handling routine has been to return error codes. Following this convention, it is good practice for every meaningful function to return diagnostic data and
for the function caller to check those return codes. In many cases, though, the caller is not able to handle every single one of the propagated success or error conditions in a proper and reasonable way. As a result, two undesirable patterns can be found in many applications written in the programming language C:

• Following the conventions thoroughly, a high percentage of the source code is concerned with returning, passing on and handling error codes.
• At the other extreme – to avoid the resulting effort and complexity – error codes are simply ignored, or new functions are implemented without returning any diagnostic data at all.

Although there are reasonable solutions, for example handling errors locally where possible and propagating only the remaining cases, error passing and handling very often gets tedious. Another arguable aspect is the use of the same means (return values) for announcing success as well as for propagating failure of a function, instead of separating those concerns. Since most of today's embedded systems are still written in C, the problems mentioned above remain current.

Many contemporary programming languages like C++, Java and C# provide another mechanism for reacting to errors and for passing information about unhandled critical situations to the appropriate handler. This mechanism is called exception handling. An "exception" in this context is a data structure containing specific information about the incident that has occurred. A piece of software that detects an erroneous situation initiates error handling by propagating ("throwing") this diagnostic data. Since an exception is raised using termination semantics – at least in most contemporary programming languages – the execution of the corresponding program code immediately stops.
The exception can be handled by any function higher in the call stack that encapsulates the current call in a try-catch block, or by terminating the whole application if no appropriate handler is available. Compared with returning error codes, the exception handling mechanism generally reduces the amount of handwritten error handling and error passing source code. It also treats the recognition and management of erroneous situations as a cross-cutting concern. But there are also some serious drawbacks:

• The termination semantics of exceptions may cause resource leaks, for example when using dynamic memory allocation.
• Compiler vendors may implement exception handling differently. Specific implementations may have significant runtime and executable-size overhead or may use dynamic memory allocation and thus increase timing jitter and unpredictability.

Using exception handling correctly requires some practice and experience. In complex projects, sensible architectural and design decisions have to be made regarding where and when to throw, pass on or catch exceptions.

The programming language Erlang [1] also enables a defensive programming style by providing means like return codes and exceptions. But, thanks to its lightweight process model and purely message-based inter-process communication, another error handling method is possible and favored.
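The contrast between the two Erlang styles can be made concrete with a small sketch. All names here (the module, `sensor:read/1`, and the stub helpers) are hypothetical illustrations, not taken from the paper:

```erlang
%% Sketch only: sensor:read/1 is an assumed API returning
%% {ok, Value} on success or {error, Reason} on failure.
-module(lic_example).
-export([read_defensive/1, read_lic/1]).

%% Defensive style: the caller inspects the tagged result
%% tuple and handles the error case itself.
read_defensive(Sensor) ->
    case sensor:read(Sensor) of
        {ok, Value}     -> process(Value);
        {error, Reason} -> log_and_recover(Reason)
    end.

%% Let-it-crash style: assert the success case in a pattern
%% match. Any {error, _} result raises a badmatch exception,
%% the process terminates, and its supervisor restarts a
%% fresh replacement.
read_lic(Sensor) ->
    {ok, Value} = sensor:read(Sensor),
    process(Value).

%% Hypothetical stubs standing in for real business logic.
process(Value) -> Value.
log_and_recover(_Reason) -> retry_later.
```

In the second variant the pattern match itself is the error check: recovery becomes the supervisor's responsibility rather than the caller's, which is what keeps error handling out of the business logic.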
By analyzing the alternative let-it-crash (LiC) paradigm [2], this paper aims to assess its pitfalls as well as its potential for improving error handling in critical systems. In order to gain first-hand experience with Erlang and LiC, a fictional, close-to-reality research project has been started.

Related Work. Joe Armstrong, co-inventor of the Erlang language, describes in his Ph.D. thesis [2] the main concepts behind Erlang – functional, concurrent and distributed programming – as well as the notion of "let it crash" and its implementation using supervisor hierarchies. He also highlights the idea of separating the concerns of core functionality and error handling in source code. While concentrating on the system attributes fault-tolerance and robustness, he does not elaborate further on safety aspects. V. Nicosia [3] discusses the missing hard real-time abilities of the Erlang virtual machine and proposes an implementation of a hard real-time scheduler. He also suggests using high-level, functional languages like Erlang to solve the efficiency and readability problems of traditional (object-oriented) languages and to reduce the error-proneness of source code. I.A. Letia et al. [4] concentrate on Erlang features that support security aspects, especially intrusion detection and the prevention of unintended functionality being carried out. The latter also constitutes a failure mode related to system safety.

2 Proof of concept scenario

Although often a necessity, long-term hospitalization is expensive and – e.g. because of antibiotic-resistant microbes – can even pose a serious health threat to hospitalized people with weakened immune systems. To reduce those risks and costs, an increasing number of patients are treated at home rather than on wards. Comparable concerns apply to geriatric care. One major drawback of home treatment is inadequate medical monitoring.
The project described in the following aims to develop a telemonitoring system that helps to solve this problem [Fig. 1]. Wearing so-called "functional clothing", i.e. garments incorporating wirelessly connected sensors as well as a power source, patients can be monitored without being restricted in movement by cables or probes. Within the scope of their possibilities they can move freely at home, even though their vital signs are checked constantly. Each sensor measures either heart rate, breathing rate or blood pressure and exists several times redundantly in order to compensate for sensor failures and invalid measurement positions. Because the power supply is limited – even though body movement and body heat may be used for recharging – at any point in time only one sensor of a kind is powered and active. Via a short-range wireless connection (IEEE 802.15.4 [5]), measured data is transferred to a base station located in the same house or room as the patient. This device maintains a constant connection with all active sensors, is able to power them on and off, and switches to an alternative measurement location if the need arises (failure, implausible data). At the same time, the base station establishes a connection with the hospital system via a dedicated line and continuously transfers the current vital signs for automated evaluation and storage. As a redundant measure, in addition to the hospital system, the device itself is also able – to a limited extent – to detect critical situations as well as major measurement failures. In this case it raises an audio-visual alarm, notifies the hospital via the dedicated line and, using a GSM modem, sends a short message or places an emergency call to a configurable phone number.

Fig. 1. Proof of concept scenario

The subject matter of the LiC proof of concept is the software development for the base station. The project focuses on the highest possible reliability of measurement data acquisition and of the transfer of the patient's vital signs to the hospital system. The maximum number of concurrently active sensors has to be ensured to support a fail-safe power supply. At the same time, a minimum temporal coverage of the most important vital signs has to be guaranteed: at every point in time, at least two out of three critical values (heart rate, breathing rate and blood pressure) have to be available. The hospital system in this case establishes the control channel of this dual-channel system and raises an alarm in case of missing up-to-date data. On the other hand, when detecting and reacting to a critical situation, the hospital system – because of its better diagnostic abilities – forms the primary channel, whereas the base station constitutes the control mechanism that ensures alarming in the most critical cases. The safe state of the base station thus is a complete shutdown, since in that case the hospital system raises an alarm because of missing data and, depending on the patient's criticality, informs the ambulance or, for example, a neighbor.

Regulatory Considerations.
For developing a medical system like the patient monitoring described above, compliance with a number of standards is mandatory. There are the medical device product standards IEC 60601-1 [6], IEC 61010-1 [7] and IEC 60601-2 [8], giving specific direction for the creation of a safe medical device. Then there are the management standards ISO 13485 [9] and ISO 14971 [10],
defining the foundation for developing a medical device. The process standard IEC 62304 [11], on the other hand, defines the software development lifecycle, i.e. how to develop and maintain a safe software system. Finally, there are other sources of information, guidelines and techniques that may be used, like IEC 61508-3 [12] and IEC 60812 [13].

The standards to be considered for medical device development in general value process requirements over technical aspects. For the prototypical implementation at hand, though, only the techniques used for error handling and error recovery are analyzed. While acknowledging the fact that for a real-world project the process considerations cannot be omitted, this work concentrates on guidelines regarding fault-tolerance and the corresponding safety aspects. The following table lists the software failure modes considered for assessing the abilities of a LiC-based implementation (derived from IEC 60812 examples):

Specific failure modes (IEC 60812 examples):
─ Activation error
─ Function failure
─ Does not stop
─ Untimely function
─ Intermittent function
─ Delayed function
─ Irregular function
─ Breakdown
─ Unstable
─ Function failure

Derived software failure modes:
• The software fails to perform a required function.
• The software performs a function that was neither required nor intended.
• The software performs the correct function, but at the wrong time or under inappropriate conditions.
• The timing behavior or the order of task execution is incorrect (concurrency).
• The software fails to recognize and handle critical situations.

Table 1. Software failure modes

3 Implementation and Testing

Application hypotheses. Considering the software failure modes [Table 1] for the project at hand, some application hypotheses are proposed. These assumptions are used as requirements for the software architecture of the system being developed and for assessing the developed prototype regarding the applicability of LiC:

1. LiC and supervisor hierarchies make it possible to ensure the execution of critical functions.
2. LiC and supervisor hierarchies make it possible to detect and prevent the unintended execution of a function.
3. LiC and supervisor hierarchies allow defining and monitoring the conditions for carrying out a critical function.
4. In spite of concurrency, carrying out critical functions at a specific instant of time and in a specific order is ensured.
5. Unexpected component failures at random points in time as well as erroneous input either do not influence the correct system behavior (reduced quality of service is accepted) or result in the system switching to a safe state.

Implementation. The prototypical implementation of parts of the fictional scenario as a PC application in Erlang enables straightforward deployment of different setups of worker processes and supervisors, as well as an evaluation of separating business logic from error handling. The development concentrates on the software of the base station and merely simulates the external sensors on the one side and the hospital system on the other. The diagram [Fig. 2] depicts one example setup, showing a structural view of the available processes and their dependencies.

Fig. 2. Structural view of processes and dependencies

The generic supervisor hierarchy is solely responsible for creating the worker processes (sensor drivers and data collector) and for handling errors by restarting or replacing terminated processes. The sensor drivers and the data collector, on the other hand, contain the core functionality (business logic) and no error handling at all. In case of missing sensor values, for example, the sensor driver simply terminates and gets replaced. The same happens if data is available but the values are outside of valid physical limits.

Testing fault-tolerance. Testing system attributes like robustness and fault-tolerance is a complex task. To begin with, all relevant classes of faulty input have to be identified. In this respect, this conventional method is the matching counterpart to traditional error handling in a defensive programming style. The results of the analysis are then used to confront the system under test.
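The division of labor described under Implementation above – a generic supervisor that only restarts, and workers that contain only business logic – might look roughly as follows. All module names, child specifications and plausibility limits here are assumptions for illustration, not taken from the project:

```erlang
%% Hypothetical sketch of a setup in the spirit of Fig. 2.
-module(base_station_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: only the crashed child is restarted,
    %% up to 10 restarts per 60 seconds before the
    %% supervisor itself gives up and terminates.
    {ok, {{one_for_one, 10, 60},
          [{heart_rate_driver,
            {sensor_driver, start_link, [heart_rate]},
            permanent, 1000, worker, [sensor_driver]},
           {collector,
            {data_collector, start_link, []},
            permanent, 1000, worker, [data_collector]}]}}.

%% In sensor_driver.erl, a driver loop in pure let-it-crash
%% style: a missing or implausible reading makes a match fail,
%% the process dies, and the supervisor starts a replacement
%% (which may connect to another physical sensor).
%%
%% loop(Sensor) ->
%%     {ok, Bpm} = sensor:read(Sensor),       % crashes on {error, _}
%%     true = Bpm > 20 andalso Bpm < 250,     % crashes on implausible values
%%     data_collector ! {vital_sign, heart_rate, Bpm},
%%     loop(Sensor).
```

Note that the worker performs no error handling at all: validity checks are expressed as assertions whose failure terminates the process, so all recovery policy lives in the supervisor's restart strategy.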
Massively parallel systems, like those typically developed in Erlang, facilitate another testing method for fault-tolerance. Breaking down the system functionality into fine-grained, mutually monitoring processes, which terminate in case an error occurs that cannot be handled locally, promotes process-level testing. A simple and effective variant of testing fault-tolerance is based on a so-called "Chaos Monkey" [14]: one or more processes injected into the system under test, with the sole task of randomly terminating other system processes. In traditional systems with a small number of complex tasks or threads, this typically leads to complete failure within a very short period of time. For systems following the LiC philosophy, it only triggers the process monitoring and thus a fast replacement of the terminated software part. In spite of the chaos monkey actively deactivating random components, a properly designed system is able to maintain basic functionality.

4 Conclusions and Future Work

This paper proposes the LiC paradigm as an alternative error handling approach for safety-related software development. Specific software failure modes derived from IEC 60812 examples formed the basis for formulating hypotheses regarding the applicability of LiC and supervisor hierarchies for implementing a critical medical system. A research project has been started in order to prove or disprove those assumptions, with the following results:

1. Ensure the execution of critical functions. Ill-performing tasks and groups of tasks are stopped and restarted, no matter the cause. For instance, upon malfunction a sensor driver gets terminated and replaced by another one. This, of course, does not ensure correct execution per se.
2. Prevent the unintended execution of a function.
When a functional monitor detects a worker executing an unintended function, this worker gets terminated and replaced, thereby preventing the execution. For instance, the calibration of a sensor is aborted when there is another calibration request of higher priority, in order to keep the required number of sensors operational at all times.
3. Define and monitor the conditions for carrying out a critical function. Workers and functional monitors can control task execution and results using distinct validation checks. From a LiC perspective, this excludes any measures to correct the situation other than restarting the affected processes. For instance, a sensor driver validates the data received from its connected sensor before forwarding it to the collector. If a violation is detected, the driver terminates itself so that the supervisor can start another driver, which in turn can connect to another physical sensor.
4. Ensure carrying out critical functions at a specific time and in a specific order. Setting real-time requirements aside, conflicts within well-ordered task sequences can be resolved by terminating blocking processes which violate the order or time constraint at hand. For instance, an ordering is imposed by updating dedicated timestamps for sensor calibration. Thus livelocks under calibration concurrency can
be prevented – allowing only one sensor to calibrate at a time – and calibration of any sensor type is guaranteed within a distinct time interval.
5. Unexpected failures either have no influence or result in a safe state. Malfunctioning processes are immediately replaced by new ones. Fatal function loss immediately results in system shutdown. For instance, the patient monitoring system is robust with regard to sporadic process crashes as well as to the complete loss of one sensor data type. Losing more than one sensor type, however, is considered fatal and immediately results in system shutdown.

The missing hard real-time abilities of Erlang pose a problem when it comes to time-critical safety applications. There are strategies to solve this issue, though; in the future, those solutions have to be analyzed and developed further. Regarding the applicability of LiC to non-medical, safety-critical systems, the underlying Erlang language features have to be evaluated against safety standards like IEC 61508-3. Some research is also necessary on whether it is possible to apply LiC without Erlang. Analyzing the underlying language features and their counterparts in other languages or frameworks will provide the necessary information.

5 References

1. Erlang. http://www.erlang.org
2. Armstrong, J.: Making Reliable Distributed Systems in the Presence of Software Errors. Ph.D. dissertation, Royal Institute of Technology, Stockholm (2003)
3. Nicosia, V.: Towards Hard Real-Time Erlang. University of Catania (2007)
4. Letia, I.A., Marian, D.A.: Embarking on the Road of Intrusion Detection, with Erlang. Technical University of Cluj-Napoca (2010)
5. IEEE Std. 802.15.4: Wireless Medium Access Control (MAC) and Physical Layer (PHY) Specifications for Low Rate Wireless Personal Area Networks (LR-WPANs) (2003)
6. IEC 60601-1 Ed. 3.0, Medical electrical equipment – Part 1: General requirements for basic safety and essential performance. IEC Geneva (2005)
7. IEC 61010-1 Ed. 3.0, Safety requirements for electrical equipment for measurement, control, and laboratory use – Part 1: General requirements. IEC Geneva (2010)
8. IEC 60601-2 Ed. 3.0, Medical electrical equipment – Part 1-2: General requirements for basic safety and essential performance – Collateral standard: Electromagnetic compatibility – Requirements and tests. IEC Geneva (2007)
9. ISO 13485 Ed. 2.0, Medical devices – Quality management systems – Requirements for regulatory purposes. ISO Geneva (2003)
10. ISO 14971 Ed. 2.0, Medical devices – Application of risk management to medical devices. ISO Geneva (2007)
11. IEC 62304 Ed. 1.0, Medical device software – Software life cycle processes. IEC Geneva (2006)
12. IEC 61508-3 Ed. 2.0, Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 3: Software requirements. IEC Geneva (2010)
13. IEC 60812 Ed. 1.0, Analysis techniques for system reliability – Procedure for failure mode and effects analysis (FMEA). IEC Geneva (2001)
14. Chaos Monkey. http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
