Failures of safety-critical electronic systems can result in loss of life, substantial financial damage or severe harm to the environment.
Safe computer systems are typically used in avionics or railway applications requiring particularly high reliability. This also goes for the medical market, while industrial automation environments demand more and more functional safety as technology becomes readily available.
5 Techniques to Achieve Functional Safety for Embedded Systems
1. Textmasterformat bearbeiten
▪ Second Level
▪ Third Level
▪ Fourth Level
Fifth Level
August 24, 2017
5 Techniques to Achieve Functional Safety for
Embedded Systems
2. 2
The Need for Safe Computing
Failures of safety-critical electronic systems can result in loss of life, substantial
financial damage or severe harm to the environment.
Safe computer systems are typically used in avionics or railway applications
requiring particularly high reliability. This also goes for the medical market, while
industrial automation environments demand more and more functional safety as
technology becomes readily available.
One of the key design elements of a safety-critical system is redundancy. Other
techniques are diversity in components, determinism and predictable behavior,
clustering to increase availability and supervisor and event logging features.
Considerations about mission-critical computer architectures are complex and
include safety-critical characteristics, reliability questions, error behavior modes,
Safety Integrity Levels (up to SIL 3 or SIL 4) and the major IEC and EN standards,
e.g., EN 50128 / EN 50129 for railways or DO-254 for avionics (up to DAL-A).
4. 4
Redundancy
Redundancy. Multiplying critical components, such as the CPU, increases the
function's reliability.
The most important strategy to make a system less vulnerable to risk is to
multiply significant components. A component that by failing brings the entire
system to a halt is called a "Single Point of Failure" (SPOF). If critical components,
such as the CPU, are redundant, the availability and/or reliability of the functions
increase.
Depending on what you want to achieve, you can use different redundancy
configurations. To do this, you name the number of functions that must be in
working order in case of a failure (M) compared to the total number of
redundant functions (N). This results in “M out of N”, abbreviated as MooN.
5. 5
Redundancy - MooN Constellations
With safe redundant functions, all
components must also deliver the same
computing results, to allow for the detection
of errors, in the simplest case in a 2oo2
system. This reduces availability (fail-safe)
Inputs Controller Outputs
M
Inputs Controller Outputs
Inputs Controller Outputs
M
Inputs Controller Outputs
Inputs Controller Outputs
Inputs Controller Outputs
Inputs Controller Outputs
2oo3
Voter
M
A 1oo2 constellation increases availability of
the system and, by consequence, the Mean
Time Between Failures (MTBF). If one of two
processors fails, a 1oo2 system can still go on
operating (fail-operational).
The 2oo3 set-up is used frequently, because it
increases both safety and availability. With
such a level of complexity, a voting
mechanism, or voter, is an inherent part of
the system. It permanently compares and
analyzes computing results.
6. 6
Diversity
Diversity. If redundant components are identical, a common cause can make
them fail. This is why a system must support dissimilarities both in hardware and
in software.
For instance, you can run different,
independently designed software applications
on the subsystems. On the hardware side you
could use different I/O interfaces. Identical
functions are implemented in varying ways. In
the end the two dissimilar set-ups must lead to
the same result, so that the system can act in a
defined way. Diversity is even possible on one
single board: memory management of the
processors allows to partition the resources,
which is in turn supported by real-time
operating systems like PikeOS.
Safe Application Safe Application
Linux Windows
Linux Drivers Windows Drivers
x86 Architecture RISC Architecture
7. 7
Clustering
Clustering. This does not increase a subsystem's safety, but it raises availability.
Backing up a system is using redundancy on a higher level with the aim of
keeping your system up even in case of a failure.
It is possible to combine two assemblies to form a highly available computer
cluster. In a set-up like this, every channel – being redundant itself – works
independently, but only one channel is active. If the active channel fails, the
system automatically switches to the second channel. The boards can be
connected using dedicated serial interfaces:
Sensors
1
2
3
Cluster
Active Computer
Stand-By Computer Stand-By Output
Active Output
Actor
UARTs (DEX) make for communication
between the two channels. A direct
connection between the Board
Management Controllers (BMCX)
controls the switch-over from the active
to the inactive channel.
8. 8
Determinism
Determinism. The need for predictable behavior forbids a number of
mechanisms, like interrupts, common in non-critical applications. Design
engineers need particular expertise in this respect.
Next to failure safety, mission-critical environments also demand calculable
execution times. The system must react to an external event within a defined
time, even under worst case conditions.
Engineers need to consider possible behavior and its consequences in detail at
an early stage, in preparation for their actual design. In terms of hardware and
firmware, BITE components are used here – Built-In Test Equipment. Errors
handling techniques such as ECC (Error Correcting Code) or the monitoring of
internal voltages play an important role, here, too.
When it comes to software, system integrators in need of deterministic behavior
select a real-time system like VxWorks or PikeOS.
9. 9
Supervisors, Event Logging
Supervisors. Board management and supervision in safe computers need to go
beyond the usual CPU functions. A reliable CPU should have a dedicated monitor
at its side rather than supervise itself.
Event Logging. While this is not a necessary safety function, it can help track back
faults in critical systems in case of an incident. Chances are higher to avoid the
error cause in the future by taking precautions.