2. Definitions
• Fault tolerance is a means of achieving dependability: it lets
us prevent system failure in the presence of faults.
• Layer: an abstraction layer or system component.
• Cross-layer approach (or design) is used when it is more efficient
to distribute the task between several layers rather than execute
it only at one layer.
• Many-core systems contain tens, hundreds or thousands of
cores (multi-core systems have 2-8 cores).
3. Introduction
• Systems’ complexity and abstraction
• TCP/IP as a motivating example
• Many-core systems
• Layered fault tolerance
• Cross-layer fault tolerance
4. Systems’ Complexity and Abstraction
• Abstraction simplifies the understanding of the system structure
• Layers of the computer system
• OSI model
• TCP/IP (Internet protocol suite)
• Object-oriented programming
• Components of the system are considered as black boxes
• Each component should provide a predefined service according to
its interface
6. TCP/IP cross-layer fault tolerance
• All layers participate in error
detection and error recovery
• Error detection and recovery are performed through the
cooperative activities of several layers
• If an error is not detected at a lower layer, it will be detected and
recovered at a higher layer
• Efficiency and flexibility of TCP/IP
Layer         Error detection    Error recovery
Application   Status codes       Retransmission or custom recovery
Transport     16-bit checksum    TCP: ack., neg. ack., ARQ, seq. number
Internet      16-bit checksum    Discard corrupted packet
Link          CRC-32             Discard corrupted packet
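The 16-bit check performed at the transport and Internet layers is the Internet checksum (RFC 1071), a one's-complement sum rather than a polynomial CRC. A minimal Python sketch follows. The `receive` helper and its convention of returning `None` for a discarded payload are illustrative assumptions, not part of any real network stack.

```python
from typing import Optional


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into 16 bits
    return ~total & 0xFFFF


def receive(payload: bytes, checksum: int) -> Optional[bytes]:
    """Hypothetical receiver: discard the payload if the checksum does not match."""
    if internet_checksum(payload) != checksum:
        return None  # corrupted: dropped here, recovery is left to another layer
    return payload


if __name__ == "__main__":
    segment = b"hello"
    check = internet_checksum(segment)
    print(receive(segment, check))   # intact payload is passed up
    print(receive(b"hellp", check))  # corrupted payload is discarded
```

A discarded payload is never acknowledged, so recovery falls to a higher layer (TCP retransmission or an application-level retry), which is the cooperation between layers that the table summarises.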
7. Many-core systems
• 10, 100 or even 1000 cores
• Heterogeneous architectures
• Redundant cores for ensuring fault tolerance
• Performance, energy efficiency and reliability are very important
factors for many-core systems
8. Layered fault tolerance
• Faults can occur at different layers of the system stack
• Most errors are handled at the layer where they are
detected
• Convenience for the developer
• Convenience takes precedence over system efficiency
9. Layered fault tolerance
• System layers are considered separately
• Unnecessary error corrections are possible
• An upper layer cannot specify the quality of
service it requires from the layer below it
• Not optimal in terms of performance and
energy consumption
10. Cross-Layer Fault Tolerance
• Fault tolerance will be distributed across
the system stack
• Useful information about the system state
will be shared among the layers
• Various application domains
• Upper layers will be able to specify their
current needs and the required service
level
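The idea of an upper layer stating its required service level can be sketched as follows. This is only an illustration: `ServiceLevel`, `LowerLayer`, `UpperLayer` and the `fault_detected` flag are hypothetical names invented here, and the `corrections` counter merely stands in for the cost of a real recovery action.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class ServiceLevel(Enum):
    BEST_EFFORT = 1  # faults are ignored
    DETECT_ONLY = 2  # faults are reported upward but not corrected
    CORRECT = 3      # faults are corrected inside the lower layer


@dataclass
class ErrorReport:
    layer: str
    description: str


class LowerLayer:
    def __init__(self) -> None:
        self.level = ServiceLevel.CORRECT
        self.corrections = 0  # stands in for time and energy spent on recovery

    def set_service_level(self, level: ServiceLevel) -> None:
        self.level = level

    def deliver(self, data: str, fault_detected: bool) -> Tuple[str, Optional[ErrorReport]]:
        if not fault_detected or self.level is ServiceLevel.BEST_EFFORT:
            return data, None
        if self.level is ServiceLevel.CORRECT:
            self.corrections += 1
            return data, None
        # DETECT_ONLY: share state information with the layer above
        return data, ErrorReport("lower", "fault detected, not corrected")


class UpperLayer:
    def __init__(self, lower: LowerLayer, critical: bool) -> None:
        self.lower = lower
        self.reports: List[ErrorReport] = []
        # The upper layer states what it currently needs from the layer below.
        lower.set_service_level(ServiceLevel.CORRECT if critical else ServiceLevel.DETECT_ONLY)

    def run(self, data: str, fault_detected: bool) -> str:
        result, report = self.lower.deliver(data, fault_detected)
        if report is not None:
            self.reports.append(report)  # the upper layer decides how to react
        return result
```

With `critical=False` the lower layer skips the correction and only reports the fault, which avoids the unnecessary corrections that a strictly layered design may perform.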
11. Cross-layer design for wireless sensor
networks
• A single-layer approach cannot share important information among different
layers
• No layer has complete information, so optimal operation of the
entire network cannot be guaranteed
• A single-layer approach cannot adapt to changes in the
environment
L. Carnevali, L. Ridi, E. Vicario, "Stochastic Fault Trees for cross-layer power management of WSN monitoring systems," IEEE Conference on Emerging Technologies & Factory Automation, pp. 1-8, 2009.
P. Rachelin Sujae, M. Vigneshpandi, "A Cross Layer Fault Tolerant Communication Architecture for Wireless Sensor Networks," Middle-East Journal of Scientific Research, pp. 1292-1296, 2014.
Y. Wang, H. Wu, F. Lin, N.F. Tzeng, "Cross-Layer Protocol Design and Optimization for Delay/Fault-Tolerant Mobile Sensor Networks (DFT-MSN’s)," IEEE Journal on Selected Areas in Communications, vol. 26, no. 5, pp. 809-819, 2008.
12. Challenges
• Investigate the trade-offs among reliability, performance and
energy consumption in many-core systems
• Ensure cross-layer fault tolerance for many-core systems
• Demonstrate that applying cross-layer fault tolerance can
improve performance and energy efficiency
13. Plan
• Implement a case study to gain experience in developing
cross-layer fault tolerance
• Apply Order Graphs to model cross-layer fault tolerance, power
consumption and performance of many-core systems
• Design novel mechanisms, libraries and patterns that will help in
engineering cross-layer fault tolerance for many-core systems
14. Case study: Car number plate recognition
application
• Several character recognition algorithms
• Possibility to specify the operational mode: reliability,
performance, energy efficiency or a trade-off among
these parameters.
• Recover from two types of errors:
• CPU core error.
• Insufficient Quality of Service.
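One way the case study could be organised is sketched below, assuming each recognition algorithm carries a rough profile and returns a confidence score. Everything here is hypothetical: the `Algorithm` profile fields, the numbers, the `CoreError` exception and the plate text are invented for illustration and do not describe the actual application.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class CoreError(Exception):
    """Raised when the (simulated) CPU core running an algorithm fails."""


@dataclass
class Algorithm:
    name: str
    accuracy: float   # hypothetical profile values
    time_ms: float
    energy_mj: float
    run: Callable[[bytes, int], Tuple[str, float]]  # (image, core) -> (text, confidence)


def order_for_mode(algorithms: List[Algorithm], mode: str) -> List[Algorithm]:
    keys = {
        "reliability": lambda a: -a.accuracy,
        "performance": lambda a: a.time_ms,
        "energy": lambda a: a.energy_mj,
    }
    return sorted(algorithms, key=keys[mode])


def recognise(image: bytes, algorithms: List[Algorithm], mode: str,
              cores: List[int], min_confidence: float) -> Tuple[str, str, int]:
    for algo in order_for_mode(algorithms, mode):  # QoS recovery: try the next algorithm
        for core in cores:                         # core-error recovery: try the next core
            try:
                text, confidence = algo.run(image, core)
            except CoreError:
                continue
            if confidence >= min_confidence:
                return text, algo.name, core
            break  # insufficient QoS: another core would give the same result
    raise RuntimeError("no algorithm met the required quality of service")


def make_runner(text: str, confidence: float, faulty_cores=frozenset()):
    """Build a stand-in recogniser that fails on the given cores."""
    def run(image: bytes, core: int) -> Tuple[str, float]:
        if core in faulty_cores:
            raise CoreError(f"core {core} failed")
        return text, confidence
    return run


ALGORITHMS = [
    Algorithm("fast", 0.80, 5, 1, make_runner("AB12 CDE", 0.60, {0})),
    Algorithm("accurate", 0.99, 50, 10, make_runner("AB12 CDE", 0.95, {0})),
]

if __name__ == "__main__":
    print(recognise(b"", ALGORITHMS, "performance", cores=[0, 1], min_confidence=0.9))
```

In this run core 0 fails for both algorithms, and the fast algorithm's confidence is below the threshold, so the call recovers from both error types and ends with the accurate algorithm on core 1.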