Quasi-static fault-tolerant scheduling schemes for
energy-efficient hard real-time systems
• Wei Tongquan, CS Department of East China Normal University, China
• Piyush Mishra, GE Global Research, Niskayuna, NY 12309, USA
• Kaijie Wu, ECE Department of University of Illinois, Chicago, IL 60607, USA
• Junlong Zhou, CS Department of East China Normal University, China
Journal of Systems and Software
A Unified Approach for Fault Tolerance and Dynamic
Power Management in Fixed-Priority Real-Time Embedded Systems
• Ying Zhang
– a Senior Software engineer with the Research and Development
Department, Guidant Corporation, St. Paul, MN, USA
• Krishnendu Chakrabarty
– Department of Electrical and Computer Engineering, Duke University,
Computer-Aided Design of Integrated Circuits and Systems,
IEEE Transactions on 25, no. 1 (2006): 111-125.
Checkpointing & Response Time
Reliability, The best fault tolerance count?
Offline Application Level Voltage Scaling
Offline Task Level Voltage Scaling
Online DVS by Using Slacks
Previous Work (Ying Zhang, Krishnendu Chakrabarty, 2006)
• Fault Tolerance Scheduling
Fault occurrences at runtime, checkpointing and state restoration.
• Dynamic Voltage Scaling (DVS)
• Offline Scheduling
Application Level Voltage Scaling (A-DVS)
Task Level Voltage Scaling (T-DVS)
• Online Scheduling
• Exact Rate-Monotonic Characterization
Used for feasibility analysis instead of iteratively deriving the
response time of each task.
Online DVS Outline
• The offline task schedules are adapted to the runtime behavior
of fault occurrences in three steps:
(1) Pre-compute and save in a lookup table the maximum slack
requirements for the processor to dynamically slow down.
(2) At runtime, retrieve the stored slack requirements and compare
them with the cumulative slack generated so far.
(3) Dynamically scale down the processor speed when the generated
slack is equal to or greater than the stored slack requirement.
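The three steps above can be sketched as a table lookup followed by a comparison. This is only an illustration: the table layout, the task names, and the function `select_speed` are hypothetical and not taken from the paper, and speeds are assumed to be normalized to the maximum frequency.

```python
from typing import Dict, Tuple

# Hypothetical offline table: task name -> (required slack, slower speed).
# Step (1): values are pre-computed offline and stored.
SlackTable = Dict[str, Tuple[float, float]]


def select_speed(table: SlackTable, task: str,
                 cumulative_slack: float, offline_speed: float) -> float:
    """Return the speed to run `task` at, given the slack generated so far."""
    # Step (2): retrieve the stored requirement and compare it with the slack.
    required_slack, slow_speed = table[task]
    # Step (3): slow down only when enough slack has accumulated.
    if cumulative_slack >= required_slack:
        return min(slow_speed, offline_speed)
    return offline_speed


table: SlackTable = {"t1": (2.0, 0.6), "t2": (5.0, 0.5)}
print(select_speed(table, "t1", 2.0, 0.8))  # enough slack, runs slower
print(select_speed(table, "t2", 4.9, 0.8))  # not enough slack, keeps offline speed
```

Keeping the runtime work to one lookup and one comparison is what makes the online step cheap; the expensive analysis stays offline.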
Fault-tolerant computing refers to the correct execution of user
programs and system software in the presence of faults.
Fault tolerance is typically achieved in real-time systems through
online fault detection, checkpointing, and rollback recovery.
Checkpointing increases the task execution time, and in the absence
of faults, it might cause a missed deadline for a task that completes
on time without checkpointing.
Frequent checkpointing reduces re-execution time due to faults but
increases task execution time and vice versa.
Therefore, the checkpointing interval, i.e., the duration between two
consecutive checkpoints, must be carefully chosen to balance
checkpointing cost with the re-execution time.
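The trade-off can be made concrete with a simplified cost model. The model below is an assumption for illustration, not the formula used in the papers: a task with fault-free execution time E is split into n equal segments, a checkpoint costing c is saved after each segment, and each of k faults forces one segment and its checkpoint to be repeated.

```python
def worst_case_time(E: float, c: float, k: int, n: int) -> float:
    """Worst-case completion time under the assumed model.

    E: fault-free execution time, c: cost of one checkpoint,
    k: faults to tolerate, n: number of equal segments (checkpoints).
    """
    segment = E / n
    # Base work + checkpoint overhead + re-execution caused by k faults.
    return E + n * c + k * (segment + c)


def best_checkpoint_count(E: float, c: float, k: int, max_n: int = 1000) -> int:
    """Exhaustively pick the segment count that minimizes worst_case_time."""
    return min(range(1, max_n + 1), key=lambda n: worst_case_time(E, c, k, n))


n_best = best_checkpoint_count(E=100.0, c=1.0, k=4)
print(n_best, worst_case_time(100.0, 1.0, 4, n_best))
```

Under this model the overhead term n·c grows with n while the re-execution term k·E/n shrinks, so the minimum sits near n = sqrt(kE/c).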
Fault occurrences count
• Relation between fault occurrences count and fault arrival rate
k is the number of fault occurrences to be tolerated.
Given a fault arrival rate λ and a task execution interval t, the mean
number of faults that arrive during the interval is λt.
o If k is much smaller than λt, a sophisticated fault-tolerant scheme with its
associated overhead is not appropriate.
o If k is much larger than λt, a fault-tolerant scheme that provides a deterministic
real-time guarantee may not exist.
In order to target a system with reasonable real-time performance with
fault tolerance, the value of k can be taken to be a small multiple of λt,
e.g., 2λt ≤ k ≤ 3λt.
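As a small worked example of the rule of thumb 2λt ≤ k ≤ 3λt, the helper below returns the integer values of k that fall in that range. The function name and the rounding to integers are assumptions made for illustration.

```python
import math
from typing import Tuple


def fault_count_range(lam: float, t: float) -> Tuple[int, int]:
    """Smallest and largest integer k with 2*lam*t <= k <= 3*lam*t."""
    mean_faults = lam * t  # expected number of faults in the interval
    return math.ceil(2 * mean_faults), math.floor(3 * mean_faults)


low, high = fault_count_range(lam=0.5, t=10.0)
print(low, high)
```

For very small λt the range may contain no integer (the lower bound exceeds the upper bound), in which case the designer has to pick k by other means.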
Exact Characterization of RMA (ECRMA)
• Critical Instant
The worst case behavior of RMA occurs when all tasks in a task set are
instantiated simultaneously and are ready for execution immediately after
It has been shown that a schedule of independent periodic tasks is
feasible if the first instance of each task is schedulable when it is
instantiated at a critical instant (Lehoczky et al., 1989).
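A minimal sketch of the exact rate-monotonic test follows. It assumes integer execution times and periods, deadlines equal to periods, and no checkpointing or fault-recovery overhead, so it shows only the basic critical-instant check rather than the fault-tolerant variant discussed in the slides. The function name `rm_feasible` is hypothetical.

```python
from typing import List, Tuple


def rm_feasible(tasks: List[Tuple[int, int]]) -> bool:
    """Exact RM feasibility test for tasks given as (execution time, period)."""
    tasks = sorted(tasks, key=lambda ct: ct[1])  # shorter period = higher priority
    for i, (_, period_i) in enumerate(tasks):
        higher = tasks[: i + 1]  # task i and every higher-priority task
        # Scheduling points: multiples of higher-priority periods up to period_i.
        points = sorted({m * p for _, p in higher
                         for m in range(1, period_i // p + 1)})
        # Task i is schedulable if the demand released from the critical
        # instant fits within some scheduling point t.
        if not any(sum(c * -(-t // p) for c, p in higher) <= t for t in points):
            return False
    return True


print(rm_feasible([(1, 4), (2, 6), (3, 12)]))
print(rm_feasible([(2, 4), (3, 6), (3, 12)]))
```

The expression `-(-t // p)` is integer ceiling division, which counts how many instances of a task with period p are released in the interval [0, t).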
A-DVS algorithm (2)
• Some Considerations
The binary search based A-DVS algorithm is valid only if the energy
consumption is monotonic with respect to frequency/voltage changes.
When the processor static power consumption as well as context
switching overhead is considered, the monotonicity does not hold.
In this case, there exists a critical processor speed below which scaling
down the processor speed will instead increase the energy consumption.
The minimum voltage level low is initialized to the level corresponding
to the processor critical speed.
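The binary search with the lower bound clamped at the critical speed can be sketched as follows. The feasibility check is passed in as a callable and is assumed to be monotonic (feasible at one speed implies feasible at every higher speed); `lowest_feasible_speed` and its parameters are hypothetical names, not the paper's interface.

```python
from typing import Callable, Optional, Sequence


def lowest_feasible_speed(levels: Sequence[float], critical_speed: float,
                          is_feasible: Callable[[float], bool]) -> Optional[float]:
    """Binary search for the slowest feasible level not below the critical speed."""
    # Levels below the critical speed are excluded: slowing further would
    # increase energy once static power and switching overhead are counted.
    candidates = [s for s in sorted(levels) if s >= critical_speed]
    if not candidates or not is_feasible(candidates[-1]):
        return None  # not schedulable even at the highest level
    low, high = 0, len(candidates) - 1
    while low < high:
        mid = (low + high) // 2
        if is_feasible(candidates[mid]):
            high = mid       # feasible: try slower levels
        else:
            low = mid + 1    # infeasible: need a faster level
    return candidates[low]


levels = [0.2, 0.4, 0.6, 0.8, 1.0]
print(lowest_feasible_speed(levels, 0.4, lambda s: s >= 0.55))
```

In a real system `is_feasible` would run the schedulability analysis with execution times stretched by the candidate speed; the stand-in lambda here only exercises the search.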
Online reevaluation of DVS policies
Offline scheduling assumes that all tasks exhibit the worst case execution
time and all faults occur during the checkpointing.
The runtime behavior of task execution and fault occurrences can vary.
At runtime, not all tasks execute up to their worst case execution times
and not all faults occur during task executions.
Hence, the slack generated in the runtime could be used to dynamically scale
down the processor speed to save energy.
The online reevaluation of DVS policies can save significant energy by using
generated slacks due to uncertainties in fault occurrence.
Heuristic Method Based on GA (2)
• Init function
Initializes the search space (chromosome population).
One chromosome is initially generated using the computationally
feasible application-level speed scaling method.
The other chromosomes are generated randomly.
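The Init function described above might look like the sketch below. The chromosome encoding (one speed level per task) and all names are assumptions for illustration; the slides do not specify the encoding.

```python
import random
from typing import List, Sequence


def init_population(num_tasks: int, pop_size: int, levels: Sequence[float],
                    app_level_speed: float, seed: int = 0) -> List[List[float]]:
    """Build the initial chromosome population for the GA."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    # One chromosome comes from the application-level speed scaling result,
    # so the population starts with at least one feasible candidate.
    seeded = [app_level_speed] * num_tasks
    # The remaining chromosomes assign a random speed level to each task.
    others = [[rng.choice(levels) for _ in range(num_tasks)]
              for _ in range(pop_size - 1)]
    return [seeded] + others


levels = [0.4, 0.6, 0.8, 1.0]
population = init_population(num_tasks=4, pop_size=6, levels=levels,
                             app_level_speed=0.8)
print(len(population), population[0])
```

Seeding with a known-feasible chromosome gives the search a valid starting point, while the random chromosomes supply the diversity the GA needs to explore task-level speed assignments.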