This paper covers availability tactics such as ping/echo and heartbeat.


Using Models to Improve the Availability of Automotive Software Architectures

Charles Shelton, Christopher Martin
Research and Technology Center, Robert Bosch LLC
[Charles.Shelton, Christopher.Martin]@us.bosch.com

Fourth International Workshop on Software Engineering for Automotive Systems (SEAS'07) 0-7695-2968-2/07 $20.00 © 2007

Abstract

This paper presents an initial model for evaluating and improving the availability of a software architecture design. The model is implemented as a reasoning framework in the ArchE architecture expert system developed jointly with the Software Engineering Institute. To ensure continuous availability, many automotive electronic control units (ECUs) employ an external watchdog running on a separate CPU to monitor the software running on the ECU. If the ECU has a failure that interrupts its functionality, the watchdog can detect this and reset the ECU to restore correct operation. The availability model can automatically evaluate the effectiveness of a watchdog design in the software architecture and can propose improvements to achieve better availability before implementation decisions are made. The model enables a quantitative analysis of system availability that can better guide software architecture and dependability design decisions and potentially reduce implementation and testing effort.

1. Introduction

Many automotive electronic control units (ECUs) implement most of their functionality in real-time software systems. Thus, ensuring the availability of the software system is essential to guaranteeing the dependable operation of the ECU. This paper presents a model for evaluating the availability of a software architecture. The results of this model can then be used to judge the comparative effectiveness of different design mechanisms in the software architecture for improving system availability.

For our initial study, we focus on the watchdog concept. The watchdog is a major design mechanism used in ECUs to compensate for transient system failures and maintain availability. A watchdog is usually an external circuit or processor that monitors the main CPU on the ECU. The application software running on the CPU periodically sends a signal to the watchdog indicating that it is still functioning. If a failure occurs in the ECU software, the software cannot send this signal, and the watchdog therefore determines that a system failure has occurred. Once the watchdog detects a failure, it triggers a system reset to recover the system and resume normal operation.

The watchdog concept has been widely used in the industry for at least 20 years [5]. However, there has been little effort to evaluate the effectiveness of the watchdog, and whether it provides the improved availability it promises. Many automotive software developers and architects regard the watchdog as an “insurance policy” to protect against any unforeseen hardware and software faults in the ECU. Therefore, it is important to understand how effective a watchdog is in improving the availability of the ECU software.

We developed the availability model as a reasoning framework (RF) for the ArchE architecture expert system that has been developed with Bosch support at the Software Engineering Institute (SEI) at Carnegie Mellon University (CMU) (a full description of the ArchE methodology is available in [1]). ArchE (Architecture Expert) is a rule-based expert system that uses models to evaluate a software architecture design and how well it satisfies its non-functional quality requirements (e.g. real-time performance and modifiability), and automatically proposes suggestions to improve the architecture design when it does not satisfy those requirements. The Bosch Rapid Architecture Prototyping tool (RAPT) focuses on incorporating the ArchE expert system into a design tool for software architects that enables model-based software architecture design.

The remainder of this paper is organized as follows. Section 2 gives an overview of ArchE with important definitions. Section 3 describes the availability reasoning framework we developed and how we model and calculate availability for a software architecture design. Section 4 details some open issues with the accuracy of the model and our plan for validating the model. Section 5 concludes the paper.

2. Overview of ArchE

ArchE is composed of multiple reasoning frameworks (RFs) that evaluate the architecture with respect to a particular quality. Each RF can process the requirements for its quality, generate an initial architecture design based on the requirements, evaluate how well the architecture satisfies the quality requirements using a model derived from the architecture design, and propose design suggestions (tactics) to improve the architecture and bring it closer to satisfying its quality requirements.

Since each RF evaluates only a single quality, the RFs may propose tactics that conflict with each other to satisfy their individual quality requirements. For example, a modifiability tactic that decomposes and encapsulates software modules behind interfaces may introduce additional runtime execution penalties that adversely affect real-time performance. Therefore, ArchE has an arbitration module, the Seeker, that collects the results and suggestions from each RF, determines what potential side effects there are for applying each tactic, and evaluates the tactics to decide which tactics promise the most net improvement of the architecture design. Figure 1 shows the decomposition of the ArchE system and its major components.

[Figure 1. High-Level Modules in ArchE. Modules shown: Architect (ArchE User), User Interface Interaction, Seeker, and the Performance, Availability, and Modifiability RFs. Flow labels: Input to the Expert System; Results of RF Evaluations; Proposed Architecture; Selected Design Tactics; Requirements Input from User; Commands to Apply Tactics for an RF; Model Evaluation Results from RFs; Design Tactic Suggestions from RFs.]

2.1 Basic ArchE Concepts

The first step in providing information to ArchE to evaluate an architecture design is to provide the software requirements that must be fulfilled. Both functional and non-functional requirements must be specified. Functional requirements are specified as responsibilities, and non-functional quality requirements are specified as quality attribute scenarios. Responsibilities represent the units of functionality that the software must provide. They are the “atomic” units that are assigned to architecture design elements (modules and tasks) in the architecture to be implemented. Each responsibility has a set of parameters (e.g. execution time in milliseconds, cost of change in person-days) for which the architect must provide some initial estimates. These are the input parameters that will be used for executing the RF models and evaluating the architecture design for each quality.

Quality attribute scenarios (described in detail in [2]) define the software quality requirements in a concise format. Each scenario includes a response measure that specifies a quantitative constraint that can be evaluated based on the results of an RF model. A scenario is defined by six parts: stimulus, source of stimulus, environment, artifact, response, and response measure. The response measure is the critical part of the scenario because it provides a quantitative constraint with which to evaluate whether the scenario is satisfied. Table 1 illustrates an example of a real-time performance quality attribute scenario. Each RF takes the set of scenarios for its quality and uses their response measures as constraints to evaluate whether the model derived from the architecture satisfies its requirements.
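The six-part scenario structure described above can be sketched as a small record type. This is only an illustration: the class name, field names, and the `is_satisfied` helper are assumptions made for this example and are not ArchE's actual schema or API. The values are taken from the performance scenario in Table 1, and the response measure is treated as a deadline in milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityAttributeScenario:
    """Six-part scenario; field names are illustrative, not ArchE's schema."""
    source_of_stimulus: str
    stimulus: str
    environment: str
    artifact: str
    response: str
    response_measure_ms: float  # quantitative constraint, here a deadline

    def is_satisfied(self, measured_ms: float) -> bool:
        """Return True when a model result meets the response measure."""
        return measured_ms <= self.response_measure_ms


# Values copied from the performance scenario in Table 1.
controller_scenario = QualityAttributeScenario(
    source_of_stimulus="Input data received from sensor",
    stimulus="Periodic activation every 10 ms",
    environment="Normal run-time operation",
    artifact="Control algorithm, sensor, actuator",
    response="Compute and update controller output value to actuator",
    response_measure_ms=10.0,
)
```

An RF model result, such as a computed task latency, would then be compared against the scenario with `controller_scenario.is_satisfied(latency_ms)`.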
Table 1. Example Quality Attribute Scenario

Performance Scenario: A controller receives periodic input from a sensor every 10 ms. The controller must run its control algorithm and send an output to a system actuator within 10 ms after receiving the sensor input.

  Source of Stimulus   Input data received from sensor
  Stimulus             Periodic activation every 10 ms
  Environment          Normal run-time operation
  Artifact             Control algorithm, sensor, actuator
  Response             Compute and update controller output value to actuator
  Response Measure     Complete controller operation within 10 ms

2.2 Elements of a Reasoning Framework

Each RF consists of several modules, each containing specific sets of rules in the ArchE system. These modules provide the following functions:
  • Scenario and Responsibility Parameter Definition
  • Initial Design Creation
  • Model Interpretation and Evaluation
  • Suggest Design Tactics
  • Apply Design Tactics
We describe these modules using the real-time performance RF as an example.

In the Scenario and Responsibility Parameter Definition module, the scenario type for the RF’s quality is defined. Each scenario specified for this quality must conform to the format defined by the RF. This enables the rules in ArchE to automatically process the scenarios in each RF. When considering real-time performance, each scenario’s response measure determines a real-time deadline for a particular system function, and the scenario’s stimulus indicates whether the function is periodic or sporadic.

In addition to the scenario type definition, each RF defines a set of parameters that must be provided as input for each functional responsibility. These parameters are required as inputs for the execution of the analysis model. The performance RF requires that each responsibility specify an execution time. With these execution times, the performance RF can assign execution times to each task in the architecture based on which functional responsibilities are assigned to which tasks.

The Initial Design Creation module contains rules for generating an initial architecture design from the requirements (responsibilities and scenarios) alone. This is required when a project starts with only a requirements specification and there is no existing architecture design to evaluate. The RF uses some general heuristics to develop an initial architecture. For the performance RF, the rules assume that each scenario and its response measure define a separate task in the architecture, and create the set of tasks. The RF assigns responsibilities to each task according to which responsibilities are linked to each scenario. Each task’s period (or minimum arrival time for a sporadic task) is derived from its scenario’s stimulus, and its deadline is derived from its scenario’s response measure. Once these tasks are generated, the performance RF can evaluate the architecture design using a real-time performance analysis.

Alternatively, if the architect has a specific design he or she wants to evaluate, this design can be used instead of the output of the Initial Design Creation module. Once an architecture design is present, the RF can interpret and evaluate its model to determine whether the architecture satisfies its scenarios for that quality.

The Model Interpretation and Evaluation module contains the rules to interpret a model internal to the RF from the architecture design. This model is then evaluated and the results are used to judge whether the architecture satisfies its scenarios. In the performance RF, rate-monotonic analysis (RMA) [4] is applied to the set of tasks in the architecture to calculate the latency for each task. The latency for each task is compared to its deadline to determine if the architecture satisfies the response measure of the scenario for that task.

If the architecture does not satisfy some of the scenarios for a given quality, the RF will execute the Suggest Design Tactics module. This module contains rules to select possible design changes to the architecture and evaluate whether they improve the model in the RF. The RF selects the most promising tactics, those that show the greatest improvement in terms of satisfying the scenarios for that quality, and sends them to the Seeker for arbitration with possible tactics from other RFs. The Seeker module is independent from all of the ArchE RFs and decides which of the tactics from all of the RFs to present to the user. The Seeker makes this decision by prioritizing the tactic suggestions received from the RFs according to their net improvement of the architecture design towards satisfying all requirements scenarios. The user then selects a tactic to apply to change the architecture design.

In the performance RF, design tactic suggestions include reducing the execution time for a responsibility, increasing the period for a task, and lengthening the deadline for a task. The performance RF will try these tactics out for each task that does not satisfy its scenario, rerun the RMA model, and select the tactics that produce the greatest overall latency improvement for the architecture tasks.

Finally, the Apply Design Tactics module contains the rules that receive the user’s input for selecting a tactic and alter the architecture design according to the user’s response. For example, if the user selects a reduce execution time tactic for a responsibility, the performance RF has the rules that make ArchE update the responsibility parameter value and update the task execution time for the task containing that responsibility in the architecture design.

2.3 The RAPT Tool

At Bosch, we have incorporated the ArchE expert system into our RAPT architecture design tool. RAPT is a tool implemented in the Java-based Eclipse framework [3]. RAPT is intended to provide a more streamlined user interface for automotive software architects that can encapsulate some of the more formal details of the ArchE expert system and its quality attribute RFs. Users can input their requirements and architecture designs into RAPT, and then click a button to have their architecture evaluated by ArchE.

In addition to providing a user interface for ArchE, the RAPT tool can store and retrieve requirements and architecture designs as models in an XML format, stored in an online database. The RAPT tool can search the database for requirements or design elements used in previous projects. These items can then be immediately incorporated into new architecture designs and requirements specifications, encouraging reuse of software architecture assets.

3. Availability Reasoning Framework

We developed the availability RF specifically to evaluate the effectiveness of watchdog configurations in automotive applications. Therefore, we do not currently address other possible dependability mechanisms that could contribute to system availability. The goal of this RF is to provide a quantitative evaluation framework for determining the required design parameters for a watchdog configuration in order to satisfy availability constraints and requirements.

The model for evaluating availability draws directly from the real-time performance task architecture. The tasks defined in the architecture specify run-time characteristics that can be used to evaluate availability on a task-by-task basis. Therefore, the availability RF depends on having an existing performance architecture analyzed by the performance RF.

We developed the availability RF using the standard ArchE RF structure. The next few sections describe the components of the availability RF.

3.1 Availability Scenarios and Parameters

We identified four types of availability scenarios to be evaluated by the RF:
  • General System Availability: This scenario type describes the overall availability target for the system, such as “five nines” or 0.99999 availability. There should be only one of these scenarios for a single architecture design.
  • Maximum Recovery Time: This scenario type puts a constraint on the maximum time it takes the system to recover to a known good state once a failure is detected.
  • Minimum Time Before Recovery: This scenario type puts a constraint on the minimum time the system should wait before initiating a recovery action after a failure is detected. This scenario helps specify a “grace period” after a failure is detected, in case the detection is a false positive and does not require the watchdog to perform a system reset. This time allows the system a “second chance” to send its service signal to the watchdog.
  • Maximum Time for Failure Detection: This scenario type puts a constraint on the maximum time it should take for a failure to be detected by the system. In terms of the watchdog, this puts an upper bound on the timeout period for the watchdog.

The parameters required for the functional responsibilities in the availability RF include:
  • Failure Rate: The architect must give a rough estimate of the failure rate, expressed as the number of failures per hour, for the implementation of each functional responsibility in the system. This can be based on data from previous software systems, failure rates of underlying hardware resources being used by each particular function, or developer experience.
  • Jitter: The architect must specify the amount of jitter in the execution time of the responsibility, expressed as a percentage of the responsibility’s total execution time.
  • Jitter Rate: The architect must specify the rate at which each responsibility experiences jitter in its execution time.
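As a rough illustration of how the three timing-related scenario types listed above could constrain a watchdog configuration, the sketch below checks a configuration against them. All class, field, and function names are hypothetical and not part of ArchE or RAPT. It also assumes one particular reading of the scenarios: the time to recover "once a failure is detected" is taken as grace period plus recovery time, the minimum time before recovery is compared with the grace period, and the maximum detection time is compared with the watchdog timeout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WatchdogConfig:
    timeout_ms: float   # watchdog timeout period
    grace_ms: float     # wait after detection before the reset is triggered
    recovery_ms: float  # time for the system to restart after the reset


@dataclass(frozen=True)
class TimingScenarios:
    max_recovery_ms: float              # Maximum Recovery Time
    min_time_before_recovery_ms: float  # Minimum Time Before Recovery
    max_failure_detection_ms: float     # Maximum Time for Failure Detection


def violated_scenarios(wd: WatchdogConfig, sc: TimingScenarios) -> List[str]:
    """Return the names of the timing scenarios the configuration violates."""
    violated = []
    # Assumption: recovery is measured from detection, so it includes the grace period.
    if wd.grace_ms + wd.recovery_ms > sc.max_recovery_ms:
        violated.append("Maximum Recovery Time")
    if wd.grace_ms < sc.min_time_before_recovery_ms:
        violated.append("Minimum Time Before Recovery")
    # The watchdog timeout bounds how long a failure can go undetected.
    if wd.timeout_ms > sc.max_failure_detection_ms:
        violated.append("Maximum Time for Failure Detection")
    return violated
```

An empty result means the configuration meets all three timing scenarios. The General System Availability scenario is not covered here because it depends on the availability model rather than on the watchdog parameters alone.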
The availability RF uses the failure rate parameters for each responsibility as the initial input to the availability model to determine how often the system might experience a failure. The jitter and jitter rate parameters are used to model how often jitter may cause the system to miss a watchdog service deadline without causing a system failure. This case would represent a false positive error detection.

3.2 Initial Availability Architecture Design

Since the availability RF uses the run-time performance architecture as a basis, the rules for an initial availability design are straightforward. The RF generates an initial configuration for a watchdog to monitor the software system, and creates a low priority watchdog servicing task to be added to the system task architecture. The servicing task has a period that equals the watchdog configuration’s timeout value. For an initial configuration, this timeout value is set to be 20% longer than the longest task period in the software architecture.

The RF will also ask the user to provide initial values for the grace period and recovery time parameters of the watchdog.

3.3 Model Interpretation and Evaluation

Availability is defined as the ratio of expected uptime (when the system can provide service) to the total operating time of the system. This can also be expressed as 1 minus the unavailability, which is the ratio of downtime, when the system is expected to provide service but is not, to total operating time.

Using the input parameters provided by the architect and the performance task architecture, we calculate the unavailability due to the individual failures of each task, and compute system availability as 1 minus the sum of all task unavailability values. For the purposes of the model, we assume that task failures are independent and that each will cause a system failure, which means the system’s watchdog servicing task does not run, so that the watchdog detects a failure and restarts the system. If the input parameters are accurate, these assumptions provide a conservative upper bound on the unavailability of each task, and a lower bound on the system availability.

To calculate the system unavailability due to each task, we must evaluate two cases: genuine task failures, and task overruns due to jitter, which cause false positive failures. For the case of real task failures, we must first calculate the task failure rate based on the failure rates of its responsibilities, and then calculate the unavailability penalty due to this task failure rate. Assuming that the failure rates of individual responsibilities in a task are independent, the overall task failure rate FR_T is given by:

  FR_T = 1 - Product over all i of (1 - FR_Ri)

where FR_Ri is the failure rate of each of the responsibilities allocated to the task. Since the task can only fail when it is executing, the task failure rate must be multiplied by the task utilization U_T, which is given by the ratio of the task’s execution time to its period. To calculate the unavailability due to this task, we multiply that result by the time penalty due to the detection time D_WD, grace period G_WD, and recovery time RT_WD of the watchdog configuration:

  Task Unavailability due to failures TU_F = FR_T * U_T * (D_WD + G_WD + RT_WD)

See Figure 2 for a timeline illustrating what happens when a task failure occurs.

[Figure 2. Timeline Showing System and Watchdog (WD) Interaction When a Failure Occurs. The system services the WD at T0 through T4; after a system failure the WD is not serviced. The WD timeout period t_wd and the WD “grace period” t_grace elapse, the WD triggers a system reset, and after the system recovery time t_recovery the system restarts in a working state and services the WD again at T'0 and T'1. Worst-case downtime per failure = t_wd + t_grace + t_recovery.]

To calculate the unavailability due to task jitter, a similar method can be applied. The worst case task jitter rate JR_T can be calculated from the individual responsibility jitter rates in a manner similar to how we calculate the task failure rate from individual responsibility failure rates. We use a conservative worst case jitter execution time penalty for the task: the sum of all the jitter values for the individual responsibilities in that task. Then the unavailability due to task jitter is simply:

  Task Unavailability due to jitter TU_J = JR_T * U'_T * (D_WD + G_WD + RT_WD)

However, we cannot automatically assume that task jitter will cause the system to miss its watchdog service. In order to determine if the task jitter causes the watchdog servicing task to miss its deadline, we must reanalyze the performance task architecture using the worst case jitter value added to the task. When the task latencies are recalculated, if the watchdog service task has overrun its deadline by more than the watchdog configuration’s grace period, then we can reason that the task jitter will cause a false positive failure detection and watchdog reset. Otherwise, the task jitter will be tolerated by the watchdog and no unavailability penalty will be assessed.

Finally, the system availability can be calculated from the results of all the task unavailability values:

  Overall Availability = 1 - Sum over all i of (TU_Fi + TU_Ji)

for all i tasks in the architecture design.

3.4 Availability Design Tactics

If the calculated system availability does not meet the availability scenarios’ response measures, the availability RF will suggest design tactics to improve the architecture and satisfy its requirements. With the results of the availability model, the availability RF has heuristic rules that decide which tactics to propose.

A watchdog configuration has three major parameters associated with it that can be manipulated in the design to affect system availability:
  • Watchdog timeout value
  • Watchdog grace period value
  • System recovery time after a watchdog reset

The watchdog timeout value is an upper bound on the time it will take for the watchdog to detect a system failure. Shortening the timeout period for the watchdog could increase availability because failures would be detected more quickly, and thus recovered from more quickly. However, in order to catch all possible task failures, the watchdog service task should be the lowest priority task, meaning that the deadline for the service task in the software architecture and the watchdog timeout period should be longer than the period of the lowest priority application task. Also, a shorter watchdog timeout period might be more sensitive to false positives from task jitter.

The watchdog grace period has a similar tension between false positives and real failures. A long grace period for the watchdog will eliminate more false positives, but cause a greater unavailability penalty for each real task failure. The availability RF must compare the measures of unavailability due to task failures and task jitter to decide whether to propose an increase or decrease to the grace period value.

The system recovery time after a watchdog reset accounts for the majority of the time spent recovering the system after a failure. The availability RF would always like to propose reducing this parameter in the watchdog configuration to improve availability. However, this recovery time is largely dependent on the ECU hardware configuration, and may not be tunable by the software architects or developers. Ultimately the software architect must decide if applying this tactic is possible and appropriate for their system.

The availability RF can test these design tactics by rerunning the availability model to compare the new model’s availability metric with that of the original model. The RF will then propose the tactics most likely to provide the biggest gain in availability for the software architecture. The architect can then decide whether to select one of the tactics proposed by ArchE. The architecture design will be updated, and the process will iterate until all scenarios are satisfied.

Another possible availability tactic is to increase the period or reduce the execution time of a task that has a relatively high failure rate. This would effectively reduce the portion of time the task runs in the system, reducing the chance that it can cause a system failure. However, this tactic could conflict with the real-time performance RF. A change in the task architecture may cause some performance scenarios to become unfulfilled. This is one example of a possible tradeoff decision, and in this situation the architect must decide which scenarios (performance or availability) are more important.

4. Open Issues

One of the major drawbacks of this model-based approach to availability evaluation is that the architect must provide some nominal failure rate values for the functions in the system. Determining failure rates for software functions and modules is a challenging problem that remains unsolved. The issue is even more problematic considering that this architecture availability analysis may be done before the software implementation is built. Our current approach is to base failure rate estimates on the underlying hardware resources accessed by the software functions. Also, it may not be as important to have precise failure numbers to generate a completely accurate availability estimate, as long as we can evaluate a range of configurations and their response to varying failure rates. The responsibility parameters in ArchE can be quickly modified to observe the response of the availability model. This should be useful for finding “good enough” watchdog configurations to satisfy availability quality requirements and optimize the architecture to satisfy all of its quality requirements.

5. Conclusions and Future Work

In this paper we presented a framework for evaluating and improving availability in a software architecture design. This framework uses a model to determine the availability, and heuristics to propose improvements to the architecture if it does not satisfy its availability requirements. Although this approach depends on some initial estimates for software failure rates, the model can be used to identify a range of acceptable solutions for a given set of requirements and a range of possible failure rates.

Our availability RF enables an architect to explore design configurations for a watchdog and how the watchdog can improve system availability. This is important for automotive ECU software designers because the watchdog concept is used as an important dependability mechanism to tolerate unforeseen transient software and hardware failures. Despite the importance of the watchdog and its near-ubiquitous use in ECUs, there has been little work done to model and evaluate how the watchdog contributes to system availability and dependability. This work addresses that concern.

The current RF does not address other dependability concerns such as reliability, safety, and security. Since the focus of this work was to develop a specific quantitative model that could be used to help make sound architecture design decisions regarding the watchdog, we decided to narrow the scope to the dependability quality most directly affected by the watchdog mechanism: availability. Other dependability concerns could be addressed with additional RFs with separate analysis models.

ArchE enables software architects to analyze and improve their software architecture designs using RFs to evaluate the architecture against its quality requirements using quantitative models. Each RF incorporates well known design strategies in the form of rules for suggesting design tactics to improve the architecture to fulfill specific requirements. Additionally, the RAPT tool provides a more intuitive user interface and also provides support for reuse of requirements and design elements via an online repository.

Within the ArchE expert system, multiple non-functional quality attributes, such as performance, modifiability and availability, can be evaluated to assess the ability of a software architecture to satisfy its requirements. Each RF will provide design suggestions to improve the architecture, and ArchE can evaluate the side effects from each quality attribute RF’s tactics on the other quality attributes of the architecture design. Thus, tradeoff decisions can be explicitly evaluated. As shown earlier, the performance and availability RFs can have conflicts since their models are based on the same runtime task architecture.

Future work for the availability RF will include expanding its scope to propose design tactics for other dependability mechanisms besides the watchdog. This will allow the availability model to provide more general architecture design assistance.

6. Acknowledgements

This work builds on the ArchE concepts developed in conjunction with the Software Engineering Institute at Carnegie Mellon University.

7. References

[1] F. Bachmann, L. Bass, M. Klein, and C. Shelton, “Designing Software Architectures to Achieve Quality Attribute Requirements”, IEE Proceedings on Software, Vol. 152, No. 5, pp. 153-165, August 2005.

[2] L. Bass, P. Clements, and R. Kazman, Software Architecture in Practice, Second Edition, Addison-Wesley, Boston, MA, USA, 2003.

[3] The Eclipse Foundation, “Eclipse – An Open Development Platform”, <http://www.eclipse.org>, January 2007.

[4] M. Klein, T. Ralya, B. Pollak, et al., A Practitioner’s Handbook for Real-Time Analysis: Guide to Rate Monotonic Analysis for Real-Time Systems, Kluwer Academic Publishers, Boston, MA, USA, 1993.

[5] A. Mahmood, E. J. McCluskey, “Concurrent Error Detection Using Watchdog Processors – A Survey,” IEEE Transactions on Computers, Vol. 37, No. 2, pp. 160-174, February 1988.

Fourth International Workshop on Software Engineering for Automotive Systems (SEAS'07) 0-7695-2968-2/07 $20.00 © 2007
29th International Conference on Software Engineering Workshops (ICSEW'07) 0-7695-2830-9/07 $20.00 © 2007
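Returning to the model of Section 3.3, the task and system availability formulas can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the ArchE rule implementation. All names are hypothetical. Watchdog times are converted from milliseconds to hours so that they combine with failure rates given per hour. U'_T is not defined explicitly in the text, so it is assumed here to be the utilization with the worst case jitter penalty added to the execution time. Whether jitter actually trips the watchdog is supplied by the caller as a flag, standing in for the rerun of the performance analysis.

```python
from dataclasses import dataclass
from math import prod
from typing import Iterable, List

MS_PER_HOUR = 3_600_000.0


@dataclass(frozen=True)
class Responsibility:
    failure_rate_per_h: float  # FR_Ri, failures per hour
    jitter_rate_per_h: float   # rate at which jitter occurs
    exec_ms: float             # execution time
    jitter_fraction: float     # jitter as a fraction of execution time


@dataclass(frozen=True)
class Task:
    period_ms: float
    responsibilities: List[Responsibility]
    jitter_trips_watchdog: bool  # result of the re-run performance analysis


@dataclass(frozen=True)
class Watchdog:
    timeout_ms: float   # D_WD
    grace_ms: float     # G_WD
    recovery_ms: float  # RT_WD

    @property
    def downtime_h(self) -> float:
        """Worst-case downtime per failure, converted to hours."""
        return (self.timeout_ms + self.grace_ms + self.recovery_ms) / MS_PER_HOUR


def combined_rate(rates: Iterable[float]) -> float:
    """1 - Product(1 - r_i), as used for FR_T (and, by analogy, JR_T)."""
    return 1.0 - prod(1.0 - r for r in rates)


def task_unavailability(task: Task, wd: Watchdog) -> float:
    """TU_F + TU_J for one task."""
    exec_ms = sum(r.exec_ms for r in task.responsibilities)
    u_t = exec_ms / task.period_ms
    fr_t = combined_rate(r.failure_rate_per_h for r in task.responsibilities)
    tu_f = fr_t * u_t * wd.downtime_h
    tu_j = 0.0
    if task.jitter_trips_watchdog:
        jitter_ms = sum(r.exec_ms * r.jitter_fraction for r in task.responsibilities)
        u_t_jitter = (exec_ms + jitter_ms) / task.period_ms  # assumed meaning of U'_T
        jr_t = combined_rate(r.jitter_rate_per_h for r in task.responsibilities)
        tu_j = jr_t * u_t_jitter * wd.downtime_h
    return tu_f + tu_j


def system_availability(tasks: Iterable[Task], wd: Watchdog) -> float:
    """Overall Availability = 1 - sum of all task unavailability values."""
    return 1.0 - sum(task_unavailability(t, wd) for t in tasks)
```

With this structure, a design tactic such as shortening the watchdog timeout can be tried by constructing a second `Watchdog` value and comparing the two `system_availability` results, which mirrors the rerun-and-compare step described in Section 3.4.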