1. Fault confinement. This stage limits the spread of fault effects to one area of the Web service, preventing contamination of other areas. Fault confinement can be achieved through fault detection within the Web services, consistency checks, and multiple requests/confirmations.
2. Fault detection. This stage recognizes that something unexpected has occurred in the Web services. Fault latency is the period of time between the occurrence of a fault and its detection. Techniques fall into two classes: off-line and on-line. With off-line techniques, such as diagnostic programs, the service cannot perform useful work while under test. On-line techniques, such as duplication, provide real-time detection that runs concurrently with useful work.
3. Diagnosis. This stage is necessary if the fault detection technique does not provide information about the failure location and/or properties.
4. Reconfiguration. This stage occurs when a fault is detected and a permanent failure is located. A Web service can be composed of different components, and individual components may fail while the service is being provided. The system may reconfigure its components either to replace the failed component or to isolate it from the rest of the system.
5. Recovery. This stage uses techniques to eliminate the effects of faults. The three basic recovery approaches are fault masking, retry, and rollback. Fault-masking techniques hide the effects of failures by allowing redundant information to outweigh the incorrect information. Web services can be replicated or implemented in different versions (N-version programming, NVP). Retry makes a second attempt at an operation, on the premise that many faults are transient in nature. Because Web services are provided over a network, retry is practical: requests and replies may be affected by network conditions. Rollback relies on the Web service operation being backed up (checkpointed) at some point in its processing prior to fault detection; operation recommences from that point. Fault latency is important here because the rollback must go back far enough to avoid the effects of undetected errors that occurred before the detected error.
6. Restart. This stage occurs after the recovery of undamaged information. Hot restart: all operations resume from the point of fault detection, which is possible only if no damage has occurred. Warm restart: only some of the processes can be resumed without loss. Cold restart: the system is completely reloaded and no processes survive. The Web services can be restarted by rebooting the server.
7. Repair. At this stage, a failed component is replaced. Repair can be off-line or on-line. Web services can be component-based and consist of other Web services. In off-line repair, either the Web service continues if the failed component/sub-Web service is not necessary for operation, or the Web service must be brought down to perform the repair. In on-line repair, the component/sub-Web service may be replaced immediately with a backup spare, or operation may continue without the component. With on-line repair, Web service operation is not interrupted.
8. Reintegration. In this stage the repaired module must be reintegrated into the Web service. For on-line repair, reintegration must be performed without interrupting Web service operation.
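Two of the recovery approaches named in the notes, retry and fault masking, can be sketched in a few lines. This is only an illustrative sketch: the names `TransientFault`, `call_with_retry`, `masked_result`, and `make_flaky` are hypothetical and do not come from the talk, and majority voting is just one simple way to let redundant results outweigh an incorrect one.

```python
import time
from collections import Counter


class TransientFault(Exception):
    """Hypothetical stand-in for a transient fault such as a network timeout."""


def call_with_retry(operation, attempts=3, delay_s=0.0):
    """Retry `operation` up to `attempts` times; re-raise the last fault if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFault as err:
            last_error = err
            time.sleep(delay_s)  # optional pause before the next attempt
    raise last_error


def masked_result(replica_results):
    """Fault masking by majority vote over the results of replicated services."""
    value, count = Counter(replica_results).most_common(1)[0]
    if count * 2 <= len(replica_results):
        raise RuntimeError("no strict majority among replicas")
    return value


def make_flaky(failures_before_success, result):
    """Build a test operation that fails a fixed number of times, then succeeds."""
    state = {"left": failures_before_success}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientFault("simulated transient fault")
        return result

    return operation
```

Retry only helps when the fault is transient; a permanent failure exhausts the attempts and is re-raised, at which point reconfiguration or repair would have to take over.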
This is the description of the test set, containing the detailed testing purpose of each test case. The test cases can be classified into functional testing (1-800) and random testing (801-1200), and into six regions according to their different patterns.
Overall, the correlation is moderate; it is highest in region IV and lowest in region VI.
Normal testing: code coverage stays in a narrow range (48%-52%), covering the main control flow/data flow. Exceptional testing: two clusters. The reason is that in some cases part of the large-scale computational functions is executed while the rest is skipped, whereas in other cases all of this computational code is skipped.
Integral. Derivative. Since λ(t) is the failure intensity function with respect to time, any existing distribution from well-known reliability models can be used, e.g., the NHPP, Weibull, S-shaped, or logarithmic Poisson models.
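As an illustration of plugging in an existing model, the sketch below codes two widely used failure intensity functions, the Goel-Okumoto NHPP form and the Musa-Okumoto logarithmic Poisson form, and integrates them numerically to get the expected number of failures. The function names and parameter names (`a`, `b`, `lam0`, `theta`) are assumptions chosen for this example; the talk does not say which parameterization it uses.

```python
import math


def goel_okumoto_intensity(t, a, b):
    """Goel-Okumoto NHPP failure intensity: lambda(t) = a * b * exp(-b * t)."""
    return a * b * math.exp(-b * t)


def musa_okumoto_intensity(t, lam0, theta):
    """Musa-Okumoto logarithmic Poisson intensity: lambda(t) = lam0 / (lam0 * theta * t + 1)."""
    return lam0 / (lam0 * theta * t + 1.0)


def expected_failures(intensity, t_end, steps=10000):
    """Integrate an intensity function over [0, t_end] with the trapezoidal rule.

    The integral of the failure intensity is the mean value function,
    i.e. the expected cumulative number of failures by time t_end.
    """
    h = t_end / steps
    total = 0.5 * (intensity(0.0) + intensity(t_end))
    for i in range(1, steps):
        total += intensity(i * h)
    return total * h
```

For the Goel-Okumoto form the integral has the closed form a(1 - e^{-bt}), which gives a quick check on the numerical result.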
1. Software Reliability Engineering: A Roadmap Michael R. Lyu Dept. of Computer Science & Engineering The Chinese University of Hong Kong Future of Software Engineering ICSE’2007 Minneapolis, Minnesota May 24, 2007
2. Introduction <ul><li>Software reliability is the probability of failure-free operation with respect to execution time and environment. </li></ul><ul><li>Software reliability engineering (SRE) is the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability. </li></ul><ul><li>SRE has been adopted by more than 50 companies as a standard or best current practice. </li></ul><ul><li>Credible software reliability techniques are still urgently needed. </li></ul>
3. Historical SRE Techniques: Fault Lifecycle <ul><li>Fault prevention: to avoid, by construction, fault occurrences. </li></ul><ul><li>Fault removal: to detect, by verification and validation, the existence of faults and eliminate them. </li></ul><ul><li>Fault tolerance: to provide, by redundancy and diversity, service complying with the specification in spite of manifested faults. </li></ul><ul><li>Fault/failure forecasting: to estimate, by statistical modeling, the presence of faults and occurrence of failures. </li></ul>
4. Fault Lifecycle Technique (figure: Fault Manifestation and Modeling Process; labels: Reliability, Fault Prevention, Fault Removal, Fault Tolerance, Fault/Failure Forecasting)
6. Software Reliability Modeling: $R = e^{-\lambda t}$ (figure axis: Testing Time)
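Assuming the formula on this slide is the standard constant-failure-rate law $R(t) = e^{-\lambda t}$, a minimal numeric sketch looks as follows. The function names and the sample values are illustrative assumptions, not data from the talk.

```python
import math


def reliability(failure_rate, t):
    """Probability of failure-free operation up to time t for a constant failure rate."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


def mean_time_to_failure(failure_rate):
    """For the exponential law, the mean time to failure is 1 / failure_rate."""
    return 1.0 / failure_rate
```

For example, with a hypothetical failure rate of 0.01 failures per hour, `reliability(0.01, 100)` equals e^{-1}, so the probability of surviving one mean time to failure (100 hours) is a little under 37%.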
7. Current SRE Process Overview
8. Current Trends and Problems <ul><li>The theoretical foundation of software reliability comes from hardware reliability techniques. </li></ul><ul><li>Software failures do not happen independently. </li></ul><ul><li>Software failures seldom repeat in exactly the same or a predictable pattern. </li></ul><ul><li>Failure mode and effect analysis (FMEA) for software is still controversial and incomplete. </li></ul><ul><li>There is currently a need for a credible end-to-end software reliability paradigm that can be directly linked to reliability prediction from the very beginning. </li></ul>
9. Future Direction 1: Reliability-Centric Software Architectures <ul><li>The product view – achieve failure-resilient software architecture </li></ul><ul><ul><li>Fault prevention </li></ul></ul><ul><ul><li>Fault tolerance </li></ul></ul><ul><li>The process view – explore component-based software engineering </li></ul><ul><ul><li>Component identification, construction, protection, integration and interaction </li></ul></ul><ul><ul><li>Reliability modeling based on software structure </li></ul></ul>
10. Future Direction 2: Design for Reliability Achievement <ul><li>Fault confinement </li></ul><ul><li>Fault detection </li></ul><ul><li>Diagnosis </li></ul><ul><li>Reconfiguration </li></ul><ul><li>Recovery </li></ul><ul><li>Restart </li></ul><ul><li>Repair </li></ul><ul><li>Reintegration </li></ul>
12. Future Direction 3: Testing for Reliability Assessment <ul><li>Establish the link between software testing and reliability </li></ul><ul><li>Study the effect of code coverage on fault coverage </li></ul><ul><li>Evaluate the impact of various testing metrics on reliability </li></ul><ul><li>Assess competing testing schemes quantitatively </li></ul>
13. Positive vs. negative evidence for coverage-based software testing (findings and resources) <ul><li>Positive </li></ul><ul><ul><li>High code coverage brings high software reliability and a low failure rate: Frankl (1988), Horgan (1994), Weyuker (1988) </li></ul></ul><ul><ul><li>A correlation between code coverage and software reliability is observed: Chen (1992) </li></ul></ul><ul><ul><li>The correlation between test effectiveness and block coverage is higher than that between test effectiveness and the size of the test set: Wong (1994) </li></ul></ul><ul><ul><li>An increase in reliability comes with an increase in at least one code coverage measure: Frate (1995) </li></ul></ul><ul><ul><li>Code coverage contributes to a noticeable amount of fault coverage: Cai (2005) </li></ul></ul><ul><li>Negative </li></ul><ul><ul><li>The testing results on published data did not support a causal dependency between code coverage and defect coverage: Briand (2000) </li></ul></ul>
14. RSDIMU test case description (figure: test case regions I, II, III, IV, V, VI)
15. The correlation: various test regions <ul><li>Linear modeling fitness in various test case regions </li></ul><ul><li>Linear regression relationship between block coverage and fault coverage in the whole test set </li></ul>(figure axis: Fault Coverage)
16. The correlation: normal operational testing vs. exceptional testing <ul><li>Normal operational testing </li></ul><ul><ul><li>very weak correlation </li></ul></ul><ul><li>Exceptional testing </li></ul><ul><ul><li>strong correlation </li></ul></ul><ul><li>R-square by testing profile (size) </li></ul><ul><ul><li>Whole test case (1200): 0.781 </li></ul></ul><ul><ul><li>Normal testing (827): 0.045 </li></ul></ul><ul><ul><li>Exceptional testing (373): 0.944 </li></ul></ul>
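The R-square values on this slide measure how well a straight line explains fault coverage as a function of block coverage. A minimal sketch of that computation, using ordinary least squares, is given below; the function name is hypothetical, and the talk does not say which tool produced the reported figures.

```python
def linear_fit_r_squared(xs, ys):
    """Fit y = intercept + slope * x by least squares and return (slope, intercept, r_squared)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-square = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot
```

When block coverage barely varies across test cases, as in the normal testing profile, `sxx` is small and the fitted line explains little of the variation in fault coverage, which is consistent with a low R-square.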
17. The correlation: normal operational testing vs. exceptional testing <ul><li>Normal testing: small coverage range (48%-52%) </li></ul><ul><li>Exceptional testing: two main clusters </li></ul>(two figures; axis: Fault Coverage)
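18. The Spectrum in Software Testing and Reliability (spectrum: Software Reliability Growth Models, New Model, Coverage-Based Analysis) <ul><li>Time Based Models </li></ul><ul><ul><li>user oriented </li></ul></ul><ul><ul><li>more physical meaning </li></ul></ul><ul><ul><li>abundant models </li></ul></ul><ul><ul><li>easy data collection </li></ul></ul><ul><ul><li>less relevance to testing </li></ul></ul><ul><li>Coverage Based Testing </li></ul><ul><ul><li>tester oriented </li></ul></ul><ul><ul><li>less physical meaning </li></ul></ul><ul><ul><li>lack of models </li></ul></ul><ul><ul><li>hard data collection </li></ul></ul><ul><ul><li>more relevance to testing </li></ul></ul><ul><li>A new model is needed to combine execution time and testing coverage </li></ul>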
19. A New Coverage-Based Reliability Model <ul><ul><li>$\lambda(t,c)$: joint failure intensity function </li></ul></ul><ul><ul><li>$\lambda_1(t)$: failure intensity function with respect to time </li></ul></ul><ul><ul><li>$\lambda_2(c)$: failure intensity function with respect to coverage </li></ul></ul><ul><ul><li>$\alpha_1, \gamma_1, \alpha_2, \gamma_2$: parameters with the constraint $\alpha_1 + \alpha_2 = 1$ </li></ul></ul>(equation annotations: joint failure intensity function; failure intensity function with time; failure intensity function with coverage; dependency factors)
20. Estimation Accuracy
21. Future Direction 4: Metrics for Reliability Prediction <ul><li>New models (e.g., BBN) to explore rich software metrics </li></ul><ul><li>Data mining approaches </li></ul><ul><li>Machine learning techniques </li></ul><ul><li>Bridging the gap of the one-way function: feedback to building reliable software </li></ul><ul><li>Continuous industrial data collection efforts – demonstration of cost-effectiveness </li></ul>
22. Future Direction 5: Reliability for Emerging Software Applications <ul><li>“The Internet changes everything” </li></ul><ul><li>On-demand customizable software </li></ul><ul><li>Service-oriented architecture, composition, integration </li></ul><ul><li>Customization by middleware – from metadata to metacode </li></ul><ul><li>A common infrastructure delivers reliability to all customers </li></ul>
23. A Paradigm for Reliable Web Service (figure components: Replication Manager with Web service selection algorithm and WatchDog; UDDI Registry; WSDL; three replicas, each with Web Service, IIS, Application, and Database; Client with Port, Application, and Database) <ul><li>1. Create Web services </li></ul><ul><li>2. Select primary Web service (PWS) </li></ul><ul><li>3. Register </li></ul><ul><li>4. Look up </li></ul><ul><li>5. Get WSDL </li></ul><ul><li>6. Invoke Web service </li></ul><ul><li>7. Keep checking the availability of the PWS </li></ul><ul><li>8. If the PWS has failed, reselect the PWS </li></ul><ul><li>9. Update the WSDL </li></ul>
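The watchdog and reselection steps of this paradigm can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: the class name, the `is_available` probe, the "first available replica" selection rule, and the `published_endpoint` attribute standing in for the updated WSDL. The slide does not specify the actual selection algorithm or how availability is checked.

```python
class ReplicationManager:
    """Toy replication manager: tracks a primary Web service (PWS) among replicas."""

    def __init__(self, replicas, is_available):
        self.replicas = list(replicas)       # replica endpoint names (hypothetical)
        self.is_available = is_available     # callable: endpoint -> bool (hypothetical probe)
        self.primary = self._select_primary()
        self.published_endpoint = self.primary  # stands in for the registered WSDL

    def _select_primary(self):
        """Assumed selection rule: pick the first replica that answers the probe."""
        for replica in self.replicas:
            if self.is_available(replica):
                return replica
        raise RuntimeError("no replica available")

    def watchdog_tick(self):
        """One watchdog check; returns True if the primary had to be reselected."""
        if self.is_available(self.primary):
            return False
        self.primary = self._select_primary()
        self.published_endpoint = self.primary  # update what clients will look up
        return True
```

A client in this sketch would read `published_endpoint` before each invocation, so a failed primary is replaced without the client needing to know which replica is serving it.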
24. Conclusions <ul><li>Software reliability is receiving more attention as it becomes an important economic consideration for businesses. </li></ul><ul><li>New SRE paradigms need to consider software architectures, testing techniques, data analyses, and credible reliability modeling procedures. </li></ul><ul><li>Domain-specific approaches to emerging software applications are worthy of investigation. </li></ul><ul><li>Still a long way to go, but the directions are clear. </li></ul>