# Cleansing test suites from coincidental correctness to enhance fault localization

## on Mar 19, 2012

• Four factors are abstracted here: a00, a10, a01 and a11. But how can these four factors be modeled mathematically for fault localization? a00 is the number of passed runs in which part S is not involved; a01 is the number of failed runs in which part S is not involved; a10 is the number of passed runs in which part S is involved; a11 is the number of failed runs in which part S is involved.
• The Jaccard index, also known as the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets. Similarity of asymmetric binary attributes: given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B are specified as follows:
• by dividing the numerator and denominator
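• How do the three situations above affect suspiciousness? They come from an insufficient test suite or from the program's inherent features.
• How often must this happen before the suspiciousness of correct statements is pushed above that of the faulty one? The fault is executed, but the execution does not fail.
• The suspiciousness of these two statements drops, and the suspiciousness of other statements overtakes them.
• Even in the extreme case, the faulty statement is not ranked behind the correct statements.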
• Lines 1-4 populate CCE with program elements that are totally correlated with failing runs and correlated with a given ratio of passing runs (determined by θ). Lines 5-8 populate CCT with tests that execute one or more cce’s. Technique - I exhibits a low rate of false negatives.

## Cleansing test suites from coincidental correctness to enhance fault localization: Presentation Transcript

• Cleansing Coincidental Correctness to Enhance Fault Localization
  Tao He, elfinhe@gmail.com
  Software Engineering Laboratory, Department of Computer Science, Sun Yat-Sen University
  The 2nd Joint Winter Workshop on Software Engineering, December 2010, Sun Yat-Sen University, Guangzhou, China
• Outline Coverage-Based Fault Localization  Introduction  Methodology  Evaluation  Discussion Cleansing Coincidental Correctness  Methodology  Evaluation Conclusion and Future Work 2/37
• Introduction  Software Debugging is an arduous task[1] that requires  Time  Effort  A good understanding of the source code  Three steps to debug[2]  Fault detection  Fault localization  Fault correction  We focus on automatic Fault Localization…[1] I. Vessey. Expertise in debugging computer programs: A process analysis. International Journal of Man-Machine Studies, 23(5):459–494, November 1985.[2] D. Wieland. Model-Based Debugging of Java Programs Using Dependencies. PhD thesis, Technischen 3/37Universitat Wien, 2001.
• Input of Fault Localization
  • Source code
  • Test cases (input and oracle)

  Source code:

      //Find the maximum among a, b and c
      int max (int a, int b, int c){
      1   int temp = a;
      2   if (b > temp ){
      3     temp = b+1; //bug
      4   }
      5   if (c > temp ){
      6     temp = c;
      7   }
      8   return temp;
        }

  Test cases:

  | Input (a, b, c) | Oracle |
  |---|---|
  | 3, 2, 1 | 3 |
  | 2, 1, 3 | 3 |
  | 1, 2, 3 | 3 |
  | 1, 2, 4 | 4 |
  | 1, 2, 3 | 3 |
  | 1, 3, 2 | 3 |
• Output of Fault Localization
  • Suspiciousness of each statement
    • Based on the likelihood of containing faults.
    • A statement with higher suspiciousness should be examined before a statement with lower suspiciousness.

  Suspiciousness results for the Jaccard coefficient (source code as on the "Input of Fault Localization" slide; S3, the buggy statement, is the most suspicious):

  | | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 |
  |---|---|---|---|---|---|---|---|---|
  | S | 0.33 | 0.33 | 0.5 | 0.33 | 0.33 | 0.25 | 0.33 | 0.33 |
• Coverage-Based Fault Localization (CBFL)
  • Based on executable statement hits (coverage)
  • Input of CBFL
    • Coverage
    • Execution result (passed or failed)

  Coverage and results for the max example (source code as on the "Input of Fault Localization" slide; r is the execution result, p = passed, f = failed):

  | a, b, c | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | r |
  |---|---|---|---|---|---|---|---|---|---|
  | 3, 2, 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | p |
  | 2, 1, 3 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | p |
  | 1, 2, 3 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | p |
  | 1, 2, 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | p |
  | 1, 2, 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | f |
  | 1, 3, 2 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | f |
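A minimal sketch of how one row of such a coverage table could be collected. It ports the slide's `max` to Python with hand-written instrumentation. The names `buggy_max` and `coverage_row`, the use of Python's built-in `max` as the oracle, and the choice to count the brace lines S4 and S7 as always hit are assumptions made for illustration, not part of the talk. Only inputs whose behaviour follows directly from the code are run; the table's failing row for input 1, 2, 3 is not reproduced by this port.

```python
def buggy_max(a, b, c):
    """Port of the slide's max(); returns (result, set of hit statement numbers)."""
    hit = {1, 2, 4, 5, 7, 8}  # assumption: S4 and S7 (closing braces) count as always hit
    temp = a                  # S1
    if b > temp:              # S2
        hit.add(3)
        temp = b + 1          # S3: the seeded bug
    if c > temp:              # S5
        hit.add(6)
        temp = c              # S6
    return temp, hit          # S8


def coverage_row(a, b, c):
    """One row of the coverage table: hit vector for S1..S8 plus 'p' or 'f'."""
    result, hit = buggy_max(a, b, c)
    vector = [1 if s in hit else 0 for s in range(1, 9)]
    verdict = "p" if result == max(a, b, c) else "f"  # built-in max acts as the oracle
    return vector, verdict


for inputs in [(3, 2, 1), (2, 1, 3), (1, 2, 3), (1, 2, 4), (1, 3, 2)]:
    print(inputs, *coverage_row(*inputs))
```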
• Input of CBFL  For brevity…//Find the maximum among a, b and c a, b, c S3 S6 Others r int max (int a, int b, int c){1 int temp = a; 3, 2, 1 0 0 1 p2 if (b > temp ){ 2, 1, 3 0 1 1 p3 temp = b+1; //bug 1, 2, 3 1 0 1 p4 }5 if (c > temp ){ 1, 2, 4 1 1 1 p6 temp = c;7 } 1, 2, 3 1 1 1 f8 return temp; } 1, 3, 2 1 0 1 f Source Code 7/37
• Methodology  Intuitively, for each statement, there are four factors, which will contribute to the suspiciousness.For each statement S An example a, b, c S3 S6 Others r SJ(s) Cue 3, 2, 1 0 0 1 p 2, 1, 3 0 1 1 p ↑a00(S) ↑ |Not cover S, Passed tests| 1, 2, 3 1 0 1 p ↑a10(S) ↓ |Cover S, Passed tests| 1, 2, 4 1 1 1 p ↑a01(S) ↓ |Not cover S, Failed tests| 1, 2, 3 1 1 1 f ↑a11(S) ↑ |Cover S, Failed tests| 1, 3, 2 1 0 1 f a00(S) 2 2 0 a10(S) 2 2 4 a01(S) 0 1 0 a11(S) 2 1 2 8/37
• Jaccard [3] Similarity of asymmetric binary attributes a, b, c S3 S6 Others r 3, 2, 1 0 0 1 p 2, 1, 3 0 1 1 p a11 ( s ) SJ ( s) = 1, 2, 3 1 0 1 p a11 ( s ) + a01 ( s) + a10 ( s) 1, 2, 4 1 1 1 p failed ( s ) 1, 2, 3 1 1 1 f SJ ( s) = totalfailed + passed ( s) 1, 3, 2 1 0 1 f SJ(j) 0.5 0.25 0.33[3] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. A. Brewer. Pinpoint: Problem determination in large,dynamic internet services. In Proceedings of 2002 International Conference on Dependable Systems and Networks(DSN 2002), pages 595–604, Bethesda, MD, USA, 23-26 June 2002. IEEE Computer Society. 9/37
• Tarantula [4]
  • Used in the Tarantula fault localization tool

  $$S_T(s) = \frac{\dfrac{a_{11}(s)}{a_{11}(s) + a_{01}(s)}}{\dfrac{a_{11}(s)}{a_{11}(s) + a_{01}(s)} + \dfrac{a_{10}(s)}{a_{10}(s) + a_{00}(s)}}$$

  $$S_T(s) = \frac{\dfrac{failed(s)}{totalfailed}}{\dfrac{failed(s)}{totalfailed} + \dfrac{passed(s)}{totalpassed}}$$

  Results for the example (coverage table as on the "Input of CBFL" slide):

  | | S3 | S6 | Others |
  |---|---|---|---|
  | S_T | 0.66 | 0.5 | 0.5 |

  [4] J. A. Jones and M. J. Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In D. F. Redmiles, T. Ellman, and A. Zisman, editors, 20th IEEE/ACM International Conference on Automated Software Engineering (ASE 2005), pages 273–282, Long Beach, CA, USA, November 7-11 2005. ACM.
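The same counts can be fed to the Tarantula formula. This is a sketch only; the function name `tarantula` and the handling of empty denominators are assumptions.

```python
# (a00, a10, a01, a11) per statement, taken from the Methodology slide.
counts = {"S3": (2, 2, 0, 2), "S6": (2, 2, 1, 1), "Others": (0, 4, 0, 2)}


def tarantula(a00, a10, a01, a11):
    """Tarantula suspiciousness: failed ratio / (failed ratio + passed ratio)."""
    total_failed = a11 + a01
    total_passed = a10 + a00
    fail_ratio = a11 / total_failed if total_failed else 0.0
    pass_ratio = a10 / total_passed if total_passed else 0.0
    if fail_ratio + pass_ratio == 0:
        return 0.0  # statement never executed; returning 0 is an assumption
    return fail_ratio / (fail_ratio + pass_ratio)


scores = {name: tarantula(*c) for name, c in counts.items()}
for name, value in scores.items():
    print(name, round(value, 3))
```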
• Ochiai [5]
  • Used in the molecular biology domain
    • To measure genetic similarity

  $$S_O(s) = \frac{a_{11}(s)}{\sqrt{(a_{11}(s) + a_{01}(s)) \times (a_{11}(s) + a_{10}(s))}}$$

  $$S_O(s) = \frac{failed(s)}{\sqrt{totalfailed \times (passed(s) + failed(s))}}$$

  Results for the example (coverage table as on the "Input of CBFL" slide):

  | | S3 | S6 | Others |
  |---|---|---|---|
  | S_O | 0.7 | 0.41 | 0.57 |

  [5] R. Abreu, P. Zoeteweij, and A. J. C. van Gemund. On the accuracy of spectrum-based fault localization. In P. McMinn, editor, Proceedings of the Testing: Academia and Industry Conference - Practice And Research Techniques (TAIC PART'07), pages 89–98, Windsor, United Kingdom, September 2007. IEEE Computer Society.
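A sketch of the Ochiai coefficient on the same counts. The function name `ochiai` and the zero-denominator guard are assumptions for illustration.

```python
import math

# (a00, a10, a01, a11) per statement, taken from the Methodology slide.
counts = {"S3": (2, 2, 0, 2), "S6": (2, 2, 1, 1), "Others": (0, 4, 0, 2)}


def ochiai(a00, a10, a01, a11):
    """Ochiai suspiciousness: a11 / sqrt((a11 + a01) * (a11 + a10)). a00 is unused."""
    denominator = math.sqrt((a11 + a01) * (a11 + a10))
    return a11 / denominator if denominator else 0.0  # guard is an assumption


scores = {name: ochiai(*c) for name, c in counts.items()}
for name, value in scores.items():
    print(name, round(value, 3))
```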
• Evaluation  Assign a score to every faulty version of each subject program  Score [6]  Describes the percentage of program that need not to be examined until the first bug-containing statement is reached  Assumption  Perfect bug detection  i.e., programmers can always correctly classify faulty code as faulty, and non-faulty code as non-faulty.[6] J. A. Jones and M. J. Harrold. Empirical evaluation of the tarantula automatic faultlocalization technique. In D.F. Redmiles, T. Ellman, and A. Zisman, editors, 20th IEEE/ACM International Conference on Automated SoftwareEngineering (ASE 2005), pages 273–282, Long Beach, CA, USA, November 7-11 2005. ACM. 12/37
• Evaluation (cont') - An example
  Sorted suspiciousness results (source code as on the "Input of Fault Localization" slide):

  | | S6 | S3 | S1 | S2 | S4 | S5 | S7 | S8 |
  |---|---|---|---|---|---|---|---|---|
  | S | 0.7 | 0.5 | 0.33 | 0.33 | 0.33 | 0.33 | 0.33 | 0.33 |

  • Step 1: examine S6 (`temp = c;`). Not the bug.
• Evaluation (cont') - An example
  • Step 2: examine S3 (`temp = b+1; //bug`). Found it!
• Evaluation (cont') - An example
  • 2 statements have been examined
  • 8 statements in the program in total
  • The score of this program is 1 - (2 ÷ 8) = 0.75
    • The percentage of statements that need not be examined
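The score from this example can be computed from a ranked list of statements. The sketch below assumes a strict ranking (it does not model how ties among equally suspicious statements are broken), and the name `score` is illustrative.

```python
def score(ranking, faulty):
    """Fraction of statements that need not be examined before reaching `faulty`.

    `ranking` lists statements from most to least suspicious.
    Assumption: ties are already broken by the order of the list.
    """
    examined = ranking.index(faulty) + 1  # statements inspected, including the bug
    return 1 - examined / len(ranking)


ranking = ["S6", "S3", "S1", "S2", "S4", "S5", "S7", "S8"]
print(score(ranking, "S3"))  # 1 - (2 / 8)
```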
• Evaluation (cont')
  • Assign a score to every faulty version of the Siemens suite
  • The effectiveness of existing techniques has been limited…
• Discussion  Rewrite the coefficients as below [7] a11 a11 + a01 a11 a11 ST = SJ = SO = a11 a10 a11 + a01 + a10 (a11 + a01 ) × (a11 + a10 ) + a11 + a01 a10 + a00 Replace by Square, and Divide by a11 + a01 divide by CT = C J = a11 + a01 a10 + a00 C J = a11 + a01 For brevity 1 a11 a11 a11 ′ ST = ′ SJ = S = 2 O ′ SO = a10  a  a 1 + CT × C J + a10 C J × 1 + 10  1 + 10 a11  a  a11  11   Both CT=(a11+a01)/(a10+a00) and CJ=a11+a01 are constant for all statements  Not influence the suspiciousness ranking  So rankings from three coefficients depend only on a11 and a10[7] R. Abreu, P. Zoeteweij, R. Golsteijn, and A. J. C. van Gemund. A practical evaluation of spectrum-based faultlocalization. Journal of Systems and Software, 82(11):1780–1792, 2009. 17/37
• The impact of a11 and a10
  • The suspiciousness calculated by the coefficients has
    • a positive correlation with a11
    • a negative correlation with a10
  • Assume that
    • whenever the fault is executed, the execution fails (to increase a11),
    • whenever the fault is not executed, the execution passes (to increase a10),
    • the test suite is adequate.
  • Then the faulty statement will always rank top.
  • Why are the techniques ineffective? Are there any interferences?
• Interferences  Factors impair the CBFL (interferences)  Coincidental Correctness [8]  The fault is executed, but this execution will not fail,  Multiple Faults  The fault is not executed, but this execution will fail.  Coverage Equivalence  The coverage between statements are always the same.[8] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi. An empirical study of the factors that reduce theeffectiveness of coverage-based fault localization. In B. Liblit, N. Nagappan, and T. Zimmermann, editors,Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in conjunction withISSTA 2009, pages 1–5, Chicago, Illinois, July 19-19 2009. ACM. 19/37
• Coincidental Correctness  Not all conditions for failure are met.  The RIP (reachability-infection-propagation) model[9]  Condition 1:the fault is executed  Condition 2:the program has transitioned into an infectious state  Condition 3:the infection has propagated to the output //Find the maximum among a, b and c int max (int a, int b, int c){ 1 int temp = a; 2 if (b > temp ){ 3 temp = b+1; //bug 4 } condition a, b, c S3 S6 Others r 5 if (c > temp ){ a ＜ b, b ＋ 1 ＝ c 1, 2, 3 1 0 1 p 6 temp = c; a ＜ b, b ＋ 1 ＜ c 1, 2, 4 1 1 1 p 7 } 8 return temp; }[9] Ammann P. and Offutt J. Introduction to Software Testing. Cambridge University Press, 2008. 20/37
• Multiple Faults  The fault is not executed, but this execution will failed. (Because another fault is executed.)//Find the maximum among a, b and c int max (int a, int b, int c){1 int temp = a;2 if (b > temp ){ condition a, b, c S3 S6 r3 temp = b+1; //bug a ＜ b, b + 1 ≥ c 1, 2, 4 1 0 f4 }5 if (c > temp ){ a ≥ b, a ＜ c 3, 2, 4 0 1 f6 temp = c+1; //bug7 }8 return temp; } 21/37
• Coverage Equivalence  The coverage between statements are always the same.  Due to  Inadequacy of the test suite  The inherent property of a program//Find the maximum among a, b and c int max (int a, int b, int c){1 int temp = a+1; //bug2 if (b > temp ){ condition a, b, c S1 S8 r3 temp = b;4 } a ＜ b or a ＜c 1, 2, 3 1 1 p5 if (c > temp ){ otherwise 7, 2, 4 1 1 f6 temp = c;7 }8 return temp; } 22/37
• Empirical Study  Coincidental Correctness (72.1%) [8]  Strong Coincidental Correctness (15.7%)  Meet Condition 1,2 of RIP(reachability-infection-propagation) model.  Weak Coincidental Correctness (56.4%)  Meet only Condition 1 of RIP(reachability-infection-propagation) model.  A safety reducing factor.  Causes the faulty statement has a lower score than others.[8] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi. An empirical study of the factors that reduce theeffectiveness of coverage-based fault localization. In B. Liblit, N. Nagappan, and T. Zimmermann, editors,Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in conjunction withISSTA 2009, pages 1–5, Chicago, Illinois, July 19-19 2009. ACM. 23/37
• Cleansing Coincidental Correctness [10]
  • Input:
    • A test suite and the coverage matrix
  • Output:
    • Subset of passing tests that are likely to be coincidentally correct.
  • Assumption
    • A good candidate for a cce is a program element that occurs in all failing runs and in a non-zero but not excessively large percentage of passing runs

  [10] W. Masri and R. Abou Assi. Cleansing test suites from coincidental correctness to enhance fault-localization. In Third International Conference on Software Testing, Verification and Validation, pages 165–174, 2010. IEEE.
• Technique - I
  • Populate CCE with program elements that are totally correlated with failures.
  • Assumption: fT(cce) = 1.0 and 0 < pT(cce) ≤ θ
    • where fT(cce) is the percentage of TF executing cce, pT(cce) the percentage of TP executing cce, and θ < 1.0.

  | Given | |
  |---|---|
  | T | a test suite |
  | TF | failing tests |
  | TP | passing tests |
  | TCC | coincidentally correct tests |

  | We estimate | |
  |---|---|
  | CCE | the set of program elements that are likely to be correlated with coincidentally correct tests |
  | cce | an element in CCE |
  | cct | a test that induces a cce |
  | CCT | estimate of TCC |
• Technique - I (cont')
  • Assumption: fT(cce) = 1.0 and 0 < pT(cce) ≤ θ
    • where fT(cce) is the percentage of TF executing cce, pT(cce) the percentage of TP executing cce, and θ < 1.0.
  • Populate CCT with tests that execute one or more cce's.
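The two steps of Technique I can be sketched as follows, applied to the running example. This is an illustration under stated assumptions rather than the authors' implementation: the function name `technique_one`, the data layout and the threshold value θ = 0.5 are all chosen here for demonstration.

```python
# Coverage columns and results from the running example (one entry per test run).
coverage = {
    "S3":     [0, 0, 1, 1, 1, 1],
    "S6":     [0, 1, 0, 1, 1, 0],
    "Others": [1, 1, 1, 1, 1, 1],
}
results = ["p", "p", "p", "p", "f", "f"]


def technique_one(coverage, results, theta):
    """Return (CCE, CCT): suspected elements and suspected coincidentally correct tests."""
    failing = [i for i, r in enumerate(results) if r == "f"]
    passing = [i for i, r in enumerate(results) if r == "p"]
    cce = set()
    for element, column in coverage.items():
        f_t = sum(column[i] for i in failing) / len(failing)  # share of failing runs executing it
        p_t = sum(column[i] for i in passing) / len(passing)  # share of passing runs executing it
        if f_t == 1.0 and 0 < p_t <= theta:
            cce.add(element)
    # A passing test becomes a cct if it executes at least one cce.
    cct = {i for i in passing if any(coverage[e][i] for e in cce)}
    return cce, cct


cce, cct = technique_one(coverage, results, theta=0.5)  # theta value is an assumption
print(cce, cct)
```

With this threshold, "Others" is excluded because every passing run executes it, and S6 is excluded because only half of the failing runs execute it.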
• Technique - I - An example
  Source code as on the "Input of Fault Localization" slide. S3 is identified as a cce: it is covered by both failing tests and by only some of the passing tests.

  | a, b, c | S3 (cce) | S6 | Others | r |
  |---|---|---|---|---|
  | 1, 2, 3 | 1 | 0 | 1 | p |
  | 1, 2, 4 | 1 | 1 | 1 | p |
  | 3, 2, 1 | 0 | 0 | 1 | p |
  | 2, 1, 3 | 0 | 1 | 1 | p |
  | 1, 2, 3 | 1 | 1 | 1 | f |
  | 1, 3, 2 | 1 | 0 | 1 | f |
• Technique - I - An example (cont')
  Find them! The passing tests that execute the cce S3 are marked as cct's (coincidental correctness).

  | | a, b, c | S3 (cce) | S6 | Others | r |
  |---|---|---|---|---|---|
  | cct | 1, 2, 3 | 1 | 0 | 1 | p |
  | cct | 1, 2, 4 | 1 | 1 | 1 | p |
  | | 3, 2, 1 | 0 | 0 | 1 | p |
  | | 2, 1, 3 | 0 | 1 | 1 | p |
  | | 1, 2, 3 | 1 | 1 | 1 | f |
  | | 1, 3, 2 | 1 | 0 | 1 | f |
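To see why identifying cct's helps, the sketch below recomputes the Jaccard suspiciousness after dropping the two cct's from the suite. Treating "cleansing" as simply removing those tests is an assumption made for this illustration, and the helper name `jaccard_scores` is hypothetical.

```python
NAMES = ["S3", "S6", "Others"]
tests = [
    # (S3, S6, Others, result, is_cct)
    (1, 0, 1, "p", True),
    (1, 1, 1, "p", True),
    (0, 0, 1, "p", False),
    (0, 1, 1, "p", False),
    (1, 1, 1, "f", False),
    (1, 0, 1, "f", False),
]


def jaccard_scores(rows):
    """Jaccard suspiciousness per statement: failed(s) / (totalfailed + passed(s))."""
    total_failed = sum(1 for r in rows if r[3] == "f")
    scores = {}
    for i, name in enumerate(NAMES):
        failed = sum(1 for r in rows if r[i] and r[3] == "f")
        passed = sum(1 for r in rows if r[i] and r[3] == "p")
        denominator = total_failed + passed
        scores[name] = failed / denominator if denominator else 0.0
    return scores


before = jaccard_scores(tests)
after = jaccard_scores([r for r in tests if not r[4]])  # assumption: cleansing = dropping cct's
print(before)
print(after)
```

On this example the faulty statement S3 moves from 0.5 to 1.0, widening its lead over the other statements.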
• Technique - II
  • A cct with a high weight is more likely to be a coincidentally correct test.
  • Weight (correlates with suspiciousness)
    • (average weight of the covered cce's) + (percentage of cce's covered)
  • The lower-ranked cct's are discarded
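A rough sketch of the ranking step. The slide does not say how each cce's weight is obtained or how many cct's are discarded, so the per-element weights, the `keep_ratio` parameter, the test and element names, and the function name `rank_ccts` are all hypothetical.

```python
import math


def rank_ccts(cct_coverage, cce_weight, keep_ratio):
    """Rank cct's by (average weight of covered cce's) + (fraction of cce's covered).

    cct_coverage: test name -> set of cce's it executes (non-empty by definition of a cct)
    cce_weight:   cce name -> weight (how these are derived is not given on the slide)
    keep_ratio:   fraction of top-ranked cct's to keep (hypothetical cut-off)
    """
    scores = {}
    for test, covered in cct_coverage.items():
        average_weight = sum(cce_weight[e] for e in covered) / len(covered)
        fraction_covered = len(covered) / len(cce_weight)
        scores[test] = average_weight + fraction_covered
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = math.ceil(len(ranked) * keep_ratio)
    return ranked[:keep], scores


# Made-up illustration data.
cce_weight = {"e1": 0.9, "e2": 0.4, "e3": 0.2}
cct_coverage = {
    "t1": {"e1", "e2"},
    "t2": {"e3"},
    "t3": {"e1", "e2", "e3"},
    "t4": {"e2"},
}
kept, scores = rank_ccts(cct_coverage, cce_weight, keep_ratio=0.5)
print(kept, scores)
```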
• Technique - III
  • Partitions the cct's into two clusters based on the similarity of the suspicious cce's
  • Assumptions
    • Typically, some cce's are relevant to the fault and others are not.
    • The coincidentally correct tests exercise these fault-relevant cce's whereas the correct tests don't.
• Evaluation false negatives: false positives: safety change: precision change: coverage reduction: 31/37
• Evaluation (cont’) 32/37
• Evaluation (cont’) Comparative results summaries 33/37
• Conclusion Without interferences, CBFL are effective and efficient techniques that automate Fault Localization. Well designed coefficients will be compatible with some interferences but not all of them. Three variations of a technique are presented to identify coincidental correctness, a safety reducing factor for CBFL. 34/37
• Future Work Conduct more algorithms to identify coincidental correctness  e.g. cluster analysis and failure classification. Evaluate whether different program elements can further reduce the rate of false positives  e.g. predicates, function calls, program paths Assess the impact of cleansing coincidental correctness on other fault localization approaches 35/37
• Q&A
• Thank you! Contact me via elfinhe@gmail.com