Muffler a tool using mutation to facilitate fault localization 2.3
 

  • I assume that you already know many of these techniques, so I only give a quick review.
  • Please find another definition that uses passed runs to describe CC.
  • Please remember to annotate the CC, e.g., 1382. Please remember to add animation.
  • It is worth mentioning that Muffler’s time cost can be greatly reduced with a simple test selection strategy: do not re-run a test case that does not cover the mutated statement. Furthermore, because the executions of mutants do not depend on each other, we can parallelize them with little effort. Nonetheless, we have to admit that Muffler needs more time in order to offer better effectiveness in fault localization.
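The test selection and parallelization ideas in the note above can be sketched as follows. This is a minimal illustration, not Muffler's actual implementation: the function names, the `coverage` map (test name to set of covered statement numbers), the mutant records, and `fake_run_test` are all hypothetical. The sketch also assumes that only originally passed tests need re-running, since only passed-to-failed changes are counted, and that test outcomes are deterministic.

```python
from concurrent.futures import ThreadPoolExecutor

def covering_passed_tests(mutated_stmt, coverage, passed_tests):
    """Test selection: keep only passed tests that execute the mutated statement."""
    return sorted(t for t in passed_tests if mutated_stmt in coverage[t])

def count_result_changes(mutant, coverage, passed_tests, run_test):
    """Count passed tests that fail on this mutant (passed -> failed changes)."""
    selected = covering_passed_tests(mutant["stmt"], coverage, passed_tests)
    return sum(1 for t in selected if not run_test(mutant, t))

def evaluate_mutants(mutants, coverage, passed_tests, run_test, workers=4):
    """Mutants are independent of each other, so evaluate them concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(count_result_changes, m, coverage, passed_tests, run_test)
            for m in mutants
        ]
        return [f.result() for f in futures]

# Hypothetical toy data: three tests, two mutants.
coverage = {"t1": {1, 2}, "t2": {2, 3}, "t3": {1, 3}}
passed_tests = {"t1", "t2"}          # t3 failed on the original program
mutants = [{"id": "m1", "stmt": 2}, {"id": "m2", "stmt": 3}]
killed = {("m1", "t1")}              # (mutant, test) pairs that now fail

def fake_run_test(mutant, test):
    """Stand-in for running the real test; returns True if the test passes."""
    return (mutant["id"], test) not in killed

changes = evaluate_mutants(mutants, coverage, passed_tests, fake_run_test)
print(changes)  # [1, 0]
```

For mutant `m2`, only `t2` is re-run: `t1` does not cover statement 3, and `t3` did not pass originally.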

Presentation Transcript

  • Muffler: An Approach Using Mutation to Facilitate Fault Localization
    Tao He, elfinhe@gmail.com
    Department of Computer Science, Sun Yat-Sen University
    Department of Computer Science and Engineering, HKUST
    Group Discussion, February 2012, HKUST, Hong Kong, China
  • Outline
    • Background
    • Motivation
    • Why does our approach work?
    • Our Approach – Muffler
    • Empirical Evaluation
    • Conclusion
  • Background Coverage-Based Fault Localization (CBFL)  Input  Coverage  Testing results (passed or failed)  Output  A ranking list of statements  Ranking functions  Most CBFL techniques are similar with each other except that different ranking functions are used to compute suspiciousness. 3/34
  • What is the limitation of existing CBFL techniques?
  • Motivation  One fundamental assumption [YPW08] of CBFL  The observed behaviors from passed runs can precisely represent the correct behaviors of this program;  and the observed behaviors from failed runs can represent the infamous behaviors.  Therefore, the different observed behaviors of program entities between passed runs and failed runs will indicate the fault’s location.  But this does not always hold.[YPW08] C. Yilmaz, A. Paradkar, and C. Williams. Time will tell: fault localization using time spectra. In Proceedingsof the 30th international conference on Software engineering (ICSE 08). ACM, New York, NY, USA, 81-90. 2008. 5/34
  • Motivation  Coincidental Correctness (CC)  “No failure is detected, even though a fault has been executed.” [RT93]  i.e., the passed runs may cover the fault.  Weaken the first part of CBFL‟s assumption:  The observed behaviors from passed runs can precisely represent the correct behaviors of this program;  More, CC occurs frequently in practice.[MAE+09][RT93] D.J. Richardson and M.C. Thompson, An analysis of test data selection criteria using the RELAY model offault detection, Software Engineering, IEEE Transactions on, vol. 19, (no. 6), pp. 533-553, 1993.[MAE+09] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi, An empirical study of the factors that reduce theeffectiveness of coverage-based fault localization, in Proceedings of the 2nd International Workshop on Defects inLarge Software Systems: Held in conjunction with the ACM SIGSOFT International Symposium on Software Testing 6/34and Analysis (ISSTA 2009), pp. 1-5, 2009.
  • Our goal is to address the CC issue via mutation analysis. What is the idea?
  • Why does our approach work? Key hypothesis
    • Mutating the faulty statement tends to maintain the results of passed test cases.
    • By contrast, mutating a correct statement tends to change the results of passed test cases (from passed to failed).
  • Why does our approach work? Three comprehensive scenarios (1/3): the mutant point M and the fault point F are in different basic blocks
    • If we mutate M: 3 test results change from passed to failed.
    • If we mutate F: 0 test results change from passed to failed.
    Figure legend: M: mutant point; F: fault point; test cases run through the program and yield passed or failed test results.
  • Why does our approach work? Three comprehensive scenarios (2/3): the mutant point M and the fault point F are in the same basic block
    • If we mutate M: 3 test results change from passed to failed, because M affects the output through a different data flow.
    • If we mutate F: 0 test results change from passed to failed.
    Figure legend: M: mutant point; F: fault point; arrows distinguish control flow from data flow.
  • Why does our approach work? Three comprehensive scenarios (3/3): when CC occurs frequently
    • If we mutate F: 0 test results change from passed to failed, because F has only a weak ability to affect the output.
    • Weak ability means a weak ability to generate an infectious state or to propagate the infectious state to the output.
    Figure legend: M: mutant point; F: fault point.
  • Does this work in real programs?
  • Why does our approach work? A feasibility study
    Figure: Distribution of statements’ result changes and the faulty statement’s testing result changes. Panels: tcas v7, tot_info v17, schedule v4, schedule2 v1, print_tokens v7, print_tokens2 v3, replace v24, space v20. The vertical axis denotes the number of testing result changes (from “passed” to “failed”), and the horizontal width denotes the probability density at the corresponding number of testing result changes.
  • Why does our approach work? Another feasibility study (when CC% ≥ 95%)
    Figure: Frequency distribution of effectiveness when CC% ≥ 95%. Horizontal axis: percentage of code examined. Vertical axis: frequency of faulty versions. Series: result changes (avg. 16.33%) and Naish (avg. 47.55%).
    When CC% is greater than or equal to 95%, the code examination effort reduction of result changes is 65.66% (= 100% − 16.33%/47.55%). With Naish, only 6 faulty versions require examining less than 20% of the statements, versus 22 versions when using result changes.
  • How to design our new ranking function?
  • Our Approach – Muffler [LRR11] L. Naish, H. J. Lee, and K. Ramamohanarao, A model for spectra-based software diagnosis. ACMTransaction on Software Engineering Methodology, 20(3):11, 2011. 19/34
  • How do we evaluate our approach? What is the result?
  • Empirical Evaluation Lines of Number of Number of Program suite Executable LOC versions test cases Code tcas 41 63-67 1608 133-137 tot_info 23 122-123 1052 272-273 schedule 9 149-152 2650 290-294 schedule2 10 127-129 2710 261-263 print_tokens 7 189-190 4130 341-343 print_tokens2 10 199-200 4115 350-355 replace 32 240-245 5542 508-515 21/34 space 38 3633-3647 13585 5882-5904
  • Empirical Evaluation
    Figure: Overall effectiveness comparison. Horizontal axis: percentage of code examined (0% to 100%). Vertical axis: percentage of faults located (0% to 100%). Techniques: Muffler, Naish, Ochiai, Tarantula, Wong3.
  • Empirical Evaluation
    % of code examined   Tarantula   Ochiai   χDebug   Naish   Muffler
    1%                   14          18       19       21      35
    5%                   38          48       56       58      74
    10%                  54          63       68       68      85
    15%                  57          65       80       80      94
    20%                  60          67       84       84      99
    30%                  79          88       91       92      110
    40%                  92          98       98       99      117
    50%                  98          99       101      102     121
    60%                  99          103      105      106     123
    70%                  101         107      117      119     123
    80%                  114         122      122      123     123
    90%                  122*        123      123      123     123
    100%                 123         123      123      123     123
    * In the 90% row the values are 122, 123, 123, 123, 123, but the transcript does not show which column holds the 122; Tarantula is assumed here.
    Table: Number of faults located at different levels of code examination effort using Naish and Muffler.
    When 1% of the statements have been examined, Naish can reach the fault in 17.07% of faulty versions. At the same time, Muffler can reach the fault in 28.46% of faulty versions.
  • Empirical Evaluation
             Tarantula   Ochiai   χDebug   Naish   Muffler
    Min      0.00        0.00     0.00     0.00    0.00
    Max      87.89       84.25    93.85    78.46   55.38
    Median   20.33       9.52     7.69     7.32    3.25
    Mean     27.68       23.62    20.04    19.34   9.62
    Stdev    28.29       26.36    24.61    23.86   13.22
    Table: Statistics of code examination effort.
    Among these five techniques, Muffler always scores the best in the rows that correspond to the minimum, median, and mean code examination effort. In addition, Muffler has a much lower standard deviation, which means that its performance varies less widely than that of the others and is more stable in terms of effectiveness. Results also show that Muffler reduces the average code examination effort relative to Naish by 50.26% (= 100% − 9.62%/19.34%).
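The relative reduction figures quoted in these slides all use the same arithmetic, 100% minus the ratio of the new effort to the baseline effort. A small sketch (the helper name `effort_reduction` is ours, not from the slides) reproduces the 50.26% figure from the mean row above and the 65.66% figure from the feasibility study:

```python
def effort_reduction(new_effort, baseline_effort):
    """Relative reduction in code examination effort, in percent."""
    return 100.0 * (1.0 - new_effort / baseline_effort)

# Mean effort: Muffler 9.62% versus Naish 19.34%.
print(round(effort_reduction(9.62, 19.34), 2))   # 50.26
# Feasibility study (CC% >= 95%): result changes 16.33% versus Naish 47.55%.
print(round(effort_reduction(16.33, 47.55), 2))  # 65.66
```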
  • How about the coincidental correctness issue?
  • Conclusion and future work
    • We propose Muffler, a technique that uses mutation to help locate program faults.
    • On 123 faulty versions of seven programs, we compare its effectiveness and efficiency with the Naish technique.
    • Results show that Muffler reduces the average code examination effort on each faulty version by 50.26%.
    • For future work, we plan to generalize our approach to locate faults in multi-fault programs.
  • Q&A
  • Thank you! Contact me via elfinhe@gmail.com
  • # Background  Mutation analysis, first proposed by Hamlet [Ham77] and Demilo et al. [DLS78] , is a fault-based testing technique used to measure the effectiveness of a test suite.  In mutation analysis, one introduces syntactic code changes, one at a time, into a program to generate various faulty programs (called mutants).  A mutation operator is a change-seeding rule to generate a mutant from the original program.[Ham77] R.G. Hamlet, Testing Programs with the Aid of a Compiler, Software Engineering, IEEE Transactionson, vol. SE-3, (no. 4), pp. 279- 290, 1977.[DLS78] R.A. DeMillo, R.J. Lipton and F.G. Sayward, Hints on Test Data Selection: Help for the PracticingProgrammer, Computer, vol. 11, (no. 4), pp. 34-41, 1978. 30/34
  • # Ranking functions
    • Tarantula [JHS02], Ochiai [AZV07], χDebug [WQZ+07], and Naish [NLR11]
    Table: Ranking functions
    [JHS02] J.A. Jones, M.J. Harrold, and J. Stasko. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering (ICSE 02), pp. 467-477, 2002.
    [AZV07] R. Abreu, P. Zoeteweij, and A.J.C. van Gemund. On the accuracy of spectrum-based fault localization. In Proceedings of Testing: Academic and Industrial Conference Practice and Research Techniques (TAIC PART-MUTATION 2007), pp. 89-98, 2007.
    [WQZ+07] W.E. Wong, Yu Qi, Lei Zhao, and Kai-Yuan Cai. Effective Fault Localization using Code Coverage. In Proceedings of the 31st Annual International Computer Software and Applications Conference (COMPSAC 07), Vol. 1, pp. 449-456, 2007.
    [NLR11] L. Naish, H.J. Lee, and K. Ramamohanarao. A model for spectra-based software diagnosis. ACM Transactions on Software Engineering and Methodology, 20(3):11, 2011.
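The table of ranking functions on this slide did not survive in the transcript. As a hedged reconstruction, the sketch below implements the four functions as they are commonly stated in the literature, with Naish written in a scaled integer form, Failed(s) × (TotalPassed + 1) − Passed(s). The function names are ours. The forms were checked only against the suspiciousness values of the “schedule” v2 example in these slides (TotalPassed = 2440, TotalFailed = 210), so treat them as an assumption rather than the authors' exact definitions. The sketch assumes at least one failed test and a statement that is covered by at least one test.

```python
import math

def tarantula(passed, failed, total_passed, total_failed):
    """Ratio of the failed-coverage rate to the sum of both coverage rates."""
    fail_ratio = failed / total_failed
    pass_ratio = passed / total_passed
    return fail_ratio / (fail_ratio + pass_ratio)

def ochiai(passed, failed, total_passed, total_failed):
    return failed / math.sqrt(total_failed * (failed + passed))

def chi_debug(passed, failed, total_passed, total_failed):
    """Wong-style formula: passed tests count less as they accumulate."""
    if passed <= 2:
        h = passed
    elif passed <= 10:
        h = 2 + 0.1 * (passed - 2)
    else:
        h = 2.8 + 0.001 * (passed - 10)
    return failed - h

def naish(passed, failed, total_passed, total_failed):
    """Scaled form: failed count is the primary key, passed count breaks ties."""
    return failed * (total_passed + 1) - passed

# Statement S1 of the example: Passed(s) = 1798, Failed(s) = 210.
s1 = (1798, 210, 2440, 210)
print(round(tarantula(*s1), 2), round(ochiai(*s1), 2),
      round(chi_debug(*s1), 2), naish(*s1))
# 0.58 0.32 205.41 510812
```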
  • # Our Approach – Muffler
    Figure: Dataflow diagram of Muffler. (Legend: input, process, output.)
    • Input: faulty program and test suite
    • Process: instrument the program and execute it against the test suite, producing coverage and testing results
    • Process: select statements to mutate, producing candidate statements
    • Process: mutate the selected statements, producing mutants
    • Process: run the mutants against the test suite, producing changes of testing results
    • Process: calculate suspiciousness and sort statements
    • Output: ranking list of all statements
  • # Our Approach – Muffler Primary Key Secondary Key Additional Key (imprecise when (invalid when (inclined to handle multiple faults coincidental coincidental correctness) occurs) correctness% is high) 33/34
  • # An Example (Part I, Part II)
    TotalPassed = 2440, TotalFailed = 210

    S1  if (block_queue){
    S2      count = block_queue->mem_count + 1; /* fault: insert ‘+1’ */
    S3      n = (int) (count*ratio); /* fault: missing ‘+1’ */
    S4      proc = find_nth(block_queue, n);
    S5      if (proc) {
    S6          block_queue = del_ele(block_queue, proc);
    S7          prio = proc->priority;
    S8          prio_queue[prio] = append_ele(prio_queue[prio], proc);}}

    Statement   Passed(s)   Failed(s)   Tarantula    Ochiai       χDebug        Naish
                                        susp   r     susp   r     susp     r    susp     r
    S1          1798        210         0.58   8     0.32   8     205.41   8    510812   8
    S2          1382        210         0.64   7     0.36   7     205.83   7    511228   7
    S3          1382        210         0.64   7     0.36   7     205.83   7    511228   7
    S4          1382        210         0.64   7     0.36   7     205.83   7    511228   7
    S5          1382        210         0.64   7     0.36   7     205.83   7    511228   7
    S6          1358        210         0.64   3     0.37   3     205.85   3    511252   3
    S7          1358        210         0.64   3     0.37   3     205.85   3    511252   3
    S8          1358        210         0.64   3     0.37   3     205.85   3    511252   3
    (susp: suspiciousness; r: rank)

    Code examination effort to locate S2 and S3: Tarantula 88%, Ochiai 88%, χDebug 88%, Naish 88%.
    Figure: Faulty version v2 of program “schedule”.
  • # An Example (Part III, Part IV)
    Mutated statement for each mutant (one of the five mutants per statement is shown):
    M1  if (!block_queue ) {
    M2      count = block_queue->mem_count != 1;
    M3      n = (int) (count <= ratio) ;
    M4      proc = find_nth(block_queue , ratio);
    M5      if (!proc) {
    M6          block_queue = del_ele(block_queue , proc-1);
    M7          prio /= proc->priority;
    M8          prio_queue[prio] = append_ele(prio_queue[__MININT__] , proc); }}

    Mutant   Change p→f (one column per mutant)   Impact   Muffler susp   r
    M1       1644   1798   1101   1101   1644     1457.6   509354.4       8
    M2       249    1097   1097   249    1382     814.8    510413.2       2
    M3       249    1116   1101   494    1101     812.2    510415.8       2
    M4       1088   638    1136   744    1382     997.6    510230.4       5
    M5       1136   1358   1101   1382   1101     1215.6   510012.4       6
    M6       1123   349    1358   814    1358     1000.4   510251.6       4
    M7       1358   1358   1101   1101   1358     1255.2   509996.8       7
    M8       598    598    1138   1358   1101     958.6    510293.4       3

    Code examination effort to locate S2 and S3: 25%.
    Figure: Faulty version v2 of program “schedule”.
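The numbers in this example are consistent with a simple reading of Muffler's ranking function: Impact is the mean number of passed-to-failed changes over a statement's mutants, and the Muffler suspiciousness is the scaled Naish score, Failed(s) × (TotalPassed + 1) − Passed(s), minus Impact. The slide that defines the function did not survive in the transcript, so the sketch below is a reconstruction inferred from the example values, not the authors' stated formula. The names `ROWS`, `impact`, and `muffler_susp` are ours.

```python
TOTAL_PASSED = 2440

# statement -> (Passed(s), Failed(s), passed->failed changes of its five mutants)
ROWS = {
    "S1": (1798, 210, [1644, 1798, 1101, 1101, 1644]),
    "S2": (1382, 210, [249, 1097, 1097, 249, 1382]),
    "S3": (1382, 210, [249, 1116, 1101, 494, 1101]),
    "S4": (1382, 210, [1088, 638, 1136, 744, 1382]),
    "S5": (1382, 210, [1136, 1358, 1101, 1382, 1101]),
    "S6": (1358, 210, [1123, 349, 1358, 814, 1358]),
    "S7": (1358, 210, [1358, 1358, 1101, 1101, 1358]),
    "S8": (1358, 210, [598, 598, 1138, 1358, 1101]),
}

def impact(changes):
    """Mean number of test results that flip from passed to failed."""
    return sum(changes) / len(changes)

def muffler_susp(passed, failed, changes, total_passed=TOTAL_PASSED):
    """Scaled Naish score minus mutation impact (inferred from the example)."""
    return failed * (total_passed + 1) - passed - impact(changes)

scores = {s: muffler_susp(*row) for s, row in ROWS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['S3', 'S2', 'S8', 'S6', 'S4', 'S5', 'S7', 'S1']
```

Under this reading the faulty statements S2 and S3 occupy the top two of eight positions, which matches the 25% code examination effort reported on the slide.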
  • # Empirical Evaluation
                         Versus Tarantula   Versus Ochiai   Versus χDebug   Versus Naish
    More effective       102                96              93              89
    Same effectiveness   19                 23              23              25
    Less effective       2                  4               7               9
    Table: Pair-wise comparison between Muffler and existing techniques.
    Muffler is more effective (examining fewer statements before encountering the faulty statement) than Naish for 89 out of 123 faulty versions; is as effective (examining the same number of statements before encountering the faulty statement) as Naish for 25 out of 123 faulty versions; and is less effective (examining more statements before encountering the faulty statement) than Naish for only 9 out of 123 faulty versions.
  • # Empirical Evaluation Experience on real faults Faulty versions CC% Code examination effort Naish Muffler v5 1% 0% 0% v9 7% 1% 0% v17 31% 12% 7% v28 49% 11% 5% v29 99% 25% 9% Table: Results with real faults in spaceFive faulty versions are chosen to represent low, medium, and the high occurrence ofcoincidental correctness. In this table, the column “CC%” presents the percentage ofcoincidentally passed test cases out of all passed test cases. The columns under the head“Code examination effort” present the percentage of code to be examined before the fault isencountered. 39/34
  • # Empirical Evaluation Efficiency analysis Program suite CBFL (seconds) Muffler (seconds) tcas 18.00 868.68 tot_info 11.92 573.12 schedule 34.02 2703.01 schedule2 27.76 1773.14 print_tokens 59.11 2530.17 print_tokens2 62.07 5062.87 replace 69.13 4139.19 Average 40.29 2521.46 Table: Time spent by each technique on subject programs.We have shown experimentally that, by taking advantages from both coverage and mutationimpact, Muffler outperforms Naish regardless the occurrence of coincidental correctness.Unfortunately, our approaches, Muffler need to execute piles of mutants to compute mutationimpact. The execution of mutants against the test suite may increase the time cost of faultlocalization. The time mainly contains the cost of instrumentation, execution, and coveragecollection. From this table, we observe that Muffler takes approximately 62.59 times ofaverage time cost to the Naish technique. 40/34
  • # Empirical Evaluation Efficiency analysis Program Mutated Total Time per mutant Mutants suite statements statements (seconds) tcas 40.15 65.10 199.90 4.26 tot_info 39.57 122.96 191.87 2.92 schedule 80.60 150.20 351.60 7.59 schedule2 75.33 127.56 327.78 5.32 print_tokens 67.43 189.86 260.29 9.49print_tokens2 86.67 199.44 398.67 12.54 replace 71.14 242.86 305.93 13.30 Average 56.52 142.79 256.90 7.92 Table: Information about mutants generated.This Table illustrates the detailed data about the number of mutated/total executablestatements, the number of mutants generated, and the time cost of running each mutant. Forexample, of the program tcas, there are, on average, 40.15 statements that are mutated byMuffler; and 65.10 executable statements in total; 199.90 mutants are generated and it takes4.26 seconds to run each of them, on average. Notice that there is no need to collect coveragefrom the mutants‟ executions, and it takes about 1/4 time to run a mutant withoutinstrumentation and coverage collection. 41/34