1. A Software Fault Localization Technique
Based on Program Mutations
Tao He
Co-authored with Xinming Wang, Xiaocong Zhou, Wenjun Li, Zhenyu Zhang, and S.C. Cheung
elfinhe@gmail.com
Software Engineering Laboratory
Department of Computer Science, Sun Yat-Sen University
The 6th Seminar of SELAB
November 2012
Sun Yat-Sen University, Guangzhou, China
4. Background
Coverage-Based Fault Localization (CBFL)
Input
Coverage
Testing results (passed or failed)
Output
A ranking list of statements
Ranking functions
Most CBFL techniques are similar to one another,
differing mainly in the ranking function used to
compute suspiciousness.
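The CBFL flow described above can be sketched in a few lines of Python. The coverage and verdict data are hypothetical, and the simple failure-ratio used here is only an illustrative ranking function, not one of the named functions discussed later:

```python
# CBFL sketch: from per-statement coverage and pass/fail verdicts,
# compute a suspiciousness score per statement and output a ranking.

def suspiciousness(failed, passed):
    # Fraction of covering runs that failed (illustrative only).
    return failed / (failed + passed) if failed + passed else 0.0

# coverage[s] = indices of test cases that execute statement s
coverage = {"s1": [0, 1, 2], "s2": [2, 3], "s3": [0, 3]}
verdicts = ["pass", "pass", "fail", "fail"]  # one per test case

scores = {}
for stmt, tests in coverage.items():
    failed = sum(verdicts[t] == "fail" for t in tests)
    passed = len(tests) - failed
    scores[stmt] = suspiciousness(failed, passed)

# The output of CBFL: statements sorted by suspiciousness.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here s2 is covered only by failing runs, so it tops the ranking list.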
5. Motivation
One fundamental assumption [YPW08] of CBFL:
The observed behaviors from passed runs can precisely
represent the correct behaviors of the program,
and the observed behaviors from failed runs can represent its
faulty behaviors.
Therefore, differences in the observed behaviors of program
entities between passed runs and failed runs will indicate the
fault's location.
But this assumption does not always hold.
[YPW08] C. Yilmaz, A. Paradkar, and C. Williams. Time will tell: fault localization using time spectra. In Proceedings
of the 30th international conference on Software engineering (ICSE '08). ACM, New York, NY, USA, 81-90. 2008.
6. Motivation
Coincidental Correctness (CC)
"No failure is detected, even though a fault has been executed." [RT93]
i.e., a passed run may cover the fault.
CC weakens the first part of CBFL's assumption: that
the observed behaviors from passed runs can precisely represent
the correct behaviors of the program.
Moreover, CC occurs frequently in practice. [MAE+09]
[RT93] D.J. Richardson and M.C. Thompson, An analysis of test data selection criteria using the RELAY model of
fault detection, Software Engineering, IEEE Transactions on, vol. 19, (no. 6), pp. 533-553, 1993.
[MAE+09] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi, An empirical study of the factors that reduce the
effectiveness of coverage-based fault localization, in Proceedings of the 2nd International Workshop on Defects in
Large Software Systems: Held in conjunction with the ACM SIGSOFT International Symposium on Software Testing
and Analysis (ISSTA 2009), pp. 1-5, 2009.
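A minimal illustration of coincidental correctness, using a hypothetical faulty function: the fault is executed by several passing tests, so passed runs no longer represent only correct behavior:

```python
# Hypothetical fault: the predicate should be `x > 0`, but the
# faulty version uses `x >= 0`. Every test executes the faulty
# line, yet only x == 0 produces a wrong result.

def sign_faulty(x):
    if x >= 0:          # fault: should be `x > 0`
        return 1
    return -1

def sign_expected(x):
    return 1 if x > 0 else -1

tests = [5, 3, -2, 0]
verdicts = ["pass" if sign_faulty(x) == sign_expected(x) else "fail"
            for x in tests]
# x = 5, 3, and -2 all evaluate the faulty predicate but still
# pass: these are coincidentally correct runs.
```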
7. Our goal is to address the CC issue via mutation analysis
Our Approach – Muffler
8. Why does our approach work?
- Key hypothesis
Mutating the faulty statement tends to maintain the
results of passed test cases.
By contrast, mutating a correct statement tends to
change the results of passed test cases (from passed to
failed).
9. Why does our approach work?
- Three comprehensive scenarios (1/3)
[Figure: a program with fault point F and mutant point M in different basic blocks, with test cases and their passed/failed results.]
- If we mutate an M in a different basic block from F: 3 test results change from passed to failed.
- If we instead mutate F itself: 0 test results change from passed to failed.
12. Why does our approach work?
- Three comprehensive scenarios (2/3)
[Figure: fault point F and mutant point M in the same basic block, with the data flow and control flow to the output.]
- If we mutate an M in the same basic block as F: 3 test results change from passed to failed, due to a different data flow affecting the output.
- If we instead mutate F itself: 0 test results change from passed to failed.
14. Why does our approach work?
- Three comprehensive scenarios (3/3)
[Figure: the mutation is applied at fault point F itself, when CC occurs frequently.]
- If we mutate F when CC occurs frequently: 0 test results change from passed to failed, due to the fault's weak ability to affect the output, i.e., to generate an infectious state or to propagate that infectious state to the output.
15. Our Approach – Muffler
Naish, the best existing ranking function [LRR11]:
  Susp_Naish(S_i) = Failed(P, S_i) × (TotalPassed(P) + 1)
Mutation impact, the average number of test results that change from passed to failed over the m mutants of S_i:
  Impact(S_i) = ( Σ_{j=1..m} Change_{p→f}(P, M_{S_i,j}) ) / m
Muffler, a combination of Naish and mutation impact:
  Susp_Muffler(S_i) = Susp_Naish(S_i) − Impact(S_i)
[LRR11] L. Naish, H. J. Lee, and K. Ramamohanarao, A model for spectra-based software diagnosis. ACM
Transaction on Software Engineering Methodology, 20(3):11, 2011.
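The three definitions above compose as follows. This is a sketch with purely illustrative numbers; in practice the counts come from real test executions:

```python
# Muffler score = Naish suspiciousness minus mutation impact,
# where impact is the mean number of passed->failed result
# changes over the m mutants of statement S_i.

def susp_naish(failed_si, total_passed):
    return failed_si * (total_passed + 1)

def impact(changes_p_to_f):
    # changes_p_to_f: p->f change count for each mutant of S_i
    return sum(changes_p_to_f) / len(changes_p_to_f)

def susp_muffler(failed_si, total_passed, changes_p_to_f):
    return susp_naish(failed_si, total_passed) - impact(changes_p_to_f)

# Illustrative: a statement covered by 3 failed runs, 500 passed
# runs in total, whose 5 mutants flip 10, 20, 30, 20, 20 tests.
score = susp_muffler(3, 500, [10, 20, 30, 20, 20])  # 1503 - 20 = 1483
```

A statement whose mutants flip few passed tests (low impact) keeps a high Muffler score, matching the key hypothesis that mutating the faulty statement tends to maintain passed results.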
19. Empirical Evaluation
% of code examined | Tarantula | Ochiai | χDebug | Naish | Muffler
  1%               |    14     |   18   |   19   |  21   |   35
  5%               |    38     |   48   |   56   |  58   |   74
 10%               |    54     |   63   |   68   |  68   |   85
 15%               |    57     |   65   |   80   |  80   |   94
 20%               |    60     |   67   |   84   |  84   |   99
 30%               |    79     |   88   |   91   |  92   |  110
 40%               |    92     |   98   |   98   |  99   |  117
 50%               |    98     |   99   |  101   | 102   |  121
 60%               |    99     |  103   |  105   | 106   |  123
 70%               |   101     |  107   |  117   | 119   |  123
 80%               |   114     |  122   |  122   | 123   |  123
 90%               |   123     |  123   |  122   | 123   |  123
100%               |   123     |  123   |  123   | 123   |  123
Table: Number of faults located at different levels of code
examination effort by each technique.
When 1% of the statements have been examined, Naish reaches the
fault in 17.07% (21/123) of faulty versions, while Muffler reaches
it in 28.46% (35/123).
20. Empirical Evaluation
       | Tarantula | Ochiai | χDebug | Naish | Muffler
Min    |    0.00   |  0.00  |  0.00  |  0.00 |   0.00
Max    |   87.89   | 84.25  | 93.85  | 78.46 |  55.38
Median |   20.33   |  9.52  |  7.69  |  7.32 |   3.25
Mean   |   27.68   | 23.62  | 20.04  | 19.34 |   9.62
Stdev  |   28.29   | 26.36  | 24.61  | 23.86 |  13.22
Table: Statistics of code examination effort (%).
Among the five techniques, Muffler scores best on the maximum, median, and mean
code examination effort, and ties on the minimum. Muffler also has a much lower
standard deviation, meaning its performance varies less widely than the others
and is more stable in terms of effectiveness. The results further show that
Muffler reduces the average code examination effort relative to Naish by
50.26% (= 100% − 9.62%/19.34%).
21. Conclusion and future work
We propose Muffler, a technique using mutation to
help locate program faults.
On 123 faulty versions of seven programs, we conduct
a comparison of effectiveness and efficiency with the
Naish technique. Results show that Muffler reduces the
average code examination effort on each faulty version
by 50.26%.
For future work, we plan to generalize our approach to
locate faults in multi-fault programs.
24. Background
Mutation analysis, first proposed by Hamlet [Ham77] and
DeMillo et al. [DLS78], is a fault-based testing technique
used to measure the effectiveness of a test suite.
In mutation analysis, one introduces syntactic code
changes, one at a time, into a program to generate
various faulty programs (called mutants).
A mutation operator is a change-seeding rule to
generate a mutant from the original program.
[Ham77] R.G. Hamlet, Testing Programs with the Aid of a Compiler, Software Engineering, IEEE Transactions on,
vol. SE-3, (no. 4), pp. 279- 290, 1977.
[DLS78] R.A. DeMillo, R.J. Lipton and F.G. Sayward, Hints on Test Data Selection: Help for the Practicing
Programmer, Computer, vol. 11, (no. 4), pp. 34-41, 1978.
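As a toy illustration of a change-seeding rule (not one of the operators used in this work), a relational-operator-replacement operator can be applied mechanically. Here a Python AST transformer turns the first `>` into `>=`, yielding one mutant:

```python
# Toy mutation operator: relational operator replacement.
# Applying the rule once ("one change at a time") produces one mutant.
import ast

class RelOpMutator(ast.NodeTransformer):
    """Replace the first `>` comparison with `>=`."""
    def __init__(self):
        self.done = False

    def visit_Compare(self, node):
        if not self.done and isinstance(node.ops[0], ast.Gt):
            node.ops[0] = ast.GtE()   # seed the change
            self.done = True
        return node

src = "def is_pos(x):\n    return x > 0\n"
tree = RelOpMutator().visit(ast.parse(src))
mutant_src = ast.unparse(tree)   # the mutant uses `x >= 0`
```

Each mutation operator in a real tool is a catalogue of such rules (arithmetic, relational, logical replacements, etc.), each application generating one mutant program.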
25. Ranking functions
Table: Ranking functions.
  Susp_Tarantula(S_i) = (Failed(S_i)/TotalFailed) / (Failed(S_i)/TotalFailed + Passed(S_i)/TotalPassed)
  Susp_Ochiai(S_i) = Failed(S_i) / sqrt(TotalFailed × (Failed(S_i) + Passed(S_i)))
  Susp_χDebug(S_i) = Failed(S_i) − h, where
      h = Passed(S_i),                       if Passed(S_i) ≤ 2
      h = 2 + 0.1 × (Passed(S_i) − 2),       if 2 < Passed(S_i) ≤ 10
      h = 2.8 + 0.01 × (Passed(S_i) − 10),   if Passed(S_i) > 10
  Susp_Naish(S_i) = Failed(S_i) × (TotalPassed + 1)
Tarantula [JHS02], Ochiai [AZV07], χDebug [WQZ+07], and Naish [NLR11]
[JHS02] J.A. Jones, M. J. Harrold, and J. Stasko. Visualization of test information to assist fault localization. In Proceedings of the
24th International Conference on Software Engineering (ICSE '02), pp. 467-477, 2002.
[AZV07] R. Abreu, P. Zoeteweij and A.J.C. Van Gemund, On the accuracy of spectrum-based fault localization, in Proc. Proceedings -
Testing: Academic and Industrial Conference Practice and Research Techniques, TAIC PART-Mutation 2007, pp. 89-98, 2007.
[WQZ+07] W.E. Wong, Yu Qi, Lei Zhao, and Kai-Yuan Cai. Effective Fault Localization using Code Coverage. In Proceedings of the
31st Annual International Computer Software and Applications Conference (COMPSAC '07), Vol. 1, pp. 449-456, 2007.
[NLR11] L. Naish, H. J. Lee, and K. Ramamohanarao, A model for spectra-based software diagnosis. ACM Transaction on Software
Engineering Methodology, 20(3):11, 2011.
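The four ranking functions in the table translate directly to Python. This is a sketch: F and P denote the failed and passed runs covering S_i, TF and TP the total failed and passed runs, with guards added for the zero-denominator cases:

```python
import math

def tarantula(F, TF, P, TP):
    # Suspiciousness in [0, 1]; guard the uncovered/failed-free cases.
    if F == 0:
        return 0.0
    p_rate = P / TP if TP else 0.0
    return (F / TF) / (F / TF + p_rate)

def ochiai(F, TF, P):
    denom = math.sqrt(TF * (F + P))
    return F / denom if denom else 0.0

def chi_debug(F, P):
    # Piecewise penalty h on the passed count, as in the table.
    if P <= 2:
        h = P
    elif P <= 10:
        h = 2 + 0.1 * (P - 2)
    else:
        h = 2.8 + 0.01 * (P - 10)
    return F - h

def naish(F, TP):
    return F * (TP + 1)
```

For example, a statement covered by 3 failed runs in a suite with 500 passed runs scores naish(3, 500) = 1503.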
26. Our Approach – Muffler
Figure: Dataflow diagram of Muffler.
Inputs: test suite and faulty program. Output: ranking list of all statements.
Process:
1. Instrument the program and execute it against the test suite → coverage & testing results.
2. Select statements to mutate → candidate statements.
3. Mutate the selected statements → mutants.
4. Run the mutants against the test suite → changes of testing results.
5. Calculate suspiciousness and sort statements → ranking list.
27. Our Approach – Muffler
Susp_Muffler(S_i) = Failed(P, S_i) × (TotalPassed(P) + 1) − Impact(S_i)
- Primary key (imprecise when multiple faults occur)
- Secondary key (invalid when the coincidental correctness rate is high)
- Additional key: Impact (inclined to handle coincidental correctness)
31. An Example
Figure: Faulty version v2 of program "schedule".
Part III: the mutated statement at each mutant point M1–M8 (one mutant shown per point), with the number of test results that change from passed to failed (Change_p→f) for each of its five mutants:
M1: if (!block_queue) {                                              → 1644, 1798, 1101, 1101, 1644
M2: count = block_queue->mem_count != 1;                             →  249, 1097, 1097,  249, 1382
M3: n = (int) (count <= ratio);                                      →  249, 1116, 1101,  494, 1101
M4: proc = find_nth(block_queue, ratio);                             → 1088,  638, 1136,  744, 1382
M5: if (!proc) {                                                     → 1136, 1358, 1101, 1382, 1101
M6: block_queue = del_ele(block_queue, proc-1);                      → 1123,  349, 1358,  814, 1358
M7: prio /= proc->priority;                                          → 1358, 1358, 1101, 1101, 1358
M8: prio_queue[prio] = append_ele(prio_queue[__MININT__], proc); }}  →  598,  598, 1138, 1358, 1101
Part IV (Muffler): Impact | susp | rank r
M1: 1457.6 | 509354.4 | 8
M2:  814.8 | 510413.2 | 2
M3:  812.2 | 510415.8 | 2
M4:  997.6 | 510230.4 | 5
M5: 1215.6 | 510012.4 | 6
M6: 1000.4 | 510251.6 | 4
M7: 1255.2 | 509996.8 | 7
M8:  958.6 | 510293.4 | 3
Code examination effort to locate S2 and S3: 25%
32. Empirical Evaluation
                   | vs Tarantula | vs Ochiai | vs χDebug | vs Naish
More effective     |     102      |    96     |    93     |    89
Same effectiveness |      19      |    23     |    23     |    25
Less effective     |       2      |     4     |     7     |     9
Table: Pair-wise comparison between Muffler and existing techniques.
Muffler is more effective (examining fewer statements before encountering the
faulty statement) than Naish for 89 out of 123 faulty versions; as effective
(examining the same number of statements) as Naish for 25 out of 123; and less
effective (examining more statements) than Naish for only 9 out of 123.
33. Empirical Evaluation
Experience on real faults
Table: Results with real faults in space.
Faulty version | CC% | Code examination effort (Naish) | (Muffler)
v5  |  1% |  0% | 0%
v9  |  7% |  1% | 0%
v17 | 31% | 12% | 7%
v28 | 49% | 11% | 5%
v29 | 99% | 25% | 9%
These five faulty versions are chosen to represent low, medium, and high
occurrences of coincidental correctness. The column "CC%" presents the
percentage of coincidentally passed test cases out of all passed test cases;
the "Code examination effort" columns present the percentage of code to be
examined before the fault is encountered.
34. Empirical Evaluation
Efficiency analysis
Table: Time spent by each technique on the subject programs.
Program suite | CBFL (seconds) | Muffler (seconds)
tcas          |  18.00 |  868.68
tot_info      |  11.92 |  573.12
schedule      |  34.02 | 2703.01
schedule2     |  27.76 | 1773.14
print_tokens  |  59.11 | 2530.17
print_tokens2 |  62.07 | 5062.87
replace       |  69.13 | 4139.19
Average       |  40.29 | 2521.46
We have shown experimentally that, by taking advantage of both coverage and
mutation impact, Muffler outperforms Naish regardless of the occurrence of
coincidental correctness. However, Muffler needs to execute a large number of
mutants to compute mutation impact, and running these mutants against the test
suite increases the time cost of fault localization. This time mainly comprises
instrumentation, execution, and coverage collection. From the table, we observe
that Muffler takes approximately 62.59 times the average time cost of the Naish
technique.
35. Empirical Evaluation
Efficiency analysis
Table: Information about the mutants generated.
Program suite | Mutated statements | Total statements | Mutants | Time per mutant (seconds)
tcas          | 40.15 |  65.10 | 199.90 |  4.26
tot_info      | 39.57 | 122.96 | 191.87 |  2.92
schedule      | 80.60 | 150.20 | 351.60 |  7.59
schedule2     | 75.33 | 127.56 | 327.78 |  5.32
print_tokens  | 67.43 | 189.86 | 260.29 |  9.49
print_tokens2 | 86.67 | 199.44 | 398.67 | 12.54
replace       | 71.14 | 242.86 | 305.93 | 13.30
Average       | 56.52 | 142.79 | 256.90 |  7.92
This table gives detailed data on the number of mutated and total executable
statements, the number of mutants generated, and the time cost of running each
mutant. For example, for the program tcas there are, on average, 40.15 mutated
statements out of 65.10 executable statements; 199.90 mutants are generated,
and each takes 4.26 seconds to run on average. Note that there is no need to
collect coverage from the mutants' executions, and running a mutant without
instrumentation and coverage collection takes about a quarter of the time.
41. Why does our approach work?
- Another feasibility study (when CC% ≥ 95%)
[Figure: Frequency distribution of effectiveness when CC% ≥ 95%; frequency of faulty versions vs. percentage of code examined. Result changes avg. 16.33%; Naish avg. 47.55%.]
When CC% is greater than or equal to 95%, the code examination effort reduction
achieved by result changes is 65.66% (= 100% − 16.33%/47.55%).
Only 6 faulty versions need to examine less than 20% of statements under Naish,
versus 22 versions using result changes.
42. Experience on real faults
Table 8: Results with real faults in space.
Faulty version | CC%    | Lines of code examined (Naish) | (Muffler)
v5  |  0.90% |   2 |   1
v20 |  1.97% |  15 |   5
v21 |  1.97% |  15 |   6
v10 |  2.74% |  47 |  18
v11 |  6.29% |  37 |  14
v6  |  6.92% |  40 |   7
v9  | 19.05% |   7 |   1
v17 | 30.92% | 427 | 244
v28 | 48.57% | 268 | 170
v29 | 99.32% | 797 | 331
Editor's Notes
I assume that you already know most of these techniques, so I only give a quick review.
Figure 3 shows the results of the overall comparison of the effectiveness of Naish and Muffler. It depicts the percentage of versions (out of 123 faulty versions) whose fault can be located when a certain percentage of the code in each version has been examined. The vertical axis shows the percentage of faulty versions, and the horizontal axis shows the percentage of code to be examined, in descending order of the ranking list produced by a fault localization technique.
Tarantula [17], Ochiai [3], χDebug [37], and Naish [26]
It is worthwhile to mention that Muffler's time cost can be greatly reduced with a simple test selection strategy: do not re-run a test case that does not cover the mutated statement. Furthermore, because the executions of mutants do not depend on each other, we can parallelize them with little effort. Nonetheless, we have to admit that Muffler needs more time to offer better effectiveness in fault localization.
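The test selection strategy mentioned here can be sketched as follows; the data shapes are hypothetical:

```python
# Re-run only the tests whose coverage includes the mutated
# statement; all other tests keep their original verdicts, since
# the mutation cannot change their behavior.

def select_tests(coverage, mutated_stmt):
    # coverage[t] = set of statements executed by test t
    return [t for t, stmts in coverage.items() if mutated_stmt in stmts]

coverage = {"t1": {"s1", "s2"}, "t2": {"s1"}, "t3": {"s2", "s3"}}
to_rerun = select_tests(coverage, "s2")   # only t1 and t3 cover s2
```

Because each mutant is independent, the `to_rerun` batches for different mutants could also be dispatched to parallel workers.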
The vertical axis denotes the number of test result changes (from 'passed' to 'failed'), and the horizontal width denotes the probability density at the corresponding number of test result changes. We additionally draw a red line to indicate the faulty statement's test result changes.
Each of these programs differs from the others with respect to size, functionality, fault type, etc.
But we can derive similar results that the fault
Of the 32 faulty versions whose CC% is greater than 95%: 1) 19 faulty versions need to examine more than 60% of all statements using Naish, whereas no version does for Muffler; 3) only 8 versions need to examine less than 40% of statements for Naish, versus 25 versions for Muffler.
Table 8 reports the results for faulty versions with different concentrations of coincidental correctness. In this table, the column "CC%" presents the percentage of coincidentally passed test cases out of all passed test cases. The columns under the heading "Lines of code examined" present the number of code lines to be examined before the fault is encountered using Naish and Muffler, respectively. For example, for version "v6", with a coincidental correctness percentage of 6.92%, Naish needs to examine 40 code lines, whereas Muffler needs only 7; the code examination effort reduction from Naish to Muffler is 82.5% (= 100% − 7/40).