  1. Prioritizing Test Cases for Regression Testing
     Article by: Rothermel et al.
     Presentation by: Martin, Otto, and Prashanth
  2. - Test case prioritization techniques schedule test cases for execution in an order that attempts to increase their effectiveness at meeting some performance goal.
     - One such goal is the rate of fault detection: a measure of how quickly faults are detected within the testing process.
       - An improved rate of fault detection during testing can provide faster feedback on the system under test and let software engineers begin correcting faults earlier than might otherwise be possible.
     - One application of prioritization techniques is regression testing.
  3. - This paper describes several techniques for using test execution information to prioritize test cases for regression testing, including:
       1) techniques that order test cases based on their total coverage of code components,
       2) techniques that order test cases based on their coverage of code components not previously covered, and
       3) techniques that order test cases based on their estimated ability to reveal faults in the code components that they cover.
  4. - When the time required to re-execute an entire test suite is short, test case prioritization may not be cost-effective; it may be sufficient simply to schedule test cases in any order.
     - When the time required to execute an entire test suite is sufficiently long, however, test case prioritization may be beneficial because, in this case, meeting testing goals earlier can yield meaningful benefits.
     - In general test case prioritization, given program P and test suite T, we prioritize the test cases in T with the intent of finding an ordering that will be useful over a succession of subsequent modified versions of P.
     - In the case of regression testing, prioritization techniques can use information gathered in previous runs of existing test cases to help prioritize the test cases for subsequent runs.
  5. - This paper considers 9 different test case prioritization techniques.
     - The first three techniques serve as experimental controls.
     - The last six techniques represent heuristics that could be implemented using software tools.
     - A source of motivation for these approaches is the conjecture that the availability of test execution data can be an asset.
     - This assumes that past test execution data can be used to predict, with sufficient accuracy, subsequent execution behavior.
  7. - Definition 1. The Test Case Prioritization Problem:
     - Given: T, a test suite; PT, the set of permutations of T; and f, a function from PT to the real numbers.
     - PT represents the set of all possible prioritizations (orderings) of T.
     - f is a function that, applied to any such ordering, yields an award value for that ordering.
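As a concrete toy illustration of Definition 1, the sketch below enumerates PT by brute force and picks the ordering with the highest award value. The durations and the award function f are assumptions invented for illustration (not from the paper), and the exponential search is purely a didactic baseline rather than one of the paper's techniques.

```python
import itertools

def prioritize(test_suite, f):
    # Exhaustive search over PT, the set of all orderings of T:
    # return the permutation maximizing the award function f.
    # Exponential in |T|, so purely a didactic baseline.
    return max(itertools.permutations(test_suite), key=f)

# Toy award function (illustrative, not from the paper): reward
# orderings with low total completion time, i.e. short tests first.
durations = {"t1": 5, "t2": 1, "t3": 3}

def f(order):
    total, clock = 0, 0
    for t in order:
        clock += durations[t]  # time at which test t finishes
        total += clock         # accumulate completion times
    return -total  # higher award = lower total completion time

best = prioritize(["t1", "t2", "t3"], f)
```

With this f, the best ordering runs the shortest tests first; the paper's techniques exist precisely because this exhaustive search is infeasible for realistic suites.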
  8. - A challenge: care must be taken to keep the cost of performing the prioritization from excessively delaying the very regression testing activities it is intended to facilitate.
  9. - M3: Optimal prioritization.
     - Given program P and a set of known faults for P, if we can determine, for test suite T, which test cases in T expose which faults in P, then we can determine an optimal ordering of the test cases in T for maximizing T's rate of fault detection for that set of faults.
     - This is not a practical technique, as it requires a priori knowledge of the existence of faults and of which test cases expose which faults.
     - However, by using this technique in the empirical studies, we can gain insight into the success of the practical heuristics by comparing their orderings to optimal ones.
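The experimental idea behind M3 can be sketched as a greedy loop over a known fault matrix. The `exposes` data here is hypothetical, and a greedy choice only approximates a truly optimal ordering, but the sketch makes clear why the technique needs a priori fault knowledge:

```python
def optimal_order(tests, exposes):
    # Greedy sketch of M3: repeatedly schedule the test that
    # exposes the most not-yet-detected faults. Requires advance
    # knowledge of `exposes` (test -> set of faults it reveals),
    # which is exactly why M3 is impractical outside experiments.
    remaining = list(tests)
    undetected = set().union(*exposes.values())
    order = []
    while remaining:
        best = max(remaining, key=lambda t: len(exposes[t] & undetected))
        order.append(best)
        undetected -= exposes[best]
        remaining.remove(best)
    return order

# Hypothetical fault matrix for illustration.
exposes = {"t1": {"f1"}, "t2": {"f1", "f2", "f3"}, "t3": {"f4"}}
```

Here t2 is scheduled first because it reveals three faults at once, then t3 for the remaining fault, and t1 (now redundant) last.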
  10. - M4: Total statement coverage prioritization.
      - By instrumenting a program, we can determine, for any test case, which statements in that program were exercised (covered) by that test case.
      - We can then prioritize test cases in terms of the total number of statements they cover, by counting the number of statements covered by each test case and then sorting the test cases in descending order of that number.
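A minimal sketch of M4, assuming a hypothetical `coverage` map from test cases to the statement ids they exercise (gathered, in practice, by instrumentation):

```python
def total_stmt_priority(coverage):
    # M4 sketch: count the statements each test case covers and
    # sort the test cases in descending order of that count.
    # `coverage` maps a test case to the set of statement ids
    # it exercises.
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

# Hypothetical coverage data for illustration.
coverage = {"t1": {1, 2}, "t2": {1, 2, 3, 4}, "t3": {5, 6, 7}}
```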
  11. - M5: Additional statement coverage prioritization.
      - Total statement coverage prioritization schedules test cases in order of total coverage achieved; however, having executed a test case and covered certain statements, more may be gained in subsequent testing by executing statements that have not yet been covered.
      - Additional statement coverage prioritization iteratively selects the test case that covers the most statements not yet covered, adjusts the coverage information of the remaining test cases to reflect only statements still uncovered, and repeats this process until every statement covered by at least one test case has been covered.
      - We may reach a point where each such statement has been covered and the remaining unprioritized test cases cannot add statement coverage. These remaining test cases could then be ordered using any prioritization technique.
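The iterative selection just described can be sketched as follows. The `coverage` data is hypothetical, and appending the leftover test cases unchanged is one arbitrary choice among those the slide leaves open:

```python
def additional_stmt_priority(coverage):
    # M5 sketch: greedily pick the test covering the most not-yet-
    # covered statements, update what remains uncovered, repeat.
    # Once no test adds coverage, append the leftovers unchanged.
    remaining = list(coverage)
    uncovered = set().union(*coverage.values())
    order = []
    while remaining and uncovered:
        best = max(remaining, key=lambda t: len(coverage[t] & uncovered))
        if not coverage[best] & uncovered:
            break  # nothing left to gain from the remaining tests
        order.append(best)
        uncovered -= coverage[best]
        remaining.remove(best)
    return order + remaining

# Hypothetical coverage data for illustration: t1 is redundant
# once t2 has run, so it drops to the end of the ordering.
coverage = {"t1": {1, 2, 3}, "t2": {1, 2, 3, 4}, "t3": {5, 6}}
```

On this same data, total statement coverage (M4) would instead yield t2, t1, t3, which illustrates the difference between the two techniques.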
  13. - M6: Total branch coverage prioritization.
      - Total branch coverage prioritization is the same as total statement coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements.
      - In this context, we define branch coverage as coverage of each possible overall outcome of a (possibly compound) condition in a predicate. Thus, for example, each if or while statement must be exercised such that it evaluates at least once to true and at least once to false.
  14. - M7: Additional branch coverage prioritization.
      - Additional branch coverage prioritization is the same as additional statement coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements.
      - After complete coverage has been achieved, the remaining test cases are prioritized by resetting the coverage vectors to their initial values and reapplying additional branch coverage prioritization to those remaining test cases.
  15. - M8: Total fault-exposing-potential (FEP) prioritization.
      - Some faults are more easily exposed than other faults, and some test cases are more adept at revealing particular faults than other test cases.
      - The ability of a test case to expose a fault (that test case's fault-exposing potential, or FEP) depends not only on whether the test case covers (executes) a faulty statement, but also on the probability that a fault in that statement will cause a failure for that test case.
      - Three probabilities could be used in determining FEP:
        1) the probability that a statement s is executed (execution probability),
        2) the probability that a change in s can cause a change in program state (infection probability), and
        3) the probability that a change in state propagates to output (propagation probability).
  16. - This paper adopts an approach that uses mutation analysis to produce a combined estimate of propagation-and-infection that does not incorporate independent execution probabilities.
      - Mutation analysis creates a large number of faulty versions (mutants) of a program by altering program statements, and uses these to assess the quality of test suites by measuring whether those test suites can detect those faults ("kill" those mutants).
      - Given program P and test suite T, we first create a set of mutants N = {n_1, n_2, ..., n_m} for P, noting which statement s_j in P contains each mutant. Next, for each test case t_i in T, we execute each mutant version n_k of P on t_i, noting whether t_i kills that mutant.
      - Having collected this information for every test case and mutant, we consider each test case t_i and each statement s_j in P, and calculate the fault-exposing potential FEP(s_j, t_i) of t_i on s_j as the ratio of mutants of s_j killed by t_i to the total number of mutants of s_j.
  17. - To perform total FEP prioritization, given these FEP(s, t) values, we calculate, for each test case t_i in T, an award value by summing the FEP(s_j, t_i) values for all statements s_j in P.
      - Given these award values, we then prioritize test cases by sorting them in order of descending award value.
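The FEP ratio and the total FEP award values can be sketched together in Python; the mutation-analysis results below are hypothetical, standing in for the output of a tool such as Proteum:

```python
def fep(mutants_of, killed):
    # FEP(s, t) sketch: the ratio of mutants of statement s killed
    # by test t to the total number of mutants seeded in s.
    # `mutants_of[s]` is the set of mutant ids in statement s;
    # `killed[t]` is the set of mutants that test t kills.
    return {(s, t): len(mutants_of[s] & killed[t]) / len(mutants_of[s])
            for s in mutants_of for t in killed}

def total_fep_priority(mutants_of, killed):
    # M8 sketch: each test's award value is the sum of its
    # FEP(s, t) over all statements; sort by descending award.
    fep_st = fep(mutants_of, killed)
    award = {t: sum(fep_st[(s, t)] for s in mutants_of) for t in killed}
    return sorted(killed, key=award.get, reverse=True)

# Hypothetical mutation-analysis results for illustration.
mutants_of = {"s1": {"m1", "m2"}, "s2": {"m3", "m4"}}
killed = {"t1": {"m1"}, "t2": {"m1", "m2", "m3"}}
```

Here t2 kills all mutants of s1 and half of those of s2 (award 1.5), so it is scheduled before t1 (award 0.5).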
  19. - M9: Additional fault-exposing-potential (FEP) prioritization.
      - This lets us account for the fact that additional executions of a statement may be less valuable than initial executions.
      - We require a mechanism for measuring the value of an execution of a statement that can be related to FEP values.
      - For this, we use the term confidence. We say that the confidence in statement s, C(s), is an estimate of the probability that s is correct.
      - If we execute a test case t that exercises s and does not reveal a fault in s, C(s) should increase.
  21. - Research Questions
        - Can test case prioritization improve the rate of fault detection of test suites?
        - How do the various test case prioritization techniques discussed earlier compare to one another in terms of effects on rate of fault detection?
      - Effectiveness Measure
        - A weighted Average of the Percentage of Faults Detected (APFD) is used.
        - It ranges from 0 to 100; higher numbers mean faster fault detection.
      - Problems with APFD
        - It does not measure the cost of prioritization.
        - That cost is normally amortized, because test suites are created after the release of a version of the software.
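A sketch of the APFD computation, using the formula commonly given for it: APFD = 1 - (TF_1 + ... + TF_m)/(n*m) + 1/(2n), where TF_i is the 1-based position of the first test detecting fault i, n the number of tests, and m the number of faults. The fault matrix is hypothetical, and the value is computed on a 0..1 scale (multiply by 100 for the percentage scale quoted on the slide):

```python
def apfd(order, exposes):
    # APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n).
    # Assumes every fault is detected by at least one test case
    # in the ordering.
    faults = set().union(*exposes.values())
    n, m = len(order), len(faults)
    first = {}  # fault -> position of the first test detecting it
    for pos, t in enumerate(order, start=1):
        for fault in exposes[t]:
            first.setdefault(fault, pos)
    return 1 - sum(first[fault] for fault in faults) / (n * m) + 1 / (2 * n)

# Hypothetical fault matrix: running t2 first detects both faults
# immediately, so that ordering scores higher.
exposes = {"t1": {"f1"}, "t2": {"f1", "f2"}, "t3": set()}
```

For example, the ordering t2, t1, t3 detects both faults with the first test and scores 5/6, while t1, t3, t2 defers f2 to the last test and scores only 1/2.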
  22. Effectiveness Example
  23. - Programs used
        - The Aristotle program analysis system provided test coverage and control flow graph information.
        - The Proteum mutation system was used to obtain mutation scores.
        - 8 C programs were used as subjects.
          - The first 7 were created at Siemens; the eighth is a European Space Agency program.
  25. - Siemens Programs: Description
        - 7 programs used by Siemens in a study that observed the "fault detecting effectiveness of coverage criteria".
        - Faulty versions of these programs were created by manually seeding them with single errors, producing the "number of versions" column.
        - Using single-line faults only allows the researchers to determine whether or not a test case discovers the error.
        - For each of the seven programs, a test case pool was created by Siemens: first via a black box method, then completed using white box testing, so that each "executable statement, edge, and definition use pair … was exercised by at least 30 test cases".
        - Only faulty programs whose errors were detectable by between 3 and 350 test cases were kept.
        - Test suites were created by the researchers by random selection until a branch-coverage-adequate test suite was obtained.
        - Proteum was used to create mutants of the seven programs.
  26. - Space Program: Description
        - 33 versions of space, each with only one fault, were created by the ESA; 2 more were created by the research team.
        - An initial pool of 10,000 test cases was obtained from Vokolos and Frankl.
        - Using these as a base, cases were added until each statement and edge was exercised by at least 30 test cases.
        - Branch-coverage-adequate test suites were created in the same way as for the Siemens programs.
        - Mutants were also created via Proteum.
  27. - Empirical Studies and Results
      - 4 different studies using the 8 programs:
        - Siemens programs with APFD measured relative to the Siemens faults
        - Siemens programs with APFD measured relative to mutants
        - Space with APFD measured relative to actual faults
        - Space with APFD measured relative to mutants
  28. - Siemens programs with APFD measured relative to Siemens faults: Study Format
        - M2 to M9 were applied to each of the 1000 test suites, resulting in 8000 prioritized test suites.
        - The original 1000 untreated suites served as M1.
        - The APFD was calculated relative to the faults provided with each program.
  29. Example boxplot
  30. - Study 1: Overall observations
        - M3 (optimal) is markedly better than all of the others, as expected.
        - The test case prioritization techniques appear to offer some improvement, but further statistical analysis was needed to confirm this.
        - That analysis revealed more results:
          - Branch-based coverage did as well as or better than statement coverage.
          - In all but one case, total branch coverage did as well as or better than additional branch coverage.
          - Total statement coverage always did as well as or better than additional statement coverage.
          - In 5 of 7 programs, even randomly prioritized test suites did better than untreated test suites.
  31. Example Groupings
  32. - Siemens programs with APFD measured relative to mutants: Study Format
        - Same format as the first study: 9000 test suites were used, 1000 for each prioritization technique.
        - Rather than running the test cases against the small set of known errors, they were applied to mutated programs, which formed a much larger bed of faulty programs to test against.
      - Results
        - Additional and total FEP prioritization outperformed all others (except optimal).
        - Branch coverage almost always outperformed statement coverage.
        - Total statement coverage outperformed additional statement coverage.
        - But additional branch coverage outperformed total branch coverage.
        - In this study, however, random prioritization did not outperform the untreated control.
  33. - Space with APFD measured relative to actual faults
      - M2 to M9 were applied to each of the 50 test suites, resulting in 400 prioritized test suites; with the original 50, 450 test suites in total.
      - Additional FEP prioritization outperformed all others, but there was no significant difference among the rest.
      - Again, random prioritization was no better than the untreated control.
  34. Study 3 Groupings
  35. - Space with APFD measured relative to mutants
        - Same technique as the other space study, only using 132,163 mutant versions of the software.
        - Additional FEP prioritization outperformed all others.
        - Branch and statement coverage are indistinguishable.
        - But each additional coverage technique always outperforms its total counterpart.
  36. Study 4 Groupings
  37. - Threats to Validity
        - Construct validity: are you measuring what you say you are measuring (and not something else)?
        - Internal validity: the ability to say that the causal relationship is true.
        - External validity: the ability to generalize results across the field.
  38. - Construct Validity
        - APFD is highly accurate, but it is not the only way to measure fault detection; one could also measure the percentage of the test suite that must be run before all errors are found.
        - APFD gives no value to later tests that detect the same error.
        - FEP-based calculations: other estimates may more accurately capture the probability of a test case finding a fault.
        - Effectiveness is measured without accounting for cost.
  39. - Internal Validity
        - Faults in the instrumentation, especially the APFD and prioritization measurement tools, could bias results.
        - Code reviews were performed to limit this threat.
        - Problems were further limited by running every prioritization algorithm on each test suite and each subject program.
  40. - External Validity
        - The Siemens programs are non-trivial but not representative of real-world programs. The space program is, but it is only one program.
        - Faults in the Siemens programs were seeded (unlike those in the real world).
        - Faults in space were found during development, but these may differ from those found later in the development process; they are also only one set of faults found by one set of programmers.
        - Single-fault program versions are also not representative of the real world.
        - The test suites were created with only a single method; other real-world methods exist.
        - These threats can only be addressed by more studies with different test suites, programs, and faults.
  41. Additional Discussion and Practical Implications
  42. - Test case prioritization can substantially improve the rate of fault detection of test suites.
      - The FEP prioritization techniques do not always justify the additional expense they incur: in some cases specific coverage-based techniques outperformed them, and even where the FEP techniques did perform best, the gain in APFD was not always large enough.
      - Branch-coverage-based techniques almost always performed as well as, if not better than, statement-coverage-based techniques. Thus, if the two incur similar costs, branch-coverage-based techniques are advocated.
  43. - The total statement and branch coverage techniques perform almost on par with the additional statement and branch coverage techniques, which justifies their use given their lower complexity.
        - However, this does not hold for the space program (Study 4), where the additional branch and statement coverage techniques outperformed their total counterparts by a wide margin.
      - Randomly prioritized test suites typically outperform untreated test suites.
  44. Conclusion
  45. - Any one of the prioritization techniques offers some improvement in fault detection capability.
      - The FEP-based techniques are currently of interest mainly to research groups, due to the high expense they incur; the code-coverage-based techniques, however, have immediate practical implications.
  46. Future Work
  47. - Additional studies should be performed using a wider range of programs, faults, and test suites.
      - The gap between the FEP prioritization techniques and optimal prioritization is yet to be bridged.
      - It remains to determine which prioritization technique is warranted by particular types of programs and test suites.
      - Other prioritization objectives should be investigated:
        - version-specific techniques;
        - applying the techniques not only to regression testing but also during the initial testing of the software.