Software Unit Test Coverage and Adequacy
HONG ZHU, Nanjing University
PATRICK A. V. HALL AND JOHN H. R. MAY, The Open University, Milton Keynes, UK

Objective measurement of test quality is one of the key issues in software testing. It has been a major research focus for the last two decades. Many test criteria have been proposed and studied for this purpose. Various kinds of rationales have been presented in support of one criterion or another. We survey the research work in this area. The notion of adequacy criteria is examined together with its role in software dynamic testing. A review of criteria classification is followed by a summary of the methods for comparison and assessment of criteria.

Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging

General Terms: Measurement, Performance, Reliability, Verification

Additional Key Words and Phrases: Comparing testing effectiveness, fault-detection, software unit test, test adequacy criteria, test coverage, testing methods

Authors' addresses: H. Zhu, Institute of Computer Software, Nanjing University, Nanjing, 210093, P.R. of China; email: hzhu@netra.nju.edu.cn; P.A.V. Hall and J.H.R. May, Department of Computing, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK.

© 1997 ACM 0360-0300/97/1200-0366 $03.50

1. INTRODUCTION

In 1972, Dijkstra claimed that "program testing can be used to show the presence of bugs, but never their absence" to persuade us that a testing approach is not acceptable [Dijkstra 1972]. However, the last two decades have seen rapid growth of research in software testing as well as intensive practice and experiments. It has been developed into a validation and verification technique indispensable to software engineering discipline. Then, where are we today? What can we claim about software testing?

In the mid-'70s, in an examination of the capability of testing for demonstrating the absence of errors in a program, Goodenough and Gerhart [1975, 1977] made an early breakthrough in research on software testing by pointing out that the central question of software testing is "what is a test criterion?", that is, the criterion that defines what constitutes an adequate test. Since then, test criteria have been a major research focus. A great number of such criteria have been proposed and investigated. Considerable research effort has attempted to provide support for the use of one criterion or another. How should we understand these different criteria? What are the future directions for the subject?

In contrast to the constant attention given to test adequacy criteria by
academics, the software industry has been slow to accept test adequacy measurement. Few software development standards require or even recommend the use of test adequacy criteria [Wichmann 1993; Wichmann and Cox 1992]. Are test adequacy criteria worth the cost for practical use?

Addressing these questions, we survey research on software test criteria in the past two decades and attempt to put it into a uniform framework.

1.1 The Notion of Test Adequacy

Let us start with some examples. Here we seek to illustrate the basic notions underlying adequacy criteria. Precise definitions will be given later.

—Statement coverage. In software testing practice, testers are often required to generate test cases to execute every statement in the program at least once. A test case is an input on which the program under test is executed during testing. A test set is a set of test cases for testing a program. The requirement of executing all the statements in the program under test is an adequacy criterion. A test set that satisfies this requirement is considered to be adequate according to the statement coverage criterion. Sometimes the percentage of executed statements is calculated to indicate how adequately the testing has been performed. The percentage of the statements exercised by testing is a measurement of the adequacy.

—Branch coverage. Similarly, the branch coverage criterion requires that all the control transfers in the program under test are exercised during testing. The percentage of the control transfers executed during testing is a measurement of test adequacy.

—Path coverage. The path coverage criterion requires that all the execution paths from the program's entry to its exit are executed during testing.

—Mutation adequacy. Software testing is often aimed at detecting faults in software. A way to measure how well this objective has been achieved is to plant some artificial faults into the program and check if they are detected by the test. A program with a planted fault is called a mutant of the original program. If a mutant and the original program produce different outputs on at least one test case, the fault is detected. In this case, we say that the mutant is dead or killed by the test set. Otherwise, the mutant is still alive. The percentage of dead mutants compared to the mutants that are not equivalent to the original program is an adequacy measurement, called the mutation score or mutation adequacy [Budd et al. 1978; DeMillo et al. 1978; Hamlet 1977].
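The mutation score just described is easy to compute once every mutant has been run against the test set. The following sketch (Python, not from the paper; the run function, the mutant/original pairing, and the set of known-equivalent mutants are assumptions made for illustration) shows the bookkeeping:

    def mutation_score(mutants, test_set, run, equivalent=frozenset()):
        """Mutation adequacy: fraction of non-equivalent mutants killed by the test set.

        mutants   -- iterable of (mutant_program, original_program) pairs
        run       -- run(program, test_case) -> observable output (assumed hook)
        equivalent -- mutants known to be equivalent to the original program
        """
        scored = killed = 0
        for mutant, original in mutants:
            if mutant in equivalent:
                continue                  # equivalent mutants cannot be killed; exclude them
            scored += 1
            # a mutant is killed if some test case distinguishes it from the original
            if any(run(mutant, t) != run(original, t) for t in test_set):
                killed += 1
        return killed / scored if scored else 1.0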
From Goodenough and Gerhart's [1975, 1977] point of view, a software test adequacy criterion is a predicate that defines "what properties of a program must be exercised to constitute a 'thorough' test, i.e., one whose successful execution implies no errors in a tested program." To guarantee the correctness of adequately tested programs, they proposed reliability and validity requirements of test criteria. Reliability requires that a test criterion always produce consistent test results; that is, if the program tested successfully on one test set that satisfies the criterion, then the program also tests successfully on all test sets that satisfy the criterion. Validity requires that the test always produce a meaningful result; that is, for every error in a program, there exists a test set that satisfies the criterion and is capable of revealing the error. But it was soon recognized that there is no computable criterion that satisfies the two requirements, and hence they are not practically applicable [Howden 1976]. Moreover, these two requirements are not independent since a criterion is either reliable or valid for any given software [Weyuker and Ostrand 1980]. Since then, the focus of research seems to have shifted from seeking theoretically ideal criteria to the search for practically applicable approximations.

Currently, the software testing literature contains two different, but closely related, notions associated with the term test data adequacy criteria. First, an adequacy criterion is considered to be a stopping rule that determines whether sufficient testing has been done so that it can be stopped. For instance, when using the statement coverage criterion, we can stop testing if all the statements of the program have been executed. Generally speaking, since software testing involves the program under test, the set of test cases, and the specification of the software, an adequacy criterion can be formalized as a function C that takes a program p, a specification s, and a test set t and gives a truth value true or false. Formally, let P be a set of programs, S be a set of specifications, D be the set of inputs of the programs in P, and T be the class of test sets, that is, T ⊆ 2^D, where 2^X denotes the set of subsets of X.

Definition 1.1 (Test Data Adequacy Criteria as Stopping Rules). A test data adequacy criterion C is a function C: P × S × T → {true, false}. C(p, s, t) = true means that t is adequate for testing program p against specification s according to the criterion C; otherwise t is inadequate.

Second, test data adequacy criteria provide measurements of test quality when a degree of adequacy is associated with each test set so that it is not simply classified as good or bad. In practice, the percentage of code coverage is often used as an adequacy measurement. Thus, an adequacy criterion C can be formally defined to be a function C from a program p, a specification s, and a test set t to a real number r = C(p, s, t), the degree of adequacy [Zhu and Hall 1992]. Formally:

Definition 1.2 (Test Data Adequacy Criteria as Measurements). A test data adequacy criterion is a function C, C: P × S × T → [0, 1]. C(p, s, t) = r means that the adequacy of testing the program p by the test set t with respect to the specification s is of degree r according to the criterion C. The greater the real number r, the more adequate the testing.

These two notions of test data adequacy criteria are closely related to one another. A stopping rule is a special case of measurement on the continuum since the actual range of measurement results is the set {0, 1}, where 0 means false and 1 means true. On the other hand, given an adequacy measurement M and a degree r of adequacy, one can always construct a stopping rule M_r such that a test set is adequate if and only if the adequacy degree is greater than or equal to r; that is, M_r(p, s, t) = true if and only if M(p, s, t) ≥ r. Since a stopping rule asserts a test set to be either adequate or inadequate, it is also called a predicate rule in the literature.
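The construction of M_r from a measurement M is mechanical. A minimal sketch in Python, under stated assumptions (the statement-coverage measurement shown is only an illustrative stand-in for a criterion C, and the statements_executed tracing hook is an invented name, not an API the paper defines):

    def stopping_rule(measure, r):
        """Build the stopping rule M_r from a measurement M: adequate iff M(p, s, t) >= r."""
        def m_r(program, spec, test_set):
            return measure(program, spec, test_set) >= r
        return m_r

    def statement_coverage(program, spec, test_set):
        """A measurement-style criterion with values in [0, 1]."""
        executed = set()
        for t in test_set:
            executed |= program.statements_executed(t)   # assumed tracing hook
        return len(executed) / len(program.statements)

    # A concrete stopping rule: stop once 85% of the statements have been executed.
    adequate_at_85 = stopping_rule(statement_coverage, 0.85)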
An adequacy criterion is an essential part of any testing method. It plays two fundamental roles. First, an adequacy criterion specifies a particular software testing requirement, and hence determines test cases to satisfy the requirement. It can be defined in one of the following forms.

(1) It can be an explicit specification for test case selection, such as a set of guidelines for the selection of test cases. Following such rules one can produce a set of test cases, although there may be some form of random selection. Such a rule is usually referred to as a test case selection criterion. Using a test case selection criterion, a testing method may be defined constructively in the form of an algorithm which generates a test set from the software under test and its own specification. This test set is then considered adequate. It should be noticed that for a given test case selection criterion, there may exist a number of test case generation algorithms. Such an algorithm may also involve random sampling among many adequate test sets.

(2) It can also be in the form of specifying how to decide whether a given test set is adequate or specifying how to measure the adequacy of a test set. A rule that determines whether a test set is adequate (or more generally, how adequate) is usually referred to as a test data adequacy criterion.

However, the fundamental concept underlying both test case selection criteria and test data adequacy criteria is the same, that is, the notion of test adequacy. In many cases they can be easily transformed from one form to another. Mathematically speaking, test case selection criteria are generators, that is, functions that produce a class of test sets from the program under test and the specification (see Definition 1.3). Any test set in this class is adequate, so that we can use any of them equally.¹ Test data adequacy criteria are acceptors, that is, functions from the program under test, the specification of the software and the test set to a characteristic number as defined in Definition 1.1. Generators and acceptors are mathematically equivalent in the sense of one-one correspondence. Hence, we use "test adequacy criteria" to denote both of them.

(¹ Test data selection criteria as generators should not be confused with test case generation tools, which may only generate one test set.)

Definition 1.3 (Test Data Adequacy Criteria as Generators [Budd and Angluin 1982]). A test data adequacy criterion C is a function C: P × S → 2^T. A test set t ∈ C(p, s) means that t satisfies C with respect to p and s, and it is said that t is adequate for (p, s) according to C.

The second role that an adequacy criterion plays is to determine the observations that should be made during the testing process. For example, statement coverage requires that the tester, or the testing system, observe whether each statement is executed during the process of software testing. If path coverage is used, then the observation of whether statements have been executed is insufficient; execution paths should be observed and recorded. However, if mutation score is used, it is unnecessary to observe whether a statement is executed during testing. Instead, the output of the original program and the output of the mutants need to be recorded and compared.

Although, given an adequacy criterion, different methods could be developed to generate test sets automatically or to select test cases systematically and efficiently, the main features of a testing method are largely determined by the adequacy criterion. For example, as we show later, the adequacy criterion is related to fault-detecting ability, the dependability of the program that passes a successful test and the number of test cases required. Unfortunately, the exact relationship between a particular adequacy criterion and the correctness or reliability of the software that passes the test remains unclear.

Due to the central role that adequacy criteria play in software testing, software testing methods are often compared in terms of the underlying adequacy criteria. Therefore, subsequently, we use the name of an adequacy criterion as a synonym of the corresponding testing method when there is no possibility of confusion.

1.2 The Uses of Test Adequacy Criteria

An important issue in the management of software testing is to "ensure that before any testing the objectives of that testing are known and agreed and that the objectives are set in terms that can be measured." Such objectives "should be quantified, reasonable and achievable" [Ould and Unwin 1986]. Almost all test adequacy criteria proposed in the literature explicitly specify particular requirements on software testing. They are objective rules applicable by software project managers for this purpose. For example, branch coverage is a
  • 5. 370 • Zhu et al. test requirement that all branches of Generally speaking, there are two basic the program should be exercised. The aspects of software dependability as- objective of testing is to satisfy this sessment. One is the dependability esti- requirement. The degree to which this mation itself, such as a reliability fig- objective is achieved can be measured ure. The other is the confidence in quantitatively by the percentage of estimation, such as the confidence or branches exercised. The mutation ade- the accuracy of the reliability estimate. quacy criterion specifies the testing re- The role of test adequacy here is a con- quirement that a test set should be able tributory factor in building confidence to rule out a particular set of software in the integrity estimate. Recent re- faults, that is, those represented by mu- search has shown some positive results tants. Mutation score is another kind of with respect to this role [Tsoukalas quantitative measurement of test qual- 1993]. ity. Although it is common in current soft- Test data adequacy criteria are also ware testing practice that the test pro- very helpful tools for software testers. cesses at both the higher and lower There are two levels of software testing levels stop when money or time runs processes. At the lower level, testing is out, there is a tendency towards the use a process where a program is tested by of systematic testing methods with the feeding more and more test cases to it. application of test adequacy criteria. Here, a test adequacy criterion can be used as a stopping rule to decide when 1.3 Categories of Test Data Adequacy this process can stop. Once the mea- Criteria surement of test adequacy indicates that the test objectives have been There are various ways to classify ade- achieved, then no further test case is quacy criteria. One of the most common needed. Otherwise, when the measure- is by the source of information used to ment of test adequacy shows that a test specify testing requirements and in the has not achieved the objectives, more measurement of test adequacy. Hence, tests must be made. In this case, the an adequacy criterion can be: adequacy criterion also provides a —specification-based, which specifies guideline for the selection of the addi- the required testing in terms of iden- tional test cases. In this way, adequacy tified features of the specification or criteria help testers to manage the soft- the requirements of the software, so ware testing process so that software that a test set is adequate if all the quality is ensured by performing suffi- identified features have been fully ex- cient tests. At the same time, the cost of ercised. In software testing literature testing is controlled by avoiding redun- it is fairly common that no distinction dant and unnecessary tests. This role of is made between specification and re- adequacy criteria has been considered quirements. This tradition is followed by some computer scientists [Weyuker in this article also; 1986] to be one of the most important. —program-based, which specifies test- At a higher level, the testing proce- ing requirements in terms of the pro- dure can be considered as repeated cy- gram under test and decides if a test cles of testing, debugging, modifying set is adequate according to whether program code, and then testing again. the program has been thoroughly ex- Ideally, this process should stop only ercised. when the software has met the required reliability requirements. 
Although test It should not be forgotten that for both data adequacy criteria do not play the specification-based and program-based role of stopping rules at this level, they testing, the correctness of program out- make an important contribution to the puts must be checked against the speci- assessment of software dependability. fication or the requirements. However, ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 6. Test Coverage and Adequacy • 371 in both cases, the measurement of test In the software testing literature, peo- adequacy does not depend on the results ple often talk about white-box testing of this checking. Also, the definition of and black-box testing. Black-box testing specification-based criteria given previ- treats the program under test as a ously does not presume the existence of “black box.” No knowledge about the a formal specification. implementation is assumed. In white- It has been widely acknowledged that box testing, the tester has access to the software testing should use information details of the program under test and from both specification and program. performs the testing according to such Combining these two approaches, we details. Therefore, specification-based have: criteria and interface-based criteria be- long to black-box testing. Program- —combined specification- and program- based criteria and combined specifica- based criteria, which use the ideas of tion and program-based criteria belong both program-based and specification- to white-box testing. based criteria. Another classification of test ade- quacy criteria is by the underlying test- There are also test adequacy criteria ing approach. There are three basic ap- that specify testing requirements with- proaches to software testing: out employing any internal information from the specification or the program. (1) structural testing: specifies testing For example, test adequacy can be mea- requirements in terms of the cover- sured according to the prospective us- age of a particular set of elements in age of the software by considering the structure of the program or the whether the test cases cover the data specification; that are most likely to be frequently (2) fault-based testing: focuses on de- used as input in the operation of the tecting faults (i.e., defects) in the software. Although few criteria are ex- software. An adequacy criterion of plicitly proposed in such a way, select- this approach is some measurement ing test cases according to the usage of of the fault detecting ability of test the software is the idea underlying ran- sets.2 dom testing, or statistical testing. In (3) error-based testing: requires test random testing, test cases are sampled cases to check the program on cer- at random according to a probability tain error-prone points according to distribution over the input space. Such our knowledge about how programs a distribution can be the one represent- typically depart from their specifica- ing the operation of the software, and tions. the random testing is called representa- The source of information used in the tive. It can also be any probability dis- adequacy measurement and the under- tribution, such as a uniform distribu- lying approach to testing can be consid- tion, and the random testing is called ered as two dimensions of the space of nonrepresentative. Generally speaking, software test adequacy criteria. A soft- if a criterion employs only the “inter- ware test adequacy criterion can be face” information—the type and valid classified by these two aspects. The re- range for the software input—it can be view of adequacy criteria is organized called an interface-based criterion: according to the structure of this space. 
—interface-based criteria, which specify testing requirements only in terms of the type and range of software input 2 We use the word fault to denote defects in soft- without reference to any internal fea- ware and the word error to denote defects in the tures of the specification or the pro- outputs produced by a program. An execution that gram. produces an error is called a failure. ACM Computing Surveys, Vol. 29, No. 4, December 1997
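Representative random testing, as described above, simply means drawing test inputs from a probability distribution that models how the software will actually be used. A minimal sketch in Python (the operational profile shown is invented purely for illustration):

    import random

    def representative_random_tests(operational_profile, n):
        """Sample n test inputs according to an operational profile.

        operational_profile -- list of (input_value, probability) pairs summing to 1
        """
        values = [v for v, _ in operational_profile]
        weights = [p for _, p in operational_profile]
        return random.choices(values, weights=weights, k=n)

    # Hypothetical profile: small inputs dominate actual usage of the software.
    profile = [((4, 2), 0.5), ((9, 6), 0.3), ((1000003, 17), 0.2)]
    tests = representative_random_tests(profile, 100)
    # Sampling from a uniform distribution over the whole input domain instead would
    # be nonrepresentative random testing in the paper's terminology.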
1.4 Organization of the Article

The remainder of the article consists of two main parts. The first part surveys various types of test data adequacy criteria proposed in the literature. It includes three sections devoted to structural testing, fault-based testing, and error-based testing. Each section consists of several subsections covering the principles of the testing method and their application to program-based and specification-based test criteria. The second part is devoted to the rationale presented in the literature in support of the various criteria. It has two sections. Section 5 discusses the methods of comparing adequacy criteria and surveys the research results in the literature. Section 6 discusses the axiomatic study and assessment of adequacy criteria. Finally, Section 7 concludes the paper.

2. STRUCTURAL TESTING

This section is devoted to adequacy criteria for structural testing. It consists of two subsections, one for program-based criteria and the other for specification-based criteria.

2.1 Program-Based Structural Testing

There are two main groups of program-based structural test adequacy criteria: control-flow criteria and data-flow criteria. These two types of adequacy criteria are combined and extended to give dependence coverage criteria. Most adequacy criteria of these two groups are based on the flow-graph model of program structure. However, a few control-flow criteria define test requirements in terms of program text rather than using an abstract model of software structure.

2.1.1 Control Flow Adequacy Criteria. Before we formally define various control-flow-based adequacy criteria, we first give an introduction to the flow graph model of program structure.

A. The flow graph model of program structure. The control flow graph stems from compiler work and has long been used as a model of program structure. It is widely used in static analysis of software [Fenton et al. 1985; Kosaraju 1974; McCabe 1976; Paige 1975]. It has also been used to define and study program-based structural test adequacy criteria [White 1981]. In this section we give a brief introduction to the flow-graph model of program structure. Although we use graph-theory terminology in the following discussion, readers are required to have only a preliminary knowledge of graph theory. To help understand the terminology and to avoid confusion, a glossary is provided in the Appendix.

A flow graph is a directed graph that consists of a set N of nodes and a set E ⊆ N × N of directed edges between nodes. Each node represents a linear sequence of computations. Each edge, representing a transfer of control, is an ordered pair (n1, n2) of nodes, and is associated with a predicate that represents the condition of control transfer from node n1 to node n2. In a flow graph, there is a begin node and an end node where the computation starts and finishes, respectively. The begin node has no inward edges and the end node has no outward edges. Every node in a flow graph must be on a path from the begin node to the end node. Figure 1 is an example of a flow graph.

Example 2.1 The following program computes the greatest common divisor of two natural numbers by Euclid's algorithm. Figure 1 is the corresponding flow graph.

    begin
      input (x, y);
      while (x > 0 and y > 0) do
        if (x > y) then x := x - y
        else y := y - x
        endif
      endwhile;
      output (x + y);
    end
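For readers who want to trace Example 2.1 concretely, here is a direct transliteration into Python together with one plausible edge list for its flow graph. The node names and the exact block boundaries are assumptions made for illustration, since Figure 1 itself is not reproduced in this transcript.

    def gcd(x, y):
        # Euclid's algorithm by repeated subtraction, as in Example 2.1
        while x > 0 and y > 0:
            if x > y:
                x = x - y
            else:
                y = y - x
        return x + y      # one of x and y is 0 here, so x + y is the greatest common divisor

    # One possible flow-graph encoding: nodes are blocks, edges carry the predicates
    # that guard the control transfers (None for unconditional transfers).
    FLOW_GRAPH_EDGES = {
        ("begin", "loop-test"): None,
        ("loop-test", "if-test"): "x > 0 and y > 0",
        ("loop-test", "output"): "not (x > 0 and y > 0)",
        ("if-test", "then"): "x > y",
        ("if-test", "else"): "not (x > y)",
        ("then", "loop-test"): None,
        ("else", "loop-test"): None,
        ("output", "end"): None,
    }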
It should be noted that in the literature there are a number of conventions of flow-graph models with subtle differ-
  • 8. Test Coverage and Adequacy • 373 Figure 1. Flow graph for program in Example 2.1. ences, such as whether a node is al- to another is represented by a directed lowed to be associated with an empty edge between the nodes such that the sequence of statements, the number of condition of the control transfer is asso- outward edges allowed for a node, and ciated with it. the number of end nodes allowed in a B. Control-flow adequacy criteria. flow graph, and the like. Although most Now, given a flow-graph model of a pro- adequacy criteria can be defined inde- gram and a set of test cases, how do we pendently of such conventions, using measure the adequacy of testing for the different ones may result in different program on the test set? First of all, measures of test adequacy. Moreover, recall that the execution of the program testing tools may be sensitive to such on an input datum is modeled as a conventions. In this article no restric- traverse in the flow graph. Every execu- tions on the conventions are made. tion corresponds to a path in the flow For programs written in a procedural graph from the begin node to the end programming language, flow-graph node. Such a path is called a complete models can be generated automatically. computation path, or simply a computa- Figure 2 gives the correspondences be- tion path or an execution path in soft- tween some structured statements and ware testing literature. their flow-graph structures. Using these A very basic requirement of adequate rules, a flow graph, shown in Figure 3, testing is that all the statements in the can be derived from the program given program are covered by test executions. in Example 2.1. Generally, to construct This is usually called statement cover- a flow graph for a given program, the age [Hetzel 1984]. But full statement program code is decomposed into a set coverage cannot always be achieved be- of disjoint blocks of linear sequences of cause of the possible existence of infea- statements. A block has the property sible statements, that is, dead code. that whenever the first statement of the Whether a piece of code is dead code is block is executed, the other statements undecidable [Weyuker 1979a; Weyuker are executed in the given order. Fur- 1979b; White 1981]. Because state- thermore, the first statement of the ments correspond to nodes in flow- block is the only statement that may be graph models, this criterion can be de- executed directly after the execution of fined in terms of flow graphs, as follows. a statement in another block. Each block corresponds to a node in the flow Definition 2.1 (Statement Coverage graph. A control transfer from one block Criterion). A set P of execution paths ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 9. 374 • Zhu et al. Figure 2. Example flow graphs for structured statements. may be missed from an adequate test. Hence, we have a slightly stronger re- quirement of adequate test, called branch coverage [Hetzel 1984], that all control transfers must be checked. Since control transfers correspond to edges in flow graphs, the branch coverage crite- rion can be defined as the coverage of all edges in the flow graph. Definition 2.2 (Branch Coverage Crite- rion). A set P of execution paths satis- fies the branch coverage criterion if and only if for all edges e in the flow graph, there is at least one path p in P such that p contains the edge e. Figure 3. Flow graph for Example 2.1. Branch coverage is stronger than statement coverage because if all edges in a flow graph are covered, all nodes are necessarily covered. Therefore, a satisfies the statement coverage crite- test set that satisfies the branch cover- rion if and only if for all nodes n in the age criterion must also satisfy state- flow graph, there is at least one path p ment coverage. Such a relationship be- in P such that node n is on the path p. tween adequacy criteria is called the Notice that statement coverage is so subsumes relation. It is of interest in weak that even some control transfers the comparison of software test ade- ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 10. Test Coverage and Adequacy • 375 quacy criteria (see details in Section feasible elements. Most program-based 5.1.3). adequacy criteria in the literature are However, even if all branches are ex- not finitely applicable, but finitely ap- ercised, this does not mean that all com- plicable versions can often be obtained binations of control transfers are by redefinition in this way. Subse- checked. The requirement of checking quently, such a version is called the all combinations of branches is usually feasible version of the adequacy crite- called path coverage or path testing, rion. It should be noted, first, that al- which can be defined as follows. though we can often obtain finite appli- Definition 2.3 (Path Coverage Crite- cability by using the feasible version, rion). A set P of execution paths satis- this may cause the undecidability prob- fies the path coverage criterion if and lem; that is, we may not be able to only if P contains all execution paths decide whether a test set satisfies a from the begin node to the end node in given adequacy criterion. For example, the flow graph. whether a statement in a program is feasible is undecidable [Weyuker 1979a; Although the path coverage criterion Weyuker 1979b; White 1991]. There- still cannot guarantee the correctness of fore, when a test set does not cover all a tested program, it is too strong to be the statements in a program, we may practically useful for most programs, not be able to decide whether a state- because there can be an infinite number ment not covered by the test data is of different paths in a program with dead code. Hence, we may not be able to loops. In such a case, an infinite set of decide if the test set satisfies the feasi- test data must be executed for adequate ble version of statement coverage. Sec- testing. This means that the testing ond, for some adequacy criteria, such as cannot finish in a finite period of time. path coverage, we cannot obtain finite But, in practice, software testing must applicability by such a redefinition. be fulfilled within a limited fixed period Recall that the rationale for path cov- of time. Therefore, a test set must be erage is that there is no path that does finite. The requirement that an ade- not need to be checked by testing, while quacy criterion can always be satisfied finite applicability forces us to select a by a finite test set is called finite appli- finite subset of paths. Thus, research cability [Zhu and Hall 1993] (see Sec- into flow-graph-based adequacy criteria tion 6). has focused on the selection of the most The statement coverage criterion and important subsets of paths. Probably branch coverage criterion are not fi- the most straightforward solution to the nitely applicable either, because they conflict is to select paths that contain no require testing to cover infeasible ele- redundant information. Hence, two no- ments. For instance, statement cover- tions from graph theory can be used. age requires that all the statements in a First, a path that has no repeated occur- program are executed. However, a pro- rence of any edge is called a simple path gram may have infeasible statements, in graph theory. Second, a path that has that is, dead code, so that no input data no repeated occurrences of any node is can cause their execution. Therefore, in called an elementary path. Thus, it is such cases, there is no adequate test set possible to define simple path coverage that can satisfy statement coverage. 
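Statement coverage and branch coverage can both be evaluated mechanically once the nodes and edges traversed by each test execution have been recorded. A small sketch in Python (illustrative only; the flow-graph data and the recorded paths are assumed inputs, not something the paper prescribes):

    def node_coverage(nodes, executed_paths):
        """Statement coverage: fraction of flow-graph nodes lying on some executed path."""
        covered = {n for path in executed_paths for n in path}
        return len(covered & set(nodes)) / len(nodes)

    def edge_coverage(edges, executed_paths):
        """Branch coverage: fraction of flow-graph edges contained in some executed path."""
        covered = {e for path in executed_paths for e in zip(path, path[1:])}
        return len(covered & set(edges)) / len(edges)

    # Any path set achieving full edge coverage also achieves full node coverage,
    # which is the subsumes relation between branch and statement coverage noted above.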
and elementary path coverage criteria, Similarly, branch coverage is not fi- which require that adequate test sets nitely applicable because a program should cover all simple paths and ele- may contain infeasible branches. How- mentary paths, respectively. ever, for statement coverage and also These two criteria are typical ones branch coverage, we can define a fi- that select finite subsets of paths by nitely applicable version of the criterion specifying restrictions on the complexity by requiring testing only to cover the of the individual paths. Another exam- ACM Computing Surveys, Vol. 29, No. 4, December 1997
ple of this type is the length-n path coverage criterion, which requires coverage of all subpaths of length less than or equal to n [Gourlay 1983]. A more complicated example of the type is Paige's level-i path coverage criterion [Paige 1978; Paige 1975]. Informally, the criterion starts with testing all elementary paths from the begin node to the end node. Then, if there is an elementary subpath or cycle that has not been exercised, the subpath is required to be checked at the next level. This process is repeated until all nodes and edges are covered by testing. Obviously, a test set that satisfies the level-i path coverage criterion must also satisfy the elementary path coverage criterion, because elementary paths are level-0 paths.

A set of control-flow adequacy criteria that are concerned with testing loops is loop count criteria, which date back to the mid-1970s [Bently et al. 1993]. For any given natural number K, the loop count-K criterion requires that every loop in the program under test should be executed zero times, once, twice, and so on, up to K times [Howden 1975]. Another control-flow criterion concerned with testing loops is the cycle combination criterion, which requires that an adequate test set should cover all execution paths that do not contain a cycle more than once.

An alternative approach to defining control-flow adequacy criteria is to specify restrictions on the redundancy among the paths. McCabe's cyclomatic measurement is such an example [McCabe 1976; McCabe 1983; McCabe and Schulmeyer 1985]. It is based on the theorem of graph theory that for any flow graph there is a set of execution paths such that every execution path can be expressed as a linear combination of them. A set of paths is independent if none of them is a linear combination of the others. According to McCabe, a path should be tested if it is independent of the paths that have been tested. On the other hand, if a path is a linear combination of tested paths, it can be considered redundant. According to graph theory, the maximal size of a set of independent paths is unique for any given graph and is called the cyclomatic number, and can be easily calculated by the following formula:

    v(G) = e - n + p,

where v(G) denotes the cyclomatic number of the graph G, n is the number of vertices in G, e is the number of edges, and p is the number of strongly connected components.³ The adequacy criterion is then defined as follows.

(³ A graph is strongly connected if, for any two nodes a and b, there exists a path from a to b and a path from b to a. Strongly connected components are maximal strongly connected subgraphs.)

Definition 2.4 (Cyclomatic-Number Criterion). A set P of execution paths satisfies the cyclomatic number criterion if and only if P contains at least one set of v independent paths, where v = e - n + p is the cyclomatic number of the flow graph.

McCabe also gave an algorithm to generate a set of independent paths from any given flow graph [McCabe 1976]. Paige [1978] has shown that the level-i path coverage criterion subsumes McCabe's cyclomatic number criterion.

The preceding control-flow test adequacy criteria are all defined in terms of flow-graph models of program structure except loop count criteria, which are defined in terms of program text. A number of other test adequacy criteria are based on the text of the program. One of the most popular criteria in software testing practice is the so-called multiple condition coverage discussed in Myers' [1979] classic book, which has proved popular in commercial software testing practice. The criterion focuses on the conditions of control transfers, such as the condition in an IF-statement or a WHILE-LOOP statement. A test set is said to satisfy the decision coverage criterion if for every condition there is at least one test case such that the condition has value true when evaluated, and there is also at least one test case such that the condition has value false.
In a high-level programming language, a condition can be a Boolean expression consisting of several atomic predicates combined by logic connectives like and, or, and not. A test set satisfies the condition coverage criterion if for every atomic predicate there is at least one test case such that the predicate has value true when evaluated, and there is also at least one test case such that the predicate has value false. Although the result value of an evaluation of a Boolean expression can only be one of two possibilities, true or false, the result may be due to different combinations of the truth values of the atomic predicates. The multiple condition coverage criterion requires that a test set should cover all possible combinations of the truth values of atomic predicates in every condition. This criterion is sometimes also called extended branch coverage in the literature. Formally:

Definition 2.5 (Multiple Condition Coverage). A test set T is said to be adequate according to the multiple-condition-coverage criterion if, for every condition C, which consists of atomic predicates (p1, p2, . . . , pn), and all the possible combinations (b1, b2, . . . , bn) of their truth values, there is at least one test case in T such that the value of pi equals bi, for i = 1, 2, . . . , n.
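Definition 2.5 asks for every combination of atomic-predicate truth values to be exercised, which makes the corresponding measurement straightforward. A small sketch in Python (illustrative; the predicate functions and the test inputs are invented, and a real tool would extract the atomic predicates from the program text rather than take them as lambdas):

    from itertools import product

    def multiple_condition_coverage(atomic_predicates, test_set):
        """Fraction of truth-value combinations of the atomic predicates
        produced by at least one test case."""
        wanted = set(product([True, False], repeat=len(atomic_predicates)))
        seen = {tuple(p(t) for p in atomic_predicates) for t in test_set}
        return len(seen & wanted) / len(wanted)

    # The loop guard of Example 2.1, "x > 0 and y > 0", has two atomic predicates.
    predicates = [lambda t: t[0] > 0, lambda t: t[1] > 0]
    tests = [(4, 2), (0, 5), (3, 0), (0, 0)]
    print(multiple_condition_coverage(predicates, tests))   # 1.0: all four combinations hit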
Woodward et al. [1980] proposed and studied a hierarchy of program-text-based test data adequacy criteria based on a class of program units called linear code sequence and jump (LCSAJ). These criteria are usually referred to as test effectiveness metrics in the literature. An LCSAJ consists of a body of code through which the flow of control may proceed sequentially and which is terminated by a jump in the control flow. The hierarchy TERi, i = 1, 2, . . . , n, . . . of criteria starts with statement coverage as the lowest level, followed by branch coverage as the next lowest level. They are denoted by TER1 and TER2, respectively, where TER represents test effectiveness ratio. The coverage of LCSAJs is the third level, which is defined as TER3. The hierarchy is then extended to the coverage of program paths containing a number of LCSAJs.

Definition 2.6 (TER3: LCSAJ Coverage)

    TER3 = (number of LCSAJs exercised at least once) / (total number of LCSAJs).

Generally speaking, an advantage of text-based adequacy criteria is that test adequacy can be easily calculated from the part of program text executed during testing. However, their definitions may be sensitive to language details. For programs written in a structured programming language, the application of TERn for n greater than or equal to 3 requires analysis and reformatting of the program structure. In such cases, the connection between program text and its test adequacy becomes less straightforward. In fact, it is observed in software testing practice that a small modification to the program may result in a considerably different set of linear code sequences and jumps.

It should be noted that none of the adequacy criteria discussed in this section are applicable due to the possible existence of infeasible elements in a program, such as infeasible statements, infeasible branches, infeasible combinations of conditions, and the like. These criteria, except path coverage, can be redefined to obtain finite applicability by only requiring the coverage of feasible elements.

2.1.2 Data-Flow-Based Test Data Adequacy Criteria. In the previous section, we have seen how control-flow information in the program under test is used to specify testing requirements. In this section, data-flow information is taken into account in the definition of testing requirements. We first introduce the way that data-flow information is
  • 13. 378 • Zhu et al. added into the flow-graph models of pro- have been bound to x in some node gram structures. Then, three basic other than the one in which it is being groups of data-flow adequacy criteria used. Otherwise it is a local computa- are reviewed. Finally, their limitations tional use. and extensions are discussed. Data-flow test adequacy analysis is concerned with subpaths from defini- A. Data-flow information in flow tions to nodes where those definitions graph. Data-flow analysis of test ade- are used. A definition-clear path with quacy is concerned with the coverage of respect to a variable x is a path where flow-graph paths that are significant for for all nodes in the path there is no the data flow in the program. Therefore, definition occurrence of the variable x. data-flow information is introduced into A definition occurrence of a variable x the flow-graph models of program struc- at a node u reaches a computational use tures. occurrence of the variable at node v if Data-flow testing methods are based and only if there is a path p from u to v on the investigation of the ways in such that p (u, w 1 , w 2 , . . . , w n , v), which values are associated with vari- and (w 1 , w 2 , . . . , w n ) is definition-clear ables and how these associations can with respect to x, and the occurrence of effect the execution of the program. This x at v is a global computational use. We analysis focuses on the occurrences of say that the definition of x at u reaches variables within the program. Each the computational occurrence of x at v variable occurrence is classified as ei- through the path p. Similarly, if there is ther a definition occurrence or a use a path p (u, w 1 , w 2 , . . . , w n , v) from occurrence. A definition occurrence of a u to v, and (w 1 , w 2 , . . . , w n ) is defini- variable is where a value is bound to the tion-clear with respect to x, and there is variable. A use occurrence of a variable a predicate occurrence of x associated is where the value of the variable is with the edge from w n to v, we say that referred. Each use occurrence is further u reaches the predicate use of x on the classified as being a computational use edge (w n , v) through the path p. If a or a predicate use. If the value of a path in one of the preceding definitions variable is used to decide whether a is feasible, that is, there is at least one predicate is true for selecting execution input datum that can actually cause the paths, the occurrence is called a predi- execution of the path, we say that a cate use. Otherwise, it is used to com- definition feasibly reaches a use of the pute a value for defining other variables definition. or as an output value. It is then called a Three groups of data-flow adequacy computational use. For example, the as- criteria have been proposed in the liter- signment statement “y : x 1 x 2 ” con- ature, and are discussed in the follow- tains computational uses of x 1 and x 2 ing. and a definition of y. The statement “if x1 x 2 then goto L endif” contains B. Simple definition-use association predicate uses of x 1 and x 2 . coverage—the Rapps-Weyuker-Frankl Since we are interested in tracing the family. Rapps and Weyuker [1985] flow of data between nodes, any defini- proposed a family of testing adequacy tion that is used only within the node in criteria based on data-flow information. which the definition occurs is of little Their criteria are concerned mainly importance. 
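To make the occurrence classification above concrete, here is a toy tagging of the two statements used as examples in the text, "y := x1 + x2" and "if x1 > x2 then goto L" (Python; the data layout is an invention for illustration, not a tool or representation the paper describes):

    # Each occurrence is (variable, kind), where kind is one of
    # "def", "c-use" (computational use) or "p-use" (predicate use).
    occurrences = {
        "y := x1 + x2": [("x1", "c-use"), ("x2", "c-use"), ("y", "def")],
        "if x1 > x2 then goto L": [("x1", "p-use"), ("x2", "p-use")],
    }

    def uses_of(variable):
        """All use occurrences of a variable across the annotated statements."""
        return [(stmt, kind) for stmt, occ in occurrences.items()
                for var, kind in occ if var == variable and kind != "def"]

    print(uses_of("x1"))   # one c-use and one p-use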
Therefore a distinction is with the simplest type of data-flow made between local computational uses paths that start with a definition of a and global computational uses. A global variable and end with a use of the same computational use of a variable x is variable. Frankl and Weyuker [1988] where no definition of x precedes the later reexamined the data-flow ade- computational use within the node in quacy criteria and found that the origi- which it occurs. That is, the value must nal definitions of the criteria did not ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 14. Test Coverage and Adequacy • 379 satisfy the applicability condition. They uses are exercised, but it also requires redefined the criteria to be applicable. that at least one predicate use should be The following definitions come from the exercised when there is no computa- modified definitions. tional use of the variable. In contrast, The all-definitions criterion requires the all-p-uses/some-c-uses criterion puts that an adequate test set should cover emphasis on predicate uses by requiring all definition occurrences in the sense that test sets should exercise all predi- that, for each definition occurrence, the cate uses and exercise at least one testing paths should cover a path computational use when there is no through which the definition reaches a predicate use. Two even weaker criteria use of the definition. were also defined. The all-predicate- uses criterion completely ignores the Definition 2.7 (All Definitions Crite- computational uses and requires that rion). A set P of execution paths satis- only predicate uses need to be tested. fies the all-definitions criterion if and The all-computation-uses criterion only only if for all definition occurrences of a requires that computational uses should variable x such that there is a use of x be tested and ignores the predicate which is feasibly reachable from the def- uses. inition, there is at least one path p in P Notice that, given a definition occur- such that p includes a subpath through rence of a variable x and a use of the which the definition of x reaches some variable x that is reachable from that use occurrence of x. definition, there may exist many paths through which the definition reaches Since one definition occurrence of a the use. A weakness of the preceding variable may reach more than one use criteria is that they require only one of occurrence, the all-uses criterion re- such paths to be exercised by testing. quires that all of the uses should be However, the applicability problem exercised by testing. Obviously, this re- arises if all such paths are to be exer- quirement is stronger than the all-defi- cised because there may exist an infi- nition criterion. nite number of such paths in a flow Definition 2.8 (All Uses Criterion). A graph. For example, consider the flow set P of execution paths satisfies the graph in Figure 1, the definition of y at all-uses criterion if and only if for all node a1 reaches the use of y at node a3 definition occurrences of a variable x through all the paths in the form: and all use occurrences of x that the a1, a2 ∧ a2, a2 n∧ a2, a3 , n 1, definition feasibly reaches, there is at least one path p in P such that p in- where ∧ is the concatenation of paths, cludes a subpath through which that p n is the concatenation of p with itself definition reaches the use. for n times, which is inductively defined The all-uses criterion was also pro- to be p 1 p and p k p ∧ p k 1 , for all posed by Herman [1976], and called k 1. To obtain finite applicability, reach-coverage criterion. As discussed at Frankl and Weyuker [1988] and Clarke the beginning of the section, use occur- et al. [1989] restricted the paths to be rences are classified into computational cycle-free or only the end node of the use occurrences and predicate use oc- path to be the same as the start node. currences. Hence, emphasis can be put either on computational uses or on Definition 2.9 (All Definition-Use- predicate uses. Rapps and Weyuker Paths Criterion: Abbr. 
All DU-Paths [1985] identified four adequacy criteria Criterion). A set P of execution paths of different strengths and emphasis. satisfies the all-du-paths criterion if The all-c-uses/some-p-uses criterion re- and only if for all definitions of a vari- quires that all of the computational able x and all paths q through which ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 15. 380 • Zhu et al. that definition reaches a use of x, there action is a path p (n 1 ) p 1 (n 2 ) is at least one path p in P such that q is . . . (n k 1 ) p k 1 (n k ) such that for a subpath of p, and q is cycle-free or all i 1, 2, . . . , k 1, d i (x i ) reaches contains only simple cycles. u i (x i ) through p i . The required k-tuples criterion then requires that all k–dr However, even with this restriction, it interactions are tested. is still not applicable since such a path may be infeasible. Definition 2.11 (Required k-Tuples Criteria). A set P of execution paths C. Interactions between variables—the satisfies the required k-tuples criterion, Ntafos required K-tuples criteria. Ntafos k 1, if and only if for all j–dr interac- [1984] also used data-flow information tions L, 1 j k, there is at least one to analyze test data adequacy. He stud- path p in P such that p includes a ied how the values of different variables subpath which is an interaction path for interact, and defined a family of ade- L. quacy criteria called required k-tuples, where k 1 is a natural number. These Example 2.1 Consider the flow criteria require that a path set cover the graph in Figure 2. The following are chains of alternating definitions and 3–dr interaction paths. uses, called definition-reference interac- tions (abbr. k–dr interactions) in a1, a3, a2, a4 for the 3–dr Ntafos’ terminology. Each definition in interaction d 1 x , u 1 x , d 2 y , a k–dr interaction reaches the next use in the chain, which occurs at the same u 2 y , d 3 x , u 3 x ; and node as the next definition in the chain. Formally: a1, a2, a3, a4 for the 3–dr Definition 2.10 (k–dr interaction). interaction d 1 y , u 1 y , d 2 x , For k 1, a k–dr interaction is a se- quence K [d 1 (x 1 ), u 1 (x 1 ), d 2 (x 2 ), u2 x , d3 y , u3 y . u 2 (x 2 ), . . . , d k (x k ), u k (x k )] where D. Combinations of definitions—the (i) d i ( x i ), 1 i k, is a definition Laski-Korel criteria. Laski and Korel occurrence of the variable x i ; [1983] defined and studied another kind (ii) u i ( x i ), 1 i k, is a use occur- of testing path selection criteria based rence of the variable x i ; on data-flow analysis. They observed (iii) the use u i ( x i ) and the definition that a given node may contain uses of d i 1 ( x i ) are associated with the several different variables, where each same node n i 1 ; use may be reached by several defini- (iv) for all i, 1 i k, the ith defini- tions occurring at different nodes. Such tion d i ( x i ) reaches the ith use definitions constitute the context of the u i ( x i ). computation at the node. Therefore, they are concerned with exploring such Note that the variables x 1 , x 2 , . . . , x k contexts for each node by selecting and the nodes n 1 , n 2 , . . . , n k need not paths along which the various combina- be distinct. This definition comes from tions of definitions reach the node. Ntafos’ [1988] later work. It is different from the original definition where the Definition 2.12 (Ordered Context). nodes are required to be distinct [Ntafos Let n be a node in the flow graph. 1984]. The same modification was also Suppose that there are uses of the vari- made by Clark et al. [1989] in their ables x 1 , x 2 , . . . , x m at the node n.4 Let formal analysis of data flow adequacy criteria. 4 This assumption comes from Clark et al. [1989]. An interaction path for a k–dr inter- The original definition given by Laski and Korel ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 16. Test Coverage and Adequacy • 381 [n 1 , n 2 , . . . , n m ] be a sequence of nodes context {n 1 , n 2 , . . . , n m }. Ignoring the such that for all i 1, 2, . . . , m, there ordering between the nodes, a slightly is a definition of x i on node n i , and the weaker criterion, called the context-cov- definition of x i reaches the node n with erage criterion, requires that all con- respect to x i . A path p p 1 (n 1 ) p 2 texts for all nodes are covered. (n 2 ) . . . p m (n m ) p m 1 (n) is called an ordered context path for the Definition 2.14 (Context-Coverage Cri- node n with respect to the sequence [n 1 , terion). A set P of execution paths sat- n 2 , . . . , n m ] if and only if for all i 2, isfies the context coverage criterion if 3, . . . , m, the subpath p i (n i ) p i 1 and only if for all nodes n and for all ... p m 1 is definition clear with re- contexts for n, there is at least one path spect to x i 1 . In this case, we say that p in P such that p contains a subpath the sequence [n 1 , n 2 , . . . , n m ] of nodes which is a definition context path for n is an ordered context for n. with respect to the context. Example 2.2 Consider the flow E. Data-flow testing for structured graph in Figure 1. There are uses of the data and dynamic data. The data flow two variables x and y at node a4. The testing methods discussed so far have a node sequences [a1, a2], [a1, a3], [a2, number of limitations. First, they make a3], and [a3, a2] are ordered contexts no distinction between atomic data such for node a4. The paths (a1, a2, a4), as integers and structured or aggregate (a1, a3, a4), (a2, a3, a4), and (a3, a2, data such as arrays and records. Modifi- a4) are the ordered context paths for cations and references to an element of them, respectively. a structured datum are regarded as The ordered context coverage requires modifications and references to the that an adequate test set should cover whole datum. It was argued that treat- all ordered contexts for every mode. ing structured data, such as arrays, as aggregate values may lead to two types Definition 2.13 (Ordered-Context Cov- of mistakes [Hamlet et al. 1993]. A com- erage Criterion). A set P of execution mission mistake may happen when a paths satisfies the ordered-context cov- definition-use path is identified but it is erage criterion if and only if for all not present for any array elements. An nodes n and all ordered contexts c for n, omission mistake may happen when a there is at least one path p in P such path is missed because of a false inter- that p contains a subpath which is an mediate assignment. Such mistakes oc- ordered context path for n with respect cur frequently even in small programs to c. [Hamlet et al. 1993]. Treating elements Given a node n, let {x 1 , x 2 , . . . , x m } of structured data as independent data be a nonempty subset of the variables can correct the mistakes. Such an ex- that are used at the node n, the nodes tension seems to add no complexity n i, i 1, 2, . . . , m, have definition when the references to the elements of occurrences of the variables x i , that structured data are static, such as the reach the node n. If there is a permuta- fields of records. However, treating ar- tion of the nodes which is an ordered rays element-by-element may introduce context for n, then we say that the set a potential infinity of definition-use {n 1 , n 2 , . . . , n m } is a context for n, and paths to be tested. 
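To make the context-coverage idea of Definitions 2.12-2.14 concrete, the following sketch (a minimal Python illustration with an invented node, variables, and reaching-definition sets; it is not taken from the survey) enumerates the contexts of one node as combinations of reaching definitions, one per variable used at the node, and measures the fraction of contexts covered by the test paths. The ordering constraint of the ordered-context criterion is deliberately ignored, so this corresponds to the weaker context-coverage criterion.

```python
from itertools import product

# Hypothetical reaching-definitions information for one node n of a flow
# graph.  The node names a1..a3 and the variables x, y are invented.
# reaching_defs[v] is the set of nodes whose definitions of v reach n.
reaching_defs = {
    "x": {"a1", "a2"},
    "y": {"a2", "a3"},
}

def contexts(reaching_defs):
    """Enumerate the (unordered) contexts of the node: one reaching
    definition node per variable used at the node."""
    variables = sorted(reaching_defs)
    for combo in product(*(sorted(reaching_defs[v]) for v in variables)):
        yield frozenset(combo)   # a context is a *set* of defining nodes

def context_coverage(covered, reaching_defs):
    """Fraction of the node's contexts exercised by the test paths;
    `covered` is the set of contexts the executed paths actually reached."""
    all_contexts = set(contexts(reaching_defs))
    return len(all_contexts & covered) / len(all_contexts)

if __name__ == "__main__":
    for c in contexts(reaching_defs):
        print(sorted(c))
    # Suppose the selected test paths exercised only these two contexts.
    covered = {frozenset({"a1", "a2"}), frozenset({"a2", "a3"})}
    print("context coverage:", context_coverage(covered, reaching_defs))
```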
an ordered context path for n with respect to that permutation is also called a definition context path for n with respect to the context {n 1 , n 2 , . . . , n m }. (Footnote 4, continued: [1983] defines a context to be formed from all variables having a definition that reaches the node.)

Moreover, theoretically speaking, whether two references to array elements are references to the same element is undecidable. Hamlet et al. [1993] proposed a partial solution to this problem by using symbolic execution and a symbolic equation solver to determine whether two occurrences of
  • 17. 382 • Zhu et al. array elements can be the occurrences efficient testing of the interaction be- of the same element. tween procedures. The basic idea of in- The second limitation of the data-flow terprocedural data-flow testing is to test testing methods discussed is that dy- the data dependence across procedure namic data were not taken into account. interfaces. Harrold and Soffa [1990; One of the difficulties in the data-flow 1991] identified two types of interproce- analysis of dynamic data such as those dural data dependences in a program: referred to by pointers is that a pointer direct data dependence and indirect variable may actually refer to a number data dependence. A direct data depen- of data storage. On the other hand, a dence is a definition-use association data storage may have a number of whose definition occurs in procedure P references to it, that is, the existence of and use occurs in a directly called proce- alias. Therefore, for a given variable V, dure Q of P. Such a dependence exists a node may contain a definite definition when (1) a definition of an actual pa- to the variable if a new value is defi- rameter in one procedure reaches a use nitely bound to the variable at the node. of the corresponding formal parameter It has a possible definition at a node n if at a call site (i.e., a procedure call); (2) a it is possible that a new value is bound definition of a formal parameter in a to it at the node. Similarly, a path may called procedure reaches a use of the be definitely definition-clear or possibly corresponding actual parameter at a re- definition-clear with respect to a vari- turn site (i.e., a procedure return); or (3) able. Ostrand and Weyuker [1991] ex- a definition of a global variable reaches tended the definition-use association re- a call or return site. An indirect data lation on the occurrences of variables to dependence is a definition-use associa- a hierarchy of relations. A definition- tion whose definition occurs in proce- use association is strong if there is a dure P and use occurs in an indirectly definite definition of a variable and a called procedure Q of P. Conditions for definite use of the variable and every indirect data dependence are similar to definition-clear path from the definition to the use is definitely definition-clear those for direct data dependence, except with respect to the variable. The associ- that multiple levels of procedure calls ation is firm if both the definition and and returns are considered. Indirect the use are definite and there is at least data dependence can be determined by one path from the definition to the use considering the possible uses of defini- that it is definitely definition-clear. The tions along the calling sequences. When association is weak if both the definition a formal parameter is passed as an ac- and the use are definite, but there is no tual parameter at a call site, an indirect path from the definition to the use data dependence may exist. Given this which is definitely definition-clear. An data dependence information, the data- association is very weak if the definition flow test adequacy criteria can be easily or the use or both of them are possible extended for interprocedural data-flow instead of definite. testing. Harrold and Soffa [1990] pro- posed an algorithm for computing the F. Interprocedural data-flow test- interprocedural data dependences and ing. 
The data-flow testing methods discussed so far have also been restricted to testing the data dependence existing within a program unit, such as a procedure. As current trends in programming encourage a high degree of modularity, the number of procedure calls and returns executed in a module continues to grow. This mandates the efficient testing of the interaction between procedures.

Harrold and Soffa [1990] also developed a tool to support interprocedural data-flow testing. Based on Harrold and Soffa's work, Ural and Yang [1988; 1993] extended the flow-graph model for accurate representation of interprocedural data-flow information. Pande et al. [1991] proposed a polynomial-time algorithm for determining interprocedural definition-
  • 18. Test Coverage and Adequacy • 383 use association including dynamic data program under test. Furthermore, these of single-level pointers for C programs. dependence relations can be efficiently calculated. 2.1.3 Dependence Coverage Criterion— an Extension and Combination of Data- Flow and Control-Flow Testing. An ex- 2.2 Specification-Based Structural Testing tension of data-flow testing methods was made by Podgurski and Clarke There are two main roles a specification can play in software testing [Richardson [1989; 1990] by generalizing control and et al. 1992]. The first is to provide the data dependence. Informally, a state- necessary information to check whether ment s is semantically dependent on a the output of the program is correct statement s if the function computed [Podgurski and Clarke 1989; 1990]. by s affects the execution behavior of s. Checking the correctness of program Podgurski and Clarke then proposed a outputs is known as the oracle problem. necessary condition of semantic depen- The second is to provide information to dence called weak syntactic dependence select test cases and to measure test as a generalization of data dependence. adequacy. As the purpose of this article There is a weak syntactic dependence is to study test adequacy criteria, we between two statements if there is a focus on the second use of specifications. chain of data flow and a weak control Like programs, a specification has dependence between the statements, two facets, syntactic structure and se- where a statement u is weakly control- mantics. Both of them can be used to dependent on statement v if v has suc- select test cases and to measure test cessors v and v such that if the branch adequacy. This section is concerned from v to v is executed then u is neces- with the syntactic structure of a specifi- sarily executed within a fixed number of cation. steps, whereas if the branch v to v is A specification specifies the proper- taken then u can be bypassed or its ties that the software must satisfy. execution can be delayed indefinitely. Given a particular instance of the soft- Podgurski and Clarke also defined the ware’s input and its corresponding out- notion of strong syntactic dependence: put, to check whether the instance of there is a strong syntactic dependence the software behavior satisfies these between two statements if there is a properties we must evaluate the specifi- chain of data flow and a strong control cation by substituting the instance of dependence between the statements. input and output into the input and Roughly speaking, a statement u is output variables in the specification, re- strongly control dependent on state- spectively. Although this evaluation ment v if v has two successors v and v process may take various forms, de- such that the execution through the pending on the type of the specification, branch v to v may result in the execu- the basic idea behind the approach is to tion of u, but u may be bypassed when consider a particular set of elements or the branch from v to v is taken. Pod- components in the specification and to gurski and Clarke proved that strong calculate the proportion of such ele- syntactic dependence is not a necessary ments or components involved in the condition of semantic dependence. evaluation. 
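As a hedged illustration of this evaluation-based view of adequacy, the sketch below substitutes observed input/output pairs into the atomic predicates of a pre/postcondition-style specification and reports the proportion of feasible truth-value combinations exercised. The toy operation ("withdraw w from balance b, giving b2"), its predicates, and the assumed feasible set are all invented for illustration.

```python
# Measuring adequacy by the combinations of truth values of atomic
# predicates exercised while evaluating a specification on observed
# input/output pairs.

atomic_predicates = {
    "pre_nonneg":  lambda b, w, b2: w >= 0,        # precondition: w >= 0
    "pre_funds":   lambda b, w, b2: w <= b,        # precondition: w <= b
    "post_update": lambda b, w, b2: b2 == b - w,   # postcondition
}
NAMES = sorted(atomic_predicates)   # fixed order of the truth values below

def truth_combination(b, w, b2):
    """Substitute one observed input/output instance into every atomic
    predicate and record the combination of truth values it exercises."""
    return tuple(atomic_predicates[n](b, w, b2) for n in NAMES)

def combination_coverage(observations, feasible):
    """Proportion of the feasible combinations exercised by the tests."""
    exercised = {truth_combination(*obs) for obs in observations}
    return len(exercised & feasible) / len(feasible)

if __name__ == "__main__":
    # Assumed feasible combinations, ordered (post_update, pre_funds, pre_nonneg).
    feasible = {
        (True, True, True),    # normal withdrawal
        (False, False, True),  # overdraft rejected, balance unchanged
        (False, True, False),  # negative amount rejected
    }
    observations = [(100, 30, 70), (100, 150, 100)]
    print(combination_coverage(observations, feasible))   # 2 of 3 covered
```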
When the definition-use association relation is replaced with various dependence relations, various dependence-coverage criteria can be obtained as extensions to the data-flow test adequacy criteria. Such criteria make more use of semantic information contained in the program under test.

There are two major approaches to formal software functional specifications: model-based specifications and property-oriented specifications such as axiomatic or algebraic specifications. The following discussion is based on these types of specifications.
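Before that discussion, a brief backward-looking illustration of the direct interprocedural definition-use associations identified by Harrold and Soffa may help: definitions of actual parameters in a caller reach uses of the corresponding formal parameters in the callee across the call site, and interprocedural data-flow criteria ask that such associations be exercised. The procedures and names below are invented.

```python
# A minimal (hypothetical) example of a *direct* interprocedural
# definition-use association: the definitions of the actual parameters
# `total` and `rate` in `caller` reach the uses of the corresponding
# formal parameters `amount` and `rate` in `callee`.

def callee(amount, rate):
    # uses of the formal parameters: the definitions made in `caller`
    # reach these uses through the procedure interface
    return amount * (1.0 + rate)

def caller(prices):
    total = sum(prices)          # definition of the actual parameter `total`
    rate = 0.05                  # definition of the actual parameter `rate`
    return callee(total, rate)   # call site: the def-use pairs cross it

if __name__ == "__main__":
    # Interprocedural data-flow criteria require test data that exercise
    # such cross-procedure associations, not only those internal to
    # `caller` or `callee` taken separately.
    print(caller([10.0, 20.0]))
```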
  • 19. 384 • Zhu et al. 2.2.1 Coverage of Model-Based For- That is, some of the choices of output mal Functional Specifications. When a allowed by the specification may not be specification is model-based, such as implemented by the program. This may those written in Z and VDM, it has two not be considered a program error, but parts. The first describes the state it may result in infeasible combinations. space of the software, and the second A feasible combination of the atomic part specifies the required operations on predicates in the preconditions is a de- the space. The state space of the soft- scription of the conditions that test ware system can be defined as a set of cases should satisfy. It specifies a sub- typed variables with a predicate to de- domain of the input space. It can be scribe the invariant property of the expressed in the same specification lan- state space. The operations are func- guage. Such specifications of testing re- tions mapping from input data and the quirements are called test templates. state before the operation to the output Stocks and Carrington [1993] suggested data and the state after the operation. the use of the formal functional specifi- Such operations can be specified by a cation language Z to express test tem- set of predicates that give the precondi- plates because the schema structure of tion, that is, the condition on the input Z and its schema calculus can provide data and the state before the operation, support to the derivation and refine- and postconditions that specify the rela- ment of test templates according to for- tionship between the input data, output mal specifications and heuristic testing data, and the states before and after the rules. Methods have also been proposed operation. to derive such test templates from mod- The evaluation of model-based formal el-based specification languages. Amla functional specifications is fairly similar and Ammann [1992] described a tech- to the evaluation of a Boolean expres- nique to extract information from for- sion in an imperative programming lan- mal specifications in Z and to derive guage. When input and output variables test templates written in Z for partition in the expression are replaced with an testing. The key step in their method is instance of input data and program out- to identify the categories of the test puts, each atomic predicate must be ei- data for each parameter and environ- ther true or false. If the result of the ment variable of a functional unit under evaluation of the whole specification is test. These categories categorize the in- true, then the correctness of the soft- put domain of one parameter or one ware on that input is confirmed. Other- environment variable according to the wise, a program error is found. How- major characteristics of the input. Ac- ever, the same truth value of a cording to Amla and Ammann, there are specification on two instances of input/ typically two distinct sources of catego- output may be due to different combina- ries in Z specifications: (a) characteris- tions of the truth values of the atomic tics enumerated in the preconditions predicates. Therefore it is natural to and (b) characteristics of a parameter or require that an adequate test cover a environment variable by itself. For pa- certain subset of feasible combinations rameters, these characteristics are of the predicates. Here a feasible combi- based on their type. 
A feasible combination means that the combination can be satisfied; that is, there is an assignment of values to the input and output variables such that the atomic predicates take their corresponding values in the predicate combination. In the case where the specification contains nondeterminism, the program may be less nondeterministic than the specification.

For environment variables, these characteristics may also be based on the invariant for the state components. Each category is then further divided into a set of choices. A choice is a subset of data that can be assigned to the parameter or the environment variable. Each category can be broken into at least two choices: one for the valid inputs and the other for the
  • 20. Test Coverage and Adequacy • 385 invalid inputs. Finer partitions of valid ronment variable, there is at least one inputs are derived according to the syn- test datum that belongs to the choice. tactic structure of the precondition Formally: predicate, the parameters, or the invari- Definition 2.16 (Each-Choice-Used ant predicate of the environment vari- Criterion). A set of test data T satis- ables. For example, the predicate “A ∨ fies the each-choice-used criterion if the B” is partitioned into three choices: (a) subset E {e e C and ?t T.(t “¬ (A ∨ B)” for the set of data which are e)} satisfies the condition: invalid inputs; (b) “A” for the subset of valid inputs which satisfy condition A; @i. 1 i n (3) “B” for the subset of valid inputs which satisfy the condition B. f Ei A i,1 , A i,2 , . . . , A i,k i , Based on Amla and Ammann’s [1992] work, Ammann and Offutt [1994] re- where cently considered how to test a func- Ei e ?X1 , . . . , X i 1 , tional unit effectively and efficiently by selecting test cases to cover various sub- Xi , . . . , Xn . X1 ... Xi e 1 1 sets of the combinations of the catego- ries and choices when the functional Xi 1 ... Xn E . unit has more than one parameter and environment variable. They proposed Ammann and Offutt suggested the three coverage criteria. The all-combi- use of the base-choice-coverage criterion nations criterion requires that software and described a technique to derive test is tested on all combinations of choices; templates that satisfy the criterion. The that is, for each combination of the base-choice-coverage criterion is based choices of the parameters and the envi- on the notion of base choice, which is a ronment variables, there is at least one combination of the choices of parame- ters and environment variables that test datum in the combination. Let x 1 , represents the normal operation of the x 2 , . . . , x n be the parameters and envi- functional unit under test. Therefore, ronment variables of the functional unit test cases of the base choice are useful under test. Suppose that the choices for to evaluate the function’s behavior in x i are A i,1 , A i,2 , . . . , A i,k i, k i 0, i 1, normal operation mode. To satisfy the 2, . . . , n. Let C {A 1,u 1 A 2,u 2 ... base-choice-coverage criterion, software A n,u n 1 ui k i and 1 i n}. C needs to be tested on the subset of com- is then the set of all combinations of binations of choices such that for each choices. The all combination criterion choice in a category, the choice is com- can be formally defined as follows. bined with the base choices for all other categories. Assume that A 1,1 A 2,1 Definition 2.15 (All-Combination Cri- ... A n,1 is the base choice. The base- terion). A set of test data T satisfies choice coverage criterion can be for- the all-combination criterion if for all mally defined as follows. c C, there exists at least one t T Definition 2.17 (Base-Choice-Coverage such that t c. Criterion). A set of test data T satis- This criterion was considered to be fies the base-choice-coverage criterion if inefficient, and the each-choice-used cri- the subset E {e e C ∧ ?t T.(t terion was considered ineffective [Am- e)} satisfies the following condition: mann and Offutt 1994]. The each- n choice-used criterion requires that each choice is used in the testing; that is, for E Bi , each choice of each parameter or envi- i 1 ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 21. 386 • Zhu et al. where the operations represents a test execu- tion of the program, where the test case Bi A 1,1 ... Ai 1,1 A i, j Ai 1,1 consists of the constants substituted for the variables. Second, a term also repre- ... A n,1 j 1, 2, . . . , k i . sents a value, that is, the result of the There are a number of works on spec- sequence of operations. Therefore, ification-based testing that focus on der- checking an equation means executing ivation of test cases from specifications, the operation sequences for the two including Denney’s [1991] work on test- terms on the two sides of the equation case generation from Prolog-based spec- and then comparing the results. If the ifications and many others [Hayes 1986; results are the same or equivalent, the Kemmerer 1985; McMullin and Gannon program is considered to be correct on 1983; Wild et al. 1992]. this test case, otherwise the implemen- Model-based formal specification can tation has errors. This interpretation also be in an executable form, such as a allows the use of algebraic specifica- finite state machine or a state chart. tions as test oracles. Aspects of such models can be repre- Since variables in a term can be re- sented in the form of a directed graph. placed by any value of the data type, Therefore, the program-based adequacy there is a great deal of freedom to choose criteria based on the flow-graph model input data for any given sequence of oper- can be adapted for specification-based ations. For algebraic specification, values testing [Fujiwara et al. 1991; Hall and are represented by ground terms, that is, Hierons 1991; Ural and Yang 1988; terms without variables. Gaudel [Bouge 1993]. et al. 1986; Bernot et al. 1991] and her 2.2.2 Coverage of Algebraic Formal colleagues suggested that the selection of Functional Specifications. Property- test cases should be based on partitioning oriented formal functional specifications the set of ground terms according to their specify software functions by a set of complexity so that the regularity and uni- properties that the software should pos- formity hypotheses on the subsets in the sess. In particular, an algebraic specifi- partition can be assumed. The complexity cation consists of a set of equations that of a test case is then the depth of nesting the operations of the software must sat- of the operators in the ground term. isfy. Therefore checking if a program Therefore, roughly speaking, the selection satisfies the specification means check- of test cases should first consider con- ing whether all of the equations are stants specified by the specification, then satisfied by the program. all the values generated by one applica- An equation in an algebraic specifica- tion of operations on constants, then val- tion consists of two terms as two sides ues generated by two applications on con- of the equation. A term is constructed stants, and so on until the test set covers from three types of symbols: variables data of a certain degree of complexity. representing arbitrary values of a given The following hypothesis, called the data type, constants representing a regularity hypothesis [Bouge et al. 1986; given data value in a data type, and Bernot et al. 1991], formally states the operators representing data construc- gap between software correctness and tors and operations on data types. adequate testing by the preceding ap- Each term has two interpretations in proach. the context of testing. 
First, a term represents a sequence of calls to the operations that implement the operators specified in the specification. When the variables in the term are replaced with constants, such a sequence of calls to the operations represents a test execution of the program.

Regularity Hypothesis

∀x [complexity(x) ≤ k ⇒ t(x) = t′(x)]  ⇒  ∀x [t(x) = t′(x)]        (2.1)
  • 22. Test Coverage and Adequacy • 387 Informally, the regularity hypothesis mation in the specification. A simple assumes that for some complexity de- way to combine program-based struc- gree k, if a program satisfies an equa- tural testing with specification-based tion t(x) t (x) on all data of complex- structure coverage criteria is to mea- ity not higher than k, then the program sure test adequacy with criteria from satisfies the equation on all data. This both approaches. hypothesis captures the intuition of in- ductive reasoning in software testing. 3. FAULT-BASED ADEQUACY CRITERIA But it cannot be proved formally (at least in its most general form) nor vali- Fault-based adequacy criteria measure dated empirically. Moreover, there is no the quality of a test set according to its way to determine the complexity k such effectiveness or ability to detect faults. that only the test cases of complexity less than k need to be tested. 3.1 Error Seeding 2.3 Summary of Structure Coverage Error seeding is a technique originally Criteria proposed to estimate the number of faults that remain in software. By this In summary, when programs are mod- method, artificial faults are introduced eled as directed graphs, paths in the into the program under test in some flow graph should be exercised by test- suitable random fashion unknown to ing. However, only a finite subset of the the tester. It is assumed that these arti- paths can be checked during testing. ficial faults are representative of the The problem is therefore to choose inherent faults in the program in terms which paths should be exercised. of difficulty of detection. Then, the soft- Control-flow test data adequacy crite- ware is tested and the inherent and ria answer this question by specifying artificial faults discovered are counted restrictions on the complexity of the separately. Let r be the ratio of the paths or by specifying restrictions on number of artificial faults found to the the redundancy among paths. Data-flow number of total artificial faults. Then adequacy criteria use data-flow infor- the number of inherent faults in the mation in the program and select paths program is statistically predicted with that are significant with respect to such maximum likelihood to be f/r, where f is information. Data-flow and control-flow the number of inherent faults found by adequacy criteria can be extended to testing. dependence coverage criteria, which This method can also be used to mea- make more use of semantic information sure the quality of software testing. The contained in the program under test. ratio r of the number of artificial faults However, none of these criteria use in- found to the total number of artificial formation about software requirements faults can be considered as a measure of or functional specifications. the test adequacy. If only a small pro- Specification-based structure cover- portion of artificial faults are found dur- age criteria specify testing require- ing testing, the test quality must be ments and measure test adequacy ac- poor. In this sense, error seeding can cording to the extent that the test data show weakness in the testing. cover the required functions specified in An advantage of the method is that it formal specifications. These criteria fo- is not restricted to measuring test qual- cus on the specification and ignore the ity for dynamic testing. It is applicable program that implements the specifica- to any testing method that aims at find- tion. ing errors or faults in the software. 
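Returning to the algebraic-specification approach of Section 2.2.2, the sketch below shows both uses of a term: each side of an equation is executed as a sequence of operation calls against the implementation and the resulting values are compared, and ground terms are enumerated by nesting depth in the spirit of the complexity-based selection behind the regularity hypothesis. The list-based stack implementation and the axiom pop(push(s, x)) = s are invented for illustration.

```python
# Toy algebraic-specification oracle: the axiom pop(push(s, x)) = s is
# checked by running both sides as operation sequences on a hypothetical
# list-based stack implementation and comparing the resulting values.

def empty():
    return []

def push(s, x):
    return s + [x]

def pop(s):
    return s[:-1] if s else s

def axiom_pop_push(s, x):
    """Both sides of the equation, executed on the implementation."""
    return pop(push(s, x)) == s

def ground_stacks(max_depth, constants=(0, 1)):
    """Enumerate ground stack terms up to a given nesting depth of `push`,
    mirroring the selection of test cases by term complexity."""
    terms = [empty()]
    frontier = [empty()]
    for _ in range(max_depth):
        frontier = [push(s, c) for s in frontier for c in constants]
        terms.extend(frontier)
    return terms

if __name__ == "__main__":
    # Check the axiom on all ground instances of complexity (depth) at most 3.
    cases = [(s, c) for s in ground_stacks(3) for c in (0, 1)]
    assert all(axiom_pop_push(s, x) for s, x in cases)
    print(f"axiom holds on {len(cases)} ground instances up to depth 3")
```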
As discussed in Section 1, software testing should employ information contained in the program as well as information in the specification.

However, the accuracy of the measure is dependent on how faults are introduced. Usually, artificial faults are
  • 23. 388 • Zhu et al. manually planted, but it has been data presented. For example, the test proved difficult to implement error data may not exercise the portion of the seeding in practice. It is not easy to program that was mutated. introduce artificial faults that are equivalent to inherent faults in diffi- (2) The mutant is equivalent to the culty of detection. Generally, artificial original program. errors are much easier to find than in- The mutant and the original program herent errors. In an attempt to over- always produce the same output, hence come this problem, mutation testing in- no test data can distinguish between troduces faults into a program more the two. Normally, only a small percent- systematically. age of mutants are equivalent to the original program. 3.2 Program Mutation Testing Definition 3.1 (Mutation Adequacy 3.2.1 Principles of Mutation Ade- Score). The mutation adequacy of a set quacy Analysis. Mutation analysis is of test data is measured by an adequacy proposed as a procedure for evaluating score computed according to the follow- the degree to which a program is tested, ing equation. that is, to measure test case adequacy [DeMillo et al. 1978; Hamlet 1977]. D Adequacy Score S Briefly, the method is as follows. We M E have a program p and a test set t that has been generated in some fashion. where D is the number of dead mutants, The first step in mutation analysis is M is the total number of mutants, and the construction of a collection of alter- E is the number of equivalent mutants. native programs that differ from the Notice that the general problem of original program in some fashion. These deciding whether a mutant is equiva- alternatives are called mutants of the lent to the original program is undecid- original program, a name borrowed able. from biology. Each mutant is then exe- 3.2.2 Theoretical Foundations of Mu- cuted on each member of the test set t, tation Adequacy. Mutation analysis is stopping either when an element of t is based on two basic assumptions—the found on which p and the mutant pro- competent programmer hypothesis and gram produce different responses, or the coupling effect hypothesis. These when t is exhausted. two assumptions are based on observa- In the former case we say that the tions made in software development mutant has died since it is of no further practice, and if valid enable the practi- value, whereas in the latter case we say cal application of mutation testing to the mutant lives. These live mutants real software [DeMillo et al. 1988]. provide valuable information. A mutant However, they are very strong assump- may remain alive for one of the follow- tions whose validity is not self-evident. ing reasons. A. Competent programmer assump- (1) The test data are inadequate. tion. The competent programmer hy- If a large proportion of mutants live, pothesis assumes that the program to then it is clear that on the basis of these be tested has been written by competent test data alone we have no more reason programmers. That is, “they create pro- to believe that p is correct than to be- grams that are close to being correct” lieve that any of the live mutants are [DeMillo et al. 1988]. A consequence correct. In this sense, mutation analysis drawn from the assumption by DeMillo can clearly reveal a weakness in test et al. 
data by demonstrating specific programs that are not ruled out by the test data presented.

[1988] is that "if we are right in our perception of programs as being close to correct, then these errors should
  • 24. Test Coverage and Adequacy • 389 be detectable as small deviations from set of programs (p). We say that p is the intended program.” In other words, locally correct with respect to if, for the mutants to be considered in muta- all programs q in (p), either q is tion analysis are those within a small equivalent to p or q fails on at least one deviation from the original program. In test point in the input space. In other practice, such mutants are obtained by words, in the neighborhood of a pro- systematically and mechanically apply- gram only the program and its equiva- ing a set of transformations, called mu- lents are possibly correct, because all tation operators, to the program under others are incorrect. With the compe- test. These mutation operators would tent programmer hypothesis, local cor- ideally model programming errors made rectness implies correctness if the by programmers. In practice, this may neighborhood is always large enough to be only partly true. We return to muta- cover at least one correct program. tion operators in Section 3.2.3. Definition 3.2 (Neighborhood Ade- B. Coupling effect assumption. The quacy Criterion). A test set t is -ade- second assumption in mutation analysis quate if for all programs q p in , is the coupling effect hypothesis, which there exists x in t such that p(x) q(x). assumes that simple and complex errors Budd and Angluin [1982] studied the are coupled, and hence test data that computability of the generation and rec- cause simple nonequivalent mutants to ognition of adequate test sets with re- die will usually also cause complex mu- spect to the two notions of correctness, tants to die. but left the neighborhood construction Trying to validate the coupling effect problem open. assumption, Offutt [1989; 1992] did an The neighborhood construction prob- empirical study with the Mothra muta- lem was investigated by Davis and tion-testing tool. He demonstrated that Weyuker [1988; 1983]. They developed a a test set developed to kill first-order theory of metric space on programs mutants (i.e., mutants) can be very suc- written in a language defined by a set of cessful at killing second-order mutants production rules. A transition sequence (i.e., mutants of mutants). However, the from program p to program q was de- relationship between second-order mu- fined to be a sequence of the programs tants and complex faults remains un- p 1 , p 2 , . . . , p n such that p 1 p and clear. pn q, and for each i 1, 2, . . . , n 1, In the search for the foundation of either p i 1 is obtained from p i (a for- mutation testing, theories have been de- ward step) or p i can be obtained from veloped for fault-based testing in gen- p i 1 (a backward step), using one of the eral and mutation testing in particular. productions of the grammar in addition Program correctness is usually taken to to a set of short-cut rules to catch the mean that the program computes the intuition that certain kinds of programs intended function. But in mutation are conceptually close. The length of a analysis there is another notion of cor- transition sequence is defined to be the rectness, namely, that a certain class of maximum of the number of forward faults has been ruled out. This is called steps and the number of backward steps local correctness here to distinguish it in the transition sequence. The distance from the usual notion of correctness. 
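As a concrete reading of Definition 3.1, the following sketch executes each mutant of a tiny function on a test set and computes the adequacy score D/(M − E). The function and its hand-written mutants are invented, and equivalence is declared by inspection rather than decided automatically, which, as noted above, is undecidable in general.

```python
# Sketch of strong mutation analysis for a tiny (invented) function.
# Each mutant differs from the original by one small syntactic change;
# a mutant is killed when some test case makes its output differ.

def original(a, b):
    return a if a > b else b          # maximum of two numbers

mutants = {
    "ROR: > replaced by >=": lambda a, b: a if a >= b else b,  # equivalent here
    "ROR: > replaced by <":  lambda a, b: a if a < b else b,
    "SVR: a replaced by b":  lambda a, b: b,
}

def mutation_score(tests, mutants, equivalent):
    dead = sum(
        1 for name, m in mutants.items()
        if name not in equivalent
        and any(m(*t) != original(*t) for t in tests)
    )
    m_total = len(mutants)
    e = len(equivalent)
    return dead / (m_total - e)       # D / (M - E), Definition 3.1

if __name__ == "__main__":
    tests = [(1, 2), (5, 5)]
    # Declared equivalent by inspection: both versions return the same
    # value whether a > b or a == b.
    equivalent = {"ROR: > replaced by >="}
    # 0.5: the test set kills only one of the two nonequivalent mutants,
    # so it is inadequate by the mutation criterion.
    print(mutation_score(tests, mutants, equivalent))
```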
(p, q) between program p and q is then According to Budd and Angluin [1982] defined as the smallest length of a tran- local correctness of a program p can be sition sequence from p to q. This dis- defined relative to a neighborhood of tance function on the program space can p, which is a set of programs containing be proved to satisfy the axioms of a the program p itself. Precisely speaking, metric space, that is, for all programs p, is a mapping from a program p to a q, r, ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 25. 390 • Zhu et al. Table I. Levels of Analysis in Mutation Testing a An erroneous expression may coincidentally compute correct output on a particular input. For example, assume that a reference to a variable is mistaken by referring to another variable, but in a particular case these two variables have the same value. Then, the expression computes the correct output on the particular input. If such an input is used as a test case, it is unable to detect the fault. Coincidental correctness analysis attempts to analyse whether a test set suffers from such weakness. (1) ( p, q) 0; (p)-critical. Finally, by studying min- (2) ( p, q) 0 if and only if p q; imally adequate test sets, Davis and (3) ( p, q) (q, p); Weyuker [1988] obtained a lower bound of the numbers of non-critical points (4) ( p, r) ( p, q) (q, r). that must be present in an adequate Then, the neighborhood (p) of pro- test set. gram p within a distance is the set of 3.2.3 Mutation Transformations. Now programs within the distance accord- let us consider the problem of how to ing to ; formally, (p) {q (p, q) generate mutants from a given program. }. Davis and Weyuker introduced the A practical method is the use of mutation notion of critical points for a program operators. Generally speaking, a muta- with respect to a neighborhood. Intu- tion operator is a syntactic transforma- itively, a critical point for a program is tion that produces a mutant when applied a test case that is the unique input to to the program under test. It applies to a kill some mutant in the neighborhood certain type of syntactic structure in the set. Formally: program and replaces it with another. In Definition 3.3 (Critical Points). An current mutation testing tools such as input c is a -critical point for p with Mothra [King and Offutt 1991], the muta- respect to if there exists a program tion operators are designed on the basis q such that for all x c, p(x) of many years’ study of programmer er- q(x) but p(c) q(c). rors. Different levels of mutation analysis can be done by applying certain types of There is a nice relationship between mutation operators to the corresponding critical points [Davis and Weyuker syntactical structures in the program. Ta- 1988] and adequate test sets. First, ble I from Budd [1981], briefly describes -critical points must be members of the levels of mutation analysis and the any -adequate test sets. Second, when corresponding mutation operators. This the distance increases, the neighbor- framework has been used by almost all of hood set (p) as well as the set of the mutation testing tools. critical points increase in the sense of set inclusion. That is, if c is (p)- 3.2.4 The Pros and Cons. Using a critical, then for all , c is also mutation-based testing system, a tester ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 26. Test Coverage and Adequacy • 391 supplies the program to be tested and achieved by replacing a statement chooses the levels of mutation analysis or a branch by the special statement to be performed. Then the system gener- TRAP, which causes the abortion of ates the set of mutants. The tester also the execution. supplies a test set. The testing system executes the original program and each The drawback of mutation testing is mutant on each test case and compares the large computation resources (both the outputs produced by the original time and space) required to test large- program and its mutants. If the output scale software. It was estimated that of a mutant differs from the output of the number of mutants for an n-line the original program, the mutant is program is on the order of n 2 [Howden 1982]. Recently, experimental data con- marked dead. Once execution is com- firmed this estimate; see Section 5.3 plete, the tester examines any mutants [Offutt et al. 1993]. A major expense in that are still alive. The tester can then mutation testing is perhaps the sub- declare a mutant to be equivalent to the stantial human cost of examining large original program, or can supply addi- numbers of mutants for possible equiva- tional test data in an effort to kill the lence, which cannot be determined ef- mutant. fectively. An average of 8.8% of equiva- Reports on experiments with muta- lent mutants had been observed in tion-based testing have claimed that experiments [Offutt et al. 1993]. the method is powerful and has a num- ber of attractive features [DeMillo and Mathur 1990; DeMillo et al. 1988]. 3.3 Variants of Program Mutation Testing (1) Mutation analysis allows a great de- Using the same idea of mutation analy- gree of automation. Mutants are sis, Howden [1982] proposed a testing generated by applying mutation op- method to improve efficiency. His test- erators and are then compiled and ing method is called weak mutation test- executed. The outputs of the mu- ing because it is weaker than the origi- tants and the original programs are nal mutation testing. The original compared and then a mutation ade- method was referred to as strong muta- quacy score is calculated. All these tion testing. The fundamental concept can be supported by mutation-test- of weak mutation testing is to identify ing software such as that of Mothra classes of program components and er- [DeMillo et al. 1988; King and Of- rors that can occur in the components. futt 1991]. Mutation transformations are then ap- (2) Mutation-based testing systems pro- plied to the components to generate mu- vide an interactive test environment tants of the components. For each com- that allows the tester to locate and ponent and a mutant of the component, remove errors. When a program un- weak mutation testing requires that a der test fails on a test case and a test datum be constructed so that it mutant does not, the tester should causes the execution of the component find it easy to locate and remove the and that the original component and the error by considering the mutation mutant compute different values. The operator applied and the location major advantage of weak mutation test- where the mutation operator is ap- ing is efficiency. Although there is the plied. same number of mutants in weak muta- (3) Mutation analysis includes many tion testing as in strong mutation test- other testing methods as special ing, it is not necessary to carry out a cases. 
For example, statement coverage and branch coverage are special cases of statement analysis in mutation testing. This can be
  • 27. 392 • Zhu et al. testing, experiments with weak muta- to p always gives a mutant that is tion testing such as Offutt and Lee’s stronger than the mutants obtained by [1991] and Marick’s [1991] suggest that it the application of to p at the same can still be an effective testing method. location. Taking the relational opera- From a more general view of muta- tor as an example, mutants can be tion-testing principles, Woodward and generated by replacing “ ” with “ ”, Halewood [1988] proposed firm muta- “ ”, “ ”, “ ”, and “ ”, respectively. In- tion testing. They regarded mutation tuitively, replacing “ ” with “ ” should testing as the process of making small be stronger than replacing with “ ”, changes in a program and comparing because if the test data are not good enough to distinguish “ ” from “ ”, it the outcomes of the original and would not be adequate for other rela- changed versions. They identified a set tional operators. Based on such argu- of parameters of such changes and com- ments, a partial ordering on relational parisons. The basic idea of firm muta- operators was defined. However, Wood- tion testing is to make such parameters ward [1991] proved that operator order- explicit so that they can be altered by ing is not the right approach to achiev- testers. Strong and weak mutation test- ing mutant ordering. The situation ing are two extremes of the firm-mutation turns out to be quite complicated when testing method. The advantage of firm- a mutation operator is applied to a loop mutation testing is that it is less expen- body, and a counterexample was given. sive than strong-mutation testing but Experiments are needed to see how ef- stronger than weak-mutation testing. In fective ordered mutation testing can be addition, firm-mutation testing provides and to assess the extent of cost saving. a mechanism for the control of fault intro- It was observed that some mutation duction and test results comparison. The operators generate a large number of major disadvantage of firm-mutation mutants that are bound to be killed if a testing is that there is no obvious system- test set kills other mutants. Therefore, atic basis on which to select the parame- Mathur [1991] suggested applying the ters and the area of program code. mutation testing method without apply- Returning to frameworks for strong ing the two mutation operators that mutation testing, ordered mutation test- produce the most mutants. Mathur’s ing was proposed to improve efficiency method was originally called con- without sacrificing effectiveness. The strained mutation testing. Offutt et al. idea is to construct an order between [1993] extended the method to omit the mutants such that mutant b is stronger first N most prevalent mutation opera- than a (written a b) if for any test tors and called it N-selective mutation case t, t kills b implies that t necessar- testing. They did an experiment on se- ily kills a. Therefore, mutant b should lective testing with the Mothra muta- be executed on test cases before the tion testing environment. They tested execution of a, and a is executed only 10 small programs using an automatic when the test data failed to kill b. A test-case generator, Godzilla [DeMillo similar ordering can also be defined on and Offutt 1991; 1993]. Their experi- test data. Given a mutant q of program mentation showed that with full selec- p, q should be executed on test data t tive mutation score, an average nonse- before the execution on test data s if t is more likely to kill q than s is. 
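The contrast between weak and strong mutation testing described above can be seen on a one-component example. In the sketch below (program, component, and mutant are all invented) one test makes the component and its mutant compute different values, killing the mutant in the weak sense, while a later computation masks the difference in the program output, so the same test does not kill the mutant in the strong sense.

```python
# Weak versus strong mutation on one component of an invented program.
# The component is the expression a * b; the mutant replaces it with a + b.
# Weak mutation only requires a test on which the component and its mutant
# compute different values; strong mutation requires the whole program's
# output to differ.

def component(a, b):
    return a * b

def mutant_component(a, b):
    return a + b

def program(a, b, comp):
    v = comp(a, b)
    return min(v, 10)        # the later computation can mask the difference

def kills_weak(test):
    a, b = test
    return component(a, b) != mutant_component(a, b)

def kills_strong(test):
    a, b = test
    return program(a, b, component) != program(a, b, mutant_component)

if __name__ == "__main__":
    # (2, 2): neither criterion satisfied (4 == 4).
    # (5, 6): weak kill only: 30 != 11, but both outputs are clipped to 10.
    # (3, 4): both criteria satisfied (12 vs 7, outputs 10 vs 7).
    for test in [(2, 2), (5, 6), (3, 4)]:
        print(test, "weak kill:", kills_weak(test),
              "strong kill:", kills_strong(test))
```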
Ordering lective mutation score of 99.99, 99.84 on mutation operators was proposed to and 99.71% was achieved by 2-selective, achieve the ordering on mutants in 4-selective, and 6-selective mutation early ’80s by Woodward et al. [1980] testing, respectively. Meanwhile, the and Riddell et al. [1982], and reap- average savings5 for these selective mu- peared recently in a note by Duncan tations were 23.98, 41.36, and 60.56%, and Robson [1990]. A mutation opera- tion is said to be stronger than if 5 The savings are measured as the reduction of for all programs p, the application of the number of mutants. ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 28. Test Coverage and Adequacy • 393 respectively [Offutt et al. 1993]. Mathur polynomial functions of degree k, and theorized that selective mutation test- the set of multinomials of degree k. ing has complexity linear in program There are some major advantages of size measured as the number of variable vector boundedness. First, any finite- references, and yet retains much of the dimensioned vector space can be de- effectiveness of mutation testing. How- scribed by a finite set of characteristic ever, experimental data show that the vectors such that any member of the number of mutants is still quadratic vector space is a linear combination of although the savings are substantial the characteristic vectors. Second, vec- [Offutt et al. 1993]. tor spaces are closed under the opera- tions of addition and scalar multiplica- tion. Suppose that some correct function 3.4 Perturbation Testing C has been replaced by an erroneous form C . Then the expression C C is While mutation testing systematically the effect of the fault on the transforma- plants faults in programs by applica- tion function. It is called the error func- tions of syntactic transformations, Zeil’s tion of C or the perturbation to C. Since perturbation testing analyses test effec- the vector space contains C and C , the tiveness by considering faults in an “er- error function must also be in the vector ror” space. It is concerned with faults in space. This enables us to study faults in arithmetic expressions within program a program as functional differences be- statements [Zeil 1983]. It was proposed tween the correct function and the in- to test vector-bounded programs, which correct function rather than simply as have the following properties: (1) vector- syntactical differences, as in mutation bounded programs have a fixed number testing. Thus it builds a bridge between of variables on a continuous input do- fault-based testing and error-based test- main; (2) in the flow-graph model, each ing, which is discussed in Section 4. node n in the flow graph is associated When a particular error function with a function C n that transforms the space E is identified, a neighborhood of environment v to a new environment v . a given function f can be expressed as Here, environment is a vector (x 1 , x 2 , the set of functions in the form of f f e, . . . , x n , y 1 , y 2 , . . . , y m ) which consists where f e E. Test adequacy can then of the current values of the input vari- be defined in terms of a test set’s ability ables x i and the values of the intermedi- to detect error functions in a particular ate and output variables y i . The func- error function space. Assuming that tions associated with nodes are there are no missing path errors in the assumed to be in a class C which is program, Zeil considered the error func- closed under function composition. (3) tions in the function C and the predi- each node may have at most two out- cate T associated with a node. The fol- ward arcs. In such cases the node n is lowing gives more details about also associated with a predicate T n , perturbation of these two types of func- which is applied to the new environ- tions. ment obtained by applying C n and com- pared with zero to determine the next 3.4.1 Perturbation to the Predicate. node for execution. It was assumed that Let n be a node in a flow graph and C n the predicates are simple, not combined and T n be the computation function and with and, or, or other logical operators. 
predicate function associated with node (4) the predicates are assumed to be in a n. Let A be a path from the begin node class T which is a vector space over the to node n and C A be the function com- real numbers R of dimension k and is puted through the path. Let T be the closed under composition with C. Exam- correct predicate that should be used in ples of vector spaces of functions include place of T n . Suppose that the predicate the set of linear functions, the set of is tested by an execution through the ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 29. 394 • Zhu et al. path A. Recall that a predicate is a Having given a characterization theo- real-valued function, and its result is rem of the set BLIND(A) and a con- compared with zero to determine the structive methods to calculate the set, direction of control transfer. If in a test Zeil defined a test-path selection crite- execution the error function e T rion for predicate perturbations. T n is evaluated to zero, it does not affect the direction of control transfer. There- Definition 3.4 (Adequacy of Detecting fore the error function e T Tn Predicate Perturbation). A set P of cannot be detected by such an execution paths all ending at some predicate T is if and only if there exists a positive perturbation-test-adequate for predicate scalar such that T if Tn ° CA v0 T ° CA v0 , BLIND p . p P for all initial environments v0 which Zeil’s criterion was originally stated in cause execution through path A, where ° the form of a rejection rule: if a program is functional composition. The subset of T has been reliably tested on a set P of that consists of the functions satisfying execution paths that all end at some the preceding equation is called the blind- predicate T, then an additional path p ness space for the path A, denoted by also ending at T need not be tested if BLIND(A). Zeil identified three types of and only if blindness and provided a constructive method to obtain BLIND(A) for any path BLIND x BLIND p . x P A. Assignment blindness consists of func- tions in the following set. Zeil also gave the following theorem about perturbation testing for predicate null CA ee T ∧ @v. e ° CA v 0. errors. THEOREM 3.1 (Minimal Adequate Test These functions evaluate to zero when Set for Predicate Perturbation Testing). the expressions computed in C A are sub- A minimal set of subpaths adequate for stituted for the program variables. After testing a given predicate in a vector the assignment statement “X : f(v)”, bounded program contains at most k for example, the expression “X f(v)” subpaths, where k is the dimension of T. can be added into the set of undetect- able predicates. Equality blindness con- 3.4.2 Perturbation to Computations. sists of equality restrictions on the path The perturbation function of the compu- domain. To be an initial environment tation associated with node n can also that causes execution of a path A, it be expressed as e C C n where C must satisfy some restrictions. A re- is the unknown correct computation. striction can be an equality, such as x However, a fault in computation func- 2. If an input restriction r(v 0 ) 0 is tion may cause two types of errors: do- imposed, then the predicate r(v) 0 is main error or computation error. A com- an undetectable error function. The fi- putation error can be revealed if there is nal component of undetectable predi- a path A from the begin node to the cates is the predicate T n itself. Because node n and a path B from node n to a T n (v) compared with zero and T n (v) node that contains an output statement compared with zero are identical for all M, such that for some initial environ- positive real numbers , self-blindness ment v 0 that causes the execution of consists of all predicates of the form paths A and B T n (v). These three types of undetect- able predicates can be combined to form M ° CB ° Cn ° CA v0 more complicated undetectable error functions. M ° CB ° C ° CA v0 . ACM Computing Surveys, Vol. 29, No. 4, December 1997
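A small numeric check of the assignment-blindness idea may help: on a made-up path that executes the assignment x := a + b, the error function e = x − a − b evaluates to zero for every input that follows the path, so adding any multiple of e to the path predicate cannot change a branch outcome and the perturbation goes undetected by tests through that path. The path, predicate, and values below are invented.

```python
# Numeric illustration of assignment blindness.  On a path that executes
# the assignment x := a + b, the error function e(a, b, x) = x - a - b is
# identically zero, so perturbing the path predicate T by any scalar
# multiple of e cannot be detected by executions through this path.

def path_A(a, b):
    """Computation C_A along the path: executes the assignment x := a + b."""
    x = a + b
    return a, b, x

def T(a, b, x):
    """Original path predicate, compared with zero to pick the branch."""
    return x - 10

def e(a, b, x):
    """A perturbation lying in the blindness space of path A."""
    return x - a - b

def branch(pred, env):
    return pred(*env) > 0

if __name__ == "__main__":
    def perturbed(a, b, x):
        return T(a, b, x) + 3.0 * e(a, b, x)

    for a, b in [(1, 2), (7, 8), (-4, 20), (0, 0)]:
        env = path_A(a, b)
        # Same branch outcome for the original and the perturbed predicate
        # on every input that executes path A: the perturbation is blind.
        assert branch(T, env) == branch(perturbed, env)
    print("perturbation T + 3*e is undetectable on path A")
```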
  • 30. Test Coverage and Adequacy • 395 A sufficient condition of this inequality error space has been applied to improve is that M ° C B ° e ° C A (v 0 ) 0. The set domain-testing methods as well [Afifi et of functions that satisfy the equation al. 1992; Zeil et al. 1992], which are M ° C B ° e ° C A (v 0 ) 0 for all v 0 that discussed in Section 4. execute A and B is then the blindness Although both mutation testing and space of the test execution through the perturbation testing are aimed at de- path A followed by B. tecting faults in software, there are a A domain error due to an erroneous number of differences between the two computation function C n can be re- methods. First, mutation testing mainly vealed by a path A from the begin node deals with faults at the syntactical to node n and a path B from node n to a level, whereas perturbation testing fo- node m with predicate T m , if for some cuses on the functional differences. Sec- initial environment v 0 that causes the ond, mutation testing systematically execution of paths A and B to node m. generates mutants according to a given list of mutation operators. The pertur- Tm ° CB ° Cn ° CA v0 bation testing perturbs a program by considering perturbation functions Tm ° CB ° C ° CA v0 . drawn from a structured error space, such as a vector space. Third, in muta- In other words, the erroneous computa- tion testing, a mutant is killed if on a tion may cause domain error if the fault test case it produces an output different affects the evaluation of a predicate. from the output of the original program. The blindness space of a test path for Hence the fault is detected. However, detecting such perturbations of a com- for perturbation testing, the blindness putation function can be expressed by space is calculated according to the test the equation: T m ° C B ° e ° C A (v 0 ) 0, cases. A fault is undetected if it remains where e Cn C . in the blindness space. Finally, al- A solution to the two preceding equa- though mutation testing is more gener- tions was obtained for linear-dominated ally applicable in the sense that there programs, which are bounded vector are no particular restrictions on the programs with the additional restric- software under test, perturbation test- tions that C is the set of linear trans- ing guarantees that combinations of the formations and T is the set of linear faults can be detected if all simple functions [Zeil 1983]. The solution es- faults are detected, provided that the sentially consists of two parts: assign- error space is a vector space. ment blindness and equality blindness, as in predicate perturbation, and blind- 3.5 The RELAY Model ness due to computation C B masking out differences between C n and C . An As discussed in perturbation testing, it adequacy criterion similar to the predi- is possible for a fault not to cause an cate perturbation adequacy was also de- error in output even if the statement fined. containing the fault is executed. With The principle of perturbation testing the restriction to linear-domained pro- can be applied without the assumptions grams, Zeil provided conditions on about the computation function space C which a perturbation (i.e., a fault), can and the predicate function space T, but be exposed. selecting an appropriate error function The problem of whether a fault can be space E. 
space E. Zeil [1983] gave a characterization of the blindness space for vector error spaces without the condition of linearity. The principle of perturbation testing can also be applied to individual test cases [Zeil 1984]. The notion of error space has been applied to improve domain-testing methods as well [Afifi et al. 1992; Zeil et al. 1992].

The problem of whether a fault can be detected was addressed by Morrel [1990], where a model of error origination and transfer was proposed. An error is originated (called "created" by Morrel) when an incorrect state is introduced at some fault location, and it is
  • 31. 396 • Zhu et al. Figure 4. RELAY model of fault detection. transferred (called “propagated” by Mor- (1) origination of a potential error in rel) if it persists to the output. the smallest subexpression contain- The RELAY model proposed by Rich- ing the fault; ardson and Thompson [1988; 1993] was (2) computational transfer of the poten- built upon Morrel’s theory, but with tial error through each operator in more detailed analysis of how an error the node, thereby revealing a con- is transferred and the conditions of such text error; transfers. The errors considered within (3) data-flow transfer of that context the RELAY model are those caused by error to another node on the path particular faults in a module. A poten- that references the incorrect con- tial fault is a discrepancy between a text; node n in the module under test and the corresponding node n* in the hypotheti- (4) cycle through (2) and (3) until a cally correct module. This potential potential error is output. fault results in a potential error if the If there is no input datum for which a expression containing the fault is exe- potential error originates and all these cuted and evaluated to a value different transfers occur, then the potential fault from that of the corresponding hypo- is not a fault. This view of error detection thetically correct expression. Given a has an analogy in a relay race, hence the potential fault, a potential error origi- name of the model. Based on this view, nates at the smallest subexpression of the RELAY model develops revealing con- the node containing the fault that eval- ditions that are necessary and sufficient uates incorrectly. The potential error to guarantee error detection. Test data transfers to a superexpression if the su- are then selected to satisfy revealing con- perexpression evaluates incorrectly. ditions. When these conditions are in- Such error transfers are called compu- stantiated for a particular type of fault, tational transfer. To reveal an output they provide a criterion by which test error, execution of a potential fault data can be selected for a program so as must cause an error that transfers from to guarantee the detection of an error node to node until an incorrect output caused by any fault of that type. results, where an error in the function computed by a node is called a context 3.6 Specification-Mutation Testing error. If a potential error is reflected in the value of some variable that is refer- Specification-fault-based testing at- enced at another node, the error trans- tempts to detect faults in the implemen- fer is called a data-flow transfer. This tation that are derived from misinter- process of error transfer is illustrated in preting the specification or the faults in Figure 4. the specification. Specification-fault- Therefore the conditions under which based testing involves planting faults a fault is detected are: into the specification. The program that ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 32. Test Coverage and Adequacy • 397 implements the original specification is a test set is measured according to its then executed and the results of the ability to detect such faults. execution are checked against the origi- Error seeding is based on the assump- nal specification and those with planted tion that the artificial faults planted in faults. The adequacy of the test is deter- the program are as difficult to detect as mined according to whether all the the natural errors. This assumption has planted faults are detected by checking. proved not true in general. Gopal and Budd [1983] extended pro- Mutation analysis systematically and gram-mutation testing to specification- automatically generates a large number mutation testing for specifications writ- of mutants. A mutant represents a pos- ten in predicate calculus. They identified sible fault. It can be detected by the test a set of mutation operations that are cases if the mutant produces an output applied to a specification in the form of different from the output of the original pre/postconditions to generate mutants program. The most important issue in of the specification. Then the program mutation-adequacy analysis is the de- under test is executed on test cases to sign of mutation operators. The method obtain the output of the program. The is based on the competent-programmer input/output pair is used to evaluate the and coupling-effect hypotheses. The mutants of the specification. If a mutant principle of mutation-adequacy analysis is falsified by the test cases in the eval- can be extended to specification-based- uation, we say that it is killed by the adequacy analysis. Given a set of test test cases; otherwise it remains alive. cases and the program under test, mu- Gopal and Budd [1983] noticed that tation adequacy can be calculated auto- some alterations to specifications were matically except for detection of equiva- not useful as mutation operators. For lent mutants. example, replacing the various clauses However, measuring the adequacy of in the specification by the truth values software testing by mutation analysis is “true” or “false” tends to generate mu- expensive. It may require large compu- tants that are trivial to kill and appear tation resources to store and execute a to be of little practical significance. large number of mutants. It also re- In an investigation of mutation opera- quires huge human resources to deter- tors for algebraic specifications, Wood- mine if live mutants are equivalent to ward [1993] defined his set of mutation the original program. Reduction of the operators based on his analysis of errors testing expense to a practically accept- in the specifications made by students. able level has been an active research He considered algebraic specifications as topic. Variants such as weak mutation term-rewriting systems. The original testing, firm mutation testing, and or- specification and the mutants of the spec- dered mutation testing have been pro- ification are compiled into executable posed. Another approach not addressed codes. When the executions of the original in this article is the execution of mu- specification and a mutant on a given test tants in parallel computers [Choi et al. case generate two different outputs, the 1989; Krauser et al. 1991]. This re- mutant is regarded as dead. Otherwise, it quires the availability of massive paral- is alive. In this way the test adequacy is lel computers and very high portability measured without executing the program. 
of the software so that it can be executed on the target machine as well as the testing machine. In summary, although progress has been made to reduce the expense of adequacy measurement by mutation analysis, there remain open problems.

3.7 Summary of Fault-Based Adequacy Criteria

Fault-based adequacy criteria focus on the faults that could possibly be contained in the software. The adequacy of

Perturbation testing is concerned with the possible functional differences
  • 33. 398 • Zhu et al. between the program under test and the 4.1 Specification-Based Input Space hypothetical correct program. The ade- Partitioning quacy of a test set is decided by its The software input space can be parti- ability to limit the error space defined tioned either according to the program in terms of a set of functions. or the specification. When partitioning A test case may not reveal a fault. the input space according to the specifi- The RELAY model analyzed the condi- cation, we consider a subset of data as a tion under which a test case reveals a subdomain if the specification requires fault. Therefore, given a fault, test cases the same function on the data. can be selected to satisfy the conditions, For example, consider the following and hence guarantee the detection of specification of a module DISCOUNT the fault. This model can also be used as INVOICE, from Hall and Hierons a basis of test adequacy criteria. [1991]. Example 4.1 (Informal Specification 4. ERROR-BASED ADEQUACY CRITERIA of DISCOUNT INVOICE Module). A company produces two goods, X and Y, AND DOMAIN ANALYSIS with prices £5 for each X purchased and Error-based testing methods require £10 for each Y purchased. An order test cases to check programs on certain consists of a request for a certain num- error-prone points [Foster 1980; Myers ber of Xs and a certain number of Ys. 1979]. The basic idea of domain analysis The cost of the purchase is the sum of and domain testing is to partition the the costs of the individual demands for input-output behavior space into subdo- the two products discounted as ex- mains so that the behavior of the soft- plained in the following. If the total is ware on each subdomain is equivalent, greater than £200, a discount of 5% is in the sense that if the software behaves given; if the total is greater than £1,000, correctly for one test case within a sub- a discount of 20% is given; also, the domain, then we can reasonably assume company wishes to encourage sales of X that the behavior of the software is cor- and give a further discount of 10% if rect for all data within that subdomain. more than thirty Xs are ordered. Nonin- teger final costs are rounded down to We may wish, however, to take more give an integer value. than one test case within each subdo- When the input x and y to the module main in order to increase our confidence DISCOUNT INVOICE has the property in the conformance of the implementa- that x 30 and 5*x 10* y 200, the tion upon this subdomain. Given a par- output should be 5*x 10*y. That is, tition into subdomains, the question is for all the data in the subset {(x, y) how many test cases should be used for x 30, 5*x 10*y 200}, the re- each subdomain and where in the sub- quired computation is the same function domain these test cases should be cho- sum 5*x 10*y. Therefore the sub- sen. The answers to these questions are set should be one subdomain. A careful based on assumptions about the where- analysis of the required computation abouts of errors. However, theoretical will give the following partition of the studies have proved that testing on cer- input space into six subdomains, as tain sets of error-prone points can de- shown in Figure 5. tect certain sets of faults in the program It seems that there is no general me- [Afifi et al. 1992; Clarke et al. 1989; chanically applicable method to derive Richardson and Clarke 1985; Zeil 1984; partitions from specifications, even if 1983; 1989a; 1989b; 1992]. 
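To make the specification-based partition concrete, the sketch below classifies an order (x, y) into one of the six subdomains. It is only an illustration: the region labels are invented here (the paper's Figure 5 uses its own names for the regions), but the border conditions x > 30, 5x + 10y = 200 and 5x + 10y = 1000 are the ones stated in the specification.

```python
def subdomain(x: int, y: int) -> str:
    """Classify an order (x, y) into one of the six specification-based
    subdomains of the DISCOUNT INVOICE module.  The textual labels are
    illustrative; the defining conditions are taken from the specification."""
    if x < 0 or y < 0:
        raise ValueError("order quantities must be non-negative")
    total = 5 * x + 10 * y
    if total <= 200:
        band = "no total discount"
    elif total <= 1000:
        band = "5% total discount"
    else:
        band = "20% total discount"
    bulk = "10% bulk discount" if x > 30 else "no bulk discount"
    return band + ", " + bulk

# All inputs in one subdomain are specified to require the same function,
# e.g. every (x, y) with x <= 30 and 5x + 10y <= 200 maps to sum = 5x + 10y:
print(subdomain(10, 5), "|", subdomain(30, 5))   # both fall in the same region
```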
Before considering these problems, let us first look at how to partition the behavior space.

there is a formal functional specification. However, systematic approaches to analyzing formal functional specifications and deriving partitions have been
  • 34. Test Coverage and Adequacy • 399 Figure 5. Partition of input space of DISCOUNT INVOICE module. proposed by a number of researchers, The input data that satisfy a precon- such as Stocks and Carrington [1993]. dition predicate P i should constitute a In particular, for formal specifications subdomain. The corresponding compu- in certain normal forms, the derivation tation on the subdomain must satisfy of partitions is possible. Hierons [1992] the postcondition Q i . The precondition developed a set of transformation rules predicate is called the subdomain’s do- to transform formal functional specifica- main condition. When the domain con- tions written in pre/postconditions into dition can be written in the form of the the following normal form. conjunction of atomic predicates of in- equality, such as P1 x1 , x2 , . . . , xn exp1 x 1 , x 2 , . . . , x n ∧ Q1 x1 , x2 , . . . , xn , y1 , y2 , . . . , ym ∨ expr x 1 , x 2 , . . . , x n (4.1) P2 x1 , x2 , . . . , xn exp1 x 1 , x 2 , . . . , x n ∧ Q2 x1 , x2 , . . . , xn , y1 , y2 , . . . , ym ∨ expr x 1 , x 2 , . . . , x n , (4.2) ··· then the equation PK x1 , x2 , . . . , xn exp1 x 1 , x 2 , . . . , x n ∧ QK x1 , x2 , . . . , xn , y1 , y2 , . . . , ym ∨ expr x 1 , x 2 , . . . , x n (4.3) where P i (x 1 , x 2 , . . . , x n ), i 1, 2, . . . , K, are preconditions that give the condi- defines a border of the subdomain. tion on the valid input data and the Example 4.2 (Formal specification of state before the operation. Q i (x 1 , x 2 , the DISCOUNT INVOICE module). . . . , x n , y 1 , y 2 , . . . , y m ), i 1, 2, . . . , For the DISCOUNT INVOICE module, K, are postconditions that specify the a formal specification can be written as relationship between the input data, follows. output data, and the states before and after the operation. The variables x i are x 0∧y 0 f sum 5x 10y input variables and y i are output vari- ables. sum 200 f discount1 100 ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 35. 400 • Zhu et al. sum 200 ∧ sum 1000 Since sum 5*x 10*y, the borders of the region are the lines defined by the f discount1 95 following four equations: x 0, y 0, x 30, sum 1000 f discount1 80 ) 5*x 10*y 200. x 30 f discount2 100 They are the x axis, the y axis, the x 30 f discount2 90 borders and in Figure 5, respec- tively. total round sum discount1 Notice that the DISCOUNT INVOICE module has two input variables. Hence discount2/10000 , the border equations should be under- stood as lines in a two-dimensional where round is the function that rounds plane as in Figure 5. Generally speak- down a real number to an integer value; ing, the number of input variables is the x is the number of product X that a dimension of the input space. A border customer ordered; y is the number of of a subdomain in a K-dimensional product Y that the customer ordered; space is a surface of K 1 dimension. sum is the cost of the order before dis- count; discount1, discount2 are the per- centages of the first and second type of 4.2 Program-Based Input-Space discount, respectively; and total is the Partitioning grand total cost of the purchase. The software input space can also be When this specification is trans- partitioned according to the program. In formed into the normal form, we have a this case, two input data belong to the clause with the following predicate as same subdomain if they cause the same precondition. “computation” of the program. Usually, the same execution path of a program is x 0 & y 0 & sum 200 considered as the same computation. Therefore, the subdomains correspond & not sum 200 & sum 1000 to the paths in the program. When the program contains loops, a particular & not sum 1000 & x 30 subset of the paths is selected according to some criterion, such as those dis- & not x 30 cussed in Section 2.1. and the following predicate as the corre- Example 4.3 Consider the following sponding postcondition, program, which implements the DIS- COUNT INVOICE. discount1 100 & discount2 100 Program DISCOUNT_INVOICE & total round sum discount1 (x, y: Int) Var discount1, discount2: Int; discount2/10000 . input (x, y); if x 30 The precondition defines the region A in then discount2 : 100 the partition shown in Figure 5. It is else discount2 : 90 equivalent to endif; sum : 5*x 10*y; x 0 & y 0 & sum 200 if sum 200 then discount1 : 100 & x 30 . (4.4) elseif sum 1000 ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 36. Test Coverage and Adequacy • 401 Figure 6. Flow graph of DISCOUNT INVOICE program. then discount1 : 95 are missing in the partitioning accord- else discount1 : 80 ing to the program. endif; Now let us return to the questions output (round(sum*discount1 about where and how many test cases *discount2/10000)) end should be selected for each subdomain. If only one test case is required and if There are six paths in the program; see we do not care about the position of the Figure 6 for its flow graph. For each test case in the subdomain, then the path there is a condition on the input test adequacy is equivalent to path cov- data such that an input causes the exe- erage provided that the input space is cution of the path if and only if the partitioned according to program paths. condition is satisfied. Such a condition But error-based testing requires test is called the path condition, and can be cases selected not only within the sub- derived mechanically by, say, symbolic domains, but also on the boundaries, at execution [Girgis 1992; Howden 1977; vertices, and just off the vertices or the 1978; Liu et al. 1989; Young and Taylor boundaries in each of the adjacent sub- 1988]. Therefore, path conditions char- domains, because these places are tradi- acterize the subdomains (i.e., they are tionally thought to be error-prone. A domain conditions). Table II gives the test case in the subdomain is called an path conditions of the paths. on test point; a test case that lies out- Notice that, for the DISCOUNT IN- side the subdomain is called an off test VOICE example, partitioning according point. to program paths is different from the This answer stems from the classifica- partitioning according to specification. tion of program errors into two main The borders of the x axis and the y axis types: domain error and computation ACM Computing Surveys, Vol. 29, No. 4, December 1997
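A runnable transcription of the example program may help here, since the extracted listing lost its comparison operators. The operators below (x <= 30, sum <= 200, sum <= 1000) are reconstructed assumptions chosen to match the informal specification, and floor division plays the role of the round-down required by the specification.

```python
def discount_invoice(x: int, y: int) -> int:
    """Python transcription of the DISCOUNT_INVOICE example program.
    The relational operators are assumptions reconstructed from the
    informal specification."""
    discount2 = 100 if x <= 30 else 90        # extra 10% off for more than 30 Xs
    sum_ = 5 * x + 10 * y                     # undiscounted cost
    if sum_ <= 200:
        discount1 = 100
    elif sum_ <= 1000:
        discount1 = 95
    else:
        discount1 = 80
    # "round" in the specification means rounding *down* to an integer value
    return (sum_ * discount1 * discount2) // 10000

# One test case per program path would already satisfy path coverage:
print(discount_invoice(10, 5))    # x <= 30, sum <= 200            -> 100
print(discount_invoice(40, 5))    # x > 30,  200 < sum <= 1000     -> 213
```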
  • 37. 402 • Zhu et al. Table II. Paths and Path Conditions error. Domain errors are those due to An adequacy criterion stricter than the incorrectness in a program’s selec- N 1 adequacy is the N N criterion, tion of boundaries for a subdomain. which requires N test cases off the bor- Computation errors are those due to the der rather than only one test case off incorrectness of the implementation of the border. Moreover, these N test cases the computation on a given subdomain. are required to be linearly independent This classification results in two types [Clarke et al. 1982]. of domain testing. The first aims at the correctness of the boundaries for each Definition 4.2 (N N Domain Ad- subdomain. The second is concerned equacy). Let {D 1 , D 2 , . . . , D n } be the with the computation on each subdo- set of subdomains of software S that main. The following give two sets of has N input variables. A set T of test criteria for these two types of testing. cases is said to be N N domain-test adequate if, for each subdomain D i , i 4.3 Boundary Analysis 1, 2, . . . , n, and each border B of D i , there are at least N test cases on the White and Cohen [1980] proposed a test border B and at least N linearly inde- method called N 1 domain-testing pendent test cases just off the border B. strategy that requires N test cases to be If the border B is in the domain D i , the selected on the borders in an N-dimen- N test cases off the border should be off sional space and one test case just off test points, otherwise they should be on the border. This can be defined as the test points. following adequacy criterion. The focal point of boundary analysis Definition 4.1 (N 1 Domain Ad- is to test if the borders of a subdomain equacy). Let {D 1 , D 2 , . . . , D n } be the are correct. The N 1 domain-ade- set of subdomains of software S that quacy criterion aims to detect if there is has N input variables. A set T of test an error of parallel shift of a linear cases is said to be N 1 domain-test border, whereas the N N domain- adequate if, for each subdomain D i , i adequacy is able to detect parallel shift 1, 2, . . . , n, and each border B of D i , and rotation of linear borders. This is there are at least N test cases on the why the criteria select the specific posi- border B and at least one test case tion for the on and off test cases. Con- which is just off the border B. If the sidering that the vertices of a subdo- border is in the domain D i , the test case main are the points at the intersection off the border should be an off test of several borders, Clark et al. [1982] point, otherwise, the test case off the suggested the use of vertices as test border should be an on test point. cases to improve the efficiency of bound- ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 38. Test Coverage and Adequacy • 403 ary analysis. The criterion requires that By applying Zeil’s [1984; 1983] work the vertices are chosen as test cases and on error spaces (see also the discussion that for each vertex a point close to the of perturbation testing in Section 3.4), vertex is also chosen as a test case. Afifi et al. [1992] proved that any test set satisfying the N 2 domain-ade- Definition 4.3 (V V Domain Ad- quacy criterion can detect all linear er- equacy). Let {D 1 , D 2 , . . . , D n } be the rors of domain borders. They also pro- set of subdomains of software S. A set T vided a method of test-case selection to of test cases is said to be V V domain- satisfy the criterion consisting of the test adequate if, for each subdomain D i , following four steps. i 1, 2, . . . , n, T contains the vertices of D i and for each vertex v of D i there is (1) Choose (N 1) on test points in a test case just off the vertex v. If a general position; the selection of vertex v of D i is in the subdomain D i , these points should attempt to then the test case just off v should be an spread them as far from one another off test point; otherwise it should be an as possible and put them on or very on point. close to the border. (2) Determine the open convex region of The preceding criteria are effective these points. for detection of errors in linear domains, (3) If this region contains off points, where a domain is a linear domain if its then select one. borders are linear functions. For nonlin- ear domains, Afifi et al. [1992] proved (4) If this region has no off points, then that the following criteria were effective change each on point to be an off for detecting linear errors. point by a slight perturbation; now there are N 1 off points and there is near certainty of finding an on Definition 4.4 (N 2 Domain Ad- point in the new open convex region. equacy). Let {D 1 , D 2 , . . . , D n } be the set of subdomains of software S that has N input variables. A set T of test 4.4 Functional Analysis cases is said to be N 2 domain-test Although boundary analysis focuses on adequate if, for each subdomain D i , i border location errors, functional analy- 1, 2, . . . , n, and each border B of D i , sis emphasizes the correctness of com- there are at least N 2 test cases x 1 , putation on each subdomain. To see how x 2 , . . . , x N 2 in T such that functional analysis works, let us take the DISCOUNT INVOICE module as an —each set of (N 1) test cases are in example again. general position, where a set {x i } con- taining N 1 vectors is in general Example 4.4 Consider region A in position if the N vectors x i x 1, i Figure 5. The function to be calculated 2, 3, . . . , N 1, are linear-indepen- on this subdomain specified by the spec- dent; ification is —there is at least one on test point and total 5*x 10*y. (4.5) one off test point; —for each pair of test cases, if the two This can be written as f(x, y) 5x points are of the same type (in the 10y, a linear function of two variables. sense that both are on test points or Mathematically speaking, two points both are off test points), they should are sufficient to determine a linear lie on the opposite sides of the hyper- function, but one point is not. Therefore plane formed by the other n test at least two test cases must be chosen in points; otherwise they should lie on the subdomain. If the program also com- the same side of the hyperplane putes a linear function on the subdo- formed by the other n test points. 
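The following sketch shows what this style of test-point selection looks like for one linear border of the two-dimensional DISCOUNT INVOICE input space (the N = 2 case of the N x 1 strategy defined above): two linearly independent on points on the border and one point displaced just off it along the border's normal. The particular border, the on points and the displacement eps are illustrative choices, not a prescription from White and Cohen.

```python
import numpy as np

def on_off_points(a, b, eps=1e-3):
    """On/off test points for one linear border a . v = b of a subdomain in a
    two-dimensional input space: two on points lying on the border and one
    off point displaced by eps along the border's normal."""
    a = np.asarray(a, dtype=float)
    on1 = (b / a[0]) * np.array([1.0, 0.0])     # intersection with the x axis
    on2 = (b / a[1]) * np.array([0.0, 1.0])     # intersection with the y axis
    off = (on1 + on2) / 2 + eps * a / np.linalg.norm(a)   # just off the border
    return on1, on2, off

# Border 5x + 10y = 200 separating regions of the DISCOUNT INVOICE partition:
print(on_off_points([5, 10], 200))
```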
main and produces correct outputs on
  • 39. 404 • Zhu et al. the two test cases, then we can say that software S. Suppose that the required the program computes the correct func- function on subdomain D i is a multino- tion on the subdomain. mial in m variables of degree k i , i 1, Analyzing the argument given in the 2, . . . , n. A set T of test cases is said to preceding example, we can see that be functional-test adequate if, for each functional analysis is based on the fol- subdomain D i , i 1, 2, . . . , n, T lowing assumptions. First, it assumes contains at least (k i 1) m test cases that for each subdomain there is a set of cj c j,1 , c j,2 , . . . , c j,m in the subdo- functions f associated with the do- main D i such that the matrix T [b ij ] main. In the preceding example, f con- is nonsingular, where b ij t i (c j,1 , c j,2 , sists of linear functions of two variables. . . . , c j,m ), and t i is the same as in the Second, the specified computation of the preceding theorem. software on this domain is considered as a function f *, which is an element of f . 4.5 Summary of Domain-Analysis and Third, it is assumed that the function f Error-Based Test Adequacy Criteria computed by the software on the subdo- main is also an element of f . Finally, The basic idea behind domain analysis there is a method for selecting a finite is the classification of program errors number of test cases for f * such that if f into two types: computation errors and and f * agree on the test cases, then domain errors. A computation error is they are equivalent. reflected by an incorrect function com- In Section 3.4, we saw the use of an puted by the program. Such an error error space in defining the neighbor- may be caused, for example, by the exe- hood set f and the use of perturbation cution of an inappropriate assignment testing in detecting computation errors statement that affects the function com- [Zeil 1984; 1983]. puted within a path in the program. A In Howden’s [1978; 1987] algebraic domain error may occur, for instance, testing, the set of functions associated when a branch predicate is expressed with the subdomain is taken as the incorrectly or an assignment statement polynomials of degree less than or equal that affects a branch predicate is wrong, to k, where k is chosen as the degree of thus affecting the conditions under the required function, if it is a polyno- which the path is selected. A boundary- mial. The following mathematical theo- analysis adequacy criterion focuses on rem, then, provides a guideline for the the correctness of the boundaries, which selection of test cases in each subdo- are sensitive to domain errors. A func- main [Howden, 1987]. tional-analysis criterion focuses on the correctness of the computation, which is THEOREM 4.1 Suppose that contains sensitive to computation errors. They all multinomials in n variables x 1 , x 2 , should be used in a complementary . . . , x n of degree less than or equal to k, fashion. and f,n f * . Let f(x 1 , x 2 , . . . , x n ) It is widely recognized that software (k 1) i 1 a i t i (x 1 , x 2 , . . . , x n ), where t i (x 1 , testing should take both specification x 2, . . . , x n) x i 1x i 2, . . . , x i n, 0 1 2 n i 1, and program into account. A way to i 2, . . . , i n k. Then f and f * are combine program-based and specifica- identical if they agree on any set of m tion-based domain-testing techniques is (k 1) n values { c i,1 , c i,2 , . . . , c i,n first to partition the input space using i 1, 2, . . . 
, m} such that the matrix M = [b_ij] is nonsingular, where b_ij = t_i(c_j,1, c_j,2, . . . , c_j,n).

Definition 4.5 (Functional Adequacy) [Howden 1978; 1987]. Let {D_1, D_2, . . . , D_n} be the set of subdomains of

the two methods separately and then refine the partition by intersection of the subdomains [Gourlay 1983; Weyuker and Ostrand 1980; Richardson and Clarke 1985]. Finally, for each subdomain in the refined partition, the required function and the computed func-
  • 40. Test Coverage and Adequacy • 405 tion are checked to see if they belong to Weyuker and Jeng 1991; Woodward et the same set of functions, say, polyno- al. 1980]. The methods to compare test mials of degree K, and the test cases are adequacy criteria according to this mea- selected according to the set of functions sure can be classified into three types: to which they belong. statistical experiment, simulation, and A limitation of domain-analysis tech- formal analysis. The following summa- niques is that they are too complicated rizes research in these three ap- to be applicable to software that has a proaches. complex input space. For example, pro- cess control software may have se- 5.1.1 Statistical Experiments. The quences of interactions between the basic form of statistical experiments software and the environment system. with test adequacy criteria is as follows. It may be difficult to partition the input Let C 1 , C 2 , . . . , C n be the test ade- space into subdomains. quacy criteria under comparison. The Another shortcoming of boundary- experiment starts with the selection of analysis techniques is that they were a set of sample programs, say, P 1 , proposed for numerical input space such P 2 , . . . , P m . Each program has a collec- that the notions of the closeness of two tion of faults that are known due to points in the V V adequacy, and “just previous experience with the software, off a border” in the N N and N 1 or planted artificially, say, by applying adequacy can be formally defined. How- mutation operations. For each program ever, it is not so simple to extend these P i, i 1, 2, . . . , m, and adequacy notions to nonnumerical software such criterion C j , j 1, 2, . . . , n, k test sets j j j as compilers. T i 1, T i 2, . . . , T i k are generated in some j fashion so that T i u is adequate to test 5. COMPARISON OF TEST DATA program P i according to the criterion C j . j ADEQUACY CRITERIA The proportion of faults r i u detected by j the test set T i u over the known faults in Comparison of testing methods has al- the program Pi is calculated for every i ways been desirable. It is notoriously 1, 2, . . . , n, j 1, 2, . . . , m, and u difficult because testing methods are 1, 2, . . . , k. Statistical inferences are j defined using different models of soft- then made based on the data r i u, i 1, ware and based on different theoretical . . . , n, j 1, . . . , m, u 1, 2, . . . , k. foundations. The results are often con- For example, Ntafos [1984] compared troversial. branch coverage, random testing, and In comparing testing adequacy crite- required pair coverage. He used 14 ria, it must be made very clear in what small programs. Test cases for each pro- sense one criterion is better than an- gram were selected from a large set of other. There are three main types of random test cases and modified as such measures in the software testing needed to satisfy each of the three test- literature: fault-detecting ability, soft- ing strategies. The percentages of mu- ware reliability, and test cost. This sec- tants killed by the test sets were consid- tion reviews the methods and the main ered the fault-detection abilities. results of the comparisons. Hamlet [1989] pointed out two poten- tial invalidating factors in this method. 
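Before turning to those factors, it may help to see the shape of the data such an experiment produces. The sketch below fixes only the bookkeeping: the test-set generator and the fault-detection oracle are assumed hooks that a real experiment (for instance one based on mutation scores) would have to supply.

```python
def detection_rate(test_set, known_faults, detects):
    """Proportion of a program's known faults revealed by one test set.
    `detects(fault, test_case)` stands for the oracle (e.g. "this test case
    kills the mutant for this fault"); it is an assumed hook, not something
    defined in the text."""
    revealed = {f for f in known_faults if any(detects(f, t) for t in test_set)}
    return len(revealed) / len(known_faults)

def run_experiment(programs, criteria, k, generate_adequate_set, detects):
    """Collect the detection rates r[i, j] = [r_ij^1, ..., r_ij^k] for every
    program P_i (given as a (program, known_faults) pair) and criterion C_j.
    Statistical inference on these numbers is a separate, later step."""
    rates = {}
    for i, (program, faults) in enumerate(programs):
        for j, criterion in enumerate(criteria):
            rates[(i, j)] = [
                detection_rate(generate_adequate_set(program, criterion),
                               faults, detects)
                for _ in range(k)
            ]
    return rates
```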
5.1 Fault-Detecting Ability They are: Fault-detecting ability is one of the (1) A particular collection of programs most direct measures of the effective- must be used—it may be too small ness of test adequacy criteria [Basili or too peculiar for the results to be and Selby 1987; Duran and Ntafos trusted. 1984; Frankl and Weiss 1993; Frankl (2) Particular test data must be created and Weyuker 1988; Ntafos 1984; for each method—the data may have ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 41. 406 • Zhu et al. good or bad properties not related to higher fault-detection rate than func- the testing method. tional or structural testing did. Al- though functional testing detected Hamlet [1989] was also critical: “It is more faults than structural testing not unfair to say that a typical testing did, functional testing and structural experiment uses a small set of toy pro- testing did not differ in fault-detec- grams, with uncontrolled human gener- tion rate. In contrast, with advanced ation of the test data. That is, neither students, the three techniques were (1) nor (2) is addressed.” not different in fault-detection rate. In addition to these two problems, there is, in fact, another potential inval- —Software type. The number of faults idating factor in the method. That is: observed, fault-detection rate, and to- tal effort in software testing all (3) A particular collection of known clearly depended on the type of soft- faults in each program must be ware under test. used—it may not be representative —Fault type. Different testing tech- of the faults in software. The faults niques have different fault-detection could be too easy or too difficult to abilities for different types of faults. find, or there could be a different Code reading detected more interface distribution of the types of faults. faults than the other methods did, One particular test method or ade- and functional testing detected more quacy criterion may have a better control faults than the other methods ability to detect one type of fault did. than other types. To avoid the effects of the potential Their experiment results indicated the invalidating factors related to the test complexity of the task of comparing data, Basili and Selby [1987] used frac- software testing methods. tional factorial design methodology of Recently, Frankl and Weiss [1993] statistical experiments and repeated ex- used another approach to addressing perimentation. They compared two dy- the potential invalidating factor associ- namic testing methods (functional test- ated with the test data. They compared ing and statement coverage) and a branch adequacy and all-uses data-flow static testing method (code review) in adequacy criteria using nine small pro- three iterative phases involving 74 sub- grams of different subjects. Instead of jects (i.e., testers) of various back- using one adequate test set for each grounds. However, they only used four criterion, they generated a large num- sample programs. The faults contained ber of adequate test sets for each crite- in the sample programs include those rion and calculated the proportion of the made during the actual development of test sets that detect errors as an esti- the program as well as artificial faults. mate of the probability of detecting er- Not only the fault-detection ability, but rors. Their main results were: also the fault-detection cost with re- spect to various classes of faults were —for five of the nine subjects, the all- compared. Since the test sets for dy- uses criterion was more effective than namic testing are generated manually branch coverage at 99% confidence, and the static testing is performed man- where the effectiveness of an ade- ually, they also made a careful compar- quacy criterion is the probability that ison of human factors in testing. Their an adequate test set exposes an error main results were: when the test set is selected randomly according to a given distribution on —Human factors. 
With professional the adequate test sets of the criterion. programmers, code reading detected —in four of the nine subjects, all-uses more software faults and yielded a adequate test sets were more effective ACM Computing Surveys, Vol. 29, No. 4, December 1997
than branch-coverage adequate sets of similar size.

—for the all-uses criterion, the probability of detecting errors clearly depends on the extent of the adequacy only when the adequacy is very close to 100% (i.e., higher than 97%). In four of the subjects, such a dependence existed for the branch-coverage criterion over all the adequacy range.

The advantage of the statistical experiment approach is that it is not limited to comparing adequacy criteria based on a common model. However, given the potential invalidating factors, it is doubtful that it is capable of producing trustworthy results on pairs of adequacy criteria with subtle differences.

5.1.2 Simulation. Both simulation and formal analysis are based on a certain simplified model of software testing. The most famous research that uses simulation for comparing software test adequacy criteria are Duran and Ntafos' [1984] comparison of partition testing with random testing and Hamlet and Taylor's [1990] repetition of the comparison.

Duran and Ntafos modeled partition testing methods by a finite number of disjoint subsets of software input spaces. Suppose that the input space D of a given software system is partitioned into k disjoint subsets D_1, D_2, . . . , D_k and an input chosen at random has probability p_i of lying in D_i. Let $\theta_i$ be the failure rate for D_i; then

$$\theta = \sum_{i=1}^{k} p_i \theta_i \qquad (5.1)$$

is the probability that the program will fail to execute correctly on an input chosen at random in accordance with the input distribution. Duran and Ntafos investigated the probability that a test set will reveal one or more errors and the expected number of errors that a set of test cases will discover.^6 For random testing, the probability P_r of finding at least one error in n tests is $1 - (1 - \theta)^n$. For a partition testing method in which n_i test cases are chosen at random from subset D_i, the probability P_p of finding at least one error is given by

$$P_p = 1 - \prod_{i=1}^{k} (1 - \theta_i)^{n_i}. \qquad (5.2)$$

With respect to the same partition,

$$P_r = 1 - (1 - \theta)^n = 1 - \Big(1 - \sum_{i=1}^{k} p_i \theta_i\Big)^n, \quad \text{where } n = \sum_{i=1}^{k} n_i. \qquad (5.3)$$

The expected number E_p of errors discovered by partition testing is given by

$$E_p = \sum_{i=1}^{k} \theta_i \qquad (5.4)$$

if one random test case is chosen from each D_i, where $\theta_i$ is the failure rate for D_i.

If in total n random test cases are used in random testing, some set of actual values $\bar{n} = \{n_1, n_2, . . . , n_k\}$ will occur, where n_i is the number of test cases that fall in D_i. If $\bar{n}$ were known, then the expected number of errors found would be

$$E(n, k, \bar{n}) = \sum_{i=1}^{k} \big[1 - (1 - \theta_i)^{n_i}\big]. \qquad (5.5)$$

^6 Note that an error means that the program's output on a specific input does not meet the specification of the software.
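Equations (5.1) to (5.5) translate directly into code. The sketch below is a plain implementation of those formulas; the failure rates and the operational profile in the usage example are made-up numbers that only show the calling convention, not data from Duran and Ntafos.

```python
from math import prod

def p_partition(thetas, ns):
    """P_p, Eq. (5.2): probability that partition testing with n_i random
    test cases drawn from subdomain D_i finds at least one error."""
    return 1 - prod((1 - th) ** n for th, n in zip(thetas, ns))

def p_random(thetas, ps, n):
    """P_r, Eq. (5.3): probability that n purely random test cases find at
    least one error, with theta = sum_i p_i * theta_i as in Eq. (5.1)."""
    theta = sum(p * th for p, th in zip(ps, thetas))
    return 1 - (1 - theta) ** n

def e_partition(thetas, ns):
    """Expected number of subdomains in which an error is found, Eq. (5.5);
    with one test case per subdomain it reduces to Eq. (5.4), sum_i theta_i."""
    return sum(1 - (1 - th) ** n for th, n in zip(thetas, ns))

# Illustrative numbers only: four subdomains, equal profile, one high failure rate.
thetas = [0.98, 0.049, 0.049, 0.049]
ps     = [0.25, 0.25, 0.25, 0.25]
print(p_partition(thetas, [1, 1, 1, 1]))   # one test case per subdomain
print(p_random(thetas, ps, 4))             # four random test cases overall
```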
  • 43. 408 • Zhu et al. Since n is not known, the expected imulation can be performed in an ideal number E r of errors found by n random testing scenario to avoid some of the tests is complicated human factors. However, because simulation is based on a certain simplified model of software testing, the E r k, n E n, k, n n! realism of the simulation result could be n questionable. For example, in Duran and Ntafos’ experiment, the choice of k k k the particular hypothetical distribution p ni ni k 1 pi n . (i.e., for 2 percent of the time i 0.98 i i and for 98 percent of the time i i 1 i 1 i 1 0.049) seems rather arbitrary. (5.6) 5.1.3 Formal Analysis of the Relation- They conducted simulations of vari- ships among Adequacy Criteria. One ous program conditions in terms of i of the basic approaches to comparing distributions and K, and calculated and adequacy criteria is to define some rela- compared the values of P p , P r , E p , and tion among adequacy criteria and to E r . Two types of i distributions were prove that the relation holds or does not investigated. The first type, called hypo- hold for various pairs of criteria. The thetical distributions, considered the majority of such comparisons in the lit- situation where a test case chosen from erature use the subsume ordering, a subset would have a high probability which is defined as follows. of finding an error affecting that subset. Definition 5.1 (Subsume Relation Therefore the i s were chosen from a among Adequacy Criteria). Let C 1 and distribution such that 2 percent of the C 2 be two software test data adequacy time i 0.98 and 98 percent of the criteria. C 1 is said to subsume C 2 if for time, i 0.049. The second type of i all programs p under test, all specifica- distribution was uniform distributions; tions s and all test sets t, t is adequate that is, the i s were allowed to vary according to C 1 for testing p with re- uniformly from 0 to a value max that spect to s implies that t is adequate varies from 0.01 to 1. according to C 2 for testing p with re- The main result of their simulation spect to s. was that when the fault-detecting abil- ity of 100 simulated random test cases Rapps and Weyuker [1985] studied was compared to that of 50 simulated the subsume relation among their origi- partition test cases, random testing was nal version of data-flow adequacy crite- superior. This was considered as evi- ria, which were not finitely applicable. dence to support one of the main conclu- Frankl and Weyuker [1988] later used sions of the paper, that random testing this relation to compare their revised would be more cost-effective than parti- feasible version of data-flow criteria. tion testing, because performing 100 They found that feasibility affects the random tests was considered less expen- subsume relation among adequacy crite- sive than 50 partition tests. ria. In Figure 7, the subsume relation Considering Duran and Ntafos’ result that holds for the infeasible version of counterintuitive, Hamlet and Taylor criteria but does not hold for the feasi- [1990] did more extensive simulation ble version of the criteria is denoted by and arrived at more precise statements an arrow marked with the symbol “*”. about the relationship between parti- Ntafos [1988] also used this relation to tion probabilities, failure rates, and ef- compare all the data-flow criteria and fectiveness. But their results corrobo- several other structural coverage crite- rated Duran and Ntafos’ results. ria. 
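The simulation comparison of random and partition testing discussed above is easy to reproduce in outline. The sketch below follows the hypothetical-distribution setup (theta_i = 0.98 with probability 0.02, otherwise 0.049) and compares 50 partition test cases, one per subdomain, with 100 purely random test cases. The assumption of 50 equally likely subdomains, the trial count and the seed are choices made here for illustration, not parameters reported in the original study.

```python
import random

def simulate(k=50, n_random=100, trials=2000, seed=1):
    """Monte Carlo sketch: k subdomains with equal probability 1/k, failure
    rates drawn from the 'hypothetical' distribution; partition testing uses
    one test case per subdomain, random testing uses n_random test cases."""
    rng = random.Random(seed)
    found_partition = found_random = 0
    for _ in range(trials):
        thetas = [0.98 if rng.random() < 0.02 else 0.049 for _ in range(k)]
        # partition testing: one random test case in each subdomain
        found_partition += any(rng.random() < th for th in thetas)
        # random testing: each test case lands in a uniformly chosen subdomain
        found_random += any(rng.random() < rng.choice(thetas)
                            for _ in range(n_random))
    return found_partition / trials, found_random / trials

print(simulate())   # estimated chance of finding an error: 50 partition vs. 100 random tests
```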
In contrast to statistical experiment,

Among many other works on comparing testing methods by subsume re-
  • 44. Test Coverage and Adequacy • 409 Figure 7. Subsume relation among adequacy criteria. lation are those by Weiser et al. [1985] Weyuker [1993a] proved that the fact and Clark et al. [1989]. Since the ade- that C 1 subsumes C 2 does not always quacy criteria based on flow analysis guarantee that C 1 is better at detecting have a common basis, most of them can faults. be compared; they fall into a fairly sim- Frankl and Weyuker investigated ple hierarchy, as shown in Figure 7. A whether a relation on adequacy criteria relatively complete picture of the rela- can guarantee better fault-detecting tions among test adequacy criteria can ability for subdomain testing methods. be built. However, many methods are The testing methods they considered incomparable; that is, one does not sub- are those by which the input space is sume the other. divided into a multiset of subspaces so The subsume relation is actually a that in each subspace at least one test comparison of adequacy criteria accord- case is required. Therefore a subdomain ing to the severity of the testing meth- test adequacy criterion is equivalent to ods. The subsume relation evaluates ad- a function that takes a program p and a equacy criteria in their own terms, specification s and gives a multiset7 without regard for what is really of in- {D 1 , D 2 , . . . , D k } of subdomains. They terest. The relation expresses nothing about the ability to expose faults or to 7 In a multiset an element may appear more than assure software quality. Frankl and once. ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 45. 410 • Zhu et al. Figure 8. Relationships among partial orderings on adequacy criteria. defined a number of partial orderings properly covers C 2 for ( p, s) if there is on test adequacy criteria and studied a multiset M {D 1,1 , D 1,2 , . . . , whether these relations are related to D 1,k1 , . . . , D n,1 , . . . , D n,kn } C1 ( p, the ability of fault detection. The follow- s) such that E i D i,1 D i,2 ... ing are the relations they defined, D i,ki for all i 1, 2, . . . , n. which can be regarded as extensions of —Let C1 ( p, s) {D 1 , D 2 , . . . , D m }, the subsume relation. C2 ( p, s) {E 1 , E 2 , . . . , E n }. C 1 Definition 5.2 (Relations among Test properly partitions C 2 for ( p, s) if Adequacy Criteria). Let C 1 and C 2 be there is a multiset M {D 1,1 , . . . , two subdomain test adequacy criteria D 1,k1 , . . . , D n,1 , . . . , D n,kn } C1 ( p, and c1 (p, s) and c2 (p, s) be the multi- s) such that E i D i,1 D i,2 ... set of the subsets of the input space for D i,ki for all i 1, 2, . . . , n, and for a program p and a specification s ac- each i 1, 2, . . . , n, the collection cording to C 1 and C 2 , respectively. {D i,1 , . . . , D i,ki } is pairwise disjoint. For each relation R previously defined, —C 1 narrows C 2 for (p, s) if for every Frankl and Weyuker defined a stronger D C2 (p, s), there is a subdomain relation, called universal R, which re- D C1 (p, s) such that D D. quires R to hold for all specifications s —C 1 covers C 2 for ( p, s) if for every D and all programs p. These relations C2 ( p, s), there is a nonempty collec- have the relationships shown in Figure tion of subdomains {D 1 , D 2 , . . . , D n } 8. belonging to C1 ( p, s) such that D 1 To study fault-detecting ability, D2 . . . Dn D. Frankl and Weyuker considered two —C 1 partitions C 2 for ( p, s) if for every idealized strategies for selecting the D C2 ( p, s), there is a nonempty test cases. The first requires the tester collection of pairwise disjoint subdo- to independently select test cases at mains {D 1 , D 2 , . . . , D n } belonging to random from the whole input space ac- C1 ( p, s) such that D 1 D2 ... cording to a uniform distribution until Dn D. the adequacy criterion is satisfied. The —Let C1 ( p, s) {D 1 , D 2 , . . . , D m }, second strategy assumes that the input C2 ( p, s) {E 1 , E 2 , . . . , E n }. C 1 space has first been divided into subdo- ACM Computing Surveys, Vol. 29, No. 4, December 1997
Table III. Relationships among Fault-Detecting Ability and Orderings on Adequacy Criteria (k_i, i = 1, 2, is the number of subdomains in C_i(p, s))

mains and then requires independent random selection of a predetermined number n of test cases from each subdomain. They used three different measures, M_1, M_2, and M_3 in the following definition of fault-detecting ability. In a later paper Frankl and Weyuker [1993b] also studied the expected number of errors detected by a test method. This measure is E(C, p, s) in the following definition. Notice that these measures are the same as those used by Duran and Ntafos [1984] except $\theta_i$ is replaced by $m_i/d_i$; see also Section 5.1.2.

Definition 5.3 (Measures of Fault-Detection Ability). Let C(p, s) = {D_1, D_2, . . . , D_k} be the multiset of the subdomains of the input space of (p, s) according to the criterion C, d_i = |D_i|, and m_i be the number of fault-causing inputs in D_i, i = 1, . . . , k. Define:

$$M_1(C, p, s) = \max_{1 \le i \le k} \frac{m_i}{d_i} \qquad (5.7)$$

$$M_2(C, p, s) = 1 - \prod_{i=1}^{k} \Big(1 - \frac{m_i}{d_i}\Big) \qquad (5.8)$$

$$M_3(C, p, s, n) = 1 - \prod_{i=1}^{k} \Big(1 - \frac{m_i}{d_i}\Big)^{n}, \qquad (5.9)$$

where $n \ge 1$ is the number of test cases in each subdomain.

$$E(C, p, s) = \sum_{i=1}^{k} \frac{m_i}{d_i}. \qquad (5.10)$$

According to Frankl and Weyuker, the first measure M_1(C, p, s) gives a crude lower bound on the probability that an adequate test set selected using either of the two strategies will expose at least one fault. The second measure M_2(C, p, s) gives the exact probability that a test set chosen by using the first sampling strategy will expose at least one fault. M_3(C, p, s, n) is a generalization of M_2 by taking the number n of test cases selected in each subdomain into account. It should be noted that M_2 and M_3 are special cases of Equation (5.2) for P_p in Duran and Ntafos' work with $n_i \equiv 1$ and $n_i \equiv n$, respectively. It is obvious that M_2 and M_3 are accurate only for the first sampling strategy because for the second strategy it is unreasonable to assume that each subdomain has exactly n random test cases.

Frankl and Weyuker [1993a,b] proved that different relations on the adequacy criteria do relate to the fault-detecting ability. Table III summarizes their results.

Given the fact that the universal narrows relation is equivalent to the subsume relation for subdomain testing methods, Frankl and Weyuker concluded that "C_1 subsumes C_2" does not guarantee a better fault-detecting ability for C_1. However, recently Zhu [1996a] proved that in certain testing scenarios the subsumes relation can imply better fault-detecting ability. He identified two software testing scenarios. In the first scenario, a software tester is asked to satisfy a particular adequacy criterion and generates test
  • 47. 412 • Zhu et al. Figure 9. Universally properly cover relation among adequacy criteria. cases specifically to meet the criterion. 5.2 Software Reliability This testing scenario is called the prior testing scenario. Frankl and Weyuker’s The reliability of a software system that second sampling strategy belongs to has passed an adequate test is a direct this scenario. In the second scenario, measure of the effectiveness of test ade- the tester generates test cases without quacy criteria. However, comparing test any knowledge of test adequacy crite- adequacy criteria with respect to this rion. The adequacy criterion is only measure is difficult. There is little work used as a stopping rule so that the of this type in the literature. A recent tester stops generating test cases only if breakthrough is the work of Tsoukalas the criterion is satisfied. This scenario et al. [1993] on estimation of software is called posterior testing scenario. reliability from random testing and par- Frankl and Weyuker’s first sampling tition testing. strategy belongs to the posterior sce- Motivated to explain the observations nario. It was proved that in the poste- made in Duran and Ntafos’ [1984] ex- rior testing scenario, the subsume rela- periment on random testing and parti- tion does imply better fault-detecting tion testing, they extended the Thayer- ability in all the probability measures Lipow-Nelson reliability model [Thayer for error detection and for the expected et al. 1978] to take into account the cost number of errors. In the posterior test- of errors, then compared random testing ing scenario, the subsume relation also with partition testing by looking at the implies more test cases [Zhu 1996a]. upper confidence bounds when estimat- Frankl and Weyuker [1993b] also in- ing the cost-weighted failure rate. vestigated how existing adequacy crite- Tsoukalas et al. [1993] considered the ria fit into the partial orderings. They situation where the input space D is studied the properly cover relation on a partitioned into k pairwise disjoint sub- subset of data-flow adequacy criteria, a sets so that D D1 D2 ... D k. limited mutation adequacy (which is ac- On each subdomain D i , there is a cost tually branch coverage), and condition penalty c i that would be incurred by the coverage. The results are shown in Fig- program’s failure to execute properly on ure 9. an input from D i , and a probability p i ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 48. Test Coverage and Adequacy • 413 that a randomly selected input will be- [1990]. They showed that in some cases long to D i . They defined the cost- random testing can perform much bet- weighted failure rate for the program as ter than partition testing, especially a whole as when only one test case is selected in each subdomain. k This work has a number of practical C ci pi i . (5.11) implications, especially for safety-criti- i 1 cal software where there is a significant penalty for failures in some subdo- Usually, the cost-weighted failure rate mains. For example, it was proved that is unknown in advance, but can be esti- when no failures are detected, the opti- mated by mum way to distribute the test cases over the subdomains in partition testing k is to select n i proportional to c i p i , 0 C c i p i f i /n i , (5.12) i k, rather than select one test case in i 1 each subdomain. where f i is the number of failures ob- 5.3 Test Cost served over D i and n i is the number of random test cases within the subdo- As testing is an expensive software de- main D i . Given the total number n of velopment activity, the cost of testing to test cases, Equation (5.13) gives the achieve a certain adequacy according to maximum likelihood estimate of the a given criterion is also of importance. cost-weighted failure rate C. Comparison of testing costs involves many factors. One of the simplified k measures of test cost is the size of an C c i f i /n (5.13) adequate test set. Weyuker’s [1988c, i 1 1993] work on the complexity of data- flow adequacy criteria belongs to this But Tsoukalas et al. [1993] pointed out category. that the real issue is how confident one In 1988, Weyuker [1988c] reported an can be in the estimate. To this end they experiment with the complexity of data- sought an upper confidence bound on C. flow testing. Weyuker used the suite of Therefore, considering finding the upper programs in Software Tools in Pascal by bound as a linear programming problem Kernighan and Plauger [1981] to estab- k that maximizes i 1 cipi i subject to cer- lish empirical complexity estimates for tain conditions, they obtained expressions Rapps and Weyuker’s family of data- for the upper bounds in the two cases of flow adequacy criteria. The suite con- random testing and partition testing and sists of over 100 subroutines. Test cases calculated the upper bounds for various were selected by testers who were in- special cases. structed to select “natural” and “atomic” The conclusion was that confidence is test cases using the selection strategy of more difficult to achieve for random their choice, and were not encouraged to testing than for partition testing. This select test cases explicitly to satisfy the agrees with the intuition that it should selected data-flow criterion. Then seven be easier to have a certain level of con- statistical estimates were computed for fidence for the more systematic of the each criterion from the data (d i , t i ) two testing methods. That is, it is easier where d i denoted the number of deci- to put one’s faith in partition testing. sion statements in program i and t i But they also confirmed the result of denoted the number of test cases used the empirical studies of the effective- to satisfy the given criterion for pro- ness of random testing by Duran and gram i. Among the seven measures, the Ntafos [1984] and Hamlet and Taylor most interesting ones are the least ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 49. 414 • Zhu et al. squares line t d (where t is the cases needed to satisfy a criterion might number of test cases sufficient to satisfy yield an optimistic picture of the real the given criteria for the subject pro- effort needed to accomplish the testing. gram and d is the number of decision Offutt et al. [1993] studied the cost of statements in that program), and the mutation testing with a sample set of 10 weighted average of the ratios of the programs. They used simple and multi- number of decision statements in a sub- ple linear regression models to establish ject program to the number of test cases a relationship among the number of sufficient to satisfy the selected crite- generated mutants and the number of rion. Weyuker observed that (1) the co- lines of code, the number of variables, efficients of d in the least-square lines the number of variable references, and were less than 1 for all the cases and (2) the number of branches. The linear re- the coefficients were very close to the gression models provide a powerful ve- average number of test cases required hicle for finding functional relationships for each decision statement. Based on among random variables. The coeffi- these observations, Weyuker concluded cient of determination provides a sum- that the experiment results reinforced mary statistic that measures how well our intuition that the relationship be- the regression equation fits the data, tween the required number of test cases and hence is used to decide whether a and the program size was linear. A re- relationship between some data exists. sult similar to Weyuker’s is also ob- Offutt et al. used a statistical package served for the all du-paths criterion by to calculate the coefficient of determina- Bieman and Schultz [1992]. tion of the following formulas. The programs in the Kernighan and 2 Plauger suite used in Weyuker’s experi- Y mutant 0 1 X line 2 X line ment were essentially the same type, 2 well designed and modularized, and Y mutant 0 1 X var 2 X var hence relatively small in size. Address- ing the problem that whether substan- Y mutant 0 1 X var 2 X varref tially larger, unstructured and modular- ized programs would require larger 3 X var X varref , amounts of test data relative to their where X line is the number of lines in the size, Weyuker [1993] conducted another program, X var is the number of vari- experiment with data-flow test ade- ables in the program, X varref is the num- quacy criteria. She used a subset of ber of variable references in the pro- ACM algorithms in Collected Algo- gram, and Y mutant is the number of rithms from ACM (Vol. 1, ACM Press, mutants. They found that for single New York, 1980) as sample programs units, the coefficients of determination that contained known faults and five or of the formulas were 0.96, 0.96, and more decision statements. In this study, 0.95, respectively. For programs of mul- the same information was collected and tiple units, they established the follow- the same values were calculated. The ing formula with a coefficient of deter- same conclusion was obtained. How- mination of 0.91. ever, in this study, Weyuker observed a large proportion of infeasible paths in Y mutant X var X varref 0 1 2 data-flow testing. The average rates of infeasible paths for the all du-path cri- 3 X unit 4 X varX varref , terion were 49 and 56% for the two groups of programs, respectively. where X unit is the number of units in Weyuker pointed out that the large the program. 
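A generic least-squares sketch of the multiple-unit model quoted above is given below. The model form (coefficients beta_0 to beta_4 on X_var, X_varref, X_unit and the product X_var * X_varref) follows the formula in the text; the fitting procedure and the R^2 computation are ordinary least squares, not the particular statistical package Offutt et al. used, and the measurements must be supplied by the caller rather than invented here.

```python
import numpy as np

def fit_mutant_model(x_var, x_varref, x_unit, y_mutant):
    """Ordinary least-squares fit of
        Y_mutant = b0 + b1*X_var + b2*X_varref + b3*X_unit + b4*X_var*X_varref
    returning the coefficient vector and the coefficient of determination R^2.
    Each argument is a sequence with one measurement per program."""
    x_var = np.asarray(x_var, dtype=float)
    x_varref = np.asarray(x_varref, dtype=float)
    x_unit = np.asarray(x_unit, dtype=float)
    y = np.asarray(y_mutant, dtype=float)
    X = np.column_stack([np.ones_like(y), x_var, x_varref, x_unit,
                         x_var * x_varref])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    r_squared = 1 - residuals @ residuals / ((y - y.mean()) @ (y - y.mean()))
    return beta, r_squared
```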
Therefore their conclusion number of infeasible paths implies that was that the number of mutants is qua- assessing the cost of data-flow testing dratic. As mentioned in Section 3, Of- only in terms of the number of test futt et al. also studied the cost of selec- ACM Computing Surveys, Vol. 29, No. 4, December 1997
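To make the regression analysis concrete, the following sketch (our illustration, not Offutt et al.'s actual statistical package; the measurements are hypothetical) fits the single-unit quadratic model Y_mutant = \beta_0 + \beta_1 X_line + \beta_2 X_line^2 by least squares and reports its coefficient of determination:

```python
import numpy as np

# Hypothetical measurements: lines of code and number of generated mutants
# for a small sample of program units (illustration only).
x_line   = np.array([30, 55, 80, 120, 150, 200, 260, 310], dtype=float)
y_mutant = np.array([210, 540, 980, 1900, 2800, 4700, 7600, 10500], dtype=float)

# Fit Y_mutant = b2*X_line^2 + b1*X_line + b0 by ordinary least squares.
b2, b1, b0 = np.polyfit(x_line, y_mutant, deg=2)

# Coefficient of determination R^2: how well the quadratic model fits the data.
predicted = np.polyval([b2, b1, b0], x_line)
ss_res = float(np.sum((y_mutant - predicted) ** 2))
ss_tot = float(np.sum((y_mutant - y_mutant.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot

print(f"Y = {b0:.1f} + {b1:.2f} X + {b2:.4f} X^2,  R^2 = {r_squared:.3f}")
```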
5.4 Summary of the Comparison of Adequacy Criteria

Comparison of test data adequacy criteria can be made with respect to various measures, such as fault detection, confidence in the reliability of the tested software, and the cost of software testing. Various abstract relations among the criteria, such as the subsume relation, have also been proposed and relationships among adequacy criteria have been investigated. The relationship between such relations and fault-detecting ability has also been studied. Recent research results have shown that partition testing more easily achieves high confidence in software reliability than random testing, though random testing may be more cost-effective.

6. AXIOMATIC ASSESSMENT OF ADEQUACY CRITERIA

In Section 5, we have seen two kinds of rationale presented to support the use of one criterion or another, yet there is no clear consensus. The first kind of rationale uses statistical or empirical data to compare testing effectiveness, whereas the second type analytically compares test data adequacy criteria based on certain mathematical models of software testing.

In this section we present a third type of rationale, the axiomatic study of the properties of adequacy criteria. The basic idea is to seek the most fundamental properties of software test adequacy and then check if the properties are satisfied for each particular criterion. Although the idea of using abstract properties as requirements of test adequacy criteria can be found in the literature, such as Baker et al. [1986], Weyuker [1986, 1988a] is perhaps the first who explicitly and clearly employed axiom systems. Work following in this direction includes refinement and improvement of the axioms [Parrish and Zweben 1991; 1993; Zhu and Hall 1993; Zhu et al. 1993; Zhu 1995a] as well as analysis and criticism [Hamlet 1989; Zweben and Gourlay 1989].

There are four clearcut roles that axiomatization can play in software testing research, as already seen in physics and mathematics. First, it makes explicit the exact details of a concept or an argument that, before the axiomatization, was either incomplete or unclear. Second, axiomatization helps to isolate, abstract, and study more fully a class of mathematical structures that have recurred in many important contexts, often with quite different surface forms, so that a particular form can draw on results discovered elsewhere. Third, it can provide the scientist with a compact way to conduct a systematic exploration of the implications of the postulates. Finally, the fourth role is the study of what is needed to axiomatize a given empirical phenomenon and what can be axiomatized in a particular form. For instance, a number of computer scientists have questioned whether formal properties can be defined to distinguish one type of adequacy criteria from another when criteria are expressed in a particular form such as predicates [Hamlet 1989; Zweben and Gourlay 1989].

Therefore we believe that axiomatization will improve our understanding of software testing and clarify our notion of test adequacy. For example, as pointed out by Parrish and Zweben [1993], the investigation of the applicability of test data adequacy criteria is an excellent example of the role that axiomatization has played. Applicability was proposed as an axiom by Weyuker in 1986, requiring a criterion to be applicable to any program in the sense of the existence of test sets that satisfy the criterion. In the assessment of test data adequacy criteria against this axiom, the data-flow adequacy criteria originally proposed by Rapps and Weyuker in 1985 were found not applicable.
This leads to the redefinition of the criteria and reexamination of the power of the criteria [Frankl and Weyuker 1988].

Weyuker's applicability requirements were defined on the assumption that the program has a finite representable input data space. Therefore any adequate test set is finite. When this assumption is removed and adequacy criteria were considered as measurements, Zhu and Hall [1993] used the term finite applicability to denote the requirement that for any software and any adequacy degree r less than 1 there is a finite test set whose adequacy is greater than r. They also found that finite applicability can be derived from other more fundamental properties of test data adequacy criteria.

6.1 Weyuker's Axiomatization of Program-Based Adequacy Criteria

Weyuker [1986] proposed an informal axiom system of test adequacy criteria. Her original purpose of the system was to present a set of properties of ideal program-based test adequacy criteria and use these properties to assess existing criteria. Regarding test adequacy criteria as stopping rules, Weyuker's axiom system consists of eleven axioms, which were later examined, formalized, and revised by Parrish and Zweben [1991; 1993].

The most fundamental properties of adequacy criteria proposed by Weyuker were those concerning applicability. She distinguished the following three applicability properties of test adequacy criteria.

AXIOM A1 (Applicability). For every program, there exists an adequate test set.

Assuming the finiteness of representable points in the input data space, Weyuker rephrased the axiom into the following equivalent form.

AXIOM A1. For every program, there exists a finite adequate test set.

Weyuker then analyzed exhaustive testing and pointed out that, although exhaustive testing is adequate, an adequacy criterion should not always ask for exhaustive testing. Hence she defined the following nonexhaustive applicability.

AXIOM A2 (Nonexhaustive Applicability). There is a program p and a test set t such that p is adequately tested by t and t is not an exhaustive test set.

Notice that by exhaustive testing Weyuker meant the test set of all representable points of the specification domain. The property of central importance in Weyuker's system is monotonicity.

AXIOM A3 (Monotonicity). If t is adequate for p and t \subseteq t', then t' is adequate for p.

AXIOM A4 (Inadequate Empty Set). The empty set is not adequate for any program.

Weyuker then studied the relationships among test adequacy and program syntactic structure and semantics. However, her axioms were rather negative. They stated that neither semantic closeness nor syntactic structure closeness are sufficient to ensure that two programs require the same test data. Moreover, an adequately tested program does not imply that the components of the program are adequately tested, nor does adequate testing of components imply adequate testing of the program.
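To make the stopping-rule reading of these axioms concrete, here is a minimal sketch (our illustration, not Weyuker's formalization) that treats a simple statement-coverage criterion as a predicate over test sets and spot-checks the monotonicity (A3) and inadequate-empty-set (A4) properties on a hypothetical program:

```python
# A program is abstracted as the set of statements each input exercises; an
# adequacy criterion, read as a stopping rule, is a predicate over test sets.
covered_by = {            # hypothetical program: input -> statements executed
    0: {"s1", "s2"},
    1: {"s1", "s3"},
    2: {"s1", "s2", "s4"},
}
all_statements = {"s1", "s2", "s3", "s4"}

def statement_coverage_adequate(test_set):
    """Stopping rule: a test set is adequate iff it executes every statement."""
    executed = set()
    for case in test_set:
        executed |= covered_by[case]
    return executed == all_statements

# A4 (inadequate empty set): the empty test set must not be adequate.
assert not statement_coverage_adequate(set())

# A3 (monotonicity): if t is adequate and t is a subset of t2, then t2 is
# adequate.  This is only a spot-check on a few supersets, not a proof.
t = {1, 2}                                  # executes s1, s2, s3, s4
assert statement_coverage_adequate(t)
for extra in ({0}, {0, 1}):
    assert statement_coverage_adequate(t | extra)
```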
Weyuker assessed some test adequacy criteria against the axioms. The main discovery of the assessment is that the mutation adequacy criterion defined in its original form does not satisfy the monotonicity and applicability axioms, because it requires a correctness condition. The correctness condition requires that the program under test produces correct output on a test set if it is to be considered adequate. For all the adequacy criteria we have discussed, it is possible for a program to fail on a test case after producing correct outputs on an adequate test set. Hence the correctness condition causes a conflict with monotonicity. The correctness condition does not play any fundamental role in mutation adequacy, so Weyuker suggested removal of this condition from the definition.^8

(^8 The definition of mutation adequacy used in this article does not use the correctness condition.)

Aware of the insufficiency of the axiom system, Weyuker [1988a] later proposed three additional axioms. The renaming property requires that a test set be adequate for a program p if and only if it is adequate for a program obtained by systematically renaming the variables in p. The complexity property requires that for every natural number n, there is a program p such that p is adequately tested by a size-n test set but not by any size n - 1 test set. The statement coverage property requires that all adequate test sets cover all feasible statements of the program under test. These axioms were intended to rule out some "unsuitable notions of adequacy."

Given the fact that Weyuker's axioms are unable to distinguish most adequacy criteria, Hamlet [1989] and Zweben and Gourlay [1989] questioned if the axiomatic approach could provide useful comparison of adequacy criteria. Recently, some of Weyuker's axioms were combined with Baker et al.'s [1986] properties for assessing control-flow adequacy criteria [Zhu 1995a]. Control-flow adequacy criteria with subtle differences were distinguished. The assessment suggested that the cycle combination criterion was the most favorable.

6.2 Parrish and Zweben's Formalization and Analysis of Weyuker's Axioms

Parrish and Zweben [1993; 1991] formalized Weyuker's axioms, but put them in a more general framework that involved specifications as well. They defined the notions of program-independent criteria and specification-independent criteria. A program-independent criterion measures software test adequacy independently of the program, whereas a specification-independent criterion measures software test adequacy independently of the specification. Parrish and Zweben explored the interdependence relations between the notions and the axioms. Their work illustrated the complex levels of analysis that an axiomatic theory of software testing can make possible. Formalization is not merely a mathematical exercise because the basic notions of software testing have to be expressed precisely and clearly so that they can be critically analyzed. For example, informally, a program-limited criterion only requires test data selected in the domain of the program, and a specification-limited criterion requires test data only selected in the domain of the specification. Parrish and Zweben formally defined the notion to be that if a test case falls outside the domain, the test set is considered inadequate. This definition is counterintuitive and conflicts with the monotonicity axiom. A more logical definition is that the test cases outside the domain are not taken into account in determining test adequacy [Zhu and Hall 1993]. Parrish and Zweben [1993] also formalized the notion of the correctness condition such that a test set is inadequate if it contains a test case on which the program is incorrect. They tried to put it into the axiom system by modifying the monotonicity axiom to be applied only to test sets on which the software is correct. However, Weyuker [1986] argued that the correctness condition plays no role in determining test adequacy. It should not be an axiom.
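The monotonicity conflict described above can be seen directly. The small sketch below (our illustration; the domain and the coverage requirement are hypothetical) contrasts the two formalizations of a domain-limited criterion: declaring a test set inadequate whenever it contains an out-of-domain case breaks monotonicity, whereas simply ignoring out-of-domain cases preserves it:

```python
# Hypothetical setting: the program's domain is the integers 0..9, and a test
# set is considered to satisfy the criterion when it contains both a "small"
# input (< 5) and a "large" input (>= 5).
DOMAIN = set(range(10))

def covering(cases):
    return any(c < 5 for c in cases) and any(c >= 5 for c in cases)

def adequate_strict(test_set):
    """Reading discussed above: any out-of-domain case makes the set inadequate."""
    return test_set <= DOMAIN and covering(test_set)

def adequate_ignoring(test_set):
    """Alternative reading: out-of-domain cases are simply not counted."""
    return covering(test_set & DOMAIN)

t = {2, 7}             # adequate under both readings
t_bigger = t | {42}    # a superset that adds an out-of-domain case

# Monotonicity requires a superset of an adequate test set to stay adequate.
print(adequate_strict(t), adequate_strict(t_bigger))      # True False -> violated
print(adequate_ignoring(t), adequate_ignoring(t_bigger))  # True True  -> preserved
```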
6.3 Zhu and Hall's Measurement Theory

Based on the mathematical theory of measurement, Zhu and Hall [1993; Zhu et al. 1993] proposed a set of axioms for the measurement of software test adequacy. The mathematical theory of measurement is the study of the logical and philosophical concepts underlying measurement as it applies in all sciences. It studies the conceptual structures of measurement systems and their properties related to the validation of the use of measurements [Roberts 1979; Krantz et al. 1971; Suppes et al. 1989; Luce et al. 1990]. Therefore the measurement of software test adequacy should conform to the theory. In fact, recent years have seen rapid growth of rigorous applications of the measurement theory to software metrics [Fenton 1991].

According to the theory, measurement is the assignment of numbers to properties of objects or events in the real world by means of an objective empirical operation. The modern form of measurement theory is representational; that is, numbers assigned to objects/events must represent the perceived relation between the properties of those objects/events. Readers are referred to Roberts' [1979] book for various applications of the theory and Krantz et al.'s trilogy [1971; Suppes et al. 1989; Luce et al. 1990] for a systematic treatment of the subject. Usually, such a theory comprises three main sections: the description of an empirical relational system, a representation theorem, and a uniqueness condition.

An empirical relational system Q = (Q, R) consists of a set Q of manifestations of the property or attribute and a family of relations R = {R_1, R_2, ..., R_n} on Q. The family R of relations provides the basic properties of the manifestations with respect to the property to be measured. They are usually expressed as axioms derived from the empirical knowledge of the real world. For software testing, the manifestations are the software tests P x S x T, which consist of a set of programs P, a set of specifications S, and a set of test sets T. Zhu et al. [1994] defined a relation <= on the space of software testing such that t_1 <= t_2 means that test t_2 is at least as adequate as t_1. It was argued that the relation has the properties of reflexivity, transitivity, comparability, boundedness, equal inadequacy for empty tests, and equal adequacy for exhaustive tests. Therefore this relation is a total ordering with maximal and minimal elements.

To assign numbers to the objects in the empirical system, a numerical system N = (N, >=) must be defined so that the measurement mapping is a homomorphism from the empirical system to the numerical system. It has been proved that for any numerical system (N, >=_N) and measurement \mu of test adequacy on N, there is a measurement \mu' on the unit interval of real numbers ([0,1], >=) such that \mu' = \phi \circ \mu, where \phi is an isomorphism between (N, >=_N) and ([0,1], >=) [Zhu and Hall 1993; Zhu et al. 1994; Zhu 1995b]. Therefore properties of test adequacy measurements can be obtained by studying adequacy measurements on the real unit interval.

Zhu and Hall's axiom system for adequacy measurements on the real unit interval consists of the following axioms.

AXIOM B1 (Inadequacy of Empty Test Set). For all programs p and specifications s, the adequacy of the empty test set is 0.

AXIOM B2 (Adequacy of Exhaustive Testing). For all programs p and specifications s, the adequacy of the exhaustive test set D is 1.

AXIOM B3 (Monotonicity). For all programs p and specifications s, if test set t_1 is a subset of test set t_2, then the adequacy of t_1 is less than or equal to the adequacy of t_2.

AXIOM B4 (Convergence). Let t_1, t_2, ..., t_n, ... \in T be test sets such that t_1 \subseteq t_2 \subseteq ... \subseteq t_n \subseteq .... Then, for all programs p and specifications s,

    \lim_{n \to \infty} C_s^p(t_n) = C_s^p\left( \bigcup_{n=1}^{\infty} t_n \right),

where C_s^p(t) is the adequacy of test set t for testing program p with respect to specification s.
AXIOM B5 (Law of Diminishing Returns). The more a program has been tested, the less a given test set can further contribute to the test adequacy. Formally, for all programs p, specifications s, and test sets t,

    \forall c, d \in T.\ c \subseteq d \Rightarrow C_s^p(t)_c \ge C_s^p(t)_d,

where C_s^p(t)_c = C_s^p(t \cup c) - C_s^p(c).

This axiom system has been proved to be consistent [Zhu and Hall 1993; Zhu et al. 1994].
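As a concrete (and deliberately simple) instance of such a measurement, the sketch below (our illustration, not Zhu and Hall's formal construction) defines a statement-coverage ratio as an adequacy measurement on the unit interval and spot-checks the behavior required by Axioms B1 and B3:

```python
# A toy adequacy measurement: the fraction of the program's statements that a
# test set executes.  Its value always lies in the unit interval [0, 1].
covered_by = {             # hypothetical program: input -> statements executed
    "a": {"s1"},
    "b": {"s1", "s2"},
    "c": {"s3"},
    "d": {"s1", "s2", "s3", "s4"},
}
all_statements = {"s1", "s2", "s3", "s4"}

def adequacy(test_set):
    executed = set()
    for case in test_set:
        executed |= covered_by[case]
    return len(executed) / len(all_statements)

# Axiom B1: the empty test set has adequacy 0.
assert adequacy(set()) == 0.0

# Axiom B3: enlarging a test set never decreases adequacy (spot-checked on an
# increasing chain of test sets, not proved in general).
chain = [set(), {"a"}, {"a", "c"}, {"a", "c", "d"}]
values = [adequacy(t) for t in chain]
assert all(x <= y for x, y in zip(values, values[1:]))
print(values)              # [0.0, 0.25, 0.5, 1.0]
```

Note that if some statement were infeasible (executable by no input), no test set could reach adequacy 1 under this measurement, which is exactly the failure of exhaustive-test adequacy and of finite applicability discussed below.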
From these axioms, a number of properties of test data adequacy criteria can be proved. For example, if a test data adequacy criterion satisfies Axiom B2 and Axiom B4, then it is finitely applicable, which means that for any real number r less than 1, there is a finite test set whose adequacy is greater than or equal to r. Therefore an adequacy criterion can fail to be finitely applicable for two reasons: not satisfying the adequacy of exhaustive testing, for example by requiring coverage of infeasible elements, or not satisfying the convergence property, as with the path-coverage criterion. Another example of a derivable property is subadditivity, which can be derived from the law of diminishing returns. Here an adequacy measurement is subadditive if the adequacy of the union of two test sets t_1 and t_2 is less than or equal to the sum of the adequacy of t_1 and the adequacy of t_2.

The measurement theory of software test adequacy was concerned not only with the properties derivable from the axioms, but also with measurement-theoretical properties of the axiom systems. Zhu et al. [1994] proved a uniqueness theorem that characterizes the admissible transformations between any two adequacy measurements that satisfy the axioms. However, the irregularity of adequacy measurements was also formally proved. That is, not all adequacy measurements can be transformed from one to another. In fact, few existing adequacy criteria are convertible from one to another. According to measurement theory, these two theorems lay the foundation for the investigation of the meaningfulness of statements and statistical operations that involve test adequacy measurements [Krantz et al. 1971; Suppes et al. 1989; Luce et al. 1990].

The irregularity theorem confirms our intuition that the criteria are actually different approximations to the ideal measure of test adequacy. Therefore it is necessary to find the "errors" for each criterion. In an attempt to do so, Zhu et al. [1994] also investigated the properties that distinguish different classes of test adequacy criteria.

6.4 Semantics of Software Test Adequacy

The axioms of software test adequacy criteria characterize the notion of software test adequacy by a set of properties for adequacy. These axioms do not directly answer what test adequacy means. The model theory [Chang and Keisler 1973] of mathematical logic provides a way of assigning meanings to formal axioms. Recently, model theory was applied to assigning meanings to the axioms of test adequacy criteria [Zhu 1996b, 1995b]. Software testing was interpreted as inductive inference such that a finite number of observations made on the behavior of the software under test are used to inductively derive general properties of the software. In particular, Gold's [1967] inductive inference model of identification in the limit was used to interpret Weyuker's axioms, and a relationship between adequate testing and software correctness was obtained in the model [Zhu 1996b]. Valiant's [1984] PAC inductive inference model was used to interpret Zhu and Hall's axioms of adequacy measurements. Relationships between software reliability and test adequacy were obtained [Zhu 1995b].

7. CONCLUSION

Since Goodenough and Gerhart [1975, 1977] pointed out that test criteria are a central problem of software testing, test
criteria have been a major focus of the research on software testing. A large number of test data adequacy criteria have been proposed and various rationales have been presented for the support of one or another criterion. In this article, various types of software test adequacy criteria proposed in the literature are reviewed. The research on comparison and evaluation of criteria is also surveyed. The notion of test adequacy is examined and its roles in software testing practice and theoretical research are discussed. Although whether an adequacy criterion captures the true notion of test adequacy is still a matter of controversy, it appears that software test adequacy criteria as a whole are clearly linked to the fault-detecting ability of software testing as well as to the dependability of the software that has passed an adequate test. It was predicted that with the advent of a quality assurance framework based upon ISO 9001, which calls for specific mechanisms for defect removal, the acceptance of more formal measurement of the testing process can now be anticipated [Wichmann 1993]. There is a tendency towards systematic approaches to software testing through using test adequacy criteria.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees for their valuable comments.

APPENDIX

Glossary of Graph-Theory-Related Terminology

Complete computation path: A complete computation path is a path that starts with the begin node and ends at the end node of the flow graph. It is also called a computation path or execution path.

Concatenation of paths: If p = (n_1, n_2, ..., n_s) and q = (n_{s+1}, ..., n_t) are two paths, and r = (n_1, n_2, ..., n_s, n_{s+1}, ..., n_t) is also a path, we say that r is the concatenation of p to q and write r = p ∧ q.

Cycle: A path p = (n_1, n_2, ..., n_t, n_1) is called a cycle.

Cycle-free path: A path is said to be cycle-free if it does not contain cycles as subpaths.

Directed graph, node, and edge: A directed graph consists of a set N of nodes and a set E ⊆ N x N of directed edges between nodes.

Elementary cycle and simple cycle: If the nodes n_1, n_2, ..., n_t in the cycle (n_1, n_2, ..., n_t, n_1) are all different, then the cycle is called an elementary cycle. If the edges in the cycle are all different, then it is called a simple cycle.

Elementary path and simple path: A path is called an elementary path if the nodes in the path are all different. It is called a simple path if the edges in the path are all different.

Empty path: A path is called an empty path if its length is 1. In this case, the path contains no edge but only a node, and hence is written as (n_1).

Feasible and infeasible paths: Not all complete computation paths necessarily correspond to an execution of the program. A feasible path is a complete computation path such that there exist input data that can cause the execution of the path. Otherwise, it is an infeasible path.

Flow graph: A flow graph is a directed graph that satisfies the following conditions: (1) it has a unique begin node which has no inward edge; (2) it has a unique end node which has no outward edge; and (3) every node in a flow graph must be on a path from the begin node to the end node.

Flow graph model of program structure: The nodes in a flow graph represent linear sequences of computations. The edges in the flow graph represent control transfers. Each edge is associated with a predicate that represents
  • 56. Test Coverage and Adequacy • 421 the condition of control transfer from BACHE, R. AND MULLERBURG, M. 1990. Meas- the computation on the start node of the ures of testability as a basis for quality assur- ance. Softw. Eng. J. (March), 86 –92. edge to that on the end node of the edge. BAKER, A. L., HOWATT, J. W., AND BIEMAN, J. M. The begin node and the end node in the 1986. Criteria for finite sets of paths that flow graph are where the computation characterize control flow. In Proceedings of starts and finishes. the 19th Annual Hawaii International Confer- ence on System Sciences, 158 –163. Inward edges and outward edges of a BASILI, V. R. AND RAMSEY, J. 1984. Structural node: The edge n 1 , n 2 is an inward coverage of functional testing. Tech. Rep. TR- edge of the node n 2 . It is an outward 1442, Department of Computer Science, Uni- edge of the node n 1 . versity of Maryland at College Park, Sept. BASILI, V. R. AND SELBY, R. W. 1987. Com- Length of a path: The length of a paring the effectiveness of software testing. path (n 1 , n 2 , . . . , n t ) is t. IEEE Trans. Softw. Eng. SE-13, 12 (Dec.), 1278 –1296. Path: A path p in a graph is a se- BAZZICHI, F. AND SPADAFORA, I. 1982. An auto- quence (n 1 , n 2 , . . . , n t ) of nodes such matic generator for compiler testing. IEEE that n i , n i 1 is an edge in the graph, Trans. Softw. Eng. SE-8, 4 (July), 343–353. for all i 1, 2, . . . , t 1, t 0. BEIZER, B. 1983. Software Testing Techniques. Van Nostrand Reinhold, New York. Start node and end node of an edge: BEIZER, B. 1984. Software System Testing and The node n 1 is the start node of the edge Quality Assurance. Van Nostrand Reinhold, n 1 , n 2 and n 2 is the end node of the New York. edge. BENGTSON, N. M. 1987. Measuring errors in op- erational analysis assumptions, IEEE Trans. Start node and end node of a path: Softw. Eng. SE-13, 7 (July), 767–776. The start node of a path (n 1 , n 2 , . . . , BENTLY, W. G. AND MILLER, E. F. 1993. CT cov- n t ) is n 1 , and n t is the end node of the erage—initial results. Softw. Quality J. 2, 1, path. 29 – 47. BERNOT, G., GAUDEL, M. C., AND MARRE, B. 1991. Strongly connected graph: A graph Software testing based on formal specifica- is strongly connected if for any two tions: A theory and a tool. Softw. Eng. J. nodes a and b, there exists a path from (Nov.), 387– 405. a to b and a path from b to a. BIEMAN, J. M. AND SCHULTZ, J. L. 1992. An em- pirical evaluation (and specification) of the all Subpath: A subsequence n u , n u 1 , du-paths testing criterion. Softw. Eng. J. . . . , n v of the path p (n 1 , n 2 , . . . , n t ) (Jan.), 43–51. is called a subpath of p, where 1 u BIRD, D. L. AND MUNOZ, C. U. 1983. Automatic generation of random self-checking test cases. v t. IBM Syst. J. 22, 3. REFERENCES BOUGE, L., CHOQUET, N., FRIBOURG, L., AND GAU- DEL, M.-C. 1986. Test set generation from ADRION, W. R., BRANSTAD, M. A., AND CHERNI- algebraic specifications using logic program- AVSKY, J. C. 1982. Validation, verification, ming. J. Syst. Softw. 6, 343–360. and testing of computer software. Comput. Surv. 14, 2 (June), 159 –192. BUDD, T. A. 1981. Mutation analysis: Ideas, ex- amples, problems and prospects. In Computer AFIFI, F. H., WHITE, L. J., AND ZEIL, S. J. 1992. Testing for linear errors in nonlinear com- Program Testing, Chandrasekaran and Radic- puter programs. In Proceedings of the 14th chi, Eds., North Holland, 129 –148. IEEE International Conference on Software BUDD, T. A. AND ANGLUIN, D. 1982. Two notions Engineering (May), 81–91. 
of correctness and their relation to testing. AMLA, N. AND AMMANN, P. 1992. Using Z speci- Acta Inf. 18, 31– 45. fications in category partition testing. In Pro- BUDD, T. A., LIPTON, R. J., SAYWARD, F. G., AND ceedings of the Seventh Annual Conference on DEMILLO, R. A. 1978. The design of a pro- Computer Assurance (June), IEEE, 3–10. totype mutation system for program testing. AMMANN, P. AND OFFUTT, J. 1994. Using formal In Proceedings of National Computer Confer- methods to derive test frames in category- ence, 623– 627. partition testing. In Proceedings of the Ninth CARVER, R. AND KUO-CHUNG, T. 1991. Replay Annual Conference on Computer Assurance and testing for concurrent programs. IEEE (Gaithersburg, MD, June), IEEE, 69 –79. Softw. (March), 66 –74. ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 57. 422 • Zhu et al. CHAAR, J. K., HALLIDAY, M. J., BHANDARI, I. S., AND SOFT Symposium on Software Testing, Anal- CHILLAREGE, R. 1993. In-process evaluation ysis and Verification 2, (July), 142–151. for software inspection and test. IEEE Trans. DEMILLO, R. A., LIPTON, R. J., AND SAYWARD, Softw. Eng. 19, 11, 1055–1070. F. G. 1978. Hints on test data selection: CHANDRASEKARAN, B. AND RADICCHI, S. (EDS.) Help for the practising programmer. Com- 1981. Computer Program Testing, North- puter 11, (April), 34 – 41. Holland. DEMILLO, R. A., MCCRACKEN, W. M., MATIN, R. J., CHANG, C. C. AND KEISLER, H. J. 1973. Model AND PASSUFIUME, J. F. 1987. Software Test- Theory. North-Holland, Amsterdam. ing and Evaluation, Benjamin-Cummings, CHANG, Y.-F. AND AOYAMA, M. 1991. Testing the Redwood City, CA. limits of test technology. IEEE Softw. DENNEY, R. 1991. Test-case generation from (March), 9 –11. Prolog-based specifications. IEEE Softw. CHERNIAVSKY, J. C. AND SMITH, C. H. 1987. A (March), 49 –57. recursion theoretic approach to program test- DIJKSTRA, E. W. 1972. Notes on structured pro- ing. IEEE Trans. Softw. Eng. SE-13, 7 (July), gramming. In Structured Programming, by 777–784. O.-J. Dahl, E. W. Dijkstra, and C. A. R. CHERNIAVSKY, J. C. AND SMITH, C. H. 1991. On Hoare, Academic Press. Weyuker’s axioms for software complexity DOWNS, T. 1985. An approach to the modelling measures. IEEE Trans. Softw. Eng. SE-17, 6 of software testing with some applications. (June), 636 – 638. IEEE Trans. Softw. Eng. SE-11, 4 (April), CHOI, B., MATHUR, A., AND PATTISON, B. 1989. 375–386. PMothra: Scheduling mutants for execution DOWNS, T. 1986. Extensions to an approach to on a hypercube. In Proceedings of SIGSOFT the modelling of software testing with some Symposium on Software Testing, Analysis and performance comparisons. IEEE Trans. Verification 3 (Dec.) 58 – 65. Softw. Eng. SE-12, 9 (Sept.), 979 –987. CHUSHO, T. 1987. Test data selection and qual- DOWNS, T. AND GARRONE, P. 1991. Some new ity estimation based on the concept of essen- models of software testing with performance tial branches for path testing. IEEE Trans. comparisons. IEEE Trans. Rel. 40, 3 (Aug.), Softw. Eng. SE-13, 5 (May), 509 –517. 322–328. CLARKE, L. A., HASSELL, J., AND RICHARDSON, D. J. DUNCAN, I. M. M. AND ROBSON, D. J. 1990. Or- 1982. A close look at domain testing. IEEE dered mutation testing. ACM SIGSOFT Trans. Softw. Eng. SE-8, 4 (July), 380 –390. Softw. Eng. Notes 15, 2 (April), 29 –30. CLARKE, L. A., PODGURSKI, A., RICHARDSON, D. J., DURAN, J. W. AND NTAFOS, S. 1984. An evalua- AND ZEIL, S. J. 1989. A formal evaluation of tion of random testing. IEEE Trans. Softw. data flow path selection criteria. IEEE Trans. Eng. SE-10, 4 (July), 438 – 444. Softw. Eng. 15, 11 (Nov.), 1318 –1332. FENTON, N. 1992. When a software measure is CURRIT, P. A., DYER, M., AND MILLS, H. D. 1986. not a measure. Softw. Eng. J. (Sept.), 357– Certifying the reliability of software. IEEE 362. Trans. Softw. Eng. SE-6, 1 (Jan.) 2–13. FENTON, N. E. 1991. Software metrics—a rigor- DAVIS, M. AND WEYUKER, E. 1988. Metric space- ous approach. Chapman & Hall, London. based test-data adequacy criteria. Comput. J. 13, 1 (Feb.), 17–24. FENTON, N. E., WHITTY, R. W., AND KAPOSI, A. A. 1985. A generalised mathematical DEMILLO, R. A. AND MATHUR, A. P. 1990. On theory of structured programming. Theor. the use of software artifacts to evaluate the Comput. Sci. 36, 145–171. effectiveness of mutation analysis for detect- ing errors in production software. In Proceed- FORMAN, I. R. 
1984. An algebra for data flow ings of 13th Minnowbrook Workshop on Soft- anomaly detection. In Proceedings of the Sev- ware Engineering (July 24 –27, Blue enth International Conference on Software En- Mountain Lake, NY), 75–77. gineering (Orlando, FL), 250 –256. DEMILLO, R. A. AND OFFUTT, A. J. 1991. Con- FOSTER, K. A. 1980. Error sensitive test case straint-based automatic test data generation. analysis (ESTCA). IEEE Trans. Softw. Eng. IEEE Trans. Softw. Eng. 17, 9 (Sept.), 900 – SE-6, 3 (May), 258 –264. 910. FRANKL, P. G. AND WEISS, S. N. 1993. An exper- DEMILLO, R. A. AND OFFUTT, A. J. 1993. Experi imental comparison of the effectiveness of mental results from an automatic test case branch testing and data flow testing. IEEE generator. ACM Trans. Softw. Eng. Methodol. Trans. Softw. Eng. 19, 8 (Aug.), 774 –787. 2, 2 (April), 109 –127. FRANKL, P. G. AND WEYUKER, J. E. 1988. An DEMILLO, R. A., GUINDI, D. S., MCCRACKEN, W. M., applicable family of data flow testing criteria. OFFUTT, A. J., AND KING, K. N. 1988. An IEEE Trans. Softw. Eng. SE-14, 10 (Oct.), extended overview of the Mothra software 1483–1498. testing environment. In Proceedings of SIG- FRANKL, P. G. AND WEYUKER, J. E. 1993a. A ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 58. Test Coverage and Adequacy • 423 formal analysis of the fault-detecting ability HAMLET, R. G. 1977. Testing programs with the of testing methods. IEEE Trans. Softw. Eng. aid of a compiler. IEEE Trans. Softw. Eng. 3, 19, 3 (March), 202–213. 4 (July), 279 –290. FRANKL, P. G. AND WEYUKER, E. J. 1993b. Prov- HARROLD, M. J., MCGREGOR, J. D., AND FITZ- able improvements on branch testing. IEEE PATRICK, K. J. 1992. Incremental testing of Trans. Softw. Eng. 19, 10, 962–975. object-oriented class structures. In Proceed- FREEDMAN, R. S. 1991. Testability of software ings of 14th ICSE (May) 68 – 80. components. IEEE Trans. Softw. Eng. SE-17, HARROLD, M. J. AND SOFFA, M. L. 1990. Inter- 6 (June), 553–564. procedural data flow testing. In Proceedings FRITZSON, P., GYIMOTHY, T., KAMKAR, M., AND of SIGSOFT Symposium on Software Testing, SHAHMEHRI, N. 1991. Generalized algorith- Analysis, and Verification 3 (Dec.), 158 –167. mic debugging and testing. In Proceedings of HARROLD, M. J. AND SOFFA, M. L. 1991. Select- ACM SIGPLAN Conference on Programming ing and using data for integration testing. Language Design and Implementation (To- IEEE Softw. (March), 58 – 65. ronto, June 26 –28). HARTWICK, D. 1977. Test planning. In Proceed- FUJIWARA, S., V. BOCHMANN, G., KHENDEK, F., ings of National Computer Conference, 285– AMALOU, M., AND GHEDAMSI, A. 1991. Test 294. selection based on finite state models. IEEE HAYES, I. J. 1986. Specification directed mod- Trans. Softw. Eng. SE-17, 6 (June), 591– 603. ule testing. IEEE Trans. Softw. Eng. SE-12, 1 GAUDEL, M.-C. AND MARRE, B. 1988. Algebraic (Jan.), 124 –133. specifications and software testing: Theory HENNELL, M. A., HEDLEY, D., AND RIDDELL, I. J. and application. In Rapport LRI 407. 1984. Assessing a class of software tools. In GELPERIN, D. AND HETZEL, B. 1988. The growth Proceedings of the Seventh ICSE, 266 –277. of software testing. Commun. ACM 31, 6 HERMAN, P. 1976. A data flow analysis ap- (June), 687– 695. proach to program testing. Aust. Comput. J. 8, GIRGIS, M. R. 1992. An experimental evalua- 3 (Nov.), 92–96. tion of a symbolic execution system. Softw. HETZEL, W. 1984. The Complete Guide to Soft- Eng. J. (July), 285–290. ware Testing, Collins. GOLD, E. M. 1967. Language identification in HIERONS, R. 1992. Software testing from formal the limit. Inf. Cont. 10, 447– 474. specification. Ph.D. Thesis, Brunel Univer- GOODENOUGH, J. B. AND GERHART, S. L. 1975. sity, UK. Toward a theory of test data selection. IEEE HOFFMAN, D. M. AND STROOPER, P. 1991. Auto- Trans. Softw. Eng. SE-3 (June). mated module testing in Prolog. IEEE Trans. GOODENOUGH, J. B. AND GERHART, S. L. 1977. Softw. Eng. 17, 9 (Sept.), 934 –943. Toward a theory of testing: Data selection HORGAN, J. R. AND LONDON, S. 1991. Data flow criteria. In Current Trends in Programming coverage and the C language. In Proceedings Methodology, Vol. 2, R. T. Yeh, Ed., Prentice- of TAV4 (Oct.), 87–97. Hall, Englewood Cliffs, NJ, 44 –79. HORGAN, J. R. AND MATHUR, A. P. 1992. Assess- GOPAL, A. AND BUDD, T. 1983. Program testing ing testing tools in research and education. by specification mutation. Tech. Rep. TR 83- IEEE Softw. (May), 61– 69. 17, University of Arizona, Nov. HOWDEN, W. E. 1975. Methodology for the gen- GOURLAY, J. 1983. A mathematical framework eration of program test data. IEEE Trans. for the investigation of testing. IEEE Trans. Comput. 24, 5 (May), 554 –560. Softw. Eng. SE-9, 6 (Nov.), 686 –709. HOWDEN, W. E. 1976. Reliability of the path HALL, P. A. V. 1991. 
Relationship between analysis testing strategy. IEEE Trans. Softw. specifications and testing. Inf. Softw. Technol. Eng. SE-2, (Sept.), 208 –215. 33, 1 (Jan./Feb.), 47–52. HOWDEN, W. E. 1977. Symbolic testing and the HALL, P. A. V. AND HIERONS, R. 1991. Formal DISSECT symbolic evaluation system. IEEE methods and testing. Tech. Rep. 91/16, Dept. Trans. Softw. Eng. SE-3 (July), 266 –278. of Computing, The Open University. HOWDEN, W. E. 1978a. Algebraic program test- HAMLET, D. AND TAYLOR, R. 1990. Partition ing. ACTA Inf. 10, 53– 66. testing does not inspire confidence. IEEE HOWDEN, W. E. 1978b. Theoretical and empiri- Trans. Softw. Eng. 16 (Dec.), 206 –215. cal studies of program testing. IEEE Trans. HAMLET, D., GIFFORD, B., AND NIKOLIK, B. 1993. Softw. Eng. SE-4, 4 (July), 293–298. Exploring dataflow testing of arrays. In Pro- HOWDEN, W. E. 1978c. An evaluation of the ef- ceedings of 15th ICSE (May), 118 –129. fectiveness of symbolic testing. Softw. Pract. HAMLET, R. 1989. Theoretical comparison of Exper. 8, 381–397. testing methods. In Proceedings of SIGSOFT HOWDEN, W. E. 1980a. Functional program Symposium on Software Testing, Analysis, testing. IEEE Trans. Softw. Eng. SE-6, 2 and Verification 3 (Dec.), 28 –37. (March), 162–169. ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 59. 424 • Zhu et al. HOWDEN, W. E. 1980b. Functional testing and ented program testing strategy. IEEE Trans. design abstractions. J. Syst. Softw. 1, 307–313. Softw. Eng. SE-9, (May), 33– 43. HOWDEN, W. E. 1981. Completeness criteria for LASKI, J., SZERMER, W., AND LUCZYCKI, P. testing elementary program functions. In Pro- 1993. Dynamic mutation testing in inte- ceedings of Fifth International Conference on grated regression analysis. In Proceedings of Software Engineering (March), 235–243. 15th International Conference on Software HOWDEN, W. E. 1982a. Validation of scientific Engineering (May), 108 –117. programs. Comput. Surv. 14, 2 (June), 193–227. LAUTERBACH, L. AND RANDALL, W. 1989. Ex- HOWDEN, W. E. 1982b. Weak mutation testing perimental evaluation of six test techniques. and completeness of test sets. IEEE Trans. In Proceedings of COMPASS 89 (Washington, DC, June), 36 – 41. Softw. Eng. SE-8, 4 (July), 371–379. LEVENDEL, Y. 1991. Improving quality with a HOWDEN, W. E. 1985. The theory and practice of manufacturing process. IEEE Softw. (March), functional testing. IEEE Softw. (Sept.), 6 –17. 13–25. HOWDEN, W. E. 1986. A functional approach to LINDQUIST, T. E. AND JENKINS, J. R. 1987. Test program testing and analysis. IEEE Trans. case generation with IOGEN. In Proceedings Softw. Eng. SE-12, 10 (Oct.), 997–1005. of the 20th Annual Hawaii International Con- HOWDEN, W. E. 1987. Functional program test- ference on System Sciences, 478 – 487. ing and analysis. McGraw-Hill, New York. LITTLEWOOD, B. AND STRIGINI, L. 1993. Valida- HUTCHINS, M., FOSTER, H., GORADIA, T., AND OS- tion of ultra-high dependability for software- TRAND, T. 1994. Experiments on the effec- based systems. C ACM 36, 11 (Nov.), 69 – 80. tiveness of dataflow- and controlflow-based LIU, L.-L. AND ROBSON, D. J. 1989. Symbolic test adequacy criteria. In Proceedings of 16th evaluation in software testing, the final re- IEEE International Conference on Software port. Computer Science Tech. Rep. 10/89, Engineering (May). School of Engineering and Applied Science, INCE, D. C. 1987. The automatic generation of University of Durham, June. test data. Comput. J. 30, 1, 63– 69. LUCE, R. D., KRANTZ, D. H., SUPPES, P., AND TVER- INCE, D. C. 1991. Software testing. In Software SKY, A. 1990. Foundations of Measurement, Engineer’s Reference Book, J. A. McDermid, Vol. 3: Representation, Axiomatization, and Ed., Butterworth-Heinemann (Chapter 19). Invariance. Academic Press, San Diego. KARASIK, M. S. 1985. Environmental testing MALAIYA, Y. K., VONMAYRHAUSER, A., AND SRIMANI, techniques for software certification. IEEE P. K. 1993. An examination of fault expo- Trans. Softw. Eng. SE-11, 9 (Sept.), 934 –938. sure ratio. IEEE Trans. Softw. Eng. 19, 11, KEMMERER, R. A. 1985. Testing formal specifi- 1087–1094. cations to detect design errors. IEEE Trans. MARICK, B. 1991. The weak mutation hypothe- Softw. Eng. SE-11, 1 (Jan.), 32– 43. sis. In Proceedings of SIGSOFT Symposium on Software Testing, Analysis, and Verifica- KERNIGHAN, B. W. AND PLAUGER, P. J. 1981. tion 4 (Oct.), 190 –199. Software Tools in Pascal, Addison-Wesley, Reading, MA. MATHUR, A. P. 1991. Performance, effective- ness, and reliability issues in software test- KING, K. N. AND OFFUTT, A. J. 1991. A FOR- ing. In Proceedings of the 15th Annual Inter- TRAN language system for mutation-based national Computer Software and Applications software testing. Softw. Pract. Exper. 21, 7 Conference (Tokyo, Sept.), 604 – 605. (July), 685–718. MARSHALL, A. C. 1991. 
A Conceptual model of KOREL, B., WEDDE, H., AND FERGUSON, R. 1992. software testing. J. Softw. Test. Ver. Rel. 1, 3 Dynamic method of test data generation for (Dec.), 5–16. distributed software. Inf. Softw. Tech. 34, 8 MCCABE, T. J. 1976. A complexity measure. (Aug.), 523–532. IEEE Trans. Softw. Eng. SE-2, 4, 308 –320. KOSARAJU, S. 1974. Analysis of structured pro- MCCABE, T. J. (ED.) 1983. Structured Testing. grams. J. Comput. Syst. Sci. 9, 232–255. IEEE Computer Society Press, Los Alamitos, KRANTZ, D. H., LUCE, R. D., SUPPES, P., AND TVER- CA. SKY, A. 1971. Foundations of Measurement, MCCABE, T. J. AND SCHULMEYER, G. G. 1985. Vol. 1: Additive and Polynomial Representa- System testing aided by structured analysis: tions. Academic Press, New York. A practical experience. IEEE Trans. Softw. KRAUSER, E. W., MATHUR, A. P., AND REGO, V. J. Eng. SE-11, 9 (Sept.), 917–921. 1991. High performance software testing on MCMULLIN, P. R. AND GANNON, J. D. 1983. Com- SIMD machines. IEEE Trans. Softw. Eng. bining testing with specifications: A case SE-17, 5 (May), 403– 423. study. IEEE Trans. Softw. Eng. SE-9, 3 LASKI, J. 1989. Testing in the program develop- (May), 328 –334. ment cycle. Softw. Eng. J. (March), 95–106. MEEK, B. AND SIU, K. K. 1988. The effectiveness LASKI, J. AND KOREL, B. 1983. A data flow ori- of error seeding. Alvey Project SE/064: Qual- ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 60. Test Coverage and Adequacy • 425 ity evaluation of programming language pro- OULD, M. A. AND UNWIN, C., EDS. 1986. Testing cessors, Report No. 2, Computing Centre, in Software Development. Cambridge Univer- King’s College London, Oct. sity Press, New York. MILLER, E. AND HOWDEN, W. E. 1981. Tutorial: PAIGE, M. R. 1975. Program graphs, an alge- Software Testing and Validation Techniques, bra, and their implication for programming. (2nd ed.). IEEE Computer Society Press, Los IEEE Trans. Softw. Eng. SE-1, 3, (Sept.), Alamitos, CA. 286 –291. MILLER, K. W., MORELL, L. J., NOONAN, R. E., PAIGE, M. R. 1978. An analytical approach to PARK, S. K., NICOL, D. M., MURRILL, B. W., AND software testing. In Proceedings COMP- VOAS, J. M. 1992. Estimating the probabil- SAC’78, 527–532. ity of failure when testing reveals no failures. PANDI, H. D., RYDER, B. G., AND LANDI, W. IEEE Trans. Softw. Eng. 18, 1 (Jan.), 33– 43. 1991. Interprocedural Def-Use associations MORELL, L. J. 1990. A theory of fault-based in C programs. In Proceedings of SIGSOFT testing. IEEE Trans. Softw. Eng. 16, 8 (Aug.), Symposium on Software Testing, Analysis, 844 – 857. and Verification 4, (Oct.), 139 –153. MYERS, G. J. 1977. An extension to the cyclo- PARRISH, A. AND ZWEBEN, S. H. 1991. Analysis matic measure of program complexity. SIG- and refinement of software test data ade- PLAN No. 12, 10, 61– 64. quacy properties. IEEE Trans. Softw. Eng. SE-17, 6 (June), 565–581. MYERS, G. J. 1979. The Art of Software Testing. John Wiley and Sons, New York. PARRISH, A. S. AND ZWEBEN, S. H. 1993. Clarify- ing some fundamental concepts in software MYERS, J. P., JR. 1992. The complexity of soft- testing. IEEE Trans. Softw. Eng. 19, 7 (July), ware testing. Softw. Eng. J. (Jan.), 13–24. 742–746. NTAFOS, S. C. 1984. An evaluation of required PETSCHENIK, N. H. 1985. Practical priorities in element testing strategies. In Proceedings of system testing. IEEE Softw. (Sept.), 18 –23. the Seventh International Conference on Soft- ware Engineering, 250 –256. PIWOWARSKI, P., OHBA, M., AND CARUSO, J. 1993. Coverage measurement experience during NTAFOS, S. C. 1984. On required element test- function testing. In Proceedings of the 15th ing. IEEE Trans. Softw. Eng. SE-10, 6 (Nov.), ICSE (May), 287–301. 795– 803. PODGURSKI, A. AND CLARKE, L. 1989. The impli- NTAFOS, S. C. 1988. A comparison of some cations of program dependences for software structural testing strategies. IEEE Trans. testing, debugging and maintenance. In Pro- Softw. Eng. SE-14 (June), 868 – 874. ceedings of SIGSOFT Symposium on Software OFFUTT, A. J. 1989. The coupling effect: Fact or Testing, Analysis, and Verification 3, (Dec.), fiction. In Proceedings of SIGSOFT Sympo- 168 –178. sium on Software Testing, Analysis, and Veri- PODGURSKI, A. AND CLARKE, L. A. 1990. A for- fication 3 (Dec. 13–15), 131–140. mal model of program dependences and its OFFUTT, A. J. 1992. Investigations of the soft- implications for software testing, debugging ware testing coupling effect. ACM Trans. and maintenance. IEEE Trans. Softw. Eng. Softw. Eng. Methodol. 1, 1 (Jan.), 5–20. 16, 9 (Sept.), 965–979. OFFUTT, A. J. AND LEE, S. D. 1991. How strong PRATHER, R. E. AND MYERS, J. P. 1987. The path is weak mutation? In Proceedings of SIG- prefix software testing strategy. IEEE Trans. SOFT Symposium on Software Testing, Anal- Softw. Eng. SE-13, 7 (July). ysis, and Verification 4 (Oct.), 200 –213. PROGRAM ANALYSIS LTD., UK. 1992. Testbed OFFUTT, A. J., ROTHERMEL, G., AND ZAPF, C. technical description. May. 1993. 
An experimental evaluation of selec- RAPPS, S. AND WEYUKER, E. J. 1985. Selecting tive mutation. In Proceedings of 15th ICSE software test data using data flow informa- (May), 100 –107. tion. IEEE Trans. Softw. Eng. SE-11, 4 OSTERWEIL, L. AND CLARKE, L. A. 1992. A pro- (April), 367–375. posed testing and analysis research initiative. RICHARDSON, D. J. AND CLARKE, L. A. 1985. IEEE Softw. (Sept.), 89 –96. Partition analysis: A method combining test- OSTRAND, T. J. AND BALCER, M. J. 1988. The ing and verification. IEEE Trans. Softw. Eng. category-partition method for specifying and SE-11, 12 (Dec.), 1477–1490. generating functional tests. Commun. ACM RICHARDSON, D. J., AHA, S. L., AND O’MALLEY, T. O. 31, 6 (June), 676 – 686. 1992. Specification-based test oracles for re- OSTRAND, T. J. AND WEYUKER, E. J. 1991. Data- active systems. In Proceedings of 14th Inter- flow-based test adequacy analysis for lan- national Conference on Software Engineering guages with pointers. In Proceedings of SIG- (May), 105–118. SOFT Symposium on Software Testing, Anal- RICHARDSON, D. J. AND THOMPSON, M. C. ysis, and Verification 4, (Oct.), 74 – 86. 1988. The RELAY model of error detection ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 61. 426 • Zhu et al. and its application. In Proceedings of SIG- TSAI, W. T., VOLOVIK, D., AND KEEFE, T. F. SOFT Symposium on Software Testing, Anal- 1990. Automated test case generation for ysis, and Verification 2 (July). programs specified by relational algebra que- RICHARDSON, D. J. AND THOMPSON, M. C. 1993. ries. IEEE Trans. Softw. Eng. 16, 3 (March), An analysis of test data selection criteria us- 316 –324. ing the relay model of fault detection. IEEE TSOUKALAS, M. Z., DURAN, J. W., AND NTAFOS, Trans. Softw. Eng. 19, 6, 533–553. S. C. 1993. On some reliability estimation RIDDELL, I. J., HENNELL, M. A., WOODWARD, M. R., problems in random and partition testing. AND HEDLEY, D. 1982. Practical aspects of IEEE Trans. Softw. Eng. 19, 7 (July), 687– 697. program mutation. Tech. Rep., Dept. of Compu- URAL, H. AND YANG, B. 1988. A structural test tational Science, University of Liverpool, UK. selection criterion. Inf. Process. Lett. 28, 3 ROBERTS, F. S. 1979. Measurement Theory, En- (July), 157–163. cyclopedia of Mathematics and Its Applica- tions, Vol. 7. Addison-Wesley, Reading, MA. URAL, H. AND YANG, B. 1993. Modeling software for accurate data flow representation. In Pro- ROE, R. P. AND ROWLAND, J. H. 1987. Some the- ceedings of 15th International Conference on ory concerning certification of mathematical Software Engineering (May), 277–286. subroutines by black box testing. IEEE Trans. Softw. Eng. SE-13, 6 (June), 677– 682. VALIANT, L. C. 1984. A theory of the learnable. ROUSSOPOULOS, N. AND YEH, R. T. 1985. SEES: Commun. ACM 27, 11, 1134 –1142. A software testing environment support system. VOAS, J., MORRELL, L., AND MILLER, K. 1991. IEEE Trans. Softw. Eng. SE-11, 4, (April), 355– Predicting where faults can hide from testing. 366. IEEE Softw. (March), 41– 48. RUDNER, B. 1977. Seeding/tagging estimation WEISER, M. D., GANNON, J. D., AND MCMULLIN, of software errors: Models and estimates. P. R. 1985. Comparison of structural test Rome Air Development Centre, Rome, NY, coverage metrics. IEEE Softw. (March), 80 – 85. RADC-TR-77-15, also AD-A036 655. WEISS, S. N. AND WEYUKER, E. J. 1988. An ex- SARIKAYA, B., BOCHMANN, G. V., AND CERNY, tended domain-based model of software reli- E. 1987. A test design methodology for pro- ability. IEEE Trans. Softw. Eng. SE-14, 10 tocol testing. IEEE Trans. Softw. Eng. SE-13, (Oct.), 1512–1524. 5 (May), 518 –531. WEYUKER, E. J. 1979a. The applicability of pro- SHERER, S. A. 1991. A cost-effective approach to gram schema results to programs. Int. testing. IEEE Softw. (March), 34 – 40. J. Comput. Inf. Sci. 8, 5, 387– 403. SOFTWARE RESEARCH. 1992. Software Test- WEYUKER, E. J. 1979b. Translatability and de- Works—Software Testers Workbench System. cidability questions for restricted classes of Software Research, Inc. program schema. SIAM J. Comput. 8, 5, 587– SOLHEIM, J. A. AND ROWLAND, J. H. 1993. An 598. empirical-study of testing and integration strategies using artificial software systems. WEYUKER, E. J. 1982. On testing non-testable IEEE Trans. Softw. Eng. 19, 10, 941–949. programs. Comput. J. 25, 4, 465– 470. STOCKS, P. A. AND CARRINGTON, D. A. 1993. Test WEYUKER, E. J. 1983. Assessing test data ade- templates: A specification-based testing quacy through program inference. ACM Trans. framework. In Proceedings of 15th Interna- Program. Lang. Syst. 5, 4, (Oct.), 641– 655. tional Conference on Software Engineering WEYUKER, E. J. 1986. Axiomatizing software (May), 405– 414. test data adequacy. IEEE Trans. Softw. Eng. SU, J. AND RITTER, P. R. 1991. 
Experience in SE-12, 12, (Dec.), 1128 –1138. testing the Motif interface. IEEE Softw. WEYUKER, E. J. 1988a. The evaluation of pro- (March), 26 –33. gram-based software test data adequacy crite- SUPPES, P., KRANTZ, D. H., LUCE, R. D., AND TVER- ria. Commun. ACM 31, 6, (June), 668 – 675. SKY, A. 1989. Foundations of Measurement, WEYUKER, E. J. 1988b. Evaluating software Vol. 2: Geometrical, Threshold, and Probabi- complexity measures. IEEE Trans. Softw. listic Representations. Academic Press, San Diego. Eng. SE-14, 9, (Sept.), 1357–1365. TAI, K.-C. 1993. Predicate-based test genera- WEYUKER, E. J. 1988c. An empirical study of tion for computer programs. In Proceedings of the complexity of data flow testing. In Pro- 15th International Conference on Software ceedings of SIGSOFT Symposium on Software Engineering (May), 267–276. Testing, Analysis, and Verification 2 (July), TAKAHASHI, M. AND KAMAYACHI, Y. 1985. An 188 –195. empirical study of a model for program error WEYUKER, E. J. 1993. More experience with prediction. IEEE, 330 –336. data-flow testing. IEEE Trans. Softw. Eng. THAYER, R., LIPOW, M., AND NELSON, E. 1978. 19, 9, 912–919. Software Reliability. North-Holland. WEYUKER, E. J. AND DAVIS, M. 1983. A formal ACM Computing Surveys, Vol. 29, No. 4, December 1997
  • 62. Test Coverage and Adequacy • 427 notion of program-based test data adequacy. ecution. IEEE Trans. Softw. Eng. SE-14, 10 Inf. Cont. 56, 52–71. (Oct.), 1499 –1511. WEYUKER, E. J. AND JENG, B. 1991. Analyzing ZEIL, S. J. 1983. Testing for perturbations of partition testing strategies. IEEE Trans. program statements. IEEE Trans. Softw. Eng. Softw. Eng. 17, 7 (July), 703–711. SE-9, 3, (May), 335–346. WEYUKER, E. J. AND OSTRAND, T. J. 1980. Theo- ZEIL, S. J. 1984. Perturbation testing for com- ries of program testing and the application of putation errors. In Proceedings of Seventh revealing sub-domains. IEEE Trans. Softw. International Conference on Software Engi- Eng. SE-6, 3 (May), 236 –246. neering (Orlando, FL), 257–265. WHITE, L. J. 1981. Basic mathematical defini- ZEIL, S. J. 1989. Perturbation techniques for tions and results in testing. In Computer Pro- detecting domain errors. IEEE Trans. Softw. gram Testing, B. Chandrasekaran and S. Eng. 15, 6 (June), 737–746. Radicchi, Eds., North-Holland, 13–24. ZEIL, S. J., AFIFI, F. H., AND WHITE, L. J. WHITE, L. J. AND COHEN, E. I. 1980. A domain 1992. Detection of linear errors via domain strategy for computer program testing. IEEE testing. ACM Trans. Softw. Eng. Methodol. 1, Trans. Softw. Eng. SE-6, 3 (May), 247–257. 4, (Oct.), 422– 451. WHITE, L. J. AND WISZNIEWSKI, B. 1991. Path ZHU, H. 1995a. Axiomatic assessment of con- testing of computer programs with loops us- trol flow based software test adequacy crite- ing a tool for simple loop patterns. Softw. ria. Softw. Eng. J. (Sept.), 194 –204. Pract. Exper. 21, 10 (Oct.). ZHU, H. 1995b. An induction theory of software WICHMANN, B. A. 1993. Why are there no mea- testing. Sci. China 38 (Supp.) (Sept.), 58 –72. surement standards for software testing? Comput. Stand. Interfaces 15, 4, 361–364. ZHU, H. 1996a. A formal analysis of the sub- sume relation between software test adequacy WICHMANN, B. A. AND COX, M. G. 1992. Prob- criteria. IEEE Trans. Softw. Eng. 22, 4 lems and strategies for software component testing standards. J. Softw. Test. Ver. Rel. 2, (April), 248 –255. 167–185. ZHU, H. 1996b. A formal interpretation of soft- WILD, C., ZEIL, S., CHEN, J., AND FENG, G. 1992. ware testing as inductive inference. J. Softw. Employing accumulated knowledge to refine Test. Ver. Rel. 6 (July), 3–31. test cases. J. Softw. Test. Ver. Rel. 2, 2 (July), ZHU, H. AND HALL, P. A. V. 1992a. Test data 53– 68. adequacy with respect to specifications and WISZNIEWSKI, B. W. 1985. Can domain testing related properties. Tech. Rep. 92/06, Depart- overcome loop analysis? IEEE, 304 –309. ment of Computing, The Open University, UK, Jan. WOODWARD, M. R. 1991. Concerning ordered mutation testing of relational operators. J. ZHU, H. AND HALL, P. A. V. 1992b. Testability of Softw. Test. Ver. Rel. 1, 3 (Dec.), 35– 40. programs: Properties of programs related to WOODWARD, M. R. 1993. Errors in algebraic test data adequacy criteria. Tech. Rep. 92/05, specifications and an experimental mutation Department of Computing, The Open Univer- testing tool. Softw. Eng. J. (July), 211–224. sity, UK, Jan. WOODWARD, M. R. AND HALEWOOD, K. 1988. ZHU, H. AND HALL, P. A. V. 1993. Test data From weak to strong— dead or alive? An anal- adequacy measurements. Softw. Eng. J. 8, 1 ysis of some mutation testing issues. In Pro- (Jan.), 21–30. ceedings of Second Workshop on Software ZHU, H., HALL, P. A. V., AND MAY, J. 1992. Testing, Verification and Analysis (July) 152– Inductive inference and software testing. J. 158. Softw. Test. Ver. Rel. 
2, 2 (July), 69 – 82. WOODWARD, M. R., HEDLEY, D., AND HENNEL, ZHU, H., HALL, P. A. V., AND MAY, J. 1994. Un- M. A. 1980. Experience with path analysis derstanding software test adequacy—an axi- and testing of programs. IEEE Trans. Softw. omatic and measurement approach. In Math- Eng. SE-6, 5 (May), 278 –286. ematics of Dependable Systems, Proceedings of WOODWARD, M. R., HENNEL, M. A., AND HEDLEY, IMA First Conference (Sept., London), Oxford D. 1980. A limited mutation approach to University Press, Oxford. program testing. Tech. Rep. Dept. of Compu- ZWEBEN, S. H. AND GOURLAY, J. S. 1989. On the tational Science, University of Liverpool. adequacy of Weyuker’s test data adequacy YOUNG, M. AND TAYLOR, R. N. 1988. Combining axioms. IEEE Trans. Softw. Eng. SE-15, 4, static concurrency analysis with symbolic ex- (April), 496 –501. Received November 1994; revised March 1996; accepted October 1996 ACM Computing Surveys, Vol. 29, No. 4, December 1997
