Software Unit Test Coverage and Adequacy



HONG ZHU
Nanjing University

PATRICK A. V. HALL AND JOHN H. R. MAY
The Open University, Milton Keynes, UK

Objective measurement of test quality is one of the key issues in software testing. It has been a major research focus for the last two decades. Many test criteria have been proposed and studied for this purpose. Various kinds of rationales have been presented in support of one criterion or another. We survey the research work in this area. The notion of adequacy criteria is examined together with its role in software dynamic testing. A review of criteria classification is followed by a summary of the methods for comparison and assessment of criteria.

Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging
General Terms: Measurement, Performance, Reliability, Verification
Additional Key Words and Phrases: Comparing testing effectiveness, fault-detection, software unit test, test adequacy criteria, test coverage, testing methods

Authors’ addresses: H. Zhu, Institute of Computer Software, Nanjing University, Nanjing, 210093, P.R. of China; email: ; P.A.V. Hall and J.H.R. May, Department of Computing, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK.
Permission to make digital / hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and / or a fee.
© 1997 ACM 0360-0300/97/1200–0366 $03.50
ACM Computing Surveys, Vol. 29, No. 4, December 1997

1. INTRODUCTION

In 1972, Dijkstra claimed that “program testing can be used to show the presence of bugs, but never their absence” to persuade us that a testing approach is not acceptable [Dijkstra 1972]. However, the last two decades have seen rapid growth of research in software testing as well as intensive practice and experiments. It has been developed into a validation and verification technique indispensable to the software engineering discipline. Then, where are we today? What can we claim about software testing?

In the mid-’70s, in an examination of the capability of testing for demonstrating the absence of errors in a program, Goodenough and Gerhart [1975, 1977] made an early breakthrough in research on software testing by pointing out that the central question of software testing is “what is a test criterion?”, that is, the criterion that defines what constitutes an adequate test. Since then, test criteria have been a major research focus. A great number of such criteria have been proposed and investigated. Considerable research effort has attempted to provide support for the use of one criterion or another. How should we understand these different criteria? What are the future directions for the subject?

In contrast to the constant attention given to test adequacy criteria by academics, the software industry has been slow to accept test adequacy measurement. Few software development standards require or even recommend the use of test adequacy criteria [Wichmann 1993; Wichmann and Cox 1992]. Are test adequacy criteria worth the cost for practical use?

Addressing these questions, we survey research on software test criteria in the past two decades and attempt to put it into a uniform framework.

1.1 The Notion of Test Adequacy

Let us start with some examples. Here we seek to illustrate the basic notions underlying adequacy criteria. Precise definitions will be given later.

- Statement coverage. In software testing practice, testers are often required to generate test cases to execute every statement in the program at least once. A test case is an input on which the program under test is executed during testing. A test set is a set of test cases for testing a program. The requirement of executing all the statements in the program under test is an adequacy criterion. A test set that satisfies this requirement is considered to be adequate according to the statement coverage criterion. Sometimes the percentage of executed statements is calculated to indicate how adequately the testing has been performed. The percentage of the statements exercised by testing is a measurement of the adequacy.

- Branch coverage. Similarly, the branch coverage criterion requires that all control transfers in the program under test are exercised during testing. The percentage of the control transfers executed during testing is a measurement of test adequacy.

- Path coverage. The path coverage criterion requires that all the execution paths from the program’s entry to its exit are executed during testing.

- Mutation adequacy. Software testing is often aimed at detecting faults in software. A way to measure how well this objective has been achieved is to plant some artificial faults into the program and check if they are detected by the test. A program with a planted fault is called a mutant of the original program. If a mutant and the original program produce different outputs on at least one test case, the fault is detected. In this case, we say that the mutant is dead or killed by the test set. Otherwise, the mutant is still alive. The percentage of dead mutants compared to the mutants that are not equivalent to the original program is an adequacy measurement, called the mutation score or mutation adequacy [Budd et al. 1978; DeMillo et al. 1978; Hamlet 1977].

From Goodenough and Gerhart’s [1975, 1977] point of view, a software test adequacy criterion is a predicate that defines “what properties of a program must be exercised to constitute a ‘thorough’ test, i.e., one whose successful execution implies no errors in a tested program.” To guarantee the correctness of adequately tested programs, they proposed reliability and validity requirements of test criteria. Reliability requires that a test criterion always produce consistent test results; that is, if the program tests successfully on one test set that satisfies the criterion, then it also tests successfully on all test sets that satisfy the criterion. Validity requires that the test always produce a meaningful result; that is, for every error in a program, there exists a test set that satisfies the criterion and is capable of revealing the error. But it was soon recognized that there is no computable criterion that satisfies the two requirements, and hence they are not practically applicable [Howden 1976]. Moreover, these two requirements are not independent since a criterion is either reliable or valid for any given software [Weyuker and Ostrand 1980]. Since then, the focus of research seems to have shifted from seeking theoretically ideal criteria to
the search for practically applicable approximations.

Currently, the software testing literature contains two different, but closely related, notions associated with the term test data adequacy criteria. First, an adequacy criterion is considered to be a stopping rule that determines whether sufficient testing has been done so that it can be stopped. For instance, when using the statement coverage criterion, we can stop testing if all the statements of the program have been executed. Generally speaking, since software testing involves the program under test, the set of test cases, and the specification of the software, an adequacy criterion can be formalized as a function $C$ that takes a program $p$, a specification $s$, and a test set $t$ and gives a truth value true or false. Formally, let $P$ be a set of programs, $S$ be a set of specifications, $D$ be the set of inputs of the programs in $P$, and $T$ be the class of test sets, that is, $T = 2^D$, where $2^X$ denotes the set of subsets of $X$.

Definition 1.1 (Test Data Adequacy Criteria as Stopping Rules). A test data adequacy criterion $C$ is a function $C: P \times S \times T \to \{\mathit{true}, \mathit{false}\}$. $C(p, s, t) = \mathit{true}$ means that $t$ is adequate for testing program $p$ against specification $s$ according to the criterion $C$; otherwise $t$ is inadequate.

Second, test data adequacy criteria provide measurements of test quality when a degree of adequacy is associated with each test set so that it is not simply classified as good or bad. In practice, the percentage of code coverage is often used as an adequacy measurement. Thus, an adequacy criterion $C$ can be formally defined to be a function $C$ from a program $p$, a specification $s$, and a test set $t$ to a real number $r = C(p, s, t)$, the degree of adequacy [Zhu and Hall 1992]. Formally:

Definition 1.2 (Test Data Adequacy Criteria as Measurements). A test data adequacy criterion is a function $C$, $C: P \times S \times T \to [0, 1]$. $C(p, s, t) = r$ means that the adequacy of testing the program $p$ by the test set $t$ with respect to the specification $s$ is of degree $r$ according to the criterion $C$. The greater the real number $r$, the more adequate the testing.

These two notions of test data adequacy criteria are closely related to one another. A stopping rule is a special case of measurement on the continuum since the actual range of measurement results is the set $\{0, 1\}$, where 0 means false and 1 means true. On the other hand, given an adequacy measurement $M$ and a degree $r$ of adequacy, one can always construct a stopping rule $M_r$ such that a test set is adequate if and only if the adequacy degree is greater than or equal to $r$; that is, $M_r(p, s, t) = \mathit{true} \Leftrightarrow M(p, s, t) \geq r$. Since a stopping rule asserts a test set to be either adequate or inadequate, it is also called a predicate rule in the literature.

An adequacy criterion is an essential part of any testing method. It plays two fundamental roles. First, an adequacy criterion specifies a particular software testing requirement, and hence determines test cases to satisfy the requirement. It can be defined in one of the following forms.

(1) It can be an explicit specification for test case selection, such as a set of guidelines for the selection of test cases. Following such rules one can produce a set of test cases, although there may be some form of random selections. Such a rule is usually referred to as a test case selection criterion. Using a test case selection criterion, a testing method may be defined constructively in the form of an algorithm which generates a test set from the software under test and its own specification. This test set is then considered adequate. It should be noticed that for a given test case selection criterion, there may exist a number of test case generation algorithms. Such an algorithm may also involve random sampling among many adequate test sets.
(2) It can also be in the form of specifying how to decide whether a given test set is adequate or specifying how to measure the adequacy of a test set. A rule that determines whether a test set is adequate (or more generally, how adequate) is usually referred to as a test data adequacy criterion.

However, the fundamental concept underlying both test case selection criteria and test data adequacy criteria is the same, that is, the notion of test adequacy. In many cases they can be easily transformed from one form to another. Mathematically speaking, test case selection criteria are generators, that is, functions that produce a class of test sets from the program under test and the specification (see Definition 1.3). Any test set in this class is adequate, so that we can use any of them equally.1 Test data adequacy criteria are acceptors that are functions from the program under test, the specification of the software and the test set to a characteristic number as defined in Definition 1.1. Generators and acceptors are mathematically equivalent in the sense of one-one correspondence. Hence, we use “test adequacy criteria” to denote both of them.

Footnote 1. Test data selection criteria as generators should not be confused with test case generation software tools, which may only generate one test set.

Definition 1.3 (Test Data Adequacy Criteria as Generators [Budd and Angluin 1982]). A test data adequacy criterion $C$ is a function $C: P \times S \to 2^T$. A test set $t \in C(p, s)$ means that $t$ satisfies $C$ with respect to $p$ and $s$, and it is said that $t$ is adequate for $(p, s)$ according to $C$.

The second role that an adequacy criterion plays is to determine the observations that should be made during the testing process. For example, statement coverage requires that the tester, or the testing system, observe whether each statement is executed during the process of software testing. If path coverage is used, then the observation of whether statements have been executed is insufficient; execution paths should be observed and recorded. However, if mutation score is used, it is unnecessary to observe whether a statement is executed during testing. Instead, the output of the original program and the output of the mutants need to be recorded and compared.

Although, given an adequacy criterion, different methods could be developed to generate test sets automatically or to select test cases systematically and efficiently, the main features of a testing method are largely determined by the adequacy criterion. For example, as we show later, the adequacy criterion is related to fault-detecting ability, the dependability of the program that passes a successful test and the number of test cases required. Unfortunately, the exact relationship between a particular adequacy criterion and the correctness or reliability of the software that passes the test remains unclear.

Due to the central role that adequacy criteria play in software testing, software testing methods are often compared in terms of the underlying adequacy criteria. Therefore, subsequently, we use the name of an adequacy criterion as a synonym of the corresponding testing method when there is no possibility of confusion.

1.2 The Uses of Test Adequacy Criteria

An important issue in the management of software testing is to “ensure that before any testing the objectives of that testing are known and agreed and that the objectives are set in terms that can be measured.” Such objectives “should be quantified, reasonable and achievable” [Ould and Unwin 1986]. Almost all test adequacy criteria proposed in the literature explicitly specify particular requirements on software testing. They are objective rules applicable by project managers for this purpose.

For example, branch coverage is a
test requirement that all branches of the program should be exercised. The objective of testing is to satisfy this requirement. The degree to which this objective is achieved can be measured quantitatively by the percentage of branches exercised. The mutation adequacy criterion specifies the testing requirement that a test set should be able to rule out a particular set of software faults, that is, those represented by mutants. Mutation score is another kind of quantitative measurement of test quality.

Test data adequacy criteria are also very helpful tools for software testers. There are two levels of software testing processes. At the lower level, testing is a process where a program is tested by feeding more and more test cases to it. Here, a test adequacy criterion can be used as a stopping rule to decide when this process can stop. Once the measurement of test adequacy indicates that the test objectives have been achieved, then no further test case is needed. Otherwise, when the measurement of test adequacy shows that a test has not achieved the objectives, more tests must be made. In this case, the adequacy criterion also provides a guideline for the selection of the additional test cases. In this way, adequacy criteria help testers to manage the software testing process so that software quality is ensured by performing sufficient tests. At the same time, the cost of testing is controlled by avoiding redundant and unnecessary tests. This role of adequacy criteria has been considered by some computer scientists [Weyuker 1986] to be one of the most important.

At a higher level, the testing procedure can be considered as repeated cycles of testing, debugging, modifying program code, and then testing again. Ideally, this process should stop only when the software has met the required reliability requirements. Although test data adequacy criteria do not play the role of stopping rules at this level, they make an important contribution to the assessment of software dependability. Generally speaking, there are two basic aspects of software dependability assessment. One is the dependability estimation itself, such as a reliability figure. The other is the confidence in estimation, such as the confidence or the accuracy of the reliability estimate. The role of test adequacy here is a contributory factor in building confidence in the integrity estimate. Recent research has shown some positive results with respect to this role [Tsoukalas 1993].

Although it is common in current software testing practice that the test processes at both the higher and lower levels stop when money or time runs out, there is a tendency towards the use of systematic testing methods with the application of test adequacy criteria.

1.3 Categories of Test Data Adequacy Criteria

There are various ways to classify adequacy criteria. One of the most common is by the source of information used to specify testing requirements and in the measurement of test adequacy. Hence, an adequacy criterion can be:

- specification-based, which specifies the required testing in terms of identified features of the specification or the requirements of the software, so that a test set is adequate if all the identified features have been fully exercised. In software testing literature it is fairly common that no distinction is made between specification and requirements. This tradition is followed in this article also;

- program-based, which specifies testing requirements in terms of the program under test and decides if a test set is adequate according to whether the program has been thoroughly exercised.

It should not be forgotten that for both specification-based and program-based testing, the correctness of program outputs must be checked against the specification or the requirements. However,
in both cases, the measurement of test adequacy does not depend on the results of this checking. Also, the definition of specification-based criteria given previously does not presume the existence of a formal specification.

It has been widely acknowledged that software testing should use information from both specification and program. Combining these two approaches, we have:

- combined specification- and program-based criteria, which use the ideas of both program-based and specification-based criteria.

There are also test adequacy criteria that specify testing requirements without employing any internal information from the specification or the program. For example, test adequacy can be measured according to the prospective usage of the software by considering whether the test cases cover the data that are most likely to be frequently used as input in the operation of the software. Although few criteria are explicitly proposed in such a way, selecting test cases according to the usage of the software is the idea underlying random testing, or statistical testing. In random testing, test cases are sampled at random according to a probability distribution over the input space. Such a distribution can be the one representing the operation of the software, in which case the random testing is called representative. It can also be any other probability distribution, such as a uniform distribution, in which case the random testing is called nonrepresentative. Generally speaking, if a criterion employs only the “interface” information (the type and valid range of the software input), it can be called an interface-based criterion:

- interface-based criteria, which specify testing requirements only in terms of the type and range of software input without reference to any internal features of the specification or the program.

In the software testing literature, people often talk about white-box testing and black-box testing. Black-box testing treats the program under test as a “black box.” No knowledge about the implementation is assumed. In white-box testing, the tester has access to the details of the program under test and performs the testing according to such details. Therefore, specification-based criteria and interface-based criteria belong to black-box testing. Program-based criteria and combined specification- and program-based criteria belong to white-box testing.

Another classification of test adequacy criteria is by the underlying testing approach. There are three basic approaches to software testing:

(1) structural testing: specifies testing requirements in terms of the coverage of a particular set of elements in the structure of the program or the specification;

(2) fault-based testing: focuses on detecting faults (i.e., defects) in the software. An adequacy criterion of this approach is some measurement of the fault detecting ability of test sets.2

(3) error-based testing: requires test cases to check the program on certain error-prone points according to our knowledge about how programs typically depart from their specifications.

The source of information used in the adequacy measurement and the underlying approach to testing can be considered as two dimensions of the space of software test adequacy criteria. A software test adequacy criterion can be classified by these two aspects. The review of adequacy criteria is organized according to the structure of this space.

Footnote 2. We use the word fault to denote defects in software and the word error to denote defects in the outputs produced by a program. An execution that produces an error is called a failure.
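As a minimal sketch of two notions from Section 1, the following Python fragment computes a mutation score (an adequacy measurement in the sense of Definition 1.2) and derives from it a stopping rule $M_r$ that accepts a test set if and only if the measurement is at least $r$. Everything in it is an illustrative assumption rather than material from the article: the program under test, the hand-written mutants, the function names, and the threshold 0.9 are all hypothetical, the specification $s$ is left out because the mutation score does not consult it, and equivalent mutants are identified by hand.

```python
from typing import Callable, Sequence, Tuple

# All names below are hypothetical and chosen only for illustration.
TestCase = Tuple[int, int]
Program = Callable[[int, int], int]


def original(a: int, b: int) -> int:
    """Program under test: the larger of two integers."""
    return a if a >= b else b


# Hand-written mutants, each a small change to `original`.
def mutant_gt(a: int, b: int) -> int:
    return a if a > b else b      # equivalent: same output even when a == b


def mutant_le(a: int, b: int) -> int:
    return a if a <= b else b     # relational operator replaced


def mutant_first(a: int, b: int) -> int:
    return a                      # always returns the first argument


def mutant_second(a: int, b: int) -> int:
    return b                      # always returns the second argument


# Assumption: equivalence was decided by inspection, so mutant_gt is excluded.
NON_EQUIVALENT = [mutant_le, mutant_first, mutant_second]


def is_killed(mutant: Program, tests: Sequence[TestCase]) -> bool:
    """A mutant is killed if its output differs on at least one test case."""
    return any(mutant(a, b) != original(a, b) for a, b in tests)


def mutation_score(mutants: Sequence[Program], tests: Sequence[TestCase]) -> float:
    """Fraction of the given (non-equivalent) mutants killed by the test set."""
    killed = sum(is_killed(m, tests) for m in mutants)
    return killed / len(mutants)


def stopping_rule(measure, r: float):
    """Build M_r from a measurement M: adequate iff M(...) >= r."""
    return lambda mutants, tests: measure(mutants, tests) >= r


small_set = [(2, 1)]
larger_set = [(2, 1), (1, 2)]
adequate = stopping_rule(mutation_score, 0.9)

print(round(mutation_score(NON_EQUIVALENT, small_set), 3))   # 0.667
print(mutation_score(NON_EQUIVALENT, larger_set))            # 1.0
print(adequate(NON_EQUIVALENT, small_set))                   # False
print(adequate(NON_EQUIVALENT, larger_set))                  # True
```

With the single test case (2, 1), the mutant that always returns its first argument survives, so the score is 2/3 and the rule with r = 0.9 asks for more testing. Adding (1, 2) kills it and the rule is satisfied.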
1.4 Organization of the Article

The remainder of the article consists of two main parts. The first part surveys various types of test data adequacy criteria proposed in the literature. It includes three sections devoted to structural testing, fault-based testing, and error-based testing. Each section consists of several subsections covering the principles of the testing method and their application to program-based and specification-based test criteria. The second part is devoted to the rationale presented in the literature in support of the various criteria. It has two sections. Section 5 discusses the methods of comparing adequacy criteria and surveys the research results in the literature. Section 6 discusses the axiomatic study and assessment of adequacy criteria. Finally, Section 7 concludes the paper.

2. STRUCTURAL TESTING

This section is devoted to adequacy criteria for structural testing. It consists of two subsections, one for program-based criteria and the other for specification-based criteria.

2.1 Program-Based Structural Testing

There are two main groups of program-based structural test adequacy criteria: control-flow criteria and data-flow criteria. These two types of adequacy criteria are combined and extended to give dependence coverage criteria. Most adequacy criteria of these two groups are based on the flow-graph model of program structure. However, a few control-flow criteria define test requirements in terms of program text rather than using an abstract model of software structure.

2.1.1 Control Flow Adequacy Criteria. Before we formally define various control-flow-based adequacy criteria, we first give an introduction to the flow graph model of program structure.

A. The flow graph model of program structure. The control flow graph stems from compiler work and has long been used as a model of program structure. It is widely used in static analysis of software [Fenton et al. 1985; Kosaraju 1974; McCabe 1976; Paige 1975]. It has also been used to define and study program-based structural test adequacy criteria [White 1981]. In this section we give a brief introduction to the flow-graph model of program structure. Although we use graph-theory terminology in the following discussion, readers are required to have only a preliminary knowledge of graph theory. To help understand the terminology and to avoid confusion, a glossary is provided in the Appendix.

A flow graph is a directed graph that consists of a set $N$ of nodes and a set $E \subseteq N \times N$ of directed edges between nodes. Each node represents a linear sequence of computations. Each edge representing transfer of control is an ordered pair $\langle n_1, n_2 \rangle$ of nodes, and is associated with a predicate that represents the condition of control transfer from node $n_1$ to node $n_2$. In a flow graph, there is a begin node and an end node where the computation starts and finishes, respectively. The begin node has no inward edges and the end node has no outward edges. Every node in a flow graph must be on a path from the begin node to the end node. Figure 1 is an example of a flow graph.

Example 2.1 The following program computes the greatest common divisor of two natural numbers by Euclid’s algorithm. Figure 1 is the corresponding flow graph.

    begin
      input (x, y);
      while (x > 0 and y > 0) do
        if (x > y)
          then x := x - y
          else y := y - x
        endif
      endwhile;
      output (x + y);
    end

[Figure 1. Flow graph for program in Example 2.1.]

It should be noted that in the literature there are a number of conventions of flow-graph models with subtle differences, such as whether a node is allowed to be associated with an empty sequence of statements, the number of outward edges allowed for a node, the number of end nodes allowed in a flow graph, and the like. Although most adequacy criteria can be defined independently of such conventions, using different ones may result in different measures of test adequacy. Moreover, testing tools may be sensitive to such conventions. In this article no restrictions on the conventions are made.

For programs written in a procedural programming language, flow-graph models can be generated automatically. Figure 2 gives the correspondences between some structured statements and their flow-graph structures. Using these rules, a flow graph, shown in Figure 3, can be derived from the program given in Example 2.1. Generally, to construct a flow graph for a given program, the program code is decomposed into a set of disjoint blocks of linear sequences of statements. A block has the property that whenever the first statement of the block is executed, the other statements are executed in the given order. Furthermore, the first statement of the block is the only statement that may be executed directly after the execution of a statement in another block. Each block corresponds to a node in the flow graph. A control transfer from one block to another is represented by a directed edge between the nodes such that the condition of the control transfer is associated with it.

B. Control-flow adequacy criteria. Now, given a flow-graph model of a program and a set of test cases, how do we measure the adequacy of testing for the program on the test set? First of all, recall that the execution of the program on an input datum is modeled as a traversal of the flow graph. Every execution corresponds to a path in the flow graph from the begin node to the end node. Such a path is called a complete computation path, or simply a computation path or an execution path in software testing literature.

A very basic requirement of adequate testing is that all the statements in the program are covered by test executions. This is usually called statement coverage [Hetzel 1984]. But full statement coverage cannot always be achieved because of the possible existence of infeasible statements, that is, dead code. Whether a piece of code is dead code is undecidable [Weyuker 1979a; Weyuker 1979b; White 1981]. Because statements correspond to nodes in flow-graph models, this criterion can be defined in terms of flow graphs, as follows.

Definition 2.1 (Statement Coverage Criterion). A set $P$ of execution paths
satisfies the statement coverage criterion if and only if for all nodes $n$ in the flow graph, there is at least one path $p$ in $P$ such that node $n$ is on the path $p$.

[Figure 2. Example flow graphs for structured statements.]

[Figure 3. Flow graph for Example 2.1.]

Notice that statement coverage is so weak that even some control transfers may be missed from an adequate test. Hence, we have a slightly stronger requirement of adequate test, called branch coverage [Hetzel 1984], that all control transfers must be checked. Since control transfers correspond to edges in flow graphs, the branch coverage criterion can be defined as the coverage of all edges in the flow graph.

Definition 2.2 (Branch Coverage Criterion). A set $P$ of execution paths satisfies the branch coverage criterion if and only if for all edges $e$ in the flow graph, there is at least one path $p$ in $P$ such that $p$ contains the edge $e$.

Branch coverage is stronger than statement coverage because if all edges in a flow graph are covered, all nodes are necessarily covered. Therefore, a test set that satisfies the branch coverage criterion must also satisfy statement coverage. Such a relationship between adequacy criteria is called the subsumes relation. It is of interest in the comparison of software test adequacy criteria (see details in Section 5.1.3).

However, even if all branches are exercised, this does not mean that all combinations of control transfers are checked. The requirement of checking all combinations of branches is usually called path coverage or path testing, which can be defined as follows.

Definition 2.3 (Path Coverage Criterion). A set $P$ of execution paths satisfies the path coverage criterion if and only if $P$ contains all execution paths from the begin node to the end node in the flow graph.

Although the path coverage criterion still cannot guarantee the correctness of a tested program, it is too strong to be practically useful for most programs, because there can be an infinite number of different paths in a program with loops. In such a case, an infinite set of test data must be executed for adequate testing. This means that the testing cannot finish in a finite period of time. But, in practice, software testing must be fulfilled within a limited fixed period of time. Therefore, a test set must be finite. The requirement that an adequacy criterion can always be satisfied by a finite test set is called finite applicability [Zhu and Hall 1993] (see Section 6).

The statement coverage criterion and branch coverage criterion are not finitely applicable either, because they require testing to cover infeasible elements. For instance, statement coverage requires that all the statements in a program are executed. However, a program may have infeasible statements, that is, dead code, so that no input data can cause their execution. Therefore, in such cases, there is no adequate test set that can satisfy statement coverage. Similarly, branch coverage is not finitely applicable because a program may contain infeasible branches. However, for statement coverage and also branch coverage, we can define a finitely applicable version of the criterion by requiring testing only to cover the feasible elements. Most program-based adequacy criteria in the literature are not finitely applicable, but finitely applicable versions can often be obtained by redefinition in this way. Subsequently, such a version is called the feasible version of the adequacy criterion. It should be noted, first, that although we can often obtain finite applicability by using the feasible version, this may cause the undecidability problem; that is, we may not be able to decide whether a test set satisfies a given adequacy criterion. For example, whether a statement in a program is feasible is undecidable [Weyuker 1979a; Weyuker 1979b; White 1991]. Therefore, when a test set does not cover all the statements in a program, we may not be able to decide whether a statement not covered by the test data is dead code. Hence, we may not be able to decide if the test set satisfies the feasible version of statement coverage. Second, for some adequacy criteria, such as path coverage, we cannot obtain finite applicability by such a redefinition.

Recall that the rationale for path coverage is that there is no path that does not need to be checked by testing, while finite applicability forces us to select a finite subset of paths. Thus, research into flow-graph-based adequacy criteria has focused on the selection of the most important subsets of paths. Probably the most straightforward solution to the conflict is to select paths that contain no redundant information. Hence, two notions from graph theory can be used. First, a path that has no repeated occurrence of any edge is called a simple path in graph theory. Second, a path that has no repeated occurrences of any node is called an elementary path. Thus, it is possible to define simple path coverage and elementary path coverage criteria, which require that adequate test sets should cover all simple paths and elementary paths, respectively.

These two criteria are typical ones that select finite subsets of paths by specifying restrictions on the complexity of the individual paths. Another exam-
  11. 11. 376 • Zhu et al. ple of this type is the length-n path graph theory, the maximal size of a set coverage criterion, which requires cover- of independent paths is unique for any age of all subpaths of length less than given graph and is called the cyclomatic or equal to n [Gourlay 1983]. A more number, and can be easily calculated by complicated example of the type is the following formula. Paige’s level-i path coverage criterion [Paige 1978; Paige 1975]. Informally, v G e n p, the criterion starts with testing all ele- where v(G) denotes the cyclomatic num- mentary paths from the begin node to ber of the graph G, n is the number of the end node. Then, if there is an ele- vertices in G, e is the number of edges, mentary subpath or cycle that has not and p is the number of strongly con- been exercised, the subpath is required nected components.3 The adequacy cri- to be checked at next level. This process terion is then defined as follows. is repeated until all nodes and edges are covered by testing. Obviously, a test set Definition 2.4 (Cyclomatic-Number that satisfies the level-i path coverage Criterion). A set P of execution paths criterion must also satisfy the elemen- satisfies the cyclomatic number crite- tary path coverage criterion, because rion if and only if P contains at least elementary paths are level-0 paths. one set of v independent paths, where A set of control-flow adequacy criteria v e n p is the cyclomatic number that are concerned with testing loops is of the flow graph. loop count criteria, which date back to the mid-1970s [Bently et al. 1993]. For McCabe also gave an algorithm to any given natural number K, the loop generate a set of independent paths count-K criterion requires that every from any given flow graph [McCabe loop in the program under test should 1976]. Paige [1978] has shown that the be executed zero times, once, twice, and level-i path coverage criterion subsumes so on, up to K times [Howden 1975]. 
McCabe’s cyclomatic number criterion. Another control-flow criterion con- The preceding control-flow test ade- cerned with testing loops is the cycle quacy criteria are all defined in terms of combination criterion, which requires flow-graph models of program structure that an adequate test set should cover except loop count criteria, which are all execution paths that do not contain a defined in terms of program text. A cycle more than once. number of other test adequacy criteria An alternative approach to defining are based on the text of program. One of control-flow adequacy criteria is to spec- the most popular criteria in software ify restrictions on the redundancy testing practice is the so-called multiple among the paths. McCabe’s cyclomatic condition coverage discussed in Myers’ measurement is such an example [Mc- [1979] classic book which has proved Cabe 1976; McCabe 1983; McCabe and popular in commercial software testing Schulmeyer 1985]. It is based on the practice. The criterion focuses on the theorem of graph theory that for any conditions of control transfers, such as flow graph there is a set of execution the condition in an IF-statement or a paths such that every execution path WHILE-LOOP statement. A test set is can be expressed as a linear combina- said of satisfying the decision coverage tion of them. A set of paths is indepen- criterion if for every condition there is dent if none of them is a linear combina- at least one test case such that the tion of the others. According to McCabe, condition has value true when evalu- a path should be tested if it is indepen- dent of the paths that have been tested. 3 A graph is strongly connected if, for any two On the other hand, if a path is a linear nodes a and b, there exists a path from a to b and combination of tested paths, it can be a path from b to a. Strongly connected compo- considered redundant. According to nents are maximal strongly connected subgraphs. ACM Computing Surveys, Vol. 29, No. 
4, December 1997
  12. 12. Test Coverage and Adequacy • 377 ated, and there is also at least one test TER2, respectively, where TER repre- case such that the condition has value sents test effectiveness ratio. The cover- false. In a high-level programming lan- age of LCSAJ is the third level, which is guage, a condition can be a Boolean defined as TER3. The hierarchy is then expression consisting of several atomic extended to the coverage of program predicates combined by logic connec- paths containing a number of LCSAJs. tives like and, or, and not. A test set satisfies the condition coverage crite- Definition 2.6 (TER3: LCSAJ Cover- rion if for every atomic predicate there age) is at least one test case such that the predicate has value true when evalu- number of LCSAJs exercised ated, and there is also at least one test at least once case such that the predicate has value TER3 total number of LCSAJs. false. Although the result value of an evaluation of a Boolean expression can Generally speaking, an advantage of only be one of two possibilities, true or text-based adequacy criteria is that test false, the result may be due to different adequacy can be easily calculated from combinations of the truth values of the the part of program text executed dur- atomic predicates. The multiple condi- ing testing. However, their definitions tion coverage criterion requires that a may be sensitive to language details. test set should cover all possible combi- For programs written in a structured nations of the truth values of atomic programming language, the application predicates in every condition. This crite- of TERn for n greater than or equal to 3 rion is sometimes also called extended requires analysis and reformatting of branch coverage in the literature. For- the program structure. In such cases, mally: the connection between program text Definition 2.5 (Multiple Condition and its test adequacy becomes less Coverage). A test set T is said to be straightforward. 
In fact, it is observed adequate according to the multiple-con- in software testing practice that a small dition-coverage criterion if, for every modification to the program may result condition C, which consists of atomic in a considerably different set of linear predicates (p 1 , p 2 , . . . , p n ), and all the sequence code and jumps. possible combinations (b 1 , b 2 , . . . , b n ) It should be noted that none of the of their truth values, there is at least adequacy criteria discussed in this sec- one test case in T such that the value of tion are applicable due to the possible p i equals b i , i 1, 2, . . . , n. existence of infeasible elements in a program, such as infeasible statements, Woodward et al. [1980] proposed and infeasible branches, infeasible combina- studied a hierarchy of program-text- tions of conditions, and the like. These based test data adequacy criteria based criteria, except path coverage, can be on a class of program units called linear redefined to obtain finite applicability code sequence and jump (LCSAJ). These by only requiring the coverage of feasi- criteria are usually referred to as test ble elements. effectiveness metrics in the literature. An LCSAJ consists of a body of code 2.1.2 Data-Flow-Based Test Data Ad- through which the flow of control may equacy Criteria. In the previous sec- proceed sequentially and which is ter- tion, we have seen how control-flow in- minated by a jump in the control flow. formation in the program under test is The hierarchy TERi , i 1, 2, . . . , used to specify testing requirements. In n, . . . of criteria starts with statement this section, data-flow information is coverage as the lowest level, followed by taken into account in the definition of branch coverage as the next lowest testing requirements. We first introduce level. They are denoted by TER1 and the way that data-flow information is ACM Computing Surveys, Vol. 29, No. 4, December 1997
added into the flow-graph models of program structures. Then, three basic groups of data-flow adequacy criteria are reviewed. Finally, their limitations and extensions are discussed.

A. Data-flow information in flow graph. Data-flow analysis of test adequacy is concerned with the coverage of flow-graph paths that are significant for the data flow in the program. Therefore, data-flow information is introduced into the flow-graph models of program structures.

Data-flow testing methods are based on the investigation of the ways in which values are associated with variables and how these associations can affect the execution of the program. This analysis focuses on the occurrences of variables within the program. Each variable occurrence is classified as either a definition occurrence or a use occurrence. A definition occurrence of a variable is where a value is bound to the variable. A use occurrence of a variable is where the value of the variable is referenced. Each use occurrence is further classified as being a computational use or a predicate use. If the value of a variable is used to decide whether a predicate is true for selecting execution paths, the occurrence is called a predicate use. Otherwise, it is used to compute a value for defining other variables or as an output value. It is then called a computational use. For example, an assignment statement of the form "y := $x_1$ op $x_2$", where op is a binary arithmetic operator, contains computational uses of $x_1$ and $x_2$ and a definition of y. A statement of the form "if $x_1$ rel $x_2$ then goto L endif", where rel is a relational operator, contains predicate uses of $x_1$ and $x_2$.

Since we are interested in tracing the flow of data between nodes, any definition that is used only within the node in which the definition occurs is of little importance. Therefore a distinction is made between local computational uses and global computational uses. A global computational use of a variable x is where no definition of x precedes the computational use within the node in which it occurs. That is, the value must have been bound to x in some node other than the one in which it is being used. Otherwise it is a local computational use.

Data-flow test adequacy analysis is concerned with subpaths from definitions to nodes where those definitions are used. A definition-clear path with respect to a variable x is a path where for all nodes in the path there is no definition occurrence of the variable x. A definition occurrence of a variable x at a node u reaches a computational use occurrence of the variable at node v if and only if there is a path p from u to v such that $p = (u, w_1, w_2, \ldots, w_n, v)$, and $(w_1, w_2, \ldots, w_n)$ is definition-clear with respect to x, and the occurrence of x at v is a global computational use. We say that the definition of x at u reaches the computational occurrence of x at v through the path p. Similarly, if there is a path $p = (u, w_1, w_2, \ldots, w_n, v)$ from u to v, and $(w_1, w_2, \ldots, w_n)$ is definition-clear with respect to x, and there is a predicate occurrence of x associated with the edge from $w_n$ to v, we say that u reaches the predicate use of x on the edge $(w_n, v)$ through the path p. If a path in one of the preceding definitions is feasible, that is, there is at least one input datum that can actually cause the execution of the path, we say that a definition feasibly reaches a use of the definition.

Three groups of data-flow adequacy criteria have been proposed in the literature, and are discussed in the following.

B. Simple definition-use association coverage: the Rapps-Weyuker-Frankl family. Rapps and Weyuker [1985] proposed a family of testing adequacy criteria based on data-flow information. Their criteria are concerned mainly with the simplest type of data-flow paths that start with a definition of a variable and end with a use of the same variable. Frankl and Weyuker [1988] later reexamined the data-flow adequacy criteria and found that the original definitions of the criteria did not
satisfy the applicability condition. They redefined the criteria to be applicable. The following definitions come from the modified definitions.

The all-definitions criterion requires that an adequate test set should cover all definition occurrences in the sense that, for each definition occurrence, the testing paths should cover a path through which the definition reaches a use of the definition.

Definition 2.7 (All Definitions Criterion). A set P of execution paths satisfies the all-definitions criterion if and only if for all definition occurrences of a variable x such that there is a use of x which is feasibly reachable from the definition, there is at least one path p in P such that p includes a subpath through which the definition of x reaches some use occurrence of x.

Since one definition occurrence of a variable may reach more than one use occurrence, the all-uses criterion requires that all of the uses should be exercised by testing. Obviously, this requirement is stronger than the all-definitions criterion.

Definition 2.8 (All Uses Criterion). A set P of execution paths satisfies the all-uses criterion if and only if for all definition occurrences of a variable x and all use occurrences of x that the definition feasibly reaches, there is at least one path p in P such that p includes a subpath through which that definition reaches the use.

The all-uses criterion was also proposed by Herman [1976], and called the reach-coverage criterion. As discussed at the beginning of the section, use occurrences are classified into computational use occurrences and predicate use occurrences. Hence, emphasis can be put either on computational uses or on predicate uses. Rapps and Weyuker [1985] identified four adequacy criteria of different strengths and emphasis. The all-c-uses/some-p-uses criterion requires that all of the computational uses are exercised, but it also requires that at least one predicate use should be exercised when there is no computational use of the variable. In contrast, the all-p-uses/some-c-uses criterion puts emphasis on predicate uses by requiring that test sets should exercise all predicate uses and exercise at least one computational use when there is no predicate use. Two even weaker criteria were also defined. The all-predicate-uses criterion completely ignores the computational uses and requires that only predicate uses need to be tested. The all-computation-uses criterion only requires that computational uses should be tested and ignores the predicate uses.

Notice that, given a definition occurrence of a variable x and a use of the variable x that is reachable from that definition, there may exist many paths through which the definition reaches the use. A weakness of the preceding criteria is that they require only one of such paths to be exercised by testing. However, the applicability problem arises if all such paths are to be exercised because there may exist an infinite number of such paths in a flow graph. For example, consider the flow graph in Figure 1. The definition of y at node a1 reaches the use of y at node a3 through all the paths of the form

$$(a1, a2) \wedge (a2, a2)^{n} \wedge (a2, a3), \quad n \geq 1,$$

where $\wedge$ is the concatenation of paths, and $p^{n}$ is the concatenation of p with itself n times, which is inductively defined to be $p^{1} = p$ and $p^{k} = p \wedge p^{k-1}$, for all $k > 1$. To obtain finite applicability, Frankl and Weyuker [1988] and Clarke et al. [1989] restricted the paths to those that are cycle-free or in which only the end node is the same as the start node.

Definition 2.9 (All Definition-Use-Paths Criterion: Abbr. All DU-Paths Criterion). A set P of execution paths satisfies the all-du-paths criterion if and only if for all definitions of a variable x and all paths q through which
that definition reaches a use of x, there is at least one path p in P such that q is a subpath of p, and q is cycle-free or contains only simple cycles.

However, even with this restriction, the criterion is still not applicable since such a path may be infeasible.

C. Interactions between variables: the Ntafos required k-tuples criteria. Ntafos [1984] also used data-flow information to analyze test data adequacy. He studied how the values of different variables interact, and defined a family of adequacy criteria called required k-tuples, where $k > 1$ is a natural number. These criteria require that a path set cover the chains of alternating definitions and uses, called definition-reference interactions (abbr. k-dr interactions) in Ntafos' terminology. Each definition in a k-dr interaction reaches the next use in the chain, which occurs at the same node as the next definition in the chain. Formally:

Definition 2.10 (k-dr interaction). For $k > 1$, a k-dr interaction is a sequence $K = [d_1(x_1), u_1(x_1), d_2(x_2), u_2(x_2), \ldots, d_k(x_k), u_k(x_k)]$ where

(i) $d_i(x_i)$, $1 \leq i \leq k$, is a definition occurrence of the variable $x_i$;
(ii) $u_i(x_i)$, $1 \leq i \leq k$, is a use occurrence of the variable $x_i$;
(iii) the use $u_i(x_i)$ and the definition $d_{i+1}(x_{i+1})$ are associated with the same node $n_{i+1}$;
(iv) for all i, $1 \leq i \leq k$, the ith definition $d_i(x_i)$ reaches the ith use $u_i(x_i)$.

Note that the variables $x_1, x_2, \ldots, x_k$ and the nodes $n_1, n_2, \ldots, n_k$ need not be distinct. This definition comes from Ntafos' [1988] later work. It is different from the original definition where the nodes are required to be distinct [Ntafos 1984]. The same modification was also made by Clarke et al. [1989] in their formal analysis of data flow adequacy criteria.

An interaction path for a k-dr interaction is a path $p = (n_1) \wedge p_1 \wedge (n_2) \wedge \cdots \wedge (n_{k-1}) \wedge p_{k-1} \wedge (n_k)$ such that for all $i = 1, 2, \ldots, k-1$, $d_i(x_i)$ reaches $u_i(x_i)$ through $p_i$. The required k-tuples criterion then requires that all k-dr interactions are tested.

Definition 2.11 (Required k-Tuples Criteria). A set P of execution paths satisfies the required k-tuples criterion, $k > 1$, if and only if for all j-dr interactions L, $1 < j \leq k$, there is at least one path p in P such that p includes a subpath which is an interaction path for L.

Example 2.1 Consider the flow graph in Figure 2. The following are 3-dr interaction paths:

(a1, a3, a2, a4) for the 3-dr interaction $[d_1(x), u_1(x), d_2(y), u_2(y), d_3(x), u_3(x)]$; and

(a1, a2, a3, a4) for the 3-dr interaction $[d_1(y), u_1(y), d_2(x), u_2(x), d_3(y), u_3(y)]$.

D. Combinations of definitions: the Laski-Korel criteria. Laski and Korel [1983] defined and studied another kind of testing path selection criteria based on data-flow analysis. They observed that a given node may contain uses of several different variables, where each use may be reached by several definitions occurring at different nodes. Such definitions constitute the context of the computation at the node. Therefore, they are concerned with exploring such contexts for each node by selecting paths along which the various combinations of definitions reach the node.

Definition 2.12 (Ordered Context). Let n be a node in the flow graph. Suppose that there are uses of the variables $x_1, x_2, \ldots, x_m$ at the node n (see footnote 4). Let $[n_1, n_2, \ldots, n_m]$ be a sequence of nodes such that for all $i = 1, 2, \ldots, m$, there is a definition of $x_i$ on node $n_i$, and the definition of $x_i$ reaches the node n with respect to $x_i$. A path $p = p_1 \wedge (n_1) \wedge p_2 \wedge (n_2) \wedge \cdots \wedge p_m \wedge (n_m) \wedge p_{m+1} \wedge (n)$ is called an ordered context path for the node n with respect to the sequence $[n_1, n_2, \ldots, n_m]$ if and only if for all $i = 2, 3, \ldots, m$, the subpath $p_i \wedge (n_i) \wedge p_{i+1} \wedge \cdots \wedge p_{m+1}$ is definition clear with respect to $x_{i-1}$. In this case, we say that the sequence $[n_1, n_2, \ldots, n_m]$ of nodes is an ordered context for n.

Footnote 4. This assumption comes from Clarke et al. [1989]. The original definition given by Laski and Korel [1983] defines a context to be formed from all variables having a definition that reaches the node.

Example 2.2 Consider the flow graph in Figure 1. There are uses of the two variables x and y at node a4. The node sequences [a1, a2], [a1, a3], [a2, a3], and [a3, a2] are ordered contexts for node a4. The paths (a1, a2, a4), (a1, a3, a4), (a2, a3, a4), and (a3, a2, a4) are the ordered context paths for them, respectively.

The ordered context coverage requires that an adequate test set should cover all ordered contexts for every node.

Definition 2.13 (Ordered-Context Coverage Criterion). A set P of execution paths satisfies the ordered-context coverage criterion if and only if for all nodes n and all ordered contexts c for n, there is at least one path p in P such that p contains a subpath which is an ordered context path for n with respect to c.

Given a node n, let $\{x_1, x_2, \ldots, x_m\}$ be a nonempty subset of the variables that are used at the node n, and let the nodes $n_i$, $i = 1, 2, \ldots, m$, have definition occurrences of the variables $x_i$ that reach the node n. If there is a permutation of the nodes which is an ordered context for n, then we say that the set $\{n_1, n_2, \ldots, n_m\}$ is a context for n, and an ordered context path for n with respect to that permutation is also called a definition context path for n with respect to the context $\{n_1, n_2, \ldots, n_m\}$. Ignoring the ordering between the nodes, a slightly weaker criterion, called the context-coverage criterion, requires that all contexts for all nodes are covered.

Definition 2.14 (Context-Coverage Criterion). A set P of execution paths satisfies the context coverage criterion if and only if for all nodes n and for all contexts for n, there is at least one path p in P such that p contains a subpath which is a definition context path for n with respect to the context.

E. Data-flow testing for structured data and dynamic data. The data flow testing methods discussed so far have a number of limitations. First, they make no distinction between atomic data such as integers and structured or aggregate data such as arrays and records. Modifications and references to an element of a structured datum are regarded as modifications and references to the whole datum. It was argued that treating structured data, such as arrays, as aggregate values may lead to two types of mistakes [Hamlet et al. 1993]. A commission mistake may happen when a definition-use path is identified but it is not present for any array elements. An omission mistake may happen when a path is missed because of a false intermediate assignment. Such mistakes occur frequently even in small programs [Hamlet et al. 1993]. Treating elements of structured data as independent data can correct the mistakes. Such an extension seems to add no complexity when the references to the elements of structured data are static, such as the fields of records. However, treating arrays element-by-element may introduce a potential infinity of definition-use paths to be tested. Moreover, theoretically speaking, whether two references to array elements are references to the same element is undecidable. Hamlet et al. [1993] proposed a partial solution to this problem by using symbolic execution and a symbolic equation solver to determine whether two occurrences of
  17. 17. 382 • Zhu et al. array elements can be the occurrences efficient testing of the interaction be- of the same element. tween procedures. The basic idea of in- The second limitation of the data-flow terprocedural data-flow testing is to test testing methods discussed is that dy- the data dependence across procedure namic data were not taken into account. interfaces. Harrold and Soffa [1990; One of the difficulties in the data-flow 1991] identified two types of interproce- analysis of dynamic data such as those dural data dependences in a program: referred to by pointers is that a pointer direct data dependence and indirect variable may actually refer to a number data dependence. A direct data depen- of data storage. On the other hand, a dence is a definition-use association data storage may have a number of whose definition occurs in procedure P references to it, that is, the existence of and use occurs in a directly called proce- alias. Therefore, for a given variable V, dure Q of P. Such a dependence exists a node may contain a definite definition when (1) a definition of an actual pa- to the variable if a new value is defi- rameter in one procedure reaches a use nitely bound to the variable at the node. of the corresponding formal parameter It has a possible definition at a node n if at a call site (i.e., a procedure call); (2) a it is possible that a new value is bound definition of a formal parameter in a to it at the node. Similarly, a path may called procedure reaches a use of the be definitely definition-clear or possibly corresponding actual parameter at a re- definition-clear with respect to a vari- turn site (i.e., a procedure return); or (3) able. Ostrand and Weyuker [1991] ex- a definition of a global variable reaches tended the definition-use association re- a call or return site. An indirect data lation on the occurrences of variables to dependence is a definition-use associa- a hierarchy of relations. 
A definition- tion whose definition occurs in proce- use association is strong if there is a dure P and use occurs in an indirectly definite definition of a variable and a called procedure Q of P. Conditions for definite use of the variable and every indirect data dependence are similar to definition-clear path from the definition to the use is definitely definition-clear those for direct data dependence, except with respect to the variable. The associ- that multiple levels of procedure calls ation is firm if both the definition and and returns are considered. Indirect the use are definite and there is at least data dependence can be determined by one path from the definition to the use considering the possible uses of defini- that it is definitely definition-clear. The tions along the calling sequences. When association is weak if both the definition a formal parameter is passed as an ac- and the use are definite, but there is no tual parameter at a call site, an indirect path from the definition to the use data dependence may exist. Given this which is definitely definition-clear. An data dependence information, the data- association is very weak if the definition flow test adequacy criteria can be easily or the use or both of them are possible extended for interprocedural data-flow instead of definite. testing. Harrold and Soffa [1990] pro- posed an algorithm for computing the F. Interprocedural data-flow test- interprocedural data dependences and ing. The data-flow testing methods developed a tool to support interproce- discussed so far have also been re- dural data-flow testing. stricted to testing the data dependence Based on Harrold and Soffa’s work, existing within a program unit, such as Ural and Yang [1988; 1993] extended a procedure. As current trends in pro- the flow-graph model for accurate repre- gramming encourage a high degree of sentation of interprocedural data-flow modularity, the number of procedure information. Pande et al. 
[1991] pro- calls and returns executed in a module posed a polynomial-time algorithm for continues to grow. This mandates the determining interprocedural definition- ACM Computing Surveys, Vol. 29, No. 4, December 1997
use association including dynamic data of single-level pointers for C programs.

2.1.3 Dependence Coverage Criterion: an Extension and Combination of Data-Flow and Control-Flow Testing. An extension of data-flow testing methods was made by Podgurski and Clarke [1989; 1990] by generalizing control and data dependence. Informally, a statement $s$ is semantically dependent on a statement $s'$ if the function computed by $s'$ affects the execution behavior of $s$. Podgurski and Clarke then proposed a necessary condition of semantic dependence called weak syntactic dependence as a generalization of data dependence. There is a weak syntactic dependence between two statements if there is a chain of data flow and a weak control dependence between the statements, where a statement $u$ is weakly control-dependent on statement $v$ if $v$ has successors $v'$ and $v''$ such that if the branch from $v$ to $v'$ is executed then $u$ is necessarily executed within a fixed number of steps, whereas if the branch from $v$ to $v''$ is taken then $u$ can be bypassed or its execution can be delayed indefinitely. Podgurski and Clarke also defined the notion of strong syntactic dependence: there is a strong syntactic dependence between two statements if there is a chain of data flow and a strong control dependence between the statements. Roughly speaking, a statement $u$ is strongly control dependent on statement $v$ if $v$ has two successors $v'$ and $v''$ such that the execution through the branch from $v$ to $v'$ may result in the execution of $u$, but $u$ may be bypassed when the branch from $v$ to $v''$ is taken. Podgurski and Clarke proved that strong syntactic dependence is not a necessary condition of semantic dependence.

When the definition-use association relation is replaced with various dependence relations, various dependence-coverage criteria can be obtained as extensions to the data-flow test adequacy criteria. Such criteria make more use of semantic information contained in the program under test. Furthermore, these dependence relations can be efficiently calculated.

2.2 Specification-Based Structural Testing

There are two main roles a specification can play in software testing [Richardson et al. 1992]. The first is to provide the necessary information to check whether the output of the program is correct [Podgurski and Clarke 1989; 1990]. Checking the correctness of program outputs is known as the oracle problem. The second is to provide information to select test cases and to measure test adequacy. As the purpose of this article is to study test adequacy criteria, we focus on the second use of specifications.

Like programs, a specification has two facets, syntactic structure and semantics. Both of them can be used to select test cases and to measure test adequacy. This section is concerned with the syntactic structure of a specification.

A specification specifies the properties that the software must satisfy. Given a particular instance of the software's input and its corresponding output, to check whether the instance of the software behavior satisfies these properties we must evaluate the specification by substituting the instance of input and output into the input and output variables in the specification, respectively. Although this evaluation process may take various forms, depending on the type of the specification, the basic idea behind the approach is to consider a particular set of elements or components in the specification and to calculate the proportion of such elements or components involved in the evaluation.

There are two major approaches to formal software functional specifications, model-based specifications and property-oriented specifications such as axiomatic or algebraic specifications. The following discussion is based on these types of specifications.
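As an illustration of evaluating a specification on an input/output instance and measuring how much of the specification the evaluation involves, the following Python sketch may help. It is not taken from the surveyed work: the absolute-value specification, the names ATOMS, evaluate_spec, coverage and run, and the choice of counting (atomic predicate, truth value) outcomes as the measured elements are all assumptions made for the example.

```python
# Hypothetical specification of an absolute-value unit, written as named
# atomic predicates over an input x and an output y (illustrative only).
ATOMS = {
    "x_nonneg": lambda x, y: x >= 0,
    "y_is_x": lambda x, y: y == x,
    "y_is_neg_x": lambda x, y: y == -x,
}


def evaluate_spec(x, y, seen):
    """Evaluate (x >= 0 and y == x) or (x < 0 and y == -x) on one instance.

    Each atomic predicate that is actually evaluated is recorded in `seen`
    together with its truth value. The atom x < 0 is taken to be the
    negation of x >= 0, so it is not listed separately.
    """
    def atom(name):
        value = ATOMS[name](x, y)
        seen.add((name, value))
        return value

    if atom("x_nonneg"):
        return atom("y_is_x")
    return atom("y_is_neg_x")


def coverage(seen):
    """Proportion of (atom, truth value) outcomes observed so far."""
    return len(seen) / (2 * len(ATOMS))


def run(unit, inputs):
    """Run `unit` on each input; return (all outputs accepted, coverage)."""
    seen = set()
    verdicts = [evaluate_spec(x, unit(x), seen) for x in inputs]
    return all(verdicts), coverage(seen)
```

With Python's built-in abs as the unit under test, the single input 3 exercises two of the six outcomes, and adding the input -2 raises this to four of six. The remaining two outcomes can only be observed when the unit returns a wrong result.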
2.2.1 Coverage of Model-Based Formal Functional Specifications. When a specification is model-based, such as those written in Z and VDM, it has two parts. The first describes the state space of the software, and the second part specifies the required operations on the space. The state space of the software system can be defined as a set of typed variables with a predicate to describe the invariant property of the state space. The operations are functions mapping from input data and the state before the operation to the output data and the state after the operation. Such operations can be specified by a set of predicates that give the precondition, that is, the condition on the input data and the state before the operation, and postconditions that specify the relationship between the input data, output data, and the states before and after the operation.

The evaluation of model-based formal functional specifications is fairly similar to the evaluation of a Boolean expression in an imperative programming language. When input and output variables in the expression are replaced with an instance of input data and program outputs, each atomic predicate must be either true or false. If the result of the evaluation of the whole specification is true, then the correctness of the software on that input is confirmed. Otherwise, a program error is found. However, the same truth value of a specification on two instances of input/output may be due to different combinations of the truth values of the atomic predicates. Therefore it is natural to require that an adequate test cover a certain subset of feasible combinations of the predicates. Here a feasible combination means that the combination can be satisfied; that is, there is an assignment of values to the input and output variables such that the atomic predicates take their corresponding values in the predicate combination. In the case where the specification contains nondeterminism, the program may be less nondeterministic than the specification. That is, some of the choices of output allowed by the specification may not be implemented by the program. This may not be considered a program error, but it may result in infeasible combinations.

A feasible combination of the atomic predicates in the preconditions is a description of the conditions that test cases should satisfy. It specifies a subdomain of the input space. It can be expressed in the same specification language. Such specifications of testing requirements are called test templates. Stocks and Carrington [1993] suggested the use of the formal functional specification language Z to express test templates because the schema structure of Z and its schema calculus can provide support to the derivation and refinement of test templates according to formal specifications and heuristic testing rules. Methods have also been proposed to derive such test templates from model-based specification languages. Amla and Ammann [1992] described a technique to extract information from formal specifications in Z and to derive test templates written in Z for partition testing. The key step in their method is to identify the categories of the test data for each parameter and environment variable of a functional unit under test. These categories categorize the input domain of one parameter or one environment variable according to the major characteristics of the input. According to Amla and Ammann, there are typically two distinct sources of categories in Z specifications: (a) characteristics enumerated in the preconditions and (b) characteristics of a parameter or environment variable by itself. For parameters, these characteristics are based on their type. For environment variables, these characteristics may also be based on the invariant for the state components. Each category is then further divided into a set of choices. A choice is a subset of data that can be assigned to the parameter or the environment variable. Each category can be broken into at least two choices: one for the valid inputs and the other for the
invalid inputs. Finer partitions of valid inputs are derived according to the syntactic structure of the precondition predicate, the parameters, or the invariant predicate of the environment variables. For example, the predicate "$A \lor B$" is partitioned into three choices: (a) "$\lnot (A \lor B)$" for the set of data which are invalid inputs; (b) "$A$" for the subset of valid inputs which satisfy condition $A$; (c) "$B$" for the subset of valid inputs which satisfy condition $B$.

Based on Amla and Ammann's [1992] work, Ammann and Offutt [1994] recently considered how to test a functional unit effectively and efficiently by selecting test cases to cover various subsets of the combinations of the categories and choices when the functional unit has more than one parameter and environment variable. They proposed three coverage criteria. The all-combination criterion requires that software be tested on all combinations of choices; that is, for each combination of the choices of the parameters and the environment variables, there is at least one test datum in the combination. Let $x_1, x_2, \ldots, x_n$ be the parameters and environment variables of the functional unit under test. Suppose that the choices for $x_i$ are $A_{i,1}, A_{i,2}, \ldots, A_{i,k_i}$, $k_i > 0$, $i = 1, 2, \ldots, n$. Let $C = \{ A_{1,u_1} \times A_{2,u_2} \times \cdots \times A_{n,u_n} \mid 1 \le u_i \le k_i \text{ and } 1 \le i \le n \}$. $C$ is then the set of all combinations of choices. The all-combination criterion can be formally defined as follows.

Definition 2.15 (All-Combination Criterion). A set of test data $T$ satisfies the all-combination criterion if for all $c \in C$, there exists at least one $t \in T$ such that $t \in c$.

This criterion was considered to be inefficient, and the each-choice-used criterion was considered ineffective [Ammann and Offutt 1994]. The each-choice-used criterion requires that each choice is used in the testing; that is, for each choice of each parameter or environment variable, there is at least one test datum that belongs to the choice. Formally:

Definition 2.16 (Each-Choice-Used Criterion). A set of test data $T$ satisfies the each-choice-used criterion if the subset $E = \{ e \mid e \in C \text{ and } \exists t \in T.\,(t \in e) \}$ satisfies the condition:

$$\forall i.\,\bigl(1 \le i \le n \Rightarrow E_i = \{ A_{i,1}, A_{i,2}, \ldots, A_{i,k_i} \}\bigr),$$

where

$$E_i = \{ e \mid \exists X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n.\,(X_1 \times \cdots \times X_{i-1} \times e \times X_{i+1} \times \cdots \times X_n \in E) \}.$$

Ammann and Offutt suggested the use of the base-choice-coverage criterion and described a technique to derive test templates that satisfy the criterion. The base-choice-coverage criterion is based on the notion of base choice, which is a combination of the choices of parameters and environment variables that represents the normal operation of the functional unit under test. Therefore, test cases of the base choice are useful to evaluate the function's behavior in normal operation mode. To satisfy the base-choice-coverage criterion, software needs to be tested on the subset of combinations of choices such that for each choice in a category, the choice is combined with the base choices for all other categories. Assume that $A_{1,1} \times A_{2,1} \times \cdots \times A_{n,1}$ is the base choice. The base-choice-coverage criterion can be formally defined as follows.

Definition 2.17 (Base-Choice-Coverage Criterion). A set of test data $T$ satisfies the base-choice-coverage criterion if the subset $E = \{ e \mid e \in C \land \exists t \in T.\,(t \in e) \}$ satisfies the following condition:

$$E \supseteq \bigcup_{i=1}^{n} B_i,$$
where

$$B_i = \{ A_{1,1} \times \cdots \times A_{i-1,1} \times A_{i,j} \times A_{i+1,1} \times \cdots \times A_{n,1} \mid j = 1, 2, \ldots, k_i \}.$$

There are a number of works on specification-based testing that focus on derivation of test cases from specifications, including Denney's [1991] work on test-case generation from Prolog-based specifications and many others [Hayes 1986; Kemmerer 1985; McMullin and Gannon 1983; Wild et al. 1992].

Model-based formal specification can also be in an executable form, such as a finite state machine or a state chart. Aspects of such models can be represented in the form of a directed graph. Therefore, the program-based adequacy criteria based on the flow-graph model can be adapted for specification-based testing [Fujiwara et al. 1991; Hall and Hierons 1991; Ural and Yang 1988; 1993].

2.2.2 Coverage of Algebraic Formal Functional Specifications. Property-oriented formal functional specifications specify software functions by a set of properties that the software should possess. In particular, an algebraic specification consists of a set of equations that the operations of the software must satisfy. Therefore checking if a program satisfies the specification means checking whether all of the equations are satisfied by the program.

An equation in an algebraic specification consists of two terms as two sides of the equation. A term is constructed from three types of symbols: variables representing arbitrary values of a given data type, constants representing a given data value in a data type, and operators representing data constructors and operations on data types.

Each term has two interpretations in the context of testing. First, a term represents a sequence of calls to the operations that implement the operators specified in the specification. When the variables in the term are replaced with constants, such a sequence of calls to the operations represents a test execution of the program, where the test case consists of the constants substituted for the variables. Second, a term also represents a value, that is, the result of the sequence of operations. Therefore, checking an equation means executing the operation sequences for the two terms on the two sides of the equation and then comparing the results. If the results are the same or equivalent, the program is considered to be correct on this test case, otherwise the implementation has errors. This interpretation allows the use of algebraic specifications as test oracles.

Since variables in a term can be replaced by any value of the data type, there is a great deal of freedom to choose input data for any given sequence of operations. For algebraic specification, values are represented by ground terms, that is, terms without variables. Gaudel and her colleagues [Bouge et al. 1986; Bernot et al. 1991] suggested that the selection of test cases should be based on partitioning the set of ground terms according to their complexity so that the regularity and uniformity hypotheses on the subsets in the partition can be assumed. The complexity of a test case is then the depth of nesting of the operators in the ground term. Therefore, roughly speaking, the selection of test cases should first consider constants specified by the specification, then all the values generated by one application of operations on constants, then values generated by two applications on constants, and so on until the test set covers data of a certain degree of complexity.

The following hypothesis, called the regularity hypothesis [Bouge et al. 1986; Bernot et al. 1991], formally states the gap between software correctness and adequate testing by the preceding approach.

Regularity Hypothesis

$$\forall x\,\bigl(\mathit{complexity}(x) \le K \Rightarrow t(x) = t'(x)\bigr) \Rightarrow \forall x\,\bigl(t(x) = t'(x)\bigr) \qquad (2.1)$$