As an exercise, refer the students to the Therac-25 case, discussed in Chapter 1.
It thus really pays off to start testing early. See also the following picture, an old one from Boehm’s book. It shows that errors discovered during operation may cost 100 times as much as errors discovered during requirements engineering.
As a simple illustration of why exhaustive testing does not work, take a simple loop with an if-statement in it. If the loop is executed 100 times, exhaustive testing requires 2^100 test cases. Random testing does work if you want to achieve reliability (see later sections/slides).
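A rough sketch of this path explosion, assuming (illustratively) one if-statement per loop iteration:

```python
# Illustrative sketch: a loop whose body contains a single if-statement.
# Every iteration independently takes the then- or the else-branch,
# so n iterations give 2**n distinct execution paths.

def paths_through_loop(n: int) -> int:
    """Number of distinct paths through a loop with one if inside."""
    return 2 ** n

print(paths_through_loop(100))  # 1267650600228229401496703205376 paths
```

Even at one test per nanosecond, exercising all of these would take vastly longer than the age of the universe.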
Coverage-based: e.g., how many statements or requirements have been tested so far.
Fault-based: e.g., how many seeded faults are found.
Error-based: focus on error-prone points, e.g., off-by-one values.
Black-box: you do not look inside, but base yourself solely on the specification/functional description.
White-box: you do look inside, at the structure of the actual program/specification.
This classification is mostly used at the module level.
For example, I may accidentally assume a procedure is only called with a positive argument (the error). As a result, the code does not cater for negative values (the fault). Now if the procedure is actually called with a negative argument, something may go wrong (a wrong answer, an abort): the failure. Note that the relation between errors, faults and failures need not be 1-1.
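A minimal sketch of this error → fault → failure chain, using a hypothetical count_digits function (name and behaviour are assumptions for illustration):

```python
def count_digits(n: int) -> int:
    # Error: the programmer assumes n is always positive.
    # Fault: as a result, the code does not handle n <= 0.
    digits = 0
    while n > 0:
        digits += 1
        n //= 10
    return digits

print(count_digits(123))   # 3: correct for the (tested) positive case
print(count_digits(-123))  # 0: the failure, a wrong answer for an untested input
```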
But even with this definition, things may be subtle. Suppose a program contains a fault which never shows up, say because a certain piece of the code never gets executed. Is this “latent” fault actually a fault? If not, does it become a fault if we reuse this part of the program in another context? See also next slide.
The Ariane 5 took off and exploded within 40 seconds. The ultimate cause was an overflow in the conversion of a variable; this case was not tested. In the Ariane 4, this did not cause any problem. The variable related to the horizontal speed of the rocket. The piece of software in question only served to speed up the restart of the launching process in case something went wrong and the launch had to be stopped prematurely. The software ran for about a minute after launch. The Ariane 4 is much slower than the Ariane 5, so within this minute the rocket was still going up and the variable in question had a small value. In the Ariane 5, horizontal speed was much higher by this time. So: a failure to specify boundary conditions for this software? A reuse failure?
If exhaustive testing does not work, we have to select a good subset. But how do we determine the quality of such a test set? This is a crucial step, and the various test techniques all address this issue in one way or another.
Note that the stopping-rule view is a special case of the measurement view. We use these adequacy criteria to decide whether one testing technique is better than another. A number of such relations between test techniques are given later on.
Objective 1 is the kind of objective used in all kinds of functional and structural test techniques. These try to systematically exercise the software so as to make sure we test “everything”. The idea behind objective 2 is that we might not be interested in faults that never show up; we really want to find those that have a large probability of manifesting themselves. So we pursue high reliability. Random testing then works, provided the test-case profile matches the operational profile, i.e., the distribution of test cases mimics actual use of the system. An example development method where this objective is applied is Cleanroom.
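A sketch of drawing random test cases from an operational profile; the profile itself (90% of real-world inputs are short arrays) is entirely made up for illustration:

```python
import random

def draw_test_case(rng: random.Random) -> list:
    # Assumed profile: 90% of actual use involves short arrays,
    # 10% long ones; the test distribution mimics this.
    if rng.random() < 0.9:
        length = rng.randint(0, 5)
    else:
        length = rng.randint(6, 999)
    return [rng.randint(-100, 100) for _ in range(length)]

rng = random.Random(42)
tests = [draw_test_case(rng) for _ in range(1000)]
short = sum(len(t) <= 5 for t in tests)
print(short)  # close to 900: the test set mirrors the usage profile
```

Reliability estimated from such a test set then reflects the failure rate users would actually experience.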
This is an example approach where we want to find as many faults as possible. The partition is perfect iff the paths in the program follow the equivalence classes chosen. For instance, we assume that the sorting module treats all arrays of length 1 < n < 999 the same. Probably, those of length 1 and 999 are also treated in the same way, but just to make sure we test these boundary cases separately. Now if the sorting program treats, say, arrays with negative numbers differently from those with only positive numbers, this equivalence-class partitioning is not perfect, and a fault in the program may go unnoticed because we may happen to use only test cases with positive numbers, and none with negative numbers.
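A sketch of such a partition for a sorting routine specified for lengths 1 to 999; the class representatives chosen here are assumptions:

```python
def is_sorted(a: list) -> bool:
    """Check that a list is in non-decreasing order."""
    return all(a[i] <= a[i + 1] for i in range(len(a) - 1))

# One representative per equivalence class, plus the boundary lengths:
test_cases = [
    [5],                      # boundary: length 1
    [3, 1, 2],                # typical short array
    list(range(999, 0, -1)),  # boundary: length 999
    [-2, 7, -5, 0],           # extra class: negative numbers, in case
                              # the program treats them differently
]

for t in test_cases:
    assert is_sorted(sorted(t))  # 'sorted' stands in for the module under test
```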
The first two models are phase models; testing is a phase following coding. The demonstration mode is often used when testing one’s own software. This model also applies when the test set is not carefully/systematically constructed. All kinds of structural and functional techniques follow the destructive mode of operation. The last two models acknowledge that testing is something that has to be done in every development phase. For instance, requirements can be reviewed too. And by making sure that there is a test for every requirement, including every non-functional requirement, you can even prevent errors from being made in the first place. Over the years, a gradual shift can be observed, from demonstration to prevention.
Correctness proofs: complex, not done very often. Stepwise abstraction: the opposite of stepwise refinement; you derive the pre- and post-conditions of a module by working backwards from the individual statements.
Much of the real value of this type of technique lies in the learning process that the people involved go through.
In branch coverage, both branches of an if-statement are tested, even if one is empty. In normal branch coverage, a combined condition like a = 1 and b = 2 requires two tests. We may also test all four combinations of the two simple predicates. The cyclomatic number criterion is related to the cyclomatic complexity metric of McCabe.
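A sketch of the difference, using an assumed compound condition a == 1 and b == 2:

```python
def f(a: int, b: int) -> str:
    if a == 1 and b == 2:
        return "then"
    return "else"

# Branch coverage: the combined condition must be true once and false once.
branch_tests = [(1, 2), (0, 2)]

# Multiple-condition coverage: all four combinations of the two
# simple predicates a == 1 and b == 2.
multi_condition_tests = [(1, 2), (1, 0), (0, 2), (0, 0)]

print([f(a, b) for a, b in branch_tests])           # ['then', 'else']
print([f(a, b) for a, b in multi_condition_tests])  # ['then', 'else', 'else', 'else']
```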
We have to include each successor to enforce that all branches following a P-use are taken. Further variations differentiate between uses in a predicate (P-use) and uses elsewhere (computations, C-use). This leads to criteria like All-C-uses/Some-P-uses and the like.
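A minimal sketch of these notions in a made-up fragment:

```python
def f(n: int) -> int:
    x = n * 2        # definition of x
    if x > 10:       # P-use of x: x is used in a predicate
        return x + 1 # C-use of x: x is used in a computation
    return 0

# All-P-uses asks that, for this def-use pair, tests drive both
# branch outcomes of the predicate that uses x:
print(f(6))  # 13: predicate true
print(f(3))  # 0: predicate false
```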
Kinds of variations in program testing: one group seeds faults into the program, which is then tested by another group.
In each variation (mutant), one simple change is made.
Note that if we happen to insert an element of a that occurs before the final element, we won’t notice a difference.
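A generic sketch of killing a mutant, using an assumed max function rather than the program from the slide:

```python
def max_of(a: int, b: int) -> int:
    return a if a > b else b

def mutant(a: int, b: int) -> int:
    return a if a < b else b    # single simple change: '>' became '<'

def kills(tests) -> bool:
    # A test set "kills" the mutant if some test tells it apart
    # from the original program.
    return any(max_of(a, b) != mutant(a, b) for a, b in tests)

print(kills([(3, 3)]))          # False: equal arguments hide the change
print(kills([(3, 3), (1, 2)]))  # True: max_of gives 2, the mutant gives 1
```

A test set that leaves many mutants alive is, by this adequacy criterion, too weak.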
This is a graphical view of this same requirement. It shows the two-dimensional (age, average number of loans per month) domain. The subdomains are bordered by lines such as age = 6, or (age = 4, 0 <= av <= 5). For each border, it is indicated which of the adjacent subdomains is closed by putting a hachure at that side; a subdomain is closed at some border iff that border belongs to the subdomain; otherwise it is open.
This yields the same picture, with the same borders, and can be used with the same test set.
Usually, stronger criteria induce higher testing costs.
These properties relate to program-based criteria. The first four are rather general and should apply to any test adequacy criterion. E.g., the applicability property says: for every program, there is an adequate test set. This is not true for the All-Nodes and All-Edges criteria, since a program may contain dead code, so that 100% coverage cannot be achieved. Anticomposition: if components have been tested adequately, this does not mean their composition is also tested adequately (cf. the Ariane 5 disaster). This property does not hold for All-Nodes and All-Edges.
Requirements may be represented as graphs, where the nodes represent elementary requirements, and the edges represent relations (like yes/no) between requirements.
And next we may apply the earlier coverage criteria to this graph
Example translation of requirements to a graph
A user may order new books. He is shown a screen with fields to fill in. Certain fields are mandatory. One field is used to check whether the department’s budget is large enough. If so, the book is ordered and the budget reduced accordingly.
Nodes in the example graph: Enter fields; All mandatory fields there?; Check budget; Order book; Notify user; Notify user.
The library maintains a list of “hot” books. Each new book is added to this list. After six months, it is removed again. Also, if a book has been on the list for more than four months and has not been borrowed more than four times a month, or it is more than two months old and has been borrowed at most twice a month, it is removed from the list.
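One possible reading of this removal rule as code; the exact boundary values and their open/closed status are assumptions, which is precisely what the domain-testing picture is meant to pin down:

```python
def remove_from_hot_list(age_months: float, avg_loans: float) -> bool:
    # Assumed reading: remove when older than six months, or older
    # than four months with at most four loans per month on average,
    # or older than two months with at most two loans per month.
    return (age_months > 6
            or (age_months > 4 and avg_loans <= 4)
            or (age_months > 2 and avg_loans <= 2))

print(remove_from_hot_list(7, 10))  # True: past six months
print(remove_from_hot_list(5, 4))   # True: over four months, few loans
print(remove_from_hot_list(3, 3))   # False: still "hot"
```

Domain testing would place test points on and just off each of these borders.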
Example (cont’d): figure of the (age, average number of loans) domain, with borders at ages 2, 4 and 6 and at 5 and 2 average loans, and two subdomains marked A and B.
Criterion A is stronger than criterion B if, for all programs P and all test sets T, A-adequacy implies B-adequacy.
In that sense, e.g., All-Edges is stronger than All-Nodes coverage (All-Edges “subsumes” All-Nodes).
One problem: such criteria can only deal with paths that can actually be executed (feasible paths). So, if you have dead code, you can never obtain 100% statement coverage. Sometimes the subsumes relation only holds for the feasible version.