As an exercise, refer the students to the Therac-25 case, discussed in chapter 1.
It thus really pays off to start testing early. See also the following picture.
Old picture from Boehm’s book.
It shows that errors discovered during operation might cost 100 times as much as errors discovered during requirements engineering.
As a simple illustration of why exhaustive testing does not work: take a simple loop with an if-statement in it. If the loop is executed 100 times, exhaustive testing takes 2^100 test cases.
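To make the arithmetic concrete, a minimal sketch (in Python; the function itself is invented for illustration): each loop iteration contains one binary choice, so 100 iterations give 2^100 distinct paths.

    def process(flags):              # flags: a list of 100 booleans
        total = 0
        for i in range(100):         # the loop body runs 100 times
            if flags[i]:             # one binary choice per iteration
                total += i
        return total

    print(2 ** 100)                  # 1267650600228229401496703205376 paths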
Random testing does work if you want to achieve reliability (see later sections/slides).
Coverage-based: e.g. how many statements or requirements have been tested so far
Fault-based: e.g., how many seeded faults are found
Error-based: focus on error-prone points, e.g., off-by-one errors
Black-box: you do not look inside, but base yourself only on the specification/functional description
White-box: you do look inside, at the structure of the actual program/specification.
This classification is mostly used at the module level.
For example, I may accidentally assume a procedure is only called with a positive argument (the error).
So the code does not test for negative values (the fault).
Now if the procedure is actually called with a negative argument, something may go wrong (a wrong answer, abnormal termination): the failure.
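A minimal sketch of this chain (the procedure and its formula are invented for illustration):

    import math

    def shipping_cost(weight):
        # Error: the author assumed weight is always positive.
        # Fault: consequently, the code does not test for negative values.
        return 5.0 + 0.5 * math.sqrt(weight)

    shipping_cost(10)    # fine: the fault stays dormant
    shipping_cost(-4)    # failure: math.sqrt raises ValueError (abnormal termination)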
Note that the relation between errors, faults and failures need not be 1-1.
But even with this definition, things may be subtle. Suppose a program contains a fault which never shows up, say because a certain piece of the code never gets executed. Is this “latent” fault actually a fault? If not, does it become a fault if we reuse this part of the program in another context? See also next slide.
The Ariane 5 took off and exploded within 40 seconds.
Ultimate cause: an overflow in the conversion of a 64-bit floating-point value to a 16-bit integer; this case was not tested.
In the Ariane 4, this did not cause any problem. The variable related to the horizontal speed of the rocket. The piece of software in question only served to speed up the restart of the launching process in case something went wrong and the launch had to be stopped prematurely. The software kept running for about a minute after lift-off. The Ariane 4 gains horizontal speed much more slowly than the Ariane 5, so within this one minute the variable in question still had a small value. In the Ariane 5, horizontal speed was much higher by that time.
So, failure to specify boundary conditions for this software?
Reuse failure?
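A hedged sketch (in Python rather than the original Ada; the numeric values are invented) of the kind of conversion that failed: a 64-bit floating-point value squeezed into a 16-bit signed integer.

    INT16_MIN, INT16_MAX = -32768, 32767

    def to_int16(x):
        # A 16-bit signed integer holds only -32768..32767; a larger value
        # triggers the kind of exception that, left unhandled, shut down
        # the Ariane 5's inertial reference system.
        value = int(x)
        if not INT16_MIN <= value <= INT16_MAX:
            raise OverflowError(f"{value} does not fit in 16 bits")
        return value

    to_int16(20000.0)    # small horizontal velocity: fine on the Ariane 4
    to_int16(40000.0)    # larger value on the Ariane 5: raises OverflowError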
If exhaustive testing does not work, we have to select a good subset of test cases. But how do we determine the quality of such a test set? This is a crucial question, and the various test techniques all address it in one way or another.
Note that the stopping rule view is a special case of the measurement view.
We use these adequacy criteria to decide whether one testing technique is better than another. A number of such relations between test techniques are given later on.
Objective 1 is the kind of objective used in all kinds of functional and structural test techniques. These try to systematically exercise the software so as to make sure we test “everything”.
The idea behind objective 2 is that we might not be interested in faults that never show up, but we really want to find those that have a large probability of manifesting themselves. So we pursue high reliability. Random testing then works, provided the test case profile matches the operational profile, i.e., the distribution of test cases mimics actual use of the system.
An example development method where this objective is applied is Cleanroom.
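A minimal sketch of the idea, assuming an invented operational profile (90% short inputs, 10% long ones) and Python's built-in sort as the program under test:

    import random

    PROFILE = [                       # (probability, input generator)
        (0.9, lambda: [random.randint(-100, 100)
                       for _ in range(random.randint(0, 10))]),
        (0.1, lambda: [random.randint(-100, 100)
                       for _ in range(random.randint(11, 999))]),
    ]

    def draw_test_case():
        r, acc = random.random(), 0.0
        for probability, generate in PROFILE:
            acc += probability
            if r < acc:
                return generate()
        return PROFILE[-1][1]()       # guard against float rounding

    for _ in range(1000):
        result = sorted(draw_test_case())
        assert all(result[i] <= result[i + 1]
                   for i in range(len(result) - 1))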
This is an example approach where we want to find as many faults as possible.
The partition is perfect iff the paths in the program follow the equivalence classes chosen. For instance, we assume that the sorting module treats all arrays of length 1 < n < 999 the same. Probably, those of length 1 and 999 are also treated in the same way, but just to make sure we test these boundary cases separately.
Now if the sorting program treats, say, arrays with negative numbers differently from those with positive numbers, this equivalence class partitioning is not perfect, and a fault in the program may go unnoticed because we may happen to use a test case that, say, only has positive numbers, and none that has negative numbers.
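A sketch of such a partition for a sorting routine, with the boundary cases tested separately; the last class shows how the overlooked negative-numbers case can be added (all values are invented):

    import random

    def test_cases():
        yield [42]                                          # boundary: length 1
        yield [random.randint(1, 100) for _ in range(500)]  # interior: 1 < n < 999
        yield [random.randint(1, 100) for _ in range(999)]  # boundary: length 999
        yield [-5, -1, -3]                                  # the overlooked class

    for case in test_cases():
        result = sorted(case)
        assert all(result[i] <= result[i + 1]
                   for i in range(len(result) - 1))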
The first two models are phase models; testing is a phase following coding. The demonstration mode is often used when testing one’s own software. This model also applies when the test set is not carefully/systematically constructed. All kinds of structural and functional techniques follow the destructive mode of operation.
The last two models acknowledge that testing is something that has to be done in every development phase. For instance, requirements can be reviewed too. And by making sure that there is a test for every requirement, including every non-functional requirement, you can even prevent errors from being made in the first place.
Over the years, a gradual shift can be observed, from demonstration to prevention.
Correctness proofs: complex, not done very often
Stepwise abstraction: the opposite of stepwise refinement; you derive the pre- and postconditions of a module by working backwards from the individual statements (see the sketch below)
Much of the real value of this type of technique is in the learning process that the people involved go through.
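A tiny sketch of stepwise abstraction on an invented fragment, deriving the postcondition from the statements:

    def max_of(a, b):
        m = a          # after this statement: m == a
        if b > m:      # branch taken iff b > a
            m = b      # after this statement: m == b and b > a
        return m       # derived postcondition: m == max(a, b)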
In branch coverage, both branches of an if-statement are tested, even if one is empty
In normal branch coverage, a combined condition like a = 1 and b = 2 requires two tests. We may also test all four combinations of the two simple predicates.
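A sketch contrasting the two (the function and values are invented):

    def f(a, b):
        if a == 1 and b == 2:
            return "then"
        return "else"              # the 'empty' branch still counts

    # Branch coverage: two tests, one per branch.
    f(1, 2); f(0, 0)

    # Multiple-condition coverage: all four combinations of the
    # simple predicates a == 1 and b == 2.
    f(1, 2); f(1, 0); f(0, 2); f(0, 0)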
The cyclomatic number criterion is related to the cyclomatic complexity metric of McCabe
We have to include each successor to enforce that all branches following a P-use are taken.
Further variations differentiate between uses in a predicate (P-use) and uses elsewhere (computations, C-use).
This leads to criteria like All-C-uses/Some-P-uses and the like.
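A small invented fragment annotated with this vocabulary:

    def total_above(values, limit):
        total = 0                  # def of total
        for v in values:           # def of v
            if v > limit:          # P-use of v and limit (in a predicate)
                total += v         # C-use of v and total (in a computation)
        return total               # C-use of total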
Kinds of variations in program testing: faults are seeded by one group into the program, which is then tested by another group.
In each variation (mutant), one simple change is made.
Note that if we happen to insert an element of a that occurs before the final element, we won’t notice a difference
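A hedged sketch of a mutant (the function and the change are invented, not the slides' example): one simple change, here > to >=, which only a test whose maximum occurs more than once can kill.

    def max_index(a):              # original version
        best = 0
        for i in range(1, len(a)):
            if a[i] > a[best]:
                best = i
        return best

    def max_index_mutant(a):       # mutant: > changed to >=
        best = 0
        for i in range(1, len(a)):
            if a[i] >= a[best]:
                best = i
        return best

    assert max_index([3, 1, 2]) == max_index_mutant([3, 1, 2])  # mutant survives
    assert max_index([2, 5, 5]) != max_index_mutant([2, 5, 5])  # mutant killed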
This is a graphical view of this same requirement. It shows the two-dimensional (age, average number of loans per month) domain. The subdomains are bordered by lines such as age = 6, or (age = 4, 0 <= av <= 5).
For each border, it is indicated which of the adjacent subdomains is closed by putting a hachure at that side; a subdomain is closed at some border iff that border belongs to the subdomain; otherwise it is open.
This yields the same picture, with the same borders, and can be used with the same test set.
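A minimal sketch of testing one such border (the function and category names are invented; only the border age = 6 is taken from the picture): an 'on' point lying on the border and an 'off' point just beside it check that the border is placed correctly and closed on the right side.

    def category(age, av):
        if age < 6:                          # border age = 6, open on this side
            return "toddler"
        return "child" if av <= 5 else "frequent reader"

    assert category(6, 3) == "child"         # on point: the border itself
    assert category(5, 3) == "toddler"       # off point: just inside the open side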
Usually, stronger criteria induce higher costs
These properties relate to program-based criteria.
The first four are rather general and should apply to any test adequacy criterion.
E.g., the applicability property says: for every program, there is an adequate test set. This is not true for the All-Nodes and All-Edges criteria, since programs may contain dead code, so that you cannot achieve 100% coverage.
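A small sketch of the dead-code argument: no test set can reach the second branch below, so 100% All-Nodes coverage is unachievable.

    def clamp(x):
        if x < 0:
            x = 0
        if x < -1:             # dead code: x >= 0 always holds here
            return "unreachable"
        return x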
Anticomposition: if components have been tested adequately, this does not mean their composition is also tested adequately (cf. the Ariane 5 disaster). This property does not hold for the All-Nodes and All-Edges criteria.