# Lecture Notes On Acceptance Testing



**Lecture Notes on Acceptance Testing**

Yakov Ben-Haim
Faculty of Mechanical Engineering
Technion — Israel Institute of Technology
Haifa 32000 Israel
yakov@technion.ac.il
http://www.technion.ac.il/yakov

Primary source material: William W. Hines and Douglas C. Montgomery, *Probability and Statistics in Engineering and Management Science*, 3rd ed., Sections 11-1 to 11-4, 11-11, 11-12.

A Note to the Student: These lecture notes are not a substitute for the thorough study of books. These notes are no more than an aid in following the lectures.

**Contents**

1. Thickness Measurement: Testing a Sample Mean
2. Sequential Sampling: Testing a Mean
3. Matching Two Dimensions: Testing Two Sample Means
4. Engine Warming: A χ² Test
5. Failure Rate: Poisson Distribution and the χ² Test
6. χ² Test of Independence in a 2-way Table
7. Acceptance Sampling of a Large Population
8. Sample Size for Detecting a Change
   - 8.1 Sample Size and Error Probabilities
   - 8.2 Uncertain Effect Size and Variance
   - 8.3 Uncertain Sample PDF (8.3.1 Background; 8.3.2 Info-gap Approach to Determining the Sample Size; 8.3.3 An Approximate Robustness for Small Effect Size)

© Yakov Ben-Haim 2005. lectures\reltest\acctes.tex 7.6.2005

## 1 Thickness Measurement: Testing a Sample Mean

¶ In the first few sections we illustrate various statistical hypothesis tests for acceptance testing.

¶ Suppose we measure the thickness of a plate at $N$ widely separated points: $x_1, \dots, x_N$. These measurements differ from one another due to random measurement error, as well as fluctuations in the local thickness.
How do we use these measurements to decide if the plate "really" has thickness $T$? What does "really has thickness $T$" mean? Perhaps: $\mu = T$, where $\mu = E(x)$.

¶ A random sample is a set of independent measurements made on the same population. That is, a random sample is a set of independent and identically distributed (i.i.d.) random variables.

¶ The sample mean of a random sample is defined as:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{1}$$

**Theorem 1.** If a random sample of size $N$ is taken from a population with mean $\mu$ and variance $\sigma^2$, then the sample mean $\bar{x}$ has mean $\mu$ and variance $\sigma^2/N$. That is:

$$E(\bar{x}) = \mu \tag{2}$$

$$\mathrm{var}(\bar{x}) = E\left[\left(\bar{x} - E(\bar{x})\right)^2\right] = \frac{\sigma^2}{N} \tag{3}$$

**Theorem 2.** If a random sample is taken from a normal population, then the sample mean is normal.

Combining the last two theorems we can assert:

$$x_i \sim N(\mu, \sigma^2) \;\; \Longrightarrow \;\; \bar{x} \sim N(\mu, \sigma^2/N) \tag{4}$$

**"Theorem" 3.** An approximate statement of the central limit theorem: the sample mean of a random sample will be approximately normal for large sample size ($N > 30$, rough number).

¶ So, let us suppose that the thickness measurement is normally distributed, or that the number of measurements is large. Thus:

$$\bar{x} \sim N(\mu, \sigma^2/N) \tag{5}$$

where:
$\mu$ = true mean thickness.
$\sigma^2$ = variance of the thickness measurements.
$N$ = sample size.

Also, assume that we know the value of $\sigma^2$.

¶ How do we decide whether:
- The true thickness of the plate is $T$?
- To accept or reject the plate?

We use a hypothesis test.

¶ The null hypothesis:

$$H_0: \quad \mu = T \tag{6}$$

$T$ = desired thickness: a known value.
$\mu$ = true thickness: an unknown value.

¶ The alternative hypothesis:

$$H_1: \quad \mu \neq T \tag{7}$$

This is a two-tailed test. The test would be one-tailed if the alternative hypothesis were:

$$H_1: \quad \mu > T \tag{8}$$

or:

$$H_1: \quad \mu < T \tag{9}$$

¶ Level of confidence, $\alpha$:

- Probability of obtaining a result at least as extreme as the observed result, conditioned upon $H_0$.
- Probability of rejecting $H_0$ erroneously.

For the two-tailed test in question:

$$\alpha = \mathrm{Prob}\left(\,\lvert \bar{x} - T \rvert \geq \lvert \bar{x}_o - T \rvert \;\middle|\; H_0 \right) \tag{10}$$

$\bar{x}$ = the random variable "sample mean".
$\bar{x}_o$ = the observed value of the random variable $\bar{x}$.

Figure 1: Sketch of probability density, illustrating the level of confidence in eq. (10).

¶ Interpreting the level of confidence:
If $\alpha$ is small: reject $H_0$.
If $\alpha$ is large: accept $H_0$.
¶ How to evaluate the level of confidence? Conditioned upon $H_0$, we can assert:

$$\bar{x} \sim N(T, \sigma^2/N) \tag{11}$$

$\bar{x}$ can be standardized as:

$$z = \frac{\bar{x} - T}{\sigma/\sqrt{N}} \sim N(0, 1) \tag{12}$$

Now the level of confidence can be expressed as:

$$\alpha = \mathrm{Prob}\left(\,\lvert \bar{x} - T \rvert \geq \lvert \bar{x}_o - T \rvert \;\middle|\; H_0 \right) \tag{13}$$

$$= \mathrm{Prob}\left(\frac{\lvert \bar{x} - T \rvert}{\sigma/\sqrt{N}} \geq \frac{\lvert \bar{x}_o - T \rvert}{\sigma/\sqrt{N}} \;\middle|\; H_0 \right) \tag{14}$$

$$= 2\left[1 - \Phi\left(\frac{\lvert \bar{x}_o - T \rvert}{\sigma/\sqrt{N}}\right)\right] \tag{15}$$

where $\Phi(\cdot)$ is the cdf of the standard normal distribution. Define:

$$z_o = \frac{\lvert \bar{x}_o - T \rvert}{\sigma/\sqrt{N}} \tag{16}$$

$$\alpha = 2\left[1 - \Phi(z_o)\right] \tag{17}$$

Figure 2: Sketch of probability density illustrating the level of confidence in eq. (15).

¶ Numerical example.
Measurements: 1.3, 1.2, 1.4, 1.3, 1.1.
Desired thickness: $T = 1.36$.
Known variance: $\sigma^2 = 0.01 \implies \sigma = 0.1$.
Hence: $N = 5$ and $\bar{x}_o = 1.26$.

$$z_o = \frac{1.26 - 1.36}{0.1/\sqrt{5}} = -2.236 \tag{18}$$
$$\alpha = \mathrm{Prob}\left(\,\lvert z \rvert \geq \lvert z_o \rvert \;\middle|\; H_0 \right) = 2\left[1 - \Phi(\lvert z_o \rvert)\right] = 0.024 \tag{19}$$

- Note: $\Phi(\lvert z_o \rvert) = 0.988$.
- So, the probability of getting a value as large or larger than the observed standardized sample mean is 0.024.
- This is rather small, so we tend to reject $H_0$. If so, then we reject $H_0$ at the 0.024 level of confidence.
- Similarly, 0.024 = probability of falsely rejecting $H_0$.

¶ We tested $H_0$ above under the assumption that $\sigma^2$ is known. What do we do if $\sigma^2$ is unknown?

¶ Two cases: $N$ large; $N$ small.

¶ Case 1: $N$ large. If $N \geq 25$ (rough number), then $s^2$, the sample variance, is a good estimate of the true population variance:

$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 \tag{20}$$

We can now assume, conditioned on $H_0$, that:

$$t = \frac{\bar{x} - T}{s/\sqrt{N}} \sim N(0, 1) \tag{21}$$

This assumption would be precise if $s^2 = \sigma^2$. Now we proceed with the test as before.

¶ Case 2: $N$ small. If $N < 25$ (rough number), then the sample variance $s^2$ is not a good estimate of the population variance. The statistic:

$$t = \frac{\bar{x} - T}{s/\sqrt{N}} \tag{22}$$

has a distribution broader than $N(0, 1)$, since both $\bar{x}$ and $s$ display variation. This is a $t$ statistic with $N - 1$ degrees of freedom (dofs).

¶ We repeat the numerical example, without knowledge of $\sigma^2$.
Measurements: 1.3, 1.2, 1.4, 1.3, 1.1.
Desired thickness: $T = 1.36$.
$N = 5$ and $\bar{x}_o = 1.26$.
Sample variance: $s^2 = 0.013 \implies s = 0.1140$. The observed $t$ statistic is:

$$t_o = \frac{\bar{x} - T}{s/\sqrt{N}} = \frac{1.26 - 1.36}{0.1140/\sqrt{5}} = -1.961 \tag{23}$$

The dofs: $5 - 1 = 4$. From the table of the $t$ distribution (transparency AS-p.7.1):

| $\alpha =$ | 0.1 | 0.05 | 0.025 | 0.01 |
|---|---|---|---|---|
| $\nu = 4$: | 1.533 | 2.132 | 2.776 | 3.747 |

So, with 4 dofs: the probability of $t_4$ exceeding 1.533 is 0.1; the probability of $t_4$ exceeding 2.132 is 0.05; etc. The level of confidence of this two-tailed test, with $t_o = -1.961$, is:

$$\alpha = \mathrm{Prob}\left(\,\lvert t \rvert \geq \lvert t_o \rvert \;\middle|\; H_0 \right) \approx 2 \times 0.07 = 0.14 \tag{24}$$

This is not small, so we cannot reject $H_0$. The probability of falsely rejecting $H_0$ is 0.14.

## 2 Sequential Sampling: Testing a Mean

¶ In the previous example we found that, with 5 measurements, we could reject $H_0$ only at the 0.14 level of confidence. This rejection is not very convincing. If we measured more, we could probably make a better, more confident decision. How many measurements to make? One approach is the idea of sequential sampling: continue adding measurements until the level of confidence is clear cut. The following table shows an example.

| $N$ | new $x_i$ | $\bar{x}$ | $s^2$ | $s/\sqrt{N}$ | $\lvert t_o \rvert$ | $\alpha$ |
|---|---|---|---|---|---|---|
| 5 | 1.3, 1.2, 1.4, 1.3, 1.1 | 1.26 | 0.013 | 0.0510 | 1.961 | 2 × 0.07 = 0.14 |
| 8 | 1.1, 1.3, 1.2 | 1.2375 | 0.01125 | 0.0375 | 3.267 | 2 × 0.007 = 0.014 |
| 11 | 1.1, 1.2, 1.1 | 1.2091 | 0.01091 | 0.0315 | 4.7915 | < 0.002 |

Table 1: Data from a sequential test of the mean thickness measurement. (Transparency)

After 11 measurements we can stop: our confidence in rejecting $H_0$ is great.

¶ General theory: sequential analysis (Abraham Wald, *Sequential Analysis*).
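The one-sample z-test of Section 1 is easy to script. The following is a minimal Python sketch (the function name and data layout are illustrative, not from the notes) that reproduces the numerical example with known σ; the small-sample case is identical except that the tail probability must come from a $t$-table or a statistics library instead of $\Phi$.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(sample, T, sigma):
    """Two-tailed z-test of H0: mu = T when sigma is known.

    Returns (z_o, alpha) as in eqs. (16)-(17): the standardized
    observed sample mean and alpha = 2 * [1 - Phi(|z_o|)].
    """
    N = len(sample)
    x_bar = sum(sample) / N
    z_o = (x_bar - T) / (sigma / sqrt(N))
    alpha = 2 * (1 - NormalDist().cdf(abs(z_o)))
    return z_o, alpha

# Thickness example: T = 1.36, sigma = 0.1
z_o, alpha = z_test_mean([1.3, 1.2, 1.4, 1.3, 1.1], T=1.36, sigma=0.1)
print(round(z_o, 3), round(alpha, 3))  # -2.236 0.025
```

The exact normal tail gives α ≈ 0.025; the 0.024 in eq. (19) reflects the rounding of Φ in the printed table.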
## 3 Matching Two Dimensions: Testing Two Sample Means

¶ Let us suppose that we are measuring two dimensions, to see if they match. For instance, the inner and outer dimensions of pieces which need to fit together. These two dimensions are measured repeatedly, with a measuring device which has random errors:

- The random sample of the inner dimension is: $x_1, \dots, x_N$.
- The random sample of the outer dimension is: $y_1, \dots, y_M$.

We need not assume that the sample sizes, $N$ and $M$, are the same. The true mean values of these two samples, which are the true dimensions, are:

$\mu_1$ = true (but unknown) inner dimension.
$\mu_2$ = true (but unknown) outer dimension.

¶ We will assume that these two sets of measurements are normally distributed, each with known variance $\sigma^2$. Will the pieces fit snugly? How confident are we of the answer? In other words, we wish to test the null hypothesis:

$$H_0: \quad \mu_1 = \mu_2 \tag{25}$$

against one of the following alternative hypotheses:

$$H_1: \quad \mu_1 \neq \mu_2 \tag{26}$$

or:

$$H_1: \quad \mu_1 > \mu_2 \tag{27}$$

or:

$$H_1: \quad \mu_1 < \mu_2 \tag{28}$$

We choose an alternative hypothesis depending upon our prior information. Let $\bar{x}$ and $\bar{y}$ be the sample means:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \bar{y} = \frac{1}{M} \sum_{i=1}^{M} y_i \tag{29}$$

Consider the statistic:

$$\Delta = \bar{x} - \bar{y} \tag{30}$$

What are the mean and variance of $\Delta$, if $H_0$ holds? (Note that we cannot answer this question under $H_1$.)

$$E(\Delta) = E(\bar{x} - \bar{y}) = E(\bar{x}) - E(\bar{y}) = 0 \tag{31}$$
$$\mathrm{var}(\Delta) = \mathrm{var}(\bar{x} - \bar{y}) = \mathrm{var}(\bar{x}) + \mathrm{var}(\bar{y}) = \frac{\sigma^2}{N} + \frac{\sigma^2}{M} \equiv \sigma_\Delta^2 \tag{32}$$

In eq. (32) we have used the fact that $\bar{x}$ and $\bar{y}$ are statistically independent because they are means of separate random samples:

$$\mathrm{var}(\bar{x} - \bar{y}) = E\left[\bar{x} - \bar{y} - E(\bar{x} - \bar{y})\right]^2 \tag{33}$$

$$= E\left[\bar{x} - E(\bar{x})\right]^2 - 2E\left\{\left[\bar{x} - E(\bar{x})\right]\left[\bar{y} - E(\bar{y})\right]\right\} + E\left[\bar{y} - E(\bar{y})\right]^2 \tag{34}$$

$$= E\left[\bar{x} - E(\bar{x})\right]^2 + E\left[\bar{y} - E(\bar{y})\right]^2 \tag{35}$$

$$= \mathrm{var}(\bar{x}) + \mathrm{var}(\bar{y}) \tag{36}$$

How is $\Delta$ distributed if $H_0$ is true? We will assume either:

- The samples are large, or
- The measurements are normally distributed.

Then:

$$\Delta \sim N(0, \sigma_\Delta^2) \tag{37}$$

Define:

$$z = \frac{\Delta}{\sigma_\Delta} \tag{38}$$

If $H_0$ holds, then:

$$z \sim N(0, 1) \tag{39}$$

Using the two-sided alternative hypothesis of eq. (26), we formulate the level of confidence as:

$$\alpha = \mathrm{Prob}\left(\,\lvert z \rvert \geq \lvert z_o \rvert \;\middle|\; H_0 \right) \tag{40}$$

Figure 3: Sketch of probability density illustrating the level of confidence in eq. (40).

Or, if we use the one-sided alternative hypothesis of eq. (27), the level of confidence becomes:

$$\alpha = \mathrm{Prob}\left(z \geq z_o \;\middle|\; H_0 \right) \tag{41}$$
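The statistic $z = \Delta/\sigma_\Delta$ is straightforward to compute. A minimal Python sketch follows (the helper name and the measurement data are invented for illustration, not from the notes), assuming a common known σ for both samples:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x, y, sigma):
    """Test H0: mu1 = mu2 for two independent samples with common
    known variance sigma**2.  Returns (z_o, two-tailed alpha)."""
    N, M = len(x), len(y)
    delta = sum(x) / N - sum(y) / M            # Delta = x_bar - y_bar, eq. (30)
    sigma_delta = sigma * sqrt(1 / N + 1 / M)  # square root of eq. (32)
    z_o = delta / sigma_delta
    alpha = 2 * (1 - NormalDist().cdf(abs(z_o)))  # two-tailed case
    return z_o, alpha

# Hypothetical inner/outer dimension measurements, sigma = 0.02 (invented)
z_o, alpha = two_sample_z([10.02, 10.00, 10.03, 10.01],
                          [9.98, 10.00, 9.97], sigma=0.02)
print(round(z_o, 2), round(alpha, 3))
```

With these invented numbers the test yields $z_o \approx 2.07$ and a two-tailed α of roughly 0.04, so we would reject $H_0$ at about the 0.04 level.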
Figure 4: Sketch of probability density illustrating the level of confidence in eq. (41).

In the two-tailed case we obtain:

$$\alpha = 2\left[1 - \Phi(\lvert z_o \rvert)\right] \tag{42}$$

In the one-tailed case we obtain:

$$\alpha = 1 - \Phi(z_o) \tag{43}$$

## 4 Engine Warming: A χ² Test

¶ The temperature of an operating engine fluctuates between "warm" and "hot". For normal operation the engine should be "hot" a fraction $p_h = 0.15$ of the time. Maintenance records since the last overhaul of the engine show that the engine was "warm" at $N_w = 162$ and "hot" at $N_h = 44$ statistically independent sample instants.

$$\frac{N_h}{N_h + N_w} = \frac{44}{206} \approx 0.21 \tag{44}$$

Is the engine operating properly?

¶ The χ² test for categorical data is suitable for addressing this question. We formulate the χ² test as follows. Given:

- $K$ types of outcomes of an 'experiment'.
- $n_i$ outcomes of type $i$, $i = 1, \dots, K$.
- $N = \sum_{i=1}^{K} n_i$ = total number of outcomes.

Null hypothesis:

$$H_0: \quad p_i = \text{probability of type-}i\text{ outcome}, \quad i = 1, \dots, K \tag{45}$$

where $p_1, \dots, p_K$ are known values. The alternative hypothesis, $H_1$, is simply: $H_0$ is false. That is, at least one of the probabilities in $H_0$ is wrong. The χ² statistic is:

$$\chi^2 = \sum_{i=1}^{K} \frac{(n_i - N p_i)^2}{N p_i} \tag{46}$$
Explanation:

- The numerator is a prediction error for type-$i$ outcomes.
- We expect $\chi^2$ to be small if $H_0$ is correct.

**Theorem 4.** If $N$ is large and if $H_0$ is true, then the statistic in eq. (46) is approximately a χ² random variable with $K - 1$ dofs.

¶ The χ² distribution is shown in transparency AS-p.13.2 for several values of the dof.

¶ How does one calculate the level of confidence, $\alpha$? Recall:

- $\alpha$ is the probability, conditioned upon the null hypothesis, of obtaining a value more extreme than the observed statistic.
- $\alpha$ = probability of falsely rejecting $H_0$.

Hence, let $\chi^2_o$ be the observed value of the χ² statistic of eq. (46). The level of confidence is:

$$\alpha = \mathrm{Prob}\left(\chi^2 \geq \chi^2_o \;\middle|\; H_0 \right) \tag{47}$$

$\alpha$ 'small' $\iff$ $\chi^2_o$ 'extreme' $\implies$ reject $H_0$.
$\alpha$ 'large' $\iff$ $\chi^2_o$ 'not extreme' $\implies$ accept $H_0$.

'Small' and 'large' are understood from the natural calibration of probability: from 0 to 1.

Figure 5: Level of confidence in eq. (47).

¶ Let us apply this theorem to our example. $K = 2$ possible states: 'hot' and 'warm'. $N_w = 162$, $N_h = 44$, $N = 206$. $H_0$: $p_h = 0.15$, $p_w = 0.85$.

$$\chi^2_o = \frac{(162 - 206 \times 0.85)^2}{206 \times 0.85} + \frac{(44 - 206 \times 0.15)^2}{206 \times 0.15} = 6.533 \tag{48}$$

DOFs = $2 - 1 = 1$. χ² table:
| $\alpha =$ | 0.05 | 0.025 | 0.01 |
|---|---|---|---|
| $\nu = 1$: | 3.84 | 5.02 | 6.63 |

From this we see that $\alpha \approx 0.01$. So, we reject $H_0$ at the 1% confidence level. In other words, the evidence is strong that the engine is running 'hot'.

## 5 Failure Rate: Poisson Distribution and the χ² Test

¶ A computer-controlled milling machine operates automatically except when jamming, tool breakage or other failures occur. Under normal circumstances these failures occur at a rate of about 1 or 2 per day. Also, the distribution in time of the failures is thought to be a Poisson process: (1) constant average failure rate; (2) events occur independently.

¶ Data have accumulated over a 50-day period for this machine. In this time 75 failures occurred, so the average failure rate is:

$$\lambda = \frac{75 \text{ failures}}{50 \text{ days}} = 1.5 \; \frac{\text{failures}}{\text{day}} \tag{49}$$

Furthermore, we know how many days had 0, 1, 2 and 3 or more failures:

| Failures/day | # days | # failures |
|---|---|---|
| 0 | 12 | 0 |
| 1 | 17 | 17 |
| 2 | 9 | 18 |
| 3+ | 12 | 40 |
| Totals: | 50 | 75 |

Table 2: Failure data.

¶ We want to test the hypothesis: the distribution over time of failures is described by a Poisson process. Recall the Poisson distribution, where $P_i$ = probability of exactly $i$ failures in a single day:

$$P_i = \frac{e^{-\lambda} \lambda^i}{i!}, \quad i = 0, 1, 2, \dots \tag{50}$$

¶ We can use a χ² test to test this hypothesis. First define some notation:

$N$ = total number of days.
$p_i = P_i$, the Poisson distribution in $H_0$, for $i = 0, \dots, 2$.
$p_3 = \sum_{i=3}^{\infty} P_i$.
$n_i$ = observed number of days with $i$ failures, $i = 0, \dots, 3$.
$N p_i$ = expected number of days with $i$ failures, $i = 0, \dots, 3$.

Now the null hypothesis is:

$H_0$: distribution of failures is $p_i$ with $\lambda = 1.5$/day.
$H_1$: $H_0$ is wrong.

¶ The χ² statistic is:

$$\chi^2_o = \sum_{i=0}^{3} \frac{(n_i - N p_i)^2}{N p_i} = 1.695 \tag{51}$$

The DOFs:

$$\text{DOF} = \underbrace{4}_{\text{categories}} - \underbrace{1}_{\text{normalization}} - \underbrace{1}_{\text{estimating } \lambda} = 2 \tag{52}$$

¶ Recall the $p$th quantile with $\nu$ DOFs, $\chi^2_{(\nu),p}$:

$$p = \mathrm{Prob}\left(\chi^2_{(\nu)} \leq \chi^2_{(\nu),p}\right) \tag{53}$$

From a χ² table, the $p$-quantiles for $\nu = 2$ are:

| $p =$ | 0.5 | 0.6 |
|---|---|---|
| $\chi^2_{(2)} =$ | 1.386 | 1.833 |

From this table we see that the level of confidence, with $\chi^2_o = 1.695$, is:

$$\alpha = \mathrm{Prob}\left(\chi^2 \geq \chi^2_o \;\middle|\; H_0 \right) \approx 1 - 0.55 = 0.45 \tag{54}$$

This is very large, so we accept $H_0$: the distribution of failures is Poisson.

¶ Now consider a different machine, for which the data are:

| Failures/day | # days | # failures |
|---|---|---|
| 0 | 15 | 0 |
| 1 | 13 | 13 |
| 2 | 8 | 16 |
| 3+ | 14 | 46 |
| Totals: | 50 | 75 |

Table 3: Failure data.

As before, the average failure rate is:

$$\lambda = 1.5 \; \frac{\text{failures}}{\text{day}} \tag{55}$$

With these results the observed χ² value is:

$$\chi^2_o = 5.86 \tag{56}$$

With 2 DOFs, this implies that the level of confidence is:

$$\alpha = \mathrm{Prob}\left(\chi^2 \geq \chi^2_o \;\middle|\; H_0 \right) \approx 0.06 \tag{57}$$

This is rather small, so we reject $H_0$: the distribution of failures is not Poisson. Some factor is causing either:

- Variation in the average failure rate, or
- Failure inter-dependence.

In short: clustering of failures.
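Sections 4 and 5 both follow the χ² recipe of eq. (46). The sketch below (helper names are illustrative, not from the notes) recomputes the failure-rate example; for ν = 2 dofs the χ² upper-tail probability has the closed form $e^{-x/2}$, so no table is needed.

```python
from math import exp, factorial

def poisson_pmf(i, lam):
    """P_i = exp(-lam) * lam**i / i!, eq. (50)."""
    return exp(-lam) * lam ** i / factorial(i)

def chi2_statistic(day_counts, lam):
    """Chi-square statistic of eq. (51) for H0: daily failure counts
    are Poisson(lam).  day_counts[i] = number of days with i failures;
    the last entry is the pooled category 'i or more'."""
    N = sum(day_counts)
    k = len(day_counts)
    p = [poisson_pmf(i, lam) for i in range(k - 1)]
    p.append(1 - sum(p))                        # pooled upper tail, p_3
    return sum((n - N * pi) ** 2 / (N * pi)
               for n, pi in zip(day_counts, p))

for counts in ([12, 17, 9, 12], [15, 13, 8, 14]):   # Tables 2 and 3
    chi2 = chi2_statistic(counts, lam=1.5)
    alpha = exp(-chi2 / 2)     # chi-square survival function for nu = 2
    print(round(chi2, 2), round(alpha, 2))
```

This gives χ² ≈ 1.70 with α ≈ 0.43 for the first machine and χ² ≈ 5.87 with α ≈ 0.05 for the second, matching the table-interpolated values 0.45 and 0.06 in eqs. (54) and (57) to within rounding.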
## 6 χ² Test of Independence in a 2-way Table

¶ Consider the following situation: two different systems are used to perform a particular mission. Records indicate the number of failed and successful missions:

| | Successful Missions | Failed Missions | Total Missions |
|---|---|---|---|
| System 1 | 86 | 35 | 121 |
| System 2 | 64 | 37 | 101 |
| Total Missions | 150 | 72 | |

Table 4: Data on successes and failures of two systems.

¶ Question: Is there solid evidence for the contention that system 1 is more reliable than system 2? In other words, are the rows and columns of the table independent? That is, by choosing the row (system), do I have sound evidence that I significantly influence the column (success or failure) in which I "fall"? We can use the χ² test to evaluate the evidence.

¶ First we must formulate the null hypothesis.

$p_{ij}$ = probability of an event in row $i$, column $j$.
$p_{i\bullet}$ = probability of an event in row $i$.
$p_{\bullet j}$ = probability of an event in column $j$.

The null hypothesis states: there is statistical independence between rows and columns. That is:

$$H_0: \quad p_{ij} = p_{i\bullet} \, p_{\bullet j} \tag{58}$$

The alternative hypothesis states that $H_0$ is false.

¶ Now we can formulate the χ² statistic. Define:

$n_{ij}$ = number of events in row $i$, column $j$.
$r$ = number of rows.
$c$ = number of columns.
$N$ = total number of events = $\sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij}$.

The χ² statistic is calculated as:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - N p_{i\bullet} p_{\bullet j})^2}{N p_{i\bullet} p_{\bullet j}} \tag{59}$$

The level of confidence is the probability of $\chi^2$ obtaining a value greater than the observed
value, $\chi^2_o$, conditioned on $H_0$:

$$\alpha = \mathrm{Prob}\left(\chi^2 \geq \chi^2_o \;\middle|\; H_0 \right) \tag{60}$$

¶ A difficulty: in our example we don't know the values of $p_{i\bullet}$ and $p_{\bullet j}$, so we can't calculate $\chi^2_o$. Solution: estimate $p_{i\bullet}$ and $p_{\bullet j}$ from the data $n_{ij}$:

$$\hat{p}_{i\bullet} = \frac{n_{i\bullet}}{N}, \qquad \hat{p}_{\bullet j} = \frac{n_{\bullet j}}{N} \tag{61}$$

where:
$n_{i\bullet}$ = sum of the $i$th row.
$n_{\bullet j}$ = sum of the $j$th column.

¶ How many DOFs do we have? DOF = number of categories − number of constraints. The number of categories is $rc$. There are 3 types of constraints:

Type 1. One constraint (normalization):

$$\sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} = N \sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} \tag{62}$$

Type 2. $r - 1$ constraints: estimate $p_{1\bullet}, \dots, p_{r-1,\bullet}$; then $p_{r\bullet} = 1 - \sum_{i=1}^{r-1} p_{i\bullet}$.

Type 3. $c - 1$ constraints: estimate $p_{\bullet 1}, \dots, p_{\bullet,c-1}$; then $p_{\bullet c} = 1 - \sum_{j=1}^{c-1} p_{\bullet j}$.

So the number of DOFs is:

$$\text{DOF} = rc - 1 - (r - 1) - (c - 1) = (r - 1)(c - 1) \tag{63}$$

¶ Now the observed statistic can be calculated as:

$$\chi^2_o = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - N \hat{p}_{i\bullet} \hat{p}_{\bullet j})^2}{N \hat{p}_{i\bullet} \hat{p}_{\bullet j}} \tag{64}$$

¶ In our numerical example we have:

$n_{11} = 86$, $n_{12} = 35$, $n_{1\bullet} = 121$
$n_{21} = 64$, $n_{22} = 37$, $n_{2\bullet} = 101$
$n_{\bullet 1} = 150$, $n_{\bullet 2} = 72$, $N = 222$
Hence the estimated probabilities are:

$$\hat{p}_{1\bullet} = \frac{n_{1\bullet}}{N} = 0.545, \qquad \hat{p}_{2\bullet} = \frac{n_{2\bullet}}{N} = 0.455 \tag{65}$$

$$\hat{p}_{\bullet 1} = \frac{n_{\bullet 1}}{N} = 0.676, \qquad \hat{p}_{\bullet 2} = \frac{n_{\bullet 2}}{N} = 0.324 \tag{66}$$

The observed statistic becomes:

$$\chi^2_o = 1.493 \tag{67}$$

And the number of DOFs is:

$$\text{DOF} = (2 - 1)(2 - 1) = 1 \tag{68}$$

Is 1.493 large or small? Should we accept or reject $H_0$? We need to calibrate $\chi^2_o$ using the quantile values of the χ² distribution with 1 DOF. Define:

$\chi^2_{(\nu)}$ = χ² random variable with $\nu$ DOFs.
$\chi^2_{(\nu),p}$ = $p$th quantile of $\chi^2_{(\nu)}$, defined by:

$$p = \mathrm{Prob}\left(\chi^2_{(\nu)} \leq \chi^2_{(\nu),p}\right) \tag{69}$$

From a χ² table, the $p$-quantiles for $\nu = 1$ are:

| $p =$ | 0.75 | 0.80 |
|---|---|---|
| $\chi^2_{(1)} =$ | 1.323 | 1.642 |

From this table we see that:

$$\mathrm{Prob}\left(\chi^2_{(1)} \geq 1.323\right) = 1 - 0.75 = 0.25 \tag{70}$$

$$\mathrm{Prob}\left(\chi^2_{(1)} \geq 1.642\right) = 1 - 0.80 = 0.20 \tag{71}$$

So, since $\chi^2_o = 1.493$, we see that:

$$\alpha \approx 0.22 \tag{72}$$

We cannot reject $H_0$ at the 0.2 level of confidence. So we accept $H_0$:

- Columns and rows are independent.
- The two systems are not significantly different in reliability.

## 7 Acceptance Sampling of a Large Population

¶ We have a large batch of items, among which a fraction $p$ are defective. We wish to sample this population to decide whether or not to accept the batch.

¶ Define:

$N$ = sample size.
$p_a$ = acceptable proportion of defective items.
$p_u$ = unacceptable proportion of defective items.

$$p_u > p_a \tag{73}$$
$p_a$ and $p_u$ can be interpreted:

- We desire the defective fraction to be no greater than $p_a$.
- We are willing to live with a defective fraction as large as $p_u$.

$p$ = true but unknown fraction of defective items in the batch.
$P_A$ = probability of accepting the batch.

¶ Our algorithm for accepting or rejecting the batch is: accept if and only if:

$$\frac{\text{number of defective items in sample}}{\text{sample size}} \leq p_a \tag{74}$$

¶ Two types of errors can be made:

I. Acceptance with $p > p_u$. Bad acceptance.
II. Rejection with $p < p_a$. Bad rejection.

¶ How do we calculate $P_A$, the probability of acceptance? Assume that $N \ll$ batch size. Thus the sample does not significantly change the composition of the population.

¶ Binomial distribution: $b(i; p, N)$ = probability of exactly $i$ defectives in a sample of size $N$.

¶ Acceptance probability:

$$P_A = \sum_{i=0}^{p_a N} b(i; p, N) \tag{75}$$

$$= \sum_{i=0}^{p_a N} \binom{N}{i} p^i (1 - p)^{N-i} \tag{76}$$

¶ Example. Suppose: $N = 100$, $p_a = 0.01$. Hence $p_a N = 1$ and:

$$P_A = \sum_{i=0}^{1} b(i; p, 100) = (1 - p)^{100} + 100 p (1 - p)^{99} \tag{77}$$

A plot of $P_A$ vs. $p$ reveals the probabilities of type I and type II errors. Recall that:

$p_a$ = max "acceptable" fraction of defectives.
$p_u$ = max "tolerable" fraction of defectives.
Figure 6: Acceptance probability in eq. (77).

¶ Type I errors: acceptance with $p > p_u$. Bad acceptance. The probability of a type I error is called the consumer's risk:

$$P_I = P_A(p = p_u) \tag{78}$$

¶ Type II errors: rejection with $p < p_a$. Bad rejection. The probability of a type II error is called the producer's risk:

$$P_{II} = 1 - P_A(p = p_a) \tag{79}$$

When we are designing a sampling scheme, we can evaluate it with 2 pairs of numbers: $(p_u, P_I)$ and $(p_a, P_{II})$.

Figure 7: Illustration of type I and type II errors in eqs. (78) and (79) using eq. (77) ($N = 100$, $p_a = 0.01$, $p_u = 0.04$).

¶ Example. Consider $P_A(p)$ in eq. (77).
Choose $p_a = 0.01$ and $p_u = 0.04$. With $N = 100$, we have $p_a N = 1$:

Consumer's risk: $P_I = P_A(p = p_u) = 0.0872$.
Producer's risk: $P_{II} = 1 - P_A(p = p_a) = 1 - 0.736 = 0.264$.

With $N = 200$, we have $p_a N = 2$:

$$P_A = \sum_{i=0}^{2} b(i; p, 200) = (1 - p)^{200} + 200 p (1 - p)^{199} + \frac{(200)(199)}{2} p^2 (1 - p)^{198} \tag{80}$$

Consumer's risk: $P_I = P_A(p = p_u) = 0.0125$.
Producer's risk: $P_{II} = 1 - P_A(p = p_a) = 1 - 0.677 = 0.323$.

By increasing the sample size we:

- Increased the producer's risk: 0.264 → 0.323.
- Decreased the consumer's risk: 0.0872 → 0.0125.

¶ As $N \to \infty$ we expect a rectangular $P_A$ vs. $p$:

Figure 8: Asymptotic acceptance probability, $N \to \infty$.

Consumer's risk: $P_I = P_A(p = p_u) = 0$.
Producer's risk: $P_{II} = 1 - P_A(p = p_a) = 0$.

For finite $N$, $P_A(p)$ oscillates with $N$, as a result of the discrete, binary nature of the distribution.
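The operating characteristics above can be reproduced directly from the binomial formula, eq. (76). A minimal Python sketch (function and variable names are illustrative, not from the notes):

```python
from math import comb

def accept_prob(p, N, p_a):
    """P_A of eq. (76): probability of accepting a batch with defective
    fraction p, when we accept iff a sample of size N contains at most
    p_a * N defectives (binomial model, valid for N << batch size)."""
    c = int(p_a * N)                            # acceptance number
    return sum(comb(N, i) * p ** i * (1 - p) ** (N - i)
               for i in range(c + 1))

p_a, p_u = 0.01, 0.04
for N in (100, 200):
    P_I = accept_prob(p_u, N, p_a)         # consumer's risk, eq. (78)
    P_II = 1 - accept_prob(p_a, N, p_a)    # producer's risk, eq. (79)
    print(N, round(P_I, 4), round(P_II, 3))
# 100 0.0872 0.264
# 200 0.0125 0.323
```

The printed pairs reproduce the risks quoted in the example, including the trade-off: doubling the sample size lowers the consumer's risk but raises the producer's risk.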
## 8 Sample Size for Detecting a Change

Collaborators: David Fox, Mick McCarthy, Keith Hays, Brendan Wintle.

### 8.1 Sample Size and Error Probabilities

¶ The dispute.

- $x$ is the measurement of the output of a system.
- One side argues: $x$ has not changed, its mean is $\mu_0$ and its pdf is $g(x)$.
- The other side argues: $x$ has changed, its mean is $\mu_1 = \mu_0 + \delta$ and its pdf has shifted to $g(x - \delta)$.
- $\delta > 0$ means that $g(x - \delta)$ is shifted to the right of $g(x)$.
- $\delta$ is the "effect size": the shift in the distribution of $x$.
- Null and alternative hypotheses:

$$H_0: \quad x \sim g(x) \tag{81}$$

$$H_1: \quad x \sim g(x - \delta) \tag{82}$$

¶ The question: How large a sample is needed to confidently resolve the dispute?

¶ Random sample.

- A random sample of $x$ values is taken. $n$ = sample size.
- $\bar{x}$ = sample mean. $f_n(\cdot)$ is the pdf of the sample mean.
- Null and alternative hypotheses:

$$H_0: \quad \bar{x} \sim f_n(\bar{x}) \tag{83}$$

$$H_1: \quad \bar{x} \sim f_n(\bar{x} - \delta) \tag{84}$$

- $\delta > 0$ means that $f_n(\bar{x} - \delta)$ is shifted $\delta$ to the right of $f_n(\bar{x})$.

¶ Example: normal distribution.

- $x \sim N(\mu, \sigma^2)$ implies $\bar{x} \sim N(\mu, \sigma^2/n)$.
- Hence the null and alternative hypotheses are:

$$H_0: \quad \bar{x} \sim N(\mu, \sigma^2/n) \tag{85}$$

$$H_1: \quad \bar{x} \sim N(\mu + \delta, \sigma^2/n) \tag{86}$$

- The pdf of the sample mean depends on the sample size.

¶ Critical value, $C$:

- Reject $H_0$ iff $\bar{x} > C$.
- $C$ depends on the sample size.

¶ Errors:

- Type I error: false rejection of $H_0$.
- Type II error: false rejection of $H_1$.
¶ Error probabilities:

- $\alpha$ = probability of falsely rejecting $H_0$:

$$\alpha = \mathrm{Prob}(\bar{x} > C \mid H_0) \tag{87}$$

$$= \int_C^\infty f(\bar{x}) \, d\bar{x} \tag{88}$$

- $\beta$ = probability of falsely rejecting $H_1$:

$$\beta = \mathrm{Prob}(\bar{x} \leq C \mid H_1) \tag{89}$$

$$= \int_{-\infty}^{C} f(\bar{x} - \delta) \, d\bar{x} \tag{90}$$

$$= \int_{-\infty}^{C - \delta} f(\bar{x}) \, d\bar{x} \tag{91}$$

- Power = $1 - \beta$ = probability of correctly rejecting $H_0$ when $H_1$ is true.

¶ Example: normal distribution.

- Null and alternative hypotheses are as in eqs. (85) and (86).
- Probability of type I error:

$$\alpha = \mathrm{Prob}(\bar{x} > C \mid H_0) \tag{92}$$

$$= \mathrm{Prob}\left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} > \frac{C - \mu}{\sigma/\sqrt{n}}\right) \tag{93}$$

$$= 1 - \Phi\left(\frac{C - \mu}{\sigma/\sqrt{n}}\right) \tag{94}$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution.

- Probability of type II error:

$$\beta = \mathrm{Prob}(\bar{x} \leq C \mid H_1) \tag{95}$$

$$= \mathrm{Prob}\left(\frac{\bar{x} - (\mu + \delta)}{\sigma/\sqrt{n}} \leq \frac{C - (\mu + \delta)}{\sigma/\sqrt{n}}\right) \tag{96}$$

$$= \Phi\left(\frac{C - (\mu + \delta)}{\sigma/\sqrt{n}}\right) \tag{97}$$

¶ Choose the sample size, $n$, so that:

- $\alpha$ = specified value, e.g. 0.02.
- $\beta \leq$ specified value, e.g. 0.2.

¶ Example: normal distribution.

- Given: $\mu$, $\delta$ and $\sigma$.
- Choose the type-I error probability, $\alpha$, say $\alpha = 0.02$.
- For any sample size $n$, the critical value, $C$, is found from eq. (94) as:

$$\Phi\left(\frac{C - \mu}{\sigma/\sqrt{n}}\right) = 1 - \alpha \tag{98}$$
So:

$$\frac{C - \mu}{\sigma/\sqrt{n}} = z_{1-\alpha} = (1 - \alpha)\text{th quantile of } \Phi \tag{99}$$

So:

$$C = \mu + \frac{z_{1-\alpha}\,\sigma}{\sqrt{n}} \tag{100}$$

- Now the type-II error probability, $\beta$, is, from eq. (97):

$$\beta = \Phi\left(\frac{C - (\mu + \delta)}{\sigma/\sqrt{n}}\right) \tag{101}$$

$$= \Phi\left(z_{1-\alpha} - \frac{\delta}{\sigma/\sqrt{n}}\right) \tag{102}$$

- Numerical example: $\mu = 0$, $\delta = 0.01$, $\sigma = 0.007$. Require $\alpha = 0.02$, so $z_{1-\alpha} = 2.05$. Table 5 shows $n$, $C$ and $\beta$.

| $n$ | $C$, eq. (100) | $\frac{C - (\mu + \delta)}{\sigma/\sqrt{n}}$ | $\beta$, eq. (101) |
|---|---|---|---|
| 2 | 0.010147 | 0.029695 | 0.512 |
| 5 | 0.0064175 | −1.14438 | 1 − 0.8729 = 0.127 |
| 10 | 0.0045379 | −2.46754 | 1 − 0.9934 = 0.0066 |

Table 5: Sample size $n$, critical value $C$, and type-II error probability $\beta$.

If we require $\beta \leq 0.02$ then: $n = 5$ is too small; $n = 10$ is big enough.

- Suppose that $\delta > 0$. Then:

$$\Phi\left(z_{1-\alpha} - \frac{\delta}{\sigma/\sqrt{n}}\right) < \Phi(z_{1-\alpha}) \tag{103}$$

Thus, from eqs. (98) and (102):

$$\beta = \Phi\left(z_{1-\alpha} - \frac{\delta}{\sigma/\sqrt{n}}\right) < \Phi(z_{1-\alpha}) = 1 - \alpha \tag{104}$$

That is:

$$\beta < 1 - \alpha \tag{105}$$

Trade-off: small $\alpha$ means that $\beta$ may be large.

### 8.2 Uncertain Effect Size and Variance

¶ Effect size:

- Effect size: $\Delta = \mu_0 - \mu_1$. Note: $\Delta$ (here) $= -\delta$ (section 8.1).
- Consider the upper-tail hypothesis: $\Delta < 0$.
- A similar derivation can be formulated for other cases.

¶ The problem:
- Assume a normal distribution.
- We have an estimate $\tilde{\Delta}$ of the effect size, but we are unsure how negative it really should be.
- We have an estimate $\tilde{\sigma}$ of the population standard deviation, but we are unconfident that this estimate is correct.

¶ Fractional-error info-gap model:

$$\mathcal{U}(R, \tilde{\Delta}, \tilde{\sigma}) = \left\{ (\Delta, \sigma) : \; (1 + R)\tilde{\Delta} \leq \Delta \leq \min[0, (1 - R)\tilde{\Delta}], \;\; \max[0, (1 - R)\tilde{\sigma}] \leq \sigma \leq (1 + R)\tilde{\sigma} \right\}, \quad R \geq 0 \tag{106}$$

¶ Power = $1 - \beta$ = $1\, -$ probability of type-II error, from eq. (102):

$$\mathrm{Power}(\Delta, \sigma, n) = 1 - \Phi\left(\frac{\Delta \sqrt{n}}{\sigma} + z_{1-\alpha}\right) \tag{107}$$

where $z_{1-\alpha}$ is the $(1 - \alpha)$th quantile of the standard normal distribution.

¶ Robustness of sample size $n$, with the requirement that the power be no less than $1 - \beta_c$:

$$\hat{R}(n, \beta_c) = \max\left\{ R : \; \min_{(\Delta, \sigma) \in \mathcal{U}(R, \tilde{\Delta}, \tilde{\sigma})} \mathrm{Power}(\Delta, \sigma, n) \geq 1 - \beta_c \right\} \tag{108}$$

¶ Inner minimum in eq. (108):

$$\mu(R) = \min_{(\Delta, \sigma) \in \mathcal{U}(R, \tilde{\Delta}, \tilde{\sigma})} \mathrm{Power}(\Delta, \sigma, n) \tag{109}$$

- $\mu(R)$ decreases as $R$ increases: nesting of the uncertainty sets.
- Robustness: greatest $R$ such that $\mu(R) \geq 1 - \beta_c$.
- $\mu(R)$ is monotonic in $R$, so the robustness is the max $R$ satisfying $\mu(R) = 1 - \beta_c$.
- A plot of $\mu(R)$ vs. $R$ is a plot of $1 - \beta_c$ vs. $\hat{R}(n, \beta_c)$.
- See fig. 9.

¶ Derive the robustness function.

- The minimum $\mu(R)$ occurs for the greatest allowed value of $\Delta/\sigma$, which is negative and occurs when $\Delta = \min[0, (1 - R)\tilde{\Delta}]$ and when $\sigma = (1 + R)\tilde{\sigma}$.
- If $R \geq 1$ then:
  - Power is minimized when $\Delta = 0$, so:

$$\mathrm{Power}(\Delta, \sigma, n) = 1 - \Phi(z_{1-\alpha}) = \mu(R) \tag{110}$$

as in the horizontal section of the curve in fig. 9.

  - Robustness is infinite when $1 - \beta_c < 1 - \Phi(z_{1-\alpha})$.
  - Thus very low demanded power (large $\beta_c$) implies very high robustness:

$$\hat{R}(n, \beta_c) = \infty \quad \text{if} \quad \beta_c > \Phi(z_{1-\alpha}) \tag{111}$$
Figure 9: Illustration of the calculation of robustness.

¶ If $R < 1$: the robustness is the greatest value of $R$ satisfying:

$$\Phi\left(\frac{(1 - R)\tilde{\Delta}\sqrt{n}}{(1 + R)\tilde{\sigma}} + z_{1-\alpha}\right) \leq \beta_c \tag{112}$$

- Let us denote by $q(\beta_c)$ the $\beta_c$ quantile of the standard normal distribution.
- $q(\beta_c)$ increases from $-\infty$ to $+\infty$ as $\beta_c$ increases from 0 to 1.
- The robustness is the greatest value of $R$ satisfying:

$$\frac{(1 - R)\tilde{\Delta}\sqrt{n}}{(1 + R)\tilde{\sigma}} + z_{1-\alpha} \leq q(\beta_c) \tag{113}$$

- If:

$$\frac{\tilde{\Delta}\sqrt{n}}{\tilde{\sigma}} + z_{1-\alpha} > q(\beta_c) \tag{114}$$

then the robustness is zero for this value of $\beta_c$, and positive robustness is obtained only for greater values of $\beta_c$ (lower power).

- Define:

$$\nu = \frac{q(\beta_c) - z_{1-\alpha}}{\tilde{\Delta}\sqrt{n}/\tilde{\sigma}} \tag{115}$$

The robustness is positive only if $\nu < 1$ (recalling that $\tilde{\Delta} < 0$).

- Now, solving eq. (113) (as an equality) for $R$, in the case that eq. (114) does not hold (that is, $\nu < 1$), we obtain the robustness:

$$\hat{R}(n, \beta_c) = \begin{cases} \dfrac{1 - \nu}{1 + \nu} & \text{if } \nu < 1 \\[4pt] 0 & \text{else} \end{cases} \qquad \text{if } \beta_c \leq \Phi(z_{1-\alpha}) \tag{116}$$

The complete robustness function is given by eqs. (111) and (116).

¶ Trade-off: robustness vs. power. Applying the chain rule for differentiation to eq. (116), one finds:

$$\frac{\partial \hat{R}(n, \beta_c)}{\partial (1 - \beta_c)} < 0 \tag{117}$$
The robustness $\hat{R}(n, \beta_c)$ decreases as the demanded power, $1 - \beta_c$, increases: high aspirations are vulnerable to uncertainty.

¶ Trade-off: robustness vs. sample size. One finds:

$$\frac{\partial \hat{R}(n, \beta_c)}{\partial n} > 0 \tag{118}$$

Thus the robustness increases as the sample size increases.

### 8.3 Uncertain Sample PDF

#### 8.3.1 Background

¶ Binary statistical test.

- $x$ = decision statistic.
- The distribution of $x$ under $H_1$ equals the distribution under $H_0$ shifted up by $\delta$.
- We accept the null hypothesis iff $x \leq C$.

¶ Definitions.

- $\alpha$ = level of significance = probability of type-I error (falsely reject $H_0$).
- $\beta(f)$ = probability of type-II error (falsely reject $H_1$) = $1\, -$ power.
- $\delta$ = non-negative effect size.
- $f(x)$ = pdf of the decision statistic.
- $C$ = critical value.

$$1 - \alpha = \int_{-\infty}^{C} f(x) \, dx \tag{119}$$

$$\beta(f) = \int_{-\infty}^{C} f(x - \delta) \, dx = \int_{-\infty}^{C - \delta} f(x) \, dx = 1 - \alpha - \int_{C - \delta}^{C} f(x) \, dx \tag{120}$$

¶ Standard statistical approach.

- Known sampling distribution: $f(x)$.
- $f(x)$ depends on the sample size (number of measurements).
- Specify $\alpha$ and $\delta$.
- Determine $C$ and $\beta$ from eqs. (119) and (120).
- Increase the sample size until the power is adequate.

#### 8.3.2 Info-gap Approach to Determining the Sample Size

¶ Approach.

- The sampling distribution is uncertain.
- Evaluate the info-gap robustness of the estimated power.
- Determine the sample size, $n$, so that adequate power is adequately robust.

¶ Info-gap model for pdf uncertainty: fractional error.

$$\mathcal{U}(R, \tilde{f}) = \left\{ f(x) : \; f \in \mathcal{P}, \; \lvert f(x) - \tilde{f}(x) \rvert \leq R \tilde{f}(x) \right\}, \quad R \geq 0 \tag{121}$$
$\mathcal{P}$ is the set of all non-negative and normalized pdfs on the domain of $x$.

¶ Performance requirement.

- Power = $1 - \beta$. Require large power; small $\beta$.
- $1 - \beta_d$ = demanded power.
- The analyst requires $\beta \leq \beta_d$.

¶ Robustness:

$$\hat{R}(n, \beta_d) = \max\left\{ R : \; \max_{f \in \mathcal{U}(R, \tilde{f})} \beta(f) \leq \beta_d \right\} \tag{122}$$

¶ Evaluating the robustness.

- Denote the inner maximum in eq. (122) by $\gamma(R)$.
- The robustness is the max $R$ such that $\gamma(R) \leq \beta_d$.
- The uncertainty sets $\mathcal{U}(R, \tilde{f})$ are nested with respect to $R$.
- Thus $\gamma(R)$ increases as $R$ increases.
- Thus the robustness is the max $R$ at which $\gamma(R) = \beta_d$.
- $\gamma(R)$ is the inverse of $\hat{R}(n, \beta_d)$:

$$\gamma(R) = \beta_d \quad \text{if and only if} \quad \hat{R}(n, \beta_d) = R \tag{123}$$

#### 8.3.3 An Approximate Robustness for Small Effect Size

¶ Special case, very small effect size:

$$\delta \ll 1 \tag{124}$$

We derive an approximate expression for the robustness.

¶ Now eq. (120) can be approximated as:

$$\beta(f) = 1 - \alpha - f(C)\delta \tag{125}$$

¶ Nominal critical value, $C$:

- $C$ = $(1 - \alpha)$th quantile of the best-estimated pdf $\tilde{f}(x)$.
- $C$ depends on the sample size $n$.

¶ Maximizing pdf. The pdf in $\mathcal{U}(R, \tilde{f})$ which maximizes $\beta$ is very nearly:

$$f(x) = \begin{cases} \tilde{f}(x) & \text{if } x < C - \delta \\ (1 - R)\tilde{f}(x) & \text{if } x \in [C - \delta, C] \\ (1 + wR)\tilde{f}(x) & \text{if } x > C \end{cases} \tag{126}$$
where $w$ is a very small positive number which normalizes $f(x)$. That is, $w$ is determined so that the decrement in $f$ on $[C - \delta, C]$ is compensated by the increment on $(C, \infty)$:

$$wR\left[1 - \tilde{F}(C)\right] = R\tilde{f}(C)\delta \tag{127}$$

where $\tilde{F}$ is the cumulative distribution function of $\tilde{f}$.

¶ Evaluating $\gamma(R)$. $\beta(f)$ in eq. (125) becomes:

$$\gamma(R) = \beta(f) = 1 - \alpha - (1 - R)\tilde{f}(C)\delta \tag{128}$$

Note: $\gamma(R)$ increases as $R$ increases.

¶ Evaluating the robustness. Equate $\gamma(R)$ in eq. (128) to $\beta_d$ and solve for $R$:

$$\hat{R}(n, \beta_d) = \begin{cases} 0 & \text{if } \beta_d < 1 - \alpha - \tilde{f}(C)\delta \\[6pt] \dfrac{\beta_d - 1 + \alpha + \tilde{f}(C)\delta}{\tilde{f}(C)\delta} & \text{else} \end{cases} \tag{129}$$

- The robustness increases as $\beta_d$ increases. Trade-off: high power $\iff$ low robustness.
- The robustness is zero when $\beta_d$ equals the nominal value, $\beta(\tilde{f})$.
- The robustness depends on the sample size through the nominal critical value $C$.
- This derivation is contingent on the small-effect assumption in eq. (124).
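Eq. (129) is simple enough to sketch in code. In this minimal Python sketch the numbers are invented purely to exercise the formula (they are not from the notes), and $\tilde{f}(C)$ is written `fC`:

```python
def robustness(beta_d, alpha, fC, delta):
    """Approximate info-gap robustness of eq. (129), valid for small
    effect size delta.  fC is the nominal pdf evaluated at the
    critical value C; beta_d is the demanded type-II error probability."""
    beta_nominal = 1 - alpha - fC * delta      # eq. (125) at R = 0
    if beta_d < beta_nominal:
        return 0.0
    return (beta_d - 1 + alpha + fC * delta) / (fC * delta)

# Invented illustration: alpha = 0.05, fC = 1.0, delta = 0.1,
# so the nominal type-II error probability is 0.85.
print(robustness(0.85, 0.05, 1.0, 0.1))  # zero robustness at the nominal beta
print(robustness(0.90, 0.05, 1.0, 0.1))  # larger beta_d gives positive robustness
```

The two calls illustrate the trade-off stated in the bullets above: demanding only the nominal performance leaves zero robustness, while relaxing the demand (larger $\beta_d$, lower power) buys robustness to uncertainty in the pdf.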