# Point Estimate, Confidence Interval, Hypothesis tests

1. Statistics Lab. Rodolfo Metulini, IMT Institute for Advanced Studies, Lucca, Italy. Lesson 3 - Point Estimate, Confidence Interval and Hypothesis Tests - 16.01.2014
2. Introduction. We start from empirical data (one variable of length N) read from an external file, and we treat it as the population. We then draw a sample of size n. Suppose we have no information on the population (or, better, we want to check whether and how well the sample represents the population). In other words, we want to make inference using the information contained in the sample, in order to obtain an estimate for the population. That sample is only one of the many samples we could randomly draw from the population (the sample space). What are the instruments for obtaining information about the population? (1) Sample mean (point estimation) (2) Confidence interval (3) Hypothesis tests
3. Sample space. In probability theory, the sample space of an experiment or random trial is the set of all possible outcomes or results of that experiment. It is common to refer to a sample space by the labels S, Ω, or U. For example, for tossing two coins, the corresponding sample space is {(head,head), (head,tail), (tail,head), (tail,tail)}, so that its size is 4: dim(Ω) = 4. It means that we can obtain 4 different samples, each with its own sample mean (not all of them necessarily distinct). In practice, we face only one sample, taken at random from the sample space.
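
The two-coin sample space can be enumerated directly. The sketch below is only an illustration: coding heads as 1 and tails as 0 is an assumption introduced here so that a sample mean can be computed for each outcome.

```python
from itertools import product
from statistics import mean

# Assumption for this sketch only: code heads as 1 and tails as 0.
faces = {"head": 1, "tail": 0}

# The sample space is every ordered pair of faces.
sample_space = list(product(faces, repeat=2))

# One sample mean per outcome in the sample space.
sample_means = [mean(faces[f] for f in outcome) for outcome in sample_space]

for outcome, m in zip(sample_space, sample_means):
    print(outcome, m)
```

With this coding, (head,tail) and (tail,head) are different samples that share the same sample mean.
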
4. Point estimate. A point estimate summarizes the information contained in the population (of size N) through a single value constructed from n values. The most widely used unbiased point estimator is the sample mean: $\hat{X}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$. Other point estimators are: (1) Sample median (2) Sample mode (3) Geometric mean: $M_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \exp\left[\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right]$. An example of what is not an estimator is the sample mean computed after subsetting the sample by truncating it at a certain value. P.S. A naive definition of estimator: a quantity computed using all the n pieces of information in the sample.
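
As a minimal sketch, the point estimators listed above can be computed with Python's standard `statistics` module. The data values are hypothetical, chosen so that the geometric mean is easy to check by hand.

```python
import math
from statistics import mean, median, mode, geometric_mean

# Hypothetical sample, used only for illustration.
sample = [2.0, 4.0, 4.0, 8.0]

x_bar = mean(sample)           # sample mean: (2 + 4 + 4 + 8) / 4
med = median(sample)           # sample median
mo = mode(sample)              # sample mode (most frequent value)
m_g = geometric_mean(sample)   # (2 * 4 * 4 * 8) ** (1 / 4)

# Same geometric mean through the log form: exp of the mean of the logs.
m_g_logs = math.exp(mean(math.log(x) for x in sample))

print(x_bar, med, mo, m_g, m_g_logs)
```

The last two values agree up to floating-point rounding, which is the identity between the product form and the exponential form of the geometric mean.
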
5. Efficient estimators. The BLUE (Best Linear Unbiased Estimator) is defined as follows: (1) it is a linear function of all the sample values; (2) it is unbiased ($E(\hat{X}_n) = \theta$); (3) it has the smallest variance among all linear unbiased estimators. The sample mean is BLUE for the parameter µ. Some estimators are biased but consistent: an estimator is consistent when it converges in probability to the true parameter as $n \to \infty$, so a bias that vanishes as n grows does not prevent consistency.
6. Point estimators - cases. Normal samples: $\hat{X}_n$ is the BLUE for the parameter µ (mean). Bernoulli samples, $f(x) = \rho^x (1-\rho)^{1-x}$: $\hat{X}_n$ is an unbiased estimator for the parameter ρ (frequency). Poisson samples, $f(x) = \frac{e^{-k} k^x}{x!}$: $\hat{X}_n$ is an unbiased estimator for the parameter k (which represents both the mean and the variance of the distribution). Exponential samples, $f(x) = \lambda e^{-\lambda x}$: $1/\hat{X}_n$ is an estimator for the parameter λ (the density at value 0); it is consistent, although biased in finite samples.
7. Confidence interval theory. With point estimators we use only one value to infer about the population. With a confidence interval we define a minimum and a maximum value between which we expect the population parameter to lie. Formally, we need to calculate $\mu_1 = \hat{X}_n - z \cdot \frac{\sigma}{\sqrt{n}}$ and $\mu_2 = \hat{X}_n + z \cdot \frac{\sigma}{\sqrt{n}}$, and we end up with the interval $\mu = \{\mu_1; \mu_2\}$. Here $\hat{X}_n$ is the sample mean; z is the upper (or lower) critical value of the theoretical distribution; σ is the standard deviation of the theoretical distribution; n is the sample size. (See the graph)
8. Confidence interval theory - Gaussian. We make some assumptions about what we might find in an experiment and find the resulting confidence interval using a normal distribution. Let us assume that the sample mean is 5, the population standard deviation is known and equal to 2, and the sample size is n = 20. In this example we use a 95 per cent confidence level and wish to find the confidence interval. N.B. Here, since the confidence level is 95 per cent, the z (the critical value) to consider is the quantile at which the CDF equals 0.975 (i.e. qnorm(0.975), not dnorm, which is the density). We can also speak of α = 0.05, or 1 − α = 0.95, or 1 − α/2 = 0.975.
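
The example above (sample mean 5, known population standard deviation 2, n = 20, 95 per cent level) can be sketched in Python. The lab itself works in R, so this is an illustrative translation rather than the lab's code: `NormalDist().inv_cdf` plays the role of R's `qnorm`, and the function name `normal_ci` is introduced here.

```python
from math import sqrt
from statistics import NormalDist

def normal_ci(x_bar, sigma, n, conf_level=0.95):
    """Confidence interval for the mean when the population sd is known."""
    alpha = 1.0 - conf_level
    # Critical value: the quantile at which the standard normal CDF is 1 - alpha/2.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

lower, upper = normal_ci(x_bar=5.0, sigma=2.0, n=20)
print(round(lower, 4), round(upper, 4))
```

The interval comes out at roughly 4.12 to 5.88; raising `conf_level` widens it because the critical value grows.
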
9. Confidence interval theory - T-student. We use the Student t distribution when n is small and the population sd is unknown. We need to use an estimate of the standard deviation based on the sample: $\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \hat{X}_n)^2}{n-1}}$. The Student t distribution is more spread out. In simple words, since we do not know the population sd, we need wider intervals (a cautious approach). The only difference from the normal case is that we use the command associated with the t distribution rather than the normal distribution. Here we repeat the procedure above, but we assume that we are working with a sample standard deviation rather than an exact standard deviation. N.B. The t distribution is characterized by its degrees of freedom. In this case the degrees of freedom are equal to n − 1, because we use 1 estimate (1 constraint).
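
A rough sketch of the same interval with the t distribution follows, reusing the numbers of the Gaussian case (mean 5, standard deviation 2 now treated as a sample estimate, n = 20). Python's standard library has no t quantile function, so the critical value is a tabulated constant assumed here; in R it would come from `qt(0.975, 19)`. The names `t_ci` and `T_CRIT_975_DF19` are hypothetical.

```python
from math import sqrt

# Assumed tabulated value, not computed here: 97.5th percentile of the
# Student t distribution with 19 degrees of freedom (about 2.093).
T_CRIT_975_DF19 = 2.093

def t_ci(x_bar, sample_sd, n, t_crit):
    """Confidence interval for the mean when the sd is estimated from the sample."""
    margin = t_crit * sample_sd / sqrt(n)
    return x_bar - margin, x_bar + margin

lower, upper = t_ci(x_bar=5.0, sample_sd=2.0, n=20, t_crit=T_CRIT_975_DF19)
print(round(lower, 3), round(upper, 3))
```

Because 2.093 exceeds the normal critical value of about 1.96, the interval is wider than the Gaussian one, which is the cautious behaviour described above.
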
10. Confidence interval theory - comparison of two means. In some cases we have an experiment of the kind called (for example) case-control. Imagine the population split in 2: one is the treated group, the other is the non-treated group. Suppose we extract a sample from each, with the aim of testing whether the two samples come from populations with the same mean parameter (is the treatment effective?). The output of this test will be a confidence interval for the difference between the two means. N.B. Here, the degrees of freedom of the t distribution are equal to $\min(n_1, n_2) - 1$.
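11. Formulas. Gaussian confidence interval: $\mu = \{\mu_1, \mu_2\} = \hat{X}_n \pm z \cdot \frac{\sigma}{\sqrt{n}}$. T-student confidence interval: $\mu = \{\mu_1, \mu_2\} = \hat{X}_n \pm t_{n-1} \cdot \frac{\hat{\sigma}}{\sqrt{n}}$. T-student confidence interval for the difference of two samples: $\mu_{diff} = \{\mu_{diff_1}, \mu_{diff_2}\} = (\hat{X}_1 - \hat{X}_2) \pm t_{n-1} \cdot sd$, with $n = \min(n_1, n_2)$, where $sd = \sqrt{\frac{sd_1 \cdot sd_1}{n_1} + \frac{sd_2 \cdot sd_2}{n_2}}$. Gaussian confidence interval for a proportion (Bernoulli distribution): $\rho = \{\rho_1, \rho_2\} = \hat{f} \pm z \cdot sd$, where $sd = \sqrt{\frac{\rho(1-\rho)}{n}}$, with ρ replaced by the sample frequency $\hat{f}$ in practice.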
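
The proportion formula can be sketched as follows. The counts (40 successes out of 100) are hypothetical, the function name `proportion_ci` is introduced here, and the unknown ρ in the standard deviation is replaced by the observed frequency, which is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, conf_level=0.95):
    """Gaussian confidence interval for a Bernoulli proportion."""
    f_hat = successes / n                      # sample frequency
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Plug-in standard deviation: rho is estimated by f_hat.
    sd = sqrt(f_hat * (1.0 - f_hat) / n)
    return f_hat - z * sd, f_hat + z * sd

# Hypothetical counts, for illustration only.
lower, upper = proportion_ci(successes=40, n=100)
print(round(lower, 3), round(upper, 3))
```
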
12. Hypothesis testing. Researchers retain or reject a hypothesis based on measurements of observed samples. The decision is often based on a statistical mechanism called hypothesis testing. A type I error is the mishap of rejecting the null hypothesis when it is in fact true. The probability of committing a type I error is called the significance level of the hypothesis test, and is denoted by the Greek letter α (the same used in the confidence intervals). We demonstrate the procedure of hypothesis testing in R first with the intuitive critical value approach. Then we discuss the popular (and very quick) p-value approach as an alternative.
13. Hypothesis testing - lower tail. The null hypothesis of the lower tail test of the population mean can be expressed as follows: $\mu \ge \mu_0$, where $\mu_0$ is a hypothesized lower bound of the true population mean µ. Let us define the test statistic z in terms of the sample mean, the sample size and the population standard deviation σ: $z = \frac{\hat{X}_n - \mu_0}{\sigma / \sqrt{n}}$. Then the null hypothesis of the lower tail test is to be rejected if $z \le z_{\alpha}$, where $z_{\alpha}$ is the 100(α) percentile of the standard normal distribution.
14. Hypothesis testing - upper tail. The null hypothesis of the upper tail test of the population mean can be expressed as follows: $\mu \le \mu_0$, where $\mu_0$ is a hypothesized upper bound of the true population mean µ. Let us define the test statistic z in terms of the sample mean, the sample size and the population standard deviation σ: $z = \frac{\hat{X}_n - \mu_0}{\sigma / \sqrt{n}}$. Then the null hypothesis of the upper tail test is to be rejected if $z \ge z_{1-\alpha}$, where $z_{1-\alpha}$ is the 100(1 − α) percentile of the standard normal distribution.
15. Hypothesis testing - two tailed. The null hypothesis of the two-tailed test of the population mean can be expressed as follows: $\mu = \mu_0$, where $\mu_0$ is a hypothesized value of the true population mean µ. Let us define the test statistic z in terms of the sample mean, the sample size and the population standard deviation σ: $z = \frac{\hat{X}_n - \mu_0}{\sigma / \sqrt{n}}$. Then the null hypothesis of the two-tailed test is to be rejected if $z \le z_{\alpha/2}$ or $z \ge z_{1-\alpha/2}$, where $z_{\alpha/2}$ is the 100(α/2) percentile and $z_{1-\alpha/2}$ the 100(1 − α/2) percentile of the standard normal distribution.
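
The three known-variance tests differ only in the rejection rule, so they can be sketched with one helper. This is an illustration, not the lab's R code: the function name `z_test`, its `tail` argument and the sample figures (mean 9900, µ0 = 10000, σ = 120, n = 30) are all assumptions made here. It returns both the critical-value decision and the p-value.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, alpha=0.05, tail="two"):
    """z test for the mean with known population sd.

    Returns (z, reject, p_value) for tail in {"lower", "upper", "two"}.
    """
    std_normal = NormalDist()
    z = (x_bar - mu0) / (sigma / sqrt(n))
    if tail == "lower":       # H0: mu >= mu0, reject if z <= z_alpha
        reject = z <= std_normal.inv_cdf(alpha)
        p_value = std_normal.cdf(z)
    elif tail == "upper":     # H0: mu <= mu0, reject if z >= z_(1-alpha)
        reject = z >= std_normal.inv_cdf(1.0 - alpha)
        p_value = 1.0 - std_normal.cdf(z)
    elif tail == "two":       # H0: mu = mu0, reject if |z| >= z_(1-alpha/2)
        reject = abs(z) >= std_normal.inv_cdf(1.0 - alpha / 2.0)
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'lower', 'upper' or 'two'")
    return z, reject, p_value

# Hypothetical figures, for illustration only.
z, reject, p_value = z_test(x_bar=9900, mu0=10000, sigma=120, n=30, tail="lower")
print(round(z, 4), reject, p_value)
```

The two-tailed branch uses the symmetry of the normal distribution: $z \le z_{\alpha/2}$ or $z \ge z_{1-\alpha/2}$ is the same condition as $|z| \ge z_{1-\alpha/2}$. In each case the p-value falls below α exactly when the critical-value rule rejects.
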
16. Hypothesis testing - lower tail with unknown variance. The null hypothesis of the lower tail test of the population mean can be expressed as follows: $\mu \ge \mu_0$, where $\mu_0$ is a hypothesized lower bound of the true population mean µ. Let us define the test statistic t in terms of the sample mean, the sample size and the sample standard deviation $\hat{\sigma}$: $t = \frac{\hat{X}_n - \mu_0}{\hat{\sigma} / \sqrt{n}}$. Then the null hypothesis of the lower tail test is to be rejected if $t \le t_{\alpha}$, where $t_{\alpha}$ is the 100(α) percentile of the Student t distribution with n − 1 degrees of freedom.
17. Hypothesis testing - upper tail with unknown variance. The null hypothesis of the upper tail test of the population mean can be expressed as follows: $\mu \le \mu_0$, where $\mu_0$ is a hypothesized upper bound of the true population mean µ. Let us define the test statistic t in terms of the sample mean, the sample size and the sample standard deviation $\hat{\sigma}$: $t = \frac{\hat{X}_n - \mu_0}{\hat{\sigma} / \sqrt{n}}$. Then the null hypothesis of the upper tail test is to be rejected if $t \ge t_{1-\alpha}$, where $t_{1-\alpha}$ is the 100(1 − α) percentile of the Student t distribution with n − 1 degrees of freedom.
18. Hypothesis testing - two tailed with unknown variance. The null hypothesis of the two-tailed test of the population mean can be expressed as follows: $\mu = \mu_0$, where $\mu_0$ is a hypothesized value of the true population mean µ. Let us define the test statistic t in terms of the sample mean, the sample size and the sample standard deviation $\hat{\sigma}$: $t = \frac{\hat{X}_n - \mu_0}{\hat{\sigma} / \sqrt{n}}$. Then the null hypothesis of the two-tailed test is to be rejected if $t \le t_{\alpha/2}$ or $t \ge t_{1-\alpha/2}$, where $t_{\alpha/2}$ is the 100(α/2) percentile and $t_{1-\alpha/2}$ the 100(1 − α/2) percentile of the Student t distribution with n − 1 degrees of freedom.
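
When the variance is unknown, the statistic is built from the sample standard deviation. The sketch below computes t from raw data; the ten data values and µ0 = 10 are hypothetical. The critical value is a tabulated constant assumed here, because Python's standard library has no t quantile function (in R it would be `qt(0.975, 9)`).

```python
from math import sqrt
from statistics import mean, stdev

# Assumed tabulated value, not computed here: 97.5th percentile of the
# Student t distribution with 9 degrees of freedom (about 2.262).
T_CRIT_975_DF9 = 2.262

def t_statistic(sample, mu0):
    """t = (sample mean - mu0) / (sample sd / sqrt(n)), sd with n - 1 denominator."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

# Hypothetical measurements, for illustration only.
sample = [9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0, 10.5, 10.1]

t = t_statistic(sample, mu0=10.0)
# Two-tailed rule at alpha = 0.05: reject if |t| >= t_(1 - alpha/2).
reject_two_tailed = abs(t) >= T_CRIT_975_DF9
print(round(t, 4), reject_two_tailed)
```

Here t is about 1.22, inside the acceptance region, so the null hypothesis µ = 10 is not rejected at the 5 per cent level.
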
19. Lower Tail Test of Population Proportion. The null hypothesis of the lower tail test about the population proportion can be expressed as follows: $\rho \ge \rho_0$, where $\rho_0$ is a hypothesized lower bound of the true population proportion ρ. Let us define the test statistic z in terms of the sample proportion $\hat{\rho}$ and the sample size: $z = \frac{\hat{\rho} - \rho_0}{\sqrt{\rho_0 (1 - \rho_0) / n}}$. Then the null hypothesis of the lower tail test is to be rejected if $z \le z_{\alpha}$, where $z_{\alpha}$ is the 100(α) percentile of the standard normal distribution.
20. Upper Tail Test of Population Proportion. The null hypothesis of the upper tail test about the population proportion can be expressed as follows: $\rho \le \rho_0$, where $\rho_0$ is a hypothesized upper bound of the true population proportion ρ. Let us define the test statistic z in terms of the sample proportion $\hat{\rho}$ and the sample size: $z = \frac{\hat{\rho} - \rho_0}{\sqrt{\rho_0 (1 - \rho_0) / n}}$. Then the null hypothesis of the upper tail test is to be rejected if $z \ge z_{1-\alpha}$, where $z_{1-\alpha}$ is the 100(1 − α) percentile of the standard normal distribution.
21. Two Tailed Test of Population Proportion. The null hypothesis of the two-tailed test about the population proportion can be expressed as follows: $\rho = \rho_0$, where $\rho_0$ is a hypothesized value of the true population proportion. Let us define the test statistic z in terms of the sample proportion $\hat{\rho}$ and the sample size: $z = \frac{\hat{\rho} - \rho_0}{\sqrt{\rho_0 (1 - \rho_0) / n}}$. Then the null hypothesis of the two-tailed test is to be rejected if $z \le z_{\alpha/2}$ or $z \ge z_{1-\alpha/2}$.
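
A minimal sketch of the proportion test statistic, applied to a lower tail test. The counts (85 successes out of 148) and the hypothesized bound ρ0 = 0.6 are hypothetical, and the function name `proportion_z` is introduced here.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(successes, n, p0):
    """z statistic for a test about a population proportion."""
    p_hat = successes / n
    # The standard deviation uses the hypothesized proportion p0, not p_hat.
    return (p_hat - p0) / sqrt(p0 * (1.0 - p0) / n)

alpha = 0.05
z = proportion_z(successes=85, n=148, p0=0.6)   # hypothetical counts

# Lower tail test, H0: rho >= rho0. Reject if z <= z_alpha.
z_alpha = NormalDist().inv_cdf(alpha)
reject_lower = z <= z_alpha
print(round(z, 4), round(z_alpha, 4), reject_lower)
```

The statistic (about −0.64) does not fall below the critical value (about −1.64), so the null hypothesis is retained. The upper and two-tailed versions change only the comparison on the last lines.
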
22. Sample size definition. The quality of a sample survey can be improved (worsened) by increasing (decreasing) the sample size. The formulas below provide the sample size needed to obtain an interval estimate at the (1 − α) confidence level with margin of error E, given a planned value for the parameter. Here, $z_{1-\alpha/2}$ is the 100(1 − α/2) percentile of the standard normal distribution. For the mean: $n = \frac{z_{1-\alpha/2}^2 \, \sigma^2}{E^2}$. For the proportion: $n = \frac{z_{1-\alpha/2}^2 \, \rho (1 - \rho)}{E^2}$.
23. Sample size definition - Exercises. Mean: Assume the population standard deviation σ of the student height in the survey is 9.48. Find the sample size needed to achieve a 1.2 centimeters margin of error at the 95 per cent confidence level. Since there are two tails of the normal distribution, the 95 per cent confidence level implies the 97.5th percentile of the normal distribution at the upper tail. Therefore, $z_{1-\alpha/2}$ is given by qnorm(.975). Proportion: Using a 50 per cent planned proportion estimate, find the sample size needed to achieve a 5 per cent margin of error for the female student survey at the 95 per cent confidence level. As before, $z_{1-\alpha/2}$ is given by qnorm(.975).
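
Both exercises can be checked with a short sketch. The lab uses R's `qnorm(.975)`; here `NormalDist().inv_cdf(0.975)` stands in for it, and the function names are introduced for illustration. Rounding up to the next whole unit is an assumption about how the result would be used.

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean(sigma, margin, conf_level=0.95):
    """n = z^2 * sigma^2 / E^2 for estimating a mean."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    return z ** 2 * sigma ** 2 / margin ** 2

def sample_size_proportion(p, margin, conf_level=0.95):
    """n = z^2 * p * (1 - p) / E^2 for estimating a proportion."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    return z ** 2 * p * (1.0 - p) / margin ** 2

n_mean = sample_size_mean(sigma=9.48, margin=1.2)     # student height exercise
n_prop = sample_size_proportion(p=0.5, margin=0.05)   # female student exercise

# A sample size must be a whole number, so round up.
print(ceil(n_mean), ceil(n_prop))
```
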
24. Homework. 1: Confidence interval for the proportion. Suppose we have a sample of n = 25 births, 15 of which are female. Define the interval (at 99 per cent) for the proportion of females in the population. HINT: Apply, with the proper functions in R, the formula in slide 11. 2: Hypothesis test to compare two proportions. Suppose we have two schools. Sampling from the first, n = 20 and the Hispanic students are 8. Sampling from the second, n = 18 and the Hispanic students are 4. Can we state (at 95 per cent) that the frequency of Hispanics is the same in the two schools? N.B.: the test here is two tailed. The test statistic here is $z = \frac{\hat{\rho}_1 - \hat{\rho}_2}{sd}$, where $\rho = \frac{\rho_1 n_1 + \rho_2 n_2}{n_1 + n_2}$ and $sd = \sqrt{\rho (1 - \rho) \left[\frac{1}{n_1} + \frac{1}{n_2}\right]}$.
25. Charts - 1. Figure: Representation of the critical point for the upper tail hypothesis test
26. Charts - 2. Figure: Representation of the critical point for the lower tail hypothesis test
27. Charts - 3. Figure: Representation of the critical point for the two-tailed hypothesis test
28. Charts - 4. Figure: Type I and Type II errors in hypothesis testing