# C2 st lecture 10 basic statistics and the z test handout

Published in: Education


1. Lecture 10 - Basic Statistics and the Z-test. C2 Foundation Mathematics (Standard Track). Dr Linda Stringer (l.stringer@uea.ac.uk) and Dr Simon Craik (s.craik@uea.ac.uk), INTO City/UEA London.
2. Lecture 9 skills. Calculate the following measures of location (averages): mode, median, mean. Calculate the following measures of dispersion (measures of spread): interquartile range, standard deviation, absolute deviation. Perform a Z-test: write the null and alternative hypotheses, look up the critical value, calculate the test statistic, make the decision, write a conclusion.
3. A data set. A data set is usually a list of values (numbers) that has been gathered in a survey. We will use the following data set to demonstrate the ideas in the first part of this lecture. A statistician wants to find how many pets the average person has. He interviews 10 people and gets the following values: 0 2 0 1 0 8 2 1 0 0
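4. Bar charts. A bar chart showing how many pets 10 people have: 0 2 0 1 0 8 2 1 0 0. [Figure: bar chart with one axis numbered 1 to 10 (the people interviewed) and the other showing the number of pets, with ticks from 0 to 8.]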
5. Pie charts. A pie chart of the data 0 2 0 1 0 8 2 1 0 0. [Figure: pie chart with slices 0 pets: 50%, 1 pet: 20%, 2 pets: 20%, 8 pets: 10%.]
6. Histogram. A histogram of the data 0 2 0 1 0 8 2 1 0 0, showing how many people have each number of pets. [Figure: histogram with bars at 0, 1, 2 and 8 pets; the frequency axis runs from 1 to 5.]
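
The frequencies behind the pie chart and the histogram can be tallied directly from the data. The following is a minimal sketch using Python's standard library; the variable names are our own and are not part of the lecture.

```python
from collections import Counter

pets = [0, 2, 0, 1, 0, 8, 2, 1, 0, 0]

# Count how many people reported each number of pets (histogram heights).
frequencies = Counter(pets)

# Convert each count to a percentage of the sample (pie chart slices).
percentages = {
    value: 100 * count / len(pets)
    for value, count in sorted(frequencies.items())
}

for value, share in percentages.items():
    print(f"{value} pets: {frequencies[value]} people ({share:.0f}%)")
```
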
7. Mode. In a data set the mode is the most frequent value (the value which occurs most often). The mode is a type of average. Example: Find the mode of the following data set: 0 2 0 1 0 8 2 1 0 0. In this data set the mode is 0.
8. Mode. There can be more than one mode in a data set. Example: 0 5 5 0 1 5 0 1 6. There are two modes, 0 and 5, which each occur three times.
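
As a quick check of the two mode examples, here is a short sketch using `statistics.multimode` from Python's standard library (Python 3.8 or later), which returns every value tied for the highest frequency. The variable names are illustrative only.

```python
from statistics import multimode

pets = [0, 2, 0, 1, 0, 8, 2, 1, 0, 0]
second_set = [0, 5, 5, 0, 1, 5, 0, 1, 6]

# multimode returns a list, so a data set with several modes is handled too.
modes_pets = multimode(pets)
modes_second = multimode(second_set)

print(modes_pets)
print(modes_second)
```
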
9. Median. The median is the middle value in an ordered data set. It is another type of average. First order the data, with values increasing from left to right. Let $n$ be the size of the data set (the number of values). If $\frac{n}{2}$ is an integer (whole number) then the median is the midpoint of the $\frac{n}{2}$th value and the $\left(\frac{n}{2}+1\right)$th value (to find the midpoint, add the values together and divide by 2). If $\frac{n}{2}$ is not an integer then round it up to the nearest integer, $\frac{n+1}{2}$. The median is the $\frac{n+1}{2}$th value. Alternatively, find the median by crossing off pairs of values, starting from the ends of the data set.
10. Example. Order the data: 0 0 0 0 0 1 1 2 2 8. Here $n = 10$ (the number of values) and $\frac{n}{2} = \frac{10}{2} = 5$, which is an integer. The median is the midpoint of the 5th and 6th values: $\frac{0+1}{2} = 0.5$.
11. Example 2. Order the data: 0 0 0 1 1 5 5 5 6. Here $n = 9$ (the number of values) and $\frac{n}{2} = \frac{9}{2} = 4.5$, which is not an integer. Round up to 5. The median is the 5th value, which is 1.
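
The median rule above can be sketched in a few lines of Python. The function name `median_by_rule` is our own; the only subtlety is that the slides count positions from 1 while Python lists count from 0.

```python
import math


def median_by_rule(values):
    """Median of a data set, following the n/2 rule described in the slides."""
    if not values:
        raise ValueError("the data set must contain at least one value")
    ordered = sorted(values)  # order the data, smallest first
    n = len(ordered)
    if n % 2 == 0:
        # n/2 is an integer: midpoint of the (n/2)th and (n/2 + 1)th values.
        k = n // 2
        return (ordered[k - 1] + ordered[k]) / 2
    # n/2 is not an integer: round up and take that value.
    k = math.ceil(n / 2)
    return ordered[k - 1]


median_pets = median_by_rule([0, 2, 0, 1, 0, 8, 2, 1, 0, 0])
median_second = median_by_rule([0, 5, 5, 0, 1, 5, 0, 1, 6])
```

For these two data sets the function reproduces the worked answers, 0.5 and 1.
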
12. Interquartile range. First order the data, with values increasing from left to right. We want to find two values: the first quartile $Q_1$ and the third quartile $Q_3$. Let $n$ be the size of the data set (the number of values). To find $Q_1$ we multiply $n$ by $\frac{1}{4}$. If $\frac{n}{4}$ is an integer (whole number) then $Q_1$ is the midpoint of the $\frac{n}{4}$th value and the $\left(\frac{n}{4}+1\right)$th value. If $\frac{n}{4}$ is not an integer then round it up to the nearest integer; $Q_1$ is the corresponding value. To find $Q_3$ we multiply $n$ by $\frac{3}{4}$. If $\frac{3n}{4}$ is an integer then $Q_3$ is the midpoint of the $\frac{3n}{4}$th value and the $\left(\frac{3n}{4}+1\right)$th value. If $\frac{3n}{4}$ is not an integer then round it up to the nearest integer; $Q_3$ is the corresponding value. The interquartile range is $Q_3 - Q_1$.
13. Example. Order the data: 0 0 0 0 0 1 1 2 2 8. $\frac{n}{4} = \frac{10}{4} = 2.5$, which is not an integer. Round up to 3. $Q_1$ is the third value, so $Q_1 = 0$. $\frac{3n}{4} = \frac{3 \times 10}{4} = 7.5$, which is not an integer. Round up to 8. $Q_3$ is the eighth value, so $Q_3 = 2$. The interquartile range is $Q_3 - Q_1 = 2 - 0 = 2$.
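
The quartile rule from the slides can be sketched as follows. The helper name `quartile_by_rule` is hypothetical. Note that library routines often use other quartile conventions and may give slightly different answers, so this sketch implements the lecture's rule directly rather than calling a library function.

```python
import math


def quartile_by_rule(values, k):
    """Return Q1 (k=1) or Q3 (k=3) using the rounding rule from the slides."""
    if k not in (1, 3):
        raise ValueError("k must be 1 (for Q1) or 3 (for Q3)")
    if not values:
        raise ValueError("the data set must contain at least one value")
    ordered = sorted(values)  # order the data, smallest first
    n = len(ordered)
    if (k * n) % 4 == 0:
        # k*n/4 is an integer: midpoint of that value and the next one.
        p = (k * n) // 4
        return (ordered[p - 1] + ordered[p]) / 2
    # Otherwise round up and take the corresponding value (1-based position).
    p = math.ceil(k * n / 4)
    return ordered[p - 1]


pets = [0, 2, 0, 1, 0, 8, 2, 1, 0, 0]
q1 = quartile_by_rule(pets, 1)
q3 = quartile_by_rule(pets, 3)
interquartile_range = q3 - q1
```
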
14. Sigma notation $\Sigma$. Given a data set $X$, we denote the sum of all the values $x$ in $X$ by $\sum x$. Example: If $X$ = 0 2 0 1 0 8 2 1 0 0 then $\sum x = 0 + 2 + 0 + 1 + 0 + 8 + 2 + 1 + 0 + 0 = 14$.
15. Mean. The mean is our third average. In a data set of size $n$ the mean, denoted $\bar{x}$, is the sum of all the values divided by $n$: $\bar{x} = \frac{\sum x}{n}$. Example: What is the mean number of pets? Calculate the sum of all the values and divide by $n$: $\bar{x} = \frac{\sum x}{n} = \frac{0 + 2 + 0 + 1 + 0 + 8 + 2 + 1 + 0 + 0}{10} = \frac{14}{10} = 1.4$.
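
In code, the sigma notation corresponds to a plain sum. A minimal sketch for the pets data (the variable names are ours):

```python
pets = [0, 2, 0, 1, 0, 8, 2, 1, 0, 0]

total = sum(pets)   # the sum of all the values, written Σx in the slides
n = len(pets)       # the size of the data set
mean = total / n    # x̄ = Σx / n

print(total, mean)
```
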
16. Standard deviation, $\sigma$. The standard deviation, $\sigma$, is a measure of dispersion. First calculate the variance, $\sigma^2$. The standard deviation, $\sigma$, is the square root of the variance. There are two formulae for variance. They give the same answer. Usually the second formula is easier to use. $\sigma^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - \bar{x}^2$. When you have found the variance, do not forget to take the square root: $\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$.
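17. Proof that the two formulae for the variance (and hence the standard deviation) are equivalent. Using $\frac{\sum x}{n} = \bar{x}$ and $\sum 1 = n$: $\sigma^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum (x^2 - 2x\bar{x} + \bar{x}^2)}{n} = \frac{\sum x^2}{n} - 2\bar{x}\,\frac{\sum x}{n} + \frac{\sum \bar{x}^2}{n} = \frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2\,\frac{\sum 1}{n} = \frac{\sum x^2}{n} - \bar{x}^2$.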
18. Example. What is the standard deviation of the following data? 0 2 0 1 0 8 2 1 0 0. Use the second formula to calculate the variance: $\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2$. We previously worked out the mean $\bar{x} = 1.4$. $\sum x^2 = 0^2 + 2^2 + 0^2 + 1^2 + 0^2 + 8^2 + 2^2 + 1^2 + 0^2 + 0^2 = 74$. The variance is $\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{74}{10} - 1.4^2 = 5.44$. The standard deviation is $\sigma = \sqrt{5.44} = 2.33$ to 2 d.p.
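
To see numerically that the two variance formulae agree, the following sketch evaluates both for the pets data. The variable names are our own. The standard library's `statistics.pstdev` also divides by $n$, so it should give the same standard deviation.

```python
import math
from statistics import pstdev

pets = [0, 2, 0, 1, 0, 8, 2, 1, 0, 0]
n = len(pets)
mean = sum(pets) / n

# First formula: the mean of the squared deviations from the mean.
variance_first = sum((x - mean) ** 2 for x in pets) / n

# Second formula: the mean of the squares minus the square of the mean.
variance_second = sum(x ** 2 for x in pets) / n - mean ** 2

# Do not forget to take the square root.
std_dev = math.sqrt(variance_second)

# Cross-check against the library's population standard deviation.
library_std_dev = pstdev(pets)

print(round(variance_first, 2), round(variance_second, 2), round(std_dev, 2))
```
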
19. Absolute value. The absolute value function gives the size of a number, ignoring its sign: $|x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}$. For example, $|5| = 5$, $|-8| = 8$, $|-1.213| = 1.213$ and $|1{,}000{,}000| = 1{,}000{,}000$.
20. Absolute deviation. The absolute deviation measures the average distance from each value to the mean. It is another measure of dispersion. As a formula: $AD = \frac{\sum |x - \bar{x}|}{n}$.
21. Example. What is the absolute deviation of the data 0 2 0 1 0 8 2 1 0 0? The mean is $\bar{x} = 1.4$. We first work out $|x - \bar{x}|$ for each value: 1.4 0.6 1.4 0.4 1.4 6.6 0.6 0.4 1.4 1.4. The absolute deviation is $AD = \frac{\sum |x - \bar{x}|}{n} = \frac{15.6}{10} = 1.56$.
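
The absolute deviation calculation can be sketched in the same way, using Python's built-in `abs` for the absolute value. The variable names are illustrative.

```python
pets = [0, 2, 0, 1, 0, 8, 2, 1, 0, 0]
n = len(pets)
mean = sum(pets) / n

# Distance of each value from the mean, |x - x̄|.
distances = [abs(x - mean) for x in pets]

# Average of those distances.
absolute_deviation = sum(distances) / n

print(round(sum(distances), 2), round(absolute_deviation, 2))
```
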
22. Hypothesis testing. We use hypothesis testing to compare the mean of a very large data set (a population mean) with the mean of a sample data set (a sample mean). Example: A lightbulb company says their lightbulbs last a mean time of 1000 hours with a standard deviation of 50. We think their lightbulbs last longer than this and propose a test at a 5% level of significance. We buy 75 lightbulbs and they last a mean time of 1022 hours. The population mean is 1000 hours. The sample is the 75 lightbulbs that we test. The sample mean is 1022 hours.
23. Hypothesis testing. The null hypothesis, $H_0$, is a statement which is assumed to be true. Sample data is collected and tested to see if it is consistent with the null hypothesis. If the sample mean is significantly different from the population mean, then we say that we have sufficient evidence to reject the null hypothesis, $H_0$, in favour of the alternative hypothesis, $H_1$.
24. The null hypothesis and the alternative hypothesis. The null hypothesis concerns the population mean. It is of the form $H_0 : \mu = A$, where $\mu$ is the population mean and $A$ is the hypothetical value. The alternative hypothesis is that the null hypothesis is incorrect and will be one of $H_1 : \mu \neq A$, $H_1 : \mu < A$ or $H_1 : \mu > A$. The wording of the question tells you which of these to use.
25. Significance level. The null hypothesis will always be tested to a given level of significance. A 5% level of significance means we are testing to see if the probability of getting sample data at least as extreme as ours, assuming the null hypothesis is true, is less than 0.05. If the probability is less, we reject the null hypothesis in favour of the alternative hypothesis. A 1% level of significance translates to a probability of 0.01.
26. Critical value. A critical value is the value beyond which we reject the null hypothesis. It tells us the boundary of the critical region(s). In a Z-test this depends on the alternative hypothesis and the significance level. We look up the critical value(s) in tables.

    | Significance level | One-tail critical value | Two-tail critical value |
    |---|---|---|
    | 5% | 1.65 | 1.96 |
    | 1% | 2.33 | 2.58 |
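
The tabulated critical values can be reproduced, as a sketch, from the standard normal distribution using `statistics.NormalDist` (Python 3.8 or later). The helper name `critical_value` is our own. The one-tail 5% value comes out as roughly 1.645, which the table quotes as 1.65.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def critical_value(significance, tails):
    """Positive critical value for a Z-test with 1 or 2 tails."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits the significance level between both tails.
    return standard_normal.inv_cdf(1 - significance / tails)


for significance in (0.05, 0.01):
    one_tail = critical_value(significance, 1)
    two_tail = critical_value(significance, 2)
    print(f"{significance:.0%}: one-tail {one_tail:.3f}, two-tail {two_tail:.3f}")
```

For a test with $H_1 : \mu < A$ the critical value is the negative of the one-tail value.
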
27. $H_1 : \mu \neq A$. If our alternative hypothesis is $H_1 : \mu \neq A$ we are doing a two-tailed test and we have 2 critical values, one negative and one positive. The critical values are the boundaries of the rejection regions. For a 5% level of significance the critical values are $-1.96$ and $1.96$. [Figure: normal curve with both tails shaded, below $-1.96$ and above $1.96$.] The rejection (shaded) regions have a combined area of 0.05.
28. $H_1 : \mu > A$. If our alternative hypothesis is $H_1 : \mu > A$ we are doing a one-tailed test and we have 1 critical value, which is positive. The critical value is the boundary of the rejection region. For a 5% level of significance the critical value is $1.65$. [Figure: normal curve with the right tail shaded, above $1.65$.] The rejection region has an area of 0.05.
29. $H_1 : \mu < A$. If our alternative hypothesis is $H_1 : \mu < A$ we are doing a one-tailed test and we have 1 critical value, which is negative. The critical value is the boundary of the rejection region. For a 5% level of significance the critical value is $-1.65$. [Figure: normal curve with the left tail shaded, below $-1.65$.] The rejection region has an area of 0.05.
30. Test statistic. The test statistic is the difference between the sample mean, $\bar{x}$, and the (hypothetical) population mean, $A$, divided by the standard error. The standard error is $\sigma/\sqrt{n}$ for the Z-test and $s/\sqrt{n}$ for the T-test, where $n$ is the sample size, $\sigma$ is the population standard deviation and $s$ is the sample standard deviation. The Z-test statistic is $Z = \frac{\bar{x} - A}{\sigma/\sqrt{n}}$. If the test statistic lies beyond the critical value(s) (in the rejection region) we reject $H_0$. If it does not, we accept $H_0$.
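
As an illustration of the formula, here is a small sketch that computes the Z-test statistic for the lightbulb example introduced earlier (population mean 1000, standard deviation 50, sample of 75 with mean 1022). The function and parameter names are our own.

```python
import math


def z_statistic(sample_mean, hypothesised_mean, population_sd, n):
    """Z = (x̄ - A) / (σ / √n)."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - hypothesised_mean) / standard_error


z_lightbulbs = z_statistic(sample_mean=1022, hypothesised_mean=1000,
                           population_sd=50, n=75)
print(round(z_lightbulbs, 2))
```
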
31. Z-test - Example 1. Research says that the mean height for a man is 182 cm with a standard deviation of 9 cm. We suspect men might be shorter than this. We get the heights of 100 men and their mean height is 176 cm. We test at a 1% level of significance.
32. Z-test - Example 1. The null hypothesis and alternative hypothesis are $H_0 : \mu = 182$ and $H_1 : \mu < 182$. We are doing a 1-tailed test at a 1% level of significance so the critical value is $C = -2.33$. The test statistic is $Z = \frac{176 - 182}{9/\sqrt{100}} = -6.67$. Since $-6.67 < -2.33$, we reject the null hypothesis.
33. Z-test - Example 2. A company says employees are supposed to work an average of 40 hours a week with a standard deviation of 5 hours. Alfred wants to know if he fits this to a 5% level of significance. He notes down how many hours he works over 48 weeks and has a mean of 39 hours.
34. Z-test - Example 2. The null hypothesis and alternative hypothesis are $H_0 : \mu = 40$ and $H_1 : \mu \neq 40$. We are doing a 2-tailed test at a 5% level of significance so the critical values are $C = -1.96$ and $C = 1.96$. The test statistic is $Z = \frac{39 - 40}{5/\sqrt{48}} = -1.39$. Since $-1.96 < -1.39 < 1.96$, we accept the null hypothesis.
35. Z-test - Example 3. A lightbulb company says their lightbulbs last a mean time of 1000 hours with a standard deviation of 50. We think their lightbulbs last longer than this and propose a test at a 5% level of significance. We buy 75 lightbulbs and they last a mean time of 1022 hours.
36. Z-test - Example 3. The null hypothesis and alternative hypothesis are $H_0 : \mu = 1000$ and $H_1 : \mu > 1000$. We are doing a 1-tailed test at a 5% level of significance so the critical value is $C = 1.65$. The test statistic is $Z = \frac{1022 - 1000}{50/\sqrt{75}} = 3.81$. Since $1.65 < 3.81$, we reject the null hypothesis.
37. Z-test summary. You will be given: (1) population mean, $A$; (2) population standard deviation, $\sigma$; (3) significance level; (4) sample mean, $\bar{x}$; (5) sample size, $n$; (6) quantifying word. You have to work out: (1) null hypothesis and alternative hypothesis; (2) critical value(s); (3) test statistic; (4) decision: accept or reject $H_0$ (sketch a picture if possible); (5) conclusion.
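
The steps in the summary can be strung together in a short sketch. The function name `z_test`, its parameters and the labels "less", "greater" and "not equal" for the alternative hypothesis are all our own choices; the critical values are copied from the table in the slides rather than computed.

```python
import math

# Critical values from the lecture's table, keyed by (significance level, tails).
CRITICAL_VALUES = {
    (0.05, 1): 1.65, (0.05, 2): 1.96,
    (0.01, 1): 2.33, (0.01, 2): 2.58,
}


def z_test(pop_mean, pop_sd, significance, sample_mean, n, alternative):
    """Return (test statistic, critical value, reject H0?) for a Z-test."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    if alternative == "not equal":      # H1: mu != A, two-tailed
        c = CRITICAL_VALUES[(significance, 2)]
        reject = abs(z) > c
    elif alternative == "greater":      # H1: mu > A, one-tailed
        c = CRITICAL_VALUES[(significance, 1)]
        reject = z > c
    elif alternative == "less":         # H1: mu < A, one-tailed
        c = -CRITICAL_VALUES[(significance, 1)]
        reject = z < c
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'not equal'")
    return z, c, reject


heights = z_test(182, 9, 0.01, 176, 100, "less")         # Example 1
hours = z_test(40, 5, 0.05, 39, 48, "not equal")         # Example 2
lightbulbs = z_test(1000, 50, 0.05, 1022, 75, "greater")  # Example 3

for name, (z, c, reject) in [("heights", heights), ("hours", hours),
                             ("lightbulbs", lightbulbs)]:
    decision = "reject H0" if reject else "accept H0"
    print(f"{name}: Z = {z:.2f}, critical value = {c}, {decision}")
```

The conclusion, written in words and in the context of the question, still has to be supplied by hand.
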
38. The theory behind the Z-test and the T-test. If samples of size $n$ are taken from a population with mean $A$ and standard deviation $\sigma$, then the sample means are (approximately, for large $n$) normally distributed, with mean $A$ and standard deviation $\sigma/\sqrt{n}$. When we calculate the test statistic, we are calculating the Z-score of the sample mean. The critical value is the Z-score that a sample mean has only a 5% (or 1%) probability of lying beyond when the null hypothesis is true. For further information, try a statistics book from the library, or the Khan Academy videos on YouTube.
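
The claim about sample means can be explored with a small simulation. This is only a sketch under stated assumptions: we assume a normally distributed population, borrow $A = 1000$, $\sigma = 50$ and $n = 75$ from the lightbulb example, and pick the number of samples and the random seed arbitrarily.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, fixed so that the run is repeatable

A, sigma, n = 1000, 50, 75   # values borrowed from the lightbulb example
number_of_samples = 2000     # arbitrary choice for this sketch

# Draw many samples of size n and record the mean of each one.
sample_means = [
    statistics.fmean(random.gauss(A, sigma) for _ in range(n))
    for _ in range(number_of_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
expected_sd = sigma / n ** 0.5   # the standard error, σ/√n

print(f"mean of sample means: {mean_of_means:.1f} (theory: {A})")
print(f"sd of sample means:   {sd_of_means:.2f} (theory: {expected_sd:.2f})")
```
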
39. Normal distribution, $X \sim N(\mu, \sigma^2)$. The normal distribution has probability density function $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\sigma$ is the population standard deviation and $\mu$ is the population mean. [Figure: graph of $f(x)$ for $\mu = 0$ and $\sigma = 1$, with $x$ from $-4$ to $4$ and $y$ from 0 to 0.5.] Probabilities correspond to areas under this curve.
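
As a final sketch, the density formula can be coded directly and compared with `statistics.NormalDist`, whose `cdf` method gives areas under the curve. The function name `normal_pdf` is our own. The areas computed here link back to the critical values used earlier: about 0.95 of the area lies between $-1.96$ and $1.96$, and about 0.05 lies above $1.65$.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x, written out from the formula."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)


standard_normal = NormalDist(mu=0, sigma=1)

# Height of the curve at its peak, x = 0.
peak = normal_pdf(0)

# Area between the two-tail 5% critical values.
central_area = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

# Area to the right of the one-tail 5% critical value.
upper_tail = 1 - standard_normal.cdf(1.65)

print(round(peak, 3), round(central_area, 3), round(upper_tail, 3))
```
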