
- 1. Basic concepts, Data visualization, Data summarization. Statistics and Data Analysis for Engineers, Part 1: Introduction and Descriptive Statistics. Ling-Chieh Kung, Department of Information Management, National Taiwan University. September 4, 2016. Introduction and Descriptive Statistics 1 / 62 Ling-Chieh Kung (NTU IM)
- 2. What is Statistics? Many things are unknown: consumers' tastes, the quality of a product, stock prices, the effectiveness of a new way of teaching/training. Statistics is the science of collecting, analyzing, interpreting, and presenting (numerical) data. Ultimate goal (of Business Statistics): better decision making. The study of Statistics includes: Descriptive Statistics; Probability; Inferential Statistics (estimation, hypothesis testing, and prediction). In summary: to estimate, test, and predict those unknowns.
- 3. My plan for today: Descriptive Statistics (visualization and summarization); Inferential Statistics ((Probability), hypothesis testing and p-value, regression analysis); case studies.
- 4. Road map: Basic concepts; Data visualization; Data summarization.
- 5. Populations vs. samples. A population is a collection of persons, objects, or items. A census investigates the whole population; sampling investigates only a subset of it, called a sample. We then use the information contained in the sample to infer ("guess") about the population. What are samples for the following populations? All students in NTU; all students in the business school; all chips made in one factory; all consumers who have bought iPhone 6. Two important questions: Why sampling? Is a sample representative?
- 6. Descriptive vs. inferential statistics. Descriptive statistics: graphical or numerical summaries of data; describing (visualizing or summarizing) a set of data. Inferential statistics: making a "scientific guess" about unknowns; trying to say something about the population. Which is descriptive and which is inferential? Calculating the average height of 1000 randomly selected NTU students, and using this number to estimate the average height of all NTU students. Another example (pharmaceutical research): all the potential patients form the population; a group of randomly selected patients is a sample; we use the result on the sample to infer the result on the population.
- 7. Parameters vs. statistics. A numerical summary of a population is a parameter: the average height of all NTU students; the expected coffee demand when the price is 50 NTD. A numerical summary of a sample is a statistic: the average height of all NTU male students; the average coffee demand when the price was 50 NTD in the past 6 days. People almost always use a statistic to infer a parameter. Some statistics are "good" while some are "bad."
- 8. Parameters vs. statistics: an example. What is the average height of all NTU students? While a census is possible, it is still quite costly. It is natural to sample some NTU students, calculate a statistic, and use that statistic to estimate the average height (the parameter). Some (good or bad) samples and statistics: the average height of all students in this classroom; the average height of 100 students randomly drawn from all students; the maximum height of 100 students randomly drawn from all students; the sum of heights of 100 students randomly drawn from all students; the average height of 60 male and 40 female students randomly drawn from the population.
- 9. Levels of data measurement. Most data we will play with are numerical. Numerical data may be categorized into three levels: nominal; ordinal; quantitative (interval or ratio).
- 10. Nominal level. A nominal scale classifies data into categories with no ranking. Data are labels or names used to identify an attribute of the element; the label may be numeric or non-numeric. Examples: laptop ownership (yes/no); citizenship (Taiwan/Japan/...); country code (886/86/1/...). Arithmetic operations cannot be applied to nominal data.
- 11. Ordinal level. An ordinal scale classifies data into categories with ranking: the order or rank of the data is meaningful, but differences between numerical labels do not imply distances. Examples: product satisfaction (satisfied, neutral, unsatisfied); professor rank (full, associate, assistant); ranking of scores (1, 2, 3, 4, ...). It is still not meaningful to do arithmetic on ordinal data: assistant + associate = full?! The grade difference between no. 1 and no. 5 may not equal that between no. 11 and no. 15.
- 12. Quantitative (interval and ratio) levels. An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point. A ratio scale is an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point. Ratio data appear more often in the world: heights, weights, income, prices. Interval data are actually rare: degrees in Celsius or Fahrenheit; GRE or GMAT scores. How about degrees in Kelvin?
- 13. Some remarks. Nominal and ordinal data are called qualitative data; interval and ratio data are called quantitative data. Most statistical methods are for quantitative data; some are for qualitative data. Distinguishing nominal and ordinal scales is important; distinguishing interval and ratio scales is not. Qualitative data are sometimes called categorical data, and quantitative data are sometimes called numeric data.
- 14. A short summary. Understand these terms: populations vs. samples; parameters vs. statistics; inferential statistics vs. descriptive statistics. For each scale of measurement, is it meaningful to compute a ranking or a distance? Nominal: ranking no, distance no. Ordinal: ranking yes, distance no. Quantitative: ranking yes, distance yes.
- 15. Road map: Basic concepts; Data visualization; Data summarization.
- 16. An example. For each day in 2011 and 2012, we record the number of daily rentals of the public bike rental system in Washington, D.C.: 985, 801, 1349, 1562, 1600, 1606, 1510, ..., 1341, 1796, and 2729. The smallest and largest numbers are 22 and 8714, respectively. How do we get some feeling for these 731 numbers? (Data: 2011/1/1: 985; 2011/1/2: 801; 2011/1/3: 1349; 2011/1/4: 1562; 2011/1/5: 1600; 2011/1/6: 1606; 2011/1/7: 1510; ...; 2012/12/29: 1341; 2012/12/30: 1796; 2012/12/31: 2729.)
- 17. Frequency distributions. The original 731 numbers form a set of ungrouped data. We start by grouping them into a frequency distribution: grouped data presented in the form of class intervals and frequencies. Let's create an intuitive frequency distribution.
- 18. Frequency distributions: an example. The resulting classes: class 1 is [0, 1000), i.e., 0 ≤ x < 1000; class 2 is [1000, 2000), i.e., 1000 ≤ x < 2000; class 3 is [2000, 3000); ...; class 8 is [7000, 8000); class 9 is [8000, 9000), i.e., 8000 ≤ x < 9000. How about [0, 999], [1000, 1999], etc.? How about (0, 1000], (1000, 2000], etc.?
- 19. Frequency distributions: an example. Then we count to get the frequency distribution: [0, 1000): 18; [1000, 2000): 80; [2000, 3000): 74; [3000, 4000): 107; [4000, 5000): 166; [5000, 6000): 106; [6000, 7000): 86; [7000, 8000): 82; [8000, 9000): 12. This is a set of grouped data. Some remarks: typically we have 5 to 15 classes; typically all classes have the same width; be aware of class endpoints! Classes should NOT overlap with each other. If there are outliers, they should be removed first.
- 20. Something more. We may add class midpoints, relative frequencies, and cumulative frequencies to the frequency table: [0, 1000): frequency 18, midpoint 500, relative frequency 2.46%, cumulative frequency 18; [1000, 2000): 80, 1500, 10.94%, 98; [2000, 3000): 74, 2500, 10.12%, 172; [3000, 4000): 107, 3500, 14.64%, 279; [4000, 5000): 166, 4500, 22.71%, 445; [5000, 6000): 106, 5500, 14.50%, 551; [6000, 7000): 86, 6500, 11.76%, 637; [7000, 8000): 82, 7500, 11.22%, 719; [8000, 9000): 12, 8500, 1.64%, 731. How about cumulative relative frequencies?
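The grouping procedure described above is mechanical enough to sketch in code. A minimal Python sketch (the function name `frequency_table` and the short sample list are illustrative, not from the slides; the real data set has 731 values):

```python
def frequency_table(data, width=1000, low=0, high=9000):
    """Group values into classes [low, low+width), ... and tabulate them."""
    n_classes = (high - low) // width
    freq = [0] * n_classes
    for x in data:
        if low <= x < high:              # lower endpoint inclusive, upper exclusive
            freq[(x - low) // width] += 1
    total = len(data)
    rows, cum = [], 0
    for k, f in enumerate(freq):
        lo = low + k * width
        cum += f
        rows.append({
            "interval": (lo, lo + width),
            "midpoint": lo + width / 2,  # class midpoint
            "freq": f,
            "rel_freq": f / total,       # relative frequency
            "cum_freq": cum,             # cumulative frequency
        })
    return rows

rentals = [985, 801, 1349, 1562, 1600, 1606, 1510, 2729]  # a few values from the slide
table = frequency_table(rentals)
```

With these eight values, [0, 1000) gets frequency 2, [1000, 2000) gets 5, and the cumulative frequency of the last class is 8.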
- 21. Histograms. A frequency distribution may be depicted as a histogram, which consists of a series of contiguous rectangles, each representing the frequency in a class. (Same intervals and frequencies as the table on the previous slide.)
- 22. Histograms. Histograms may be the most important type of data graph. One particular reason to draw histograms is to get some idea about the distribution: bell shape? M shape? Skewed? Any outliers? We will discuss distributions in more detail.
- 23. Frequency polygons. Alternatively, we may draw a frequency polygon by using line segments to connect dots plotted at class midpoints. The information contained in a frequency polygon is quite similar to that contained in a histogram.
- 24. Frequency polygons. It is more convenient to use a frequency polygon to compare multiple frequency distributions. Both: uni-modal and symmetric? 2011: bi-modal and skewed to the right (right-tailed)? 2012: uni-modal and skewed to the left (left-tailed)? Warning: people may misinterpret a frequency polygon as a line chart (which is for data with a time sequence).
- 25. Line charts. A line chart is useful for depicting a time series data set: a two-dimensional data set whose first dimension (the x-axis) consists of labels of time points. It visualizes how a quantity changes as time goes by, as illustrated with our monthly bike rentals.
- 26. Pie charts. A pie chart is a circular depiction of data where each slice represents the percentage of the corresponding category. It visualizes relative frequency distributions well. For our bike rental data set: what are the proportions of rentals in the four seasons? What are the proportions of rentals on the seven days of a week?
- 27. A pie chart for seasonal rentals. Winter (12/20-3/20): 471348 rentals, 14.3%; Spring (3/21-6/20): 918589, 27.9%; Summer (6/21-9/20): 1061129, 32.2%; Fall (9/21-12/20): 841613, 25.6%.
- 28. A pie chart for rentals among weekdays. Sunday: 444027; Monday: 455503; Tuesday: 469109; Wednesday: 473048; Thursday: 485395; Friday: 487790; Saturday: 477807.
- 29. Data not appropriate for pie charts. Pie charts are used to visualize proportions, i.e., subtotals over the overall total. They should not be used to compare averages. The total numbers of rentals made by male and female users are appropriate for a pie chart; the average numbers of rentals per male and per female user are not.
- 30. Bar charts. Pie charts are useful for visualizing the proportion of each category. For demonstrating the differences among categories, a bar chart is a better choice: the larger the category, the longer the bar. Some people draw bars vertically, some horizontally.
- 31. Bar charts. Let's replace the pie chart with a bar chart (same weekday totals: Sunday 444027; Monday 455503; Tuesday 469109; Wednesday 473048; Thursday 485395; Friday 487790; Saturday 477807). Note that the y-axis does not start at 0!
- 32. Bar charts vs. histograms. What differences distinguish a bar chart from a histogram? A bar chart uses noncontiguous bars to visualize categorical data; a histogram uses contiguous bars to visualize quantitative data.
- 33. Visualizing two variables. When we have data for two variables, typically we want to identify whether there is any relationship between them. Visualizing the data in a two-dimensional manner helps. When the two variables are both measured on quantitative scales, we may depict each observation as a point on a plane to create a scatter plot. For our bike rental example: how do monthly rentals in 2011 and those in 2012 relate to each other? How do daily casual and registered rentals relate to each other?
- 34. Monthly rentals in 2011 and 2012 (month: 2011 rentals, 2012 rentals). 1: 38189, 96744; 2: 48215, 103137; 3: 64045, 164875; 4: 94870, 174224; 5: 135821, 195865; 6: 143512, 202830; 7: 141341, 203607; 8: 136691, 214503; 9: 127418, 218573; 10: 123511, 198841; 11: 102167, 152664; 12: 87323, 123713.
- 35. Road map: Basic concepts; Data visualization; Data summarization.
- 36. Summarizing the data with numbers. Descriptive Statistics includes some common ways to describe data: summarization with numbers and visualization with graphs. This is always the first step of any data analysis project: to get intuitions that guide our directions. Here we talk about summarization: for a (large) set of numbers, we use a few numbers to summarize them. For a population, these numbers are parameters; for a sample, they are statistics. We will talk about three things: measures of central tendency for the center or middle part of the data; measures of variability for how variable the data are; measures of correlation for the relationship between two variables.
- 37. Medians. The median is the middle value in an ordered set of numbers; roughly speaking, half of the numbers are below it and half are above it. Suppose there are N numbers. If N is odd, the median is the ((N+1)/2)th smallest number; if N is even, the median is the average of the (N/2)th and the (N/2 + 1)th smallest numbers. For example, the median of {1, 2, 4, 5, 6, 8, 9} is 5, and the median of {1, 2, 4, 5, 6, 8} is (4+5)/2 = 4.5.
- 38. Medians. A median is unaffected by the magnitude of extreme values: the median of {1, 2, 4, 5, 6, 8, 9} is 5, and the median of {1, 2, 4, 5, 6, 8, 900} is still 5. Medians may be calculated from quantitative or ordinal data, but not from nominal data. Unfortunately, a median uses only part of the information contained in the numbers: for quantitative data, a median treats them as merely ordinal.
- 39. Means. The mean is the average of a set of data and can be calculated only from quantitative data. The mean of {1, 2, 4, 5, 6, 8, 9} is (1 + 2 + 4 + 5 + 6 + 8 + 9) / 7 = 5. A mean uses all the information contained in the numbers. Unfortunately, a mean is affected by extreme values: the mean of {1, 2, 4, 5, 6, 8, 900} is (1 + 2 + 4 + 5 + 6 + 8 + 900) / 7 ≈ 132.29! Using the mean and median simultaneously can be a good idea. We should try to identify outliers (extreme values that seem "strange") before calculating a mean (or any statistic).
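The two examples on this slide can be checked with Python's standard `statistics` module; this is a sketch, not part of the original slides:

```python
import statistics

data = [1, 2, 4, 5, 6, 8, 9]
print(statistics.mean(data), statistics.median(data))  # both are 5

# Replacing 9 with 900 drags the mean far to the right
# but leaves the median untouched.
skewed = [1, 2, 4, 5, 6, 8, 900]
print(round(statistics.mean(skewed), 2))  # 132.29
print(statistics.median(skewed))          # still 5
```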
- 40. Population means vs. sample means. Let {x_i}, i = 1, ..., N, be a population with N as the population size. The population mean is µ ≡ (∑_{i=1}^{N} x_i) / N. Let {x_i}, i = 1, ..., n, be a sample with n < N as the sample size. The sample mean is x̄ ≡ (∑_{i=1}^{n} x_i) / n. The notations µ and x̄ are used throughout the statistics world.
- 41. Population means vs. sample means. µ ≡ (∑_{i=1}^{N} x_i) / N and x̄ ≡ (∑_{i=1}^{n} x_i) / n. Aren't these two means the same? From the perspective of calculation, yes; from the perspective of statistical inference, no. Typically the population mean is fixed but unknown, while the sample mean is random: we may get different values of x̄ today and tomorrow. To start from x̄ and use inferential statistics to estimate or test µ, we need to apply probability.
- 42. Quartiles and percentiles. The median lies at the middle of the data; the first quartile lies at the middle of the first half of the data; the third quartile lies at the middle of the second half of the data. For the pth percentile, p% of the values are below it and (100 − p)% of the values are above it. Median, quartiles, and percentiles: the 25th percentile is the first quartile; the 50th percentile is the median (and the second quartile); the 75th percentile is the third quartile.
- 43. Modes. The mode(s) is (are) the most frequently occurring value(s) in a set of qualitative data. In the set {A, A, A, B, B, C, D, E, F, F, F, G, H}, the modes are A and F; the frequency of the modes (A and F) is 3. Though this definition may also be applied to quantitative data, it is sometimes useless: in many cases, all values are modes! For quantitative data, we instead look for the modal class(es).
- 44. Modal classes. In a baseball team, players' heights (in cm) are: 178, 172, 175, 184, 172, 175, 165, 178, 177, 175, 180, 182, 177, 183, 180, 178, 179, 162, 170, 171. For the classes [160, 165), [165, 170), ..., and [185, 190), the modal class is [175, 180). We sometimes say the mode of this set is 177.5. The way of grouping matters!
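Finding the modal class is a small counting exercise; a sketch using the slide's height data (the function name `modal_class` is illustrative):

```python
from collections import Counter

heights = [178, 172, 175, 184, 172, 175, 165, 178, 177, 175,
           180, 182, 177, 183, 180, 178, 179, 162, 170, 171]

def modal_class(data, low=160, width=5):
    """Return the (lo, hi) class interval containing the most values."""
    counts = Counter((x - low) // width for x in data)
    k = max(counts, key=counts.get)  # index of the most frequent class
    return (low + k * width, low + (k + 1) * width)

print(modal_class(heights))  # (175, 180), whose midpoint is 177.5
```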
- 45. Variability. Measures of variability describe the spread or dispersion of a set of data. They are especially important when two sets of data have the same center.
- 46. Ranges and interquartile ranges. The range of a set of data {x_i}, i = 1, ..., N, is the difference between the maximum and minimum values, i.e., max_i {x_i} − min_i {x_i}. The interquartile range of a set of data is the difference between the third and first quartiles. It is the range of the middle 50% of the data, and it excludes the effects of extreme values.
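Both measures are easy to compute; a sketch with the stdlib (note that quartile conventions differ between textbooks and libraries, so the interquartile range below follows the stdlib's "inclusive" interpolation rule, which may not match the "middle of each half" rule on the previous slide exactly):

```python
import statistics

data = [1, 2, 4, 5, 6, 8, 9]
rng = max(data) - min(data)  # range: 9 - 1 = 8

# quantiles(n=4) returns the three quartiles; the "inclusive" method
# interpolates within the sorted data.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(rng, iqr)
```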
- 47. Deviations from the mean. Consider a set of population data {x_i}, i = 1, ..., N, with mean µ. Intuitively, a way to measure dispersion is to examine how each number deviates from the mean. For x_i, the deviation from the population mean is defined as x_i − µ; for a sample, the deviation from the sample mean is x_i − x̄. Example (mean 5): x_1 = 1, deviation 1 − 5 = −4; x_2 = 2, deviation −3; x_3 = 4, deviation −1; x_4 = 5, deviation 0; x_5 = 6, deviation 1; x_6 = 8, deviation 3; x_7 = 9, deviation 4.
- 48. Mean deviations. May we combine the N deviations into a single number that summarizes the aggregate deviation? Intuitively, we may sum them up and calculate the mean deviation: (∑_{i=1}^{N} (x_i − µ)) / N. Is it always 0? (For the example above, the deviations −4, −3, −1, 0, 1, 3, and 4 indeed sum to 0.)
- 49. Adjusting mean deviations. People use two ways to adjust mean deviations: the mean absolute deviation/error (MAD), (∑_{i=1}^{N} |x_i − µ|) / N, and the mean squared deviation/error (variance or MSE), (∑_{i=1}^{N} (x_i − µ)²) / N. A larger MAD or variance means that the data are more disperse. For the example (mean 5): the |d_i| values are 4, 3, 1, 0, 1, 3, 4 with mean 2.29, and the d_i² values are 16, 9, 1, 0, 1, 9, 16 with mean 7.43.
- 50. MAD vs. variance. The main difference: an MAD puts the same weight on all values, while a variance puts more weight on extreme values. They may rank dispersion differently. Set 1: {0, 4, 5, 6, 10} (mean 5) has deviations −5, −1, 0, 1, 5, so MAD = 2.4 and variance = 10.4. Set 2: {1, 2, 5, 8, 9} (mean 5) has deviations −4, −3, 0, 3, 4, so MAD = 2.8 and variance = 10. In general, people use variances more than MADs, but MADs are still popular in some areas, e.g., demand forecasting. It is the analyst's discretion to choose the appropriate one.
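The two measures and the rank reversal on this slide can be reproduced directly; a sketch (the helper name `mad_and_var` is illustrative):

```python
def mad_and_var(data):
    """Mean absolute deviation and (population) variance around the mean."""
    mu = sum(data) / len(data)
    mad = sum(abs(x - mu) for x in data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mad, var

print(mad_and_var([0, 4, 5, 6, 10]))  # (2.4, 10.4)
print(mad_and_var([1, 2, 5, 8, 9]))   # (2.8, 10.0)
```

The first set has the larger variance but the smaller MAD, so the two measures disagree about which set is more disperse.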
- 51. Standard deviations. One drawback of variances is that the unit of measurement is the square of the original one: for the baseball team, the variance of member heights is 34.05 cm². What does that mean?! People take the square root of a variance to get a standard deviation: the standard deviation of member heights is √34.05 ≈ 5.84 cm. A standard deviation typically has more managerial implications.
- 52. Population vs. sample variances. Recall that the population and sample means are µ ≡ (∑_{i=1}^{N} x_i) / N and x̄ ≡ (∑_{i=1}^{n} x_i) / n, respectively; formula-wise there is no difference. However, the population and sample variances are σ² ≡ (∑_{i=1}^{N} (x_i − µ)²) / N and s² ≡ (∑_{i=1}^{n} (x_i − x̄)²) / (n − 1), respectively. Note the difference between N and n − 1! The population and sample standard deviations are their square roots, σ and s. The notations σ², σ, s², and s are used throughout the statistics world.
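The N vs. n − 1 distinction is built into the stdlib: `pvariance` divides by N and `variance` divides by n − 1. A sketch on the deviation example used earlier:

```python
import statistics

data = [1, 2, 4, 5, 6, 8, 9]
pop_var = statistics.pvariance(data)  # sum of squared deviations / N
smp_var = statistics.variance(data)   # sum of squared deviations / (n - 1)
print(round(pop_var, 2), round(smp_var, 2))  # 7.43 vs. 8.67
```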
- 53. Coefficient of variation. The coefficient of variation is the ratio of the standard deviation to the mean: σ / µ. When will you use coefficients of variation?
- 54. z-scores. Consider a set of sample data {x_i}, i = 1, ..., n, with sample mean x̄ and sample standard deviation s. For x_i, the z-score is z_i = (x_i − x̄) / s. For a set of population data {x_i}, i = 1, ..., N, with population mean µ and population standard deviation σ, the z-score of x_i is z_i = (x_i − µ) / σ. A value's z-score measures how many standard deviations it deviates from the mean.
- 55. z-scores vs. outliers. One common rule for detecting outliers is to double-check whether x_i is an outlier if |z_i| = |x_i − µ| / σ > 3, since it is quite rare for a value's z-score to have so large a magnitude. For sample data, use (x_i − x̄) / s. Some people propose using the median and MAD in a similar way: double-check whether x_i is an outlier if |x_i − median| / MAD > 3. (The "MAD" here can be the mean absolute deviation from the mean, the mean absolute deviation from the median, the median absolute deviation from the median, etc.) These rules only suggest investigating some extreme values again; they are neither sufficient nor necessary conditions for outliers.
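A sketch of both rules on the {1, 2, 4, 5, 6, 8, 900} example from the earlier slide (the MAD variant chosen here is the median absolute deviation from the median, one of the options listed above). It also illustrates why the z-score rule is not a necessary condition: the extreme value inflates s so much that its own z-score stays below 3.

```python
import statistics

data = [1, 2, 4, 5, 6, 8, 900]

# z-score rule (sample version): flag x if |x - xbar| / s > 3.
xbar, s = statistics.mean(data), statistics.stdev(data)
z_flagged = [x for x in data if abs(x - xbar) / s > 3]

# median/MAD rule, with MAD = median absolute deviation from the median.
med = statistics.median(data)
mad = statistics.median([abs(x - med) for x in data])
mad_flagged = [x for x in data if abs(x - med) / mad > 3]

print(z_flagged)    # [] -- 900 masks itself by inflating s
print(mad_flagged)  # [900]
```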
- 56. Correlation. Consider the size of a house (in m²) and its price (in $1000) in a city: (75, 315), (59, 229), (85, 355), (65, 261), (72, 234), (46, 216), (107, 308), (91, 306), (75, 289), (65, 204), (88, 265), (59, 195). How do we measure/describe the correlation (linear relationship) between the two variables?
- 57. Intuition. Consider a set of paired data {(x_i, y_i)}, i = 1, ..., N. When one variable goes up, does the other tend to go up or down? More precisely, if x_i is larger than µ_x (the mean of the x_i's), is it more likely to see y_i > µ_y or y_i < µ_y? If the variables tend to be large together, we say they have a positive correlation; if one goes up when the other goes down, there is a negative correlation.
- 58. Covariances. We define the covariance of a set of two-dimensional (sample) data as s_xy ≡ (∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ)) / (n − 1). If most points fall in the first and third quadrants (relative to the means), most (x_i − x̄)(y_i − ȳ) will be positive and s_xy tends to be positive; otherwise, s_xy tends to be negative. The covariance of house size and price is 617.16. Is that large or small? It depends on how variable the two variables themselves are.
- 59. Pearson's correlation coefficient. To remove the variability of each variable itself, we define the sample correlation coefficient as r ≡ s_xy / (s_x s_y), where s_x and s_y are the sample standard deviations of the x_i's and y_i's. In our example, r = 617.16 / (16.78 × 50.45) ≈ 0.729. It can be shown that we always have −1 ≤ r ≤ 1: r > 0 means positive correlation, r = 0 no correlation, and r < 0 negative correlation. People often judge the degree of correlation based on |r|: 0 ≤ |r| < 0.25, weak; 0.25 ≤ |r| < 0.5, moderately weak; 0.5 ≤ |r| < 0.75, moderately strong; 0.75 ≤ |r| ≤ 1, strong.
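The formula above translates directly into code; a sketch using the house-size/price data from the earlier slide (the function name `pearson_r` is illustrative):

```python
def pearson_r(xs, ys):
    """Sample correlation coefficient r = s_xy / (s_x * s_y)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    sx = (sum((x - xbar) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - ybar) ** 2 for y in ys) / (n - 1)) ** 0.5
    return sxy / (sx * sy)

sizes  = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
prices = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]
print(round(pearson_r(sizes, prices), 3))  # about 0.729: moderately strong
```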
- 60. Correlation vs. independence. A correlation coefficient only measures how one variable linearly depends on the other. (The slide shows two scatter plots with clear nonlinear relationships, one with r = 0.5973 and one with r = 0.) Being uncorrelated does not mean being independent!
- 61. Correlation vs. causation. A correlation coefficient only measures whether two variables correlate with each other. High correlation does not imply causation: does A cause B, or B cause A? Does C cause both A and B? Or is it just by chance?
- 62. Correlation of qualitative variables. Sometimes the variables are not quantitative/numeric. For ordinal data, we calculate Spearman's rank correlation; for nominal data, we calculate Cramér's V.
- 63. Sampling, Sampling distributions, Hypothesis testing, p-value, t test, and more. Statistics and Data Analysis for Engineers, Part 2: Hypothesis Testing and p-value. Ling-Chieh Kung, Department of Information Management, National Taiwan University. September 4, 2016. Hypothesis Testing and p-value 1 / 71 Ling-Chieh Kung (NTU IM)
- 64. Road map: Sampling; Sampling distributions; Hypothesis testing; p-value, t test, and more.
- 65. Random vs. nonrandom sampling. Sampling is the process of selecting a subset of entities from the whole population. Sampling can be random or nonrandom. If random, whether an entity is selected is probabilistic: e.g., randomly select 1000 phone numbers from the telephone book and then call them. If nonrandom, it is deterministic: e.g., ask all your classmates for their preferences on iOS/Android. Most statistical methods are only for random sampling. Some popular random sampling techniques: simple random sampling; stratified random sampling; cluster (or area) random sampling.
- 66. Simple random sampling. In simple random sampling, each entity has the same probability of being selected. The good part of simple random sampling is its simplicity. However, it may result in nonrepresentative samples: there is some possibility that too many of the sampled entities fall in the same stratum, i.e., share the same property. E.g., it is possible that all randomly sampled voters are younger than 40; the sample is thus nonrepresentative. How do we fix this problem?
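The phone-book example can be mimicked with the stdlib's `random.sample`, which draws without replacement with every entity equally likely (the population of ID numbers below is made up for illustration):

```python
import random

random.seed(2016)                         # fixed seed so the sketch is reproducible
population = list(range(100000))          # hypothetical phone-book entries
sample = random.sample(population, 1000)  # a simple random sample of 1000 entries
print(len(sample), len(set(sample)))      # 1000 distinct entities
```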
- 67. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Stratiﬁed random sampling We may apply stratiﬁed random sampling. We ﬁrst split the whole population into several strata. Data in one stratum should be (relatively) homogeneous. Data in diﬀerent strata should be (relatively) heterogeneous. We then use simple random sampling for each stratum. Hypothesis Testing and p-value 5 / 71 Ling-Chieh Kung (NTU IM)
- 68. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Stratified random sampling As an example, suppose that we want to sample 40 out of 1000 graduates to understand the number of credits they got at school. Suppose that 100 students double majored; then we can split the whole population into two strata: the double-major stratum (size 100) and the no-double-major stratum (size 900). To sample 40 graduates, we sample 40 × 100/1000 = 4 from the double-major stratum and 36 from the other stratum. Hypothesis Testing and p-value 6 / 71 Ling-Chieh Kung (NTU IM)
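The proportional allocation on this slide is just arithmetic; a minimal sketch using the slide's numbers:

```python
# Proportional allocation for stratified sampling (sizes from the slide).
strata = {"double major": 100, "no double major": 900}
total_sample = 40
N = sum(strata.values())   # 1000 graduates in total
allocation = {name: total_sample * size // N for name, size in strata.items()}
print(allocation)   # {'double major': 4, 'no double major': 36}
```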
- 69. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Stratified random sampling We may further split the population into more strata. Double major: yes or no. Class: 1994-1998, 1999-2003, 2004-2008, or 2009-2012. This stratification makes sense only if students in different classes tend to take different numbers of units. Stratified random sampling is good at reducing sampling error. But it can be hard to identify a reasonable stratification. It is also more costly and time-consuming. Hypothesis Testing and p-value 7 / 71 Ling-Chieh Kung (NTU IM)
- 70. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Cluster (or area) random sampling Imagine that you are going to introduce a new product into all the retail stores in Taiwan. If the product is actually unpopular, an introduction with a large quantity will incur a huge loss. How to get an idea about the popularity? Typically we first try to introduce the product in a small area. We put the product on the shelves only in those stores in the specified area. This is the idea of cluster (or area) random sampling. Those consumers in the area form a sample. Hypothesis Testing and p-value 8 / 71 Ling-Chieh Kung (NTU IM)
- 71. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Cluster (or area) random sampling In cluster random sampling, we deﬁne clusters. We will only choose one or some clusters and then collect all the data in these clusters. If a cluster is too large, we may further split it into multiple second-stage clusters. Therefore, we want data in a cluster to be heterogeneous, and data across clusters somewhat homogeneous. For example, people may do cluster random sampling to understand the popularity of a new product. Those chosen cities (counties, states, etc.) are called test market cities (counties, states, etc.). People use cluster random sampling in this case because of its feasibility and convenience. We should select test market cities whose population proﬁles are similar to that of the entire country. Hypothesis Testing and p-value 9 / 71 Ling-Chieh Kung (NTU IM)
- 72. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Nonrandom sampling Sometimes we do nonrandom sampling. Convenience sampling. The researcher samples data that are easy to collect. Judgment sampling. The researcher decides whom to ask or what data to collect. Quota sampling. In each stratum, we use whatever method is convenient to fill the quota, a predetermined number of samples in the stratum. Snowball sampling. Once we ask one person, we ask her/him to suggest others. Nonrandom sampling cannot be analyzed by the statistical methods we introduce in this course. Hypothesis Testing and p-value 10 / 71 Ling-Chieh Kung (NTU IM)
- 73. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Road map Sampling. Sampling distributions. Hypothesis testing. p-value, t test, and more. Hypothesis Testing and p-value 11 / 71 Ling-Chieh Kung (NTU IM)
- 74. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Sampling distributions When we cannot examine the whole population, we study a sample. What will be contained in a random sample is unpredictable. We need to know the probability distribution of a sample so that we may connect the sample with the population. The probability distribution of a sample is a sampling distribution. Hypothesis Testing and p-value 12 / 71 Ling-Chieh Kung (NTU IM)
- 75. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Sampling distributions A factory produces bags of candies. Ideally, each bag should weigh 2 kg. As the production process cannot be perfect, a bag of candies should weigh between 1.8 and 2.2 kg. Let X be the weight of a bag of candies. Let µ and σ be its expected value and standard deviation. Is µ = 2? Is 1.8 < µ < 2.2? How large is σ? Let’s sample: In a random sample of 1 bag of candies, suppose it weighs 2.1 kg. May we conclude that 1.8 < µ < 2.2? What if the average weight of 5 bags in a random sample is 2.1 kg? What if the sample size is 10, 50, or 100? What if the mean is 2.3 kg? We need to know the sampling distribution of those statistics (sample mean, sample standard deviation, etc.). Hypothesis Testing and p-value 13 / 71 Ling-Chieh Kung (NTU IM)
- 76. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Sample means The sample mean is one of the most important statistics. Definition 1 Let {Xi}i=1,...,n be a sample from a population. Then x̄ = (1/n) ∑i=1,...,n Xi is the sample mean. Sometimes we write x̄n to emphasize that the sample size is n. We assume that Xi and Xj are independent for all i ≠ j. This is fine if n ≪ N, i.e., we sample a few items from a large population. In practice, we require n ≤ 0.05N. Hypothesis Testing and p-value 14 / 71 Ling-Chieh Kung (NTU IM)
- 77. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Means and variances of sample means Suppose the population mean and variance are µ and σ², respectively. These two numbers are fixed. A sample mean x̄ is a random variable. It has its expected value E[x̄], variance Var(x̄), and standard deviation √Var(x̄). These numbers are all fixed. They are also denoted as µx̄, σ²x̄, and σx̄, respectively. For any population, we have the following theorem: Proposition 1 (Mean and variance of a sample mean) Let {Xi}i=1,...,n be a size-n random sample from a population with mean µ and variance σ². Then µx̄ = µ, σ²x̄ = σ²/n, and σx̄ = σ/√n. Hypothesis Testing and p-value 15 / 71 Ling-Chieh Kung (NTU IM)
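A quick simulation (not part of the slides) illustrates Proposition 1 with a deliberately non-normal population; numpy is assumed:

```python
# Verify E[xbar] = mu and sd(xbar) = sigma / sqrt(n) by simulation.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 25, 100_000
# Exponential population with scale 2: mu = sigma = 2.
xbars = rng.exponential(scale=2.0, size=(reps, n)).mean(axis=1)

print(xbars.mean())   # close to mu = 2
print(xbars.std())    # close to sigma / sqrt(n) = 0.4
```

Note the proposition needs nothing about the population's shape; only its mean and variance enter the formulas.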
- 78. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Means and variances of sample means Do the terms confuse you? The sample mean vs. the mean of the sample mean. The sample variance vs. the variance of the sample mean. By definition, they are: x̄ = (1/n) ∑i=1,...,n Xi, a random variable. E[x̄], a constant. s² = (1/(n−1)) ∑i=1,...,n (Xi − x̄)², a random variable. Var(x̄), a constant. The sample variance also has its mean and variance. Hypothesis Testing and p-value 16 / 71 Ling-Chieh Kung (NTU IM)
- 79. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example: Quality inspection The weight of a bag of candies follows a normal distribution with mean µ = 2 and standard deviation σ = 0.2. Suppose the quality control officer decides to sample 4 bags and calculate the sample mean x̄. She will punish me if x̄ ∉ [1.8, 2.2]. Note that my production process is actually “good”: µ = 2. Unfortunately, it is not perfect: σ > 0. We may still be punished (if we are unlucky) even though µ = 2. What is the probability that I will be punished? We want to calculate 1 − Pr(1.8 < x̄ < 2.2). We know that µx̄ = µ = 2 and σx̄ = σ/√4 = 0.1. But we do not know the probability distribution of x̄! Hypothesis Testing and p-value 17 / 71 Ling-Chieh Kung (NTU IM)
- 80. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Sampling from a normal population If the population is normal, the sample mean is also normal! Proposition 2 Let {Xi}i=1,...,n be a size-n random sample from a normal population with mean µ and standard deviation σ. Then x̄ ∼ ND(µ, σ/√n). We already know that µx̄ = µ and σx̄ = σ/√n. This is true regardless of the population distribution. When the population is normal, the sample mean will also be normal. Hypothesis Testing and p-value 18 / 71 Ling-Chieh Kung (NTU IM)
- 81. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example revisited: Quality inspection The weight of a bag of candies follows a normal distribution with mean µ = 2 and standard deviation σ = 0.2. Suppose the quality control officer decides to sample 4 bags and calculate the sample mean x̄. She will punish me if x̄ ∉ [1.8, 2.2]. What is the probability that I will be punished? The distribution of the sample mean x̄ is ND(2, 0.1). Pr(x̄ < 1.8) + Pr(x̄ > 2.2) ≈ 0.045. Hypothesis Testing and p-value 19 / 71 Ling-Chieh Kung (NTU IM)
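The slide's 0.045 can be checked directly; a sketch assuming scipy is available:

```python
# Under mu = 2, the sample mean of 4 bags is N(2, 0.1); compute
# Pr(xbar < 1.8) + Pr(xbar > 2.2), the probability of being punished.
from scipy.stats import norm

p = norm.cdf(1.8, loc=2, scale=0.1) + norm.sf(2.2, loc=2, scale=0.1)
print(round(p, 4))   # 0.0455, i.e., about 0.045 as on the slide
```

Since both cutoffs sit two standard errors from the mean, this is just 2(1 − Φ(2)).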
- 82. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Adjusting the standard deviation When the population is ND(µ = 2, σ = 0.2) and the sample size is n = 4, the probability of punishment is 0.045. If we adjust our standard deviation σ (by paying more or less attention to the production process), the probability will change. Reducing σ reduces the probability of being punished. With the sampling distribution of ¯x, we may optimize σ. An improvement from 0.2 to 0.15 is helpful; from 0.15 to 0.1 is not. Hypothesis Testing and p-value 20 / 71 Ling-Chieh Kung (NTU IM)
- 83. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Adjusting the sample size When the population is ND(2, 0.2) and the sample size is n = 4, the probability of punishment is 0.045. If the quality control oﬃcer increases the sample size n, the probability will decrease. µ = 2 is actually ideal. A larger sample size makes the oﬃcer less likely to make a mistake. Hypothesis Testing and p-value 21 / 71 Ling-Chieh Kung (NTU IM)
- 84. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Distribution of the sample mean So now we have one general conclusion: When we sample from a normal population, the sample mean is also normal. And its mean and standard deviation are µ and σ/√n, respectively. What if the population is non-normal? Fortunately, we have a very powerful theorem, the central limit theorem, which applies to any population. Hypothesis Testing and p-value 22 / 71 Ling-Chieh Kung (NTU IM)
- 85. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Central limit theorem The theorem says that a sample mean is approximately normal when the sample size is large enough. Proposition 3 (Central limit theorem) Let {Xi}i=1,...,n be a size-n random sample from a population with mean µ and standard deviation σ. Let x̄n be the sample mean. If σ < ∞, then x̄n converges to ND(µ, σ/√n) as n → ∞. How large is “large enough”? In practice, typically n ≥ 30 is believed to be large enough. Hypothesis Testing and p-value 23 / 71 Ling-Chieh Kung (NTU IM)
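A simulation sketch (assuming numpy and scipy) shows the theorem in action: sample means from a heavily skewed population become more and more symmetric as n grows.

```python
# Skewness of sample means from an exponential population fades as n grows,
# as the central limit theorem promises.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps = 100_000
skews = {n: stats.skew(rng.exponential(size=(reps, n)).mean(axis=1))
         for n in (2, 30, 200)}
print(skews)   # roughly 2 / sqrt(n): about 1.41, 0.37, 0.14
```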
- 86. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Road map Sampling. Sampling distributions. Hypothesis testing. p-value, t test, and more. Hypothesis Testing and p-value 24 / 71 Ling-Chieh Kung (NTU IM)
- 87. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Hypothesis testing How do scientists (physicists, chemists, etc.) do research? Observe phenomena. Make hypotheses. Test the hypotheses through experiments (or other methods). Make conclusions about the hypotheses. Social scientists and business researchers do the same thing with hypothesis testing. One of the most important techniques of statistical inference. A technique for (statistically) proving things. Relying on sampling distributions. Hypothesis Testing and p-value 25 / 71 Ling-Chieh Kung (NTU IM)
- 88. Sampling Sampling distributions Hypothesis testing p-value, t test, and more People ask questions In the business (or social science) world, people ask questions: Are older workers more loyal to a company? Does the newly hired CEO enhance our profitability? Is one candidate preferred by more than 50% of voters? Do teenagers eat fast food more often than adults? Is the quality of our products stable enough? How should we answer these questions? Statisticians suggest: First make a hypothesis. Then test it with samples and statistical methods. Hypothesis Testing and p-value 26 / 71 Ling-Chieh Kung (NTU IM)
- 89. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Statistical hypotheses A statistical hypothesis is a formal way of stating a hypothesis. Typically it is a mathematical description of parameters to test. It contains two parts: The null hypothesis (denoted as H0). The alternative hypothesis (denoted as Ha or H1). The alternative hypothesis is: The thing that we want (need) to prove. The conclusion that can be made only if we have a strong evidence. The null hypothesis corresponds to a default position. We ﬁrst assume that the null hypothesis is correct. Then we collect sample data. If under the null hypothesis it is quite unlikely to see our observed result, we claim that the null hypothesis is wrong. Hypothesis Testing and p-value 27 / 71 Ling-Chieh Kung (NTU IM)
- 90. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Statistical hypotheses: example 1 In our factory, we produce packs of candy whose average weight should be 1 kg. One day, a consumer told us that his pack only weighs 900 g. We need to know whether this is just a rare event or our production system is out of control. If (we believe) the system is out of control, we need to shut down the machine and spend two days on inspection and maintenance. This will cost us at least $100,000. So we should not believe that our system is out of control just because of one complaint. What should we do? Hypothesis Testing and p-value 28 / 71 Ling-Chieh Kung (NTU IM)
- 91. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Statistical hypotheses: example 1 We first state a hypothesis: “Our production system is under control.” Then we ask: Is there strong enough evidence showing that the hypothesis is wrong, i.e., the system is out of control? Initially, we assume that our system is under control. Then we do a survey to see if we have strong enough evidence. We shut down machines only if we can “prove” that the system is indeed out of control. Let µ be the average weight; the statistical hypothesis is H0 : µ = 1 vs. Ha : µ ≠ 1. Hypothesis Testing and p-value 29 / 71 Ling-Chieh Kung (NTU IM)
- 92. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Statistical hypotheses: example 2 In our society, we adopt the presumption of innocence. One is considered innocent until proven guilty. So when there is a person who probably stole some money: H0 : The person is innocent Ha : The person is guilty. There are two possible errors: One is guilty but we think she/he is innocent. One is innocent but we think she/he is guilty. Which one is more critical? It is unacceptable that an innocent person is considered guilty. We will say one is guilty only if there is a strong evidence. Hypothesis Testing and p-value 30 / 71 Ling-Chieh Kung (NTU IM)
- 93. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Statistical hypotheses: example 3 Consider the following hypothesis: “The candidate is preferred by more than 50% of voters.” As we need a default position, and the percentage that we care about is 50%, we will choose our null hypothesis as H0 : p = 0.5. p is the population proportion of voters preferring the candidate. More precisely, let Xi = 1 if voter i prefers this candidate and 0 otherwise, i = 1, ..., N; then p = (1/N) ∑i=1,...,N Xi. How about the alternative hypothesis? Should it be Ha : p > 0.5 or Ha : p < 0.5? Hypothesis Testing and p-value 31 / 71 Ling-Chieh Kung (NTU IM)
- 94. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Statistical hypotheses: example 3 The choice of the alternative hypothesis depends on the related decisions or actions to make. Suppose one will go for the election only if she thinks she will win (i.e., p > 0.5), the alternative hypothesis will be Ha : p > 0.5. Suppose one tends to participate in the election and will give up only if the chance is slim, the alternative hypothesis will be Ha : p < 0.5. The alternative hypothesis is “the thing we want (need) to prove.” Hypothesis Testing and p-value 32 / 71 Ling-Chieh Kung (NTU IM)
- 95. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Two types of errors Type-1 error (false positive): Rejecting a true null hypothesis. There is nothing, but we say there is one. Type-2 error (false negative): Not rejecting a false null hypothesis. There is something, but we do not see it. Hypothesis Testing and p-value 33 / 71 Ling-Chieh Kung (NTU IM)
- 97. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Remarks We want to control the chances for us to make these mistakes. Unfortunately, we cannot control both. We choose to control the probability of a type-1 error. The choice of the default position is important. For setting up a statistical hypothesis: Our default position will be put in the null hypothesis. The thing we want to prove (i.e., the thing that needs a strong evidence) will be put in the alternative hypothesis. For writing the mathematical statement: The equal sign (=) will always be put in the null hypothesis. The alternative hypothesis contains an unequal sign or a strict inequality: ≠, >, or <. The direction of the alternative hypothesis, when it is an inequality, depends on the context. Hypothesis Testing and p-value 35 / 71 Ling-Chieh Kung (NTU IM)
- 98. Sampling Sampling distributions Hypothesis testing p-value, t test, and more One-tailed tests and two-tailed tests If the alternative hypothesis contains an unequal sign (≠), the test is a two-tailed test. If it contains a strict inequality (> or <), the test is a one-tailed test. Suppose we want to test the value of the population mean. In a two-tailed test, we test whether the population mean significantly deviates from a hypothesized value. We do not care whether it is larger than or smaller than. In a one-tailed test, we test whether the population mean significantly deviates from a hypothesized value in a specific direction. Hypothesis Testing and p-value 36 / 71 Ling-Chieh Kung (NTU IM)
- 99. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The first example: a two-tailed test Let’s test the average weight (in g) of our products. H0 : µ = 1000 vs. Ha : µ ≠ 1000. The variance of the product weights is σ² = 40000 g². The case with unknown σ² will be discussed later. A random sample has been collected. Suppose the sample size n = 100. Suppose the sample mean x̄ = 963. How to make a conclusion? Hypothesis Testing and p-value 37 / 71 Ling-Chieh Kung (NTU IM)
- 100. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Controlling the error probability All we can do is to collect a random sample and make our conclusion based on the observed sample. It is natural that we may be wrong when we claim µ ≠ 1000. We want to control the error probability. Let α be the maximum probability for us to make this error. α is called the significance level. 1 − α is called the confidence level. Target: If µ = 1000, our sampling and testing process will make us claim that µ ≠ 1000 with probability at most α. Hypothesis Testing and p-value 38 / 71 Ling-Chieh Kung (NTU IM)
- 101. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Rejection rule Now let’s test with the significance level α = 0.05. Intuitively, if X̄ deviates from 1000 a lot, we should reject the null hypothesis and believe that µ ≠ 1000. If µ = 1000, it is very unlikely to observe such a large deviation. So such a large deviation provides strong evidence. So we start by sampling and calculating the sample mean. We want to construct a rejection rule: If |X̄ − 1000| > d, we reject H0. We need to calculate d. Hypothesis Testing and p-value 39 / 71 Ling-Chieh Kung (NTU IM)
- 102. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Rejection rule We want a distance d such that if H0 is true, the probability of rejecting H0 is at most 5%, i.e., Pr(|X̄ − 1000| > d | µ = 1000) ≤ 0.05. The smallest d that satisfies the above inequality requires Pr(|X̄ − 1000| > d) = 0.05. Consider X̄: We know σ = 200 and n = 100. We assume that µ = 1000. Thanks to the central limit theorem, X̄ ∼ ND(1000, 20). We then solve Pr(|X̄ − 1000| > d) = 0.05. Hypothesis Testing and p-value 40 / 71 Ling-Chieh Kung (NTU IM)
- 103. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Rejection rule: the critical value According to X̄ ∼ ND(1000, 20), Pr(|X̄ − 1000| > 39.2) = 0.05. The rejection region is R = (−∞, 960.8) ∪ (1039.2, ∞). If x̄ falls in the rejection region, we reject H0. Hypothesis Testing and p-value 41 / 71 Ling-Chieh Kung (NTU IM)
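The 39.2 is just the 97.5th normal percentile scaled by the standard error; a check assuming scipy:

```python
# Two-tailed rejection region at alpha = 0.05 when xbar ~ N(1000, 20) under H0.
from scipy.stats import norm

alpha = 0.05
d = norm.ppf(1 - alpha / 2) * 20    # z_{0.975} * sigma_xbar = 1.96 * 20
print(round(d, 1))                  # 39.2
print(1000 - d, 1000 + d)           # thresholds, about 960.8 and 1039.2
```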
- 104. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Rejection rule: the critical value Because x̄ = 963 ∉ R, we cannot reject H0. The deviation from 1000 is not large enough. The evidence is not strong enough. Hypothesis Testing and p-value 42 / 71 Ling-Chieh Kung (NTU IM)
- 105. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Rejection rule: the critical value In this example, the two values 960.8 and 1039.2 are the critical values for rejection. If the sample mean is more extreme than one of the critical values, we reject H0. Otherwise, we do not reject H0. x̄ = 963 is not strong enough to support Ha: µ ≠ 1000. Concluding statement: Because the sample mean does not lie in the rejection region, we cannot reject H0. With a 95% confidence level, there is no strong evidence showing that the average weight is not 1000 g. Therefore, we should not shut down machines to do an inspection. Hypothesis Testing and p-value 43 / 71 Ling-Chieh Kung (NTU IM)
- 106. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Summary We want to know whether the machine is out of control. If the machine is actually good, we do not want to reach a conclusion that requires inspection and maintenance. We will do the inspection only if we have strong evidence suggesting that µ ≠ 1000. We want to know whether H0 is false, i.e., whether µ ≠ 1000. We control the probability of making a wrong conclusion. We should not reject H0 if it is true. We limit the probability at α = 5%. We will conclude that H0 is false if X̄ falls in the rejection region. The calculation of the critical values is based on the normal distribution, which can always be transformed to the z distribution. This is called a z test. Hypothesis Testing and p-value 44 / 71 Ling-Chieh Kung (NTU IM)
- 107. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Not rejecting vs. accepting We should be careful in writing our conclusions: Wrong: Because the sample mean does not lie in the rejection region, we accept H0. With a 95% conﬁdence level, there is a strong evidence showing that the average weight is 1000 g. Right: Because the sample mean does not lie in the rejection region, we cannot reject H0. With a 95% conﬁdence level, there is no strong evidence showing that the average weight is not 1000 g. Unable to prove one thing is false does not mean it is true! Hypothesis Testing and p-value 45 / 71 Ling-Chieh Kung (NTU IM)
- 108. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The first example (part 2) Suppose that we modify the hypothesis into a directional one:¹ H0 : µ = 1000. Ha : µ < 1000. We still have σ² = 40000, n = 100, and α = 0.05. This is a one-tailed test. Once we have strong evidence supporting Ha, we will claim that µ < 1000. We need to find a distance d such that Pr(1000 − X̄ > d | µ = 1000) = 0.05. ¹Some researchers write µ ≥ 1000 in this case. Hypothesis Testing and p-value 46 / 71 Ling-Chieh Kung (NTU IM)
- 109. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Rejection rule: the critical value For 0.05 = Pr(1000 − X̄ > d), we have d = 32.9. As the observed sample mean x̄ = 963 ∈ (−∞, 967.1), we reject H0. The deviation from 1000 is large enough. The evidence is strong enough. Hypothesis Testing and p-value 47 / 71 Ling-Chieh Kung (NTU IM)
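The one-tailed distance uses the 95th percentile instead of the 97.5th; a check assuming scipy:

```python
# One-tailed rejection rule at alpha = 0.05 when xbar ~ N(1000, 20) under H0.
from scipy.stats import norm

d = norm.ppf(0.95) * 20     # z_{0.95} * sigma_xbar = 1.645 * 20
critical = 1000 - d
print(round(critical, 1))   # 967.1
print(963 < critical)       # True: xbar = 963 falls in the rejection region
```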
- 110. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Rejection rule: the critical value In this example, 967.1 is the critical value for rejection. If the sample mean is more extreme than (in this case, below) the critical value, we reject H0. Otherwise, we do not reject H0. There is strong evidence supporting Ha: µ < 1000. Concluding statement: Because the sample mean lies in the rejection region, we reject H0. With a 95% confidence level, there is strong evidence showing that the average weight is less than 1000 g. Hypothesis Testing and p-value 48 / 71 Ling-Chieh Kung (NTU IM)
- 111. Sampling Sampling distributions Hypothesis testing p-value, t test, and more One-tailed tests vs. two-tailed tests When should we use a two-tailed test? We use a two-tailed test when we lack information about the direction. E.g., we suspect that the population mean has changed, but we have no idea about whether it has become larger or smaller. If we know or believe that the change is possible only in one direction, we may use a one-tailed test. Having more information (i.e., knowing the direction of change) makes rejection “easier,” i.e., easier to find strong enough evidence. Hypothesis Testing and p-value 49 / 71 Ling-Chieh Kung (NTU IM)
- 112. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Summary Distinguish the following pairs: One- and two-tailed tests. No evidence showing H0 is false and having evidence showing H0 is true. Not rejecting H0 and accepting H0. Using = and using ≥ or ≤ in the null hypothesis. Hypothesis Testing and p-value 50 / 71 Ling-Chieh Kung (NTU IM)
- 113. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Road map Sampling. Sampling distributions. Hypothesis testing. p-value, t test, and more. Hypothesis Testing and p-value 51 / 71 Ling-Chieh Kung (NTU IM)
- 114. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The p-value The p-value is an important, meaningful, and widely-adopted tool for hypothesis testing. Deﬁnition 2 For an observed value of a statistic in a statistical test, the p-value is the probability of observing a value that is more extreme than the observed value under the assumption that the null hypothesis is true. Calculated based on an observed value of the statistic. Is the tail probability of the observed value. Assuming that the null hypothesis is true. Hypothesis Testing and p-value 52 / 71 Ling-Chieh Kung (NTU IM)
- 115. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The p-value Mathematically: Suppose we test a population mean µ with a one-tailed test H0 : µ = 1000 vs. Ha : µ < 1000. Given an observed x̄, the p-value is defined as Pr(X̄ ≤ x̄). In the previous example, σ = 200, n = 100, α = 0.05, and x̄ = 963. If H0 is true, i.e., µ = 1000, we have Pr(X̄ ≤ 963) = 0.032. The p-value of x̄ is 0.032. Hypothesis Testing and p-value 53 / 71 Ling-Chieh Kung (NTU IM)
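The 0.032 is a single normal tail probability; a check assuming scipy:

```python
# p-value for the one-tailed test: Pr(Xbar <= 963) when mu = 1000 and
# sigma_xbar = 200 / sqrt(100) = 20.
from scipy.stats import norm

p_value = norm.cdf(963, loc=1000, scale=20)
print(round(p_value, 3))   # 0.032
```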
- 116. Sampling Sampling distributions Hypothesis testing p-value, t test, and more How to use the p-value? The p-value can be used for constructing a rejection rule. For a one-tailed test: If the p-value is smaller than α, we reject H0. If the p-value is greater than α, we do not reject H0. In our example, the one-tailed test is H0 : µ = 1000 Ha : µ < 1000. We have α = 0.05. Because the p-value 0.032 < 0.05, we reject H0. Hypothesis Testing and p-value 54 / 71 Ling-Chieh Kung (NTU IM)
- 117. Sampling Sampling distributions Hypothesis testing p-value, t test, and more p-values vs. critical values Using the p-value is equivalent to using the critical values. The rejection-or-not decision we make will be the same based on the two methods. Hypothesis Testing and p-value 55 / 71 Ling-Chieh Kung (NTU IM)
- 118. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The benefit of using the p-value In many studies, researchers do not determine the significance level α before a test is conducted. They calculate the p-value and then mark the significance of the result with stars. One typical way of assigning stars: a p-value in (0, 0.01] is highly significant (***); in (0.01, 0.05], moderately significant (**); in (0.05, 0.1], slightly significant (*); in (0.1, 1), insignificant (no mark). Hypothesis Testing and p-value 56 / 71 Ling-Chieh Kung (NTU IM)
- 119. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The size of a p-value Suppose one is testing whether people at different ages sleep for at least eight hours per day on average. Age groups: [10, 15), [15, 20), [20, 25), etc. For group i, a one-tailed test with Ha : µi > 8 is conducted. The result may be presented in a table: group 1, ages [10, 15), p-value 0.0002***; group 2, [15, 20), 0.2; group 3, [20, 25), 0.06*; group 4, [25, 30), 0.04**; group 5, [30, 35), 0.03**. A smaller p-value does NOT mean a larger deviation! We cannot conclude that µ5 > µ4, µ1 > µ3, etc. There are other tests for the difference between two population means. Hypothesis Testing and p-value 57 / 71 Ling-Chieh Kung (NTU IM)
- 120. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The p-value for two-tailed tests How to construct the rejection rule for a two-tailed test? If the p-value is smaller than α/2, we reject H0. If the p-value is greater than α/2, we do not reject H0. Consider the two-tailed test H0 : µ = 1000 vs. Ha : µ ≠ 1000. We have α = 0.05. Because the p-value 0.032 > α/2 = 0.025, we do not reject H0. Some researchers/books/software use another definition: The p-value for a two-tailed test is twice that for the corresponding one-tailed test. They then compare this p-value with α. Hypothesis Testing and p-value 58 / 71 Ling-Chieh Kung (NTU IM)
- 121. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Summary The p-value is the tail probability of the realized value of a statistic, assuming the null hypothesis is true. The p-value method is an alternative way of forming the rejection rule. It is equivalent to the critical-value method. The p-value is related to the probability for H0 to be false. It does not measure the magnitude of the deviation. Hypothesis Testing and p-value 59 / 71 Ling-Chieh Kung (NTU IM)
- 122. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The z test In example 1, basically we use the fact that X̄ ∼ ND(µ, σ/√n). This implies that (X̄ − µ)/(σ/√n) ∼ ND(0, 1), the so-called standard normal distribution, or the z distribution. Therefore, this test is called the z test. This requires knowledge of σ. Hypothesis Testing and p-value 60 / 71 Ling-Chieh Kung (NTU IM)
- 123. Sampling Sampling distributions Hypothesis testing p-value, t test, and more When the variance is unknown When the population variance σ² is unknown, the quantity (X̄ − µ)/(σ/√n) cannot be computed. What if we use the sample variance S² as a substitute? Proposition 4 For a normal population, the quantity T = (X̄ − µ)/(S/√n) follows the t distribution with n − 1 degrees of freedom. What is the t distribution? Hypothesis Testing and p-value 61 / 71 Ling-Chieh Kung (NTU IM)
- 124. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The t distribution The t distribution is defined as follows: Definition 3 A random variable X follows the t distribution with n degrees of freedom, denoted as X ∼ t(n), if f(x|n) = [Γ((n+1)/2) / (√(nπ) Γ(n/2))] (1 + x²/n)^(−(n+1)/2), for all x ∈ (−∞, ∞). Γ(x) = ∫0∞ z^(x−1) e^(−z) dz is the gamma function. Hypothesis Testing and p-value 62 / 71 Ling-Chieh Kung (NTU IM)
- 125. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The z and t distributions Let's compare Z = (X̄ − µ)/(σ/√n) and T = (X̄ − µ)/(S/√n). Because we do not know σ, we substitute S for it. Z ∼ ND(0, 1) and T ∼ t(n − 1). As the t distribution is a substitute for the z distribution, it is designed to also be centered at 0: E[T] = E[Z] = 0. However, as we add one more random variable into the formula (σ is a known constant while S is random), T is "more random" than Z, i.e., Var(T) > Var(Z). Graphically, t curves are flatter than the z curve. Fact: t(n) → ND(0, 1) as n → ∞. Hypothesis Testing and p-value 63 / 71 Ling-Chieh Kung (NTU IM)
- 126. Sampling Sampling distributions Hypothesis testing p-value, t test, and more [Figure: density curves of the z distribution and of t distributions with various degrees of freedom; the t curves are flatter with heavier tails.] Hypothesis Testing and p-value 64 / 71 Ling-Chieh Kung (NTU IM)
- 127. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The t test We use the t test to test the population mean when the population is normal and σ is unknown. If the sample size is large, we may still use the z distribution, with s substituting for σ. Hypothesis Testing and p-value 65 / 71 Ling-Chieh Kung (NTU IM)
- 128. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example 2 An MBA program seldom admits applicants with less than two years of work experience. To test whether the average work experience of admitted students is above two years, 20 admitted applicants are randomly selected, and their work experience prior to entering the program is recorded. They have an average work experience of 2.5 years; this is the sample mean. The sample standard deviation is 1.3765 years. The population is believed to be normal. The confidence level is set to 95%. Hypothesis Testing and p-value 66 / 71 Ling-Chieh Kung (NTU IM)
- 129. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example 2: hypothesis Suppose the one asking the question is a potential applicant with one year of work experience. He is pessimistic and will apply for the program only if the average work experience is proven to be less than two years. The hypothesis is H0: µ = 2 vs. Ha: µ < 2, where µ is the average work experience (in years) of all admitted applicants prior to entering the program. To encourage him, we need strong evidence showing that his chance is high. Hypothesis Testing and p-value 67 / 71 Ling-Chieh Kung (NTU IM)
- 130. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example 2: hypothesis and test Suppose instead he is optimistic and will apply unless the average work experience is proven to be greater than two years. The hypothesis becomes H0: µ = 2 vs. Ha: µ > 2. To discourage him, we need strong evidence showing that his chance is slim. Let's consider the optimistic candidate (and Ha: µ > 2) first. Because the population variance is unknown and the population is normal, we may use the t test. Hypothesis Testing and p-value 68 / 71 Ling-Chieh Kung (NTU IM)
- 131. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example 2A: calculation and interpretation Calculation: The p-value is Pr(X̄ > 2.5 | µ = 2) = 0.0604. Conclusion: For this one-tailed test, as the p-value > 0.05 = α, we do not reject H0. There is no strong evidence showing that the average work experience is longer than two years. The result is not strong enough to discourage the potential applicant, who has only one year of work experience. Decision: The (optimistic) applicant should apply. Hypothesis Testing and p-value 69 / 71 Ling-Chieh Kung (NTU IM)
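Example 2A can be reproduced from the summary statistics alone. A sketch that computes the t statistic and its upper-tail probability by numerically integrating the t density of Definition 3 (in practice a library routine such as scipy's `stats.t.sf` would replace the hand-rolled integration):

```python
from math import gamma, pi, sqrt

def t_tail(t0, df, upper=60.0, steps=200_000):
    """P(T > t0) for a t distribution with df degrees of freedom (midpoint rule)."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    h = (upper - t0) / steps
    total = 0.0
    for i in range(steps):
        x = t0 + (i + 0.5) * h
        total += c * (1 + x * x / df) ** (-(df + 1) / 2)
    return total * h

n, x_bar, s, mu0 = 20, 2.5, 1.3765, 2   # summary statistics of Example 2
t_stat = (x_bar - mu0) / (s / sqrt(n))  # realized t statistic, about 1.624
p_value = t_tail(t_stat, n - 1)         # Pr(X-bar > 2.5 | mu = 2)
print(round(t_stat, 3), round(p_value, 4))
```

Since the p-value exceeds α = 0.05, H0 is not rejected, matching the slide; the pessimistic case of Example 2B is simply 1 minus this tail probability.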
- 132. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example 2B – a pessimistic applicant Suppose the applicant is pessimistic and the hypothesis is H0: µ = 2 vs. Ha: µ < 2. The p-value will be Pr(X̄ < 2.5 | µ = 2) = 1 − 0.0604 = 0.9396, calculated based on the t distribution. We do not reject H0 and cannot conclude that µ < 2. There is no strong evidence to encourage him. He should not apply. Note that with different alternative hypotheses, the final decision may be different! This happens if and only if we do not reject H0 in both cases. Hypothesis Testing and p-value 70 / 71 Ling-Chieh Kung (NTU IM)
- 133. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Summary To test the population mean µ:
  σ²       Sample size   Normal population   Nonnormal population
  Known    n ≥ 30        z                   z
  Known    n < 30        z                   Nonparametric
  Unknown  n ≥ 30        t or z              z
  Unknown  n < 30        t                   Nonparametric
  More parameters that may be tested: Population proportion (z test). Population variance (χ² test). Difference of two population means (t test). Ratio of two population variances (F test). Hypothesis Testing and p-value 71 / 71 Ling-Chieh Kung (NTU IM)
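The decision table can be written as a small helper function; a sketch encoding the table's recommendations (for the "t or z" cell it returns t, which is always valid):

```python
def choose_test(sigma_known: bool, n: int, normal_population: bool) -> str:
    """Pick a test for the population mean, following the summary table."""
    if not normal_population and n < 30:
        return "nonparametric"  # small nonnormal samples: no z or t
    if sigma_known:
        return "z"
    # sigma unknown:
    if normal_population:
        return "t"              # "t or z" when n >= 30; t is always valid
    return "z"                  # nonnormal but n >= 30: CLT with s for sigma

print(choose_test(sigma_known=False, n=20, normal_population=True))
```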
- 134. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Statistics and Data Analysis for Engineers Part 3: Regression Analysis Ling-Chieh Kung Department of Information Management National Taiwan University September 4, 2016 Regression Analysis 1 / 83 Ling-Chieh Kung (NTU IM)
- 135. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Correlation and prediction We often try to find correlation among variables. For example, prices and sizes of houses:
  House          1    2    3    4    5    6    7    8    9    10   11   12
  Size (m²)      75   59   85   65   72   46   107  91   75   65   88   59
  Price ($1000)  315  229  355  261  234  216  308  306  289  204  265  195
  We may calculate their correlation coefficient as r = 0.729. Now given a house whose size is 100 m², may we predict its price? Regression Analysis 2 / 83 Ling-Chieh Kung (NTU IM)
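The r = 0.729 above can be reproduced with a few lines of Python; a sketch using only the standard library:

```python
from math import sqrt

size  = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

n = len(size)
mx, my = sum(size) / n, sum(price) / n

# Centered sums of squares and cross products.
sxy = sum((x - mx) * (y - my) for x, y in zip(size, price))
sxx = sum((x - mx) ** 2 for x in size)
syy = sum((y - my) ** 2 for y in price)

r = sxy / sqrt(sxx * syy)  # sample correlation coefficient
print(round(r, 3))
```

The correlation summarizes the strength of the linear relationship but, by itself, gives no prediction rule; that is what the regression line in the next sections provides.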
- 136. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Correlation among more than two variables Sometimes we have more than two variables. For example, we may also know the number of bedrooms in each house:
  House          1    2    3    4    5    6    7    8    9    10   11   12
  Size (m²)      75   59   85   65   72   46   107  91   75   65   88   59
  Price ($1000)  315  229  355  261  234  216  308  306  289  204  265  195
  Bedroom        1    1    2    2    2    1    3    3    2    1    3    1
  How to summarize the correlation among the three variables? How to predict house price based on size and number of bedrooms? Regression Analysis 3 / 83 Ling-Chieh Kung (NTU IM)
- 137. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Regression analysis Regression is a solution! As one of the most widely used tools in Statistics, it discovers: Which variables affect a given variable. How they affect the target. In general, we predict/estimate one dependent variable by one or more independent variables. Independent variables: potential factors that may affect the outcome. Dependent variable: the outcome. Independent variables are explanatory variables; the dependent variable is the response variable. As another example, suppose we want to predict the number of arriving customers for tomorrow: Dependent variable: number of arriving customers. Independent variables: weather, holiday or not, promotion or not, etc. Regression Analysis 4 / 83 Ling-Chieh Kung (NTU IM)
- 138. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Types of regression analysis Based on the number of independent variables: Simple regression: one independent variable. Multiple regression: more than one independent variable. The dependent variable may be quantitative or qualitative. In ordinary regression, the dependent variable is quantitative. In logistic regression, the dependent variable is qualitative. There are other types of regression models. Regression Analysis 5 / 83 Ling-Chieh Kung (NTU IM)
- 139. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Road map Simple regression. Multiple regression. Indicator variables and interaction. Endogeneity and residual analysis. Logistic regression. Regression Analysis 6 / 83 Ling-Chieh Kung (NTU IM)
- 140. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Basic principle Consider the price-size relationship again. In the sequel, let xi be the size and yi be the price of house i, i = 1, ..., 12.
  Size (m²)   Price ($1000)
  46          216
  59          229
  59          195
  65          261
  65          204
  72          234
  75          315
  75          289
  85          355
  88          265
  91          306
  107         308
  How to relate sizes and prices "in the best way"? Regression Analysis 7 / 83 Ling-Chieh Kung (NTU IM)
- 141. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Linear estimation If we believe that the relationship between the two variables is linear, we will assume that yi = β0 + β1xi + εi, where β0 is the intercept of the equation, β1 is the slope of the equation, and εi is the random noise for record i. Somehow there is such a formula, but we do not know β0 and β1: they are parameters of the population. We want to use our sample data (e.g., the information of the twelve houses) to estimate β0 and β1. We want to form two statistics β̂0 and β̂1 as our estimates of β0 and β1. Regression Analysis 8 / 83 Ling-Chieh Kung (NTU IM)
- 142. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Linear estimation Given the values of β̂0 and β̂1, we will use ŷi = β̂0 + β̂1xi as our estimate of yi. Then we have yi = β̂0 + β̂1xi + εi, where εi = yi − ŷi is now interpreted as the estimation error, which we hope is small. For all data points, let's minimize the sum of squared errors (SSE): Σᵢ εᵢ² = Σᵢ (yi − ŷi)² = Σᵢ (yi − (β̂0 + β̂1xi))². The solution of min over (β̂0, β̂1) of Σᵢ (yi − (β̂0 + β̂1xi))² is our least squares approximation (estimation) of the given data. Regression Analysis 9 / 83 Ling-Chieh Kung (NTU IM)
- 143. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Least squares approximation The least squares problem min over (β̂0, β̂1) of Σᵢ (yi − (β̂0 + β̂1xi))² has a closed-form formula for the best (β̂0, β̂1): β̂1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)² and β̂0 = ȳ − β̂1x̄. For our house example, we get (β̂0, β̂1) = (102.717, 2.192). Its SSE is 13118.63. We will never know the true values of β0 and β1. However, according to our sample data, the best (least squares) estimate is (102.717, 2.192). We tend to believe that β0 ≈ 102.717 and β1 ≈ 2.192. Regression Analysis 10 / 83 Ling-Chieh Kung (NTU IM)
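A sketch that applies the closed-form formulas to the twelve houses and reproduces (β̂0, β̂1) = (102.717, 2.192), SSE = 13118.63, and the R² = 0.5315 discussed in the next slides:

```python
size  = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

n = len(size)
mx, my = sum(size) / n, sum(price) / n

# Closed-form least squares estimates.
b1 = (sum((x - mx) * (y - my) for x, y in zip(size, price))
      / sum((x - mx) ** 2 for x in size))
b0 = my - b1 * mx

# Sum of squared errors of the fitted line, and R-squared.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))
sst = sum((y - my) ** 2 for y in price)
r2 = 1 - sse / sst
print(round(b0, 3), round(b1, 3), round(sse, 2), round(r2, 4))
```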
- 144. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Interpretations Our regression model is y = 102.717 + 2.192x. Interpretation: When the house size increases by 1 m², the price is expected to increase by $2,192. (Bad) interpretation: For a house whose size is 0 m², the price is expected to be $102,717. Regression Analysis 11 / 83 Ling-Chieh Kung (NTU IM)
- 145. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Linear multiple regression In most cases, more than one independent variable may be used to explain the outcome of the dependent variable. For example, consider the number of bedrooms. We may take both variables as independent variables and do linear multiple regression: yi = β0 + β1x1,i + β2x2,i + εi, where yi is the house price (in $1000), x1,i is the house size (in m²), x2,i is the number of bedrooms, and εi is the random noise. Our least squares estimate is (β̂0, β̂1, β̂2) = (82.737, 2.854, −15.789).
  Price ($1000)  Size (m²)  Bedroom
  315            75         1
  229            59         1
  355            85         2
  261            65         2
  234            72         2
  216            46         1
  308            107        3
  306            91         3
  289            75         2
  204            65         1
  265            88         3
  195            59         1
  Regression Analysis 12 / 83 Ling-Chieh Kung (NTU IM)
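A sketch that solves the two-variable normal equations by hand (Cramer's rule on the centered sums of squares) and reproduces (82.737, 2.854, −15.789); statistical software would do this in one call:

```python
size  = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
bed   = [1, 1, 2, 2, 2, 1, 3, 3, 2, 1, 3, 1]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

n = len(price)
m1, m2, my = sum(size) / n, sum(bed) / n, sum(price) / n

# Centered sums of squares and cross products.
s11 = sum((x - m1) ** 2 for x in size)
s22 = sum((z - m2) ** 2 for z in bed)
s12 = sum((x - m1) * (z - m2) for x, z in zip(size, bed))
s1y = sum((x - m1) * (y - my) for x, y in zip(size, price))
s2y = sum((z - m2) * (y - my) for z, y in zip(bed, price))

# Solve the 2x2 normal equations by Cramer's rule.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = my - b1 * m1 - b2 * m2
print(round(b0, 3), round(b1, 3), round(b2, 3))
```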
- 146. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Interpretations Our regression model is y = 82.737 + 2.854x1 − 15.789x2. When the house size increases by 1 m² (and all other independent variables are fixed), we expect the price to increase by $2,854. When there is one more bedroom (and all other independent variables are fixed), we expect the price to decrease by $15,789. One must interpret the results and determine whether they are meaningful by oneself. The number of bedrooms may not be a good indicator of house price, at least not in a linear way. We need more than finding coefficients: We need to judge the overall quality of a given regression model. We may want to compare multiple regression models. We must test the significance of regression coefficients. Regression Analysis 13 / 83 Ling-Chieh Kung (NTU IM)
- 147. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Model validation: How good is a model? How to measure the quality of a model? For the model y = 102.717 + 2.192x, how good is it? In general, for a given regression model y = β̂0 + β̂1x1 + · · · + β̂kxk, how may we evaluate its overall quality? The total sum of squares, SST = Σᵢ (yi − ȳ)², is the SSE of the "worst" model, which ignores all independent variables and always predicts ȳ. With our regression model, the sum of squared errors is SSE = Σᵢ (yi − ŷi)² = Σᵢ (yi − (β̂0 + β̂1xi))². The proportion of total variability that is explained by the regression model is 0 ≤ R² = 1 − SSE/SST ≤ 1. The larger R² is, the better the regression model. Regression Analysis 14 / 83 Ling-Chieh Kung (NTU IM)
- 148. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Obtaining R² Whenever we find the estimated coefficients, we also have R². Statistical software includes R² in the regression report. For the regression model y = 102.717 + 2.192x, we have R² = 0.5315: around 53% of the variation in house prices is explained by house size. If (and only if) there is only one independent variable, then R² = r², where r is the correlation coefficient between the dependent and independent variables. −1 ≤ r ≤ 1 and 0 ≤ r² = R² ≤ 1. Regression Analysis 15 / 83 Ling-Chieh Kung (NTU IM)
- 149. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Comparing regression models Now we have a way to compare regression models. For our example:
        Size only   Bedroom only   Size and bedroom
  R²    0.5315      0.29           0.5513
  Using size only is better than using the number of bedrooms only. Is using size and bedrooms together better? In general, adding more variables always increases R²! In the worst case, we may set the corresponding coefficients to 0. Some variables may actually be meaningless. To perform a "fair" comparison and identify the meaningful factors, we need to adjust R² based on the number of independent variables. Regression Analysis 16 / 83 Ling-Chieh Kung (NTU IM)
- 150. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Adjusted R² The standard way to adjust R² to the adjusted R² is R²adj = 1 − [(n − 1)/(n − k − 1)](1 − R²), where n is the sample size and k is the number of independent variables used. For our example:
          Size only   Bedroom only   Size and bedroom
  R²      0.5315      0.290          0.5513
  R²adj   0.4846      0.219          0.4516
  Actually using sizes only results in the best model! Regression Analysis 17 / 83 Ling-Chieh Kung (NTU IM)
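The adjustment is easy to verify; a sketch using the R² values from the table:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

n = 12  # twelve houses
print(adjusted_r2(0.5315, n, 1))  # size only
print(adjusted_r2(0.290, n, 1))   # bedroom only
print(adjusted_r2(0.5513, n, 2))  # size and bedroom
```

Because the penalty factor (n − 1)/(n − k − 1) grows with k, adding a weak variable can raise R² while lowering R²adj, which is exactly what happens when bedrooms are added.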
- 151. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Testing coefficient significance Another important task in validating a regression model is to test the significance of each coefficient. Recall our model with two independent variables: y = 82.737 + 2.854x1 − 15.789x2. Note that 2.854 and −15.789 are calculated solely from the sample. We never know whether β1 and β2 really equal these two values! In fact, we cannot even be sure that β1 and β2 are not 0. We need to test them: H0: βi = 0 vs. Ha: βi ≠ 0. We look for strong enough evidence showing that βi ≠ 0. Regression Analysis 18 / 83 Ling-Chieh Kung (NTU IM)
- 152. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Testing coefficient significance The testing results are provided in regression reports. Statistical software (e.g., R) tells us:
             Coefficients   Standard Error   t Stat   p-value
  Intercept  82.737         59.873           1.382    0.200
  Size       2.854          1.247            2.289    0.048 **
  Bedroom    −15.789        25.056           −0.630   0.544
  As we have no idea about the population variance, we apply the t test. "Coefficients" records the estimates β̂i, "Standard Error" records their estimated standard errors SE(β̂i), and "t Stat" records T = (β̂i − 0)/SE(β̂i), just as x̄ and S/√n play these roles in a one-sample t test. "p-value" gives the tail probability of T multiplied by 2 (done by most software). Simply compare them with α! Recall the assumption that εi is normal! Regression Analysis 19 / 83 Ling-Chieh Kung (NTU IM)
- 153. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Testing coefficient significance Statistical software tells us:
             Coefficients   Standard Error   t Stat   p-value
  Intercept  82.737         59.873           1.382    0.200
  Size       2.854          1.247            2.289    0.048 **
  Bedroom    −15.789        25.056           −0.630   0.544
  At a 95% confidence level, we believe that β1 ≠ 0: house size really has some impact on house price. At a 95% confidence level, we have no evidence that β2 ≠ 0: we cannot conclude that the number of bedrooms has an impact on house price. If we use only size as an independent variable, its p-value will be 0.00714. We will be quite confident that it has an impact. Regression Analysis 20 / 83 Ling-Chieh Kung (NTU IM)
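The t statistics and two-sided p-values in the report can be re-derived from the coefficients and standard errors with df = n − k − 1 = 12 − 2 − 1 = 9. A sketch (the t tail probability is integrated numerically here; a library routine such as scipy's `stats.t.sf` would normally be used):

```python
from math import gamma, pi, sqrt

def t_tail(t0, df, upper=60.0, steps=100_000):
    """P(T > t0) for a t distribution with df degrees of freedom (midpoint rule)."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    h = (upper - t0) / steps
    total = 0.0
    for i in range(steps):
        x = t0 + (i + 0.5) * h
        total += c * (1 + x * x / df) ** (-(df + 1) / 2)
    return total * h

df = 12 - 2 - 1  # n - k - 1
report = {"Intercept": (82.737, 59.873),
          "Size": (2.854, 1.247),
          "Bedroom": (-15.789, 25.056)}

results = {}
for name, (coef, se) in report.items():
    t_stat = coef / se
    p_two_sided = 2 * t_tail(abs(t_stat), df)  # two-sided p-value
    results[name] = (round(t_stat, 3), round(p_two_sided, 3))
    print(name, results[name])
```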
- 154. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Road map Simple regression. Multiple regression. Indicator variables and interaction. Endogeneity and residual analysis. Logistic regression. Regression Analysis 21 / 83 Ling-Chieh Kung (NTU IM)
- 155. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression House age The age of a house may also affect its price.
  Price ($1000)  Size (m²)  Bedroom  Age (years)
  315            75         1        16
  229            59         1        20
  355            85         2        16
  261            65         2        15
  234            72         2        21
  216            46         1        16
  308            107        3        15
  306            91         3        15
  289            75         2        14
  204            65         1        21
  265            88         3        15
  195            59         1        26
  Let's add age as an independent variable in explaining house prices. Because the number of bedrooms seems to be unhelpful, let's ignore it. Regression Analysis 22 / 83 Ling-Chieh Kung (NTU IM)
- 156. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression House age For house i, let yi be its price, x1,i be its size, and x3,i be its age. We assume the following linear relationship: yi = β0 + β1x1,i + β2x3,i + εi. Software gives us the following regression report:
             Coefficients   Standard Error   t Stat   p-value
  Intercept  262.882        83.632           3.143    0.012
  Size       1.533          0.628            2.443    0.037 **
  Age        −6.368         2.881            −2.211   0.054 *
  R² = 0.696, R²adj = 0.629
  R²adj goes up from 0.485 (size only) to 0.629. Age is significant at a 10% significance level. Seems good! Regression Analysis 23 / 83 Ling-Chieh Kung (NTU IM)
- 157. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression "Nonlinear" relationship May we do better? By looking at the age-price scatter plot (and from our intuition), maybe the impact of age on price is "nonlinear": A new house's value depreciates fast. The value depreciates slowly when the house is old. At least this is true for a car. It is worthwhile to try to capture this nonlinear relationship. For example, we may try to replace house age by its reciprocal: yi = β0 + β1x1,i + β2(1/x3,i) + εi. Regression Analysis 24 / 83 Ling-Chieh Kung (NTU IM)
- 158. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Variable transformation To fit yi = β0 + β1x1,i + β2(1/x3,i) + εi to our sample data: Prepare a new column as 1/age. Input these three columns to software. Read the report. We may consider any kind of nonlinear relationship. This technique is called variable transformation.
  Price ($1000)  Size (m²)  1/Age (1/years)
  315            75         0.063
  229            59         0.050
  355            85         0.063
  261            65         0.067
  234            72         0.048
  216            46         0.063
  308            107        0.067
  306            91         0.067
  289            75         0.071
  204            65         0.048
  265            88         0.067
  195            59         0.038
  Regression Analysis 25 / 83 Ling-Chieh Kung (NTU IM)
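Preparing the transformed column is one line of code; a sketch whose values match the 1/Age column above up to rounding:

```python
age = [16, 20, 16, 15, 21, 16, 15, 15, 14, 21, 15, 26]  # ages from the data table

# New transformed column: the reciprocal of age.
inv_age = [1 / a for a in age]
print(inv_age)
```

The same pattern works for any transformation (squares, logs, etc.): build the new column, then feed it to the regression routine as if it were an ordinary independent variable.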
- 159. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression The reciprocal of house age Software gives us the following regression report:
             Coefficients   Standard Error   t Stat   p-value
  Intercept  22.905         57.154           0.401    0.698
  Size       1.524          0.647            2.356    0.043 **
  1/Age      2185.575       1044.497         2.092    0.066 *
  R² = 0.685, R²adj = 0.615
  Validation: Both variables are significant (at different significance levels). Using size and age better explains house price than using size alone (at least for the given sample data). However, the intuition that house value depreciates at different speeds is not supported by the data: replacing age with 1/age does not improve the fit. Changing 1/age to age² does not help either. Regression Analysis 26 / 83 Ling-Chieh Kung (NTU IM)
- 160. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Typical ways of variable transformation [Figure: typical ways of variable transformation.] Regression Analysis 27 / 83 Ling-Chieh Kung (NTU IM)
- 161. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Variable selection and model building In general, we may have a lot of candidate independent variables: size, number of bedrooms, age, distance to a park, distance to a hospital, safety in the neighborhood, etc. If we consider only linear relationships, for p candidate independent variables, we have 2^p − 1 combinations. For each variable, we have many ways to transform it. In the next lecture, we will introduce the way of modeling interaction among independent variables. How to find the "best" regression model (if there is one)? Regression Analysis 28 / 83 Ling-Chieh Kung (NTU IM)
- 162. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Variable selection and model building There is no "best" model; there are "good" models. Some general suggestions: Take each independent variable one at a time and observe the relationship between it and the dependent variable. A scatter plot helps. Use this to consider variable transformation. For each pair of independent variables, check their relationship. If two are highly correlated, quite likely one is not needed. Once a model is built, check the p-values. You may want to remove insignificant variables (but removing a variable may change the significance of other variables). Go back and forth to try various combinations. Stop when a good enough one (with high R² and R²adj and small p-values) is found. Software can somewhat automate the process, but its power is limited (e.g., it cannot decide transformations). We may need to find new independent variables. Intuition and experience may help (or hurt). Regression Analysis 29 / 83 Ling-Chieh Kung (NTU IM)
- 163. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Summary With a regression model, we try to identify how independent variables affect the dependent variable. We adopt the least squares criterion for estimating the coefficients. Model validation: The overall quality of a regression model is judged by its R² and R²adj. We may test the significance of independent variables by their p-values. Model building: Variable transformation. Variable selection. Regression Analysis 30 / 83 Ling-Chieh Kung (NTU IM)
- 164. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Case study: ticket selling A theater put on hundreds of stage performances in the past six years. The owner hopes that statistics and data analysis may help her improve ticket sales. Key question: What makes a show popular? Popularity is defined as the number of tickets sold. Potential factors: year, month, day, time, location, actors/actresses, drama type, ticket prices, etc. 100 performances are randomly drawn from the whole pool. All were made during weekends. Tickets were all publicly sold. Tickets for all performances were sold through the same channels. For each performance, the ticket price(s) remained the same. As a group of consultants, how may we help the theater? Regression Analysis 31 / 83 Ling-Chieh Kung (NTU IM)
- 165. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Variables Six variables are obtained: Variable Meaning Year The year in which the performance was made Time Morning, afternoon, or evening Capacity The number of seats in the theater hall AvgPrice The average of all prices SalesQty The number of tickets sold SalesDuration Performance day − Announcement day Labeling and scaling: Years are labeled as 1, 2, ..., and 6 (6 means the last year). Capacities and sales quantities have been scaled in the same proportion. Regression Analysis 32 / 83 Ling-Chieh Kung (NTU IM)
- 166. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Data (incomplete)
  Yr  Tm  Cap  A.P.  Qty  S.D.
  5   A   230  400   218  50
  5   A   150  500   119  46
  5   A   230  400   160  126
  5   A   200  775   200  324
  6   E   190  1175  178  115
  6   A   190  1175  183  109
  5   E   190  775   161  58
  3   A   200  675   200  112
  5   E   200  775   158  323
  1   M   200  575   128  360
  2   M   190  575   190  289
  6   A   130  500   108  89
  4   E   200  775   169  100
  4   E   200  775   135  259
  5   A   310  650   251  346
  2   A   250  550   250  145
  1   A   190  675   183  254
  6   A   200  1175  146  110
  1   M   200  575   140  94
  4   A   200  775   195  255
  Regression Analysis 33 / 83 Ling-Chieh Kung (NTU IM)
- 167. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Regression To construct a regression model, we first consider quantitative independent variables. Dependent variable: SalesQty. Independent variables: Capacity, AvgPrice, Year. Let's ignore SalesDuration for a while. Note that Year is a quantitative variable: the difference between two values makes sense (4 − 2 and 5 − 3 both mean a difference of two years), and the values will keep increasing. If we had a variable Month whose possible values are 1, 2, ..., and 12, the difference between 12 and 1 would be ambiguous: 11 months or 1 month. Scatter plots help us consider: Variable selection: Does a variable have an impact? Transformation: What is a variable's impact? Multicollinearity: Are two variables highly correlated? Regression Analysis 34 / 83 Ling-Chieh Kung (NTU IM)
- 168. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression [Figure: scatter plots of SalesQty against the candidate independent variables.] Regression Analysis 35 / 83 Ling-Chieh Kung (NTU IM)
- 169. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Regression It seems that Capacity, AvgPrice, and Year are all worth a try. Let's put them into a regression model. If we do this one by one: SalesQty = 20.79 + 0.72Capacity: R² = 0.538, p-value ≈ 0. SalesQty = 174.9 + 0.0028AvgPrice: R² = 0.0002, p-value = 0.885. SalesQty = 203.6 − 6.77Year: R² = 0.063, p-value = 0.0115. If we include them together, the regression model is SalesQty = 24.742 + 0.702Capacity + 0.027AvgPrice − 4.696Year, with R² = 0.57, R²adj = 0.556; the p-values are 0, 0.056, and 0.019, respectively. Do not try independent variables separately; try them together. Regression Analysis 36 / 83 Ling-Chieh Kung (NTU IM)
- 170. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Adding Time into the model Time may also be an inﬂuential variable. However, it is qualitative. More precisely, it is nominal. Even if we label Time with numeric values, we cannot treat it as a quantitative variable and put it into a regression model. For each qualitative variable, we need to introduce several indicator variables to represent its values. Regression Analysis 37 / 83 Ling-Chieh Kung (NTU IM)
- 171. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Road map Simple regression. Multiple regression. Indicator variables and interaction. Endogeneity and residual analysis. Logistic regression. Regression Analysis 38 / 83 Ling-Chieh Kung (NTU IM)
- 172. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Numeric labeling does not work The variable Time has three values. Morning, afternoon, and evening. Why can’t we label them as 1, 2, and 3 and do regression? Suppose we label (morning, afternoon, evening) as (1, 2, 3): The regression model is SalesQty = 164.021 + 6.313Time. Why is this wrong? Regression Analysis 39 / 83 Ling-Chieh Kung (NTU IM)
- 173. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Numeric labeling does not work Different labeling gives different regression results. We may also label (morning, afternoon, evening) as (1, 2, 10) or (3, 1, 2):
  Labeling (1, 2, 3):  SalesQty = 164.021 + 6.313Time,   p-value = 0.294
  Labeling (1, 2, 10): SalesQty = 177.224 − 0.075Time,   p-value = 0.95
  Labeling (3, 1, 2):  SalesQty = 205.725 − 15.091Time,  p-value = 0.0084
  Regression Analysis 40 / 83 Ling-Chieh Kung (NTU IM)
- 174. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Binary variables There is one exception: If a qualitative variable is binary, we may label the values as 0 and 1 and then treat it as quantitative. Labeling values as 1 and 0, 1 and 2, or 7 and 8 is also good. Labeling values as 1 and −1, 1 and 5, or 4 and 8 is bad. This is because a regression coeﬃcient measures what happens to the dependent variable “when that independent variable increases by 1.” When the binary variable is labeled with 0 and 1, its regression coeﬃcient ˆβi tells us that “if the value changes from 0 to 1 (while all others remain the same), we expect the dependent variable to increase by ˆβi.” What if we have more than two values? Regression Analysis 41 / 83 Ling-Chieh Kung (NTU IM)
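This can be checked with a tiny simple-regression sketch (the y values below are hypothetical): relabeling a binary variable from 0/1 to 7/8 shifts the intercept but leaves the slope, and hence the fit, unchanged.

```python
def fit(x, y):
    """Closed-form simple-regression estimates: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    return my - b1 * mx, b1

y   = [10, 12, 9, 20, 22, 19]  # hypothetical outcomes
x01 = [0, 0, 0, 1, 1, 1]       # binary variable labeled 0/1
x78 = [7, 7, 7, 8, 8, 8]       # the same variable labeled 7/8

(b0a, b1a), (b0b, b1b) = fit(x01, y), fit(x78, y)
print(b1a, b1b)  # identical slopes; only the intercept shifts
```

Any two labels one unit apart give the same slope; labels like 1 and 5 would rescale it, which is why such labelings are "bad".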
- 175. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Indicator variables Consider a variable x with three values A, B, and C. We first choose a reference level, say, A. We then manually create two indicator variables xB and xC: xB = 1 if x = B and 0 otherwise; xC = 1 if x = C and 0 otherwise. In other words, we have a mapping:
  x   xB   xC
  A   0    0
  B   1    0
  C   0    1
  Regression Analysis 42 / 83 Ling-Chieh Kung (NTU IM)
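The mapping can be automated; a sketch (the helper name `make_indicators` is hypothetical, mirroring what `pandas.get_dummies(..., drop_first=True)` does in practice):

```python
def make_indicators(values, reference="A"):
    """Create 0/1 indicator columns for a qualitative variable,
    dropping the chosen reference level."""
    levels = sorted(set(values) - {reference})
    return {f"x{lev}": [1 if v == lev else 0 for v in values] for lev in levels}

x = ["A", "B", "C", "B"]
dummies = make_indicators(x)
print(dummies)
```

With A as the reference level, each indicator's coefficient measures the expected change in the dependent variable relative to level A.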