Statistical Analysis Software: Bivariate and Multivariate Regression Analysis. Academic Department of Marketing, Caucasus School of Business, Caucasus University, 2011
Problems of Test 1 • Formulating null and alternative hypotheses incorrectly • Ignoring the question “why” • Ignoring the necessity to comment on the scale used • Mixing up the Wilcoxon test and the paired samples t test • Massively ignoring the necessity to check the equality of variances (Levene’s test) • Kolmogorov-Smirnov test
Homework 1 • Three or four homework assignments will be given throughout the course. You will be informed about the number of points you can get from each assignment. • The first homework assignment will include two problems. The first one is the ANOVA problem from Test 1; each of you will have an individual database. The second problem will be about using Pearson’s chi-square statistic in cross-tabulations. However, you will have to come up with your own example and your own fictional database. • The assignment is worth 2 points and is due
Important Note (Homework) • EVEN IF ALL THE INTERPRETATION IS CORRECT, YOU WILL GET ZERO POINTS IN CASE YOU SUBMIT THE WRONG OUTPUT WHETHER IT’S BECAUSE YOU DID THE WRONG TEST OR YOU USED SOMEBODY ELSE’S DATASET.
Warming Up – Linear Equations • What does a linear relationship imply? • What does a linear relationship look like (mathematically)? • What are the variables in this equation and what are the parameters? • How are the parameters interpreted?
Scatterplot (1) • Scatterplot – a collection of points (x, y) in a coordinate system. Each point on a scatterplot depicts a single case that has a specific X value and a specific Y value, which you can read off the X and Y axes.
Scatterplot (2) • As we see, there is a certain relationship between income and saving – the higher the income, the higher the saving. • But are we interested only in the direction? Not really. It is important to measure by how much saving increases as income increases by, say, 1 Lari. • By saying this we imply that there is a linear relationship between income and saving (which is not necessarily true, but let’s ignore this for now).
Scatterplot (3) • Going back to our scatterplot, we need to find a line (i.e. determine the intercept and the slope) which best describes the relationship between two variables (in this case saving and income). • This is exactly where regression comes into play – it helps to identify such a line by using the sample information.
Bivariate Regression Model • In theory, the relationship between saving and income already exists and is somewhere out there – we can’t really determine it in practice. Why? Because we would need to collect information about everybody’s income and everybody’s saving (i.e. we would need information about the whole population). • If we could, the bivariate regression model would look like this: Y = β0 + β1*X, where Y is saving and X is income.
Error Term • Note that even in the ideal case, where we have information about the population, we are still unable to predict the level of saving exactly from the level of income. Why? Because income is not the only factor that determines saving. There are other factors that aren’t accounted for in our bivariate regression model. • All the other factors not explicitly accounted for in the regression model fall into the so-called error term, denoted by ε. • Therefore, the population regression model looks like this: Y = β0 + β1*X + ε
Linear Regression Analysis (Bivariate) • Identifying the line that depicts the relationship between X and Y boils down to estimating β0 and β1. • What regression basically does is provide us with estimates (regression coefficients) of β0 and β1, which are denoted by b0 and b1. • The estimated regression model looks like this: Ŷ = b0 + b1*X
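As a minimal sketch of how b0 and b1 can be obtained, the snippet below applies the ordinary least squares formulas to a small sample. The slides do not name the estimation method, so ordinary least squares is an assumption here; the income and saving figures and the function name fit_line are made up purely for illustration.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-products and sum of squared deviations from the means
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx               # slope estimate
    b0 = y_mean - b1 * x_mean      # intercept estimate
    return b0, b1


# Hypothetical sample: income and saving in Lari (not course data)
income = [1000, 2000, 3000, 4000, 5000]
saving = [200, 400, 500, 400, 500]

b0, b1 = fit_line(income, saving)
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")
```

The fitted line passes through the point of sample means, which is why the intercept is computed from the means once the slope is known.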
Interpreting Regression Coefficients • Ŷ = b0 + b1*X • Ŷ – predicted value, shows the predicted value of Y as X takes specific values. • b0 – intercept, shows the predicted value of Y when X = 0. • b1 – slope estimate, shows by how much the predicted value of Y changes as X changes by 1 unit.
Residual • The residual is the difference between the actual value of Y and the predicted value of Y, and is denoted by e. • e = Y – Ŷ • Do not mix up the residual and the error term. They are NOT the same. We never know the error term. However, we can easily compute the residual from the sample. The residual serves as an estimate of the error term.
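The definition e = Y – Ŷ can be sketched in a few lines. The data and the helper fit_line are hypothetical, and the line is fitted by ordinary least squares, which is an assumption rather than something the slides state. With an intercept in the model, least squares residuals sum to zero, which the last line checks.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y_hat = b0 + b1 * x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_mean - b1 * x_mean, b1


# Hypothetical sample: income and saving in Lari (not course data)
income = [1000, 2000, 3000, 4000, 5000]
saving = [200, 400, 500, 400, 500]

b0, b1 = fit_line(income, saving)
predicted = [b0 + b1 * xi for xi in income]              # Y-hat for each case
residuals = [yi - pi for yi, pi in zip(saving, predicted)]  # e = Y - Y-hat

for xi, yi, pi, ei in zip(income, saving, predicted, residuals):
    print(f"income={xi}  actual={yi}  predicted={pi:.1f}  residual={ei:.1f}")
print("sum of residuals:", round(sum(residuals), 6))
```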
Linear Regression - Output • Thus, if income is 0, the predicted saving is 124.842 Lari. If income increases by 1 Lari, predicted saving increases by 0.147 Lari. • Is this model adequate for predicting the level of saving? Not really. Saving is also determined by other factors, such as family size and the education level, age and gender of the household head. (Of course there may be other determinants as well, but let’s focus on these for now.)
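To make the interpretation concrete, the estimated line can be written as a small function. The two coefficients are the ones quoted on this slide; the function name predicted_saving and the income values tried are illustrative assumptions.

```python
B0 = 124.842  # intercept quoted on the slide (predicted saving at income 0)
B1 = 0.147    # slope quoted on the slide (change in predicted saving per 1 Lari)


def predicted_saving(income):
    """Predicted saving in Lari from the estimated bivariate model."""
    return B0 + B1 * income


for income in (0, 1000, 1001):
    print(income, round(predicted_saving(income), 3))

# Raising income by 1 Lari moves the prediction by exactly the slope
print(round(predicted_saving(1001) - predicted_saving(1000), 3))
```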
Multiple Regression Analysis • Multiple regression implies including more than one independent variable in the regression model. Basically it looks like this: Y = β0 + β1*X1 + β2*X2 + β3*X3 + … + βk*Xk + ε • In this case we need to estimate (k+1) parameters, β0, β1, β2, …, βk; their estimates are denoted by b0, b1, b2, …, bk. • Interpretation of slope coefficients: b1 shows by how much predicted Y changes as X1 changes by 1 unit, holding all other X-s constant. • Interpretation of intercept – the predicted value of Y when all the X-s are equal to zero.
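The following is a rough sketch of estimating b0, b1, …, bk by least squares through the normal equations (X'X)b = X'y, using only the standard library. Least squares is assumed as the estimation method, and the helper names and the two-predictor dataset are hypothetical: the saving values were constructed without noise as 50 + 0.1*income - 10*size, so the fit should recover those numbers.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple(rows, y):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data: (income in Lari, family size) -> saving
rows = [(1000, 3), (1500, 5), (2000, 2), (2500, 4), (3000, 6), (3500, 1)]
saving = [120, 150, 230, 260, 290, 390]

coefs = fit_multiple(rows, saving)
print([round(c, 4) for c in coefs])

# Predicted saving for a family with income 1000 and 5 members
forecast = coefs[0] + coefs[1] * 1000 + coefs[2] * 5
print(round(forecast, 2))
```

The coefficient on income is read as the change in predicted saving per extra Lari of income for a fixed family size, which is the "holding all other X-s constant" interpretation.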
Major Goals of Conducting Regression Analysis • Goal 1. Measuring partial effects – by how much does Y change when X1 changes by 1 unit, holding all other X-s constant? • Goal 2. Forecasting the values of the dependent variable – what is the predicted saving level (measured in Laris) of a family that has an income of 1000 Laris and 5 members, and whose household head studied for 15 years and is 47 years old? • Regression provides answers to these questions.
Predictive Power of a Model • In order to know how good our model is for forecasting, we need to measure the predictive power of the model. In other words, we want to know how well the independent variables explain the dependent variable. • The coefficient of determination (R-squared) is widely used for this purpose.
Coefficient of Determination – R-Squared (1) • The coefficient of determination (R-squared) measures the proportion of the variation in Y explained by the variation in the X-s; in other words, how much of the variation in the dependent variable is explained by the independent variables. • This is also called goodness-of-fit. • R-squared ranges from 0 to 1 and shows how well the regression line describes the data cloud that you see on the scatterplot. • The closer the data are clustered around the regression line, the closer R-squared is to 1. R² = 1 is a perfect fit (practically never observed with real data). The closer R-squared is to 0, the worse the fit.
Coefficient of Determination – R-Squared (2) • For example, if R-squared is equal to 0.045, it means that the independent variables explain only 4.5% of the variation in the dependent variable. • This is an example of low predictive power. • The higher the R-squared, the better the predictive power of your model.
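One common way to compute R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that formula on a made-up sample; the data, the function name r_squared and the line 220 + 0.06*income (the least squares line for these particular made-up points) are illustrative assumptions.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean = sum(actual) / len(actual)
    ss_total = sum((a - mean) ** 2 for a in actual)                 # total variation
    ss_resid = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained part
    return 1 - ss_resid / ss_total


# Hypothetical sample: income and saving in Lari (not course data)
income = [1000, 2000, 3000, 4000, 5000]
saving = [200, 400, 500, 400, 500]

# Predictions from the least squares line for this made-up sample
predicted = [220 + 0.06 * x for x in income]

r2 = r_squared(saving, predicted)
print(f"R-squared = {r2:.3f}")
```

Here the line explains 60% of the variation in the made-up saving figures; the remaining 40% sits in the residuals.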
Testing Significance of Regression Coefficients (1) • As we already mentioned, the other goal of regression analysis is to determine partial effects. • Basically, partial effects measure the pure effects of the respective independent variables on the dependent variable. • What we want to know is whether these pure effects are important. How can we find this out? • This is done by testing the significance of the regression coefficients.
Testing Significance of Regression Coefficients (2) • Suppose we want to test whether the age of the household head (X4) has an important effect on saving once all the other factors (household size, income, education of household head) are controlled for. • The null hypothesis is that β4 = 0 (i.e., as X4 changes by 1 unit, nothing happens to Y; there is no effect on Y). • The alternative hypothesis is that β4 is different from 0 (two-tailed test).
Testing Significance of Regression Coefficients (3) • It can be shown that if we divide the estimate of β4 (b4) by the standard error of b4 (the estimated standard deviation of b4), the resulting statistic follows a t distribution when the null hypothesis is true. • Thus, we can either calculate the t statistic and compare it to the critical t value at the 5% significance level, or we can simply look at the p-value (Sig.) of the regression coefficient. If the latter is less than 0.05, we conclude that the regression coefficient is significantly different from zero (or simply significant, for short). In other words, the partial effect of this variable is statistically important.
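The test can be sketched for the simplest case, a bivariate regression, where the standard error of the slope has a short closed form. Everything below is an illustration under assumptions: the data are made up, least squares estimation is assumed, and the p-value is obtained by numerically integrating the t density with Simpson's rule rather than by any routine the course software uses.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * area


def slope_t_test(x, y):
    """Return (t statistic, degrees of freedom, p-value) for H0: slope = 0."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
    b0 = y_mean - b1 * x_mean
    ss_resid = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    df = n - 2  # two parameters (b0 and b1) were estimated
    se_b1 = math.sqrt(ss_resid / df / s_xx)  # standard error of the slope
    t_stat = b1 / se_b1
    return t_stat, df, two_tailed_p(t_stat, df)


# Hypothetical sample: income and saving in Lari (not course data)
income = [1000, 2000, 3000, 4000, 5000]
saving = [200, 400, 500, 400, 500]

t_stat, df, p_value = slope_t_test(income, saving)
print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.3f}")
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

With only five made-up observations the slope is not significant at the 5% level, even though the fitted line slopes upward; small samples give large standard errors.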
Testing Significance of Regression Coefficients - Example • Going back to our multivariate regression example, no single independent variable appears to be statistically significant – all the p-values are more than 0.05. • However, even though these variables are separately insignificant, there is a chance that they are collectively significant. • This hypothesis is tested by the joint F test.
Joint F Test • Null Hypothesis: β1 = β2 = β3 = β4 = 0 • Alternative Hypothesis: at least one of these coefficients is different from zero. • This is equivalent to testing whether R² = 0.
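The link to R-squared can be made explicit: for a model with an intercept and k independent variables estimated on n cases, the joint F statistic can be written as F = (R²/k) / ((1 – R²)/(n – k – 1)). The sketch below evaluates this formula; the function name and the sample size of 100 are hypothetical, and R² = 0.045 is used only because it is the illustrative value mentioned earlier in the slides.

```python
def joint_f_statistic(r2, n, k):
    """F statistic for H0: all k slope coefficients equal zero.

    r2 -- R-squared of the fitted model
    n  -- number of cases in the sample
    k  -- number of independent variables (intercept not counted)
    """
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical setting: 4 independent variables, 100 cases, R-squared of 0.045
f_stat = joint_f_statistic(0.045, n=100, k=4)
print(f"F = {f_stat:.3f} with ({4}, {100 - 4 - 1}) degrees of freedom")

# An R-squared of exactly 0 gives F = 0, the value most favourable to H0
print(joint_f_statistic(0.0, n=100, k=4))
```

The resulting F value is then compared with the critical value of the F distribution for (k, n – k – 1) degrees of freedom, or judged by the p-value (Sig.) reported next to it in the output.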
Important Note • It can happen that all the coefficients are separately insignificant but jointly significant, even though in our example they’re also jointly insignificant at 5% significance level. • It can also happen that regression coefficients are separately significant but jointly insignificant. WHEN?