Lesson 16: Data Analysis II – Explaining Observed Differences: Cross-Tabulation, Correlation and Regression
Explaining Variation with Dependent and Independent Variables Why are there differences or variations in the data? Product usage, preferences and attitudes depend partly on marketing activities; such variables are called dependent variables. An independent variable is one which the researcher believes can explain the differences or variations that occur in a dependent variable, e.g., a brand's price, package and advertising.
Assumptions The data to be analyzed are obtained from descriptive studies, not from experiments. The data are from very large samples, usually in excess of 300 and frequently as large as 1,000. The data include measures on a number of variables for each respondent.
Method of Analysis Analyze patterns of change which are common to both a dependent variable and one or more independent variables. Cross-tabulation is applicable to data in which the dependent variable and the independent variables are categorical variables, or continuous variables which have been placed into categories. Correlation and regression analysis are commonly applied to situations where both the dependent variable and the independent variables are continuous.
Cross-Tabulation Independent variables in cross-tabulations include the respondents’ age, amount of education, household income, size of family and occupation. In the cells of a cross-tabulation researchers typically show percentages as well as actual counts of the number of different responses given.
Constructing and Interpreting a Cross-Tabulation Cross-tabulation is used on both types of categorical variables. Assign the categories associated with the dependent variable to the rows of the cross-tabulation and the categories associated with the independent variable(s) to the columns. Assign the top row to the dependent variable category with the largest quantifiable number; each succeeding row is assigned a category with a progressively lower quantifiable number. Alternately, the top row can be assigned to the highest or most desirable category and the bottom row to the lowest or least desirable category. The independent variable category with the largest quantifiable number is assigned to the rightmost column and the category with the smallest quantifiable number to the leftmost column. Alternately, the rightmost column is assigned to the highest or most desirable category and the leftmost column to the lowest or least desirable category. Each column's percentages total 100%. To interpret the cross-tabulation, analyze the pattern of percentages across each row separately. If the percentages increase from left to right, the dependent variable category is positively associated with the independent variable; if they decrease from left to right, it is negatively associated with the independent variable.
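The column-percentage convention described above can be sketched in a few lines of Python. The survey responses and category names here are hypothetical, invented purely for illustration.

```python
from collections import Counter

# Hypothetical survey responses: (usage level, income group), one tuple per respondent.
responses = [
    ("heavy", "high"), ("heavy", "high"), ("heavy", "low"),
    ("light", "high"), ("light", "low"), ("light", "low"),
    ("none",  "low"),  ("none",  "high"), ("none", "low"), ("none", "low"),
]

counts = Counter(responses)                        # cell counts of the cross-tabulation
col_totals = Counter(col for _, col in responses)  # column (independent variable) totals

def column_percent(row, col):
    """Percentage of respondents in column `col` who fall in row `row` (each column sums to 100%)."""
    return 100.0 * counts[(row, col)] / col_totals[col]
```

Reading each row's percentages from the low-income column to the high-income column then shows whether that usage category rises or falls with income.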
Three Useful Questions Does the cross-tabulation show a valid or spurious relationship? How many independent variables should be used in the cross-tabulation? Are the differences seen in the cross-tabulation statistically significant, or could they have occurred by chance due to sampling variation?
Cross-Tabulation shows a valid explanation Changes in the independent variables are believed to cause changes in, or to explain variation in, the dependent variable.
Cross-Tabulation shows a spurious explanation The relationship is spurious if the implied relationship between the dependent and independent variables does not seem to be logical.
How many independent variables should be used? No more than three or four.
Are the differences statistically significant? If a cross-tabulation reveals an interesting relationship between a dependent and an independent variable, a chi-square analysis can be used to test whether the observed differences are statistically significant. Expected count in the i-th cell = (total number of observations in the row in which cell i is located × total number of observations in the column in which cell i is located) / total number of observations in all cells.
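The expected-count formula, and the chi-square statistic it feeds into, can be sketched as follows; the 2×2 table used to exercise the sketch is hypothetical.

```python
def expected_count(row_total, col_total, grand_total):
    # Expected count in cell i = (row total of cell i x column total of cell i) / total observations
    return row_total * col_total / grand_total

def chi_square(observed, expected):
    # Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

The resulting statistic is compared against a chi-square critical value with the appropriate degrees of freedom to judge whether the differences could have occurred by chance.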
Introductory Comments on Correlation and Regression Analysis Correlation and regression analysis can be used in situations where both the dependent and independent variables are of the continuous type, for example, examining the relationship between milk consumption and both age and income. Their results are more accurate representations of the relationships between variables, and are more objectively arrived at, than similar results from cross-tabulations.
CORRELATION Data to which regression and correlation can be applied have the following characteristics: they are continuous variables; more than one variable is measured for each respondent; and the number of respondents is greater than the number of variables.
Correlation Analysis A positive relationship is said to exist between two variables when larger values of one variable are associated with larger values of the other variable. Thus, wine consumption and income are said to be positively related if higher levels of wine consumption tend to be associated with higher income. Wine consumption and income are said to be negatively related if lower levels of wine consumption tend to be associated with higher income.
Correlation Coefficient r A measure of the relationship between two variables is the correlation coefficient r. Perfect positive correlation is indicated by r = +1.00; perfect negative correlation by r = −1.00; no relationship by r = 0.
r
r = ∑(Yi − Ȳ)(Xi − X̄) / √(∑(Yi − Ȳ)² × ∑(Xi − X̄)²)
where Ȳ = average wine consumption, X̄ = average income, Yi = wine consumption of individual household i, and Xi = income of individual household i.
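The formula for r translates directly into Python. This is a minimal sketch, with no guard against zero variance in either variable.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists of observations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Numerator: sum of cross-products of deviations from the means
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squared deviations
    den = sqrt(sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys))
    return num / den
```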
Household Wine Consumption and Income : City A & B
Scatter Diagram of Wine Consumption and Income for City A (horizontal axis: Annual Income, $000)
Scatter Diagram of Wine Consumption and Income for City B (horizontal axis: Annual Income, $000)
Applying the correlation coefficient formula for Cities A & B
r = ∑(Yi − Ȳ)(Xi − X̄) / √(∑(Yi − Ȳ)² × ∑(Xi − X̄)²) = 8/√((10)(10)) = 8/10 = +0.80 for City A
r = 10/√((10)(10)) = 10/10 = +1.00 for City B (perfect positive correlation)
r interpretation If r = 0.8 or larger, there is a very strong or high relationship between the variables. If r is between 0.4 and 0.8 (disregarding sign), the relationship between the variables is considered moderate to high. For lower values of r, the relationship is small to insignificant: when a correlation analysis results in an r of less than 0.4, researchers do not have strong evidence that there is a relationship between the dependent and independent variables. The + or − sign on the correlation coefficient indicates whether the relationship is positive or negative; for example, r = −0.8 or −1.00 means that wine consumption is lower among households with higher income. Correlation analysis, like cross-tabulation, attempts to identify patterns of variation common to a dependent variable and an independent variable.
REGRESSION ANALYSIS The correlation coefficient is a summary measure which indicates the relative strength and the + or − direction of a relationship between two variables, but it does not describe the underlying relationship. It cannot be used to predict the size of change to expect in the dependent variable if the independent variable is changed by one unit, i.e., how large a change in wine consumption would occur with a given increase in income. Hence, an equation is needed.
Regression Analysis Regression analysis is a technique whereby a mathematical equation is “fitted” to a set of data. Four elements are involved: the set of data; the mathematical equation; the technique which fits the equation to the data; and how the equation is evaluated to see how well it “describes” the data.
Data Set Data set must consist of measures of two or more continuous variables and the sample size must be at least two or three times as large as the number of measured variables, preferably much larger.
Mathematical Equation In most applications of regression analysis, the equation which is used is that of a straight line. The general form of the equation of a straight line is Y = a + bX, where: Y = the dependent variable; X = the independent variable; b = a coefficient which indicates the effect on Y of a one-unit change in X (that is, if b = +8.2, a one-unit increase in X will result in an 8.2-unit increase in Y); a = the coefficient which identifies the value of Y when X is equal to 0, that is, the value of Y at which the straight line intersects the Y axis.
Linear Regression Equation A simple linear regression equation can be applied to the wine consumption data available from City C: Y* = a + bX, where Y* = the predicted value of the dependent variable, that is, monthly household wine consumption in quarts; X = the observed value of the independent variable, i.e., the annual household income in thousands of dollars; b = a coefficient which indicates by how much household wine consumption in quarts is expected to increase with a $1,000 increase in annual household income; a = the point at which the straight line intersects the Y axis.
Pictorial Presentation of Regression Analysis [Scatter diagram: Household Annual Income (X) on the horizontal axis, Household Wine Consumption (Y) on the vertical axis; the fitted regression line runs through the points, and the location of each dot marks one household's wine consumption Yi and its annual income Xi.]
Fitting the equation to the observed data Plot all wine consumption and annual income data from a very large sample on a scatter diagram. Envision fitting a line through the points in such a manner as to achieve “the best possible fit.” The fitted regression line can be viewed as a “predictor line” in the sense that it “predicts” household wine consumption for each different value of annual household income: for each and every observed value of annual household income Xi, the regression line provides a predicted value of wine consumption Y*i. The difference between the wine consumption reported by household i (Yi) and its predicted wine consumption (Y*i) is (Yi − Y*i), and this difference is called a residual. The procedure commonly used to calculate the regression line which “best fits” a particular set of data is the “least squares method.” This procedure identifies the one equation which, when fitted to the observed data, minimizes the sum of the squares of all residuals; that is, the procedure minimizes ∑(Yi − Y*i)² over all i.
Calculating a and b values First, the means: X̄ = ∑Xi/n = 70/5 = 14; Ȳ = ∑Yi/n = 15/5 = 3.
Calculating a and b
b = (n∑XiYi − (∑Xi)(∑Yi)) / (n∑Xi² − (∑Xi)²)
= (5(226) − (70)(15)) / (5(1020) − (70)²)
= (1130 − 1050) / (5100 − 4900)
= 80/200 = +0.40
a, b and slope The a coefficient is obtained as a = Ȳ − bX̄ = 3 − (0.40)(14) = −2.60. It should be noted that the regression line intersects the Y axis at −2.60, that is, at the value of the a coefficient. The b coefficient indicates that for each $1,000 increase in annual household income, monthly wine consumption is predicted to increase by 0.40 quarts: the slope of the regression line is +0.40, so the line rises to the right.
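The b and a calculations above can be sketched as one function. The five (income, consumption) pairs used to exercise it are hypothetical, but they were chosen to reproduce the sums given for City C (∑Xi = 70, ∑Yi = 15, ∑XiYi = 226, ∑Xi² = 1020), so the fit recovers b = +0.40 and a = −2.60.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b for the line y* = a + b*x."""
    n = len(xs)
    # b = (n * sum(XY) - sum(X) * sum(Y)) / (n * sum(X^2) - (sum(X))^2)
    b = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / \
        (n * sum(x * x for x in xs) - sum(xs) ** 2)
    # a = mean(Y) - b * mean(X)
    a = sum(ys) / n - b * sum(xs) / n
    return a, b
```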
Advantages of Regression The result demonstrates the advantage of a regression analysis over a correlation analysis. With r = +0.80, the correlation analysis only identifies the presence of a moderately strong positive relationship between wine consumption and income. The regression analysis leads to a more complete description of the relationship; for example, predicted monthly wine consumption for selected annual incomes is shown in the table above. Because the b coefficient indicates by how much the dependent variable will change for a given change in the independent variable, a regression equation is a type of descriptive relationship which can help researchers arrive at a better understanding of variation in the dependent variable.
Evaluating the Regression Equation All regression procedures also calculate a measure called the “coefficient of determination,” identified as R². This coefficient takes on a maximum value of 1.00 and a minimum value of 0. An R² value of 1.00 indicates that the regression equation “explains” 100 percent of the variance in the dependent variable about its mean. This variance would be explained perfectly if every dot in the scatter diagram fell precisely on the regression line, that is, if all of the residuals were equal to 0. When the regression equation does not fit the data perfectly, some of the residuals will be greater than 0. Those residuals form a distribution around the regression equation, and this distribution can be used as a measure of how much variance is “unexplained” by the regression equation.
Regression Equation
R² = (total variance in the dependent variable − variance unexplained by the regression equation) / total variance in the dependent variable
= (∑(Yi − Ȳ)² − ∑(Yi − Y*i)²) / ∑(Yi − Ȳ)²
If the regression line explains all the variance in Y, all the residuals will be 0, the variance “unexplained” by the regression equation will be 0, and the coefficient of determination will be R² = (total variance in the dependent variable − 0) / total variance in the dependent variable = 1.00. R² values in the 0.50–1.00 range are usually interpreted to mean that the regression equation does a good job of explaining the variation in Y.
Calculating R² for the Wine Consumption Regression Equation for City C
Ȳ = ∑Yi/n = 15/5 = 3
R² = (10 − 3.6)/10 = 0.64
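The R² computation can be sketched as below. The five data points are a hypothetical set consistent with the City C sums (total Y variance 10, residual sum of squares 3.6), so the sketch reproduces R² = 0.64.

```python
def r_squared(xs, ys, a, b):
    """Coefficient of determination for the fitted line y* = a + b*x."""
    y_bar = sum(ys) / len(ys)
    # Total variance of Y about its mean
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    # Variance left unexplained: sum of squared residuals (Yi - Y*i)^2
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (ss_total - ss_resid) / ss_total
```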
Interpretation of R² Values in Explaining Variance Y*i = −2.60 + 0.40Xi is capable of explaining about 64 percent of the total variance observed in the dependent variable, monthly household wine consumption. In other words, 36 percent of the total variance in household wine consumption is “unexplained” by the regression equation.
R² Explained If the regression line does not explain any of the variance in Y, all the residuals will be large and the variance “unexplained” by the regression equation will be approximately equal to the total Y variance. The two terms in the numerator of the R² formula will then be equal, the numerator will be 0, and R² will be 0. An R² value approximating 0 indicates that the regression equation does not explain any of the variance observed in Y. In general, R² values of 0.25 or less indicate that the regression equation is of little use in explaining variance, and regression equations with R² values in the 0.25–0.50 range are typically judged to be of only moderate use in explaining the variance observed in a dependent variable.
Multiple Linear Regression Simple linear regression uses one independent variable. When two or more independent variables are used in a linear regression analysis, it is called multiple linear regression: Y = a + bX1 + cX2 + dX3 + …, where Y is the dependent variable and X1, X2, X3, … are independent variables. The additional coefficients (c, d, …) are similar to the b coefficient, except that they are associated with the independent variables X2, X3, and so on.
Calculations needed for the Multiple Linear Regression of Wine Consumption in City C
Ȳ = 15/5 = 3; X̄1 = 70/5 = 14; X̄2 = 150/5 = 30
Calculating c
c = ((∑yx2)(∑x1²) − (∑yx1)(∑x1x2)) / ((∑x1²)(∑x2²) − (∑x1x2)²)
= ((25)(40) − (16)(46)) / ((40)(74) − (46)²)
= (1000 − 736) / (2960 − 2116) = 264/844 = 0.312
(Here the lowercase y, x1 and x2 denote deviations from the respective means.)
Calculating a
a = Ȳ − bX̄1 − cX̄2 = 3.0 − (0.0402)(14) − (0.312)(30) = −6.923
The REGRESSION EQUATION Y* = −6.923 + 0.0402X1 + 0.312X2. From this equation, researchers see that a greater increase in wine consumption is associated with a one-year increase in age (0.312 quarts) than with a $1,000 increase in annual income (0.0402 quarts). As in simple linear regression, the a coefficient (−6.923) simply positions the equation: it is the value of Y* when X1 and X2 are both 0 and should not be given a substantive interpretation on its own.
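The two-predictor least-squares estimates can be sketched from the deviation sums used above. The age values below are hypothetical but consistent with the City C sums (∑yx1 = 16, ∑yx2 = 25, ∑x1² = 40, ∑x2² = 74, ∑x1x2 = 46). Note that the intercept computed from unrounded b and c is about −6.948; the −6.923 shown above results from carrying the rounded coefficients 0.0402 and 0.312 into the a calculation.

```python
def fit_two_predictors(ys, x1s, x2s):
    """Least-squares fit of y* = a + b*x1 + c*x2 via deviation sums (two predictors only)."""
    n = len(ys)
    y_bar, x1_bar, x2_bar = sum(ys) / n, sum(x1s) / n, sum(x2s) / n
    # Deviations from the means (the lowercase y, x1, x2 of the formulas)
    y = [v - y_bar for v in ys]
    x1 = [v - x1_bar for v in x1s]
    x2 = [v - x2_bar for v in x2s]
    s = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))  # deviation cross-product sum
    den = s(x1, x1) * s(x2, x2) - s(x1, x2) ** 2
    b = (s(y, x1) * s(x2, x2) - s(y, x2) * s(x1, x2)) / den
    c = (s(y, x2) * s(x1, x1) - s(y, x1) * s(x1, x2)) / den
    a = y_bar - b * x1_bar - c * x2_bar
    return a, b, c
```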
Stepwise Multiple Linear Regression Suppose there are five independent variables (X1, X2, X3, X4, X5). Stepwise multiple regression first evaluates each independent variable separately to determine which one results in the largest R², that is, which one explains most of the variation in the dependent variable. If it is X3, that independent variable is selected for the regression equation.
Stepwise Multiple Linear Regression Next, the stepwise regression evaluates X3 in combination with each of the remaining independent variables (one at a time) to determine which of the latter results in the largest increase in R². If it is X5, that variable becomes the second independent variable selected for the regression equation. Likewise, the remaining variables are evaluated one at a time with the two already selected to determine the third variable. This continues until R² can no longer be increased significantly by adding another independent variable to the regression equation. The independent variables which are not selected do not become part of the regression equation.
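The first step of the stepwise procedure, picking the single variable with the largest R², can be sketched as follows (for one predictor, R² is simply the squared correlation). The candidate data and variable names are hypothetical; later steps would require fitting multi-variable equations and are omitted from this sketch.

```python
def simple_r2(xs, ys):
    """R² of a one-variable least-squares fit (equals the squared correlation coefficient)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def first_stepwise_pick(ys, candidates):
    """First step of forward stepwise selection: the candidate variable with the largest R²."""
    return max(candidates, key=lambda name: simple_r2(candidates[name], ys))
```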
Problems in using Regression Analysis Inadequate sample size (the sample should be at least two to three times the number of variables); independent variables which do not have a direct effect on the dependent variable; independent variables which are highly correlated, so that their effect is the same as that of a single variable used twice in the equation; and a relationship between the dependent and independent variables which is not linear, i.e., has an unusual shape, and hence cannot be analyzed by linear regression techniques.