# Data analysis test for association BY Prof Sachin Udepurkar

Test of Association - Bivariate Analysis.

To interpret the relationship between variables.

Published in: Technology, Health & Medicine

1. DATA ANALYSIS – TESTING FOR ASSOCIATION
    - Relationship: a consistent and systematic link between two or more variables.
    - When interpreting the relationship between variables, the following aspects are taken into account:
        1. Whether two or more variables are related at all, i.e. whether a relationship is present, judged by the concept of statistical significance
        2. If a relationship is present, its direction, which can be either positive or negative
        3. The strength of the association
        4. The type of relationship
2. Difference between Univariate and Bivariate

    | Univariate Data | Bivariate Data |
    |---|---|
    | involving a single variable | involving two variables |
    | does not deal with causes or relationships | deals with causes or relationships |
    | the major purpose of univariate analysis is to describe | the major purpose of bivariate analysis is to explain |
    | central tendency: mean, mode, median | analysis of two variables simultaneously |
    | dispersion: range, variance, max, min, quartiles, standard deviation | correlations |
    | frequency distributions | comparisons, relationships, causes, explanations |
    | bar graph, histogram, pie chart, line graph, box-and-whisker plot | tables where one variable is contingent on the values of the other variable |
    | | independent and dependent variables |
    | Sample question: How many of the students in the freshman class are female? | Sample question: Is there a relationship between the number of females in Computer Programming and their scores in Mathematics? |
3. 1) To measure whether a relationship is present, using the concept of statistical significance
    - The question is whether a relationship exists between two or more variables.
    - If we test for statistical significance and find that it exists, we say that a relationship is present.
    - Stated another way, knowledge about the behavior of one variable allows us to make a useful prediction about the behavior of another.
    - For example: if we found a statistically significant relationship between perceptions of the quality of Santa Fe Grill food and satisfaction, we would say that a relationship is present and that perceptions of food quality tell us what the perceptions of satisfaction are likely to be.
4. 2) If a relationship is present, it is important to know its direction, which can be either positive or negative
    - Establishing the presence of a relationship precedes establishing its direction.
    - The direction of a relationship can be either positive or negative.
    - For example: in the Santa Fe Grill example, a positive relationship exists if respondents who rate the quality of food high are also highly satisfied. Similarly, a negative relationship exists if respondents say the speed of service is slow (low rating) but they are still satisfied (high rating).
5. 3) Understanding strength of association
    - In general, the strength of association is categorized as (a) nonexistent, (b) weak, (c) moderate, or (d) strong.
    - If a consistent and systematic relationship is not present, the strength of association is nonexistent.
    - A weak association means there is a low probability that the variables have a consistent and systematic relationship.
    - A strong association means there is a high probability that a consistent and systematic relationship exists.
6. 4) Type of relationship
    - If two variables can be described as related, the next question is "What is the nature of the relationship?", that is, how can the link between variables Y and X best be described?
    - There are a number of different ways in which two variables (X and Y) can share a relationship.
7. To answer the above questions, the following statistical methodologies are applied:
    - Covariation
    - Chi-square test
    - Correlation coefficient
        1. Pearson correlation coefficient
        2. Coefficient of determination
        3. Spearman rank order correlation coefficient
    - Regression analysis
8. COVARIATION
    - Covariation is defined as the amount of change in one variable that is consistently related to the change in another variable of interest, or the degree of association between two items or variables.
    - For example: if we know DVD purchases are related to age, we want to know the extent to which younger persons purchase more DVDs and, ultimately, which types of DVDs.
    - If two variables are found to change together on a reliable or consistent basis, we can use that information to make predictions as well as decisions on advertising and marketing strategies.
    - For example: the change in attitude towards a Starbucks coffee advertising campaign as it varies between light, medium and heavy consumers of Starbucks coffee.
9. SCATTER PLOTS AND CORRELATION
    - A scatter plot (or scatter diagram) is used to show the relationship between two variables.
10. SCATTER PLOT EXAMPLES
    [Figure: scatter plots of y against x showing linear relationships and curvilinear relationships]
11. SCATTER PLOT EXAMPLES (continued)
    [Figure: scatter plots of y against x showing strong relationships and weak relationships]
12. SCATTER PLOT EXAMPLES (continued)
    [Figure: scatter plots of y against x showing no relationship]
13. Smoking and Lung Capacity
    - We can see easily from the graph that as smoking goes up, lung capacity tends to go down.
    - The two variables covary in opposite directions.
    - We now examine two statistics, covariance and correlation, for quantifying how variables covary.

    | Cigarettes (X) | Lung Capacity (Y) |
    |---|---|
    | 0 | 45 |
    | 5 | 42 |
    | 10 | 33 |
    | 15 | 31 |
    | 20 | 29 |

    One easy way to visually describe covariation between two variables is a scatter diagram: a graphic plot of the relative position of two variables, using a horizontal and a vertical axis to represent the values of the respective variables.

    [Figure: scatter diagram of Lung Capacity (vertical axis) against Smoking (horizontal axis)]
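As a rough numerical check of the pattern described in the smoking example, the sketch below computes the sample covariance of the five data points. It assumes the usual sample formula with an n − 1 denominator; the function name `sample_covariance` is illustrative and does not come from the slides.

```python
# Data from the slide: cigarettes smoked (X) and lung capacity (Y).
cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]


def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1 (assumed convention)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


cov_smoking = sample_covariance(cigarettes, lung_capacity)
# A negative value means the two variables tend to move in opposite directions.
print(cov_smoking)
```

The sign is what matters here; the magnitude depends on the units of both variables, which is why correlation is introduced later as a standardized measure.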
14. The formula for calculating the covariance of sample data is:

    $$\mathrm{COV}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

    where
    - $x$ = the independent variable
    - $y$ = the dependent variable
    - $n$ = number of data points in the sample
    - $\bar{x}$ = the mean of the independent variable $x$
    - $\bar{y}$ = the mean of the dependent variable $y$

    Example: to understand how covariance is used, consider data describing the rate of economic growth ($x_i$) and the rate of return on the S&P 500 ($y_i$). Using the covariance formula, you can determine whether economic growth and S&P 500 returns have a positive or inverse relationship.
15. Before you compute the covariance, calculate the means of x and y.

    A) Identify the variables for the covariance formula:
    - x = 2.1, 2.5, 4.0, and 3.6 (economic growth)
    - y = 8, 12, 14, and 10 (S&P 500 returns)
    - $\bar{x} \approx 3.1$ (3.05 before rounding)
    - $\bar{y} = 11$

    B) Substitute these values into the covariance formula to determine the relationship between economic growth and S&P 500 returns.
16. Interpretation:
    - The covariance between the returns of the S&P 500 and economic growth is 1.53.
    - Since the covariance is positive, the variables are positively related: they move together in the same direction.
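The substitution step can be sketched in a few lines of Python. This is a minimal sketch that assumes the sample covariance uses an n − 1 denominator, which is the convention that reproduces the 1.53 reported on the slide; the variable names are illustrative.

```python
# Data from the slides: economic growth (x) and S&P 500 returns (y).
growth = [2.1, 2.5, 4.0, 3.6]
sp500_returns = [8, 12, 14, 10]

n = len(growth)
mean_x = sum(growth) / n          # about 3.05, shown as 3.1 on the slide
mean_y = sum(sp500_returns) / n   # 11

# Sum of cross-deviations divided by n - 1 (sample covariance).
cov_xy = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(growth, sp500_returns)
) / (n - 1)

print(round(cov_xy, 2))  # 1.53
```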
18. Correlation:
    - Correlation is another way to determine how two variables are related.
    - In addition to telling you whether variables are positively or inversely related, correlation also tells you the degree to which the variables tend to move together.
    - Correlation standardizes the measure of interdependence between two variables and, consequently, tells you how closely the two variables move.
    - The correlation measurement, called the Pearson correlation coefficient, always takes a value between 1 and -1.

    A) If the correlation coefficient is 1:
    - The variables have a perfect positive correlation. This means that if one variable moves a given amount, the second moves proportionally in the same direction.
    - A positive correlation coefficient less than 1 indicates a less than perfect positive correlation, with the strength of the correlation growing as the number approaches 1.
19. B) If the correlation coefficient is zero:
    - No linear relationship exists between the variables.
    - If one variable moves, you can make no predictions about the movement of the other variable; they are uncorrelated.

    C) If the correlation coefficient is -1:
    - The variables are perfectly negatively correlated (or inversely correlated) and move in opposition to each other.
    - If one variable increases, the other variable decreases proportionally.
    - A negative correlation coefficient greater than -1 indicates a less than perfect negative correlation, with the strength of the correlation growing as the number approaches -1.
20. To calculate the correlation coefficient for two variables, use the correlation formula:

    $$r(x, y) = \frac{\mathrm{COV}(x, y)}{s_x \, s_y}$$

    where
    - $r(x, y)$ = correlation of the variables x and y
    - $\mathrm{COV}(x, y)$ = covariance of the variables x and y
    - $s_x$ = sample standard deviation of the random variable x
    - $s_y$ = sample standard deviation of the random variable y

    To calculate correlation, you must know the covariance of the two variables and the standard deviation of each variable. From the earlier example, you know that the covariance of S&P 500 returns and economic growth is 1.53.
21. Now you need to determine the standard deviation of each of the variables.
    - Calculate the standard deviation of the S&P 500 returns and of economic growth.
    - Using the information from above, you know that:
        - $\mathrm{COV}(x, y) = 1.53$
        - $s_x = 0.90$
        - $s_y = 2.58$
22. Now calculate the correlation coefficient by substituting the numbers above into the correlation formula:

    $$r(x, y) = \frac{1.53}{0.90 \times 2.58} \approx 0.66$$

    A correlation coefficient of 0.66 tells you two important things:
    - Because the correlation coefficient is a positive number, returns on the S&P 500 and economic growth are positively related.
    - Because 0.66 is relatively far from zero (which would indicate no correlation), the correlation between returns on the S&P 500 and economic growth is strong.
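The same result can be reproduced from the raw data. The sketch below is one way to do it, assuming sample (n − 1) standard deviations and covariance; the helper names are illustrative rather than taken from the slides.

```python
import math

# Data from the slides: economic growth (x) and S&P 500 returns (y).
growth = [2.1, 2.5, 4.0, 3.6]
sp500_returns = [8, 12, 14, 10]


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation with an n - 1 denominator."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def sample_cov(xs, ys):
    """Sample covariance with an n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


s_x = sample_std(growth)           # about 0.90
s_y = sample_std(sp500_returns)    # about 2.58
r = sample_cov(growth, sp500_returns) / (s_x * s_y)

print(round(s_x, 2), round(s_y, 2), round(r, 2))  # 0.9 2.58 0.66
```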
23. The coefficient of determination is the amount of variability in one measure that is explained by the other measure.
    - It is the square of the Pearson correlation coefficient, written $r^2$.
    - For example, if the correlation coefficient between two variables is $r = 0.90$, the coefficient of determination is $(0.90)^2 = 0.81$.
    - This number ranges from 0.00 to 1.0 and shows the proportion of variation in one variable that is explained or accounted for by another.
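A small sketch of the squaring step follows. The function name `coefficient_of_determination` is an illustrative choice; the second call applies it to the rounded 0.66 from the S&P 500 example, which is an extra illustration not worked out on the slides.

```python
def coefficient_of_determination(r):
    """Square of the Pearson correlation coefficient."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return r ** 2


r2_slide = coefficient_of_determination(0.90)   # example from the slide
r2_sp500 = coefficient_of_determination(0.66)   # rounded r from the S&P 500 example

print(round(r2_slide, 2))  # 0.81
print(round(r2_sp500, 2))  # 0.44
```

Read this way, a correlation of 0.66 corresponds to roughly 44% of the variation in one variable being accounted for by the other.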
24. Spearman rank order correlation coefficient: a statistical measure of association between two variables where both have been measured using ordinal (rank order) scales. Because it is computed on ranks, it captures monotonic rather than strictly linear association.
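Since the slide's own example is not available, the sketch below uses made-up rankings (two hypothetical judges ranking five items) purely for illustration. It applies the standard shortcut formula $\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$, which is valid only when there are no tied ranks.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation via the shortcut formula (assumes no tied ranks)."""
    if len(rank_x) != len(rank_y):
        raise ValueError("both rankings must have the same length")
    n = len(rank_x)
    # d is the difference between the two ranks given to the same item.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


# Hypothetical rankings of five items by two judges (not from the slides).
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]

rho = spearman_rho(judge_a, judge_b)
print(rho)  # 0.8
```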
25. INTRODUCTION TO REGRESSION ANALYSIS
    - Regression analysis is used to:
        - Predict the value of a dependent variable based on the value of at least one independent variable
        - Explain the impact of changes in an independent variable on the dependent variable
    - Dependent variable: the variable we wish to explain
    - Independent variable: the variable used to explain the dependent variable
26. SIMPLE LINEAR REGRESSION MODEL
    - Only one independent variable, x
    - The relationship between x and y is described by a linear function
    - Changes in y are assumed to be caused by changes in x
27. TYPES OF REGRESSION MODELS
    [Figure: four scatter plots illustrating a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship]
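28. POPULATION LINEAR REGRESSION
    The population regression model:

    $$y = \beta_0 + \beta_1 x + \varepsilon$$

    where
    - $y$ = dependent variable
    - $\beta_0$ = population y intercept
    - $\beta_1$ = population slope coefficient
    - $x$ = independent variable
    - $\varepsilon$ = random error term, or residual

    $\beta_0 + \beta_1 x$ is the linear component and $\varepsilon$ is the random error component.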
29. LINEAR REGRESSION ASSUMPTIONS
    - Error values ($\varepsilon$) are statistically independent
    - Error values are normally distributed for any given value of x
    - The probability distribution of the errors is normal
    - The probability distribution of the errors has constant variance
    - The underlying relationship between the x variable and the y variable is linear
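30. POPULATION LINEAR REGRESSION (continued)
    [Figure: the line $y = \beta_0 + \beta_1 x + \varepsilon$ plotted against x, with intercept $\beta_0$ and slope $\beta_1$; for a given $x_i$, the observed value of y, the predicted value of y, and the random error $\varepsilon_i$ between them are marked]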
31. ESTIMATED REGRESSION MODEL
    The sample regression line provides an estimate of the population regression line:

    $$\hat{y}_i = b_0 + b_1 x$$

    where
    - $\hat{y}_i$ = estimated (or predicted) y value
    - $b_0$ = estimate of the regression intercept
    - $b_1$ = estimate of the regression slope
    - $x$ = independent variable

    The individual random error terms $e_i$ have a mean of zero.
32. LEAST SQUARES CRITERION
    $b_0$ and $b_1$ are obtained by finding the values that minimize the sum of the squared residuals:

    $$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$$
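33. THE LEAST SQUARES EQUATION
    The formulas for $b_1$ and $b_0$ are:

    $$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

    algebraic equivalent:

    $$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}$$

    and

    $$b_0 = \bar{y} - b_1 \bar{x}$$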
34. INTERPRETATION OF THE SLOPE AND THE INTERCEPT
    - $b_0$ is the estimated average value of y when the value of x is zero
    - $b_1$ is the estimated change in the average value of y as a result of a one-unit change in x
35. FINDING THE LEAST SQUARES EQUATION
    - The coefficients $b_0$ and $b_1$ will usually be found using computer software, such as Excel or Minitab.
    - Other regression measures will also be computed as part of computer-based regression analysis.
36. SIMPLE LINEAR REGRESSION EXAMPLE
    - A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
    - A random sample of 10 houses is selected
    - Dependent variable (y) = house price in \$1000s
    - Independent variable (x) = square feet
37. SAMPLE DATA FOR HOUSE PRICE MODEL

    | House Price in \$1000s (y) | Square Feet (x) |
    |---|---|
    | 245 | 1400 |
    | 312 | 1600 |
    | 279 | 1700 |
    | 308 | 1875 |
    | 199 | 1100 |
    | 219 | 1550 |
    | 405 | 2350 |
    | 324 | 2450 |
    | 319 | 1425 |
    | 255 | 1700 |
38. REGRESSION USING EXCEL
    - Tools / Data Analysis / Regression
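39. EXCEL OUTPUT
    The regression equation is: house price = 98.24833 + 0.10977 (square feet)

    | Regression Statistics | |
    |---|---|
    | Multiple R | 0.76211 |
    | R Square | 0.58082 |
    | Adjusted R Square | 0.52842 |
    | Standard Error | 41.33032 |
    | Observations | 10 |

    ANOVA

    | | df | SS | MS | F | Significance F |
    |---|---|---|---|---|---|
    | Regression | 1 | 18934.9348 | 18934.9348 | 11.0848 | |
    | Residual | 8 | 13665.5652 | 1708.1957 | | |
    | Total | 9 | 32600.5000 | | | |

    | | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
    |---|---|---|---|---|---|---|
    | Intercept | 98.24833 | | | 0.1289 | | 232.0738 |
    | Square Feet | 0.10977 | | | 0.01039 | | |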
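The coefficients in the Excel output can be cross-checked by applying the least squares formulas for $b_1$ and $b_0$ directly to the ten sample houses. The following is a minimal pure-Python sketch; the variable names are illustrative.

```python
# Sample data from the slides: square feet (x) and house price in $1000s (y).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(price) / n

# b1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, price))
s_xx = sum((x - mean_x) ** 2 for x in square_feet)
b1 = s_xy / s_xx

# b0 = mean_y - b1 * mean_x
b0 = mean_y - b1 * mean_x

print(round(b0, 5), round(b1, 5))  # 98.24833 0.10977
```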
40. GRAPHICAL PRESENTATION
    House price model: scatter plot and regression line

    [Figure: House Price (\$1000s) plotted against Square Feet with the fitted regression line; intercept = 98.248, slope = 0.10977]

    house price = 98.24833 + 0.10977 (square feet)
41. INTERPRETATION OF THE INTERCEPT, $b_0$
    house price = 98.24833 + 0.10977 (square feet)
    - $b_0$ is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values)
    - Here, no houses had 0 square feet, so $b_0 = 98.24833$ just indicates that, for houses within the range of sizes observed, \$98,248.33 is the portion of the house price not explained by square feet
42. INTERPRETATION OF THE SLOPE COEFFICIENT, $b_1$
    house price = 98.24833 + 0.10977 (square feet)
    - $b_1$ measures the estimated change in the average value of Y as a result of a one-unit change in X
    - Here, $b_1 = 0.10977$ tells us that the value of a house increases by 0.10977 × \$1000 = \$109.77, on average, for each additional square foot of size
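To make the two interpretations concrete, the sketch below plugs sizes into the fitted equation. The 2,000 square foot house is a hypothetical input chosen for illustration (it lies inside the observed range of 1,100 to 2,450 square feet), and `predict_price` is an illustrative name.

```python
# Coefficients of the fitted equation reported on the slides.
B0 = 98.24833   # intercept, in $1000s
B1 = 0.10977    # slope, in $1000s per square foot


def predict_price(square_feet):
    """Predicted house price in $1000s for a given size in square feet."""
    return B0 + B1 * square_feet


predicted = predict_price(2000)  # hypothetical house size
# Adding one square foot changes the prediction by the slope, about $109.77.
one_more_sq_ft = predict_price(2001) - predict_price(2000)

print(round(predicted, 2))               # 317.79, i.e. about $317,790
print(round(one_more_sq_ft * 1000, 2))   # 109.77
```

Predictions for sizes far outside the observed range, including x = 0, should be treated with caution for the reason given in the interpretation of the intercept.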
43. LEAST SQUARES REGRESSION PROPERTIES
    - The sum of the residuals from the least squares regression line is 0: $\sum (y - \hat{y}) = 0$
    - The sum of the squared residuals, $\sum (y - \hat{y})^2$, is a minimum
    - The simple regression line always passes through the mean of the y variable and the mean of the x variable
    - The least squares coefficients are unbiased estimates of $\beta_0$ and $\beta_1$
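The first and third properties can be checked numerically on the house price data. This is a sketch that refits the line from the raw data so that rounding in the published coefficients does not hide the result; names are illustrative, and the checks hold only up to floating-point error.

```python
# Sample data from the slides: square feet (x) and house price in $1000s (y).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(price) / n

# Least squares slope and intercept.
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, price)) / sum(
    (x - mean_x) ** 2 for x in square_feet
)
b0 = mean_y - b1 * mean_x

# Property 1: the residuals sum to zero.
residuals = [y - (b0 + b1 * x) for x, y in zip(square_feet, price)]
residual_sum = sum(residuals)

# Property 3: the fitted line passes through (mean_x, mean_y).
fitted_at_mean_x = b0 + b1 * mean_x

print(abs(residual_sum) < 1e-9)               # True
print(abs(fitted_at_mean_x - mean_y) < 1e-9)  # True
```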
44. EXPLAINED AND UNEXPLAINED VARIATION
    Total variation is made up of two parts:

    $$SST = SSE + SSR$$

    - Total sum of squares: $SST = \sum (y - \bar{y})^2$
    - Sum of squares error: $SSE = \sum (y - \hat{y})^2$
    - Sum of squares regression: $SSR = \sum (\hat{y} - \bar{y})^2$

    where
    - $\bar{y}$ = average value of the dependent variable
    - $y$ = observed values of the dependent variable
    - $\hat{y}$ = estimated value of y for the given x value
45. EXPLAINED AND UNEXPLAINED VARIATION (continued)
    - SST = total sum of squares: measures the variation of the $y_i$ values around their mean $\bar{y}$
    - SSE = error sum of squares: variation attributable to factors other than the relationship between x and y
    - SSR = regression sum of squares: explained variation attributable to the relationship between x and y
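46. EXPLAINED AND UNEXPLAINED VARIATION (continued)
    [Figure: for one observation $(x_i, y_i)$, a plot of y against x marks the fitted value $\hat{y}_i$ and the mean $\bar{y}$, illustrating $SST = \sum (y_i - \bar{y})^2$, $SSE = \sum (y_i - \hat{y}_i)^2$, and $SSR = \sum (\hat{y}_i - \bar{y})^2$]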
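The decomposition can be verified on the house price data, and the resulting sums of squares compared with the ANOVA table in the Excel output. A minimal sketch, with illustrative variable names:

```python
# Sample data from the slides: square feet (x) and house price in $1000s (y).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(price) / n

# Least squares fit.
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, price)) / sum(
    (x - mean_x) ** 2 for x in square_feet
)
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * x for x in square_feet]

sst = sum((y - mean_y) ** 2 for y in price)                  # total variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(price, fitted))  # unexplained
ssr = sum((y_hat - mean_y) ** 2 for y_hat in fitted)         # explained

r_squared = ssr / sst  # share of total variation explained by square feet

print(round(sst, 4), round(ssr, 4), round(sse, 4))  # 32600.5 18934.9348 13665.5652
print(round(r_squared, 5))                          # 0.58082
```

The ratio SSR / SST reproduces the R Square value in the Excel output, which ties the variation decomposition back to the coefficient of determination.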
47. THANKS