DATA ANALYSIS – TESTING FOR ASSOCIATION
Relationship :
 A consistent and systematic link between two or more variables
 ...
Difference between Univariate and Bivariate
Univariate Data
• involving a single variable
Bivariate Data
• involving ...
1) To measure whether relationship is present vide concept of
statistical significance  Whether relation exist between tw...
2) If the relationship is present it is important to know the direction
which can be either Positive or Negative
 Presenc...
3) Understanding strength of association
 In general categorize the strength of association as
a. Nonexistent
...
4) Type of relationship
 If we say two variables can be described as related, then we
would pose this as question “What i...
 In the wake of finding answers to above questions following statistical
methodologies will be applied
a.Covariation
a.Ch...
COVARIATION :
 It is defined as amount of change in one variable that is consistently
related to the change in another va...
SCATTER PLOTS AND
CORRELATION


A scatter plot (or scatter diagram) is used to
show the relationship between two variable...
SCATTER PLOT EXAMPLES
[Figures: scatter plots of y against x illustrating linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
Smoking and Lung Capacity

• We can see easily from the
graph that as smoking
goes up, lung capacity
tends to go down.
• T...
 The formula for calculating covariance of sample data is as follows :
x  = the independent variable
y  = the dependent v...
 Before you compute the covariance, calculate the mean
of x and y
A ) Now you can identify the variables
for the covarianc...
Interpretation :
 The covariance between
the returns of the S&P 500
and economic growth is
1.53.
 Since the covariance i...
Correlation :
 Correlation is another way to determine how two variables are related.
 In addition to telling you whethe...
B) If correlation coefficient is zero
No relationship exists between the variables
 If one variable moves, you can make ...
 To calculate the correlation coefficient for two
variables, you would use the correlation
formula, shown below.

= corre...
 Now you need to
determine the standard
deviation of each of the
variables
 You would calculate the
standard deviation o...
Now calculate the correlation coefficient by substituting the numbers
above into the correlation formula, as shown below.
...
The coefficient of determination is the amount of variability in one measure
that is explained by the other measure
The co...
Spearman Rank Order correlation coefficient :
A statistical measure of linear association between two variables where
both...
INTRODUCTION TO
REGRESSION ANALYSIS


Regression analysis is used to:
 Predict

the value of a dependent variable based ...
SIMPLE LINEAR REGRESSION
MODEL


Only one independent variable, x



Relationship between x and y is described
by a line...
TYPES OF REGRESSION MODELS
Positive Linear
Relationship

Negative Linear
Relationship

Relationship NOT Linear

No Relatio...
POPULATION LINEAR REGRESSION
The population regression
model:
Population
Dependent
Variable

y intercept

Populatio
n Slop...
LINEAR REGRESSION
ASSUMPTIONS


Error values (ε) are statistically independent



Error values are normally distributed ...
POPULATION LINEAR REGRESSION (continued)
y = β0 + β1x + ε
[Figure: regression line on x and y axes, marking the observed value of y for xi, the random error εi, and the predicted value of y]
...
ESTIMATED REGRESSION MODEL
The sample regression line provides an estimate
of the population regression line
Estimated
(or...
LEAST SQUARES CRITERION


b0 and b1 are obtained by finding the values of b0
and b1 that minimize the sum of the squared
...
THE LEAST SQUARES EQUATION


The formulas for b1 and b0 are:

$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
algebraic
equivalen...
INTERPRETATION OF THE
SLOPE AND THE INTERCEPT
b0 is the estimated average value of y when the value of x is zero

b

i...
FINDING THE LEAST
SQUARES EQUATION
The

coefficients b0 and b1 will
usually be found using computer
software, such as Exc...
SIMPLE LINEAR REGRESSION
EXAMPLE


A real estate agent wishes to examine the
relationship between the selling price of a ...
SAMPLE DATA FOR HOUSE
PRICE MODEL
House Price in $1000s (y) | Square Feet (x)
245 | 1400
312 | 1600
279 | 1700
308 | 1875

...
REGRESSION USING EXCEL


Tools / Data Analysis / Regression
EXCEL OUTPUT
Regression Statistics
Multiple R: 0.76211
R Square: 0.58082
Adjusted R Square
The regression equation is:
...
GRAPHICAL PRESENTATION
House price model: scatter plot and regression line
[Figure: scatter plot with fitted regression line; vertical axis: House Price ($1000s); intercept = 98.248]
...
INTERPRETATION OF THE
INTERCEPT, B0

house price = 98.24833 + 0.10977 (square feet)


b0 is the estimated average value o...
INTERPRETATION OF THE
SLOPE COEFFICIENT, B1

house price = 98.24833 + 0.10977 (square feet)
b

measures the estimated cha...
LEAST SQUARES REGRESSION
PROPERTIES
 The

sum of the residuals from the least squares regression line is 0 ( ∑ ( y − ŷ ...
EXPLAINED AND
UNEXPLAINED VARIATION


Total variation is made up of two parts:

SST =
Total sum
of Squares

SST = ∑ ( y −...
EXPLAINED AND
UNEXPLAINED VARIATION
(continued)


SST = total sum of squares
 Measures

the variation of the yi values a...
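EXPLAINED AND UNEXPLAINED VARIATION (continued)
[Figure: regression line with one observation (Xi, yi), showing its deviations from the fitted value ŷi and from the mean ȳ]
SSE = ∑(yi − ŷi)²
SST = ∑(yi − ȳ)²
SSR = ∑...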
THANKS……
Data analysis   test for association BY Prof Sachin Udepurkar

Test of Association - Bivariate Analysis.

To interpret relationship between variables

Published in: Technology, Health & Medicine

    1. 1. DATA ANALYSIS – TESTING FOR ASSOCIATION Relationship: • A consistent and systematic link between two or more variables • When interpreting the relationship between variables, the following aspects are taken into account: 1. Whether two or more variables are related at all, i.e. whether a relationship is present, as judged by statistical significance 2. If a relationship is present, its direction, which can be either positive or negative 3. The strength of association 4. The type of relationship
    2. 2. Difference between Univariate and Bivariate
        Univariate Data: • involving a single variable • does not deal with causes or relationships • the major purpose of univariate analysis is to describe • central tendency: mean, mode, median • dispersion: range, variance, max, min, quartiles, standard deviation • frequency distributions • bar graph, histogram, pie chart, line graph, box-and-whisker plot • Sample question: How many of the students in the freshman class are female?
        Bivariate Data: • involving two variables • deals with causes or relationships • the major purpose of bivariate analysis is to explain • analysis of two variables simultaneously • correlations • comparisons, relationships, causes, explanations • tables where one variable is contingent on the values of the other variable • independent and dependent variables • Sample question: Is there a relationship between the number of females in Computer Programming and their scores in Mathematics?
    3. 3. 1) To measure whether a relationship is present, using the concept of statistical significance • Whether a relationship exists between two or more variables • If we test for statistical significance and find that it exists, then we say that a relationship is present • Stated another way, knowledge about the behavior of one variable allows us to make a useful prediction about the behavior of another • For example: if we found a statistically significant relationship between perceptions of the quality of Santa Fe Grill food and satisfaction, we would say a relationship is present and that perceptions of the quality of food tell us what the perceptions of satisfaction are likely to be
    4. 4. 2) If a relationship is present, it is important to know its direction, which can be either positive or negative • Presence of a relationship precedes direction • The direction of a relationship can be either positive or negative • For example: using the Santa Fe Grill example, we could say that a positive relationship exists if respondents who rate the quality of food high are also highly satisfied. Similarly, a negative relationship exists if respondents say the speed of service is slow (low rating) but they are still satisfied (high rating)
    5. 5. 3) Understanding strength of association • In general, the strength of association is categorized as: a. Nonexistent b. Weak c. Moderate d. Strong • If a consistent and systematic relationship is not present, then the strength of association is nonexistent • A weak association means there is a low probability that the variables have a relationship • A strong association means there is a high probability that a consistent and systematic relationship exists
    6. 6. 4) Type of relationship • If we say two variables can be described as related, then we would pose the question "What is the nature of the relationship?", that is, how can the link between variables Y and X best be described? • There are a number of different ways in which two variables (X and Y) can share a relationship
    7. 7. • To answer the above questions, the following statistical methodologies are applied: a. Covariation b. Chi-square test c. Correlation coefficient (1. Pearson correlation coefficient 2. Coefficient of determination 3. Spearman rank order correlation coefficient) d. Regression analysis
    8. 8. COVARIATION: • It is defined as the amount of change in one variable that is consistently related to the change in another variable of interest, or the degree of association between two items or variables • For example: if we know DVD purchases are related to age, then we want to know the extent to which younger persons purchase more DVDs and ultimately which types of DVDs • If two variables are found to change together on a reliable or consistent basis, then we can use that information to make predictions as well as decisions on advertising and marketing strategies • For example: the change in attitude towards a Starbucks coffee advertising campaign as it varies between light, medium and heavy consumers of Starbucks coffee
    9. 9. SCATTER PLOTS AND CORRELATION • A scatter plot (or scatter diagram) is used to show the relationship between two variables
    10. 10. SCATTER PLOT EXAMPLES [Figure: scatter plots of y against x showing linear relationships and curvilinear relationships]
    11. 11. SCATTER PLOT EXAMPLES (continued) [Figure: scatter plots of y against x showing strong relationships and weak relationships]
    12. 12. SCATTER PLOT EXAMPLES (continued) [Figure: scatter plots of y against x showing no relationship]
    13. 13. Smoking and Lung Capacity • One easy way to visually describe covariation between two variables is a scatter diagram, which is a graphic plot of the relative position of two variables using a horizontal and a vertical axis to represent the values of the respective variables • We can see easily from the graph that as smoking goes up, lung capacity tends to go down. • The two variables covary in opposite directions. • We now examine two statistics, covariance and correlation, for quantifying how variables covary.
        Cigarettes (X) | Lung Capacity (Y)
        0 | 45
        5 | 42
        10 | 33
        15 | 31
        20 | 29
        [Figure: scatter plot of Lung Capacity (vertical axis) against Smoking (horizontal axis)]
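The visual impression from the smoking table can be checked numerically. The following is a minimal sketch in plain Python, assuming the sample (n − 1) form of covariance and the Pearson form of correlation; the variable names are illustrative and not from the slides.

```python
from math import sqrt

cigarettes = [0, 5, 10, 15, 20]        # X, from the slide's table
lung_capacity = [45, 42, 33, 31, 29]   # Y, from the slide's table

n = len(cigarettes)
mean_x = sum(cigarettes) / n
mean_y = sum(lung_capacity) / n

# Sums of cross-products and squared deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cigarettes, lung_capacity))
sxx = sum((x - mean_x) ** 2 for x in cigarettes)
syy = sum((y - mean_y) ** 2 for y in lung_capacity)

covariance = sxy / (n - 1)             # sample covariance (assumed n - 1 divisor)
correlation = sxy / sqrt(sxx * syy)    # Pearson correlation coefficient

print(f"covariance = {covariance:.2f}, correlation = {correlation:.3f}")
```

Both statistics come out negative (roughly −53.75 and −0.96), which matches the statement that the two variables covary in opposite directions.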
    14. 14. • The formula for calculating the covariance of sample data is as follows: $\mathrm{COV}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$, where x = the independent variable, y = the dependent variable, n = number of data points in the sample, $\bar{x}$ = the mean of the independent variable x, $\bar{y}$ = the mean of the dependent variable y • Example: to understand how covariance is used, consider the table, which describes the rate of economic growth (xi) and the rate of return on the S&P 500 (yi) • Using the covariance formula, you can determine whether economic growth and S&P 500 returns have a positive or inverse relationship.
    15. 15. • Before you compute the covariance, calculate the mean of x and y A) Now you can identify the variables for the covariance formula as follows: x = 2.1, 2.5, 4.0, and 3.6 (economic growth); y = 8, 12, 14, and 10 (S&P 500 returns); $\bar{x}$ = 3.05 (shown rounded as 3.1); $\bar{y}$ = 11 B) Substitute these values into the covariance formula to determine the relationship between economic growth and S&P 500 returns.
    16. 16. Interpretation: • The covariance between the returns of the S&P 500 and economic growth is 1.53. • Since the covariance is positive, the variables are positively related; they move together in the same direction
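The covariance of 1.53 can be reproduced from the four data points given in the example. This is a minimal sketch, assuming the sample covariance divides by n − 1 (which is the form that yields 1.53); the function name `sample_covariance` is hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by (n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1)

economic_growth = [2.1, 2.5, 4.0, 3.6]   # x, from the slide
sp500_returns = [8, 12, 14, 10]          # y, from the slide

cov_xy = sample_covariance(economic_growth, sp500_returns)
print(round(cov_xy, 2))
```

The positive sign is what carries the interpretation; the magnitude depends on the units of x and y, which is why the slides move on to correlation.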
    18. 18. Correlation: • Correlation is another way to determine how two variables are related. • In addition to telling you whether variables are positively or inversely related, correlation also tells you the degree to which the variables tend to move together • Correlation standardizes the measure of interdependence between two variables and, consequently, tells you how closely the two variables move. • The correlation measurement, called the Pearson correlation coefficient, always takes a value between –1 and 1 A) If the correlation coefficient is 1: the variables have a perfect positive correlation. This means that if one variable moves a given amount, the second moves proportionally in the same direction. A positive correlation coefficient less than one indicates a less than perfect positive correlation, with the strength of the correlation growing as the number approaches one.
    19. 19. B) If the correlation coefficient is zero: no linear relationship exists between the variables • If one variable moves, you can make no predictions about the movement of the other variable; they are uncorrelated. C) If the correlation coefficient is –1: • The variables are perfectly negatively correlated (or inversely correlated) and move in opposition to each other • If one variable increases, the other variable decreases proportionally • A negative correlation coefficient greater than –1 indicates a less than perfect negative correlation, with the strength of the correlation growing as the number approaches –1
    20. 20. • To calculate the correlation coefficient for two variables, you would use the correlation formula: $r(x, y) = \frac{\mathrm{COV}(x, y)}{s_x s_y}$, where r(x, y) = correlation of the variables x and y, COV(x, y) = covariance of the variables x and y, sx = sample standard deviation of the random variable x, sy = sample standard deviation of the random variable y • To calculate correlation, you must know the covariance for the two variables and the standard deviations of each variable • From the earlier example, you know that the covariance of S&P 500 returns and
    21. 21. • Now you need to determine the standard deviation of each of the variables • You would calculate the standard deviation of the S&P 500 returns and the economic growth • Using the information from above, you know that COV(x, y) = 1.53, sx = 0.90, sy = 2.58
    22. 22. Now calculate the correlation coefficient by substituting the numbers above into the correlation formula: $r(x, y) = \frac{1.53}{0.90 \times 2.58} \approx 0.66$. A correlation coefficient of 0.66 tells you two important things: • Because the correlation coefficient is a positive number, returns on the S&P 500 and economic growth are positively related. • Because 0.66 is relatively far from zero (which would indicate no correlation), the correlation between returns on the S&P 500 and economic growth is strong
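The same correlation can be computed directly from the raw data rather than from the rounded intermediate values. A minimal sketch, assuming sample (n − 1) standard deviations as the slides describe; the helper names `sample_std` and `pearson_r` are hypothetical.

```python
from math import sqrt

def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def pearson_r(xs, ys):
    """Correlation = sample covariance / (s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (sample_std(xs) * sample_std(ys))

economic_growth = [2.1, 2.5, 4.0, 3.6]   # x, from the slides
sp500_returns = [8, 12, 14, 10]          # y, from the slides

s_x = sample_std(economic_growth)
s_y = sample_std(sp500_returns)
r = pearson_r(economic_growth, sp500_returns)
print(round(s_x, 2), round(s_y, 2), round(r, 2))
```

Because the result is scaled by both standard deviations, it is unit-free and always falls between –1 and 1, unlike the covariance.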
    23. 23. The coefficient of determination is the amount of variability in one measure that is explained by the other measure. It is the square of the Pearson correlation coefficient, written r². For example, if the correlation coefficient between two variables is r = 0.90, the coefficient of determination is (0.90)² = 0.81. This number ranges from 0.00 to 1.0 and shows the proportion of variation in one variable that is explained or accounted for by the other
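As a quick arithmetic check of the squaring step, here is a small sketch; the function name is hypothetical, and applying it to the 0.66 correlation from the earlier S&P 500 example is an illustration added here, not a figure from the slides.

```python
def coefficient_of_determination(r):
    """Square of a Pearson correlation coefficient r, where -1 <= r <= 1."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return r * r

r_squared_slide = coefficient_of_determination(0.90)   # the slide's example
r_squared_sp500 = coefficient_of_determination(0.66)   # illustrative reuse of the earlier r

print(round(r_squared_slide, 2), round(r_squared_sp500, 2))
```

Note that squaring discards the sign, so r = 0.90 and r = –0.90 explain the same proportion of variation.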
    24. 24. Spearman Rank Order correlation coefficient: A statistical measure of association between two variables where both have been measured using ordinal (rank order) scales; it measures the linear association between the two sets of ranks Example:
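The slide's own example is not included in the transcript, so the following sketch uses made-up ranks purely for illustration. It assumes the common shortcut formula r_s = 1 − 6Σd² / (n(n² − 1)), which holds only when there are no tied ranks; the data and names are hypothetical.

```python
def spearman_rank_correlation(ranks_a, ranks_b):
    """Spearman coefficient from two rankings without ties."""
    if len(ranks_a) != len(ranks_b) or len(ranks_a) < 2:
        raise ValueError("need two equally long rankings with at least 2 items")
    n = len(ranks_a)
    # d is the difference between the two ranks given to the same item
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Hypothetical data: two raters each rank the same five items from 1 to 5
rater_1 = [1, 2, 3, 4, 5]
rater_2 = [2, 1, 4, 3, 5]

r_s = spearman_rank_correlation(rater_1, rater_2)
print(round(r_s, 2))
```

With ties, the usual approach is to assign average ranks and compute the Pearson correlation on those ranks instead of using the shortcut.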
    25. 25. INTRODUCTION TO REGRESSION ANALYSIS • Regression analysis is used to: • Predict the value of a dependent variable based on the value of at least one independent variable • Explain the impact of changes in an independent variable on the dependent variable • Dependent variable: the variable we wish to explain • Independent variable: the variable used to explain the dependent variable
    26. 26. SIMPLE LINEAR REGRESSION MODEL • Only one independent variable, x • Relationship between x and y is described by a linear function • Changes in y are assumed to be caused by changes in x
    27. 27. TYPES OF REGRESSION MODELS [Figure: four scatter plots labelled Positive Linear Relationship, Negative Linear Relationship, Relationship NOT Linear, and No Relationship]
    28. 28. POPULATION LINEAR REGRESSION The population regression model: y = β0 + β1x + ε, where y = dependent variable, β0 = population y intercept, β1 = population slope coefficient, x = independent variable, ε = random error term, or residual. β0 + β1x is the linear component and ε is the random error component
    29. 29. LINEAR REGRESSION ASSUMPTIONS • Error values (ε) are statistically independent • Error values are normally distributed for any given value of x • The probability distribution of the errors is normal • The probability distribution of the errors has constant variance • The underlying relationship between the x variable and the y variable is linear
    30. 30. POPULATION LINEAR REGRESSION (continued) y = β0 + β1x + ε [Figure: regression line with intercept β0 and slope β1 on x and y axes, marking the observed value of y for xi, the predicted value of y for xi, and the random error εi for this x value]
    31. 31. ESTIMATED REGRESSION MODEL The sample regression line provides an estimate of the population regression line: $\hat{y}_i = b_0 + b_1 x$, where $\hat{y}_i$ = estimated (or predicted) y value, b0 = estimate of the regression intercept, b1 = estimate of the regression slope, x = independent variable. The individual random error terms ei have a mean of zero
    32. 32. LEAST SQUARES CRITERION • b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals: $\sum e^2 = \sum (y - \hat{y})^2 = \sum (y - (b_0 + b_1 x))^2$
    33. 33. THE LEAST SQUARES EQUATION • The formulas for b1 and b0 are: $b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$, with the algebraic equivalent $b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$, and $b_0 = \bar{y} - b_1 \bar{x}$
    34. 34. INTERPRETATION OF THE SLOPE AND THE INTERCEPT • b0 is the estimated average value of y when the value of x is zero • b1 is the estimated change in the average value of y as a result of a one-unit change in x
    35. 35. FINDING THE LEAST SQUARES EQUATION • The coefficients b0 and b1 will usually be found using computer software, such as Excel or Minitab • Other regression measures will also be computed as part of computer-based regression analysis
    36. 36. SIMPLE LINEAR REGRESSION EXAMPLE • A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet) • A random sample of 10 houses is selected • Dependent variable (y) = house price in $1000s • Independent variable (x) = square feet
    37. 37. SAMPLE DATA FOR HOUSE PRICE MODEL
        House Price in $1000s (y) | Square Feet (x)
        245 | 1400
        312 | 1600
        279 | 1700
        308 | 1875
        199 | 1100
        219 | 1550
        405 | 2350
        324 | 2450
        319 | 1425
        255 | 1700
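The least squares formulas for b1 and b0 can be applied to this sample by hand or in a few lines of code. The following is a minimal plain-Python sketch rather than the Excel or Minitab route the slides use; the function name `least_squares` is hypothetical.

```python
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]          # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

def least_squares(xs, ys):
    """Return (b0, b1) for the simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # intercept
    return b0, b1

b0, b1 = least_squares(sqft, prices)
print(f"house price = {b0:.5f} + {b1:.5f} (square feet)")
```

The printed coefficients should agree with the regression equation reported for this example, 98.24833 and 0.10977.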
    38. 38. REGRESSION USING EXCEL  Tools / Data Analysis / Regression
    39. 39. EXCEL OUTPUT The regression equation is: house price = 98.24833 + 0.10977 (square feet)
        Regression Statistics
        Multiple R | 0.76211
        R Square | 0.58082
        Adjusted R Square | 0.52842
        Standard Error | 41.33032
        Observations | 10
        ANOVA
        Source | df | SS | MS | F
        Regression | 1 | 18934.9348 | 18934.9348 | 11.0848
        Residual | 8 | 13665.5652 | 1708.1957 |
        Total | 9 | 32600.5000 | |
        Coefficients (columns: Coefficients, Standard Error, t Stat, P-value, Lower 95%, Upper 95%; values legible in the transcript: P-values 0.1289 and 0.01039, Upper 95% 232.0738)
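The summary statistics in this output can be recomputed from the ten observations. This is a sketch under the usual simple-regression definitions (one predictor, so the residual degrees of freedom are n − 2); it does not call Excel or any statistics library, and the variable names are illustrative.

```python
from math import sqrt

prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]          # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(prices)
mean_x = sum(sqft) / n
mean_y = sum(prices) / n
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, prices))
      / sum((x - mean_x) ** 2 for x in sqft))
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * x for x in sqft]
sst = sum((y - mean_y) ** 2 for y in prices)               # total sum of squares
sse = sum((y - f) ** 2 for y, f in zip(prices, fitted))    # residual sum of squares
ssr = sst - sse                                            # regression sum of squares

r_square = ssr / sst
multiple_r = sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)
standard_error = sqrt(sse / (n - 2))          # square root of the residual mean square
f_statistic = (ssr / 1) / (sse / (n - 2))     # regression MS divided by residual MS

print(round(multiple_r, 5), round(r_square, 5), round(adjusted_r_square, 5))
print(round(standard_error, 5), round(f_statistic, 4))
```

Each value can be compared against the corresponding cell of the Regression Statistics and ANOVA tables.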
    40. 40. GRAPHICAL PRESENTATION House price model: scatter plot and regression line [Figure: scatter plot of House Price ($1000s, vertical axis, 0 to 450) against Square Feet (horizontal axis, 0 to 3000) with the fitted line; intercept = 98.248, slope = 0.10977] house price = 98.24833 + 0.10977 (square feet)
    41. 41. INTERPRETATION OF THE INTERCEPT, B0 house price = 98.24833 + 0.10977 (square feet) • b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values) • Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
    42. 42. INTERPRETATION OF THE SLOPE COEFFICIENT, B1 house price = 98.24833 + 0.10977 (square feet) • b1 measures the estimated change in the average value of Y as a result of a one-unit change in X • Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000) = $109.77, on average, for each additional one square foot of size
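To make the slope interpretation concrete, the fitted equation can be used for prediction. In this sketch the coefficients are taken from the regression equation above, while the 2,000 square foot house is a hypothetical input chosen to lie inside the observed size range (1,100 to 2,450 square feet).

```python
B0 = 98.24833   # intercept from the regression equation, in $1000s
B1 = 0.10977    # slope from the regression equation, in $1000s per square foot

def predict_price(square_feet):
    """Predicted house price in $1000s for a given size."""
    return B0 + B1 * square_feet

price_2000 = predict_price(2000)                       # hypothetical 2,000 sq ft house
extra_per_sqft = predict_price(2001) - predict_price(2000)

print(f"predicted price: ${price_2000 * 1000:,.2f}")
print(f"change per additional square foot: ${extra_per_sqft * 1000:,.2f}")
```

Predictions for sizes well outside the observed range, such as 0 square feet, are extrapolations and should be read with the caution the intercept slide describes.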
    43. 43. LEAST SQUARES REGRESSION PROPERTIES • The sum of the residuals from the least squares regression line is 0: $\sum (y - \hat{y}) = 0$ • The sum of the squared residuals, $\sum (y - \hat{y})^2$, is a minimum • The simple regression line always passes through the mean of the y variable and the mean of the x variable • The least squares coefficients are unbiased estimates of β0 and β1
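The first three properties can be checked numerically on the house price sample. This is a minimal sketch; the nudged alternative line used to illustrate the minimum property is an arbitrary choice made here for demonstration.

```python
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]          # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(prices)
mean_x = sum(sqft) / n
mean_y = sum(prices) / n
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, prices))
      / sum((x - mean_x) ** 2 for x in sqft))
b0 = mean_y - b1 * mean_x

def sum_squared_residuals(intercept, slope):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(sqft, prices))

# Property 1: residuals sum to zero (up to floating-point rounding)
residual_sum = sum(y - (b0 + b1 * x) for x, y in zip(sqft, prices))

# Property 2: any other line has a larger sum of squared residuals
sse_fitted = sum_squared_residuals(b0, b1)
sse_nudged = sum_squared_residuals(b0 + 1.0, b1 * 1.01)   # arbitrary alternative line

# Property 3: the line passes through (mean of x, mean of y)
value_at_mean_x = b0 + b1 * mean_x

print(residual_sum, sse_fitted < sse_nudged, value_at_mean_x, mean_y)
```

The unbiasedness property concerns repeated sampling and cannot be confirmed from a single sample like this one.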
    44. 44. EXPLAINED AND UNEXPLAINED VARIATION • Total variation is made up of two parts: SST = SSE + SSR, where SST = total sum of squares, $SST = \sum (y - \bar{y})^2$; SSE = sum of squares error, $SSE = \sum (y - \hat{y})^2$; SSR = sum of squares regression, $SSR = \sum (\hat{y} - \bar{y})^2$; and where $\bar{y}$ = average value of the dependent variable, y = observed values of the dependent variable, $\hat{y}$ = estimated value of y for the given x value
    45. 45. EXPLAINED AND UNEXPLAINED VARIATION (continued) • SST = total sum of squares • Measures the variation of the yi values around their mean $\bar{y}$ • SSE = error sum of squares • Variation attributable to factors other than the relationship between x and y • SSR = regression sum of squares • Explained variation attributable to the relationship between x and y
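The decomposition of total variation can be verified on the house price sample by computing each sum of squares from its own definition. A minimal sketch with illustrative variable names:

```python
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]          # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(prices)
mean_x = sum(sqft) / n
mean_y = sum(prices) / n
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, prices))
      / sum((x - mean_x) ** 2 for x in sqft))
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * x for x in sqft]

sst = sum((y - mean_y) ** 2 for y in prices)                # total variation
sse = sum((y - f) ** 2 for y, f in zip(prices, fitted))     # unexplained variation
ssr = sum((f - mean_y) ** 2 for f in fitted)                # explained variation

explained_share = ssr / sst    # proportion of variation explained (R Square)

print(round(sst, 4), round(ssr, 4), round(sse, 4), round(explained_share, 5))
```

Here SSR is computed independently rather than as SST − SSE, so the fact that the two parts add up to the total is a genuine check of the identity.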
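    46. 46. EXPLAINED AND UNEXPLAINED VARIATION (continued) [Figure: regression line with one observation (Xi, yi), its fitted value $\hat{y}_i$, and the mean $\bar{y}$, illustrating the three deviations] $SSE = \sum (y_i - \hat{y}_i)^2$, $SST = \sum (y_i - \bar{y})^2$, $SSR = \sum (\hat{y}_i - \bar{y})^2$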
    47. 47. THANKS……
