INTRODUCTION TO STATISTICAL THEORY FOR SCIENTIST
CORRELATION AND REGRESSION
• Typical questions are: “Are two or more variables linearly related? If so, what is the strength of the relationship?”
• The numerical measure used to determine whether two or more variables are linearly related, and to determine the strength of that relationship, is called the CORRELATION COEFFICIENT.
• There are two types of relationship: SIMPLE RELATIONSHIP and MULTIPLE RELATIONSHIP.
• CORRELATION: statistical method used to determine whether a linear relationship between variables exists.
• REGRESSION: used to describe the nature of the relationship between variables; positive/negative or linear/nonlinear.
• SIMPLE REGRESSION: has two variables; an independent variable (explanatory) and a dependent variable (response).
• MULTIPLE REGRESSION: two or more independent variables are used to predict one dependent variable.
• POSITIVE RELATIONSHIP: both variables increase or decrease at the same time.
• NEGATIVE RELATIONSHIP: as one variable increases, the other variable decreases, and vice versa.
Scatter plots and Correlation
• To find the relationship between two different variables, data need to be collected. Example: the relationship between the number of hours of study and exam grades.
• The independent variable is the variable that can be controlled or manipulated, while the dependent variable cannot.
• The dependent and independent variables can be plotted in a graph called a scatter plot.
• The independent variable x is plotted on the horizontal axis and the dependent variable y on the vertical axis.
• A scatter plot is a visual way to show the relationship between two variables.
SCATTER PLOT is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y.

Company   Cars (in ten thousands)   Revenue (in billions)
A         63                        7
B         29                        3.9
C         20.8                      2.1
D         19.1                      2.8
E         13.4                      1.4
F         8.5                       1.5
Correlation
• The correlation explained here is the Pearson Product Moment Correlation Coefficient (PPMC), named after Karl Pearson.
• The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two quantitative variables.
• The symbol for the sample correlation coefficient is r, while ρ (rho) is used for the population correlation coefficient.
• The value of the correlation coefficient ranges from -1 to +1.
• A value close to +1 shows a strong positive linear correlation, while a value close to -1 shows a strong negative linear correlation.
• A value of r close to zero means that there is no linear relationship between the variables, or only a weak one.
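The slide does not state a formula for r. As a minimal sketch, the standard computational formula r = (nΣxy − ΣxΣy) / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²]) can be applied to the cars and revenue data from the scatter plot table. The function name pearson_r is an illustrative choice, not something from the slides.

```python
import math

# Data from the cars/revenue table
# (cars in ten thousands, revenue in billions).
cars = [63, 29, 20.8, 19.1, 13.4, 8.5]
revenue = [7, 3.9, 2.1, 2.8, 1.4, 1.5]


def pearson_r(x, y):
    """Sample correlation coefficient r via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(cars, revenue)
print(round(r, 3))  # 0.982
```

A value this close to +1 indicates a strong positive linear relationship between the number of cars and revenue in this sample.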
Regression
• We previously tested the significance of the correlation coefficient. If the correlation is significant, the next step is to determine the equation of the regression line.
• LINE OF BEST FIT: best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.
• A best fit is needed because the values of y will be predicted from the values of x; hence, the closer the points are to the line, the better the prediction will be.
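The line of best fit can be sketched with the usual least-squares formulas for the slope b and intercept a of y' = a + bx. The example below assumes the cars and revenue data from the scatter plot table; fit_line is a hypothetical helper name used only for illustration.

```python
# Data from the cars/revenue table
# (cars in ten thousands, revenue in billions).
cars = [63, 29, 20.8, 19.1, 13.4, 8.5]
revenue = [7, 3.9, 2.1, 2.8, 1.4, 1.5]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y' = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    # Slope minimizes the sum of squared vertical distances to the line.
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # The least-squares line passes through the point (mean x, mean y).
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


a, b = fit_line(cars, revenue)
print(f"y' = {a:.3f} + {b:.3f}x")  # y' = 0.396 + 0.106x
```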
• MARGINAL CHANGE: the magnitude of the change in one variable when the other variable changes by exactly 1 unit.
• See example 10-9; the slope of the regression line is 0.106, which means that for each increase of 10,000 cars, the value of y changes by 0.106 unit ($106 million) on average.
• EXTRAPOLATION: making predictions beyond the bounds of the data.
• When predictions are made, they are based on present conditions or on the premise that present trends will continue.
• OUTLIER: a point that seems out of place when compared with the other points.
• Some of these points can affect the equation of the regression line; such points are called influential points or influential observations.
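Marginal change and extrapolation can be illustrated with a short sketch, assuming the line is refitted to the cars and revenue data from the scatter plot table. The helpers fit_line and predict are hypothetical names; predict also reports whether the requested x lies outside the observed range, which is one simple way to flag extrapolation.

```python
# Data from the cars/revenue table
# (cars in ten thousands, revenue in billions).
cars = [63, 29, 20.8, 19.1, 13.4, 8.5]
revenue = [7, 3.9, 2.1, 2.8, 1.4, 1.5]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y' = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


def predict(x_new, x_data, intercept, slope):
    """Return (predicted y, True if x_new is outside the observed x range)."""
    is_extrapolation = not (min(x_data) <= x_new <= max(x_data))
    return intercept + slope * x_new, is_extrapolation


intercept, slope = fit_line(cars, revenue)

# Marginal change: a 1-unit increase in x (10,000 cars) changes the
# predicted y by exactly the slope.
y_at_20, _ = predict(20, cars, intercept, slope)
y_at_21, _ = predict(21, cars, intercept, slope)
marginal_change = y_at_21 - y_at_20
print(round(marginal_change, 3))  # 0.106

# 40 lies inside the observed range (8.5 to 63); 100 lies outside it.
inside_flag = predict(40, cars, intercept, slope)[1]
outside_flag = predict(100, cars, intercept, slope)[1]
print(inside_flag, outside_flag)  # False True
```

A prediction flagged as extrapolation is still computed, but it relies on the premise that the fitted trend continues beyond the data.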
Coefficient of determination

x   1    2    3    4    5
y   10   8    12   16   20
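The slide gives only the data, so the following is a sketch under the usual definition: the coefficient of determination r² is the ratio of the explained variation to the total variation, where the explained variation comes from the least-squares line fitted to the x and y values above. The variable names are illustrative.

```python
x = [1, 2, 3, 4, 5]
y = [10, 8, 12, 16, 20]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept of y' = intercept + slope * x.
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * a for a in x]

# Total variation splits into explained and unexplained parts.
total_variation = sum((b - mean_y) ** 2 for b in y)
explained_variation = sum((p - mean_y) ** 2 for p in predicted)
unexplained_variation = sum((b - p) ** 2 for b, p in zip(y, predicted))

r_squared = explained_variation / total_variation

print(round(slope, 3), round(intercept, 3))  # 2.8 4.8
print(round(total_variation, 1),
      round(explained_variation, 1),
      round(unexplained_variation, 1))  # 92.8 78.4 14.4
print(round(r_squared, 3))  # 0.845
```

Under these assumptions, about 84.5% of the variation in y is explained by the regression line, and the remainder is unexplained.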