What is regression analysis? <ul><li>Regression analysis is a technique for measuring the relationship between two interval- or ratio-level variables. </li></ul><ul><li>The regression framework is at the heart of empirical social and political science research. </li></ul><ul><li>Regression analysis acts as a statistical surrogate for controlled experiments, and can be used to make causal inferences. </li></ul>
Regression models <ul><li>Researchers translate verbal theories, hypotheses, even hunches into models. </li></ul><ul><li>A model shows how and under what conditions two (or more) variables are related. </li></ul><ul><li>A regression model with a dependent variable and one independent variable is known as a bivariate regression model. </li></ul><ul><li>A regression model with a dependent variable and two or more independent variables and/or control variables is known as a multivariate regression model. </li></ul>
Scatterplots <ul><li>A scatterplot graphs the sample observations by plotting each one as a point at its X and Y values. </li></ul><ul><li>The X axis generally represents the values of the independent variable, and the Y axis usually represents the values of the dependent variable. </li></ul><ul><li>X is the horizontal axis; Y is the vertical axis. </li></ul>
Scatterplots <ul><li>Scatterplots allow you to study the pattern of the dots, that is, the relationship between the two variables. </li></ul><ul><li>Scatterplots allow political scientists to identify: </li></ul><ul><li>-- positive or negative relationships -- monotonic or linear relationships </li></ul>
Regression Equation The linear equation is specified as follows: Y = a + bX, where Y = dependent variable; X = independent variable; a = constant (value of Y when X = 0); b = slope of the regression line (change in Y for a one-unit change in X).
Regression Equation <ul><li>Y = a + bX </li></ul><ul><li>a can be positive or negative. In high school algebra, you may have referred to a as the intercept. This is because a is the point at which the regression line crosses the Y axis. </li></ul><ul><li>b (the slope coefficient) can be positive or negative. A positive coefficient denotes a positive relationship and a negative coefficient denotes a negative relationship. </li></ul><ul><li>The substantive interpretation of the slope coefficient depends on the variables involved, how they are coded, and the scale on which they are measured. Larger coefficients may indicate a stronger relationship, but not necessarily. </li></ul>
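The point about coding and scale can be illustrated with a minimal Python sketch. The coefficients and the helper name `predict` are assumptions made up for illustration; they do not come from the slides. The sketch shows that measuring X in different units changes the size of b without changing the predicted values of Y.

```python
def predict(a, b, x):
    """Return the value of Y on the line Y = a + bX."""
    return a + b * x

# Hypothetical coefficients, with X measured in single units.
a = 10.0
b_per_unit = 0.002

# The same line with X measured in thousands of units:
# the slope is 1,000 times larger, but the relationship is identical.
b_per_thousand = b_per_unit * 1000

x_units = 5000
x_thousands = x_units / 1000

y_from_units = predict(a, b_per_unit, x_units)
y_from_thousands = predict(a, b_per_thousand, x_thousands)

# Both codings give the same prediction (up to floating-point rounding).
print(abs(y_from_units - y_from_thousands) < 1e-9)  # True
```

This is why a larger coefficient does not by itself signal a stronger relationship: its size depends on the units of X and Y.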
The Regression Model <ul><li>The goal of regression analysis is to find an equation which “best fits” the data. </li></ul><ul><li>In regression, the equation is chosen so that its graph is the line that minimizes the sum of the squared vertical distances between the data points and the line. </li></ul>
<ul><li>d1 and d2 represent the vertical distances of observed data points from an estimated regression line. </li></ul><ul><li>Regression analysis uses a mathematical procedure that finds the single line that minimizes the sum of the squared distances of the data points from the line. </li></ul>
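The idea of a single best-fitting line can be sketched in Python using the standard closed-form least-squares formulas for one independent variable. These formulas are a well-known result but are not spelled out in the slides, and the data and function names below are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line Y = a + bX that minimizes the
    sum of squared vertical distances to the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation in X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

def sum_squared_distances(xs, ys, a, b):
    """Sum of squared vertical distances between the points and Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
best = sum_squared_distances(xs, ys, a, b)

# Any other line, such as one with a slightly different slope, fits worse.
worse = sum_squared_distances(xs, ys, a, b + 0.1)
print(best < worse)  # True
```

For these made-up points the fitted line is approximately Y = 2.2 + 0.6X.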
Regression Equation The standard regression equation is the same as the linear equation with one exception: the error term. Y = α + βX + ε Where Y = dependent variable α = constant term β = slope or regression coefficient X = independent variable ε = error term
Regression Equation This regression procedure is known as ordinary least squares (OLS). α (the constant term) is interpreted the same as before: it is the value of Y when X = 0. β (the regression coefficient) tells how much Y is predicted to change when X changes by one unit. The sign of the regression coefficient indicates the direction of the relationship between the two quantitative variables, and its size indicates how large the predicted change is in the units of Y.
Regression Equation The error (ε) reflects the fact that observed data do not follow a neat pattern that can be summarized exactly with a straight line. An observation's score on Y can be broken into two parts: α + βX is due to the independent variable; ε is due to error. Observed value = Predicted value (α + βX) + error (ε)
Regression Equation The error is the difference between the observed value of Y and the predicted value of Y (observed minus predicted). This difference is known as the residual.
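As a small illustration of the decomposition Observed value = Predicted value + error, the Python sketch below splits each observed Y into its predicted part and its residual. The coefficients, the observations, and the function name `decompose` are hypothetical and chosen only to keep the arithmetic simple.

```python
def decompose(alpha, beta, x, observed_y):
    """Split an observed Y into (predicted value, residual)."""
    predicted = alpha + beta * x        # part due to the independent variable
    residual = observed_y - predicted   # part due to error
    return predicted, residual

# Hypothetical coefficients and (X, observed Y) pairs, for illustration only.
alpha, beta = 2.0, 0.5
observations = [(1, 3.0), (2, 2.5), (4, 4.5)]

for x, observed_y in observations:
    predicted, residual = decompose(alpha, beta, x, observed_y)
    # Adding the two parts back together recovers the observed value.
    print(x, observed_y, predicted, residual)
```

A positive residual means the observation lies above the line; a negative residual means it lies below.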
Regression Interpretation For the data on the scatterplot: Y (depvar) = telephone lines per 1,000 people; X (indvar) = infant mortality (infant deaths per 1,000 live births). We can use regression analysis to examine the relationship between communication capacity (measured here as telephone lines per 1,000 people) and infant mortality.
Regression Interpretation In this analysis, the intercept and regression coefficient are as follows: α (or constant) = 121. This means that when X (infant mortality) is 0 deaths, the predicted value is 121 phone lines per 1,000 population. β = -1.25. This means that when X increases by 1 death, there is a predicted or estimated decrease of 1.25 phone lines per 1,000 population.
Regression Interpretation <ul><li>These calculations are useful because they allow you to make predictions from the data. </li></ul><ul><li>An increase from 1 to 10 deaths per 1,000 live births is associated with a predicted decline from 119.75 to 108.5 telephone lines per 1,000 population, that is, 119.75 - 108.5 = 11.25 telephone lines. </li></ul>
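The arithmetic behind this prediction can be reproduced with a short Python sketch. The constant (121) and coefficient (-1.25) are the values reported on the slide; the function and variable names are assumptions introduced for illustration.

```python
ALPHA = 121.0   # constant reported on the slide
BETA = -1.25    # regression coefficient reported on the slide

def predicted_phone_lines(infant_mortality):
    """Predicted telephone lines per 1,000 population for a given
    infant mortality rate (deaths per 1,000 live births)."""
    return ALPHA + BETA * infant_mortality

at_1 = predicted_phone_lines(1)     # 119.75
at_10 = predicted_phone_lines(10)   # 108.5

print(at_1 - at_10)  # 11.25
```

The same result follows directly from the slope: a change of 9 deaths times 1.25 lines per death is 11.25 lines.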
<ul><li>Interpreting the meaning of a coefficient can be tricky. What does a coefficient of -1.25 mean? </li></ul><ul><li>-- It means there is a negative relationship between infant mortality and phone lines. </li></ul><ul><li>-- It means that for every additional infant death per 1,000 live births there is a predicted decrease of 1.25 phone lines per 1,000 population. </li></ul><ul><li>This information is useful, but is there a measure that tells us how good a job we do predicting the observed values? </li></ul>
R-squared <ul><li>Yes, the measure is known as R-squared (or R²). </li></ul><ul><li>As stated earlier, there are two component parts of the total deviation from the mean, which is usually measured as the total sum of squares (or total variation). </li></ul><ul><li>The first component is the difference between the mean and the predicted value of Y. This is the explained part of the deviation (Regression Sum of Squares). </li></ul><ul><li>The second component is the difference between the predicted and the observed value of Y, which measures prediction errors. This is the unexplained part of the deviation (Residual Sum of Squares). </li></ul>
R-squared <ul><li>Total SS = Regression SS + Residual SS. In other words, the total sum of squares is the sum of the regression sum of squares and the residual sum of squares. </li></ul><ul><li>R² = Regression SS / Total SS. The more variance the regression model explains, the higher the R². </li></ul>
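The sums of squares and R² can be computed by hand, as in the Python sketch below. It assumes a bivariate OLS fit with a constant term, which is the case in which Total SS splits exactly into Regression SS and Residual SS. The data and the function name `sums_of_squares` are hypothetical.

```python
def sums_of_squares(xs, ys):
    """Fit Y = a + bX by least squares and return
    (total SS, regression SS, residual SS, R-squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Least-squares slope and intercept for one independent variable.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    predicted = [a + b * x for x in xs]

    # Total deviation of the observed values from the mean.
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    # Explained part: predicted values versus the mean.
    regression_ss = sum((p - mean_y) ** 2 for p in predicted)
    # Unexplained part: observed values versus predicted values.
    residual_ss = sum((y - p) ** 2 for y, p in zip(ys, predicted))

    return total_ss, regression_ss, residual_ss, regression_ss / total_ss

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

total_ss, regression_ss, residual_ss, r2 = sums_of_squares(xs, ys)
print(total_ss, regression_ss, residual_ss, r2)
```

For these made-up points the total sum of squares is 6, of which about 3.6 is explained and about 2.4 is unexplained, so R² is about 0.6.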