Exploring bivariate data


Published in: Technology

  1. 1. Exploring Bivariate Data
  2. 2. Bivariate Data • Analyzing patterns in scatterplots • Correlation and linearity • Least-squares regression line • Residual plots, outliers, and influential points • Transformations to achieve linearity: logarithmic and power transformations
  3. 3. Scatterplots • The most effective way to display the relationship between two quantitative variables. • The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual.
  4. 4. Scatterplot Variables • Response Variable - measures the outcome of a study. (The dependent variable, plotted on the y-axis.) • Explanatory or Predictor Variable - helps explain or predict changes in a response variable. (The independent variable, plotted on the x-axis.)
  5. 5. Example: • If you think that alcohol causes body temperature to drop, you might do a study giving certain amounts of alcohol to mice and measuring the temperature drops. • In this case the explanatory variable is the amount of alcohol and the response variable is the measured temperature drop.
  6. 6. There are two ways of determining whether two variables are related: 1) By looking at a scatter plot (graphical approach) 2) By calculating a “correlation coefficient” (mathematical approach)
  7. 7. How to Make a Scatterplot 1. Decide which variable should go on each axis. 2. Label and scale your axes. 3. Plot individual data values.
  8. 8. Interpreting a Scatterplot • In any graph of data, look for the overall pattern and for striking deviations from that pattern. • You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship. • An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.
  9. 9. Positive Linear Association
  10. 10. No Association
  11. 11. Clusters Clusters of points within the plot can indicate the presence of another variable. The scatterplot on the right shows two clear clusters—one near 2 minutes; the other between 4 – 5 minutes.
  12. 12. Gaps Gaps are regions (values) of the explanatory variable that have no associated response measurements. The scatterplot on the right shows a gap between 60,000 and 80,000 white blood cells (and probably another between 80,000 and 100,000).
  13. 13. Correlation Coefficient (r) • The correlation coefficient (r) measures the strength of the linear relationship between two quantitative variables. • It gives a numerical description of the strength and direction of the linear association between two variables. $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$
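As a rough illustration of the formula for r, here is a minimal Python sketch. The function name `correlation` and the sample data are made up for this example and do not come from the slides; the sketch assumes the sample standard deviation (n - 1 divisor), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of standardized x and y, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Made-up illustrative data (an assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(f"r = {r:.3f}")  # a positive value, since y tends to rise with x
```

Because both variables are standardized before multiplying, the result does not depend on the units of x or y.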
  14. 14. Properties of r • r is always a number between -1 and 1. • r > 0 indicates a positive association. • r < 0 indicates a negative association. • Values of r near 0 indicate a very weak linear relationship. • The strength of the linear relationship increases as r moves away from 0 towards -1 or 1. • The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship.
  15. 15. Correlation ≠ Causation • Whenever we have a strong correlation, it is tempting to explain it by imagining that the explanatory variable has caused the response to change. • A variable that is not explicitly part of a study but affects the way the variables in the study appear to be related is called a lurking variable. • Because we can never be certain that observational data are not hiding a lurking variable, it is never safe to conclude that a scatterplot demonstrates a cause-and-effect relationship, no matter how strong the correlation. • Scatterplots and correlation coefficients never prove causation.
  16. 16. Least-Squares Regression (LSRL) Least-squares regression (linear regression) allows you to fit a line to a scatter diagram in order to predict the value of one variable from the value of another variable. $\hat{y} = a + bx$, where a is the y-intercept and b is the slope of the line.
  17. 17. Regression Line • A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. • We often use the regression line for predicting the value of y for a given value of x.
  18. 18. Interpreting a Regression Line • The line is fitted to the data by the method of least squares. The main idea behind this method is that the sum of the squared vertical distances between the data points and the line is minimized. • The least-squares regression line is a mathematical model for the data that helps us predict values of the response (dependent) variable from the explanatory (independent) variable. Therefore, with regression, unlike with correlation, we must specify which is the response and which is the explanatory variable.
  19. 19. Formulas for finding the slope and y-intercept of a linear regression line: slope: $b = r \frac{s_y}{s_x}$; y-intercept: $a = \bar{y} - b\bar{x}$
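The slope and intercept formulas can be applied directly, as in the sketch below. The function name `least_squares_line` and the data are hypothetical and chosen only for illustration; the sketch assumes sample standard deviations with an n - 1 divisor.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r*s_y/s_x and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation coefficient r, written with the same n - 1 divisor as s_x and s_y.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b

# Made-up illustrative data (an assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # prints "y-hat = 2.20 + 0.60x"
```

The intercept formula means the fitted line always passes through the point of means: substituting x = x_bar gives y-hat = y_bar.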
  20. 20. When will we ever need this? • We use regression lines to make predictions. • Interpolation: making predictions within known data values. • Extrapolation: making predictions beyond known data values.
  21. 21. How good is our prediction? The strength of a prediction which uses the LSRL depends on how close the data points are to the regression line. The mathematical approach to describing this strength is via the coefficient of determination. The coefficient of determination gives us the proportion of variation in the values of y that is explained by least-squares regression of y on x. The coefficient of determination turns out to be the correlation coefficient squared (r²).
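To see the "proportion of variation explained" reading concretely, the sketch below computes 1 - SSE/SST for a fitted line, which for a least-squares line equals r squared. The function name `r_squared` and the data are assumptions made for this example; the slope is computed as Sxy/Sxx, which is algebraically the same as r times s_y/s_x.

```python
from statistics import mean

def r_squared(xs, ys):
    """Proportion of the variation in y explained by the least-squares line of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx          # slope (equivalent to r * s_y / s_x)
    a = y_bar - b * x_bar  # y-intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total variation in y
    return 1 - sse / sst

# Made-up illustrative data (an assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(f"r^2 = {r_squared(xs, ys):.2f}")  # prints "r^2 = 0.60"
```

For this data, about 60% of the variation in y is accounted for by the line, and the remaining 40% is left in the residuals.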
  22. 22. Residuals • Because the LSRL minimizes the sum of the squared vertical distances between the data values and the line, we have a special name for these vertical distances. They are called residuals. • A residual is simply the difference between the observed y and the predicted y: residual = y - ŷ.
  23. 23. Residual Plots • Residuals help us determine how well our data can be modeled by a straight line, by enabling us to construct a residual plot. • A residual plot is a scatter diagram that plots the residuals on the y-axis and their corresponding x values on the x-axis.
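The points of a residual plot can be computed as in this sketch, which pairs each x value with its residual (observed y minus predicted y). The function name `residuals` and the data are hypothetical; drawing the plot itself is left to whatever plotting tool is at hand.

```python
from statistics import mean

def residuals(xs, ys):
    """Observed y minus predicted y for the least-squares line fitted to (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Made-up illustrative data (an assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys)

# Each (x, residual) pair is one point of the residual plot.
for x, e in zip(xs, res):
    print(f"x = {x}, residual = {e:+.1f}")
```

The residuals of a least-squares line always sum to zero (up to rounding), so a residual plot is centered on the horizontal line at 0; what matters is whether the points show a pattern around that line.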
  24. 24. INTERPRETING RESIDUAL PLOTS: The following residual plot is in a curved pattern and shows that the relationship is not linear. A straight line is not a good summary for such data.
  25. 25. INTERPRETING RESIDUAL PLOTS: A spread about the line that increases as x increases indicates that prediction of y will be less accurate for larger x, as shown in this residual plot. A spread that decreases as x increases indicates the reverse: prediction is less accurate for smaller x.
  26. 26. INTERPRETING RESIDUAL PLOTS: The following shows a residual plot that has a uniform scatter of points about the fitted line with no unusual observations. This tells us that our linear model (regression line) will give us a good prediction of the data.
  27. 27. Unusual and Influential Data • Outlier: a value in a set of data that does not fit with the rest of the data. • Leverage: an observation with an extreme value on a predictor variable has high leverage. Leverage is a measure of how far an observation's value of the independent variable deviates from its mean. High-leverage points can have an effect on the estimates of the regression coefficients. • Influence: influence can be thought of as the product of leverage and outlierness. Removing an influential observation substantially changes the estimates of the coefficients.
  28. 28. Outliers • Data points more than 2 standard deviations away from the mean of the data set • Data points that do not fit the pattern governed by the rest of the data • In regression, any data point that has an unusually large residual How can I tell if a point in my data set is an outlier? • Take the IQR (interquartile range) of your data set and multiply it by 1.5. Subtract that number from Quartile 1 and add it to Quartile 3. Any number lying outside these two values can be considered an outlier.
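The 1.5 × IQR rule is usually stated with a lower fence at Q1 - 1.5 × IQR and an upper fence at Q3 + 1.5 × IQR, and it can be sketched as follows. The function name `iqr_outliers` and the data are made up for illustration. Quartiles can be computed by several slightly different conventions; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, so other tools may give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # Quartile convention is an assumption; textbooks and calculators vary.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up illustrative data (an assumption, not from the slides).
data = [2, 3, 4, 5, 6, 7, 8, 9, 30]
print(iqr_outliers(data))  # prints [30]
```

Here Q1 = 4 and Q3 = 8, so the IQR is 4 and the fences are -2 and 14; only the value 30 falls outside them.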
  29. 29. Influential Points • Influential points are normally outliers in the x direction, but are not always outliers in terms of regression (they need not have large residuals). • A point is said to be influential if removing it noticeably changes the LSRL. • A point with high leverage is potentially influential, but it is influential only if it actually changes the line.
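One way to check influence in practice is to refit the line without the suspect point and compare slopes. The sketch below does this and also computes leverage with the standard simple-regression formula h_i = 1/n + (x_i - x_bar)^2 / sum((x_j - x_bar)^2). The function names (`fit`, `leverage`, `slope_change_without`) and both data sets are hypothetical, chosen so that the same high-leverage x value is harmless in one case and influential in the other.

```python
from statistics import mean

def fit(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b

def leverage(xs):
    """Leverage of each point: large when x is far from the mean of x."""
    n, x_bar = len(xs), mean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

def slope_change_without(xs, ys, i):
    """Slope fitted without point i, minus the slope fitted with all points."""
    _, b_all = fit(xs, ys)
    _, b_rest = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    return b_rest - b_all

# Made-up data (assumptions): the last point is extreme in x in both cases.
xs = [1, 2, 3, 4, 5, 15]
ys_on_trend = [2, 4, 6, 8, 10, 30]   # last point follows y = 2x
ys_off_trend = [2, 4, 6, 8, 10, 5]   # last point is far from y = 2x

print(f"leverage of last point: {leverage(xs)[5]:.2f}")
print(f"slope change, on trend:  {slope_change_without(xs, ys_on_trend, 5):.2f}")
print(f"slope change, off trend: {slope_change_without(xs, ys_off_trend, 5):.2f}")
```

The last point has the same high leverage in both data sets, but removing it changes the slope only when it also departs from the trend of the other points. This matches the idea of influence as a combination of leverage and outlierness.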