Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

- The AI Rush by Jean-Baptiste Dumont 970303 views
- AI and Machine Learning Demystified... by Carol Smith 3592289 views
- 10 facts about jobs in the future by Pew Research Cent... 646852 views
- 2017 holiday survey: An annual anal... by Deloitte United S... 1047375 views
- Harry Surden - Artificial Intellige... by Harry Surden 608470 views
- Inside Google's Numbers in 2017 by Rand Fishkin 1191203 views

3,444 views

Published on

Published in:
Healthcare

No Downloads

Total views

3,444

On SlideShare

0

From Embeds

0

Number of Embeds

208

Shares

0

Downloads

188

Comments

0

Likes

5

No embeds

No notes for slide

- 1. Simple Linear Regression: Step-By-Step Dan Wellisch danwellisch@gmail.com Chicago Technology For Value-Based Healthcare Meetup https://www.meetup.com/Chicago-Technology-For-Value-Based-Healthcare- Meetup/ 9/26/2017 Acknowledgement: Most of the tables and graphs used for this presentation are those of Brandon Foltz. https://www.youtube.com/user/BCFoltz I highly recommend his material. He is very good at explanations of complex statistical methods.
- 2. What is Simple Linear Regression? Simple Linear Regression is a method used to fit the best straight line between a set of data points. After a graph is properly scaled, the data points must “look” like they would fit a straight line, not a parabola, or any other shape. The line is used as a model in order to predict a variable y from another variable x. A regression line must involve 2 variables, the dependent and the independent variable. Finding the “best-fit” line is the goal of simple linear regression.
- 3. Definitions: Input, Predictive, Or Independent Variable X – 3 names mean the same thing. This is the variable whose value that is believed to influence the value of another variable. This variable should not be dependent on another variable (by definition) Output, Response, Or Dependent Variable Y – 3 names mean the same thing. This is the variable whose value that is believed to be influenced by the value of another variable. It is by definition, dependent on another variable. Best-Fit Line – Represents our model. It is the line that “best fits” our data points. The line represents the best estimate of the y value for every given input of x. Sum Of Squares – An important calculation we will use to find the best-fit line.
- 4. One Variable
- 5. • Problem: A waiter wants to predict his next tip, but he forgot to record the bill amounts for previous tips. • Here is a graph of his tips. The tips is the only variable. Let’s call it the y variable. • Meal# is not a variable. It is simply used to identify a tip. y variable Can we come up with a model for this problem with only 1 variable?
- 6. • The only option for our model is to use the mean of the Tips($) • Tips are on the y access. We would call the mean 𝒚 (y bar). • The mean for the tip amounts is 10. • The model for our problem is simply 𝒚 = 10. • 𝒚 = 10 is our best fit line (represented by bold black line). 𝒚 = 10
- 7. • Now, let’s talk about goodness of fit. This will tell us how good our data points fit the line. • We need to calculate the residuals (errors) for each point. -5 +7 +1 -2 +4 -5 𝒚 = 10
- 8. -5 +7 +1 -2 +4 -5 𝒚 = 10 • The best fit line is the one that minimizes the sum of the squares of the residuals (errors). • The error is the difference between the actual data point and the point on the line. • SSE (Sum Of Squared Errors) = (-5)2 + 72 + 12 + (-2)2 + 42 + (-5)2 = 120 • SST (Sum Of Squared Total) = SSR (Sum Of Squared Regression) + SSE is the Sum Of Squares Equation. • Since there is no regression line (as we only have 1 variable), we can not make the SSE any smaller than 120, because SSR = 0.
- 9. Two Variables
- 10. The goal of simple linear regression is to create a linear model that minimizes the sum of squares of the errors (SSE). From a previous slide: SST (Sum Of Squared Total) = SSR (Sum Of Squared Regression) + SSE When we have 2 variables, we can create a regression line; and therefore, we can calculate an SSR > 0. If SSR > 0, then we can reduce SSE. Minimizing the errors means that the line will fit the data better. 2 variables: One is the dependent variable: y. The other is the independent variable x. Goal Of Simple Linear Regression
- 11. So now, we need to introduce a little math….. Remember from the previous slide that we want to minimize the SSE. We write this mathematically this way.: 𝑦 is often referred to as y-hat.
- 12. • Repeating the Problem: As a waiter, how do we predict the tips we will receive for service rendered? • Let’s say, we didn’t forget to record the bill amount. Independent Variable (x) Dependent Variable (y)
- 13. If we scale the graph according to the data points available, we can then plot the points.
- 14. Algebra Of Lines - Slope Equation Of A Straight Line Y = mX + b slope y-intercept m = (rise/run) = (yj – yi) / (xj – xi) b is the point where x = 0 and y intersects the y-axis y x xj, yj • (0,b) •xi, yi run rise
- 15. Think of sy divided by sx as the variation (resembling change) in Y over the variation in X, in units of X and Y. Rise Over Run! Derivation to this Regression Line - Slope
- 16. Let’s create a table of calculations that we can use to calculate the slope of the regression (best-fit) line.
- 17. • Now that we have the slope, we can calculate the y-intercept because we know 1 point on the line already (74,10). • What do we call 74,10?
- 18. • (74,10) is the Centroid. • For comparison, Excel has calculated the regression equation very close to our manual calculation. Excel calculation
- 19. Meaning of our equation….
- 20. SST = SSR + SSE
- 21. Summary The regression line must have an SSE < the SSE of the one variable equation, where y = the mean of the data values. Otherwise, it is not a good model to predict y, given x. Remember the Sum Of Squares equation, SST = SSR + SSE. In our example, 120 is the SSE if we only had the past tips (no bill amounts) to predict our next tip. SSR is 0 as there is no regression line. When we add an independent variable (the bill amount), the tips becomes the dependent variable and we can create a regression line. The regression line contributes to the SST and therefore reduces the SSE from 120 to 30.075. Since 30.075 < 120, our regression line is a better model for predicting future tips.

No public clipboards found for this slide

Be the first to comment