This slide is self-explanatory. Make sure you can recognize a research question that can be answered by simple linear regression: such questions are all predictive in nature.
This is a review slide – Again this table shows you where regression is in the world of inferential statistics
Inferential statistics are more powerful when they can help us predict the future
It’s totally possible that the relationship between two variables is NOT linear. Check your scatterplots first to make sure the relationship looks somewhat linear; otherwise the simple linear regression method should NOT be used.
For example, suppose you want to see if gender (IV1) and race (IV2) predict spending (DV). If, in your sample, all men are Caucasian and all women are African American (perfect correlation between gender and race), then you will NOT be able to run the regression.
For example, suppose you want to see if eating ice cream (IV) causes people to go to the beach more often (DV). You will probably find a positive relationship; however, the IV correlates with an external variable (temperature), which causes variance in your DV (it determines whether people go to the beach or not). In this case running a regression would not make sense.
For example, if you interview 20 men and 30 women, but it turns out that 2 of the 20 men are the same person interviewed twice, then the “independence” principle is violated.
After the regression model is built, if you can still see a recognizable pattern in the errors, then the model is not good enough. The model should capture the trend of the data completely and leave behind completely random errors.
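The last point above (a recognizable pattern in the errors) can be illustrated with a minimal Python sketch. The data are made up for illustration (y = x squared, a deliberately curved relationship), and the helper name `fit_line` is hypothetical; it is a plain least-squares fit, not the output of any particular statistics package.

```python
def fit_line(xs, ys):
    """Least-squares fit with one predictor; returns (constant, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    constant = mean_y - slope * mean_x
    return constant, slope

# Hypothetical data: the true relationship is curved, not linear.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

constant, slope = fit_line(xs, ys)

# Errors (residuals) = actual Y minus the Y predicted by the line.
residuals = [y - (constant + slope * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The errors are positive at both ends and negative in the middle. That U shape is a recognizable pattern, which suggests a straight line is not capturing the trend in these data.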
This is the example used in Individual Assignment 6
To understand regression you first need to understand how a straight line is expressed mathematically. 1. All straight lines can be expressed mathematically in terms of a constant and a slope. 2. We use y=2x+1 as an example (constant = 1, slope = 2).
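As a small illustration of the example line y = 2x + 1, the Python sketch below computes y for a few x values. The function name `line` and the chosen x values are made up for illustration.

```python
def line(x, constant=1.0, slope=2.0):
    """Return y for a straight line: y = constant + slope * x."""
    return constant + slope * x

xs = [0, 1, 2, 3]
ys = [line(x) for x in xs]
print(ys)  # [1.0, 3.0, 5.0, 7.0]
```

When x is 0, y equals the constant (1). Each time x goes up by 1, y goes up by the slope (2).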
Regression is like what Ralph Lauren and Armani do every day: finding the runway model that fits the best. (Note: Kate Moss is one of the most prototypical runway models.)
Like Kate Moss (or other runway models), the regression line represents an idealized version of the real world. The reality is the messy data we collected (aka the dots). The line is an idealized model that best represents the messy-data reality. A good model, in the world of statistics, is close to reality. The goal is to minimize the difference between the model and the reality. When a regression model represents the real world well, the errors (distances from the dots to the line) are minimal, and the goodness of fit measure, or R square, is large.
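The idea of a line that minimizes the errors, and of R square as the goodness of fit, can be sketched in a few lines of Python. The data points below are invented for illustration, and the helper names `fit_line` and `r_squared` are hypothetical. This is the standard least-squares calculation for one predictor, shown as a sketch rather than as the procedure any specific software follows.

```python
def fit_line(xs, ys):
    """Least-squares fit with one predictor; returns (constant, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    constant = mean_y - slope * mean_x
    return constant, slope

def r_squared(xs, ys, constant, slope):
    """Share of the variation in Y that the line explains (0 to 1)."""
    mean_y = sum(ys) / len(ys)
    # Squared distances from the dots to the line (the errors).
    ss_error = sum((y - (constant + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Squared distances from the dots to the mean of Y.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_error / ss_total

# Hypothetical "messy" data that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

constant, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, constant, slope)
print(round(constant, 2), round(slope, 2), round(r2, 3))
```

Because the dots sit close to the line, the errors are small and R square comes out close to 1.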
According to this definition of “goodness of fit”, Kate Moss is a really bad model (poor goodness of fit, large errors; R-square would be very small). Your goal is to do better than Kate Moss!
In this case, Average Daily Clicks significantly predicted Direct Sales Revenue. However, the beta coefficient (the slope) is negative, which means the more the clicks, the lower the revenue. The model fit is pretty good.
Get the total df value (39 in this case) from the ANOVA table
So with regression you get a model, which is a line with a linear function (Y = BX). This means that given any X we can predict the value of Y. For example, suppose we would like to see if the number of clicks (X) predicts revenue (Y), and we get a regression line Y = 200X with an R square of 45%. This means that given any number of clicks, we can predict the expected revenue level, with the model explaining 45% of the variation in revenue. Perhaps this regression model is based on data of 5-20 clicks. But because the line can be extended indefinitely toward the upper right corner of the graph, the model also predicts that, when we get 1000 clicks, our revenue will be $200,000!
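The prediction in this example can be written out as a short Python sketch. The slope of 200 and the 5-20 click range come from the example above; the function names and the extrapolation flag are hypothetical additions. The flag is only a reminder that a prediction for 1000 clicks relies on the line continuing to hold beyond the range of the data.

```python
# Range of clicks the example model is assumed to be based on.
OBSERVED_MIN, OBSERVED_MAX = 5, 20

def predict_revenue(clicks, slope=200.0):
    """Predicted revenue from the example line Y = 200X."""
    return slope * clicks

def is_extrapolation(clicks):
    """True when clicks fall outside the range of the original data."""
    return not (OBSERVED_MIN <= clicks <= OBSERVED_MAX)

for clicks in (10, 1000):
    print(clicks, predict_revenue(clicks), is_extrapolation(clicks))
# 10 2000.0 False
# 1000 200000.0 True
```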
There is a LOT more to regression than what we discussed. We covered the basic concepts and you’re not expected to know more than that. However, this slide gives you some ideas about other considerations when running regression.
Here’s some food for thought for your group project.
To summarize, we have discussed 4 different kinds of inferential statistics in this course: T test, Correlation, Chi square, and Regression. How do you know which test is appropriate for your project? Use this summary table to decide.
Many students ask which test is better than the others. This question is like asking, is a pregnancy test better than a DNA test? It’s impossible to answer without knowing what your objective is.
Some people also wonder if we can use more than one test in a research study. The answer is obvious. Of course! We take the same approach to other "research questions" in our lives. For example, if you want to know if you're pregnant, you get a pregnancy test. If you want to know if you're diabetic, you get a blood test. If you want to know who's the father of your child, you get a DNA test! If you need answers to all 3 questions, you order all 3 tests! Again, it's all about your research question!
S6 w2 linear regression
Purpose: Determine if one or more IVs can predict a DV
Examples:
• Does your height (IV) predict how much money you will spend (DV)?
• Does the number of store managers (IV) predict how often the machine will break down (DV)?
• Do the number of clicks (IV1) and the number of comments (IV2) on the blog predict the size of revenue (DV)?
Research Question                         Inferential Statistics
Compare means of 2 numeric variables      T test
Relate 2 categorical variables            Pearson Chi Square
Relate 2 numeric variables                Pearson Correlation r
Use 1+ IVs to explain 1 numeric DV        Regression
Correlation tells us how X relates to Y (in the past)
Simple Regression tells us how X predicts Y (in the future)
• E.g., Does AvgDailyClicks predict DirectSalesRevenue?
Multiple Regression tells us how X1, X2, X3, ... predict Y
• E.g., Do NumberBlogAuthors & AvgDailyClicks predict SponsorRevenue?
• The relationship between the Xs and Y is linear
• If you have 2 or more Xs, they are not perfectly correlated with each other
• Xs are not correlated with external variables
• Independence: any two observations should be independent of each other
• Errors are normally distributed
• And a few others
Example: Does Number of Stupid Customers predict Self Checkout Error Rate?
When we use X to predict Y:
• X = the predictor = the independent variable (IV)
• Y = the predicted value = the dependent variable (DV); the value of Y depends on the predictor X
• You’re basically building a linear model between X and Y: Y = Constant + B*X + error
Who is the best fitting model? (Hint: Not Kate Moss)
The line that’s closest to all dots
Goodness of Fit (R2): How well does the line fit the data? (How well does Kate fit the average woman?)
[Figure labels: Constant; Slope B; Distances to regression line = error]
Good fit = small errors
Y = Constant + B*X + error
DirectSalesRevenue = 19.466 - .003*AvgDailyClicks + error
• Constant is significantly greater than zero
• Slope (-.003) is significantly less than zero
Goodness of Fit (R2): Model explains 59% of the variation in DirectSalesRevenue
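To make the negative slope concrete, the fitted equation from this slide can be plugged into a short Python sketch. The constant (19.466) and slope (-.003) come from the slide; the click values of 1000 and 2000 are hypothetical inputs chosen for illustration and may not fall within the range of the original data.

```python
CONSTANT = 19.466   # from the slide's regression output
SLOPE = -0.003      # from the slide's regression output

def predicted_revenue(avg_daily_clicks):
    """Predicted DirectSalesRevenue = Constant + B * AvgDailyClicks."""
    return CONSTANT + SLOPE * avg_daily_clicks

# Hypothetical click counts, used only to show the direction of the effect.
low = predicted_revenue(1000)
high = predicted_revenue(2000)
print(round(low, 3), round(high, 3), round(high - low, 3))
# 16.466 13.466 -3.0
```

With a slope of -.003, every additional 1000 average daily clicks lowers the predicted direct sales revenue by 3 units.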
The number of average daily clicks significantly predicted direct sales revenue, b = -.003, t(39) = 14.72, p < .001. The number of average daily clicks also explained a significant proportion of variance in direct sales revenue, R2 = .59, F(1, 38) = 42.64, p < .001. These findings suggest that websites with more average daily clicks tend to have lower direct sales revenue.
Y=200X (R2 = 45%)
Given any X, we can predict the value of Y; the model explains 45% of the variation in Y
Assumptions: Xs are somewhat independent; Y values are independent; Y values are normally distributed; errors are normally distributed; X Y relations are linear; no outliers
• Example: Time series data are NOT independent. Stock price today depends on stock price yesterday, which depends on stock price the day before, etc.
Multiple regression is just an extension of simple regression
• Use multiple Xs (e.g., both AvgDailyClicks and NumberAuthors) to predict Y
• When you have a condition (e.g., customer choice depends on gender; brand awareness depends on comm. channel; number of applications depends on program of study), you need to create an interaction term (next class)
When an X is categorical (e.g., whether the blog host is Google or WordPress):
• Code X in numbers, e.g., 0 is Google, 1 is WordPress
When Y is categorical (e.g., whether the blog won the Outstanding Blog Award):
• Code Y in numbers, e.g., 0 is No, 1 is Yes, and use Logistic Regression
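The 0/1 coding of a categorical X described above can be sketched in Python. The blog hosts and revenue figures below are invented for illustration, and the names `HOST_CODES` and `fit_line` are hypothetical. The fit is a plain least-squares calculation with one predictor.

```python
def fit_line(xs, ys):
    """Least-squares fit with one predictor; returns (constant, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    constant = mean_y - slope * mean_x
    return constant, slope

# Coding scheme from the slide: 0 is Google, 1 is WordPress.
HOST_CODES = {"Google": 0, "WordPress": 1}

# Hypothetical data: blog host and revenue for six blogs.
hosts = ["Google", "WordPress", "Google", "WordPress", "Google", "WordPress"]
revenue = [10, 14, 12, 15, 11, 16]

xs = [HOST_CODES[h] for h in hosts]  # categorical X turned into numbers
constant, slope = fit_line(xs, revenue)
print(constant, slope)
```

With a single 0/1 predictor, the constant is the average Y of the group coded 0 (Google), and the slope is the difference between the two group averages (WordPress minus Google).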
What is your Y (the value you want to predict)?
• Is your Y categorical? Do you need Logistic Regression? See the instructor for help
What is your X (your predictor variable)?
• How many Xs do you have?
• Are any of your Xs categorical? Do you have a coding scheme?
• Do you have a condition? (e.g., customer choice depends on gender; brand awareness depends on comm. channel; number of applications depends on program of study) See the instructor for help
Research Question                         Inferential Statistics
Compare means of 2 numeric variables      T test
Relate 2 numeric variables                Pearson Correlation r
Relate 2 categorical variables            Pearson Chi Square
Use 1+ IVs to explain 1 numeric DV        Regression