2. What we should know
Basics of the software
Know the difference between a continuous and a discrete
(nominal) variable
Know how to summarize a continuous variable (e.g. mean income) and a
nominal variable (e.g. % Female)
Relationship between 2 variables
Both continuous (correlation)
Both Nominal (cross-tab, mosaic plot)
One Continuous & one Nominal (e.g. take Mean of continuous
variable by Nominal)
Understand the p-value: the only time we are interested in a
‘statistical’ test is when running controlled experiments
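The summaries listed on this slide can be sketched in a few lines of Python. This is only an illustration: the records below are hypothetical, and the helper names `pearson` and `mean_by_group` are made up for this example.

```python
from statistics import mean

# Hypothetical records (not from the slides): income in $1000s, gender, years of education.
income = [42.0, 55.5, 61.0, 38.5, 70.0, 49.0]
gender = ["F", "M", "F", "F", "M", "M"]
education = [12, 16, 16, 12, 18, 14]

def pearson(x, y):
    """Pearson correlation between two continuous variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mean_by_group(values, groups):
    """Mean of a continuous variable within each level of a nominal variable."""
    return {
        g: mean([v for v, k in zip(values, groups) if k == g])
        for g in sorted(set(groups))
    }

mean_income = mean(income)                           # continuous: mean
pct_female = 100 * gender.count("F") / len(gender)   # nominal: percentage
r_income_edu = pearson(income, education)            # both continuous: correlation
income_by_gender = mean_by_group(income, gender)     # continuous by nominal: group means
```

A cross-tab for two nominal variables would follow the same pattern, counting records per pair of levels instead of averaging.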
3. Why do we need models?
o Graphs are useful for understanding but don’t
scale (when we have too many potential
predictors).
o We want to automate the analysis
o Which Ad to display?
o How to provide an insurance quote based on the
information provided by a new customer
o Conduct ‘what-if’ analysis for planning Black Friday.
4. Example
Predicting auto insurance
Traditional measures
Usage based
GPS device
Forecasting sales for Sony digital camera at Best
Buy
Build a demand model based on historical data from
1000 stores
5. Regression: Key Points
Regression: widely used research tool
Determine whether the independent variables explain a significant
variation in the dependent variable: whether a relationship exists.
Determine how much of the variation in the dependent variable can
be explained by the independent variables: strength of the
relationship.
Control for other independent variables when evaluating the
contributions of a specific variable or set of variables. Marginal effect
Forecast/Predict the values of the dependent variable.
Use regression results as inputs to additional computations:
Optimal pricing, promotion, time to launch a product….
8. Box Office Prediction
Suppose you are helping Warner Bros. in
developing a model for forecasting Box Office
revenues for their new movie The Watchman. In
the file “BoxOffice.csv” you are provided the
opening week revenues (in millions of $) for
various past movies along with several predictor
variables:
Variable | Description of the Variable
Opening_Week_Revenue | Opening week revenue in millions of $
# of Theaters | Number of movie theaters in which each movie was initially released
Overall Rating | Critic rating for each movie (a higher number implies more favorable ratings)
Genre | 1 for Action, 2 for Comedy, 3 for Kids, and 4 for Other
9. Data
Movie | Opening_Week_Revenue | Num_Theaters | Overall_Rating | Genre
The Dark Knight | 158.4 | 4366 | 82 | 1
Iron Man | 98.6 | 4105 | 79 | 1
Indiana Jones and the Kingdom of the Crystal Skull | 100.1 | 4260 | 65 | 1
Hancock | 62.6 | 3965 | 49 | 1
Quantum of Solace | 67.5 | 3451 | 58 | 1
The Incredible Hulk | 55.4 | 3505 | 61 | 1
Wanted | 50.9 | 3175 | 64 | 1
Get Smart | 38.7 | 3911 | 54 | 1
The Mummy: Tomb of the Dragon Emperor | 40.5 | 3760 | 31 | 1
Journey to the Center of the Earth | 21 | 2811 | 57 | 1
Eagle Eye | 29.2 | 3510 | 43 | 1
10,000 B.C. | 35.9 | 3410 | 34 | 1
Valkyrie | 21 | 2711 | 56 | 1
Jumper | 27.4 | 3428 | 35 | 1
Cloverfield | 40.1 | 3411 | 64 | 1
The Day the Earth Stood Still (2008) | 30.5 | 3560 | 40 | 1
Hellboy II: The Golden Army | 34.5 | 3204 | 78 | 1
Spider-Man 3 | 151.1 | 4252 | 59 | 1
Transformers | 70.5 | 4011 | 61 | 1
Pirates of the Caribbean: At World's End | 114.7 | 4362 | 50 | 1
10. Objective
Develop a regression model for “Opening
week Revenues” with all other variables as
predictors. Interpret your parameters.
Prediction: The attributes for the movie
“Watchman” are as follows:
– Theaters= 3611, Rating= 57, Action= 1
– Given this information, what are the predicted
first week revenues for the new movie
Watchman?
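As a rough sketch of how such a model could be fit outside JMP, the code below runs least squares on the 20 rows shown on the data slide. It rests on several assumptions: only those 20 rows are used (all Action, so Genre cannot be estimated and is omitted), theaters are rescaled to thousands for numerical stability, and the helpers `solve` and `ols` are written for this example. The result is therefore not the model reported on later slides.

```python
# (revenue in $mn, theaters, rating), copied from the 20 rows on the data slide.
rows = [
    (158.4, 4366, 82), (98.6, 4105, 79), (100.1, 4260, 65), (62.6, 3965, 49),
    (67.5, 3451, 58), (55.4, 3505, 61), (50.9, 3175, 64), (38.7, 3911, 54),
    (40.5, 3760, 31), (21.0, 2811, 57), (29.2, 3510, 43), (35.9, 3410, 34),
    (21.0, 2711, 56), (27.4, 3428, 35), (40.1, 3411, 64), (30.5, 3560, 40),
    (34.5, 3204, 78), (151.1, 4252, 59), (70.5, 4011, 61), (114.7, 4362, 50),
]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Design matrix: intercept, theaters in thousands, rating.
X = [[1.0, theaters / 1000, rating] for _, theaters, rating in rows]
y = [revenue for revenue, _, _ in rows]
b0, b_theaters, b_rating = ols(X, y)

# Prediction for the new movie: Theaters = 3611, Rating = 57.
predicted_owr = b0 + b_theaters * 3.611 + b_rating * 57
residuals = [yi - (b0 + b_theaters * r[1] + b_rating * r[2]) for r, yi in zip(X, y)]
```

With the full BoxOffice.csv, the same `ols` call would take additional 0/1 genre columns in `X`.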
14. Regression: Forecasting Box-office Revenues
You need to convert the “Genre” variable into a series of dummy variables. This
is a nominal variable (i.e. categories such as 1=Action, 2=Comedy, ...). Adding this
variable directly into the regression does not teach us anything, because the numeric
codes are arbitrary: our coding could just as well have been 1=Comedy, 2=Action, ...
In addition, note that the total number of dummy variables we include/need is 1
less than the number of categories. The left-out category is absorbed in the
intercept.
It does not matter which category you leave out: all included dummy variables are
interpreted relative to the one you leave out.
For example, suppose we leave out “Action” and include dummy variables for
“comedy”, “kids” and “other”. The output of this regression:
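The conversion described above can be sketched as follows. The function name `genre_dummies` and the sample codes are hypothetical; the genre coding (1=Action, 2=Comedy, 3=Kids, 4=Other) is the one given on the variable slide.

```python
GENRE_NAMES = {1: "Action", 2: "Comedy", 3: "Kids", 4: "Other"}

def genre_dummies(codes, base="Action"):
    """Return one 0/1 indicator per genre, omitting the base (left-out) category."""
    kept = [name for name in GENRE_NAMES.values() if name != base]
    return [{name: int(GENRE_NAMES[c] == name) for name in kept} for c in codes]

codes = [1, 2, 3, 4, 2]                     # hypothetical genre codes for five movies
dummies = genre_dummies(codes, base="Action")
# An Action movie gets all zeros, so its effect is absorbed in the intercept.
```

Four categories yield three indicator columns; passing a different `base` changes which genre becomes the reference.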
15. Regression with Genre Dummy Variables Only
We left out “Action” as the
base. Compare the Intercept &
Average for Action
Just looking at the means, we
see that “Kids” movies generate
56.66 - 45.10 = 11.56 ($mn) less
than Action. This difference, with
a negative sign (-11.56), is the
coefficient for ‘Kids’ in the
regression.
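A small check of this correspondence, using hypothetical revenues rather than the BoxOffice.csv data: in a dummy-only regression, the intercept equals the base-category mean and each coefficient equals that genre's mean minus the base mean. The code verifies that these values satisfy the least-squares conditions.

```python
from statistics import mean

# Hypothetical opening-week revenues ($mn) by genre; not the slide's data.
revenue = {
    "Action": [60.0, 48.0, 71.0],
    "Comedy": [30.0, 36.0],
    "Kids": [44.0, 50.0, 41.0],
}

base = "Action"
intercept = mean(revenue[base])   # intercept = mean of the left-out category
coef = {g: mean(v) - intercept for g, v in revenue.items() if g != base}

# With these values each fitted value is its genre mean, so residuals sum to
# zero within every genre. Those are the normal equations of a dummy-only
# model, which is why least squares returns exactly these numbers.
residual_sums = {
    g: sum(y - (intercept + coef.get(g, 0.0)) for y in v)
    for g, v in revenue.items()
}
```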
16. Output from JMP
Note: In the JMP output, click the red triangle and then select Estimates > Indicator Function
Parameterization to get the “dummy” variable output
JMP Output
What is the interpretation of
Action here?
17. Leave out Comedy this time
We left out “Comedy” this time,
so Comedy is now absorbed in the intercept.
Action is 24.68 more
than Comedy. Compare this to
the -24.68 coefficient on
Comedy in the previous
regression.
None of the model fit statistics
change. The coefficients
adjust to the left-out
category (Comedy in this case).
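The effect of switching the left-out category can be illustrated with a short sketch. The data and the helper `dummy_fit` are hypothetical; it relies on the fact that a dummy-only regression fits each genre at its own mean.

```python
from statistics import mean

# Hypothetical opening-week revenues ($mn) by genre.
revenue = {"Action": [60.0, 48.0, 71.0], "Comedy": [30.0, 36.0], "Kids": [44.0, 50.0, 41.0]}

def dummy_fit(groups, base):
    """Intercept, coefficients, and fitted values of a dummy-only regression."""
    means = {g: mean(v) for g, v in groups.items()}
    intercept = means[base]
    coefs = {g: m - intercept for g, m in means.items() if g != base}
    fitted = {g: intercept + coefs.get(g, 0.0) for g in groups}
    return intercept, coefs, fitted

int_action, coef_action, fit_action = dummy_fit(revenue, base="Action")
int_comedy, coef_comedy, fit_comedy = dummy_fit(revenue, base="Comedy")

# The Comedy coefficient (base Action) and the Action coefficient (base Comedy)
# are equal in size with opposite signs; fitted values are the same either way.
```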
18. Add All Predictors
• Regression with OWR as the dependent variable and # of Theaters,
Overall_Rating, and Genre as predictors
# of Theaters: Each additional theater changes OWR by the
# of Theaters coefficient (in $mn), holding rating and genre fixed.
Overall_Rating: Each additional point in overall
rating increases OWR by $0.278mn.
Genre (Kids): Compared to “Other”, Kids
movies generate $17.53mn less in OWR after
controlling for the effect of # of Theaters and
Overall_Rating
19. Objective
Develop a regression model for “Opening
week Revenues” with all other variables as
predictors. Interpret your parameters.
Prediction: The attributes for the movie
“Watchman” are as follows:
– Theaters= 3611, Rating= 57, Action= 1
– Given this information, what are the predicted
first week revenues for the new movie
Watchman?
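The prediction amounts to plugging the attributes into the fitted equation. The form below is a sketch that assumes “Other” is the left-out genre (as in the Kids interpretation on the previous slide); the symbols b_0 through b_5 are placeholders for the estimates in the regression output, which are not reproduced here.

```latex
\[
\widehat{\text{OWR}} = b_0 + b_1\,\text{Theaters} + b_2\,\text{Rating}
  + b_3\,\text{Action} + b_4\,\text{Comedy} + b_5\,\text{Kids}
\]
\[
\widehat{\text{OWR}}_{\text{Watchman}} = b_0 + b_1\,(3611) + b_2\,(57) + b_3\,(1)
\]
```

The Comedy and Kids terms drop out because both dummies are 0 for an Action movie.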
21. Context
Southwest & the Wright Amendment
For context, search for
“Southwest Wright Amendment”
22. Impact of Southwest Airlines on Price
Suppose you are representing Southwest and want to claim that the
presence of SW in a market is good for consumers because it lowers
the fares.
For the analysis, you are provided data on fares from approximately 600
“city-pairs” with the following variables:
Objective: Analyze the impact of Southwest
presence on the average fares
25. Compare Mean Fare by SW
NOTE: If you square the t-ratio (6.71 * 6.71 ≈ 45.02), you get the
F-ratio of 45.03, up to rounding of the t-ratio
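The link between the two statistics can be checked numerically. The sketch below uses hypothetical fares (not the city-pair data) and assumes the pooled, equal-variance form of the two-sample t-test, which is the version that matches a one-way ANOVA with two groups.

```python
from statistics import mean

# Hypothetical fares ($) for markets with and without Southwest.
fares_sw = [120.0, 135.0, 110.0, 128.0, 140.0]
fares_no_sw = [180.0, 165.0, 190.0, 172.0, 185.0, 178.0]

def pooled_t(a, b):
    """Two-sample t-ratio using the pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def anova_f(groups):
    """One-way ANOVA F-ratio: between-group over within-group mean square."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

t_ratio = pooled_t(fares_sw, fares_no_sw)
f_ratio = anova_f([fares_sw, fares_no_sw])
# With two groups, t_ratio squared equals f_ratio up to floating-point error.
```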
26. Basic intuition of Regression Based Models
o Conceptually, fares do not just depend on presence of
Southwest
o Other factors
o In our example: Competition, Distance
o Analyze the relationship between these variables & Fares
o In analyzing output with single predictors, note the
correspondence between regression output vs. ANOVA (t-test)
o We get the same output from regression as a t-test or ANOVA
o More important point is to understand the workings of a
“dummy” variable in regression
32. Understand Output
Rsquare: Of the
total variation in
Fares, 41.6% is
explained by our
model
Distance is the most
important predictor
& Southwest is least
important
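For reference, R-square is one minus the ratio of unexplained to total variation. The sketch below computes it on hypothetical fares and predictions; the function name `r_squared` is chosen for this example.

```python
def r_squared(actual, predicted):
    """Share of the total variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    ss_resid = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_resid / ss_total

actual = [100.0, 150.0, 200.0, 250.0]      # hypothetical fares ($)
predicted = [110.0, 140.0, 210.0, 240.0]   # hypothetical model predictions ($)
r2 = r_squared(actual, predicted)          # 1 - 400 / 12500 = 0.968
```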
33. Interpretation Of Coefficients
Southwest: After controlling for Distance and Competition (# of airlines), the
absence of Southwest in the market increases fares by approximately $49.
Distance: Increasing distance by 100 miles increases the fare by $21.50.
# of Airlines: Increasing the number of airlines serving the market by 1 reduces
the fare by approximately $41.
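These interpretations can be combined to answer simple what-if questions. The sketch below takes the three effects as stated on this slide; the intercept is not given, so only changes in predicted fare are computed, and the function `fare_change` is a hypothetical helper.

```python
# Effects as stated in the interpretations above (the intercept is not shown there).
COEF_NO_SOUTHWEST = 49.0      # $ added when Southwest is absent from the market
COEF_PER_MILE = 21.5 / 100    # $ per additional mile ($21.50 per 100 miles)
COEF_PER_AIRLINE = -41.0      # $ per additional airline serving the market

def fare_change(extra_miles=0.0, extra_airlines=0, southwest_exits=False):
    """Change in predicted fare when inputs change, all else held fixed."""
    return (COEF_PER_MILE * extra_miles
            + COEF_PER_AIRLINE * extra_airlines
            + (COEF_NO_SOUTHWEST if southwest_exits else 0.0))

longer_route = fare_change(extra_miles=100)       # about +21.5
one_more_airline = fare_change(extra_airlines=1)  # -41.0
combined = fare_change(extra_miles=200, extra_airlines=1, southwest_exits=True)
```

Because the model is linear, the effects add: the combined scenario is 43 - 41 + 49, or about $51.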
34. How does the software compute the parameters?
• Least Squares Principle: Choose β’s so that the sum of the
squared prediction errors,
$$SSQ = \sum_{m=1}^{M} \left( Fare_m - \beta_0 - \beta_1\, SW_m - \beta_2\, Dist_m - \beta_3\, Comp_m \right)^2$$
is as small as possible.
Ok, but what does that mean? Open the file SSQ_Intuition.xls
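The least squares principle can be tried out in code much as in a spreadsheet: compute the sum of squared errors for any candidate set of β’s and compare. This is a sketch on hypothetical markets, not the contents of SSQ_Intuition.xls; the helpers `ssq`, `solve`, and `least_squares` are written for this example and use the normal equations, which is one standard way to find the minimizer.

```python
# Hypothetical markets: (fare $, Southwest present 0/1, distance in miles, number of airlines).
markets = [
    (210.0, 0, 500.0, 2), (150.0, 1, 450.0, 3), (320.0, 0, 1200.0, 1), (240.0, 1, 1100.0, 3),
    (400.0, 0, 2000.0, 2), (290.0, 1, 1800.0, 4), (180.0, 0, 300.0, 3), (130.0, 1, 350.0, 2),
]

def ssq(beta):
    """Sum of squared prediction errors for candidate (b0, b1, b2, b3)."""
    b0, b1, b2, b3 = beta
    return sum((fare - (b0 + b1 * sw + b2 * dist + b3 * comp)) ** 2
               for fare, sw, dist, comp in markets)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def least_squares():
    """Minimize ssq by solving the normal equations (X'X) b = X'y."""
    X = [[1.0, sw, dist, comp] for _, sw, dist, comp in markets]
    y = [fare for fare, _, _, _ in markets]
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(4)]
    return solve(xtx, xty)

beta_hat = least_squares()
guess = [100.0, -40.0, 0.15, -10.0]           # an arbitrary hand-picked candidate
nudged = [beta_hat[0] + 5.0] + beta_hat[1:]   # least-squares solution with the intercept shifted
# ssq(beta_hat) is smaller than ssq(guess) and ssq(nudged).
```

Any change to `beta_hat`, in any direction, raises the sum of squared errors, which is what “as small as possible” means here.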
35. Average Fare by # of Airlines
Split by Presence of Southwest (Interactions—for later)
36. Conclusion
T-test and ANOVA are both used to compare means across different groups
T-test for 2 groups and ANOVA for many groups
We can always convert the question to a regression problem using dummy variables
Advantage of regression is that it is straightforward to control for any number of other variables that might impact the outcome
From now on, we will focus on regression analysis
37. Regression: Key Points
Regression: widely used research tool
• Determine whether the independent variables explain a significant
variation in the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can
be explained by the independent variables: strength of the
relationship.
• Control for other independent variables when evaluating the
contributions of a specific variable or set of variables. Marginal effect
• Forecast/Predict the values of the dependent variable.
• Use regression results as inputs to additional computations:
Optimal pricing, promotion, time to launch a product….