This document discusses quantitative research methods for model building and multiple regression analysis. It covers regression diagnostics, polynomial models, nominal variables in regression, stepwise regression, and statistical analyses for different variable types. It also provides an overview of simple linear regression, multiple regression, and logistic regression models. Examples are given to demonstrate how to estimate regression coefficients, assess model fit, interpret results, and check assumptions using SPSS.
9 model building
1. Quantitative Research Methods
Lecture 9 Model Building
1. Regression Diagnostics I
2. Regression Diagnostics II: Multicollinearity
3. Regression Diagnostics III: Time Series
4. Polynomial Models
5. Nominal Variables in Multiple Regression
6. Stepwise Multiple Regression
2. Statistical analyses
• Group differences (nominal variable) on one interval variable:
▫ t-tests (2 groups)
▫ ANOVA (3 or more groups)
One factor: one-way ANOVA
Two factors: two-way ANOVA
• The relationship between two nominal variables:
▫ Chi-square test
• The relationship between two interval variables:
▫ Correlation, simple linear regression
• The relationship between multiple interval variables and one interval variable:
▫ Multiple regression
• The relationship between multiple interval variables and one nominal variable (yes/no):
▫ Logistic regression
3. Regression
• Simple Linear Regression (interval)
▫ one independent, one dependent
• Multiple Regression (all interval)
▫ multiple independent, one dependent
• Logistic Regression
▫ multiple interval independent, one nominal dependent (Yes/No)
▫ Check example: https://youtu.be/H_48AcV0qlY
4. 16.4
Simple Linear Regression Model…
A straight-line model with one independent variable is called a simple linear regression model. It is written as:

y = β0 + β1x + ε

where y is the dependent variable, x is the independent variable, β0 is the y-intercept, β1 is the slope of the line, and ε is the error variable.
5. 16.5
Simple Linear Regression Model…
Note that both β0 and β1 are population parameters, which are usually unknown and hence estimated from the data.
(Diagram: the line y = β0 + β1x in the (x, y) plane; β1 = slope = rise/run, β0 = y-intercept.)
6. 16.6
Estimating the Coefficients…
In much the same way we base estimates of µ on x̄, we estimate β0 using b0 and β1 using b1, the y-intercept and slope (respectively) of the least squares or regression line given by:

ŷ = b0 + b1x

(Recall: this is an application of the least squares method, and it produces the straight line that minimizes the sum of the squared differences between the points and the line.)
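The arithmetic behind b0 and b1 is easy to check by hand. Below is a minimal Python sketch of the least-squares formulas on invented data; the numbers are placeholders, not from any example in this lecture:

```python
# Minimal sketch of the least-squares formulas on invented data.
import numpy as np

x = np.array([15, 18, 21, 24, 15, 18], dtype=float)  # hypothetical predictor
y = np.array([13, 8, 5, 3, 9, 12], dtype=float)      # hypothetical response

# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2);  b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"yhat = {b0:.3f} + {b1:.3f} x")
```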
8. 16.8
Example 16.2…
Car dealers across North America use the "Red Book" to
help them determine the value of used cars that their
customers trade in when purchasing new cars.
The book, which is published monthly, lists the trade-in
values for all basic models of cars.
It provides alternative values for each car model according
to its condition and optional features.
The values are determined on the basis of the average paid
at recent used-car auctions, the source of supply for many
used-car dealers.
9. 16.9
Example 16.2…
However, the Red Book does not indicate the value
determined by the odometer reading, despite the fact that a
critical factor for used-car buyers is how far the car has
been driven.
To examine this issue, a used-car dealer randomly selected 100 three-year-old Toyota Camrys that were sold at auction
during the past month.
The dealer recorded the price ($1,000) and the number of
miles (thousands) on the odometer. (Xm16-02).
The dealer wants to find the regression line.
10. 16.10
Using SPSS
SPSS steps: Analyze > Regression > Linear
11. 16.11
SPSS output: check three tables
(1) Model Summary: R², the strength of the linear relationship; (2) ANOVA: model significance/fit; (3) Coefficients: b0 and b1.
12. 16.12
Example 16.2…
As you might expect with used cars…
The slope coefficient, b1, is –0.0669; that is, each additional mile on the odometer decreases the price by $0.0669, or 6.69¢ (both variables are recorded in thousands).
The intercept, b0, is 17.250 (i.e., $17,250). One interpretation would be that when x = 0 (no miles on the car) the selling price is $17,250. However, we have no data for cars with fewer than 19,100 miles on them, so this isn't a correct assessment.
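For readers working outside SPSS, here is a hedged Python sketch of the same steps using statsmodels. The data are simulated to resemble Example 16.2 (odometer in thousands of miles, price in $1,000s); the column names and values are assumptions, not the actual Xm16-02 file:

```python
# Simulated stand-in for Example 16.2, fit with statsmodels OLS.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
odometer = rng.uniform(19.1, 49.2, 100)                       # thousands of miles
price = 17.25 - 0.0669 * odometer + rng.normal(0, 0.3, 100)   # $1,000s
df = pd.DataFrame({"odometer": odometer, "price": price})

model = smf.ols("price ~ odometer", data=df).fit()
print(model.summary())  # roughly the same information as SPSS's three tables:
                        # model summary (R^2), ANOVA (F), and coefficients
```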
13. 16.13
Testing the Slope…
If no linear relationship exists between the two
variables, we would expect the regression line to be
horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e.
we want to see if the slope (β1) is something other
than zero. Our research hypothesis becomes:
H1: β1 ≠ 0
Thus the null hypothesis becomes:
H0: β1 = 0
14. 16.14
Coefficient of Determination…
Tests thus far have shown if a linear relationship
exists; it is also useful to measure the strength
of the relationship. This is done by calculating
the coefficient of determination – R2.
The coefficient of determination is the square of the coefficient of correlation (r); hence R² = r².
15. 16.15
Coefficient of Determination…
As we did with analysis of variance, we can partition
the variation in y into two parts:
Variation in y = SSE + SSR
SSE – Sum of Squares Error – measures the amount of
variation in y that remains unexplained (i.e. due to
error)
SSR – Sum of Squares Regression – measures the
amount of variation in y explained by variation in the
independent variable x.
16. 16.16
Coefficient of Determination
R2 has a value of .6483. This means 64.83% of the variation
in the auction selling prices (y) is explained by the variation
in the odometer readings (x). The remaining 35.17% is
unexplained, i.e. due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.
In general, the higher the value of R², the better the model fits the data.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.
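The decomposition behind R² can be verified directly. A minimal sketch, using invented observed and fitted values (for an OLS fit with an intercept, SSE + SSR equals the total variation in y):

```python
# Sketch of the SSE/SSR decomposition behind R^2 (invented arrays).
import numpy as np

y = np.array([14.0, 15.2, 13.1, 12.4, 16.0])      # observed values (invented)
yhat = np.array([14.2, 14.9, 13.3, 12.1, 15.8])   # fitted values (invented)

sse = np.sum((y - yhat) ** 2)           # unexplained variation
ssr = np.sum((yhat - y.mean()) ** 2)    # explained variation
r2 = ssr / (sse + ssr)
print(f"R^2 = {r2:.4f}")
```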
17. 16.17
Using the Regression Equation…
We could use our regression equation:
ŷ = 17.250 – .0669x
to predict the selling price of a car with 40 (thousand) miles on it:
ŷ = 17.250 – .0669(40) = 14.574, i.e., $14,574
We call this value ($14,574) a point prediction.
Chances are, though, the actual selling price will be different; hence we can estimate the selling price in terms of an interval.
18. 16.18
Prediction Interval
The prediction interval is used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable:

ŷ ± t_(α/2, n–2) · s_ε · √[1 + 1/n + (x_g – x̄)² / ((n–1)s_x²)]

(x_g is the given value of x we're interested in)
19. 16.19
Confidence Interval Estimator…
…of the expected value of y. In this case, we are
estimating the mean of y given a value of x:
(Technically this formula is used for infinitely large
populations. However, we can interpret our
problem as attempting to determine the average
selling price of all Toyota Camrys, all with 40,000
miles on the odometer)
20. 16.20
What's the Difference?
The two formulas differ only in the extra "1" under the square root: the prediction interval includes it, the confidence interval does not.
The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.
Prediction interval: used to estimate one individual value of y (at a given x).
Confidence interval: used to estimate the mean value of y (at a given x).
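Both intervals are available from standard software. A hedged statsmodels sketch on simulated Example 16.2-style data (the names and values are assumptions):

```python
# Both interval types at x_g = 40, on simulated Example 16.2-style data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"odometer": rng.uniform(19.1, 49.2, 100)})
df["price"] = 17.25 - 0.0669 * df["odometer"] + rng.normal(0, 0.3, 100)
model = smf.ols("price ~ odometer", data=df).fit()

new = pd.DataFrame({"odometer": [40.0]})                  # x_g = 40 (thousands)
frame = model.get_prediction(new).summary_frame(alpha=0.05)
# Confidence interval for the mean of y at x_g:
print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])
# Prediction interval for an individual y at x_g (wider: the extra "1" term):
print(frame[["obs_ci_lower", "obs_ci_upper"]])
```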
23. 16.23
Regression Diagnostics…
There are three conditions that are required in order to
perform a regression analysis. These are:
• The error variable must be normally distributed,
• The error variable must have a constant variance,
• The errors must be independent of each other.
How can we diagnose violations of these conditions?
Residual Analysis, that is, examine the
differences between the actual data points and those
predicted by the linear equation…
24. 16.24
Nonnormality…
We can take the residuals and put them into a histogram
to visually check for normality…
…we’re looking for a bell shaped histogram with the
mean close to zero.
28. 16.28
Heteroscedasticity…
When the requirement of a constant variance is violated,
we have a condition of heteroscedasticity.
We can diagnose heteroscedasticity by plotting the
residual against the predicted y.
29. 16.29
Heteroscedasticity…
If the variance of the error variable (σ_ε²) is not constant, then we have "heteroscedasticity". Here's the plot of the residual against the predicted value of y:
there doesn’t appear to be a
change in the spread of the
plotted points, therefore no
heteroscedasticity
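The two visual checks (a residual histogram for normality, residuals versus predicted values for constant variance) can be produced as follows; this is a sketch on invented data, not the lecture's data set:

```python
# Visual residual diagnostics on a simulated simple-regression fit.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
df["y"] = 2 + 0.5 * df["x"] + rng.normal(0, 1, 100)
model = smf.ols("y ~ x", data=df).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(model.resid, bins=15)                # want a bell shape centered at 0
ax1.set_title("Histogram of residuals")
ax2.scatter(model.fittedvalues, model.resid)  # want constant spread around 0
ax2.axhline(0, linestyle="--")
ax2.set_title("Residuals vs. predicted y")
plt.show()
```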
32. 16.32
Nonindependence of the Error Variable
If we were to observe the auction price of cars every week
for, say, a year, that would constitute a time series.
When the data are time series, the errors often are
correlated.
Error terms that are correlated over time are said to be
autocorrelated or serially correlated.
We can often detect autocorrelation by graphing the
residuals against the time periods. If a pattern
emerges, it is likely that the independence requirement is
violated.
33. 16.33
Nonindependence of the Error Variable
Patterns in the appearance of the residuals over time indicate that autocorrelation exists:
Note the runs of positive residuals, replaced by runs of negative residuals.
Note the oscillating behavior of the residuals around zero.
The Durbin–Watson test is one way to test for autocorrelation.
34. 16.34
Outliers…
An outlier is an observation that is unusually
small or unusually large.
E.g., our used-car example had odometer readings from 19.1 to 49.2 thousand miles. Suppose we have a value of only 5,000 miles (i.e., a car driven by an old person only on Sundays) — this point is an outlier.
35. 16.35
Outliers…
Possible reasons for the existence of outliers include:
▫ There was an error in recording the value
▫ The point should not have been included in the sample
▫ Perhaps the observation is indeed valid.
Outliers can be easily identified from a scatter plot.
If the absolute value of the standardized residual is > 2, we suspect the point may be an outlier and investigate further.
They need to be dealt with since they can easily
influence the least squares line…
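A quick way to apply the |standardized residual| > 2 rule of thumb; again a sketch on invented data:

```python
# Flag suspected outliers via standardized residuals (|value| > 2).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
df["y"] = 2 + 0.5 * df["x"] + rng.normal(0, 1, 100)
model = smf.ols("y ~ x", data=df).fit()

std_resid = model.get_influence().resid_studentized_internal
suspects = np.where(np.abs(std_resid) > 2)[0]
print("Investigate rows:", suspects)
```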
38. Procedure for Regression Diagnostics
1. Develop a model that has a theoretical basis; that
is, for the dependent variable in question, find an
independent variable that you believe is linearly
related to it.
2. Gather data for the two variables.
3. Draw the scatter diagram to determine whether a
linear model appears to be appropriate. Identify
possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions (see the diagnostics slides).
6. Assess the model fit (see the SPSS output slides).
7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable or estimate its mean (or both).
39. From simple linear regression to
multiple regression
• Simple linear regression: one independent variable and one dependent variable (e.g., Education → Income).
40. 17.40
Multiple Regression…
The simple linear regression model was used to
analyze how one interval variable (the dependent
variable y) is related to one other interval variable (the
independent variable x).
Multiple regression allows for any number of
independent variables.
We expect to develop models that fit the data better
than would a simple linear regression model.
43. Example: GSS2008
• How is income affected by
▫ Age (AGE)
▫ Education (EDUC)
▫ Work hours (HRS)
▫ Spouse work hours (SPHRS)
▫ Occupation prestige score (PRESTG80)
▫ Number of children (CHILDS)
▫ Number of family members who earn money (EARNRS)
▫ Years with current employer (CUREMPYR)
44. 17.44
The Model…
We now assume we have k independent variables potentially related to the one dependent variable. This relationship is represented in this first-order linear equation:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

where y is the dependent variable, x1, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ε is the error variable.
In the one-variable, two-dimensional case we drew a regression line; here we imagine a response surface.
45. 17.45
Estimating the Coefficients…
The sample regression equation is expressed as:

ŷ = b0 + b1x1 + b2x2 + … + bkxk

We will use computer output to:
Assess the model…
How well does it fit the data? Is it useful? Are any required conditions violated?
Employ the model…
Interpreting the coefficients and making predictions using the regression model.
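As a hedged sketch of how this estimation looks outside SPSS: the fit below uses Python's statsmodels with the slide's GSS variable names, but the data are simulated placeholders, not GSS2008.

```python
# Multiple regression with the slide's variable names on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "AGE": rng.integers(18, 70, n),
    "EDUC": rng.integers(8, 21, n),
    "HRS": rng.integers(10, 60, n),
    "SPHRS": rng.integers(0, 60, n),
    "PRESTG80": rng.integers(20, 80, n),
    "CHILDS": rng.integers(0, 5, n),
    "EARNRS": rng.integers(1, 4, n),
    "CUREMPYR": rng.integers(0, 30, n),
})
df["INCOME"] = (-51785 + 461 * df["AGE"] + 4101 * df["EDUC"]
                + 620 * df["HRS"] + rng.normal(0, 20000, n))

fit = smf.ols("INCOME ~ AGE + EDUC + HRS + SPHRS + PRESTG80 + CHILDS"
              " + EARNRS + CUREMPYR", data=df).fit()
print(fit.params)                      # b0 ... b8
print(fit.rsquared, fit.rsquared_adj)  # R^2 and adjusted R^2
```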
46. 17.46
Regression Analysis Steps…
1. Use a computer and software to generate the coefficients and the statistics used to assess the model.
2. Diagnose violations of required conditions. If there are problems, attempt to remedy them.
3. Assess the model's fit: coefficient of determination, F-test of the analysis of variance.
4. If steps 1–3 are OK, use the model for prediction.
52. 17.52
The Model…
Although we haven't done any assessment of the model yet, at first pass:
ŷ = –51785.243 + 460.87x1 + 4100.9x2 + 620x3 – 862.201x4 + … + 329.771x8
it suggests that increases in AGE, EDUC, HRS, PRESTG80, EARNRS, and CUREMPYR will positively impact income.
Likewise, increases in SPHRS and CHILDS will negatively impact income…
INTERPRET
53. 17.53
Model Assessment…
We will assess the model in two ways:
Coefficient of determination, and
F-test of the analysis of variance.
54. 17.54
Coefficient of Determination…
• Again, the coefficient of determination is defined as:
R² = SSR / (SSR + SSE) = 1 – SSE / (total variation in y)
Here R² = .337. This means that 33.7% of the variation in income is explained by the eight independent variables, but 66.3% remains unexplained.
55. 17.55
Adjusted R2 value…
The adjusted R² is the coefficient of determination adjusted for the number of explanatory variables. It takes into account the sample size n and k, the number of independent variables, and is given by:

Adjusted R² = 1 – [(1 – R²)(n – 1) / (n – k – 1)]
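The adjustment is a one-line formula. A small sketch, where R² = .337 comes from the slide but n = 500 is an assumed sample size:

```python
# Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Penalize R^2 for the number of explanatory variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.337, n=500, k=8))  # n = 500 is an assumed sample size
```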
56. 17.56
Testing the Validity of the Model…
In a multiple regression model (i.e. more than one
independent variable), we utilize an analysis of
variance technique to test the overall validity of the
model. Here’s the idea:
H0: β1 = β2 = … = βk = 0
H1: At least one βi is not equal to zero.
If the null hypothesis is true, none of the independent variables is linearly related to y, and so the model is invalid.
If at least one βi is not equal to 0, the model does have some validity.
57. 17.57
Testing the Validity of the Model…
ANOVA table for regression analysis…

Source of Variation   Degrees of Freedom   Sums of Squares   Mean Squares        F-Statistic
Regression            k                    SSR               MSR = SSR/k         F = MSR/MSE
Error                 n–k–1                SSE               MSE = SSE/(n–k–1)
Total                 n–1
A large value of F indicates that most of the variation in y is explained by
the regression equation and that the model is valid. A small value of F
indicates that most of the variation in y is unexplained.
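The table's arithmetic in code, with invented SSR and SSE and assumed n and k, just to show the mechanics of the F-test:

```python
# ANOVA-table arithmetic behind the F-test (invented sums of squares).
from scipy import stats

ssr, sse = 4.0e11, 7.9e11   # invented explained/unexplained variation
n, k = 500, 8               # assumed sample size and number of predictors

msr = ssr / k
mse = sse / (n - k - 1)
F = msr / mse
p = stats.f.sf(F, k, n - k - 1)   # upper-tail p-value
print(f"F = {F:.2f}, p = {p:.4g}")
```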
58. Testing the Validity of the Model…
p < .05: at least one βi is not 0.
Reject H0, accept H1: the model is valid.
59. 17.59
Interpreting the Coefficients*
Intercept (b0) –51785.243 • This is the average income when all of the independent variables are zero. It's meaningless to try and interpret this value, particularly if 0 is outside the range of the values of the independent variables (as is the case here).
Age (b1) 460.87 • Each one-year increase in age increases annual income by $460.87.
Education (b2) 4100.9 • For each additional year of education, annual income increases by $4,100.90.
Hours of work (b3) 620 • For each additional hour of work per week, annual income increases by $620.
*in each case we assume all other variables are held constant…
60. 17.60
Interpreting the Coefficients*
Spouse hours of work (b4) –862.201 • For each additional hour the spouse works per week, average annual income decreases by $862.20.
Occupation prestige score (b5) 641 • For each additional unit of score, average annual income increases by $641.
Number of children (b6) –331 • For each additional child, average income decreases by $331.
Number of family members who earn money (b7) 687 • For each additional family member earning money, income increases by $687.
Number of years with current job (b8) 330 • For each additional year with the current job, income increases by $330.
*in each case we assume all other variables are held constant…
61. 17.61
Testing the Coefficients…
For each independent variable, we can test to
determine whether there is enough evidence of a linear
relationship between it and the dependent variable for
the entire population…
H0: βi = 0
H1: βi ≠ 0
(for i = 1, 2, …, k), using
t = bi / s_bi
as our test statistic (with n–k–1 degrees of freedom).
62. 17.62
Testing the Coefficients
We can use SPSS output to quickly test each of the
8 coefficients in our model…
Thus, EDUC, HRS, SPHRS, and PRESTG80 are linearly related to income. There is no evidence to infer that AGE, CHILDS, EARNRS, and CUREMPYR are linearly related to income.
63. 17.63
Using the Regression Equation
Much like we did with simple linear regression, we
can produce a prediction interval for a
particular value of y.
As well, we can produce the confidence interval
estimate of the expected value of y.
64. 17.64
Using the Regression Equation
Exercise GSS2008:
We add one row (our given values for the independent variables) to the bottom of our data set; then produce:
▫ the prediction interval
▫ the confidence interval estimate
for the dependent variable y.
65. 17.65
Regression Diagnostics I
Exercise GSS2008
• Calculate the residuals and check the following:
• Is the error variable nonnormal?
▫ Perform a normality test.
• Is the error variance constant?
▫ Plot the residuals versus the predicted values of y.
• Are the errors independent (time-series data)?
▫ Plot the residuals versus the time periods.
• Are there observations that are inaccurate or do
not belong to the target population?
▫ Double-check the accuracy of outliers and influential
observations.
66. 17.66
Regression Diagnostics II
• Multiple regression models have a problem that
simple regressions do not, namely
multicollinearity.
• It happens when the independent variables
are highly correlated.
• We’ll explore this concept through the following
example…
67. 17.67
Example GSS2008
• AGE and CUREMPYR are not significant predictors of INCOME in the multiple regression model, but when we run correlations between AGE and INCOME, and between CUREMPYR and INCOME, they are both significantly correlated.
• How do we account for this apparent contradiction?
• The answer is that AGE and CUREMPYR are correlated with each other (and with the other independent variables)!
• This is the problem of multicollinearity.
70. How to deal with multicollinearity
problem
• Multicollinearity exists in virtually all multiple regression models.
• To minimize its effect:
▫ Try to include independent variables that are independent of each other.
▫ Develop a model that has a theoretical basis, and include only the IVs that are necessary.
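Two quick diagnostic checks for multicollinearity are a correlation matrix of the predictors and variance inflation factors (VIFs). A sketch on invented data, with CUREMPYR deliberately built to be correlated with AGE; the VIF > 10 cutoff is a common rule of thumb, not from the slides:

```python
# Correlation matrix and VIFs on invented, deliberately collinear data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
age = rng.uniform(18, 70, 300)
curempyr = 0.5 * age + rng.normal(0, 5, 300)   # correlated with AGE by design
educ = rng.uniform(8, 20, 300)
X = pd.DataFrame({"AGE": age, "CUREMPYR": curempyr, "EDUC": educ})

print(X.corr().round(2))           # pairwise correlations among predictors
Xc = sm.add_constant(X)
for i, name in enumerate(Xc.columns):
    if name != "const":            # VIF > 10 (some use > 5) signals trouble
        print(name, round(variance_inflation_factor(Xc.values, i), 2))
```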
71. 17.71
Regression Diagnostics III – Time Series
• The Durbin–Watson test allows us to determine whether there is evidence of first-order autocorrelation — a condition in which a relationship exists between consecutive residuals, i.e., e_(i–1) and e_i (i is the time period). The statistic for this test is defined as:

d = Σ_(i=2..n) (e_i – e_(i–1))² / Σ_(i=1..n) e_i²

• d has a range of values: 0 ≤ d ≤ 4.
72. 17.72
Durbin–Watson (two-tail test)
• To test for first-order autocorrelation:
• If d < dL or d > 4 – dL, first-order autocorrelation exists.
• If d falls between dL and dU, or between 4 – dU and 4 – dL, the test is inconclusive.
• If d falls between dU and 4 – dU, there is no evidence of first-order autocorrelation.
(Number line from 0 to 4, with cut points dL, dU, 4–dU, 4–dL: exists | inconclusive | doesn't exist (around 2) | inconclusive | exists.)
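The statistic itself is simple to compute. A sketch on invented residuals, checking the hand computation against statsmodels' built-in durbin_watson:

```python
# Durbin-Watson: d = sum((e_i - e_{i-1})^2) / sum(e_i^2), on invented residuals.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

e = np.array([0.5, 0.6, 0.4, -0.2, -0.5, -0.3, 0.1, 0.4])  # invented residuals
d_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(d_manual, durbin_watson(e))  # the two agree; compare d to dL and dU
```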
73. 17.73
Example 17.1 Xm17-01
Can we create a model that will predict lift ticket
sales at a ski hill based on two weather
parameters?
Variables:
y - lift ticket sales during Christmas week,
x1 - total snowfall (inches), and
x2 - average temperature (degrees Fahrenheit)
Our ski hill manager collected 20 years of data.
74. 17.74
Example 17.1
Both the coefficient of determination
and the p-value of the F-test indicate
the model is poor…
Neither variable is linearly related to ticket sales…
75. 17.75
Example 17.1
• The histogram of residuals…
• reveals the errors may be normally distributed…
76. 17.76
Example 17.1
• In the plot of residuals versus predicted values
(testing for heteroscedasticity) — the error
variance appears to be constant…
77. 17.77
Example 17.1 Durbin-Watson
• Apply the Durbin–Watson statistic to the entire list of residuals.
• Regression > Linear > Statistics > check Durbin-Watson
78. 17.78
Example 17.1
To test for first-order autocorrelation with α = .05, we
find in Table 8(a) in Appendix B
dL = 1.10 and dU = 1.54
The null and alternative hypotheses are
H0 : There is no first-order autocorrelation.
H1 : There is first-order autocorrelation.
The rejection region includes d < dL = 1.10. Since d =
.593, we reject the null hypothesis and conclude that
there is enough evidence to infer that first-order
autocorrelation exists.
79. 17.79
Example 17.1
Autocorrelation usually indicates that the model needs to
include an independent variable that has a time-ordered
effect on the dependent variable.
The simplest such independent variable represents the
time periods. We included a third independent variable
that records the number of years since the year the data
were gathered. Thus, x3 = 1, 2,..., 20. The new model is
y = β0 + β1x1 + β2x2 + β3x3 + ε
80. 17.80
Example 17.1
The fit of the model is high; the model is valid…
Snowfall and time (x3, our new variable) are linearly related to ticket sales; temperature is not…
dL = 1.10 and dU = 1.54; since dU < d < 4 – dU, first-order autocorrelation doesn't exist.
81. 17.81
Example 17.1
• The Durbin–Watson statistic computed from the residuals of our new regression analysis is 1.885.
• We can conclude that there is not enough evidence to infer the presence of first-order autocorrelation. (Determining dL is left as an exercise for the reader…)
• Hence, we have improved our model dramatically!
82. 17.82
Example 17.1
Notice that the model is improved dramatically.
The F-test tells us that the model is valid. The t-tests tell us that
both the amount of snowfall and time are significantly linearly
related to the number of lift tickets.
This information could prove useful in advertising for the resort.
For example, if there has been a recent snowfall, the resort
could emphasize that in its advertising.
If no new snow has fallen, it may emphasize its snow-making facilities.
84. 18.84
Polynomial Models
Previously we looked at this multiple regression model:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

(it's considered linear or first-order since the exponent on each of the xi's is 1)
The independent variables may be functions of a smaller number of predictor variables; polynomial models fall into this category. If there is one predictor variable (x) we have:

y = β0 + β1x + β2x² + … + βpx^p + ε
85. 18.85
Polynomial Models
(1) y = β0 + β1x1 + β2x2 + … + βkxk + ε
(2) y = β0 + β1x + β2x² + … + βpx^p + ε
Technically, equation (2) is a multiple regression model with p independent variables (x1, x2, …, xp). Since x1 = x, x2 = x², x3 = x³, …, xp = x^p, it's based on one predictor variable (x).
p is the order of the equation; we'll focus on equations of order p = 1, 2, and 3.
86. 18.86
First Order Model
When p = 1, we have our simple linear regression model:

y = β0 + β1x + ε

That is, we believe there is a straight-line relationship between the dependent and independent variables over the range of the values of x.
89. 18.89
Polynomial Models: 2 Predictor
Variables
Perhaps we suspect that there are two predictor variables (x1 and x2) which influence the dependent variable:
First-order model (no interaction): y = β0 + β1x1 + β2x2 + ε
First-order model (with interaction): y = β0 + β1x1 + β2x2 + β3x1x2 + ε
90. 18.90
Polynomial Models: 2 Predictor Variables
First order models, 2 predictors, without & with interaction:
91. 18.91
Polynomial Models: 2 Predictor Variables
If we believe that a quadratic relationship exists between y and each of x1 and x2, and that the predictor variables interact in their effect on y, we can use this model:
Second-order model (in two variables) WITH interaction:
y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
92. 18.92
Polynomial Models: 2 Predictor
Variables
2nd order models, 2 predictors, without & with interaction:
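In formula-based software these models are one line each. A hedged statsmodels sketch of the second-order two-predictor model with interaction, on invented data (I() wraps the squared terms; x1:x2 is the interaction):

```python
# Second-order two-predictor model with interaction, via the formula API.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
d = pd.DataFrame({"x1": rng.normal(50, 10, 200), "x2": rng.normal(8, 2, 200)})
d["y"] = (5 - 0.02 * (d["x1"] - 50) ** 2 - 0.5 * (d["x2"] - 8) ** 2
          + rng.normal(0, 1, 200))

m = smf.ols("y ~ x1 + x2 + I(x1**2) + I(x2**2) + x1:x2", data=d).fit()
print(m.params)  # b0 ... b5, matching the slide's equation
```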
93. 18.93
Selecting a Model
One predictor variable, or two (or more)?
First order? Second order? Higher order?
With interaction? Without?
How do we choose the right model?
Use our knowledge of the variables involved to
build an initial model.
Test that model using statistical techniques.
If required, modify our model and re-test…
94. 18.94
Example 18.1
We’ve been asked to come up with a regression model
for a fast food restaurant. We know our primary
market is middle-income adults and their children,
particularly those between the ages of 5 and 12.
Dependent variable —restaurant revenue (gross or net)
Predictor variables — family income, age of children
Is the relationship first order? quadratic?…
95. 18.95
Example 18.1
The relationship between the dependent variable (revenue)
and each predictor variable is probably quadratic.
Members of low- or high-income households are less likely to eat at this chain's restaurants, since the restaurants attract mostly middle-income customers.
Families in neighborhoods where the mean age of children is either quite low or quite high are also less likely to eat there than families with children in the 5-to-12-year range.
Seems reasonable?
96. 18.96
Example 18.1
Should we include the interaction term in our model? When in doubt, it is probably best to include it.
Our model, then, is:

y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε

where y = annual gross sales, x1 = median annual household income*, x2 = mean age of children*
*in the neighborhood
97. 18.97
Example 18.2 Xm18-02
Our fast food restaurant research department
selected 25 locations at random and gathered data
on revenues, household income, and ages of
neighborhood children.
(Screenshot: the collected data alongside the calculated squared and interaction terms.)
98. 18.98
Example 18.2
You can take the original data collected (revenues,
household income, and age) and plot y vs. x1 and y
vs. x2 to get a feel for the data; trend lines were
added for clarity…
100. 18.100
Nominal Independent Variables
Thus far in our regression analysis, we’ve only
considered variables that are interval. Often
however, we need to consider nominal data in
our analysis.
For example, our earlier example regarding the
market for used cars focused only on mileage.
Perhaps color is an important factor. How can we
model this new variable?
101. 18.101
Indicator Variables
An indicator variable (also called a dummy
variable) is a variable that can assume either one
of only two values (usually 0 and 1).
A value of 1 usually indicates the existence of a certain
condition, while a value of 0 usually indicates that the
condition does not hold.
I1 = 1 if color is white, 0 if color is not white
I2 = 1 if color is silver, 0 if color is not silver

Car Color   I1   I2
white       1    0
silver      0    1
other       0    0
two-tone!   1    1

To represent m categories, we need m–1 indicator variables.
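A sketch of how the m–1 indicators enter a regression in practice, using invented used-car data; treatment coding with "other" as the baseline reproduces I1 (white) and I2 (silver):

```python
# Indicator (dummy) variables for a three-category color factor.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
cars = pd.DataFrame({
    "odometer": rng.uniform(19, 49, 100),
    "color": rng.choice(["white", "silver", "other"], 100),
})
cars["price"] = 17 - 0.06 * cars["odometer"] + rng.normal(0, 0.3, 100)

# Treatment coding with "other" as baseline yields I1 (white) and I2 (silver).
m = smf.ols('price ~ odometer + C(color, Treatment(reference="other"))',
            data=cars).fit()
print(m.params)
```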
102. 18.102
Interpreting Indicator Variable Coefficients
After performing our regression analysis, we have this regression equation (prices in $1,000s; b0 and b1 as estimated from the data):

ŷ = b0 + b1x + .0911 I1 + .3304 I2

Thus, the price diminishes with additional mileage (x); a white car sells for $91.10 more than other colors (I1); a silver car fetches $330.40 more than other colors (I2).
104. 18.104
Testing the Coefficients
To test the coefficient of I1, we use these
hypotheses…
H0: β2 = 0 (the coefficient of I1)
H1: β2 ≠ 0
There is insufficient evidence to infer that, in the population, 3-year-old white Tauruses with the same odometer reading have a different selling price than Tauruses in the "other" color category…
105. 18.105
Testing the Coefficients
To test the coefficient of I2, we use these
hypotheses…
H0: β3 = 0 (the coefficient of I2)
H1: β3 ≠ 0
We can conclude that there are differences in
auction selling prices between all 3-year-old
silver-colored Tauruses and the “other” color
category with the same odometer readings
106. Stepwise Regression
• Stepwise Regression is an iterative procedure
that adds and deletes one independent variable
at a time. The decision to add or delete a variable
is made on the basis of whether that variable
improves the model.
• It is a procedure that can eliminate correlated
independent variables.
107. Step 1: do a simultaneous regression and rank all the significant variables
(Screenshot: coefficients table with the predictors ranked No. 1 through No. 4 by significance.)
108. Step 2
• Analyze
• Regression
• Linear
• Stepwise
• Dependent variable
• Independent variables (1st round: the top predictor; 2nd round: the top predictor and the 2nd-top predictor; … until the nth round, where n = number of predictors)
• Statistics
• R square change & Descriptives
109. Stepwise output
• What to read?
• R², R² change, F of R² change, and the significance level of the F of R² change in each round
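For intuition, here is a minimal forward-selection loop, one flavor of stepwise regression. It mimics, rather than reproduces, SPSS's stepwise procedure; the simulated data and the 0.05 entry criterion are assumptions (for a single added variable, the t-test on the new term is equivalent to the F-of-R²-change test):

```python
# Minimal forward-selection sketch on simulated data (assumed names/criteria).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({v: rng.normal(size=n) for v in ["EDUC", "HRS", "AGE", "CHILDS"]})
df["INCOME"] = 3 * df["EDUC"] + 2 * df["HRS"] + rng.normal(size=n)

candidates, selected = ["EDUC", "HRS", "AGE", "CHILDS"], []
while candidates:
    # Pick the candidate that gives the largest R^2 when added to the model.
    best = max(candidates, key=lambda v: smf.ols(
        "INCOME ~ " + " + ".join(selected + [v]), data=df).fit().rsquared)
    fit = smf.ols("INCOME ~ " + " + ".join(selected + [best]), data=df).fit()
    if fit.pvalues[best] < 0.05:   # entry test on the newly added term
        selected.append(best)
        candidates.remove(best)
    else:
        break                      # no remaining candidate improves the model
print("Selected predictors:", selected)
```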
112. Multiple regression
• Multiple regression examines the predictability
of a set of predictors on a dependent variable
(criterion)
• Why don’t we just throw in all the predictors and
let the MR determine which ones are good
predictors then?
• Reason 1: Theoretical consideration
• Reason 2: Concern of sample size
113. Concern of sample size
• The desired level is 20 observations for each
independent variable
• For instance, if you have 6 predictors, you’ve got
to have at least 120 subjects in your data
• However, if a stepwise procedure is employed,
the recommended level increases to 50 to 1
• That is, you’ve got to have at least 300 subjects
in order to run stepwise MR
114. 18.114
Model Building
Here is a procedure for building a regression model:
1. Identify the dependent variable; what is it we wish to predict? Don't forget the variable's unit of measure.
2. List potential predictors; how would changes in predictors change the dependent variable? Be selective; go with the fewest independent variables required. Be aware of the effects of multicollinearity.
3. Gather the data; at least six observations for each independent variable used in the equation.
115. 18.115
Model Building
4. Identify several possible models; formulate first- and second-order models with and without interaction. Draw scatter diagrams.
5. Use statistical software to estimate the models.
6. Determine whether the required conditions are satisfied; if not, attempt to correct the problem.
7. Use your judgment and the statistical output to select the best model!