1. Quantitative Research Technique
Multiple Regression Analysis
Selection of Predictor Variables
Confidence and Prediction Interval
Dinesh Pudasaini (CRN 071MSI604)
2. Goal
• Develop a statistical model that can predict the values of a
dependent (response) variable based upon the values of
the Independent (explanatory) variables.
• In many situations, more than one independent variable
may be useful in predicting the value of a dependent
variable. We then use multiple regression.
3. Introduction
Simple Regression
A statistical model that utilizes one quantitative independent
variable “X” to predict the quantitative dependent variable
“Y.”
Multiple Regression:
A statistical model that utilizes two or more quantitative and
qualitative explanatory variables (x1,..., xk) to predict a
quantitative dependent variable Y.
4. Simple vs. Multiple
• Simple Regression
• represents the unit
change in Y per unit
change in X .
• Does not take into account
any other variable besides
single independent
variable.
• Multiple regression
• i represents the unit
change in Y per unit change
in Xi.
• Takes into account the
effect of other i s.
6. Linear Model
• Relationship between one dependent & two or more
independent variables is a linear function
Y = β0 + β1X1 + β2X2 + … + βkXk + ε
where Y is the dependent (response) variable, X1, …, Xk are the independent (explanatory) variables, β0 is the population Y-intercept, β1, …, βk are the population slopes, and ε is the random error term.
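As an illustration, here is a minimal simulation sketch of this model in Python; the coefficient values (β0 = 5, β1 = 2, β2 = −1.5), the error standard deviation, and the sample size are all assumed for the example.

```python
# Minimal sketch: simulate data from Y = beta0 + beta1*X1 + beta2*X2 + eps,
# with eps ~ N(0, sigma^2). All numeric values below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100
beta0, beta1, beta2, sigma = 5.0, 2.0, -1.5, 3.0    # assumed population parameters

x1 = rng.uniform(0, 10, n)                          # explanatory variable X1
x2 = rng.uniform(0, 10, n)                          # explanatory variable X2
eps = rng.normal(0, sigma, n)                       # random error term
y = beta0 + beta1 * x1 + beta2 * x2 + eps           # response variable Y
```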
7. Linear Model
• The error terms ε are mutually independent and identically
distributed, with mean 0 and constant variance σ².
• This is so because the observations y1, y2, . . . , yn are a random
sample; they are mutually independent, and hence the error
terms are also mutually independent.
• The distribution of the error term is independent of the joint
distribution of x1, x2, . . . , xk.
8. Method of Least Squares
• We use the least-squares method to fit a linear function to the
data.
• b0, b1, b2, b3, . . . , bk are the sample estimates of the
coefficients β0, β1, β2, β3, . . . , βk.
• The least-squares method chooses the b’s that make the sum of
squares of the residuals as small as possible.
• The least-squares estimates are the values that minimize the
quantity Σ(yi − ŷi)², the sum of squared deviations between the
observed and fitted values.
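A minimal least-squares sketch, continuing the simulated x1, x2, y from the block above; np.linalg.lstsq returns the b’s that minimize the residual sum of squares.

```python
# Least-squares fit of the linear model (uses x1, x2, y from the simulation sketch).
import numpy as np

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix: intercept, X1, X2
b, *_ = np.linalg.lstsq(X, y, rcond=None)         # b = (b0, b1, b2) minimizing the sum of squared residuals
residuals = y - X @ b
print("estimates:", b, " SSE:", np.sum(residuals ** 2))
```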
9. Standard Error of Estimate and
Coefficient of Multiple Determination
• The observed variability of the responses about this fitted model
is measured by the variance s² = SSE / (n − k − 1), and the
regression standard error of estimate is s = √[SSE / (n − k − 1)].
Coefficient of Multiple Determination
When the null hypothesis is rejected, a relationship between Y and
the X variables exists; its strength is measured by R².
10. Coefficient of Multiple Determination.
• Sum of squares due to error:
SSE = Σ(yi − ŷi)²
• Sum of squares due to regression:
SSR = Σ(ŷi − ȳ)²
• Total sum of squares:
SST = Σ(yi − ȳ)²
• Obviously, SST = SSR + SSE.
• The ratio SSR/SST represents the proportion of the total variation in
y explained by the regression model.
• This ratio, denoted by R², is called the coefficient of multiple
determination.
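A short sketch computing the three sums of squares and R² for the fitted model from the previous blocks:

```python
# Sums of squares and R^2 (uses X, b, y from the least-squares sketch above).
import numpy as np

y_hat = X @ b                              # fitted values
SSE = np.sum((y - y_hat) ** 2)             # sum of squares due to error
SSR = np.sum((y_hat - y.mean()) ** 2)      # sum of squares due to regression
SST = np.sum((y - y.mean()) ** 2)          # total sum of squares
R2 = SSR / SST                             # coefficient of multiple determination
print(R2, np.isclose(SST, SSR + SSE))      # SST = SSR + SSE holds up to rounding
```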
11. Adjusted Coefficient of Multiple
Determination.
• R² is sensitive to the magnitudes of n and k in small samples.
If k is large relative to n, the model tends to fit the data very
well. In the extreme case, if n = k + 1, the model fits the data
exactly.
• A better goodness-of-fit measure is the adjusted R²:
Adjusted R² = 1 − [(n − 1)/(n − k − 1)](1 − R²)
            = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
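The corresponding adjusted R², computed from either of the two equivalent forms above (here with the k = 2 predictors of the simulated example):

```python
# Adjusted R^2 (uses y, R2, SSE, SST from the previous sketches; k = 2 predictors).
n, k = len(y), 2
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
# Equivalent form: 1 - (SSE / (n - k - 1)) / (SST / (n - 1))
print(adj_R2)
```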
12. Hypothesis Tests in Multiple Linear
Regression
• Three types of hypothesis tests can be carried out for multiple
linear regression models:
• First test, the test for significance of regression: checks the
significance of the whole regression model (an overall F-test).
• Second test: checks the significance of individual regression
coefficients (partial t-tests).
• Third test: simultaneously checks the significance of a group of
regression coefficients (a partial F-test).
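A sketch of how the three tests can be obtained with statsmodels for the simulated data above; the restriction matrix in the third test (both slopes equal to zero) is chosen only as an example.

```python
# Overall F-test, partial t-tests, and a joint F-test on a subset of coefficients.
import numpy as np
import statsmodels.api as sm

X_sm = sm.add_constant(np.column_stack([x1, x2]))   # intercept + predictors
fit = sm.OLS(y, X_sm).fit()

print(fit.fvalue, fit.f_pvalue)     # 1) significance of the whole regression (F-test)
print(fit.tvalues, fit.pvalues)     # 2) significance of individual coefficients (t-tests)
R = np.array([[0.0, 1.0, 0.0],      # 3) joint test of beta1 = beta2 = 0 (example restriction)
              [0.0, 0.0, 1.0]])
print(fit.f_test(R))
```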
16. ANOVA for Regression
• Analysis of Variance (ANOVA) consists of
calculations that provide information about levels of
variability within a regression model and form a basis
for tests of significance.
17. Example
A TV industry analyst wants to build a statistical model for
predicting the number of subscribers that a cable station can
expect.
Y = Number of cable subscribers (SUSCRIB).
X1 = Advertising rate the station charges local advertisers for one minute
of prime-time space (ADRATE).
X2 = Kilowatt power of the station’s non-cable signal (KILOWATT).
X3 = Number of families living in the station’s area of dominant influence
(ADI), a geographical division of radio and TV audiences (APIPOP).
X4 = Number of competing stations in the ADI (COMPETE).
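A hypothetical sketch of how the analyst’s model could be fit; the slides do not include the cable-station data, so the file name "cable.csv" and its column layout are assumed placeholders.

```python
# Fit SUSCRIB on the four candidate predictors (data file is an assumed placeholder).
import pandas as pd
import statsmodels.formula.api as smf

cable = pd.read_csv("cable.csv")   # assumed: columns SUSCRIB, ADRATE, KILOWATT, APIPOP, COMPETE
fit = smf.ols("SUSCRIB ~ ADRATE + KILOWATT + APIPOP + COMPETE", data=cable).fit()
print(fit.summary())               # partial t-tests indicate which variables to drop
```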
19. Multiple Regression Equation
• Based on the partial t-test, the variables signal and compete
are the least significant variables in our model.
• Let’s drop the least significant variables one at a time.
22. Multiple Regression Prediction
• All the variables in the model are statistically significant,
therefore our final model is:
• Final Model
23. Multicollinearity
• High correlation between X variables (Independent variables).
• Coefficients measure combined effect.
• Leads to unstable coefficients depending on X variables in model
• Always exists; matter of degree
• Example: Using both total number of rooms and number of
bedrooms as explanatory variables in same model
• In many non-experimental situations in business,
economics, and the social and biological sciences, the
independent variables tend to be correlated among
themselves.
24. Detecting Multicollinearity
• Examine correlation matrix
– Check whether correlations between pairs of X variables are
higher than their correlations with the Y variable
• Few remedies
– Obtain new sample data
– Eliminate one correlated X variable
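A small sketch of the correlation-matrix check, using the simulated predictors from the earlier blocks (pandas assumed):

```python
# Examine the correlation matrix of the predictors and their correlation with y.
import pandas as pd

df_corr = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
# High correlations between X columns (relative to their correlation with y) flag multicollinearity.
print(df_corr.corr())
```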
25. Finding the Best Multiple Regression
Equation
• Use common sense and practical considerations to
include or exclude variables.
• Consider the P-value.
• Consider equations with high values of adjusted R2
and try to include only a few variables.
• For a given number of predictor (x) variables,
select the equation with the largest value of adjusted
R2.
27. Statement of problem
• A common problem is that there is a large set of candidate
predictor variables.
• The goal is to choose a small subset from the larger set so that the
resulting regression model is simple, yet has good predictive
ability.
Example: Cement data
• Response y: heat evolved in calories during hardening of cement on a per
gram basis
• Predictor x1: % of tricalcium aluminate
• Predictor x2: % of tricalcium silicate
• Predictor x3: % of tetracalcium alumino ferrite
• Predictor x4: % of dicalcium silicate
28. Two basic methods
of selecting predictors
• Stepwise regression: Enter and remove predictors, in a
stepwise manner, until there is no justifiable reason to enter or
remove more.
• Best subsets regression: Select the subset of predictors that do
the best at meeting some well-defined objective criterion.
29. Stepwise regression: the idea
• Start with no predictors in the “stepwise model.”
• At each step, enter or remove a predictor based on partial F-
tests (that is, the t-tests).
• Stop when no more predictors can be justifiably entered or
removed from the stepwise model.
1. Specify an Alpha-to-Enter (αE = 0.15) significance level.
2. Specify an Alpha-to-Remove (αR = 0.15) significance level.
30. Stepwise regression:
Step #1
1. Fit each of the one-predictor models, that is, regress y on x1,
regress y on x2, … regress y on xp-1.
2. The first predictor put in the stepwise model is the predictor that
has the smallest t-test P-value (below αE = 0.15).
3. If no P-value is below αE = 0.15, stop.
Step #2
1. Suppose x1 was the “best” one predictor.
2. Fit each of the two-predictor models with x1 in the model, that is,
regress y on (x1, x2), regress y on (x1, x3), …, and y on (x1, xp-1).
3. The second predictor put in stepwise model is the predictor that
has the smallest t-test P-value (below αE = 0.15).
4. If no P-value is below αE = 0.15, stop.
31. Stepwise regression:
Step #2 (continued)
1. Suppose x2 was the “best” second predictor.
2. Step back and check the P-value for β1 = 0. If it is no longer
significant (above αR = 0.15), remove x1 from the stepwise
model.
Step#3
1. Suppose both x1 and x2 made it into the two-predictor
stepwise model.
2. Fit each of the three-predictor models with x1 and x2 in the
model, that is, regress y on (x1, x2, x3), regress y on (x1, x2,
x4), …, and regress y on (x1, x2, xp-1).
32. Stepwise regression:
Step #3 (continued)
1. The third predictor put in stepwise model is the predictor
that has the smallest t-test P-value (below αE = 0.15).
2. If no P-value is below αE = 0.15, stop.
3. Step back and check the P-values for β1 = 0 and β2 = 0. If either
is no longer significant (above αR = 0.15), remove that
predictor from the stepwise model.
Stopping the procedure
The procedure is stopped when adding an additional predictor
does not yield a t-test P-value below αE = 0.15.
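A rough sketch of the stepwise procedure described above, with both thresholds set to 0.15; the data frame df, its response column, and the candidate list are assumed inputs, and this is an illustration of the steps rather than a library routine.

```python
# Stepwise selection with an alpha-to-enter and an alpha-to-remove of 0.15.
import statsmodels.api as sm

def stepwise(df, response, candidates, alpha_enter=0.15, alpha_remove=0.15):
    selected = []
    while True:
        # Entry step: fit each model that adds one remaining predictor,
        # and pick the predictor with the smallest t-test P-value.
        remaining = [c for c in candidates if c not in selected]
        pvals = {}
        for c in remaining:
            X = sm.add_constant(df[selected + [c]])
            pvals[c] = sm.OLS(df[response], X).fit().pvalues[c]
        if not pvals or min(pvals.values()) >= alpha_enter:
            break                                  # stop: nothing left to justifiably enter
        selected.append(min(pvals, key=pvals.get))
        # Removal step: step back and drop any predictor that is no longer significant.
        fit = sm.OLS(df[response], sm.add_constant(df[selected])).fit()
        selected = [c for c in selected if fit.pvalues[c] < alpha_remove]
    return selected
```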
34. Confidence intervals are intervals constructed about the
predicted value of y, at a given level of x, which are used to
measure the accuracy of the mean response of all the
individuals in the population.
Prediction intervals are intervals constructed about the
predicted value of y that are used to measure the accuracy of a
single individual’s predicted value.
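A sketch contrasting the two intervals with statsmodels, again on the simulated fit from the earlier blocks; the new design point (x1 = 5, x2 = 5) is an arbitrary example.

```python
# Confidence interval for the mean response vs. prediction interval for one new observation.
import numpy as np
import statsmodels.api as sm

X_sm = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X_sm).fit()

x_new = np.array([[1.0, 5.0, 5.0]])          # intercept, x1 = 5, x2 = 5 (assumed point)
pred = fit.get_prediction(x_new)
print(pred.conf_int(alpha=0.05))             # confidence interval for the mean response
print(pred.conf_int(obs=True, alpha=0.05))   # wider prediction interval for a single individual
```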
36. Example
• Suppose we want to estimate the average weight of an adult
male in a city. We draw a random sample of 1,000 men from a
population of 1,000,000 men and weigh them. We find that the
average man in our sample weighs 180 pounds, and the
standard deviation of the sample is 30 pounds. What is the
95% confidence interval?
Solution:
• Identify a sample statistic. Since we are trying to estimate the
mean weight in the population, we choose the mean weight in
our sample (180) as the sample statistic.
• Select a confidence level. We are working with a 95%
confidence level.
37. Example Contd….
• Find the margin of error.
Find standard error.
The standard error (SE) of the mean is:
SE = s / sqrt( n ) = 30 / sqrt(1000) = 30/31.62 = 0.95
Find critical value.
• The critical value is a factor used to compute the margin of
error. To express the critical value as a t score (t*):
– Compute alpha (α): α = 1 − (confidence level / 100) = 1 − 95/100 = 0.05
– Find the critical probability (p*): p* = 1 − α/2 = 1 − 0.05/2 =
0.975
– Find the degrees of freedom (df): df = n − 1 = 1000 − 1 = 999
38. Example Contd..
– The critical value is the t score having 999 degrees of
freedom and a cumulative probability equal to 0.975. From
the t distribution table, we find that the critical value is
1.96.
• Note: We might also have expressed the critical value as a
z-score, since the sample size is large.
• Compute the margin of error (ME): ME = critical value × standard
error = 1.96 × 0.95 = 1.86
• The range of the confidence interval = sample statistic ±
margin of error.
• With the uncertainty denoted by the confidence level, the
95% confidence interval is 180 ± 1.86 pounds.
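The same arithmetic can be checked with scipy, using the numbers given on the slide:

```python
# Reproduce the margin-of-error calculation: n = 1000, mean = 180, s = 30.
from math import sqrt
from scipy import stats

n, xbar, s = 1000, 180.0, 30.0
se = s / sqrt(n)                           # standard error, about 0.95
t_star = stats.t.ppf(0.975, df=n - 1)      # critical t value, about 1.96
me = t_star * se                           # margin of error, about 1.86
print(xbar - me, xbar + me)                # 95% CI: roughly 178.1 to 181.9 pounds
```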
39. Questions
• Explain the linear multiple regression model.
• How can predictor variables be selected using stepwise
regression analysis?
• Suppose we want to estimate the average weight of an adult
male in a city. We draw a random sample of 1,000 men from a
population of 1,000,000 men and weigh them. We find that the
average man in our sample weighs 180 pounds, and the
standard deviation of the sample is 30 pounds. What is the
95% confidence interval?