SlideShare a Scribd company logo
Categorical Variables
Confounding
Variable Selection
Other Considerations
Linear Modelling in Stata
Session 6: Further Topics in Linear Modelling
Mark Lunt
Centre for Epidemiology Versus Arthritis
University of Manchester
24/11/2020
Categorical Variables
Confounding
Variable Selection
Other Considerations
This Week
Categorical Variables
Comparing outcome between groups
Comparing slopes between groups (Interactions)
Confounding
Variable Selection
Other considerations
Polynomial Regression
Transformation
Regression through the origin
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Categorical Variables
None of the linear model assumptions mention the
distribution of x.
Can use x-variables with any distribution
This enables us to compare different groups
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Dichotomous Variable
Let x = 0 in group A and x = 1 in group B.
Linear model equation is Ŷ = β0 + β1x
In group A, x = 0 so Ŷ = β0
In group B, x = 1 so Ŷ = β0 + β1
Hence the coefficient of x gives the mean difference
between the two groups.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Dichotomous Variable Example
x takes values 0 or 1
Y is normally distributed with variance 1, and mean 3 if
x = 0 and 4 if x = 1.
We wish to test if there difference in the mean value of Y
between the groups with x = 0 and x = 1
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Dichotomous Variable: Stata output
. regress Y x
Source | SS df MS Number of obs = 40
-------------+------------------------------ F( 1, 38) = 10.97
Model | 9.86319435 1 9.86319435 Prob > F = 0.0020
Residual | 34.1679607 38 .89915686 R-squared = 0.2240
-------------+------------------------------ Adj R-squared = 0.2036
Total | 44.031155 39 1.12900398 Root MSE = .94824
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .9931362 .2998594 3.31 0.002 .3861025 1.60017
_cons | 3.0325 .2120326 14.30 0.000 2.603262 3.461737
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Dichotomous Variables and the T-Test
Differences in mean between two groups usually tested for
with t-test.
Linear model results are exactly the same.
Linear model assumptions are exactly the same.
Normal distribution in each group
Same variance in each group
A t-test is a special case of a linear model.
Linear model is far more versatile (can adjust for other
variables).
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
T-Test: Stata output
. ttest Y, by(x)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
0 | 20 3.0325 .2467866 1.103663 2.515969 3.54903
1 | 20 4.025636 .1703292 .7617355 3.669133 4.382139
---------+--------------------------------------------------------------------
combined | 40 3.529068 .1680033 1.062546 3.189249 3.868886
---------+--------------------------------------------------------------------
diff | -.9931362 .2998594 -1.60017 -.3861025
------------------------------------------------------------------------------
diff = mean(0) - mean(1) t = -3.3120
Ho: diff = 0 degrees of freedom = 38
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0010 Pr(|T| > |t|) = 0.0020 Pr(T > t) = 0.9990
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Categorical Variable with Several Categories
What can we do if there are more than two categories ?
Cannot use x = 0, 1, 2, . . ..
Instead we use “dummy” or “indicator” variables.
If there are k categories, we need k − 1 indicators.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Three Groups: Example
Group x1 x2 Ȳ σ2
A 0 0 3 1 Baseline Group
B 1 0 5 1
C 0 1 4 1
β0 = Ŷ in group A
β1 = difference between Ŷ in group A and Ŷ in group B
β2 = difference between Ŷ in group A and Ŷ in group C
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Three Groups: Stata Output
. regress Y x1 x2
Source | SS df MS Number of obs = 60
-------------+------------------------------ F( 2, 57) = 16.82
Model | 37.1174969 2 18.5587485 Prob > F = 0.0000
Residual | 62.8970695 57 1.10345736 R-squared = 0.3711
-------------+------------------------------ Adj R-squared = 0.3491
Total | 100.014566 59 1.69516214 Root MSE = 1.0505
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 | 1.924713 .3321833 5.79 0.000 1.259528 2.589899
x2 | 1.035985 .3321833 3.12 0.003 .3707994 1.701171
_cons | 3.075665 .2348891 13.09 0.000 2.605308 3.546022
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Comparing Groups
In the previous example, groups B and C both compared to
group A.
Can we compare groups B and C as well ?
In group B, Ŷ = β0 + β1
In group C, Ŷ = β0 + β2
Hence difference between groups is β1 − β2
Can use lincom to obtain this difference, and test its
significance.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
The lincom Command
lincom is short for linear combination.
It can be used to calculate linear combinations of the
parameters of a linear model.
Linear combination = ajβj + ak βk + . . .
Can be used to find differences between groups
(Difference between Group B and Group C = β1 − β2)
Can be used to find mean values in groups
(Mean value in group B = β0 + β1).
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Stata Output from lincom
. lincom x1 - x2
( 1) x1 - x2 = 0
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .8887284 .3321833 2.68 0.010 .2235428 1.553914
------------------------------------------------------------------------------
. lincom _cons + x1
( 1) x1 + _cons = 0
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 5.000378 .2348891 21.29 0.000 4.530021 5.470736
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Factor Variables in Stata
Generating dummy variables can be tedious and
error-prone
Stata can do it for you
Identify categorical variables by adding “i.” to the start of
their name.
For example, suppose that the variable group contains the
values “1”, “2” and “3” for the three groups in the previous
example.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Stata Output with a Factor Variable
. regress Y i.group
Source | SS df MS Number of obs = 60
-------------+------------------------------ F( 2, 57) = 16.82
Model | 37.1174969 2 18.5587485 Prob > F = 0.0000
Residual | 62.8970695 57 1.10345736 R-squared = 0.3711
-------------+------------------------------ Adj R-squared = 0.3491
Total | 100.014566 59 1.69516214 Root MSE = 1.0505
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
group |
2 | 1.924713 .3321833 5.79 0.000 1.259528 2.589899
3 | 1.035985 .3321833 3.12 0.003 .3707994 1.701171
|
_cons | 3.075665 .2348891 13.09 0.000 2.605308 3.546022
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Using factor variables with lincom
. lincom 2.group - 3.group
( 1) 2.group - 3.group = 0
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .8887284 .3321833 2.68 0.010 .2235428 1.553914
------------------------------------------------------------------------------
. lincom _cons + 2.group
( 1) 2.group + _cons = 0
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 5.000378 .2348891 21.29 0.000 4.530021 5.470736
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Linear Models and ANOVA
Differences in mean between more than two groups
usually tested for with ANOVA.
Linear model results are exactly the same.
Linear model assumptions are exactly the same.
ANOVA is a special case of a linear model.
Linear model is far more versatile (can adjust for other
variables).
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Mixing Categorical & Continuous Variables
So far, we have only seen either continuous or categorical
predictors in a linear model.
No problem to mix both.
E.g. Consider a clinical trial in which the outcome is
strongly associated with age.
To test the effect of treatment, need to include both age
and treatment in linear model.
Once upon a time, this was called Analysis of Covariance
(ANCOVA)
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Example Clinical Trial: simulated data
0
5
10
Y
20 25 30 35 40
age
Placebo Active Treatment
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Stata Output Ignoring the Effect of Age
. regress Y treat
Source | SS df MS Number of obs = 40
-------------+------------------------------ F( 1, 38) = 2.86
Model | 26.5431819 1 26.5431819 Prob > F = 0.0989
Residual | 352.500943 38 9.27634061 R-squared = 0.0700
-------------+------------------------------ Adj R-squared = 0.0456
Total | 379.044125 39 9.71908013 Root MSE = 3.0457
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treat | 1.629208 .9631376 1.69 0.099 -.3205623 3.578978
_cons | 4.379165 .6810411 6.43 0.000 3.00047 5.757861
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Observed and predicted values from linear model
ignoring age
0
5
10
20 25 30 35 40
age
Placebo Active Treatment
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Stata Output Including the Effect of Age
. regress Y treat age
Source | SS df MS Number of obs = 40
-------------+------------------------------ F( 2, 37) = 262.58
Model | 354.096059 2 177.04803 Prob > F = 0.0000
Residual | 24.9480658 37 .674272049 R-squared = 0.9342
-------------+------------------------------ Adj R-squared = 0.9306
Total | 379.044125 39 9.71908013 Root MSE = .82114
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treat | 1.238752 .2602711 4.76 0.000 .7113924 1.766111
age | -.5186644 .0235322 -22.04 0.000 -.5663453 -.4709836
_cons | 20.59089 .7581107 27.16 0.000 19.05481 22.12696
------------------------------------------------------------------------------
Age explains variation in Y
This reduces RMSE (estimate of σ)
Standard error of coefficient = σ
√
nsx
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Observed and predicted values from linear model
including age
0
5
10
20 25 30 35 40
age
Placebo Active Treatment
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Interactions
In previous example, assumed that the effect of age was
the same in treated and untreated groups.
I.e. regression lines were parallel.
This may not be the case.
If the effect of one variable varies accord to the value of
another variable, this is called “interaction” between the
variables.
Don’t assume that an effect differs between two groups
because it is significant in one, not in the other
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Interaction Example
Consider the clinical trial in the previous example
Suppose treatment reverses the effect of aging, so that Ŷ
is constant in the treated group.
Thus the difference between the treated and untreated
groups will increase with increasing age.
Need to fit different intercepts and different slopes in the
two groups.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Clinical trial data with predictions assuming equal
slopes
0
5
10
20 25 30 35 40
age
Placebo Active Treatment
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Regression Equations
Need to fit the two equations
Y=



β00 + β10 × age + ε if treat = 0
β01 + β11 × age + ε if treat = 1
These are equivalent to the equation
Y=β00+β10×age+(β01−β00)×treat+(β11−β10)×age×treat+ε.
I.e. the output from stata can be interpreted as
_cons The intercept in the untreated group (treat
== 0)
age The slope with age in the untreated group
treat The difference in intercept between the
treated and untreated groups
treat#c.age The difference in slope between the treated
and untreated groups
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Interactions: Stata Output
. regress Y i.treat age i.treat#c.age
Source | SS df MS Number of obs = 40
-------------+------------------------------ F( 3, 36) = 173.38
Model | 563.762012 3 187.920671 Prob > F = 0.0000
Residual | 39.0189256 36 1.08385904 R-squared = 0.9353
-------------+------------------------------ Adj R-squared = 0.9299
Total | 602.780938 39 15.4559215 Root MSE = 1.0411
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.treat | -8.226356 1.872952 -4.39 0.000 -12.02488 -4.427833
age | -.4866572 .0412295 -11.80 0.000 -.5702744 -.40304
|
treat#c.age |
1 | .4682374 .0597378 7.84 0.000 .3470836 .5893912
|
_cons | 19.73531 1.309553 15.07 0.000 17.07942 22.39121
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Interactions: Using lincom
lincom can be used to calculate the slope in the treated
group:
. lincom age + 1.treat#c.age
( 1) age + 1.treat#c.age = 0
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -.0184198 .0432288 -0.43 0.673 -.1060919 .0692523
------------------------------------------------------------------------------
Can also be used to calculate intercept in treated group.
However, this is not interesting since
We are unlikely to be be interested in subjects of age 0
The youngest subjects in our sample were 20, so we are
extrapolating a long way from the data.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Interactions: Predictions from Linear Model
0
5
10
20 25 30 35 40
age
Placebo Active Treatment
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Treatment effect at different ages
. lincom 1.treat + 20*1.treat#c.age
( 1) 1.treat + 20*1.treat#c.age = 0
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 1.138392 .7279832 1.56 0.127 -.3380261 2.61481
------------------------------------------------------------------------------
. lincom 1.treat + 40*1.treat#c.age
( 1) 1.treat + 40*1.treat#c.age = 0
------------------------------------------------------------------------------
Y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 10.50314 .6378479 16.47 0.000 9.209524 11.79676
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
The testparm Command
Used to test a number of parameters simultaneously
Syntax: testparm varlist
Test β = 0 for all variables in varlist
Produces a χ2 test on k degrees of freedom, where there
are k variables in varlist.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Dichotomous Variables
Multiple Categories
Categorical & Continuous
Interactions
Old and new syntax for categorical variables
Stata used to use a different syntax for categorical
variables
Still works, but new method is preferred
You may still see old syntax in existing do-files
New syntax Old Syntax
Prefix none required xi:
Variable type Numeric String or numeric
Interaction # *
Creates new variables No Yes
More info help fvvarlist help xi
Categorical Variables
Confounding
Variable Selection
Other Considerations
Confounding
A linear model shows association.
It does not show causation.
Apparent association may be due to a third variable which
we haven’t included in model
Confounding is about causality, and knowledge of the
mechanisms are required to decide if a variable is a
confounder.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Confounding Example: Fuel Consumption
. regress mpg foreign
Source | SS df MS Number of obs = 74
---------+------------------------------ F( 1, 72) = 13.18
Model | 378.153515 1 378.153515 Prob > F = 0.0005
Residual | 2065.30594 72 28.6848048 R-squared = 0.1548
---------+------------------------------ Adj R-squared = 0.1430
Total | 2443.45946 73 33.4720474 Root MSE = 5.3558
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
foreign | 4.945804 1.362162 3.631 0.001 2.230384 7.661225
_cons | 19.82692 .7427186 26.695 0.000 18.34634 21.30751
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Confounding Example: Weight and Fuel Consumption
10
20
30
40
Mileage
(mpg)
2,000 3,000 4,000 5,000
Weight (lbs.)
US Vehicles non−US Vehicles
Categorical Variables
Confounding
Variable Selection
Other Considerations
Confounding Example: Controlling for Weight
. regress mpg foreign weight
Source | SS df MS Number of obs = 74
---------+------------------------------ F( 2, 71) = 69.75
Model | 1619.2877 2 809.643849 Prob > F = 0.0000
Residual | 824.171761 71 11.608053 R-squared = 0.6627
---------+------------------------------ Adj R-squared = 0.6532
Total | 2443.45946 73 33.4720474 Root MSE = 3.4071
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
foreign | -1.650029 1.075994 -1.533 0.130 -3.7955 .4954421
weight | -.0065879 .0006371 -10.340 0.000 -.0078583 -.0053175
_cons | 41.6797 2.165547 19.247 0.000 37.36172 45.99768
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
What is Confounding ?
What you see is not what you get
Ŷ = β0 + β1x
Two groups differing in x by ∆x will differ in Y by β1∆x
If we change x by ∆x, what happens to Ŷ ?
If it changes by β1∆x, no confounding
If it changes by anything else, there is confounding
Categorical Variables
Confounding
Variable Selection
Other Considerations
Path Variables vs. Confounders
Foreign //
$$
mpg
Weight
<<
Weight is a path variable
Foreign // mpg
Weight
<<
dd
Weight is a confounder
Categorical Variables
Confounding
Variable Selection
Other Considerations
Identifying a Confounder
Is a cause of the outcome irrespective of other predictors
Is associated with the predictor
Is not a consequence of the predictor
Weight is associated with mpg
This association does not depend on where the car was
designed
But is weight a path variable ?
Categorical Variables
Confounding
Variable Selection
Other Considerations
Identifying a Confounder
Is a cause of the outcome irrespective of other predictors
Is associated with the predictor
Is not a consequence of the predictor
Weight is associated with mpg
This association does not depend on where the car was
designed
But is weight a path variable ?
Foreign designers produce smaller cars in order to getter
better fuel consumption: path variable
Categorical Variables
Confounding
Variable Selection
Other Considerations
Identifying a Confounder
Is a cause of the outcome irrespective of other predictors
Is associated with the predictor
Is not a consequence of the predictor
Weight is associated with mpg
This association does not depend on where the car was
designed
But is weight a path variable ?
Foreign designers produce smaller cars in order to getter
better fuel consumption: path variable
Size is decided for reasons other than fuel consumption:
confounder
Categorical Variables
Confounding
Variable Selection
Other Considerations
Allowing for Confounding
In theory, adding a confounder to a regression model is
sufficient to adjust for confounding.
Then parameters for other variables measure the effects of
those variables when confounder does not change.
This assumes
Confounder measured perfectly
Linear association between confounder and outcome
If either of the above are not true, there will be residual
confounding
Categorical Variables
Confounding
Variable Selection
Other Considerations
Variable Selection
May wish to reduce the number of predictors used in a
linear model.
Efficiency
Clearer understanding
Several suggested methods
Forward selection
Backward Elimination
Stepwise
All subsets
Categorical Variables
Confounding
Variable Selection
Other Considerations
Forward Selection
Choose a significance level pe at which variables will enter
the model.
Fit each predictor in turn.
Choose the most significant predictor.
If its significance level is less than pe, it is selected.
Now add each remaining variable to this model in turn, and
test the most significant.
Continue until no further variables are added.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Backward Elimination
Starts with all predictors in model.
Removes the least significant.
Repeat until all remaining predictors significant at chosen
level pr .
Has the advantage that all parameters are adjusted for the
effect of all other variables from the start.
Can give unusual results if there are a large number of
correlated variables.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Stepwise Selection
Combination of preceding methods.
Variables are added one at a time.
Each time a variable is added, all the other variables are
tested to see if they should be removed.
Must have pr > pe, or a variable could be entered and
removed on the same step.
Categorical Variables
Confounding
Variable Selection
Other Considerations
All Subsets
Can try every possible subset of variables.
Can be hard work: 10 predictors = 1023 subsets.
Need a criterion to choose best model.
Adjusted R2 is possible, there are others.
Not implemented in stata.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Problems with Variable Selection
Significance Levels
Hypotheses tested are not independent.
Variables chosen for testing not randomly selected.
Hence significance levels not equal to nominal levels.
Less of a problem in large samples.
Differences in Models Selected
Models chosen by different methods may differ.
If variables are highly correlated, choice of variable
becomes arbitrary
Choice of significance level will affect models.
Need common sense.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Variable Selection in Stata
Command sw regress is used for forwards, backwards
and stepwise selection.
Option pe is used to set significance level for inclusion
Option pr is used to set significance level for exclusion
Set pe for forwards, pr for backwards and both for
stepwise regression.
The sw command does not work with factor variables, so
the old xi: syntax must be used.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Variable Selection in Stata: Example 1
. sw regress weight price hdroom trunk length turn displ gratio, pe(0.05)
p = 0.0000 < 0.0500 adding length
p = 0.0000 < 0.0500 adding displ
p = 0.0015 < 0.0500 adding price
p = 0.0288 < 0.0500 adding turn
Source | SS df MS Number of obs = 74
---------+------------------------------ F( 4, 69) = 293.75
Model | 41648450.8 4 10412112.7 Prob > F = 0.0000
Residual | 2445727.56 69 35445.3269 R-squared = 0.9445
---------+------------------------------ Adj R-squared = 0.9413
Total | 44094178.4 73 604029.841 Root MSE = 188.27
------------------------------------------------------------------------------
weight | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
length | 19.38601 2.328203 8.327 0.000 14.74137 24.03064
displ | 2.257083 .467792 4.825 0.000 1.323863 3.190302
price | .0332386 .0087921 3.781 0.000 .0156989 .0507783
turn | 23.17863 10.38128 2.233 0.029 2.468546 43.88872
_cons | -2193.042 298.0756 -7.357 0.000 -2787.687 -1598.398
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Variable Selection in Stata: Example 2
. sw regress weight price hdroom trunk length turn displ gratio, pr(0.05)
p = 0.6348 >= 0.0500 removing hdroom
p = 0.5218 >= 0.0500 removing trunk
p = 0.1371 >= 0.0500 removing gratio
Source | SS df MS Number of obs = 74
---------+------------------------------ F( 4, 69) = 293.75
Model | 41648450.8 4 10412112.7 Prob > F = 0.0000
Residual | 2445727.56 69 35445.3269 R-squared = 0.9445
---------+------------------------------ Adj R-squared = 0.9413
Total | 44094178.4 73 604029.841 Root MSE = 188.27
------------------------------------------------------------------------------
weight | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
price | .0332386 .0087921 3.781 0.000 .0156989 .0507783
turn | 23.17863 10.38128 2.233 0.029 2.468546 43.88872
displ | 2.257083 .467792 4.825 0.000 1.323863 3.190302
length | 19.38601 2.328203 8.327 0.000 14.74137 24.03064
_cons | -2193.042 298.0756 -7.357 0.000 -2787.687 -1598.398
------------------------------------------------------------------------------
Categorical Variables
Confounding
Variable Selection
Other Considerations
Polynomial Regression
If association between x and Y is non-linear, can fit
polynomial terms in x.
Keep adding terms until the highest order term is not
significant.
Parameters are meaningless: only entire function has
meaning.
Fractional polynomials and splines can also be used
Categorical Variables
Confounding
Variable Selection
Other Considerations
Transformations
If Y is not normal or has non-constant variance, it may be
possible to fit a linear model to a transformation of Y.
Interpretation becomes more difficult after transformation.
Log transformation has a simple interpretation.
log(Y) = β0 + β1x
when x increases by 1, log(Y) increases by β1,
Y is multiplied by eβ1
Transforming x is not normally necessary unless the
problem suggests it.
Categorical Variables
Confounding
Variable Selection
Other Considerations
Regression through the origin
You may know that if x = 0, y = 0.
Stata can force the regression line through the origin with
the option nocons.
However
R2
is calculated differently and cannot be compared to
conventional R2
.
If we have no data near the origin, should not force line
through the origin.
May obtain a better fit with a non-zero intercept if there is
measurement error.

More Related Content

Similar to slides.pdf

Sparsenet
SparsenetSparsenet
Sparsenetndronen
 
Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsSalford Systems
 
statistic project on Hero motocorp
statistic project on Hero motocorpstatistic project on Hero motocorp
statistic project on Hero motocorpYug Bokadia
 
Feature Scaling with R.pdf
Feature Scaling with R.pdfFeature Scaling with R.pdf
Feature Scaling with R.pdfShakiruBankole2
 
Standard deviation quartile deviation
Standard deviation  quartile deviationStandard deviation  quartile deviation
Standard deviation quartile deviationRekha Yadav
 
Introduction to Hybrid Vehicle System Modeling and Control - 2013 - Liu - App...
Introduction to Hybrid Vehicle System Modeling and Control - 2013 - Liu - App...Introduction to Hybrid Vehicle System Modeling and Control - 2013 - Liu - App...
Introduction to Hybrid Vehicle System Modeling and Control - 2013 - Liu - App...sravan66
 
18-21 Principles of Least Squares.ppt
18-21 Principles of Least Squares.ppt18-21 Principles of Least Squares.ppt
18-21 Principles of Least Squares.pptBAGARAGAZAROMUALD2
 
Regression Analysis.pptx
Regression Analysis.pptxRegression Analysis.pptx
Regression Analysis.pptxMdRokonMia1
 
Lecture_4_Quality Control Tools.pdf / en
Lecture_4_Quality Control Tools.pdf / enLecture_4_Quality Control Tools.pdf / en
Lecture_4_Quality Control Tools.pdf / enssuser8634982
 
Measure of dispersion by Neeraj Bhandari ( Surkhet.Nepal )
Measure of dispersion by Neeraj Bhandari ( Surkhet.Nepal )Measure of dispersion by Neeraj Bhandari ( Surkhet.Nepal )
Measure of dispersion by Neeraj Bhandari ( Surkhet.Nepal )Neeraj Bhandari
 
Correspondence Analysis
Correspondence AnalysisCorrespondence Analysis
Correspondence AnalysisGaetan Lion
 
2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_fariaPaulo Faria
 
Utility maximisation and regret minimisation - a mixture of a generalisation
Utility maximisation and regret minimisation - a mixture of a generalisationUtility maximisation and regret minimisation - a mixture of a generalisation
Utility maximisation and regret minimisation - a mixture of a generalisationInstitute for Transport Studies (ITS)
 

Similar to slides.pdf (20)

Static Models of Continuous Variables
Static Models of Continuous VariablesStatic Models of Continuous Variables
Static Models of Continuous Variables
 
Sparsenet
SparsenetSparsenet
Sparsenet
 
Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForests
 
statistic project on Hero motocorp
statistic project on Hero motocorpstatistic project on Hero motocorp
statistic project on Hero motocorp
 
Feature Scaling with R.pdf
Feature Scaling with R.pdfFeature Scaling with R.pdf
Feature Scaling with R.pdf
 
Multiple Regression
Multiple RegressionMultiple Regression
Multiple Regression
 
Ch15
Ch15Ch15
Ch15
 
Standard deviation quartile deviation
Standard deviation  quartile deviationStandard deviation  quartile deviation
Standard deviation quartile deviation
 
Chapter07.pdf
Chapter07.pdfChapter07.pdf
Chapter07.pdf
 
Introduction to Hybrid Vehicle System Modeling and Control - 2013 - Liu - App...
Introduction to Hybrid Vehicle System Modeling and Control - 2013 - Liu - App...Introduction to Hybrid Vehicle System Modeling and Control - 2013 - Liu - App...
Introduction to Hybrid Vehicle System Modeling and Control - 2013 - Liu - App...
 
Regression
RegressionRegression
Regression
 
18-21 Principles of Least Squares.ppt
18-21 Principles of Least Squares.ppt18-21 Principles of Least Squares.ppt
18-21 Principles of Least Squares.ppt
 
Regression Analysis.pptx
Regression Analysis.pptxRegression Analysis.pptx
Regression Analysis.pptx
 
Lecture_4_Quality Control Tools.pdf / en
Lecture_4_Quality Control Tools.pdf / enLecture_4_Quality Control Tools.pdf / en
Lecture_4_Quality Control Tools.pdf / en
 
Measure of dispersion by Neeraj Bhandari ( Surkhet.Nepal )
Measure of dispersion by Neeraj Bhandari ( Surkhet.Nepal )Measure of dispersion by Neeraj Bhandari ( Surkhet.Nepal )
Measure of dispersion by Neeraj Bhandari ( Surkhet.Nepal )
 
Correspondence Analysis
Correspondence AnalysisCorrespondence Analysis
Correspondence Analysis
 
Data Mind Traps
Data Mind TrapsData Mind Traps
Data Mind Traps
 
2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria
 
Utility maximisation and regret minimisation - a mixture of a generalisation
Utility maximisation and regret minimisation - a mixture of a generalisationUtility maximisation and regret minimisation - a mixture of a generalisation
Utility maximisation and regret minimisation - a mixture of a generalisation
 
Describing Data: Numerical Measures
Describing Data: Numerical MeasuresDescribing Data: Numerical Measures
Describing Data: Numerical Measures
 

More from dawitg2

bacterial plant pathogen jeyarajesh-190413122916.pptx
bacterial plant pathogen jeyarajesh-190413122916.pptxbacterial plant pathogen jeyarajesh-190413122916.pptx
bacterial plant pathogen jeyarajesh-190413122916.pptxdawitg2
 
unit1_practical-applications-of-biotechnology.ppt
unit1_practical-applications-of-biotechnology.pptunit1_practical-applications-of-biotechnology.ppt
unit1_practical-applications-of-biotechnology.pptdawitg2
 
Introduction-to-Plant-Cell-Culture-lec1.ppt
Introduction-to-Plant-Cell-Culture-lec1.pptIntroduction-to-Plant-Cell-Culture-lec1.ppt
Introduction-to-Plant-Cell-Culture-lec1.pptdawitg2
 
fertilizers-150111082613-conversion-gate01.pdf
fertilizers-150111082613-conversion-gate01.pdffertilizers-150111082613-conversion-gate01.pdf
fertilizers-150111082613-conversion-gate01.pdfdawitg2
 
sac-301-manuresfertilizersandsoilfertilitymanagement-210105122952.pdf
sac-301-manuresfertilizersandsoilfertilitymanagement-210105122952.pdfsac-301-manuresfertilizersandsoilfertilitymanagement-210105122952.pdf
sac-301-manuresfertilizersandsoilfertilitymanagement-210105122952.pdfdawitg2
 
davindergill 135021014 -170426133338.pdf
davindergill 135021014 -170426133338.pdfdavindergill 135021014 -170426133338.pdf
davindergill 135021014 -170426133338.pdfdawitg2
 
preparationofdifferentagro-chemicaldosesforfield-140203005533-phpapp02.pdf
preparationofdifferentagro-chemicaldosesforfield-140203005533-phpapp02.pdfpreparationofdifferentagro-chemicaldosesforfield-140203005533-phpapp02.pdf
preparationofdifferentagro-chemicaldosesforfield-140203005533-phpapp02.pdfdawitg2
 
pesticide stoeage storage ppwscript.ppt
pesticide stoeage storage  ppwscript.pptpesticide stoeage storage  ppwscript.ppt
pesticide stoeage storage ppwscript.pptdawitg2
 
-pre-cautions--on--seed-storage--1-.pptx
-pre-cautions--on--seed-storage--1-.pptx-pre-cautions--on--seed-storage--1-.pptx
-pre-cautions--on--seed-storage--1-.pptxdawitg2
 
Intgrated pest management 02 _ 06_05.pdf
Intgrated pest management 02 _ 06_05.pdfIntgrated pest management 02 _ 06_05.pdf
Intgrated pest management 02 _ 06_05.pdfdawitg2
 
major potato and tomato disease in ethiopia .pptx
major potato and tomato disease in ethiopia .pptxmajor potato and tomato disease in ethiopia .pptx
major potato and tomato disease in ethiopia .pptxdawitg2
 
pesticide formulation pp602lec1to2-211207073233.pdf
pesticide formulation pp602lec1to2-211207073233.pdfpesticide formulation pp602lec1to2-211207073233.pdf
pesticide formulation pp602lec1to2-211207073233.pdfdawitg2
 
diseasespests-2013-130708184617-phpapp02.pptx
diseasespests-2013-130708184617-phpapp02.pptxdiseasespests-2013-130708184617-phpapp02.pptx
diseasespests-2013-130708184617-phpapp02.pptxdawitg2
 
pesticide pp602lec1to2-211207073233.pptx
pesticide pp602lec1to2-211207073233.pptxpesticide pp602lec1to2-211207073233.pptx
pesticide pp602lec1to2-211207073233.pptxdawitg2
 
agricultaraly important agrochemicals.ppt
agricultaraly important agrochemicals.pptagricultaraly important agrochemicals.ppt
agricultaraly important agrochemicals.pptdawitg2
 
agri pesticide chemistry-180722174627.pdf
agri pesticide chemistry-180722174627.pdfagri pesticide chemistry-180722174627.pdf
agri pesticide chemistry-180722174627.pdfdawitg2
 
372922285 -important Fungal-Nutrition.ppt
372922285 -important Fungal-Nutrition.ppt372922285 -important Fungal-Nutrition.ppt
372922285 -important Fungal-Nutrition.pptdawitg2
 
disease development and pathogenesis-201118142432.pptx
disease development and pathogenesis-201118142432.pptxdisease development and pathogenesis-201118142432.pptx
disease development and pathogenesis-201118142432.pptxdawitg2
 
ppp211lecture8-221211055228-824cf9da.pptx
ppp211lecture8-221211055228-824cf9da.pptxppp211lecture8-221211055228-824cf9da.pptx
ppp211lecture8-221211055228-824cf9da.pptxdawitg2
 
defensemechanismsinplants-180308104711.pptx
defensemechanismsinplants-180308104711.pptxdefensemechanismsinplants-180308104711.pptx
defensemechanismsinplants-180308104711.pptxdawitg2
 

More from dawitg2 (20)

bacterial plant pathogen jeyarajesh-190413122916.pptx
bacterial plant pathogen jeyarajesh-190413122916.pptxbacterial plant pathogen jeyarajesh-190413122916.pptx
bacterial plant pathogen jeyarajesh-190413122916.pptx
 
unit1_practical-applications-of-biotechnology.ppt
unit1_practical-applications-of-biotechnology.pptunit1_practical-applications-of-biotechnology.ppt
unit1_practical-applications-of-biotechnology.ppt
 
Introduction-to-Plant-Cell-Culture-lec1.ppt
Introduction-to-Plant-Cell-Culture-lec1.pptIntroduction-to-Plant-Cell-Culture-lec1.ppt
Introduction-to-Plant-Cell-Culture-lec1.ppt
 
fertilizers-150111082613-conversion-gate01.pdf
fertilizers-150111082613-conversion-gate01.pdffertilizers-150111082613-conversion-gate01.pdf
fertilizers-150111082613-conversion-gate01.pdf
 
sac-301-manuresfertilizersandsoilfertilitymanagement-210105122952.pdf
sac-301-manuresfertilizersandsoilfertilitymanagement-210105122952.pdfsac-301-manuresfertilizersandsoilfertilitymanagement-210105122952.pdf
sac-301-manuresfertilizersandsoilfertilitymanagement-210105122952.pdf
 
davindergill 135021014 -170426133338.pdf
davindergill 135021014 -170426133338.pdfdavindergill 135021014 -170426133338.pdf
davindergill 135021014 -170426133338.pdf
 
preparationofdifferentagro-chemicaldosesforfield-140203005533-phpapp02.pdf
preparationofdifferentagro-chemicaldosesforfield-140203005533-phpapp02.pdfpreparationofdifferentagro-chemicaldosesforfield-140203005533-phpapp02.pdf
preparationofdifferentagro-chemicaldosesforfield-140203005533-phpapp02.pdf
 
pesticide stoeage storage ppwscript.ppt
pesticide stoeage storage  ppwscript.pptpesticide stoeage storage  ppwscript.ppt
pesticide stoeage storage ppwscript.ppt
 
-pre-cautions--on--seed-storage--1-.pptx
-pre-cautions--on--seed-storage--1-.pptx-pre-cautions--on--seed-storage--1-.pptx
-pre-cautions--on--seed-storage--1-.pptx
 
Intgrated pest management 02 _ 06_05.pdf
Intgrated pest management 02 _ 06_05.pdfIntgrated pest management 02 _ 06_05.pdf
Intgrated pest management 02 _ 06_05.pdf
 
major potato and tomato disease in ethiopia .pptx
major potato and tomato disease in ethiopia .pptxmajor potato and tomato disease in ethiopia .pptx
major potato and tomato disease in ethiopia .pptx
 
pesticide formulation pp602lec1to2-211207073233.pdf
pesticide formulation pp602lec1to2-211207073233.pdfpesticide formulation pp602lec1to2-211207073233.pdf
pesticide formulation pp602lec1to2-211207073233.pdf
 
diseasespests-2013-130708184617-phpapp02.pptx
diseasespests-2013-130708184617-phpapp02.pptxdiseasespests-2013-130708184617-phpapp02.pptx
diseasespests-2013-130708184617-phpapp02.pptx
 
pesticide pp602lec1to2-211207073233.pptx
pesticide pp602lec1to2-211207073233.pptxpesticide pp602lec1to2-211207073233.pptx
pesticide pp602lec1to2-211207073233.pptx
 
agricultaraly important agrochemicals.ppt
agricultaraly important agrochemicals.pptagricultaraly important agrochemicals.ppt
agricultaraly important agrochemicals.ppt
 
agri pesticide chemistry-180722174627.pdf
agri pesticide chemistry-180722174627.pdfagri pesticide chemistry-180722174627.pdf
agri pesticide chemistry-180722174627.pdf
 
372922285 -important Fungal-Nutrition.ppt
372922285 -important Fungal-Nutrition.ppt372922285 -important Fungal-Nutrition.ppt
372922285 -important Fungal-Nutrition.ppt
 
disease development and pathogenesis-201118142432.pptx
disease development and pathogenesis-201118142432.pptxdisease development and pathogenesis-201118142432.pptx
disease development and pathogenesis-201118142432.pptx
 
ppp211lecture8-221211055228-824cf9da.pptx
ppp211lecture8-221211055228-824cf9da.pptxppp211lecture8-221211055228-824cf9da.pptx
ppp211lecture8-221211055228-824cf9da.pptx
 
defensemechanismsinplants-180308104711.pptx
defensemechanismsinplants-180308104711.pptxdefensemechanismsinplants-180308104711.pptx
defensemechanismsinplants-180308104711.pptx
 

Recently uploaded

Open Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPointOpen Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPointELaRue0
 
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya - UEM Kolkata Quiz Club
 
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdfDanh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdfQucHHunhnh
 
Features of Video Calls in the Discuss Module in Odoo 17
Features of Video Calls in the Discuss Module in Odoo 17Features of Video Calls in the Discuss Module in Odoo 17
Features of Video Calls in the Discuss Module in Odoo 17Celine George
 
....................Muslim-Law notes.pdf
....................Muslim-Law notes.pdf....................Muslim-Law notes.pdf
....................Muslim-Law notes.pdfVikramadityaRaj
 
ppt your views.ppt your views of your college in your eyes
ppt your views.ppt your views of your college in your eyesppt your views.ppt your views of your college in your eyes
ppt your views.ppt your views of your college in your eyesashishpaul799
 
IATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdffIATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdff17thcssbs2
 
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdfTelling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdfTechSoup
 
Benefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational ResourcesBenefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational Resourcesdimpy50
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePedroFerreira53928
 
Basic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & EngineeringBasic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & EngineeringDenish Jangid
 
Gyanartha SciBizTech Quiz slideshare.pptx
Gyanartha SciBizTech Quiz slideshare.pptxGyanartha SciBizTech Quiz slideshare.pptx
Gyanartha SciBizTech Quiz slideshare.pptxShibin Azad
 
2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptxmansk2
 
slides CapTechTalks Webinar May 2024 Alexander Perry.pptx
slides CapTechTalks Webinar May 2024 Alexander Perry.pptxslides CapTechTalks Webinar May 2024 Alexander Perry.pptx
slides CapTechTalks Webinar May 2024 Alexander Perry.pptxCapitolTechU
 
Application of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matricesApplication of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matricesRased Khan
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxEduSkills OECD
 
The Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. HenryThe Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. HenryEugene Lysak
 

Recently uploaded (20)

Open Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPointOpen Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPoint
 
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
 
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdfDanh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
 
Features of Video Calls in the Discuss Module in Odoo 17
Features of Video Calls in the Discuss Module in Odoo 17Features of Video Calls in the Discuss Module in Odoo 17
Features of Video Calls in the Discuss Module in Odoo 17
 
....................Muslim-Law notes.pdf
....................Muslim-Law notes.pdf....................Muslim-Law notes.pdf
....................Muslim-Law notes.pdf
 
ppt your views.ppt your views of your college in your eyes
ppt your views.ppt your views of your college in your eyesppt your views.ppt your views of your college in your eyes
ppt your views.ppt your views of your college in your eyes
 
Operations Management - Book1.p - Dr. Abdulfatah A. Salem
Operations Management - Book1.p  - Dr. Abdulfatah A. SalemOperations Management - Book1.p  - Dr. Abdulfatah A. Salem
Operations Management - Book1.p - Dr. Abdulfatah A. Salem
 
“O BEIJO” EM ARTE .
“O BEIJO” EM ARTE                       .“O BEIJO” EM ARTE                       .
“O BEIJO” EM ARTE .
 
IATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdffIATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdff
 
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdfTelling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
Telling Your Story_ Simple Steps to Build Your Nonprofit's Brand Webinar.pdf
 
Post Exam Fun(da) Intra UEM General Quiz - Finals.pdf
Post Exam Fun(da) Intra UEM General Quiz - Finals.pdfPost Exam Fun(da) Intra UEM General Quiz - Finals.pdf
Post Exam Fun(da) Intra UEM General Quiz - Finals.pdf
 
Benefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational ResourcesBenefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational Resources
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Basic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & EngineeringBasic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
 
Gyanartha SciBizTech Quiz slideshare.pptx
Gyanartha SciBizTech Quiz slideshare.pptxGyanartha SciBizTech Quiz slideshare.pptx
Gyanartha SciBizTech Quiz slideshare.pptx
 
2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx
 
slides CapTechTalks Webinar May 2024 Alexander Perry.pptx
slides CapTechTalks Webinar May 2024 Alexander Perry.pptxslides CapTechTalks Webinar May 2024 Alexander Perry.pptx
slides CapTechTalks Webinar May 2024 Alexander Perry.pptx
 
Application of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matricesApplication of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matrices
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
The Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. HenryThe Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. Henry
 

slides.pdf

  • 1. Categorical Variables Confounding Variable Selection Other Considerations Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Mark Lunt Centre for Epidemiology Versus Arthritis University of Manchester 24/11/2020
  • 2. Categorical Variables Confounding Variable Selection Other Considerations This Week Categorical Variables Comparing outcome between groups Comparing slopes between groups (Interactions) Confounding Variable Selection Other considerations Polynomial Regression Transformation Regression through the origin
  • 3. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Categorical Variables None of the linear model assumptions mention the distribution of x. Can use x-variables with any distribution This enables us to compare different groups
  • 4. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Dichotomous Variable Let x = 0 in group A and x = 1 in group B. Linear model equation is Ŷ = β0 + β1x In group A, x = 0 so Ŷ = β0 In group B, x = 1 so Ŷ = β0 + β1 Hence the coefficient of x gives the mean difference between the two groups.
  • 5. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Dichotomous Variable Example x takes values 0 or 1 Y is normally distributed with variance 1, and mean 3 if x = 0 and 4 if x = 1. We wish to test if there difference in the mean value of Y between the groups with x = 0 and x = 1
  • 6. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Dichotomous Variable: Stata output . regress Y x Source | SS df MS Number of obs = 40 -------------+------------------------------ F( 1, 38) = 10.97 Model | 9.86319435 1 9.86319435 Prob > F = 0.0020 Residual | 34.1679607 38 .89915686 R-squared = 0.2240 -------------+------------------------------ Adj R-squared = 0.2036 Total | 44.031155 39 1.12900398 Root MSE = .94824 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- x | .9931362 .2998594 3.31 0.002 .3861025 1.60017 _cons | 3.0325 .2120326 14.30 0.000 2.603262 3.461737 ------------------------------------------------------------------------------
  • 7. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Dichotomous Variables and the T-Test Differences in mean between two groups usually tested for with t-test. Linear model results are exactly the same. Linear model assumptions are exactly the same. Normal distribution in each group Same variance in each group A t-test is a special case of a linear model. Linear model is far more versatile (can adjust for other variables).
  • 8. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions T-Test: Stata output . ttest Y, by(x) Two-sample t test with equal variances ------------------------------------------------------------------------------ Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------- 0 | 20 3.0325 .2467866 1.103663 2.515969 3.54903 1 | 20 4.025636 .1703292 .7617355 3.669133 4.382139 ---------+-------------------------------------------------------------------- combined | 40 3.529068 .1680033 1.062546 3.189249 3.868886 ---------+-------------------------------------------------------------------- diff | -.9931362 .2998594 -1.60017 -.3861025 ------------------------------------------------------------------------------ diff = mean(0) - mean(1) t = -3.3120 Ho: diff = 0 degrees of freedom = 38 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.0010 Pr(|T| > |t|) = 0.0020 Pr(T > t) = 0.9990
  • 9. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Categorical Variable with Several Categories What can we do if there are more than two categories ? Cannot use x = 0, 1, 2, . . .. Instead we use “dummy” or “indicator” variables. If there are k categories, we need k − 1 indicators.
  • 10. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Three Groups: Example Group x1 x2 Ȳ σ2 A 0 0 3 1 Baseline Group B 1 0 5 1 C 0 1 4 1 β0 = Ŷ in group A β1 = difference between Ŷ in group A and Ŷ in group B β2 = difference between Ŷ in group A and Ŷ in group C
  • 11. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Three Groups: Stata Output . regress Y x1 x2 Source | SS df MS Number of obs = 60 -------------+------------------------------ F( 2, 57) = 16.82 Model | 37.1174969 2 18.5587485 Prob > F = 0.0000 Residual | 62.8970695 57 1.10345736 R-squared = 0.3711 -------------+------------------------------ Adj R-squared = 0.3491 Total | 100.014566 59 1.69516214 Root MSE = 1.0505 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- x1 | 1.924713 .3321833 5.79 0.000 1.259528 2.589899 x2 | 1.035985 .3321833 3.12 0.003 .3707994 1.701171 _cons | 3.075665 .2348891 13.09 0.000 2.605308 3.546022 ------------------------------------------------------------------------------
  • 12. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Comparing Groups In the previous example, groups B and C both compared to group A. Can we compare groups B and C as well ? In group B, Ŷ = β0 + β1 In group C, Ŷ = β0 + β2 Hence difference between groups is β1 − β2 Can use lincom to obtain this difference, and test its significance.
  • 13. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions The lincom Command lincom is short for linear combination. It can be used to calculate linear combinations of the parameters of a linear model. Linear combination = ajβj + ak βk + . . . Can be used to find differences between groups (Difference between Group B and Group C = β1 − β2) Can be used to find mean values in groups (Mean value in group B = β0 + β1).
  • 14. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Stata Output from lincom . lincom x1 - x2 ( 1) x1 - x2 = 0 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- (1) | .8887284 .3321833 2.68 0.010 .2235428 1.553914 ------------------------------------------------------------------------------ . lincom _cons + x1 ( 1) x1 + _cons = 0 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- (1) | 5.000378 .2348891 21.29 0.000 4.530021 5.470736 ------------------------------------------------------------------------------
  • 15. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Factor Variables in Stata Generating dummy variables can be tedious and error-prone Stata can do it for you Identify categorical variables by adding “i.” to the start of their name. For example, suppose that the variable group contains the values “1”, “2” and “3” for the three groups in the previous example.
  • 16. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Stata Output with a Factor Variable . regress Y i.group Source | SS df MS Number of obs = 60 -------------+------------------------------ F( 2, 57) = 16.82 Model | 37.1174969 2 18.5587485 Prob > F = 0.0000 Residual | 62.8970695 57 1.10345736 R-squared = 0.3711 -------------+------------------------------ Adj R-squared = 0.3491 Total | 100.014566 59 1.69516214 Root MSE = 1.0505 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- group | 2 | 1.924713 .3321833 5.79 0.000 1.259528 2.589899 3 | 1.035985 .3321833 3.12 0.003 .3707994 1.701171 | _cons | 3.075665 .2348891 13.09 0.000 2.605308 3.546022 ------------------------------------------------------------------------------
  • 17. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Using factor variables with lincom . lincom 2.group - 3.group ( 1) 2.group - 3.group = 0 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- (1) | .8887284 .3321833 2.68 0.010 .2235428 1.553914 ------------------------------------------------------------------------------ . lincom _cons + 2.group ( 1) 2.group + _cons = 0 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- (1) | 5.000378 .2348891 21.29 0.000 4.530021 5.470736 ------------------------------------------------------------------------------
  • 18. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Linear Models and ANOVA Differences in mean between more than two groups usually tested for with ANOVA. Linear model results are exactly the same. Linear model assumptions are exactly the same. ANOVA is a special case of a linear model. Linear model is far more versatile (can adjust for other variables).
  • 19. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Mixing Categorical & Continuous Variables So far, we have only seen either continuous or categorical predictors in a linear model. No problem to mix both. E.g. Consider a clinical trial in which the outcome is strongly associated with age. To test the effect of treatment, need to include both age and treatment in linear model. Once upon a time, this was called Analysis of Covariance (ANCOVA)
  • 20. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Example Clinical Trial: simulated data 0 5 10 Y 20 25 30 35 40 age Placebo Active Treatment
  • 21. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Stata Output Ignoring the Effect of Age . regress Y treat Source | SS df MS Number of obs = 40 -------------+------------------------------ F( 1, 38) = 2.86 Model | 26.5431819 1 26.5431819 Prob > F = 0.0989 Residual | 352.500943 38 9.27634061 R-squared = 0.0700 -------------+------------------------------ Adj R-squared = 0.0456 Total | 379.044125 39 9.71908013 Root MSE = 3.0457 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- treat | 1.629208 .9631376 1.69 0.099 -.3205623 3.578978 _cons | 4.379165 .6810411 6.43 0.000 3.00047 5.757861 ------------------------------------------------------------------------------
  • 22. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Observed and predicted values from linear model ignoring age 0 5 10 20 25 30 35 40 age Placebo Active Treatment
  • 23. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Stata Output Including the Effect of Age . regress Y treat age Source | SS df MS Number of obs = 40 -------------+------------------------------ F( 2, 37) = 262.58 Model | 354.096059 2 177.04803 Prob > F = 0.0000 Residual | 24.9480658 37 .674272049 R-squared = 0.9342 -------------+------------------------------ Adj R-squared = 0.9306 Total | 379.044125 39 9.71908013 Root MSE = .82114 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- treat | 1.238752 .2602711 4.76 0.000 .7113924 1.766111 age | -.5186644 .0235322 -22.04 0.000 -.5663453 -.4709836 _cons | 20.59089 .7581107 27.16 0.000 19.05481 22.12696 ------------------------------------------------------------------------------ Age explains variation in Y This reduces RMSE (estimate of σ) Standard error of coefficient = σ √ nsx
  • 24. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Observed and predicted values from linear model including age 0 5 10 20 25 30 35 40 age Placebo Active Treatment
  • 25. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Interactions In previous example, assumed that the effect of age was the same in treated and untreated groups. I.e. regression lines were parallel. This may not be the case. If the effect of one variable varies accord to the value of another variable, this is called “interaction” between the variables. Don’t assume that an effect differs between two groups because it is significant in one, not in the other
  • 26. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Interaction Example Consider the clinical trial in the previous example Suppose treatment reverses the effect of aging, so that Ŷ is constant in the treated group. Thus the difference between the treated and untreated groups will increase with increasing age. Need to fit different intercepts and different slopes in the two groups.
  • 27. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Clinical trial data with predictions assuming equal slopes 0 5 10 20 25 30 35 40 age Placebo Active Treatment
  • 28. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Regression Equations Need to fit the two equations Y=    β00 + β10 × age + ε if treat = 0 β01 + β11 × age + ε if treat = 1 These are equivalent to the equation Y=β00+β10×age+(β01−β00)×treat+(β11−β10)×age×treat+ε. I.e. the output from stata can be interpreted as _cons The intercept in the untreated group (treat == 0) age The slope with age in the untreated group treat The difference in intercept between the treated and untreated groups treat#c.age The difference in slope between the treated and untreated groups
  • 29. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Interactions: Stata Output . regress Y i.treat age i.treat#c.age Source | SS df MS Number of obs = 40 -------------+------------------------------ F( 3, 36) = 173.38 Model | 563.762012 3 187.920671 Prob > F = 0.0000 Residual | 39.0189256 36 1.08385904 R-squared = 0.9353 -------------+------------------------------ Adj R-squared = 0.9299 Total | 602.780938 39 15.4559215 Root MSE = 1.0411 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- 1.treat | -8.226356 1.872952 -4.39 0.000 -12.02488 -4.427833 age | -.4866572 .0412295 -11.80 0.000 -.5702744 -.40304 | treat#c.age | 1 | .4682374 .0597378 7.84 0.000 .3470836 .5893912 | _cons | 19.73531 1.309553 15.07 0.000 17.07942 22.39121 ------------------------------------------------------------------------------
  • 30. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Interactions: Using lincom lincom can be used to calculate the slope in the treated group: . lincom age + 1.treat#c.age ( 1) age + 1.treat#c.age = 0 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- (1) | -.0184198 .0432288 -0.43 0.673 -.1060919 .0692523 ------------------------------------------------------------------------------ Can also be used to calculate intercept in treated group. However, this is not interesting since We are unlikely to be be interested in subjects of age 0 The youngest subjects in our sample were 20, so we are extrapolating a long way from the data.
  • 31. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Interactions: Predictions from Linear Model 0 5 10 20 25 30 35 40 age Placebo Active Treatment
  • 32. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Treatment effect at different ages . lincom 1.treat + 20*1.treat#c.age ( 1) 1.treat + 20*1.treat#c.age = 0 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- (1) | 1.138392 .7279832 1.56 0.127 -.3380261 2.61481 ------------------------------------------------------------------------------ . lincom 1.treat + 40*1.treat#c.age ( 1) 1.treat + 40*1.treat#c.age = 0 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- (1) | 10.50314 .6378479 16.47 0.000 9.209524 11.79676 ------------------------------------------------------------------------------
  • 33. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions The testparm Command Used to test a number of parameters simultaneously Syntax: testparm varlist Test β = 0 for all variables in varlist Produces a χ2 test on k degrees of freedom, where there are k variables in varlist.
  • 34. Categorical Variables Confounding Variable Selection Other Considerations Dichotomous Variables Multiple Categories Categorical & Continuous Interactions Old and new syntax for categorical variables Stata used to use a different syntax for categorical variables Still works, but new method is preferred You may still see old syntax in existing do-files New syntax Old Syntax Prefix none required xi: Variable type Numeric String or numeric Interaction # * Creates new variables No Yes More info help fvvarlist help xi
  • 35. Categorical Variables Confounding Variable Selection Other Considerations Confounding A linear model shows association. It does not show causation. Apparent association may be due to a third variable which we haven’t included in model Confounding is about causality, and knowledge of the mechanisms are required to decide if a variable is a confounder.
  • 36. Categorical Variables Confounding Variable Selection Other Considerations Confounding Example: Fuel Consumption . regress mpg foreign Source | SS df MS Number of obs = 74 ---------+------------------------------ F( 1, 72) = 13.18 Model | 378.153515 1 378.153515 Prob > F = 0.0005 Residual | 2065.30594 72 28.6848048 R-squared = 0.1548 ---------+------------------------------ Adj R-squared = 0.1430 Total | 2443.45946 73 33.4720474 Root MSE = 5.3558 ------------------------------------------------------------------------------ mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- foreign | 4.945804 1.362162 3.631 0.001 2.230384 7.661225 _cons | 19.82692 .7427186 26.695 0.000 18.34634 21.30751 ------------------------------------------------------------------------------
  • 37. Categorical Variables Confounding Variable Selection Other Considerations Confounding Example: Weight and Fuel Consumption 10 20 30 40 Mileage (mpg) 2,000 3,000 4,000 5,000 Weight (lbs.) US Vehicles non−US Vehicles
  • 38. Categorical Variables Confounding Variable Selection Other Considerations Confounding Example: Controlling for Weight . regress mpg foreign weight Source | SS df MS Number of obs = 74 ---------+------------------------------ F( 2, 71) = 69.75 Model | 1619.2877 2 809.643849 Prob > F = 0.0000 Residual | 824.171761 71 11.608053 R-squared = 0.6627 ---------+------------------------------ Adj R-squared = 0.6532 Total | 2443.45946 73 33.4720474 Root MSE = 3.4071 ------------------------------------------------------------------------------ mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- foreign | -1.650029 1.075994 -1.533 0.130 -3.7955 .4954421 weight | -.0065879 .0006371 -10.340 0.000 -.0078583 -.0053175 _cons | 41.6797 2.165547 19.247 0.000 37.36172 45.99768 ------------------------------------------------------------------------------
  • 39. Categorical Variables Confounding Variable Selection Other Considerations What is Confounding ? What you see is not what you get Ŷ = β0 + β1x Two groups differing in x by ∆x will differ in Y by β1∆x If we change x by ∆x, what happens to Ŷ ? If it changes by β1∆x, no confounding If it changes by anything else, there is confounding
  • 40. Categorical Variables Confounding Variable Selection Other Considerations Path Variables vs. Confounders Foreign // $$ mpg Weight << Weight is a path variable Foreign // mpg Weight << dd Weight is a confounder
  • 41. Categorical Variables Confounding Variable Selection Other Considerations Identifying a Confounder Is a cause of the outcome irrespective of other predictors Is associated with the predictor Is not a consequence of the predictor Weight is associated with mpg This association does not depend on where the car was designed But is weight a path variable ?
  • 42. Categorical Variables Confounding Variable Selection Other Considerations Identifying a Confounder Is a cause of the outcome irrespective of other predictors Is associated with the predictor Is not a consequence of the predictor Weight is associated with mpg This association does not depend on where the car was designed But is weight a path variable ? Foreign designers produce smaller cars in order to getter better fuel consumption: path variable
  • 43. Categorical Variables Confounding Variable Selection Other Considerations Identifying a Confounder Is a cause of the outcome irrespective of other predictors Is associated with the predictor Is not a consequence of the predictor Weight is associated with mpg This association does not depend on where the car was designed But is weight a path variable ? Foreign designers produce smaller cars in order to getter better fuel consumption: path variable Size is decided for reasons other than fuel consumption: confounder
  • 44. Categorical Variables Confounding Variable Selection Other Considerations Allowing for Confounding In theory, adding a confounder to a regression model is sufficient to adjust for confounding. Then parameters for other variables measure the effects of those variables when confounder does not change. This assumes Confounder measured perfectly Linear association between confounder and outcome If either of the above are not true, there will be residual confounding
  • 45. Categorical Variables Confounding Variable Selection Other Considerations Variable Selection May wish to reduce the number of predictors used in a linear model. Efficiency Clearer understanding Several suggested methods Forward selection Backward Elimination Stepwise All subsets
  • 46. Categorical Variables Confounding Variable Selection Other Considerations Forward Selection Choose a significance level pe at which variables will enter the model. Fit each predictor in turn. Choose the most significant predictor. If its significance level is less than pe, it is selected. Now add each remaining variable to this model in turn, and test the most significant. Continue until no further variables are added.
  • 47. Categorical Variables Confounding Variable Selection Other Considerations Backward Elimination Starts with all predictors in model. Removes the least significant. Repeat until all remaining predictors significant at chosen level pr . Has the advantage that all parameters are adjusted for the effect of all other variables from the start. Can give unusual results if there are a large number of correlated variables.
  • 48. Categorical Variables Confounding Variable Selection Other Considerations Stepwise Selection Combination of preceding methods. Variables are added one at a time. Each time a variable is added, all the other variables are tested to see if they should be removed. Must have pr > pe, or a variable could be entered and removed on the same step.
  • 49. Categorical Variables Confounding Variable Selection Other Considerations All Subsets Can try every possible subset of variables. Can be hard work: 10 predictors = 1023 subsets. Need a criterion to choose best model. Adjusted R2 is possible, there are others. Not implemented in stata.
  • 50. Categorical Variables Confounding Variable Selection Other Considerations Problems with Variable Selection Significance Levels Hypotheses tested are not independent. Variables chosen for testing not randomly selected. Hence significance levels not equal to nominal levels. Less of a problem in large samples. Differences in Models Selected Models chosen by different methods may differ. If variables are highly correlated, choice of variable becomes arbitrary Choice of significance level will affect models. Need common sense.
  • 51. Categorical Variables Confounding Variable Selection Other Considerations Variable Selection in Stata Command sw regress is used for forwards, backwards and stepwise selection. Option pe is used to set significance level for inclusion Option pr is used to set significance level for exclusion Set pe for forwards, pr for backwards and both for stepwise regression. The sw command does not work with factor variables, so the old xi: syntax must be used.
  • 52. Categorical Variables Confounding Variable Selection Other Considerations Variable Selection in Stata: Example 1 . sw regress weight price hdroom trunk length turn displ gratio, pe(0.05) p = 0.0000 < 0.0500 adding length p = 0.0000 < 0.0500 adding displ p = 0.0015 < 0.0500 adding price p = 0.0288 < 0.0500 adding turn Source | SS df MS Number of obs = 74 ---------+------------------------------ F( 4, 69) = 293.75 Model | 41648450.8 4 10412112.7 Prob > F = 0.0000 Residual | 2445727.56 69 35445.3269 R-squared = 0.9445 ---------+------------------------------ Adj R-squared = 0.9413 Total | 44094178.4 73 604029.841 Root MSE = 188.27 ------------------------------------------------------------------------------ weight | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- length | 19.38601 2.328203 8.327 0.000 14.74137 24.03064 displ | 2.257083 .467792 4.825 0.000 1.323863 3.190302 price | .0332386 .0087921 3.781 0.000 .0156989 .0507783 turn | 23.17863 10.38128 2.233 0.029 2.468546 43.88872 _cons | -2193.042 298.0756 -7.357 0.000 -2787.687 -1598.398 ------------------------------------------------------------------------------
  • 53. Categorical Variables Confounding Variable Selection Other Considerations Variable Selection in Stata: Example 2 . sw regress weight price hdroom trunk length turn displ gratio, pr(0.05) p = 0.6348 >= 0.0500 removing hdroom p = 0.5218 >= 0.0500 removing trunk p = 0.1371 >= 0.0500 removing gratio Source | SS df MS Number of obs = 74 ---------+------------------------------ F( 4, 69) = 293.75 Model | 41648450.8 4 10412112.7 Prob > F = 0.0000 Residual | 2445727.56 69 35445.3269 R-squared = 0.9445 ---------+------------------------------ Adj R-squared = 0.9413 Total | 44094178.4 73 604029.841 Root MSE = 188.27 ------------------------------------------------------------------------------ weight | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- price | .0332386 .0087921 3.781 0.000 .0156989 .0507783 turn | 23.17863 10.38128 2.233 0.029 2.468546 43.88872 displ | 2.257083 .467792 4.825 0.000 1.323863 3.190302 length | 19.38601 2.328203 8.327 0.000 14.74137 24.03064 _cons | -2193.042 298.0756 -7.357 0.000 -2787.687 -1598.398 ------------------------------------------------------------------------------
  • 54. Categorical Variables Confounding Variable Selection Other Considerations Polynomial Regression If association between x and Y is non-linear, can fit polynomial terms in x. Keep adding terms until the highest order term is not significant. Parameters are meaningless: only entire function has meaning. Fractional polynomials and splines can also be used
  • 55. Categorical Variables Confounding Variable Selection Other Considerations Transformations If Y is not normal or has non-constant variance, it may be possible to fit a linear model to a transformation of Y. Interpretation becomes more difficult after transformation. Log transformation has a simple interpretation. log(Y) = β0 + β1x when x increases by 1, log(Y) increases by β1, Y is multiplied by eβ1 Transforming x is not normally necessary unless the problem suggests it.
  • 56. Categorical Variables Confounding Variable Selection Other Considerations Regression through the origin You may know that if x = 0, y = 0. Stata can force the regression line through the origin with the option nocons. However R2 is calculated differently and cannot be compared to conventional R2 . If we have no data near the origin, should not force line through the origin. May obtain a better fit with a non-zero intercept if there is measurement error.