Design Matrix and
Contrast Statement in SAS Regression
Procedures
Gang Cui, MPH
Sr. Biostatistician, CSCC, UNC
Outline
• Motivations
• Design Matrix definition
• Three commonly used design matrix coding schemes in
logistic regression
• Contrast statement under effect and GLM coding scheme
• Summary of design matrix choices in all regression procedures
13:10
Motivation
• Deep learning on certain elusive topics
• Mutual learning and knowledge sharing
• Lightweight but practical
• Engaging and enjoyable
Design Matrix
• A design matrix is a matrix of “values” of the explanatory variables for a set of observations. The values in the matrix are not the raw values in the dataset; the way they are coded is also called parameterization.
– Example: comparing the means of y among three groups, with y being the continuous response, x being an explanatory variable with 3 groups (1, 2, 3), 𝜇𝑖 being the mean of y in group i, and 𝜏𝑖 being the difference between group i and the reference group (group 1).
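Cell-means model: \( y_{ij} = \mu_i + \epsilon_{ij} \)
\[
\begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{31}\\ y_{32} \end{pmatrix}
=
\begin{pmatrix} 1&0&0\\ 1&0&0\\ 1&0&0\\ 0&1&0\\ 0&1&0\\ 0&0&1\\ 0&0&1 \end{pmatrix}
\begin{pmatrix} \mu_1\\ \mu_2\\ \mu_3 \end{pmatrix}
+
\begin{pmatrix} \epsilon_{11}\\ \epsilon_{12}\\ \epsilon_{13}\\ \epsilon_{21}\\ \epsilon_{22}\\ \epsilon_{31}\\ \epsilon_{32} \end{pmatrix}
\]
Reference-group model: \( y_{ij} = \mu_1 + \tau_i + \epsilon_{ij} \), with \( \tau_1 = 0 \)
\[
\begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{31}\\ y_{32} \end{pmatrix}
=
\begin{pmatrix} 1&0&0\\ 1&0&0\\ 1&0&0\\ 1&1&0\\ 1&1&0\\ 1&0&1\\ 1&0&1 \end{pmatrix}
\begin{pmatrix} \mu_1\\ \tau_2\\ \tau_3 \end{pmatrix}
+
\begin{pmatrix} \epsilon_{11}\\ \epsilon_{12}\\ \epsilon_{13}\\ \epsilon_{21}\\ \epsilon_{22}\\ \epsilon_{31}\\ \epsilon_{32} \end{pmatrix}
\]
Written out row by row (cell-means model on the left, reference-group model on the right):
\[
\begin{aligned}
y_{11} &= \mu_1 + \epsilon_{11} & \qquad y_{11} &= \mu_1 + \epsilon_{11}\\
y_{12} &= \mu_1 + \epsilon_{12} & \qquad y_{12} &= \mu_1 + \epsilon_{12}\\
y_{13} &= \mu_1 + \epsilon_{13} & \qquad y_{13} &= \mu_1 + \epsilon_{13}\\
y_{21} &= \mu_2 + \epsilon_{21} & \qquad y_{21} &= \mu_1 + \tau_2 + \epsilon_{21}\\
y_{22} &= \mu_2 + \epsilon_{22} & \qquad y_{22} &= \mu_1 + \tau_2 + \epsilon_{22}\\
y_{31} &= \mu_3 + \epsilon_{31} & \qquad y_{31} &= \mu_1 + \tau_3 + \epsilon_{31}\\
y_{32} &= \mu_3 + \epsilon_{32} & \qquad y_{32} &= \mu_1 + \tau_3 + \epsilon_{32}
\end{aligned}
\]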
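As a rough illustration of the two parameterizations above, the following Python sketch (not SAS code) builds both design matrices from the group labels of the seven observations. The helper names `cell_means_row` and `reference_row` are hypothetical and are used only for this example; group 1 is assumed to be the reference group, as in the equations above.

```python
# Group label of each observation, in the order y11, y12, y13, y21, y22, y31, y32.
groups = [1, 1, 1, 2, 2, 3, 3]
levels = [1, 2, 3]

def cell_means_row(g):
    """One indicator column per group: parameters (mu1, mu2, mu3)."""
    return [1 if g == lev else 0 for lev in levels]

def reference_row(g, ref=1):
    """Intercept column plus one indicator per non-reference group:
    parameters (mu1, tau2, tau3) when group 1 is the reference."""
    return [1] + [1 if g == lev else 0 for lev in levels if lev != ref]

X_cell = [cell_means_row(g) for g in groups]
X_ref = [reference_row(g) for g in groups]

# Print the group label next to its row in each design matrix.
for g, row_cell, row_ref in zip(groups, X_cell, X_ref):
    print(g, row_cell, row_ref)
```

Both matrices describe the same three group means; only the meaning of the parameters changes.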
Design Matrix
In the dataset, you may have explanatory variables Sex = "M"/"F" and Age_Group = 1/2/3. The SAS CLASS and MODEL statements will be:
   class Sex Age_Group / <options>;
   model outcome_var = Sex Age_Group / <options>;
In the design matrix, SAS creates “dummy variables” for Sex and Age_Group, with values 1 and 0 only (GLM and REF coding). The underlying mathematical model will be something like:
   Outcome_var = 𝜇 + 𝛽1·Sex1 + 𝛽2·Sex2 + 𝛽3·Age_Group1 + 𝛽4·Age_Group2 + 𝛽5·Age_Group3,
where 𝜇 is the mean of the reference group and each 𝛽 is the added effect of the corresponding group. In GLM coding, the 𝛽 for the reference group is set to 0.
Data            | Design matrix (dummy variables)
Sex  Age_Group  | 𝜇 (intercept)  Sex1  Sex2  Age_Group1  Age_Group2  Age_Group3
M    1          | 1              1     0     1           0           0
M    2          | 1              1     0     0           1           0
M    3          | 1              1     0     0           0           1
F    1          | 1              0     1     1           0           0
F    2          | 1              0     1     0           1           0
F    3          | 1              0     1     0           0           1
Design Matrix Choices in SAS
Three most common design matrix coding schemes:
• GLM (indicator or dummy coding): every level gets its own 0/1 column, so the reference value is coded as “1” in its own column.
• EFFECT: reference value coded as “-1”. Each beta estimates the difference between the effect of a nonreference level and the average effect over all levels.
• REF (reference cell coding): reference value coded as “0”.
Specified with the global option PARAM=<GLM/EFFECT/REF> in the CLASS statement:
   class <classvar1> <classvar2> / param = glm/effect/ref <or by default>;
*Notes:
1. The design matrix option has a profound impact on CONTRAST/LSMEANS statements and on result interpretation.
2. Different procedures have different default design matrix options.
3. Not all procedures allow all three options.
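To make the three schemes concrete, here is a minimal Python sketch (an illustration, not SAS code) that returns the design-variable values for one level of a class variable. The function `design_row` is a hypothetical helper, and it assumes the last listed level is the reference, which matches the default used throughout these slides.

```python
def design_row(level, levels, param):
    """Design-variable values for one level of a class variable.

    The last entry of `levels` is treated as the reference level.
    `param` is one of "glm", "effect", "ref".
    """
    ref = levels[-1]
    nonref = levels[:-1]
    if param == "glm":
        # One 0/1 column per level, including the reference.
        return [1 if level == lev else 0 for lev in levels]
    if param == "ref":
        # One 0/1 column per non-reference level; reference is all zeros.
        return [1 if level == lev else 0 for lev in nonref]
    if param == "effect":
        # Same columns as "ref", but the reference row is all -1.
        if level == ref:
            return [-1] * len(nonref)
        return [1 if level == lev else 0 for lev in nonref]
    raise ValueError("param must be 'glm', 'effect' or 'ref'")

treatment_levels = ["A", "B", "P"]
for param in ("glm", "effect", "ref"):
    for level in treatment_levels:
        print(param, level, design_row(level, treatment_levels, param))
```

For a class variable with c levels, GLM coding yields c columns while EFFECT and REF coding yield c-1 columns.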
PARAM=GLM in PROC Logistic
• Suppose we have a dataset with a binary outcome Pain (Yes/No) and explanatory variables Treatment (A, B, P) and Sex (M, F).
   proc logistic data=data1;
      class Treatment Sex / param=GLM;
      model Pain = Treatment Sex / expb;
   run;
Class Level Information
Class      Value  Design Variables
Treatment  A      1  0  0
           B      0  1  0
           P      0  0  1
Sex        F      1  0
           M      0  1
*“Dummy coding”: c columns in the design matrix, where c is the number of levels of the class variable. The reference value is coded as 1 in the last column.
*The reference is always the “LAST” value with GLM coding; you have no choice unless you format the data so that the desired reference sorts “LAST”.
PARAM=GLM in PROC Logistic
Odds Ratio Estimates
Effect            Point Estimate  95% Wald Confidence Limits
Treatment A vs P   8.960          1.949   41.201
Treatment B vs P  11.822          2.453   56.972
Sex F vs M         4.354          1.206   15.719
With param=GLM, you can simply take exp(beta) to get the OR, because each 𝛽 coefficient estimates the effect of a level compared with the reference level.
OR of A vs P = exp(2.1928) = 8.960
OR of B vs P = exp(2.4700) = 11.822
OR of F vs M = exp(1.4711) = 4.354
Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   -1.9705   0.7030          7.8570           0.0051       0.139
Treatment A  1    2.1928   0.7784          7.9353           0.0048       8.960
Treatment B  1    2.4700   0.8023          9.4770           0.0021      11.822
Treatment P  0    0        .               .                .            .
Sex F        1    1.4711   0.6550          5.0440           0.0247       4.354
Sex M        0    0        .               .                .            .
PARAM=EFFECT in PROC Logistic
   proc logistic data=data1;
      class Treatment Sex / param=effect;
      model Pain = Treatment Sex / expb;
   run;
Class Level Information
Class      Value  Design Variables
Treatment  A       1   0
           B       0   1
           P      -1  -1
Sex        F       1
           M      -1
*PARAM=EFFECT is the default in PROC LOGISTIC.
*The default reference is the “LAST” value, but you can change the reference (REF= option).
*There are c-1 columns in the design matrix, where c is the number of levels of the class variable. The reference value is coded as -1.
PARAM=EFFECT in PROC Logistic
Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   0.3193    0.3089          1.0685           0.3013      1.376
Treatment A  1   0.6385    0.4323          2.1815           0.1397      1.894
Treatment B  1   0.9157    0.4467          4.2032           0.0403      2.499
Sex F        1   0.7355    0.3275          5.0440           0.0247      2.087
Odds Ratio Estimates
Effect            Point Estimate  95% Wald Confidence Limits
Treatment A vs P   8.960          1.949   41.201
Treatment B vs P  11.822          2.453   56.972
Sex F vs M         4.354          1.206   15.719
*With param=EFFECT, one CANNOT calculate OR = exp(beta) as with GLM coding, because each 𝛽 coefficient estimates the effect of a level compared with the average over all levels.
*The p-value of a beta estimate and the 95% confidence limits of the corresponding OR may look inconsistent. For example, Treatment A has p = 0.1397, yet the 95% limits for A vs P (1.949, 41.201) exclude 1. The two answer different questions (A vs the average, A vs P).
OR of A vs P = exp(2*0.6385 + 0.9157) = 8.960
OR of B vs P = exp(2*0.9157 + 0.6385) = 11.822
OR of F vs M = exp(2*0.7355) = 4.354
Huh?
PARAM=EFFECT in PROC Logistic
According to the design matrix, treatment=P is coded as -1:
\[
\text{Treatment (rows A, B, P):}\quad
\begin{pmatrix} 1 & 0\\ 0 & 1\\ -1 & -1 \end{pmatrix}
\begin{pmatrix} \beta_A\\ \beta_B \end{pmatrix}
\qquad
\text{Sex (rows F, M):}\quad
\begin{pmatrix} 1\\ -1 \end{pmatrix}\beta_F
\]
Therefore the logit function of each group, the odds, the difference between non-reference and reference, and the OR are:
Class  Level  Logit function    Odds            Difference                   OR
Trt    A      L(A)=𝛽0+𝛽A       Exp(𝛽0+𝛽A)     A vs P: L(A)-L(P)=2𝛽A+𝛽B   Exp(2𝛽A+𝛽B)
       B      L(B)=𝛽0+𝛽B       Exp(𝛽0+𝛽B)     B vs P: L(B)-L(P)=2𝛽B+𝛽A   Exp(2𝛽B+𝛽A)
       P      L(P)=𝛽0-𝛽A-𝛽B   Exp(𝛽0-𝛽A-𝛽B)
Sex    F      L(F)=𝛽0+𝛽F       Exp(𝛽0+𝛽F)     F vs M: L(F)-L(M)=2𝛽F       Exp(2𝛽F)
       M      L(M)=𝛽0-𝛽F       Exp(𝛽0-𝛽F)
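The logit differences in the table can be checked numerically. The Python sketch below (an illustration only, not part of the SAS workflow) plugs the effect-coded estimates reported earlier into the design rows and exponentiates the differences; the variable and function names are hypothetical.

```python
import math

# Effect-coded estimates from the PARAM=EFFECT output.
b0, b_A, b_B, b_F = 0.3193, 0.6385, 0.9157, 0.7355

# Effect-coded design rows: the reference level is coded -1 in every column.
trt_design = {"A": (1, 0), "B": (0, 1), "P": (-1, -1)}
sex_design = {"F": 1, "M": -1}

def trt_logit(level):
    """Logit for a treatment level (the Sex term cancels in differences)."""
    x_a, x_b = trt_design[level]
    return b0 + x_a * b_A + x_b * b_B

def sex_logit(level):
    """Logit for a sex level (the Treatment term cancels in differences)."""
    return b0 + sex_design[level] * b_F

or_a_vs_p = math.exp(trt_logit("A") - trt_logit("P"))  # exp(2*b_A + b_B)
or_b_vs_p = math.exp(trt_logit("B") - trt_logit("P"))  # exp(2*b_B + b_A)
or_f_vs_m = math.exp(sex_logit("F") - sex_logit("M"))  # exp(2*b_F)

print(round(or_a_vs_p, 2), round(or_b_vs_p, 2), round(or_f_vs_m, 2))
```

Up to rounding of the four-decimal estimates, the results agree with the Odds Ratio Estimates table that SAS prints.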
PARAM=REF in PROC Logistic
   proc logistic data=data1;
      class Treatment Sex / param=REF;
      model Pain = Treatment Sex / expb;
   run;
*Reference value coded as 0.
Class Level Information
Class      Value  Design Variables
Treatment  A      1  0
           B      0  1
           P      0  0
Sex        F      1
           M      0
PARAM=REF in PROC Logistic
Odds Ratio Estimates
Effect            Point Estimate  95% Wald Confidence Limits
Treatment A vs P   8.960          1.949   41.201
Treatment B vs P  11.822          2.453   56.972
Sex F vs M         4.354          1.206   15.719
With param=REF, the SAS output simply omits the reference rows; otherwise the estimates are the same as with GLM coding. The OR is calculated in the same way as with GLM coding.
OR of A vs P = exp(2.1928) = 8.960
OR of B vs P = exp(2.4700) = 11.822
OR of F vs M = exp(1.4711) = 4.354
Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   -1.9705   0.7030          7.8570           0.0051       0.139
Treatment A  1    2.1928   0.7784          7.9353           0.0048       8.960
Treatment B  1    2.4700   0.8023          9.4770           0.0021      11.822
Sex F        1    1.4711   0.6550          5.0440           0.0247       4.354
Mixed Coding System in PROC Logistic
One can specify a different param option for each class variable, though this is unusual. An individual parameterization trumps the global option, unless the global option is GLM.
   proc logistic data=data1;
      class Treatment(param=effect) Sex(param=ref) / param=ref;
      model Pain = Treatment Sex / expb;
   run;
Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   -0.4163   0.4252          0.9586           0.3276      0.659
Treatment A  1    0.6385   0.4323          2.1815           0.1397      1.894
Treatment B  1    0.9157   0.4467          4.2032           0.0403      2.499
Sex F        1    1.4711   0.6550          5.0440           0.0247      4.354
Odds Ratio Estimates
Effect            Point Estimate  95% Wald Confidence Limits
Treatment A vs P   8.960          1.949   41.201
Treatment B vs P  11.822          2.453   56.972
Sex F vs M         4.354          1.206   15.719
OR of A vs P = exp(2*0.6385 + 0.9157) = 8.960
OR of B vs P = exp(2*0.9157 + 0.6385) = 11.822
OR of F vs M = exp(1.4711) = 4.354
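One way to see that GLM, EFFECT, and mixed coding are reparameterizations of the same fitted model is to compute the predicted logit of every Treatment-by-Sex cell under each coding. The Python sketch below does this with the estimates copied from the three output tables; the function names are hypothetical, and small discrepancies in the fourth decimal are expected because the printed estimates are rounded.

```python
# Estimates copied from the three "Analysis of Maximum Likelihood Estimates" tables.
glm = {"b0": -1.9705, "A": 2.1928, "B": 2.4700, "P": 0.0, "F": 1.4711, "M": 0.0}
eff = {"b0": 0.3193, "A": 0.6385, "B": 0.9157, "F": 0.7355}
mix = {"b0": -0.4163, "A": 0.6385, "B": 0.9157, "F": 1.4711}

def effect_trt(est, trt):
    """Effect coding: the reference level P contributes -(bA + bB)."""
    return est[trt] if trt != "P" else -(est["A"] + est["B"])

def logit_glm(trt, sex):
    return glm["b0"] + glm[trt] + glm[sex]

def logit_effect(trt, sex):
    sex_part = eff["F"] if sex == "F" else -eff["F"]
    return eff["b0"] + effect_trt(eff, trt) + sex_part

def logit_mixed(trt, sex):
    # Treatment is effect coded, Sex is reference coded (M contributes 0).
    sex_part = mix["F"] if sex == "F" else 0.0
    return mix["b0"] + effect_trt(mix, trt) + sex_part

cells = [(t, s) for t in "ABP" for s in "FM"]
logits = {c: (logit_glm(*c), logit_effect(*c), logit_mixed(*c)) for c in cells}
for cell, values in logits.items():
    print(cell, [round(v, 3) for v in values])
```

Because the fitted logits agree, any odds ratio between two cells is the same under all three codings; only the way it is assembled from the parameters differs.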
Contrast Statement
What if I want to compare treatment A vs B, or the average of A and B vs P, without changing the reference group and refitting the model for each comparison?
The answer is to use the CONTRAST statement. However, the coding scheme has a profound impact on how the CONTRAST statement must be written.
General syntax:
   CONTRAST 'label' <var-name> <dummy_coeff_1 … dummy_coeff_n> / options;
Contrast Statement with Effect Coding in Logistic
Comparing A vs B: we know L(A)=𝛽0+𝛽A and L(B)=𝛽0+𝛽B, therefore L(A)-L(B)=𝛽A-𝛽B, and we can write the contrast statement as:
   contrast "A vs B" Treatment 1 -1;
Class  Level  Logit function    Odds            Difference                   OR
Trt    A      L(A)=𝛽0+𝛽A       Exp(𝛽0+𝛽A)     A vs P: L(A)-L(P)=2𝛽A+𝛽B   Exp(2𝛽A+𝛽B)
       B      L(B)=𝛽0+𝛽B       Exp(𝛽0+𝛽B)     B vs P: L(B)-L(P)=2𝛽B+𝛽A   Exp(2𝛽B+𝛽A)
       P      L(P)=𝛽0-𝛽A-𝛽B   Exp(𝛽0-𝛽A-𝛽B)
Sex    F      L(F)=𝛽0+𝛽F       Exp(𝛽0+𝛽F)     F vs M: L(F)-L(M)=2𝛽F       Exp(2𝛽F)
       M      L(M)=𝛽0-𝛽F       Exp(𝛽0-𝛽F)
Contrast Statement with Effect Coding in Logistic
Contrast Estimation and Testing Results by Row
Contrast  Type  Row  Estimate  Standard Error  Alpha  Confidence Limits  Wald Chi-Square  Pr > ChiSq
A vs B    PARM  1    -0.2772   0.7463          0.05   -1.7399  1.1855    0.1380           0.7103
A vs B    EXP   1     0.7579   0.5656          0.05    0.1755  3.2725    0.1380           0.7103
Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   0.3193    0.3089          1.0685           0.3013      1.376
Treatment A  1   0.6385    0.4323          2.1815           0.1397      1.894
Treatment B  1   0.9157    0.4467          4.2032           0.0403      2.499
Sex F        1   0.7355    0.3275          5.0440           0.0247      2.087
   proc logistic data=data1;
      class Treatment Sex / param=effect;
      model Pain = Treatment Sex / e expb;
      contrast "A vs B" Treatment 1 -1 / e estimate=both;
   run;
Coefficients of Contrast A vs B
Parameter   Row1
TreatmentA   1
TreatmentB  -1
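The PARM and EXP rows of the contrast table can be reproduced by hand. The Python sketch below is an illustration under one assumption: the standard error 0.7463 is taken from the SAS output, since it depends on the covariance between the two estimates and cannot be derived from the printed betas alone.

```python
import math
from statistics import NormalDist

b_A, b_B = 0.6385, 0.9157   # effect-coded estimates for Treatment A and B
se = 0.7463                 # standard error of the contrast, as reported by SAS

# Contrast coefficients (1, -1) applied to (b_A, b_B).
estimate = 1 * b_A + (-1) * b_B

# 95% Wald confidence limits on the logit scale.
z = NormalDist().inv_cdf(0.975)
lower, upper = estimate - z * se, estimate + z * se

# Wald chi-square with 1 degree of freedom.
wald_chisq = (estimate / se) ** 2

# EXP row: exponentiate the estimate and its limits to get the odds ratio.
odds_ratio = math.exp(estimate)
or_limits = (math.exp(lower), math.exp(upper))

print(round(estimate, 4), round(lower, 4), round(upper, 4))
print(round(odds_ratio, 4), round(wald_chisq, 4))
```

The confidence limits for the odds ratio are obtained by exponentiating the logit-scale limits, which is why they are not symmetric around 0.7579.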
Contrast Statement with Effect Coding in Logistic
Now compare the average of (A and B) vs P:
½{L(A)+L(B)} - L(P) = ½(𝛽0+𝛽A+𝛽0+𝛽B) - (𝛽0-𝛽A-𝛽B) = 1.5𝛽A + 1.5𝛽B
   proc logistic data=data1;
      class Treatment Sex / param=effect;
      model Pain = Treatment Sex / expb;
      contrast "average of (A and B) vs P" Treatment 1.5 1.5 / e estimate=both;
   run;
Coefficients of Contrast average of (A and B) vs P
Parameter   Row1
Intercept   0
TreatmentA  1.5
TreatmentB  1.5
Contrast Estimation and Testing Results by Row
Contrast                   Type  Row  Estimate  Standard Error  Alpha  Confidence Limits  Wald Chi-Square  Pr > ChiSq
average of (A and B) vs P  PARM  1     2.3314   0.6969          0.05   0.9656   3.6972    11.1931          0.0008
average of (A and B) vs P  EXP   1    10.2923   7.1722          0.05   2.6263  40.3341    11.1931          0.0008
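The contrast coefficients need not be worked out by hand: they are a weighted combination of the design rows of the levels being compared. The Python sketch below illustrates this for the effect-coded Treatment variable; `contrast_coefficients` is a hypothetical helper, and each design row includes the intercept column so that its cancellation is visible.

```python
import math

b0, b_A, b_B = 0.3193, 0.6385, 0.9157

# Effect-coded design rows: (Intercept, TreatmentA, TreatmentB).
rows = {"A": [1, 1, 0], "B": [1, 0, 1], "P": [1, -1, -1]}

def contrast_coefficients(weights):
    """Weighted sum of design rows; `weights` maps level -> weight (summing to 0)."""
    return [sum(w * rows[level][k] for level, w in weights.items())
            for k in range(3)]

coef_a_vs_b = contrast_coefficients({"A": 1, "B": -1})
coef_avg_vs_p = contrast_coefficients({"A": 0.5, "B": 0.5, "P": -1})

# Apply the coefficients to the estimates to get the contrast on the logit scale.
betas = [b0, b_A, b_B]
est_avg_vs_p = sum(c * b for c, b in zip(coef_avg_vs_p, betas))

print(coef_a_vs_b)      # intercept coefficient is 0, then 1 and -1
print(coef_avg_vs_p)    # intercept coefficient is 0, then 1.5 and 1.5
print(round(est_avg_vs_p, 3), round(math.exp(est_avg_vs_p), 2))
```

The last two entries of each coefficient list are what goes after the variable name in the CONTRAST statement; the E option prints the same coefficients so they can be confirmed.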
Contrast statement Coding in PROC GLM
• Unlike PROC LOGISTIC, GLM coding is the only coding scheme available in PROC GLM.
• The “LAST” value is always used as the reference group: A=2 and B=3 in the following case.
Example: Y is a continuous response following a normal distribution with constant variance. The model has two factors, A with 2 levels and B with 3 levels.
Main-effects model: Yij = μ + αi + βj + εij, where i is the level of factor A and j is the level of factor B.
The design matrix of the main-effects model:
Data  | Design matrix
A  B  | 𝜇 (intercept)  A1  A2  B1  B2  B3
1  1  | 1              1   0   1   0   0
1  2  | 1              1   0   0   1   0
1  3  | 1              1   0   0   0   1
2  1  | 1              0   1   1   0   0
2  2  | 1              0   1   0   1   0
2  3  | 1              0   1   0   0   1
Contrast Statement in PROC GLM
Test hypothesis 1: H0: μB1 = μB2
From the design matrix, we know μB1 = μ + αi + β1 and μB2 = μ + αi + β2, so
μB1 - μB2 = β1 - β2
Contrast statement: contrast "B=1 vs B=2" B 1 -1 / e;
   proc glm data=data3;
      class a b;
      model y = a b / solution;
      contrast "B=1 vs B=2" B 1 -1 / e;
      /* CONTRAST and ESTIMATE usually go hand in hand: ESTIMATE gives the estimate of the difference and its SE */
      estimate "B=1 vs B=2" B 1 -1 / e;
   run;
Contrast    DF  Contrast SS  Mean Square  F Value  Pr > F
B=1 vs B=2  1   4.98877622   4.98877622   4.85     0.0328
Parameter   Estimate     Standard Error  t Value  Pr > |t|
B=1 vs B=2  -0.84927549  0.38544563      -2.20    0.0328
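Two things in this output can be checked with a few lines of Python (an illustrative sketch, with the hypothetical helper `glm_row`): subtracting the GLM-coded design rows for B=1 and B=2 at the same level of A leaves only the B coefficients 1 -1 0, and the F value of the single-degree-of-freedom CONTRAST is the square of the t value from ESTIMATE.

```python
def glm_row(a, b):
    """GLM-coded design row (mu, A1, A2, B1, B2, B3) for the main-effects model."""
    return [1] + [int(a == i) for i in (1, 2)] + [int(b == j) for j in (1, 2, 3)]

# Row for B=1 minus row for B=2, at each level of A: mu and the A columns cancel.
diffs = [[x - y for x, y in zip(glm_row(a, 1), glm_row(a, 2))] for a in (1, 2)]

# Values reported by the ESTIMATE statement.
estimate, se = -0.84927549, 0.38544563
t_value = estimate / se
f_value = t_value ** 2

print(diffs)
print(round(t_value, 2), round(f_value, 2))
```

Since both statements test the same one-dimensional hypothesis, their p-values coincide (0.0328 in the output above).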
Contrast Statement Coding in PROC GLM
Model with crossed effects: Yijk = μ + αi + βj + αβij + εijk, where i is the level of factor A, j is the level of factor B, and k indexes the observations within each A*B cell.
The design matrix of the crossed-effects model:
Data  | Design matrix
      |    A       B           A*B
A  B  | μ  A1  A2  B1  B2  B3  A1B1  A1B2  A1B3  A2B1  A2B2  A2B3
1  1  | 1  1   0   1   0   0   1     0     0     0     0     0
1  2  | 1  1   0   0   1   0   0     1     0     0     0     0
1  3  | 1  1   0   0   0   1   0     0     1     0     0     0
2  1  | 1  0   1   1   0   0   0     0     0     1     0     0
2  2  | 1  0   1   0   1   0   0     0     0     0     1     0
2  3  | 1  0   1   0   0   1   0     0     0     0     0     1
Contrast Statement in PROC GLM
Test hypothesis 2: H0: μ11 = μ12 (cell mean at A=1, B=1 vs cell mean at A=1, B=2)
From the design matrix, we know: μ11 = μ + α1 + β1 + αβ11 and μ12 = μ + α1 + β2 + αβ12, so
μ11 - μ12 = β1 - β2 + αβ11 - αβ12
Contrast statement: coefficients are now needed for two effects, B and A*B.
   proc glm data=data3;
      class a b;
      model y = a b a*b / solution;
      contrast "AB11 vs AB12" B 1 -1
                              A*B 1 -1;   /* equivalent to A*B 1 -1 0 0 0 0; trailing zeros can be omitted */
      estimate "AB11 vs AB12" B 1 -1
                              A*B 1 -1;
   run;
Contrast      DF  Contrast SS  Mean Square  F Value  Pr > F
AB11 vs AB12  1   6.81411446   6.81411446   6.53     0.0143
Parameter     Estimate     Standard Error  t Value  Pr > |t|
AB11 vs AB12  -1.40976950  0.55182283      -2.55    0.0143
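For the crossed model the same row-difference idea yields the coefficients for both effects at once. The Python sketch below (illustrative; `crossed_row` is a hypothetical helper) builds the GLM-coded row for a cell, with the A*B columns ordered A1B1, A1B2, A1B3, A2B1, A2B2, A2B3 as in the design matrix above, and subtracts the row for cell (A=1, B=2) from the row for cell (A=1, B=1).

```python
A_LEVELS, B_LEVELS = (1, 2), (1, 2, 3)

def crossed_row(a, b):
    """GLM-coded design row: mu, A1..A2, B1..B3, then the six A*B columns."""
    row = [1]
    row += [int(a == i) for i in A_LEVELS]
    row += [int(b == j) for j in B_LEVELS]
    row += [int(a == i and b == j) for i in A_LEVELS for j in B_LEVELS]
    return row

# Cell (A=1, B=1) minus cell (A=1, B=2).
diff = [x - y for x, y in zip(crossed_row(1, 1), crossed_row(1, 2))]

# Split the difference by effect to read off the CONTRAST coefficients.
coef = {
    "Intercept": diff[0:1],
    "A": diff[1:3],
    "B": diff[3:6],
    "A*B": diff[6:12],
}
for effect, values in coef.items():
    print(effect, values)
```

The intercept and A coefficients are all zero, so only B and A*B appear in the CONTRAST statement, and the trailing zeros of each effect can be left out.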
Steps to Construct Contrast Statement
Step 1. Write down the model. Two parts are crucial:
– Parameterization: how the design variables for the CLASS statement variables are coded.
– Parameter ordering: the order of the parameters depends on the CLASS statement and the ORDER= option. Confirm it from the SAS output.
Step 2. Write down the hypothesis to be tested.
Step 3. Write the CONTRAST statement.
Default and Alternative Coding in Regression
SAS Procedures
Default design matrix coding                                 | Regression procedures
GLM coding (indicator or dummy): ref value coded as 1        | GENMOD, GLM, GLMSELECT, GLIMMIX, LIFEREG, MIXED, and SURVEYPHREG
EFFECT coding (deviation from mean): ref value coded as -1   | CATMOD, LOGISTIC, and SURVEYLOGISTIC
REF coding: ref value coded as 0                             | PHREG and TRANSREG

Alternative coding allowed?  | Regression procedures
Allowed                      | LOGISTIC, GENMOD, GLMSELECT, PHREG, SURVEYLOGISTIC, and SURVEYPHREG
Not allowed                  | GLM, MIXED, GLIMMIX, and LIFEREG
Take Home Messages
• The design matrix has a profound impact on CONTRAST/LSMEANS statements and on result interpretation.
• First, know the design matrix you are using and the linear combination of parameters it implies.
• The order of variables in the CLASS statement matters.
• Effect coding estimates the effect of each level compared with the average of all levels, not with the reference.
• Follow the steps for constructing a contrast statement:
– Step 1: Know the design matrix and the order of the parameters
– Step 2: Write down the hypothesis to be tested
– Step 3: Construct the CONTRAST and ESTIMATE statements
• Different regression procedures have different default design matrix schemes, and not all procedures allow alternative schemes.
Acknowledgement
• Thanks Kathy Roggenkamp for your
encouragement and support!

Design matrix and Contrast

  • 1. Design Matrix and Contrast Statement in SAS Regression Procedures Gang Cui, MPH Sr. Biostatistician, CSCC, UNC
  • 2. Outline • Motivations • Design Matrix definition • Three commonly used design matrix coding schemes in logistic regression • Contrast statement under effect and GLM coding scheme • Summary of design matrix choices in all regression procedures 13:10
  • 3. Motivation • Deep learning on certain certain elusive topic • Mutual learning and knowledge sharing • Light weight but practical • Engaging and enjoy 13:10
  • 4. Design Matrix • Design matrix is a matrix of “values” of explanatory variables of a set of objects. The values in matrix is not the raw value in dataset, also called parameterization. – Exp. Comparing means of y among three groups: with y being the continuous response, x being explanatory variable having 3 groups (1,2,3), 𝜇𝑖 being the mean of y of each group, 𝜏𝑖 being the difference of each group comparing with reference group. 𝑦𝑖𝑗 = 𝜇𝑖 + 𝜖𝑖𝑗 𝑦𝑖𝑗 = 𝜇1 + 𝜏𝑖 + 𝜖𝑖𝑗                                                                               32 31 22 21 13 12 11 3 2 1 32 31 22 21 13 12 11 100 100 010 010 001 001 001           y y y y y y y                                                                               32 31 22 21 13 12 11 3 2 1 32 31 22 21 13 12 11 101 101 011 011 001 001 001           y y y y y y y 𝑦11 = 𝜇1 + 𝜖11 𝑦12 = 𝜇1 + 𝜖12 𝑦13 = 𝜇1 + 𝜖13 𝑦21 = 𝜇2 + 𝜖21 𝑦22 = 𝜇2 + 𝜖22 𝑦31 = 𝜇3 + 𝜖31 𝑦32 = 𝜇3 + 𝜖32 𝑦11 = 𝜇1 + 𝜖11 𝑦12 = 𝜇1 + 𝜖12 𝑦13 = 𝜇1 + 𝜖13 𝑦21 = 𝜇1 + 𝜏2 + 𝜖22 𝑦22 = 𝜇1 + 𝜏2 + 𝜖22 𝑦31 = 𝜇1 + 𝜏3 + 𝜖31 𝑦32 = 𝜇1 + 𝜏3 + 𝜖32 13:10
  • 5. Design Matrix In dataset, you may have explanatory vars Sex =“M”/”F”, and Age_Group =1/2/3. SAS class and model statement will be: Class Sex Age_Group/<options; model outcome_var=sex age_group/<options>; In design matrix, SAS creates “dummy variables” for sex and age_group, with value 1 and 0 only (GLM and REF coding). True mathematic model will be something like: Outcom_var=𝜇+𝛽iSex1+ 𝛽2Sex2+ 𝛽3Age_Group1+ 𝛽4Age_Group2 +𝛽5Age_Group3, where 𝜇 is the mean of reference group, 𝛽 is added effect corresponding to each group. In GLM 𝛽 for reference group will be set to 0. Data Design Matrix Sex Age_Group Sex Age_Group 𝜇 (intercept) Sex1 Sex2 Age_Group1 Age_Group2 Age_Group3 M 1 1 1 0 1 0 0 M 2 1 1 0 0 1 0 M 3 1 1 0 0 0 1 F 1 1 0 1 1 0 0 F 2 1 0 1 0 1 0 F 3 1 0 1 0 0 1 Dummy variables 13:10
  • 6. Design Matrix Choices in SAS Three most common design matrix coding schemes: • GLM - (indicator or dummy coding) reference value coded as “1”. • Effect – Ref value coded as “-1”. Beta estimates are estimating the difference in the effect of each nonreference level compared to the average effect over all levels. • REF - Reference cell coding, reference value coded as “0”. Specified in Class statement global option Param=<GLM/EFFECT/REF>: class <classvar1> <classvar2>/param = glm/effect/ref <or by default>; *Note: 1. Design matrix option has profound impact on CONTRAST/LSMEAN statement and result interpretation. 2. Different procedures have different default design matrix options. 3. Not all procedures allow all three options. 13:10
  • 7. PARAM=GLM in PROC Logistic • Suppose having a dataset with binary outcome Pain(Yes/No), and explanatory variable treatment (A, B, P) and sex (M, F). proc logistic data=data1; class Treatment Sex/param=GLM; model Pain= Treatment Sex/ expb; run; Class Level Information Class Value Design Variables Treatment A 1 0 0 B 0 1 0 P 0 0 1 Sex F 1 0 M 0 1 *“Dummy coding”: c columns in design matrix, c is the number of level of class var. Reference value coded as 1. Reference is always “LAST” value with GLM, you don’t have choice unless you formatted ref as the “LAST”. 13:10
  • 8. PARAM=GLM in PROC Logistic Odds Ratio Estimates Effect Point Estimate 95% Wald Confidence Limits Treatment A vs P 8.960 1.949 41.201 Treatment B vs P 11.822 2.453 56.972 Sex F vs M 4.354 1.206 15.719 With param=GLM, you can simply exp(beta) to get OR, because 𝛽 coefficient estimate is estimating effect comparing to reference level. OR of A vs P=exp(0.2.1928)=8.960; OR of B vs P=exp(0.2.4700)=11.822 OR of F vs M=exp(1.4711)=4.354 Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq Exp(Est) Intercept 1 -1.9705 0.7030 7.8570 0.0051 0.139 Treatment A 1 2.1928 0.7784 7.9353 0.0048 8.960 Treatment B 1 2.4700 0.8023 9.4770 0.0021 11.822 Treatment P 0 0 . . . . Sex F 1 1.4711 0.6550 5.0440 0.0247 4.354 Sex M 0 0 . . . . 13:10
  • 9. PARAM=EFFECT in PROC Logistic proc logistic data=data1; class Treatment Sex/param=effect; model Pain= Treatment Sex/ expb; run; Class Level Information Class Value Design Variables Treatment A 1 0 B 0 1 P -1 -1 Sex F 1 M -1 *Default param=effect *Default reference is “LAST” value. But you can change reference. *There is c-1 column in design matrix, c is the number of level of class var. Reference value coded as -1. 13:10
  • 10. PARAM=EFFECT in PROC Logistic Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq Exp(Est) Intercept 1 0.3193 0.3089 1.0685 0.3013 1.376 Treatment A 1 0.6385 0.4323 2.1815 0.1397 1.894 Treatment B 1 0.9157 0.4467 4.2032 0.0403 2.499 Sex F 1 0.7355 0.3275 5.0440 0.0247 2.087 Odds Ratio Estimates Effect Point Estimate 95% Wald Confidence Limits Treatment A vs P 8.960 1.949 41.201 Treatment B vs P 11.822 2.453 56.972 Sex F vs M 4.354 1.206 15.719 *Normally, one CANNOT calculate OR=exp(beta) in GLM, because 𝛽 coefficient estimate is estimating effect comparing to average of all level. * It is possible that p-value in beta estimate and 95% of OR in OR estimate are NOT consistent OR of A vs P=exp(2*0.6385+0.9157))=8.960; OR of B vs P=exp(2*0.9157+0.6385))=11.822 OR if F vs M=exp(2x0.7355)=4.354 Huh 13:10
  • 11. PARAM=EFFECT in PROC Logistic
    According to the design matrix, treatment=P is coded as -1 in every design column. The logit of each group, the difference between each non-reference level and the reference, the odds, and the OR are therefore:
    Design matrix for treatment (rows A, B, P; columns 𝛽A, 𝛽B):
         1    0
         0    1
        -1   -1
    Design matrix for sex (rows F, M; column 𝛽F):
         1
        -1
      Class  Level  Logit function      Odds            Difference (non-ref vs ref)   OR
      Trt    A      L(A)=𝛽0+𝛽A         Exp(𝛽0+𝛽A)     A vs P: L(A)-L(P)=2𝛽A+𝛽B     Exp(2𝛽A+𝛽B)
             B      L(B)=𝛽0+𝛽B         Exp(𝛽0+𝛽B)     B vs P: L(B)-L(P)=2𝛽B+𝛽A     Exp(2𝛽B+𝛽A)
             P      L(P)=𝛽0-𝛽A-𝛽B     Exp(𝛽0-𝛽A-𝛽B)
      Sex    F      L(F)=𝛽0+𝛽F         Exp(𝛽0+𝛽F)     F vs M: L(F)-L(M)=2𝛽F         Exp(2𝛽F)
             M      L(M)=𝛽0-𝛽F         Exp(𝛽0-𝛽F)
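The logit differences in the table can be verified numerically. This Python sketch plugs in the effect-coded estimates printed on the previous slide and differences the logits; the helper names are illustrative, and the results agree with the SAS odds ratios only up to the rounding of the printed estimates.

```python
import math

# Effect-coded estimates copied from the slide output.
b0, bA, bB, bF = 0.3193, 0.6385, 0.9157, 0.7355

# Design values per level under effect coding (reference coded as -1).
treatment_rows = {"A": (1, 0), "B": (0, 1), "P": (-1, -1)}
sex_rows = {"F": 1, "M": -1}


def logit(treatment, sex):
    """Linear predictor for one treatment/sex combination."""
    t_a, t_b = treatment_rows[treatment]
    return b0 + t_a * bA + t_b * bB + sex_rows[sex] * bF


# Hold the other factor fixed; it cancels in the difference.
or_A_vs_P = math.exp(logit("A", "M") - logit("P", "M"))  # exp(2*bA + bB)
or_B_vs_P = math.exp(logit("B", "M") - logit("P", "M"))  # exp(2*bB + bA)
or_F_vs_M = math.exp(logit("A", "F") - logit("A", "M"))  # exp(2*bF)

print(round(or_A_vs_P, 2), round(or_B_vs_P, 2), round(or_F_vs_M, 2))
```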
  • 12. PARAM=REF in PROC Logistic
    proc logistic data=data1;
      class Treatment Sex / param=ref;
      model Pain = Treatment Sex / expb;
    run;
    * The reference value is coded as 0.
    Class Level Information
      Class      Value  Design Variables
      Treatment  A        1    0
                 B        0    1
                 P        0    0
      Sex        F        1
                 M        0
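For comparison with effect coding, the sketch below builds reference-coded and GLM-coded design values in Python. Both helpers are hypothetical illustrations of the coding rules described in these slides (reference level all zeros under REF; one indicator column per level under GLM), and they assume the last level is the reference.

```python
def reference_coding(levels, reference=None):
    """c-1 columns; the reference level is coded 0 in every column."""
    reference = levels[-1] if reference is None else reference
    non_ref = [lv for lv in levels if lv != reference]
    return {lv: [1 if lv == other else 0 for other in non_ref] for lv in levels}


def glm_coding(levels):
    """c columns, one 0/1 indicator per level (a less-than-full-rank design)."""
    return {lv: [1 if lv == other else 0 for other in levels] for lv in levels}


ref_design = reference_coding(["A", "B", "P"])
glm_design = glm_coding(["A", "B", "P"])
print(ref_design)
print(glm_design)
```

Under GLM coding the extra column is redundant with the intercept, which is consistent with the reference rows showing DF=0 and an estimate of 0 in the PARAM=GLM output.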
  • 13. PARAM=REF in PROC Logistic
    Odds Ratio Estimates
      Effect            Point Estimate  95% Wald Confidence Limits
      Treatment A vs P           8.960     1.949    41.201
      Treatment B vs P          11.822     2.453    56.972
      Sex F vs M                 4.354     1.206    15.719
    With PARAM=REF, the SAS output simply omits the reference rows; otherwise the estimates are the same as with GLM coding, and the OR is calculated in the same way.
      OR of A vs P = exp(2.1928) = 8.960
      OR of B vs P = exp(2.4700) = 11.822
      OR of F vs M = exp(1.4711) = 4.354
    Analysis of Maximum Likelihood Estimates
      Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
      Intercept     1   -1.9705          0.7030           7.8570      0.0051     0.139
      Treatment A   1    2.1928          0.7784           7.9353      0.0048     8.960
      Treatment B   1    2.4700          0.8023           9.4770      0.0021    11.822
      Sex F         1    1.4711          0.6550           5.0440      0.0247     4.354
  • 14. Mixed Coding System in PROC Logistic
    One can specify a different PARAM= option for each class variable, though this is unusual. An individual parameterization overrides the global option, unless the global option is GLM.
    proc logistic data=data1;
      class Treatment(param=effect) Sex(param=ref) / param=ref;
      model Pain = Treatment Sex / expb;
    run;
    Analysis of Maximum Likelihood Estimates
      Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
      Intercept     1   -0.4163          0.4252           0.9586      0.3276     0.659
      Treatment A   1    0.6385          0.4323           2.1815      0.1397     1.894
      Treatment B   1    0.9157          0.4467           4.2032      0.0403     2.499
      Sex F         1    1.4711          0.6550           5.0440      0.0247     4.354
    Odds Ratio Estimates
      Effect            Point Estimate  95% Wald Confidence Limits
      Treatment A vs P           8.960     1.949    41.201
      Treatment B vs P          11.822     2.453    56.972
      Sex F vs M                 4.354     1.206    15.719
    Treatment is effect-coded, so its ORs use the effect-coding formula; Sex is reference-coded, so its OR is exp(beta).
      OR of A vs P = exp(2*0.6385+0.9157) = 8.960
      OR of B vs P = exp(2*0.9157+0.6385) = 11.822
      OR of F vs M = exp(1.4711) = 4.354
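The three sets of estimates (REF, EFFECT, and mixed) describe the same fitted model, so one can be converted into another. The Python sketch below is a derivation aid rather than anything SAS reports: for a main-effects model, an effect-coded coefficient equals the reference-coded coefficient minus the mean over all levels of that factor (with the reference level counted as 0), and the intercept absorbs that mean. `ref_to_effect` is a hypothetical helper.

```python
def ref_to_effect(ref_betas):
    """Convert reference-coded coefficients for ONE factor to effect coding.

    ref_betas maps each non-reference level to its coefficient; the reference
    level has an implicit coefficient of 0. Returns (effect_betas, shift),
    where shift is the amount that must be added to the intercept.
    """
    n_levels = len(ref_betas) + 1
    shift = sum(ref_betas.values()) / n_levels  # mean over all levels, ref = 0
    effect = {lv: b - shift for lv, b in ref_betas.items()}
    return effect, shift


# Reference-coded estimates copied from the PARAM=REF slide.
intercept_ref = -1.9705
trt_effect, trt_shift = ref_to_effect({"A": 2.1928, "B": 2.4700})
sex_effect, sex_shift = ref_to_effect({"F": 1.4711})

# Treatment effect-coded, Sex reference-coded (the mixed model on this slide).
intercept_mixed = intercept_ref + trt_shift
# Both factors effect-coded (the PARAM=EFFECT slides).
intercept_effect = intercept_ref + trt_shift + sex_shift

print(trt_effect, sex_effect)
print(round(intercept_mixed, 4), round(intercept_effect, 4))
```

Up to rounding of the printed inputs, this reproduces the Treatment estimates 0.6385 and 0.9157, the Sex estimate 0.7355, and the intercepts shown for the mixed and fully effect-coded fits.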
  • 15. Contrast Statement
    What if I want to compare treatment A vs B, or the average of A and B vs P, without changing the reference group and rerunning the procedure for each comparison? The answer is the CONTRAST statement. However, the coding scheme has a profound impact on how the CONTRAST statement must be written.
    General syntax:
      CONTRAST 'label' <var-name> <dummy_coeff_1 … dummy_coeff_n> / options;
  • 16. Contrast Statement with Effect Coding in Logistic
    Comparing A vs B: we know L(A)=𝛽0+𝛽A and L(B)=𝛽0+𝛽B, therefore L(A)-L(B)=𝛽A-𝛽B, and we can write the contrast statement as:
      contrast "A vs B" treatment 1 -1;
      Class  Level  Logit function      Odds            Difference (non-ref vs ref)   OR
      Trt    A      L(A)=𝛽0+𝛽A         Exp(𝛽0+𝛽A)     A vs P: L(A)-L(P)=2𝛽A+𝛽B     Exp(2𝛽A+𝛽B)
             B      L(B)=𝛽0+𝛽B         Exp(𝛽0+𝛽B)     B vs P: L(B)-L(P)=2𝛽B+𝛽A     Exp(2𝛽B+𝛽A)
             P      L(P)=𝛽0-𝛽A-𝛽B     Exp(𝛽0-𝛽A-𝛽B)
      Sex    F      L(F)=𝛽0+𝛽F         Exp(𝛽0+𝛽F)     F vs M: L(F)-L(M)=2𝛽F         Exp(2𝛽F)
             M      L(M)=𝛽0-𝛽F         Exp(𝛽0-𝛽F)
  • 17. Contrast Statement with Effect Coding in Logistic
    proc logistic data=data1;
      class Treatment Sex / param=effect;
      model Pain = Treatment Sex /e expb;
      contrast "A vs B" treatment 1 -1 / e estimate=both;
    run;
    Coefficients of Contrast A vs B
      Parameter    Row1
      TreatmentA      1
      TreatmentB     -1
    Analysis of Maximum Likelihood Estimates
      Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
      Intercept     1    0.3193          0.3089           1.0685      0.3013     1.376
      Treatment A   1    0.6385          0.4323           2.1815      0.1397     1.894
      Treatment B   1    0.9157          0.4467           4.2032      0.0403     2.499
      Sex F         1    0.7355          0.3275           5.0440      0.0247     2.087
    Contrast Estimation and Testing Results by Row
      Contrast  Type  Row  Estimate  Standard Error  Alpha  Confidence Limits  Wald Chi-Square  Pr > ChiSq
      A vs B    PARM    1   -0.2772          0.7463   0.05  -1.7399   1.1855           0.1380      0.7103
      A vs B    EXP     1    0.7579          0.5656   0.05   0.1755   3.2725           0.1380      0.7103
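The PARM and EXP rows of the contrast table can be reproduced from the printed estimates. In this Python sketch the contrast estimate is the coefficient-weighted sum of the betas; the standard error 0.7463 is copied from the SAS output because computing it needs the covariance matrix of the estimates, which the slides do not show. Treating the EXP standard error as a delta-method value (exp(estimate) × SE) is an assumption that happens to match the printed 0.5656.

```python
import math

# Effect-coded estimates and the contrast coefficients from the slide.
betas = {"Intercept": 0.3193, "TreatmentA": 0.6385, "TreatmentB": 0.9157, "SexF": 0.7355}
contrast = {"TreatmentA": 1, "TreatmentB": -1}

estimate = sum(coef * betas[name] for name, coef in contrast.items())
se = 0.7463          # copied from the SAS output, not computed here
z = 1.959964         # assumed normal quantile for alpha = 0.05

wald_chi_square = (estimate / se) ** 2
ci = (estimate - z * se, estimate + z * se)

exp_estimate = math.exp(estimate)          # OR of A vs B
exp_ci = (math.exp(ci[0]), math.exp(ci[1]))
exp_se = exp_estimate * se                 # delta-method approximation (assumption)

print(round(estimate, 4), round(wald_chi_square, 4), [round(v, 4) for v in ci])
print(round(exp_estimate, 4), round(exp_se, 4), [round(v, 4) for v in exp_ci])
```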
  • 18. Contrast Statement with Effect Coding in Logistic
    Now compare the average of (A and B) vs P:
      ½{L(A)+L(B)} - L(P) = ½(𝛽0+𝛽A+𝛽0+𝛽B) - (𝛽0-𝛽A-𝛽B) = 1.5𝛽A + 1.5𝛽B
    proc logistic data=data1;
      class Treatment Sex / param=effect;
      model Pain = Treatment Sex / expb;
      contrast "average of (A and B) vs P" treatment 1.5 1.5 / e estimate=both;
    run;
    Coefficients of Contrast average of (A and B) vs P
      Parameter    Row1
      Intercept       0
      TreatmentA    1.5
      TreatmentB    1.5
    Contrast Estimation and Testing Results by Row
      Contrast                   Type  Row  Estimate  Standard Error  Alpha  Confidence Limits  Wald Chi-Square  Pr > ChiSq
      average of (A and B) vs P  PARM    1    2.3314          0.6969   0.05   0.9656   3.6972          11.1931      0.0008
      average of (A and B) vs P  EXP     1   10.2923          7.1722   0.05   2.6263  40.3341          11.1931      0.0008
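The coefficients 1.5 and 1.5 need not be derived by hand: they are the same weighted combination applied to the design rows. The Python sketch below assumes each row is ordered as Intercept, TreatmentA, TreatmentB under effect coding; `combine` is an illustrative helper, and the final estimate matches the SAS value only up to rounding of the printed betas.

```python
import math

# Effect-coded design rows: [Intercept, TreatmentA, TreatmentB].
rows = {"A": [1, 1, 0], "B": [1, 0, 1], "P": [1, -1, -1]}


def combine(weights):
    """Weighted sum of design rows; weights maps level -> weight."""
    width = len(next(iter(rows.values())))
    return [sum(w * rows[lv][k] for lv, w in weights.items()) for k in range(width)]


# Average of A and B minus P.
coefficients = combine({"A": 0.5, "B": 0.5, "P": -1.0})

betas = [0.3193, 0.6385, 0.9157]  # Intercept, Treatment A, Treatment B from the slide
estimate = sum(c * b for c, b in zip(coefficients, betas))
odds_ratio = math.exp(estimate)

print(coefficients)
print(round(estimate, 3), round(odds_ratio, 2))
```

The intercept coefficient comes out as 0, which is why only the two treatment coefficients appear in the CONTRAST statement.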
  • 19. Contrast Statement Coding in PROC GLM
    • Unlike PROC LOGISTIC, PROC GLM supports only the GLM coding scheme.
    • It always uses the "LAST" value as the reference group: A=2 and B=3 in the following case.
    Example: Y is a continuous response following a normal distribution with constant variance. The model has two factors, A with 2 levels and B with 3 levels.
    Main effect model: Yij = μ + αi + βj + εij, where i is the level of factor A and j is the level of factor B.
    The design matrix of the main effect model:
      Data     Design Matrix
      A  B     μ (intercept)  A1  A2  B1  B2  B3
      1  1     1               1   0   1   0   0
      1  2     1               1   0   0   1   0
      1  3     1               1   0   0   0   1
      2  1     1               0   1   1   0   0
      2  2     1               0   1   0   1   0
      2  3     1               0   1   0   0   1
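The design matrix above follows a simple rule: one intercept column plus one indicator column per level of each factor. A short Python sketch of that rule, with a hypothetical helper name and the level sets from this example assumed as defaults:

```python
def glm_main_effects_row(a, b, a_levels=(1, 2), b_levels=(1, 2, 3)):
    """Design row [mu, A1, A2, B1, B2, B3] under GLM (indicator) coding."""
    return [1] + [int(a == lv) for lv in a_levels] + [int(b == lv) for lv in b_levels]


design = {(a, b): glm_main_effects_row(a, b) for a in (1, 2) for b in (1, 2, 3)}
for cell, row in design.items():
    print(cell, row)
```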
  • 20. Contrast Statement in PROC GLM
    Test hypothesis 1: H0: μB1 = μB2
    From the design matrix, we know μB1 = μ + αi + β1 and μB2 = μ + αi + β2, so μB1 - μB2 = β1 - β2.
    Contrast statement: contrast "B=1 vs B=2" B 1 -1 / e;
    proc glm data=data3;
      class a b;
      model y = a b / solution;
      contrast "B=1 vs B=2" B 1 -1 / e;
      * CONTRAST and ESTIMATE usually go hand in hand: ESTIMATE gives the estimated difference and its SE;
      estimate "B=1 vs B=2" B 1 -1 / e;
    run;
      Contrast     DF  Contrast SS  Mean Square  F Value  Pr > F
      B=1 vs B=2    1   4.98877622   4.98877622     4.85  0.0328
      Parameter    Estimate     Standard Error  t Value  Pr > |t|
      B=1 vs B=2   -0.84927549      0.38544563    -2.20    0.0328
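Two things on this slide can be checked with a few lines of Python: the contrast coefficients are the difference between the design rows for B=1 and B=2 (at the same level of A), and for a single-degree-of-freedom contrast the F value is the square of the t value from ESTIMATE. This is a sketch using the printed numbers; the helper name is illustrative.

```python
def glm_row(a, b):
    """Main-effects design row [mu, A1, A2, B1, B2, B3] under GLM coding."""
    return [1] + [int(a == lv) for lv in (1, 2)] + [int(b == lv) for lv in (1, 2, 3)]


# Difference between B=1 and B=2 at the same level of A (here A=1).
contrast_row = [x - y for x, y in zip(glm_row(1, 1), glm_row(1, 2))]
b_coefficients = contrast_row[3:]  # the B columns only

# Values copied from the ESTIMATE output on the slide.
estimate = -0.84927549
std_error = 0.38544563
t_value = estimate / std_error
f_value = t_value ** 2

print(contrast_row, b_coefficients)
print(round(t_value, 2), round(f_value, 2))
```

The intercept and A columns cancel in the difference, leaving `B 1 -1` (with a trailing 0 for B3), and the same p-value appears in both output tables.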
  • 21. Contrast Statement Coding in PROC GLM
    Model with crossed effects: Yijk = μ + αi + βj + αβij + εijk, where i is the level of factor A, j is the level of factor B, and k indexes the observations within each A*B cell.
    The design matrix of the crossed effect model:
      Data     Design Matrix
                    A       B           A*B
      A  B     μ    A1  A2  B1  B2  B3  A1B1  A1B2  A1B3  A2B1  A2B2  A2B3
      1  1     1     1   0   1   0   0     1     0     0     0     0     0
      1  2     1     1   0   0   1   0     0     1     0     0     0     0
      1  3     1     1   0   0   0   1     0     0     1     0     0     0
      2  1     1     0   1   1   0   0     0     0     0     1     0     0
      2  2     1     0   1   0   1   0     0     0     0     0     1     0
      2  3     1     0   1   0   0   1     0     0     0     0     0     1
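The interaction columns in this matrix are the element-wise products of the A and B indicator columns, listed with B varying fastest. A minimal Python sketch of that construction, assuming the level sets of this example (the function name is illustrative):

```python
def glm_crossed_row(a, b, a_levels=(1, 2), b_levels=(1, 2, 3)):
    """Design row [mu, A1, A2, B1, B2, B3, A1B1, ..., A2B3] under GLM coding."""
    a_cols = [int(a == lv) for lv in a_levels]
    b_cols = [int(b == lv) for lv in b_levels]
    # Interaction columns: product of every A indicator with every B indicator,
    # ordered A1B1, A1B2, A1B3, A2B1, A2B2, A2B3.
    ab_cols = [x * y for x in a_cols for y in b_cols]
    return [1] + a_cols + b_cols + ab_cols


crossed = {(a, b): glm_crossed_row(a, b) for a in (1, 2) for b in (1, 2, 3)}
for cell, row in crossed.items():
    print(cell, row)
```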
  • 22. Contrast Statement in PROC GLM
    Test hypothesis 2: H0: μAB11 = μAB12
    From the design matrix, we know μAB11 = μ + α1 + β1 + αβ11 and μAB12 = μ + α1 + β2 + αβ12, so μAB11 - μAB12 = β1 - β2 + αβ11 - αβ12.
    Contrast statement: there are now two sets of dummy coefficients, one for B and one for A*B.
    proc glm data=data3;
      class a b;
      model y = a b a*b / solution;
      * A*B 1 -1 is equivalent to A*B 1 -1 0 0 0 0 because trailing zeros can be omitted;
      contrast "AB11 vs AB12" B 1 -1 A*B 1 -1;
      estimate "AB11 vs AB12" B 1 -1 A*B 1 -1;
    run;
      Contrast       DF  Contrast SS  Mean Square  F Value  Pr > F
      AB11 vs AB12    1   6.81411446   6.81411446     6.53  0.0143
      Parameter      Estimate     Standard Error  t Value  Pr > |t|
      AB11 vs AB12   -1.40976950      0.55182283    -2.55    0.0143
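As with the main-effects model, the contrast coefficients are the difference between two design rows, here the rows for cells (A=1, B=1) and (A=1, B=2) of the crossed model. The Python sketch below splits that difference by effect so it can be read straight into a CONTRAST statement; the helper and the grouping labels are illustrative.

```python
def crossed_row(a, b):
    """GLM-coded row: mu, A1-A2, B1-B3, then A*B cells with B varying fastest."""
    a_cols = [int(a == lv) for lv in (1, 2)]
    b_cols = [int(b == lv) for lv in (1, 2, 3)]
    return [1] + a_cols + b_cols + [x * y for x in a_cols for y in b_cols]


diff = [x - y for x, y in zip(crossed_row(1, 1), crossed_row(1, 2))]

# Slice the difference into the effects named in the MODEL statement.
contrast_by_effect = {
    "Intercept": diff[0:1],
    "A": diff[1:3],
    "B": diff[3:6],
    "A*B": diff[6:12],
}
print(contrast_by_effect)
```

The intercept and A coefficients are all zero, so only `B 1 -1` and `A*B 1 -1` need to be written.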
  • 23. Steps to Construct a Contrast Statement
    Step 1. Write down the model. Two parts are crucial:
      – Parameterization: how the design variables in the CLASS statement are coded.
      – Parameter ordering: the order of parameters depends on the CLASS statement and the ORDER= option. Confirm it from the SAS output.
    Step 2. Write down the hypothesis to be tested.
    Step 3. Write the CONTRAST statement.
  • 24. Default and Alternative Coding in Regression SAS Procedures
    Default design matrix coding, by procedure:
      GLM coding (indicator or dummy), ref value coded as 1: GENMOD, GLM, GLMSELECT, GLIMMIX, LIFEREG, MIXED, and SURVEYPHREG
      EFFECT coding (deviation from mean), ref value coded as -1: CATMOD, LOGISTIC, and SURVEYLOGISTIC
      REF coding, ref value coded as 0: PHREG and TRANSREG
    Alternative coding allowed?
      Allowed: LOGISTIC, GENMOD, GLMSELECT, PHREG, SURVEYLOGISTIC, and SURVEYPHREG
      Not allowed: GLM, MIXED, GLIMMIX, and LIFEREG
  • 25. Take Home Messages
    • The design matrix has a profound impact on CONTRAST/LSMEANS statements and on the interpretation of results.
    • First, know the design matrix you are using and the linear combination of parameters it implies.
    • The order of variables in the CLASS statement matters.
    • Effect coding estimates the effect of a level compared with the average of all levels, not with the reference level.
    • Follow the steps for constructing a contrast statement:
      – Step 1: Know the design matrix and the order of parameters
      – Step 2: Write down the hypothesis to be tested
      – Step 3: Construct the CONTRAST and ESTIMATE statements
    • Different regression procedures have different default design matrix schemes, and not all procedures allow alternative schemes.
  • 26. Acknowledgement
    • Thanks to Kathy Roggenkamp for your encouragement and support!

Editor's Notes

  1. Sometimes, when I read statistical and SAS references, I feel it would be better to share what I learn with others, since that benefits both me and the center. When you study or read something, you think you understand it until you try to explain it clearly to someone else. That is what homework is for. Explaining to others is often the best way to really learn something, because it requires you to think more deeply and thoroughly. In real work, everybody does unique work and uses specific skills, so I also want this mini-lunch presentation to set an example at the center and encourage sharing of learning experiences, so that we can learn from each other. Sometimes I wish I could just watch a video or presentation that pulls together the material on a specific topic of interest. For the mini-lunch presentations, the topic will be small and focused, directly related to our daily tasks, and hopefully engaging. By the way, I don't take credit for the material here; I just put together what I learned and try to present it clearly. For this presentation, I will focus on three common parameterization choices in SAS regression procedures and on how to write CONTRAST and LSMEANS statements to perform various hypothesis tests. In all examples here, all explanatory variables are categorical, and the response variable is either binary, as in the logistic model, or continuous, as in GLM.
  3. Depending on the study interest and hypothesis, the design matrix can differ for the same model. Take this simple ANOVA example: y is a continuous response variable, and there is one explanatory variable with three group