Design matrix and Contrast

Design Matrix and
Contrast Statement in SAS Regression
Procedures
Gang Cui, MPH
Sr. Biostatistician, CSCC, UNC

Outline
• Motivations
• Design Matrix definition
• Three commonly used design matrix coding schemes in
logistic regression
• Contrast statement under effect and GLM coding scheme
• Summary of design matrix choices in all regression procedures
13:10

Motivation
• Deep learning on certain elusive topics
• Mutual learning and knowledge sharing
• Lightweight but practical
• Engaging and enjoyable

Design Matrix
• A design matrix is a matrix of "values" of the explanatory
variables for a set of observations. The values in the matrix are not the
raw values in the dataset; this recoding is also called parameterization.
– Example: comparing the mean of y among three groups, with y the continuous response,
x the explanatory variable with 3 groups (1, 2, 3), 𝜇𝑖 the mean of y in group i,
and 𝜏𝑖 the difference between group i and the reference group (group 1).
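Cell-means model: 𝑦𝑖𝑗 = 𝜇𝑖 + 𝜖𝑖𝑗
Reference-group model: 𝑦𝑖𝑗 = 𝜇1 + 𝜏𝑖 + 𝜖𝑖𝑗 (with 𝜏1 = 0)

Cell-means model in matrix form (3 observations in group 1, 2 in group 2, 2 in group 3):

\[
\begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{31}\\ y_{32} \end{pmatrix}
=
\begin{pmatrix} 1&0&0\\ 1&0&0\\ 1&0&0\\ 0&1&0\\ 0&1&0\\ 0&0&1\\ 0&0&1 \end{pmatrix}
\begin{pmatrix} \mu_1\\ \mu_2\\ \mu_3 \end{pmatrix}
+
\begin{pmatrix} \epsilon_{11}\\ \epsilon_{12}\\ \epsilon_{13}\\ \epsilon_{21}\\ \epsilon_{22}\\ \epsilon_{31}\\ \epsilon_{32} \end{pmatrix}
\]

Reference-group model in matrix form:

\[
\begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{31}\\ y_{32} \end{pmatrix}
=
\begin{pmatrix} 1&0&0\\ 1&0&0\\ 1&0&0\\ 1&1&0\\ 1&1&0\\ 1&0&1\\ 1&0&1 \end{pmatrix}
\begin{pmatrix} \mu_1\\ \tau_2\\ \tau_3 \end{pmatrix}
+
\begin{pmatrix} \epsilon_{11}\\ \epsilon_{12}\\ \epsilon_{13}\\ \epsilon_{21}\\ \epsilon_{22}\\ \epsilon_{31}\\ \epsilon_{32} \end{pmatrix}
\]
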
Written out per observation, the cell-means model is:
𝑦11 = 𝜇1 + 𝜖11
𝑦12 = 𝜇1 + 𝜖12
𝑦13 = 𝜇1 + 𝜖13
𝑦21 = 𝜇2 + 𝜖21
𝑦22 = 𝜇2 + 𝜖22
𝑦31 = 𝜇3 + 𝜖31
𝑦32 = 𝜇3 + 𝜖32
and the reference-group model is:
𝑦11 = 𝜇1 + 𝜖11
𝑦12 = 𝜇1 + 𝜖12
𝑦13 = 𝜇1 + 𝜖13
𝑦21 = 𝜇1 + 𝜏2 + 𝜖21
𝑦22 = 𝜇1 + 𝜏2 + 𝜖22
𝑦31 = 𝜇1 + 𝜏3 + 𝜖31
𝑦32 = 𝜇1 + 𝜏3 + 𝜖32
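
The two parameterizations above describe the same group means. A minimal Python sketch of that equivalence follows; the group means in `mu` are made-up numbers used only for illustration, and the helper names are not from the slides.

```python
# Observation order follows the slide: y11, y12, y13, y21, y22, y31, y32.
groups = [1, 1, 1, 2, 2, 3, 3]

def cell_means_row(g):
    """One indicator column per group; parameters are (mu1, mu2, mu3)."""
    return [1 if g == k else 0 for k in (1, 2, 3)]

def reference_row(g):
    """Intercept plus indicators for groups 2 and 3; parameters are (mu1, tau2, tau3)."""
    return [1] + [1 if g == k else 0 for k in (2, 3)]

def predict(row, params):
    """Dot product of one design-matrix row with the parameter vector."""
    return sum(x * p for x, p in zip(row, params))

# Hypothetical group means, chosen only for illustration.
mu = [10.0, 12.0, 15.0]
# Reference-group parameters implied by those means: (mu1, tau2, tau3).
tau = [mu[0], mu[1] - mu[0], mu[2] - mu[0]]

X_cell = [cell_means_row(g) for g in groups]
X_ref = [reference_row(g) for g in groups]

fitted_cell = [predict(r, mu) for r in X_cell]
fitted_ref = [predict(r, tau) for r in X_ref]

print(X_ref)
print(fitted_cell == fitted_ref)
```

Both design matrices reproduce the same fitted means; only the meaning of the parameters changes.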

Design Matrix
In the dataset you may have explanatory variables Sex = "M"/"F" and Age_Group = 1/2/3. The SAS
CLASS and MODEL statements will be:
class Sex Age_Group / <options>;
model outcome_var = Sex Age_Group / <options>;
In the design matrix, SAS creates "dummy variables" for Sex and Age_Group that take only the values 1
and 0 (GLM and REF coding). The underlying mathematical model is something like:
Outcome_var = 𝜇 + 𝛽1·Sex1 + 𝛽2·Sex2 + 𝛽3·Age_Group1 + 𝛽4·Age_Group2 + 𝛽5·Age_Group3,
where 𝜇 is the mean of the reference group and each 𝛽 is the added effect of the corresponding level.
In GLM coding, 𝛽 for the reference level is set to 0.
Data and design matrix (the Sex1 to Age_Group3 columns are the dummy variables):

Data             Design matrix
Sex  Age_Group   𝜇 (intercept)  Sex1  Sex2  Age_Group1  Age_Group2  Age_Group3
M    1           1              1     0     1           0           0
M    2           1              1     0     0           1           0
M    3           1              1     0     0           0           1
F    1           1              0     1     1           0           0
F    2           1              0     1     0           1           0
F    3           1              0     1     0           0           1
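
As a rough illustration of why the 𝛽 of a reference level has to be fixed at 0, the Python sketch below rebuilds the dummy columns shown on this slide and checks that they are linearly dependent on the intercept. The column order (M before F) simply follows the slide; the helper name `glm_columns` is an assumption made for this example.

```python
def glm_columns(value, levels):
    """GLM (dummy) coding: one 0/1 column per level of the class variable."""
    return [1 if value == lev else 0 for lev in levels]

sex_levels = ["M", "F"]   # column order Sex1, Sex2 as printed on the slide
age_levels = [1, 2, 3]    # columns Age_Group1..Age_Group3

data = [(s, a) for s in sex_levels for a in age_levels]
design = [[1] + glm_columns(s, sex_levels) + glm_columns(a, age_levels)
          for s, a in data]

for (s, a), row in zip(data, design):
    print(s, a, row)

# The Sex columns add up to the intercept column, and so do the Age_Group
# columns, so the full set of columns is not linearly independent.
sex_sums = [row[1] + row[2] for row in design]
age_sums = [row[3] + row[4] + row[5] for row in design]
print(sex_sums, age_sums)
```

Because of this redundancy, one parameter per class variable cannot be estimated freely, which is why the reference-level 𝛽 is set to 0.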

Design Matrix Choices in SAS
The three most common design matrix coding schemes:
• GLM – indicator (dummy) coding: one column per level, so the reference level is
coded as "1" in its own column and its parameter is set to 0.
• EFFECT – the reference level is coded as "-1". Beta estimates measure the difference
between the effect of each nonreference level and the average effect over
all levels.
• REF – reference cell coding: the reference level is coded as "0".
The scheme is specified with the CLASS statement global option PARAM=<GLM/EFFECT/REF>:
class <classvar1> <classvar2> / param = glm/effect/ref <or the procedure default>;
*Note:
1. The design matrix option has a profound impact on CONTRAST/LSMEANS statements and
on result interpretation.
2. Different procedures have different default design matrix options.
3. Not all procedures allow all three options.
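
To make the three schemes concrete, here is a small Python sketch that codes one level of a class variable under each scheme, taking the last level as the reference. The function `code_level` is a hypothetical helper written for this illustration; it mimics the layout of the "Class Level Information" tables and is not a SAS interface.

```python
def code_level(level, levels, param):
    """Return the design-matrix columns for one level of a class variable.

    The last entry of `levels` is treated as the reference level.
    """
    reference = levels[-1]
    nonref = levels[:-1]
    if param == "glm":
        # c columns; every level, including the reference, gets its own indicator.
        return [1 if level == lev else 0 for lev in levels]
    if param == "ref":
        # c-1 columns; the reference level is all zeros.
        return [1 if level == lev else 0 for lev in nonref]
    if param == "effect":
        # c-1 columns; the reference level is -1 in every column.
        if level == reference:
            return [-1] * len(nonref)
        return [1 if level == lev else 0 for lev in nonref]
    raise ValueError(f"unknown parameterization: {param}")

levels = ["A", "B", "P"]
for param in ("glm", "effect", "ref"):
    print(param, {lev: code_level(lev, levels, param) for lev in levels})
```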

PARAM=GLM in PROC Logistic
• Suppose a dataset has a binary outcome Pain (Yes/No) and
explanatory variables Treatment (A, B, P) and Sex (M, F).
proc logistic data=data1;
class Treatment Sex / param=GLM;
model Pain = Treatment Sex / expb;
run;
Class Level Information
Class      Value  Design Variables
Treatment  A      1  0  0
           B      0  1  0
           P      0  0  1
Sex        F      1  0
           M      0  1
* "Dummy coding": c columns in the design matrix,
where c is the number of levels of the class variable.
The reference level is coded as 1 in its own (last) column.
* With GLM coding the reference is always the "LAST" level; you
have no choice unless you format the desired reference level so that
it sorts "LAST".

PARAM=GLM in PROC Logistic
Odds Ratio Estimates
Effect            Point Estimate  95% Wald Confidence Limits
Treatment A vs P   8.960          1.949  41.201
Treatment B vs P  11.822          2.453  56.972
Sex F vs M         4.354          1.206  15.719
With param=GLM, you can simply take exp(beta)
to get the OR, because each 𝛽 coefficient
estimates the effect relative to the reference
level.
OR of A vs P = exp(2.1928) = 8.960
OR of B vs P = exp(2.4700) = 11.822
OR of F vs M = exp(1.4711) = 4.354
Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   -1.9705   0.7030          7.8570           0.0051       0.139
Treatment A  1    2.1928   0.7784          7.9353           0.0048       8.960
Treatment B  1    2.4700   0.8023          9.4770           0.0021      11.822
Treatment P  0    0        .               .                .            .
Sex F        1    1.4711   0.6550          5.0440           0.0247       4.354
Sex M        0    0        .               .                .            .
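
The arithmetic on this slide can be checked with a few lines of Python. The sketch below takes the estimates and standard errors from the table above and recomputes the odds ratios together with Wald limits of the form exp(beta ± z·SE); that formula for the limits is the usual Wald construction and is assumed here rather than quoted from the slides.

```python
import math
from statistics import NormalDist

# (estimate, standard error) copied from the maximum likelihood table above.
estimates = {
    "Treatment A vs P": (2.1928, 0.7784),
    "Treatment B vs P": (2.4700, 0.8023),
    "Sex F vs M": (1.4711, 0.6550),
}

z = NormalDist().inv_cdf(0.975)  # two-sided 95% normal quantile, about 1.96

def odds_ratio_with_ci(beta, se):
    """Odds ratio and Wald 95% limits under GLM (or REF) coding."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

results = {name: odds_ratio_with_ci(b, se) for name, (b, se) in estimates.items()}
for name, (odds_ratio, lower, upper) in results.items():
    print(f"{name}: OR={odds_ratio:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Small differences in the last digit against the SAS output are expected because the inputs are rounded to four decimals.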

PARAM=EFFECT in PROC Logistic
proc logistic data=data1;
class Treatment Sex / param=effect;
model Pain = Treatment Sex / expb;
run;
Class Level Information
Class      Value  Design Variables
Treatment  A       1   0
           B       0   1
           P      -1  -1
Sex        F       1
           M      -1
* param=effect is the default in PROC LOGISTIC.
* The default reference is the "LAST" level, but you
can change the reference.
* There are c-1 columns in the design matrix, where c is the
number of levels of the class variable. The reference level is
coded as -1.

Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   0.3193    0.3089          1.0685           0.3013      1.376
Treatment A  1   0.6385    0.4323          2.1815           0.1397      1.894
Treatment B  1   0.9157    0.4467          4.2032           0.0403      2.499
Sex F        1   0.7355    0.3275          5.0440           0.0247      2.087
Odds Ratio Estimates
Effect            Point Estimate  95% Wald Confidence Limits
Treatment A vs P   8.960          1.949  41.201
Treatment B vs P  11.822          2.453  56.972
Sex F vs M         4.354          1.206  15.719
* With effect coding, one CANNOT calculate
OR = exp(beta) as with GLM coding, because each 𝛽 coefficient
estimates the effect relative to the
average over all levels, not relative to the reference level.
* The p-value of a beta estimate and the 95% confidence
limits of the corresponding OR can therefore look
inconsistent (Treatment A has p = 0.1397, yet its OR limits exclude 1).
OR of A vs P = exp(2*0.6385 + 0.9157) = 8.960
OR of B vs P = exp(2*0.9157 + 0.6385) = 11.822
OR of F vs M = exp(2*0.7355) = 4.354
Why these formulas? The design matrix explains them.

According to the design matrix, Treatment = P is coded as -1 in every column (and Sex = M as -1):

Design matrix for Treatment     Design matrix for Sex
Level   A    B                  Level   F
A       1    0                  F       1
B       0    1                  M      -1
P      -1   -1

Therefore the logit function of each group, the difference
between each non-reference level and the reference, the odds, and the OR are:

Class  Level  Logit function      Odds             Difference (non-ref vs ref)      OR
Trt    A      L(A) = 𝛽0+𝛽A        Exp(𝛽0+𝛽A)       A vs P: L(A)-L(P) = 2𝛽A+𝛽B       Exp(2𝛽A+𝛽B)
       B      L(B) = 𝛽0+𝛽B        Exp(𝛽0+𝛽B)       B vs P: L(B)-L(P) = 2𝛽B+𝛽A       Exp(2𝛽B+𝛽A)
       P      L(P) = 𝛽0-𝛽A-𝛽B     Exp(𝛽0-𝛽A-𝛽B)
Sex    F      L(F) = 𝛽0+𝛽F        Exp(𝛽0+𝛽F)       F vs M: L(F)-L(M) = 2𝛽F          Exp(2𝛽F)
       M      L(M) = 𝛽0-𝛽F        Exp(𝛽0-𝛽F)
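
The table above can be reproduced numerically. The following Python sketch plugs the effect-coded estimates from the PARAM=EFFECT output into the design rows and recovers the reported odds ratios; the dictionary and function names are illustrative only.

```python
import math

# Effect-coded estimates from the PARAM=EFFECT output.
b0, bA, bB, bF = 0.3193, 0.6385, 0.9157, 0.7355

# Design-matrix rows under effect coding (reference level coded -1).
treatment_rows = {"A": (1, 0), "B": (0, 1), "P": (-1, -1)}
sex_rows = {"F": (1,), "M": (-1,)}

def treatment_logit(level):
    """L(level) = b0 + x1*bA + x2*bB, as in the logit column of the table."""
    x1, x2 = treatment_rows[level]
    return b0 + x1 * bA + x2 * bB

def sex_logit(level):
    """L(level) = b0 + x*bF."""
    (x,) = sex_rows[level]
    return b0 + x * bF

or_a_vs_p = math.exp(treatment_logit("A") - treatment_logit("P"))  # exp(2bA + bB)
or_b_vs_p = math.exp(treatment_logit("B") - treatment_logit("P"))  # exp(2bB + bA)
or_f_vs_m = math.exp(sex_logit("F") - sex_logit("M"))              # exp(2bF)

print(round(or_a_vs_p, 2), round(or_b_vs_p, 2), round(or_f_vs_m, 2))
```

The results agree with the Odds Ratio Estimates table up to rounding of the four-decimal inputs.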

PARAM=REF in PROC Logistic
proc logistic data=data1;
class Treatment Sex / param=REF;
model Pain = Treatment Sex / expb;
run;
* The reference level is coded as 0.
Class Level Information
Class      Value  Design Variables
Treatment  A      1  0
           B      0  1
           P      0  0
Sex        F      1
           M      0

PARAM=REF in PROC Logistic
Odds Ratio Estimates
Effect            Point Estimate  95% Wald Confidence Limits
Treatment A vs P   8.960          1.949  41.201
Treatment B vs P  11.822          2.453  56.972
Sex F vs M         4.354          1.206  15.719
With param=REF, the SAS output simply omits the
reference rows; otherwise the estimates
are the same as with GLM coding. The OR is calculated in the
same way as with GLM coding.
OR of A vs P = exp(2.1928) = 8.960
OR of B vs P = exp(2.4700) = 11.822
OR of F vs M = exp(1.4711) = 4.354
Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   -1.9705   0.7030          7.8570           0.0051       0.139
Treatment A  1    2.1928   0.7784          7.9353           0.0048       8.960
Treatment B  1    2.4700   0.8023          9.4770           0.0021      11.822
Sex F        1    1.4711   0.6550          5.0440           0.0247       4.354

Mixed Coding System in PROC Logistic
One can specify a different param option for each class variable, though this is unusual.
A parameterization set on an individual variable overrides the global option, unless the global option is GLM.
proc logistic data=data1;
class Treatment(param=effect) Sex(param=ref) / param=ref;
model Pain = Treatment Sex / expb;
run;
Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   -0.4163   0.4252          0.9586           0.3276      0.659
Treatment A  1    0.6385   0.4323          2.1815           0.1397      1.894
Treatment B  1    0.9157   0.4467          4.2032           0.0403      2.499
Sex F        1    1.4711   0.6550          5.0440           0.0247      4.354
Odds Ratio Estimates
Effect            Point Estimate  95% Wald Confidence Limits
Treatment A vs P   8.960          1.949  41.201
Treatment B vs P  11.822          2.453  56.972
Sex F vs M         4.354          1.206  15.719
OR of A vs P = exp(2*0.6385 + 0.9157) = 8.960
OR of B vs P = exp(2*0.9157 + 0.6385) = 11.822
OR of F vs M = exp(1.4711) = 4.354
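
One way to see how the mixed output relates to the all-effect output is to compare the intercepts. With Sex in REF coding and Treatment in effect coding, the intercept should be the logit for Sex = M averaged over treatments, which in the all-effect fit is 𝛽0 - 𝛽F. This relation is an inference from the coding rules, not something stated on the slides; the Python sketch below only checks it against the reported numbers.

```python
import math

# Estimates reported with param=effect for both variables.
effect_intercept = 0.3193
effect_sex_f = 0.7355

# Estimates reported with Treatment(param=effect) and Sex(param=ref).
mixed_intercept_reported = -0.4163
mixed_sex_f = 1.4711

# Sex = M is coded -1 under effect coding and 0 under REF coding, so the
# mixed-coding intercept should equal b0 - bF from the all-effect fit.
derived_intercept = effect_intercept - effect_sex_f

# The REF-coded Sex estimate should equal 2*bF, the F vs M logit difference.
derived_sex_f = 2 * effect_sex_f

or_f_vs_m = math.exp(mixed_sex_f)

print(round(derived_intercept, 4), round(derived_sex_f, 4), round(or_f_vs_m, 3))
```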

Contrast Statement
What if I want to compare treatment A vs B, or the average of A and B vs P, without changing
the reference group and without rerunning the procedure?
The answer is to use a CONTRAST statement. However, the coding scheme has a profound
impact on how the CONTRAST statement must be written.
General syntax:
CONTRAST 'label' <effect-name> <coefficient_1 … coefficient_n> / <options>;

Contrast Statement with Effect Coding in Logistic
Comparing A vs B: we know L(A) = 𝛽0+𝛽A and L(B) = 𝛽0+𝛽B, therefore
L(A)-L(B) = 𝛽A - 𝛽B, and we can write the contrast statement as:
Contrast "A vs B" treatment 1 -1;

Class  Level  Logit function      Odds             Difference (non-ref vs ref)      OR
Trt    A      L(A) = 𝛽0+𝛽A        Exp(𝛽0+𝛽A)       A vs P: L(A)-L(P) = 2𝛽A+𝛽B       Exp(2𝛽A+𝛽B)
       B      L(B) = 𝛽0+𝛽B        Exp(𝛽0+𝛽B)       B vs P: L(B)-L(P) = 2𝛽B+𝛽A       Exp(2𝛽B+𝛽A)
       P      L(P) = 𝛽0-𝛽A-𝛽B     Exp(𝛽0-𝛽A-𝛽B)
Sex    F      L(F) = 𝛽0+𝛽F        Exp(𝛽0+𝛽F)       F vs M: L(F)-L(M) = 2𝛽F          Exp(2𝛽F)
       M      L(M) = 𝛽0-𝛽F        Exp(𝛽0-𝛽F)

Contrast Estimation and Testing Results by Row
proc logistic data=data1;
class Treatment Sex / param=effect;
model Pain = Treatment Sex / e expb;
Contrast "A vs B" treatment 1 -1 / e estimate=both;
run;

Coefficients of Contrast A vs B
Parameter   Row1
TreatmentA   1
TreatmentB  -1

Analysis of Maximum Likelihood Estimates
Parameter    DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Exp(Est)
Intercept    1   0.3193    0.3089          1.0685           0.3013      1.376
Treatment A  1   0.6385    0.4323          2.1815           0.1397      1.894
Treatment B  1   0.9157    0.4467          4.2032           0.0403      2.499
Sex F        1   0.7355    0.3275          5.0440           0.0247      2.087

Contrast Estimation and Testing Results by Row
Contrast  Type  Row  Estimate  Standard Error  Alpha  Confidence Limits  Wald Chi-Square  Pr > ChiSq
A vs B    PARM  1    -0.2772   0.7463          0.05   -1.7399  1.1855    0.1380           0.7103
A vs B    EXP   1     0.7579   0.5656          0.05    0.1755  3.2725    0.1380           0.7103
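
The PARM and EXP rows of the contrast output can be reproduced by hand. The Python sketch below applies the contrast coefficients (1, -1) to the effect-coded treatment estimates; the standard error 0.7463 is copied from the output because it depends on the covariance of the estimates, which the slides do not show.

```python
import math
from statistics import NormalDist

beta = {"A": 0.6385, "B": 0.9157}   # effect-coded treatment estimates
coefficients = {"A": 1, "B": -1}    # CONTRAST "A vs B" treatment 1 -1
standard_error = 0.7463             # copied from the contrast output

# Contrast estimate on the logit scale: 1*bA + (-1)*bB.
estimate = sum(coefficients[k] * beta[k] for k in coefficients)

z = NormalDist().inv_cdf(0.975)
lower = estimate - z * standard_error
upper = estimate + z * standard_error
wald_chi_square = (estimate / standard_error) ** 2

print(f"PARM: {estimate:.4f} ({lower:.4f}, {upper:.4f}), chi-square {wald_chi_square:.4f}")
print(f"EXP:  {math.exp(estimate):.4f} ({math.exp(lower):.4f}, {math.exp(upper):.4f})")
```

The EXP row is simply the PARM row exponentiated, that is, the odds ratio of A vs B with its confidence limits.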

Now compare the average of (A and B) vs P:
½{L(A)+L(B)} - L(P) = ½(𝛽0+𝛽A+𝛽0+𝛽B) - (𝛽0-𝛽A-𝛽B) = 1.5𝛽A + 1.5𝛽B
proc logistic data=data1;
class Treatment Sex / param=effect;
model Pain = Treatment Sex / expb;
Contrast "average of (A and B) vs P" treatment 1.5 1.5 / e estimate=both;
run;

Coefficients of Contrast average of (A and B) vs P
Parameter   Row1
Intercept   0
TreatmentA  1.5
TreatmentB  1.5

Contrast Estimation and Testing Results by Row
Contrast                   Type  Row  Estimate  Standard Error  Alpha  Confidence Limits  Wald Chi-Square  Pr > ChiSq
average of (A and B) vs P  PARM  1     2.3314   0.6969          0.05   0.9656   3.6972    11.1931          0.0008
average of (A and B) vs P  EXP   1    10.2923   7.1722          0.05   2.6263  40.3341    11.1931          0.0008
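
The coefficients 1.5 and 1.5 do not have to be worked out by algebra each time; they fall out of a weighted combination of design-matrix rows. The Python sketch below shows this for the comparison above, using the effect-coded rows and estimates from the earlier slides; the variable names are illustrative only.

```python
import math

# Effect-coded design rows for Treatment (columns: TreatmentA, TreatmentB).
rows = {"A": (1, 0), "B": (0, 1), "P": (-1, -1)}

# Hypothesis: 0.5*L(A) + 0.5*L(B) - 1*L(P). The weights sum to zero,
# so the intercept cancels and only treatment coefficients remain.
weights = {"A": 0.5, "B": 0.5, "P": -1.0}

coefficients = [sum(weights[lev] * rows[lev][j] for lev in rows) for j in range(2)]
print(coefficients)

beta = (0.6385, 0.9157)  # effect-coded estimates for Treatment A and B
estimate = sum(c * b for c, b in zip(coefficients, beta))
print(round(estimate, 4), round(math.exp(estimate), 2))
```

The same row-combination idea works for any comparison among levels, as long as the rows match the coding scheme actually in use.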

Contrast statement Coding in PROC GLM
• Unlike PROC LOGISTIC, PROC GLM supports only the GLM coding scheme.
• It always uses the "LAST" level as the reference: A=2 and B=3 in the following case.
Example: Y is a continuous response following a normal distribution with constant
variance. The model has two factors, A with 2 levels and B with 3 levels.
Main effect model: Yij = μ + αi + βj + εij, where i is the level of factor A and j is the level
of factor B.
The design matrix of the main effect model:

Data    Design matrix
A  B    μ (intercept)  A1  A2  B1  B2  B3
1  1    1              1   0   1   0   0
1  2    1              1   0   0   1   0
1  3    1              1   0   0   0   1
2  1    1              0   1   1   0   0
2  2    1              0   1   0   1   0
2  3    1              0   1   0   0   1

Contrast Statement in PROC GLM
Test hypothesis 1: H0: μB1 = μB2
From the design matrix, we know μB1 = μ + αi + β1 and μB2 = μ + αi + β2, so
μB1 - μB2 = β1 - β2
Contrast statement: Contrast "B=1 vs B=2" B 1 -1 / e;
proc glm data=data3;
class a b;
model y = a b / solution;
contrast "B=1 vs B=2" B 1 -1 / e;
estimate "B=1 vs B=2" B 1 -1 / e; /* CONTRAST and ESTIMATE usually go hand in hand: ESTIMATE gives the estimated difference and its SE */
run;

Contrast    DF  Contrast SS  Mean Square  F Value  Pr > F
B=1 vs B=2  1   4.98877622   4.98877622   4.85     0.0328

Parameter   Estimate     Standard Error  t Value  Pr > |t|
B=1 vs B=2  -0.84927549  0.38544563      -2.20    0.0328
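
A quick Python sketch of two checks on this slide: the contrast coefficients are the difference between the B columns of the two design rows being compared, and for a 1 degree-of-freedom contrast the F value from CONTRAST is the square of the t value from ESTIMATE. The helper name `b_columns` is made up for this example.

```python
b_levels = [1, 2, 3]

def b_columns(b):
    """GLM-coded columns B1, B2, B3 for one level of factor B."""
    return [1 if b == lev else 0 for lev in b_levels]

# With factor A held fixed, the intercept and A columns cancel, so only
# the B columns differ between the B=1 row and the B=2 row.
coefficients = [x - y for x, y in zip(b_columns(1), b_columns(2))]
print(coefficients)  # trailing zero may be omitted in the CONTRAST statement

# Values copied from the ESTIMATE output.
estimate = -0.84927549
standard_error = 0.38544563

t_value = estimate / standard_error
f_value = t_value ** 2
print(round(t_value, 2), round(f_value, 2))
```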

Contrast Statement Coding in PROC GLM
Model with crossed effects: Yijk = μ + αi + βj + αβij + εijk, where i is the level of factor A,
j is the level of factor B, and k indexes the observations within each A*B cell.
The design matrix of the crossed effect model:

Data    Design matrix
                A       B           A*B
A  B    μ   A1  A2  B1  B2  B3  A1B1  A1B2  A1B3  A2B1  A2B2  A2B3
1  1    1   1   0   1   0   0   1     0     0     0     0     0
1  2    1   1   0   0   1   0   0     1     0     0     0     0
1  3    1   1   0   0   0   1   0     0     1     0     0     0
2  1    1   0   1   1   0   0   0     0     0     1     0     0
2  2    1   0   1   0   1   0   0     0     0     0     1     0
2  3    1   0   1   0   0   1   0     0     0     0     0     1

Contrast Statement in PROC GLM
Test hypothesis 2: H0: μAB11 = μAB12 (the cell means at A=1, B=1 and at A=1, B=2)
From the design matrix, we know: μAB11 = μ + α1 + β1 + αβ11 and μAB12 = μ + α1 + β2 + αβ12, so
μAB11 - μAB12 = β1 - β2 + αβ11 - αβ12
Contrast statement: coefficients are now needed for two effects, B and A*B.
proc glm data=data3;
class a b;
model y = a b a*b / solution;
contrast "AB11 vs AB12" B 1 -1
                        A*B 1 -1; /* equivalent to A*B 1 -1 0 0 0 0: trailing zeros can be omitted */
estimate "AB11 vs AB12" B 1 -1
                        A*B 1 -1;
run;

Contrast      DF  Contrast SS  Mean Square  F Value  Pr > F
AB11 vs AB12  1   6.81411446   6.81411446   6.53     0.0143

Parameter     Estimate     Standard Error  t Value  Pr > |t|
AB11 vs AB12  -1.40976950  0.55182283      -2.55    0.0143
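
The coefficients for B and A*B can be read off by subtracting the two design-matrix rows being compared. The Python sketch below builds GLM-coded rows for the crossed model in the column order shown on the design matrix slide (μ, A1, A2, B1 to B3, A1B1 to A2B3) and takes the difference; `design_row` is a hypothetical helper, not SAS code.

```python
a_levels = [1, 2]
b_levels = [1, 2, 3]

def design_row(a, b):
    """GLM-coded row: intercept, A columns, B columns, then A*B columns."""
    a_cols = [1 if a == i else 0 for i in a_levels]
    b_cols = [1 if b == j else 0 for j in b_levels]
    ab_cols = [1 if (a, b) == (i, j) else 0 for i in a_levels for j in b_levels]
    return [1] + a_cols + b_cols + ab_cols

# Cell (A=1, B=1) minus cell (A=1, B=2).
difference = [x - y for x, y in zip(design_row(1, 1), design_row(1, 2))]

intercept_coef = difference[0]
a_coefs = difference[1:3]
b_coefs = difference[3:6]
ab_coefs = difference[6:]

print(intercept_coef, a_coefs, b_coefs, ab_coefs)
```

The intercept and A coefficients cancel, which is why the CONTRAST statement only lists B and A*B.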

Steps to Construct Contrast Statement
Step 1. Write down the model. Two parts are crucial:
– Parameterization: how the design variables in the CLASS statement
are coded.
– Parameter ordering: the order of the parameters depends on the
CLASS statement and the ORDER= option. Confirm it from the SAS
output.
Step 2. Write down the hypothesis to be tested.
Step 3. Write the CONTRAST statement.

Default and Alternative Coding in Regression
SAS Procedures
Default design matrix coding                              Regression procedures
GLM coding (indicator or dummy): ref value coded as 1     GENMOD, GLM, GLMSELECT, GLIMMIX, LIFEREG, MIXED, and SURVEYPHREG
EFFECT coding (deviation from mean): ref value coded -1   CATMOD, LOGISTIC, and SURVEYLOGISTIC
REF coding: ref value coded as 0                          PHREG and TRANSREG

Alternative coding allowed?   Regression procedures
Allowed                       LOGISTIC, GENMOD, GLMSELECT, PHREG, SURVEYLOGISTIC, and SURVEYPHREG
Not allowed                   GLM, MIXED, GLIMMIX, and LIFEREG

Take Home Messages
• The design matrix has a profound impact on CONTRAST/LSMEANS statements and
on result interpretation.
• First, know the design matrix you are using and the linear
combination of parameters it implies.
• The order of variables in the CLASS statement matters.
• Effect coding estimates each effect relative to the average over all levels, not relative to
the reference.
• Follow the steps for constructing a contrast statement:
– Step 1: Know the design matrix and the order of parameters
– Step 2: Write down the hypothesis to be tested
– Step 3: Construct the CONTRAST and ESTIMATE statements
• Different regression procedures have different default design matrix
schemes, and not all procedures allow alternative schemes.

Acknowledgement
• Thanks Kathy Roggenkamp for your
encouragement and support!