Multivariate Analysis
Prof. Dr. Jamalludin Ab Rahman MD MPH
Department of Community Medicine
Kulliyyah of Medicine
Smoking & lung cancer
Good case-control study associating lung cancer with smoking (Wynder EL, Graham E. Tobacco smoking as a possible etiologic factor in bronchiogenic carcinoma: a study of 684 proven cases. JAMA 1950;143:329-36.)
Tobacco dust (not smoke) might be causing the elevated incidence of lung tumours among German tobacco workers. (Hermann Rottmann in Würzburg, 1898)
Difference Causal
Smoking might be related to lung cancer, but lung cancer is still rare (Adler I. Primary Malignant Growths of the Lungs and Bronchi. London: Longmans, 1912:22)
86 lung cancer patients were likely to have been smokers (Müller FH. Tabakmissbrauch und Lungencarcinom. Zeitschrift für Krebsforschung 1939;49:57–85.)
Smoking 35 cigarettes per day increases the risk 40-fold (Doll R, Hill AB. The mortality of doctors in relation to their smoking habits. BMJ 1954;1:1451–5.)
Animal study associating cigarette smoke tar with cancer (Wynder E, Graham EA, Croninger AB. Experimental production of carcinoma with cigarette tar. Cancer Res 1953;13:855–66)
8 November 2016 (C) Jamalludin Ab Rahman 2015
It is about relationships
Analysis of the relationships between two or more variables.
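Multivariate?
• Multivariate: used here as a general term for analyses with multiple independent variables (IVs)
• May involve multiple dependent variables (DVs)
[Diagram: IVs linked to one DV, and IVs linked to more than one DV]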
[Diagram: several exposures linked to one outcome; a third variable can act as an effect modifier (moderator), a confounder or a mediator]
Exercise & fitness
[Chart: fitness level by exercise intensity (Low, Moderate, High)]
• Is there any difference (%) between low and moderate intensity?
• How big is the difference (%) between low and moderate?
• Is there any pattern now?
• What is your conclusion?
Physical activity & blood pressure
Time (minutes/week)  SBP (mmHg)
30 140
40 145
85 130
90 143
100 130
110 120
110 110
120 120
130 110
130 109
140 98
150 100
140 110
135 120
160 100
160 96
170 100
200 89
200 100
240 80
y = -0.3287x + 155.89
R² = 0.8508
[Scatter plot: SBP (mmHg) against time (minutes/week), with the fitted line]
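As a rough check of the fitted line on this slide, the sketch below refits the 20 Time/SBP pairs by ordinary least squares in plain Python. The function name `fit_line` and the variable names are illustrative choices, not part of the original slides.

```python
# Time (minutes/week) and SBP (mmHg) pairs copied from the slide's table.
time = [30, 40, 85, 90, 100, 110, 110, 120, 130, 130,
        140, 150, 140, 135, 160, 160, 170, 200, 200, 240]
sbp = [140, 145, 130, 143, 130, 120, 110, 120, 110, 109,
       98, 100, 110, 120, 100, 96, 100, 89, 100, 80]


def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x.

    Returns (b0, b1, r_squared). The name fit_line is hypothetical.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)   # spread of x
    syy = sum((yi - mean_y) ** 2 for yi in y)   # spread of y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                  # slope
    b0 = mean_y - b1 * mean_x       # intercept
    r_squared = sxy ** 2 / (sxx * syy)
    return b0, b1, r_squared


b0, b1, r2 = fit_line(time, sbp)
print(f"y = {b1:.4f}x + {b0:.2f}, R2 = {r2:.4f}")
```

The slope is negative: in this data set, each additional minute of activity per week goes with roughly a 0.33 mmHg lower SBP.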
The 3rd variable
[Diagram: an outcome linked to two exposures]
The 3rd factor can be a...
1. Confounder
2. Mediator or intervening factor
3. Moderator or effect modifier (interaction)
The 3rd variable
[Diagram: exposures, outcome and confounder]
A confounder influences a relationship (between two variables) but is not part of the pathway.
The 3rd variable
[Diagram: exposures, outcome and moderator]
Moderation occurs when an exposure has different effects on disease at different values of another variable (interaction).
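One common way to write moderation down, offered here as an illustrative sketch and not as the model used in the slides, is a regression with a product (interaction) term. The symbols X (exposure), M (moderator) and Y (outcome) are introduced only for this example.

```latex
Y = \beta_0 + \beta_1 X + \beta_2 M + \beta_3 (X \times M) + \varepsilon
```

Under this form the effect of X on Y is \beta_1 + \beta_3 M, so it differs across values of M whenever \beta_3 is not zero.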
Stress vs. MS vs. coping mechanism
[Diagram: stress linked to new multiple sclerosis lesions, with coping mechanism as the moderator]
Stress and new lesions: OR = 1.62, p = 0.009
Coping mechanism: distraction (OR = 0.69, p = 0.009), instrumental (OR = 0.77, p = 0.081), emotional preoccupation (OR = 1.46, p = 0.088) and palliative (NS, not significant)
Mohr, D. C., Goodkin, D. E., Nelson, S., Cox, D., & Weiner, M. (2002). Moderating Effects of Coping on the Relationship Between Stress and the Development of New Brain Lesions in Multiple Sclerosis. Psychosom Med, 64(5), 803-809.
The 3rd variable
[Diagram: exposures, outcome and mediator]
A mediator influences a relationship (between two variables) and is also part of the pathway.
Why multivariate?
• Multi-factorial: which are the significant factors?
• Multiple outcomes
• Multiple units of measurement
• Exploration of associations
Regression
• Most common multivariate technique
• Finds the line that best fits the data, by ordinary least squares (OLS)
• Many types:
  • Linear regression
  • Logistic regression
  • and many others
[Figure: x-y plot of the fitted line $y = \beta_0 + \beta_1 x$, with intercept $\beta_0$]
Regression equation
• e.g. linear regression
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon$$
$Y$ = dependent variable
$\beta_0$ = intercept
$\beta_1$ = coefficient for variable $x_1$
$x_1$ = explanatory variable
$\varepsilon$ = error (residual)
Example #1
Arterial BP = Constant + Age + Body weight + Pulse rate + Stress + Residual
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon$$
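Logistic regression
• $$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$
• P = probability that Y = 1, i.e. the event occurs
• 1 - P = probability that the event does not occur
• $\ln\left(\frac{P}{1-P}\right)$ = ln of the odds = logit (the exponential of a coefficient is an odds ratio, OR)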
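The logit and its inverse can be sketched in a few lines of Python. This is a minimal illustration under the definitions on this slide; the function names (`logit`, `inverse_logit`, `predicted_probability`) are chosen for the example and do not come from the slides or from any particular library.

```python
import math


def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Probability P recovered from a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_probability(intercept, coefficients, values):
    """P(Y = 1) for one subject, given hypothetical fitted coefficients.

    The linear predictor is intercept + sum(coefficient * value); the
    inverse logit turns it back into a probability.
    """
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return inverse_logit(z)
```

A probability of 0.5 corresponds to odds of 1 and therefore to a logit of 0, which is why a linear predictor of 0 gives a predicted probability of 0.5.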
R² = 0.986, meaning about 99% of the variation in arterial BP (ABP) is explained by age, body weight, pulse rate and stress (F(4,15) = 291.948, P < 0.001).
Main Result
Arterial BP = 17.3 + 0.6(Age) + 0.9(Body weight) + 0.09(Pulse rate) + 0.01(Stress)
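To show how the fitted equation is used, the sketch below plugs values into it. The coefficients are the ones on the slide; the function name and the example subject are hypothetical, and the units of each predictor are an assumption because the slide does not state them.

```python
# Coefficients as reported on the slide.
INTERCEPT = 17.3
COEFFICIENTS = {
    "age": 0.6,
    "body_weight": 0.9,
    "pulse_rate": 0.09,
    "stress": 0.01,
}


def predict_arterial_bp(age, body_weight, pulse_rate, stress):
    """Predicted arterial BP from the slide's fitted equation.

    Units are assumed to match those of the original data set,
    which the slide does not specify.
    """
    return (INTERCEPT
            + COEFFICIENTS["age"] * age
            + COEFFICIENTS["body_weight"] * body_weight
            + COEFFICIENTS["pulse_rate"] * pulse_rate
            + COEFFICIENTS["stress"] * stress)


# Hypothetical subject, invented purely for illustration.
example = predict_arterial_bp(age=50, body_weight=80, pulse_rate=70, stress=50)
print(round(example, 1))
```

Each coefficient is the expected change in arterial BP for a one-unit increase in that predictor with the others held constant, so one extra unit of age adds 0.6 to the prediction.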
Example #2
• Snoring and risk of cardiovascular disease (CVD) in women (Hu 2000). From the Nurses' Health Study, a cohort study. Baseline in 1986: N = 71,779 women aged 40 to 65 years without diagnosed CVD or cancer. Followed until 31 May 1994.
• CVD = Snoring + Age + Smoking + BMI + Alcohol + Physical activity + Menopausal status + Family history of MI + DM + Cholesterol + Hours sleeping + Sleeping position
Testing assumptions for linear regression
1. Linearity between DV & IVs: scatter plot of residuals vs. predicted values
2. Normality: histogram of residuals
3. No outliers: casewise diagnostics (within ±2 SD), Cook's D (for influential points, < 1), leverage point (< 4/n)
4. Homogeneity of variance: scatter plot of residuals vs. predicted values
5. Independence (no autocorrelation): Durbin-Watson of 1.5-2.5 (the statistic itself ranges from 0 to 4, and a value near 2 suggests no autocorrelation)
6. No multicollinearity: tolerance > 0.1, VIF < 10
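Two of the checks in the list above are simple enough to compute by hand. The sketch below, with hypothetical function names, computes the Durbin-Watson statistic from a list of residuals and converts a tolerance into a VIF (VIF = 1 / tolerance), then applies the cut-offs quoted on this slide.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def vif_from_tolerance(tolerance):
    """Variance inflation factor as the reciprocal of tolerance."""
    return 1.0 / tolerance


def passes_slide_thresholds(dw, tolerance):
    """Apply the slide's rules of thumb: DW in 1.5-2.5, tolerance > 0.1, VIF < 10."""
    return (1.5 <= dw <= 2.5
            and tolerance > 0.1
            and vif_from_tolerance(tolerance) < 10)
```

Residuals that alternate in sign push the statistic towards 4, and residuals that barely change from one case to the next push it towards 0.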
Residuals
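[Figure: x-y plot with the fitted line $y = \beta_0 + \beta_1 x$ and intercept $\beta_0$; a residual is the vertical distance from an observed point to the line]
Residual = Observed - Predicted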
Residual statistics - Linearity
[Two plots of residuals against predicted values: linear vs. non-linear]
Residual statistics - Normality
[Two panels, axes labelled Residuals and Predicted: normal distribution vs. not normal]
Residual statistics - Heterogeneity
[Two plots of residuals against predicted values: homogeneous vs. heterogeneous]
Influential data
[Figure: x-y plot with a regression line and three marked points A, B and C]
A = Outlier: still within the range of x, but with a large residual
B & C = Leverage points
B = Good leverage: it will not change the regression line
C = Bad leverage: it will change the regression line
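Leverage depends only on how far a point's x value lies from the mean of x. As a minimal sketch for simple (one-predictor) regression, the code below computes each point's leverage and flags those above the 4/n rule of thumb quoted earlier in these slides. The function names and the x values are hypothetical.

```python
def leverages(x):
    """Leverage of each point in a simple regression: 1/n + (x - mean)^2 / Sxx."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def high_leverage_indices(x, cutoff=None):
    """Indices of points whose leverage exceeds the cutoff (default 4/n)."""
    if cutoff is None:
        cutoff = 4.0 / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]


# Hypothetical x values: the last one sits far from the rest.
x_values = [1, 2, 3, 4, 20]
print(high_leverage_indices(x_values))
```

Leverage alone cannot separate point B from point C: both are far out in x, and only the residual shows whether the point agrees with the line (good leverage) or pulls it away (bad leverage).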
Type of multivariate tests
Dependent variables | Independent variables | Test
1, continuous | ≥ 2, all continuous | Linear regression
1, continuous | ≥ 2, all categorical | ANOVA
1, continuous | ≥ 2, continuous + categorical | ANCOVA
> 1, continuous | All categorical | MANOVA
> 1, continuous | Categorical + continuous | MANCOVA
1, dichotomous | ≥ 2, continuous + categorical | Binary logistic regression
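The table above can be read as a lookup from the types of variables to a test. The sketch below encodes only the six rows shown and returns `None` for any combination the table does not list. The names `TEST_TABLE` and `choose_test` are hypothetical.

```python
CONT, CAT = "continuous", "categorical"

# Keys: (number of DVs as "one" or "many", DV type, set of IV types).
TEST_TABLE = {
    ("one", "continuous", frozenset({CONT})): "Linear regression",
    ("one", "continuous", frozenset({CAT})): "ANOVA",
    ("one", "continuous", frozenset({CONT, CAT})): "ANCOVA",
    ("many", "continuous", frozenset({CAT})): "MANOVA",
    ("many", "continuous", frozenset({CONT, CAT})): "MANCOVA",
    ("one", "dichotomous", frozenset({CONT, CAT})): "Binary logistic regression",
}


def choose_test(n_dv, dv_type, iv_types):
    """Return the test named in the table, or None if the row is not listed.

    n_dv is the number of dependent variables, dv_type is "continuous" or
    "dichotomous", and iv_types is an iterable with one type per IV.
    """
    if n_dv < 1:
        raise ValueError("at least one dependent variable is required")
    dv_count = "one" if n_dv == 1 else "many"
    return TEST_TABLE.get((dv_count, dv_type, frozenset(iv_types)))
```

For example, `choose_test(1, "continuous", ["continuous", "categorical"])` corresponds to the ANCOVA row.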
