Multivariate1

Multivariate Analysis using R
Dipika Patra
21 July 2016
1. One Sample Hotelling T-square test :
Descriptive Aspect :
Changes in pulmonary response of 12 workers after 6 hours of exposure to cotton dust, measured by the
following 3 variables.
FVC - changes in Forced Vital Capacity (FVC) after 6 hours
FEV - Changes in Forced Expiratory Volume (FEV) after 6 hours
CC - Changes in Closing Capacity (CC) after 6 hours
Inferential Aspect :
Hypothesis 1 :
The null hypothesis : NO CHANGE IN PULMONARY FUNCTION against
The alternative hypothesis : CHANGE IN PULMONARY FUNCTION
In order to test the hypothesis we use the one-sample Hotelling T-square statistic, introduced by
Hotelling (1931).
With dimension p = 3 and n = 12 observations, the scaled statistic T-square x (n - p) / (p (n - 1)) follows,
under the null hypothesis, an F distribution with df1 = p = 3 and df2 = n - p = 12 - 3 = 9 degrees of freedom.
HotellingsT2 reports this F-scaled value as T.2.
We reject the null hypothesis if the calculated statistic is greater than the tabulated value with
df1 = 3, df2 = 9 at the 5% level of significance or, in terms of the P-value, if it is less
than 0.05.
• R Codes (to call the data) with Output:
library(ICSNP)
## Loading required package: mvtnorm
## Loading required package: ICS
data(pulmonary)
pulmonary
## FVC FEV CC
## 1 -0.11 -0.12 -4.3
## 2 0.02 0.08 4.4
## 3 -0.02 0.03 7.5
## 4 0.07 0.19 -0.3
## 5 -0.16 -0.36 -5.8
## 6 -0.42 -0.49 14.5
## 7 -0.32 -0.48 -1.9
## 8 -0.35 -0.30 17.3
## 9 -0.10 -0.04 2.5
## 10 0.01 -0.02 -5.6
## 11 -0.10 -0.17 2.2
## 12 -0.26 -0.30 5.5
• R codes (to test the Hypothesis 1) with Output:
attach(pulmonary)
HotellingsT2(pulmonary, mu = c(0,0,0))
##
## Hotelling's one sample T2-test
##
## data: pulmonary
## T.2 = 3.8231, df1 = 3, df2 = 9, p-value = 0.05123
## alternative hypothesis: true location is not equal to c(0,0,0)
Conclusion:
Since the P-value = 0.05123 is greater than 0.05, we fail to reject the null hypothesis based on the given
sample. That is, the sample does not provide evidence of a significant change in pulmonary
function at the 5% level.
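The reported statistic can be recomputed by hand as a check. The sketch below is a minimal illustration, assuming the F-scaled form F = T-square x (n - p) / (p (n - 1)) with (p, n - p) degrees of freedom, which is consistent with df1 = 3 and df2 = 9 in the output above. The names `pulmonary_df` and `one_sample_t2` are hypothetical, introduced only for this example; the values are transcribed from the data listing above.

```r
# Changes in FVC, FEV and CC for the 12 workers (transcribed from the listing).
pulmonary_df <- data.frame(
  FVC = c(-0.11, 0.02, -0.02, 0.07, -0.16, -0.42,
          -0.32, -0.35, -0.10, 0.01, -0.10, -0.26),
  FEV = c(-0.12, 0.08, 0.03, 0.19, -0.36, -0.49,
          -0.48, -0.30, -0.04, -0.02, -0.17, -0.30),
  CC  = c(-4.3, 4.4, 7.5, -0.3, -5.8, 14.5,
          -1.9, 17.3, 2.5, -5.6, 2.2, 5.5)
)

# Hypothetical helper: one-sample Hotelling T-square test of H0: mean = mu0.
one_sample_t2 <- function(x, mu0) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  d <- colMeans(x) - mu0                           # deviation of sample mean from mu0
  t2 <- n * drop(t(d) %*% solve(cov(x)) %*% d)     # Hotelling T-square
  f_stat <- t2 * (n - p) / (p * (n - 1))           # F-scaled statistic
  p_value <- pf(f_stat, p, n - p, lower.tail = FALSE)
  list(T2 = t2, F = f_stat, df1 = p, df2 = n - p, p.value = p_value)
}

res <- one_sample_t2(pulmonary_df, mu0 = c(0, 0, 0))
print(res)
```

With `mu0 = c(0, 0, 0)` the F value should agree with the T.2 of about 3.82 reported above; passing a different `mu0` tests a different hypothesized mean vector.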
Hypothesis 2:
The null hypothesis : CHANGE ONLY IN CLOSING CAPACITY, BY 2 UNITS (mean change vector (0, 0, 2)') against
The alternative hypothesis : OTHER THAN NULL HYPOTHESIS
To test the above hypothesis the following code is used.
• R Codes (to test the Hypothesis 2) with Output:
HotellingsT2(pulmonary, mu = c(0,0,2))
##
## Hotelling's one sample T2-test
##
## data: pulmonary
## T.2 = 6.6204, df1 = 3, df2 = 9, p-value = 0.01178
## alternative hypothesis: true location is not equal to c(0,0,2)
Conclusion:
Since the P-value = 0.01178 is less than 0.05, we reject the null hypothesis based on the given
sample. That is, we conclude that the mean change in pulmonary function is not a change of 2 units in
closing capacity alone.
2. Two Sample Hotelling T-square test:
Generating Data Using R code:
Consider two teachers rated by two independent groups of 3 and 6 students
respectively, based on satisfaction level (scale of 6) and knowledge level (scale of 10) :
math.teach <- data.frame(teacher = factor(rep(1:2, c(3, 6))),
                         satisfaction = c(1, 3, 2, 4, 6, 6, 5, 5, 4),
                         knowledge = c(3, 7, 2, 6, 8, 8, 10, 10, 6))
math.teach
## teacher satisfaction knowledge
## 1 1 1 3
## 2 1 3 7
## 3 1 2 2
## 4 2 4 6
## 5 2 6 8
## 6 2 6 8
## 7 2 5 10
## 8 2 5 10
## 9 2 4 6
Graphical Display (Multiple Boxplot) :
[Figure: boxplots of Teacher Knowledge and Teacher Satisfaction]
Hypothesis 1:
Consider testing the null hypothesis that the two groups have identical mean vectors. This is represented
below as well as the general alternative that the mean vectors are not equal.
The null hypothesis : NO DIFFERENCE IN RATING BETWEEN TWO GROUPS against
The alternative hypothesis : CHANGE IN RATING BETWEEN TWO GROUPS
In order to test the hypothesis we use the distribution of the two-sample Hotelling T-square. The Hotelling
T-square statistic for two samples is
$$T^2 = (\bar{X}_1 - \bar{X}_2)' \, S_p^{-1} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)^{-1} (\bar{X}_1 - \bar{X}_2)$$
where $S_p$ is the pooled variance-covariance matrix, and
$$T^2 \sim \frac{p\,(n_1 + n_2 - 2)}{n_1 + n_2 - p - 1} \, F(p,\; n_1 + n_2 - p - 1)$$
The distribution is indexed by two parameters, the dimension p = 2 and the degrees of freedom
v = n1 + n2 − p − 1 = 3 + 6 − 2 − 1 = 6.
We reject the null hypothesis if the calculated statistic is greater than the tabulated value with
p = 2, v = 6 at the 1% level of significance or, in terms of the P-value, if it is less
than 0.01.
attach(math.teach)
HotellingsT2(cbind(satisfaction, knowledge) ~
teacher, mu=c(0,0))
##
## Hotelling's two sample T2-test
##
## data: cbind(satisfaction, knowledge) by teacher
## T.2 = 9, df1 = 2, df2 = 6, p-value = 0.01562
## alternative hypothesis: true location difference is not equal to c(0,0)
Conclusion:
Since the P-value = 0.01562 is greater than 0.01, we fail to reject the null hypothesis based on the given
sample. That is, at the 1% level the sample shows no significant difference in rating
between the two groups.
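The two-sample statistic can be worked through by hand on this small data set. The following is a minimal sketch, assuming the pooled-covariance form of T-square and its F scaling by (n1 + n2 - p - 1) / (p (n1 + n2 - 2)); the names `math_teach`, `two_sample_t2` and `delta0` are hypothetical, and the data are copied from the listing above.

```r
# Ratings copied from the printed data frame.
math_teach <- data.frame(
  teacher      = factor(rep(1:2, c(3, 6))),
  satisfaction = c(1, 3, 2, 4, 6, 6, 5, 5, 4),
  knowledge    = c(3, 7, 2, 6, 8, 8, 10, 10, 6)
)

# Hypothetical helper: two-sample Hotelling T-square test of
# H0: (mean of group 1) - (mean of group 2) = delta0.
two_sample_t2 <- function(x1, x2, delta0 = rep(0, ncol(x1))) {
  x1 <- as.matrix(x1)
  x2 <- as.matrix(x2)
  n1 <- nrow(x1)
  n2 <- nrow(x2)
  p  <- ncol(x1)
  d  <- colMeans(x1) - colMeans(x2) - delta0
  # Pooled variance-covariance matrix S_p
  s_pooled <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)
  t2 <- drop(t(d) %*% solve(s_pooled * (1 / n1 + 1 / n2)) %*% d)
  f_stat <- t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
  p_value <- pf(f_stat, p, n1 + n2 - p - 1, lower.tail = FALSE)
  list(T2 = t2, F = f_stat, df1 = p, df2 = n1 + n2 - p - 1, p.value = p_value)
}

ratings <- math_teach[, c("satisfaction", "knowledge")]
group1  <- ratings[math_teach$teacher == 1, ]
group2  <- ratings[math_teach$teacher == 2, ]

res0 <- two_sample_t2(group1, group2)                       # H0: difference = (0, 0)
res1 <- two_sample_t2(group1, group2, delta0 = c(-1, 1))    # H0: difference = (-1, 1)
print(res0)
print(res1)
```

For the zero difference this gives T-square = 21, which scales to F = 21 x 6 / 14 = 9 on (2, 6) degrees of freedom with P-value 1/64 = 0.015625, in line with the T.2 = 9 reported above. A non-zero `delta0` shifts the hypothesized difference in the same way as the `mu` argument does in the calls shown in this section.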
Hypothesis 2:
Consider testing the null hypothesis that the difference between the mean vectors of the two groups is
(−1, 1)'. This is represented below, along with the general alternative that the difference between the
mean vectors is other than that.
The null hypothesis : GIVEN SPECIFIC CHANGE IN RATING BETWEEN TWO GROUPS against
The alternative hypothesis : CHANGE IN RATING BETWEEN TWO GROUPS OTHER THAN GIVEN
In order to test the hypothesis we use the distribution of Two Sample Hotelling T- square.
HotellingsT2(cbind(satisfaction, knowledge) ~teacher, mu=c(-1,1))
##
##
## T.2 = 5.6897, df1 = 2, df2 = 6, p-value = 0.04115
## alternative hypothesis: true location difference is not equal to c(-1,1)
Conclusion:
Since the P-value = 0.04115 is less than 0.05, we reject the null hypothesis at the 5% level of significance
based on the given sample. That is, we conclude that the difference in mean ratings between the two groups
differs significantly from (-1, 1)'.
Hypothesis 3:
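Consider testing the null hypothesis that the difference between the mean vectors of the two groups is
(1, 1)'. This is represented below, along with the general alternative that the difference between the
mean vectors is other than that.
The null hypothesis : GIVEN SPECIFIC CHANGE IN RATING BETWEEN TWO GROUPS against
The alternative hypothesis : CHANGE IN RATING BETWEEN TWO GROUPS OTHER THAN THE GIVEN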
In order to test the hypothesis we use the distribution of Two Sample Hotelling T- square.
HotellingsT2(cbind(satisfaction, knowledge) ~teacher, mu=c(1,1))
##
##
## T.2 = 16.034, df1 = 2, df2 = 6, p-value = 0.003915
Conclusion:
Since the P-value = 0.003915 is less than 0.01, we reject the null hypothesis based on the given
sample. That is, we conclude that the difference in mean ratings between the two groups differs
significantly from (1, 1)'.
Hypothesis 4:
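Consider testing the null hypothesis that the difference between the mean vectors of the two groups is
(2, 2)'. This is represented below, along with the general alternative that the difference between the
mean vectors is other than that.
The null hypothesis : GIVEN SPECIFIC CHANGE IN RATING BETWEEN TWO GROUPS against
The alternative hypothesis : CHANGE IN RATING BETWEEN TWO GROUPS OTHER THAN THE GIVEN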
To test the hypothesis we use the distribution of Two Sample Hotelling T- square.
HotellingsT2(cbind(satisfaction, knowledge) ~teacher, mu=c(2,2))
##
##
## T.2 = 25.138, df1 = 2, df2 = 6, p-value = 0.001212
Conclusion:
Since the P-value = 0.001212 is less than 0.01, we reject the null hypothesis based on the given
sample. That is, we conclude that the difference in mean ratings between the two groups differs
significantly from (2, 2)'.
3. Two Way MANOVA(two factors) :
Triathlon performance : a multi-sport race in which 60 competitors complete a swim course, a bike course, and a
run course, in that order.
Factors : gender (2 levels), age group (3 levels) and their interaction (6 combinations)
Research Question :
Whether gender (M/F) or age category (CAT1, CAT2, CAT3) has an effect on the times for the individual sports.
• R Codes (to read the data from CSV file) with Output:
getwd()
## [1] "C:/Users/User/Documents"
data.manova <- read.csv("triathlon.csv", header = TRUE)
data.manova
## GENDER CATEGORY SWIM BIKE RUN
## 1 F CAT1 52 252 145
## 2 F CAT1 44 238 163
## 3 F CAT1 42 196 83
## 4 F CAT1 46 238 179
## 5 F CAT1 42 238 136
## 6 F CAT1 38 203 176
## 7 F CAT1 50 336 95
## 8 F CAT1 40 196 152
## 9 F CAT1 42 238 108
## 10 F CAT1 40 266 132
## 11 F CAT2 34 322 147
## 12 F CAT2 42 238 161
## 13 F CAT2 34 217 173
## 14 F CAT2 36 217 154
## 15 F CAT2 46 252 120
## 16 F CAT2 38 182 143
## 17 F CAT2 32 245 126
## 18 F CAT2 38 231 162
## 19 F CAT2 30 161 150
## 20 F CAT2 28 210 136
## 21 F CAT3 28 182 111
## 22 F CAT3 28 217 119
## 23 F CAT3 32 210 141
## 24 F CAT3 32 238 168
## 25 F CAT3 26 210 171
## 26 F CAT3 26 189 123
## 27 F CAT3 24 147 89
## 28 F CAT3 30 217 140
## 29 F CAT3 28 259 105
## 30 F CAT3 28 203 131
## 31 M CAT1 50 294 103
## 32 M CAT1 48 329 109
## 33 M CAT1 58 357 161
## 34 M CAT1 50 245 87
## 35 M CAT1 52 259 172
## 36 M CAT1 56 308 178
## 37 M CAT1 50 308 152
## 38 M CAT1 48 343 170
## 39 M CAT1 48 301 115
## 40 M CAT1 52 252 123
## 41 M CAT2 36 224 151
## 42 M CAT2 38 273 150
## 43 M CAT2 34 259 133
## 44 M CAT2 34 217 90
## 45 M CAT2 38 252 172
## 46 M CAT2 38 224 80
## 47 M CAT2 34 217 171
## 48 M CAT2 42 287 164
## 49 M CAT2 36 252 166
## 50 M CAT2 40 280 154
## 51 M CAT3 22 196 143
## 52 M CAT3 20 196 167
## 53 M CAT3 20 175 127
## 54 M CAT3 24 154 80
## 55 M CAT3 22 189 152
## 56 M CAT3 24 175 137
## 57 M CAT3 28 231 125
## 58 M CAT3 26 217 156
## 59 M CAT3 24 196 149
## 60 M CAT3 22 161 113
• Formatting data to run MANOVA :
gender <- as.factor(data.manova[,1])
cat <- as.factor(data.manova[,2])
times <- as.matrix(data.manova[,3:5])
head(times)
## SWIM BIKE RUN
## [1,] 52 252 145
## [2,] 44 238 163
## [3,] 42 196 83
## [4,] 46 238 179
## [5,] 42 238 136
## [6,] 38 203 176
• R Code for two way MANOVA considering interaction effect of gender & age :
output <- manova(times~gender*cat)
output
## Call:
## manova(times ~ gender * cat)
##
## Terms:
## gender cat gender:cat Residuals
## resp 1 24.07 4709.20 396.93 738.80
## resp 2 6468.82 51696.63 15093.63 65321.90
## resp 3 2.02 1681.60 212.13 43755.90
## Deg. of Freedom 1 2 2 54
##
## Residual standard errors: 3.698849 34.78024 28.46567
## Estimated effects may be unbalanced
Wilks' lambda test :
summary(output, test="Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## gender 1 0.90547 1.8095 3 52 0.1568890
## cat 2 0.12952 30.8289 6 104 < 2.2e-16 ***
## gender:cat 2 0.62497 4.5923 6 104 0.0003562 ***
## Residuals 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pillai’s trace test ( default in R):
summary(output, test="Pillai")
## Df Pillai approx F num Df den Df Pr(>F)
## gender 1 0.09453 1.8095 3 52 0.15689
## cat 2 0.90048 14.4686 6 106 5.338e-12 ***
## gender:cat 2 0.37636 4.0952 6 106 0.00098 ***
## Residuals 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hotelling-Lawley’s trace test :
summary(output, test="Hotelling")
## Df Hotelling-Lawley approx F num Df den Df Pr(>F)
## gender 1 0.1044 1.810 3 52 0.1568890
## cat 2 6.4889 55.156 6 102 < 2.2e-16 ***
## gender:cat 2 0.5979 5.083 6 102 0.0001335 ***
## Residuals 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
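The three statistics reported above are different summaries of the same hypothesis (H) and error (E) sum-of-squares-and-cross-products matrices. The sketch below illustrates the relationship on a one-way layout. It assumes the built-in `iris` data as a stand-in, since the triathlon CSV file is not reproduced here; the object names are hypothetical.

```r
# Assumption: iris (shipped with R) replaces the triathlon data for illustration.
responses <- as.matrix(iris[, 1:4])
fit <- manova(responses ~ Species, data = iris)

E <- crossprod(residuals(fit))                         # error (within-group) SSCP
total <- crossprod(scale(responses, scale = FALSE))    # total SSCP about the grand mean
H <- total - E                                         # hypothesis (between-group) SSCP

manual <- c(
  Wilks     = det(E) / det(H + E),                     # Wilks' lambda
  Pillai    = sum(diag(H %*% solve(H + E))),           # Pillai's trace
  Hotelling = sum(diag(H %*% solve(E)))                # Hotelling-Lawley trace
)

# Second column of the stats table holds the test statistic.
builtin <- c(
  Wilks     = summary(fit, test = "Wilks")$stats[1, 2],
  Pillai    = summary(fit, test = "Pillai")$stats[1, 2],
  Hotelling = summary(fit, test = "Hotelling-Lawley")$stats[1, 2]
)

print(rbind(manual, builtin))
```

When a term has a single degree of freedom, as gender does above, there is only one non-zero eigenvalue and the three statistics carry the same information, which is why the approximate F for gender is identical (1.8095) in all three tables.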
• R Code with output for separate study for each response :
summary.aov(output)
## Response SWIM :
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 24.1 24.07 1.7591 0.1903
## cat 2 4709.2 2354.60 172.1012 < 2.2e-16 ***
## gender:cat 2 396.9 198.47 14.5062 9.073e-06 ***
## Residuals 54 738.8 13.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response BIKE :
## gender 1 6469 6468.8 5.3476 0.024591 *
## cat 2 51697 25848.3 21.3682 1.458e-07 ***
## gender:cat 2 15094 7546.8 6.2388 0.003651 **
## Residuals 54 65322 1209.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response RUN :
## gender 1 2 2.02 0.0025 0.9604
## cat 2 1682 840.80 1.0376 0.3612
## gender:cat 2 212 106.07 0.1309 0.8776
## Residuals 54 43756 810.29
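• R Code for two way MANOVA without interaction effect of gender & age (note: the call shown uses gender*cat, which still includes the interaction term, so the output repeats the full model; an additive model would be written times~gender+cat) :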
manova(times~gender*cat)
## Call:
## manova(times ~ gender * cat)
##
## Terms:
## gender cat gender:cat Residuals
## resp 1 24.07 4709.20 396.93 738.80
## resp 2 6468.82 51696.63 15093.63 65321.90
## resp 3 2.02 1681.60 212.13 43755.90
## Deg. of Freedom 1 2 2 54
##
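In an R model formula, `*` expands to main effects plus interaction, while `+` gives main effects only. The sketch below shows the difference on simulated data with the same layout as the triathlon design (2 genders x 3 age categories x 10 competitors). The values are random and all object names are hypothetical; it illustrates the formula syntax only and does not reproduce the analysis.

```r
set.seed(1)  # simulated data, for illustration only
gender  <- factor(rep(c("F", "M"), each = 30))
age_cat <- factor(rep(rep(c("CAT1", "CAT2", "CAT3"), each = 10), times = 2))
times_sim <- matrix(rnorm(180, mean = 100, sd = 10), ncol = 3,
                    dimnames = list(NULL, c("SWIM", "BIKE", "RUN")))

fit_full     <- manova(times_sim ~ gender * age_cat)  # main effects + interaction
fit_additive <- manova(times_sim ~ gender + age_cat)  # main effects only

# Terms included in each model
full_terms     <- attr(terms(fit_full), "term.labels")
additive_terms <- attr(terms(fit_additive), "term.labels")
print(full_terms)
print(additive_terms)

# Dropping the 2-df interaction moves those degrees of freedom to the residuals.
print(c(full = fit_full$df.residual, additive = fit_additive$df.residual))
```

With 60 observations the residual degrees of freedom are 54 for the full model and 56 for the additive one.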
• Particular effect of gender :
manova(times~gender)
## Call:
## manova(times ~ gender)
##
## Terms:
## gender Residuals
## resp 1 24.07 5844.93
## resp 2 6468.82 132112.17
## resp 3 2.02 45649.63
## Deg. of Freedom 1 58
##
• Particular effect of age category :
manova(times~cat)
## Call:
## manova(times ~ cat)
##
## Terms:
## cat Residuals
## resp 1 4709.2 1159.8
## resp 2 51696.63 86884.35
## resp 3 1681.60 43970.05
## Deg. of Freedom 2 57
##
4. Principal Component Analysis :
Switzerland, in 1888, was entering a period known as the demographic transition; i.e., its fertility was
beginning to fall from the high level typical of underdeveloped countries. The data were collected for 47
French-speaking “provinces” on the following 6 variables.
Fertility - Common standardized fertility measure
Agriculture - Percentage of males involved in agriculture as occupation
Examination - Percentage of draftees receiving highest mark on army examination
Education - Percentage of draftees with education beyond primary school
Catholic - Percentage of Catholics
Infant.Mortality - Live births who live less than 1 year
Here, all variables are scaled to [0, 100], whereas in the original all but “Catholic” were scaled to [0, 1].
Purpose :
-To reduce the dimensionality of data
-To decrease redundancy
-To identify the variables that work together to create the dynamics of the system
• R Codes (to call the data) with Output:
library(psych)
attach(swiss)
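Graphical Display to explore dependence structure:
The scatter plot matrix of the data reveals moderate to strong correlation among several of the 6 variables
(for example, Examination and Education, r = 0.70), while Infant.Mortality is only weakly correlated with
most of the others.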
plot(swiss, col="blue", pch=20)
[Figure: scatter plot matrix of Fertility, Agriculture, Examination, Education, Catholic and Infant.Mortality]
round(cor(swiss),2)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.00 0.35 -0.65 -0.66 0.46
## Agriculture 0.35 1.00 -0.69 -0.64 0.40
## Examination -0.65 -0.69 1.00 0.70 -0.57
## Education -0.66 -0.64 0.70 1.00 -0.15
## Catholic 0.46 0.40 -0.57 -0.15 1.00
## Infant.Mortality 0.42 -0.06 -0.11 -0.10 0.18
## Infant.Mortality
## Fertility 0.42
## Agriculture -0.06
## Examination -0.11
## Education -0.10
## Catholic 0.18
## Infant.Mortality 1.00
• Principal Components :
swiss_pca <- prcomp(swiss, center = TRUE, scale. = TRUE)
swiss_pca
## Standard deviations:
## [1] 1.7887865 1.0900955 0.9206573 0.6625169 0.4522540 0.3476529
##
## Rotation:
## PC1 PC2 PC3 PC4 PC5
## Fertility -0.4569876 0.3220284 -0.17376638 0.53555794 -0.38308893
## Agriculture -0.4242141 -0.4115132 0.03834472 -0.64291822 -0.37495215
## Examination 0.5097327 0.1250167 -0.09123696 -0.05446158 -0.81429082
## Education 0.4543119 0.1790495 0.53239316 -0.09738818 0.07144564
## Catholic -0.3501111 0.1458730 0.80680494 0.09947244 -0.18317236
## Infant.Mortality -0.1496668 0.8111645 -0.16010636 -0.52677184 0.10453530
## PC6
## Fertility 0.47295441
## Agriculture 0.30870058
## Examination -0.22401686
## Education 0.68081610
## Catholic -0.40219666
## Infant.Mortality -0.07457754
-Center and scale refer to subtracting each variable's mean and dividing by its standard deviation, the
normalization applied before implementing PCA.
-The rotation matrix provides the principal component loadings. Each column of the rotation matrix contains
the loading vector of one principal component.
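Because the variables are centered and scaled, this PCA is equivalent to an eigen-decomposition of the correlation matrix. The following minimal sketch checks that equivalence on the built-in `swiss` data; the object names `pca`, `eig` and `scores` are introduced here for illustration.

```r
pca <- prcomp(swiss, center = TRUE, scale. = TRUE)
eig <- eigen(cor(swiss))

# Variances of the PCs equal the eigenvalues of the correlation matrix.
same_variances <- all.equal(pca$sdev^2, eig$values)
print(same_variances)

# Loadings equal the eigenvectors up to the sign of each column.
max_loading_gap <- max(abs(abs(pca$rotation) - abs(eig$vectors)))
print(max_loading_gap)

# Scores are the standardized data projected onto the loading vectors.
scores <- scale(swiss) %*% pca$rotation
same_scores <- all.equal(scores, pca$x, check.attributes = FALSE)
print(same_scores)
```

The sign of a loading vector is arbitrary, so only absolute values are compared; a column of the rotation matrix may appear with all signs flipped on another platform without changing the interpretation.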
• Selection of Components :
The summary method describes the importance of the PCs. The first row gives the standard deviation
associated with each PC. The second row shows the proportion of the variance in the data explained by each
component, while the third row gives the cumulative proportion of explained variance.
summary(swiss_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.7888 1.0901 0.9207 0.66252 0.45225 0.34765
## Proportion of Variance 0.5333 0.1981 0.1413 0.07315 0.03409 0.02014
## Cumulative Proportion 0.5333 0.7313 0.8726 0.94577 0.97986 1.00000
We can see that the first two principal components account for more than 70% of the variance of the
data (73.1%) and, with the third principal component included, about 87% of the variability is explained.
• Graphical selection by Screeplot:
The plot method returns a plot of the variances (y-axis) associated with the PCs (x-axis). The figure below
is useful for deciding how many PCs to retain for further analysis.
An eigenvalue > 1 indicates that a PC accounts for more variance than one of the original
variables in the standardized data. This is commonly used as a cutoff for retaining PCs.
In this case, the first two PCs have eigenvalues above 1 and together explain most (about 73%) of the
variability in the data.
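The proportions of variance and the eigenvalue cutoff can be derived directly from the standard deviations returned by `prcomp`. The sketch below is one way to do this in base R on the built-in `swiss` data; the object names are introduced here for illustration.

```r
pca <- prcomp(swiss, center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2                     # variance of each PC
prop_var    <- eigenvalues / sum(eigenvalues) # proportion of variance
cum_var     <- cumsum(prop_var)               # cumulative proportion

print(round(rbind(eigenvalues, prop_var, cum_var), 4))

# Eigenvalue-greater-than-one rule: keep PCs whose variance exceeds that of
# a single standardized variable.
retained <- which(eigenvalues > 1)
print(retained)
```

With six standardized variables the eigenvalues sum to 6, so the cutoff of 1 corresponds to a PC explaining more than 1/6 of the total variance; here it keeps the first two components.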
library("factoextra")
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
eig.val <- get_eigenvalue(swiss_pca)
head(eig.val)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.1997570 53.329283 53.32928
## Dim.2 1.1883082 19.805137 73.13442
## Dim.3 0.8476098 14.126830 87.26125
## Dim.4 0.4389287 7.315478 94.57673
## Dim.5 0.2045337 3.408895 97.98562
## Dim.6 0.1208626 2.014376 100.00000
fviz_screeplot(swiss_pca,ncp=6, choice="eigenvalue")
[Figure: scree plot; x-axis: Dimensions (1 to 6), y-axis: Eigenvalue]
Report :
We select the variables whose contribution to the principal components is substantial (absolute loading
above 0.5). Selected variables with their loadings :
• In PC1 “Examination (0.51)”
• In PC2 “Infant.Mortality (0.81)”
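The selection rule can be applied programmatically. The sketch below is a minimal illustration, assuming the 0.5 cutoff on the absolute loading and the first two PCs only; `threshold` and `dominant` are hypothetical names.

```r
pca <- prcomp(swiss, center = TRUE, scale. = TRUE)
threshold <- 0.5   # cutoff on the absolute loading, as used in the report

loadings <- pca$rotation[, c("PC1", "PC2")]

# For each retained PC, keep the variables whose absolute loading exceeds the cutoff.
dominant <- lapply(colnames(loadings), function(pc) {
  v <- loadings[, pc]
  round(v[abs(v) > threshold], 2)
})
names(dominant) <- colnames(loadings)
print(dominant)
```

Only the magnitude of a loading is compared with the cutoff, because the sign of a whole loading vector is arbitrary and may differ between platforms.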