Multivariate Analysis using R
Dipika Patra
21 July 2016
1. One Sample Hotelling T-square test:
Descriptive Aspect:
Changes in the pulmonary response of 12 workers after 6 hours of exposure to cotton dust, measured by the
following 3 variables.
FVC - Changes in Forced Vital Capacity (FVC) after 6 hours
FEV - Changes in Forced Expiratory Volume (FEV) after 6 hours
CC - Changes in Closing Capacity (CC) after 6 hours
Inferential Aspect:
Hypothesis 1:
The null hypothesis: NO CHANGE IN PULMONARY FUNCTION against
The alternative hypothesis: CHANGE IN PULMONARY FUNCTION
To test the hypothesis we use the one-sample Hotelling T-square statistic, introduced by Hotelling (1931).
Its distribution is indexed by the dimension p = 3 and the sample size n = 12; after scaling it follows an F
distribution with p = 3 and n - p = 12 - 3 = 9 degrees of freedom.
We reject the null hypothesis if the calculated statistic is greater than the tabulated value with these degrees
of freedom at the 5% level of significance or, in terms of the p-value, if the p-value is less than 0.05.
• R Codes (to call the data) with Output:
library(ICSNP)
## Loading required package: mvtnorm
## Loading required package: ICS
data(pulmonary)
pulmonary
## FVC FEV CC
## 1 -0.11 -0.12 -4.3
## 2 0.02 0.08 4.4
## 3 -0.02 0.03 7.5
## 4 0.07 0.19 -0.3
## 5 -0.16 -0.36 -5.8
## 6 -0.42 -0.49 14.5
## 7 -0.32 -0.48 -1.9
## 8 -0.35 -0.30 17.3
## 9 -0.10 -0.04 2.5
## 10 0.01 -0.02 -5.6
## 11 -0.10 -0.17 2.2
## 12 -0.26 -0.30 5.5
• R Codes (to test the Hypothesis 1) with Output:
attach(pulmonary)
HotellingsT2(pulmonary, mu = c(0,0,0))
##
## Hotelling's one sample T2-test
##
## data: pulmonary
## T.2 = 3.8231, df1 = 3, df2 = 9, p-value = 0.05123
## alternative hypothesis: true location is not equal to c(0,0,0)
Conclusion:
Since the p-value (0.05123) is greater than 0.05, we fail to reject the null hypothesis at the 5% level of
significance. That is, based on the given sample, there is no significant change in pulmonary function.
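The statistic reported above can be reproduced by hand in base R. The following is a minimal sketch, assuming the data are exactly the values printed earlier; the helper name `hotelling_one_sample` is hypothetical and is not part of ICSNP. It computes T-square, rescales it to the F statistic with p and n - p degrees of freedom, and returns the p-value.

```r
# Pulmonary data as printed above (12 workers, 3 responses).
pulmonary <- data.frame(
  FVC = c(-0.11, 0.02, -0.02, 0.07, -0.16, -0.42, -0.32, -0.35, -0.10, 0.01, -0.10, -0.26),
  FEV = c(-0.12, 0.08, 0.03, 0.19, -0.36, -0.49, -0.48, -0.30, -0.04, -0.02, -0.17, -0.30),
  CC  = c(-4.3, 4.4, 7.5, -0.3, -5.8, 14.5, -1.9, 17.3, 2.5, -5.6, 2.2, 5.5)
)

# Hypothetical helper (not from ICSNP): one-sample Hotelling T-square
# together with its F-scaled version and p-value.
hotelling_one_sample <- function(x, mu) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  d <- colMeans(x) - mu                          # sample mean minus hypothesised mean
  t2 <- n * drop(t(d) %*% solve(cov(x)) %*% d)   # Hotelling T-square
  f <- t2 * (n - p) / (p * (n - 1))              # F-scaled statistic
  list(T2 = t2, F = f, df1 = p, df2 = n - p,
       p.value = pf(f, p, n - p, lower.tail = FALSE))
}

res1 <- hotelling_one_sample(pulmonary, mu = c(0, 0, 0))
print(res1)
```

The `F` element corresponds to the value labelled `T.2` in the output above, which suggests that the reported statistic is already on the F scale. Calling the helper with `mu = c(0, 0, 2)` corresponds to Hypothesis 2.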
Hypothesis 2:
The null hypothesis: CHANGE ONLY IN CLOSING CAPACITY, BY 2 UNITS against
The alternative hypothesis: OTHER THAN NULL HYPOTHESIS
To test this hypothesis the following code is used.
• R Codes (to test the Hypothesis 2) with Output:
HotellingsT2(pulmonary, mu = c(0,0,2))
##
## Hotelling's one sample T2-test
##
## data: pulmonary
## T.2 = 6.6204, df1 = 3, df2 = 9, p-value = 0.01178
## alternative hypothesis: true location is not equal to c(0,0,2)
Conclusion:
Since the p-value (0.01178) is greater than 0.01, we fail to reject the null hypothesis at the 1% level of
significance (at the 5% level used for Hypothesis 1 it would be rejected). That is, at the 1% level the sample is
consistent with a change of 2 units in closing capacity and no change in FVC or FEV.
2. Two Sample Hotelling T-square test:
Generating Data Using R code:
Consider two teachers rated by two independent groups of 3 and 6 students respectively, on satisfaction
level (scale of 6) and knowledge level (scale of 10):
# The knowledge values were truncated in the source line; they are completed from the printed output below.
math.teach <- data.frame(teacher = factor(rep(1:2, c(3, 6))),
                         satisfaction = c(1, 3, 2, 4, 6, 6, 5, 5, 4),
                         knowledge = c(3, 7, 2, 6, 8, 8, 10, 10, 6))
math.teach
## teacher satisfaction knowledge
## 1 1 1 3
## 2 1 3 7
## 3 1 2 2
## 4 2 4 6
## 5 2 6 8
## 6 2 6 8
## 7 2 5 10
## 8 2 5 10
## 9 2 4 6
Graphical Display (Multiple Boxplot):
[Figure: boxplots of Teacher Knowledge and Teacher Satisfaction]
Hypothesis 1:
Consider testing the null hypothesis that the two groups have identical mean vectors, against the general
alternative that the mean vectors are not equal.
The null hypothesis: NO DIFFERENCE IN RATING BETWEEN TWO GROUPS against
The alternative hypothesis: CHANGE IN RATING BETWEEN TWO GROUPS
To test the hypothesis we use the two-sample Hotelling T-square statistic:
$$T^2 = (\bar{X}_1 - \bar{X}_2)' \, S_p^{-1} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)^{-1} (\bar{X}_1 - \bar{X}_2)$$
where $S_p$ is the pooled variance-covariance matrix, and
$$T^2 \sim \frac{p(n_1 + n_2 - 2)}{n_1 + n_2 - p - 1} \, F(p,\; n_1 + n_2 - p - 1)$$
The distribution is indexed by two parameters, the dimension p = 2 and the degrees of freedom
v = n1 + n2 - p - 1 = 3 + 6 - 2 - 1 = 6.
We reject the null hypothesis if the calculated statistic is greater than the tabulated value with p = 2, v = 6
at the 1% level of significance or, in terms of the p-value, if the p-value is less than 0.01.
• R Codes (to test the Hypothesis 1) with Output:
attach(math.teach)
HotellingsT2(cbind(satisfaction, knowledge) ~
teacher, mu=c(0,0))
##
## Hotelling's two sample T2-test
##
## data: cbind(satisfaction, knowledge) by teacher
## T.2 = 9, df1 = 2, df2 = 6, p-value = 0.01562
## alternative hypothesis: true location difference is not equal to c(0,0)
Conclusion:
Since the p-value (0.01562) is greater than 0.01, we fail to reject the null hypothesis at the 1% level of
significance. That is, based on the given sample, there is no significant difference in rating between the two
groups.
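The two-sample statistic and its F scaling can also be worked out step by step in base R. This is a minimal sketch, assuming the ratings printed in math.teach above; the variable names (`sp`, `delta0`, `t2`, `f_stat`, `p_val`) are illustrative only.

```r
# Ratings as printed in math.teach above.
satisfaction <- c(1, 3, 2, 4, 6, 6, 5, 5, 4)
knowledge    <- c(3, 7, 2, 6, 8, 8, 10, 10, 6)
teacher      <- factor(rep(1:2, c(3, 6)))

ratings <- cbind(satisfaction, knowledge)
x1 <- ratings[teacher == 1, ]
x2 <- ratings[teacher == 2, ]
n1 <- nrow(x1)
n2 <- nrow(x2)
p  <- ncol(ratings)

# Pooled variance-covariance matrix S_p.
sp <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)

# T-square for H0: mean difference (group 1 minus group 2) equals delta0.
delta0 <- c(0, 0)
d  <- colMeans(x1) - colMeans(x2) - delta0
t2 <- drop(t(d) %*% solve(sp * (1 / n1 + 1 / n2)) %*% d)

# Rescale to an F statistic with p and n1 + n2 - p - 1 degrees of freedom.
df2    <- n1 + n2 - p - 1
f_stat <- t2 * df2 / (p * (n1 + n2 - 2))
p_val  <- pf(f_stat, p, df2, lower.tail = FALSE)
c(T2 = t2, F = f_stat, df1 = p, df2 = df2, p.value = p_val)
```

Here `f_stat` plays the role of the value labelled `T.2` in the output above. Changing `delta0` to another vector, such as `c(-1, 1)`, tests a specific non-zero difference between the two mean vectors.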
Hypothesis 2:
Consider testing the null hypothesis that the difference between the mean vectors of the two groups is
(-1, 1), against the general alternative that the difference is anything else.
The null hypothesis: GIVEN SPECIFIC CHANGE IN RATING BETWEEN TWO GROUPS against
The alternative hypothesis: CHANGE IN RATING BETWEEN TWO GROUPS OTHER THAN THE GIVEN
To test the hypothesis we use the two-sample Hotelling T-square statistic.
• R Codes (to test the Hypothesis 2) with Output:
HotellingsT2(cbind(satisfaction, knowledge) ~teacher, mu=c(-1,1))
##
## Hotelling's two sample T2-test
##
## data: cbind(satisfaction, knowledge) by teacher
## T.2 = 5.6897, df1 = 2, df2 = 6, p-value = 0.04115
## alternative hypothesis: true location difference is not equal to c(-1,1)
Conclusion:
Since the p-value (0.04115) is less than 0.05, we reject the null hypothesis at the 5% level of significance.
That is, based on the given sample, the difference in mean rating between the two groups differs significantly
from (-1, 1)'.
Hypothesis 3:
Consider testing the null hypothesis that the difference between the mean vectors of the two groups is
(1, 1), against the general alternative that the difference is anything else.
The null hypothesis: GIVEN SPECIFIC CHANGE IN RATING BETWEEN TWO GROUPS against
The alternative hypothesis: CHANGE IN RATING BETWEEN TWO GROUPS OTHER THAN THE GIVEN
To test the hypothesis we use the two-sample Hotelling T-square statistic.
• R Codes (to test the Hypothesis 3) with Output:
HotellingsT2(cbind(satisfaction, knowledge) ~teacher, mu=c(1,1))
##
## Hotelling's two sample T2-test
##
## data: cbind(satisfaction, knowledge) by teacher
## T.2 = 16.034, df1 = 2, df2 = 6, p-value = 0.003915
## alternative hypothesis: true location difference is not equal to c(1,1)
Conclusion:
Since the p-value (0.003915) is less than 0.01, we reject the null hypothesis at the 1% level of significance.
That is, based on the given sample, the difference in mean rating between the two groups differs significantly
from (1, 1)'.
Hypothesis 4:
Consider testing the null hypothesis that the difference between the mean vectors of the two groups is
(2, 2), against the general alternative that the difference is anything else.
The null hypothesis: GIVEN SPECIFIC CHANGE IN RATING BETWEEN TWO GROUPS against
The alternative hypothesis: CHANGE IN RATING BETWEEN TWO GROUPS OTHER THAN THE GIVEN
To test the hypothesis we use the two-sample Hotelling T-square statistic.
• R Codes (to test the Hypothesis 4) with Output:
HotellingsT2(cbind(satisfaction, knowledge) ~teacher, mu=c(2,2))
##
## Hotelling's two sample T2-test
##
## data: cbind(satisfaction, knowledge) by teacher
## T.2 = 25.138, df1 = 2, df2 = 6, p-value = 0.001212
## alternative hypothesis: true location difference is not equal to c(2,2)
Conclusion:
Since the p-value (0.001212) is less than 0.01, we reject the null hypothesis at the 1% level of significance.
That is, based on the given sample, the difference in mean rating between the two groups differs significantly
from (2, 2)'.
3. Two Way MANOVA (two factors):
Descriptive Aspect:
Triathlon performance: a multi-sport race in which 60 competitors complete a swim course, a bike course, and
a run course, in that order.
Factors: gender (2 levels), age group (3 levels) and their interaction (6 combinations)
Research Question:
Does gender (M/F) or age category (CAT1, CAT2, CAT3) have an effect on the times for the individual sports?
• R Codes (to read the data from CSV file) with Output:
getwd()
## [1] "C:/Users/User/Documents"
data.manova<-read.csv("triathlon.csv",header = T)
data.manova
## GENDER CATEGORY SWIM BIKE RUN
## 1 F CAT1 52 252 145
## 2 F CAT1 44 238 163
## 3 F CAT1 42 196 83
## 4 F CAT1 46 238 179
## 5 F CAT1 42 238 136
## 6 F CAT1 38 203 176
## 7 F CAT1 50 336 95
## 8 F CAT1 40 196 152
## 9 F CAT1 42 238 108
## 10 F CAT1 40 266 132
## 11 F CAT2 34 322 147
## 12 F CAT2 42 238 161
## 13 F CAT2 34 217 173
## 14 F CAT2 36 217 154
## 15 F CAT2 46 252 120
## 16 F CAT2 38 182 143
## 17 F CAT2 32 245 126
## 18 F CAT2 38 231 162
## 19 F CAT2 30 161 150
## 20 F CAT2 28 210 136
## 21 F CAT3 28 182 111
## 22 F CAT3 28 217 119
## 23 F CAT3 32 210 141
## 24 F CAT3 32 238 168
## 25 F CAT3 26 210 171
## 26 F CAT3 26 189 123
## 27 F CAT3 24 147 89
## 28 F CAT3 30 217 140
## 29 F CAT3 28 259 105
## 30 F CAT3 28 203 131
## 31 M CAT1 50 294 103
## 32 M CAT1 48 329 109
## 33 M CAT1 58 357 161
## 34 M CAT1 50 245 87
## 35 M CAT1 52 259 172
## 36 M CAT1 56 308 178
## 37 M CAT1 50 308 152
## 38 M CAT1 48 343 170
## 39 M CAT1 48 301 115
## 40 M CAT1 52 252 123
## 41 M CAT2 36 224 151
## 42 M CAT2 38 273 150
## 43 M CAT2 34 259 133
## 44 M CAT2 34 217 90
## 45 M CAT2 38 252 172
## 46 M CAT2 38 224 80
## 47 M CAT2 34 217 171
## 48 M CAT2 42 287 164
## 49 M CAT2 36 252 166
## 50 M CAT2 40 280 154
## 51 M CAT3 22 196 143
## 52 M CAT3 20 196 167
## 53 M CAT3 20 175 127
## 54 M CAT3 24 154 80
## 55 M CAT3 22 189 152
## 56 M CAT3 24 175 137
## 57 M CAT3 28 231 125
## 58 M CAT3 26 217 156
## 59 M CAT3 24 196 149
## 60 M CAT3 22 161 113
• Formatting data to run MANOVA:
gender <- as.factor(data.manova[,1])
cat <- as.factor(data.manova[,2])
times <- as.matrix(data.manova[,3:5])
head(times)
## SWIM BIKE RUN
## [1,] 52 252 145
## [2,] 44 238 163
## [3,] 42 196 83
## [4,] 46 238 179
## [5,] 42 238 136
## [6,] 38 203 176
• R Code for two-way MANOVA considering the interaction effect of gender & age:
output <- manova(times~gender*cat)
output
## Call:
## manova(times ~ gender * cat)
##
## Terms:
## gender cat gender:cat Residuals
## resp 1 24.07 4709.20 396.93 738.80
## resp 2 6468.82 51696.63 15093.63 65321.90
## resp 3 2.02 1681.60 212.13 43755.90
## Deg. of Freedom 1 2 2 54
##
## Residual standard errors: 3.698849 34.78024 28.46567
## Estimated effects may be unbalanced
Wilks' lambda test:
summary(output, test="Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## gender 1 0.90547 1.8095 3 52 0.1568890
## cat 2 0.12952 30.8289 6 104 < 2.2e-16 ***
## gender:cat 2 0.62497 4.5923 6 104 0.0003562 ***
## Residuals 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pillai's trace test (default in R):
summary(output, test="Pillai")
## Df Pillai approx F num Df den Df Pr(>F)
## gender 1 0.09453 1.8095 3 52 0.15689
## cat 2 0.90048 14.4686 6 106 5.338e-12 ***
## gender:cat 2 0.37636 4.0952 6 106 0.00098 ***
## Residuals 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hotelling-Lawley trace test:
summary(output, test="Hotelling")
## Df Hotelling-Lawley approx F num Df den Df Pr(>F)
## gender 1 0.1044 1.810 3 52 0.1568890
## cat 2 6.4889 55.156 6 102 < 2.2e-16 ***
## gender:cat 2 0.5979 5.083 6 102 0.0001335 ***
## Residuals 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
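For the gender effect the three criteria give the same F value and p-value. One way to see why: an effect with 1 degree of freedom has a single non-zero eigenvalue, so Wilks' lambda, Pillai's trace and the Hotelling-Lawley trace are all functions of that one number. The sketch below starts from the rounded Hotelling-Lawley value printed above; the variable names are illustrative, and small rounding differences from the printed tables are to be expected.

```r
# Hotelling-Lawley trace for gender, as printed above (rounded to 4 decimals).
lambda <- 0.1044

# With a single non-zero eigenvalue, the other two criteria follow directly.
wilks  <- 1 / (1 + lambda)
pillai <- lambda / (1 + lambda)

# F statistic for a 1-df effect: num Df = 3 responses, den Df = 52 (from the tables above).
f_stat <- lambda * 52 / 3
p_val  <- pf(f_stat, 3, 52, lower.tail = FALSE)

round(c(Wilks = wilks, Pillai = pillai, F = f_stat, p.value = p_val), 4)
```

For cat and gender:cat (2 degrees of freedom each) there are two non-zero eigenvalues, which is why the three criteria give different approximate F values there.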
• R Code with output for a separate study of each response:
summary.aov(output)
## Response SWIM :
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 24.1 24.07 1.7591 0.1903
## cat 2 4709.2 2354.60 172.1012 < 2.2e-16 ***
## gender:cat 2 396.9 198.47 14.5062 9.073e-06 ***
## Residuals 54 738.8 13.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response BIKE :
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 6469 6468.8 5.3476 0.024591 *
## cat 2 51697 25848.3 21.3682 1.458e-07 ***
## gender:cat 2 15094 7546.8 6.2388 0.003651 **
## Residuals 54 65322 1209.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response RUN :
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 2 2.02 0.0025 0.9604
## cat 2 1682 840.80 1.0376 0.3612
## gender:cat 2 212 106.07 0.1309 0.8776
## Residuals 54 43756 810.29
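• R Code for two-way MANOVA without the interaction effect of gender & age:
# An additive (no-interaction) model would be specified as manova(times ~ gender + cat).
# The call below, as in the original, uses `*`, so its output still includes the gender:cat term.
manova(times~gender*cat)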
## Call:
## manova(times ~ gender * cat)
##
## Terms:
## gender cat gender:cat Residuals
## resp 1 24.07 4709.20 396.93 738.80
## resp 2 6468.82 51696.63 15093.63 65321.90
## resp 3 2.02 1681.60 212.13 43755.90
## Deg. of Freedom 1 2 2 54
##
## Residual standard errors: 3.698849 34.78024 28.46567
## Estimated effects may be unbalanced
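Because `manova` reports sequential sums of squares, the residual terms of an additive model `times ~ gender + cat` can be derived from the table above without refitting: the gender and cat rows stay the same, and the interaction sums of squares and their 2 degrees of freedom are pooled into the residuals. The sketch below does that arithmetic from the printed figures; it is a derived illustration, not output from the original analysis.

```r
# Sums of squares printed above for the three responses (interaction model).
ss_interaction <- c(SWIM = 396.93, BIKE = 15093.63, RUN = 212.13)
ss_residual    <- c(SWIM = 738.80, BIKE = 65321.90, RUN = 43755.90)

# Additive model: interaction SS and its 2 df move into the residuals.
ss_resid_additive <- ss_interaction + ss_residual
df_resid_additive <- 54 + 2

# Residual standard errors implied for the additive model.
rse_additive <- sqrt(ss_resid_additive / df_resid_additive)

rbind(SS = ss_resid_additive, RSE = rse_additive)
```

The same bookkeeping explains the one-factor fits: dropping further terms moves their sums of squares and degrees of freedom into the residual row.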
• Particular effect of gender:
manova(times~gender)
## Call:
## manova(times ~ gender)
##
## Terms:
## gender Residuals
## resp 1 24.07 5844.93
## resp 2 6468.82 132112.17
## resp 3 2.02 45649.63
## Deg. of Freedom 1 58
##
## Residual standard errors: 10.03866 47.72626 28.05464
## Estimated effects may be unbalanced
• Particular effect of age category:
manova(times~cat)
## Call:
## manova(times ~ cat)
##
## Terms:
## cat Residuals
## resp 1 4709.2 1159.8
## resp 2 51696.63 86884.35
## resp 3 1681.60 43970.05
## Deg. of Freedom 2 57
##
## Residual standard errors: 4.510806 39.04212 27.77417
## Estimated effects may be unbalanced
4. Principal Component Analysis :
Descriptive Aspect:
Switzerland, in 1888, was entering a period known as the demographic transition; i.e., its fertility was
beginning to fall from the high level typical of underdeveloped countries. The data were collected for 47
French-speaking "provinces" on the following 6 variables.
Fertility - Common standardized fertility measure
Agriculture - Percentage of males involved in agriculture as occupation
Examination - Percentage of draftees receiving highest mark on army examination
Education - Percentage of draftees with education beyond primary school
Catholic - Percentage of Catholics
Infant.Mortality - Live births who live less than 1 year
Here, all variables are scaled to [0, 100], whereas in the original all but "Catholic" were scaled to [0, 1].
Purpose:
-To reduce the dimensionality of the data
-To decrease redundancy
-To identify the variables that work together to create the dynamics of the system
• R Codes (to call the data) with Output:
library(psych)
attach(swiss)
swiss_pca <- prcomp(swiss, center = TRUE, scale. = TRUE)
swiss_pca
## Standard deviations:
## [1] 1.7887865 1.0900955 0.9206573 0.6625169 0.4522540 0.3476529
##
## Rotation:
## PC1 PC2 PC3 PC4 PC5
## Fertility -0.4569876 0.3220284 -0.17376638 0.53555794 -0.38308893
## Agriculture -0.4242141 -0.4115132 0.03834472 -0.64291822 -0.37495215
## Examination 0.5097327 0.1250167 -0.09123696 -0.05446158 -0.81429082
## Education 0.4543119 0.1790495 0.53239316 -0.09738818 0.07144564
## Catholic -0.3501111 0.1458730 0.80680494 0.09947244 -0.18317236
## Infant.Mortality -0.1496668 0.8111645 -0.16010636 -0.52677184 0.10453530
## PC6
## Fertility 0.47295441
## Agriculture 0.30870058
## Examination -0.22401686
## Education 0.68081610
## Catholic -0.40219666
## Infant.Mortality -0.07457754
-center and scale refer to subtracting each variable's mean and dividing by its standard deviation, so that
the variables are normalized before PCA is applied.
-The rotation matrix provides the principal component loadings. Each column of the rotation matrix contains
the loading vector of one principal component.
• Selection of Components:
The summary method describes the importance of the PCs. The first row gives the standard deviation
associated with each PC. The second row shows the proportion of the variance in the data explained by each
component, while the third row gives the cumulative proportion of explained variance.
summary(swiss_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.7888 1.0901 0.9207 0.66252 0.45225 0.34765
## Proportion of Variance 0.5333 0.1981 0.1413 0.07315 0.03409 0.02014
## Cumulative Proportion 0.5333 0.7313 0.8726 0.94577 0.97986 1.00000
We can see that the first two principal components account for more than 70% of the variance of the data
(73.13%), and that including the third principal component explains more than 85% (87.26%).
• Graphical selection by Screeplot:
The plot method returns a plot of the variances (y-axis) associated with the PCs (x-axis). This plot
is useful for deciding how many PCs to retain for further analysis.
An eigenvalue > 1 indicates that a PC accounts for more variance than one of the original
variables in standardized data. This is commonly used as a cutoff point for deciding which PCs are retained.
In this case, we can see that the first two PCs explain most of the variability in the data.
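The proportions in the summary table and the eigenvalue cutoff can both be recovered from the standard deviations alone. The following is a minimal sketch, assuming the standard deviations printed by prcomp above; the variable names are illustrative.

```r
# Standard deviations of the six PCs, as printed by prcomp above.
sdev <- c(1.7887865, 1.0900955, 0.9206573, 0.6625169, 0.4522540, 0.3476529)

# Eigenvalues of the correlation matrix are the squared standard deviations.
eigenvalues <- sdev^2
prop_var    <- eigenvalues / sum(eigenvalues)
cum_prop    <- cumsum(prop_var)

# Eigenvalue > 1 cutoff: indices of the PCs that would be retained.
keep <- which(eigenvalues > 1)

round(rbind(eigenvalues, prop_var, cum_prop), 4)
keep
```

Because the data are standardized, the eigenvalues sum to the number of variables (6), so an eigenvalue of 1 corresponds to the variance of a single original variable.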
Report:
We select the responsible variables, whose contribution to the principal components is substantial (with loading
beyond ± 0.5). Responsible variables with their loadings:
• In PC1: Examination (0.51)
• In PC2: Infant.Mortality (0.81)
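The selection rule above can be applied mechanically to the loadings. This is a minimal sketch, assuming the PC1 and PC2 columns of the rotation matrix printed earlier; the object names `loadings` and `strong` are illustrative.

```r
# Loadings for PC1 and PC2, copied from the rotation matrix printed above.
loadings <- cbind(
  PC1 = c(-0.4569876, -0.4242141, 0.5097327, 0.4543119, -0.3501111, -0.1496668),
  PC2 = c(0.3220284, -0.4115132, 0.1250167, 0.1790495, 0.1458730, 0.8111645)
)
rownames(loadings) <- c("Fertility", "Agriculture", "Examination",
                        "Education", "Catholic", "Infant.Mortality")

# For each component, keep the variables whose absolute loading exceeds 0.5.
strong <- lapply(colnames(loadings), function(pc) {
  col <- loadings[, pc]
  round(col[abs(col) > 0.5], 2)
})
names(strong) <- colnames(loadings)
strong
```

With a real fit, the same filter could be applied to `swiss_pca$rotation` instead of the hand-copied matrix.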