Measurements to Predict Gender
Roopan Verma
February 10, 2016
Table of Contents
Executive Summary
Background, Motivation, Statement of Problem and Objectives
Data Collection and Explanation
Statistical Analysis
Evaluation of Results
Appendix I: Variable Description
Appendix II: R Code
Appendix III: Other Models
Executive Summary
Measurements were taken of over 500 individuals with the goal of correctly predicting
each person's gender. Twenty-four measurements, mostly diameters and girths along with
age, weight, and height, were recorded for each individual. The raw data was first
grouped by gender to visualize the male and female distributions and how they compare
to one another. A logistic regression model was then built to predict the gender of a
given validation set. Because the validation set carries no gender labels, the labeled
training data was split into a training set and a test set to estimate how accurately
the model predicts. The model predicted extremely well, with 98.52% accuracy on the
test set. This result was not surprising, since the variables in the data set were
expected to make it fairly easy to determine which individuals were male and which
were female.
Background, Motivation, Statement of Problem and Objectives
A set of 24 measurements was collected on over 500 males and females. The individuals
selected were physically active, mostly in their 20s and 30s, with some older
participants. Based on these measurements, the primary objective was to predict the
gender of a certain number of individuals. The measurements include skeletal diameters
and girths of the waist, chest, and other body sites, along with age, weight, and
height. Besides the main objective of gender prediction, other tasks were to be
accomplished, such as converting the data to English units and obtaining descriptive
statistics of the data such as the mean and coefficient of variation.
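For reference, the conversions were handled with conv_unit() from the measurements package (the full conversion code is in Appendix II); a representative pair of calls, assuming the data frame is named gender as in the appendix:

library(measurements)

# height from centimeters to inches, weight from kilograms to pounds
gender$Hgt <- round(conv_unit(gender$Hgt, "cm", "inch"), 2)
gender$Wgt <- round(conv_unit(gender$Wgt, "kg", "lbs"), 2)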
Data Collection and Explanation
Twenty-four measurements were taken of each individual, consisting of skeletal
(diameter) measurements, girth measurements, and other variables including age, weight,
and height (full descriptions of all variables are in Appendix I). Gender, the
dependent variable, was recorded for only two-thirds of the individuals; the remaining
one-third of the data set was used as a validation set.
Table 1: The first six observations of the complete data set. The NAs in the data frame
correspond to the validation set, which required the data set to be split.
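A minimal sketch of how the labeled and unlabeled rows can be separated (the splitting code itself is not shown in Appendix II; gender.train is the name the appendix uses for the labeled portion, and gender.valid is a hypothetical name for the rest):

# rows with a recorded gender form the training data;
# rows with NA in Gend form the validation set to be predicted
gender.train <- gender[!is.na(gender$Gend), ]
gender.valid <- gender[is.na(gender$Gend), ]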
Since the purpose of the analysis is to distinguish male from female, the mean and
percent coefficient of variation of each variable were computed separately by gender.
This gives a better picture of the differences between males and females.
Mean:
Coefficient of Variation:
Table 2: The mean and percent coefficient of variation of each variable for males and females in the data set.
Grouping the variables by gender showed that some variables have clearly different
distributions for the two genders, while others look similar. Interestingly, the
coefficients of variation were similar for the two genders, with only small differences
between males and females. The variation values themselves were generally low, except
for age, which was expected given the wide range of ages of the individuals measured.
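The grouped summaries in Table 2 were produced with aggregate() (see Appendix II). The co.var() helper used there is not defined in the appendix, so the definition below is an assumption of the usual percent coefficient of variation:

# percent coefficient of variation: standard deviation as a percentage of the mean
co.var <- function(x) 100 * sd(x) / mean(x)

# grouped means and percent CVs by gender, as in Appendix II
gender.mean <- round(aggregate(gender[, 1:24], list(gender$Gend), mean), 1)
gender.cov <- round(aggregate(gender[, 1:24], list(gender$Gend), co.var), 1)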
Plot 1: Histograms of four variables grouped by gender.
The above histograms show that both biacromial diameter and chest depth differ clearly
between males and females, with the male average higher than the female average. The
bitrochanteric and biiliac diameters, however, show very little difference in
distribution between males and females.
[Plot 1 panels: counts by gender for Biacromial Diameter, Biiliac Diameter, Bitrochanteric Diameter, and Chest Depth, all in inches.]
Plot 2: Histograms of four variables grouped by gender.
Shoulder girth, chest girth, and waist girth all present clear differences in their
distributions when grouped by gender; the histograms indicate that males have larger
girths than females for these variables.
[Plot 2 panels: counts by gender for Ankle Diameter, Shoulder Girth, Chest Girth, and Waist Girth, all in inches.]
Plot 3: Histograms of four variables grouped by gender.
There is a clear difference in chest diameter between males and females, while the
other three histograms in Plot 3 show very little difference in distribution between
males and females.
[Plot 3 panels: counts by gender for Chest Diameter, Knee Diameter, Elbow Diameter, and Wrist Diameter, all in inches.]
Plot 4: Histograms of four variables grouped by gender.
Of the four histograms above, the only variable that shows a clear difference in
distribution is bicep girth.
[Plot 4 panels: counts by gender for Navel Girth, Hip Girth, Thigh Girth, and Bicep Girth, all in inches.]
Plot 5: Histograms of four variables grouped by gender.
In Plot 5, forearm girth shows a clear difference, with males having a larger girth
than females. The other three histograms show a slight difference in favor of males,
but it is not as pronounced as it is for forearm girth.
[Plot 5 panels: counts by gender for Forearm Girth, Knee Girth, Calf Girth, and Ankle Girth, all in inches.]
Plot 6: Histograms of four variables grouped by gender.
Height, weight, and wrist girth are all greater in males than in females, while the
distribution of ages appears fairly even between the genders.
The results obtained from the histograms and the table of means coincide with what one
would expect when measurements are taken of males and females. One would assume quite a
bit of difference in height and weight between males and females, and these results
bear that out. The same can be said for other measurements such as bicep girth, chest
measurements, and waist measurements, where it is safe to conclude that males on
average have larger values than females. The measurements taken on the 300-plus labeled
individuals confirm this.
[Plot 6 panels: counts by gender for Wrist Girth (inches), Age (years), Weight (lbs), and Height (inches).]
Statistical Analysis
With 24 variables in the data set, one would expect at least some of the predictors to
be correlated with one another.
Figure 1: Correlation plot of the variables used to predict gender in the data set
Variables VIF
1 Biac 5.156712
2 Biil 2.705762
3 Bitr 4.031599
4 Chst_Dpth 4.614601
5 Chst_D 6.255370
6 Elb_D 7.258601
7 Wrst_D 5.648761
8 Knee_D 4.026285
9 Ankl_D 4.764529
10 Shld_G 12.642681
11 Chst_G 18.308704
12 Wst_G 11.981441
13 Abd_G 5.738989
14 Hip_G 11.966555
15 Thgh_G 6.156189
16 Bcp_G 14.495503
17 Frrm_G 18.230675
18 Knee_G 4.815733
19 Clf_GM 4.379525
20 Ankl_Gm 3.988350
21 Wrst_Gm 9.670141
22 Age 1.714157
23 Wgt 42.805985
24 Hgt 4.917905
Table 3: Variance inflation factors (VIF) for each predictor
Figure 1 and Table 3 both show high correlation and strong multicollinearity among the
predictors. Several variables are strongly positively correlated with one another,
suggesting that not all of them are needed to fit an adequate model. For example,
weight is highly correlated with many of the other predictors and has the highest VIF
value. This is expected: an individual with large girth and diameter measurements will
generally also weigh more than one with smaller measurements.
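For reference, VIF values for a set of predictors can be computed with vif() from the car package; how the values in Table 3 were actually obtained is not shown in Appendix II, so the call below is only a sketch:

library(car)

# VIFs depend only on the predictors, so the (0/1) gender column simply serves
# as a response here; columns 1:24 are the measurements, column 25 is Gend
vif(lm(Gend ~ ., data = gender.train[, 1:25]))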
After examining the raw data, a model was fit using logistic regression. Because the
variables are recorded in different units, standardizing the predictors was considered;
however, the model built on standardized variables turned out to be less adequate than
the one fit to the original scale, so the unstandardized predictors were kept.
The first model fit included every predictor available in the data set, even though it
was expected that not all of them would be retained. Many variables in this full model
were insignificant. A reduced model was then selected through backwards elimination:
starting from the full model, insignificant variables were removed one at a time, where
a variable was treated as insignificant if the p-value of its z statistic exceeded
0.05. The BIC (Bayesian Information Criterion) was also tracked to see how the final
model compared to previous models and to confirm that the final model did in fact have
the lowest BIC value.
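A minimal sketch of that BIC comparison for the full and final models, using bayesglm from the arm package as in Appendix II (the five retained predictors are written out explicitly rather than subtracting the dropped ones):

library(arm)

full <- bayesglm(factor(Gend) ~ ., data = gender.train[, 1:25],
                 family = binomial(link = "logit"), control = list(maxit = 50))

reduced <- bayesglm(factor(Gend) ~ Elb_D + Wst_G + Abd_G + Hip_G + Hgt,
                    data = gender.train[, 1:25],
                    family = binomial(link = "logit"), control = list(maxit = 50))

# BIC() works on these fits because bayesglm objects inherit from glm;
# the reduced model is expected to show the lower (better) value
BIC(full, reduced)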
Deviance Residuals:
Min 1Q Median 3Q Max
-0.74731 -0.01801 -0.00045 0.00586 0.89380
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -72.6428 24.6754 -2.944 0.003241 **
Elb_D 11.8753 4.6046 2.579 0.009909 **
Wst_G 2.0371 0.4662 4.370 1.25e-05 ***
Abd_G -0.7015 0.2655 -2.643 0.008227 **
Hip_G -1.6422 0.4750 -3.457 0.000545 ***
Hgt 0.9839 0.3618 2.719 0.006542 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 467.8098 on 337 degrees of freedom
Residual deviance: 6.3932 on 332 degrees of freedom
AIC: 18.393
VIF:
Elb_D Wst_G Abd_G Hip_G Hgt
1.237114 7.592639 4.736656 5.320721 1.674569
Table 4: Summary and VIF values of the reduced model
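Written out, the reduced model corresponds to the following fitted log-odds, with coefficients taken from the table above (p denotes the predicted probability that Gend = 1):

log(p / (1 - p)) = -72.64 + 11.88*Elb_D + 2.04*Wst_G - 0.70*Abd_G - 1.64*Hip_G + 0.98*Hgt

An observation is classified as Gend = 1 whenever p exceeds 0.5, matching the 0.5 cutoff used in the prediction code in Appendix II.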
Plot 7: BIC values through model selection process
The reduced model retained five significant variables: elbow diameter, waist girth,
abdominal girth, hip girth, and height. Initial checks of model adequacy, based on the
residual deviance and the AIC, indicated that the model was sufficient, and the BIC for
the final model was the lowest of all the models fit to the data set.
The most concerning aspect of the reduced model was the somewhat high VIF values for
some of the variables. In an attempt to fix this, waist girth, which had the highest
VIF value, was removed from the model in the hope of resolving the multicollinearity.
This did not work: although the VIF values all dropped below three, the change
introduced another insignificant variable and led to much higher AIC and BIC values
(model in Appendix III). It was therefore decided to keep the final reduced model as
is, even with waist girth having a somewhat high VIF value.
Because this is a logistic regression model, a conventional R-squared value is not
available. Instead, pseudo R-squared measures were used: McFadden's, Cox and Snell, and
Nagelkerke (Cragg and Uhler).
Pseudo.R.squared
McFadden 0.986334
Cox and Snell (ML) 0.744655
Nagelkerke (Cragg and Uhler) 0.993616
Table 5: Pseudo R-squared values for the reduced model
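As a check, McFadden's value can be recovered directly from the deviances of the reduced model: for a binary logistic regression the saturated log-likelihood is zero, so 1 - logLik(model)/logLik(null) equals 1 - (residual deviance / null deviance). A one-line sketch using the gender.lm object from Appendix II:

# 1 - 6.3932 / 467.8098 = 0.9863, matching McFadden's value in the table above
1 - deviance(gender.lm) / gender.lm$null.deviance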
These pseudo R-squared values indicate a well-fitting model, with extremely high
McFadden and Nagelkerke values. Goodness of fit was also assessed with the
Hosmer-Lemeshow test, whose null hypothesis is that the model fits the data; a p-value
under 0.05 would indicate a poor fit.
Hosmer-Lemeshow C statistic
data: fitted(gender.lm) and gender.train$Gend
X-squared = 0.53776, df = 8, p-value = 0.9998
Hosmer-Lemeshow H statistic
data: fitted(gender.lm) and gender.train$Gend
X-squared = 3.4337, df = 8, p-value = 0.9043
Table 6: Hosmer-Lemeshow test results
The high p-values of the Hosmer-Lemeshow test indicate a good fit for this model. For
the residual analysis, two plots were examined: the Pearson residuals plotted against
the fitted values and the Pearson residuals plotted against the linear predictor.
Plot 8: Residual plots of the reduced model
There does not seem to be anything of concern in the residual plots, so the model was
judged sufficient for prediction.
The validation set provided does not include the actual gender of the individuals, so
the accuracy of predictions on it cannot be measured directly. To account for this, the
labeled training data was split into two sets: one portion was used as a training set
as before and the other was held out as a test set. This was done after fitting a model
on the full training data, in order to see whether the predictive ability, the results,
or the selected variables would differ between the two models, and to allow their
predictions to be compared. The data was split 70/30, with the test set holding 30% of
the 338 labeled observations.
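A minimal sketch of such a 70/30 split using createDataPartition() from the caret package (the actual splitting code is not shown in Appendix II; training and testing are the object names the appendix code expects):

library(caret)

set.seed(1) # arbitrary seed, only for reproducibility of the split
idx <- createDataPartition(gender.train$Gend, p = 0.70, list = FALSE)
training <- gender.train[idx, ]
testing <- gender.train[-idx, ]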
Deviance Residuals:
Min 1Q Median 3Q Max
-0.78864 -0.02652 -0.00051 0.00938 0.74798
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -65.8307 21.6974 -3.034 0.002413 **
Elb_D 13.5727 5.1193 2.651 0.008019 **
Wst_G 1.7092 0.4059 4.211 2.54e-05 ***
Abd_G -0.4840 0.2450 -1.976 0.048173 *
Hip_G -1.6433 0.4877 -3.369 0.000754 ***
Hgt 0.8544 0.3351 2.550 0.010771 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 327.013 on 235 degrees of freedom
Residual deviance: 5.956 on 230 degrees of freedom
AIC: 17.956
Number of Fisher Scoring iterations: 33
Table 7: Reduced model refit on the new (70%) training set
Not surprisingly, backwards elimination on this new training set selected the same
variables as before, and the goodness-of-fit tests and residual checks were consistent
with the earlier results. This model was then used to predict on the test set, with
10-fold cross-validation used to assess its predictive ability.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 68 1
1 1 65
Accuracy : 0.9852
95% CI : (0.9475, 0.9982)
No Information Rate : 0.5111
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9704
Mcnemar's Test P-Value : 1
Sensitivity : 0.9855
Specificity : 0.9848
Pos Pred Value : 0.9855
Neg Pred Value : 0.9848
Prevalence : 0.5111
Detection Rate : 0.5037
Detection Prevalence : 0.5111
Balanced Accuracy : 0.9852
'Positive' Class : 0
Table 8: Confusion matrix and accuracy of the model on the held-out test set
The model achieved extremely high accuracy on the test set, 98.52%. This lined up with
the high pseudo R-squared values and goodness-of-fit tests, which indicated that the
predictive ability of the model would be good. The ROC curve below reflects the same
high accuracy on the test set.
Area under the curve (AUC): 0.985
Plot 9: ROC Curve for the test set
Evaluation of Results
The predictive ability of the final reduced model proved to be extremely strong as the
training data set had 98.52% accuracy on the test set. These results may have been so
over powering due to the amount of predictors used and the clear differences many of
the variables showed in distributions when grouped by gender. For example, height was
a significant variable in the final model and based on the distributions between genders,
there was a clear difference between males and females.
Based on the distributions presented in the histograms, it was theorized that not all
of the variables in the data set would be significant. Of the 24 variables, it was
speculated that the final model would include all or some of the following: biacromial
diameter, chest depth, chest girth, shoulder girth, waist girth, chest diameter, bicep
girth, forearm girth, wrist girth, weight, and height. The final model included only
five significant variables, three of which (elbow diameter, navel girth, and hip girth)
were not on that list; the other two, waist girth and height, were among those
originally expected to be significant.
It was felt that, based on the variables presented in the data set, predicting gender
would be highly accurate due to the large differences in distributions for many of the
variables. This was confirmed by the 98.52% accuracy of the final logistic model. The
accuracy of the model on the validation set is therefore also expected to be extremely
high.
Appendix I: Variable Description
Variable abbreviations, matched to the measurement labels used in the plots:
Biac = biacromial diameter; Biil = biiliac diameter; Bitr = bitrochanteric diameter;
Chst_Dpth = chest depth; Chst_D = chest diameter; Elb_D = elbow diameter;
Wrst_D = wrist diameter; Knee_D = knee diameter; Ankl_D = ankle diameter.
Shld_G = shoulder girth; Chst_G = chest girth; Wst_G = waist girth;
Abd_G = navel (abdominal) girth; Hip_G = hip girth; Thgh_G = thigh girth;
Bcp_G = bicep girth; Frrm_G = forearm girth; Knee_G = knee girth;
Clf_GM = calf girth; Ankl_Gm = ankle girth; Wrst_Gm = wrist girth.
Age = age (years); Wgt = weight (lbs); Hgt = height (inches).
All diameters and girths are in inches after conversion.
Appendix II: R Code
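# Packages assumed by the code below (library() calls were not part of the
# original listing): measurements for conv_unit(), ggplot2, corrplot,
# arm for bayesglm(), MKmisc for HLgof.test(), and caret for
# trainControl()/train()/confusionMatrix(); pROC and ROSE are attached further
# down. multiplot(), co.var(), and nagelkerke() are helper functions defined
# elsewhere (nagelkerke() is noted below as obtained online).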
#Unit Conversions
gender[,c(1:5,10:21 ,24)] <- round(conv_unit(gender[,c(1:5,
10:21, 24)], "cm", "inch"),2)
gender[, 6:9] <- round(conv_unit(gender[, 6:9], "cm", "inch")/2,
2)
gender[, "Wgt"] <- round(conv_unit(gender[, "Wgt"], "kg",
"lbs"),2)
#printing first 6 rows
print(head(format(gender, digits=4, nsmall=2)), row.names=F)
#mean and cov
gender.mean <- round(aggregate(gender[, 1:24],
list(gender$Gend), mean),1)
gender.cov <- round(aggregate(gender[, 1:24],
list(gender$Gend), co.var),1)
#Histograms
multiplot(
ggplot(gender.train, aes(x=Biac , fill=factor(Gend))) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
+ xlab("Biacromial Diameter (inch)") +
scale_fill_discrete(name = "Gender", labels=c("female",
"male")) ,
ggplot(gender.train, aes(x=Biil, fill=factor(Gend))) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
+ xlab("Biiliac Diameter (inch)") +
scale_fill_discrete(name = "Gender", labels=c("female",
"male")),
ggplot(gender.train, aes(x=Bitr, fill=factor(Gend))) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
+ xlab("Bitrochantric Diameter (inch)") +
scale_fill_discrete(name = "Gender", labels=c("female",
"male"))
,
ggplot(gender.train, aes(x=Chst_Dpth, fill=factor(Gend))) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
+ xlab("Chest Depth (inch)") + scale_fill_discrete(name =
"Gender", labels=c("female", "male")), cols=2)
#Corrplot
M <- cor(gender.train)
corrplot(M, method="circle")
#Model with all variables
gender.lm <- bayesglm(factor(Gend) ~., data=gender.train[,
1:25], family=binomial(link="logit"), control = list(maxit
= 50))
#Final reduced model
gender.lm <- bayesglm(as.factor(Gend) ~.-Age-Clf_GM-Knee_G-
Bitr-Wgt-Chst_D-Ankl_Gm-Knee_D-Biil-Chst_G-Ankl_D-
Chst_Dpth-Wrst_D-Shld_G-Bcp_G-Biac-Wrst_Gm-Thgh_G-Frrm_G,
data=gender.train[, 1:25], family=binomial(link="logit"),
control = list(maxit = 50))
#Pseudo R-2 values. Function obtained online
nagelkerke(gender.lm)
#Hosmer test
HLgof.test(fit = fitted(gender.lm), obs = gender.train$Gend)
#residual plots
multiplot(
ggplot(gender.train, aes(x=gender.lm$linear.predictor,
y=residuals(gender.lm, "pearson"))) + geom_point(shape=1)
+xlab("Linear Predictor") + ylab("Pearson Residuals") ,
ggplot(gender.train, aes(x=gender.lm$fitted.values,
y=residuals(gender.lm, "pearson"))) + geom_point(shape=1)
+xlab("Fitted Values") + ylab("Pearson Residuals"), cols=2)
#Roc plot
prob <- predict(gender.lm,type=c("response"))
gender.lm$prob <- prob
library(pROC)
g <- roc(Gend ~ prob, data = gender.train)
plot(g)
#predictions
fit1 <- predict(gender.lm, gender[, 1:24], type='response'
)
fitted.results1 <- ifelse(fit1 > 0.5,1,0)
gender10 <- gender
gender10$prediction_of_gender <- fitted.results1
gender.final <- gender10[, c("train", "ID",
"prediction_of_gender")]
ctrl <- trainControl(method = "repeatedcv", number = 10,
savePredictions = TRUE)
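# 'training' and 'testing' below are the 70/30 split of the labeled data
# described in the report; the code that creates the split was not part of
# the original listing.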
mod_fit <- train(Gend ~ Elb_D + Wst_G + Abd_G + Hip_G +
Hgt, data=training, method="glm", family="binomial",
trControl = ctrl, tuneLength = 5)
pred = predict(mod_fit, newdata=testing)
fitted.results <- ifelse(pred > 0.5,1,0)
confusionMatrix(data=fitted.results, testing$Gend)
library(ROSE)
roc.curve(testing$Gend,fitted.results)
Appendix III: Other Models
Full Model:
Deviance Residuals:
Min 1Q Median 3Q Max
-0.60491 -0.03612 -0.00198 0.01249 0.69878
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.446e+01 2.369e+01 -2.721 0.00651 **
Biac 6.092e-01 7.678e-01 0.793 0.42752
Biil -2.174e-01 8.869e-01 -0.245 0.80637
Bitr -5.135e-02 9.642e-01 -0.053 0.95752
Chst_Dpth 4.102e-01 9.103e-01 0.451 0.65226
Chst_D -1.374e-01 7.688e-01 -0.179 0.85818
Elb_D 4.045e+00 4.107e+00 0.985 0.32467
Wrst_D 2.412e+00 4.970e+00 0.485 0.62748
Knee_D 7.479e-01 3.093e+00 0.242 0.80896
Ankl_D 1.677e+00 3.869e+00 0.433 0.66468
Shld_G 1.788e-01 2.728e-01 0.656 0.51214
Chst_G 7.703e-02 2.539e-01 0.303 0.76164
Wst_G 1.002e+00 3.983e-01 2.517 0.01184 *
Abd_G -3.998e-01 2.698e-01 -1.481 0.13850
Hip_G -4.576e-01 4.651e-01 -0.984 0.32512
Thgh_G -1.438e+00 7.234e-01 -1.988 0.04680 *
Bcp_G 2.132e-01 6.021e-01 0.354 0.72322
Frrm_G 7.071e-01 9.665e-01 0.732 0.46443
Knee_G 3.048e-02 8.657e-01 0.035 0.97192
Clf_GM 2.717e-02 7.576e-01 0.036 0.97140
Ankl_Gm 2.510e-01 1.185e+00 0.212 0.83233
Wrst_Gm 1.039e+00 1.884e+00 0.551 0.58145
Age 3.027e-04 6.953e-02 0.004 0.99653
Wgt -4.147e-03 3.772e-02 -0.110 0.91245
Hgt 5.569e-01 3.852e-01 1.446 0.14828
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 467.8098 on 337 degrees of freedom
Residual deviance: 5.3379 on 313 degrees of freedom
AIC: 55.338
Number of Fisher Scoring iterations: 49
Reduced model with waist girth removed (referenced in the Statistical Analysis section):
Deviance Residuals:
Min 1Q Median 3Q Max
-2.38432 -0.19335 -0.01419 0.10344 2.35817
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -54.12967 8.31865 -6.507 7.67e-11 ***
Elb_D 17.62633 2.55615 6.896 5.36e-12 ***
Abd_G 0.02233 0.10519 0.212 0.831865
Hip_G -0.53972 0.15894 -3.396 0.000684 ***
Hgt 0.41021 0.10880 3.770 0.000163 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 467.81 on 337 degrees of freedom
Residual deviance: 109.34 on 333 degrees of freedom
AIC: 119.34
Number of Fisher Scoring iterations: 12
Vif:
Elb_D Abd_G Hip_G Hgt
1.400524 2.604707 2.953671 1.107523