Table of Contents
Executive Summary
Background, Motivation, Statement of Problem and Objectives
Data Collection and Explanation
Statistical Analysis
Evaluation of Results
Appendix I: Variable Description
Appendix II: R Code
Appendix III: Other Models
Executive Summary
Body measurements were taken of over 500 individuals in order to predict the gender
of each person. Twenty-four measurements, mostly diameters and girths, were recorded
for each individual. The raw data were grouped by gender to visualize the distributions
for males and females and how they compare to one another. A logistic regression model
was then fit with the goal of predicting gender for a given validation set. Because
gender was not recorded in the validation set, the labelled data were split into a
training set and a test set to measure the model's predictive accuracy. The model
predicted extremely well, with 98.52% accuracy on the test set. This result was not
surprising, as the variables in the data set were expected to make it fairly easy to
distinguish males from females.
Background, Motivation, Statement of Problem and Objectives
Twenty-four variables were measured on a sample of over 500 males and females. The
individuals selected were physically active people, mostly in their 20s and 30s, with
some older individuals. Based on these measurements, the primary objective was to
predict the gender of the individuals whose gender was not recorded. The measurements
include girths such as waist and chest, skeletal diameters, and other body
measurements. Alongside the main objective of gender prediction, other tasks included
converting the data to English units and obtaining descriptive statistics such as the
mean and coefficient of variation.
Data Collection and Explanation
Twenty-four measurements were taken of each individual: skeletal measurements, girth
measurements, and age, weight and height (full descriptions of all variables are in
Appendix I). Gender, the dependent variable, was recorded for only two-thirds of the
individuals; the remaining one-third of the dataset was used as a validation set.
Table 1: The first six observations of the complete dataset. The NA’s in the data frame
correspond to the validation set, which is why the dataset had to be split.
The purpose of the data was to find a distinction between males and females, and thus
the mean and percent coefficient of variation of each variable were computed by gender.
This gave a better understanding of the differences between males and females.
Mean:
Coefficient of Variation:
Table 2: The mean and percent coefficient of variation of each variable for males and females in the dataset.
Grouping the variables by gender showed that the distributions of some variables were
clearly different between genders, while other variables showed similar distributions.
Interestingly, the coefficients of variation were similar for both genders, with only
small differences between males and females. The variation values themselves were
generally low except for age, which was expected given the wide age range of the
individuals measured.
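The grouped summaries described above can be sketched in base R as follows. This is a minimal illustration only: the data frame `body`, its column names, and the simulated values are hypothetical stand-ins, not the report's dataset, and the metric source units are an assumption.

```r
# Hypothetical stand-in data in metric units (NOT the report's dataset).
set.seed(42)
body <- data.frame(
  Gend   = rep(c("female", "male"), each = 50),
  Hgt_cm = c(rnorm(50, 165, 6.5), rnorm(50, 178, 7)),
  Wgt_kg = c(rnorm(50, 60, 9), rnorm(50, 78, 10))
)

# Convert to English units: 1 inch = 2.54 cm, 1 kg is about 2.2046 lb.
body$Hgt <- body$Hgt_cm / 2.54
body$Wgt <- body$Wgt_kg * 2.2046

# Percent coefficient of variation: 100 * standard deviation / mean.
pct_cv <- function(x) 100 * sd(x) / mean(x)

# One row per gender, one column per variable.
means <- aggregate(cbind(Hgt, Wgt) ~ Gend, data = body, FUN = mean)
cvs   <- aggregate(cbind(Hgt, Wgt) ~ Gend, data = body, FUN = pct_cv)

print(means)
print(cvs)
```

Because the coefficient of variation is a ratio of standard deviation to mean, it is unaffected by the unit conversion, which makes it convenient for comparing spread across variables measured on different scales.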
Plot 1: Histograms of four variables grouped by gender.
The histograms in Plot 1 show that both biacromial diameter and chest depth differ
clearly between males and females, with males averaging higher than females. The
bitrochanteric and biiliac diameters, however, show very little difference in
distribution between males and females.
[Plot 1 panels: Biacromial Diameter (inch), Biiliac Diameter (inch), Bitrochanteric Diameter (inch), Chest Depth (inch). Vertical axis: count. Legend: Gender (female, male).]
Plot 2: Histograms of four variables grouped by gender.
Shoulder girth, chest girth and waist girth all show clear differences in their
distributions when grouped by gender. The histograms in Plot 2 indicate that these
girths are larger in males than in females.
[Plot 2 panels: Ankle Diameter (inch), Shoulder Girth (inch), Chest Girth (inch), Waist Girth (inch). Vertical axis: count. Legend: Gender (female, male).]
Plot 3: Histograms of four variables grouped by gender.
There is a clear difference in chest diameter between males and females, while the
other three histograms in Plot 3 (knee, elbow and wrist diameter) show very little
difference in distribution between males and females.
[Plot 3 panels: Chest Diameter (inch), Knee Diameter, Elbow Diameter (inch), Wrist Diameter (inch). Vertical axis: count. Legend: Gender (female, male).]
Plot 4: Histograms of four variables grouped by gender.
Of the histograms in Plot 4, the only variable that shows a clear difference in
distributions is bicep girth.
[Plot 4 panels: Navel Girth (inch), Hip Girth (inch), Thigh Girth (inch), Bicep Girth (inch). Vertical axis: count. Legend: Gender (female, male).]
Plot 5: Histograms of four variables grouped by gender.
In Plot 5, forearm girth shows a clear difference, with males having a larger girth
than females. The other three histograms show a slight difference in favor of males,
but it is not as pronounced as it is for forearm girth.
[Plot 5 panels: Forearm Girth (inch), Knee Girth (inch), Calf Girth (inch), Ankle Girth (inch). Vertical axis: count. Legend: Gender (female, male).]
Plot 6: Histograms of four variables grouped by gender.
Height, weight and wrist girth are all greater in males than in females. The
distribution of ages appears to be fairly even between the genders.
The results obtained from the histograms and the table of means coincide with what one
would expect when measurements are taken of males and females. One would assume a
sizable difference in height and weight between males and females, and these results
bear that out. The same can be said for other measurements such as bicep girth, chest
measurements and waist measurements, where males would be expected to have greater
measurements on average than females. The measurements taken on the more than 300
individuals with recorded gender confirm that.
[Plot 6 panels: Wrist Girth (inch), Age (years), Weight (lbs), Height (inch). Vertical axis: count. Legend: Gender (female, male).]
Statistical Analysis
With 24 different variables in the data, one would expect at least some of the different
predictors to be correlated with one another.
Figure 1: Correlation plot of the variables used to predict gender in the data set
[Figure 1 image: correlation matrix with color scale from -1 to 1. Rows and columns: Biac, Biil, Bitr, Chst_Dpth, Chst_D, Elb_D, Wrst_D, Knee_D, Ankl_D, Shld_G, Chst_G, Wst_G, Abd_G, Hip_G, Thgh_G, Bcp_G, Frrm_G, Knee_G, Clf_GM, Ankl_Gm, Wrst_Gm, Age, Wgt, Hgt.]
    Variables        VIF
 1  Biac        5.156712
 2  Biil        2.705762
 3  Bitr        4.031599
 4  Chst_Dpth   4.614601
 5  Chst_D      6.255370
 6  Elb_D       7.258601
 7  Wrst_D      5.648761
 8  Knee_D      4.026285
 9  Ankl_D      4.764529
10  Shld_G     12.642681
11  Chst_G     18.308704
12  Wst_G      11.981441
13  Abd_G       5.738989
14  Hip_G      11.966555
15  Thgh_G      6.156189
16  Bcp_G      14.495503
17  Frrm_G     18.230675
18  Knee_G      4.815733
19  Clf_GM      4.379525
20  Ankl_Gm     3.988350
21  Wrst_Gm     9.670141
22  Age         1.714157
23  Wgt        42.805985
24  Hgt         4.917905
Table 1: Multicollinearity table of the predictors
Figure 1 and the VIF table above both show high correlation and multicollinearity
among the predictors. Several variables are positively correlated with one another,
which suggests that not all variables are needed to fit an adequate model. For
example, weight is highly correlated with many other predictors and has the highest
VIF value (42.8). This is expected: an individual with large girth and diameter
measurements will tend to weigh more than one with smaller measurements.
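As a rough illustration of what the VIF measures, the sketch below computes it two ways on simulated data: from the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors, and from the diagonal of the inverse correlation matrix. The variable names and simulated values are hypothetical, and this is not necessarily how the report's VIF values were produced.

```r
set.seed(7)
n <- 200
# Hypothetical predictors: wgt is built from the two girths, so it is collinear.
chest_g <- rnorm(n, 38, 3)
waist_g <- rnorm(n, 32, 4)
hgt     <- rnorm(n, 68, 3)
wgt     <- 2.5 * chest_g + 2 * waist_g + rnorm(n, 0, 4)
X <- data.frame(chest_g, waist_g, hgt, wgt)

# Definition: regress each predictor on all the others and use its R-squared.
vif_by_regression <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse of the correlation matrix.
vif_by_inverse <- diag(solve(cor(X)))

print(round(vif_by_regression, 3))
print(round(vif_by_inverse, 3))
```

In this toy setup the weight-like variable has the largest VIF because it is close to a linear combination of the girths, which mirrors the pattern seen for Wgt in the table.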
After examining the raw data, a model was fit using logistic regression. Because the
variables are measured in different units, standardizing the predictors was
considered; however, a model fit to the standardized predictors was less adequate
than the one fit without standardization.
The first model fit included every predictor available in the data set, even though it
was expected that not all variables would be retained. Many variables in this full
model were insignificant. A reduced model was selected through backward elimination:
starting with the full model, insignificant variables were removed based on the
p-value of their z statistic, Pr(>|z|). A variable was labeled insignificant if its
p-value was greater than 0.05. The BIC (Bayesian Information Criterion) was also
considered, to see how the final model compared to previous models and whether the
final model did in fact have the lowest BIC value.
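The selection procedure described above can be sketched as a loop in base R. This is a minimal illustration under stated assumptions: the data are simulated, the helper name `backward_eliminate` is hypothetical, and dropping exactly one variable per step (the one with the largest p-value) is an assumption about the procedure rather than a detail given in the text.

```r
set.seed(1)
n <- 300
# Simulated data: only x1 and x2 truly affect the binary response y.
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1 - 1.0 * dat$x2))

backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  bic_path <- numeric(0)
  repeat {
    rhs <- if (length(predictors) > 0) paste(predictors, collapse = " + ") else "1"
    fit <- glm(as.formula(paste(response, "~", rhs)),
               data = data, family = binomial)
    bic_path <- c(bic_path, BIC(fit))   # record the BIC of every model visited
    if (length(predictors) == 0) break
    pvals <- summary(fit)$coefficients[predictors, "Pr(>|z|)"]
    if (max(pvals) <= alpha) break      # every remaining term is significant
    predictors <- predictors[-which.max(pvals)]  # drop the least significant
  }
  list(fit = fit, predictors = predictors, bic_path = bic_path)
}

result <- backward_eliminate(dat, "y")
print(result$predictors)
print(round(result$bic_path, 2))
```

Recording the BIC at each step gives the kind of sequence that can be plotted from the full model to the final model to check that the final model has the lowest value.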
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.74731  -0.01801  -0.00045   0.00586   0.89380
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -72.6428    24.6754  -2.944 0.003241 **
Elb_D         11.8753     4.6046   2.579 0.009909 **
Wst_G          2.0371     0.4662   4.370 1.25e-05 ***
Abd_G         -0.7015     0.2655  -2.643 0.008227 **
Hip_G         -1.6422     0.4750  -3.457 0.000545 ***
Hgt            0.9839     0.3618   2.719 0.006542 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 467.8098 on 337 degrees of freedom
Residual deviance: 6.3932 on 332 degrees of freedom
AIC: 18.393
VIF:
   Elb_D    Wst_G    Abd_G    Hip_G      Hgt
1.237114 7.592639 4.736656 5.320721 1.674569
Table 2: Table of reduced model
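Written out, the coefficient estimates in the reduced-model table correspond to the fitted linear predictor below. Here \hat{p} denotes the fitted probability of the class coded 1; which gender that is depends on the coding used in the R code, which is not shown in this section, so no assumption about it is made here.

```latex
\log\frac{\hat{p}}{1-\hat{p}}
  = -72.6428
  + 11.8753\,\mathrm{Elb\_D}
  + 2.0371\,\mathrm{Wst\_G}
  - 0.7015\,\mathrm{Abd\_G}
  - 1.6422\,\mathrm{Hip\_G}
  + 0.9839\,\mathrm{Hgt}
```

Each coefficient is the change in the log-odds for a one-unit increase in that measurement with the others held fixed.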
Plot 7: BIC values through the model selection process
The reduced model retained only five significant variables: elbow diameter, waist
girth, abdominal girth, hip girth, and height. Based on the residual deviance and the
AIC, initial analysis indicated that this model was adequate. The BIC of the final
model was also the lowest of all the models fit to the data set.
The main concern with the reduced model was the somewhat high VIF values for some of
the variables. To address this, waist girth, which had the highest VIF value, was
removed from the model in the hope of resolving the multicollinearity. This did not
work, however, as the model’s adequacy worsened. The VIF values all dropped below
three, but the change also left another variable insignificant and produced extremely
high AIC and BIC values (model in Appendix III). It was therefore decided to keep the
final reduced model as is, even with waist girth having a somewhat high VIF value.
[Plot 7 image: BIC on the vertical axis (ticks from 60 to 150) against the sequence of models from full to final.]
Because this is a logistic regression model, an ordinary R-squared value cannot be
obtained. Instead, pseudo R-squared values (McFadden, Cox and Snell, and Nagelkerke)
were used.
                              Pseudo.R.squared
McFadden                              0.986334
Cox and Snell (ML)                    0.744655
Nagelkerke (Cragg and Uhler)          0.993616
Table 3: Pseudo R-Squared values for reduced model
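The three values in Table 3 can be reproduced, to rounding, from the null and residual deviances reported for the reduced model. The sketch below assumes an ungrouped binary response, for which the log-likelihood equals minus half the deviance, and takes n = 338 (the null degrees of freedom plus one). The variable names are illustrative.

```r
# Deviances as reported in the reduced-model summary.
null_dev <- 467.8098
res_dev  <- 6.3932
n        <- 338   # null degrees of freedom (337) + 1

# For ungrouped binary data, logLik = -deviance / 2.
mcfadden   <- 1 - res_dev / null_dev
cox_snell  <- 1 - exp((res_dev - null_dev) / n)
nagelkerke <- cox_snell / (1 - exp(-null_dev / n))

print(round(c(McFadden = mcfadden,
              CoxSnell = cox_snell,
              Nagelkerke = nagelkerke), 6))
```

The Cox and Snell value is bounded above by 1 - exp(-null_dev / n), about 0.749 here, which is why it looks low next to the other two even though the fit is nearly perfect; Nagelkerke rescales it by that bound.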
These pseudo R-squared values indicate an adequate model, with extremely high McFadden
and Nagelkerke values. Goodness of fit was also assessed with the Hosmer-Lemeshow
test, whose null hypothesis is that the model fits the data; a p-value under 0.05
indicates poor goodness of fit.
Hosmer-Lemeshow C statistic
data: fitted(gender.lm) and gender.train$Gend
X-squared = 0.53776, df = 8, p-value = 0.9998
Hosmer-Lemeshow H statistic
data: fitted(gender.lm) and gender.train$Gend
X-squared = 3.4337, df = 8, p-value = 0.9043
Table 4: Hosmer-Lemeshow Test
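For readers unfamiliar with the test, the following is a minimal base-R sketch of the quantile-grouped (decile-of-risk) form of the statistic, commonly labeled the C statistic. It is not the routine that produced the output above: the function name `hosmer_lemeshow` and the simulated data are hypothetical, and the sketch assumes the fitted probabilities have no ties at the quantile breaks.

```r
hosmer_lemeshow <- function(y, p, g = 10) {
  # Group observations by quantiles of the fitted probabilities.
  breaks <- quantile(p, probs = seq(0, 1, length.out = g + 1))
  grp    <- cut(p, breaks = breaks, include.lowest = TRUE)
  obs    <- tapply(y, grp, sum)      # observed events per group
  expd   <- tapply(p, grp, sum)      # expected events per group
  n_g    <- tapply(p, grp, length)   # group sizes
  pbar   <- expd / n_g
  stat   <- sum((obs - expd)^2 / (n_g * pbar * (1 - pbar)))
  df     <- g - 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated example with a correctly specified model.
set.seed(3)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
hl  <- hosmer_lemeshow(y, fitted(fit), g = 10)
print(hl)
```

With ten groups the reference distribution has 8 degrees of freedom, matching the df reported in the output above.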
The high p-values of the Hosmer-Lemeshow test indicate a good fit for this model. For
the residual analysis, two plots were produced: the Pearson residuals against the
fitted values, and the Pearson residuals against the linear predictor.
Plot 8: Residual plots of the reduced model
Nothing in the residual plots appears to be of concern, and thus it can be concluded
that this model is adequate for prediction.
The validation data set does not include the actual gender of the individuals, so the
accuracy of predictions on it cannot be determined. To account for this, the labelled
data were split into two sets: one portion was used as a new training set and the
other as a test set. This was done after fitting a model on the entire labelled data
set, to see whether the predictive ability, the results, or the variables selected
would differ between the two models. It also allowed the predictions of both models
to be compared. The data were split in a 70/30 ratio, with the test set holding 30%
of the 338 observations.
[Plot 8 panels: Pearson residuals against the linear predictor (horizontal axis from -20 to 20); Pearson residuals against the fitted values (horizontal axis from 0 to 1).]
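A 70/30 split of the 338 labelled observations, as described above, could be done along the following lines. The seed, the data frame `labelled`, and the use of `floor()` to size the training set are assumptions for illustration; the report's actual split is not shown in this section.

```r
set.seed(123)
n <- 338                                  # labelled observations, per the text
labelled <- data.frame(id = seq_len(n))   # hypothetical stand-in for the data

# Draw 70% of the row indices at random, without replacement.
train_idx <- sample(n, size = floor(0.7 * n))
train <- labelled[train_idx, , drop = FALSE]
test  <- labelled[-train_idx, , drop = FALSE]

print(c(train = nrow(train), test = nrow(test)))
```

With 338 rows, `floor(0.7 * n)` gives 236 training rows, which is consistent with the 235 null degrees of freedom reported for the model refit on the new training set.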
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.78864  -0.02652  -0.00051   0.00938   0.74798
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -65.8307    21.6974  -3.034 0.002413 **
Elb_D         13.5727     5.1193   2.651 0.008019 **
Wst_G          1.7092     0.4059   4.211 2.54e-05 ***
Abd_G         -0.4840     0.2450  -1.976 0.048173 *
Hip_G         -1.6433     0.4877  -3.369 0.000754 ***
Hgt            0.8544     0.3351   2.550 0.010771 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 327.013 on 235 degrees of freedom
Residual deviance: 5.956 on 230 degrees of freedom
AIC: 17.956
Number of Fisher Scoring iterations: 33
Table 5: Reduced model on the new training set
Not surprisingly, backward elimination on the new training set selected the same
variables as it did on the full labelled data set. The goodness-of-fit tests and
residual checks were all consistent with the earlier results. This model was then used
to predict on the test set to see how accurate its predictions were. 10-fold cross
validation was used as a method to assess the predictive ability of the model.
Confusion Matrix and Statistics
          Reference
Prediction  0  1
         0 68  1
         1  1 65
               Accuracy : 0.9852
                 95% CI : (0.9475, 0.9982)
    No Information Rate : 0.5111
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.9704
 Mcnemar's Test P-Value : 1
            Sensitivity : 0.9855
            Specificity : 0.9848
         Pos Pred Value : 0.9855
         Neg Pred Value : 0.9848
             Prevalence : 0.5111
         Detection Rate : 0.5037
   Detection Prevalence : 0.5111
      Balanced Accuracy : 0.9852
       'Positive' Class : 0
Table 6: The measured accuracy of the training-set model on the test set
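The headline statistics in Table 6 follow directly from the four counts in the confusion matrix. The sketch below recomputes accuracy, sensitivity, specificity and Cohen's kappa in base R, treating class 0 as the positive class as the output states; the object names are illustrative.

```r
# Counts from the confusion matrix: rows are predictions, columns the reference.
cm <- matrix(c(68, 1, 1, 65), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

total       <- sum(cm)
accuracy    <- sum(diag(cm)) / total
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # positive class is 0
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Cohen's kappa: agreement beyond what the marginal totals alone would give.
p_expected <- sum(rowSums(cm) * colSums(cm)) / total^2
kappa_stat <- (accuracy - p_expected) / (1 - p_expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa_stat), 4))
```

Only two of the predictions are off the diagonal, one in each direction, which is also why the McNemar test p-value is 1.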
The model fit on the training set was extremely accurate on the test set, with 98.52%
accuracy. This agrees with the high pseudo R-squared values and goodness-of-fit tests,
which had indicated that the predictive ability of this model would be good. The ROC
curve below reflects the high accuracy on the test set.
Area under the curve (AUC): 0.985
Plot 9: ROC Curve for the test set
[Plot 9 image: ROC curve, true positive rate against false positive rate, both axes from 0.0 to 1.0.]
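An AUC such as the one reported above can be computed without any plotting package by using its rank-based (Mann-Whitney) form: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below uses a tiny made-up example; the function name `auc_rank` and the scores are hypothetical and unrelated to the report's data.

```r
auc_rank <- function(score, label) {
  # label: 1 for the positive class, 0 for the negative class.
  r  <- rank(score)                 # average ranks handle tied scores
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)
auc <- auc_rank(score, label)
print(auc)   # 0.75: three of the four positive/negative pairs are ordered correctly
```

An AUC of 1 means every positive case is scored above every negative case, so a value near 0.985 indicates that almost all pairs are ranked correctly.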
Evaluation of Results
The predictive ability of the final reduced model proved to be extremely strong: the
model fit on the training set had 98.52% accuracy on the test set. These results may
have been so strong because of the number of predictors available and the clear
differences in distribution that many of the variables showed when grouped by gender.
For example, height was a significant variable in the final model, and its
distributions showed a clear difference between males and females.
Based on the distributions presented in the histograms, it was theorized that not all
of the variables in the data set would be significant. Of the 24 variables, it was
speculated that the final model would include all or some of the following: biacromial
diameter, chest depth, chest girth, shoulder girth, waist girth, chest diameter, bicep
girth, forearm girth, wrist girth, weight and height. The final model included only
five significant variables. Three of them, elbow diameter, navel (abdominal) girth and
hip girth, were originally thought to be insignificant. The other two, waist girth and
height, were among those originally expected to be significant.
Based on the variables in the data set, it was expected that gender could be predicted
with high accuracy because of the large differences in distribution for many of the
variables. This was confirmed by the 98.52% accuracy of the final logistic model on
the test set. The accuracy of the model on the validation set was expected to be
extremely high as well.