Statistical Analyses of athlete's BMI and haemoglobin level

Group Project
Analyses of Athletes' BMI and Hemoglobin Level
Introduction
This report describes the available data, the type and meaning of each variable, and the
analyses performed on them. The dataset was collected in a study of how athletes' body
characteristics vary with sport and sex. The data are available in the DAAG package in R as
`ais`.
Data Description
The data initially contained 202 observations and 13 variables. A few columns that were not
needed for the analyses were removed, reducing the dataset to 202 observations and 8
variables. The table below lists the variables that were retained for the analyses.
Variable Name   Data Type   Description
Rcc             Numeric     Red blood cell count
Hg              Numeric     Hemoglobin concentration, in g per deciliter
BMI             Numeric     Body mass index, kg/m^2
Lbm             Numeric     Lean body mass, kg
Ht              Numeric     Height, cm
wt              Numeric     Weight, kg
Sex             Factor      A factor with levels f for female, m for male
Sport           Factor      A factor with 10 sports

Data Cleaning
The dataset does not contain any missing values. Five columns were removed because they
were not used in the analyses.
Data Summary
A summary of the variables in the dataset is shown in Figure 1.
Figure 1: Summary of variables in the dataset
• There are 6 numeric variables and 2 factor variables (sex and sport).
Data Analysis
1.0 Initial exploration of data
To get an initial feel for the data, we plot histograms of the two variables of interest: BMI
(Body Mass Index) and HG (blood hemoglobin concentration).
Figure 2: Histogram of hemoglobin level
Figure 3: Histogram of BMI
The plots are shown above in Figures 2 and 3. Both look roughly normal; later in the project
we use hypothesis testing to check this impression formally.
2.0 Empirical CDF
The empirical CDF (ECDF) is a nonparametric estimate of the underlying CDF of a random
variable: at each value x it gives the proportion of observations less than or equal to x.
Plotting it shows how quickly the CDF increases to 1.
Figure 4: ECDF of the sample BMI
Figure 5: ECDF of the normal distribution of BMI
Figure 4 shows the ECDF of the sample BMI, and Figure 5 shows the ECDF of a normal
distribution with the mean and standard deviation of BMI for comparison. The two are close,
although the ECDF of the sample BMI rises to 1 slightly more quickly.

The ECDF of the blood hemoglobin level is plotted against that of a normal distribution with
the same mean and standard deviation.
Figure 7: ECDF of the normal with mean and sd of hg
The ECDF of HG is close to normal, as shown in Figures 6 and 7.
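The comparison described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the `bmi` vector is simulated as a stand-in for the BMI column of `ais`, and the names `F_sample`, `F_ref` and `max_gap` are hypothetical, not taken from the original analysis.

```r
# Simulated stand-in for the BMI column (assumption: NOT the real ais data).
set.seed(1)
bmi <- rnorm(202, mean = 23, sd = 2.9)

# Empirical CDF of the sample: F_n(x) = (number of observations <= x) / n.
F_sample <- ecdf(bmi)

# ECDF of a reference sample drawn from a normal distribution with the
# sample's own mean and standard deviation.
ref <- rnorm(1000, mean = mean(bmi), sd = sd(bmi))
F_ref <- ecdf(ref)

# Largest vertical gap between the sample ECDF and the fitted normal CDF,
# evaluated on a grid of BMI values. Small values mean "close to normal".
grid <- seq(min(bmi), max(bmi), length.out = 200)
max_gap <- max(abs(F_sample(grid) - pnorm(grid, mean(bmi), sd(bmi))))
print(max_gap)
```

Calling `plot(F_sample)` followed by `lines(F_ref, col = "red")` overlays the two curves for a visual check of the kind shown in the figures.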
3.0 BMI Categorization and Confidence interval estimation of BMI mean
Categorical proportions for the BMI variable are calculated using the ecdf function, based
on the following ranges of BMI values:
• Underweight: BMI less than 18.5
• Normal: BMI between 18.5 and 24.9
• Overweight: BMI between 25 and 30
• Obese: BMI more than 30
These proportions give a general picture of the athletes as a whole, with all sports pooled
together.
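A minimal sketch of how such category proportions can be read off an ECDF is given below. It assumes simulated BMI values rather than the `ais` data, and it assumes cut points of 18.5, 25 and 30, which closes the small gap between 24.9 and 25 in the ranges listed above. The name `props` is hypothetical.

```r
# Simulated stand-in for the BMI column (assumption: NOT the real ais data).
set.seed(1)
bmi <- rnorm(202, mean = 23, sd = 2.9)

F_bmi <- ecdf(bmi)

# F_bmi(x) is the proportion of athletes with BMI <= x, so each category
# proportion is a difference of ECDF values at the assumed cut points.
# A value exactly on a cut point falls into the lower category.
props <- c(
  Underweight = F_bmi(18.5),
  Normal      = F_bmi(25) - F_bmi(18.5),
  Overweight  = F_bmi(30) - F_bmi(25),
  Obese       = 1 - F_bmi(30)
)
print(round(props, 3))
```

Because the four categories partition the whole range of BMI values, the proportions sum to 1.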
We now calculate a confidence interval for the mean BMI of all the athletes. We do this
with both a non-parametric and a parametric bootstrap approach. Comparing the two sets of
results helps to judge which approach is preferable.
Figure 6: ECDF of sample hemoglobin concentration

Non-Parametric Model approach:
• Mean: 22.95
• SE: 0.203
• CI: (22.557, 23.354)
Parametric Model approach:
• Mean: 22.956
• SE: 0.201
• CI: (22.562, 23.351)
Both approaches produce similar results. There does not seem to be a clear winner here,
although the confidence interval from the parametric approach is slightly narrower.
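The two approaches can be sketched as follows. This is an illustration rather than the original computation: the data are simulated, the number of bootstrap replicates `B` is an arbitrary choice, and the interval is assumed to be the normal-based one (estimate plus or minus 1.96 bootstrap standard errors), which the report does not state explicitly. The helper `boot_ci` is hypothetical.

```r
# Simulated stand-in for the BMI column (assumption: NOT the real ais data).
set.seed(1)
bmi <- rnorm(202, mean = 23, sd = 2.9)
n <- length(bmi)
B <- 2000  # number of bootstrap replicates (arbitrary choice)

# Non-parametric bootstrap: resample the observed values with replacement.
np_means <- replicate(B, mean(sample(bmi, n, replace = TRUE)))

# Parametric bootstrap: draw fresh samples from the fitted normal model.
p_means <- replicate(B, mean(rnorm(n, mean = mean(bmi), sd = sd(bmi))))

# Normal-based 95% interval: estimate +/- z * bootstrap standard error.
boot_ci <- function(boot_means, estimate) {
  se <- sd(boot_means)
  z <- qnorm(0.975)
  c(mean = estimate, se = se,
    lower = estimate - z * se, upper = estimate + z * se)
}

np_ci <- boot_ci(np_means, mean(bmi))
p_ci <- boot_ci(p_means, mean(bmi))
print(rbind(nonparametric = np_ci, parametric = p_ci))
```

With a sample of this size both bootstrap standard errors should land close to the textbook value `sd(bmi) / sqrt(n)`, which is why the two intervals are expected to be nearly identical.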
HG (hemoglobin concentration) is another variable of interest, so we also estimate a
confidence interval for its mean. The parametric method is chosen because the histogram
of this variable in Figure 2 suggests approximate normality.
• Mean: 14.56
• SE: 0.0945
• CI: (14.377, 14.748)
4.0 Hypothesis testing
This section includes all the hypothesis testing that was performed on this dataset.
4.1 Testing whether the distributions are normal
Both BMI and HG are tested to check whether they follow a normal distribution. A
permutation test was used for this purpose: for each variable, 1000 random numbers were
generated from a normal distribution with the mean and standard deviation of that variable
and tested against the sample data.
The null hypothesis of the permutation test is that both samples come from the same
distribution. We reject it if the p-value is less than 0.05.
The results of the test are given below.
BMI:
• P-value: 0.948
• Conclusion: The p-value is far above 0.05, so we cannot reject the null hypothesis. The
data are consistent with a normal distribution, although failing to reject does not prove
normality.

HG:
• P-value: 0.71
• Conclusion: The p-value is well above 0.05, so we cannot reject the null hypothesis. The
data are consistent with a normal distribution.
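A permutation test of this kind can be sketched as below. The report does not say which test statistic was used, so the largest gap between the two ECDFs is assumed here purely for illustration. The data are simulated, and `ks_stat` and `perm_test` are hypothetical helper names.

```r
# Simulated stand-in for the hg column (assumption: NOT the real ais data).
set.seed(1)
hg <- rnorm(202, mean = 14.5, sd = 1.4)

# 1000 draws from a normal with the sample's mean and standard deviation.
ref <- rnorm(1000, mean = mean(hg), sd = sd(hg))

# Assumed test statistic: largest vertical gap between the two ECDFs.
ks_stat <- function(x, y) {
  grid <- sort(c(x, y))
  max(abs(ecdf(x)(grid) - ecdf(y)(grid)))
}

# Permutation test: under the null both samples share one distribution,
# so group labels are exchangeable. Shuffle the labels B times and see
# how often the shuffled statistic is at least as large as the observed.
perm_test <- function(x, y, stat, B = 1000) {
  observed <- stat(x, y)
  pooled <- c(x, y)
  n_x <- length(x)
  perm_stats <- replicate(B, {
    idx <- sample(length(pooled), n_x)
    stat(pooled[idx], pooled[-idx])
  })
  mean(perm_stats >= observed)
}

p_value <- perm_test(hg, ref, ks_stat)
print(p_value)
```

Because the p-value is a proportion over `B` shuffles, it can only take values in steps of `1 / B`; a reported value of 0 means that no shuffled statistic reached the observed one.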
4.2 Testing whether BMI varies with sex
The BMI values for males and females were separated and a Wald test was applied to test
whether sex influences BMI. The null hypothesis is that the difference between the two group
means is zero.
The p-value obtained from the test is 1.27e-06, so we reject the null hypothesis: there is
a significant difference between the mean BMI of male and female athletes.
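A Wald test for a difference in two means can be sketched as follows, assuming the usual form in which the difference is divided by its estimated standard error and compared with a standard normal. The two groups are simulated with arbitrary sizes and means, and `wald_two_sample` is a hypothetical helper.

```r
# Simulated stand-ins for male and female BMI (assumption: NOT ais data;
# group sizes, means and spreads are arbitrary illustrative values).
set.seed(1)
bmi_m <- rnorm(100, mean = 24, sd = 2.8)
bmi_f <- rnorm(100, mean = 22, sd = 2.6)

# Wald statistic: W = (mean(x) - mean(y)) / se, with
# se = sqrt(var(x)/n_x + var(y)/n_y). Under the null of equal means,
# W is approximately standard normal for large samples.
wald_two_sample <- function(x, y) {
  diff <- mean(x) - mean(y)
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  w <- diff / se
  list(estimate = diff, se = se, statistic = w,
       p_value = 2 * pnorm(-abs(w)))  # two-sided p-value
}

res <- wald_two_sample(bmi_m, bmi_f)
print(res$p_value)
```

The same helper applies unchanged to the comparisons of hemoglobin concentration between two sports.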
4.3 Hypotheses testing on HG based on category of sport
The sports in our dataset are divided into two categories:
• Endurance Sport: Row, Swim, T_400, Tennis, WaterPolo
• Power Sport: Netball, B_Ball, Field, Gym, TSprint
4.3.1 Endurance sport vs Endurance sport
We first ask whether hg concentration differs between two sports within the same category.
We apply a Wald test with the null hypothesis that there is no difference between the two
sports. The p-value is 0.36, which is greater than 0.05, so we cannot reject the null
hypothesis.
A permutation test was also used to test whether the two samples come from the same
distribution. The p-value is 0.638, again greater than 0.05, so there is insufficient
evidence to reject the null hypothesis.
The p-value from the permutation test is larger than the one from the Wald test, but both
tests lead to the same conclusion. A larger p-value does not by itself mean that the null
hypothesis is more strongly supported.

4.3.2 Endurance sport vs Power sport
We now study whether hg concentration differs between sports from two different
categories. The sports selected are Rowing (as endurance sport) and Netball (as power
sport).
The null hypothesis is that hemoglobin concentration does not differ between the two
sports. The p-value obtained from the Wald test is 6.33e-6, so the null hypothesis is
rejected: there is a significant difference in hemoglobin concentration between athletes of
the two sports.
The permutation test gives a similar result, with a p-value of 0; that is, none of the
permuted statistics was as extreme as the observed one. Both tests reject decisively, which
suggests that both have high power for a difference of this size. The permutation test gives
a marginally smaller p-value than the Wald test.
5.0 Maximum Likelihood Estimator
Based on the results in section 4.1, both BMI and HG are consistent with a normal
distribution. The aim now is to find point estimates for the parameters of the normal
distribution using maximum likelihood estimation. The results are given below, where the
second parameter is the standard deviation.
MLE for BMI ~ N(22.95, 2.86)
MLE for HG ~ N(14.57, 1.36)
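For a normal model the maximum likelihood estimates have a closed form: the sample mean, and the standard deviation computed with divisor n rather than n - 1. The sketch below shows this on simulated data and cross-checks it by minimizing the negative log-likelihood numerically with `optim`. The names `mu_hat`, `sigma_hat` and `neg_loglik` are hypothetical.

```r
# Simulated stand-in for the hg column (assumption: NOT the real ais data).
set.seed(1)
hg <- rnorm(202, mean = 14.5, sd = 1.4)

# Closed-form MLEs for a normal model.
mu_hat <- mean(hg)
sigma_hat <- sqrt(mean((hg - mu_hat)^2))  # divides by n, unlike sd()

# Numerical cross-check: minimize the negative log-likelihood.
# The standard deviation is parameterized on the log scale so that the
# optimizer can never propose a negative value.
neg_loglik <- function(par) {
  -sum(dnorm(hg, mean = par[1], sd = exp(par[2]), log = TRUE))
}
fit <- optim(c(mean(hg), log(sd(hg))), neg_loglik)

print(c(mu_hat = mu_hat, sigma_hat = sigma_hat))
print(c(mu_num = fit$par[1], sigma_num = exp(fit$par[2])))
```

With around 200 observations the difference between `sigma_hat` and `sd(hg)` is small, since the two differ only by the factor `sqrt((n - 1) / n)`.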
6.0 Bayesian Analysis
The approach so far has followed the frequentist philosophy, in which we construct
confidence intervals around estimates of a parameter. Let us now look at how things
differ when we treat the parameter itself as random and compute a posterior interval
for it using the Bayesian approach. We do this for both the bmi and hg variables in our
dataset.
BMI:
Let us assume a prior for the mean bmi that follows N(0, 1).
The posterior for the mean then follows N(22.95, 0.04) (results are calculated using
the formula derived in class).
Posterior interval for mean bmi: (22.877, 22.898)

Hg:
Let us assume a prior for the mean hg that follows N(1, 2).
The posterior for the mean then follows N(14.56, 0.07).
Posterior interval for mean hg: (14.429, 14.465)
In the second case, we deliberately chose a higher variance for the prior to
understand the effect of prior variance on the posterior. We observe that, because the
posterior variance is small in our case, a larger prior variance tends to inflate it.
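The formula derived in class is not reproduced in this report. As a hedged illustration, the sketch below uses the standard conjugate update for a normal mean with known variance, plugging in the sample variance for the known variance and reading the second parameter of the prior as a variance. Both choices are assumptions, the data are simulated, and the output is not intended to reproduce the figures quoted above. `normal_posterior` is a hypothetical helper.

```r
# Simulated stand-in for the BMI column (assumption: NOT the real ais data).
set.seed(1)
bmi <- rnorm(202, mean = 23, sd = 2.9)

# Conjugate normal update for the mean, treating sigma2 as known:
#   posterior precision = 1 / prior_var + n / sigma2
#   posterior mean      = precision-weighted average of the prior mean
#                         and the sample mean
normal_posterior <- function(x, prior_mean, prior_var, sigma2 = var(x)) {
  n <- length(x)
  post_var <- 1 / (1 / prior_var + n / sigma2)
  post_mean <- post_var * (prior_mean / prior_var + n * mean(x) / sigma2)
  list(mean = post_mean, var = post_var,
       interval = qnorm(c(0.025, 0.975), post_mean, sqrt(post_var)))
}

tight <- normal_posterior(bmi, prior_mean = 0, prior_var = 1)
wide  <- normal_posterior(bmi, prior_mean = 0, prior_var = 100)
print(c(tight$mean, tight$var))
print(c(wide$mean, wide$var))
```

Under this model a larger prior variance gives the prior less weight, so the posterior mean moves toward the sample mean and the posterior variance grows slightly toward `sigma2 / n`.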
Conclusion
• If the distribution of the population is known, the parametric bootstrap gives slightly
narrower intervals than the non-parametric bootstrap
• The permutation test appears slightly more powerful than the Wald test on the chosen
dataset, judging by the smaller p-value in section 4.3.2
• In the Bayesian approach, the closer the prior is to the posterior, the better the
posterior prediction
• Body Mass Index (BMI) depends on sex
• Hemoglobin concentration depends on the category of sport
