Performed statistical analysis in R to determine which aspects of a professional golfer's game have the highest predictive value for average earnings per event.
Braedon Churchill
Audrey Fu
Stats 431
Computing Project
1. Background
a. Description of the problem – As a professional golfer playing in tournaments/events, you want to maximize the earnings you receive from each event. To do this, you want to find out which aspects of a golfer's performance have a higher correlation with the average earnings attained in a given event. Such aspects include the number of events played, average score per round, percentage of greens hit in regulation, average driving distance, driving accuracy, and average putts per round. By discovering which aspects correlate with higher earnings, you know which aspects of your own game to focus on in order to earn more money.
b. Description of statistical questions – Do any of the variables significantly explain or predict the average earnings per event of golfers?
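The regression model implied by this question (using the variable labels defined in Appendix V) is a multiple linear regression; this statement of the model is added here for clarity and follows the standard form:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + εi,
where Y is average earnings per event, X1 through X6 are the six performance measures, and the errors εi are assumed to be independent and normally distributed with mean 0.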
2. Results
- Exploratory Analysis
- Using the pairs function in R, the pairwise relationships in the data were examined. Appendix I shows the results. Although the scatter appears fairly noisy, a roughly linear relationship can be seen between most of the variables.
- Appendix II shows the residual plot of the linear model. The residuals εi appear to be approximately normally distributed, given the random scatter, and are assumed to be independent.
- Hypothesis Testing
- Conducted an F-test to see whether at least one of the variables has predictive value for the average earnings per event of golfers. Results are shown in Appendix III. The test gives significant evidence that at least one of the variables has predictive value.
- Summary of the data
- Conducted a VIF test in R to see how much the variance of each coefficient is inflated by correlation among the predictors, compared to when the predictors are not linearly related (the VIF definition and a short check are given after this list). The results are shown in Appendix IV. The predictors are not highly correlated, which suggests that each coefficient contributes its own predictive value for average earnings per event.
- Running the summary function in R, as seen in Appendix V, every coefficient has a p-value below 0.05, indicating that each has predictive value for average earnings per event. The R² value is 0.82, meaning that 82% of the variability in earnings is explained by the model.
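For reference, the variance inflation factor reported by vif() for a predictor is VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing that predictor on the remaining predictors; this is the standard definition rather than something stated in the slides. A minimal R sketch, assuming the variable assignments from the R code in the appendix have been run:
#hand-compute the VIF for X1 and compare with the vif(lm) output in Appendix IV
aux_fit <- lm(X1 ~ X2 + X3 + X4 + X5 + X6)
1 / (1 - summary(aux_fit)$r.squared)   #should be close to 2.94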
3. Discussion
- The F test was performed in order to test whether at least one of the variables has predictive value for earnings per event. The limitation of this test is that it does not show which of the variables have predictive value, only that at least one of them does.
- Another hypothesis test, which checks the significance of each individual coefficient (its t statistic and p-value), could be performed to see which of the variables have predictive value; a short sketch of this follow-up is given after this list. Using a data set that includes the stats of more individual golfers would also help produce more accurate test results.
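A minimal sketch of the follow-up test suggested above, using the model object lm fitted in the appendix R code:
#t statistics and p-values for each individual coefficient (the same values shown in Appendix V)
summary(lm)$coefficients
#95% confidence intervals; an interval that excludes 0 indicates the variable has predictive value
confint(lm, level = 0.95)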
4. Appendix
a. Details of Statistical Analyses (Appendix I – V)
Appendix I (pairs plot of the variables)
Appendix II (residual plot of the fitted linear model)
Appendix III
H0: X1=X2=X3=X4=X5=X6=0 Ha: At least one Xi ≠ 0
k = 6 n = 18
DF1 = k = 6 DF2 = n-(k+1) = 11 α = .05
F Statistic = 13.99 with Fα, DF1=6, DF2=11 = 3.09
Reject Ho if F > Fα
Since 13.99 > 3.09 I reject the null hypothesis
It can be concluded that at least one of the variables has a predictive value on the average
earnings per event of golfers. It is shown with the residual plot that the data is normally
distributed. Independence is assumed.
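For reference, the overall F statistic in Appendix V can be reproduced from the reported R² using the standard formula F = (R²/k) / ((1 − R²)/(n − k − 1)); the short R check below is an addition, not part of the original slides:
#reproduce the overall F statistic from R^2 = 0.8234 with k = 6 predictors and 18 residual df
(0.8234 / 6) / ((1 - 0.8234) / 18)   #approximately 13.99
#critical value of the F distribution at alpha = .05
qf(0.95, df1 = 6, df2 = 18)          #approximately 2.66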
Appendix IV
> vif(lm)
X1 X2 X3 X4 X5 X6
2.937727 3.598127 1.846040 1.830498 1.742528 2.145418
Appendix V
Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6)
Y = Average earnings per event
X1 = Average score per round
X2 = Percentage of greens in regulation
X3 = Driving accuracy
X4 = Average putts per round
X5 = Number of events
X6 = Average driving distance
Residuals:
Min 1Q Median 3Q Max
-50215 -21877 1518 18345 37626
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1411171.4 1350234.6 1.045 0.309796
X1 -44490.9 19562.2 -2.274 0.035418 *
X2 22564.0 4727.3 4.773 0.000152 ***
X3 -5463.8 1453.6 -3.759 0.001437 **
X4 57686.9 23583.6 2.446 0.024946 *
X5 -4751.8 1288.8 -3.687 0.001687 **
X6 -3466.1 999.3 -3.469 0.002742 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 29530 on 18 degrees of freedom
Multiple R-squared: 0.8234, Adjusted R-squared: 0.7646
F-statistic: 13.99 on 6 and 18 DF, p-value: 6.497e-06
b. R Code
#load the data set
Golfstats <- read.delim("~/Golfstats.txt")
View(Golfstats)
#name the variables
Y  <- Golfstats$Earnings.Event
X1 <- Golfstats$Avg..Score
X2 <- Golfstats$GIR.....
X3 <- Golfstats$Driving.Accuracy....
X4 <- Golfstats$Putts.Round
X5 <- Golfstats$Events
X6 <- Golfstats$Driving.Distance
#perform exploratory analysis: pairwise scatter plots of the response and predictors
pairs(cbind(Y, X1, X2, X3, X4, X5, X6))
#create the linear model (the object is named lm here, masking the base function,
#to match the output shown in the appendices)
lm <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6)
#create residual plot to check the residuals
plot(lm$fitted, lm$resid)
abline(h = 0, lty = 2)
#check the significance of the variables and find test statistics
summary(lm)
#check for multicollinearity of the variables (vif() is from the car package)
library(car)
vif(lm)
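As a possible extension (not part of the original analysis), the fitted model could be used to predict the average earnings per event for a new golfer; the predictor values below are hypothetical and only illustrate the call:
#predict average earnings per event for a hypothetical golfer (illustrative values only)
new_golfer <- data.frame(X1 = 70.5,  #average score per round
                         X2 = 65,    #greens in regulation (%)
                         X3 = 60,    #driving accuracy (%)
                         X4 = 29,    #average putts per round
                         X5 = 25,    #number of events
                         X6 = 290)   #average driving distance
predict(lm, newdata = new_golfer)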