Eun Seuk Choi (Eric)
Statistical Methods & Data Analytics
Final Project
Professor Alan Huebner
December 10, 2015
<Analysis of NBA Real Plus-Minus for the 2014-2015 Regular Season>
Table of Contents
1. Introduction
a. Describe data
b. About variables
c. Purpose of analysis
2. Data
a. More details about data
b. The source of data
3. Regression Analysis
a. Exploratory data analysis
i. Scatterplots of each of X variables vs. Y variable
ii. Most highly correlated X variables
b. Linear Regression Analysis
i. Fit a full model and report the R^2
ii. Conduct one F-test to test for the removal of a subset of variables
iii. Use stepAIC()
iv. Find outliers
v. Choose the “final” model
vi. Perform model diagnostics (a residuals vs. fitted plot and a normal plot)
vii. Validate the model by cross validation or bootstrapping
4. Results
a. Three inferences about the final model and importance of each inference
i. A confidence interval for a fitted value
ii. A prediction interval for a fitted value
iii. A confidence interval for one or more slope parameters
5. Conclusion
a. How well the model describes Y variable
b. Factors that can improve the predictive power of the model
1. Introduction
a. Describe data
The file “NBA real plus-minus for 2014-2015 regular seasons” contains data
extracted from ESPN.com on each individual NBA player’s influence on his team’s wins,
based on the number of games he played during the season, his minutes per game, and his
on-court impact on team offensive and defensive performance. The data consist of the 474
NBA players who played at least one game during the 2014-2015 regular season.
b. About variables
There are five variables in total: GP, M, ORPM, DRPM, and WINS. WINS is the
response variable; the other four are predictors. GP is the number of games played (out of
82) in the 2014-2015 regular season. M is each player’s minutes per game. ORPM is a
player’s estimated on-court impact on team offensive performance, measured in points
scored per 100 offensive possessions, while DRPM is his estimated on-court impact on
team defensive performance, measured the same way on the defensive end. WINS
estimates the number of wins each player contributed to his team’s win total for the
season; it is computed from the player’s Real Plus-Minus and the number of possessions he
played.
c. Purpose of analysis
By interpreting the results of a linear regression on these five variables (WINS as the
response and the other four as predictors), I want to identify the primary factors that
positively affect WINS. I will search for the optimal model for predicting WINS by
conducting an F-test for removing a subset of variables, identifying outliers in the data,
performing model diagnostics on my final model, and validating it using cross-validation.
Based on these results, I will make inferences pertinent to my topic using a confidence
interval for a fitted value, a prediction interval, and a confidence interval for slope
parameters. After evaluating my final model, I will finish this project by suggesting ways
to improve its predictive power.
2. About Data
a. More details about data
The data were extracted from the ESPN.com website. The original data include six
variables: the five mentioned above plus RPM. I excluded RPM because it is simply
ORPM + DRPM; it is perfectly correlated with that sum, so there is no need to include it in
my model.
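The redundancy could be verified directly in R on the original file before dropping the column; the filename and column names below are assumptions for illustration, not the actual file used:

```r
# Hypothetical check on the original ESPN file that still contains RPM:
# RPM should equal ORPM + DRPM for every player, making it redundant.
RPM.full <- read.table("NBARPM_full.txt", header = TRUE)   # assumed filename
all.equal(RPM.full$RPM, RPM.full$ORPM + RPM.full$DRPM)     # TRUE up to rounding
cor(RPM.full$RPM, RPM.full$ORPM + RPM.full$DRPM)           # essentially 1
```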
b. The source of data
The underlying source of the data is Basketball-Reference.com, which provided
play-by-play data to ESPN; data analysts at ESPN then assembled the play-by-play data
into ORPM and DRPM for the 2014-2015 regular season using their own methods.
According to ESPN, the RPM model sifts through more than 230,000 possessions each NBA
season to tease apart the real plus-minus effects attributable to each player.
3. Regression Analysis
a. Exploratory data analysis
i. Scatterplots of each of X variables vs. Y variable
RPM1<-read.table("NBARPM.txt",header=T)
attach(RPM1)
plot(GP,WINS)
plot(M,WINS)
plot(ORPM,WINS)
[Scatterplots: GP vs. WINS, M vs. WINS, and ORPM vs. WINS]
ii. Most highly correlated X variables
cor(cbind(GP,M,ORPM,DRPM))
According to the correlation matrix, GP and M are the most highly correlated pair of X
variables (cor = 0.66).
b. Linear Regression Analysis
i. Fit a full model and report the R^2
mod.RPM<-lm(WINS~GP+M+ORPM+DRPM)
summary(mod.RPM)
R^2 = 0.8575, Adjusted R^2 = 0.8563
ii. Conduct one F-test to test for the removal of a subset of variables
Taking mod.RPM as the full model, I want to find out whether the set of three variables
M, ORPM, and DRPM can be removed from the model by conducting an F-test for
comparing nested models.
mod.reduced<-lm(WINS~GP)
summary(mod.reduced)
SSE.r<-sum(mod.reduced$residuals^2)
SSE.c<-sum(mod.RPM$residuals^2)
F<-((SSE.r-SSE.c)/(4-1))/(SSE.c/(474-(4+1)))
#F=763.4384, with df1 = 3 (the dropped predictors) and df2 = 474-5 = 469
pf(763.4384,3,469,lower.tail=F) # very small p-value
Given the very small p-value, we reject the null hypothesis and conclude that the group of
three predictors cannot be removed from the model.
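In the notation of the code above, the F-statistic follows the standard nested-model form, with n = 474 observations, 4 predictors in the complete model, and 1 in the reduced model:

```latex
F \;=\; \frac{(SSE_r - SSE_c)/(df_r - df_c)}{SSE_c / df_c}
  \;=\; \frac{(SSE_r - SSE_c)/3}{SSE_c / 469},
\qquad F \sim F_{3,\,469} \ \text{under } H_0 .
```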
iii. Use stepAIC()
library(MASS)
optimal.bp <- stepAIC(mod.RPM)
optimal.bp$anova
Initial Model : WINS~GP+M+ORPM+DRPM
Final Model : WINS~GP+M+ORPM+DRPM
iv. Find outliers
rstandard(mod.RPM)
I found that two players, Draymond Green (observation 121) and Stephen Curry
(observation 421), have standardized residuals greater than 3. They are outliers.
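Rather than scanning the full rstandard() output by eye, the flagged observations could be pulled out programmatically; a small sketch, assuming mod.RPM and RPM1 as defined above:

```r
# Flag observations whose standardized residuals exceed 3 in absolute value
out.idx <- which(abs(rstandard(mod.RPM)) > 3)
out.idx            # row indices of the outliers
RPM1[out.idx, ]    # inspect the flagged rows
```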
v. Choose the “final” model
I chose the initial model (WINS~GP+M+ORPM+DRPM) as the final model, since it
has the highest adjusted R^2 among the candidate combinations of variables.
Adjusted R^2 for WINS~M+GP+ORPM+DRPM = 0.8563
Adjusted R^2 for WINS~M+GP+ORPM = 0.6335
Adjusted R^2 for WINS~M+GP+DRPM = 0.5572
Adjusted R^2 for WINS~GP+ORPM+DRPM = 0.8502, and so on. The initial model
has the highest adjusted R^2. In addition, according to the stepAIC() function, the initial
model is the optimal model for these data.
vi. Perform model diagnostics (a residuals vs. fitted plot and a normal plot)
plot(mod.RPM$fitted.values,mod.RPM$residuals)
[Residuals vs. fitted values for mod.RPM (left) and mod.RPM3 (right)]
Since the plot does not show a random pattern, I modified the model to reflect what
appears in plot(GP,WINS) and plot(M,WINS). Because those two plots show a quadratic
pattern, I tried GP^2 and M^2 in a new model.
GP1<-GP^2
M1<-M^2
mod.RPM3<-lm(WINS~(GP1)+(M1)+ORPM+DRPM)
summary(mod.RPM3)
plot(mod.RPM3$fitted.values,mod.RPM3$residuals)
However, I got a graph similar to the one above, suggesting that the assumptions of
linearity and constant error variance might not hold for my model. I additionally tried to
obtain a better plot using quadratic, log, and exponential transformations of my
predictors, but I could not find a better model than the original. Therefore, I decided to
keep my original model.
[Normal Q-Q plot and histogram of mod.RPM$residuals]
qqnorm(mod.RPM$residuals)
On the other hand, the normal Q-Q plot of mod.RPM$residuals is nearly a straight line,
which indicates that the residuals are approximately normal.
hist(mod.RPM$residuals)
In addition, the histogram of the residuals is approximately bell-shaped, supporting
the claim that the residuals are normal.
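The two visual checks could be supplemented with a formal normality test; this is a sketch of an additional check, not part of the original analysis, assuming mod.RPM as fit above:

```r
# Shapiro-Wilk test of normality on the residuals (valid for n <= 5000);
# a small p-value would indicate departure from normality.
shapiro.test(mod.RPM$residuals)
```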
vii. Validate the model by cross validation
Using cross-validation (code attached in the Appendix), rsquared.Group2 = 0.851 and
rsquared.Group1 = 0.845. Since the mean of these two values, 0.848, is close to the
R^2 = 0.8575 of the final model, I concluded that the model is valid.
4. Results
a. Three inferences about the final model and importance of each inference
i. A confidence interval for a fitted value
I chose to compute a 95% confidence interval for the mean WINS over all players who
have ORPM=0, DRPM=0, M=20.43, and GP=54.29.
predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="confidence",level=0.95)
The result shows that the mean WINS for all players with ORPM=0, DRPM=0,
M=20.43, and GP=54.29 falls within [2.705, 2.968] with 95% confidence. I chose ORPM=0
and DRPM=0 because these values are greater than mean(ORPM) = -0.646 and
mean(DRPM) = -0.278, and because ORPM = DRPM = 0 is where a player breaks even in his
offensive and defensive contributions to the team. Mean(M) = 20.43 and
mean(GP) = 54.29 were chosen for M and GP so that the interval isolates the effect of the
ORPM and DRPM values. I conclude that for players with average GP and M who break
even on both ends of the floor (and thus perform slightly better than the league average),
the mean WINS falls within [2.705, 2.968].
ii. A prediction interval for a fitted value
I chose to compute a 95% prediction interval for the WINS of a “new” player who has
ORPM=0, DRPM=0, M=20.43, and GP=54.29.
predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="prediction",level=0.95)
The result indicates that the WINS of a single new player with ORPM=0, DRPM=0,
M=20.43, and GP=54.29 falls within [0.183, 5.490] with 95% confidence. I chose these
values for the same reasons as in the confidence interval above. Note that the prediction
interval is much wider than the confidence interval, since it accounts for the variability of
an individual observation rather than a mean. I conclude that the WINS of a new player
with average GP and M who breaks even on both ends of the floor falls within
[0.183, 5.490].
iii. A confidence interval for one or more slope parameters
I chose to compute a 95% confidence interval for the slope of the ORPM variable,
using the estimate and standard error from summary(mod.RPM):
Lower = 1.142995 - 1.96*0.03653 = 1.071396
Upper = 1.142995 + 1.96*0.03653 = 1.214594
(With 469 residual degrees of freedom, the exact t multiplier qt(0.975, 469) is about
1.965, so 1.96 is a close approximation.) Therefore, I am 95% confident that the slope of
ORPM falls within [1.071396, 1.214594]. Since this interval does not contain 0, I conclude
that ORPM is a significant predictor in this model, which the low p-value for the ORPM
variable also confirms.
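The hand computation above can be checked against R's built-in confint(), which uses the exact t quantile rather than the normal approximation; a sketch, assuming mod.RPM as fit above:

```r
# 95% confidence interval for the ORPM slope, computed by R directly
confint(mod.RPM, "ORPM", level = 0.95)
```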
5. Conclusion
a. How well the model describes Y variable.
In general, I found that my model describes the response variable (WINS)
satisfactorily, as it has R^2 and adjusted R^2 values of about 0.85. In particular, the results
are consistent with my intuition that WINS increases as GP, M, ORPM, and DRPM increase,
with the increase in WINS driven most strongly by ORPM and DRPM, which have larger
slopes than GP and M.
b. Factors that can improve the predictive power of the model
It would be better if I could find a model with a random pattern in its residuals vs.
fitted values plot. I transformed some of my predictor variables to try to fit a better model,
but I could not find one better than my final model.
Additionally, if I had PER (Player Efficiency Rating) as one of my predictor variables,
the predictive power of the model might increase, since PER is also positively correlated
with WINS. If PER turned out not to be strongly correlated with my original predictors,
adding it would let me better interpret how each player’s performance affects WINS.
<Appendix>
#attach data
RPM1<-read.table("NBARPM.txt",header=T)
attach(RPM1)
#scatterplots of each of predictor variables
plot(GP,WINS)
plot(M,WINS)
plot(ORPM,WINS)
#correlation matrix among predictor variables
cor(cbind(GP,M,ORPM,DRPM))
#Fit full model using all X’s and report R^2
mod.RPM<-lm(WINS~GP+M+ORPM+DRPM)
summary(mod.RPM)
#Use a reduced model to conduct F-test
mod.reduced<-lm(WINS~GP)
summary(mod.reduced)
SSE.r<-sum(mod.reduced$residuals^2)
SSE.c<-sum(mod.RPM$residuals^2)
F<-((SSE.r-SSE.c)/(4-1))/(SSE.c/(474-(4+1)))
F
pf(763.4384,3,469,lower.tail=F)
#Note: in one session this displayed 1, which is wrong; in a fresh R session the same call
#returns about 3.28e-180, the true (essentially zero) p-value.
#Use StepAIC to find final (optimal) model
library(MASS)
optimal.bp <- stepAIC(mod.RPM)
optimal.bp$anova # display results
#Find out outliers
rstandard(mod.RPM)
#model diagnostics (fitted values vs residuals and normal plot)
plot(mod.RPM$fitted.values,mod.RPM$residuals)
qqnorm(mod.RPM$residuals)
hist(mod.RPM$residuals)
#model diagnostics with changed model (fitted values vs residuals and normal plot)
GP1<-GP^2
M1<-M^2
mod.RPM3<-lm(WINS~(GP1)+(M1)+ORPM+DRPM)
summary(mod.RPM3)
plot(mod.RPM3$fitted.values,mod.RPM3$residuals)
qqnorm(mod.RPM3$residuals)
#validate the final model by using cross validation
set.seed(5)
#obtain total sample size
n<-dim(RPM1)[1]
Group1.index<-sample(1:n,round(n/2),replace=F)
Group2.index<-setdiff(1:n,Group1.index)
Group1<-RPM1[Group1.index,]
Group2<-RPM1[Group2.index,]
#Fit a linear model on Group1 and a separate one on Group2
mod.Group1<-lm(WINS~GP+M+ORPM+DRPM,data=Group1)
mod.Group2<-lm(WINS~GP+M+ORPM+DRPM,data=Group2)
###Compute fitted values on Group2 using model fit on Group1
fitted.Group2<-NULL
for (i in 1:dim(Group2)[1]){
fitted.Group2<-
c(fitted.Group2,(mod.Group1$coef[1]+mod.Group1$coef[2]*Group2$GP[i]
+mod.Group1$coef[3]*Group2$M[i]
+mod.Group1$coef[4]*Group2$ORPM[i]
+mod.Group1$coef[5]*Group2$DRPM[i]
))
}
##Now, compute R^2 comparing these Group2 fitted values to the Group2 y's,
##using the formula 1 - (SSE/SSTo)
rsquared.Group2 <- 1 - sum((Group2$WINS-fitted.Group2)^2)/sum((Group2$WINS-
mean(Group2$WINS))^2)
rsquared.Group2
###Compute fitted values on Group1 using model fit on Group2
fitted.Group1<-NULL
for (i in 1:dim(Group1)[1]){
fitted.Group1<-
c(fitted.Group1,(mod.Group2$coef[1]+mod.Group2$coef[2]*Group1$GP[i]
+mod.Group2$coef[3]*Group1$M[i]
+mod.Group2$coef[4]*Group1$ORPM[i]
+mod.Group2$coef[5]*Group1$DRPM[i]
))
}
##Now, compute R^2 comparing these Group1 fitted values to the Group1 y's,
##using the formula 1 - (SSE/SSTo)
rsquared.Group1 <- 1 - sum((Group1$WINS-fitted.Group1)^2)/sum((Group1$WINS-
mean(Group1$WINS))^2)
rsquared.Group1
###Compute mean of both R^2
mean(c(rsquared.Group2,rsquared.Group1))
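The two fitted-value loops above could also be written without loops, since predict() applies a fitted model's coefficients to new data directly; a compact sketch, assuming mod.Group1, mod.Group2, Group1, and Group2 as defined above:

```r
# Loop-free computation of the cross-validated fitted values
fitted.Group2 <- predict(mod.Group1, newdata = Group2)
fitted.Group1 <- predict(mod.Group2, newdata = Group1)
```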
#A confidence interval for a fitted value
predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="confidence",level=0.95)
#A prediction interval for a fitted value
predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="prediction",level=0.95)
# A confidence interval for one or more slope parameters is calculated manually from summary(mod.RPM)

More Related Content

Viewers also liked

eCommerce shopping cart - basic elements
eCommerce shopping cart - basic elementseCommerce shopping cart - basic elements
eCommerce shopping cart - basic elementsMineWhat
 
G08.2014 magic-quadrant-for-unified-threat-management-2014
G08.2014   magic-quadrant-for-unified-threat-management-2014G08.2014   magic-quadrant-for-unified-threat-management-2014
G08.2014 magic-quadrant-for-unified-threat-management-2014Satya Harish
 
How we attracted our target audience
How we attracted our target audienceHow we attracted our target audience
How we attracted our target audienceMax Trimming
 
AddReality RetailBanking
AddReality RetailBankingAddReality RetailBanking
AddReality RetailBankingAddReality
 
CENTRALSTATE_Diploma_
CENTRALSTATE_Diploma_CENTRALSTATE_Diploma_
CENTRALSTATE_Diploma_DEREK FOLLEY
 
Naomi Virginia Lyons
Naomi Virginia LyonsNaomi Virginia Lyons
Naomi Virginia LyonsNaomi Lyons
 
Plan grand palais visiteur (1)
Plan grand palais visiteur (1)Plan grand palais visiteur (1)
Plan grand palais visiteur (1)0665
 
Autoevaluación rúbrica atlas
Autoevaluación rúbrica atlasAutoevaluación rúbrica atlas
Autoevaluación rúbrica atlasnuria garcia gomez
 
Correo electrónico y el correo postal
Correo electrónico y el correo postalCorreo electrónico y el correo postal
Correo electrónico y el correo postalfernandez madrid
 
supervalu refrence lett zane
supervalu refrence lett zanesupervalu refrence lett zane
supervalu refrence lett zaneZane Sheikh
 
EMPLOYEE VOLUNTARY BENEFITS
EMPLOYEE VOLUNTARY BENEFITSEMPLOYEE VOLUNTARY BENEFITS
EMPLOYEE VOLUNTARY BENEFITSSMITA RASTOGI
 
Where is my JETPACK? CAG 2012
Where is my JETPACK? CAG 2012Where is my JETPACK? CAG 2012
Where is my JETPACK? CAG 2012Brian Housand
 

Viewers also liked (19)

Misbah cv (2) (1)
Misbah cv (2) (1)Misbah cv (2) (1)
Misbah cv (2) (1)
 
Syllabus informatica
Syllabus informaticaSyllabus informatica
Syllabus informatica
 
eCommerce shopping cart - basic elements
eCommerce shopping cart - basic elementseCommerce shopping cart - basic elements
eCommerce shopping cart - basic elements
 
A beleza dos cisnes
A beleza dos cisnesA beleza dos cisnes
A beleza dos cisnes
 
G08.2014 magic-quadrant-for-unified-threat-management-2014
G08.2014   magic-quadrant-for-unified-threat-management-2014G08.2014   magic-quadrant-for-unified-threat-management-2014
G08.2014 magic-quadrant-for-unified-threat-management-2014
 
How we attracted our target audience
How we attracted our target audienceHow we attracted our target audience
How we attracted our target audience
 
AddReality RetailBanking
AddReality RetailBankingAddReality RetailBanking
AddReality RetailBanking
 
CENTRALSTATE_Diploma_
CENTRALSTATE_Diploma_CENTRALSTATE_Diploma_
CENTRALSTATE_Diploma_
 
Naomi Virginia Lyons
Naomi Virginia LyonsNaomi Virginia Lyons
Naomi Virginia Lyons
 
Plan grand palais visiteur (1)
Plan grand palais visiteur (1)Plan grand palais visiteur (1)
Plan grand palais visiteur (1)
 
REPORT
REPORTREPORT
REPORT
 
Autoevaluación rúbrica atlas
Autoevaluación rúbrica atlasAutoevaluación rúbrica atlas
Autoevaluación rúbrica atlas
 
Correo electrónico y el correo postal
Correo electrónico y el correo postalCorreo electrónico y el correo postal
Correo electrónico y el correo postal
 
supervalu refrence lett zane
supervalu refrence lett zanesupervalu refrence lett zane
supervalu refrence lett zane
 
звіт 1
звіт 1звіт 1
звіт 1
 
CSS Layout
CSS LayoutCSS Layout
CSS Layout
 
EMPLOYEE VOLUNTARY BENEFITS
EMPLOYEE VOLUNTARY BENEFITSEMPLOYEE VOLUNTARY BENEFITS
EMPLOYEE VOLUNTARY BENEFITS
 
Where is my JETPACK? CAG 2012
Where is my JETPACK? CAG 2012Where is my JETPACK? CAG 2012
Where is my JETPACK? CAG 2012
 
Spare Tyre
Spare TyreSpare Tyre
Spare Tyre
 

Similar to Data Analytics Project_Eun Seuk Choi (Eric)

Six Sigma Methods and Formulas for Successful Quality Management
Six Sigma Methods and Formulas for Successful Quality ManagementSix Sigma Methods and Formulas for Successful Quality Management
Six Sigma Methods and Formulas for Successful Quality ManagementIJERA Editor
 
Grid search.pptx
Grid search.pptxGrid search.pptx
Grid search.pptxAbithaSam
 
Six Sigma Methods and Formulas for Successful Quality Management
Six Sigma Methods and Formulas for Successful Quality ManagementSix Sigma Methods and Formulas for Successful Quality Management
Six Sigma Methods and Formulas for Successful Quality ManagementRSIS International
 
Capstone Project - Nicholas Imholte - Final Draft
Capstone Project - Nicholas Imholte - Final DraftCapstone Project - Nicholas Imholte - Final Draft
Capstone Project - Nicholas Imholte - Final DraftNick Imholte
 
Uber Data Analysis - SAS Project
Uber Data Analysis - SAS ProjectUber Data Analysis - SAS Project
Uber Data Analysis - SAS ProjectKushal417
 
Grouped time-series forecasting: Application to regional infant mortality counts
Grouped time-series forecasting: Application to regional infant mortality countsGrouped time-series forecasting: Application to regional infant mortality counts
Grouped time-series forecasting: Application to regional infant mortality countshanshang
 
Auto MPG Regression Analysis
Auto MPG Regression AnalysisAuto MPG Regression Analysis
Auto MPG Regression AnalysisAnirudh Srinath.V
 
Cross-validation aggregation for forecasting
Cross-validation aggregation for forecastingCross-validation aggregation for forecasting
Cross-validation aggregation for forecastingDevon Barrow
 
Machine Learning Foundations Project Presentation
Machine Learning Foundations Project PresentationMachine Learning Foundations Project Presentation
Machine Learning Foundations Project PresentationAmit J Bhattacharyya
 
RBHF_SDM_2011_Jie
RBHF_SDM_2011_JieRBHF_SDM_2011_Jie
RBHF_SDM_2011_JieMDO_Lab
 
Stats computing project_final
Stats computing project_finalStats computing project_final
Stats computing project_finalAyank Gupta
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classificationYanchang Zhao
 
Statistical Model to Predict IPO Prices for Semiconductor
Statistical Model to Predict IPO Prices for SemiconductorStatistical Model to Predict IPO Prices for Semiconductor
Statistical Model to Predict IPO Prices for SemiconductorXuanhua(Peter) Yin
 
lanen_5e_ch05_student.ppt
lanen_5e_ch05_student.pptlanen_5e_ch05_student.ppt
lanen_5e_ch05_student.pptSURAJITDASBAURI
 
Regression Analysis of NBA Points Final
Regression Analysis of NBA Points  FinalRegression Analysis of NBA Points  Final
Regression Analysis of NBA Points FinalJohn Michael Croft
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionDario Panada
 
WCSMO-ModelSelection-2013
WCSMO-ModelSelection-2013WCSMO-ModelSelection-2013
WCSMO-ModelSelection-2013OptiModel
 
ModelSelection1_WCSMO_2013_Ali
ModelSelection1_WCSMO_2013_AliModelSelection1_WCSMO_2013_Ali
ModelSelection1_WCSMO_2013_AliMDO_Lab
 

Similar to Data Analytics Project_Eun Seuk Choi (Eric) (20)

Six Sigma Methods and Formulas for Successful Quality Management
Six Sigma Methods and Formulas for Successful Quality ManagementSix Sigma Methods and Formulas for Successful Quality Management
Six Sigma Methods and Formulas for Successful Quality Management
 
Grid search.pptx
Grid search.pptxGrid search.pptx
Grid search.pptx
 
Six Sigma Methods and Formulas for Successful Quality Management
Six Sigma Methods and Formulas for Successful Quality ManagementSix Sigma Methods and Formulas for Successful Quality Management
Six Sigma Methods and Formulas for Successful Quality Management
 
IPL Data Analysis using Data Science
IPL Data Analysis using Data ScienceIPL Data Analysis using Data Science
IPL Data Analysis using Data Science
 
Capstone Project - Nicholas Imholte - Final Draft
Capstone Project - Nicholas Imholte - Final DraftCapstone Project - Nicholas Imholte - Final Draft
Capstone Project - Nicholas Imholte - Final Draft
 
Uber Data Analysis - SAS Project
Uber Data Analysis - SAS ProjectUber Data Analysis - SAS Project
Uber Data Analysis - SAS Project
 
Grouped time-series forecasting: Application to regional infant mortality counts
Grouped time-series forecasting: Application to regional infant mortality countsGrouped time-series forecasting: Application to regional infant mortality counts
Grouped time-series forecasting: Application to regional infant mortality counts
 
Auto MPG Regression Analysis
Auto MPG Regression AnalysisAuto MPG Regression Analysis
Auto MPG Regression Analysis
 
Cross-validation aggregation for forecasting
Cross-validation aggregation for forecastingCross-validation aggregation for forecasting
Cross-validation aggregation for forecasting
 
Machine Learning Foundations Project Presentation
Machine Learning Foundations Project PresentationMachine Learning Foundations Project Presentation
Machine Learning Foundations Project Presentation
 
RBHF_SDM_2011_Jie
RBHF_SDM_2011_JieRBHF_SDM_2011_Jie
RBHF_SDM_2011_Jie
 
Stats computing project_final
Stats computing project_finalStats computing project_final
Stats computing project_final
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
 
Statistical Model to Predict IPO Prices for Semiconductor
Statistical Model to Predict IPO Prices for SemiconductorStatistical Model to Predict IPO Prices for Semiconductor
Statistical Model to Predict IPO Prices for Semiconductor
 
lanen_5e_ch05_student.ppt
lanen_5e_ch05_student.pptlanen_5e_ch05_student.ppt
lanen_5e_ch05_student.ppt
 
Regression Analysis of NBA Points Final
Regression Analysis of NBA Points  FinalRegression Analysis of NBA Points  Final
Regression Analysis of NBA Points Final
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point Detection
 
AUTO MPG Regression Analysis
AUTO MPG Regression AnalysisAUTO MPG Regression Analysis
AUTO MPG Regression Analysis
 
WCSMO-ModelSelection-2013
WCSMO-ModelSelection-2013WCSMO-ModelSelection-2013
WCSMO-ModelSelection-2013
 
ModelSelection1_WCSMO_2013_Ali
ModelSelection1_WCSMO_2013_AliModelSelection1_WCSMO_2013_Ali
ModelSelection1_WCSMO_2013_Ali
 

Data Analytics Project_Eun Seuk Choi (Eric)

  • 1. Eun Seuk Choi (Eric) Statistical Methods & Data Analytics Final Project Professor Alan Huebner December 10, 2015 <Analysis on NBA Real Plus-Minus for 2014-2015 Regular Seasons>
  • 2. Table of Contents 1. Introduction a. Describe data b. About variables c. Purpose of analysis 2. Data a. More details about data b. The source of data 3. Regression Analysis a. Exploratory data analysis i. Scatterplots of each of X variables vs. Y variable ii. Most highly correlated X variables b. Linear Regression Analysis i. Fit a full model and report the 𝑅2 ii. Conduct one F-test to test for the removal of a subset of variables iii. Use all stepAIC() iv. Find outliers v. Choose the “final” model vi. Perform model diagnostics (a residuals vs. fitted plot and a normal plot) vii. Validate the model by cross validation or bootstrapping 4. Results a. Three inferences about the final model and importance of each inference i. A confidence interval for a fitted value ii. A prediction interval for a fitted value iii. A confidence interval for one or more slope parameters 5. Conclusion a. How well the model describes Y variable b. Factors that can improve the predictive power of the model
  • 3. 1. Introduction a. Describe data The file “NBA real plus-minus for 2014-2015 regular seasons” contains the data extracted from ESPN.com, about an individual NBA player’s influence on his team’s wins by analyzing the number of games he played during the season, the number of minutes he played on each game, on-court team offensive performance, and on-court team defensive performance. Data consists of 474 NBA players who played for at least one game for 2014- 2015 regular seasons. b. About variables There are 5 variables in total: GP, M, ORPM, DRPM, and WINS. While WINS is the response variable, all the other 4 variables are predictor variables. GP is the number of games played for 2014-2015 regular seasons out of 82 games. M is minutes per game for each player. ORPM is player’s estimated on-court impact on team’s offensive performance, measured in points scored per 100 offensive possessions, while DRPM is player’s estimated on-court impact on team’s defensive performance. WINS provides an estimate of the number of wins each player has contributed to his team’s win total on the season. WINS includes the player’s Real Plus-Minus and his number of possessions played. c. Purpose of analysis By interpreting the result of linear regressions on those 5 variables (WINS for the response variable and the other four for predictor variables), I want to find out primary factors that positively affect WINS. I will find the optimal model to predict WINS by conducting F-test to remove a subset of variables from the model, observe outliers within
  • 4. data, perform model diagnostics on my final model, and validate it using cross validation. Based on these interpretations, I will make inferences pertinent to my topic by using combinations of a confidence interval for a fitted value and a confidence interval for slope parameters. With an evaluation about my final model, I will finish this project by finding a way to improve the predictive power of the model. 2. About Data a. More details about data The data was extracted from ESPN.com website. Original data includes 6 variables, which include 5 variable mentioned above plus RPM, but I excluded it since RPM is just ORPM+DRPM. RPM has a perfect correlation with ORPM+DRPM, so there is no need to include RPM on my model. b. The source of data The source of data is Basketball-Reference.com. It provided play-by-play data to ESPN and Data Analysts on ESPN assembled play-by-play data to construct ORPM and DRPM data with their own ways for 2014-2015 regular seasons. According to ESPN, the ORPM and DRPM model sifts through more than 230,000 possessions each NBA season to tease apart the real plus-minus effects attributable to each player. 3. Regression Analysis a. Exploratory data analysis i. Scatterplots of each of X variables vs. Y variable RPM1<-read.table("NBARPM.txt",header=T)
  • 5. 0 20 40 60 80 -505101520 GP WINS 0 10 20 30 40 -505101520 M WINS -4 -2 0 2 4 6 8 -505101520 ORPM WINS attach(RPM1) 1) plot(GP,WINS) 2) plot(M,WINS) 3) plot(ORPM,WINS) ii. Most highly correlated X variables cor(cbind(GP,M,ORPM,DRPM)) According to the correlation matrix, GP and M are most highly correlated X variables (With cor = 0.66) b. Linear Regression Analysis
  • 6. i. Fit a full model and report the 𝑅2 mod.RPM<-lm(WINS~GP+M+ORPM+DRPM) summary(mod.RPM) R^2 = 0.8575, Adjusted R^2 = 0.8563 ii. Conduct one F-test to test for the removal of a subset of variables Given mod.RPM is a full model, I want to find out if the set of three variables, M, ORPM, DRPM can be removed in my model by conducting F-test for comparing nested models. mod.reduced<-lm(WINS~GP) summary(mod.reduced) SSE.r<-sum(mod.reduced$residuals^2) SSE.c<-sum(mod.RPM$residuals^2) F<-((SSE.r-SSE.c)/(4-1))/(SSE.c/(474-(4+1))) #F=763.4384 pf(763.4384,3,470,lower.tail=F) # very low p-value Given very low p-value, we reject the null and cannot remove a group of 3 predictors from the model. iii. Use all stepAIC() library(MASS)
  • 7. optimal.bp <- stepAIC(mod.RPM) optimal.bp$anova Initial Model : WINS~GP+M+ORPM+DRPM Final Model : WINS~GP+M+ORPM+DRPM iv. Find outliers rstandard(mod.RPM) I found out that two players, Draymond Green (121st value) and Stephen Curry (421st value), have z-score >3. They are outliers. v. Choose the “final” model I chose the intial model (WINS~GP+M+ORPM+DRPM) to be the final model since it has the highest adjusted R^2 among combinations of other variables. Adjusted R^2 for WINS~M+GP+ORPM+DRPM = 0.8563 Adjusted R^2 for WINS~M+GP+ORPM = 0.6335 Adjusted R^2 for WINS~M+GP+DRPM = 0.5572 Adjusted R^2 for WINS~GP+ORPM+DRPM = 0.8502, and so on. The initial model has the highest adjusted R^2. In addition, according to StepAIC function, the initial model is the optimal model for this data. vi. Perform model diagnostics (a residuals vs. fitted plot and a normal plot) plot(mod.RPM$fitted.values,mod.RPM$residuals)
  • 8. -5 0 5 10 -4-20246 mod.RPM$fitted.values mod.RPM$residuals -5 0 5 10 -4-20246 mod.RPM3$fitted.values mod.RPM3$residuals Since the plot does not have a random pattern, I changed the model, reflecting the result in plot(GP,WINS) and plot(M,WINS). Since those two plots have quadratic pattern, I tried with GP^2 and M^2 for the new model. GP1<-GP^2 M1<-M^2 mod.RPM3<-lm(WINS~(GP1)+(M1)+ORPM+DRPM) summary(mod.RPM3) plot(mod.RPM3$fitted.values,mod.RPM3$residuals) However, I got the similar graph as above, meaning that the assumption that residuals are normal might not hold for my model. Additionally, I tried to obtain a better
  • 9. -3 -2 -1 0 1 2 3 -4-20246 Normal Q-Q Plot Theoretical Quantiles SampleQuantiles Histogram of mod.RPM$residuals mod.RPM$residuals Frequency -4 -2 0 2 4 6 8 050100150 plot by trying quadratic, log, exponential transformation on my parameters, but I could not find a better one than the original model. Therefore, I decided to stick with my original model. qqnorm(mod.RPM$residuals) On the other hand, qqnorm(mod.RPM$residuals) has approximately linear increasing function (nearly straight line), which indicates that residuals might be normal. hist(mod.RPM$residuals)
• 10. In addition, the histogram of the residuals has approximately a bell shape, supporting the claim that the residuals are normal.
vii. Validate the model by cross validation
Using cross validation (code is attached in the Appendix), rsquared.Group2 = 0.851 and rsquared.Group1 = 0.845. Since the mean of these two values, 0.848, is close to the R^2 = 0.8575 of the final model, I concluded that the model is valid.
4. Results
a. Three inferences about the final model and importance of each inference
i. A confidence interval for a fitted value
I chose to compute a 95% confidence interval for the mean WINS for all players who have ORPM = 0, DRPM = 0, M = 20.43, and GP = 54.29.
predict(mod.RPM, data.frame(ORPM=0, DRPM=0, M=20.43, GP=54.29), interval="confidence", level=0.95)
The result shows that the mean WINS for all players with ORPM = 0, DRPM = 0, M = 20.43, and GP = 54.29 falls within [2.705, 2.968] with 95% confidence. I chose ORPM = 0 and DRPM = 0 because these values are greater than mean(ORPM) = -0.646 and mean(DRPM) = -0.278, and because ORPM = DRPM = 0 is where a player breaks even in his offensive and defensive contributions to the team. Mean(M) = 20.43 and mean(GP) = 54.29 were chosen as the fitted values for M and GP so that the WINS value can be compared more directly across ORPM and DRPM values. I conclude that the mean WINS for all players with ORPM = 0, DRPM = 0, and average GP and M values, who perform better than the average on both ends of the floor, falls within [2.705, 2.968].
ii. A prediction interval for a fitted value
I chose to compute a 95% prediction interval for the WINS of a “new” player who has ORPM = 0, DRPM = 0, M = 20.43, and GP = 54.29.
• 11. predict(mod.RPM, data.frame(ORPM=0, DRPM=0, M=20.43, GP=54.29), interval="prediction", level=0.95)
The result indicates that the WINS of a single “new” player with ORPM = 0, DRPM = 0, M = 20.43, and GP = 54.29 falls within [0.183, 5.490] with 95% confidence. I chose ORPM = 0, DRPM = 0, M = 20.43, and GP = 54.29 for the same reasons as in the confidence interval above. I conclude that the WINS of a new player with ORPM = 0, DRPM = 0, and average GP and M values, who performs better than the average on both ends of the floor, falls within [0.183, 5.490]. Note that the prediction interval is much wider than the confidence interval because it accounts for the variability of an individual observation, not just the uncertainty in the estimated mean.
iii. A confidence interval for one or more slope parameters
I chose to compute a 95% confidence interval for the ORPM slope.
Lower = 1.142995 - 1.96 * 0.03653 = 1.071396
Upper = 1.142995 + 1.96 * 0.03653 = 1.214594
Therefore, I am 95% confident that the slope of ORPM falls within [1.071396, 1.214594]. Since this interval does not contain 0, I can conclude that ORPM is a significant predictor in this model. This is also verified by the low p-value for the ORPM coefficient.
5. Conclusion
a. How well the model describes Y variable
In general, I found that my model describes my response variable (WINS) satisfactorily, as the model has R^2 and adjusted R^2 values of about 0.85. In particular, the results are consistent with my intuition that WINS increases as GP, M, ORPM, and DRPM increase, and that the increase in WINS is most strongly driven by ORPM and DRPM, which have larger slopes than GP and M.
b. Factors that can improve the predictive power of the model
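The manual ±1.96·SE interval above uses the normal approximation; R's confint() gives the exact t-based interval directly. A sketch on the built-in mtcars data (with the project model, the call would be confint(mod.RPM, "ORPM", level = 0.95)):

```r
# t-based confidence interval for a single slope, via confint().
# mtcars stands in for the project data; "wt" stands in for "ORPM".
mod <- lm(mpg ~ wt + hp, data = mtcars)
confint(mod, "wt", level = 0.95)   # 95% CI for the wt slope
```

With n = 474 the t and normal quantiles are nearly identical, so the manual interval and confint() would agree to several decimal places.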
• 12. It would be better if I could find a model with a random pattern in its residuals vs. fitted values plot. I transformed some of my predictor variables in search of a better fit, but I could not find a model better than my final one. Additionally, if I had PER (Player Efficiency Rating) as one of my predictor variables, the predictive power of the model might have increased, since PER also has a positive correlation with WINS. If PER turned out not to correlate strongly with my original predictor variables, adding it would allow a fuller interpretation of how each player’s performance affects WINS.
• 13. <Appendix>
# attach data
RPM1 <- read.table("NBARPM.txt", header=T)
attach(RPM1)
# scatterplots of each of the predictor variables vs. WINS
plot(GP, WINS)
plot(M, WINS)
plot(ORPM, WINS)
# correlation matrix among predictor variables
cor(cbind(GP, M, ORPM, DRPM))
# Fit full model using all X's and report R^2
mod.RPM <- lm(WINS ~ GP + M + ORPM + DRPM)
summary(mod.RPM)
# Use a reduced model to conduct F-test
mod.reduced <- lm(WINS ~ GP)
summary(mod.reduced)
SSE.r <- sum(mod.reduced$residuals^2)
SSE.c <- sum(mod.RPM$residuals^2)
• 14. F <- ((SSE.r - SSE.c)/(4-1)) / (SSE.c/(474-(4+1)))
F
pf(763.4384, 3, 470, lower.tail=F)
# pf returned 1 in my session, which seems wrong. However, restarting R and
# rerunning pf(763.4384, 3, 470, lower.tail=F) gives 3.28e-180, the true value.
# Use stepAIC to find the final (optimal) model
library(MASS)
optimal.bp <- stepAIC(mod.RPM)
optimal.bp$anova  # display results
# Find outliers
rstandard(mod.RPM)
# model diagnostics (fitted values vs. residuals and normal plot)
plot(mod.RPM$fitted.values, mod.RPM$residuals)
qqnorm(mod.RPM$residuals)
hist(mod.RPM$residuals)
# model diagnostics with the transformed model (fitted values vs. residuals and normal plot)
GP1 <- GP^2
M1 <- M^2
• 15. mod.RPM3 <- lm(WINS ~ GP1 + M1 + ORPM + DRPM)
summary(mod.RPM3)
plot(mod.RPM3$fitted.values, mod.RPM3$residuals)
qqnorm(mod.RPM3$residuals)
# validate the final model by using cross validation
set.seed(5)
# obtain total sample size
n <- dim(RPM1)[1]
Group1.index <- sample(1:n, round(n/2), replace=F)
Group2.index <- setdiff(1:n, Group1.index)
Group1 <- RPM1[Group1.index, ]
Group2 <- RPM1[Group2.index, ]
# Fit a linear model on Group1 and a separate one on Group2
mod.Group1 <- lm(WINS ~ GP + M + ORPM + DRPM, data=Group1)
mod.Group2 <- lm(WINS ~ GP + M + ORPM + DRPM, data=Group2)
### Compute fitted values on Group2 using the model fit on Group1
fitted.Group2 <- NULL
for (i in 1:dim(Group2)[1]) {
  fitted.Group2 <- c(fitted.Group2,
    mod.Group1$coef[1] + mod.Group1$coef[2]*Group2$GP[i]
    + mod.Group1$coef[3]*Group2$M[i]
    + mod.Group1$coef[4]*Group2$ORPM[i]
    + mod.Group1$coef[5]*Group2$DRPM[i])
}
## Now, compute R^2 comparing these Group2 fitted values to the Group2 y's.
## Use the formula 1 - (SSE/SSTo)
rsquared.Group2 <- 1 - sum((Group2$WINS - fitted.Group2)^2) /
  sum((Group2$WINS - mean(Group2$WINS))^2)
rsquared.Group2
• 16. ### Compute fitted values on Group1 using the model fit on Group2
fitted.Group1 <- NULL
for (i in 1:dim(Group1)[1]) {
  fitted.Group1 <- c(fitted.Group1,
    mod.Group2$coef[1] + mod.Group2$coef[2]*Group1$GP[i]
    + mod.Group2$coef[3]*Group1$M[i]
    + mod.Group2$coef[4]*Group1$ORPM[i]
    + mod.Group2$coef[5]*Group1$DRPM[i])
}
## Now, compute R^2 comparing these Group1 fitted values to the Group1 y's.
## Use the formula 1 - (SSE/SSTo)
rsquared.Group1 <- 1 - sum((Group1$WINS - fitted.Group1)^2) /
  sum((Group1$WINS - mean(Group1$WINS))^2)
rsquared.Group1
### Compute the mean of both R^2 values
mean(c(rsquared.Group2, rsquared.Group1))
# A confidence interval for a fitted value
predict(mod.RPM, data.frame(ORPM=0, DRPM=0, M=20.43, GP=54.29), interval="confidence", level=0.95)
# A prediction interval for a fitted value
predict(mod.RPM, data.frame(ORPM=0, DRPM=0, M=20.43, GP=54.29), interval="prediction", level=0.95)
# A confidence interval for one or more slope parameters is calculated manually from summary(mod.RPM)