Explore the effect of offensive and defensive team productivity in the NBA on wins, using 10+ years of NBA regular season data (2002 – 2013).
Key words: data normalization; directional hypotheses; feature engineering; OLS regression; web scraping
Radu Stancut
Applied Data Science
Foundations Project Outline/Notes
Introduction
This Foundations paper looks at the performance of professional basketball teams in
the National Basketball Association (NBA) over the past ten years and investigates the
relationship between offensive and defensive team statistics and win totals, and by extension
playoff appearances. Specifically, I look at field goal percentage (FG%) on both
offense, the proficiency with which a team scores baskets, and defense, the effectiveness with
which it keeps the other team from scoring. Above-average performance in either category is
expected to have a positive influence on winning games, but just how much remains to
be seen.
Statistical analysis and sports have become increasingly intertwined in the public
imagination over the last decade. The reader may be aware of Moneyball, either the book or the
movie, if not both, which describes how Billy Beane, General Manager of Major League
Baseball’s (MLB) Oakland Athletics, was driven by a limited payroll to use
statistical analysis to find underappreciated assets, in this case ballplayers, with which to
build several winning teams. This quantification of sports has in fact been going on for several
decades, from Bill James’ Baseball Abstract in the 1980s through to Nate Silver’s
ESPN-backed website, fivethirtyeight.com, and the MIT Sloan Sports Analytics
Conference.
This paper is meant to provide a surface-level investigation into some such sports statistics and
their influence on team outcomes. Below I describe the data collected, its analysis, and my findings
and conclusions, followed by ideas on next steps.
Data Description
NBA team statistics for regular seasons 2005 – 2014 were collected in the form of CSV
files from Basketball-Reference.com; two tables per season were collected, one for stats on
offense and the other on defense. The tables consisted of team names and a breakdown of team
performance by minutes, points, assists, rebounds, field goal percentages, and other categories (a
full list may be found in the appendix). In addition to the team season performance metrics I also
collected the win/loss standings and conference/division information for each season in an Excel
document that I later formatted for merging.
The offense and defense statistics were kept separate from one another but the ten
seasons had to be combined into one data frame for analysis. This combining of different seasons
was done via a ‘for loop’ on import within Python. The NBA team information of win/loss
records and conference/division data was also imported and subsequently merged, leading to two
data frames, one each for offense and defense, of 300 instances (30 teams, ten years) with
varying performance measurements (list in appendix).
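The import-and-merge step described above can be sketched roughly as follows. This is a minimal illustration, not the original script: it assumes pandas, and the team names, values, and in-memory CSV strings are made-up stand-ins for the per-season files and the standings document.

```python
import io
import pandas as pd

# Made-up stand-ins for the per-season CSV exports (team names and values are
# illustrative, not real data); real files would be read from disk instead.
season_csvs = {
    2005: "Team,FG%\nTeam A,0.460\nTeam B,0.440\n",
    2006: "Team,FG%\nTeam A,0.470\nTeam B,0.450\n",
}
standings_csv = (
    "Team,Year,Wins,Losses,Conference,Division\n"
    "Team A,2005,50,32,East,Atlantic\n"
    "Team B,2005,32,50,West,Pacific\n"
    "Team A,2006,45,37,East,Atlantic\n"
    "Team B,2006,37,45,West,Pacific\n"
)

# Stack the seasons into one data frame with a for loop on import.
frames = []
for year, csv_text in season_csvs.items():
    season = pd.read_csv(io.StringIO(csv_text))
    season["Year"] = year  # tag each row with its season before stacking
    frames.append(season)
offense = pd.concat(frames, ignore_index=True)

# Attach win/loss records and conference/division data per team and season.
standings = pd.read_csv(io.StringIO(standings_csv))
offense = offense.merge(
    standings, on=["Team", "Year"], how="left", validate="one_to_one"
)
print(offense)
```

The same loop would be repeated for the defensive tables. The `validate="one_to_one"` argument is a safeguard chosen for this sketch: it raises an error if a team-season pair appears more than once on either side of the merge.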
Several calculations and identification steps were taken prior to running regression analysis:
• The winning percentage (Win%) of each team, by year, was calculated and placed into a
new column.
• A column was created to flag whether a team had a winning record (Win% > 0.5).
• Playoff teams were identified and flagged in a new binary column (Playoffs).
• FG% was normalized by grouping along Year, Year & Conference, and Year & Division,
and setting each group’s average to 1.0 to allow for comparative analysis across seasons. A
rating above 1.0 in FG% on offense meant better than average; the reverse was true for
defensive FG%, where a team had to be below 1.0 to be ‘above’ average in keeping opponents
from scoring (team examples in appendix).
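The calculations listed above might look something like the sketch below. It assumes pandas and assumes that "setting the average to 1.0" means dividing each team's FG% by the mean FG% of its group; the helper name `add_index` and all values are hypothetical.

```python
import pandas as pd

# Made-up team-season rows (not real data).
teams = pd.DataFrame({
    "Team": ["A", "B", "C", "D"],
    "Year": [2005, 2005, 2005, 2005],
    "Conference": ["East", "East", "West", "West"],
    "Wins": [50, 32, 60, 22],
    "Losses": [32, 50, 22, 60],
    "FG%": [0.46, 0.44, 0.48, 0.42],
})

# Winning percentage and a flag for a winning record.
teams["Win%"] = teams["Wins"] / (teams["Wins"] + teams["Losses"])
teams["Winning Record"] = teams["Win%"] > 0.5

def add_index(df, keys, name):
    """Hypothetical helper: scale FG% so each group's mean equals 1.0."""
    group_mean = df.groupby(keys)["FG%"].transform("mean")
    df[name] = df["FG%"] / group_mean

add_index(teams, ["Year"], "FG% IDX_Year")
add_index(teams, ["Year", "Conference"], "FG% IDX_Conf")
print(teams[["Team", "Win%", "Winning Record", "FG% IDX_Year", "FG% IDX_Conf"]])
```

A division-level index would follow the same pattern with `["Year", "Division"]` as the grouping keys.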
Descriptive Statistics
To provide a general picture and context to the findings described in the next section, I have
outlined some exploratory numbers on the data here:
• The distribution of wins among NBA teams over this ten-year range is approximately
normal (numeric summary in appendix).
• Winning teams had the following min, max, and mean indexed FG% over ten years:
o Offense – 0.945 (min/worst); 1.102 (max/best); 1.015 (mean)
o Defense – 0.917 (min/best); 1.030 (max/worst); 0.982 (mean)
• Losing teams had the following min, max, and mean indexed FG% over ten years:
o Offense – 0.924 (min/worst); 1.044 (max/best); 0.984 (mean)
o Defense – 0.939 (min/best); 1.082 (max/worst); 1.018 (mean)
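A summary like the one above could be produced with a grouped aggregation. The sketch below assumes pandas and uses made-up values; the `Record` label column is introduced here only for illustration.

```python
import pandas as pd

# Made-up team-season rows (not real data).
teams = pd.DataFrame({
    "Win%": [0.61, 0.55, 0.45, 0.39],
    "FG% IDX_Year": [1.04, 1.00, 0.98, 0.94],
})

# Label each row as a winning or losing team, then summarize the index.
teams["Record"] = teams["Win%"].gt(0.5).map({True: "winning", False: "losing"})
summary = teams.groupby("Record")["FG% IDX_Year"].agg(["min", "max", "mean"])
print(summary)
```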
Methods & Analysis
As seen in the descriptive statistics above, the expectation that above-average
performance in either category would have a positive influence on winning
games has some superficial merit. To delve deeper I created scatter plots (see appendix),
by offense and defense, for FG% indexed by year (FG% IDX_Year), by year & conference
(FG% IDX_Conf), and by year & division (FG% IDX_Div). NBA teams play most of
their games within conference (52 of 82), and a sizable share of those games are against
division rivals (16 of 52); the grouping of field goal percentages was meant to show whether
any of these performance indexes was a better predictor of wins.
The slopes of the fitted lines for each indexed FG% group, league-wide, were:
• Offense
o FG% IDX_Year: 3.011
o FG% IDX_Conf: 3.048
o FG% IDX_Div: 3.419
• Defense
o FG% IDX_Year: -3.500
o FG% IDX_Conf: -3.595
o FG% IDX_Div: -3.667
Two things jump out from the numbers above. First, defensive FG% appeared to have
a bigger influence, as measured by the magnitude of the slope, than offensive FG%. Second,
the more specific indexes, by conference and division, had steeper slopes for both offense
and defense.
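Slopes of this kind can be obtained by fitting a straight line to each index in turn. The sketch below assumes NumPy and pandas, assumes the vertical axis of the scatter plots is Win%, and uses made-up values in which Win% is constructed as an exact line in the year index.

```python
import numpy as np
import pandas as pd

# Made-up index values (not real data).
teams = pd.DataFrame({
    "FG% IDX_Year": [0.96, 0.98, 1.00, 1.02, 1.04],
    "FG% IDX_Conf": [0.95, 0.99, 1.00, 1.01, 1.05],
})
# Win% is built as an exact line in the year index, so that slope is known.
teams["Win%"] = 3.0 * teams["FG% IDX_Year"] - 2.5

# Fit a degree-1 polynomial (a straight line) for each index column.
slopes = {}
for col in ["FG% IDX_Year", "FG% IDX_Conf"]:
    slope, intercept = np.polyfit(
        teams[col].to_numpy(), teams["Win%"].to_numpy(), deg=1
    )
    slopes[col] = slope

print(slopes)
```

Under the Win% assumption, a slope near 3 would mean that an index 0.01 above the group average goes with a Win% roughly 0.03 higher.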
Following the scatter plots and fitted lines, a linear regression analysis was run; the
dependent variable in all instances was Win%, with the independent variable alternating between
the offensive and defensive FG% IDX fields. In this instance there was little to no difference in
results between league, conference, and division comparisons, so only the broadest measure,
league-wide, is outlined below and included in the appendix:
• Offense
o Coef: 0.5025; Std err: 0.008; R-squared: 0.924
• Defense
o Coef: 0.496; Std err: 0.0010; R-squared: 0.901
The results indicate that both offensive and defensive FG% performance numbers do a good job
of explaining winning in the NBA. Of course, much like the influence of education on wages
discussed in class, this could be due to other factors being bundled within these measurements.
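For reference, a single-predictor OLS fit of Win% on an FG% index can be computed as in the sketch below. It assumes NumPy and made-up data, and it includes an intercept; whether the original regression included one is not stated, so the numbers here are not meant to reproduce those above.

```python
import numpy as np

# Made-up values (not real data): FG% index and Win% for five team-seasons.
x = np.array([0.95, 0.98, 1.00, 1.02, 1.05])
y = np.array([0.35, 0.45, 0.50, 0.56, 0.64])

# Design matrix with an intercept column (an assumption of this sketch).
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # beta[0] intercept, beta[1] slope

# R-squared: share of the variance in Win% explained by the fitted line.
fitted = X @ beta
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Standard errors from the residual variance and (X'X)^-1.
dof = len(y) - X.shape[1]
cov = (ss_res / dof) * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov))

print(beta, std_err, r_squared)
```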
Discussion & Next Steps
Taking up the thread of bundled influences above, this research would need to be expanded to
other team performance statistics to properly weigh the impact that FG% alone has on winning.
Additionally, the analysis here dealt only with end-of-season numbers and does not
purport to be predictive for in-season numbers. This is another area for further research:
for example, how does a team’s performance up to the All-Star break (about halfway
into the season) factor into the remaining games and the final win/loss record?
Appendix
List of Fields Used in Analysis, Including ‘Index’ Fields Created
'2P'        'Playoffs'
'2P%'       'STL'
'2PA'       'TOV'
'3P'        'TRB'
'3P%'       'Team'
'3PA'       'Team2'
'AST'       'Wins'
'BLK'       'Year'
'DEF Rk'    'Conference'
'DRB'       'Division'
'FG'        'Win%'
'FG%'       'Winning Record'
'FGA'       'FG% Year'
'FT'        '2P% Year'
'FT%'       '3P% Year'
'FTA'       'FG% Conf'
'G'         '2P% Conf'
'Losses'    '3P% Conf'
'MP'        'FG% Div'
'ORB'       '2P% Div'
'PF'        '3P% Div'
'PTS'       'FG% IDX_Year'
'PTS/G'     'FG% IDX_Conf'
            'FG% IDX_Div'
NBA Team Win Description (Histogram Companion)
count 300.000000
mean 40.196667
std 12.610899
min 7.000000
25% 31.750000
50% 41.000000
75% 50.000000
max 67.000000