Bank Shots to Bankroll
Joseph DeLay, Adam Pescatore, Zach Meyer, Lucas Howes, Josh Danker
The University of Iowa College of Liberal Arts and Sciences
Abstract
The goal of our research is to determine, based on win shares (WS) and player
efficiency rating (PER), what the expected salary should be for each player in the NBA.
We then want to compare that number to what players were actually paid during the
2013-14 NBA season to see whether players were paid what their statistics say they
deserved, which teams were best at spending money in general, and whether there is
anything else about NBA pay that can be learned from the data.
Introduction
The data we are using comes from a spreadsheet on industry-leading analyst Nate Brixius’
blog, complete with statistics from the 2013-14 NBA season, such as games played,
minutes played, field goals made, field goals attempted, and many more (Brixius). We
simply added minutes per game, PER, and player salaries. We did not use every single
NBA player in our data set. In order to be eligible for a PER, a player must have played
an average of 6.09 minutes per game throughout the season, so that left us with 337
data points, or eligible players. We want to conduct an analysis of WS and PER and
create a formula using those two variables and league average salary to determine what
players at certain levels of those statistics should be paid.
Now, this does not mean that we are simply looking to see if any teams are saving
money by paying players less than what they deserve. If we determine a specific player
deserves $8 million a year, but the team is paying him $6 million, then clearly the player
is being underpaid, and we would count him among the underpaid players. We assume
that there is a big difference between the surplus that the overpaid players are receiving
and the deficit for the underpaid players, but we want to make sure we weed out the
noise, meaning the salary differences that are negligible as a percentage of the league
average salary, before we begin our analysis. We will now explain the two statistics we
are using, PER and win shares, as well as the NBA salary cap, in more depth.
PER Explained
PER stands for player efficiency rating. PER is an advanced statistic that measures
a player’s effectiveness on a per-minute basis while accounting for the pace
at which a team plays. Created by statistician John Hollinger, PER is a formula that
includes but is not limited to “positive accomplishments such as field goals, free throws,
3-pointers, assists, rebounds, blocks, and steals, and negative ones such as missed shots,
turnovers and personal fouls” (Hollinger, 2011). PER is an effective statistic to use
because it can compare two players even if there is a significant minutes gap, meaning
one player is on the court significantly longer than another player. One flaw in PER is
that it cannot measure a player’s defensive efficiency. However, PER is useful because it
can “summarize a player’s statistical accomplishments in a single number” (Hollinger,
2011).
Win Shares Explained
According to David Corby of Basketball Reference, win shares is a statistic that attempts
to “credit a player’s total measurable contribution to his team’s win total during the
season” (Casciaro 2015). Unlike PER, win shares can measure both offensive and defensive
productivity from a player. Statistics on offense include field goals, assists, free throws,
and offensive rebounds that lead to points. However, defensive statistics are not
measured as easily as offensive statistics. One aspect of defense that is measured to
determine win shares is a “stop.” A stop is generally given to a player who gets a steal,
block or defensive rebound. Thus, when factoring defense into win shares, the main
criterion measured is how often a team gets a stop, as well as which player forces the
stop. One advantage win shares has over PER is that it measures the player’s total
productivity, instead of the player’s per minute productivity. In other words, win shares
tells us how much a player produces, and PER tells us how efficient a player is.
Salary Cap Explained
Since the statistics used in our research are from the 2013-14 NBA season, the salary
cap we use will also be from that season because the salary cap changes every year.
According to senior NBA analyst Sekou Smith, the salary cap for the 2013-14 NBA season
was $58.679M (Smith). Furthermore, the minimum a team was required to spend was
$52.811M, or 90 percent of the salary cap. However, the salary cap is not necessarily the
maximum a team can spend on players. The maximum a team can spend without a
penalty is called the tax level, which was set at $71.748 million. However, if a team
exceeds the tax level, they will have to pay the NBA the following fees (provided by
Sekou Smith):
• Portion of team salary $0-$4.99 million over tax level: $1.50 for $1
• Portion of team salary $5-$9.99 million over tax level: $1.75 for $1
• Portion of team salary $10-$14.99 million over tax level: $2.50 for $1
• Portion of team salary $15-$19.99 million over tax level: $3.25 for $1
• Rates increase by $0.50 for each additional $5 million of team salary above the tax
level.
For example, if a team exceeds the tax level by $2 million, it must pay the NBA a fee of
$3 million ($1.50 per $1). A team that exceeds the initial $58.679 million salary cap but
stays under the tax level pays no penalty; only teams that exceed the tax level owe the
fees in the bullet points above.
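To make the bracket arithmetic concrete, here is a minimal Python sketch of the schedule above (the function names and the convention of working in millions of dollars are our own illustration, not official NBA tooling):

TAX_LEVEL = 71.748  # 2013-14 tax level, in millions of dollars

def bracket_rate(i):
    """Tax rate for the i-th $5M bracket above the tax level."""
    base = [1.50, 1.75, 2.50, 3.25]
    # Past the listed brackets, the rate rises by $0.50 per additional $5M.
    return base[i] if i < len(base) else 3.25 + 0.50 * (i - 3)

def luxury_tax(team_salary):
    """Tax owed (in $M) for a given team salary (in $M)."""
    overage = team_salary - TAX_LEVEL
    tax, i = 0.0, 0
    while overage > 0:
        portion = min(5.0, overage)   # amount falling in this $5M bracket
        tax += portion * bracket_rate(i)
        overage -= portion
        i += 1
    return tax

# The worked example above: $2M over the tax level owes 2 * $1.50 = $3M.
print(luxury_tax(TAX_LEVEL + 2.0))  # 3.0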
Background
Throughout our project we used a concept called Topological Data Analysis. This is a
mathematical technique that involves creating simplicial complexes. We represent
data as a simplicial complex in order to discover its topological attributes. A
simplicial complex is a type of graph built from points, edges, triangles, faces,
tetrahedra, and so on, as seen in Figure 1. One way to create the edges is based upon
how close the points are to each other, using Euclidean distance. There are many other
ways of measuring how “close” points are, but we will not get into those, since we did
not use them. We use epsilon balls to determine whether points are close: if another
point lies within a distance of radius epsilon, we connect the two points with an edge.
Epsilon is simply a number that we can choose, so having a radius of epsilon is no
different than having a radius of 1 once we choose epsilon to equal 1. We can then use
these edges to build faces. To create a face, all three edges need to be present and
there needs to be a common area of intersection among all three points’ epsilon balls.
If all three points’ epsilon balls don’t overlap then there is a
hole. For example, in Figure 2, the left complex has a hole because the three epsilon
balls do not intersect in the middle, while the right complex is filled in, creating a face,
because all three balls overlap in the center. The simplicial complex on the left is
therefore a triangle with white in the middle (a hole), while the one on the right is
colored in, with no hole.
Figure 1: a simplicial complex built from points, edges, triangles, and tetrahedra (image from http://inperc.com/wiki/index.php?title=Simplicial_homology).
Figure 2: two triangles of epsilon balls, one without and one with a common intersection (mimicked from Isabel Darcy’s PowerPoint 1, slide 23).
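As a concrete illustration of the edge rule just described, the following minimal Python sketch (our own, using the convention above that two points are joined when they lie within distance epsilon) builds the edges of a complex from a point cloud:

import math
from itertools import combinations

def build_edges(points, epsilon):
    """Return index pairs (i, j) whose Euclidean distance is <= epsilon."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) <= epsilon]

# Example: only the first two points are within epsilon of each other.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0)]
print(build_edges(points, epsilon=1.2))  # [(0, 1)]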
In topological data analysis, many inferences can be made from the number of holes
and their location relative to the rest of the data. Making inferences based upon holes
works in higher dimensions as well, not just for two-dimensional triangles and faces.
This way of creating simplicial complexes is called the Čech complex. Another style of
simplicial complex building is called the Vietoris-Rips complex.
Vietoris-Rips Complex
This style of complex construction does not require a common intersection of epsilon
balls to fill in a hole. With this style, if all the edges are connected, you fill in the
complex, and you fill in all complexes of dimension greater than one. This is especially
important when building our persistence diagrams and barcodes, because once all the
edges of a cycle are present, the cycle is filled in and dies. This allows us to find
long-lasting cycles and judge their importance. There are some drawbacks: for example,
in Figure 2, both simplicial complexes would be identical and both would be filled in.
Topology loses some precision this way, but we can still learn a lot from it. The reason
for this construction will become clearer when we discuss homology.
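A minimal sketch of the Vietoris-Rips filling rule in dimension two (our own illustration, not code we used in the project): a triangle is filled as soon as all three of its edges exist, with no common-intersection test.

from itertools import combinations

def rips_triangles(n_points, edges):
    """Return triples (i, j, k) all of whose edges appear in `edges`."""
    edge_set = {frozenset(e) for e in edges}
    return [(i, j, k) for i, j, k in combinations(range(n_points), 3)
            if {frozenset((i, j)), frozenset((j, k)), frozenset((i, k))} <= edge_set]

# All three pairwise edges are present, so the triangle is filled in --
# which is why both complexes in Figure 2 would look identical here.
print(rips_triangles(3, [(0, 1), (1, 2), (0, 2)]))  # [(0, 1, 2)]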
Homology
Homology studies and compares manifolds, their triangulations, simplicial
complexes, and holes. The two complexes above (Čech and Vietoris-Rips) are what we
use to compute homology. Homology also deals with collections of edges called cycles:
a collection of edges is a cycle if you can travel around the edges and get back to your
starting point. In homology we work with Z mod 2 coefficients, meaning all values are
either 0 or 1; if you add two edges that each carry 1 mod 2, the result is 0 mod 2.
Homology is equal to cycles mod boundaries. This method can tell whether a certain
collection of edges forms a circle, a face, a torus, or a ball. While homology deals with
collections of edges, persistent homology goes further by telling us how long these
cycles exist.
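To make the mod 2 arithmetic concrete, here is a tiny Python sketch (our own): representing a chain as a set of edges, addition over Z mod 2 is symmetric difference, so an edge appearing in both chains cancels.

# A chain over Z mod 2 is a set of edges: each edge is present (1) or not (0).
c1 = {frozenset((0, 1)), frozenset((1, 2))}
c2 = {frozenset((1, 2)), frozenset((2, 0))}
# Adding chains is symmetric difference: 1 + 1 = 0 mod 2, so the shared
# edge (1, 2) cancels.
print(c1 ^ c2)  # {frozenset({0, 1}), frozenset({0, 2})}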
Persistent Homology
Persistent homology has many of the same characteristics as homology. Instead of
using a single distance for our epsilon balls, we gradually enlarge the balls to
determine which cycles persist over time. These persistent cycles tell us the
importance of the shapes and holes by showing how long each cycle lasts. With our
data, we are trying to determine which of these cycles are noise and which are
important to us. We determine this using two tools: barcodes and persistence
diagrams.
Barcodes
Barcodes help researchers better understand clusters of certain dimensions in their
data. They are based upon when a cycle starts and when it gets close enough to
another cycle to be absorbed into it; this absorption is called the death of the cycle.
A barcode is drawn on an axis by taking the time the cycle is created and drawing a
line that stretches from the time the cycle is born to the time it dies. An important
note is that a barcode is based upon a filtered complex: time determines when each
point and edge comes into existence, because we grow our epsilon balls gradually in
steps. When one epsilon ball intersects another, the younger of the two cycles dies.
Because of this we can tell when cycles start and end, which tells us how important
they are. This, coupled with persistence diagrams, can tell us a lot about our data.
The three barcodes we computed are shown on the next page.
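For intuition, here is a minimal Python sketch (our own, not the algorithm Perseus uses) of how an H0 barcode arises: every point is born at radius 0, and when two components meet, the younger one dies.

import math
from itertools import combinations

def h0_barcode(points):
    """Bars (birth, death) for connected components; one bar never dies."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    # Every pairwise distance is a radius at which an edge appears.
    pairs = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    bars = []
    for r, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:                 # two components meet at radius r ...
            parent[rj] = ri
            bars.append((0.0, r))    # ... and the younger one dies
    bars.append((0.0, math.inf))     # the last component persists forever
    return bars

print(h0_barcode([(0, 0), (0.1, 0), (5, 5)]))
# [(0.0, 0.1), (0.0, ~7.0), (0.0, inf)]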
Barcodes from Perseus: H0, H1, and H2 barcodes.
Barcodes from R: we attempted an H2 barcode as well, but ran into an insufficient
memory error and could not produce it.
Persistence Diagrams
A persistence diagram is used similarly to a barcode: both involve the starting and
ending of cycles, and both are visual representations of the data set. However, a
persistence diagram is more of a graph, because it plots each cycle's birth time on the
x-axis against its death time on the y-axis, with the line y = x drawn in. This line helps
us determine which cycles are noise, because most cycles close to it are considered
noise: their starting time is so close to their ending time that they are usually
unimportant. There are exceptions where short cycles can be important, but we will
not get into those in this report. It is important to note that persistence diagrams and
barcodes are just two ways of visualizing the same information.
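A minimal Python sketch of that noise heuristic (our own; the cutoff value is an arbitrary choice for illustration):

def split_noise(diagram, cutoff=0.05):
    """diagram: list of (birth, death) pairs; returns (signal, noise) lists."""
    signal = [(b, d) for b, d in diagram if d - b > cutoff]
    noise = [(b, d) for b, d in diagram if d - b <= cutoff]
    return signal, noise

signal, noise = split_noise([(0.0, 0.9), (0.2, 0.23), (0.1, 0.12)])
print(signal)  # [(0.0, 0.9)] -- only the long-lived cycle survives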
H0 Explained
H0: Studying a data set’s H0 is very important because it tells us how many
connected components there are. In ordinary homology (as opposed to persistent
homology), the rank of H0 lets us see, for instance, whether two different components
live in our data set. In a barcode, however, if we let the epsilon balls grow
indefinitely, there will eventually be just one connected component, and therefore one
infinitely long line; but you can still spot other lines that are long compared to the
rest. Such a line tells us that a component persisted for a long time before becoming
connected to another component. This is where some of the ambiguity in topological
data analysis comes from: judging which lines are important and which are not. For
most data sets it is fairly clear which is which from the combination of the barcode
and the persistence diagram. For example, if our data set were Figure 3, we would
see two long bars, because the distance between the two circles is so large. As for
what H0 through H3 mean: each number counts cycles that do not bound anything.
To calculate our Hn values we take the n-dimensional cycles and mod out by the
boundaries. For H0 this is synonymous with counting connected components, because
we consider a point to be a 0-dimensional cycle. In the higher dimensions we use the
same idea to find large holes in the data that last for a long time; we can learn a lot
from seeing which cycles do not bound a surface.
Figure 3: two widely separated circles (image from http://aninfopage.blogspot.com/).
R Explained
R is a programming environment for analyzing data. R can create histograms, pie
charts, box-and-whisker plots, scatterplots, barcodes, and persistence diagrams. It is
very helpful, and we used it to create barcodes. R pulls together all of the ideas we
have been describing: for example, it loads our set of 337 data points, which lives in
R^3 (meaning we have three different variables), and uses Euclidean distances to
calculate when the points connect for H0. It does the same for H1, calculating the
distances at which 1-dimensional cycles form. R is a script-based language where you
issue commands to accomplish tasks. It is also free software, improved by many
contributors who write libraries; a library is a collection of functions and applications
you can use. For example, we used the TDA and phom libraries to make the barcodes.
R is a very useful resource, because doing this by hand in higher dimensions would be
impossible.
Results
This graph shows every player who is eligible for a PER (minimum 6.09 minutes per
game) in relation to a line that passes through the league-average salary, PER, and WS.
Each team has its own color so we can tell whether any team has more of its players
above or below the line. We have established that the players farthest from the origin
while still under this line are the players who are more efficient and lower cost to their
team. Conversely, if a player is anywhere above the line, we feel that his efficiency
and/or WS is not worth the salary he is paid. We labeled a few exemplary points, like
LeBron James and Kevin Durant, to show where notable players stand on this graph.
This diagram is important because it offers a clear visual; combined with the
spreadsheet of player statistics, it can teach us a lot about each team.
Figure 4: 3D scatterplot of PER, win shares, and salary for all 337 eligible players.
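The published figure was produced with a 3D scatterplot tool for Excel (see the references); a rough Python equivalent, assuming the reference line runs from the origin through the league-average point (our reading of the description above) and using made-up data, might look like:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
per, ws, sal = rng.random(50), rng.random(50), rng.random(50)  # fake players
avg = np.array([per.mean(), ws.mean(), sal.mean()])            # league average

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(per, ws, sal)
t = np.linspace(0, 1.5, 50)            # parameter along the reference line
line = np.outer(t, avg)                # points t * (avgPER, avgWS, avgSal)
ax.plot(line[:, 0], line[:, 1], line[:, 2], color="red")
ax.set_xlabel("PER"); ax.set_ylabel("Win Shares"); ax.set_zlabel("Salary")
plt.show()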
Analysis of H0
The H0 barcodes, as well as the persistence diagrams, tell us a lot about the shape of
our data. There are many early deaths, and deaths happen far less frequently the
more steps are taken. This tells us that there are many areas with data points
crowded together, but these areas are not necessarily close to each other. This makes
sense: players' salaries tend to form clusters, since many play for the league minimum
or the veteran's minimum, so many data points have identical values in at least one
dimension. Moreover, most of these players do not play much, so their PER and win
shares are also close to each other, near zero, and these points start out extremely
close. Close points merge quickly in H0, which explains why there are so many early
deaths.
Analysis of H1
H1: We found five cycles that persist indefinitely. For a one-dimensional cycle to
persist, two things are needed: a large gap between certain groups of players in
abilities and salaries, and few players filling the space in between, so that the cycle
never bounds anything. One such cycle can be explained by a few players having
really good PER and WS with poor salaries, compared to players with really high
salaries and poor stats. A cycle like this would persist for a long time, and there
would not be many players lying inside it. Two more cycles can be explained by
comparing those same groups to the players doing well in all regards: high salary,
PER, and WS. The first two types of players, combined with the players who do
poorly in stats and get paid poorly, would explain the last two persisting cycles. We
do not expect many players to fall in between the really good players who are paid
well and the players who play well but are not paid much. Similarly, we do not
expect many players in between the really good, well-paid players and the highly
paid players who play poorly; the same logic applies to the players who do poorly
and get paid poorly. Using the 3-dimensional cube of players, we can see that there
are not many players in the scenarios described, so no surface can form there, and
these cycles persist indefinitely. This is the type of data we expected to see under
our hypothesis: a large misallocation of money relative to statistics. It shows us that
some teams are getting a really good deal on players and some teams are misusing
their money. If all of the teams were using their money correctly, we would expect
no one-dimensional cycles, because the majority of the points would fall inside a
diagonal cylinder; the cycles would then bound a surface (the points in the cylinder),
and there would be no cycles in H1.
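To make the diagonal-cylinder intuition concrete, here is a minimal Python sketch (our own, assuming the diagonal is the line through the origin in the direction (1, 1, 1) in normalized PER-WS-salary space); accurately paid players would sit close to this line.

import math

def dist_to_diagonal(p):
    """Distance from a point p in R^3 to the line t * (1, 1, 1)."""
    t = sum(p) / 3.0                       # projection onto the diagonal
    return math.sqrt(sum((x - t) ** 2 for x in p))

print(dist_to_diagonal((0.5, 0.5, 0.5)))   # 0.0 -- exactly on the diagonal
print(dist_to_diagonal((0.9, 0.8, 0.1)))   # ~0.62 -- far from it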
Explanations of H2 and H3
H2: In our H2 data we concluded that there are no important cycles, because they all
died relatively soon after they were formed. This is not necessarily a bad thing; it just
means that all of our 2-dimensional cycles bound a surface. We cannot gather any
more information from our H2.
H3: We tried to give an appropriate analysis, but from the H3 persistence diagram we
could not come up with any valid explanation of why there would be 3-dimensional
cycles that do not bound any surface, and we especially could not figure out why the
only higher-dimensional cycles failing to bound a surface would be 3-dimensional.
However, we did find what we were looking for in our H1.
Software Used and Creation of Data
We used a variety of software packages to help us understand our data. The input
for all of them was a normalized data set, with a maximum value of 1 and a minimum
value of 0; it was a three-dimensional data set containing PER, win shares, and salary.
The software that allowed us to compute the barcodes, as well as the data for the
persistence diagrams, was Perseus. Perseus computes the persistent homology of a
data set, taking a scaling factor, step size, number of steps, the dimension of the data,
and the data itself as input. To create the persistence diagrams, we used a Matlab
script called persdia, which came bundled with the Perseus program. Taking the birth
and death times from Perseus, I wrote my own program using the Python turtle
library to draw the barcodes for our group. Turtle is a library meant to help
introductory programmers learn to program, but it has enough drawing functionality
to produce the barcodes, so we used it; barcodes are one of the most revealing views
of a data set and are somewhat easier to interpret than birth-vs.-death graphs. The
final piece of software was Python Mapper, which reveals clusters within data; it
produced the clustering diagram and created the 3-d cube of points.
To check the program I wrote, we also plotted the barcodes using R, as described
earlier. The results were very similar; the only differences were that my barcodes
lasted infinitely where the ones plotted by R did not, and that R somehow chose a
starting radius, as we never specified one in the R script.
Additional Information on Use of Software
It is important to note certain settings on the software packages used and why those
settings were chosen, since changing any of them even a small amount would
completely change the output. Most of the settings fell into a sweet spot determined
by testing; for example, for the step size in Perseus we tried 1000, 20, 75, and many
other values until we found the one we judged gave the output revealing the most
about the data. For Perseus we settled on a scaling factor of .03, a step size of .01, a
starting radius of .2, and 150 total steps. These were chosen for several reasons. The
scaling factor of .03 shrank each point considerably, which was necessary because
many points were extremely close to begin with and their epsilon balls would
otherwise intersect immediately, destroying cycles before a single step was taken.
The step size of .01 was chosen for much the same reason: because much of the data
started so close together, the steps had to be small to let data points die gradually
instead of all at once. The starting radius of .2 seemed to be the sweet spot for
producing good output; much lower and the output contained many infinite cycles,
much larger and points died as soon as they were born. We used 150 steps because
that produced the best shape of data compared with higher or lower step counts.
Most of the infinite cycles probably would have closed had we gone up to, say, 1000
steps, but that was much harder on the software, which crashed almost every time I
tried step counts that high. For the settings on Python Mapper we used the default
values in the GUI, with two changes: we switched the cover from uniform 1-d to
balanced 1-d, and the clustering from single linkage to complete linkage. We changed
the cover simply because it produced results we thought were easier to interpret,
and the balanced 1-d cover made slightly more sense for our data, which has some
heavily crowded regions. We changed the clustering to complete linkage because we
thought it more important that one entire cluster be close to another entire cluster,
which is better achieved by clustering on farthest points rather than closest points.
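For reference, a tiny Python sketch of the filtration schedule these settings imply, under our assumption (an illustration, not Perseus documentation) that the radius grows by one step size per step:

start_radius, step_size, num_steps = 0.2, 0.01, 150
radii = [start_radius + step_size * t for t in range(num_steps + 1)]
print(radii[0], radii[-1])  # 0.2 ... 1.7 (up to float rounding)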
Discussion
When our group considered how to determine each player's expected salary, we
framed it in terms of what we thought were the two most important, stand-out
statistics: PER and win shares. In our spreadsheet of all players' statistics, we created
a formula that weighs each player's normalized PER and win shares against the
league averages, and then multiplies that number by the league average salary. This
is the formula we used:
Expected salary = SQRT((PER1 / AvePER) * (WS1 / AveWS)) * AveSal

where:
PER1 = player's normalized PER
AvePER = normalized league average PER
WS1 = player's normalized WS
AveWS = normalized league average WS
AveSal = league average salary
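The formula translates directly into Python; the variable names mirror the formula, and the sample inputs below are made-up normalized values rather than numbers from our data set:

import math

def expected_salary(per1, ws1, ave_per, ave_ws, ave_sal):
    """Geometric mean of the PER and WS ratios, scaled by average salary."""
    return math.sqrt((per1 / ave_per) * (ws1 / ave_ws)) * ave_sal

# A player at exactly league-average PER and WS earns the average salary.
print(expected_salary(0.5, 0.4, 0.5, 0.4, 5_000_000))  # 5000000.0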
We want to give a few examples of players whose salaries stood out in our data. We
calculated each of these players' expected salaries using the weighted PER and WS
formula from the spreadsheet. Since the salaries we computed are based on statistics
from the 2013-14 season itself, the numbers are really "what the recorded statistics
from the season would have been worth if they had been accurately predicted." We
give examples of the most overpaid and underpaid players, the player most deserving
of the league's highest salary, the most accurately paid (note 2), the players most
deserving of the league's lowest salary (note 3), and the player most deserving of the
league average (note 4).
Category                                Player                      Real salary - expected salary (note 1)
Most overpaid                           Amar’e Stoudemire           $18,162K
Most underpaid                          Isaiah Thomas               -$8,743K
Deserving of highest salary             Kevin Durant                $415K
Most accurately paid (note 2)           Luis Scola                  $765
Lowest deserved salary (two, note 3)    Tony Wroten, Jae Crowder    $1,160K, $788K respectively
Deserving of average salary (note 4)    Derrick Favors              $937K
1. Real salary - expected salary is our statistic showing the difference between what a
player was paid and what we believe he should have been paid.
2. The most accurately paid player (Luis Scola) has the smallest difference between
real salary and expected salary.
3. The lowest deserved salaries (Tony Wroten, Jae Crowder) follow from
normalization: Wroten had the league's lowest PER and Crowder the league's lowest
WS, both of which show up in our data as zero, so our calculations say they
technically deserve a salary of zero.
4. The player most deserving of the average salary (Derrick Favors) is the one closest
to the league's average PER and average WS.
Validation
There are many factors that teams use when determining a contract to offer a
player. Since we decided to use PER, average salary and win shares, we were able to
generate a formula to produce an expected salary. We feel this formula is an
accurate depiction of how much players deserve to be paid because it considers
what we believe are the most useful statistics. However, others who examine how
much each player is worth would likely use different formulas and methods to come
up with different salary figures for each player because they may value other
statistics differently. For example, Mike Ghirardo of California Polytechnic State
University created a different method to analyze how much a player should be paid.
He chose to compare salary to Adjusted Plus-Minus (APM), which measures and
compares how a team does when a particular player is on or off the court. However,
after analyzing his report we found his method to be flawed (and ours to be more
effective) because he believes his method has a lot of noise, and we agree.
Of course, every data set will have noise, but we expect ours to have less than ten
percent. Ghirardo's data may have had a lot of noise because he used only one basic
statistic to determine a player's salary. Ghirardo attributes the noise to collinearity
between players, meaning one player may play the majority of his minutes with a
specific group of players; this leads to a lot of unhelpful data, because many players
end up with similar APM. We found our data to have little noise, which is why we
judged our method more effective. Our method focuses more on the individual
instead of the team, which gives us more relevant data when determining a player's
salary. We believe this is due to using two advanced statistics as well as a basic one.
Minor Validation Issue
We ran two different software stacks in our research to double-check our results:
R/phom and Perseus/Mapper. Instead of finding identical barcodes, as we expected,
there was a slight difference: the R barcode for H1 was missing the five infinitely
lasting cycles. This is a meaningful difference, because those cycles are what support
our hypothesis. However, we believe the Perseus results are the correct ones,
because Perseus is better at handling bigger data sets. Our H2 did not even run in R
because of an insufficient memory error. We still believe in our results, but we note
this discrepancy as a disclaimer.
Limitations
We were not able to meet all of our research goals. After collecting the data and
using Perseus, it became apparent that we would not be able to learn anything about
how well specific NBA teams manage their money, because Perseus does not let you
identify specific data points. Without knowing which data points were associated
with which cycles, we decided it would be impossible to analyze players or teams
individually using topology.
Conclusion
Contrary to popular belief, our data shows that of the 337 players we analyzed who
were eligible for a PER, only 129 were overpaid by our standards. That leaves over
60% of the league (again, counting only players eligible for a PER; otherwise there
are about 440 players altogether) who were paid less than their
statistics say they deserved during the 2013-14 NBA season. However, while the
underpaid players far outnumbered the overpaid ones, the amount by which the
overpaid players surpassed their expected salaries is much higher than the gap
between real and expected salaries for those who were underpaid. Prime examples
are Amar’e Stoudemire, Carmelo Anthony, Dwyane Wade, Russell Westbrook, and
many others. What these players have in common is marketability: they have become
public figures over their careers, so fans often go to games solely to be able to say
they saw that player play in person. This
information can be helpful to NBA front offices that need to determine how they can
save money on one player to make room in the salary cap for others. We also learned
that over half of all NBA contracts are quite similar in value relative to performance.
There are several groups of outliers whose value is completely different
from what is expected based on their performance. These contracts are likely the most
important decisions an organization makes as it either has found tremendous value in
relation to the contract, paid a premium for a premium player, or wasted its money on
an underperforming player. The fact that there are outliers at all proves that talent
evaluation is far from perfect in the NBA, and that there are likely some teams that are
better at it than others. Also, with only 8 players who were both paid and played like
superstars across 30 teams, teams should probably focus on getting great draft picks
on cheap contracts and avoid overpaying middling players. With only an 8-in-30
chance of having a properly paid superstar, most teams should focus on winning
without one, as most will not have a choice.
Contributions
• Zach contributed the abstract (edited by Joey), introduction (edited by Joey), math
background, Vietoris-Rips complex, homology (edited by Josh), persistent homology
(edited by Josh), barcodes (edited by Lucas), persistence diagrams (edited by Lucas),
H0 explained, the explanation of the H1-H3 persistence diagrams, the 3D scatterplot,
the minor validation issue, R explained, and the barcodes from R.
• Lucas contributed the barcodes, the analysis of H0, and the explanations of all
software we used.
• Adam contributed the explanations of PER (edited by Joey), the salary cap, and win
shares, and the validation.
• Joey contributed the discussion, the conclusion, and the explanation of the 3D
scatterplot.
References
"3d Scatterplot for MS Excel." Doka.ch. Doka Life Cycle Assessments, n.d. Web. 10
Apr. 2015.
Brixius, Nathan. "NBA Rosters and Team Changes: 2013-2014." Nathan Brixius.
Wordpress, 08 Oct. 2014. Web. 24 Mar. 2015.
Casciaro, Joseph. "Get To Know An Advanced Stat: Win Shares." TheScore. N.p., 12
Feb. 2015. Web. 31 Mar. 2015.
Ghirardo, Mike. "NBA Salaries: Assessing True Player Value." Calpoly.edu. Digital
Commons, n.d. Web. 20 Apr. 2015.
Hollinger, John. “What Is PER?” ESPN. ESPN Internet Ventures, 11 Aug. 2011. Web.
26 Mar. 2015.
Smith, Sekou. "2013-14 NBA Salary Cap Figure Set at $58.679 Million." NBA.com
Hang Time Blog. NBA, 9 July 2013. Web. 07 Apr. 2015.
"Topological Data Analysis with R." R-bloggers. Web.
http://www.r-bloggers.com/topological-data-analysis-with-r/
Weisstein, Eric. "Homology." Wolfram Mathworld. Wolfram Alpha, n.d. Web. 24 Mar.
2015.
R code:

# Load the raw (un-normalized) data: columns are PER, WS, and salary.
NBA.Graphs <- read.csv("/mnt/nfs/netapp2/students/zmmeyer/Downloads/NBA Graphs.csv", header=FALSE)
View(NBA.Graphs)
data <- data.matrix(NBA.Graphs, rownames.force=NA)

library("TDA")  # loads FNN, igraph, parallel, scales
library(phom)   # loads Rcpp

head(data)
#         V1   V2    V3
# [1,] 29.90 19.2 17832
# [2,] 29.40 15.9 19067
# [3,] 26.97 14.3 14693
# [4,] 26.54 10.4  5375
# [5,] 26.18  7.9  4916
# [6,] 25.98 12.2 18668

max_dim <- 0  # dimension = 0
max_f <- 1    # maximum filtration value (x-axis) = 1
bball <- pHom(data, dimension=max_dim, max_filtration_value=max_f, mode="vr", metric="euclidean")
plotBarcodeDiagram(bball, max_dim, max_f, title="H0 of Stats vs Salary")

# Switched to the normalized data.
NORMD <- read.delim("/mnt/nfs/netapp2/students/zmmeyer/Downloads/NORMD.txt", header=FALSE)
View(NORMD)
data <- data.matrix(NORMD, rownames.force=NA)
bball <- pHom(data, dimension=max_dim, max_filtration_value=max_f, mode="vr", metric="euclidean")

head(data)
#             V1        V2        V3  V4
# [1,] 0.7817508 1.0000000 1.0000000 0.2
# [2,] 0.8368823 0.9797325 0.8358209 0.2
# [3,] 0.6416231 0.8812323 0.7562189 0.2
# [4,] 0.2256596 0.8638022 0.5621891 0.2
# [5,] 0.2051694 0.8492096 0.4378109 0.2
# [6,] 0.8190706 0.8411026 0.6517413 0.2

plotBarcodeDiagram(bball, max_dim, max_f, title="H0 of Stats vs Salary")

max_dim <- 1
max_f <- 1
bball <- pHom(data, dimension=max_dim, max_filtration_value=max_f, mode="vr", metric="euclidean")
plotBarcodeDiagram(bball, max_dim, max_f, title="H1 of Stats vs Salary")

# We attempted H2 as well but got an insufficient memory error.

More Related Content

Viewers also liked

Looking to sell statistical products on Amazon? Do it in five steps, using th...
Looking to sell statistical products on Amazon? Do it in five steps, using th...Looking to sell statistical products on Amazon? Do it in five steps, using th...
Looking to sell statistical products on Amazon? Do it in five steps, using th...Stats Cosmos
 
Js frameworksmackdown
Js frameworksmackdownJs frameworksmackdown
Js frameworksmackdownmichaelbreyes
 
Ayman_CV_eng_sep16
Ayman_CV_eng_sep16Ayman_CV_eng_sep16
Ayman_CV_eng_sep16Ayman Wadi
 
Entorno y primeros pasos de powerpoint
Entorno y primeros pasos de powerpointEntorno y primeros pasos de powerpoint
Entorno y primeros pasos de powerpointheectordav
 
Sample Workout Program
Sample Workout ProgramSample Workout Program
Sample Workout ProgramTaylor Box
 
marketingfinancial
marketingfinancialmarketingfinancial
marketingfinancialJon Lee
 
Of fomentar-h-bitos-alimenticios-saludables
Of fomentar-h-bitos-alimenticios-saludablesOf fomentar-h-bitos-alimenticios-saludables
Of fomentar-h-bitos-alimenticios-saludablesLiceo Integral del Saber
 

Viewers also liked (12)

Looking to sell statistical products on Amazon? Do it in five steps, using th...
Looking to sell statistical products on Amazon? Do it in five steps, using th...Looking to sell statistical products on Amazon? Do it in five steps, using th...
Looking to sell statistical products on Amazon? Do it in five steps, using th...
 
Negara berkembang
Negara berkembangNegara berkembang
Negara berkembang
 
Js frameworksmackdown
Js frameworksmackdownJs frameworksmackdown
Js frameworksmackdown
 
Ayman_CV_eng_sep16
Ayman_CV_eng_sep16Ayman_CV_eng_sep16
Ayman_CV_eng_sep16
 
Entorno y primeros pasos de powerpoint
Entorno y primeros pasos de powerpointEntorno y primeros pasos de powerpoint
Entorno y primeros pasos de powerpoint
 
Resume
ResumeResume
Resume
 
Deemester_Training_Slides
Deemester_Training_SlidesDeemester_Training_Slides
Deemester_Training_Slides
 
Sample Workout Program
Sample Workout ProgramSample Workout Program
Sample Workout Program
 
Insect Mointoring
Insect MointoringInsect Mointoring
Insect Mointoring
 
Ciudadaníadigital
CiudadaníadigitalCiudadaníadigital
Ciudadaníadigital
 
marketingfinancial
marketingfinancialmarketingfinancial
marketingfinancial
 
Of fomentar-h-bitos-alimenticios-saludables
Of fomentar-h-bitos-alimenticios-saludablesOf fomentar-h-bitos-alimenticios-saludables
Of fomentar-h-bitos-alimenticios-saludables
 

Similar to Bank Shots to Bankroll Final

Predicting Salary for MLB Players
Predicting Salary for MLB PlayersPredicting Salary for MLB Players
Predicting Salary for MLB PlayersRobert-Ian Greene
 
Sports Aanalytics - Goaltender Performance
Sports Aanalytics - Goaltender PerformanceSports Aanalytics - Goaltender Performance
Sports Aanalytics - Goaltender PerformanceJason Mei
 
1. After watching the attached video by Dan Pink on .docx
1. After watching the attached video by Dan Pink on .docx1. After watching the attached video by Dan Pink on .docx
1. After watching the attached video by Dan Pink on .docxjeremylockett77
 
Measuring Team Chemistry in MLB
Measuring Team Chemistry in MLBMeasuring Team Chemistry in MLB
Measuring Team Chemistry in MLBDavid Kelly
 
The Effect of RAT on Wages for Professional Basketball Players 0505.docx upda...
The Effect of RAT on Wages for Professional Basketball Players 0505.docx upda...The Effect of RAT on Wages for Professional Basketball Players 0505.docx upda...
The Effect of RAT on Wages for Professional Basketball Players 0505.docx upda...Andre Williams
 
WageDiscriminationAmongstNFLAthletes
WageDiscriminationAmongstNFLAthletesWageDiscriminationAmongstNFLAthletes
WageDiscriminationAmongstNFLAthletesGeorge Ulloa
 
Identifying Key Factors in Winning MLB Games Using a Data-Mining Approach
Identifying Key Factors in Winning MLB Games Using a Data-Mining ApproachIdentifying Key Factors in Winning MLB Games Using a Data-Mining Approach
Identifying Key Factors in Winning MLB Games Using a Data-Mining ApproachJoelDabady
 
Yujie Zi Econ 123CW Research Paper - NBA Defensive Teams
Yujie Zi Econ 123CW Research Paper - NBA Defensive TeamsYujie Zi Econ 123CW Research Paper - NBA Defensive Teams
Yujie Zi Econ 123CW Research Paper - NBA Defensive TeamsYujie Zi
 
Pressure Index in Cricket
Pressure Index in CricketPressure Index in Cricket
Pressure Index in CricketIOSR Journals
 
Writing Paper And Envelopes Sets, 72PCS Cute Stationary
Writing Paper And Envelopes Sets, 72PCS Cute StationaryWriting Paper And Envelopes Sets, 72PCS Cute Stationary
Writing Paper And Envelopes Sets, 72PCS Cute StationaryGina Alfaro
 
1982 maher modelling association football scores
1982 maher   modelling association football scores1982 maher   modelling association football scores
1982 maher modelling association football scoresponton42
 
Scoring Contribution Percentage
Scoring Contribution PercentageScoring Contribution Percentage
Scoring Contribution PercentageDamon Smith
 
4DDBA 8307 Week 7 Assignment TemplateJohn DoeD.docx
4DDBA 8307 Week 7 Assignment TemplateJohn DoeD.docx4DDBA 8307 Week 7 Assignment TemplateJohn DoeD.docx
4DDBA 8307 Week 7 Assignment TemplateJohn DoeD.docxtroutmanboris
 

Similar to Bank Shots to Bankroll Final (20)

Predicting Salary for MLB Players
Predicting Salary for MLB PlayersPredicting Salary for MLB Players
Predicting Salary for MLB Players
 
Sports Aanalytics - Goaltender Performance
Sports Aanalytics - Goaltender PerformanceSports Aanalytics - Goaltender Performance
Sports Aanalytics - Goaltender Performance
 
LAX IMPACT! White Paper
LAX IMPACT! White PaperLAX IMPACT! White Paper
LAX IMPACT! White Paper
 
Final Research Paper
Final Research PaperFinal Research Paper
Final Research Paper
 
1. After watching the attached video by Dan Pink on .docx
1. After watching the attached video by Dan Pink on .docx1. After watching the attached video by Dan Pink on .docx
1. After watching the attached video by Dan Pink on .docx
 
Measuring Team Chemistry in MLB
Measuring Team Chemistry in MLBMeasuring Team Chemistry in MLB
Measuring Team Chemistry in MLB
 
The Effect of RAT on Wages for Professional Basketball Players 0505.docx upda...
The Effect of RAT on Wages for Professional Basketball Players 0505.docx upda...The Effect of RAT on Wages for Professional Basketball Players 0505.docx upda...
The Effect of RAT on Wages for Professional Basketball Players 0505.docx upda...
 
WageDiscriminationAmongstNFLAthletes
WageDiscriminationAmongstNFLAthletesWageDiscriminationAmongstNFLAthletes
WageDiscriminationAmongstNFLAthletes
 
Identifying Key Factors in Winning MLB Games Using a Data-Mining Approach
Identifying Key Factors in Winning MLB Games Using a Data-Mining ApproachIdentifying Key Factors in Winning MLB Games Using a Data-Mining Approach
Identifying Key Factors in Winning MLB Games Using a Data-Mining Approach
 
Yujie Zi Econ 123CW Research Paper - NBA Defensive Teams
Yujie Zi Econ 123CW Research Paper - NBA Defensive TeamsYujie Zi Econ 123CW Research Paper - NBA Defensive Teams
Yujie Zi Econ 123CW Research Paper - NBA Defensive Teams
 
Lineup Efficiency
Lineup EfficiencyLineup Efficiency
Lineup Efficiency
 
Research Paper
Research PaperResearch Paper
Research Paper
 
Pressure Index in Cricket
Pressure Index in CricketPressure Index in Cricket
Pressure Index in Cricket
 
Kerber_NBA_Analysis
Kerber_NBA_AnalysisKerber_NBA_Analysis
Kerber_NBA_Analysis
 
Writing Paper And Envelopes Sets, 72PCS Cute Stationary
Writing Paper And Envelopes Sets, 72PCS Cute StationaryWriting Paper And Envelopes Sets, 72PCS Cute Stationary
Writing Paper And Envelopes Sets, 72PCS Cute Stationary
 
1982 maher modelling association football scores
1982 maher   modelling association football scores1982 maher   modelling association football scores
1982 maher modelling association football scores
 
Sharpe Ratio & Information Ratio
Sharpe Ratio & Information RatioSharpe Ratio & Information Ratio
Sharpe Ratio & Information Ratio
 
Sharpe
SharpeSharpe
Sharpe
 
Scoring Contribution Percentage
Scoring Contribution PercentageScoring Contribution Percentage
Scoring Contribution Percentage
 
4DDBA 8307 Week 7 Assignment TemplateJohn DoeD.docx
4DDBA 8307 Week 7 Assignment TemplateJohn DoeD.docx4DDBA 8307 Week 7 Assignment TemplateJohn DoeD.docx
4DDBA 8307 Week 7 Assignment TemplateJohn DoeD.docx
 

Bank Shots to Bankroll Final

  • 1. DeLay, Pescatore, Meyer, Howes, Danker 1 Bank Shots to Bankroll Joseph DeLay, Adam Pescatore, Zach Meyer, Lucas Howes, Josh Danker The University of Iowa College of Liberal Arts and Sciences Abstract The goals of our research is to determine, based on win shares (WS) and player efficiency rating (PER), what the expected salary should be for each player in the NBA. We then want to take that number and compare it to what players were paid during the 2013-14 NBA season to see if players were paid what their statistics say they deserved, what teams were best at spending money in general, and if there is any information about NBA pay in general that can be learned from the data. Introduction The data we are using is from a spreadsheet from industry-leading analyst Nate Brixius’ blog, completed with statistics from the 2013-14 NBA season, such as games played, minutes played, field goals made, field goals attempted, and many more (Brixius). We simply added minutes per game, PER, and player salaries. We did not use every single NBA player in our data set. In order to be eligible for a PER, a player must have played an average of 6.09 minutes per game throughout the season, so that left us with 337 data points, or eligible players. We want to conduct an analysis of WS and PER and create a formula using those two variables and league average salary to determine what players at certain levels of those statistics should be paid. Now, this does not mean that we are looking to see if any teams are saving money by paying players less than what they deserve. If we determine a specific player deserves $8 million a year, but the team is paying him $6 million, then clearly the player is being underpaid, so we would consider. We assume that there is a big difference between the surplus that the overpaid players are receiving and the deficit for the underpaid players, but we want to make sure we weed out the noise, or the salary differences that are negligible based on a percentage of the league average salary,
  • 2. DeLay, Pescatore, Meyer, Howes, Danker 2 before we start making analyses. We will now explain the two statistics we are using, PER and win shares, as well as the NBA salary cap more in depth. PER Explained PER is an acronym for player efficiency rating. PER is an advanced statistic that measures a player’s effectiveness on a per-minute basis, while taking into consideration the pace at which a team plays. Created by statistician John Hollinger, PER is a formula that includes but is not limited to “positive accomplishments such as field goals, free throws, 3-pointers, assists, rebounds, blocks, and steals, and negative ones such as missed shots, turnovers and personal fouls” (Hollinger, 2011). PER is an effective statistic to use because it can compare two players even if there is a significant minutes gap, meaning one player is on the court significantly longer than another player. One flaw in PER is that it cannot measure a player’s defensive efficiency. However, PER is useful because it can “summarize a player’s statistical accomplishments in a single number” (Hollinger, 2011). Win Shares Explained According to David Corby of Basketball Reference, win shares is a statistic to “credit a player’s total measurable contribution to his team’s win total during the season” (Casciaro 2015). Unlike PER, win shares can measure both offensive and defensive productivity from a player. Statistics on offense include field goals, assists, free throws, and offensive rebounds that lead to points. However, defensive statistics are not measured as easily as offensive statistics. One aspect of defense that is measured to determine win shares is a “stop.” A stop is generally given to a player who gets a steal, block or defensive rebound. Thus, when factoring in defense to win shares, the main criteria measured is how often a team gets a stop, as well as the player who forces a stop. One advantage win shares has over PER is that it measures the player’s total productivity, instead of the player’s per minute productivity. In other words, win shares tells us how much a player produces, and PER tells us how efficient a player is.
  • 3. DeLay, Pescatore, Meyer, Howes, Danker 3 Salary Cap Explained Since the statistics used in our research are from the 2013-14 NBA season, the salary cap we use will also be from that season because the salary cap changes every year. According to senior NBA analyst Sekou Smith, the salary cap for the 2013-14 NBA season was $58.679M (Smith). Furthermore, the minimum a team was required to spend was $52.811M, or 90 percent of the salary cap. However, the salary cap is not necessarily the maximum a team can spend on players. The maximum a team can spend without a penalty is called a tax level, and it was set at $71.748 million dollars. However, if a team exceeds the tax level, they will have to pay the NBA the following fees (provided by Sekou Smith): • Portion of team salary $0-$4.99 million over tax level: $1.50 for $1 • Portion of team salary $5-$9.99 million over tax level: $1.75 for $1 • Portion of team salary $10-$14.99 million over tax level: $2.50 for $1 • Portion of team salary $15-$19.99 million over tax level: $3.25 for $1 • Rates increase by $0.50 for each additional $5 million of team salary above the tax level. For example, if a team exceeds the tax level by 2 million dollars, they must pay the NBA a fee of 3 million dollars. As long as they stay under the tax level, if a team exceeds the initial salary cap at $58.679 million dollars, there is no penalty. Conversely, if a team exceeds the tax level, they will have to pay a fee explained in the above bullet points.
  • 4. DeLay, Pescatore, Meyer, Howes, Danker 4 Background Throughout our project we used a concept called Topological Data Analysis. This is a mathematical technique that involves creating simplicial complexes. We represent data as a simplicial complex in order to discover its topological attributes. A simplicial complex is a type of graph comprised of points, edges, triangles, faces, tetrahedrons, etc. as seen in figure 1. One way to create these edges is based upon how close the points are to each other using a Euclidean distance. There are many other kinds of ways of saying how “close” points are to each other, but we won’t get into those seeing as we did not use any of those methods. We use epsilon balls to determine if points are close to each other. By this I mean if another point is within a distance of radius, epsilon we connect the points with an edge. Epsilon is just a fancy word for a number that we can change or choose. So having a radius of epsilon is no different than having a radius of 1, we just choose epsilon to equal 1. We can then use these edges we just created to similarly create faces. To create a face all three edges need to be connected and there needs to be a common area of intersection in all 3 points epsilon ball. If all three points’ epsilon balls don’t overlap then there is a hole. For example, in Figure 2, the left complex would have a hole because the 3 epsilon balls do not intersect in the middle. However the right figure would be filled in, creating a face, because they all overlap in the center. The simplicial complex for the example on the left would be a triangle with white in the middle, a hole. While the example on the right would be colored in, creating a face, there would be no hole. Mimicked from Power Point 1 slide 23 From Isabel Darcy. Figure 2 Figure 1 http://inperc.com/wiki/inde x.php?title=Simplicial_homol ogy
  • 5. DeLay, Pescatore, Meyer, Howes, Danker 5 In topological data analysis many inferences can be made depending on the number of holes and the location of these holes compared to the rest of your data in the graph. Making inferences based upon holes works in higher dimensions as well not just for two dimensional triangles and faces. This way of creating simplicial complexes is called the Čech Complex.  Another style of simplicial complex building is called the Vietoris-Rips complex. Vietoris-Rips Complex This type of simplicial creation does not involve creating epsilon balls and needing a common intersection to fill in a hole. With this style, if the edges are connected, you fill in the complex.  You fill in all complexes of dimension greater than one. This is especially important when building our persistent diagrams and barcodes because filling once the cycles become connected the cycle then dies. This allows us to find long lasting cycles and determine their importance. There are some drawbacks though, for example, in figure 2, both simplicial complexes would be identical and they would be filled in. Topology has some loss of precision but we can still learn a lot from it. The reason for doing this will be explained more when discussing the study of homology. Homology Homology studies and compares manifolds, their triangulations, simplicial complexes, and holes. The last two complexes are used to determine homology. Homology also has to do with a collection of edges being cycles which is defined as, if you can travel around the edges and get back to the same point. In homology we are dealing with Z mod 2 coefficients. This means that all the values are either 0 or 1. So if you add two edges with 1 mod 2 together then the value equals 0 mod 2. Homology is equal to cycles mod boundaries. This method can tell whether a certain collection of edges is a circle, a face, a torus, or a ball. While homology deals with a collection of edges, persistent homology goes further with this idea by telling us how long these collection of cycles exist.
  • 6. DeLay, Pescatore, Meyer, Howes, Danker 6 Persistent Homology Persistent homology has a lot of the same characteristics as homology. Instead of using a single distance for our epsilon balls, we greatly enlarge our epsilon balls to determine whether certain cycles persist throughout time. These persistent cycles tell us the importance of the shapes and holes by showing us how long each cycle lasts. With our data, we are trying to determine which of these cycles are noise and which are important to us. We will determine this by using two methods, barcodes and persistent diagrams. Barcodes Barcodes are used to help researchers better understand clusters of certain dimensions in their data. They are based upon when a cycle starts and when it gets close enough to another cycle to become part of that cycle. This absorption is called the death of the cycle. A barcode is created on an axis by taking the time the cycle is created and creating a line that stretches from the time the cycle began to the time it dies. An important note about creating a barcode is that it is based upon a filtered complex. This means that time determines when each point and edge comes into existence because we look at the time when gradually growing our epsilon balls in steps. When the epsilon ball intersects with another, the younger of the two cycles dies. Because of this fact we can tell when cycles start and end which can tell us how important they are. This coupled with persistence diagrams can tell us a lot about our data. The three barcodes we computed are shown on the next page. Barcodes from Perseus-
  • 7. DeLay, Pescatore, Meyer, Howes, Danker 7 Barcodes From R- We attempted to make an H2 but we had an insufficient memory error and could not make it. Persistent Diagrams A persistence diagram is used similarly to barcodes in the way that they both involve the starting and ending of cycles and that they are both visual representations of the data set. However persistence diagrams are more of a graph because it maps the birth time on the x-axis vs. the end time on the y-axis. There is also a line drawn as y=x. This line on the graph can help us determine which cycles are noise because most of the cycles that are close to this line are considered noise. The reason we think much of this is noise is H0 Barcode H1 Barcode H2 Barcode
  • 8. DeLay, Pescatore, Meyer, Howes, Danker 8 because the starting time is so close to the ending time of the cycle that most times these cycles are not important. There are exceptions when short cycles can be important, but we will not get into those in this report. It is important to note that persistence diagrams and barcodes are just two ways of visualizing the same information. H0 Explained H0: Studying a data set’s H0 is very important because it can tell us how many components there are. When doing homology and not persistent homology this can really allow us to see if there are two different components living in our data set based upon its rank. However, when doing H0 in the barcode when we let the growing of the epsilon balls go on indefinitely there will always be just one connected component. Therefore one long line, however you can still see other long lines compared to the rest. This will tell us that this component persisted for a long time before becoming connected to the other component. This is where some of the ambiguousness comes from in Topological Data Analysis, in trying to judge which lines are important and which are not. For most data sets it is really clear to tell which is important and which is not through the combination of the barcode and persistent diagram. For example if our data set was figure 3 we would see two long bars for a very long time because the distance between the two circles is so big. When talking about what these H0 through H3, this number comes from cycles that do not have any boundaries. Actually to calculate our Hn values we take our cycles and mod them by the boundaries. This is synonymous with how many connected components there are for H0 because we consider a point to be a 0-dimensional cycle. We use this is in many of the higher dimensions to figure out large holes in the data that last for a long time. We can learn a lot from realizing which cycles do not bound a surface. Figure 3 http://aninfopage.blogspot.c om/
R Explained

R is a software tool for programming data analysis. R can create histograms, pie charts, box-and-whisker plots, barcodes, scatterplots, and persistence diagrams, and it is what we used to create our barcodes. R ties together all the ideas above: it loads our set of 337 data points, which lives in R^3 (three variables per player), and uses Euclidean distances to calculate when points connect for H0. It does the same for H1, tracking distances until 1-dimensional cycles form. R is a script-based language in which you issue commands to accomplish tasks. It is also free software, improved by many contributors who write libraries; a library is a collection of functions and applications. For example, we used the TDA and phom libraries to make the barcodes. R is a very useful resource, because doing these computations by hand in higher dimensions would be impossible.
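One way to sanity-check the H0 computation without any topology library: the H0 death times of a Vietoris-Rips filtration are exactly the merge heights of single-linkage clustering. A minimal sketch in base R, with a placeholder file name:

R sketch-

# Single-linkage merge heights equal the H0 bar deaths of a Rips filtration.
pts <- as.matrix(read.csv("normalized.csv", header = FALSE))  # placeholder file

d  <- dist(pts, method = "euclidean")   # pairwise Euclidean distances
hc <- hclust(d, method = "single")      # single-linkage clustering

# Each merge height is the scale at which two components join, i.e. the
# death of the younger H0 class (the elder rule).
sort(hc$height)[1:10]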
Results

This graph shows every player eligible for a PER (minimum 6.09 minutes per game) plotted against a reference line that passes through the league-average salary, PER, and WS. Each team has its own color set so we can tell whether any team has more of its players above or below the line. We have established that the players farthest from the origin while still under the line are the ones who are more efficient at lower cost to their team. Conversely, for a player anywhere above the line, we feel his efficiency and/or WS does not justify the salary he is paid. We flagged a few exemplary points, such as LeBron James and Kevin Durant, to show where notable players stand. This diagram is important because it offers a clear visual; combined with the spreadsheet of player statistics, it tells us a great deal about each team.

[Figure 4: 3-D scatterplot of PER, WS, and salary with the average reference line]
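The published Figure 4 was built with an Excel 3-D scatterplot template (see References); an approximation in R with the scatterplot3d package might look like the following sketch. The data frame, its column names, and the reference-line construction are our placeholders.

R sketch-

library(scatterplot3d)   # CRAN package for static 3-D scatterplots

nba <- read.csv("nba_stats.csv")   # placeholder: columns PER, WS, Salary

s3d <- scatterplot3d(nba$PER, nba$WS, nba$Salary,
                     xlab = "PER", ylab = "Win Shares", zlab = "Salary")

# Reference line from the origin through the league-average point,
# extended to twice the averages so it spans the cloud.
t <- c(0, 2)
s3d$points3d(t * mean(nba$PER), t * mean(nba$WS), t * mean(nba$Salary),
             type = "l", lwd = 2)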
Analysis of H0

The H0 barcode and persistence diagram tell us a lot about the shape of our data. There are many early deaths, and deaths happen far less frequently as more steps are taken. This tells us there are many regions where data points crowd together, but those regions are not necessarily close to one another. That makes sense: players' salaries tend to form clusters, since many play for the league minimum and others for the veteran's minimum, so many data points share identical values in at least one dimension. Moreover, most of these players do not play much, so their PER and win shares are also close to each other, near zero, and these points therefore start out extremely close. Close points merge quickly in H0, which explains the many early deaths.

Analysis of H1

We found five cycles that persist indefinitely. For a 1-dimensional cycle to persist, two things must hold: there must be a large gap between certain players in (PER, WS, salary) space, and there must be few players filling the region in between, so that the cycle does not bound anything. One such cycle can be explained by a few players with very good PER and WS but poor salaries sitting opposite players with very high salaries and poor stats; a loop like this would persist for a long time, since few players fall directly inside it. Two more cycles can be explained by comparing each of those groups with the players who do well in all respects: high salary, high PER, and high WS. The last two persisting cycles come from combining the first two player types with the players who have both poor stats and poor pay. We do not expect many players to fall between the really good players who are paid well and the players who play well but are not paid much; similarly, few players sit between the well-paid stars and the highly paid players who play poorly, and the same logic applies around the poorly paid, poorly performing group. Using the 3-dimensional cube of players, we can see that these in-between regions are indeed sparse, so no surface forms there, and the cycles persist indefinitely.

This is the type of data our hypothesis predicts: a large misallocation of money relative to statistics. It shows that some teams are getting very good deals on players while others are misusing their money. If every team were spending its money correctly, we would expect no 1-dimensional cycles at all, because the majority of points would fall inside a diagonal cylinder; the points in the cylinder would fill in any loops, the cycles would bound a surface, and H1 would be empty.

[Figures: H0 and H1 persistence diagrams]
Explanations of H2 and H3

H2: In our H2 data we concluded that there are no important cycles, because they all died relatively close to when they were formed. This is not necessarily a bad thing; it simply means that all of our 2-dimensional cycles bound a surface. We cannot gather any more information from H2.

H3: We tried to give an appropriate analysis, but from the H3 persistence diagram we could not reach any valid conclusion about why there would be 3-dimensional cycles that do not bound a surface. In any case, we found what we were looking for in our H1.

[Figures: H2 and H3 persistence diagrams]

Software Used and Creation of Data

To better understand our research we used a variety of software packages. The input to all of them was a normalized data set, with a maximum value of 1 and a minimum value of 0, in three dimensions: PER, win shares, and salary.

The software that computed the barcodes, as well as the data behind the persistence diagrams, was Perseus. Perseus computes the persistent homology of a data set after taking as input a scaling factor, a step size, a number of steps, the dimension of the data, and the data itself. To create the persistence diagrams we used a Matlab script called persdia, which comes bundled with Perseus.

Taking the birth and death times from Perseus, one of us wrote a program using the Python turtle library to draw the barcodes for the group. Turtle is meant to help introductory programmers learn to program, but it has enough drawing functionality to produce barcodes, so we used it; barcodes are among the most revealing pictures of a data set, and somewhat easier to interpret than birth-versus-death graphs.

Another piece of software was Mapper, which reveals clusters within data; it produced the clustering diagram and the 3-D cube of points.

Finally, to cross-check the hand-written barcode program, we also plotted the barcodes using R (see "R Explained" above). The results were very similar. The only differences were that our barcodes included the infinitely persisting bars where R's did not, and that R appeared to choose a starting radius on its own, as we never specified one in the R program.

Additional Information on Use of Software

It is important to note certain settings on the software packages used and why we chose them. Changing any of these settings even a small amount
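The min-max normalization described above is one line per column in R; a minimal sketch, assuming placeholder file and column names. The constant fourth column of 0.2 visible in the appendix output appears to be the starting radius carried along for the persistence software, which is our reading rather than a documented fact.

R sketch-

rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))   # min-max to [0, 1]

raw  <- read.csv("nba_stats.csv")   # placeholder: raw PER, WS, Salary columns
norm <- data.frame(PER    = rescale01(raw$PER),
                   WS     = rescale01(raw$WS),
                   Salary = rescale01(raw$Salary))

# The appendix data carries a constant fourth column of 0.2, which appears
# to be the starting radius handed to the persistence software.
norm$r0 <- 0.2

write.table(norm, "NORMD.txt", row.names = FALSE, col.names = FALSE)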
would completely change the output. Most of the settings fell into a sweet spot found by testing: for the step size in Perseus, for example, we tried 1000, 20, 75, and many other values until we found the one that revealed the most about the data.

For Perseus we used a scaling factor of 0.03, a step size of 0.01, a starting radius of 0.2, and 150 total steps. The scaling factor of 0.03 shrank the size of each point considerably, which was necessary because many points start extremely close together; without it their epsilon balls would intersect immediately, destroying cycles before a single step was taken. The step size of 0.01 was chosen for much the same reason: because so much of the data starts close together, the step had to be small enough to let cycles die gradually instead of all at once. The starting radius of 0.2 was the sweet spot for a good output; much lower and the output contained many infinite cycles, much higher and the points died as soon as they were born. We used 150 steps because that produced the best shape of data compared with higher or lower step counts. Most of the infinite cycles probably would have merged had we gone up to, say, 1000 steps, but that was much harder on the software, which crashed almost every time we tried step counts that high.

For the settings on Python Mapper we kept the GUI defaults, with two changes: we switched the cover from uniform 1-d to balanced 1-d, and the clustering from single to complete linkage. We changed the cover simply because it produced results that were easier to interpret, and because a balanced 1-d cover makes slightly more sense for our data, which has regions where points are heavily crowded. We changed the clustering to complete linkage because we thought it more important that one whole cluster be close to another whole cluster, which clustering on farthest points rather than nearest points achieves much better.
Discussion

When our group considered how to determine each player's expected salary, we framed it in terms of what we regard as the two most important, stand-out statistics: PER and win shares. In our spreadsheet of all players' statistics, we created a formula that weighs each player's normalized PER and win shares against the league averages, and then multiplies the result by the league-average salary. This is the formula we used:

Expected salary = SQRT((PER1 / AvePER) * (WS1 / AveWS)) * AveSal

where:
PER1 = player's normalized PER
AvePER = normalized league-average PER
WS1 = player's normalized WS
AveWS = normalized league-average WS
AveSal = league-average salary

We want to give a few examples of players whose salaries stood out in our data. We calculated each of these players' expected salaries with the weighted PER and WS formula above. While these are salaries the players should have been paid for the 2013-14 season, the statistics we have are from that same season, so the numbers are better read as "what the season's recorded statistics would have been worth had they been predicted accurately." We give examples of the most overpaid and most underpaid players, the player most deserving of the league's highest salary, the most accurately paid player (2), the players most deserving of the league's lowest salary (3), and the player most deserving of the league-average salary (4).
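As a check, the spreadsheet formula translates directly into an R function; only the argument names and the illustrative numbers below are ours.

R sketch-

# The paper's formula: the geometric mean of a player's normalized PER and
# WS ratios to league average, multiplied by the league-average salary.
expected_salary <- function(per, ws, ave_per, ave_ws, ave_sal) {
  sqrt((per / ave_per) * (ws / ave_ws)) * ave_sal
}

# Illustrative only: a player at exactly league-average PER and WS comes out
# at exactly the league-average salary (the figures here are arbitrary).
expected_salary(per = 0.45, ws = 0.30, ave_per = 0.45, ave_ws = 0.30,
                ave_sal = 5e6)   # returns 5e+06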
Category / Player / Real salary minus expected salary (1)

Most overpaid: Amar'e Stoudemire, $18,162K
Most underpaid: Isaiah Thomas, -$8,743K
Deserving of highest salary: Kevin Durant, $415K
Most accurately paid (2): Luis Scola, $765
Lowest deserved salary (3): Tony Wroten and Jae Crowder, $1,160K and $788K respectively
Deserving of average salary (4): Derrick Favors, $937K

1. Real salary minus expected salary is our statistic for the difference between what a player is paid and what we believe he should be paid.
2. The most accurately paid player (Luis Scola) has the smallest difference between real and expected salary.
3. The lowest deserved salaries (Tony Wroten, Jae Crowder) belong to the players with the league's lowest PER (Wroten) and lowest WS (Crowder); since our data is normalized, both values appear as zero, so our calculation says they technically deserve a salary of zero.
4. The player most deserving of the average salary (Derrick Favors) is the one closest to both the league-average PER and the league-average WS.

Validation

There are many factors teams weigh when determining what contract to offer a player. Because we chose PER, average salary, and win shares, we were able to generate a formula that produces an expected salary. We feel this formula is an accurate depiction of what players deserve to be paid because it rests on what we believe are the most useful statistics. However, others who examine player value would likely use different formulas and methods, and arrive at different salary figures, because they may weigh other statistics differently. For example, Mike Ghirardo of California Polytechnic State University created a different method for analyzing what a player should be paid: he compared salary to Adjusted Plus-Minus (APM), which measures how a team performs with a particular player on versus off the court. However,
after analyzing his report we found his method to be flawed (and ours to be more effective), largely because he himself believes his data contains a lot of noise, and we agree. Of course every data set has some noise, but we expect ours to contain less than ten percent. Ghirardo's data may be so noisy because he used only one basic statistic to determine a player's salary, and because of collinearity between players: one player may play the majority of his minutes alongside a specific group of teammates, so many players end up with similar APM values, which yields a lot of uninformative data. We found our data to have little noise, which is why we judge our method more effective. Our method also focuses on the individual rather than the team, giving us more relevant data for determining a player's salary; we attribute this to using two advanced statistics alongside a basic one.

Minor Validation Issue

We ran two different software stacks, R/phom and Perseus/Mapper, to double-check our results. Instead of the identical barcodes we expected, there was a slight difference: the H1 barcode from R was missing the five infinitely persisting cycles. This is a significant difference, because those cycles are what support our hypothesis. However, we believe the Perseus results are the correct ones, since Perseus is better at handling larger data sets; our H2 did not even run in R, because of an insufficient-memory error. We still stand by our results, but we note this discrepancy as a disclaimer.

Limitations

We were not able to meet all of our research goals. After collecting the data and running Perseus, it became apparent that we could not learn anything about how well specific NBA teams manage their money, because Perseus does not let you identify specific data points. Without knowing which data points belong to which cycles, we decided it would be impossible to analyze individual players or teams using topology.
Conclusion

Contrary to popular belief, our data shows that of the 337 players we analyzed who were eligible for a PER, only 129 were overpaid by our standards. That leaves over 60% of the league (again, counting only PER-eligible players; there are about 440 players in all) who were paid less than their statistics say they deserved during the 2013-14 NBA season. However, while underpaid players outnumbered overpaid ones by roughly eight to five, the amounts by which overpaid players exceeded their expected salaries are much larger than the shortfalls of the underpaid. Prime examples include Amar'e Stoudemire, Carmelo Anthony, Dwyane Wade, and Russell Westbrook. What these players share is marketability: they have become public figures over their careers, and many fans buy tickets solely to say they saw that player in person. This information can help NBA front offices decide where to save money on one player to clear salary-cap room for others.

We also learned that over half of all NBA contracts are similar in value relative to performance. There are several groups of outliers whose value departs completely from what their performance predicts. These contracts are likely the most important decisions an organization makes, since it has either found tremendous value relative to the contract, paid a premium for a premium player, or wasted its money on an underperformer. The fact that outliers exist at all shows that talent evaluation in the NBA is far from perfect, and that some teams are probably better at it than others. Finally, with only eight players in the league who were both paid and performed like superstars, and 30 teams, teams should focus on getting great value from cheap draft-pick contracts and avoid overpaying middling players. With roughly an 8-in-30 chance of having a properly paid superstar, most teams should plan to win without one, as most of them will not have a choice.
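Given an expected-salary column, the conclusion's tallies reduce to a few vectorized comparisons; a sketch with hypothetical file and column names.

R sketch-

nba <- read.csv("nba_salaries.csv")   # placeholder: columns real, expected
nba$diff <- nba$real - nba$expected   # positive = overpaid, negative = underpaid

sum(nba$diff > 0)              # number of overpaid players (129 in our data)
sum(nba$diff < 0)              # number of underpaid players
mean(nba$diff[nba$diff > 0])   # average surplus among the overpaid
mean(-nba$diff[nba$diff < 0])  # average shortfall among the underpaid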
Contributions

• Zach contributed the abstract (edited by Joey), introduction (edited by Joey), math background, Vietoris-Rips complex, homology (edited by Josh), persistent homology (edited by Josh), barcodes (edited by Lucas), persistence diagrams (edited by Lucas), H0 explained, the explanations of the H1-H3 persistence diagrams, the 3D scatterplot, the minor validation issue, R explained, and the barcodes from R.
• Lucas contributed the barcodes, the analysis of H0, and the explanations of all software we used.
• Adam contributed the explanations of PER (edited by Joey), the salary cap, and win shares, and the validation.
• Joey contributed the discussion, the conclusion, and the explanation of the 3D scatterplot.
References

"3d Scatterplot for MS Excel." Doka.ch. Doka Life Cycle Assessments, n.d. Web. 10 Apr. 2015.

Brixius, Nathan. "NBA Rosters and Team Changes: 2013-2014." Nathan Brixius. WordPress, 8 Oct. 2014. Web. 24 Mar. 2015.

Casciaro, Joseph. "Get To Know An Advanced Stat: Win Shares." TheScore. N.p., 12 Feb. 2015. Web. 31 Mar. 2015.

Ghirardo, Mike. "NBA Salaries: Assessing True Player Value." Calpoly.edu. Digital Commons, n.d. Web. 20 Apr. 2015.

Hollinger, John. "What Is PER?" ESPN. ESPN Internet Ventures, 11 Aug. 2011. Web. 26 Mar. 2015.

Smith, Sekou. "2013-14 NBA Salary Cap Figure Set at $58.679 Million." NBA.com Hang Time Blog. NBA, 9 July 2013. Web. 7 Apr. 2015.

"Topological Data Analysis with R." R-bloggers. Web. http://www.r-bloggers.com/topological-data-analysis-with-r/

Weisstein, Eric. "Homology." Wolfram MathWorld. Wolfram Alpha, n.d. Web. 24 Mar. 2015.
R code-

# Load the raw (unnormalized) statistics: PER, WS, and salary per player.
NBA.Graphs <- read.csv("/mnt/nfs/netapp2/students/zmmeyer/Downloads/NBA Graphs.csv",
                       header = FALSE)
data <- data.matrix(NBA.Graphs, rownames.force = NA)

library("TDA")   # loads FNN, igraph, parallel, scales
library(phom)    # loads Rcpp; provides pHom() and plotBarcodeDiagram()

# H0 on the raw data: dimension = 0, filtration displayed up to 1 on the x-axis.
max_dim <- 0
max_f <- 1
bball <- pHom(data, dimension = max_dim, max_filtration_value = max_f,
              mode = "vr", metric = "euclidean")
plotBarcodeDiagram(bball, max_dim, max_f, title = "H0 of Stats vs Salary")

# Switched to the normalized data set (columns scaled to [0, 1], plus a
# constant fourth column of 0.2).
NORMD <- read.delim("/mnt/nfs/netapp2/students/zmmeyer/Downloads/NORMD.txt",
                    header = FALSE)
data <- data.matrix(NORMD, rownames.force = NA)
bball <- pHom(data, dimension = max_dim, max_filtration_value = max_f,
              mode = "vr", metric = "euclidean")
plotBarcodeDiagram(bball, max_dim, max_f, title = "H0 of Stats vs Salary")

# H1 on the normalized data.
max_dim <- 1
max_f <- 1
bball <- pHom(data, dimension = max_dim, max_filtration_value = max_f,
              mode = "vr", metric = "euclidean")
plotBarcodeDiagram(bball, max_dim, max_f, title = "H1 of Stats vs Salary")

# Attempting H2 (max_dim <- 2) produced an insufficient-memory error.