Big Data and The Hot Hand Fallacy:
A Nonparametric Approach
Jacob Dorn, Lukas Hager, David Mendelssohn∗
University of Chicago
March 2016
Abstract
While many fans believe in the concept of a “Hot Hand,” where a player who has
made many shots in close succession will have more success in the next attempt than
normal, statisticians have argued that shooting success is independent of whether or
not a player is indeed “hot.” We add to this discussion by using a dataset which has
only been in existence since 2010 and novel, nonparametric methods for defining heat.
In general, we find no effect of heat on shot outcomes after accounting for location. In
some special cases, we find that hot players in fact fare worse on their next attempt
by 2.5 to 5 percentage points. Additionally, we find that both offensive and defensive
players act as if there were such a thing as the Hot Hand, which may explain our lack of
Hot Hand findings; the apparently worse performance of hot players may instead reflect
endogenous, unobserved defender characteristics. This behavior is an important
consideration in future evaluations of any Hot Hand effect.
1 Introduction and Review of Relevant Literature
One of the most compelling spectacles in a basketball game is a ‘hot streak’, where a player
seemingly cannot miss. This effect is well documented in the more poetic portions of the
annals of basketball; many players, whether amateur or professional, have described
themselves as hot when racking up high scoring totals. Purvis Short of the Golden State Warriors
famously declared that the Hot Hand “...[is] hard to describe. But the basket seems to be
so wide. No matter what you do, you know the ball is going to go in” [6, p. 158]. Based
on this statement, the Hot Hand effect can be conceptualized as immediately-prior success
positively correlating with immediately-subsequent success.
By and large, researchers have disagreed with Short’s contention of the Hot Hand’s existence.
The classic explanation follows the premise of Kahneman and Tversky’s “Law of Small
Numbers,” that humans are pattern-seeking beings, attributing meaning to random sampling
variation [8]. Essentially, they argue that, when a fan thinks a player is hot, they are
overinterpreting a chance occurrence.

∗ With thanks to Winnie van Dijk, Ken DeGennaro, Jordan Solomon, and Charlie Rohlf

While this disagreement might not look like an economic question at first glance, it can
be interpreted not only statistically, but also economically in many ways. The first such
interpretation is practical: An NBA championship may bring perhaps tens of millions of
dollars in revenue to a local economy [3]. While much of that income ends up in the hands of
already-wealthy team owners, and some of it may come at the expense of other parts of the
local economy, a true (or false) Hot Hand could present an inefficiency in player evaluation,
and thus be a factor in the allocation of a nontrivial amount of money. If there are hot
players, a coach who lets such a player take a break in the midst of a streak is a foolish
one. Further, if only certain players have the ability to get “hot”, then a coach ought to
consciously play them more when the team is losing with little time left, as the Hot Hand
becomes particularly beneficial if the player manages to make the game close by playing
well.¹
The second interpretation views the response to perceived hot streaks as a particularly plain
instance of classic economic theory. A generalized corporation has the goal of maximizing
shareholder profit. Firms theoretically optimize profits, but the agents doing so do not always
have aligned incentives. A team is a specific corporation where “profit” is likely some mix of
the prestige of victory and the joy of hard cash. During a game, these two presumably align
and agents – players and coaches – should, at least in theory, aim to optimize the likelihood
of victory.
We may speculate on the implications of possible outcomes. If there are hot streaks and
coaches act as if there are not, it suggests some bit of information failure. But if there are no
real hot streaks, and agents act as if there are, it suggests some other incentives may be at
play. Perhaps coaches and defenders maneuver to avoid appearing, respectively, as oblivious
chumps who spurn a well-positioned asset by not playing a hot player or as absentminded
marks who fail to adjust to the time-specific threat of a hot shooter. Perhaps shooters take
advantage of the widespread belief that they have temporary special powers in order to take
more shots and experience a bit more of the limelight. More shots might even translate to
higher salaries, yielding mismatched incentives. While we do not consider these implications
in our paper, they are certainly relevant for future consideration.

¹ It is worth noting that we do not consider time effects in “cooling” a player’s hot streak, but our inclusion
of only the prior four shots in evaluating heat should, hopefully, make this omission’s effects negligible.

The third interpretation is to view our paper as an investigation of a peculiar cultural
situation. The NBA can be viewed as a series of high-stakes Bernoulli trials – effectively,
coin flips with differing probabilities. Organizations spend hundreds of millions of dollars in
the hopes that, on any given night, they will have more coins come up heads than another
organization. The efficiency of these choices of metaphorical coins, of throwing style, and
of ordering of throws, gets debated for hours each day before and during the season on
commute-time radio and Internet forums (e.g. [5]). Many of these commentators think there
is such a thing as the Hot Hand; in these terms, a persistent belief in correlation across
Bernoulli trials. We would like to know if they are correct.
In 1985, Gilovich, Vallone, and Tversky wrote a seminal paper on this subject entitled “The Hot
Hand in Basketball: On the Misperception of Random Sequences.” After defining hotness
as a player hitting multiple shots in a row, they found (for a Philadelphia 76ers team over
48 home games) no statistical difference between the supposed hot streaks and the expected
number of hot streaks if shots were independent Bernoulli. This paper has been highly
regarded for a long time. In fact, the Harvard basketball team was criticized on the basis of
this paper’s findings when they professed belief in the Hot Hand [7]. However, Gilovich et al.
make a notable assumption, commendably mentioned at the beginning of the paper:
“Each player has an ensemble of shots that vary in difficulty (depending, for
example, on the distance from the basket and on defensive pressure), and each
shot is randomly selected from this ensemble.” [2, p. 297]
This statement, while perhaps necessary when using only data available in 1985, represents
a fundamental oversimplification of the problem. Player shot choice is non-constant. And,
in fact, we find that their choice depends on their heat level, as hot players take worse shots.
As such, treating the overall shooting percentage as constant across shots is fundamentally
misguided, as the very instance of a hot streak makes its continuation less likely.
Their approach was improved upon somewhat by Bocskocsky et al.’s “The Hot Hand: A New
Approach to an Old ‘Fallacy.’” The authors set out to ameliorate issues with Gilovich et al.’s
approach by defining heat as exceeding some expectation of performance, rather
than raw streaks of hits and misses. This required the creation of some way of defining the
expectation of a certain shot on the court. To achieve this end, Bocskocsky et al. regressed
shot outcomes on dummy variables for position on court, by dividing the floor into 2-by-2
foot boxes with assumed homogeneous effects across players, and regressing on dummies for
box and player [1]. Thus, their model of the expected probability of making a shot took distance
into account only insofar as discrete squares can be a proxy for location. Further, they used
a linear regression model because of a lack of convergence of their Probit models; we consider
our KNN approach, using a similar data source, preferable.
2 Data
The SportVU system is an arrangement of cameras hung from the rafters of arenas for all
NBA teams. It collects 25 readings per second to provide statistics on player position, ball
position, and play information. STATS LLC provides an Application Programmer Interface
(API) to access the data, beginning in the 2010-2011 NBA season. We use two API sources:
shot chart data and shot log data. Shot chart data offers position, time, and outcome
information. Shot type (layup, jump shot, etc.) and some other provided attributes were
discarded from our analysis. We also removed the provided distance value in favor of a more
exact measure we calculated using (X,Y) data.² We were able to retrieve approximately
205,000 shots worth of data from the shot charts from the 2014-2015 season. However,
because shots on which the shooter was fouled are only considered shots if the basket is
made, we limited our study to the 160,000 shots where no foul occurred. An example of all
of the remaining individual (X,Y) data points for Stephen Curry is provided as Figure 1.
Unfortunately, the shot log dataset became unavailable after we began this project. By
communicating directly with STATS, we were able to procure fewer than 10,000 shots worth
of data from the 2015-2016 season. This data is anonymized, meaning we could not integrate
it with the shot chart data we had already obtained, nor could we determine if a foul had
been committed on any shot. Not only did this mean that any analysis we could perform
would potentially suffer from omitted variable bias in the form of missing attributes, but it
also meant that we had very few shots worth of shot log data for any given player (and, at
most, four games per player). This issue is further accentuated by the fact that some players
shoot much more often than others, giving us even less data on those players who attempt
shots infrequently. Histograms of number of shots by a player in shot chart data and shot
log data are provided as Figure 2 and Figure 3 respectively.
As is evident from comparing Figures 2 and 3, the shot chart data commonly has 1000 shots
per player, while, in the less complete shot log data, almost no one reaches even one tenth
of that frequency. Still, for analyzing this smaller shot log data, we used closest defender
height, closest defender distance, number of dribbles taken before the shot, an indicator
for scoring, (X,Y) location, (anonymized) game id number, (anonymized) player id number,
game period, game clock, and shot clock.³ Length of touch time was also provided, but we
exclude this from the analysis on theoretical grounds because it is potentially an effect of
player heat and thus should not be evaluated as a factor in order to prevent endogeneity,
with a justification we detail below. We also did not consider effects of game time, shot clock,
or number of dribbles, though these effects may turn out to be relevant in future analyses.

² We verified that our estimates were consistent with the given data. In fact, we found that the STATS
data consistently rounded in the same direction, so our imputation was a slight improvement.

Figure 1: All Stephen Curry shots. Green triangles are made shots. Red circles are missed
shots. (Axes: x; y.)

Figure 2: Histogram of number of shots per player in the larger shot chart dataset.
(Axes: number of shots by player, log scale; density.)

Figure 3: Histogram of number of shots per player in the smaller shot log dataset.
(Axes: number of shots by player, log scale; density.)

Figure 4: Number of games by player in shot log dataset.
(Axes: number of games in dataset; number of players.)

3 Methods
We consider each attempted shot to be, at the moment the offensive player decides to shoot,
a Bernoulli random variable with unknown probability. The probability is clearly in the
interval [0, 1], with a mix of factors in the probability generation model both unknown and
known to the econometrician. We may then write the outcome model for a shot by player i
in game g at time t as:
$$\text{make}_{i,g,t} = f(Z_{g,t}, Y_{i,g,t}) + \eta_{i,g,t} = f(X_{i,g,t}) + \eta_{i,g,t}$$
Here, f() is some unknown function of Z, factors which might apply identically to any other
player in the same game at the same time (e.g. if there is something odd about the arena at
that moment), and Y , factors which are time-, player-, and game-specific. It is worth noting
that, if the factors in Z can be consistently estimated for a player’s data, we can do just as
well by treating any factors in Z as factors in Y (though there may be losses of efficiency
in the finite case). We then combine these factors as $X_{i,g,t}$. As is well-known, since f is
necessarily in the interval [0, 1], η may be written as an exogenous (to f) variable to ensure
that the make function takes on the values 0 or 1, 0 for missing and 1 for making, and f()
is the probability of the shot being made. Our question of interest may then be restated as
a test for substantial evidence that heat belongs as an input to the functional form. We aim
to test the null hypothesis that most fans are wrong, i.e. our null hypothesis is that heat is
not a factor in f().
Before we may consider whether heat belongs as an argument to the functional form, we
must have some definition of heat. In turn, heat is defined as outperforming expectations
in prior shots, so to have a definition of heat, we must have a definition of expectations. In
our main dataset, we have only position on court and time. We chose not to pursue time
controls, so we are left with only two potential inputs to control for in expectations: distance
and position.

³ For a reader more interested in the econometric content than the basketball content of this paper, the
shot clock effectively refers to the time remaining for the offensive team to attempt a shot. More details can be
found through alternative sources.

We do not see any good method of parametrizing distance. Proportion of shots made by
nearest distance can be found in Figure 5.⁴ While, in general, shots from farther away are
less likely to go in, there is a large stretch of distances where farther shots are more likely
to go in, before percentages fall off quickly with distances beyond the three-point line. This
makes the parametrization of distance difficult with any polynomial.
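The binning behind a plot like Figure 5 (distance rounded in log scale to the nearest multiple of 0.2, then exponentiated) can be sketched as follows. This is only an illustration under stated assumptions: the natural logarithm is assumed because the base is not stated, and the names `log_bin` and `proportion_made_by_bin` and the `(distance_ft, made)` pair layout are hypothetical.

```python
import math
from collections import defaultdict


def log_bin(distance_ft, step=0.2):
    """Round log(distance) to the nearest multiple of `step`, then exponentiate.

    Assumes distance_ft > 0 and a natural-log scale (the base is an assumption).
    """
    return math.exp(round(math.log(distance_ft) / step) * step)


def proportion_made_by_bin(shots, step=0.2):
    """Proportion of made shots per log-distance bin.

    shots: iterable of (distance_ft, made) pairs, with made equal to 0 or 1.
    Returns a dict mapping each bin's representative distance to its make rate.
    """
    totals = defaultdict(lambda: [0, 0])  # bin -> [makes, attempts]
    for distance_ft, made in shots:
        b = log_bin(distance_ft, step)
        totals[b][0] += made
        totals[b][1] += 1
    return {b: makes / attempts for b, (makes, attempts) in sorted(totals.items())}
```

Binning on the log scale gives narrow bins near the basket, where make probability changes quickly, and wide bins far from it.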
Indeed, we consider any attempt to parametrize these positional factors with a generalized
equation form difficult, because of the variety of player-specific positional factors for which
no econometrician can account. Consider Figure 6, a graph of Stephen Curry’s shooting
percentages by four-square-foot (“2-by-2”) box in our dataset. Curry shoots far better in
the lower-left corner than the lower-right corner, and has a large patch of particularly good
shooting near (-8, 8). While the first factor might be captured by side of court, the latter
cannot easily be included in any top-down analysis.
There is also a more pernicious threat to any analysis of our larger dataset: defenders.
Because of limitations on data availability, we cannot incorporate any defender information
into our shot chart analysis. If defenders play more closely when the offensive player is hot,
this is a threat to exogeneity. The defender problem is an unavoidable result of temporary
data limitations. The issues of parametrizations, though, can be addressed with a different
prediction method: K Nearest Neighbors (KNN).
Given some number of shots (K) and some cutoff value (a distance), KNN predicts a shot’s
likelihood of going in by taking the sample average of the outcomes of the first K shots that
are within the cutoff. As we increase K, we incorporate more data and, if the additional data
is sufficiently similar, achieve more-exact predictions. However, many players lack sufficient
shots in all locations, so increasing K means that we cannot predict some shots for some
players. To remedy this, we can increase our cutoff, but this leads to us predicting a shot
using the outcomes of attempts that are less and less similar to it in terms of location. As
such, deciding which K and which cutoff value to use is a sort of optimization: we want to
be able to predict as many shots as possible, while also making sure that those predictions
are as accurate as possible. KNN further represents an attractive choice of specification,
as it has the property that when the ratio $K/n \to 0$ as $n \to \infty$, the bias of
the estimator goes to 0, and, under suitable moment conditions on f(), the variance of the
estimator tends towards 0 as well [4]. While our data is finite, a large K and small cutoff
should then yield low-variance and low-bias estimation of f().

⁴ Data is rounded in log scale to the nearest multiple of 0.2, then exponentiated, because of large swings
in probability among the many shots near the basket.

Figure 5: Shooting percentage, by distance, in larger dataset. Vertical line represents general
distance of three-point line. (Axes: distance, log scale; prop_made.)

Figure 6: Stephen Curry’s shooting percentage by location, divided into 2-by-2 foot squares.
(Axes: round_x_2_ft; round_y_2_ft. Scale: prop_made, 0.00 to 1.00.)

We construct our KNN estimate as follows:
1. For each player’s shot, consider all of their other shots
2. If there are not at least K other shots by the player within the chosen cutoff distance
of the given shot (in our case, the cutoff is evaluated under the Euclidean metric, and
so can be measured in feet), make no prediction
3. If there are at least K other shots by the player within the chosen cutoff of the given
shot, use the proportion of those K which were successful
4. Optionally, among the shots where we can make a prediction, regress outcome on
prediction in a simple linear model to generate an improved “fitted” prediction
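The steps above can be sketched in Python as follows. This is a minimal illustration rather than the code used for the paper: the function name `knn_predictions` and the `(x, y, made)` tuple layout are hypothetical, the K shots used are assumed to be the K nearest within the cutoff, ties are broken by input order, and the optional fitted regression of step 4 is omitted.

```python
import math


def knn_predictions(shots, k, cutoff=math.inf):
    """Leave-one-out KNN make probabilities for a single player's shots.

    shots: list of (x, y, made) tuples, with made equal to 0 or 1.
    Returns one entry per shot: the proportion of the k nearest other shots
    that were made, or None when fewer than k other shots lie within
    `cutoff` (Euclidean distance, in the same units as x and y).
    """
    preds = []
    for i, (xi, yi, _) in enumerate(shots):
        neighbours = []
        for j, (xj, yj, made_j) in enumerate(shots):
            if j == i:
                continue  # step 1: only the player's *other* shots
            d = math.hypot(xi - xj, yi - yj)
            if d <= cutoff:
                neighbours.append((d, made_j))
        if len(neighbours) < k:
            preds.append(None)  # step 2: too few nearby shots, no prediction
            continue
        neighbours.sort(key=lambda pair: pair[0])
        nearest = neighbours[:k]
        preds.append(sum(made for _, made in nearest) / k)  # step 3
    return preds
```

Passing `cutoff=math.inf` corresponds to a specification with no cutoff, in which a prediction exists whenever the player has at least K other shots.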
Choosing a specification for KNN (values for K and cutoff) is a non-trivial optimization
problem. We want to be able to predict on as many shots as possible, while also making
good predictions. We considered these factors in picking a specification, based on $R^2$ values
for different specifications, visible in Figure 7. While $R^2$ is certainly problematic when used
alone as a metric, it gives us a good idea of the fit of the specifications used. The best
specification by this metric was K=25 with a cutoff of 5 feet. However, we could only
predict 38% of shots with this standard. Potentially, these 38% may be different shots than
the remainder of our dataset, so we also consider the $R^2$-maximizing specification which covers at
least 80% of shots. This specification ended up being a K-value of 50 with no cutoff, for
which we can make a prediction on 99% of shots. We further cut our sample arbitrarily in
half and applied KNN to ensure that no overfitting was occurring; the results are shown in
Figure 8. We see that the prediction is worse, but not so much worse as to expect that KNN
has overfitted for any given specification.
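The trade-off between coverage and fit can be made concrete with a small helper that scores one specification. This is a sketch under assumptions: the name `coverage_and_r_squared` is hypothetical, predictions are given as probabilities with `None` marking shots that could not be predicted, and $R^2$ is computed directly from the raw predictions as one minus the ratio of residual to total sum of squares, which may differ from the exact $R^2$ reported here.

```python
def coverage_and_r_squared(outcomes, predictions):
    """Share of shots with a prediction, and R^2 on those shots.

    outcomes: list of 0/1 shot results.
    predictions: list of probabilities, or None where no prediction was made.
    Returns (coverage, r_squared); r_squared is None if it is undefined.
    """
    pairs = [(y, p) for y, p in zip(outcomes, predictions) if p is not None]
    coverage = len(pairs) / len(outcomes)
    if not pairs:
        return coverage, None
    mean_y = sum(y for y, _ in pairs) / len(pairs)
    ss_tot = sum((y - mean_y) ** 2 for y, _ in pairs)
    ss_res = sum((y - p) ** 2 for y, p in pairs)
    if ss_tot == 0:
        return coverage, None  # all predicted shots had the same outcome
    # This R^2 can be negative when predictions fit worse than the mean.
    return coverage, 1 - ss_res / ss_tot
```

Evaluating this helper over a grid of K and cutoff values, and again on a random half of the sample, mirrors the comparison described above.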
While there are alternatives, such as Bocskocsky et al.’s division of the court into 2-by-2 foot
squares with player fixed effects (and other factors which we did not incorporate into our
analysis), we find that, after regressing on KNN predictions to produce “fitted” predictions,
our low-cutoff prediction method outperforms all other parametrizations tried (see Table 1),
with fewer degrees of freedom needed. Admittedly, our specification with no cutoff
underperforms other specifications, but, as it is solely used as a robustness check, this is of little
consequence.

Figure 7: $R^2$ values for different choices of K and cutoff.
(Axes: k; knn_r_sq. Legends: as.factor(groups) with levels 2, 3, 5, 10, 20, Inf; cutoff.)

Figure 8: Predictions based on half of the data.
(Axes: k; knn_r_sq. Legends: cutoff; cutoff_is with levels 2, 5, 10, 25, Inf.)


Table 1: Comparison of $R^2$ for various prediction methods in NBA location-only dataset⁵

prediction                       regression r sq   adjusted r sq   simple knn r sq
box pred with fe                 0.079             0.073
linear fe pred                   0.072             0.070
third deg fe pred                0.080             0.077
fifth deg fe pred                0.085             0.082
lin by player                    0.078             0.073
lin plus theta by player         0.080             0.075
lin plus sin by player           0.080             0.075
lin plus triangle by player      0.083             0.075
cubic by player                  0.089             0.080
cubic plus theta by player       0.092             0.082
cubic plus sin by player         0.092             0.082
cubic plus triangle by player    0.094             0.082
k=50 cutoff=none                 0.061             0.054
k=25 cutoff=5                    0.095             0.084

We then have a method to test the hypothesis that heat levels have no effect on shooting
percentage. If it were truly a part of our functional form, we could not give a specific level
of effect without parametrizing f(). But, to test our null (that each shot is independent),
we may consider the derivative of f() with respect to heat, however heat is defined. If f
is, in fact, differentiable with respect to heat, this function has a linear first order Taylor
Approximation, so we may include the average marginal effect of heat, however heat is
defined, as:
$$\text{make}_{i,g,t} = \beta\,\text{knn}_{i,g,t} + \gamma\,\text{heat}_{i,g,t} + \varepsilon_{i,g,t}$$
where $\text{knn}_{i,g,t}$ is our KNN prediction of the shot calculated using the player’s other shots,
as explained above.
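A regression of this form can be sketched with the standard library alone. The sketch below is illustrative and makes several assumptions: the names `ols` and `heat_regression` are hypothetical, an intercept is included because the regression tables report a Constant row even though the displayed equation omits one, and standard errors are not computed.

```python
def ols(rows, y):
    """Least-squares coefficients for y regressed on the columns of `rows`.

    rows: list of regressor lists (one list per observation).
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination
    with partial pivoting; assumes the regressors are not collinear.
    """
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def heat_regression(made, knn, heat):
    """Regress shot outcome on the KNN prediction and a heat measure.

    made: 0/1 outcomes; knn: KNN make probabilities; heat: heat values.
    Returns the intercept, beta (on knn), and gamma (on heat).
    """
    rows = [[1.0, k, h] for k, h in zip(knn, heat)]
    const, beta, gamma = ols(rows, made)
    return {"const": const, "beta": beta, "gamma": gamma}
```

Under the null that heat does not enter f(), gamma should be close to zero; if the KNN prediction is well calibrated, beta should be close to one.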
Even if we were to find an effect of heat, the likely interpretation would be endogeneity.
Unobserved factors in shooting (e.g. defender quality, whether the shooter fought with their spouse
the prior night, the quality of the shooter’s lunch, time-specific injuries, etc.) are almost
certainly correlated among shots in the same game, and even more so, among shots in the
same game at similar times. Indeed, when viewed in this way, it would be shocking not to
find apparent positive effects of heat. It would suggest that the econometrician accounted
for all significant non-shot-specific factors!

⁵ Here, “box pred with fe” refers to our attempt to replicate Bocskocsky et al.’s method of division into
2-by-2 boxes, plus player fixed effects, though we did not have their other controls (namely their controls for time
remaining, score differential, play-by-play categorization, angle of defender, and height differential). “lin,”
“linear”, “third”, “cubic”, and “fifth” refer to, respectively, 1st, 1st, 3rd, 3rd, and 5th degree polynomials of
imputed distance. “theta” refers to regressing on angle between shooter and basket, relative to the symmetric
line from basket to basket, while “sin” refers to the sine of that angle, and “triangle” regressions divided the
court with the y = x and y = −x lines to produce triangles for left, right, top, and bottom portions of the
court (with the bottom and top portions combined to form a “middle”). “fe” regressions include player fixed
effects; “by player” regressions regressed on distance and angle separately by player.

We have thus far delayed any real definition of heat. This is, in part, because our choice of
definition requires navigating waves of ambiguity. It is not immediately clear what factors
would suggest one measure over the other. While we choose to use a novel definition that
seems at least as good as the others, we include other definitions for comparison. Before
embarking onto these mathematical decisions, it would be good to have a more specific sense
of what the “Hot Hand” is.
The informal sense – a player having a temporary burst of near-magical powers – suffices
from a fan’s perspective. Let us develop a more formal sense with a brief thought experiment.
Presume, perhaps impossibly, that a hot streak does exist during a specific player’s shot,
and bestows positive effects. What can the hot streak not do?
Suppose, at the precise moment the hot player releases his shot, a physicist were to freeze
the arena in time. They may explore the action with a calculator, a supercomputer, and
as much time as they would like. The physicist can calculate the velocity of the ball, the
thickness of the air, the density of sweat on the ball. We presume they may predict, with
some high degree of precision, whether the ball, if unimpeded on its way, will fall through
the basket. What role, at this moment, could the shooter’s hot streak serve? Very nearly
none, as the result is quite literally out of their hands.
Consider now the moment the player decides to shoot. If they have normally disadvantageous
momentum, they may be inclined to shoot anyway because of their heat. But at this moment,
their momentum is fixed, so heat’s only apparent effect can be the change of choice function.
A well-designed model may show selection bias later rooted in this stage. But if a hot streak,
in fact, were to do nothing to shot quality (nor rational players’ shot selection), our physicist,
perhaps in cahoots with an econometrician and using more extensive data than ours, could
control for these shot-specific effects, yielding no remaining role for heat.
So we have a specific period of time in which we may see heat: During the action of the shot.
The player’s momentum, position, and defender are all fixed at the moment of decision, so
we should, given perfect information, control for these.

We are then left with what we would characterize as a hot streak given perfect information: a
change in likelihood of shot success, controlling for all characteristics decided at the beginning
of the shot.
The defender’s precise actions are in part decided in the few moments after release, so even
if we could, we should not control for overly-specific defender actions. For different reasons,
even if we could observe the quality of the player’s shooting motion, we should not control
for this; the shooting motion would be a likely path for heat to have its effect. Thus, we
would prefer to not include touch time in our analysis.
This definition has an uncomfortable implication that we should not control for some defender
characteristics. Morally, it feels wrong to ascribe the actions of the defender to the shooter.
Indeed, a player with heat may shoot better, but if they are also defended better, we may
even see them shoot more poorly during a hot streak. While perhaps morally repugnant,
this is the economically-relevant perspective. If all positive offensive effects are cancelled out
by defensive effects, there is no reason to pass to a “hot shooter.”
We now proceed to our mathematical definition of statistical heat. We first revisit Bocskocsky
et al.’s paper. Particularly, we are interested in their specification of “complex heat.”
Their definition is
$$\text{Complex}_n = (\text{\% of past } n \text{ shots made}) - (\text{a priori expected shooting \% of those } n \text{ shots})$$
This specification is certainly viable, but we chose to alternatively define heat using what
we called a “Heat CDF,” the probability that the shooter would do at least as well as they
actually did on the n immediately-prior shots. We can express this value as:
$$\text{CDF}_n = P(X \geq \text{shots actually made}) = P(\text{makes at least as many shots as observed})$$
where n is the number of shots attempted and X is a variable referring to the number
of makes. Consistent with Bocskocsky et al., we choose n=4. This value is, in fact,
$1 - \text{CDF}_{\text{shooting percentage}}(\text{shots made})$, but can be viewed as
$\text{CDF}_{\text{“missing” percentage}}(\text{shots missed} - \varepsilon)$, for any $\varepsilon \in (0, 1)$.
This statistic has two benefits. One benefit accrues to the researcher: We can compare
outcomes in a fair way. Not all players who make one more shot than expected – which is to
say, 25% more shots than expected – are equal. Instead, the language of probability allows
us to compare the marginal difficulty of those 25% additional makes, and to look at the
effects of the CDF being below any level, i.e. the effect of the immediately-prior four shots
being at least a certain level of improbability. The second benefit accrues to the reader: A
fan does not, in our experience, consider a player “hot” directly because of the additional
points they contribute. “Hot” is, instead, closer to a synonym of “impossible,” and so we
prefer to define heat as the impossibility of outcome; how likely a player was to do as well
as they did.
To construct the distribution, we use our KNN predictions to create a probability for each
of the prior four shots,⁶ and then calculate the player’s CDF function on those four shots.
As such, if the probability that they made as many shots as they did is extremely small
(i.e. small probability of seeing more extreme results), we would call them hot. We can
additionally define coldness as the opposite: if the probability that they made so few shots
out of the n is very small, we can call them cold. We omit results for coldness from this
paper for brevity, not logic.
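Because each of the prior four shots has its own predicted probability, the number of makes follows a Poisson-binomial rather than a plain binomial distribution. The sketch below computes the tail probability in the displayed definition, P(X ≥ shots actually made). It is an illustration under assumptions: the name `prob_at_least` is hypothetical, and the prior shots are treated as independent, as they would be under the null hypothesis.

```python
def prob_at_least(probs, makes):
    """P(X >= makes), where X counts makes among independent shots.

    probs: predicted make probability for each prior shot (e.g. four values).
    makes: number of those shots actually made.
    """
    dist = [1.0]  # dist[j] = probability of exactly j makes so far
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1 - p)   # this shot is missed
            nxt[j + 1] += mass * p     # this shot is made
        dist = nxt
    return sum(dist[makes:])
```

For example, a player who made all four prior shots when each had a predicted probability of 0.5 gets a value of 0.5 to the fourth power, or 0.0625. Small values correspond to hot players, and the analogous lower tail would measure coldness.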
As a robustness check, we also try Bocskocsky et al.’s definition of heat, after multiplying by n, as
“complex difference,” and the ratio of percentage of shots made to expected percentage of shots made as
“complex ratio.” We find similar results with these definitions, though we lose p-values as natural
cutoffs for comparing heat levels.
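As a worked illustration of these two alternative measures, the sketch below computes both from the outcomes and predicted probabilities of the prior n shots. The function name `complex_heat` is hypothetical, and the ratio is read as the share of shots made divided by the expected share, which is an interpretation of the description above.

```python
def complex_heat(made, expected):
    """Complex difference and complex ratio over the prior n shots.

    made: 0/1 outcomes of the prior n shots.
    expected: predicted make probabilities for the same shots (not all zero).
    Returns (difference, ratio), where difference is n times
    (share made - expected share) and ratio is share made / expected share.
    """
    n = len(made)
    pct_made = sum(made) / n
    pct_expected = sum(expected) / n
    difference = n * (pct_made - pct_expected)
    ratio = pct_made / pct_expected
    return difference, ratio
```

For instance, making three of four shots with predicted probabilities 0.5, 0.5, 0.25, and 0.25 gives a difference of 1.5 makes above expectation and a ratio of 2.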
We also explored the use of linear, cubic, and higher-order polynomials, with either homogeneous
effects in addition to player fixed effects or as independent variables with coefficients
allowed to vary for all players. Again, their $R^2$ values all fell in between those of a fitted
KNN prediction with K=50 and no cutoff and with K=25 and a cutoff at 5 feet (Table 1),
suggesting that the aforementioned convergence as $K/n \to 0$ and $K \to \infty$ is, to at least a loose
degree, holding in our finite sample for K=25.
As a side note, by and large, our predictions are in the interval [0, 1]. For direct KNN, failure
in this is impossible. While regressing on KNN prediction, it is theoretically possible to find
either negative or greater-than-1 fitted probabilities. For the other methods of approximating
distance and angle, it is certainly possible that the sum of average effects may, in some cases,
yield an implied probability outside the realm of feasibility. In fact, these occurrences are
rare for non-KNN methods, and never were observed in our KNN methods. The non-KNN
percentages are detailed in Table 2.

⁶ We do not consider the possibility that a player might be “hot” during the first four shots of the game,
nor do we consider heat if we cannot make a prediction during all four immediately preceding shots. The
latter is alleviated by the inclusion of the no-cutoff KNN specification, but the former remains as a case
where our findings may not have external validity.

Table 2: Proportion of shots feasible for non-KNN predictions

                                          Predictions in interval [0, 1]
Baseline, shot log                        9036 feasible predictions (99.4% of shots in dataset)
Cubic & triangle by player, shot chart    188901 (99.7%)
Cubic by player, shot chart               188961 (99.8%)
Fifth degree & fe, shot chart             189075 (99.8%)

4 Results and Discussion
Our main results will be based on the shot chart data, which lacks defender information.
We will also briefly look at shot log data to make conclusions about defender behavior, but
will be unable to look for heat for statistical reasons rooted in our lack of data. We ran
many alternative regressions which were omitted because of space but which, in general,
were consistent with the results given here.
Within the shot chart data, we will look for, and in general fail to find, heat, in our main
specification, K=25 with a cutoff of 5 feet and using our CDF of heat. We will then search,
analogously, for heat in the larger sample size where K=50 and no cutoff, to check for
selection bias among the shots with sufficient similar shots. As a robustness check, we will
also consider other methods of predicting shots and defining heat.
For all of these shot chart results, besides effects of heat on shooting percentage, we will
also consider apparent effects of heat on shooter behavior. This is done by regressing our
prediction itself on heat among the prior four shots. If a hotter player takes worse shots, we
will see a decrease in predicted quality of shot associated with hotter players.
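Read off the rows of the regression tables in this section, the two regressions described above can be written as follows. The symbols are shorthand introduced only for this sketch, not notation from earlier sections: y_i indicates whether shot i was made, \hat{p}_i is the fitted prediction for that shot, and h_i is the chosen heat measure computed from the prior four shots.

```latex
\begin{align}
y_i       &= \alpha + \beta\,\hat{p}_i + \gamma\,h_i + \varepsilon_i, \\
\hat{p}_i &= \delta + \theta\,h_i + u_i .
\end{align}
```

Under this reading, a well-calibrated prediction gives \beta close to 1, \gamma measures any remaining association between heat and shot outcomes, and \theta measures whether the quality of the shot taken changes with heat. When h_i is the Heat CDF, a lower value means a hotter player, so a Hot Hand would appear as \gamma < 0.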
Let us begin with the shot chart data. Consider Table 3. An increasing Heat CDF (i.e. a
player who was more likely to do at least as well as they did on their prior four shots, which
is to say he performed worse) is associated with worse outcomes, and the other two measures
of heat show slight improvement in shooting from hot players. While all these effects are in
the direction of increased performance by “hot” players, none of these effects are significant,
either statistically or practically.

Table 3: Dependent variable: Shot Made (k=25, cutoff=5 ft, fitted)
Definition of heat:      Heat CDF    complex difference   complex ratio
Fitted KNN prediction    1.008∗∗∗    1.048∗∗∗             1.048∗∗∗
                         (0.033)     (0.044)              (0.044)
Heat                     −0.015      0.003                0.007
                         (0.020)     (0.008)              (0.015)
Constant                 0.008       −0.023               −0.030
                         (0.024)     (0.023)              (0.028)
Observations             8,786       4,407                4,407
R2                       0.098       0.116                0.116
Adjusted R2              0.098       0.116                0.116
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

There is some effect of statistically significantly hot players, but it runs in the opposite
direction. As seen in Table 4, players with a Heat CDF below .05, i.e. players whose play
on the prior four shots would be seen as statistically-significantly better than expected from
a one-sided test, actually perform five percentage points worse. While the number of shots
where this occurs is not large enough to find statistical significance, similar effects are found
by using K=50 with no cutoff, as in Table 5. The magnitude of effect is smaller – only 2.5
percentage points, perhaps because of imprecision in prediction estimates – but the effect is
clear: Extremely hot players shoot worse.
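To make the Heat CDF threshold concrete, the sketch below assumes that the Heat CDF is the probability, under independent shots with the fitted make probabilities, of making at least as many of the prior four shots as the player actually did. That reading follows the description above, but the function name `heat_cdf` and the example probabilities are illustrative assumptions.

```python
def heat_cdf(probs, makes):
    """P(at least `makes` successes) for independent shots with the given
    make probabilities (upper tail of a Poisson binomial distribution)."""
    # dist[k] holds P(exactly k makes) over the shots processed so far.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this shot missed
            nxt[k + 1] += mass * p       # this shot was made
        dist = nxt
    return sum(dist[makes:])

# Four prior shots, all made, with assumed fitted probabilities.
easy = heat_cdf([0.6, 0.6, 0.6, 0.6], 4)   # equals 0.6**4
hard = heat_cdf([0.4, 0.4, 0.4, 0.4], 4)   # equals 0.4**4
print(round(easy, 4), round(hard, 4))

# Three or more makes out of four coin-flip shots: 5/16.
print(heat_cdf([0.5, 0.5, 0.5, 0.5], 3))
```

Under this assumed definition, four straight makes push the Heat CDF below 0.05 only when the shots were fairly difficult (for four equally likely shots, a make probability below roughly 0.47), which may help explain why few shots fall in the most extreme group.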

Table 4: Dependent variable: Shot Made (k=25, cutoff=5 ft, fitted)
Heat CDF...              ≤ 0.05      ≤ 0.1       ≤ 0.5
Fitted KNN Prediction    1.008∗∗∗    1.008∗∗∗    1.008∗∗∗
                         (0.033)     (0.033)     (0.033)
Heat CDF ≤ (x)           −0.052      −0.002      0.021
                         (0.070)     (0.042)     (0.015)
Constant                 −0.004      −0.004      −0.007
                         (0.017)     (0.017)     (0.017)
R2                       0.098       0.098       0.098
Adjusted R2              0.098       0.098       0.098
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 5: Dependent variable: Shot Made (k=50, no cutoff, fitted)
Heat CDF...              ≤ 0.05      ≤ 0.1       ≤ 0.5
Fitted KNN prediction    1.012∗∗∗    1.012∗∗∗    1.012∗∗∗
                         (0.012)     (0.012)     (0.012)
Heat CDF ≤ (x)           −0.026∗∗    −0.011      −0.001
                         (0.011)     (0.008)     (0.003)
Constant                 −0.008      −0.008      −0.008
                         (0.005)     (0.005)     (0.005)
Observations             100,726     100,726     100,726
R2                       0.062
Adjusted R2              0.062       0.062       0.062
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

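The significance statements for Tables 4 and 5 can be checked directly from the reported coefficients and standard errors. The sketch below computes the t-ratio and a two-sided p-value using a normal approximation, which is an assumption made here for simplicity; the helper name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(coef, se):
    """Return (t_ratio, two-sided p-value) under a normal approximation."""
    t = coef / se
    p = math.erfc(abs(t) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|t|))
    return t, p

# Coefficient on the "Heat CDF <= 0.05" indicator and its standard error.
t4, p4 = two_sided_p(-0.052, 0.070)   # Table 4: K=25, 5 ft cutoff
t5, p5 = two_sided_p(-0.026, 0.011)   # Table 5: K=50, no cutoff
print(f"Table 4: t = {t4:.2f}, p = {p4:.3f}")
print(f"Table 5: t = {t5:.2f}, p = {p5:.3f}")
```

The Table 4 estimate has a t-ratio of about −0.74 and is not significant at the 10% level, while the smaller Table 5 estimate has a t-ratio of about −2.36 and is significant at the 5% level, reflecting its much smaller standard error in the larger sample.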
How can we make sense of this? We cannot write off these effects as an artifact of the
KNN predictions. Our alternative methods of making predictions – regressing on a cubic
polynomial of log shot distance, with the possible inclusion of a triangle for shot angle,
and a general fifth-degree polynomial of log distance with player fixed effects – yield
similar coefficients on Heat CDF, as in Table 6.

Table 6: Dependent variable: Shot Made
Prediction Method:       Cubic of Distance   Cubic with Triangle   Fifth-Degree Polynomial with Player Fixed Effects
Shot prediction          1.007∗∗∗            1.006∗∗∗              1.003∗∗∗
                         (0.010)             (0.010)               (0.011)
Heat CDF ≤ CDF Max       −0.024∗∗            −0.026∗∗              −0.024∗∗
                         (0.011)             (0.011)               (0.011)
Constant                 −0.008∗             −0.008∗               −0.006
                         (0.004)             (0.004)               (0.005)
Observations             100,994             100,994               100,994
R2                       0.085               0.089                 0.082
Adjusted R2              0.085               0.089                 0.082
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

As an explanation, we suggest that unobserved defender choices are correlated with heat.
If, when a shooter plays exceptionally well, defenders put more effort into defense, we would
find that players who play extremely well on four shots will be guarded more carefully (or
even guarded by multiple players) on the fifth, yielding worse shooting percentages after
controlling for location. These effects would be most marked for extreme offensive perfor-
mances, yielding similar results. While we cannot test this hypothesis directly in the shot
chart data, we can find indirect evidence for it in offensive shot choice in the shot chart data
and direct evidence from the shot log data.
Analogous to Tables 3, 4, and 5 are Tables 7, 8, and 9. Here, we regress our predicted
likelihoods on heat – if a player takes shots that are more likely to go in, they are taking
better shots. Across measures in these three tables, we consistently find that hotter players
take worse shots (in the shot chart data). While the general regression is not very significant,
the large-cutoff measures are (see footnote 7).
Footnote 7: It is worth noting that there may be endogeneity, also caused by omitted defender information. It may
be that closely-defended shooters take farther shots to increase their distance from the defender. Removing
such endogeneity can be done once the shot log dataset becomes available again.

Table 7: Dependent variable: Prediction (k=25, cutoff=5 ft, fitted knn)
Definition of heat:      Heat CDF    complex difference   complex ratio
Heat                     −0.011∗     −0.004               −0.008
                         (0.006)     (0.003)              (0.005)
Constant                 0.515∗∗∗    0.513∗∗∗             0.521∗∗∗
                         (0.006)     (0.002)              (0.005)
R2                       0.0004      0.0004               0.001
Adjusted R2              0.0002      0.0002               0.0004
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 8: Dep var: KNN Prediction (k=25, cutoff = 5 ft, fitted)
Heat CDF...              ≤ 0.05      ≤ 0.1       ≤ 0.5
Heat CDF ≤ (x)           −0.040∗     −0.014      0.003
                         (0.023)     (0.014)     (0.005)
Constant                             0.505∗∗∗    0.505∗∗∗
                         (0.002)     (0.002)     (0.002)
R2                       0.0003      0.0001      0.0001
Adjusted R2              0.0002      0.00001     −0.0001
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 9: Dep var: KNN Prediction (k=50, no cutoff, fitted)
Heat CDF...              ≤ 0.05      ≤ 0.1       ≤ 0.5
Heat CDF ≤ (x)           −0.013∗∗∗   −0.015∗∗∗   −0.014∗∗∗
                         (0.003)     (0.002)     (0.001)
Constant                             0.403∗∗∗    0.407∗∗∗
                         (0.0004)    (0.0004)    (0.0005)
Observations             100,726     100,726     100,726
R2                       0.0002      0.001       0.003
Adjusted R2              0.0002      0.001       0.003
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

So we have a clear narrative that “hot” players do not seem to do better than others if we do
not control for their defenders, and, if anything, do slightly worse in extreme cases. We can
test that this pattern does not reflect some sort of selection issues by regressing outcomes
and shot selection on heat, by each player, and comparing the distribution of coefficients.
To ensure we have enough shots, we used K=50 and no cutoff. The relevant graphs can
be found in Figure 9 and Figure 10, with median coefficients as a vertical line. Indeed, the
distribution of estimates appears roughly centered on our pooled OLS estimates, supporting
our findings.
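The per-player check can be sketched as follows: fit a separate least-squares slope for each player, then compare the median of those slopes with the pooled estimate. The data layout (a list of `(player, heat, prediction)` tuples), the function names, and the toy values are assumptions for illustration, and the sketch uses a single regressor (shot quality on heat) for brevity.

```python
from collections import defaultdict
from statistics import median

def ols_slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def slopes_by_player(rows, min_shots=3):
    """Map each player to the slope of prediction on heat, skipping players
    with too few shots or no variation in heat."""
    groups = defaultdict(list)
    for player, heat, pred in rows:
        groups[player].append((heat, pred))
    out = {}
    for player, obs in groups.items():
        heat = [h for h, _ in obs]
        pred = [p for _, p in obs]
        if len(obs) >= min_shots and len(set(heat)) > 1:
            out[player] = ols_slope(heat, pred)
    return out

# Toy rows: (player, heat, fitted prediction); the values are made up.
rows = [
    ("A", 0.1, 0.40), ("A", 0.5, 0.44), ("A", 0.9, 0.48),
    ("B", 0.2, 0.50), ("B", 0.6, 0.50), ("B", 0.8, 0.50),
    ("C", 0.3, 0.55), ("C", 0.7, 0.51), ("C", 0.9, 0.49),
]
slopes = slopes_by_player(rows)
pooled = ols_slope([h for _, h, _ in rows], [p for _, _, p in rows])
print("per-player slopes:", slopes)
print("median slope:", median(slopes.values()), "pooled slope:", pooled)
```

If the pooled estimate were driven by a handful of unusual players, the median of the per-player slopes would sit far from the pooled slope; a distribution roughly centered on the pooled estimate is the pattern described above.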
We can also consider the shot log data. With this, we used only a “baseline” model, generat-
ing predictions from a player fixed effects model with a homogeneous third-degree polynomial
of log distance in addition to homogeneous defender distance and height effects, with the
results in Table 10. We found, unsurprisingly, that defender closeness (the distance between
shooter and defender when the shot is taken) and height both reduced the likelihood of shot
success, again suggesting that the omission of these defensive factors in our model of heat is
detrimental to our method.
We can do the analogous regressions of shot outcomes on heat. We find, consistently, that
hot players shoot worse (e.g. Table 11). Yet there is a clear explanation, which is rooted in
our player fixed effects.

Figure 9: Coefficients on Heat CDF from regressing outcome on fitted knn prediction and
heat, by player (x-axis: coefficient on heat; y-axis: unweighted density)

Figure 10: Coefficients on Heat CDF from regressing shot quality on heat, by player
(x-axis: coefficient on heat; y-axis: unweighted density)

Table 10: Dependent variable: Shot Made (shot log data)
log(imputed dist to hoop)    −0.010
                             (0.022)
log dist squared             −0.065∗∗∗
                             (0.014)
log dist cubed               0.008∗∗∗
                             (0.003)
nearest def height           −0.108∗∗∗
                             (0.019)
nearest def dist             0.025∗∗∗
                             (0.002)
Observations                 9,091
R2                           0.100
Adjusted R2                  0.061
Residual Std. Error          0.482 (df = 8714)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Because residuals are orthogonal to player after the inclusion of player fixed effects, the sum
of residuals for each player must be zero. For a player with, say, five shots, if the first four
exceed expectations greatly, the fifth must be significantly below expectations in order to
balance out the residuals. Since we have at most four games of data for any player, such
effects persist in our sample. Indeed, we find that if we reduce our regression to players where
we can make predictions on at least 10 shots, the effect is markedly reduced (see Table 11),
strongly suggesting that these apparent findings reflect our statistical approach, rather than
anything which has to do with basketball itself. It is notable that KNN, which only uses
other shots to make a prediction on a given shot, would avoid these issues, even though the
shot log dataset is not large enough to make the approach useful.
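The mechanical effect of player fixed effects can be seen in a small example. The sketch below uses the simplest possible fixed-effects model, in which each shot's prediction is the player's own sample mean; this is an assumption for illustration, since the baseline model also includes distance and defender terms. The residuals for each player then sum to zero, so four above-expectation shots force a below-expectation fifth.

```python
def fe_residuals(outcomes):
    """Residuals from a player-fixed-effects-only model: each prediction is
    the player's own sample mean, so residuals sum to zero by construction."""
    mean = sum(outcomes) / len(outcomes)
    return [y - mean for y in outcomes]

# A player with five shots: four makes followed by a miss.
res = fe_residuals([1, 1, 1, 1, 0])
print(res)        # four positive residuals, then one large negative residual
print(sum(res))   # zero up to floating-point error

# With more shots per player, one short sequence weighs less on the mean,
# so the fifth-shot residual is less extreme.
long_run = fe_residuals([1, 1, 1, 1, 0] + [1, 0] * 10)
print(long_run[4])
```

In the five-shot case the fifth residual is −0.8, while with 25 shots it shrinks in magnitude to −0.56, in line with the observation that the apparent effect weakens when the sample is restricted to players with more shots.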
Table 11: Dependent variable: Shooting Percentage (shot log data)
Group:                   all shooters   frequent shooters
Baseline prediction      1.014∗∗∗       0.995∗∗∗
                         (0.047)        (0.052)
Heat CDF                 0.086∗∗∗       0.059∗∗
                         (0.024)        (0.026)
Constant                 −0.056∗∗       −0.031
                         (0.027)        (0.029)
Observations             5,076          4,497
R2                       0.087          0.078
Adjusted R2              0.087          0.077
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

On the other hand, hot players had defenders play more closely to the shooter, with similar
magnitudes of effect across both frequent and all shooters; defender height does not appear
to be affected (Tables 12 and 13, respectively). While we do not find statistical significance
in either, undoubtedly caused at least in part by our small sample size, we expect that,
once the relevant data becomes available again at a comparable scale, this portion of our
analysis could be completed, perhaps using KNN with some defender distance metric. In
the meantime, the consistency among frequent and infrequent shooters is heartening but
unsurprising. We did not mandate that residuals in defender distance balance out to zero,
so we no longer are forcing canceling effects in subsequent shots.

Table 12: Dependent variable: Closest Defender Distance
Group:                   all shooters   frequent shooters
Heat CDF                 0.102          0.104
                         (0.135)        (0.139)
Constant                                3.997∗∗∗
                         (0.094)        (0.097)
Observations             5,076          4,497
R2                       0.0001         0.0001
Adjusted R2              −0.0001        −0.0001
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 13: Dependent variable: Closest Defender Height
Group:                   all shooters   frequent shooters
Heat CDF                 −0.001         −0.013
                         (0.015)        (0.016)
Constant                                6.617∗∗∗
                         (0.011)        (0.011)
Observations             5,076          4,497
R2                       0.00000        0.0001
Adjusted R2              −0.0002        −0.0001
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

5 Conclusion
We find no real evidence that hot shooters shoot better. If anything, we find that hot shooters
shoot worse on their next shot. We suspect that this reflects endogeneity dictated by limited
data availability. This explanation is in agreement with Bocskocsky et al.’s finding that a
player shooting well caused a defender to move closer. We also find strong indication that
defender size and effort have significant effects on shot quality. Our clearest findings – a
weak indication that defender effort is impacted by heat, and a strong indication that shot
choice is impacted – are important for future studies of the Hot Hand. Our KNN and CDF
methodology will be a useful approach for further examinations as well.
References
[1] Bocskocsky, A., J. Ezekowitz, and C. Stein (2014). The hot hand: A new ap-
proach to an old “Fallacy”. In 8th Annual MIT Sloan Sports Analytics Conference.
http://www.sloansportsconference.com/wp-content/uploads/2014/02/2014 SSAC The-
Hot-Hand-A-New-Approach.pdf.
[2] Gilovich, T., R. Vallone, and A. Tversky (1985). The hot hand in basketball: On the
misperception of random sequences. Cognitive Psychology 17(3), 295–314.
[3] Hemlock, D. (2013). What’s an nba championship worth? at least tens of millions of
dollars, sports business experts say. SunSentinel. http://articles.sun-sentinel.com/2013-
05-26/news/fl-heat-nba-championship-worth-20130526 1 miami-heat-nba-championship-
merchandise-sales.
[4] Mack, Y. and M. Rosenblatt (1979). Multivariate k-nearest neighbor density estimates.
Journal of Multivariate Analysis 9(1), 1–15.
[5] Paine, N. (2015). No matter how much they make, the best players in the nba are vastly
underpaid. Fivethirtyeight.com. http://fivethirtyeight.com/features/kawhi-leonard-like-
all-the-best-nba-players-is-vastly-underpaid/.
[6] Poundstone, W. (2014). How to Predict the Unpredictable: The Art of Outsmarting
Almost Everyone. Oneworld Publications.
[7] Tepper, T. (2014). Why you shouldn’t overplay a hot hand — in basketball or investing.
Time Magazine. http://time.com/money/3145979/hot-hand-basketball-investing/.
[8] Tversky, A. and D. Kahneman (1971). Belief in the law of small numbers. Psychological
Bulletin 76(2), 105–110.