1. Logistic Regression and Markov Chain approach to
NCAA Basketball seeding
Michael Hankin
University of Southern California
mhankin@usc.edu
April 22, 2013
3. Overview of Logistic Regression

Basic idea of Logistic Regression: Given explanatory variables $X$ and a binary response variable $Y$, we wish to determine $P(Y = 1 \mid X)$. Logistic regression allows us to estimate this by modeling

$$Y \sim \mathrm{Bernoulli}\left(\sigma(w^T X)\right), \quad\text{where}\quad \sigma(w^T X) = \frac{1}{1 + e^{w^T X}}.$$

If we model P(i beats j on j's home court | i beat j by x on i's home court) as $\sigma(\alpha + \beta x)$, we obtain the following likelihood:

$$L(\alpha, \beta) = \prod_{g:\,\text{games}} \left(\frac{1}{1 + e^{\alpha + \beta x_g}}\right)^{w_g} \left(1 - \frac{1}{1 + e^{\alpha + \beta x_g}}\right)^{1 - w_g}$$
4. We then find parameters that maximize the likelihood.

$$\ell(\alpha, \beta) = \log L(\alpha, \beta) = \sum_{g:\,\text{games}} \left[ w_g \log\frac{1}{1 + e^{\alpha + \beta x_g}} + (1 - w_g)\log\left(1 - \frac{1}{1 + e^{\alpha + \beta x_g}}\right) \right]$$

$$= \sum_{g:\,\text{games}} \left[ -w_g \log\left(1 + e^{\alpha + \beta x_g}\right) + (1 - w_g)\left(\alpha + \beta x_g - \log\left(1 + e^{\alpha + \beta x_g}\right)\right) \right]$$

$$= \sum_{g:\,\text{games}} \left[ (1 - w_g)(\alpha + \beta x_g) - \log\left(1 + e^{\alpha + \beta x_g}\right) \right]$$
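As a concrete illustration, here is a minimal Python sketch of the simplified log-likelihood on the last line; the function and argument names (neg_log_likelihood, x, w) are illustrative, not taken from the slides.

```python
import numpy as np

def neg_log_likelihood(alpha, beta, x, w):
    """Negative log-likelihood of the model above.

    x : array of victory margins x_g on the winner's home court
    w : array of 0/1 outcomes w_g for the return game
    Uses the slides' convention P(w_g = 1) = 1 / (1 + exp(alpha + beta * x_g)).
    """
    z = alpha + beta * x
    # ell = sum_g (1 - w_g)(alpha + beta x_g) - log(1 + e^{alpha + beta x_g})
    ell = np.sum((1.0 - w) * z - np.logaddexp(0.0, z))
    return -ell
```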
5. Taking partial derivatives of the log-likelihood:

$$\frac{\partial \ell}{\partial \alpha} = \sum_{g:\,\text{games}} \left[ (1 - w_g) - \frac{e^{\alpha + \beta x_g}}{1 + e^{\alpha + \beta x_g}} \right] \tag{1}$$

$$= \sum_{g:\,\text{games}} \left[ (1 - w_g) - \left(1 - \frac{1}{1 + e^{\alpha + \beta x_g}}\right) \right] \tag{2}$$

$$= \sum_{g:\,\text{games}} \left[ \frac{1}{1 + e^{\alpha + \beta x_g}} - w_g \right] \tag{3}$$

$$\frac{\partial \ell}{\partial \beta} = \sum_{g:\,\text{games}} \left[ (1 - w_g)x_g - \frac{e^{\alpha + \beta x_g}}{1 + e^{\alpha + \beta x_g}}\, x_g \right] \tag{4}$$

$$= \sum_{g:\,\text{games}} \left[ (1 - w_g)x_g - \left(1 - \frac{1}{1 + e^{\alpha + \beta x_g}}\right) x_g \right] \tag{5}$$

$$= \sum_{g:\,\text{games}} \left[ \frac{1}{1 + e^{\alpha + \beta x_g}} - w_g \right] x_g \tag{6}$$
6. And the second partial derivatives:

$$\frac{\partial^2 \ell}{\partial \alpha^2} = \sum_{g:\,\text{games}} \left[ -\frac{1}{1 + e^{\alpha + \beta x_g}} \cdot \frac{e^{\alpha + \beta x_g}}{1 + e^{\alpha + \beta x_g}} \right] \tag{7}$$

$$= \sum_{g:\,\text{games}} \left[ -\frac{1}{1 + e^{\alpha + \beta x_g}} \left(1 - \frac{1}{1 + e^{\alpha + \beta x_g}}\right) \right] \tag{8}$$

$$\frac{\partial^2 \ell}{\partial \alpha\,\partial \beta} = \frac{\partial^2 \ell}{\partial \beta\,\partial \alpha} = \sum_{g:\,\text{games}} \left[ -\frac{1}{1 + e^{\alpha + \beta x_g}} \left(1 - \frac{1}{1 + e^{\alpha + \beta x_g}}\right) x_g \right] \tag{9, 10}$$

$$\frac{\partial^2 \ell}{\partial \beta^2} = \sum_{g:\,\text{games}} \left[ -\frac{1}{1 + e^{\alpha + \beta x_g}} \cdot \frac{e^{\alpha + \beta x_g}}{1 + e^{\alpha + \beta x_g}}\, x_g^2 \right] \tag{11}$$

$$= \sum_{g:\,\text{games}} \left[ -\frac{1}{1 + e^{\alpha + \beta x_g}} \left(1 - \frac{1}{1 + e^{\alpha + \beta x_g}}\right) x_g^2 \right] \tag{12}$$
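A short sketch of the gradient and Hessian in code, following equations (3), (6), and (8)-(12); the name gradient_and_hessian and the array-based layout are assumptions for illustration.

```python
import numpy as np

def gradient_and_hessian(alpha, beta, x, w):
    """Gradient and Hessian of the log-likelihood ell(alpha, beta).

    Implements equations (3), (6) and (8)-(12), with the slides' convention
    P(w_g = 1) = 1 / (1 + exp(alpha + beta * x_g)).
    """
    p = 1.0 / (1.0 + np.exp(alpha + beta * x))     # 1 / (1 + e^{alpha + beta x_g})
    grad = np.array([np.sum(p - w),                # d ell / d alpha, eq. (3)
                     np.sum((p - w) * x)])         # d ell / d beta,  eq. (6)
    v = -p * (1.0 - p)                             # common factor in eqs. (8)-(12)
    hess = np.array([[np.sum(v),     np.sum(v * x)],
                     [np.sum(v * x), np.sum(v * x * x)]])
    return grad, hess
```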
7. Want $\alpha, \beta$ s.t. $\nabla\ell(\alpha, \beta) = 0$. For current estimates $\alpha', \beta'$ let $\Delta\alpha = \alpha - \alpha'$, $\Delta\beta = \beta - \beta'$. By Taylor we have:

$$0 = \nabla\ell(\alpha, \beta) = \nabla\ell(\alpha' + \Delta\alpha,\ \beta' + \Delta\beta) \approx \nabla\ell(\alpha', \beta') + \nabla^2\ell(\alpha', \beta')\begin{pmatrix}\Delta\alpha \\ \Delta\beta\end{pmatrix}$$

$$\Rightarrow \quad \begin{pmatrix}\Delta\alpha \\ \Delta\beta\end{pmatrix} \approx -\left[\nabla^2\ell(\alpha', \beta')\right]^{-1}\nabla\ell(\alpha', \beta')$$

Newton to the rescue: Successive updates of the following form should converge to the optimal values.

$$\begin{pmatrix}\alpha \\ \beta\end{pmatrix} = \begin{pmatrix}\alpha' \\ \beta'\end{pmatrix} - \left[\nabla^2\ell(\alpha', \beta')\right]^{-1}\nabla\ell(\alpha', \beta')$$
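A sketch of this Newton-Raphson loop, reusing the gradient_and_hessian sketch above; fit_newton, the zero starting point, and the stopping rule are illustrative choices, not the author's implementation.

```python
import numpy as np

def fit_newton(x, w, n_iter=25):
    """Newton-Raphson updates for (alpha, beta) as in the slide's formula."""
    theta = np.zeros(2)                        # (alpha, beta) starting point
    for _ in range(n_iter):
        grad, hess = gradient_and_hessian(theta[0], theta[1], x, w)
        step = np.linalg.solve(hess, grad)     # [grad^2 ell]^{-1} grad ell
        theta = theta - step                   # Newton update
        if np.max(np.abs(step)) < 1e-10:       # stop once updates are negligible
            break
    return theta
```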
8. Use of Logistic Regression in LRMC

Victory/Defeat margin: We have now found $r^H_x$, the probability that if team i beats team j by x at i's home court, team i will beat team j at j's home court. Assuming home-court advantage is additive, the superiority probability $s^H_x$, the probability that team i would beat team j on a neutral court given that team i beat team j by x on team i's home court, equals $r^H_{x+h}$. Choosing $h$ so that $r^H_{2h} = 1/2$ (a team that wins at home by exactly twice the home-court advantage should be a coin flip on the opponent's court) gives $h = -\alpha_r/(2\beta_r)$ and $s^H_x = \sigma(\alpha_r/2 + \beta_r x)$.
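A small sketch of this computation; neutral_court_prob and the argument names alpha_r, beta_r are illustrative.

```python
import numpy as np

def neutral_court_prob(alpha_r, beta_r, x):
    """Home-court shift h and superiority probability s_x from the fitted model.

    Uses r^H_x = sigma(alpha_r + beta_r * x) with sigma(z) = 1 / (1 + exp(z)),
    as on the earlier slides, so h = -alpha_r / (2 * beta_r) and
    s_x = sigma(alpha_r / 2 + beta_r * x).
    """
    h = -alpha_r / (2.0 * beta_r)            # implied home-court advantage in points
    s_x = 1.0 / (1.0 + np.exp(alpha_r / 2.0 + beta_r * x))
    return h, s_x
```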
9. Alternative assumptions: Because each game has finite length (equal except for overtime), a reasonable estimator for a team's skill is the proportion of time it controls the ball. Going further, the proportion of time a team controls the ball can be estimated by its score divided by the sum of both teams' scores. This motivates a multiplicative home-court advantage (look at the score ratio) and a log-multiplicative one (log of the score ratio).

Reduce overfitting: By penalizing large parameter values (shrinking the model toward the hypothesis that future games are independent of past games) we can reduce overfitting: choose nonnegative $\lambda_\alpha, \lambda_\beta$ and minimize
$$-\ell + \lambda_\alpha \alpha^2 + \lambda_\beta \beta^2.$$
In my regularized examples I placed larger penalties on the $\alpha$'s, operating under the hypothesis that there is no home-court advantage.
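A minimal sketch of the penalized criterion, reusing the neg_log_likelihood sketch above; the names and the commented-out optimizer call are illustrative assumptions.

```python
import numpy as np

def penalized_neg_log_likelihood(alpha, beta, x, w, lam_alpha, lam_beta):
    """Ridge-penalized objective: -ell + lam_alpha * alpha^2 + lam_beta * beta^2."""
    return (neg_log_likelihood(alpha, beta, x, w)
            + lam_alpha * alpha ** 2
            + lam_beta * beta ** 2)

# Example use with a generic optimizer (scipy assumed available):
# from scipy.optimize import minimize
# result = minimize(lambda t: penalized_neg_log_likelihood(t[0], t[1], x, w, 1.0, 0.1),
#                   x0=np.zeros(2))
```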
10. Logistic Regression "Goodness of Fit"

Assumptions for test: Because the number of observations is much larger than the number of "buckets" (for classical LRMC the mean and median number of observations per score differential were approximately 32.9 and 17, respectively), the CLT allows us to normalize the residuals by assuming that each observation is Bernoulli:
$$r_i = \frac{y_i - \hat{y}_i}{\sqrt{\hat{y}_i(1 - \hat{y}_i)}} \quad\text{and thus}\quad \sum_i r_i^2 \sim \chi^2_{n-2}.$$
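A sketch of this test in code, assuming y holds the observed outcomes and y_hat the fitted probabilities; the function name is illustrative.

```python
import numpy as np
from scipy.stats import chi2

def chi_squared_gof_pvalue(y, y_hat):
    """Chi-squared goodness-of-fit p-value from the normalized residuals.

    r_i = (y_i - y_hat_i) / sqrt(y_hat_i * (1 - y_hat_i)),
    sum_i r_i^2 ~ chi^2 with n - 2 degrees of freedom (two fitted parameters).
    """
    r = (y - y_hat) / np.sqrt(y_hat * (1.0 - y_hat))
    stat = np.sum(r ** 2)
    dof = len(y) - 2
    return chi2.sf(stat, dof)                  # upper-tail p-value
```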
11. Chi Squared p-values for logistic regressions

                       2011       2012       2013
additive               0.511777   0.552131   0.569139
additive (reg)         0.500654   0.534811   0.550568
multiplicative         0.495586   0.537728   0.522612
multiplicative (reg)   0.027208   0.001498   0.001819
log mult               0.499545   0.558072   0.593485
log mult (reg)         0.424898   0.440884   0.483908

Table: χ² p-values
16. Overview of Markov Chains

Stochastic process with finite states: A finite-state Markov chain is a stochastic process in which the probability of being in a given state at time t depends only on the state at time t-1.

Steady state: Given some basic conditions, there exists a probability distribution across the states such that, if the Markov chain is run for a long time, we can expect the state at any given time to be "Multinoulli" (categorical) with the steady-state distribution.
17. Use of Markov Chains in LRMC

LRMC states: In LRMC we create one state for each team; being in a team's state indicates that we currently think that team is the best team.

Transition probabilities: Given some probability distribution based on each team's regular-season record, at each "step" we either jump to another team or stay put.

Expected time per state: Eventually a steady-state distribution emerges, representing the proportion of time we expect to spend in each state. In this case, because the transition matrix is sparse and small enough for my laptop to handle, we just find its eigenvector corresponding to eigenvalue 1 and normalize it in $L_1$.
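A minimal sketch of the eigenvector computation, assuming a row-stochastic transition matrix T, so the steady state is the eigenvalue-1 eigenvector of T^T (the left eigenvector of T), normalized in L1; a dense matrix is used here for simplicity and the function name is illustrative.

```python
import numpy as np

def stationary_distribution(T):
    """Steady-state distribution of a row-stochastic transition matrix T."""
    eigvals, eigvecs = np.linalg.eig(T.T)
    k = np.argmin(np.abs(eigvals - 1.0))       # eigenvalue (numerically) closest to 1
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()                       # L1 normalization
```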
18. Transition Probabilities

Naive Approach: To motivate the more complex LRMC approach we start simple. Take p = P(team i is better than team j | team i beat team j), $w_{ij}$ = the number of times i beat j, $l_{ij}$ = the number of times j beat i, and $N_i$ = the total number of games played by i (required to normalize the transition probabilities). Then we define the transition probability
$$t_{ij} = \frac{1}{N_i}\left(w_{ij}(1 - p) + l_{ij}\,p\right).$$

Better approach: Obviously we can do better by considering the victory margin and game location. Writing $x(g)$ for the home team's margin of victory in game g,
$$t_{ij} = \frac{1}{N_i}\left(\sum_{g:\,i \text{ at } j} r^H_{x(g)} + \sum_{g:\,j \text{ at } i} \left(1 - r^H_{x(g)}\right)\right), \qquad t_{ii} = 1 - \sum_{j \neq i} t_{ij}.$$
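A sketch of building this transition matrix from a game list; the (home, away, margin) data layout, transition_matrix, and the r_H callable are illustrative assumptions.

```python
import numpy as np

def transition_matrix(games, n_teams, r_H):
    """LRMC transition matrix via the 'better approach' formula above.

    games : iterable of (home, away, margin) tuples, with
            margin = home score - away score, i.e. the home team's margin x(g)
    r_H   : callable returning r^H_x for a given margin x,
            e.g. lambda x: 1.0 / (1.0 + np.exp(alpha_r + beta_r * x))
    """
    T = np.zeros((n_teams, n_teams))
    n_games = np.zeros(n_teams)
    for home, away, x in games:
        n_games[home] += 1
        n_games[away] += 1
        T[away, home] += r_H(x)        # evidence the home team is better
        T[home, away] += 1.0 - r_H(x)  # evidence the away team is better
    T /= n_games[:, None]              # divide row i by N_i
    # t_ii = 1 - sum_{j != i} t_ij
    T[np.diag_indices(n_teams)] = 1.0 - T.sum(axis=1)
    return T
```

The resulting matrix can then be passed to the stationary_distribution sketch above to produce the ranking.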
19. 2013 Top 10 projected teams

     Top teams     Top teamsL              TopProb    TopProbL
0    Miami (FL)    Nevada-Las Vegas        0.006619   0.003262
1    Michigan      Notre Dame              0.006619   0.003262
2    Wisconsin     Virginia Commonwealth   0.006670   0.003262
3    Ohio State    James Madison           0.006788   0.003262
4    Syracuse      Louisville              0.006991   0.003262
5    Kansas        North Carolina A&T      0.007234   0.003262
6    Gonzaga       North Carolina State    0.007625   0.003262
7    Indiana       New Mexico              0.008241   0.003361
8    Louisville    Syracuse                0.008352   0.003361
9    Florida       Memphis                 0.008582   0.003361
20. Solitary and comparative accuracy

Proportion of Tournament matchups predicted correctly:

                 2012-2013        2011-2012        2010-2011
Additive         0.630769230769   0.716417910448   0.615384615385
Multiplicative   0.569230769231   0.641791044776   0.615384615385
Log Mult         0.630769230769   0.686567164179   0.630769230769
21. 2012-2013 Linear Regression for Playoff probability
difference vs victory margin
22. References

Paul Kvam and Joel S. Sokol (2006). A Logistic Regression/Markov Chain Model for NCAA Basketball. Naval Research Logistics.

RogueWave Logistic Regression Documentation.
http://www.roguewave.com/portals/0/products/legacy-hpp/docs/anaug/3-3.html