Total Payroll vs. Winning Percentage in Major League Baseball
Bayesian Statistics, Fall 2014
Lingwen He, Zijian Su, Xiangyu Li, Padraic O’Shea
Introduction

Major League Baseball (MLB) is the last professional sport in America without a salary cap. The lack of a salary cap has led to large differences in total payroll between big-market and small-market teams. This glaring difference has fed the ongoing discussion of whether teams can “buy” wins by spending more money. To investigate whether teams that spend more money have higher winning percentages, we explore whether a linear relationship exists between average total payroll and average winning percentage of MLB teams from 2004 to 2012.
Methods

Data

From Baseball-Reference.com we acquired data on regular season winning percentage by team. These data can be accessed at http://www.baseball-reference.com/leagues/MLB/. Data on total payroll by team were acquired from USA Today and are available at http://content.usatoday.com/sportsdata/baseball/mlb/salaries/team/2004.

To explore the linear relationship between total payroll and winning percentage over time, one data point per team was needed. To calculate these data points, winning percentage and total payroll were collected for the 2004 to 2012 seasons and averaged by team. The predictor variable, average total payroll, was rescaled by dividing by one million, so that it is expressed in millions of dollars and the slope coefficient is larger and easier to read.
Initial diagnostic checks on the averaged dataset did not indicate a severe violation of the normality assumption, although the normal QQ plot was not perfectly linear. Three potential outliers were also identified during these checks. The possible outliers were removed one at a time, and each reduced model was assessed using residual plots and QQ plots. Overall, we found that removing the points did not improve the model or the fit of the distribution. Therefore, the dataset containing all points was used for the analysis. The plots used for these checks can be found in Appendix C.
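The one-at-a-time refits described above can be sketched in a few lines of Python. This is only an illustration, not the R session used for the report: it uses the averaged values as listed in Appendix B (winning percentages rounded to two decimals, so the estimates may differ slightly from the R output in Appendix C), and it assumes the three candidate outliers are observations 3, 12 and 24, as suggested by the figure titles in Appendix C. The helper names `ols` and `refit_without` are hypothetical.

```python
# Averaged 2004-2012 data as listed in Appendix B
# (payroll in millions of dollars, winning percentage as a proportion).
X = [63.69154422, 89.76840667, 74.92461167, 140.7180136, 109.4106621,
     99.92325911, 68.43453544, 60.32229233, 67.06203433, 100.8169706,
     83.123759, 55.12364089, 114.6389837, 99.61515889, 48.75051056,
     69.04419133, 74.57996711, 56.90573011, 116.4527793, 199.368707,
     60.03360822, 118.5733706, 44.04681044, 55.889759, 94.06393778,
     92.80824244, 94.66015589, 44.78459711, 72.37762889, 69.80194444]
Y = [0.47, 0.54, 0.44, 0.56, 0.48, 0.53, 0.49, 0.49, 0.48, 0.52,
     0.46, 0.40, 0.55, 0.52, 0.50, 0.50, 0.50, 0.47, 0.50, 0.58,
     0.52, 0.53, 0.42, 0.50, 0.50, 0.46, 0.57, 0.49, 0.54, 0.50]


def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def refit_without(x, y, index):
    """Refit the line after dropping one observation (1-based index)."""
    keep = [i for i in range(len(x)) if i != index - 1]
    return ols([x[i] for i in keep], [y[i] for i in keep])


full_fit = ols(X, Y)
# Assumption: candidate outliers are observations 3, 12 and 24.
reduced_fits = {k: refit_without(X, Y, k) for k in (3, 12, 24)}

print("all points:  intercept=%.4f slope=%.6f" % full_fit)
for k, (a, b) in reduced_fits.items():
    print("without #%d: intercept=%.4f slope=%.6f" % (k, a, b))
```

If the slope and intercept change little across the refits, that is consistent with the decision to keep all 30 teams in the analysis.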
To begin understanding the data, we consider descriptive statistics. The mean, median, and standard deviation for each variable under consideration are given in Table 1 below. As expected, average total salary appears skewed to the right, since the mean (84.657) is greater than the median (74.752). Average winning percentage appears to be approximately normally distributed. Additionally, the standard deviations are fairly large, especially for total salary.
Data (Avg. 2004 to 2012)   Mean     Standard Deviation   Minimum   Median   Maximum
Total Salary               84.657   32.693               44.046    74.752   199.368
Winning %                  0.5003   0.0414               0.40      0.50     0.58

Table 1. Descriptive Statistics for Study Variables (Total Salary in Millions)
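As a rough check, the summary statistics in Table 1 can be recomputed from the averaged values listed in Appendix B. The sketch below uses only Python's standard `statistics` module; the helper name `describe` is hypothetical, and the standard deviation is assumed to be the sample (n - 1) version.

```python
import statistics

# Averaged 2004-2012 data as listed in Appendix B.
payroll = [63.69154422, 89.76840667, 74.92461167, 140.7180136, 109.4106621,
           99.92325911, 68.43453544, 60.32229233, 67.06203433, 100.8169706,
           83.123759, 55.12364089, 114.6389837, 99.61515889, 48.75051056,
           69.04419133, 74.57996711, 56.90573011, 116.4527793, 199.368707,
           60.03360822, 118.5733706, 44.04681044, 55.889759, 94.06393778,
           92.80824244, 94.66015589, 44.78459711, 72.37762889, 69.80194444]
win_pct = [0.47, 0.54, 0.44, 0.56, 0.48, 0.53, 0.49, 0.49, 0.48, 0.52,
           0.46, 0.40, 0.55, 0.52, 0.50, 0.50, 0.50, 0.47, 0.50, 0.58,
           0.52, 0.53, 0.42, 0.50, 0.50, 0.46, 0.57, 0.49, 0.54, 0.50]


def describe(values):
    """Return the five summaries reported in Table 1."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


payroll_stats = describe(payroll)
win_stats = describe(win_pct)

for name, stats in (("Total Salary", payroll_stats), ("Winning %", win_stats)):
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

A mean above the median for payroll is the same right-skew signal noted in the text.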
Statistical Method

We hypothesize that average total payroll and average winning percentage are linearly associated. To assess this relationship, Bayesian simple linear regression is used with average winning percentage as the response. Two models are fit. The first uses a non-informative prior, reflecting a lack of prior knowledge about the effect of salary on winning percentage. The second uses an informative prior based on our prior beliefs. The predictive outputs of the two models are then compared.
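Written out, the regression described above takes the following form. This is a transcription of the OpenBUGS model in Appendix B rather than a new specification: the predictor is centered at its mean, and the symbol \tau stands for the precision that the code calls tausq.

```latex
\begin{aligned}
y_i \mid \beta_0, \beta_1, \tau &\sim \mathrm{N}\!\left(\mu_i,\ \tau^{-1}\right), \qquad i = 1, \dots, N, \\
\mu_i &= \beta_0 + \beta_1 \,(x_i - \bar{x}), \\
\sigma &= 1 / \sqrt{\tau}.
\end{aligned}
```

Because x is centered, \beta_0 is the expected winning percentage for a team with average payroll, and \beta_1 is the change in expected winning percentage per additional million dollars of average payroll.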
For the informative prior, a N(0.5, 0.05) distribution is used for beta0 because we expect the winning percentage to be 50% with small variance. For beta1, a N(0.1, 100) distribution is used, reflecting our limited knowledge and an expectation that this rate will be positive but not overly large. We expect the variance of beta1 to be large.
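One detail worth keeping in mind when reading these priors against the code in Appendix B: OpenBUGS parameterizes `dnorm` by mean and precision (the reciprocal of the variance), not by mean and variance. The sketch below converts the second argument of each prior under that reading, so the implied variance and standard deviation can be compared with the intent described above. The helper name `precision_to_variance` is hypothetical.

```python
import math


def precision_to_variance(precision):
    """Convert a BUGS-style precision to a variance."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    return 1.0 / precision


# (mean, second argument of dnorm) as written in the informative-prior model.
priors = {"beta0": (0.5, 0.05), "beta1": (0.1, 100.0)}

summary = {}
for name, (mean, precision) in priors.items():
    variance = precision_to_variance(precision)
    summary[name] = {"mean": mean, "variance": variance, "sd": math.sqrt(variance)}

for name, row in summary.items():
    print(name, row)
```

Read as precisions, 0.05 corresponds to a variance of about 20 for beta0 and 100 corresponds to a variance of 0.01 for beta1. Swapping in `1 / variance` as the second argument would be the way to encode an intended variance directly.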
Convergence was assessed from the OpenBUGS output using history plots, auto-correlation plots, and MC_error values. Because convergence was rapid, only one chain was used for the MCMC integration. However, this meant that BGR plots could not be used to assess burn-in. Because the initial values were based on intuition and are likely not representative of the posterior distribution, a burn-in of 3000 samples was used to exclude them.
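To make the burn-in idea concrete, here is a minimal single-chain Gibbs sampler for the non-informative model, written in plain Python. It is a sketch under stated assumptions and not a description of how OpenBUGS samples internally: it relies on the flat priors for beta0 and beta1 and the Gamma(0.001, 0.001) prior on the precision giving closed-form conditionals when x is centered, it starts from the initial values listed in Appendix B, and it mirrors the 3000-sample burn-in and 12000 retained samples shown in the node statistics. The function name `run_gibbs` and the seed are hypothetical.

```python
import math
import random

# Averaged 2004-2012 data as listed in Appendix B.
X = [63.69154422, 89.76840667, 74.92461167, 140.7180136, 109.4106621,
     99.92325911, 68.43453544, 60.32229233, 67.06203433, 100.8169706,
     83.123759, 55.12364089, 114.6389837, 99.61515889, 48.75051056,
     69.04419133, 74.57996711, 56.90573011, 116.4527793, 199.368707,
     60.03360822, 118.5733706, 44.04681044, 55.889759, 94.06393778,
     92.80824244, 94.66015589, 44.78459711, 72.37762889, 69.80194444]
Y = [0.47, 0.54, 0.44, 0.56, 0.48, 0.53, 0.49, 0.49, 0.48, 0.52,
     0.46, 0.40, 0.55, 0.52, 0.50, 0.50, 0.50, 0.47, 0.50, 0.58,
     0.52, 0.53, 0.42, 0.50, 0.50, 0.46, 0.57, 0.49, 0.54, 0.50]


def run_gibbs(x, y, n_keep=12000, burn_in=3000, seed=2014):
    """Single-chain Gibbs sampler; returns retained (beta0, beta1, tau) draws."""
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    xc = [xi - xbar for xi in x]            # centered predictor
    ybar = sum(y) / n
    sxx = sum(v * v for v in xc)
    sxy = sum(v * yi for v, yi in zip(xc, y))

    beta0, beta1, tau = 0.0, 0.0, 1.0       # initial values from Appendix B
    draws = []
    for it in range(burn_in + n_keep):
        # Flat priors and centered x: each coefficient is conditionally normal.
        beta0 = rng.gauss(ybar, 1.0 / math.sqrt(n * tau))
        beta1 = rng.gauss(sxy / sxx, 1.0 / math.sqrt(tau * sxx))
        # Gamma(0.001, 0.001) prior on tau gives a gamma conditional.
        sse = sum((yi - beta0 - beta1 * v) ** 2 for v, yi in zip(xc, y))
        shape = 0.001 + n / 2.0
        rate = 0.001 + sse / 2.0
        tau = rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
        if it >= burn_in:                   # discard the burn-in draws
            draws.append((beta0, beta1, tau))
    return draws


draws = run_gibbs(X, Y)
b0_mean = sum(d[0] for d in draws) / len(draws)
b1_mean = sum(d[1] for d in draws) / len(draws)
postprob = sum(1 for d in draws if d[1] >= 0) / len(draws)
print("beta0 mean:", round(b0_mean, 4))
print("beta1 mean:", round(b1_mean, 6))
print("Pr(beta1 >= 0 | y):", round(postprob, 4))
```

The last quantity plays the same role as the `postprob` node in the OpenBUGS code: the share of retained draws in which the slope is non-negative.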
Figure 2 contains the node statistics for the informative prior. The history plots and auto-correlation plots used for assessing convergence in the informative prior model can be found in Appendix A.
node mean sd MC_error val2.5pc median val97.5pc start sample
beta0 0.5002 0.006425 6.46E-5 0.4876 0.5002 0.5131 3001 12000
beta1 7.966E-4 2.01E-4 1.935E-6 4.013E-4 7.952E-4 0.001184 3001 12000
mu[1] 0.4835 0.007636 7.081E-5 0.4684 0.4835 0.4989 3001 12000
mu[2] 0.5043 0.006521 6.684E-5 0.4914 0.5043 0.5171 3001 12000
mu[3] 0.4925 0.006691 6.445E-5 0.4792 0.4924 0.5059 3001 12000
mu[4] 0.5449 0.01305 1.345E-4 0.5189 0.5449 0.5705 3001 12000
mu[5] 0.52 0.008179 8.614E-5 0.5033 0.52 0.536 3001 12000
mu[6] 0.5124 0.007157 7.506E-5 0.498 0.5125 0.5266 3001 12000
mu[7] 0.4873 0.007166 6.734E-5 0.4732 0.4872 0.5018 3001 12000
mu[8] 0.4808 0.008022 7.386E-5 0.465 0.4808 0.497 3001 12000
mu[9] 0.4862 0.007292 6.824E-5 0.4718 0.4862 0.501 3001 12000
mu[10] 0.5131 0.007238 7.598E-5 0.4986 0.5132 0.5275 3001 12000
mu[11] 0.499 0.006429 6.421E-5 0.4864 0.499 0.512 3001 12000
mu[12] 0.4767 0.008688 7.94E-5 0.4593 0.4767 0.4943 3001 12000
mu[13] 0.5241 0.008868 9.323E-5 0.5062 0.5242 0.5415 3001 12000
mu[14] 0.5122 0.00713 7.474E-5 0.4978 0.5122 0.5262 3001 12000
mu[15] 0.4716 0.009598 8.73E-5 0.4525 0.4716 0.4909 3001 12000
mu[16] 0.4878 0.007113 6.698E-5 0.4738 0.4877 0.5022 3001 12000
mu[17] 0.4922 0.006711 6.455E-5 0.4789 0.4922 0.5057 3001 12000
mu[18] 0.4781 0.008451 7.74E-5 0.4614 0.4781 0.4951 3001 12000
mu[19] 0.5256 0.009123 9.582E-5 0.5072 0.5257 0.5434 3001 12000
mu[20] 0.5916 0.02402 2.405E-4 0.5445 0.5915 0.639 3001 12000
mu[21] 0.4806 0.008057 7.414E-5 0.4646 0.4806 0.4968 3001 12000
mu[22] 0.5273 0.00943 9.892E-5 0.5083 0.5274 0.5457 3001 12000
mu[23] 0.4679 0.01032 9.374E-5 0.4475 0.4678 0.4885 3001 12000
mu[24] 0.4773 0.008585 7.853E-5 0.4602 0.4773 0.4947 3001 12000
mu[25] 0.5077 0.006722 6.976E-5 0.4943 0.5078 0.5211 3001 12000
mu[26] 0.5067 0.006652 6.882E-5 0.4935 0.5068 0.5199 3001 12000
mu[27] 0.5082 0.006758 7.024E-5 0.4947 0.5082 0.5216 3001 12000
mu[28] 0.4685 0.0102 9.27E-5 0.4483 0.4684 0.4889 3001 12000
mu[29] 0.4905 0.006852 6.532E-5 0.4769 0.4904 0.5042 3001 12000
mu[30] 0.4884 0.007049 6.655E-5 0.4745 0.4883 0.5027 3001 12000
postprob 0.9998 0.01581 1.431E-4 1.0 1.0 1.0 3001 12000
sigma 0.03472 0.004829 4.704E-5 0.02672 0.03418 0.04542 3001 12000
tausq 876.2 235.3 2.254 484.9 856.2 1401.0 3001 12000
Figure 2. Node Statistics for Informative Prior
Discussion

Based on our analyses, we found a positive relationship between average total payroll and average winning percentage in Major League Baseball for the years 2004 to 2012. For both the non-informative and informative priors, the statistics for postprob indicate that Pr(β1 ≥ 0 | y) is about 0.9998. In other words, there is a greater than 99% posterior probability that beta1 > 0. These findings are further supported by the posterior means and by the 95% credible sets for beta1, which contain only positive values. Therefore, there does appear to be a positive linear association between average total payroll and average winning percentage for MLB teams.
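To give the slope a more tangible scale, the short calculation below converts the posterior summary for beta1 from Figure 2 into wins over a 162-game regular season. The $10 million increment is a hypothetical figure chosen for illustration, the function name `wins_per_extra_payroll` is hypothetical, and the result describes an association in the averaged data rather than a causal effect of spending.

```python
GAMES_PER_SEASON = 162  # length of an MLB regular season

# Posterior summary for beta1 under the informative prior (Figure 2),
# in winning-percentage points per million dollars of average payroll.
beta1_mean = 7.966e-4
beta1_interval = (4.013e-4, 0.001184)  # 2.5% and 97.5% posterior quantiles


def wins_per_extra_payroll(slope, extra_millions, games=GAMES_PER_SEASON):
    """Expected change in wins per season associated with extra payroll."""
    return slope * extra_millions * games


extra = 10.0  # hypothetical increment: $10 million of average payroll
point = wins_per_extra_payroll(beta1_mean, extra)
low, high = (wins_per_extra_payroll(b, extra) for b in beta1_interval)

print("wins per season per extra $10M: %.2f (95%% interval %.2f to %.2f)"
      % (point, low, high))
```

Under these numbers, an extra $10 million of average payroll is associated with roughly 1.3 additional wins per season, with a 95% credible range of about 0.65 to 1.9 wins.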
There was very little difference between the results under the non-informative and informative priors. We believe this is because the informative prior is very consistent with the data. The mean and median under the informative prior are actually slightly larger than those under the non-informative prior. This may indicate that the non-informative prior fits the data better, but the difference is very small.
If the linear relationship between average total payroll and average winning percentage for MLB teams were explored further, more information about the parameters would help improve the analysis. Additionally, if an averaged dataset were used in a follow-up study, including more years would be advisable.
Finally, although this analysis suggests that a positive relationship exists between total payroll and winning percentage, it would be important to account for ongoing changes in the league. Most notable is the debate over using statistics such as on-base percentage, rather than traditional baseball measures of success, to estimate wins. This ongoing development is affecting the perceived value of many players and may drastically affect a team’s salary and winning percentage.
References

Our dataset was constructed by combining historical Major League Baseball team salaries and winning percentages. The data for both variables were drawn from the same time period, 2004 to 2012. Links to these MLB data sources can be found below:

Baseball-Reference.com. (2014). Team Wins. Retrieved from http://www.baseball-reference.com/leagues/MLB/.
USA Today. (2014). USATODAY Salaries Database, MLB salaries by team for various years (2004 to 2014). Retrieved from http://content.usatoday.com/sportsdata/baseball/mlb/salaries/team/2004.
Appendix A: MCMC Integration Convergence

Figures 3 and 4 below are the history plots and auto-correlation plots, respectively, for the non-informative prior. From these plots it was assessed that convergence occurred quickly for every variable. Convergence of postprob was assessed using the MC_error found in the results section of this paper.
Figure 3. History Plots for Non-Informative Prior
Figure 4. Auto-Correlation Plots for Non-Informative Prior
Figures 5 and 6 below are the history plots and auto-correlation plots, respectively, for the informative prior. From these plots it was assessed that convergence occurred quickly for every variable. Convergence of postprob was assessed using the MC_error found in the results section of this paper.
Figure 5. History Plots for Informative Prior
Figure 6. Auto-Correlation Plots for Informative Prior
Appendix B: OpenBUGS Code

Non-informative Prior

model {
    for (i in 1:N) {
        xcent[i] <- x[i] - mean(x[])
    }
    for (i in 1:N) {
        mu[i] <- beta0 + beta1 * xcent[i]
        y[i] ~ dnorm(mu[i], tausq)
    }
    postprob <- step(beta1)
    beta0 ~ dflat()
    beta1 ~ dflat()
    tausq ~ dgamma(0.001, 0.001)
    sigma <- 1 / sqrt(tausq)
}
#data
list(x=c(63.69154422, 89.76840667, 74.92461167, 140.7180136, 109.4106621,
         99.92325911, 68.43453544, 60.32229233, 67.06203433, 100.8169706,
         83.123759, 55.12364089, 114.6389837, 99.61515889, 48.75051056,
         69.04419133, 74.57996711, 56.90573011, 116.4527793, 199.368707,
         60.03360822, 118.5733706, 44.04681044, 55.889759, 94.06393778,
         92.80824244, 94.66015589, 44.78459711, 72.37762889, 69.80194444),
     y=c(0.47, 0.54, 0.44, 0.56, 0.48, 0.53, 0.49, 0.49, 0.48, 0.52,
         0.46, 0.40, 0.55, 0.52, 0.50, 0.50, 0.50, 0.47, 0.50, 0.58,
         0.52, 0.53, 0.42, 0.50, 0.50, 0.46, 0.57, 0.49, 0.54, 0.50),
     N=30)

#inits
list(beta0=0, beta1=0, tausq=1)
Informative Prior

model {
    for (i in 1:N) {
        xcent[i] <- x[i] - mean(x[])
    }
    for (i in 1:N) {
        mu[i] <- beta0 + beta1 * xcent[i]
        y[i] ~ dnorm(mu[i], tausq)
    }
    postprob <- step(beta1)
    beta0 ~ dnorm(0.5, 0.05)
    beta1 ~ dnorm(0.1, 100)
    tausq ~ dgamma(0.001, 0.001)
    sigma <- 1 / sqrt(tausq)
}
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4361350  0.0164219  26.558  < 2e-16 ***
xnew2       0.0007798  0.0001804   4.323 0.000187 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03171 on 27 degrees of freedom
Multiple R-squared: 0.4091, Adjusted R-squared: 0.3872
F-statistic: 18.69 on 1 and 27 DF, p-value: 0.0001872
Reduced Model (#12 Removed)

> data_new <- mydata[-c(12),]
> xnew <- data_new$AverageTotalPayroll
> pct_new <- data_new$AveragePCT
> remove_residual_line <- lm(pct_new~xnew)
> summary(remove_residual_line)

Call:
lm(formula = pct_new ~ xnew)

Residuals:
      Min        1Q    Median        3Q       Max
-0.056063 -0.013108  0.004436  0.016632  0.059747

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4421938  0.0156408  28.272  < 2e-16 ***
xnew        0.0007190  0.0001709   4.208 0.000255 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02964 on 27 degrees of freedom
Multiple R-squared: 0.396, Adjusted R-squared: 0.3737
F-statistic: 17.7 on 1 and 27 DF, p-value: 0.0002551
Reduced Model (#24 Removed)

> data_up1 <- data.bb[-c(24),]
> xnew1 <- data_up1$Avg.Total.Payroll
> pct1 <- data_up1$Average.PCT
> red_residual_line <- lm(pct1~xnew1)
> plot(red_residual_line)
> summary(red_residual_line)

Call:
lm(formula = pct1 ~ xnew1)

Residuals:
      Min        1Q    Median        3Q       Max
-0.075337 -0.013510  0.007229  0.017961  0.062273

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4301762  0.0174143  24.702  < 2e-16 ***
xnew1       0.0008193  0.0001903   4.305 0.000196 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03304 on 27 degrees of freedom
Multiple R-squared: 0.4071, Adjusted R-squared: 0.3851
F-statistic: 18.54 on 1 and 27 DF, p-value: 0.0001965
Residual & QQ plots Based on Complete Dataset
Residual & QQ plots Based on Reduced Model (Data point 3 Removed), with a scatter plot of y against x
Residual & QQ plots Based on Reduced Model (Data point 12 Removed)
Residual & QQ plots Based on Reduced Model (Data point 24 Removed)
Contributions

Project proposal: All Members
OpenBUGS/R Computing:
- Non-informative prior: Lingwen He
- Informative prior: Zijian Su
- Inference: Xiangyu Li
- Additional Computing: Lingwen He, Zijian Su, Xiangyu Li
Interim report: All Members
Final report writing and formatting: Padraic O’Shea