What Innings Determine Total Wins

By: Payton Soicher and Kevin Mosier
“Winning the inning” is one of the most popular phrases thrown around by baseball
players and coaches. Teams try to “win” every inning because the more innings you win in a
game, the better your chance of winning that game. However, in a 162-game Major League
Baseball season, it is impossible to win every inning. MLB teams try to figure out which
players will give them the best chance to win by studying batting averages, slugging and
on-base percentages, runs batted in, and other advanced analytics. Our study set out to see
whether managers and organizations were overlooking an important predictor of a team’s
record at the end of the season: which specific innings are the best predictors of that record?
The first thing that we did was take the 2015 MLB won-loss record of every team. After
that, we found the number of runs that each team scored in every single inning and made
each inning its own variable (Inning1, Inning2, Inning3, etc.). This shows how many runs each
team’s offense put up in each inning during the season. We did not include extra innings
because teams do not all play the same number of extra innings, but every complete
baseball game has to go a minimum of 9 innings. After collecting the runs for each frame, we
found the number of runs each team gave up in innings 1-9 to show how many runs the
defense allowed in each inning during the year. We then took the runs scored in each inning and
subtracted the runs the defense gave up in that same inning, giving each team a “net
run” total for each inning. If the net total for an inning was positive, the team won that inning
for the season; a negative total shows that they lost that inning on the year.
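The net-run bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the function name `net_runs_by_inning` and the run totals below are made up for the example.

```python
def net_runs_by_inning(runs_scored, runs_allowed):
    """Return net runs (scored minus allowed) for innings 1-9 as a dict."""
    if len(runs_scored) != 9 or len(runs_allowed) != 9:
        raise ValueError("expected season totals for exactly nine innings")
    return {
        f"Inning{i}": scored - allowed
        for i, (scored, allowed) in enumerate(zip(runs_scored, runs_allowed), start=1)
    }


# Hypothetical season totals for one team (not real 2015 data).
scored = [85, 70, 78, 82, 90, 75, 88, 69, 60]
allowed = [75, 73, 80, 79, 82, 77, 78, 70, 55]

net = net_runs_by_inning(scored, allowed)
for inning, value in net.items():
    # Positive net runs mean the team "won" that inning on the season.
    result = "won" if value > 0 else "lost" if value < 0 else "even"
    print(inning, value, result)
```

Each value in `net` plays the role of one of the Inning1 through Inning9 variables for that team.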
The data that we used did not need any kind of transformation. We also did not want to
eliminate any outlier data points that might be in the model. For example,
the Blue Jays scored a total of 116 runs in the second inning over the season, when the MLB
average for the second inning was 70.63. Although that might look like an extremely high value,
the Blue Jays were known to have one of the best offensive lineups in MLB, with their third,
fourth, and fifth batters each hitting about 40 home runs on the season. We also looked at the
residual plot for each inning, and they all looked random and scattered, which is what you
would like to see for a regression analysis. We also treated the inning variables as independent
of one another, since runs, hits, outs, and the other determining factors in baseball don’t carry
over from one inning to the next.
Before running our model with each inning as its own variable, the first test that we ran
combined every inning into one “net” variable. To do this, we took the sum of the net runs
for innings 1 through 9 to see whether each team had a positive or negative run total on the
year, and then ran a model with that total against the dependent variable of wins.
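As a rough sketch of this first test, the code below sums each team's nine inning values into one season net total and fits a one-variable least-squares line of wins on that total. The five teams and their numbers are invented for illustration, and `fit_line` is a hypothetical helper, not the software used in the study.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical per-inning net runs (innings 1-9) and win totals for five teams.
inning_nets = [
    [-20, -15, -10, -5, -10, -15, -10, -10, -5],  # sums to -100
    [-10, -5, -5, -5, -5, -5, -5, -5, -5],        # sums to -50
    [5, -5, 0, 5, -5, 0, 5, -5, 0],               # sums to 0
    [10, 5, 5, 5, 5, 5, 5, 5, 5],                 # sums to 50
    [20, 15, 10, 5, 10, 15, 10, 10, 5],           # sums to 100
]
wins = [70, 76, 81, 86, 92]

# One "net" variable per team: the sum of its nine inning values.
net_runs = [sum(team) for team in inning_nets]
intercept, slope = fit_line(net_runs, wins)
print(net_runs)
print(round(intercept, 3), round(slope, 3))
```

In this toy data the slope is positive, meaning each additional net run is associated with a fraction of an additional win.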

From the graph, the relationship between the X values (net runs) and the dependent variable
(wins) looks linear, the variance looks constant, the points are independent, and
the scatter plot shows mean zero with no trend.
The null and alternative hypotheses for this regression test are:
$$H_0: \beta_1 = 0$$
$$H_a: \beta_1 \neq 0, \quad \alpha = 0.05$$

Net ANOVA Table
In the ANOVA (Analysis of Variance) table, the extremely high F value of 102.36
leads to an extremely low p-value of less than .0001, so we can reject the null hypothesis and
conclude that $\beta_1 \neq 0$. Also, in the parameter estimates, because the only variable in the
model is Net Runs, the F p-value and the Net Runs p-value should be identical, which they are. From
what we observe in this simple one-variable linear regression analysis, Net Runs for an entire
season is a good predictor of wins at the end of the regular season.
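The agreement between the two p-values is a general property of one-variable regression rather than a coincidence of this data set. With a single predictor, the overall F statistic is the square of the slope's t statistic, so both tests give the same p-value. Taking the reported F value and the 30 teams in the sample as given:

```latex
F = \frac{MSR}{MSE} = t^2, \qquad
F \sim F_{1,\,n-2} = F_{1,\,28} \ \text{under } H_0, \qquad
|t| = \sqrt{102.36} \approx 10.12
```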
Although that is an interesting regression test, it is a very coarse examination compared
to the one that we are most interested in, but it establishes that the question is worth
analyzing further. After running the first model, we ran the more specific
model with all nine innings as variables. For the ANOVA F test, the
null and alternative hypotheses were:
$$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = \beta_7 = \beta_8 = \beta_9 = 0$$
$$H_a: \text{at least one } \beta_i \neq 0, \quad \alpha = 0.05$$
In more practical terms, we wanted to see whether at least one inning variable had a slope that
is significantly different from 0 at the 5% level.
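For reference, the overall F statistic behind this test compares the variation explained by the nine inning variables with the variation left in the residuals. The form below is the standard textbook one; the degrees of freedom assume the 30 teams and nine predictors described above.

```latex
F = \frac{SSR / p}{SSE / (n - p - 1)}, \qquad p = 9,\ n = 30
\quad\Longrightarrow\quad F \sim F_{9,\,20} \ \text{under } H_0
```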

Residual Plot of All 9 Innings
The residual plots for each of the innings are randomly scattered around
mean 0 with no increase in variance, all great signs for a linear regression.

Model With All 9 Innings
With all 9 variables in the model, we get an F p-value of less than .0001; therefore, we
reject the null hypothesis and conclude that at least one $\beta_i \neq 0$. The R-Square and
Adjusted R-Square are both above .75, which says that the independent variables explain
more than 75% of the variation in wins, and an Adjusted R-Square of .75 is generally a good fit.
We can also see from the full model above that at the $\alpha = 0.05$ level, the intercept,
Inning1, and Inning5 are the only parameter estimates that are significant in the model. Keeping
only those significant terms, the model from this multivariable regression would be:
$$\text{Predicted Wins} = 80.833 + 0.11651\,(\text{Inning1}) + 0.19052\,(\text{Inning5})$$
You might also notice that every single parameter estimate is positive, which might give people
the wrong impression that the model can only predict a win total higher than 80.83, but that is
not true. Since inning values can be positive or negative, a team that loses both Inning1 and
Inning5 on the season would have a predicted win total below 80.83 (which is the mean of all the
wins by all the teams in MLB).
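To make this concrete, take a hypothetical team (the inning values are invented for illustration) that finishes the season at -10 net runs in both the first and the fifth inning. Plugging into the two-variable equation above gives a prediction below the intercept:

```latex
\begin{aligned}
\text{Predicted Wins} &= 80.833 + 0.11651(-10) + 0.19052(-10) \\
&= 80.833 - 1.1651 - 1.9052 \\
&= 77.7627
\end{aligned}
```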
It is unnecessary, and not good modeling practice, to keep variables in the model that
are not significant, so we ran a few tests to find a smaller model in which every
variable is a significant predictor of wins. The first thing we looked at
was Mallows' C(p) criterion, which is a good indicator of which model is the best predictor of wins.
We took the best model of each size, from one variable up to the model with all nine
variables in it, and this was what the output looked like:
Under the C(p) criterion, you pick the model with the smallest value, and the
smallest value is 4.4689 (in model #4) with the variables Inning1, Inning2, Inning5, and Inning7.
That same model has a reported R-Square of .7584, which is higher than that of the model with
all nine variables. (Plain R-Square cannot rise when variables are removed, so this comparison
presumably refers to the Adjusted R-Square.) That is a great result, since the model has five
fewer variables and explains the dependent variable, wins, just as well.
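The two criteria used in this comparison can be written out directly. The sketch below uses the standard textbook definitions of Mallows' C(p) and Adjusted R-Square; the function names and the input numbers are placeholders, not values from our output.

```python
def mallows_cp(sse_subset, mse_full, n_obs, n_predictors):
    """Mallows' C(p) = SSE_p / MSE_full - n + 2p, where p counts the intercept."""
    p = n_predictors + 1
    return sse_subset / mse_full - n_obs + 2 * p


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-Square penalizes R-Square for each predictor in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Placeholder numbers for a 30-team season (not taken from the study's output).
n_teams = 30
sse_full = 400.0                      # residual sum of squares, nine-variable model
mse_full = sse_full / (n_teams - 10)  # 9 slopes + intercept = 10 parameters

# For the full model, C(p) always equals its own parameter count.
cp_full = mallows_cp(sse_full, mse_full, n_teams, 9)

# A four-variable subset with a slightly larger residual sum of squares.
cp_subset = mallows_cp(430.0, mse_full, n_teams, 4)
adj_subset = adjusted_r2(0.80, n_teams, 4)
print(cp_full, cp_subset, round(adj_subset, 3))
```

A subset whose C(p) is small and close to its own parameter count is usually read as fitting about as well as the full model, which is the sense in which the smallest C(p) is preferred here.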
To help support our initial conclusion that the model with Inning1, Inning2, Inning5, and
Inning7 is the best one, our next step was to run a forward selection process to see if that
method would give us the same model. Forward selection considers the variables one at a
time, adds the one with the highest |t| value that meets the significance level ($\alpha = 0.05$),
keeps that variable in the model, and repeats the process with the remaining variables until
none of the variables left outside the model meets the significance level. When we ran the
forward selection, this is what we got:
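For readers who want to see the mechanics, the forward selection procedure described above can be sketched as follows. This is an illustration on made-up data, not the output of the run on the 2015 season. One labeled assumption: instead of computing a p-value against the .05 level, the sketch uses a fixed F-to-enter threshold (`f_to_enter`, a hypothetical parameter). For a single added variable the partial F equals t squared, so picking the largest F is the same as picking the highest |t|.

```python
def ols_sse(columns, y):
    """Residual sum of squares from regressing y on an intercept plus columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Solve by Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(k):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_select(data, y, f_to_enter=4.0):
    """Add one variable at a time while the best candidate's partial F passes."""
    chosen, remaining = [], list(data)
    sse_current = ols_sse([], y)
    while remaining:
        # Fit every candidate model; the lowest SSE is also the highest |t|.
        trials = {
            name: ols_sse([data[v] for v in chosen + [name]], y)
            for name in remaining
        }
        best = min(trials, key=trials.get)
        df_resid = len(y) - len(chosen) - 2  # n minus slopes minus intercept
        if df_resid <= 0:
            break
        f_stat = (sse_current - trials[best]) / (trials[best] / df_resid)
        if f_stat < f_to_enter:  # assumed stand-in for the alpha = .05 test
            break
        chosen.append(best)
        remaining.remove(best)
        sse_current = trials[best]
    return chosen


# Toy data (not the 2015 season): wins depend on Inning1 only.
data = {
    "Inning1": [-4, -3, -2, -1, 1, 2, 3, 4],
    "Inning5": [1, -1, 1, -1, 1, -1, 1, -1],
}
noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
wins = [80 + 2 * x + e for x, e in zip(data["Inning1"], noise)]

selected = forward_select(data, wins)
print(selected)
```

On this toy data only Inning1 enters, because adding Inning5 afterward leaves the residual sum of squares essentially unchanged.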

The variables chosen by the forward selection process match the model that was
identified by Mallows' C(p). We therefore concluded that the model with Inning1, Inning2,
Inning5, and Inning7 would be the best model to predict wins in a regular season. These were the
ANOVA and parameter estimates from that model:
Model With Inning1 Inning2 Inning5 Inning7
The final model to predict the number of wins at the end of the regular season is
$$\text{Predicted Wins} = 80.833 + 0.12322\,(\text{Inning1}) + 0.15119\,(\text{Inning2}) + 0.22596\,(\text{Inning5}) + 0.18543\,(\text{Inning7})$$
For example, if you had +10 for the first inning, -3 for the second inning, +8 for the fifth
inning, and +10 for the seventh inning, your predicted win total would be
$$\text{Predicted Wins} = 80.833 + 0.12322\,(10) + 0.15119\,(-3) + 0.22596\,(8) + 0.18543\,(10) = 85.27361 \text{ wins}$$
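The same calculation can be wrapped in a small function so that other inning totals can be tried. The coefficients are the ones reported above; the function name `predict_wins` and the dictionary layout are just one convenient way to write it.

```python
# Coefficients of the final model as reported in the text.
INTERCEPT = 80.833
COEFFICIENTS = {
    "Inning1": 0.12322,
    "Inning2": 0.15119,
    "Inning5": 0.22596,
    "Inning7": 0.18543,
}


def predict_wins(net_runs):
    """Predicted season wins from net runs in innings 1, 2, 5, and 7."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * net_runs[name] for name in COEFFICIENTS
    )


# The worked example from the text: +10, -3, +8, and +10 net runs.
example = {"Inning1": 10, "Inning2": -3, "Inning5": 8, "Inning7": 10}
predicted = predict_wins(example)
print(round(predicted, 5))
```

A team at exactly zero net runs in all four innings would be predicted at the intercept, 80.833 wins.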
The first 20 (out of 30) residuals look random, with no real trend. About
half of the observed values fall above the predicted value and half fall below. As is typical
of regression plots, the points flare out towards the ends and are closer to the predicted
values near the centroid.

Once this model is chosen as the best way to predict wins at the end of
the regular season, a common question arises: why those innings? As any baseball player will
tell you, the first two innings are the hardest two innings for any pitcher because you are
guaranteed at a minimum to face the top six hitters in the opposing team’s lineup. As most
people would correctly assume, managers place their best hitters at the top of the lineup
because they want those players to have the greatest chance of batting more times than the
rest of the players in the batting order. Looking at the dataset, the vast majority of teams both
scored and allowed most of their runs in the first two innings of their seasons. Teams are also
easily encouraged or discouraged in those first two innings. If you jump out to an early lead
in a game, it is difficult for the opposing team to stay positive and locked in while playing from
behind, while the leading team gains confidence knowing that they are playing with a
lead and don’t have as much pressure placed upon them.
As for the fifth and seventh innings, offense is less of an indicator at this point;
relief pitching now comes into play. In MLB, a “Quality Start” is defined as a starting
pitcher lasting six or more innings and giving up three or fewer earned runs. Most MLB starting
pitchers only toss about six innings, and then the team’s bullpen of relief pitchers takes over.
Considering recent research regarding the dangers of over-using a pitcher’s throwing arm, most
teams like to pull the plug on a starting pitcher after the sixth inning, as his pitch count
nears a hundred. In most cases, the seventh inning is the first frame that involves a reliever.
That is often when hitters get to look at a new pitcher after possibly struggling with the
starter, giving them an opportunity to score more runs and regain momentum for their team. On
the opposite side of that spectrum, a team with a great bullpen can win more games if it can
hold on to a lead beginning in the seventh inning. As for the fifth inning, teams that don’t get
the quality start that they were looking for from their starting pitcher have
to go to the bullpen earlier than expected. The fifth inning is another common inning for a new
pitcher to come into a game if the starting pitcher isn’t performing well.
This dataset shows clearly that both offense and pitching have a big influence on
the number of wins in a season, but certain innings are more important than
others. Teams might be focusing too heavily on individual statistics when trying to determine
the key to success. Organizations can look at this study to help identify what needs more work:
offense at the beginning of games or relief pitching in the middle and later parts of games.
Our study has shown that one of the biggest predictors of a ball club’s record at the end
of the regular season is whether it “wins” the first, second, fifth, and seventh innings.
