
Analysis of the Impact of Weather on Runs Scored in Baseball
Games at Fenway Park
Steve Cultrera
A Thesis
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Science in Data Mining
Department of Mathematics
Central Connecticut State University
New Britain, Connecticut
April 2013
Thesis Committee:
Daniel Larose (chair), Ph.D.
Roger Bilisoly, Ph.D.
Dan Miller, Ph.D.

Abstract
The goal of this thesis is to determine the impact of weather on the number of runs scored in a
baseball game. To accomplish this, a dataset was created using 40 years of baseball data from
Major League Baseball games played at Fenway Park in Boston, MA. This data was combined
with weather data measured five miles away at Logan International Airport. Several data mining
techniques were used to analyze the data including unsupervised k-means clustering and
principal components analysis (PCA), as well as supervised decision trees and exhaustive linear
model search.
Several interesting results were obtained through exploratory data analysis (EDA). The
orientation of the park, its location on the New England coast, and the prevailing wind
conditions mean there are many games played during favorable hitting conditions. We also see
there is no difference in average runs scored between warmer months and cooler months.
But using linear models, both the wind blowing toward the outfield and the temperature were
found to account for just over 2% of the variance of total runs scored by both teams in the
games played during the 40-year span. And it is shown that this variance is not explained by
players’ increased intent to hit home runs while the wind is blowing in a favorable direction.
Additionally, there are about 14.5% more runs scored during warm games when the wind is
blowing out as compared to cooler games when the wind is blowing in toward home plate.

Table of Contents
Abstract
Table of Contents
Introduction
Literature Review
Methods
Conclusions
End Notes
References

Introduction
According to the treatise, “The Physics of Baseball” (Adair, 2002), everything from temperature,
humidity, and air pressure to the ball’s velocity, stitch height, roughness, circumference and
rotation impacts the flight of a baseball. Major League Baseball (MLB) pitchers are specifically
prohibited from scuffing or moistening the ball because even these minor changes impact the
amount of movement of the pitched ball. Admittedly these effects are small – but in a game of
inches small changes can be significant.
This thesis attempts to find a measurable change in the number of runs scored during a game
due to weather conditions. To accomplish this a dataset containing statistics from 40 years of
baseball games played at Fenway Park in Boston, Massachusetts, was combined with a dataset
describing the weather conditions at the starting time of each game as recorded by the National
Weather Service at Logan International Airport.
• It would seem obvious that when the wind is blowing toward the outfield there should
be more home runs (HR) hit. In fact Adair (2002) expresses the belief that the longest
HRs are probably aided by the wind. But what if players try to hit more HRs when they
see the wind blowing out? Would an increase in home runs be due to wind or a player’s
extra effort?
• The change in air pressure by elevation and its effect on a batted ball is well
documented – for example balls generally travel farther in Denver than at lower
elevations (Cross, 2011). In theory, extreme variations in temperature, humidity, and air
pressure can result in a 5% difference in the distance a ball travels (Watts & Bahill,
2000). But do normal weather variations which occur during a game have a measurable
impact on the number of runs scored?
• There is also currently a debate about whether or not an increase in temperature causes
an increase in the number of HRs hit (Dykstra, 2012; Alan, 2012).
Baseball is perhaps unique among professional sports in that there is a vast amount of data
available for analysis. Unfortunately, there is little accurate weather information associated with
the data. This thesis systematically examines the interplay between weather conditions and the
game of baseball.
The remainder of this thesis is organized as follows:
• Data Pre-Processing – a discussion of how this thesis’s dataset was created by
combining two separate datasets using the Perl scripting language.
• Exploratory Analysis – the variables in the dataset are explored by examining
distributions and their relationships to key variables.
• Unsupervised Analysis – Principal Components Analysis (PCA) and K-means clustering
are run against the dataset to see if underlying groupings of data can be found.
• Supervised Analysis – an attempt to improve upon previously created predictive models
using weather variables.
• Conclusions – summary of the results found.

Literature Review
Explaining the number of runs scored in a baseball game has been the topic of several previous
research papers, but so far none have been found that analyze the impact of weather on runs
scored. It is generally considered true that the number of base runners obtained (either through
hits, walks, or hit-by-pitch) is the primary driver of the number of runs scored. The expected
run production (ERP) model (Bennett and Fleuck, 1983) examined statistics
from 24 Major League Baseball (MLB) teams for games between 1969 and 1976 (192 cases
representing some 15,000 games) to create a linear model which explains 95% of the variance
(adjusted R²) of the runs scored in the games studied. The adjusted R² metric is a measure of
how well the model fits the data, or its “goodness of fit,” that takes into consideration the
efficiency of the model. A model with an adjusted R² = 100% would match the data perfectly and
contain only pertinent variables. The value is “adjusted” because it has been penalized
for the number of variables included in the model; simply adding a variable to a model would
always increase the unadjusted R², even if just by chance. Table 1 shows the ERP regression
model:
Table 1 – Expected Run Production Model
Variable    Coefficient    Description
H1B            0.499       Singles
H2B            0.728       Doubles
H3B            1.265       Triples
HR             1.449       Home Runs
TBB            0.353       Total Base on Balls
HP             0.362       Hit By Pitch
SB             0.126       Stolen Base
SF             0.394       Sacrifice
GIDP          -0.395       Ground into Double Play
OUT           -0.085       Putouts
CONSTANT     -67.0
An interpretation of the regression coefficient is that it indicates each variable’s effect on runs
scored if all the other variables were held constant. So assuming no changes to any other
variable, one home run (HR) results in a 1.449 increase in runs scored, which seems odd because
a home run is only worth one run during the game, but the model also accounts for home runs
hit when other runners are on base that might also score. We can see this more clearly by
comparing single (0.499) to base on balls (0.353). Both the single and the base on balls only
allow the hitter to move forward one base, but since a single results from a ball being hit into
the field, it has the added possibility of allowing a runner that might be on second base or third
base to score. On the other hand, the base on balls will only score a run if the bases happen to
be loaded (i.e. one runner is on each base).
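The interpretation above can be illustrated with a minimal sketch that evaluates the ERP model as a plain weighted sum. The coefficients are those of Table 1, with the parenthesized entries taken as negative (consistent with the discussion of ground into double play and putouts). The function names and the team-season totals are hypothetical and serve only to show the arithmetic.

```python
# Coefficients from Table 1 (ERP model). Parenthesized table entries are
# assumed to be negative values.
ERP_COEFFICIENTS = {
    "H1B": 0.499, "H2B": 0.728, "H3B": 1.265, "HR": 1.449,
    "TBB": 0.353, "HP": 0.362, "SB": 0.126, "SF": 0.394,
    "GIDP": -0.395, "OUT": -0.085,
}
ERP_CONSTANT = -67.0


def expected_runs(stats):
    """Return the ERP estimate for a dict of counting statistics."""
    total = ERP_CONSTANT
    for name, coefficient in ERP_COEFFICIENTS.items():
        total += coefficient * stats.get(name, 0)
    return total


def marginal_effect(stats, variable, delta=1):
    """Change in expected runs when one variable grows by delta, others fixed."""
    changed = dict(stats)
    changed[variable] = changed.get(variable, 0) + delta
    return expected_runs(changed) - expected_runs(stats)


# Hypothetical team-season totals, invented for illustration only.
season = {
    "H1B": 1000, "H2B": 250, "H3B": 30, "HR": 150, "TBB": 550,
    "HP": 40, "SB": 100, "SF": 45, "GIDP": 130, "OUT": 4100,
}

print(round(expected_runs(season), 2))
print(round(marginal_effect(season, "HR"), 3))   # one extra home run
print(round(marginal_effect(season, "TBB"), 3))  # one extra base on balls
```

Because the model is linear, the marginal effect of any variable is simply its coefficient, regardless of the other totals.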

We also see in Table 1 that the variables with the highest coefficients are all hits (single, double,
triple, home run), and that there’s a “second tier” of variables which are not quite as strong
(base on balls, hit by pitch, sacrifice, and stolen base) but generate runs simply because each
action results in base runners progressing forward on the path from first base to home plate.
Any action that advances runners generally results in an increase in the number of runs
scored.
Finally in Table 1 we see a “third tier” of variables; those having negative coefficients (indicating
that overall they result in a decrease in the number of runs scored). Both ground into double
play (-0.395) and putouts (-0.085) tend to reduce the number of runs scored.
The ERP model also demonstrates one of the deficiencies of Batting Average (BA = Total Hits/At
Bats) as a measure of a player’s offensive performance. BA includes the variables with the
strongest regression coefficients, but it doesn’t completely account for a player’s additional
ability to move runners forward by getting a base on balls or hit by pitch. We see in the ERP
model that the metric on-base percentage (OBP = [Total Hits + Base on Balls + Hit by Pitch]/Total
Plate Appearances) gives a more comprehensive estimate of a player’s ability to increase the
number of runs scored for his team.
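A small sketch of the two metrics, using the definitions given above, may make the contrast concrete. The two players and their totals are hypothetical; the function names are illustrative only.

```python
def batting_average(hits, at_bats):
    """BA = Total Hits / At Bats."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, plate_appearances):
    """OBP = (Total Hits + Base on Balls + Hit by Pitch) / Total Plate Appearances."""
    return (hits + walks + hit_by_pitch) / plate_appearances


# Two hypothetical players (invented totals): player B hits slightly less
# often than player A but reaches base far more often through walks.
player_a = {"hits": 150, "at_bats": 500, "walks": 20, "hbp": 2, "pa": 530}
player_b = {"hits": 140, "at_bats": 500, "walks": 90, "hbp": 5, "pa": 600}

for name, p in (("A", player_a), ("B", player_b)):
    ba = batting_average(p["hits"], p["at_bats"])
    obp = on_base_percentage(p["hits"], p["walks"], p["hbp"], p["pa"])
    print(name, round(ba, 3), round(obp, 3))
```

In this invented case BA ranks player A higher while OBP ranks player B higher, which is the deficiency of BA that the ERP coefficients point to.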
Another early work on this subject is Lindsey (1963). Lindsey’s coefficients in Table 2 are
interestingly close to those of ERP, considering he created them by hand in 1959 using only four
individual batting records (Harvey Kuenn, Al Kaline, Rocky Colavito, Harmon Killebrew):
Table 2 - Lindsey (1963)
Variable Coefficient
H1B 0.410
H2B 0.820
H3B 1.060
HR 1.420
The similar coefficients indicate that despite all the complexities of the game of baseball, the
simple act of moving from one base to the next should be considered the most important factor
in increasing the number of runs scored.
Another popular metric is the basic version of Bill James’ Runs Created (James, 2003). James is
generally considered to be an early adopter of a more scientific approach to baseball data
analysis. According to him, this formula gives you a team’s Runs Created within 5% on 90% of
the games played since 1920:
\[ RC = \frac{(H + BB) \times TB}{AB + BB} \qquad (1) \]
where H is hits, BB is base on balls, TB is total bases, and AB is at bats.
Later in this thesis James’ formulas will be compared to the games played in this thesis’s
dataset.

Methods
The dataset for this thesis was developed by taking statistics from MLB games played at Fenway
Park between 1970 and 2009, and combining them with weather observations recorded by the
National Climatic Data Center’s (NCDC) Boston /Logan International Airport weather station
(Station identifier, NCDC=14739). By using NCDC’s Integrated Surface Hourly (ISH) dataset, the
weather recorded within one hour of the start of each game was available.
Fenway Park is the home field for the Boston Red Sox, but the Red Sox team statistics will be
combined with each visiting team’s statistics to come up with a total for each game.
It should also be noted that Fenway Park is five miles farther inland than Logan International
Airport, which is a critical factor weather-wise. We may never know the exact weather at
Fenway Park at each game’s starting hour over the past 40 years, but the weather within five miles
of the park will suffice. For example, wind blowing from the Southwest at Logan may only
translate to an approximate South-Southwest wind at Fenway, but this will still provide insight
into weather’s impact on baseball because any differences in weather between the two
locations will be consistent. The temperature may
be two degrees cooler at Logan on days when there’s a sea breeze in August, but this difference
will be consistent over time – i.e. the sea breeze doesn’t make it cooler for some games and
warmer for others during the August games. Likewise that same sea breeze in April will have a
consistent effect (if any) at Fenway.
The website Retrosheet.org maintains comprehensive datasets of MLB game statistics. In the
time span covered by this thesis (including regular season, divisional, and World Series playoff
games) there were 3,231 games played at Fenway Park. Retrosheet’s “game log” file provides a
useful summary of the statistics for each game. But game log files do not include the starting
time of each game. So Retrosheet’s “event” files (also known as the “play-by-play” files) were
used to determine the starting time of each game.
The Retrosheet.org files did contain temperature and wind directions, but these were either too
general (e.g. “wind blowing left to right”) or too infrequently populated to be of use for any kind
of trend analysis.
Data pre-processing
The most significant initial task was to combine the following two datasets:
1. Retrosheet.org – The game log files were the primary source, with each game’s start time
extracted from the event files. The game logs are flat files that are easily downloaded and
imported into a Microsoft Access database. The event files, on the other hand, are hierarchical
in nature – i.e., each row of data can be one of several formats and thus cannot be processed as
a simple import. Of interest for this thesis was the “info” formatted row type, which contains each
game’s starting time. To get this information along with the related Game Id (described later),
the general purpose scripting language, Perl, was used to extract the “info” rows, which were
written to a formatted table file. This could then easily be joined to the summarized Game Log
file using Game Id.
2. National Climatic Data Center (NCDC) Integrated Surface Hourly (ISH) dataset – The 40-year
ISH file which was downloaded contained about half a million hourly weather observations. This
is not a huge number of rows, but unfortunately each row was quite wide, making the final text
file extremely large, several gigabytes. To reduce this file to a manageable size a Perl script was
used to strip out a subset of weather data only for days on which games occurred.
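The extraction of start times from the event files can be sketched as follows. The thesis used Perl; Python is used here purely for illustration. The sketch assumes the comma-separated Retrosheet event layout in which an "id" record opens each game and "info" records carry key/value pairs such as "starttime". The sample records, team code, start times, and function name are invented for the example.

```python
import csv


def extract_start_times(lines):
    """Map each game id to the value of its 'info,starttime' record."""
    start_times = {}
    current_game = None
    for record in csv.reader(lines):
        if not record:
            continue
        if record[0] == "id" and len(record) >= 2:
            # An "id" record marks the beginning of a new game.
            current_game = record[1]
        elif (record[0] == "info" and len(record) >= 3
              and record[1] == "starttime" and current_game is not None):
            start_times[current_game] = record[2]
    return start_times


# Invented sample in the assumed event-file layout (values are hypothetical).
SAMPLE = """id,BOS197908240
version,1
info,visteam,XXX
info,hometeam,BOS
info,starttime,7:35PM
play,1,0,batter01,00,,S7
id,BOS197908250
info,hometeam,BOS
info,starttime,2:05PM
"""

for game_id, start in extract_start_times(SAMPLE.splitlines()).items():
    print(game_id, start)
```

Writing the result out as a two-column table keyed on Game Id yields a flat file that can be joined to the game log data.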
The first issue was ensuring the two files used the same time zone. The ISH file’s
Greenwich Mean Time date/times were converted to Eastern Standard Time by subtracting five
hours. Since all games took place between April and October, no consideration was required for
Daylight Saving Time.
Combining the two files was a challenge. Initially it was hoped the two files could be joined via a
simple database process, but there were too many issues preventing a simple database SQL
(Structured Query Language) join operation. For example while most of the hourly weather
observations were on the hour (e.g. 7:00 PM) many were several minutes before or after the
hour (e.g. 6:59 or 7:03). In some cases the temperature was in an observation made at 7:00 PM
while the wind speed had been measured and stored at a different time (and in a different row) at 7:03 PM.
The final solution was to create a custom Perl script which crawled the weather file to build a
complete row of weather data for each game using the weather data that was closest to the
game’s start time.
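The matching logic described above can be sketched as a per-field nearest-observation search. This is an illustration in Python, not the thesis's Perl script. The field names follow this thesis's dataset, while the function names, the one-hour cutoff parameter, and the sample observations are assumptions. The UTC offset is left as a parameter and defaults to the five hours described in the text; local clock time in Boston between April and October is normally Eastern Daylight Time (UTC-4), so a four-hour offset may be needed when matching against local start times.

```python
from datetime import datetime, timedelta


def to_local(utc_time, offset_hours=5):
    """Shift a UTC timestamp by a fixed offset (five hours, per the text)."""
    return utc_time - timedelta(hours=offset_hours)


def build_weather_row(game_start, observations, fields,
                      max_gap=timedelta(hours=1)):
    """For each field, take the value from the observation closest in time.

    observations: list of (timestamp, {field: value}) pairs whose timestamps
    are already in the same time zone as game_start. A field missing from an
    observation (or set to None) is skipped for that observation.
    """
    row = {}
    for field in fields:
        best_gap = None
        row[field] = None
        for timestamp, values in observations:
            value = values.get(field)
            if value is None:
                continue
            gap = abs(timestamp - game_start)
            if gap <= max_gap and (best_gap is None or gap < best_gap):
                best_gap = gap
                row[field] = value
    return row


# Invented observations: temperature logged on the hour, wind three minutes later.
observations_utc = [
    (datetime(1979, 8, 24, 23, 0), {"Temp": 25.0, "Wind_Spd": 3.6}),
    (datetime(1979, 8, 25, 0, 0), {"Temp": 24.4}),
    (datetime(1979, 8, 25, 0, 3), {"Wind_Spd": 4.1}),
]
observations = [(to_local(t), v) for t, v in observations_utc]
game_start = datetime(1979, 8, 24, 19, 0)

row = build_weather_row(game_start, observations, ["Temp", "Wind_Spd", "SLP"])
print(row)
```

Searching field by field, rather than row by row, is what allows a temperature from one observation and a wind speed from another to be combined into a single complete row.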
The resulting combined baseball/weather dataset was spot-checked for accuracy and missing
values. There were several weather fields that were generally not populated well enough over
the 40-year span to be usable (for example, several of the precipitation variables), so they were
removed from the analysis. About a dozen games had missing temperature or wind speed
values for various reasons; the data for them was manually retrieved using one of the many
websites that provide historical weather information.
Additional Fields
Calculated Fields
The final step was to generate the following additional fields as shown in Table 3:

Table 3 - Calculated Fields
Field                           Type     Description
Total_Runs                      Integer  Total number of runs scored by both teams in a game
“PAB” versions of each field    Real     Each baseball variable was converted to a “per at bat” version to account for the different number of at bats in a game
TDS                             Real     Temperature-Dew Point Spread – the difference between the temperature and the dew point; a low TDS is considered to create conditions for low visibility
The baseball statistics in the Retrosheet.org files are split between home and visitor. For this
thesis, these two groups were added together. For example, total runs was created from home
team runs plus visiting team runs. “Total” versions of all the fields were created. Using the total
runs scored by both teams should remove any biases introduced by the “home field advantage.”
The next step was to change the overall total fields into “per at bat” or PAB versions of each
metric. In baseball the AB (at bat) differs from the PA (plate appearance) in that technically an
AB does not include a) BB (base on balls), b) HBP (hit by pitch), or c) SF (sacrifice), as well as
several other rare scenarios.
For the purposes of this thesis, AB was used to normalize each statistic. Using PAB versions of
each field allows both the 236 extra-inning games (7% of the dataset) and the 22
shortened games (0.9% of the dataset) to remain in the dataset. Otherwise, games like the 20-inning
marathon played on September 3, 1981, which had 47 total hits, could not be used. These 47 hits
convert to 0.30 total hits PAB, which is only a little above this field’s mean of 0.27.
A disadvantage of using PAB versions of the fields is the statistics lose easy interpretability.
While an analyst might know that 20 strikeouts by one pitcher in a game is a good performance
(Roger Clemens, April 29, 1986 in 30 ABs), this same performance in K_PAB is 0.67 which is less
interpretable. In this thesis PAB versions are used in all the modeling; for discussion purposes
either may be used.
One issue is that, technically, the PAB versions of each field are discrete values – there are only 34
possible values for total runs in the data set (1 through 36, excluding 32 and 34). Likewise, there are
only 71 possible values for AB. Together there are 547 combinations of total runs and AB, with
the most common combination being 7 total runs in 66 ABs (36 occurrences). These games have
a TotRun_PAB of 7/66 = 0.106. Fortunately, these discrete values are also ordinal (a TotRun_PAB
of 0.106 is less than a TotRun_PAB of 8/66, or 0.121) and will be treated as “real” numbers
for the purposes of modeling.
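The per-at-bat conversion is a simple division, sketched below with the figures quoted in the text (7 and 8 total runs in 66 ABs, and 20 strikeouts in 30 ABs). The function name is illustrative only.

```python
def per_at_bat(count, at_bats):
    """Convert a game total into its "per at bat" (PAB) version."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return count / at_bats


# Figures taken from the text.
print(round(per_at_bat(7, 66), 3))   # most common runs/AB combination
print(round(per_at_bat(8, 66), 3))   # one more run in the same number of ABs
print(round(per_at_bat(20, 30), 2))  # 20 strikeouts in 30 ABs
```

The same function applies to every counting statistic in the dataset, which is what lets extra-inning and shortened games sit on a common scale.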

Wind Quantification
The additional fields listed in Table 4 were added to the dataset to aid downstream analyses in
determining the wind’s effects with respect to the park’s orientation. These will be discussed
further in the Exploratory Data Analysis Section.
Table 4 - Additional Wind Fields
Field Type Description
W_NE Real Wind speed when wind is blowing from 0°-90°
W_SE Real Wind speed when wind is blowing from 100°-180°
W_SW Real Wind speed when wind is blowing from 190°-270°
W_NW Real Wind speed when wind is blowing from 280°-360°
W_HR_Max Boolean (1/0) 1 when blowing from 180°-290°; 0 otherwise
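As a sketch of how the fields in Table 4 could be derived from a single observation, the function below applies the degree ranges exactly as listed. Two assumptions are made that the table does not state: a quadrant field is set to 0 when the wind is not blowing from that quadrant, and directions arrive in 10° steps (so no value falls between the listed ranges). The function name is hypothetical.

```python
def wind_fields(direction_deg, speed):
    """Derive the Table 4 fields from wind direction (degrees clockwise
    from true north) and wind speed."""
    fields = {"W_NE": 0.0, "W_SE": 0.0, "W_SW": 0.0, "W_NW": 0.0}
    if 0 <= direction_deg <= 90:
        fields["W_NE"] = speed
    elif 100 <= direction_deg <= 180:
        fields["W_SE"] = speed
    elif 190 <= direction_deg <= 270:
        fields["W_SW"] = speed
    elif 280 <= direction_deg <= 360:
        fields["W_NW"] = speed
    # W_HR_Max overlaps the SE, SW, and NW ranges at its edges.
    fields["W_HR_Max"] = 1 if 180 <= direction_deg <= 290 else 0
    return fields


print(wind_fields(230, 5.0))  # southwest wind
print(wind_fields(40, 6.0))   # northeast wind
```

Note that W_HR_Max is not tied to a single quadrant: a wind from 180° counts toward W_SE and one from 290° toward W_NW, yet both set W_HR_Max to 1.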

Complete Dataset
The final set of variables used in this thesis is listed in Table 5:
Table 5 - Complete Dataset
Field Description
GameId Retrosheet.org’s Unique Game ID, a combination of the home field
code (all games in this thesis are “BOS”), the date, and game number
(usually “0”, or “1” for the second game of a double header). So for
example “BOS197908240” indicates the first game of a double header
on 8/24/1979.
Month_MM Month
Hour_UTC_HH Hour in UTC
TotRuns_PAB Runs Per AB (PAB)
TotRC_Bas_PAB Runs Created as calculated by James' Basic formula
TotAssists_PAB Assists PAB
TotBalk_PAB Balks
TotBB_PAB Base on balls
TotCatchInt_PAB Catcher Interference
TotCS_PAB Caught Stealing
TotDbl_PAB Doubles
TotDblPlay_PAB Double Plays
TotErrors_PAB Errors
TotGDP_PAB Ground into Double Plays
TotH_PAB Hits
TotHBP_PAB Hit by Pitch
TotHR_PAB Home Runs
TotIB_PAB Intentional Base on Balls
TotIndER_PAB Individual Earned Runs
TotK_PAB Strikeouts
TotLOB_PAB Left on Base
TotPB_PAB Passed Balls
TotPitchersUsed_PAB Pitchers Used
TotPutouts_PAB Putouts
TotRBI_PAB Runs Batted In
TotSacFly_PAB Sacrifice Flies
TotSacHt_PAB Sacrifice Hits
TotSing_PAB Singles
TotSB_PAB Stolen Bases
TotTmER_PAB Team Earned Runs
TotTrip_PAB Triples
TotTripPlay_PAB Triple Plays
TotWP_PAB Wild Pitches
Wind_Dir Wind Direction – angle measured clockwise from true north
Wind_Spd Wind Speed – in meters per second
Ceil_Hgt Ceiling Height – height above ground level, in meters, of the lowest cloud or obscuring phenomena
Visby_Dist Visibility Distance – distance in meters at which an object can be seen horizontally
Temp Temperature in degrees Celsius (°C)
Dewpt Dew Point (°C)
SLP Sea Level Pressure in Hectopascals (hPa)
TDS Temperature/Dew Point Spread
W_NE Wind speed when wind is blowing from 0°-90°
W_SE Wind speed when wind is blowing from 100°-180°
W_SW Wind speed when wind is blowing from 190°-270°
W_NW Wind speed when wind is blowing from 280°-360°
W_HR_Max A Boolean field indicating if the wind is blowing toward the outfield. 1
when blowing from 180°-290°; 0 otherwise

Exploratory Data Analysis
Exploration begins with an examination of the target variable and continues with an
examination of each of the baseball and weather variables.
Target Variable – Total Runs
Box & whisker plots of both Tot_Runs and TotRuns_PAB are shown in Figure 1. A concern was
how much converting TotRuns to TotRuns_PAB would change the distribution of the variable.
The answer is not much:
Figure 1 – Distribution of Total Runs
In Figure 1 we see that the mean number of runs scored in a game is around 9.9 which
translates to a mean TotRun_PAB of 0.13. We also see that there are some outliers which we
will need to assess for influence when doing regression.
We also see in Figure 1 that the distribution is slightly skewed right (skew=0.418). A histogram,
Figure 2, shows this more clearly:

Figure 2 - TotRun_PAB Histogram
When creating linear models, we’ll assess diagnostics to ensure regression assumptions are met.
For example, using the square root of TotRun_PAB improves the symmetry of the variable’s
distribution as shown in Figure 3, but it is still slightly skewed (skew=-0.346):
Figure 3 - Sqrt(TotRun_PAB)
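The effect of the square-root transformation on skewness can be sketched as below. The thesis does not state which skewness estimator was used, so the moment-based version here is an assumption, and the sample values are invented to resemble a right-skewed runs-per-at-bat distribution.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented right-skewed sample (not taken from the thesis's data).
sample = [0.05, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14,
          0.16, 0.20, 0.28, 0.40]
transformed = [math.sqrt(v) for v in sample]

print(round(skewness(sample), 3))
print(round(skewness(transformed), 3))
```

Because the square root compresses large values more than small ones, it pulls in the right tail and lowers the skewness of a right-skewed variable.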
An initial check was to see if simple correlation revealed any relationships between runs scored
and any of the weather variables:

Table 6 – Correlation Table
Correlations
TotRun_PAB Temp Dewpt SLP Visby_Dist Ceil_Hgt TDS
TotRun_PAB 1 0.13 0.07 -0.05 -0.04 0.01 0.07
Temp 0.13 1 0.66 -0.16 -0.04 0.24 0.34
Dewpt 0.07 0.66 1 -0.16 -0.39 -0.14 -0.49
SLP -0.05 -0.16 -0.16 1 0.15 0.19 0.01
Visby_Dist -0.04 -0.04 -0.39 0.15 1 0.32 0.43
Ceil_Hgt 0.01 0.24 -0.14 0.19 0.32 1 0.45
TDS 0.07 0.34 -0.49 0.01 0.43 0.45 1
P-values
TotRun_PAB Temp Dewpt SLP Visby_Dist Ceil_Hgt TDS
TotRun_PAB - 0 0.0003 0.0049 0.0554 0.4441 0.0004
Temp 0 - 0 0 0.0188 0 0
Dewpt 0.0003 0 - 0 0 0 0
SLP 0.0049 0 0 - 0 0 0.4472
Visby_Dist 0.0554 0.0188 0 0 - 0 0
Ceil_Hgt 0.4441 0 0 0 0 - 0
TDS 0.0004 0 0 0.4472 0 0 -
Figure 4 - Corrgram
Table 6 gives the Pearson correlation, and the significance of that correlation, between pairs
of variables. Correlation values range from -1 to +1 and indicate the strength of the linear
relationship between two variables. Values beyond ±0.60 are generally considered to be
strong, with a positive value indicating that the variables tend to move in the same direction and
a negative value indicating that the variables tend to move in opposite directions. Figure 4, a
corrgram, indicates positive relationships with blue and negative relationships with red; the
strength of the correlation is indicated by the color saturation (lower left) or a pie chart (upper
right). We see the strongest correlation with total runs is Temp (temperature) at 0.13, which,
while statistically significant, is hardly strong. We also see that fields Temp and Dewpt (0.66) are
strongly correlated as expected – both temperature and the amount of moisture in the air are
generally higher in the summer and lower in the winter. It also makes sense that TDS has a mild
correlation with Visby_Dist (0.43) and Ceil_Hgt (0.45) – a low TDS is an indicator of possible poor
visibility and a high TDS is indicative of clear skies.
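For reference, the Pearson correlation reported in Table 6 can be computed as in the following minimal sketch. The function name and the sample temperature and runs values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented values: game-time temperature (°C) and runs per at bat.
temps = [10.0, 14.0, 18.0, 22.0, 26.0, 30.0]
runs_pab = [0.12, 0.15, 0.11, 0.14, 0.16, 0.13]

print(round(pearson_r(temps, runs_pab), 3))
```

A value near +1 or -1 indicates a nearly perfect linear relationship, while values near 0, such as those between total runs and the weather variables, indicate a weak linear relationship.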
Table 7 shows a breakdown of mean runs scored by month, and there appear to be fewer runs
scored early and late in the season (when temperatures tend to be cooler). Figure 5 shows this
difference graphically. But this difference is not supported by a one-way ANOVA test on
runs scored by month (p = 0.095). The one-way ANOVA tells us whether this apparent
difference is statistically significant. A p-value less than α = 0.05 would indicate that at least
one of the months had a significantly different mean and that this difference was likely not due
to chance.
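The F statistic behind a one-way ANOVA can be sketched in a few lines. This is a minimal illustration, not the software used for the thesis; the function name and the three small groups are invented. Converting F to a p-value requires the F distribution, for which a statistics library would normally be used (for example `scipy.stats.f.sf`, if SciPy is available).

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented groups standing in for runs scored in three different months.
months = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_b, df_w = one_way_anova_f(months)
print(f_stat, df_b, df_w)
```

A large F means the monthly means differ by more than the game-to-game variation within months would suggest; a small F, as with runs scored by month, means they do not.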
Table 7 – Mean Runs Scored by Month
Month   Mean Runs Scored (p = 0.095)   Scheffe Group
4        9.69                          A
5        9.65                          A
6       10.17                          A
7       10.13                          A
8       10.24                          A
9        9.57                          A
10       9.76                          A

Figure 5 – Total Runs by Month
An additional test for determining if there is a difference in mean runs scored between months
is the Scheffe method (Mendenhall, 2003) for comparison of multiple means. Added to Table 7
is the Scheffe Group, which also indicates there is no difference between months at α = 0.05.
Finally, we compare Tot_Runs with two of Bill James’ Runs Created formulas. Table 8 shows
the correlation between the three metrics (all p ≈ 0):
Table 8 - Total Runs Correlation Comparison
Tot_Runs RC_Tot_Bas RC_Tot_SB
Tot_Runs 1.000 0.855 0.858
RC_Tot_Bas 0.855 1.000 0.997
RC_Tot_SB 0.858 0.997 1.000
Here RC_Tot_Bas is calculated with Equation 1, which is repeated here:
\[ RC = \frac{(H + BB) \times TB}{AB + BB} \qquad (1) \]
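RC_Tot_SB is the slightly more complicated “stolen base” version of runs created:
\[ RC = \frac{(H + BB - CS) \times (TB + 0.55 \times SB)}{AB + BB} \qquad (2) \]
where CS is caught stealing and SB is stolen bases.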
With a correlation of 0.997 there is not much difference between the “basic” and the “stolen
base” versions using this thesis’s data. An obvious difference is that the “stolen base” version
includes stolen bases in the calculation. Using BaseballReference.com, we can see that the
mean number of stolen bases in the major leagues between 1970 and 2009 is 0.67 per game per
team. In this thesis’s dataset, the visiting teams average 0.67 stolen bases per game, while the
Boston Red Sox average only 0.41 stolen bases per game. Clearly the stolen base version of Runs
Created would be less accurate for a team like the Red Sox, which does not steal as many bases
historically.
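The relationship between the two versions can be sketched in code. The functions below assume the commonly published forms of James’ basic and “stolen base” formulas; the function names and the single-game totals are hypothetical.

```python
def runs_created_basic(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (hits + walks) * total_bases / (at_bats + walks)


def runs_created_stolen_base(hits, walks, caught_stealing, total_bases,
                             stolen_bases, at_bats):
    """Stolen base version: (H + BB - CS) * (TB + 0.55 * SB) / (AB + BB)."""
    on_base = hits + walks - caught_stealing
    advancement = total_bases + 0.55 * stolen_bases
    return on_base * advancement / (at_bats + walks)


# Invented combined totals for one game (both teams).
game = {"H": 19, "BB": 7, "TB": 30, "AB": 70, "SB": 1, "CS": 1}

basic = runs_created_basic(game["H"], game["BB"], game["TB"], game["AB"])
with_sb = runs_created_stolen_base(game["H"], game["BB"], game["CS"],
                                   game["TB"], game["SB"], game["AB"])
print(round(basic, 2), round(with_sb, 2))
```

When a game has no stolen bases and no runners caught stealing, the two versions give the same value, which is consistent with the two metrics being so highly correlated in a dataset with relatively few stolen bases.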
Finally, the correlation between Tot_Runs and RC_Tot_Bas is 0.855. One property of correlation
is that we can square it to get R² (with only two variables R² = adjusted R²) of 0.855² = 73.1%. Thus
in a linear model regressing Tot_Runs on RC_Tot_Bas, RC_Tot_Bas would account for about
73.1% of the variance in Tot_Runs for the games played in this dataset.

Baseball Variables
Table 9 below shows a summary of the regular baseball statistics while Table 10 shows the
“PAB” versions of each.
Table 9 – Baseball Variables Summary
Field Min 1st Qu Median Mean 3rd Qu Max
Tot_Runs 1 7 9 9.91 13 36
TotAssists 6 18 21 21.05 24 50
TotBalk 0 0 0 0.08 0 4
TotBB 0 5 6 6.81 9 19
TotCatchInt 0 0 0 0.01 0 2
TotCS 0 0 0 0.52 1 4
TotDbl 0 3 4 4.07 5 13
TotDblPlay 0 1 2 2.03 3 10
TotErrors 0 1 1 1.52 2 8
TotGDP 0 1 2 1.70 2 9
TotH 5 15 19 19.23 23 47
TotHBP 0 0 0 0.57 1 6
TotHR 0 1 2 1.92 3 11
TotIB 0 0 0 0.56 1 6
TotIndER 0 6 9 9.00 12 31
TotK 1 9 11 11.53 14 35
TotLOB 3 12 14 14.73 17 39
TotPB 0 0 0 0.20 0 4
TotPitchersUsed 2 4 6 6.06 7 17
TotPutouts 30 51 51 53.13 54 120
TotRBI 0 6 9 9.38 12 35
TotSacFly 0 0 0 0.62 1 5
TotSacHt 0 0 0 0.53 1 6
TotSB 0 0 1 1.08 2 8
TotTmER 0 6 9 8.97 12 31
TotTrip 0 0 0 0.39 1 5
TotTripPlay 0 0 0 0.00 0 2
TotWP 0 0 0 0.58 1 6

Table 10 - Baseball Variables - PAB Versions
Field Min 1st Qu Median Mean 3rd Qu Max
TotRun_PAB 0.010 0.090 0.130 0.136 0.170 0.410
TotAssists_PAB 0.110 0.250 0.300 0.301 0.350 0.610
TotBalk_PAB 0.000 0.000 0.000 0.001 0.000 0.050
TotBB_PAB 0.000 0.060 0.090 0.094 0.120 0.300
TotCatchInt_PAB 0.000 0.000 0.000 0.000 0.000 0.020
TotCS_PAB 0.000 0.000 0.000 0.006 0.010 0.060
TotDbl_PAB 0.000 0.030 0.050 0.053 0.070 0.160
TotDblPlay_PAB 0.000 0.010 0.020 0.025 0.040 0.140
TotErrors_PAB 0.000 0.010 0.010 0.018 0.030 0.100
TotGDP_PAB 0.000 0.010 0.020 0.021 0.030 0.130
TotH_PAB 0.090 0.230 0.270 0.270 0.310 0.460
TotHBP_PAB 0.000 0.000 0.000 0.006 0.010 0.080
TotHR_PAB 0.000 0.010 0.020 0.023 0.040 0.140
TotIB_PAB 0.000 0.000 0.000 0.006 0.010 0.060
TotIndER_PAB 0.000 0.080 0.120 0.123 0.160 0.370
TotK_PAB 0.010 0.120 0.160 0.163 0.200 0.440
TotLOB_PAB 0.050 0.170 0.210 0.207 0.240 0.390
TotPB_PAB 0.000 0.000 0.000 0.002 0.000 0.050
TotPitchersUsed_PAB 0.020 0.060 0.080 0.082 0.100 0.180
TotPutouts_PAB 0.580 0.720 0.770 0.767 0.800 1.060
TotRBI_PAB 0.000 0.090 0.130 0.128 0.160 0.400
TotSacFly_PAB 0.000 0.000 0.000 0.007 0.010 0.070
TotSacHt_PAB 0.000 0.000 0.000 0.006 0.010 0.070
TotSB_PAB 0.000 0.000 0.010 0.013 0.020 0.120
TotTmER_PAB 0.000 0.080 0.120 0.122 0.160 0.370
TotTrip_PAB 0.000 0.000 0.000 0.004 0.010 0.070
TotTripPlay_PAB 0.000 0.000 0.000 0.000 0.000 0.030
TotWP_PAB 0.000 0.000 0.000 0.006 0.010 0.070
Also, for the PAB versions, histograms of each variable are shown in Figure 6:

Figure 6 - Variable Density Plots
We see that many of the variables are highly skewed, even more than we saw with total runs.
We also see that some of the variables, like total intentional base on balls (TotIB_PAB) and total
passed balls (TotPB_PAB), are almost constant. For example there were no intentional base on
balls in almost half of the 3231 games.

Weather Variables
While we generally know any given geography will have seasonal variations in weather, the
game of baseball is not played year round. Baseball is only played for seven months out of the
year which means the games in this thesis’s dataset were all played between the months of
April and October; there are no games played during the winter.
A topic we’d like to explore is how much the weather statistics vary throughout the course of
the baseball season. In so doing, some general assumptions were made that should be
explained; they were learned from a lifetime of living in New England, both in Massachusetts
and close to the coast in Connecticut:
• Large bodies of water have a moderating effect – an ocean breeze cools the shore
during the summer and warms it during the fall.
• A Nor’easter is usually a winter storm and named because the storm travels up the
coast and results in winds from the Northeast.
• Dew point is a measure of the moisture in the air; when it gets above about 15.5°C (60°F)
it feels oppressive.
• Most New Englanders don’t use Celsius, but since it is used in this thesis’s source for
weather data, it will be used here.
Temperature
Temperature is the air temperature in degrees Celsius. As shown in Figure 7 the temperature
distribution is symmetric with a mean of 20°C (68°F).
Figure 7 – Temp Box & Whisker plot
Table 11 shows a breakdown of mean temperature by month and shows that months at the
beginning and end of the playing season are cooler. This is supported by a very small p-value
(p ≈ 0) on the one-way ANOVA test. This means the apparent difference in monthly means is
statistically significant, and the Scheffe Group tells us that April is the coldest month, May and
October have the same mean temperatures, June and September each differ from every other
month, and finally July and August have about the same mean temperatures.
Table 11 – Mean Temperature by Month
Month Mean Temperature (°C)
(p 0)
Scheffe
Group
4 12.04 A
5 16.46 B
6 21.77 C
7 25.02 D
8 24.04 D
9 19.86 E
10 16.48 B
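The one-way ANOVA behind the p-values in these monthly tables can be sketched as follows. This
is a minimal illustration rather than the procedure actually run for the thesis: the function
name and the three small samples are hypothetical, and only the F statistic is computed (turning
it into a p-value needs the F distribution, which is left to a statistics package).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical game-time temperatures (°C) for three months; NOT thesis data.
temps_by_month = {
    4: [10.0, 12.0, 14.0],
    7: [24.0, 25.0, 26.0],
    10: [15.0, 16.0, 17.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(temps_by_month.values()))
print(f_stat, df_b, df_w)
```

A large F relative to its degrees of freedom corresponds to a small p-value, which is what the
tables report as p ≈ 0.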
Dew Point
Dew point is a measure of the amount of moisture in the air. The dew point is slightly skewed
left with a mean of 11.6°C (52.8°F) as shown in Figure 8:
Figure 8 – Dew Point Boxplot
The average monthly Dew Point in Table 12 also seems to show a significant difference by
month with a p-value of almost 0. Again this is as we would expect – any New Englander can tell
you there is a large difference between the oppressive feel of days in July and the crisp days of
October.
Table 12 – Mean Dew Point by Month
Month   Mean Dew Point (°C) (p ≈ 0)   Scheffe Group
4 1.55 A
5 7.69 B
6 13.16 C
7 16.22 D
8 16.12 D
9 12.48 C
10 10.01 E

Sea Level Pressure (SLP)
Sea Level Pressure (SLP) is a measure of the air pressure in hectopascals (hPa). In Figure 9 the
SLP seems to have a fairly symmetrical distribution with some outliers on both ends and a mean
of 1015.3 hPa. And in Table 13 we see it has some statistically significant seasonal variation with
a p-value very close to 0.
Figure 9 - SLP Boxplot
Table 13 – SLP by Month
Month   SLP (hPa) (p ≈ 0)   Scheffe Group
4 1015.15 A
5 1014.77 A
6 1014.35 A
7 1014.41 A
8 1015.68 A
9 1017.56 B
10 1015.37 A
Interestingly, there’s not an obvious difference between means by month in Table 13, but the
low p-value indicates there is a difference that is not due to chance. Included in Table 13 is the
group as indicated by the Scheffe test for multiple comparisons. We can see that September
does in fact have a higher sea level pressure on average than the other months.
Ceiling Height
The Ceiling Height is essentially visibility in the vertical direction – the distance between the
ground and the lowest clouds – and is measured in meters. Just over half of the games
have a Ceiling Height of 22,000 meters (m), which is essentially the maximum. As a result the
box & whisker plot in Figure 10 is a little deceiving, so a histogram (Figure 11) is included as well:
Figure 10 - Ceiling Height Boxplot

Figure 11 - Ceiling Height Histogram
We can see in Figure 11 there is a large gap between 10,000-11,000 meters and 22,000 meters.
This is probably due to 22,000 being used to indicate “maximum visibility.”
Table 14 – Ceiling Height by Month
Month   Ceiling Height (m) (p = 0.00167)   Scheffe Group
4 11634.90 A
5 11738.24 A
6 12621.88 A
7 13446.97 A
8 13556.52 A
9 13262.74 A
10 11118.80 A
Table 14 with a low p-value (0.00167) indicates that not all the differences of Ceiling Height
between the months are due to chance – there seems to be some kind of seasonal variation by
this test, but the Scheffe method does not detect any difference between the months.
Visibility Distance
Visibility is simply the distance one can see in the horizontal direction. The mean visibility is
17.7 km, and even with a minimum of 200 m it doesn’t seem like it would have an impact on
something at the scale of a baseball game. On the other hand, low visibility could indicate other
atmospheric conditions like high humidity. As with Ceiling Height, both a box & whisker plot
(Figure 12) and histogram (Figure 13) are included to better illustrate the distribution:
Figure 12 - Visibility Boxplot
Figure 13 - Visibility Histogram
Table 15 – Visibility Distance by Month
Month   Visibility Distance (m) (p = 0.0621)
4 18717.61
5 18125.01
6 17689.24
7 16670.39
8 16878.40
9 18531.32
10 17579.30

Table 15’s p-value of 0.0621 falls just above the conventional 0.05 threshold, which indicates the
differences in Visibility Distance by month are probably (but just barely) due to chance.
Wind Speed and Direction
As discussed in the Data Pre-Processing section and shown in Table 4 - Additional Wind Fields,
several variables were created to aid in quantifying the combination of speed and direction –
W_NE, W_NW, W_SE, W_SW, and W_HR_Max.
Figure 14 shows a box & whisker plot of Wind Speed. The mean is just over 5 meters per second
(MPS), equivalent to about 11.2 MPH:
Figure 14 – Wind Speed Box & Whisker Plot
But as shown in Figure 15, there is clearly a difference by direction – stronger winds tend to
blow from the South-Southwest (180° to 320°). The one-way ANOVA has a p-value close to 0,
indicating the difference is statistically significant:
Figure 15 - Wind Speed by Direction
In addition to speed tendencies, the wind also has directional tendencies. These tendencies for a
given geographic location are called that area’s “prevailing winds.” Figure 16 shows the number
of games played at Fenway per wind direction:

Figure 16 - Games Played by Compass Point
In weather terminology, Figure 16 is called a “wind rose.” We see that most games are played
with the wind blowing generally from either the East (90°) or West (270°). But there are also
almost 225 games played with winds specifically blowing from 200°, which is the
South-Southwest direction. Very few games are played when the wind is blowing from the North.
An examination of a wind rose from Logan International Airport shows that ours is not unusual.
Since Logan is on the New England coast (which runs north to south), wind is generally either a
“land breeze” (winds from the west) or a “sea breeze” (winds from the east). A bias of the wind
rose generated from this dataset is that baseball games are usually played during fair weather
and none are played during the winter – there are generally no games played during the typical
New England Nor’easter. This would be the explanation for very few games (fewer than 50) being
played when winds are blowing from 340° to around 70°.
In Figure 17 we see a radial plot of the maximum and mean wind speed by direction:

Figure 17 – Max & Mean Wind Speed in MPS by Direction
Intuitively, it seems that wind speed and direction should have an impact on batted balls. So it
was desirable to quantify wind speed and direction as they relate to the orientation of Fenway
Park itself. A 15 MPS wind blowing in toward home plate should have a different effect than a
wind of the same speed blowing straight out to center field. Again, since the wind metrics are
actually recorded five miles away at Logan International Airport, it was desirable to see if this
wind quantification made sense in relation to the baseball metrics recorded at Fenway Park.
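One way to relate a reported wind to the orientation of a ballpark is to project the wind onto
the line from home plate to center field. The sketch below is illustrative only: the function
name and the 45° field bearing are assumptions made for this example, not values taken from the
thesis, and it is not the definition of the W_NE, W_SE, W_SW, W_NW or W_HR_Max fields. It relies
on the meteorological convention that a wind direction is the direction the wind blows from.

```python
import math


def outfield_component(speed_mps, wind_from_deg, field_bearing_deg):
    """Signed wind speed along the home-plate-to-center-field line.

    Positive values mean the wind is blowing out, negative values mean it
    is blowing in toward home plate.
    """
    # Wind directions are reported as "from"; convert to the direction of travel.
    wind_toward_deg = (wind_from_deg + 180.0) % 360.0
    angle = math.radians(wind_toward_deg - field_bearing_deg)
    return speed_mps * math.cos(angle)


# ASSUMPTION: illustrative compass bearing from home plate to center field.
FIELD_BEARING_DEG = 45.0

blowing_out = outfield_component(15.0, 225.0, FIELD_BEARING_DEG)  # wind from the SW
blowing_in = outfield_component(15.0, 45.0, FIELD_BEARING_DEG)    # wind from the NE
crosswind = outfield_component(15.0, 135.0, FIELD_BEARING_DEG)    # wind from the SE
print(blowing_out, blowing_in, crosswind)
```

Under these assumptions the same 15 MPS wind contributes its full speed when blowing straight
out, the full speed with opposite sign when blowing straight in, and essentially nothing as a
pure crosswind.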
As stated in the Introduction, Adair (2002) speculates that the longest home runs are probably
aided by the wind. Added to Figure 17 are two letters indicating the historic wind speed and
direction for two of the longest home runs hit at Fenway Park. The “T” marks the conditions
when a 502-foot home run was hit by Ted Williams on June 9, 1946 (Kease, 1946) (Johnson,
2012). The weather data for the Williams home run was extracted manually from a historical
weather site, since this thesis’s dataset only goes back to 1970. The “M” in Figure 17 marks the
conditions when a home run was hit by Manny Ramirez on June 23, 2001, which is officially
listed as one foot shorter (501 feet) than the Williams home run. Ramirez’s home run was reported
to have hit a light tower in left field (Associated Press, 2001), while Williams’ home run is
famously commemorated by a lone red seat set far back in the right field bleachers of the park.
Clearly both home runs could have been wind aided as Adair speculates, but the Williams home
run was hit on a day when the wind was more favorable for home run hitting (15 MPS blowing
straight out to right field) than in any of the games played in the 40 years covered by this
thesis’s dataset.
So how do the wind measurements made at Logan relate to the orientation of Fenway Park?
Figure 18 – Home Runs per Game per Compass Point
An early concern in this project was to verify that weather conditions at Logan made sense in
relation to the baseball metrics recorded at Fenway Park. One would think more home runs are hit
when the wind is “blowing out” at any baseball field. But since we see clearly in Figure 16 that
the wind isn’t uniformly distributed across all wind directions, we can’t just consider the total
number of home runs hit for each direction. Therefore Figure 18 shows the number of HRs hit
per game per wind direction, overlaid on an image of Fenway Park taken from Google Maps. As
can be seen, there are generally more HRs hit per game at Fenway when the wind is blowing out
– close to 2.5 HRs per game when the wind is blowing out vs. fewer than 1.5 per game when the
wind is blowing in toward home plate. This seems to validate that the weather, at least as far as
wind is concerned, should be suitable for analysis of baseball played at Fenway Park.2
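The normalization described above, home runs per game rather than total home runs, can be
sketched as a simple group-by over wind-direction bins. The function name, the 10° bin width and
the handful of game records are hypothetical and chosen only to show why dividing by the number
of games matters.

```python
from collections import defaultdict


def hr_per_game_by_direction(games, bin_width=10):
    """Map each wind-direction bin (in degrees) to home runs per game.

    `games` is an iterable of (wind_from_deg, home_runs) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # bin start -> [home runs, games]
    for wind_from_deg, home_runs in games:
        bin_start = int(wind_from_deg % 360) // bin_width * bin_width
        totals[bin_start][0] += home_runs
        totals[bin_start][1] += 1
    return {b: hr / n for b, (hr, n) in sorted(totals.items())}


# Hypothetical records; NOT thesis data. The 90° bin has more games but
# fewer home runs per game than the 200° bin.
games = [(200, 3), (200, 2), (205, 4), (90, 1), (95, 1), (90, 2), (93, 0)]
rates = hr_per_game_by_direction(games)
print(rates)
```

Comparing totals alone would mostly reflect how often the wind blows from each direction, while
the per-game rate isolates the quantity of interest.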
Figure 19 is a bubble plot where each bubble’s position indicates the rate per at bat of home
runs (mean = 0.02) and doubles (mean = 0.05), and a bubble’s size indicates the wind speed. It
seems (graphically, at least) that when the wind is from the Southwest the bubbles trail farther
toward the right, indicating a higher rate of home runs. On the other hand, the same effect does
not appear for doubles – they generally range between 0 and 0.15 doubles PAB for all quadrants.
Figure 19 - Doubles and HRs by Wind Direction
Of course, even if there are more home runs hit when the wind is blowing toward the outfield,
how do we actually know the increase is caused by the wind? It’s common knowledge that HRs
and strikeouts are highly correlated – a batter who is trying to hit a home run is more likely to
strike out. As evidence of this, in Figure 20 we see total home runs and strikeouts per at bat in
MLB since 1970 (scaled).3 They have a correlation of 0.85:

Figure 20 - HR and SO Per At Bat by Year
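The correlation quoted for Figure 20 is presumably a Pearson correlation between the two yearly
series; that is an assumption, since the text does not name the coefficient. A minimal sketch of
the calculation follows, using a hypothetical function name and made-up yearly rates rather than
the MLB data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical yearly rates per at bat; NOT the MLB series in Figure 20.
hr_per_ab = [0.020, 0.024, 0.027, 0.030, 0.031]
so_per_ab = [0.150, 0.160, 0.165, 0.170, 0.185]
r = pearson_r(hr_per_ab, so_per_ab)
print(round(r, 2))
```

Because the coefficient is unaffected by linear rescaling, plotting the two series on a common
scale does not change the value.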
We now repeat the bubble plot but instead of doubles, strikeouts are on the Y-axis (Figure 21):
Figure 21 – Home runs (HRs) and Strikeouts (SOs) per At Bat (AB) by Wind Direction

Regardless of the quadrant the wind blows from, strikeouts generally range from 0 to 0.4 per
AB. One final check is to look at the ratio of TotHR_PAB (mean=0.023) to TotK_PAB
(mean=0.162) at a more granular level by wind direction.
Figure 22 - HR/K PAB by Wind Direction
Figure 22 shows a radial plot of this ratio. The ratio of mean TotHR_PAB to mean TotK_PAB is
0.023/0.162 = 0.142, but we can clearly see the ratio is disproportionately higher when the wind
is blowing out.
Finally a Boolean (yes or no) field was also created, W_HR_Max, to quantify when the wind was
“blowing out” – a time when home runs should be maximized because the wind was blowing
from home plate toward the outfield. Figure 23 shows another bubble plot of the difference
between wind blowing out (W_HR_Max =1) and wind blowing in (W_HR_Max = 0). Clearly there
are far more games played at Fenway when the wind is blowing out, as indicated by the number
of bubbles in the left plot. Less clear is the apparent tendency for HRs to be hit at a higher rate
when the wind blows out. But this we hope to quantify.

Figure 23 - HRs and Doubles by Wind In or Out

Unsupervised Analysis
The goal of unsupervised analysis is to determine if the data contains natural groups, or clusters,
of data. This type of analysis is “unsupervised” as there is no specific target for prediction. Any
underlying groupings we find will, hopefully, help us better understand the data.
An advantage of unsupervised analysis is that there is no need for a human to label the data and
thus there is no target classification or metric. A significant benefit of removing the target is that
we also remove any preconceived notions we may have about the nature of the relationships
within the dataset. In theory we can point these unsupervised algorithms at our dataset and
they will provide us valuable insights we might otherwise miss.
Of course, there are always downsides. One disadvantage of unsupervised analysis is that
determining the exact number of natural, underlying groupings can be subjective. Perhaps there
are no underlying groupings. We can apply our knowledge of, or hunches about, the domain at
hand but in so doing we are at risk of introducing biases. Still this concept of grouping the data is
a very powerful tool for helping us understand a large quantity of information.
Finally, the groupings created during the unsupervised analysis can be passed to downstream,
supervised algorithms. The pooled effect created by these groupings can often be used by
subsequent algorithms to make better predictions.
This thesis will use two common algorithms for unsupervised analysis – K-means Clustering
(MacQueen, 1967) and Principal Components Analysis (PCA) (Larose, 2007). In addition, the
following subsets of the thesis’s dataset will be analyzed for groupings separately:
1. Weather-only fields – W_NE, W_SE, W_SW, W_NW, W_HR_Max, Ceil_Hgt, Visby_Dist, Temp,
Dewpt, TDS, SLP
2. Baseball-only fields – TotAssists_PAB, TotBalk_PAB, TotBB_PAB, TotCS_PAB, TotDbl_PAB,
TotErrors_PAB, TotGDP_PAB, TotH_PAB, TotHBP_PAB, TotHR_PAB, TotIB_PAB, TotK_PAB,
TotLOB_PAB, TotPB_PAB, TotPitchersUsed_PAB, TotPutouts_PAB, TotSacFly_PAB,
TotSacHt_PAB, TotSB_PAB, TotTrip_PAB, TotWP_PAB, TotRun_PAB
3. Complete dataset – both of the above sets combined.
It’s also common during unsupervised analysis to scale the variables so that variables with large
values don’t “overpower” those with relatively small values. With this in mind, all variables were
standardized to a mean of 0 and a standard deviation of 1.
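The standardization step can be sketched as a z-score transform. This is a minimal
illustration: the function name and sample values are hypothetical, and the use of the
population standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical readings; NOT thesis data. SLP is in hPa and Temp in °C,
# so the raw magnitudes differ by roughly two orders of magnitude.
slp = [1013.0, 1015.0, 1017.0]
temp = [12.0, 20.0, 28.0]
z_slp = standardize(slp)
z_temp = standardize(temp)
print(z_slp, z_temp)
```

After the transform both variables are on the same unitless scale, so a field such as SLP (values
near 1015) no longer dominates a Euclidean distance simply because of its units.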
Also as stated earlier, using “PAB” versions of each field allows us to use the 236 extra-inning
and 22 shortened games. But one consequence of converting to “PAB” is that the variable
TotPutouts_PAB becomes highly correlated with TotRun_PAB, and this relationship doesn’t tell
us anything about the game we are trying to understand. This is because TotPutouts is the total
number of outs made in a game and for most games (2870 in this thesis’s dataset, or 88.8%) has
a value of either 51 (when the home team doesn’t need its final turn batting) or 54 (when the
home team wins or loses in its final turn batting) for a nine-inning game. When TotPutouts is
divided by at bats, the rate of putouts per at bat, TotPutouts_PAB, becomes strongly and
negatively correlated with TotRun_PAB (-0.69). Essentially any AB that does not result in an out
has to be some other action (hit, walk, error) which moves the base runners forward and
thus results in an increase in the number of runs scored due to the nature of the game. Because
this strong relationship effectively masks hits and prevents us from seeing whether weather has
any impact on runs scored, TotPutouts_PAB will be removed from the dataset.
Principal Components
Principal Component Analysis (PCA) is an unsupervised technique generally used for dimension
reduction, or to reduce the number of variables. These fewer variables are usually called
“components,” and each component explains a percentage of the total variance within the
entire dataset. We then examine the weightings of each component to characterize it in terms
of our original variables. A key decision when using PCA is deciding how many of the
components to keep, since not all are of significance. To make this decision, the following
criteria were used (Larose, 2007):
• Eigenvalue – Only components with an eigenvalue greater than 1 are kept.
• Scree Plot – The point where the scree plot levels out generally indicates no
additional explained variance, so components beyond this “elbow” are
superfluous.
• Cumulative Variance (CV) explained – Ideally the components will explain some
domain-dependent percentage of the total variance in the dataset. For this
project the target is about 90%.
Assessment of components was done using the above criteria as suggestions, with the final
number to use being determined by a combination of the above metrics. These criteria are
only loosely applied and the right to bend the rules is reserved. A component with an eigenvalue
of 0.99 may be kept if it makes sense for the analysis. Likewise a component with an eigenvalue
of 1.01 isn’t rigidly accepted.
The Occam’s razor heuristic suggests that, all else being equal, a five-component solution is
better than a six-component solution – this rule will be used here.
Finally, after choosing the number of components, they will be profiled based on their weightings.
A rule of thumb is that a weight should have a magnitude of at least 0.50 to be considered a
“characteristic” of the component (Larose, 2007).

Weather-Only Fields
Table 16 – Weather-Only Fields PCA Summary
PC   Eigenvalue   Proportion of Variance   Cumulative Variance
1 2.85 0.26 0.26
2 2.45 0.22 0.48
3 1.47 0.13 0.62
4 1.23 0.11 0.73
5 1.06 0.10 0.82
6 0.62 0.06 0.88
7 0.58 0.05 0.93
8 0.41 0.04 0.97
9 0.21 0.02 0.99
10 0.12 0.01 1.00
11 0.00 0.00 1.00
Table 17 – Weather-Only PCA Weightings
Variable PC1 PC2 PC3 PC4 PC5 PC6
W_NE 0.33 0.13 -0.28 -0.38 0.47 -0.28
W_SE 0.24 -0.03 0.60 0.27 -0.37 -0.04
W_SW -0.46 0.16 -0.03 -0.38 -0.26 -0.03
W_NW -0.06 -0.38 -0.37 0.47 0.17 0.23
W_HR_Max -0.50 0.09 -0.10 -0.21 -0.24 0.10
Ceil_Hgt -0.16 -0.33 0.37 -0.09 0.36 -0.10
Visby_Dist -0.07 -0.44 0.14 -0.17 -0.07 -0.68
Temp -0.43 0.13 0.24 0.21 0.45 0.00
Dewpt -0.16 0.50 0.24 0.22 0.33 -0.13
TDS -0.31 -0.47 -0.02 -0.03 0.11 0.15
SLP 0.20 -0.13 0.36 -0.49 0.19 0.58
Weather-Only Summary
The eigenvalue criterion suggests five components (Table 16), the scree plot criterion (Figure 24)
suggests six, and the CV criterion (Table 16) suggests seven. But six components explain 88% of
the variance, which is close enough to 90%, so six is the number of weather components that will
be used.
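The eigenvalue and cumulative-variance criteria can be checked directly against the eigenvalues
listed in Table 16. The sketch below uses those published (rounded) eigenvalues; the function
names are hypothetical, and the scree-plot criterion is omitted because it is a visual judgment.

```python
# Eigenvalues of the eleven weather-only components, as listed in Table 16.
eigenvalues = [2.85, 2.45, 1.47, 1.23, 1.06, 0.62, 0.58, 0.41, 0.21, 0.12, 0.00]


def eigenvalue_rule_count(eigs, threshold=1.0):
    """Number of components whose eigenvalue exceeds the threshold."""
    return sum(1 for e in eigs if e > threshold)


def components_for_variance(eigs, target=0.90):
    """Smallest number of leading components reaching the target share of variance."""
    total = sum(eigs)
    running = 0.0
    for count, e in enumerate(eigs, start=1):
        running += e
        if running / total >= target:
            return count
    return len(eigs)


def cumulative_share(eigs, n):
    """Share of total variance explained by the first n components."""
    return sum(eigs[:n]) / sum(eigs)


by_eigenvalue = eigenvalue_rule_count(eigenvalues)
by_variance = components_for_variance(eigenvalues)
share_of_six = cumulative_share(eigenvalues, 6)
print(by_eigenvalue, by_variance, round(share_of_six, 2))
```

With these inputs the eigenvalue rule gives five components, the 90% rule gives seven, and six
components account for 88% of the variance, which matches the figures quoted in the summary.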
Table 18 shows the profile of each of the six components.
Figure 24 – Weather-Only Variables Scree Plot

Table 18 - Weather Component Profiles
Component   Description
PC_W1   Winds from East/NE (non-HR Max Direction); Lower temperature
PC_W2   Higher Dew Point; Lower Visibility
PC_W3   Wind from the Southeast
PC_W4   Lower SLP
PC_W5   Winds from Northeast; Higher Temp
PC_W6   Lower Visibility; High SLP
In PCA, the first component accounts for the most variance. From Table 16 we see that
PC_W1 accounts for 26% of all the variance in the data. This makes sense since, as we saw
earlier, a significant number of games are played when the winds are from the East.
Temperature may be slightly lower in these games though the weight (-0.43) doesn’t quite reach
our ±0.50 criterion. This also makes sense, as during the warmer months when baseball is
played, a “sea breeze” tends to have a moderating effect on the temperature. The next highest
component, PC_W2, accounts for 22% more variance. The games which have a higher score for
PC_W2 will tend to have a higher dew point with perhaps low visibility, although the weight
here is only -0.44. Together PC_W1 and PC_W2 explain almost half of all the variance seen in
the dataset.
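The ±0.50 rule of thumb can be applied mechanically to the weightings in Table 17. The sketch
below copies the PC1 and PC2 columns of that table; the function name and the relaxed 0.40
cutoff are illustrative assumptions, included only to show which variables sit just under the
threshold.

```python
# PC1 and PC2 weightings copied from Table 17.
pc1 = {"W_NE": 0.33, "W_SE": 0.24, "W_SW": -0.46, "W_NW": -0.06,
       "W_HR_Max": -0.50, "Ceil_Hgt": -0.16, "Visby_Dist": -0.07,
       "Temp": -0.43, "Dewpt": -0.16, "TDS": -0.31, "SLP": 0.20}
pc2 = {"W_NE": 0.13, "W_SE": -0.03, "W_SW": 0.16, "W_NW": -0.38,
       "W_HR_Max": 0.09, "Ceil_Hgt": -0.33, "Visby_Dist": -0.44,
       "Temp": 0.13, "Dewpt": 0.50, "TDS": -0.47, "SLP": -0.13}


def profile(weights, cutoff=0.50):
    """Names of variables whose weight magnitude reaches the cutoff."""
    return [name for name, w in weights.items() if abs(w) >= cutoff]


strict_pc1 = profile(pc1)
strict_pc2 = profile(pc2)
relaxed_pc1 = profile(pc1, cutoff=0.40)  # hypothetical looser cutoff
relaxed_pc2 = profile(pc2, cutoff=0.40)
print(strict_pc1, strict_pc2, relaxed_pc1, relaxed_pc2)
```

Under the strict cutoff each component is characterized by a single variable; the descriptions
in Table 18 also draw on weights in the 0.40 to 0.50 range, which is consistent with the loose
application of the criteria described earlier.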

Baseball-Only Fields
Table 19 – Baseball variables PCA Summary
PC   Eigenvalue   Prop of Var   Cum Var
1 2.82 0.13 0.13
2 2.08 0.10 0.23
3 1.80 0.09 0.32
4 1.22 0.06 0.38
5 1.12 0.05 0.43
6 1.06 0.05 0.48
7 1.03 0.05 0.53
8 1.00 0.05 0.58
9 0.98 0.05 0.62
10 0.98 0.05 0.67
11 0.95 0.05 0.72
12 0.92 0.04 0.76
13 0.90 0.04 0.80
14 0.80 0.04 0.84
15 0.79 0.04 0.88
16 0.73 0.03 0.91
17 0.59 0.03 0.94
18 0.52 0.02 0.97
19 0.36 0.02 0.98
20 0.33 0.02 1.00
21 0.02 0.00 1.00
Figure 25 - Baseball Variables Scree Plot

Table 20 – Baseball-Only Fields PCA Weightings
Variable PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
TotAssists_PAB 0.17 -0.55 0.23 -0.21 -0.01 0.05 0.06 0.03
TotBalk_PAB -0.05 0.00 0.06 0.15 0.34 0.30 -0.20 0.44
TotBB_PAB -0.43 -0.26 -0.14 -0.17 0.01 0.02 -0.06 -0.07
TotCS_PAB 0.04 -0.19 -0.06 -0.05 0.24 0.14 0.58 -0.29
TotDbl_PAB -0.28 0.24 0.28 0.10 -0.32 0.04 0.03 0.12
TotErrors_PAB -0.13 -0.13 0.02 0.27 0.38 -0.37 -0.19 -0.05
TotGDP_PAB 0.04 -0.38 0.32 -0.36 0.03 0.06 -0.21 0.17
TotH_PAB -0.28 0.20 0.52 0.06 -0.04 -0.01 0.08 -0.09
TotHBP_PAB -0.17 0.06 -0.10 -0.36 0.12 0.29 0.09 0.07
TotHR_PAB 0.02 0.28 0.23 -0.37 0.07 -0.34 0.26 -0.16
TotIB_PAB -0.28 -0.24 -0.05 0.13 -0.35 -0.08 0.16 -0.02
TotK_PAB -0.04 0.22 -0.53 -0.13 0.08 0.09 0.10 0.11
TotLOB_PAB -0.46 -0.15 -0.17 0.17 -0.17 0.11 -0.15 -0.06
TotPB_PAB -0.12 -0.01 -0.05 -0.18 0.02 0.01 -0.45 -0.63
TotPitchersUsed_PAB -0.40 0.13 -0.01 -0.38 0.11 0.00 0.10 0.10
TotSacFly_PAB -0.21 0.07 0.21 0.22 0.34 -0.06 0.14 0.05
TotSacHt_PAB -0.13 -0.29 -0.04 0.24 -0.09 0.01 0.41 0.00
TotSB_PAB -0.12 -0.14 -0.14 0.02 0.34 -0.50 -0.01 0.02
TotTrip_PAB -0.03 0.05 0.15 0.20 0.35 0.52 0.01 -0.35
TotWP_PAB -0.20 -0.02 0.02 -0.16 0.18 -0.02 0.01 0.28
Baseball-Only Summary
The eigenvalue criterion (Table 19) suggests 8-10 components, the scree plot (Figure 25)
suggests 4-6 components, and the CV criterion (Table 19) suggests 16 components. But since
components 9 through 16 each account for only 3-5% of the variance of the variables, these will
not be considered further. And since none of the components accounts for a particularly large
amount of variance (the highest is 13%), the smallest number suggested by the scree plot
criterion, four, will be used. These four components only account for 38% of the variance in the
dataset, but a lower explained variance is more desirable than insignificant components.
Table 21 – Baseball Component Profiles
Component   Description
PC_B1   Fewer LOB and Low BB
PC_B2   Fewer Assists
PC_B3   Many Hits and Fewer K’s
PC_B4   Fewer HRs; Fewer Pitchers Used

None of the weightings in PC_B1 are particularly strong (i.e. with a magnitude of at least 0.50),
but the component could probably be characterized as the “well pitched game.” Still PC_B1 only
accounts for 13% of the variance in the dataset. PC_B2 accounts for an additional 10% and has
as its strongest feature a low number of assists with the weighting for TotAssists_PAB at -0.55.
The other traits of each component are shown in Table 21, but in general no single component
explains a large amount of variance.

Complete Set of Fields
Table 22 – Complete set PCA Summary
PC   Eigenvalue   Prop of Var   Cum Var
1 3.29 0.10 0.10
2 2.68 0.08 0.19
3 2.49 0.08 0.26
4 1.89 0.06 0.32
5 1.79 0.06 0.38
6 1.45 0.05 0.42
7 1.27 0.04 0.46
8 1.20 0.04 0.50
9 1.14 0.04 0.54
10 1.08 0.03 0.57
11 1.03 0.03 0.60
12 1.01 0.03 0.63
13 0.99 0.03 0.67
14 0.98 0.03 0.70
15 0.95 0.03 0.73
16 0.95 0.03 0.76
17 0.92 0.03 0.78
18 0.90 0.03 0.81
19 0.80 0.03 0.84
20 0.78 0.02 0.86
21 0.73 0.02 0.88
22 0.62 0.02 0.90
23 0.57 0.02 0.92
24 0.55 0.02 0.94
25 0.51 0.02 0.96
26 0.41 0.01 0.97
27 0.36 0.01 0.98
28 0.32 0.01 0.99
29 0.21 0.01 1.00
30 0.12 0.00 1.00
31 0.02 0.00 1.00
32 0.00 0.00 1.00
Figure 26 - Complete Set Scree Plot

Table 23 - Full Set PCA Weightings
Variable PC1 PC2 PC3 PC4 PC5 PC6
TotAssists_PAB 0.12 -0.16 0.03 -0.17 0.53 -0.15
TotBalk_PAB -0.02 0.04 0.04 -0.04 0.01 -0.04
TotBB_PAB -0.01 0.30 0.25 0.24 0.20 0.02
TotCS_PAB 0.05 -0.06 0.02 0.06 0.14 -0.01
TotDbl_PAB -0.21 0.23 0.06 -0.13 -0.06 0.02
TotErrors_PAB -0.02 0.09 0.11 0.00 0.10 0.00
TotGDP_PAB 0.01 -0.04 0.03 -0.19 0.44 -0.18
TotH_PAB -0.33 0.23 0.10 -0.30 0.03 -0.03
TotHBP_PAB -0.03 0.15 0.03 0.11 -0.06 -0.02
TotHR_PAB -0.24 0.01 -0.05 -0.21 -0.17 -0.03
TotIB_PAB -0.02 0.16 0.11 0.21 0.27 0.10
TotK_PAB 0.09 0.04 -0.07 0.38 -0.34 0.16
TotLOB_PAB 0.02 0.30 0.21 0.34 0.15 0.09
TotPB_PAB -0.01 0.09 0.07 0.05 -0.01 -0.02
TotPitchersUsed_PAB -0.15 0.33 0.09 0.13 -0.06 -0.03
TotSacFly_PAB -0.12 0.16 0.12 -0.12 0.00 0.02
TotSacHt_PAB 0.03 0.05 0.11 0.14 0.28 0.05
TotSB_PAB 0.04 0.09 0.06 0.13 0.09 0.10
TotTrip_PAB -0.10 0.00 0.04 -0.08 0.01 -0.01
TotWP_PAB -0.05 0.14 0.10 0.05 0.03 -0.09
TotRun_PAB -0.35 0.27 0.13 -0.29 -0.11 -0.05
W_NE 0.24 0.22 -0.10 0.00 0.01 -0.23
W_SE 0.16 0.09 0.01 -0.29 0.02 0.52
W_SW -0.36 -0.16 -0.12 0.23 0.10 0.02
W_NW 0.02 -0.16 0.33 0.05 -0.13 -0.38
W_HR_Max -0.39 -0.20 -0.06 0.25 0.05 -0.05
Ceil_Hgt -0.07 -0.21 0.29 -0.03 0.01 0.38
Visby_Dist 0.02 -0.22 0.38 -0.07 -0.03 0.14
Temp -0.35 -0.19 -0.12 0.10 0.15 0.23
Dewpt -0.19 0.07 -0.44 0.03 0.20 0.21
TDS -0.17 -0.30 0.42 0.07 -0.08 0.00
SLP 0.17 0.05 0.11 -0.13 0.00 0.38
Complete Set Summary
For the complete set of clustering fields, the eigenvalues (Table 22) suggest 12-13 components,
the scree plot (Figure 26) suggests 6-7, and to account for 90% of variance we’d need 22
components. Note also that the scree plot only shows the first 10 components, but each
component after the first 10 accounts for about 3% of the variance or less, so the scree plot is
quite flat beyond that point. While two of our criteria suggest a larger number should be used,
we will use the scree plot as the primary measure to determine the number of components to
keep for the complete set of data. Both 22 (as suggested by the CV criterion) and 12 (as
suggested by the eigenvalue criterion) are too high since most of these components
account for very little variance. Using the scree plot we see that the amount of variance
explained after about six components drops off sharply. So six is the number of components
that will be used for the complete dataset.
Table 24 – Complete Set Component Profiles
Component   Description
PC_CMP_1   Fewer Runs; Low Temp; Fewer Hits; Wind not out of the SW
PC_CMP_2   More Pitchers Used; More Walks; Low TDS
PC_CMP_3   Wind from NW; Higher Visby_Dist; Low Dew Point
PC_CMP_4   More Ks; More LOB; Fewer Runs; Fewer Hits
PC_CMP_5   More Assists; Fewer Ks
PC_CMP_6   SE Wind; High Ceiling Height
In Table 24 we see that the first component, PC_CMP_1, is characterized by a lower number of
runs, low temperature, a lower number of hits, and wind blowing from a direction other
than the SW. But none of these weightings is particularly strong. Likewise none of the
weightings are compelling enough to create any strong characterizations for any of the
components.

Clustering
Data clustering is a popular technique for finding underlying groupings that might help explain
data. Points or observations (or in the case of this thesis, games) that are in the same cluster
could be considered to be similar to each other. The Fisher Iris dataset (Fisher, 1936) is a classic
dataset often used for the exploration of clustering techniques. It contains length and width
measures in centimeters of the petals and sepals of various species of irises. Figure 27 shows a
simple plot of the sepal and petal widths. Notice how there appear to be two to three
groupings of dots.
Figure 27 - Fisher Iris Data
Because of the underlying clusters in the iris data, if we know the species we’ll also generally
know the size of its parts and vice versa.
For our clustering we’ll use an algorithm called k-means, which essentially starts with random
cluster centers (k) and adds a data point to the closest center, recalculates the new center of the
cluster (the cluster’s mean), and then repeats this process for all points in the dataset. Like the
above graph, k-means uses Euclidean distance as the metric; unlike the above graph, the k-
means algorithm can handle many dimensions.
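The assign-and-recompute loop can be sketched in a few lines. The version below is the common
batch form of k-means, which reassigns all points before recomputing the centers; it may differ
in detail from the implementation used for this thesis. The function name, the explicit starting
centers (used instead of random ones so the result is repeatable) and the six sample points are
all illustrative assumptions.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Batch k-means on tuples of coordinates, starting from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its closest center (Euclidean distance).
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # leave an empty cluster's center in place
        if updated == centers:
            break  # converged: no center moved
        centers = updated
    return labels, centers


# Hypothetical two-dimensional points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels, centers)
```

Because the distance function accepts tuples of any length, the same code works unchanged for
data with many dimensions.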
A downside to k-means clustering is determining k, or the number of clusters. There are a
number of ways to determine a good value for k. We will examine all values of k between 2 and
15, to determine which is most logical for groupings of the data.
The following metrics were considered to find optimal clusters within the three unsupervised
datasets:

1. For clustering, Partitioning Around Medoids (PAM) as described in Kaufman &
Rousseeuw (1990) will be used. This algorithm is a more robust version of k-means and is
implemented in the cluster package for the programming language R. PAM provides a silhouette
metric which is useful for determining cluster cohesiveness. A member’s silhouette width
compares how dissimilar it is to the other members of its own cluster with how dissimilar it is
to the members of the nearest other cluster. The silhouette value for each cluster is the mean
silhouette width of its members. Silhouette is interpreted as shown in Table 25.
Table 25 - Silhouette Metric Interpretation
Value Interpretation
Close to 1 Cluster member is well defined
Near 0 Cluster member is on a boundary between clusters
Close to -1 Cluster member is poorly defined
Thus, a good cluster has a mean silhouette for all members closer to one, and an overall solution
that is closer to one is a better cluster solution than a lower value.
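For readers who want to see how a silhouette width is obtained, the sketch below follows the
usual definition: for each member, a is the mean distance to the other members of its own
cluster, b is the smallest mean distance to the members of any other cluster, and the width is
(b - a) / max(a, b). The function name and the four one-dimensional points are hypothetical, and
the code is not the R cluster package implementation.

```python
import math
from statistics import mean


def silhouette_widths(points, labels):
    """Silhouette width of every point under the given cluster labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for a single-member cluster
            continue
        # The distance from p to itself is 0, so divide by the other members only.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        widths.append((b - a) / max(a, b))
    return widths


# Hypothetical one-dimensional data with two tight, well-separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_widths(points, [0, 0, 1, 1])
bad = silhouette_widths(points, [0, 1, 0, 1])
print(round(mean(good), 3), round(mean(bad), 3))
```

The sensible grouping yields widths close to 1, while deliberately mixing the two groups yields
negative widths.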
2. Prediction strength (Tibshirani & Walther, 2005) is another metric for assessing
cluster quality. The main idea behind prediction strength is to divide the observations into
training and testing datasets, cluster each, and build a co-membership matrix, M, with one row
and one column per test observation. The training clusters are then used to predict cluster
co-membership in the test data, or how often each pair of test observations that falls into the
same test cluster is also assigned to the same cluster by the training solution.
Any algorithm can be used to create clusters, but this project uses R’s fpc package which uses
the Nearest Neighbor (k-nn) algorithm in its prediction.strength() function. Note that a single
cluster (k=1) always has a prediction strength of one (100%), since all pairs are in the same
cluster. So one weakness of prediction strength is that it doesn’t help us determine whether or
not clusters should be created at all. Tibshirani & Walther recommend the largest value of k
that results in a prediction strength of at least 0.8.
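A simplified, single-split sketch of prediction strength is given below. It is an illustration of
the idea rather than the fpc implementation: it uses a tiny k-means with a deterministic start
and nearest-centroid prediction, performs one training/test split instead of repeated
cross-validation, and all names and data points are hypothetical.

```python
import math


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))


def kmeans(points, k, max_iter=100):
    """Tiny batch k-means; starts from k evenly spaced points of the sorted data."""
    ordered = sorted(points)
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return [nearest(p, centers) for p in points], centers


def prediction_strength(train, test, k):
    """Worst-case share of co-clustered test pairs that the training centers keep together."""
    _, train_centers = kmeans(train, k)
    test_labels, _ = kmeans(test, k)
    predicted = [nearest(p, train_centers) for p in test]
    shares = []
    for j in range(k):
        members = [i for i, lab in enumerate(test_labels) if lab == j]
        n = len(members)
        if n < 2:
            continue  # a cluster with fewer than two members has no pairs to score
        kept = sum(1 for a in members for b in members
                   if a != b and predicted[a] == predicted[b])
        shares.append(kept / (n * (n - 1)))
    return min(shares) if shares else 0.0


# Hypothetical training and test samples drawn from two well-separated groups.
train = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (5.0, 5.0), (5.4, 5.2), (5.1, 5.6)]
test = [(0.1, 0.3), (0.4, 0.1), (0.3, 0.5), (5.2, 5.1), (5.3, 5.5), (4.9, 5.3)]
ps_k1 = prediction_strength(train, test, 1)
ps_k2 = prediction_strength(train, test, 2)
print(ps_k1, ps_k2)
```

With k=1 every pair is trivially kept together, which is why a prediction strength of one at
k=1 carries no information about whether clusters exist.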
For reference, these two techniques were used on the previously seen Fisher Iris dataset. The
set is generally considered to have two to three clusters of irises. As can be seen in Table 26
there is agreement between the two metrics in this case: the highest mean silhouette width
(0.68) and the largest prediction strength for k > 1 (0.95) both suggest two clusters. Using a
three cluster solution would give a mean silhouette width of 0.57, which is a small drop-off from
the two cluster solution for this metric. But a three cluster solution causes a significant
drop in prediction strength (from 0.95 to 0.58). This loss in performance is
from having to differentiate between the two types of irises which can be seen in the upper
cluster in Figure 27.
Figure 28 shows the silhouette plot for the k=2 cluster solution. We can see that Cluster 1 has 50
members and a mean silhouette width of 0.84, which is quite close to 1, indicating the members
of this cluster are very similar to each other. Cluster 2 has 100 members and a mean width of
0.60, which is also a decent value. At the bottom of the silhouette plot is the overall
width, which is also the value we see in Table 26 for k=2.
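As a check, the overall width is the member-weighted mean of the two cluster widths. Using the
cluster sizes and mean widths reported for Figure 28 (the symbol \bar{s} is notation introduced
here for the overall mean silhouette width):

```latex
\bar{s} = \frac{50(0.84) + 100(0.60)}{50 + 100} = \frac{42 + 60}{150} = 0.68
```

This agrees with the Avg Silhouette Width listed for k=2 in Table 26.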

Table 26 – Iris Cluster Results
Iris Dataset
k Avg Silhouette Width Prediction Strength
1 n/a 1.00
2 0.68 0.95
3 0.57 0.58
4 0.50 0.58
5 0.49 0.50
6 0.51 0.45
7 0.37 0.43
8 0.34 0.40
9 0.30 0.32
10 0.31 0.26
11 0.30 0.28
12 0.27 0.12
13 0.25 0.11
14 0.25 0.06
15 0.25 0.00
One final point: Figure 27 shows that one could argue for either two or three clusters in the
Fisher Iris data. While there are clearly at least two clusters, the upper cluster could conceivably
be separated into two clusters, for a total of three. The ultimate number of clusters used should
be based on common sense and knowledge of the specific domain from which the data
originates.
Figure 28 - Iris Silhouette

Clustering of Weather Variables
Table 27 – Weather-Only Cluster Results
Weather-Only
k   Mean Silhouette Width   Prediction Strength
1 n/a 1.00
2 0.213 0.93
3 0.258 0.67
4 0.290 0.72
5 0.245 0.54
6 0.206 0.59
7 0.204 0.49
8 0.192 0.49
9 0.187 0.46
10 0.180 0.41
11 0.182 0.40
12 0.182 0.42
13 0.187 0.43
14 0.188 0.37
15 0.186 0.37
The Mean Silhouette Width criterion suggests a four cluster solution (0.290), while Prediction
Strength indicates a two cluster solution (0.93). Silhouette plots for k values of 2 through 4
are shown in Figure 30, Figure 31, and Figure 32.
Figure 29 – Weather-Only Prediction Strength
Figure 30 – Weather-Only Silhouette (k=2)
Figure 31 – Weather-Only Silhouette (k=3)

None of the silhouette solutions is any more compelling than the others. Strictly speaking the
k=4 solution has the best silhouette width at 0.29, but with k=2 having such a high prediction
strength, it seems using two clusters would be an easier choice to defend. So k=2 is the value
that will be used for this project. A summary of variables for the two cluster solution is shown in
Table 28:
Table 28 – Cluster Summary
Cluster 1 2
Count 1233 1998
Average of Tot_Runs 10.7 9.4
Average of W_HR_Max 1.0 0.1
Average of W_NE 0.0 1.6
Average of W_SE 0.0 2.0
Average of W_SW 5.3 0.0
Average of W_NW 0.8 1.4
Average of Ceil_Hgt 13357 12306
Average of Visby_Dist 17634 17792
Average of Temp 23.9 17.7
Average of Dewpt 13.7 10.2
Average of TDS 10.2 7.4
Average of SLP 1013.4 1016.5
Perhaps the biggest difference that stands out between the two clusters is in the variable W_SW.
In Cluster 1, winds are from the Southwest with a mean speed of 5.3 MPS. It also makes sense that
Temp (temperature) and Dewpt (dew point) would be higher when winds are from the
Southwest. There are more runs scored (10.7) for this cluster as well. Recall we saw in the Wind
Quantification section that it appeared as if more runs were scored when the wind was blowing
from the Southwest.
Figure 32 – Weather-Only Silhouette (k=4)
Because there are only two clusters, a new field will be added to the dataset, Cluster1. Cluster1
will have the value of 1 when the game falls in Cluster 1 and a value of 0 if the game is not in
Cluster 1 (i.e. Cluster 2).

Clustering of Baseball Variables
The baseball variables did not form particularly strong clusters by either metric:
Table 29 – Baseball-Only Clustering Results
Baseball-Only
k Avg Silhouette Width Prediction Strength
1 n/a 1.00
2 0.038 0.51
3 0.055 0.36
4 0.032 0.28
5 0.028 0.24
6 0.032 0.20
7 0.027 0.18
8 0.023 0.16
9 0.018 0.16
10 0.012 0.15
11 0.016 0.15
12 0.021 0.13
13 0.019 0.13
14 0.018 0.13
15 0.013 0.11
In Table 29 we see the maximum prediction strength is for k=2, which is only slightly higher than
predicting with a coin flip. A value of k=3 is suggested by the Mean Silhouette Width. But neither
is particularly compelling. We also see in Figure 33 that none of the values for k results in a
prediction strength greater than 0.80. Because of these results no clusters will be created from
the set of baseball variables. This would also seem to validate what we saw in the PCA – that the
baseball data seems to be very diverse.
Figure 33 – Baseball-Only Prediction Strength

Clustering of Complete Set of Variables
As with the Baseball-only variables, Table 30 shows the full set of variables did not create
compelling clusters:
Table 30 – Complete Set Cluster Summary
Full
k Avg Silhouette Width Prediction Strength
1 n/a 1.00
2 0.061 0.51
3 0.059 0.36
4 0.070 0.28
5 0.051 0.24
6 0.045 0.20
7 0.038 0.18
8 0.038 0.16
9 0.031 0.16
10 0.030 0.15
11 0.024 0.15
12 0.019 0.13
13 0.024 0.13
14 0.018 0.13
15 0.022 0.11
Table 30 shows that the prediction strength for any given value of k is only slightly better than
could be had by guessing (0.51 for k=2, 0.36 for k=3, 0.28 for k=4). Because of this, and the fact
that no value for k results in a prediction strength greater than 0.80, no clusters will be created
with the complete set.
Figure 34 – Cluster Only Prediction Strength

Supervised Analysis
The goal of this thesis is to explore the role that weather might play in scoring runs in the game
of baseball. Therefore a gauge of this role would be the extent to which weather variables can
improve the performance of a model that predicts and/or explains runs scored.
Supervised analysis is a branch of machine learning that applies algorithms to data that has been
labeled or otherwise categorized prior to application of the algorithm. The hope is that data
mining algorithms, in the course of learning the difference within our labeled data, will help us
find relationships that we would not find otherwise.
The supervised algorithms used in this thesis were chosen for specific reasons. Since we are
examining metrics generated during the game in an attempt to explain past performance, the
primary purpose of this thesis is one of explanation, not prediction. Artificial neural nets are a
supervised technique that can be highly predictive, but they’re also “relatively opaque to human
interpretation” (Larose, 2004), so they would not be the best technique to help us understand
our data. On the other hand, decision trees produce very understandable results even though
they might not be the most powerful techniques (Caruana, 2006). A secondary factor in
choosing a supervised data mining algorithm is the ability to compare to past work like Bennett
& Flueck, 1983, and Lindsey, 1963. For that reason linear models will also be used to analyze this
thesis’s data.
A pitfall of supervised techniques is overfitting. This happens when the model learns our data so
well that it generalizes poorly to new data that the model has never been exposed to before
(Mitchell, 1997). A young child who points at an airplane and says “bird” has overfit his
conceptual model of a bird. When the parent (playing the role of supervisor) informs the child
that he is not exactly correct, the parent has given a negative learning example that helps the
child generalize his model – not everything in the sky is a bird.
To guard against overfitting and test model performance, we will split the data into two groups –
one consisting of 85% of the data chosen at random and the other consisting of the remaining 15%.
The larger group will be considered the “training” set and the smaller will be the “test” set. We
will build all our models with the training set and measure their performance with the test set.
By measuring model performance with data that was not used to build the models, we can detect
overfitting and get a more accurate assessment of how well each model will perform on data to
which it has never been exposed.
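The 85/15 partition described above can be sketched as follows. This is an illustrative Python version, not the thesis's R code; the seed value, the function name, and the use of integer game indexes as stand-ins for the 3,231 game records are all assumptions.

```python
import random


def train_test_split(records, train_fraction=0.85, seed=42):
    """Shuffle a copy of the records and cut it into training and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)  # truncates to a whole record count
    return shuffled[:cut], shuffled[cut:]


game_ids = range(3231)  # stand-in for the 3,231 games in the dataset
train, test = train_test_split(game_ids)
print(len(train), len(test))
```

Every record lands in exactly one partition, so nothing used to fit a model is reused to score it.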
Finally, with the exception of the unique fields (e.g. Game ID), constant or almost constant fields
(TotIB_PAB, TotPB_PAB), the date fields, and TotPutouts_PAB (because it combines hits, errors,
and walks), the supervised analysis was run on the complete set of variables (Table 5 - Complete
Dataset) plus the variables created in the Unsupervised section: the six weather Principal
Components (PC_W_1 through PC_W_6) described in Table 18 - Weather Component Profiles, the
four baseball Principal Components (PC_B_1 through PC_B_4) described in Table 21 – Baseball
Component Profiles, the six complete Principal Components (PC_CMP_1 through PC_CMP_6)
described in Table 24 – Complete Set Component Profiles, and finally the cluster variable as
shown in Table 28 – Cluster Summary. The target variable for all supervised analysis will be
TotRun_PAB.
Decision Tree
Simply put, a decision tree represents a series of decisions. Initially, the algorithm considers all
fields and all values to determine the point at which the data could be split to create the two
groups that best differentiate the target we’re interested in. The algorithm is then applied
recursively to each resulting group (or branch) until some stopping criterion is reached – for
example a minimum number of data points in each final group (or leaf).
CART (Classification and Regression Tree) is a popular decision tree algorithm (Breiman, 1984).
This paper uses the regression part of CART since our target is a real number, as opposed to a
classification tree that tries to predict a class of some kind. R’s rpart library will be used for
creating CART decision trees as this library closely follows the Breiman algorithm for its
implementation.
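The core step of a regression tree, choosing the split point that best separates the target, can be sketched for a single predictor as below. This is a simplified Python illustration and not rpart's implementation: it scores each candidate threshold by the combined sum of squared errors (SSE) of the two resulting groups, and the toy data and function names are invented for the example.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y, min_leaf=1):
    """Find the threshold on one predictor that minimises the two groups' total SSE.

    Returns (threshold, total_sse); threshold is None if no valid split exists.
    """
    pairs = sorted(zip(x, y))
    best_threshold, best_score = None, float("inf")
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        score = sse(left) + sse(right)
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score


# Toy data (invented): low predictor values go with high targets.
x = [-1.2, -0.8, -0.5, 0.1, 0.4, 0.9]
y = [0.20, 0.19, 0.18, 0.12, 0.11, 0.10]
threshold, score = best_split(x, y)
print(threshold, score)
```

A full tree repeats this search over every predictor, keeps the best split, and then recurses into each branch until a stopping rule such as `min_leaf` is hit.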
Figure 35 shows the first decision tree, where the target is TotRun_PAB. We see the most
important splitter variable at the top of the tree, PC_B_1. Recall PC_B_1 was the “Well Pitched
Game” principal component. Low Left on Base (LOB) is its strongest feature at (-0.46), followed
by low Base on Balls, BB (-0.43), and Total Pitchers Used (-0.40). Interestingly, a large value for
LOB is often used by baseball announcers as an indicator of poor, or perhaps untimely, hitting.
But here in PC_B_1 it is grouped with other metrics that are indicative of good pitching and thus
LOB is probably low because fewer batters are reaching base in games where PC_B_1 is strong.
In the entire training set, the mean TotRun_PAB is 0.136, which is displayed in the top node. Also
in the top node we see the total number of records in the training set (n=2746; 2746/3231 = 85%).
Games with PC_B_1 less than -0.26 are less well pitched, and the TotRun_PAB rises to 0.183.
Since the mean number of Total at Bats (TotAB) in the dataset is 69.22, this change in
TotRun_PAB translates to about 3.25 runs per game (69.22*0.183 - 69.22*0.136).
Farther down, the decision tree algorithm selects PC_B_2 and TotLOB_PAB as important splitter
variables. PC_B_2 was characterized as the “Low number of Assists” principal component. The
strongest trait of PC_B_2 is TotAssists_PAB (-0.55). Also in Figure 35 we see the only other
variable used by the tree, TotLOB_PAB.
To summarize Decision Tree 1, games that are well pitched (as defined by low number of base
on balls and fewer pitchers used) have fewer runs scored. Unfortunately, this doesn’t tell us
much about the impact of weather on baseball, other than that the weather effect is not strong
enough, or important enough, to be chosen as a splitter variable by the decision tree
algorithm. It’s not even clear that it tells us much about baseball that isn’t already known – if a
pitcher stays in the game and hasn’t given up a large number of walks, and hasn’t left men on
base, then the number of runs scored in the game will be less. This result underscores common
baseball knowledge.

We now use Decision Tree 1 on the “test” partition of our dataset, the 15% of data that was
held back to prevent overfitting.
Figure 35 - Decision Tree 1
Table 31 shows the tree’s performance on the held back test part of the dataset. It includes the
mean as a “model” for comparison and it contains the following metrics:
Mean Absolute Error (MAE) = Mean(Abs(Actual - Predicted)) – the mean of the absolute (sign-free)
errors.
Mean Squared Error (MSE) = Mean[(Actual - Predicted)^2] – another metric that keeps positive and
negative errors from cancelling, taking the square instead of the absolute value. Unfortunately,
this metric is in squared units of TotRun_PAB, and is harder to interpret.
RMSE = Sqrt(MSE) – the square root of the above metric, which puts MSE back in the same units as
TotRun_PAB.
NMSE = Normalized MSE, where MSE is compared to a hypothetical model which uses the mean
as the prediction: NMSE = MSE / Mean[(Actual - mean(Actual))^2].
NMAE = Normalized MAE, where MAE is compared to a hypothetical model which uses the mean
as the prediction: NMAE = MAE / Mean[Abs(Actual - mean(Actual))].
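The five metrics can be computed together, as in this short Python sketch. It assumes the baseline mean is taken from the same values being scored, whereas the thesis reports the same mean (0.136) for both partitions; the sample numbers and the function name are invented for illustration.

```python
import math


def evaluate(actual, predicted):
    """Return MAE, MSE, RMSE, NMSE and NMAE for paired actual/predicted values."""
    n = len(actual)
    baseline = sum(actual) / n  # the "predict the mean" model
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    baseline_mae = sum(abs(a - baseline) for a in actual) / n
    baseline_mse = sum((a - baseline) ** 2 for a in actual) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse),
            "NMSE": mse / baseline_mse, "NMAE": mae / baseline_mae}


actual = [0.10, 0.12, 0.15, 0.20]      # invented runs-per-at-bat values
predicted = [0.11, 0.12, 0.14, 0.18]   # invented model output
metrics = evaluate(actual, predicted)
mean_model = evaluate(actual, [sum(actual) / len(actual)] * len(actual))
print(metrics)
print(mean_model)
```

Scoring the mean model against itself gives NMSE and NMAE of one by construction, and any model's MAE can never exceed its RMSE, which makes both useful sanity checks on a results table.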
Table 31 - Model Performance on Test Set
Model MAE MSE RMSE NMSE NMAE
Mean TotRun_PAB (0.136) 0.496 0.037 0.191 1.000 1.000
Decision Tree 1 0.026 0.001 0.033 0.318 0.561
Table 31 contains our performance metrics for two models. The first row, Mean TotRun_PAB,
indicates the performance of a hypothetical model which simply uses the mean of TotRun_PAB
as our prediction (0.136 in both the training and testing partitions). The MAE listed for this
simple model (0.496) is out of line with the rest of the table: MAE can never exceed RMSE (0.191
here), and the Decision Tree 1 values (MAE of 0.026, NMAE of 0.561) imply a baseline MAE of
roughly 0.046. Both NMSE and NMAE for this hypothetical model are equal to one, as they should
be, since both are comparisons to the hypothetical mean model. In the final row, we can see the
performance of the Decision Tree 1 model. Both NMSE of 0.318 and NMAE of 0.561 indicate that
Decision Tree 1 would be a substantial improvement over using the mean as a prediction.
Linear Models
As discussed in the Literature Review, the Expected Run Production (ERP) model (Bennett and
Flueck, 1983) purportedly has an adjusted R² of 95% and is repeated here in Table 32:
Table 32 - ERP (Bennett and Flueck)
Variable Coefficient Description
H1B 0.499 Singles
H2B 0.728 Doubles
H3B 1.265 Triples
HR 1.449 Home Runs
TBB 0.353 Total Base on Balls
HP 0.362 Hit By Pitch
SB 0.126 Stolen Base
SF 0.394 Sacrifice
GIDP (0.395) Ground into Double Play
OUT (0.085) Putouts
CONSTANT (-67.0)
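Because ERP is a linear model, applying it is a weighted sum of counting statistics plus the constant. The Python sketch below uses the coefficients from Table 32 and reads the parenthesised entries as negative values, which is an assumption about the table's notation. The size of the constant suggests the inputs are season totals, which is also an assumption, and the sample totals are invented.

```python
# Coefficients transcribed from Table 32; parenthesised entries are read as
# negative values (an assumption about the table's accounting-style notation).
ERP_COEFFICIENTS = {
    "H1B": 0.499, "H2B": 0.728, "H3B": 1.265, "HR": 1.449, "TBB": 0.353,
    "HP": 0.362, "SB": 0.126, "SF": 0.394, "GIDP": -0.395, "OUT": -0.085,
}
ERP_CONSTANT = -67.0


def expected_runs(stats):
    """Apply the linear ERP equation to a dict of counting statistics."""
    return ERP_CONSTANT + sum(
        coef * stats.get(name, 0) for name, coef in ERP_COEFFICIENTS.items())


# Hypothetical team-season totals, invented purely for illustration.
season = {"H1B": 1000, "H2B": 280, "H3B": 30, "HR": 150, "TBB": 550,
          "HP": 40, "SB": 100, "SF": 50, "GIDP": 130, "OUT": 4100}
print(round(expected_runs(season), 1))
```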

For the purpose of comparison, the ERP model was applied to this thesis’s dataset and the
resulting regression coefficients (using total runs as the response variable) are shown in Table
33:

Table 33 – ERP Coefficients on This Thesis’s Dataset
Variable Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.44 0.46 7.46 0
Tot1B 0.53 0.01 49.94 0
TotDbl 0.77 0.02 40.22 0
TotTrip 0.98 0.07 14.85 0
TotHR 1.41 0.03 53.06 0
TotBB 0.37 0.01 26.59 0
TotHBP 0.46 0.05 8.99 0
TotSB -0.03 0.03 -0.89 0.374
TotSac 0.29 0.04 7.70 0
TotGDP -0.25 0.03 -7.70 0
TotPutouts -0.17 0.01 -19.16 0
The linear model in Table 33 has an adjusted R² of only 76.7%, which is lower than the 95% cited
in Bennett and Flueck’s original paper. One difference may be that the ERP study considered
15,000 games from both the National and American Leagues. Boston is in the American League,
where the designated hitter bats instead of the pitcher and the games are generally higher scoring.
Another difference is that the Boston Red Sox have long been known for having neither
stealing nor sacrificing as a significant part of their playing style (this will be demonstrated later).
But probably the most significant difference is that this thesis examines data from across
multiple decades – the 70s, 80s, 90s, and 00s. The nature of baseball has changed significantly
across this span. For example, James (2003) points out that the 80s represent a significant shift in
the nature of the game. It was the first decade where players were finally paid well enough to
play baseball full time and train during the off season. And perhaps the added salary incentive
also increased the incentive to use performance enhancing drugs. As previously discussed, it can
be seen in Figure 20 - HR and SO Per At Bat by Year that there was a large spike in both HRs
and SOs starting in the mid-80s.
Still, the parameter estimates are fairly close, except for TotSB, which again may be because the
Red Sox traditionally do not use the stolen base as a significant part of their game play.
In Figure 36 we see the diagnostic plots for the ERP model. For a regression model to be valid,
certain statistical assumptions must be met. We want to see that the residuals (the differences
between the actual and fitted values for each data point) have constant variance over the
whole range of fitted values. In the Residuals vs. Fitted plot, there is a red line that
indicates the mean residual; with it close to zero for the entire set of values, there is no sign
of a systematic pattern in the residuals. There are some outliers (indicated by an observation
number in the plot) that pull the red mean residual line upward, but not substantially. The
Scale-Location plot allows us to check that the magnitude of the residuals is constant throughout
the entire range of values – another assumption for valid regression. Here the red line, which
tracks the typical residual magnitude across the range of fitted values, is fairly flat, but it does
slant upward overall. This indicates the ERP regression model is more accurate in the middle of
the range of values and predicts either too low or too high at the ends.
Another assumption of regression modeling is that the residuals are normally distributed. When
most of the dots are on the diagonal line of the Normal Q-Q plot, this gives us an indication that
the residuals are normally distributed (as is the case here). Finally, regression is impacted by
outliers – a small number of extreme values can cause the regression coefficients to be off. The
Residuals vs. Leverage plot assesses Cook’s distance, a measure of the overall influence of an
outlying point (Mendenhall, 2003). The plot does not indicate any high leverage values with Cook’s
distance greater than 1, the commonly accepted limit. Overall, the model performs a little more
poorly at the extremes, but the diagnostics are acceptable.
Figure 36 - ERP Diagnostics
The same ERP model was then created using the PAB versions of each field, after removing
both TotSB_PAB (as it was not significant here) and TotPutouts_PAB (because it is so highly
correlated with TotRun_PAB):

Table 34 - ERP Using PAB Fields Summary
Variable Estimate Std. Error t value Pr(>|t|)
TotHR_PAB 1.513 0.029 51.810 0
TotTrip_PAB 1.146 0.086 13.359 0
TotDbl_PAB 0.818 0.021 39.246 0
Tot1B_PAB 0.578 0.013 44.645 0
TotBB_PAB 0.342 0.014 25.098 0
TotHBP_PAB 0.525 0.062 8.516 0
TotSacFly_PAB 0.764 0.065 11.741 0
TotSacHt_PAB (0.293) 0.063 (4.670) 0
TotGDP_PAB (0.314) 0.034 (9.343) 0
The ERP model using PAB versions of each field has an R² of 69.14%, again lower than the
original ERP study, but comparable to the “non PAB” version of ERP. The diagnostic plots (not
shown) are similar. Finally, to ease interpretability and comparison, this model does not use the
transformed target to account for the slight skewing identified in the Exploratory Data Analysis.
Table 35 shows this ERP using the PAB fields added to the Model Performance table.
Table 35 - Model Performance
Model MAE MSE RMSE NMSE NMAE
Mean TotRun_PAB (0.136) 0.496 0.037 0.191 1.000 1.000
Decision Tree 1 0.026 0.001 0.033 0.318 0.561
ERP PAB Version 0.026 0.001 0.033 0.325 0.572
We see that Decision Tree 1 Model with a lower NMAE outperforms the ERP PAB model, but
only slightly. This is interesting since as stated earlier, the decision tree primarily uses pitching
performance while the ERP PAB model primarily uses hitting statistics.
The next step is to perform an exhaustive search similar to the one done by Bennett and Flueck,
but with the weather variables included. An exhaustive search is also sometimes called a
“regular subsets” search, and it consists of iterating through all possible combinations of
predictors, using some metric to assess each model’s performance. This thesis uses adjusted R²
to measure performance.
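The idea of an exhaustive search can be sketched in plain Python as below. This is a brute-force illustration on invented toy data, not the algorithm used by R's leaps package: it fits ordinary least squares for every subset of predictors and keeps the subset with the highest adjusted R² at each size. All names and values are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def adjusted_r2(columns, y):
    """Fit y on the predictor columns (plus intercept); return adjusted R^2."""
    n, p = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p + 1)]
           for j in range(p + 1)]
    xty = [sum(r[j] * y[i] for i, r in enumerate(rows)) for j in range(p + 1)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)


def exhaustive_search(data, target):
    """Return {subset size: (best adjusted R^2, predictor names)}."""
    names = [name for name in data if name != target]
    best = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            score = adjusted_r2([data[name] for name in subset], data[target])
            if size not in best or score > best[size][0]:
                best[size] = (score, subset)
    return best


# Toy data (invented): runs depend on hits and walks, not on the noise column.
data = {
    "hits":  [0.20, 0.25, 0.30, 0.22, 0.28, 0.35, 0.18, 0.27],
    "walks": [0.08, 0.05, 0.10, 0.12, 0.06, 0.09, 0.07, 0.11],
    "noise": [0.50, 0.10, 0.40, 0.30, 0.90, 0.20, 0.70, 0.60],
}
data["runs"] = [0.75 * h + 0.35 * w - 0.06
                for h, w in zip(data["hits"], data["walks"])]
best = exhaustive_search(data, "runs")
for size, (score, subset) in best.items():
    print(size, subset, round(score, 4))
```

The number of subsets doubles with every added predictor, which is why dedicated packages avoid fitting every combination naively.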
A note on the weaknesses of using an exhaustive search: Exhaustive search is only intended to
be a starting point in the model building process (Mendenhall, 2003). Models found by the
search should be evaluated with common sense. The technique can miss the best subsets
(Hocking, 1976), though there are probably several equally good “best” subsets (Gorman &
Toman, 1966).

Before running an exhaustive search we must address the issue of multicollinearity (correlation
between our predictor variables) within the dataset, especially since we have added principal
components and a cluster variable. Correlation was calculated between each pair of variables in
the dataset (not shown), but Table 36 shows the most strongly correlated variables (absolute
correlation greater than 0.60). The strongest correlations are between variables created in the
unsupervised process and this is to be expected. But for the purposes of building linear models,
these variables will be removed to ease interpretation which is, after all, the main purpose of
this thesis.
Several correlated variables are being left in the dataset for the exhaustive search. There is a
fairly strong correlation between Temp & Dewpt (0.66, p ≈ 0) and W_SW & W_HR_Max (0.78,
p ≈ 0), both of which we saw when exploring the weather variables in Table 6 – Correlation
Table. Another strongly correlated pair is TotH_PAB and TotRun_PAB (0.70, p ≈ 0). It should be
noted that TotH_PAB is the total number of hits (singles + doubles + triples + home runs), yet
the correlations between TotH_PAB and TotDbl_PAB (0.44, p ≈ 0), TotTrip_PAB (0.15, p ≈ 0), and
TotHR_PAB (0.26, p ≈ 0) are not overly strong. Although not included in the dataset, the
correlation between total singles PAB and TotRun_PAB (0.36, p ≈ 0) is not strong. So TotH_PAB is
not primarily made up of any one type of hit (singles for example). Ultimately, these correlated
variables will be left in the dataset and, in the event they are identified as being useful for the
prediction of runs scored, we will reassess the impact of having multicollinearity in the model.
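A screen for strongly correlated pairs like the one described above can be sketched in Python as follows. The 0.60 cutoff comes from the text; the sample columns are invented and only borrow the thesis's variable names for readability, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strongly_correlated(data, cutoff=0.60):
    """List (name_a, name_b, r) for every pair whose |r| exceeds the cutoff."""
    flagged = []
    for name_a, name_b in combinations(data, 2):
        r = pearson(data[name_a], data[name_b])
        if abs(r) > cutoff:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented sample values; only the column names echo the thesis dataset.
sample = {
    "Temp": [10, 15, 20, 25, 30],
    "Dewpt": [5, 9, 12, 18, 20],
    "W_SW": [3, 1, 4, 1, 5],
}
pairs = strongly_correlated(sample)
for name_a, name_b, r in pairs:
    print(name_a, name_b, round(r, 2))
```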

Table 36 - Strongly Correlated Variables
Columns (left to right): PC_B_1, PC_B_2, PC_B_3, TotAssists_PAB, TotBB_PAB, TotGDP_PAB, TotH_PAB, TotK_PAB, TotLOB_PAB, TotRun_PAB, Cluster1, Dewpt, PC_W_1, PC_W_2, PC_W_3, TDS, Temp, Visby_Dist, W_HR_Max, W_SE, W_SW
PC_B_1 - -0.01 0.00 0.30 -0.43 0.04 -0.74 0.13 -0.36 -0.83 -0.10 -0.06 0.11 -0.06 0.01 -0.04 -0.09 0.07 -0.11 0.03 -0.10
PC_B_2 -0.01 - -0.02 -0.05 -0.61 0.03 0.37 -0.32 -0.69 0.38 0.14 0.12 -0.17 0.07 0.06 0.05 0.18 -0.01 0.14 -0.01 0.13
PC_B_3 0.00 -0.02 - 0.78 0.16 0.67 0.18 -0.67 0.00 0.03 -0.02 -0.01 -0.01 -0.07 0.06 0.05 0.03 0.09 -0.03 0.05 -0.01
TotAssists_PAB 0.30 -0.05 0.78 - 0.05 0.46 -0.15 -0.35 -0.15 -0.21 -0.06 -0.04 0.05 -0.07 0.06 0.01 -0.04 0.08 -0.07 0.07 -0.05
TotBB_PAB -0.43 -0.61 0.16 0.05 - 0.12 -0.02 0.05 0.55 0.25 -0.06 -0.11 0.07 -0.07 -0.03 0.03 -0.09 0.02 -0.06 0.01 -0.05
TotGDP_PAB 0.04 0.03 0.67 0.46 0.12 - 0.10 -0.21 -0.09 -0.03 0.00 0.03 -0.01 0.00 0.03 0.01 0.04 0.01 0.00 0.02 0.00
TotH_PAB -0.74 0.37 0.18 -0.15 -0.02 0.10 - -0.31 0.19 0.70 0.14 0.12 -0.17 0.08 0.04 0.05 0.17 -0.03 0.14 -0.03 0.13
TotK_PAB 0.13 -0.32 -0.67 -0.35 0.05 -0.21 -0.31 - 0.02 -0.19 -0.02 -0.02 0.05 0.03 -0.03 -0.06 -0.07 -0.07 -0.02 -0.02 -0.03
TotLOB_PAB -0.36 -0.69 0.00 -0.15 0.55 -0.09 0.19 0.02 - -0.10 -0.05 -0.07 0.08 -0.02 -0.05 -0.03 -0.10 0.01 -0.05 -0.01 -0.04
TotRun_PAB -0.83 0.38 0.03 -0.21 0.25 -0.03 0.70 -0.19 -0.10 - 0.12 0.07 -0.14 0.05 0.02 0.07 0.13 -0.04 0.13 -0.03 0.11
Cluster1 -0.10 0.14 -0.02 -0.06 -0.06 0.00 0.14 -0.02 -0.05 0.12 - 0.25 -0.85 0.21 -0.10 0.24 0.48 -0.01 0.93 -0.43 0.84
Dewpt -0.06 0.12 -0.01 -0.04 -0.11 0.03 0.12 -0.02 -0.07 0.07 0.25 - -0.28 0.78 0.28 -0.49 0.66 -0.39 0.19 -0.02 0.21
PC_W_1 0.11 -0.17 -0.01 0.05 0.07 -0.01 -0.17 0.05 0.08 -0.14 -0.85 -0.28 - -0.01 0.01 -0.51 -0.74 -0.11 -0.84 0.41 -0.78
PC_W_2 -0.06 0.07 -0.07 -0.07 -0.07 0.00 0.08 0.03 -0.02 0.05 0.21 0.78 -0.01 - -0.01 -0.74 0.21 -0.68 0.15 -0.06 0.26
PC_W_3 0.01 0.06 0.06 0.06 -0.03 0.03 0.04 -0.03 -0.05 0.02 -0.10 0.28 0.01 -0.01 - -0.02 0.28 0.17 -0.12 0.74 -0.04
TDS -0.04 0.05 0.05 0.01 0.03 0.01 0.05 -0.06 -0.03 0.07 0.24 -0.49 -0.51 -0.74 -0.02 - 0.34 0.43 0.27 -0.18 0.18
Temp -0.09 0.18 0.03 -0.04 -0.09 0.04 0.17 -0.07 -0.10 0.13 0.48 0.66 -0.74 0.21 0.28 0.34 - -0.05 0.43 -0.18 0.39
Visby_Dist 0.07 -0.01 0.09 0.08 0.02 0.01 -0.03 -0.07 0.01 -0.04 -0.01 -0.39 -0.11 -0.68 0.17 0.43 -0.05 - 0.01 0.08 0.00
W_HR_Max -0.11 0.14 -0.03 -0.07 -0.06 0.00 0.14 -0.02 -0.05 0.13 0.93 0.19 -0.84 0.15 -0.12 0.27 0.43 0.01 - -0.39 0.78
W_SE 0.03 -0.01 0.05 0.07 0.01 0.02 -0.03 -0.02 -0.01 -0.03 -0.43 -0.02 0.41 -0.06 0.74 -0.18 -0.18 0.08 -0.39 - -0.37
W_SW -0.10 0.13 -0.01 -0.05 -0.05 0.00 0.13 -0.03 -0.04 0.11 0.84 0.21 -0.78 0.26 -0.04 0.18 0.39 0.00 0.78 -0.37 -

The R language’s leaps package was used for the exhaustive regular subsets search on the training
partition of the dataset.
The top 25 models are shown in Table 37. We can see that the best one-variable model uses TotH_PAB
and has an adjusted R² of 0.489. Recall TotH_PAB and TotRun_PAB were correlated at 0.70. This
makes sense statistically since, for a one-variable model, R² can be converted into the correlation
by taking the square root, √0.489 ≈ 0.70 (Kutner, 2004). The regression equation of this
one-variable model is:
TotRun_PAB = -0.068 + 0.754 * TotH_PAB
This means each hit per AB (regardless of the type of hit) accounts for about ¾ of a run.
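Both the square-root relationship and the one-variable equation can be checked with a few lines of Python. The coefficients and the adjusted R² come from the text above; the 0.27 hits-per-at-bat input is a hypothetical value chosen only to show the calculation.

```python
import math

ADJ_R2_ONE_VARIABLE = 0.489        # Table 37, TotH_PAB-only model
INTERCEPT, SLOPE = -0.068, 0.754   # one-variable regression equation

# For a single predictor, the square root of R^2 is the absolute correlation.
implied_correlation = math.sqrt(ADJ_R2_ONE_VARIABLE)


def predicted_runs_per_at_bat(hits_per_at_bat):
    """Evaluate TotRun_PAB = -0.068 + 0.754 * TotH_PAB."""
    return INTERCEPT + SLOPE * hits_per_at_bat


example = predicted_runs_per_at_bat(0.27)  # 0.27 is a hypothetical input
print(round(implied_correlation, 2), round(example, 3))
```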
We can also see in Table 37 that after about 7-9 variables little additional variance is explained,
as measured by adjusted R². This can be seen visually in Figure 37: the adjusted R² is quite flat
after 10 variables. It certainly wouldn’t make much sense to use more than 10 variables for such a
small amount of additional explained variance.
Figure 37 - Number of Variables and Adjusted R²

Table 37 - Regular Subsets for TotRun_PAB
Num Vars Model Adj R²
1 TotH_PAB 0.48942
2 TotH_PAB + TotHR_PAB 0.57342
3 TotBB_PAB + TotH_PAB + TotLOB_PAB 0.77580
4 TotBB_PAB + TotGDP_PAB + TotH_PAB + TotLOB_PAB 0.83309
5 TotBB_PAB + TotGDP_PAB + TotH_PAB + TotHBP_PAB + TotLOB_PAB 0.86488
6 TotBB_PAB + TotErrors_PAB + TotGDP_PAB + TotH_PAB + TotHBP_PAB + TotLOB_PAB 0.89487
7 TotBB_PAB + TotCS_PAB + TotErrors_PAB + TotGDP_PAB + TotH_PAB + TotHBP_PAB + TotLOB_PAB 0.92146
8 TotAssists_PAB + TotBB_PAB + TotCS_PAB + TotErrors_PAB + TotGDP_PAB + TotH_PAB + TotHBP_PAB + TotLOB_PAB 0.92595
9 TotAssists_PAB + TotBB_PAB + TotCS_PAB + TotErrors_PAB + TotGDP_PAB + TotH_PAB + TotHBP_PAB + TotHR_PAB + TotLOB_PAB 0.92859
10 TotAssists_PAB + TotBB_PAB + TotCS_PAB + TotDbl_PAB + TotErrors_PAB + TotGDP_PAB + TotH_PAB + TotHBP_PAB + TotHR_PAB +
TotLOB_PAB
0.92924
11 TotAssists_PAB + TotBB_PAB + TotCS_PAB + TotErrors_PAB + TotGDP_PAB + TotH_PAB + TotHBP_PAB + TotHR_PAB + TotK_PAB +
TotLOB_PAB + TotPitchersUsed_PAB
0.92992
TotK_PAB + TotLOB_PAB + TotPitchersUsed_PAB
0.93048
TotK_PAB + TotLOB_PAB + TotPitchersUsed_PAB + TotWP_PAB
0.93090
TotK_PAB + TotLOB_PAB + TotPitchersUsed_PAB + TotSacHt_PAB + TotWP_PAB
0.93112
TotK_PAB + TotLOB_PAB + TotPB_PAB + TotPitchersUsed_PAB + TotSacHt_PAB + TotWP_PAB
0.93130
TotK_PAB + TotLOB_PAB + TotPB_PAB + TotPitchersUsed_PAB + TotSacFly_PAB + TotSacHt_PAB + TotWP_PAB
0.93150
TotK_PAB + TotLOB_PAB + TotPB_PAB + TotPitchersUsed_PAB + TotSacFly_PAB + TotSacHt_PAB + TotTrip_PAB + TotWP_PAB
0.93165
TotK_PAB + TotLOB_PAB + TotPB_PAB + TotPitchersUsed_PAB + TotSacFly_PAB + TotSacHt_PAB + TotSB_PAB + TotTrip_PAB +
TotWP_PAB
0.93175
TotWP_PAB + Temp
0.93179
TotWP_PAB + W_NW + Dewpt
0.93182
TotWP_PAB + W_SE + W_NW + Dewpt
0.93184
TotWP_PAB + W_NW + Temp + Dewpt + TDS
0.93186
TotWP_PAB + W_SE + W_NW + Temp + Dewpt + TDS
0.93189
24 TotAssists_PAB + TotBalk_PAB + TotBB_PAB + TotCS_PAB + TotDbl_PAB + TotErrors_PAB + TotGDP_PAB + TotH_PAB + TotHBP_PAB +
TotHR_PAB + TotK_PAB + TotLOB_PAB + TotPB_PAB + TotPitchersUsed_PAB + TotSacFly_PAB + TotSacHt_PAB + TotSB_PAB +
TotTrip_PAB + TotWP_PAB + W_SE + W_NW + Temp + Dewpt + TDS
0.93190
25 TotAssists_PAB + TotBalk_PAB + TotBB_PAB + TotCS_PAB + TotDbl_PAB + TotErrors_PAB + TotGDP_PAB + TotH_PAB + TotHBP_PAB +
TotHR_PAB + TotK_PAB + TotLOB_PAB + TotPB_PAB + TotPitchersUsed_PAB + TotSacFly_PAB + TotSacHt_PAB + TotSB_PAB +
TotTrip_PAB + TotWP_PAB + W_SE + W_NW + Ceil_Hgt + Temp + Dewpt + TDS
0.93189

Assuming we don’t include extra variables beyond seven because they account for so little variance, the
best model out of the exhaustive search was the seven variable model as shown in Table 38:
Table 38 - Leaps 7 Linear Model
Call:
lm(formula = TotRun_PAB ~ TotBB_PAB + TotCS_PAB + TotErrors_PAB +
TotGDP_PAB + TotH_PAB + TotHBP_PAB + TotLOB_PAB, data = data.prediction.train)
Residuals:
Min 1Q Median 3Q Max
-0.081041 -0.009913 0.001121 0.011621 0.060278
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.006104 0.002010 -3.036 0.00242 **
TotBB_PAB 0.924893 0.009155 101.026 < 2e-16 ***
TotCS_PAB -1.120415 0.036774 -30.468 < 2e-16 ***
TotErrors_PAB 0.624288 0.018779 33.243 < 2e-16 ***
TotGDP_PAB -0.915649 0.018641 -49.119 < 2e-16 ***
TotH_PAB 0.932288 0.006031 154.577 < 2e-16 ***
TotHBP_PAB 1.199033 0.034720 34.535 < 2e-16 ***
TotLOB_PAB -0.917415 0.008801 -104.235 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01705 on 2738 degrees of freedom
Multiple R-squared: 0.9217, Adjusted R-squared: 0.9215
F-statistic: 4602 on 7 and 2738 DF, p-value: < 2.2e-16
The variables are:
• TotBB_PAB – total base on balls per at bat
• TotCS_PAB – total times caught stealing per at bat
• TotErrors_PAB – total errors per at bat
• TotGDP_PAB – total ground into double play per at bat
• TotH_PAB – total hits per at bat
• TotHBP_PAB – total hit by pitch per at bat
• TotLOB_PAB – total left on base per at bat
This model has an adjusted R² of 92.15%. Other than the selection of TotH_PAB over TotHR_PAB,
TotDbl_PAB, and TotTrip_PAB, this model is very similar to the ERP model. We can see that generally
any action that moves a base runner forward (TotBB_PAB, TotHBP_PAB) has a positive regression
coefficient and thus a positive effect on runs scored. Actions which remove runners from the game,
like getting caught stealing (TotCS_PAB) or grounding into a double play (TotGDP_PAB), drive the
runs scored downward. TotErrors_PAB has a positive effect on runs scored – this makes sense, as an
error is generally a failed putout attempt and is a missed opportunity by the defense to remove a
potential run from the game.
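The adjusted R² reported in Table 38 can be reproduced from the other figures in the same output. The Python sketch below applies the standard adjustment formula; the observation count of 2746 is inferred from the 2738 residual degrees of freedom plus seven predictors and an intercept, and the function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalise R^2 for the number of predictors used in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Multiple R-squared, training-set size, and predictor count from Table 38.
adj = adjusted_r_squared(0.9217, 2746, 7)
print(round(adj, 4))
```

With this many observations the penalty is tiny, which is why the adjusted and unadjusted values differ only in the fourth decimal place.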
The diagnostic plots for this model are shown in Figure 38.
Figure 38 - Leaps 7 Diagnostics
The diagnostic plots have a few issues. In the Normal Q-Q plot in Figure 38 we see that the Leaps 7
model has quite a few points off the diagonal in the lower left, indicating a departure from
normality, which could probably be corrected by using the transformed target as mentioned in the
Exploratory Data Analysis. We also see in the Residuals vs. Leverage plot that there are some points
which exceed the Cook’s distance threshold. Removing these high leverage points would change our
regression coefficients slightly, but for the purpose of this thesis we will disregard them.
We now compare this model to the others that have been built:
