CASE STUDY
Understanding the Key Drivers to Maximise Revenue Generated From
Handle – Methodology, Findings & Results
CONTENTS
1. Data Preparation
i. Evaluating composition of available data
ii. Combining given data sets
iii. Deriving variables for analysis
2. Data Exploration
i. Univariate Analysis
a. Categorical variables
b. Numeric variables
c. Synopsis of Findings
ii. Bivariate Analysis
a. Plots
b. Categorical variables
c. Numeric variables
d. Synopsis of Key Findings
iii. Multivariate Analysis
3. Testing Assumptions of OLS
4. Regression Model Building
i. Summary of the Iterations performed for building the Models
ii. A combined model for all three race tracks
iii. Model for Track ‘AP’ : Model with plots for Residuals & Model fit
iv. Model for Track ‘CRC’ : Model with plots for Residuals & Model fit
v. Model for Track ‘FG’ : Model with plots for Residuals & Model fit
1. DATA PREPARATION
(i) EVALUATING THE COMPOSITION OF AVAILABLE DATA
First, the composition of the available data was evaluated. The key points examined were:
• The type of each variable (character/numeric) in each data set was checked with PROC CONTENTS before merging.
• The extent of missing data was identified with PROC FREQ.
Table Name | Variable | # of missing data points | Total # of data points in Table | % of total missing values | Meaning of the variable
Race | conditions_of_race | 103101 | 185360 | 56% | Restrictions (or conditions) on the eligibility of horses to run. See decode table
Race | Race_Type | 1 | 185360 | 0% |
Race | sex_restriction | 116451 | 185360 | 63% | Restrictions on the gender of horses that can run. See decode table
Race | scheduled_surface | 182532 | 185360 | 98% | Planned surface for a race. In inclement weather, turf races are often moved to dirt
Race | track_condition | 7279 | 185360 | 4% | Describes the condition of the surface at race time. See decode table
Race | weather | 7151 | 185360 | 4% | Weather at race time. See decode table
Race | grade | 183201 | 185360 | 99% | Ranking of stakes races. 1 being the highest
Race | About_distance_indicator | 24392 | 2071 | 92% | Can be used with distance_id and distance_unit to indicate estimated length of races. Used for turf races. See decode table
Race | track_sealed_indicator | 47203 | 185360 | 25% | Y/N on whether dirt track is "sealed". Sealing is a process of smoothing and compacting the dirt surface to make it less penetrable to rain
Race_Distance_Conv | Race_Type | 1 | 185360 | 0% |
Race_Distance_Conv | sex_restriction | 116451 | 185360 | 63% | Restrictions on the gender of horses that can run. See decode table
Race_Distance_Conv | scheduled_surface | 182532 | 185360 | 98% | Planned surface for a race. In inclement weather, turf races are often moved to dirt
Race_Distance_Conv | track_condition | 7279 | 185360 | 4% | Describes the condition of the surface at race time. See decode table
Race_Distance_Conv | weather | 7151 | 185360 | 4% | Weather at race time. See decode table
Race_Distance_Conv | grade | 183201 | 185360 | 99% | Ranking of stakes races. 1 being the highest
Race_Distance_Conv | track_sealed_indicator | 47203 | 185360 | 25% | Y/N on whether dirt track is "sealed". Sealing is a process of smoothing and compacting the dirt surface to make it less penetrable to rain
Race_Distance_Conv | conditions_of_race | 103101 | 185360 | 56% | Restrictions (or conditions) on the eligibility of horses to run. See decode table
Track | Track_Id | 2 | 811 | 0% | Track abbreviation
Track | Track_Type | 12 | 811 | 1% | Should all be T for thoroughbred
Track | State | 28 | 811 | 3% | State of operation
Track_Statistic | Location_Type | 7 | 188565 | 0% | Should all be T for track. See decode
Track_Statistic | Location | 22483 | 3980 | 83% | Should all be ON for on (vs. off)
Track_Zone | DST_YN | 1 | 167 | 1% |
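The percentages in the table above can be reproduced with a few lines of code. The sketch below is in Python rather than SAS, purely for illustration. The helper names `pct_missing` and `missing_summary` and the toy rows are hypothetical, and treating blank strings as missing is an assumption about the raw files.

```python
def pct_missing(n_missing, n_total):
    """Share of missing values as a whole-number percentage."""
    return round(100 * n_missing / n_total) if n_total else 0


def missing_summary(rows, columns):
    """Return {column: (n_missing, n_total, pct_missing)} for a list of dict rows."""
    n_total = len(rows)
    summary = {}
    for col in columns:
        # Assumption: None and blank strings both count as missing.
        n_missing = sum(
            1 for row in rows
            if row.get(col) is None or str(row.get(col)).strip() == ""
        )
        summary[col] = (n_missing, n_total, pct_missing(n_missing, n_total))
    return summary


# Toy rows for illustration only (not taken from the case-study files).
toy_races = [
    {"weather": "C", "grade": None},
    {"weather": "", "grade": None},
    {"weather": "L", "grade": "1"},
    {"weather": "C", "grade": None},
]
summary = missing_summary(toy_races, ["weather", "grade"])
```

With the counts reported for conditions_of_race, `pct_missing(103101, 185360)` gives 56, which matches the table.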
1. DATA PREPARATION
(ii) COMBINING GIVEN DATA SETS
The following presents a synopsis of the manner in which the given data sets were merged/combined to arrive at a
consolidated data set for use in final analysis of the variable ‘Handle’:
ORIGINAL DATA FILES
File # | File Name | # of Observations | # of Variables
1 | Exotic_Payoff | 722873 | 15
2 | Race | 185360 | 60
3 | Race_Distance_Conv | 185360 | 61
4 | Track | 811 | 12
5 | Track_Statistic | 188565 | 11
6 | Track_Zone | 167 | 8
MERGED DATA FILES
File # | File Name | # of Observations | # of Variables | Files Combined | Primary Key
1 | Race_Combined | 185360 | 61 | Race + Race_Distance_Conv | Track_id, Race_Date, Race_Number
2 | Exotic_Race_Combi | 724278 | 68 | Exotic_Payoff + Race_Combined | Track_id, Race_Date, Race_Number
3 | TrackStat_Zone1 | 188569 | 17 | Track_Zone1 + Track_Statistic | Track_id, Country
4 | Track_Final | 189214 | 22 | Track1 + TrackStat_Zone1 | Track_id, Country, Track_Name, State
5 | CDI_0 | 875026 | 83 | Track_Final + Exotic_Race_Combi | Track_id, Race_Date, Country
• Data set ‘Track_Zone’ was modified by renaming the variable ‘Area_ID’ to ‘Country’, giving data set ‘Track_Zone1’. Similarly, data set ‘Track’ was modified by renaming ‘Area_ID’ to ‘Country’, giving data set ‘Track1’.
• In some merges, the two data sets being combined do not share all values of ‘Track_id’. Because the unmatched tracks are retained, the observation count increases after merging.
• Variables with more than 50% missing values were dropped, as no meaningful analysis was possible without adequate data.
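The growth in observation counts after merging can be illustrated with a small Python sketch. It assumes the merge keeps unmatched rows from both inputs (outer-join behaviour). The function `full_outer_join` and the toy track ids T1 to T3 are hypothetical, and this is not the SAS code used in the study.

```python
def full_outer_join(left, right, keys):
    """Merge two lists of dict rows on `keys`, keeping unmatched rows from both sides."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    # Index the right-hand rows by their key values.
    right_by_key = {}
    for rrow in right:
        right_by_key.setdefault(key_of(rrow), []).append(rrow)

    merged = []
    matched_keys = set()
    for lrow in left:
        k = key_of(lrow)
        if k in right_by_key:
            matched_keys.add(k)
            for rrow in right_by_key[k]:
                merged.append({**lrow, **rrow})
        else:
            merged.append(dict(lrow))  # left-only row is kept

    # Right-only rows are kept as well, which is what inflates the count.
    for k, rrows in right_by_key.items():
        if k not in matched_keys:
            merged.extend(dict(rrow) for rrow in rrows)
    return merged


# Toy inputs: T2 exists only on the left, T3 only on the right.
left_rows = [{"Track_id": "T1", "Track_Name": "One"}, {"Track_id": "T2", "Track_Name": "Two"}]
right_rows = [{"Track_id": "T1", "Attendance": 100}, {"Track_id": "T3", "Attendance": 50}]
merged_rows = full_outer_join(left_rows, right_rows, ["Track_id"])
```

Here two inputs of two rows each produce three merged rows, because each side contributes one track the other lacks.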
1. DATA PREPARATION
(iii) DERIVING VARIABLES FOR ANALYSIS
From the final data set obtained after merging the given data files, the following variables were derived:
# | Original Variable | Derived Variable | Description of the Derived Variable
1 | Race_Dt | Date_Race | The date on which the race took place.
2 | Race_Dt | Yr | The year in which the race took place.
3 | Race_Dt | Weekday | The day of the week on which the race took place.
4 | Weekday | Day_of_Week | Weekday recoded as a character variable.
5 | Weekday | Weekend_Indi | Categorical indicator of whether the race day is a weekend.
6 | WPS_pool, Total_pool | Handle_Combi | The Handle generated from a race.
7 | Race_Dt | Mon | The month of the year in which the race took place.
8 | Post_time | Race_time | ‘Post_time’ converted from character to time format, to derive other time-related variables for analysis.
9 | Race_time | HOD | The hour of the day in which the race took place.
10 | Race_Dt | HOL_XXX | Holiday indicators, one for each holiday in the list of holidays for the years 2005 & 2006.
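The date-based derivations in the table can be sketched in Python as follows. The sketch assumes the raw value is a SAS datetime (seconds since 1 January 1960) and that a weekend means Saturday or Sunday; neither detail is spelled out in the slides. The function name `derive_date_fields` is hypothetical.

```python
from datetime import datetime, timedelta

SAS_EPOCH = datetime(1960, 1, 1)  # SAS dates and datetimes count from this point
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def derive_date_fields(sas_datetime_seconds):
    """Derive Date_Race, Yr, Mon, Day_of_Week and Weekend_Indi from a SAS datetime."""
    dt = SAS_EPOCH + timedelta(seconds=sas_datetime_seconds)
    return {
        "Date_Race": dt.date(),
        "Yr": dt.year,
        "Mon": dt.month,
        "Day_of_Week": DAY_NAMES[dt.weekday()],
        # Assumption: weekend = Saturday or Sunday.
        "Weekend_Indi": "Y" if dt.weekday() >= 5 else "N",
    }


# Build a SAS-style datetime for an arbitrary example date, then derive the fields.
example_seconds = (datetime(2005, 7, 4, 13, 30) - SAS_EPOCH).total_seconds()
fields = derive_date_fields(example_seconds)
```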
Notes:
• To derive variables from ‘Race_Date’ (stored in a date-time format), it was first converted from a number format to a SAS date format.
• The data set obtained from merging the given data files was further subset to include:
  - only the track ids required for the analysis: CRC, AP & FG
  - only the relevant years: 2005 & 2006
• The variable ‘Handle_Combi’ was identified as the dependent variable.
• Continuous variables were binned after evaluating their distributions with PROC UNIVARIATE. The binned variables were:
  - Purse_usa
  - Minimum_claim_price
  - Maximum_claim_price
  - Attendance
  - Handle_combi
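One way to bin a continuous variable from its distribution is to cut it at the quartiles. The Python sketch below shows that idea only; the slides do not give the actual cut points, so quartile-based bins, the function names and the toy values are all assumptions.

```python
import statistics
from bisect import bisect_right


def quartile_cuts(values):
    """Return the three quartile cut points (Q1, median, Q3) of the values."""
    return statistics.quantiles(values, n=4)


def assign_bin(value, cuts):
    """Map a value to bin 1..4 according to the quartile cut points."""
    return bisect_right(cuts, value) + 1


# Toy values for illustration only.
toy_purse = [1, 2, 3, 4, 5, 6, 7, 8]
cuts = quartile_cuts(toy_purse)
bins = [assign_bin(v, cuts) for v in toy_purse]
```

Each value lands in one of four ordered bins, which turns the numeric variable into an ordinal one suitable for frequency tables and association tests.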
2. DATA EXPLORATION
(i) Univariate Analysis
PROC CONTENTS was used to identify the character & numeric variables in the data set that emerged after merging the given data files. The analysis shown in the following slides is for Track Ids AP, CRC & FG and for the years 2005 & 2006.
i. Categorical variables: PROC FREQ was used to evaluate the distribution of categorical variables in the data set.
ii. Numeric variables: PROC MEANS was used to understand the characteristics of each numeric variable using:
a. Measures of Central Tendency and
b. Dispersion
iii. Numeric variables: PROC UNIVARIATE was used to evaluate the:
a. Skewness & Kurtosis
b. Distribution of the variable
c. Inter Quartile Range (IQR): this was used in binning the variable for subsequent multivariate analysis
iv. Categorical & numeric variables were analysed both at an overall level and by track id, in line with the business requirements.
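The two kinds of summary described above can be sketched in Python: a frequency table with percentages for a categorical variable, and central tendency plus dispersion for a numeric one. This is an illustrative analogue of the SAS output, not the original code; `freq_table`, `describe` and the toy inputs are hypothetical.

```python
import statistics
from collections import Counter


def freq_table(values):
    """Return {level: (frequency, percent)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        level: (n, round(100 * n / total, 2))
        for level, n in sorted(counts.items())
    }


def describe(values):
    """Return basic measures of central tendency and dispersion."""
    return {
        "n": len(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
    }


# Toy inputs for illustration only.
wager_freq = freq_table(["E", "T", "E", "S"])
runner_stats = describe([2, 4, 6])
```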
2. DATA EXPLORATION: Univariate Analysis
(a) Categorical Variables: Wager_Type
Frequency distribution on an overall basis
wager_type Frequency Percent
3 4729 17.87
4 1090 4.12
5 5 0.02
6 104 0.39
9 41 0.15
D 1454 5.49
E 6352 24.01
M 184 0.7
Q 612 2.31
S 5633 21.29
T 6190 23.39
Z 67 0.25
Frequency distribution Track_id wise
wager_type Frequency Percent
3 1359 19.33
4 379 5.39
D 573 8.15
E 1745 24.82
S 1304 18.54
T 1647 23.42
Z 25 0.36
wager_type Frequency Percent
3 2679 17.48
4 540 3.52
5 5 0.03
6 2 0.01
9 41 0.27
D 709 4.63
E 3751 24.48
M 184 1.2
S 3693 24.1
T 3686 24.06
Z 32 0.21
wager_type Frequency Percent
3 691 16.82
4 171 4.16
6 102 2.48
D 172 4.19
E 856 20.84
Q 612 14.9
S 636 15.49
T 857 20.87
Z 10 0.24
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Univariate Analysis
(a) Categorical Variables: Race_Type
Frequency distribution on an overall basis
Race_type Frequency Percent
ALW 2738 10.35
AOC 2855 10.79
CAN 325 1.23
CLM 8683 32.81
DBY 4 0.02
HCP 22 0.08
MCL 6094 23.03
MSW 3406 12.87
OCS 14 0.05
SHP 17 0.06
SIM 5 0.02
SST 14 0.05
STK 1768 6.68
STR 516 1.95
Frequency distribution Track_id wise
Race_type Frequency Percent
ALW 1248 17.75
AOC 621 8.83
CLM 2687 38.21
HCP 22 0.31
MCL 767 10.91
MSW 1079 15.34
SHP 17 0.24
STK 449 6.39
STR 142 2.02
Race_type Frequency Percent
ALW 983 6.42
AOC 1857 12.12
CAN 282 1.84
CLM 4551 29.7
MCL 4577 29.87
MSW 1694 11.06
OCS 14 0.09
SIM 5 0.03
SST 14 0.09
STK 1066 6.96
STR 279 1.82
Race_type Frequency Percent
ALW 507 12.34
AOC 377 9.18
CAN 43 1.05
CLM 1445 35.18
DBY 4 0.1
MCL 750 18.26
MSW 633 15.41
STK 253 6.16
STR 95 2.31
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Univariate Analysis
(a) Categorical Variables: Age_restriction
Frequency distribution on an overall basis
Age_restriction Frequency Percent
2 5175 19.56
3 2606 9.85
4 12 0.05
2U 6 0.02
34 3905 14.76
35 402 1.52
3U 12510 47.28
45 227 0.86
4U 1612 6.09
5U 6 0.02
Frequency distribution Track_id wise
Age_restriction Frequency Percent
2 493 7.01
3 681 9.68
34 1393 19.81
3U 4465 63.5
Age_restriction Frequency Percent
2 4409 28.78
3 1110 7.24
4 12 0.08
2U 6 0.04
34 2512 16.39
3U 7169 46.79
4U 98 0.64
5U 6 0.04
Age_restriction Frequency Percent
2 273 6.65
3 815 19.84
35 402 9.79
3U 876 21.33
45 227 5.53
4U 1514 36.86
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Univariate Analysis
(a) Categorical Variables: Track_condition
Frequency distribution on an overall basis
Track_condition Frequency Percent
FM 3707 14.18
FT 17139 65.56
GD 1928 7.38
MY 120 0.46
SF 129 0.49
SY 2571 9.84
WF 122 0.47
YL 425 1.63
Frequency distribution Track_id wise
Track_condition Frequency Percent
FM 1216 17.29
FT 4264 60.64
GD 433 6.16
MY 120 1.71
SF 129 1.83
SY 466 6.63
WF 62 0.88
YL 342 4.86
Track_condition Frequency Percent
FM 1606 10.68
FT 10042 66.78
GD 1470 9.78
SY 1812 12.05
WF 38 0.25
YL 70 0.47
Track_condition Frequency Percent
FM 885 21.74
FT 2833 69.59
GD 25 0.61
SY 293 7.2
WF 22 0.54
YL 13 0.32
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Univariate Analysis
(a) Categorical Variables: Weather
Frequency distribution on an overall basis
Weather Frequency Percent
C 14479 55.39
F 175 0.67
H 622 2.38
L 8316 31.81
O 2237 8.56
R 312 1.19
Frequency distribution Track_id wise
Weather Frequency Percent
C 4360 62
H 622 8.85
L 1703 24.22
O 117 1.66
R 230 3.27
Weather Frequency Percent
C 8586 57.1
F 38 0.25
L 4453 29.61
O 1931 12.84
R 30 0.2
Weather Frequency Percent
C 1533 37.66
F 137 3.37
L 2160 53.06
O 189 4.64
R 52 1.28
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Univariate Analysis
(a) Categorical Variables: Others
Sex_restriction
Sex_restriction Frequency Percent
B 5779 51.13
F 5523 48.87
Stakes_indicator
Stakes_indicator Frequency Percent
N 24661 93.2
Y 1800 6.8
Surface
Surface Frequency Percent
D 20936 79.12
T 5525 20.88
2. DATA EXPLORATION: Univariate Analysis
(b) Numeric Variables
track_id N Obs Variable Minimum Mean Median Maximum Std.Dev
AP 7034 purse_usa 9500 28548.9 25000 1000000 52990.9
AP 7034 minimum_claim_price 0 13264.9 10000 100000 17398.2
AP 7034 maximum_claim_price 0 14559.1 10000 100000 18215.5
AP 7034 number_of_runners 3 8 8 14 2
AP 7034 Handle_Combi 52542 238538 215058 3139455 146357
CRC 15322 purse_usa 7000 23742.1 18000 2000000 48642.1
CRC 15322 minimum_claim_price 0 14076.3 12500 62500 12314.2
CRC 15322 maximum_claim_price 0 14434.1 12500 62500 12255.9
CRC 15322 number_of_runners 0 8 7 13 2
CRC 15322 Handle_Combi 0 138670 124265 1186000 74123.6
FG 4107 purse_usa 8000 28401.7 20500 600000 39505
FG 4107 minimum_claim_price 0 12009.3 9000 80000 15375.1
FG 4107 maximum_claim_price 0 13804.9 10000 80000 15947.5
FG 4107 number_of_runners 0 8 8 13 2
FG 4107 Handle_Combi 0 199949 183672 1647365 99617.3
Note:
The variable ‘Attendance’, which originally appears only in data set ‘Track_Statistic’, records the combined attendance at a track for the whole day, whereas the other files hold data for multiple races at a track on any day. The merged data file therefore does not show correct numbers for ‘Attendance’, and the variable has not been included in the analysis.
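The problem described in the note can be seen in a small Python sketch: once a day-level figure is merged onto race-level rows, it is repeated for every race, so a plain sum overstates it. The field names and numbers below are hypothetical toy values, not case-study data.

```python
def naive_attendance_total(rows):
    """Sum attendance over race-level rows (double counts day-level values)."""
    return sum(row["attendance"] for row in rows)


def daily_attendance_total(rows):
    """Sum attendance once per (track_id, race_date) pair."""
    per_day = {}
    for row in rows:
        per_day[(row["track_id"], row["race_date"])] = row["attendance"]
    return sum(per_day.values())


# Toy merged rows: one race day with three races and a single daily attendance figure.
merged_rows = [
    {"track_id": "T1", "race_date": "2005-07-04", "race_number": 1, "attendance": 5000},
    {"track_id": "T1", "race_date": "2005-07-04", "race_number": 2, "attendance": 5000},
    {"track_id": "T1", "race_date": "2005-07-04", "race_number": 3, "attendance": 5000},
]
naive_total = naive_attendance_total(merged_rows)
daily_total = daily_attendance_total(merged_rows)
```

The naive total is three times the true daily figure in this toy case, which is why a race-level analysis cannot use the merged attendance values directly.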
2. DATA EXPLORATION: Univariate Analysis
(c) Synopsis of Key Findings
The table below summarises the univariate analysis performed in the preceding slides for both categorical and numeric variables:
Variable | Overall | Remarks
Wager Type | Exacta & Trifecta | Exacta & Trifecta were the most common wager types across all three tracks.
Race Type | Claiming | Track CRC also had Maiden Claiming among its most common race types, besides Claiming.
Age Restriction | 3 yo’s & up | 4 yo’s & up was also most common on track FG, besides 3 yo’s & up.
Track Condition | Fast | A Fast track condition was most common across all three race tracks.
Weather | Clear | Track FG was most often Cloudy.
Purse_usa | | Tracks AP & FG had an average purse of approx. USD 28500, while track CRC’s was approx. USD 24000. Track AP had the highest median value for purse_usa.
Minimum_claim_price | | It was roughly the same for all three race tracks, approx. USD 12000 to 14000.
Maximum_claim_price | | It was roughly the same for all three race tracks. Also, there was not much difference between the min & max claim price on any of the three race tracks.
Number_of_runners | | The average number of runners was 8 for all three race tracks.
Handle_Combi | | The average Handle & median Handle were highest for track AP.
2. DATA EXPLORATION
(ii) Bivariate Analysis
(a) Plots: For both categorical and numeric variables, plots were used to graphically assess the data, identify any group patterns and detect extreme values & outliers, if any.
Each categorical & numeric variable was plotted on the X-axis against the dependent variable, ‘Handle_combi’, on the Y-axis.
(b) Categorical Variables: The Chi-Square test was used to evaluate the strength of association between the dependent variable and each of the categorical variables, both the existing ones and those created by binning continuous numeric variables. For this purpose, the dependent variable ‘Handle_Combi’ was converted from a numeric variable to an ordinal variable. Refer to the tab ‘Proc Univariate_Binning’ in the worksheet of the link below for workings.
Measures of Central Tendency & Dispersion were also used.
Workings
(c) Numeric Variables: Correlation analysis was performed between each of the numeric variables and the dependent variable, ‘Handle_Combi’.
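The two association measures named above can be sketched in plain Python. The first function computes the Pearson chi-square statistic and its degrees of freedom for a categorical variable against binned Handle; the second computes a correlation coefficient, assuming Pearson correlation since the slides do not name the type. Function names and toy data are hypothetical, and a p-value would need a chi-square distribution function that is outside this sketch.

```python
from collections import Counter
from math import sqrt


def chi_square_statistic(pairs):
    """pairs: list of (category, handle_bin). Returns (statistic, degrees_of_freedom)."""
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(category for category, _ in pairs)
    col_totals = Counter(handle_bin for _, handle_bin in pairs)
    statistic = 0.0
    for category, row_n in row_totals.items():
        for handle_bin, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected under independence
            statistic += (observed[(category, handle_bin)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy data: surface perfectly predicts the Handle bin, so association is maximal.
toy_pairs = [("D", "low")] * 10 + [("T", "high")] * 10
stat, dof = chi_square_statistic(toy_pairs)
r = pearson_r([1, 2, 3], [2, 4, 6])
```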
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Categorical variables
[Plots: Handle_Combi & Age_Restriction; Handle_Combi & Grade]
• The values highlighted in the plot for Handle_Combi & Grade are those for which there is no grade.
• The count of such missing values is 26116.
• Since this count is around 99% of the total data, and in the absence of a confirmation from the business, this variable is dropped for the purpose of analysis & model building.
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Categorical variables
[Plots: Handle_Combi & Race_Type; Handle_Combi & Track_condition]
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Categorical variables
[Plots: Handle_Combi & Wager_Type; Handle_Combi & Weather]
The following observations in the plots above appear to be outliers:
• Wager_Type = 4, Handle_Combi = 3139455, Count = 1
• Wager_Type = E, Handle_Combi = 3187911 & 3206094, Count = 1 each
• Wager_Type = T, Handle_Combi = 2946736 & 3026707, Count = 1 each
• Wager_Type = 5, Handle_Combi = 1074715, Count = 1
• Wager_Type = 3, Handle_Combi = 2062128 & 2079182, Count = 1 each
• Wager_Type = S, Handle_Combi = 2341293 & 2423058, Count = 1 each
• Weather = Blank, Handle_Combi = 0, Count = 330
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Numeric variables
[Plots: Handle_Combi & Attendance; Handle_Combi & Distance_id]
• It is unclear how there can be any Handle when Attendance = 0.
• Attendance = Blank, Handle_Combi = 0, Count = 299
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Numeric variables
[Plots: Handle_Combi & Fraction_1; Handle_Combi & Fraction_2]
• Fraction is the split time and distance of a race. It is unclear whether Handle can be > 0 when Fraction = 0.
• Fraction_1 = 5534, Count = 4
• Fraction_2 = 15140, Count = 4
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Numeric variables
[Plots: Handle_Combi & Fraction_3; Handle_Combi & Fraction_4]
• Fraction_3 = 21840, Count = 4
• It is unclear whether Handle should be > 0 when Fraction = 0.
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Numeric variables
[Plots: Handle_Combi & Fraction_5; Handle_Combi & HOD]
• It is unclear whether Handle should be > 0 when Fraction = 0.
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Numeric variables
[Plots: Handle_Combi & Maximum Claim Price; Handle_Combi & Minimum Claim Price]
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Numeric variables
[Plots: Handle_Combi & Month; Handle_Combi & No. of Runners]
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Numeric variables
[Plots: Handle_Combi & Number of Tickets bet; Handle_Combi & Payoff Amount]
• Number_of_tickets_bet = 100, Handle_Combi = 3139455, Count = 1
• Number_of_tickets_bet = 300, Count = 6
• Payoff_amount = 449240, Handle_Combi = 74352, Count = 1
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Numeric variables
[Plots: Handle_Combi & Purse_usa; Handle_Combi & Race_Number]
• Purse_usa = 2000000, Count = 5
• Race_number = 66, Track_id = CRC, Count = 5
2. DATA EXPLORATION: Bivariate Analysis
(a) Plots: Numeric variables
[Plot: Handle_Combi & Weekday]
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical variables
track_id N Obs N nmiss Minimum Mean Median Maximum Sum Std Dev
AP 7034 7032 2 52542 238538 215058 3139455 1677401898 146357
CRC 15322 15322 0 0 138670 124265 1186000 2124701137 74124
FG 4107 4107 0 0 199949 183672 1647365 821191578 99617
Track_Id wise Handle for the years 2005 & 2006
Year N Obs N nmiss Minimum Mean Maximum Sum Std Dev
2005 14355 14353 2 0 178654 3139455 2564218644 116427
2006 12108 12108 0 0 170059 3060903 2059075969 104284
Year wise Handle
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical variables
Holiday Nobs N Mean Std Dev Min Max
No Holiday 24797 24795 174490 112375 0 3139455
HOL_BxD 95 95 244158 118135 88287 585465
HOL_GF 58 58 209818 75809 85759 411644
HOL_NY 150 150 241862 119630 72810 645937
HOL_TGV 201 201 117320 54094 29170 283323
HOL_Vet 88 88 212197 111846 68649 585766
HOL_Lab 184 184 185184 85400 65943 445343
HOL_ID 183 183 155035 82665 46025 457547
HOL_Mem 352 352 170100 80894 49733 438551
HOL_CDM 113 113 179644 59513 87072 362010
HOL_East 94 94 177696 57118 67108 307895
HOL_SPD 103 103 154384 51668 55188 279983
HOL_SB 45 45 170669 61101 62080 343682
Handle on different Holidays for the year 2005 & 2006
• Boxing Day & New Year's Day have the highest average Handle
• However, the dispersion on these days is also on the higher side
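A grouped summary like the holiday table can be sketched in Python as below. It groups rows by a class variable and reports count, mean, standard deviation, minimum and maximum per group. The function `summarize_by_group` and the toy rows are hypothetical; this only illustrates the kind of output, not the SAS step that produced it.

```python
import statistics
from collections import defaultdict


def summarize_by_group(rows, group_key, value_key):
    """Return {group: {n, mean, std_dev, min, max}} for value_key within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])

    summary = {}
    for group, values in groups.items():
        summary[group] = {
            "n": len(values),
            "mean": statistics.mean(values),
            # Sample standard deviation needs at least two observations.
            "std_dev": statistics.stdev(values) if len(values) > 1 else None,
            "min": min(values),
            "max": max(values),
        }
    return summary


# Toy rows for illustration only.
toy_rows = [
    {"holiday": "No Holiday", "handle": 100},
    {"holiday": "No Holiday", "handle": 200},
    {"holiday": "HOL_NY", "handle": 300},
    {"holiday": "HOL_NY", "handle": 500},
]
holiday_summary = summarize_by_group(toy_rows, "holiday", "handle")
```

Reading the mean and standard deviation side by side, as the bullets above do, shows whether a high average comes with high variability.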
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical variables
Handle & Race #: Overall basis
race_number nmiss Minimum Mean Maximum Sum Std Dev
1 0 0 119529 376662 206425851 53432
2 0 0 144899 525388 364420470 60688
3 0 0 152468 586142 372631958 69666
4 0 0 164420 747300 442126299 79359
5 0 0 177810 902720 498046927 87828
6 0 0 192440 1647365 531134486 105368
7 0 0 191724 941320 486979194 100842
8 0 0 197928 2090764 477798356 132956
9 0 0 204963 3139455 614683133 180861
10 0 0 185809 1045357 364929770 106234
11 0 0 173064 1353026 144508699 147978
12 0 0 179634 1186000 82990909 130050
13 0 0 120881 320650 34571841 50269
14 0 66689 102336 171037 2046720 28514
66 0 0 0 0 0 0
Handle & Race #: Track_Id wise
race_number Minimum Mean Maximum Std Dev
1 61220 164992 376662 51027
2 73235 183461 525388 58135
3 81479 201061 586142 72153
4 61246 220511 747300 85305
5 60299 229374 902720 98073
6 63431 245164 818390 96160
7 62133 262249 941320 97489
8 52542 269484 2090764 187592
9 68110 282277 3139455 255602
10 126975 322285 1045357 142620
11 729217 1014273 1353026 233335
12 440455 527820 624104 66091
race_number Minimum Mean Maximum Std Dev
1 0 95986 256454 39501
2 0 120956 316368 49793
3 0 127744 410791 56311
4 0 137095 454350 62528
5 0 141896 432622 66342
6 0 146594 477671 65293
7 0 146183 689454 75138
8 0 150831 578579 72838
9 0 141709 645937 74310
10 0 150792 773904 91221
11 0 161757 1074715 121928
12 0 170349 1186000 118032
13 0 120881 320650 50269
14 66689 102336 171037 28514
66 0 0 0 0
race_number Minimum Mean Maximum Std Dev
1 0 136802 346263 49652
2 0 161746 484711 59473
3 0 159603 399003 63393
4 0 176754 613785 74055
5 0 189721 568623 71512
6 0 236041 1647365 155171
7 0 228021 861703 99563
8 0 230092 737818 103343
9 0 243992 1015652 133043
10 0 204050 500897 74997
11 177969 252145 390084 56180
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical variables
Handle & Race Type: Overall basis
race_type nmiss Minimum Mean Maximum Sum Std Dev
ALW 0 31129 202051 818390 553215834 94282
AOC 0 40955 174011 568623 496802159 75421
CAN 0 0 0 0 0 0
CLM 0 28771 160986 675791 1397839701 73082
DBY 0 57846 75204 89428 300817 15417
HCP 0 161830 269683 490685 5933015 87602
MCL 0 24623 141795 583365 864097538 62007
MSW 0 27740 197642 747300 673168518 89045
OCS 0 90100 177323 334321 2482525 70185
SHP 0 185893 286774 421977 4875159 73033
SIM 0 0 0 0 0 0
SST 0 66215 176655 252036 2473163 54601
STK 0 43966 302815 3139455 535376495 275741
STR 0 29170 168081 419676 86729689 76316
Handle & Race Type: Track_Id wise
race_type N Minimum Mean Maximum
ALW 1248 62133 247491 818390
AOC 621 52542 231366 500463
CLM 2687 60299 211068 675791
HCP 22 161830 269683 490685
MCL 767 81532 198251 470437
MSW 1079 61220 243456 747300
SHP 17 185893 286774 421977
STK 449 91487 442856 3139455
STR 142 108189 234625 406691
race_type Minimum Mean Maximum Std Dev
ALW 31129 141594 689454 65323
AOC 40955 147556 454350 58556
CAN 0 0 0 0
CLM 28771 126675 479510 52577
MCL 24623 126323 583365 51485
MSW 27740 156988 578579 70481
OCS 90100 177323 334321 70185
SIM 0 0 0 0
SST 66215 176655 252036 54601
STK 43966 233655 1186000 144080
STR 29170 132104 419676 60860
race_type Minimum Mean Maximum Std Dev
ALW 60184 207417 499212 73915
AOC 70325 209846 568623 74347
CAN 0 0 0 0
CLM 55188 175920 468001 65413
DBY 57846 75204 89428 15417
MCL 55353 178479 434572 61185
MSW 65918 228344 613785 78170
STK 78226 345681 1647365 233437
STR 75712 174274 334480 61308
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical variables
Handle & Age_Restriction: Overall basis
Age_restriction nmiss Minimum Mean Maximum Sum Std Dev
2 0 0 153635 747300 795063415 80370
3 0 0 210265 1647365 547951285 142556
4 0 96480 166530 298930 1998364 59611
2U 0 0 63590 157102 381538 72528
34 0 0 160670 583365 627417387 80087
35 0 0 183691 434572 73843611 69143
3U 0 0 176734 3139455 2210939606 123652
45 0 65918 177127 355546 40207927 54918
4U 0 56314 201147 1007029 324248936 92251
5U 0 157566 207091 248986 1242544 35932
Handle & Age_Restriction: Track_Id wise
Age_restriction Minimum Mean Maximum Std Dev
2 61220 242223 747300 103225
3 80139 262704 1353026 158859
34 72518 220667 524670 81221
3U 52542 240021 3139455 162695
Age_restriction Minimum Mean Maximum Std Dev
2 0 140838 584870 69891
3 0 161285 1186000 112616
4 96480 166530 298930 59611
2U 0 63590 157102 72528
34 0 127400 583365 56663
3U 0 136606 1074715 73016
4U 72810 221839 454350 75001
5U 157566 207091 248986 35932
Age_restriction Minimum Mean Maximum Std Dev
2 0 200347 459996 73247
3 57846 233157 1647365 143454
35 0 183691 434572 69143
3U 0 182550 509929 78766
45 65918 177127 355546 54918
4U 56314 199808 1007029 93120
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Bivariate Analysis
Handle & Distance_id: Overall basis
Distance_id nmiss Minimum Mean Maximum Sum Std Dev
200 0 0 94080 326326 4233597 61825
350 0 104981 142810 175986 428430 35730
400 0 57846 129871 188257 1688319 44850
440 0 79568 109127 135500 436507 28628
450 0 0 116453 258603 47629242 48152
500 0 0 155807 569295 289488594 79451
550 0 0 147242 667579 433038125 73547
600 0 0 179191 1186000 1004544564 91576
650 0 42656 158924 504548 272077630 68769
700 0 0 146596 991286 322950464 83803
750 0 54576 205633 773904 69092549 103604
800 0 0 183797 909653 920821709 93494
818 0 55188 180558 450004 40444935 68968
832 0 39771 126262 394530 67045136 54619
850 0 0 186831 1647365 745453980 111383
900 0 0 218692 1007029 294577826 106595
950 0 90100 583927 2090764 33867777 508071
1000 0 0 724202 3139455 46348930 920541
1100 0 0 81521 174799 978254 64746
1200 0 91487 299421 805235 27247318 140306
1600 0 168211 225182 290300 900727 55358
Handle & Distance_id : Track_Id wise
Distance_id Minimum Mean Maximum Std Dev
400 123621 154167 188257 27745
450 61220 134536 229418 39500
500 81984 237488 477310 81093
550 74563 210006 498141 91052
600 61246 220387 1045357 89701
650 63431 204931 504548 72213
700 52542 218187 991286 116989
750 118256 202789 301862 58614
800 70663 235191 909653 95855
850 97339 282728 699943 89145
900 62133 223963 568607 84253
950 185893 645239 2090764 512202
1000 230036 1180695 3139455 1014228
1200 91487 299504 584818 106742
Distance_id Minimum Mean Maximum Std Dev
200 0 94080 326326 61825
450 0 114970 258603 48539
500 0 138812 569295 67703
550 0 127500 447481 58367
600 0 141577 1186000 88648
650 42656 137669 409607 55482
700 0 130858 583365 64729
750 54576 187841 773904 109816
800 0 129737 517699 60098
832 39771 126262 394530 54619
850 0 151523 689454 80153
900 0 190424 706858 123442
950 90100 137227 191116 35961
1000 0 137283 334321 93446
1100 0 81521 174799 64746
1200 117416 299268 805235 189300
1600 168211 225182 290300 55358
Distance_id Minimum Mean Maximum Std Dev
350 104981 142810 175986 35730
400 57846 75204 89428 15417
440 79568 109127 135500 28628
550 0 194172 667579 81347
600 0 186492 613785 74779
750 102894 238813 653609 88864
800 60184 197700 568623 79025
818 55188 180558 450004 68968
850 0 228115 1647365 155189
900 86624 293787 1007029 162084
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical Variables
Handle & Track_Condition: Overall basis
Track_condition nmiss Minimum Mean Maximum Sum Std Dev
FM 0 47127 229414 3060903 850436130 140592
FT 0 0 165842 1647365 2842366310 88371
GD 0 27740 180584 773904 348166275 99651
MY 0 72518 180118 416176 21614199 65542
SF 0 97339 313760 910328 40475089 129318
SY 0 0 141092 557473 362746364 64893
WF 0 65195 180629 403022 22036696 71323
YL 0 78342 318714 3139455 135453550 343003
Handle & Track_Condition : Track_Id wise
Track_condition Minimum Mean Maximum Std Dev
FM 104925 292445 3060903 190157
FT 52542 218140 1045357 92865
GD 83694 262882 750821 105619
MY 72518 180118 416176 65542
SF 97339 313760 910328 129318
SY 61246 181344 438860 68084
WF 98683 214068 403022 70024
YL 123543 344863 3139455 374599
Track_condition Minimum Mean Maximum Std Dev
FM 47127 179514 805235 90965
FT 0 135460 1186000 66663
GD 27740 155587 773904 83682
SY 33291 126181 557473 57553
WF 65195 128146 263524 47777
YL 78342 198372 505685 101064
Track_condition Minimum Mean Maximum Std Dev
FM 81009 233360 667579 90214
FT 55188 194819 1647365 101440
GD 135246 225033 413163 73366
SY 0 169282 384473 66152
WF 91081 177043 284815 52412
YL 179507 278785 384008 70898
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical Variables
Handle & Weather: Overall basis
Weather nmiss Minimum Mean Maximum Sum Std Dev
C 0 0 182642 3060903 2644474127 111013
F 0 60135 177446 388463 31053098 69996
H 0 83538 245123 750821 152466346 97967
L 0 27740 170675 3139455 1419335293 114994
O 0 31129 143224 1186000 320391583 78758
R 0 0 178122 432213 55574166 69767
Handle & Weather : Track_Id wise
Weather Minimum Mean Maximum Std Dev
C 60299 243543 3060903 132301
H 83538 245123 750821 97967
L 52542 233100 3139455 196410
O 74563 202576 457547 77792
R 61246 184426 432213 65310
Weather Minimum Mean Maximum Std Dev
C 0 146379 1074715 78120
F 60135 145458 279416 47698
L 27740 134054 457672 56383
O 31129 135004 1186000 76598
R 85277 157631 342570 64507
Weather Minimum Mean Maximum Std Dev
C 56314 212536 1647365 116815
F 65918 186319 388463 72693
L 55188 196956 1015652 86203
O 60184 190461 446041 68389
R 0 162061 334254 86450
Track_id wise tables above, in order: AP, CRC, FG.
2. DATA EXPLORATION: Bivariate Analysis
Handle & Number_of_Runners: Overall basis
No._of_Runners nmiss Minimum Mean Maximum Sum Std Dev
0 0 0 0 0 0 0
3 0 68110 87790 136111 614531 24010
4 0 29170 114078 339675 14373783 52845
5 0 31129 131162 574244 136277078 71137
6 0 27740 141212 1186000 616248774 76070
7 0 39557 156329 902720 1019111132 72781
8 0 24623 172195 1353026 935708945 84230
9 0 38246 203835 2090764 648603161 121901
10 0 51652 216924 3139455 619101417 187917
11 0 72784 234635 1015652 327081336 111020
12 0 49268 246030 818390 291791853 107368
13 0 175033 516909 1074715 11371992 261140
14 0 587653 752653 910328 3010611 146399
Handle & Number_of_Runners : Track_Id wise
No._of_Runners Minimum Mean Maximum Std Dev
3 68110 88723 136111 26163
4 60299 126896 256008 48252
5 52542 181985 574244 74382
6 69928 190938 1030559 79928
7 61220 209634 902720 80247
8 81479 235889 1353026 103748
9 96150 266363 2090764 135946
10 109264 328673 3139455 331440
11 121347 280257 750821 91916
12 106441 288447 818390 96510
13 353658 473836 577231 101247
14 587653 752653 910328 146399
No._of_Runners Minimum Mean Maximum Std Dev
0 0 0 0 0
3 82191 82191 82191 .
4 29170 85022 196532 37081
5 31129 97017 327829 44356
6 27740 115761 1186000 60986
7 39557 132015 769498 57135
8 24623 141845 619376 59079
9 38246 160553 605132 77775
10 51652 168770 689454 75795
11 72784 188094 773904 100674
12 49268 207189 805235 115008
13 175033 534763 1074715 333074
No._of_Runners Minimum Mean Maximum Std Dev
0 0 0 0 0
4 75939 142690 339675 62476
5 62080 153114 526757 73351
6 55188 169506 737818 75955
7 55353 171881 394925 65169
8 72351 190798 462356 71482
9 88655 229691 1647365 147919
10 57846 215446 468001 71121
11 101330 251000 1015652 119559
12 85195 234524 653609 86906
13 396762 504946 667579 120182
Track: AP
Track: FG
Track: CRC
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical Variables
Handle & Location_Type: Overall basis
Location_Type nmiss Minimum Mean Maximum Sum Std Dev
F 1 86735 175072 376662 47269555 49556
I 0 0 116255 256454 40340341 39131
L 0 69928 180899 318480 6512372 64031
O 0 96663 209871 386702 7765224 84081
S 0 0 76598 175461 26579435 25944
T 1 0 178739 3139455 4477232533 111255
Handle & Location_Type : Track_Id wise
Track: AP
Location_Type Minimum Mean Maximum Std Dev
F 93507 179912 376662 51240
L 69928 180899 318480 64031
O 96663 209871 386702 84081
T 52542 240715 3139455 148876
Track: CRC
Location_Type Minimum Mean Maximum Std Dev
I 0 116255 256454 39131
S 0 76598 175461 25944
T 0 143180 1186000 73051
Track: FG
Location_Type Minimum Mean Maximum Std Dev
F 86735 164357 346263 44025
T 55188 202861 1647365 98654
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical Variables
Handle & Day_of_Week: Overall basis
Day_of_Week nmiss Minimum Mean Maximum Sum Std Dev
Fri 0 0 171751 503267 897744907 80706
Mon 0 0 141411 585465 473726296 67001
Sat 0 0 223245 3139455 1383896707 173655
Sunday 0 0 163067 1186000 898986819 82768
Thurs 0 0 160395 574244 548228915 71613
Tues 2 27740 110526 332921 151088435 45915
Wed 0 54471 194393 456229 269622534 68865
Handle & Day_of_Week : Track_Id wise
Track: AP
Day_of_Week Minimum Mean Maximum Std Dev
Fri 80995 240492 503267 78187
Mon 61246 225548 457547 81572
Sat 68110 317064 3139455 266268
Sunday 83580 227642 563999 80174
Thurs 52542 194249 574244 68250
Tues 63431 164264 332921 69310
Wed 70663 202620 456229 69792
Track: CRC
Day_of_Week Minimum Mean Maximum Std Dev
Fri 0 132774 454808 57846
Mon 0 127376 585465 59292
Sat 0 176748 1074715 99373
Sunday 0 125038 1186000 63192
Thurs 0 124788 340647 56543
Tues 27740 105446 297950 40389
Wed 54471 160513 327829 52889
Track: FG
Day_of_Week Minimum Mean Maximum Std Dev
Fri 57472 191383 425895 67939
Mon 60184 166166 410833 55430
Sat 55353 261509 1647365 151040
Sunday 62080 197506 452179 69753
Thurs 0 166491 374051 71729
Tues 76610 161920 313327 54843
2. DATA EXPLORATION: Bivariate Analysis
Handle & Month: Overall basis
Month nmiss Minimum Mean Maximum Sum Std Dev
1 0 57786 208338 667579 284797365 85543
2 0 60830 200426 1015652 201829172 99535
3 0 55188 202569 1647365 188996819 133950
4 0 58512 168971 518254 53732836 77771
5 0 0 157576 689454 485017662 70931
6 0 33291 156905 578960 517159956 79389
7 2 0 183157 1186000 644530914 107033
8 0 0 217787 3139455 672525570 203657
9 0 42249 168886 584818 478622274 88225
10 0 0 112135 550582 213841634 62991
11 0 24623 144993 585766 283315911 64563
12 0 0 189533 805235 598924500 88769
Handle & Month : Track_Id wise
Track: AP
Month Minimum Mean Maximum Std Dev
5 52542 193084 623554 69567
6 60299 206980 578960 80135
7 69928 238548 910328 94481
8 70663 295475 3139455 250128
9 61246 249515 584818 92486
Track: CRC
Month Minimum Mean Maximum Std Dev
1 72810 249833 645937 108879
4 58512 168971 518254 77771
5 0 134856 689454 61910
6 33291 112235 365592 44334
7 0 132613 1186000 91722
8 0 130480 470593 58221
9 42249 127321 344041 48056
10 0 112135 550582 62991
11 24623 141236 585766 63116
12 42540 186032 805235 89726
Track: FG
Month Minimum Mean Maximum Std Dev
1 57786 201351 667579 78893
2 60830 200426 1015652 99535
3 55188 202569 1647365 133950
11 55353 177397 462356 67908
12 0 199967 509929 85060
2. DATA EXPLORATION: Bivariate Analysis
Handle & HOD (Hour of the day): Overall basis
HOD nmiss Minimum Mean Maximum Sum Std Dev
1 0 24623 153512 902720 718435054 69015
2 0 31129 170912 818390 820378441 80133
3 0 29170 189699 2090764 1025700855 114302
4 0 40367 195446 3139455 1060881991 147331
5 0 33291 202763 1353026 534077362 122728
6 0 46025 174845 1186000 201596057 113198
7 0 156994 249484 379489 31933999 55992
11 0 39400 150897 376662 2565252 107688
12 0 28771 120498 586142 225933517 56593
Handle & HOD (Hour of the day): Track_Id wise
Track: AP
HOD Minimum Mean Maximum Std Dev
1 61220 187499 902720 78039
2 61246 209849 818390 89115
3 60299 235107 2090764 139178
4 52542 268527 3139455 224173
5 68110 278611 1353026 122267
6 80995 276369 1030559 104720
7 156994 249484 379489 55992
11 176749 275783 376662 72334
12 115850 377879 586142 125569
Track: CRC
HOD Minimum Mean Maximum Std Dev
1 24623 132098 454350 57664
2 31129 144769 522378 64563
3 29170 152494 689454 73566
4 40367 150608 805235 80956
5 33291 149124 1074715 90316
6 46025 124325 1186000 78465
11 39400 82778 162251 38426
12 28771 109425 316368 46450
Track: FG
HOD Minimum Mean Maximum Std Dev
1 59350 169483 484711 62488
2 71195 205370 667579 79772
3 60184 235762 1647365 131045
4 83017 220443 1015652 98004
5 101330 247910 1007029 129986
6 202109 257019 328307 48277
12 55188 144130 346263 48998
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical Variables: Chi-Square Test of Association
The Chi-Square test of association was used to evaluate whether the association between each independent variable and Handle (the dependent variable) is statistically significant. Only those independent variables whose association with Handle is significant at the 5% level are carried forward into the OLS regression model. For the purpose of this test, continuous variables were binned into categories on the basis of their distributions, as found using PROC UNIVARIATE:
Variable P-Value Statistical association with Handle
Wager_Type <0.0001 Significant
Race_Type <0.0001 Significant
Age_Restriction <0.0001 Significant
Sex_Restriction <0.0001 Significant
Stakes_Indicator <0.0001 Significant
Surface <0.0001 Significant
Track_Condition <0.0001 Significant
Weather <0.0001 Significant
Grade <0.0001 Significant
Track_Sealed_Indicator <0.0001 Significant
Maximum_Claim_Price <0.0001 Significant
2. DATA EXPLORATION: Bivariate Analysis
(b) Categorical Variables: Chi-Square Test of Association (contd.)
Variable P-Value Statistical association with Handle
Minimum_Claim_Price <0.0001 Significant
Purse <0.0001 Significant
Day of the Week <0.0001 Significant
Attendance <0.0001 Significant
All of the variables listed above were thus found to have a statistically significant association with Handle, the dependent variable.
(Please refer to the hyperlinked file for detailed workings.)
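The test described above can be sketched as follows. This is a minimal illustration in plain Python, not the SAS code used in the case study: the cut point for binning Handle, the function names and the sample observations are all assumptions made for the example, and the p-value uses a simple series expansion that is adequate only for moderate values of the statistic.

```python
import math
from collections import Counter

def bin_label(value, upper_bounds):
    """Return the index of the first bin whose upper bound is >= value."""
    for index, bound in enumerate(upper_bounds):
        if value <= bound:
            return index
    return len(upper_bounds)

def chi_square_sf(statistic, dof):
    """Upper-tail probability of a chi-square variable (series expansion)."""
    if statistic <= 0 or dof <= 0:
        return 1.0
    a, t = dof / 2.0, statistic / 2.0
    term = total = 1.0 / a
    n = 1
    while term > total * 1e-16:
        term *= t / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(t) - t - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def chi_square_test(pairs):
    """pairs: iterable of (row_category, column_category) observations."""
    observed = Counter(pairs)
    row_totals, col_totals = Counter(), Counter()
    for (row, col), count in observed.items():
        row_totals[row] += count
        col_totals[col] += count
    n = sum(observed.values())
    statistic = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            statistic += (observed[(row, col)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, chi_square_sf(statistic, dof)

# Hypothetical observations: (day of week, Handle in dollars).
handle_upper_bounds = [150000]  # assumed cut point: bin 0 = low, bin 1 = high
races = ([("Sat", 250000)] * 20 + [("Sat", 90000)] * 10
         + [("Tues", 250000)] * 10 + [("Tues", 90000)] * 20)
pairs = [(day, bin_label(handle, handle_upper_bounds)) for day, handle in races]
statistic, dof, p_value = chi_square_test(pairs)
significant = p_value < 0.05  # 5% level, as used in the tables above
```

A variable would be carried into the regression only when `significant` is true, mirroring the rule stated for the tables above.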
2. DATA EXPLORATION: Bivariate Analysis
(c) Numeric Variables : Correlation Analysis
Please click on the file below for the Correlation Matrix. The cells highlighted in red indicate potential multicollinearity, i.e. a high positive or negative correlation between two variables. Multicollinearity is checked again later while building the regression model.
CORRELATION MATRIX
For each variable, the row marked r gives the correlation coefficients and the row marked p gives the corresponding p-values. N = 26461 for every pair (26463 is reported for race_date with itself).
Columns, in order: Handle_Combi, race_date, race_number, number_of_tickets_bet, total_pool, payoff_amount, distance_id, purse_usa, wps_pool, fraction_1, fraction_2, fraction_3, fraction_4, fraction_5, winning_time, minimum_claim_price, maximum_claim_price, number_of_runners, DistanceID_Conv_to_Fur
Handle_Combi
r: 1 -0.0734 0.16599 0.04529 0.70363 0.00818 0.14679 0.34625 0.882 0.06214 -0.0395 -0.002 0.16188 0.0563 0.106 -0.1632 -0.1435 0.38228 0.14679
p: 1 <.0001 <.0001 <.0001 <.0001 0.1835 <.0001 <.0001 <.0001 <.0001 <.0001 0.7437 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001
race_date
r: -0.0734 1 0.00202 0.0218 0.02629 0.00809 -0.0062 0.01589 -0.1112 -0.0333 0.0028 0.01504 0.02396 -0.0006 0.01026 0.02654 0.0097 -0.0722 -0.0062
p: <.0001 1 0.7425 0.0004 <.0001 0.1882 0.317 0.0097 <.0001 <.0001 0.6482 0.0144 <.0001 0.9266 0.095 <.0001 0.1145 <.0001 0.3166
race_number
r: 0.16599 0.00202 1 -0.0798 0.00187 0.12701 0.04731 0.17814 0.22663 0.00081 0.01002 0.05535 0.00879 0.01247 0.04337 -0.0264 -0.0285 0.22343 0.04731
p: <.0001 0.7425 1 <.0001 0.7612 <.0001 <.0001 <.0001 <.0001 0.8957 0.103 <.0001 0.1529 0.0425 <.0001 <.0001 <.0001 <.0001 <.0001
number_of_tickets_bet
r: 0.04529 0.0218 -0.0798 1 0.3378 -0.023 -0.0444 -0.0558 -0.1102 0.02118 0.10424 0.06271 -0.0573 -0.0282 0.02492 0.02918 0.02917 0.02648 -0.0444
p: <.0001 0.0004 <.0001 1 <.0001 0.0002 <.0001 <.0001 <.0001 0.0006 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001
total_pool
r: 0.70363 0.02629 0.00187 0.3378 1 -0.1316 0.05351 0.10608 0.32596 0.03397 -0.0017 0.01738 0.05977 0.02161 0.05725 -0.0633 -0.0594 0.15722 0.0535
p: <.0001 <.0001 0.7612 <.0001 1 <.0001 <.0001 <.0001 <.0001 <.0001 0.781 0.0047 <.0001 0.0004 <.0001 <.0001 <.0001 <.0001 <.0001
payoff_amount
r: 0.00818 0.00809 0.12701 -0.023 -0.1316 1 0.03044 0.04208 0.15456 0.01499 0.04696 0.04868 0.02319 -0.0027 0.05796 -0.0028 -0.0053 0.33667 0.03044
p: 0.1835 0.1882 <.0001 0.0002 <.0001 1 <.0001 <.0001 <.0001 0.0147 <.0001 <.0001 0.0002 0.6656 <.0001 0.6548 0.3863 <.0001 <.0001
distance_id
r: 0.14679 -0.0062 0.04731 -0.0444 0.05351 0.03044 1 0.17335 0.165 0.71452 0.59167 0.80977 0.78441 0.10692 0.96169 -0.0654 -0.0569 0.06724 1
p: <.0001 0.317 <.0001 <.0001 <.0001 <.0001 1 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001
purse_usa
r: 0.34625 0.01589 0.17814 -0.0558 0.10608 0.04208 0.17335 1 0.39972 0.00694 -0.0746 -0.0566 0.17255 0.07282 0.09688 -0.2313 -0.2228 0.05945 0.17335
p: <.0001 0.0097 <.0001 <.0001 <.0001 <.0001 <.0001 1 <.0001 0.2591 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001
wps_pool
r: 0.882 -0.1112 0.22663 -0.1102 0.32596 0.15456 0.165 0.39972 1 0.07052 -0.0439 -0.0071 0.18321 0.06319 0.11513 -0.1792 -0.1547 0.42674 0.165
p: <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 1 <.0001 <.0001 0.2473 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001
fraction_1
r: 0.06214 -0.0333 0.00081 0.02118 0.03397 0.01499 0.71452 0.00694 0.07052 1 0.87029 0.78847 0.59345 0.09429 0.75985 -0.0107 -0.0058 -0.0012 0.71452
p: <.0001 <.0001 0.8957 0.0006 <.0001 0.0147 <.0001 0.2591 <.0001 1 <.0001 <.0001 <.0001 <.0001 <.0001 0.0813 0.3428 0.8504 <.0001
fraction_2
r: -0.0395 0.0028 0.01002 0.10424 -0.0017 0.04696 0.59167 -0.0746 -0.0439 0.87029 1 0.75279 0.54696 0.09175 0.67156 0.03351 0.03405 0.06642 0.59167
p: <.0001 0.6482 0.103 <.0001 0.781 <.0001 <.0001 <.0001 <.0001 <.0001 1 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001
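The screening of highly correlated pairs can be sketched as follows. This is an illustrative example only: the 0.8 threshold is an assumed cut-off (the case study does not state the value used for the highlighting), the function names are hypothetical, and the sample columns are made-up numbers that merely reuse variable names from the matrix.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def highly_correlated_pairs(columns, threshold=0.8, exclude=()):
    """List predictor pairs whose |r| is at or above the (assumed) threshold."""
    names = [name for name in columns if name not in exclude]
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson_r(columns[first], columns[second])
            if abs(r) >= threshold:
                flagged.append((first, second, round(r, 4)))
    return flagged

# Hypothetical values, for illustration only.
columns = {
    "distance_id": [600, 800, 1000, 1200],
    "winning_time": [70.1, 95.0, 120.2, 146.0],
    "purse_usa": [20000, 15000, 30000, 18000],
}
flagged = highly_correlated_pairs(columns)
```

Passing the dependent variable in `exclude` keeps the check to pairs of predictors, which is where multicollinearity matters.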
2. DATA EXPLORATION: Bivariate Analysis
(d) Synopsis of Key Findings
The table below gives a synopsis of the track-wise bivariate analysis of Handle against each of the categorical variables. For each track ID, it lists the category of each variable that gives the highest average Handle:
Variable Track AP Track CRC Track FG
Race Number 11 12 11
Race Type STK (Stakes) STK (Stakes) STK (Stakes)
Age Restriction 3 4U 3
Distance ID 1000 1200 900
Track Condition YL (Yielding) YL (Yielding) YL (Yielding)
Weather H (Hazy) R (Rainy) C (Clear)
No. of Runners 14 13 13
Location Type T (Track) T (Track) T (Track)
Day of Week Saturday Saturday Saturday
Month August January March
Hour of the Day 12 3 6
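Each entry in the synopsis is simply the category with the highest mean Handle for a track. A minimal sketch of that lookup is shown below, using the mean Handle by Weather code from the track-wise Weather tables and assuming those three tables appear in the order AP, CRC, FG; the function name is hypothetical.

```python
def best_category_by_track(mean_handle):
    """Map each track to the category with the highest mean Handle."""
    return {
        track: max(categories, key=categories.get)
        for track, categories in mean_handle.items()
    }

# Mean Handle by Weather code, copied from the track-wise Weather tables
# (the assignment of tables to tracks is an assumption, see above).
mean_handle_by_weather = {
    "AP": {"C": 243543, "H": 245123, "L": 233100, "O": 202576, "R": 184426},
    "CRC": {"C": 146379, "F": 145458, "L": 134054, "O": 135004, "R": 157631},
    "FG": {"C": 212536, "F": 186319, "L": 196956, "O": 190461, "R": 162061},
}
best_weather = best_category_by_track(mean_handle_by_weather)
```

Under that assumption, the result agrees with the Weather row of the synopsis.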
2. DATA EXPLORATION: Multivariate Analysis
For the dependent variable, ‘Handle_Combi’, a track-wise (Track_Id) multivariate analysis was conducted against each of the numeric variables.
1. Race Number
[Charts, one per track (AP, CRC, FG): Average Handle by Race #. X-axis: Race # (1 to 15). Series: Average Handle (Dollar Amount) and Number of Races (Frequency).]
• For all 3 race tracks, average Handle is highest for race # 11.
• The number of races falls as the race # increases for all 3 race tracks.
• In fact, very few races have been run as race # 11.
• Thus, the higher race numbers, which have fewer races, have been generating the highest average Handle.
• It would therefore be relevant to identify why average Handle is highest for race # 11 in spite of the lower number of races.
2. DATA EXPLORATION: Multivariate Analysis
2. Number of Runners
[Charts, one per track (AP, CRC, FG): Average Handle by No. of Runners. X-axis: No. of Runners (1 to 13). Series: Average Handle (Dollar Amount) and No. of Races (Frequency).]
• For all 3 race tracks, the largest number of races have been run with around 6-7 runners.
• Beyond 6-7 runners in a race, the number of races shows a falling trend at each of the 3 race tracks.
• However, average Handle is seen to increase with the number of runners at all 3 race tracks.
• Thus, although fewer races have been run with more than 6-7 runners, those races have generated higher Handle.
• Increasing the number of races with more than 6-7 runners could therefore have a positive impact on Handle.
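The comparison made in these charts, race count against average Handle per category, can be sketched as follows. The field names, the function names and the rule used for flagging (a below-average race count combined with above-average Handle) are assumptions made for this illustration, and the rows are made-up numbers.

```python
from collections import defaultdict

def count_and_mean(rows, key, value="handle"):
    """Return {category: (number_of_races, average_handle)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
        counts[row[key]] += 1
    return {cat: (counts[cat], sums[cat] / counts[cat]) for cat in sorted(counts)}

def few_races_high_handle(stats):
    """Categories with a below-average race count but above-average Handle."""
    total_races = sum(n for n, _ in stats.values())
    total_handle = sum(n * mean for n, mean in stats.values())
    average_count = total_races / len(stats)
    overall_mean = total_handle / total_races
    return [cat for cat, (n, mean) in stats.items()
            if n < average_count and mean > overall_mean]

# Hypothetical races, for illustration only.
rows = (
    [{"runners": 6, "handle": h} for h in (100000, 120000, 110000, 130000)]
    + [{"runners": 7, "handle": h} for h in (120000, 140000, 130000, 150000)]
    + [{"runners": 10, "handle": 300000}]
)
stats = count_and_mean(rows, "runners")
candidates = few_races_high_handle(stats)
```

Categories returned in `candidates` are the ones the bullets above single out: few races, but high average Handle.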
2. DATA EXPLORATION: Multivariate Analysis
3. Day of the Week
[Charts, one per track (AP, CRC, FG): Average Handle by Day of the Week. X-axis: Day of the Week (Mon to Sunday). Series: Average Handle (Dollar Amount) and No. of Races (Frequency).]
• Clearly, Saturday generates the highest average Handle during the week at all 3 race tracks.
• Tuesday generates the lowest average Handle during the week.
• The number of races held is lowest on Tuesday and highest on Saturday across all 3 race tracks.
• For track CRC, however, average Handle on Wednesday is almost as high as on Saturday, although very few races are held on Wednesday.
• There is thus scope for increasing Handle on Wednesday by increasing the number of races held on that day.
• On Friday & Sunday, average Handle is much lower than on Saturday, even though the number of races is almost as high as on Saturday at all race tracks except CRC.
2. DATA EXPLORATION: Multivariate Analysis
4. Month of the Year
[Charts, one per track (AP, CRC, FG): Average Handle by Month of the Year. X-axis: Month of the Year (Jan to Dec). Series: Average Handle (Dollar Amount) and No. of Races (Frequency).]
• Tracks AP & FG hold races in only a few months rather than the whole year round, and their racing months do not coincide.
• For track CRC, January has the highest average Handle but the lowest number of races. There is thus scope for increasing Handle in January even further by increasing the number of races in that month.
• Also, at track CRC, June & October have the lowest average Handle but a higher number of races than months in which average Handle is higher.
• In December in particular, the number of races rises steeply over November, while average Handle rises much less steeply.
2. DATA EXPLORATION: Multivariate Analysis
5. Hour of the Day (HOD)
[Charts, one per track (AP, CRC, FG): Average Handle by Hour of the Day (HOD). X-axis: HOD (1 to 9). Series: Average Handle (Dollar Amount) and No. of Races (Frequency).]
• There appears to be a data anomaly for track FG: there are no races at the 9th HOD, yet it shows some average Handle.
• For track AP, average Handle increases as the day progresses. However, the number of races falls sharply after the 5th HOD.
• Strikingly, for track AP, the very few races at the 8th & 9th HOD have generated the highest average Handle.
• For track FG, the number of races falls steeply after the 4th HOD, yet average Handle increases. It is also unclear how so few races at the 6th HOD yield average Handle comparable to hours with a much higher number of races.
• For track CRC, the number of races falls steeply after the 4th HOD, yet average Handle remains high, with only a marginal drop.
2. DATA EXPLORATION: Multivariate Analysis
6. Purse Amount
[Charts, one per track (AP, CRC, FG): Average Handle by Purse Amount. X-axis: Purse (Dollar Amount) in brackets <=15000, 15001-30000, 30001-50000, 50001-200000, 200000+. Series: Average Handle (Dollar Amount) and No. of Races (Frequency).]
• For all three race tracks, higher purse brackets show higher average Handle.
• The number of races falls sharply for the higher purse brackets, although tracks AP & CRC both show an increase in the number of races for the USD 15001-30000 bracket.
• Overall, higher purse brackets show higher average Handle, accompanied by a sharp fall in the number of races.
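The purse brackets used on the chart axis can be reproduced with a simple binning step, sketched below. The bracket boundaries are taken from the axis labels above; the function and constant names are hypothetical, purse amounts are assumed to be whole dollars, and the sample rows are made-up numbers.

```python
import bisect
from collections import defaultdict

PURSE_UPPER_BOUNDS = [15000, 30000, 50000, 200000]
PURSE_LABELS = ["<=15000", "15001-30000", "30001-50000", "50001-200000", "200000+"]

def purse_bracket(purse):
    """Return the bracket label for a purse amount in whole dollars."""
    # bisect_left returns the index of the first upper bound >= purse,
    # so a purse equal to a bound stays in the lower bracket.
    return PURSE_LABELS[bisect.bisect_left(PURSE_UPPER_BOUNDS, purse)]

def average_handle_by_bracket(races):
    """races: iterable of (purse, handle). Returns {label: (count, mean)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for purse, handle in races:
        entry = totals[purse_bracket(purse)]
        entry[0] += 1
        entry[1] += handle
    return {label: (totals[label][0], totals[label][1] / totals[label][0])
            for label in PURSE_LABELS if label in totals}

# Hypothetical (purse, handle) pairs, for illustration only.
sample = [(12000, 100000), (15000, 120000), (25000, 150000), (250000, 900000)]
by_bracket = average_handle_by_bracket(sample)
```

The same pattern applies to the claim price and attendance brackets used in the later charts, with their own boundary lists.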
2. DATA EXPLORATION: Multivariate Analysis
7. Minimum Claim Price
[Charts, one per track (AP, CRC, FG): Average Handle by Min Claim Price. X-axis: Minimum Claim Price (Dollar Amount) in brackets <=10000, 10001-30000, 30001-50000, 50001-100000, 100000+. Series: Average Handle (Dollar Amount) and No. of Obs (Frequency).]
• For all three race tracks, average Handle has been fairly constant across all brackets of the minimum claim price.
• Also, for all three race tracks, the number of races falls as the minimum claim price bracket increases.
• Only the 100000+ minimum claim price bracket shows a marginal increase in the number of races.
2. DATA EXPLORATION: Multivariate Analysis
8. Maximum Claim Price
[Charts, one per track (AP, CRC, FG): Average Handle by Max Claim Price. X-axis: Maximum Claim Price (Dollar Amount) in brackets <=10000, 10001-30000, 30001-50000, 50001-100000, 100000+. Series: Average Handle (Dollar Amount) and No. of Obs (Frequency).]
• The observations made for the minimum claim price apply similarly to the maximum claim price.
2. DATA EXPLORATION: Multivariate Analysis
9. Attendance
[Charts, one per track (AP, CRC, FG): Average Handle by Attendance. X-axis: Attendance (in numbers) in brackets 0-3000, 3001-5000, 5001-7000, 7001-9000, 9001-11000, 11000+. Series: Average Handle (Dollar Amount) and No. of Races (Frequency).]
• For track AP, Handle is generated only when attendance is 0-3000. Is that the maximum audience capacity of this track?
• For track CRC, Handle increases for the higher attendance brackets, though the number of races falls correspondingly.
• For track FG, Handle is highest for the 5001-7000 attendance bracket, though very few races fall in this bracket. No races with attendance greater than 9000 have taken place at track FG. Is the capacity of track FG limited to 9000?
3. TESTING ASSUMPTIONS OF OLS
The following assumptions of OLS could not be tested, as the SAS options listed below for each test are not available in WPS. Hence, it could not be conclusively evaluated whether the estimates were BLUE (Best Linear Unbiased Estimators):
i. Linearity
The assumption of linearity can also be tested graphically from the partial residual plots, but the ‘Partial’ option was not available while fitting the model using Proc Reg.
ii. Independence of Error terms
The Durbin-Watson test for evaluating the independence of the error terms was not available as an option in Proc Reg in WPS.
iii. Normality of Error terms
The ‘Normal’, ‘Histogram’ & ‘Probplot’ options are not available in Proc Reg to evaluate the normality of the error terms.
iv. Homoskedasticity
White’s Test could not be used in WPS.
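Where a procedure option is unavailable, a statistic such as Durbin-Watson can in principle be computed directly from exported residuals. The sketch below shows the standard formula only; it was not part of the case study, the function name is hypothetical, and it assumes the residuals are supplied in a meaningful order (for example by race date and race number).

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residuals: alternating signs suggest negative autocorrelation.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Made-up residuals: identical values give the lower extreme of the statistic.
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

The statistic lies between 0 and 4. Values near 2 are consistent with no first-order autocorrelation, values towards 0 indicate positive autocorrelation and values towards 4 indicate negative autocorrelation.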
4. REGRESSION MODEL BUILDING
(i) A Snapshot of the Iterations Performed
The following is a synopsis of the iterations performed for building the models. Four models were built: a combined model for all three race tracks, and a separate model for each of the three race tracks, viz. AP, CRC & FG:
Adjusted R-Square by iteration
Iteration #  Overall  AP    CRC   FG    Description of variables included in the iteration
1.           0.56     0.68  0.43  0.51  All as-is variables, numeric in nature, were used.
2.           0.56     0.68  0.43  0.51  Dropping variables that were found to be linear combinations of other variables in iteration # 1.
3.           0.68     0.75  0.56  0.70  Dummy variables, created for categorical variables, along with as-is numeric variables were included.
4.           0.68     0.75  0.56  0.70  Dropping variables that were found to be linear combinations of other variables in iteration # 3.
5.           0.68     0.74  0.56  0.69  Dropping variables that were found to be linear combinations of other variables in iteration # 4 or insignificant @ 5%.
6.           0.68     0.74              Variables found insignificant @ 5% in the preceding iteration were dropped.
7.           0.67     0.71  0.53  0.65  Variables with a VIF score > 10 in iteration # 6 above were dropped.
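The two quantities driving these iterations, Adjusted R-Square and the VIF score, follow standard formulas that can be sketched as below. The function names, the predictor names and the auxiliary R-Square values are hypothetical and serve only to illustrate the VIF > 10 rule; they are not results from the case study.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-Square for a model with an intercept."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

def variance_inflation_factor(auxiliary_r_squared):
    """VIF of a predictor, given the R-Square obtained by regressing
    that predictor on all the other predictors."""
    return 1.0 / (1.0 - auxiliary_r_squared)

def predictors_to_drop(auxiliary_r_squared_by_name, vif_cutoff=10.0):
    """Names of predictors whose VIF exceeds the cut-off."""
    return [
        name
        for name, r2 in auxiliary_r_squared_by_name.items()
        if variance_inflation_factor(r2) > vif_cutoff
    ]

# Hypothetical auxiliary R-Square values, for illustration only.
auxiliary = {"predictor_a": 0.97, "predictor_b": 0.35, "predictor_c": 0.80}
dropped = predictors_to_drop(auxiliary)
example_adjusted = adjusted_r_squared(0.5, n_obs=101, n_predictors=10)
```

A VIF above 10 corresponds to an auxiliary R-Square above 0.9, i.e. a predictor that is almost a linear combination of the others. Dropping such predictors can lower Adjusted R-Square slightly, as seen between iterations 6 and 7.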
4. REGRESSION MODEL BUILDING
(ii) A combined model for all 3 race tracks taken together.
(a) Regression Output
Please click on the above hyperlink for detailed results and the model equation in the tab named ‘7th Iteration’.
(b) Summary of the results
Only those drivers that are statistically significant @ 5% are shown and interpreted with respect to their impact on Handle.
Columns: Type of Characteristic; Drivers of Handle; Impact on Handle; Average change in Handle for a 1 unit change in the driver.
Race Age_34 -ve 8390
Race Age_35 -ve 8058
Race Age_4U +ve 16835
Race Course_T +ve 39050
Race DST -ve 51826
Race Location_F -ve 46635
Race Location_I -ve 23297
Race Location_L -ve 46913
Race Race_ALW -ve 6936
Race Race_AOC -ve 9218
Race Race_DBY -ve 222613
Race Race_MCL +ve 4501
Race Race_MSW +ve 4606
Race Race_OCS -ve 33897
Race Race_SHP +ve 40170
Race Race_STK -ve 14399
Race State_IL +ve 39715
Race Track_FM -ve 20126
Race Track_GD -ve 19918
Race Track_MY -ve 20368
Race Track_SF +ve 11108
Race Track_SY -ve 12894
Race Wager_3 -ve 83076
Race Wager_4 -ve 64022
Race Wager_5 -ve 111358
Race Wager_6 -ve 114061
Race Wager_9 -ve 66484
Race Wager_D -ve 74073
Race Wager_M -ve 94612
Race Wager_Q -ve 93753
Race Wager_S -ve 68124
Race Wager_T -ve 22168
Race Wager_Z -ve 93800
Race Weather_L -ve 2824
Race Weather_O -ve 8414
Race Weather_R -ve 24565
Race number_of_runners +ve 12287
Race number_of_tickets_bet +ve 211
Race purse_usa +ve 1
Race race_number +ve 3302
Time HOL_BxD +ve 130563
Time HOL_Lab +ve 21833
Time HOL_Mem -ve 9719
Time HOL_NY +ve 59249
Time HOL_SB -ve 20825
Time HOL_SPD -ve 22563
Time HOL_TGV -ve 37150
Time HOL_Vet -ve 19124
Time Mon +ve 1764
Time WeekDay +ve 5603
Time Weekend_Indi +ve 15326
Time fraction_1 +ve 17
SUMMARY: REGRESSION RESULTS OF THE OVERALL MODEL
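As a reading aid for the last column of the summary table, the sketch below turns a few of the reported coefficients into an implied change in Handle. The coefficient values are copied from the combined-model table, with the sign taken from the "Impact on Handle" column. The scenarios and the helper name `handle_change` are hypothetical, and the calculation assumes all other drivers are held fixed and that dummy drivers such as `Weekend_Indi` and `Weather_R` are measured against their omitted baseline category, which this section does not state.

```python
# Signed coefficients copied from the combined-model summary table
# (magnitude from the last column, sign from the "Impact on Handle" column).
COEFFICIENTS = {
    "number_of_runners": 12287,
    "race_number": 3302,
    "Weekend_Indi": 15326,
    "Weather_R": -24565,
}

def handle_change(deltas, coefficients=COEFFICIENTS):
    """Average change in Handle implied by the given driver changes,
    holding every other driver in the model fixed."""
    return sum(coefficients[name] * delta for name, delta in deltas.items())

# Hypothetical scenario: two extra runners in a weekend race.
weekend_extra_runners = handle_change(
    {"number_of_runners": 2, "Weekend_Indi": 1}
)

# Same scenario with the Weather_R dummy switched on.
weekend_extra_runners_rain = handle_change(
    {"number_of_runners": 2, "Weekend_Indi": 1, "Weather_R": 1}
)

print(weekend_extra_runners, weekend_extra_runners_rain)
```

Because the model is linear, the effects simply add. The negative `Weather_R` coefficient offsets a large part of the gain from the extra runners and the weekend indicator.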
4. REGRESSION MODEL BUILDING
(ii) A combined model for all 3 race tracks taken together.
(c) Residual Plot
4. REGRESSION MODEL BUILDING
(ii) A combined model for all 3 race tracks taken together.
(d) Model Fit
4. REGRESSION MODEL BUILDING
(iii) Model for Track ‘AP’
(a) Regression Output
Please click on the above hyperlink for detailed results and the model equation in the tab named ‘7th Iteration’.
(b) Summary of the results
Results of only those drivers statistically significant @ 5% have been shown and interpreted with respect to their impact on Handle.
Type of Characteristic  Drivers of Handle       Impact on Handle  Average change in Handle for a 1 unit change in the driver
Race                    Breed_QH                -ve               48267
Race                    Location_F              -ve               34365
Race                    Location_L              -ve               36642
Race                    Race_ALW                -ve               9246
Race                    Race_AOC                -ve               16294
Race                    Race_MSW                -ve               6628
Race                    Race_STK                -ve               17440
Race                    Track_SY                -ve               30309
Race                    Track_YL                +ve               29391
Race                    Wager_3                 -ve               79629
Race                    Wager_4                 -ve               71710
Race                    Wager_D                 -ve               93001
Race                    Wager_S                 -ve               72598
Race                    Wager_T                 -ve               20962
Race                    Wager_Z                 -ve               106457
Race                    Weather_R               -ve               15752
Race                    number_of_runners       +ve               16502
Race                    number_of_tickets_bet   +ve               314
Race                    purse_usa               +ve               2
Race                    race_number             +ve               5319
Time                    HOD                     +ve               3880
Time                    HOL_ID                  +ve               29700
Time                    HOL_Lab                 +ve               87228
Time                    WeekDay                 +ve               7264
Time                    Weekend_Indi            +ve               34873
SUMMARY: REGRESSION RESULTS OF THE MODEL FOR TRACK 'AP'
4. REGRESSION MODEL BUILDING
(iii) Model for Track ‘AP’
(c) Residual Plot
4. REGRESSION MODEL BUILDING
(iii) Model for Track ‘AP’
(d) Model Fit
4. REGRESSION MODEL BUILDING
(iv) Model for Track ‘CRC’
(a) Regression Output
Please click on the above hyperlink for detailed results and the model equation in the tab named ‘6th Iteration’.
(b) Summary of the results
Results of only those drivers statistically significant @ 5% have been shown and interpreted with respect to their impact on Handle.
Type of Characteristic  Drivers of Handle       Impact on Handle  Average change in Handle for a 1 unit change in the driver
Race                    Age_4U                  +ve               106507
Race                    Course_T                +ve               35208
Race                    DistanceID_Conv_to_Fur  -ve               37
Race                    Location_S              +ve               8080
Race                    Race_MSW                +ve               8504
Race                    Race_STK                +ve               19343
Race                    Track_FM                -ve               19582
Race                    Track_GD                -ve               16025
Race                    Track_SY                -ve               9652
Race                    Track_WF                -ve               16437
Race                    Track_YL                -ve               22817
Race                    Wager_3                 -ve               67735
Race                    Wager_5                 -ve               93481
Race                    Wager_6                 -ve               113661
Race                    Wager_9                 -ve               67019
Race                    Wager_D                 -ve               49220
Race                    Wager_M                 -ve               80114
Race                    Wager_S                 -ve               54116
Race                    Wager_T                 -ve               12607
Race                    Wager_Z                 -ve               66625
Race                    Weather_F               +ve               22005
Race                    Weather_L               -ve               6765
Race                    Weather_O               -ve               4888
Race                    number_of_runners       +ve               9799
Race                    purse_usa               +ve               1
Race                    race_number             +ve               1840
Time                    HOD                     -ve               2791.4285
Time                    HOL_CDM                 +ve               38739
Time                    HOL_ID                  -ve               25209
Time                    HOL_Mem                 -ve               12591
Time                    Mon                     +ve               3053.76611
Time                    WeekDay                 +ve               5912.74561
Time                    Weekend_Indi            +ve               8358.8241
Time                    fraction_3              +ve               0.84318
Time                    fraction_4              +ve               0.26747
Time                    fraction_5              +ve               0.95467
SUMMARY: REGRESSION RESULTS OF THE MODEL FOR TRACK 'CRC'
4. REGRESSION MODEL BUILDING
(iv) Model for Track ‘CRC’
(c) Residual Plot
4. REGRESSION MODEL BUILDING
(iv) Model for Track ‘CRC’
(d) Model Fit
4. REGRESSION MODEL BUILDING
(v) Model for Track ‘FG’
(a) Regression Output
Please click on the above hyperlink for detailed results and the model equation in the tab named ‘6th Iteration’.
(b) Summary of the results
Results of only those drivers statistically significant @ 5% have been shown and interpreted with respect to their impact on Handle.
Type of Characteristic  Drivers of Handle       Impact on Handle  Average change in Handle for a 1 unit change in the driver
Race                    Age_4U                  +ve               8298
Race                    Location_F              -ve               19754
Race                    Race_ALW                -ve               10989
Race                    Race_AOC                -ve               7139
Race                    Race_STK                -ve               13220
Race                    Track_YL                +ve               49565
Race                    Wager_3                 -ve               101465
Race                    Wager_4                 -ve               87265
Race                    Wager_6                 -ve               129220
Race                    Wager_D                 -ve               76597
Race                    Wager_Q                 -ve               97902
Race                    Wager_S                 -ve               81803
Race                    Wager_T                 -ve               16207
Race                    Wager_Z                 -ve               111527
Race                    Weather_F               -ve               10450
Race                    number_of_runners       +ve               11585
Race                    number_of_tickets_bet   +ve               725
Race                    payoff_amount           +ve               1
Race                    purse_usa               +ve               1
Race                    race_number             +ve               5897
Time                    HOD                     -ve               1792
SUMMARY: REGRESSION RESULTS OF THE MODEL FOR TRACK 'FG'
4. REGRESSION MODEL BUILDING
(v) Model for Track ‘FG’
(c) Residual Plot
4. REGRESSION MODEL BUILDING
(v) Model for Track ‘FG’
(d) Model Fit
Thank You

  • 35. 2. DATA EXPLORATION: Bivariate Analysis (b) Categorical variables Handle & Age_Restriction: Overall basis Age_restri ction nmiss Minimum Mean Maximum Sum Std Dev 2 0 0 153635 747300 795063415 80370 3 0 0 210265 1647365 547951285 142556 4 0 96480 166530 298930 1998364 59611 2U 0 0 63590 157102 381538 72528 34 0 0 160670 583365 627417387 80087 35 0 0 183691 434572 73843611 69143 3U 0 0 176734 31394552210939606 123652 45 0 65918 177127 355546 40207927 54918 4U 0 56314 201147 1007029 324248936 92251 5U 0 157566 207091 248986 1242544 35932 Handle & Age_Restriction: Track_Id wise Age_restric tion Minimum Mean Maximum Std Dev 2 61220 242223 747300 103225 3 80139 262704 1353026 158859 34 72518 220667 524670 81221 3U 52542 240021 3139455 162695 Age_restri ction Minimum Mean Maximum Std Dev 2 0 140838 584870 69891 3 0 161285 1186000 112616 4 96480 166530 298930 59611 2U 0 63590 157102 72528 34 0 127400 583365 56663 3U 0 136606 1074715 73016 4U 72810 221839 454350 75001 5U 157566 207091 248986 35932 Age_restrict ion Minimum Mean Maximum Std Dev 2 0 200347 459996 73247 3 57846 233157 1647365 143454 35 0 183691 434572 69143 3U 0 182550 509929 78766 45 65918 177127 355546 54918 4U 56314 199808 1007029 93120 Track: AP Track: FG Track: CRC
  • 36. 2. DATA EXPLORATION: Bivariate Analysis Handle & Distance_id: Overall basis Distnace _id nmiss Minimum Mean Maximum Sum Std Dev 200 0 0 94080 326326 4233597 61825 350 0 104981 142810 175986 428430 35730 400 0 57846 129871 188257 1688319 44850 440 0 79568 109127 135500 436507 28628 450 0 0 116453 258603 47629242 48152 500 0 0 155807 569295289488594 79451 550 0 0 147242 667579433038125 73547 600 0 0 179191 1186000 100454456 4 91576 650 0 42656 158924 504548272077630 68769 700 0 0 146596 991286322950464 83803 750 0 54576 205633 773904 69092549 103604 800 0 0 183797 909653920821709 93494 818 0 55188 180558 450004 40444935 68968 832 0 39771 126262 394530 67045136 54619 850 0 0 186831 1647365745453980 111383 900 0 0 218692 1007029294577826 106595 950 0 90100 583927 2090764 33867777 508071 1000 0 0 724202 3139455 46348930 920541 1100 0 0 81521 174799 978254 64746 1200 0 91487 299421 805235 27247318 140306 1600 0 168211 225182 290300 900727 55358 Handle & Distance_id : Track_Id wise Distnace_id Minimum Mean Maximum Std Dev 400 123621 154167 188257 27745 450 61220 134536 229418 39500 500 81984 237488 477310 81093 550 74563 210006 498141 91052 600 61246 220387 1045357 89701 650 63431 204931 504548 72213 700 52542 218187 991286 116989 750 118256 202789 301862 58614 800 70663 235191 909653 95855 850 97339 282728 699943 89145 900 62133 223963 568607 84253 950 185893 645239 2090764 512202 1000 230036 1180695 3139455 1014228 1200 91487 299504 584818 106742 Distnace_id Minimum Mean Maximum Std Dev 200 0 94080 326326 61825 450 0 114970 258603 48539 500 0 138812 569295 67703 550 0 127500 447481 58367 600 0 141577 1186000 88648 650 42656 137669 409607 55482 700 0 130858 583365 64729 750 54576 187841 773904 109816 800 0 129737 517699 60098 832 39771 126262 394530 54619 850 0 151523 689454 80153 900 0 190424 706858 123442 950 90100 137227 191116 35961 1000 0 137283 334321 93446 1100 0 81521 174799 64746 1200 117416 299268 805235 189300 1600 168211 225182 290300 55358 
Distnace_id Minimum Mean Maximum Std Dev 350 104981 142810 175986 35730 400 57846 75204 89428 15417 440 79568 109127 135500 28628 550 0 194172 667579 81347 600 0 186492 613785 74779 750 102894 238813 653609 88864 800 60184 197700 568623 79025 818 55188 180558 450004 68968 850 0 228115 1647365 155189 900 86624 293787 1007029 162084 Track: AP Track: FG Track: CRC
  • 37. 2. DATA EXPLORATION: Bivariate Analysis (b) Categorical Variables Handle & Track_Condition: Overall basis Track_con dition nmiss Minimum Mean Maximum Sum Std Dev FM 0 47127 229414 3060903850436130 140592 FT 0 0 165842 1647365 284236631 0 88371 GD 0 27740 180584 773904348166275 99651 MY 0 72518 180118 416176 21614199 65542 SF 0 97339 313760 910328 40475089 129318 SY 0 0 141092 557473362746364 64893 WF 0 65195 180629 403022 22036696 71323 YL 0 78342 318714 3139455135453550 343003 Handle & Track_Condition : Track_Id wise Track_conditi on Minimum Mean Maximum Std Dev FM 104925 292445 3060903 190157 FT 52542 218140 1045357 92865 GD 83694 262882 750821 105619 MY 72518 180118 416176 65542 SF 97339 313760 910328 129318 SY 61246 181344 438860 68084 WF 98683 214068 403022 70024 YL 123543 344863 3139455 374599 Track_conditio n Minimum Mean Maximum Std Dev FM 47127 179514 805235 90965 FT 0 135460 1186000 66663 GD 27740 155587 773904 83682 SY 33291 126181 557473 57553 WF 65195 128146 263524 47777 YL 78342 198372 505685 101064 Track_conditio n Minimum Mean Maximum Std Dev FM 81009 233360 667579 90214 FT 55188 194819 1647365 101440 GD 135246 225033 413163 73366 SY 0 169282 384473 66152 WF 91081 177043 284815 52412 YL 179507 278785 384008 70898 Track: AP Track: FG Track: CRC
  • 38. 2. DATA EXPLORATION: Bivariate Analysis (b) Categorical Variables
    Handle & Weather: Overall basis
    Weather  nmiss  Minimum  Mean    Maximum  Sum         Std Dev
    C        0      0        182642  3060903  2644474127  111013
    F        0      60135    177446  388463   31053098    69996
    H        0      83538    245123  750821   152466346   97967
    L        0      27740    170675  3139455  1419335293  114994
    O        0      31129    143224  1186000  320391583   78758
    R        0      0        178122  432213   55574166    69767
    Handle & Weather: Track_Id wise
    Track: AP
    Weather  Minimum  Mean    Maximum  Std Dev
    C        60299    243543  3060903  132301
    H        83538    245123  750821   97967
    L        52542    233100  3139455  196410
    O        74563    202576  457547   77792
    R        61246    184426  432213   65310
    Track: CRC
    Weather  Minimum  Mean    Maximum  Std Dev
    C        0        146379  1074715  78120
    F        60135    145458  279416   47698
    L        27740    134054  457672   56383
    O        31129    135004  1186000  76598
    R        85277    157631  342570   64507
    Track: FG
    Weather  Minimum  Mean    Maximum  Std Dev
    C        56314    212536  1647365  116815
    F        65918    186319  388463   72693
    L        55188    196956  1015652  86203
    O        60184    190461  446041   68389
    R        0        162061  334254   86450
  • 39. 2. DATA EXPLORATION: Bivariate Analysis Handle & Number_of_Runners: Overall basis No._of_ Runners nmiss Minimum Mean Maximum Sum Std Dev 0 0 0 0 0 0 0 3 0 68110 87790 136111 614531 24010 4 0 29170 114078 339675 14373783 52845 5 0 31129 131162 574244 136277078 71137 6 0 27740 141212 1186000 616248774 76070 7 0 39557 156329 902720 1019111132 72781 8 0 24623 172195 1353026 935708945 84230 9 0 38246 203835 2090764 648603161 121901 10 0 51652 216924 3139455 619101417 187917 11 0 72784 234635 1015652 327081336 111020 12 0 49268 246030 818390 291791853 107368 13 0 175033 516909 1074715 11371992 261140 14 0 587653 752653 910328 3010611 146399 Handle & Number_of_Runners : Track_Id wise No._of_Runne rs Minimum Mean Maximum Std Dev 3 68110 88723 136111 26163 4 60299 126896 256008 48252 5 52542 181985 574244 74382 6 69928 190938 1030559 79928 7 61220 209634 902720 80247 8 81479 235889 1353026 103748 9 96150 266363 2090764 135946 10 109264 328673 3139455 331440 11 121347 280257 750821 91916 12 106441 288447 818390 96510 13 353658 473836 577231 101247 14 587653 752653 910328 146399 No._of_Runne rs Minimum Mean Maximum Std Dev 0 0 0 0 0 3 82191 82191 82191 . 4 29170 85022 196532 37081 5 31129 97017 327829 44356 6 27740 115761 1186000 60986 7 39557 132015 769498 57135 8 24623 141845 619376 59079 9 38246 160553 605132 77775 10 51652 168770 689454 75795 11 72784 188094 773904 100674 12 49268 207189 805235 115008 13 175033 534763 1074715 333074 No._of_Runne rs Minimum Mean Maximum Std Dev 0 0 0 0 0 4 75939 142690 339675 62476 5 62080 153114 526757 73351 6 55188 169506 737818 75955 7 55353 171881 394925 65169 8 72351 190798 462356 71482 9 88655 229691 1647365 147919 10 57846 215446 468001 71121 11 101330 251000 1015652 119559 12 85195 234524 653609 86906 13 396762 504946 667579 120182 Track: AP Track: FG Track: CRC
  • 40. 2. DATA EXPLORATION: Bivariate Analysis (b) Categorical Variables
    Handle & Location_Type: Overall basis
    Location_Type  nmiss  Minimum  Mean    Maximum  Sum         Std Dev
    F              1      86735    175072  376662   47269555    49556
    I              0      0        116255  256454   40340341    39131
    L              0      69928    180899  318480   6512372     64031
    O              0      96663    209871  386702   7765224     84081
    S              0      0        76598   175461   26579435    25944
    T              1      0        178739  3139455  4477232533  111255
    Handle & Location_Type: Track_Id wise
    Track: AP
    Location_Type  Minimum  Mean    Maximum  Std Dev
    F              93507    179912  376662   51240
    L              69928    180899  318480   64031
    O              96663    209871  386702   84081
    T              52542    240715  3139455  148876
    Track: CRC
    Location_Type  Minimum  Mean    Maximum  Std Dev
    I              0        116255  256454   39131
    S              0        76598   175461   25944
    T              0        143180  1186000  73051
    Track: FG
    Location_Type  Minimum  Mean    Maximum  Std Dev
    F              86735    164357  346263   44025
    T              55188    202861  1647365  98654
  • 41. 2. DATA EXPLORATION: Bivariate Analysis (b) Categorical Variables Handle & Day_of_Week: Overall basis Day_of_ Week nmiss Minimum Mean Maximum Sum Std Dev Fri 0 0 171751 503267 897744907 80706 Mon 0 0 141411 585465 473726296 67001 Sat 0 0 223245 3139455 1383896707 173655 Sunday 0 0 163067 1186000 898986819 82768 Thurs 0 0 160395 574244 548228915 71613 Tues 2 27740 110526 332921 151088435 45915 Wed 0 54471 194393 456229 269622534 68865 Handle & Day_of_Week : Track_Id wise Day_of_ Week Minimum Mean Maximum Std Dev Fri 80995 240492 503267 78187 Mon 61246 225548 457547 81572 Sat 68110 317064 3139455 266268 Sunday 83580 227642 563999 80174 Thurs 52542 194249 574244 68250 Tues 63431 164264 332921 69310 Wed 70663 202620 456229 69792 Day_of_Week Minimum Mean Maximum Std Dev Fri 0 132774 454808 57846 Mon 0 127376 585465 59292 Sat 0 176748 1074715 99373 Sunday 0 125038 1186000 63192 Thurs 0 124788 340647 56543 Tues 27740 105446 297950 40389 Wed 54471 160513 327829 52889 Day_of_Week Minimum Mean Maximum Std Dev Fri 57472 191383 425895 67939 Mon 60184 166166 410833 55430 Sat 55353 261509 1647365 151040 Sunday 62080 197506 452179 69753 Thurs 0 166491 374051 71729 Tues 76610 161920 313327 54843 Track: AP Track: FG Track: CRC
  • 42. 2. DATA EXPLORATION: Bivariate Analysis Handle & Month: Overall basis Month nmiss Minimum Mean Maximum Sum Std Dev 1 0 57786 208338 667579 284797365 85543 2 0 60830 200426 1015652 201829172 99535 3 0 55188 202569 1647365 188996819 133950 4 0 58512 168971 518254 53732836 77771 5 0 0 157576 689454 485017662 70931 6 0 33291 156905 578960 517159956 79389 7 2 0 183157 1186000 644530914 107033 8 0 0 217787 3139455 672525570 203657 9 0 42249 168886 584818 478622274 88225 10 0 0 112135 550582 213841634 62991 11 0 24623 144993 585766 283315911 64563 12 0 0 189533 805235 598924500 88769 Handle & Month : Track_Id wise Month Minimum Mean Maximum Std Dev 5 52542 193084 623554 69567 6 60299 206980 578960 80135 7 69928 238548 910328 94481 8 70663 295475 3139455 250128 9 61246 249515 584818 92486 Month Minimum Mean Maximum Std Dev 1 72810 249833 645937 108879 4 58512 168971 518254 77771 5 0 134856 689454 61910 6 33291 112235 365592 44334 7 0 132613 1186000 91722 8 0 130480 470593 58221 9 42249 127321 344041 48056 10 0 112135 550582 62991 11 24623 141236 585766 63116 12 42540 186032 805235 89726 Month Minimum Mean Maximum Std Dev 1 57786 201351 667579 78893 2 60830 200426 1015652 99535 3 55188 202569 1647365 133950 11 55353 177397 462356 67908 12 0 199967 509929 85060 Track: AP Track: FG Track: CRC
  • 43. 2. DATA EXPLORATION: Bivariate Analysis Handle & HOD (Hour of the day): Overall basis HOD nmiss Minimum Mean Maximum Sum Std Dev 1 0 24623 153512 902720 718435054 69015 2 0 31129 170912 818390 820378441 80133 3 0 29170 189699 2090764 1025700855 114302 4 0 40367 195446 3139455 1060881991 147331 5 0 33291 202763 1353026 534077362 122728 6 0 46025 174845 1186000 201596057 113198 7 0 156994 249484 379489 31933999 55992 11 0 39400 150897 376662 2565252 107688 12 0 28771 120498 586142 225933517 56593 Handle & HOD (Hour of the day): Track_Id wise HOD Minimum Mean Maximum Std Dev 1 61220 187499 902720 78039 2 61246 209849 818390 89115 3 60299 235107 2090764 139178 4 52542 268527 3139455 224173 5 68110 278611 1353026 122267 6 80995 276369 1030559 104720 7 156994 249484 379489 55992 11 176749 275783 376662 72334 12 115850 377879 586142 125569 HOD Minimum Mean Maximum Std Dev 1 24623 132098 454350 57664 2 31129 144769 522378 64563 3 29170 152494 689454 73566 4 40367 150608 805235 80956 5 33291 149124 1074715 90316 6 46025 124325 1186000 78465 11 39400 82778 162251 38426 12 28771 109425 316368 46450 HOD Minimum Mean Maximum Std Dev 1 59350 169483 484711 62488 2 71195 205370 667579 79772 3 60184 235762 1647365 131045 4 83017 220443 1015652 98004 5 101330 247910 1007029 129986 6 202109 257019 328307 48277 12 55188 144130 346263 48998 Track: AP Track: FG Track: CRC
  • 44. 2. DATA EXPLORATION: Bivariate Analysis (b) Categorical Variables: Chi-Square Test of Association
    The following summarises the Chi-Square tests performed to evaluate whether the association between each independent variable and Handle (the dependent variable) is statistically significant. The OLS regression model is then built using only those independent variables whose association with Handle is significant at the 5% level. For this test, continuous variables were binned into categorical variables on the basis of their distributions, as found using PROC UNIVARIATE:
    Variable                P-Value  Statistical association with Handle
    Wager_Type              <0.0001  Significant
    Race_Type               <0.0001  Significant
    Age_Restriction         <0.0001  Significant
    Sex_Restriction         <0.0001  Significant
    Stakes_Indicator        <0.0001  Significant
    Surface                 <0.0001  Significant
    Track_Condition         <0.0001  Significant
    Weather                 <0.0001  Significant
    Grade                   <0.0001  Significant
    Track_Sealed_Indicator  <0.0001  Significant
    Maximum_Claim_Price     <0.0001  Significant
  • 45. 2. DATA EXPLORATION: Bivariate Analysis (b) Categorical Variables: Chi-Square Test of Association (contd.)
    Variable             P-Value  Statistical association with Handle
    Minimum_Claim_Price  <0.0001  Significant
    Purse                <0.0001  Significant
    Day of the Week      <0.0001  Significant
    Attendance           <0.0001  Significant
    All of the variables listed above have thus been found to have a statistically significant association with Handle, the dependent variable. (Please refer to the hyperlinked file for detailed workings.)
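The case study ran these tests in SAS and does not give the bin cut points it used. The Python sketch below only illustrates the mechanics: it bins a continuous Handle at its median (an assumption made here for simplicity), cross-tabulates it against a categorical variable, and computes the Pearson chi-square statistic, degrees of freedom and p-value. The function names and toy data are hypothetical.

```python
import math
import statistics
from collections import Counter


def chi2_sf(x, dof):
    """Upper-tail probability P(X > x) for a chi-square variable with dof degrees of freedom."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed-form starting points: dof = 1 (odd case) and dof = 2 (even case).
    if dof % 2:
        a, q = 0.5, math.erfc(math.sqrt(h))
    else:
        a, q = 1.0, math.exp(-h)
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < dof / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def bin_at_median(values):
    """Label each value 'high' or 'low' relative to the median (assumed binning rule)."""
    cut = statistics.median(values)
    return ["high" if v > cut else "low" for v in values]


def chi_square_test(xs, ys):
    """Pearson chi-square test of association between two categorical sequences."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals, col_totals = Counter(xs), Counter(ys)
    stat = 0.0
    for r, r_tot in row_totals.items():
        for c, c_tot in col_totals.items():
            expected = r_tot * c_tot / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, chi2_sf(stat, dof)


# Made-up data: 40 Saturday races and 40 Tuesday races with toy Handle values.
day = ["Sat"] * 40 + ["Tue"] * 40
handle = [200] * 30 + [100] * 10 + [200] * 10 + [100] * 30
stat, dof, p_value = chi_square_test(day, bin_at_median(handle))
significant_at_5_percent = p_value < 0.05
```

A variable would be carried into the regression only when `significant_at_5_percent` is true, which matches the selection rule described on the slide.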
  • 46. 2. DATA EXPLORATION: Bivariate Analysis (c) Numeric Variables : Correlation Analysis Please click on the file below for the Correlation Matrix. The cells highlighted in RED indicate the presence of multi-collinearity due to a high value of positive or negative correlation between any two variables. Multi-collinearity is checked later while building the regression model. CORRELATION MATRIX Handle_ Combi race_da te race_nu mber number _of_tick ets_bet total_po ol payoff_ amount distance _id purse_u sa wps_po ol fraction _1 fraction _2 fraction _3 fraction _4 fraction _5 winning _time minimu m_claim _price maximu m_claim _price number _of_run ners Distance ID_Con v_to_Fu r Handle_ Combi 1 -0.0734 0.16599 0.04529 0.70363 0.00818 0.14679 0.34625 0.882 0.06214 -0.0395 -0.002 0.16188 0.0563 0.106 -0.1632 -0.1435 0.38228 0.14679 1 <.0001 <.0001 <.0001 <.0001 0.1835 <.0001 <.0001 <.0001 <.0001 <.0001 0.7437 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 race_dat e -0.0734 1 0.00202 0.0218 0.02629 0.00809 -0.0062 0.01589 -0.1112 -0.0333 0.0028 0.01504 0.02396 -0.0006 0.01026 0.02654 0.0097 -0.0722 -0.0062 race_dat e <.0001 1 0.7425 0.0004 <.0001 0.1882 0.317 0.0097 <.0001 <.0001 0.6482 0.0144 <.0001 0.9266 0.095 <.0001 0.1145 <.0001 0.3166 26461 26463 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 race_nu mber 0.16599 0.00202 1 -0.0798 0.00187 0.12701 0.04731 0.17814 0.22663 0.00081 0.01002 0.05535 0.00879 0.01247 0.04337 -0.0264 -0.0285 0.22343 0.04731 race_nu mber <.0001 0.7425 1 <.0001 0.7612 <.0001 <.0001 <.0001 <.0001 0.8957 0.103 <.0001 0.1529 0.0425 <.0001 <.0001 <.0001 <.0001 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 number_ of_ticket s_bet 0.04529 0.0218 -0.0798 1 0.3378 -0.023 -0.0444 -0.0558 -0.1102 0.02118 
0.10424 0.06271 -0.0573 -0.0282 0.02492 0.02918 0.02917 0.02648 -0.0444 number_ of_ticket s_bet <.0001 0.0004 <.0001 1 <.0001 0.0002 <.0001 <.0001 <.0001 0.0006 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 total_po ol 0.70363 0.02629 0.00187 0.3378 1 -0.1316 0.05351 0.10608 0.32596 0.03397 -0.0017 0.01738 0.05977 0.02161 0.05725 -0.0633 -0.0594 0.15722 0.0535 total_po ol <.0001 <.0001 0.7612 <.0001 1 <.0001 <.0001 <.0001 <.0001 <.0001 0.781 0.0047 <.0001 0.0004 <.0001 <.0001 <.0001 <.0001 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 payoff_a mount 0.00818 0.00809 0.12701 -0.023 -0.1316 1 0.03044 0.04208 0.15456 0.01499 0.04696 0.04868 0.02319 -0.0027 0.05796 -0.0028 -0.0053 0.33667 0.03044 payoff_a mount 0.1835 0.1882 <.0001 0.0002 <.0001 1 <.0001 <.0001 <.0001 0.0147 <.0001 <.0001 0.0002 0.6656 <.0001 0.6548 0.3863 <.0001 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 distance _id 0.14679 -0.0062 0.04731 -0.0444 0.05351 0.03044 1 0.17335 0.165 0.71452 0.59167 0.80977 0.78441 0.10692 0.96169 -0.0654 -0.0569 0.06724 1 distance _id <.0001 0.317 <.0001 <.0001 <.0001 <.0001 1 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 purse_us a 0.34625 0.01589 0.17814 -0.0558 0.10608 0.04208 0.17335 1 0.39972 0.00694 -0.0746 -0.0566 0.17255 0.07282 0.09688 -0.2313 -0.2228 0.05945 0.17335 purse_us a <.0001 0.0097 <.0001 <.0001 <.0001 <.0001 <.0001 1 <.0001 0.2591 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 w ps_poo l 0.882 
-0.1112 0.22663 -0.1102 0.32596 0.15456 0.165 0.39972 1 0.07052 -0.0439 -0.0071 0.18321 0.06319 0.11513 -0.1792 -0.1547 0.42674 0.165 w ps_poo l <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 1 <.0001 <.0001 0.2473 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 fraction_ 1 0.06214 -0.0333 0.00081 0.02118 0.03397 0.01499 0.71452 0.00694 0.07052 1 0.87029 0.78847 0.59345 0.09429 0.75985 -0.0107 -0.0058 -0.0012 0.71452 fraction_ 1 <.0001 <.0001 0.8957 0.0006 <.0001 0.0147 <.0001 0.2591 <.0001 1 <.0001 <.0001 <.0001 <.0001 <.0001 0.0813 0.3428 0.8504 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 fraction_ 2 -0.0395 0.0028 0.01002 0.10424 -0.0017 0.04696 0.59167 -0.0746 -0.0439 0.87029 1 0.75279 0.54696 0.09175 0.67156 0.03351 0.03405 0.06642 0.59167 fraction_ 2 <.0001 0.6482 0.103 <.0001 0.781 <.0001 <.0001 <.0001 <.0001 <.0001 1 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461 26461
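As an illustration of how such a matrix can be screened for multi-collinearity, the Python sketch below computes Pearson correlations for every pair of numeric columns and lists the pairs whose absolute correlation meets a threshold. The 0.8 threshold, the function names and the toy columns are assumptions for illustration; the source does not state the cut-off behind its red highlighting.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_collinear(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 4)))
    return flagged


# Made-up columns; the names echo variables from the matrix, the values do not.
toy_columns = {
    "distance_id": [1, 2, 3, 4, 5],
    "winning_time": [2, 4, 6, 8, 10.5],
    "payoff_amount": [5, 1, 4, 2, 3],
}
collinear_pairs = flag_collinear(toy_columns)
```

In the matrix itself, pairs such as distance_id and winning_time (0.96169) or fraction_1 and fraction_2 (0.87029) are the kind this screen would surface.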
  • 47. 2. DATA EXPLORATION: Bivariate Analysis (d) Synopsis of Key Findings
    The following is a synopsis of the track-wise bivariate analysis of each categorical variable against Handle. For each track ID, the category of each variable that gives the highest average Handle is listed:
    Variable         AP             CRC            FG
    Race Number      11             12             11
    Race Type        STK (Stakes)   STK (Stakes)   STK (Stakes)
    Age Restriction  3              4U             3
    Distance ID      1000           1200           900
    Track Condition  YL (Yielding)  YL (Yielding)  YL (Yielding)
    Weather          H (Hazy)       R (Rainy)      C (Clear)
    No. of Runners   14             13             13
    Location Type    T (Track)      T (Track)      T (Track)
    Day of Week      Saturday       Saturday       Saturday
    Month            August         January        April
    Hour of the Day  July           December       March
  • 49. 2. DATA EXPLORATION: Multivariate Analysis
    For the dependent variable, ‘Handle_Combi’, and each of the numeric variables, a multivariate analysis was conducted track-wise.
    1. Race Number
    [Charts, one per track (AP, CRC, FG): Average Handle by Race #. Axes: Race # (1 to 15), Average Handle (Dollar Amount), Number of Races (Frequency).]
    • For all 3 race tracks, average Handle is highest for race # 11.
    • The number of races falls as the race # increases at all 3 race tracks.
    • In fact, very few races have been run as race # 11.
    • Thus, the higher race numbers, which account for fewer races, generate the highest average Handle.
    • Identifying why race # 11 has the highest average Handle despite so few races is therefore relevant.
  • 50. 2. DATA EXPLORATION: Multivariate Analysis
    2. Number of Runners
    [Charts, one per track (AP, CRC, FG): Average Handle by No. of Runners. Axes: No. of Runners (1 to 13), Average Handle (Dollar Amount), No. of Races (Frequency).]
    • At all 3 race tracks, the largest number of races has been run with around 6-7 runners.
    • Beyond 6-7 runners, the number of races shows a falling trend at each of the 3 race tracks.
    • However, Handle is seen to increase with the number of runners at all 3 race tracks.
    • Thus, although fewer races have taken place with more than 6-7 runners, their Handle has been higher.
    • Increasing the number of races with more than 6-7 runners could therefore have a positive impact on Handle.
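The charts on these slides pair two quantities for each level of a variable: the number of races and the average Handle. A minimal Python sketch of that tabulation follows. The screening rule in `few_races_high_handle` (fewer races than the median level, higher average Handle than the median level) is a hypothetical way to surface the "few races, high Handle" pattern the slide describes, not a rule taken from the source, and the rows are made up.

```python
from collections import defaultdict
from statistics import mean, median


def handle_profile(rows):
    """rows: iterable of (track_id, level, handle).

    Returns {track_id: {level: (n_races, average_handle)}}.
    """
    cells = defaultdict(lambda: defaultdict(list))
    for track, level, handle in rows:
        cells[track][level].append(handle)
    return {
        track: {lvl: (len(vals), mean(vals)) for lvl, vals in sorted(levels.items())}
        for track, levels in cells.items()
    }


def few_races_high_handle(profile_for_track):
    """Hypothetical screen: levels below the median race count but above the median average Handle."""
    count_cut = median(n for n, _ in profile_for_track.values())
    handle_cut = median(avg for _, avg in profile_for_track.values())
    return [
        lvl
        for lvl, (n, avg) in profile_for_track.items()
        if n < count_cut and avg > handle_cut
    ]


# Made-up (track_id, number_of_runners, handle) rows for illustration only.
toy_rows = (
    [("AP", 6, h) for h in (100, 120, 110, 130)]
    + [("AP", 7, h) for h in (140, 150, 160)]
    + [("AP", 10, 300)]
)
profile = handle_profile(toy_rows)
candidates = few_races_high_handle(profile["AP"])
```

The same two functions apply unchanged to race number, day of the week, month or hour of the day by swapping what is passed as `level`.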
  • 51. 2. DATA EXPLORATION: Multivariate Analysis
    3. Day of the Week
    [Charts, one per track (AP, CRC, FG): Average Handle by Day of the Week. Axes: Day of the Week (Mon to Sunday), Average Handle (Dollar Amount), No. of Races (Frequency).]
    • Saturday generates the highest average Handle of the week at all 3 race tracks.
    • Tuesday appears to have the lowest average Handle of the week.
    • The number of races held is lowest on Tuesday and highest on Saturday across all 3 race tracks.
    • However, for track CRC, although the average Handle on Wednesday is almost as high as on Saturday, very few races are held that day.
    • There is thus scope for increasing Handle on Wednesday by increasing the number of races held on that day.
    • On Friday and Sunday, average Handle is much lower than on Saturday, yet the number of races is almost as high as on Saturday at every track except CRC.
  • 52. 2. DATA EXPLORATION: Multivariate Analysis
    4. Month of the Year
    [Charts, one per track (AP, CRC, FG): Average Handle by Month of the Year. Axes: Month of the Year (Jan to Dec), Average Handle (Dollar Amount), No. of Races (Frequency).]
    • Tracks AP and FG hold races in only a few months rather than year-round, and their racing months do not overlap.
    • For track CRC, January has the highest average Handle but the lowest number of races. There is thus scope for increasing Handle in January further by increasing the number of races in that month.
    • Also at track CRC, June and October have the lowest average Handle but more races than months in which Handle is higher.
    • In December in particular, the number of races rises steeply over November, while average Handle rises less steeply.
  • 53. 2. DATA EXPLORATION: Multivariate Analysis
    5. Hour of the Day (HOD)
    [Charts, one per track (AP, CRC, FG): Average Handle by Hour of the Day (HOD). Axes: HOD (1 to 9), Average Handle (Dollar Amount), No. of Races (Frequency).]
    • There appears to be a data anomaly for track FG: although there are no races at the 9th HOD, it shows some average Handle.
    • For track AP, average Handle increases with the hour of the day. However, the number of races falls sharply after the 5th HOD.
    • Strikingly, for track AP, very few races at the 8th and 9th HOD generate the highest average Handle.
    • For track FG, although the number of races falls steeply after the 4th HOD, average Handle increases. This also raises the question of how so few races at the 6th HOD yield a Handle comparable to that of hours with many more races.
    • For track CRC, although the number of races falls steeply after the 4th HOD, average Handle remains high with only a marginal drop.
  • 54. 2. DATA EXPLORATION: Multivariate Analysis 6. Purse Amount
    [Figure: three panels, "AP: Average Handle by Purse Amount", "CRC: Average Handle by Purse Amount" and "FG: Average Handle by Purse Amount". Each plots Average Handle (dollar amount) and No. of Races (frequency) against Purse (dollar amount) in the brackets <=15000, 15001-30000, 30001-50000, 50001-200000 and 200000+.]
    • For all three race tracks, higher purse brackets have a higher average Handle.
    • The number of races falls sharply in the higher purse brackets, although both track AP and track CRC show an increase in the number of races in the USD 15001-30000 bracket.
    • Overall, higher purse brackets combine a higher average Handle with a sharp fall in the number of races.
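The bracket-wise charts in this part of the deck (purse, claim price, attendance) all rest on the same aggregation: assign each race to a bracket, then compute the number of races and the average Handle per track and bracket. A minimal sketch of that step is given below, assuming each race is available as a `(track, purse, handle)` tuple. The function names and the sample rows are hypothetical and are not taken from the case-study data; the bracket edges are read off the chart axis, and placing a purse of exactly 200000 in the 50001-200000 bracket is an assumption.

```python
from collections import defaultdict

# Upper edges and labels read off the x-axis of the purse charts.
# Assumption: a purse of exactly 200000 belongs to "50001-200000".
PURSE_BRACKETS = [
    (15000, "<=15000"),
    (30000, "15001-30000"),
    (50000, "30001-50000"),
    (200000, "50001-200000"),
]


def purse_bracket(purse):
    """Return the label of the purse bracket that contains `purse`."""
    for upper, label in PURSE_BRACKETS:
        if purse <= upper:
            return label
    return "200000+"


def summarise(races):
    """Map (track, bracket) to (number of races, average Handle).

    `races` is an iterable of (track, purse, handle) tuples.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for track, purse, handle in races:
        cell = totals[(track, purse_bracket(purse))]
        cell[0] += 1        # race count for this track and bracket
        cell[1] += handle   # running sum of Handle
    return {key: (n, total / n) for key, (n, total) in totals.items()}


# Made-up rows for illustration only; not the case-study data.
sample = [
    ("AP", 12000, 100000.0),
    ("AP", 14000, 120000.0),
    ("AP", 250000, 900000.0),
    ("CRC", 20000, 80000.0),
]
result = summarise(sample)
```

Reading the two numbers side by side, as the charts do, is what exposes brackets where a high average Handle rests on very few races.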
  • 55. 2. DATA EXPLORATION: Multivariate Analysis 7. Minimum Claim Price
    [Figure: three panels, "AP: Average Handle by Min Claim Price", "CRC: Average Handle by Min Claim Price" and "FG: Average Handle by Min Claim Price". Each plots Average Handle (dollar amount) and No. of Obs (frequency) against Minimum Claim Price (dollar amount) in the brackets <=10000, 10001-30000, 30001-50000, 50001-100000 and 100000+.]
    • For all three race tracks, average Handle is fairly constant across all brackets of the minimum claim price.
    • Also for all three race tracks, the number of races falls as the minimum claim price bracket rises.
    • Only a marginal increase in the number of races can be seen for the 100000+ bracket of the minimum claim price.
  • 56. 2. DATA EXPLORATION: Multivariate Analysis 8. Maximum Claim Price
    [Figure: three panels, "AP: Average Handle by Max Claim Price", "CRC: Average Handle by Max Claim Price" and "FG: Average Handle by Max Claim Price". Each plots Average Handle (dollar amount) and No. of Obs (frequency) against Maximum Claim Price (dollar amount) in the brackets <=10000, 10001-30000, 30001-50000, 50001-100000 and 100000+.]
    • The observations made for the minimum claim price apply similarly to the maximum claim price.
  • 57. 2. DATA EXPLORATION: Multivariate Analysis 9. Attendance
    [Figure: three panels, "AP: Average Handle by Attendance", "CRC: Average Handle by Attendance" and "FG: Average Handle by Attendance". Each plots Average Handle (dollar amount) and No. of Races (frequency) against Attendance (in numbers) in the brackets 0-3000, 3001-5000, 5001-7000, 7001-9000, 9001-11000 and 11000+.]
    • For track AP, Handle is generated only when attendance is 0-3000. Is that the maximum audience capacity of this track?
    • For track CRC, Handle increases in the higher attendance brackets, though the number of races falls correspondingly.
    • For track FG, Handle is highest for the 5001-7000 attendance bracket, though the number of races in this bracket is very low. No races with attendance greater than 9000 have taken place at track FG. Is the capacity of track FG limited to 9000?
  • 59. 3. TESTING ASSUMPTIONS OF OLS
    The following assumptions of OLS could not be tested because the SAS procedures and options listed below for each test are not available in WPS. Hence, it could not be conclusively evaluated whether the estimates were BLUE (Best Linear Unbiased Estimators).
    i. Linearity: Although linearity can also be assessed graphically from partial residual plots, the 'Partial' option was not available when fitting the model with Proc Reg.
    ii. Independence of error terms: The Durbin-Watson test for independence of the error terms was not available as an option of Proc Reg in WPS.
    iii. Normality of error terms: The 'Normal', 'Histogram' and 'Probplot' options were not available in Proc Reg to evaluate the normality of the error terms.
    iv. Homoskedasticity: White's Test could not be used in WPS.
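Two of the checks listed above reduce to simple statistics of the residuals, so they could in principle be computed outside WPS once the residuals are exported. The sketch below is not part of the original analysis: it assumes the residuals are available as a plain list of numbers in observation order, and the function names are hypothetical. It computes the Durbin-Watson statistic (independence) and the Jarque-Bera statistic (normality), the latter being a substitute for the 'Normal' option rather than the same test.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares.

    Values near 2 suggest no first-order autocorrelation; values
    towards 0 or 4 suggest positive or negative autocorrelation.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    if den == 0:
        raise ValueError("all residuals are zero")
    return num / den


def jarque_bera(residuals):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4).

    S is the sample skewness and K the sample kurtosis. Under
    normality the statistic is approximately chi-square with
    2 degrees of freedom, so large values argue against normality.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    centred = [e - mean for e in residuals]
    m2 = sum(c ** 2 for c in centred) / n   # second central moment
    m3 = sum(c ** 3 for c in centred) / n   # third central moment
    m4 = sum(c ** 4 for c in centred) / n   # fourth central moment
    if m2 == 0:
        raise ValueError("residuals have zero variance")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Made-up residuals that alternate in sign (strong negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0]
dw = durbin_watson(alternating)
jb = jarque_bera(alternating)
```

The Durbin-Watson statistic is only meaningful when the residuals are in a natural order, for example by race date and race number.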
  • 61. 4. REGRESSION MODEL BUILDING (i) A Snapshot of the Iterations Performed
    The following is a synopsis of the iterations performed to build the models. Four models were built: one combined model for all three race tracks, and three separate models, one for each of the tracks AP, CRC and FG.
    Iteration # | Description of variables included in the iteration | Adjusted R-Square: Overall | AP | CRC | FG
    1 | All as-is numeric variables were used. | 0.56 | 0.68 | 0.43 | 0.51
    2 | Variables found to be linear combinations of other variables in iteration # 1 were dropped. | 0.56 | 0.68 | 0.43 | 0.51
    3 | Dummy variables created for the categorical variables were included along with the as-is numeric variables. | 0.68 | 0.75 | 0.56 | 0.70
    4 | Variables found to be linear combinations of other variables in iteration # 3 were dropped. | 0.68 | 0.75 | 0.56 | 0.70
    5 | Variables found to be linear combinations of other variables in iteration # 4, or insignificant @ 5%, were dropped. | 0.68 | 0.74 | 0.56 | 0.69
    6 | Variables found insignificant @ 5% in the preceding iteration were dropped. | 0.68 | 0.74 | |
    7 | Variables with a VIF score > 10 in iteration # 6 were dropped. | 0.67 | 0.71 | 0.53 | 0.65
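The two selection criteria used across the iterations follow standard formulas, sketched below for reference. This is an illustration under stated assumptions, not the procedure run in WPS: the function names are hypothetical, `n_predictors` excludes the intercept, and `r_squared_j` denotes the R-square obtained by regressing predictor j on all the other predictors.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-square: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Penalises R^2 for the number of predictors p, which is why it is
    used to compare iterations with different numbers of variables.
    """
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


def vif(r_squared_j):
    """Variance inflation factor of predictor j: 1 / (1 - R_j^2)."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("R_j^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)


# A VIF cut-off of 10 corresponds to R_j^2 = 0.9, i.e. 90% of the
# variation in predictor j is explained by the other predictors.
vif_at_cutoff = vif(0.9)

# Made-up example: R^2 = 0.5 with 101 observations and 10 predictors.
example_adj = adjusted_r_squared(0.5, 101, 10)
```

Dropping high-VIF variables removes redundant predictors, which is consistent with the small fall in Adjusted R-Square between iterations 6 and 7.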
  • 62. 4. REGRESSION MODEL BUILDING (ii) A combined model for all 3 race tracks taken together.
    (a) Regression Output
    Please click on the above hyperlink for detailed results and the model equation in the tab named '7th Iteration'.
    (b) Summary of the results
    Only drivers that are statistically significant at the 5% level are shown, interpreted with respect to their impact on Handle.
    SUMMARY: REGRESSION RESULTS OF THE OVERALL MODEL
    Type of Characteristic | Driver of Handle | Impact on Handle | Average change in Handle for a 1 unit change in the driver
    Race | Age_34 | -ve | 8390
    Race | Age_35 | -ve | 8058
    Race | Age_4U | +ve | 16835
    Race | Course_T | +ve | 39050
    Race | DST | -ve | 51826
    Race | Location_F | -ve | 46635
    Race | Location_I | -ve | 23297
    Race | Location_L | -ve | 46913
    Race | Race_ALW | -ve | 6936
    Race | Race_AOC | -ve | 9218
    Race | Race_DBY | -ve | 222613
    Race | Race_MCL | +ve | 4501
    Race | Race_MSW | +ve | 4606
    Race | Race_OCS | -ve | 33897
    Race | Race_SHP | +ve | 40170
    Race | Race_STK | -ve | 14399
    Race | State_IL | +ve | 39715
    Race | Track_FM | -ve | 20126
    Race | Track_GD | -ve | 19918
    Race | Track_MY | -ve | 20368
    Race | Track_SF | +ve | 11108
    Race | Track_SY | -ve | 12894
    Race | Wager_3 | -ve | 83076
    Race | Wager_4 | -ve | 64022
    Race | Wager_5 | -ve | 111358
    Race | Wager_6 | -ve | 114061
    Race | Wager_9 | -ve | 66484
    Race | Wager_D | -ve | 74073
    Race | Wager_M | -ve | 94612
    Race | Wager_Q | -ve | 93753
    Race | Wager_S | -ve | 68124
    Race | Wager_T | -ve | 22168
    Race | Wager_Z | -ve | 93800
    Race | Weather_L | -ve | 2824
    Race | Weather_O | -ve | 8414
    Race | Weather_R | -ve | 24565
    Race | number_of_runners | +ve | 12287
    Race | number_of_tickets_bet | +ve | 211
    Race | purse_usa | +ve | 1
    Race | race_number | +ve | 3302
    Time | HOL_BxD | +ve | 130563
    Time | HOL_Lab | +ve | 21833
    Time | HOL_Mem | -ve | 9719
    Time | HOL_NY | +ve | 59249
    Time | HOL_SB | -ve | 20825
    Time | HOL_SPD | -ve | 22563
    Time | HOL_TGV | -ve | 37150
    Time | HOL_Vet | -ve | 19124
    Time | Mon | +ve | 1764
    Time | WeekDay | +ve | 5603
    Time | Weekend_Indi | +ve | 15326
    Time | fraction_1 | +ve | 17
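The summary table reports each driver as a sign ('+ve' or '-ve') and a magnitude, so the expected change in Handle is the signed magnitude multiplied by the change in the driver, holding all other drivers fixed. The sketch below illustrates that arithmetic with four rows copied from the overall-model table. The function names are hypothetical, and the result is only as precise as the rounded values shown on the slide.

```python
# (sign, magnitude) pairs copied from the overall-model summary table.
OVERALL_EFFECTS = {
    "number_of_runners": ("+ve", 12287),
    "race_number": ("+ve", 3302),
    "Weather_R": ("-ve", 24565),
    "Weekend_Indi": ("+ve", 15326),
}


def handle_change(driver, delta, effects=OVERALL_EFFECTS):
    """Expected change in Handle when `driver` changes by `delta` units.

    Assumes the linear model holds and every other driver is unchanged.
    For dummy variables such as Weather_R, `delta` is 1 (switched on).
    """
    sign, magnitude = effects[driver]
    signed = magnitude if sign == "+ve" else -magnitude
    return signed * delta


def total_change(changes, effects=OVERALL_EFFECTS):
    """Sum the effects of several simultaneous driver changes."""
    return sum(handle_change(driver, delta, effects)
               for driver, delta in changes.items())


# Two extra runners in a race: 2 * 12287
two_more_runners = handle_change("number_of_runners", 2)

# Two extra runners, but with the Weather_R dummy switched on.
runners_with_weather_r = total_change({"number_of_runners": 2, "Weather_R": 1})
```

In this reading, the gain from two extra runners (24574) is almost exactly cancelled by the Weather_R effect (24565), which shows how the signed magnitudes can be combined across drivers.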
  • 63. 4. REGRESSION MODEL BUILDING (ii) A combined model for all 3 race tracks taken together. (c) Residual Plot
  • 64. 4. REGRESSION MODEL BUILDING (ii) A combined model for all 3 race tracks taken together. (d) Model Fit
  • 65. 4. REGRESSION MODEL BUILDING (iii) Model for Track ‘AP’
    (a) Regression Output
    Please click on the above hyperlink for detailed results and the model equation in the tab named '7th Iteration'.
    (b) Summary of the results
    Only drivers that are statistically significant at the 5% level are shown, interpreted with respect to their impact on Handle.
    SUMMARY: REGRESSION RESULTS OF THE MODEL FOR TRACK 'AP'
    Type of Characteristic | Driver of Handle | Impact on Handle | Average change in Handle for a 1 unit change in the driver
    Race | Breed_QH | -ve | 48267
    Race | Location_F | -ve | 34365
    Race | Location_L | -ve | 36642
    Race | Race_ALW | -ve | 9246
    Race | Race_AOC | -ve | 16294
    Race | Race_MSW | -ve | 6628
    Race | Race_STK | -ve | 17440
    Race | Track_SY | -ve | 30309
    Race | Track_YL | +ve | 29391
    Race | Wager_3 | -ve | 79629
    Race | Wager_4 | -ve | 71710
    Race | Wager_D | -ve | 93001
    Race | Wager_S | -ve | 72598
    Race | Wager_T | -ve | 20962
    Race | Wager_Z | -ve | 106457
    Race | Weather_R | -ve | 15752
    Race | number_of_runners | +ve | 16502
    Race | number_of_tickets_bet | +ve | 314
    Race | purse_usa | +ve | 2
    Race | race_number | +ve | 5319
    Time | HOD | +ve | 3880
    Time | HOL_ID | +ve | 29700
    Time | HOL_Lab | +ve | 87228
    Time | WeekDay | +ve | 7264
    Time | Weekend_Indi | +ve | 34873
  • 66. 4. REGRESSION MODEL BUILDING (iii) Model for Track ‘AP’ (c) Residual Plot
  • 67. 4. REGRESSION MODEL BUILDING (iii) Model for Track ‘AP’ (d) Model Fit
  • 68. 4. REGRESSION MODEL BUILDING (iv) Model for Track ‘CRC’
    (a) Regression Output
    Please click on the above hyperlink for detailed results and the model equation in the tab named '6th Iteration'.
    (b) Summary of the results
    Only drivers that are statistically significant at the 5% level are shown, interpreted with respect to their impact on Handle.
    SUMMARY: REGRESSION RESULTS OF THE MODEL FOR TRACK 'CRC'
    Type of Characteristic | Driver of Handle | Impact on Handle | Average change in Handle for a 1 unit change in the driver
    Race | Age_4U | +ve | 106507
    Race | Course_T | +ve | 35208
    Race | DistanceID_Conv_to_Fur | -ve | 37
    Race | Location_S | +ve | 8080
    Race | Race_MSW | +ve | 8504
    Race | Race_STK | +ve | 19343
    Race | Track_FM | -ve | 19582
    Race | Track_GD | -ve | 16025
    Race | Track_SY | -ve | 9652
    Race | Track_WF | -ve | 16437
    Race | Track_YL | -ve | 22817
    Race | Wager_3 | -ve | 67735
    Race | Wager_5 | -ve | 93481
    Race | Wager_6 | -ve | 113661
    Race | Wager_9 | -ve | 67019
    Race | Wager_D | -ve | 49220
    Race | Wager_M | -ve | 80114
    Race | Wager_S | -ve | 54116
    Race | Wager_T | -ve | 12607
    Race | Wager_Z | -ve | 66625
    Race | Weather_F | +ve | 22005
    Race | Weather_L | -ve | 6765
    Race | Weather_O | -ve | 4888
    Race | number_of_runners | +ve | 9799
    Race | purse_usa | +ve | 1
    Race | race_number | +ve | 1840
    Time | HOD | -ve | 2791.4285
    Time | HOL_CDM | +ve | 38739
    Time | HOL_ID | -ve | 25209
    Time | HOL_Mem | -ve | 12591
    Time | Mon | +ve | 3053.76611
    Time | WeekDay | +ve | 5912.74561
    Time | Weekend_Indi | +ve | 8358.8241
    Time | fraction_3 | +ve | 0.84318
    Time | fraction_4 | +ve | 0.26747
    Time | fraction_5 | +ve | 0.95467
  • 69. 4. REGRESSION MODEL BUILDING (iv) Model for Track ‘CRC’ (c) Residual Plot
  • 70. 4. REGRESSION MODEL BUILDING (iv) Model for Track ‘CRC’ (d) Model Fit
  • 71. 4. REGRESSION MODEL BUILDING (v) Model for Track ‘FG’
    (a) Regression Output
    Please click on the above hyperlink for detailed results and the model equation in the tab named '6th Iteration'.
    (b) Summary of the results
    Only drivers that are statistically significant at the 5% level are shown, interpreted with respect to their impact on Handle.
    SUMMARY: REGRESSION RESULTS OF THE MODEL FOR TRACK 'FG'
    Type of Characteristic | Driver of Handle | Impact on Handle | Average change in Handle for a 1 unit change in the driver
    Race | Age_4U | +ve | 8298
    Race | Location_F | -ve | 19754
    Race | Race_ALW | -ve | 10989
    Race | Race_AOC | -ve | 7139
    Race | Race_STK | -ve | 13220
    Race | Track_YL | +ve | 49565
    Race | Wager_3 | -ve | 101465
    Race | Wager_4 | -ve | 87265
    Race | Wager_6 | -ve | 129220
    Race | Wager_D | -ve | 76597
    Race | Wager_Q | -ve | 97902
    Race | Wager_S | -ve | 81803
    Race | Wager_T | -ve | 16207
    Race | Wager_Z | -ve | 111527
    Race | Weather_F | -ve | 10450
    Race | number_of_runners | +ve | 11585
    Race | number_of_tickets_bet | +ve | 725
    Race | payoff_amount | +ve | 1
    Race | purse_usa | +ve | 1
    Race | race_number | +ve | 5897
    Time | HOD | -ve | 1792
  • 72. 4. REGRESSION MODEL BUILDING (v) Model for Track ‘FG’ (c) Residual Plot
  • 73. 4. REGRESSION MODEL BUILDING (v) Model for Track ‘FG’ (d) Model Fit