Harness Racing and SAS

Harvard Stats 135 midterm project evaluating SAS techniques.

Published in: Sports, Entertainment & Humor
  • Present horse part recoding
  • Show creating a mean data set
  • Show SAS code for recoding data

    1. HARNESS RACING AND SAS: USING SAS TO MODEL HORSE RACES
    2. DATA SET
       • “Past Performance” (PP) data from TrackMaster for the September 26, 2013 races at Yonkers Raceway
       • Published in advance of the race
       • Cost: $1.50
       • Comes in XML format; parsed using Python
       • Contains the 10 most recent PPs for each horse racing that day
       • 12 races x 8 horses x 10 past performances = 960 records
       • Variables of use: lengths back at each quarter, final time, lead final time, gait, age (meta), track condition, track name, track length
       • Created race-level, horse-race-level, and longitudinal data sets for different aspects of this analysis
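The slides do not show the parsing script, so the following is only a minimal sketch of how an XML file of past performances could be flattened into one record per past performance. The XML layout, tag names, and attribute names are all hypothetical; the real TrackMaster schema is not given in the deck.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: the real TrackMaster schema is not shown in the
# slides, so every tag and attribute name below is an assumption.
SAMPLE_XML = """
<racecard track="Yonkers Raceway" date="2013-09-26">
  <race number="1">
    <horse name="Example Horse A" age="5" gait="P">
      <pp final_time="117.2" condition="FT" track_length="0.5"/>
      <pp final_time="118.0" condition="GD" track_length="0.5"/>
    </horse>
    <horse name="Example Horse B" age="7" gait="P">
      <pp final_time="119.4" condition="SY" track_length="0.625"/>
    </horse>
  </race>
</racecard>
"""


def flatten_past_performances(xml_text):
    """Return one dict per past performance (one row of a longitudinal set)."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for race in root.iter("race"):
        for horse in race.iter("horse"):
            for pp in horse.iter("pp"):
                rows.append({
                    "race": int(race.get("number")),
                    "horse": horse.get("name"),
                    "age": int(horse.get("age")),
                    "gait": horse.get("gait"),
                    "final_time": float(pp.get("final_time")),
                    "condition": pp.get("condition"),
                    "track_length": float(pp.get("track_length")),
                })
    return rows


rows = flatten_past_performances(SAMPLE_XML)
```

With the full file, the same loop would be expected to yield the 960 records counted on the slide (12 races x 8 horses x 10 past performances).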
    3. GAIT AND CONDITION
       • Hypothesis: Gait and track condition influence race time
       • Gait
         • Binary: Pacers and Trotters
         • Each race is one or the other
         • Each horse is one or the other
       • Condition
         • Categorical: Fast, Good, or Sloppy
         • Each race is assigned exactly one condition
       • Created and cleaned race-level data set
       • Comparing group means showed that mean race time differs for both variables
       • T-tests showed these differences are statistically significant
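The comparison of means on slide 3 was run in SAS, and the code is not shown. As a rough Python illustration of what a two-sample t-test computes, the sketch below uses Welch's unequal-variance form; whether the project used the pooled or the unequal-variance test is not stated, so that choice is an assumption. The race times are invented for illustration and are not the project's data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Invented race times in seconds, for illustration only.
pacer_times = [116.0, 117.0, 118.0, 117.0]
trotter_times = [120.0, 121.0, 122.0, 121.0]
t_stat, dof = welch_t(pacer_times, trotter_times)
```

A large negative `t_stat` here would mean the first group (pacers in this toy data) finishes faster on average; the p-value would then come from a t distribution with `dof` degrees of freedom.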
    4. REMOVING OUTLIERS
    5. REMOVING OUTLIERS
    6. GAIT T-TEST
    7. CONDITION T-TEST
    8. CORRELATION: LENGTHS BACK AT CALLS
       • Some horses pull away early; others seem to wait for the last quarter to go to the front
       • TrackMaster reports lengths back from the lead and calls at each quarter
       • Lengths are recorded as fractional numbers (to the quarter) and as parts of a horse
         • Nose
         • Head
         • Neck
       • Additional complication: “costly breaks” of pace and disqualification
       • Still unresolved: the lengths back recorded for winners at the finish look strange
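The recoding of "parts of a horse" mentioned on slide 8 was done in a SAS DATA step that the deck does not reproduce. A Python sketch of the same idea follows. The numeric values assigned to nose, head, and neck, and the input string format (for example `2 1/4`), are assumptions chosen for illustration, not the values the author used.

```python
from fractions import Fraction

# Assumed conversions from "parts of a horse" to lengths; the slides do not
# give the values the author used, so these are illustrative placeholders.
PART_OF_HORSE = {"nose": 0.05, "head": 0.10, "neck": 0.25}


def lengths_to_float(raw):
    """Convert a lengths-back string such as '2 1/4', '3', or 'neck' to a float."""
    text = raw.strip().lower()
    if text in PART_OF_HORSE:
        return PART_OF_HORSE[text]
    # Sum whitespace-separated pieces, e.g. '2 1/4' -> 2 + 1/4.
    return float(sum(Fraction(piece) for piece in text.split()))


examples = [lengths_to_float(s) for s in ["2 1/4", "3", "Neck"]]
```

Breaks of pace and disqualifications would need their own handling (for instance, flagging the record rather than converting it), which this sketch leaves out.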
    9. CORRELATION OF LENGTHS BACK BY QUARTER
    10. CORRELATION OF LENGTHS BACK BY QUARTER
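Slides 9 and 10 presented the correlation output as images, which did not survive in this transcript. For readers unfamiliar with the statistic, here is a small Python sketch of the Pearson correlation between lengths back at two calls. The numbers are invented and the function name is my own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented lengths back for five horses at the first quarter and at the finish.
first_quarter = [0.0, 1.5, 3.0, 4.25, 6.0]
finish = [1.0, 0.0, 2.5, 5.0, 7.5]
r = pearson_r(first_quarter, finish)
```

A value of `r` near 1 would suggest that horses trailing early also tend to trail at the finish; a value near 0 would fit the picture of late closers described on slide 8.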
    11. AGE AND SPEED
        • Goal: Quantify how much horses slow down with age
        • Merged metadata for each horse with past performance data
        • Single-variable regression analysis of the mean data set
        • Found that age is a weak predictor of speed
        • Age is discrete, yet not categorical
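Slide 11 describes two steps: collapsing past performances into a per-horse mean data set, then regressing mean time on age. The sketch below shows both in plain Python. It is not the author's SAS code; the field names and the sample rows are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_by_horse(rows):
    """Collapse past performances to one (age, mean final time) pair per horse."""
    times = defaultdict(list)
    ages = {}
    for r in rows:
        times[r["horse"]].append(r["final_time"])
        ages[r["horse"]] = r["age"]
    return [(ages[h], mean(t)) for h, t in times.items()]


def ols_line(points):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - slope * mx, slope


# Invented records (times in seconds), for illustration only.
sample_rows = [
    {"horse": "A", "age": 4, "final_time": 116.0},
    {"horse": "A", "age": 4, "final_time": 118.0},
    {"horse": "B", "age": 6, "final_time": 118.0},
    {"horse": "B", "age": 6, "final_time": 120.0},
    {"horse": "C", "age": 8, "final_time": 120.0},
    {"horse": "C", "age": 8, "final_time": 122.0},
]
intercept, slope = ols_line(mean_by_horse(sample_rows))
```

In this toy data the slope is the estimated number of seconds added per year of age. Treating age as a single numeric regressor matches the slide's point that age is discrete but not categorical.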
    12. MULTIVARIATE REGRESSION
        • Longitudinal data set
        • Created dummy variables for past and present track conditions, gaits, and track sizes
        • Used SAS’s LAG and LAST features
        • Removed disqualified races
        • Modeled race time from the current race’s conditions and those of the two prior races
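The lagging step on slide 12 was done with SAS features whose code is not shown. As a hedged Python analog (not a reproduction of SAS semantics), the sketch below attaches the previous two values of a field to each record within a horse and flags each horse's most recent record. It assumes rows arrive sorted oldest-first within each horse; the field names are hypothetical.

```python
def add_lags(rows, key="final_time"):
    """Attach lag1/lag2 of `key` per horse and flag each horse's last row.

    Assumes rows are ordered oldest-first within each horse.
    """
    history = {}
    out = []
    for r in rows:
        prev = history.setdefault(r["horse"], [])
        lagged = dict(r)
        lagged["lag1"] = prev[-1] if len(prev) >= 1 else None
        lagged["lag2"] = prev[-2] if len(prev) >= 2 else None
        prev.append(r[key])
        out.append(lagged)
    # Mark the most recent record for each horse.
    last_index = {r["horse"]: i for i, r in enumerate(out)}
    for i, r in enumerate(out):
        r["is_last"] = last_index[r["horse"]] == i
    return out


# Invented records, for illustration only.
lagged_rows = add_lags([
    {"horse": "A", "final_time": 118.0},
    {"horse": "A", "final_time": 117.0},
    {"horse": "A", "final_time": 116.0},
    {"horse": "B", "final_time": 120.0},
])
```

Keeping a separate history per horse prevents one horse's times from leaking into the next horse's lag values. Rows whose `lag2` is `None` would have to be dropped before fitting a model that uses two prior races.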
    13. MULTIVARIATE REGRESSION

        Control Variables
        Label             Parameter Estimate  Standard Error  t Value  Pr > |t|
        Intercept                  104.67788         4.81142    21.76    <.0001
        Lag final time               0.01412         0.03120     0.45    0.6510
        Lag2 final time              0.11361         0.02975     3.82    0.0001
        Pacer                       -3.68185         0.21247   -17.33    <.0001
        Fast                        -0.77005         0.38954    -1.98    0.0484
        Sloppy                       0.86942         0.43605     1.99    0.0465
        Age                          0.05312         0.04023     1.32    0.1871
        5/8 Track                   -2.74052         0.20313   -13.49    <.0001
        1 Track                     -3.18411         0.47824    -6.66    <.0001

        Variables of Interest
        Label             Parameter Estimate  Standard Error  t Value  Pr > |t|
        Fast lag                     0.35883         0.38598     0.93    0.3528
        Sloppy lag                   0.48532         0.43151     1.12    0.2610
        Fast lag2                    0.09472         0.37245     0.25    0.7993
        Sloppy lag2                 -0.39904         0.42068    -0.95    0.3431
        5/8 Track lag                0.14639         0.23680     0.62    0.5366
        1 Track lag                  0.40192         0.51792     0.78    0.4379
        5/8 Track lag2               0.58564         0.21764     2.69    0.0073
        1 Track lag2                 0.67260         0.49172     1.37    0.1717

        Final race times from previous races are not great determinants of final race time in this race!
    14. PREDICTION OF SEPTEMBER 26 RACES
        • Used the coefficients from my multivariate regression and the two most recent races for each horse
        • Ranked horses by predicted race time
        • My bets weren’t great, but they were better than choosing at random
        • Reason: very low variance in race times among horses; the model lacks predictive power, even with R^2 > 0.5
        • [Chart: Predicting the Winner, Right vs. Wrong; values not recoverable]
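To make the prediction step on slide 14 concrete, the sketch below plugs the parameter estimates from slide 13 into a linear predictor and ranks a field by predicted time. The dictionary keys, the two example horses, and their feature values are invented. Treating the response as seconds is also an assumption, since the slides do not state units.

```python
# Estimates copied from the slide 13 table; the dictionary keys are my own names.
COEF = {
    "intercept": 104.67788, "lag_time": 0.01412, "lag2_time": 0.11361,
    "pacer": -3.68185, "fast": -0.77005, "sloppy": 0.86942, "age": 0.05312,
    "track_5_8": -2.74052, "track_1": -3.18411,
    "fast_lag": 0.35883, "sloppy_lag": 0.48532,
    "fast_lag2": 0.09472, "sloppy_lag2": -0.39904,
    "track_5_8_lag": 0.14639, "track_1_lag": 0.40192,
    "track_5_8_lag2": 0.58564, "track_1_lag2": 0.67260,
}


def predict_time(features):
    """Linear predictor: intercept plus coefficient * value for each feature."""
    return COEF["intercept"] + sum(COEF[name] * value
                                   for name, value in features.items())


def rank_horses(field):
    """Return horse names ordered from fastest to slowest predicted time."""
    return sorted(field, key=lambda name: predict_time(field[name]))


# Two hypothetical pacers on a fast track; features left out count as zero.
field = {
    "Horse A": {"pacer": 1, "fast": 1, "age": 5, "lag_time": 118.0,
                "lag2_time": 119.0, "fast_lag": 1, "fast_lag2": 1},
    "Horse B": {"pacer": 1, "fast": 1, "age": 5, "lag_time": 120.0,
                "lag2_time": 121.0, "fast_lag": 1, "fast_lag2": 1},
}
ranking = rank_horses(field)
```

Two seconds of difference in each prior race moves the prediction by only about a quarter of a second here, which illustrates the slide's point that predicted times within a field sit very close together.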
    15. FINAL THOUGHTS
        • SAS’s LAG and LAST features are great for dealing with longitudinal data
        • Most of the work was in the DATA steps, not the PROC steps
        • My model was based on only 960 records from 96 horses
        • With more data, I might model Pacers and Trotters separately, and each track condition separately
        • Still want to investigate lengths back for winning horses
        • Learned a lot about SAS and about harness racing
