6. Analysis
Benchmarking with a linear model: r2 = 0.73, RMSE = 6.5 min
Reducing the number of features, ensemble partial least squares regression: r2 = 0.73, RMSE = 6.4 min
(5-fold cross-validation; metrics are regression r2 and RMSE)
Validation on 72 runners: r2 = 0.63, RMSE = 7.2 min
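The r2 and RMSE numbers above come from cross-validation. As a minimal sketch of how such figures are produced (the talk's actual analysis was done in R; this NumPy version substitutes an ordinary least-squares fit and synthetic data for illustration):

```python
# Hedged sketch: k-fold cross-validated r2 and RMSE for a linear model.
# All data here is synthetic; the original analysis used real runner data in R.
import numpy as np

def kfold_r2_rmse(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    preds = np.empty(len(y))
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column and fit by least squares on the training fold.
        Xtr = np.c_[np.ones(len(train)), X[train]]
        Xte = np.c_[np.ones(len(test)), X[test]]
        w, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        preds[test] = Xte @ w
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y - preds) ** 2))
    return r2, rmse

# Synthetic stand-in: half-marathon time (minutes) driven mostly by one feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 110 + 8 * X[:, 0] + rng.normal(scale=2, size=200)
r2, rmse = kfold_r2_rmse(X, y, k=5)
print(f"r2={r2:.2f}, RMSE={rmse:.1f} min")
```

Out-of-fold predictions are pooled before computing r2 and RMSE, so each runner is scored by a model that never saw them.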
7. About me: Alexis Yelton, MIT postdoc
Research: chitinase activity in marine cyanobacteria
My first half marathon: 1:56:30
Personal best: 1:47:56
8. 22 Features
Month Distance, Month Runs, Month Elevation, Month Pace, Month Time
6 Month Distance, 6 Month Runs, 6 Month Elevation, 6 Month Pace, 6 Month Time
Age Range, Weight Range, Gender
Rest Days / Week, Fast Days / Week, Long Days / Week
5K Time, Marathon Time, Minimum Pace, Minimum Pace > 2 mi, Minimum Pace > 3 mi, SD Pace
9. Results
[Figure: absolute training error, sqrt(errorstrain^2), plotted against half marathon time (train_sub$HALF.MARATHON)]
Errors vary with half marathon time. A larger data set would allow for better predictions for faster and slower runners.
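The pattern the slide describes, larger errors for the fastest and slowest runners, can be checked by binning absolute errors by actual finish time. A minimal sketch with synthetic data (the R plot's sqrt(errors^2) is simply the absolute error; variable names here are illustrative, not from the original code):

```python
# Hedged sketch: does prediction error vary across the range of finish times?
# Fit a linear model, then compare mean |error| for fast, middle, and slow runners.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 110 + 10 * X[:, 0] + rng.normal(scale=3, size=300)  # synthetic times, minutes

Xd = np.c_[np.ones(len(y)), X]
w, *_ = np.linalg.lstsq(Xd, y, rcond=None)
abs_err = np.abs(y - Xd @ w)

# Mean |error| in three equal-sized bins: fast, middle, slow runners.
edges = np.quantile(y, [0, 1 / 3, 2 / 3, 1])
bins = np.digitize(y, edges[1:-1])
for name, b in zip(["fast", "middle", "slow"], range(3)):
    print(f"{name:>6}: mean |error| = {abs_err[bins == b].mean():.2f} min")
```

With real data, a clear U-shape across bins would support the slide's point that the tails of the time distribution are underrepresented.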
13. Analysis
Benchmarking with a linear model: r2 = 0.73, RMSE = 6.5 min
Dealing with collinear features (and reducing the number of features):
1. Ensemble partial least squares regression: r2 = 0.72, RMSE = 6.6 min
2. Linear model: r2 = 0.71, RMSE = 6.7 min
3. Lasso regression: r2 = 0.69, RMSE = 6.8 min
4. Ridge regression: r2 = 0.72, RMSE = 6.7 min
5. Random forest regression: r2 = 0.67, RMSE = 7.1 min
(3-fold cross-validation; metrics are regression r2 and RMSE)
Validation on 69 runners: r2 = 0.63, RMSE = 7.2 min
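Ridge regression, one of the models compared above, handles collinear features (e.g. month pace vs. 6-month pace) by shrinking coefficients rather than letting them blow up. A minimal closed-form sketch on synthetic data, not the original runner data:

```python
# Hedged sketch: ridge regression via its closed form,
#   w = (X^T X + alpha * I)^(-1) X^T y
# with centering so the intercept is left unpenalized.
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_feat = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_feat), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Two nearly collinear features: OLS coefficients would be unstable,
# but ridge splits the shared signal between them and keeps both finite.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # almost a copy of x1
X = np.c_[x1, x2]
y = 3 * x1 + rng.normal(scale=0.1, size=200)
w, b = ridge_fit(X, y, alpha=1.0)
print(w, b)  # the two coefficients sum to roughly 3
```

The penalty trades a little bias for much lower variance, which is why ridge's cross-validated r2 here nearly matches the full linear model despite the collinearity.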
14. Analysis
Benchmarking with a linear model: r2 = 0.73, RMSE = 6.5 min
Reducing the number of features:
1. Ensemble partial least squares regression: r2 = 0.72, RMSE = 6.6 min
2. Linear model: r2 = 0.71, RMSE = 6.7 min
3. Lasso regression: r2 = 0.69, RMSE = 6.8 min
Other models with these features:
1. Ridge regression: r2 = 0.72, RMSE = 6.7 min
2. Random forest regression: r2 = 0.67, RMSE = 7.1 min
(3-fold cross-validation; metrics are regression r2 and RMSE)
Validation on 69 runners: r2 = 0.63, RMSE = 7.2 min
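The lasso's role in feature reduction above is that its L1 penalty drives weak coefficients exactly to zero, so the surviving features form the reduced set. A minimal coordinate-descent sketch on synthetic data (not the original runner features):

```python
# Hedged sketch: lasso regression by coordinate descent with soft-thresholding.
# Features assumed roughly standardized; synthetic data for illustration only.
import numpy as np

def lasso_cd(X, y, alpha=0.1, n_iter=200):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's contribution removed.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            # Soft-thresholding: coefficients inside [-alpha, alpha] snap to 0.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
# Only the first two of six features actually matter.
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=300)
w = lasso_cd(X, y, alpha=0.2)
print(np.round(w, 2))  # the four irrelevant coefficients shrink to (near) zero
```

On the runner data, the nonzero coefficients would identify which of the 22 features to carry into the other models.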
15. Results
Variable importance (rf$finalModel, increase in node purity / IncNodePurity)
[Figure: variable importance plot; features ranked: pace past month, 5K time, rest days, SD pace, weight, long days, age, gender]
Your average pace over the past month is the most important feature by far.
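IncNodePurity is specific to R's random forests; a model-agnostic analogue is permutation importance: shuffle one feature and measure how much predictive r2 drops. A minimal sketch with synthetic stand-ins for pace, 5K time, etc. (using a linear model rather than a forest, purely for brevity):

```python
# Hedged sketch: permutation importance as an alternative to node purity.
# Breaking one feature's link to the target and re-scoring shows how much
# the model relied on it. Synthetic data, illustrative feature roles.
import numpy as np

def r2_score(y, pred):
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))            # stand-ins for pace, 5K time, age
y = 6 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1, size=400)

Xd = np.c_[np.ones(len(y)), X]
w, *_ = np.linalg.lstsq(Xd, y, rcond=None)
base = r2_score(y, Xd @ w)

importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's information
    importance.append(base - r2_score(y, np.c_[np.ones(len(y)), Xp] @ w))
print(np.round(importance, 3))  # the dominant "pace" stand-in ranks first
```

Either measure should recover the slide's finding: the feature carrying most of the signal (pace over the past month) dominates the ranking.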
Editor's Notes
Start with asking if anyone is a runner. Be excited about the problem.
Try elastic net regression (lasso and ridge combination)
Motivation: Add what specifically works (main drivers of improving performance)
Figure of errors vs actual times
Start with 20 features in LM
Feature selection (PLS regression, Lasso)
Compare the same features in different models
Improve usability
Focus on the model I used. Introduce motivations before talking about the models (simplicity/usability and results)