2. Runners want to run faster
What goal should you set for your half-marathon time?
How can you improve on that time?
3. Data from Strava.com
Time series, demographic, and aggregated running data on
10,000 runners, of whom 1,000 have half-marathon times and other features.
[Figure: HalfMarathonTime vs. rests per week]
[Figure: LogHalfMarathonTime vs. log month distance; legend: clean5$mnth_pace]
4. Data from Strava.com
[Figure: HalfMarathonTime vs. age (years)]
Distance past month       Weight range
Time past month           Age range
Pace past month           Number of rest days/wk
Distance past 6 months    Number of long days/wk
Gender                    Sdev pace
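Several of the features listed above (distance, time, pace, and the standard deviation of pace over the past month) are summaries of a runner's activity log. The following is a minimal sketch of how such summaries could be derived. It is not the author's code: the run log, the units (miles and minutes), the feature names, and the choice of a distance-weighted monthly pace are all assumptions made for illustration.

```python
from statistics import stdev

# Hypothetical run log for one runner over the past month:
# (distance in miles, duration in minutes). Values are invented.
runs = [(3.0, 27.0), (5.0, 47.5), (10.0, 100.0), (4.0, 34.0)]


def month_features(runs):
    """Summarize a month of runs into distance, time, pace, and pace spread."""
    total_distance = sum(d for d, _ in runs)
    total_time = sum(t for _, t in runs)
    paces = [t / d for d, t in runs]  # minutes per mile, one value per run
    return {
        "distance_past_month": total_distance,
        "time_past_month": total_time,
        # Assumption: monthly pace is total time over total distance,
        # so longer runs count for more than shorter ones.
        "pace_past_month": total_time / total_distance,
        # Sample standard deviation of the per-run paces.
        "sd_pace": stdev(paces),
    }


features = month_features(runs)
```

Averaging the per-run paces directly would be an equally plausible definition; the slides do not say which one was used.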
[Figure: HalfMarathonTime (min) vs. weight (lbs), by gender (F, M)]
5. Analysis
3-fold cross-validation
Model                                   r²      RMSE
Benchmark: linear model                 0.49    10 min
Regularized and nonlinear regression modeling:
1. Lasso regression                     0.48    10 min
2. Ridge regression                     0.48    10 min
3. Random forest regression             0.66    8.3 min

Validation:
179 runners                             0.79    6.2 min
The higher validation score seems to be related to a different distribution in the
test set, possibly because of the importance of outliers.
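The r² and RMSE figures above come from 3-fold cross-validation. The sketch below shows how those two scores can be computed on held-out folds. It is not the analysis from the slides, whose plot labels suggest it was done in R: it uses plain Python, a one-predictor least-squares line as a stand-in model, and synthetic pace and race-time data, all of which are assumptions for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def r2_and_rmse(ys, preds):
    """Coefficient of determination and root-mean-square error."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(ys))


def cross_validate(xs, ys, k=3, seed=0):
    """Average held-out r2 and RMSE over k shuffled folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint index sets
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [b0 + b1 * xs[i] for i in held_out]
        scores.append(r2_and_rmse([ys[i] for i in held_out], preds))
    mean_r2 = sum(s[0] for s in scores) / k
    mean_rmse = sum(s[1] for s in scores) / k
    return mean_r2, mean_rmse


# Synthetic data (not from Strava): pace in min/mile, race time in minutes.
rng = random.Random(1)
pace = [rng.uniform(7.0, 11.0) for _ in range(90)]
half_time = [13.1 * p + rng.gauss(0, 5) for p in pace]

cv_r2, cv_rmse = cross_validate(pace, half_time)
```

Because each fold is scored only on runners the model did not see during fitting, the averaged scores estimate out-of-sample performance, which is what makes them comparable to the separate validation set.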
6. Results
Your average pace over the past month is the most
important feature by far.
[Figure: variable importance from rf$finalModel; x-axis: increase in node purity (IncNodePurity). Features shown:]
Pace past month
Distance past month
Distance past 6 months
Elevation past month
Rest days
SD pace
Weight
Long days
Age
Gender
7. About me: Alexis Yelton, MIT postdoc
Chitinase in marine cyanobacteria
[Figure: chitinase activity]
My first half marathon: 1:56:30
Personal best: 1:47:56
Editor's Notes
Start with asking if anyone is a runner. Be excited about the problem.
Try elastic net regression (lasso and ridge combination)
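The elastic net mentioned in the note above blends the lasso (L1) and ridge (L2) penalties into a single objective. One common way to write it, following the parameterization used by the R package glmnet, is sketched here. The symbols are not from the slides: $n$ is the number of runners, $y_i$ the half-marathon time, $x_i$ the feature vector, $\lambda \ge 0$ the overall penalty strength, and $\alpha \in [0, 1]$ the mixing weight.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
+ \lambda \left( \alpha \,\lVert \beta \rVert_1
+ \frac{1-\alpha}{2} \,\lVert \beta \rVert_2^2 \right)
```

Setting $\alpha = 1$ recovers the lasso and $\alpha = 0$ recovers ridge regression, so both models in the analysis are special cases of this objective.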