6. Analysis
Benchmarking with a linear model: r2 = 0.73, RMSE = 6.5 min
Reducing the number of features, ensemble partial least squares regression: r2 = 0.73, RMSE = 6.4 min
(5-fold cross-validation; metrics are regression r2 and RMSE)
Validation on 72 runners: r2 = 0.63, RMSE = 7.2 min
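The r2 and RMSE numbers above come from cross-validation. As a minimal sketch of how such figures are produced (the talk's actual analysis was done in R; this NumPy version substitutes an ordinary least-squares fit and synthetic data for illustration):

```python
# Hedged sketch: k-fold cross-validated r2 and RMSE for a linear model.
# All data here is synthetic; the original analysis used real runner data in R.
import numpy as np

def kfold_r2_rmse(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    preds = np.empty(len(y))
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column and fit by least squares on the training fold.
        Xtr = np.c_[np.ones(len(train)), X[train]]
        Xte = np.c_[np.ones(len(test)), X[test]]
        w, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        preds[test] = Xte @ w
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y - preds) ** 2))
    return r2, rmse

# Synthetic stand-in: half-marathon time (minutes) driven mostly by one feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 110 + 8 * X[:, 0] + rng.normal(scale=2, size=200)
r2, rmse = kfold_r2_rmse(X, y, k=5)
print(f"r2={r2:.2f}, RMSE={rmse:.1f} min")
```

Out-of-fold predictions are pooled before computing r2 and RMSE, so each runner is scored by a model that never saw them.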
7. About me: Alexis Yelton, MIT postdoc
Research: chitinase activity in marine cyanobacteria
My first half marathon: 1:56:30
Personal best: 1:47:56
8. 22 Features
Month Distance, Month Runs, Month Elevation, Month Pace, Month Time
6 Month Distance, 6 Month Runs, 6 Month Elevation, 6 Month Pace, 6 Month Time
Age Range, Weight Range, Gender
Rest Days / Week, Fast Days / Week, Long Days / Week
5K Time, Marathon Time, Minimum Pace, Minimum Pace > 2 mi, Minimum Pace > 3 mi, SD Pace
9. Results
[Figure: absolute training error, sqrt(errorstrain^2), plotted against half marathon time (train_sub$HALF.MARATHON)]
Errors vary with half marathon time. A larger data set would allow for better predictions for faster and slower runners.
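The pattern the slide describes, larger errors for the fastest and slowest runners, can be checked by binning absolute errors by actual finish time. A minimal sketch with synthetic data (the R plot's sqrt(errors^2) is simply the absolute error; variable names here are illustrative, not from the original code):

```python
# Hedged sketch: does prediction error vary across the range of finish times?
# Fit a linear model, then compare mean |error| for fast, middle, and slow runners.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 110 + 10 * X[:, 0] + rng.normal(scale=3, size=300)  # synthetic times, minutes

Xd = np.c_[np.ones(len(y)), X]
w, *_ = np.linalg.lstsq(Xd, y, rcond=None)
abs_err = np.abs(y - Xd @ w)

# Mean |error| in three equal-sized bins: fast, middle, slow runners.
edges = np.quantile(y, [0, 1 / 3, 2 / 3, 1])
bins = np.digitize(y, edges[1:-1])
for name, b in zip(["fast", "middle", "slow"], range(3)):
    print(f"{name:>6}: mean |error| = {abs_err[bins == b].mean():.2f} min")
```

With real data, a clear U-shape across bins would support the slide's point that the tails of the time distribution are underrepresented.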
13. Analysis
Benchmarking with a linear model: r2 = 0.73, RMSE = 6.5 min
Dealing with collinear features (and reducing the number of features):
1. Ensemble partial least squares regression: r2 = 0.72, RMSE = 6.6 min
2. Linear model: r2 = 0.71, RMSE = 6.7 min
3. Lasso regression: r2 = 0.69, RMSE = 6.8 min
4. Ridge regression: r2 = 0.72, RMSE = 6.7 min
5. Random forest regression: r2 = 0.67, RMSE = 7.1 min
(3-fold cross-validation; metrics are regression r2 and RMSE)
Validation on 69 runners: r2 = 0.63, RMSE = 7.2 min
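Ridge regression, one of the models compared above, handles collinear features (e.g. month pace vs. 6-month pace) by shrinking coefficients rather than letting them blow up. A minimal closed-form sketch on synthetic data, not the original runner data:

```python
# Hedged sketch: ridge regression via its closed form,
#   w = (X^T X + alpha * I)^(-1) X^T y
# with centering so the intercept is left unpenalized.
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_feat = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_feat), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Two nearly collinear features: OLS coefficients would be unstable,
# but ridge splits the shared signal between them and keeps both finite.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # almost a copy of x1
X = np.c_[x1, x2]
y = 3 * x1 + rng.normal(scale=0.1, size=200)
w, b = ridge_fit(X, y, alpha=1.0)
print(w, b)  # the two coefficients sum to roughly 3
```

The penalty trades a little bias for much lower variance, which is why ridge's cross-validated r2 here nearly matches the full linear model despite the collinearity.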
14. Analysis
Benchmarking with a linear model: r2 = 0.73, RMSE = 6.5 min
Reducing the number of features:
1. Ensemble partial least squares regression: r2 = 0.72, RMSE = 6.6 min
2. Linear model: r2 = 0.71, RMSE = 6.7 min
3. Lasso regression: r2 = 0.69, RMSE = 6.8 min
Other models with these features:
1. Ridge regression: r2 = 0.72, RMSE = 6.7 min
2. Random forest regression: r2 = 0.67, RMSE = 7.1 min
(3-fold cross-validation; metrics are regression r2 and RMSE)
Validation on 69 runners: r2 = 0.63, RMSE = 7.2 min
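The lasso's role in feature reduction above is that its L1 penalty drives weak coefficients exactly to zero, so the surviving features form the reduced set. A minimal coordinate-descent sketch on synthetic data (not the original runner features):

```python
# Hedged sketch: lasso regression by coordinate descent with soft-thresholding.
# Features assumed roughly standardized; synthetic data for illustration only.
import numpy as np

def lasso_cd(X, y, alpha=0.1, n_iter=200):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's contribution removed.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            # Soft-thresholding: coefficients inside [-alpha, alpha] snap to 0.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
# Only the first two of six features actually matter.
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=300)
w = lasso_cd(X, y, alpha=0.2)
print(np.round(w, 2))  # the four irrelevant coefficients shrink to (near) zero
```

On the runner data, the nonzero coefficients would identify which of the 22 features to carry into the other models.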
15. Results
Variable importance (rf$finalModel, increase in node purity / IncNodePurity)
[Figure: variable importance plot; features ranked: pace past month, 5K time, rest days, SD pace, weight, long days, age, gender]
Your average pace over the past month is the most important feature by far.
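IncNodePurity is specific to R's random forests; a model-agnostic analogue is permutation importance: shuffle one feature and measure how much predictive r2 drops. A minimal sketch with synthetic stand-ins for pace, 5K time, etc. (using a linear model rather than a forest, purely for brevity):

```python
# Hedged sketch: permutation importance as an alternative to node purity.
# Breaking one feature's link to the target and re-scoring shows how much
# the model relied on it. Synthetic data, illustrative feature roles.
import numpy as np

def r2_score(y, pred):
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))            # stand-ins for pace, 5K time, age
y = 6 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1, size=400)

Xd = np.c_[np.ones(len(y)), X]
w, *_ = np.linalg.lstsq(Xd, y, rcond=None)
base = r2_score(y, Xd @ w)

importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's information
    importance.append(base - r2_score(y, np.c_[np.ones(len(y)), Xp] @ w))
print(np.round(importance, 3))  # the dominant "pace" stand-in ranks first
```

Either measure should recover the slide's finding: the feature carrying most of the signal (pace over the past month) dominates the ranking.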
Editor's Notes
Start with asking if anyone is a runner. Be excited about the problem.
Try elastic net regression (lasso and ridge combination)
Motivation: Add what specifically works (main drivers of improving performance)
Figure of errors vs actual times
Start with 20 features in LM
Feature selection (PLS regression, Lasso)
Compare the same features in different models
Improve usability
Focus on the model I used. Introduce motivations before talking about the models (simplicity/usability and results)