Practical Machine Learning
• Your model makes unacceptably large errors on new data. What to do next?
• Collect more training samples
• Reduce number of features
• Increase number of features
• Regularization
• Bigger Model
• Hyper-parameter tuning
Bias vs. Variance
[Figure: three fits of f(x) vs. x: high bias (underfit), "just right", high variance (overfit)]
Bias vs. Variance – Machine Learning perspective
• Optimal error rate (e.g. Bayes rate, best human error)
• Training error
• Validation error

Data split: Training | Validation | Test

                    High bias   High variance   Both
Optimal error          1%            1%          1%
Training error         5%            2%          5%
Validation error       6%            6%         10%

Bias = training error - optimal error; Variance = validation error - training error.
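This decomposition is mechanical enough to write down. A minimal sketch in Python (the function name and the percent-as-integer convention are my own):

```python
def diagnose(optimal, train, val):
    """Split error rates (in whole percent) into bias and variance.

    Bias     = training error - optimal (Bayes) error
    Variance = validation error - training error
    """
    return {"bias": train - optimal, "variance": val - train}

# The three cases from the table above:
print(diagnose(1, 5, 6))   # high bias:     {'bias': 4, 'variance': 1}
print(diagnose(1, 2, 6))   # high variance: {'bias': 1, 'variance': 4}
print(diagnose(1, 5, 10))  # both:          {'bias': 4, 'variance': 5}
```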
Data from different distributions/domains
• Training data: 50-hour conversational speech; Test data: 10-hour call-center speech
• Split: Train | Train-Val | Val | Test (Train-Val is a held-out slice of the training distribution)

Example error chain:
• Optimal error rate (e.g. Bayes rate, best human error): 1%
• Training error: 5% (gap to optimal = bias)
• Train-Val error: 6% (gap to training = variance)
• Validation error: 10% (gap to Train-Val = train-test mismatch)
• Test error: 20% (gap to validation = overfitting of Val)
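With the extra Train-Val split, each successive gap gets its own label. A sketch extending the same idea (names are my own):

```python
def diagnose_with_mismatch(optimal, train, train_val, val, test):
    """Attribute each successive error gap (whole percent) to one cause."""
    return {
        "bias": train - optimal,                 # model underfits
        "variance": train_val - train,           # overfits the training set
        "train_test_mismatch": val - train_val,  # distributions differ
        "val_overfitting": test - val,           # tuned too hard on Val
    }

# Error chain from the example above: 1% / 5% / 6% / 10% / 20%
print(diagnose_with_mismatch(1, 5, 6, 10, 20))
# {'bias': 4, 'variance': 1, 'train_test_mismatch': 4, 'val_overfitting': 10}
```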
Workflow (courtesy of Andrew Ng)
Training error high?
  Yes → Bigger model / Train longer / New model architecture
  No ↓
Train-Val error high?
  Yes → More data / Regularization / New model architecture
  No ↓
Val error high?
  Yes → More data similar to test / Data synthesis / New model architecture
  No ↓
Test error high?
  Yes → More validation data
  No → Done
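The flowchart is easy to encode as a top-down check. A sketch (function name and return format are my own):

```python
def next_actions(train_high, train_val_high, val_high, test_high):
    """Walk the decision flow above and return the suggested remedies."""
    if train_high:
        return ["bigger model", "train longer", "new model architecture"]
    if train_val_high:
        return ["more data", "regularization", "new model architecture"]
    if val_high:
        return ["more data similar to test", "data synthesis",
                "new model architecture"]
    if test_high:
        return ["more validation data"]
    return ["done"]

print(next_actions(False, True, False, False))
# ['more data', 'regularization', 'new model architecture']
```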
Learning curves
[Figure: error vs. amount of training data, with Train and Validation curves]
• High bias: the Train and Validation curves converge at a high error. Getting more data likely doesn't help much.
• High variance: a large gap remains between the Train and Validation curves. Getting more data is likely to help.
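The same read-off can be automated once the two curves are measured. A minimal sketch, assuming errors are fractions and the gap threshold is a judgment call:

```python
def more_data_likely_helps(train_errs, val_errs, gap_tol=0.02):
    """Inspect the tail of a learning curve.

    A small remaining train/validation gap suggests high bias (more data
    won't help much); a large gap suggests high variance (more data
    likely helps).
    """
    gap = val_errs[-1] - train_errs[-1]
    return gap > gap_tol

# High bias: curves already converged near 15% error
print(more_data_likely_helps([0.15, 0.145, 0.14], [0.17, 0.155, 0.15]))  # False
# High variance: a wide gap persists
print(more_data_likely_helps([0.02, 0.03, 0.04], [0.25, 0.18, 0.14]))   # True
```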
Working with imbalanced datasets
• Change your performance metric (e.g. F1 score instead of Accuracy)
• Customize objective function
• Data:
• Oversampling/Undersampling
• Synthesize minority class (e.g. SMOTE)
• Buy more data
• Algorithms:
• Bagging
• New/Other models
• Different perspective, e.g. anomaly detection
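The first bullet is easy to see with numbers. A self-contained sketch (helper names are mine) comparing accuracy and F1 on a 99:1 class split:

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def f1(tp, fp, fn):
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 990 negatives, 10 positives. Always predicting "negative" scores
# 99% accuracy while catching no positives at all:
print(accuracy(0, 0, 990, 10), f1(0, 0, 10))  # 0.99 0.0
# A model catching half the positives has the same accuracy, better F1:
print(accuracy(5, 5, 985, 5), f1(5, 5, 5))    # 0.99 0.5
```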
Dirty work drives progress
