How to Reduce Scikit-Learn Training Time
1. How to Reduce Scikit-Learn Training Time
Michael Galarnyk
mgalarnyk@anyscale.com
@galarnykmichael
2. About Me
● Current: Developer Relations at Anyscale; writing about distributed computing
● Previous: Data Scientist at Scripps Research; understanding activity data
● Teaching Python/Data Science/Analytics at Stanford Continuing Studies, LinkedIn Learning, etc.
3. Why Speed Up Scikit-Learn?
● Scikit-Learn is an easy-to-use Python library for machine learning
● Sometimes scikit-learn models can take a long time to train
4. Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or distribute your training
6. Upgrade your scikit-learn
● pip install --upgrade scikit-learn
● Why? Sometimes scikit-learn releases include efficiency improvements
● The release notes can also give you ideas for how to improve your workflow (a quick version check follows)
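A minimal sketch to confirm which release you are on after upgrading (the pip command above does the actual upgrade):

import sklearn
# Compare this against the changelog at https://scikit-learn.org/stable/whats_new.html
print(sklearn.__version__)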
7. Histogram Boosting Improvements
● HistGradientBoostingClassifier and HistGradientBoostingRegressor now have support for categorical features
● This saves time compared to one-hot encoding (by far the slowest approach)
8. Histogram Boosting Improvements
● One-hot encoding creates one additional feature per category value for each categorical feature, which means more split points to evaluate during fitting (and deeper trees)
9. Histogram Boosting Improvements
● Native handling of categorical features should be slightly slower than treating categories as ordered quantities (‘Ordinal’), since native handling requires sorting the categories
10. Histogram Boosting Improvements
● However, speed isn’t everything: treating categorical features as continuous (ordinal) values is not always the best approach (a minimal usage sketch follows)
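A minimal sketch of the native categorical support, assuming scikit-learn >= 0.24; the synthetic data, column layout, and parameter choices are illustrative:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
# Column 0: a categorical feature encoded as small non-negative integers;
# column 1: an ordinary continuous feature.
X = np.column_stack([rng.randint(0, 10, size=1000), rng.rand(1000)])
y = rng.randint(0, 2, size=1000)

# Marking column 0 as categorical lets the trees split on category subsets
# instead of treating the integer codes as ordered values.
clf = HistGradientBoostingClassifier(categorical_features=[0])
clf.fit(X, y)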
11. Fast Approximation of Polynomial Feature Expansion
● The PolynomialFeatures transformer creates interaction terms and higher-order polynomial terms from your features
● It returns squared features and interaction terms (more if you want higher-order polynomials)
● It can be slow
12. Fast Approximation of Polynomial Feature Expansion
● The PolynomialCountSketch kernel approximation function provides a faster alternative: the features it creates approximate the full polynomial feature expansion
13. Fast Approximation of Polynomial Feature Expansion
● The drawbacks are reduced interpretability and possibly some accuracy (a rough sketch follows)
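A rough sketch comparing the exact expansion with the sketched approximation; the dataset, pipeline, and parameter choices are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=5000, n_features=100, random_state=0)

# Exact degree-2 expansion: roughly n_features**2 / 2 output columns, slow to build.
exact = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000))
exact.fit(X, y)

# Sketched approximation: a fixed number of output components, regardless of
# how many exact polynomial terms there would be.
approx = make_pipeline(
    PolynomialCountSketch(degree=2, n_components=300, random_state=0),
    LogisticRegression(max_iter=1000),
)
approx.fit(X, y)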
14. Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or distribute your training
15. Changing your Solver
● Many learning problems are formulated as minimization of some loss function on a training set of examples
● Optimization functions (solvers) help with that minimization
16. Changing your Solver: Idea
● Better algorithms allow you to make better use of the same hardware
● With a more efficient algorithm, you can produce an optimal model faster
Image from Gaël Varoquaux’s talk
17. Changing your Solver: Idea
● A full-gradient algorithm (liblinear) converges rapidly, but each iteration (shown as a white +) can be costly because it requires you to use all of the data
Image from Gaël Varoquaux’s talk
18. Changing your Solver: Idea
● In the sub-sampled approach, each iteration is cheap to compute, but it can converge much more slowly
Image from Gaël Varoquaux’s talk
19. Changing your Solver: Idea
● Some algorithms, like ‘saga’, achieve the best of both worlds: each iteration is cheap to compute, and the algorithm converges rapidly thanks to a variance reduction technique
Image from Gaël Varoquaux’s talk
20. Changing your Solver: Practice
● Choosing the right solver for a problem can save a lot of time
21. Changing your Solver: Practice
● The code shown was for a single-class (binary) problem, not multiclass
22. Changing your Solver: Practice
● Some solvers, such as ‘liblinear’, are only for single-class (binary) problems
24. Changing your Solver: Practice
● The documentation gives good suggestions on which solver to use for different use cases (a timing sketch follows)
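A minimal sketch of swapping solvers on the same problem; the dataset size and solver list are illustrative, and relative timings will vary with the shape and sparsity of your data:

import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=20000, n_features=200, random_state=0)

# The solver is just a constructor argument; everything else stays the same.
for solver in ["liblinear", "lbfgs", "saga"]:
    start = time.perf_counter()
    LogisticRegression(solver=solver, max_iter=1000).fit(X, y)
    print(solver, f"{time.perf_counter() - start:.2f}s")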
25. Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or distribute your training
26. Different hyperparameter optimization techniques
● To achieve high performance with most scikit-learn algorithms, you need to tune a model’s hyperparameters
● Hyperparameters are the parameters of a model that are not updated during training
27. Different hyperparameter optimization techniques
● Scikit-Learn contains a couple of techniques for hyperparameter tuning, like grid search (GridSearchCV) and random search (RandomizedSearchCV)
28. Different hyperparameter optimization techniques
● Grid search: exhaustively considers all parameter combinations
● Random search: samples a given number of candidates from a parameter space with a specified distribution
Grid search can sometimes miss the optimal region of an important parameter
29. Grid Search Benefits and Drawbacks
● Benefits: Explainable, easily parallelizable
● Drawbacks: Inefficient/expensive

for rate in [0.1, 0.01, 0.001]:
    for hidden_layers in [2, 3, 4]:
        for param in ["a", "b", "c"]:
            train_model(rate, hidden_layers, param)

Grid search (pseudocode, not scikit-learn): the cross-product of all possible configurations (3 x 3 x 3 = 27 evaluations)
32. Random Search Benefits and Drawbacks
● Benefits: Easily parallelizable, hard to beat in high dimensions
● Drawbacks: Less explainable, inefficient/expensive

for i in range(num_samples):
    train_model(alpha=sample(0.001, 0.1),
                hidden_layers=sample(2, 4),
                param=sample(["a", "b", "c"]))

Random search (pseudocode, not scikit-learn): num_samples configurations drawn at random
33. Random Search Benefits and Drawbacks
● Benefits: Easily parallelizable, hard to beat in high dimensions
● Drawbacks: Inefficient/expensive
RandomizedSearchCV
34. Random Search Benefits and Drawbacks
● Benefits: Easily parallelizable, hard to beat in high dimensions
● Drawbacks: Inefficient/expensive
Fewer configurations tested with relatively good performance (a scikit-learn sketch of both searches follows)
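A hedged sketch of the two built-in searches; the estimator, grid, and distributions here are illustrative choices (loguniform needs scipy >= 1.4):

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Grid search: exhaustive cross-product of the listed values (3 x 3 = 9 candidates).
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, n_jobs=-1)
grid.fit(X, y)

# Random search: n_iter candidates sampled from the given distributions.
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e0)},
    n_iter=10, random_state=0, n_jobs=-1,
)
rand.fit(X, y)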
35. Different hyperparameter optimization techniques
● Scikit-learn recently added new techniques such as halving grid search (HalvingGridSearchCV) and halving random search (HalvingRandomSearchCV)
Successive halving
36. Different hyperparameter optimization techniques
● Hyperparameter candidates are evaluated with a small amount of resources at the first iteration, and the more promising candidates are selected and given more resources during each successive iteration (a sketch follows)
Successive halving
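A minimal sketch of successive halving; it was experimental in scikit-learn 0.24, hence the enable_halving_search_cv import, and the estimator, grid, and resource budget here are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [3, 5, None], "min_samples_split": [2, 5, 10]},
    resource="n_estimators",  # candidates start with few trees...
    max_resources=200,        # ...and the survivors get more each round
    random_state=0,
)
search.fit(X, y)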
37. Different hyperparameter optimization techniques
● There is a library called Tune-sklearn that provides cutting-edge hyperparameter tuning techniques (Bayesian optimization, early stopping, and distributed execution)
Early stopping in action.
38. Different hyperparameter optimization techniques
● Hyperparameter set 2 is a set of unpromising hyperparameters that would be detected by Tune-sklearn’s early stopping mechanisms and stopped early to avoid wasting time and resources
39. Features of Tune-sklearn
● Consistency with the scikit-learn API
● Accessibility to modern hyperparameter tuning techniques
● Scalability
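A hedged sketch of Tune-sklearn’s drop-in replacement for the scikit-learn searches; it assumes pip install tune-sklearn "ray[tune]" scikit-optimize, and the estimator and search ranges are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from tune_sklearn import TuneSearchCV

X, y = make_classification(n_samples=5000, random_state=0)

search = TuneSearchCV(
    SGDClassifier(),                 # supports partial_fit, so it can be early-stopped
    {"alpha": (1e-4, 1e-1)},         # a range to sample from
    search_optimization="bayesian",  # modern search instead of grid/random
    early_stopping=True,             # stop unpromising configurations early
    max_iters=10,                    # epoch budget per configuration
    n_trials=10,
)
search.fit(X, y)
print(search.best_params_)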
40. Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or distribute your training
41. Parallelize or Distribute your training
● Another way to increase your model building speed is to parallelize or distribute your training with joblib and Ray
Resources (dark blue) that scikit-learn can utilize for single-core (A), multi-core (B), and multi-node (C) training
42. Parallelize or Distribute your training
● By default, scikit-learn trains a model using a single core
44. Parallelize or Distribute your training
● For this talk, you can think of the MacBook as a single node with 4 cores
45. Parallelize or Distribute your training
● Using multiple cores can speed up the training of your model
46. Parallelize or Distribute your training
● This is especially true if your model has a high degree of parallelism
A random forest® is an easy model to parallelize, as each decision tree is independent of the others.
47. Parallelize or Distribute your training
● Scikit-Learn can parallelize training on a single node with joblib, which by default uses the ‘loky’ backend
48. Parallelize or Distribute your training
● Joblib allows you to choose between backends like ‘loky’, ‘multiprocessing’, ‘dask’, and ‘ray’
49. Parallelize Training Example
● The ‘loky’ backend is optimized for a single node, not for running distributed (multi-node) applications (a sketch follows)
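A minimal sketch of switching joblib backends; it assumes pip install ray, and the forest size and data are illustrative:

import joblib
from ray.util.joblib import register_ray
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, random_state=0)
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)  # n_jobs=-1: use every core the backend provides

# Default ‘loky’ backend: parallel across the cores of one node.
with joblib.parallel_backend("loky"):
    clf.fit(X, y)

# ‘ray’ backend: the same fit() call can fan out across a Ray cluster.
register_ray()
with joblib.parallel_backend("ray"):
    clf.fit(X, y)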
51. Distributed Application Challenges
● Scheduling tasks across multiple machines
● Transferring data efficiently
● Recovering from machine failures
Distributed computing doesn’t always go the way you would hope
52. ‘ray’ backend
‘ray’ can handle the details for you, keep things simple, and lead to better performance
Ray makes parallel and distributed processing work like you would hope
54. ‘ray’ backend
Performance was measured on one, five, and ten m5.8xlarge nodes with 32 cores each. The performance of Loky and Multiprocessing does not depend on the number of machines (they run on a single machine).
55. ‘ray’ backend
Ray performs a lot better in the random forest benchmark. This training used 45,000 trees (estimators), which resulted in 45,000 tasks being submitted.
56. ‘ray’ backend
Ray’s high-throughput decentralized scheduler, along with shared memory, allows Ray to scale the workload to multiple nodes.
57. ‘ray’ backend
Note that the performance improvement increases as we add nodes, but it is bottlenecked mainly by the serial part of the program (Amdahl’s law).
59. What is Ray
Ray is an open source library for parallel and distributed Python.
The Ray ecosystem consists of the core Ray system, scalable libraries for machine learning, and tools for launching clusters on any cluster manager or cloud provider.
Note that there is also improved memory usage (https://scikit-learn.org/stable/whats_new/v0.24.html#sklearn-ensemble) and missing-value support.