How to Reduce Scikit-Learn Training Time
1. How to Reduce Scikit-Learn Training Time
Michael Galarnyk
mgalarnyk@anyscale.com
@galarnykmichael
2. About Me
● Current: Developer Relations at Anyscale; writing about distributed computing
● Previous: Data Scientist at Scripps Research; understanding activity data
● Teaching Python/Data Science/Analytics at Stanford Continuing Studies, LinkedIn Learning, etc.
3. Why Speed Up Scikit-Learn?
● Scikit-Learn is an easy-to-use Python library for machine learning
● Sometimes scikit-learn models can take a long time to train
4. Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or distribute your training
6. Upgrade your scikit-learn
● pip install --upgrade scikit-learn
● Why? Sometimes scikit-learn releases include efficiency improvements
● The release notes can also give you ideas for how to improve your workflow (a quick version check follows)
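A minimal sketch to confirm which release you are on after upgrading (the pip command above does the actual upgrade):

import sklearn
# Compare this against the changelog at https://scikit-learn.org/stable/whats_new.html
print(sklearn.__version__)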
7. Histogram Boosting Improvements
● HistGradientBoostingClassifier and HistGradientBoostingRegressor now have support for categorical features
● This saves time compared to one-hot encoding (by far the slowest approach)
8. Histogram Boosting Improvements
● One-hot encoding creates one additional feature per category value for each categorical feature, which means more split points to evaluate during fitting (and deeper trees)
9. Histogram Boosting Improvements
● Native handling of categorical features should be slightly slower than treating categories as ordered quantities (‘Ordinal’), since native handling requires sorting the categories
10. Histogram Boosting Improvements
● However, speed isn’t everything: treating categorical features as continuous (ordinal) values is not always the best approach (a minimal usage sketch follows)
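A minimal sketch of the native categorical support, assuming scikit-learn >= 0.24; the synthetic data, column layout, and parameter choices are illustrative:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
# Column 0: a categorical feature encoded as small non-negative integers;
# column 1: an ordinary continuous feature.
X = np.column_stack([rng.randint(0, 10, size=1000), rng.rand(1000)])
y = rng.randint(0, 2, size=1000)

# Marking column 0 as categorical lets the trees split on category subsets
# instead of treating the integer codes as ordered values.
clf = HistGradientBoostingClassifier(categorical_features=[0])
clf.fit(X, y)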
11. Fast Approximation of Polynomial Feature Expansion
● The PolynomialFeatures transformer creates interaction terms and higher-order polynomial terms from your features
● It returns squared features and interaction terms (more if you want higher-order polynomials)
● It can be slow
12. Fast Approximation of Polynomial Feature Expansion
● The PolynomialCountSketch kernel approximation function provides a faster alternative: the features it creates approximate the full polynomial feature expansion
13. Fast Approximation of Polynomial Feature Expansion
● The drawbacks are reduced interpretability and possibly some accuracy (a rough sketch follows)
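A rough sketch comparing the exact expansion with the sketched approximation; the dataset, pipeline, and parameter choices are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=5000, n_features=100, random_state=0)

# Exact degree-2 expansion: roughly n_features**2 / 2 output columns, slow to build.
exact = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000))
exact.fit(X, y)

# Sketched approximation: a fixed number of output components, regardless of
# how many exact polynomial terms there would be.
approx = make_pipeline(
    PolynomialCountSketch(degree=2, n_components=300, random_state=0),
    LogisticRegression(max_iter=1000),
)
approx.fit(X, y)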
14. Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or distribute your training
15. Changing your Solver
● Many learning problems are formulated as minimization of some loss function on a training set of examples
● Optimization functions (solvers) help with that minimization
16. Changing your Solver: Idea
● Better algorithms allow you to make better use of the same hardware
● With a more efficient algorithm, you can produce an optimal model faster
Image from Gaël Varoquaux’s talk
17. Changing your Solver: Idea
● A full-gradient algorithm (liblinear) converges rapidly, but each iteration (shown as a white +) can be costly because it requires you to use all of the data
Image from Gaël Varoquaux’s talk
18. Changing your Solver: Idea
● In the sub-sampled approach, each iteration is cheap to compute, but it can converge much more slowly
Image from Gaël Varoquaux’s talk
19. Changing your Solver: Idea
● Some algorithms, like ‘saga’, achieve the best of both worlds: each iteration is cheap to compute, and the algorithm converges rapidly thanks to a variance reduction technique
Image from Gaël Varoquaux’s talk
20. Changing your Solver: Practice
● Choosing the right solver for a problem can save a lot of time
21. Changing your Solver: Practice
● The code shown was for a single-class (binary) problem, not multiclass
22. Changing your Solver: Practice
● Some solvers, such as ‘liblinear’, are only for single-class (binary) problems
24. Changing your Solver: Practice
● The documentation gives good suggestions on which solver to use for different use cases (a timing sketch follows)
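A minimal sketch of swapping solvers on the same problem; the dataset size and solver list are illustrative, and relative timings will vary with the shape and sparsity of your data:

import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=20000, n_features=200, random_state=0)

# The solver is just a constructor argument; everything else stays the same.
for solver in ["liblinear", "lbfgs", "saga"]:
    start = time.perf_counter()
    LogisticRegression(solver=solver, max_iter=1000).fit(X, y)
    print(solver, f"{time.perf_counter() - start:.2f}s")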
25. Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or distribute your training
26. Different hyperparameter optimization techniques
● To achieve high performance with most scikit-learn algorithms, you need to tune a model’s hyperparameters
● Hyperparameters are the parameters of a model that are not updated during training
27. Different hyperparameter optimization techniques
● Scikit-Learn contains a couple of techniques for hyperparameter tuning, like grid search (GridSearchCV) and random search (RandomizedSearchCV)
28. Different hyperparameter optimization techniques
● Grid search: exhaustively considers all parameter combinations
● Random search: samples a given number of candidates from a parameter space with a specified distribution
Grid search can sometimes miss the optimal region of an important parameter
29. Grid Search Benefits and Drawbacks
● Benefits: Explainable, easily parallelizable
● Drawbacks: Inefficient/expensive

for rate in [0.1, 0.01, 0.001]:
    for hidden_layers in [2, 3, 4]:
        for param in ["a", "b", "c"]:
            train_model(rate, hidden_layers, param)

Grid search (pseudocode, not scikit-learn): the cross-product of all possible configurations (3 x 3 x 3 = 27 evaluations)
32. Random Search Benefits and Drawbacks
● Benefits: Easily parallelizable, hard to beat in high dimensions
● Drawbacks: Less explainable, inefficient/expensive

for i in range(num_samples):
    train_model(alpha=sample(0.001, 0.1),
                hidden_layers=sample(2, 4),
                param=sample(["a", "b", "c"]))

Random search (pseudocode, not scikit-learn): num_samples configurations drawn at random
33. Random Search Benefits and Drawbacks
● Benefits: Easily parallelizable, hard to beat in high dimensions
● Drawbacks: Inefficient/expensive
RandomizedSearchCV
34. Random Search Benefits and Drawbacks
● Benefits: Easily parallelizable, hard to beat in high dimensions
● Drawbacks: Inefficient/expensive
Fewer configurations tested with relatively good performance (a scikit-learn sketch of both searches follows)
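A hedged sketch of the two built-in searches; the estimator, grid, and distributions here are illustrative choices (loguniform needs scipy >= 1.4):

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Grid search: exhaustive cross-product of the listed values (3 x 3 = 9 candidates).
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, n_jobs=-1)
grid.fit(X, y)

# Random search: n_iter candidates sampled from the given distributions.
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e0)},
    n_iter=10, random_state=0, n_jobs=-1,
)
rand.fit(X, y)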
35. Different hyperparameter optimization techniques
● Scikit-learn recently added new techniques such as halving grid search (HalvingGridSearchCV) and halving random search (HalvingRandomSearchCV)
Successive halving
36. Different hyperparameter optimization techniques
● Hyperparameter candidates are evaluated with a small amount of resources at the first iteration, and the more promising candidates are selected and given more resources during each successive iteration (a sketch follows)
Successive halving
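A minimal sketch of successive halving; it was experimental in scikit-learn 0.24, hence the enable_halving_search_cv import, and the estimator, grid, and resource budget here are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [3, 5, None], "min_samples_split": [2, 5, 10]},
    resource="n_estimators",  # candidates start with few trees...
    max_resources=200,        # ...and the survivors get more each round
    random_state=0,
)
search.fit(X, y)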
37. Different hyperparameter optimization techniques
● There is a library called Tune-sklearn that provides cutting-edge hyperparameter tuning techniques (Bayesian optimization, early stopping, and distributed execution)
Early stopping in action.
38. Different hyperparameter optimization techniques
● Hyperparameter set 2 is a set of unpromising hyperparameters that would be detected by Tune-sklearn’s early stopping mechanisms and stopped early to avoid wasting time and resources
39. Features of Tune-sklearn
● Consistency with the scikit-learn API
● Accessibility to modern hyperparameter tuning techniques
● Scalability
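A hedged sketch of Tune-sklearn’s drop-in replacement for the scikit-learn searches; it assumes pip install tune-sklearn "ray[tune]" scikit-optimize, and the estimator and search ranges are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from tune_sklearn import TuneSearchCV

X, y = make_classification(n_samples=5000, random_state=0)

search = TuneSearchCV(
    SGDClassifier(),                 # supports partial_fit, so it can be early-stopped
    {"alpha": (1e-4, 1e-1)},         # a range to sample from
    search_optimization="bayesian",  # modern search instead of grid/random
    early_stopping=True,             # stop unpromising configurations early
    max_iters=10,                    # epoch budget per configuration
    n_trials=10,
)
search.fit(X, y)
print(search.best_params_)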
40. Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or distribute your training
41. Parallelize or Distribute your training
● Another way to increase your model building speed is to parallelize or distribute your training with joblib and Ray
Resources (dark blue) that scikit-learn can utilize for single-core (A), multi-core (B), and multi-node (C) training
42. Parallelize or Distribute your training
● By default, scikit-learn trains a model using a single core
44. Parallelize or Distribute your training
● For this talk, you can think of the MacBook as a single node with 4 cores
45. Parallelize or Distribute your training
● Using multiple cores can speed up the training of your model
46. Parallelize or Distribute your training
● This is especially true if your model has a high degree of parallelism
A random forest® is an easy model to parallelize, as each decision tree is independent of the others.
47. Parallelize or Distribute your training
● Scikit-Learn can parallelize training on a single node with joblib, which by default uses the ‘loky’ backend
48. Parallelize or Distribute your training
● Joblib allows you to choose between backends like ‘loky’, ‘multiprocessing’, ‘dask’, and ‘ray’
49. Parallelize Training Example
● The ‘loky’ backend is optimized for a single node, not for running distributed (multi-node) applications (a sketch follows)
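A minimal sketch of switching joblib backends; it assumes pip install ray, and the forest size and data are illustrative:

import joblib
from ray.util.joblib import register_ray
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, random_state=0)
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)  # n_jobs=-1: use every core the backend provides

# Default ‘loky’ backend: parallel across the cores of one node.
with joblib.parallel_backend("loky"):
    clf.fit(X, y)

# ‘ray’ backend: the same fit() call can fan out across a Ray cluster.
register_ray()
with joblib.parallel_backend("ray"):
    clf.fit(X, y)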
51. Distributed Application Challenges
● Scheduling tasks across multiple machines
● Transferring data efficiently
● Recovering from machine failures
Distributed computing doesn’t always go the way you would hope
52. ‘ray’ backend
‘ray’ can handle the details for you, keep things simple, and lead to better performance
Ray makes parallel and distributed processing work like you would hope
54. ‘ray’ backend
Performance was measured on one, five, and ten m5.8xlarge nodes with 32 cores each. The performance of Loky and Multiprocessing does not depend on the number of machines (they run on a single machine).
55. ‘ray’ backend
Ray performs a lot better in the random forest benchmark. This training used 45,000 trees (estimators), which resulted in 45,000 tasks being submitted.
56. ‘ray’ backend
Ray’s high-throughput decentralized scheduler, along with shared memory, allows Ray to scale the workload to multiple nodes.
57. ‘ray’ backend
Note that the performance improvement increases as we add nodes, but it is bottlenecked mainly by the serial part of the program (Amdahl’s law).
59. What is Ray
Ray is an open source library for parallel and distributed Python.
The Ray ecosystem consists of the core Ray system, scalable libraries for machine learning, and tools for launching clusters on any cluster manager or cloud provider.
Note that there is also improved memory usage (https://scikit-learn.org/stable/whats_new/v0.24.html#sklearn-ensemble) and missing-value support.