How to Reduce Scikit-Learn
Training Time
Michael Galarnyk
mgalarnyk@anyscale.com
@galarnykmichael
About Me
● Current: Developer Relations at Anyscale
● Previous: Data Scientist at Scripps Research
2
Current: Writing about Distributed
Computing
Previous: Understanding Activity
Data
Teaching Python/Data
Science/Analytics at Stanford
Continuing Studies, LinkedIn
Learning, etc.
Why Speed Up Scikit-Learn?
● Scikit-Learn is an easy to use Python Library for
Machine Learning
● Sometimes scikit-learn models can take a long time
to train
3
Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast Approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or Distribute your Training
4
Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast Approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or Distribute your Training
5
Upgrade your scikit-learn
● pip install --upgrade scikit-learn
● Why? Sometimes scikit-learn releases have
efficiency improvements
● Can give you ideas for how to improve your workflow
6
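A quick way to confirm the upgrade took effect is to print the installed version. This is a minimal sketch; the exact version string depends on your environment:

import sklearn
print(sklearn.__version__)  # roughly 0.24 or newer for the features discussed in this talk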
Histogram Boosting Improvements
● HistGradientBoostingClassifier and
HistGradientBoostingRegressor
now have support for categorical
features
● This saves time compared to one-hot
encoding (by far the slowest)
7
Image source
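Below is a minimal sketch of the native categorical support, assuming scikit-learn >= 0.24 (versions before 1.0 also need the experimental import noted in the comment). The toy data and column layout are made up for illustration:

import numpy as np
# scikit-learn < 1.0 additionally needs, before the ensemble import:
# from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=500),          # numeric feature
    rng.integers(0, 4, size=500),  # categorical feature encoded as integer codes 0-3
]).astype(float)
y = (X[:, 1] >= 2).astype(int)

# Mark column 1 as categorical so the model handles it natively (no one-hot encoding)
clf = HistGradientBoostingClassifier(categorical_features=[1])
clf.fit(X, y)
print(clf.score(X, y))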
Histogram Boosting Improvements
● One-hot encoding creates one
additional feature per category
value for each categorical feature,
which means more split points during
fitting (and more tree depth)
8
Image source
Histogram Boosting Improvements
● Native handling of categorical
features should be slightly slower
than treating categories as ordered
quantities (‘Ordinal’), since it
requires sorting the categories.
9
Image source
Histogram Boosting Improvements
● However, speed isn’t everything:
treating categorical features as
continuous (ordinal) is not always
the best approach.
10
Image source
Fast Approximation of Polynomial Feature
Expansion
● The PolynomialFeatures transformer
creates interaction terms and
higher-order polynomials of your
features.
● It returns squared features and
interaction terms (more if you want
higher-order polynomials)
● It can be slow
11
Image source
Fast Approximation of Polynomial Feature
Expansion
● The PolynomialCountSketch kernel
approximation provides a faster
alternative: the features it creates
approximate the polynomial feature
expansion (see the sketch below).
12
Image source
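As a rough sketch of the difference (assuming scikit-learn >= 0.24 for PolynomialCountSketch; the random data here is only for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.kernel_approximation import PolynomialCountSketch

X = np.random.rand(1000, 50)

# Exact degree-2 expansion: every squared feature and pairwise interaction (wide and slow)
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# Approximate expansion: a fixed number of components that approximate the
# degree-2 polynomial kernel, typically much faster on large inputs
X_sketch = PolynomialCountSketch(degree=2, n_components=300).fit_transform(X)

print(X_poly.shape, X_sketch.shape)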
Fast Approximation of Polynomial Feature
Expansion
● The drawbacks are reduced
interpretability and possibly some
accuracy
13
Image source
Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast Approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or Distribute your Training
14
Changing your Solver
● Many learning problems are formulated as
minimization of some loss function on a training set
of examples.
● Optimization functions (solvers) help with
minimization.
15
Changing your Solver Idea
● Better algorithms
allow you to make
better use of the same
hardware
● With a more efficient
algorithm, you can
produce an optimal
model faster.
16
Image from Gaël Varoquaux’s talk
Changing your Solver Idea
● Full gradient
algorithm (liblinear)
converges rapidly, but
each iteration (shown
as a white +) can be
costly because it
requires you to use all
of the data
17
Image from Gaël Varoquaux’s talk
Changing your Solver Idea
● In the sub-sampled
approach, each
iteration is cheap to
compute, but it can
converge much more
slowly.
18
Image from Gaël Varoquaux’s talk
Changing your Solver Idea
● Some algorithms like ‘saga’
achieve the best of both
worlds. Each iteration is
cheap to compute, and the
algorithm converges rapidly
because of a variance
reduction technique
19
Image from Gaël Varoquaux’s talk
Changing your Solver Practice
20
● Choosing the right
solver for a
problem can save
a lot of time.
Changing your Solver Practice
21
● This code was for a
single-class (binary) problem,
not multiclass; a hedged
sketch follows below.
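The original code is shown as an image in the slides; a hedged reconstruction of that kind of single-class (binary) solver comparison might look like this (the dataset size and solver list are illustrative):

import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50_000, n_features=100, random_state=0)

for solver in ["liblinear", "lbfgs", "saga"]:
    start = time.time()
    LogisticRegression(solver=solver, max_iter=1000).fit(X, y)
    print(solver, round(time.time() - start, 2), "seconds")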
Changing your Solver Practice
22
● Some solvers, such as
‘liblinear’, are designed
only for single-class
(binary) problems
Changing your Solver Practice
23
● This code is for multiple
classes; a sketch follows below
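Again, the slide's code is an image; a hedged sketch of the multiclass version (‘liblinear’ is left out since it targets binary problems):

import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50_000, n_features=100,
                           n_informative=20, n_classes=5, random_state=0)

for solver in ["lbfgs", "saga"]:
    start = time.time()
    LogisticRegression(solver=solver, max_iter=1000).fit(X, y)
    print(solver, round(time.time() - start, 2), "seconds")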
Changing your Solver Practice
24
● The documentation gives
good suggestions on which
solver to use for different
use cases.
Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast Approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or Distribute your Training
25
Different hyperparameter optimization techniques
● To achieve high performance for most scikit-learn
algorithms, you need to tune a model’s
hyperparameters
● Hyperparameters are the parameters of a model
which are not updated during training
26
Different hyperparameter optimization techniques
● Scikit-Learn contains a couple of
techniques for hyperparameter tuning,
like grid search (GridSearchCV) and
random search (RandomizedSearchCV)
27
Different hyperparameter optimization techniques
● Grid search: exhaustively
considers all parameter
combinations.
● Random search: samples a
given number of candidates
from a parameter space with a
specified distribution.
28
Grid search can sometimes miss the
optimal region of an important
parameter
Grid Search Benefits and Drawbacks
● Benefits: Explainable, easily
parallelizable
● Drawbacks:
Inefficient/expensive
29
for rate in [0.1, 0.01, 0.001]:
    for hidden_layers in [2, 3, 4]:
        for param in ["a", "b", "c"]:
            train_model(rate, hidden_layers, param)

Grid search (code isn’t scikit-learn): cross-product
of all possible configurations (3 x 3 x 3 = 27 evaluations)
Grid Search Benefits and Drawbacks
● Benefits: Explainable, easily
parallelizable
● Drawbacks:
Inefficient/expensive
30
GridSearchCV
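The GridSearchCV code on the slide is an image; a minimal sketch of an equivalent scikit-learn grid search (the estimator and grid are chosen only for illustration, with 3 x 3 x 3 = 27 candidates as in the pseudocode above):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1],
    "kernel": ["rbf", "poly", "sigmoid"],
}

# Every combination is evaluated with cross-validation; n_jobs=-1 parallelizes the search
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)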
Grid Search Benefits and Drawbacks
● Benefits: Explainable, easily
parallelizable
● Drawbacks:
Inefficient/expensive
31
81 configurations tested
for i in range(num_samples):
    train_model(alpha=sample(0.001, 0.1),
                hidden_layers=sample(2, 4),
                param=sample(["a", "b", "c"]))
Random Search Benefits and Drawbacks
● Benefits: Easily
parallelizable, hard to beat in
high dimensions
● Drawbacks: Less explainable,
Inefficient/expensive
32
Random search (code isn’t scikit-learn)
Random Search Benefits and Drawbacks
● Benefits: Easily
parallelizable, hard to beat
on high dimensions
● Drawbacks:
Inefficient/expensive
33
RandomizedSearchCV
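Likewise, a minimal sketch of RandomizedSearchCV (the distributions and n_iter are illustrative; only the requested number of candidates is sampled instead of the full grid):

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-3, 1e1),
    "kernel": ["rbf", "poly", "sigmoid"],
}

# Samples 20 candidate configurations from the distributions above
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20,
                            cv=3, n_jobs=-1, random_state=0)
search.fit(X, y)
print(search.best_params_)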
Random Search Benefits and Drawbacks
● Benefits: Easily
parallelizable, hard to beat
on high dimensions
● Drawbacks:
Inefficient/expensive
34
Fewer configurations tested with
relatively good performance
Different hyperparameter optimization techniques
● Scikit-learn recently added
new techniques such as
halving grid search
(HalvingGridSearchCV) and
halving random search
(HalvingRandomSearchCV)
35
Successive halving
Different hyperparameter optimization techniques
● Hyperparameter candidates are
evaluated with a small amount of
resources at the first iteration, and
the more promising candidates are
selected and given more resources at
each successive iteration.
36
Successive halving
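A minimal sketch of halving grid search (successive halving is still exported through scikit-learn's experimental module, hence the extra import; the estimator and grid are illustrative):

from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, random_state=0)
param_grid = {"max_depth": [3, 5, 10, None],
              "min_samples_split": [2, 5, 10]}

# Each candidate starts with few samples; only the best third survives each round
search = HalvingGridSearchCV(RandomForestClassifier(random_state=0),
                             param_grid, factor=3, resource="n_samples")
search.fit(X, y)
print(search.best_params_)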
Different hyperparameter optimization techniques
● There is a library called
Tune-sklearn that provides
cutting edge
hyperparameter tuning
techniques (Bayesian
optimization, early
stopping, and distributed
execution)
37
Early stopping in action.
Different hyperparameter optimization techniques
● Hyperparameter set 2 is a
set of unpromising
hyperparameters that
would be detected by Tune-sklearn’s
early stopping
mechanisms and stopped
early to avoid wasting time
and resources.
38
Early stopping in action.
Features of Tune-sklearn
● Consistency with the scikit-learn API
● Accessibility to modern
hyperparameter tuning
techniques
● Scalability
39
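As a sketch of how Tune-sklearn's drop-in replacement is typically used (this assumes the tune-sklearn and ray packages are installed; the argument names follow the project's README and may vary between versions):

from tune_sklearn import TuneSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, random_state=0)
param_dists = {"alpha": (1e-4, 1e-1), "epsilon": (1e-2, 1e-1)}

search = TuneSearchCV(SGDClassifier(),
                      param_dists,
                      n_trials=20,                     # number of sampled configurations
                      search_optimization="bayesian",  # modern search technique
                      early_stopping=True,             # stop unpromising trials early
                      max_iters=10)                    # max training iterations per trial
search.fit(X, y)
print(search.best_params_)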
Ways to Speed Up Scikit-Learn
● Upgrade your scikit-learn
● Recent improvements to HistGradientBoostingClassifier
● Fast Approximation of polynomial feature expansion
● Changing your solver
● Different hyperparameter optimization techniques
● Parallelize or Distribute your Training
40
Parallelize or Distribute your training
● Another way to
increase your model
building speed is to
parallelize or distribute
your training
with joblib and Ray
41
Resources (dark blue) that scikit-learn can utilize for
single-core (A), multicore (B), and multinode (C)
training
Parallelize or Distribute your training
● By default, scikit-learn
trains a model using a
single core.
42
Parallelize or Distribute your training
● Virtually all computers
today have multiple
cores.
43
Parallelize or Distribute your training
● For this talk, you can
think of the MacBook
as a single node with 4
cores.
44
Parallelize or Distribute your training
● Using multiple cores
can speed up the
training of your
model.
45
Parallelize or Distribute your training
● This is especially true
if your model has a
high degree of
parallelism
46
A random forest® is an easy model to parallelize as
each decision tree is independent of the others.
Parallelize or Distribute your training
● Scikit-Learn can
parallelize training on
a single node with
joblib which by
default uses the ‘loky’
backend.
47
A random forest® is an easy model to parallelize as
each decision tree is independent of the others.
Parallelize or Distribute your training
● Joblib allows you to
choose between
backends like ‘loky’,
‘multiprocessing’,
‘dask’, and ‘ray’.
48
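A minimal sketch of switching the joblib backend around scikit-learn's internal parallelism (the default ‘loky’ is shown; ‘dask’ and ‘ray’ work once their packages are installed and registered):

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10_000, random_state=0)

# Any joblib-based parallelism (here cross-validation with n_jobs=-1)
# runs on the backend chosen below
with joblib.parallel_backend("loky"):
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, n_jobs=-1)
print(scores.mean())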
Parallelize Training Example
● The ‘loky’ backend is
optimized for a single
node, not for running
distributed (multinode)
applications.
49
Parallelize Training Example
n_jobs = -1 creates 1 job
per core automatically
50
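The slide's code is an image; a hedged sketch of single-node parallel training with one job per core:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

# n_jobs=-1 asks joblib to create one job per available core on this machine
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X, y)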
Distributed Application Challenges
● Scheduling tasks
across multiple
machines
● Transferring data
efficiently
● Recovering from
machine failures
51
Distributed computing doesn’t always go the way you
would hope (image source)
‘ray’ backend
’ray’ can handle the
details for you, keep
things simple, and lead
to better performance
52
Ray makes parallel and distributed processing work
like you would hope (image source)
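A minimal sketch of the ‘ray’ backend (assumes ray is installed; on an actual cluster you would typically call ray.init with the cluster address before registering the backend):

import joblib
from ray.util.joblib import register_ray
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

register_ray()  # makes 'ray' available as a joblib backend

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)

# The trees are now trained as Ray tasks, which can span multiple nodes
with joblib.parallel_backend("ray"):
    clf.fit(X, y)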
‘ray’ backend
53
Normalized speedup in terms of execution time of Ray, multiprocessing, and
Dask relative to the default ‘loky’ backend
‘ray’ backend
54
Performance was measured on one, five, and ten m5.8xlarge nodes with 32 cores each. The performance of Loky and
Multiprocessing does not depend on the number of machines (they run on a single machine)
‘ray’ backend
55
Ray performs a lot better in the random forest benchmark. This training used 45,000 trees (estimators) which resulted in
45,000 tasks being submitted.
‘ray’ backend
56
Ray’s high throughput decentralized scheduler along with shared memory allow Ray to scale the workload to multiple
nodes.
‘ray’ backend
57
Note that the performance improvement increases as we add nodes, but it is bottlenecked mainly by the serial part of
the program (Amdahl’s law).
‘ray’ backend
58
You can learn about how to distribute scikit-learn here.
What is Ray
Ray is an open source
library for parallel and
distributed Python
59
The Ray ecosystem consists of the core Ray system, scalable
libraries for machine learning, and tools for launching
clusters on any cluster or cloud provider.
Questions?
© 2019-2020, Anyscale.io
mgalarnyk@anyscale.com
@GalarnykMichael


Editor's Notes

  1. Note that there is also improved memory usage: https://scikit-learn.org/stable/whats_new/v0.24.html#sklearn-ensemble There is also missing value support
  2. Code Example: https://scikit-learn.org/stable/auto_examples/kernel_approximation/plot_scalable_poly_kernels.html#sphx-glr-auto-examples-kernel-approximation-plot-scalable-poly-kernels-py
  3. Code Example: https://scikit-learn.org/stable/auto_examples/kernel_approximation/plot_scalable_poly_kernels.html#sphx-glr-auto-examples-kernel-approximation-plot-scalable-poly-kernels-py
  4. Code Example: https://stackoverflow.com/questions/62815111/improve-speed-of-scikit-learn-multinomial-logistic-regression/66070192#66070192
  5. Code Example: https://stackoverflow.com/questions/62815111/improve-speed-of-scikit-learn-multinomial-logistic-regression/66070192#66070192
  6. For randomized search, if we only sample 60 parameter combinations via randomized search, we already h
  7. For randomized search, if we only sample 60 parameter combinations via randomized search, we already h
  8. For randomized search, if we only sample 60 parameter combinations via randomized search, we already h
  9. For randomized search, if we only sample 60 parameter combinations via randomized search, we already h
  10. For randomized search, if we only sample 60 parameter combinations via randomized search, we already h
  11. For randomized search, if we only sample 60 parameter combinations via randomized search, we already h
  12. For randomized search, if we only sample 60 parameter combinations via randomized search, we already h
  13. If you are curious why dask isn’t fast here, read this here: https://github.com/dask/dask/issues/5993
  14. If you are curious why dask isn’t fast here, read this here: https://github.com/dask/dask/issues/5993
  15. If you are curious why dask isn’t fast here, read this here: https://github.com/dask/dask/issues/5993
  16. If you are curious why dask isn’t fast here, read this here: https://github.com/dask/dask/issues/5993
  17. If you are curious why dask isn’t fast here, read this here: https://github.com/dask/dask/issues/5993
  18. If you are curious why dask isn’t fast here, read this here: https://github.com/dask/dask/issues/5993