SlideShare a Scribd company logo
Accelerating Random Forests in Scikit-Learn 
Gilles Louppe 
Universite de Liege, Belgium 
August 29, 2014 
1 / 26
Motivation 
... and many more applications ! 
2 / 26
About 
Scikit-Learn 
 Machine learning library for Python 
 Classical and well-established 
algorithms 
 Emphasis on code quality and usability 
Myself 
 @glouppe 
 PhD student (Liege, Belgium) 
 Core developer on Scikit-Learn since 2011 
Chief tree hugger 
scikit 
3 / 26
Outline 
1 Basics 
2 Scikit-Learn implementation 
3 Python improvements 
4 / 26
Machine Learning 101 
 Data comes as... 
A set of samples L = f(xi ; yi )ji = 0; : : : ;N  1g, with 
Feature vector x 2 Rp (= input), and 
Response y 2 R (regression) or y 2 f0; 1g (classi
cation) (= 
output) 
 Goal is to... 
Find a function ^y = '(x) 
Such that error L(y; ^y) on new (unseen) x is minimal 
5 / 26
Decision Trees 
푡2 
푡1 
풙 
푋푡1 ≤ 푣푡1 
푡10 
푡3 
≤ 
푡4 푡5 푡6 푡7 
푡8 푡9 푡11 푡12 푡13 
푡14 푡15 푡16 푡17 
푋푡3 ≤ 푣푡3 
푋푡6 ≤ 푣푡6 
푋푡10 ≤ 푣푡10 
푝(푌 = 푐|푋 = 풙) 
Split node 
≤  Leaf node 
 
 
 
≤ 
≤ 
t 2 ' : nodes of the tree ' 
Xt : split variable at t 
vt 2 R : split threshold at t 
'(x) = arg maxc2Y p(Y = cjX = x) 
6 / 26
Random Forests 
풙 
휑1 휑푀 
푝휑1 (푌 = 푐|푋 = 풙) 
… 
푝휑푚 (푌 = 푐|푋 = 풙) 
Σ 
푝휓(푌 = 푐|푋 = 풙) 
Ensemble of M randomized decision trees 'm 
 (x) = arg maxc2Y 
1M 
PM 
m=1 p'm(Y = cjX = x) 
7 / 26
Learning from data 
function BuildDecisionTree(L) 
Create node t 
if the stopping criterion is met for t then 
byt = some constant value 
else 
Find the best partition L = LL [ LR 
tL = BuildDecisionTree(LL) 
tR = BuildDecisionTree(LR) 
end if 
return t 
end function 
8 / 26
Outline 
1 Basics 
2 Scikit-Learn implementation 
3 Python improvements 
9 / 26
History 
Time for building a Random Forest (relative to version 0.10) 
1 0.99 0.98 
0.33 0.11 0.04 
0.10 0.11 0.12 0.13 0.14 0.15 
0.10 : January 2012 
 First sketch at sklearn.tree and sklearn.ensemble 
 Random Forests and Extremely Randomized Trees modules 
10 / 26
History 
Time for building a Random Forest (relative to version 0.10) 
1 0.99 0.98 
0.33 0.11 0.04 
0.10 0.11 0.12 0.13 0.14 0.15 
0.11 : May 2012 
 Gradient Boosted Regression Trees module 
 Out-of-bag estimates in Random Forests 
10 / 26
History 
Time for building a Random Forest (relative to version 0.10) 
1 0.99 0.98 
0.33 0.11 0.04 
0.10 0.11 0.12 0.13 0.14 0.15 
0.12 : October 2012 
 Multi-output decision trees 
10 / 26
History 
Time for building a Random Forest (relative to version 0.10) 
1 0.99 0.98 
0.33 0.11 0.04 
0.10 0.11 0.12 0.13 0.14 0.15 
0.13 : February 2013 
 Speed improvements 
Rewriting from Python to Cython 
 Support of sample weights 
 Totally randomized trees embedding 
10 / 26
History 
Time for building a Random Forest (relative to version 0.10) 
1 0.99 0.98 
0.33 0.11 0.04 
0.10 0.11 0.12 0.13 0.14 0.15 
0.14 : August 2013 
 Complete rewrite of sklearn.tree 
Refactoring 
Cython enhancements 
 AdaBoost module 
10 / 26
History 
Time for building a Random Forest (relative to version 0.10) 
1 0.99 0.98 
0.33 0.11 0.04 
0.10 0.11 0.12 0.13 0.14 0.15 
0.15 : August 2014 
 Further speed and memory improvements 
Better algorithms 
Cython enhancements 
 Better parallelism 
 Bagging module 
10 / 26
Implementation overview 
 Modular implementation, designed with a strict separation of 
concerns 
Builders : for building and connecting nodes into a tree 
Splitters : for
nding a split 
Criteria : for evaluating the goodness of a split 
Tree : dedicated data structure 
 Ecient algorithmic formulation [See Louppe, 2014] 
Tips. An ecient algorithm is better than a bad one, even if 
the implementation of the latter is strongly optimized. 
Dedicated sorting procedure 
Ecient evaluation of consecutive splits 
 Close to the metal, carefully coded, implementation 
2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests 
# But we kept it stupid simple for users! 
clf = RandomForestClassifier() 
clf.fit(X_train, y_train) 
y_pred = clf.predict(X_test) 
11 / 26
Development cycle 
User feedback 
Benchmarks 
Profiling 
Algorithmic 
and code 
improvements 
Implementation 
Peer review 
12 / 26
Continuous benchmarks 
 During code review, changes in the tree codebase are 
monitored with benchmarks. 
 Ensure performance and code quality. 
 Avoid code complexi
cation if it is not worth it. 
13 / 26
Outline 
1 Basics 
2 Scikit-Learn implementation 
3 Python improvements 
14 / 26
Disclaimer. Early optimization is the root of all evil. 
(This took us several years to get it right.) 
15 / 26
Pro
ling 
Use pro
ling tools for identifying bottlenecks. 
In [1]: clf = DecisionTreeClassifier() 
# Timer 
In [2]: %timeit clf.fit(X, y) 
1000 loops, best of 3: 394 mu s per loop 
# memory_profiler 
In [3]: %memit clf.fit(X, y) 
peak memory: 48.98 MiB, increment: 0.00 MiB 
# cProfile 
In [4]: %prun clf.fit(X, y) 
ncalls tottime percall cumtime percall filename:lineno(function) 
390/32 0.003 0.000 0.004 0.000 _tree.pyx:1257(introsort) 
4719 0.001 0.000 0.001 0.000 _tree.pyx:1229(swap) 
8 0.001 0.000 0.006 0.001 _tree.pyx:1041(node_split) 
405 0.000 0.000 0.000 0.000 _tree.pyx:123(impurity_improvement) 
1 0.000 0.000 0.007 0.007 tree.py:93(fit) 
2 0.000 0.000 0.000 0.000 {method 'argsort' of 'numpy.ndarray' 405 0.000 0.000 0.000 0.000 _tree.pyx:294(update) 
... 
16 / 26
Pro
ling (cont.) 
# line_profiler 
In [5]: %lprun -f DecisionTreeClassifier.fit clf.fit(X, y) 
Line % Time Line Contents 
================================= 
... 
256 4.5 self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_) 
257 
258 # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise 
259 0.4 if max_leaf_nodes  0: 
260 0.5 builder = DepthFirstTreeBuilder(splitter, min_samples_split, 
261 0.6 self.min_samples_leaf, 262 else: 
263 builder = BestFirstTreeBuilder(splitter, min_samples_split, 
264 self.min_samples_leaf, max_depth, 
265 max_leaf_nodes) 
266 
267 22.4 builder.build(self.tree_, X, y, sample_weight) 
... 
17 / 26
Call graph 
python -m cProfile -o profile.prof script.py 
gprof2dot -f pstats profile.prof -o graph.dot 
18 / 26
Python is slow :-( 
 Python overhead is too large for high-performance code. 
 Whenever feasible, use high-level operations (e.g., SciPy or 
NumPy operations on arrays) to limit Python calls and rely 
on highly-optimized code. 
def dot_python(a, b): # Pure Python (2.09 ms) 
s = 0 
for i in range(a.shape[0]): 
s += a[i] * b[i] 
return s 
np.dot(a, b) # NumPy (5.97 us) 
 Otherwise (and only then !), write compiled C extensions 
(e.g., using Cython) for critical parts. 
cpdef dot_mv(double[::1] a, double[::1] b): # Cython (7.06 us) 
cdef double s = 0 
cdef int i 
for i in range(a.shape[0]): 
s += a[i] * b[i] 
return s 
19 / 26
Stay close to the metal 
 Use the right data type for the right operation. 
 Avoid repeated access (if at all) to Python objects. 
Trees are represented by single arrays. 
Tips. In Cython, check for hidden Python overhead. Limit 
yellow lines as much as possible ! 
cython -a tree.pyx 
20 / 26

More Related Content

What's hot

Machine Learning Landscape
Machine Learning LandscapeMachine Learning Landscape
Machine Learning Landscape
Eng Teong Cheah
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 Sydney
Alexandros Karatzoglou
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
Knoldus Inc.
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
Md. Main Uddin Rony
 
Naive Bayes
Naive Bayes Naive Bayes
Naive Bayes
Eric Wilson
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
CloudxLab
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboost
Shuai Zhang
 
K-Means Algorithm
K-Means AlgorithmK-Means Algorithm
K-Means Algorithm
Carlos Castillo (ChaTo)
 
boosting algorithm
boosting algorithmboosting algorithm
boosting algorithm
Prithvi Paneru
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its Applications
Dr Ganesh Iyer
 
Clustering
ClusteringClustering
Clustering
LipikaSaha2
 
Machine learning ppt.
Machine learning ppt.Machine learning ppt.
Machine learning ppt.
ASHOK KUMAR
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
Anil Yadav
 
Knn
KnnKnn
Cluster Analysis Introduction
Cluster Analysis IntroductionCluster Analysis Introduction
Cluster Analysis Introduction
PrasiddhaSarma
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning Algorithm
Palin analytics
 
K Nearest Neighbor Algorithm
K Nearest Neighbor AlgorithmK Nearest Neighbor Algorithm
K Nearest Neighbor Algorithm
Tharuka Vishwajith Sarathchandra
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Girish Khanzode
 

What's hot (20)

Machine Learning Landscape
Machine Learning LandscapeMachine Learning Landscape
Machine Learning Landscape
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 Sydney
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Naive Bayes
Naive Bayes Naive Bayes
Naive Bayes
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboost
 
K-Means Algorithm
K-Means AlgorithmK-Means Algorithm
K-Means Algorithm
 
boosting algorithm
boosting algorithmboosting algorithm
boosting algorithm
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its Applications
 
Clustering
ClusteringClustering
Clustering
 
KNN
KNNKNN
KNN
 
Machine learning ppt.
Machine learning ppt.Machine learning ppt.
Machine learning ppt.
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Knn
KnnKnn
Knn
 
Cluster Analysis Introduction
Cluster Analysis IntroductionCluster Analysis Introduction
Cluster Analysis Introduction
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning Algorithm
 
K Nearest Neighbor Algorithm
K Nearest Neighbor AlgorithmK Nearest Neighbor Algorithm
K Nearest Neighbor Algorithm
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 

Viewers also liked

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
Sarah Guido
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Arnaud Joly
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
Gilles Louppe
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
Kan Ouivirach, Ph.D.
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learn
AWeber
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
Damian R. Mingle, MBA
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
PyData
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learn
AWeber
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learn
Qingkai Kong
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
odsc
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
Gael Varoquaux
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
Jeff Klukas
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
Chetan Khatri
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
Asim Jalis
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017
Francesco Mosconi
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pôle Systematic Paris-Region
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
Yoss Cohen
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learn
Matt Hagy
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
Oswal Abhishek
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
Gael Varoquaux
 

Viewers also liked (20)

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learn
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learn
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learn
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learn
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 

Similar to Accelerating Random Forests in Scikit-Learn

SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
South Tyrol Free Software Conference
 
python.ppt
python.pptpython.ppt
python.ppt
Arun471829
 
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Piotr Przymus
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1
Kenta Oono
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Universitat Politècnica de Catalunya
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
Intel® Software
 
Common Design of Deep Learning Frameworks
Common Design of Deep Learning FrameworksCommon Design of Deep Learning Frameworks
Common Design of Deep Learning Frameworks
Kenta Oono
 
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
Masashi Shibata
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
Christian Peel
 
Deep Learning meetup
Deep Learning meetupDeep Learning meetup
Deep Learning meetup
Ivan Goloskokovic
 
Deep Learning, Scala, and Spark
Deep Learning, Scala, and SparkDeep Learning, Scala, and Spark
Deep Learning, Scala, and Spark
Oswald Campesato
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data Expo
BigDataExpo
 
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Yusuke Izawa
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developers
Abdul Muneer
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Matopt
MatoptMatopt
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
PyData
 
python.ppt
python.pptpython.ppt
python.ppt
ramamoorthi24
 
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Raffi Khatchadourian
 

Similar to Accelerating Random Forests in Scikit-Learn (20)

SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
 
python.ppt
python.pptpython.ppt
python.ppt
 
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Common Design of Deep Learning Frameworks
Common Design of Deep Learning FrameworksCommon Design of Deep Learning Frameworks
Common Design of Deep Learning Frameworks
 
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
 
Deep Learning meetup
Deep Learning meetupDeep Learning meetup
Deep Learning meetup
 
Deep Learning, Scala, and Spark
Deep Learning, Scala, and SparkDeep Learning, Scala, and Spark
Deep Learning, Scala, and Spark
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data Expo
 
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developers
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Matopt
MatoptMatopt
Matopt
 
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
 
python.ppt
python.pptpython.ppt
python.ppt
 
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
 

Recently uploaded

Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 

Recently uploaded (20)

Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 

Accelerating Random Forests in Scikit-Learn

  • 1. Accelerating Random Forests in Scikit-Learn Gilles Louppe Universite de Liege, Belgium August 29, 2014 1 / 26
  • 2. Motivation ... and many more applications ! 2 / 26
  • 3. About Scikit-Learn Machine learning library for Python Classical and well-established algorithms Emphasis on code quality and usability Myself @glouppe PhD student (Liege, Belgium) Core developer on Scikit-Learn since 2011 Chief tree hugger scikit 3 / 26
  • 4. Outline 1 Basics 2 Scikit-Learn implementation 3 Python improvements 4 / 26
  • 5. Machine Learning 101 Data comes as... A set of samples L = f(xi ; yi )ji = 0; : : : ;N 1g, with Feature vector x 2 Rp (= input), and Response y 2 R (regression) or y 2 f0; 1g (classi
  • 6. cation) (= output) Goal is to... Find a function ^y = '(x) Such that error L(y; ^y) on new (unseen) x is minimal 5 / 26
  • 7. Decision Trees 푡2 푡1 풙 푋푡1 ≤ 푣푡1 푡10 푡3 ≤ 푡4 푡5 푡6 푡7 푡8 푡9 푡11 푡12 푡13 푡14 푡15 푡16 푡17 푋푡3 ≤ 푣푡3 푋푡6 ≤ 푣푡6 푋푡10 ≤ 푣푡10 푝(푌 = 푐|푋 = 풙) Split node ≤ Leaf node ≤ ≤ t 2 ' : nodes of the tree ' Xt : split variable at t vt 2 R : split threshold at t '(x) = arg maxc2Y p(Y = cjX = x) 6 / 26
  • 8. Random Forests 풙 휑1 휑푀 푝휑1 (푌 = 푐|푋 = 풙) … 푝휑푚 (푌 = 푐|푋 = 풙) Σ 푝휓(푌 = 푐|푋 = 풙) Ensemble of M randomized decision trees 'm (x) = arg maxc2Y 1M PM m=1 p'm(Y = cjX = x) 7 / 26
  • 9. Learning from data function BuildDecisionTree(L) Create node t if the stopping criterion is met for t then byt = some constant value else Find the best partition L = LL [ LR tL = BuildDecisionTree(LL) tR = BuildDecisionTree(LR) end if return t end function 8 / 26
  • 10. Outline 1 Basics 2 Scikit-Learn implementation 3 Python improvements 9 / 26
  • 11. History Time for building a Random Forest (relative to version 0.10) 1 0.99 0.98 0.33 0.11 0.04 0.10 0.11 0.12 0.13 0.14 0.15 0.10 : January 2012 First sketch at sklearn.tree and sklearn.ensemble Random Forests and Extremely Randomized Trees modules 10 / 26
  • 12. History Time for building a Random Forest (relative to version 0.10) 1 0.99 0.98 0.33 0.11 0.04 0.10 0.11 0.12 0.13 0.14 0.15 0.11 : May 2012 Gradient Boosted Regression Trees module Out-of-bag estimates in Random Forests 10 / 26
  • 13. History Time for building a Random Forest (relative to version 0.10) 1 0.99 0.98 0.33 0.11 0.04 0.10 0.11 0.12 0.13 0.14 0.15 0.12 : October 2012 Multi-output decision trees 10 / 26
  • 14. History Time for building a Random Forest (relative to version 0.10) 1 0.99 0.98 0.33 0.11 0.04 0.10 0.11 0.12 0.13 0.14 0.15 0.13 : February 2013 Speed improvements Rewriting from Python to Cython Support of sample weights Totally randomized trees embedding 10 / 26
  • 15. History Time for building a Random Forest (relative to version 0.10) 1 0.99 0.98 0.33 0.11 0.04 0.10 0.11 0.12 0.13 0.14 0.15 0.14 : August 2013 Complete rewrite of sklearn.tree Refactoring Cython enhancements AdaBoost module 10 / 26
  • 16. History Time for building a Random Forest (relative to version 0.10) 1 0.99 0.98 0.33 0.11 0.04 0.10 0.11 0.12 0.13 0.14 0.15 0.15 : August 2014 Further speed and memory improvements Better algorithms Cython enhancements Better parallelism Bagging module 10 / 26
  • 17. Implementation overview Modular implementation, designed with a strict separation of concerns Builders : for building and connecting nodes into a tree Splitters : for
  • 18. nding a split Criteria : for evaluating the goodness of a split Tree : dedicated data structure Ecient algorithmic formulation [See Louppe, 2014] Tips. An ecient algorithm is better than a bad one, even if the implementation of the latter is strongly optimized. Dedicated sorting procedure Ecient evaluation of consecutive splits Close to the metal, carefully coded, implementation 2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests # But we kept it stupid simple for users! clf = RandomForestClassifier() clf.fit(X_train, y_train) y_pred = clf.predict(X_test) 11 / 26
  • 19. Development cycle User feedback Benchmarks Profiling Algorithmic and code improvements Implementation Peer review 12 / 26
  • 20. Continuous benchmarks During code review, changes in the tree codebase are monitored with benchmarks. Ensure performance and code quality. Avoid code complexi
  • 21. cation if it is not worth it. 13 / 26
  • 22. Outline 1 Basics 2 Scikit-Learn implementation 3 Python improvements 14 / 26
  • 23. Disclaimer. Early optimization is the root of all evil. (This took us several years to get it right.) 15 / 26
  • 24. Pro
  • 26. ling tools for identifying bottlenecks. In [1]: clf = DecisionTreeClassifier() # Timer In [2]: %timeit clf.fit(X, y) 1000 loops, best of 3: 394 mu s per loop # memory_profiler In [3]: %memit clf.fit(X, y) peak memory: 48.98 MiB, increment: 0.00 MiB # cProfile In [4]: %prun clf.fit(X, y) ncalls tottime percall cumtime percall filename:lineno(function) 390/32 0.003 0.000 0.004 0.000 _tree.pyx:1257(introsort) 4719 0.001 0.000 0.001 0.000 _tree.pyx:1229(swap) 8 0.001 0.000 0.006 0.001 _tree.pyx:1041(node_split) 405 0.000 0.000 0.000 0.000 _tree.pyx:123(impurity_improvement) 1 0.000 0.000 0.007 0.007 tree.py:93(fit) 2 0.000 0.000 0.000 0.000 {method 'argsort' of 'numpy.ndarray' 405 0.000 0.000 0.000 0.000 _tree.pyx:294(update) ... 16 / 26
  • 27. Pro
  • 28. ling (cont.) # line_profiler In [5]: %lprun -f DecisionTreeClassifier.fit clf.fit(X, y) Line % Time Line Contents ================================= ... 256 4.5 self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_) 257 258 # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise 259 0.4 if max_leaf_nodes 0: 260 0.5 builder = DepthFirstTreeBuilder(splitter, min_samples_split, 261 0.6 self.min_samples_leaf, 262 else: 263 builder = BestFirstTreeBuilder(splitter, min_samples_split, 264 self.min_samples_leaf, max_depth, 265 max_leaf_nodes) 266 267 22.4 builder.build(self.tree_, X, y, sample_weight) ... 17 / 26
  • 29. Call graph python -m cProfile -o profile.prof script.py gprof2dot -f pstats profile.prof -o graph.dot 18 / 26
  • 30. Python is slow :-( Python overhead is too large for high-performance code. Whenever feasible, use high-level operations (e.g., SciPy or NumPy operations on arrays) to limit Python calls and rely on highly-optimized code. def dot_python(a, b): # Pure Python (2.09 ms) s = 0 for i in range(a.shape[0]): s += a[i] * b[i] return s np.dot(a, b) # NumPy (5.97 us) Otherwise (and only then !), write compiled C extensions (e.g., using Cython) for critical parts. cpdef dot_mv(double[::1] a, double[::1] b): # Cython (7.06 us) cdef double s = 0 cdef int i for i in range(a.shape[0]): s += a[i] * b[i] return s 19 / 26
  • 31. Stay close to the metal Use the right data type for the right operation. Avoid repeated access (if at all) to Python objects. Trees are represented by single arrays. Tips. In Cython, check for hidden Python overhead. Limit yellow lines as much as possible ! cython -a tree.pyx 20 / 26
  • 32. Stay close to the metal (cont.) Take care of data locality and contiguity. Make data contiguous to leverage CPU prefetching and cache mechanisms. Access data in the same way it is stored in memory. Tips. If accessing values row-wise (resp. column-wise), make sure the array is C-ordered (resp. Fortran-ordered). cdef int[::1, :] X = np.asfortranarray(X, dtype=np.int) cdef int i, j = 42 cdef s = 0 for i in range(...): s += X[i, j] # Fast s += X[j, i] # Slow If not feasible, use pre-buering. 21 / 26
  • 33. Stay close to the metal (cont.) Arrays accessed with bare pointers remain the fastest solution we have found (sadly). NumPy arrays or MemoryViews are slightly slower Require some pointer kung-fu # 7.06 us # 6.35 us 22 / 26
  • 34. Ecient parallelism in Python is possible ! 23 / 26
  • 35. Joblib Scikit-Learn implementation of Random Forests relies on joblib for building trees in parallel. Multi-processing backend Multi-threading backend Require C extensions to be GIL-free Tips. Use nogil declarations whenever possible. Avoid memory dupplication trees = Parallel(n_jobs=self.n_jobs)( delayed(_parallel_build_trees)( tree, X, y, ...) for i, tree in enumerate(trees)) 24 / 26
  • 36. A winning strategy Scikit-Learn implementation proves to be one of the fastest among all libraries and programming languages. 14000 12000 10000 8000 6000 4000 2000 0 Fit time (s ) 203.01 211.53 4464.65 3342.83 randomForest 1518.14 1711.94 R, Fortran 1027.91 13427.06 10941.72 Scikit-Learn-RF Scikit-Learn-ETs OpenCV-RF OpenCV-ETs OK3-RF OK3-ETs Weka-RF R-RF Orange-RF Scikit-Learn Python, Cython OpenCV C++ OK3 C Weka Java Orange Python 25 / 26
  • 37. Summary The open source development cycle really empowered the Scikit-Learn implementation of Random Forests. Combine algorithmic improvements with code optimization. Make use of pro
  • 38. ling tools to identify bottlenecks. Optimize only critical code ! 26 / 26