Random Forests are, without contest, one of the most robust, accurate, and versatile tools for solving machine learning tasks. Implementing this algorithm properly and efficiently, however, remains a challenging task involving issues that are easily overlooked if not considered with care. In this talk, we present the Random Forests implementation developed within the Scikit-Learn machine learning library. In particular, we describe the iterative team efforts that led us to gradually improve our codebase and eventually make Scikit-Learn's Random Forests one of the most efficient implementations in the scientific ecosystem, across all libraries and programming languages. Algorithmic and technical optimizations that have made this possible include:
- An efficient formulation of the decision tree algorithm, tailored for Random Forests;
- Cythonization of the tree induction algorithm;
- CPU cache optimizations, through low-level organization of data into contiguous memory blocks;
- Efficient multi-threading through GIL-free routines;
- A dedicated sorting procedure, taking into account the properties of data;
- Shared pre-computations whenever critical.
Overall, we believe that lessons learned from this case study extend to a broad range of scientific applications and may be of interest to anybody doing data analysis in Python.
3. About
Scikit-Learn
- Machine learning library for Python
- Classical and well-established algorithms
- Emphasis on code quality and usability
Myself
- @glouppe
- PhD student (Liege, Belgium)
- Core developer on Scikit-Learn since 2011
- Chief tree hugger
5. Machine Learning 101
Data comes as...
- A set of samples L = {(x_i, y_i) | i = 0, ..., N-1}, with
- Feature vector x ∈ R^p (= input), and
- Response y ∈ R (regression) or y ∈ {0, 1} (classification) (= output)
Goal is to...
- Find a function ŷ = φ(x)
- Such that error L(y, ŷ) on new (unseen) x is minimal
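These definitions can be made concrete with a tiny worked example; the data and the hand-made classifier φ below are illustrative inventions, not taken from the talk:

```python
import numpy as np

# Toy sample set L = {(x_i, y_i)}: 2-dimensional inputs, binary outputs
X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.5]])
y = np.array([0, 0, 1, 1])

def phi(x):
    # A hand-made classifier: predict class 1 when the first feature exceeds 1.5
    return int(x[0] > 1.5)

def zero_one_loss(y_true, y_pred):
    # Classification error L(y, y_hat): fraction of mismatched predictions
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

y_pred = [phi(x) for x in X]
print(zero_one_loss(y, y_pred))  # 0.0: phi separates these four samples perfectly
```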
7. Decision Trees
[Figure: a binary decision tree with nodes t_1, ..., t_17. Each split node t tests X_t ≤ v_t on an input x; each leaf node stores p(Y = c | X = x).]
- t ∈ φ : nodes of the tree φ
- X_t : split variable at t
- v_t ∈ R : split threshold at t
- φ(x) = arg max_{c ∈ Y} p(Y = c | X = x)
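The notation above translates directly into a prediction routine: starting at the root, follow the X_t ≤ v_t comparisons down to a leaf and return the most probable class. The dict-based node layout below is a hypothetical sketch for illustration, not Scikit-Learn's actual array-based data structure:

```python
# Each split node stores (feature X_t, threshold v_t, left, right);
# each leaf stores class probabilities p(Y = c | X = x).
tree = {
    "feature": 0, "threshold": 1.5,
    "left": {"leaf": {0: 0.9, 1: 0.1}},
    "right": {"feature": 1, "threshold": 0.5,
              "left": {"leaf": {0: 0.4, 1: 0.6}},
              "right": {"leaf": {0: 0.1, 1: 0.9}}},
}

def predict(node, x):
    # phi(x) = argmax_c p(Y = c | X = x) at the leaf reached by x
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    proba = node["leaf"]
    return max(proba, key=proba.get)

print(predict(tree, [2.0, 0.3]))  # goes right at the root, then left
```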
9. Learning from data

function BuildDecisionTree(L)
    Create node t
    if the stopping criterion is met for t then
        ŷ_t = some constant value
    else
        Find the best partition L = L_L ∪ L_R
        t_L = BuildDecisionTree(L_L)
        t_R = BuildDecisionTree(L_R)
    end if
    return t
end function
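The pseudocode above can be sketched in plain Python. The split search here is a deliberately naive exhaustive scan minimizing weighted Gini impurity; function names are made up, and this is nothing like the optimized Cython implementation the talk describes:

```python
from collections import Counter

def gini(ys):
    # Gini impurity of a list of class labels
    n = len(ys)
    return 1.0 - sum((c / n) ** 2 for c in Counter(ys).values())

def build_decision_tree(X, y, min_samples=2):
    # Stopping criterion: pure node or too few samples -> leaf with majority class
    if len(set(y)) == 1 or len(y) < min_samples:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    best = None
    # Find the best partition L = L_L U L_R by exhaustive search over (feature, threshold)
    for j in range(len(X[0])):
        for v in sorted({x[j] for x in X}):
            left = [i for i, x in enumerate(X) if x[j] <= v]
            right = [i for i in range(len(X)) if i not in left]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, v, left, right)
    if best is None:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    _, j, v, left, right = best
    return {"feature": j, "threshold": v,
            "left": build_decision_tree([X[i] for i in left],
                                        [y[i] for i in left], min_samples),
            "right": build_decision_tree([X[i] for i in right],
                                         [y[i] for i in right], min_samples)}

print(build_decision_tree([[0], [1], [2], [3]], [0, 0, 1, 1]))
```

On this toy data the builder recovers the obvious split at threshold 1 and terminates with two pure leaves.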
11. History
Time for building a Random Forest (relative to version 0.10):

Version:        0.10  0.11  0.12  0.13  0.14  0.15
Relative time:  1.00  0.99  0.98  0.33  0.11  0.04

0.10 : January 2012
- First sketch at sklearn.tree and sklearn.ensemble
- Random Forests and Extremely Randomized Trees modules

12. History
0.11 : May 2012
- Gradient Boosted Regression Trees module
- Out-of-bag estimates in Random Forests

13. History
0.12 : October 2012
- Multi-output decision trees

14. History
0.13 : February 2013
- Speed improvements
- Rewriting from Python to Cython
- Support of sample weights
- Totally randomized trees embedding

15. History
0.14 : August 2013
- Complete rewrite of sklearn.tree
- Refactoring
- Cython enhancements
- AdaBoost module

16. History
0.15 : August 2014
- Further speed and memory improvements
- Better algorithms
- Cython enhancements
- Better parallelism
- Bagging module
17. Implementation overview
Modular implementation, designed with a strict separation of concerns
- Builders : for building and connecting nodes into a tree
- Splitters : for finding a split
- Criteria : for evaluating the goodness of a split
- Tree : dedicated data structure
Efficient algorithmic formulation [See Louppe, 2014]
Tips. An efficient algorithm is better than a bad one, even if the implementation of the latter is strongly optimized.
- Dedicated sorting procedure
- Efficient evaluation of consecutive splits
Close to the metal, carefully coded implementation
- 2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests

# But we kept it stupid simple for users!
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
19. Development cycle
An iterative loop:
- User feedback
- Benchmarks
- Profiling
- Algorithmic and code improvements
- Implementation
- Peer review
20. Continuous benchmarks
During code review, changes in the tree codebase are monitored with benchmarks.
- Ensure performance and code quality.
- Avoid code complexification.
30. Python is slow :-(
Python overhead is too large for high-performance code.
Whenever feasible, use high-level operations (e.g., SciPy or NumPy operations on arrays) to limit Python calls and rely on highly-optimized code.

def dot_python(a, b):  # Pure Python (2.09 ms)
    s = 0
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s

np.dot(a, b)  # NumPy (5.97 us)

Otherwise (and only then!), write compiled C extensions (e.g., using Cython) for critical parts.

cpdef dot_mv(double[::1] a, double[::1] b):  # Cython (7.06 us)
    cdef double s = 0
    cdef int i
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s
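The gap on the slide can be reproduced with a small, self-contained timeit harness; the absolute figures will differ by machine, but the orders-of-magnitude difference between the interpreted loop and np.dot should not:

```python
import timeit
import numpy as np

def dot_python(a, b):
    # Pure-Python dot product: one interpreter dispatch per element
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s

rng = np.random.default_rng(0)
a = rng.random(10_000)
b = rng.random(10_000)

t_py = timeit.timeit(lambda: dot_python(a, b), number=20) / 20
t_np = timeit.timeit(lambda: np.dot(a, b), number=20) / 20
print(f"pure Python: {t_py * 1e6:.1f} us, np.dot: {t_np * 1e6:.1f} us")
```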
31. Stay close to the metal
- Use the right data type for the right operation.
- Avoid repeated access (if at all) to Python objects.
- Trees are represented by single arrays.
Tips. In Cython, check for hidden Python overhead. Limit yellow lines as much as possible!

cython -a tree.pyx
32. Stay close to the metal (cont.)
Take care of data locality and contiguity.
- Make data contiguous to leverage CPU prefetching and cache mechanisms.
- Access data in the same way it is stored in memory.
Tips. If accessing values row-wise (resp. column-wise), make sure the array is C-ordered (resp. Fortran-ordered).

cdef int[::1, :] X = np.asfortranarray(X, dtype=np.intc)
cdef int i, j = 42
cdef int s = 0
for i in range(...):
    s += X[i, j]  # Fast
    s += X[j, i]  # Slow

If not feasible, use pre-buffering.
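The same locality effect is visible from pure NumPy, without Cython: iterating over the slices that match the memory order is markedly faster than iterating against it. A rough sketch (timings are machine-dependent):

```python
import timeit
import numpy as np

n = 2000
C = np.ascontiguousarray(np.ones((n, n)))  # C-ordered: rows are contiguous
F = np.asfortranarray(np.ones((n, n)))     # Fortran-ordered: columns are contiguous

def row_sums(X):
    # Walk each row in full before moving to the next one
    return [X[i, :].sum() for i in range(X.shape[0])]

# Row-wise traversal matches C order (cache-friendly) but fights Fortran order
t_c = timeit.timeit(lambda: row_sums(C), number=5)
t_f = timeit.timeit(lambda: row_sums(F), number=5)
print(f"row-wise over C-ordered: {t_c:.3f}s, over Fortran-ordered: {t_f:.3f}s")
```

Both calls compute the same sums; only the memory traversal pattern differs, which is exactly the point of the slide.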
33. Stay close to the metal (cont.)
Arrays accessed with bare pointers remain the fastest solution we have found (sadly).
- NumPy arrays or MemoryViews are slightly slower (7.06 us for the memoryview dot product above vs. 6.35 us with bare pointers)
- Require some pointer kung-fu
35. Joblib
Scikit-Learn implementation of Random Forests relies on joblib for building trees in parallel.
- Multi-processing backend
- Multi-threading backend
  - Requires C extensions to be GIL-free
  - Tips. Use nogil declarations whenever possible.
  - Avoids memory duplication

trees = Parallel(n_jobs=self.n_jobs)(
    delayed(_parallel_build_trees)(
        tree, X, y, ...)
    for i, tree in enumerate(trees))
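The structure joblib exploits here is embarrassingly parallel: each tree is fit independently on its own bootstrap resample. That pattern can be approximated with only the standard library; the toy "tree" below is a decision stump on one feature, and all names are illustrative rather than joblib's or Scikit-Learn's API:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fit_stump(X, y, seed):
    # Toy "tree": a one-feature decision stump fit on a bootstrap resample of (X, y)
    rng = random.Random(seed)
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
    # Pick the threshold on feature 0 minimizing misclassifications
    # when predicting class 0 for x <= threshold
    return min({x[0] for x in Xb},
               key=lambda v: sum((x[0] <= v) != (c == 0) for x, c in zip(Xb, yb)))

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# Stdlib analogue of Parallel(n_jobs=4)(delayed(fit_stump)(X, y, s) for s in range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    stumps = list(pool.map(lambda s: fit_stump(X, y, s), range(10)))
print(stumps)
```

With pure-Python workers like this one, threads stay serialized by the GIL; the slide's point is that the real tree builders release the GIL in Cython (nogil), so joblib's threading backend gets true parallelism without duplicating X and y per worker.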
36. A winning strategy
Scikit-Learn implementation proves to be one of the fastest among all libraries and programming languages.

Fit time (s):
Scikit-Learn-RF     (Python, Cython)   203.01
Scikit-Learn-ETs    (Python, Cython)   211.53
OpenCV-RF           (C++)              4464.65
OpenCV-ETs          (C++)              3342.83
OK3-RF              (C)                1518.14
OK3-ETs             (C)                1711.94
Weka-RF             (Java)             1027.91
R-RF (randomForest) (R, Fortran)       13427.06
Orange-RF           (Python)           10941.72
37. Summary
The open source development cycle really empowered the Scikit-Learn implementation of Random Forests.
- Combine algorithmic improvements with code optimization.
- Make use of profiling tools to identify bottlenecks.
- Optimize only critical code!