Accelerating Random Forests in Scikit-Learn


Random Forests are among the most robust, accurate and versatile tools for solving machine learning tasks. Implementing this algorithm properly and efficiently nevertheless remains a challenging task, involving issues that are easily overlooked. In this talk, we present the Random Forests implementation developed within the Scikit-Learn machine learning library. In particular, we describe the iterative team efforts that led us to gradually improve our codebase and eventually make Scikit-Learn's Random Forests one of the most efficient implementations in the scientific ecosystem, across all libraries and programming languages. Algorithmic and technical optimizations that have made this possible include:

- An efficient formulation of the decision tree algorithm, tailored for Random Forests;
- Cythonization of the tree induction algorithm;
- CPU cache optimizations, through low-level organization of data into contiguous memory blocks;
- Efficient multi-threading through GIL-free routines;
- A dedicated sorting procedure, taking into account the properties of data;
- Shared pre-computations whenever critical.

Overall, we believe that lessons learned from this case study extend to a broad range of scientific applications and may be of interest to anybody doing data analysis in Python.
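
As an illustration of why sorting matters for tree induction, here is a minimal pure-Python sketch of a best-split search on a single feature. It is not Scikit-Learn's code (which the abstract describes as Cython operating on contiguous memory); the function names `gini` and `best_split` are hypothetical, and the Gini criterion is an assumption made for the example. The idea shown is the generic one: sort the samples once by feature value, then sweep the candidate thresholds while updating class counts incrementally, so that the sweep itself is linear in the number of samples.

```python
from collections import Counter


def gini(counts, n):
    """Gini impurity of a node holding n samples with the given class counts."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best binary split.

    Hypothetical helper for illustration only. Samples are sorted once;
    class counts are then updated incrementally during the sweep.
    Returns (None, inf) when no threshold can separate the samples.
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    n = len(order)
    left = Counter()          # class counts in the left child
    right = Counter(labels)   # class counts in the right child
    best_threshold, best_impurity = None, float("inf")

    for rank in range(n - 1):
        i, j = order[rank], order[rank + 1]
        # Move sample i from the right child to the left child.
        left[labels[i]] += 1
        right[labels[i]] -= 1
        # Equal consecutive values cannot be separated by a threshold.
        if values[i] == values[j]:
            continue
        n_left = rank + 1
        n_right = n - n_left
        impurity = (n_left * gini(left, n_left)
                    + n_right * gini(right, n_right)) / n
        if impurity < best_impurity:
            best_threshold = (values[i] + values[j]) / 2.0
            best_impurity = impurity

    return best_threshold, best_impurity
```

For example, `best_split([2.0, 1.0, 3.0, 4.0], [0, 0, 1, 1])` places the threshold midway between 2.0 and 3.0, where both children are pure. The sort dominates the cost of the search, which is one reason a dedicated sorting procedure can pay off.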

Published in: Data & Analytics

Accelerating Random Forests in Scikit-Learn

  1. Accelerating Random Forests in Scikit-Learn. Gilles Louppe, Université de Liège, Belgium. August 29, 2014.
  2. Motivation ... and many more applications!
  3. About Scikit-Learn: Machine learning library for Python. Classical and well-established algorithms. Emphasis on code quality and usability. Myself: @glouppe, PhD student (Liège, Belgium), core developer on Scikit-Learn since 2011, chief tree hugger.
  4. Outline: 1 Basics; 2 Scikit-Learn implementation; 3 Python improvements.
  5. Machine Learning 101. Data comes as... A set of samples L = {(x_i, y_i) | i = 0, ..., N
