Accelerating the Random Forest algorithm for commodity parallel hardware
Mark Seligman
Suiji
August 5, 2015
Outline
Introduction
Random Forests
Implementation
Examples and anecdotes
Ongoing work
Summary and future work
Q & A
Arborist project
Began as proprietary implementation of Random Forest (TM) algorithm.
Aim was enhanced performance across a wide variety of hardware, data and workflows.
GPU acceleration a key concern.
Open-sourced and rewritten following dissolution of venture.
Arborist is the project name.
Pyborist is the Python implementation, under development.
Rborist is the R package.
Project design goals
Language-agnostic, compiled core.
Minimal reliance on call-backs and external libraries.
Minimize data movement.
Ready extensibility.
Common source base for all spins.
Binary decision trees, briefly
Prediction method presenting a series of true/false questions about the data.
Answer to given question determines which (of two) questions to pose next.
Successive T/F branching relationship justifies “tree” nomenclature.
Different data take different paths through the tree.
Terminal (or “leaf”) node in path reports score for that path (and data).
Can build single tree and refine: “boosting”.
Can build “forest” of (typically) 100 − 1000 trees.
Overall average (regression) or plurality (classification) derived from each tree’s score.
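For concreteness, a minimal C++ sketch of the traversal just described; Node and predictOne are illustrative names, not the Arborist data structures:

#include <vector>

struct Node {
  int predictor;   // column tested at this node; -1 marks a leaf
  double cut;      // numerical criterion: x[predictor] <= cut branches left
  int left, right; // indices of the two successor nodes
  double score;    // value reported when this node ends the path
};

// Walks one observation from the root to a leaf and returns its score.
double predictOne(const std::vector<Node>& tree, const std::vector<double>& x) {
  int i = 0;
  while (tree[i].predictor >= 0)  // interior node: pose the T/F question
    i = (x[tree[i].predictor] <= tree[i].cut) ? tree[i].left : tree[i].right;
  return tree[i].score;           // leaf: report the score for this path
}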
Random Forests
Random Forest is trademarked, registered to Leo Breiman (dec.) and Adele Cutler.
Predicts or validates vector of data (“response”)
Numerical: “regression”.
Categorical: “classification”.
Trains on design matrix of observations: “predictor” columns.
Columns individually either numerical or categorical (“factors”).
Trees trained on randomly-selected (“bagged”) set of matrix rows.
Predictors sampled randomly throughout training - separately chosen for each node.
Validation on held-out subset: different for each tree.
Independent prediction on separately-provided test sets.
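How per-tree scores combine into a forest-level answer, sketched under the assumption that per-tree results are already collected (helper names are hypothetical):

#include <algorithm>
#include <map>
#include <numeric>
#include <vector>

// Regression: overall average of the trees' leaf scores.
double regressForest(const std::vector<double>& treeScores) {
  return std::accumulate(treeScores.begin(), treeScores.end(), 0.0) /
         treeScores.size();
}

// Classification: plurality (majority vote) over the trees' predicted classes.
int classifyForest(const std::vector<int>& treeVotes) {
  std::map<int, int> tally;
  for (int c : treeVotes) ++tally[c];
  return std::max_element(tally.begin(), tally.end(),
                          [](const std::pair<const int, int>& a,
                             const std::pair<const int, int>& b) {
                            return a.second < b.second;
                          })->first;
}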
Training as tree building
Begins with a root node, together with the bagged set.
Bagging: can view as indicator set of row indices, with multiplicities.
Subnode construction (“splitting”) is driven by information content.
Nodes with sufficient information branch into two new subnodes.
Branch is annotated with splitting criterion, determining its sense.
If no splitting, the node is terminal: leaf.
Tree construction can proceed depth-first, breadth-first, ...
Construction terminates when frontier nodes exhaust information content.
User may also constrain termination: node count, tree depth, node width, ...
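Bagging-as-multiplicities, sketched below; the package itself delegates PRNG and sampling to front-end call-backs, so this standalone version is illustrative only:

#include <random>
#include <vector>

// Draws nRow rows with replacement and records how often each appears.
// Rows left at zero form the held-out (out-of-bag) set for validation.
std::vector<int> bagCounts(int nRow, std::mt19937& gen) {
  std::uniform_int_distribution<int> pick(0, nRow - 1);
  std::vector<int> count(nRow, 0);  // multiplicity per row index
  for (int i = 0; i < nRow; ++i)
    ++count[pick(gen)];
  return count;
}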
Building trees: splitting as conditioning
Splitting has the consequence of partitioning the training data into progressively smaller subsets.
Operationally, the splitting criterion conditions the data into complementary subspaces.
The left successor inherits the subspace satisfying the criterion.
The right successor inherits its complement.
From this perspective, the root node trains on all bagged observations.
Successor nodes, similarly, train on data conditioned by the parent.
As we’ll see, the conditioned subspaces can be characterized as row sections of the design.
In other words, the splitting criteria define successive bipartitions on row indices.
From this perspective, then, the algorithm would seem to terminate naturally.
Splitting: predictor perspective
Splitting criteria are formulated as order or subset relations with respect to some predictor:
E.g., numerical predictor: p <= 3.2 ? branch left : branch right.
Factor predictor: q ∈ {3, 8, 17} ? branch left : branch right.
At a given node, candidate criteria obtained over randomly-sampled set of predictors.
Each predictor evaluates a series of (L/R) trial subsets:
For numerical predictors, trials are distinct cuts in the linear order.
For factors, trials are partitions over the runs of identical predictor values.
Criterion derived from trial maximizing “impurity” (separation) on response.
The predictor/criterion pair best “separating” the response is chosen for splitting.
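A sketch of the numerical/regression case: responses are walked in predictor order and each trial cut is scored by the weighted-variance criterion, here in its equivalent between-group sum-of-squares form. bestCut is a hypothetical name:

#include <vector>

// y: responses arranged in increasing order of the predictor.
// Returns the position after which the best left/right cut falls, or -1.
// A real implementation scores only cuts between distinct predictor
// values (runs); ties are ignored here for brevity.
int bestCut(const std::vector<double>& y) {
  double total = 0.0;
  for (double v : y) total += v;
  const int n = static_cast<int>(y.size());
  double sumL = 0.0, best = 0.0;
  int argMax = -1;
  for (int i = 0; i < n - 1; ++i) {  // trial cut between positions i, i+1
    sumL += y[i];
    const double sumR = total - sumL;
    // Maximizing between-group sum of squares minimizes weighted variance.
    const double gain = sumL * sumL / (i + 1) + sumR * sumR / (n - 1 - i);
    if (gain > best) { best = gain; argMax = i; }
  }
  return argMax;
}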
Predictor ordering ⇐⇒ row index permutation
Trial score is a function only of response - evaluated according to predictor order.
Irrespective of predictor, a given node is scored over a unique set of response indices.
The role of the predictor is to dictate the order to walk the indices.
That is, predictor values play no role in scoring the trials.
Hence only predictor ranks (and runs) affect scoring.
Each trial, in particular the “winner”, determines a bipartition of predictor ranks.
The predictor ranks, in turn, define a bipartition of row indices:
One set of indices characterizes the left branch.
Its complement characterizes the right branch.
Throughout training, then, the frontier nodes train over a (highly-disconnected) partition of the original bagged data, as row sections.
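The rank/permutation view in miniature: each column is argsorted once, splitting walks the resulting index vector, and the winning cut's prefix of indices is exactly the left branch's row set (orderColumn is illustrative):

#include <algorithm>
#include <numeric>
#include <vector>

// Computed once per predictor: the permutation of row indices sorting the
// column; idx[k] is the row holding the k-th smallest predictor value.
std::vector<int> orderColumn(const std::vector<double>& col) {
  std::vector<int> idx(col.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::stable_sort(idx.begin(), idx.end(),
                   [&col](int a, int b) { return col[a] < col[b]; });
  return idx;
}
// If the winning cut falls after position c, rows idx[0..c] go left and
// the remaining rows right: a bipartition of row indices.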
Trial generation: 4 cases, divergent work loads
Response (criterion)              Numerical     Factor
Regression (weighted variance)    Index walk    Index walk → run sets; run-set sort; run-set walk: O(#runs)
Classification (Gini gain)        Index walk    Index walk → run sets; run-set walk: O(2^#runs)
Index walks are linear in node width, but differ in state maintained.
Power set walks resort to sampling above ∼ 10 runs.
Binary classification walks runs linearly.
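A sketch of the linear run walk for the binary case: by a classical result, ordering a factor's runs by class-1 proportion guarantees the optimal subset split is some prefix, reducing the O(2^#runs) search to a sort plus one Gini pass (names are illustrative):

#include <algorithm>
#include <vector>

struct Run { double n1, n; };  // class-1 count and total count per level

// N1, N: class-1 and total counts at the node. Returns the best value of
// the between-group Gini statistic over all prefix splits.
double bestGiniSplit(std::vector<Run> runs, double N1, double N) {
  std::sort(runs.begin(), runs.end(),  // order levels by class-1 proportion
            [](const Run& a, const Run& b) { return a.n1 / a.n < b.n1 / b.n; });
  double n1L = 0.0, nL = 0.0, best = 0.0;
  for (std::size_t i = 0; i + 1 < runs.size(); ++i) {
    n1L += runs[i].n1;
    nL += runs[i].n;
    const double n1R = N1 - n1L, nR = N - nL;
    const double n0L = nL - n1L, n0R = nR - n1R;
    // Maximizing this statistic minimizes the weighted Gini impurity.
    const double g = (n1L * n1L + n0L * n0L) / nL +
                     (n1R * n1R + n0R * n0R) / nR;
    if (g > best) best = g;
  }
  return best;
}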
Aside: performance is data-dependent
As with linear algebra, the appropriate treatment depends on the contents of the (design) matrix: SVD, for example.
E.g., regression has regular access patterns and tends to run very quickly.
Constraints on response or predictor values may admit numerical simplification.
Will ties play a significant role? Sparse data can train very quickly.
Custom implementations rely heavily on the answer to such questions.
It therefore makes sense to strive for extensibility and ease of customization.
Data locality
Computers store data in hierarchy:
Registers.
Caches (L1 - L3).
RAM.
Disk.
CPU operates on registers.
Loading registers consumes many clock cycles, depending upon position in hierarchy.
Performance therefore best when data is spatially (hence temporally) local.
Similarly, loops over vectors are most efficient when data in consecutive iterations are separated by predictable and short(ish) strides.
“Regular” access patterns allow compiler, and hence hardware, to do a good job.
Regularity is crucial for GPUs, which excel at performing identical operations on contiguous data.
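A toy illustration of the stride point: the two loops below sum the same row-major matrix, but only the first touches memory at unit stride; the second jumps nCol doubles per iteration and loses spatial locality:

#include <vector>

double sumUnitStride(const std::vector<double>& m, int nRow, int nCol) {
  double s = 0.0;
  for (int r = 0; r < nRow; ++r)
    for (int c = 0; c < nCol; ++c)
      s += m[r * nCol + c];  // consecutive iterations touch adjacent memory
  return s;
}

double sumLargeStride(const std::vector<double>& m, int nRow, int nCol) {
  double s = 0.0;
  for (int c = 0; c < nCol; ++c)
    for (int r = 0; r < nRow; ++r)
      s += m[r * nCol + c];  // stride of nCol doubles: cache-unfriendly
  return s;
}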
Algorithm: observations
Splitting is “embarrassingly parallel”: trials can be evaluated on all nodes in the frontier, and all candidate predictors, at the same time.
However, ranks corresponding to a node’s row indices vary with predictor.
Naive solution is to sort observations at each splitting step.
Approach used by early implementations.
Does not scale well.
Predictor ordering used repeatedly: suggests pre-ordering.
Algorithm: cont.
With pre-ordering, index walk accumulates per-node state by row index lookup.
Original Arborist approach.
No data locality, as index lookup is irregular.
Large state budget: must be swapped as indexed node changes.
“Restaging”: maintain separately-sorted state vectors for each predictor, by node.
Begin with pre-sorted list, update via (stable) bipartition at each node.
Current Arborist approach. Data locality improves with tree depth.
Only a modest amount of state to move: 16 bytes, including doubles.
Splitting becomes quite regular: next datum prefetchable by hardware.
Each node/predictor pair restageable on SIMD hardware:
Partition using parallel scan.
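A sequential sketch of restaging as a stable bipartition, assuming per-node state vectors already in predictor order; Stage and restage are illustrative names, not the Arborist types:

#include <vector>

struct Stage { int row; double rank; };  // per-predictor state, ~16 bytes

// One regular, prefetchable pass: each element lands in the successor
// dictated by its row's branch sense; stability preserves predictor order.
void restage(const std::vector<Stage>& node,
             const std::vector<char>& goesLeft,  // indexed by row
             std::vector<Stage>& left, std::vector<Stage>& right) {
  for (const Stage& s : node)
    (goesLeft[s.row] ? left : right).push_back(s);
}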
Organization
Compiled code with various language front-ends.
R was the driving language, but Python now under active development.
Front-end “bridges” wherever possible: Rcpp, Cython.
Minimal use of front-end call-backs: PRNG, sampling, sorting.
Common code base also supports GPU version, largely as a subtyped extension.
Look and feel
Guided by existing packages. Many options the same, or similar.
Supports only numeric and factor data: leaves type-wrangling to the user.
Predictor sampling: continuous predProb (Bernoulli) vs. discrete max features (without replacement); see the sketch following this list.
Breadth-first implementation; introduces specification of terminating level.
Introduces information-based stopping criterion, minRatio.
Unlimited (essentially) factor cardinality: blessing as well as curse.
Many useful package options remain to be implemented.
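The sketch promised above: continuous predProb draws each predictor independently, so the candidate count is itself random, whereas discrete mtry fixes it exactly:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Bernoulli scheme: expected candidate count is nPred * predProb.
std::vector<int> sampleBernoulli(int nPred, double predProb, std::mt19937& gen) {
  std::bernoulli_distribution coin(predProb);
  std::vector<int> cand;
  for (int p = 0; p < nPred; ++p)
    if (coin(gen)) cand.push_back(p);
  return cand;
}

// Without-replacement scheme: exactly mtry distinct predictors (mtry <= nPred).
std::vector<int> sampleMtry(int nPred, int mtry, std::mt19937& gen) {
  std::vector<int> all(nPred);
  std::iota(all.begin(), all.end(), 0);
  std::shuffle(all.begin(), all.end(), gen);
  all.resize(mtry);
  return all;
}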
Distinguishing features
Decoupling of splitting from row-lookup: restaging + highly regular node walk.
Both stages benefit from resulting data locality and regularity.
Restaging maintained as stable partition (amenable to SIMD parallelization).
Training produces lightweight, serial “pre-tree”.
Rich intermediate state: e.g., frontier maps reveal quantiles.
Amenability to workflow internalization: “loopless” behavior.
Training wheels, early experience
Began with R front end.
Compare performance with randomForest package.
“Medium” data: large, but in-memory.
Speedups typically observed as row counts approach 500 − 1000.
Linear scaling with # predictors, # trees, as expected.
Log-linear with # rows, also as expected.
Regression much easier to accelerate than classification.
Feature trials: Bernoulli vs. w/o replacement
German credit-scoring data [1]: binary response, 1000 rows.
7 numerical predictors.
13 categorical predictors, cardinalities range from 2 to 10.
Graphs to follow compare misprediction rates and execution times at various predictor selections.
Accuracy, as a function of trial metric:
Bernoulli (blue) and without-replacement (green) appear to track well.
Performance of Rborist generally 2–3×, in this regime.
[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[Figure: “Misprediction: predProb (Rborist) vs. mtry (RF)”; x-axis: mtry (default = 4) or equivalent (default = 4.8), range 5–15; y-axis: misprediction rate, roughly 0.20–0.32.]
[Figure: “Execution time ratios: randomForest / Rborist”; x-axis: (equivalent) mtry, range 5–15; y-axis: ratio, roughly 2.0–4.5.]
Instructive user account
Arborist performance problem noted in recent blog post [2].
Airline flight-delay data tested on various RF packages.
8 predictors; various row counts: 10^4, 10^5, 10^6, 10^7.
Slowdown appears due to error in large-cardinality sampling.
In fact, the GitHub version had already repaired the salient problem.
Nonetheless, the exercise suggests improvements, some already implemented, others to come.
[2] “Benchmarking Random Forest Implementations”, Szilard Pafka, DataScience.LA, May 19, 2015.
Account, cont.
Splitting now parallelized across all pairs, rather than by-predictor.
Class weighting to treat unbalanced data.
Binary classification with high-cardinality factors:
Replaces sampling with n log n method.
Points to need for more, and broader, testing.
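A sketch of what pair-level parallelism can look like with OpenMP: the frontier is flattened into node/predictor pairs so dynamic scheduling can balance uneven node widths. evalPair is a hypothetical stand-in for trial generation, not the Arborist signature:

#include <utility>
#include <vector>

// Hypothetical stand-in: walks one node's staged indices for one predictor.
double evalPair(int node, int pred) { (void)node; (void)pred; return 0.0; }

// cand[n]: predictors sampled for node n; gain pre-sized by the caller.
void splitFrontier(int nNode, const std::vector<std::vector<int>>& cand,
                   std::vector<std::vector<double>>& gain) {
  std::vector<std::pair<int, int>> pairs;  // flatten the frontier
  for (int n = 0; n < nNode; ++n)
    for (int p : cand[n]) pairs.push_back(std::make_pair(n, p));
  #pragma omp parallel for schedule(dynamic)  // balance divergent workloads
  for (long i = 0; i < static_cast<long>(pairs.size()); ++i)
    gain[pairs[i].first][pairs[i].second] =
        evalPair(pairs[i].first, pairs[i].second);
}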
GPU: pilot study with University of Washington team
GWAS data provided by the Dept. of Global Health [3].
100 samples, up to ∼10^6 predictors.
Binary response: HIV detected or not.
Purely categorical predictors with cardinality = 3: SNPs.
Bespoke CPU and GPU versions spun off for the data set.
Each tree trained (almost) entirely on GPU.
Results illustrate potential - for highly regular data sets.
Drop-off on right is an artefact from copying data.frame.
[3] Courtesy of the Lingappam lab.
CPU vs. GPU: execution time ratios of bespoke versions
[Figure: “CPU vs GPU: timing ratios (1000 trees)”; x-axis: predictor count, 0–250,000; y-axis: ratio, roughly 15–50.]
GPU-centric packages
Ad hoc version not scalable as implemented: rewritten.
Restaging now implemented as stable partition via parallel scan.
Nvidia engineers concur with this solution, anticipate good scaling.
In general, though, data need not be so regular as this.
Mixed predictor types and multiple cardinalities present load-balancing challenges.
Dynamic parallelism option available for irregular workloads.
Predictor selection thwarts data locality: adjacent columns not necessarily used.
Lowest-hanging fruit may be isolated special cases such as SNP data.
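A serial sketch of the scan formulation: an exclusive prefix sum over the left/right flags assigns each element its destination slot, which is exactly the stable partition a GPU evaluates in parallel:

#include <vector>

// flag[i] == 1 when element i moves to the left successor. dest[i] is the
// slot element i occupies after the bipartition; equal flags keep their
// relative order, so the partition is stable.
std::vector<int> scanPartition(const std::vector<int>& flag) {
  const int n = static_cast<int>(flag.size());
  std::vector<int> dest(n);
  int nLeft = 0;
  for (int i = 0; i < n; ++i)   // exclusive scan over the left flags
    if (flag[i]) dest[i] = nLeft++;
  int nRight = 0;
  for (int i = 0; i < n; ++i)   // complementary scan for the right side
    if (!flag[i]) dest[i] = nLeft + nRight++;
  return dest;
}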
GPU vs. CPU
Highly regular regression/numeric case may perform well on GPU.
On-GPU transpose: restaging, splitting employ different major orderings.
For now: split on CPU and restage (highly regular) on GPU.
Multiple trees can be restaged at once via software pipeline:
Masks transfer latency by overlapping training of multiple trees.
Keeps CPU busy by dispatching less-regular tasks to multiple cores.
CPU-level parallelism
Original work focused on predictor-level parallelism, emphasizing wider data sets.
Node-level parallelism has emerged as equal player (e.g., flight-delay data).
But with high core count and closely-spaced data, false sharing looms as potential threat.
Infrastructure now in place to support hierarchical parallelization:
Head node orders predictors and scatters copies.
Multiple nodes each train blocks of trees on multicore hardware.
GPU participation also possible.
Head node gathers pretrees, builds forest, validates.
Remaining implementation chiefly a matter of scheduling and tuning.
CPU: load balancing
Mixed factor, numerical predictors offer greatest challenge, especially for classification.
In some cases, may benefit from parallelizing trial generations themselves.
Irrespective of data types, inter-level pass following splitting is inherently sequential.
May make sense to pipeline: overlap splitting of one tree with interlevel of another.
N.B.: Much more performance-testing is needed to investigate these scenarios.
Additional projects
Sparse internal representation.
Inchoate; main challenge is defining interface.
NA handling.
Some variants easier to implement than others.
Post-processing: facilitate use by other utilities.
Feature contributions.
Pyborist: goals
Encourage flexing, testing by broader ML community.
Honor precedent of scikit-learn: features, style.
Provide R-like abstractions: data frames and factors.
Attempt to minimize impact of host language on user data.
Stress software organization and design.
Pyborist: key ingredients
Cython bridge: emphasis on compilation.
Pandas: “dataframe” and “category” essential.
NumPy: PRNG, sort and sampling call-backs.
Considered other options: SWIG, CFFI, ctypes ...
Summary
Only a few rigid design principles:
Constrain data movement.
Language-agnostic, compiled core implementation.
Common source base.
Plenty of opportunities for improvement.
Load balancing appears to be lowest-hanging fruit: both CPU and GPU.
Longer term
Solicit help, comments from the community.
Expanded use of templating: large, small types.
Out-of-memory support.
Generalized dispatch, fat binaries.
Plugins for other splitting methods: non-Gini.
Internalize additional workflows.
Acknowledgments
Stephen Elston, Quanta Analytics.
Abraham Flaxman, Dept. Global Health, U.W.
Seattle PuPPy.
Thank you
mseligman@suiji.org
@suijidata
blog.suiji.org (“Mood Stochastic”)
github.com/suiji/Arborist
Mark Seligman (Suiji) Accelerating the Random Forest algorithm for commodity parallel hardware August 5, 2015 44 / 44

More Related Content

What's hot

Conistency of random forests
Conistency of random forestsConistency of random forests
Conistency of random forestsHoang Nguyen
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Parth Khare
 
Machine Learning - Decision Trees
Machine Learning - Decision TreesMachine Learning - Decision Trees
Machine Learning - Decision TreesRupak Roy
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodHonglin Yu
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest Rupak Roy
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forestsMarc Garcia
 
13 random forest
13 random forest13 random forest
13 random forestVishal Dutt
 
Understanding the Machine Learning Algorithms
Understanding the Machine Learning AlgorithmsUnderstanding the Machine Learning Algorithms
Understanding the Machine Learning AlgorithmsRupak Roy
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and BoostingMohit Rajput
 
WEKA: Algorithms The Basic Methods
WEKA: Algorithms The Basic MethodsWEKA: Algorithms The Basic Methods
WEKA: Algorithms The Basic MethodsDataminingTools Inc
 
WEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedWEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedDataminingTools Inc
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerDataminingTools Inc
 
Machine Learning Decision Tree Algorithms
Machine Learning Decision Tree AlgorithmsMachine Learning Decision Tree Algorithms
Machine Learning Decision Tree AlgorithmsRupak Roy
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction StratergiesAnjaliSoorej
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)Abhimanyu Dwivedi
 
Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning Usama Fayyaz
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)Abhimanyu Dwivedi
 

What's hot (20)

Conistency of random forests
Conistency of random forestsConistency of random forests
Conistency of random forests
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
Machine Learning - Decision Trees
Machine Learning - Decision TreesMachine Learning - Decision Trees
Machine Learning - Decision Trees
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning Method
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest
 
Decision tree
Decision treeDecision tree
Decision tree
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
 
13 random forest
13 random forest13 random forest
13 random forest
 
Understanding the Machine Learning Algorithms
Understanding the Machine Learning AlgorithmsUnderstanding the Machine Learning Algorithms
Understanding the Machine Learning Algorithms
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
WEKA: Algorithms The Basic Methods
WEKA: Algorithms The Basic MethodsWEKA: Algorithms The Basic Methods
WEKA: Algorithms The Basic Methods
 
WEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedWEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been Learned
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid Miner
 
Machine Learning Decision Tree Algorithms
Machine Learning Decision Tree AlgorithmsMachine Learning Decision Tree Algorithms
Machine Learning Decision Tree Algorithms
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction Stratergies
 
Primer on major data mining algorithms
Primer on major data mining algorithmsPrimer on major data mining algorithms
Primer on major data mining algorithms
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)
 

Similar to Accelerating Random Forest Algorithm for Parallel Hardware

Algoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nyaAlgoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nyabatubao
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 
Applied machine learning: Insurance
Applied machine learning: InsuranceApplied machine learning: Insurance
Applied machine learning: InsuranceGregg Barrett
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Derek Kane
 
Handling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random UndersamplingHandling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random UndersamplingIRJET Journal
 
Chap_05_Data_Collection_and_Analysis.ppt
Chap_05_Data_Collection_and_Analysis.pptChap_05_Data_Collection_and_Analysis.ppt
Chap_05_Data_Collection_and_Analysis.pptRosaHildaFlix
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfBeyaNasr1
 
Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...eSAT Publishing House
 
Artificial intyelligence and machine learning introduction.pptx
Artificial intyelligence and machine learning introduction.pptxArtificial intyelligence and machine learning introduction.pptx
Artificial intyelligence and machine learning introduction.pptxChandrakalaV15
 
IRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining ToolIRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining ToolIRJET Journal
 
An Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithmsAn Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithmsShouvic Banik0139
 
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHESIMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHESVikash Kumar
 
Mis End Term Exam Theory Concepts
Mis End Term Exam Theory ConceptsMis End Term Exam Theory Concepts
Mis End Term Exam Theory ConceptsVidya sagar Sharma
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionIRJET Journal
 
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETSURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETEditor IJMTER
 

Similar to Accelerating Random Forest Algorithm for Parallel Hardware (20)

Algoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nyaAlgoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nya
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
Applied machine learning: Insurance
Applied machine learning: InsuranceApplied machine learning: Insurance
Applied machine learning: Insurance
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
AI Algorithms
AI AlgorithmsAI Algorithms
AI Algorithms
 
Handling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random UndersamplingHandling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random Undersampling
 
Chap_05_Data_Collection_and_Analysis.ppt
Chap_05_Data_Collection_and_Analysis.pptChap_05_Data_Collection_and_Analysis.ppt
Chap_05_Data_Collection_and_Analysis.ppt
 
PythonML.pptx
PythonML.pptxPythonML.pptx
PythonML.pptx
 
A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAU...
A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAU...A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAU...
A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAU...
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...
 
Artificial intyelligence and machine learning introduction.pptx
Artificial intyelligence and machine learning introduction.pptxArtificial intyelligence and machine learning introduction.pptx
Artificial intyelligence and machine learning introduction.pptx
 
IRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining ToolIRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining Tool
 
Bank loan purchase modeling
Bank loan purchase modelingBank loan purchase modeling
Bank loan purchase modeling
 
An Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithmsAn Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithms
 
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHESIMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
 
Mis End Term Exam Theory Concepts
Mis End Term Exam Theory ConceptsMis End Term Exam Theory Concepts
Mis End Term Exam Theory Concepts
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & Prediction
 
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETSURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
 

More from PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 

Recently uploaded (20)

RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 

Accelerating Random Forest Algorithm for Parallel Hardware

• 10. Training as tree building
  Begins with a root node, together with the bagged set.
  Bagging: can view as indicator set of row indices, with multiplicities.
  Subnode construction (“splitting”) is driven by information content.
  Nodes with sufficient information branch into two new subnodes.
  Branch is annotated with splitting criterion, determining its sense.
  If no splitting, the node is terminal: leaf.
  Tree construction can proceed depth-first, breadth-first, ...
  Construction terminates when frontier nodes exhaust information content.
  User may also constrain termination: node count, tree depth, node width, ...

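To make the frontier-based control flow above concrete, here is a minimal, self-contained C++ sketch of breadth-first tree building under user-supplied stopping constraints. All names (Node, trySplit, minWidth, maxDepth) are illustrative, not the Arborist's actual interfaces, and the split itself is a toy placeholder for an information-driven criterion.

    #include <cstddef>
    #include <iostream>
    #include <queue>
    #include <vector>

    struct Node {
        std::vector<std::size_t> rows; // bagged row indices reaching this node
        int depth;
    };

    // Toy placeholder for an information-driven split: bipartitions rows by
    // parity. A real splitter would pick the best predictor/criterion pair.
    static bool trySplit(const Node& parent, Node& left, Node& right) {
        for (std::size_t r : parent.rows)
            (r % 2 ? left.rows : right.rows).push_back(r);
        return !left.rows.empty() && !right.rows.empty();
    }

    int main() {
        const std::size_t minWidth = 2; // user constraint: minimal node width
        const int maxDepth = 4;         // user constraint: terminating level
        std::queue<Node> frontier;
        frontier.push({{0, 1, 2, 3, 4, 5, 6, 7}, 0}); // root holds the bagged set
        std::size_t leaves = 0;
        while (!frontier.empty()) {     // breadth-first frontier walk
            Node n = frontier.front();
            frontier.pop();
            Node l{{}, n.depth + 1}, r{{}, n.depth + 1};
            if (n.rows.size() >= minWidth && n.depth < maxDepth && trySplit(n, l, r)) {
                frontier.push(l);       // node branches into two subnodes
                frontier.push(r);
            } else {
                ++leaves;               // no split: node is terminal (leaf)
            }
        }
        std::cout << leaves << " leaves\n";
    }
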
• 11. Building trees: splitting as conditioning
  Splitting has the consequence of partitioning the training data into progressively smaller subsets.
  Operationally, the splitting criterion conditions the data into complementary subspaces.
    The left successor inherits the subspace satisfying the criterion.
    The right successor inherits its complement.
  From this perspective, the root node trains on all bagged observations.
  Successor nodes, similarly, train on data conditioned by the parent.
  As we’ll see, the conditioned subspaces can be characterized as row sections of the design.
  In other words, the splitting criteria define successive bipartitions on row indices.
  From this perspective, then, the algorithm would seem to terminate naturally.

• 12. Splitting: predictor perspective
  Splitting criteria are formulated as order or subset relations with respect to some predictor:
    E.g., numerical predictor: p <= 3.2 ? branch left : branch right.
    Factor predictor: q ∈ {3, 8, 17} ? branch left : branch right.
  At a given node, candidate criteria are obtained over a randomly-sampled set of predictors.
  Each predictor evaluates a series of (L/R) trial subsets:
    For numerical predictors, trials are distinct cuts in the linear order.
    For factors, trials are partitions over the runs of identical predictor values.
  The criterion is derived from the trial maximizing impurity reduction (“separation”) on the response.
  The predictor/criterion pair best “separating” the response is chosen for splitting.

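As one concrete instance of trial evaluation, the sketch below walks a numerical predictor's ordered values and scores every distinct cut for regression. The quantity sumL^2/nL + sumR^2/nR is the standard shortcut for weighted-variance splitting: the total sum of squares is fixed, so maximizing this score minimizes the weighted within-node variance. A hedged sketch with illustrative names, not the package's code.

    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <vector>

    struct Cut { std::size_t pos; double score; };

    // Walks the node's rows in ascending predictor order, scoring each
    // distinct cut by sumL^2/nL + sumR^2/nR.
    static Cut bestCut(const std::vector<double>& x,   // predictor, ascending
                       const std::vector<double>& y) { // response, same order
        const std::size_t n = x.size();
        const double sumTot = std::accumulate(y.begin(), y.end(), 0.0);
        double sumL = 0.0;
        Cut best{0, -1.0};
        for (std::size_t i = 0; i + 1 < n; ++i) {
            sumL += y[i];
            if (x[i] == x[i + 1]) continue;            // ties admit no cut here
            const double nL = i + 1, nR = n - nL;
            const double sumR = sumTot - sumL;
            const double score = sumL * sumL / nL + sumR * sumR / nR;
            if (score > best.score) best = {i, score};
        }
        return best; // criterion: x <= x[best.pos] ? branch left : branch right
    }

    int main() {
        Cut c = bestCut({1.0, 2.0, 3.2, 5.0}, {0.1, 0.2, 1.9, 2.1});
        std::cout << "cut after ordered position " << c.pos << "\n"; // prints 1
    }
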
• 13. Predictor ordering ⇐⇒ row index permutation
  Trial score is a function only of the response - evaluated according to predictor order.
  Irrespective of predictor, a given node is scored over a unique set of response indices.
  The role of the predictor is to dictate the order in which to walk the indices.
    That is, predictor values play no role in scoring the trials.
    Hence only predictor ranks (and runs) affect scoring.
  Each trial, in particular the “winner”, determines a bipartition of predictor ranks.
  The predictor ranks, in turn, define a bipartition of row indices:
    One set of indices characterizes the left branch.
    Its complement characterizes the right branch.
  Throughout training, then, the frontier nodes train over a (highly-disconnected) partition of the original bagged data, as row sections.

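A small sketch of the last step: once the winning trial fixes a position in the predictor-ordered index vector, the row bipartition defining the two subnodes falls out by slicing that vector. Names are hypothetical.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // ord holds row indices in ascending predictor order; cutPos is the last
    // position (inclusive) sent left by the winning trial.
    static void bipartition(const std::vector<std::size_t>& ord, std::size_t cutPos,
                            std::vector<std::size_t>& left,
                            std::vector<std::size_t>& right) {
        left.assign(ord.begin(), ord.begin() + cutPos + 1);
        right.assign(ord.begin() + cutPos + 1, ord.end());
    }

    int main() {
        std::vector<std::size_t> left, right;
        bipartition({4, 0, 2, 3, 1}, 1, left, right); // rows 4,0 left; 2,3,1 right
        std::cout << left.size() << " left, " << right.size() << " right\n";
    }
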
• 14. Trial generation: 4 cases, divergent work loads

                          Numerical predictor   Factor predictor
  Regression              Index walk            Index walk → run sets;
  (weighted variance)                           run set sort;
                                                run set walk: O(#runs)
  Classification          Index walk            Index walk → run sets;
  (Gini gain)                                   run set walk: O(2^#runs)

  Index walks are linear in node width, but differ in state maintained.
  Power set walks resort to sampling above ∼ 10 runs.
  Binary classification walks runs linearly.

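The bottom-right cell deserves a note. For binary classification, the O(2^#runs) power-set walk can be avoided: a standard device, consistent with the linear run walk named above (and with the n log n method mentioned later), sorts the runs by class-1 proportion and evaluates only the #runs - 1 linear cuts. A sketch under that assumption; names are illustrative and this is not the package's code.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    struct Run { std::size_t n1, n; }; // class-1 count and total, per level

    static double giniImpurity(double n1, double n) {
        if (n == 0.0) return 0.0;
        const double p = n1 / n;
        return 2.0 * p * (1.0 - p);
    }

    static double bestBinarySplit(std::vector<Run> runs) {
        // Sort runs by class-1 proportion: the n log n step.
        std::sort(runs.begin(), runs.end(), [](const Run& a, const Run& b) {
            return a.n1 * b.n < b.n1 * a.n;
        });
        double nTot = 0, n1Tot = 0;
        for (const Run& r : runs) { nTot += r.n; n1Tot += r.n1; }
        double bestScore = 1e300, nL = 0, n1L = 0;
        for (std::size_t i = 0; i + 1 < runs.size(); ++i) { // linear run walk
            nL += runs[i].n; n1L += runs[i].n1;
            const double nR = nTot - nL, n1R = n1Tot - n1L;
            const double score = (nL * giniImpurity(n1L, nL) +
                                  nR * giniImpurity(n1R, nR)) / nTot;
            bestScore = std::min(bestScore, score);
        }
        return bestScore; // lower weighted impurity = greater Gini gain
    }

    int main() {
        std::cout << bestBinarySplit({{9, 10}, {1, 10}, {5, 10}}) << "\n"; // 0.34
    }
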
• 15. Aside: performance is data-dependent
  As with linear algebra, the appropriate treatment depends on the contents of the (design) matrix: SVD, for example.
  E.g., regression has regular access patterns and tends to run very quickly.
  Constraints in response or predictor values - may benefit from numerical simplification.
  Will ties play a significant role? - sparse data can train very quickly.
  Custom implementations rely heavily on the answer to such questions.
  It therefore makes sense to strive for extensibility and ease of customization.

• 16. Data locality
  Computers store data in hierarchy:
    Registers.
    Caches (L1 - L3).
    RAM.
    Disk.
  CPU operates on registers.
  Loading registers consumes many clock cycles, depending upon position in hierarchy.
  Performance therefore best when data is spatially (hence temporally) local.
  Similarly, loops over vectors most efficient when data in consecutive iterations separated by predictable and short(ish) strides.
  “Regular” access patterns allow compiler, and hence hardware, to do a good job.
  Regularity is crucial for GPUs, which excel at performing identical operations on contiguous data.

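A toy illustration of the stride point, assuming a row-major matrix stored in a flat vector: the row sum touches consecutive addresses the hardware prefetcher can stream, while the column sum jumps ncol elements per iteration and defeats the cache.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Sum along a row of a row-major matrix: unit stride, cache-friendly.
    static double sumRow(const std::vector<double>& m, std::size_t ncol, std::size_t r) {
        double s = 0.0;
        for (std::size_t j = 0; j < ncol; ++j)
            s += m[r * ncol + j];
        return s;
    }

    // Sum down a column: stride of ncol elements, cache-hostile for large ncol.
    static double sumCol(const std::vector<double>& m, std::size_t nrow,
                         std::size_t ncol, std::size_t c) {
        double s = 0.0;
        for (std::size_t i = 0; i < nrow; ++i)
            s += m[i * ncol + c];
        return s;
    }

    int main() {
        const std::size_t nrow = 1000, ncol = 1000;
        std::vector<double> m(nrow * ncol, 1.0);
        std::cout << sumRow(m, ncol, 0) << " " << sumCol(m, nrow, ncol, 0) << "\n";
    }
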
• 17. Algorithm: observations
  Splitting is “embarrassingly parallel”: trials can be evaluated on all nodes in the frontier, and all candidate predictors, at the same time.
  However, ranks corresponding to a node’s row indices vary with predictor.
  Naive solution is to sort observations at each splitting step.
    Approach used by early implementations.
    Does not scale well.
  Predictor ordering used repeatedly: suggests pre-ordering.

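The pre-ordering suggestion amounts to computing, once per predictor, the permutation of row indices that sorts that predictor's column. A minimal sketch; the Arborist also caches rank and run information, which is elided here.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Returns ord such that ord[k] is the row holding the k-th smallest value.
    // Stable sort keeps tied values (runs) in row order.
    static std::vector<std::size_t> preOrder(const std::vector<double>& col) {
        std::vector<std::size_t> ord(col.size());
        std::iota(ord.begin(), ord.end(), 0);
        std::stable_sort(ord.begin(), ord.end(),
                         [&col](std::size_t a, std::size_t b) { return col[a] < col[b]; });
        return ord;
    }

    int main() {
        for (std::size_t r : preOrder({3.2, 0.5, 0.5, 1.7}))
            std::cout << r << " ";    // prints 1 2 3 0
        std::cout << "\n";
    }
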
• 18. Algorithm: cont.
  With pre-ordering, index walk accumulates per-node state by row index lookup.
    Original Arborist approach.
    No data locality, as index lookup is irregular.
    Large state budget: must be swapped as indexed node changes.
  “Restaging”: maintain separately-sorted state vectors for each predictor, by node.
    Begin with pre-sorted list, update via (stable) bipartition at each node.
    Current Arborist approach.
    Data locality improves with tree depth.
    Only modest amount of state to move: 16 bytes to include doubles.
    Splitting becomes quite regular: next datum prefetchable by hardware.
    Each node/predictor pair restageable on SIMD hardware: partition using parallel scan.

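A sketch of one restaging pass, assuming a record loosely echoing the 16-byte state mentioned above (field names hypothetical): the parent's predictor-sorted records stream through a single pass whose stability keeps both subnode vectors sorted, and whose consecutive reads make the next datum prefetchable.

    #include <cstdint>
    #include <vector>

    struct StagedCell {          // hypothetical 16-byte state record
        std::uint32_t row;       // row index within the bagged set
        std::uint32_t rank;      // predictor rank
        double        ySum;      // sampled response contribution
    };

    // One restaging pass: a stable bipartition of the parent's
    // predictor-sorted state vector into the two subnodes.
    static void restage(const std::vector<StagedCell>& parent,
                        const std::vector<bool>& goesLeft, // indexed by row
                        std::vector<StagedCell>& left,
                        std::vector<StagedCell>& right) {
        for (const StagedCell& cell : parent)
            (goesLeft[cell.row] ? left : right).push_back(cell);
    }

    int main() {
        std::vector<StagedCell> parent{{0, 0, 1.5}, {2, 1, 0.25}, {1, 2, 2.0}};
        std::vector<StagedCell> left, right;
        restage(parent, {true, false, true}, left, right);
        return left.size() == 2 ? 0 : 1;
    }
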
• 20. Organization
  Compiled code with various language front-ends.
  R was the driving language, but Python now under active development.
  Front-end “bridges” wherever possible: Rcpp, Cython.
  Minimal use of front-end call-backs: PRNG, sampling, sorting.
  Common code base also supports GPU version, largely as a subtyped extension.

• 21. Look and feel
  Guided by existing packages.
  Many options the same, or similar.
  Supports only numeric and factor data: leaves type-wrangling to the user.
  Predictor sampling: continuous predProb (Bernoulli) vs. discrete max features (w/o replacement).
  Breadth-first implementation; introduces specification of terminating level.
  Introduces information-based stopping criterion, minRatio.
  Unlimited (essentially) factor cardinality: blessing as well as curse.
  Many useful package options remain to be implemented.

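The two predictor-sampling schemes contrast as follows; a minimal sketch, with illustrative names, of Bernoulli selection (the predProb style, whose candidate count varies node to node) versus fixed-count selection without replacement (the max features style):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    // Bernoulli (predProb): each predictor enters the candidate set
    // independently, so the candidate count varies from node to node.
    static std::vector<std::size_t> sampleBernoulli(std::size_t nPred, double predProb,
                                                    std::mt19937& rng) {
        std::bernoulli_distribution pick(predProb);
        std::vector<std::size_t> cand;
        for (std::size_t p = 0; p < nPred; ++p)
            if (pick(rng)) cand.push_back(p);
        return cand;
    }

    // Without replacement (max features): exactly mtry distinct predictors.
    static std::vector<std::size_t> sampleFixed(std::size_t nPred, std::size_t mtry,
                                                std::mt19937& rng) {
        std::vector<std::size_t> all(nPred);
        std::iota(all.begin(), all.end(), 0);
        std::shuffle(all.begin(), all.end(), rng);
        all.resize(mtry);
        return all;
    }

    int main() {
        std::mt19937 rng(42);
        return sampleBernoulli(20, 0.24, rng).size() + sampleFixed(20, 4, rng).size() > 0 ? 0 : 1;
    }
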
• 22. Distinguishing features
  Decoupling of splitting from row-lookup: restaging + highly regular node walk.
  Both stages benefit from resulting data locality and regularity.
  Restaging maintained as stable partition (amenable to SIMD parallelization).
  Training produces lightweight, serial “pre-tree”.
  Rich intermediate state: e.g., frontier maps reveal quantiles.
  Amenability to workflow internalization: “loopless” behavior.

• 23. Training wheels, early experience
  Began with R front end.
  Compare performance with randomForest package.
  “Medium” data: large, but in-memory.
  Speedups typically observed as row counts approach 500 − 1000.
  Linear scaling with # predictors, # trees, as expected.
  Log-linear with # rows, also as expected.
  Regression much easier to accelerate than classification.

• 25. Feature trials: Bernoulli vs. w/o replacement
  German credit-scoring data [1]: binary response, 1000 rows.
    7 numerical predictors.
    13 categorical predictors, cardinalities range from 2 to 10.
  Graphs to follow compare misprediction, execution times at various predictor selections.
  Accuracy, as function of trial metric: Bernoulli (blue) and w/o replacement (green) appear to track well.
  Performance of Rborist generally 2-3× in this regime.

  [1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

• 26. [Figure: Misprediction: predProb (Rborist) vs. mtry (RF). x-axis: mtry (default = 4) or equivalent (default = 4.8), range 5 to 15; y-axis: misprediction rate, 0.20 to 0.32.]

• 27. [Figure: Execution time ratios: randomForest / Rborist (equivalent). x-axis: mtry; y-axis: ratio, 2.0 to 4.5.]

• 28. Instructive user account
  Arborist performance problem noted in recent blog post [2].
  Airline flight-delay data tested on various RF packages.
  8 predictors; various row counts: 10^4, 10^5, 10^6, 10^7.
  Slowdown appears due to error in large-cardinality sampling.
  In fact, the Github version had already repaired the salient problem.
  Nonetheless, suggests improvements either implemented or to-be.

  [2] “Benchmarking Random Forest Implementations”, Szilard Pafka, DataScience.LA, May 19, 2015.

• 29. Account, cont.
  Splitting now parallelized across all pairs, rather than by-predictor.
  Class weighting to treat unbalanced data.
  Binary classification with high-cardinality factors:
    Replaces sampling with n log n method.
  Points to need for more, and broader, testing.

• 30. GPU: pilot study with University of Washington team
  GWAS data provided by Dept. Global Health [3].
  100 samples, up to ∼ 10^6 predictors.
  Binary response: HIV detected or not.
  Purely categorical predictors with cardinality = 3: SNPs.
  Bespoke CPU and GPU versions spun off for the data set.
  Each tree trained (almost) entirely on GPU.
  Results illustrate potential - for highly regular data sets.
  Drop-off on right is an artefact from copying the data.frame.

  [3] Courtesy of Lingappam lab.

• 31. CPU vs. GPU: execution time ratios of bespoke versions
  [Figure: CPU vs GPU: timing ratios (1000 trees). x-axis: predictor count, 0 to 250,000; y-axis: ratio, 15 to 50.]

• 33. GPU-centric packages
  Ad hoc version not scalable as implemented: rewritten.
    Restaging now implemented as stable partition via parallel scan.
    Nvidia engineers concur with this solution, anticipate good scaling.
  In general, though, data need not be so regular as this.
    Mixed predictor types and multiple cardinalities present load-balancing challenges.
    Dynamic parallelism option available for irregular workloads.
    Predictor selection thwarts data locality: adjacent columns not necessarily used.
  Lowest-hanging fruit may be isolated special cases such as SNP data.

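The scan formulation can be sketched in serial C++; on a GPU, both loops become device-wide primitives: an exclusive prefix sum, then a scatter whose write targets are all distinct and hence mutually independent. Illustrative only.

    #include <cstddef>
    #include <vector>

    // Stable partition via scan: pos[i] = number of left-bound elements
    // strictly before i (an exclusive prefix sum of the flags). Left element
    // i lands at pos[i]; right element i lands at nLeft + (i - pos[i]).
    template <typename T>
    static std::vector<T> scanPartition(const std::vector<T>& in,
                                        const std::vector<int>& isLeft) {
        const std::size_t n = in.size();
        std::vector<std::size_t> pos(n);
        std::size_t nLeft = 0;
        for (std::size_t i = 0; i < n; ++i) { // exclusive scan (parallel on GPU)
            pos[i] = nLeft;
            nLeft += isLeft[i];
        }
        std::vector<T> out(n);
        for (std::size_t i = 0; i < n; ++i)   // scatter (parallel on GPU)
            out[isLeft[i] ? pos[i] : nLeft + (i - pos[i])] = in[i];
        return out;
    }

    int main() {
        std::vector<int> v{5, 6, 7, 8, 9};
        std::vector<int> flags{1, 0, 1, 0, 1};
        return scanPartition(v, flags)[0] == 5 ? 0 : 1; // yields 5 7 9 6 8
    }
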
• 34. GPU vs. CPU
  Highly regular regression/numeric case may perform well on GPU.
  On-GPU transpose: restaging, splitting employ different major orderings.
  For now: split on CPU and restage (highly regular) on GPU.
  Multiple trees can be restaged at once via software pipeline:
    Masks transfer latency by overlapping training of multiple trees.
    Keeps CPU busy by dispatching less-regular tasks to multiple cores.

• 35. CPU-level parallelism
  Original work focused on predictor-level parallelism, emphasizing wider data sets.
  Node-level parallelism has emerged as an equal player (e.g., flight-delay data).
  But with high core count and closely-spaced data, false sharing looms as a potential threat.
  Infrastructure now in place to support hierarchical parallelization:
    Head node orders predictors and scatters copies.
    Multiple nodes each train blocks of trees on multicore hardware.
    GPU participation also possible.
    Head node gathers pretrees, builds forest, validates.
  Remaining implementation chiefly a matter of scheduling and tuning.

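The per-machine layer of this hierarchy can be as simple as an OpenMP loop over a block of independent trees. A sketch under hypothetical names; dynamic scheduling absorbs tree-to-tree variation in splitting cost.

    #include <vector>

    struct PreTree { int id; }; // stand-in for the lightweight pre-tree

    // Stub: a real implementation would run the full per-tree training pass.
    static PreTree trainOneTree(int t) { return PreTree{t}; }

    // Trees are independent, so a block parallelizes trivially across cores.
    static std::vector<PreTree> trainBlock(int nTree) {
        std::vector<PreTree> block(nTree);
        #pragma omp parallel for schedule(dynamic)
        for (int t = 0; t < nTree; ++t)
            block[t] = trainOneTree(t);
        return block;
    }

    int main() { return trainBlock(100).size() == 100 ? 0 : 1; }
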
• 36. CPU: load balancing
  Mixed factor, numerical predictors offer greatest challenge, especially for classification.
  In some cases, may benefit from parallelizing trial generations themselves.
  Irrespective of data types, the inter-level pass following splitting is inherently sequential.
  May make sense to pipeline: overlap splitting of one tree with the inter-level pass of another.
  N.B.: Much more performance-testing is needed to investigate these scenarios.

• 37. Additional projects
  Sparse internal representation.
    Inchoate; main challenge is defining interface.
  NA handling.
    Some variants easier to implement than others.
  Post-processing: facilitate use by other utilities.
    Feature contributions.

• 38. Pyborist: goals
  Encourage flexing, testing by broader ML community.
  Honor precedent of scikit-learn: features, style.
  Provide R-like abstractions: data frames and factors.
  Attempt to minimize impact of host language on user data.
  Stress software organization and design.

• 39. Pyborist: key ingredients
  Cython bridge: emphasis on compilation.
  Pandas: “dataframe” and “category” essential.
  NumPy: PRNG, sort and sampling call-backs.
  Considered other options: SWIG, CFFI, ctypes ...

• 41. Summary
  Only a few rigid design principles:
    Constrain data movement.
    Language-agnostic, compiled core implementation.
    Common source base.
  Plenty of opportunities for improvement.
  Load balancing appears to be lowest-hanging fruit: both CPU and GPU.

• 42. Longer term
  Solicit help, comments from the community.
  Expanded use of templating: large, small types.
  Out-of-memory support.
  Generalized dispatch, fat binaries.
  Plugins for other splitting methods: non-Gini.
  Internalize additional workflows.

• 43. Acknowledgments
  Stephen Elston, Quanta Analytics.
  Abraham Flaxman, Dept. Global Health, U.W.
  Seattle PuPPy.

• 44. Thank you
  mseligman@suiji.org
  @suijidata
  blog.suiji.org (“Mood Stochastic”)
  github.com/suiji/Arborist