CudaTree (GTC 2014)

Random Forests have become an extremely popular machine learning algorithm for making predictions from large and complicated data sets. The highest-performing implementations of Random Forests currently all run on the CPU. We implemented a Random Forest learner for the GPU (using PyCUDA and runtime code generation) which outperforms the currently preferred libraries (scikit-learn and wiseRF). The "obvious" parallelization strategy (using one thread block per tree) results in poor performance. Instead, we developed a more nuanced collection of kernels to handle various tradeoffs between the number of samples and the number of features.

  1. CudaTree: Training Random Forests on the GPU. Alex Rubinsteyn (NYU, Mount Sinai), presented at GTC on March 25th, 2014.
  2. What’s a Random Forest? A bagged ensemble of randomized decision trees: learning by uncorrelated memorization. Few free parameters, popular for “data science”.
     trees = []
     for i in 1 .. T:
         Xi, Yi = random_sample(X, Y)
         t = RandomizedDecisionTree()
         t.fit(Xi, Yi)
         trees.append(t)
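As a concrete version of the loop above, here is a minimal runnable sketch in Python. Using scikit-learn's DecisionTreeClassifier with max_features="sqrt" as a stand-in for RandomizedDecisionTree is my assumption, not part of the talk.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_forest(X, y, n_trees=100, seed=0):
        """Bagged ensemble of randomized trees: resample rows, fit, repeat."""
        rng = np.random.default_rng(seed)
        trees = []
        n = len(X)
        for i in range(n_trees):
            idx = rng.integers(0, n, size=n)  # bootstrap: n rows drawn with replacement
            # max_features="sqrt" approximates the per-split random feature subset
            t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
            t.fit(X[idx], y[idx])
            trees.append(t)
        return trees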
  3. Decision Tree Training. Random Forest trees are trained like normal decision trees (CART), but use a random subset of features at each split.
     bestScore, bestThresh = inf, None
     for i in RandomSubset(nFeatures):
         Xi, Yi = sort(X[:, i], Y)
         for nLess, thresh in enumerate(Xi):
             score = Impurity(nLess, Yi)
             if score < bestScore:
                 bestScore, bestThresh = score, thresh
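A runnable NumPy version of this split search for one node, using weighted Gini impurity for Impurity. The helper names and the midpoint-threshold choice are mine, and labels are assumed to be integers 0..n_classes-1.

    import numpy as np

    def gini(counts):
        """Gini impurity of a vector of per-class counts."""
        total = counts.sum()
        if total == 0:
            return 0.0
        p = counts / total
        return 1.0 - np.sum(p * p)

    def best_split(X, y, feature_indices, n_classes):
        """Exhaustive threshold search over a random subset of features."""
        best_score, best_feat, best_thresh = np.inf, None, None
        for i in feature_indices:
            order = np.argsort(X[:, i], kind="stable")
            xi, yi = X[order, i], y[order]
            total = np.bincount(yi, minlength=n_classes).astype(float)
            left = np.zeros(n_classes)      # label counts left of the candidate split
            for n_less in range(1, len(xi)):
                left[yi[n_less - 1]] += 1
                if xi[n_less] == xi[n_less - 1]:
                    continue                # cannot split between equal feature values
                right = total - left
                score = (n_less * gini(left) + (len(xi) - n_less) * gini(right)) / len(xi)
                if score < best_score:
                    best_score, best_feat = score, i
                    best_thresh = 0.5 * (xi[n_less] + xi[n_less - 1])
        return best_feat, best_thresh, best_score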
  4. Random Forests: Easy to Parallelize?
     • Each tree gets trained independently!
     • Simple parallelization strategy: “each processor trains a tree”
     • Works great for multi-core implementations (wiseRF, scikit-learn, &c.)
     • Terrible for GPU
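For the CPU case, the one-tree-per-processor idea is only a few lines with joblib (which is also what scikit-learn uses to parallelize its forests). The fit_one_tree helper reuses the bootstrap idea from the earlier sketch and the naming is mine.

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.tree import DecisionTreeClassifier

    def fit_one_tree(X, y, seed):
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample for this tree
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
        return tree.fit(X[idx], y[idx])

    def fit_forest_parallel(X, y, n_trees=100, n_jobs=-1):
        # "each processor trains a tree": embarrassingly parallel across CPU cores
        return Parallel(n_jobs=n_jobs)(
            delayed(fit_one_tree)(X, y, seed) for seed in range(n_trees))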
  5. Random Forest GPU Training Strategies
     • Tree per Thread: naive/slow
     • Depth-First: data-parallel threshold selection for one tree node
     • Breadth-First: learn a whole level of the tree simultaneously (threshold per block)
     • Hybrid: use Depth-First training until too few samples, then switch to Breadth-First
  6–11. Depth-First Training: Thread block = subset of samples (six diagram slides repeating this caption)
  12. Depth-First Training: Algorithm Sketch
      (1) Parallel prefix scan: compute label-count histograms
      (2) Map: evaluate the impurity score for all feature thresholds
      (3) Reduce: which feature/threshold pair has the lowest impurity score?
      (4) Map: mark whether each sample goes left or right at the split
      (5) Shuffle: keep samples/labels on both sides of the split contiguous.
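A CPU-side NumPy analogue of steps (1)–(3) for a single feature at one node; cumsum plays the role of the parallel prefix scan, and the real kernels run the same computation per thread block over a subset of the node's samples. The function name and the use of Gini as the impurity are my assumptions.

    import numpy as np

    def depth_first_split_one_feature(x, y, n_classes):
        """x: feature values for the samples at this node, y: integer labels."""
        order = np.argsort(x, kind="stable")
        x_sorted, y_sorted = x[order], y[order]

        # (1) prefix scan: label-count histogram to the left of each candidate split
        one_hot = np.eye(n_classes)[y_sorted]          # shape (n, c)
        left_counts = np.cumsum(one_hot, axis=0)
        total = left_counts[-1]

        # (2) map: impurity score of every candidate threshold
        n = len(x_sorted)
        n_left = np.arange(1, n)                       # samples strictly left of split i
        left = left_counts[:-1]
        right = total - left
        gini_left = 1.0 - np.sum((left / n_left[:, None]) ** 2, axis=1)
        gini_right = 1.0 - np.sum((right / (n - n_left)[:, None]) ** 2, axis=1)
        scores = (n_left * gini_left + (n - n_left) * gini_right) / n
        scores[x_sorted[1:] == x_sorted[:-1]] = np.inf  # no split between equal values

        # (3) reduce: pick the threshold with the lowest impurity
        best = np.argmin(scores)
        thresh = 0.5 * (x_sorted[best] + x_sorted[best + 1])
        # steps (4)/(5) would then be: mask = x <= thresh, followed by a stable
        # partition that keeps both sides of the split contiguous
        return thresh, scores[best]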
  13–16. Breadth-First Training: Thread block = tree node (four diagram slides repeating this caption)
  17. Breadth-First Training: Algorithm Sketch. Uses the same sequence of data-parallel operations (scan label counts, map impurity evaluation, &c.) as Depth-First training, but within each thread block. Fast at the bottom of the tree (lots of small tree nodes), insufficient parallelism at the top.
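In the same NumPy terms, one breadth-first sweep over a level looks roughly like this; each iteration of the loop corresponds to one thread block on the GPU, and best_split is the hypothetical helper from the slide-3 sketch.

    import numpy as np

    def breadth_first_level(nodes, X, y, feature_indices, n_classes):
        """nodes: list of sample-index arrays, one per tree node on the current level."""
        next_level = []
        for idx in nodes:                        # one thread block per tree node
            feat, thresh, _ = best_split(X[idx], y[idx], feature_indices, n_classes)
            if feat is None:                     # node is pure or unsplittable: leaf
                continue
            go_left = X[idx, feat] <= thresh
            next_level.append(idx[go_left])      # children become the next level's nodes
            next_level.append(idx[~go_left])
        return next_level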
  18. Hybrid Training. Add a tree node to the Breadth-First queue when it contains fewer samples than:
      3702 + 1.58c + 0.0577n + 21.84f
      ★ c = number of classes
      ★ n = total number of samples
      ★ f = number of features considered at split
      (coefficients from a machine-specific regression, needs to be generalized)
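Read as code, the rule is just a linear predictor for the switch point; the coefficients are copied from the slide and the function names are mine.

    def bfs_sample_threshold(c, n, f):
        """c = classes, n = total samples, f = features considered per split."""
        return 3702 + 1.58 * c + 0.0577 * n + 21.84 * f

    def should_move_to_bfs_queue(node_sample_count, c, n, f):
        # switch a node from depth-first to breadth-first training below this size
        return node_sample_count < bfs_sample_threshold(c, n, f)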
  19. GPU Algorithms vs. Number of Features
      • Randomly generated synthetic data
      • Performance relative to sklearn 0.14
  20. Benchmark Data
      Dataset      Samples   Features   Classes
      ImageNet     10k       4k         10
      CIFAR100     50k       3k         100
      covertype    581k      57         7
      poker        1M        11         10
      PAMAP2       2.87M     52         10
      intrusion    5M        41         24
  21. Benchmark Results
      Dataset      wiseRF    sklearn 0.15   CudaTree (Titan)   CudaTree + wiseRF
      ImageNet     23s       13s            27s                25s
      CIFAR100     160s      180s           197s               94s
      covertype    107s      73s            67s                52s
      poker        117s      98s            59s                58s
      PAMAP2       1,066s    241s           934s               757s
      intrusion    667s      1,682s         199s               153s
      Hardware/settings: 6-core Xeon E5-2630, 24 GB RAM, GTX Titan; n_trees = 100, features_per_split = sqrt(n_features)
  22. Thanks!
      • Installing: pip install cudatree
      • Source: https://github.com/EasonLiao/CudaTree
      • Credit: Yisheng Liao did most of the hard work; Russell Power & Jinyang Li were the sanity-check brigade.
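For completeness, a usage sketch modeled loosely on the project's README; the exact names (load_data, bfs_threshold) are recalled from memory rather than taken from this talk, so treat them as assumptions and check the repository for the authoritative API.

    from cudatree import load_data, RandomForestClassifier

    # load a small bundled demo dataset (name assumed from the README)
    x_train, y_train = load_data("digits")
    forest = RandomForestClassifier(n_estimators=50, verbose=True)
    # bfs_threshold (assumed kwarg): node size at which training switches to breadth-first
    forest.fit(x_train, y_train, bfs_threshold=4096)
    predictions = forest.predict(x_train)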
