Building Random Forest at Scale

Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

1. Building Random Forest at Scale
Michal Malohlava (@mmalohlava), @hexadata
BrightTalk, 5/20/2014

2. Who am I?
Background
•PhD in CS from Charles University in Prague, 2012
•1 year postdoc at Purdue University experimenting with algorithms for large-scale computation
•1 year at 0xdata helping to develop the H2O engine for big-data computation
Experience with domain-specific languages, distributed systems, software engineering, and big data.
3. Overview
1. A little bit of theory
2. Random Forest observations
3. Scaling & distribution of Random Forest
4. Q&A
4. Tree Planting

5. What is a model for this data?
•Training sample of points covering the area [0,3] x [0,3]
•Two possible colors of points
[figure: scatter plot of the training points, X and Y axes from 0 to 3]

6. What is a model for this data?
The model should be able to predict the color of a new point.
[figure: the same scatter plot with a new point - what is the color of this point?]

7. Decision tree
[figure: decision tree with splits x<0.8, y<0.8, x<2, x<1.7, y<2, y<2.3, y<0.8, next to the resulting partition of the X-Y plane]
8. How to grow a decision tree?
Split the rows in a given node into two sets with respect to an impurity measure:
•the smaller the impurity, the more skewed the class distribution
•compare the impurity of the parent with the impurity of the children
A. Possible impurity measures: Gini, entropy, RSS
B. Respect the type of feature: nominal, ordinal, continuous
[figure: partially grown tree with splits x<0.8, y<2, y<0.8 and an undecided split marked ???]
(A small code sketch of the split search follows this slide.)
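As a minimal illustration of the split search described on slide 8, the sketch below picks the threshold of one continuous feature that minimizes the weighted Gini impurity of the two children. It is a toy, single-machine example; GiniSplitSketch and its methods are made-up names, not H2O's implementation.

import java.util.Arrays;

// Toy split search: find the threshold of one continuous feature that
// minimizes the weighted Gini impurity of the two child nodes.
class GiniSplitSketch {

  // Gini impurity of a set of labels: 1 - sum_k p_k^2
  static double gini(int[] classCounts, int total) {
    if (total == 0) return 0.0;
    double sum = 0.0;
    for (int c : classCounts) {
      double p = (double) c / total;
      sum += p * p;
    }
    return 1.0 - sum;
  }

  // Returns a threshold t so that splitting rows into {x < t} / {x >= t}
  // gives the lowest weighted child impurity; labels are 0..numClasses-1.
  static double bestSplit(double[] x, int[] label, int numClasses) {
    Integer[] order = new Integer[x.length];
    for (int i = 0; i < x.length; i++) order[i] = i;
    Arrays.sort(order, (a, b) -> Double.compare(x[a], x[b]));

    int[] left = new int[numClasses];
    int[] right = new int[numClasses];
    for (int l : label) right[l]++;

    double bestImpurity = Double.MAX_VALUE, bestThreshold = Double.NaN;
    for (int i = 0; i < x.length - 1; i++) {
      int row = order[i];
      left[label[row]]++;                       // move one row from right to left
      right[label[row]]--;
      int nLeft = i + 1, nRight = x.length - nLeft;
      double weighted = (nLeft * gini(left, nLeft) + nRight * gini(right, nRight)) / x.length;
      if (weighted < bestImpurity) {            // keep the split with the purest children
        bestImpurity = weighted;
        bestThreshold = (x[order[i]] + x[order[i + 1]]) / 2.0;
      }
    }
    return bestThreshold;
  }
}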
9. When to stop growing the tree?
1. Build the full tree, or
2. Apply a stopping criterion - a limit on:
•tree depth, or
•the minimum number of points in a leaf
[figure: tree with splits x<0.8, y<0.8, x<2, x<1.7, y<2]

10. How to assign a leaf value?
The leaf value is:
•if the leaf contains only one point, its color represents the leaf value
•else the majority color is picked, or the color distribution is stored
[figure: tree with splits x<0.8, y<2, y<0.8 and a leaf marked ???]

11. Decision tree
The tree covers the whole area with rectangles predicting a point's color.
[figure: full decision tree next to the rectangular partition of the X-Y plane]

12. Decision tree scoring
The model can predict a point's color based on its coordinates.
[figure: full decision tree next to the colored partition of the X-Y plane]

13. Overfitting
The tree perfectly represents the training data (0% training error), but it also learned the noise!
[figure: partition of the X-Y plane with a tiny rectangle fitted around a noisy point, labeled "Noise"]

14. Overfitting
And hence it poorly predicts a new point! The expected color was RED, since the points follow a "chess board" pattern.
[figure: partition of the X-Y plane with a mispredicted new point]

15. Handle overfitting
•Pre-pruning via a stopping criterion
•Post-pruning: decreases the complexity of the model and helps with model generalization
•Randomize tree building and combine the trees together
"The model should have low training error but also low generalization error!"
Random Forest idea: Breiman, L. (2001). Random forests. Machine Learning, 5-32. http://link.springer.com/article/10.1023/A:1010933404324

16. Randomize #1: Bagging
Prepare a bootstrap sample for each tree by sampling with replacement.
[figure: three bootstrap samples drawn from the training points]

17. Randomize #1: Bagging
[figure: two trees built from different bootstrap samples, each with its own splits]

18. Randomize #1: Bagging
Each tree sees only a sample of the training data and captures only a part of the information.
Build multiple weak trees which vote together to give the resulting prediction:
•voting is based on a majority vote, or a weighted average
(A small bagging-and-voting sketch follows this slide.)
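A minimal sketch of the bagging-and-voting scheme from slides 16-18: each tree gets its own bootstrap sample drawn with replacement, and the forest predicts by majority vote. The Tree and TreeBuilder interfaces are placeholders for a single-tree learner and are not part of any real API.

import java.util.Random;

// Bagging sketch: sample rows with replacement for each tree, then combine
// the trees by majority vote.
class BaggingSketch {
  interface Tree { int predict(double[] point); }
  interface TreeBuilder { Tree buildTree(int[] rowIndices); }

  static Tree[] buildForest(int numTrees, int numRows, TreeBuilder builder, long seed) {
    Random rnd = new Random(seed);
    Tree[] forest = new Tree[numTrees];
    for (int t = 0; t < numTrees; t++) {
      int[] sample = new int[numRows];           // bootstrap sample: same size as the
      for (int i = 0; i < numRows; i++)          // training set, drawn with replacement
        sample[i] = rnd.nextInt(numRows);
      forest[t] = builder.buildTree(sample);
    }
    return forest;
  }

  // Majority vote of all trees for one point.
  static int vote(Tree[] forest, double[] point, int numClasses) {
    int[] votes = new int[numClasses];
    for (Tree tree : forest) votes[tree.predict(point)]++;
    int best = 0;
    for (int c = 1; c < numClasses; c++) if (votes[c] > votes[best]) best = c;
    return best;
  }
}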
19. Randomize #2: Feature selection
Randomized split selection:
•randomly select a subset of features of size sqrt(#features)
•select the best split using only that subset
[figure: partially grown tree with an undecided split marked ???]
(A small subset-selection sketch follows this slide.)
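A sketch of the per-split feature subsampling described on slide 19; drawing the subset with a partial Fisher-Yates shuffle is just one convenient way to do it and is not meant to mirror H2O's code.

import java.util.Arrays;
import java.util.Random;

// Draw a random subset of feature indices of size sqrt(#features); the split
// search at one node then considers only these features.
class FeatureSubsetSketch {
  static int[] randomFeatureSubset(int numFeatures, Random rnd) {
    int k = Math.max(1, (int) Math.sqrt(numFeatures));
    int[] all = new int[numFeatures];
    for (int i = 0; i < numFeatures; i++) all[i] = i;
    // Partial Fisher-Yates shuffle: after k swaps the first k entries are a
    // uniform random subset of the feature indices.
    for (int i = 0; i < k; i++) {
      int j = i + rnd.nextInt(numFeatures - i);
      int tmp = all[i]; all[i] = all[j]; all[j] = tmp;
    }
    return Arrays.copyOf(all, k);
  }
}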
20. Out-of-bag points and validation
Each tree is built over a sample of the training points. The remaining points are called "out-of-bag" (OOB).
These points are used for validation as a good approximation of the generalization error; it is almost identical to N-fold cross-validation.
[figure: X-Y plane distinguishing in-bag and out-of-bag points]
(A small OOB-error sketch follows this slide.)
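A sketch of the OOB error estimate from slide 20, reusing the placeholder Tree type from the bagging sketch above: each point is scored only by the trees whose bootstrap sample did not contain it.

// OOB error sketch; inBag[t][row] is true if the row was drawn into tree t's
// bootstrap sample. Reuses the placeholder BaggingSketch.Tree interface.
class OobSketch {
  static double oobError(BaggingSketch.Tree[] forest, boolean[][] inBag,
                         double[][] points, int[] label, int numClasses) {
    int scored = 0, wrong = 0;
    for (int row = 0; row < points.length; row++) {
      int[] votes = new int[numClasses];
      int nVotes = 0;
      for (int t = 0; t < forest.length; t++) {
        if (inBag[t][row]) continue;             // only out-of-bag trees may vote
        votes[forest[t].predict(points[row])]++;
        nVotes++;
      }
      if (nVotes == 0) continue;                 // row was in-bag for every tree
      int best = 0;
      for (int c = 1; c < numClasses; c++) if (votes[c] > votes[best]) best = c;
      scored++;
      if (best != label[row]) wrong++;
    }
    return scored == 0 ? Double.NaN : (double) wrong / scored;
  }
}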
21. Advantages of Random Forest
•Independent trees which can be built in parallel
•The model does not overfit easily
•Produces reasonable accuracy
•Brings more features to analyze data: variable importance, proximities, missing-value imputation
Breiman, L., Cutler, A. Random Forest. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

22. A Few Observations

23. Covtype dataset

24. Sampling rate impact

25. Number of split features

26. Variable importance

27. Building Forests with H2O

28. H2O platform

29. Challenges
Parallelize and distribute the Random Forest algorithm:
•preserve computation with data
•minimize data transfers
Preserve Random Forest properties:
•split nodes in an efficient way
•sample and keep track of OOB samples
Handle large trees.

30. Implementation #1
Build independent trees per machine over local data:
•RVotes approach
•each node builds a subset of the forest
Chawla, N., & Hall, L. (2004). Learning ensembles from bites: A scalable and accurate approach. The Journal of Machine Learning Research, 5, p421–451.

31. Implementation #1
•Fast - trees are independent and can be built in parallel
•Data have to fit into memory
•Possible accuracy decrease if each node can see only a subset of the data
(A small per-node sketch of this approach follows this slide.)
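A much-simplified, single-process sketch of the Implementation #1 idea from slides 30-31, with threads standing in for machines and each "machine" building trees only from its local row partition; it reuses the placeholder types from the bagging sketch and is an illustration, not H2O's distributed code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each "node" (a thread here) builds its own subset of the forest using only
// its local rows; the per-node forests are concatenated at the end. In a real
// RVotes-style setup the builder would also bootstrap within the local data.
class LocalForestsSketch {
  static List<BaggingSketch.Tree> buildForest(List<int[]> localPartitions,
                                              int treesPerNode,
                                              BaggingSketch.TreeBuilder builder)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(localPartitions.size());
    List<Future<List<BaggingSketch.Tree>>> parts = new ArrayList<>();
    for (int[] localRows : localPartitions) {
      parts.add(pool.submit((Callable<List<BaggingSketch.Tree>>) () -> {
        List<BaggingSketch.Tree> local = new ArrayList<>();
        for (int t = 0; t < treesPerNode; t++)
          local.add(builder.buildTree(localRows));   // each tree sees only local data
        return local;
      }));
    }
    List<BaggingSketch.Tree> forest = new ArrayList<>();
    for (Future<List<BaggingSketch.Tree>> f : parts) forest.addAll(f.get());
    pool.shutdown();
    return forest;
  }
}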
32. Implementation #2
Build a distributed tree over all the data.
[figure: dataset points spread across the cluster]

33. Implementation #2
Each data point has an assigned tree node, stored in a temporary vector.
A pass over the data points means visiting tree nodes.
[figure: dataset points with per-row tree-node assignments]

34. Implementation #2
Each data point has an in/out-of-bag flag, stored in a temporary vector.
Trick for on-the-fly scoring: the position of out-of-bag rows inside the tree is tracked as well.
[figure: dataset points with in/out-of-bag flags I, O, I, I, ...]

35. Implementation #2
The tree is built per layer:
•each node prepares a histogram for splitting
[figure: active tree layer above the dataset points and in/out-of-bag flags]

36. Implementation #2
The tree is built per layer:
•histograms are reduced and a new layer is prepared
[figure: active tree layer above the dataset points and in/out-of-bag flags]

37. Implementation #2
•Exact solution - no decrease of accuracy
•Elegant solution merging tree building and OOB scoring
•More data transfers to exchange histograms
•Can produce huge trees (since tree size depends on the data)
(A small layer-by-layer histogram sketch follows this slide.)
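A much-simplified sketch of the per-layer histogram idea from slides 33-36, for a single continuous feature with fixed equal-width bins; the Histogram class and its reduce method are made-up illustrations, not H2O's actual histogram code.

// Each data partition fills histograms for the nodes of the active tree layer;
// histograms belonging to the same node are then reduced (added together)
// across partitions before the best split per node is chosen.
class LayerHistogramSketch {
  static class Histogram {
    final double min, max;
    final int[][] classCounts;                   // [bin][class]
    Histogram(double min, double max, int bins, int numClasses) {
      this.min = min; this.max = max;
      this.classCounts = new int[bins][numClasses];
    }
    void add(double value, int label) {
      int bins = classCounts.length;
      int b = (int) ((value - min) / (max - min) * bins);
      if (b < 0) b = 0;
      if (b >= bins) b = bins - 1;
      classCounts[b][label]++;
    }
    // Reduction step: merge a histogram computed on another data partition.
    void reduce(Histogram other) {
      for (int b = 0; b < classCounts.length; b++)
        for (int c = 0; c < classCounts[b].length; c++)
          classCounts[b][c] += other.classCounts[b][c];
    }
  }

  // One local pass over a partition for the active layer: nodeOf[row] is the
  // temporary vector from slide 33 that maps each row to its active tree node.
  static void fillLocal(Histogram[] perNode, double[] feature, int[] label,
                        int[] nodeOf, boolean[] inBag, int[] rows) {
    for (int row : rows) {
      if (!inBag[row]) continue;                 // out-of-bag rows do not influence splits
      perNode[nodeOf[row]].add(feature[row], label[row]);
    }
  }
}

After every partition has filled its local histograms, the per-node histograms are merged with reduce, a split is chosen for each active node from the merged counts, and nodeOf is updated to route each row to its child node, which forms the next layer.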
38. Tree representation
Internally stored in a compressed format as a byte array
•But it can be pretty huge (>10 MB)
Externally it can be stored as code:

class Tree_1 {
  static final float predict(double[] data) {
    float pred = ((float) data[3 /* petal_wid */] < 0.8200002f ? 0.0f :
        ((float) data[2 /* petal_len */] < 4.835f ?
            ((float) data[3 /* petal_wid */] < 1.6600002f ? 1.0f : 0.0f) :
            ((float) data[2 /* petal_len */] < 5.14475f ?
                ((float) data[3 /* petal_wid */] < 1.7440002f ?
                    ((float) data[1 /* sepal_wid */] < 2.3600001f ? 0.0f : 1.0f) :
                    0.0f) :
                0.0f)));
    return pred;
  }
}
39. Lessons learned
•Preserving deterministic computation is crucial!
•Trees need to be sent around the cloud for validation, which can be expensive!
•Tracking out-of-bag points can be tricky!
•Clever data binning is a key trick to decrease memory consumption.

40. Time for questions
Thank you!

41. Learn more about H2O at 0xdata.com
or
git clone https://github.com/0xdata/h2o
Thank you! Follow us at @hexadata

42. References
•0xdata, H2O: https://github.com/0xdata/h2o/
•Breiman, L. (1999). Pasting small votes for classification in large databases and on-line. Machine Learning, Vol. 36. Kluwer, p85–103.
•Breiman, L. (2001). Random forests. Machine Learning, 5–32. http://link.springer.com/article/10.1023/A:1010933404324
•Breiman, L., Cutler, A. Random Forest. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
•Chawla, N., & Hall, L. (2004). Learning ensembles from bites: A scalable and accurate approach. The Journal of Machine Learning Research, 5, p421–451.
•Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. http://link.springer.com/content/pdf/10.1007/978-0-387-84858-7.pdf
