Introduction to TreeNet (2004)


TreeNet is Salford's most flexible and powerful data mining tool, capable of consistently generating extremely accurate models. TreeNet has been responsible for the majority of Salford's modeling competition awards. TreeNet demonstrates remarkable performance for both regression and classification. The algorithm typically generates thousands of small decision trees built in a sequential error-correcting process to converge to an accurate model.


  1. An Introduction to TreeNet™
     Mikhail Golovnya, Dan Steinberg, Scott Cardell
     Salford Systems (golomi@salford-systems.com)
  2. • A new approach to machine learning / function approximation developed by Jerome H. Friedman at Stanford University
       ◦ Co-author of CART® with Breiman, Olshen, and Stone
       ◦ Author of MARS™, PRIM, and Projection Pursuit
     • Good for both classification and regression problems
     • Builds on the notions of committees of experts and boosting, but is substantially different in implementation details
  3. • Stagewise function approximation in which each stage models the residuals from the previous stage's model
       ◦ Conventional boosting models the original target at each stage
     • Each stage uses a very small tree, as small as two nodes and typically in the range of 4-8 nodes
       ◦ Conventional bagging and boosting use full-size trees, and even massively large trees
     • Each stage learns from a fraction of the available training data: typically less than 50% to start, falling to 20% or less by the last stage
     • Each stage learns only a little: the contribution of each new tree is severely down-weighted (the learning rate is typically 0.10 or less)
     • Focus in classification is on points near the decision boundary; points far from the boundary are ignored, even if they are on the wrong side of it
  4. • Built on CART trees and thus:
       ◦ Immune to outliers
       ◦ Handles missing values automatically
       ◦ Selects variables
       ◦ Results invariant with respect to monotone transformations of variables
     • Trains very rapidly: many small trees do not take much longer to run than one large tree
     • Resistant to overtraining and generalizes very well
     • Can be remarkably accurate with little effort, BUT the resulting model may be very complex
  5. • An intuitive introduction
     • TreeNet mathematical basics
       ◦ Specification of the TreeNet model as a series expansion
       ◦ Non-parametric approach to steepest-descent optimization
     • TreeNet at work
       ◦ Small trees, learning rates, sub-sample fractions, regression types
       ◦ Reading the output: reports and diagnostics
     • Comparison to AdaBoost and other methods
  6. • Consider the basic problem of estimating a continuous outcome y based on a vector of predictors X
     • Running a step-wise multiple linear regression produces an estimate f1(X) and the associated residuals r1 = y - f1(X)
     • A simple intuitive idea: run a second-stage regression on the residuals to produce an estimate f2(X) and the updated residuals r2 = y - f1(X) - f2(X)
     • Repeating this process multiple times results in the following series expansion: y ≈ f1(X) + f2(X) + f3(X) + … (sketched below)
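     A minimal sketch of this stagewise idea (not Salford's implementation), assuming numpy and scikit-learn and using a small regression tree at each stage so that later stages still have residual signal to model:

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def stagewise_fit(X, y, n_stages=3, max_leaf_nodes=6):
            stages = []
            residual = np.asarray(y, dtype=float).copy()
            for _ in range(n_stages):
                model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, residual)
                stages.append(model)                 # f_k, fit to the current residuals
                residual -= model.predict(X)         # r_k = y - f_1 - ... - f_k
            return stages

        def stagewise_predict(stages, X):
            # y_hat = f_1(X) + f_2(X) + ... + f_K(X)
            return sum(model.predict(X) for model in stages)

     (If every stage re-fit a plain linear regression on the same predictors, the later stages would add almost nothing, since the residuals are orthogonal to the fitted predictors; that is one reason trees become the natural building block in what follows.)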
  7. • The above idea is easy to implement
     • Unfortunately, a direct implementation suffers from overfitting
     • The residuals from the previous model essentially communicate where that model fails the most; hence, the next-stage model effectively tries to improve the previous model where it failed
     • This is generally known as boosting
     • We may want to replace the individual regressions with something simpler, for example regression trees
     • It is not yet clear whether this simple idea actually works, nor how to generalize it to various loss functions or to classification
  8. • For any given set of inputs X we want to predict some outcome y
     • Thus we want to construct a "nice" function f(X), which in turn can be used to express an estimate of y
     • We need to define how "nice" is measured
  9. • In regression, when y is continuous, the easiest assumption is that f(X) itself is the estimate of y
     • We may then define the loss function L[y, f(X)] as the loss incurred when y is estimated by f(X)
     • For example, the least squares (LS) loss is defined as L[y, f(X)] = (y - f(X))^2
     • Formally, a "nicely" defined f(X) will have the smallest expected loss (over the entire population) within the boundaries of its construction (for example, in multiple linear regression, f(X) belongs to the class of linear functions)
  10. • In reality, we have a set of N observed pairs (x, y) from the population, not the entire population
      • Hence, the expected loss E L[y, f(X)] can be replaced with the sample estimate R = (1/N) Σ L[y_i, f_i], where f_i = f(x_i)
      • The problem thus reduces to finding a function f(X) that minimizes R
      • Unfortunately, classification will demand additional treatment
  11. • Consider binary classification and assume that y is coded as +1 or -1
      • The most detailed solution would then give us the associated probabilities p(y)
      • Since probabilities are naturally constrained to the [0, 1] interval, we assume that the function f(X) enters through the transformation p(y) = 1 / (1 + exp(-2yf))
      • Note that p(+1) + p(-1) = 1
      • The "trick" here is finding an unconstrained estimate f instead of the constrained estimate p
      • Also note that f is simply half the log-odds of y = +1: f = (1/2) log[p(+1) / p(-1)]
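      A small numeric illustration of the link between f and p (function names are ours; only numpy is assumed):

        import numpy as np

        def prob_from_f(f, y=+1):
            # p(y) = 1 / (1 + exp(-2yf));  note prob_from_f(f, +1) + prob_from_f(f, -1) = 1
            return 1.0 / (1.0 + np.exp(-2.0 * y * f))

        def f_from_prob(p_plus):
            # f is half the log-odds of y = +1
            return 0.5 * np.log(p_plus / (1.0 - p_plus))

        # e.g. prob_from_f(1.0) is about 0.88, and f_from_prob(0.88) is about 1.0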
  12. (insert graph)
      • This graph shows the one-to-one correspondence between f and p for y = +1
      • Note that the most significant probability change occurs when f is between -3 and +3
  13. • Again, the main question is what a "nice" f means, given that we observed N pairs (x, y) from the population
      • Approaching this problem from the maximum likelihood point of view, one may show that the negative log-likelihood in this case becomes R = (1/N) Σ log[1 + exp(-2 y_i f_i)]
      • The problem once again reduces to finding f that minimizes R above
      • We could obtain the same result formally by introducing a special loss function for classification: L[y, f(X)] = log[1 + exp(-2yf)]
      • The above likelihood considerations show a "natural" way to arrive at such a peculiar loss function
  14. • Other approaches to defining the loss function for binary classification are possible
      • For example, by dropping the log term in the previous equation one arrives at the exponential loss L = exp(-2yf)
      • It is possible to show that this loss function is effectively the one used in the "classical" AdaBoost algorithm
      • AdaBoost can be considered a predecessor of gradient boosting; we will defer the comparison until later
  15. • To summarize, we are looking for a function f(X) that minimizes the estimate of the loss R = (1/N) Σ L[y_i, f(x_i)]
      • The typical loss functions are:
        ◦ Least squares (LS): L[y, f(X)] = (y - f(X))^2
        ◦ Least absolute deviation (LAD): L[y, f(X)] = |y - f(X)|
        ◦ Logistic (binary classification): L[y, f(X)] = log[1 + exp(-2yf(X))]
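      The same three losses written out (our notation; y is continuous for LS and LAD, and coded +1/-1 for the logistic loss):

        import numpy as np

        def ls_loss(y, f):
            return (y - f) ** 2                    # least squares

        def lad_loss(y, f):
            return np.abs(y - f)                   # least absolute deviation

        def logistic_loss(y, f):
            return np.log1p(np.exp(-2.0 * y * f))  # binary classification

        # The estimated loss R is the sample average, e.g. R = np.mean(ls_loss(y, f))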
  16. • The function f(X) is introduced as a known function of a fixed set of unknown parameters
      • The problem then reduces to finding a set of optimal parameter estimates using non-linear optimization techniques
      • Multiple linear regression and logistic regression: f(X) is a linear combination of fixed predictors, with the parameters being the intercept term and the slope coefficients
      • Major problem: the function and the predictors need to be specified beforehand, which usually results in a lengthy trial-and-error process
  17. • Construct f(X) using a stage-wise approach
      • Start with a constant, then at each stage adjust the values of f(X) in various regions of the data
      • It is important to keep the adjustment rate low: the resulting model becomes smoother and is usually less subject to overfitting
      • Note that we are effectively treating the values f_i = f(x_i) at all individual observed data points as separate parameters
  18. • More specifically, assume that we have gone through k-1 stages and obtained the current version f_{k-1}(X)
      • We want to construct an updated version f_k(X) resulting in a smaller value of R
      • Treating the individual values f_i = f_{k-1}(x_i) as parameters, we proceed by computing the anti-gradient g_k, whose components are g_ki = -∂R/∂f_i evaluated at the current f_{k-1}
      • The individual components mark the "directions" in which the individual f_{k-1}(x_i) must be changed to obtain a smaller R
      • To induce smoothness, let's limit our "freedom" by allowing only M (a small number, say between 2 and 10) distinct constant adjustments at any given stage
  19. • The optimal strategy is then to group the individual components g_ki into M mutually exclusive groups such that the variance within each group is minimized
      • But this is equivalent to growing a fixed-size (M terminal nodes) regression tree using g_ki as the target
      • Suppose the tree found M mutually exclusive subsets S_k1, …, S_kM of the cases
      • The constant adjustments a_kj are computed to minimize the node contributions to the loss: a_kj = argmin_a Σ_{i ∈ S_kj} L[y_i, f_{k-1}(x_i) + a]
      • Finally, the updated function is f_k(X) = f_{k-1}(X) + a_kj for X falling in S_kj
  20. For a given loss function L[y, f(X)], tree size M, and MaxTrees:
      ◦ Make an initial guess f_0(X) = a constant
      ◦ For k = 0 to MaxTrees-1:
      ◦ Compute the anti-gradient g_k by taking the derivative of the loss with respect to f(X) and substituting y and the current f_k(X)
      ◦ Fit an M-node regression tree to the components of the anti-gradient (this will partition the observations into M mutually exclusive groups)
      ◦ Find the within-node updates a_kj by performing M univariate optimizations of the node contributions to the estimated loss
      ◦ Do the update f_{k+1}(X) = f_k(X) + a_kj for X falling in node j (see the sketch below)
      ◦ End for
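      A compact sketch of this generic loop (an illustration, not the TreeNet product code), assuming numpy and scikit-learn; the loss enters through two user-supplied callables, neg_gradient(y, f) and node_update(y_node, f_node):

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def gradient_boost(X, y, neg_gradient, node_update, f0, M=6, max_trees=200):
            f = np.full(len(y), f0, dtype=float)      # initial constant guess
            stages = []
            for _ in range(max_trees):
                g = neg_gradient(y, f)                # anti-gradient components
                tree = DecisionTreeRegressor(max_leaf_nodes=M).fit(X, g)
                leaf = tree.apply(X)                  # M mutually exclusive groups
                updates = {j: node_update(y[leaf == j], f[leaf == j])
                           for j in np.unique(leaf)}  # M univariate optimizations
                f = f + np.array([updates[j] for j in leaf])
                stages.append((tree, updates))
            return stages

        def boost_predict(stages, X, f0):
            # Score new data: start from the constant and add each tree's node update
            f = np.full(X.shape[0], f0, dtype=float)
            for tree, updates in stages:
                f += np.array([updates[j] for j in tree.apply(X)])
            return f

      Slides 21-23 amount to choosing these two callables (and the starting constant f0) for the LS, LAD, and logistic losses.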
  21. For L[y, f(X)] = (y - f)^2, M, and MaxTrees:
      ◦ Initial guess f_0(X) = mean(y)
      ◦ For k = 0 to MaxTrees-1:
      ◦ The anti-gradient component is g_ki = y_i - f_k(x_i), which is the traditional definition of the current residual
      ◦ Fit an M-node regression tree to the current residuals (this will partition the observations into M mutually exclusive groups)
      ◦ The within-node updates a_kj simply become the node averages of the current residuals
      ◦ Do the update f_{k+1}(X) = f_k(X) + a_kj for X falling in node j
      ◦ End for
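      The LS specialization of the sketch above (our illustration): the anti-gradient is the ordinary residual and the node update is the node mean of those residuals:

        import numpy as np

        def ls_neg_gradient(y, f):
            return y - f                        # current residuals

        def ls_node_update(y_node, f_node):
            return np.mean(y_node - f_node)     # node average of the residuals

        # e.g. gradient_boost(X, y, ls_neg_gradient, ls_node_update, f0=y.mean())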
  22. For L[y, f(X)] = |y - f|, M, and MaxTrees:
      ◦ Initial guess f_0(X) = median(y)
      ◦ For k = 0 to MaxTrees-1:
      ◦ The anti-gradient component is g_ki = sign(y_i - f_k(x_i)), the sign of the current residual
      ◦ Fit an M-node regression tree to the signs of the current residuals (this will partition the observations into M mutually exclusive groups)
      ◦ The within-node updates a_kj now become the node medians of the current residuals
      ◦ Do the update f_{k+1}(X) = f_k(X) + a_kj for X falling in node j
      ◦ End for
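      The LAD specialization (our illustration): the anti-gradient is the sign of the current residual and the node update is the node median of the residuals:

        import numpy as np

        def lad_neg_gradient(y, f):
            return np.sign(y - f)               # signs of the current residuals

        def lad_node_update(y_node, f_node):
            return np.median(y_node - f_node)   # node median of the residuals

        # e.g. gradient_boost(X, y, lad_neg_gradient, lad_node_update, f0=np.median(y))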
  23. For L[y, f(X)] = log[1 + exp(-2yf)], M, and MaxTrees:
      ◦ Initial guess f_0(X) = half the log-odds of y = +1 in the learn sample
      ◦ For k = 0 to MaxTrees-1:
      ◦ Recall that the anti-gradient component is g_ki = 2 y_i / (1 + exp(2 y_i f_k(x_i))); we call these the generalized residuals
      ◦ Fit an M-node regression tree to the generalized residuals (this will partition the observations into M mutually exclusive groups)
      ◦ The within-node updates a_kj are somewhat more complicated: a_kj = Σ g_ki / Σ |g_ki| (2 - |g_ki|), where both sums are taken over the node and |g_ki| (2 - |g_ki|) equals 4 p_i (1 - p_i), four times the binomial variance
      ◦ Do the update f_{k+1}(X) = f_k(X) + a_kj for X falling in node j
      ◦ End for
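      The two-class logistic specialization (our illustration of the Newton-style node update given above); y is assumed to be coded +1/-1:

        import numpy as np

        def logit_neg_gradient(y, f):
            # generalized residuals g = 2y / (1 + exp(2yf))
            return 2.0 * y / (1.0 + np.exp(2.0 * y * f))

        def logit_node_update(y_node, f_node):
            g = logit_neg_gradient(y_node, f_node)
            return np.sum(g) / np.sum(np.abs(g) * (2.0 - np.abs(g)))

        # e.g. p_plus = np.mean(y == 1)
        #      f0 = 0.5 * np.log(p_plus / (1.0 - p_plus))   # half log-odds of y = +1
        #      gradient_boost(X, y, logit_neg_gradient, logit_node_update, f0)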
  24. • Consider the following simple dataset with a single predictor X and 1000 observations
      • Here and in the following slides, negative-response observations are marked in blue, whereas positive-response observations are marked in red
      • The general tendency is to have a positive response in the middle of the range of X
      (insert table)
  25. • The dataset was generated using the following model, described by f(X) and the corresponding p(X) for y = +1
      (insert graphs)
  26. (insert graph)
      • TreeNet fits a constant probability of 0.55
      • The residuals are positive for y = +1 and negative for y = -1
  27. (insert graph)
      • The dataset was partitioned into 3 regions: low X (negative adjustment), middle X (positive), and high X (negative)
      • The residuals "reflect" the directions of the adjustments
  28. (insert graph)
      • This graph shows the predicted f(X) after 1000 iterations and a very small learning rate of 0.002
      • Note how the true shape was nearly perfectly recovered
  29. • The purpose of running a regression tree is to group observations into homogeneous subsets
      • Once we have the right partition, the adjustments for each terminal node are computed separately to optimize the given loss function; these are generally different from the predictions generated by the regression tree itself (they are the same only for the LS loss)
      • Thus, the procedure is no longer as simple as the initial intuitive recursive-regression approach we started with
      • Nonetheless, the tree is used to define the actual form of f(X) over the whole range of X, not only at the individual data points observed
      • This becomes important in the final model deployment and scoring
  30. • Up to this point we guarded against overfitting only by allowing a small number of adjustments at each stage
      • We may further guard against it by forcing the adjustments to be smaller
      • This is done by introducing a new parameter called "shrinkage" (the learning rate), set to a constant value between 0 and 1 (see the sketch below)
      • Small learning rates result in smoother models: a rate of 0.1 means that TreeNet will take about 10 times more iterations to extract the same signal, so more variables will be tried, finer partitions will result, and smaller boundary jumps will take place
      • Ideally, one might want to keep the learning rate close to zero and the number of stages (trees) close to infinity
      • However, rates below 0.001 usually become impractical
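      A minimal sketch (ours, not Salford's) of where shrinkage enters a single boosting stage:

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def boost_stage(X, y, f, neg_gradient, node_update, M=6, learn_rate=0.1):
            g = neg_gradient(y, f)
            tree = DecisionTreeRegressor(max_leaf_nodes=M).fit(X, g)
            leaf = tree.apply(X)
            updates = {j: node_update(y[leaf == j], f[leaf == j])
                       for j in np.unique(leaf)}
            # shrinkage: down-weight this tree's contribution by the learning rate
            return f + learn_rate * np.array([updates[j] for j in leaf])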
  31. (insert graph)
      • This graph shows the predicted f(X) after 100 iterations and a learning rate of 1
      • Note the roughness of the shape and the presence of abrupt, strong jumps
  32. (insert graph)
      • This graph shows the predicted f(X) after 1000 iterations and a very small learning rate of 0.0002
      • Note how the true shape was nearly perfectly recovered
      • It may be further improved
  33. • At each stage, instead of working with the entire learn dataset, consider taking a random sample of a fixed size (see the sketch below)
      • Typical sampling rates are set to 50% of the learn data (the default), and even smaller for very large datasets
      • In the long run, the entire learn dataset is exploited, but the running time is reduced by a factor of two with the 50% sampling rate
      • Sampling forces TreeNet to "rethink" the optimal partition points from stage to stage due to random fluctuations of the residuals
      • This, combined with shrinkage and a large number of iterations, results in an overall improvement of the captured signal shape
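      A sketch of per-stage subsampling (our illustration): each stage's tree and node updates are computed on a random fraction of the learn data, while the resulting adjustment is still applied to every observation:

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def boost_stage_sampled(X, y, f, neg_gradient, node_update,
                                M=6, learn_rate=0.1, sample_rate=0.5, rng=None):
            if rng is None:
                rng = np.random.default_rng()
            idx = rng.choice(len(y), size=int(sample_rate * len(y)), replace=False)
            g = neg_gradient(y[idx], f[idx])              # residuals on the sample only
            tree = DecisionTreeRegressor(max_leaf_nodes=M).fit(X[idx], g)
            leaf_all = tree.apply(X)                      # but score every observation
            leaf_smp = leaf_all[idx]
            updates = {j: node_update(y[idx][leaf_smp == j], f[idx][leaf_smp == j])
                       for j in np.unique(leaf_smp)}
            adj = np.array([updates.get(j, 0.0) for j in leaf_all])
            return f + learn_rate * adj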
  34. (insert graph)
      • This graph shows the predicted f(X) after 1000 stages, a learning rate of 0.002, and 50% sampling
      • Note the minor fluctuations in the average loss
      • The resulting model is nice and smooth, but there is still room for improvement
  35. (insert graph)
      • All previous runs allowed as few as 10 cases per individual region/node (the default)
      • Here we have increased this limit to 50
      • This immediately resulted in an even smoother shape
      • In practice, various node-size limits should be tried
  36. • In classification problems, it is possible to further reduce the amount of data processed at each stage (see the sketch below)
      • We ignore data points "too far" from the decision boundary to be usefully considered:
        ◦ Confidently, correctly classified points are ignored (just like in conventional boosting)
        ◦ Badly misclassified data points are also ignored (very different from conventional boosting)
        ◦ The focus is on the cases most difficult to classify correctly: those near the decision boundary
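      A hedged sketch of this idea in the spirit of Friedman's influence trimming (the exact rule TreeNet uses is not given on the slide): each case gets influence 4*p*(1-p), which is close to zero far from the decision boundary on either side, and the lowest-influence fraction is set aside for the stage:

        import numpy as np

        def trim_far_from_boundary(f, trim_fraction=0.1):
            p = 1.0 / (1.0 + np.exp(-2.0 * f))     # predicted P(y = +1)
            influence = 4.0 * p * (1.0 - p)        # tiny when p is near 0 or 1
            order = np.argsort(influence)          # least influential first
            cum = np.cumsum(influence[order]) / influence.sum()
            drop = order[cum <= trim_fraction]     # drop the low-influence tail
            keep = np.setdiff1d(np.arange(len(f)), drop)
            return keep                            # indices to use at this stage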
  37. (insert graph)
      • 2-dimensional predictor space
      • Red dots represent cases with a +1 target
      • Green dots represent cases with a -1 target
      • The black curve represents the decision boundary
  38. • The remaining slides present TreeNet runs on real data as well as examples of the GUI controls
      • We start with the Boston Housing dataset to illustrate regression
      • Then we proceed with the Cell Phone dataset to illustrate classification
  39. (insert graph)
  40. (insert graph)
  41. (insert graph)
      • Essentially a regression tree with 2 terminal nodes
  42. (insert table)
      • CART run with TARGET = MV, PREDICTORS = LSTAT, LIMIT DEPTH = 1
      • Save residuals as RESI