Introduction to TreeNet (2004)

An Introduction to TreeNetTM

Salford Systems
http://www.salford-systems.com
golomi@salford-systems.com
Mikhail Golovnya, Dan Steinberg, Scott Cardell

 New approaches to machine learning/function

◦ Approximation developed by Jerome H. Friedman at
Stanford University

 Co-author of CART® with Breiman, Olshen and Stone
 Author of MARSTM, PRIM, Projection Pursuit

 Good for classification and regression problems

 Builds on the notions of committees of experts
and boosting but is substantially different in
implementation details

 Stagewise function approximation in which each stage models
residuals from the last step model
◦ Conventional boosting models original target each stage

 Each stage uses a very small tree, as small as two nodes and
typically in the range of 4-8 nodes
◦ Conventional bagging and boosting use full size trees and even
massively large trees

 Each stage learns from a fraction of the available training data.
Typically less than 50% to start and falling into 20% or less by the
last stage

 Each stage learns only a little: Severely down weight contribution of
each new tree (learning rate is typically 0.10 or less)

 Focus in classification is on points near decision boundary, ignore
points far away from boundary even if the points are on the wrong

 Built on CART trees and thus
◦ Immune to outliers

◦ Handles missing values automatically

◦ Selects variables

◦ Results invariant wrt monotone transformations of variables

 Trains very rapidly: many small trees do not take

 much longer run times than one large tree

 Resistant to over training- generalizes very well

 Can be remarkably accurate with little effort

 BUT resulting model may be very complex

 An intuitive introduction

 TreeNet Mathematical Basics

◦ Specifications of the TreeNet model as a series expansion

◦ Non-parametric approach to steepest descent optimization

 TreeNet at work

◦ Small trees, learning rates, sub-sample fractions, regression types

◦ Reading the output: reports and diagnostics

 Comparing to AdaBoost and other methods

 Consider the basic problem of estimating continuous
outcome y based on a vector of predictors X

 Running a step-wise multiple linear regression will produce
an estimate f1 (X) and associated residuals

 A simple intuitive idea: run a second-stage regression model
to produce an estimate of the residuals f2 (X) and the
associated updated residuals r^2=(y-f1f2)

 Repeating this process multiple times results to the following
series expansion: y=f1+f2+f3+…

 The above idea can be easily implemented

 Unfortunately, the direct implementation suffers from the
overfitting issues

 The residuals from the previous model essentially communicate
information about where this model fails the most- hence, the
next stage model effectively tries to improve the previous model
where it failed

 This is generally known as boosting

 We may want to replace individual regressions with something
simpler- regression trees, for example

 It is not yet known whether this simple idea actually works nor it
is clear how to generalize it for various types of loss functions or
classifications

 For any given set of inputs X we want to predict some
outcome y

 Thus we want to construct a “nice” function f(X) which in turn
can be used to express an estimate of y

 We need to define how “nice” can be measured

 In regression, when y is continuous, the easiest is to assume
that f(X) itself is the estimate of y

 We may then define the loss function as the loss incurred
when y is estimated by f(X)

 For example, least square loss (LS) is defined as (LOA¯0’:f)^2

 Formally, a “nicely” defined f(X) will have the smallest
expected loss (over the entire population) within the
boundaries of its construction (for example, in multiple linear
regressions, f(X) belongs to the class of linear functions)

 In reality, we have a set of N observed pairs (x,y) from the
population, not the entire population

 Hence, the expected loss WU/A can be replaced with an
estimate

 Here Fi=f(x)

 The problem thus reduces to finding a function f(X) that
minimizes R

 Unfortunately, classification will demand additional treatment

 Consider binary classification and assume that y is coded as +1
or -1

 The most detailed solution would then give us the associated
probabilities p(y)

 Since probabilities are naturally constrained to the [0,1] interval,
we assume that the function f(X) is transformed
p(y)=1/(1+exp(-2fy))

 Note that p(+1)+p(-1)=1

 The “trick” here is finding an unconstrained estimate f instead of
constrained estimate p

 Also note that f is simply half log-odds ratio of y=+1

 (insert graph)

 This graph shows the one-to-one correspondence between f
and p for y=+1

 Note that the most significant probability change occurs when
f is between -3 and +3

 Again, the main question is what “nice” f means given that we observed
N pairs (x,y) from the population

 Approaching this problem from the maximum likelihood point of view,
one may show that the negative log-likelihood in this case becomes

 (insert equation)

 The problem once again reduces to finding f that minimizes R above

 We could obtain the same result formally by introducing a special loss
function for classification (insert equation)

 The above likelihood considerations show a “natural” way to arrive to
such a peculiar loss function

 Other approaches to defining the loss functions for binary
classification are possible

 For example, by throwing away the log term in the previous
equation one would arrive to the following loss L=exp(2yf)

 It is possible to show that this loss function is effectively used
in the “classical” AdaBoost algorithm

 AdaBoost could be considered as a predecessor of gradient
boosting, we will defer the comparison until later

 To summarize we are looking for a function
f(X) that minimizes the estimate of loss

 The typical loss functions are

 (insert equations)

 The function f(X) is introduced as a known function of a
fixed set of unknown parameters

 The problem then reduces to finding a set of optimal
parameter estimates using non-linear optimization
techniques

 Multiple linear regression and logistic regression: 1(X) is a
linear combination of fixed predictors; parameters being
the intercept term and the slope coefficients

 Major problem: the function and predictors need to be
specified beforehand- this usually results to a lengthy
trial-and-error process

 Construct f(X) using stage-wise approach

 Start with a constant, then at each stage adjust the values off
(X) in various regions of data

 It is important to keep the adjustment rate low- the resulting
model will become smoother and usually less subject to
overfitting

 Note that we are effectively treating the values f=f(X) at all
individual observed data points as separate parameters

 More specifically, assume that we have gone through k-1
stages and obtained the current version fK-1 (X)

 We want to construct an updated version fk(x) resulting to
a smaller value of R

 Treating individual (insert equation) as parameters, we
proceed by computing the anti-gradient (insert equation)

 The individual components mark the “directions” in which
individual fK-1 must be changed to obtain a smaller R

 To induce smoothness lets limit our “freedom” by allowing
only M (a smaller number, say between 2 and 10) distinct
constant adjustments at any given stage

 The optimal strategy is then to group individual
components gk, into M mutually exclusive groups, such
that the variance within each group is minimized

 But this is equivalent to growing a fixed-size (M terminal
nodes) regression tree using gk, as the target

 Suppose we found M subsets (insert equation) of cases
(insert equation)

 The constant adjustments a kj are computed to minimize
(insert equation)

 Finally the updated f(X) is (insert equation)

 For the given loss function L[y,IV],M, and MaxTrees
◦ Make an initial guess f(X)=f

◦ For K=0 to MaxTrees-1

◦ Compute the anti-gradient Gk by taking the derivative of the loss with
respect to f(X) and substitute y and current fk (X)

◦ Fit an M-node regression tree to the components of the negative gradient
1this will partition observations into M mutually exclusive groups

◦ Find the within node updates a5 by performing M univariate optimizations
of the node contributions to the estimated loss

◦ Do the update (insert equation)

◦ End for

 For L[y,IV]=(y-f)^2, M, and MaxTrees

 Initial guess f(X)=f= mean(y)

 For K=0 to MaxTrees-1

 The anti-gradient component (insert equation) which is the traditional
definition of the current residual

 Fit an M-node regression tree to the current residuals 1* this will partition
observations into M mutually exclusive groups

 The within-node updates a k, simply become node averages of the current
residuals

 Do the update: (insert equation)
 End for

 For L[Y,fiX]=1 y-fl,M, and MaxTrees

 Initial guess f(X)=f=median(y)

 For k=0 to MaxTrees-1

 The anti-gradient component (insert equation) which is the sign of the
current residuals

 Fit an M-node regression tree to the sign of the current residuals 1* this
will partition observations into M mutually exclusive groups

 The within-node updates a ki now become node medians of the current
residuals

 Do the update (insert equation)
 End for

 For L[y,f(X)]=log[1-exp(-2yf)], M, and MaxTrees

 Initial guess f(X)=f= half log- odds of y=+1

 For k=0 to MaxTrees-1

 Recall that (insert equation) we call these generalize residuals

 Fit an M-node regression tree to the generalized residuals 1* this will
partition observations into M mutually exclusive groups

 The within-node updates ak, are somewhat complicated (insert
equation) where all measures are taken with respect to the node and
variance (insert equation)

 Do the update (insert equation)
 End for

 Consider the following simple data set with
single predictor X and 1000 observations
 Here and in the following slides negative
response observations are marked in blue
whereas positive response observations are
marked in red
 The general tendency is to have positive
response in the middle of the range of X
 (insert table)

 The dataset was generated using the
following model described by f(X) and the
corresponding p(X) for y=+1
 (insert graphs)

 (insert graph)

 TreeNet fits constant probability 0.55

 The residuals are positive for y=+1 and
negative for y=-1

 (insert graph)

 The dataset was partitioned into 3 regions:
low X (negative adjustment), middle X
(positive), and large X (negative)

 The residuals “reflect” the directions of the
adjustments

 (insert graph)

 This graph shows predictors f(X) after 1000
iterations and a very small learning rate of
0.002

 Note how the true shape was nearly perfectly
recovered

 The purpose of running a regression tree is to group observations into
homogenous subsets

 Once we have the right partition the adjustments for each terminal node
are computed separately to optimize the given loss function- these are
generally different from the predictions generated by the regression tree
itself (they are the same only for the LS Loss)

 Thus, the procedure is no longer as simple as the initial intuitive
recursive regression approach we started with

 Nonetheless, the tree is used to define the actual form of (X) over the
range of X and not only for the individual data points observed

 This becomes important in the final model deployment and scoring

 Up to this point we guarded against overfitting only by allowing a small
number of adjustments at each stage

 We may further enhance this subject by forcing the adjustments to be
smaller

 This is done by introducing a new parameter called “shrinkage” (learning
rate) that is set to a constant value between 0 and 1

 Small learning rates result to smoother models: a rate of 0.1 means that
TreeNet will take 10 times more iterations to extract the same signal-
more variables will be tried, finer partitions will result, smaller boundary
jumps will take place

 Ideally, one might ultimately want to keep the learning rate close to zero
and the number of stages (trees) close to infinity

 However, rates below 0.001 usually become impractical

 (insert graph)

 This graph shows predictor f(X) after 100
iterations and a learning rate of 1

 Note the roughness of the shape and the
presence of abrupt strong jumps

 (insert graph)

 This graph shows predicted f(X) after 1000
iterations and a very small learning rate of
0.0002

 Note how the true shape was nearly perfectly
recovered

 It may be further approved

 At each stage, instead of working with the entire learn dataset,
consider taking a random sample of a fixed size

 Typical sampling rates are set to 50% of the learn data (the
default) and even smaller for very large datasets

 In the long run, the entire learn dataset is exploited but the
running time is reduced by the factor of two with the 50%
sampling rate

 Sampling forces TreeNet to “rethink” optimal partition points
from run to run due to random fluctuations of the residuals

 This, combined with the shrinkage and a large number of
iterations, results to the overall improvement of the captured
signal shape

 (insert graph)

 This graph shows predicted f(X) after 1000
stages, learning rate of 0.002, and 50%
sampling

 Note the minor fluctuations in the average
loss

 The resulting model is nice and smooth but
there is still room for improvement

 (insert graph)

 All previous allowed as few as 10 cases for
individual region/node (the default)

 Here we have increased this limit up to 50

 This immediately resulted to an even smoother
shape

 In practice, various node size limits should be
tried

 In classification problems, it is possible to further
reduce the amount of data processed as each stage

 We ignore data points “too far” from the decision
boundary to be usefully considered

◦ Well correctly classified points are ignored (just like
conventional boosting)

◦ Badly misclassified data points are also ignored (very
different from conventional boosting)

◦ The focus is on the cases most difficult to classify correctly:
those near the decision boundary

 (insert graph)

 2-dimensional predictor space

 Red dots represent cases with +1 target

 Green dots represent cases with -1 target

 Black curve represents the decision boundary

 The remaining slides present TreeNet runs on real data as
well as give examples of GUI controls

 We start with the Boston Housing dataset to illustrate
regression

 Then we proceed with the Cell Phone dataset to illustrate
classification

 (insert graph)
 Essentially a regression tree with 2 terminal
nodes

 (insert table)
 CART run with TARGET=MV

 PREDICTORS= LSTAT

 LIMIT DEPTH= 1

 Save residuals as RESI

Introduction to TreeNet (2004)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Introduction to TreeNet (2004)

Similar to Introduction to TreeNet (2004) (20)

More from Salford Systems

More from Salford Systems (20)

Recently uploaded

Recently uploaded (20)

Introduction to TreeNet (2004)