The idea that combining good methods could yield promising results was suggested by researchers more than a decade ago
In tree-structured analysis, suggestion stems from:
Wray Buntine (1991)
Kwok and Carter (1990)
Heath, Kasif and Salzberg (1993)
The notion is that if the trees can somehow capture different aspects of the data, the combination will be “better”
Better in this context means more accurate in classification and prediction for future cases
The original implementation of CART already included bagging (Bootstrap Aggregation) and ARCing (Adaptive Resampling and Combining) approaches to build tree ensembles
The original bagging and boosting approaches relied on sampling with replacement techniques to obtain a new modeling dataset
Subsequent approaches focused on refining the sampling machinery or changing the modeling emphasis from the original dependent variable to current model generalized residuals
The most important variants (and dates of published articles) are:
Bagging (Breiman, 1996, “Bootstrap Aggregation”)
Boosting (Freund and Schapire, 1995)
Multiple Additive Regression Trees (Friedman, 1999, aka MART™ or TreeNet™)
RandomForests™ (Breiman, 2001)
Work continues with major refinements underway (Friedman in collaboration with Salford Systems)
In this experiment, we ran RF with 100 trees on sample data (772×111) using different values of R (N Vars), the number of variables searched at each split
Combining trees always improves performance, with the optimal number of sampled predictors settling at around 11
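The experiment above can be sketched with scikit-learn's RandomForestClassifier standing in for the original RF implementation. The slide's 772×111 dataset is not available here, so a synthetic classification problem of the same shape is assumed, and the max_features parameter plays the role of N Vars:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the slide's 772x111 dataset (not available here)
X, y = make_classification(n_samples=772, n_features=110, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for n_vars in (1, 3, 11, 25, 110):      # candidate values of N Vars
    rf = RandomForestClassifier(n_estimators=100, max_features=n_vars,
                                random_state=0).fit(X_tr, y_tr)
    # test accuracy for this choice of N Vars
    print(n_vars, round(rf.score(X_te, y_te), 3))
```

On real data one would compare out-of-bag or test-set accuracy across these settings, as the slide does, and pick the N Vars where the curve flattens.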
Both running speed and resources consumed depend mostly on the row dimension of the data
Trees are grown in as simple a way as feasible to keep run times low (no surrogates, no priors, etc.)
Classification models produce pseudo-probability scores (percent of votes)
Performance-wise, RF is capable of matching modern boosting techniques, including MART (described later)
Naturally allows parallel processing
The final model code is usually bulky and impossible to interpret directly
Current stable implementations include multinomial classification and least-squares regression, with ongoing research into more advanced areas of predictive modeling (survival, choice, etc.)
Proximity Matrix – Raw Material for Further Advances
RF introduces a novel way to define proximity between two observations:
For a dataset of size N define an N x N matrix of proximities
Initialize all proximities to zeroes
For any given tree, apply the tree to the dataset
If case i and case j both end up in the same node, increase the proximity Prox(i, j) between i and j by one
Accumulate over all trees in RF and normalize by twice the number of trees in RF
The resulting matrix provides an intrinsic measure of proximity
Observations that are “alike” will have proximities close to one
The closer the proximity to 0, the more dissimilar cases i and j are
The measure is invariant to monotone transformations
The measure is clearly defined for any type of independent variables, including categorical
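The recipe above can be sketched in Python, with scikit-learn's RandomForestClassifier and its apply() method (which reports the terminal node each case falls into in each tree) standing in for the original RF implementation. The data here are synthetic. Each pair (i, j) is counted once per tree, so dividing by the number of trees puts identical cases at proximity 1.0:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (assumption); any classification data would do
X, y = make_classification(n_samples=60, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# leaves[i, t] = index of the terminal node that case i falls into in tree t
leaves = rf.apply(X)
N, n_trees = leaves.shape

prox = np.zeros((N, N))
for t in range(n_trees):
    # count every pair (i, j) landing in the same terminal node of tree t
    prox += leaves[:, t][:, None] == leaves[:, t][None, :]
prox /= n_trees   # normalize so that identical cases have proximity 1.0
```

The matrix is symmetric with ones on the diagonal; 1 - prox can then be treated as a dissimilarity for clustering, scaling, or outlier detection.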
In predictive modeling the problem is usually attacked by introducing a well-chosen loss function L(Y, X, f(X))
In stochastic gradient boosting we need a loss function for which gradients can easily be computed and used to construct good base learners
The loss function used on the test data does not need the same properties
Practical ways of constructing loss functions
Direct interpretation of f(Xi) as an estimate of Yi or of a population statistic of the distribution of Y conditional on X
Least Squares Loss (LS): fi is an estimate of E(Y|Xi)
Least Absolute Deviation Loss (LAD): fi is an estimate of median(Y|Xi)
Huber-M Loss: fi is an estimate of Yi
Choosing a conditional distribution for Y|X, defining f(X) as a parameter of that distribution, and using the negative log-likelihood as the loss function
Logistic Loss (conditional Bernoulli; f(X) is the half log-odds of Y=1)
Poisson Loss (conditional Poisson; f(X) is log(λ), the log of the conditional mean)
Exponential Loss (conditional Exponential; f(X) is log(λ), the log of the rate)
More general likelihood functions, for example, multinomial discrete choice, the Cox model
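A minimal numerical sketch of the generalized residuals (negative gradients) implied by these losses, using NumPy and small made-up vectors y and f; the factor of 2 in the logistic gradient comes from the half log-odds parameterization mentioned above:

```python
import numpy as np

y = np.array([0.5, 2.0, -1.0, 3.0])   # observed targets (made up)
f = np.array([1.0, 1.5, 0.0, 2.0])    # current model predictions (made up)
r = y - f                             # ordinary residuals

# Least squares: L = (y - f)^2 / 2 -> negative gradient is the residual itself
ls_grad = r

# Least absolute deviation: L = |y - f| -> negative gradient is the sign
lad_grad = np.sign(r)

# Huber-M: quadratic within delta of zero, linear outside
delta = 1.0
huber_grad = np.where(np.abs(r) <= delta, r, delta * np.sign(r))

# Logistic loss with f as the half log-odds of Y = 1 (y01 in {0, 1}):
# p = 1 / (1 + exp(-2f)), and the negative gradient works out to 2(y - p)
y01 = np.array([0, 1, 1, 0])
p = 1.0 / (1.0 + np.exp(-2.0 * f))
logit_grad = 2.0 * (y01 - p)
```

Note how the LAD and Huber gradients are bounded, which is what makes those losses robust to outlying targets, while the LS gradient grows without bound.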
Huber-M regression loss is a reasonable compromise between the classical LS loss and robust LAD loss
Logistic log-likelihood based loss strikes the middle ground between the extremely sensitive exponential loss on one side and conventional LS and LAD losses on the other side
In reality, we have a set of N observed pairs (Xi, yi) from the population, not the entire population
Hence, we use sample-based estimates of L(Y, X, f(X))
To avoid biased estimates, one usually partitions the data into independent learn and test samples using the latter to compute an unbiased estimate of the population loss
In stochastic gradient boosting the problem is attacked by acting as if we are trying to minimize the loss function on the learn sample, but doing so in a slow, constrained way
This results in a series of models that move closer and closer to the f(X) that minimizes the loss on the learn sample; eventually the new models become overfit to the learn sample
From this sequence, the function f(X) with the lowest loss on the test sample is chosen
Because the choice is made from a fixed set of models, overfitting to the test data is avoided
Sometimes the loss functions used on the test data and learn data differ
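The select-the-best-stage-on-test-data idea can be sketched with scikit-learn's GradientBoostingRegressor, whose staged_predict walks through the fixed sequence of models one stage at a time; the Friedman #1 benchmark generator is assumed here as stand-in data:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

gb = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                               max_depth=2, subsample=0.5, random_state=0)
gb.fit(X_tr, y_tr)

# Loss on the independent test sample after each stage of the sequence
test_mse = [np.mean((y_te - pred) ** 2) for pred in gb.staged_predict(X_te)]

# Choose the model (stage) with the lowest test loss
best_stage = int(np.argmin(test_mse)) + 1
```

Plotting test_mse against the stage number typically shows the test loss falling, bottoming out, and then rising as later stages overfit the learn sample; best_stage marks the bottom of that curve.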
The function f(X) is introduced as a known function of a fixed set of unknown parameters
The problem then reduces to finding a set of optimal parameter estimates using classical optimization techniques
In linear regression and logistic regression: f(X) is a linear combination of fixed predictors; the parameters are the intercept and the slope coefficients
Major problem: the function and predictors must be specified beforehand, which can result in a lengthy specification search by trial and error
If this trial-and-error process uses the same data as the final model, that model will be overfit. This is the classical overfitting problem
If new data are used to estimate the final model and the model performs poorly, the specification search process must be repeated
This approach shows most benefits on small datasets where only simple specifications can be justified, or on datasets where there is strong a priori knowledge of the correct specification
Construct f(X) using a data-driven incremental approach
Start with a constant, then at each stage adjust the values of f(X) by small increments in various regions of the data
It is important to keep the adjustment rate low – the resulting model will become smoother and be less subject to overfitting
Treating fi = f(Xi) at all individual observed data points as separate parameters, the negative of the gradient of the learn-sample loss R points in the direction of change in f(X) that yields the steepest reduction of the loss:
G = { gi = -dR/dfi ; i = 1, …, N }
The components of the negative gradient will be called generalized residuals
We want to limit the number of separate adjustments currently allowed to a small number M. A natural way to proceed is to find an orthogonal partition of the X-space into M mutually exclusive regions such that the variance of the residuals within each region is minimized
This job is accomplished by building a fixed size M-node regression tree using the generalized residuals as the current target variable
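A minimal sketch of this loop under least-squares loss, where the generalized residuals are just the ordinary residuals: at each stage a fixed-size regression tree (scikit-learn's DecisionTreeRegressor with max_leaf_nodes standing in for the M-node tree) is fit to the current residuals, and f is adjusted by a small fraction of its predictions. The data here are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: any (X, y) pair would do here)
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

nu, M = 0.1, 6                  # small adjustment rate, fixed tree size
f = np.full_like(y, y.mean())   # stage 0: a constant model

for _ in range(200):
    residuals = y - f                       # generalized residuals (LS loss)
    tree = DecisionTreeRegressor(max_leaf_nodes=M).fit(X, residuals)
    f += nu * tree.predict(X)               # small increment in each region
```

Keeping nu small means each tree contributes only a fraction of its fitted residuals, which is what makes the sequence of models move slowly and smoothly toward the learn-sample minimizer.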
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Pacific Grove: Wadsworth.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of Statistical Learning. Springer.
Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, pp. 148-156.