# RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING


Jerome H. Friedman
Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305 (jhf@stanford.edu)

Prediction involves estimating the unknown value of an attribute of a system under study given the values of other measured attributes. In predictive (machine) learning the prediction rule is derived from data consisting of previously solved cases. Most methods for predictive learning were originated many years ago at the dawn of the computer age. Recently two new techniques have emerged that have revitalized the field. These are support vector machines and boosted decision trees. This paper provides an introduction to these two new methods, tracing their respective ancestral roots to standard kernel methods and ordinary decision trees.

## I. INTRODUCTION

The predictive or machine learning problem is easy to state if difficult to solve in general. Given a set of measured values of attributes/characteristics/properties on an object (observation) x = (x_1, x_2, ..., x_n) (often called "variables"), the goal is to predict (estimate) the unknown value of another attribute y. The quantity y is called the "output" or "response" variable, and x = {x_1, ..., x_n} are referred to as the "input" or "predictor" variables. The prediction takes the form of a function

$$\hat y = F(x_1, x_2, \cdots, x_n) = F(x)$$

that maps a point x in the space of all joint values of the predictor variables to a point ŷ in the space of response values. The goal is to produce a "good" predictive F(x). This requires a definition for the quality, or lack of quality, of any particular F(x). The most commonly used measure of lack of quality is prediction "risk". One defines a "loss" criterion that reflects the cost of mistakes: L(y, ŷ) is the loss or cost of predicting a value ŷ for the response when its true value is y. The prediction risk is defined as the average loss over all predictions

$$R(F) = E_{yx}\, L(y, F(x)) \qquad (1)$$

where the average (expected value) is over the joint (population) distribution of all of the variables (y, x), which is represented by a probability density function p(y, x). Thus, the goal is to find a mapping function F(x) with low predictive risk.

Given a function f of elements w in some set, the choice of w that gives the smallest value of f(w) is called arg min_w f(w). This definition applies to all types of sets, including numbers, vectors, colors, or functions. In terms of this notation the optimal predictor with lowest predictive risk (called the "target function") is given by

$$F^* = \arg\min_F R(F). \qquad (2)$$

Given joint values for the input variables x, the optimal prediction for the output variable is ŷ = F*(x). When the response takes on numeric values y ∈ R¹, the learning problem is called "regression", and commonly used loss functions include absolute error L(y, F) = |y − F|, and even more commonly squared error L(y, F) = (y − F)², because algorithms for minimization of the corresponding risk tend to be much simpler. In the "classification" problem the response takes on a discrete set of K unorderable categorical values (names or class labels), y, F ∈ {c_1, ..., c_K}, and the loss criterion L(y, F) becomes a discrete K × K matrix.

There are a variety of ways one can go about trying to find a good predicting function F(x). One might seek the opinions of domain experts, formally codified in the "expert systems" approach of artificial intelligence. In predictive or machine learning one uses data. A "training" data base

$$D = \{y_i, x_{i1}, x_{i2}, \cdots, x_{in}\}_1^N = \{y_i, x_i\}_1^N \qquad (3)$$

of N previously solved cases is presumed to exist for which the values of all variables (response and predictors) have been jointly measured. A "learning" procedure is applied to these data in order to extract (estimate) a good predicting function F(x). There are a great many commonly used learning procedures. These include linear/logistic regression, neural networks, kernel methods, decision trees, multivariate splines (MARS), etc. For descriptions of a large number of such learning procedures see Hastie, Tibshirani and Friedman 2001.

Most machine learning procedures have been around for a long time, and most research in the field has concentrated on producing refinements to these long standing methods. However, in the past several years there has been a revolution in the field inspired by the introduction of two new approaches: the extension of kernel methods to support vector machines (Vapnik 1995), and the extension of decision trees by
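As a concrete illustration of the risk (1) and the target function (2), the following sketch (not from the paper; the simulated distribution, predictor functions, and all names are invented for illustration) estimates R(F) by Monte Carlo under squared-error loss and checks that the conditional mean, which is the optimal predictor for this loss, achieves lower estimated risk than a deliberately shifted competitor.

```python
import random

random.seed(0)

# Toy population (an invented assumption, not from the paper):
# y = 2*x + noise, noise ~ N(0, 0.5^2).  Under squared-error loss
# L(y, F) = (y - F)^2, the target function (2) is the conditional
# mean F*(x) = E[y | x] = 2*x.
N = 100_000
data = []
for _ in range(N):
    x = random.uniform(0.0, 1.0)
    y = 2.0 * x + random.gauss(0.0, 0.5)
    data.append((x, y))

def risk(F):
    """Monte Carlo estimate of the prediction risk R(F) of eq. (1)."""
    return sum((y - F(x)) ** 2 for x, y in data) / len(data)

def target(x):
    return 2.0 * x          # the conditional mean E[y|x]

def biased(x):
    return 2.0 * x + 0.3    # a shifted competitor with higher risk

print(risk(target))  # close to the noise variance 0.25
print(risk(biased))  # larger by roughly 0.3^2 = 0.09
```

The gap between the two estimates illustrates why the target function (2) is defined as the risk minimizer: any systematic departure from E[y|x] adds its squared bias to the risk.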
PHYSTAT2003, SLAC, Stanford, California, September 8-11, 2003 (WEAT003)

boosting (Freund and Schapire 1996, Friedman 2001). It is the purpose of this paper to provide an introduction to these new developments. First the classic kernel and decision tree methods are introduced. Then the extension of kernels to support vector machines is described, followed by a description of applying boosting to extend decision tree methods. Finally, similarities and differences between these two approaches will be discussed.

Although arguably the most influential recent developments, support vector machines and boosting are not the only important advances in machine learning in the past several years. Owing to space limitations these are the ones discussed here. There have been other important developments that have considerably advanced the field as well. These include (but are not limited to) the bagging and random forest techniques of Breiman 1996 and 2001 that are somewhat related to boosting, and the reproducing kernel Hilbert space methods of Wahba 1990 that share similarities with support vector machines. It is hoped that this article will inspire the reader to investigate these as well as other machine learning procedures.

## II. KERNEL METHODS

Kernel methods for predictive learning were introduced by Nadaraya (1964) and Watson (1964). Given the training data (3), the response estimate ŷ for a set of joint values x is taken to be a weighted average of the training responses {y_i}_1^N:

$$\hat y = F_N(x) = \sum_{i=1}^{N} y_i\, K(x, x_i) \Big/ \sum_{i=1}^{N} K(x, x_i). \qquad (4)$$

The weight K(x, x_i) assigned to each response value y_i depends on its location x_i in the predictor variable space and the location x where the prediction is to be made. The function K(x, x') defining the respective weights is called the "kernel function", and it defines the kernel method. Often the form of the kernel function is taken to be

$$K(x, x') = g(d(x, x')/\sigma) \qquad (5)$$

where d(x, x') is a defined "distance" between x and x', σ is a scale ("smoothing") parameter, and g(z) is a (usually monotone) decreasing function with increasing z; often g(z) = exp(−z²/2). Using this kernel (5), the estimate ŷ (4) is a weighted average of {y_i}_1^N, with more weight given to observations i for which d(x, x_i) is small. The value of σ defines "small". The distance function d(x, x') must be specified for each particular application.

Kernel methods have several advantages that make them potentially attractive. They represent a universal approximator: as the training sample size N becomes arbitrarily large, N → ∞, the kernel estimate (4) (5) approaches the optimal predicting target function (2), F_N(x) → F*(x), provided the value chosen for the scale parameter σ as a function of N approaches zero, σ(N) → 0, at a slower rate than 1/N. This result holds for almost any distance function d(x, x'); only very mild restrictions (such as convexity) are required. Another advantage of kernel methods is that no training is required to build a model; the training data set is the model. Also, the procedure is conceptually quite simple and easily explained.

Kernel methods suffer from some disadvantages that have kept them from becoming highly used in practice, especially in data mining applications. Since there is no model, they provide no easily understood model summary. Thus, they cannot be easily interpreted. There is no way to discern how the function F_N(x) (4) depends on the respective predictor variables x. Kernel methods produce a "black box" prediction machine. In order to make each prediction, the kernel method needs to examine the entire data base. This requires enough random access memory to store the entire data set, and the computation required to make each prediction is proportional to the training sample size N. For large data sets this is much slower than that for competing methods.

Perhaps the most serious limitation of kernel methods is statistical. For any finite N, performance (prediction accuracy) depends critically on the chosen distance function d(x, x'), especially for regression y ∈ R¹. When there are more than a few predictor variables, even the largest data sets produce a very sparse sampling in the corresponding n-dimensional predictor variable space. This is a consequence of the so-called "curse of dimensionality" (Bellman 1962). In order for kernel methods to perform well, the distance function must be carefully matched to the (unknown) target function (2), and the procedure is not very robust to mismatches.

As an example, consider the often used Euclidean distance function

$$d(x, x') = \left[ \sum_{j=1}^{n} (x_j - x'_j)^2 \right]^{1/2}. \qquad (6)$$

If the target function F*(x) dominantly depends on only a small subset of the predictor variables, then performance will be poor, because the kernel function (5) (6) depends on all of the predictors with equal strength. If one happened to know which variables were the important ones, an appropriate kernel could be constructed. However, this knowledge is often not available. Such "kernel customizing" is a requirement with kernel methods, but it is difficult to do without considerable a priori knowledge concerning the problem at hand.

The performance of kernel methods tends to be fairly insensitive to the detailed choice of the function
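The estimate (4), with the Gaussian kernel g(z) = exp(−z²/2) of (5) and the Euclidean distance (6), takes only a few lines to sketch. This is an illustration, not part of the paper: the toy training grid, the σ value, and the function names are all invented.

```python
import math

def euclid(x, xi):
    """Euclidean distance of eq. (6) between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xi)))

def kernel_predict(x, data, sigma):
    """Nadaraya-Watson estimate F_N(x) of eq. (4): a weighted average
    of the training responses, with Gaussian weights from eq. (5)."""
    weights = [math.exp(-((euclid(x, xi) / sigma) ** 2) / 2)
               for xi, _ in data]
    num = sum(w * yi for w, (_, yi) in zip(weights, data))
    den = sum(weights)
    return num / den

# Invented training set in the form of (3): a grid on [0,1]^2 with
# noiseless responses y = x1 + x2.
train = [((x1 / 10, x2 / 10), x1 / 10 + x2 / 10)
         for x1 in range(11) for x2 in range(11)]

print(kernel_predict((0.5, 0.5), train, sigma=0.1))  # near 1.0
```

Note that each call to `kernel_predict` scans the whole training set, which is exactly the memory and computation cost discussed above, and that shrinking or growing `sigma` trades variance against bias as described for the smoothing parameter σ.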
g(z) (5), but somewhat more sensitive to the value chosen for the smoothing parameter σ. A good value depends on the (usually unknown) smoothness properties of the target function F*(x), as well as the sample size N and the signal/noise ratio.

## III. DECISION TREES

Decision trees were developed largely in response to the limitations of kernel methods. Detailed descriptions are contained in monographs by Breiman, Friedman, Olshen and Stone 1983, and by Quinlan 1992. The minimal description provided here is intended as an introduction sufficient for understanding what follows.

A decision tree partitions the space of all joint predictor variable values x into J disjoint regions {R_j}_1^J. A response value ŷ_j is assigned to each corresponding region R_j. For a given set of joint predictor values x, the tree prediction ŷ = T_J(x) assigns as the response estimate the value assigned to the region containing x:

$$x \in R_j \Rightarrow T_J(x) = \hat y_j. \qquad (7)$$

Given a set of regions, the optimal response values associated with each one are easily obtained, namely the value that minimizes prediction risk in that region:

$$\hat y_j = \arg\min_{y'} E_y\,[L(y, y') \mid x \in R_j]. \qquad (8)$$

The difficult problem is to find a good set of regions {R_j}_1^J. There are a huge number of ways to partition the predictor variable space, the vast majority of which would provide poor predictive performance. In the context of decision trees, choice of a particular partition directly corresponds to choice of a distance function d(x, x') and scale parameter σ in kernel methods. Unlike with kernel methods, where this choice is the responsibility of the user, decision trees attempt to use the data to estimate a good partition.

Unfortunately, finding the optimal partition requires computation that grows exponentially with the number of regions J, so that this is only possible for very small values of J. All tree based methods use a greedy top-down recursive partitioning strategy to induce a good set of regions given the training data set (3). One starts with a single region covering the entire space of all joint predictor variable values. This is partitioned into two regions by choosing an optimal splitting predictor variable x_j and a corresponding optimal split point s. Points x for which x_j ≤ s are defined to be in the left daughter region, and those for which x_j > s comprise the right daughter region. Each of these two daughter regions is then itself optimally partitioned into two daughters of its own in the same manner, and so on. This recursive partitioning continues until the observations within each region all have the same response value y. At this point a recursive recombination strategy ("tree pruning") is employed, in which sibling regions are in turn merged in a bottom-up manner until the number of regions J* that minimizes an estimate of future prediction risk is reached (see Breiman et al 1983, Ch. 3).

### A. Decision tree properties

Decision trees are the most popular predictive learning method used in data mining. There are a number of reasons for this. As with kernel methods, decision trees represent a universal method. As the training data set becomes arbitrarily large, N → ∞, tree based predictions (7) (8) approach those of the target function (2), T_J(x) → F*(x), provided the number of regions grows arbitrarily large, J(N) → ∞, but at a rate slower than N.

In contrast to kernel methods, decision trees do produce a model summary. It takes the form of a binary tree graph. The root node of the tree represents the entire predictor variable space and the (first) split into its daughter regions. Edges connect the root to two descendent nodes below it, representing these two daughter regions and their respective splits, and so on. Each internal node of the tree represents an intermediate region and its optimal split, defined by one of the predictor variables x_j and a split point s. The terminal nodes represent the final region set {R_j}_1^J used for prediction (7). It is this binary tree graphic that is most responsible for the popularity of decision trees. No matter how high the dimensionality of the predictor variable space, or how many variables are actually used for prediction (splits), the entire model can be represented by this two-dimensional graphic, which can be plotted and then examined for interpretation. For examples of interpreting binary tree representations see Breiman et al 1983 and Hastie, Tibshirani and Friedman 2001.

Tree based models have other advantages as well that account for their popularity. Training (tree building) is relatively fast, scaling as nN log N with the number of variables n and training observations N. Subsequent prediction is extremely fast, scaling as log J with the number of regions J. The predictor variables need not all be numeric valued. Trees can seamlessly accommodate binary and categorical variables. They also have a very elegant way of dealing with missing variable values in both the training data and future observations to be predicted (see Breiman et al 1983, Ch. 5.3).

One property that sets tree based models apart from all other techniques is their invariance to monotone transformations of the predictor variables. Replacing any subset of the predictor variables {x_j} by (possibly different) arbitrary strictly monotone functions
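The greedy top-down partitioning described above can be sketched for the simplest case of a single predictor variable and squared-error loss, where the optimal response value (8) for a region is the mean of its training responses. This is a minimal sketch under invented assumptions: one predictor, a fixed depth limit in place of the purity stopping rule and pruning step described in the text, and all function names are made up.

```python
def region_loss(ys):
    """Squared-error loss of a region when predicting its mean, eq. (8)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(data):
    """Greedy step: the split point s minimizing the summed loss of the
    left (x <= s) and right (x > s) daughter regions."""
    xs = sorted({x for x, _ in data})
    best = (None, region_loss([y for _, y in data]))
    for s in xs[:-1]:
        left = [y for x, y in data if x <= s]
        right = [y for x, y in data if x > s]
        loss = region_loss(left) + region_loss(right)
        if loss < best[1]:
            best = (s, loss)
    return best

def grow(data, depth=0, max_depth=2):
    """Recursive partitioning: split each daughter in the same manner
    until no split helps or the (invented) depth limit is reached."""
    s, _ = best_split(data)
    if s is None or depth == max_depth:
        return {"predict": sum(y for _, y in data) / len(data)}
    left = [(x, y) for x, y in data if x <= s]
    right = [(x, y) for x, y in data if x > s]
    return {"split": s,
            "left": grow(left, depth + 1, max_depth),
            "right": grow(right, depth + 1, max_depth)}

def predict(tree, x):
    """Tree prediction T_J(x) of eq. (7): follow splits to a region."""
    while "split" in tree:
        tree = tree["left"] if x <= tree["split"] else tree["right"]
    return tree["predict"]

# A step function y = 0 for x < 5, y = 1 for x >= 5; one greedy split
# at s = 4 recovers the two regions exactly.
data = [(x, 0.0 if x < 5 else 1.0) for x in range(10)]
tree = grow(data)
print(predict(tree, 2), predict(tree, 7))  # 0.0 1.0
```

The nested-dict tree is a crude stand-in for the binary tree graphic discussed above: each internal node carries its split point s, and each terminal node carries the region's response estimate.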