1.
ARTICLE IN PRESS Pattern Recognition ] (]]]]) ]]]–]]] Contents lists available at ScienceDirect Pattern Recognition journal homepage: www.elsevier.com/locate/prLearn++.MF: A random subspace approach for the missing feature problemRobi Polikar a,n, Joseph DePasquale a, Hussein Syed Mohammed a,Gavin Brown b, Ludmilla I. Kuncheva ca Electrical and Computer Eng., Rowan University, 201 Mullica Hill Road, Glassboro, NJ 08028, USAb University of Manchester, Manchester, England, UKc University of Bangor, Bangor, Wales, UKa r t i c l e in fo abstractArticle history: We introduce Learn+ .MF, an ensemble-of-classiﬁers based algorithm that employs random subspace +Received 9 November 2009 selection to address the missing feature problem in supervised classiﬁcation. Unlike most establishedReceived in revised form approaches, Learn+ .MF does not replace missing values with estimated ones, and hence does not need +16 April 2010 speciﬁc assumptions on the underlying data distribution. Instead, it trains an ensemble of classiﬁers,Accepted 21 May 2010 each on a random subset of the available features. Instances with missing values are classiﬁed by the majority voting of those classiﬁers whose training data did not include the missing features. We showKeywords: that Learn+ .MF can accommodate substantial amount of missing data, and with only gradual decline in +Missing data performance as the amount of missing data increases. We also analyze the effect of the cardinality ofMissing features the random feature subsets, and the ensemble size on algorithm performance. Finally, we discuss theEnsemble of classiﬁers conditions under which the proposed approach is most effective.Random subspace method & 2010 Elsevier Ltd. All rights reserved.1. Introduction Fig. 1 illustrates such a scenario for a handwritten character recognition application: characters are digitized on an 8 Â 8 grid,1.1. The missing feature problem creating 64 features, f1–f64, a random subset (of about 20–30%) of which – indicated in orange (light shading) – are missing in each The integrity and completeness of data are essential for any case. Having such a large proportion of randomly varying featuresclassiﬁcation algorithm. After all, a trained classiﬁer – unless may be viewed as an extreme and unlikely scenario, warrantingspeciﬁcally designed to address this issue – cannot process reacquisition of the entire dataset. However, data reacquisition isinstances with missing features, as the missing number(s) in the often expensive, impractical, or sometimes even impossible,input vectors would make the matrix operations involved in data justifying the need for an alternative practical solution.processing impossible. To obtain a valid classiﬁcation, the data to The classiﬁcation algorithm described in this paper is designedbe classiﬁed should be complete with no missing features to provide such a practical solution, accommodating missing(henceforth, we use missing data and missing features inter- features subject to the condition of distributed redundancychangeably). Missing data in real world applications is not (discussed in Section 3), which is satisﬁed surprisingly often inuncommon: bad sensors, failed pixels, unanswered questions in practice.surveys, malfunctioning equipment, medical tests that cannot beadministered under certain conditions, etc. are all familiar 1.2. Current techniques for accommodating missing datascenarios in practice that can result in missing features. Featurevalues that are beyond the expected dynamic range of the data The simplest approach for dealing with missing data is todue to extreme noise, signal saturation, data corruption, etc. can ignore those instances with missing attributes. Commonlyalso be treated as missing data. Furthermore, if the entire data are referred to as ﬁltering or list wise deletion approaches, suchnot acquired under identical conditions (time/location, using the techniques are clearly suboptimal when a large portion of thesame equipment, etc.), different data instances may be missing data have missing attributes [1], and of course are infeasible, ifdifferent features. each instance is missing at least one or more features. A more pragmatic approach commonly used to accommodate missing data is imputation [2–5]: substitute the missing value with a meaningful estimate. Traditional examples of this approach n Corresponding author: Tel: + 1 856 256 5372; fax: + 1 856 256 5241. include replacing the missing value with one of the existing data E-mail address: polikar@rowan.edu (R. Polikar). points (most similar in some measure) as in hot – deck imputation0031-3203/$ - see front matter & 2010 Elsevier Ltd. All rights reserved.doi:10.1016/j.patcog.2010.05.028 Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
2.
ARTICLE IN PRESS2 R. Polikar et al. / Pattern Recognition ] (]]]]) ]]]–]]]Fig. 1. Handwritten character recognition as an illustration of the missing feature problem addressed in this study: a large proportion of feature values are missing(orange/shaded), and the missing features vary randomly from one instance to another (the actual characters are the numbers 9, 7 and 6). For interpretation of thereferences to colour in this ﬁgure legend, the reader is referred to the web version of this article.[2,6,7], the mean of that attribute across the data, or the mean of using the existing features) by calculating the fuzzy membershipits k-nearest neighbors [4]. In order for the estimate to be a of the data point to its nearest neighbors, clusters or hyperboxes.faithful representative of the missing value, however, the training The parameters of the clusters and hyperboxes are determineddata must be sufﬁciently dense, a requirement rarely satisﬁed for from the existing data. Algorithms based on the general fuzzydatasets with even modest number of features. Furthermore, min–max neural networks [20] or ARTMAP and fuzzy c-meansimputation methods are known to produce biased estimates as clustering [21] are examples of this approach.the proportion of missing data increases. A related, but perhaps a More recently, ensemble based approaches have also beenmore rigorous approach is to use polynomial regression to proposed. For example, Melville et al. [22] showed that theestimate substitutes for missing values [8]. However, regression algorithm DECORATE, which generates artiﬁcial data (with notechniques – besides being difﬁcult to implement in high missing values) from existing data (with missing values) is quitedimensions – assume that the data can reasonably ﬁt to a robust to missing data. On the other hand, Juszczak and Duin [23]particular polynomial, which may not hold for many applications. proposed combining an ensemble of one-class classiﬁers, each Theoretically rigorous methods with provable performance trained on a single feature. This approach is capable of handlingguarantees have also been developed. Many of these methods any combination of missing features, with the fewest number ofrely on model based estimation, such as Bayesian estimation classiﬁers possible. The approach can be very effective as long as[9–11], which calculates posterior probabilities by integrating single features can reasonably estimate the underlying decisionover the missing portions of the feature space. Such an approach boundaries, which is not always plausible.also requires a sufﬁciently dense data distribution; but more In this contribution, we propose an alternative strategy, calledimportantly, a prior distribution for all unknown parameters must Learn++.MF. It is inspired in part by our previously introducedbe speciﬁed, which requires prior knowledge. Such knowledge is ensemble-of-classiﬁers based incremental learning algorithm,typically vague or non-existent, and inaccurate choices often lead Learn++, and in part by the random subspace method (RSM). Into inferior performance. essence, Learn++.MF generates a sufﬁcient number of classiﬁers, An alternative strategy in model based methods is the each trained on a random subset of features. An instance with oneExpectation Maximization (EM) algorithm [12–14,10], justly or more missing attributes is then classiﬁed by majority voting ofadmired for its theoretical elegance. EM consists of two steps, those classiﬁers that did not use the missing features during theirthe expectation (E) and the maximization (M) step, which training. Hence, this approach differs in one fundamental aspectiteratively maximize the expectation of the log-likelihood of the from the other techniques: Learn++.MF does not try to estimate thecomplete data, conditioned on observed data. Conceptually, these values of the missing data; instead it tries to extract the moststeps are easy to construct, and the range of problems that can be discriminatory classiﬁcation information provided by the existinghandled by EM is quite broad. However, there are two potential data, by taking advantage of a presumed redundancy in thedrawbacks: (i) convergence can be slow if large portions of data feature set. Hence, Learn+ .MF avoids many of the pitfalls of +are missing; and (ii) the M step may be quite difﬁcult if a closed estimation and imputation based techniques.form of the distribution is unknown, or if different instances are As in most missing feature approaches, we assume that themissing different features. In such cases, the theoretical simplicity probability of a feature being missing is independent of the valueof EM does not translate into practice [2]. EM also requires prior of that or any other feature in the dataset. This model is referredknowledge that is often unavailable, or estimation of an unknown to as missing completely at random (MCAR) [2]. Hence, given theunderlying distribution, which may be computationally prohibi- dataset X¼(xij) where xij represents the jth feature of instance xi,tive for large dimensional datasets. Incorrect distribution estima- and a missing data indicator matrix M¼(Mij), where Mij is 1 if xij istion often leads to inconsistent results, whereas lack of missing and 0 otherwise, we assume that pðM9XÞ ¼ pðMÞ. This issufﬁciently dense data typically causes loss of accuracy. Several the most restrictive mechanism that generates missing data, thevariations have been proposed, such as using Gaussian mixture one that provides us with no additional information. However,models [15,16]; or Expected Conditional Maximization, to this is also the only case where list-wise deletion or mostmitigate some of these difﬁculties [2]. imputation approaches lead to no bias. Neural network based approaches have also been proposed. In the rest of this paper, we ﬁrst provide a review of ensembleGupta and Lam [17] looked at weight decay on datasets with systems, followed by the description of the Learn++.MF algorithm,missing values, whereas Yoon and Lee [18] proposed the Training- and provide a theoretical analysis to guide the choice of its freeEStimation-Training (TEST) algorithm, which predicts the actual parameters: number of features used to train each classiﬁer, andtarget value from a reasonably well estimated imputed one. Other the ensemble size. We then present results, analyze algorithmapproaches use neuro-fuzzy algorithms [19], where unknown performance with respect to its free parameters, and comparevalues of the data are either estimated (or a classiﬁcation is made performance metrics with theoretically expected values. We also Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
3.
ARTICLE IN PRESS R. Polikar et al. / Pattern Recognition ] (]]]]) ]]]–]]] 3compare the performance of Learn+ .MF to that of a single classiﬁer + feature(s) needs to be classiﬁed, only those classiﬁers trained withusing mean imputation, to naıve Bayes that can naturally handle ¨ the features that are presently available in x are used for themissing features, and an intermediate approach that combines classiﬁcation. As such, Learn+ .MF follows an alternative paradigm +RSM with mean imputation, as well as to that of one-class for the solution of the missing feature problem: the algorithmensemble approach of [23]. We conclude with discussions and tries to extract most of the information from the availableremarks. features, instead of trying to estimate, or impute, the values of the missing ones. Hence, Learn+ .MF does not require a very dense + training data (though, such data are certainly useful); it does not2. Background: ensemble and random subspace methods need to know or estimate the underlying distributions; and hence does not suffer from the adverse consequences of a potential An ensemble-based system is obtained by combining diverse incorrect estimate.classiﬁers. The diversity in the classiﬁers, typically achieved by Learn++.MF makes two assumptions. First, the feature set mustusing different training parameters for each classiﬁer, allows be redundant, that is, the classiﬁcation problem is solvable withindividual classiﬁers to generate different decision boundaries unknown subset(s) of the available features (of course, if the[24–27]. If proper diversity is achieved, a different error is made identities of the redundant features were known, they would haveby each classiﬁer, strategic combination of which can then reduce already been excluded from the feature set). Second, thethe total error. Starting in early 1990s, with such seminal works as redundancy is distributed randomly over the feature set (hence,[28–30,13,31–33], ensemble systems have since become an time series data would not be well suited for this approach). Theseimportant research area in machine learning, whose recent assumptions are primarily due to the random nature of featurereviews can be found in [34–36]. selection, and are shared with all RSM based approaches. We The diversity requirement can be achieved in several ways combine these two assumptions under the name distributed[24–27]: training individual classiﬁers using different (i) training redundancy. Such datasets are not uncommon in practice (thedata (sub)sets; (ii) parameters of a given classiﬁer architecture; scenario in Fig. 1, sensor applications, etc), and it is these(iii) classiﬁer models; or (iv) subsets of features. The last one, also applications for which Learn+ .MF is designed, and expected to +known as the random subspace method (RSM), was originally be most effective.proposed by Ho [37] for constructing an ensemble of decisiontrees (decision forests). In RSM, classiﬁers are trained usingdifferent random subsets of the features, allowing classiﬁers to err 3.2. Algorithm Learn+ .MF +in different sub-domains of the feature space. Skurichina pointsout that RSM works particularly well when the database provides A trivial approach is to create a classiﬁer for each of the 2f À 1redundant information that is ‘‘dispersed’’ across all features [38]. possible non-empty subsets of the f features. Then, for any The RSM has several desirable attributes: (i) working in a instance with missing features, use those classiﬁers whosereduced dimensional space also reduces the adverse conse- training feature set did not include the missing ones. Such anquences of the curse of dimensionality; (ii) RSM based ensemble approach, perfectly suitable for datasets with few features, has inclassiﬁers may provide improved classiﬁcation performance, fact been previously proposed [44]. For large feature sets, thisbecause the synergistic classiﬁcation capacity of a diverse set of exhaustive approach is infeasible. Besides, the probability of aclassiﬁers compensates for any discriminatory information lost by particular feature combination being missing also diminisheschoosing a smaller feature subset; (iii) implementing RSM is quite exponentially as the number of features increase. Therefore,straightforward; and (iv) it provides a stochastic and faster alter- trying to account for every possible combination is inefﬁcient.native to the optimal-feature-subset search algorithms. RSM At the other end of the spectrum, is Juszczak and Duin’s approachapproaches have been well-researched for improving diversity, [23], using fxc one-class classiﬁers, one for each feature and eachwith well-established beneﬁts for classiﬁcation applications [39], of the c classes. This approach requires fewest number ofregression problems [40] and optimal feature selection applica- classiﬁers that can handle any feature combination, but comestions [38,41]. However, the feasibility of an RSM based approach at a cost of potential performance drop due to disregarding anyhas not been explored for the missing feature problem, and hence existing relationship between the features, as well as the infor-constitutes the focus of this paper. mation provided by other existing features. Finally, a word on the algorithm name: Learn++ was developed Learn++.MF offers a strategic compromise: it trains an ensembleby reconﬁguring AdaBoost [33] to incrementally learn from new of classiﬁers with a subset of the features, randomly drawn from adata that may later become available. Learn+ generates an + feature distribution, which is iteratively updated to favor selec-ensemble of classiﬁers for each new database, and combines tion of those features that were previously undersampled. Thisthem through weighted majority voting. Learn+ draws its training + allows Learn++.MF to achieve classiﬁcation performances withdata from a distribution, iteratively updated to force the algo- little or no loss (compared to a fully intact data) even when largerithm to learn novel information not yet learned by the ensemble portions of data are missing. Without any loss of generality, we[42,43]. Learn+ .MF, combines the distribution update concepts of + assume that the training dataset is complete. Otherwise, a ++Learn with the random feature selection of RSM, to provide a sentinel can be used to ﬂag the missing features, and thenovel solution to the Missing Feature problem. classiﬁers are then trained on random selection of existing features. For brevity, we focus on the more critical case of ﬁeld data containing missing features.3. Approach The pseudocode of Learn++.MF is provided in Fig. 2. The inputs È É to the algorithm are the training data S ¼ ðxi ,yi Þ,i ¼ 1,. . .,N ;3.1. Assumptions and targeted applications of the proposed feature subset cardinality; i.e., number of features, nof, used toapproach train each classiﬁer; a supervised classiﬁcation algorithm (BaseClassiﬁer), the number of classiﬁers to be created T; and a The essence of the proposed approach is to generate a sentinel value sen to designate a missing feature. At each iterationsufﬁciently large number of classiﬁers, each trained with a t, the algorithm creates one additional classiﬁer Ct. The featuresrandom subset of features. When an instance x with missing used for training Ct are drawn from an iteratively updated Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
4.
ARTICLE IN PRESS4 R. Polikar et al. / Pattern Recognition ] (]]]]) ]]]–]]] Fig. 2. Pseudocode of algorithm Learn++.MF.distribution Dt that promotes further diversity with respect to the During classiﬁcation of a new instance z, missing (or known tofeature combinations. D1 is initialized to be uniform, hence each be corrupt) values are ﬁrst replaced by a sentinel sen (chosen as afeature is equally likely to be used in training classiﬁer C1 (step 1). value not expected to occur in the data). Then, the features with the value sen are identiﬁed and placed in M(z), the set of missingD1 ðjÞ ¼ 1=f , j ¼ 1, Á Á Á ,f ð1Þ features.where f is the total number of features. For each iteration MðzÞ ¼ argðzðjÞ ¼ ¼ senÞ, 8j, j ¼ 1,. . .,f , ð5Þt ¼1,y,T, the distribution Dt A Rf that was updated in theprevious iteration is ﬁrst normalized to ensure that Dt stays as a Ã Finally, all classiﬁers Ct whose feature list Fselection(t) does notproper distribution include the features listed in M(z) are combined through majority , voting to obtain the ensemble classiﬁcation of z. XfDt ¼ Dt Dt ðjÞ ð2Þ X EðzÞ ¼ arg max 1ðMðzÞ Fselection ðtÞÞ ¼ |U ð6Þ j¼1 yAY Ã t:Ct ðzÞ ¼ y A random sample of nof features is drawn, without replace-ment, according to Dt, and the indices of these features are placedin a list, called F selection ðtÞ A Rnof (step 2). This list keeps track of 3.3. Choosing algorithm parameterswhich features have been used for each classiﬁer, so thatappropriate classiﬁers can be called during testing. Classiﬁer Ct, Considering that feature subsets for each classiﬁer are chosenis trained (step 3) using the features in Fselection(t), and evaluated randomly, and that we wish to use far fewer than 2f À 1 classiﬁerson S (step 4) to obtain necessary to accommodate every feature subset combination, it is possible that a particular feature combination required to 1XNPerft ¼ 1Ct ðxi Þ ¼ yi U ð3Þ accommodate a speciﬁc set of missing features might not have Ni¼1 been sampled by the algorithm. Such an instance cannot bewhere 1 Á U evaluates to 1 if the predicate is true. We require Ct to processed by Learn++.MF. It is then fair to ask how often Learn+ .MF +perform better than some minimum threshold, such as 1/2 as in will not be able to classify a given instance. Formally, given thatAdaBoost, or 1/c (better than the random guess for a reasonably Learn++.MF needs at least one useable classiﬁer that can handle thewell-balanced c-class data), on its training data to ensure that it current missing feature combination, what is the probability thathas a meaningful classiﬁcation capacity. The distribution Dt is there will be at least one classiﬁer – in an ensemble of T classiﬁersthen updated (step 5) according to – that will be able to classify an instance with m of its f features missing, if each classiﬁer is trained on nofof features? A formalDt þ 1 ðFselection ðtÞÞ ¼ bÃDt ðFselection ðtÞÞ, 0 o b r1 ð4Þ analysis to answer this question also allows us to determine thesuch that the weights of those features in current Fselection(t) are relationship among these parameters, and hence can guide us inreduced by a factor of b. Then, features not in Fselection(t) appropriately selecting the algorithm’s free parameters: nof and T.effectively receive higher weights when Dt is normalized in the A classiﬁer Ct is only useable if none of the nof features selectednext iteration. This gives previously undersampled features a for its training data matches one of those missing in x. Then, whatbetter chance to be selected into the next feature subset. Choosing is the probability of ﬁnding exactly such a classiﬁer? Without lossb ¼1 results in a feature distribution that remains uniform of any generality, assume that m missing features are ﬁxed, andthroughout the experiment. We recommend b ¼nof/f, which that we are choosing nof features to train Ct. Then, the probabilityreduces the likelihood of previously selected features being of choosing the ﬁrst feature – among f features – such that it doesselected in the next iteration. In our preliminary efforts, we not coincide with any of the m missing ones is (f Àm)/f. Theobtained better performances by using b ¼nof/f over other choices probability of choosing the second such feature is (f Àm À 1)/such as b ¼1 or b ¼1/f, though not always with statistically (f À 1), and similarly the probability of choosing the (nof)th suchsigniﬁcant margins [45]. feature is ðf ÀmÀðnof À1ÞÞ=ðf Àðnof À1ÞÞ. Then the probability of Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
5.
ARTICLE IN PRESS R. Polikar et al. / Pattern Recognition ] (]]]]) ]]]–]]] 5ﬁnding a useable classiﬁer Ct for x is ﬁnding a single useable classiﬁer is a success in a Bernoulli f Àm f ÀmÀ1 f ÀmÀ2 f ÀmÀðnof À1Þ experiment with probability of success p, then the probability ofp¼ U U UÁÁÁU ﬁnding at least t useable classiﬁers is f f À1 f À2 f Àðnof À1Þ nof À1 Y Pð 4t useable classifiers when m features are missingÞ ¼ pm t 4 m m m m m ¼ 1À U 1À U 1À U Á Á Á U 1À ¼ 1À T f f À1 f À2 f Àðnof À1Þ f Ài X T ðpÞt ð1ÀpÞTÀt i¼0 ¼ ð13Þ ð7Þ t¼t t Note that this is exactly the scenario described by the Note that Eq. (13) still needs to be weighted with thehypergeometric (HG) distribution: we have f features, m of which (binomial) probabilities of ‘‘having exactly m missing features’’,are missing; we then randomly choose nof of these f features. The and then summed over all values of m to yield the desiredprobability that exactly k of these selected nof features will be one probabilityof the m missing features is fX Ànof ! !, ! f m f Àm f P¼ rm ð1ÀrÞf Àm Uðpm t Þ 4 ð14Þp $ HGðk; f ,m,nof Þ ¼ ð8Þ m¼0 m k nof Àk nof By using the appropriate expressions from Eqs. (7) and (13) in We need k¼0, i.e., none of the features selected is one of the Eq. (14), we obtainmissing ones. Then, the desired probability is 2 0 Ànof fX X T nof À1 T Y !t ! f m m f Àm P¼ 4 m r ð1ÀrÞ Uf Àm @ 1À m¼0 m t¼t t i¼0 f Ài 0 nof ðf ÀmÞ! ðf Ànof Þ!p $ HGðk ¼ 0; f ,m,nof Þ ¼ ! ¼ !TÀt 13 U nof À1 Y f ðf ÀmÀnof Þ! f! m Â 1À 1À A5 ð15Þ nof i¼0 f Ài ð9Þ Eqs. (7) and (9) are equivalent. However, Eq. (7) is preferable, Eq. (15), when evaluated for t¼ 1, is the probability of ﬁndingsince it does not use factorials, and hence will not result in at least one useable classiﬁer trained on nof features – from a poolnumerical instabilities for large values of f. If we have T such of T classiﬁers – when each feature has a probability r of beingclassiﬁers, each trained on a randomly selected nof features, what missing. For t ¼1, Eqs. (10) and (13) are identical. However,is the probability that at least one will be useable, i.e., at least one Eq. (13) allows computation of the probability of ﬁnding anyof them was trained with a subset of nof features that do not number of useable classiﬁers (say, tdesired, not just one) simply byinclude one of the m missing features. The probability that a starting the summation from t ¼tdesired. These equations can helprandom single classiﬁer is not useable is (1 Àp). Since the feature us choose the algorithm’s free parameters, nof and T, to obtain aselections are independent (speciﬁcally for b ¼ 1), the probability reasonable conﬁdence that we can process most instances withof not ﬁnding any single useable classiﬁer among all T classiﬁers is missing features.(1 Àp)T. Hence, the probability of ﬁnding at least one useable As an example, consider a 30-feature dataset. First, let us ﬁxclassiﬁer is probability that a feature is missing at r ¼ 0.05. Then, on average, !T 1.5 features are missing from each instance. Fig. 3(a) shows the nof À1 Y m probability of ﬁnding at least one classiﬁer for various values ofP ¼ 1Àð1ÀpÞT ¼ 1À 1À 1À ð10Þ i¼0 f Ài nof as a function of T. Fig. 3(b) provides similar plots for a higher rate of r ¼0.20 (20% missing features). The solid curve in each If we know how many features are missing, Eq. (10) is exact. case is the curve obtained by Eq. (15), whereas the variousHowever, we usually do not know exactly how many features indicators represent the actual probabilities obtained as a result ofwill be missing. At best, we may know the probability r that 1000 simulations. We observe that (i) the experimental data ﬁtsa given feature may be missing. We then have a binomial the theoretical curve very well; (ii) the probability of ﬁnding adistribution: naming the event ‘‘ a given feature is missing’’ as useable classiﬁer increases with the ensemble size; and (iii) forsuccess, then the probability pm of having m of f features missing a ﬁxed r and a ﬁxed ensemble size, the probability of ﬁnding ain any given instance is the probability of having m successes in f useable classiﬁer increases as nof decreases. This latter obser-trials of a Bernoulli experiment, characterized by the binomial vation makes sense, as fewer the number of features used indistribution training, the larger the number of missing features that can be f accommodated. In fact, if a classiﬁer is trained with nof featurespm ¼ rm ð1ÀrÞf Àm ð11Þ m out of a total of f features, then that classiﬁer can accommodate The actual probability of ﬁnding a useable classiﬁer is then a up to f Ànof missing features.weighted sum of the probabilities (1 À p)T, where the weights are In Fig. 3(c), we now ﬁx nof¼ 9 (30% of the total feature size),the probability of having exactly m features missing, with m and calculate the probability of ﬁnding at least one useablevarying from 0 (no missing features) to maximum allowable classiﬁer as a function of ensemble size, for various values of r.number of fÀ nof missing features. As expected, for a ﬁxed value of nof and ensemble size, this 2 3 probability decreases as r increases. Ànof fX nof À1 Y !T In order to see how these results scale with larger feature sets, f mP ¼ 1À 4 f Àm rm ð1ÀrÞ U 1À 1À 5 ð12Þ m f Ài Fig. 4 shows the theoretical plots for the 216—feature multiple m¼0 i¼0 features (MFEAT) dataset [46]. All trends mentioned above can be Eq. (12) computes the probability of ﬁnding at least one observed in this case as well, however, the required ensemble sizeuseable classiﬁer by subtracting the probability of ﬁnding no is now in thousands. An ensemble with a few thousand classiﬁersuseable classiﬁer (in an ensemble of T classiﬁers) from one. is well within today’s computational resources, and in fact notA more informative way, however, is to compute the probability uncommon in many multiple classiﬁer system applications. It isof ﬁnding t useable classiﬁers (out of T). If the probability of clear, however, that as the feature size grows beyond the Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
6.
ARTICLE IN PRESS6 R. Polikar et al. / Pattern Recognition ] (]]]]) ]]]–]]] Probability of finding at least one useable classifier Probability of finding at least one useable classifier f=30, ρ =0.05 f =30, ρ =0.2 1 1 0.9 0.9 0.8 0.8 0.7 0.6 0.7 0.5 0.6 nof = 3 (10%) 0.4 nof = 5 (15%) nof = 5 (15%) 0.5 nof = 9 (30%) 0.3 nof = 7 (23%) nof = 12 (40%) nof = 9 (30%) nof = 15 (50%) 0.2 nof = 12 (40%) 0.4 nof = 20 (67%) nof = 15 (50%) 0.1 0 0 20 40 60 80 100 0 20 40 60 80 100 120 140 160 180 200 Number of classifies in the ensemble Number of classifiers in the ensemble Probability of finding at least one useable classifier f =30, nof =9 1 0.9 0.8 0.7 0.6 0.5 0.4 ρ = 0.05 ρ = 0.10 0.3 ρ = 0.15 ρ = 0.20 0.2 ρ = 0.25 0.1 ρ = 0.30 ρ = 0.40 0 0 50 100 150 200 Number of classifiers in the ensemble Fig. 3. Probability of ﬁnding at least one useable classiﬁer for f¼ 30 (a) r ¼0.05, variable nof, (b) r ¼0.2 variable nof and (c) nof ¼ 9, variable r.thousands range, the computational burden of this approach avoid repetition of similar outcomes, we present representativegreatly increases. Therefore, Learn++.MF is best suited for appli- full results on four datasets here, and provide results for manycations with moderately high dimensionality (fewer than 1000). additional real world and UCI benchmark datasets online [48], aThis is not an overly restrictive requirement, since many appli- summary of which is provided at the end of this section. The fourcations have typically much smaller dimensionalities. datasets featured here encompass a broad range of feature and database sizes. These datasets are the Wine (13 features), Wisconsin Breast Cancer (WBC—30 features), Optical Character4. Simulation results Recognition (OCR—62 features) and Multiple Features (MFEAT—216 features) datasets obtained from the UCI repository4.1. Experimental setup and overview of results [46]. In all cases, multilayer perceptron type neural networks were used as the base classiﬁer, though any supervised classiﬁer can Learn+ .MF was evaluated on several databases with various + also be used. The training parameters of the base classiﬁers werenumber of features. The algorithm resulted in similar trends in all not ﬁne tuned, and selected as ﬁxed reasonable values (errorcases, which we discuss in detail below. Due to the amount of goal¼0.001–0.01, number of hidden layer nodes¼20–30 based ondetail involved with each dataset, as well as for brevity and to the dataset). Our goal is not to show the absolute best classiﬁcation Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
7.
ARTICLE IN PRESS R. Polikar et al. / Pattern Recognition ] (]]]]) ]]]–]]] 7 Probability of finding at least one useable classifier Probability of finding at least one useable classifier f =216, ρ =0.15 f=216, nof=40 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 nof = 15 0.4 nof = 20 0.3 nof = 25 0.3 ρ = 0.05 nof = 30 0.2 0.2 ρ = 0.10 nof = 35 ρ = 0.15 0.1 nof = 40 0.1 ρ = 0.18 nof = 50 ρ = 0.20 0 0 0 1000 2000 3000 4000 0 500 1000 1500 2000 2500 3000 3500 4000 Number of classifiers in the ensemble Number of classifiers in the ensemble Fig. 4. Probability of ﬁnding at least one classiﬁer, f ¼ 216 (a) r ¼0.15, variable nof and (b) nof¼ 40, variable r.Table 1Datasets used in this study. Dataset k Dataset size Number of classes Number of features nof1 nof2 nof3 nof4 nof5 nof6 T WINE 178 3 13 3 4 5 6 7 variable 200 WBC 400 2 30 10 12 14 16 – variable 1000 OCR 5620 10 62 16 20 24 – – variable 1000 MFEAT 2000 10 216 10 20 30 40 60 variable 2000 VOC-In 384 6 6 2 3 – – – variable 100 VOC-IIn 84 12 12 3 4 5 6 – variable 200 IONn 270 2 34 8 10 12 14 – variable 1000 WATERn 374 4 38 12 14 16 18 – variable 1000 ECOLIn 347 5 5 2 3 – – – variable 1000 DERMAn 366 6 34 8 10 12 14 – variable 1000 PENn 10,992 10 16 6 7 8 9 – variable 250 n The full results for these datasets are provided online due to space considerations and may be found at [48].performances (which are quite good), but rather the effect of the observations, discussed below in detail, were as follows: (i) themissing features and strategic choice of the algorithm parameters. algorithm can handle data with missing values, 20% or more, with We deﬁne the global number of features in a dataset as the little or no performance drop, compared to classifying data withproduct of the number of features f, and the number of instances all values intact; and (ii) the choice of nof presents a critical trade-N; and a single test trial as evaluating the algorithm with 0–30% off: higher performance can be obtained using a larger nof, when(in steps of 2.5%), of the global number of features missing from fewer features are missing, yet smaller nof yields higher and morethe test data (typically half of the total data size). We evaluate the stable performance, and leaves fewer instances unprocessed,algorithm with different number of features, nof, used in training when larger portions of the data are missing.individual classiﬁers, including a variable nof, where the numberof features used to train each classiﬁer is determined by randomlydrawing an integer in the 15–50% range of f. All results are 4.2. Wine databaseaverages of 100 trials on test data, reported with 95% conﬁdenceintervals, where training and test sets were randomly reshufﬂed Wine database comes from the chemical analysis of 13for each trial. Missing features were simulated by randomly constituents in wines obtained from three cultivars. Previousreplacing actual values with sentinels. experiments using optimized classiﬁers trained with all 13 Table 1 summarizes datasets and parameters used in the features, and evaluated on a test dataset with no missing features,experiments: the cardinality of the dataset, the number of classes, had zero classiﬁcation error, setting the benchmark targettotal number of features f, the speciﬁc values of nof used in performance for this database. Six values of nof were considered:simulations, and the total number of classiﬁers T generated. 3, 4, 5, 6, 7 and variable, where each classiﬁer was trained on – not The behavior of the algorithm with respect to its parameters with a ﬁxed number of features, but a random selection of one ofwas very consistent across the databases. The two primary these values. Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
8.
ARTICLE IN PRESS8 R. Polikar et al. / Pattern Recognition ] (]]]]) ]]]–]]]Table 2Performance on the WINE database. nof ¼ 3 (out of 13) nof ¼4 (out of 13) nof ¼5 (out of 13) PMF Ensemble Perf PIP PMF Ensemble Perf PIP PMF Ensemble Perf PIP 0.00 100.007 0.00 100 0.00 100.00 70.00 100 0.00 100.00 70.00 100 2.50 100.007 0.00 100 2.50 99.67 7 0.71 100 2.50 100.00 70.00 100 5.00 100.007 0.00 100 5.00 100.00 70.00 100 5.00 99.67 7 0.71 100 7.50 99.67 7 0.71 100 7.50 99.67 7 0.71 100 7.50 99.33 7 1.43 100 10.00 99.33 7 0.95 100 10.00 99.67 7 0.71 100 10.00 99.007 1.09 100 15.00 99.007 1.09 100 15.00 98.007 1.58 100 15.00 98.33 7 1.60 100 20.00 99.007 1.09 100 20.00 98.007 1.91 100 20.00 96.67 7 1.85 100 25.00 99.007 1.09 100 25.00 97.66 7 1.53 100 25.00 96.29 7 1.31 99 30.00 96.33 7 1.98 100 30.00 94.29 7 1.54 99 30.00 95.22 7 2.26 98 nof ¼ 6 (out of 13) nof ¼7 (out of 13) nof ¼variable (3–7 out of 13) PMF Ensemble Perf PIP PMF Ensemble Perf PIP PMF Ensemble Perf PIP 0.00 100.007 0.00 100 0.00 100.00 70.00 100 0.00 100.00 70.00 100 2.50 99.67 7 0.71 100 2.50 99.33 7 0.95 100 2.50 99.33 7 0.95 100 5.00 99.33 7 0.95 100 5.00 99.33 7 0.95 100 5.00 100.00 70.00 100 7.50 98.67 7 1.58 100 7.50 98.67 7 1.17 100 7.50 99.67 7 0.71 100 10.00 98.33 7 2.20 100 10.00 98.32 7 1.60 100 10.00 99.67 7 0.74 100 15.00 96.98 72.00 99 15.00 96.94 7 2.30 99 15.00 97.67 7 1.53 100 20.00 96.98 7 2.27 99 20.00 91.27 7 3.55 96 20.00 98.00 7 1.17 100 25.00 93.72 7 2.66 95 25.00 93.03 7 4.35 86 25.00 97.33 7 1.78 100 30.00 95.29 7 2.63 93 30.00 91.34 7 4.20 78 30.00 94.67 7 2.18 100 Table 2 summarizes the test performances using all values of (of the same architecture as ensemble members) that employsnof, and the percentage of instances that could be processed – the commonly used mean imputation (M.I.) to replace the missingcorrectly or otherwise – with the existing ensemble (described data, as well as that of the E.M. algorithm [47]. We observe thatbelow). The ﬁrst row with 0.0% missing features is the algorithm’s Learn++.MF ensemble (signiﬁcantly) outperforms both E.M. andperformance when classiﬁers were trained on nof features, but mean imputation for all values of PMF, an observation that wasevaluated on a fully intact test data. The algorithm is able to obtain consistent for all databases we have evaluated (we compare100% correct classiﬁcation with any of the nof values used, indi- Learn++.MF to naı Bayes and RSM with mean imputation later in ¨vecating that this dataset does in fact include redundant features. the paper). Fig. 5(b) shows the performance of a single usable Now, since the feature subsets are selected at random, it is classiﬁer, averaged over all T classiﬁers, as a function of PMF.possible that the particular set of features available for a given The average performance of any ﬁxed single classiﬁer is, asinstance do not match the feature combinations selected by any of expected, independent of PMF; however, using higher nof valuesthe classiﬁers. Such an instance cannot be processed, as there are generally associated with better individual classiﬁer perfor-would be no classiﬁer trained with the unique combination of the mance. This makes sense, since there is more informationavailable features. The column with the heading PIP (percentage available to each classiﬁer with a larger nof. However, thisof instances processed) indicates the percentage of those comes at the cost of smaller PIP. In Fig. 5(b) and (c) we compareinstances that can be processed by the generated ensemble. As the performance and the PIP with a single vs. ensemble classiﬁer.expected, PIP decreases as the ‘‘% missing features (PMF)’’ For example, consider 20% PMF, for which the corresponding PIPincreases. We note that with 30% of the features missing from values for single classiﬁers are shown in Fig. 5(b). A single usablethe test data, 100% of this test dataset can be processed (at a classiﬁer trained with nof¼ 3 features is able to classify, onperformance of 96.3%) when nof¼3, whereas only 78% can be average, 51% of instances with a performance of 79%, whereas theprocessed (at a performance of 91.3%) when nof ¼7 features are ensemble has a 100% PIP with a performance of 99%. Fig. 5(c)used for training the classiﬁers. A few additional observations: shows PIP as a function of PMF, both for single classiﬁers (lowerﬁrst, we note that there is no (statistically signiﬁcant) perfor- curves) and the ensembles (upper curves). As expected, PIPmance drop with up to 25% of features missing when nof¼3, or up decreases with increasing PMF for both the ensemble and a singleto 10% when nof¼6, and only a few percentage points (3.5% for usable classiﬁer. However, the decline in PIP is much steeper fornof¼3, 4.7% for nof¼6) when as many as 30% of the features are single classiﬁers. In fact, there is virtually no drop (PIP ¼100%) upmissing—hence the algorithm can handle large amounts of until 20% PMF when the ensemble is used (regardless of nof).missing data with relatively little drop in classiﬁcation perfor- Hence the impact of using an ensemble is two-folds: the ensemblemance. Second, virtually all test data can be processed—even can process a larger percentage of instances with missingwhen as much as 30% of the features are missing, if we use features, and it also provides a higher classiﬁcation performancerelatively few features for training the classiﬁers. on those instances, compared to a single classiﬁer. For a more in depth analysis of the algorithm’s behavior, Also note in Fig. 5(c) that the decline in PIP is much steeperconsider the plots in Fig. 5. Fig. 5(a) illustrates the overall with nof ¼7 than it is for nof¼3, similar to decline in classiﬁcationperformance of the Learn+ .MF ensemble. All performances in the + performance shown in Fig. 5(a). Hence, our main observation0–30% interval of PMF with increments of 2.5% are provided regarding the trade-off one must accept by choosing a speciﬁc(while the tables skip some intermediate values for brevity). As nof value is as follows: classiﬁers trained with a larger nofexpected, the performance declines as PMF increases – albeit only may achieve a higher generalization performance individually,gradually. The decline is much milder for nof¼3 than it is for or initially when fewer features are missing, but are only ablenof¼7, the reasons of which are discussed below. Also included in to classify fewer instances as the PMF increases. This observa-Fig. 5(a) for comparison, is the performance of a single classiﬁer tion can be explained as follows: using large nof for training Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
9.
ARTICLE IN PRESS R. Polikar et al. / Pattern Recognition ] (]]]]) ]]]–]]] 9 Learn++.MF Ensemble Performance vs. MI Single Usable Classifier Performance Learn ++.MF 88 100 Ensembles PIP @20% PMF 26% 86 95 EM 84 20% 33% % % 82 37% Mean 90 Imputation 80 41% 78 51% 85 76 0 5 10 15 20 25 30 0 5 10 15 20 25 30 % Missing Feature (PMF) % Missing Feature (PMF) Percent Instances Processed Percent Classifiers Usable 100 100 Learn ++ .MF 80 80 Ensemble 60 60 % % 40 40 Single Classifiers 20 20 0 0 0 5 10 15 20 25 30 0 5 10 15 20 25 30 % Missing Feature (PMF) % Missing Feature (PMF) 3 features 4 features 5 features 6 features 7 features Variable M.I. E.M. ++Fig. 5. Detailed analysis of the algorithm on the WINE database with respect to PMF and nof: (a) Learn .MF ensemble performances compared to that of mean imputation,(b) single classiﬁer performance, (c) single classiﬁer and ensemble PIP and (d) percentage of useable classiﬁers in the ensemble.(e.g., 10 out of 13) means that the same large number of features the results of averaging 1000 such Monte Carlo simulations for(10) are required for testing. Therefore, fewer number of missing f¼13, nof ¼3–7, PMF [0–100]%, which agrees with the actualfeatures (only 3, in this case) can be accommodated. In fact, the results of Fig. 5(d) (up to 30% PMF is shown in Fig. 7(d)).probability of no classiﬁer being available for a given combinationof missing features increases with nof . Of course, if all features areused for training, then no classiﬁer will be able to process any 4.3. Wisconsin breast cancer (WBC) databaseinstance with even a single missing feature, hence the originalmotivation of this study. The WBC database consists of 569 instances of cell-biopsy Finally, Fig. 5(d) analyzes the average percentage of usable classiﬁed as malignant or benign based on 30 features obtainedclassiﬁers (PUC) per given instance, with respect to PMF. The from the individual cells. The expected best performance fromdecline in PUC also makes sense: there are fewer classiﬁers this dataset (when no features are missing) is 97% [46]. Five nofavailable to classify instances that have a higher PMF. Further- values 10, 12, 14, 16 and variable, were tested.more, this decline is also steeper for higher nof. Table 3 summarizes the test performances using all values of In summary, then, a larger nof usually provides a better nof, and the PIP values, similar to that of Wine database in Table 2.individual and/or initial performance than a smaller nof, but the The results with this database of larger number of features haveensemble is more resistant to unusable feature combinations much of the same trends and characteristics mentioned for thewhen trained with a smaller nof. How do we, then, choose the nof Wine database. Speciﬁcally, the proximity of the ensemblevalue, since we cannot know PMF in future datasets? A simple performance when used with nof¼ 10, 12, 14 and 16 featuressolution is to use a variable number of features—typically in the with no missing features to the 97% ﬁgure reported in therange of 15–50% of the total number of features. We observe from literature indicates that this database also has a redundant featureTable 2 and Fig. 5, that such a variable selection provides set. Also, as in the Wine dataset, the algorithm is able to handle upsurprisingly good performance: a good compromise with little to 20% missing features with little or no (statistically) signiﬁcantor no loss of unprocessed instances. performance drop, and only a gradual decline thereafter. As As for the other parameter, T—the number of classiﬁers, the expected, the PIP drops with increasing PMF, and the drop istrade off is simply a matter of computational resources. A larger steeper for larger nof values compared to smaller ones. UsingPIP and PUC will be obtained, if larger number of classiﬁers is variable number of features provides a good trade off. Thesetrained. Fig. 5(d) shows PUC for various values of PMF, indicating characteristics can also be seen in Fig. 7, where we plot thethe actual numbers obtained for this database after the classiﬁers ensemble performance, and PIP for single and ensemble classiﬁerswere trained. Family of curves generated using Eq. (15), such as as in Fig. 5 for Wine database. Fig. 7(a) also shows that thethose in Fig. 3, as well as Monte Carlo simulations can provide us ensemble outperforms a single classiﬁer (of identical architecture)with these numbers, before training the classiﬁers. Fig. 6 shows that uses mean imputation to replace the missing features. The Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
10.
ARTICLE IN PRESS10 R. Polikar et al. / Pattern Recognition ] (]]]]) ]]]–]]] 100 Expected percentage of useable classifier 90 nof = 3 nof = 4 80 nof = 5 nof = 6 70 nof = 7 in the ensemble 60 50 40 30 20 10 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Probability of any given feature being missing Fig. 6. Estimating the required ensemble size ahead of time. Learn++.MF Ensemble Performances vs. M.I. Percent of Processed Instances 100 Learn++.MF 100 Ensembles 80 Learn++.MF 90 Ensemble 60 % % Mean Imputation 40 80 20 Single Classifiers 70 0 0 5 10 15 20 25 30 0 5 10 15 20 25 30 % Missing Feature (PMF) % Missing Feature (PMF) 10 features 12 features 14 features 16 features Variable M.I.Fig. 7. Analysis of the algorithm on the WBC database with respect to PMF and nof: (a) Learn++.MF ensemble performances compared to that of mean imputation and (b)single classiﬁer and ensemble PIP.Table 3Performance on the WBC database. nof ¼ 10 (out of 30) nof¼ 12 (out of 30) nof¼ 14 (out of 30) PMF Ensemble Perf PIP PMF Ensemble Perf PIP PMF Ensemble Perf PIP 0.00 95.007 0.00 100 0.00 96.00 70.00 100 0.00 95.00 70.00 100 2.50 95.207 0.24 100 2.50 95.80 70.18 100 2.50 94.75 70.24 100 5.00 95.057 0.30 100 5.00 95.55 70.41 100 5.00 94.80 70.33 100 7.50 95.15 7 0.42 100 7.50 95.55 70.11 100 7.50 94.55 70.56 100 10.00 95.55 7 0.30 100 10.00 95.34 70.40 100 10.00 94.39 70.31 100 15.00 95.107 0.69 100 15.00 94.88 70.50 100 15.00 93.84 70.95 98 20.00 94.63 7 0.77 100 20.00 93.08 71.10 97 20.00 92.35 71.08 91 25.00 94.53 7 0.81 98 25.00 92.78 70.74 91 25.00 91.42 71.20 76 30.00 92.78 7 1.39 92 30.00 89.87 72.02 75 30.00 89.01 71.73 55 nof ¼ 16 (out of 30) nof¼ var. (10–16 out of 30) PMF Ensemble Perf PIP PMF Ensemble Perf PIP 0.00 95.007 0.00 100 0.00 95.50 70.00 100 2.50 94.807 0.29 100 2.50 95.45 70.19 100 5.00 94.75 7 0.37 100 5.00 95.30 70.43 100 7.50 95.047 0.52 100 7.50 95.30 70.46 100 10.00 94.24 7 0.48 99 10.00 95.15 70.39 100 15.00 92.507 0.98 94 15.00 94.79 70.85 100 20.00 90.207 1.10 79 20.00 94.90 70.79 97 25.00 86.98 7 2.45 56 25.00 91.61 71.23 91 30.00 86.34 7 3.17 34 30.00 91.13 71.32 78 Please cite this article as: R. Polikar, et al., Learn + + .MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
Be the first to comment