How to handle missing data

Published in: Technology

ARTICLE IN PRESS

Pattern Recognition
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/pr

Learn++.MF: A random subspace approach for the missing feature problem

Robi Polikar a,*, Joseph DePasquale a, Hussein Syed Mohammed a, Gavin Brown b, Ludmilla I. Kuncheva c

a Electrical and Computer Eng., Rowan University, 201 Mullica Hill Road, Glassboro, NJ 08028, USA
b University of Manchester, Manchester, England, UK
c University of Bangor, Bangor, Wales, UK

* Corresponding author: Tel: +1 856 256 5372; fax: +1 856 256 5241. E-mail address: polikar@rowan.edu (R. Polikar).

Article history: Received 9 November 2009; received in revised form 16 April 2010; accepted 21 May 2010.

Keywords: Missing data; Missing features; Ensemble of classifiers; Random subspace method

0031-3203/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2010.05.028

Please cite this article as: R. Polikar, et al., Learn++.MF: A random subspace approach for the missing feature problem, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028

Abstract

We introduce Learn++.MF, an ensemble-of-classifiers based algorithm that employs random subspace selection to address the missing feature problem in supervised classification. Unlike most established approaches, Learn++.MF does not replace missing values with estimated ones, and hence does not need specific assumptions on the underlying data distribution. Instead, it trains an ensemble of classifiers, each on a random subset of the available features. Instances with missing values are classified by the majority voting of those classifiers whose training data did not include the missing features. We show that Learn++.MF can accommodate substantial amount of missing data, and with only gradual decline in performance as the amount of missing data increases. We also analyze the effect of the cardinality of the random feature subsets, and the ensemble size on algorithm performance. Finally, we discuss the conditions under which the proposed approach is most effective.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. The missing feature problem

The integrity and completeness of data are essential for any classification algorithm. After all, a trained classifier (unless specifically designed to address this issue) cannot process instances with missing features, as the missing number(s) in the input vectors would make the matrix operations involved in data processing impossible. To obtain a valid classification, the data to be classified should be complete with no missing features (henceforth, we use missing data and missing features interchangeably). Missing data in real world applications is not uncommon: bad sensors, failed pixels, unanswered questions in surveys, malfunctioning equipment, medical tests that cannot be administered under certain conditions, etc. are all familiar scenarios in practice that can result in missing features. Feature values that are beyond the expected dynamic range of the data due to extreme noise, signal saturation, data corruption, etc. can also be treated as missing data. Furthermore, if the entire data are not acquired under identical conditions (time/location, using the same equipment, etc.), different data instances may be missing different features.

Fig. 1 illustrates such a scenario for a handwritten character recognition application: characters are digitized on an 8 × 8 grid, creating 64 features, $f_1$–$f_{64}$, a random subset (of about 20–30%) of which, indicated in orange (light shading), are missing in each case. Having such a large proportion of randomly varying features may be viewed as an extreme and unlikely scenario, warranting reacquisition of the entire dataset. However, data reacquisition is often expensive, impractical, or sometimes even impossible, justifying the need for an alternative practical solution.

The classification algorithm described in this paper is designed to provide such a practical solution, accommodating missing features subject to the condition of distributed redundancy (discussed in Section 3), which is satisfied surprisingly often in practice.

1.2. Current techniques for accommodating missing data

The simplest approach for dealing with missing data is to ignore those instances with missing attributes. Commonly referred to as filtering or list-wise deletion approaches, such techniques are clearly suboptimal when a large portion of the data have missing attributes [1], and of course are infeasible, if each instance is missing at least one or more features. A more pragmatic approach commonly used to accommodate missing data is imputation [2–5]: substitute the missing value with a meaningful estimate. Traditional examples of this approach include replacing the missing value with one of the existing data points (most similar in some measure) as in hot-deck imputation
[2,6,7], the mean of that attribute across the data, or the mean of its k-nearest neighbors [4]. In order for the estimate to be a faithful representative of the missing value, however, the training data must be sufficiently dense, a requirement rarely satisfied for datasets with even modest number of features. Furthermore, imputation methods are known to produce biased estimates as the proportion of missing data increases. A related, but perhaps a more rigorous approach is to use polynomial regression to estimate substitutes for missing values [8]. However, regression techniques, besides being difficult to implement in high dimensions, assume that the data can reasonably fit to a particular polynomial, which may not hold for many applications.

Fig. 1. Handwritten character recognition as an illustration of the missing feature problem addressed in this study: a large proportion of feature values are missing (orange/shaded), and the missing features vary randomly from one instance to another (the actual characters are the numbers 9, 7 and 6). For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.

Theoretically rigorous methods with provable performance guarantees have also been developed. Many of these methods rely on model based estimation, such as Bayesian estimation [9–11], which calculates posterior probabilities by integrating over the missing portions of the feature space. Such an approach also requires a sufficiently dense data distribution; but more importantly, a prior distribution for all unknown parameters must be specified, which requires prior knowledge. Such knowledge is typically vague or non-existent, and inaccurate choices often lead to inferior performance.

An alternative strategy in model based methods is the Expectation Maximization (EM) algorithm [12–14,10], justly admired for its theoretical elegance. EM consists of two steps, the expectation (E) and the maximization (M) step, which iteratively maximize the expectation of the log-likelihood of the complete data, conditioned on observed data. Conceptually, these steps are easy to construct, and the range of problems that can be handled by EM is quite broad. However, there are two potential drawbacks: (i) convergence can be slow if large portions of data are missing; and (ii) the M step may be quite difficult if a closed form of the distribution is unknown, or if different instances are missing different features. In such cases, the theoretical simplicity of EM does not translate into practice [2]. EM also requires prior knowledge that is often unavailable, or estimation of an unknown underlying distribution, which may be computationally prohibitive for large dimensional datasets. Incorrect distribution estimation often leads to inconsistent results, whereas lack of sufficiently dense data typically causes loss of accuracy. Several variations have been proposed, such as using Gaussian mixture models [15,16]; or Expected Conditional Maximization, to mitigate some of these difficulties [2].

Neural network based approaches have also been proposed. Gupta and Lam [17] looked at weight decay on datasets with missing values, whereas Yoon and Lee [18] proposed the Training-EStimation-Training (TEST) algorithm, which predicts the actual target value from a reasonably well estimated imputed one. Other approaches use neuro-fuzzy algorithms [19], where unknown values of the data are either estimated (or a classification is made using the existing features) by calculating the fuzzy membership of the data point to its nearest neighbors, clusters or hyperboxes. The parameters of the clusters and hyperboxes are determined from the existing data. Algorithms based on the general fuzzy min–max neural networks [20] or ARTMAP and fuzzy c-means clustering [21] are examples of this approach.

More recently, ensemble based approaches have also been proposed. For example, Melville et al. [22] showed that the algorithm DECORATE, which generates artificial data (with no missing values) from existing data (with missing values) is quite robust to missing data. On the other hand, Juszczak and Duin [23] proposed combining an ensemble of one-class classifiers, each trained on a single feature. This approach is capable of handling any combination of missing features, with the fewest number of classifiers possible. The approach can be very effective as long as single features can reasonably estimate the underlying decision boundaries, which is not always plausible.

In this contribution, we propose an alternative strategy, called Learn++.MF. It is inspired in part by our previously introduced ensemble-of-classifiers based incremental learning algorithm, Learn++, and in part by the random subspace method (RSM). In essence, Learn++.MF generates a sufficient number of classifiers, each trained on a random subset of features. An instance with one or more missing attributes is then classified by majority voting of those classifiers that did not use the missing features during their training. Hence, this approach differs in one fundamental aspect from the other techniques: Learn++.MF does not try to estimate the values of the missing data; instead it tries to extract the most discriminatory classification information provided by the existing data, by taking advantage of a presumed redundancy in the feature set. Hence, Learn++.MF avoids many of the pitfalls of estimation and imputation based techniques.

As in most missing feature approaches, we assume that the probability of a feature being missing is independent of the value of that or any other feature in the dataset. This model is referred to as missing completely at random (MCAR) [2]. Hence, given the dataset $X = (x_{ij})$ where $x_{ij}$ represents the $j$th feature of instance $x_i$, and a missing data indicator matrix $M = (M_{ij})$, where $M_{ij}$ is 1 if $x_{ij}$ is missing and 0 otherwise, we assume that $p(M \mid X) = p(M)$. This is the most restrictive mechanism that generates missing data, the one that provides us with no additional information. However, this is also the only case where list-wise deletion or most imputation approaches lead to no bias.

In the rest of this paper, we first provide a review of ensemble systems, followed by the description of the Learn++.MF algorithm, and provide a theoretical analysis to guide the choice of its free parameters: number of features used to train each classifier, and the ensemble size. We then present results, analyze algorithm performance with respect to its free parameters, and compare performance metrics with theoretically expected values. We also
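The MCAR assumption $p(M \mid X) = p(M)$ stated above can be illustrated with a short sketch. It is only an illustration under stated assumptions, not code from the paper: the function names `mcar_mask` and `apply_mask`, the per-feature missing probability argument `rho`, and the use of `None` as the sentinel are all hypothetical choices made here. The mask is drawn without ever looking at the data values, which is what makes the mechanism MCAR.

```python
import random

def mcar_mask(n_instances, n_features, rho, seed=0):
    """Draw an indicator matrix M with M[i][j] = 1 if x_ij is missing, else 0.

    Every entry is an independent Bernoulli(rho) draw that never looks at the
    data values, so p(M | X) = p(M) holds by construction (MCAR).
    """
    rng = random.Random(seed)
    return [[1 if rng.random() < rho else 0 for _ in range(n_features)]
            for _ in range(n_instances)]

def apply_mask(X, M, sen=None):
    """Replace every entry flagged in M by the sentinel value sen (assumed: None)."""
    return [[sen if flag else value for value, flag in zip(row, flags)]
            for row, flags in zip(X, M)]

if __name__ == "__main__":
    X = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    M = mcar_mask(len(X), len(X[0]), rho=0.3, seed=42)
    print(apply_mask(X, M))
```

A mask whose entries depended on the feature values (for example, dropping only saturated readings) would no longer satisfy the assumption.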
compare the performance of Learn++.MF to that of a single classifier using mean imputation, to naïve Bayes that can naturally handle missing features, and an intermediate approach that combines RSM with mean imputation, as well as to that of the one-class ensemble approach of [23]. We conclude with discussions and remarks.

2. Background: ensemble and random subspace methods

An ensemble-based system is obtained by combining diverse classifiers. The diversity in the classifiers, typically achieved by using different training parameters for each classifier, allows individual classifiers to generate different decision boundaries [24–27]. If proper diversity is achieved, a different error is made by each classifier, strategic combination of which can then reduce the total error. Starting in the early 1990s, with such seminal works as [28–30,13,31–33], ensemble systems have since become an important research area in machine learning, whose recent reviews can be found in [34–36].

The diversity requirement can be achieved in several ways [24–27]: training individual classifiers using different (i) training data (sub)sets; (ii) parameters of a given classifier architecture; (iii) classifier models; or (iv) subsets of features. The last one, also known as the random subspace method (RSM), was originally proposed by Ho [37] for constructing an ensemble of decision trees (decision forests). In RSM, classifiers are trained using different random subsets of the features, allowing classifiers to err in different sub-domains of the feature space. Skurichina points out that RSM works particularly well when the database provides redundant information that is "dispersed" across all features [38].

The RSM has several desirable attributes: (i) working in a reduced dimensional space also reduces the adverse consequences of the curse of dimensionality; (ii) RSM based ensemble classifiers may provide improved classification performance, because the synergistic classification capacity of a diverse set of classifiers compensates for any discriminatory information lost by choosing a smaller feature subset; (iii) implementing RSM is quite straightforward; and (iv) it provides a stochastic and faster alternative to the optimal-feature-subset search algorithms. RSM approaches have been well-researched for improving diversity, with well-established benefits for classification applications [39], regression problems [40] and optimal feature selection applications [38,41]. However, the feasibility of an RSM based approach has not been explored for the missing feature problem, and hence constitutes the focus of this paper.

Finally, a word on the algorithm name: Learn++ was developed by reconfiguring AdaBoost [33] to incrementally learn from new data that may later become available. Learn++ generates an ensemble of classifiers for each new database, and combines them through weighted majority voting. Learn++ draws its training data from a distribution, iteratively updated to force the algorithm to learn novel information not yet learned by the ensemble [42,43]. Learn++.MF combines the distribution update concepts of Learn++ with the random feature selection of RSM, to provide a novel solution to the missing feature problem.

3. Approach

3.1. Assumptions and targeted applications of the proposed approach

The essence of the proposed approach is to generate a sufficiently large number of classifiers, each trained with a random subset of features. When an instance $x$ with missing feature(s) needs to be classified, only those classifiers trained with the features that are presently available in $x$ are used for the classification. As such, Learn++.MF follows an alternative paradigm for the solution of the missing feature problem: the algorithm tries to extract most of the information from the available features, instead of trying to estimate, or impute, the values of the missing ones. Hence, Learn++.MF does not require very dense training data (though such data are certainly useful); it does not need to know or estimate the underlying distributions; and hence does not suffer from the adverse consequences of a potential incorrect estimate.

Learn++.MF makes two assumptions. First, the feature set must be redundant, that is, the classification problem is solvable with unknown subset(s) of the available features (of course, if the identities of the redundant features were known, they would have already been excluded from the feature set). Second, the redundancy is distributed randomly over the feature set (hence, time series data would not be well suited for this approach). These assumptions are primarily due to the random nature of feature selection, and are shared with all RSM based approaches. We combine these two assumptions under the name distributed redundancy. Such datasets are not uncommon in practice (the scenario in Fig. 1, sensor applications, etc.), and it is these applications for which Learn++.MF is designed, and expected to be most effective.

3.2. Algorithm Learn++.MF

A trivial approach is to create a classifier for each of the $2^f - 1$ possible non-empty subsets of the $f$ features. Then, for any instance with missing features, use those classifiers whose training feature set did not include the missing ones. Such an approach, perfectly suitable for datasets with few features, has in fact been previously proposed [44]. For large feature sets, this exhaustive approach is infeasible. Besides, the probability of a particular feature combination being missing also diminishes exponentially as the number of features increases. Therefore, trying to account for every possible combination is inefficient. At the other end of the spectrum is Juszczak and Duin's approach [23], using $f \times c$ one-class classifiers, one for each feature and each of the $c$ classes. This approach requires the fewest classifiers that can handle any feature combination, but comes at a cost of potential performance drop due to disregarding any existing relationship between the features, as well as the information provided by other existing features.

Learn++.MF offers a strategic compromise: it trains an ensemble of classifiers with a subset of the features, randomly drawn from a feature distribution, which is iteratively updated to favor selection of those features that were previously undersampled. This allows Learn++.MF to achieve classification performances with little or no loss (compared to fully intact data) even when large portions of data are missing. Without any loss of generality, we assume that the training dataset is complete. Otherwise, a sentinel can be used to flag the missing features, and the classifiers are then trained on a random selection of existing features. For brevity, we focus on the more critical case of field data containing missing features.

The pseudocode of Learn++.MF is provided in Fig. 2. The inputs to the algorithm are the training data $S = \{(x_i, y_i),\ i = 1, \ldots, N\}$; the feature subset cardinality, i.e., the number of features, $\mathit{nof}$, used to train each classifier; a supervised classification algorithm (BaseClassifier); the number of classifiers to be created, $T$; and a sentinel value $sen$ to designate a missing feature. At each iteration $t$, the algorithm creates one additional classifier $C_t$. The features used for training $C_t$ are drawn from an iteratively updated
distribution $D_t$ that promotes further diversity with respect to the feature combinations. $D_1$ is initialized to be uniform, hence each feature is equally likely to be used in training classifier $C_1$ (step 1).

$$D_1(j) = 1/f, \quad j = 1, \ldots, f \tag{1}$$

where $f$ is the total number of features. For each iteration $t = 1, \ldots, T$, the distribution $D_t \in \mathbb{R}^f$ that was updated in the previous iteration is first normalized to ensure that $D_t$ stays as a proper distribution

$$D_t = D_t \Big/ \sum_{j=1}^{f} D_t(j) \tag{2}$$

A random sample of $\mathit{nof}$ features is drawn, without replacement, according to $D_t$, and the indices of these features are placed in a list, called $F_{selection}(t) \in \mathbb{R}^{\mathit{nof}}$ (step 2). This list keeps track of which features have been used for each classifier, so that appropriate classifiers can be called during testing. Classifier $C_t$ is trained (step 3) using the features in $F_{selection}(t)$, and evaluated on $S$ (step 4) to obtain

$$\mathrm{Perf}_t = \frac{1}{N} \sum_{i=1}^{N} \llbracket C_t(x_i) = y_i \rrbracket \tag{3}$$

where $\llbracket \cdot \rrbracket$ evaluates to 1 if the predicate is true. We require $C_t$ to perform better than some minimum threshold, such as 1/2 as in AdaBoost, or $1/c$ (better than the random guess for a reasonably well-balanced $c$-class data), on its training data to ensure that it has a meaningful classification capacity. The distribution $D_t$ is then updated (step 5) according to

$$D_{t+1}(F_{selection}(t)) = \beta \cdot D_t(F_{selection}(t)), \quad 0 < \beta \le 1 \tag{4}$$

such that the weights of those features in current $F_{selection}(t)$ are reduced by a factor of $\beta$. Then, features not in $F_{selection}(t)$ effectively receive higher weights when $D_t$ is normalized in the next iteration. This gives previously undersampled features a better chance to be selected into the next feature subset. Choosing $\beta = 1$ results in a feature distribution that remains uniform throughout the experiment. We recommend $\beta = \mathit{nof}/f$, which reduces the likelihood of previously selected features being selected in the next iteration. In our preliminary efforts, we obtained better performances by using $\beta = \mathit{nof}/f$ over other choices such as $\beta = 1$ or $\beta = 1/f$, though not always with statistically significant margins [45].

Fig. 2. Pseudocode of algorithm Learn++.MF.

During classification of a new instance $z$, missing (or known to be corrupt) values are first replaced by a sentinel $sen$ (chosen as a value not expected to occur in the data). Then, the features with the value $sen$ are identified and placed in $M(z)$, the set of missing features.

$$M(z) = \arg\big(z(j) == sen\big), \quad \forall j,\ j = 1, \ldots, f \tag{5}$$

Finally, all classifiers $C_t^*$ whose feature list $F_{selection}(t)$ does not include the features listed in $M(z)$ are combined through majority voting to obtain the ensemble classification of $z$.

$$E(z) = \arg\max_{y \in Y} \sum_{t:\, C_t^*(z) = y} \llbracket M(z) \cap F_{selection}(t) = \emptyset \rrbracket \tag{6}$$

3.3. Choosing algorithm parameters

Considering that feature subsets for each classifier are chosen randomly, and that we wish to use far fewer than the $2^f - 1$ classifiers necessary to accommodate every feature subset combination, it is possible that a particular feature combination required to accommodate a specific set of missing features might not have been sampled by the algorithm. Such an instance cannot be processed by Learn++.MF. It is then fair to ask how often Learn++.MF will not be able to classify a given instance. Formally, given that Learn++.MF needs at least one useable classifier that can handle the current missing feature combination, what is the probability that there will be at least one classifier, in an ensemble of $T$ classifiers, that will be able to classify an instance with $m$ of its $f$ features missing, if each classifier is trained on $\mathit{nof}$ features? A formal analysis to answer this question also allows us to determine the relationship among these parameters, and hence can guide us in appropriately selecting the algorithm's free parameters: $\mathit{nof}$ and $T$.

A classifier $C_t$ is only useable if none of the $\mathit{nof}$ features selected for its training data matches one of those missing in $x$. Then, what is the probability of finding exactly such a classifier? Without loss of any generality, assume that $m$ missing features are fixed, and that we are choosing $\mathit{nof}$ features to train $C_t$. Then, the probability of choosing the first feature, among $f$ features, such that it does not coincide with any of the $m$ missing ones is $(f-m)/f$. The probability of choosing the second such feature is $(f-m-1)/(f-1)$, and similarly the probability of choosing the $(\mathit{nof})$th such feature is $(f-m-(\mathit{nof}-1))/(f-(\mathit{nof}-1))$. Then the probability of
finding a useable classifier $C_t$ for $x$ is

$$p = \frac{f-m}{f} \cdot \frac{f-m-1}{f-1} \cdot \frac{f-m-2}{f-2} \cdots \frac{f-m-(\mathit{nof}-1)}{f-(\mathit{nof}-1)} = \left(1-\frac{m}{f}\right)\left(1-\frac{m}{f-1}\right)\left(1-\frac{m}{f-2}\right) \cdots \left(1-\frac{m}{f-(\mathit{nof}-1)}\right) = \prod_{i=0}^{\mathit{nof}-1}\left(1-\frac{m}{f-i}\right) \tag{7}$$

Note that this is exactly the scenario described by the hypergeometric (HG) distribution: we have $f$ features, $m$ of which are missing; we then randomly choose $\mathit{nof}$ of these $f$ features. The probability that exactly $k$ of these selected $\mathit{nof}$ features will be one of the $m$ missing features is

$$p \sim HG(k; f, m, \mathit{nof}) = \binom{m}{k}\binom{f-m}{\mathit{nof}-k} \Big/ \binom{f}{\mathit{nof}} \tag{8}$$

We need $k = 0$, i.e., none of the features selected is one of the missing ones. Then, the desired probability is

$$p \sim HG(k=0; f, m, \mathit{nof}) = \binom{m}{0}\binom{f-m}{\mathit{nof}} \Big/ \binom{f}{\mathit{nof}} = \frac{(f-m)!}{(f-m-\mathit{nof})!} \cdot \frac{(f-\mathit{nof})!}{f!} \tag{9}$$

Eqs. (7) and (9) are equivalent. However, Eq. (7) is preferable, since it does not use factorials, and hence will not result in numerical instabilities for large values of $f$. If we have $T$ such classifiers, each trained on a randomly selected $\mathit{nof}$ features, what is the probability that at least one will be useable, i.e., that at least one of them was trained with a subset of $\mathit{nof}$ features that does not include any of the $m$ missing features? The probability that a random single classifier is not useable is $(1-p)$. Since the feature selections are independent (specifically for $\beta = 1$), the probability of not finding any single useable classifier among all $T$ classifiers is $(1-p)^T$. Hence, the probability of finding at least one useable classifier is

$$P = 1-(1-p)^T = 1-\left(1-\prod_{i=0}^{\mathit{nof}-1}\left(1-\frac{m}{f-i}\right)\right)^{T} \tag{10}$$

If we know how many features are missing, Eq. (10) is exact. However, we usually do not know exactly how many features will be missing. At best, we may know the probability $\rho$ that a given feature may be missing. We then have a binomial distribution: naming the event "a given feature is missing" as success, the probability $p_m$ of having $m$ of $f$ features missing in any given instance is the probability of having $m$ successes in $f$ trials of a Bernoulli experiment, characterized by the binomial distribution

$$p_m = \binom{f}{m}\rho^m(1-\rho)^{f-m} \tag{11}$$

The actual probability of finding a useable classifier is then one minus a weighted sum of the probabilities $(1-p)^T$, where the weights are the probabilities of having exactly $m$ features missing, with $m$ varying from 0 (no missing features) to the maximum allowable number of $f-\mathit{nof}$ missing features.

$$P = 1-\sum_{m=0}^{f-\mathit{nof}}\left[\binom{f}{m}\rho^m(1-\rho)^{f-m} \cdot \left(1-\prod_{i=0}^{\mathit{nof}-1}\left(1-\frac{m}{f-i}\right)\right)^{T}\right] \tag{12}$$

Eq. (12) computes the probability of finding at least one useable classifier by subtracting the probability of finding no useable classifier (in an ensemble of $T$ classifiers) from one. A more informative way, however, is to compute the probability of finding $t$ useable classifiers (out of $T$). If finding a single useable classifier is regarded as a success in a Bernoulli experiment with probability of success $p$, then the probability of finding at least $t$ useable classifiers is

$$P(\ge t \text{ useable classifiers when } m \text{ features are missing}) = p_m^{\ge t} = \sum_{\tau=t}^{T}\binom{T}{\tau} p^{\tau}(1-p)^{T-\tau} \tag{13}$$

Note that Eq. (13) still needs to be weighted with the (binomial) probabilities of "having exactly $m$ missing features", and then summed over all values of $m$ to yield the desired probability

$$P = \sum_{m=0}^{f-\mathit{nof}}\binom{f}{m}\rho^m(1-\rho)^{f-m} \cdot p_m^{\ge t} \tag{14}$$

By using the appropriate expressions from Eqs. (7) and (13) in Eq. (14), we obtain

$$P = \sum_{m=0}^{f-\mathit{nof}}\left[\binom{f}{m}\rho^m(1-\rho)^{f-m} \cdot \sum_{\tau=t}^{T}\binom{T}{\tau}\left(\prod_{i=0}^{\mathit{nof}-1}\left(1-\frac{m}{f-i}\right)\right)^{\tau}\left(1-\prod_{i=0}^{\mathit{nof}-1}\left(1-\frac{m}{f-i}\right)\right)^{T-\tau}\right] \tag{15}$$

Eq. (15), when evaluated for $t = 1$, is the probability of finding at least one useable classifier trained on $\mathit{nof}$ features, from a pool of $T$ classifiers, when each feature has a probability $\rho$ of being missing. For $t = 1$, Eqs. (10) and (13) are identical. However, Eq. (13) allows computation of the probability of finding any number of useable classifiers (say, $t_{desired}$, not just one) simply by starting the summation from $t = t_{desired}$. These equations can help us choose the algorithm's free parameters, $\mathit{nof}$ and $T$, to obtain a reasonable confidence that we can process most instances with missing features.

As an example, consider a 30-feature dataset. First, let us fix the probability that a feature is missing at $\rho = 0.05$. Then, on average, 1.5 features are missing from each instance. Fig. 3(a) shows the probability of finding at least one classifier for various values of $\mathit{nof}$ as a function of $T$. Fig. 3(b) provides similar plots for a higher rate of $\rho = 0.20$ (20% missing features). The solid curve in each case is the curve obtained by Eq. (15), whereas the various indicators represent the actual probabilities obtained as a result of 1000 simulations. We observe that (i) the experimental data fits the theoretical curve very well; (ii) the probability of finding a useable classifier increases with the ensemble size; and (iii) for a fixed $\rho$ and a fixed ensemble size, the probability of finding a useable classifier increases as $\mathit{nof}$ decreases. This latter observation makes sense, as the fewer the features used in training, the larger the number of missing features that can be accommodated. In fact, if a classifier is trained with $\mathit{nof}$ features out of a total of $f$ features, then that classifier can accommodate up to $f-\mathit{nof}$ missing features.

In Fig. 3(c), we now fix $\mathit{nof} = 9$ (30% of the total feature size), and calculate the probability of finding at least one useable classifier as a function of ensemble size, for various values of $\rho$. As expected, for a fixed value of $\mathit{nof}$ and ensemble size, this probability decreases as $\rho$ increases.

In order to see how these results scale with larger feature sets, Fig. 4 shows the theoretical plots for the 216-feature multiple features (MFEAT) dataset [46]. All trends mentioned above can be observed in this case as well; however, the required ensemble size is now in the thousands. An ensemble with a few thousand classifiers is well within today's computational resources, and in fact not uncommon in many multiple classifier system applications. It is clear, however, that as the feature size grows beyond the
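The probabilities derived in Eqs. (7), (10), (13) and (15) are straightforward to evaluate numerically when choosing $\mathit{nof}$ and $T$. The following is a minimal sketch, assuming the independence of feature selections that the analysis itself assumes ($\beta = 1$); the function names are hypothetical, and the binomial tail of Eq. (13) is computed through its complementary sum, which is mathematically equivalent and keeps the arithmetic small when $t$ is small.

```python
from math import comb

def p_useable(f, m, nof):
    """Eq. (7): probability that a single classifier avoids all m missing features."""
    p = 1.0
    for i in range(nof):
        p *= max(0.0, 1.0 - m / (f - i))  # a factor hits 0 once m > f - nof
    return p

def p_at_least_one(f, m, nof, T):
    """Eq. (10): at least one of T classifiers is useable, with m known."""
    return 1.0 - (1.0 - p_useable(f, m, nof)) ** T

def p_at_least_t(f, m, nof, T, t=1):
    """Eq. (13): at least t of T classifiers are useable, with m known.

    Computed as 1 minus the probability of fewer than t successes.
    """
    p = p_useable(f, m, nof)
    fewer = sum(comb(T, k) * p ** k * (1.0 - p) ** (T - k) for k in range(t))
    return 1.0 - fewer

def p_overall(f, rho, nof, T, t=1):
    """Eq. (15): Eq. (13) weighted by the binomial probability of m missing features."""
    return sum(comb(f, m) * rho ** m * (1.0 - rho) ** (f - m)
               * p_at_least_t(f, m, nof, T, t)
               for m in range(f - nof + 1))

if __name__ == "__main__":
    # Settings of the 30-feature example in the text: f = 30, nof = 9.
    for T in (10, 50, 100, 200):
        print(T, p_overall(30, 0.05, 9, T), p_overall(30, 0.20, 9, T))
```

Scanning $T$ for a fixed $\mathit{nof}$ and $\rho$ in this way gives one way to pick an ensemble size that reaches a desired confidence of finding a useable classifier.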
thousands range, the computational burden of this approach greatly increases. Therefore, Learn++.MF is best suited for applications with moderately high dimensionality (fewer than 1000). This is not an overly restrictive requirement, since many applications have typically much smaller dimensionalities.

Fig. 3. Probability of finding at least one useable classifier for $f = 30$: (a) $\rho = 0.05$, variable $\mathit{nof}$; (b) $\rho = 0.2$, variable $\mathit{nof}$; and (c) $\mathit{nof} = 9$, variable $\rho$ ($\rho$ = 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40). Horizontal axis: number of classifiers in the ensemble; vertical axis: probability of finding at least one useable classifier.

4. Simulation results

4.1. Experimental setup and overview of results

Learn++.MF was evaluated on several databases with various number of features. The algorithm resulted in similar trends in all cases, which we discuss in detail below. Due to the amount of detail involved with each dataset, as well as for brevity and to avoid repetition of similar outcomes, we present representative full results on four datasets here, and provide results for many additional real world and UCI benchmark datasets online [48], a summary of which is provided at the end of this section. The four datasets featured here encompass a broad range of feature and database sizes. These datasets are the Wine (13 features), Wisconsin Breast Cancer (WBC, 30 features), Optical Character Recognition (OCR, 62 features) and Multiple Features (MFEAT, 216 features) datasets obtained from the UCI repository [46]. In all cases, multilayer perceptron type neural networks were used as the base classifier, though any supervised classifier can also be used. The training parameters of the base classifiers were not fine tuned, and selected as fixed reasonable values (error goal = 0.001–0.01, number of hidden layer nodes = 20–30 based on the dataset). Our goal is not to show the absolute best classification
performances (which are quite good), but rather the effect of the missing features and strategic choice of the algorithm parameters.

Fig. 4. Probability of finding at least one useable classifier for f = 216: (a) ρ = 0.15, variable nof and (b) nof = 40, variable ρ. [Plots not reproduced; each panel shows this probability against the number of classifiers in the ensemble.]

We define the global number of features in a dataset as the product of the number of features f and the number of instances N, and a single test trial as evaluating the algorithm with 0–30% (in steps of 2.5%) of the global number of features missing from the test data (typically half of the total data size). We evaluate the algorithm with different numbers of features, nof, used in training individual classifiers, including a variable nof, where the number of features used to train each classifier is determined by randomly drawing an integer in the 15–50% range of f. All results are averages of 100 trials on test data, reported with 95% confidence intervals, where training and test sets were randomly reshuffled for each trial. Missing features were simulated by randomly replacing actual values with sentinels.

Table 1 summarizes datasets and parameters used in the experiments: the cardinality of the dataset, the number of classes, total number of features f, the specific values of nof used in simulations, and the total number of classifiers T generated.

Table 1
Datasets used in this study.

Dataset    Dataset size   Classes   Features   nof1   nof2   nof3   nof4   nof5   nof6       T
WINE       178            3         13         3      4      5      6      7      variable   200
WBC        400            2         30         10     12     14     16     –      variable   1000
OCR        5620           10        62         16     20     24     –      –      variable   1000
MFEAT      2000           10        216        10     20     30     40     60     variable   2000
VOC-I*     384            6         6          2      3      –      –      –      variable   100
VOC-II*    84             12        12         3      4      5      6      –      variable   200
ION*       270            2         34         8      10     12     14     –      variable   1000
WATER*     374            4         38         12     14     16     18     –      variable   1000
ECOLI*     347            5         5          2      3      –      –      –      variable   1000
DERMA*     366            6         34         8      10     12     14     –      variable   1000
PEN*       10,992         10        16         6      7      8      9      –      variable   250

* The full results for these datasets are provided online due to space considerations and may be found at [48].

The behavior of the algorithm with respect to its parameters was very consistent across the databases. The two primary observations, discussed below in detail, were as follows: (i) the algorithm can handle data with missing values, 20% or more, with little or no performance drop, compared to classifying data with all values intact; and (ii) the choice of nof presents a critical trade-off: higher performance can be obtained using a larger nof when fewer features are missing, yet a smaller nof yields higher and more stable performance, and leaves fewer instances unprocessed, when larger portions of the data are missing.

4.2. Wine database

The Wine database comes from the chemical analysis of 13 constituents in wines obtained from three cultivars. Previous experiments using optimized classifiers trained with all 13 features, and evaluated on a test dataset with no missing features, had zero classification error, setting the benchmark target performance for this database. Six values of nof were considered: 3, 4, 5, 6, 7 and variable, where each classifier was trained not with a fixed number of features, but with a random selection of one of these values.
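The test protocol described in Section 4.1 (random feature subsets, a variable nof, and randomly masked test values) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, NaN is assumed as the sentinel, the rounding of the 15–50% range is an assumption, and training of the base classifiers is omitted.

```python
import numpy as np

def draw_feature_subsets(f, T, nof=None, rng=None):
    """Return T sorted feature-index arrays.

    If nof is None, each subset size is drawn uniformly from the
    15-50% range of f (the rounding used here is an assumption).
    """
    rng = np.random.default_rng(rng)
    lo = max(1, int(np.ceil(0.15 * f)))
    hi = max(lo, int(np.floor(0.50 * f)))
    subsets = []
    for _ in range(T):
        size = nof if nof is not None else int(rng.integers(lo, hi + 1))
        subsets.append(np.sort(rng.choice(f, size=size, replace=False)))
    return subsets

def mask_features(X, pmf, rng=None):
    """Replace a fraction pmf of all N*f entries with NaN (assumed sentinel)."""
    rng = np.random.default_rng(rng)
    X = np.array(X, dtype=float, copy=True)
    n_missing = int(round(pmf * X.size))
    idx = rng.choice(X.size, size=n_missing, replace=False)
    X.flat[idx] = np.nan
    return X

def usability(X_missing, subsets):
    """U[i, t] is True when none of classifier t's features is missing in instance i."""
    missing = np.isnan(X_missing)
    return np.stack([~missing[:, s].any(axis=1) for s in subsets], axis=1)

def pip_and_puc(U):
    """PIP: % of instances with at least one usable classifier; PUC: mean % usable."""
    return 100.0 * U.any(axis=1).mean(), 100.0 * U.mean()
```

For a dataset with f = 13, calling `usability(mask_features(X_test, 0.20), draw_feature_subsets(13, 200))` and passing the result to `pip_and_puc` gives the two percentages for one trial; averaging over repeated trials would mirror the reporting described above.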
Table 2
Performance on the WINE database (Ensemble Perf: % correct ± 95% confidence interval; PMF: % missing features; PIP: % of instances processed).

          nof = 3 (out of 13)      nof = 4 (out of 13)      nof = 5 (out of 13)
PMF       Ensemble Perf    PIP     Ensemble Perf    PIP     Ensemble Perf    PIP
0.00      100.00 ± 0.00    100     100.00 ± 0.00    100     100.00 ± 0.00    100
2.50      100.00 ± 0.00    100     99.67 ± 0.71     100     100.00 ± 0.00    100
5.00      100.00 ± 0.00    100     100.00 ± 0.00    100     99.67 ± 0.71     100
7.50      99.67 ± 0.71     100     99.67 ± 0.71     100     99.33 ± 1.43     100
10.00     99.33 ± 0.95     100     99.67 ± 0.71     100     99.00 ± 1.09     100
15.00     99.00 ± 1.09     100     98.00 ± 1.58     100     98.33 ± 1.60     100
20.00     99.00 ± 1.09     100     98.00 ± 1.91     100     96.67 ± 1.85     100
25.00     99.00 ± 1.09     100     97.66 ± 1.53     100     96.29 ± 1.31     99
30.00     96.33 ± 1.98     100     94.29 ± 1.54     99      95.22 ± 2.26     98

          nof = 6 (out of 13)      nof = 7 (out of 13)      nof = variable (3–7 out of 13)
PMF       Ensemble Perf    PIP     Ensemble Perf    PIP     Ensemble Perf    PIP
0.00      100.00 ± 0.00    100     100.00 ± 0.00    100     100.00 ± 0.00    100
2.50      99.67 ± 0.71     100     99.33 ± 0.95     100     99.33 ± 0.95     100
5.00      99.33 ± 0.95     100     99.33 ± 0.95     100     100.00 ± 0.00    100
7.50      98.67 ± 1.58     100     98.67 ± 1.17     100     99.67 ± 0.71     100
10.00     98.33 ± 2.20     100     98.32 ± 1.60     100     99.67 ± 0.74     100
15.00     96.98 ± 2.00     99      96.94 ± 2.30     99      97.67 ± 1.53     100
20.00     96.98 ± 2.27     99      91.27 ± 3.55     96      98.00 ± 1.17     100
25.00     93.72 ± 2.66     95      93.03 ± 4.35     86      97.33 ± 1.78     100
30.00     95.29 ± 2.63     93      91.34 ± 4.20     78      94.67 ± 2.18     100

Table 2 summarizes the test performances using all values of nof, and the percentage of instances that could be processed (correctly or otherwise) with the existing ensemble, as described below. The first row, with 0.0% missing features, is the algorithm's performance when classifiers were trained on nof features but evaluated on fully intact test data. The algorithm is able to obtain 100% correct classification with any of the nof values used, indicating that this dataset does in fact include redundant features.

Now, since the feature subsets are selected at random, it is possible that the particular set of features available for a given instance does not match the feature combinations selected by any of the classifiers. Such an instance cannot be processed, as no classifier would have been trained using only features that are available in that instance. The column with the heading PIP (percentage of instances processed) indicates the percentage of instances that can be processed by the generated ensemble. As expected, PIP decreases as the percentage of missing features (PMF) increases. We note that with 30% of the features missing from the test data, 100% of this test dataset can be processed (at a performance of 96.3%) when nof = 3, whereas only 78% can be processed (at a performance of 91.3%) when nof = 7 features are used for training the classifiers. A few additional observations: first, we note that there is no (statistically significant) performance drop with up to 25% of features missing when nof = 3, or up to 10% when nof = 6, and a drop of only a few percentage points (3.5% for nof = 3, 4.7% for nof = 6) when as many as 30% of the features are missing; hence the algorithm can handle large amounts of missing data with relatively little drop in classification performance. Second, virtually all test data can be processed, even when as much as 30% of the features are missing, if we use relatively few features for training the classifiers.

For a more in-depth analysis of the algorithm's behavior, consider the plots in Fig. 5. Fig. 5(a) illustrates the overall performance of the Learn++.MF ensemble. All performances in the 0–30% interval of PMF with increments of 2.5% are provided (while the tables skip some intermediate values for brevity). As expected, the performance declines as PMF increases, albeit only gradually. The decline is much milder for nof = 3 than it is for nof = 7, the reasons for which are discussed below. Also included in Fig. 5(a), for comparison, is the performance of a single classifier (of the same architecture as ensemble members) that employs the commonly used mean imputation (M.I.) to replace the missing data, as well as that of the E.M. algorithm [47]. We observe that the Learn++.MF ensemble (significantly) outperforms both E.M. and mean imputation for all values of PMF, an observation that was consistent for all databases we have evaluated (we compare Learn++.MF to naïve Bayes and RSM with mean imputation later in the paper). Fig. 5(b) shows the performance of a single usable classifier, averaged over all T classifiers, as a function of PMF. The average performance of any fixed single classifier is, as expected, independent of PMF; however, higher nof values are generally associated with better individual classifier performance. This makes sense, since there is more information available to each classifier with a larger nof. However, this comes at the cost of smaller PIP. In Fig. 5(b) and (c) we compare the performance and the PIP of a single classifier vs. the ensemble. For example, consider 20% PMF, for which the corresponding PIP values for single classifiers are shown in Fig. 5(b). A single usable classifier trained with nof = 3 features is able to classify, on average, 51% of instances with a performance of 79%, whereas the ensemble has a 100% PIP with a performance of 99%. Fig. 5(c) shows PIP as a function of PMF, both for single classifiers (lower curves) and the ensembles (upper curves). As expected, PIP decreases with increasing PMF for both the ensemble and a single usable classifier. However, the decline in PIP is much steeper for single classifiers. In fact, there is virtually no drop (PIP = 100%) up until 20% PMF when the ensemble is used (regardless of nof). Hence the impact of using an ensemble is twofold: the ensemble can process a larger percentage of instances with missing features, and it also provides a higher classification performance on those instances, compared to a single classifier.

Also note in Fig. 5(c) that the decline in PIP is much steeper with nof = 7 than it is for nof = 3, similar to the decline in classification performance shown in Fig. 5(a). Hence, our main observation regarding the trade-off one must accept by choosing a specific nof value is as follows: classifiers trained with a larger nof may achieve a higher generalization performance individually, or initially when fewer features are missing, but are able to classify fewer instances as the PMF increases. This observation can be explained as follows: using large nof for training
(e.g., 10 out of 13) means that the same large number of features (10) are required for testing. Therefore, fewer missing features (only 3, in this case) can be accommodated. In fact, the probability of no classifier being available for a given combination of missing features increases with nof. Of course, if all features are used for training, then no classifier will be able to process any instance with even a single missing feature, hence the original motivation of this study.

Fig. 5. Detailed analysis of the algorithm on the WINE database with respect to PMF and nof: (a) Learn++.MF ensemble performances compared to that of mean imputation, (b) single classifier performance, (c) single classifier and ensemble PIP and (d) percentage of useable classifiers in the ensemble. [Plots not reproduced; all panels are plotted against % missing features (PMF), with curves for 3, 4, 5, 6 and 7 features, variable nof, M.I. and E.M. Panel (b) is annotated with single-classifier PIP values at 20% PMF: 51%, 41%, 37%, 33%, 26% and 20%.]

Finally, Fig. 5(d) analyzes the average percentage of usable classifiers (PUC) per given instance, with respect to PMF. The decline in PUC also makes sense: there are fewer classifiers available to classify instances that have a higher PMF. Furthermore, this decline is also steeper for higher nof.

In summary, then, a larger nof usually provides a better individual and/or initial performance than a smaller nof, but the ensemble is more resistant to unusable feature combinations when trained with a smaller nof. How do we, then, choose the nof value, since we cannot know PMF in future datasets? A simple solution is to use a variable number of features, typically in the range of 15–50% of the total number of features. We observe from Table 2 and Fig. 5 that such a variable selection provides surprisingly good performance: a good compromise with few or no instances left unprocessed.

As for the other parameter, T, the number of classifiers, the trade-off is simply a matter of computational resources. A larger PIP and PUC will be obtained if a larger number of classifiers is trained. Fig. 5(d) shows PUC for various values of PMF, indicating the actual numbers obtained for this database after the classifiers were trained. A family of curves generated using Eq. (15), such as those in Fig. 3, as well as Monte Carlo simulations, can provide us with these numbers before training the classifiers. Fig. 6 shows the results of averaging 1000 such Monte Carlo simulations for f = 13, nof = 3–7, PMF [0–100]%, which agree with the actual results of Fig. 5(d) (where up to 30% PMF is shown).

4.3. Wisconsin breast cancer (WBC) database

The WBC database consists of 569 instances of cell biopsies classified as malignant or benign based on 30 features obtained from the individual cells. The expected best performance from this dataset (when no features are missing) is 97% [46]. Five nof values, 10, 12, 14, 16 and variable, were tested.

Table 3 summarizes the test performances using all values of nof, and the PIP values, similar to those of the Wine database in Table 2. The results with this database, which has a larger number of features, show much the same trends and characteristics mentioned for the Wine database. Specifically, the proximity of the ensemble performance when used with nof = 10, 12, 14 and 16 features with no missing features to the 97% figure reported in the literature indicates that this database also has a redundant feature set. Also, as in the Wine dataset, the algorithm is able to handle up to 20% missing features with little or no (statistically) significant performance drop, and only a gradual decline thereafter. As expected, the PIP drops with increasing PMF, and the drop is steeper for larger nof values compared to smaller ones. Using a variable number of features provides a good trade-off. These characteristics can also be seen in Fig. 7, where we plot the ensemble performance, and PIP for single and ensemble classifiers, as in Fig. 5 for the Wine database. Fig. 7(a) also shows that the ensemble outperforms a single classifier (of identical architecture) that uses mean imputation to replace the missing features. The
Fig. 6. Estimating the required ensemble size ahead of time. [Plot not reproduced; it shows the expected percentage of useable classifiers in the ensemble against the probability of any given feature being missing (0 to 1), with curves for nof = 3, 4, 5, 6 and 7.]

Fig. 7. Analysis of the algorithm on the WBC database with respect to PMF and nof: (a) Learn++.MF ensemble performances compared to that of mean imputation and (b) single classifier and ensemble PIP. [Plots not reproduced; both panels are plotted against % missing features (PMF), with curves for 10, 12, 14 and 16 features, variable nof and M.I.]

Table 3
Performance on the WBC database (Ensemble Perf: % correct ± 95% confidence interval; PMF: % missing features; PIP: % of instances processed).

          nof = 10 (out of 30)     nof = 12 (out of 30)     nof = 14 (out of 30)
PMF       Ensemble Perf    PIP     Ensemble Perf    PIP     Ensemble Perf    PIP
0.00      95.00 ± 0.00     100     96.00 ± 0.00     100     95.00 ± 0.00     100
2.50      95.20 ± 0.24     100     95.80 ± 0.18     100     94.75 ± 0.24     100
5.00      95.05 ± 0.30     100     95.55 ± 0.41     100     94.80 ± 0.33     100
7.50      95.15 ± 0.42     100     95.55 ± 0.11     100     94.55 ± 0.56     100
10.00     95.55 ± 0.30     100     95.34 ± 0.40     100     94.39 ± 0.31     100
15.00     95.10 ± 0.69     100     94.88 ± 0.50     100     93.84 ± 0.95     98
20.00     94.63 ± 0.77     100     93.08 ± 1.10     97      92.35 ± 1.08     91
25.00     94.53 ± 0.81     98      92.78 ± 0.74     91      91.42 ± 1.20     76
30.00     92.78 ± 1.39     92      89.87 ± 2.02     75      89.01 ± 1.73     55

          nof = 16 (out of 30)     nof = variable (10–16 out of 30)
PMF       Ensemble Perf    PIP     Ensemble Perf    PIP
0.00      95.00 ± 0.00     100     95.50 ± 0.00     100
2.50      94.80 ± 0.29     100     95.45 ± 0.19     100
5.00      94.75 ± 0.37     100     95.30 ± 0.43     100
7.50      95.04 ± 0.52     100     95.30 ± 0.46     100
10.00     94.24 ± 0.48     99      95.15 ± 0.39     100
15.00     92.50 ± 0.98     94      94.79 ± 0.85     100
20.00     90.20 ± 1.10     79      94.90 ± 0.79     97
25.00     86.98 ± 2.45     56      91.61 ± 1.23     91
30.00     86.34 ± 3.17     34      91.13 ± 1.32     78
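The kind of Monte Carlo estimate summarized in Fig. 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: each feature is assumed to be missing independently with probability ρ, feature subsets are assumed to be drawn uniformly at random, and the function names are hypothetical. The closed-form expressions follow from those assumptions and are not claimed to reproduce Eq. (15) exactly.

```python
import numpy as np

def p_usable(rho, nof):
    """Probability that none of a classifier's nof features is missing,
    assuming independent missingness with per-feature probability rho."""
    return (1.0 - rho) ** nof

def p_at_least_one(rho, nof, T):
    """Probability that at least one of T classifiers is usable, additionally
    assuming the classifiers' usabilities are independent of one another."""
    return 1.0 - (1.0 - p_usable(rho, nof)) ** T

def monte_carlo_puc(f, nof, rho, T=200, trials=1000, rng=None):
    """Estimate the expected percentage of usable classifiers by simulation."""
    rng = np.random.default_rng(rng)
    # One random feature subset per classifier, shape (T, nof).
    subsets = np.stack([rng.choice(f, size=nof, replace=False) for _ in range(T)])
    # One simulated missing-feature pattern per trial, shape (trials, f).
    missing = rng.random((trials, f)) < rho
    # usable[i, t]: classifier t needs no feature that is missing in trial i.
    usable = ~missing[:, subsets].any(axis=2)
    return 100.0 * usable.mean()
```

Under these assumptions, `monte_carlo_puc(13, 3, 0.2)` should land close to `100 * p_usable(0.2, 3)`, and sweeping `rho` from 0 to 1 for nof = 3 to 7 would produce a family of curves of the kind Fig. 6 describes.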