ICDM03_SNB.doc
A Kernel-Based Semi-Naive Bayesian Classifier Using P-Trees¹

Anne Denton, William Perrizo
Department of Computer Science, North Dakota State University
Fargo, ND 58105-5164, USA
{anne.denton, william.perrizo}@ndsu.nodak.edu

Abstract

A novel semi-naive Bayesian classifier is introduced that is particularly suitable to data with many attributes. The naive Bayesian classifier is taken as a starting point and correlations are reduced through joining of highly correlated attributes. Our technique differs from related work in its use of kernel functions that systematically include continuous attributes rather than relying on discretization as a preprocessing step. This retains distance information within the attribute domains and ensures that attributes are joined based on their correlation for the particular values of the test sample. We implement a lazy semi-naive Bayesian classifier using P-trees and demonstrate that it generally outperforms the naive Bayesian classifier.

1. INTRODUCTION

One of the main challenges in data mining is handling data with many attributes. The volume of the space that is spanned by all attributes grows exponentially with the number of attributes, and the density of training points decreases accordingly. This phenomenon is also termed the curse of dimensionality [1]. Many current problems, such as DNA sequence analysis and text analysis, suffer from it (see the "spam" data set from [2] discussed below). Classification techniques that estimate a class label based on the density of training points, such as nearest-neighbor classifiers, have to base their decision on far-distant neighbors. Decision tree algorithms can only use a few attributes before the small number of data points in each branch makes results insignificant; the potential predictive power of all other attributes is then lost. A classifier that suffers relatively little from high dimensionality is the naive Bayesian classifier.
Despite its name and its reputation of being trivial, it can lead to surprisingly high accuracy, even in cases in which the assumption of independence of attributes, on which it is based, is no longer valid [3]. Other classifiers, namely generalized additive models [4], have been developed that make similar use of the predictive power of large numbers of

¹ Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.
attributes and improve on the naive Bayesian classifier. These classifiers require an optimization procedure that is computationally unacceptable for many data mining applications. Classifiers that require an eager training or optimization step are particularly ill suited to settings in which the training data changes continuously. Such a situation is common for sliding-window approaches in data streams, in which old data is discarded at the rate at which new data arrives [5]. We introduce a lazy classifier that does not require a training phase. Data that is no longer considered accurate can be eliminated or replaced without the need to recreate a classifier or redo an optimization procedure.

Our classifier improves on the accuracy of the naive Bayesian classifier by treating strongly correlated attributes as one. Approaches that aim at improving the validity of the naive assumption through joining of attributes are commonly referred to as semi-naive Bayesian classifiers [6-8]. Kononenko originally proposed this idea [6], and Pazzani [7] more recently evaluated Cartesian product attributes in a wrapper approach. In previous work continuous attributes were intervalized as a preprocessing step, significantly limiting the usefulness of the classifier for continuous data. Other classifiers that improve on the naive Bayesian classifier include Bayesian network and augmented Bayesian classifiers [9-11]. These classifiers determine a tree- or network-like structure between attributes to account for correlations. Such classifiers commonly assume that correlations are determined for attributes as a whole, but generalizations that consider the specific instance are also discussed [9]. They do, however, all discretize continuous attributes, and thereby lose distance information within attributes. Our approach is founded on a more general definition of the naive Bayesian classifier that involves kernel density estimators to compute probabilities [4].
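The contrast between the two treatments of a continuous attribute can be sketched in a few lines. This is our own toy illustration, not the paper's implementation: a kernel density estimate averages Gaussian kernels centered at the training points, while the traditional alternative fits a single Gaussian to them.

```python
import math

def gauss(u, sigma):
    # Gaussian density of deviation u with width sigma.
    return math.exp(-u * u / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)

def kde(x, points, sigma):
    # One-dimensional kernel density estimate: average of Gaussian
    # kernels centered at the training points.
    return sum(gauss(x - p, sigma) for p in points) / len(points)

def fitted_gaussian(x, points):
    # Traditional naive Bayesian alternative: a single Gaussian with
    # the sample mean and standard deviation of the training points.
    mu = sum(points) / len(points)
    var = sum((p - mu) ** 2 for p in points) / len(points)
    return gauss(x - mu, math.sqrt(var))

pts = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]   # two well-separated clusters
```

On this sample the kernel estimate stays high near both clusters, whereas the fitted Gaussian has its single peak between them, where no training data lies. This is exactly the situation the paper illustrates in its Figure 1.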
We introduce a kernel-based correlation function and join attributes when the value of the correlation function at the location of the test sample exceeds a predefined threshold. We show that in this framework it is sufficient to define the kernel function of joined attributes without stating the shape of the joined attributes themselves. No information is lost in the joining process. The benefits of a kernel-based definition of density estimators are thereby fully extended to the elimination of correlations. In contrast to most other techniques, attributes are only joined if their values are correlated at the location of the test sample. An example of the impact of a local correlation definition could be the classification of e-mail messages based on author age and message length. These two attributes are probably highly correlated if the author is a young child, i.e., they should be joined if the age and message length of the test sample are both very small. For other age groups there is probably little basis for considering those attributes combined.

Evaluation of kernel functions requires the fast computation of counts, i.e., of the number of records that satisfy a given condition. We use compressed, bit-column-oriented
data structures, namely P-trees, to represent the data [12-16]. This allows us to efficiently evaluate the number of data points that satisfy particular conditions on the attributes. Speed is further increased through a distance function that leads to intervals for which counts can be calculated particularly easily, namely the HOBbit distance [12]. Intervals are weighted according to a Gaussian function.

Section 2 presents the concept of kernel density estimation in the context of Bayes' theorem and the naive Bayesian classifier. Section 2.1 defines correlations, section 2.2 summarizes P-tree properties, section 2.3 shows how the HOBbit distance can be used in kernel functions, and section 2.4 combines the components into the full algorithm. Section 3 presents our results, with 3.1 introducing the data sets and 3.2-3.4 discussing accuracy. Section 3.5 demonstrates the performance of our algorithm, and section 4 concludes the paper.

2. NAIVE AND SEMI-NAIVE BAYESIAN CLASSIFIER USING KERNEL DENSITY ESTIMATION

Bayes' theorem is used in different contexts. In what is called Bayesian estimation in statistics, Bayes' theorem allows consistent use of prior knowledge of the data, by transforming a prior into a posterior distribution through use of empirical knowledge of the data [1,4]. Using the simplest assumption of a constant prior distribution, Bayes' theorem leads to a straightforward relationship between conditional probabilities. Given a class label C with m classes c_1, c_2, ..., c_m and an attribute vector x of all other attributes, the conditional probability of class label c_i can be expressed as follows

    P(C = c_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C = c_i) \, P(C = c_i)}{P(\mathbf{x})}    (1)

P(C = c_i) is the probability of class label c_i and can be estimated from the data directly. The probability of a particular unknown sample, P(x), does not have to be calculated: it does not depend on the class label, so the class with the highest probability can be determined without it.
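Because the denominator of (1) is the same for every class, a classifier only needs the numerator. A minimal sketch of that decision rule (the function names and the toy likelihood below are ours, for illustration only):

```python
from collections import Counter

def predict(x, training, class_of, likelihood):
    # Choose the class maximizing P(x | C = c) * P(C = c); P(x) in (1) cancels.
    # `likelihood(x, c)` stands in for any estimate of P(x | C = c),
    # e.g. a kernel density estimate f_c(x).
    labels = [class_of(t) for t in training]
    prior = {c: n / len(labels) for c, n in Counter(labels).items()}
    return max(prior, key=lambda c: likelihood(x, c) * prior[c])

# Toy usage: samples are their own labels; the likelihood favors matching classes.
predict(0, [0, 0, 1], lambda t: t, lambda x, c: 1.0 if c == x else 0.2)   # → 0
```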
Probabilities for discrete attributes can be calculated based on frequencies of occurrences. For continuous attributes different alternatives exist. We use a representation that is used by the statistics community and that is based on one-dimensional kernel density estimates as discussed in [4]. The conditional probability P(x | C = c_i) can be written as a kernel density estimate for class c_i

    P(\mathbf{x} \mid C = c_i) = f_i(\mathbf{x})    (2)

with
    f_i(\mathbf{x}) = \frac{1}{N} \sum_{t=1}^{N} K_i(\mathbf{x}, \mathbf{x}_t)    (3)

where x_t are training points and K_i(x, x_t) is a kernel function. Estimating this quantity from the data directly would be equivalent to density-based estimation [1], which is also called Parzen window classification [4], or, for HOBbit distance-based classification, Podium classification [13]. The naive Bayesian model assumes that for a given class the probabilities for individual attributes are independent

    P(\mathbf{x} \mid C = c_i) = \prod_{k=1}^{M} P(x^{(k)} \mid C = c_i)    (4)

where x^{(k)} is the k-th attribute of a total of M attributes. The conditional probability P(x^{(k)} | C = c_i) can, for categorical attributes, simply be derived from the sample proportions. For numerical attributes two alternatives exist to the kernel-based representation used in this paper. Intervalization leads to the same formalism as categorical attributes and does not need to be treated separately. The traditional naive Bayesian classifier estimates probabilities by an approximation of the data through a function, such as a Gaussian distribution

    P(x^{(k)} \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(x^{(k)} - \mu_i)^2}{2\sigma_i^2}\right)    (5)

where \mu_i is the mean of the values of attribute x^{(k)} averaged over training points with class label c_i, and \sigma_i is the standard deviation. We use a one-dimensional kernel density estimate that follows naturally from (2)

    f_i(\mathbf{x}) = \prod_{k=1}^{M} f_i^{(k)}(x^{(k)}) = \frac{1}{N^M} \prod_{k=1}^{M} \left( \sum_{t=1}^{N} K_i^{(k)}(x^{(k)}, x_t^{(k)}) \right)    (6)

where the one-dimensional Gaussian kernel function is given by

    K_{i\,\mathrm{Gauss}}^{(k)}(x^{(k)}, x_t^{(k)}) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{(x^{(k)} - x_t^{(k)})^2}{2\sigma_k^2}\right) [C = c_i]    (7)

with \sigma_k being the standard deviation of attribute k. Note the difference from the kernel
density estimate even in the case of a Gaussian kernel function. The Gaussian kernel function evaluates the density of training points within a range of the sample point x. This does not imply that the distribution of weighted training points must be Gaussian. We show that the density-based approximation is commonly more accurate. A toy configuration can be seen in Figure 1, in which the Gaussian kernel density estimate has two peaks whereas the Gaussian distribution function has, as always, one peak only. The data points are represented by vertical lines.

Figure 1: Comparison between Gaussian distribution function and kernel density estimate.

Categorical attributes can be discussed within the same framework. The kernel function for categorical attributes is

    K_{i\,\mathrm{Cat}}^{(k)}(x^{(k)}, x_t^{(k)}) = \frac{1}{N_i} \, [x^{(k)} = x_t^{(k)}] \, [C = c_i]    (8)

where N_i is the number of training points with class label c_i. [\pi] indicates that the term is 1 if the predicate \pi is true and 0 if it is false; we use this notation throughout the paper. The kernel density for categorical attributes is identical to the conditional probability

    f_{i\,\mathrm{Cat}}(x^{(k)}) = \sum_{t=1}^{N} K_{i\,\mathrm{Cat}}^{(k)}(x^{(k)}, x_t^{(k)}) = \sum_{t=1}^{N} \frac{1}{N_i} \, [x^{(k)} = x_t^{(k)}] \, [C = c_i] = P(x^{(k)} \mid C = c_i)    (9)

2.1. Correlation function of attributes

We will now go beyond the naive Bayesian approximation by joining attributes if
they are highly correlated. Attributes are considered highly correlated if the product assumption in (6) can be shown to be a poor approximation. The validity of the product assumption can be verified for any two attributes a and b individually by calculating the following correlation function

    \mathrm{Corr}(a, b) = N \, \frac{\sum_{t=1}^{N} \prod_{k=a,b} K^{(k)}(x^{(k)}, x_t^{(k)})}{\prod_{k=a,b} \sum_{t=1}^{N} K^{(k)}(x^{(k)}, x_t^{(k)})} - 1    (10)

where the kernel function is a Gaussian function (7) for continuous data and (8) for categorical data. If the product assumption (6) is satisfied, Corr(a, b) is 0 by definition. This means that kernel density estimates can be calculated for the individual attributes a and b independently and combined using the naive Bayesian model. If, however, the product assumption is shown to be poor, i.e., the combined kernel function differs from the product of individual ones, then the two attributes will be considered together. The product assumption is considered poor if the correlation function between the two attributes at the location of the unknown sample (10) exceeds a threshold, typically 0.05-1. The respective attributes are then considered together. The kernel function for joined attributes is

    K^{(a,b)}(x^{(a)}, x^{(b)}, x_t^{(a)}, x_t^{(b)}) = \prod_{k=a,b} K^{(k)}(x^{(k)}, x_t^{(k)})    (11)

With this definition it is possible to work with joined attributes in the same way as with the original attributes. Since all probability calculations are based on kernel functions, there is no need to explicitly state the format of the joined variables themselves. It also becomes clear how multiple attributes can be joined: the kernel function of joined attributes can be used in the correlation calculation (10) without further modification. In fact, it is possible to look at the joining of attributes as successively interchanging the summation and product in (6).
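A direct, unoptimized sketch of the correlation test of eq. (10) for two continuous attributes, using class-independent Gaussian kernels. The helper names and the toy data in the tests are ours, not part of the paper's P-tree implementation:

```python
import math

def gauss_kernel(x, xt, sigma):
    # One-dimensional Gaussian kernel, eq. (7) without the class predicate.
    return math.exp(-(x - xt) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def corr(xa, xb, train_a, train_b, sigma_a, sigma_b):
    """Correlation function of eq. (10) at the test point (xa, xb).

    Returns approximately 0 when the product assumption holds for this
    test point; large positive values signal that attributes a and b
    should be joined for this particular sample.
    """
    n = len(train_a)
    joint = sum(gauss_kernel(xa, a, sigma_a) * gauss_kernel(xb, b, sigma_b)
                for a, b in zip(train_a, train_b))
    marg_a = sum(gauss_kernel(xa, a, sigma_a) for a in train_a)
    marg_b = sum(gauss_kernel(xb, b, sigma_b) for b in train_b)
    return n * joint / (marg_a * marg_b) - 1

corr(0, 0, [0, 0, 1, 1], [0, 1, 0, 1], 1.0, 1.0)   # all combinations occur: ≈ 0
corr(0, 0, [0, 0, 1, 1], [0, 0, 1, 1], 1.0, 1.0)   # duplicated attribute: > 0
```

For the first sample every value combination occurs equally often, so the function returns essentially 0; for the duplicated attribute it is positive, and a sufficiently low threshold would trigger a join at this test point.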
The correlation calculation (10) compares the kernel density estimate at the location of the unknown sample after a potential join with that before a join. If the difference is large, a join is performed. It is important to observe that the criterion for correlation is a function of the
attribute values. If attributes are correlated overall but not for the value of a particular test sample, no join is performed. One may, e.g., expect that in a classification problem of student satisfaction with a course, class size and the ability to hear the instructor are correlated attributes. If the test sample refers to a class size of 200 students this may, however, not be the case, if microphones are used in such large classrooms. This applies to both categorical and continuous attributes. An example could be the attributes hair color and age. It is probably fair to assume that hair color gray should have a high correlation with age, whereas little correlation is to be expected for other hair colors. Attributes should therefore only be joined if the test sample has hair color gray.

Note that the traditional naive Bayesian classifier, in which a distribution function is used for continuous attributes (5), would lead to a correlation function that vanishes identically for two continuous attributes. The sum over all data points that is explicit in (6) is part of the definition of the mean and standard deviation for the traditional naive Bayesian classifier. Without this sum the numerator and denominator in (10) would be equal and the correlation function would vanish by definition. We will now proceed to discuss the P-tree data structure and the HOBbit distance that make the calculation efficient.

2.2. P-Trees

The P-tree data structure was originally developed for spatial data [10] but has been successfully applied in many contexts [14,15]. P-trees store bit columns of the data in sequence to allow compression as well as fast evaluation of counts of records that satisfy a particular condition. Most storage structures for tables store data by rows. Each row must be scanned whenever an algorithm requires information on the number of records that satisfy a particular condition.
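The counting idea can be sketched without the tree compression: store each bit position of an attribute as its own bitmap, then AND bitmaps (or their complements) to count matching records. This toy version, with names of our choosing, uses plain Python integers as uncompressed bitmaps; real P-trees additionally compress runs of identical bits hierarchically.

```python
def bit_columns(values, width):
    """Vertical decomposition: one bitmap (as a Python int) per bit position,
    most significant bit first. Record i occupies bit i of each bitmap."""
    cols = []
    for b in range(width - 1, -1, -1):
        col = 0
        for i, v in enumerate(values):
            if (v >> b) & 1:
                col |= 1 << i
        cols.append(col)
    return cols

def count_equal(cols, value, width, n):
    """Count records with attribute == value via bitwise ANDs:
    the column itself where the bit of `value` is 1, its complement where 0."""
    mask = (1 << n) - 1
    acc = mask
    for b in range(width):
        bit = (value >> (width - 1 - b)) & 1
        acc &= cols[b] if bit else ~cols[b] & mask
    return bin(acc).count("1")

a1 = [12, 12, 3, 7, 12, 5]
cols = bit_columns(a1, 4)
count_equal(cols, 12, 4, len(a1))   # 3 records have A1 = 12
```

The AND of compressed subtrees is where the better-than-O(N) behavior described in the text comes from: a high-level pure-0 node resolves a whole block of records in a single step.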
Such database scans, which scale as O(N), are the basis for many data mining algorithms. We use a column-wise storage in which columns are further broken up into bits. A tree-based structure replaces subtrees that consist entirely of 0 values by a higher-level "pure 0" node, and subtrees that consist entirely of 1 values by higher-level "pure 1" nodes. The number of records that satisfy a particular condition is then evaluated by a bit-wise AND on the compressed bit sequences. Figure 2 illustrates the storage of a table with 2 integer attributes and one Boolean attribute. The number of records with A1 = 12 (i.e., the bit sequence 1100) is evaluated as a bit-wise AND of the two P-trees corresponding to the higher-order bits of A1 and the complements of the two P-trees corresponding to the lower-order bits. This AND operation can be done very efficiently for the first half of the data set, since the single high-level 0-bit already indicates that the condition is not satisfied for any of those records. This is the basis for a
scaling better than O(N) for such operations.

The efficiency of P-tree operations depends strongly on the compression of the bit sequences, and thereby on the ordering of rows. For data that shows inherent continuity, such as spatial or multimedia data, such an ordering can easily be constructed. For spatial data, for example, neighboring pixels will often represent similar color values. Traversing an image in a sequence that keeps close points close will preserve the continuity when moving from the two-dimensional image to the one-dimensional storage representation. An example of a suitable ordering is an ordering according to a space-filling curve such as Peano, or recursive raster, ordering [10]. For time sequences such as data streams the time dimension will commonly show the corresponding continuity.

Figure 2: Storage of tables as hierarchically compressed bit columns

If data shows no natural continuity it may be beneficial to sort it. Sorting according to full attributes is one option, but it can easily be seen that it is likely not the best one. Sorting the table in Figure 2 according to attribute A1 would correspond to sorting according to the first four bits of the bit-wise representation on the right-hand side. That means that the lowest-order bit of A1 affects the sorting more strongly than the highest-order bit of A2. It is, however, likely that the highest-order bit of A2 will be more descriptive of the data, and will correlate more strongly with further attributes, such as A3. We therefore sort according to all highest-order bits first. Figure 2 indicates at the bottom the sequence in which bits are used for sorting. It can be seen that the Peano order traversal of an image corresponds to sorting records according to their spatial coordinates in just this way. We will therefore refer to it as generalized Peano order sorting. Figure 3
shows the relationship between traversal of points in Peano order and sorting according to highest-order bits.

A main difference between generalized Peano order sorting and the Peano order traversal of images lies in the fact that spatial data does not require storing spatial coordinates, since they are redundant. When feature attributes are used for sorting, not all value combinations exist, and the attributes that are used for sorting therefore have to be represented as P-trees. Only integer data types are traversed in Peano order, since the concept of proximity does not apply to categorical data. The relevance of categorical data for the ordering is chosen according to the number of attributes needed to represent it. Figure 3 shows how two numerical attributes are traversed, with crosses representing existing data points. The attribute values are then listed in table format.

Figure 3: Peano order sorting and P-tree construction

The higher-order bit of attribute x is chosen as an example to demonstrate the construction of a P-tree of fan-out 2, i.e., each node has 0 or 2 children. In the P-tree graph, 1 stands for nodes that represent only 1 values, 0 stands for nodes that represent only 0 values, and m ("mixed") stands for nodes that represent a combination of 0 and 1 values. Only "mixed" nodes have children. For other nodes the data sequence can be reconstructed based on the purity information and node level alone. The P-tree example is kept simple for demonstration purposes. The implementation has a fan-out of 16, and uses an array rather than pointers to represent the tree structure.

2.3. HOBbit Distance

The nature of a P-tree-based data representation with its bit-column structure has a strong impact on the kinds of algorithms that will be efficient. P-trees allow easy
evaluation of the number of data points in a neighborhood that can be represented by a single bit pattern. The HOBbit distance has the desired property. It is defined as

    d_{\mathrm{HOBbit}}(x_s^{(k)}, x_t^{(k)}) = \begin{cases} 0 & \text{for } x_s^{(k)} = x_t^{(k)} \\[4pt] \displaystyle \max_{j \ge 0} \left\{ j + 1 : \left\lfloor \frac{x_s^{(k)}}{2^j} \right\rfloor \neq \left\lfloor \frac{x_t^{(k)}}{2^j} \right\rfloor \right\} & \text{for } x_s^{(k)} \neq x_t^{(k)} \end{cases}    (10)

where x_s^{(k)} and x_t^{(k)} are the values of attribute k for points x_s and x_t, and \lfloor \cdot \rfloor denotes the floor function. The HOBbit distance is thereby the number of bits by which two values have to be right-shifted to make them equal. The numbers 32 (100000) and 37 (100101), e.g., have HOBbit distance 3 because only the first three digits are equal, and the numbers consequently have to be right-shifted by 3 bits. The first three bits (100) define the neighborhood of 32 (or 37) with a HOBbit distance of no more than 3.

We would like to approximate functions that are defined for Euclidean distances by the HOBbit distance. The exponential HOBbit distance corresponds to the average Euclidean distance of all values within a neighborhood of a particular HOBbit distance

    d_{\mathrm{EH}}(x_s^{(k)}, x_t^{(k)}) = \begin{cases} 0 & \text{for } x_s^{(k)} = x_t^{(k)} \\[4pt] 2^{\,d_{\mathrm{HOBbit}}(x_s^{(k)}, x_t^{(k)}) - 1} & \text{for } x_s^{(k)} \neq x_t^{(k)} \end{cases}    (11)

With this definition we can write the one-dimensional Gaussian kernel function in HOBbit approximation as

    K_{i\,\mathrm{Gauss}}^{(k)}(x_s^{(k)}, x_t^{(k)}) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{d_{\mathrm{EH}}(x_s^{(k)}, x_t^{(k)})^2}{2\sigma_k^2}\right) [C = c_i]    (12)

Note that this can be seen as a type of intervalization, due to the fact that a large number of points will have the same HOBbit distance from the sample. This explains why the HOBbit-/P-tree-based algorithm can be efficient despite the high computational effort that multi-dimensional kernel density functions and correlation functions would otherwise involve.

2.4. Algorithm

Our classification algorithm uses the components developed above as follows.
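The two HOBbit quantities defined above are easy to compute by shifting; this small sketch (function names ours) mirrors the definitions:

```python
def hobbit(xs, xt):
    """HOBbit distance: the number of right-shifts needed to make xs and xt equal."""
    d = 0
    while xs != xt:
        xs >>= 1
        xt >>= 1
        d += 1
    return d

def d_eh(xs, xt):
    """Exponential HOBbit distance: the average Euclidean distance of values
    within the HOBbit neighborhood; 0 if equal, else 2**(d_HOBbit - 1)."""
    return 0 if xs == xt else 2 ** (hobbit(xs, xt) - 1)

hobbit(32, 37)   # 3: 100000 and 100101 agree only in their top three bits
d_eh(32, 37)     # 2**(3-1) = 4
```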
Based on the attribute values of each test sample, we evaluate kernel functions for
all attributes. For continuous attributes we evaluate (7), and for categorical attributes (8), for all values of the class labels. Note that these kernel functions only have to be evaluated once for each attribute value and can be reused as long as the training data is unchanged. Kernel functions for pairs of attributes are then evaluated to determine the correlation function (10). For this purpose we use kernel functions that are independent of the class label rather than doing the analysis for different classes separately. This makes the solution numerically more stable by avoiding situations in which two attributes are joined for one class label and not for another. Attributes are joined if the correlation function exceeds a given threshold, commonly chosen in the range 0.05 to 1. Joining of attributes consists of computing the joined kernel functions (11) and using the result to replace the individual kernel functions of the respective two attributes in (6).

Evaluation of kernel functions for continuous attributes involves evaluating the number of data points within each of the HOBbit ranges of the attribute(s). This is done through AND operations on the P-trees that correspond to the respective highest-order bits. The number of points in each range is then weighted according to the value of the Gaussian function (12) using the exponential HOBbit distance that corresponds to the given HOBbit range.

3. IMPLEMENTATION AND RESULTS

We implemented all algorithms in Java and evaluated them on 4 data sets. Data sets were selected to have at least 3000 data points and to contain continuous attributes. Two thirds of the data were taken as the training set and one third as the test set. Due to the consistently large size of the data sets, cross-validation was considered unnecessary. All experiments were done using the same parameter values for all data sets.

3.1. Data Sets
Three of the data sets were obtained from the UCI machine learning repository [2], where full documentation on the data sets is available. These data sets include the following:

  • spam data set: word and letter frequencies are used to classify e-mail as spam
  • adult data set: census data is used to predict whether income is greater than $50,000
  • sick-euthyroid data set: medical data is used to predict sickness from thyroid disease

An additional data set was taken from the spatial data mining domain (crop data set). The RGB colors in the photograph of a cornfield are used to predict the yield of the field [17]. The class label is the first bit of the 8-bit yield information, i.e., the class label is 1
if yield is higher than 128 for a given pixel. Table 1 summarizes the properties of the data sets.

                  Number of attributes     Number   Number of instances  Unknown  Probability of
                  (without class label)    of                            values   minority class
                  total  categ.  contin.   classes  training   test               label (%)
  spam             57      0      57         2        3067      1534      no       37.68
  crop              3      0       3         2      784080     87120      no       24.69
  adult            14      8       6         2       30162     15060      no       24.57
  sick-euthyroid   25     19       6         2        2108      1055      yes       9.29

Table 1: Summary of data set properties

No preprocessing of the data was done, but some attributes were identified as being logarithmic in nature, and the logarithm was encoded in the P-trees. The following attributes were chosen as logarithmic: "capital-gain" and "capital-loss" of the adult data set, and all attributes of the spam data set.

3.2. Results

We will now compare results of our semi-naive algorithm with an implementation of the traditional naive Bayesian algorithm that uses a Gaussian distribution function, cf. equation (5), and a P-tree naive Bayesian algorithm that uses HOBbit-based kernel density estimation, cf. equation (12), but does not check for correlations. Table 2 gives a summary of the results (error rates, with standard errors).

                  Traditional  P-Tree       Discrete    Discrete    Kernel      Kernel
                  Naive Bayes  Naive Bayes  Semi-Naive  Semi-Naive  Semi-Naive  Semi-Naive
                  (+/-)        (+/-)        Bayes (1)   Bayes (2)   Bayes (1)   Bayes (2)
                                            (+/-)       (+/-)       (+/-)       (+/-)
  spam            11.9  0.9    10.0  0.8     9.4  0.8    9.5  0.8    8.4  0.7    8.0  0.7
  crop            21.6  0.2    22.0  0.2    28.8  0.2   21.0  0.2   28.5  0.2   21.0  0.2
  adult           18.3  0.4    18.0  0.3    17.3  0.3   17.1  0.3   17.0  0.3   16.6  0.3
  sick-euthyroid  15.2  1.2     5.9  0.7    11.4  1.0    8.8  1.0    4.2  0.6    4.6  0.7

Table 2: Results

3.3. P-Tree Naive Bayesian Classifier

Before using the semi-naive Bayesian classifier we will evaluate the performance of a simple naive Bayesian classifier that uses kernel density estimates based on the HOBbit distance. Whether this classifier improves performance depends on two factors. Classification accuracy should benefit from the fact that a kernel density estimate in general gives more detailed information on an attribute than the choice of a simple distribution function. Our implementation does, however, use the HOBbit distance when determining the kernel density estimate, which may make classification worse overall. Figure 3 shows that for three of the data sets the P-tree naive Bayesian algorithm constitutes an improvement over the traditional naive Bayesian classifier.

Figure 3: Decrease in error rate for the P-tree naive Bayesian classifier compared with the traditional naive Bayesian classifier, in units of the standard error.

3.4. Semi-Naive Bayesian Classifier

The semi-naive Bayesian classifier was evaluated using two parameter combinations. Figure 4 shows the decrease in error rate compared with the P-tree naive Bayesian classifier, which is the relevant comparison when evaluating the benefit of
combining attributes.

Figure 4: Decrease in error rate for the kernel-based semi-naive Bayesian classifier compared with the P-tree naive classifier, for two parameter combinations, in units of the standard error.

It can be clearly seen that for the chosen parameters accuracy is generally increased over the P-tree naive Bayesian algorithm, both for data sets with continuous attributes and for ones with categorical attributes. For the majority of data sets both parameter combinations lead to a significant improvement, despite the fact that the cutoff value for correlations varies from 0.05 to 0.3 and the runs differ in their treatment of anti-correlations. It can therefore be concluded that the benefits of correlation elimination are fairly robust with respect to parameter choice. Run (1) used a cut-off threshold t = 0.3 while run (2) used t = 0.05. Run (1) eliminates not only correlations, i.e., attributes for which Corr(a,b) > t, but also anti-correlations, i.e., attributes for which Corr(a,b) < -t.

We also tried using multiple iterations, in which not only two but more attributes were joined. We did not get any consistent improvement, which suggests that joining of attributes should not be overdone. This is in accordance with the observation that correlated attributes can benefit the naive Bayesian classifier [18]. The mechanism by which correlation elimination can lead to a decrease in accuracy in the semi-naive Bayesian algorithm is as follows. Correlation elimination was introduced as a simple interchange of summation and multiplication in equation (6). For categorical attributes the kernel function is a step function. The product of step functions will quickly result in a small volume in attribute space, leading to correspondingly few training points and poor
statistical significance. By combining many attributes the problems of the curse of dimensionality can recur. Similarly, it turned out not to be beneficial to eliminate anti-correlations as extensively as correlations, because they lead even more quickly to a decrease in the number of points that are considered.

We then compared our approach with the alternative strategy of discretizing continuous attributes as a preprocessing step. Figure 5 shows the decrease in error rate of the kernel-based implementation compared with the approach of discretizing attributes as a preprocessing step. The improvement in accuracy for the kernel-based representation is clearly evident. This may explain why early discretization-based results on semi-naive classifiers that were reported by Kononenko [6] showed little or no improvement on a data set that is similar to our sick-euthyroid data set, whereas the kernel-based implementation leads to a clear improvement.

Figure 5: Decrease in error rate for 2 parameter combinations for the kernel-based semi-naive Bayesian classifier compared with a semi-naive Bayesian classifier using discretization, in units of the standard error.

3.5. Performance
Figure 6: Scaling of execution time (time per test sample, in milliseconds) as a function of training set size, from 0 to 30,000 training points; measured execution times are compared with a linear interpolation.

It is important for data mining algorithms to be efficient for large data sets. Figure 6 shows that the P-tree-based semi-naive Bayesian algorithm scales better than O(N) as a function of the number of training points. This scaling is closely related to the P-tree storage concept, which benefits increasingly from compression as the data set size grows.

4. CONCLUSIONS

We have presented a semi-naive Bayesian algorithm that treats continuous data through kernel density estimates rather than discretization. We were able to show that it increases accuracy for data sets from a wide range of domains, both from the UCI machine learning repository and from an independent source. By avoiding discretization, our algorithm ensures that distance information within numerical attributes is represented accurately, and improvements in accuracy could clearly be demonstrated. Categorical and continuous data are thereby treated on an equally strong footing, which is unusual, since classification algorithms tend to favor one or the other type of data. Our algorithm is particularly valuable for the classification of data sets with many attributes. It does not require training of a classifier and is therefore suitable for settings such as data streams. The implementation using P-trees has an efficient sub-linear scaling with respect to training set size. We have thereby introduced a tool that is equally interesting from a theoretical and a practical perspective.

References

[1] D. Hand, H. Mannila, P. Smyth, "Principles of Data Mining", The MIT Press,
Cambridge, Massachusetts, 2001.
[2] http://www.ics.uci.edu/mlearn/MLSummary.html
[3] P. Domingos, M. Pazzani, "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier", 13th International Conference on Machine Learning, 105-112, 1996.
[4] T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer-Verlag, New York, 2001.
[5] M. Datar, A. Gionis, P. Indyk, R. Motwani, "Maintaining Stream Statistics over Sliding Windows", ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002.
[6] I. Kononenko, "Semi-Naive Bayesian Classifier", Proceedings of the Sixth European Working Session on Learning, 206-219, 1991.
[7] M. Pazzani, "Constructive Induction of Cartesian Product Attributes", Information, Statistics and Induction in Science, Melbourne, Australia, 1996.
[8] Z. Zheng, G. Webb, K.-M. Ting, "Lazy Bayesian Rules: A Lazy Semi-Naive Bayesian Learning Technique Competitive to Boosting Decision Trees", Proc. 16th Int. Conf. on Machine Learning, 493-502, 1999.
[9] N. Friedman, D. Geiger, M. Goldszmidt, "Bayesian Network Classifiers", Machine Learning, 29, 131-163, 1997.
[10] E. J. Keogh, M. J. Pazzani, "Learning Augmented Bayesian Classifiers: A Comparison of Distribution-Based and Classification-Based Approaches", Uncertainty 99: The Seventh International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 1999.
[11] J. Cheng, R. Greiner, "Comparing Bayesian Network Classifiers", Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI'99), 101-107, Morgan Kaufmann, 1999.
[12] M. Khan, Q. Ding, W. Perrizo, "K-Nearest Neighbor Classification of Spatial Data Streams Using P-trees", PAKDD-2002, Taipei, Taiwan, May 2002.
[13] W. Perrizo, Q. Ding, A. Denton, K. Scott, Q. Ding, M. Khan, "PINE - Podium Incremental Neighbor Evaluator for Spatial Data Using P-trees", Symposium on Applied Computing (SAC'03), Melbourne, Florida, USA, 2003.
[14] Q. Ding, W. Perrizo, Q. Ding, "On Mining Satellite and Other Remotely Sensed Images", DMKD-2001, 33-40, Santa Barbara, CA, 2001.
[15] W. Perrizo, W. Jockheck, A. Perera, D. Ren, W. Wu, Y. Zhang, "Multimedia Data Mining Using P-trees", Multimedia Data Mining Workshop, KDD, Sept. 2002.
[16] A. Perera, A. Denton, P. Kotala, W. Jockheck, W. Valdivia Granda, W. Perrizo, "P-tree Classification of Yeast Gene Deletion Data", SIGKDD Explorations, Dec. 2002.
[17] http://midas-10.cs.ndsu.nodak.edu/data/images/data_set_94/
[18] R. Kohavi, G. John, "Wrappers for Feature Subset Selection", Artificial Intelligence, 97(1-2), 273-324, 1997.
