IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,                        VOL. 22,   NO. 1,    JANUARY 2010             ...
140                                               IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,                    ...
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,                VOL. 22,   NO. 1,   JANUARY 2010                      ...
142                                              IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,                   VO...
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,                 VOL. 22,   NO. 1,   JANUARY 2010                     ...
144                                              IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,                     ...
Upcoming SlideShare
Loading in …5

Bayesian classifiers programmed in sql


Published on

Dear Students
Ingenious techno Solution offers an expertise guidance on you Final Year IEEE & Non- IEEE Projects on the following domain
For further details contact us:
044-42046028 or 8428302179.

Ingenious Techno Solution
#241/85, 4th floor
Rangarajapuram main road,
Kodambakkam (Power House)

Published in: Education, Technology
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Bayesian classifiers programmed in sql

  1. 1. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010 139 Bayesian Classifiers Programmed in SQL set and classification model. We also develop query optimizations that may be applicable to other distance-based algorithms. A horizontal layout of the cluster centroids table and a denormalized Carlos Ordonez and Sasi K. Pitchaimalai table for sufficient statistics are essential to accelerate queries. Past research has shown the main alternatives to integrate data miningAbstract—The Bayesian classifier is a fundamental classification technique. In algorithms without modifying DBMS source code are SQL queriesthis work, we focus on programming Bayesian classifiers in SQL. We introduce and User-Defined Functions (UDFs), with UDFs being generallytwo classifiers: Naive Bayes and a classifier based on class decomposition using more efficient [5]. We show SQL queries are more efficient thanK-means clustering. We consider two complementary tasks: model computation UDFs for Bayesian classification. Finally, despite the fact that C++and scoring a data set. We study several layouts for tables and several indexing is a significantly faster alternative, we provide evidence thatalternatives. We analyze how to transform equations into efficient SQL queriesand introduce several query optimizations. We conduct experiments with real and exporting data sets is a performance bottleneck.synthetic data sets to evaluate classification accuracy, query optimizations, and The article is organized as follows: Section 2 introducesscalability. Our Bayesian classifier is more accurate than Naive Bayes and definitions. Section 3 introduces two alternative Bayesian classifiersdecision trees. Distance computation is significantly accelerated with horizontal ¨ in SQL: Naıve Bayes and a Bayesian classifier based on K-means.layout for tables, denormalization, and pivoting. We also compare Naive Bayes Section 4 presents an experimental evaluation on accuracy,implementations in SQL and C++: SQL is about four times slower. Our Bayesian optimizations, and scalability. Related work is discussed inclassifier in SQL achieves high classification accuracy, can efficiently analyze Section 5. Section 6 concludes the paper.large data sets, and has linear scalability.Index Terms—Classification, K-means, query optimization. 2 DEFINITIONS Ç We focus on computing classification models on a data set X ¼ fx1 ; . . . ; xn g with d attributes X1 ; . . . ; Xd , one discrete attribute G1 INTRODUCTION (class or target), and n records (points). We assume G has m ¼ 2CLASSIFICATION is a fundamental problem in machine learning values. Data set X represents a d  n matrix, where xi represents aand statistics [2]. Bayesian classifiers stand out for their robust- column vector. We study two complementary models: 1) each classness, interpretability, and accuracy [2]. They are deeply related to is approximated by a normal distribution or histogram and 2) fittingmaximum likelihood estimation and discriminant analysis [2], a mixture model with k clusters on each class with K-means. We usehighlighting their theoretical importance. In this work, we focus subscripts i; j; h; g as follows: i ¼ 1 . . . n; j ¼ 1 . . . k; h ¼ 1 . . . d; ¨on Bayesian classifiers considering two variants: Naıve Bayes [2] g ¼ 1 . . . m. The T superscript indicates matrix transposition.and a Bayesian classifier based on class decomposition using Throughout the paper, we will use a small running example,clustering [7]. where d ¼ 4; k ¼ 3 (for K-means) and m ¼ 2 (binary). In this work, we integrate Bayesian classification algorithmsinto a DBMS. Such integration allows users to directly analyze data 3 BAYESIAN CLASSIFIERS PROGRAMMED IN SQLsets inside the DBMS and to exploit its extensive capabilities(storage management, querying, concurrency control, fault toler- ¨ We introduce two classifiers: Naıve Bayes (NB) and a Bayesianance, and security). We use SQL queries as the programming Classifier based on K-means (BKM) that performs class decom-mechanism, since it is the standard language in a DBMS. More position. We consider two phases for both classifiers: 1) computingimportantly, using SQL eliminates the need to understand and the model and 2) scoring a data set based on the model. Ourmodify the internal source, which is a difficult task. Unfortunately, optimizations do not alter model accuracy.SQL has two important drawbacks: it has limitations to manipulate 3.1 ¨ Naıve Bayesvectors and matrices and has more overhead than a systems We consider two versions of NB: one for numeric attributes andlanguage like C. Keeping those issues in mind, we study how to another for discrete attributes. Numeric NB will be improved withevaluate mathematical equations with several tables layouts and class decomposition.optimized SQL queries. NB assumes attributes are independent, and thus, the joint class Our contributions are the following: We present two efficient conditional probability can be estimated as the product of ¨SQL implementations of Naıve Bayes for numeric and discrete probabilities of each attribute [2]. We now discuss NB based on aattributes. We introduce a classification algorithm that builds one multivariate Gaussian. NB has no input parameters. Each class isclustering model per class, which is a generalization of K-means[1], [4]. Our main contribution is a Bayesian classifier programmed modeled as a single normal distribution with mean vector Cg and a ¨in SQL, extending Naıve Bayes, which uses K-means to decompose diagonal variance matrix Rg . Scoring assumes a model is availableeach class into clusters. We generalize queries for clustering and there exists a data set with the same attributes in order toadding a new problem dimension. That is, our novel queries predict class G. Variance computation based on sufficient statisticscombine three dimensions: attribute, cluster, and class subscripts. [5] in one pass can be numerically unstable when the variance isWe identify euclidean distance as the most time-consuming much smaller compared to large attribute values or when the datacomputation. Thus, we introduce several schemes to efficiently set has a mix of very large and very small numbers. For ill-compute distance considering different storage layouts for the data conditioned data sets, the computed variance can be significantly different from the actual variance or even become negative. Therefore, the model is computed in two passes: a first pass to. The authors are with the Department of Computer Science, University of get the mean per class and a second one to compute the variance Houston, Houston, TX 77204. E-mail: {ordonez, yeskaym} P per class. The mean per class is given by Cg ¼ xi 2Yg xi =Ng , where PManuscript received 24 Apr. 2008; revised 12 Jan. 2009; accepted 4 May Yg X are the records in class g. Equation Rg ¼ 1=Ng ni 2Yg ðxi À x2009; published online 14 May 2009. T Cg Þðxi À Cg Þ gives a diagonal variance matrix Rg , which isRecommended for acceptance by S. Wang.For information on obtaining reprints of this article, please send e-mail to: numerically stable, but requires two passes over the data, and reference IEEECS Log Number TKDE-2008-04-0219. The SQL implementation for numeric NB follows the meanDigital Object Identifier no. 10.1109/TKDE.2009.127. and variance equations introduced above. We compute three 1041-4347/10/$26.00 ß 2010 IEEE Published by the IEEE Computer Society Authorized licensed use limited to: Asha Das. Downloaded on June 17,2010 at 06:10:21 UTC from IEEE Xplore. Restrictions apply.
  2. 2. 140 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010aggregations grouping by g with two queries. The first query TABLE 1computes the mean Cg of class g with a sum ðXh Þ=count ðÃÞ BKM Tablesaggregation and class priors g with a count() aggregation. Thesecond query computes Rg with sum(ðXh À h Þ2 Þ. Note the jointprobability computation is not done in this phase. Scoring usesthe Gaussian parameters as input to classify an input point to themost probable class, with one query in one pass over X. Eachclass probability is evaluated as a Gaussian. To avoid numericalissues when a variance is zero, the probability is set to 1 and thejoint probability is computed with a sum of probabilitylogarithms instead of a product of probabilities. A CASEstatement pivots probabilities and avoids a max() aggregation.A final query determines the predicted class, being the one withmaximum probability, obtained with a CASE statement. We now discuss NB for discrete attributes. For numeric NB, we vectors (instead of variables). Class priors are analogous to NB:used Gaussians because they work well for large data sets and g ¼ Ng =n.because they are easy to manipulate mathematically. That is, NB We now introduce sufficient statistics in the generalizeddoes not assume any specific probability density function (pdf). notation. Let Xg:j represent partition j of points in class g (asAssume X1 ; . . . ; Xd can be discrete or numeric. If an attribute Xh is determined by K-means). Then, L, the linear sum of points is: Pdiscrete (categorical) NB simply computes its histogram: prob- Lg:j ¼ xi 2Xg:j xi , and the sum of “squared” points for cluster g : j Pabilities are derived with counts per value divided by the becomes: Qg:j ¼ xi 2Xg:j xi xT . Based on L; Q, the Gaussian para- icorresponding number of points in each class. Otherwise, if the meters per class g are:attribute Xh is numeric then binning is required. Binning requirestwo passes over the data set, pretty much like numeric NB. In the 1 Cg:j ¼ Lg:j ; ð1Þfirst pass, bin boundaries are determined. On the second pass, one- Ng:jdimensional frequency histograms are computed on each attribute.The bin boundaries (interval ranges) may impact the accuracy of 1 1 Rg:j ¼ Qg:j À Lg:j ðLg:j ÞT : ð2ÞNB due to skewed distributions or extreme values. Thus, we Ng:j ðNg:j Þ2consider two techniques to bin attributes: 1) creating k uniformintervals between min and max and 2) taking intervals around the We generalize K-means to compute m models, fitting a mixturemean based on multiples of the standard deviation, thus getting model to each class. K-means is initialized, and then, it iteratesmore evenly populated bins. We do not study other binning until it converges on all classes. The algorithm is as given below.schemes such as quantiles (i.e., equidepth binning). Initialization: The implementation in SQL of discrete NB is straightforward. 1. Get global N; L; Q and ; .For discrete attributes, no preprocessing is required. For numeric 2. Get k random points per class to initialize C.attributes, the minimum, maximum, and mean can be determinedin one pass in a single query. The variance for all numeric While not all m models converge:attributes is computed on a second pass to avoid numerical issues. 1. E step: get k distances j per g; find nearest cluster j per g;Then, each attribute is discretized finding the interval for each update N; L; Q per class.value. Once we have a binned version of X, then we compute 2. M step: update W ; C; R from N; L; Q per class; computehistograms on each attribute with SQL aggregations. Probabilities model quality per g; monitor convergence.are obtained dividing by the number of records in each class.Scoring requires determining the interval for each attribute value For scoring, in a similar manner to NB, a point is assigned to theand retrieving its probability. Each class probability is also class with highest probability. In this case, this means getting thecomputed by adding logarithms. NB has an advantage over other most probable cluster based on mk Gaussians. The probability getsclassifiers: it can handle a data set with mixed attribute types (i.e., multiplied by the class prior by default (under the commondiscrete and numerical). Bayesian framework), but it can be ignored for imbalanced classification problems.3.2 Bayesian Classifier Based on K-MeansWe now present BKM, a Bayesian classifier based on class 3.2.2 Bayesian Classifier Programmed in SQLdecomposition obtained with the K-means algorithm. BKM is a Based on the mathematical framework introduced in the previousgeneralization of NB, where NB has one cluster per class and the section, we now study how to create an efficient implementation ofBayesian classifier has k 1 clusters per class. For ease of BKM in SQL.exposition and elegance, we use the same k for each class, but Table 1 provides a summary of all working tables used by BKM.we experimentally show it is a sound decision. We consider two In this case, the tables for NB are extended with the j subscript,major tasks: model computation and scoring a data set based on a and since there is a clustering algorithm behind, we introducecomputed model. The most time-consuming phase is building the tables for distance computation. Like NB, since computationsmodel. Scoring is equivalent to an E step from K-means executed require accessing all attributes/distances/probabilities for eachon each class. point at one time, tables have a physical organization that stores all values for point i on the same address and there is a primary index3.2.1 Mixture Model and Decomposition Algorithm on i for efficient hash join computation. In this manner, we forceWe fit a mixture of k clusters to each class with K-means [4]. The the query optimizer to perform n I/Os instead of the actualoutput are priors (same as NB), mk weights (W ), mk centroids number of rows. This storage and indexing optimization is(C), and mk variance matrices (R). We generalize NB notation with essential for all large tables having n or or more rows.k clusters per class g with notation g : j, meaning the given vector This is a summary of optimizations. There is a single set of tablesor matrix refers to jth cluster from class g. Sums are defined over for the classifier: all m models are updated on the same table scan; Authorized licensed use limited to: Asha Das. Downloaded on June 17,2010 at 06:10:21 UTC from IEEE Xplore. Restrictions apply.
  3. 3. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010 141 TABLE 2 but mdk columns (or m tables, with dk columns each). Table 2 Query Optimization: Distance Computation shows the expected I/O cost. The following SQL statement computes k distances for each point, corresponding to the gth model. This statement is also used for scoring, and therefore, it is convenient to include g. INSERT INTO XD SELECT i,XH.g ,(X1-C1_X1)**2 + .. +(X4-C1_X4)**2, ..,(X1-C3_X1)**2 + .. +(X4-C3_X4)**2 FROM XH,CH WHERE XH.g=CH.g; We also exploited UDFs which represent an extensibilitythis eliminates the need to create multiple temporary tables. We mechanism [8] provided by the DBMS. UDFs are programmed inintroduce a fast distance computation mechanism based on a the fast C language and they can take advantage of arrays and flowflattened version of the centroids; temporary tables have less rows. control statements (loops, if-then). Based on this layout, weWe use a CASE statement to get closest cluster avoiding the use of developed a UDF which receives Cg as parameter and canstandard SQL aggregations; a join between two large tables is internally manipulate it as a matrix. The UDF is called mn timesavoided. We delete points from classes whose model has converged; and it returns the closest cluster per class. That is, the UDFthe algorithm works with smaller tables as models converge. computes distances and determines the closest cluster. In the experimental section, we will study which is the most efficient3.2.3 Model Computation alternative between SQL and the UDF. A big drawback about UDFsTable CH is initialized with k random points per class; this is that they are dependent on the database system architecture.initialization requires building a stratified sample per class. Therefore, they are not portable and optimizations developed for The global variance is used to normalize X to follow a one database system may not work well in another one.Nð0; 1Þ pdf. The global mean and global variance are derived as At this point, we have computed k distances per class and wein NB. In this manner, K-means computes euclidean distance with need to determine the closest cluster. There are two basicall attributes being on the same relative scale. The normalized data alternatives: pivoting distances and using SQL standard aggrega-set is stored in table XH, which also has n rows. BKM uses XH tions, using case statements to determine the minimum distance.thereafter at each iteration. For the first alternative, XD must be pivoted into a bigger table. We first discuss several approaches to compute distance for Then, the minimum distance is determined using the min()each class. Based on these approaches we derive a highly efficient aggregation. The closest cluster is the subscript of the minimumscheme to compute distance. The following statements assume the distance, which is determined joining XH. In the secondsame k per class. alternative, we just need to compare every distance against the The simplest, but inefficient, way to compute distances is to rest using a CASE statement. Since the second alternative does notcreate pivoted versions of X and C having one value per row. use joins and is based on a single table scan, it is much faster thanSuch a table structure enables the usage of SQL standard using a pivoted version of XD. The SQL to determine closestaggregations (e.g., sum(), count()). Evaluating a distance query cluster per class for our running example is given below. Thisinvolves creating a temporary table with mdkn rows, which will be statement can also be used for scoring.much larger than X. Since the vertical variant is expected to be theslowest, it is not analyzed in the experimental section. Instead, we INSERT INTO XN SELECT i,guse a hybrid distance computation in which we have k distances ,CASE WHEN d1=d2 AND d1=d3 THEN 1per row. We start by considering a pivoted version of X called WHEN d2=d3 THEN 2XV . The corresponding SQL query computes k distances per class ELSE 3 END AS jusing an intermediate table having dn rows. We call this variant FROM XD;hybrid vertical. Once we have determined the closest cluster for each point on We consider a horizontal scheme to compute distances. All each class, we need to update sufficient statistics, which are justk distances per class are computed as SELECT terms. This sums. This computation requires joining XH and XN andproduces a temporary table with mkn rows. Then, the temporary partitioning points by class and cluster number. Since table NLQtable is pivoted and aggregated to produce a table having mn rows is denormalized, the jth vector and the jth matrix are computedbut k columns. Such a layout enables efficient computation of the together. The join computation is demanding since we are joiningnearest cluster in a single table scan. The query optimizer decides two tables with OðnÞ rows. For BKM, it is unlikely that variancewhich join algorithm to use and does not index any intermediate can have numerical issues, but is feasible. In such a case, sufficienttable. We call this approach nested horizontal. We introduce a statistics are substituted by a two-pass computation like NB.modification to the nested horizontal approach in which we createa primary index on the temporary distance table in order to enable INSERT INTO NLQ SELECT XH.g,jfast join computation. We call this scheme horizontal temporary. ,sum(1.0) as Ng We introduce a yet more efficient scheme to get distance ,sum(X1) , .. ,sum(X4) /* L */using a flattened version of C, which we call CH, that has ,sum(X1**2) , .. ,sum(X4**2) /* Q*/dk columns corresponding to all k centroids for one class. TheSQL code pairs each attribute from X with the corresponding FROM XH,XN WHERE XH.i=XN.i GROUP BY XH.g,j;attribute from the cluster. Computation is efficient because CH We now discuss the M step to update W ; C; R. Computinghas only m rows and the join computes a table with n rows (for WCR from NLQ is straightforward since both tables have thebuilding the model) or mn rows for scoring. We call this same structure and they are small. There is a Cartesian productapproach the horizontal approach. There exists yet another between NLQ and the model quality table, but this latter table has“completely horizontal” approach in which all mk distances are only one row. Finally, to speed up computations, we delete pointsstored on the same row for each point and C has only one row from XH for classes whose model has converged: a reduction in Authorized licensed use limited to: Asha Das. Downloaded on June 17,2010 at 06:10:21 UTC from IEEE Xplore. Restrictions apply.
  4. 4. 142 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010 TABLE 3 TABLE 4 Accuracy Varying k Comparing Accuracy: NB, BKM, and DTsize is propagated to the nearest cluster along with sufficientstatistics queries.INSERT INTO WCR SELECT NLQ.g,NLQ.j ,Ng/MODEL.n /* pi */ ,L1/Ng, ..,L4/Ng /* C */ ,Q1/Ng-(L1/Ng)**2,..,Q4/Ng-(L4/Ng)**2 /* R */ BKM ran until K-means converged on all classes. Decision trees used the CN5.0 algorithm splitting nodes until they reached aFROM NLQ,MODEL WHERE NLQ.g=MODEL.g; minimum percentage of purity or became too small. Pruning was applied to reduce overfit. The number of clusters for BKM by3.2.4 Scoring default was k ¼ 4.We consider two alternatives: based on probability (default) or The number of clusters (k) is the most important parameter todistance. Scoring is similar to the E step, but there are differences. tune BKM accuracy. Table 3 shows accuracy behavior as kFirst, the new data set must be normalized with the original increases. A higher k produces more complex models. Intuitively,variance used to build the model. Such variance should be similar a higher k should achieve higher accuracy because each class canto the variance of the new data set. Second, we need to compute be better approximated with localized clusters. On the other hand,mk probabilities (distances) instead of only k distances because a higher k than necessary can lead to empty clusters. As can bewe are trying to find the most probable (closest) cluster among all seen, a lower k produces better models for spam and wbcancer,(this is a subtle, but important difference in SQL code generation); whereas a higher k produces higher accuracy for pima and bscale.thus, the join condition between XH and CH gets eliminated. Therefore, there is no ideal setting for k, but, in general, k 20Third, once we have mk probabilities (distances), we need to pick produces good results. Based on these results, we set k ¼ 4 bythe maximum (minimum) one, and then, the point is assigned to default since it provides reasonable accuracy for all data sets.the corresponding cluster. To have a more elegant implementa- Table 4 compares the accuracy of the three models, includingtion, we predict the class in two stages: we first determine the the overall accuracy as well as a breakdown per class. Accuracymost probable cluster per class, and then, compare the prob- per predicted class is important to understand issues withabilities of such clusters. Column XH.g is not used for scoring imbalanced classes and detecting subsets of highly similar points.purposes, but it is used to measure accuracy, when known. In practice, one class is generally more important than the other one (asymmetric), depending on whether the classification task is4 EXPERIMENTAL EVALUATION to minimize false positives or false negatives. This is a summary of findings. First of all, considering global accuracy, BKM is betterWe analyze three major aspects: 1) classification accuracy, 2) query than NB on two data sets and becomes tied with the other two dataoptimization, and 3) time complexity and speed. We compare the sets. On the other hand, compared to decision trees, it is better inaccuracy of NB, BKM, and decision trees (DTs). one case and worse otherwise. NB follows the same pattern, but4.1 Setup showing a wider gap. However, it is important to understand whyWe used the Teradata DBMS running on a server with a 3.2 GHz BKM and NB perform worse than DT in some cases, looking at theCPU, 2 GB of RAM, and a 750 GB disk. accuracy breakdown. In this case, BKM generally does better than Parameters were set as follows: We set ¼ 0:001 for K-means. NB, except in two cases. Therefore, class decomposition doesThe number of clusters per class was k ¼ 4 (setting experimentally increase prediction accuracy. On the other hand, compared tojustified). All query optimizations were turned on by default (they decision trees, in all cases where BKM is less accurate than DT ondo not affect model accuracy). Experiments with DTs were global accuracy, BKM has higher accuracy on one class. In fact, onperformed using a data mining tool. closer look, it can be seen that DT completely misses prediction for We used real data sets to test classification accuracy (from one class for the bscale data set. That is, global accuracy for DT isthe UCI repository) and synthetic data sets to analyze speed misleading. We conclude that BKM performs better than NB and is(varying d; n). Real data sets include pima (d ¼ 6; n ¼ 768), competitive with DT.spam (d ¼ 7; n ¼ 4;601), bscale (d ¼ 4; n ¼ 625), and wbcancer(d ¼ 7; n ¼ 569). Categorical attributes (!3 values) were trans-formed into binary attributes. TABLE 5 Query Optimization: Distance Computation n ¼ 100k (Seconds)4.2 Model AccuracyIn this section, we measure the accuracy of predictions when usingBayesian classification models. We used 5-fold cross validation.For each run, the data set was partitioned into a training set and atest set. The training set was used to compute the model, whereasthe test set was used to independently measure accuracy. Thetraining set size was 80 percent and the test set was 20 percent. Authorized licensed use limited to: Asha Das. Downloaded on June 17,2010 at 06:10:21 UTC from IEEE Xplore. Restrictions apply.
  5. 5. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010 143 TABLE 6 Query Optimization: SQL versus UDF (Scoring, n ¼ 1M) (Seconds)4.3 Query OptimizationTable 5 provides detailed experiments on distance computation,the most demanding computation for BKM. Our best distance Fig. 1. BKM Classifier: Time complexity; (default d ¼ 4, k ¼ 4, n ¼ 100k).strategy is two orders of magnitude faster than the worst strategyand it is one order of magnitude faster than its closest rival. The ing the DBMS source code are SQL queries and UDFs. A discreteexplanation is that I/O is minimized to n operations and ¨ Naıve Bayes classifier programmed in SQL is introduced in [6]. Wecomputations happen in main memory for each row through summarize differences with ours. The data set is assumed to haveSQL arithmetic expressions. Note a standard aggregation on the discrete attributes: binning numeric attributes is not considered. Itpivoted version of X (XV) is faster than the horizontal nested uses an inefficient large pivoted intermediate table, whereas ourquery variant. discrete NB model can directly work on a horizontal layout. Our Table 6 compares SQL with UDFs to score the data set proposal extends clustering algorithms in SQL [4] to perform(computing distance and finding nearest cluster per class). We ¨ classification, generalizing Naıve Bayes [2]. K-means clusteringexploit scalar UDFs [5]. Since finding the nearest cluster is was programmed with SQL queries introducing three variants [4]:straightforward in the UDF, this comparison considers both standard, optimized, and incremental. We generalized the opti-computations as one. This experiment favors the UDF since SQL mized variant. Note that classification represents a significantlyrequires accessing large tables in separate queries. As we can see, harder problem than clustering. User-Defined Functions areSQL (with arithmetic expressions) turned out be faster than the identified as an important extensibility mechanism to integrateUDF. This was interesting because both approaches used the same data mining algorithms [5], [8]. ATLaS [8] extends SQL syntax withtable as input and performed the same I/O reading a large table. object-oriented constructs to define aggregate and table functionsWe expected SQL to be slower because it required a join and XH (with initialize, iterate, and terminate clauses), providing a user-and XD were accessed. The explanation was that the UDF has friendly interface to the SQL standard. We point out severaloverhead to pass each point and model as parameters in each call. differences with our work. First, we propose to generate SQL code4.4 Speed and Time Complexity from a host language, thus achieving Turing-completeness in SQL (similar to embedded SQL). We showed the Bayesian classifier canWe compare SQL and C++ running on the same computer. We also be solved more efficiently with SQL queries than with UDFs. Evencompare the time to export with ODBC. C++ worked on flat files further, SQL code provides better portability and ATLaS requiresexported from the DBMS. We used binary files in order to get modifying the DBMS source code. Class decomposition withmaximum performance in C++. Also, we shut down the DBMS clustering is shown to improve NB accuracy [7]. The classifier canwhen C++ was running. In short, we conducted a fair comparison. adapt to skewed distributions and overlapping subsets of pointsTable 7 compares SQL, C++, and ODBC varying n. Clearly, ODBC by building better local models. In [7], EM was the algorithm to fitis a bottleneck. Overall, both languages scale linearly. We can see a mixture per class. Instead, we decided to use K-means because itSQL performs better as n grows because DBMS overhead becomes is faster, simpler, better understood in database research and hasless important. However, C++ is about four times faster. less numeric issues. Fig. 1 shows BKM time complexity varying n; d with large datasets. Time is measured for one iteration. BKM is linear in n and d,highlighting its scalability. 6 CONCLUSIONS We presented two Bayesian classifiers programmed in SQL: the5 RELATED WORK ¨ Naıve Bayes classifier (with discrete and numeric versions) and a ¨ generalization of Naıve Bayes (BKM), based on decomposingThe most widely used approach to integrate data mining classes with K-means clustering. We studied two complementaryalgorithms into a DBMS is to modify the internal source code. aspects: increasing accuracy and generating efficient SQL code. WeScalable K-means (SKM) [1] and O-cluster [3] are two examples of introduced query optimizations to generate fast SQL code. The bestclustering algorithms internally integrated with a DBMS. A physical storage layout and primary index for large tables is baseddiscrete Naive Bayes classifier has been internally integrated with on the point subscript. Sufficient statistics are stored on denorma-the SQL Server DBMS. On the other hand, the two main lized tables. The euclidean distance computation uses a flattenedmechanisms to integrate data mining algorithms without modify- (horizontal) version of the cluster centroids matrix, which enables arithmetic expressions. The nearest cluster per class, required by K- means, is efficiently determined avoiding joins and aggregations. TABLE 7 Comparing SQL, C++, and ODBC with NB (d ¼ 4) (Seconds) Experiments with real data sets compared NB, BKM, and decision trees. The numeric and discrete versions of NB had similar accuracy. BKM was more accurate than NB and was similar to decision trees in global accuracy. However, BKM was more accurate when computing a breakdown of accuracy per class. A low number of clusters produced good results in most cases. We compared equivalent implementations of NB in SQL and C++ with large data sets: SQL was four times slower. SQL queries were faster than UDFs Authorized licensed use limited to: Asha Das. Downloaded on June 17,2010 at 06:10:21 UTC from IEEE Xplore. Restrictions apply.
  6. 6. 144 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010to score, highlighting the importance of our optimizations. NB and REFERENCESBKM exhibited linear scalability in data set size and dimensionality. [1] P. Bradley, U. Fayyad, and C. Reina, “Scaling Clustering Algorithms to There are many opportunities for future work. We want to Large Databases,” Proc. ACM Knowledge Discovery and Data Mining (KDD)derive incremental versions or sample-based methods to accelerate Conf., pp. 9-15, 1998. [2] T. Hastie, R. Tibshirani, and J.H. Friedman, The Elements of Statisticalthe Bayesian classifier. We want to improve our Bayesian classifier Learning, first ed. Springer, produce more accurate models with skewed distributions, data [3] B.L. Milenova and M.M. Campos, “O-Cluster: Scalable Clustering of Largesets with missing information, and subsets of points having High Dimensional Data Sets,” Proc. IEEE Int’l Conf. Data Mining (ICDM), pp. 290-297, 2002.significant overlap with each other, which are known issues for [4] C. Ordonez, “Integrating K-Means Clustering with a Relational DBMSclustering algorithms. We are interested in combining dimension- Using SQL,” IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 188-201,ality reduction techniques like PCA or factor analysis with Feb. 2006.Bayesian classifiers. UDFs need further study to accelerate [5] C. Ordonez, “Building Statistical Models and Scoring with UDFs,” Proc. ACM SIGMOD, pp. 1005-1016, 2007.computations and evaluate complex mathematical equations. [6] S. Thomas and M.M. Campos, SQL-Based Naive Bayes Model Building and Scoring, US Patent 7,051,037, US Patent and Trade Office, 2006. [7] R. Vilalta and I. Rish, “A Decomposition of Classes via Clustering toACKNOWLEDGMENTS Explain and Improve Naive Bayes,” Proc. European Conf. Machine Learning (ECML), pp. 444-455, 2003.This research work was partially supported by US National [8] H. Wang, C. Zaniolo, and C.R. Luo, “ATLaS: A Small but Complete SQLScience Foundation grants CCF 0937562 and IIS 0914861. Extension for Data Mining and Data Streams,” Proc. Very Large Data Bases (VLDB) Conf., pp. 1113-1116, 2003. . For more information on this or any other computing topic, please visit our Digital Library at Authorized licensed use limited to: Asha Das. Downloaded on June 17,2010 at 06:10:21 UTC from IEEE Xplore. Restrictions apply.