An Evaluation of Commercial Data Mining: Oracle Data Mining

Emily Davis
Computer Science Department
Rhodes University

Supervisor: John Ebden

November 2004

Submitted in partial fulfilment of the requirements for BSc. Honours in Computer Science
Acknowledgements

I am very grateful for all the advice and assistance given to me by my supervisor, John Ebden. I am exceedingly thankful for all the time and effort he put into helping me produce this work. I am also grateful for the funding provided by the Andrew Mellon Foundation in the form of an Honours Degree Scholarship. I must acknowledge the financial and technical support of this project by Telkom SA, Business Connexion, Comverse SA, and Verso Technologies through the Telkom Centre of Excellence at Rhodes University. I must also thank the technical division in the Computer Science Department at Rhodes University, especially Jody Balarin and Chris Morley, for their help.
Table of Contents

Abstract
This project describes an investigation of a commercial data mining suite, namely that available with Oracle9i database software. The investigation was conducted in order to determine the type of results achieved when data mining models were created using Oracle's data mining components and applied to data. Issues investigated in this process included whether the algorithms used in the evaluation found a pattern in a data set, which of the algorithms built the most effective data mining model, the manner in which the data mining models were tested, and the effect the distribution of the data set had on the testing process. Two algorithms in the Classification category, Naïve Bayes and Adaptive Bayes Network, were used to build the data mining models. The models were then tested to determine their accuracy and applied to new data to establish their effectiveness. The results of the testing process and the results of applying the models to new data were analysed and compared as part of this investigation. A number of conclusions were drawn from this investigation, namely that Oracle Data Mining provides all the functionality necessary to easily build an effective data mining model, and that the Adaptive Bayes Network algorithm produced the most effective data mining model. As far as actual results were concerned, the accuracy the models displayed during testing was not a good indication of the accuracy they would display when applied to new data, and the distribution of the target attribute in the data sets had an impact on the data mining models and the testing thereof.

Section 1 Introduction
Chapter 1 Introduction

The purpose of this evaluation is to determine how the Oracle Data Mining suite provides data mining functionality. This involves investigating a number of issues:

1. How easy the tools available with the data mining software are to use, and in what ways they provide aspects of data mining such as data preparation, building of data mining models and testing of these models.
2. Whether the algorithms selected for this evaluation found a useful pattern in a data set, and what happened when the models produced by the algorithms were applied to a new data set.
3. Which of the algorithms investigated built the most effective data mining model, and under what circumstances this occurred.
4. How the models were tested, and whether test results gave an indication of how the models would perform when applied to new data.
5. Lastly, the manner in which the distribution of the data used to build the data mining models affected the models, and how the distribution of the data used to test the models affected the test results.

1.1 Background to Data Mining

Data mining is a relatively new offshoot of database technology which has arisen primarily as a result of the ability of computers to:

• Store vast quantities of data in data warehouses. (Data warehouses differ from operational databases in that the data in a warehouse is historical; it does not consist only of the active records in a database.)
• Implement various algorithms for the mining of data.
• Use these algorithms to analyse these vast quantities of data in a reasonable amount of time.

The ability to store vast amounts of data is of little use if the data cannot somehow be organised in a meaningful way. Data mining achieves this by discovering the patterns in data that represent knowledge and providing some sort of description or abstraction of what is contained in a data set. These patterns allow organisations to learn from past behaviour stored in historical data and exploit those patterns that work best for them.

There are various ways to classify data mining into categories, as suggested by a number of authors. Berry and Linoff [2000] classify the various techniques of data mining into two main categories: directed data mining and undirected data mining. Geatz and Roiger [2003] divide data mining into two categories, supervised and unsupervised learning. Al-Attar [2004] makes a distinction between data mining and data modelling.

Berry and Linoff [2000] suggest considering the goals of the data mining project when classifying data mining and, accordingly, what techniques can be used to fulfil these goals. Predictive techniques are useful for making predictions, and descriptive techniques help with understanding of a problem space. According to Berry and Linoff [2000], directed data mining involves using the data to build a model that describes one particular variable of interest in terms of the rest of the data. This category includes techniques such as classification, estimation and prediction. Undirected data mining builds a model with no single target variable, aiming rather to establish the relationships among all the variables. Included in this category are affinity groupings or association discovery, clustering (classification with no predefined classes) and description or visualization. [Berry and Linoff, 2000]

Geatz and Roiger [2003] define input variables as independent variables and output variables as dependent variables. It can then be deduced that dependent variables do not exist in unsupervised learning, as no output variable is produced but rather a descriptive relationship; in supervised learning a predictive, dependent variable is produced as output. According to Al-Attar [2004], data mining results in patterns that are understandable, such as decision trees, rules and associations. Data modelling produces a model that fits the data and that can be understandable (trees, rules) or presented as a black box, as in neural networks.
In keeping with these definitions it is possible to say that directed data mining, supervised learning and Al-Attar's [2004] definition of data mining describe similar predictive techniques and fall into the category of supervised learning. Undirected data mining, unsupervised learning and Al-Attar's [2004] data modelling are in the same class as descriptive techniques and fall into the category of unsupervised learning.

1.2 Supervised Learning and Classification Techniques

Algorithms are used to implement the techniques in these various data mining categories. Supervised learning covers techniques that include prediction, classification, estimation, decision trees and association rules. As this evaluation investigates classification techniques, these will be discussed in further detail. Geatz and Roiger [2003] describe classification as a technique where the dependent or output variable is categorical; the emphasis of the model is on assigning new instances of data to categorical classes. The authors describe estimation as a similar technique that is used to determine the value of an unknown output attribute that is numerical. Geatz and Roiger [2003] state that prediction differs from the two techniques mentioned above only in that it is used to determine future outcomes of data. Classification techniques such as these are generally used when there is a set of input and output data, as dependent and independent variables exist in the data.

1.3 Oracle Data Mining (ODM)

Oracle embeds data mining in the Oracle 9i Enterprise Edition version 9.2.0.5.0 database, which allows for integration with other database applications. All data mining functions are provided through the Java API, giving the data miner complete control over the data mining functions. [Oracle9i Data Mining Concepts Release 2 (9.2), 2002]

The Oracle Data Mining suite is made up of two components, the data mining Java API and the Data Mining Server (DMS). [Oracle9i Data Mining Concepts Release 2 (9.2), 2002] The DMS is a server-side component that provides a repository of metadata for the input and result objects of data mining. The DMS also provides a connection to the database and access to the data that is mined. It is possible to use JDeveloper 10g to provide access to the Java API and the DMS. The data mining can then be performed using Data Mining for Java (DM4J) 9.0.4 or by writing Java code; DM4J provides a number of wizards that automatically produce the Java code. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
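The ODM Java API classes themselves are not reproduced here, but the database connection that the DMS and DM4J sit on top of can be illustrated with plain JDBC. The following is a minimal sketch assuming a local Oracle 9i instance; the host, SID and account details are placeholders, not values from the project.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DmsConnectionSketch {
    public static void main(String[] args) throws Exception {
        // Load the Oracle thin JDBC driver of the Oracle 9i era.
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:ORCL", // hypothetical host/SID
                "odm_user", "odm_password");             // hypothetical account
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM WEATHER_BUILD");
        if (rs.next()) {
            System.out.println("WEATHER_BUILD records: " + rs.getInt(1));
        }
        conn.close();
    }
}
```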
1.3.1 Oracle Data Mining Algorithms

ODM supports a number of algorithms, and the choice of algorithm depends on the data available for mining as well as the format of results required. This project has made use of the Adaptive Bayes Network and Naïve Bayes algorithms, which are Classification algorithms that assign new instances of data to categorical classes and can be used to make predictions when applied to new data.

1.3.2 Functionality of Oracle Data Mining Algorithms and ODM

Mining tasks are available to perform data mining operations using these algorithms; these tasks include building and testing models, computing model lift and applying models to new data (scoring). DM4J wizards control the preparation and mining of data as well as the evaluation and scoring of models. DM4J can also automatically generate the Java and SQL code needed to transfer the data mining into integrated data mining or business intelligence applications. [Oracle Data Mining for Java (DM4J), 2004]

1.4 Chapter Summary

This chapter introduces the evaluation and describes what is hoped to be achieved by investigating the Oracle Data Mining suite. A short background to data mining is presented, and supervised learning and Classification techniques are introduced. A short introduction to ODM is also presented. The next chapter will describe the approach taken by this evaluation and will present reasons for some of the design decisions.
Section 2 Evaluation of Oracle Data Mining

Chapter 2 Methodology of the Evaluation
This chapter aims to provide an explanation of the approach that has been taken during this evaluation. It explains why ODM was selected as the data mining tool to be evaluated, as well as why the Naïve Bayes and Adaptive Bayes Network algorithms were used to build the data mining models. The parameters required by these algorithms are explained and the data used during this evaluation is described.

2.1 Approach

One purpose of this evaluation is to determine what functionality is provided with ODM, as well as to ascertain what kinds of models can be produced by ODM. In order to make these discoveries, it is necessary to use a number of algorithms in the data mining suite to build data mining models, to test the accuracy of these models and to validate the results these models produce when applied to new data. To be able to compare the results the models produce, it has been necessary to select two data mining algorithms that fall into the same categories, in this case supervised learning and classification. For this reason, Naïve Bayes for Classification and Adaptive Bayes Network for Classification have been selected, as both algorithms fall into the supervised learning category and can be used to make predictions. These predictions could then be compared to determine which models, built using the different algorithms, are more effective. Both algorithms allow for building the model, testing the model, computing model lift (a measure of how quickly the model finds actual positive target values) and application of the model to new data.

An Oracle 9i Enterprise Edition version 9.2.0.5.0 database was configured, and the tools and software for data mining were installed and configured for use with the database. For the purposes of this investigation, JDeveloper 10g provides the access to the Java API and the DMS. The data mining itself is performed using DM4J 9.0.4, an extension of JDeveloper that provides the user with a number of wizards that automatically create the Java programs that perform the data mining when these programs are run. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
The data used during the evaluation was obtained from http://www.ru.ac.za/weather/, which provides an archive of weather data for the Grahamstown area covering a number of years. It was deemed more interesting to use this data to determine whether a pattern was present when conducting the evaluation, as the results would be of more interest than sample data with little relevance to Rhodes University. The two Classification algorithms were then used to build, test and apply a number of data mining models to the data, and it was then possible to compare the predictions made by each model. During the model building stage it was possible to build the models using prepared and unprepared data, as well as to build models using the different techniques, to determine the effect this had on the results. During testing of the models it was possible to compare the models' accuracy and to measure how quickly each model finds actual positive target values (model lift). Once the models had been built and tested it was possible to apply the models to new data and then compare the predictions made by each model to those of the other models, as well as to the actual values in the historical data. It was also of interest to compare the results of testing the models to those of applying the models to new data.

2.2 Choice of Data Mining Tool

It was decided to evaluate the data mining functionality provided with the Oracle9i Enterprise Edition database. An aspect of ODM that supported its use was that all data mining processing occurs within the database. This removes the need to extract data from the database in order to perform the mining, as well as reducing the need for hardware and software to store and manage this data. According to Berger [2004] this results in a more secure and stable data management and mining environment and enhances productivity, as the data does not have to be extracted from the database before it is mined.

ODM uses Java code to build, test and apply the models. It was decided to use DM4J 9.0.4 (an extension of JDeveloper 10g) to conduct the data mining, as DM4J provides wizards that allow the user to adjust the settings for the data mining and automatically generates the Java code that is run when the mining is performed. This functionality allows novice users to use the default settings for the various algorithms, while more advanced users can experiment with the different settings without having to rewrite vast amounts of code. DM4J also provides access to the Oracle 9i database and the data used for the data mining, which allows the user to carry out data preparation within the database using similar wizards. These factors would allow the ease of use of the tools to be evaluated, and allow it to be determined how the various stages of the data mining process are supported by ODM.

In the study of related literature it is apparent that a number of authors feel data mining should be conducted in a procedural manner. Al-Attar [2004] feels that a step-by-step data mining methodology needs to be developed to allow non-experts to conduct data mining, and that this methodology should be repeatable for most data mining projects. This and similar statements show the need for a well-defined data mining process to be used by data miners. Geatz and Roiger [2003] introduce the KDD (Knowledge Discovery and Data Mining) data mining process, in which emphasis is placed on data preparation for model building and which involves:

• Identification of the goal to be achieved using data mining.
• Selecting the data to be mined.
• Data preprocessing in order to deal with noisy data.
• Data transformation, which involves the addition or removal of attributes and instances, normalizing of data and type conversions.
• The actual data mining; at this stage the model is built from training and test data sets.
• Interpreting the resulting model to determine whether the results it presents are useful or interesting.
• Applying the model or acquired knowledge to the problem.

When this suggested process is compared to the process used by ODM as depicted in Figure 1, it is apparent that ODM makes use of similar stages in its data mining and places the necessary emphasis on preparation of data and evaluation of results. This suggests that ODM provides access to the necessary stages involved in conducting a more successful data mining project.

Figure 1. The Oracle Data Mining Process [Berger, 2004]

2.3 The Data

The data used in this evaluation consists of a number of tables that are stored in the Oracle database and available in Appendix B on the CD-ROM that accompanies this project. The data was created from a weather data archive available at http://www.ru.ac.za/weather/ compiled by Jacot-Guillarmod, F. According to the explanation on the web page, the data available at the site represents readings gathered at 5 minute intervals throughout a day. Data recorded includes:

• Temperature (degrees F)
• Humidity (percent)
• Barometer (inches of mercury)
• Wind Direction (degrees, 360 = North, 90 = East)
• Wind Speed (MPH)
• High Wind Speed (MPH)
• Solar Radiation (Watts/m^2)
• Rainfall (inches)
• Wind Chill (computed from high wind speed and temperature)

Preparing the data to create the database tables involved removing the reading of rainfall in inches from the records and replacing it with a 'yes' or 'no' value, depending on whether rain had been measured or not (a sketch of this recoding step appears below). The 5 minute interval measurements are thus used to determine whether rain had been recorded on the day the measurements were taken. Although information is lost regarding the amount of rain that had fallen on a specific day, for the purposes of this evaluation it is only of interest whether rain fell at all on a specific day, as the predictions made by the algorithms are categorical. This categorical variable, which was named RAIN, would then be predicted by the models when applied to new data of the same format. The resulting structure of the tables of data is depicted in Table 1.

Name    | Data Type | Size | Nulls?
THETIME | NUMBER    |      | NO
TEMP    | NUMBER    |      | YES
HUM     | NUMBER    |      | YES
BARO    | NUMBER    |      | YES
WDIR    | NUMBER    | 3    | YES
WSPD    | NUMBER    |      | YES
WSHI    | NUMBER    |      | YES
SRAD    | NUMBER    |      | YES
CHILL   | NUMBER    |      | YES
RAIN    | VARCHAR   | 3    | YES

Table 1. Mining Data Table Structure

The data set WEATHER_BUILD is used for the building of the data mining models for both algorithms. This data set consists of 2601 records and is created from a number of daily weather archives recorded in September 2004.
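To make the recoding step concrete, the following sketch shows the idea: any measured rainfall in a day's 5 minute readings yields a RAIN value of 'yes'. The record layout and method names are assumptions for illustration, not the project's actual preparation scripts.

```java
public final class RainRecoder {
    // Returns "yes" if any of a day's 5 minute rainfall readings (inches)
    // is non-zero, otherwise "no" -- the categorical RAIN value described above.
    static String rainForDay(double[] fiveMinuteRainfallInches) {
        for (double reading : fiveMinuteRainfallInches) {
            if (reading > 0.0) {
                return "yes";
            }
        }
        return "no";
    }

    public static void main(String[] args) {
        double[] dryDay = {0.0, 0.0, 0.0};
        double[] wetDay = {0.0, 0.02, 0.0};      // rain measured in one interval
        System.out.println(rainForDay(dryDay));  // no
        System.out.println(rainForDay(wetDay));  // yes
    }
}
```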
The test data set used to evaluate the effectiveness of the models is created from WEATHER_BUILD, and the process of creating this data set will be explained in more detail later in the project. WEATHER_APPLY consists of 290 records and is the data set to which the built and tested model is applied in order to make predictions. All the actual values of the RAIN attribute had been removed and stored for later comparison. This means the models will predict whether the value of RAIN will be 'yes' or 'no', and it will then be possible to compare these predictions with the actual values in the original data used to create WEATHER_APPLY. The results of the application of the models to the data are stored by DM4J for inspection and use. It is also possible to export the results to spreadsheet format, which has been done in this case to allow for comparison between models and with the actual data values.

2.4 Classification Algorithms

The two algorithms selected for the evaluation were Naïve Bayes and Adaptive Bayes Network. Both are classification algorithms that allow the data miner to build a model using historical data and then apply this model to new data in order to make predictions regarding a dependent, categorical variable in the data. Berger [2004] states that both algorithms should be used in a data mining project to see which algorithm is able to build the better model. This provides a further justification for the comparison of these two algorithms within the data mining suite.

2.4.1 Naïve Bayes

The Naïve Bayes algorithm builds a model that predicts the probability of a variable falling into a categorical class. This is achieved by discovering patterns present in the data and counting the number of times certain conditions or relationships in the data occur. [Berger, 2004] The data mining model represents these relationships and can be applied to new data to make predictions. The algorithm makes use of Bayes' Theorem, which is statistical in nature. [Berger, 2004] The algorithm is said to provide quicker model building and faster application to new data than the Adaptive Bayes Network algorithm. Naïve Bayes can also be used to make predictions of categorical classes that consist of binary outcomes or multiple categories of outcomes. [Berger, 2004]

2.4.2 Adaptive Bayes Network

The Adaptive Bayes Network model provides similar functionality to that of Naïve Bayes, but can also be used to generate rules or decision tree-like outcomes when built, and again to make predictions when applied to new data. The rules that are generated are easy to interpret in the form of "if...then" statements. Berger [2004] states that this algorithm can be used to build better models than Naïve Bayes, but it requires a larger number of parameters to be set and it tends to take a longer time to build such a model.

2.5 Algorithm Settings

2.5.1 Naïve Bayes Settings

Naïve Bayes works by looking at the build data and calculating conditional probabilities for the target value. This is done by observing the frequency of certain attribute values and combinations thereof. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] The two parameters that must be supplied to the Naïve Bayes build wizard, as shown in Figure 2, indicate how outliers in the data should be treated; occurrences below the threshold values are ignored when creating the model. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]

The singleton threshold provides a threshold for the count of items that occur frequently in the data. Given k as the number of times the item occurs in the data, P as the number of records and t as the singleton threshold expressed as a percentage of P, the item is considered to occur frequently if k >= t*P. [Oracle Help for Java, 1997-2004] The pairwise threshold provides a threshold for the count of pairs of items that occur frequently in the data. Given k as the number of times two items appear together in the records, and P and t as above, a pair is frequent if k > t*P. [Oracle Help for Java, 1997-2004] A small sketch of these two tests follows Figure 2.

Figure 2. Naïve Bayes algorithm settings
  17. 17. The singleton threshold value provides a threshold for the count of items that occur frequently in the data. Given k as the number of times the item occurs in the data, P as the number of records and t as the singleton threshold expressed as a percentage of P; then the item is considered to occur frequently if k>=t*P. [Oracle Help for Java,1997-2004] The pairwise threshold provides a threshold for the count of pairs of items that occur frequently in the data. Given k as the number of times two items appear together in the records and P and t as above; a pair is frequent if k>t*P. [Oracle Help for Java, 1997-2004] Figure 2 Naïve Bayes algorithm settings 2.5.2 Adaptive Bayes Network Settings 17 An Evaluation of Commercial Data Mining: Oracle Data Mining
  18. 18. Adaptive Bayes Network works by ranking the attributes in a data set and then building a Naïve Bayes model in order of the ranked attributes. The algorithm then builds a set of features or ‘trees’ using these attributes which are in turn tested against the model in order to determine whether they improve the accuracy of the model or not. If no improvement is found the feature is discarded. When the number of discarded features reaches a certain level the building stops and the model is those features that remain. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] The choice of settings when building an Adaptive Bayes Network model allows the user to choose from three types of models: SingleFeatureBuild, MultiFeatureBuild and NaiveBayesBuild. The SingleFeatureBuild model produces rules in an “if….then” format and produces only one feature. The parameters required by this type of model are shown in Figure 3 and include the maximum depth of the feature (number of attributes in the feature) and number of predictors to use during the building of the model. It is then possible for the algorithm to determine which attributes to include in the feature and how many to include up to the specified maximum. [Oracle Help for Java, 1997-2004] A greater feature depth as well as a greater number of predictors included will result in a slower model building process. The MultiFeatureBuild model does not generate any rules. This model builds a form of Naive Bayes model and creates one or more features made up of a number of attributes. 18 An Evaluation of Commercial Data Mining: Oracle Data Mining
  19. 19. The parameters required by this kind of model are the maximum number of features to build and, as with the SingleFeatureBuild model type, the maximum number of predictors or attributes to use while the model is built. Also to be specified is the maximum number of failures to allow when a feature is tested against model accuracy, before it is discarded and the number of attributes allowed in a feature. [Oracle Help for Java, 1997-2004]. Again a greater feature depth, greater number of predictors and a greater number of failures allowed will result in a slower model build process. Figure 3 Adaptive Bayes Network algorithm settings 19 An Evaluation of Commercial Data Mining: Oracle Data Mining
  20. 20. The NaiveBayesBuild model type does not generate rules either and, like the MultiFeatureBuild, also builds a form of Naïve Bayes model. The maximum number of predictors to consider during the build process must be specified by the user. [Oracle Help for Java, 1997-2004] Again, the greater the number of predictors the algorithm must consider, the slower the model building will be. The type of model created in all the Adaptive Bayes Network models in this evaluation was the SingleFeatureBuild. This model type was chosen as in the explanations of the model types it appears to be the model type that results in a model less similar to a Naïve Bayes model. Also it is the only model type that produces rules and the rules produced by the model would be of interest in determining what aspects of the data influenced the predictions made by the model. 2.6 Chapter Summary This chapter has described what has hoped to have been achieved by building data mining models using the Naïve Bayes and Adaptive Bayes Network algorithms in the Oracle Data Mining suite. The reasons for selecting Oracle Data Mining for this research have been highlighted. The models built using the algorithms have been outlined and the parameters required by each algorithm have been described. The source of the data used for this evaluation has been explained as well as how the data sets for the data mining were created. The next chapter will describe the process of preparing the data, building the models, testing the models and training and tuning the models. 20 An Evaluation of Commercial Data Mining: Oracle Data Mining
  21. 21. Chapter 3 Classification Models This chapter describes the process of building the Classification models. The process of preparing the data to create the build and test data sets is discussed and the Priors technique is introduced. The actual model building is explained in this chapter. The model testing process is described including aspects like model accuracy, confusion matrices and model lift. The process of training and tuning the models to increase their effectiveness is explained. This chapter provides insight into how ODM provides data mining functionality. 3.1 Preparing the Data 3.1.1 Build and Test Data Sets 21 An Evaluation of Commercial Data Mining: Oracle Data Mining
  22. 22. Pyle [2000] emphasises the importance of proper data preparation for data mining and says the benefits of data mining using properly prepared data include the creation of more effective models faster. He states that at least two outputs are required from data preparation: the training data set which is used for building the model and the testing data set which helps detect overtraining (noise trained into the model). These data sets are used by the data mining suite later in the data mining process. In the case of this evaluation it was necessary to use the data in WEATHER_BUILD to create the training and testing data sets. DM4J provides a tool which allows the user to create randomized build and test tables from the existing data. The wizard is known as the Transformation Split wizard and is specifically developed for use with Classification models. The wizard allows the user to select which data is to be used to create the new tables as well as to specify what percentage of records in the original data should be allocated to each of the build and test tables. WEATHER_BUILD was used as the original data and 75% of the records were allocated to the build table and 25% were placed in the test table. That is, 1951 records were randomly selected from WEATHER_BUILD and placed in the build table and the remaining 650 records were placed in the test table. These ratios were chosen because the varying nature of the weather data meant it would be more beneficial to have a larger number of cases in the build data set, thus allowing the data mining model to be aware of a larger number of cases that influenced the target attribute RAIN. The wizard produced the Transformation Split component which was run and the resulting tables were named THE_BUILD and THE_TEST and were stored in the database along with the original data. 3.1.2 Priors In a number of scenarios where the variable that is being predicted is binary in nature, one outcome of this variable may occur more frequently in the data that the other. When the model is built from such data the model may not observe enough of the one case to build an accurate model and may predict the other case nearly every time but 22 An Evaluation of Commercial Data Mining: Oracle Data Mining
  23. 23. still show a high accuracy during testing. In order to prevent this from occurring, it is necessary to create a build table that has approximately equal numbers of each outcome and also to supply the algorithm with the original distribution of the data or the prior distribution. This technique, known as Priors, should result in a more effective model. However, the model must be tested against data of the original distribution. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] In order to determine the effect of using such a technique as a form of data preparation it was decided to build models using both algorithms that would use data prepared in this way. When the data in THE_BUILD data set was examined it was apparent that the ‘no’ outcome occurred more frequently than the ‘yes’ for the target attribute RAIN as shown in Figure 4. An outcome of ‘no’ occurred 1242 times and a ‘yes’ occurred 727 times. Histogram for:RAIN 1400 1200 1000 Bin Count 800 600 400 200 0 yes no Bin Range Figure 4. Data Distribution for RAIN Attribute from THE_BUILD data set It was possible to create a build data set with a more even distribution of the target attribute. This was accomplished using the ODM browser and a Transformation wizard which created a stratified sample of the data with a balanced distribution of the target attribute. Stratified random sampling divides the data set into subpopulations and samples are then taken from these in proportion to subpopulation size. [Fernandez, 2003] As there were 727 cases of ‘yes’ for the RAIN attribute, creating a balanced data set would require a data set of approximately twice that size (1454). This data set was created by the wizard and named THE_BUILD1. When the 23 An Evaluation of Commercial Data Mining: Oracle Data Mining
  24. 24. distribution of the RAIN attribute was inspected again a more balanced distribution was shown as depicted in Figure 5. Histogram for:RAIN 1400 1200 1000 Bin Count 800 600 400 200 0 yes no Bin Range Figure 5. Data Distribution for RAIN Attribute from THE_BUILD1 data set The data sets THE_BUILD and THE_BUILD1 were used to build models using each algorithm and tested on the same test data, THE_TEST, in order to allow for an evaluation of the effect the distribution of the data has on the resulting models. 3.2 Building the Models In total, 8 classification models were built using DM4J, four using the Naïve Bayes for Classification algorithm and four using the Adaptive Bayes Network for Classification algorithm. Of the four for each algorithm, two models were built using the data set THE_BUILD where the Priors technique was not made use of and weighting was used in one of the two, two models were built using THE_BUILD1, using the Priors technique and again, weighting was used in one of the two. Weighting and its effects on the models will be discussed later in this chapter. All the models were built using the attribute RAIN as the target value. This means the models were built in order to predict the outcome, ‘yes’ or ‘no’, of RAIN when applied to new data. 3.2.1 Building the Naïve Bayes Models 24 An Evaluation of Commercial Data Mining: Oracle Data Mining
  25. 25. 3.2.1.1 nbBuild The first model built was named nbBuild and used the data set THE_BUILD which had the uneven distribution of the target attribute RAIN (as discussed in section 3.1.2). The Naïve Bayes algorithm was used and the default algorithm settings were used. This means the singleton threshold was 0.1 and the pairwise threshold was 0.1 for the model. 3.2.1.2 nbBuild2 The second model was named nbBuild2 and made use of the data set THE_BUILD1 which was adjusted using stratified sampling and the Priors technique to have an even distribution of the target value RAIN. When making use of the Priors technique it was necessary to specify in the model build wizard what the original distribution of the data had been in order for the algorithm to be aware of this when making its classifications. The values supplied at this stage of the model build process are shown in Figure 6. Again, the default algorithm settings of 0.1 for the pairwise and singleton thresholds were used. 3.2.2 Building the Adaptive Bayes Network Models 3.2.2.1 abnBuild The third model was named abnBuild and made use of the data set THE_BUILD. The Adaptive Bayes Network algorithm was used and a model type of SingleFeatureBuild was selected. This model type produces rules along with its predictions. The settings for the model type were left at the defaults. These settings included a maximum number of predictors of 25, a maximum network feature depth of 10 and no time limit for the running of the algorithm. 3.2.2.2 abnBuild2 The fourth model was named abnBuild2 and made use of THE_BUILD1 which was the adjusted data set. Again it was necessary to specify the original distribution of the 25 An Evaluation of Commercial Data Mining: Oracle Data Mining
  26. 26. original data set as shown in Figure 6. A SingleFeatureBuild model type was selected and the default settings as described above were used. Figure 6. Extract from Classification Model Build Wizard, Priors Settings. 3.3 Testing the Models Roiger and Geatz [2003] state that evaluation of supervised learning models involves determining the level of predictive accuracy and that supervised learning models can be evaluated using test data sets. Such models can be evaluated by comparing the test set error rates of supervised learning models created from the same training data to 26 An Evaluation of Commercial Data Mining: Oracle Data Mining
  27. 27. determine the accuracy of the models and which model is most effective. It is of interest how ODM supports testing, whether the accuracy a model displays during testing indicates how it will perform on new data and how the data used during testing affects the results of testing the models. The test model results produced by DM4J are depicted in confusion matrices. Confusion matrices can be used to determine the accuracy of Classification models and to show the number of false negative or false positive predictions made by the model on the test data. Confusion matrices are best used for evaluating the accuracy of models using categorical data which is being used in this case. [Roiger and Geatz, 2003] Roiger and Geatz [2003] provide an example of a confusion matrix as shown in Table 2, Model A is used to classify categorical data into two classes, Accept and Reject. The rows in the table represent the actual values in the data and the columns represent the predicted values. The model correctly classified 600 Accept instances from the data and correctly classified 300 Reject instances. However, there were actually 625 Accept instances in the data and 375 Reject instances. The model also classified 675 instances as Accept instances and 325 instances as Reject instances. The accuracy of the model is then determined by dividing 900 by 1000 and results in a 90% accuracy or an error rate of 10%. Example Model Predicted Accept Predicted Reject Actual Accept 600 25 Actual Reject 75 300 Table 2. Example Confusion Matrix 3.3.1 Model Accuracy The four models discussed in the previous section were each tested on the same test data set, THE_TEST, consisting of 633 records. The test accuracy for each model is shown in Table 3. It is interesting to note the greater accuracy of the models built using the Adaptive Bayes Network algorithm and that using the prior distribution technique appears to have had a negative impact on the test accuracy of the models. 27 An Evaluation of Commercial Data Mining: Oracle Data Mining
  28. 28. Model nbBuild nbBuild2 abnBuild abnBuild2 Test Accuracy 72.35387% 71.09005% 85.15008% 84.9921% Table 3. Model Test Accuracy Rates 3.3.2 Model Confusion Matrices Testing the models produced a confusion matrix for each model which showed the tendencies of the individual model’s predictions when examined. The following Tables 4-7 depict each models confusion matrix which is then discussed. Again, the rows represent actual values and the columns represent the predicted values. nbBuild no yes no 384 34 yes 141 74 Table 4. Confusion Matrix for Model nbBuild Testing When nbBuild was tested the model correctly predicted the value of the RAIN attribute in 384 + 74 = 458 cases out of 633 cases. As can be seen in the lower left corner of the matrix, the model also incorrectly predicts a larger number (141) of ‘no’ values that are actually ‘yes’ values. This error will be adjusted for when the model is tuned. nbBuild2 no yes no 320 98 yes 85 130 Table 5. Confusion Matrix for Model nbBuild2 Testing The nbBuild2 model correctly predicted the value of the RAIN attribute in 320 + 130 = 450 cases out of 633 cases. When tested this model shows less of a tendency for an error in a certain direction, i.e. ‘yes’ or ‘no’, as the false prediction numbers of 98 and 85 are close. This can be attributed to the fact that the model was built using the Priors technique, to compensate for the lower level of ‘yes’ values for RAIN in the original data. abnBuild no yes no 353 65 28 An Evaluation of Commercial Data Mining: Oracle Data Mining
  29. 29. yes 29 186 Table 6. Confusion Matrix for Model abnBuild Testing The abnBuild model correctly predicted the value of the RAIN attribute in 353 + 186 = 539 cases out of 633 cases. This model shows a higher accuracy during testing than the previous models built using Naïve Bayes. Testing also shows that this model makes a larger number of incorrect ‘yes’ predictions. This effect could also be minimised during tuning. abnBuild2 no yes no 346 72 yes 23 192 Table 7. Confusion Matrix for Model abnBuild2 Testing The abnBuild2 model correctly predicted the value of the RAIN attribute in 346+192 = 538 cases out of 633 cases. Similarly, this model shows a higher accuracy during testing than those models built using Naïve Bayes. This model also tends to make a larger number of incorrect ‘yes’ predictions. This too could be dealt with during model tuning. Once the accuracy of the models is tested it is possible to perform another kind of model testing using cumulative gains charts or lift charts. 3.4 Calculating Model Lift A lift or cumulative gains chart shows how well the model improves predictions of positive target attribute outcomes over a sample of the data containing actual results. The usefulness of such a technique would be apparent in a business problem where predicted positive values in a model may indicate possible business opportunities. Lift allows that miner to estimate how well the model will perform when applied to new data. [Oracle Help for Java, 1997-2004] 29 An Evaluation of Commercial Data Mining: Oracle Data Mining
  30. 30. Figure 7. nbBuild Lift Chart Figure 7 shows the cumulative lift chart for nbBuild when applied to the test data set, THE_TEST. The value in the first column, approximately 2.4, indicates that the model should find approximately 2.4 times as many actual positive values for the RAIN attribute than a random selection of 10% of the data would show. Figure 8. nbBuild2 Lift Chart 30 An Evaluation of Commercial Data Mining: Oracle Data Mining
  31. 31. Figure 8 depicts the cumulative lift chart for nbBuild2 when applied to the test data set. In the first and second columns the graph indicates that the model should find approximately 2.4 times as many positive values than random selection. Figure 9. abnBuild Lift Chart Figure 9 shows the cumulative lift chart for abnBuild when applied to the test data set. The value of approximately 2.6 indicates that the model finds approximately 2.6 times as many positive values as random selection would. 31 An Evaluation of Commercial Data Mining: Oracle Data Mining
  32. 32. Figure 10. abnBuild2 Lift Chart Figure 10 shows the cumulative lift chart for abnBuild2 when applied to the test data set. Similarly, the value of approximately 2.6 indicates that the model finds approximately 2.6 times as many positive values as random selection would. It is evident from the above charts that, although the accuracy of the models is not high in all cases, when applied to new data they should provide a far greater level of accuracy than attempting to make predictions using no model at all. 3.5 Training and Tuning the Models Using ODM it is possible to assign weights to the target value when using Naïve Bayes or Adaptive Bayes so that the model predicts more of one kind of outcome if it appears that there are a large number of false predictions of a certain kind when testing the model. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] This bias can be built into the model to increase predictions of the desired target value. In this investigation weighting was used to introduce this bias because when testing the nbBuild model, it was apparent from the confusion matrix that a significant error was encountered as the model predicted a large number of false negatives, ‘no’ values for the target attribute RAIN that were in fact ‘yes’ values. These predictions were 32 An Evaluation of Commercial Data Mining: Oracle Data Mining
  33. 33. false in 141 of the cases. This level of false predictions was high, thus it was viable to use weighting in order to decrease the number of false negative predictions. A weighting value is often chosen by trial and error and is then associated with a certain type of prediction, false negative or positive, and the model will then treat a false prediction of that kind as ‘the weighting value’ times as costly as an error of the other kind. This forces the model to make more predictions in the other direction. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] As it was apparent during testing that nbBuild predicted a large number of false negatives and as this was the most substantial error out of all the models, it was decided to build another four models, two more for each algorithm, which incorporated a weighting of 3 against false negatives. The weighting value of 3 was chosen after some experimentation and used on all the new models as shown in the extract of the model build wizard for abnBuild4 in Figure 11. The Priors technique was used in one case for each algorithm. The models were then tested on the same test data set, THE_TEST, which the previous models were tested on. Table 8 represents the previous models’ test accuracy rates and Table 9 represents the new weighted models’ test accuracy rates. Model nbBuild (no nbBuild2 abnBuild (no abnBuild2 Priors) Priors) Test Accuracy 72.35387% 71.09005% 85.15008% 84.9921% Table 8 Unweighted Models’ Test Accuracy Rates Model nbBuild3 (no nbBuild4 abnBuild3 (no abnBuild4 Priors) Priors) Test Accuracy 72.511846% 68.24645% 77.40916% 77.40916% Table 9 Weighted Models’ Test Accuracy Rates 33 An Evaluation of Commercial Data Mining: Oracle Data Mining
  34. 34. Figure 11. Extract Showing Weighting of Model Build Wizard for abnBuild4 In only one case, nbBuild3, did weighting improve model test accuracy when compared to the model with the same settings, nbBuild, before weighting was added. It is of interest to compare the confusion matrices for these two models. nbBuild no yes no 384 34 yes 141 74 Table 10 nbBuild Confusion Matrix 34 An Evaluation of Commercial Data Mining: Oracle Data Mining
  35. 35. nbBuild3 no yes no 381 37 yes 137 78 Table 11 nbBuild3 Confusion Matrix Table 10 shows the confusion matrix for nbBuild and Table 11 shows the confusion matrix for the weighted model nbBuild3. nbBuild3 was weighted 3 against false negatives. The effects of this weighting are shown in the decrease of false ‘no’ predictions, from 141 to 137, the increase in correct ‘yes’ predictions, from 74 to 78, and the increase in false ‘yes’ predictions, from 34 to 37. The affect of the weighting seems minimal but can be increased by increasing the value of the weighting. However, since the weighting appears to have had a negative impact on the test accuracy of other models it was decided to leave the value at 3. 3.6 Applying the Models to New Data At this stage it is necessary to provide a summary of the models built and tested thus far. This summary is provided in Table 12. Classification Algorithm Naïve Bayes Adaptive Bayes Network No weighting, no use of nbBuild abnBuild Priors No Weighting, use of nbBuild2 abnBuild2 Priors Weighting, no use of nbBuild3 abnBuild3 Priors Weighting, use of Priors nbBuild4 abnBuild4 Table 12. Summary of Classification Models The models were applied to the new data in the WEATHER_APPLY set. The results were depicted according to the unique THE_TIME attribute for each record and showed a prediction, ‘yes’ or ‘no’, of whether it was likely to rain. The results were exported to spreadsheets to allow for inspection and the comparisons are discussed in the following chapters. 35 An Evaluation of Commercial Data Mining: Oracle Data Mining
  36. 36. 3.7 Chapter Summary The eight Classification models that have been built have been discussed and it is apparent that the algorithms have found a pattern in the data. The support ODM provides for the process of preparing the data to build the models has been described and the Priors technique has been explained. The model testing process has been described and has given an indication of the accuracy of the models. It will be interesting to compare this accuracy with the accuracy the models exhibit when applied to new data. Model lift has been calculated for the models. Four of the models have been tuned by introducing weighting into the models. The models have been applied to new data and the results of this are described in the next chapter. 36 An Evaluation of Commercial Data Mining: Oracle Data Mining
  37. 37. Chapter 4 Model Results This chapter describes the results obtained when the models were applied to new data. Extracts of the results are provided to show how these can be interpreted. The rules associated with the predictions made by the Adaptive Bayes Network models are explained. As a form of external validation the predictions made by the models are compared to the actual values in the original data. The results of this validation are compared for the eight models in order to determine which model is most effective when applied to new data and with what settings this model was built. 4.1 Results of Application to New Data The eight classification models were applied to the new data in the WEATHER_APPLY data set. This data set consisted of 290 records all of which had had the value for the RAIN attribute removed. These values had been stored for later comparison. The results were depicted by THE_TIME attribute and showed a prediction ‘yes’ or ‘no’, of whether it was likely to rain for all 290 records. The probability of this prediction was also depicted as shown in a sample from the results for nbBuild in Table 13. The results in this extract can be interpreted as at THE_TIME attribute with value 1, it is predicted that no rain will have been measured and this prediction is given with a probability of 0.9999. At 37 An Evaluation of Commercial Data Mining: Oracle Data Mining
  38. 38. THE_TIME attribute with value 138 it is predicted that rain will have been measured with a probability of 0.6711. PREDICTION PROBABILITY THE_TIME no 0.9999 1 yes 0.6711 138 Table 13. Extract of results from model nbBuild Those models that were weighted provided predictions and cost figures. This cost figure is provided instead of probability as the model makes prediction based on the cost of an incorrect prediction to the model’s accuracy. This cost figure is determined by the weighting of a certain type of false prediction when the model is tuned and the algorithm then attempts to minimise costs when making predictions. An extract from these types of results is shown in Table 14. This extract can be interpreted as at THE_TIME attribute of value 1, it is predicted that no rain will have been measured and the cost of such a prediction is 0. At THE_TIME attribute of value 138 it is predicted that rain will have been measured, if this prediction is incorrect the cost is higher at 0.3288 which is due to the fact that a target value of ‘yes’ was weighted to avoid false negatives. Low cost can be interpreted as high probability as can be seen from comparing the two extracts, but it is not possible to directly calculate probability from cost. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] PREDICTION COST THE_TIME no 0 1 yes 0.3288 138 Table 14. Extract of results from model nbBuild3 4.1.1 Rules Associated with Adaptive Bayes Network Predictions 38 An Evaluation of Commercial Data Mining: Oracle Data Mining
  39. 39. Those models that were built using the Adaptive Bayes Network algorithm provide the same format of results as shown in Tables 13 and 14 but also provide the rule with which the associated prediction was made. During the model build stage these rules are generated and then predictions are made using these rules when the model is applied to new data. However, not all rules are made use of when the model is applied to new data. The format of these results is shown in Table 15. PREDICTION PROBABILITY RULE_ID THETIME no 0.5418 52 1 yes 0.6677 53 138 Table 15 Extract of Results from model abnBuild showing rules After inspecting the spreadsheets containing the results of those models built using the Adaptive Bayes Network algorithm, it was apparent that when the models were applied to the new data only 8 of the 61 rules generated during the model building process were used to make the predictions. These 8 rules will be expanded upon in Table 16. Rule ID If (Condition) Then Confidence Support (classification) 2 CHILL in (37 - no 0.63258135 0.104113765 46.6) 38 CHILL in (37 - yes 0.6427132 0.019299136 46.6) and WDIR in (22 - 89.6) 43 CHILL in (37 - yes 0.94884205 0.019807009 46.6) and WDIR in (89.6 - 157.2) 44 CHILL in (46.6 - yes 0.9037015 0.014728288 56.2) and WDIR in (89.6 - 157.2) 48 CHILL in (46.6 - yes 0.8486806 0.031488065 56.2) and WDIR in (157.2 - 224.8) 52 CHILL in (37 - no 0.54187334 0.019807009 46.6) and WDIR in (224.8 - 292.4) 39 An Evaluation of Commercial Data Mining: Oracle Data Mining
  40. 40. 53 CHILL in (46.6 - yes 0.6677172 0.1777552 56.2) and WDIR in (224.8 - 292.4) 57 CHILL in (37 - no 0.93961054 0.0726257 46.6) and WDIR in (292.4 - 360) Table 16 Rules used by Adaptive Bayes Network Models to Make Predictions These rules can be interpreted as follows for rule 52: IF CHILL in (37 - 46.6) and WDIR in (224.8 - 292.4) THEN RAIN equal (no) Confidence=0.54187334 Support=0.019807009 The support value given with the rules gives an indication of the percentage of cases in the build data set with the same predicted target attribute and that meet the conditions of the rule. The confidence value indicates the improvement in the accuracy of the model that has been made by adding the rule. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] 4.2 Comparison of Model Results Once the models had been applied to new data and the results of this step had been exported to spreadsheets, it was possible to replace the original values of the RAIN attribute in the WEATHER_APPLY data set to allow the effectiveness of the models to be evaluated. After the RAIN attribute was replaced in the original data, each prediction made by each model was compared to the actual value of the RAIN attribute for that record. The number of correct predictions was counted and the percentage of correct predictions was calculated. It is also of interest to consider the accuracy of the model during testing when evaluating the effectiveness of the model when applied to new data. This makes it possible to determine whether testing results 40 An Evaluation of Commercial Data Mining: Oracle Data Mining
  41. 41. give a good indication of model performance when applied to new data. These results are depicted in Table17. Percentage of Model Number of Correct Correct Accuracy Model Model Predictions (out of 290) Predictions During Settings Testing No weighting, nbBuild no use of Priors 40 13.79% 72.35386% No weighting, nbBuild2 use of Priors 107 36.90% 71.09005% Weighting, nbBuild3 no use of Priors 40 13.79% 72.511846% nbBuild4 Weighting, 185 63.79% use of 41 An Evaluation of Commercial Data Mining: Oracle Data Mining
  42. 42. Priors 68.24645% No weighting, abnBuild no use of Priors 123 42.41% 85.15008% No weighting, abnBuild2 use of Priors 123 42.41% 84.9921% Weighting, abnBuild3 no use of Priors 212 73.10% 77.40916% Weighting, abnBuild4 use of Priors 212 73.10% 77.40916% Table 17 Summary of Accuracy of Predictions When Compared to Actual Data It is also of interest to directly compare the results when applied to new data of those models built using different algorithms but the same settings in terms of weighting and use of Priors. This is depicted in Table 18. Models Settings Naïve Bayes Percentage Adaptive Bayes of Correct Predictions Network Percentage of Correct Predictions nbBuild vs No weighting, no 13.79% 42.41% abnBuild use of Priors nbBuild2 vs No weighting, use 36.90% 42.41% abnBuild2 of Priors nbBuild3 vs Weighting, no use 13.79% 73.10% abnBuild3 of Priors nbBuild3 vs Weighting, use of 63.79% 73.10% abnBuild3 Priors Table 18 Comparison of Models Built using Same Settings 4.3 Chapter Summary 42 An Evaluation of Commercial Data Mining: Oracle Data Mining
This chapter has described the results obtained when the models were applied to new data, including the rules the Adaptive Bayes Network algorithm used to make its predictions. The predictions made by the models were compared to the original values of RAIN in the data. The accuracy of the predictions was compared between models as well as with the accuracy the models displayed during testing. In the following chapter the results of applying the models to new data, and the results of the comparisons between the models, are interpreted as part of the evaluation of ODM.

Chapter 5 Interpretation of Results

This chapter provides an interpretation of the results obtained from the data mining models built using similar techniques but different algorithms. Each comparison between the models is interpreted and reasons are presented for the results obtained. The effectiveness of all the models is compared and the significance of these observations discussed.

5.1 Comparison of Model Results
As presented in Table 18 in the previous chapter, the percentage of correct predictions for each model built using the Naïve Bayes algorithm was compared to that of the model built using the Adaptive Bayes Network algorithm with the same settings in terms of Priors and weighting. Table 19 adds the accuracy during testing for each model to these results.

Comparison   Models                  Settings                         Naïve Bayes   Adaptive Bayes       Naïve Bayes       Adaptive Bayes Network
                                                                      (% correct)   Network (% correct)  Test Accuracy     Test Accuracy
1            nbBuild vs abnBuild     No weighting, no use of Priors   13.79%        42.41%               72.35386%         85.15008%
2            nbBuild2 vs abnBuild2   No weighting, use of Priors      36.90%        42.41%               71.09005%         84.9921%
3            nbBuild3 vs abnBuild3   Weighting, no use of Priors      13.79%        73.10%               72.511846%        77.40916%
4            nbBuild4 vs abnBuild4   Weighting, use of Priors         63.79%        73.10%               68.24645%         77.40916%

Table 19 Comparison of models built using the same techniques, showing accuracy during testing

When each comparison is inspected, it is apparent that in every case the models built using the Adaptive Bayes Network algorithm outperform those built using Naïve Bayes when applied to new data. In all except comparison 2, the percentages of correct predictions for the Adaptive Bayes Network models are markedly higher than those for the Naïve Bayes models. During testing, the models built using the Adaptive Bayes Network algorithm also showed a higher level of accuracy than those built using Naïve Bayes.
However, this difference in test accuracy within each comparison is not nearly as large as the difference demonstrated when the models are applied to new data. In the following subsections each comparison is interpreted and discussed.

5.1.1 Comparison 1

The models in this comparison were built without making use of the Priors technique and were not tuned using weighting to introduce bias. The model built using the Adaptive Bayes Network algorithm, abnBuild, correctly predicted 42.41% of the RAIN attribute outcomes when applied to new data, whereas the model built using Naïve Bayes, nbBuild, predicted only 13.79% of the outcomes correctly. During testing, nbBuild showed an accuracy of 72.35386% and abnBuild an accuracy of 85.15008%.

nbBuild's relatively high test accuracy combined with its low accuracy on new data can be attributed to the unbalanced distribution of outcomes for the RAIN attribute in the data set used to build the model, THE_BUILD. During the model building stage the model did not observe enough cases of one outcome of the target attribute to build an accurate model, yet it still showed a high level of accuracy during testing because the distribution of the test data set, THE_TEST, was similar to that of the build data. Thus, when applied to the new data, WEATHER_APPLY, the model was shown to be ineffective.
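To see why an unbalanced build set hurts a probabilistic classifier of this kind, consider a toy posterior computation in the style of Bayes' rule. The Python sketch below uses invented numbers and illustrates the general behaviour, not Oracle's implementation: a heavily skewed class prior pulls the posterior towards the majority outcome even when the evidence favours rain.

```python
# Toy Bayes posterior: P(class | x) is proportional to P(class) * P(x | class).
# Invented numbers: a skewed prior (10% "yes") versus a balanced one.
def posterior_yes(prior_yes, likelihood_yes, likelihood_no):
    """P(yes | x) for a single piece of evidence x."""
    p_yes = prior_yes * likelihood_yes
    p_no = (1 - prior_yes) * likelihood_no
    return p_yes / (p_yes + p_no)

# Evidence that is three times as likely on rainy days:
like_yes, like_no = 0.6, 0.2

print(posterior_yes(0.10, like_yes, like_no))  # 0.25: skewed prior -> predict "no"
print(posterior_yes(0.50, like_yes, like_no))  # 0.75: balanced prior -> predict "yes"
```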
abnBuild showed a higher overall accuracy than nbBuild, which was expected as the Adaptive Bayes Network algorithm is said to build more effective models. [Berger, 2004]

5.1.2 Comparison 2

The models in this comparison were built using the Priors technique in order to minimise the effect of the unbalanced distribution of outcomes for the RAIN attribute in the build data set. When applied to the new data, nbBuild2 correctly predicted 36.90% of the outcomes of RAIN and abnBuild2 correctly predicted 42.41%. During testing, nbBuild2 showed an accuracy of 71.09005% and abnBuild2 an accuracy of 84.9921%.

With the Priors technique in place, nbBuild2 correctly predicted 23.11 percentage points more of the outcomes on new data than nbBuild, while its test accuracy decreased by a little over one percentage point. The increase in accuracy on new data can be attributed to the balanced distribution of outcomes for the RAIN attribute in the build data set used for this model, which allowed the model to observe a sufficient number of cases of each outcome. The slight decrease in test accuracy arises because the test data set retained a distribution similar to that of the original build data, with its uneven distribution of the target attribute.

abnBuild2 showed the same accuracy on new data as the Adaptive Bayes Network model that did not make use of the Priors technique, but a lower test accuracy than abnBuild.
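The rebalancing idea that the Priors preparation relies on can be pictured with a short, generic sketch. The Python fragment below balances the outcomes of a target attribute by randomly oversampling the rarer class; it is one simple way of achieving a balanced build set and is an assumption for illustration, not a reproduction of the Stratified Sampling wizard. The record counts are invented.

```python
import random

# Generic sketch of balancing a target attribute before model building.
# 'records' is a hypothetical build set with an uneven RAIN distribution.
random.seed(0)
records = [{"RAIN": "no"} for _ in range(1600)] + [{"RAIN": "yes"} for _ in range(160)]

yes = [r for r in records if r["RAIN"] == "yes"]
no = [r for r in records if r["RAIN"] == "no"]
minority, majority = (yes, no) if len(yes) < len(no) else (no, yes)

# Oversample the minority class until both outcomes are equally represented.
balanced = majority + minority + random.choices(minority, k=len(majority) - len(minority))

counts = {v: sum(1 for r in balanced if r["RAIN"] == v) for v in ("yes", "no")}
print(counts)  # {'yes': 1600, 'no': 1600}
```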
The fact that both Adaptive Bayes Network models showed the same accuracy when applied to new data is indicative of the algorithm's effectiveness at building a model from data with varying distributions of the target attribute. The decrease in abnBuild2's test accuracy can be attributed to the build data having a more even distribution of the target attribute than the test data set.

5.1.3 Comparison 3

The models in this comparison were built from the original build data with its uneven distribution of the target attribute, RAIN. These models were tuned because it was evident during testing of nbBuild that the model predicted a large number of false 'no' values for the target attribute, so it was viable to introduce bias into the models by using weighting. A weighting of 3 was applied against false negatives, meaning that the cost of predicting a false negative was three times that of predicting a false positive.

With this weighting in place, nbBuild3 showed no improvement in accuracy on new data compared to the first Naïve Bayes model; the accuracy remained at 13.79%. The accuracy of nbBuild3 during testing improved by only 0.157986 percentage points over the first model. It is unexpected that nbBuild3's accuracy on new data should not improve after introducing weighting. This could indicate that the model's ineffectiveness stems largely from building it on data that was not prepared with the Priors technique. The small improvement in test accuracy could be attributed to the weighting, but the weighting evidently had little effect on the accuracy of the model when applied to new data.
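The effect of such a weighting can be sketched independently of ODM. Below is a minimal, generic Python illustration of cost-sensitive classification: given a predicted probability of rain, the class is chosen by minimising expected cost rather than by comparing the probability to 0.5. The 3:1 cost ratio follows the weighting described above; the probabilities and the helper function are hypothetical stand-ins for what the algorithms do internally.

```python
# Sketch of cost-sensitive prediction with a 3:1 weighting against false
# negatives (predicting "no" when the actual outcome is "yes").
COST_FALSE_NEGATIVE = 3.0
COST_FALSE_POSITIVE = 1.0

def predict_with_costs(p_yes):
    """Choose the class that minimises expected misclassification cost."""
    expected_cost_of_no = p_yes * COST_FALSE_NEGATIVE         # wrong if it rains
    expected_cost_of_yes = (1 - p_yes) * COST_FALSE_POSITIVE  # wrong if it does not
    return "yes" if expected_cost_of_yes < expected_cost_of_no else "no"

# With equal costs the decision threshold is 0.5; with a 3:1 weighting it
# drops to 0.25, so borderline cases are pushed towards predicting "yes".
for p in (0.10, 0.30, 0.45, 0.70):
    print(p, predict_with_costs(p))
```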
A dramatic improvement in the Adaptive Bayes Network model was observed after introducing weighting. The accuracy of the model on new data increased to 73.10%, even though the test accuracy of abnBuild3 dropped to 77.40916%. It can be deduced that the weighting had a significant impact on abnBuild3's accuracy on new data, as the model was tuned sufficiently to avoid errors of a particular kind. The accuracy of the first Adaptive Bayes Network model during testing was 85.15008%; introducing weighting reduced this to 77.40916%, even though the weighted model markedly outperforms the non-weighted model on new data. It must also be emphasised that in this case the accuracy during testing and the accuracy on new data are relatively close, which did not occur with the previous models. For this reason, it can be said that introducing weighting improved the effectiveness of this model both in terms of the accuracy measured during testing and the accuracy achieved on new data.

5.1.4 Comparison 4

The models in this comparison, nbBuild4 and abnBuild4, were built using the Priors technique and tuned by introducing a weighting of 3 against false negatives. nbBuild4 showed a significant improvement in accuracy on new data, correctly predicting 63.79% of the outcomes. The accuracy of abnBuild4, during both testing and application, was unchanged from that of the previous model, which made use of weighting alone.

It is apparent that nbBuild4's improvement is due to the combination of the Priors technique and the weighting. These results suggest that building a model on data with a balanced distribution of the target attribute enhances the effect of the weighting: the balanced data yields a more effective base model, which the weighting then tunes to increase its overall accuracy.
Also to be noted is that the accuracy of nbBuild4 during testing, 68.24645%, is closer to its accuracy on new data than is the case for the other Naïve Bayes models. This could indicate the increased effectiveness, during both testing and application to new data, of this model built using Priors and weighting. The accuracy of abnBuild4 remained the same, during both application to new data and testing, as that of the previous model, which did not make use of the Priors technique. It is therefore apparent that the Priors technique had no impact on the effectiveness of this model, while tuning the model using weighting improved its effectiveness significantly.

5.2 Effectiveness of Models

Figure 12 graphically portrays the accuracy of the models when applied to the new data set, WEATHER_APPLY, together with the settings for each pair of models. Among the Naïve Bayes models, the most effective model correctly predicted 63.79% of the RAIN outcomes and was built using the Priors technique together with a weighting of 3 against false negatives. The most effective Adaptive Bayes Network model correctly predicted 73.10% of the RAIN outcomes and was built by introducing a weighting of 3 against false negatives. The Priors technique had no influence on the accuracy of the Adaptive Bayes Network models when they were applied to new data.

Figure 13 graphically portrays the accuracy of the models when tested on the test data set, THE_TEST, together with the settings for each pair of models. The most accurate Naïve Bayes model during testing was the one that used weighting but not the Priors technique, with a test accuracy of 72.51%.
However, this model was also one of the two models that performed most poorly when applied to new data, correctly predicting only 13.79% of the RAIN outcomes. The high test accuracy in this case is attributed to the build and test data sets having a similarly unbalanced distribution of outcomes for the target attribute. This caused the model to perform well during testing but to be ineffective when applied to the new data set, which has a different distribution of outcomes for the target attribute.

The Adaptive Bayes Network model that demonstrated the greatest accuracy during testing, 85.15%, was the one that made no use of the Priors technique and had no bias introduced in the form of weighting. This model was also one of the two Adaptive Bayes Network models that performed most poorly when applied to new data, predicting only 42.41% of the outcomes of RAIN correctly. Since the use of the Priors technique had no effect on the effectiveness of these models on new data, it can be deduced that the introduction of weighting improved their accuracy on new data even though this was not reflected during testing. These results raise some questions about the effectiveness of the model testing process.

[Figure 12: bar chart of model accuracy (0% to 80%) on the new data, comparing the Naïve Bayes and Adaptive Bayes Network models under each combination of settings: no weighting/no priors, no weighting/priors, weighting/no priors, weighting/priors.]
Figure 12 Model results and settings of application to new data

[Figure 13: bar chart of model accuracy (0% to 90%) during testing, comparing the Naïve Bayes and Adaptive Bayes Network models under the same combinations of settings.]

Figure 13 Model results and settings during testing

5.3 Significance of Results

It is significant that the effectiveness on new data of the models built using the Adaptive Bayes Network algorithm was not affected by the use of the Priors technique, which attempts to ensure a balanced distribution of the target attribute in the build data set. This could indicate the algorithm's effectiveness at incorporating rare occurrences in the data into the model. The opposite holds for the models built using the Naïve Bayes algorithm: their effectiveness was markedly improved by the Priors technique, indicating that this form of data preparation is a requirement for the algorithm.
Also to be emphasised is the effect of introducing weighting into both kinds of model. In the case of the models built using the Adaptive Bayes Network algorithm, the introduction of weighting produced a dramatic increase in accuracy when the models were applied to new data. The models built using the Naïve Bayes algorithm only benefited from the introduction of weighting when the Priors technique had also been used. This could indicate that weighting, as a tuning step, is most beneficial when the model is already reasonably effective.

It was interesting to note the discrepancies between the models' accuracy during testing and their accuracy when applied to new data. In all cases the test accuracy was higher than the accuracy calculated when the model was applied to new data, and in some cases the difference was significant. This could indicate the impact the nature of the test data has on measured test accuracy. The data used for testing the models was created from the data set used to build the models using the Transformation Split wizard, as discussed in Chapter 3. For this reason the distribution of the target attribute in both data sets was similar, which inflated the accuracy of models built from and tested on such similar data. External validation of the models' performance on the new data emphasised this influence. These findings indicate the need for test data sets that are less similar to the build data sets, and call into question the use of the Transformation Split wizard to create build and test data sets from data with a skewed distribution of the target attribute.

5.4 Chapter Summary

After interpreting the results obtained from the different models it is apparent that the most effective model was built using the Adaptive Bayes Network algorithm with a weighting of 3 against false negatives. The results obtained from the models built using the Adaptive Bayes Network algorithm were not affected by the use of the Priors technique, whereas the results of the models built using the Naïve Bayes algorithm were. Weighting had an effect on the results obtained from both kinds of model, but in the case of the Naïve Bayes models it was only noticeable when the Priors technique was also used. Also to be noted is that the accuracy of the models during testing does not always indicate their effectiveness when applied to new data.
The following chapter will discuss the conclusions that can be drawn from the results obtained in this chapter.

Section 3 Conclusion

Chapter 6 Conclusions Drawn from Results

This chapter draws conclusions from the results presented in the previous chapters. The first set of conclusions concerns the actual results obtained when the models were applied to new data. The next set considers the effect the data used during the data mining had on the results obtained, and lastly conclusions regarding Oracle Data Mining are drawn.

6.1 Conclusions Regarding Model Results

The most effective model built using the Naïve Bayes algorithm correctly predicted the outcome of the RAIN attribute for 63.79% of 290 records. The model built using this algorithm and no other techniques correctly predicted only 13.79% of the outcomes, and introducing bias into that model using weighting had no effect on this accuracy. Use of the Priors technique increased the accuracy to 36.90%, and the combination of weighting and Priors increased the accuracy on new data to 63.79%.

Tuning was accomplished by introducing bias into the model using weighting. It was viable to introduce bias because, during testing, the confusion matrix of the model showed that it tended to make errors by predicting outcomes of a certain kind. Bias makes these particular errors more costly to the effectiveness of the model, so the algorithm attempts to minimise them when building the model.

These observations indicate that the Naïve Bayes algorithm requires the use of the Priors technique when the build data has an uneven distribution of the target attribute. This ensures the algorithm observes enough of each target attribute outcome to build a model that will be effective when applied to data with a different distribution of the target attribute.
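The confusion-matrix reasoning above is easy to reproduce. The Python sketch below shows how a confusion matrix exposes a systematic skew towards false 'no' predictions, which is the kind of pattern that justified introducing bias; the counts are invented for illustration and are not the actual test figures.

```python
from collections import Counter

# Hypothetical (actual, predicted) pairs for a model skewed towards "no":
# note the large false-negative count (actual "yes" predicted as "no").
pairs = ([("yes", "no")] * 60 + [("yes", "yes")] * 40 +
         [("no", "no")] * 180 + [("no", "yes")] * 10)

matrix = Counter(pairs)
for actual in ("yes", "no"):
    for predicted in ("yes", "no"):
        print(f"actual={actual:<4} predicted={predicted:<4} count={matrix[(actual, predicted)]}")

# Overall accuracy looks respectable even though most rainy days are missed,
# which is why a single accuracy figure can be misleading on skewed data.
accuracy = (matrix[("yes", "yes")] + matrix[("no", "no")]) / len(pairs)
print(f"accuracy = {accuracy:.2%}")
```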
Adjusting the settings of the algorithm parameters would also be beneficial when using this algorithm to build a model from data with an uneven target attribute distribution. These parameters, the pairwise and singleton thresholds, affect how the algorithm treats outliers in the data. Reducing the values of these parameters allows a more accurate model to be built, but this is only beneficial if the model already observes enough cases of each target attribute outcome.

Introducing bias into the Naïve Bayes model was most beneficial when the model had been built using the Priors technique. This could indicate that the effect of weighting is enhanced when the model is already relatively effective.

The most effective of all the models was built using the Adaptive Bayes Network algorithm. This model correctly predicted 73.10% of the 290 RAIN attribute outcomes when applied to the new weather data set; tuning increased this level of accuracy from 42.41%. The use of the Priors technique had no effect on the models built using the Adaptive Bayes Network algorithm, which indicates that the effectiveness of the resulting models was not affected by the distribution of the target attribute in the build data set. It can thus be concluded that the algorithm effectively takes account of occurrences in the data even when those occurrences are rare.

6.2 Conclusions Regarding Data

It is apparent from the results of applying the models to the WEATHER_APPLY data set that the algorithms found a pattern in the data that allowed them to correctly predict the outcome of the RAIN attribute in a significant number of cases. According to the rules generated by the Adaptive Bayes Network algorithm, these predictions were mostly influenced by the measurements for wind chill factor (CHILL) and wind direction (WDIR) in the records. Although unexpected, these measurements appear to allow the models to make accurate predictions in most cases.
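This influence can be read directly off Table 16. As a quick illustration, the sketch below tallies the attributes appearing in the conditions of the eight rules that were actually used; only the rule conditions come from Table 16, and the tally itself is a trivial demonstration.

```python
from collections import Counter

# Attributes appearing in the condition of each rule used (from Table 16).
rule_conditions = {
    2: ["CHILL"],
    38: ["CHILL", "WDIR"], 43: ["CHILL", "WDIR"], 44: ["CHILL", "WDIR"],
    48: ["CHILL", "WDIR"], 52: ["CHILL", "WDIR"], 53: ["CHILL", "WDIR"],
    57: ["CHILL", "WDIR"],
}

usage = Counter(attr for attrs in rule_conditions.values() for attr in attrs)
print(usage)  # Counter({'CHILL': 8, 'WDIR': 7}): every prediction hinged on
              # wind chill, and all but one rule also tested wind direction.
```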
The Transformation Split wizard allows a data set to be split into build and test data sets by randomly selecting a predetermined number of records for the build data set and placing the remaining records in the test data set. However, creating the data sets with this technique results in both data sets showing a similar distribution of the target attribute: if the distribution of the target attribute is uneven, both data sets will reflect this to some extent. This is depicted in Figures 14, 15 and 16. Figure 14 shows the distribution of the RAIN attribute in the data set from which the Transformation Split wizard created the build and test data sets; Figure 15 shows the distribution of this attribute in the build data set and Figure 16 in the test data set.

[Figure 14: bar chart of bin counts (0 to 1800) for the 'yes' and 'no' outcomes of the RAIN attribute in the original data set.]

Figure 14 Distribution of RAIN attribute in the original data set

It appears that similar distributions of the target attribute in the build and test data sets inflate the accuracy of the model during testing: testing a model on a data set that resembles the build data set yields an inflated accuracy figure. This was evident from the significantly lower levels of accuracy the models showed when applied to the new data with its different distribution. The distribution of the apply data set is shown in Figure 17.
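This behaviour of a random split can be sketched directly. The short Python illustration below assumes a hypothetical source data set with a 90/10 'no'/'yes' split and shows that randomly partitioning the records, in the style described for the Transformation Split wizard, leaves both partitions with roughly the same skewed distribution; it is a generic demonstration, not the wizard itself.

```python
import random

random.seed(1)
# Hypothetical source data with a skewed RAIN distribution (90% "no").
data = ["no"] * 1620 + ["yes"] * 180
random.shuffle(data)

# A random split: a predetermined number of records for the build set,
# the remainder for the test set.
build, test = data[:1200], data[1200:]

for name, subset in (("build", build), ("test", test)):
    share_yes = subset.count("yes") / len(subset)
    print(f"{name}: {len(subset)} records, {share_yes:.1%} 'yes'")
# Both subsets end up close to the original 10% 'yes' share, so a model can
# test well here yet fail on apply data with a different distribution.
```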
These findings indicate the need to test a model on a variety of data sets with different distributions in order to properly validate its accuracy and its effectiveness on data with a different distribution of the target attribute. Further, it appears beneficial to build and test models on the largest data set possible, as a wider range of occurrences in the data would then be incorporated into the model.

[Figure 15: bar chart of bin counts (0 to 1800) for the 'yes' and 'no' outcomes of the RAIN attribute in the build data set.]

Figure 15 Distribution of RAIN attribute in the build data set

[Figure 16: bar chart of bin counts (0 to 1800) for the 'yes' and 'no' outcomes of the RAIN attribute in the test data set.]
Figure 16 Distribution of RAIN attribute in the test data set

[Figure 17: bar chart of bin counts (0 to 300) for the 'yes' and 'no' outcomes of the RAIN attribute in the apply data set.]

Figure 17 Distribution of RAIN attribute in the apply data set

6.3 Conclusions Regarding Oracle Data Mining

Oracle Data Mining, and DM4J in particular, provides the user with easy-to-use and understandable wizards that cover all aspects of the data mining process. Wizards are available to create the build and test data sets from an original data set, to prepare the data for use with the Priors technique, to build models, to test these models and to apply them to new data. Although data preparation is an important aspect of the data mining process [Berger, 2004], it is not explicitly emphasised in the wizards that support model building: techniques for data preparation, and the benefits of using them, are accessible through the Data Mining Browser but are not brought to the user's attention.

The Data Mining Browser in DM4J allows the user to easily access the results of model testing and of applying models to new data. These results can also be exported to spreadsheets, increasing their accessibility and ensuring they can easily be worked with.
DM4J provides easy and reliable access to the database and the tables stored in it, making it possible to locate a specific data set during the data mining process. The Data Mining Browser also allows the user to view summaries of the data, including the distributions of attributes in a data set, which is useful during the data preparation phase.

It is apparent from the results of the model testing during this evaluation that testing a model on a single data set does not give an indication of the effectiveness of the model when applied to new data; the test accuracy of a model can be misleading. For this reason models should be externally validated, using a technique similar to the one used in this investigation (applying the model to data where the outcome of the target attribute is known), or tested on a number of data sets with varying distributions in order to better determine model accuracy. The need to validate a model on a variety of data sets is not emphasised in the documentation or by the wizards.

It must be emphasised that the ease and speed of building and testing a model using the wizards allows a number of models to be built and tests to be conducted. This approach is recommended in order to ensure that the most effective model possible is produced.

6.4 Chapter Summary

This chapter has drawn a number of conclusions from the results obtained during the data mining. Conclusions have been made regarding the model results, the effect of the data used during the data mining, and Oracle Data Mining itself. The following chapter will conclude this evaluation.
Chapter 7 Conclusion

This chapter presents the conclusions drawn from the evaluation and suggests possible extensions to the research area.

7.1 Conclusion

Oracle Data Mining provides data mining functionality through a series of wizards. These wizards allow the user to perform data preparation, to build models, to test these models and to apply the models to new data. The data preparation in this evaluation was performed using the Transformation Split wizard and the Stratified Sampling wizard, and a number of wizards were used to build, test and apply the models. The wizards were easy to use and understand and allowed a number of models to be built in a short amount of time. Access to the database was provided through the wizards. However, it was found that the wizards for building the data mining models placed little emphasis on data preparation.

The two Classification algorithms used in this evaluation found a distinct pattern in the weather data sets. This allowed the models to be used to predict the outcome of the RAIN attribute when they were applied to the new data set. It is possible to conclude that, given a new set of weather data, the data mining models would be able to make fairly accurate predictions of the outcome of the RAIN attribute.

Of the algorithms investigated, the Adaptive Bayes Network algorithm produced the most effective model when applied to new data, correctly predicting 73.10% of the RAIN attribute outcomes. This model was tuned using a weighting of 3 against false negatives to introduce bias into the model. The most effective model built using the Naïve Bayes algorithm correctly predicted 63.79% of the outcomes and was built using the Priors technique combined with the same weighting.
