IT FOR BUSINESS INTELLIGENCE
Data Analysis Techniques Using WEKA: Classification and Regression
Nikhil Yagnic (07AG3801)

Introduction

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License.

The Weka workbench contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to this functionality.

The original non-Java version of Weka was a TCL/TK front-end to (mostly third-party) modelling algorithms implemented in other programming languages, plus data pre-processing utilities in C, and a Makefile-based system for running machine learning experiments. This original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent fully Java-based version (Weka 3), for which development started in 1997, is now used in many different application areas, in particular for educational purposes and research.

Advantages of Weka include:
- free availability under the GNU General Public License
- portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform
- a comprehensive collection of data pre-processing and modelling techniques
- ease of use due to its graphical user interfaces

Weka supports several standard data mining tasks, more specifically, data pre-processing, clustering, classification, regression, visualization, and feature selection. All of Weka's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally, numeric or nominal attributes, but some other attribute types are also supported). Weka provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query.
It is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table that is suitable for processing using Weka. Another important area that is currently not covered by the algorithms included in the Weka distribution is sequence modelling.

Classification via decision trees using WEKA

Problem:
A bank is introducing a new financial product and wants to classify new customers according to whether they are likely to buy it. The bank has existing information about its old clients, including those who are interested in buying the new product.

Classification is a statistical technique that helps to assign any new client to one of the existing groups. It builds a model from the available training data (here, the existing client records) and then classifies new data using that model.

Steps to do classification in WEKA

Step 1: Create a data file in ARFF or CSV format. Weka understands both formats. Here we use a file in CSV format, Bank.csv.
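
To make the two file formats in Step 1 concrete, here is a minimal sketch that turns CSV text into the ARFF layout (an `@relation` line, one `@attribute` line per column, then `@data`). The column names and rows are hypothetical, since the contents of Bank.csv are not shown here, and the helper `csv_to_arff` is an illustration rather than part of Weka.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: list its distinct values in sorted order.
            labels = ",".join(sorted(set(values)))
            lines.append("@attribute " + name + " {" + labels + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample rows; the real Bank.csv columns may differ.
sample_csv = "age,income,married,buys\n48,17546.0,NO,YES\n40,30085.1,YES,NO\n"
arff_text = csv_to_arff(sample_csv, "bank")
print(arff_text)
```

Weka can load the CSV file directly, so a conversion like this is only needed if you prefer to keep the attribute types explicit in the file.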
Step 2: Open the Weka application. This will show the following screen.

Step 3: Load data into WEKA.
Click on the “Open file” button and browse for the bank.csv file. Weka then shows all the attributes, as in the figure below.
Step 4: View the data
In the selected attribute panel you can see the name and type of the selected attribute and the values it takes.
You can also visualize the frequency distributions of all the attributes at once by clicking on the “Visualize All” button. It shows the following screen.
This view shows the range of the data for each attribute, together with its mean, median and frequency distribution. For example, the value of age in our case ranges from 18 to 67, with an average of 42.5.

Step 5: Classify the data
To do this, select the Classify tab, which shows the following screen.
Then click on the Choose button and select the J48 algorithm, which is under the trees node. This will show the following screen.
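
J48 is Weka's implementation of the C4.5 decision tree learner, which picks the attribute to split on using an entropy-based measure. The sketch below shows the basic information gain calculation on a tiny, hypothetical customer table (C4.5 itself uses a normalized variant, the gain ratio, plus pruning, which are not reproduced here). The attribute names `married`, `car` and `buys` are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical records, not taken from the bank data set.
customers = [
    {"married": "YES", "car": "NO", "buys": "YES"},
    {"married": "YES", "car": "YES", "buys": "YES"},
    {"married": "NO", "car": "NO", "buys": "NO"},
    {"married": "NO", "car": "YES", "buys": "NO"},
]
gain_married = information_gain(customers, "married", "buys")
gain_car = information_gain(customers, "car", "buys")
print(gain_married, gain_car)
```

In this toy table `married` separates the two classes perfectly while `car` tells us nothing, so a tree learner of this kind would split on `married` first.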
Step 6: Run the classification algorithm
Select the dependent variable that should be classified and click on Start.
The classifier output panel then shows the tree in ASCII form, which is difficult to read. To view the output as a tree, right-click on the trees.J48 entry and select the “Visualize tree” option. Right-clicking again on the output and selecting the full screen option shows the following screen.
Step 7: Analyze the model created from the existing data
From the classifier output we can see that the classification accuracy of the model is 89%.
This means that the model predicts the correct value for 89% of the instances it was evaluated on. So if we use the same model to predict the buying decision of a new customer, the estimated probability that the prediction is correct is about 0.89, assuming new customers resemble the existing ones.

Step 8: Test the new customer data
Create your new customer data in ARFF or CSV format with the same attributes as the training data.
Now load the data by selecting the radio button “Supplied test set” and clicking on “Set” to browse for the new data set.
Then click on the Start button, which generates a new tree.
Save the classification result as ARFF. This file contains a copy of the new instances along with an additional column for the predicted value. The result will look like the following.
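
The accuracy figure discussed in Step 7 is the fraction of instances whose predicted class matches the actual class (Weka's output lists this as correctly classified instances). A minimal sketch of that calculation follows; the labels are hypothetical and are not the bank data.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels for ten customers (8 of the 10 predictions match).
actual = ["YES", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "YES", "NO"]
predicted = ["YES", "NO", "YES", "NO", "NO", "NO", "YES", "YES", "YES", "NO"]
model_accuracy = accuracy(actual, predicted)
print(f"accuracy = {model_accuracy:.0%}")
```

For genuinely new customers the actual class is unknown, so the accuracy can only be estimated from data where the outcome is already recorded.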
Regression Using WEKA

Problem:
The idea is to find out how CPU performance is correlated with attributes such as machine cycle time, minimum main memory, cache memory, etc.

Regression is a statistical tool that helps in finding out how the dependent variable (CPU performance) is related to the independent attributes.

Steps to do Regression in WEKA

Step 1: Create the data file and open WEKA in the same way as we did for classification.

Step 2: Load the regression data file CPU.arff into Weka.
Click on “Open file” and browse for the file. This shows the following screen.

Step 3: Run the regression
Click on the Classify tab and choose “LinearRegression” under the functions node. This shows the following screen.
Click on Start. The classifier output screen then shows the output, which gives a regression equation.
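
To show what a regression equation is, here is a minimal least-squares sketch with a single predictor. Weka's linear regression fits several attributes at once and may also select attributes, so this is only the one-variable special case. The data values are hypothetical and are not taken from CPU.arff.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical cache sizes and performance scores lying on one line.
cache_kb = [0, 8, 16, 32, 64]
performance = [20.0, 36.0, 52.0, 84.0, 148.0]
slope, intercept = fit_line(cache_kb, performance)
print(f"performance = {slope:.2f} * cache + {intercept:.2f}")
```

With several attributes the equation has the same shape: one coefficient per attribute plus a constant term.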
Interpretation of the output:
- CPU performance depends most on CHMAX, and then on CACHE.
- The correlation coefficient of 0.912 is very high, which suggests that the dependent variable is strongly associated with the independent variables.
- We can also estimate the CPU performance of a new machine from the regression equation if we have the values of its attributes.
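
The last two points can be sketched in a few lines: plug attribute values into a regression equation to get predictions, then measure how closely the predictions track the actual values with the Pearson correlation coefficient. The coefficients, intercept and machine records below are hypothetical placeholders, not the values from the Weka output.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def predict(coefficients, intercept, attributes):
    """Evaluate a linear regression equation for one machine."""
    return intercept + sum(coefficients[name] * attributes[name]
                           for name in coefficients)


# Hypothetical equation: performance = 1.5 * CHMAX + 0.5 * CACHE + 10
coefficients = {"CHMAX": 1.5, "CACHE": 0.5}
intercept = 10.0
machines = [
    {"CHMAX": 8, "CACHE": 16},
    {"CHMAX": 16, "CACHE": 32},
    {"CHMAX": 32, "CACHE": 64},
    {"CHMAX": 4, "CACHE": 0},
]
actual = [28.0, 55.0, 88.0, 20.0]  # hypothetical measured performance
predicted = [predict(coefficients, intercept, m) for m in machines]
r = pearson(actual, predicted)
print(predicted, round(r, 3))
```

A value of r close to 1 means the predictions rise and fall with the actual values; it does not by itself say which attribute matters most.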