IT for Business Intelligence
Data Mining Techniques: Classification and Regression Using WEKA
A. Kranthikumar (10BM60001)
Classification via decision trees using WEKA

Problem: A bank is introducing a new financial product and wants to classify new customers according to whether they are likely to buy it. The bank already has information on its existing clients, including their interest in buying the new product.

Classification is a statistical technique that assigns a new client to one of the existing groups. It builds a model from the available training data and then classifies new data using that model.

Steps to do classification in WEKA

Step 1: Create a data file in ARFF or CSV format. WEKA understands both formats. Here we use a CSV file, Bank.csv.

Step 2: Open the WEKA application. This shows the following screen. Now click on the Explorer button. This opens the following window.
Step 3: Load the data into WEKA
Click on the “Open file...” button and browse for the Bank.csv file. WEKA then lists all the attributes, as shown in the figure below.
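Step 1 mentions the ARFF format without showing what such a file contains. The following is a minimal sketch, not WEKA's own converter, of how CSV text with a header row could be turned into ARFF text. The column names (age, income, buys) and the sample rows are assumptions made for illustration and are not taken from Bank.csv; the helper csv_to_arff is likewise hypothetical.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal_columns):
    """Convert CSV text with a header row into ARFF text.

    Columns listed in nominal_columns are declared with their set of
    observed values; every other column is declared numeric.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines = ["@relation " + relation, ""]
    for index, name in enumerate(header):
        if name in nominal_columns:
            values = sorted({row[index] for row in data})
            lines.append("@attribute %s {%s}" % (name, ",".join(values)))
        else:
            lines.append("@attribute %s numeric" % name)

    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample data; the real Bank.csv columns may differ.
sample_csv = "age,income,buys\n25,30000,no\n47,52000,yes\n"
arff_text = csv_to_arff(sample_csv, "bank", {"buys"})
print(arff_text)
```

The header section declares each attribute and its type, and the rows after @data hold the instances in the same column order.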
Step 4: View the data
In the Selected attribute panel you can see the name and type of the selected attribute and the values it takes. You can also view the frequency distributions of all the attributes at once by clicking on the “Visualize All” button. It shows the following screen.
This view shows the range of data for each attribute along with its frequency distribution; summary statistics such as the mean appear in the Selected attribute panel. For example, age in our case ranges from 18 to 67 with an average of 42.5.

Step 5: Classify the training data
Select the Classify tab, which shows the following screen. Then click on the Choose button and select the J48 algorithm under the trees node. This shows the following screen.
Step 6: Run the classification algorithm
Select the dependent variable to be classified and click on Start. The classifier output panel shows the tree in ASCII form, which is difficult to read. To view the output as a tree, right-click on the trees.J48 entry in the result list and select the “Visualize tree” option. Right-clicking again on the output and selecting the full screen option shows the following screen.

Step 7: Analyze the model created from the existing data
From the classifier output we can see that the classification accuracy of the model is 89%. This means that the model predicted the class correctly for 89% of the evaluated instances. So if we use the same model to predict the buying decision of a new customer, the estimated probability of a correct prediction is about 0.89.

Step 8: Test the new customer data
Create the new customer data in ARFF or CSV format with the same attributes as the training data. Select the “Supplied test set” radio button and click on “Set...” to browse for the new data set. Then click on the Start button, which rebuilds the tree and applies it to the new data. Save the classification result as an ARFF file. This file contains a copy of the new instances along with an additional column for the predicted value. The result will look like the following.
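The accuracy figure discussed in Step 7 is the share of correctly classified instances, which can be read off the confusion matrix in the classifier output. The sketch below shows that arithmetic. The counts are hypothetical, chosen only so that the result matches the 89% reported above; they are not the actual counts from the bank data.

```python
def accuracy(confusion):
    """Return the fraction of instances on the diagonal of a square
    confusion matrix (rows = actual class, columns = predicted class)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts for illustration only.
confusion = [
    [48, 7],   # actual "yes": 48 predicted yes, 7 predicted no
    [4, 41],   # actual "no":   4 predicted yes, 41 predicted no
]

acc = accuracy(confusion)
print("accuracy = %.2f" % acc)
```

With these counts, 89 of the 100 instances lie on the diagonal, so the accuracy is 0.89.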
Regression using WEKA

Problem: The idea is to find out how CPU performance is related to attributes such as machine cycle time, minimum main memory, cache memory, etc.

Regression is a statistical tool that helps find out how the dependent variable (CPU performance) is related to the independent attributes.

Steps to do regression in WEKA

Step 1: Create the data file and open WEKA in the same way as for classification.

Step 2: Load the regression data file CPU.arff into WEKA. Click on “Open file...” and browse for the file, which shows the following screen.

Step 3: Run the regression
Click on the Classify tab and choose “LinearRegression” under the functions node. This shows the following screen.
Click on Start. The classifier output panel then shows the results, including the regression equation.
Interpretation of the output: From the output you can see that CPU performance depends most on CHMAX and then on CACHE. The high correlation coefficient of 0.912 suggests that the dependent variable is strongly associated with the independent variables. Given the attribute values of a new machine, we can also estimate its CPU performance using the regression equation.
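Estimating the performance of a new machine from the regression equation amounts to a weighted sum of its attribute values plus a constant. The sketch below illustrates this under stated assumptions: the coefficient values, the intercept and the attribute values are hypothetical placeholders, not the numbers WEKA reports for CPU.arff, and only two attributes are used to keep the example short.

```python
def predict(coefficients, intercept, instance):
    """Evaluate a linear regression equation for one instance:
    intercept + sum(coefficient * attribute value)."""
    return intercept + sum(
        coefficients[name] * instance[name] for name in coefficients
    )


# Hypothetical coefficients; substitute the values from the WEKA output.
coefficients = {"CHMAX": 1.5, "CACHE": 0.5}
intercept = -10.0

# Hypothetical attribute values for a new machine.
new_machine = {"CHMAX": 32, "CACHE": 64}

predicted = predict(coefficients, intercept, new_machine)
print("predicted performance:", predicted)
```

With these placeholder numbers the estimate is -10 + 1.5 * 32 + 0.5 * 64 = 70.0. In practice every attribute that appears in the WEKA equation would be included in the dictionary.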