Data Mining Techniques using WEKA
IT for Business Intelligence
Term paper on Weka
Submitted by: Saurabh Singh (10BM60082)
Introduction
Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented in other programming languages, plus data preprocessing utilities in C and a Makefile-based system for running machine learning experiments. This original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent fully Java-based version (Weka 3), for which development started in 1997, is now used in many different application areas, in particular for educational purposes and research.
Advantages of Weka include: free availability under the GNU General Public License; portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform; a comprehensive collection of data preprocessing and modeling techniques; and ease of use due to its graphical user interfaces.
Weka primarily consists of the following four screens: Explorer, Experimenter, KnowledgeFlow and Simple CLI.
K-means clustering in WEKA
Suppose a company wants to segment its market based on the attributes collected by its research team. This can be done effectively and efficiently using K-means clustering in Weka.
The attributes used are as follows: ID, AGE, SEX, RELIGION, INCOME, MARRIED, CHILDREN, CAR, SAVING A/C, CURRENT A/C, LOAN, PENSION PLAN.
Weka accepts a few input file formats, such as .csv and .arff. We use a .csv file as the input file in our example. The given data file consists of 1600 instances and the 12 attributes described above.
Steps in K-means analysis:
Step 1: Weka startup screen
Step 2: Choose the Explorer option from the menu. This option is sufficient for all the operations we need to perform on the data.
Step 3: Load the .csv file of bank accounts data.
Step 4: Since we intend to create clusters within the data, click on the Cluster tab and choose Simple K-means (SimpleKMeans) among the choices that appear. The following screen appears.
Step 5: Click on the text box next to the Choose button and the following menu appears.
Step 6: Assign the value 4 to the ‘numClusters’ box.
Step 7: Click on Start to begin the clustering process. The following screen appears.
Step 8: The result can be viewed in a separate window. The following screen appears.
From the above results we can interpret that Cluster 0: centers around the male population; mainly lives in town areas; is mostly unmarried; doesn’t own a car or have a previous loan; owns a savings a/c and a current a/c; still does not have a pension plan.
Hence we can conclude that cluster 0 is the cluster most likely to buy a pension plan. A similar interpretation can be applied to the other clusters as required.
Step 9: We can use Visualize All to see the distribution of all the variables in the population.
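The clustering that Weka performs in Steps 4 to 8 can be illustrated outside the GUI. What follows is a minimal sketch of the basic K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its members), not Weka's own implementation. The six customer rows, the restriction to two numeric attributes (AGE and INCOME), the min-max scaling, the evenly spaced starting centroids and the choice of 2 clusters instead of 4 are all assumptions made to keep the example small.

```python
def min_max_scale(rows):
    """Rescale every column to [0, 1] so that no attribute dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid dividing by zero
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Basic K-means on numeric tuples; returns (centroids, labels)."""
    # Assumption for reproducibility: start from points spread evenly through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:  # no point changed cluster, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical (AGE, INCOME in thousands) rows, invented purely for illustration.
rows = [(25, 18), (52, 60), (31, 22), (58, 72), (28, 20), (61, 65)]
centroids, labels = k_means(min_max_scale(rows), k=2)

for c in range(2):
    members = [r for r, lab in zip(rows, labels) if lab == c]
    mean_age = sum(r[0] for r in members) / len(members)
    mean_income = sum(r[1] for r in members) / len(members)
    print(f"Cluster {c}: {len(members)} customers, "
          f"mean age {mean_age:.1f}, mean income {mean_income:.1f}")
```

The per-cluster means printed at the end play the same role as the cluster centroids in the Weka result window: they are what we read off when describing a cluster such as Cluster 0. Nominal attributes such as SEX or MARRIED are left out of this sketch.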
Linear Regression using WEKA
Regression
A regression model can answer questions such as how much should be charged for a given model of car with a certain set of features. It uses past data on car sales, car prices, the features provided and other attributes to determine the price of future models.
Regression in WEKA
Suppose a company wants to regress the price of a car on the various features associated with it. It can run the regression in WEKA by choosing the independent variables appropriately and then obtaining a regression equation that describes the relationship between the independent variables and the dependent variable. The following example illustrates this procedure.
Step 1: Weka startup screen
Step 2: Choose the Explorer option from the menu. This option is sufficient for all the operations we need to perform on the data.
Step 3: Load the .csv file of car specification data.
Step 4: Click the Classify tab, then click the Choose button and select Linear Regression from Functions. The following screen appears.
Step 5: After clicking on the Start button, the following output is generated.
Interpretation of the output: From the above output, we can observe that the selling price is positively correlated with the engine displacement and with none of the other factors.
Step 6: Right-click on the result list for options and select Visualize classifier errors to get the following screen.
Step 7: If we click on any point in the plot, Weka displays a summary of that data point.
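To make the regression equation and the classifier errors more concrete, here is a minimal sketch of an ordinary least-squares fit with a single independent variable. It is not Weka's Linear Regression implementation, which works with several attributes at once. The five (engine displacement, price) pairs are invented for illustration, and the names `fit_line`, `displacement` and `price` are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; also returns Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data: engine displacement in litres, selling price in thousands.
displacement = [1.2, 1.6, 2.0, 2.4, 3.0]
price = [9.5, 12.0, 15.5, 17.0, 22.0]

slope, intercept, r = fit_line(displacement, price)
print(f"price = {intercept:.3f} + {slope:.3f} * displacement (r = {r:.3f})")

# Residuals (actual minus predicted) are the per-point errors of the fitted line.
predicted = [intercept + slope * x for x in displacement]
residuals = [actual - fitted for actual, fitted in zip(price, predicted)]
for x, actual, err in zip(displacement, price, residuals):
    print(f"displacement {x:.1f}: actual {actual:.1f}, error {err:+.2f}")
```

A positive slope together with a correlation coefficient close to 1 corresponds to the statement that price is positively correlated with engine displacement. The residuals are the quantities that the Visualize classifier errors plot lets us inspect point by point.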