Clustering and Regression using WEKA

Published in: Business, Technology, Education


  1. VGSOM. WEKA – Data Mining Techniques: Clustering and Regression. By M.P. Vijaya Prabhu (10BM60097)
  2. Contents
     1. Introduction (slide 3)
     2. Clustering (slide 4)
        2.1 Data Visualization (slide 8)
     3. Regression Analysis (slide 10)
        3.1 Pricing the house (slide 10)
     4. References (slide 13)
  3. WEKA – DATA MINING TECHNIQUES

     1. INTRODUCTION

     “Data Mining Software in Java”. Weka, an acronym for Waikato Environment for Knowledge Analysis, is a collection of state-of-the-art machine learning algorithms and data preprocessing tools written in Java, developed at the University of Waikato, New Zealand. It is free software that runs on almost any platform and is available under the GNU General Public License.

     Weka is a data mining tool that lets users carry out complex analyses interactively and visualize the results effectively.

     [Screenshot: the WEKA GUI]

     Advantages of using WEKA:
     1) Built-in advanced algorithms
     2) Effective visualization of results
     3) Easy-to-use GUI
  4. Let us demonstrate the use of WEKA with two examples, one on clustering (K-means) and one on regression.

     2. CLUSTERING

     The data is a sample bank data set taken from an online source. It contains the following attributes:
     1) age numeric
     2) sex {FEMALE,MALE}
     3) region {INNER_CITY,TOWN,RURAL,SUBURBAN}
     4) income numeric
     5) married {NO,YES}
     6) children {0,1,2,3}
     7) car {NO,YES}
     8) save_act {NO,YES}
     9) current_act {NO,YES}
     10) mortgage {NO,YES}
     11) pep {YES,NO}

     Based on these data, we need to cluster the customers into 6 groups and find the characteristics of each group. The sample data contains 600 instances. The objective is to cluster them with the K-means algorithm.

     First, the data is loaded into WEKA and preprocessed. Once preprocessing is done, we can start clustering the data.

     [Screenshot: preprocessing the data in WEKA]
  5. WEKA's SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes. When computing distances, as in our case, the built-in algorithm automatically normalizes numerical attributes. Euclidean distance is used as the general measure of distance between instances and clusters.

     After selecting k-means, we can open the advanced settings of the algorithm. We have changed the number of clusters from 2 to 6, to get 6 different clusters from the given data.
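The distance computation described on this slide can be illustrated with a small Python sketch. It is not WEKA's code: it assumes min-max scaling for numeric attributes and a 0/1 mismatch for categorical ones, and the attribute ranges and the two records are hypothetical values chosen for illustration.

```python
import math

def mixed_distance(a, b, numeric_ranges):
    """Euclidean-style distance over records with numeric and categorical fields.

    numeric_ranges maps each numeric attribute name to its (min, max) in the data
    (an assumption of this sketch; WEKA derives its ranges internally).
    """
    total = 0.0
    for name, value_a in a.items():
        value_b = b[name]
        if name in numeric_ranges:
            lo, hi = numeric_ranges[name]
            span = hi - lo
            # Scale the difference by the attribute range so that large-valued
            # attributes such as income do not dominate small-valued ones.
            diff = (value_a - value_b) / span if span else 0.0
        else:
            # Categorical attributes contribute 0 when equal and 1 otherwise.
            diff = 0.0 if value_a == value_b else 1.0
        total += diff * diff
    return math.sqrt(total)

# Hypothetical ranges and records using attribute names from the bank data.
ranges = {"age": (18, 67), "income": (5000.0, 63000.0)}
x = {"age": 30, "income": 20000.0, "sex": "FEMALE", "married": "YES"}
y = {"age": 45, "income": 35000.0, "sex": "MALE", "married": "YES"}
print(round(mixed_distance(x, y, ranges), 4))
```

Without the scaling step, the income difference (thousands) would swamp the age difference (tens), which is why normalization matters for mixed data like this.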
  6. After the required details are given, “Use training set” is checked. Then we can click “Start”. The result is given below.

     OUTPUT:

     === Run information ===

     Scheme:      weka.clusterers.SimpleKMeans -N 6 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
     Relation:    bank-data
     Instances:   600
     Attributes:  12
                  id
  7.              age
                  sex
                  region
                  income
                  married
                  children
                  car
                  save_act
                  current_act
                  mortgage
                  pep
     Test mode:   evaluate on training data

     === Clustering model (full training set) ===

     kMeans
     ======

     Number of iterations: 18
     Within cluster sum of squared errors: 1955.4146634784236
     Missing values globally replaced with mean/mode

     Cluster centroids:
                               Cluster#
     Attribute    Full Data   0           1           2           3           4           5
                  (600)       (74)        (164)       (71)        (58)        (99)        (134)
     ==========================================================================================
     id           ID12101     ID12107     ID12103     ID12101     ID12104     ID12102     ID12108
     age          42.395      42.9324     43.7744     39.0282     37.3103     38.404      47.3433
     sex          FEMALE      FEMALE      FEMALE      FEMALE      FEMALE      MALE        MALE
     region       INNER_CITY  RURAL       INNER_CITY  INNER_CITY  TOWN        INNER_CITY  TOWN
     income       27524.0312  28838.7605  28586.4063  20463.1273  20600.8528  25720.037   33568.3929
     married      YES         NO          YES         YES         YES         YES         NO
     children     1.0117      1.973       0.628       0.6901      1.6207      0.899       0.9403
     car          NO          NO          NO          NO          NO          YES         YES
     save_act     YES         YES         YES         NO          NO          NO          YES
     current_act  YES         YES         YES         YES         YES         YES         YES
     mortgage     NO          NO          NO          NO          NO          YES         NO
     pep          NO          NO          NO          YES         NO          YES         YES

     Time taken to build model (full training data) : 0.16 seconds

     === Model and evaluation on training set ===

     Clustered Instances

     0       74 ( 12%)
     1      164 ( 27%)
     2       71 ( 12%)
     3       58 ( 10%)
     4       99 ( 17%)
     5      134 ( 22%)
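To make the run options in the output above more concrete, here is a minimal k-means sketch in Python. It is a simplified illustration, not WEKA's SimpleKMeans: it handles numeric tuples only, and the parameter names `k`, `max_iter` and `seed` are chosen here to mirror the `-N`, `-I` and `-S` options. The sample points are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=500, seed=10):
    """Basic k-means; returns (centroids, assignments, sse)."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Within-cluster sum of squared errors, the quantity WEKA also reports.
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignments))
    return centroids, assignments, sse

# Hypothetical two-dimensional data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments, sse = kmeans(points, k=2)
print(assignments, round(sse, 4))
```

A lower within-cluster sum of squared errors means tighter clusters, so the figure is useful for comparing runs with different seeds or cluster counts on the same data.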
  8. The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters.

     0       74 ( 12%)
     1      164 ( 27%)
     2       71 ( 12%)
     3       58 ( 10%)
     4       99 ( 17%)
     5      134 ( 22%)

     The output of this clustering is given in the form of cluster centroids:

     Attribute    Full Data   0           1           2           3           4           5
     age          42.395      42.9324     43.7744     39.0282     37.3103     38.404      47.3433
     sex          FEMALE      FEMALE      FEMALE      FEMALE      FEMALE      MALE        MALE
     region       INNER_CITY  RURAL       INNER_CITY  INNER_CITY  TOWN        INNER_CITY  TOWN
     income       27524.0312  28838.7605  28586.4063  20463.1273  20600.8528  25720.037   33568.3929
     married      YES         NO          YES         YES         YES         YES         NO
     children     1.0117      1.973       0.628       0.6901      1.6207      0.899       0.9403
     car          NO          NO          NO          NO          NO          YES         YES
     save_act     YES         YES         YES         NO          NO          NO          YES
     current_act  YES         YES         YES         YES         YES         YES         YES
     mortgage     NO          NO          NO          NO          NO          YES         NO
     pep          NO          NO          NO          YES         NO          YES         YES

     For example, the centroid for cluster 0 shows that this is a segment of cases representing middle-aged (approx. 43) females living in rural areas with an average income of approx. $28,800, who are unmarried with about two children, etc. Furthermore, this group has on average said NO to the PEP product.

     2.1 Data Visualization

     The result can be viewed more intuitively with the visualization tools built into WEKA. The distribution of males and females in each cluster can be visualized with the following steps.

     Step 1: Right-click on the output and select “Visualize cluster assignments”.
  9. Step 2: Select Cluster as the X axis.
     Step 3: Select Instance_number as the Y axis.
     Step 4: Select “sex” as the colour, so that the instances are distinguished by sex based on colour.

     This results in a visualization of the distribution of males and females in each cluster.
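The plot produced by the steps above is, in effect, a picture of a cross-tabulation of sex against cluster. As a rough Python sketch of that same tabulation outside WEKA, assuming the cluster assignments and sex values are available as two parallel lists (the eight values below are hypothetical, not taken from the bank data):

```python
from collections import Counter

def sex_by_cluster(clusters, sexes):
    """Count how many instances of each sex fall in each cluster."""
    counts = Counter(zip(clusters, sexes))
    table = {}
    for (cluster, sex), n in sorted(counts.items()):
        table.setdefault(cluster, {})[sex] = n
    return table

# Hypothetical cluster assignments and sex values for eight instances.
clusters = [0, 0, 1, 1, 1, 4, 4, 5]
sexes = ["FEMALE", "FEMALE", "FEMALE", "MALE", "FEMALE", "MALE", "MALE", "MALE"]
for cluster, row in sex_by_cluster(clusters, sexes).items():
    print(cluster, row)
```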
  10. 3. Regression Analysis

      Regression can be done effectively, with many options, in WEKA. Let us explain it using a simple “LinearRegression” example.

      3.1 Pricing the house

      The data is taken from an online source. The selling price of a house needs to be determined from the data given. The data contains the following attributes:
      1) houseSize NUMERIC
      2) lotSize NUMERIC
      3) bedrooms NUMERIC
      4) granite NUMERIC
      5) bathroom NUMERIC
      6) sellingPrice NUMERIC

      So, based on the size of the house, the lot size, the number of bedrooms, whether it is furnished with granite, and the number of bathrooms, we need to predict the dependent variable, i.e. the selling price.

      First, the data is loaded into WEKA and any necessary preprocessing is done. Since our data is already processed, we proceed to selecting the type of regression.
  11. In the screen shown above, select “LinearRegression”. Then select “Use training set” in the Test options. There are three other choices available when doing simple linear regression:
      • Supplied test set: a separate test data set is supplied to evaluate the model.
  12. • Cross-validation: WEKA splits the supplied data into subsets, repeatedly builds a model on all but one subset and tests it on the subset left out, and averages the results.
      • Percentage split: WEKA builds the model on a given percentage of the data and tests it on the remainder.

      Here the attribute “sellingPrice” is chosen. This means that with the available data we are going to predict the dependent variable (selling price). Then click on the “Start” button to build a model using WEKA.

      OUTPUT:

      === Run information ===

      Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
      Relation:     house
      Instances:    700
      Attributes:   6
                    houseSize
                    lotSize
                    bedrooms
                    granite
                    bathroom
                    sellingPrice
      Test mode:    evaluate on training data

      === Classifier model (full training set) ===

      Linear Regression Model

      sellingPrice =

           22.6582 * houseSize +
            9.1242 * lotSize +
        42145.0767 * bedrooms +
        42562.0901 * bathroom +
       -20981.3142

      Time taken to build model: 0.04 seconds

      === Evaluation on training set ===
      === Summary ===

      Correlation coefficient                  0.9945
      Mean absolute error                   4790.821
      Root mean squared error               4245.4125
      Relative absolute error                 11.9082 %
      Root relative squared error             11.21   %
      Total Number of Instances              700

      The output predicts that the selling price will be
  13. sellingPrice = (22.6582 * houseSize) + (9.1242 * lotSize) + (42145.0767 * bedrooms) + (42562.0901 * bathroom) - 20981.3142

      To determine the selling price of a house from given data, just plug the values into this equation.

      The model omits the granite attribute, which indicates that granite does not matter much for the selling price of the house.

      4. References
      http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
      www.cs.waikato.ac.nz/ml/weka/
      http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
      http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html
      http://www.cs.utexas.edu/users/ml/tutorials/Weka-tut/
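As a worked illustration of plugging values into the regression equation from section 3.1, the following Python sketch evaluates it for one house. The coefficients are copied from the WEKA output; the house itself (3000 house size, 10000 lot size, 4 bedrooms, granite, 1 bathroom) is a hypothetical example, and the function name is chosen here for illustration.

```python
# Coefficients copied from the WEKA linear regression output.
# granite does not appear because the fitted model omits it.
COEFFICIENTS = {
    "houseSize": 22.6582,
    "lotSize": 9.1242,
    "bedrooms": 42145.0767,
    "bathroom": 42562.0901,
}
INTERCEPT = -20981.3142

def predict_selling_price(house):
    """Evaluate the fitted linear model for one house given as a dict."""
    return INTERCEPT + sum(coef * house[name] for name, coef in COEFFICIENTS.items())

# Hypothetical house; the granite value is ignored by the model.
house = {"houseSize": 3000, "lotSize": 10000, "bedrooms": 4, "granite": 1, "bathroom": 1}
print(round(predict_selling_price(house), 2))
```

Changing the granite value leaves the prediction unchanged, which is the practical meaning of the attribute being left out of the model.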
