Weka for clustering and regression itb vgsom


Explaining Weka with simple practical examples.

Published in: Technology, Education

  2. WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. WEKA is free software available under the GNU General Public License. Compared with MS Excel, WEKA makes it easy to run multivariate regression, and its output shows the fitted equation for the dependent variable along with other statistical data.

     Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.

     The initial versions of WEKA used only Attribute-Relation File Format (ARFF) files, saved as *.arff. Newer versions also support other formats, including XRFF, binary serialized files, LIBSVM, SVM Light, CSV, and C4.5.

     USING WEKA:
     The WEKA GUI Chooser has the following four options:
       1. Weka Explorer
       2. Weka Experimenter
       3. Weka Knowledge Flow
       4. Simple CLI

     Weka Explorer has the following tabs:
       1. Preprocess
       2. Classify
       3. Cluster
       4. Associate
       5. Select Attributes
       6. Visualize
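The ARFF format mentioned above is plain text: a header that names the relation and declares each attribute, followed by a data section of comma-separated rows. The sketch below shows that layout by converting a small CSV string to ARFF. It is only an illustration, not WEKA's own converter: the function name `csv_to_arff`, the relation name, and the sample values are made up, and every column is assumed to be numeric.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text (with a header row) to ARFF text.

    Assumption: every column is numeric. Nominal or string
    attributes would need a different @attribute declaration.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines = [f"@relation {relation}", ""]
    # One @attribute line per CSV column.
    lines += [f"@attribute {name} numeric" for name in header]
    # The @data section holds the rows, comma-separated.
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Made-up sample values, for illustration only.
sample = "Sales,Price,Promotion\n4100,60,300\n3800,65,250\n"
arff_text = csv_to_arff(sample, "chocolate_sales")
print(arff_text)
```

The printed text can be saved with an .arff extension and opened from the Preprocess tab like any other ARFF file.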
  3. Apart from these statistical operations, the data can be visualized graphically and filtered as required.

     Weka Experimenter:
     Several algorithms are available for each task, so the critical step is identifying the best one. For regression and classification, the Experimenter compares algorithms by statistical analysis to find the best performer. Unfortunately, no such option exists for clustering algorithms.

     Import of data:
     Data is imported as a CSV file, which is converted into ARFF format automatically during import. The data is imported through the Preprocess tab of WEKA, as shown in the picture above.

     CLUSTERING
     Definition: Cluster analysis is a class of statistical techniques that can be applied to data that exhibit “natural” groupings. Cluster analysis sorts through the raw data and groups them into clusters. A cluster is a group of relatively homogeneous cases or observations. Objects in a cluster are similar to each other. They are also dissimilar to objects outside the cluster, particularly objects in other clusters.
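The grouping idea in the definition above can be illustrated with k-means, the family of algorithm this deck later applies to its survey data. The sketch below is plain Python, not WEKA's implementation, which differs in details such as how the starting centroids are chosen. The function name `kmeans`, the naive "first k points" initialisation, and the sample answers are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Minimal k-means: returns (centroids, cluster index per point)."""
    # Naive initialisation (an assumption): use the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)

    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return centroids, assignment


# Hypothetical 1-5 survey answers (two questions per respondent).
answers = [(5, 1), (1, 5), (4, 2), (5, 2), (2, 4), (1, 4)]
centroids, labels = kmeans(answers, k=2)
print(labels)
print(centroids)
```

Each centroid is the per-question mean of its members, which is the same kind of per-attribute summary that a clustering run reports for every cluster.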
  4. DATA SET USED FOR CLUSTERING
     The example used is a survey report on traditional sweets. It had:
     Instances: 76
     Attributes: 33

     The questions or attributes were as follows:
     Age, Profession, Diabetesstop, Obesitystop, Otherstop, Cadburynchocl,
     Homemadesweets, Sweetfrmshop, Cakepastry, Sugarcube, Celebration, Gifts,
     Beginningauspicious, Yummyfood, Healthconcern, Lunchdinnerafter,
     Tastytraditn, Abroad, Frequencyeating, Inflnearby, Inflfrndrelative,
     Inflblogonline, Advert, Quality, Packaging, Ambience, Price,
     Imptraditonsweet, Newexperimentswt, Newvariety, Homedeliveryimp,
     Impchitchatplace, Packagdsweetslngtime
  5. PROCEDURE AND RESULT:
     The data set is taken from my AMRP project survey on the interest and motivation of consumers towards traditional sweets.
     The Simple K-Means algorithm was used to cluster the data set.
     The output is as follows:

     Attribute               Full Data   0         1
                             (76)        (44)      (32)
     =======================================================
     Age                     1.6711      1.6364    1.7188
     Profession              1.7632      1.6818    1.875
     Diabetesstop            2.3553      2.3636    2.3438
     Obesitystop             1.9605      1.9545    1.9688
     otherstop               1.9474      1.8636    2.0625
     Cadburynchocl           4.2895      4.25      4.3438
     homemadesweets          4.3421      4.3636    4.3125
     sweetfrmshop            4.0395      4.1136    3.9375
     cakepastry              3.9342      4.0455    3.7813
     sugarcube               2.4605      2.5       2.4063
     celebration             4.1447      4.3409    3.875
     gifts                   3.7632      3.7955    3.7188
     beginningauspicious     3.7763      3.8636    3.6563
     yummyfood               3.8158      3.9318    3.6563
     healthconcern           2.9868      3         2.9688
     lunchdinnerafter        3.9737      4.0909    3.8125
     tastytraditn            3.7632      4.0227    3.4063
     abroad                  1.8684      1.8864    1.8438
     frequencyeating         2.5658      2.4318    2.75
     inflnearby              3.0         4.0       3.0
     inflfrndrelative        4.0         4.0       3.0
     inflblogonline          3.0         3.0       2.0
     advert                  3.0         3.0       2.0
     quality                 5.0         5.0       5.0
     packaging               3.0         3.0       4.0
     ambience                3.0         3.0       4.0
     price                   3.0         4.0       3.0
     imptraditonsweet        5.0         5.0       3.0
     newexperimentswt        3.0         3.0       3.0
     newvariety              3.0         3.0       4.0
     homedeliveryimp         2.8158      2.8409    2.7813
     impchitchatplace        3.3421      3.3182    3.375
  6. packagdsweetslngtime    3.1579      3         3.375

     Note: The significant values in the above table, on which the cluster characteristics are based, are marked in red.

     Clustered Instances
     0    44 (58%)
     1    32 (42%)

     INTERPRETATION:

     ASPECTS                          CLUSTER ‘0’                      CLUSTER ‘1’
     Traditionality                   Loves traditional sweets.        Loves experiments and newer
                                      Considers sweets a traditional   varieties of sweets.
                                      symbol. Wants sweets after
                                      lunch or dinner.
     Frequency of consumption         High                             Medium
     Price                            More price sensitive             Less price sensitive
     Influence of friends, relatives  High                             Medium. Generally tries a new
     or advertisements to try a                                        shop on own instinct.
     new shop
     Ambience of shop and packaging   Matters less                     Matters significantly
     Food court for chatting          Preferred                        Preferred
     (like Haldiram)
     Packaged/tinned sweets           Medium                           Good demand

     INFERENCE AND SUGGESTION DERIVED FROM THE CLUSTERING:
     There are two distinct clusters of consumers in the sweet industry.

     Cluster ‘0’ (58%) considers sweets the “symbol of tradition”, typically savored after lunch and dinner. These consumers enjoy the most traditional sweets and prefer not to try new variants. They stick to old shops unless external agents (friends, relatives, blogs, advertisements, etc.) inspire them to try otherwise. Quality is an important factor, but ambience and packaging do not play a major role. Shops like Nokur or Girish Dey will typically be their favorites.

     Cluster ‘1’ (42%) are the true connoisseurs of sweets. They appreciate both traditional and experimental sweets (the new variants), and they often prefer trying out new shops and brands. They also like packaged sweets, which can be savored later. Apart from quality, ambience and packaging play a vital role, whereas price is of medium importance.

  7. This cluster seems to consist of more impulsive consumers, who would probably not mind paying a premium for new and creative sweets. Brands like K.C. Das will be their preferred choice.

     REGRESSION
     The next procedure is regression analysis.
     We obtained data from stores on the monthly sales of a celebration chocolate pack, its price, and the amount spent on its promotion (posters used around the block or any other effort).
     We then select all attributes, go to the Classify tab, and run the linear regression function.

     OUTPUT
     The output obtained is given below.

     === Run information ===
     Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
     Relation:     Problem_2-weka.filters.unsupervised.attribute.Remove-R1
     Instances:    46
  8. Attributes:   3
                   Sales
                   Price
                   Promotion
     Test mode:    split 80.0% train, remainder test

     === Classifier model (full training set) ===
     Linear Regression Model
     Sales = -53.2173 * Price + 3.6131 * Promotion + 5837.5208
     Time taken to build model: 0 seconds

     === Evaluation on test split ===
     === Summary ===
     Correlation coefficient                  0.8066
     Mean absolute error                    543.6332
     Root mean squared error                711.4575
     Relative absolute error                 48.288  %
     Root relative squared error             59.6886 %
     Total Number of Instances                5
     Ignored Class Unknown Instances          4

     INTERPRETATION
     The test split shows a correlation coefficient of 0.8066 between predicted and actual sales. Squaring it gives about 0.65, so the model accounts for roughly 65% of the variation in sales. As expected, the coefficients show that sales decrease as price increases and increase as the promotion budget increases.
     This shows how WEKA can be used for multivariate regression.
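The fitted equation and the correlation coefficient reported in the regression output can be checked with a few lines of Python. This is only a sketch: the function name `predict_sales` and the input values (a price of 50 and a promotion spend of 200) are made up for illustration, and only the coefficients and the correlation value come from the output.

```python
def predict_sales(price, promotion):
    """Apply the linear model reported in the WEKA output."""
    return -53.2173 * price + 3.6131 * promotion + 5837.5208


# Squared correlation: the share of variation in sales that the
# predictions account for on the test split.
r = 0.8066
r_squared = r ** 2
print(round(r_squared, 4))  # about 0.65

# Hypothetical inputs, chosen only to show the direction of each effect.
base = predict_sales(price=50, promotion=200)
higher_price = predict_sales(price=51, promotion=200)
more_promotion = predict_sales(price=50, promotion=201)

print(higher_price - base)    # negative: sales fall as price rises
print(more_promotion - base)  # positive: sales rise with promotion
```

Changing one input by one unit while holding the other fixed shifts the prediction by exactly that input's coefficient, which is how the signs in the equation are read.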