
Term Paper on WEKA


A term paper demonstrating the use of WEKA to conduct cluster analysis and regression analysis.


  1. 2012
     Regression Analysis and Cluster Analysis Using WEKA
     Kanishka Chakraborty (10BM60036)
     VGSoM, IIT Kharagpur 2010-2012
  2. Table of Contents
     Introduction
     Scope of this term paper
       Data Used
       Analysis Done
     Analysis
       Regression Analysis
       Cluster Analysis
     References
  3. Introduction

     The amount of data generated is huge and grows at an exponential rate. But data is not of much use in itself: it must be converted into information that can be interpreted and used. There are multiple methods to convert data into information. Data mining is one of the methods that help in deducing meaningful patterns and facts from data. It has applications in every walk of life, and an organization needs it to gain the insights on which its decisions will be based. Many data mining tools are present in the market. WEKA (Waikato Environment for Knowledge Analysis) is one such tool, and it is among the most widely used toolkits of its kind.

     It is a free, Java-based tool available under the GNU General Public License. Its many features have made it a popular data mining tool: it offers visualization tools, algorithms, and preprocessing and modeling techniques for data mining. It provides the user with both a GUI (Graphical User Interface) and a CLI (Command Line Interface).

     The applications available:
     • Explorer: An environment to analyze data in WEKA
     • Experimenter: An environment for conducting statistical tests
     • KnowledgeFlow: Same as Explorer, with the additional feature of drag-and-drop
     • Simple CLI: Provides a command line interface for WEKA

     The tool requires the data to be in .arff format. ARFF stands for Attribute-Relation File Format. It is an ASCII file with all the attributes, their relation and the values for each instance. It consists of three parts: Relation, Attribute and Data.
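As an illustration of the three parts of an ARFF file, the following Python sketch splits a small ARFF text into its relation name, attribute declarations and data rows. The sample content (the relation `vas_survey` and the attributes `mobile_internet` and `handset_price`) is hypothetical and is not the survey file used in this paper. The parser is a sketch that handles only the simple, unquoted subset of the format.

```python
SAMPLE_ARFF = """\
% Hypothetical example, not the survey data from this paper
@relation vas_survey

@attribute mobile_internet {yes,no}
@attribute handset_price numeric

@data
yes,4
no,2
"""


def split_arff(text):
    """Split simple ARFF text into (relation, attributes, rows).

    Handles only unquoted names and comma-separated values.
    """
    relation = None
    attributes = []  # list of (name, type) pairs
    rows = []        # list of value lists, one per instance
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = split_arff(SAMPLE_ARFF)
print(relation)
print(attributes)
print(rows)
```

Each returned piece corresponds to one of the three parts: the Relation header, the Attribute declarations, and the Data section.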
  4. Scope of this term paper

     This paper analyzes the usage pattern and experience of telecom customers with Value Added Services (VAS). The analysis is carried out in order to identify customers who are likely to go for a service like 3G. The paper also tries to identify which factors are important in assessing whether a customer will adopt 3G. This information plays a major role in the creation of the marketing strategy for 3G.

     DATA USED

     The data used for this paper was collected through a survey conducted in Guwahati, Assam. The survey was done in order to identify the important factors that differentiate potential 3G customers from non-potential ones. The sample size used for this analysis is 206 and consists of the following demographic segments:
     • Students
     • Young Professionals (<35 years of age)
     • Working Professionals (>35 years of age)
     • Housewives
     • Defense personnel
     • Low Income Group (Rickshaw drivers, Auto rickshaw drivers, Shopkeepers etc.)

     | Variable | Description | Categories |
     | Monthly expenditure on VAS | How much the customer spends on VAS in a month | <100, 100-300, 300-500, >500 |
     | Mobile Internet | Whether the customer uses internet on their mobile | Yes, No |
     | Internet speed experience | What has been the mobile internet usage satisfaction level of the customers | Satisfied, Neither Satisfied nor Dissatisfied, Dissatisfied, Not used |
     | 3G Awareness | How aware is the customer regarding the 3G services | Using 3G, Fully Aware, Partially Aware, Not Aware |
     | Handset Price | What is the price of the handset the customer is using | <3000, 3000-5000, 5000-7000, 7000-10000, 10000-15000, 15000-20000, 20000-30000, >30000 |
     | 3G usage plan | Whether the customer is planning to use 3G in the near future | Yes, No |
     | Demography | The age-occupation combination of the customer | Low income group, Housewives, Defense, Young Professionals, Working Professionals, Students |
  5. To be usable in WEKA, the data was first converted to .arff format. This is done by introducing a few things:
     • Attribute: Each variable is defined as an attribute. The data type (numeric, string etc.) is also defined for each attribute.
     • Data: The instances are entered under the data header. Each instance consists of one value for each attribute.

     ANALYSIS DONE

     The following analyses will be conducted using the tool:
     • Regression
     • Clustering

     Regression is carried out in order to understand the relation between the variables in the data, so as to predict how one variable varies with respect to one or more other variables. Clustering is a technique that forms groups and assigns each instance to one of them. Each group consists of instances that are similar to each other. It is widely used for segmenting customers according to their characteristics and preferences.
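The conversion described above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used for the paper: the function `to_arff`, the relation name, the attribute names and the two respondents are all hypothetical, and values are written without quoting or escaping.

```python
def to_arff(relation, attributes, records):
    """Build ARFF text from attribute declarations and records.

    attributes: list of (name, type) pairs, where type is "numeric"
                or a list of nominal category labels.
    records:    list of dicts mapping attribute name to value.
    """
    lines = ["@relation " + relation, ""]
    for name, attr_type in attributes:
        if isinstance(attr_type, list):
            declared = "{" + ",".join(attr_type) + "}"  # nominal attribute
        else:
            declared = attr_type
        lines.append("@attribute " + name + " " + declared)
    lines += ["", "@data"]
    for record in records:
        # One comma-separated line per instance, in attribute order.
        lines.append(",".join(str(record[name]) for name, _ in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical attribute names and two made-up respondents.
ATTRIBUTES = [
    ("mobile_internet", ["yes", "no"]),
    ("handset_price", "numeric"),
    ("plan_3g", ["yes", "no"]),
]
RECORDS = [
    {"mobile_internet": "yes", "handset_price": 4, "plan_3g": "yes"},
    {"mobile_internet": "no", "handset_price": 2, "plan_3g": "no"},
]

arff_text = to_arff("vas_survey", ATTRIBUTES, RECORDS)
print(arff_text)
```

Declaring each attribute with its type up front is what lets the data section stay a plain list of comma-separated values.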
  6. Analysis

     Regression Analysis

     Regression analysis is used to understand the relation that a particular variable (the dependent variable) shares with others (the independent variables). For this paper the factors studied are as follows:
     • Dependent Variable: Plan to use 3G
     • Independent Variables:
       o Internet mobile user
       o 3G awareness
       o Price of the handset used

     STEPS TO FOLLOW
     I. Select the Classify tab
     II. Click on the Choose button
     III. Go to functions
     IV. Select LinearRegression from the list
     V. Under Test options, choose Percentage split and enter the percentage of the data to be used for training (the rest is held out for testing)
     VI. Click on Start to perform the analysis
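The idea behind a percentage split can be sketched in Python as follows: the instances are shuffled, the chosen percentage is used to fit the model, and the remainder is held out to evaluate it. The function name `percentage_split`, the seed and the rounding rule are assumptions made for illustration and are not taken from WEKA's implementation.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle the instances and split them into (train, test) lists."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 0 and 100")
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# 206 placeholder instance ids, matching the sample size of the survey.
train, test = percentage_split(range(206), train_percent=66)
print(len(train), len(test))
```

With 206 instances and a 66% split, this sketch trains on 136 instances and tests on the remaining 70.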
  7. OUTPUT

     The regression analysis conducted on the data gives the following equation:

     3G Planner = 0.4599 * (Internet Mobile User) + 0.0891 * (3G awareness) - 0.1325 * (Handset price) + 0.9421

     ANALYSIS OF THE OUTPUT

     The signs of the coefficients have to be read against the numeric coding of the categories, which the paper does not list. The cluster centroids reported in the cluster analysis suggest that lower codes stand for "Yes" and for higher awareness. Under that coding, a lower value of 3G Planner indicates a more likely adopter, which is consistent with the interpretations below.

     The output leads to the following interpretations:
     • Whether a person is planning to use 3G depends to a great extent on whether that person uses internet on their mobile. A person who uses internet on their mobile is more likely to try 3G.
     • The plan to try 3G also depends on the price of the handset the respondent is currently using. The higher the price, the higher the likelihood that the person will try 3G.
     • The plan for 3G usage also depends on the 3G awareness level, although this dependence is weak. According to the output, the higher the awareness about 3G, the more likely it is that the person will try 3G.
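The fitted equation can be evaluated for a single respondent with a few lines of Python. The coefficients are the ones reported above. The function name and the input codes in the example call are illustrative assumptions, since the paper does not state how the categories were coded numerically.

```python
# Coefficients as reported in the regression output.
COEFFICIENTS = {
    "internet_mobile_user": 0.4599,
    "awareness_3g": 0.0891,
    "handset_price": -0.1325,
}
INTERCEPT = 0.9421


def predict_3g_planner(internet_mobile_user, awareness_3g, handset_price):
    """Evaluate the fitted linear equation for one respondent."""
    return (
        COEFFICIENTS["internet_mobile_user"] * internet_mobile_user
        + COEFFICIENTS["awareness_3g"] * awareness_3g
        + COEFFICIENTS["handset_price"] * handset_price
        + INTERCEPT
    )


# Input codes below are illustrative only; the paper does not list its coding.
score = predict_3g_planner(internet_mobile_user=1, awareness_3g=2, handset_price=4)
print(round(score, 4))
```

Because the handset price coefficient is negative, raising the handset price code while holding the other inputs fixed lowers the computed value.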
  8. Cluster Analysis

     Before creating a marketing strategy for any product, it is very important to identify the segments present in the market. These segments can then be studied in order to select the one best suited for targeting. Clustering can be used to identify the segments present in the market. For this paper, K Means Clustering has been used.

     STEPS TO FOLLOW
     I. Select the Cluster tab
     II. Click on the Choose button
     III. Select SimpleKMeans from the list
     IV. Click on the text box beside the Choose button and enter the number of clusters you want in numClusters
     V. Click on Start to perform the analysis
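The principle behind K Means Clustering can be sketched in Python as follows. This is a plain version of the standard algorithm for numeric points with fixed starting centroids. It is not WEKA's SimpleKMeans, which differs in details such as how the initial centroids are chosen. The function names and the four sample points are made up for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: index of the closest centroid for every point.
        assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(point, centroids[k]))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, assignment


# Made-up two-dimensional points forming two obvious groups.
POINTS = [(1.0, 1.0), (1.0, 2.0), (5.0, 4.0), (5.0, 5.0)]
final_centroids, labels = k_means(POINTS, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(final_centroids)
print(labels)
```

The number of clusters is fixed by the number of starting centroids, which mirrors the way the user supplies the cluster count in the steps above.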
  9. OUTPUT

     The outputs obtained are as follows:

     Cluster centroids

     The centroids obtained by clustering help in understanding the characteristics of each segment. They provide information about each cluster in terms of the various variables.

     | Attribute | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 |
     | Membership | (65) | (61) | (41) | (39) |
     | Monthly Expense on VAS | 1.1692 | 1 | 1.1951 | 1.3333 |
     | Mobile Internet user | .6923 | 2 | 1.7317 | .9487 |
     | Satisfaction level of mobile internet usage | 1.9385 | 0 | 0 | 1.2462 |
     | 3G Awareness | 2.4769 | 2.6885 | 2.6098 | 2.1538 |
     | Demography | 4.18154 | 2.3607 | 4.6829 | 5.2821 |
     | Price of Handset used | 2.3385 | 2.1311 | 3.3415 | 3.8769 |
     | 3G usage plan | 2 | 2 | 1.9268 | .8974 |

     Clustered Instances

     Clustered instances give the number of instances that belong to each cluster. This aids in predicting what percentage of the total population is likely to belong to each cluster.
     • Cluster 0: 65 (32%)
     • Cluster 1: 61 (30%)
     • Cluster 2: 41 (20%)
     • Cluster 3: 39 (19%)
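The percentages in the clustered instances list follow directly from the membership counts. A short Python check, using only the counts reported above (the variable names are illustrative):

```python
CLUSTER_SIZES = {0: 65, 1: 61, 2: 41, 3: 39}  # as reported in the output

total = sum(CLUSTER_SIZES.values())
# Share of each cluster, rounded to the nearest whole percent.
shares = {k: round(100 * n / total) for k, n in CLUSTER_SIZES.items()}
for k, n in CLUSTER_SIZES.items():
    print(f"Cluster {k}: {n} ({shares[k]}%)")
```

The counts add up to the sample size of 206. The rounded shares add up to 101% rather than 100% because each one is rounded separately.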
  10. ANALYSIS OF THE OUTPUT

     In K Means Clustering the number of clusters to be formed is entered by the user. Here the number of clusters was set to 4. WEKA describes each cluster in terms of the centroid of each variable for that cluster. The cluster descriptions are as follows:

     | Attribute | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 |
     | Membership | (65) | (61) | (41) | (39) |
     | Monthly Expense on VAS | <100 | <100 | <100 | 0-300 |
     | Mobile Internet user | Yes | No | No | Yes |
     | Satisfaction level of mobile internet usage | Not Satisfied | Haven't used | Haven't used | Satisfied |
     | 3G Awareness | Low awareness | Low awareness | Low awareness | Fully Aware |
     | Demography | Working Professionals | Housewives | Working Professionals | Young Professionals & Students |
     | Price of Handset used | 3000-5000 | 3000-5000 | 5000-7000 | 7000-10000 |
     | 3G usage plan | No | No | No | Yes |

     Thus the segment to be targeted initially is cluster 3. It consists of young working professionals (<35 years of age) and students. This segment is the most likely to go for 3G services. The awareness level of this segment is fairly high. The handsets used by the members of this segment are in the price band of 7000-10000. The members of this segment are satisfied with the speed of internet they receive on their handsets. The cluster membership of this segment is 19%. Thus, according to the analysis, around 19% of the total population consists of customers who are likely to go for a service like 3G.

     References
     • http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
     • http://en.wikipedia.org/wiki/Weka_%28machine_learning%29
     • http://sourceforge.net/projects/weka/files/documentation/3.6.x/WekaManual-3-6-2.pdf/download
