WEKA Manual (IT for Business Intelligence)
Sagar (10BM60075)

IT for Business Intelligence (BM61080)
Data Mining Techniques using WEKA
Sagar (10BM60075)
Vinod Gupta School of Management

This term paper contains a brief introduction to WEKA, a powerful data mining tool, along with a guide to two data mining techniques, Clustering (k-means) and Linear Regression, using the WEKA tool.
Data Mining Techniques using WEKA:

WEKA: Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. The Weka workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. Weka supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection. Weka's main user interface is the Explorer, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface and from the command line. There is also the Experimenter, which allows the systematic comparison of the predictive performance of Weka's machine learning algorithms on a collection of datasets.

Interfaces:
- Command Line Interface (CLI)
- Graphical User Interface (GUI)

Fig. 1: The WEKA GUI Chooser

The buttons can be used to start the following applications:
- Explorer: the environment for exploring data with WEKA. It gives access to all the facilities using menu selection and form filling.
- Experimenter: answers the question "Which methods and parameter values work best for the given problem?"
- Knowledge Flow: supports incremental learning and allows designing configurations for streamed data processing. Incremental algorithms can be used to process very large datasets.
- Simple CLI: a simple command line interface for executing WEKA commands directly.

The Explorer interface features several panels providing access to the main components of the workbench:
- The Preprocess panel has facilities for importing data from a database, a CSV file, etc., and for preprocessing this data using a filtering algorithm. It is possible to transform the data and delete instances and attributes according to specific criteria.
- The Classify panel lets you apply classification and regression algorithms, estimate the accuracy of the resulting predictive model, and visualize erroneous predictions, ROC curves, etc.
- The Associate panel provides access to association rule learners that attempt to identify all important interrelationships between attributes in the data.
- The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm.
- The Select attributes panel provides algorithms for identifying the most predictive attributes in a dataset.
- The Visualize panel shows a scatter plot matrix, where individual scatter plots can be selected and enlarged, and analyzed further using various selection operators.

This paper will demonstrate the following two data mining techniques in WEKA:
- Clustering (Simple K Means)
- Linear regression

Clustering in WEKA

Clustering: Clustering can be loosely defined as the process of organizing objects into groups whose members are similar in some way. More precisely, it is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar to each other than to those in other clusters. The clusters found by different algorithms vary significantly in their properties, and understanding these "cluster models" is key to understanding the differences between the various algorithms. Typical cluster models include:
- Connectivity models: Models based on distance connectivity.
- Centroid models: The k-means algorithm, for example, represents each cluster by a single mean vector.
- Distribution models: Clusters are modeled using statistical distributions, such as multivariate normal distributions.
- Density models: Clusters are seen as connected dense regions in the data space.
- Subspace models: Clusters are modeled with both cluster members and relevant attributes.
- Group models: Do not provide a refined model, just the grouping information.
- Graph-based models: A clique (a subset of nodes in a graph such that every two nodes in the subset are connected by an edge) can be considered as a prototypical form of cluster.

Clustering algorithms may be classified as listed below:
- Exclusive Clustering
- Overlapping Clustering
- Hierarchical Clustering
- Probabilistic Clustering

Four of the most used clustering algorithms are:
- K-means
- Fuzzy C-means
- Hierarchical clustering
- Mixture of Gaussians

K-means is an exclusive clustering algorithm, Fuzzy C-means is an overlapping clustering algorithm, Hierarchical clustering is, as its name says, hierarchical, and Mixture of Gaussians is a probabilistic clustering algorithm.

K-Means Clustering: K-means (MacQueen, 1967) is one of the simplest clustering algorithms. The procedure classifies a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

The k-means algorithm does not necessarily find the optimal configuration. It is also sensitive to the initial, randomly selected cluster centres. There is no general theoretical solution for finding the optimal number of clusters for a given data set. A simple approach is to compare the results of multiple runs with different values of k and choose the best one.

Why do Clustering (Business Applications)?
- Market segmentation
- Identifying market needs
- Better understanding the relationships between different groups of consumers/potential customers
- Product positioning
- New product opportunities
- Selecting test markets
- Grouping all the shopping items available on the web into a set of unique products

K-Means Clustering in WEKA:

The data set we'll use for our clustering example focuses on a fictional BMW dealership. The dealership has kept track of how people walk through the dealership and the showroom, what cars they look at, and how often they ultimately make purchases. They are hoping to mine this data by finding patterns in it and by using clusters to determine if certain behaviors in their customers emerge. There are 100 rows of data in this sample, and each column describes a step that the customers reached in their BMW experience, with a column holding a 1 (they made it to this step or looked at this car) or a 0 (they didn't reach this step).

The ARFF data we'll be using with WEKA is the bmw-browsers file.
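Before moving on to the WEKA walkthrough, the four k-means steps listed above can be sketched in plain Python. This is only an illustrative sketch and not WEKA's SimpleKMeans implementation. The function names, the fixed random seed, and the tiny `visitors` data set (hypothetical 0/1 rows, not taken from bmw-browsers) are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, assignments) for a list of numeric vectors."""
    rng = random.Random(seed)
    # Step 1: use k distinct, randomly chosen data points as initial centroids.
    distinct = sorted(set(map(tuple, points)))
    centroids = [list(p) for p in rng.sample(distinct, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments (and hence centroids) stop changing.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return centroids, assignments


# Hypothetical visitors: (looked at M5, made a purchase), coded 0/1.
visitors = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
centroids, assignments = k_means(visitors, k=2)
for label, centroid in enumerate(centroids):
    size = assignments.count(label)
    print(f"cluster {label}: {size} visitors, centroid {centroid}")
```

Because every column is coded 0/1, each centroid coordinate in this sketch is simply the fraction of that cluster's members with a 1 in that column. The result depends on the seed, which reflects the sensitivity to initial centres noted above.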
Steps to be followed for doing K-Means Clustering in WEKA:

Step 1: Select Explorer in the Weka GUI Chooser window (Fig. 1).

Step 2: The following window appears:

Fig. 2

Step 3: Select "Open File" and load the ARFF data file bmw-browsers. After loading the file, the interface will look like this:

Fig. 3

Step 4: With this data set, we are looking to create clusters, so click on the Cluster tab. Click Choose and select SimpleKMeans, set the numClusters value to 5 and click OK:
Fig. 4

Step 5: To view the distribution of all variables in the population, click "Visualize All":

Fig. 5
Step 6: Now we are ready to run the clustering algorithm. 100 rows of data with five data clusters would likely take a few hours of computation with a spreadsheet, but WEKA can produce the answer in less than a second. Select "Use training set" in the Cluster mode panel and then click the Start button to begin the clustering process.

Fig. 6

Step 7: To display the result in a separate window, right-click the result in the Result list panel and select "View in separate window". The following result will be displayed:

Fig. 7
Results Interpretation:

The output tells how each cluster comes together, with a "1" meaning everyone in that cluster has a value of one for that attribute, and a "0" meaning everyone in that cluster has a value of zero for that attribute. Each cluster shows a type of behavior in customers, from which we can draw conclusions:

Cluster 0:
  o The "Dreamers"
  o Wander around the dealership
  o Don't purchase anything
Cluster 1:
  o The "M5 Lovers"
  o Not a high purchase rate: only 52 percent
  o A potential problem that could be a focus for improvement
  o More salespeople could be sent to the M5 section
Cluster 2:
  o The "Throw-Aways"
  o No good conclusions can be drawn from their behavior
Cluster 3:
  o The "BMW Babies"
  o Always purchase a car and finance it
  o They walk around the lot looking at cars and then go to the computer search available at the dealership
  o The search computers could be made more prominent around the lots
  o Tend to buy M5s or Z4s
Cluster 4:
  o The "Starting out with BMW"
  o These look at the 3-series and never at the much more expensive M5
  o Do not walk around the lot and ignore the computer search terminals
  o While 50 percent get to the financing stage, only 32 percent ultimately finish the transaction
  o These know exactly what kind of car they want (the 3-series entry-level model)
  o Sales to this group can be increased by relaxing financing standards or by reducing the 3-series prices

The data in these clusters can also be inspected visually. To do this:
- Right-click the result in the Result list panel
- Select "Visualize cluster assignments"

By setting the X-axis variable to M5 and the Y-axis variable to Purchase, we get the following output:

Fig. 8
This figure shows in a chart how the clusters are grouped in terms of who looked at the M5 and who purchased one. Also, turn up the "Jitter" to about three-fourths of the maximum, which artificially scatters the plot points so that we can see them more easily.

The visual results match the conclusions we drew above. At the X=1, Y=1 point (those who looked at M5s and made a purchase), the only clusters represented are 1 and 3. The only clusters at point X=0, Y=0 are 4 and 0. Clusters 1 and 3 were buying the M5s, while cluster 0 wasn't buying anything, and cluster 4 was only looking at the 3-series. By changing the X and Y axes, we can identify other trends and patterns.

Other clustering methods can also be used to group the data into clusters. WEKA is particularly useful when the data set is large, since it can generate clusters quickly even then. Because clustering has many business applications, WEKA is well suited to clustering data in real business scenarios.

Linear Regression using WEKA

Regression: Regression is the easiest technique to use, but is also probably the least powerful. The model can be as simple as one input variable and one output variable (a Scatter diagram in Excel, or an XY Diagram in OpenOffice.org). It can also be more complex, including dozens of input variables. Regression models all fit the same general pattern: there are a number of independent variables which, when taken together, produce a result, the dependent variable. The regression model is then used to predict the unknown value of the dependent variable, given the values of the independent variables. Correlation analysis can be applied to determine the degree to which variables are related.
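As a small illustration of the correlation analysis mentioned above, the sketch below computes a correlation coefficient in plain Python. The paper does not name a specific measure, so the use of the Pearson coefficient is an assumption, as is the function name `pearson_r`. The sample values are the lot sizes and selling prices of the seven priced houses tabulated later in this paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Lot size and selling price of the seven priced houses in this paper.
lot_size = [9191, 10061, 10150, 14156, 9600, 19994, 9365]
selling_price = [205000, 224900, 197900, 189900, 195000, 325000, 230000]
r = pearson_r(lot_size, selling_price)
print(f"lot size vs. selling price: r = {r:.2f}")
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 a weak one. With only seven houses, any such figure should be read as a rough indication.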
Broadly, regression can be classified into two types:
- Simple linear regression (one dependent variable and one independent variable)
- Multiple regression (one dependent variable and many independent variables)

The process of Multiple regression in WEKA is described with an example in this term paper.

Business applications of Regression
- Pricing decisions
- Trend line analysis
- Risk analysis for investments
- Sales or market forecasts
- Predicting the demographics and types of future work forces for large companies
- Total quality control

Regression in WEKA:

The price of a house (the dependent variable) is the result of many independent variables: the square footage of the house, the size of the lot, whether granite is in the kitchen, whether the bathrooms are upgraded, etc. Let's continue this example of a house price-based regression model, and create some real data to examine. These are actual numbers from houses for sale, and we will try to find the value for a house.

House values for regression model:

House size (square feet)   Lot size   Bedrooms   Granite   Upgraded bathroom?   Selling price
3529                       9191       6          0         0                    $205,000
3247                       10061      5          1         1                    $224,900
4032                       10150      5          0         1                    $197,900
2397                       14156      4          1         0                    $189,900
2200                       9600       4          0         1                    $195,000
3536                       19994      6          1         1                    $325,000
2983                       9365       5          0         1                    $230,000
3198                       9669       5          1         1                    ????

Steps to be followed:

Step 1: Select Explorer from the WEKA GUI Chooser window and load the file houses. The following screen will appear:

Fig. 9

Step 2: Click the Classify tab in the Explorer window and click the Choose button in the Classifier panel. Then select LinearRegression from functions:

Fig. 10
It automatically identifies the dependent variable as Selling Price. If it doesn't, we can select the dependent variable manually.

Step 3: Press the Start button and the following output is generated:

Fig. 11

As described earlier in the clustering example, the output can also be viewed in a separate window.

We can also visualize the classifier errors, i.e. those instances which the regression equation predicts wrongly, by right-clicking the result in the Result list panel and selecting "Visualize classifier errors".

Fig. 12
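To make the idea of a fitted regression equation concrete, here is a minimal sketch of ordinary least squares in plain Python, applied to the seven priced houses from the table above. It is not WEKA's LinearRegression: it keeps all five columns, whereas WEKA also selects attributes, so the coefficients shown in WEKA's output may differ. The function names are hypothetical and chosen only for this example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_least_squares(rows, targets):
    """Return [intercept, b1, ..., bp] minimizing the squared error."""
    design = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    """Evaluate the fitted equation for one row of independent variables."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))


# Columns: house size, lot size, bedrooms, granite, upgraded bathroom.
houses = [
    (3529, 9191, 6, 0, 0), (3247, 10061, 5, 1, 1), (4032, 10150, 5, 0, 1),
    (2397, 14156, 4, 1, 0), (2200, 9600, 4, 0, 1), (3536, 19994, 6, 1, 1),
    (2983, 9365, 5, 0, 1),
]
prices = [205000, 224900, 197900, 189900, 195000, 325000, 230000]

coefficients = fit_least_squares(houses, prices)
estimate = predict(coefficients, (3198, 9669, 5, 1, 1))  # the "????" row
print("coefficients:", [round(c, 4) for c in coefficients])
print("estimated selling price:", round(estimate))
```

With seven rows and six fitted parameters the model has almost no data to spare, so the estimate for the unpriced house is best treated as a demonstration of the mechanics only.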
Interpreting the regression model:

Let us interpret the patterns and conclusions that our model tells us, besides just a strict house value:
- Granite doesn't matter: WEKA will only use columns that statistically contribute to the accuracy of the model. This regression model is telling us that granite in the kitchen doesn't affect the house's value.
- Bathrooms do matter: We can use the coefficient from the regression model to determine the value of an upgraded bathroom on the house value.
- Bigger houses reduce the value: WEKA is telling us that the bigger our house is, the lower the selling price. This can be seen by the negative coefficient in front of the houseSize variable. The house size, unfortunately, isn't an independent variable because it's related to the bedrooms variable, which makes sense, since bigger houses tend to have more bedrooms.

Other applications of WEKA in data mining:

WEKA can be used for various other data mining techniques:
- Classification (using decision trees)
- Collaborative filtering (Nearest Neighbor)
- Association

References:
a) Ian H. Witten, Eibe Frank and Mark A. Hall, Data Mining, 3rd edition, Morgan Kaufmann
b) www.wikipedia.org
c) http://www2.cs.uregina.ca/~dbd/cs831/notes/clustering/clustering.html
d) http://www.ibm.com/developerworks/opensource/library/os-weka2/index.html