WEKA Manual (IT for Business Intelligence)                                     Sagar (10BM60075)




                 IT for Business Intelligence (BM61080)


     Data Mining Techniques using WEKA




Sagar (10BM60075)




This term paper contains a brief introduction to WEKA, a powerful data mining tool, along with a
guide to two data mining techniques, clustering (k-means) and linear regression, using the WEKA tool.




                               Vinod Gupta School of Management


Data Mining Techniques using WEKA:

WEKA: Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software
written in Java, developed at the University of Waikato, New Zealand. The Weka workbench contains a collection
of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user
interfaces for easy access to this functionality. Weka supports several standard data mining tasks: data
preprocessing, clustering, classification, regression, visualization, and feature selection. Weka's main user
interface is the Explorer, but essentially the same functionality can be accessed through the component-based
Knowledge Flow interface and from the command line. There is also the Experimenter, which allows the
systematic comparison of the predictive performance of Weka's machine learning algorithms on a collection of
datasets.

Interfaces –
      Command Line Interface (CLI)
      Graphical User Interface (GUI)

         The WEKA GUI Chooser –




                                                       Fig. 1

The buttons can be used to start the following applications –
     Explorer – this is the environment for exploring data with WEKA and gives access to all the facilities using
        menu selection and form filling.
     Experimenter – helps answer the question: which methods and parameter values work best for
        the given problem?
     Knowledge Flow – Supports incremental learning and allows designing configurations for streamed data
        processing. Incremental algorithms can be used to process very large datasets.
     Simple CLI – A simple Command Line Interface for executing WEKA commands directly.


The Explorer interface features several panels providing access to the main components of the
workbench:

        The preprocess panel has facilities for importing data from a database, a CSV file, etc., and for
         preprocessing this data using a filtering algorithm. It is possible to transform the data and delete
         instances and attributes according to specific criteria.
        The classify panel is used to apply classification and regression algorithms, to estimate the accuracy of
         the resulting predictive model, and to visualize erroneous predictions, ROC curves, etc.





        The associate panel provides access to association rule learners that attempt to identify all important
         interrelationships between attributes in the data.
        The cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm.
         There is also an implementation of the expectation maximization algorithm.
        The select attributes panel provides algorithms for identifying the most predictive attributes in a dataset.
        The visualize panel shows a scatter plot matrix, where individual scatter plots can be selected and
         enlarged, and analyzed further using various selection operators.

This paper will demonstrate the following two data mining techniques in WEKA:
     Clustering (Simple K Means)
     Linear regression



Clustering in WEKA


Clustering: Clustering can be loosely defined as: The process of organizing objects into groups whose members
are similar in some way. Clustering is the task of assigning a set of objects into groups (called clusters) so that the
objects in the same cluster are more similar to each other than to those in other clusters. The clusters found by
different algorithms vary significantly in their properties, and understanding these "cluster models" is key to
understanding the differences between the various algorithms. Typical cluster models include:

        Connectivity models: Models based on distance connectivity.
        Centroid models: The k-means algorithm represents each cluster by a single mean vector.
        Distribution models: Clusters are modeled using statistical distributions, such as multivariate normal
         distributions.
        Density models: Clusters are seen as connected dense regions in the data space.
        Subspace models: Clusters are modeled with both cluster members and relevant attributes.
        Group models: Do not provide a refined model, just the grouping information.
        Graph-based models: A clique (a subset of nodes in a graph such that every two nodes in the subset are
         connected by an edge) can be considered as a prototypical form of cluster.

Clustering algorithms may be classified as listed below:

        Exclusive Clustering
        Overlapping Clustering
        Hierarchical Clustering
        Probabilistic Clustering

Four of the most used clustering algorithms are:

        K-means
        Fuzzy C-means
        Hierarchical clustering
        Mixture of Gaussians

K-means is an exclusive clustering algorithm, Fuzzy C-means is an overlapping clustering algorithm, Hierarchical
clustering is, as the name indicates, a hierarchical algorithm, and Mixture of Gaussians is a probabilistic
clustering algorithm.
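
The difference between exclusive and overlapping clustering can be sketched in a few lines of Python. This is only an illustration, not WEKA code: the two centroids, the sample point and the fuzzifier m = 2.0 are assumptions, and the membership formula is the standard fuzzy c-means one.

```python
import math

def distances(point, centroids):
    """Euclidean distance from one point to each centroid."""
    return [math.dist(point, c) for c in centroids]

def hard_assignment(point, centroids):
    """Exclusive clustering: the point belongs to exactly one cluster."""
    d = distances(point, centroids)
    return d.index(min(d))

def fuzzy_memberships(point, centroids, m=2.0):
    """Overlapping clustering: fuzzy c-means style degrees that sum to 1.

    m > 1 is the fuzzifier; 2.0 is a common but arbitrary choice here.
    """
    d = distances(point, centroids)
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if 0.0 in d:
        return [1.0 if di == 0.0 else 0.0 for di in d]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exponent for dk in d) for di in d]

centroids = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centres
point = (1.0, 0.0)
print(hard_assignment(point, centroids))    # index of the single closest cluster
print(fuzzy_memberships(point, centroids))  # one membership degree per cluster
```

With hard assignment the point is counted only in the nearer cluster; with fuzzy memberships it belongs mostly to the nearer cluster and partly to the other.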

K-Means Clustering: K-means (MacQueen, 1967) is one of the simplest clustering algorithms. It partitions a given
data set into a fixed number of clusters (say k), chosen a priori. The main idea is to define k centroids, one for
each cluster. The algorithm is composed of the following steps:




    1.   Place K points into the space represented by the objects that are being clustered. These points represent
         initial group centroids.
    2.   Assign each object to the group that has the closest centroid.
    3.   When all objects have been assigned, recalculate the positions of the K centroids.
    4.   Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into
         groups from which the metric to be minimized can be calculated.
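
The four steps above can be sketched in plain Python. This is a minimal illustration and not WEKA's SimpleKMeans implementation; the function name k_means, the seed and max_iter defaults, and the sample points are assumptions made for the example.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples.

    Returns (centroids, assignments). The seed and max_iter defaults
    are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)
    # Step 1: pick k of the objects as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once no object changes cluster (centroids are stable).
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random start.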

The k-means algorithm does not necessarily find the optimal configuration. It is also quite sensitive to the
randomly selected initial cluster centres. There is no general theoretical solution for finding the optimal
number of clusters for a given data set. A simple approach is to compare the results of multiple runs with
different values of k and choose the best one.
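
One way to carry out such a comparison is sketched below: for each candidate k, k-means is restarted several times from random centres and the lowest within-cluster squared error is kept. The helper names, the number of restarts and the sample points are assumptions for illustration, not WEKA settings.

```python
import math
import random

def run_k_means(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns the within-cluster squared error."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

def best_error_per_k(points, k_values, restarts=10, seed=0):
    """For each k, keep the lowest error found over several random restarts."""
    rng = random.Random(seed)
    return {
        k: min(run_k_means(points, k, rng) for _ in range(restarts))
        for k in k_values
    }

# Hypothetical data: three tight groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
for k, error in best_error_per_k(points, [1, 2, 3, 4]).items():
    print(k, round(error, 2))
```

The error generally keeps shrinking as k grows, so "best" here usually means the k after which the improvement levels off, not simply the k with the smallest error.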

Why Do Clustering (Business Applications)?

        Market Segmentation
        Identifying market needs
        To better understand the relationships between different groups of consumers/potential customers
        Product positioning
        New product opportunities
        Selecting test markets
        Clustering can be used to group all the shopping items available on the web into a set of unique products

K-Means Clustering in WEKA:

The data set we'll use for our clustering example focuses on a fictional BMW dealership. The dealership has kept track
of how people walk through the dealership and the showroom, what cars they look at, and how often they
ultimately make purchases. They are hoping to mine this data by finding patterns in the data and by using clusters
to determine if certain behaviors in their customers emerge. There are 100 rows of data in this sample, and each
column describes the steps that the customers reached in their BMW experience, with a column having a 1 (they
made it to this step or looked at this car), or 0 (they didn't reach this step).
The ARFF data we'll be using with WEKA is:
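
For readers unfamiliar with ARFF, the sketch below shows the general layout of such a file (a relation name, one @attribute line per column, then comma-separated rows after @data) and parses it in Python. The relation name, the three attribute names and the two data rows are hypothetical placeholders and are not the contents of the bmw-browsers file.

```python
# Hypothetical ARFF text: placeholder attributes and rows, for layout only.
ARFF_TEXT = """\
@relation example-browsers
@attribute Showroom numeric
@attribute M5 numeric
@attribute Purchase numeric
@data
1,0,0
1,1,1
"""

def parse_arff(text):
    """Parse a minimal ARFF document with numeric attributes only.

    Returns (attribute_names, rows). Comment lines (%) and blank lines are
    skipped; quoting, sparse rows, nominal and date attributes are not handled.
    """
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue
        lower = line.lower()
        if in_data:
            rows.append([float(v) for v in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, rows

names, rows = parse_arff(ARFF_TEXT)
print(names)
print(rows)
```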






Steps to be followed for doing K-Means Clustering in WEKA:

Step 1: Select Explorer in the Weka GUI Chooser window (Fig.1)
Step 2: The following window appears:




                                                       Fig. 2

Step 3: Select “Open File” and load the ARFF data file bmw-browsers. After loading the file, the interface
looks like this –




                                                       Fig. 3
Step 4: With this data set, we are looking to create clusters, so click on the Cluster tab. Click Choose and
select SimpleKMeans, set the numClusters value to 5 and click OK:







                                                         Fig. 4


Step 5: To view the distribution of all variables in the population, click “Visualize All”:




                                                         Fig. 5



Step 6: Now we are ready to run the clustering algorithm. Clustering 100 rows of data into five clusters would
likely take a few hours of computation with a spreadsheet, but WEKA produces the answer in less than a second.
Select Use training set in the Cluster mode panel and then click the Start button to begin the clustering process.




                                                         Fig. 6

Step 7: To display the result in a separate window, right-click the result in the Result list panel and select View
in a separate window. The following result is displayed:




                                                         Fig. 7




Results Interpretation:
The output tells how each cluster comes together: with a "1" meaning everyone in that cluster shares the same
value of one, and a "0" meaning everyone in that cluster has a value of zero for that attribute. Each cluster shows
a type of behavior in customers, from which we can draw conclusions:
      Cluster 0:
             o The "Dreamers"
             o Wander around the dealership
             o Don't purchase anything

       Cluster 1:
            o The "M5 Lovers"
            o Not a high purchase rate: only 52 percent
            o A potential problem and could be a focus
            o More salespeople could be sent to the M5 section

       Cluster 2:
            o The "Throw-Aways"
            o No good conclusions from their behavior

       Cluster 3:
            o The "BMW Babies"
            o Always purchase a car and finance it
            o They walk around the lot looking at cars and then go to the computer search available at the
                 dealership
            o The dealership could make its search computers more prominent around the lots
            o Tend to buy M5s or Z4s

       Cluster 4:
            o The "Starting out with BMW"
            o These look at the 3-series and never at the much more expensive M5
            o Do not walk around the lot and ignore the computer search terminals
            o While 50 percent get to the financing stage, only 32 percent ultimately finish the transaction
            o These know exactly what kind of car they want (the 3-series entry-level model)
            o Sales to this group can be increased by relaxing financing standards or by reducing the 3-series
                 prices
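
Percentages such as the 52 percent purchase rate above can be read straight from the cluster centroids: for a 0/1 attribute, the centroid value is the mean over the cluster's members, which equals the share of members with the value 1. A minimal sketch, using hypothetical cluster members and not the dealership data:

```python
def centroid(rows):
    """Mean of each column over the rows that belong to one cluster."""
    return [sum(col) / len(rows) for col in zip(*rows)]

# Hypothetical cluster members; columns are (M5, Financing, Purchase).
members = [
    (1, 1, 1),
    (1, 0, 0),
    (1, 1, 1),
    (1, 0, 0),
]
print(centroid(members))  # [1.0, 0.5, 0.5]
```

Here every member looked at the M5 (centroid value 1.0), while half of them reached financing and half purchased (0.5 each).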

The data in these clusters can also be inspected visually. To do this:
     Right click the result in the Result list panel
     Select Visualize cluster assignments
     Setting the X-axis variable to M5 and the Y-axis variable to Purchase gives the following output:




                                                           Fig. 8





This figure shows in a chart how the clusters are grouped in terms of who looked at the M5 and who purchased
one. Also, turn up the "Jitter" to about three-fourths of the way maxed out, which will artificially scatter the plot
points to allow us to see them more easily.
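
The idea behind jitter can be sketched as follows: because every attribute is 0 or 1, many customers fall on exactly the same plot position, and a small random offset spreads them out for display. This is only an illustration of the concept; the noise range, the seeds and the sample values are assumptions, and it makes no claim about how WEKA computes its jitter.

```python
import random

def jitter(values, amount=0.3, seed=0):
    """Add uniform noise in [-amount, +amount] to each value, for plotting only."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

m5 = [1, 1, 1, 0, 0]        # hypothetical 0/1 values for the X axis
purchase = [1, 1, 0, 0, 0]  # hypothetical 0/1 values for the Y axis
xs = jitter(m5, seed=1)
ys = jitter(purchase, seed=2)
for x, y in zip(xs, ys):
    print(f"{x:.2f}, {y:.2f}")
```

The underlying data are unchanged; only the plotted coordinates are perturbed.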

The visual results match the conclusions we drew above. We can see in the X=1, Y=1 point (those who looked at
M5s and made a purchase) that the only clusters represented here are 1 and 3. We also see that the only clusters
at point X=0, Y=0 are 4 and 0. Clusters 1 and 3 were buying the M5s, while cluster 0 wasn't buying anything, and
cluster 4 was only looking at the 3-series. By changing X and Y axes, we can identify other trends and patterns.

Other clustering methods can also be used to group the data into clusters. WEKA is particularly useful when the
data set is large, since it can generate clusters quickly even then. Because clustering has many business
applications, WEKA is well suited to clustering data in real business scenarios.


Linear Regression using WEKA
Regression: Regression is the easiest technique to use, but is also probably the least powerful. This model can
be as easy as one input variable and one output variable (Scatter diagram in Excel, or an XY Diagram in
OpenOffice.org). It can get more complex than that, including dozens of input variables. Regression models all fit
the same general pattern: there are a number of independent variables, which, when taken together, produce a
result — a dependent variable. The regression model is then used to predict the result of an unknown dependent
variable, given the values of the independent variables. Correlation analysis can be applied to determine the
degree to which variables are related. Broadly, regression can be classified into two types:
      Simple linear regression (one dependent variable and one independent variable)

        Multiple regression (one dependent variable and many independent variables)
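
The simple linear case can be worked out by hand with the usual least-squares formulas. The sketch below is a minimal illustration on hypothetical data that lies exactly on y = 3 + 2x; the function name is an assumption made for the example.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 3 + 2x.
xs = [1, 2, 3, 4]
ys = [5, 7, 9, 11]
intercept, slope = fit_simple_regression(xs, ys)
print(intercept, slope)  # 3.0 2.0
```

Multiple regression follows the same principle with one coefficient per independent variable.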

The process of Multiple regression in WEKA is described with an example in this term paper.

Business applications of Regression
     Pricing decisions

        Trend Line Analysis

        Risk Analysis for Investments

        Sales or Market Forecasts

        To predict the demographics and types of future work forces for large companies.

        Total quality control

Regression in WEKA:

The price of the house (the dependent variable) is the result of many independent variables — the square footage
of the house, the size of the lot, whether granite is in the kitchen, bathrooms are upgraded, etc. Let's continue this
example of a house price-based regression model, and create some real data to examine. These are actual
numbers from houses for sale, and we will try to find the value for a house.
House values for regression model:
House size (square feet)   Lot size   Bedrooms   Granite   Upgraded bathroom?   Selling price
3529                       9191       6          0         0                    $205,000
3247                       10061      5          1         1                    $224,900
4032                       10150      5          0         1                    $197,900
2397                       14156      4          1         0                    $189,900
2200                       9600       4          0         1                    $195,000
3536                       19994      6          1         1                    $325,000
2983                       9365       5          0         1                    $230,000
3198                       9669       5          1         1                    ????
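
To show what fitting such a model involves, here is a minimal sketch that runs ordinary least squares on the seven priced houses above and applies the result to the last row. It is not WEKA's LinearRegression: WEKA applies attribute selection by default, so its coefficients need not match a plain fit on all five columns. The helper names solve, fit_ols and predict are assumptions made for this example.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, targets):
    """Ordinary least squares with an intercept; returns [intercept, coef_1, ...]."""
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Apply the fitted equation to one row of independent variables."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

# Columns: house size, lot size, bedrooms, granite, upgraded bathroom.
houses = [
    (3529, 9191, 6, 0, 0),
    (3247, 10061, 5, 1, 1),
    (4032, 10150, 5, 0, 1),
    (2397, 14156, 4, 1, 0),
    (2200, 9600, 4, 0, 1),
    (3536, 19994, 6, 1, 1),
    (2983, 9365, 5, 0, 1),
]
prices = [205000, 224900, 197900, 189900, 195000, 325000, 230000]

coefs = fit_ols(houses, prices)
print(coefs)
print(predict(coefs, (3198, 9669, 5, 1, 1)))  # the house with the unknown price
```

Solving the normal equations directly is adequate for a table this small; with only seven rows and six parameters the fit is close to exact on the training data, so the prediction should be read as an illustration, not as a reliable estimate.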


Steps to be followed:
Step 1: Select Explorer from the WEKA GUI Chooser window and load the file houses. The following screen appears:




                                                        Fig. 9

Step 2: Click the Classify tab in the Explorer window and click the Choose button in the Classifier panel. Then
select LinearRegression from functions:




                                                       Fig. 10




WEKA automatically identifies the dependent variable as Selling Price. If it does not, we can select the
dependent variable manually.

Step 3: Press the Start button and the following output is generated:




                                                       Fig. 11

As described earlier in the clustering example, the output can also be viewed in a separate window.
We can also visualize the classifier errors, i.e. how far the regression equation's predictions are from the
actual values, by right-clicking the result in the Result list panel and selecting Visualize classifier errors.




                                                       Fig. 12






Interpreting the regression model:
Let us interpret the patterns and conclusions that our model tells us, besides just a strict house value:
      Granite doesn't matter: WEKA will only use columns that statistically contribute to the accuracy of the
          model. This regression model is telling us that granite in your kitchen doesn't affect the house's value.
      Bathrooms do matter: We can use the coefficient from the regression model to determine the value of
          an upgraded bathroom on the house value.
      Bigger houses reduce the value: WEKA is telling us that the bigger our house is, the lower the selling
          price. This can be seen by the negative coefficient in front of the houseSize variable. The house size,
          unfortunately, isn't an independent variable because it's related to the bedrooms variable, which makes
          sense, since bigger houses tend to have more bedrooms.
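
The relationship between house size and bedrooms can be checked directly on the seven priced houses from the table by computing their Pearson correlation. This is a minimal sketch; the function name pearson is an assumption made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# House size and bedrooms for the seven houses with known prices.
house_size = [3529, 3247, 4032, 2397, 2200, 3536, 2983]
bedrooms = [6, 5, 5, 4, 4, 6, 5]
print(round(pearson(house_size, bedrooms), 2))
```

On these seven houses the correlation comes out at about 0.77, which supports the point that the two variables carry overlapping information and that the sign of a single coefficient should be interpreted with care.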


Other applications of WEKA in data mining:
WEKA can be used for various other data mining techniques:
    Classification (using decision trees)

        Collaborative filtering (Nearest Neighbor)

        Association


References:
    a)   Data Mining by Ian H. Witten, Eibe Frank and Mark A. Hall (3rd edition, Morgan Kaufmann publisher)

    b) www.wikipedia.org

    c)   http://www2.cs.uregina.ca/~dbd/cs831/notes/clustering/clustering.html

    d) http://www.ibm.com/developerworks/opensource/library/os-weka2/index.html




BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
IJCI JOURNAL
 
Excel Datamining Addin Intermediate
Excel Datamining Addin IntermediateExcel Datamining Addin Intermediate
Excel Datamining Addin Intermediate
excel content
 
Excel Datamining Addin Intermediate
Excel Datamining Addin IntermediateExcel Datamining Addin Intermediate
Excel Datamining Addin Intermediate
DataminingTools Inc
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET Journal
 
A Machine learning based framework for Verification and Validation of Massive...
A Machine learning based framework for Verification and Validation of Massive...A Machine learning based framework for Verification and Validation of Massive...
A Machine learning based framework for Verification and Validation of Massive...
IRJET Journal
 

Similar to Weka_Manual_Sagar (20)

Itb weka nikhil
Itb weka nikhilItb weka nikhil
Itb weka nikhil
 
Lx3520322036
Lx3520322036Lx3520322036
Lx3520322036
 
Clustering
ClusteringClustering
Clustering
 
Classification and Clustering Analysis using Weka
Classification and Clustering Analysis using Weka Classification and Clustering Analysis using Weka
Classification and Clustering Analysis using Weka
 
Open06
Open06Open06
Open06
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
Weka toolkit introduction
Weka toolkit introductionWeka toolkit introduction
Weka toolkit introduction
 
Weka toolkit introduction
Weka toolkit introductionWeka toolkit introduction
Weka toolkit introduction
 
Stock Market Prediction Using ANN
Stock Market Prediction Using ANNStock Market Prediction Using ANN
Stock Market Prediction Using ANN
 
Classification and Prediction Based Data Mining Algorithm in Weka Tool
Classification and Prediction Based Data Mining Algorithm in Weka ToolClassification and Prediction Based Data Mining Algorithm in Weka Tool
Classification and Prediction Based Data Mining Algorithm in Weka Tool
 
A Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine LearningA Comparative Study on Identical Face Classification using Machine Learning
A Comparative Study on Identical Face Classification using Machine Learning
 
Data Analysis – Technical learnings
Data Analysis – Technical learningsData Analysis – Technical learnings
Data Analysis – Technical learnings
 
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
 
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMSPREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS
 
Predicting performance of classification algorithms
Predicting performance of classification algorithmsPredicting performance of classification algorithms
Predicting performance of classification algorithms
 
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
 
Excel Datamining Addin Intermediate
Excel Datamining Addin IntermediateExcel Datamining Addin Intermediate
Excel Datamining Addin Intermediate
 
Excel Datamining Addin Intermediate
Excel Datamining Addin IntermediateExcel Datamining Addin Intermediate
Excel Datamining Addin Intermediate
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data Mining
 
A Machine learning based framework for Verification and Validation of Massive...
A Machine learning based framework for Verification and Validation of Massive...A Machine learning based framework for Verification and Validation of Massive...
A Machine learning based framework for Verification and Validation of Massive...
 

Recently uploaded

Brand Analysis for an artist named Struan
Brand Analysis for an artist named StruanBrand Analysis for an artist named Struan
Brand Analysis for an artist named Struan
sarahvanessa51503
 
Kseniya Leshchenko: Shared development support service model as the way to ma...
Kseniya Leshchenko: Shared development support service model as the way to ma...Kseniya Leshchenko: Shared development support service model as the way to ma...
Kseniya Leshchenko: Shared development support service model as the way to ma...
Lviv Startup Club
 
Sustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & EconomySustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & Economy
Operational Excellence Consulting
 
Discover the innovative and creative projects that highlight my journey throu...
Discover the innovative and creative projects that highlight my journey throu...Discover the innovative and creative projects that highlight my journey throu...
Discover the innovative and creative projects that highlight my journey throu...
dylandmeas
 
Exploring Patterns of Connection with Social Dreaming
Exploring Patterns of Connection with Social DreamingExploring Patterns of Connection with Social Dreaming
Exploring Patterns of Connection with Social Dreaming
Nicola Wreford-Howard
 
3.0 Project 2_ Developing My Brand Identity Kit.pptx
3.0 Project 2_ Developing My Brand Identity Kit.pptx3.0 Project 2_ Developing My Brand Identity Kit.pptx
3.0 Project 2_ Developing My Brand Identity Kit.pptx
tanyjahb
 
Digital Transformation and IT Strategy Toolkit and Templates
Digital Transformation and IT Strategy Toolkit and TemplatesDigital Transformation and IT Strategy Toolkit and Templates
Digital Transformation and IT Strategy Toolkit and Templates
Aurelien Domont, MBA
 
Call 8867766396 Satta Matka Dpboss Matka Guessing Satta batta Matka 420 Satta...
Call 8867766396 Satta Matka Dpboss Matka Guessing Satta batta Matka 420 Satta...Call 8867766396 Satta Matka Dpboss Matka Guessing Satta batta Matka 420 Satta...
Call 8867766396 Satta Matka Dpboss Matka Guessing Satta batta Matka 420 Satta...
bosssp10
 
Building Your Employer Brand with Social Media
Building Your Employer Brand with Social MediaBuilding Your Employer Brand with Social Media
Building Your Employer Brand with Social Media
LuanWise
 
Improving profitability for small business
Improving profitability for small businessImproving profitability for small business
Improving profitability for small business
Ben Wann
 
Introduction to Amazon company 111111111111
Introduction to Amazon company 111111111111Introduction to Amazon company 111111111111
Introduction to Amazon company 111111111111
zoyaansari11365
 
Affordable Stationery Printing Services in Jaipur | Navpack n Print
Affordable Stationery Printing Services in Jaipur | Navpack n PrintAffordable Stationery Printing Services in Jaipur | Navpack n Print
Affordable Stationery Printing Services in Jaipur | Navpack n Print
Navpack & Print
 
Training my puppy and implementation in this story
Training my puppy and implementation in this storyTraining my puppy and implementation in this story
Training my puppy and implementation in this story
WilliamRodrigues148
 
An introduction to the cryptocurrency investment platform Binance Savings.
An introduction to the cryptocurrency investment platform Binance Savings.An introduction to the cryptocurrency investment platform Binance Savings.
An introduction to the cryptocurrency investment platform Binance Savings.
Any kyc Account
 
ikea_woodgreen_petscharity_cat-alogue_digital.pdf
ikea_woodgreen_petscharity_cat-alogue_digital.pdfikea_woodgreen_petscharity_cat-alogue_digital.pdf
ikea_woodgreen_petscharity_cat-alogue_digital.pdf
agatadrynko
 
Set off and carry forward of losses and assessment of individuals.pptx
Set off and carry forward of losses and assessment of individuals.pptxSet off and carry forward of losses and assessment of individuals.pptx
Set off and carry forward of losses and assessment of individuals.pptx
HARSHITHV26
 
LA HUG - Video Testimonials with Chynna Morgan - June 2024
LA HUG - Video Testimonials with Chynna Morgan - June 2024LA HUG - Video Testimonials with Chynna Morgan - June 2024
LA HUG - Video Testimonials with Chynna Morgan - June 2024
Lital Barkan
 
In the Adani-Hindenburg case, what is SEBI investigating.pptx
In the Adani-Hindenburg case, what is SEBI investigating.pptxIn the Adani-Hindenburg case, what is SEBI investigating.pptx
In the Adani-Hindenburg case, what is SEBI investigating.pptx
Adani case
 
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdfSearch Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Arihant Webtech Pvt. Ltd
 
20240425_ TJ Communications Credentials_compressed.pdf
20240425_ TJ Communications Credentials_compressed.pdf20240425_ TJ Communications Credentials_compressed.pdf
20240425_ TJ Communications Credentials_compressed.pdf
tjcomstrang
 

Recently uploaded (20)

Brand Analysis for an artist named Struan
Brand Analysis for an artist named StruanBrand Analysis for an artist named Struan
Brand Analysis for an artist named Struan
 
Kseniya Leshchenko: Shared development support service model as the way to ma...
Kseniya Leshchenko: Shared development support service model as the way to ma...Kseniya Leshchenko: Shared development support service model as the way to ma...
Kseniya Leshchenko: Shared development support service model as the way to ma...
 
Sustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & EconomySustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & Economy
 
Discover the innovative and creative projects that highlight my journey throu...
Discover the innovative and creative projects that highlight my journey throu...Discover the innovative and creative projects that highlight my journey throu...
Discover the innovative and creative projects that highlight my journey throu...
 
Exploring Patterns of Connection with Social Dreaming
Exploring Patterns of Connection with Social DreamingExploring Patterns of Connection with Social Dreaming
Exploring Patterns of Connection with Social Dreaming
 
3.0 Project 2_ Developing My Brand Identity Kit.pptx
3.0 Project 2_ Developing My Brand Identity Kit.pptx3.0 Project 2_ Developing My Brand Identity Kit.pptx
3.0 Project 2_ Developing My Brand Identity Kit.pptx
 
Digital Transformation and IT Strategy Toolkit and Templates
Digital Transformation and IT Strategy Toolkit and TemplatesDigital Transformation and IT Strategy Toolkit and Templates
Digital Transformation and IT Strategy Toolkit and Templates
 
Call 8867766396 Satta Matka Dpboss Matka Guessing Satta batta Matka 420 Satta...
Call 8867766396 Satta Matka Dpboss Matka Guessing Satta batta Matka 420 Satta...Call 8867766396 Satta Matka Dpboss Matka Guessing Satta batta Matka 420 Satta...
Call 8867766396 Satta Matka Dpboss Matka Guessing Satta batta Matka 420 Satta...
 
Building Your Employer Brand with Social Media
Building Your Employer Brand with Social MediaBuilding Your Employer Brand with Social Media
Building Your Employer Brand with Social Media
 
Improving profitability for small business
Improving profitability for small businessImproving profitability for small business
Improving profitability for small business
 
Introduction to Amazon company 111111111111
Introduction to Amazon company 111111111111Introduction to Amazon company 111111111111
Introduction to Amazon company 111111111111
 
Affordable Stationery Printing Services in Jaipur | Navpack n Print
Affordable Stationery Printing Services in Jaipur | Navpack n PrintAffordable Stationery Printing Services in Jaipur | Navpack n Print
Affordable Stationery Printing Services in Jaipur | Navpack n Print
 
Training my puppy and implementation in this story
Training my puppy and implementation in this storyTraining my puppy and implementation in this story
Training my puppy and implementation in this story
 
An introduction to the cryptocurrency investment platform Binance Savings.
An introduction to the cryptocurrency investment platform Binance Savings.An introduction to the cryptocurrency investment platform Binance Savings.
An introduction to the cryptocurrency investment platform Binance Savings.
 
ikea_woodgreen_petscharity_cat-alogue_digital.pdf
ikea_woodgreen_petscharity_cat-alogue_digital.pdfikea_woodgreen_petscharity_cat-alogue_digital.pdf
ikea_woodgreen_petscharity_cat-alogue_digital.pdf
 
Set off and carry forward of losses and assessment of individuals.pptx
Set off and carry forward of losses and assessment of individuals.pptxSet off and carry forward of losses and assessment of individuals.pptx
Set off and carry forward of losses and assessment of individuals.pptx
 
LA HUG - Video Testimonials with Chynna Morgan - June 2024
LA HUG - Video Testimonials with Chynna Morgan - June 2024LA HUG - Video Testimonials with Chynna Morgan - June 2024
LA HUG - Video Testimonials with Chynna Morgan - June 2024
 
In the Adani-Hindenburg case, what is SEBI investigating.pptx
In the Adani-Hindenburg case, what is SEBI investigating.pptxIn the Adani-Hindenburg case, what is SEBI investigating.pptx
In the Adani-Hindenburg case, what is SEBI investigating.pptx
 
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdfSearch Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdf
 
20240425_ TJ Communications Credentials_compressed.pdf
20240425_ TJ Communications Credentials_compressed.pdf20240425_ TJ Communications Credentials_compressed.pdf
20240425_ TJ Communications Credentials_compressed.pdf
 

Weka_Manual_Sagar

  • 1.
WEKA Manual (IT for Business Intelligence)          Sagar (10BM60075)

IT for Business Intelligence (BM61080)

Data Mining Techniques using WEKA

Sagar (10BM60075)

This term paper contains a brief introduction to WEKA, a powerful data mining tool, along with a guide to two data mining techniques, Clustering (k-means) and Linear Regression, using the WEKA tool.

Page 1
Vinod Gupta School of Management
  • 2.
Data Mining Techniques using WEKA:

WEKA: Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. The Weka workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. Weka supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection. Weka's main user interface is the Explorer, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface and from the command line. There is also the Experimenter, which allows the systematic comparison of the predictive performance of Weka's machine learning algorithms on a collection of datasets.

Interfaces:
  - Command Line Interface (CLI)
  - Graphical User Interface (GUI)

The WEKA GUI Chooser:

Fig. 1

The buttons can be used to start the following applications:
  - Explorer: the environment for exploring data with WEKA. It gives access to all the facilities using menu selection and form filling.
  - Experimenter: answers the question "Which methods and parameter values work best for the given problem?"
  - Knowledge Flow: supports incremental learning and allows designing configurations for streamed data processing. Incremental algorithms can be used to process very large datasets.
  - Simple CLI: a simple Command Line Interface for executing WEKA commands directly.

The Explorer interface features several panels providing access to the main components of the workbench:
  - The Preprocess panel has facilities for importing data from a database, a CSV file, etc., and for preprocessing this data using a filtering algorithm. It is possible to transform the data and delete instances and attributes according to specific criteria.
  - The Classify panel enables the user to apply classification and regression algorithms, to estimate the accuracy of the resulting predictive model, and to visualize erroneous predictions, ROC curves, etc.
  • 3.
  - The Associate panel provides access to association rule learners that attempt to identify all important interrelationships between attributes in the data.
  - The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm.
  - The Select attributes panel provides algorithms for identifying the most predictive attributes in a dataset.
  - The Visualize panel shows a scatter plot matrix, where individual scatter plots can be selected, enlarged, and analyzed further using various selection operators.

This paper will demonstrate the following two data mining techniques in WEKA:
  - Clustering (Simple K Means)
  - Linear regression

Clustering in WEKA

Clustering: Clustering can be loosely defined as the process of organizing objects into groups whose members are similar in some way. It is the task of assigning a set of objects to groups (called clusters) so that the objects in the same cluster are more similar to each other than to those in other clusters.

The clusters found by different algorithms vary significantly in their properties, and understanding these "cluster models" is key to understanding the differences between the various algorithms. Typical cluster models include:
  - Connectivity models: models based on distance connectivity.
  - Centroid models: the k-means algorithm, for example, represents each cluster by a single mean vector.
  - Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions.
  - Density models: clusters are seen as connected dense regions in the data space.
  - Subspace models: clusters are modeled with both cluster members and relevant attributes.
  - Group models: these do not provide a refined model, just the grouping information.
  - Graph-based models: a clique (a subset of nodes in a graph such that every two nodes in the subset are connected by an edge) can be considered a prototypical form of cluster.

Clustering algorithms may be classified as listed below:
  - Exclusive Clustering
  - Overlapping Clustering
  - Hierarchical Clustering
  - Probabilistic Clustering

Four of the most used clustering algorithms are:
  - K-means
  - Fuzzy C-means
  - Hierarchical clustering
  - Mixture of Gaussians

K-means is an exclusive clustering algorithm, Fuzzy C-means is an overlapping clustering algorithm, hierarchical clustering is, as its name indicates, hierarchical, and Mixture of Gaussians is a probabilistic clustering algorithm.

K-Means Clustering: K-means (MacQueen, 1967) is one of the simplest clustering algorithms. The procedure classifies a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The algorithm is composed of the following steps:
  • 4.
  1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
  2. Assign each object to the group that has the closest centroid.
  3. When all objects have been assigned, recalculate the positions of the K centroids.
  4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

The k-means algorithm does not necessarily find the optimal configuration. It is also sensitive to the initial, randomly selected cluster centres. There is no general theoretical solution for finding the optimal number of clusters for a given data set. A simple approach is to compare the results of multiple runs with different values of k and choose the best one.

Why do clustering? (Business applications)
  - Market segmentation
  - Identifying market needs
  - Better understanding the relationships between different groups of consumers/potential customers
  - Product positioning
  - New product opportunities
  - Selecting test markets
  - Grouping all the shopping items available on the web into a set of unique products

K-Means Clustering in WEKA: The data set we'll use for our clustering example focuses on a fictional BMW dealership. The dealership has kept track of how people walk through the dealership and the showroom, what cars they look at, and how often they ultimately make purchases. They are hoping to mine this data by finding patterns in it and by using clusters to determine whether certain behaviors emerge among their customers. There are 100 rows of data in this sample, and each column describes a step that the customers reached in their BMW experience, with a column having a 1 (they made it to this step or looked at this car) or a 0 (they didn't reach this step).

The ARFF data file we'll be using with WEKA is bmw-browsers, which is loaded in Step 3 of the steps that follow.
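The four k-means steps listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not WEKA's SimpleKMeans implementation: the function names `k_means` and `squared_distance`, the `seed` and `max_iter` parameters, and the six-row `toy_rows` data set are all assumptions made for the example (the rows are made-up 0/1 values in the spirit of the dealership data, not the actual bmw-browsers file).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, seed=0, max_iter=100):
    """Hypothetical helper: returns (centroids, assignment, sse)."""
    rng = random.Random(seed)
    # Step 1: use k of the objects themselves as the initial centroids.
    centroids = [list(row) for row in rng.sample(rows, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Step 4: stop once no object changes group (centroids stop moving).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # The metric being minimized: total squared distance to the centroids.
    sse = sum(squared_distance(row, centroids[a])
              for row, a in zip(rows, assignment))
    return centroids, assignment, sse


# Made-up 0/1 rows (illustrative only).
toy_rows = [
    (1, 1, 0), (1, 1, 0), (1, 0, 0),
    (0, 0, 1), (0, 1, 1), (0, 0, 1),
]
centroids, assignment, sse = k_means(toy_rows, k=2)
print(centroids, assignment, sse)
```

For 0/1 data, each centroid coordinate is the share of the cluster's members that have a 1 in that column. Running the function with several values of `k` (and several seeds) and comparing the returned `sse` is one simple way to carry out the "multiple runs" comparison mentioned above.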
  • 5.
Steps to be followed for doing K-Means Clustering in WEKA:

Step 1: Select Explorer in the WEKA GUI Chooser window (Fig. 1).

Step 2: The following window appears:

Fig. 2

Step 3: Select "Open File" and load the ARFF data file bmw-browsers. After loading the file, the interface will look like this:

Fig. 3

Step 4: With this data set, we are looking to create clusters, so click on the Cluster tab. Click Choose and select SimpleKMeans, set the numClusters value to 5 and click OK:
  • 6.
Fig. 4

Step 5: For viewing the distribution of all variables in the population, we can click on "Visualize All":

Fig. 5
  • 7.
Step 6: Now we are ready to run the clustering algorithm. 100 rows of data with five data clusters would likely take a few hours of computation with a spreadsheet, but WEKA can produce the answer in less than a second. Select Use training set in the Cluster mode panel and then click the Start button to begin the clustering process.

Fig. 6

Step 7: To display the result in a separate window, right-click the result in the Result list panel and select View in a separate window. The following result will be displayed:

Fig. 7
  • 8.
Results Interpretation: The output tells how each cluster comes together: a "1" means everyone in that cluster shares the same value of one, and a "0" means everyone in that cluster has a value of zero for that attribute. Each cluster shows a type of behavior in customers, from which we can draw conclusions:

  - Cluster 0: The "Dreamers"
      o Wander around the dealership
      o Don't purchase anything
  - Cluster 1: The "M5 Lovers"
      o Not a high purchase rate: only 52 percent
      o A potential problem that could be a focus for improvement
      o More salespeople could be sent to the M5 section
  - Cluster 2: The "Throw-Aways"
      o No good conclusions can be drawn from their behavior
  - Cluster 3: The "BMW Babies"
      o Always purchase a car and finance it
      o Walk around the lot looking at cars and then go to the computer search available at the dealership
      o Search computers could be made more prominent around the lots section
      o Tend to buy M5s or Z4s
  - Cluster 4: The "Starting out with BMW"
      o Look at the 3-series and never at the much more expensive M5
      o Do not walk around the lot and ignore the computer search terminals
      o While 50 percent get to the financing stage, only 32 percent ultimately finish the transaction
      o Know exactly what kind of car they want (the 3-series entry-level model)
      o Sales to this group can be increased by relaxing financing standards or by reducing the 3-series prices

The data in these clusters can also be inspected visually. To do this:
  - Right-click the result in the Result list panel
  - Select Visualize cluster assignments
  - Set the X-axis variable to M5 and the Y-axis variable to Purchase

We get the following output:

Fig. 8
  • 9.
This figure shows in a chart how the clusters are grouped in terms of who looked at the M5 and who made a purchase. Also, turn up the "Jitter" to about three-fourths of the way to the maximum, which will artificially scatter the plot points so that we can see them more easily.

The visual results match the conclusions we drew above. At the X=1, Y=1 point (those who looked at M5s and made a purchase), the only clusters represented are 1 and 3. The only clusters at point X=0, Y=0 are 4 and 0. Clusters 1 and 3 were buying the M5s, while cluster 0 wasn't buying anything and cluster 4 was only looking at the 3-series. By changing the X and Y axes, we can identify other trends and patterns.

Other clustering methods can also be used to group the data into clusters. WEKA can generate clusters quickly even when the data set is very large. As clustering has many applications in business, WEKA is very useful for clustering data in real business scenarios.

Linear Regression using WEKA

Regression: Regression is the easiest technique to use, but is also probably the least powerful. The model can be as simple as one input variable and one output variable (a Scatter diagram in Excel, or an XY Diagram in OpenOffice.org). It can also get more complex than that, including dozens of input variables. Regression models all fit the same general pattern: there are a number of independent variables which, when taken together, produce a result, the dependent variable. The regression model is then used to predict an unknown value of the dependent variable, given the values of the independent variables. Correlation analysis can be applied to determine the degree to which variables are related.
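As a rough illustration of the correlation analysis mentioned above, the sketch below computes Pearson's correlation coefficient by hand. The function name `pearson_r` is an assumption made for this example; the two lists are the house size and bedroom columns of the seven priced houses in the house table of this paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance, up to a factor of n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# House size (square feet) and bedrooms for the seven priced houses.
house_size = [3529, 3247, 4032, 2397, 2200, 3536, 2983]
bedrooms = [6, 5, 5, 4, 4, 6, 5]

r = pearson_r(house_size, bedrooms)
print(f"correlation between house size and bedrooms: {r:.2f}")
```

A value close to +1 or -1 indicates a strong linear relationship between two variables, while a value near 0 indicates little linear relationship. Two input variables that are strongly correlated with each other are not really independent, which matters when interpreting regression coefficients.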
Broadly, regression can be classified into two types:
  - Simple linear regression (one dependent variable and one independent variable)
  - Multiple regression (one dependent variable and many independent variables)

The process of multiple regression in WEKA is described with an example in this term paper.

Business applications of Regression
  - Pricing decisions
  - Trend line analysis
  - Risk analysis for investments
  - Sales or market forecasts
  - Predicting the demographics and types of future work forces for large companies
  - Total quality control

Regression in WEKA: The price of a house (the dependent variable) is the result of many independent variables: the square footage of the house, the size of the lot, whether granite is in the kitchen, whether the bathrooms are upgraded, etc. Let's continue this example of a house price-based regression model and look at some real data. These are actual numbers from houses for sale, and we will try to find the value of a house.

House values for regression model:

House size (square feet)   Lot size   Bedrooms   Granite   Upgraded bathroom?   Selling price
3529                       9191       6          0         0                    $205,000
3247                       10061      5          1         1                    $224,900
4032                       10150      5          0         1                    $197,900
2397                       14156      4          1         0                    $189,900
2200                       9600       4          0         1                    $195,000
3536                       19994      6          1         1                    $325,000
2983                       9365       5          0         1                    $230,000
3198                       9669       5          1         1                    ????

  • 10.
Steps to be followed:

Step 1: Select Explorer from the WEKA GUI Chooser window and load the file houses. The following screen will appear:

Fig. 9

Step 2: Click the Classify tab in the Explorer window and click the Choose button in the Classifier panel. Then select LinearRegression from functions:

Fig. 10
  • 11.
WEKA automatically identifies the dependent variable as Selling Price. If it does not, we can select the dependent variable ourselves.

Step 3: Press the Start button and the following output is generated:

Fig. 11

As described earlier in the clustering example, the output can also be viewed in a separate window. We can also visualize the classifier errors, i.e. the instances that the regression equation predicts wrongly, by right-clicking the result set in the Result list panel and selecting Visualize classifier errors.

Fig. 12
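To make the idea behind the regression output more concrete, here is a minimal sketch of multiple linear regression by ordinary least squares, solved through the normal equations in plain Python. It is not WEKA's LinearRegression implementation, which also selects attributes, so its coefficients need not match Fig. 11. The helper names `solve`, `fit_linear` and `predict` are assumptions made for this example, and leaving the granite column out is a choice made here for brevity.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear(rows, targets):
    """Least squares via (X^T X) w = X^T y; returns [intercept, coef_1, ...]."""
    xs = [[1.0] + [float(v) for v in row] for row in rows]  # add intercept
    p = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(p)]
    return solve(xtx, xty)


def predict(weights, row):
    """Apply the fitted equation to one row of independent variables."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Columns: house size, lot size, bedrooms, upgraded bathroom (granite omitted).
houses = [
    (3529, 9191, 6, 0), (3247, 10061, 5, 1), (4032, 10150, 5, 1),
    (2397, 14156, 4, 0), (2200, 9600, 4, 1), (3536, 19994, 6, 1),
    (2983, 9365, 5, 1),
]
prices = [205000, 224900, 197900, 189900, 195000, 325000, 230000]

weights = fit_linear(houses, prices)
estimate = predict(weights, (3198, 9669, 5, 1))  # the house marked ????
print(weights, estimate)
```

The first entry of `weights` is the intercept and the remaining entries are the coefficients of the four columns, in order. With only seven rows the fit is fragile, so the numbers are best read as an illustration of the method rather than as a valuation.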
  • 12.
Interpreting the regression model: Let us interpret the patterns and conclusions that our model gives us, beyond just a strict house value:
  - Granite doesn't matter: WEKA will only use columns that statistically contribute to the accuracy of the model. This regression model is telling us that granite in the kitchen doesn't affect the house's value.
  - Bathrooms do matter: We can use the coefficient from the regression model to determine the value that an upgraded bathroom adds to the house.
  - Bigger houses reduce the value: WEKA is telling us that the bigger the house is, the lower the selling price. This can be seen from the negative coefficient in front of the houseSize variable. The house size, unfortunately, isn't an independent variable, because it is related to the bedrooms variable. That makes sense, since bigger houses tend to have more bedrooms.

Other applications of WEKA in data mining: WEKA can be used for various other data mining techniques:
  - Classification (using decision trees)
  - Collaborative filtering (Nearest Neighbor)
  - Association

References:
  a) Data Mining by Ian H. Witten, Eibe Frank and Mark A. Hall (3rd edition, Morgan Kaufmann)
  b) www.wikipedia.org
  c) http://www2.cs.uregina.ca/~dbd/cs831/notes/clustering/clustering.html
  d) http://www.ibm.com/developerworks/opensource/library/os-weka2/index.html