Data Mining Techniques using WEKA




VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR
                     In partial fulfillment
             of the requirements for the degree of
        MASTER OF BUSINESS ADMINISTRATION




                                 SUBMITTED BY:


                                 Prashant Menon 10BM60061

                                 VGSOM, IIT KHARAGPUR
Introduction to WEKA


WEKA is an open-source collection of data mining and machine learning
algorithms, including
– data pre-processing
– classification
– clustering
– association rule extraction

• Created by researchers at the University of Waikato in New Zealand

• Java based (also open source).


Main features of WEKA

• 49 data preprocessing tools

• 76 classification/regression algorithms

• 8 clustering algorithms

• 15 attribute/subset evaluators + 10 search algorithms for feature selection.

• 3 algorithms for finding association rules

• 3 graphical user interfaces

– “The Explorer” (exploratory data analysis)

– “The Experimenter” (experimental environment)

– “The KnowledgeFlow” (new process model inspired interface)


Weka: Download and Installation

• Download Weka (the stable version) from http://www.cs.waikato.ac.nz/ml/weka/

   – Choose a self-extracting executable (including a Java VM)
– (If one is interested in modifying/extending WEKA, there is a developer version that
includes the source code)
• After the download is complete, run the self-extracting file to install WEKA, and use the
default set-ups.


Weka Application Interfaces

• Explorer
– preprocessing, attribute selection, learning, visualization
• Experimenter
– testing and evaluating machine learning algorithms
• Knowledge Flow
– visual design of the KDD process
• Simple Command-line
– a simple interface for typing commands

Weka Functions and Tools

• Preprocessing Filters
• Attribute selection
• Classification/Regression
• Clustering
• Association discovery
• Visualization


Load data file and Preprocessing

• Load data file in formats: ARFF, CSV, C4.5, binary
• Import from URL or SQL database (using JDBC)
• Preprocessing filters
– Adding/removing attributes
– Attribute value substitution
– Discretization
– Time series filters (delta, shift)
– Sampling, randomization
– Missing value management
– Normalization and other numeric transformations
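To make the numeric transformations concrete, here is a minimal hand-rolled sketch of min-max normalization, which maps each numeric attribute into [0, 1]; this is an illustration only, not WEKA's own filter code:

```python
def min_max_normalize(values):
    """Scale a list of numeric attribute values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# horsepower values from a few rows of the autos data
print(min_max_normalize([111, 154, 102, 115]))
```

The maximum maps to 1.0 and the minimum to 0.0; all other values fall proportionally in between.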


Feature Selection

• Very flexible: arbitrary combination of search and evaluation methods
• Search methods
– best-first
– genetic
– ranking ...
• Evaluation measures
– ReliefF
– information gain
– gain ratio
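To illustrate one of these evaluation measures, information gain for a nominal attribute can be sketched as follows (a toy implementation for illustration, not WEKA's own evaluator):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Information gain of a nominal attribute with respect to the class:
    H(class) minus the weighted conditional entropy per attribute value."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# toy example: fuel-type vs. a binary risk label
fuel = ["gas", "gas", "diesel", "diesel"]
risk = ["high", "high", "low", "low"]
print(info_gain(fuel, risk))   # 1.0 (perfectly informative attribute)
```

An attribute that perfectly separates the classes scores the full class entropy; an attribute independent of the class scores 0.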

Classification

• Predicted target must be categorical
• Implemented methods
– decision trees (J48, etc.) and rules
– Naïve Bayes
– neural networks
– instance-based classifiers …
• Evaluation methods
– test data set
    – cross-validation

Clustering
• Implemented methods
– k-Means
– EM
– Cobweb
– X-means
– FarthestFirst…
• Clusters can be visualized and compared to “true” clusters (if given)

Regression

• Predicted target is continuous
• Methods
– Linear regression
– Simple Linear Regression
– Neural networks
– Regression trees …

Weka: Pros and cons

Pros
– Open source,
• Free
• Extensible
• Can be integrated into other java packages
– GUIs (Graphic User Interfaces)
• Relatively easier to use
– Features
• Run individual experiment, or
• Build KDD phases

Cons

– Lack of proper and adequate documentation
– Systems are updated constantly (Kitchen Sink Syndrome)



WEKA data formats

• Data can be imported from a file in various formats:
– ARFF (Attribute Relation File Format) has two sections:
• the Header information defines attribute name, type and relations.
• the Data section lists the data records.
– CSV: Comma Separated Values (text file)
– C4.5: a format used by the C4.5 decision tree induction
algorithm; requires two separate files
• Names file: defines the names of the attributes
• Data file: lists the records (samples)
– binary
• Data can also be read from a URL or from an SQL database (using JDBC)
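As a sketch of how the ARFF header is structured, the following minimal parser (an illustration only, not WEKA's loader) extracts attribute names and kinds from header lines:

```python
def parse_arff_header(lines):
    """Collect (name, kind) pairs from ARFF header lines.
    A minimal sketch: handles 'real'/'numeric' and nominal {...} specs only."""
    attrs = []
    for line in lines:
        line = line.strip()
        if line.lower().startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            kind = "nominal" if spec.startswith("{") else spec
            attrs.append((name, kind))
        elif line.lower().startswith("@data"):
            break                      # header ends where the data section begins
    return attrs

header = [
    "@relation autos",
    "@attribute normalized-losses real",
    "@attribute fuel-type { diesel, gas}",
    "@data",
]
print(parse_arff_header(header))   # [('normalized-losses', 'real'), ('fuel-type', 'nominal')]
```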
This term paper will demonstrate the following two data mining techniques using
WEKA:

    Clustering (Simple K Means)
    Linear regression



Clustering
Clustering allows a user to group data in order to discover patterns in it.
Clustering is advantageous when the data set is defined and a general pattern needs to
be determined from it. One can create a specific number of groups, depending on
business needs. One defining benefit of clustering over classification is that every
attribute in the data set is used to analyze the data. A major disadvantage of
clustering is that the user is required to know ahead of time how many groups to
create. For a user without any real knowledge of the data, this might be difficult; it might
take several rounds of trial and error to determine the ideal number of groups to create.
However, for the average user, clustering can be the most useful data mining method
available. It can quickly turn the entire set of data into groups, from which one
can quickly draw some conclusions.
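The idea behind k-means, the clustering method used later in this paper, can be sketched in a few lines (a toy one-dimensional version for illustration, not WEKA's SimpleKMeans implementation):

```python
def k_means_1d(points, k, iters=20):
    """Toy one-dimensional k-means: seed the centroids with evenly
    spaced sorted values, then alternate assignment and update steps."""
    centroids = sorted(points)[:: max(1, len(points) // k)][:k]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

cents, groups = k_means_1d([1, 2, 3, 20, 21, 22], k=2)
print(sorted(cents))   # [2.0, 21.0]
```

Note that k must be chosen up front, which is exactly the disadvantage discussed above.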


Data set for WEKA

This data set consists of three types of entities: (a) the specification of an auto in terms of
various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in
use as compared to other cars. The second rating corresponds to the degree to which the
auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol
associated with its price. Then, if it is more risky (or less), this symbol is adjusted by
moving it up (or down) the scale. A value of +3 indicates that the auto is risky, -3 that it
is probably pretty safe.
The third factor is the relative average loss payment per insured vehicle year. This value
is normalized for all autos within a particular size classification (two-door small, station
wagons, sports/speciality, etc...), and represents the average loss per car per year.


A part of the saved ARFF file:
 @relation autos
 @attribute normalized-losses real
 @attribute make { alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury,
 mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo}
 @attribute fuel-type { diesel, gas}
 @attribute aspiration { std, turbo}
 @attribute num-of-doors { four, two}
 @attribute body-style { hardtop, wagon, sedan, hatchback, convertible}
 @attribute drive-wheels { 4wd, fwd, rwd}
@attribute engine-location { front, rear}
@attribute wheel-base real
@attribute length real
@attribute width real
@attribute height real
@attribute curb-weight real
@attribute engine-type { dohc, dohcv, l, ohc, ohcf, ohcv, rotor}
@attribute num-of-cylinders { eight, five, four, six, three, twelve, two}
@attribute engine-size real
@attribute fuel-system { 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi}
@attribute bore real
@attribute stroke real
@attribute compression-ratio real
@attribute horsepower real
@attribute peak-rpm real
@attribute city-mpg real
@attribute highway-mpg real
@attribute price real
@attribute symboling { -3, -2, -1, 0, 1, 2, 3}

@data
?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000
,21,27,13495,3
?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000
,21,27,16500,3
?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9,154,5000,
19,26,16500,1
164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10,102,5500,24,30,
13950,2
164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8,115,5500,18,22,17450
,2
?,audi,gas,std,two,sedan,fwd,front,99.8,177.3,66.3,53.1,2507,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
,2
158,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,
17710,1
?,audi,gas,std,four,wagon,fwd,front,105.8,192.7,71.4,55.7,2954,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,
18920,1
158,audi,gas,turbo,four,sedan,fwd,front,105.8,192.7,71.4,55.9,3086,ohc,five,131,mpfi,3.13,3.4,8.3,140,5500,17,
20,23875,1
?,audi,gas,turbo,two,hatchback,4wd,front,99.5,178.2,67.9,52,3053,ohc,five,131,mpfi,3.13,3.4,7,160,5500,16,22,
?,0
192,bmw,gas,std,two,sedan,rwd,front,101.2,176.8,64.8,54.3,2395,ohc,four,108,mpfi,3.5,2.8,8.8,101,5800,23,29,
16430,2
192,bmw,gas,std,four,sedan,rwd,front,101.2,176.8,64.8,54.3,2395,ohc,four,108,mpfi,3.5,2.8,8.8,101,5800,23,29
,16925,0
188,bmw,gas,std,two,sedan,rwd,front,101.2,176.8,64.8,54.3,2710,ohc,six,164,mpfi,3.31,3.19,9,121,4250,21,28,
20970,0
188,bmw,gas,std,four,sedan,rwd,front,101.2,176.8,64.8,54.3,2765,ohc,six,164,mpfi,3.31,3.19,9,121,4250,21,28,
21105,0
?,bmw,gas,std,four,sedan,rwd,front,103.5,189,66.9,55.7,3055,ohc,six,164,mpfi,3.31,3.19,9,121,4250,20,25,24565,
1
?,bmw,gas,std,four,sedan,rwd,front,103.5,189,66.9,55.7,3230,ohc,six,209,mpfi,3.62,3.39,8,182,5400,16,22,30760,0
?,bmw,gas,std,two,sedan,rwd,front,103.5,193.8,67.9,53.7,3380,ohc,six,209,mpfi,3.62,3.39,8,182,5400,16,22,41315
,0
?,bmw,gas,std,four,sedan,rwd,front,110,197,70.9,56.3,3505,ohc,six,209,mpfi,3.62,3.39,8,182,5400,15,20,36880,0
 121,chevrolet,gas,std,two,hatchback,fwd,front,88.4,141.1,60.3,53.2,1488,l,three,61,2bbl,2.91,3.03,9.5,48,5100,47,
 53,5151,2
 98,chevrolet,gas,std,two,hatchback,fwd,front,94.5,155.9,63.6,52,1874,ohc,four,90,2bbl,3.03,3.11,9.6,70,5400,38,43,
 6295,1
 81,chevrolet,gas,std,four,sedan,fwd,front,94.5,158.8,63.6,52,1909,ohc,four,90,2bbl,3.03,3.11,9.6,70,5400,38,43,
 6575,0
 118,dodge,gas,std,two,hatchback,fwd,front,93.7,157.3,63.8,50.8,1876,ohc,four,90,2bbl,2.97,3.23,9.41,68,5500
 ,37,41,5572,1
 118,dodge,gas,std,two,hatchback,fwd,front,93.7,157.3,63.8,50.8,1876,ohc,four,90,2bbl,2.97,3.23,9.4,68,5500
 ,31,38,6377,1
 118,dodge,gas,turbo,two,hatchback,fwd,front,93.7,157.3,63.8,50.8,2128,ohc,four,98,mpfi,3.03,3.39,7.6,102,5500
 ,24,30,7957,1
 148,dodge,gas,std,four,hatchback,fwd,front,93.7,157.3,63.8,50.6,1967,ohc,four,90,2bbl,2.97,3.23,9.4,68,5500,
 31,38,6229,1
 148,dodge,gas,std,four,sedan,fwd,front,93.7,157.3,63.8,50.6,1989,ohc,four,90,2bbl,2.97,3.23,9.4,68,5500,31,
 38,6692,1


Clustering in WEKA

Load the data file AUTOS.arff into WEKA using the same steps we used to load data
into the Preprocess tab. Take a few minutes to look around the data in this tab. Look at
the columns, the attribute data, the distribution of the columns, etc. The screen should
look like the figure shown below after loading the data.




With this data set, we are looking to create clusters, so instead of clicking on
the Classify tab, click on the Cluster tab. Click Choose and select SimpleKMeans from
the choices that appear (this will be our preferred method of clustering for this paper).
The WEKA Explorer window should look like the following figure at this point.
Finally, we want to adjust the attributes of our cluster algorithm by
clicking SimpleKMeans (not the best UI design here, but go with it). The only attribute
of the algorithm we are interested in adjusting here is the numClusters field, which tells
WEKA how many clusters we want to create. (Remember, one needs to know this before
starting.) Let's change the default value of 2 to 4 for now, but keep these steps in mind if one
later wants to adjust the number of clusters created. WEKA Explorer should look like the
following at this point. Click OK to accept these values.
At this point, we are ready to run the clustering algorithm. Remember that this many
rows of data with four clusters would likely take hours of computation in a
spreadsheet, but WEKA can spit out the answer in less than a second. The output should
look like the figure shown below.
Time taken to build model (full training data) : 0.02 seconds

=== Model and evaluation on training set ===

Clustered Instances

0    60 ( 29%)
1    33 ( 16%)
2    55 ( 27%)
3    57 ( 28%)

Based on the values of the cluster centroids as shown in the above figure, we can state the
characteristics of each of the clusters. For explanation, we consider Cluster 1 and
Cluster 2.

Cluster 2

This group will always look for a premium-segment car (‘Peugot’). It has the largest wheel
base, length, height, curb weight and engine size. As engine size is inversely proportional
to mileage, it has the lowest city and highway mileage. It has the highest number of
cylinders.
Compression ratio, horsepower and peak rpm all have the highest values, which makes it the
highest-priced car.

Cluster 1

This group will always look for a ‘VALUE FOR MONEY’ car. It belongs to the mass
segment. As engine power is inversely proportional to mileage, we can see it has
the highest highway and city mileage and low compression ratio, horsepower and RPM.
For this segment, price is one of the most important criteria when buying a car.

The cluster analysis will help the car company decide which segment it should target before
starting new product development and bringing the car to market.


Visualization of Clustering Results
A more intuitive way to go through the results is to visualize them in the graphical form.
To do so:
     Right click the result in the Result list panel
     Select Visualize cluster assignments
     By setting X-axis variable as Cluster, Y-axis variable as Instance_number and
      Color as aspiration, we get the following output:




Here we can see that all the clusters (segments) show a mixed response on aspiration.

Similarly we can change the variables in X-axis, Y-axis and color to visualize other aspects of
result. Note that WEKA has generated an extra variable named “Cluster” (not present in
original data) which signifies the cluster membership of various instances. We can save the
output as an arff file by clicking on the save button.
The output file contains an additional attribute cluster for each instance.
Thus, besides the values of the twenty-six attributes for any instance, the output also specifies
the cluster membership for that instance.
Creating the regression model with WEKA

To create the model, click on the Classify tab. The first step is to select the model we
want to build, so WEKA knows how to work with the data, and how to create the
appropriate model:
    1. Click the Choose button, then expand the functions branch.
    2. Select the LinearRegression leaf.
This tells WEKA that we want to build a regression model. As one can see from the other
choices, though, there are many possible models to build. This should give a good
indication of how we are only scratching the surface of this subject. There is another
choice called SimpleLinearRegression in the same branch. Do not choose this, because
simple regression only looks at one variable, and we have six.

The attributes used are as follows:

 MYCT: machine cycle time in nanoseconds (integer)
 MMIN: minimum main memory in kilobytes (integer)
 MMAX: maximum main memory in kilobytes (integer)
 CACH: cache memory in kilobytes (integer)
 CHMIN: minimum channels in units (integer)
 CHMAX: maximum channels in units (integer)
 PRP: published relative performance (integer) (target variable)


A part of the data file is as follows:

 @relation machine_cpu
 @attribute MYCT numeric
 @attribute MMIN numeric
 @attribute MMAX numeric
 @attribute CACH numeric
 @attribute CHMIN numeric
 @attribute CHMAX numeric
 @attribute class numeric
 @data
 125,256,6000,256,16,128,198
 29,8000,32000,32,8,32,269
 29,8000,32000,32,8,32,220
 29,8000,32000,32,8,32,172
 29,8000,16000,32,8,16,132
 26,8000,32000,64,8,32,318
 23,16000,32000,64,16,32,367
 23,16000,32000,64,16,32,489
 23,16000,64000,64,16,32,636
When we've selected the right model, WEKA Explorer should look like the following
figure.




Now that the desired model has been chosen, we have to tell WEKA which data it
should use to build the model. Though it may be obvious to us that we want to use
the data we supplied in the ARFF file, there are actually different options, some more
advanced than what we'll be using. The other three choices are Supplied test set, where
one can supply a different set of data to evaluate the model; Cross-validation, which lets
WEKA build a model based on subsets of the supplied data and then average them out to
create a final model; and Percentage split, where WEKA takes a percentage subset of the
supplied data to build a final model. These other choices are useful with other models.
With regression, we can simply choose Use training set. This tells WEKA that to build
our desired model, we can simply use the data set we supplied in our ARFF file.
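To make the Cross-validation option concrete, the fold construction can be sketched as follows (an illustrative index-splitting scheme, not WEKA's exact fold assignment):

```python
def cross_validation_folds(n_instances, k=10):
    """Split instance indices into k folds; each fold serves once as the
    test set while the remaining k-1 folds form the training set."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# the machine_cpu data set used below has 209 instances
train, test = cross_validation_folds(209, k=10)[0]
print(len(train), len(test))   # 188 21
```

Every instance is used for testing exactly once, which is why cross-validation gives a more robust error estimate than a single train/test split.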
Finally, the last step to creating our model is to choose the dependent variable (the
column we are looking to predict). We know this should be the Class, since that's what
we're trying to determine. Right below the test options, there's a combo box that lets us
choose the dependent variable. The column Class should be selected by default. If it's
not, please select it.

Now we are ready to create our model. Click Start. The following figure shows what the
output should look like.

=== Run information ===




Scheme:     weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:    machine_cpu

Instances:   209

Attributes: 7

          MYCT

          MMIN

          MMAX

          CACH

          CHMIN

          CHMAX

          class

Test mode:      evaluate on training data




=== Classifier model (full training set) ===




Linear Regression Model




class =




   0.0491 * MYCT +

   0.0152 * MMIN +

   0.0056 * MMAX +

   0.6298 * CACH +

   1.4599 * CHMAX +

  -56.075




Time taken to build model: 0 seconds
=== Evaluation on training set ===

=== Summary ===




Correlation coefficient         0.93

Mean absolute error             37.9748

Root mean squared error              58.9899

Relative absolute error         39.592 %

Root relative squared error          36.7663 %

Total Number of Instances            209




Here, class represents PRP (Published Relative Performance).
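For reference, the two absolute-error measures in the summary above are computed as follows (a generic sketch over actual/predicted pairs, not WEKA's evaluation code):

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error: average magnitude of the residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: penalizes large residuals more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print(mae([100, 200], [110, 190]), rmse([100, 200], [110, 190]))   # 10.0 10.0
```

Because RMSE squares the residuals before averaging, a few badly predicted instances inflate it more than they inflate MAE, which is why the two numbers differ in the summary.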



Interpreting the regression model
WEKA puts the regression model right there in the output, as shown in the listing below:


Class(PRP) = 0.0491 * MYCT + 0.0152 * MMIN + 0.0056 * MMAX + 0.6298 *
CACH + 1.4599 * CHMAX -56.075



Listing 2 shows the result of plugging in the values for one of the machines in the data set.


Listing 2. PRP value using regression model

Class(PRP) = 0.0491 * 29 + 0.0152 * 8000 + 0.0056 * 32000 + 0.6298 * 32 + 1.4599 * 32 –
56.075

     PRP = 1.4239 + 121.6 + 179.2 + 20.1536 + 46.7168 – 56.075



     PRP = 313.0193
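The arithmetic above can be checked programmatically; the coefficients are taken from the model output, and the input values correspond to one of the audi rows shown earlier:

```python
# Coefficients from the WEKA linear regression output (CHMIN was dropped)
coeffs = {"MYCT": 0.0491, "MMIN": 0.0152, "MMAX": 0.0056,
          "CACH": 0.6298, "CHMAX": 1.4599}
intercept = -56.075

def predict_prp(row):
    """Apply the fitted linear model to one machine's attribute values."""
    return sum(coeffs[name] * row[name] for name in coeffs) + intercept

row = {"MYCT": 29, "MMIN": 8000, "MMAX": 32000, "CACH": 32, "CHMAX": 32}
print(round(predict_prp(row), 4))   # 313.0193
```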
However, looking back to the beginning of this paper, data mining isn't just about outputting
a single number: it's about identifying patterns and rules. It's not strictly used to
produce an absolute number but rather to create a model that lets us detect patterns,
predict output, and come up with conclusions backed by the data.



The minimum channels attribute (CHMIN) does not appear in the model. WEKA will only use
columns that statistically contribute to the accuracy of the model (measured in
R-squared), and it will throw out and ignore columns that don't help in creating a good
model. So this regression model is telling us that minimum channels doesn't affect PRP.

We can also visualize the classifier errors, i.e. those instances that are wrongly
predicted by the regression equation, by right-clicking on the result set in the Result list
panel and selecting Visualize classifier errors.




The X-axis shows the actual class (PRP) and the Y-axis the predicted class.

 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Data Mining Techniques using WEKA

  • 1. Data Mining Techniques using WEKA
       VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR
       In partial fulfillment of the requirements for the degree of
       MASTER OF BUSINESS ADMINISTRATION
       Submitted by: Prashant Menon 10BM60061, VGSOM, IIT KHARAGPUR
  • 2. Introduction to WEKA
       WEKA is an open-source collection of data mining and machine learning algorithms, including
       – data pre-processing
       – classification
       – clustering
       – association rule extraction
       • Created by researchers at the University of Waikato in New Zealand
       • Java based (also open source)
       Main features of WEKA
       • 49 data preprocessing tools
       • 76 classification/regression algorithms
       • 8 clustering algorithms
       • 15 attribute/subset evaluators + 10 search algorithms for feature selection
       • 3 algorithms for finding association rules
       • 3 graphical user interfaces
       – “The Explorer” (exploratory data analysis)
       – “The Experimenter” (experimental environment)
       – “The KnowledgeFlow” (new process-model-inspired interface)
       Weka: Download and Installation
       • Download Weka (the stable version) from http://www.cs.waikato.ac.nz/ml/weka/
       – Choose a self-extracting executable (including a Java VM)
  • 3. – (If one is interested in modifying/extending Weka, there is a developer version that includes the source code)
       • After the download is complete, run the self-extracting file to install Weka, using the default set-up.
       Weka Application Interfaces
       • Explorer
       – preprocessing, attribute selection, learning, visualization
       • Experimenter
       – testing and evaluating machine learning algorithms
       • Knowledge Flow
       – visual design of the KDD process
       • Simple Command-line
       – a simple interface for typing commands
       Weka Functions and Tools
       • Preprocessing filters
       • Attribute selection
       • Classification/Regression
       • Clustering
       • Association discovery
       • Visualization
       Load data file and Preprocessing
       • Load data files in these formats: ARFF, CSV, C4.5, binary
       • Import from a URL or an SQL database (using JDBC)
       • Preprocessing filters
       – Adding/removing attributes
       – Attribute value substitution
       – Discretization
       – Time series filters (delta, shift)
       – Sampling, randomization
       – Missing value management
       – Normalization and other numeric transformations
       Feature Selection
       • Very flexible: arbitrary combinations of search and evaluation methods
       • Search methods
  • 4. – best-first
       – genetic
       – ranking ...
       • Evaluation measures
       – ReliefF
       – information gain
       – gain ratio
       Classification
       • The predicted target must be categorical
       • Implemented methods
       – decision trees (J48, etc.) and rules
       – Naïve Bayes
       – neural networks
       – instance-based classifiers ...
       • Evaluation methods
       – test data set
       – cross-validation
       Clustering
       • Implemented methods
       – k-Means
       – EM
       – Cobweb
       – X-means
       – FarthestFirst ...
       • Clusters can be visualized and compared to “true” clusters (if given)
       Regression
       • The predicted target is continuous
       • Methods
       – Linear regression
       – Simple linear regression
       – Neural networks
       – Regression trees ...
       Weka: Pros and Cons
       Pros
       – Open source
       • Free
       • Extensible
       • Can be integrated into other Java packages
       – GUIs (Graphical User Interfaces)
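Cross-validation, listed above as an evaluation method, repeatedly holds out one fold for testing while training on the rest and averages the results. A minimal sketch in plain Python (illustrative only, not WEKA's API; `train_fn` and `predict_fn` are hypothetical placeholders for any learner):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(data, labels, train_fn, predict_fn, k=10):
    """Average classification accuracy over k train/test splits."""
    folds = k_fold_indices(len(data), k)
    accuracies = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in test_set]
        # train on everything outside the held-out fold
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # evaluate on the held-out fold
        correct = sum(predict_fn(model, data[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

WEKA's Explorer performs the same loop (10-fold by default) when "Cross-validation" is selected as the test option.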
  • 5. • Relatively easy to use
       – Features
       • Run individual experiments, or
       • Build KDD phases
       Cons
       – Lack of proper and adequate documentation
       – Systems are updated constantly (“kitchen sink syndrome”)
       3. WEKA Data Formats
       • Data can be imported from a file in various formats:
       – ARFF (Attribute-Relation File Format) has two sections:
       • the Header information defines attribute names, types and relations
       • the Data section lists the data records
       – CSV: Comma-Separated Values (text file)
       – C4.5: a format used by the decision-tree induction algorithm C4.5; requires two separate files
       • Names file: defines the names of the attributes
       • Data file: lists the records (samples)
       – binary
       • Data can also be read from a URL or from an SQL database (using JDBC)
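The two-section ARFF layout described above (a header of @attribute declarations followed by an @data section of comma-separated records) can be illustrated with a minimal parser sketch. This is a simplified illustration in Python, not WEKA's own Java parser, and it ignores quoted attribute names and sparse data:

```python
def parse_arff(text):
    """Split an ARFF document into attribute names and data rows."""
    attributes, rows = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@attribute'):
            attributes.append(line.split()[1])    # attribute name
        elif lower.startswith('@data'):
            in_data = True                        # everything after is records
        elif in_data:
            rows.append(line.split(','))          # one record per line
    return attributes, rows
```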
  • 6. This term paper will demonstrate the following two data mining techniques using WEKA:
        Clustering (Simple K-Means)
        Linear regression
       Clustering
       Clustering allows a user to group data records in order to discover patterns in the data. Clustering is advantageous when the data set is defined and a general pattern needs to be determined from it. One can create a specific number of groups, depending on business needs. One defining benefit of clustering over classification is that every attribute in the data set is used to analyze the data. A major disadvantage of clustering is that the user is required to know ahead of time how many groups to create. Without any real knowledge of the data this might be difficult, and it might take several rounds of trial and error to determine the ideal number of groups. However, for the average user, clustering can be the most useful data mining method: it can quickly take an entire data set and turn it into groups, from which one can quickly draw some conclusions.
       Data set for WEKA
       This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price; then, if a car is more risky (or less), the symbol is adjusted by moving it up (or down) the scale. A value of +3 indicates that the auto is risky, -3 that it is probably quite safe. The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc.) and represents the average loss per car per year.
       A part of the saved ARFF file:
       @relation autos
       @attribute normalized-losses real
       @attribute make { alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo}
       @attribute fuel-type { diesel, gas}
       @attribute aspiration { std, turbo}
       @attribute num-of-doors { four, two}
       @attribute body-style { hardtop, wagon, sedan, hatchback, convertible}
       @attribute drive-wheels { 4wd, fwd, rwd}
  • 7. @attribute engine-location { front, rear}
       @attribute wheel-base real
       @attribute length real
       @attribute width real
       @attribute height real
       @attribute curb-weight real
       @attribute engine-type { dohc, dohcv, l, ohc, ohcf, ohcv, rotor}
       @attribute num-of-cylinders { eight, five, four, six, three, twelve, two}
       @attribute engine-size real
       @attribute fuel-system { 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi}
       @attribute bore real
       @attribute stroke real
       @attribute compression-ratio real
       @attribute horsepower real
       @attribute peak-rpm real
       @attribute city-mpg real
       @attribute highway-mpg real
       @attribute price real
       @attribute symboling { -3, -2, -1, 0, 1, 2, 3}
       @data
       ?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,13495,3
       ?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,16500,3
       ?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9,154,5000,19,26,16500,1
       164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10,102,5500,24,30,13950,2
       164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8,115,5500,18,22,17450,2
       ?,audi,gas,std,two,sedan,fwd,front,99.8,177.3,66.3,53.1,2507,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250,2
       158,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710,1
       ?,audi,gas,std,four,wagon,fwd,front,105.8,192.7,71.4,55.7,2954,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920,1
       158,audi,gas,turbo,four,sedan,fwd,front,105.8,192.7,71.4,55.9,3086,ohc,five,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875,1
       ?,audi,gas,turbo,two,hatchback,4wd,front,99.5,178.2,67.9,52,3053,ohc,five,131,mpfi,3.13,3.4,7,160,5500,16,22,?,0
       192,bmw,gas,std,two,sedan,rwd,front,101.2,176.8,64.8,54.3,2395,ohc,four,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430,2
       192,bmw,gas,std,four,sedan,rwd,front,101.2,176.8,64.8,54.3,2395,ohc,four,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16925,0
       188,bmw,gas,std,two,sedan,rwd,front,101.2,176.8,64.8,54.3,2710,ohc,six,164,mpfi,3.31,3.19,9,121,4250,21,28,20970,0
       188,bmw,gas,std,four,sedan,rwd,front,101.2,176.8,64.8,54.3,2765,ohc,six,164,mpfi,3.31,3.19,9,121,4250,21,28,21105,0
       ?,bmw,gas,std,four,sedan,rwd,front,103.5,189,66.9,55.7,3055,ohc,six,164,mpfi,3.31,3.19,9,121,4250,20,25,24565,1
       ?,bmw,gas,std,four,sedan,rwd,front,103.5,189,66.9,55.7,3230,ohc,six,209,mpfi,3.62,3.39,8,182,5400,16,22,30760,0
       ?,bmw,gas,std,two,sedan,rwd,front,103.5,193.8,67.9,53.7,3380,ohc,six,209,mpfi,3.62,3.39,8,182,5400,16,22,41315,0
  • 8. ?,bmw,gas,std,four,sedan,rwd,front,110,197,70.9,56.3,3505,ohc,six,209,mpfi,3.62,3.39,8,182,5400,15,20,36880,0
       121,chevrolet,gas,std,two,hatchback,fwd,front,88.4,141.1,60.3,53.2,1488,l,three,61,2bbl,2.91,3.03,9.5,48,5100,47,53,5151,2
       98,chevrolet,gas,std,two,hatchback,fwd,front,94.5,155.9,63.6,52,1874,ohc,four,90,2bbl,3.03,3.11,9.6,70,5400,38,43,6295,1
       81,chevrolet,gas,std,four,sedan,fwd,front,94.5,158.8,63.6,52,1909,ohc,four,90,2bbl,3.03,3.11,9.6,70,5400,38,43,6575,0
       118,dodge,gas,std,two,hatchback,fwd,front,93.7,157.3,63.8,50.8,1876,ohc,four,90,2bbl,2.97,3.23,9.41,68,5500,37,41,5572,1
       118,dodge,gas,std,two,hatchback,fwd,front,93.7,157.3,63.8,50.8,1876,ohc,four,90,2bbl,2.97,3.23,9.4,68,5500,31,38,6377,1
       118,dodge,gas,turbo,two,hatchback,fwd,front,93.7,157.3,63.8,50.8,2128,ohc,four,98,mpfi,3.03,3.39,7.6,102,5500,24,30,7957,1
       148,dodge,gas,std,four,hatchback,fwd,front,93.7,157.3,63.8,50.6,1967,ohc,four,90,2bbl,2.97,3.23,9.4,68,5500,31,38,6229,1
       148,dodge,gas,std,four,sedan,fwd,front,93.7,157.3,63.8,50.6,1989,ohc,four,90,2bbl,2.97,3.23,9.4,68,5500,31,38,6692,1
       Clustering in WEKA
       Load the data file AUTOS.arff into WEKA using the same steps used to load data into the Preprocess tab. Take a few minutes to look around the data in this tab: look at the columns, the attribute data, the distribution of the columns, etc. The screen should look like the figure shown below after loading the data.
       With this data set, we are looking to create clusters, so instead of clicking on the Classify tab, click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear (this will be our preferred method of clustering for this paper). The WEKA Explorer window should look like the following figure at this point.
  • 9. Finally, we want to adjust the attributes of our cluster algorithm by clicking SimpleKMeans (not the best UI design here, but go with it). The only attribute of the algorithm we are interested in adjusting is the numClusters field, which tells WEKA how many clusters we want to create. (Remember, one needs to know this before starting.) Let's change the default value of 2 to 4 for now, but keep these steps in mind if one later wants to adjust the number of clusters created. The WEKA Explorer should look like the following at this point. Click OK to accept these values.
  • 10. At this point, we are ready to run the clustering algorithm. Remember that clustering this many rows of data into four clusters would likely take hours of computation with a spreadsheet, but WEKA can spit out the answer in less than a second. The output should look like the figure shown below.
  • 11. Time taken to build model (full training data): 0.02 seconds
       === Model and evaluation on training set ===
       Clustered Instances
       0    60 (29%)
       1    33 (16%)
       2    55 (27%)
       3    57 (28%)
       Based on the values of the cluster centroids shown in the figure above, we can state the characteristics of each of the clusters. For explanation we take Cluster 1 and Cluster 2.
       Cluster 2
       This group will always look for the premium-segment car ‘Peugot’. It has the largest wheel base, length, height, curb weight and engine size. As engine size is inversely proportional to mileage, it has the lowest city and highway mileage. It has the highest number of cylinders. Compression ratio, horsepower and peak rpm all have the highest values, which make it the highest-priced car.
       Cluster 1
       This group will always look for the ‘value for money’ car. It belongs to the mass segment. As engine power is inversely proportional to mileage, we can see it has
  • 12. the highest highway and city mileage and low compression ratio, horsepower and RPM. For this segment, price is one of the important criteria before buying the car.
       The cluster analysis will help the car company decide which segment it should target before starting new product development or bringing the car to market.
       Visualization of Clustering Results
       A more intuitive way to go through the results is to visualize them in graphical form. To do so:
        Right-click the result in the Result list panel
        Select Visualize cluster assignments
        By setting the X-axis variable to Cluster, the Y-axis variable to Instance_number and Color to aspiration, we get the following output:
       Here we can see that all the clusters (segments) have a mixed response to aspiration. Similarly, we can change the variables on the X-axis, Y-axis and color to visualize other aspects of the result. Note that WEKA has generated an extra variable named “Cluster” (not present in the original data) which signifies the cluster membership of the various instances. We can save the output as an ARFF file by clicking on the Save button. The output file contains an additional attribute, cluster, for each instance. Thus, besides the values of the twenty-six attributes for any instance, the output also specifies the cluster membership of that instance.
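Conceptually, the SimpleKMeans algorithm used above alternates between assigning each instance to its nearest centroid and recomputing the centroids as cluster means, until the assignment stops changing. A minimal sketch assuming purely numeric attributes (WEKA's implementation additionally handles nominal attributes, missing values and seeded random initialisation):

```python
def kmeans(points, k, iterations=100):
    """Plain k-means on numeric tuples; returns (centroids, clusters)."""
    centroids = points[:k]                      # naive initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the centroid with the smallest squared distance
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:          # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

This is why the number of clusters (numClusters) must be chosen up front: k is fixed before the iteration begins.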
  • 13. Creating the regression model with WEKA
       To create the model, click on the Classify tab. The first step is to select the model we want to build, so WEKA knows how to work with the data and how to create the appropriate model:
       1. Click the Choose button, then expand the functions branch.
       2. Select the LinearRegression leaf. This tells WEKA that we want to build a regression model. As one can see from the other choices, there are many possible models to build; this should give a good indication of how we are only touching the surface of this subject. There is another choice called SimpleLinearRegression in the same branch. Do not choose it, because simple linear regression only looks at one variable, and we have six.
       The attributes used are:
       MYCT: machine cycle time in nanoseconds (integer)
       MMIN: minimum main memory in kilobytes (integer)
       MMAX: maximum main memory in kilobytes (integer)
       CACH: cache memory in kilobytes (integer)
       CHMIN: minimum channels in units (integer)
       CHMAX: maximum channels in units (integer)
       PRP: published relative performance (integer) (target variable)
       A part of the data file is as follows:
       @relation machine_cpu
       @attribute MYCT numeric
       @attribute MMIN numeric
       @attribute MMAX numeric
       @attribute CACH numeric
       @attribute CHMIN numeric
       @attribute CHMAX numeric
       @attribute class numeric
       @data
       125,256,6000,256,16,128,198
       29,8000,32000,32,8,32,269
       29,8000,32000,32,8,32,220
       29,8000,32000,32,8,32,172
       29,8000,16000,32,8,16,132
       26,8000,32000,64,8,32,318
       23,16000,32000,64,16,32,367
       23,16000,32000,64,16,32,489
       23,16000,64000,64,16,32,636
  • 14. When we've selected the right model, the WEKA Explorer should look like the following figure.
       Now that the desired model has been chosen, we have to tell WEKA where the data is that it should use to build the model. Though it may be obvious that we want to use the data supplied in the ARFF file, there are actually other options, some more advanced than what we'll be using. The other three choices are Supplied test set, where one can supply a different set of data to build the model; Cross-validation, which lets WEKA build a model based on subsets of the supplied data and then average them out to create a final model; and Percentage split, where WEKA takes a percentage subset of the supplied data to build a final model. These other choices are useful with different models. With regression, we can simply choose Use training set. This tells WEKA that to build our desired model, it can simply use the data set supplied in our ARFF file.
       Finally, the last step in creating our model is to choose the dependent variable (the column we are looking to predict). We know this should be class, since that's what we're trying to determine. Right below the test options, there's a combo box that lets us choose the dependent variable. The column class should be selected by default; if it's not, please select it.
       Now we are ready to create our model. Click Start. The following figure shows what the output should look like.
       === Run information ===
       Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
  • 15. Relation: machine_cpu
       Instances: 209
       Attributes: 7
       MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX, class
       Test mode: evaluate on training data
       === Classifier model (full training set) ===
       Linear Regression Model
       class = 0.0491 * MYCT + 0.0152 * MMIN + 0.0056 * MMAX + 0.6298 * CACH + 1.4599 * CHMAX + -56.075
       Time taken to build model: 0 seconds
  • 16. === Evaluation on training set ===
       === Summary ===
       Correlation coefficient          0.93
       Mean absolute error             37.9748
       Root mean squared error         58.9899
       Relative absolute error         39.592 %
       Root relative squared error     36.7663 %
       Total Number of Instances      209
       Here class represents PRP (Published Relative Performance).
       Interpreting the regression model
       WEKA puts the regression model right there in the output, as shown in the listing below:
       class (PRP) = 0.0491 * MYCT + 0.0152 * MMIN + 0.0056 * MMAX + 0.6298 * CACH + 1.4599 * CHMAX - 56.075
       Listing 2 shows the result of plugging in the values of one instance from the data file (MYCT=29, MMIN=8000, MMAX=32000, CACH=32, CHMAX=32).
       Listing 2. PRP value using the regression model
       class (PRP) = 0.0491 * 29 + 0.0152 * 8000 + 0.0056 * 32000 + 0.6298 * 32 + 1.4599 * 32 - 56.075
        PRP = 1.4239 + 121.6 + 179.2 + 20.1536 + 46.7168 - 56.075
        PRP = 313.0193
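The hand computation in Listing 2 can be reproduced programmatically; a quick check of the arithmetic, with the coefficients taken from the model output above:

```python
# Coefficients of the fitted linear regression model (CHMIN was dropped by WEKA)
coef = {'MYCT': 0.0491, 'MMIN': 0.0152, 'MMAX': 0.0056,
        'CACH': 0.6298, 'CHMAX': 1.4599}
intercept = -56.075

# The instance being scored in Listing 2
instance = {'MYCT': 29, 'MMIN': 8000, 'MMAX': 32000, 'CACH': 32, 'CHMAX': 32}

prp = sum(coef[a] * instance[a] for a in coef) + intercept
print(round(prp, 4))   # 313.0193
```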
  • 17. However, data mining isn't just about outputting a single number: it's about identifying patterns and rules. It's not strictly used to produce an absolute number, but rather to create a model that lets us detect patterns, predict output, and come up with conclusions backed by the data.
       Note that CHMIN (minimum channels) does not appear in the model. WEKA only uses columns that statistically contribute to the accuracy of the model (measured in R-squared); it throws out and ignores columns that don't help in creating a good model. So this regression model is telling us that minimum channels doesn't affect PRP.
       We can also visualize the classifier errors, i.e. those instances that are wrongly predicted by the regression equation, by right-clicking on the result set in the Result list panel and selecting Visualize classifier errors. The X-axis shows the actual class (PRP) and the Y-axis the predicted class.
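The summary statistics WEKA reports for a regression model (correlation coefficient, mean absolute error, root mean squared error) are all computed from paired actual/predicted values. An illustrative sketch of those formulas, not WEKA's own code:

```python
import math

def evaluate(actual, predicted):
    """Correlation coefficient, MAE and RMSE of predictions vs. true values."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    # Pearson correlation between actual and predicted values
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    corr = cov / (sa * sp)
    return corr, mae, rmse
```

A correlation coefficient near 1 (0.93 in the run above) means the predicted PRP values track the actual ones closely, while MAE and RMSE measure the average size of the errors visualized in the classifier-errors plot.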