Data Analysis with Weka
zoo.arff
Done by
Clement Robert H.
Daniyar M.
Web and Social Computing
Dataset
Zoo.arff:
A simple database containing 17 Boolean-valued attributes. The "type"
attribute appears to be the class attribute. Here is a breakdown of which
animals are in which type.
Objectives
● Select Dataset
● Learn and Explain the two classifiers
● Apply them to the selected Dataset
● Compare the results
Outline
● DataSet Visualization
● Preprocessing
● Classification
● JRip Implementation and Results
● J48 Implementation and Results
● Knowledge Flow Implementation
● Experimenter (Comparison of the Two Classifiers)
● Conclusions
● Questions and Answers
Visualization
(Each of the following slides plots the attribute distribution per class; the
class legend throughout is: mammal, bird, reptile, fish, amphibian, insect,
invertebrate.)
Basic statistics
Hair
Feathers
Eggs
Milk
Catsize
Pre-Processing
Why?
Incomplete Data
Noisy Data
Inconsistent Data
How?
Discretize
Remove Duplicates
RemoveUseless
Etc
Classification
“Data mining technique used to predict group
membership for data instances”
Types of classifiers:
Decision Trees
J48
Rule Based Classifiers
JRip
Bayes
Naive Bayes
Classifiers Options Explained-Applied on DataSet
Training Data:
We use the whole dataset as the training set. It gives the best results on the
data set itself but does not guarantee good performance on unseen data.
Cross-Validation:
Divide the data set into k subsamples; in each of k rounds, k-1 subsamples serve
as training data and the remaining one as test data. The reported accuracy is
averaged over the k rounds.
Percentage Split:
We divide the dataset into two parts: the first X% of the data set is used as
training and the rest is used as the test set.
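The three test options above can be sketched in plain Python (no Weka involved); the fold construction here is simplified for illustration, since Weka stratifies its cross-validation folds by default:

```python
# How the three Weka test options partition a 101-instance dataset
# such as zoo.arff (illustrative sketch, not Weka's implementation).
n = 101
indices = list(range(n))

# Training set: train and test on all 101 instances.
train_all, test_all = indices, indices

# Cross-validation with k = 10: k rounds, each holding out one fold
# for testing and training on the remaining k-1 folds.
k = 10
folds = [indices[i::k] for i in range(k)]
rounds = [(sorted(set(indices) - set(f)), f) for f in folds]

# Percentage split at 66%: the first 66% trains, the rest tests.
cut = int(n * 0.66)
train_ps, test_ps = indices[:cut], indices[cut:]

print(len(train_all), len(folds), len(train_ps), len(test_ps))
```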
JRIP Classifier Algorithm
Based on RIPPER Algorithm
RIPPER: Repeated Incremental
Pruning to Produce Error
Reduction
Incremental Pruning
Error Reduction
Produces Rules
Works fine for:
classes with missing values, and
binary or nominal classes
Advantages
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
“If Age is greater than 35 and Status is Married Then
He/she Does not Cheat”
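The rule-list idea behind JRip can be sketched as follows; the rules here are made up for illustration and are not JRip's actual output on zoo.arff:

```python
# Illustrative sketch of how a rule-based classifier such as JRip applies
# its rules: each rule is a set of attribute tests plus a class label,
# checked in order, with a default class when no rule fires.
rules = [
    ({"milk": 1}, "mammal"),          # hypothetical rules, not JRip output
    ({"feathers": 1}, "bird"),
    ({"fins": 1, "eggs": 1}, "fish"),
]
default_class = "invertebrate"

def classify(instance, rules, default):
    for conditions, label in rules:
        if all(instance.get(a) == v for a, v in conditions.items()):
            return label              # first matching rule wins
    return default

print(classify({"milk": 1, "feathers": 0}, rules, default_class))  # mammal
```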
Classifier Evaluation-Default Options JRip
Pre-Processed?-> NO Results
PreProcessed?-> No Results->NO BIG CHANGE
Classifier Evaluation->Cross Validation Increased
JRip
● Same Options as Previous
● Except
● Cross-Validation= 20 Folds
“If the number of cross-validation folds
increases, the correctly classified
instances decrease”
Why?
Next Cross-Validation=40 etc
PreProcessed?-> No Results->NO BIG CHANGE
Classifier Evaluation->Cross Validation
JRip
● No Relation among those values
● Why 10? Extensive experiments have shown
that this is the best choice to get an accurate
estimate for small Data Set ( we have 101
instances and 17 attributes)
● The more we increase the number of FOLDS in
cross-validation, the smaller each test fold
becomes! (eg: for k=80, the 101 instances are
split into 80 folds, leaving barely one test
instance per fold, so each accuracy estimate
rests on very few test cases)
Cross-Validation (k)   Correctly Classified Inst. (%)
10                     87.14
20                     85.14
30                     86.13
40                     84.15
50                     87.12
60                     82.17
70                     85.14
80                     87.12
90                     84.15
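The arithmetic behind the fold-size argument can be checked directly; this is a plain-Python sketch of why large k is problematic on 101 instances:

```python
# As the number of cross-validation folds k grows, each test fold
# shrinks toward a single instance, so every per-fold accuracy
# estimate rests on very little data.
n = 101
for k in (10, 20, 40, 80):
    test_size = n // k               # smallest test-fold size
    train_size = n - test_size       # corresponding training-set size
    print(f"k={k:2d}: ~{test_size} test instance(s), {train_size} for training")
```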
PreProcessed?-> No Results
Classifier Evaluation->Training Set JRip
PreProcessing-> Not Yet
Classifier Evaluation->Training Set JRip
Cross-Validation (k)   Correctly Classified Inst. (%)
10                     87.14
20                     85.14
30                     86.13
40                     84.15
50                     87.12
60                     82.17
70                     85.14
80                     87.12
90                     84.15
VS
Training Set           92.07
Training Set
● works with the totality of the data set as
both training and test data
● It works better on the given data but does
not give us confidence for unseen cases
(prediction, etc)
Pre-Processing- Why?
JRip
● Rule generated based on useless
attribute
○ Filter with RemoveUseless
● Possible Duplicates->can lead to biased
Rules-> We never know
○ Filter with RemoveDuplicates
● Other Filters
○ Discretize, etc
● RemoveDuplicates:
Removes all duplicate
instances from the first
batch of data it receives
● RemoveUseless: This filter
removes attributes that do
not vary at all or that vary
too much
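What the two filters do can be sketched in plain Python (illustrative only, not Weka's implementation; the tiny instance list is made up, and only the constant-attribute case of RemoveUseless is shown):

```python
# RemoveDuplicates drops repeated instances; RemoveUseless drops
# attributes whose value never varies across the data.
instances = [
    (1, 0, 1, "mammal"),
    (1, 0, 1, "mammal"),   # exact duplicate of the first instance
    (0, 1, 1, "bird"),
]

# RemoveDuplicates: keep only the first occurrence of each instance.
seen, deduped = set(), []
for inst in instances:
    if inst not in seen:
        seen.add(inst)
        deduped.append(inst)

# RemoveUseless (constant case): attribute columns with a single value.
n_attrs = len(deduped[0]) - 1       # last column is the class
useless = [i for i in range(n_attrs)
           if len({inst[i] for inst in deduped}) == 1]

print(len(deduped), useless)        # 2 instances remain; column 2 is constant
```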
Pre-Processing-> RemoveDuplicates (instances), RemoveUseless (attributes) JRip
PreProcessed?-> Yes
Classifier Evaluation->Default Options JRip
Classifier Evaluation-> Percentage Split-> Default (66%) JRip
PreProcessed?-> Yes Results
● With this split, the test samples
contain no amphibians!
● The statistics look good because
the test is done on a small test set.
Classifier Evaluation-> Percentage Split-> 80% JRip
PreProcessed?-> Yes Results
Percentage Split
● Takes samples randomly
● By chance, some class
representatives may be left out
http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
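The risk of a random split leaving out a rare class can be simulated; the per-class counts below are those commonly reported for the UCI zoo dataset (41 mammals, 20 birds, 5 reptiles, 13 fish, 4 amphibians, 8 insects, 10 invertebrates):

```python
# How often does a random 66% split leave the 4 amphibians
# entirely out of the 35-instance test set?
import random

labels = (["mammal"] * 41 + ["bird"] * 20 + ["reptile"] * 5 +
          ["fish"] * 13 + ["amphibian"] * 4 + ["insect"] * 8 +
          ["invertebrate"] * 10)
cut = int(len(labels) * 0.66)        # 66 training, 35 test instances

misses = 0
for seed in range(200):
    random.seed(seed)
    test = random.sample(labels, len(labels))[cut:]
    if "amphibian" not in test:
        misses += 1
print(f"{misses}/200 random 66% splits left no amphibian in the test set")
```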
What Next? JRip
● Done with the test options applied to the data
● JRip has its own parameters to tune in order to make it work better
○ Folds, MinWeights, seeds of randomization, error rate, and
Pruning
● Pruning
○ Pre-Pruning: stop the classifier from growing as soon as a
condition is met
○ Post-Pruning: let the classifier grow fully, then reduce it to make
it smaller so that it generalizes better to unseen data.
● Without pruning, the classifier is more detailed, which limits it on
unseen cases.
● JRip uses a post-pruning method (based on REP, Reduced-Error Pruning)
● Next: JRip with pruning vs JRip without pruning
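The post-pruning idea behind REP can be sketched on a rule list: grow first, then simplify whenever the simpler model does no worse on held-out validation data. Everything below is illustrative (hypothetical rules and validation instances), not Weka's implementation:

```python
# Minimal sketch of reduced-error post-pruning on a rule list.
def error(rules, default, data):
    """Fraction of (instance, label) pairs the rule list misclassifies."""
    wrong = 0
    for inst, label in data:
        pred = default
        for cond, cls in rules:
            if all(inst.get(a) == v for a, v in cond.items()):
                pred = cls
                break
        wrong += (pred != label)
    return wrong / len(data)

def post_prune(rules, default, validation):
    """Drop trailing rules while validation error does not increase."""
    pruned = list(rules)
    while pruned:
        candidate = pruned[:-1]
        if error(candidate, default, validation) <= error(pruned, default, validation):
            pruned = candidate       # simpler and no worse: keep pruning
        else:
            break
    return pruned

rules = [({"milk": 1}, "mammal"),
         ({"feathers": 1}, "bird"),
         ({"legs": 6}, "insect")]    # over-specific last rule
validation = [({"milk": 1}, "mammal"),
              ({"feathers": 1}, "bird"),
              ({"legs": 6}, "invertebrate")]  # last rule hurts here
print(len(post_prune(rules, "invertebrate", validation)))  # 2 rules survive
```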
Classifier Evaluation->Pruning-TRUE Vs FALSE JRip
PreProcessed?-> no Results with Experimenter*
J48 Classifier Algorithm
Based on the C4.5 Algorithm, a modified ID3 that handles:
continuous attributes
missing attribute values
attributes with differing costs
post-pruning of trees
Produces a Decision Tree
The advantages of C4.5 are:
• Builds models that can be easily interpreted
• Easy to implement
• Can use both categorical and continuous
values
• Deals with noise
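The splitting criterion behind C4.5 (and hence J48) is information gain normalised by the split's own entropy, the gain ratio. A minimal sketch, with made-up label lists for illustration:

```python
# Entropy and gain ratio, the attribute-selection measure of C4.5.
from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain_ratio(parent, children):
    total = len(parent)
    gain = entropy(parent) - sum(len(c) / total * entropy(c) for c in children)
    # Split info: entropy of the partition itself (which child each row goes to).
    split_info = entropy([i for i, c in enumerate(children) for _ in c])
    return gain / split_info if split_info else 0.0

parent = ["mammal"] * 4 + ["bird"] * 4
# A split on "milk" that separates the two classes perfectly:
children = [["mammal"] * 4, ["bird"] * 4]
print(round(gain_ratio(parent, children), 3))  # 1.0
```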
Default options J48
Default Options J48
Cross validation =20
J48
No major changes
ROC and PRC areas dropped
insignificantly
Confusion matrix and
decision tree remain the same
Changing cross validation value J48
Cross-Validation (k)   Correctly Classified Inst. (%)
10                     92.079
20                     92.079
30                     92.079
40                     92.079
50                     92.079
60                     92.079
70                     92.079
80                     92.079
90                     92.079
Changing k changes nothing, including ROC, PRC, confusion
matrix, and decision tree.
This is because the number of instances is relatively small for
J48 (101).
Classifier by training set J48
Weka Knowledge Flow
Step by Step
Comparing two different trees
Comparing two sets of rules
Getting same results by Knowledge Flow
Getting same results by Knowledge Flow (con’t)
J48? or JRip on Zoo.arff-> Comparison with Experimenter Feature
Experimenter is Suitable for:
● Large Scale Experiments
● Automation
● Statistics can be stored in .arff
format
● …..
● Classifiers Comparison
J48? or JRip? on Zoo.arff?-> Results
With:
● Significance of 10% on
Percentage Correct Instances
● Cross-Validation=10
● Default Classifiers Options
We see that:
● J48 is the Winner on Zoo.arff
Lessons Learnt & Conclusions
● Output Readability
○ JRip outputs are easy to read and understand
○ J48 trees can be more complex to read
● Performance
○ J48 beats JRip, as it generates a general tree without pre-processing
○ J48 generates a tree whose precision is high
○ No big difference for a small data set like zoo.arff
● Weka test options
○ Can mislead someone who only looks for good results.
○ There is no inherently good or bad classifier; it depends on many characteristics (dataset size,
options used, classifier used, etc.).
● Weka
○ The Experimenter gives a good way to compare classifiers
○ Knowledge Flow helps to see the intermediary steps in the generation of a classifier (and helps to
understand how some options, like cross-validation, work)
Q & A
Thank you
