Data Analysis with Weka
zoo.arff
Done by
Clement Robert H.
Daniyar M.
Web and Social Computing
Dataset
Zoo.arff:
A simple database containing 17 Boolean-valued attributes. The "type"
attribute appears to be the class attribute. Here is a breakdown of which
animals are in which type.
Objectives
● Select Dataset
● Learn and Explain the two classifiers
● Apply them to the selected Dataset
● Compare the results
Outline
● DataSet Visualization
● Preprocessing
● Classification
● JRip Implementation and Results
● J48 Implementation and Results
● Knowledge Flow Implementation
● Experimenter (Comparison of the Two Classifiers)
● Conclusions
● Questions and Answers
Visualization
Classes: mammal, bird, reptile, fish, amphibian, insect, invertebrate.
[Figures: basic statistics and per-class distributions for the attributes Hair, Feathers, Eggs, Milk, and Catsize]
Pre-Processing
Why?
Incomplete Data
Noisy Data
Inconsistent Data
How?
Discretize
Remove Duplicates
RemoveUseless
Etc
Classification
“Data mining technique used to predict group
membership for data instances”
Types of classifiers:
● Decision Trees
○ J48
● Rule-Based Classifiers
○ JRip
● Bayes
○ Naive Bayes
Classifier Test Options Explained and Applied to the Dataset
Training Set:
The whole dataset is used both to build and to test the classifier. This gives the
best results on the dataset itself but does not guarantee good performance on unseen data.
Cross-Validation:
The dataset is divided into k subsamples (folds). In each of k rounds, k-1 folds are
used as training data and the remaining fold as test data, so every instance is tested once.
Percentage Split:
The dataset is divided into two parts: X% of the instances are used for
training and the rest is used as the test set.
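As a minimal sketch of how the percentage split and cross-validation options partition the data, the Python below works on instance indices only. The function names, the fixed seed, and the plain (non-stratified) fold assignment are assumptions made for illustration; Weka's own implementation may differ, for example by stratifying folds by class.

```python
import random


def percentage_split(n, train_percent, seed=1):
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_percent / 100)
    return indices[:n_train], indices[n_train:]


def k_fold_splits(n, k, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 101 instances, as in zoo.arff
train, test = percentage_split(101, 66)
print(len(train), len(test))

splits = list(k_fold_splits(101, 10))
print(len(splits), sorted(len(test_fold) for _, test_fold in splits))
```

Evaluating on the training set corresponds to using all 101 indices for both building and testing, which is why it needs no splitting function here.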
JRip Classifier Algorithm
Based on the RIPPER algorithm
RIPPER: Repeated Incremental Pruning to Produce Error Reduction
● Incremental pruning
● Error reduction
● Produces rules
Works fine for:
● Class: missing class values, binary and nominal classes
Advantages
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Example rule: “If Age is greater than 35 and Status is Married, then he/she does not cheat.”
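To illustrate why rules like this are easy to interpret and fast to apply, here is a small sketch of an ordered rule list in Python. The attribute names, the single rule, and the default class are hypothetical and only mirror the example rule above; they are not output from JRip.

```python
# Each rule is a (condition, class label) pair; rules are tried in order.
# Hypothetical rule list mirroring the slide's example.
RULES = [
    (lambda person: person["age"] > 35 and person["status"] == "married",
     "does not cheat"),
]
DEFAULT_CLASS = "cheats"  # assumed default class when no rule fires


def classify(instance, rules=RULES, default=DEFAULT_CLASS):
    """Return the label of the first rule whose condition holds, else the default."""
    for condition, label in rules:
        if condition(instance):
            return label
    return default


print(classify({"age": 40, "status": "married"}))
print(classify({"age": 30, "status": "single"}))
```

Classifying a new instance only requires checking conditions in order until one matches, which is what makes prediction rapid.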
Classifier Evaluation -> Default Options (JRip)
Preprocessed? -> No
Results -> no big change
Classifier Evaluation -> Cross-Validation Increased (JRip)
● Same options as previous
● Except: Cross-Validation = 20 folds
“If the number of cross-validation folds increases, the number of correctly
classified instances decreases.” Why?
Next: Cross-Validation = 40, etc.
Preprocessed? -> No
Results -> no big change
Classifier Evaluation -> Cross-Validation (JRip)
● No clear relation between the number of folds and the accuracy
● Why 10? Extensive experiments have shown
that this is about the right choice to get an accurate
estimate for a small dataset (we have 101
instances and 17 attributes)
● The more we increase the number of folds in
cross-validation, the smaller each test fold
becomes, while each training subset grows towards
the full dataset (e.g. for k=80 with 101 instances,
each test fold holds only one or two instances!)
Cross-Validation (K)    Correctly Classified Inst. (%)
10                      87.14
20                      85.14
30                      86.13
40                      84.15
50                      87.12
60                      82.17
70                      85.14
80                      87.12
90                      84.15
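The effect of the number of folds on fold sizes can be checked with a little arithmetic. The sketch below assumes the instances are divided as evenly as possible among the folds; the helper name fold_sizes is introduced here for illustration only.

```python
def fold_sizes(n, k):
    """Sizes of the k test folds when n instances are split as evenly as possible."""
    base, extra = divmod(n, k)
    # The first `extra` folds receive one additional instance.
    return [base + 1] * extra + [base] * (k - extra)


n = 101  # number of instances in zoo.arff
for k in (10, 20, 40, 80):
    sizes = fold_sizes(n, k)
    print(f"k={k}: test fold size {min(sizes)}-{max(sizes)}, "
          f"training size {n - max(sizes)}-{n - min(sizes)}")
```

With k=80, every test fold contains one or two instances, so the accuracy measured on a single fold can only be 0%, 50%, or 100%; the averaged figure still uses all 101 instances.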
Classifier Evaluation -> Training Set (JRip)
Preprocessed? -> No (not yet)
Results
Cross-Validation (K)    Correctly Classified Inst. (%)
10                      87.14
20                      85.14
30                      86.13
40                      84.15
50                      87.12
60                      82.17
70                      85.14
80                      87.12
90                      84.15
vs.
Training Set            92.07
Training Set
● Uses the entire dataset as both
training and test data
● It works better on the given data but does
not give us confidence about using it on
unseen cases (prediction, etc.)
Pre-Processing: Why? (JRip)
● Rules may be generated based on a useless
attribute
○ Filter with RemoveUseless
● Possible duplicates can lead to biased
rules, and we would never know
○ Filter with RemoveDuplicates
● Other filters
○ Discretize, etc.
● RemoveDuplicates:
removes all duplicate
instances from the first
batch of data it receives
● RemoveUseless: this filter
removes attributes that do
not vary at all or that vary
too much
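The two filters described above can be approximated in a few lines of Python. This is only a sketch of the idea, not Weka's code: the function names, the 99% threshold for "varies too much", and the toy rows are assumptions chosen for illustration.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def useless_columns(rows, nominal_columns, max_distinct_percent=99.0):
    """Indices of columns that are constant, or nominal with too many distinct values."""
    useless = []
    for col in range(len(rows[0])):
        distinct = {row[col] for row in rows}
        if len(distinct) == 1:
            useless.append(col)  # does not vary at all
        elif (col in nominal_columns
              and 100.0 * len(distinct) / len(rows) > max_distinct_percent):
            useless.append(col)  # varies too much (behaves like an ID)
    return useless


# Hypothetical toy rows: name, milk, kingdom, class
rows = [
    ["aardvark", "yes", "animal", "mammal"],
    ["bass", "no", "animal", "fish"],
    ["bass", "no", "animal", "fish"],   # exact duplicate of the previous row
    ["carp", "no", "animal", "fish"],
]
unique = remove_duplicates(rows)
print(len(unique))
print(useless_columns(unique, nominal_columns={0, 1}))
```

In this toy data the name column is flagged because every value is distinct, and the kingdom column because it is constant.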
Pre-Processing -> RemoveDuplicates (instances), RemoveUseless (attributes) (JRip)
Preprocessed? -> Yes
Classifier Evaluation -> Default Options (JRip)
Classifier Evaluation -> Percentage Split -> Default (66%) (JRip)
Preprocessed? -> Yes
Results
● With this split, the test samples
do not include any amphibians!
● The statistics look good because
the test is done on a small test set.
Classifier Evaluation -> Percentage Split -> 80% (JRip)
Preprocessed? -> Yes
Results
Percentage Split
● Takes samples randomly
● There is a chance that some classes
have no representatives in the test set
http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
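How likely is it that a random percentage split leaves a small class out of the test set entirely? The sketch below computes this with a hypergeometric count. The figures used (4 amphibians among 101 instances, 34 test instances for a 66% split) are assumptions based on the commonly distributed zoo dataset, not numbers taken from these slides, and the function name is hypothetical.

```python
from math import comb


def prob_class_absent(n_total, n_class, n_test):
    """Probability that a random test set of n_test instances,
    drawn without replacement from n_total, contains none of the
    n_class instances of a given class."""
    return comb(n_total - n_class, n_test) / comb(n_total, n_test)


# Assumed figures: 101 instances, 4 amphibians, 34 test instances
p = prob_class_absent(101, 4, 34)
print(f"{p:.3f}")
```

Under these assumptions the chance is a little under one in five, so a test set without amphibians is not a rare accident.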
What Next? (JRip)
● Done with the test options applied to the data
● JRip has its own parameters that can be changed to make it work better
○ Folds, MinWeights, seed of randomization, error rate and
pruning
● Pruning
○ Pre-pruning: stop the classifier from growing further once a
condition is met
○ Post-pruning: let the classifier grow, then reduce it to make
it smaller so that it covers more unseen data
● Without pruning, the classifier is more detailed, which limits how well
it handles unseen cases
● JRip uses a post-pruning method (based on REP, reduced error pruning)
● Next: JRip with pruning vs. JRip without pruning
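Post-pruning of a single rule can be sketched as follows: grow a rule with several conditions, then try dropping trailing conditions and keep the shortest version that scores at least as well on held-out pruning data. The scoring formula (p - n) / (p + n), the attribute names, and the toy pruning set are assumptions for illustration; the metric and procedure JRip itself uses may differ.

```python
def covers(conditions, instance):
    """A rule covers an instance when all of its conditions hold."""
    return all(cond(instance) for cond in conditions)


def prune_value(conditions, prune_set, target):
    """Score a rule on the pruning set: (p - n) / (p + n), where p and n are
    the covered instances of the target class and of other classes."""
    covered = [label for inst, label in prune_set if covers(conditions, inst)]
    if not covered:
        return -1.0  # worst possible score when nothing is covered
    p = sum(1 for label in covered if label == target)
    n = len(covered) - p
    return (p - n) / (p + n)


def prune_rule(conditions, prune_set, target):
    """Drop trailing conditions as long as the pruning-set score does not get worse."""
    best, best_value = conditions, prune_value(conditions, prune_set, target)
    for length in range(len(conditions) - 1, 0, -1):
        candidate = conditions[:length]
        value = prune_value(candidate, prune_set, target)
        if value >= best_value:  # ties favour the shorter rule
            best, best_value = candidate, value
    return best


# Hypothetical over-specific rule for "mammal": milk AND hair AND legs == 4
conditions = [lambda a: a["milk"], lambda a: a["hair"], lambda a: a["legs"] == 4]
prune_set = [
    ({"milk": True, "hair": True, "legs": 4}, "mammal"),
    ({"milk": True, "hair": True, "legs": 2}, "mammal"),
    ({"milk": True, "hair": False, "legs": 0}, "mammal"),
    ({"milk": False, "hair": False, "legs": 4}, "reptile"),
    ({"milk": False, "hair": True, "legs": 6}, "insect"),
]
pruned = prune_rule(conditions, prune_set, "mammal")
print(len(conditions), "->", len(pruned), "conditions")
```

The pruned rule keeps only the milk condition, which covers all three mammals in the toy pruning set instead of just one.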
Classifier Evaluation -> Pruning TRUE vs. FALSE (JRip)
Preprocessed? -> No. Results with Experimenter*
J48 Classifier Algorithm
Based on the C4.5 algorithm
A modified ID3 that handles:
● continuous attributes
● missing attribute values
● attributes with differing costs
● post-pruning of trees
Produces a decision tree
Works fine for:
The advantages of C4.5 are:
• Builds models that can be easily interpreted
• Easy to implement
• Can use both categorical and continuous
values
• Deals with noise
Default Options (J48)
Cross-Validation = 20 (J48)
● No major changes
● ROC and PRC areas dropped
insignificantly
● Confusion matrix and
decision tree remain the same
Changing the Cross-Validation Value (J48)
Cross-Validation (K)    Correctly Classified Inst. (%)
10                      92.079
20                      92.079
30                      92.079
40                      92.079
50                      92.079
60                      92.079
70                      92.079
80                      92.079
90                      92.079
Changing the number of folds changes nothing, including ROC, PRC, confusion matrix and
decision tree.
This is because the number of instances (101) is relatively small for
J48.
Classifier by training set J48
Weka Knowledge Flow
Step by Step
Comparing two different trees
Comparing two sets of rules
Getting the same results with Knowledge Flow
Getting the same results with Knowledge Flow (cont'd)
J48 or JRip on Zoo.arff? -> Comparison with the Experimenter
Experimenter is suitable for:
● Large-scale experiments
● Automation
● Statistics can be stored in .arff format
● …..
● Classifier comparison
J48 or JRip on Zoo.arff? -> Results
With:
● Significance level of 10% on
percentage of correctly classified instances
● Cross-Validation = 10
● Default classifier options
We see that:
● J48 is the winner on Zoo.arff
Lessons Learnt & Conclusions
● Output readability
○ JRip outputs are easy to read and understand
○ J48 trees can be more complex to read
● Performance
○ J48 beats JRip, as it generates a general tree without pre-processing
○ J48 generates a tree whose precision is high
○ No big difference for a small dataset like zoo.arff
● Weka test options
○ Can mislead someone who only looks for good results
○ There is no universally good or bad classifier; it depends on many characteristics (dataset size, options used, classifier used, etc.)
● Weka
○ Experimenter gives a good way to compare classifiers
○ Knowledge Flow helps to see the intermediate steps in building a classifier (it helps to understand how some options, like cross-validation, work)
Q & A
Thank you

How Barcodes Can Be Leveraged Within Odoo 17
 

Weka.arff

  • 1. Data Analysis with Weka zoo.arff Done by Clement Robert H. Daniyar M. Web and Social Computing
  • 2. Dataset Zoo.arff: A simple database containing 17 Boolean-valued attributes. The "type" attribute appears to be the class attribute. Here is a breakdown of which animals are in which type.
  • 3. Objectives ● Select Dataset ● Learn and Explain the two classifiers ● Apply them to the selected Dataset ● Compare the results
  • 4. Outline ● DataSet Visualization ● Preprocessing ● Classification ● JRip Implementation and Results ● J48 Implementation and Results ● Knowledge Flow Implementation ● Experimenter (Comparison of the Two Classifiers) ● Conclusions ● Questions and Answers
  • 5. Visualization mammal, bird, reptile, fish, amphibian, insect, invertebrate.
  • 6. Basic statistics mammal, bird, reptile, fish, amphibian, insect, invertebrate.
  • 7. Hair mammal, bird, reptile, fish, amphibian, insect, invertebrate.
  • 8. Feathers mammal, bird, reptile, fish, amphibian, insect, invertebrate.
  • 9. Eggs mammal, bird, reptile, fish, amphibian, insect, invertebrate.
  • 10. Milk mammal, bird, reptile, fish, amphibian, insect, invertebrate.
  • 11. Catsize mammal, bird, reptile, fish, amphibian, insect, invertebrate.
  • 12. Pre-Processing. Why? Incomplete data, noisy data, inconsistent data. How? Discretize, RemoveDuplicates, RemoveUseless, etc.
  • 13. Classification: “Data mining technique used to predict group membership for data instances”. Types of classifiers: decision trees (J48), rule-based classifiers (JRip), Bayes (Naive Bayes).
  • 14. Classifier Options Explained and Applied to the Dataset. Training set: the whole dataset is used for both training and testing. This gives the best results on the dataset itself but says little about performance on unseen data. Cross-validation: divide the dataset into k subsamples, use k-1 subsamples as training data and one subsample as test data, and repeat so that each subsample serves as the test set once. Percentage split: divide the dataset into two parts, with X% used for training and the rest used as the test set.
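The cross-validation and percentage-split options described above can be sketched in a few lines of plain Python. This is an illustration only, not Weka's implementation: the function names are invented for this example, and the sketch shuffles once with a fixed seed and does not stratify the folds by class.

```python
import random


def k_fold_indices(n_instances, k, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def percentage_split(n_instances, train_percent, seed=1):
    """Return (train, test) index lists after one shuffled split."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_percent / 100)
    return indices[:cut], indices[cut:]


if __name__ == "__main__":
    # zoo.arff has 101 instances.
    for train, test in k_fold_indices(101, 10):
        print("train:", len(train), "test:", len(test))
    train, test = percentage_split(101, 66)
    print("66% split -> train:", len(train), "test:", len(test))
```

Every instance lands in exactly one test fold, which is why cross-validation uses all of the data for testing without ever testing on an instance the model was trained on.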
  • 15. JRip Classifier Algorithm. Based on the RIPPER algorithm (Repeated Incremental Pruning to Produce Error Reduction): incremental pruning, error reduction. Produces rules. Works fine for: missing class values, binary and nominal classes. Advantages: as highly expressive as decision trees; easy to interpret; easy to generate; can classify new instances rapidly. Example rule: “If Age is greater than 35 and Status is Married, then he/she does not cheat.”
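To show why a rule set classifies new instances quickly, here is a minimal sketch of applying an ordered rule list: the first rule whose conditions all hold decides the class, and a default class catches everything else. The field names `age` and `status`, the class labels, and the default class are assumptions made up for this example; only the single rule comes from the slide.

```python
def classify(instance, rules, default):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(test(instance) for test in conditions):
            return label
    return default


# The example rule from the slide, written as a list of condition functions.
RULES = [
    (
        [lambda x: x["age"] > 35, lambda x: x["status"] == "married"],
        "does not cheat",
    ),
]

# Hypothetical default class for instances that no rule covers.
DEFAULT = "cheats"

if __name__ == "__main__":
    print(classify({"age": 40, "status": "married"}, RULES, DEFAULT))
    print(classify({"age": 30, "status": "single"}, RULES, DEFAULT))
```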
  • 16. JRip: Classifier Evaluation, Default Options. Pre-processed? No. Results.
  • 17. JRip: Classifier Evaluation, Cross-Validation Increased. Pre-processed? No. Results: no big change. ● Same options as before, except cross-validation = 20 folds. ● “If the number of cross-validation folds increases, the number of correctly classified instances decreases.” Why? ● Next: cross-validation = 40, etc.
  • 18. JRip: Classifier Evaluation, Cross-Validation. Pre-processed? No. Results: no big change. ● There is no clear relation between the number of folds and the accuracy. ● Why 10? Extensive experiments have shown that this is the best choice for getting an accurate estimate on a small dataset (we have 101 instances and 17 attributes). ● The more we increase the number of folds, the smaller each test subset becomes (e.g. for k = 80 we ask for 80 subsets out of only 101 instances, so each test fold holds just one or two instances).
      Cross-validation folds    Correctly classified instances (%)
      10                        87.14
      20                        85.14
      30                        86.13
      40                        84.15
      50                        87.12
      60                        82.17
      70                        85.14
      80                        87.12
      90                        84.15
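For a concrete sense of how the number of folds interacts with 101 instances, the fold sizes can be computed directly. The helper below is a sketch written for this example (its name is not part of Weka). It shows that as k grows, the test folds shrink toward one or two instances while each training set grows toward the full dataset.

```python
def fold_sizes(n_instances, k):
    """Return (smallest test fold, largest test fold, smallest training set)."""
    base, extra = divmod(n_instances, k)
    # When n_instances is not a multiple of k, some folds get one extra instance.
    largest = base + (1 if extra else 0)
    return base, largest, n_instances - largest


if __name__ == "__main__":
    for k in (10, 20, 40, 80):
        smallest, largest, train = fold_sizes(101, k)
        print(f"k={k}: test folds of {smallest}-{largest}, training set >= {train}")
```

With test folds this small, the accuracy measured on any single fold is very coarse, so the averaged figure fluctuates from one k to the next.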
  • 19. JRip: Classifier Evaluation, Training Set. Pre-processed? No. Results.
  • 20. JRip: Classifier Evaluation, Training Set. Pre-processed? Not yet. Cross-validation versus training set:
      Cross-validation folds (k)    Correctly classified instances (%)
      10                            87.14
      20                            85.14
      30                            86.13
      40                            84.15
      50                            87.12
      60                            82.17
      70                            85.14
      80                            87.12
      90                            84.15
      Training set                  92.07
    Training set ● Uses the whole dataset as both training and test data ● It works better on the given data but does not give us confidence to use the model on unseen cases (prediction, etc.)
  • 21. JRip: Pre-Processing, Why? ● A rule may be generated from a useless attribute ○ Filter with RemoveUseless ● Possible duplicates can lead to biased rules, and we cannot know in advance ○ Filter with RemoveDuplicates ● Other filters ○ Discretize, etc.
  • 22. JRip: Pre-Processing with RemoveDuplicates (instances) and RemoveUseless (attributes). ● RemoveDuplicates: removes all duplicate instances from the first batch of data it receives ● RemoveUseless: removes attributes that do not vary at all or that vary too much
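The two filters can be illustrated with a small plain-Python sketch. It mimics the behaviour described on the slide and is not Weka's code: the function names, the toy rows, and the 0.99 threshold for "varies too much" are assumptions chosen for this example.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def useless_columns(rows, max_distinct_fraction=0.99):
    """Return indices of columns that are constant or almost all-distinct."""
    n = len(rows)
    useless = []
    for col in range(len(rows[0])):
        distinct = len({row[col] for row in rows})
        if distinct == 1 or distinct / n > max_distinct_fraction:
            useless.append(col)
    return useless


if __name__ == "__main__":
    # Hypothetical rows: (name, hair, constant flag, class).
    rows = [
        ("aardvark", 1, 0, "mammal"),
        ("bass", 0, 0, "fish"),
        ("bass", 0, 0, "fish"),  # exact duplicate of the previous row
        ("chub", 0, 0, "fish"),
        ("dove", 0, 0, "bird"),
    ]
    unique = remove_duplicates(rows)
    print(len(rows), "->", len(unique), "rows")
    # The name column is all-distinct and the flag column never varies.
    print("useless columns:", useless_columns(unique))
```

An identifier-like column such as the animal name is a typical "varies too much" attribute: a rule learned from it memorises individual instances instead of generalising.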
  • 24. JRip: Classifier Evaluation, Percentage Split, Default (66%). Pre-processed? Yes. Results: ● With this split, the test set contains no amphibians! ● The statistics look good because the test is done on a small test set.
  • 25. JRip: Classifier Evaluation, Percentage Split, 80%. Pre-processed? Yes. Results: Percentage split ● Takes samples randomly ● Some classes may have no representatives in the test set. http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
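The risk of a class being left out of the test set can be checked with a short simulation. This is a sketch under assumptions: the class counts are the ones given in the public description of the zoo dataset (they sum to 101), the shuffle is Python's rather than Weka's, and the function names are invented here, so the classes that go missing for a given seed will not match Weka's output.

```python
import random

# Class sizes as listed in the zoo dataset description (assumed, sums to 101).
ZOO_CLASS_COUNTS = {
    "mammal": 41, "bird": 20, "reptile": 5, "fish": 13,
    "amphibian": 4, "insect": 8, "invertebrate": 10,
}


def build_labels(class_counts):
    """Expand {class: count} into a flat list of class labels."""
    return [name for name, count in class_counts.items() for _ in range(count)]


def missing_test_classes(labels, train_percent, seed):
    """Return the classes that have no instance in the test part of a split."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    cut = round(len(labels) * train_percent / 100)
    test_labels = {labels[i] for i in indices[cut:]}
    return sorted(set(labels) - test_labels)


if __name__ == "__main__":
    labels = build_labels(ZOO_CLASS_COUNTS)
    for seed in range(5):
        for percent in (66, 80):
            missing = missing_test_classes(labels, percent, seed)
            print(f"seed={seed} split={percent}% missing in test: {missing}")
```

Small classes such as amphibian and reptile are the ones most likely to vanish from the test set, and the larger the training percentage, the more often it happens.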
  • 26. JRip: What Next? ● We are done with the test options applied to the data ● JRip has its own parameters that can be changed to make it work better ○ Folds, MinWeights, seed of randomization, error rate and pruning ● Pruning ○ Pre-pruning: stop the classifier from growing further once a condition is met ○ Post-pruning: let the classifier grow, then reduce it to make it smaller so that it covers more unseen data ● Without pruning, the classifier is more detailed, which limits it on unseen cases ● JRip uses a post-pruning method (based on REP, reduced error pruning) ● Next: JRip with pruning vs. JRip without pruning
  • 27. JRip: Classifier Evaluation, Pruning TRUE vs. FALSE. Pre-processed? No. Results with Experimenter*
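Post-pruning of a single rule can be sketched as follows: grow the rule in full, then try dropping its trailing conditions and keep the shortest version that scores at least as well on a held-out pruning set. The score used here is (p - n) / (p + n), the form described for RIPPER, where p and n are the covered positives and negatives. Weka's JRip may compute its metric differently, and the attribute names and pruning data below are hypothetical.

```python
def covers(conditions, instance):
    """True if the instance satisfies every (attribute, value) condition."""
    return all(instance.get(attr) == value for attr, value in conditions)


def prune_value(conditions, prune_set, target):
    """Score a rule on the pruning set as (p - n) / (p + n)."""
    covered = [label for inst, label in prune_set if covers(conditions, inst)]
    if not covered:
        return -1.0  # a rule that covers nothing gets the worst possible score
    p = sum(1 for label in covered if label == target)
    n = len(covered) - p
    return (p - n) / (p + n)


def prune_rule(conditions, prune_set, target):
    """Drop trailing conditions while the pruning-set score does not get worse."""
    best = list(conditions)
    best_value = prune_value(best, prune_set, target)
    for keep in range(len(conditions) - 1, 0, -1):
        candidate = list(conditions[:keep])
        value = prune_value(candidate, prune_set, target)
        if value >= best_value:  # ties favour the shorter, more general rule
            best, best_value = candidate, value
    return best


if __name__ == "__main__":
    # Hypothetical pruning set of (instance, class) pairs.
    prune_set = [
        ({"milk": 1, "hair": 1, "aquatic": 0}, "mammal"),
        ({"milk": 1, "hair": 0, "aquatic": 1}, "mammal"),
        ({"milk": 1, "hair": 1, "aquatic": 1}, "mammal"),
        ({"milk": 0, "hair": 0, "aquatic": 1}, "fish"),
        ({"milk": 0, "hair": 1, "aquatic": 0}, "insect"),
    ]
    grown = [("milk", 1), ("hair", 1), ("aquatic", 0)]
    print(prune_rule(grown, prune_set, "mammal"))
```

The overly detailed rule is reduced to one that covers more mammals in the pruning set without picking up any errors, which is the sense in which a pruned classifier "covers more unseen data".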
  • 28. J48 Classifier Algorithm. Based on the C4.5 algorithm, a modification of ID3 that adds support for: continuous attributes, missing attribute values, attributes with differing costs, and post-pruning of trees. Produces a decision tree. The advantages of C4.5 are: • Builds models that can be easily interpreted • Easy to implement • Can use both categorical and continuous values • Deals with noise
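C4.5 chooses the attribute to split on by gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below computes it for a nominal attribute. It is a simplified illustration, not J48's code: it ignores missing values and continuous attributes, and the toy values are made up.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def gain_ratio(attribute_values, labels):
    """Information gain of the split divided by the split's own entropy."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected class entropy after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: one Boolean attribute and a class label.
    milk = [1, 1, 0, 0]
    animal_type = ["mammal", "mammal", "bird", "fish"]
    print(gain_ratio(milk, animal_type))
```

An attribute that never varies yields a split entropy of zero, so the sketch returns 0.0 for it instead of dividing by zero.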
  • 31. J48: Cross-validation = 20. No major changes. ROC and PRC areas dropped insignificantly. The confusion matrix and decision tree remain the same.
  • 32. J48: Changing the cross-validation value. Changing the number of folds changes nothing, including ROC, PRC, the confusion matrix and the decision tree. This is because the number of instances (101) is relatively small for J48.
      Cross-validation folds    Correctly classified instances (%)
      10                        92.079
      20                        92.079
      30                        92.079
      40                        92.079
      50                        92.079
      60                        92.079
      70                        92.079
      80                        92.079
      90                        92.079
  • 37. Comparing two sets of rules
  • 38. Getting same results by Knowledge Flow
  • 39. Getting same results by Knowledge Flow (cont'd)
  • 40. J48 or JRip on Zoo.arff? Comparison with the Experimenter feature. The Experimenter is suitable for: ● Large-scale experiments ● Automation ● Storing statistics in .arff format ● ….. ● Comparing classifiers
  • 41. J48 or JRip on Zoo.arff? Results. With: ● Significance level of 10% on the percentage of correctly classified instances ● Cross-validation = 10 ● Default classifier options. We see that: ● J48 is the winner on Zoo.arff
  • 42. Lessons Learnt & Conclusions ● Output readability ○ JRip outputs are easy to read and understand ○ J48 trees can be more complex to read ● Performance ○ J48 beats JRip, as it generates a general tree without pre-processing ○ J48 generates a tree with high precision ○ There is no big difference on a small dataset like zoo.arff ● Weka test options ○ They can mislead someone who only looks for good results ○ There is no good or bad classifier: it depends on many characteristics (dataset size, options used, classifier used, etc.) ● Weka ○ The Experimenter gives a good way to compare classifiers ○ Knowledge Flow exposes the intermediate steps in generating a classifier, which helps in understanding how options such as cross-validation work
  • 43. Q & A Thank you