Data Mining with WEKA
Census Income Dataset
(UCI Machine Learning Repository)
Hein and Maneshka
Data Mining
● non-trivial extraction of previously unknown and potentially useful information
from data by means of computers.
● part of the machine learning field.
● two types of machine learning:
○ supervised learning: to learn a mapping from inputs to known (labelled) outputs
■ regression: to find real value(s) as output
■ classification: to map an instance of data to one of the predefined classes
○ unsupervised learning: to discover the internal representation of data
■ clustering: to group instances of data together based on some characteristics
■ association rule mining: to find relationships between attribute values that occur together
Aim
● Perform data mining using WEKA
○ understanding the dataset
○ preprocessing
○ task: classification
Dataset - Census Income Dataset
● from UCI machine learning repository
● 32,561 instances
● attributes: 14
○ continuous: age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week
○ nominal: workclass, education, marital-status, occupation, relationship, race, sex, native-country
● salary - classes: 2 (<= 50K and > 50K)
● missing values:
○ workclass & occupation: 1836 (6%)
○ native-country: 583 (2%)
● imbalanced (skewed) distribution of values
○ age, capital-gain, capital-loss, native-country
Dataset - Census Income Dataset
● imbalanced distributions of attributes
● no strong separation of classes
(figure legend: blue = <=50K, red = >50K)
Preprocessing
● preprocess (filter) the data for effective data mining
○ consider how to deal with missing values, and outliers
○ consider which attributes are relevant
● removed fnlwgt attribute (final weight)
○ with fnlwgt, J48, full dataset - accuracy (86.232%)
○ without fnlwgt - accuracy (86.2596%)
● removed education-num attribute
○ mirror attribute to education
● handling missing values
○ ReplaceMissingValues filter (unsupervised - attribute)
● removed duplicates
○ RemoveDuplicates filter (unsupervised - instance)
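The two cleaning steps above can be sketched outside WEKA in plain Python. This is only an illustration, assuming missing nominal values are marked with "?" and missing numeric values with None; the function names and the toy rows are hypothetical, and nominal gaps are filled with the most frequent value while numeric gaps are filled with the mean.

```python
from collections import Counter
from statistics import mean

MISSING = "?"  # assumed marker for a missing nominal value


def replace_missing(rows, nominal_cols, numeric_cols):
    """Fill nominal gaps with the column mode and numeric gaps with the column mean."""
    filled = [list(r) for r in rows]
    for c in nominal_cols:
        present = [r[c] for r in rows if r[c] != MISSING]
        mode = Counter(present).most_common(1)[0][0]
        for r in filled:
            if r[c] == MISSING:
                r[c] = mode
    for c in numeric_cols:
        present = [r[c] for r in rows if r[c] is not None]
        avg = mean(present)
        for r in filled:
            if r[c] is None:
                r[c] = avg
    return filled


def remove_duplicates(rows):
    """Keep the first occurrence of each distinct row, preserving order."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(r)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Toy rows (age, workclass, salary); not taken from the real dataset.
rows = [
    [39, "Private", "<=50K"],
    [50, "?", "<=50K"],
    [None, "Private", ">50K"],
    [39, "Private", "<=50K"],
]
cleaned = remove_duplicates(replace_missing(rows, nominal_cols=[1], numeric_cols=[0]))
```

In the toy data the second row gets workclass "Private", the third row gets the mean age, and the repeated first row is dropped.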
Preprocessing
● grouped education attribute values
○ 16 values → 9 values
○ kept as they are (8 values): HS-graduate, Some-college, Bachelors, Prof-School, Masters, Doctorate, Assoc-acdm, Assoc-voc
○ merged into HS-not-finished (8 values): Pre-school, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th
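The grouping amounts to a simple value mapping. A minimal sketch in Python, using the value names as written on the slide (the spelling in the raw data file may differ) and a hypothetical helper name:

```python
# Education levels below high-school graduation, as named on the slide.
NOT_FINISHED = {
    "Pre-school", "1st-4th", "5th-6th", "7th-8th",
    "9th", "10th", "11th", "12th",
}


def group_education(value):
    """Map every pre-graduation level to one value; leave the others unchanged."""
    return "HS-not-finished" if value in NOT_FINISHED else value


original_values = sorted(NOT_FINISHED) + [
    "HS-graduate", "Some-college", "Bachelors", "Prof-School",
    "Masters", "Doctorate", "Assoc-acdm", "Assoc-voc",
]
grouped_values = {group_education(v) for v in original_values}
```

Applied to all 16 original values, the mapping leaves 9 distinct values.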
Preprocessing - Balancing Class Distribution
● without balancing the class distribution, classifiers perform badly on the classes with fewer instances
Preprocessing - Balancing Class Distribution
Step 1: Select the Resample filter
Filters→supervised→instance→Resample
Step 2: Set the biasToUniformClass parameter of the Resample filter to 1.0 and click ‘Apply Filter’
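The idea behind resampling towards a uniform class distribution can be sketched in Python. This is a simplified illustration, not WEKA's implementation: it assumes sampling with replacement and an equal share per class, and the function name, the seed and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def resample_uniform(rows, label_of, seed=1):
    """Draw the same number of rows from every class, sampling with replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in rows:
        by_class[label_of(r)].append(r)
    per_class = len(rows) // len(by_class)  # equal share of the original size
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.choices(by_class[label], k=per_class))
    rng.shuffle(sample)
    return sample


# Toy imbalanced data: 90 rows of one class, 10 of the other.
rows = [(i, "<=50K") for i in range(90)] + [(i, ">50K") for i in range(10)]
balanced = resample_uniform(rows, label_of=lambda r: r[1])
counts = Counter(label for _, label in balanced)
```

After resampling, both classes contribute 50 rows, so minority-class rows are repeated and majority-class rows are thinned out.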
Preprocessing - Outliers
● Outliers in the data can skew the results and mislead the algorithms.
● Outliers can be removed as follows
Preprocessing - Removing Outliers
Step 1: Select the InterquartileRange filter
Filters→unsupervised→attribute→InterquartileRange → Apply
Result: creates two new attributes, outlier and
extreme value, at attribute indices
14 and 15 respectively
Preprocessing - Removing Outliers
Step 2: a) Select another filter, RemoveWithValues
Filters→unsupervised→instance→RemoveWithValues
b) Click on the filter to get its parameters.
Set attributeIndex to 14 and nominalIndices to 2,
since only the instances whose value is yes need to be
removed.
Preprocessing - Removing Outliers
Result: removes all outliers from the dataset
Step 3: Remove the outlier and extreme value attributes from the dataset
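The interquartile-range rule behind these steps can be sketched in Python for a single numeric attribute. The quartile interpolation and the factor of 3.0 are assumptions made for this sketch (the filter's own settings should be checked in WEKA), and the hours-per-week values are invented.

```python
def quartiles(values):
    """Return (Q1, Q3) using linear interpolation between sorted values."""
    s = sorted(values)

    def at(p):
        pos = p * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    return at(0.25), at(0.75)


def flag_outliers(values, factor=3.0):
    """Flag values further than factor * IQR below Q1 or above Q3."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v < low or v > high for v in values]


hours = [40, 38, 45, 40, 42, 39, 41, 99]  # invented hours-per-week values
flags = flag_outliers(hours)               # step 1: mark outliers
kept = [v for v, is_outlier in zip(hours, flags) if not is_outlier]  # step 2: drop them
```

Here only the value 99 falls outside the fences (30.75 to 51.75), so it is the single instance removed.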
Preprocessing - Impact of Removing Outliers
● With outliers in the dataset - 85.3302% correctly classified instances
● Without outliers in the dataset - 84.3549% correctly classified instances
Since the percentage of correctly classified instances was higher for the
dataset with outliers, we kept the outliers.
The reduced accuracy is due to the nature of our dataset (very skewed
distributions in some attributes, e.g. capital-gain).
Preprocessing
● Our preprocessing recap
○ removed fnlwgt and education-num attributes
○ removed duplicate instances
○ filled in missing values
○ grouped some attribute values for education
○ rebalanced class distribution
● size of dataset: 14356 instances
Performance of Classifiers
● simplest measure: rate of correct predictions (accuracy)
● confusion matrix: counts of true/false positives (TP, FP) and true/false negatives (TN, FN)
● Precision: how many positive predictions are correct (TP/(TP + FP))
● Recall: how many actual positives are caught (TP/(TP + FN))
● F-Measure: considers both precision and recall
(2 * precision * recall / (precision + recall))
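As a worked example of these three measures, here is a short Python sketch. The counts (80 true positives, 20 false positives, 40 false negatives) are invented for illustration and are not results from the dataset.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
```

With these counts precision is 0.8, recall is about 0.667 and the F-measure is about 0.727, which sits between the two.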
Performance of Classifiers
● kappa statistic: chance-corrected accuracy measure (should be greater than 0, i.e. better than chance)
● ROC Area: the bigger the area, the better the result (should be greater than 0.5, the value for random guessing)
● Error rates: useful for regression
○ predicting real values
○ predictions are not just right or wrong
○ these reflect the magnitude of errors
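The kappa statistic compares the observed accuracy with the accuracy expected by chance from the row and column totals of the confusion matrix. A minimal Python sketch, with an invented 2x2 matrix (rows are actual classes, columns are predicted classes):

```python
def kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of matching row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


k = kappa([[40, 10], [20, 30]])           # invented counts
k_chance = kappa([[25, 25], [25, 25]])    # predictions unrelated to the truth
```

The first matrix has 70% accuracy against 50% expected by chance, giving kappa 0.4; the second gives kappa 0, which is why a useful classifier should score above 0.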
Developing Classifiers
● ran algorithms with default parameters
● test option: 10-fold cross-validation
● preprocessed dataset
Algorithm     Accuracy
J48           83.6305 %
JRip          82.0075 %
NaiveBayes    76.5464 %
IBk           84.9401 %
Logistic      82.3837 %
● chose J48 and IBk classifiers to develop further.
● IBk is the best performing.
● J48 is very fast, second best, and very popular.
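For readers unfamiliar with 10-fold cross-validation, the procedure can be sketched in Python as follows. This is a simplified illustration: the folds are not stratified by class, the helper names are hypothetical, and a majority-class predictor stands in for a real classifier.

```python
from collections import Counter


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; every index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validated_accuracy(rows, labels, fit, predict, k=10):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for train, test in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, rows[i]) == labels[i] for i in test)
    return correct / len(rows)


def fit_majority(train_rows, train_labels):
    """Stand-in learner: remember the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


labels = ["<=50K"] * 15 + [">50K"] * 5   # invented labels
rows = list(range(len(labels)))          # placeholder feature values
acc = cross_validated_accuracy(rows, labels, fit_majority, predict_majority, k=10)
```

The majority-class predictor scores 0.75 here, which is simply the share of the larger class; a real classifier is only useful if it beats that baseline.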
J48 Algorithm
● Open-source Java implementation of the C4.5 algorithm in the Weka data
mining tool
● It creates a decision tree based on labelled input data
● The generated trees can be used for classification, and for this reason C4.5 is often called a
statistical classifier
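C4.5 chooses the attribute to split on using the gain ratio, which is the information gain divided by the split information. The sketch below computes it for a nominal attribute in Python; it covers only this selection criterion, not the full tree-building or pruning procedure, and the attribute values are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of splitting on a nominal attribute, divided by split info."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


salary = [">50K", ">50K", "<=50K", "<=50K"]
perfect = gain_ratio(["yes", "yes", "no", "no"], salary)   # separates the classes
useless = gain_ratio(["a", "b", "a", "b"], salary)         # tells us nothing
```

An attribute that separates the classes perfectly scores 1.0 and an uninformative one scores 0.0, so the tree places the higher-scoring attribute nearer the root.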
Pros and Cons of J48
Pros
● Easier to interpret results
● Helps to visualise through a decision tree
Cons
● Run-time complexity depends on the depth of the tree (which grows with the number of
attributes in the dataset)
● Space complexity is large, as values need to be stored in arrays repeatedly.
J48 - Using Default Parameters
Number of Leaves : 811
Size of the tree : 1046
J48 - Setting binarySplits parameter to True
J48 - Setting unpruned parameter to True
Number of Leaves : 3479
Size of the tree : 4214
J48 - Setting unpruned and binarySplits parameters to True
J48 - Observations
● we initially thought education would be the most important factor in classifying
income.
● the J48 tree (without binary splits) has capital-gain as its root node, instead of
education.
● this suggests capital-gain contributes more to predicting income than we initially
thought.
IBk Classifier
● instance-based classifier
● k-nearest neighbors algorithm
● takes the k nearest neighbors to make decisions
● uses distance measures to find the nearest neighbors
○ chi-square distance, euclidean distance (used by IBk)
● can use distance weighting
○ to give more influence to nearer neighbors
○ 1/distance and 1-distance
● can be used for classification and regression
○ classification - output is the class value most common among the neighbors
○ regression - output is the average of the neighbors' values
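A minimal Python sketch of k-nearest-neighbour classification with optional 1/distance weighting, assuming purely numeric features and euclidean distance. The function name and the toy points are hypothetical, and this is not WEKA's IBk code.

```python
from collections import defaultdict
from math import dist


def knn_predict(train, query, k=5, weighted=True):
    """Classify query from its k nearest neighbours.

    train is a list of (features, label) pairs with numeric feature tuples.
    With weighted=True each neighbour votes with weight 1/distance.
    """
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        d = dist(features, query)
        # The small constant avoids division by zero for an exact match.
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


train = [((0.0, 0.0), "A"), ((4.0, 0.0), "B"), ((0.0, 4.0), "B")]
query = (0.5, 0.5)
with_weights = knn_predict(train, query, k=3, weighted=True)
without_weights = knn_predict(train, query, k=3, weighted=False)
```

With k=3 the unweighted vote follows the two distant "B" points, while the weighted vote follows the single close "A" point, which illustrates why weighting lets a larger k be used without drowning out near neighbours.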
Pros and Cons of IBk
Pros
● easy to understand / implement
● performs well given enough representative instances
● choice of attributes and distance measures
Cons
● large search space
○ have to search whole dataset to get nearest neighbors
● curse of dimensionality
● must choose meaningful distance measure
Improving IBk
ran the KNN algorithm (IBk) with different combinations of parameters
Parameters                               Correct Prediction   ROC Area
KNN (k=1, no weight) default             84.9401 %            0.860
KNN (k=5, no weight)                     80.691 %             0.882
KNN (k=5, inverse-distance-weight)       85.978 %             0.929
KNN (k=10, no weight)                    81.0323 %            0.887
KNN (k=10, inverse-distance-weight)      86.5422 %            0.939
KNN (k=10, similarity-weighted)          81.6244 %            0.892
KNN (k=50, inverse-distance-weight)      86.8905 %            0.948
KNN (k=100, inverse-distance-weight)     86.6397 %            0.947
IBk - Observations
● larger k gives better classification when inverse-distance weighting is used
○ up to a certain value of k (50)
○ using inverse-distance weighting improves accuracy greatly
● limitations
○ we used euclidean distance (not the best for nominal values in the dataset)
Vote Classifier
● we combined our classifiers with the Vote meta classifier
○ used average of probabilities
Classifier                               Accuracy    ROC Area
J48                                      85.3998 %   0.879
KNN (k=50, inverse-distance-weight)      86.8905 %   0.948
Logistic                                 82.3837 %   0.905
Vote                                     87.3084 %   0.947
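The "average of probabilities" combination rule can be sketched in Python: each classifier outputs a probability per class, the probabilities are averaged, and the class with the highest average wins. The three distributions below are invented for a single instance and do not come from the experiments.

```python
def vote_average(distributions):
    """Average per-class probabilities from several classifiers and pick the top class."""
    classes = distributions[0].keys()
    averaged = {
        c: sum(d[c] for d in distributions) / len(distributions) for c in classes
    }
    return max(averaged, key=averaged.get), averaged


# Invented class probabilities for one instance.
label, averaged = vote_average([
    {"<=50K": 0.60, ">50K": 0.40},   # e.g. a decision tree
    {"<=50K": 0.30, ">50K": 0.70},   # e.g. k-nearest neighbours
    {"<=50K": 0.45, ">50K": 0.55},   # e.g. logistic regression
])
```

Although the first classifier favours "<=50K", the averaged probabilities (0.45 against 0.55) select ">50K", so a confident classifier can outweigh a less confident one.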
What We Have Done
● Developed a classifier for the Census Income Dataset
○ a lot of preprocessing
○ learned in detail about the J48 and KNN classifiers
● Developed a classifier with 87.3084 % accuracy and 0.947 ROC area.
○ using Vote
Thank You.

Data mining with weka

  • 1. Data Mining with WEKA Census Income Dataset (UCI Machine Learning Repository) Hein and Maneshka
  • 2. Data Mining ● non-trivial extraction of previously unknown and potentially useful information from data by means of computers. ● part of the machine learning field. ● two types of machine learning: ○ supervised learning: to learn a mapping from labelled examples to known outputs ■ regression: to find real value(s) as output ■ classification: to map an instance of data to one of the predefined classes ○ unsupervised learning: to discover the internal representation of data ■ clustering: to group instances of data together based on some characteristics ■ association rule mining: to find relationships between instances of data
  • 3. Aim ● Perform data mining using WEKA ○ understanding the dataset ○ preprocessing ○ task: classification
  • 4. Dataset - Census Income Dataset ● from UCI machine learning repository ● 32,561 instances ● attributes: 14 ○ continuous: age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week ○ nominal: workclass, education, marital-status, occupation, relationship, race, sex, native-country ● salary - classes: 2 (<= 50K and > 50K) ● missing values: ○ workclass & occupation: 1836 (6%) ○ native-country: 583 (2%) ● imbalanced distribution of values ○ age, capital-gain, capital-loss, native-country
  • 5. Dataset - Census Income Dataset ● imbalanced distributions of attributes ● no strong separation of classes (figure legend: blue = <=50K, red = >50K)
  • 6. Preprocessing ● preprocess (filter) the data for effective data mining ○ consider how to deal with missing values and outliers ○ consider which attributes are relevant ● removed fnlwgt attribute (final weight) ○ with fnlwgt, J48, full dataset - accuracy (86.232%) ○ without fnlwgt - accuracy (86.2596%) ● removed education-num attribute ○ numeric mirror of the education attribute ● handling missing values ○ ReplaceMissingValues filter (unsupervised - attribute) ● removed duplicates ○ RemoveDuplicates filter (unsupervised - instance)
  • 7. Preprocessing ● grouped education attribute values ○ 16 values → 9 values ○ kept unchanged: HS-graduate, Some-college, Bachelors, Prof-School, Masters, Doctorate, Assoc-acdm, Assoc-voc ○ merged into HS-not-finished: Pre-school, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th
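The grouping on this slide can be sketched as a simple lookup. This is only an illustration in Python, not the WEKA procedure the authors used; the names `LOW_LEVELS`, `GROUPED` and `group_education` are hypothetical, and the label spellings follow the slide rather than the raw dataset.

```python
# Hypothetical sketch: merge all below-high-school levels into one label.
# Label spellings follow the slide, not necessarily the raw dataset values.
LOW_LEVELS = ["Pre-school", "1st-4th", "5th-6th", "7th-8th",
              "9th", "10th", "11th", "12th"]
GROUPED = {level: "HS-not-finished" for level in LOW_LEVELS}


def group_education(value):
    """Return the merged label for a low level, otherwise the value unchanged."""
    return GROUPED.get(value, value)


original = LOW_LEVELS + ["HS-graduate", "Some-college", "Bachelors", "Prof-School",
                         "Masters", "Doctorate", "Assoc-acdm", "Assoc-voc"]
grouped_values = sorted({group_education(v) for v in original})
print(len(original), "->", len(grouped_values))  # 16 -> 9
```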
  • 8. Preprocessing - Balancing Class Distribution ● without balancing the class distribution, the classifiers perform poorly on the less frequent class
  • 9. Preprocessing - Balancing Class Distribution Step 1: Apply the Resample filter Filters→supervised→instance→Resample Step 2: Set the biasToUniformClass parameter of the Resample Filter to 1.0 and click ‘Apply Filter’
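To make the effect of a fully uniform class bias concrete, here is a minimal Python sketch that draws a sample with replacement so that every class receives an equal share. It is an assumption-laden illustration of the idea, not WEKA's Resample implementation; `resample_uniform`, `label_of` and the seed value are hypothetical.

```python
import random
from collections import Counter, defaultdict


def resample_uniform(rows, label_of, seed=1):
    """Draw about len(rows) instances with replacement, equally split across classes."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable output
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    per_class = len(rows) // len(by_class)
    sample = []
    for members in by_class.values():
        # Sampling with replacement lets a small class be drawn repeatedly.
        sample.extend(rng.choices(members, k=per_class))
    rng.shuffle(sample)
    return sample


# Toy data: 75 instances of one class and 25 of the other.
rows = [("a", "<=50K")] * 75 + [("b", ">50K")] * 25
balanced = resample_uniform(rows, label_of=lambda r: r[1])
counts = Counter(label for _, label in balanced)
print(counts)  # 50 instances of each class
```

Because the minority class is sampled with replacement, the balanced set contains repeated instances, which is worth keeping in mind when reading cross-validation results.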
  • 10. Preprocessing - Outliers ● Outliers in data can skew and mislead the processing of algorithms. ● Outliers can be removed in the following manner
  • 11. Preprocessing - Removing Outliers Step 1: Select the InterquartileRange filter Filters→unsupervised→attribute→InterquartileRange → Apply Result: creates two attributes, outlier and extreme value, with attribute numbers 14 and 15 respectively
  • 12. Preprocessing - Removing Outliers Step 2: a) Select another filter, RemoveWithValues Filters→unsupervised→instance→RemoveWithValues b) Click on the filter to get its parameters. Set attributeIndex to 14 and nominalIndices to 2, since only the instances whose value is yes need to be removed.
  • 13. Preprocessing - Removing Outliers Result: removes all outliers from the dataset Step 3: Remove the outlier and extreme value attributes from the dataset
  • 14. Preprocessing - Impact of Removing Outliers ● With outliers in dataset - 85.3302% correctly classified instances ● Without outliers in dataset - 84.3549% correctly classified instances Since the percentage of correctly classified instances was greater for the dataset with outliers, this was selected! The reduced accuracy is due to the nature of our dataset (very skewed distributions in attributes such as capital-gain).
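The interquartile-range idea behind these steps can be sketched in a few lines of Python. This is a simplified illustration, not WEKA's filter: it uses a single threshold rather than separate outlier and extreme-value flags, the factor of 3.0 is an assumed setting, and `iqr_flags` and the `hours` values are hypothetical.

```python
import statistics


def iqr_flags(values, outlier_factor=3.0):
    """Flag values more than outlier_factor * IQR outside the quartiles.

    outlier_factor=3.0 is an assumed setting, not taken from the slides.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low = q1 - outlier_factor * iqr
    high = q3 + outlier_factor * iqr
    return [v < low or v > high for v in values]


# Toy hours-per-week values with one implausibly large entry.
hours = [38, 40, 40, 40, 42, 45, 45, 50, 99]
flags = iqr_flags(hours)
# Keep only the instances that were not flagged, mirroring the removal step.
kept = [v for v, is_outlier in zip(hours, flags) if not is_outlier]
print(kept)
```

On a heavily skewed attribute most non-zero values can fall outside such bounds, which is one way to read the accuracy drop reported after outlier removal.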
  • 15. Preprocessing ● Our preprocessing recap ○ removed fnlwgt, edu-num attributes ○ removed duplicate instances ○ filled in missing values ○ grouped some attribute values for education ○ rebalanced class distribution ● size of dataset: 14356 instances
  • 16. Performance of Classifiers ● simplest measure: rate of correct predictions ● confusion matrix: table of true/false positives and negatives (TP, FP, FN, TN) ● Precision: how many positive predictions are correct (TP / (TP + FP)) ● Recall: how many actual positives are caught (TP / (TP + FN)) ● F-Measure: considers both precision and recall (2 * precision * recall / (precision + recall))
  • 17. Performance of Classifiers ● kappa statistic: chance-corrected accuracy measure (should be greater than 0) ● ROC Area: the larger the area, the better the result (should be greater than 0.5) ● Error rates: useful for regression ○ predicting real values ○ predictions are not just right or wrong ○ these reflect the magnitude of errors
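As a worked example of these measures, the following Python sketch computes accuracy, precision, recall, F-measure and the kappa statistic from the four cells of a two-class confusion matrix. The function name and the counts are hypothetical, and the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common measures from a 2x2 confusion matrix (non-zero denominators assumed)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from predicted and actual class totals.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_measure": f_measure, "kappa": kappa}


# Hypothetical counts, not results from the slides.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.70 but the kappa statistic is only 0.40, since half of the agreement would be expected by chance.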
  • 18. Developing Classifiers
        ● ran algorithms with default parameters
        ● test option: 10-fold cross-validation
        ● preprocessed dataset
        Algorithm     Accuracy
        J48           83.6305 %
        JRip          82.0075 %
        NaiveBayes    76.5464 %
        IBk           84.9401 %
        Logistic      82.3837 %
        ● chose J48 and IBk classifiers to develop further.
        ● IBk is the best performing.
        ● J48 is very fast, second best, and very popular.
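The 10-fold cross-validation used for these accuracies can be illustrated with a small Python sketch of how the folds partition the data. It is a simplification: it neither shuffles nor stratifies the instances, which a real evaluation tool may do, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train, test) index lists so that each instance is tested exactly once."""
    indices = list(range(n_instances))
    for fold in range(k):
        test = indices[fold::k]  # every k-th instance, starting at this fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


folds = list(k_fold_indices(25, k=10))
for train, test in folds[:2]:
    print(len(train), "training /", len(test), "test instances")
```

The reported accuracy is then the share of correct predictions pooled over all test folds, so every instance contributes exactly one prediction.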
  • 19. J48 Algorithm ● Open source Java implementation of the C4.5 algorithm in the Weka data mining tools ● It creates a decision tree based on labelled input data ● The trees generated can be used for classification, and for this reason it is often called a statistical classifier
  • 20. Pros and Cons of J48 Pros ● Easier to interpret results ● Helps to visualise through a decision tree Cons ● Run-time complexity of the algorithm depends on the depth of the tree (i.e. the number of attributes in the dataset) ● Space complexity is large, as values need to be stored in arrays repeatedly.
  • 21. J48 - Using Default Parameters Number of Leaves : 811 Size of the tree : 1046
  • 22. J48 - Setting binarySplits parameter to True
  • 23. J48 - Setting unpruned parameter to True Number of Leaves : 3479 Size of the tree : 4214
  • 24.
  • 25. J48 - Setting unpruned and binarySplits parameters to True
  • 26. J48 - Observations ● we initially thought Education would be the most important factor in classifying income. ● the J48 tree (without binarization) has CapitalGain as its root node, instead of Education. ● It means CapitalGain contributes more towards income than we initially thought.
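Why one attribute ends up at the root can be illustrated with the gain ratio criterion that C4.5 uses to rank candidate splits. The Python sketch below handles nominal attributes only and omits C4.5's numeric thresholds and other refinements; the toy data and the names `entropy` and `gain_ratio` are hypothetical and do not reproduce the authors' tree.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on a nominal attribute, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): six instances, two candidate attributes.
labels = [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"]
has_capital_gain = ["yes", "yes", "no", "no", "no", "no"]
education = ["HS", "BSc", "HS", "BSc", "HS", "BSc"]

scores = {"has_capital_gain": gain_ratio(has_capital_gain, labels),
          "education": gain_ratio(education, labels)}
root = max(scores, key=scores.get)
print(scores, "-> root:", root)
```

In this toy data the capital-gain flag separates the classes perfectly while education carries no information, so the flag is chosen as the root.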
  • 27. IBk Classifier ● instance-based classifier ● k-nearest neighbors algorithm ● takes the nearest k neighbors to make decisions ● uses distance measures to get the nearest neighbors ○ chi-square distance, euclidean distance (used by IBk) ● can use distance weighting ○ to give more influence to nearer neighbors ○ 1/distance and 1-distance ● can be used for classification and regression ○ classification - the class value assigned is the one most common among the neighbors ○ regression - the value is the average of the neighbors
  • 28. Pros and Cons of IBk Pros ● easy to understand / implement ● performs well with enough representative data ● choice between attributes and distance measures Cons ● large search space ○ has to search the whole dataset to get the nearest neighbors ● curse of dimensionality ● must choose a meaningful distance measure
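The effect of 1/distance weighting can be seen in a minimal Python sketch of k-nearest-neighbor classification. It assumes numeric features and Euclidean distance, adds a small epsilon (an assumption) to avoid division by zero, and is not IBk's actual implementation; `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k=3, weighted=True):
    """Classify query from its k nearest neighbors in train, a list of (features, label)."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        distance = math.dist(features, query)
        # 1/distance weighting; the epsilon guards against a zero distance (assumption).
        votes[label] += 1.0 / (distance + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


# Toy one-dimensional data: one close neighbor of class A, two farther ones of class B.
train = [((1.0,), "A"), ((5.0,), "B"), ((6.0,), "B")]
query = (2.0,)
plain = knn_predict(train, query, k=3, weighted=False)
weighted = knn_predict(train, query, k=3, weighted=True)
print(plain, weighted)  # B A
```

Without weighting the two distant neighbors outvote the close one; with 1/distance weighting the close neighbor dominates, which is one reason weighting matters more as k grows.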
  • 29. Improving IBk
        ● ran the KNN algorithm (IBk) with different combinations of parameters
        Parameters                              Correct predictions   ROC area
        k = 1, no weighting (default)           84.9401 %             0.860
        k = 5, no weighting                     80.691 %              0.882
        k = 5, inverse-distance weighting       85.978 %              0.929
        k = 10, no weighting                    81.0323 %             0.887
        k = 10, inverse-distance weighting      86.5422 %             0.939
        k = 10, similarity weighting            81.6244 %             0.892
        k = 50, inverse-distance weighting      86.8905 %             0.948
        k = 100, inverse-distance weighting     86.6397 %             0.947
  • 30. IBk - Observations ● larger k gives better classification ○ up to a certain value of k (50) ○ using inverse-distance weighting improves accuracy greatly ● limitations ○ we used euclidean distance (not the best for nominal values in the dataset)
  • 31. Vote Classifier
        ● we combined our classifiers (Meta → Vote)
            ○ used average of probabilities
        Classifier                                 Accuracy     ROC area
        J48                                        85.3998 %    0.879
        IBk (k = 50, inverse-distance weighting)   86.8905 %    0.948
        Logistic                                   82.3837 %    0.905
        Vote                                       87.3084 %    0.947
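The "average of probabilities" combination rule can be sketched as follows: each classifier returns a probability per class, the probabilities are averaged, and the class with the highest average wins. The Python below is an illustration with made-up probabilities, not output from the authors' models; `vote_average` is a hypothetical helper.

```python
def vote_average(distributions):
    """Average per-class probabilities from several classifiers and pick the top class."""
    classes = distributions[0].keys()
    averaged = {c: sum(d[c] for d in distributions) / len(distributions)
                for c in classes}
    return max(averaged, key=averaged.get), averaged


# Hypothetical class probabilities for one instance from three classifiers.
j48 = {"<=50K": 0.60, ">50K": 0.40}
ibk = {"<=50K": 0.20, ">50K": 0.80}
logistic = {"<=50K": 0.45, ">50K": 0.55}

prediction, averaged = vote_average([j48, ibk, logistic])
print(prediction, averaged)
```

Here the first classifier alone would predict <=50K, but the averaged probabilities favour >50K, which shows how a confident classifier can outweigh a less certain one.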
  • 32. What We Have Done ● Developed a classifier for the Census Income Dataset ○ a lot of preprocessing ○ learned in detail about the J48 and KNN classifiers ● Developed a classifier with 87.3084 % accuracy and 0.947 ROC area ○ using Vote