1. Data Mining with WEKA
Census Income Dataset
(UCI Machine Learning Repository)
Hein and Maneshka
2. Data Mining
● non-trivial extraction of previously unknown and potentially useful information
from data by means of computers
● part of the machine learning field
● two types of machine learning:
○ supervised learning: to learn from labelled examples how to predict an output
■ regression: to predict real value(s) as output
■ classification: to map an instance of data to one of the predefined classes
○ unsupervised learning: to discover the internal structure of data
■ clustering: to group instances of data together based on some characteristics
■ association rule mining: to find relationships between attribute values in the data
3. Aim
● Perform data mining using WEKA
○ understanding the dataset
○ preprocessing
○ task: classification
8. Preprocessing - Balancing Class Distribution
● without balancing the class distribution, the classifiers perform badly on classes with fewer instances
9. Preprocessing - Balancing Class Distribution
Step 1: Apply the Resample filter
Filters→supervised→instance→Resample
Step 2: Set the biasToUniformClass parameter of
the Resample filter to 1.0 and click
‘Apply’
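The effect of Resample with biasToUniformClass = 1.0 can be sketched in plain NumPy. This helper is illustrative only (the function name and sizing choice are ours, not Weka's internals): it resamples with replacement so that every class ends up equally represented.

```python
import numpy as np

rng = np.random.default_rng(42)

def resample_uniform(X, y):
    """Resample with replacement so every class gets the same number of
    instances -- mimics Weka's Resample filter with biasToUniformClass=1.0.
    (Illustrative sketch; Weka's actual implementation differs in detail.)"""
    classes = np.unique(y)
    per_class = len(y) // len(classes)   # keep total size roughly constant
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

# toy skewed dataset: 90 instances of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
Xb, yb = resample_uniform(X, y)
print(np.bincount(yb))   # both classes now equally represented: [50 50]
```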
10. Preprocessing - Outliers
● Outliers in the data can skew results and mislead learning algorithms.
● Outliers can be removed in the following manner
11. Preprocessing - Removing Outliers
Step 1: Select the InterquartileRange filter
Filters→unsupervised→attribute→InterquartileRange → Apply
Result: creates two new attributes, Outlier and
ExtremeValue, at attribute positions
14 and 15 respectively
12. Preprocessing - Removing Outliers
Step 2: a) Select another filter, RemoveWithValues
Filters→unsupervised→instance→RemoveWithValues
b) Click on the filter to edit its parameters.
Set attributeIndex to 14 and nominalIndices to 2,
since only instances with Outlier set to yes need to be
removed.
13. Preprocessing - Removing Outliers
Result: removes all outliers from the dataset
Step 3: Remove the Outlier and ExtremeValue attributes from the dataset
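The interquartile-range rule behind steps 1–3 can be sketched in a few lines of NumPy. The helper below is illustrative (our own function, not Weka's code); it flags a row as an outlier when its value falls outside [Q1 − factor·IQR, Q3 + factor·IQR], the same idea the InterquartileRange filter uses:

```python
import numpy as np

def remove_iqr_outliers(X, col, factor=3.0):
    """Drop rows whose value in column `col` lies outside
    [Q1 - factor*IQR, Q3 + factor*IQR].  This mirrors the rule used by
    Weka's InterquartileRange filter (illustrative sketch, not Weka code)."""
    q1, q3 = np.percentile(X[:, col], [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - factor * iqr, q3 + factor * iqr
    mask = (X[:, col] >= lo) & (X[:, col] <= hi)
    return X[mask]

# toy column with one extreme value
X = np.array([[1.0], [2.0], [3.0], [2.5], [100.0]])
print(remove_iqr_outliers(X, col=0))   # the row containing 100.0 is dropped
```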
14. Preprocessing - Impact of Removing Outliers
● With outliers in dataset - 85.3302 % correctly classified instances
● Without outliers in dataset - 84.3549 % correctly classified instances
Since the percentage of correctly classified instances was greater for the
dataset with outliers, that version was selected!
The reduced accuracy is likely due to the nature of our dataset: some
attributes (e.g. capital-gain) have very skewed distributions.
15. Preprocessing
● Our preprocessing recap
○ removed fnlwgt, edu-num attributes
○ removed duplicate instances
○ filled in missing values
○ grouped some attribute values for education
○ rebalanced class distribution
● size of dataset: 14356 instances
16. Performance of Classifiers
● simplest measure: rate of correct predictions
● confusion matrix: table of true/false positive and negative counts (TP, FP, FN, TN)
● Precision: how many positive predictions are correct (TP / (TP + FP))
● Recall: how many actual positives are caught (TP / (TP + FN))
● F-Measure: considers both precision and recall
(2 × precision × recall / (precision + recall))
17. Performance of Classifiers
● kappa statistic: chance-corrected accuracy measure (should be greater than 0; 0 means no better than chance)
● ROC Area: the larger the area, the better the classifier (should be greater than 0.5, which is random guessing)
● Error rates: useful for regression
○ predictions are real values, not just right or wrong
○ these reflect the magnitude of errors
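Both the kappa statistic and the regression error rates can be computed directly from first principles. The sketch below (our own helper functions and example counts, for a two-class case) shows kappa as observed accuracy corrected by chance agreement, plus the two standard magnitude-sensitive error rates:

```python
import math

def kappa(tp, fp, fn, tn):
    """Cohen's kappa from a 2x2 confusion matrix: observed accuracy
    corrected by the accuracy expected from chance alone."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # chance agreement: product of marginal proportions, summed per class
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

def error_rates(actual, predicted):
    """Mean absolute error and root mean squared error -- the
    magnitude-sensitive measures used for regression."""
    diffs = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

print(kappa(tp=45, fp=5, fn=15, tn=35))     # 0.6: well above chance
print(error_rates([3.0, 5.0], [2.0, 7.0]))  # MAE 1.5, RMSE ~1.581
```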
18. Developing Classifiers
● ran algorithms with default parameters
● test parameter: cross-validation 10 fold
● preprocessed dataset
Algorithm    Accuracy
J48          83.6305 %
JRip         82.0075 %
NaiveBayes   76.5464 %
IBk          84.9401 %
Logistic     82.3837 %
● chose J48 and IBk classifiers to
develop further.
● IBk is the best performing.
● J48 is very fast, the second best, and very
popular.
19. J48 Algorithm
● Open-source Java implementation of the C4.5 algorithm in the Weka data
mining tool
● It builds a decision tree from labelled input data
● The generated trees can be used for classification, and for this reason it is
called a statistical classifier
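The core of how C4.5 (and thus J48) grows its tree is choosing, at each node, the attribute whose split most reduces class entropy. A minimal sketch of that criterion (our own toy labels; J48 additionally normalises the gain into a gain ratio):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, groups):
    """Information gain of splitting `labels` into `groups`.
    (C4.5/J48 divides this further by the split's own entropy
    to obtain the gain ratio.)"""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# toy example: income class labels before and after splitting on an attribute
labels = ['<=50K'] * 6 + ['>50K'] * 4
split = [['<=50K'] * 5 + ['>50K'] * 1,
         ['<=50K'] * 1 + ['>50K'] * 3]
print(round(info_gain(labels, split), 3))   # 0.256 bits gained by this split
```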
20. Pros and Cons of J48
Pros
● Results are easy to interpret
● The decision tree helps with visualisation
Cons
● Runtime complexity depends on the depth of the tree (i.e. the number of
attributes in the dataset)
● Space complexity is large, as values need to be stored in arrays repeatedly
21. J48 - Using Default Parameters
Number of Leaves : 811
Size of the tree : 1046
26. J48 - Observations
● we initially thought Education would be the most important factor in
classifying income.
● the J48 tree (without binarization) has CapitalGain as its root node, instead
of Education.
● this means CapitalGain contributes more towards income than we initially
thought.
27. IBk Classifier
● instance-based classifier
● k-nearest neighbors algorithm
● takes nearest k neighbors to make decisions
● uses distance measures to find the nearest neighbors
○ e.g. chi-square distance, Euclidean distance (used by IBk)
● can use distance weighting
○ to give more influence to nearer neighbors
○ 1/distance and 1-distance
● can use for classification and regression
○ classification - output is the class value most common among the neighbors
○ regression - output is the average of the neighbors’ values
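The whole procedure fits in a few lines; the sketch below (our own helper, with made-up training points) classifies a query by Euclidean distance with optional inverse-distance weighting, as in IBk's "Weight by 1/distance" option:

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=3, weighted=True):
    """Classify `query` from `train` = [(features, label), ...] using
    Euclidean distance; with `weighted`, each of the k nearest neighbours
    votes with weight 1/distance (illustrative sketch of IBk's behaviour;
    a zero distance is crudely given weight 1.0 here)."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in train
    )[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / d if weighted and d > 0 else 1.0
    return max(votes, key=votes.get)

train = [((1.0, 1.0), 'A'), ((1.2, 0.9), 'A'),
         ((5.0, 5.0), 'B'), ((5.2, 4.8), 'B')]
print(knn_predict(train, (1.1, 1.0), k=3))   # 'A'
```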
28. Pros and Cons of IBk
Pros
● easy to understand / implement
● performs well given enough representative data
● choice between attributes and distance measures
Cons
● large search space
○ have to search whole dataset to get nearest neighbors
● curse of dimensionality
● must choose meaningful distance measure
29. Improving IBk
ran the IBk (k-NN) algorithm with different combinations of parameters
Parameters                               Correct Predictions   ROC Area
IBk (k = 1, no weight) [default]         84.9401 %             0.860
IBk (k = 5, no weight)                   80.691 %              0.882
IBk (k = 5, inverse-distance weight)     85.978 %              0.929
IBk (k = 10, no weight)                  81.0323 %             0.887
IBk (k = 10, inverse-distance weight)    86.5422 %             0.939
IBk (k = 10, similarity-weighted)        81.6244 %             0.892
IBk (k = 50, inverse-distance weight)    86.8905 %             0.948
IBk (k = 100, inverse-distance weight)   86.6397 %             0.947
30. IBk - Observations
● larger k gives better classification
○ up to a certain value of k (50 here)
○ using inverse distance weighting improves accuracy greatly
● limitations
○ we used Euclidean distance (not the best choice for the nominal values in the dataset)
31. Vote Classifier
● we combined our classifiers with the Vote meta-classifier
○ used average of probabilities as the combination rule
Classifier                              Accuracy    ROC Area
J48                                     85.3998 %   0.879
IBk (k = 50, inverse-distance weight)   86.8905 %   0.948
Logistic                                82.3837 %   0.905
Vote                                    87.3084 %   0.947
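"Average of probabilities" simply means averaging each classifier's class-probability estimates and picking the class with the highest mean. A sketch with made-up probabilities (note how averaging can even overrule a simple majority vote, because it weighs confidence):

```python
def vote_average(prob_lists):
    """Combine class-probability estimates from several classifiers by
    averaging them (illustrative sketch of Vote's 'average of
    probabilities' rule), then pick the highest-probability class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / n_models for i in range(n_classes)]
    return max(range(n_classes), key=lambda i: avg[i]), avg

# hypothetical [P(<=50K), P(>50K)] estimates from three classifiers
# for one instance: two models lean towards class 1, one is very sure of 0
probs = [[0.90, 0.10], [0.45, 0.55], [0.40, 0.60]]
winner, avg = vote_average(probs)
print(winner, [round(a, 3) for a in avg])   # 0 [0.583, 0.417]
```

Here a majority vote would have picked class 1 (two of three models), but the confident first model shifts the averaged probabilities towards class 0.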
32. What We Have Done
● Developed a classifier for the Census Income Dataset
○ a lot of preprocessing
○ learned in detail about the J48 and k-NN classifiers
● Developed a classifier with 87.3084 % accuracy and 0.947 ROC area
○ using Vote