1. Predictive Modeling of Income
Levels based on Demographic
and Employment Features
Presented By:
Areeb Ansari
DATA SCIENCE PROJECT, DECEMBER ‘23
LEARNBAY
2. Agenda
Objective
Data
Methods
Artificial Neural Network
Normal Bayes Classifier
Decision Trees
Boosted Trees
Random Forest
Results
Comparisons
Observations
3. Objective
Analyze census data to identify trends.
The prediction task is to determine whether a person makes over 50K a year.
Compare the accuracy and run time of different machine learning algorithms.
4. Data
• 48842 instances (train = 32561, test = 16281)
• 45222 if instances with unknown values are
removed (train = 30162, test = 15060)
• Duplicate or conflicting instances : 6
• 2 classes : >50K, <=50K
• Probability for the label '>50K' : 23.93% / 24.78%
(without unknowns)
• 14 attributes : both continuous and discrete-valued.
5. Data Dictionary
• Age
• Work-class
• Final_census
• Education
• Education_num
• Marital Status
• Occupation
• Relationship
• Race
• Gender
• Capital-gain
• Capital-loss
• Hours/week
• Country
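The attributes above match the UCI Adult census data. Below is a minimal sketch of loading the splits and removing instances with unknown values, assuming pandas and the standard adult.data / adult.test files; the column spellings follow the dictionary above and are an assumption, since the slides do not name the tooling used.

```python
import pandas as pd

# Column names following the data dictionary above (spellings assumed;
# the raw UCI Adult files carry no header row).
COLUMNS = [
    "age", "workclass", "final_census", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "country", "income",
]

def load_census(path):
    """Load one split and drop instances with unknown ('?') values."""
    df = pd.read_csv(path, header=None, names=COLUMNS,
                     skipinitialspace=True, na_values="?", comment="|")
    return df.dropna()  # train: 32561 -> 30162, test: 16281 -> 15060

train = load_census("adult.data")
test = load_census("adult.test")
```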
7. Artificial Neural Network
• Sigmoid function is used as the squashing
function.
• No. of Layers = 3
• 256 nodes in first layer. Second and third
layers have 10 nodes each.
• Terminate if the no. of epochs exceeds 1000 or the rate of change of the
network weights falls below 10⁻⁶.
• Learning rate = 0.1
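A minimal sketch of this network with scikit-learn's MLPClassifier, assuming the three layers above are hidden layers; the slides do not name the library actually used, so the parameter names here are scikit-learn's.

```python
from sklearn.neural_network import MLPClassifier

# Sigmoid squashing function, three layers of 256 / 10 / 10 nodes
# (interpreted here as hidden layers), learning rate 0.1, and stopping
# after 1000 epochs or when the improvement falls below 1e-6.
ann = MLPClassifier(
    hidden_layer_sizes=(256, 10, 10),
    activation="logistic",   # sigmoid
    solver="sgd",
    learning_rate_init=0.1,
    max_iter=1000,
    tol=1e-6,
)
# ann.fit(X_train, y_train)
```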
8. Normal Bayes Classifier
• The classifier assumes that:
• Features are fairly independent in nature
• the attributes are normally distributed.
• It is not necessary for the attributes to be independent, but the classifier
yields better results if they are.
• Data distribution function is assumed to be a
Gaussian mixture – one component per class.
• Training data → mean vectors and covariance matrices estimated for every
class → prediction.
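Two hedged scikit-learn analogues of this classifier (the slides do not name the implementation): GaussianNB makes the per-feature independence assumption, while QuadraticDiscriminantAnalysis is closer to the description above of a mean vector and covariance matrix estimated per class.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Per-class Gaussians with independent features (naive Bayes) ...
nb = GaussianNB()

# ... or a full mean vector and covariance matrix per class, as described above.
qda = QuadraticDiscriminantAnalysis(store_covariance=True)

# nb.fit(X_train, y_train); qda.fit(X_train, y_train)
```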
9. Decision Trees
• Regression trees partition continuous values.
• Maximum depth of tree = 25
• Minimum sample count = 5
• Maximum no. of categories = 15
• No. of cross-validation folds = 15
• CART (Classification and Regression Tree) is used as the tree algorithm:
  - rules for splitting data at a node based on the value of a variable
  - stopping rules for deciding on terminal nodes
  - prediction of the target variable for terminal nodes
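A sketch of this tree using scikit-learn's CART-based DecisionTreeClassifier as an assumed stand-in for whatever implementation was used; the category limit on the slide has no direct equivalent here.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# CART tree: depth capped at 25, at least 5 samples required to split a node.
tree = DecisionTreeClassifier(max_depth=25, min_samples_split=5)

# 15-fold cross-validation, as on the slide.
# scores = cross_val_score(tree, X_train, y_train, cv=15)
```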
10. Boosted Trees
• Real AdaBoost algorithm has been used.
• Misclassified events → reweight them → build & optimize a new tree with the
reweighted events → score each tree → use the tree scores as weights and
average over all the trees.
• Weak classifier: a classifier with an error rate only slightly better than
random guessing. No. of weak classifiers used = 10.
• Trim rate: threshold to eliminate samples with boosting weight < 1 - trim rate.
Trim rate used = 0.95.
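A hedged sketch with scikit-learn's AdaBoostClassifier and 10 weak learners; Real AdaBoost specifically, and the 0.95 trim rate for dropping low-weight samples, are options of other libraries (e.g. OpenCV's boosted trees) with no direct switch here.

```python
from sklearn.ensemble import AdaBoostClassifier

# 10 weak classifiers (the default base learner is a depth-1 decision stump,
# i.e. a classifier only slightly better than random guessing), reweighted
# and combined as described above.
boost = AdaBoostClassifier(n_estimators=10)
# boost.fit(X_train, y_train)
```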
11. Random Forest
• Another Ensemble Learning Method
• Collection of tree predictors : forest
• At first, it grows many decision trees.
• To classify a new object from an input vector:
1. It is classified by each of the trees in the forest.
2. The mode (majority vote) of the classes is chosen.
• All the trees are trained with the same
parameters but on different training sets
12. Random Forest (contd.)
• No. of variables randomly selected at each node and used to find the best
split = 4
• Maximum no. of trees in the forest = 100
• Forest accuracy = 0.01
• Terminate if the no. of iterations exceeds 50 or the error percentage
exceeds 0.1
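A sketch of the forest with these parameters using scikit-learn's RandomForestClassifier (an assumption about the library); the termination rule above is approximated here only by the fixed tree count and the out-of-bag score.

```python
from sklearn.ensemble import RandomForestClassifier

# Up to 100 trees, 4 variables considered at each split, out-of-bag score tracked.
forest = RandomForestClassifier(n_estimators=100, max_features=4, oob_score=True)
# forest.fit(X_train, y_train); forest.oob_score_
```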
13. Results

Unknown data included

Method         | Correct classifications | Wrong classifications | Class 0 false positives | Class 1 false positives | Time | Accuracy
Neural Network | 13734 | 2547 | 1339 | 1208 | 719 | 0.84356
Normal Bayes   | 13335 | 2946 | 1968 |  978 |   3 | 0.819053
Decision Tree  | 13088 | 3193 | 1022 | 2171 |   5 | 0.803882
Boosted Tree   | 13487 | 2794 | 1628 | 1166 | 285 | 0.828389
Random Forest  | 13694 | 2587 |  864 | 1723 |  51 | 0.841103

Unknown data excluded

Method         | Correct classifications | Wrong classifications | Class 0 false positives | Class 1 false positives | Time | Accuracy
Neural Network | 12711 | 2349 | 1804 |  545 | 545 | 0.844024
Normal Bayes   | 12226 | 2834 | 1945 |  889 |   3 | 0.811819
Decision Tree  | 12017 | 3043 |  983 | 2060 |   4 | 0.797942
Boosted Tree   | 12260 | 2800 | 1510 | 1290 | 221 | 0.814077
Random Forest  | 12621 | 2439 |  850 | 1589 |  48 | 0.838048
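The Accuracy column is simply correct classifications divided by all test instances; a quick check against the Neural Network rows above:

```python
# Unknowns included: 16281 test instances; unknowns excluded: 15060.
print(13734 / (13734 + 2547))   # 0.84356...  (matches the first table)
print(12711 / (12711 + 2349))   # 0.844024... (matches the second table)
```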
15. Observations
Removing irrelevant attributes improves accuracy (curse of dimensionality).
Some attributes seemed to have little relevance to salary, for example Race and Gender.
Removing these attributes improves accuracy by 0.21% for Decision Trees (see the sketch below).
For Random Forest, accuracy improves by 0.33%.
For Boosted Trees, accuracy falls slightly, by 0.12%.
For ANN, accuracy improves by 1.12%.
Bayes Classifier: removing correlated attributes improves accuracy.
Education_num is highly correlated with Education; removing Education_num improves accuracy by 0.83%.
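A minimal sketch of the attribute removal described above, reusing the (assumed) column names from the loading sketch earlier; the models would then be retrained on the reduced frames to reproduce the quoted accuracy changes.

```python
# Drop the low-relevance attributes, and separately the attribute that is
# correlated with education (column names assumed from the data dictionary).
reduced = train.drop(columns=["race", "gender"])        # little relevance to income
reduced_nb = train.drop(columns=["education_num"])      # correlated with education
```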
Constructing a model in this framework requires making several choices:
• The shape of the decision to use in each node
• The type of predictor to use in each leaf
• The splitting objective to optimize in each node
• The method for injecting randomness into the trees
In the case of regression, the classifier response is the average of the responses over all the trees in the forest, as sketched below.
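A small sketch illustrating that last point: for regression, averaging the per-tree responses of a scikit-learn RandomForestRegressor (an assumed implementation, with hypothetical toy data) reproduces the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
reg = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# The forest's regression response is the average of the individual tree responses.
per_tree = np.stack([t.predict(X[:5]) for t in reg.estimators_])
assert np.allclose(per_tree.mean(axis=0), reg.predict(X[:5]))
```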