Predictive Modeling of Income
Levels based on Demographic
and Employment Features
Presented By:
Areeb Ansari
DATA SCIENCE PROJECT, DECEMBER ‘23
LEARNBAY
Agenda
• Objective
• Data
• Methods
  • Artificial Neural Network
  • Normal Bayes Classifier
  • Decision Trees
  • Boosted Trees
  • Random Forest
• Results
• Comparisons
• Observations
CSC 7333 - Dr. Jianhua Chen 2
Objective
• Analysis of census data to determine certain trends
• Prediction task is to determine whether a person makes over 50K a year.
• Analyze the accuracy and run time of different machine learning algorithms
Data
• 48842 instances (train = 32561, test = 16281)
• 45222 if instances with unknown values are
removed (train = 30162, test = 15060)
• Duplicate or conflicting instances : 6
• 2 classes : >50K, <=50K
• Probability for the label '>50K' : 23.93% / 24.78%
(without unknowns)
• 14 attributes : both continuous and discrete-valued.
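A minimal sketch of how the "unknowns removed" counts could be produced. It assumes unknown values are marked with "?" (the slides do not state the marker) and that each record is already parsed into a list of strings. The helper names and the sample rows are hypothetical.

```python
UNKNOWN = "?"  # assumed marker for a missing value; not stated on the slide


def drop_unknowns(rows):
    """Keep only the records in which no field equals the unknown marker."""
    return [row for row in rows if UNKNOWN not in (field.strip() for field in row)]


def positive_rate(rows, label_index=-1, positive=">50K"):
    """Fraction of records whose label is the positive class."""
    hits = sum(1 for row in rows if row[label_index].strip() == positive)
    return hits / len(rows)


# Tiny made-up sample: age, work-class, label.
sample = [
    ["39", "State-gov", "<=50K"],
    ["50", "?", ">50K"],
    ["38", "Private", ">50K"],
    ["53", "Private", "<=50K"],
]
clean = drop_unknowns(sample)
print(len(sample), len(clean))   # record counts before and after filtering
print(positive_rate(clean))      # share of '>50K' labels after filtering
```

Running the same two helpers on the full train and test files would give the instance counts and class probabilities with and without unknowns.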
Data Dictionary
• Age
• Work-class
• Final_census
• Education
• Education_num
• Marital Status
• Occupation
• Relationship
• Race
• Gender
• Capital-gain
• Capital-loss
• Hours/week
• Country
Data SnapShot
Artificial Neural Network
• Sigmoid function is used as the squashing
function.
• No. of Layers = 3
• 256 nodes in first layer. Second and third
layers have 10 nodes each.
• Terminate if the no. of epochs exceeds 1000 or the rate of change of network weights falls below 10^-6.
• Learning rate = 0.1
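The sigmoid squashing function and the two stopping conditions can be sketched as below. This is an illustration only: the function names and the list-of-lists weight layout are assumptions, and the training loop itself (backpropagation) is omitted.

```python
import math

MAX_EPOCHS = 1000          # epoch limit from the slide
MIN_WEIGHT_CHANGE = 1e-6   # weight-change threshold from the slide


def sigmoid(x):
    """Squashing function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Output of one fully connected layer with sigmoid activations.

    weights[j] holds the incoming weights of node j (assumed layout).
    """
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def should_stop(epoch, weight_change):
    """Stop when the epoch budget is exceeded or the weights have settled."""
    return epoch > MAX_EPOCHS or weight_change < MIN_WEIGHT_CHANGE


# Two inputs feeding a layer of two nodes; both weighted sums are 0 here.
print(layer_forward([1.0, 2.0], [[0.0, 0.0], [1.0, -0.5]], [0.0, 0.0]))
```

Stacking three such layers (256, 10, and 10 nodes) gives the network shape described above.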
Normal Bayes Classifier
• The classifier assumes that:
  • Features are fairly independent in nature
  • The attributes are normally distributed.
• It is not necessary for the attributes to be independent, but the classifier yields better results if they are.
• Data distribution function is assumed to be a
Gaussian mixture – one component per class.
• Training data → Mean vectors and covariance matrices for every class → Predict
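The "estimate per-class Gaussian, then predict" flow can be sketched for a single continuous attribute. This is a one-dimensional simplification, so the mean vector becomes a mean and the covariance matrix becomes a variance. The function names and the sample data are hypothetical.

```python
import math
from statistics import mean, pvariance


def fit(values, labels):
    """Estimate prior, mean, and variance of the attribute for every class."""
    model = {}
    for cls in set(labels):
        xs = [x for x, y in zip(values, labels) if y == cls]
        model[cls] = (len(xs) / len(values), mean(xs), pvariance(xs))
    return model


def log_gaussian(x, mu, var):
    """Log density of a normal distribution (var must be greater than 0)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def predict(model, x):
    """Pick the class with the largest log prior plus log likelihood."""
    def score(cls):
        prior, mu, var = model[cls]
        return math.log(prior) + log_gaussian(x, mu, var)
    return max(model, key=score)


# Made-up hours-per-week values with labels 0 (<=50K) and 1 (>50K).
hours = [20, 30, 35, 40, 45, 50, 55, 60]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit(hours, labels)
print(predict(model, 25), predict(model, 58))
```

With all 14 attributes, the same idea uses a full covariance matrix per class, which is what lets the classifier cope with attributes that are not independent.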
Decision Trees
• Regression trees partition continuous values
• Maximum depth of tree = 25
• Minimum sample count = 5
• Maximum no. of categories = 15
• No. of cross-validation folds = 15
• CART (Classification and Regression Tree) is used as the tree algorithm:
  • Rules for splitting data at a node based on the value of a variable
  • Stopping rules for deciding on terminal nodes
  • Prediction of the target variable for terminal nodes
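A minimal sketch of a CART-style splitting rule on one continuous variable. It assumes Gini impurity as the split criterion (the slides do not name the criterion) and uses hypothetical function names and made-up data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means the node is pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, gini(labels))  # no split unless it lowers the impurity
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


ages = [22, 25, 30, 41, 45, 52]
labels = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]
print(best_threshold(ages, labels))
```

A full tree applies this search to every variable at every node and stops when a stopping rule fires, for example the maximum depth or minimum sample count listed above.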
Boosted Trees
• Real AdaBoost algorithm has been used.
• Misclassified events → Reweight them → Build & optimize new tree with reweighted events → Score each tree → Use tree scores as weights and average over all trees
• Weak classifier → a classifier with an error rate slightly better than random guessing.
  • No. of weak classifiers used = 10
• Trim rate → Threshold to eliminate samples with boosting weight < 1 - trim rate.
  • Trim rate used = 0.95
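The reweighting step can be sketched as follows. Note the swap: this uses the discrete AdaBoost update rather than the Real AdaBoost variant named on the slide, because it shows the idea in the fewest lines. The function name is hypothetical, and weight trimming is not shown.

```python
import math


def reweight(weights, labels, predictions):
    """One boosting round: return (tree_score, new_weights).

    labels and predictions use -1 / +1; weights must sum to 1 and the
    weighted error must lie strictly between 0 and 1.
    """
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    # Score of this tree: larger when its weighted error is smaller.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples (y * p == -1) are scaled up, the rest scaled down.
    raw = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, labels, predictions)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]  # the second sample is misclassified
alpha, new_weights = reweight(weights, labels, predictions)
print(alpha, new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next tree is pushed to get it right.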
Random Forest
• Another Ensemble Learning Method
• Collection of tree predictors : forest
• At first, it grows many decision trees.
• To classify a new object from an input vector:
1. It is classified by each of the trees in the forest
2. Mode of the classes is chosen.
• All the trees are trained with the same
parameters but on different training sets
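The voting step and the "different training sets" idea can be sketched as below. It assumes the training sets are bootstrap samples (drawn with replacement), which the slides do not state, and it uses simple stand-in functions in place of trained trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement: one training set per tree."""
    return [rng.choice(rows) for _ in rows]


def forest_predict(trees, x):
    """Classify x with every tree and return the most common class (the mode)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Stand-in "trees": each one thresholds hours per week at a different value.
trees = [
    lambda hours: ">50K" if hours > 40 else "<=50K",
    lambda hours: ">50K" if hours > 45 else "<=50K",
    lambda hours: ">50K" if hours > 50 else "<=50K",
]
print(forest_predict(trees, 47))

rng = random.Random(0)  # fixed seed so the sample is repeatable
print(bootstrap_sample([1, 2, 3, 4, 5], rng))
```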
Random Forest (contd.)
• No. of variables randomly selected at node and
used to find best split(s) = 4
• Maximum no. of trees in the forest = 100
• Forest accuracy = 0.01
• Terminate if no. of iterations exceed 50 or error
percentage exceeds 0.1
Results

Unknown data included
Method | Correct Classification | Wrong Classification | Class 0 false positives | Class 1 false positives | Time | Accuracy
Neural Network | 13734 | 2547 | 1339 | 1208 | 719 | 0.84356
Normal Bayes | 13335 | 2946 | 1968 | 978 | 3 | 0.819053
Decision Tree | 13088 | 3193 | 1022 | 2171 | 5 | 0.803882
Boosted Tree | 13487 | 2794 | 1628 | 1166 | 285 | 0.828389
Random Forest | 13694 | 2587 | 864 | 1723 | 51 | 0.841103

Unknown data excluded
Method | Correct Classification | Wrong Classification | Class 0 false positives | Class 1 false positives | Time | Accuracy
Neural Network | 12711 | 2349 | 1804 | 545 | 545 | 0.844024
Normal Bayes | 12226 | 2834 | 1945 | 889 | 3 | 0.811819
Decision Tree | 12017 | 3043 | 983 | 2060 | 4 | 0.797942
Boosted Tree | 12260 | 2800 | 1510 | 1290 | 221 | 0.814077
Random Forest | 12621 | 2439 | 850 | 1589 | 48 | 0.838048
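The Accuracy column follows directly from the correct and wrong counts. A short check, using the figures from the "unknown data included" table (the function name is hypothetical):

```python
def accuracy(correct, wrong):
    """Share of test records classified correctly."""
    return correct / (correct + wrong)


# (correct, wrong) pairs copied from the 'unknown data included' table.
included = {
    "Neural Network": (13734, 2547),
    "Normal Bayes": (13335, 2946),
    "Decision Tree": (13088, 3193),
    "Boosted Tree": (13487, 2794),
    "Random Forest": (13694, 2587),
}
for method, (correct, wrong) in included.items():
    print(f"{method:15s} {accuracy(correct, wrong):.6f}")
```

Each pair sums to 16281, the size of the test set, and the wrong count equals the sum of the two false-positive columns.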
Comparisons (unknown data included)
[Charts comparing Neural Network, Normal Bayes, Decision Tree, Boosted Tree, and Random Forest on: Accuracy, Time, Class 0 false positives, Class 1 false positives]
Observations
• Removing non-relevant attributes improves accuracy (Curse of Dimensionality)
• Some attributes seemed to have little relevance to salary. For example: Race, Gender.
  • Removing these attributes improves accuracy by 0.21% in decision trees.
  • For Random Forest, accuracy improves by 0.33%
  • For Boosted Trees, accuracy falls slightly by 0.12%
  • For ANN, accuracy improves by 1.12%
• Bayes Classifier: removing correlated attributes improves accuracy.
  • Education_num is highly related to Education. Removing education_num improves accuracy by 0.83%
Thank you!!!



Editor's Notes

  1. Regression accuracy is 1
  2. Constructing a model in this framework requires making several choices. The shape of the decision to use in each node. The type of predictor to use in each leaf. The splitting objective to optimize in each node. The method for injecting randomness into the trees. In case of a regression, the classifier response is the average of the responses over all the trees in the forest.