3. Executive Summary
• After data preparation and partitioning, three models are built:
one each in SAS Studio, EM, and DataRobot
• The same test dataset is scored by these models
• The model built in EM has the best performance
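The partition-then-score workflow above can be sketched as follows. This is a minimal illustration rather than the project's actual procedure: the 70/30 split, the fixed seed, the accuracy metric, the made-up rows, and the majority-class stand-in model are all assumptions, since the slides do not state them.

```python
import random

def partition(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, test_rows):
    """Share of test rows whose Income label the model predicts correctly."""
    hits = sum(predict(row) == row["Income"] for row in test_rows)
    return hits / len(test_rows)

# Made-up rows for illustration only; not taken from the Income dataset.
rows = [{"Age": 20 + i, "Income": ">50K" if i % 4 == 0 else "<=50K"}
        for i in range(100)]
train, test = partition(rows)

# Hypothetical stand-in model: always predict the majority class.
def baseline(row):
    return "<=50K"

score = accuracy(baseline, test)
```

Scoring every model on the same `test` list is what makes the comparison fair: any difference in `score` then comes from the model, not from the rows it happened to be evaluated on.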
4. Introduction
• Can we predict Income level based on age, gender, education, etc.?
• What is my income level after I graduate?
5. Purpose
• Identify the best predictive model for the Income dataset
• Predict my Income level
• Practice data preparation, model building, and model assessment
6. Data Selection
• The Income dataset was originally extracted from the 1994 Census Bureau database
• Downloaded from Kaggle.com
• Reasons for choosing it:
• The target variable, Income, is categorical
• Medium size: 10+ columns and 30K+ rows
• Used in Macro and DataRobot projects
7. Exploration
• Data explored in SAS Studio
• 32,561 observations
• 15 variables: 6 Num, 9 Char
• Num: Age Capitalgain Capitalloss Weekhour Edunum Fnlwgt
• Char: Income Relationship Education Occupation Sex Marital
Workclass Race Nativecountry
• Target: Income (“>50K”, “<=50K”)
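The exploration itself was done in SAS Studio; the same kind of profile (row count, Num/Char split, target levels) can be sketched in Python as below. The three-row sample and the helper names `is_number` and `profile` are hypothetical and only stand in for the real 32,561-row file.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample; the real file has 32,561 rows and 15 columns.
SAMPLE = """Age,Workclass,Education,Income
39,State-gov,Bachelors,<=50K
50,Self-emp-not-inc,Bachelors,<=50K
52,Private,HS-grad,>50K
"""

def is_number(text):
    """True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def profile(rows, target="Income"):
    """Return (row count, {column: 'Num' or 'Char'}, target level counts)."""
    types = {}
    for col in rows[0]:
        # A column is numeric only if every value in it parses as a number.
        types[col] = "Num" if all(is_number(r[col]) for r in rows) else "Char"
    levels = Counter(r[target] for r in rows)
    return len(rows), types, levels

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
n_obs, col_types, target_levels = profile(rows)
```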
25. Preparation & Transformations
• Solutions:
• Imputing missing values using subject-matter knowledge:
impute missing values for Workclass and Occupation with “Unemployed”
• Imputing missing values using the mode:
impute missing values for Nativecountry with “United-States”
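The two imputation strategies above can be sketched as follows. This assumes missing values are coded as "?", which the slides do not state, and uses four made-up rows; the function names are hypothetical.

```python
from collections import Counter

MISSING = "?"  # assumption: missing values are coded as "?"

def impute_constant(rows, column, value):
    """Replace missing entries in `column` with a fixed value (in place)."""
    for row in rows:
        if row[column] == MISSING:
            row[column] = value

def impute_mode(rows, column):
    """Replace missing entries in `column` with its most frequent observed value."""
    observed = [row[column] for row in rows if row[column] != MISSING]
    mode = Counter(observed).most_common(1)[0][0]
    impute_constant(rows, column, mode)
    return mode

# Made-up rows for illustration only.
rows = [
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "United-States"},
    {"Workclass": "?", "Occupation": "?", "Nativecountry": "United-States"},
    {"Workclass": "Private", "Occupation": "Tech-support", "Nativecountry": "?"},
    {"Workclass": "State-gov", "Occupation": "Adm-clerical", "Nativecountry": "Mexico"},
]

# Subject-matter imputation: a fixed label chosen by the analyst.
for col in ("Workclass", "Occupation"):
    impute_constant(rows, col, "Unemployed")

# Mode imputation: the label is derived from the data.
mode = impute_mode(rows, "Nativecountry")
```

The mode is computed from the observed values only, so the missing-value code itself can never be chosen as the replacement.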
26. Preparation & Transformations
• Solutions:
• Converting Capitalgain and Capitalloss from Num to Char
• Binning multi-level variables: Education, Marital, Workclass
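Both transformations can be sketched as below. The slides do not give the actual conversion rule or bin definitions, so the two-level "None"/"Some" scheme and the Education grouping are hypothetical examples, not the mapping used in the project.

```python
def amount_to_char(amount):
    """Map a numeric amount to a category label (hypothetical two-level scheme)."""
    return "None" if amount == 0 else "Some"

# Hypothetical grouping; the slides do not list the actual bins.
EDUCATION_BINS = {
    "Preschool": "No-HS-diploma",
    "9th": "No-HS-diploma",
    "11th": "No-HS-diploma",
    "HS-grad": "HS-grad",
    "Some-college": "Some-college",
    "Bachelors": "Bachelors",
    "Masters": "Graduate",
    "Doctorate": "Graduate",
}

def bin_level(value, mapping, default="Other"):
    """Collapse a detailed level into its bin; unmapped levels fall into `default`."""
    return mapping.get(value, default)

# Made-up row for illustration only.
row = {"Capitalgain": 2174, "Capitalloss": 0, "Education": "Masters"}
row["Capitalgain"] = amount_to_char(row["Capitalgain"])
row["Capitalloss"] = amount_to_char(row["Capitalloss"])
row["Education"] = bin_level(row["Education"], EDUCATION_BINS)
```

The same `bin_level` helper would serve Marital and Workclass with their own mapping dictionaries.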
28. Preparation & Transformations
• Reasons for dropping variable Fnlwgt:
• It is a weight from the Current Population Survey files, not original Census data
• It showed near-zero importance in last week's DataRobot project
29. Preparation & Transformations
• Reasons for not binning variable Occupation:
• 15 levels
• No sound criterion for grouping its levels
• Reasons for not binning variables Race and Relationship:
• 5-6 Levels
• Each level is meaningful
50. Options and Recommendations
• Factors which may cause these differences:
• Dropping variable Fnlwgt
• Reducing levels
• Variable transformation: Capitalgain Capitalloss
• These changes increase speed but decrease model performance
51. Options
• Using DataRobot to build models without handling “data issues”
• Keep iterating in SAS Studio
52. Summary
• We can predict Income level based on these characteristics
• For the Income dataset, DataRobot is the most robust tool for building models
• Be aware of unexpected outcomes from data preparation
• Iterate back and forth until the result is satisfactory