Data preprocessing techniques
See my Paris applied psychology conference paper here
https://www.slideshare.net/jasonrodrigues/paris-conference-on-applied-psychology
or
https://prezi.com/view/KBP8JnekVH9LkLOiKY3w/
Class lecture by Prof. Raj Jain on Big Data. The talk covers Why Big Data Now?, Big Data Applications, ACID Requirements, Terminology, Google File System, BigTable, MapReduce, MapReduce Optimization, Story of Hadoop, Hadoop, Apache Hadoop Tools, Apache Other Big Data Tools, Other Big Data Tools, Analytics, Types of Databases, Relational Databases and SQL, Non-relational Databases, NewSQL Databases, and Columnar Databases. A video recording is available on YouTube.
● Experience in creating aggregates, hierarchies, filters, quick filters, and calculated measures.
● Merged multiple data sources by joining multiple tables and using data blending.
● Experience in creating interactive dashboards using Actions.
● Eliminated unwanted data in worksheets using filters.
● Designed and deployed reports with drill-up and drop-down menu options, as well as parameterized and linked reports, using Tableau.
● Experience in creating derived fields using calculated functions and parameters in Tableau.
● Used context filters to improve the performance of Tableau reports.
● Designed a report that displays the bottom five customers dynamically using sets.
● Worked on the development of dashboards of key performance indicators for top management.
● Experience in developing, testing, and deploying Tableau reporting solutions using Tableau Server.
● Scheduled reports in Tableau Server.
Uncertainty & Probability
Bayes' rule
Choosing Hypotheses: Maximum a Posteriori
Maximum Likelihood: Bayes' Concept Learning
Maximum Likelihood of a Real-Valued Function
Bayes Optimal Classifier
Joint distributions
Naive Bayes Classifier
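The listed concepts (Bayes' rule, MAP hypothesis selection, the Naive Bayes classifier) can be tied together with a short sketch. This is illustrative code, not taken from the presentation; it implements a Naive Bayes classifier over categorical features and returns the maximum a posteriori class.

```python
from collections import Counter, defaultdict

# Illustrative Naive Bayes for categorical features (a sketch, not the
# presentation's code): P(class | x) is proportional to
# P(class) * product over i of P(x_i | class).
def train(rows, labels):
    priors = Counter(labels)            # class counts
    cond = defaultdict(Counter)         # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    return priors, cond, len(labels)

def predict(row, priors, cond, n):
    best, best_p = None, -1.0
    for y, c in priors.items():
        p = c / n                       # prior P(y)
        for i, v in enumerate(row):
            counts = cond[(i, y)]
            # simple Laplace-style smoothing to avoid zero probabilities
            p *= (counts[v] + 1) / (c + len(counts) + 1)
        if p > best_p:
            best, best_p = y, p
    return best                         # maximum a posteriori (MAP) class
```

With real data one would multiply in log space to avoid numerical underflow; the structure of the computation is the same.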
This presentation briefly discusses the following topics:
Data Analytics Lifecycle
Importance of Data Analytics Lifecycle
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communicate Results
Phase 6: Operationalize
Data Analytics Lifecycle Example
Data Analytics PowerPoint Presentation Slides - SlideTeam
This complete deck is designed to make sure you do not lag in your presentations. Our creatively crafted slides come with apt research and planning. This exclusive twenty-slide deck is here to help you strategize, plan, analyse, or segment the topic with clear understanding. Utilize ready-to-use presentation slides on Data Analytics PowerPoint Presentation Slides with all sorts of editable templates, charts and graphs, overviews, and analysis templates. It is usable for marking important decisions and covering critical issues. Display and present all possible kinds of underlying nuances and progress factors for an all-inclusive presentation for the teams. This presentation deck can be used by all professionals, managers, and individuals, and by internal and external teams in any company or organization.
Best Data Science Ppt using Python
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning, and big data.
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio - Marina Santini
attribute selection, constructing decision trees, decision trees, divide and conquer, entropy, gain ratio, information gain, machine learning, pruning, rules, surprisal
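The entropy, information gain, and gain ratio measures named in this lecture's keywords can be computed in a few lines. This is an illustrative sketch, not the lecture's own code:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from splitting the labels by an attribute's values."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the split's own entropy (split info)."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0
```

A perfectly informative binary split of a balanced binary label set yields entropy 1 bit, information gain 1, and gain ratio 1; the gain-ratio normalization penalizes attributes with many distinct values.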
This presentation gives an idea of Data Preprocessing in the field of Data Mining. Images, examples, and other material are adapted from "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber, and Jian Pei.
The Nested process coordinates strategy, planning and execution. Market Segment Analysis and Product Definition are explicitly aligned. This is a plan for planning.
Predictive Planning is an element of Oracle EPM's focus on Intelligent Performance Management, which is automating as much as possible in order to free up humans to do the real thinking. Predictive Planning is advanced statistical forecasting made easy and tightly integrated into EPBCS. It includes methods such as linear regression, exponential smoothing, and seasonality. For each forecast, it tests many different techniques and creates a forecast using the best one. You might use the results as your primary forecast, you might use them as your forecast seed, or you might use them to compare to and validate human-made forecasts. You don’t need a PhD in statistics. In fact, it’s a good way to learn more about statistical forecasting techniques (aka data science).
So how exactly does it work, and how can you use it to improve your forecasts? This presentation provides a quick overview of the statistical techniques and error measures. It identifies some potential use cases from finance, sales, and HR. Finally, it digs into some examples of how to set up and implement the cube. This presentation is intended for EPBCS admins and developers, as well as Finance, Sales, and HR planners who want to improve their forecasting and analytics.
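The slides do not disclose Oracle's actual selection logic, but the general "fit several techniques on history, score them on a holdout, keep the winner" pattern can be sketched as follows. The method set, window, alpha, and holdout length here are illustrative assumptions, not Oracle's implementation:

```python
# Hypothetical sketch of best-technique forecast selection; the methods
# and parameters are illustrative, not Oracle Predictive Planning's.
def naive_forecast(series):
    return series[-1]                          # repeat the last observation

def moving_average(series, window=3):
    return sum(series[-window:]) / window      # mean of the last `window` points

def exp_smoothing(series, alpha=0.5):
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level  # exponentially weighted level
    return level

def best_forecast(series, holdout=3):
    """Score each method one-step-ahead on a holdout; return (winner, forecast)."""
    train, test = series[:-holdout], series[-holdout:]
    methods = {"naive": naive_forecast,
               "moving_avg": moving_average,
               "exp_smooth": exp_smoothing}
    scores = {}
    for name, f in methods.items():
        history, errors = list(train), []
        for actual in test:                    # backtest on the holdout
            errors.append(abs(f(history) - actual))
            history.append(actual)
        scores[name] = sum(errors) / len(errors)   # mean absolute error
    winner = min(scores, key=scores.get)
    return winner, methods[winner](series)
```

A real tool would include richer methods (seasonality, regression) and error measures such as RMSE or MAPE, but the select-by-backtest-error structure is the same.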
Continuing our board meeting theme from last week, in this week's webinar, our own Gainsight Admin, Will Robins, dives into how he leveraged Gainsight's capabilities to prepare the charts and graphs we used in our most recent board meeting.
AVATA Webinar: Solutions to Common Demantra & ASCP Challenges - AVATA
As a leading provider of SCP solutions with a 15-year focus on Oracle Supply Chain Solutions, join AVATA as we examine the most common challenges when implementing and configuring Oracle’s Demantra and ASCP planning solutions.
AVATA is adding to its express solutions suite with “IBP express”, a hosted service offering that provides the framework for conducting the S&OP/IBP process with supporting dashboard reports and KPIs. IBP express allows for rapid deployment, enabling your first S&OP/IBP cycle within 90 days.
IBP express is both a technology tool and service offering that supports advancing your current S&OP process or implementing S&OP/IBP for the first time. IBP express includes the required Education, Workshops, Coaching & Technology that will deliver a rapid ROI.
Dorman’s Journey towards Integrated Demand Planning leveraging SAP APO DP and... - Mitesh Verma
Presentation from the ASUG Fall Focus Conference 2017 held in Chicago.
Learn how Dorman Products transformed their demand planning process and associated analytics leveraging SAP Advanced Planning and Optimization (APO) Demand Planning (DP) and SAP HANA as the enterprise data warehouse. In this session, we will share how Dorman Products partnered with Bristlecone to develop an integrated demand planning process in SAP APO-DP and supporting analytics leveraging the power of SAP HANA.
An OBIEE Success Story: How a Regional Utility Created Visibility in Supply Chain provides an overview of a project utilizing both OBIEE and Business Intelligence Analytics products. The project’s goal was to provide timely data and reporting to Supply Chain to aid strategic decision making. The result was a reduction in overall operational costs, performance and productivity tracking, inventory management in partnership with business operations, and the initiation of basic governance practices for the data within the Oracle E-Business Suite.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Techniques to optimize the PageRank algorithm usually fall into two categories: one tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
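As a rough illustration of one optimization above, here is a minimal power-iteration PageRank that stops recomputing a vertex once its rank change falls below tolerance. This is a heuristic sketch, not the report's code: freezing a vertex while its in-neighbors still change introduces a small approximation, and the graph is assumed to have no dead ends.

```python
# Minimal PageRank with converged-vertex skipping (illustrative sketch).
# `graph` is an adjacency list {vertex: [out-neighbors]} with no dead ends.
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=100):
    verts = list(graph)
    n = len(verts)
    rank = {v: 1.0 / n for v in verts}
    inlinks = {v: [] for v in verts}           # reverse adjacency for pulls
    for u, outs in graph.items():
        for v in outs:
            inlinks[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new = {}
        for v in verts:
            if v in converged:
                new[v] = rank[v]               # skip work for converged vertices
                continue
            s = sum(rank[u] / len(graph[u]) for u in inlinks[v])
            new[v] = (1 - damping) / n + damping * s
            if abs(new[v] - rank[v]) < tol:
                converged.add(v)               # freeze this vertex from now on
        rank = new
        if len(converged) == n:
            break
    return rank
```

On a 3-cycle every vertex keeps rank 1/3, so the whole graph converges on the first sweep; on large graphs the savings come from the long tail of vertices that settle early.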
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures the stolen assets are flagged as scam transactions, making it far harder for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
3. Executive Summary
• After data preparation and partitioning, three models are built in SAS Studio, EM, and DataRobot
• The same test dataset is scored by these models
• The model built in EM has the best performance
4. Introduction
• Can we predict Income level based on age, gender, education, etc.?
• What is my income level after I graduate?
5. Purpose
• Figure out the best predictive model for Income dataset
• Predict my Income level
• Practice skills in preparing data, building models, and assessing models
6. Data Selection
• The Income dataset was originally extracted from the 1994 Census Bureau database
• Downloaded from Kaggle.com
• Reasons for choosing it:
• The target variable, Income, is a categorical variable
• Medium size: 10+ columns and 30K+ rows
• Used in Macro and DataRobot projects
7. Exploration
• Using SAS Studio to explore the data
• 32,561 observations
• 15 variables: 6 Num, 9 Char
• Num: Age Capitalgain Capitalloss Weekhour Edunum Fnlwgt
• Char: Income Relationship Education Occupation Sex Marital Workclass Race Nativecountry
• Target: Income (“>50K”, “<=50K”)
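The same exploration can be reproduced outside SAS Studio. A small pandas sketch (the CSV file name is an assumption; the column names follow the slide's variable list):

```python
import pandas as pd

# Summarize a dataset the way the slide does: observation count plus
# numeric vs. character variable counts (the slides use SAS Studio;
# this is an equivalent pandas check).
def summarize(df):
    numeric = df.select_dtypes(include="number").columns
    character = df.select_dtypes(exclude="number").columns
    return {"observations": len(df),
            "numeric": len(numeric),
            "character": len(character)}

# summarize(pd.read_csv("adult.csv"))  # file name assumed; per the slide
# this should report 32,561 observations, 6 numeric and 9 character variables
```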
25. Preparation & Transformations
• Solutions:
• Imputing missing values using subject-matter knowledge:
impute missing values for Workclass and Occupation with “Unemployed”
• Imputing missing values using the mode:
impute missing values for Nativecountry with “United-States”
26. Preparation & Transformations
• Solutions:
• Converting Capitalgain and Capitalloss from Num to Char
• Binning multiple-level variables: Education Marital Workclass
28. Preparation & Transformations
• Reasons for dropping variable Fnlwgt:
• It is the weight on the Current Population Survey files, not original Census data
• It showed near-zero importance in last week’s DataRobot project
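The preparation steps from slides 25 through 28 can be sketched in pandas (the original work was done in SAS). The num-to-char conversion is shown here as a has-gain/has-loss flag, which is one plausible reading of the slide; the Education/Marital/Workclass binning is omitted because the slides do not give the bins:

```python
import pandas as pd

# Pandas sketch of the preparation steps in slides 25-28; details are
# assumptions where noted, not the project's actual SAS code.
def prepare(df):
    df = df.copy()
    # Slide 25: subject-matter imputation for Workclass/Occupation,
    # mode imputation for Nativecountry (its mode is "United-States")
    df["Workclass"] = df["Workclass"].fillna("Unemployed")
    df["Occupation"] = df["Occupation"].fillna("Unemployed")
    df["Nativecountry"] = df["Nativecountry"].fillna(df["Nativecountry"].mode()[0])
    # Slide 26: convert Capitalgain/Capitalloss from Num to Char -- shown
    # here as a has-gain / has-loss flag (one plausible conversion)
    df["Capitalgain"] = (df["Capitalgain"] > 0).map({True: "Y", False: "N"})
    df["Capitalloss"] = (df["Capitalloss"] > 0).map({True: "Y", False: "N"})
    # Slide 28: drop the Current Population Survey weight
    return df.drop(columns=["Fnlwgt"])
```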
29. Preparation & Transformations
• Reasons for not handling variable Occupation:
• 15 levels
• No sound criterion for grouping them
• Reasons for not handling variables Race and Relationship:
• 5-6 levels
• Each level is meaningful
50. Options and Recommendations
• Factors which may cause these differences:
• Dropping variable Fnlwgt
• Reducing levels
• Variable transformation: Capitalgain Capitalloss
• These increase speed but decrease model performance
51. Options
• Use DataRobot to build models without handling “data issues”
• Keep trying in SAS Studio
52. Summary
• We can predict Income level based on these characteristics
• For the Income dataset, DataRobot is the most robust tool for building models
• Be aware of unexpected outcomes during data preparation
• Iterate back and forth until reaching an ideal result