Project presentation slides

College Scorecard
Predicting Earnings To Debt Ratio
Emdadul Haque and Derek Atwood

Data Description
College Scorecard data: https://www.kaggle.com/kaggle/college-scorecard
● Data collected from 1996 - 2013
● 2009 dataset chosen for completeness and recency
● 7149 observations / 1484 features
● Each observation corresponds to a unique College
● Features related to demographics, cost of attendance, proportion of students
receiving financial aid, earnings multiple years after matriculation, etc

Data Description
● Lots of missing data!
● Some information not reported by specific Colleges
● Some information suppressed for privacy

Data Processing
● Variables with >15% of observations missing were removed
● Response variable created as a ratio of median earnings six years after
matriculation vs. median debt
● For each variable, missing values were replaced with the median of non-missing
values
● Highly correlated and low variance variables were removed
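
The missing-data filter, median imputation, and response ratio described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pipeline: the function names, the column-as-list layout, and the use of None to mark a missing value are all assumptions made for the example.

```python
from statistics import median

def drop_sparse_columns(columns, max_missing=0.15):
    """Keep only columns whose share of missing (None) values is at most max_missing."""
    kept = {}
    for name, values in columns.items():
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share <= max_missing:
            kept[name] = values
    return kept

def impute_median(columns):
    """Replace each None with the median of that column's non-missing values.

    Assumes every column still has at least one observed value, which holds
    after drop_sparse_columns with a threshold below 100%.
    """
    filled = {}
    for name, values in columns.items():
        fill = median(v for v in values if v is not None)
        filled[name] = [fill if v is None else v for v in values]
    return filled

def earnings_to_debt(median_earnings, median_debt):
    """Row-wise ratio; None where either input is missing or debt is zero (an assumption)."""
    return [
        None if e is None or d is None or d == 0 else e / d
        for e, d in zip(median_earnings, median_debt)
    ]
```

Applying the filter before imputation keeps the median from being estimated on columns that are mostly empty.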

Data Processing
● Outliers diagnosed and removed (~0.5% of response variable)

Analysis
● Originally we intended to use data from 2009 to predict earnings to debt ratio for
2011
● Predictors with low amounts of missing values in 2009 had large amounts of
missing values in 2011, and vice versa
● Final data consisted of 5130 observations and 223 predictors
● 2009 data split into training (70%) and testing (30%) sets
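
A 70/30 split like the one above might look like this in outline. The shuffle-then-cut approach and the fixed seed are assumptions; the slides do not say how rows were assigned to each set.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# With the 5130 observations reported above:
train, test = train_test_split(range(5130))
print(len(train), len(test))
```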

Methodology
Linear Model:
● Poor performance (negative predicted ratios)
Lasso:
● Exploratory lasso model selected ~120-130 variables for various iterations
● Models resulted in MSE of ~0.45 (R2 ~0.65)
Principal Component Analysis:
● No single predictor explained a significant percentage of variance
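
The two metrics quoted above are linked: on a given test set, R2 equals 1 minus the residual sum of squares divided by the total sum of squares, which is the same as 1 minus MSE over the variance of the response. A small sketch of both computations, with illustrative function names:

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_total = sum((t - mean_y) ** 2 for t in y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - ss_residual / ss_total
```

Because R2 is scaled by the spread of the response, it is easier to compare across data sets than raw MSE.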

Random Forest Explained
● Ensemble learning method that aggregates regression trees
● A subset of the total predictors is used to build each tree
● + Handles large numbers of variables without deletion
● + Runs efficiently on large data sets
● + Inherently handles interactions between variables
● - Loss of interpretability
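
The mechanism can be illustrated with a deliberately simplified sketch. Each "tree" here is a one-split regression stump, whereas a real random forest grows much deeper trees, so this shows only the aggregation idea: bootstrap the rows, give each tree a random subset of predictors, and average the predictions. All names are hypothetical; the defaults of 200 trees and one-half of the predictors mirror the final model reported on the next slide.

```python
import random
from statistics import mean

def fit_stump(X, y, features):
    """Find the single (feature, threshold) split with the lowest squared error."""
    best = None
    for f in features:
        # Every unique value except the largest leaves both sides non-empty.
        for threshold in sorted({row[f] for row in X})[:-1]:
            left = [t for row, t in zip(X, y) if row[f] <= threshold]
            right = [t for row, t in zip(X, y) if row[f] > threshold]
            lm, rm = mean(left), mean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:  # no usable split: fall back to the sample mean
        m = mean(y)
        return (None, None, m, m)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_mean, right_mean = stump
    if f is None:
        return left_mean
    return left_mean if row[f] <= threshold else right_mean

def fit_forest(X, y, n_trees=200, feature_fraction=0.5, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random predictor subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = max(1, round(p * feature_fraction))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample rows with replacement
        features = rng.sample(range(p), k)          # predictor subset for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def predict_forest(forest, row):
    """Aggregate by averaging the individual tree predictions."""
    return mean(predict_stump(stump, row) for stump in forest)
```

Averaging many trees that each see different rows and predictors reduces variance, which is also why individual splits stop being interpretable.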

Random Forest
Final Model:
● One-half of the total predictors used per tree
● Forest of 200 trees
● MSE of ~0.3 (R2 ~0.75)

Conclusion
● Missing data posed the greatest challenge to building an accurate model
● The data was notably unclean: redundant variables, missing factor levels, etc.
● A significant amount of data processing was required (~¾ of time spent)
● Imputing missing data with median values increased model performance
● The large amount of missing data likely sets an upper bound on the performance
of this model, but more data processing, feature engineering, and additional
tuning of parameters could result in more robust performance.
