College Scorecard
Predicting Earnings To Debt Ratio
Emdadul Haque and Derek Atwood
Data Description
College Scorecard data: https://www.kaggle.com/kaggle/college-scorecard
● Data collected from 1996 - 2013
● 2009 dataset chosen for completeness and recency
● 7149 observations / 1484 features
● Each observation corresponds to a unique college
● Features related to demographics, cost of attendance, proportion of students
receiving financial aid, earnings multiple years after matriculation, etc.
Data Description
● Lots of missing data!
● Some information not reported by specific Colleges
● Some information suppressed for privacy
Data Processing
● Variables with >15% of observations missing were removed
● Response variable created as a ratio of median earnings six years after
matriculation vs. median debt
● For each variable, missing values were replaced with the median of non-missing
values
● Highly correlated and low-variance variables were removed
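The processing steps above can be sketched in plain Python. This is a minimal illustration under assumptions, not the authors' code: the column names (`earnings_6yr`, `median_debt`, `pct_aid`, `sparse`), the use of `None` for missing values, and the choice to drop rows that lack earnings or debt are all hypothetical. The correlation and variance filters are omitted.

```python
from statistics import median

MISSING = None  # marker for a missing value in this sketch (assumption)

def drop_sparse_columns(rows, max_missing=0.15):
    """Keep only columns whose share of missing values is at most max_missing."""
    n = len(rows)
    keep = [
        col for col in rows[0]
        if sum(r[col] is MISSING for r in rows) / n <= max_missing
    ]
    return [{col: r[col] for col in keep} for r in rows]

def add_ratio(rows, earnings="earnings_6yr", debt="median_debt", out="ratio"):
    """Add the response (earnings / debt); rows lacking either value are dropped."""
    return [
        {**r, out: r[earnings] / r[debt]}
        for r in rows
        if r[earnings] is not MISSING and r[debt] not in (MISSING, 0)
    ]

def impute_median(rows):
    """Replace each missing value with the median of that column's non-missing values."""
    medians = {
        col: median(r[col] for r in rows if r[col] is not MISSING)
        for col in rows[0]
    }
    return [
        {col: medians[col] if r[col] is MISSING else r[col] for col in r}
        for r in rows
    ]

# Toy data: "pct_aid" is 1/8 missing (kept), "sparse" is 6/8 missing (dropped).
earnings = [30000, 40000, 24000, 36000, 28000, 45000, 33000, 27000]
debt = [15000, 10000, 12000, 12000, 14000, 15000, 11000, 9000]
pct_aid = [0.5, None, 0.7, 0.2, 0.9, 0.4, 0.6, 0.3]
sparse = [None, None, 1.0, None, None, None, 2.0, None]

rows = [
    dict(earnings_6yr=e, median_debt=d, pct_aid=p, sparse=s)
    for e, d, p, s in zip(earnings, debt, pct_aid, sparse)
]
clean = impute_median(add_ratio(drop_sparse_columns(rows)))
print(clean[1])
```

In practice a data-frame library would do the same work in a few calls; the plain-Python version is used here only to keep the sketch self-contained.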
Data Processing
● Outliers diagnosed and removed (~0.5% of observations, based on the response variable)
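The slides do not say how outliers were diagnosed. As one possible approach, the sketch below applies the common 1.5 × IQR fence rule to the response values; both the rule and the multiplier `k` are assumptions, and the toy ratios are made up for illustration.

```python
from statistics import quantiles

def remove_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (IQR fence rule, an assumption)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Made-up earnings-to-debt ratios with one extreme value.
ratios = [1.8, 2.0, 2.1, 2.2, 2.4, 2.5, 2.7, 3.0, 25.0]
kept = remove_outliers(ratios)
print(kept)
```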
Analysis
● Originally we intended to use data from 2009 to predict the earnings-to-debt ratio for
2011
● Predictors with few missing values in 2009 had many missing values in 2011, and
vice versa, so the analysis was restricted to 2009
● Final data consisted of 5130 observations and 223 predictors
● 2009 data split into training (70%) and testing (30%) sets
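A 70/30 split like the one described can be sketched as follows. The function name, the fixed seed, and the integer stand-ins for colleges are assumptions made for illustration; the slides do not say how the split was drawn.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

observations = list(range(5130))  # stand-ins for the 5130 observations
train, test = train_test_split(observations)
print(len(train), len(test))  # 3591 1539
```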
Methodology
Linear Model:
● Poor performance (negative predicted ratios)
Lasso:
● Exploratory lasso models selected ~120-130 variables across iterations
● Models resulted in MSE of ~0.45 (R² ~0.65)
Principal Component Analysis:
● No single principal component explained a significant percentage of variance
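For reference, the two metrics quoted for the models can be computed as below. This is a generic sketch with made-up numbers, not the authors' evaluation code. Since R² = 1 - MSE / Var(y) when both are measured on the same data, the quoted pairs are roughly consistent with a response variance a little above 1.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

# Made-up ratios and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mse(y_true, y_pred), r_squared(y_true, y_pred))
```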
Random Forest Explained
● Ensemble learning method that aggregates regression trees
● A subset of the total predictors is used to build each tree
● + Handles large numbers of variables without deletion
● + Runs efficiently on large data sets
● + Inherently captures interactions between variables
● - Loss of interpretability
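The idea can be sketched in a few dozen lines of plain Python. This is a toy illustration under assumptions, not the model used in the project: each tree gets its own random predictor subset, following the wording of the slide, whereas widely used implementations such as R's randomForest and scikit-learn draw the subset at each split. Bootstrap resampling of rows, the depth limit, all function and parameter names, and the synthetic data are likewise assumptions.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets, features):
    """Return the (feature, threshold) pair with the lowest squared error, or None."""
    best, best_err = None, float("inf")
    for f in features:
        for threshold in sorted({r[f] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[f] < threshold]
            right = [t for r, t in zip(rows, targets) if r[f] >= threshold]
            err = sse(left) + sse(right)
            if err < best_err:
                best, best_err = (f, threshold), err
    return best

def build_tree(rows, targets, features, depth):
    """Grow a regression tree; a leaf is the mean target, a node is a tuple."""
    split = best_split(rows, targets, features) if depth > 0 else None
    if split is None:
        return sum(targets) / len(targets)
    f, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] < threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] >= threshold]
    return (
        f,
        threshold,
        build_tree([r for r, _ in left], [t for _, t in left], features, depth - 1),
        build_tree([r for r, _ in right], [t for _, t in right], features, depth - 1),
    )

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf's mean target."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] < threshold else right
    return tree

def fit_forest(rows, targets, n_trees=200, feature_fraction=0.5, max_depth=4, seed=0):
    """Fit n_trees trees, each on a bootstrap sample and a random predictor subset."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features * feature_fraction))
    forest = []
    for _ in range(n_trees):
        features = rng.sample(range(n_features), k)       # per-tree predictor subset
        idx = [rng.randrange(len(rows)) for _ in rows]    # bootstrap sample (assumption)
        forest.append(
            build_tree([rows[i] for i in idx], [targets[i] for i in idx], features, max_depth)
        )
    return forest

def predict_forest(forest, row):
    """Aggregate the trees by averaging their predictions."""
    return sum(predict_tree(t, row) for t in forest) / len(forest)

# Synthetic data with an interaction term between predictors 1 and 2.
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(4)] for _ in range(60)]
targets = [2 * r[0] + r[1] * r[2] for r in rows]

forest = fit_forest(rows, targets, n_trees=25)
pred = predict_forest(forest, [0.5, 0.5, 0.5, 0.5])
print(round(pred, 3))
```

Averaging many trees is what makes the forest hard to interpret: no single set of split rules describes the prediction.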
Random Forest
Final Model:
One-half of the total predictors used per tree
Forest of 200 trees
MSE of ~0.3 (R² ~0.75)
Conclusion
● Missing data posed the greatest challenge to building an accurate model
● Data was decidedly unclean: redundant variables, missing factor levels, etc.
● Significant amount of data processing required (~¾ of time spent)
● Imputing missing data with median values increased model performance
● The large amount of missing data likely sets an upper bound on the performance
of this model, but more data processing, feature engineering, and additional
tuning of parameters could result in more robust performance.
Questions?
