Determining Student Return on Investment in College Education
1. ↗
Data Driven Decision Making
George Jreije, Alex Turgeon, Kyle Elkins,
Kai Tay, Daniel Hyland
Is College Worth
the Cost?
Determining Student Return on
Investment in College
Education
2. Introduction
↗ Problem
↗ High Tuition Costs & Uncertain Market Conditions
“What would be the earning potential of a student
attending a certain college?”
↗ Solution
↗ Build a prediction model using DOE college data
↗ Identify what factors affect income following
college graduation
↗ Evaluate the mean student earnings for
specific colleges
Introduction | Data | Assumptions | Process | Final Model | Application | Considerations
3. Data Breakdown
↗ Raw data consisted of 1,700 variables
⋄ Time Series Data: 19 years
↗ Consists of 10 main variable categories
⋄ Cost
⋄ Aid
⋄ Repayment
⋄ Completion
⋄ Earnings
⋄ Identifier
⋄ School Type
⋄ Academics
⋄ Admissions
⋄ Student
4. Initial Assumptions
↗ Assumption
↗ Existing literature and knowledge base
↗ Relevance of major/program, initial loans,
college attended
↗ The Fault
↗ Not familiar with dataset
↗ Missing observations
↗ Theoretical approach
5. Process | Data Cleaning
↗ Combined 6 years of data into one dataframe
↗ Dropped missing observations
↗ ~48,000 -> ~11,000 (rows)
↗ Dropped variables with missing observations
↗ ~1,700 -> ~300 (columns)
↗ Dropped categorical variables
↗ ~300 -> 89 (columns)
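The cleaning steps on this slide can be sketched in plain Python. The slides do not show the authors' code (the speaker notes mention R), so this is only an illustration under assumptions: rows are dictionaries, missing values are `None`, and the names `clean_rows` and `target` are hypothetical. The exact rules used to decide which rows and columns to drop are not stated; here rows missing the outcome are dropped first, then any column that still has a gap, then non-numeric columns.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def clean_rows(rows: List[Row], target: str) -> List[Row]:
    """Illustrative cleaning pass; the drop rules are assumptions, not the authors' code."""
    # Step 1: drop observations (rows) that are missing the outcome variable.
    kept = [r for r in rows if r.get(target) is not None]
    if not kept:
        return []
    # Step 2: drop variables (columns) that still contain a missing value.
    complete = [c for c in kept[0] if all(r.get(c) is not None for r in kept)]
    # Step 3: drop categorical variables by keeping only numeric columns.
    numeric = [
        c for c in complete
        if all(isinstance(r[c], (int, float)) and not isinstance(r[c], bool) for r in kept)
    ]
    return [{c: r[c] for c in numeric} for r in kept]


# Made-up toy rows, not taken from the DOE data.
sample = [
    {"earnings": 41000, "cost": 20000, "aid": None, "region": "NE"},
    {"earnings": None, "cost": 15000, "aid": 3000, "region": "SW"},
    {"earnings": 35000, "cost": 12000, "aid": 2500, "region": "MW"},
]
cleaned = clean_rows(sample, target="earnings")
print(cleaned)
```

Dropping whole columns on a single gap is a blunt rule; a tolerance threshold would keep more variables at the cost of fewer complete rows.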
6. Process | Test & Training
↗ Split Test & Training set
↗ Training: 2007-2012
⋄ ~8,500 observations
↗ Test: 2012-2013
⋄ ~2,700 observations
[Figure: timeline with training data from 2007 to 2012 and test data from 2012 to 2013]
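A year-based split like the one on this slide can be sketched as follows. This is an assumption-laden illustration, not the authors' code: the `year` key, the function name `split_by_year`, and the `first_test_year` parameter are hypothetical. The slide lists 2012 in both ranges, so the sketch assumes each file is labeled by the year it starts in and that the file starting in 2012 is the test set.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_by_year(rows: List[Row], first_test_year: int) -> Tuple[List[Row], List[Row]]:
    """Put rows before `first_test_year` in training and the rest in test."""
    train = [r for r in rows if r["year"] < first_test_year]
    test = [r for r in rows if r["year"] >= first_test_year]
    return train, test


# Made-up rows: one per starting year, 2007 through 2012.
rows = [{"year": y, "earnings": 30000 + 500 * i} for i, y in enumerate(range(2007, 2013))]
train, test = split_by_year(rows, first_test_year=2012)
print(len(train), len(test))
```

Splitting on time instead of at random keeps later data out of training, so the test error reflects how the model would do on a year it has not seen.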
7. Process | Stepwise
↗ Regression using stepwise selection (leapSeq)
↗ Identify most significant variables
↗ Confirmed our five variables were significant
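To make the idea of stepwise selection concrete, here is a minimal sketch in plain Python. It is not the R routine named on the slide: it is a simplified forward-only variant that adds, one at a time, the predictor that most reduces the residual sum of squares. The function names, column names, and toy numbers are all hypothetical.

```python
from typing import Dict, List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns: List[Sequence[float]], y: Sequence[float]) -> float:
    """Residual sum of squares of a least-squares fit with an intercept."""
    n = len(y)
    design = [[1.0] * n] + [list(c) for c in columns]
    k = len(design)
    xtx = [[sum(design[i][t] * design[j][t] for t in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(design[i][t] * y[t] for t in range(n)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[i] * design[i][t] for i in range(k)) for t in range(n)]
    return sum((y[t] - fitted[t]) ** 2 for t in range(n))


def forward_select(data: Dict[str, Sequence[float]], y: Sequence[float],
                   max_vars: int) -> List[str]:
    """Greedily add the predictor that lowers the RSS the most."""
    chosen: List[str] = []
    while len(chosen) < max_vars:
        remaining = [name for name in data if name not in chosen]
        if not remaining:
            break
        best = min(remaining, key=lambda name: rss([data[c] for c in chosen + [name]], y))
        chosen.append(best)
    return chosen


# Made-up toy data: y depends only on x1 and x3.
toy = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [3, 1, 4, 1, 5, 9],
    "x3": [1, 0, 1, 0, 1, 0],
}
y = [2 * a + 5 * c for a, c in zip(toy["x1"], toy["x3"])]
selected = forward_select(toy, y, max_vars=2)
print(selected)
```

A hybrid procedure would also try removing or swapping a previously chosen variable after each addition; the forward-only version is shown here only because it is shorter.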
8. Comparison of Models
↗ Theoretical Approach: 5 predictors
↗ Student loans
↗ Region
↗ College brand recognition
↗ Major/Program
↗ Type of degree
↗ Data-Driven Approach: 5 predictors
↗ Instructional Expenditure per student
↗ % of aided students with low family income
↗ Average family income of dependent students
↗ % of first-generation students
↗ % of students earning over $25k
9. Final Model
↗ Ran a multivariate regression on our model
↗ AIC of 143,166
↗ MSE of 12,672,173
↗ Test data
↗ AIC of 43,831
↗ MSE of 12,672,173
↗ MSE decreased to 59%
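For readers unfamiliar with the two metrics on this slide, the sketch below shows how MSE and AIC can be computed from actual and predicted values. The slides do not say which AIC definition was used; this assumes the common least-squares form n * ln(RSS / n) + 2k, which drops an additive constant. The function names and toy numbers are hypothetical.

```python
import math
from typing import Sequence


def mean_squared_error(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Average squared difference between actual and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def gaussian_aic(y: Sequence[float], y_hat: Sequence[float], n_params: int) -> float:
    """AIC for a least-squares fit, up to an additive constant (assumed definition)."""
    n = len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return n * math.log(rss / n) + 2 * n_params


# Made-up values, not the model output reported on the slide.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
mse_value = mean_squared_error(actual, predicted)
aic_value = gaussian_aic(actual, predicted, n_params=2)
print(mse_value, aic_value)
```

Because AIC grows with the number of observations, AIC values from sets of different sizes (such as a training set and a smaller test set) are not directly comparable; MSE is the fairer comparison there.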
10. Application
↗ Business-to-Enterprise Application
↗ Need to understand factors that contribute
to student success at institutions
↗ Identify strengths and areas to improve
↗ Remain competitive in a crowded market
↗ Higher Education
↗ Marketing & Communications
↗ Admissions
↗ Alumni Fundraising and Giving
11. Considerations | Data
↗ Data
↗ Information Bias
⋄ Schools have a natural incentive to report
success stories
⋄ Social work doesn't pay well, therefore
earnings numbers may be skewed or
suppressed when reported
↗ Inconsistent Source of Data
⋄ School-distributed student outcome surveys
have a response rate of only 15-20% on
average
12. Considerations | External
↗ External Factors
↗ Model does not account for external
environment and broader economic conditions
⋄ Unemployment Rate
⋄ Emerging Jobs and Shifting Industry
13. Conclusion
↗ Insights
↗ Being a first-generation student does lead to lower
earnings after college.
↗ Higher instructional expenditure on students and
family income of dependent students have no effect
on the mean earnings six years out.
↗ Not Applicable
↗ Does not account for external factors and
time-sensitive data
↗ Improvements
↗ Refine & standardize
data gathering method
George: We thought the initial loan would definitely have an impact, and that the brand ranking would too, which is what the literature said.
Kyle
Talk about how we had used Veera, an online data-blending software, to clean our data in the theoretical approach. After cleaning with Veera and plugging in the R code, we realized it wasn’t working within R.
We decided halfway through that this would not work, so we talked with the professor, and he recommended that we start from scratch with the data-driven approach, using numerical multivariate linear regression.
And here’s what we did afterwards (cue to slide).
Limited scale/time for categorical variables.
We had 6 data files, one for each year (2007-2013), and we combined them into one dataframe.
Kyle
We split our test and training sets by which year the data came from.
2007-2012 was our training data
Kyle
We used regression with stepwise selection (the hybrid method),
which helped identify which variables were the most significant in predicting mean earnings 6 years out of college.
This narrowed us down to 5 variables.
Kai
So the student loans data was not at the student level; we only had school-level data.
Region was actually usable.
College brand recognition: using U.S. News, we created a binary variable for recognition of the top 150 colleges.
Major and program has the same problem as student loans: we don’t know what the student got, just what was offered by the school.
Type of degree: what type of school (years of institution). The data doesn’t say what we wanted to measure.
Instructional expenditure: how much they spent on teaching (pay for professors divided by students).
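Two of the derived variables described in these notes are simple enough to sketch in code: the top-150 recognition flag and the per-student instructional expenditure. This is only an illustration; the function names, argument names, and example figures are hypothetical, and the actual columns the authors used are not given.

```python
from typing import Optional


def instructional_spend_per_student(instructional_spend: float, enrollment: int) -> float:
    """Teaching expenditure divided by the number of students."""
    if enrollment <= 0:
        raise ValueError("enrollment must be positive")
    return instructional_spend / enrollment


def top_ranked_flag(rank: Optional[int], cutoff: int = 150) -> int:
    """Return 1 if the school is ranked within the cutoff, else 0 (unranked counts as 0)."""
    return 1 if rank is not None and rank <= cutoff else 0


# Made-up example values.
spend = instructional_spend_per_student(12_000_000, 4000)
flags = [top_ranked_flag(42), top_ranked_flag(151), top_ranked_flag(None)]
print(spend, flags)
```

Both are school-level quantities, which matches the point in the notes that the data says nothing about individual students.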
Dan
Alex
Alex
Raw data consisted of 1,700 variables spanning 19 years