2. Objective
Inform how the U.S. Department of Labor (DOL) workforce
system can prioritize limited follow-up resources to those
who are most likely to have the most trouble finding or
keeping a job
Gain insights from machine learning on the characteristics
of workforce system clients that are most likely to be
unemployed 1 year after exiting workforce system
services
3. Method
Diagram: 107 predictors feed a tree algorithm that classifies each client as Unemployed or Employed
Fit 2 different “tree-based” machine learning algorithms
Decision Trees
Random Forests
Decision trees are relatively simple and quick but can
suffer from over-fitting (i.e., they fit the training
dataset too well and do not generalize to future data)
Random Forests are computationally intensive and more
complicated, but generalize better to future data
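The over-fitting contrast above can be illustrated with a minimal sketch. It assumes scikit-learn and uses synthetic data from make_classification as a stand-in for the client records; the sample size, feature count, and noise level are illustrative choices, not values from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the WIOA file); flip_y adds label noise.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# An unconstrained tree keeps splitting until it memorizes the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A forest averages many trees grown on bootstrap samples.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
forest_test_acc = forest.score(X_test, y_test)

print(f"tree   train={tree_train_acc:.2f} test={tree_test_acc:.2f}")
print(f"forest test={forest_test_acc:.2f}")
```

A large gap between the tree's training and test accuracy is the over-fitting symptom; limiting tree size (for example with a cap on leaf nodes) is one common way to narrow it.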
4. Data
1. DOL Performance Records
Administrative data with characteristics of individuals served,
workforce services provided, and employment outcomes for 4 quarters
after exit
PY 2018 Q2 WIOA Performance Records Public Use Data File
2. O*Net Skills Dataset
Mapping of occupation codes to 35 skills (e.g., reading comprehension,
active listening)
Every occupation is rated on each skill on a 0-5 scale
E.g., Chief Executives have a score of 4.88 on Active Listening and a
score of 0 on Equipment Maintenance.
1 Active Learning
2 Active Listening
3 Complex Problem Solving
4 Coordination
5 Critical Thinking
6 Equipment Maintenance
7 Equipment Selection
8 Installation
9 Instructing
10 Judgment and Decision Making
11 Learning Strategies
12 Management of Financial Resources
13 Management of Material Resources
14 Management of Personnel Resources
15 Mathematics
16 Monitoring
17 Negotiation
18 Operation Monitoring
19 Operation and Control
20 Operations Analysis
21 Persuasion
22 Programming
23 Quality Control Analysis
24 Reading Comprehension
25 Repairing
26 Science
27 Service Orientation
28 Social Perceptiveness
29 Speaking
30 Systems Analysis
31 Systems Evaluation
32 Technology Design
33 Time Management
34 Troubleshooting
35 Writing
5. Data Scope and Limitations
Data Scope
~1 million workforce system clients
Client’s most recent spell of service at the workforce center
Adults age 25 - 65 served by the workforce system from July 1, 2016
to December 31, 2018
50 states (excludes territories)
Employment outcomes available
Limitation: Lost about half of the raw data file when I merged in O*Net
skills ratings (due to missing values on the most recent occupation
variable)
Results may not generalize to the broader workforce system population
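One way to quantify the merge loss described above is a left join with a match indicator. This is a minimal sketch assuming pandas; the column names (client_id, occupation_code) and the second occupation's code and ratings are hypothetical, and only the Chief Executives ratings come from the slides.

```python
import pandas as pd

# Toy client records; None marks a missing most-recent occupation (hypothetical columns).
clients = pd.DataFrame({
    "client_id": [1, 2, 3, 4],
    "occupation_code": ["11-1011", None, "99-9999", None],
})

# Toy skills table: 11-1011 is Chief Executives (ratings from the slides);
# 99-9999 and its ratings are made up for illustration.
skills = pd.DataFrame({
    "occupation_code": ["11-1011", "99-9999"],
    "active_listening": [4.88, 2.50],
    "equipment_maintenance": [0.00, 1.00],
})

# A left join keeps every client; the _merge column records whether a skills row matched.
merged = clients.merge(skills, on="occupation_code", how="left", indicator=True)
match_rate = (merged["_merge"] == "both").mean()
unmatched = merged[merged["_merge"] == "left_only"]

print(f"share of clients with skill ratings: {match_rate:.0%}")
print(unmatched[["client_id", "occupation_code"]])
```

Comparing the unmatched rows against demographic fields is a simple starting point for checking whether the retained half differs systematically from the dropped half.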
6. Participant Trends
OUTCOME:
33% of participants are
unemployed 1 yr later
SELECT FEATURES
53% have only a high
school diploma/GED or
less
43% are White
28% are Black
Age range is diverse
Age 25-30: 20%
Age 31-40: 27%
Age 41-50: 24%
Age 51-65: 29%
7. Decision Tree: 15 Most Important Features
to predict unemployment
Age 51-65 (vs Age 25-30)
Education Level Less than HS (vs
having a BA or higher)
Duration (in days) of workforce system
service receipt
Living in LA, FL, MA, CT, NH, IN (vs
California)
Veteran status
Being long-term unemployed prior to
receiving workforce system services
Being Black or providing “no response”
to the race/ethnicity data element (vs
being White)
The following job skills: programming,
monitoring
8. Random Forest: 15 Most Important Features
to predict unemployment
Age 51-65 (vs Age 25-30)
Education Level Less than HS (vs
having a BA or higher)
Duration (in days) of workforce system
service receipt
Living in LA or FL (vs CA)
Providing “no response” to the
race/ethnicity data element (vs being
White)
The following job skills: programming,
monitoring, reading comprehension,
math, writing, operation monitoring
Being long-term unemployed prior to
receiving workforce system services
Veteran status
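Rankings like the two lists above can be produced from a fitted model's importance scores. The slides do not say how importance was computed, so this sketch assumes scikit-learn's impurity-based feature_importances_ attribute and uses synthetic data with placeholder feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; feature names are placeholders, not the study's predictors.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
top15 = importances.sort_values(ascending=False).head(15)
print(top15)
```

These scores rank how much each feature contributes to the splits; they do not indicate whether a feature raises or lowers the predicted chance of unemployment.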
9. Prediction Results
Model | Best Parameters | Accuracy on Validation Dataset (Share of Correctly Classified Cases) | Accuracy on Test Dataset (Share of Correctly Classified Cases)
Decision Tree | max # of leaf nodes = 100 | 67% | 67%
Random Forest | max depth of tree = 10; max features = 14; N estimators = 100 | 67% | 67%
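The reported best parameters could be set as follows. This sketch assumes the models were fit with scikit-learn, which the slides do not state; the parameter names max_leaf_nodes, max_depth, max_features, and n_estimators are the closest scikit-learn equivalents of the slide wording, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with enough columns for max_features=14.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           random_state=0)

# Decision tree: at most 100 leaf nodes.
tree = DecisionTreeClassifier(max_leaf_nodes=100, random_state=0).fit(X, y)

# Random forest: 100 trees, each at most 10 levels deep,
# considering 14 candidate features at every split.
forest = RandomForestClassifier(max_depth=10, max_features=14,
                                n_estimators=100, random_state=0).fit(X, y)

n_leaves = tree.get_n_leaves()
deepest = max(est.get_depth() for est in forest.estimators_)
print(f"tree leaves: {n_leaves}, deepest forest tree: {deepest}")
```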
10. Takeaways
Decision Tree and Random Forest models have the same
level of accuracy on the validation and test data.
However, they vary somewhat in the most important
predictive features
67% accuracy is not much better than a coin flip (50%)
Future research: Improve predictive power
Modify features (add interaction and higher order terms)
Restrict the data scope to a more homogeneous subset of the
workforce system clients, such as low-income adults age 30-40
in California.
Diagnose who is missing data on O*Net skill ratings
Try additional machine learning algorithms
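To put the 67% figure in context, it can be compared with two naive baselines: a coin flip and always predicting the more common outcome. The arithmetic below is a rough sketch that assumes the 33% unemployment share from the Participant Trends slide also holds in the test dataset; both percentages are rounded on the slides.

```python
share_unemployed = 0.33  # from the Participant Trends slide (rounded)
model_accuracy = 0.67    # reported validation and test accuracy (rounded)

coin_flip = 0.5
# Always predicting the more common class ("employed") is right for every employed client.
majority_baseline = max(share_unemployed, 1 - share_unemployed)

gain_over_coin_flip = model_accuracy - coin_flip
gain_over_majority = model_accuracy - majority_baseline

print(f"majority-class baseline: {majority_baseline:.2f}")
print(f"gain over coin flip:     {gain_over_coin_flip:.2f}")
print(f"gain over majority:      {gain_over_majority:.2f}")
```

Under that assumption the models are about level with the majority-class baseline, which supports the call for more predictive power and suggests reporting metrics beyond accuracy, such as recall for the unemployed class.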
12. Extra: Fitting a Machine Learning Model
1. Engineer the features: clean data and recode values as needed.
Convert categorical features into binary dummies.
2. Split data into “training” and “test” datasets. Hold the “test”
data in reserve until Step #5.
3. Try out a range of model parameters on the training dataset,
leveraging 5-fold cross-validation to create a more robust fit.
4. Pick the best model parameters based on prediction accuracy
(How well does the model trained on the “training dataset” predict
the outcomes in the “validation” dataset?)
5. Assess how well my model generalizes to unseen data. Evaluate
how accurately the model predicts the outcomes in the “test”
dataset.
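The five steps above can be sketched end to end. This is a minimal illustration assuming pandas and scikit-learn; the toy columns, random labels, and parameter grid are hypothetical and stand in for the real client data and search range.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600
# Toy client table with hypothetical columns; labels are random, for illustration only.
raw = pd.DataFrame({
    "age": rng.integers(25, 66, size=n),
    "education": rng.choice(["less_than_hs", "hs_or_ged", "ba_or_higher"], size=n),
    "veteran": rng.integers(0, 2, size=n),
})
y = pd.Series(rng.integers(0, 2, size=n), name="unemployed")

# Step 1: convert categorical features into binary dummies.
X = pd.get_dummies(raw, columns=["education"], dtype=int)

# Step 2: split into training and test data; the test data is held in reserve.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Steps 3-4: try a range of parameters with 5-fold cross-validation
# and keep the setting with the best validation accuracy.
param_grid = {"max_leaf_nodes": [10, 50, 100]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Step 5: evaluate the refit best model once on the held-out test data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, f"validation={search.best_score_:.2f}",
      f"test={test_accuracy:.2f}")
```

Because the labels here are random, accuracy should hover near 50%; the point is the workflow, not the score.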
13. Demographic & Socioeconomic Features
# | Predictor | Coding | Description
1 | Age | Continuous | Age in years at program entry
2 | Sex | Categorical | Male, female, neither. Omitted “male” for interpretability.
3 | Race/Ethnicity | Categorical | Hispanic, Asian (Not Hispanic), Black (Not Hispanic), Native Hawaiian/Pacific Islander/American Indian/Alaska Native (Not Hispanic), White (Not Hispanic), Multiple Race (Not Hispanic). Omitted “White” for interpretability.
4 | Education Level | Categorical | Less than HS, HS diploma or GED, some post-secondary, postsecondary technical or vocational certificate, Associate’s degree, Bachelor’s Degree or higher. Omitted “Bachelor’s Degree or higher” for interpretability.
5 | Veteran Status | Binary | Flag for veteran
6 | Low-income Status | Binary | Flag for low-income at entry
7 | English as a Second Language | Binary | Flag for English as a Second Language at entry
8 | Single Parent | Binary | Flag for single parent at entry
9 | Criminal History | Categorical | Yes, no, or refused to answer. Omitted “no” for interpretability.
10 | Long-term unemployed | Binary | Flag for being unemployed for 27 or more consecutive weeks
11 | Public Assistance Status | Binary | Flag for receipt of TANF, SNAP, SSI, or other reported assistance
12 | State | Categorical | State that submitted the participant data. Omitted CA for interpretability.
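Omitting a reference category, as the table describes for several categorical predictors, can be done by building all dummies and then dropping the chosen one. This is a minimal sketch assuming pandas; the column name race_ethnicity and the toy values are hypothetical.

```python
import pandas as pd

# Toy column with shortened category labels (hypothetical values).
df = pd.DataFrame({"race_ethnicity": ["White", "Black", "Hispanic", "White"]})

# Build one binary column per category, then drop the chosen reference category
# so the remaining dummies are read relative to "White".
dummies = pd.get_dummies(df["race_ethnicity"], prefix="race", dtype=int)
dummies = dummies.drop(columns="race_White")

print(dummies)
```

Rows in the reference category end up with zeros in every remaining dummy column, which is what makes comparisons such as "Black (vs being White)" readable.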
14. Workforce System Experience Features
# | Name | Coding | Description
13 | 2017 Exit Year | Binary | Exit year is 2017 rather than 2016
14 | Service Duration | Continuous | Cumulative days of service
15 | Number of Spells of Service Receipt | Continuous | Count of the number of cycles of “entry” and “exit” into workforce system services
Recent Occupation Skills Features
# | Name | Coding | Description
16 | Skill Rating (for each of the 35 skills) for the client’s most recent occupation | Continuous | Rating between 0 - 5