Data Con LA 2022-PaCMAP ensembles for occupational specializations in the California Cloud Workforce

PaCMAP Ensembles
Neal Fultz
neal@njnm.co
bit.ly/3vwx0W3

Executive Summary
● LACC system needed a way to identify
curricula gaps in Cloud Computing program
● By applying BERT, PaCMAP and elbow
grease, can ﬁnd the gaps.

Background: The Client
● CA Cloud Workforce Consortia (LACC)
● > 2,000 annual openings in LA County
● “industry standard skills to understand and
develop applications for the cloud”

Background: The Problem
● “Cloud classes” lag behind “Cloud jobs”
● More generally: how do we ﬁgure out if /
where there is a mismatch between what
industry needs and what is taught in
classrooms?

Data: Curriculum
● Course Outline of Record
○ Tentative schedule,
assignments, textbook
○ Course Objectives
○ Student Learning
Outcomes
● Programs are
sets of courses

Data: Jobs
● O*Net
○ by Dept of Labor
○ CC 4.0
○ Tasks
○ Work Activities
○ Wages & Growth

Idea: “Skill Space”
Soft Technical

“Interpretation”
● “Closeness”
● “Matches”
● “Misses”
Soft Technical

Implementation
1. Stock NN + ﬁne tune
2. Dimensionality Reduction
3. Hyperparameter tuning
4. Score & Aggregate

Hard Mode
Because the client has free devops in the form of
students, they wanted to make downstream
applications a class project.
=> Therefore, can’t use anything students can’t
access or ﬁgure out in ﬁnal deliverable.

Implementation pt 1: NN
● DistilBERT, a distilled version of BERT:
smaller, faster, cheaper and lighter. Sanh,
Debut, Julien, Wolf. Neurips 2019.
● “40% smaller, 60% faster, that retains 97% of
the language understanding capabilities.”
● Runs comfortably on typically student
workstation or in Collab. GPU optional.
“40% smaller” is still
768-dimensional

Implementation pt 2: DR
● Uniform Pairwise Controlled
Manifold Approximation Projection
○ Multi stage optimization with “Far neighbors”
○ Review paper is extremely good

Implementation Pt 2
● “Bleeding edge” - multiple breaking changes
during engagement, non-standard interfaces,
“interesting” defaults, etc
○ But devs @ Duke very responsive
● Based on spot checking, /very/ good at
consolidating redundant information in this
type of data set

Implementation Pt2
● Issue: Only “as good” as inputs.
● Solution: Leverage domain knowledge
○ Bloom’s Taxonomy
● Solution: go even wider and let PaCMAP strip
out redundancies

Implementation Pt 2
Level % Job Tasks
creating 16.3
evaluating 12.8
analyzing 7.9
applying 10.3
understanding 2.8
remembering 4.7

PaCMAP ensemble
DistilBERT
Bag of
Words
Bloom’s
Taxonomy
PaCMAP
Ensemble
Embedding

Implementation pt 2
● PaCMAP ensemble provides reasonable and
structured way to blend together three
diﬀerent NLP models
● Have to deal with extra complexity (as with all
ensembles)

Implementation pt 3: Tune
● Need to tune all component models +
PaCMAP
● Choose a good loss / metric:
○ “Stress”
○ “K-fold Stress”
○ “K-fold Spearman Stress”

● Choosing # of dimensions
○ In past, would use scree plots / intuition
○ Use Gavish & Donaho instead (270)
○ NB: that’s under linearity, PaCMAP can do as
good with fewer, treat as hyperparameter
Implementation pt 3: Tune

Implementation pt 4: Scoring
Now have this thing:
Note mismatch between what that is and what
the actual problem is. Need to distill to a metric
of “closeness”
Soft Technical

Soft Technical
a c
b
d
e
x
y
z x y z
a
b
c
d
e

● What to ﬁll in cells?
○ cos() - problems
○ Distances
probably ok
○ CDF
x y z
a
b
c
d
e

● How to aggregate?
○ mean: attenuation
○ mean(max())
○ Note: 2 scores
○ Forcibly
Symmetrize
x y z
a
b
c
d
e

“Design and conduct
hardware or software tests”

PaCMAP ensemble (scoring)
DistilBERT
Bag of
Words
Bloom’s
Taxonomy
PaCMAP
Ensemble
Embedding
Fine Tuning

Shoutouts?
Special thanks to:
● Salomon / ScopeWave
● Nancy / Santa Monica College
● Ankush, Jeremy, Rebecca / Handshake
● PaCMAP Team / Duke

Who Are You?
Neal Fultz, neal@njnm.co - data science and
machine learning consultant and recovering
software engineer. Primarily AdTech and
FinTech, but I do other things as well.

How did you find this Project?
After presenting at IDEAS 2017 (DTLA) on a
project I did with DataKind for University of
Wisconsin Parkside, another attendee
remembered me 4 years later and reached out.

What about The Program Level?
● BEWARE between-course sparsity.
● Concatenate sets (Course Outcome x Job
Task) similarity matrices and reaggregate.
● This allows diﬀerent programs to tune to
speciﬁc niches and specialties.

Future Work?
“Skill space” is generic, NNs are very ﬂexible:
● Determine if courses can transfer or substitute
● Resume generator, job recommender from
student transcripts
● Join to wages data, estimate ROI per course
● Identify “missing Bloom level”

Data Con LA 2022-PaCMAP ensembles for occupational specializations in the California Cloud Workforce

Recommended

Recommended

More Related Content

Similar to Data Con LA 2022-PaCMAP ensembles for occupational specializations in the California Cloud Workforce

Similar to Data Con LA 2022-PaCMAP ensembles for occupational specializations in the California Cloud Workforce (20)

More from Data Con LA

More from Data Con LA (20)

Recently uploaded

Recently uploaded (20)

Data Con LA 2022-PaCMAP ensembles for occupational specializations in the California Cloud Workforce