This document discusses how Degreed uses machine learning to categorize learning content to personalize skill development for users. It describes how Degreed generates training data by tagging content, trains a neural network model on word embeddings to predict categories for new content, and evaluates the model's precision on different categories to identify areas for improvement. The goal is to apply the model's category predictions to each user's learning history to suggest personalized content areas to focus on.
10. Machine Learning task:
Predict which category a piece of content belongs to using the title
and description of the content.
Choosing categories → Generating training data → Training model → Evaluating model
12. Choosing categories
Criteria
• Based on our users’ interests & completions
• Separability
• Concept size
• Data density
• Context independent
• Linked to key words and phrases
Method
• Manually derived, data driven
• Iterate based on model results
• Eventually, clustering
15. Evaluating training data
• Separability: categories per item
• Data density: items per category
• Coverage: how much content is in the training data? 218,559 / 436,290 items
• Quality: for a subsample, how good are the assigned categories? 78% precision
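The checks above can be computed directly from the tagged items. A minimal sketch, assuming a hypothetical mapping from item ID to its list of assigned categories; the function names and toy data are illustrative, and only the two coverage counts come from the slide.

```python
from collections import Counter

def categories_per_item(tags):
    """Mean number of categories assigned per tagged item (separability check)."""
    return sum(len(cats) for cats in tags.values()) / len(tags)

def items_per_category(tags):
    """Number of tagged items in each category (data density check)."""
    return Counter(cat for cats in tags.values() for cat in cats)

def coverage(tagged_count, total_count):
    """Share of all content items that received at least one category."""
    return tagged_count / total_count

# Hypothetical toy data, not taken from the deck.
toy_tags = {
    "item-1": ["Data Science"],
    "item-2": ["Leadership", "Business Culture"],
    "item-3": ["Data Science"],
}
mean_cats = categories_per_item(toy_tags)
density = items_per_category(toy_tags)

# Coverage counts as reported on the slide.
slide_coverage = coverage(218559, 436290)
print(f"{slide_coverage:.1%}")  # 50.1%
```

With the slide's counts, roughly half of the content catalogue ends up in the training data.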
16. The model
So | you | want | to | be | a | data | scientist | So you | … | data scientist
So you want to be a data scientist …
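The split shown on this slide (single words followed by word pairs) can be sketched as follows. This assumes plain whitespace tokenization and a maximum n-gram length of 2; the deck does not state either choice, and `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(text, max_n=2):
    """Return all word n-grams of length 1..max_n, shortest first."""
    tokens = text.split()  # assumption: whitespace tokenization
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams("So you want to be a data scientist")
# Eight single words come first, then seven word pairs,
# starting with "So you" and ending with "data scientist".
```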
17. The model
So | So you | So you want | …. data scientist
Semantic map
Word embeddings
So you want to be a data scientist …
18. The model
So | So you | So you want | …. data scientist
73 categories
Hidden layer
Word embeddings
Category embeddings
So you want to be a data scientist …
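One way to read this diagram is as a shallow classifier: n-gram embeddings are averaged into a hidden vector, which is scored against one embedding per category and normalized with a softmax. The sketch below is an assumption about that structure, not the deck's implementation. Only the 73 categories come from the slide; the embedding size, the random untrained weights, and all names are hypothetical.

```python
import math
import random

NUM_CATEGORIES = 73  # from the slide
EMBED_DIM = 8        # assumed; the deck does not give a dimension

rng = random.Random(0)

def random_vector(dim):
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]

# Untrained, randomly initialized parameters (illustration only).
word_embeddings = {}
category_embeddings = [random_vector(EMBED_DIM) for _ in range(NUM_CATEGORIES)]

def embed(feature):
    """Look up (or lazily create) the embedding for one n-gram feature."""
    if feature not in word_embeddings:
        word_embeddings[feature] = random_vector(EMBED_DIM)
    return word_embeddings[feature]

def hidden_layer(features):
    """Average the feature embeddings into a single hidden vector."""
    vectors = [embed(f) for f in features]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(features):
    """Score the hidden vector against every category embedding."""
    hidden = hidden_layer(features)
    scores = [sum(w * h for w, h in zip(cat, hidden)) for cat in category_embeddings]
    return softmax(scores)

probs = predict_proba(["So", "you", "want", "So you", "you want"])
best_category = max(range(NUM_CATEGORIES), key=probs.__getitem__)
```

Training would adjust both embedding tables so that the correct category receives the highest probability; with random weights the prediction here is arbitrary.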
19. The model
So | So you | So you want | …. data scientist | scien | cient | ienti | …
73 categories
Hidden layer
Word embeddings
Category embeddings
So you want to be a data scientist …
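The fragments "scien", "cient", "ienti" added on this slide are consecutive five-character windows over the word "scientist". A minimal sketch of that split, assuming a fixed window of 5 and no word-boundary markers (the deck specifies neither); `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(word, n=5):
    """Return every run of n consecutive characters in the word."""
    word = word.lower()
    if len(word) < n:
        return [word]  # assumption: short words are kept whole
    return [word[i:i + n] for i in range(len(word) - n + 1)]

pieces = char_ngrams("scientist")
print(pieces)  # ['scien', 'cient', 'ienti', 'entis', 'ntist']
```

Subword features like these let related words such as "science" and "scientist" share some of their input features.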
20. How does it do?
Overall model precision
Test set: 50K items
P@1: 48.9%
Most popular category: 4.8%
Breakdown by category
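The metrics on this slide can be computed as sketched below. This assumes P@1 means "the top prediction is one of the item's true categories" and that the baseline is the score obtained by always predicting the most popular category; the deck does not define either term. Function names and the toy data are hypothetical.

```python
from collections import Counter

def precision_at_1(true_sets, predictions):
    """Share of items whose top prediction is among their true categories."""
    hits = sum(pred in truth for truth, pred in zip(true_sets, predictions))
    return hits / len(predictions)

def most_popular_baseline(true_sets):
    """P@1 achieved by always predicting the most common category."""
    counts = Counter(cat for truth in true_sets for cat in truth)
    top_category = counts.most_common(1)[0][0]
    return precision_at_1(true_sets, [top_category] * len(true_sets))

def precision_by_category(true_sets, predictions):
    """For each predicted category, the share of those predictions that were right."""
    predicted = Counter(predictions)
    correct = Counter(p for truth, p in zip(true_sets, predictions) if p in truth)
    return {cat: correct[cat] / n for cat, n in predicted.items()}

# Hypothetical toy data, not results from the deck.
toy_truth = [
    {"Data Science"},
    {"Leadership", "Business Culture"},
    {"Data Science"},
    {"Cognition"},
]
toy_preds = ["Data Science", "Leadership", "Cognition", "Cognition"]

overall = precision_at_1(toy_truth, toy_preds)
baseline = most_popular_baseline(toy_truth)
breakdown = precision_by_category(toy_truth, toy_preds)
```

Comparing the overall figure to the baseline shows how much the model adds over a trivial guess, and the per-category breakdown points to the categories that need better training data.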
21. Where are we making mistakes?
Which categories is content wrongly categorized into?
Data Science
False Negative categories
Business management
False Negative categories
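A false-negative breakdown like the ones charted here can be tallied by taking every item that truly belongs to a target category and counting which other category the model assigned instead. A minimal sketch, assuming one true label per item for simplicity; the helper name and the placeholder labels are hypothetical and do not reflect the deck's results.

```python
from collections import Counter

def false_negative_categories(target, true_labels, predicted_labels):
    """Count the wrong categories assigned to items that truly belong to target."""
    return Counter(
        pred
        for truth, pred in zip(true_labels, predicted_labels)
        if truth == target and pred != target
    )

# Placeholder labels only.
toy_true = ["A", "A", "A", "A", "B"]
toy_pred = ["A", "B", "B", "C", "A"]

misses = false_negative_categories("A", toy_true, toy_pred)
print(misses.most_common())  # [('B', 2), ('C', 1)]
```

Categories that dominate this count are candidates for merging with the target category or for clearer separation in the training data.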
22. The model
So | So you | So you want | …. data scientist | scien | cient | ienti | …
Hidden layers
So you want to be a data scientist …
Word embeddings
tSNE
Sentence vectors
25. Results from applying to my history:
What I’m Learning About Right Now
Data Science
Learning
Software Engineering
Programming Languages
Business Intelligence
Leadership
Cognition
Agile Management
Human Resources
Business Culture
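A ranked list like this one could be produced by classifying each item in a user's learning history and counting the predicted categories. The sketch below assumes ranking by simple frequency, which the deck does not confirm; `top_learning_areas`, the lookup table standing in for the trained model, and the toy history are all hypothetical.

```python
from collections import Counter

def top_learning_areas(history, classify, k=10):
    """Rank categories by how many history items the classifier assigns to them."""
    counts = Counter(classify(item) for item in history)
    return [category for category, _ in counts.most_common(k)]

# Hypothetical stand-in for the trained model: a fixed lookup table.
toy_predictions = {
    "Intro to pandas": "Data Science",
    "Deep learning basics": "Data Science",
    "Feature engineering tips": "Data Science",
    "Leading remote teams": "Leadership",
    "Giving feedback": "Leadership",
    "Scrum in practice": "Agile Management",
}
toy_history = list(toy_predictions)

areas = top_learning_areas(toy_history, toy_predictions.get, k=3)
print(areas)  # ['Data Science', 'Leadership', 'Agile Management']
```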
26. Using this to personalize learning
Data Science
Programming Languages
Software Engineering
Building Relationships
Leadership