3. How did I scrape the jobs?
Jobs were scraped from the following:
• Dice.com
• Monster.com
• Startupers.com
• Ventureloop.com
All credit goes to:
Craig Perler
CTO & Founder
projectSHERPA
4. Job Selection Process
Iterative process based on keywords, text analysis using
NLTK, and manual review of the jobs
Went from 70k jobs to a final set of 585 based on the following
criteria:
Keywords:
▪ (Job title contained: 'data science', 'data scientist', 'statistical', 'statistician' OR
▪ Job description contained: 'predictive modeling', 'data mining', 'text mining',
'machine learning', 'natural language processing') AND
▪ Job title did not contain: 'intern', 'internship'
Skills
▪ Job had at least one tagged skill and did not contain the following: 'Salesforce',
'VBA', 'Sharepoint', 'Drupal'
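The criteria above can be sketched as a single filter function. This is a minimal illustration, not the code actually used: the job dict layout (`title`, `description`, `skills`), the case-insensitive matching, and the whole-word match on 'intern' are all assumptions.

```python
import re

TITLE_KEYWORDS = ("data science", "data scientist", "statistical", "statistician")
DESCRIPTION_KEYWORDS = ("predictive modeling", "data mining", "text mining",
                        "machine learning", "natural language processing")
EXCLUDED_SKILLS = {"salesforce", "vba", "sharepoint", "drupal"}

# Assumption: match 'intern'/'internship' as whole words so that titles
# such as "International ..." are not excluded by accident.
INTERN_PATTERN = re.compile(r"\b(intern|internship)\b", re.IGNORECASE)


def contains_any(text, phrases):
    """Case-insensitive substring check for any of the given phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


def keep_job(job):
    """Return True if a job dict passes the keyword and skill criteria."""
    keyword_match = (contains_any(job["title"], TITLE_KEYWORDS)
                     or contains_any(job["description"], DESCRIPTION_KEYWORDS))
    not_intern = INTERN_PATTERN.search(job["title"]) is None
    skills = {skill.lower() for skill in job["skills"]}
    # At least one tagged skill, and none of the excluded ones.
    skills_ok = bool(skills) and not (skills & EXCLUDED_SKILLS)
    return keyword_match and not_intern and skills_ok
```

Applying it to a list is then `selected = [job for job in jobs if keep_job(job)]`; the manual review step described above would still follow.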
5. Cleaning Up the Jobs Part 1
Job data was good, but needed some further
cleaning
Converted the posted pay rate text into a number
Assigned job seniority ('junior', 'default', 'senior') based on job title
keywords
Used python-linkedin to pull in more
company data
Name, Description, Industry, Company Size, Company Type,
Specialities
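A rough sketch of the two title/pay clean-up steps on this slide (pay rate text to number, seniority from title keywords). The pay rate formats handled and the seniority keyword lists are hypothetical; the slides do not list the actual ones. Hourly rates are not annualized here.

```python
import re

# Assumption: pay text looks like "$120,000 per year", "90k" or "$55.50/hr".
_NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)(?:\s*(k)\b)?", re.IGNORECASE)

# Hypothetical keyword lists; the real ones are not given in the slides.
SENIOR_WORDS = {"senior", "sr", "lead", "principal", "director"}
JUNIOR_WORDS = {"junior", "jr", "entry", "associate"}


def parse_pay_rate(text):
    """Return the first number found in the pay text, or None if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    if match.group(2):  # a trailing "k" means thousands
        value *= 1000
    return value


def job_seniority(title):
    """Map a job title to 'junior', 'default' or 'senior' by keyword."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    if words & SENIOR_WORDS:
        return "senior"
    if words & JUNIOR_WORDS:
        return "junior"
    return "default"
```

For example, `parse_pay_rate("90k")` gives `90000.0` and `job_seniority("Sr. Data Scientist")` gives `'senior'`.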
6. Cleaning Up the Jobs Part 2
Added more data columns
Job Posted Year, Job Posted Month, Number of words in Job
Description, Number of characters in Job Description
Converted text fields to numeric values
Job Seniority, Employee Count, Company Type
Converted Company Industry and Specialities to
binary valued columns
Used only the following specialities: 'Big Data', 'Analytics', 'Machine
Learning', 'analytics', 'Data Science', 'Big Data Analytics', 'Natural
Language Processing', 'Predictive Analytics', 'Data Mining'
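The extra columns described on this slide could be built along these lines. This is a sketch under assumptions: the job dict keys, the numeric seniority codes, and the column names are made up for illustration; only the speciality list comes from the slide (including both 'Analytics' and 'analytics', matched case-sensitively here).

```python
from datetime import date

SPECIALITIES = ["Big Data", "Analytics", "Machine Learning", "analytics",
                "Data Science", "Big Data Analytics",
                "Natural Language Processing", "Predictive Analytics",
                "Data Mining"]

# Hypothetical numeric encoding of the seniority labels.
SENIORITY_CODES = {"junior": 0, "default": 1, "senior": 2}


def build_features(job):
    """Turn one job dict into a flat dict of numeric feature columns."""
    description = job["description"]
    row = {
        "posted_year": job["posted"].year,
        "posted_month": job["posted"].month,
        "description_words": len(description.split()),
        "description_chars": len(description),
        "seniority": SENIORITY_CODES[job["seniority"]],
    }
    # One binary (0/1) column per speciality in the fixed list.
    company_specialities = set(job["specialities"])
    for name in SPECIALITIES:
        row["speciality_" + name] = int(name in company_specialities)
    return row


example = build_features({
    "posted": date(2013, 5, 17),  # illustrative date only
    "description": "Build predictive models",
    "seniority": "senior",
    "specialities": ["Big Data", "Consulting"],
})
```

Specialities outside the fixed list (such as 'Consulting' in the example) simply produce no column.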
8. What do we do?
Test using two different datasets
Pay Rate - job salaries as provided by the poster
(67 records)
Estimated Salary - job salaries as provided by a
separate model (584 records)
9. Model Selection
Most of the competitors in the Kaggle competition
used Random Forest.
Zygmunt Zając made a suggestion:
10. Initial Results
Tested with linear models and Random Forest
Pay Rate
Model                        Training Score   Testing Score
Ordinary Linear Regression    0.795           -32,120,852.970
Ridge Regression              0.669           -0.079
Lasso Regression             -0.009           -0.021
Random Forest                 1.000            0.147

Estimated Salary
Model                        Training Score   Testing Score
Ordinary Linear Regression    0.430           -330,266.154
Ridge Regression              0.361            0.163
Lasso Regression              0.031           -0.051
Random Forest                 1.000            0.325
Can we do better???
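A note on reading the scores: the slides do not name the metric, but if the scores are R² (which is what the `score` method of scikit-learn regressors returns), then 1.0 is a perfect fit, 0.0 is no better than always predicting the mean, and there is no lower bound. That would explain the huge negative testing scores for Ordinary Linear Regression. A small hand-rolled version, with made-up numbers:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total.

    Undefined (division by zero) if all actual values are identical.
    """
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


actual = [50000.0, 70000.0, 90000.0]  # made-up salaries

perfect = r_squared(actual, actual)                     # exact predictions
mean_only = r_squared(actual, [70000.0] * 3)            # always predict the mean
wild = r_squared(actual, [-1000000.0, 2000000.0, 500000.0])  # far-off predictions
```

Here `perfect` is 1.0, `mean_only` is 0.0, and `wild` is strongly negative, which is the pattern a badly extrapolating linear model would show on held-out data.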
12. Smoothing out the data
Our salary data is too granular - let's round to
units of 10k
[Figure: Original Data vs. Smoothed Data]
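The rounding step can be sketched as below. Whether the original rounded to the nearest 10k or rounded down is not stated; nearest (halves up) is assumed here.

```python
def smooth_salary(salary, unit=10000):
    """Round a salary to the nearest multiple of `unit` (halves round up)."""
    return int((salary + unit / 2) // unit) * unit


salaries = [84999, 85000, 123456.78, 61000, 58500]  # made-up values
smoothed = [smooth_salary(s) for s in salaries]
distinct_levels = sorted(set(smoothed))
```

After rounding, the target takes only a handful of distinct values, which is what makes it usable as a class label for a classification model.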
13. Expanding the model set
Random Forest is good but slow - try
Decision Tree to see if we can get
comparable results
LDA - can use a classification model on the
smoothed data
KNN - why not?
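For reference, KNN regression is simple enough to write out in a few lines. This is a dependency-free illustration of the technique, not the implementation used for the results; the feature vectors and salaries are made up, and plain Euclidean distance with an unweighted mean is assumed.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), target)
        for row, target in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up data: two numeric features per job, salary as the target.
train_X = [[1, 0], [2, 0], [3, 1], [10, 1]]
train_y = [60000, 70000, 80000, 150000]

prediction = knn_predict(train_X, train_y, [2, 0], k=3)
```

Features on very different scales (for example employee count vs. a 0/1 column) would dominate the distance, so scaling the columns first is usually worth considering.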
15. Model Results
Linear Models (OLS, Ridge, Lasso)
Universally poor on the small data set
Of the Linear Models, Ridge was the best on the large data set
Decision Tree/Random Forest
Overfitted on the small data set (great training score, poor test score)
Had the best results on the large data set (Random Forest)
KNN
Comparable to the other non-linear models on both the small and large data sets
LDA
Needed rather large K to get good results
Had the best results on the small data set
16. Final Results
Pay Rate
Model            Training Score   Testing Score
KNN              1.000            0.147

Estimated Salary
Model            Training Score   Testing Score
Decision Tree    1.000            0.342

Smoothed Pay Rate
Model            Training Score   Testing Score
LDA              0.585            0.286

Smoothed Estimated Salary
Model            Training Score   Testing Score
Random Forest    1.000            0.581
18. How can we do better next time?
More data!
Either more data points or expand the parameters
of the model
Keep playing with the shape of the data
Improve the ranges - 20k vs. 30k is more
significant than 150k vs. 160k
Improve the quality of the data
Verify the LinkedIn data, Job Seniority, etc.
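To put numbers on the "20k vs. 30k" point: the same 10k step is a 50% raise at the low end and under 7% at the high end. One possible way to reflect that (an assumption, not something tried in this project) is to compare salaries on a log scale, where equal gaps mean equal ratios.

```python
import math


def relative_gap(low, high):
    """Size of the step from low to high, as a fraction of low."""
    return (high - low) / low


def log_gap(low, high):
    """Gap on a natural-log scale; equal ratios give equal gaps."""
    return math.log(high) - math.log(low)


low_end = relative_gap(20000, 30000)     # 0.5, i.e. 50%
high_end = relative_gap(150000, 160000)  # about 0.067

low_end_log = log_gap(20000, 30000)
high_end_log = log_gap(150000, 160000)
```

On the log scale the low-end gap is several times the high-end gap, so bins of equal log width would be narrow for small salaries and wide for large ones.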
19. Additional Thoughts
Why I think a project like this is a marketing gimmick:
Only recruiters post expected salary
Too much variance in job titles and not enough in
the job description
Only provides base salary and ignores bonus and
non-cash compensation
Cannot handle deprecated skills or brand new skills
20. References
projectSHERPA Homepage, http://projectsherpa.com/
"Predict the salary of any UK job ad based on its contents", http://www.kaggle.com/c/job-salary-prediction
"Predicting advertised salaries", http://fastml.com/predicting-advertised-salaries/