Job Salary Prediction
with Python
John Maiden
NYC Data Science Academy
Motivation
How did I scrape the jobs?
Jobs were scraped from the following:
• Dice.com
• Monster.com
• Startupers.com
• Ventureloop.com
All credit goes to:
Craig Perler
CTO & Founder
projectSHERPA
Job Selection Process
• Iterative process based on keywords, text analysis using NLTK, and manual review of the jobs
• Went from 70k jobs to a final set of 585 based on the following criteria:
  • Keywords:
    ▪ (Job title contained: 'data science', 'data scientist', 'statistical', 'statistician' OR
    ▪ Job description contained: 'predictive modeling', 'data mining', 'text mining', 'machine learning', 'natural language processing') AND
    ▪ Job title did not contain: 'intern', 'internship'
  • Skills:
    ▪ Job had at least one tagged skill and did not contain the following: 'Salesforce', 'VBA', 'Sharepoint', 'Drupal'
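The selection criteria above can be sketched as a single predicate. This is a rough illustration, not the code used for the project: the record fields (`title`, `description`, `skills`) are assumed names, the matching is plain case-insensitive substring search, and the skills rule is read as "none of the job's tagged skills is on the excluded list".

```python
# Keyword lists are taken from the slide; everything else is an assumption.
TITLE_TERMS = ("data science", "data scientist", "statistical", "statistician")
DESCRIPTION_TERMS = ("predictive modeling", "data mining", "text mining",
                     "machine learning", "natural language processing")
EXCLUDED_TITLE_TERMS = ("intern", "internship")
EXCLUDED_SKILLS = {"salesforce", "vba", "sharepoint", "drupal"}


def contains_any(text, terms):
    """Case-insensitive substring match against any of the terms."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def keep_job(job):
    """Apply the keyword and skills criteria to one job record (a dict)."""
    keyword_hit = (contains_any(job["title"], TITLE_TERMS)
                   or contains_any(job["description"], DESCRIPTION_TERMS))
    # Naive: a substring test also rejects titles such as "Internal Analyst".
    not_intern = not contains_any(job["title"], EXCLUDED_TITLE_TERMS)
    skills = {skill.lower() for skill in job["skills"]}
    skills_ok = bool(skills) and not (skills & EXCLUDED_SKILLS)
    return keyword_hit and not_intern and skills_ok


# Made-up records for illustration only.
jobs = [
    {"title": "Senior Data Scientist", "description": "Build models.",
     "skills": ["Python", "R"]},
    {"title": "Data Science Intern", "description": "Machine learning.",
     "skills": ["Python"]},
    {"title": "Analyst", "description": "Data mining of CRM records.",
     "skills": ["Salesforce", "SQL"]},
    {"title": "Software Engineer", "description": "Machine learning pipelines.",
     "skills": []},
]
selected = [job["title"] for job in jobs if keep_job(job)]
print(selected)
```

Only the first record passes: the second is an internship, the third has an excluded skill, and the fourth has no tagged skills.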
Cleaning Up the Jobs Part 1
• Job data was good, but needed some further cleaning
  • Converted the posted pay rate text into a number
  • Assigned job seniority ('junior', 'default', 'senior') based on job title keywords
• Used python-linkedin to pull in more company data
  • Name, Description, Industry, Company Size, Company Type, Specialities
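A minimal sketch of the two cleaning steps above (pay rate text to a number, and seniority from title keywords). The slides do not show the actual rules, so the text formats handled, the 2,080 hours-per-year conversion, and the seniority keyword lists are all assumptions made for illustration.

```python
import re

HOURS_PER_YEAR = 2080  # assumption: 40 hours x 52 weeks to annualise hourly rates

# Assumed keyword lists; the slides only name the three seniority levels.
SENIOR_TERMS = {"senior", "sr", "lead", "principal", "director"}
JUNIOR_TERMS = {"junior", "jr", "associate", "entry"}


def parse_pay_rate(text):
    """Return an annual figure from text such as '$120k' or '$60/hr'.

    Ranges are reduced to their midpoint. Returns None if no number is found.
    """
    cleaned = text.lower().replace(",", "")
    amounts = []
    for number, k_suffix in re.findall(r"(\d+(?:\.\d+)?)\s*(k\b)?", cleaned):
        value = float(number)
        if k_suffix:
            value *= 1000
        amounts.append(value)
    if not amounts:
        return None
    midpoint = sum(amounts) / len(amounts)
    if re.search(r"/\s*(hr|hour)|per hour|hourly", cleaned):
        midpoint *= HOURS_PER_YEAR
    return midpoint


def job_seniority(title):
    """Map a job title to 'junior', 'default' or 'senior' by whole-word keywords."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    if words & SENIOR_TERMS:
        return "senior"
    if words & JUNIOR_TERMS:
        return "junior"
    return "default"


print(parse_pay_rate("$95,000 - $110,000"))
print(job_seniority("Sr. Data Scientist"))
```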
Cleaning Up the Jobs Part 2
• Added more data columns
  • Job Posted Year, Job Posted Month, Number of words in Job Description, Number of characters in Job Description
• Converted text fields to numeric values
  • Job Seniority, Employee Count, Company Type
• Converted Company Industry and Specialities to binary valued columns
  • Used only the following specialities: 'Big Data', 'Analytics', 'Machine Learning', 'analytics', 'Data Science', 'Big Data Analytics', 'Natural Language Processing', 'Predictive Analytics', 'Data Mining'
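The column-building steps above might look roughly like this. The record layout, the ordinal seniority codes, and the `speciality_` column prefix are assumptions; only the speciality list comes from the slide (it lists both 'Analytics' and 'analytics', so the sketch matches case-sensitively and keeps them as separate columns).

```python
from datetime import date

KEPT_SPECIALITIES = [
    "Big Data", "Analytics", "Machine Learning", "analytics", "Data Science",
    "Big Data Analytics", "Natural Language Processing",
    "Predictive Analytics", "Data Mining",
]
SENIORITY_CODES = {"junior": 0, "default": 1, "senior": 2}  # assumed coding


def build_features(job):
    """Turn one cleaned job record (a dict) into a flat dict of numeric features."""
    description = job["description"]
    features = {
        "posted_year": job["posted"].year,
        "posted_month": job["posted"].month,
        "description_words": len(description.split()),
        "description_chars": len(description),
        "seniority": SENIORITY_CODES[job["seniority"]],
    }
    # One binary (0/1) column per kept speciality.
    specialities = set(job["specialities"])
    for name in KEPT_SPECIALITIES:
        features["speciality_" + name] = int(name in specialities)
    return features


# Made-up record for illustration only.
job = {
    "description": "Build machine learning models",
    "posted": date(2014, 3, 15),
    "seniority": "senior",
    "specialities": ["Big Data", "Cloud"],
}
features = build_features(job)
print(features)
```

Specialities outside the kept list ('Cloud' here) simply produce no column.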
Modeling Time!
Problem
• Only 67 jobs (out of 585) have posted salaries!
Solution
What do we do?
Test using two different datasets
• Pay Rate - job salaries as provided by the poster (67 records)
• Estimated Salary - job salaries as provided by a separate model (584 records)
Model Selection
• Most of the competitors in the Kaggle competition used Random Forest.
• Zygmunt Zając made a suggestion:
Initial Results
Tested with linear models and Random Forest

Pay Rate
Model                       Training Score    Testing Score
Ordinary Linear Regression           0.795  -32,120,852.970
Ridge Regression                     0.669           -0.079
Lasso Regression                    -0.009           -0.021
Random Forest                        1.000            0.147

Estimated Salary
Model                       Training Score    Testing Score
Ordinary Linear Regression           0.430     -330,266.154
Ridge Regression                     0.361            0.163
Lasso Regression                     0.031           -0.051
Random Forest                        1.000            0.325
Can we do better???
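A note on reading the scores: they look like the coefficient of determination (R²), which is what scikit-learn regressors return from `score`, although the slides do not say so. Under that assumption, R² = 1 - SS_res / SS_tot, so a model that always predicts the mean scores 0 and anything worse goes negative with no lower bound. The salaries below are made up to show how a couple of wild predictions produce a huge negative score.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up salaries and predictions for illustration only.
actual = [60_000, 80_000, 100_000, 120_000]
mean_guess = [90_000] * 4                               # always predict the mean
close_guess = [65_000, 78_000, 103_000, 115_000]        # small errors
wild_guess = [-2_000_000, 80_000, 100_000, 5_000_000]   # two extreme misses

for name, guess in [("mean", mean_guess), ("close", close_guess), ("wild", wild_guess)]:
    print(name, r_squared(actual, guess))
```

A high training score paired with a large negative testing score is the pattern you would expect when an unregularised linear model extrapolates badly on unseen rows.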
Changing the shape of the data
[Figure: salary distribution panels for Original Data, Log Data, and Sqrt Data]
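A small sketch of the two reshaping transforms named in the panels. The salaries are made up, and the idea of fitting on the transformed target and inverting the predictions afterwards is an assumption about how the transforms would be used, not something the slides state.

```python
import math

salaries = [45_000, 60_000, 85_000, 120_000, 250_000]  # made-up values

# Both transforms compress the long right tail of the distribution.
log_salaries = [math.log(s) for s in salaries]
sqrt_salaries = [math.sqrt(s) for s in salaries]


def from_log(value):
    """Invert the log transform to get back to salary units."""
    return math.exp(value)


def from_sqrt(value):
    """Invert the square-root transform to get back to salary units."""
    return value ** 2


# Ratio of largest to smallest value shows how much each transform compresses.
spread_original = max(salaries) / min(salaries)
spread_sqrt = max(sqrt_salaries) / min(sqrt_salaries)
spread_log = max(log_salaries) / min(log_salaries)
print(spread_original, spread_sqrt, spread_log)
```

A model trained on `log_salaries` predicts in log units, so its output needs `from_log` before it can be compared with a real salary.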
Smoothing out the data
Our salary data is too granular - let's round to units of 10k
[Figure: Original Data vs. Smoothed Data]
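Rounding to units of 10k can be sketched in one line. Whether the project rounded to the nearest 10k or rounded down is not stated, so round-to-nearest (halves up) is an assumption here. Integer arithmetic is used on purpose, because Python's built-in `round` sends exact halves to the nearest even number.

```python
BUCKET = 10_000


def smooth_salary(salary, bucket=BUCKET):
    """Round a salary to the nearest multiple of bucket, with halves rounding up."""
    return (salary + bucket // 2) // bucket * bucket


# Made-up salaries for illustration only.
raw = [83_500, 87_250, 95_000, 104_999]
smoothed = [smooth_salary(s) for s in raw]
print(smoothed)
```

After smoothing, the target takes only a handful of distinct values, which is what makes a classification model an option.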
Expanding the model set
• Random Forest is good but slow - try Decision Tree to see if we can get comparable results
• LDA - can use a classification model on the smoothed data
• KNN - why not?
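To make "classification on the smoothed data" concrete: once salaries are rounded to 10k, each distinct value can be treated as a class label. The sketch below uses a hand-rolled nearest-neighbour majority vote as a simple stand-in classifier. It is not LDA and not the project's code, and the two features (seniority code, description length in hundreds of words) are made up.

```python
from collections import Counter


def knn_classify(train_rows, train_labels, row, k=3):
    """Majority vote among the k training rows closest to row.

    Closeness is squared Euclidean distance over the feature values.
    """
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(train_row, row)), label)
        for train_row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up training data: [seniority code, description words / 100].
train_rows = [[0, 2.0], [0, 2.5], [1, 3.0], [1, 3.5], [2, 5.0], [2, 5.5]]
# Smoothed salaries used as class labels.
train_labels = [70_000, 70_000, 100_000, 100_000, 140_000, 140_000]

prediction = knn_classify(train_rows, train_labels, [2, 5.2], k=3)
print(prediction)
```

With k=1 every training row is its own nearest neighbour, so a model like this reproduces its training labels exactly whenever the rows are distinct.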
Reviewing the code
http://xkcd.com/221/
Model Results
• Linear Models (OLS, Ridge, Lasso)
  • Universally poor on the small data set
  • Of the Linear Models, Ridge was the best on the large data set
• Decision Tree/Random Forest
  • Overfitted on the small data set (great training score, poor test score)
  • Had the best results on the large data set (Random Forest)
• KNN
  • Comparable to the non-"Linear Models" on the small and large data sets
• LDA
  • Needed rather large K to get good results
  • Had the best results on the small data set
Final Results
Dataset                    Model          Training Score  Testing Score
Pay Rate                   KNN                     1.000          0.147
Estimated Salary           Decision Tree           1.000          0.342
Smoothed Pay Rate          LDA                     0.585          0.286
Smoothed Estimated Salary  Random Forest           1.000          0.581
Reviewing the data
http://blog.mindjet.com/2011/12/drowning-from-information-overload/
How can we do better next time?
• More data!
  • Either more data points or expand the parameters of the model
• Keep playing with the shape of the data
  • Improve the ranges - 20k vs. 30k is more significant than 150k vs. 160k
• Improve the quality of the data
  • Verify the LinkedIn data, Job Seniority, etc.
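One way to act on the point about ranges is to bucket by ratio instead of by a fixed 10k step, so that low salaries get narrow buckets and high salaries get wide ones. This is only a sketch of that idea. The 20k base and the 1.25 ratio are arbitrary choices for illustration and do not come from the slides.

```python
import math


def log_bucket(salary, base=20_000, ratio=1.25):
    """Index of the geometric bucket that contains salary.

    Bucket i covers [base * ratio**i, base * ratio**(i + 1)).
    """
    return math.floor(math.log(salary / base, ratio))


for salary in (20_000, 30_000, 150_000, 160_000):
    print(salary, log_bucket(salary))
```

With these settings 20k and 30k fall into different buckets while 150k and 160k share one, matching the intuition that the same 10k gap matters more at the low end.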
Additional Thoughts
Why I think a project like this is a marketing gimmick:
• Only recruiters post expected salary
• Too much variance in job titles and not enough in the job description
• Only provides base salary and ignores bonus and non-cash compensation
• Cannot handle deprecated skills or brand new skills
References
• projectSHERPA Homepage, http://projectsherpa.com/
• "Predict the salary of any UK job ad based on its contents", http://www.kaggle.com/c/job-salary-prediction
• "Predicting advertised salaries", http://fastml.com/predicting-advertised-salaries/
