Jobmash
Job searching without the pain
Marianne Hoogeveen
Searching for data science jobs is tiring and depressing
‘Data Scientist’ ranges from Excel pusher to software engineer
MANY jobs, but how to find the right kind?
What IS the right kind (for me)?
Want “more like this” feature
MORE LIKE THIS, PLEASE!
Data: 11,000 job postings from Indeed
Specific data challenges:
Buzzwords (“world-class”, “driven”, “exciting”, “mission”, “opportunity”)
Difficult to validate: what is ground truth?
Look for “Data Scientist”, “Data Engineer”, “Data Analyst” job descriptions
US-wide
Compute text similarity
Remove low-information words (stop words, dates, locations, numbers, common verbs, …)
Count occurrences of words (and bigrams), down-weighting terms that are common across the corpus of all documents (TF-IDF)
Compare these TF-IDF vectors using cosine similarity
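A minimal sketch of the three steps above in plain Python. The word list in LOW_INFO, the helper names, the smoothed IDF formula and the three sample postings are assumptions made for illustration; the slides do not say which implementation or word list was used.

```python
import math
import re
from collections import Counter

# Hypothetical low-information list; the real list is not given in the slides.
LOW_INFO = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "we", "our",
            "world-class", "driven", "exciting", "mission", "opportunity"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens (digits are dropped), filter, add bigrams."""
    words = [w for w in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
             if w not in LOW_INFO]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document: term count times smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up postings, for illustration only.
docs = [
    "Build machine learning models in Python for an exciting mission",
    "Design machine learning models and deploy them with Python",
    "Prepare Excel reports and dashboards for the finance team",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])
sim_02 = cosine(vecs[0], vecs[2])
print(f"posting 0 vs 1: {sim_01:.3f}")
print(f"posting 0 vs 2: {sim_02:.3f}")
```

In this toy corpus the two machine learning postings share words and bigrams, while the Excel posting shares nothing with posting 0 once the low-information words are gone.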
What are the job titles of top-100 most similar job postings?
[Figure: three panels, one per search term (Data Scientist, Data Engineer, Data Analyst), each showing the share of DS, DE, DA and other job titles among the most similar postings]
Using cosine similarity on TF-IDF vectors, after removing buzzwords; find 100 most similar and compare job titles
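The title comparison could be tallied roughly as follows. This is a sketch that assumes the similarities of every posting to the query posting are already computed; the function name, the sample titles and the scores are made up for illustration.

```python
from collections import Counter

def top_k_title_counts(similarities, titles, query_index, k=100):
    """Tally the job titles of the k postings most similar to the query posting."""
    ranked = sorted(
        (i for i in range(len(titles)) if i != query_index),  # skip the query itself
        key=lambda i: similarities[i],
        reverse=True,
    )
    return Counter(titles[i] for i in ranked[:k])

# Illustrative similarity of each posting to posting 0 (not real data).
titles = ["Data Scientist", "Data Scientist", "Data Engineer",
          "Data Analyst", "Data Scientist"]
similarities = [1.0, 0.62, 0.15, 0.40, 0.55]

counts = top_k_title_counts(similarities, titles, query_index=0, k=3)
print(counts)
```

With k=100 on the full corpus, the resulting counts per title are what the panels above summarise for each search term.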
Which job titles can we predict from job description?
[Figure: two ROC plots (true positive rate against false positive rate, both axes from 0.0 to 1.0), one with buzzwords removed and one with buzzwords kept (“world-class”, “exciting”, “mission”, “opportunity”)]
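For readers unfamiliar with this kind of plot, here is a sketch of how true and false positive rates are obtained from classifier scores. The classifier behind the slides is not specified; the scores and labels below are invented, labels mark whether a posting carries the target job title, and the code assumes all scores are distinct.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Walk from the highest score down; each step lowers the decision threshold.
    for _score, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented scores; label 1 means the posting really has the target title.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = area_under_curve(points)
print(points)
print(f"area under curve: {area:.3f}")
```

A curve that hugs the top-left corner (area close to 1) means the title is easy to predict from the description; a curve near the diagonal means it is not.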
Let’s have a look!
JOBMASH APP
Why is this useful?
Fewer irrelevant results
Don’t miss similar jobs that don’t have the right job title
About me
PhD Theoretical Physics, King’s College London
Data Science Internship at Cytora Ltd:
Recognising street addresses in newspaper articles
Extra slides
Validation
200 random docs
Compare with same docs cut in half
Compute pairwise cosine similarities
[Figure: pairwise cosine similarities on a scale from 0 to 1; axes: original documents sorted by ID, half documents sorted by original’s ID]
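A sketch of this check on toy data. It assumes "cut in half" means keeping the first half of the words, and it uses raw word counts rather than TF-IDF weights to stay short; both choices, the helper names and the three sample documents are illustrative assumptions.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Raw word counts; a simplification of the TF-IDF weighting."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two Counters (missing terms count as 0)."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def first_half(text):
    """Keep the first half of the words (assumed meaning of 'cut in half')."""
    words = text.split()
    return " ".join(words[: len(words) // 2])

# Made-up documents standing in for the 200 random postings.
docs = [
    "build machine learning models in python and deploy machine learning services",
    "prepare excel reports and sas dashboards for excel users in finance",
    "design hadoop data pipelines and build hadoop platform tooling",
]
full = [bag_of_words(d) for d in docs]
half = [bag_of_words(first_half(d)) for d in docs]

# Rows: original documents; columns: half documents, in the same ID order.
matrix = [[cosine(f, h) for h in half] for f in full]
best_match = [max(range(len(half)), key=lambda j: row[j]) for row in matrix]
print(best_match)
```

If the similarity measure is sensible, each original document should be most similar to its own half, so the largest values sit on the diagonal of the matrix.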
Topic modelling
Salient terms:
Client
Machine
Model
System
Status
Financial
Research
Risk
Analysis
Statistical
Engineer
Employment
Support
Big
LDA topics
e.g. “analysis”, “report”, “statistical”, “excel”, “sas”, “insight”
e.g. “big”, “technology”, “system”, “engineer”, “design”, “build”, “hadoop”, “platform”
Other LDA topics
e.g. “status”, “disability”, “gender”
e.g. “benefit”, “dental”, “pay”, “medical”, “401k”

