Company Clustering Based on the Skills They Seek in the Job Market
Navid Nobani
Master of Business Intelligence and Big Data Analytics
Objective
The main objective of my work is to find a way to cluster companies based on their demand for human resources, as expressed through job announcements.
While this may seem easy at first, job announcements are in most cases hard to analyze because:
• There is no universal taxonomy for describing skills and occupations
• Companies tend to ask for skills that they may not actually need for the current vacancy
• Companies may try to hide critical information such as the company name, exact location, salary, …
• …
Given these issues, it is extremely hard to find companies that behave alike in terms of the skills they seek and the occupations they offer.
My focus is first to find an efficient way to clean company names, which gives me a realistic view of each company’s activity, and then to find a method (through grouping or filtering the skills) to cluster similar companies together.
An Overview of What I Did
• Storage: AWS Athena (TabulaeX)
• Companies and Skills (WollyBI)
• Cleaning Company Names
• Pivoting Companies and ESCO Skills
• Creating Subsets
• Extreme Values
• Skill Grouping
• Sector Filtering
• Analysis
• Visualization
• Clustering
• Dimension Reduction
Getting Raw data from TabulaeX Database
I used European job announcements collected by TabulaeX in 2018 as my data source. The data is stored on AWS S3 and is accessible through AWS Athena, an interactive query service. I then used the DBeaver database administration tool to run the queries and save the results locally as CSV files.
Getting Raw data from TabulaeX Database (cont.)
I mainly used two tables from the TabulaeX database. The first contains announcement data gathered via different methods such as scrapers and crawlers. The second contains the ESCO skills extracted for each announcement using ontology matching and machine learning techniques.
Announcements table:
id   Company   …
1    AA
2    AA
3    BB
.    BC
.    .
n    ZZ

Skills table:
id   Skill 1   Skill 2   Skill 3   Skill 4   …   Skill m
1
2
3
.
n
Cleaning Company Names
Because of business regulations, and regardless of the technique used to capture the company name from a job announcement, in most cases the name is contaminated with prefixes and suffixes that indicate, at a national or international level, the company’s type of activity.
To solve this problem, I wrote an algorithm (implemented in Python) that removes these unwanted parts from the raw company names.
To build it, I first manually cleaned about 5,000 names and then used them as input to a simple ontology-matching script, which increased the number of clean names to 80,000 companies. Next, I used these clean names to identify the tokens that are not part of the company name, using a frequency-location based metric capable of detecting unwanted parts.
The final algorithm combines the two pieces described above: one matches names against the already cleaned company names, and the other detects unwanted parts for companies that are not present in the training set.
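The two pieces can be illustrated with a minimal Python sketch. The slides do not define the frequency-location metric, so the scoring used here (how often a token sits at the first or last position of a raw name and is absent from the clean name) is only an assumption, as are the function names `learn_affixes` and `clean_name`, the thresholds, and every example name other than Pirelli and Nastasi.

```python
from collections import Counter


def tokens(name):
    """Lowercase, drop dots (so "S.p.A." becomes "spa"), split on whitespace."""
    return name.lower().replace(".", "").split()


def learn_affixes(pairs, min_count=2, min_ratio=0.8):
    """Assumed metric: tokens that often sit at the edge of a raw name
    (location) and are usually missing from the clean name (frequency)."""
    seen, dropped = Counter(), Counter()
    for raw, clean in pairs:
        raw_tokens, clean_tokens = tokens(raw), set(tokens(clean))
        if not raw_tokens:
            continue
        for tok in {raw_tokens[0], raw_tokens[-1]}:  # first or last position
            seen[tok] += 1
            if tok not in clean_tokens:
                dropped[tok] += 1
    return {t for t, n in seen.items()
            if n >= min_count and dropped[t] / n >= min_ratio}


def clean_name(raw, known, affixes):
    """Piece 1: look the name up among already cleaned names.
    Piece 2: otherwise strip learned affixes from both ends."""
    toks = tokens(raw)
    key = " ".join(toks)
    if key in known:
        return known[key]
    while len(toks) > 1 and toks[-1] in affixes:
        toks.pop()
    while len(toks) > 1 and toks[0] in affixes:
        toks.pop(0)
    return " ".join(toks)


# Hypothetical training pairs (raw name, manually cleaned name).
pairs = [
    ("Pirelli S.p.A.", "pirelli"),
    ("Nastasi Srl.", "nastasi"),
    ("Rossi Srl", "rossi"),
    ("Gruppo Bianchi SpA", "bianchi"),
    ("Gruppo Verdi", "verdi"),
]
known = {" ".join(tokens(raw)): clean for raw, clean in pairs}
affixes = learn_affixes(pairs)

print(sorted(affixes))
print(clean_name("Gruppo Neri S.r.l.", known, affixes))
```

The lookup handles names already in the training set, while the affix stripping generalizes to unseen names such as the hypothetical "Gruppo Neri S.r.l.".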
Preliminary Results
Unlike using all ESCO skills, using the new classification of skills produced promising results from the early stages of the analysis.
Figure panels: PCA with Minkowski distance (m=3); hierarchical clustering (k=4); t-SNE plot
Cleaning Company Names (cont.)
Dirty Company Names → Manual Cleaning → 5,000 Clean Names → Simple Cleaning Script → 80,000 Clean Names → Frequency-Location Metric → Final Algorithm
Creating a table with Companies and Skills
After cleaning the company names, I used the sparklyr package to create a table with companies as rows and skills as columns.
Resulting table: 4,909 companies (rows) × 1,498 skills (columns), with one row per company (AA, BB, BC, …, ZZ) and one column per skill (Skill 1, Skill 2, …, Skill m).
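The pivoting step can be sketched as follows. The original work used sparklyr (R); this is a plain-Python illustration of the same idea, and the function name `pivot_company_skills` and the input rows are hypothetical.

```python
from collections import Counter, defaultdict


def pivot_company_skills(rows):
    """Turn (company, skill) rows into a company x skill count table."""
    counts = defaultdict(Counter)
    skills = set()
    for company, skill in rows:
        counts[company][skill] += 1
        skills.add(skill)
    columns = sorted(skills)
    # One row per company, one column per skill, zero where a skill is absent.
    table = {c: [counts[c][s] for s in columns] for c in sorted(counts)}
    return columns, table


# Hypothetical long-format input: one row per skill found in an announcement.
rows = [("AA", "skill1"), ("AA", "skill2"), ("AA", "skill1"), ("BB", "skill3")]
columns, table = pivot_company_skills(rows)

print(columns)
for company, values in table.items():
    print(company, values)
```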
Clustering
To cluster companies based on their skills, I decided to use two different categories of clustering algorithms:
1. Prototype-Based
• KMeans
• PAM
• CLARA
2. Density-Based
• DBSCAN
Figure: Clustering benchmark. Source: biomedicalcomputationreview.org
Prototype-Based Clustering
This category of clustering methods considers a cluster to be a set of objects in which each object is closer to the prototype that defines its cluster than to the prototype of any other cluster.
For KMeans, the prototype is a centroid, the average of all points in the cluster, while for the PAM and CLARA algorithms it is a medoid, the most representative data point of the cluster.
By nature, prototype-based methods need K (the number of clusters) to be known a priori.
To choose K, I used the elbow and silhouette methods for all three algorithms.
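As an illustration of the elbow method, the sketch below runs a small KMeans implementation for several values of K and prints the within-cluster sum of squares, whose curve bends at a suitable K. This is a toy version on made-up 2-D points, not the code or data used in the analysis; the function names and the random initialization are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Basic Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Each centroid becomes the mean of its cluster (kept as-is if empty).
        new = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in range(1, 5):
    _, inertia = kmeans(points, k)
    print(k, round(inertia, 3))
```

On this toy data the sum of squares drops sharply from K=1 to K=2 and only slightly afterwards, which is the "elbow" one looks for.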
Prototype-Based Clustering (cont.)
Figure: elbow method and silhouette method plots for KMeans, PAM, and CLARA
As the figures show, all algorithms and methods point to 4 as the number of clusters.
Prototype-Based Clustering (cont.)
Figure: KMeans, PAM, and CLARA plots with Euclidean and Manhattan distances
Preliminary Results
Using the extracted skills (about 1,500) with classic clustering methods such as KMeans, DBSCAN, …, combined with dimensionality reduction and visualization techniques such as PCA and t-SNE, did not produce acceptable results.
Figure panels: PCA with log transformation; t-SNE plot; PCA
New Features
TabulaeX internally developed a new classification system for ESCO skills. In this system, each skill is classified into one of the following macro-categories:
• Knowledge: 90
• Personal Qualities: 25
• Skills: 219
• Tools & Technologies: 97
Total number of skills covered: 431
Data Transformation
While the preliminary results were promising, some of them, such as the PCA, still had room for improvement. I therefore applied a simple transformation that converts the raw data from absolute numbers to percentages (each value divided by the company’s total).
Before:
Company       Knowledge  PersonalQualities  Tools&Technologies  Skills
Nastasi Srl.  25         73                 14                  5
After:
Company       Knowledge  PersonalQualities  Tools&Technologies  Skills
Nastasi Srl.  0.21       0.62               0.11                0.04
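The transformation for the example row can be reproduced with a few lines of Python. This is a sketch: the function name `to_shares` is hypothetical, and truncating to two decimals is an assumption made only because it reproduces the figures shown on the slide.

```python
import math


def to_shares(counts):
    """Divide each category count by the row total."""
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in counts}
    return {k: v / total for k, v in counts.items()}


row = {"Knowledge": 25, "PersonalQualities": 73,
       "Tools&Technologies": 14, "Skills": 5}
shares = to_shares(row)

# Assumption: the slide values look truncated (not rounded) to two decimals.
truncated = {k: math.floor(v * 100) / 100 for k, v in shares.items()}
print(truncated)
```

After the transformation every company’s row sums to 1, so companies of very different sizes become comparable.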
Correlation plots and Relationships between skills
Using the four categories of skills, some interesting relationships emerged.
Three types of relationships between skill categories can be observed:
• Substitution (e.g. Knowledge vs. Skills)
• Correlation (e.g. Knowledge vs. Tools & Tech.)
• Partial correlation (e.g. Skills vs. Tools & Tech.)
From Correlations to Comparisons
While the relationships across all skills are interesting, filtering these plots for similar companies gives a general view of companies’ strategies toward the skills of their human resources.
To do so, we need to identify similar companies based on their business sectors.
I used an excerpt of the Crunchbase database covering the top 1,000 companies for the 28 countries of the European Union, with 19,144 unique companies.
In these data, each company’s sector is a mixture drawn from 703 unique fields/activities. After extracting the unique fields, I created a wide table mapping the fields to each company. This table is then used to find similar companies using the Jaccard metric.
Field Mapping (19,144 companies × 703 fields) → Jaccard Distance Matrix (19,144 × 19,144)
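The Jaccard distance between two companies is one minus the size of the intersection of their field sets divided by the size of the union. A minimal Python sketch follows; the company labels, field names, and the helper `most_similar` are hypothetical and are not taken from the Crunchbase data.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def most_similar(target, sectors, n=2):
    """Return the n companies closest to `target` by Jaccard distance."""
    others = [c for c in sectors if c != target]
    return sorted(
        others,
        key=lambda c: (jaccard_distance(sectors[target], sectors[c]), c),
    )[:n]


# Hypothetical field sets per company.
sectors = {
    "A": {"software", "cloud"},
    "B": {"software", "analytics"},
    "C": {"tyres", "automotive"},
    "D": {"software", "cloud", "analytics"},
}

print(round(jaccard_distance(sectors["A"], sectors["B"]), 3))
print(most_similar("A", sectors, n=2))
```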
Focusing on specific companies
With the Jaccard distance matrix of company sectors, we can filter the previous pair plots for a given company and its n most similar companies by business sector.
Given the positive impact of using grouped skills, we can use the interplay between skill groups to compare similar companies.
With four categories of skills, there are 6 distinct pairs. The appropriate pair depends on the type of analysis and comparison we want to perform on the similar companies.
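The count of 6 follows from choosing 2 categories out of 4. A short Python check, using the four macro-category names from the classification, lists the pairs:

```python
from itertools import combinations

categories = ["Knowledge", "Personal Qualities", "Skills", "Tools & Technologies"]

# Every unordered pair of distinct categories: C(4, 2) = 6.
pairs = list(combinations(categories, 2))
for first, second in pairs:
    print(first, "vs.", second)
```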
The next slide shows the «Knowledge» and «Tools & Technologies» plots for «Pirelli» and «SAP».
The size of the circles shows the “Personal Qualities” skill group.
Figure panels: Pirelli; SAP
Conclusion
While clustering companies on all ESCO skills seems intimidating at first, applying various clustering methods and algorithms has shown that, unless the skills are first grouped in a meaningful way, the results have no added business value.
