Company Clustering Based on skills they seek in the job market
1. Company Clustering Based on Skills They Seek in the Job Market
Navid Nobani
Master of Business Intelligence and Big Data Analytics
2. Objective
The main objective of my work is to find a way to cluster companies based on their demand for human
resources, as expressed through job announcements.
While this may seem easy at first, job announcements are in most cases hard to analyse because:
• There is no universal taxonomy for describing skills and occupations
• Companies tend to ask for skills that they may not actually need for the current vacancy
• Companies may try to hide critical information such as the company name, exact location, salary, …
• …
Given these issues, it is extremely hard to find companies that behave in the same way in terms of the skills they seek and
the occupations they offer.
My focus is first to find an efficient way to clean company names, which gives me a real view of
each company's activity, and then to find a method (through grouping or filtering the skills) to cluster similar
companies together.
3. An Overview of What I Did!
• Storage: AWS Athena (TabulaeX)
• Companies and Skills (WollyBI)
• Cleaning Company Names
• Pivoting Companies and ESCO Skills
• Creating Subsets: Extreme Values, Skill Grouping, Sector Filtering
• Analysis: Visualization, Clustering, Dimension Reduction
4. Getting Raw data from TabulaeX Database
I used European job announcements collected by TabulaeX in 2018 as my data
source. The data are stored on AWS S3 and are accessible through AWS Athena,
which is an interactive query tool. I then used the DBeaver database administration
tool to launch the queries and save the data locally as CSV files.
5. Getting Raw data from TabulaeX Database (cont.)
I mainly used two tables from the TabulaeX databases. The first table contains
announcement data gathered via different methods such as scrapers and crawlers;
the second contains the ESCO skills extracted for each announcement using ontology
matching and machine learning techniques.
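[Figure: two schematic tables. The announcements table has the columns id, Company, … with rows 1, 2, 3, …, n and company names AA, AA, BB, BC, …, ZZ. The skills table has the columns id, skill1, skill2, skill3, skill4, …, skill m for the same rows 1, 2, 3, …, n.]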
6. Cleaning Company Names
Because of business regulations, and regardless of the technique used to capture the company name from job
announcements, in most cases the company name is contaminated with suffixes and prefixes which, at
a national or international level, show the activity type of the company.
To solve this problem, I wrote an algorithm (implemented in Python) which removes these unwanted
parts from the raw company names.
To build this algorithm, I first manually cleaned about 5,000 names and then used these clean names as
the input of a simple ontology matching script, which helped me increase the number of clean names to 80,000
companies. In the next step I used these clean names to identify the tokens which are not part of the
company name. To do so I used a frequency-location based metric which is capable of detecting
unwanted parts.
The algorithm uses the two pieces described above: one to match the already cleaned company names
and another to detect unwanted parts for companies which are not present in the training set.
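The two pieces described above could look roughly like the sketch below. It is not the original implementation: the tokenizer, the metric (share of names in which a token sits in the first or last position), the threshold of 0.3 and all function names are assumptions made for illustration, and the input names are toy data.

```python
from collections import Counter

def tokenize(name):
    # Lowercase, drop dots, split on whitespace (a deliberately simple tokenizer).
    return name.lower().replace(".", "").split()

def edge_token_scores(raw_names):
    """Hypothetical frequency-location metric: share of names in which a
    token occurs in the first or last position."""
    counts = Counter()
    for name in raw_names:
        tokens = tokenize(name)
        if len(tokens) > 1:
            counts.update({tokens[0], tokens[-1]})
    return {tok: c / len(raw_names) for tok, c in counts.items()}

def clean_name(raw, known_clean, scores, threshold=0.3):
    tokens = tokenize(raw)
    # Piece 1: look for an already cleaned name inside the raw name,
    # trying the longest token spans first.
    for size in range(len(tokens), 0, -1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate in known_clean:
                return candidate
    # Piece 2: strip edge tokens that are frequent at name boundaries.
    while len(tokens) > 1 and scores.get(tokens[-1], 0.0) >= threshold:
        tokens.pop()
    while len(tokens) > 1 and scores.get(tokens[0], 0.0) >= threshold:
        tokens.pop(0)
    return " ".join(tokens)

# Toy inputs, made up for illustration only.
raw_names = ["Nastasi Srl.", "Rossi Srl", "Bianchi Srl", "Pirelli SpA", "SAP SE"]
scores = edge_token_scores(raw_names)
known_clean = {"pirelli", "sap"}
```

With these toy inputs, "Pirelli SpA" is resolved through the list of known clean names, while an unseen name such as "Verdi Srl." loses its trailing token because "srl" is frequent at name boundaries.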
7. Preliminary Results
Unlike using all ESCO skills, utilizing the new classification of skills generated
promising results from the early stages of the analysis.
[Figure panels: PCA with Minkowski Distance (m=3), Hierarchical Clustering (k=4), t-SNE plot]
8. Cleaning Company Names (cont.)
[Figure: flow of the cleaning process. Dirty Company Names → Manual Cleaning → 5,000 Clean Names → Simple Cleaning Script → 80,000 Clean Names → Frequency-Location Metric → Final Algorithm]
9. Creating a table with Companies and Skills
After cleaning the company names, I used the sparklyr package to create a table with
companies as rows and skills as columns.
[Figure: schematic company-by-skill table with 4,909 companies (AA, AA, BB, BC, …, ZZ) as rows and 1,498 skills (skill1, skill2, skill3, skill4, …, skill m) as columns.]
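As an illustration of the pivoting step, here is a dependency-free Python sketch; the original work used sparklyr, so this is a swapped-in technique rather than the actual code. It assumes the input is a list of (company, skill) pairs and that each cell holds the number of times a company asked for a skill. The function name and toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def pivot(rows):
    """Turn (company, skill) pairs into a company-by-skill count matrix."""
    counts = defaultdict(Counter)
    for company, skill in rows:
        counts[company][skill] += 1
    companies = sorted(counts)
    skills = sorted({skill for per_company in counts.values() for skill in per_company})
    # A Counter returns 0 for skills a company never asked for.
    matrix = [[counts[company][skill] for skill in skills] for company in companies]
    return companies, skills, matrix

# Toy rows, made up for illustration only.
rows = [("AA", "sql"), ("AA", "python"), ("AA", "sql"), ("BB", "teamwork")]
companies, skills, matrix = pivot(rows)
```

On real data of this size (4,909 companies by 1,498 skills) a data-frame library would normally do the same job in a single call.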
10. Clustering
In order to cluster companies based on their skills, I decided
to use two different categories of clustering algorithms:
1. Prototype-Based
• K-means
• PAM
• CLARA
2. Density-Based
• DBSCAN
Clustering Benchmark. Source: biomedicalcomputationreview.org
11. Prototype-Based Clustering
This category of clustering methods considers a cluster as a set of objects in
which each object is closer to the prototype that defines the cluster than to the
prototype of any other cluster.
In the case of K-means the prototype is the centroid, i.e. the average of all points
in the cluster, while for the PAM and CLARA algorithms the prototype is a medoid,
which is the most representative data point of the cluster.
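The difference between the two kinds of prototype can be shown with a small sketch. It assumes Euclidean distance and takes the medoid to be the member with the smallest total distance to all other members; the points are toy data.

```python
import math

def centroid(points):
    # Coordinate-wise mean; usually not one of the observed points.
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def medoid(points):
    # The observed point with the smallest total distance to all the others.
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Toy cluster with one outlier, made up for illustration only.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
c = centroid(cluster)
m = medoid(cluster)
```

Here the outlier pulls the centroid to (2.4, 2.4), a location where no data point exists, while the medoid stays on the observed point (1.0, 1.0).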
By nature, prototype-based methods need K (the number of clusters) to be known a
priori.
To choose K, I used the Elbow and Silhouette methods for all three algorithms.
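A minimal, dependency-free sketch of the Silhouette method with K-means follows. It is not the code used for the analysis: the farthest-first initialisation is an assumption chosen to keep the example deterministic, the data are toy points, and the Elbow method is not shown.

```python
import math

def farthest_first(points, k):
    # Deterministic initialisation: start from the first point, then keep
    # adding the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width; higher means tighter, better separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for name, other in clusters.items() if name != lab)
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

def choose_k(points, candidates):
    # Pick the K with the highest mean silhouette width.
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))

# Toy data: four tight groups of five points each, made up for illustration.
centres = [(0, 0), (10, 0), (0, 10), (10, 10)]
offsets = [(-0.5, 0), (0.5, 0), (0, -0.5), (0, 0.5), (0, 0)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]
best_k = choose_k(points, range(2, 7))
```

On this toy data the silhouette width peaks at the number of groups that were built in, which is the kind of signal read off the silhouette plots.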
12. Prototype-Based Clustering (cont.)
[Figure: Elbow Method and Silhouette Method plots for K-means, PAM and CLARA]
As can be seen from the figures above, all algorithms and methods point to 4 as the number of clusters.
14. Preliminary Results
Using the extracted skills (about 1,500) and applying classic clustering
methods like K-means, DBSCAN, … together with dimensionality reduction
and visualization techniques like PCA and t-SNE did not produce acceptable
results.
[Figure panels: PCA with log Transformation, t-SNE plot, PCA]
15. New Features
TabulaeX internally developed a new classification system of ESCO skills. Based
on this system, each skill is classified into one of the following macro-categories:
• Knowledge: 90
• Personal Qualities: 25
• Skills: 219
• Tools & Technologies: 97
Total number of skills covered: 431
16. Data Transformation
While the preliminary results were promising, some of them, such as the PCA, still had
room for improvement. To address this, I applied a simple transformation that
converts the raw data from absolute numbers to percentages.
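Before the transformation (absolute numbers):
Company        Knowledge   Personal Qualities   Tools & Technologies   Skills
Nastasi Srl.   25          73                   14                     5
After the transformation (shares of the company's total):
Company        Knowledge   Personal Qualities   Tools & Technologies   Skills
Nastasi Srl.   0.21        0.62                 0.11                   0.04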
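The transformation amounts to dividing each count by the company's row total. A minimal sketch using the Nastasi Srl. row from the table (the function name is hypothetical):

```python
def to_shares(counts):
    """Convert absolute counts per skill category into shares of the row total."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: value / total for category, value in counts.items()}

nastasi = {
    "Knowledge": 25,
    "Personal Qualities": 73,
    "Tools & Technologies": 14,
    "Skills": 5,
}
shares = to_shares(nastasi)
```

The row total is 117, so Knowledge becomes 25 / 117 and Personal Qualities 73 / 117, and the four shares sum to 1.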
17. Correlation plots and Relationships between skills
Using the four categories of skills, some interesting relationships emerged.
It is possible to observe three types of
relationships between skills:
• Substitution
(e.g. Knowledge vs. Skills)
• Correlation
(e.g. Knowledge vs. Tools & Tech.)
• Partial Correlation
(e.g. Skills vs. Tools & Tech.)
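One way to quantify such relationships is the Pearson correlation between two category shares across companies. This is an assumption about how the plots could be summarised, not a description of the original analysis: the sketch reads "substitution" as a strongly negative coefficient, and the share values are toy data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy category shares for four hypothetical companies.
knowledge = [0.1, 0.2, 0.3, 0.4]
skills = [0.6, 0.5, 0.4, 0.3]
tools = [0.20, 0.25, 0.28, 0.30]

r_substitution = pearson(knowledge, skills)   # shares move in opposite directions
r_correlation = pearson(knowledge, tools)     # shares move together
```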
18. From Correlations to Comparisons
While seeing the relationships between all skills is interesting, filtering these plots
for similar companies gives us a general view of companies' strategies toward the skills
of their human resources.
In order to do so, we need to identify similar companies based on their business sectors.
I used an excerpt of the Crunchbase database covering the top 1,000 companies for the
28 countries of the European Union, with 19,144 unique companies.
Based on these data, companies have a mixture of 703 unique fields/activities as
their sector. After extracting the unique fields, I created a wide table to map the
fields to each company. This table is later used to find similar companies using the
Jaccard metric.
[Figure: Field Mapping table (19,144 companies × 703 fields) → Jaccard Distance Matrix (19,144 companies)]
19. Focusing on specific companies
Having the Jaccard distance matrix of company sectors, we can
filter the previous pair plots for a given company and its n
most similar companies based on business sector.
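The sector-based similarity search can be sketched as follows, assuming each company is represented by the set of fields it is tagged with. The function names, company names and fields are hypothetical toy data, not values from the Crunchbase excerpt.

```python
def jaccard_distance(a, b):
    """1 minus the size of the intersection over the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def most_similar(target, sectors, n):
    """Names of the n companies whose field sets are closest to the target's."""
    ranked = sorted(
        (jaccard_distance(sectors[target], fields), name)
        for name, fields in sectors.items()
        if name != target
    )
    return [name for _, name in ranked[:n]]

# Toy field sets, made up for illustration only.
sectors = {
    "Company A": {"software", "cloud"},
    "Company B": {"software", "analytics"},
    "Company C": {"automotive", "manufacturing"},
    "Company D": {"software", "cloud", "analytics"},
}
neighbours = most_similar("Company A", sectors, n=2)
```

For the full data set this pairwise comparison would be run over all 19,144 companies to fill the distance matrix.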
Based on what we saw about the
positive impact of using grouped skills,
we can use their inter-dynamics to
compare similar companies with one
another.
With four categories of skills we
get 6 distinct pairs. Choosing
the appropriate pair depends on the
type of analysis and comparison we
want to perform on the similar
companies.
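The count of 6 is simply the number of unordered pairs that can be drawn from four categories, which can be enumerated directly:

```python
from itertools import combinations

categories = ["Knowledge", "Personal Qualities", "Skills", "Tools & Technologies"]

# Every unordered pair of distinct categories: 4 choose 2.
pairs = list(combinations(categories, 2))
```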
In the next slide you will see the
«Knowledge» and «Tools & Technologies»
plots for the companies «Pirelli» and «SAP».
20. The size of the circles shows the “Personal Qualities” skill group
[Figure panels: Pirelli, SAP]
21. Conclusion
While using all ESCO skills to cluster the companies seems
intimidating at the beginning, applying various clustering methods and
algorithms has shown that, unless the skills are grouped in a meaningful way before
they are used, the results have no added business value.