Session 2.2 ontology-guided job market demand analysis: a cross-sectional study for the data science field
1. Ontology-guided Job Market Demand Analysis:
A Cross-Sectional Study for the Data Science field
Elisa Margareth Sibarani, Simon Scerri, Camilo Morales, Sören Auer, Diego Collarana
12/09/2017 - 13/09/2017
SEMANTiCS 2017, Amsterdam, Netherlands
2. 2
Motivation (1)
Rapid changes in the job market
Continuous increase in new skills and new job profiles
New challenges for
• Job seekers are less informed about the skills in demand
• Educators are unable to offer courses that meet the expectations
Example skills: Deep learning, Scikit-learn, TensorFlow, Caffe, Theano, Torch, MXNet, Puppet, Ansible, Vagrant, …
4. 4
Goals
• Identify the most in-demand technical skills (use case: the data science field)
• Perform a cross-sectional analysis that focuses on a snapshot of demand (at a point in time)
• Target user groups:
• job seekers and applicants
• educators and training providers
• Use co-word analysis to identify and structure relationships among concepts
• Serve as a basis for our future work on time-series analysis
6. 6
Literature Review (relevant state of the art)
• Research on co-word analysis [1, 5, 7, 17]
• The potential of utilizing co-word analysis for investigating job
adverts [8, 12, 13, 9, 10]
• A review of the research methodology of 70 studies in LIS [6]
• OBIE applications [14, 18]
• IT skill analysis [11, 15, 16, 19]
7. 7
Novel contribution
• little co-word analysis research explores OBIE for keyword extraction
• provide the SARO ontology, a knowledge representation that serves as a reusable model
• implement an OBIE methodology
• a critical step in this study to cope with the “indexer effect” of co-word analysis
• utilize co-word analysis for knowledge discovery
• reveal skill demands together with their internal and external correlations with other skills
9. 9
Skills and Recruitment Ontology (SARO)
• the extension of two relevant models
• ESCO (labor market and its skills)
• Schema.org (job openings in organizations)
• goal: enable analysis and reuse of related tasks to interpret job
postings in the context of skills
• top-level view of SARO
• saro:JobPosting, extends the so:JobPosting concept in Schema.org and
defines essential attributes – including so:datePosted, so:jobLocation,
so:hiringOrganization, in addition to the saro:describes to state the
saro:jobRole
• saro:Skill, extends the esco:Concept, categorizes skills as job-specific or
transversal (cross-sector)
11. 11
Ontology-based Information Extraction (1)
• purpose: builds an index/keywords list as a basic pillar for co-word
analysis
• guided by an ontology (a certain model that specifies the objective
of the search)
• extracting pre-defined concepts and instances
• annotating text using concepts and instances
• why GATE?
• enables developers to implement a flexible IE system
• supplies robust evaluation tools for NLP
13. 13
Correlation Matrix Construction
• build a symmetric co-word matrix
• querying triples to retrieve occurrence and co-occurrence frequency
• transform co-word matrix into correlation matrix
• calculate equivalence index to measure strength of association between two
keywords
• $E_{ij} = \dfrac{C_{ij}^{2}}{C_i \cdot C_j}$, with $0 \le E_{ij} \le 1$
• Cij = number of job adverts where the skill pair appears
• Ci = number of times that the keyword i is used to index a document
• Cj = number of times that the keyword j is used to index a document
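The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `equivalence_index` and the toy adverts are assumptions, and the counts come from in-memory lists rather than from triples queried out of a knowledge base.

```python
from collections import Counter
from itertools import combinations


def equivalence_index(adverts):
    """Return {(skill_i, skill_j): E_ij} for every co-occurring skill pair.

    `adverts` is an iterable of skill lists, one list per job advert.
    """
    occurrence = Counter()    # C_i: number of adverts indexed with skill i
    cooccurrence = Counter()  # C_ij: number of adverts containing both i and j
    for skills in adverts:
        unique = sorted(set(skills))  # count each skill once per advert
        occurrence.update(unique)
        cooccurrence.update(combinations(unique, 2))
    # E_ij = C_ij^2 / (C_i * C_j), which always lies between 0 and 1
    return {
        (i, j): c_ij ** 2 / (occurrence[i] * occurrence[j])
        for (i, j), c_ij in cooccurrence.items()
    }


# Hypothetical toy data: each inner list is the skill set of one advert.
adverts = [
    ["Python", "SQL", "Hadoop"],
    ["Python", "SQL"],
    ["Python", "R"],
    ["SQL", "Hadoop"],
]
index = equivalence_index(adverts)
for pair, value in sorted(index.items(), key=lambda item: -item[1]):
    print(pair, round(value, 3))
```

Pair keys are stored in sorted order, so each unordered pair appears once; this corresponds to keeping only the upper triangle of the symmetric matrix.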
15. 15
Co-word Analysis for Job Adverts (2)
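[Flowchart of the two-pass clustering procedure. Pass-1: choose the link with the highest index, add other links, choose the corresponding nodes, and check whether the links exceed the threshold (NO/YES); bounded by the co-occurrence threshold and the maximum number of Pass-1 links. Pass-2: choose a link from a Pass-1 cluster, choose the Pass-1 node with the highest index, add links to other Pass-1 nodes, and check whether the links exceed the threshold (NO/YES); bounded by the co-occurrence threshold and the maximum number of Pass-1 and Pass-2 links.]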
16. 16
1st Study and Evaluation: OBIE Method (1)
• The objective is to evaluate:
• the adequacy of the OBIE method
• its performance compared to manual human extraction
• Data collection and pre-processing:
• 872 job adverts from August to November 2015, crawled from Adzuna.com
• 184 keywords (skills), ranging from 1 to 26 per advert
• 20% of job adverts had at least 10 keywords, and 95% had more than 1 keyword
17. 17
1st Study and Evaluation: OBIE Method (2)
• lowest F-score: SkillTool annotation
• 86 annotations in total: 37 consistent, 16 unique to the IAA set, and 33 unique to the OBIE method
• Investigating the results suggests possibilities for improvement:
• ambiguity remains a challenge, e.g. Hudson in “Hudson Global Resources Limited offers the services of an employment agency …” and excel in “If this role sounds like something which you could excel in, please do not hesitate …”
• skills unknown to SARO instances are not annotated, e.g. only SQL is marked up in Microsoft SQL Server
• missing synonyms result in incomplete annotations, e.g. MsOffice and MsAccess are marked up, but Microsoft Office and Access are missed
18. 18
2nd Study: Co-word for Skill Analysis (1)
• The objectives:
• proof-of-concept, ahead of next stage research (time-series analysis to
detect trends)
• identify its potential to yield useful insights
• some generic skills were dropped
• four frequently occurring field words were ignored (Analyst, Analysis, Analytics, Analytic)
• 180 skills under consideration
19. 19
2nd Study: Co-word for Skill Analysis (2)
• The observed co-occurrence counts range from:
• 1 (1087 skill pairs were observed only once in the job post sample)
• 167 (1 skill pair was observed in 167 job posts)
• To generate the clusters, we heuristically define two variants:
• 1st variant: co-occurrence threshold ≥ 10, Pass1Links = 3, and MaxLinks = 5
• 2nd variant: co-occurrence threshold ≥ 15, Pass1Links = 5, and MaxLinks = 8
• Why those numbers?
• to generate clusters that are not incomprehensibly large
• to avoid clusters bound together by very weak links
22. 22
Conclusion
• OBIE method, guided by a domain vocabulary (SARO)
• Evaluation shows the resulting F-measure of automated extraction
fares very well
• the co-word analysis study serves as a proof-of-concept to identify
skill demand composition and trends
• all raw elicited information, higher-level abstractions and results,
available in a standard format (RDF)
23. 23
Future Work
• OBIE method improvement
• Ontology Population
• skill synonyms
• new skill instances
• add more JAPE grammar rules
• entities written entirely in upper case (e.g. CVS, GO), starting in lower case and followed by digits (e.g. k8s), or consisting of a mix of upper-case and lower-case letters (e.g. LeSS, SaaS)
• perform time-series analysis and identify trends and shifts
• provide a Web-based User Interface
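The three token shapes named for the planned JAPE rules can be illustrated with regular expressions. The sketch below is written in Python, not JAPE, and is not part of the OBIE pipeline; the pattern names, the `token_shape` helper, and the exact boundaries of each pattern are assumptions. In particular, "mixed case" is read here as a lower-case letter directly followed by an upper-case letter, so that ordinary capitalised words are not matched.

```python
import re

# Assumed interpretations of the three shapes mentioned on the slide.
UPPER = re.compile(r"^[A-Z]{2,}$")                       # e.g. CVS, GO
LOWER_DIGIT = re.compile(r"^[a-z]+\d+[a-z0-9]*$")        # e.g. k8s
MIXED = re.compile(r"^[A-Za-z]*[a-z][A-Z][A-Za-z]*$")    # e.g. LeSS, SaaS


def token_shape(token):
    """Return the name of the first matching shape, or None."""
    if UPPER.match(token):
        return "upper"
    if LOWER_DIGIT.match(token):
        return "lower_digit"
    if MIXED.match(token):
        return "mixed"
    return None


for token in ["CVS", "GO", "k8s", "LeSS", "SaaS", "Python", "excel"]:
    print(token, token_shape(token))
```

Shape alone only flags candidate tokens; a token such as GO would still need context or an ontology lookup before being accepted as a skill.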
Scope
Portrays the skills in demand at a single point in time
Co-word analysis: when two skills often appear together, there is a high chance that they are strongly related
Occurrence and co-occurrence frequency of skill pairs
How to extract keywords prior to the co-word analysis? Ontology-based Information Extraction (OBIE)
The OBIE pipeline exploits an ontology: Skills and Recruitment Ontology (SARO)
Research on co-word analysis [1, 5, 7, 17]
reduces and projects data into a specific visual representation
The potential of utilizing co-word analysis for investigating job adverts
library and information science (LIS) [8, 12, 13]
information systems (IS) [9, 10]
A review of the research methodology of 70 studies in LIS [6]
3 out of 70 used automatic text analysis
18 out of 70 used inferential statistics
OBIE applications [14, 18]
IT skill analysis [11, 15, 16, 19]
What is TOBIE?
A framework for semantically extracting skills and information related to a job posting, and for analyzing trends and changes
TOBIE relies on a vocabulary to guide the information extraction process (http://vocol.iais.fraunhofer.de/saro/)
TOBIE employs co-word analysis to dynamically map skill demand and its structure (job role inquiry)
OBIE pipeline accepts as input:
job adverts dataset (JSON or XML)
SARO ontology
OBIE pipeline consists of:
linguistic analysis components (pre-processing)
named-entity recognizer based on the ontology
transducer with JAPE grammar rules
using SILK framework:
convert OBIE result in XML to RDF
load RDF triples to KB
a technique which exploits the use of co-occurrence phenomena
skills that occur together frequently are assumed to be related
the strength of that relationship is related to the co-occurrence frequency
networks of these co-occurring phenomena are constructed
utilizes the Paris/Keele method for cluster analysis
well suited to large and heterogeneous datasets
for a large number of keywords, this method is simpler and easier to present
the assumption: there is a cluster-type structure; there is no obligation to specify the number of clusters or to include all keywords in the cluster construction
Networks of these co-occurring phenomena are constructed, and then maps of evolving skills sets are generated using the link-node values of the networks. With these maps of skill structure and evolution, the labour market policy analyst/educators can develop a deeper understanding of the interrelationships among the different skill sets and the impacts of external intervention, and can recommend new directions for more desirable curricula.
@Inbook{Kostoff1993,
author="Kostoff, Ronald N.",
editor="Bozeman, Barry
and Melkers, Julia",
title="Co-Word Analysis",
bookTitle="Evaluating R{\&}D Impacts: Methods and Practice",
year="1993",
publisher="Springer US",
address="Boston, MA",
pages="63--78",
abstract="In formulating and executing broad spectrum research policy, it is important to understand how research thrusts have interrelated and evolved over time, how they are projected to evolve, and how different types of interventions from sponsors and policymakers can affect the evolution and impact of research. While a panel of experts could provide an acceptable view of the trends and interrelationships within a narrowly-defined research area, identification of the connectivity of a broad range of areas is well beyond the expertise of any one panel of experts, and perhaps beyond a group of panels. An integration of topics and trends requires supplementation to the standard peer or analyst group evaluation. Much recent effort has been focused on development of more objective quantitative approaches for analyzing and integrating written and survey information to supplement analysts or groups of peers in understanding research trends.",
isbn="978-1-4757-5182-6",
doi="10.1007/978-1-4757-5182-6_4",
url="https://doi.org/10.1007/978-1-4757-5182-6_4"
}
the algorithm, based on a threshold on co-occurrence frequency and on the number of links in one cluster (sub-network):
Pass-1:
starting linked nodes: choose the link with the highest coefficient
add other links and their corresponding nodes in decreasing order of the coefficient, following Breadth First Search (BFS)
stop when no more links exceed the threshold or the maximum number of Pass-1 links is reached
Pass-2:
extend each Pass-1 sub-network by adding links between nodes in different clusters
stop when no link meets the co-occurrence threshold, or the maximum total number of links (Pass-1 and Pass-2) is reached
Result: a keyword structure/map and a strategic diagram
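The two passes can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function name `two_pass_clusters`, the input layout, and the toy links are hypothetical, and Pass-1 grows each cluster greedily by strongest link first instead of following a strict breadth-first order. The parameters `max_pass1_links` and `max_total_links` play the role of Pass1Links and MaxLinks.

```python
def two_pass_clusters(links, min_cooccurrence, max_pass1_links, max_total_links):
    """Cluster skills from weighted links.

    `links` maps (skill_a, skill_b) to (co-occurrence count, equivalence index).
    """
    # Keep only links that meet the co-occurrence threshold, strongest first.
    strong = sorted(
        ((e, a, b) for (a, b), (c, e) in links.items() if c >= min_cooccurrence),
        reverse=True,
    )
    assigned = {}  # node -> index of the cluster that owns it
    clusters = []

    # Pass-1: seed a cluster with the strongest unused link, then grow it.
    for _, a, b in strong:
        if a in assigned or b in assigned:
            continue
        idx = len(clusters)
        cluster = {"nodes": {a, b}, "pass1": [(a, b)], "pass2": []}
        assigned[a] = assigned[b] = idx
        grown = True
        while grown and len(cluster["pass1"]) < max_pass1_links:
            grown = False
            for _, x, y in strong:
                x_in, y_in = x in cluster["nodes"], y in cluster["nodes"]
                if x_in == y_in:
                    continue  # link is fully inside or fully outside the cluster
                new_node = y if x_in else x
                if new_node in assigned:
                    continue  # node already belongs to another cluster
                cluster["nodes"].add(new_node)
                assigned[new_node] = idx
                cluster["pass1"].append((x, y))
                grown = True
                break  # restart from the strongest remaining link
        clusters.append(cluster)

    # Pass-2: add links that connect this cluster to nodes of other clusters.
    for idx, cluster in enumerate(clusters):
        for _, x, y in strong:
            if len(cluster["pass1"]) + len(cluster["pass2"]) >= max_total_links:
                break
            cx, cy = assigned.get(x), assigned.get(y)
            if cx is None or cy is None or cx == cy:
                continue
            if idx in (cx, cy):
                cluster["pass2"].append((x, y))
    return clusters


# Hypothetical toy links: (co-occurrence count, equivalence index).
links = {
    ("Python", "SQL"): (20, 0.50),
    ("Java", "Spark"): (18, 0.45),
    ("Python", "R"): (15, 0.40),
    ("SQL", "Hadoop"): (12, 0.30),
    ("Spark", "Hadoop"): (11, 0.20),
    ("Excel", "R"): (3, 0.60),  # dropped: below the co-occurrence threshold
}
clusters = two_pass_clusters(links, min_cooccurrence=10,
                             max_pass1_links=2, max_total_links=3)
for cluster in clusters:
    print(sorted(cluster["nodes"]), cluster["pass1"], cluster["pass2"])
```

Pass-1 links describe the internal structure of a cluster, while Pass-2 links record its connections to other clusters; the same external link is therefore listed under both clusters it joins.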
Gold standard and Inter-Annotator Agreement
two annotators annotated a random sample of 50 job adverts
strict IAA F-score = 94%
then, two additional and different samples of 50 job posts were manually annotated
this produced the gold standard of manually annotated job posts
OBIE evaluation:
OBIE was executed for the 150 posts, containing 1,760 sentences and 53,577 tokens
The Corpus Quality Assurance tool in GATE was used to calculate precision, recall, and the F1-score
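For reference, strict precision, recall, and F1 can be computed as in the sketch below. It is not GATE's implementation: the function name `strict_scores`, the tuple layout (document id, start offset, end offset, annotation type), and the toy annotations are assumptions, and "strict" is taken to mean that an automatic annotation counts as correct only if it matches a gold annotation exactly.

```python
def strict_scores(gold, predicted):
    """Return (precision, recall, f1) for exact-match annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (document id, start, end, type).
gold = {
    ("ad1", 0, 6, "Skill"),
    ("ad1", 20, 23, "Skill"),
    ("ad2", 5, 11, "Skill"),
    ("ad2", 30, 36, "Skill"),
}
predicted = {
    ("ad1", 0, 6, "Skill"),
    ("ad1", 20, 23, "Skill"),
    ("ad2", 5, 11, "Skill"),
    ("ad2", 31, 36, "Skill"),  # boundary mismatch, so not a strict match
    ("ad3", 0, 5, "Skill"),    # spurious annotation
}
precision, recall, f1 = strict_scores(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```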