Automatic Extraction of Job Information from Job Vacancies
1. eduworks-network.eu
facebook.com/eduworksnetwork
@EduworksNetwork
This project has been funded with support from the European Commission.
This communication reflects the views only of the author, and the Commission cannot be held responsible for any use which may be made of the information contained therein.
Vladimer Kobayashi
Advisers: Stefan Mol, Gábor Kismihók, and Deanne den Hartog
3. What is in a vacancy*?
• Worker-oriented domain
1. Worker characteristics
2. Worker requirements
3. Experience requirements
• Job-oriented domain
1. Occupational requirements
2. Workforce characteristics
3. Occupation-specific information
*Based on O*NET’s Content Model
4. Job Information Extraction from Vacancies
• Data Integration & Management: XML, databases
• Automatic Extraction: part-of-speech tagging, classification
• Presentation: summarisation, visualization, analytics
5. Method – Automatic Extraction
• Vacancies → Preprocessing and Segmentation → Sentences
• Sentences → Feature Extraction → Feature matrix
• Feature matrix → Classification model (Random Forest, SVM, and Naive Bayes) → Classified sentences → Validation
• Hard-to-classify sentences (Query by committee) → Expert → Newly expert-labelled sentences → Retrain
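The "query by committee" step in the method can be illustrated with a small sketch. It assumes that each committee member (for example the Random Forest, SVM, and Naive Bayes models) has already voted on every sentence, and it scores disagreement with vote entropy, which is one common choice and not necessarily the measure used in this work. The function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's label votes (0 means full agreement)."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_hard_sentences(sentences, committee_votes, k):
    """Return the k sentences on which the committee disagrees most."""
    scored = sorted(
        zip(sentences, committee_votes),
        key=lambda pair: vote_entropy(pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in scored[:k]]

# Toy data, invented for illustration only.
sentences = [
    "manage a team of nurses",
    "excellent communication skills",
    "you will have experience to lead projects",
]
# One vote per committee member for each sentence.
committee_votes = [
    ("activity", "activity", "activity"),
    ("attribute", "attribute", "attribute"),
    ("activity", "attribute", "activity"),
]

# Sentences selected here would be sent to the expert for labelling.
hard = select_hard_sentences(sentences, committee_votes, k=1)
print(hard)
```

Sentences returned by such a selection step would be labelled by the expert and added to the training set before retraining.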
6. Preprocessing
• Punctuation removal
• Lower case
• Sentence segmentation
• Stopword removal
• We do not remove these stopwords: "to", "have", "has", "had", "must", "can", "could", "may", "might", "shall", "should", "will", and "would"
7. Feature Type | Number of derived features | Variable Type
Part of speech (POS) tag of the first word | 1 | Categorical (actual POS)
Is the first word in this sentence unique to work activity sentences (based on the labelled data) | 1 | Numeric
Is the first word in this sentence unique to worker attribute sentences (based on the labelled data) | 1 | Numeric
Is the last word in this sentence unique to work activity sentences (based on the labelled data) | 1 | Numeric
Is the last word in this sentence unique to worker attribute sentences (based on the labelled data) | 1 | Numeric
8. Feature Type | Number of derived features | Variable Type
Proportion of adjectives | 1 | Numeric
Proportion of verbs | 1 | Numeric
Proportion of word “to” | 1 | Numeric
Proportion of modal verbs | 1 | Numeric
Proportion of numbers | 1 | Numeric
Proportion of adverbs | 1 | Numeric
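As an illustration of how the proportion features above might be computed, the sketch below takes a sentence that has already been POS-tagged and returns one value per feature. It assumes Penn Treebank tags (JJ, VB, MD, CD, RB) produced by some upstream tagger; the slides do not say which tagger or tagset was used, and the function name is hypothetical.

```python
def pos_proportions(tagged):
    """Proportion features for one sentence given (word, tag) pairs."""
    n = len(tagged)
    if n == 0:
        return {}

    def share(predicate):
        # Fraction of tokens in the sentence that satisfy the predicate.
        return sum(1 for word, tag in tagged if predicate(word, tag)) / n

    return {
        "adjectives": share(lambda w, t: t.startswith("JJ")),
        "verbs": share(lambda w, t: t.startswith("VB")),
        "to": share(lambda w, t: w.lower() == "to"),
        "modals": share(lambda w, t: t == "MD"),
        "numbers": share(lambda w, t: t == "CD"),
        "adverbs": share(lambda w, t: t.startswith("RB")),
    }

# Hand-tagged toy sentence, assuming Penn Treebank tags.
tagged = [("must", "MD"), ("have", "VB"), ("2", "CD"),
          ("years", "NNS"), ("experience", "NN")]
features = pos_proportions(tagged)
print(features)
```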
9. Feature Type | Number of derived features | Variable Type
Proportion of nouns | 1 | Numeric
Proportion of nouns, verbs, adjectives, adverbs, and other part of speech tags followed by another verb | 5 |
Proportion of unique words found only in work activity sentences (based on the labelled data) | 1 | Numeric
Proportion of unique words found only in worker attribute sentences (based on the labelled data) | 1 | Numeric
Frequency of keywords for work activity and worker attribute sentences | 149 | Numeric
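The "unique words" features rely on vocabularies derived from the labelled data. A minimal sketch of one way to build them is shown below, assuming tokenised sentences for each class; the helper names and the toy sentences are hypothetical and are not taken from the authors' data.

```python
def unique_vocab(target_sentences, other_sentences):
    """Words that occur in target_sentences but never in other_sentences."""
    target = {w for s in target_sentences for w in s}
    other = {w for s in other_sentences for w in s}
    return target - other

def proportion_unique(tokens, vocab):
    """Share of a sentence's tokens that belong to the given vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocab) / len(tokens)

# Toy labelled data, invented for illustration only.
activity = [["manage", "patient", "care"], ["prepare", "reports"]]
attribute = [["excellent", "patient", "communication"],
             ["good", "reports", "writing"]]

activity_only = unique_vocab(activity, attribute)
attribute_only = unique_vocab(attribute, activity)

new_sentence = ["prepare", "patient", "care", "plans"]
features = {
    "first_word_unique_activity": int(new_sentence[0] in activity_only),
    "last_word_unique_activity": int(new_sentence[-1] in activity_only),
    "prop_unique_activity": proportion_unique(new_sentence, activity_only),
    "prop_unique_attribute": proportion_unique(new_sentence, attribute_only),
}
print(features)
```

The same vocabularies can serve both the first-word and last-word indicators and the proportion features, since all of them ask whether a word appears in only one class of labelled sentences.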
11. Key results
• We identified
• 270,000 work activity sentences
• 317,000 worker attribute sentences
• Classifier is at least 90 percent accurate (10-fold cross-validation)
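The 10-fold cross-validation behind the accuracy figure can be sketched as follows. This is a pure-Python illustration of the protocol only, not the authors' evaluation code: `k_fold_accuracy`, `train_majority`, and the toy labels are hypothetical, and folds are formed by simple striding without shuffling or stratification.

```python
from collections import Counter

def k_fold_accuracy(examples, labels, train_fn, k=10):
    """Mean accuracy over k folds; train_fn(xs, ys) must return a predict function."""
    n = len(examples)
    accuracies = []
    for fold in range(k):
        # Every k-th example, starting at `fold`, is held out for testing.
        test_idx = set(range(fold, n, k))
        train_x = [examples[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        predict = train_fn(train_x, train_y)
        correct = sum(1 for i in test_idx if predict(examples[i]) == labels[i])
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

def train_majority(train_x, train_y):
    """Placeholder classifier that always predicts the most frequent training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda example: majority

# Toy data: 20 placeholder sentences, 15 of one class and 5 of the other.
examples = [f"sentence {i}" for i in range(20)]
labels = ["activity"] * 15 + ["attribute"] * 5

accuracy = k_fold_accuracy(examples, labels, train_majority, k=10)
print(accuracy)
```

In practice the placeholder would be replaced by a real classifier trained on the feature matrix, and the data would usually be shuffled before the folds are formed.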
14. Word2vec – word similarity
Words similar to "communication"
Word | Cosine similarity
interpersonal | 0.90
verbal | 0.90
skills | 0.88
written | 0.85
strong | 0.84
excellent | 0.83
good | 0.83
communicator | 0.81
ability | 0.80
organisational | 0.80
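The cosine similarity scores in this table measure the angle between word vectors. The sketch below shows the computation on toy three-dimensional vectors that are invented for illustration; real word2vec embeddings have many more dimensions and are learned from the vacancy text, and the `most_similar` helper here is a hypothetical stand-in for what an embedding library would provide.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only.
vectors = {
    "communication": [1.0, 2.0, 2.0],
    "interpersonal": [2.0, 4.0, 4.0],
    "skills": [1.0, 2.0, 1.0],
    "forklift": [2.0, -1.0, 0.0],
}

ranking = most_similar("communication", vectors)
for other, score in ranking:
    print(f"{other}\t{score:.2f}")
```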
27. Challenges
• Labeling data is time-consuming
• Choose which data to label
• Make use of unlabeled data
• Crowdsource the labeling
28. Summary
• Job vacancies as a source of job information
• Apply techniques from text mining and machine learning to perform the job information extraction
• Contribution to Job Analysis, Job Test Validation, and Career Planning
• Benefits job-seekers and recruiters
29. Key Publications
2017
Kobayashi, V. B., Berkers, H. A., Mol, S. T., Kismihok, G., & Den Hartog, D. N. (2017). Text Mining in Organizational Research. Organizational Research Methods. Manuscript in preparation.
Kobayashi, V. B., Mol, S. T., Berkers, H. A., Kismihok, G., & Den Hartog, D. N. (in press). Text Classification for Organizational Research: A Tutorial. Organizational Research Methods.
2016
Kobayashi, V., Mol, S. T., Kismihok, G., & Hesterberg, M. (2017). Automatic Extraction of Nursing Tasks from Online Job Vacancies. In M. Fathi, M. Khobreh, & F. Ansari (Eds.), Professional Education and Training through Knowledge, Technology and Innovation (pp. 51-56). Siegen, Germany: Universitätsverlag Siegen.
30. This work was supported by the European Commission through the Marie-Curie Initial Training Network EDUWORKS (grant number PITN-GA-2013-608311)