SKILL: A System for Skill Identification and Normalization
Meng Zhao, Faizan Javed, Ferosh Jacob, Matt McNair
© 2014 CareerBuilder ► 2 ◄
Taxonomy
Surface Forms
Normalized Entity Name
Noises
Selected Sections
Deduplication
Blacklist
Wiki Category Tags
BLS SOC System
Capability, Knowledgeability, Technology, Terminology
Surface Forms
categories keywords like school, company, person, etc.
Most Likely Sense
Skills Sense (BI -> Business Intelligence)
Google Search (SVM -> Support Vector Machine)
• Tokenize input text and assemble n-grams
• Match n-grams directly with Taxonomy
• Example labels: Date of Birth; Birth; Childbirth; Doomed
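The two steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the regex tokenizer, the function names `ngrams` and `match_taxonomy`, and the toy taxonomy are all assumptions made for the example. The 5-token limit comes from the speaker notes. The sample text also shows how direct matching can pick up "birth" from "Date of Birth", which is presumably the kind of false positive the slide's example points at.

```python
import re

def ngrams(tokens, max_n=5):
    """Yield every run of 1..max_n consecutive tokens, joined by spaces."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def match_taxonomy(text, surface_forms, max_n=5):
    """Return n-grams of `text` that appear verbatim in `surface_forms`."""
    # Assumption: lowercase and keep only alphanumeric runs as tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    matched = []
    for gram in ngrams(tokens, max_n):
        if gram in surface_forms and gram not in matched:
            matched.append(gram)  # keep first-seen order, no duplicates
    return matched

# Hypothetical toy taxonomy of surface forms.
taxonomy = {"birth", "java", "machine learning"}
text = "Date of Birth: 1990. Skills: Java, machine learning."
matches = match_taxonomy(text, taxonomy)
print(matches)
```

Direct matching alone has no notion of context, so a later relevance-scoring step is needed to filter out matches such as "birth" in this sample.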
• Neural Network Language Model
• Input is a corpus and output is a Huffman tree
• Given a word, predicts its context (or vice versa)
• Mikolov, T. et al., ICLR 2013 [1]
• Don’t count, predict! (Baroni et al., ACL 2014 [2])
[1] Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[2] Baroni, M., Dinu, G., and Kruszewski, G. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL.
• Training data: surface forms ONLY
• Substitute ‘\s+’ with ‘_’
• Vector size: 200
• Skip-gram model with hierarchical softmax (Mikolov et al., ASRU 2011*)
• Min-count: 1
* Mikolov, T., Deoras, A., Povey, D., Burget, L., and Černocký, J. 2011. Strategies for Training Large Scale Neural Network Language Models. ASRU.
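As a small illustration of the preprocessing bullet, the sketch below reads the substitution as "replace each run of whitespace with an underscore", so that a multi-word surface form becomes a single training token. That reading is an assumption, as are the helper name `to_token`, the leading/trailing strip, and the key names in `settings`, which simply restate the slide's values and do not belong to any particular library.

```python
import re

def to_token(surface_form):
    """Collapse each whitespace run to '_' so the phrase is one token."""
    # Assumption: surrounding whitespace is dropped before substitution.
    return re.sub(r"\s+", "_", surface_form.strip())

# Training settings as listed on the slide (key names are illustrative).
settings = {
    "vector_size": 200,
    "architecture": "skip-gram",
    "output_layer": "hierarchical softmax",
    "min_count": 1,
}

tokens = [to_token(s) for s in ["machine  learning", "business intelligence", "java"]]
print(tokens)
```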
Taxonomy
Surface Forms
Normalized Entity Name
Word2vec Vectors
• Collect seed skill surface forms by direct matching
• For each seed surface form x_i, count the other seed surface forms that show up in its word2vec vector
• Choose skills by a user-defined cutoff on confidence scores (default: 0.5)
• If the number of words is below 150, return all skills
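The scoring steps above might look roughly like this. It is a sketch under stated assumptions rather than the system's implementation: each surface form's "vector" is modelled as a hypothetical `neighbors` mapping from a surface form to the set of surface forms related to it, the confidence score is the fraction of other seeds found in that set, a lone seed is given a score of 1.0, and the 150-word rule is applied to the length of the input text.

```python
def relevance_scores(seeds, neighbors):
    """Score each seed by the share of other seeds found among its neighbors."""
    unique = list(dict.fromkeys(seeds))  # drop duplicates, keep order
    scores = {}
    for seed in unique:
        others = [s for s in unique if s != seed]
        if not others:
            scores[seed] = 1.0  # assumption: a single seed is kept as-is
            continue
        related = neighbors.get(seed, set())
        hits = sum(1 for s in others if s in related)
        scores[seed] = hits / len(others)
    return scores

def select_skills(seeds, neighbors, n_words, cutoff=0.5, min_words=150):
    """Keep seeds whose score reaches the cutoff; short inputs keep all."""
    unique = list(dict.fromkeys(seeds))
    if n_words < min_words:
        return unique
    scores = relevance_scores(unique, neighbors)
    return [s for s in unique if scores[s] >= cutoff]

# Hypothetical neighbor sets standing in for the word2vec output.
neighbors = {
    "java": {"python", "sql"},
    "python": {"java", "sql"},
    "sql": {"java", "python"},
    "birth": {"childbirth"},
}
seeds = ["java", "python", "sql", "birth"]
print(select_skills(seeds, neighbors, n_words=200))
print(select_skills(seeds, neighbors, n_words=100))
```

With these toy inputs, "birth" has no other seed among its neighbors, so it falls below the cutoff for the longer text and is kept only when the short-text exception applies.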
• Taxonomy Precision: 90%.
• Taxonomy Recall: 70%. CB Taxonomy ∩ ESCO Taxonomy (50K vs 5K).
• ESCO is a systematic EU government initiative for complete workforce
analytics.
• Tagging: Precision 82%; Recall: 70%.
% of Approved Skills    # of Responses    Cumulative %
100%                    902 (28%)         28%
90% - 99%               661 (21%)         49%
80% - 89%               618 (19%)         68%
70% - 79%               432 (13%)         81%
60% - 69%               251 (8%)          89%
50% or less             352 (11%)         100%
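For readers who want to check the table, the share and cumulative columns can be recomputed from the response counts. The sketch below assumes both columns are percentages of the total number of responses, rounded to the nearest whole number; the function name `response_shares` is made up for the example.

```python
def response_shares(rows):
    """Return (label, count, share %, cumulative %) for each row."""
    total = sum(count for _, count in rows)
    running = 0
    table = []
    for label, count in rows:
        running += count
        table.append((label, count,
                      round(100 * count / total),
                      round(100 * running / total)))
    return table

# Response counts taken from the table above.
rows = [
    ("100%", 902),
    ("90% - 99%", 661),
    ("80% - 89%", 618),
    ("70% - 79%", 432),
    ("60% - 69%", 251),
    ("50% or less", 352),
]
table = response_shares(rows)
for label, count, share, cumulative in table:
    print(f"{label:<12} {count:>4} ({share}%)  {cumulative}%")
```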
Web service

Editor's Notes

  • #3 So why am I doing this? It started with my personal experience: I used to teach at a third-tier university in downtown Atlanta. For a high-schooler from southern Atlanta, entering college is a big thing for the family, because many of these students are among the first in their families to earn a college degree since the Civil War. For the students, it is more of a dream-come-true, happily-ever-after kind of thing. The reality, however, is that they carry heavy debt for college, and what is worse, many cannot find a job after graduation. Educators at a third- or fourth-tier university here don’t really know what to teach, and students have no idea what they are learning. More unfortunately, employers simply can’t find the right candidates. After joining CareerBuilder, I realized this is part of a bigger socioeconomic problem called “the skills gap”. I wanted to do something about it, because it always feels great to work on something with both commercial value and social good, and it is also part of the company’s mission. The system I’m presenting today is the first step I am able to take: to fill the gap, I feel we first need to know what defines a skill.
  • #5 Stopwords; city and country names; rhetorical terms commonly used in resumes.
  • #8 Tokenize the input text and remove irrelevant symbols. Assemble up to 5 sequential tokens into n-grams and match them against the Taxonomy. Anything in common is retained. Happy boss and bonus!
  • #11 Collect surface forms by direct matching. Given a candidate surface form, calculate the proportion of relevant surface forms out of all surface forms; a surface form is relevant if it shows up in the candidate's w2v vector. Normalize the frequency to a probability. Truncate the output at a defined cutoff (default 0.5). Allow exceptions.