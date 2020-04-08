
Level-based Resume Classification on Nursing Positions. Haoqi Gu. Advisor: Jinho D. Choi. April 3rd, 2020
Undergraduate Honors Thesis, Emory University
Authors: Haoqi Gu
Year: 2020

Published in: Technology
  1. Level-based Resume Classification on Nursing Positions. Haoqi Gu. Advisor: Jinho D. Choi. April 3rd, 2020
  2. Outline: ➢ Introduction ➢ Background ➢ Dataset ➢ Approach ➢ Experiments ➢ Conclusion
  3. Introduction ● Document Classification ○ Similar Characteristics ● Text, image, music, etc. ● Text Classification ● Resume Classification ● Motivations ○ Tedious process of hiring applicants at Emory University
  4. Background ● Related work: Job Category Classification ○ Apply domain adaptation to a set of job descriptions and use the resulting model to classify resumes indirectly (indirect classification into job categories) ○ Classify resumes into categories by interests, education, and so on ● Comparison: Level-based Job Classification ○ Similar resumes (applications for the same job) ○ Classify resumes into levels (different expertise levels) ● New Dataset Uniqueness ○ Real applicants' resumes from Emory University ○ One specific job: clinical research coordinator ○ No similar dataset online ○ No similar research
  5. Dataset ● Dataset Statistics ● Annotation ● Format Conversion and Section Extraction (DOCX) ● Format Conversion and Section Extraction (PDF) ● Extraction Accuracy ● Preprocessed Dataset
  6. Dataset Statistics ● Job Position: Clinical Research Coordinator (CRC) ● 6512 application resumes ● Four levels of CRC positions: CRC I, CRC II, CRC III, and CRC IV ● Ratio of applicants across the four levels (CRC I : II : III : IV): 23:17:8:2 ● 2025 resumes randomly selected: 932 CRC I, 688 CRC II, 324 CRC III, 81 CRC IV ○ Advantages: the sample mimics the population and contains training samples for every level
  7. Dataset Annotation ● Done by Dr. Elaine Fisher, Rebecca Thomas, Charlie Williams and Sabrina Sabir from the School of Nursing ● 250 resumes were double-checked
  8. Dataset Description
  9. Difficulties ● Resume formats differ ○ As shown in the previous picture ● Resume lengths differ ○ Long resumes of more than 1000 tokens are common ○ Some are very short ● Lots of noise ○ Side information ● First work to mine these resumes
  10. Format Conversion and Section Extraction (DOCX) ● Sample original format ● DOCX to HTML file (API: Pypandoc)
  11. Format Conversion and Section Extraction (DOCX) ● Sample converted format ● Extract the HTML tags used for section titles (<p><strong>...</strong></p>) ● Mark all lines that contain these HTML tags ● Split the whole resume at these lines
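The tag-based splitting on slide 11 can be sketched roughly as follows. This is a minimal illustration rather than the thesis code: it assumes the DOCX file has already been converted to an HTML string, that each title sits alone on a line as `<p><strong>Title</strong></p>`, and that text before the first title goes into an "Other" bucket. The function name `split_html_sections` and the sample HTML are hypothetical.

```python
import re

# Assumption: a section title is a paragraph whose whole content is bold,
# i.e. <p><strong>Title</strong></p>, as described on the slide.
TITLE_RE = re.compile(r"<p><strong>(.*?)</strong></p>")

def split_html_sections(html):
    """Split converted HTML into {section title: list of text lines}."""
    sections = {}
    current = "Other"  # hypothetical bucket for text before the first title
    for line in html.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TITLE_RE.fullmatch(line)
        if match:
            # A title line starts a new section.
            current = match.group(1).strip()
            sections.setdefault(current, [])
        else:
            # Strip the remaining tags and keep only the visible text.
            text = re.sub(r"<[^>]+>", "", line).strip()
            if text:
                sections.setdefault(current, []).append(text)
    return sections

sample_html = """<p>Jane Doe</p>
<p><strong>Education</strong></p>
<p>B.S. in Biology</p>
<p><strong>Work Experience</strong></p>
<p>Research assistant, 2 years</p>"""
html_sections = split_html_sections(sample_html)
```

Real resumes often use bold text for things other than titles, so a production version would likely need a whitelist of title names on top of the tag check.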
  12. Format Conversion and Section Extraction (PDF) ● Sample original files ● PDF to plain text file (API: Tika Parser)
  13. Format Conversion and Section Extraction (PDF) ● Sample converted format ● Use string matching to catch section titles ● Mark all lines that contain a section title ● Split the whole resume at these lines
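For the PDF path, the title matching on slide 13 might look like the sketch below. It assumes the PDF has already been converted to plain text. The keyword table `SECTION_TITLES` is hypothetical, since the slides do not list the strings that were actually matched.

```python
# Hypothetical title keywords mapped to canonical section names; the actual
# list used in the thesis is not given on the slides.
SECTION_TITLES = {
    "work experience": "Work Experience",
    "experience": "Work Experience",
    "education": "Education",
    "skills": "Skills",
    "activities": "Activities",
    "profile": "Profile",
}

def split_text_sections(text):
    """Split plain resume text into {section name: list of lines}."""
    sections = {}
    current = "Other"  # text before the first recognized title
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        key = stripped.rstrip(":").lower()
        if key in SECTION_TITLES:
            # The line is a section title: start collecting under it.
            current = SECTION_TITLES[key]
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(stripped)
    return sections

sample_text = """Jane Doe
EDUCATION
B.S. in Biology

Work Experience:
Clinical research coordinator, 3 years"""
text_sections = split_text_sections(sample_text)
```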
  14. Resume ID with Different Sections
  15. Extraction Accuracy ● 100 resumes checked manually ● Extraction accuracy: 97% ● 6 sections extracted: Work Experience, Education, Activities, Skills, Profile and Other ● Sample processed data (.JSON)
  16. Sample Preprocessed Data ● A list of resumes ● Each resume contains several sections ● Each section records the section name, the section content, and the tokenized lines.
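One possible shape for the preprocessed JSON is sketched below. The field names (`id`, `sections`, `name`, `content`, `tokenized_lines`) and whitespace tokenization are assumptions made for illustration; the slides only state that each section stores its name, its content, and its tokenized lines.

```python
import json

def make_section(name, content):
    """Build one section record; whitespace tokenization is an assumption."""
    lines = [ln for ln in content.splitlines() if ln.strip()]
    return {
        "name": name,
        "content": content,
        "tokenized_lines": [ln.split() for ln in lines],
    }

# A dataset is a list of resumes, each holding a list of sections.
resume = {
    "id": "resume_0001",  # hypothetical identifier format
    "sections": [make_section("Education", "B.S. in Biology\nEmory University")],
}
dataset = [resume]
serialized = json.dumps(dataset, indent=2)
```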
  17. Approach ● String matching ● Feature vector ● Bags of Words ● Ensemble Models
  18. Approach (String matching) ● Idea: compare the key information in resumes against the guidelines ● Sample guidelines (Work experience + Education) ● Extract key information from resumes as key phrases ○ Work Experience 5-by-1 vector: [3.333, 0, 1, 0, 0] ○ Education 7-by-1 vector: [ [ ],[ ],['master', 1],['bachelor',0],[ ],[ ],[ ] ] ● Multiple if-statements check whether the resume contains the information required by the guideline
  19. Sample Guideline
  20. String Matching (Work Experience Vector) ● Guideline ○ Five classes of work experience mentioned: ■ 0: not related; 1: clinical setting; 2: clinical research; 3: clinical intern; 4: lab experience ■ The type of work experience is clearly introduced ● Extraction ○ String matching and regular expressions ○ Sample: re.compile(r'((\s*(p).?(h).?(d).?,?\s*)$)|(((\s*(m|pharm).?(d).?,?\s*)$)|(lab scientist)$)')
  21. String Matching (Work Experience Vector) ● 5-by-1 vector ○ The first entry represents the presence of unrelated work experience in the resume ○ The second entry represents the presence of work experience in a clinical setting ○ ...Similar for the rest ● Entry value ○ For each entry, the value is the duration (in years) of that type of work experience ○ 0 means no work experience of that type ○ Sample: [0, 1, 0, 0, 0]
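A small sketch of how the 5-by-1 work experience vector could be assembled once each job has been assigned a class and a duration. Summing durations per class and rounding to three decimals are assumptions (the sample value 3.333 suggests a month count divided by 12). The helper name `work_experience_vector` is hypothetical.

```python
# Class indices follow the guideline on slide 20.
WORK_CLASSES = [
    "not related",
    "clinical setting",
    "clinical research",
    "clinical intern",
    "lab experience",
]

def work_experience_vector(jobs):
    """jobs: iterable of (class_index, duration_in_months) pairs.

    Returns the total duration in years for each work experience class.
    """
    months = [0] * len(WORK_CLASSES)
    for class_index, duration in jobs:
        months[class_index] += duration  # assumption: durations add up per class
    return [round(m / 12, 3) for m in months]

# 40 months of unrelated work plus 12 months of clinical research.
work_vec = work_experience_vector([(0, 40), (2, 12)])
```

With these inputs the result has the same form as the sample vector on slide 18.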
  22. String Matching (Education) ● Guideline ○ Seven degrees of education mentioned: ■ 'No college','Technical diplo','Associate', 'Bachelor','Master','PHD', 'MD' ■ Each degree is either a science-and-health-related degree or a non-science, non-health-related degree (clearly introduced) ● Extraction ○ Regular expressions ○ Sample: re.compile(r'(\s*(b).?(a|s|sc).?(chelor|chelor’s|chelors|s)?,?\s*)$') (Bachelor degree)
  23. String Matching (Education) ● 7-by-1 vector ○ The first entry represents the presence of "no college degree" ○ ...Similar for the rest ● Entry value ○ For each entry, the value encodes the type of the degree: ■ 0: degree not mentioned ■ 1: mentioned, and neither science- nor health-related ■ 2: mentioned, and science-and-health-related ○ Sample: [0,0,0,0,1,0,0]
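The education encoding can be illustrated with the sketch below. The regular expressions here are simplified stand-ins, not the patterns from the thesis, and the keyword list used to decide whether a degree is science-and-health-related is hypothetical. Patterns this loose will produce false positives on real text (for example "Ms." or the state abbreviation "MA").

```python
import re

DEGREES = ["No college", "Technical diplo", "Associate", "Bachelor", "Master", "PHD", "MD"]

# Hypothetical, simplified patterns; degrees without a pattern are never detected.
DEGREE_PATTERNS = {
    "Associate": re.compile(r"\bassociate(?:'s|s)?(?![a-z])", re.IGNORECASE),
    "Bachelor": re.compile(r"\b(?:bachelor(?:'s|s)?|b\.?(?:a|s|sc)\.?)(?![a-z])", re.IGNORECASE),
    "Master": re.compile(r"\b(?:master(?:'s|s)?|m\.?(?:a|s|sc)\.?)(?![a-z])", re.IGNORECASE),
    "PHD": re.compile(r"\bph\.?\s?d\.?(?![a-z])", re.IGNORECASE),
    "MD": re.compile(r"\bm\.?d\.?(?![a-z])", re.IGNORECASE),
}

# Hypothetical keywords marking a science-and-health-related field.
SCIENCE_HEALTH_KEYWORDS = ("biology", "chemistry", "nursing", "health", "science")

def education_vector(lines):
    """Encode education lines as a 7-entry vector with values 0, 1, or 2."""
    vector = [0] * len(DEGREES)
    for line in lines:
        lowered = line.lower()
        related = any(word in lowered for word in SCIENCE_HEALTH_KEYWORDS)
        for index, degree in enumerate(DEGREES):
            pattern = DEGREE_PATTERNS.get(degree)
            if pattern and pattern.search(line):
                # 2: science-and-health-related, 1: mentioned but unrelated.
                vector[index] = max(vector[index], 2 if related else 1)
    return vector

edu_vec = education_vector(["Master of Arts in History"])
```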
  24. String Matching (Judgement) ● Every resume has two vectors, which represent the key information in the work experience and education sections. ● Sample: Work Experience: [ 0, 0, 3, 0, 0 ] Education: [ 0, 0, 0, 1, 0, 0, 0 ], which fulfills the second requirement.
  25. Approach (Feature Vector) ● Previous vectors ● Combine the work experience and education vectors into a 12-by-1 feature vector ● Models trained on the feature vector: Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machine, and ensemble models.
  26. Approach (Bags of Words) ● TF-IDF of a specific word: ○ Term frequency: raw count of the specific word in the resume ○ Inverse document frequency: the numerator is the total number of resumes (2025); the denominator is the number of resumes that contain this word
  27. Approach (Bags of Words) ● 125-by-1 vector, with each entry representing one of 125 frequent words ● Models trained on bags of words: Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machine, and ensemble models.
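The bag-of-words features can be sketched as below. The slides give the raw count and the ratio of total resumes to resumes containing the word. Taking the natural logarithm of that ratio is the standard TF-IDF form and is an assumption here, as are the function names and the toy documents. The thesis uses 125 frequent words, while the example keeps only 2.

```python
import math
from collections import Counter

def top_words(docs, k):
    """Return the k most frequent tokens across all documents."""
    totals = Counter(token for doc in docs for token in doc)
    return [word for word, _ in totals.most_common(k)]

def tfidf_vectors(docs, vocab):
    """TF-IDF per document: raw count * log(N / document frequency)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    doc_freq = {w: sum(1 for c in counts if w in c) for w in vocab}
    vectors = []
    for c in counts:
        vectors.append([
            c[w] * math.log(n_docs / doc_freq[w]) if doc_freq[w] else 0.0
            for w in vocab
        ])
    return vectors

# Toy tokenized resumes (hypothetical data).
docs = [
    ["clinical", "research", "clinical"],
    ["research", "lab"],
    ["research", "nurse"],
]
vocab = top_words(docs, 2)           # the thesis keeps 125 words instead of 2
bow_vectors = tfidf_vectors(docs, vocab)
```

A word that appears in every resume gets weight 0 under this formula, which is why "research" contributes nothing in the toy example.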
  28. Approach (Ensemble Models) Sections for training: 1) Work Experience 2) Education 3) Work Experience + Education ● Ensemble Model 1 ○ Select the single best model for each of work experience, education, and work experience + education (3 models in total) ○ Combine their predictions and vote for the final result. ● Ensemble Model 2 ○ From all models trained on work experience, education, and work experience + education, select the top 3 models for each (9 models in total) ○ Combine the three models' predictions for each part, and vote for the final result. ● Ensemble Model 3 ○ Combine the 9 models from Ensemble 2 ○ Aggregate their predictions and vote for the final prediction.
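The voting schemes can be sketched as follows. This reflects one reading of the slide: Ensemble 2 votes within each of the three section groups and then across the three group results, while Ensemble 3 pools all nine predictions into a single vote. The tie-breaking rule (the label seen first wins) and the example labels are assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_two(grouped_predictions):
    """Vote inside each group of models, then vote across the group winners."""
    return majority_vote([majority_vote(group) for group in grouped_predictions])

def ensemble_three(grouped_predictions):
    """Pool every model's prediction into one flat vote."""
    return majority_vote([p for group in grouped_predictions for p in group])

# Predictions for one resume from 9 hypothetical models, grouped by the
# section they were trained on (work experience, education, both).
grouped = [
    ["CRC I", "CRC I", "CRC II"],
    ["CRC I", "CRC I", "CRC II"],
    ["CRC II", "CRC II", "CRC II"],
]
two_result = ensemble_two(grouped)
three_result = ensemble_three(grouped)
```

The example is chosen so that the two schemes disagree: the grouped vote returns CRC I, while the pooled vote returns CRC II because five of the nine models predicted it.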
  29. Experiments ● Data Split ○ Training (75%), development (10%), and test (15%) ● Results ● Error Analysis
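A minimal sketch of a 75/10/15 split over the 2025 selected resumes. Whether the original split was random or stratified by level is not stated on the slides, so the shuffled split, the fixed seed, and the rounding (leftover items go to the test set) are all assumptions.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle and split into training (75%), development (10%), and test (rest)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = int(0.75 * len(items))
    n_dev = int(0.10 * len(items))
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

# Resume ids 0..2024 stand in for the 2025 annotated resumes.
train_set, dev_set, test_set = split_dataset(range(2025))
```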
  30. Results (String matching) ● String matching (only on work experience): 42.92 ● String matching (only on education): 1.43 ● String matching (both work experience and education): 50.05
  31. Results (Feature Vectors) Training on different sections of the resume data, the best accuracy is 61.88%.
  32. Random Forest Model Development ● Number of estimators ● Maximum depth of the random forest
  33. Gradient Boosting Model Development ● Number of estimators ● Learning rate
  34. Results (Bags of Words) ● Training on different sections of the resume data using bags of words, the best accuracy is 61.88%.
  35. Ensemble Models ● Ensemble models are chosen based on the previous results. ● Ensemble Model 1 ○ Work Experience: GB(FV) ○ Education: SVM(BoW) ○ Both: GB(BoW) ○ Combine 3 models ● Ensemble Model 2 ○ Work Experience ■ LR(FV)+RF(FV)+GB(FV) ○ Education ■ LR(BoW)+RF(BoW)+SVM(BoW) ○ Both: ■ LR(BoW)+RF(BoW)+GB(BoW) ○ Combine 9 models.
  36. Ensemble 1 and 2 ● Ensemble Model 1 ○ GB(FV) ○ SVM(BoW) ○ GB(BoW) ● Ensemble Model 2 ○ Work Experience ■ LR(FV)+RF(FV)+GB(FV) ○ Education ■ LR(BoW)+RF(BoW)+SVM(BoW) ○ Both: ■ LR(BoW)+RF(BoW)+GB(BoW)
  37. Ensemble 3 See the performance of Ensemble 2 when trained on different sections.
  38. Error Analysis (1) Annotation bias: 8 resumes may have been annotated incorrectly according to the guideline. (2) PDF file problem: 1 file is a scanned document in PDF format. (3) DOCX file problem ● 2 DOCX files use a fancy structure that cannot be converted cleanly to HTML. ● 1 DOCX file uses inconsistent section titles: the education title is an image while the work experience title is text, which makes the titles hard for the HTML converter to detect. (4) Writing style problem: 2 resumes do not follow common formats in the education and work experience sections, which makes information extraction less effective.
  39. Conclusion What's new in this thesis: ● Created and cleaned up a new, real dataset for level-based resume classification. ● Trained preliminary models on this dataset and obtained good results. ● Promising future research: improve the results by trying other models.

