Determining Column Numbers in Résumé with Clustering

Hello, I am Yavuz Balı, an R&D engineer at Kariyer.net. Today I will present the study "Determining Column Numbers in Résumé with Clustering", which is part of this project.

Determining Column Numbers in Résumé
with Clustering
Şeref Recep Keskin
R&D Engineer
Kariyer.net
Yavuz Balı
R&D Engineer
Kariyer.net
Günce Kezban Orman
Department of Computer Engineering
Galatasaray University
Sultan N. Turhan
Department of Computer Engineering
Galatasaray University
F. Serhan Daniş
Department of Computer Engineering
Galatasaray University

Contents
• Introduction
• Dataset
• Methodology
• Experiment
• Conclusion and Future Directions

Introduction
About Kariyer.net
• Kariyer.net is Turkey's first and leading job and employee recruitment
platform, established in 1999.
• More than 1 million job postings are viewed daily, and around 300K job
applications are received.
• It has been a Research and Development center since 2014, the first in
the Turkish Internet industry.
• It mediates the employment of 1.5 million people annually.

Introduction
• Examining candidates' résumés during recruitment creates a large
workload.
• The aim is to reduce the workload of the recruitment process by
transforming unstructured documents into structured ones.
• This study is about column detection, one of the problems that arises
when converting documents into a structured format.

Problem Definition &
Proposed Solution
Major Considerations:
• Extract information from a résumé
• Detect the column type of a résumé
Solution:
• Formalize the problem of finding the columns of a
résumé as a clustering problem
Figure: distribution of the texts' x0 coordinates in the 2D plane (two-column résumé)

Dataset
Figure: example one-column résumé (left) and two-column résumé (right)

Dataset
Parsed Résumé Dataframe: Example of a One-Column Résumé

     x0      y0      x1      y1      words                 Page_number
5    152.62  188.18  210.97  209.15  education             1
6    97.77   208.27  417.59  243.82  2014-2019ncomput..    1
7    78.83   256.16  288.61  270.73  Programmingnpyth..    1
8    120.30  282.93  304.52  309.46  ıdenjupyter notebo..  1
9    71.15   314.62  304.56  365.06  framework annlibr..   1

Dataset
Descriptions of the properties that describe the texts

Parameter  Explanation
x0         x coordinate of the left corner
y0         y coordinate of the top corner
x1         x coordinate of the right corner
y1         y coordinate of the bottom corner
words      The output of text extraction

Methodology
The coordinate distributions of the x0 and y0 features of the
texts in single- and double-column résumés are shown on this slide.
Figure panels: Two Column Résumé, One Column Résumé

Methodology
• Algorithms used in the study are listed below:
  • K-means approach with post-processing methods:
    • Silhouette
    • Elbow
  • DBSCAN

Methodology
K-means with Silhouette Method

Experiments
K-means with Silhouette Method
• Threshold of silhouette score = 0.95
  Silhouette score > 0.95: two-column résumé
  Silhouette score < 0.95: one-column résumé
Figure panels: single-column samples, two-column samples

Experiments
K-means with Elbow Method
• Threshold of Within-Cluster Sum of Squared
Errors (WSS) = 50,000
  WSS < 50,000: two-column résumé
  WSS > 50,000: one-column résumé
Figure panels: single-column samples, two-column samples

Experiments
Figure panels: Silhouette, Elbow, DBSCAN

Method      Test Accuracy  F1-Score  Recall  Precision
DBSCAN      83%            72%       68%     77%
Elbow       75%            57%       49%     66%
Silhouette  57%            43%       49%     38%

Conclusion and Future Directions
Conclusion
• The DBSCAN method achieves higher accuracy than the elbow
and silhouette methods.
• Although a 72% F1-score is relatively high, it is not sufficient.
Future Directions
• In future studies, the metric values may be improved with different clustering
approaches (model-based, spectral, hierarchical, etc.) on this specific
problem.

Thank you for listening

Editor's Notes

1- I will give the presentation under five headings. 2- Read the headings.
Since 1999, Kariyer.net, Turkey's largest employment platform, has brought job seekers and employers together online, using new-generation technologies in job search and recruitment processes. On its platform, Kariyer.net allows candidates to upload free-form résumés. These résumés are stored in databases unprocessed. Transforming these unstructured résumés into a structured form is of great importance to Kariyer.net.
In the recruitment process, manually reviewing résumés is quite time-consuming for recruiters. Extracting the actual meaning of résumés and structuring them can facilitate this process. More specifically, we first focus on finding the number of columns in a résumé, since once its main parts are separated, the subdivisions can be analyzed easily.
To determine the number of columns, we focus on the coordinate information of the texts parsed from the résumés. We assume that, for texts belonging to the same paragraph, only $y_0$ changes while $x_0$ remains within a certain tolerance range. In the case of multiple columns, the $x_0$ coordinate is treated as an attribute that specifies the starting position of the text. When the coordinates of one-column and two-column résumés were examined, a significant difference was detected: the $x_0$ coordinates of two-column résumés were observed to cluster around two centers. The slide shows the 2D distribution of the $x_0$ coordinates for the two column types. This study therefore formalizes the problem of finding the columns of a résumé as a clustering problem.
In this study, résumés collected from Kariyer.net databases were used. A total of 1018 résumés were used for the test. Résumés uploaded directly to the system can differ in format and style: one or two columns, different headers, or different writing details. As can be seen on the slide, a one-column résumé is shown on the left and a two-column résumé on the right.
The test data set created to compare the methods contains 685 single-column and 333 double-column résumés.
To determine the number of columns, the résumés are pre-processed and the texts they contain are separated; the study then focuses on the coordinate information of these parsed texts. In this respect, the problem is treated as an analytical problem that examines the coordinates of the analyzed data. The slide shows a table of texts parsed from résumés.
The table on this slide contains the descriptions of the features obtained from the parsed texts.
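As a rough illustration of the table described in these notes, the parsed text blocks could be held in a small record type. The class name `TextBlock` and the helper `x0_values` are hypothetical and not taken from the study; only the field names and the sample rows follow the slide's table.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """One parsed text block; field names follow the slide's table."""
    x0: float  # x coordinate of the left corner
    y0: float  # y coordinate of the top corner
    x1: float  # x coordinate of the right corner
    y1: float  # y coordinate of the bottom corner
    words: str  # output of text extraction (kept exactly as on the slide)
    page_number: int


# Sample rows copied from the one-column résumé table on the slide.
blocks = [
    TextBlock(152.62, 188.18, 210.97, 209.15, "education", 1),
    TextBlock(97.77, 208.27, 417.59, 243.82, "2014-2019ncomput..", 1),
    TextBlock(78.83, 256.16, 288.61, 270.73, "Programmingnpyth..", 1),
]


def x0_values(blocks, page=1):
    """Return the left x coordinates of all blocks on one page."""
    return [b.x0 for b in blocks if b.page_number == page]


print(x0_values(blocks))
```

The list returned by `x0_values` is the kind of one-dimensional input that the clustering methods in the following notes operate on.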
When the x0 and y0 coordinates of the texts obtained from one-column and two-column résumés are analyzed, the catplots on the slide are obtained. In the catplot of the two-column résumé, the text coordinates form two distinct projection sets on the x-axis. The text coordinates of single-column résumés, by contrast, form a sparse distribution on the x-axis. On the y-axis, on the other hand, no feature separates the two types.
In this study, three different methods were used to cluster the coordinate data. The well-known K-means approach was used with two different post-processing methods. In addition, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, which has proven performance in analyzing geographic data, was used. The number of columns was inferred from the coordinate data using these methods. (Get info from Şeref.)
ELBOW: The Elbow method calculates, for a given number of clusters, the sum of the squared distances from each data point to the center of the cluster it belongs to [13]. This quantity is also called the Within-Cluster Sum of Squared Errors (WSS). A threshold is determined from the WSS values calculated for the one-column and two-column CVs in the pre-prepared training dataset. The WSS value is calculated for a cluster number of 2. The number of columns of any résumé is then estimated by comparing its WSS value with this threshold.
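The WSS rule can be sketched in plain Python as follows. This is a minimal illustration, not the study's implementation: it assumes clustering is done on the x0 values only, uses a hand-written two-cluster K-means with a deterministic min/max start, and the function names are hypothetical. The 50,000 threshold is the value reported on the slide and depends on the dataset.

```python
def two_means_1d(values, iterations=100):
    """Lloyd's algorithm for k=2 on non-empty 1D data; returns (centers, labels)."""
    centers = [min(values), max(values)]  # simple deterministic start (assumption)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Update step: each center becomes the mean of its members.
        updated = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            updated.append(sum(members) / len(members) if members else centers[k])
        if updated == centers:
            break
        centers = updated
    return centers, labels


def wss(values, centers, labels):
    """Within-cluster sum of squared errors for a given clustering."""
    return sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))


WSS_THRESHOLD = 50_000  # threshold reported on the slide


def columns_by_elbow(x0):
    """WSS below the threshold is read as two columns, otherwise one."""
    centers, labels = two_means_1d(x0)
    return 2 if wss(x0, centers, labels) < WSS_THRESHOLD else 1


two_column_x0 = [70, 72, 71, 69, 320, 322, 321, 319]  # made-up coordinates
print(columns_by_elbow(two_column_x0))
```

Because WSS grows with the number of text blocks, a fixed threshold of this kind is sensitive to résumé length, which may be one reason no clear separating value was found.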
SILHOUETTE: The Silhouette method, on the other hand, is used to find the most appropriate number of clusters and to interpret the consistency within clusters of data. The method calculates the silhouette coefficient of each point, which measures how similar the point is to its own cluster compared to other clusters. The silhouette coefficient takes values between -1 and +1. As in Elbow, a threshold is determined from the silhouette coefficients calculated for single- and double-column résumés in the training dataset. Using this threshold, an estimate is made from the silhouette coefficients obtained from a résumé.
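A minimal sketch of the silhouette rule, under the same assumptions as before: only x0 values are clustered, the two-cluster labelling comes from a hand-written K-means, and all function names are hypothetical. For a point, a is the mean distance to the other points of its own cluster, b is the mean distance to the points of the other cluster, and the coefficient is (b - a) / max(a, b). The 0.95 threshold is the value reported on the slide.

```python
def two_means_labels(values, iterations=100):
    """Assign each 1D value to one of two clusters with Lloyd's algorithm."""
    centers = [min(values), max(values)]  # deterministic start (assumption)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        updated = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            updated.append(sum(members) / len(members) if members else centers[k])
        if updated == centers:
            break
        centers = updated
    return labels


def mean_silhouette(values, labels):
    """Average silhouette coefficient of a two-cluster labelling of 1D data."""
    scores = []
    for i, (v, lab) in enumerate(zip(values, labels)):
        own = [abs(v - w) for j, (w, other_lab) in enumerate(zip(values, labels))
               if other_lab == lab and j != i]
        other = [abs(v - w) for w, other_lab in zip(values, labels)
                 if other_lab != lab]
        if not own or not other:
            scores.append(0.0)  # convention for a singleton or missing cluster
            continue
        a = sum(own) / len(own)      # cohesion within the point's own cluster
        b = sum(other) / len(other)  # separation from the other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def columns_by_silhouette(x0, threshold=0.95):
    """A score above the threshold is read as two columns, otherwise one."""
    score = mean_silhouette(x0, two_means_labels(x0))
    return 2 if score > threshold else 1


two_column_x0 = [70, 72, 71, 69, 320, 322, 321, 319]  # made-up coordinates
print(columns_by_silhouette(two_column_x0))
```

A threshold as high as 0.95 only accepts résumés whose two groups of x0 values are very tight and far apart, so indented bullet lists or centered headings inside a column can push a genuine two-column résumé below it.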
DBSCAN: Unlike the K-means algorithm, DBSCAN does not require the number of clusters to be specified beforehand. It is also robust to outliers. The data produced from the parsed résumés form a valid dataset for DBSCAN: the coordinates of the text blocks are point data in the 2D coordinate plane. The number of clusters that the text blocks form on the page can be determined using DBSCAN, and the number of clusters detected can indicate the number of columns in the résumé.
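The cluster-counting idea can be sketched with a small hand-written DBSCAN. The notes do not state the parameters or the exact features used, so `eps=20.0`, `min_samples=3`, and the choice to cluster on x0 alone are assumptions made only for this illustration; the function names are hypothetical.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one cluster label per point (-1 means noise)."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def columns_by_dbscan(x0, eps=20.0, min_samples=3):
    """Count the clusters of x0 values; noise points are ignored."""
    labels = dbscan([(v,) for v in x0], eps, min_samples)
    return len({lab for lab in labels if lab != NOISE})


two_column_x0 = [70, 72, 71, 69, 320, 322, 321, 319, 500]  # 500 is a stray block
print(columns_by_dbscan(two_column_x0))
```

The isolated block at x0 = 500 is labelled as noise rather than forming a third cluster, which illustrates the outlier resistance mentioned in the notes.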
First, we consider the Silhouette method. The slide shows box plots of the silhouette coefficients obtained from the training data set for 2 clusters. When the box plots were examined, no clear threshold could be determined that separates the two column types. After manual trials, the threshold with the highest accuracy was found to be 0.95. The confusion matrix shows the success achieved with this value.
For the Elbow method, the slide shows box plots of the WSS values obtained from the training set for 2 clusters. Again, no clear WSS value could be found that separates one-column from two-column résumés. The threshold with the highest accuracy for Elbow was 50,000. The estimation results obtained with this value are given in the confusion matrix on the slide.
The results of the estimations made on the test data set with the DBSCAN method are shown in the confusion matrix. The number of clusters was determined by DBSCAN from the coordinate data obtained from the résumés. Résumés with one cluster were estimated to be single-column, and résumés with two clusters were estimated to be double-column.
When the three methods are compared, DBSCAN is observed to have the highest success. DBSCAN achieved a success rate of 90.07% on single-column résumés and 67.56% on two-column résumés. The Silhouette method reached 87.88% on single-column résumés and 49.24% on two-column résumés, while the Elbow method reached 61.16% on single-column résumés and 48.94% on two-column résumés. In all three methods, results were higher for single-column résumés than for double-column résumés.
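As a worked check, the DBSCAN row of the results table can be recomputed from the per-class success rates above and the test-set sizes (685 single-column, 333 double-column). The sketch assumes that two-column résumés are the positive class and that each success rate is the share of that class classified correctly; neither assumption is stated in the notes.

```python
single_total, double_total = 685, 333       # test-set sizes from the notes
single_rate, double_rate = 0.9007, 0.6756   # DBSCAN per-class success rates

# Assumption: two-column résumés are the positive class.
tn = round(single_total * single_rate)  # single-column résumés classified correctly
tp = round(double_total * double_rate)  # two-column résumés classified correctly
fp = single_total - tn                  # single-column résumés predicted as two-column
fn = double_total - tp                  # two-column résumés predicted as single-column

accuracy = (tp + tn) / (single_total + double_total)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.0%} {f1:.0%} {recall:.0%} {precision:.0%}")
# prints "83% 72% 68% 77%"
```

Under these assumptions the counts are 617 and 225 correct out of 685 and 333, and the four derived values agree with the DBSCAN row of the table (accuracy, F1-score, recall, precision).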
In this study, three different clustering methods were considered for the problem of determining the number of columns in résumés. When the results of the methods were compared, DBSCAN obtained the highest metric values, with 83% accuracy and a 72% F1-score. Although these results are acceptable, they are not sufficient for determining the number of columns in documents. When the résumés are examined, it is seen that the information is presented under relevant headings. Accordingly, headings carry information about the number of columns of a résumé. As an extension of this work, heading information (semantic and positional) can be used to identify headings and the number of columns simultaneously. In this way, we expect that success rates for the résumé parsing task can be raised to reasonable levels. In addition to this study, different clustering approaches (model-based, spectral, hierarchical, etc.) and different metrics (gap statistics, modularity, etc.) can be used to find the optimal number of clusters.
If you have any questions, I'd be happy to answer them.
Question 1: Is a solution produced for only 1 and 2 columns?
Answer 1: This is a preliminary study. We continue with the best results among the possible options. Based on the data set we currently have, it covers 1- and 2-column documents.
Question 2: Why a clustering solution?
Answer 2: When the coordinates of the texts were examined, we observed that the starting coordinates of the text showed different distributions and clustering on the x-axis depending on the number of columns.
Question 3: How did you get the texts in the PDF?
Answer 3: We have developed an algorithm that extracts the content of documents such as PDF, Docx, etc.
Question 4: Is this a clustering solution?
Answer 4: In this study, we need to find the number of segments. Unlike uses of clustering such as profiling, segmentation, or partitioning, we use clustering as a tool in this study.
Question 5: Can it be used for any document?
Answer 5: Yes, the study contains a general solution. Therefore, it can be used to determine the columns of different documents.
Question 6: Why does this study exist?
Answer 6: This is a small part of a big project. It solves a problem encountered while parsing texts into sections in the CV parsing project. If the number of columns is not known, texts or sections may get mixed up when parsing documents with different column types. This would be difficult to explain here. If you send me an e-mail, I will be happy to explain it in detail.
