Hello, I am Yavuz Balı, an R&D engineer at Kariyer.net. Today I will present the study "Determining Column Numbers in Résumé with Clustering", which is part of this project.
Determining Column Numbers in Résumé with Clustering
1. Determining Column Numbers in Résumé with Clustering
Şeref Recep Keskin
R&D Engineer
Kariyer.net
Yavuz Balı
R&D Engineer
Kariyer.net
Günce Kezban Orman
Department of Computer Engineering
Galatasaray University
Sultan N. Turhan
Department of Computer Engineering
Galatasaray University
F. Serhan Daniş
Department of Computer Engineering
Galatasaray University
3. Introduction
About Kariyer.net
It is Turkey's first and leading job and employee recruitment platform, established in 1999.
More than 1 M job postings are viewed daily and around 300K job applications are received.
It has been a Research and Development center since 2014, the first in the Turkish Internet industry.
It mediates the employment of 1.5 million people annually.
4. Introduction
Examining candidates' résumés during recruitment processes constitutes a large workload.
The aim is to reduce the workload of the recruitment process by transforming unstructured documents into structured ones.
This study is about column detection, which is one of the problems encountered when converting documents into a structured format.
5. Problem Definition & Proposed Solution
Major Considerations:
• Extract information from a résumé
• Detect the column type of a résumé
Solution:
• Formalize the problem of finding the columns of a résumé as a clustering problem
[Figure: distribution of the texts' x0 coordinates in the 2D plane for a two-column résumé]
9. Parameter Explanation
Dataset: descriptions of the properties that describe the texts
x0      x coordinate of the left corner
y0      y coordinate of the top corner
x1      x coordinate of the right corner
y1      y coordinate of the bottom corner
words   The output of text extraction
10. Methodology
The coordinate distributions of the x0 and y0 features of the texts in single- and double-column résumés are shown below:
[Figure panels: Two Column Résumé | One Column Résumé]
11. Methodology
The algorithms used in the study are listed below:
• K-means approach with post-processing methods:
    • Silhouette
    • Elbow
• DBSCAN
19. Conclusion and Future Directions
Conclusion
• The DBSCAN method has higher accuracy than the elbow and silhouette methods.
• Although a 72% F1-score is a high value, it is not sufficient.
Future Directions
• In future studies, the metric values can be increased with different clustering approaches (model-based, spectral, hierarchical, etc.) on this specific problem.
1- I will give this presentation under 5 headings.
2- (Read the heading.)
Since 1999, Kariyer.net, Turkey's largest employment platform, has brought job seekers and employers together online, using new-generation technologies in job search and recruitment processes. On its platform, Kariyer.net allows candidates to upload free-form résumés. These résumés are stored in databases unprocessed. Transforming these unstructured résumés into a structured form is of great importance to Kariyer.net.
In the recruitment process, manually reviewing résumés is very time-consuming for recruiters. Extracting the actual meaning of résumés and structuring them can ease this process. More specifically, we first focus on finding the number of columns in a résumé, since once the main parts of the résumé are separated, the subdivisions can be analyzed easily.
To determine the number of columns, we focus on the coordinate information of the texts parsed from the résumés. We assume that, for texts belonging to the same paragraph, only $y_0$ changes while $x_0$ stays within a certain tolerance range. In the case of multiple columns, the coordinate $x_0$ is treated as the attribute that specifies the starting position of each text. When the coordinates of one-column and two-column résumés were examined, a significant difference was detected: in two-column résumés, the $x_0$ values cluster around two centers. The slide shows the distribution of the $x_0$ coordinates in 2D for the two column types. This study therefore formalizes the problem of finding the columns of a résumé as a clustering problem.
In this study, résumés collected from Kariyer.net databases were used. A total of 1018 résumés were used for testing. Résumés uploaded directly to the system can have different formats and styles, with one or two columns, different headers, and different levels of written detail. The slide shows examples: a two-column résumé on the right and a one-column résumé on the left.
The test data set created to compare the methods contains 685 single-column and 333 double-column résumés.
To determine the number of columns, the résumés are pre-processed and the texts they contain are separated; we then focus on the coordinate information of these texts. In this respect, the problem is treated as an analytical problem that examines the coordinates of the parsed data. The slide shows a table of texts parsed from résumés.
The table on this slide contains the descriptions of the features obtained for the parsed texts.
When the x0 and y0 coordinates of the texts obtained from one-column and two-column résumés are analyzed, the catplots on the slide are obtained. In the catplot of the two-column résumé, the text coordinates form two distinct groups when projected onto the x-axis. The text coordinates of the single-column résumé form a sparse distribution when projected onto the x-axis. The y-axis, on the other hand, shows no feature that separates the two types.
In this study, three different methods were used to cluster the coordinate data. The well-known K-means approach was used with two different post-processing methods. In addition, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, which has proven performance in analyzing geographic data, was used. The number of columns was inferred from the coordinate data using these methods. (Note to self: get information from Şeref.)
ELBOW:
With the Elbow method, the sum of the squared distances from each data point to the center of the cluster it is assigned to is calculated for a given number of clusters [13]. This quantity is also called the Within-Cluster Sum of Squared Errors (WSS). A threshold value is determined from the WSS values calculated on the pre-prepared training data set of one-column and two-column CVs. The WSS value is calculated for two clusters. The number of columns of any résumé is then estimated by comparing its WSS value with the determined threshold.
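The WSS computation for two clusters can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes clustering on the x0 values only, uses a hand-written 1D k-means with a deterministic start, and the function names and sample coordinates are hypothetical. The notes do not say on which side of the threshold a résumé counts as two-column, so the sketch stops at the WSS value.

```python
def two_means_1d(xs, max_iter=50):
    """Lloyd's k-means with k=2 on 1D data; returns (centers, labels)."""
    centers = [min(xs), max(xs)]  # assumption: start from the two extremes
    labels = [0] * len(xs)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
                  for x in xs]
        # Recompute each center as the mean of its members.
        new_centers = []
        for k in (0, 1):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members
                               else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def wss_for_two_clusters(xs):
    """Within-cluster sum of squared distances for the k=2 partition."""
    centers, labels = two_means_1d(xs)
    return sum((x - centers[lab]) ** 2 for x, lab in zip(xs, labels))


# Hypothetical x0 values of text blocks in a two-column page.
two_column_x0 = [50, 52, 51, 300, 302, 301]
wss_value = wss_for_two_clusters(two_column_x0)
print(wss_value)
```

The resulting value would then be compared with the threshold tuned on the training set.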
SILHOUETTE:
The Silhouette method, on the other hand, is used to find the most appropriate number of clusters and to interpret the consistency within clusters of data. The method calculates the silhouette coefficient of each point, which measures how similar a point is to its own cluster compared to other clusters. The silhouette coefficient takes values between -1 and +1. As with Elbow, a threshold value is determined from the silhouette coefficients calculated for single- and double-column résumés in the training data set. Using this threshold, an estimate is made from the silhouette coefficients obtained from each résumé.
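The silhouette coefficient described above can be sketched for 1D x0 values. This is an illustration under assumptions: the cluster labels are supplied by hand here (in the study they would come from K-means with two clusters), singleton clusters get a score of 0 by convention, and the function names and sample data are hypothetical.

```python
def silhouette_samples_1d(xs, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every point."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, (x, lab) in enumerate(zip(xs, labels)):
        # a: mean distance to the other points of the same cluster.
        same = [abs(x - y) for j, (y, m) in enumerate(zip(xs, labels))
                if m == lab and j != i]
        if not same:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(abs(x - y) for y, m in zip(xs, labels) if m == other)
            / labels.count(other)
            for other in clusters if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean_silhouette(xs, labels):
    scores = silhouette_samples_1d(xs, labels)
    return sum(scores) / len(scores)


# Hypothetical x0 values: two well-separated groups versus an arbitrary split.
separated = mean_silhouette([50, 52, 51, 300, 302, 301], [0, 0, 0, 1, 1, 1])
mixed = mean_silhouette([50, 51, 52, 53], [0, 1, 0, 1])
print(separated, mixed)
```

A partition into two well-separated groups scores close to +1, while a split that cuts through a single group scores near or below 0, which is what makes a threshold on this value usable as a column-count signal.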
DBSCAN:
Unlike the K-means algorithm, DBSCAN does not require the number of clusters to be specified beforehand. It is also robust to outliers. The data produced from the parsed résumés form a valid data set for the DBSCAN algorithm: the coordinates of the text blocks are points in the 2D coordinate plane. The number of clusters of text blocks on the page can be determined using DBSCAN, and the number of clusters detected gives information about the number of columns in the résumé.
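A compact DBSCAN over (x0, y0) points can be sketched in plain Python. This is an illustrative sketch only: the notes do not give the parameters used in the study, so `eps` and `min_samples` below, along with the sample coordinates and function names, are assumptions.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:  # core point: keep expanding
                queue.extend(j_nbrs)
    return labels


def count_clusters(points, eps=20, min_samples=3):
    """Number of clusters found, ignoring noise (assumed parameters)."""
    return len({lab for lab in dbscan(points, eps, min_samples)
                if lab != NOISE})


# Hypothetical pages: text lines 15 units apart, columns at x0 = 50 and 300.
two_column = ([(50, 100 + 15 * i) for i in range(10)]
              + [(300, 100 + 15 * i) for i in range(10)])
one_column = [(50, 100 + 15 * i) for i in range(20)] + [(500, 800)]
print(count_clusters(two_column), count_clusters(one_column))
```

In the one-column sample, the isolated point at (500, 800) is labeled as noise rather than forming a second cluster, which illustrates the outlier resistance mentioned above.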
First, we consider the Silhouette method. The box plots of the silhouette coefficients obtained from the training data set for 2 clusters are shown on the slide. When the box plots were examined, no clear threshold value could be identified that separates the two column types. After manual trials, the threshold value with the highest accuracy was found to be 0.95. The confusion matrix shows the success achieved with this value.
For the Elbow method, the box plots of the WSS values obtained from the training set for 2 clusters are shown on the slide. Here too, no clear WSS threshold could be found that separates one-column from two-column résumés. The threshold value with the highest accuracy for Elbow was found to be 50,000. The estimation results obtained with this value are given in the confusion matrix shown on the slide.
The results of the estimations made on the test data set with the DBSCAN method are shown in the confusion matrix. The number of clusters was determined by DBSCAN from the coordinate data obtained from the résumés. Résumés containing one cluster were estimated to be single-column, and résumés containing two clusters were estimated to be double-column.
When all 3 methods are compared, the DBSCAN method has the highest success. DBSCAN achieved a success rate of 90.07% on single-column résumés and 67.56% on two-column résumés. The Silhouette method reached a success rate of 87.88% on single-column résumés and 49.24% on two-column résumés. The Elbow method, on the other hand, had a success rate of 61.16% on single-column résumés and 48.94% on two-column résumés. All 3 methods performed better on single-column résumés than on double-column résumés.
In this study, three different clustering methods were considered for the problem of determining the number of columns in résumés. When the results of the methods were compared, the DBSCAN method obtained the highest metric values, with 83% accuracy and a 72% F1-score. Although these results are acceptable, they are not sufficient to determine the number of columns in the documents. When the résumés are examined, it is seen that the information is presented under the relevant headings. Accordingly, headings carry information about the number of columns of a résumé. As an extension of this work, heading information (semantic and positional) can be used to identify headings and the number of columns simultaneously. In this way, we expect that success rates for the résumé parsing task can be raised to reasonable levels. In addition, different clustering approaches (model-based, spectral, hierarchical, etc.) and different metrics (gap statistics, modularity, etc.) can be used to find the optimal number of clusters.
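As a consistency check, the reported 83% accuracy and 72% F1-score for DBSCAN can be reproduced from the test-set sizes and the per-class success rates given earlier. The counts below are reconstructed by rounding and are therefore assumptions, as is the choice of the two-column class as the positive class for F1; the actual confusion matrix is on the slide and is not reproduced in these notes.

```python
# Test-set sizes and DBSCAN per-class success rates quoted in the talk.
n_single, n_double = 685, 333
rate_single, rate_double = 0.9007, 0.6756

# Assumption: correct counts recovered by rounding rate * class size.
correct_single = round(n_single * rate_single)
correct_double = round(n_double * rate_double)

accuracy = (correct_single + correct_double) / (n_single + n_double)

# Assumption: "two-column" is the positive class.
tp = correct_double             # two-column predicted as two-column
fn = n_double - correct_double  # two-column predicted as single-column
fp = n_single - correct_single  # single-column predicted as two-column
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy = {accuracy:.4f}, F1 = {f1:.4f}")
```

Under these assumptions both values round to the figures quoted in the talk.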
If you have any questions, I'd be happy to answer them.
Question 1: Is the solution only for 1 and 2 columns?
Answer 1: This is a preliminary study. We proceed with the best-performing of the possible options.
Given the data set we have at the moment, it covers 1- and 2-column documents.
Question 2: Why a clustering solution?
Answer 2: When the coordinates of the texts were examined, we observed that the starting coordinates of the texts showed different distributions and clustering on the x-axis depending on the number of columns.
Question 3: How did you get the texts from the PDF?
Answer 3: We have developed an algorithm that extracts the content information of documents such as PDF, Docx, etc.
Question 4: Is this a clustering solution?
Answer 4: In this study, we need to find the number of segments. Unlike uses of clustering such as profiling, segmentation or partitioning, we use clustering as a tool in this study.
Question 5: Can it be used for any document?
Answer 5: Yes, the study offers a general solution. Therefore, it can be used to determine the columns of other kinds of documents.
Question 6: Why does this study exist?
Answer 6: This is a small part of a big project. It is a solution to a problem encountered while parsing texts into sections in the CV parsing project. If the number of columns is not known, texts or sections can get mixed up when parsing documents of different column types.
This would be difficult to explain here. If you send me an e-mail, I will be happy to explain it in detail.