Determining Column Numbers in Résumé
with Clustering
Şeref Recep Keskin
R&D Engineer
Kariyer.net
Yavuz Balı
R&D Engineer
Kariyer.net
Günce Kezban Orman
Department of Computer Engineering
Galatasaray University
Sultan N. Turhan
Department of Computer Engineering
Galatasaray University
F. Serhan Daniş
Department of Computer Engineering
Galatasaray University
Contents
• Introduction
• Dataset
• Methodology
• Experiment
• Conclusion and Future Directions
Introduction
About Kariyer.net
• Kariyer.net is Turkey's first and leading job and employee recruitment
platform, established in 1999.
• More than 1 million job postings are viewed daily, and around 300K job
applications are received.
• It has been a Research and Development center since 2014, the
first in the Turkish Internet industry.
• It mediates the employment of 1.5 million people annually.
Introduction
• Examining candidates' résumés during recruitment
requires a large amount of manual work.
• The aim is to reduce the workload of the recruitment
process by transforming unstructured documents into
structured ones.
• This study addresses column detection, one of the
problems that arise when converting documents into a
structured format.
Problem Definition &
Proposed Solution
Major Considerations:
• Extract information from résumés
• Detect the column type of a résumé
Solution:
• Formalize the problem of finding the columns of a
résumé as a clustering problem
Distribution of the texts' x0 coordinates in the 2D plane (two-column résumé)
Dataset
Example résumés: one-column résumé (left), two-column résumé (right)
Dataset
Parsed résumé dataframe (example of a one-column résumé):
     x0      y0      x1      y1      words                 Page_number
5    152.62  188.18  210.97  209.15  education             1
6    97.77   208.27  417.59  243.82  2014-2019ncomput..    1
7    78.83   256.16  288.61  270.73  Programmingnpyth..    1
8    120.30  282.93  304.52  309.46  ıdenjupyter notebo..  1
9    71.15   314.62  304.56  365.06  framework annlibr..   1
Dataset
Descriptions of the properties of the parsed texts:
Parameter  Explanation
x0         x coordinate of the left edge
y0         y coordinate of the top edge
x1         x coordinate of the right edge
y1         y coordinate of the bottom edge
words      The output of text extraction
Methodology
The coordinate distributions of the x0 and y0 features of the
texts in single- and double-column résumés are shown below:
Two Column Résumé        One Column Résumé
Methodology
• Algorithms used in the study are listed below:
  • K-means approach with post-processing methods:
    • Silhouette
    • Elbow
  • DBSCAN
Methodology
K-means with Elbow Method
Methodology
K-means with Silhouette Method
Methodology
DBSCAN Method
Experiments
K-means with Silhouette Method
• Threshold of Silhouette Score = 0.95
Silhouette Score > 0.95: Two Column Résumé
Silhouette Score < 0.95: One Column Résumé
Single-column samples
Two-column samples
Experiments
K-means with Elbow Method
• Threshold of Within-Cluster Sum of Squared
Errors (WSS) = 50,000
WSS < 50,000: Two Column Résumé
WSS > 50,000: One Column Résumé
Single-column samples
Two-column samples
Experiments
DBSCAN Method
Experiments
Silhouette Elbow DBSCAN
Method      Test Accuracy  F1-Score  Recall  Precision
DBSCAN      83%            72%       68%     77%
Elbow       75%            57%       49%     66%
Silhouette  57%            43%       49%     38%
Conclusion and Future Directions
Conclusion
• The DBSCAN method achieves higher accuracy than the Elbow
and Silhouette methods.
• Although an F1-score of 72% is fairly high, it is not sufficient.
Future Directions
• In future studies, the metric values may be improved with different clustering
approaches (model-based, spectral, hierarchical, etc.) on this specific
problem.
Thank you for listening

Editor's Notes

  1. Hello, I am Yavuz Balı, an R&D engineer at Kariyer.net. Today I will present the study "Determining Column Numbers in Résumé with Clustering", which is part of this project.
  2. The presentation is organized under five headings. (Read the headings.)
  3. Since 1999, Kariyer.net, Turkey's largest employment platform, has brought job seekers and employers together online, using new-generation technologies in job search and recruitment processes. On its platform, Kariyer.net allows candidates to upload free-style résumés. These résumés are stored in databases unprocessed. It is of great importance for Kariyer.net to transform these unstructured résumés into a structured form.
  4. In the recruitment process, manually reviewing résumés is quite time-consuming for recruiters. Extracting the actual meaning of résumés and structuring their form can facilitate this process. More specifically, we first focus on finding the number of columns in a résumé, since once the main parts of the résumé are separated, the subdivisions can be analyzed easily.
  5. To determine the number of columns, we focus on the coordinate information of the texts parsed from the résumés. We assume that, for texts belonging to the same paragraph, only y0 changes while x0 remains within a certain tolerance range. In the case of multiple columns, x0 is treated as an attribute that specifies the starting position of the text. When the coordinates of one-column and two-column résumés were examined, a significant difference was detected: the x0 coordinates of two-column résumés cluster around two centers. The slide shows the distribution of the x0 coordinates in 2D for the two column types. This study therefore formalizes the problem of finding the columns of a résumé as a clustering problem.
  6. In this study, résumés collected from Kariyer.net databases were used. A total of 1018 résumés were used for the test. Résumés uploaded directly to the system can have different formats and styles, with one or two columns, different headers, or different writing details. The slide shows an example one-column résumé on the left and a two-column résumé on the right.
  7. The test data set created to compare the methods contains 685 single-column and 333 double-column résumés.
  8. To determine the number of columns, the résumés are pre-processed and the texts they contain are separated; the analysis then focuses on the coordinate information of these parsed texts. In this respect, the problem is treated as an analytical problem that examines the coordinates of the parsed data. The slide shows a table of texts parsed from a résumé.
  9. The table on this slide describes the features obtained from the parsed texts.
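
A minimal sketch of how one parsed text block could be represented, assuming the field names from the table on slide 8. The class name `TextBlock` and the `width` helper are hypothetical and not part of the study; the sample rows are copied from the slide, including its truncated text values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """One parsed text block; field names follow the table on the slide."""
    x0: float  # x coordinate of the left edge
    y0: float  # y coordinate of the top edge
    x1: float  # x coordinate of the right edge
    y1: float  # y coordinate of the bottom edge
    words: str  # output of text extraction
    page_number: int

    @property
    def width(self) -> float:
        # Hypothetical helper: horizontal extent of the block.
        return self.x1 - self.x0


# Rows 5-7 of the example table (text values are truncated on the slide).
blocks = [
    TextBlock(152.62, 188.18, 210.97, 209.15, "education", 1),
    TextBlock(97.77, 208.27, 417.59, 243.82, "2014-2019ncomput..", 1),
    TextBlock(78.83, 256.16, 288.61, 270.73, "Programmingnpyth..", 1),
]

# The left-edge coordinates are the feature the column analysis starts from.
x0_values = [block.x0 for block in blocks]
```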
  10. Plotting the x0 and y0 coordinates of the texts obtained from one-column and two-column résumés yields the catplots on the slide. In the catplot of the two-column résumé, the text coordinates form two distinct groups when projected onto the x-axis. The text coordinates of the single-column résumé instead form a sparse distribution along the x-axis. The y-axis, on the other hand, shows no feature that separates the two layouts.
  11. In this study, three different methods were used to cluster the coordinate data. The well-known K-means approach was used with two different post-processing methods. In addition, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, which has proven performance in analyzing geographic data, was used. The number of columns was interpreted using these methods and the coordinate data. (Note to self: get details from Şeref.)
  12. ELBOW: The Elbow method calculates, for a given number of clusters, the sum of the squared distances between each data point and the center of the cluster to which it is assigned [13]. This quantity is also called the Within-Cluster Sum of Squared Errors (WSS). A threshold value is determined from the WSS values calculated on the pre-prepared training dataset for résumés with one and two columns. The WSS value is calculated for two clusters. The number of columns in any résumé is then estimated by comparing its WSS value with the determined threshold.
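
The WSS rule can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: it clusters only the x0 values, uses a small hand-written Lloyd's algorithm with a deterministic start in place of a library K-means, and the sample coordinates and the threshold of 100 are hypothetical. The 50,000 threshold on the slide depends on the features and scale used in the study.

```python
def kmeans_1d_two_clusters(values, max_iter=100):
    """Lloyd's algorithm in 1D with k=2; returns (centers, labels)."""
    centers = [min(values), max(values)]  # deterministic initialization (assumption)
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: each value goes to the nearer center.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in (0, 1):
            members = [v for v, label in zip(values, labels) if label == k]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def wss(values, centers, labels):
    """Within-cluster sum of squared errors."""
    return sum((v - centers[label]) ** 2 for v, label in zip(values, labels))


def predict_columns_elbow(x0_values, threshold):
    """WSS below the threshold is read as two columns, otherwise one."""
    centers, labels = kmeans_1d_two_clusters(x0_values)
    return 2 if wss(x0_values, centers, labels) < threshold else 1


# Hypothetical x0 samples: two tight groups versus one spread-out group.
two_column_x0 = [70, 72, 71, 320, 322, 321]
one_column_x0 = [70, 90, 120, 150, 100, 80]
HYPOTHETICAL_THRESHOLD = 100

two_column_wss = wss(two_column_x0, *kmeans_1d_two_clusters(two_column_x0))
one_column_wss = wss(one_column_x0, *kmeans_1d_two_clusters(one_column_x0))
```

With two clusters, a two-column page leaves little residual spread around each center, so its WSS stays small; a single spread-out column leaves more.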
  13. SILHOUETTE: The Silhouette method is used to find the most appropriate number of clusters and to interpret the consistency within clusters of data. It calculates the silhouette coefficient of each point, which measures how similar a point is to its own cluster compared with other clusters. The silhouette coefficient takes values between -1 and +1. As with Elbow, a threshold value is determined from the silhouette coefficients calculated for single- and double-column résumés in the training dataset. Using this threshold, an estimate is made from the silhouette coefficients obtained from a résumé.
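
A small sketch of the silhouette rule, under assumptions: it works on x0 values only, and `split_at_midpoint` is a hypothetical stand-in for the K-means step with two clusters. The sample coordinates are made up; the 0.95 threshold is the one reported on the slide.

```python
def mean_silhouette(values, labels):
    """Average silhouette coefficient of 1D points for the given cluster labels."""
    clusters = {}
    for v, label in zip(values, labels):
        clusters.setdefault(label, []).append(v)
    scores = []
    for v, label in zip(values, labels):
        own = clusters[label]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)  # convention: singleton or single cluster scores 0
            continue
        # a: mean distance to the other points of the same cluster.
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(sum(abs(v - u) for u in other) / len(other)
                for key, other in clusters.items() if key != label)
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


def split_at_midpoint(values):
    """Stand-in for K-means with k=2: label each value by its side of the range midpoint."""
    midpoint = (min(values) + max(values)) / 2
    return [0 if v <= midpoint else 1 for v in values]


def predict_columns_silhouette(x0_values, threshold=0.95):
    """A score above the threshold is read as two columns, otherwise one."""
    score = mean_silhouette(x0_values, split_at_midpoint(x0_values))
    return 2 if score > threshold else 1


# Hypothetical x0 samples.
two_column_x0 = [70, 72, 71, 320, 322, 321]
one_column_x0 = [70, 90, 120, 150, 100, 80]
```

Two well-separated groups give coefficients close to +1, which is why a high score is read as a two-column layout.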
  14. DBSCAN: Unlike the K-means algorithm, DBSCAN does not require the number of clusters to be specified beforehand. It is also resistant to outliers. The data produced from the parsed résumés form a valid dataset for the DBSCAN algorithm: the coordinate information of the text blocks yields point data in the 2D coordinate plane. The number of clusters of text blocks on the page can be determined using DBSCAN, and the number of clusters detected can indicate the number of columns in the résumé.
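
The idea can be illustrated with a compact, plain-Python DBSCAN over (x0, y0) points. This is a sketch, not the study's code: the notes do not state the parameters used, so `EPS` and `MIN_SAMPLES` are hypothetical, as are the sample coordinates.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Plain DBSCAN; returns one label per point, with -1 marking noise."""
    n = len(points)
    # Each point's eps-neighborhood (the point itself is included).
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand only through core points
    return labels


def count_clusters(points, eps, min_samples):
    """Number of clusters found, ignoring noise points."""
    return len(set(dbscan(points, eps, min_samples)) - {-1})


EPS, MIN_SAMPLES = 30, 3  # hypothetical parameters

# Hypothetical (x0, y0) points: text blocks stacked 20 units apart.
left_column = [(70, y) for y in (100, 120, 140, 160)]
right_column = [(320, y) for y in (100, 120, 140, 160)]
two_column_page = left_column + right_column
one_column_page = left_column
```

One detected cluster would then be read as a single-column résumé and two clusters as a double-column résumé.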
  15. First, we consider the Silhouette method. The slide shows box plots of the silhouette coefficients obtained from the training data set for two clusters. When the box plots were examined, no clear threshold value could be determined that separates the two column types. After manual trials, the threshold value with the highest accuracy was found to be 0.95. The confusion matrix shows the success achieved with this value.
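
The manual trials described here amount to scanning candidate thresholds and keeping the one with the highest training accuracy. The sketch below shows that scan; the function name, the scores, the labels, and the candidate list are all hypothetical and do not come from the study's data.

```python
def best_threshold(scores, true_columns, candidates):
    """Return (threshold, accuracy) that maximizes accuracy of 'score > t means two columns'."""
    best = None
    for t in candidates:
        predictions = [2 if score > t else 1 for score in scores]
        correct = sum(p == y for p, y in zip(predictions, true_columns))
        accuracy = correct / len(true_columns)
        # Strict comparison: on ties the earliest candidate is kept.
        if best is None or accuracy > best[1]:
            best = (t, accuracy)
    return best


# Hypothetical training data: silhouette score and true column count per résumé.
scores = [0.99, 0.97, 0.96, 0.93, 0.94, 0.90, 0.80, 0.60]
true_columns = [2, 2, 2, 2, 1, 1, 1, 1]
candidates = [0.50, 0.85, 0.95]

threshold, accuracy = best_threshold(scores, true_columns, candidates)
```

The same scan applies to the WSS values of the Elbow method with the comparison reversed, since there a value below the threshold is read as two columns.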
  16. For the Elbow method, box plots of the WSS values obtained from the training dataset for two clusters are shown on the slide. Here, too, no clear WSS value could be found that separates one-column from two-column résumés. The threshold value with the highest accuracy for Elbow was found to be 50,000. The estimation results obtained with this value are given in the confusion matrix shown on the slide.
  17. The results of the estimations made on the test dataset with the DBSCAN method are shown in the confusion matrix. The number of clusters was determined by DBSCAN from the coordinate data obtained from the résumés. Résumés containing one cluster were predicted to be single-column, and résumés containing two clusters were predicted to be double-column.
  18. When all three methods are compared, DBSCAN has the highest success. DBSCAN achieved a success rate of 90.07% on single-column résumés and 67.56% on two-column résumés. The Silhouette method reached 87.88% on single-column résumés and 49.24% on two-column résumés. The Elbow method reached 61.16% on single-column résumés and 48.94% on two-column résumés. In all three methods, results were higher for single-column résumés than for two-column résumés.
  19. In this study, three different clustering methods were considered for the problem of determining the number of columns in résumés. When the results were compared, the DBSCAN method obtained the highest metric values, with 83% accuracy and a 72% F1-score. Although these results are acceptable, they are not sufficient to determine the number of columns in the documents. When résumés are examined, it is seen that the information is presented under relevant headings. Accordingly, headings carry information about the number of columns in a résumé. As an extension of this work, heading information (semantic and positional) can be used to identify headings and the number of columns simultaneously. In this way, we expect that success rates for the résumé parsing task can be increased to reasonable levels. In addition to this study, different clustering approaches (model-based, spectral, hierarchical, etc.) and different metrics (gap statistic, modularity, etc.) can be used to find the optimal number of clusters.
  20. If you have any questions, I'd be happy to answer them. Question 1: Is a solution produced for only 1 and 2 columns? Answer 1: This is a preliminary study, and we continue with the best results among the possible options. Based on the dataset we have at the moment, it works on 1- and 2-column documents. Question 2: Why a clustering solution? Answer 2: When the coordinates of the texts were examined, we observed that the starting coordinates of the text showed different distributions and clustering on the x-axis depending on the number of columns. Question 3: How did you get the texts in the PDF? Answer 3: We have developed an algorithm that extracts the content information of documents such as PDF, Docx, etc. Question 4: Is this a clustering solution? Answer 4: In this study, we need to find the number of segments. Unlike uses of clustering such as profiling, segmentation or partitioning, we use clustering as a tool in this study. Question 5: Can it be used for any document? Answer 5: Yes, the study contains a general solution. Therefore, it can be used to determine the columns of different documents. Question 6: Why does this study exist? Answer 6: This is a small part of a big project. It is a solution to a problem encountered while parsing texts into sections in the CV parsing project. If the number of columns is not known, texts or sections can get mixed up when parsing documents with different column layouts. This would be difficult to explain here. If you send me an e-mail, I will be happy to explain it in detail.
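
The three approaches described in slides 12 to 14 can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. As simplifying assumptions, it clusters only the 𝒙𝟎 coordinate of each text block (the slides describe 2D points), seeds k-means at the minimum and maximum value, and uses hypothetical DBSCAN parameters (`eps=50.0`, `min_samples=2`). `ONE_COLUMN_X0` reuses the 𝒙𝟎 values from the parsed dataframe example; `TWO_COLUMN_X0` is made-up data for illustration, and all function names are hypothetical.

```python
from statistics import mean

# x0 values from the one-column example dataframe shown in the slides.
ONE_COLUMN_X0 = [152.62, 97.77, 78.83, 120.30, 71.15]
# Hypothetical x0 values for a two-column page (invented for illustration).
TWO_COLUMN_X0 = [40.0, 42.5, 41.0, 45.0, 310.0, 312.0, 308.5, 315.0]


def kmeans_two_clusters(xs, max_iter=50):
    """Lloyd's k-means on 1-D values with k=2, seeded at the min and max."""
    centers = [min(xs), max(xs)]
    labels = [0] * len(xs)
    for _ in range(max_iter):
        # Assign each point to the nearer center (ties go to cluster 0).
        labels = [0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1 for x in xs]
        new_centers = []
        for k in (0, 1):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def wss(xs, labels, centers):
    """Within-cluster sum of squared distances to the assigned center."""
    return sum((x - centers[lab]) ** 2 for x, lab in zip(xs, labels))


def mean_silhouette(xs, labels):
    """Mean silhouette coefficient; undefined cases are scored 0 by convention."""
    scores = []
    for i, (x, lab) in enumerate(zip(xs, labels)):
        own = [abs(x - y) for j, (y, m) in enumerate(zip(xs, labels)) if m == lab and j != i]
        other = [abs(x - y) for y, m in zip(xs, labels) if m != lab]
        if not own or not other:
            scores.append(0.0)
            continue
        a, b = mean(own), mean(other)  # mean intra- and inter-cluster distance
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def two_cluster_scores(xs):
    """WSS and mean silhouette for k=2, the two values that get thresholded."""
    labels, centers = kmeans_two_clusters(xs)
    return wss(xs, labels, centers), mean_silhouette(xs, labels)


def dbscan_cluster_count(xs, eps, min_samples):
    """Number of DBSCAN clusters for 1-D points.

    A point is a core point if at least min_samples points (itself included)
    lie within eps. In 1-D, sorted core points belong to the same cluster
    as long as consecutive gaps are at most eps.
    """
    pts = sorted(xs)
    cores = [p for p in pts if sum(1 for q in pts if abs(p - q) <= eps) >= min_samples]
    if not cores:
        return 0
    return 1 + sum(1 for prev, cur in zip(cores, cores[1:]) if cur - prev > eps)


wss_one, sil_one = two_cluster_scores(ONE_COLUMN_X0)
wss_two, sil_two = two_cluster_scores(TWO_COLUMN_X0)
clusters_one = dbscan_cluster_count(ONE_COLUMN_X0, eps=50.0, min_samples=2)
clusters_two = dbscan_cluster_count(TWO_COLUMN_X0, eps=50.0, min_samples=2)
```

On this toy data, the two-column page gives a lower WSS and a higher mean silhouette at k=2 than the one-column page, which is the kind of separation the thresholds in slides 15 and 16 try to exploit. The DBSCAN cluster count can be mapped directly to a column estimate, as in slide 17. Real pages would need `eps` and `min_samples` tuned to the page width and the density of text blocks.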