VIVEKANAND EDUCATION SOCIETY’S INSTITUTE OF
TECHNOLOGY
Department of Computer Engineering
Project Report on
Resume and CV Summarization using NLP
In partial fulfillment of the Final Year, Bachelor of Engineering (B.E.) Degree in
Computer Engineering at the University of Mumbai Academic Year 2020-2021.
Submitted by
Anjali Asrani D17B 05
Sneha Indulkar D17B 23
Kaif Siddique D17B 63
Project Mentor
Mrs. Vidya Zope
(2020-2021)
VIVEKANAND EDUCATION SOCIETY’S INSTITUTE OF
TECHNOLOGY
Department of Computer Engineering
Certificate
This is to certify that Anjali Asrani, Sneha Indulkar and Kaif Siddique of Final
Year Computer Engineering studying under the University of Mumbai have
satisfactorily completed the mini project on “Resume and CV Summarization
using NLP” as a part of their coursework of Mini Project for Semester-VIII under
the guidance of their mentor Mrs. Vidya Zope in the year 2020-2021.
This mini project report entitled (Resume and CV Summarization using NLP) by
(Anjali Asrani, Sneha Indulkar and Kaif Siddique) is approved for the degree of
____________.
Programme Outcomes Grade
PO1,PO2,PO3,PO4,PO5,PO6,PO7,
PO8, PO9, PO10, PO11, PO12
PSO1, PSO2
Date:
Project Guide : Mrs. Vidya Zope
MINI PROJECT REPORT APPROVAL FOR B. E
(COMPUTER ENGINEERING)
This mini project report entitled Resume and CV Summarization using NLP by
Anjali Asrani, Sneha Indulkar and Kaif Siddique is approved for the degree of
B.E Computer Engg.
Internal Examiner
---------------------------------------------
External Examiner
---------------------------------------------
Head of the Department
-----------------------------------------------
Principal
-----------------------------------------------
Date:
Place:
Declaration
We declare that this written submission represents our ideas in our own words and
where others' ideas or words have been included, we have adequately cited and
referenced the original sources. We also declare that we have adhered to all
principles of academic honesty and integrity and have not misrepresented or
fabricated or falsified any idea/data/fact/source in our submission. We understand
that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.
______________________________ ______________________________
Anjali Asrani 05 Sneha Indulkar 23
______________________________
Kaif Siddique 63
Date :
ACKNOWLEDGEMENT
We are thankful to our college, Vivekanand Education Society’s Institute of Technology,
for considering our project and extending help at every stage of our work of
collecting information regarding the project.
It gives us immense pleasure to express our deep and sincere gratitude to Assistant
Professor Mrs. Vidya Zope (Project Guide) for her kind help and valuable advice during the
development of the project synopsis and for her guidance and suggestions.
We are deeply indebted to the Head of the Computer Department, Dr. (Mrs.) Nupur Giri,
and our Principal, Dr. (Mrs.) J.M. Nair, for giving us this valuable opportunity to do this
project.
We express our hearty thanks to them for their assistance, without which it would have
been difficult to finish this project synopsis and project review successfully.
We convey our deep sense of gratitude to all teaching and non-teaching staff for their
constant encouragement, support and selfless help throughout the project work. It is a great
pleasure to acknowledge the help and suggestions which we received from the Department of
Computer Engineering.
We wish to express our profound thanks to all those who helped us in gathering
information about the project. Our families too have provided moral support and
encouragement throughout.
Computer Engineering Department
COURSE OUTCOMES FOR B.E PROJECT
Learners will be able to:
Course Outcome    Description of the Course Outcome
CO 1              Do a literature survey/industrial visit and identify the
                  problem of the selected project topic.
CO 2              Apply basic engineering fundamentals in the domain of
                  practical applications for problem identification,
                  formulation and solution.
CO 3              Attempt and design a problem solution with the right approach
                  to complex problems.
CO 4              Cultivate the habit of working in a team.
CO 5              Correlate the theoretical and experimental/simulation
                  results and draw the proper inferences.
CO 6              Demonstrate the knowledge, skills and attitudes of a
                  professional engineer and prepare a report as per the standard
                  guidelines.
Abstract
This project proposes a model for extracting important information from the semi-structured
text of a curriculum vitae (CV) or resume and ranking it according to the preferences and
requirements of the associated company. To achieve this goal, the process is
divided into three basic segments. The first segment splits the CV /
Resume into parts based on the topic of each part, the second extracts structured data
from the unstructured text, and the final segment evaluates the
structured data with a decision tree algorithm and trains the system. Structured data extraction
starts by converting the CV / Resume to text and segmenting it. After the
conversion to structured data, decision tree techniques classify the input into
different categories based on qualifications, experience, etc.
CHAPTER 1
INTRODUCTION
____________________________________________________________________________
1.1 Introduction to the project
After completing education, the next phase in a person’s life is a job.
However, many people start working before completing their formal education.
While searching for jobs, the most important document representing an applicant is the Curriculum Vitae
(CV) or Resume. In this era of technology, job searching has become smarter and easier at the
same time. However, there are usually many applicants for a single job, and it is tough
for an employer to select candidates based only on their CV / Resume. To solve this problem, some
companies provide specific formats for their applicants to make this process
a little easier. Even then, the process remains tedious and, in most cases, error-prone.
Every organization has to deal with folders full of resumes. Going through these
resumes is tiring and very time consuming. It would help greatly
if a model existed that processes the resumes and not only reports how many and
which of them meet the requirements, but also gives a summarized version of each resume. These
concise resumes can be of great use to a hiring company in its selection procedures, allowing it to
immediately reject applications that are not suitable for the job description.
1.2 Problem Definition
Large companies and recruitment agencies receive, process and manage hundreds of
resumes from job applicants. In addition, many people publish their resumes on the web. Dealing with
loads of resumes at once is time consuming, since all the recruiter needs is the set of valuable resumes
that represent candidates specialized in the fields the company/agency is looking for. The resumes
received in their original format (pdf, docm etc.) are unstructured. The unstructured format of
these resumes, with varied templates and fonts, makes them difficult to process. These resumes
can be automatically retrieved and processed by a resume information extraction system. Extracted
information such as name, phone / mobile numbers, e-mail IDs, qualifications, experience, skill sets etc.
can be stored as structured information in a database.
1.3 Scope of project
Although the research details one of the most feasible ways to evaluate a CV / Resume,
the domain was restricted to the CVs / Resumes of engineering students only, and the
amount of sample data versus the amount of test data was relatively small. In addition, CVs /
Resumes with widely varied layout designs are out of the scope of this paper. As future scope,
the methodologies can be applied to data from CVs / Resumes of other job
departments, or the whole research can be repeated at a much larger scale.
1.4 CV / Resume Analyzing Process
In the past, CVs/Resumes submitted by job seekers were manually analyzed and
judged by employers. This method is still followed today. However, as big
companies often need to deal with hundreds of CVs/Resumes every day, it has become
very problematic and time consuming to handle such a large number of CVs/Resumes one by one.
As a result, many companies started to provide specific formats or forms which job seekers
fill in with the required information, so that the CV/Resume can be analyzed by machine
with simple pattern recognition and keyword searching. While this method reduced the workload
for employers, it significantly increased the amount of work for applicants, as they need to
maintain a different format for each job they apply for. Additionally, it tends to reduce the
creativity and flexibility with which skills and qualifications can be presented in a CV/Resume.
1.5 Natural Language Processing Approach
With all the pros and cons in mind, there has always been an attempt to find an automated
method that offers the best of both worlds, where employers can easily select qualified candidates
in a short time and applicants can still demonstrate their creativity while maintaining
just one format to apply to different organizations. Innovation in the field of Natural Language
Processing [4] along with Machine Learning [5] has been really helpful in this case. The ability to
understand unstructured written language and extract important information from it to teach the
machine is exactly what is needed to analyze written documents such as resumes the way
human beings do.
1.6 Technologies to be used
1. Software requirement :
i. Jupyter Notebook
ii. Colab
2. Technology used
i. NLP
ii. Python
iii. spaCy
CHAPTER 2
LITERATURE SURVEY
______________________________________________________________________________
2.1 Research Papers Referred
● Satyaki Sanyal et al. [1] parse information from a resume using natural language
processing, find the keywords, cluster the resumes into sectors based on those keywords and lastly
show the most relevant resume to the employer based on keyword matching. First, the user
uploads a resume to the web platform. The parser extracts all the necessary information from
the resume and auto-fills a form for the user to proofread. Once the user confirms, the
resume is saved into a NoSQL database, ready to be shown to employers. The
user also gets their resume in both JSON and PDF formats.
● Dr. K. Satheesh, A. Jahnavi et al. [2] observe that screening resumes in bulk is a
challenging task and that recruiters or hiring managers spend a lot of valuable time
searching through every resume. Resumes are often populated with irrelevant and
unnecessary information, so parsing thousands of resumes manually
consumes a lot of time and energy and makes the hiring process expensive. In
this paper, the process of screening resumes is automated using advanced
Natural Language Processing techniques. The model helps
recruiters screen resumes against a job description quickly. It makes the
hiring process easy and efficient by extracting the required entities automatically using
the spaCy NER model and then generating a graph displaying the score of
each resume. Based on the scores, the recruiter can choose the required candidates
without rummaging through piles of resumes from unqualified candidates.
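The score-based ranking described in [2] can be illustrated with a small sketch. This is not the scoring formula of the cited paper: it assumes that the entities of each resume have already been extracted, and it uses a simple overlap ratio against the skills listed in a job description. The names `score_resume` and `rank_resumes` and the sample data are hypothetical.

```python
def score_resume(resume_entities, required_skills):
    """Return the fraction of required skills found among the resume entities."""
    required = {skill.lower() for skill in required_skills}
    if not required:
        return 0.0
    found = {entity.lower() for entity in resume_entities}
    return len(required & found) / len(required)


def rank_resumes(resumes, required_skills):
    """Sort (candidate, score) pairs from the best match to the worst."""
    scored = [(name, score_resume(entities, required_skills))
              for name, entities in resumes.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical extracted entities for two candidates.
resumes = {
    "candidate_a": ["Java", "MySQL", "HTML5"],
    "candidate_b": ["Python"],
}
ranking = rank_resumes(resumes, ["java", "mysql", "python", "linux"])
```

A ratio keeps the score between 0 and 1 regardless of how many skills the job description lists, which makes scores comparable across job postings.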
CHAPTER 3
CONCEPTUAL SYSTEM DESIGN
______________________________________________________________________________
3.1 System diagram
The resumes received in their original format (pdf, docm etc.) are unstructured.
The unstructured format of these resumes, with varied templates and fonts, makes them difficult to
process. These resumes can be automatically retrieved and processed by a resume information
extraction system. Extracted information such as name, phone / mobile numbers, e-mail IDs,
qualifications, experience, skill sets etc. can be stored as structured information in a database.
Sample resume or CV considered:
Segment 1 : (Name and details)
Md. Sakib Zaman Flat-A3, 127 West Kafrul, Agargaon, Taltola, Dhaka 1207 Mobile:
+8801912397694 E-mail: sakib2033@gmail.com
Segment 2 : (Working experience)
Employment Status Currently working as a Student Tutor/Teaching Assistant at
Department of Computer Science & Engineering, BRAC University from January 2017 Currently
working as a Student Trainer at Competitive Programming Training Session organized by
Department of Computer Science & Engineering, BRAC University and BRAC University ACM
Students Chapter from August 2016 Currently working as a Student Mentor at First Year Advising
Team, BRAC University Former Intern Software Engineer at Projukti Next from 2nd May 2017 to
31st May 2017
Segment 3 : (Educational qualifications)
Educational Qualification Final year student of Computer Science and Engineering, BRAC
University, Dhaka CGPA: 3.74 in scale of 4.0 (till April, 2017) H.S.C. (2013) from Notre Dame
College, Dhaka GPA: 5.0 in scale of 5.0 S.S.C (2011) from Sher-E-Bangla Nagar Govt. Boys High
School, Dhaka GPA: 5.0 in scale of 5.0
Segment 4 : (Technical skills)
Field dependent Technical Skills Programming Languages: Java, C, C++, C# Operating
Systems: Windows, Linux Database System: MySQL Web: HTML5, CSS3
Segment 5 : (Awards & achievements)
Achievements in Competitions and Programming 1st Runner-up in BRAC
University Intra University Programming Contest,
Segment 6 : (Projects)
Field dependent: Projects Currently working on an Online File Server System for
Educational Institutions Library Management System for Data Structure course Hospital
Management System for Data Structure course Cineplex Management System for Database course
Dhaka City Management System for Software Engineering course
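The segmentation shown above, and the extraction of contact details from Segment 1, can be sketched as follows. This is a minimal illustration and not the project's actual code: the heading strings in `HEADINGS`, the two regular expressions and the function names are assumptions chosen to fit the sample resume.

```python
import re

# Assumed heading text that marks the start of each segment in the sample.
HEADINGS = {
    "experience": "Employment Status",
    "education": "Educational Qualification",
    "skills": "Technical Skills",
    "projects": "Projects",
}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")


def split_segments(text):
    """Cut the resume text at each known heading; missing headings are skipped."""
    found = sorted((text.find(heading), name, heading)
                   for name, heading in HEADINGS.items() if heading in text)
    first = found[0][0] if found else len(text)
    segments = {"details": text[:first].strip()}
    for i, (pos, name, heading) in enumerate(found):
        end = found[i + 1][0] if i + 1 < len(found) else len(text)
        segments[name] = text[pos + len(heading):end].strip()
    return segments


def extract_contact(text):
    """Pull the first e-mail address and phone number out of a text segment."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
    }


sample = ("Md. Sakib Zaman Mobile: +8801912397694 E-mail: sakib2033@gmail.com "
          "Employment Status Student Tutor at BRAC University "
          "Educational Qualification CGPA: 3.74 "
          "Technical Skills Java, C, C++")
segments = split_segments(sample)
contact = extract_contact(segments["details"])
```

Heading lookup by fixed strings only works when resumes share a vocabulary of section titles. Resumes with other layouts would need a learned segmenter, which is consistent with the scope limits stated in Chapter 1.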
3.2 Flowchart
We use Python’s spaCy module [3] for training the NER model. spaCy’s models are statistical, and
every “decision” they make (for example, which part-of-speech tag to assign, or whether a word
is a named entity) is a prediction. This prediction is based on the examples the model has seen
during training.
During training, the model is shown the unlabelled text and makes a prediction. Because we know
the correct answer, we can give the model feedback on its prediction in the form of an error
gradient of the loss function, which measures the difference between the model's prediction and the
expected output. The greater the difference, the more significant the gradient and the updates to
our model.
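The feedback loop described above can be illustrated with a toy example. The sketch below is not spaCy's implementation: it swaps in a two-weight logistic classifier that decides whether a token is a named entity, with hand-made features and made-up training tokens, purely to show that a larger prediction error produces a larger weight update.

```python
import math


def features(token):
    # Hypothetical hand-made features: a bias term and a capitalisation flag.
    return [1.0, 1.0 if token[:1].isupper() else 0.0]


def predict(weights, token):
    # Probability that the token is a named entity (logistic function).
    z = sum(w * x for w, x in zip(weights, features(token)))
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, token, label, learning_rate=0.5):
    # For the log loss, the gradient with respect to each weight is
    # (prediction - label) * feature, so a larger error gives a larger step.
    error = predict(weights, token) - label
    new_weights = [w - learning_rate * error * x
                   for w, x in zip(weights, features(token))]
    return new_weights, error


def train(examples, epochs=20):
    weights = [0.0, 0.0]
    mean_errors = []
    for _ in range(epochs):
        total = 0.0
        for token, label in examples:
            weights, error = update(weights, token, label)
            total += abs(error)
        mean_errors.append(total / len(examples))
    return weights, mean_errors


# Made-up labelled tokens: 1 = named entity, 0 = ordinary word.
EXAMPLES = [("Dhaka", 1), ("working", 0), ("BRAC", 1),
            ("at", 0), ("Java", 1), ("from", 0)]
weights, mean_errors = train(EXAMPLES)
```

As training proceeds the mean error per epoch shrinks, and so do the updates, which is the behaviour the paragraph above describes for the real model.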
CHAPTER 4
IMPLEMENTATION AND EVALUATION
____________________________________________________________________________
4.1 Resume summarization using NER:
Data Preparation:
Our first task is to create manually annotated training data to train the model. We
use an online annotation tool called Dataturks, which automatically parses the documents and
lets us annotate the required entities.
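Annotations exported from such a tool usually have to be converted into the (text, entities) pairs that an NER trainer expects. The sketch below assumes an export with one JSON object per line, a `content` field holding the text and an `annotation` list whose entries carry `label` and `points` with `start` and `end` offsets. These field names, the inclusive `end` offset and the function name are assumptions that should be checked against the actual export.

```python
import json


def dataturks_to_spacy(lines):
    """Convert assumed Dataturks JSON lines into (text, {"entities": [...]}) pairs."""
    training_data = []
    for line in lines:
        record = json.loads(line)
        text = record["content"]
        entities = []
        for annotation in record.get("annotation") or []:
            labels = annotation["label"]
            if not isinstance(labels, list):
                labels = [labels]
            for point in annotation["points"]:
                for label in labels:
                    # Assumption: the export uses inclusive end offsets,
                    # while Python slices and spaCy spans are end-exclusive.
                    entities.append((point["start"], point["end"] + 1, label))
        training_data.append((text, {"entities": entities}))
    return training_data


# A made-up export line in the assumed layout.
example_line = json.dumps({
    "content": "Sakib Zaman knows Java",
    "annotation": [
        {"label": ["Name"], "points": [{"start": 0, "end": 10}]},
        {"label": ["Skills"], "points": [{"start": 18, "end": 21}]},
    ],
})
converted = dataturks_to_spacy([example_line])
```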
PDF to Text:
Our aim for this project is an end-to-end tool which takes in a document
and gives out the expected result, in this case the category and the summary. Since the majority of
resumes are submitted in PDF format, we added a preprocessing step that
converts PDF to text. We used pdfminer under Python for this task. Note that pdfminer extracts the
text embedded in the PDF rather than performing Optical Character Recognition, so scanned resumes
would need a separate OCR step.
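A minimal sketch of this preprocessing step is given below. It assumes the pdfminer.six distribution and its `extract_text` helper from `pdfminer.high_level`. The helper names `pdf_to_text` and `normalize_whitespace` are hypothetical, and the whitespace rules are one reasonable choice rather than the project's actual settings.

```python
import re


def normalize_whitespace(text):
    """Tidy raw extracted text before it is passed on to cleaning and NER."""
    text = text.replace("\x0c", "\n")          # form feeds mark page breaks
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces and tabs
    text = re.sub(r"\n\s*\n+", "\n\n", text)   # collapse runs of blank lines
    return text.strip()


def pdf_to_text(path):
    """Extract the embedded text layer of a PDF file (no OCR is performed)."""
    # Imported lazily so that normalize_whitespace works without pdfminer.six.
    from pdfminer.high_level import extract_text
    return normalize_whitespace(extract_text(path))
```

A call such as `pdf_to_text("resume.pdf")`, where the file name is a placeholder, would return the resume as a single cleaned string.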
Data Cleaning :
1. Unnecessary separators: Many resumes had separators such as a string of ’-’ characters, which
were removed [3].
2. Punctuation and Stop Words: Punctuation and stop words did not add any value
to the analysis, so they were removed.
3. Erroneous Formatting: Some records had highly erroneous formatting
that got in the way of cleaning and analysis. These records were discarded.
4. Personal details: Details like email IDs, phone numbers, dates etc. would add
noise and hardly any value to the process of
classification. It was hence considered best to remove them.
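The cleaning steps listed above can be sketched with the standard library alone. The stop-word list, the regular expressions and the function name are illustrative assumptions, and date removal is left out for brevity. A real pipeline would more likely use a full stop-word list from an NLP library.

```python
import re
import string

# A tiny illustrative stop-word list (assumption, not the project's list).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "at", "to",
              "for", "is", "as", "from"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")
SEPARATOR_RE = re.compile(r"[-_=*]{3,}")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_resume_text(text):
    """Apply the cleaning steps: personal details, separators, punctuation, stop words."""
    text = EMAIL_RE.sub(" ", text)       # personal details: e-mail addresses
    text = PHONE_RE.sub(" ", text)       # personal details: phone numbers
    text = SEPARATOR_RE.sub(" ", text)   # separators such as "--------"
    # Replacing punctuation with spaces also shortens tokens like "C++" to "C".
    text = text.translate(PUNCT_TABLE)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_resume_text(
    "-------- Worked as a Tutor at BRAC University, Dhaka. "
    "Contact: sakib2033@gmail.com, +8801912397694"
)
```

E-mail addresses and phone numbers are removed before punctuation, because stripping punctuation first would break them into fragments that the patterns no longer match.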
Class Labels and Indexing : For added ease during further processing, the class
labels were mapped to numeric constants [3]. This made it easier for the model to predict.
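The label mapping can be sketched as a pair of dictionaries. The label names below are hypothetical, and sorting the labels is an assumption made here so that the numbering is reproducible between runs.

```python
def build_label_index(labels):
    """Map each distinct class label to an integer and back."""
    classes = sorted(set(labels))
    label_to_id = {label: index for index, label in enumerate(classes)}
    id_to_label = {index: label for label, index in label_to_id.items()}
    return label_to_id, id_to_label


# Hypothetical class labels attached to four training records.
labels = ["Skills", "Name", "Degree", "Skills"]
label_to_id, id_to_label = build_label_index(labels)
encoded = [label_to_id[label] for label in labels]
```

Keeping the reverse mapping lets numeric predictions be turned back into readable labels when the results are reported.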
CHAPTER 6
CONCLUSION AND FUTURE SCOPE
______________________________________________________________________________
6.1 Conclusion
Our application helps recruiters screen resumes more efficiently, thereby reducing
the cost of hiring. This provides the organization with a potential candidate, and the candidate is
placed in an organization which appreciates his/her skill set and ability.
6.2 Future Scope
The application can be extended further to other domains such as Telecom, Healthcare,
E-commerce and public sector jobs.
CHAPTER 7
REFERENCES
______________________________________________________________________________
● [1] Resume Parser with Natural Language Processing, International Journal of Engineering
Science and Computing, February 2017:
https://www.researchgate.net/publication/313851778_Resume_Parser_with_Natural_Language_Processing
● [2] Resume Ranking based on Job Description using SpaCy NER model, International Research
Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, p-ISSN: 2395-0072,
Volume: 07, Issue: 05, May 2020:
https://www.irjet.net/archives/V7/i5/IRJET-V7I516.pdf
● [3] Article: A Review of Named Entity Recognition (NER) Using Automatic
Summarization of Resumes:
https://towardsdatascience.com/a-review-of-named-entity-recognition-ner-using-automatic-summarization-of-resumes-5248a75de175