Resume Summary
Submitted by: Team 8
Krishna Chouhan, 201405635
Ala Praveen, 201405617
Kalpit Thakkar, 201201071
Anubhav Shrivastava, 201201105
Guide: Prof. Vasudeva Varma
Mentor: Ashish Kumar
Content
Abstract
Architecture
Technology
Working
Future work
References
Project Links
Abstract
Analyzing a single resume manually is feasible, but analyzing a large
collection of resumes manually is not. Finding a particular piece of
information across such a collection by hand is equally impractical.
Storing resumes in a defined format and retrieving the required information
becomes easier once the resumes have been parsed and refined.
Resumes come in many document formats. The goal is therefore not only to
convert this unstructured data into a structured format for better storage,
but also to allow fast extraction of the information.
Architecture
Resume Summarizer is divided into two parts:
- Resume Parser
- Search Resume
Resume Parser:
Parses resumes and converts the unstructured resume files into a
structured collection of resumes.
Search Resume:
Searches the collection of structured resumes for the required
information and displays it.
Technology Used
- Java: Parses each resume into a simple format that Hadoop can use for
mapping and reducing the data.
- Hadoop: The system uses Hadoop's map and reduce approach to operate on
resumes. There are several mappers, one for each file format, but only a
single reducer.
- MySQL: Stores the structured data locally.
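The one-mapper-per-format, single-reducer arrangement can be sketched as follows. This is a minimal in-memory illustration in Python, not the project's Java/Hadoop code. The format names, the `map_pdf` and `map_docx` functions, and the key/value layout are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical mappers, one per supported file format. Each receives text
# already extracted from a file and emits (resume_id, line) pairs.
def map_pdf(resume_id, text):
    # Assumption: extracted PDF text uses form feeds as page breaks.
    for line in text.replace("\f", "\n").splitlines():
        if line.strip():
            yield resume_id, line.strip()

def map_docx(resume_id, text):
    # Assumption: extracted DOCX text has one paragraph per line.
    for line in text.splitlines():
        if line.strip():
            yield resume_id, line.strip()

MAPPERS = {"pdf": map_pdf, "docx": map_docx}

def reduce_resume(resume_id, lines):
    # The single reducer: collects every line emitted for one resume.
    return {"id": resume_id, "lines": list(lines)}

def run_job(files):
    # files: iterable of (resume_id, file_format, extracted_text)
    grouped = defaultdict(list)
    for resume_id, file_format, text in files:
        mapper = MAPPERS[file_format]  # pick the mapper for this format
        for key, value in mapper(resume_id, text):
            grouped[key].append(value)  # shuffle step: group values by key
    return [reduce_resume(key, values) for key, values in grouped.items()]
```

Adding a new file format in this sketch means registering one more mapper in `MAPPERS`; the reducer stays unchanged.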
Working
The project works with two types of input.
The first input is a resume, which is processed to extract the information
from the unstructured file and convert it into a structured format. It can be
a single resume uploaded live by a user or a collection of resumes.
The second input is a search query, which is run against the collection of
structured resumes in storage.
Input:
An interface is provided where the user can upload a resume and optionally
specify its format.
If the user specifies a format, the uploaded file's format is checked against
it. If they match, the resume is processed by the resume parser and the
results are stored in the database. Otherwise, the uploaded file's format is
checked against the formats the parser supports; if it matches one of them,
the resume is parsed and stored in the database in a structured format.
The difficulty is that resumes need not follow any fixed structure, so
processing and storing unstructured resumes in the database is a hard task.
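The format check described above might look like the following sketch. It is written in Python for brevity and is not the project's code. The list of supported formats, the extension-based detection, and the names `detect_format` and `choose_format` are all assumptions, since the slides do not specify them.

```python
from pathlib import Path

# Assumption: illustrative list; the slides do not name the supported formats.
SUPPORTED_FORMATS = {"pdf", "doc", "docx", "txt"}

def detect_format(filename):
    # Assumption: the format is inferred from the file extension.
    return Path(filename).suffix.lower().lstrip(".")

def choose_format(filename, specified=None):
    """Return (format_to_parse_with, note); format is None if unparseable."""
    detected = detect_format(filename)
    if detected not in SUPPORTED_FORMATS:
        return None, "unsupported format"
    if specified is None:
        return detected, "format detected from file"
    if specified.lower().lstrip(".") == detected:
        return detected, "matches specified format"
    # The user's hint disagrees with the file, so fall back to detection.
    return detected, "specified format ignored"
```

For example, `choose_format("cv.docx", "pdf")` falls back to the detected `docx` format, while an unsupported upload such as `cv.png` yields `None` so the caller can reject it.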
Working
Processing Resume:
Resumes are processed from either kind of input (a single live upload or a
collection).
First, the resume is parsed for information according to the format of the
resume file.
Second, the data obtained from parsing is given as input to a mapper. There is
one mapper for each supported file format, and each maps the data accordingly.
Last, the mapped data is passed to the reducer, which reduces it into the
desired format and stores it in the storage.
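The three steps (parse, map, reduce) can be illustrated with a small field-extraction sketch. This is Python written for illustration, not the project's Java/Hadoop implementation. The regular expressions, the field names `email`, `phone`, and `other`, and the function names are assumptions; the slides do not say which fields are extracted or how.

```python
import re

# Assumption: simple illustrative patterns, not production-grade validators.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def parse(raw_text):
    # Step 1: format-specific parsing, reduced here to splitting plain text.
    return [line.strip() for line in raw_text.splitlines() if line.strip()]

def map_line(line):
    # Step 2: the mapper tags each line with the field it appears to hold.
    match = EMAIL.search(line)
    if match:
        return "email", match.group()
    match = PHONE.search(line)
    if match:
        return "phone", match.group()
    return "other", line

def reduce_fields(pairs):
    # Step 3: the reducer merges the tagged values into one structured record.
    record = {"email": [], "phone": [], "other": []}
    for field, value in pairs:
        record[field].append(value)
    return record

def process(raw_text):
    return reduce_fields(map_line(line) for line in parse(raw_text))
```

Lines that match no pattern are kept under `other` rather than dropped, so nothing from the original resume is lost at this stage.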
Working
Resume Search:
The same interface also lets the user query the database using keywords or
data elements expected to appear in resumes.
After the user enters a search query, operations are performed on the database
using the conditions and filters in the query.
The results are then refined, ranked, and displayed as the response to the
user's query.
The workflow is as follows:
[Figure: resume search workflow]
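The filter-then-rank behaviour of the search step can be sketched as below. This is a Python illustration that works on in-memory dictionaries instead of the project's MySQL database. The scoring rule (number of query keywords found in a record) and the `search` function are assumptions, because the slides do not describe how the rank is computed.

```python
def search(records, keywords, filters=None):
    """Filter records, then rank them by how many keywords they contain."""
    filters = filters or {}
    keywords = [k.lower() for k in keywords]
    ranked = []
    for record in records:
        # Apply the query's conditions as exact-match field filters.
        if any(record.get(field) != value for field, value in filters.items()):
            continue
        text = " ".join(str(v) for v in record.values()).lower()
        # Assumed rank: count of query keywords that occur in the record.
        score = sum(1 for k in keywords if k in text)
        if score > 0:
            ranked.append((score, record))
    # Highest score first; the sort is stable, so ties keep their input order.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in ranked]
```

A query for `["java", "hadoop"]` filtered on a hypothetical `city` field returns only records from that city, with those mentioning both keywords listed before those mentioning one.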
Working
Storage & Display:
After the resumes are processed, the data is stored in a tuple-based structure.
Each tuple stores one resume, and the fields of the resume are stored as
comma-separated columns.
Each column of a tuple holds a specific piece of information, so each field of
the resume is matched to a column. Data that could not be parsed successfully,
along with low-priority information, is stored in the last column of the tuple.
When the results of a search query are displayed, the storage is searched and
the resumes are shown in order of the rank each obtained when searched for the
relevant fields.
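The tuple layout with a catch-all last column can be sketched as follows. This is a Python illustration only. The column names in `COLUMNS` and the `key=value` encoding of leftover fields are assumptions, since the slides do not list the actual columns.

```python
import csv
import io

# Assumption: illustrative column names; the last one is the catch-all column.
COLUMNS = ["name", "email", "phone", "skills", "other"]

def to_tuple(parsed):
    """Map parsed fields to fixed columns; leftovers go in the last column."""
    known = [parsed.get(col, "") for col in COLUMNS[:-1]]
    leftovers = [f"{k}={v}" for k, v in parsed.items() if k not in COLUMNS[:-1]]
    return tuple(known + ["; ".join(leftovers)])

def to_csv_line(row):
    # csv.writer quotes any field that itself contains a comma.
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().rstrip("\r\n")
```

Using the `csv` module rather than joining with commas by hand keeps a value such as `Java, Hadoop` in a single column.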
Future Work
Resume Summarizer operates on a collection of resumes in certain file formats
and stores them in a structured format. The project currently parses a limited
number of file formats; future work could extend this to more complicated
formats such as images.
The identification of resume elements could also be improved by applying
machine learning in the summarizer after parsing.
References
http://stackoverflow.com/questions/2036236/tips-on-how-to-parse-custom-file-format
https://thomaslevine.com/!/parsing-pdfs
http://stackoverflow.com/questions/4015477/read-pdf-files-using-java
https://blogs.oracle.com/prasanna/entry/openoffice_parser_extracting_text_from
http://javabeginnerstutorial.com/code-base/read-doc-file-in-java-using-poi/
http://stackoverflow.com/questions/16476711/how-to-read-docx-file-content-in-java-api-using-poi-jar
https://en.wikipedia.org/wiki/MapReduce
https://docs.mongodb.org/manual/core/map-reduce/