Resume Text Generation Using Language Models
Hana Ba-Sabaa, Aadil Islam, Joe Zhang
Dartmouth College
hana.h.ba-sabaa.22@dartmouth.edu,
{aadil.islam,joe.zhang}.gr@dartmouth.edu
Abstract
Text generation is a challenging task in
Natural Language Processing in which the
goal is to generate text when provided with
some input. The release of pre-trained,
open-source text generation models such as
GPT-1, GPT-2, GPT-3, and GPT-Neo has
attracted considerable attention due to the
algorithms' alleged ability to generate
human-like text. Whether Natural Language
Generation (NLG) algorithms produce
"natural" text, as measured by human
evaluators' inability to distinguish
human-written from machine-generated text,
is not a well-studied area. We conducted an
experiment assessing behavioral reactions
to the state-of-the-art NLG algorithms
GPT-2 and GPT-Neo and compared their
output to human-written text. In our study,
algorithm-generated text did not outperform
human-written text, as evaluated by our
human participants. Our results do not align
with current contributions in the field,
which suggest that participants fail to
reliably detect algorithmically generated
text in the human-in-the-loop treatment. We
discuss how these results reflect
shortcomings in our sample size, which, if
addressed in future research, could further
test the ability of NLG algorithms to
produce human-like text.
1 Introduction
Each year, many students and young professionals
apply for jobs and compose or edit their
resumes. Our project is
an English language resume text generator that
will assist applicants through the arduous process
of applying for jobs. Our tool outputs bullet points
that elaborate on the job requirements or description
given a topic or a job title. For instance, when the
user inputs the job title “School Counselor”
followed by the seed phrase “Worked at Smith
High School as a school counselor. Helped
students…”, an example output would be:
“...understand and overcome social or behavioral
problems.” Since it is difficult to write resumes
that properly describe a user’s prior job/internship
experiences without knowing the specifics of the
user’s experiences, our tool will provide a
template that can be personalized based on
personal experiences. The output can be adjusted
and modified manually by the user in any text
editor.
Similar work was done in a research study by
Köbis et al. (2021) that explored people’s ability
to discern artificial content from human content.
In this experiment, humans directly competed
with an AI agent in the form of the natural
language generation (NLG) algorithm GPT-2
(Radford et al., 2019). The results showed that
human participants were incapable of reliably
detecting algorithm-generated poetry, even when
incentivized to do so. Additionally, the study
showed significantly higher preference for
algorithm-generated texts when humans were
involved in the selection process, versus when the
selection was random. We drew inspiration from
the study by introducing human-in-the-loop
factors into our evaluation techniques when
selecting the machine-generated text.
The work of Lee et al. (2020) leveraged GPT-2
355M to generate coherent patent claims. Their
training dataset comprised over 500,000 of the
granted U.S. utility patents in 2013, which were
then preprocessed by splitting claim texts into
smaller units of inventive thoughts called ‘claim
spans,’ circumventing the need to fine-tune GPT-
2 upon entire claim texts that may be too coarse-
grained to model upon. This inspired us to
reconsider generating full resume texts and
instead focus on modeling smaller phrases such as
bullet points for job titles. The authors also
observed that generating the first reasonably
coherent patent claims required relatively few
fine-tuning steps, suggesting that future deep
learning models may hold similar promise within
related domains.
2 Methodology
2.1 Dataset
The dataset leveraged in this project was obtained
from Kaggle [1] and comprises 2,484 resumes
scraped from a database of example resumes
hosted by LiveCareer [2]. We kept only resumes
that were scored 85 (out of 100) or above by
LiveCareer agents, i.e., those judged by former
recruiters to be fairly well written. This was to
ensure our system is fine-tuned on the highest-
quality resumes possible. The dataset offers the
following attributes for each resume:
● ID: Unique identifier for the resume PDF.
● Resume_str: Resume text in string format.
● Resume_html: Resume content in HTML
format as present from web scraping.
● Category: Type of job that the resume was
used to apply for: HR, Designer,
Information-Technology, Teacher,
Advocate, Business-Development,
Healthcare, Fitness, Agriculture, BPO,
Sales, Consultant, Digital-Media,
Automobile, Chef, Finance, Apparel,
Engineering, Accountant, Construction,
Public-Relations, Banking, Arts, Aviation.
2.2 Preprocessing
Figure 1: Resume #2 of dataset after preprocessing. Extracted job
titles and bullet points are colored in red and green, respectively.
We parsed HTML contents from resumes and
extracted tags containing job titles and their
corresponding bullet points. We then performed
[1] https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset
preprocessing steps to ensure that the data is
formatted suitably for being fed as input to our
language models. For instance, after extracting
the necessary text from each resume, we remove
all characters that are neither alphanumeric nor
punctuation. We kept punctuation in order to
retain proper formatting for dependent clauses
and lists. We kept numbers simply as
placeholders; even though the precise values may
not apply to each user, he/she can easily copy and
paste our generated bullet points and edit them by
incorporating his/her individual experiences.
Figure 1 visualizes the content we handle from a
random resume’s HTML content.
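As an illustrative sketch of the character-filtering step described above (the function name and exact character set are our assumptions, not the project's actual code), the cleaning could be implemented as:

```python
import string

def clean_text(text: str) -> str:
    # Keep alphanumerics, whitespace, and standard punctuation; drop
    # everything else (e.g. stray symbols left over from HTML scraping).
    allowed = set(string.ascii_letters + string.digits +
                  string.punctuation + " \t\n")
    return "".join(ch for ch in text if ch in allowed)
```

Numbers and punctuation pass through unchanged, preserving list formatting and numeric placeholders as described.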
In order to enable users to customize the
information they want to generate, down to the
desired job title to be written about, we
transformed our original dataset of 2,484 resumes
into a dataset of 64,441 tuples of form (job title,
bullet point) comprising all observed pairs of job
titles and bullet points. This yields text samples of
the form “<JOB_TITLE>: <BULLET_POINT>”
such as:
Assistant Head Teller:
Consistently met or exceeded
quarterly sales goals.
By fine-tuning our language models on such
text samples, we allow users to pass in a prompt
of the form “<JOB_TITLE>”, optionally
followed by a colon and one or more seed words,
and receive synthetic job-specific bullet points.
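The tuple-to-sample transformation described above can be sketched as follows (a minimal illustration; the helper name is ours, not the project's):

```python
def make_training_samples(pairs):
    # pairs: list of (job_title, bullet_point) tuples extracted from resumes.
    # Returns one "<JOB_TITLE>: <BULLET_POINT>" text sample per pair, the
    # format the language models are fine-tuned on.
    return [f"{title}: {bullet}" for title, bullet in pairs]

samples = make_training_samples([
    ("Assistant Head Teller",
     "Consistently met or exceeded quarterly sales goals."),
])
```

Applying this over all 2,484 resumes yields the 64,441 text samples used for fine-tuning.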
2.3 Algorithm
Language generation models that we
experimented with include GPT-2 124M and
GPT-Neo 125M (Black et al., 2021). Though we
originally intended to use GPT-Neo 355M in
order to examine the effect of additional neural
network layers on the quality of GPT-Neo text
generations, we were held back by GPU memory
limitations of our Google Colab coding
environments (Bisong, 2019). We chose the
aforementioned models over the state-of-the-art
GPT-3 architecture proposed by OpenAI (Brown
et al., 2020) because the latter is not open source
and because of the heavy monetary cost of using
OpenAI's API to fine-tune on our sizable dataset.
In contrast, GPT-Neo models developed by
EleutherAI [3] are open source and offer
performance competitive with that of GPT-2
models [4]. Compared to GPT-2 models,
GPT-Neo models are said to be more suitable for
longer texts and are pre-trained on more recent
data.
[2] https://www.livecareer.com/resume-search/
To fine-tune and evaluate our language models,
we used the aitextgen library [5], a Python toolkit
for text-based AI training and generation across
numerous deep learning architectures. We fine-
tuned each language model on a single GPU for
10 epochs with a learning rate of 1e-3.
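The fine-tuning setup can be sketched with aitextgen roughly as follows. The model name and learning rate follow the text above; the file path is a placeholder, and aitextgen's train() counts optimization steps rather than epochs, so the step count shown is illustrative rather than our exact configuration:

```python
from aitextgen import aitextgen

# Load the pre-trained GPT-Neo 125M checkpoint from Hugging Face.
ai = aitextgen(model="EleutherAI/gpt-neo-125M")

# Fine-tune on a text file of "<JOB_TITLE>: <BULLET_POINT>" samples.
ai.train("resume_samples.txt", learning_rate=1e-3, num_steps=5000)

# Prompt with a job title, optionally followed by a colon and seed words.
ai.generate(prompt="School Counselor: Helped students", max_length=60)
```

The same pipeline applies to GPT-2 124M by loading aitextgen's default GPT-2 checkpoint instead.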
2.4 Evaluation
2.4.1 Human Evaluators
We created a Google Form containing 48 text
samples of both synthetic and human-written
bullet points and presented them to 10 people to
rate how ‘human-sounding’ they are. We first
chose the top-six most frequent job titles from our
dataset, which are Accountant, Sales Associate,
Consultant, Teacher, Administrative Assistant,
and Executive Chef. For each job title, we then
chose two ground truth bullet points from the
actual dataset, three bullet points generated by
GPT-2, and three bullet points generated by GPT-
Neo. We randomized the order of the bullet points
within each job title and asked our human
evaluators to rate them. The options were based
on a 5-point Likert scale denoting (5) Strongly
Agree, (4) Agree, (3) Neutral, (2) Disagree, and
(1) Strongly Disagree. Evaluators were current
undergraduate students.
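The per-cell values reported later in Table 1 are means and standard deviations of these Likert ratings. For illustration (the ratings below are hypothetical, and whether we use the population or sample standard deviation is a detail of our analysis scripts):

```python
from statistics import mean, pstdev

# Hypothetical 1-5 Likert ratings from ten evaluators for one bullet point.
ratings = [5, 4, 3, 5, 2, 4, 4, 3, 5, 4]
avg = round(mean(ratings), 2)   # average human-likeness score
sd = round(pstdev(ratings), 2)  # population standard deviation
```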
2.4.2 BLEU Score
Bilingual evaluation understudy (BLEU) is an
algorithm for evaluating the quality of text
(Papineni et al., 2002). Designed to evaluate
language translation, BLEU has been shown to be
a performant metric for many natural language
generation tasks, having a high correlation with
human judgment (Callison-Burch et al., 2006). It
uses n-grams to calculate the similarity between
generated text and a reference corpus, which is
especially suitable for short sentences. We
decided to use BLEU as an initial metric to gauge
text generation quality.
[3] https://www.eleuther.ai/
[4] https://towardsdatascience.com/guide-to-fine-tuning-text-generation-models-gpt-2-gpt-neo-and-t5-dc5de6b3bc5e
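Single-reference sentence-level BLEU as described above can be sketched as follows: the geometric mean of modified n-gram precisions, multiplied by a brevity penalty (Papineni et al., 2002). This is a simplified illustration for intuition; our actual evaluation may use a library implementation with smoothing:

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    # Modified n-gram precision for n = 1..max_n against one reference,
    # combined by geometric mean and scaled by a brevity penalty.
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n])
                              for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n])
                             for i in range(len(ref) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing: an empty n-gram overlap zeroes the score
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to its reference scores 1.0; a candidate sharing no four-gram (or even no unigram) with the reference scores 0.0.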
2.4.3 Preliminary Results
We calculated the BLEU score for 6 phrases using
GPT-Neo. The respective input job titles were
chosen based on the number of data points it had:
2 with a lot of data, 2 with an average amount of
data, and 2 with few data. Since BLEU score
requires a ground truth phrase, the input needed to
be unique so that there was only a single point in
the data starting with the input. Some inputs are as
follows: “Teacher: Developed lesson”, “Project
Coordinator: Collected”, and “Finance and Sales
Consultant: Planned”. The BLEU score for these
initial results was 0.19.
3 Results
Figure 2: Loss curves for GPT-2 124M and GPT-Neo 125M after
fine-tuning for 10 epochs with learning rate of 1e-3.
Source         Accountant    Sales Assoc.  Consultant    Teacher       Admin. Asst.  Exec. Chef
Ground Truth   2.80 (1.36)   3.55 (1.10)   3.35 (1.27)*  3.15 (1.27)   4.05 (1.10)*  2.95 (1.28)*
GPT-2 124M     4.13 (1.20)*  3.13 (1.63)   3.33 (1.27)   3.47 (1.57)   3.90 (1.21)   2.77 (1.52)
GPT-Neo 125M   2.67 (1.54)   4.40 (0.97)*  2.23 (1.04)   3.77 (1.14)*  2.20 (1.52)   2.23 (1.38)

Table 1: Average human-likeness scores (ranging from 1.0 to 5.0,
where higher is better) across job types for resume bullet points
taken from ground-truth resumes versus generated by language
models. Starred values indicate the most natural-sounding bullet
points across job types. Parentheses indicate standard deviations.
[5] https://github.com/minimaxir/aitextgen
Figure 2 shows the convergence of both language
models upon fine-tuning, indicating that both
GPT models are adjusting to subtleties of the
English language found across resume bullet
points. Table 1 compares human-likeness scores
for ground-truth resume bullet points and those
generated by our models.
4 Discussion
4.1 BLEU Score
BLEU scores range from 0 to 1, with 0.6
considered the best one could realistically achieve
and 0.3 considered decent. We obtained a
relatively low score of 0.19 for our preliminary
results. However, a low
BLEU score does not necessarily mean the
generated text is bad. There are many possibilities
for a ‘good’ sentence on a resume and there is not
one correct answer. The model may have
generated a good text that was very different from
the originals. We decided human evaluation
would be better suited for our task.
4.2 Human Evaluation
For only two of the top-six job titles do GPT-Neo
text generations appear to be (on average) more
humanlike than ground truth bullet points and
GPT-2 text generations, whereas they fare poorly
across all other job titles, making it a potentially
less versatile model for clients seeking
customization across a variety of job titles. Upon
closer examination, the following GPT-Neo text
generation for an ‘Administrative Assistant’ job
title received the poorest Likert score of Strongly
Disagree from 90% of evaluators:
Ensuring that certifications and
coordinates the billing and updates
Although the proposed bullet point cleverly
begins without a subject (assumed across most
resumes to refer to the job applicant), it fails
to complete the initial verb phrase describing how
the applicant supposedly handles certifications,
transitions unnaturally to how the applicant
supposedly coordinates billing processes, and
forms a run-on sentence bearing the conjunction
‘and’ twice. It seems that although the
responsibilities of the job title have been learned
by GPT-Neo over the course of fine-tuning,
perhaps further fine-tuning on text samples that
are ensured to be grammatically correct would
help expose the model to syntactic features of
language present in resumes. Another GPT-Neo
text generation given a score of Strongly
Disagree, this time by 80% of evaluators and for
the ‘Executive Chef’ job title, was the following:
Responsible for the daily operation
of the food items for the operation
of the food items
The proposed bullet point exhibits repetition of
the prepositional phrase “for the (daily) operation
of the food items,” prompting the need
to study what a good balance would be between
forced no-repetition and repeating cycles of
identical n-grams. This balance could be
implicitly learned by fine-tuning our language
models further and by manually experimenting
with the repetition penalty parameter (Keskar et
al., 2019) featured in the generate method of
our aitextgen pipeline.
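Such degenerate repetition can also be flagged automatically. As a simple diagnostic sketch (not part of our original pipeline):

```python
from collections import Counter

def repeated_ngrams(text: str, n: int = 4):
    # Return every word n-gram that occurs more than once in a generated
    # bullet point; repeated n-grams signal degenerate cycles like the
    # "Executive Chef" example above.
    tokens = text.split()
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return sorted(g for g, c in counts.items() if c > 1)
```

Running this on the “Executive Chef” generation flags the repeated four-gram “of the food items,” while a non-degenerate bullet point yields an empty list.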
Regarding ethical concerns, recall that the
resumes we used in this project were specifically
those evaluated by LiveCareer agents to be fairly
well-written. Consequently, our dataset is biased
in terms of the writing quality that real-world
resumes actually bear. Another concern is the
usage of text generations. Our generations are
meant to serve as a high-quality template. Clients
should not misuse our system by taking its text
generations verbatim, as neither GPT-2 nor GPT-
Neo were fine-tuned on the client’s personal
experiences. Rather, clients should incorporate
their personal experiences into the writing styles
present in our text generations.
5 Conclusion
Algorithms that generate human-sounding text
are becoming ever more widely
accessible. Models like GPT-2 and GPT-Neo can
guide users through daunting writing processes,
such as resume building. Understanding humans’
behavioral reactions to these algorithms helps
shape future breakthroughs in the field that will
address the shortcomings of the existing models.
As a step in that direction, our study adopts a
behavioral science approach to examine a balance
of formulaic and creative artificial intelligence in
the form of resumes. As with the majority of
studies, the design of our current study is subject
to limitations. Due to the small sample size of our
human evaluators, further research on a larger
sample size is critical for reinforcing the accuracy
of our results. We hope more studies follow suit
to provide new behavioral insights into humans
versus artificial intelligence.
References
Bisong, Ekaba. Building machine learning and deep
learning models on Google cloud platform: A
comprehensive guide for beginners. Apress, 2019.
Black, Sid, et al. "GPT-Neo: Large scale
autoregressive language modeling with mesh-
tensorflow." Zenodo (2021).
Brown, Tom, et al. "Language models are few-shot
learners." Advances in neural information
processing systems 33 (2020): 1877-1901.
Callison-Burch, Chris, Miles Osborne, and Philipp
Koehn. "Re-evaluating the role of BLEU in
machine translation research." 11th Conference of
the European Chapter of the Association for
Computational Linguistics. 2006.
Keskar, Nitish Shirish, et al. "Ctrl: A conditional
transformer language model for controllable
generation." arXiv preprint arXiv:1909.05858
(2019).
Köbis, Nils, and Luca D. Mossink. "Artificial
intelligence versus Maya Angelou: Experimental
evidence that people cannot differentiate AI-
generated from human-written poetry." Computers
in human behavior 114 (2021): 106553.
Lee, Jieh-Sheng, and Jieh Hsiang. "Patent claim
generation by fine-tuning OpenAI GPT-2." World
Patent Information 62 (2020): 101983.
Papineni, Kishore, et al. "Bleu: a method for automatic
evaluation of machine translation." Proceedings of
the 40th annual meeting of the Association for
Computational Linguistics. 2002.
Radford, Alec, et al. "Language models are
unsupervised multitask learners." OpenAI blog 1.8
(2019): 9.