An Analysis of Keyword Preferences Amongst Recruiters
and Candidate Resumes
By Amir Behbehani
Abstract
Many prospective employees post resumes online, which recruiters use to screen and “submit” potential
candidates to hiring managers. One challenge is that recruiters must sift through hundreds of resumes and search for hard skills they are generally unqualified to assess for “fit.” Moreover, the recruiter
must search for clues that determine a “cultural fit,” which is nearly impossible without some contact with
the candidate. This paper examines the decision making process recruiters use to determine candidate
qualifications, and attempts to determine if a computer algorithm can be comparably, or more effective,
than a human recruiter at finding a candidate fit.
Overview
Path.to, in an effort to validate a proprietary scoring algorithm that matches job-seekers to potential
employers, has hired two recruiters (this number can increase) tasked with reviewing and rating resumes.
Objective
The goal of this project is to determine if the recruiters possess any "tacit" knowledge, absent from the
matching algorithm and thus not yet codified, that enables a reduced-friction hiring process.
Method
To identify differences between human classification and the machine learning classifier, we design an
experiment that requires n recruiters to rate identically distributed resumes (i.e., recruiters rate the same
resumes), k number of times. The value of k is a number between 1 and j, where j depends on a
resampling technique. The purpose of resampling is to mimic multiple recruiters, and thus identify the
underlying rating sample distribution for each job seeker, the sample statistic, and related variances.
Variances in this experiment are of two types: intra-variances (variances in the ranking of an individual
resume, by one recruiter, over the multiple resampled draws) and extra-variances (variances in the
ranking of an individual resume between (or among) recruiters). Large variances in the former indicate
inconsistent, and thus unreliable, rankings from that recruiter. Large variances in the latter may
indicate subjective bias.
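As a toy illustration of these two variance types (the ratings below are hypothetical, chosen only to make the contrast visible):

```r
# Hypothetical resampled draws: two recruiters each rate the same resume 4 times
recruiterA <- c(7, 8, 7, 8)   # consistent rater
recruiterB <- c(4, 9, 2, 8)   # inconsistent rater

# Intra-variance: how much one recruiter's ratings of the same resume vary
intraA <- var(recruiterA)
intraB <- var(recruiterB)

# Extra-variance: how much the recruiters' mean ratings of the resume disagree
extra <- var(c(mean(recruiterA), mean(recruiterB)))
```

A large intraB flags recruiter-B as unreliable; a large extra flags possible subjective bias between the two raters.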
The rankings from the machine learning algorithm are, then, compared to the rankings from n recruiters.
If the variance between the recruiters' scores and the machine learning algorithm's scores for the
applicants is small, we have a consistent ranking system that captures enough objectively measurable data
to determine a strong match. Conversely, large differences between the machine learning and recruiter
scores indicate a ranking algorithm that requires further tuning.
Control vs. Experiment
Our control group is the human recruiters (assuming they are reliable within and between (among, if n
> 2) themselves); we are testing how closely the computer ratings approximate the recruiter ratings.
The machine ratings are the 'experimental' group.
Computer Matching Algorithm
Matching employees and employers, algorithmically, largely requires scoring potential hires (i.e.
applicants) for a specific job. That is, applicants may score a strong ‘fit’ for a specific job, or job posting,
but a specific fit does not render an individual applicant a strong fit for the greater pool of job postings.
For example, an applicant may score a strong fit for a specific law firm hiring attorneys. It does not
follow that this prospective candidate would qualify as a strong fit for an engineering job at a
technology firm; another candidate will probably score a better fit for the engineering job. While qualified
matches are not, necessarily, mutually exclusive or collectively exhaustive (candidate-A scoring a
strong fit for position-X does not, necessarily, preclude a probable fit for position-Y as well), we expect
the likelihood that a candidate who scores well for one job also scores well for another job, or set of jobs,
to decrease exponentially.
The computerized scoring system broadly defines applicant qualifications, categorically, as a Tacit or
Explicit skillset. Tacit knowledge, generally speaking, refers to a set of skills that are highly social,
organizationally specific, and difficult to train. Tacit workers understand people, products, organizational
dynamics (beyond just the org-chart), and are highly emotionally attuned. Explicit knowledge, by
contrast, is domain specific, highly technical, easy to teach (although, not always easy to learn), and
organizationally transferable. Explicit workers are your statisticians, attorneys, accountants, and niche
engineers. In any organization, you need both tacit and explicit knowledge. Innovation requires
harnessing explicit knowledge, transforming it into tacit knowledge, and selling it. As such, any
meaningful employee/employer matching system would require measuring employee explicit and tacit
knowledge. The former, we approximate using traditional proxies such as years of experience, education,
stated skillset, etc. The latter, we approximate using social data (Facebook, Twitter, Forrst, etc.),
checking applicant tweets and Facebook posts, for example, for content relevant to the various job
postings, along with signals intended to determine their preferred work environment.
The total set of scoring determinants (rankers) and their respective weights are as follows:
Explicit Skills (max 1.15)
● Core Skills (max .25)
● Similarity (max .20)
● Skills Topic Model (max .10)
● Job Experience (max .15)
● Education (max .05)
● Category (max .15)
● Total Experience (max .15)
● Endorsements (max .10)
Tacit Skills
● Social Network Statuses (max 0.02)
○ Twitter
○ Facebook
○ Dribbble
○ Forrst
○ Behance
○ Github
● Signals (Likes and Dislikes) (max .21)
○ ApplicantJob
○ ApplicantBusiness
○ BusinessApplicant
○ BusinessJobApplicant
○ BusinessTitle
● Cultural Preferences ( max 0.12)
○ Benefits
○ Formality
○ CompanySize
○ Risk
○ Salary
Scoring starts by identifying the appropriate rankers to be used in the scoring. Not all rankers are used for
each individual employee/employer score, and the same rankers are not consistently applied to each
applicant. Instead, the job post itself helps determine which sets of rankers we will use, score, and apply
to the overall rank. Once rankers are selected, the scoring algorithm evaluates applicants, applying to each
ranker a score between 0 and 99. Rankers are then normalized and reevaluated to bound the total score, an
additive function of the various determinant rankers, between 0 and 99.
The Total Score will assume the form:
Total Score = w_a*X_1 + w_b*X_2 + w_c*X_3 + … + w_j*X_i
X: rating of 0-99 for the respective ranker
w: slope, or weight, of the respective ranker
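A minimal sketch of this additive, weighted total, using ranker weights from the table above (the final normalization step, dividing by the sum of the selected weights so that a perfect 99 on every ranker yields 99 overall, is an assumption about how totals are bounded):

```r
# Per-ranker ratings (0-99) for one applicant; rankers selected by the job post
ratings <- c(core_skills = 90, similarity = 80, job_experience = 70, education = 99)
# Maximum weights for these rankers, from the table above
weights <- c(core_skills = 0.25, similarity = 0.20, job_experience = 0.15, education = 0.05)

# Additive total: sum of weight * rating over the selected rankers
raw_total <- sum(weights * ratings)

# Normalize so the total score is bounded between 0 and 99
total_score <- raw_total / sum(weights)
```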
Survey Questionnaires
To experimentally compare the results of the computerized matching system to our control group - two
randomly selected recruiters - we dynamically allocate a set of questions to gauge how each recruiter
perceives applicant “fit”. Each question coincides with a ranker used in the computerized score and is
weighted accordingly. The questions are dynamically allocated, specific to each applicant/job pairing, to
match the rankers used in the overall applicant scoring. Each question has a set of five possible answers
(A through E), each worth 25 points more than the answer below it (i.e., A=99, B=74, C=49, D=24,
E=0). This approach attempts to “back into” the scores for each ranker.
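This "backing into" of ranker scores from questionnaire answers can be sketched as follows (the question names and weights here are illustrative, mirroring the ranker weights above):

```r
# Answer-to-score mapping stated in the text
answer_score <- c(A = 99, B = 74, C = 49, D = 24, E = 0)

# Hypothetical recruiter answers, one per dynamically allocated question
answers <- c(core_skills = "A", similarity = "B", job_experience = "C")
# Ranker weight attached to each question
weights <- c(core_skills = 0.25, similarity = 0.30, job_experience = 0.15)

# Weighted recruiter-side score, comparable to the machine's ranker scores
weighted_score <- sum(answer_score[answers] * weights)
```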
Questionnaire
How well does the company environment match the desires of the candidate? (UserPreference) (.12)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much does the candidate’s overall profile match the requirements of this position? (SimilarityWeight
+ Skills Topic Model Weight) (.20 + .10)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much did the candidate’s work experience qualify them for this position? (JobXpWeight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much did the candidate’s education experience qualify them for this position? (EducationWeight)
(.05)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much did the candidate’s skills qualify them for this position? (Core Skills Ranker) (.25)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much does the candidate’s skill level qualify them for this position? (Experience Weight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How well does this position match the stated interests of the candidate? (Category Weight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
Experimental Participants (i.e., How many recruiters do we need?)
While two recruiters is a limited sample, and thus cannot be representative of the general population of
recruiters, the nature of the resampling technique attempts to solve this problem. A third recruiter would
certainly increase significance should the two recruiters disagree frequently (i.e., extra-variance is high),
or should an individual recruiter have high intra-variance within the resampled resumes.
Resampling Technique
We should resample at least 10% of the resumes with the recruiters to check for reliability. More
importantly, the number of duplicate-rated resumes for each recruiter should be at least 50, if not 100, to
test for significance at the 95% level; conversely, we can adjust the confidence in our result with fewer
duplicate-rated resumes. The challenge is to ensure enough trials to check for reliability within a single
recruiter's ratings. Presumably, though, the computer will rate each resume identically each time it sees it,
so there is no need to resample with the computer.
How many are we resampling, practically?
Front End Process
Step 0: Recruiter A draws and rates an applicant resume from a stack of M resumes.
Back End Process
Step 1: We serve a stack of M resumes to Recruiter-A and Recruiter-B, both recruiters see the same
resumes.
Step 2: Every k(x)th resume is resampled, and thus Recruiter-A and Recruiter-B, respectively, re-rate these
resumes, where k(x) is a pseudo-random process, to minimize the chance that the recruiters learn the
resampling frequency.
On the k(x)th interval, we sample P resumes Q times each, randomly (these numbers can change). For
example, we can reserve 5 previously rated resumes for resampling, each of which will be re-rated 20
times; P = 5, Q = 20, or (5,20). Resampling (5,20) would require the recruiters to examine
approximately 1000 resumes in order to adequately space the resampled resumes: 5*20 = 100 resampled
draws; to space 100 resampled draws at intervals of approximately k(x) = 10 requires 1000 resumes.
Alternatively, (2,20) would require only 400 resumes distributed at intervals of k(x) = 10, or 320 resumes
if k(x) = 8. I suggest P be at least 2, if not 3 or more, and Q be at least 20, if not 30 or more.
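The spacing arithmetic above reduces to a simple product; a sketch (the function name is illustrative):

```r
# Approximate deck size for a (P, Q) resampling plan:
# P resumes reserved for resampling, Q re-ratings each, spaced ~k apart
deck_size <- function(P, Q, k) P * Q * k

deck_size(2, 20, 10)  # (2,20) at spacing k(x) = 10 -> 400 resumes
deck_size(2, 20, 8)   # (2,20) at spacing k(x) = 8  -> 320 resumes
```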
Note: The more resumes we resample, the smaller the space we need between resampled resumes. This is
because, with a larger pool of resampled resumes, the chance of the same resume reappearing at any given
interval is lower. For example, if we resample 5 resumes, we do not need a space of 10 between every
resampled resume, as that would mean raters see any particular resampled resume only once every 50
draws; clearly, we don't have to space things out quite that much.
Thus, at the end, we have: (a.) ratings for every resume from each rater, and (b.) a series of Q ratings for
each of the P resumes that we have decided to re-sample. If the standard error for each of these P
resumes is roughly equivalent, we can comfortably use this error rate across all resumes.
Addendum: Standard Errors and Sampling Distributions
Taking the metaphor of rolling a die, we can think of standard errors in the following way. Let's say I roll
a fair die 100 times, and average all of those die rolls. The average of those die rolls will be something
like 3.5ish. Probably not exactly 3.5 (sometimes I'll randomly get more 6's, sometimes more 1's, etc) but
close to 3.5. If I do it again, I'll get something close to 3.5 the next time, and the next time, and the next
time. All those numbers close to 3.5 will have a distribution (called the sampling distribution), and that
distribution will be characterized by being centered around 3.5, with a standard deviation equal to the
standard error. The standard error (equation given below) thus gives us a feel for how any particular
mean might deviate from the true population mean (3.5).
If we think about this in terms of classrooms and height, we can think of it this way: any given classroom
has some randomly selected students in it. Now, there is a true population mean height of students, but
each classroom in the building will have a mean height that is slightly off from the main mean height.
Without knowing the population mean height, though, we can approximate how far off the classroom
mean height is using the equation for standard error, which just requires that we have the sample
(classroom) standard deviation and the sample size of the classroom.
The equation for standard error is given by the following:
SE=SD/sqrt(n)
Where SE is the standard error of the sampling distribution, SD is the standard deviation of one sample,
and n is the sample size of that particular sample. Thus, assuming an equivalent standard deviation, the
higher our sample size (the more times we dish out the same resume) the more certain of the variance
around the "true" value of the rating we'll have. The higher the standard deviation, the higher sample
we'll need to get the same size standard error.
For example, in our case, let's say that we had a recruiter rate a resume 8 times: they rated it a 7 twice,
an 8 four times, and a 9 twice. We can calculate a 95% confidence interval within which we are 95%
confident the "true" value of the rating exists. In R, this would look like:
x<-c(7,7,8,8,8,8,9,9)
xbar<- mean(x)
xdev<-sd(x)
xse<-xdev/sqrt(length(x))
xbar+1.96*xse
xbar-1.96*xse
Giving us a range from approximately 7.48 to 8.52. (We may bootstrap the standard error when we're
worried that the distribution of responses may cause the equation above to give us a biased estimate of the
standard error. Bootstrapping is the process of resampling from the sample repeatedly to gain a better
estimate of the standard error.)
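A minimal bootstrap of the standard error for the same 8-rating example (nonparametric resampling of the observed ratings; the 10,000-draw count is an arbitrary choice):

```r
# The eight observed ratings from the example above
x <- c(7, 7, 8, 8, 8, 8, 9, 9)

set.seed(1)  # make the random draws reproducible
# Resample the ratings with replacement many times, taking the mean of each draw
boot_means <- replicate(10000, mean(sample(x, length(x), replace = TRUE)))

# The SD of the bootstrap means estimates the standard error of the mean
boot_se <- sd(boot_means)
# Compare with the analytic estimate: sd(x) / sqrt(length(x)), roughly 0.27
```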
Presumably, the second recruiter would also rate the resume, and we would be able to examine whether
their confidence range overlapped with the confidence range of the first recruiter.
With this in mind, essentially, the goal of this study is to see whether or not the machine ratings are
equivalent "enough" to the human ratings; in other words, do the machine ratings fall within a reasonable
range of error from the true mean?
To do that, we need a confidence interval for each resume. We could do this via two mechanisms: (1)
resampling every resume a few times for the reviewers to rate, or (2) assuming that the error in any rating
does not vary from resume to resume, and using just a couple of resumes to construct a standard error
we'd apply to every resume.
Rating Technique
Instead of the recruiters rating the individual resumes on a scale of 1 to 10, I suggest the recruiters answer
questions about the resumes, which map to scores. This method will simultaneously answer why the
recruiter is or is not interested in the candidate, and will, presumably, reduce the intra-variance.
Figure 1. Survey Questions
Dataset
Data collected from the experiment will conform to the following matrix. For each employer
(EmployerID) we will serve the recruiters a randomized set of resumes (ResumeID). The recruiters will
then score each resume (R1-Score and R2-Score). A subset of resumes will then be resampled and
served again (TrialNo), whereby the recruiters rate the resumes again.
● EmployerID - The hiring company
● ResumeID - The applicant
● TrialNo - The number of the trial for which we are gathering ratings. This key differentiates the
duplicate ratings for the same resume.
● R1-Score - Recruiter 1's score
● R2-Score - Recruiter 2's score
● M-Score - Machine's score.
Output
The output then tells us the range of expected (rank) values for each applicant, per job. So we will
know, for example, that 95% of the scores for John Doe working at Microsoft fall between, say, 7 and 9 in
the control group, versus a single score of 6.5 in the experimental group (no range, because there is no
resampling). From the data collected during the study, we calibrate the machine learning algorithm so that
its scores move toward the control range.
Important Caveat - Memory Effect and Repeatedly Sampled Resume Obfuscation
We need to cleverly mask the profiles so that the recruiters will not recognize the same resume
served multiple times. This ‘memory’ effect can bias the experiment. There are several ways to mitigate
the memory effect, which include:
1 Remove all names, phone numbers, addresses, and other uniquely identifiable attributes from
each resume.
2 Artificially expand the pool of applicant resumes for each job. This further obfuscates the fact
that each recruiter will repeatedly view and rate a subset of resampled resumes. We can easily
expand the pool of applicant resumes, by either using existing resumes from the general
repository (i.e., resumes that don’t necessarily match the said job), or by creating ‘dummy
resumes’. Of course, we would have to scratch the dummy resumes from our final dataset from
which we determine statistical significance. The purpose of the dummy resumes is strictly to
mitigate memory effect.
3 Finally, we can abstract information from each profile to create a more homogeneous applicant
pool. For example, applicant-A and applicant-B might each have worked at Microsoft; the
former applicant might have worked during the years 1992- 1997, while the latter applicant
during the years 2000-2005. In both cases, the applicants worked for Microsoft for 5 years. For
obfuscation purposes, we can abstract the dates into a common attribute (Years_Worked).
4 Between resamplings, we should randomize the order in which the resumes are rated. This will
further reduce the chance of the recruiters ‘catching on’ that the resumes are being resampled.
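The masking steps above can be sketched as a toy deck builder that shuffles the fresh resumes and splices the resampled ones in at pseudo-random positions (function and variable names are illustrative):

```r
# Build a serving order: shuffle the fresh resumes, then insert each resampled
# resume at a pseudo-random position so no fixed resampling interval emerges
build_deck <- function(fresh_ids, resample_ids) {
  deck <- sample(fresh_ids)  # randomize base order between resamplings
  for (id in resample_ids) {
    pos <- sample(seq_along(deck), 1)      # pseudo-random insertion point
    deck <- append(deck, id, after = pos)  # re-serve this resume once more
  }
  deck
}

set.seed(7)
deck <- build_deck(fresh_ids = 1:100, resample_ids = c(1, 2, 3))
```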
R-Code
#Path.To
#Pseudo-Analysis: Resumes
#9.20.12
#Requires: None
#Output: Dataset indicating whether machine ratings matches reasonable range of human ratings
#EmployerID - The hiring company
#ResumeID - The applicant
#TrialNo - The number of the trial for which we are gathering ratings. This key
#  differentiates the duplicate ratings for the same resume.
#R1-Score - Recruiter 1's score
#R2-Score - Recruiter 2's score
#M-Score - Machine's score.
#Create fake data
#Assume 10 positions, 100 resumes
#3 resumes (1,2, and 3) resampled 20 times
#employers creates an array of 1-10, count by 1
#employerID 160 = number of times we sample from employers
employers<-seq(1,10,1)
EmployerID<-sample(employers, 160, replace=T)
resumeID<-c(seq(1,100,1), rep(1,20), rep(2,20), rep(3,20))
TrialNo<-c(rep(1,100), seq(2,21,1), seq(2,21,1), seq(2,21,1))
R1Score<-round(rnorm(160, 5, 2))
R2Score<-round(rnorm(160, 5, 2))
MScore<-round(rnorm(160, 5, 2))
data<-as.data.frame(cbind(EmployerID, resumeID, TrialNo, R1Score, R2Score, MScore))
#To look at variance within ratings of Rater1's ratings of resumes 1, 2, and 3
#Get all Reviewer 1's ratings of resume #1, #2, and #3
rat1res1<-c(data$R1Score[1], data$R1Score[101:120])
rat1res2<-c(data$R1Score[2], data$R1Score[121:140])
rat1res3<-c(data$R1Score[3], data$R1Score[141:160])
#Get all Reviewer 2's ratings of resume #1, #2, and #3
rat2res1<-c(data$R2Score[1], data$R2Score[101:120])
rat2res2<-c(data$R2Score[2], data$R2Score[121:140])
rat2res3<-c(data$R2Score[3], data$R2Score[141:160])
#How much are individual recruiters' ratings varying on individual resumes?
#  Lower SDs indicate more reliable ratings
sd(rat1res1)
sd(rat2res1)
sd(rat1res2)
sd(rat2res2)
sd(rat1res3)
sd(rat2res3)
#A simple measure: are the reviewers rating the resumes roughly similarly?
#  Lower mean differences indicate more reliable ratings
mean(rat1res1)-mean(rat2res1)
mean(rat1res2)-mean(rat2res2)
mean(rat1res3)-mean(rat2res3)
#Assuming we're content with the numbers we see with the above, we move on.
#What is the standard error for each rating?
se1<-sd(rat1res1)/sqrt(length(rat1res1))
se2<-sd(rat1res2)/sqrt(length(rat1res2))
se3<-sd(rat1res3)/sqrt(length(rat1res3))
se4<-sd(rat2res1)/sqrt(length(rat2res1))
se5<-sd(rat2res2)/sqrt(length(rat2res2))
se6<-sd(rat2res3)/sqrt(length(rat2res3))
#Create mean standard error and mean standard deviation
semean<-mean(c(se1, se2, se3, se4, se5, se6))
sdmean<-mean(c(sd(rat1res1),sd(rat2res1),sd(rat1res2),sd(rat2res2),sd(rat1res3),sd(rat2res3)))
#Is the machine rating within the estimated 95% CI for each number?
#Create lower bounds and upper bounds around the first rater, using SD
data$lb1<-data$R1Score-(1.96*sdmean)
data$ub1<-data$R1Score+(1.96*sdmean)
data$lb2<-data$R2Score-(1.96*sdmean)
data$ub2<-data$R2Score+(1.96*sdmean)
#Is the machine code within this range?
match1<-c()
match2<-c()
for (i in 1:100) {
if (data$MScore[i]>data$lb1[i] & data$MScore[i]<data$ub1[i]) {match1[i]<-"Match"} else
{match1[i]<-"Outside Bounds"}
if (data$MScore[i]>data$lb2[i] & data$MScore[i]<data$ub2[i]) {match2[i]<-"Match"} else
{match2[i]<-"Outside Bounds"}
}
#Combine data and write out to csv file
dataexamine<-as.data.frame(cbind(data[1:100,],match1,match2))
write.csv(dataexamine, file="~/Documents/Documents/Work/Amir/RecuiterStudy/data.csv")
#Using intercoder reliability, how reliable are all coders, taken together?
library(irr)
robinson(data[,4:6])
#How about the two recruiters?
robinson(data[,4:5])
#The first recruiter and the machine?
robinson(cbind(data[,4],data[,6]))
#The second recruiter and the machine?
robinson(data[,5:6])
Lines 18-26 just create data (and concatenate it) that will look roughly like the data output from the
experiment. Here, we assume 10 employers (lines 18-19), 100 resumes (line 20), 3 of which we resample
an additional 20 times (line 21 identifies these values), and we assign (for the time being) arbitrary
recruiter-A, recruiter-B , and machine scores to the resumes (lines 22-24). Line 26 concatenates all into a
data set.
Lines 30-32 combine all of recruiter-A's ratings for the resampled resumes 1, 2, and 3; lines 35-37 combine
all of recruiter-B's ratings for the resampled resumes 1, 2, and 3. Then, lines 40-45 find the standard
deviation of all ratings for both raters on the resampled resumes. Smaller standard deviations indicate
more reliable ratings from the human raters (a good thing, and, assuming we have reliability there,
something that will allow us to justify applying one aggregate SD to all resume ratings).
Lines 48-50 examine, in a simple fashion, how recruiter-A's ratings differ from recruiter-B's, on average.
Numbers closer to zero indicate higher levels of reliability between raters.
Lines 54-59 create standard errors for all resample resumes and raters, and lines 62 and 63 create average
standard deviations and average standard errors across all resampled resumes.
Lines 67-71 create upper and lower bounds (95% bounds, to be precise) for the ratings of both recruiter-A
and recruiter-B. We use the standard deviation here, rather than the standard error, because we cannot
take the standard error from a sample of 21 (the sample we have for resampled resumes) and apply
it to the single ratings of recruiter-A (the sample we have for non-resampled resumes). In essence, the standard
error estimates the variation in means from repeated samples, not the variation in raw ratings themselves.
To see whether the machine rating is matching raw human ratings, we want to look at the variation in
human ratings, not the variation in means of samples of human ratings.
Lines 74-79 create two new columns for the end of the dataset: one column indicating whether the
machine rating fell within the 95% bounds for recruiter-A, and one column indicating whether the
machine rating fell within the 95% bounds for recruiter-B. If the machine is within bounds, the word
"Match" is recorded in the column, and if the machine is outside of bounds, the words "Outside Bounds"
are recorded. This allows us to easily examine, visually, to see where the machine matches and where it
doesn't.
Lines 82 and 83 combine all the data and record the dataset to a csv file at a specified location.
Preliminary Results - Dataset from Oct 29th
Results from an initial set of experimental data, which include recruiter qualitative responses, were coded,
aggregated, and evaluated as follows:
A set of qualitative responses from which the recruiters rated resumes were coded on a linear scale from 0
to 100, where a coded rating of 0 signifies a bad match and a coded rating of 100 signifies a perfect match.
Response Coded Value
Perfectly 100
Very Much 75
Acceptable 50
Unacceptable 25
Not at All 0
Each recruiter rates an applicant resume against a corresponding job posting, rating “fit” on each of
nine Question IDs. Thus, a resume that scores perfectly for each Question ID has a possible 900 points
(100 points for each category), which are then adjusted, or weighted, such that the total points
d = 36 * tan(1.94 + (x / 68)) + 90, where x equals the sum of values for each Question ID adjusted by
the machine-selected weights, fall within the range of 0-99.
Question ID Weights
Category 0.1
Core Skills 0.23
Education 0.03
Experience Level 0.04
General Similarity 0.18
JobExperience 0.08
Recommend NULL
Skills Topics 0.05
User Preference 0.1
Resampled resumes, then, receive multiple ratings from both recruiters (e.g., Recruiter A might rate
Resume-A 5 times, each time evaluating the resume slightly differently, while Recruiter B might rate
Resume-A 7 times, each time evaluating the resume slightly differently). The scores for Resume-A are then
aggregated, and we measure both the intra- and extra-variances.
Descriptive Statistics
Average Machine Rating: 75.52
Average Human Rating: 50.69
Correlation Coefficient (between human and machine): 0.312
Among the resampled resumes, most received different ratings across draws, but three were rated
identically each time they were rated (all three resumes were re-rated by the same recruiter).
Average Standard Deviation Across All Resampled Resumes: 7.16
Interpretation: We can expect 95% of all human ratings to be within 14.32 points on either side of the
mean rating. Because the average human rating is lower than the average machine rating, most of the
machine scores and human scores are not matches; the machine scores are simply too high. (Only 13.7%
of the machine scores are within the human range.)
Margin of Error Caveat: If we generously boost the human scores by 25 points, making the average
human score and the average machine score roughly the same, then 84.1% of the machine scores are
within two standard deviations of the corresponding human scores.
Statistical Significance: Using any sort of statistical significance test at this point is meaningless; the
sample size is too small. With more data, we can potentially estimate our variances and assign a measure
of statistical significance to our findings.
Additional Discussion Points
● Need to address the coding rules
● Recruiters had zero perfect fits
● Machine ratings are clustered above 80 points and below 70 points; this does not reflect the human scores
● Human ratings are more continuous
Link to Aggregated Dataset
Stata Code - For Massaging Data
insheet using "/Users/tarpus/Desktop/initialdata2.0.csv", comma
*Drop all the glenns
drop if user_id==1
*Sort on profile_id while keeping the rest of the order stable (so, don't re-order
*  resampled resumes OR randomly re-order the order of the ratings)
sort created_at, stable
sort profile_id, stable
encode job_id, gen(job_id_numeric)
encode profile_id, gen(profile_id_numeric)
*drop if profile_id_numeric==75 && user_id==6
drop if profile_id_numeric==75
drop if profile_id_numeric==34
destring points, gen(pointsnum) i("NA")
destring weight, gen(weightnum) i("NA")
destring points, gen(scorenum) i("NA")
encode answer, gen(answernum)
recode answernum 4=75 1=50 3=25 2=0, gen(hscore)
gen hpoints=hscore*weight
outsheet job_id_numeric profile_id_numeric user_id points hpoints created_at using ///
"/Users/tarpus/Desktop/initialdata2.0clean.csv", comma replace nolabel
*39377014fc19545b4675b7a1b165566e02e4b34
*9f93a3145210aa734fb37fca4618a9a46b38403
clear
insheet using "/Users/tarpus/Desktop/initialdata2.0IDed.csv", comma
destring points, gen(pointsnum) i("NA")
destring weight, gen(weightnum) i("NA")
destring points, gen(scorenum) i("NA")
collapse
R Code - For Data Analysis
rawdata<-read.csv("/Users/tarpus/Desktop/initialdata2.0clean.csv")
rawdata$uniqueID<-rep(1:(length(rawdata$points)/9), each=9)
summary(rawdata$job_id_numeric)
summary(rawdata$profile_id_numeric)
adata<-aggregate(rawdata, by=list(rawdata$uniqueID), FUN=mean, na.rm=TRUE)
adata$comprating<-36*(tan(1.94+((adata$points*8)/68)))+90
adata$humanrating<-36*(tan(1.94+((adata$hpoints*8)/68)))+90
summary(adata$comprating)
summary(adata$humanrating)
library(lattice)
xyplot(adata$comprating~adata$humanrating)
cor(adata$comprating,adata$humanrating)
#find re-samples
ob<-table(adata$profile_id_numeric); ob-1
#Average resume rating:
mean(adata$comprating)
mean(adata$humanrating)
sd(adata$comprating)
sd(adata$humanrating)
write.csv(adata, file="/Users/tarpus/Desktop/aggregateddata.csv")
resamples<-read.csv("/Users/tarpus/Desktop/aggregateddataresamples.csv")
mean(c(resamples$humanrating[1], resamples$humanrating[2]))
mean(c(resamples$humanrating[2], resamples$humanrating[3]))
mean(c(resamples$humanrating[5], resamples$humanrating[6]))
mean(c(resamples$humanrating[7], resamples$humanrating[8], resamples$humanrating[9]))
mean(c(resamples$humanrating[10], resamples$humanrating[11]))
mean(c(resamples$humanrating[12], resamples$humanrating[13]))
mean(c(resamples$humanrating[14], resamples$humanrating[15]))
mean(c(resamples$humanrating[16], resamples$humanrating[17]))
mean(c(resamples$humanrating[18], resamples$humanrating[19]))
mean(c(resamples$humanrating[20], resamples$humanrating[21]))
mean(c(resamples$humanrating[22], resamples$humanrating[23]))
mean(c(resamples$humanrating[24], resamples$humanrating[25]))
mean(c(resamples$humanrating[26], resamples$humanrating[27], resamples$humanrating[28]))
mean(c(resamples$humanrating[29], resamples$humanrating[30]))
mean(c(resamples$humanrating[31], resamples$humanrating[32], resamples$humanrating[33]))
sds<-c(
sd(c(resamples$humanrating[1], resamples$humanrating[2])),
sd(c(resamples$humanrating[2], resamples$humanrating[3])),
sd(c(resamples$humanrating[5], resamples$humanrating[6])),
sd(c(resamples$humanrating[7], resamples$humanrating[8], resamples$humanrating[9])),
sd(c(resamples$humanrating[10], resamples$humanrating[11])),
sd(c(resamples$humanrating[12], resamples$humanrating[13])),
sd(c(resamples$humanrating[14], resamples$humanrating[15])),
sd(c(resamples$humanrating[16], resamples$humanrating[17])),
sd(c(resamples$humanrating[18], resamples$humanrating[19])),
sd(c(resamples$humanrating[20], resamples$humanrating[21])),
sd(c(resamples$humanrating[22], resamples$humanrating[23])),
sd(c(resamples$humanrating[24], resamples$humanrating[25])),
sd(c(resamples$humanrating[26], resamples$humanrating[27], resamples$humanrating[28])),
sd(c(resamples$humanrating[29], resamples$humanrating[30])),
sd(c(resamples$humanrating[31], resamples$humanrating[32], resamples$humanrating[33]))
)
sdmean<-mean(sds)
adata$lb1<-(adata$humanrating-(1.96*sdmean))
adata$ub1<-(adata$humanrating+(1.96*sdmean))
#many are outside match, b/c of different scale, so standardize ratings
#  (add ~25 to human ratings)
adata$lb2<-(adata$humanrating+(mean(adata$comprating)-mean(adata$humanrating))-(1.96*sdmean))
adata$ub2<-(adata$humanrating+(mean(adata$comprating)-mean(adata$humanrating))+(1.96*sdmean))
#Is the machine rating within the human confidence interval?
inside1 <- adata$comprating > adata$lb1 & adata$comprating < adata$ub1
inside2 <- adata$comprating > adata$lb2 & adata$comprating < adata$ub2
match1 <- ifelse(inside1, "Match", "Outside Bounds")
match2 <- ifelse(inside2, "Match", "Outside Bounds")
#Recode as 0/1 so the means below give the match rates
match1 <- as.numeric(inside1)
match2 <- as.numeric(inside2)
mean(match1)
mean(match2)
#Combine data and write out to csv file
dataexamine<-as.data.frame(cbind(adata,match1, match2))
write.csv(dataexamine, file="/Users/tarpus/Desktop/dataexamine.csv")
#different ratings, different raters: 10, 22, 77, 80, 88, 92, 93, 105, 112
#different ratings, same rater: 41, 66, 109,
#same rating, different raters:
Secondary Results - Dataset from Oct 29th
Correlations between:
weighted human rating and human recommendation: .83
human recommendation and weighted machine rec #1: .20
human recommendation and weighted machine rec #2: .27
weighted human rating and weighted machine rating: .28
Scatterplots showing these same relationships (omitted here).
Conclusion
Since we are limited in the number of recruiters, we examine the recruiters' final
recommendations and their ratings to determine, quickly, whether the recruiters' ratings are
trustworthy. That is, we asked each recruiter to rate each resume in two ways: first, they
rated the resume on eight categories, each on a five-point (1-5) scale, according to how well the
resume fits the job on that category. Next, the recruiter gave each resume an
overall recommendation - on the same 1-5 scale - for how well the applicant's resume "fits" the
job. It would be suspect if a recruiter rated a resume highly on the individual
categories, only to rate the same resume low on the overall recommendation (i.e., not recommend
the applicant). Our sanity check correlates the human overall recommendations with
the individual human category ratings, after walking the human category ratings through the
machine concatenation and weighting procedure. While not a perfect fit, the two ratings
correlate closely (.83), signifying that (1) our recruiters rated resumes reasonably, and (2) broadly
speaking, the tangent function and machine weighting scheme are reasonable.
Human vs. Machine at a Glance
Similarly, we compare the human recommendations against the two iterations of machine ratings,
and then, finally, the human ratings against the machine ratings. In each case, the correlations
were very weak (between .21 and .31). This disjunct likely results from (1) the machine rating
resumes significantly differently than the human recruiters across each evaluation category, and
(2) the weighting system that turns individual machine ratings into a single judgment introducing
extra noise, because the weights do not match the way humans naturally rate the resumes.
Relative Value (Weights)
To discover how humans "naturally" rate the resumes, we can leverage the fact that the recruiters
provide both a single final rating and individual category ratings. Thus, we create
a linear model that predicts the final human recommendation from the individual category
human ratings. The resulting coefficients (and their statistical significance)
elucidate how our human coders weight the resumes. We use both an Ordinary
Least Squares (OLS) regression and an ordinal logistic regression to model the factor weights
that best predict the outcome of a resume being recommended for a job. The ordinal logit,
critically, does not assume a continuous scale - rather than predicting a number, the model
predicts the probability (more precisely, the log-odds) that a given set of covariates falls into
one category or another. Technically speaking, this makes it the superior model, as it fits our
data more appropriately. However, the results from the ordinal logistic model generally match
the OLS estimates, substantively speaking, indicating that the continuous-scale assumption of
the OLS model is not badly violated. Accordingly, and due to the more straightforward nature of
OLS, we focus on the OLS model for most of this write-up.
NOTE: When we attempt to operationalize the weights and, thus, calibrate the machine scores, it
will make more sense to use the OLS model than the logistic model.
Ordinary Least Squares Model
The OLS model suggests that human coders weight "general similarity" and "job experience" far
more heavily than any other categories. These two variables alone predict about 85% of the
variation in the human raters' overall evaluation of the candidates. (This number is confirmed
when we re-run the estimation with just those two variables and none of the others - the R2
remains at approximately .85 even with the other variables excluded.) The OLS suggests a very
different weighting system from that of the machine-generated scores: namely, it (loosely)
suggests a scheme in which 80%-90% of the overall judgment is driven in equal parts
by general similarity and job experience, with the rest driven by the other variables. (So, we could
place a weight of .45 on general similarity, .45 on job experience, and .02
on everything else; or .4 on the first two, and .04 on everything else.)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.65630 1.08808 0.603 0.54679
user_preference_h_scale 0.01686 0.02368 0.712 0.47697
general_similarity_h_scale 0.40938 0.04936 8.293 2.46e-15 ***
skills_topics_h_scale NA NA NA NA
job_xp_h_scale 0.39072 0.04822 8.103 9.26e-15 ***
education_h_scale -0.02619 0.02535 -1.033 0.30222
core_skills_h_scale -0.01473 0.02636 -0.559 0.57670
experience_level_h_scale 0.14142 0.04828 2.929 0.00363 **
category_h_scale 0.01649 0.02487 0.663 0.50778
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.629 on 347 degrees of freedom
Multiple R-squared: 0.8494, Adjusted R-squared: 0.8463
F-statistic: 279.5 on 7 and 347 DF, p-value: < 2.2e-16
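The two-variable check described above (re-running the estimation with only the dominant predictors and confirming the R2 barely moves) can be sketched as follows, again on simulated stand-in data rather than the real ratings:

```r
# Simulated data where two categories carry almost all of the signal
set.seed(1)
n <- 300
d <- data.frame(
  general_similarity = runif(n, 1, 5),
  job_xp             = runif(n, 1, 5),
  education          = runif(n, 1, 5)
)
d$overall <- 0.45 * d$general_similarity + 0.45 * d$job_xp +
             0.02 * d$education + rnorm(n, sd = 0.4)

# Full model vs. a model with only the two dominant predictors
full    <- lm(overall ~ general_similarity + job_xp + education, data = d)
reduced <- lm(overall ~ general_similarity + job_xp, data = d)

# If the two variables dominate, the R^2 values will be nearly identical
c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)
```

When the omitted categories carry negligible weight, dropping them costs almost no explanatory power, which is exactly the pattern reported for the real data.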
Constructing a Set of Predictive Models
To confirm the assumptions made by the OLS model, we perform a secondary analysis using an
ordinal logistic regression model. In this model, we predict the [log] likelihood that the overall
recommendation an individual human gives a resume falls into one of the five ratings [Perfect
Fit, Very Much a Fit, Acceptable Fit, Unacceptable Fit, & Not a Fit at All], given that human's
ratings on the individual job-fit categories. We employ a special case of the logistic model:
an ordinal logistic regression, also known as a proportional odds logistic
regression (POLR, for short). The ordinal logistic model factors the various rating components (e.g.,
job experience, core skills, etc.) to determine, and assign, a stochastically derived score for each
applicant resume; individual resumes are scored for a specific job, i.e., the logistic score
predicts the likelihood that a resume matches a prescribed job posting.
Coefficients:
Value Std. Error t value
user_preference_h_glm 0.4053 0.5436 0.7455
general_similarity_h_glm 4.0417 0.8831 4.5767***
job_xp_h_glm 4.1715 0.8370 4.9839***
education_h_glm -0.2695 0.5020 -0.5369
core_skills_h_glm 0.1567 0.5874 0.2667
experience_level_h_glm 0.3481 0.7979 0.4362
category_h_glm 0.6937 0.5441 1.2750
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As can be seen in the above model results, the proportional odds ordinal logistic regression
model largely confirms the OLS results: general similarity and job
experience are, in general, the two most important predictors of the human overall job
recommendations. The only difference in the ordinal logistic model is that it makes an
even stronger case for the (similar) importance of these two variables: not only do their
coefficients (nearly) match again, but in this model, experience level is no longer
statistically significant. This underscores, again, that a weighting system that weights general
similarity and job experience heavily while weighting the other categories only lightly would
match the apparent implicit human weights more closely.
Model Score Performance
Using insights gleaned from the OLS and ordinal logit models, we construct a new, calibrated
scoring function where, rather than machine-generated weights, we use human weights in the
construction of the final score. In short, any computer resume rating has two
components:
1 The rating score assigned to each individual category for the resume (general similarity, job
experience, experience level, etc.)
2 The weight assigned to that category by the machine system.
The individual ratings, which the machine gives on a zero-to-one-hundred scale, are
multiplied by their weights and then added together. So, for instance, if the
machine used two rating categories, boogie and woogie, weighted those categories .4
and .6, and scored a resume 50 and 75 on them, respectively, we could find the
pre-normalization machine rating as follows:
Rating = .4*(boogie)+.6*(woogie)
Rating = .4*50+.6*75
Rating = 65
That 65 would then be passed through the normalizing tangent function to arrive at the final
resume rating. Accordingly, if the machine ratings appear to diverge from the human ratings,
there are only two sources of error: the weighting system and the machine's category rating
system itself. Given the low correlation between human ratings and machine ratings, our first
attempt was to alter the machine weighting system to match the implicit human rating system more
closely. We did this by using the weights suggested by the OLS model -
approximately, a weight of .45 on general similarity, a weight of .45 on job experience, and a
weight of .02 on everything else.
However, using this new weighting system, the performance of the machine did not increase - in
fact, it decreased. The correlations between machine rankings and human rankings, using the
implied human weights, lie between .08 and .22. (Before re-weighting, the correlations
had been much better - between .21 and .31.) Changing the weights in small ways (say,
a weight of .4 on job experience and general similarity, and a weight of .04 on everything else)
does not significantly improve the correlations.
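The score construction and re-weighting described above can be sketched in R. The weighted_rating function below is an illustrative stand-in, and tanh() stands in for the unspecified "tangent function" used by the production system.

```r
# Sketch of the weighted-score construction: category scores (0-100) are
# combined with a weight vector, then passed through a normalizing function.
weighted_rating <- function(scores, weights) {
  stopifnot(length(scores) == length(weights))
  sum(scores * weights)
}

# The worked example from the text: weights .4/.6, scores 50/75 -> 65
raw <- weighted_rating(c(boogie = 50, woogie = 75), c(0.4, 0.6))

# Hypothetical normalization step (tanh rescaled to a 0-100 range);
# the real system's tangent function may differ
final <- 100 * tanh(raw / 100)
```

Re-weighting amounts to swapping in a different weight vector (e.g., .45/.45 on the two dominant categories and .02 elsewhere) while leaving the category scores untouched, which is why a weight change alone cannot fix errors in the scores themselves.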
Individual Category Rating Comparison
Given this, we decided to examine correlations between individual human category ratings and
individual machine category ratings. The answers were enlightening - on many categories, the
machine ratings and human ratings were not even weakly correlated. Individual category
correlations fall between -.10 and .46, and only three of the eight categories correlate at a
strength above .10 (all reported correlations use the new machine point scores):
User preference: 0.46
General similarity: 0.06
Skills Topics: -0.10
Job Experience: 0.20
Education: N/A
Core Skills: 0.32
Experience Level: N/A
Category: 0.10
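A per-category comparison like the one above can be computed as follows. The data frames and column names are illustrative stand-ins; categories where the machine produced no variance are reported as NA, corresponding to the "N/A" entries in the list.

```r
# Illustrative stand-in data: human ratings (1-5) vs. machine scores (0-100)
set.seed(7)
n <- 50
human <- data.frame(
  job_xp    = sample(1:5, n, replace = TRUE),
  education = sample(1:5, n, replace = TRUE)
)
machine <- data.frame(
  job_xp    = runif(n, 0, 100),
  education = rep(0, n)   # machine scored every resume 0 on education
)

# Correlate human and machine ratings within each category; cor() is
# undefined when one side has zero variance, so report NA there
cor_by_category <- sapply(names(human), function(cat) {
  if (sd(machine[[cat]]) == 0) return(NA_real_)
  cor(human[[cat]], machine[[cat]])
})
cor_by_category
```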
Given this data, two conclusions came to light. The first was that, perhaps, we could
improve the fit between the machine final rating and the human final rating by re-weighting to
focus more on the areas where the human and machine agree. Doing this - weighting user
preference, core skills, and job experience higher than all the rest (namely, a weighting
system of .4 for user preference, .3 for core skills, .2 for job experience, and .05 each for
category and general similarity) - again gives unsatisfactory results,
not significantly improving on the prior weightings. Correlations between the final rankings of
machine and human, with this re-weighting, range from .22 to .34 - an increase over the
original weights, but only a marginal one.
Points System
This brings us to the second point raised by the low correlations between the individual point
scores of machine and human - largely, the machine is just not good at matching human scores,
nor is the machine good at rating resumes in the first place. For instance, in the above
correlations, there is no correlation reported for either education or experience level, because the
machine failed to produce any variance in these categories at all: for experience level, the machine
gave every single resume a score of 100 (indicating a perfect experience level), and for education,
the machine gave every single resume a score of 0 (indicating that the resume has no apparent
education at all). Of the categories, only in three of the eight (user preference,
job experience, and core skills) does the machine do better than random chance at predicting
how the human will rate the resume, and only one of those (job experience) is among the two
categories the human implicitly treats as important (as revealed by the OLS and ordinal logit models).
This lack of correspondence between the machine and human scores, combined with the fact that
the machine fails a certain sort of "sanity test" of its own on certain categories (education and
experience level) by not rating any resumes differently at all, suggests that it is not the weights,
but the category scores, that prevent the machine from properly matching the
human ratings.
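The zero-variance "sanity test" above can be made mechanical: flag any machine category whose scores never vary across resumes. A minimal sketch, with fabricated scores:

```r
# Fabricated machine category scores for ten resumes
machine_scores <- data.frame(
  experience_level = rep(100, 10),  # machine gave every resume 100
  education        = rep(0, 10),    # machine gave every resume 0
  job_xp           = c(10, 40, 55, 70, 20, 90, 35, 60, 80, 15)
)

# A category with zero standard deviation cannot discriminate between
# resumes and should be flagged before any correlation analysis
no_variance <- sapply(machine_scores, function(x) sd(x) == 0)
names(machine_scores)[no_variance]   # the degenerate categories
```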
References
http://mathworld.wolfram.com/BootstrapMethods.html
http://statistics.stanford.edu/~ckirby/techreports/BIO/BIO%20101.pdf
http://www.methodsconsultants.com/tutorials/bootstrap1.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html
How to Make a Pirate ship Primary Education.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 

An Analysis Of Keyword Preferences Amongst Recruiters And Candidate Resumes

By Amir Behbehani

Abstract

Many prospective employees post resumes online, which recruiters use to screen and “submit” potential candidates to hiring managers. One challenge is that recruiters must sift through hundreds of resumes, searching for hard skills they are generally unqualified to assess for “fit.” Moreover, the recruiter must search for clues that indicate a “cultural fit,” which is nearly impossible without some contact with the candidate. This paper examines the decision-making process recruiters use to determine candidate qualifications, and attempts to determine whether a computer algorithm can be as effective as, or more effective than, a human recruiter at finding a candidate fit.
Overview

Path.to, in an effort to validate a proprietary scoring algorithm that matches job-seekers to potential employers, has hired two recruiters (this number can increase) tasked with reviewing and rating resumes.

Objective

The goal of this project is to determine whether the recruiters possess any "tacit" knowledge, absent from the matching algorithm and thus not yet codified, that enables a reduced-friction hiring process.

Method

To identify differences between human classification and the machine-learning classifier, we design an experiment that requires n recruiters to rate identically distributed resumes (i.e., recruiters rate the same resumes) k times. The value of k is a number between 1 and j, where j depends on a resampling technique. The purpose of resampling is to mimic multiple recruiters, and thus identify the underlying rating sample distribution for each job seeker, the sample statistic, and the related variances. Variances in this experiment are of two types: intra-variances (variances in the ranking of an individual resume, by one recruiter, over the multiple resampled draws) and extra-variances (variances in the ranking of an individual resume between, or among, recruiters). Large variances in the former indicate inconsistent, and thus unreliable, rankings from the recruiter in question. Large variances in the latter may indicate subjective bias. The rankings from the machine-learning algorithm are then compared to the rankings from the n recruiters. If the variance between the recruiters' scores and the machine-learning algorithm's scores is small, the ranking system is consistent and captures enough objectively measurable data to determine a strong match. Conversely, large differences between the machine-learning and recruiter scores indicate a ranking algorithm that requires further tuning.

Control vs. Experiment

Our control group is the human recruiters (assuming they are reliable within and between, or among if n > 2, themselves).
We are measuring how closely the computer ratings approximate the recruiter ratings; the machine ratings are, conversely, the 'experimental' group.

Computer Matching Algorithm

Matching employees and employers algorithmically largely requires scoring potential hires (i.e., applicants) for a specific job. That is, an applicant may score a strong ‘fit’ for a specific job, or job posting, but a specific fit does not render that applicant a strong fit for the greater pool of job postings. For example, an applicant may score a strong fit for a specific law firm hiring attorneys. It does not follow that this candidate would also qualify as a strong fit for an engineering job at a technology firm; another candidate will probably score a better fit for the engineering job. While qualified matches are not necessarily mutually exclusive or collectively exhaustive (candidate-A scoring a strong fit for position-X does not necessarily preclude a probable fit for position-Y as well), we expect the likelihood that a candidate who scores well for one job also scores well for another job, or set of jobs, to decrease exponentially.
The computerized scoring system broadly defines applicant qualifications, categorically, as a Tacit or Explicit skillset. Tacit knowledge, generally speaking, refers to a set of skills that are highly social, organizationally specific, and difficult to train. Tacit workers understand people, products, and organizational dynamics (beyond just the org-chart), and are highly emotionally attuned. Explicit knowledge, by contrast, is domain specific, highly technical, easy to teach (although not always easy to learn), and organizationally transferable. Explicit workers are your statisticians, attorneys, accountants, and niche engineers. Any organization needs both tacit and explicit knowledge: innovation requires harnessing explicit knowledge, transforming it into tacit knowledge, and selling it. As such, any meaningful employee/employer matching system must measure both explicit and tacit knowledge. The former we approximate using traditional proxies such as years of experience, education, stated skillset, etc. The latter we approximate using social data (Facebook, Twitter, Forrst, etc.), checking applicant tweets and Facebook posts, for example, for content relevant to the various jobs and intended to determine their preferred work environment. The total set of scoring determinants (rankers) and their respective weights are as follows:

Explicit Skills (max 1.15)
● Core Skills (max .25)
● Similarity (max .20)
● Skills Topic Model (max .10)
● Job Experience (max .15)
● Education (max .05)
● Category (max .15)
● Total Experience (max .15)
● Endorsements (max .10)

Tacit Skills
● Social Network Statuses (max 0.02)
  ○ Twitter
  ○ Facebook
  ○ Dribbble
  ○ Forrst
  ○ Behance
  ○ Github
● Signals (Likes and Dislikes) (max .21)
  ○ ApplicantJob
  ○ ApplicantBusiness
  ○ BusinessApplicant
  ○ BusinessJobApplicant
  ○ BusinessTitle
● Cultural Preferences (max 0.12)
  ○ Benefits
  ○ Formality
  ○ CompanySize
  ○ Risk
  ○ Salary

Scoring starts by identifying the appropriate rankers to be used in the scoring. Not all rankers are used for each individual employee/employer score, and the same rankers are not consistently applied to each applicant. Instead, the job post itself helps determine which set of rankers we will use, score, and apply to the overall rank. Once rankers are selected, the scoring algorithm evaluates applicants, assigning each ranker a score between 0 and 99. Rankers are then normalized and reevaluated to bound the total score, an additive function of the various determinant rankers, between 0 and 99. The Total Score assumes the form:

Total Score = βa·X1 + βb·X2 + βc·X3 + … + βj·Xi

X: rating of 0-99
β: slope, or weight, of the respective ranker

Survey Questionnaires

To experimentally compare the results of the computerized matching system to our control group (two randomly selected recruiters), we dynamically allocate a set of questions to gauge how each recruiter perceives applicant “fit”. Each question coincides with a ranker used in the computerized score, and is weighted accordingly. The questions are dynamically allocated specific to each applicant/job pairing to match the rankers used in the overall applicant scoring. Each question has a set of five possible answers (A through E), awarded in decrements of roughly 25 points from the top rating down (i.e., A = 99, B = 74, C = 49, D = 24, E = 0). This approach attempts to “back into” the scores for each ranker.

Questionnaire

How well does the company environment match the desires of the candidate? (UserPreference) (.12)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much does the candidate’s overall profile match the requirements of this position? (SimilarityWeight + Skills Topic Model Weight) (.20 + .10)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How much did the candidate’s work experience qualify them for this position? (JobXpWeight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How much did the candidate’s education experience qualify them for this position? (EducationWeight) (.05)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How much did the candidate’s skills qualify them for this position? (Core Skills Ranker) (.25)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How much does the candidate’s skill level qualify them for this position? (Experience Weight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How well does this position match the stated interests of the candidate? (Category Weight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
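The "backing into" of ranker scores described above can be sketched in a few lines of R. The helper name `score_candidate` is hypothetical (not Path.to's production code), the per-question weights are taken from the questionnaire, and no final normalization to the 0-99 band is applied here.

```r
# Map each answer letter to its 0-99 value, per the questionnaire
answer_values <- c(A = 99, B = 74, C = 49, D = 24, E = 0)

# Per-question weights, as listed in the questionnaire above
weights <- c(UserPreference = 0.12, Similarity = 0.30, JobXp = 0.15,
             Education = 0.05, CoreSkills = 0.25, Experience = 0.15,
             Category = 0.15)

score_candidate <- function(answers, w) {
  # answers: character vector of letters A-E, one per ranker, in w's order
  sum(answer_values[answers] * w)
}

# Example: a recruiter answering "B" (Very Much) on every question
answers <- rep("B", length(weights))
score_candidate(answers, weights)  # 74 * 1.17 = 86.58
```

Because the weights sum to more than 1, a raw total can exceed 99; in the actual scoring pipeline, the normalization step described above bounds the result.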
Experimental Participants (i.e., How Many Recruiters Do We Need?)

While two recruiters is a limited sample, and thus cannot be representative of the general population of recruiters, the nature of the resampling technique attempts to solve this problem. A third recruiter would certainly increase significance should the two recruiters disagree frequently (i.e., extra-variance is high), or should an individual recruiter show high intra-variance within the resampled resumes.

Resampling Technique

We should resample at least 10% of the resumes with the recruiters to check for reliability. More importantly, the number of duplicate-rated resumes for each recruiter should be at least 50, if not 100, to test for significance at the 95% level; conversely, we can accept lower confidence in our result with fewer duplicate-rated resumes. The challenge is to ensure enough trials to check for reliability within a single recruiter's ratings. Presumably, though, the computer will rate each resume identically each time it sees it, so there is no need to resample with the computer.

How many are we resampling, practically?

Front End Process

Step 0: Recruiter A draws and rates an applicant resume from a stack of M resumes.

Back End Process

Step 1: We serve a stack of M resumes to Recruiter-A and Recruiter-B; both recruiters see the same resumes.

Step 2: Every k(x)th resume is resampled, and thus Recruiter-A and Recruiter-B, respectively, re-rate these resumes, where k(x) is a pseudo-random process, to minimize the chance that the recruiters learn the resampling frequency. On the k(x)th interval, we sample P resumes Q times each, randomly (these numbers can change). For example, we can reserve 5 previously rated resumes for resampling, each of which will be re-rated 20 times; P = 5, Q = 20, or (5,20).
Resampling (5,20) would require the recruiters to examine approximately 1000 resumes in order to adequately space the resampled resumes: 5*20 = 100 resampled resumes; spacing 100 resampled resumes at intervals of approximately k(x) = 10 requires 1000 resumes. Alternatively, (2,20) would require only 400 resumes distributed at intervals of k(x) = 10, or 320 resumes if k(x) = 8. I suggest P be at least 2, if not 3 or more, and Q be at least 20, if not 30 or more.

Note: The more resumes we are resampling, the smaller the space we need between resampled resumes, because with a larger pool of resampled resumes the chance of the same individual resume reappearing shrinks on its own. For example, if we resample 5 resumes, we do not need a space of 10 between every resampled resume, as that would mean raters see the same resume only once every 50 draws - clearly, we don't have to space things out quite that much.
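The resume-count arithmetic for a (P,Q) design can be sketched as a one-line R helper (the function name is ours, for illustration): reserving P resumes, each re-rated Q times, spaced at intervals of roughly k, means the recruiters must examine about P * Q * k resumes in total.

```r
# Total resumes the recruiters must examine for a (P, Q) resampling design
# with resampled resumes spaced roughly every k draws
total_resumes <- function(P, Q, k) P * Q * k

total_resumes(2, 20, 10)  # the (2,20) design at k(x) = 10 -> 400 resumes
total_resumes(2, 20, 8)   # the same design at k(x) = 8  -> 320 resumes
```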
Thus, at the end, we have: (a) ratings for every resume from each rater, and (b) a series of Q ratings for each of the P resumes that we have decided to resample. If the standard error for each of these P resumes is roughly equivalent, we can comfortably use this error rate across all resumes.

Addendum: Standard Errors and Sampling Distributions

Taking the metaphor of rolling a die, we can think of standard errors in the following way. Say I roll a fair die 100 times and average all of those rolls. The average will be something like 3.5-ish - probably not exactly 3.5 (sometimes I'll randomly get more 6's, sometimes more 1's, etc.), but close to it. If I do it again, I'll get something close to 3.5 the next time, and the next time, and the next time. All those numbers close to 3.5 will have a distribution (called the sampling distribution), and that distribution will be centered around 3.5 with a standard deviation equal to the standard error. The standard error (equation given below) thus gives us a feel for how any particular mean might deviate from the true population mean (3.5).

In terms of classrooms and height, we can think of it this way: any given classroom has some randomly selected students in it. There is a true population mean height of students, but each classroom in the building will have a mean height that is slightly off from that population mean. Without knowing the population mean height, though, we can approximate how far off the classroom mean height is using the equation for standard error, which requires only the sample (classroom) standard deviation and the sample size of the classroom. The equation for standard error is:

SE = SD / sqrt(n)

where SE is the standard error of the sampling distribution, SD is the standard deviation of one sample, and n is the sample size of that particular sample.
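The die-rolling metaphor above is easy to check with a quick simulation (this is our illustration, not part of the study's code): the means of repeated 100-roll samples cluster around 3.5, and their spread - the standard error - is approximated by SD/sqrt(n) computed from a single sample.

```r
set.seed(42)  # arbitrary seed, for reproducibility
n <- 100

# 10,000 replications of "roll a fair die 100 times and take the mean"
sample_means <- replicate(10000, mean(sample(1:6, n, replace = TRUE)))

mean(sample_means)  # close to 3.5, the true population mean
sd(sample_means)    # empirical standard error, roughly 1.71/sqrt(100) = 0.171
sd(sample(1:6, n, replace = TRUE)) / sqrt(n)  # the SE formula from one sample
```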
Thus, assuming an equivalent standard deviation, the higher our sample size (the more times we dish out the same resume), the more certain we will be of the variance around the "true" value of the rating. The higher the standard deviation, the larger the sample we'll need to get the same size standard error. For example, say a recruiter rated a resume 8 times: a 7 twice, an 8 four times, and a 9 twice. We can calculate a 95% confidence interval in which we are 95% confident the "true" value of the rating exists. In R, this would look like:

x <- c(7, 7, 8, 8, 8, 8, 9, 9)
xbar <- mean(x)
xdev <- sd(x)
xse <- xdev / sqrt(length(x))
xbar + 1.96 * xse
xbar - 1.96 * xse
This gives us a range from 7.47 to 8.52. (We may bootstrap the standard error when we're worried that the distribution of responses may cause the equation above to give us a biased estimate of the standard error. Bootstrapping is the process of resampling from the sample repeatedly to gain a better estimate of the standard error.) Presumably, the second recruiter would also rate the resume, and we would be able to examine whether their confidence range overlapped with the confidence range of the first recruiter.

With this in mind, the goal of this study is essentially to see whether or not the machine ratings are equivalent "enough" to the human ratings - in other words, do the machine ratings fall within a reasonable range of error from the true mean? To do that, we need a confidence interval for each resume. We could build one via two mechanisms: (1) resampling every resume a few times for the reviewers to rate, or (2) assuming that the error in any rating does not vary from resume to resume, and using just a couple of resumes to construct a standard error we'd apply to every resume.

Rating Technique

Instead of the recruiters rating the individual resumes on a scale of 1 to 10, I suggest the recruiters answer questions about the resumes, which map to scores. This method will simultaneously answer why the recruiter is or is not interested in the candidate, and will presumably reduce the intra-variance.

Figure 1. Survey Questions
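The bootstrap mentioned above can be sketched on the same 8-rating example (again, our illustration, not the study's code): resample the ratings with replacement many times, and use the standard deviation of the resampled means as the standard error estimate.

```r
set.seed(1)  # arbitrary seed, for reproducibility
x <- c(7, 7, 8, 8, 8, 8, 9, 9)

# 5,000 bootstrap replicates: resample the 8 ratings with replacement, take the mean
boot_means <- replicate(5000, mean(sample(x, length(x), replace = TRUE)))
boot_se <- sd(boot_means)
boot_se  # in the neighborhood of the analytic SE of ~0.27 computed above
```

The bootstrap estimate comes in slightly under the analytic formula here (it resamples from the empirical distribution, which has no n-1 correction), but it is useful precisely when we distrust that formula.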
Dataset

Data collected from the experiment will conform to the following matrix. For each employer (EmployerID), we will serve the recruiters a randomized set of resumes (ResumeID). The recruiters will then score each resume (R1-Score and R2-Score). A subset of resumes will then be resampled and served again (TrialNo), whereby the recruiters will rate the resumes again.

● EmployerID - The hiring company
● ResumeID - The applicant
● TrialNo - The number of the trial for which we are gathering ratings. This key differentiates the duplicate ratings for the same resume.
● R1-Score - Recruiter 1's score
● R2-Score - Recruiter 2's score
● M-Score - Machine's score

Output

The output then tells us the range of expected (rank) values for each applicant for each job. We will know, for example, that 95% of the scores for John Doe working at Microsoft fall between, say, 7 and 9 in the control group, versus 6.5 in the experiment group (no range, because there is no resampling). From the data collected during the study, we calibrate the machine-learning algorithm to capture a score closer to 6.5.

Important Caveat - Memory Effect and Repeatedly Sampled Resume Obfuscation

We need to cleverly mask the profiles so that the recruiters will not recognize the same resume served multiple times. This ‘memory’ effect can bias the experiment. There are several ways to mitigate it, which include:

1. Remove all names, phone numbers, addresses, and other uniquely identifiable attributes from each resume.

2. Artificially expand the pool of applicant resumes for each job. This further obfuscates the fact that each recruiter will repeatedly view and rate a subset of resampled resumes. We can easily expand the pool of applicant resumes either by using existing resumes from the general repository (i.e., resumes that don't necessarily match the job in question) or by creating ‘dummy resumes’.
Of course, we would have to scratch the dummy resumes from the final dataset from which we determine statistical significance; their purpose is strictly to mitigate the memory effect.

3. Finally, we can abstract information from each profile to create a more homogeneous applicant pool. For example, applicant-A and applicant-B might each have worked at Microsoft - the former during the years 1992-1997, the latter during the years 2000-2005. In both cases, the applicants worked for Microsoft for 5 years. For obfuscation purposes, we can abstract the dates into a common attribute (Years_Worked).

4. Between resamplings, we should randomize the order in which the resumes are rated. This will further reduce the chance of the recruiters ‘catching on’ that resumes are being resampled.
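The order-randomization step above can be sketched as follows. This is a hypothetical illustration, not the production serving logic: shuffle the fresh resumes, then splice the resampled duplicates in at pseudo-random positions so their spacing is not predictable.

```r
set.seed(7)  # arbitrary seed, for reproducibility
fresh <- paste0("resume_", 1:40)           # 40 fresh resumes
resampled <- rep(paste0("dup_", 1:2), 20)  # P = 2 resumes, Q = 20 repeats each

serve_order <- sample(fresh)  # randomize the fresh resume order
for (d in sample(resampled)) {
  pos <- sample(length(serve_order) + 1, 1)      # random insertion point
  serve_order <- append(serve_order, d, after = pos - 1)
}
length(serve_order)  # 80 servings in total (40 fresh + 40 duplicates)
```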
R-Code

#Path.To
#Pseudo-Analysis: Resumes
#9.20.12
#Requires: None
#Output: Dataset indicating whether machine ratings match a reasonable range of human ratings

#EmployerID - The hiring company
#ResumeID - The applicant
#TrialNo - The number of the trial for which we are gathering ratings.
#          This key differentiates the duplicate ratings for the same resume.
#R1-Score - Recruiter 1's score
#R2-Score - Recruiter 2's score
#M-Score - Machine's score

#Create fake data
#Assume 10 positions, 100 resumes
#3 resumes (1, 2, and 3) resampled 20 times
#employers creates an array of 1-10, count by 1
#160 = number of times we sample from employers
employers <- seq(1, 10, 1)
EmployerID <- sample(employers, 160, replace=T)
resumeID <- c(seq(1, 100, 1), sample(1, 20, replace=T), sample(2, 20, replace=T), sample(3, 20, replace=T))
TrialNo <- c(sample(1, 100, replace=T), seq(2, 21, 1), seq(2, 21, 1), seq(2, 21, 1))
R1Score <- round(rnorm(160, 5, 2))
R2Score <- round(rnorm(160, 5, 2))
MScore <- round(rnorm(160, 5, 2))
data <- as.data.frame(cbind(EmployerID, resumeID, TrialNo, R1Score, R2Score, MScore))

#To look at variance within Rater 1's ratings of resumes 1, 2, and 3
#Get all Reviewer 1's ratings of resume #1, #2, and #3
rat1res1 <- c(data$R1Score[1], data$R1Score[101:120])
rat1res2 <- c(data$R1Score[2], data$R1Score[121:140])
rat1res3 <- c(data$R1Score[3], data$R1Score[141:160])
#Get all Reviewer 2's ratings of resume #1, #2, and #3
#(the first element must come from R2Score; an earlier draft mistakenly used R1Score here)
rat2res1 <- c(data$R2Score[1], data$R2Score[101:120])
rat2res2 <- c(data$R2Score[2], data$R2Score[121:140])
rat2res3 <- c(data$R2Score[3], data$R2Score[141:160])

#How much are individual recruiters' ratings varying on individual resumes?
#Lower SDs indicate more reliable ratings
sd(rat1res1)
sd(rat2res1)
sd(rat1res2)
sd(rat2res2)
sd(rat1res3)
sd(rat2res3)

#A simple measure: are the reviewers rating the resumes roughly similarly?
#Lower mean differences indicate more reliable ratings
mean(rat1res1) - mean(rat2res1)
mean(rat1res2) - mean(rat2res2)
mean(rat1res3) - mean(rat2res3)

#Assuming we're content with the numbers we see above, we move on.
#What is the standard error for each rating?
se1 <- sd(rat1res1)/sqrt(length(rat1res1))
se2 <- sd(rat1res2)/sqrt(length(rat1res2))
se3 <- sd(rat1res3)/sqrt(length(rat1res3))
se4 <- sd(rat2res1)/sqrt(length(rat2res1))
se5 <- sd(rat2res2)/sqrt(length(rat2res2))
se6 <- sd(rat2res3)/sqrt(length(rat2res3))

#Create mean standard error and mean standard deviation
semean <- mean(c(se1, se2, se3, se4, se5, se6))
sdmean <- mean(c(sd(rat1res1), sd(rat2res1), sd(rat1res2), sd(rat2res2), sd(rat1res3), sd(rat2res3)))

#Is the machine rating within the estimated 95% CI for each number?
#Create lower and upper bounds around each rater, using the SD
data$lb1 <- data$R1Score - (1.96*sdmean)
data$ub1 <- data$R1Score + (1.96*sdmean)
data$lb2 <- data$R2Score - (1.96*sdmean)
data$ub2 <- data$R2Score + (1.96*sdmean)

#Is the machine score within this range?
match1 <- c()
match2 <- c()
for (i in 1:100) {
  if (data$MScore[i] > data$lb1[i] & data$MScore[i] < data$ub1[i]) {
    match1[i] <- "Match"
  } else {
    match1[i] <- "Outside Bounds"
  }
  if (data$MScore[i] > data$lb2[i] & data$MScore[i] < data$ub2[i]) {
    match2[i] <- "Match"
  } else {
    match2[i] <- "Outside Bounds"
  }
}

#Combine data and write out to csv file
dataexamine <- as.data.frame(cbind(data[1:100,], match1, match2))
write.csv(dataexamine, file="~/Documents/Documents/Work/Amir/RecuiterStudy/data.csv")

#Using intercoder reliability, how reliable are all coders, taken together?
library(irr)
robinson(data[,4:6])
#How about the two recruiters?
robinson(data[,4:5])
#The first recruiter and the machine?
robinson(cbind(data[,4], data[,6]))
#The second recruiter and the machine?
robinson(data[,5:6])

Lines 18-26 just create data (and concatenate it) that will look roughly like the data output from the experiment.
Here, we assume 10 employers (lines 18-19) and 100 resumes (line 20), 3 of which we resample an additional 20 times (line 21 identifies these values), and we assign (for the time being) arbitrary recruiter-A, recruiter-B, and machine scores to the resumes (lines 22-24). Line 26 concatenates everything into a data set.
Lines 30-32 combine all of recruiter-A's ratings for the resampled resumes 1, 2, and 3; lines 35-37 combine all of recruiter-B's ratings for the resampled resumes 1, 2, and 3. Then, lines 40-45 find the standard deviation of all ratings for both raters on the resampled resumes. Smaller standard deviations indicate more reliable ratings from the human raters (a good thing, and, assuming we have reliability there, something that will allow us to justify applying one aggregate SD to all resume ratings). Lines 48-50 examine, in a simple fashion, how recruiter-A's ratings differ from recruiter-B's, on average. Numbers closer to zero indicate higher levels of reliability between raters. Lines 54-59 create standard errors for all resampled resumes and raters, and lines 62 and 63 create the average standard deviation and average standard error across all resampled resumes.

Lines 67-71 create upper and lower bounds (95% bounds, to be precise) for the ratings of both recruiter-A and recruiter-B. We use the standard deviation here, rather than the standard error, because we cannot take the standard error from a sample of 21 (the sample we have for each resampled resume) and apply it to the single ratings of the non-resampled resumes. In essence, the standard error estimates the variation in means from repeated samples, not the variation in raw ratings themselves. To see whether the machine rating matches raw human ratings, we want to look at the variation in human ratings, not the variation in means of samples of human ratings.

Lines 74-79 create two new columns at the end of the dataset: one indicating whether the machine rating fell within the 95% bounds for recruiter-A, and one indicating whether it fell within the 95% bounds for recruiter-B. If the machine is within bounds, the word "Match" is recorded in the column; if it is outside of bounds, the words "Outside Bounds" are recorded.
This allows us to examine easily, and visually, where the machine matches and where it doesn't. Lines 82 and 83 combine all the data and record the dataset to a csv file at a specified location.
Preliminary Results - Dataset from Oct 29th

Results from an initial set of experimental data, which include recruiter qualitative responses, were coded, aggregated, and evaluated as follows. The qualitative responses with which the recruiters rated resumes were coded on a linear scale from 0 to 100, where a coded rating of 0 signifies a bad match and a coded rating of 100 signifies a perfect match:

Response       Coded Value
Perfectly      100
Very Much      75
Acceptable     50
Unacceptable   25
Not at All     0

Each recruiter rates an applicant resume against a corresponding job posting, selecting a “fit” for each of nine Question IDs. Thus, a resume that scores perfectly on every Question ID has a possible 900 points (100 points per category), which are then weighted and adjusted such that the total points d = 36 * tan(1.94 + (x / 68)) + 90, where x equals the sum of values for each Question ID adjusted by the machine-selected weights, fall within the range 0-99.

Question ID          Weight
Category             0.1
Core Skills          0.23
Education            0.03
Experience Level     0.04
General Similarity   0.18
JobExperience        0.08
Recommend            NULL
Skills Topics        0.05
User Preference      0.1

Resampled resumes, then, receive multiple ratings from both recruiters (e.g., Recruiter A might rate Resume-A 5 times, each time evaluating the resume slightly differently, while Recruiter B might rate Resume-A 7 times, each time evaluating the resume slightly differently). The scores for Resume-A are then aggregated, and we measure both the intra- and extra-variances.

Descriptive Statistics

Average Machine Rating: 75.52
Average Human Rating: 50.69
Correlation Coefficient (between human and machine): 0.312

Among the resampled resumes, most received different ratings across draws, but three were rated identically each time they were rated (all three by the same recruiter).

Average Standard Deviation Across All Resampled Resumes: 7.16

Interpretation: We can expect 95% of all human ratings to fall within 14.32 points on either side of the mean rating. Because the average human rating is lower than the average machine rating, most of the machine scores and human scores are not matches - the machine scores are simply too high. (Only 13.7% of the machine scores are within the human range.)

Margin of Error Caveat: If we generously boost the human scores by 25 points, making the average human score and average machine score the same, then 84.1% of the machine scores are within two standard deviations of the human score.

Statistical Significance: Using any sort of statistical significance test at this point is meaningless; the sample size is too small. With more data, we can potentially estimate our variances and assign a measure of statistical significance to our findings.

Additional Discussion Points

● Need to address the coding rules
● Recruiters had zero perfect fits
• 15.
● Machine ratings are clustered above 80 points and below 70 points; this does not reflect human scores
● Human ratings are more continuous

Link to Aggregated Dataset

Stata Code - For Massaging Data

insheet using "/Users/tarpus/Desktop/initialdata2.0.csv", comma
*Drop all the glenns
drop if user_id==1
*Sort on profile_id while keeping the rest of the order stable (so, don't re-order resampled resumes OR randomly re-order the order of the ratings)
sort created_at, stable
sort profile_id, stable
encode job_id, gen(job_id_numeric)
encode profile_id, gen(profile_id_numeric)
*drop if profile_id_numeric==75 && user_id==6
drop if profile_id_numeric==75
drop if profile_id_numeric==34
destring points, gen(pointsnum) i("NA")
destring weight, gen(weightnum) i("NA")
destring points, gen(scorenum) i("NA")
encode answer, gen(answernum)
recode answernum 4=75 1=50 3=25 2=0, gen(hscore)
gen hpoints=hscore*weight
outsheet job_id_numeric profile_id_numeric user_id points hpoints created_at using "/Users/tarpus/Desktop/initialdata2.0clean.csv", comma replace nolabel
*39377014fc19545b4675b7a1b165566e02e4b34
*9f93a3145210aa734fb37fca4618a9a46b38403
clear
insheet using "/Users/tarpus/Desktop/initialdata2.0IDed.csv", comma
destring points, gen(pointsnum) i("NA")
destring weight, gen(weightnum) i("NA")
destring points, gen(scorenum) i("NA")
collapse

R Code - For Data Analysis

rawdata<-read.csv("/Users/tarpus/Desktop/initialdata2.0clean.csv")
rawdata$uniqueID<-rep(1:(length(rawdata$points)/9), each=9)
summary(rawdata$job_id_numeric)
summary(rawdata$profile_id_numeric)
adata<-aggregate(rawdata, by=list(rawdata$uniqueID), FUN=mean, na.rm=TRUE)
adata$comprating<-36*(tan(1.94+((adata$points*8)/68)))+90
adata$humanrating<-36*(tan(1.94+((adata$hpoints*8)/68)))+90
summary(adata$comprating)
summary(adata$humanrating)
library(lattice)
xyplot(adata$comprating~adata$humanrating)
cor(adata$comprating,adata$humanrating)
#find re-samples
ob<-table(adata$profile_id_numeric); ob-1
#Average resume rating:
mean(adata$comprating)
mean(adata$humanrating)
sd(adata$comprating)
sd(adata$humanrating)
write.csv(adata, file="/Users/tarpus/Desktop/aggregateddata.csv")
• 16.
resamples<-read.csv("/Users/tarpus/Desktop/aggregateddataresamples.csv")
mean(c(resamples$humanrating[1], resamples$humanrating[2]))
mean(c(resamples$humanrating[2], resamples$humanrating[3]))
mean(c(resamples$humanrating[5], resamples$humanrating[6]))
mean(c(resamples$humanrating[7], resamples$humanrating[8], resamples$humanrating[9]))
mean(c(resamples$humanrating[10], resamples$humanrating[11]))
mean(c(resamples$humanrating[12], resamples$humanrating[13]))
mean(c(resamples$humanrating[14], resamples$humanrating[15]))
mean(c(resamples$humanrating[16], resamples$humanrating[17]))
mean(c(resamples$humanrating[18], resamples$humanrating[19]))
mean(c(resamples$humanrating[20], resamples$humanrating[21]))
mean(c(resamples$humanrating[22], resamples$humanrating[23]))
mean(c(resamples$humanrating[24], resamples$humanrating[25]))
mean(c(resamples$humanrating[26], resamples$humanrating[27], resamples$humanrating[28]))
mean(c(resamples$humanrating[29], resamples$humanrating[30]))
mean(c(resamples$humanrating[31], resamples$humanrating[32], resamples$humanrating[33]))
sds<-c(
  sd(c(resamples$humanrating[1], resamples$humanrating[2])),
  sd(c(resamples$humanrating[2], resamples$humanrating[3])),
  sd(c(resamples$humanrating[5], resamples$humanrating[6])),
  sd(c(resamples$humanrating[7], resamples$humanrating[8], resamples$humanrating[9])),
  sd(c(resamples$humanrating[10], resamples$humanrating[11])),
  sd(c(resamples$humanrating[12], resamples$humanrating[13])),
  sd(c(resamples$humanrating[14], resamples$humanrating[15])),
  sd(c(resamples$humanrating[16], resamples$humanrating[17])),
  sd(c(resamples$humanrating[18], resamples$humanrating[19])),
  sd(c(resamples$humanrating[20], resamples$humanrating[21])),
  sd(c(resamples$humanrating[22], resamples$humanrating[23])),
  sd(c(resamples$humanrating[24], resamples$humanrating[25])),
  sd(c(resamples$humanrating[26], resamples$humanrating[27], resamples$humanrating[28])),
  sd(c(resamples$humanrating[29], resamples$humanrating[30])),
  sd(c(resamples$humanrating[31], resamples$humanrating[32], resamples$humanrating[33]))
)
sdmean<-mean(sds)
adata$lb1<-(adata$humanrating-(1.96*sdmean))
adata$ub1<-(adata$humanrating+(1.96*sdmean))
#many are outside match, b/c of different scale, so standardize ratings (add ~25 to human ratings)
adata$lb2<-(adata$humanrating+(mean(adata$comprating)-mean(adata$humanrating))-(1.96*sdmean))
adata$ub2<-(adata$humanrating+(mean(adata$comprating)-mean(adata$humanrating))+(1.96*sdmean))
#Is the machine code within this range?
match1<-c()
match2<-c()
for (i in 1:length(adata$comprating)) {
  if (adata$comprating[i]>adata$lb1[i] & adata$comprating[i]<adata$ub1[i]) {match1[i]<-"Match"} else {match1[i]<-"Outside Bounds"}
  if (adata$comprating[i]>adata$lb2[i] & adata$comprating[i]<adata$ub2[i]) {match2[i]<-"Match"} else {match2[i]<-"Outside Bounds"}
}
match1<-c()
match2<-c()
for (i in 1:length(adata$comprating)) {
  if (adata$comprating[i]>adata$lb1[i] & adata$comprating[i]<adata$ub1[i]) {match1[i]<-1} else {match1[i]<-0}
  if (adata$comprating[i]>adata$lb2[i] & adata$comprating[i]<adata$ub2[i]) {match2[i]<-1} else {match2[i]<-0}
}
mean(match1)
mean(match2)
• 17.
#Combine data and write out to csv file
dataexamine<-as.data.frame(cbind(adata,match1, match2))
write.csv(dataexamine, file="/Users/tarpus/Desktop/dataexamine.csv")
#different ratings, different raters: 10, 22, 77, 80, 88, 92, 93, 105, 112
#different ratings, same rater: 41, 66, 109
#same rating, different raters:

Secondary Results - Dataset from Oct 29th

Correlations between:
weighted human rating and human recommendation: .83
human recommendation and weighted machine rec #1: .20
human recommendation and weighted machine rec #2: .27
weighted human rating and weighted machine rating: .28

Scatterplots showing the relationships between the same measures.

Conclusion

Since we are limited in the number of recruiters, we examine the recruiters' final recommendations and their category ratings to determine, very quickly, whether the recruiters' ratings are trustworthy. That is, we asked each recruiter to rate each resume in two ways: first, on each of 8 categories, on a five-point scale (1-5), as to how well the resume fits the job on that category. Next, the recruiter gives each resume an overall recommendation - on the same 1-5 scale - as to how well the applicant's resume "fits" the job. It would be very questionable if a recruiter rated a resume highly on the individual categories, only to rate the same resume low on the overall recommendation (i.e., not recommend the applicant). Our sanity check correlates the human overall recommendations with the individual human category ratings, after walking the category ratings through the machine concatenation and weighting procedure. While not a perfect fit, the two ratings correlate very closely (.83), signifying that (1) our recruiters rated resumes reasonably, and (2) broadly speaking, the tangent function and the machine weighting scheme are reasonable.

Human vs. Machine at a Glance

Similarly, we compare the human recommendations against the two iterations of machine ratings, and then, finally, the human ratings against the machine ratings. In each case, the correlations were very weak (between .21 and .31). This disconnect likely results from (1) the machine rating resumes significantly differently than the human recruiters across each evaluation category, and (2) the weighting system that turns individual machine ratings into a single judgment introducing extra noise, because the weights do not match the way humans naturally rate the resumes.
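The second point - that weights mismatched to the implicit human weights degrade the aggregate correlation even when the underlying category scores are informative - can be illustrated with a toy simulation. Every number below is invented purely for illustration; only the qualitative pattern is the point.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy per-category scores for 200 resumes in two hypothetical categories.
a = rng.uniform(0, 100, 200)
b = rng.uniform(0, 100, 200)

# Suppose the human recruiter implicitly weights category a far more than b.
human_overall = 0.9 * a + 0.1 * b + rng.normal(0, 5, 200)

matched = 0.9 * a + 0.1 * b      # aggregate built with human-like weights
mismatched = 0.1 * a + 0.9 * b   # same category scores, inverted weights

r_matched = np.corrcoef(matched, human_overall)[0, 1]
r_mismatched = np.corrcoef(mismatched, human_overall)[0, 1]
```

Both aggregates are built from exactly the same category scores, yet the mismatched weighting correlates far more weakly with the simulated human judgment - the same mechanism proposed above for the weak machine-human correlations.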
• 18. Relative Value (Weights)

To discover how humans "naturally" rate the resumes, we can leverage the fact that the recruiters provide both individual category ratings and a single final rating. Thus, we create a linear model that predicts the final human recommendation from the individual category human ratings. The resulting coefficients (and statistical significances) of the independent variables elucidate how our human coders weight the resumes. We use both an Ordinary Least Squares (OLS) regression and an ordinal logistic regression to model the factor weights that best predict the outcome of a resume being recommended for a job. The ordinal logit, critically, does not assume a continuous scale - rather than predicting a number, the model predicts the probability (the log of the odds ratio, precisely) that a given set of covariates falls into one category or another. Technically speaking, this makes it the superior model, as it fits our data more appropriately. However, the results from the ordinal logistic model generally match the OLS estimates, substantively speaking, indicating that the continuous-scale assumption of the OLS model is not badly violated. Accordingly, and because OLS is more straightforward, we focus on the OLS model for most of this write-up. NOTE: When we attempt to operationalize the weights and, thus, calibrate the machine scores, it will make more sense to use the OLS model than the logistic model.

Ordinary Least Squares Model

The OLS model suggests that human coders weight "general similarity" and "job experience" far more than any other categories. These two variables, accordingly, predict about 85% of the variation in the human raters' overall evaluation of the candidates. (This number is confirmed when we re-run the estimation with just those two variables and none of the others - the R-squared remains at approximately .85, even with the other variables excluded.)
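The implicit-weight recovery described above can be sketched in Python as a stand-in for the R estimation. The data here are simulated: three hypothetical category ratings on a 1-5 scale, and an overall recommendation constructed mostly from the first two, mimicking the pattern the write-up reports.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated recruiter data: per-category ratings (1-5) and an overall
# recommendation driven mostly by the first two categories plus noise.
n = 355
X = rng.integers(1, 6, size=(n, 3)).astype(float)
overall = 0.45 * X[:, 0] + 0.45 * X[:, 1] + 0.02 * X[:, 2] + rng.normal(0, 0.1, n)

# OLS via least squares: the fitted coefficients estimate the implicit
# weight the rater places on each category.
design = np.column_stack([np.ones(n), X])  # prepend an intercept column
beta, *_ = np.linalg.lstsq(design, overall, rcond=None)
implicit_weights = beta[1:]  # should land near the true (0.45, 0.45, 0.02)
```

With enough ratings, the recovered coefficients sit close to the weights used to generate the overall scores, which is exactly the logic behind reading the OLS coefficients as the recruiters' implicit weights.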
The OLS suggests a very different weighting system than that of the machine-generated scores: namely, it (loosely) suggests a system in which 80%-90% of the machine's judgment would be driven in equal parts by general similarity and job experience, with the rest driven by the other variables. (So, we could place a weight of .45 on general similarity, a weight of .45 on job experience, and a weight of .02 on everything else; or a weight of .4 on the first two and a weight of .04 on everything else.)

Coefficients: (1 not defined because of singularities)
                             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                   0.65630     1.08808    0.603   0.54679
user_preference_h_scale       0.01686     0.02368    0.712   0.47697
general_similarity_h_scale    0.40938     0.04936    8.293   2.46e-15 ***
skills_topics_h_scale              NA          NA       NA        NA
job_xp_h_scale                0.39072     0.04822    8.103   9.26e-15 ***
education_h_scale            -0.02619     0.02535   -1.033   0.30222
core_skills_h_scale          -0.01473     0.02636   -0.559   0.57670
experience_level_h_scale      0.14142     0.04828    2.929   0.00363 **
category_h_scale              0.01649     0.02487    0.663   0.50778
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.629 on 347 degrees of freedom
• 19. Multiple R-squared: 0.8494, Adjusted R-squared: 0.8463
F-statistic: 279.5 on 7 and 347 DF, p-value: < 2.2e-16

Constructing a Set of Predictive Models

To confirm the assumptions made by the OLS model, we perform a secondary analysis using an ordinal logistic regression model. In this model, we predict the [log] likelihood that the overall recommendation an individual human gives a resume falls into one of the five ratings (Perfect Fit, Very Much a Fit, Acceptable Fit, Unacceptable Fit, and Not a Fit at All), given the ratings that human assigned on the individual job-fit categories. We employ a logistic model - specifically, a special case of the logistic model: an ordinal logistic regression, or proportional odds logistic regression (POLR, for short). The ordinal logistic model factors the various rating components (e.g., job experience, core skills, etc.) to determine, and assign, a stochastically derived score for each applicant resume; individual resumes are scored against a specific job, i.e., the logistic score predicts the likelihood that a resume matches a prescribed job posting.

Coefficients:
                           Value  Std. Error  t value
user_preference_h_glm     0.4053      0.5436   0.7455
general_similarity_h_glm  4.0417      0.8831   4.5767 ***
job_xp_h_glm              4.1715      0.8370   4.9839 ***
education_h_glm          -0.2695      0.5020  -0.5369
core_skills_h_glm         0.1567      0.5874   0.2667
experience_level_h_glm    0.3481      0.7979   0.4362
category_h_glm            0.6937      0.5441   1.2750

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As the model results above show, the proportional odds ordinal logistic regression largely confirms the results of the OLS model: general similarity and job experience are, in general, the two most important predictors of human overall job recommendations.
The only difference in the ordinal logistic model, in fact, is that it makes an even stronger case for the (similar) importance of these two variables: not only do their two coefficients (nearly) match again, but in this model experience level is no longer statistically significant. This underscores, again, that a weighting system that weights general similarity and job experience heavily, while weighting the other categories only lightly, would match the apparent implicit human weights more closely.

Model Score Performance

Using insights gleaned from the OLS and ordinal logit models, we construct a new, calibrated scoring function in which, rather than the machine-generated weights, we use the human weights in the construction of the final score. In short, any computer resume rating has two components:

1. The rating score assigned to individual categories for the resume (general similarity, job experience, experience level, etc.)
• 20.
2. The weight assigned to that category by the machine system.

The individual ratings, which the machine gives on a zero to one hundred scale, are multiplied by their weights and then summed. So, for instance, if the machine used two rating categories, boogie and woogie, weighted those categories .4 and .6, and scored them 50 and 75, respectively, the pre-normalization machine rating would be:

Rating = .4*(boogie) + .6*(woogie)
Rating = .4*50 + .6*75
Rating = 65

That 65 would then be walked through the normalizing tangent function to arrive at the final resume rating. Accordingly, if the machine ratings appear to be off from the human ratings, there are only two sources of error: the weighting system and the machine rating system itself. Given the low correlation between human ratings and machine ratings, our first attempt was to alter the machine weighting system to match the implicit human weighting system more closely. We did this by using the suggested weights from the OLS model - approximately, a weight of .45 on general similarity, a weight of .45 on job experience, and a weight of .02 on everything else. However, using this new weighting system, the performance of the machine did not increase - in fact, it decreased. The correlations between machine rankings and human rankings, using the implied human weights, lie between .08 and .22. (Before re-weighting, the correlations had been better - between .21 and .31.) Slightly changing the weights in small ways (say, a weight of .4 on job experience and general similarity, and a weight of .04 on everything else) does not significantly improve the correlations.

Individual Category Rating Comparison

Given this, we decided to examine correlations between individual human category ratings and individual machine category ratings.
The answers were enlightening - on many categories, the machine ratings and human ratings were not even weakly correlated, with individual category correlations falling between -.10 and .46, and only three of the eight category ratings correlating at a strength above .10 (all reported correlations use the new machine point scores):

User preference: 0.46
General similarity: 0.06
Skills Topics: -0.10
Job Experience: 0.20
Education: N/A
Core Skills: 0.32
Experience Level: N/A
Category: 0.10

Given this data, two conclusions came to light. The first conclusion was that, perhaps, we could
• 21. improve the fit of the machine final rating and the human final rating if we re-weighted both to focus more on the areas where the human and machine agreed. Doing this - weighting user preference, job experience, and core skills higher than all the rest (namely, a weighting system of .4 for user preference, .3 for core skills, .2 for job experience, and .05 each for category and general similarity) - again gives unsatisfactory results, not significantly improving on the prior weightings. Correlations between the final machine and human rankings, with this re-weighting, range from .22 to .34 - an increase over the original weights, but only a marginal one.

Points System

This brings us to the second point raised by the low correlations between the individual machine and human point scores: largely, the machine is just not good at matching human scores, nor is the machine good at rating resumes in the first place. For instance, in the above correlations, no correlation is reported for either education or experience level, because the machine failed to produce any variance in these categories at all - for experience level, the machine gave every single resume a score of 100 (indicating a perfect experience level), and for education, the machine gave every single resume a score of 0 (indicating that the resume shows no apparent education at all). Of the remaining categories, in only three of the eight (user preference, job experience, and core skills) does the machine do better than random chance at predicting how the human will rate the resume, and in only one of the two categories (job experience) that the humans implicitly deem important (as revealed by the OLS and ordinal logit models).
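Why the zero-variance categories yield no correlation at all can be seen in a short sketch. The arrays here are invented; only the constant-column behavior mirrors what the machine actually produced for education and experience level.

```python
import numpy as np

rng = np.random.default_rng(2)

human = {
    "job_xp": rng.uniform(0, 100, 30),
    "education": rng.uniform(0, 100, 30),
}
machine = {
    # Correlated, noisy machine scores for job experience...
    "job_xp": 0.5 * human["job_xp"] + rng.normal(0, 10, 30),
    # ...but a constant score for education, as the machine actually produced.
    "education": np.zeros(30),
}

cors = {}
for cat in human:
    if np.std(machine[cat]) == 0:
        cors[cat] = None  # correlation undefined: the machine column never varies
    else:
        cors[cat] = float(np.corrcoef(human[cat], machine[cat])[0, 1])
```

A Pearson correlation divides by both standard deviations, so a category the machine scores identically for every resume has no defined correlation with the human ratings - which is why those cells report N/A above.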
This lack of correspondence between the machine and human scores, combined with the fact that the machine fails a basic "sanity test" on certain categories (education and experience level) by not differentiating among resumes at all, suggests that it is the category scores, not the weights, that prevent the machine from properly matching the human ratings.

References

http://mathworld.wolfram.com/BootstrapMethods.html
http://statistics.stanford.edu/~ckirby/techreports/BIO/BIO%20101.pdf
http://www.methodsconsultants.com/tutorials/bootstrap1.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html