An Analysis of Keyword Preferences Amongst Recruiters
and Candidate Resumes
By Amir Behbehani
Abstract
Many prospective employees post resumes online, which recruiters use to screen and “submit” potential
candidates to hiring managers. One challenge is that recruiters must sift through hundreds of resumes and search for hard skills they are generally unqualified to assess for “fit.” Moreover, the recruiter
must search for clues that determine a “cultural fit,” which is nearly impossible without some contact with
the candidate. This paper examines the decision making process recruiters use to determine candidate
qualifications, and attempts to determine if a computer algorithm can be comparably, or more effective,
than a human recruiter at finding a candidate fit.
Overview
Path.to, in an effort to validate a proprietary scoring algorithm that matches job-seekers to potential
employers, has hired two recruiters (this number can increase) tasked with reviewing and rating resumes.
Objective
The goal of this project is to determine if the recruiters possess any "tacit" knowledge, absent from the
matching algorithm and thus not yet codified, that enables a reduced-friction hiring process.
Method
To identify differences between human classification and the machine learning classifier, we design an
experiment that requires n recruiters to rate identically distributed resumes (i.e., recruiters rate the same
resumes), k number of times. The value of k is a number between 1 and j, where j depends on a
resampling technique. The purpose of resampling is to mimic multiple recruiters, and thus identify the
underlying rating sample distribution for each job seeker, the sample statistic, and related variances.
Variances in this experiment are of two types: intra-variances (variances in the ranking of an individual
resume, by one recruiter, over the multiple resampled draws) and extra-variances (variances in the
ranking of an individual resume between (or among) recruiters). Large variances in the former indicate
inconsistent, and thus unreliable, rankings from that recruiter. Large variances in the latter may
indicate subjective bias.
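As a toy illustration of these two variance types (the ratings below are hypothetical, chosen only to make the contrast visible):

```r
# Hypothetical resampled draws: two recruiters each rate the same resume 4 times
recruiterA <- c(7, 8, 7, 8)   # consistent rater
recruiterB <- c(4, 9, 2, 8)   # inconsistent rater

# Intra-variance: how much one recruiter's ratings of the same resume vary
intraA <- var(recruiterA)
intraB <- var(recruiterB)

# Extra-variance: how much the recruiters' mean ratings of the resume disagree
extra <- var(c(mean(recruiterA), mean(recruiterB)))
```

A large intraB flags recruiter-B as unreliable; a large extra flags possible subjective bias between the two raters.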
The rankings from the machine learning algorithm are, then, compared to the rankings from n recruiters.
If the variance between the recruiters' scores and the machine learning algorithm's scores for the
applicants is small, we have a consistent ranking system that captures enough objectively measurable data
to determine a strong match. Conversely, large differences between the machine learning and recruiter
scores indicate a ranking algorithm that requires further tuning.
Control vs. Experiment
Our control group is the human recruiters (assuming they are reliable within and between (among, if n
> 2) themselves); we are testing how closely the computer ratings approximate the recruiter ratings.
The machine ratings are the 'experimental' group.
Computer Matching Algorithm
Matching employees and employers, algorithmically, largely requires scoring potential hires (i.e.
applicants) for a specific job. That is, applicants may score a strong ‘fit’ for a specific job, or job posting,
but a specific fit does not render an individual applicant a strong fit for the greater pool of job postings.
For example, an applicant may score a strong fit for a specific law firm hiring attorneys. It does not
follow that this prospective candidate would qualify as a strong fit for an engineering job at a
technology firm; another candidate will probably score a better fit for the engineering job. While qualified
matches are not, necessarily, mutually exclusive or collectively exhaustive (candidate-A scoring a
strong fit for position-X does not, necessarily, preclude a probable fit for position-Y as well), we expect
the likelihood that a candidate who scores well for one job also scores well for another job, or set of jobs,
to decrease exponentially.
The computerized scoring system broadly defines applicant qualifications, categorically, as a Tacit or
Explicit skillset. Tacit knowledge, generally speaking, refers to a set of skills that are highly social,
organizationally specific, and difficult to train. Tacit workers understand people, products, organizational
dynamics (beyond just the org-chart), and are highly emotionally attuned. Explicit knowledge, by
contrast, is domain specific, highly technical, easy to teach (although, not always easy to learn), and
organizationally transferable. Explicit workers are your statisticians, attorneys, accountants, and niche
engineers. In any organization, you need both tacit and explicit knowledge. Innovation requires
harnessing explicit knowledge, transforming it into tacit knowledge, and selling it. As such, any
meaningful employee/employer matching system would require measuring employee explicit and tacit
knowledge. The former, we approximate using traditional proxies such as years of experience, education,
stated skillset, etc. The latter, we approximate using social data (Facebook, Twitter, Forrst, etc.),
checking applicant tweets and Facebook posts, for example, for content relevant to the various job
postings, along with signals intended to determine their preferred work environment.
The total set of scoring determinants (rankers) and their respective weights are as follows:
Explicit Skills (max 1.15)
● Core Skills (max .25)
● Similarity (max .20)
● Skills Topic Model (max .10)
● Job Experience (max .15)
● Education (max .05)
● Category (max .15)
● Total Experience (max .15)
● Endorsements (max .10)
Tacit Skills
● Social Network Statuses (max 0.02)
○ Twitter
○ Facebook
○ Dribbble
○ Forrst
○ Behance
○ Github
● Signals (Likes and Dislikes) (max .21)
○ ApplicantJob
○ ApplicantBusiness
○ BusinessApplicant
○ BusinessJobApplicant
○ BusinessTitle
● Cultural Preferences ( max 0.12)
○ Benefits
○ Formality
○ CompanySize
○ Risk
○ Salary
Scoring starts by identifying the appropriate rankers to be used in the scoring. Not all rankers are used for
each individual employee/employer score, and the same rankers are not consistently applied to each
applicant. Instead, the job post itself helps determine which sets of rankers we will use, score, and apply
to the overall rank. Once rankers are selected, the scoring algorithm evaluates applicants, applying to each
ranker a score between 0 and 99. Rankers are then normalized and reevaluated to bound the total score, an
additive function of the various determinant rankers, between 0 and 99.
The Total Score will assume the form:
Total Score = w_a*X_1 + w_b*X_2 + w_c*X_3 + … + w_j*X_i
X: rating of 0-99 for the respective ranker
w: slope, or weight, of the respective ranker
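A minimal sketch of this additive, weighted total, using ranker weights from the table above (the final normalization step, dividing by the sum of the selected weights so that a perfect 99 on every ranker yields 99 overall, is an assumption about how totals are bounded):

```r
# Per-ranker ratings (0-99) for one applicant; rankers selected by the job post
ratings <- c(core_skills = 90, similarity = 80, job_experience = 70, education = 99)
# Maximum weights for these rankers, from the table above
weights <- c(core_skills = 0.25, similarity = 0.20, job_experience = 0.15, education = 0.05)

# Additive total: sum of weight * rating over the selected rankers
raw_total <- sum(weights * ratings)

# Normalize so the total score is bounded between 0 and 99
total_score <- raw_total / sum(weights)
```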
Survey Questionnaires
To experimentally compare the results of the computerized matching system to our control group - two
randomly selected recruiters - we dynamically allocate a set of questions to gauge how each recruiter
perceives applicant “fit”. Each question coincides with a ranker used in the computerized score and is
weighted accordingly. The questions are dynamically allocated, specific to each applicant/job pairing, to
match the rankers used in the overall applicant scoring. Each question has a set of five possible answers
(A through E), each worth 25 points more than the answer below it (i.e., A=99, B=74, C=49, D=24,
E=0). This approach attempts to “back into” the scores for each ranker.
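This "backing into" of ranker scores from questionnaire answers can be sketched as follows (the question names and weights here are illustrative, mirroring the ranker weights above):

```r
# Answer-to-score mapping stated in the text
answer_score <- c(A = 99, B = 74, C = 49, D = 24, E = 0)

# Hypothetical recruiter answers, one per dynamically allocated question
answers <- c(core_skills = "A", similarity = "B", job_experience = "C")
# Ranker weight attached to each question
weights <- c(core_skills = 0.25, similarity = 0.30, job_experience = 0.15)

# Weighted recruiter-side score, comparable to the machine's ranker scores
weighted_score <- sum(answer_score[answers] * weights)
```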
Questionnaire
How well does the company environment match the desires of the candidate? (UserPreference) (.12)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much does the candidate’s overall profile match the requirements of this position? (SimilarityWeight
+ Skills Topic Model Weight) (.20 + .10)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much did the candidate’s work experience qualify them for this position? (JobXpWeight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much did the candidate’s education experience qualify them for this position? (EducationWeight)
(.05)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much did the candidate’s skills qualify them for this position? (Core Skills Ranker) (.25)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much does the candidate’s skill level qualify them for this position? (Experience Weight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How well does this position match the stated interests of the candidate? (Category Weight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
Experimental Participants (i.e., How many recruiters do we need?)
While two recruiters is a limited sample, and thus cannot be representative of the general population of
recruiters, the nature of the resampling technique attempts to solve this problem. A third recruiter would
certainly increase significance should the two recruiters disagree frequently (i.e., extra-variance is high),
or should an individual recruiter have high intra-variance within the resampled resumes.
Resampling Technique
We should resample at least 10% of the resumes with the recruiters to check for reliability. More
importantly, the number of duplicate-rated resumes for each recruiter should be at least 50, if not 100, to
test for significance at the 95% level; conversely, we can adjust the confidence in our result with fewer
duplicate-rated resumes. The challenge is to ensure enough trials to check for reliability within a single
recruiter's ratings. Presumably, though, the computer will rate each resume identically each time it sees it,
so there is no need to resample with the computer.
How many are we resampling, practically?
Front End Process
Step 0: Recruiter A draws and rates an applicant resume from a stack of M resumes.
Back End Process
Step 1: We serve a stack of M resumes to Recruiter-A and Recruiter-B, both recruiters see the same
resumes.
Step 2: Every k(x)th resume is resampled, and thus Recruiter-A and Recruiter-B, respectively, re-rate these
resumes, where k(x) is a pseudo-random process, to minimize the chance that the recruiters learn the
resampling frequency.
On the k(x)th interval, we sample P resumes Q times each, randomly (these numbers can change). For
example, we can reserve 5 previously rated resumes for resampling, each of which will be re-rated 20
times; P = 5, Q = 20, or (5,20). Resampling (5,20) would require the recruiters to examine
approximately 1000 resumes in order to adequately space the resampled resumes: 5*20 = 100 resampled
draws; to space 100 resampled draws at intervals of approximately k(x) = 10 requires 1000 resumes.
Alternatively, (2,20) would require only 400 resumes distributed at intervals of k(x) = 10, or 320 resumes
if k(x) = 8. I suggest P be at least 2, if not 3 or more, and Q be at least 20, if not 30 or more.
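The spacing arithmetic above reduces to a simple product; a sketch (the function name is illustrative):

```r
# Approximate deck size for a (P, Q) resampling plan:
# P resumes reserved for resampling, Q re-ratings each, spaced ~k apart
deck_size <- function(P, Q, k) P * Q * k

deck_size(2, 20, 10)  # (2,20) at spacing k(x) = 10 -> 400 resumes
deck_size(2, 20, 8)   # (2,20) at spacing k(x) = 8  -> 320 resumes
```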
Note: The more resumes we resample, the smaller the space we need between resampled resumes. This is
because, with a larger pool of resampled resumes, the chance of the same resume reappearing at any given
interval is lower. For example, if we resample 5 resumes, we do not need a space of 10 between every
resampled resume, as that would mean raters see any particular resampled resume only once every 50
draws; clearly, we don't have to space things out quite that much.
Thus, at the end, we have: (a.) ratings for every resume from each rater, and (b.) a series of Q ratings for
each of the P resumes that we have decided to re-sample. If the standard error for each of these P
resumes is roughly equivalent, we can comfortably use this error rate across all resumes.
Addendum: Standard Errors and Sampling Distributions
Taking the metaphor of rolling a die, we can think of standard errors in the following way. Let's say I roll
a fair die 100 times, and average all of those die rolls. The average of those die rolls will be something
like 3.5ish. Probably not exactly 3.5 (sometimes I'll randomly get more 6's, sometimes more 1's, etc) but
close to 3.5. If I do it again, I'll get something close to 3.5 the next time, and the next time, and the next
time. All those numbers close to 3.5 will have a distribution (called the sampling distribution), and that
distribution will be characterized by being centered around 3.5, with a standard deviation equal to the
standard error. The standard error (equation given below) thus gives us a feel for how any particular
mean might deviate from the true population mean (3.5).
If we think about this in terms of classrooms and height, we can think of it this way: any given classroom
has some randomly selected students in it. Now, there is a true population mean height of students, but
each classroom in the building will have a mean height that is slightly off from the main mean height.
Without knowing the population mean height, though, we can approximate how far off the classroom
mean height is using the equation for standard error, which just requires that we have the sample
(classroom) standard deviation and the sample size of the classroom.
The equation for standard error is given by the following:
SE=SD/sqrt(n)
Where SE is the standard error of the sampling distribution, SD is the standard deviation of one sample,
and n is the sample size of that particular sample. Thus, assuming an equivalent standard deviation, the
higher our sample size (the more times we dish out the same resume) the more certain of the variance
around the "true" value of the rating we'll have. The higher the standard deviation, the higher sample
we'll need to get the same size standard error.
For example, in our case, let's say that we had a recruiter rate a resume 8 times: they rated it a 7 twice,
an 8 four times, and a 9 twice. We can calculate a 95% confidence interval within which we are 95%
confident the "true" value of the rating exists. In R, this would look like:
x<-c(7,7,8,8,8,8,9,9)
xbar<- mean(x)
xdev<-sd(x)
xse<-xdev/sqrt(length(x))
xbar+1.96*xse
xbar-1.96*xse
Giving us a range from approximately 7.48 to 8.52. (We may bootstrap the standard error when we're
worried that the distribution of responses may cause the equation above to give us a biased estimate of the
standard error. Bootstrapping is the process of resampling from the sample repeatedly to gain a better
estimate of the standard error.)
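A minimal bootstrap of the standard error for the same 8-rating example (nonparametric resampling of the observed ratings; the 10,000-draw count is an arbitrary choice):

```r
# The eight observed ratings from the example above
x <- c(7, 7, 8, 8, 8, 8, 9, 9)

set.seed(1)  # make the random draws reproducible
# Resample the ratings with replacement many times, taking the mean of each draw
boot_means <- replicate(10000, mean(sample(x, length(x), replace = TRUE)))

# The SD of the bootstrap means estimates the standard error of the mean
boot_se <- sd(boot_means)
# Compare with the analytic estimate: sd(x) / sqrt(length(x)), roughly 0.27
```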
Presumably, the second recruiter would also rate the resume, and we would be able to examine whether
their confidence range overlapped with the confidence range of the first recruiter.
With this in mind, essentially, the goal of this study is to see whether or not the machine ratings are
equivalent "enough" to the human ratings; in other words, do the machine ratings fall within a reasonable
range of error from the true mean?
To do that, we need a confidence interval for each resume. We could do this via two mechanisms: (1)
resampling every resume a few times for the reviewers to rate, or (2) assuming that the error in any rating
does not vary from resume to resume, and using just a couple of resumes to construct a standard error
we'd apply to every resume.
Rating Technique
Instead of the recruiters rating the individual resumes on a scale of 1 to 10, I suggest the recruiters answer
questions about the resumes, which map to scores. This method will simultaneously answer why the
recruiter is or is not interested in the candidate, and will, presumably, reduce the intra-variance.
Figure 1. Survey Questions
Dataset
Data collected from the experiment will conform to the following matrix. For each employer
(EmployerID) we will serve the recruiters a randomized set of resumes (ResumeID). The recruiters will
then score each resume (R1-Score and R2-Score). A subset of resumes will then be resampled and
served again (TrialNo), whereby the recruiters rate the resumes again.
● EmployerID - The hiring company
● ResumeID - The applicant
● TrialNo - The number of the trial for which we are gathering ratings. This key differentiates the
duplicate ratings for the same resume.
● R1-Score - Recruiter 1's score
● R2-Score - Recruiter 2's score
● M-Score - Machine's score.
Output
The output then tells us the range of expected (rank) values for each applicant, per job. So we will
know, for example, that 95% of the scores for John Doe working at Microsoft fall between, say, 7 and 9 in
the control group, versus a single score of 6.5 in the experimental group (no range, because there is no
resampling). From the data collected during the study, we calibrate the machine learning algorithm so that
its scores move toward the control range.
Important Caveat - Memory Effect and Repeatedly Sampled Resume Obfuscation
We need to cleverly mask the profiles so that the recruiters will not recognize the same resume
served multiple times. This ‘memory’ effect can bias the experiment. There are several ways to mitigate
the memory effect, which include:
1 Remove all names, phone numbers, addresses, and other uniquely identifiable attributes from
each resume.
2 Artificially expand the pool of applicant resumes for each job. This further obfuscates the fact
that each recruiter will repeatedly view and rate a subset of resampled resumes. We can easily
expand the pool of applicant resumes, by either using existing resumes from the general
repository (i.e., resumes that don’t necessarily match the said job), or by creating ‘dummy
resumes’. Of course, we would have to scratch the dummy resumes from our final dataset from
which we determine statistical significance. The purpose of the dummy resumes is strictly to
mitigate memory effect.
3 Finally, we can abstract information from each profile to create a more homogeneous applicant
pool. For example, applicant-A and applicant-B might each have worked at Microsoft; the
former applicant might have worked during the years 1992- 1997, while the latter applicant
during the years 2000-2005. In both cases, the applicants worked for Microsoft for 5 years. For
obfuscation purposes, we can abstract the dates into a common attribute (Years_Worked).
4 Between resamplings, we should randomize the order in which the resumes are rated. This will
further reduce the chance of the recruiters ‘catching on’ that the resumes are being resampled.
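The masking steps above can be sketched as a toy deck builder that shuffles the fresh resumes and splices the resampled ones in at pseudo-random positions (function and variable names are illustrative):

```r
# Build a serving order: shuffle the fresh resumes, then insert each resampled
# resume at a pseudo-random position so no fixed resampling interval emerges
build_deck <- function(fresh_ids, resample_ids) {
  deck <- sample(fresh_ids)  # randomize base order between resamplings
  for (id in resample_ids) {
    pos <- sample(seq_along(deck), 1)      # pseudo-random insertion point
    deck <- append(deck, id, after = pos)  # re-serve this resume once more
  }
  deck
}

set.seed(7)
deck <- build_deck(fresh_ids = 1:100, resample_ids = c(1, 2, 3))
```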
R-Code
#Path.To
#Pseudo-Analysis: Resumes
#9.20.12
#Requires: None
#Output: Dataset indicating whether machine ratings matches reasonable range of human ratings
#EmployerID - The hiring company
#ResumeID - The applicant
#TrialNo - The number of the trial for which we are gathering ratings. This key
#  differentiates the duplicate ratings for the same resume.
#R1-Score - Recruiter 1's score
#R2-Score - Recruiter 2's score
#M-Score - Machine's score.
#Create fake data
#Assume 10 positions, 100 resumes
#3 resumes (1,2, and 3) resampled 20 times
#employers creates an array of 1-10, count by 1
#employerID 160 = number of times we sample from employers
employers<-seq(1,10,1)
EmployerID<-sample(employers, 160, replace=T)
resumeID<-c(seq(1,100,1), rep(1,20), rep(2,20), rep(3,20))
TrialNo<-c(rep(1,100), seq(2,21,1), seq(2,21,1), seq(2,21,1))
R1Score<-round(rnorm(160, 5, 2))
R2Score<-round(rnorm(160, 5, 2))
MScore<-round(rnorm(160, 5, 2))
data<-as.data.frame(cbind(EmployerID, resumeID, TrialNo, R1Score, R2Score, MScore))
#To look at variance within ratings of Rater1's ratings of resumes 1, 2, and 3
#Get all Reviewer 1's ratings of resume #1, #2, and #3
rat1res1<-c(data$R1Score[1], data$R1Score[101:120])
rat1res2<-c(data$R1Score[2], data$R1Score[121:140])
rat1res3<-c(data$R1Score[3], data$R1Score[141:160])
#Get all Reviewer 2's ratings of resume #1, #2, and #3
rat2res1<-c(data$R2Score[1], data$R2Score[101:120])
rat2res2<-c(data$R2Score[2], data$R2Score[121:140])
rat2res3<-c(data$R2Score[3], data$R2Score[141:160])
#How much are individual recruiters' ratings varying on individual resumes?
#  Lower SDs indicate more reliable ratings
sd(rat1res1)
sd(rat2res1)
sd(rat1res2)
sd(rat2res2)
sd(rat1res3)
sd(rat2res3)
#A simple measure: are the reviewers rating the resumes roughly similarly?
#  Lower mean differences indicate more reliable ratings
mean(rat1res1)-mean(rat2res1)
mean(rat1res2)-mean(rat2res2)
mean(rat1res3)-mean(rat2res3)
#Assuming we're content with the numbers we see with the above, we move on.
#What is the standard error for each rating?
se1<-sd(rat1res1)/sqrt(length(rat1res1))
se2<-sd(rat1res2)/sqrt(length(rat1res2))
se3<-sd(rat1res3)/sqrt(length(rat1res3))
se4<-sd(rat2res1)/sqrt(length(rat2res1))
se5<-sd(rat2res2)/sqrt(length(rat2res2))
se6<-sd(rat2res3)/sqrt(length(rat2res3))
#Create mean standard error and mean standard deviation
semean<-mean(c(se1, se2, se3, se4, se5, se6))
sdmean<-mean(c(sd(rat1res1),sd(rat2res1),sd(rat1res2),sd(rat2res2),sd(rat1res3),sd(rat2res3)))
#Is the machine rating within the estimated 95% CI for each number?
#Create lower bounds and upper bounds around the first rater, using SD
data$lb1<-data$R1Score-(1.96*sdmean)
data$ub1<-data$R1Score+(1.96*sdmean)
data$lb2<-data$R2Score-(1.96*sdmean)
data$ub2<-data$R2Score+(1.96*sdmean)
#Is the machine code within this range?
match1<-c()
match2<-c()
for (i in 1:100) {
if (data$MScore[i]>data$lb1[i] & data$MScore[i]<data$ub1[i]) {match1[i]<-"Match"} else
{match1[i]<-"Outside Bounds"}
if (data$MScore[i]>data$lb2[i] & data$MScore[i]<data$ub2[i]) {match2[i]<-"Match"} else
{match2[i]<-"Outside Bounds"}
}
#Combine data and write out to csv file
dataexamine<-as.data.frame(cbind(data[1:100,],match1,match2))
write.csv(dataexamine, file="~/Documents/Documents/Work/Amir/RecuiterStudy/data.csv")
#Using intercoder reliability, how reliable are all coders, taken together?
library(irr)
robinson(data[,4:6])
#How about the two recruiters?
robinson(data[,4:5])
#The first recruiter and the machine?
robinson(cbind(data[,4],data[,6]))
#The second recruiter and the machine?
robinson(data[,5:6])
Lines 18-26 just create data (and concatenate it) that will look roughly like the data output from the
experiment. Here, we assume 10 employers (lines 18-19), 100 resumes (line 20), 3 of which we resample
an additional 20 times (line 21 identifies these values), and we assign (for the time being) arbitrary
recruiter-A, recruiter-B , and machine scores to the resumes (lines 22-24). Line 26 concatenates all into a
data set.
Lines 30-32 combine all of recruiter-A's ratings for the resampled resumes 1, 2, and 3; lines 35-37 combine
all of recruiter-B's ratings for the resampled resumes 1, 2, and 3. Then, lines 40-45 find the standard
deviation of all ratings for both raters on the resampled resumes. Smaller standard deviations indicate
more reliable ratings from the human raters (a good thing, and, assuming we have reliability there,
something that will allow us to justify applying one aggregate SD to all resume ratings).
Lines 48-50 examine, in a simple fashion, how recruiter-A's ratings differ from recruiter-B's, on average.
Numbers closer to zero indicate higher levels of reliability between raters.
Lines 54-59 create standard errors for all resample resumes and raters, and lines 62 and 63 create average
standard deviations and average standard errors across all resampled resumes.
Lines 67-71 create upper and lower bounds (95% bounds, to be precise) for the ratings of both recruiter-A
and recruiter-B. We use the standard deviation here, rather than the standard error, because we cannot
take the standard error from a sample of 21 (the sample we have for resampled resumes) and apply
it to the single ratings of recruiter-A (the sample we have for non-resampled resumes). In essence, the standard
error estimates the variation in means from repeated samples, not the variation in raw ratings themselves.
To see whether the machine rating is matching raw human ratings, we want to look at the variation in
human ratings, not the variation in means of samples of human ratings.
Lines 74-79 create two new columns for the end of the dataset: one column indicating whether the
machine rating fell within the 95% bounds for recruiter-A, and one column indicating whether the
machine rating fell within the 95% bounds for recruiter-B. If the machine is within bounds, the word
"Match" is recorded in the column, and if the machine is outside of bounds, the words "Outside Bounds"
are recorded. This allows us to easily examine, visually, to see where the machine matches and where it
doesn't.
Lines 82 and 83 combine all the data and record the dataset to a csv file at a specified location.
Preliminary Results - Dataset from Oct 29th
Results from an initial set of experimental data, which include recruiter qualitative responses, were coded,
aggregated, and evaluated as follows:
A set of qualitative responses from which the recruiters rated resumes were coded on a linear scale from 0
to 100, where a coded rating of 0 signifies a bad match and a coded rating of 100 signifies a perfect match.
Response Coded Value
Perfectly 100
Very Much 75
Acceptable 50
Unacceptable 25
Not at All 0
Each recruiter rates an applicant resume against a corresponding job posting, rating “fit” on each of
nine Question IDs. Thus, a resume that scores perfectly for each Question ID has a possible 900 points
(100 points for each category), which are then adjusted, or weighted, such that the total points
d = 36 * tan(1.94 + (x / 68)) + 90, where x equals the sum of values for each Question ID adjusted by
the machine-selected weights, fall within the range of 0-99.
Question ID Weights
Category 0.1
Core Skills 0.23
Education 0.03
Experience Level 0.04
General Similarity 0.18
JobExperience 0.08
Recommend NULL
Skills Topics 0.05
User Preference 0.1
Resampled resumes, then, receive multiple ratings from both recruiters (e.g., Recruiter A might rate
Resume-A 5 times, each time evaluating the resume slightly differently, while Recruiter B might rate
Resume-A 7 times, each time evaluating the resume slightly differently). The scores for Resume-A are then
aggregated, and we measure both the intra- and extra-variances.
Descriptive Statistics
Average Machine Rating: 75.52
Average Human Rating: 50.69
Correlation Coefficient (between human and machine): 0.312
Among the resampled resumes, most received different ratings across draws, but three were rated
identically each time they were rated (all three resumes were re-rated by the same recruiter).
Average Standard Deviation Across All Resampled Resumes: 7.16
Interpretation: We can expect 95% of all human ratings to be within 14.32 points on either side of the
mean rating. Because the average human rating is lower than the average machine rating, most of the
machine scores and human scores are not matches; the machine scores are simply too high. (Only 13.7%
of the machine scores are within the human range.)
Margin of Error Caveat: If we generously boost the human scores by 25 points, making the average
human score and the average machine score roughly the same, then 84.1% of the machine scores are
within two standard deviations of the corresponding human scores.
Statistical Significance: Using any sort of statistical significance test at this point is meaningless; the
sample size is too small. With more data, we can potentially estimate our variances and assign a measure
of statistical significance to our findings.
Additional Discussion Points
● Need to address the coding rules
● Recruiters had zero perfect fits
● Machine ratings are clustered above 80 points and below 70 points; this does not reflect the human scores
● Human ratings are more continuous
Link to Aggregated Dataset
Stata Code - For Massaging Data
insheet using "/Users/tarpus/Desktop/initialdata2.0.csv", comma
*Drop all the glenns
drop if user_id==1
*Sort on profile_id while keeping the rest of the order stable (so, don't re-order
*  resampled resumes OR randomly re-order the order of the ratings)
sort created_at, stable
sort profile_id, stable
encode job_id, gen(job_id_numeric)
encode profile_id, gen(profile_id_numeric)
*drop if profile_id_numeric==75 && user_id==6
drop if profile_id_numeric==75
drop if profile_id_numeric==34
destring points, gen(pointsnum) i("NA")
destring weight, gen(weightnum) i("NA")
destring points, gen(scorenum) i("NA")
encode answer, gen(answernum)
recode answernum 4=75 1=50 3=25 2=0, gen(hscore)
gen hpoints=hscore*weight
outsheet job_id_numeric profile_id_numeric user_id points hpoints created_at using ///
"/Users/tarpus/Desktop/initialdata2.0clean.csv", comma replace nolabel
*39377014fc19545b4675b7a1b165566e02e4b34
*9f93a3145210aa734fb37fca4618a9a46b38403
clear
insheet using "/Users/tarpus/Desktop/initialdata2.0IDed.csv", comma
destring points, gen(pointsnum) i("NA")
destring weight, gen(weightnum) i("NA")
destring points, gen(scorenum) i("NA")
collapse
R Code - For Data Analysis
rawdata<-read.csv("/Users/tarpus/Desktop/initialdata2.0clean.csv")
rawdata$uniqueID<-rep(1:(length(rawdata$points)/9), each=9)
summary(rawdata$job_id_numeric)
summary(rawdata$profile_id_numeric)
adata<-aggregate(rawdata, by=list(rawdata$uniqueID), FUN=mean, na.rm=TRUE)
adata$comprating<-36*(tan(1.94+((adata$points*8)/68)))+90
adata$humanrating<-36*(tan(1.94+((adata$hpoints*8)/68)))+90
summary(adata$comprating)
summary(adata$humanrating)
library(lattice)
xyplot(adata$comprating~adata$humanrating)
cor(adata$comprating,adata$humanrating)
#find re-samples
ob<-table(adata$profile_id_numeric); ob-1
#Average resume rating:
mean(adata$comprating)
mean(adata$humanrating)
sd(adata$comprating)
sd(adata$humanrating)
write.csv(adata, file="/Users/tarpus/Desktop/aggregateddata.csv")
resamples<-read.csv("/Users/tarpus/Desktop/aggregateddataresamples.csv")
mean(c(resamples$humanrating[1], resamples$humanrating[2]))
mean(c(resamples$humanrating[2], resamples$humanrating[3]))
mean(c(resamples$humanrating[5], resamples$humanrating[6]))
mean(c(resamples$humanrating[7], resamples$humanrating[8], resamples$humanrating[9]))
mean(c(resamples$humanrating[10], resamples$humanrating[11]))
mean(c(resamples$humanrating[12], resamples$humanrating[13]))
mean(c(resamples$humanrating[14], resamples$humanrating[15]))
mean(c(resamples$humanrating[16], resamples$humanrating[17]))
mean(c(resamples$humanrating[18], resamples$humanrating[19]))
mean(c(resamples$humanrating[20], resamples$humanrating[21]))
mean(c(resamples$humanrating[22], resamples$humanrating[23]))
mean(c(resamples$humanrating[24], resamples$humanrating[25]))
mean(c(resamples$humanrating[26], resamples$humanrating[27], resamples$humanrating[28]))
mean(c(resamples$humanrating[29], resamples$humanrating[30]))
mean(c(resamples$humanrating[31], resamples$humanrating[32], resamples$humanrating[33]))
sds<-c(
sd(c(resamples$humanrating[1], resamples$humanrating[2])),
sd(c(resamples$humanrating[2], resamples$humanrating[3])),
sd(c(resamples$humanrating[5], resamples$humanrating[6])),
sd(c(resamples$humanrating[7], resamples$humanrating[8], resamples$humanrating[9])),
sd(c(resamples$humanrating[10], resamples$humanrating[11])),
sd(c(resamples$humanrating[12], resamples$humanrating[13])),
sd(c(resamples$humanrating[14], resamples$humanrating[15])),
sd(c(resamples$humanrating[16], resamples$humanrating[17])),
sd(c(resamples$humanrating[18], resamples$humanrating[19])),
sd(c(resamples$humanrating[20], resamples$humanrating[21])),
sd(c(resamples$humanrating[22], resamples$humanrating[23])),
sd(c(resamples$humanrating[24], resamples$humanrating[25])),
sd(c(resamples$humanrating[26], resamples$humanrating[27], resamples$humanrating[28])),
sd(c(resamples$humanrating[29], resamples$humanrating[30])),
sd(c(resamples$humanrating[31], resamples$humanrating[32], resamples$humanrating[33]))
)
sdmean<-mean(sds)
adata$lb1<-(adata$humanrating-(1.96*sdmean))
adata$ub1<-(adata$humanrating+(1.96*sdmean))
#many are outside match, b/c of different scale, so standardize ratings
#  (add ~25 to human ratings)
adata$lb2<-(adata$humanrating+(mean(adata$comprating)-mean(adata$humanrating))-(1.96*sdmean))
adata$ub2<-(adata$humanrating+(mean(adata$comprating)-mean(adata$humanrating))+(1.96*sdmean))
#Is the machine rating within the human confidence interval?
inside1 <- adata$comprating > adata$lb1 & adata$comprating < adata$ub1
inside2 <- adata$comprating > adata$lb2 & adata$comprating < adata$ub2
match1 <- ifelse(inside1, "Match", "Outside Bounds")
match2 <- ifelse(inside2, "Match", "Outside Bounds")
#Recode as 0/1 so the means below give the match rates
match1 <- as.numeric(inside1)
match2 <- as.numeric(inside2)
mean(match1)
mean(match2)
#Combine data and write out to csv file
dataexamine<-as.data.frame(cbind(adata,match1, match2))
write.csv(dataexamine, file="/Users/tarpus/Desktop/dataexamine.csv")
#different ratings, different raters: 10, 22, 77, 80, 88, 92, 93, 105, 112
#different ratings, same rater: 41, 66, 109,
#same rating, different raters:
Secondary Results - Dataset from Oct 29th
Correlations between:
weighted human rating and human recommendation: .83
human recommendation and weighted machine rec #1: .20
human recommendation and weighted machine rec #2: .27
weighted human rating and weighted machine rating: .28
Scatterplots showing these same relationships (omitted here).
Conclusion
Since we are limited in the number of recruiters, we examine the recruiters' final
recommendations and their ratings to determine, quickly, whether the recruiters' ratings are
trustworthy. That is, we asked each recruiter to rate each resume in two ways: first, they
rated the resume on eight categories, each on a five-point (1-5) scale, according to how well the
resume fits the job on that category. Next, the recruiter gave each resume an
overall recommendation - on the same 1-5 scale - for how well the applicant's resume "fits" the
job. It would be suspect if a recruiter rated a resume highly on the individual
categories, only to rate the same resume low on the overall recommendation (i.e., not recommend
the applicant). Our sanity check correlates the human overall recommendations with
the individual human category ratings, after walking the human category ratings through the
machine concatenation and weighting procedure. While not a perfect fit, the two ratings
correlate closely (.83), signifying that (1) our recruiters rated resumes reasonably, and (2) broadly
speaking, the tangent function and machine weighting scheme are reasonable.
Human vs. Machine at a Glance
Similarly, we compare the human recommendations against the two iterations of machine ratings,
and then, finally, the human ratings against the machine ratings. In each case, the correlations
were very weak (between .21 and .31). This disjunct likely results from (1) the machine rating
resumes significantly differently than the human recruiters across each evaluation category, and
(2) the weighting system that turns individual machine ratings into a single judgment introducing
extra noise, because the weights do not match the way humans naturally rate the resumes.
Relative Value (Weights)
To discover how humans "naturally" rate the resumes, we can leverage the fact that the recruiters
provide both a single final rating and individual category ratings. Thus, we create
a linear model that predicts the final human recommendation from the individual category
human ratings. The resulting coefficients (and their statistical significance)
elucidate how our human coders weight the resumes. We use both an Ordinary
Least Squares (OLS) regression and an ordinal logistic regression to model the factor weights
that best predict the outcome of a resume being recommended for a job. The ordinal logit,
critically, does not assume a continuous scale - rather than predicting a number, the model
predicts the probability (more precisely, the log-odds) that a given set of covariates falls into
one category or another. Technically speaking, this makes it the superior model, as it fits our
data more appropriately. However, the results from the ordinal logistic model generally match
the OLS estimates, substantively speaking, indicating that the continuous-scale assumption of
the OLS model is not badly violated. Accordingly, and due to the more straightforward nature of
OLS, we focus on the OLS model for most of this write-up.
NOTE: When we attempt to operationalize the weights and, thus, calibrate the machine scores, it
will make more sense to use the OLS model than the logistic model.
Ordinary Least Squares Model
The OLS model suggests that human coders weight "general similarity" and "job experience" far
more heavily than any other categories. These two variables alone predict about 85% of the
variation in the human raters' overall evaluation of the candidates. (This number is confirmed
when we re-run the estimation with just those two variables and none of the others - the R2
remains at approximately .85 even with the other variables excluded.) The OLS suggests a very
different weighting system from that of the machine-generated scores: namely, it (loosely)
suggests a scheme in which 80%-90% of the overall judgment is driven in equal parts
by general similarity and job experience, with the rest driven by the other variables. (So, we could
place a weight of .45 on general similarity, .45 on job experience, and .02
on everything else; or .4 on the first two, and .04 on everything else.)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.65630 1.08808 0.603 0.54679
user_preference_h_scale 0.01686 0.02368 0.712 0.47697
general_similarity_h_scale 0.40938 0.04936 8.293 2.46e-15 ***
skills_topics_h_scale NA NA NA NA
job_xp_h_scale 0.39072 0.04822 8.103 9.26e-15 ***
education_h_scale -0.02619 0.02535 -1.033 0.30222
core_skills_h_scale -0.01473 0.02636 -0.559 0.57670
experience_level_h_scale 0.14142 0.04828 2.929 0.00363 **
category_h_scale 0.01649 0.02487 0.663 0.50778
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.629 on 347 degrees of freedom
Multiple R-squared: 0.8494, Adjusted R-squared: 0.8463
F-statistic: 279.5 on 7 and 347 DF, p-value: < 2.2e-16
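The two-variable check described above (re-running the estimation with only the dominant predictors and confirming the R2 barely moves) can be sketched as follows, again on simulated stand-in data rather than the real ratings:

```r
# Simulated data where two categories carry almost all of the signal
set.seed(1)
n <- 300
d <- data.frame(
  general_similarity = runif(n, 1, 5),
  job_xp             = runif(n, 1, 5),
  education          = runif(n, 1, 5)
)
d$overall <- 0.45 * d$general_similarity + 0.45 * d$job_xp +
             0.02 * d$education + rnorm(n, sd = 0.4)

# Full model vs. a model with only the two dominant predictors
full    <- lm(overall ~ general_similarity + job_xp + education, data = d)
reduced <- lm(overall ~ general_similarity + job_xp, data = d)

# If the two variables dominate, the R^2 values will be nearly identical
c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)
```

When the omitted categories carry negligible weight, dropping them costs almost no explanatory power, which is exactly the pattern reported for the real data.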
Constructing a Set of Predictive Models
To confirm the assumptions made by the OLS model, we perform a secondary analysis using an
ordinal logistic regression model. In this model, we predict the [log] likelihood that the overall
recommendation an individual human gives a resume falls into one of the five ratings [Perfect
Fit, Very Much a Fit, Acceptable Fit, Unacceptable Fit, & Not a Fit at All], given that human's
ratings on the individual job-fit categories. We employ a special case of the logistic model:
an ordinal logistic regression, also known as a proportional odds logistic
regression (POLR, for short). The ordinal logistic model factors the various rating components (e.g.,
job experience, core skills, etc.) to determine, and assign, a stochastically derived score for each
applicant resume; individual resumes are scored for a specific job, i.e., the logistic score
predicts the likelihood that a resume matches a prescribed job posting.
Coefficients:
Value Std. Error t value
user_preference_h_glm 0.4053 0.5436 0.7455
general_similarity_h_glm 4.0417 0.8831 4.5767***
job_xp_h_glm 4.1715 0.8370 4.9839***
education_h_glm -0.2695 0.5020 -0.5369
core_skills_h_glm 0.1567 0.5874 0.2667
experience_level_h_glm 0.3481 0.7979 0.4362
category_h_glm 0.6937 0.5441 1.2750
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As can be seen in the above model results, the proportional odds ordinal logistic regression
model largely confirms the OLS results: general similarity and job
experience are, in general, the two most important predictors of the human overall job
recommendations. The only difference in the ordinal logistic model is that it makes an
even stronger case for the (similar) importance of these two variables: not only do their
coefficients (nearly) match again, but in this model, experience level is no longer
statistically significant. This underscores, again, that a weighting system that weights general
similarity and job experience heavily while weighting the other categories only lightly would
match the apparent implicit human weights more closely.
Model Score Performance
Using insights gleaned from the OLS and ordinal logit models, we construct a new, calibrated
scoring function where, rather than machine-generated weights, we use human weights in the
construction of the final score. In short, any computer resume rating has two
components:
1 The rating score assigned to each individual category for the resume (general similarity, job
experience, experience level, etc.)
2 The weight assigned to that category by the machine system.
The individual ratings, which the machine gives on a zero-to-one-hundred scale, are
multiplied by their weights and then added together. So, for instance, if the
machine used two rating categories, boogie and woogie, weighted those categories .4
and .6, and scored a resume 50 and 75 on them, respectively, we could find the
pre-normalization machine rating as follows:
Rating = .4*(boogie)+.6*(woogie)
Rating = .4*50+.6*75
Rating = 65
That 65 would then be passed through the normalizing tangent function to arrive at the final
resume rating. Accordingly, if the machine ratings appear to diverge from the human ratings,
there are only two sources of error: the weighting system and the machine's category rating
system itself. Given the low correlation between human ratings and machine ratings, our first
attempt was to alter the machine weighting system to match the implicit human rating system more
closely. We did this by using the weights suggested by the OLS model -
approximately, a weight of .45 on general similarity, a weight of .45 on job experience, and a
weight of .02 on everything else.
However, using this new weighting system, the performance of the machine did not increase - in
fact, it decreased. The correlations between machine rankings and human rankings, using the
implied human weights, lie between .08 and .22. (Before re-weighting, the correlations
had been much better - between .21 and .31.) Changing the weights in small ways (say,
a weight of .4 on job experience and general similarity, and a weight of .04 on everything else)
does not significantly improve the correlations.
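The score construction and re-weighting described above can be sketched in R. The weighted_rating function below is an illustrative stand-in, and tanh() stands in for the unspecified "tangent function" used by the production system.

```r
# Sketch of the weighted-score construction: category scores (0-100) are
# combined with a weight vector, then passed through a normalizing function.
weighted_rating <- function(scores, weights) {
  stopifnot(length(scores) == length(weights))
  sum(scores * weights)
}

# The worked example from the text: weights .4/.6, scores 50/75 -> 65
raw <- weighted_rating(c(boogie = 50, woogie = 75), c(0.4, 0.6))

# Hypothetical normalization step (tanh rescaled to a 0-100 range);
# the real system's tangent function may differ
final <- 100 * tanh(raw / 100)
```

Re-weighting amounts to swapping in a different weight vector (e.g., .45/.45 on the two dominant categories and .02 elsewhere) while leaving the category scores untouched, which is why a weight change alone cannot fix errors in the scores themselves.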
Individual Category Rating Comparison
Given this, we decided to examine correlations between individual human category ratings and
individual machine category ratings. The answers were enlightening - on many categories, the
machine ratings and human ratings were not even weakly correlated. Individual category
correlations fall between -.10 and .46, and only three of the eight categories correlate at a
strength above .10 (all reported correlations use the new machine point scores):
User preference: 0.46
General similarity: 0.06
Skills Topics: -0.10
Job Experience: 0.20
Education: N/A
Core Skills: 0.32
Experience Level: N/A
Category: 0.10
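A per-category comparison like the one above can be computed as follows. The data frames and column names are illustrative stand-ins; categories where the machine produced no variance are reported as NA, corresponding to the "N/A" entries in the list.

```r
# Illustrative stand-in data: human ratings (1-5) vs. machine scores (0-100)
set.seed(7)
n <- 50
human <- data.frame(
  job_xp    = sample(1:5, n, replace = TRUE),
  education = sample(1:5, n, replace = TRUE)
)
machine <- data.frame(
  job_xp    = runif(n, 0, 100),
  education = rep(0, n)   # machine scored every resume 0 on education
)

# Correlate human and machine ratings within each category; cor() is
# undefined when one side has zero variance, so report NA there
cor_by_category <- sapply(names(human), function(cat) {
  if (sd(machine[[cat]]) == 0) return(NA_real_)
  cor(human[[cat]], machine[[cat]])
})
cor_by_category
```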
Given this data, two conclusions came to light. The first was that, perhaps, we could
improve the fit between the machine final rating and the human final rating by re-weighting to
focus more on the areas where the human and machine agree. Doing this - weighting user
preference, core skills, and job experience higher than all the rest (namely, a weighting
system of .4 for user preference, .3 for core skills, .2 for job experience, and .05 each for
category and general similarity) - again gives unsatisfactory results,
not significantly improving on the prior weightings. Correlations between the final rankings of
machine and human, with this re-weighting, range from .22 to .34 - an increase over the
original weights, but only a marginal one.
Points System
This brings us to the second point raised by the low correlations between the individual point
scores of machine and human - largely, the machine is just not good at matching human scores,
nor is the machine good at rating resumes in the first place. For instance, in the above
correlations, there is no correlation reported for either education or experience level, because the
machine failed to produce any variance in these categories at all: for experience level, the machine
gave every single resume a score of 100 (indicating a perfect experience level), and for education,
the machine gave every single resume a score of 0 (indicating that the resume has no apparent
education at all). Of the categories, only in three of the eight (user preference,
job experience, and core skills) does the machine do better than random chance at predicting
how the human will rate the resume, and only one of those (job experience) is among the two
categories the human implicitly treats as important (as revealed by the OLS and ordinal logit models).
This lack of correspondence between the machine and human scores, combined with the fact that
the machine fails a certain sort of "sanity test" of its own on certain categories (education and
experience level) by not rating any resumes differently at all, suggests that it is not the weights,
but the category scores, that prevent the machine from properly matching the
human ratings.
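The zero-variance "sanity test" above can be made mechanical: flag any machine category whose scores never vary across resumes. A minimal sketch, with fabricated scores:

```r
# Fabricated machine category scores for ten resumes
machine_scores <- data.frame(
  experience_level = rep(100, 10),  # machine gave every resume 100
  education        = rep(0, 10),    # machine gave every resume 0
  job_xp           = c(10, 40, 55, 70, 20, 90, 35, 60, 80, 15)
)

# A category with zero standard deviation cannot discriminate between
# resumes and should be flagged before any correlation analysis
no_variance <- sapply(machine_scores, function(x) sd(x) == 0)
names(machine_scores)[no_variance]   # the degenerate categories
```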
References
http://mathworld.wolfram.com/BootstrapMethods.html
http://statistics.stanford.edu/~ckirby/techreports/BIO/BIO%20101.pdf
http://www.methodsconsultants.com/tutorials/bootstrap1.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html
How to Make a Pirate ship Primary Education.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 

An Analysis Of Keyword Preferences Amongst Recruiters And Candidate Resumes

By Amir Behbehani

Abstract

Many prospective employees post resumes online, which recruiters use to screen and “submit” potential candidates to hiring managers. One challenge is that recruiters must sift through hundreds of resumes, searching for hard skills they are generally unqualified to assess for “fit.” Moreover, the recruiter must search for clues that indicate a “cultural fit,” which is nearly impossible without some contact with the candidate. This paper examines the decision-making process recruiters use to determine candidate qualifications, and attempts to determine whether a computer algorithm can be as effective as, or more effective than, a human recruiter at finding a candidate fit.
Overview

Path.to, in an effort to validate a proprietary scoring algorithm that matches job-seekers to potential employers, has hired two recruiters (this number can increase) tasked with reviewing and rating resumes.

Objective

The goal of this project is to determine whether the recruiters possess any "tacit" knowledge, absent from the matching algorithm and thus not yet codified, that enables a reduced-friction hiring process.

Method

To identify differences between human classification and the machine-learning classifier, we design an experiment that requires n recruiters to rate identically distributed resumes (i.e., recruiters rate the same resumes) k times. The value of k is a number between 1 and j, where j depends on a resampling technique. The purpose of resampling is to mimic multiple recruiters, and thus identify the underlying rating sample distribution for each job seeker, the sample statistic, and the related variances. Variances in this experiment are of two types: intra-variances (variances in the ranking of an individual resume, by one recruiter, over the multiple resampled draws) and extra-variances (variances in the ranking of an individual resume between, or among, recruiters). Large variances in the former indicate inconsistent, and thus unreliable, rankings from the recruiter in question. Large variances in the latter may indicate subjective bias. The rankings from the machine-learning algorithm are then compared to the rankings from the n recruiters. If the variance between the recruiters' scores and the machine-learning algorithm's scores is small, the ranking system is consistent and captures enough objectively measurable data to determine a strong match. Conversely, large differences between the machine-learning and recruiter scores indicate a ranking algorithm that requires further tuning.

Control vs. Experiment

Our control group is the human recruiters (assuming they are reliable within and between, or among if n > 2, themselves).
We are measuring how closely the computer ratings approximate the recruiter ratings; the machine ratings are, conversely, the 'experimental' group.

Computer Matching Algorithm

Matching employees and employers algorithmically largely requires scoring potential hires (i.e., applicants) for a specific job. That is, an applicant may score a strong ‘fit’ for a specific job, or job posting, but a specific fit does not render that applicant a strong fit for the greater pool of job postings. For example, an applicant may score a strong fit for a specific law firm hiring attorneys. It does not follow that this candidate would also qualify as a strong fit for an engineering job at a technology firm; another candidate will probably score a better fit for the engineering job. While qualified matches are not necessarily mutually exclusive or collectively exhaustive (candidate-A scoring a strong fit for position-X does not necessarily preclude a probable fit for position-Y as well), we expect the likelihood that a candidate who scores well for one job also scores well for another job, or set of jobs, to decrease exponentially.
The computerized scoring system broadly defines applicant qualifications, categorically, as a Tacit or Explicit skillset. Tacit knowledge, generally speaking, refers to a set of skills that are highly social, organizationally specific, and difficult to train. Tacit workers understand people, products, and organizational dynamics (beyond just the org-chart), and are highly emotionally attuned. Explicit knowledge, by contrast, is domain specific, highly technical, easy to teach (although not always easy to learn), and organizationally transferable. Explicit workers are your statisticians, attorneys, accountants, and niche engineers. Any organization needs both tacit and explicit knowledge: innovation requires harnessing explicit knowledge, transforming it into tacit knowledge, and selling it. As such, any meaningful employee/employer matching system must measure both explicit and tacit knowledge. The former we approximate using traditional proxies such as years of experience, education, stated skillset, etc. The latter we approximate using social data (Facebook, Twitter, Forrst, etc.), checking applicant tweets and Facebook posts, for example, for content relevant to the various jobs and intended to determine their preferred work environment. The total set of scoring determinants (rankers) and their respective weights are as follows:

Explicit Skills (max 1.15)
● Core Skills (max .25)
● Similarity (max .20)
● Skills Topic Model (max .10)
● Job Experience (max .15)
● Education (max .05)
● Category (max .15)
● Total Experience (max .15)
● Endorsements (max .10)

Tacit Skills
● Social Network Statuses (max 0.02)
  ○ Twitter
  ○ Facebook
  ○ Dribbble
  ○ Forrst
  ○ Behance
  ○ Github
● Signals (Likes and Dislikes) (max .21)
  ○ ApplicantJob
  ○ ApplicantBusiness
  ○ BusinessApplicant
  ○ BusinessJobApplicant
  ○ BusinessTitle
● Cultural Preferences (max 0.12)
  ○ Benefits
  ○ Formality
  ○ CompanySize
  ○ Risk
  ○ Salary

Scoring starts by identifying the appropriate rankers to be used in the scoring. Not all rankers are used for each individual employee/employer score, and the same rankers are not consistently applied to each applicant. Instead, the job post itself helps determine which set of rankers we will use, score, and apply to the overall rank. Once rankers are selected, the scoring algorithm evaluates applicants, assigning each ranker a score between 0 and 99. Rankers are then normalized and reevaluated to bound the total score, an additive function of the various determinant rankers, between 0 and 99. The Total Score assumes the form:

Total Score = βa·X1 + βb·X2 + βc·X3 + … + βj·Xi

X: rating of 0-99
β: slope, or weight, of the respective ranker

Survey Questionnaires

To experimentally compare the results of the computerized matching system to our control group (two randomly selected recruiters), we dynamically allocate a set of questions to gauge how each recruiter perceives applicant “fit”. Each question coincides with a ranker used in the computerized score, and is weighted accordingly. The questions are dynamically allocated specific to each applicant/job pairing to match the rankers used in the overall applicant scoring. Each question has a set of five possible answers (A through E), awarded in decrements of roughly 25 points from the top rating down (i.e., A = 99, B = 74, C = 49, D = 24, E = 0). This approach attempts to “back into” the scores for each ranker.

Questionnaire

How well does the company environment match the desires of the candidate? (UserPreference) (.12)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
How much does the candidate’s overall profile match the requirements of this position? (SimilarityWeight + Skills Topic Model Weight) (.20 + .10)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How much did the candidate’s work experience qualify them for this position? (JobXpWeight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How much did the candidate’s education experience qualify them for this position? (EducationWeight) (.05)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How much did the candidate’s skills qualify them for this position? (Core Skills Ranker) (.25)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How much does the candidate’s skill level qualify them for this position? (Experience Weight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all

How well does this position match the stated interests of the candidate? (Category Weight) (.15)
A. Perfectly
B. Very Much
C. Acceptable
D. Unacceptable
E. Not at all
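The "backing into" of ranker scores described above can be sketched in a few lines of R. The helper name `score_candidate` is hypothetical (not Path.to's production code), the per-question weights are taken from the questionnaire, and no final normalization to the 0-99 band is applied here.

```r
# Map each answer letter to its 0-99 value, per the questionnaire
answer_values <- c(A = 99, B = 74, C = 49, D = 24, E = 0)

# Per-question weights, as listed in the questionnaire above
weights <- c(UserPreference = 0.12, Similarity = 0.30, JobXp = 0.15,
             Education = 0.05, CoreSkills = 0.25, Experience = 0.15,
             Category = 0.15)

score_candidate <- function(answers, w) {
  # answers: character vector of letters A-E, one per ranker, in w's order
  sum(answer_values[answers] * w)
}

# Example: a recruiter answering "B" (Very Much) on every question
answers <- rep("B", length(weights))
score_candidate(answers, weights)  # 74 * 1.17 = 86.58
```

Because the weights sum to more than 1, a raw total can exceed 99; in the actual scoring pipeline, the normalization step described above bounds the result.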
Experimental Participants (i.e., How Many Recruiters Do We Need?)

While two recruiters is a limited sample, and thus cannot be representative of the general population of recruiters, the nature of the resampling technique attempts to solve this problem. A third recruiter would certainly increase significance should the two recruiters disagree frequently (i.e., extra-variance is high), or should an individual recruiter show high intra-variance within the resampled resumes.

Resampling Technique

We should resample at least 10% of the resumes with the recruiters to check for reliability. More importantly, the number of duplicate-rated resumes for each recruiter should be at least 50, if not 100, to test for significance at the 95% level; conversely, we can accept lower confidence in our result with fewer duplicate-rated resumes. The challenge is to ensure enough trials to check for reliability within a single recruiter's ratings. Presumably, though, the computer will rate each resume identically each time it sees it, so there is no need to resample with the computer.

How many are we resampling, practically?

Front End Process

Step 0: Recruiter A draws and rates an applicant resume from a stack of M resumes.

Back End Process

Step 1: We serve a stack of M resumes to Recruiter-A and Recruiter-B; both recruiters see the same resumes.

Step 2: Every k(x)th resume is resampled, and thus Recruiter-A and Recruiter-B, respectively, re-rate these resumes, where k(x) is a pseudo-random process, to minimize the chance that the recruiters learn the resampling frequency. On the k(x)th interval, we sample P resumes Q times each, randomly (these numbers can change). For example, we can reserve 5 previously rated resumes for resampling, each of which will be re-rated 20 times; P = 5, Q = 20, or (5,20).
Resampling (5,20) would require the recruiters to examine approximately 1000 resumes in order to adequately space the resampled resumes: 5*20 = 100 resampled resumes; spacing 100 resampled resumes at intervals of approximately k(x) = 10 requires 1000 resumes. Alternatively, (2,20) would require only 400 resumes distributed at intervals of k(x) = 10, or 320 resumes if k(x) = 8. I suggest P be at least 2, if not 3 or more, and Q be at least 20, if not 30 or more.

Note: The more resumes we are resampling, the smaller the space we need between resampled resumes, because with a larger pool of resampled resumes the chance of the same individual resume reappearing shrinks on its own. For example, if we resample 5 resumes, we do not need a space of 10 between every resampled resume, as that would mean raters see the same resume only once every 50 draws - clearly, we don't have to space things out quite that much.
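The resume-count arithmetic for a (P,Q) design can be sketched as a one-line R helper (the function name is ours, for illustration): reserving P resumes, each re-rated Q times, spaced at intervals of roughly k, means the recruiters must examine about P * Q * k resumes in total.

```r
# Total resumes the recruiters must examine for a (P, Q) resampling design
# with resampled resumes spaced roughly every k draws
total_resumes <- function(P, Q, k) P * Q * k

total_resumes(2, 20, 10)  # the (2,20) design at k(x) = 10 -> 400 resumes
total_resumes(2, 20, 8)   # the same design at k(x) = 8  -> 320 resumes
```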
Thus, at the end, we have: (a) ratings for every resume from each rater, and (b) a series of Q ratings for each of the P resumes that we have decided to resample. If the standard error for each of these P resumes is roughly equivalent, we can comfortably use this error rate across all resumes.

Addendum: Standard Errors and Sampling Distributions

Taking the metaphor of rolling a die, we can think of standard errors in the following way. Say I roll a fair die 100 times and average all of those rolls. The average will be something like 3.5-ish - probably not exactly 3.5 (sometimes I'll randomly get more 6's, sometimes more 1's, etc.), but close to it. If I do it again, I'll get something close to 3.5 the next time, and the next time, and the next time. All those numbers close to 3.5 will have a distribution (called the sampling distribution), and that distribution will be centered around 3.5 with a standard deviation equal to the standard error. The standard error (equation given below) thus gives us a feel for how any particular mean might deviate from the true population mean (3.5).

In terms of classrooms and height, we can think of it this way: any given classroom has some randomly selected students in it. There is a true population mean height of students, but each classroom in the building will have a mean height that is slightly off from that population mean. Without knowing the population mean height, though, we can approximate how far off the classroom mean height is using the equation for standard error, which requires only the sample (classroom) standard deviation and the sample size of the classroom. The equation for standard error is:

SE = SD / sqrt(n)

where SE is the standard error of the sampling distribution, SD is the standard deviation of one sample, and n is the sample size of that particular sample.
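The die-rolling metaphor above is easy to check with a quick simulation (this is our illustration, not part of the study's code): the means of repeated 100-roll samples cluster around 3.5, and their spread - the standard error - is approximated by SD/sqrt(n) computed from a single sample.

```r
set.seed(42)  # arbitrary seed, for reproducibility
n <- 100

# 10,000 replications of "roll a fair die 100 times and take the mean"
sample_means <- replicate(10000, mean(sample(1:6, n, replace = TRUE)))

mean(sample_means)  # close to 3.5, the true population mean
sd(sample_means)    # empirical standard error, roughly 1.71/sqrt(100) = 0.171
sd(sample(1:6, n, replace = TRUE)) / sqrt(n)  # the SE formula from one sample
```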
Thus, assuming an equivalent standard deviation, the higher our sample size (the more times we dish out the same resume), the more certain we will be of the variance around the "true" value of the rating. The higher the standard deviation, the larger the sample we'll need to get the same size standard error. For example, say a recruiter rated a resume 8 times: a 7 twice, an 8 four times, and a 9 twice. We can calculate a 95% confidence interval in which we are 95% confident the "true" value of the rating exists. In R, this would look like:

x <- c(7, 7, 8, 8, 8, 8, 9, 9)
xbar <- mean(x)
xdev <- sd(x)
xse <- xdev / sqrt(length(x))
xbar + 1.96 * xse
xbar - 1.96 * xse
This gives us a range from 7.47 to 8.52. (We may bootstrap the standard error when we're worried that the distribution of responses may cause the equation above to give us a biased estimate of the standard error. Bootstrapping is the process of resampling from the sample repeatedly to gain a better estimate of the standard error.) Presumably, the second recruiter would also rate the resume, and we would be able to examine whether their confidence range overlapped with the confidence range of the first recruiter.

With this in mind, the goal of this study is essentially to see whether or not the machine ratings are equivalent "enough" to the human ratings - in other words, do the machine ratings fall within a reasonable range of error from the true mean? To do that, we need a confidence interval for each resume. We could build one via two mechanisms: (1) resampling every resume a few times for the reviewers to rate, or (2) assuming that the error in any rating does not vary from resume to resume, and using just a couple of resumes to construct a standard error we'd apply to every resume.

Rating Technique

Instead of the recruiters rating the individual resumes on a scale of 1 to 10, I suggest the recruiters answer questions about the resumes, which map to scores. This method will simultaneously answer why the recruiter is or is not interested in the candidate, and will presumably reduce the intra-variance.

Figure 1. Survey Questions
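The bootstrap mentioned above can be sketched on the same 8-rating example (again, our illustration, not the study's code): resample the ratings with replacement many times, and use the standard deviation of the resampled means as the standard error estimate.

```r
set.seed(1)  # arbitrary seed, for reproducibility
x <- c(7, 7, 8, 8, 8, 8, 9, 9)

# 5,000 bootstrap replicates: resample the 8 ratings with replacement, take the mean
boot_means <- replicate(5000, mean(sample(x, length(x), replace = TRUE)))
boot_se <- sd(boot_means)
boot_se  # in the neighborhood of the analytic SE of ~0.27 computed above
```

The bootstrap estimate comes in slightly under the analytic formula here (it resamples from the empirical distribution, which has no n-1 correction), but it is useful precisely when we distrust that formula.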
Dataset

Data collected from the experiment will conform to the following matrix. For each employer (EmployerID), we will serve the recruiters a randomized set of resumes (ResumeID). The recruiters will then score each resume (R1-Score and R2-Score). A subset of resumes will then be resampled and served again (TrialNo), whereby the recruiters will rate the resumes again.

● EmployerID - The hiring company
● ResumeID - The applicant
● TrialNo - The number of the trial for which we are gathering ratings. This key differentiates the duplicate ratings for the same resume.
● R1-Score - Recruiter 1's score
● R2-Score - Recruiter 2's score
● M-Score - Machine's score

Output

The output then tells us the range of expected (rank) values for each applicant for each job. We will know, for example, that 95% of the scores for John Doe working at Microsoft fall between, say, 7 and 9 in the control group, versus 6.5 in the experiment group (no range, because there is no resampling). From the data collected during the study, we calibrate the machine-learning algorithm to capture a score closer to 6.5.

Important Caveat - Memory Effect and Repeatedly Sampled Resume Obfuscation

We need to cleverly mask the profiles so that the recruiters will not recognize the same resume served multiple times. This ‘memory’ effect can bias the experiment. There are several ways to mitigate it, which include:

1. Remove all names, phone numbers, addresses, and other uniquely identifiable attributes from each resume.

2. Artificially expand the pool of applicant resumes for each job. This further obfuscates the fact that each recruiter will repeatedly view and rate a subset of resampled resumes. We can easily expand the pool of applicant resumes either by using existing resumes from the general repository (i.e., resumes that don't necessarily match the job in question) or by creating ‘dummy resumes’.
Of course, we would have to scratch the dummy resumes from the final dataset from which we determine statistical significance; their purpose is strictly to mitigate the memory effect.

3. Finally, we can abstract information from each profile to create a more homogeneous applicant pool. For example, applicant-A and applicant-B might each have worked at Microsoft - the former during the years 1992-1997, the latter during the years 2000-2005. In both cases, the applicants worked for Microsoft for 5 years. For obfuscation purposes, we can abstract the dates into a common attribute (Years_Worked).

4. Between resamplings, we should randomize the order in which the resumes are rated. This will further reduce the chance of the recruiters ‘catching on’ that resumes are being resampled.
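The order-randomization step above can be sketched as follows. This is a hypothetical illustration, not the production serving logic: shuffle the fresh resumes, then splice the resampled duplicates in at pseudo-random positions so their spacing is not predictable.

```r
set.seed(7)  # arbitrary seed, for reproducibility
fresh <- paste0("resume_", 1:40)           # 40 fresh resumes
resampled <- rep(paste0("dup_", 1:2), 20)  # P = 2 resumes, Q = 20 repeats each

serve_order <- sample(fresh)  # randomize the fresh resume order
for (d in sample(resampled)) {
  pos <- sample(length(serve_order) + 1, 1)      # random insertion point
  serve_order <- append(serve_order, d, after = pos - 1)
}
length(serve_order)  # 80 servings in total (40 fresh + 40 duplicates)
```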
R-Code

#Path.To
#Pseudo-Analysis: Resumes
#9.20.12
#Requires: None
#Output: Dataset indicating whether machine ratings match a reasonable range of human ratings

#EmployerID - The hiring company
#ResumeID - The applicant
#TrialNo - The number of the trial for which we are gathering ratings.
#          This key differentiates the duplicate ratings for the same resume.
#R1-Score - Recruiter 1's score
#R2-Score - Recruiter 2's score
#M-Score - Machine's score

#Create fake data
#Assume 10 positions, 100 resumes
#3 resumes (1, 2, and 3) resampled 20 times
#employers creates an array of 1-10, count by 1
#160 = number of times we sample from employers
employers <- seq(1, 10, 1)
EmployerID <- sample(employers, 160, replace=T)
resumeID <- c(seq(1, 100, 1), sample(1, 20, replace=T), sample(2, 20, replace=T), sample(3, 20, replace=T))
TrialNo <- c(sample(1, 100, replace=T), seq(2, 21, 1), seq(2, 21, 1), seq(2, 21, 1))
R1Score <- round(rnorm(160, 5, 2))
R2Score <- round(rnorm(160, 5, 2))
MScore <- round(rnorm(160, 5, 2))
data <- as.data.frame(cbind(EmployerID, resumeID, TrialNo, R1Score, R2Score, MScore))

#To look at variance within Rater 1's ratings of resumes 1, 2, and 3
#Get all Reviewer 1's ratings of resume #1, #2, and #3
rat1res1 <- c(data$R1Score[1], data$R1Score[101:120])
rat1res2 <- c(data$R1Score[2], data$R1Score[121:140])
rat1res3 <- c(data$R1Score[3], data$R1Score[141:160])
#Get all Reviewer 2's ratings of resume #1, #2, and #3
#(the first element must come from R2Score; an earlier draft mistakenly used R1Score here)
rat2res1 <- c(data$R2Score[1], data$R2Score[101:120])
rat2res2 <- c(data$R2Score[2], data$R2Score[121:140])
rat2res3 <- c(data$R2Score[3], data$R2Score[141:160])

#How much are individual recruiters' ratings varying on individual resumes?
#Lower SDs indicate more reliable ratings
sd(rat1res1)
sd(rat2res1)
sd(rat1res2)
sd(rat2res2)
sd(rat1res3)
sd(rat2res3)

#A simple measure: are the reviewers rating the resumes roughly similarly?
#Lower mean differences indicate more reliable ratings
mean(rat1res1) - mean(rat2res1)
mean(rat1res2) - mean(rat2res2)
mean(rat1res3) - mean(rat2res3)

#Assuming we're content with the numbers we see above, we move on.
#What is the standard error for each rating?
se1 <- sd(rat1res1)/sqrt(length(rat1res1))
se2 <- sd(rat1res2)/sqrt(length(rat1res2))
se3 <- sd(rat1res3)/sqrt(length(rat1res3))
se4 <- sd(rat2res1)/sqrt(length(rat2res1))
se5 <- sd(rat2res2)/sqrt(length(rat2res2))
se6 <- sd(rat2res3)/sqrt(length(rat2res3))

#Create mean standard error and mean standard deviation
semean <- mean(c(se1, se2, se3, se4, se5, se6))
sdmean <- mean(c(sd(rat1res1), sd(rat2res1), sd(rat1res2), sd(rat2res2), sd(rat1res3), sd(rat2res3)))

#Is the machine rating within the estimated 95% CI for each number?
#Create lower and upper bounds around each rater, using the SD
data$lb1 <- data$R1Score - (1.96*sdmean)
data$ub1 <- data$R1Score + (1.96*sdmean)
data$lb2 <- data$R2Score - (1.96*sdmean)
data$ub2 <- data$R2Score + (1.96*sdmean)

#Is the machine score within this range?
match1 <- c()
match2 <- c()
for (i in 1:100) {
  if (data$MScore[i] > data$lb1[i] & data$MScore[i] < data$ub1[i]) {
    match1[i] <- "Match"
  } else {
    match1[i] <- "Outside Bounds"
  }
  if (data$MScore[i] > data$lb2[i] & data$MScore[i] < data$ub2[i]) {
    match2[i] <- "Match"
  } else {
    match2[i] <- "Outside Bounds"
  }
}

#Combine data and write out to csv file
dataexamine <- as.data.frame(cbind(data[1:100,], match1, match2))
write.csv(dataexamine, file="~/Documents/Documents/Work/Amir/RecuiterStudy/data.csv")

#Using intercoder reliability, how reliable are all coders, taken together?
library(irr)
robinson(data[,4:6])
#How about the two recruiters?
robinson(data[,4:5])
#The first recruiter and the machine?
robinson(cbind(data[,4], data[,6]))
#The second recruiter and the machine?
robinson(data[,5:6])

Lines 18-26 just create data (and concatenate it) that will look roughly like the data output from the experiment.
Here, we assume 10 employers (lines 18-19) and 100 resumes (line 20), 3 of which we resample an additional 20 times (line 21 identifies these values), and we assign (for the time being) arbitrary recruiter-A, recruiter-B, and machine scores to the resumes (lines 22-24). Line 26 concatenates everything into a data set.
Lines 30-32 combine all of recruiter-A's ratings for the resampled resumes 1, 2, and 3; lines 35-37 combine all of recruiter-B's ratings for the resampled resumes 1, 2, and 3. Then, lines 40-45 find the standard deviation of all ratings for both raters on the resampled resumes. Smaller standard deviations indicate more reliable ratings from the human raters (a good thing, and, assuming we have reliability there, something that will allow us to justify applying one aggregate SD to all resume ratings). Lines 48-50 examine, in a simple fashion, how recruiter-A's ratings differ from recruiter-B's, on average. Numbers closer to zero indicate higher levels of reliability between raters. Lines 54-59 create standard errors for all resampled resumes and raters, and lines 62 and 63 create the average standard deviation and average standard error across all resampled resumes.

Lines 67-71 create upper and lower bounds (95% bounds, to be precise) for the ratings of both recruiter-A and recruiter-B. We use the standard deviation here, rather than the standard error, because we cannot take the standard error from a sample of 21 (the sample we have for each resampled resume) and apply it to the single ratings of the non-resampled resumes. In essence, the standard error estimates the variation in means from repeated samples, not the variation in raw ratings themselves. To see whether the machine rating matches raw human ratings, we want to look at the variation in human ratings, not the variation in means of samples of human ratings.

Lines 74-79 create two new columns at the end of the dataset: one indicating whether the machine rating fell within the 95% bounds for recruiter-A, and one indicating whether it fell within the 95% bounds for recruiter-B. If the machine is within bounds, the word "Match" is recorded in the column; if it is outside of bounds, the words "Outside Bounds" are recorded.
This allows us to examine easily, and visually, where the machine matches and where it doesn't. Lines 82 and 83 combine all the data and record the dataset to a csv file at a specified location.
Preliminary Results - Dataset from Oct 29th

Results from an initial set of experimental data, which include recruiter qualitative responses, were coded, aggregated, and evaluated as follows. The qualitative responses with which the recruiters rated resumes were coded on a linear scale from 0 to 100, where a coded rating of 0 signifies a bad match and a coded rating of 100 signifies a perfect match:

Response       Coded Value
Perfectly      100
Very Much      75
Acceptable     50
Unacceptable   25
Not at All     0

Each recruiter rates an applicant resume against a corresponding job posting, selecting a “fit” for each of nine Question IDs. Thus, a resume that scores perfectly on every Question ID has a possible 900 points (100 points per category), which are then weighted and adjusted such that the total points d = 36 * tan(1.94 + (x / 68)) + 90, where x equals the sum of values for each Question ID adjusted by the machine-selected weights, fall within the range 0-99.

Question ID          Weight
Category             0.1
Core Skills          0.23
Education            0.03
Experience Level     0.04
General Similarity   0.18
JobExperience        0.08
Recommend            NULL
Skills Topics        0.05
User Preference      0.1

Resampled resumes, then, receive multiple ratings from both recruiters (e.g., Recruiter A might rate Resume-A 5 times, each time evaluating the resume slightly differently, while Recruiter B might rate Resume-A 7 times, each time evaluating the resume slightly differently). The scores for Resume-A are then aggregated, and we measure both the intra- and extra-variances.

Descriptive Statistics

Average Machine Rating: 75.52
Average Human Rating: 50.69
Correlation Coefficient (between human and machine): 0.312

Among the resampled resumes, most received different ratings across draws, but three were rated identically each time they were rated (all three by the same recruiter).

Average Standard Deviation Across All Resampled Resumes: 7.16

Interpretation: We can expect 95% of all human ratings to fall within 14.32 points on either side of the mean rating. Because the average human rating is lower than the average machine rating, most of the machine scores and human scores are not matches - the machine scores are simply too high. (Only 13.7% of the machine scores are within the human range.)

Margin of Error Caveat: If we generously boost the human scores by 25 points, making the average human score and average machine score the same, then 84.1% of the machine scores are within two standard deviations of the human score.

Statistical Significance: Using any sort of statistical significance test at this point is meaningless; the sample size is too small. With more data, we can potentially estimate our variances and assign a measure of statistical significance to our findings.

Additional Discussion Points

● Need to address the coding rules
● Recruiters had zero perfect fits
• 15.
● Machine ratings are clustered above 80 points and below 70 points; this does not reflect human scores
● Human ratings are more continuous

Link to Aggregated Dataset

Stata Code - For Massaging Data

insheet using "/Users/tarpus/Desktop/initialdata2.0.csv", comma
*Drop all the glenns
drop if user_id==1
*Sort on profile_id while keeping the rest of the order stable (so, don't re-order resampled resumes OR randomly re-order the order of the ratings)
sort created_at, stable
sort profile_id, stable
encode job_id, gen(job_id_numeric)
encode profile_id, gen(profile_id_numeric)
*drop if profile_id_numeric==75 && user_id==6
drop if profile_id_numeric==75
drop if profile_id_numeric==34
destring points, gen(pointsnum) i("NA")
destring weight, gen(weightnum) i("NA")
destring points, gen(scorenum) i("NA")
encode answer, gen(answernum)
recode answernum 4=75 1=50 3=25 2=0, gen(hscore)
gen hpoints=hscore*weight
outsheet job_id_numeric profile_id_numeric user_id points hpoints created_at using "/Users/tarpus/Desktop/initialdata2.0clean.csv", comma replace nolabel
*39377014fc19545b4675b7a1b165566e02e4b34
*9f93a3145210aa734fb37fca4618a9a46b38403
clear
insheet using "/Users/tarpus/Desktop/initialdata2.0IDed.csv", comma
destring points, gen(pointsnum) i("NA")
destring weight, gen(weightnum) i("NA")
destring points, gen(scorenum) i("NA")
collapse

R Code - For Data Analysis

rawdata<-read.csv("/Users/tarpus/Desktop/initialdata2.0clean.csv")
rawdata$uniqueID<-rep(1:(length(rawdata$points)/9), each=9)
summary(rawdata$job_id_numeric)
summary(rawdata$profile_id_numeric)
adata<-aggregate(rawdata, by=list(rawdata$uniqueID), FUN=mean, na.rm=TRUE)
adata$comprating<-36*(tan(1.94+((adata$points*8)/68)))+90
adata$humanrating<-36*(tan(1.94+((adata$hpoints*8)/68)))+90
summary(adata$comprating)
summary(adata$humanrating)
library(lattice)
xyplot(adata$comprating~adata$humanrating)
cor(adata$comprating,adata$humanrating)
#find re-samples
ob<-table(adata$profile_id_numeric); ob-1
#Average resume rating:
mean(adata$comprating)
mean(adata$humanrating)
sd(adata$comprating)
sd(adata$humanrating)
write.csv(adata, file="/Users/tarpus/Desktop/aggregateddata.csv")
• 16.
resamples<-read.csv("/Users/tarpus/Desktop/aggregateddataresamples.csv")
mean(c(resamples$humanrating[1], resamples$humanrating[2]))
mean(c(resamples$humanrating[2], resamples$humanrating[3]))
mean(c(resamples$humanrating[5], resamples$humanrating[6]))
mean(c(resamples$humanrating[7], resamples$humanrating[8], resamples$humanrating[9]))
mean(c(resamples$humanrating[10], resamples$humanrating[11]))
mean(c(resamples$humanrating[12], resamples$humanrating[13]))
mean(c(resamples$humanrating[14], resamples$humanrating[15]))
mean(c(resamples$humanrating[16], resamples$humanrating[17]))
mean(c(resamples$humanrating[18], resamples$humanrating[19]))
mean(c(resamples$humanrating[20], resamples$humanrating[21]))
mean(c(resamples$humanrating[22], resamples$humanrating[23]))
mean(c(resamples$humanrating[24], resamples$humanrating[25]))
mean(c(resamples$humanrating[26], resamples$humanrating[27], resamples$humanrating[28]))
mean(c(resamples$humanrating[29], resamples$humanrating[30]))
mean(c(resamples$humanrating[31], resamples$humanrating[32], resamples$humanrating[33]))
sds<-c(
  sd(c(resamples$humanrating[1], resamples$humanrating[2])),
  sd(c(resamples$humanrating[2], resamples$humanrating[3])),
  sd(c(resamples$humanrating[5], resamples$humanrating[6])),
  sd(c(resamples$humanrating[7], resamples$humanrating[8], resamples$humanrating[9])),
  sd(c(resamples$humanrating[10], resamples$humanrating[11])),
  sd(c(resamples$humanrating[12], resamples$humanrating[13])),
  sd(c(resamples$humanrating[14], resamples$humanrating[15])),
  sd(c(resamples$humanrating[16], resamples$humanrating[17])),
  sd(c(resamples$humanrating[18], resamples$humanrating[19])),
  sd(c(resamples$humanrating[20], resamples$humanrating[21])),
  sd(c(resamples$humanrating[22], resamples$humanrating[23])),
  sd(c(resamples$humanrating[24], resamples$humanrating[25])),
  sd(c(resamples$humanrating[26], resamples$humanrating[27], resamples$humanrating[28])),
  sd(c(resamples$humanrating[29], resamples$humanrating[30])),
  sd(c(resamples$humanrating[31], resamples$humanrating[32], resamples$humanrating[33]))
)
sdmean<-mean(sds)
adata$lb1<-(adata$humanrating-(1.96*sdmean))
adata$ub1<-(adata$humanrating+(1.96*sdmean))
#many are outside match, b/c of different scale, so standardize ratings (add ~25 to human ratings)
adata$lb2<-(adata$humanrating+(mean(adata$comprating)-mean(adata$humanrating))-(1.96*sdmean))
adata$ub2<-(adata$humanrating+(mean(adata$comprating)-mean(adata$humanrating))+(1.96*sdmean))
#Is the machine code within this range?
match1<-c()
match2<-c()
for (i in 1:length(adata$comprating)) {
  if (adata$comprating[i]>adata$lb1[i] & adata$comprating[i]<adata$ub1[i]) {match1[i]<-"Match"} else {match1[i]<-"Outside Bounds"}
  if (adata$comprating[i]>adata$lb2[i] & adata$comprating[i]<adata$ub2[i]) {match2[i]<-"Match"} else {match2[i]<-"Outside Bounds"}
}
match1<-c()
match2<-c()
for (i in 1:length(adata$comprating)) {
  if (adata$comprating[i]>adata$lb1[i] & adata$comprating[i]<adata$ub1[i]) {match1[i]<-1} else {match1[i]<-0}
  if (adata$comprating[i]>adata$lb2[i] & adata$comprating[i]<adata$ub2[i]) {match2[i]<-1} else {match2[i]<-0}
}
mean(match1)
mean(match2)
• 17.
#Combine data and write out to csv file
dataexamine<-as.data.frame(cbind(adata,match1, match2))
write.csv(dataexamine, file="/Users/tarpus/Desktop/dataexamine.csv")
#different ratings, different raters: 10, 22, 77, 80, 88, 92, 93, 105, 112
#different ratings, same rater: 41, 66, 109
#same rating, different raters:

Secondary Results - Dataset from Oct 29th

Correlations between:
weighted human rating and human recommendation: .83
human recommendation and weighted machine rec #1: .20
human recommendation and weighted machine rec #2: .27
weighted human rating and weighted machine rating: .28

Scatterplots showing the relationships between the same measures.

Conclusion

Since we are limited in the number of recruiters, we examine the recruiters' final recommendations and their category ratings to determine, very quickly, whether the recruiters' ratings are trustworthy. That is, we asked each recruiter to rate each resume in two ways: first, on each of 8 categories, on a five-point scale (1-5), as to how well the resume fits the job on that category. Next, the recruiter gives each resume an overall recommendation - on the same 1-5 scale - as to how well the applicant's resume "fits" the job. It would be very questionable if a recruiter rated a resume highly on the individual categories, only to rate the same resume low on the overall recommendation (i.e., not recommend the applicant). Our sanity check correlates the human overall recommendations with the individual human category ratings, after walking the category ratings through the machine concatenation and weighting procedure. While not a perfect fit, the two ratings correlate very closely (.83), signifying that (1) our recruiters rated resumes reasonably, and (2) broadly speaking, the tangent function and the machine weighting scheme are reasonable.

Human vs. Machine at a Glance

Similarly, we compare the human recommendations against the two iterations of machine ratings, and then, finally, the human ratings against the machine ratings. In each case, the correlations were very weak (between .21 and .31). This disconnect likely results from (1) the machine rating resumes significantly differently than the human recruiters across each evaluation category, and (2) the weighting system that turns individual machine ratings into a single judgment introducing extra noise, because the weights do not match the way humans naturally rate the resumes.
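The second point - that weights mismatched to the implicit human weights degrade the aggregate correlation even when the underlying category scores are informative - can be illustrated with a toy simulation. Every number below is invented purely for illustration; only the qualitative pattern is the point.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy per-category scores for 200 resumes in two hypothetical categories.
a = rng.uniform(0, 100, 200)
b = rng.uniform(0, 100, 200)

# Suppose the human recruiter implicitly weights category a far more than b.
human_overall = 0.9 * a + 0.1 * b + rng.normal(0, 5, 200)

matched = 0.9 * a + 0.1 * b      # aggregate built with human-like weights
mismatched = 0.1 * a + 0.9 * b   # same category scores, inverted weights

r_matched = np.corrcoef(matched, human_overall)[0, 1]
r_mismatched = np.corrcoef(mismatched, human_overall)[0, 1]
```

Both aggregates are built from exactly the same category scores, yet the mismatched weighting correlates far more weakly with the simulated human judgment - the same mechanism proposed above for the weak machine-human correlations.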
• 18. Relative Value (Weights)

To discover how humans "naturally" rate the resumes, we can leverage the fact that the recruiters provide both individual category ratings and a single final rating. Thus, we create a linear model that predicts the final human recommendation from the individual category human ratings. The resulting coefficients (and statistical significances) of the independent variables elucidate how our human coders weight the resumes. We use both an Ordinary Least Squares (OLS) regression and an ordinal logistic regression to model the factor weights that best predict the outcome of a resume being recommended for a job. The ordinal logit, critically, does not assume a continuous scale - rather than predicting a number, the model predicts the probability (the log of the odds ratio, precisely) that a given set of covariates falls into one category or another. Technically speaking, this makes it the superior model, as it fits our data more appropriately. However, the results from the ordinal logistic model generally match the OLS estimates, substantively speaking, indicating that the continuous-scale assumption of the OLS model is not badly violated. Accordingly, and because OLS is more straightforward, we focus on the OLS model for most of this write-up. NOTE: When we attempt to operationalize the weights and, thus, calibrate the machine scores, it will make more sense to use the OLS model than the logistic model.

Ordinary Least Squares Model

The OLS model suggests that human coders weight "general similarity" and "job experience" far more than any other categories. These two variables, accordingly, predict about 85% of the variation in the human raters' overall evaluation of the candidates. (This number is confirmed when we re-run the estimation with just those two variables and none of the others - the R-squared remains at approximately .85, even with the other variables excluded.)
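The implicit-weight recovery described above can be sketched in Python as a stand-in for the R estimation. The data here are simulated: three hypothetical category ratings on a 1-5 scale, and an overall recommendation constructed mostly from the first two, mimicking the pattern the write-up reports.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated recruiter data: per-category ratings (1-5) and an overall
# recommendation driven mostly by the first two categories plus noise.
n = 355
X = rng.integers(1, 6, size=(n, 3)).astype(float)
overall = 0.45 * X[:, 0] + 0.45 * X[:, 1] + 0.02 * X[:, 2] + rng.normal(0, 0.1, n)

# OLS via least squares: the fitted coefficients estimate the implicit
# weight the rater places on each category.
design = np.column_stack([np.ones(n), X])  # prepend an intercept column
beta, *_ = np.linalg.lstsq(design, overall, rcond=None)
implicit_weights = beta[1:]  # should land near the true (0.45, 0.45, 0.02)
```

With enough ratings, the recovered coefficients sit close to the weights used to generate the overall scores, which is exactly the logic behind reading the OLS coefficients as the recruiters' implicit weights.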
The OLS suggests a very different weighting system than that of the machine-generated scores: namely, it (loosely) suggests a system in which 80%-90% of the machine's judgment would be driven in equal parts by general similarity and job experience, with the rest driven by the other variables. (So, we could place a weight of .45 on general similarity, a weight of .45 on job experience, and a weight of .02 on everything else; or a weight of .4 on the first two and a weight of .04 on everything else.)

Coefficients: (1 not defined because of singularities)
                             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                   0.65630     1.08808    0.603   0.54679
user_preference_h_scale       0.01686     0.02368    0.712   0.47697
general_similarity_h_scale    0.40938     0.04936    8.293   2.46e-15 ***
skills_topics_h_scale              NA          NA       NA        NA
job_xp_h_scale                0.39072     0.04822    8.103   9.26e-15 ***
education_h_scale            -0.02619     0.02535   -1.033   0.30222
core_skills_h_scale          -0.01473     0.02636   -0.559   0.57670
experience_level_h_scale      0.14142     0.04828    2.929   0.00363 **
category_h_scale              0.01649     0.02487    0.663   0.50778
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.629 on 347 degrees of freedom
• 19. Multiple R-squared: 0.8494, Adjusted R-squared: 0.8463
F-statistic: 279.5 on 7 and 347 DF, p-value: < 2.2e-16

Constructing a Set of Predictive Models

To confirm the assumptions made by the OLS model, we perform a secondary analysis using an ordinal logistic regression model. In this model, we predict the [log] likelihood that the overall recommendation an individual human gives a resume falls into one of the five ratings (Perfect Fit, Very Much a Fit, Acceptable Fit, Unacceptable Fit, and Not a Fit at All), given the ratings that human assigned on the individual job-fit categories. We employ a logistic model - specifically, a special case of the logistic model: an ordinal logistic regression, or proportional odds logistic regression (POLR, for short). The ordinal logistic model factors the various rating components (e.g., job experience, core skills, etc.) to determine, and assign, a stochastically derived score for each applicant resume; individual resumes are scored against a specific job, i.e., the logistic score predicts the likelihood that a resume matches a prescribed job posting.

Coefficients:
                           Value  Std. Error  t value
user_preference_h_glm     0.4053      0.5436   0.7455
general_similarity_h_glm  4.0417      0.8831   4.5767 ***
job_xp_h_glm              4.1715      0.8370   4.9839 ***
education_h_glm          -0.2695      0.5020  -0.5369
core_skills_h_glm         0.1567      0.5874   0.2667
experience_level_h_glm    0.3481      0.7979   0.4362
category_h_glm            0.6937      0.5441   1.2750

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As the model results above show, the proportional odds ordinal logistic regression largely confirms the results of the OLS model: general similarity and job experience are, in general, the two most important predictors of human overall job recommendations.
The only difference in the ordinal logistic model, in fact, is that it makes an even stronger case for the (similar) importance of these two variables: not only do their two coefficients (nearly) match again, but in this model experience level is no longer statistically significant. This underscores, again, that a weighting system that weights general similarity and job experience heavily, while weighting the other categories only lightly, would match the apparent implicit human weights more closely.

Model Score Performance

Using insights gleaned from the OLS and ordinal logit models, we construct a new, calibrated scoring function in which, rather than the machine-generated weights, we use the human weights in the construction of the final score. In short, any computer resume rating has two components:

1. The rating score assigned to individual categories for the resume (general similarity, job experience, experience level, etc.)
• 20.
2. The weight assigned to that category by the machine system.

The individual ratings, which the machine gives on a zero to one hundred scale, are multiplied by their weights and then summed. So, for instance, if the machine used two rating categories, boogie and woogie, weighted those categories .4 and .6, and scored them 50 and 75, respectively, the pre-normalization machine rating would be:

Rating = .4*(boogie) + .6*(woogie)
Rating = .4*50 + .6*75
Rating = 65

That 65 would then be walked through the normalizing tangent function to arrive at the final resume rating. Accordingly, if the machine ratings appear to be off from the human ratings, there are only two sources of error: the weighting system and the machine rating system itself. Given the low correlation between human ratings and machine ratings, our first attempt was to alter the machine weighting system to match the implicit human weighting system more closely. We did this by using the suggested weights from the OLS model - approximately, a weight of .45 on general similarity, a weight of .45 on job experience, and a weight of .02 on everything else. However, using this new weighting system, the performance of the machine did not increase - in fact, it decreased. The correlations between machine rankings and human rankings, using the implied human weights, lie between .08 and .22. (Before re-weighting, the correlations had been better - between .21 and .31.) Slightly changing the weights in small ways (say, a weight of .4 on job experience and general similarity, and a weight of .04 on everything else) does not significantly improve the correlations.

Individual Category Rating Comparison

Given this, we decided to examine correlations between individual human category ratings and individual machine category ratings.
The answers were enlightening - on many categories, the machine ratings and human ratings were not even weakly correlated, with individual category correlations falling between -.10 and .46, and only three of the eight category ratings correlating at a strength above .10 (all reported correlations use the new machine point scores):

User preference: 0.46
General similarity: 0.06
Skills Topics: -0.10
Job Experience: 0.20
Education: N/A
Core Skills: 0.32
Experience Level: N/A
Category: 0.10

Given this data, two conclusions came to light. The first conclusion was that, perhaps, we could
• 21. improve the fit of the machine final rating and the human final rating if we re-weighted both to focus more on the areas where the human and machine agreed. Doing this - weighting user preference, job experience, and core skills higher than all the rest (namely, a weighting system of .4 for user preference, .3 for core skills, .2 for job experience, and .05 each for category and general similarity) - again gives unsatisfactory results, not significantly improving on the prior weightings. Correlations between the final machine and human rankings, with this re-weighting, range from .22 to .34 - an increase over the original weights, but only a marginal one.

Points System

This brings us to the second point raised by the low correlations between the individual machine and human point scores: largely, the machine is just not good at matching human scores, nor is the machine good at rating resumes in the first place. For instance, in the above correlations, no correlation is reported for either education or experience level, because the machine failed to produce any variance in these categories at all - for experience level, the machine gave every single resume a score of 100 (indicating a perfect experience level), and for education, the machine gave every single resume a score of 0 (indicating that the resume shows no apparent education at all). Of the remaining categories, in only three of the eight (user preference, job experience, and core skills) does the machine do better than random chance at predicting how the human will rate the resume, and in only one of the two categories (job experience) that the humans implicitly deem important (as revealed by the OLS and ordinal logit models).
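Why the zero-variance categories yield no correlation at all can be seen in a short sketch. The arrays here are invented; only the constant-column behavior mirrors what the machine actually produced for education and experience level.

```python
import numpy as np

rng = np.random.default_rng(2)

human = {
    "job_xp": rng.uniform(0, 100, 30),
    "education": rng.uniform(0, 100, 30),
}
machine = {
    # Correlated, noisy machine scores for job experience...
    "job_xp": 0.5 * human["job_xp"] + rng.normal(0, 10, 30),
    # ...but a constant score for education, as the machine actually produced.
    "education": np.zeros(30),
}

cors = {}
for cat in human:
    if np.std(machine[cat]) == 0:
        cors[cat] = None  # correlation undefined: the machine column never varies
    else:
        cors[cat] = float(np.corrcoef(human[cat], machine[cat])[0, 1])
```

A Pearson correlation divides by both standard deviations, so a category the machine scores identically for every resume has no defined correlation with the human ratings - which is why those cells report N/A above.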
This lack of correspondence between the machine and human scores, combined with the fact that the machine fails a basic "sanity test" on certain categories (education and experience level) by not differentiating among resumes at all, suggests that it is the category scores, not the weights, that prevent the machine from properly matching the human ratings.

References

http://mathworld.wolfram.com/BootstrapMethods.html
http://statistics.stanford.edu/~ckirby/techreports/BIO/BIO%20101.pdf
http://www.methodsconsultants.com/tutorials/bootstrap1.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html