HR Analytics: Why are our best and most experienced employees leaving prematurely?

Erik Bebernes
Introduction

This project uses a dataset I found on Kaggle, where a company has been experiencing difficulty retaining its best and most experienced employees. The data frame consists of 15,000 observations of 10 variables, which are:
names(hr)
 [1] "satisfaction_level"    "last_evaluation"       "number_project"
 [4] "average_montly_hours"  "time_spend_company"    "Work_accident"
 [7] "left"                  "promotion_last_5years" "sales"
[10] "salary"
Satisfaction Level – the employee's overall job satisfaction level, based on a survey
Last Evaluation – the employee's performance score, given by their manager
Number of projects – how many projects the employee has been involved in
Average monthly hours – mean hours worked by the employee per month
Time spend company – years the employee has worked for the company
Work accident – binary variable; 1 indicates the employee has had an accident in the workplace
Left – binary variable; 1 indicates the employee has left, 0 indicates the employee is still at the company
Promotion last 5 years – binary variable signaling whether the employee has been promoted in the last five years
Sales – categorical variable for job type (department)
Salary – categorical variable (low, medium, high) for how much the employee is paid annually
My approach to this project can be summarized in the following steps:

1.) Clean and structure the dataset, including imputing missing values if necessary
2.) Create subsets of the best employees who left and who stayed
3.) Create discrete factor variables and perform an association rules analysis
4.) Classify employees through a decision tree analysis
5.) Find any significant correlations, and differences in correlations between the subsets
6.) Perform exploratory visualization analysis in an attempt to explain any discrepancies in correlations
7.) Run a random forest algorithm, as well as a logistic regression, to confirm significant relationships between the variables
8.) Provide conclusions and recommendations for management
HR_comma_sep <- read.csv("~/Downloads/HR_comma_sep.csv", header = TRUE)
View(HR_comma_sep)
hr <- HR_comma_sep
Cleaning and structuring the dataset

At first glance the dataset seems clean, but to make sure, I'm going to use the "Amelia" package to identify any missingness.
library(Amelia)
missmap(hr)
This shows that there is no missing data.
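As a quick cross-check on the missingness map (this step isn't part of the original analysis), base R can count missing values directly:

# count NAs in each column; every entry should be zero
colSums(is.na(hr))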
> str(hr)
'data.frame':   14999 obs. of  10 variables:
 $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
 $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
 $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
 $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
 $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
 $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
 $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
Subsets

hrbestleft <- hr[which(hr$last_evaluation > .72 & hr$left == 1),]  # employees with high evaluations who left the company
hrbeststay <- hr[which(hr$last_evaluation > .72 & hr$left == 0),]  # employees with high evaluations who stayed at the company
Creating Discrete Variables and Association Rules Analysis
# tertile cutoffs for average monthly hours
q33_hours <- quantile(hr$average_montly_hours, .33)
q67_hours <- quantile(hr$average_montly_hours, .67)
hr$Hours_Discrete[hr$average_montly_hours <= q33_hours] <- 'low'
hr$Hours_Discrete[hr$average_montly_hours > q33_hours & hr$average_montly_hours < q67_hours] <- 'average'
hr$Hours_Discrete[hr$average_montly_hours >= q67_hours] <- 'high'

# tertile cutoffs for satisfaction level
q33_sat <- quantile(hr$satisfaction_level, .33)
q67_sat <- quantile(hr$satisfaction_level, .67)
quantile(hr$satisfaction_level, .8)  # inspected but not used
hr$Sat_Discrete[hr$satisfaction_level <= q33_sat] <- 'low'
hr$Sat_Discrete[hr$satisfaction_level > q33_sat & hr$satisfaction_level < q67_sat] <- 'average'
hr$Sat_Discrete[hr$satisfaction_level >= q67_sat] <- 'high'
library(arules)
hr$Work_accident <- as.factor(hr$Work_accident)
hr$left <- as.factor(hr$left)
hr$promotion_last_5years <- as.factor(hr$promotion_last_5years)
hr$Hours_Discrete <- as.factor(hr$Hours_Discrete)
hr$Sat_Discrete <- as.factor(hr$Sat_Discrete)
names(hr)
hrassoc <- hr[, c(6, 7, 8, 9, 10, 11, 12)]  # keep only the factor columns for the association rules analysis
rules <- apriori(hrassoc, parameter = list(support = .2, confidence = .7))
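I viewed the mined rules with arules' inspect(); a minimal sketch of the call (the exact invocation isn't shown in my script, so the sorting choice is an assumption). The listing below is truncated to rules [23]–[30]:

# print the mined rules with their support, confidence and lift,
# highest-confidence rules first
inspect(sort(rules, by = "confidence"))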
     lhs                                      rhs                 support    confidence lift
[23] {salary=low, Hours_Discrete=high}     => {left=1}            0.5527863  1          1
[24] {salary=low, Hours_Discrete=high}     => {Sat_Discrete=low}  0.5527863  1          1
[25] {Work_accident=0, salary=low}         => {left=1}            0.5816298  1          1
[26] {Work_accident=0, salary=low}         => {Sat_Discrete=low}  0.5816298  1          1
[27] {promotion_last_5years=0, salary=low} => {left=1}            0.6043125  1          1
[28] {promotion_last_5years=0, salary=low} => {Sat_Discrete=low}  0.6043125  1          1
[29] {left=1, salary=low}                  => {Sat_Discrete=low}  0.6082330  1          1
[30] {salary=low, Sat_Discrete=low}        => {left=1}            0.6082330  1          1
Most interesting rules:

1.) Of the people who left, 99% never received a promotion
2.) 95% never had an accident
3.) 60% were low salary
4.) 100% had low job satisfaction
These rules signify a few important relationships between the variables that may explain why some employees are leaving. Of the employees who left, 99% never received a promotion, 95% never had an accident, 60% were low salary, and an astonishing 100% had low job satisfaction. This suggests satisfaction is significant in determining leaving vs. staying. Next I'm going to look at correlations between satisfaction and the numeric variables.
Correlation Analysis

Using all employees in the dataset:
cor(hr[,1:5])
                     satisfaction_level last_evaluation number_project average_montly_hours time_spend_company
satisfaction_level           1.00000000       0.1050212     -0.1429696          -0.02004811         -0.1008661
last_evaluation              0.10502121       1.0000000      0.3493326           0.33974180          0.1315907
number_project              -0.14296959       0.3493326      1.0000000           0.41721063          0.1967859
average_montly_hours        -0.02004811       0.3397418      0.4172106           1.00000000          0.1277549
time_spend_company          -0.10086607       0.1315907      0.1967859           0.12775491          1.0000000
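The correlation plot referred to below isn't reproduced in this document; a minimal sketch of one way to draw it (the corrplot package is my assumption, since the original plotting code isn't shown):

# visualize the correlation matrix of the five numeric variables
library(corrplot)
corrplot(cor(hr[, 1:5]), method = "circle")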
The above plot and output show correlations between the numeric variables for all employees. Managers seem to give higher evaluation scores to employees who work more hours and who have more projects; however, there is a negative correlation between employee satisfaction and number of projects. It should be interesting to see how this compares to correlations using just the best employees.
Correlations using just the best and most experienced employees that left:

> hrbestleft <- hr[which(hr$last_evaluation >= .72 & hr$left == 1),]
> cor(hrbestleft[,1:5])
                     satisfaction_level last_evaluation number_project average_montly_hours time_spend_company
satisfaction_level            1.0000000       0.3611564     -0.7370609           -0.4771749          0.6582700
last_evaluation               0.3611564       1.0000000     -0.2150533           -0.1261519          0.3147566
number_project               -0.7370609      -0.2150533      1.0000000            0.5217016         -0.3644283
average_montly_hours         -0.4771749      -0.1261519      0.5217016            1.0000000         -0.1572702
time_spend_company            0.6582700       0.3147566     -0.3644283           -0.1572702          1.0000000
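Since step 5 of my approach is about differences in correlations between the subsets, here is a quick sketch of computing those differences directly (not part of the original code):

# element-wise difference between the subset and full-sample correlations;
# large absolute values flag relationships that change for the best leavers
round(cor(hrbestleft[, 1:5]) - cor(hr[, 1:5]), 2)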
There are some very notable differences here, including the massive negative correlation between number of projects and satisfaction level and the large negative correlation between average monthly hours and satisfaction level. This probably means that managers are overworking their best employees, which leads to lower satisfaction levels. It's worth looking at the data visually to see if this is in fact the case. I'll also run a decision tree analysis, which may serve as a confirmation.
Interpreting Correlation Differences Visually
Do the best employees work more hours?

Comparing these histograms, it's clear that employees who score higher on manager evaluations are working considerably more hours than the workforce as a whole.
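The histograms themselves aren't reproduced in this document; a minimal sketch of how the comparison could be drawn with ggplot2 (the fill colors and binwidth are my assumptions, since the original plotting code isn't shown):

# overlay monthly hours for the whole workforce and for the best employees who left
library(ggplot2)
ggplot() +
  geom_histogram(data = hr, aes(average_montly_hours), fill = 'blue', alpha = .3, binwidth = 10) +
  geom_histogram(data = hrbestleft, aes(average_montly_hours), fill = 'green', alpha = .3, binwidth = 10) +
  xlab("Average Monthly Hours")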
Do the best employees work on more projects?

Yes, the best employees usually have more projects. There is a downward trend as the number of projects increases when you look at the workforce as a whole, and the opposite can almost be said for the best employees (until you get to 6 projects).
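Again, the plot isn't reproduced here; a sketch of one way to draw the project counts (my assumption, as the original plotting code isn't shown):

# counts by number of projects for the best employees who left;
# swap hrbestleft for hr to see the workforce as a whole
library(ggplot2)
ggplot(hrbestleft, aes(factor(number_project))) +
  geom_bar() +
  xlab("Number of Projects")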
Have the best employees been working at the company for a longer period of time?

Almost all of the best employees have been at the company for at least four years; perhaps this can be related to "learning by doing." It's also a sufficient amount of time to prove to managers that they are high performing. The dataset as a whole shows that there is an abundance of employees who have been there for 2 and 3 years. Let's see if anyone is being promoted.
As you can see above, hardly any of the best performing employees have been promoted in the last five years. In fact, it's only .2%. It must be discouraging to these employees to be highly evaluated and not be rewarded for it.
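That .2% figure can be checked directly; a sketch, assuming it refers to the subset of best employees who left:

# share of the best leavers who were promoted in the last five years
prop.table(table(hrbestleft$promotion_last_5years))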
Next I'm going to look at the relationship between job type and salary. Are there noticeable differences in pay between different departments of the company? And how many employees are in each department?

A couple of things I noticed while looking at this graph are that a majority of the good employees are on the low end of the salary spectrum, and most of them are working in sales, support, and technical roles. However, I made the same graph using the dataset as a whole and didn't see much of a difference, so I'll put these observations aside for now.
As I mentioned earlier during my association rules analysis, satisfaction is most likely significant in determining why the best employees are leaving. The plot below is an attempt to see that relationship visually, where the green density is the subset of the best employees that left, the red density is the best employees that have stayed, and the blue density is the entire dataset.
p1 <- ggplot() +
  geom_density(data = hrbestleft, aes(satisfaction_level), fill = 'green', alpha = .3) +
  geom_density(data = hrbeststay, aes(satisfaction_level), fill = 'red', alpha = .3) +
  geom_density(data = hr, aes(satisfaction_level), fill = 'blue', alpha = .3) +
  theme_light(base_size = 16) +
  xlab("Satisfaction Level") + ylab("") +
  ggtitle("Satisfaction Levels of Subsets")
The green density (the best employees that left) is what really stands out here. Many of those employees have very low satisfaction levels (< .25), then there is a lull, and then another group appears with satisfaction levels greater than .6. It's difficult to say why this might be. Perhaps there is a difference in how the employees interpret satisfaction; it's possible that they still enjoyed their jobs despite being overworked and not being promoted. I think the best way to figure this out is through a decision tree analysis, where those who left will be classified more accurately.

But first, I want to combine average monthly hours and satisfaction into a plot. Since I noticed earlier that the good employees that left were working a lot more hours, there should be a strong relationship between the two.
plot6 <- ggplot(hr, aes(satisfaction_level, average_montly_hours, color = left)) +
  geom_point(alpha = .3) +
  ggtitle("Hours and Satisfaction")
These distributions are very tight, which tells me that the decision tree will be a great addition to my analysis. The blue box must be underperforming employees, those who have not been working many hours and aren't that satisfied, whereas the other two blue distributions, judging by the earlier density plots, are high performing employees.
My next plot is another confirmation of that hypothesis, but this time I'm adding years spent at the company.
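That plot isn't reproduced in this document either; a minimal sketch of one way to fold tenure into the scatter (mapping tenure to point size is my assumption, since the original code isn't shown):

# hours vs. satisfaction, colored by left, with point size showing tenure
library(ggplot2)
ggplot(hr, aes(satisfaction_level, average_montly_hours, color = left, size = time_spend_company)) +
  geom_point(alpha = .3) +
  ggtitle("Hours, Satisfaction and Tenure")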
The cluster on the right has a lot of employees who have been at the company for a long time; I think the lack of promotions may have something to do with them leaving.
Decision Tree Analysis

Decision trees are easiest to interpret when they are small, so in order to get a few simple rules (and to avoid over-fitting the model) I trained the tree on a small sample of the data (with the sampling probabilities below, roughly 6% of the rows land in the training set).
install.packages("party")
library(party)
set.seed(421)
ind <- sample(2, nrow(hr), replace = TRUE, prob = c(0.02, 0.3))
traindata <- hr[ind == 1,]
testdata <- hr[ind == 2,]
form <- left ~ satisfaction_level + average_montly_hours + time_spend_company + last_evaluation
hrtree <- ctree(form, data = traindata, controls = ctree_control(maxsurrogate = 3))
table(predict(hrtree), traindata$left)
plot(hrtree, type = "simple")
print(hrtree)
Using the variables time spent at company, satisfaction, average monthly hours and last evaluation (what I think are the most important variables, based on the visualizations I made), I was able to come up with a few rules that help classify employees into the leaving and staying categories. Here are my key takeaways:
[Decision tree plot (type = "simple"): the root splits on satisfaction_level at 0.46 (p < 0.001), with further splits on time_spend_company (at 2, 4 and 5 years), last_evaluation (at 0.8) and average_montly_hours (at 216); each terminal node reports n and the class proportions y.]

1.) Employees with low satisfaction levels who haven't been at the company long will generally stay.
2.) Employees with low satisfaction levels who have been at the company between 2 and 5 years leave.
3.) Employees with high satisfaction levels who have been working for less than or equal to 4 years stay.
4.) High performing employees with high satisfaction who have been at the company more than 4 years leave when they are working too many hours.
This analysis is 91.5% accurate, which is pretty good considering how simple the tree is. If I were to show management one graph it would be this one: it identifies clear-cut patterns and confirms much of what I had been hypothesizing with my previous analyses.
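The accuracy figure can be recovered from the training confusion table produced above; a quick sketch (this check isn't in the original code):

# training-set accuracy: share of correct predictions on the diagonal of the table
tab <- table(predict(hrtree), traindata$left)
sum(diag(tab)) / sum(tab)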
Random Forest and Logistic Regression
Before offering my final advice to management, I want to see how accurately I can predict who is going to leave. An accurate machine learning algorithm will allow the company to focus on specific employees, perhaps offering them a raise or reducing their hours before they decide to leave.

First I'm going to try a logistic regression, which estimates the probability of a binary dependent variable for each observation. Any probability greater than .5 will mean the employee is predicted to leave. Let's see how it goes:
Logistic Regression:

# creating a test and training set using dplyr
library(dplyr)
set.seed(142)
train <- sample_frac(hr, .7)
sid <- as.numeric(rownames(train))
test <- hr[-sid,]
# fit the logistic model; the fitting step was omitted from the original script,
# so the predictor set (all remaining columns) is an assumption
glmmodel <- glm(left ~ ., data = train, family = binomial)
# type = "response" converts logits to predicted probabilities
fitted.results <- predict(glmmodel, newdata = test, type = "response")
new <- mutate(test, fitted.results)
predicted.to.leave <- filter(new, fitted.results > .5)
predicted.to.stay <- filter(new, fitted.results < .5)
View(predicted.to.stay)
summary(predicted.to.stay$left)
summary(predicted.to.leave$left)
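The accuracy quoted below can be computed by scoring the test set; a sketch (not in the original code):

# test-set accuracy of the 0.5-threshold classifier
predicted <- ifelse(fitted.results > .5, "1", "0")
mean(predicted == test$left)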
The model ended up being only 79.4% accurate, which is okay, but considering the decision tree was 91% accurate, I think I can come up with a better model. Random forest works by averaging the results of many decision trees and can work very well. Let's try that:
randindex <- sample(1:dim(hr)[1])
cutpoint2_3 <- floor(2 * dim(hr)[1] / 3)
traindata <- hr[randindex[1:cutpoint2_3],]
testdata <- hr[randindex[(cutpoint2_3 + 1):dim(hr)[1]],]
library(randomForest)
rfmodel <- randomForest(factor(left) ~ satisfaction_level + number_project + average_montly_hours + time_spend_company + promotion_last_5years + last_evaluation, data = traindata)
plot9 <- plot(rfmodel, ylim = c(0, 0.36))
The false positive and false negative errors are very low, which is a good sign. Let's see how accurate the model is when I try it on a test set.
library(caret)  # confusionMatrix() comes from caret
prediction <- predict(rfmodel, testdata)
confusionMatrix(prediction, testdata$left)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 3786   48
         1   10 1156

               Accuracy : 0.9884
                 95% CI : (0.985, 0.9912)
    No Information Rate : 0.7592
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9679
 Mcnemar's Test P-Value : 1.184e-06

            Sensitivity : 0.9974
            Specificity : 0.9601
         Pos Pred Value : 0.9875
         Neg Pred Value : 0.9914
             Prevalence : 0.7592
         Detection Rate : 0.7572
   Detection Prevalence : 0.7668
      Balanced Accuracy : 0.9787

       'Positive' Class : 0
The model is 98.84% accurate; this will prove very beneficial in identifying employees that are likely to leave in the future. What variables are most important in leaving vs. staying?
importance(rfmodel)
                      MeanDecreaseGini
satisfaction_level         1226.048093
number_project              665.390311
average_montly_hours        536.922188
time_spend_company          664.193153
promotion_last_5years         4.487941
last_evaluation             430.694068
According to the random forest model, satisfaction, number of projects and time spent at the company are the three most significant variables.
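randomForest also ships a plotting helper that shows the same ranking visually; a one-line sketch (not used in the original):

# dot chart of variable importance (MeanDecreaseGini)
varImpPlot(rfmodel)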
Conclusion and Recommendations

I very much enjoyed learning more about this dataset. I performed so many types of analyses because retaining a company's best employees is extremely important. High turnover is costly, and if a company wants to grow, it needs the right people leading the way. I've worked for organizations in the past that have had high turnover rates, and while you want underperforming employees to leave, you want your best workers to grow with you.

What I found most useful in this project were the visualizations, the decision tree and the random forest algorithm. They can all be used in different ways.
If management wants a basic understanding of what's going on, I would show them the visuals; if they want to know what patterns are harming them, I would go over the decision tree; and if they want to know which employees will leave in the future, the random forest model would be helpful. Based on all of those, here are the two key points management should know concerning why their best and most experienced employees are leaving prematurely:
1.) They are being overworked – it's common for managers to take advantage of employees who do a good job by giving them a heavier workload. This is costing the company, because those employees are deciding to leave.
2.) They aren't being promoted – good employees expect to be rewarded. There is a large group of employees with high satisfaction levels who have been at the company for more than four years, but they decided to leave because there isn't any career growth.
There are a couple of simple, obvious actions management can take. They shouldn't work their best employees more than anyone else, and the best employees should be promoted after 3 or 4 years. In time, I think they will find that although the company will be less productive in the short run, reducing the turnover rate of their best employees will lead to incremental growth.