JATIN SAINI
MS BUSINESS ANALYTICS UNIVERSITY OF CINCINNATI
HR ANALYTICS REPORT – DATA MANAGEMENT PROJECT
Introduction
This report investigates why the best and most experienced employees leave prematurely. We use
exploratory data analysis and multi-factor linear regression to understand the relationship
between the predictor and response variables. After analysing the data, we fit a linear regression
model and check the assumptions of the linear model. The data is divided into 3 classes based on
salary: high, medium and low.
Data Description
In this study we use data available on Kaggle (https://www.kaggle.com/ludobenistant/hr-analytics)
• satisfaction_level: Satisfaction Level
• last_evaluation: Last Evaluation
• number_project: Number of Projects
• average_montly_hours: Average Monthly Hours
• time_spend_company: Time Spent at the Company
• work_accident: Whether they have had a work accident
• promotion_last_5years: Whether they have had a promotion in last 5 years
• sales: Department (sales)
• salary: Salary (high/medium/low)
• left: Whether the Employee has left (left=1)
Cleaning Dataset
Table 1: Overview of dataset
We observed no null or missing values in the data, as shown in Figure 1.
Figure 1
From the values in the data table, we can see that the data is already normalised and appears
to have been compiled by joining an employee record table, an employee evaluation table and a
work history table. Hence, we do not have to normalise the data.
As seen in the initial data exploration, satisfaction level and last evaluation are on a scale
of 0 to 1, and work accident and promotion record are binary [0,1]. Since this data does not
include an employee ID or employee name, it is difficult to identify duplicate rows directly.
We assume that satisfaction level, last evaluation, number of projects handled, time spent in
the company and salary group cannot all be the same for any 2 employees, and we check for
duplicates by comparing these columns together.
After removing duplicates based on 9 variables, we find 11739 distinct rows. This step
eliminates records that may be present in the dataset due to system error or some other reason
that we are not aware of.
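The duplicate check described above can be sketched as follows. This is a minimal illustration rather than the code used for the report: the three sample rows are hypothetical, the helper name drop_duplicates is our own, and the key uses only the five columns named in the text.

```python
# Hypothetical sample rows; the real data would be read from the Kaggle CSV.
rows = [
    {"satisfaction_level": 0.38, "last_evaluation": 0.53, "number_project": 2,
     "time_spend_company": 3, "salary": "low"},
    {"satisfaction_level": 0.80, "last_evaluation": 0.86, "number_project": 5,
     "time_spend_company": 6, "salary": "medium"},
    {"satisfaction_level": 0.38, "last_evaluation": 0.53, "number_project": 2,
     "time_spend_company": 3, "salary": "low"},  # repeat of the first row
]

# Columns assumed to identify an employee when taken together.
KEY_COLUMNS = ("satisfaction_level", "last_evaluation", "number_project",
               "time_spend_company", "salary")


def drop_duplicates(records, key_columns):
    """Keep the first record for each distinct combination of key columns."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


distinct_rows = drop_duplicates(rows, KEY_COLUMNS)
print(len(rows), "rows before,", len(distinct_rows), "rows after")
```

Keeping the first occurrence preserves the original row order, which makes it easy to compare the row count before and after the step.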
Next, we study the descriptive summary of the data:
TABLE 2: Descriptive Summary of Variables
Variable       Satisfaction   Last         Number of   Average   Time Spent   Work       Promoted
               Level          Evaluation   Projects    Monthly   in the       Accident   in last
                                           Handled     Hours     Company                 5 years
Min            0.09           0.36         2           96        2            0          0
1st Quartile   0.44           0.56         3           156       3            0          0
Median         0.64           0.72         4           200       3            0          0
Mean           0.61           0.72         3.8         201       3.5          0.15       0.02
3rd Quartile   0.82           0.87         5           245       4            0          0
Max            1              1            7           310       10           1          1
Next, we look for correlations between variables. A common assumption is that employees who
leave have a lower satisfaction level than employees who stay. We test this belief against the
data by plotting the correlations between variables.
FIGURE 2: The descriptive summary for each variable
The correlation matrix between variables in Figure 2 shows that satisfaction_level and work
accident are negatively correlated with employees leaving, while time spent in the company is
positively correlated with it.
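Each entry of such a correlation matrix is a Pearson correlation coefficient between two columns. The sketch below shows how one entry could be computed; the function name pearson and the six sample values are hypothetical and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: satisfaction score and left flag for six employees.
satisfaction = [0.90, 0.80, 0.75, 0.40, 0.20, 0.10]
left = [0, 0, 0, 1, 1, 1]

r = pearson(satisfaction, left)
print(round(r, 2))  # negative when leavers tend to have lower satisfaction
```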
Below are the paired plots and correlations between satisfaction_level, time spent in the
company, work accident and whether the employee left.
FIGURE 3
The plot below shows the satisfaction level of employees by salary group and employment
status.
Figure 4: Satisfaction levels for different Salary Groups with Employee Status
Figure 4 indicates that the satisfaction level of employees who stay is greater than 0.5 in
most cases, while employees who leave generally have a satisfaction level below 0.5. This
supports the common assumption that higher employee satisfaction helps in retaining
employees.
Another plot that helps us get a sense of the data is the workload plot by department for
employees leaving and staying, shown in Figure 5.
Figure 5: Workload for Employees across different Departments
Figure 5 shows that employees handling more than 4 projects tend to leave the company, and
this pattern holds across approximately all departments. The key insight is that people with
more than 4 projects have a higher tendency to leave.
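One way to check this pattern numerically is to compute the share of leavers for each project count. The sketch below is an illustration only: the helper leave_rate_by and the six sample records are hypothetical, and the same helper could be applied to other grouping columns such as salary or department.

```python
from collections import defaultdict


def leave_rate_by(records, column):
    """Share of employees with left == 1 for each value of the given column."""
    totals = defaultdict(int)
    leavers = defaultdict(int)
    for record in records:
        totals[record[column]] += 1
        leavers[record[column]] += record["left"]
    return {value: leavers[value] / totals[value] for value in sorted(totals)}


# Hypothetical records, not taken from the dataset.
records = [
    {"number_project": 3, "left": 0},
    {"number_project": 3, "left": 0},
    {"number_project": 4, "left": 0},
    {"number_project": 4, "left": 1},
    {"number_project": 6, "left": 1},
    {"number_project": 6, "left": 1},
]

rates = leave_rate_by(records, "number_project")
for projects, rate in rates.items():
    print(projects, rate)
```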
Methodology
From the above data exploration, we can infer that:
1. the higher the satisfaction level of employees, the lower the chances of them leaving.
2. a higher number of projects, or a higher workload, leads to higher chances of leaving.
3. employees who have had a work accident have lower chances of leaving.
The fitted linear regression model summarises these relationships in the following equation:
Chances of leaving = 0.35 + 0.04 * Time_Spend_Company - 0.45 * Satisfaction_level - 0.10 * Work_accident
A 'Chances of leaving' value of 0.5 or higher indicates a high chance that the employee will
leave the job.
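As a worked example, the equation can be evaluated for individual employees. The coefficients and the 0.5 threshold come from the report; the function name and the two employees below are hypothetical.

```python
def chances_of_leaving(time_spend_company, satisfaction_level, work_accident):
    """Evaluate the fitted equation from the report for one employee."""
    return (0.35
            + 0.04 * time_spend_company
            - 0.45 * satisfaction_level
            - 0.10 * work_accident)


# Hypothetical employee A: 6 years in the company, satisfaction 0.10, no accident.
score_a = chances_of_leaving(6, 0.10, 0)   # 0.35 + 0.24 - 0.045 - 0.00
# Hypothetical employee B: 3 years in the company, satisfaction 0.80, no accident.
score_b = chances_of_leaving(3, 0.80, 0)   # 0.35 + 0.12 - 0.36 - 0.00

for name, score in (("A", score_a), ("B", score_b)):
    label = "high" if score >= 0.5 else "low"
    print(name, round(score, 3), label)
```

Employee A scores about 0.545, which is above the 0.5 threshold, while employee B scores about 0.11.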
Note that the best-fitting model does not include the number of projects handled by the
employee as a factor in determining whether the employee leaves the company, and that
satisfaction level appears to be the most influential of the factors.
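For readers who want to see how such coefficients are estimated, the sketch below fits an ordinary least squares model by solving the normal equations. It is not the code used for the report. The six employees are hypothetical, and their targets are generated from the report's equation without noise, so the fit simply recovers the published coefficients; on the real data the estimates would come from the observed left column.

```python
def fit_linear_regression(features, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           + [sum(row[i] * y for row, y in zip(design, targets))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(aug[idx][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [value / pivot_value for value in aug[col]]
        for idx in range(k):
            if idx != col:
                factor = aug[idx][col]
                aug[idx] = [a - factor * b for a, b in zip(aug[idx], aug[col])]
    return [aug[i][k] for i in range(k)]


# Hypothetical employees: (time_spend_company, satisfaction_level, work_accident).
features = [(3, 0.80, 0), (6, 0.10, 0), (2, 0.50, 1),
            (4, 0.40, 0), (5, 0.90, 1), (3, 0.20, 0)]
# Noise-free targets generated from the report's equation (an assumption
# made purely for illustration).
targets = [0.35 + 0.04 * t - 0.45 * s - 0.10 * a for t, s, a in features]

coefficients = fit_linear_regression(features, targets)
print([round(c, 2) for c in coefficients])  # intercept, time, satisfaction, accident
```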
Inferences
From the analysis conducted, we can infer the following key points:
1. The dataset provided was a combination of 3 different tables, and it would have been
helpful to receive the data in that form.
2. There is a linear correlation between the chances of an employee leaving and the time
spent in the company, the employee's satisfaction level with the job, and whether the
employee had a work accident during their years in the job.
3. Employees handling more than 4 projects have higher chances of leaving the job. However,
the relationship between the number of projects handled and the chances of leaving is not
linear.
4. Satisfaction level (on a scale of 0 to 1) has a large impact on the chances of an employee
leaving, and this holds across all salary groups.
5. The general notion that a higher satisfaction level lowers an employee's chances of
leaving holds true, as seen from the final model.
