Delve into our students' project on employee retention, highlighting data-driven strategies to enhance workforce stability. Explore how analytics can predict turnover, identify key retention drivers, and improve employee engagement. Gain insights into HR analytics, predictive modeling, and innovative approaches to employee retention. To learn more, check out https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Phase 1: Project Overview
1. Business Problem
1.1 Objective:
To predict why and when employees are most likely to leave the company using a Machine Learning model, so that actions can be taken to improve employee retention and new hiring can be planned in advance. We have data on former employees, and our target variable Y is the probability of an employee leaving the company. This project falls under what is commonly known as "HR Analytics" or "People Analytics".
1.2 Challenges:
The main challenge is that the data provided contains a much larger percentage of active employees than of ex-employees, which are the cases we need to learn from. We incorporate the SMOTE technique to balance the data before modelling the Y variable.
1.3 Real World Impact:
This solution would benefit any company with employees by estimating the likelihood of an active employee leaving, so that retention decisions can be made in advance. It would also help save the true cost of replacing an employee, which comes from the time spent interviewing and finding a replacement, sign-on bonuses, and the loss of productivity for several months while the new employee gets accustomed to the role. A study by the Center for American Progress found that companies typically pay about one-fifth of an employee's salary to replace that employee.
2. Dataset
2.1 Data Fields:
• Attrition: Whether the employee is still with the company or an ex-employee.
• Age: 18 to 60 years old.
• Gender: Female or Male.
• Department: Research & Development, Sales, Human Resources.
• BusinessTravel: Travel_Rarely, Travel_Frequently, Non-Travel.
• DistanceFromHome: Distance between the company and their home, in miles.
• MonthlyIncome: The employee's numeric monthly income.
• MaritalStatus: Married, Single, Divorced.
• Education: Level of education.
• EducationField: Life Sciences, Medical, Marketing, Technical Degree, Other.
• EnvironmentSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'.
• RelationshipSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'.
• JobInvolvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'.
• JobRole: Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, etc.
• JobSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'.
• OverTime: Whether they work overtime or not.
• NumCompaniesWorked: Number of companies they worked for before joining IBM.
• PerformanceRating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'.
• YearsAtCompany: Years they have worked for IBM.
• WorkLifeBalance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'.
• YearsSinceLastPromotion: Years passed since their last promotion.
2.2 Datasets:
In this case study, an HR dataset containing data for 1,470 employees was sourced from Kaggle. I will use this dataset to predict when employees are going to quit by understanding the main drivers of employee churn. Only a single dataset is present:
• WA_Fn-UseC_-HR-Employee-Attrition.csv
2.3 Data Understanding & Tools:
• The data comes from a Kaggle competition, so it can be downloaded directly for this solution; to productionize on live data we would have to build a data pipeline. Cloud solutions and SQL queries for data pipelines are very common in companies and can be used effectively.
• For this particular instance we can use the Pandas and NumPy libraries to process the data, as it is in CSV format.
• As the data is company specific, additional data can be acquired through business understanding of the domain.
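As a minimal sketch of the loading step with Pandas: the real project would read WA_Fn-UseC_-HR-Employee-Attrition.csv from disk, but here a tiny inline sample stands in for the file so the snippet is self-contained.

```python
from io import StringIO

import pandas as pd

# Stand-in for "WA_Fn-UseC_-HR-Employee-Attrition.csv"; a real run would use
# pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv") instead.
sample_csv = StringIO(
    "Age,Attrition,MonthlyIncome,OverTime\n"
    "41,Yes,5993,Yes\n"
    "49,No,5130,No\n"
    "37,Yes,2090,Yes\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)                        # (3, 4)
print(df.dtypes)                       # per-column datatypes
print(df["Attrition"].value_counts())  # class balance of the target
```

The same three calls (shape, dtypes, value counts) give the first-pass data understanding described in the EDA phase below.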
3. Solutions to Similar Problems
3.1 Solution Approach and Problem Type:
This project falls under what is commonly known as "HR Analytics" or "People Analytics". I will be using a step-by-step systematic approach, following a method that could be applied to a variety of ML problems. Some of the machine learning algorithms considered are:
• Logistic Regression
• Random Forest
• Decision Trees
• K Nearest Neighbours
We can use cross-validation to understand and compare model performances, and can also run Grid Search for the best possible hyperparameters of each model.
Phase 2: EDA and Feature Extraction
EDA and feature extraction are valuable for understanding the likelihood of an employee leaving the company. We will explore the data using these steps:
• Employee data understanding and insights
• Removing duplicates and imputing missing values
• Checking correlation
• Univariate analysis
• Multivariate analysis
• Outliers
• Binarizing categorical variables
• Oversampling and undersampling
• Data transformation (normalization)
1. Libraries:
The following libraries have been used:
• Pandas for DataFrame operations
• NumPy for numeric operations
• Matplotlib and Seaborn for data visualisation
• Scikit-Learn for all the machine learning algorithms
• imbalanced-learn for re-sampling techniques to handle strong between-class imbalance
2. EDA
We will start with an understanding of the data:
• Data shape
• Datatypes of each variable
• Unique values of each variable
2.1 Finding NA and null values:
No null or NA values were found across the dataset.
2.2 Dropping unimportant variables:
Three variables (EmployeeCount, Over18, StandardHours) were found to have only one unique value. EmployeeNumber also carries no meaning for the analysis, so we drop this column as well.
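The pruning rule described here can be sketched in a few lines of Pandas; the toy frame below mimics the three constant columns plus the EmployeeNumber identifier.

```python
import pandas as pd

# Toy stand-in for the HR data: three single-valued columns plus an ID column.
df = pd.DataFrame({
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "EmployeeNumber": [1, 2, 4],
    "Age": [41, 49, 37],
})

# Columns with a single unique value carry no information for the model.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols + ["EmployeeNumber"])
print(df.columns.tolist())  # ['Age']
```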
2.3 Correlation Analysis:
2.3.1 Correlations with the target variable Y
• Our target variable Y has its most positive correlations with:
PerformanceRating 0.002889
MonthlyRate 0.015170
NumCompaniesWorked 0.043494
DistanceFromHome 0.077924
• The target variable Y has its most negative correlations with:
TotalWorkingYears -0.171063
JobLevel -0.169105
YearsInCurrentRole -0.160545
MonthlyIncome -0.159840
Age -0.159205
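A sketch of how such a ranking is typically produced with Pandas, on a toy numeric frame (the real figures above come from the full dataset):

```python
import pandas as pd

# Toy frame: Y is the attrition flag, plus three numeric features.
df = pd.DataFrame({
    "Y": [1, 0, 1, 0, 0, 1],
    "Age": [25, 45, 28, 50, 41, 30],
    "DistanceFromHome": [20, 3, 15, 2, 5, 12],
    "MonthlyIncome": [2000, 9000, 2500, 11000, 7000, 3000],
})

# Correlate every feature with Y and sort: most negative first,
# most positive last, mirroring the two lists above.
corr_with_y = df.corr()["Y"].drop("Y").sort_values()
print(corr_with_y)
```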
Most of the variables appear to be weakly correlated with each other, but there are still plenty of insights to be had:
• Job level and total working years are highly correlated (i.e., the longer you work, the higher the job level you achieve).
• Age is correlated with JobLevel and Education (i.e., older employees tend to be more educated and more senior).
• HourlyRate, DailyRate, and MonthlyRate are completely uncorrelated with each other.
• MonthlyIncome is highly correlated with job level.
• Monthly income and total working years are highly correlated.
• Performance rating and percentage salary hike are highly correlated.
• Years in current role and years at company are highly correlated (i.e., sticking with the company for long can promote your role).
• Years with current manager and years at company are highly correlated.
• Work-life balance correlates with pretty much none of the numeric values.
• Number of companies worked at is weakly correlated with time spent at the company (which might indicate a likelihood to leave).
• If the performance rating is high, the percentage salary hike is bigger.
2.4 Descriptive Statistics:
The describe() function gives descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution. The important takeaways are:
• The mean age of employees is 37.
• Most people get a promotion within 2-5 years.
• Average tenure is 7 years.
• No one has a performance rating under 3.
• Most people get training 2-3 times a year.
2.5 Visualization:
2.5.1 Trend of attrition by age:
• There is a sudden rise in attrition near age 25, where employees leave.
• Most employees leave before their 30s, and nearly the same pattern appears again around age 31.
• Later on, the attrition trend decreases as age increases.
• So age 25 and the 28-32 bracket should be identified as carrying a higher risk of losing employees.
(a) Most people leave in their early 30's; the largest age category among current employees falls between 35 and 40. For older employees, typically after 35, the attrition rate becomes low.
(b) The average age of male employees leaving is 34, versus 33 for female employees.
2.5.2 Years-at-company trend analysis:
• The highest attrition rate occurs in the first year on the job: over 20% of all employees who left did so in their first year.
• The vast majority of the workforce has been with the company for under 10 years.
• The maximum years at company is 40.
• The average years at company is 6 for female employees and 5 for male employees.
2.5.3 Overtime analysis:
• A majority of leavers tend to leave because of overtime, so from the above we can confidently say overtime is related to attrition.
• The major proportion of overtime workers are men, at 60%, with an average age of 34; the remainder are women with an average age of 31.
2.5.4 Distance-from-home trend analysis:
• The average distance from home is 8.92 miles for currently active employees and 10.63 miles for ex-employees.
• Hence we can conclude that an employee is more likely to quit if the distance is over 10 miles.
• The majority of married employees tend to leave the company if the distance is large, which on average comes out to approximately 11 miles.
2.5.5 Analysis based on monthly income:
• The bar graph plots age vs monthly income; the line follows the trend of attrition with age.
• We see that as an employee's age increases, monthly income also increases and the attrition trend decreases.
• So higher-aged employees are, overall, loyal employees.
(a) As the job level increases, the monthly income increases. Hence we can conclude that a good employee who stays with the company and copes with the work gets good increments in job level and income, and in the long run becomes a loyal employee.
(b) The average monthly income for a female ex-employee is Rs. 4,770, while for a male ex-employee it is Rs. 4,798.
• Setting attrition aside, the general monthly income of female employees is higher than that of male employees.
2.5.6 Analysis based on business travel:
(a) We see that a good majority of those who need to travel frequently leave the company.
(b) The count of male employees who travel frequently and leave the company is higher than that of female employees who travel frequently.
2.5.7 Number-of-companies-worked analysis:
• Employees who had worked at one company before are more likely to leave, i.e., employees who hit their overall two-year anniversary should be identified as potentially having a higher risk of leaving.
• The average age of those employees is approximately 30.
2.5.8 Trend of attrition by total working years:
• The attrition rate decreases, i.e., an employee is less likely to leave as the number of working years increases.
• Employees with between 5 and 9 years of experience should be identified as potentially having a higher risk of leaving.
• The average total working years is 9 for ex-employees and 12 for active employees.
2.5.9 Trend of attrition by years with current manager:
• A large number of leavers leave 6 months after their current manager does.
• The same pattern repeats at 2-3 years and at the 7th year, where the attrition rate among ex-employees spikes and later returns to normal. These employees should be identified as potentially having a higher risk of leaving, and managers should be aware of this pattern.
2.5.10 Attrition analysis by department:
• Of all attrition cases, 65% of employees leave from the Research & Development department.
• Next is the Sales department, with 30% of the total.
• This attrition trend occurs for employees aged 33-35; this group should be flagged as at risk of leaving, and advance steps should be taken for these employees across departments.
2.5.11 Attrition analysis by marital status:
• Most ex-employees come from the single and married classes.
• The average monthly income of married employees is the highest, followed by that of single employees.
• The average years at company is also highest for married employees at 6.5 years, followed by single employees at 4.5 years.
3. Random Findings
• Compared to other roles, human resources employees generally get promoted faster.
• Human resources employees have slightly lower job satisfaction compared to other roles.
Phase 3: Data Transformation
1. Undersampling & Oversampling (SMOTE)
• The dataset provided has a larger proportion of active employees than ex-employees, so a machine learning model may become biased towards active employees. To correct this imbalance we use the SMOTE technique, so that the minority class becomes proportionate to the majority class.
• The initial counts of the unique values were [0: 1233, 1: 237].
• After oversampling and undersampling, the count of the minority class has increased and that of the majority class has decreased.
• The final proportion of values comes to [0: 770, 1: 616], which is decent.
• This transformation is important to avoid bias in the model's output.
2. Feature Scaling
2.1 Normalization of data
• The resampled data needs to be normalized to a fixed range of values, so that the model won't be biased towards variables with large values.
• The data has been normalized to values between 0 and 1, independently of the statistical distribution each variable follows.
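This 0-to-1 rescaling is typically done with scikit-learn's MinMaxScaler, sketched below on toy values (each column is scaled independently, exactly as described):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy columns with very different ranges (e.g. age vs monthly income).
X = np.array([[18, 1000.0],
              [39, 5000.0],
              [60, 19999.0]])

X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled.min(axis=0))  # [0. 0.]
print(X_scaled.max(axis=0))  # [1. 1.]
```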
3. Train and Test Data Splitting
• The data has been split into training and test sets, and the model is trained on the training data.
• Both the dependent and independent variables are split into training and test portions.
• This is done so that the model's performance can be judged reliably when unseen test data is fed to it.
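A sketch of the split, assuming an 80/20 ratio (the actual test fraction used in the project is not stated):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy X (independent variables) and y (dependent variable).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# stratify=y keeps the class proportions equal in both portions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```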
Phase 4: Building Machine Learning Models
1. Baseline Algorithms
• First we run a range of baseline algorithms (using out-of-the-box hyperparameters) before moving on to more sophisticated solutions.
• A baseline prediction algorithm tells us whether the predictions of a given algorithm are good or not.
• A baseline provides a set of predictions that we can evaluate as we would any predictions for our problem, such as classification.
• The scores from these algorithms provide the point of comparison needed when evaluating all other machine learning algorithms on our problem.
• Once established, we can comment on how much better a given algorithm is.
The algorithms considered in this section are: Logistic Regression, Random Forest, KNN, and Decision Tree Classifier.
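The baseline pass can be sketched as follows; synthetic stand-in data replaces the HR dataset so the snippet runs on its own, and the scores it prints are not the project's reported figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the resampled, scaled HR data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# The four baseline models, all with out-of-the-box hyperparameters.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```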
From the above results, we can conclude that:
• The Random Forest Classifier seems to be the best-fit model, with an average accuracy of approximately 90.2%.
• The Decision Tree and Logistic Regression models also perform well, with average accuracies of approximately 82.4% and 77.5% respectively. We shall proceed with these models for analysis.
2. Logistic Regression
• Logistic Regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. It is not as sophisticated as the ensemble methods, so it provides us with a good benchmark.
• We use the LogisticRegression() class of the sklearn library to fit the data:
from sklearn.linear_model import LogisticRegression
• The mean accuracy of the y predictions comes out to approximately 77.8%.
• On fitting the data to logistic regression we get an intercept of 1.16107 and the following coefficients:
array([[-0.60427658, -0.66403848,  1.01115777, -0.17174707, -1.16585296,
         0.259884  , -0.23246306, -1.86911886, -0.5026541 , -1.20096013,
        -0.16554763,  0.32295026,  1.05840619,  1.78505991, -0.17840222,
        -0.26499954, -0.73665015, -0.73702936, -0.6300062 , -0.63448951,
        -0.6229809 ,  1.14153067, -1.35572966,  2.03116262, -1.03586988,
         1.41016837,  0.92204911, -0.45429208,  0.38475649, -0.35644322,
        -0.11705673, -0.16685064, -0.2734763 ,  0.80221059,  0.99447046,
         1.33874149, -0.24631437, -0.14315613, -0.31208634,  0.06669798,
        -0.01705458,  0.94274675,  0.56312458,  0.97563792]])
2.1 Confusion Matrix, Classification Report & ROC Curve
Confusion matrix:
[[165  28]
 [ 52 102]]
AUC score = 77.6
The precision score for predicting our required Y variable is 75%.
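These evaluation metrics all come from scikit-learn; a sketch on tiny hand-made labels (not the project's real test-set predictions, which produced the matrix shown above):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Toy true labels, hard predictions, and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.8, 0.9, 0.4, 0.7])

cm = confusion_matrix(y_true, y_pred)
print(cm)                                   # [[TN FP], [FN TP]]
print(classification_report(y_true, y_pred))  # precision/recall/F1 per class
auc = roc_auc_score(y_true, y_prob)         # AUC uses probabilities, not labels
print("AUC:", auc)
```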
3. Random Forest Classifier
• Random Forest is a machine learning method capable of solving both regression and classification problems. It is a brand of ensemble learning, as it relies on an ensemble of decision trees: it aggregates classification (or regression) trees.
• Random Forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. Random Forest can handle a large number of features, and is helpful for estimating which of your variables are important in the underlying data being modeled.
• Using the same train/test split with the parameters below gives us a mean score of approximately 92%.
• The scores of all the splits are as follows; their mean gives us the overall score of the model.
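A sketch of this step with scikit-learn defaults on stand-in data; the roughly 92% figure above comes from the real dataset, not from this toy run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the resampled HR data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Per-split scores, whose mean is the overall score of the model.
scores = cross_val_score(rf, X, y, cv=5)
print("split scores:", scores.round(3))
print("mean score:", round(scores.mean(), 3))

# Feature importances back the "which variables matter" use mentioned above.
rf.fit(X, y)
print("top importance:", round(float(rf.feature_importances_.max()), 3))
```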
4. Decision Tree Classifier
• A decision tree classifier is a machine learning algorithm used for both classification and regression tasks that predicts the value of a target variable by learning simple decision rules inferred from the input features.
• Decision trees have a hierarchical tree-like structure, where each internal node represents a feature or attribute and each branch represents a decision rule based on that attribute. The leaf nodes represent the final predicted outcome or class label.
• They are also capable of handling nonlinear relationships between features and the target variable.
• Using the same train/test split with the parameters below gives us a score of approximately 87.6%.
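A sketch of the decision-tree step on stand-in data (the 87.6% figure above is from the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split the same way as in Phase 3.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The tree learns simple axis-aligned decision rules from the features.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = tree.score(X_test, y_test)
print("tree depth:", tree.get_depth())
print("test accuracy:", round(acc, 3))
```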
5. Grid Search for Fine-Tuning of Hyperparameters
• Grid search works by creating a grid of all possible combinations of the hyperparameter values specified by the user. It then trains and evaluates the model using each combination of hyperparameters and selects the one that yields the best performance on a predefined evaluation metric, such as accuracy, precision, or F1 score.
• It systematically explores all possible combinations of hyperparameters, ensuring that the best combination within the specified search space is found. However, this exhaustive search can be computationally expensive.
5.1 Grid Search for the Random Forest Classifier
• Fitting the same train/test split with the above input parameters gives us the best possible hyperparameters:
Best score = 0.9056670382757339
Best parameters = {'max_features': 1, 'n_estimators': 178}
5.2 Grid Search for the Decision Tree Classifier
• Fitting the same train/test split with the above input parameters gives us the best possible hyperparameters:
Best score = 0.8896137963275589
Best parameters = {'criterion': 'entropy', 'max_depth': 13, 'random_state': 17}
Conclusion:
Retention plans:
The major indicators of employees leaving include:
• Age: Employees in the young 25-35 age bracket are more likely to leave. Hence, efforts should be made to clearly articulate the company's long-term plan for retention, and to provide incentives wherever possible to upgrade job level.
• Monthly Income: People on higher wages are less likely to leave the company. Hence, efforts should be made to gather information from the current local market to determine whether the company is paying competitive monthly wages.
• Over Time: People who work overtime are more likely to leave the company. Hence, efforts must be taken to scope projects appropriately up front, with adequate support and crew, so as to reduce overtime.
• YearsWithCurrManager: A large number of leavers leave 6 months after their current managers. By collecting line-manager details for each employee, the company can determine which managers have experienced the largest numbers of resignations over the past year. Extracting patterns among the employees who have resigned may reveal recurring patterns in employees leaving, in which case action may be taken accordingly.
• DistanceFromHome: Employees who live further away are more likely to leave the company. Hence, efforts should be made to provide support in the form of transportation for groups of employees living in the same area, or in the form of allowances. Initial screening of employees based on their home location is not recommended, as it would be regarded as a form of discrimination as long as employees make it to work on time every day.
• TotalWorkingYears: More experienced employees are less likely to leave. Employees who have between 5 and 8 years of experience should be identified as potentially having a higher risk of leaving.
• YearsAtCompany: Loyal employees are less likely to leave, but employees who hit their two-year anniversary should be identified as potentially having a higher risk of leaving.
The company should look deeper into human resources roles to understand which parts of the job people are not satisfied with. Frequent communication and one-on-ones are strongly recommended.
While the company doesn't need to worry too much about people who have worked for 2-4 companies, it's still worth paying attention to males who went to more than 5 companies.