PREDICTING EMPLOYEE ATTRITION
1.1 OBJECTIVE AND SCOPE OF THE STUDY
• The objective of this project is to predict the likelihood of attrition for
each employee, i.e. to find out who is more likely to leave the
organization.
• This will help organizations find ways to prevent attrition or
plan the hiring of new candidates in advance.
• Attrition is a costly and time-consuming problem
for an organization, and it also leads to a loss of productivity.
• The scope of the project extends to companies in all
industries.
1.2 ANALYTICS APPROACH
• Check for missing values in the data and, if there are any, handle
them accordingly.
• Understand how the features are related to our target
variable, attrition.
• Convert the target variable into numeric form.
• Apply feature selection and feature engineering to make the data
model-ready.
• Apply several algorithms to check which one is the most
suitable.
• Draw recommendations from the analysis.
1.3 DATA SOURCES
• For this project we picked an HR dataset named ‘IBM HR Analytics
Employee Attrition & Performance’, which
is available on the IBM website.
• The data contains records of 1,470 employees.
• It includes each employee’s current employment
status, the number of companies worked for in the past,
the number of years at the current company and in the current
role, education level, distance from home, monthly
income, etc.
1.4 TOOLS AND TECHNIQUES
• We selected Python as our analytics tool.
• Python offers many packages for data analysis, such as Pandas, NumPy,
Matplotlib and Seaborn.
• Logistic Regression, Random Forest,
Support Vector Machine and XGBoost were used for
prediction.
2.1 IMPORTING LIBRARY AND DATA EXTRACTION
• Importing libraries and packages
• Data extraction
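The import and extraction steps might look roughly like the sketch below. The original slides showed these steps only as screenshots, so this is an assumption: the helper name `load_attrition_data` and the three-row in-memory sample are hypothetical, and the sample merely stands in for the real CSV file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
SAMPLE_CSV = """Age,Attrition,DistanceFromHome,MonthlyIncome,OverTime
41,Yes,1,5993,Yes
49,No,8,5130,No
37,Yes,2,2090,Yes
"""


def load_attrition_data(source):
    """Read the HR data from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# With the real file this would be load_attrition_data("<path to the CSV>").
attrition_df = load_attrition_data(io.StringIO(SAMPLE_CSV))
print(attrition_df.shape)
```

Passing a path instead of the `StringIO` object would load the full dataset in the same way.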
2.2 EXPLORATORY DATA ANALYSIS
• Exploratory data analysis is the process of performing initial investigations on the
data to discover patterns, spot inconsistencies, test
hypotheses and check assumptions with the help of graphical
representations.
• Displaying First 5 Rows
• Displaying rows and columns
• Identifying Missing Values
• Count of “Yes” and “No” values of Attrition
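The four inspection steps listed above can be sketched with standard pandas calls. The toy DataFrame is hypothetical and only keeps the example self-contained; the real dataset has 1,470 rows.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
attrition_df = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No"],
    "OverTime": ["Yes", "No", "Yes", "Yes", "No", "No"],
})

first_rows = attrition_df.head()                   # first 5 rows
n_rows, n_cols = attrition_df.shape                # number of rows and columns
missing_per_column = attrition_df.isnull().sum()   # missing values per column
attrition_counts = attrition_df["Attrition"].value_counts()  # "Yes"/"No" counts

print(first_rows)
print(n_rows, n_cols)
print(missing_per_column)
print(attrition_counts)
```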
2.3 VISUALIZATION (EDA)
• Attrition vs. “Age”
• Attrition vs. “Distance from Home”
• Attrition vs. “Job Satisfaction”
• Attrition vs. “Performance Rating”
• Attrition vs. “Training Times Last Year”
• Attrition vs. “Work Life Balance”
• Attrition vs. “Years At Company”
• Attrition vs. “Years in Current Role”
• Attrition vs. “Years Since Last Promotion”
• Attrition vs. Categorical Variables
• Attrition vs. “Gender, Marital Status and Overtime”
• Attrition vs. “Department, Job Role and Business Travel”
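The charts themselves were not preserved in this transcript. As a rough illustration of the numbers behind an "Attrition vs. categorical variable" chart, the sketch below computes the share of attrition within each overtime group with `pandas.crosstab`. The data is a hypothetical toy sample, and the tabular form is a stand-in for the plots, not a reproduction of them.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
attrition_df = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No", "Yes", "No"],
    "OverTime":  ["Yes", "No", "Yes", "Yes", "No", "No", "No", "No"],
})

# Row-normalised cross-tabulation: share of "Yes"/"No" attrition within
# each OverTime group. A grouped bar chart would display these shares.
attrition_by_overtime = pd.crosstab(
    attrition_df["OverTime"], attrition_df["Attrition"], normalize="index"
)
print(attrition_by_overtime)
```

The same call works for any other categorical column, such as department or marital status.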
Data Pre-Processing
• Data pre-processing is a data mining technique that transforms raw data into
an understandable format.
• It makes the data ready for analysis.
Steps involved:
• Taking care of missing data and dropping non-relevant
features
• Feature extraction
• Converting categorical features into numeric form
• Binarization of the converted categorical features
• Feature scaling
• Understanding the correlation of features with each other
• Splitting the data into training and test sets
3.1 FEATURE SELECTION
• Feature selection is the process of selecting the features that contribute
most to the prediction variable or output.
Benefits of feature selection:
• Improves model performance
• Improves accuracy
• Provides a better understanding of the data
Dropping non-relevant variables
# Drop all fixed and non-relevant variables
attrition_df.drop(
    ['DailyRate', 'EmployeeCount', 'EmployeeNumber', 'HourlyRate',
     'MonthlyRate', 'Over18', 'PerformanceRating', 'StandardHours',
     'StockOptionLevel', 'TrainingTimesLastYear'],
    axis=1, inplace=True)
Check number of rows and columns
Features Extraction
3.2 FEATURE ENGINEERING
Label Encoding
• Label encoding converts categorical variables into numeric
form so that they are machine-readable.
• It is an important pre-processing step for structured datasets in supervised
learning.
• Fit and transform the required columns of the data, then replace the
existing text data with the new encoded data.
Convert categorical variables into numeric variables
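A minimal sketch of the fit-and-transform step, assuming scikit-learn's `LabelEncoder` is used (the slides do not show the actual code). The toy DataFrame and the `encoders` dictionary are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, for illustration only.
attrition_df = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                       "Non-Travel", "Travel_Rarely"],
})

encoders = {}
for column in ["Attrition", "BusinessTravel"]:
    encoder = LabelEncoder()
    # fit_transform learns the sorted set of labels and maps each to an integer
    attrition_df[column] = encoder.fit_transform(attrition_df[column])
    encoders[column] = encoder  # kept so codes can be mapped back to labels

print(attrition_df)
print(list(encoders["BusinessTravel"].classes_))
```

Because the labels are sorted before numbering, "No" becomes 0 and "Yes" becomes 1 for the target column.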
One Hot Encoder
• It performs “binarization” of the categorical features so that they
can be included as features to train the model.
• It takes a column of categorical data that has been label
encoded and splits it into multiple columns, one per category.
• The numbers are replaced by 1s and 0s, depending on which
column corresponds to the value in each row.
Applying “One Hot Encoder” on Label Encoded features
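The column-splitting behaviour described above can be illustrated with scikit-learn's `OneHotEncoder`. This is a sketch under the assumption that this class was used; the array `business_travel_codes` is a hypothetical label-encoded column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label-encoded column with three categories (0, 1, 2).
business_travel_codes = np.array([[2], [1], [0], [2]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
one_hot = encoder.fit_transform(business_travel_codes).toarray()

print(encoder.categories_)  # the categories found in the column
print(one_hot)              # one 0/1 column per category
```

Each row now has a single 1 in the column of its category, so the model no longer sees an artificial ordering between the codes 0, 1 and 2.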
Feature Scaling
• Feature scaling is a method used to standardize the range of the
independent variables (features) of the data.
• It is also known as data normalization.
• Here it is used to centre the features around
zero so that their variances are on the same scale.
• The two most popular methods of feature scaling are standardization
and normalization.
Scaling the features
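A short sketch of standardization, assuming scikit-learn's `StandardScaler` (the slides do not say which scaler was applied). The feature values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: age and monthly income.
X = np.array([[25.0, 2000.0],
              [35.0, 4000.0],
              [45.0, 6000.0],
              [55.0, 8000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # (x - column mean) / column std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

In a train/test workflow the scaler is usually fitted on the training set only and then applied to the test set with `transform`, so that no information from the test data leaks into training.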
Correlation Matrix
• Correlation is a statistical technique that measures how one
variable moves or changes in relation to another variable.
• It is a bivariate measure that describes the association
between two variables.
Usefulness of the correlation matrix:
• If two variables are closely correlated, we can predict one
variable from the other.
• Correlation plays a vital role in locating the important variables
on which other variables depend.
• It is used as the foundation for various modeling techniques.
• Proper correlation analysis leads to a better understanding of the data.
Plotting correlation matrix
Correlation matrix Plot
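The plot itself is not preserved here. As a sketch of how the underlying matrix can be computed, the example below uses `DataFrame.corr` on a hypothetical toy sample with already-numeric columns; a heatmap function such as `seaborn.heatmap` could then draw the result.

```python
import pandas as pd

# Hypothetical toy data with already-numeric columns.
attrition_df = pd.DataFrame({
    "Attrition":      [1, 0, 1, 0, 0, 1],
    "OverTime":       [1, 0, 1, 0, 1, 1],
    "YearsAtCompany": [1, 10, 2, 8, 6, 1],
})

corr_matrix = attrition_df.corr()  # pairwise Pearson correlation
# Correlation of every feature with the target, strongest positive first
target_corr = (
    corr_matrix["Attrition"].drop("Attrition").sort_values(ascending=False)
)

print(corr_matrix.round(2))
print(target_corr.round(2))
```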
Splitting data into train and test
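A minimal sketch of the split, assuming scikit-learn's `train_test_split`. The arrays are hypothetical, and `test_size`, `random_state` and `stratify` are illustrative choices, not values taken from the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (10 samples, 2 features) and binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# stratify=y keeps the class proportions similar in both subsets, which
# matters when one class (employees who leave) is the minority.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```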
Model building
• Modeling means training a machine learning
algorithm to predict the labels from the features, tuning it for
the business need, and validating it on holdout data.
• Models used for employee attrition:
• Logistic Regression
• Random Forest
• Support Vector Machine
• XGBoost
4.1 LOGISTIC REGRESSION
• Logistic Regression is one of the most basic and widely used
machine learning algorithms for solving classification problems.
• It predicts a dependent variable (Y) from an
independent variable (X) when the dependent variable
is categorical.
• Linear Regression equation
Y = β0 + β1X + ε
• Y stands for the dependent variable that needs to be predicted.
• β0 is the Y-intercept, i.e. the point where the line
crosses the y-axis.
• β1 is the slope of the line (the slope can be negative or positive,
depending on the relationship between the dependent variable and
the independent variable).
• X represents the independent variable that is used to predict
the dependent variable.
• ε denotes the error term.
• Sigmoid Function
p(x) = 1 / (1 + e^(-(β0 + β1x)))
• The sigmoid maps the linear term β0 + β1x to a probability between 0 and 1.
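As a small worked example of how the sigmoid turns the linear term β0 + β1x into a probability, the sketch below evaluates it at three points. The coefficient values are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -4.0, 2.0

x = np.array([0.0, 2.0, 4.0])
p = sigmoid(beta0 + beta1 * x)  # predicted probability of the positive class
print(p.round(3))  # [0.018 0.5   0.982]
```

At x = 2 the linear term is exactly zero, so the predicted probability is 0.5; this is the point where a classifier with a 0.5 threshold switches from predicting class 0 to class 1.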
• Building Logistic Regression Model
• Testing the Model
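The building and testing steps could be sketched as follows, assuming scikit-learn's `LogisticRegression`. The data is synthetic (`make_classification`) and stands in for the prepared attrition features, so the printed accuracy says nothing about the results reported in this deck.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared attrition features and target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)       # building the model
y_pred = model.predict(X_test)    # testing the model on held-out data
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
```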
• Confusion Matrix
• The confusion matrix is one of the most commonly used tools for
evaluating classification models.
• It avoids "confusion" by laying out the
actual and predicted values in a tabular format.
• A standard confusion matrix tabulates actual against predicted classes, giving the counts of true positives, false positives, true negatives and false negatives. Here, positive class = 1 and negative class = 0.
• Creating confusion matrix
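A small sketch of creating a confusion matrix with scikit-learn's `confusion_matrix`, using hypothetical labels and predictions. In scikit-learn's layout the rows are actual classes and the columns are predicted classes, both in the order 0, 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # order follows the row/column layout above
accuracy = (tp + tn) / cm.sum()

print(cm)        # [[3 1]
                 #  [1 3]]
print(accuracy)  # 0.75
```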
• AUC score
• Receiver Operating Characteristic (ROC)
• The ROC curve plots the true positive rate against the false positive rate
of a classification model across all classification thresholds.
• The model is summarized by the Area Under the Curve
(AUC).
• The AUC, also referred to as the index of
accuracy (A) or concordance index, summarizes the performance shown by
the ROC curve. The higher the area, the better the model.
• Plotting ROC curve
• ROC Curve For Logistic Regression
Using the Logistic Regression algorithm, we obtained an accuracy of
79% and a ROC AUC score of 0.77.
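To make the ROC and AUC definitions concrete, the sketch below computes both with scikit-learn on four hypothetical predictions. It illustrates the metric only and is unrelated to the scores reported above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(auc)  # 0.75: three of the four (positive, negative) pairs are ranked correctly
```

Plotting `tpr` against `fpr` gives the ROC curve; the AUC equals the fraction of positive/negative pairs in which the positive example receives the higher score.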
4.2 RANDOM FOREST
• Random Forest is a supervised learning algorithm.
• It builds a forest of classification trees using the bagging
technique and aggregates their predictions.
• When splitting a node, the algorithm considers only a random subset
of the features.
• Building Random Forest Model
• Testing the Model
• Confusion Matrix
• AUC score
• Plotting ROC curve
• ROC Curve For Random Forest
Using the Random Forest algorithm, we obtained an accuracy of 79%
and a ROC AUC score of 0.76.
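A sketch of the Random Forest steps, assuming scikit-learn's `RandomForestClassifier`. The data is synthetic and the hyperparameters are illustrative, not the ones used in the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared attrition features and target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# n_estimators: number of bagged trees in the forest.
# max_features="sqrt": size of the random feature subset tried at each split.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
)
forest.fit(X_train, y_train)

accuracy = accuracy_score(y_test, forest.predict(X_test))
# AUC needs scores, so use the predicted probability of the positive class
auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])
print(f"accuracy: {accuracy:.2f}, roc_auc: {auc:.2f}")
```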
4.3 SUPPORT VECTOR MACHINE
• SVM is a supervised machine learning algorithm used for both
regression and classification problems.
• The objective is to find a hyperplane in an N-dimensional space (N being the number of features) that separates the classes.
• Hyperplanes
• Hyperplanes are decision boundaries
that help segregate the data points.
• The dimension of the hyperplane
depends on the number of features.
• Support Vectors
• These are the data points closest to the hyperplane; they
influence its position and orientation.
• They are used to maximize the margin of the classifier.
• They are considered critical elements of a dataset.
• Kernel Technique
• Used when a non-linear decision boundary is needed.
• The data is mapped into a higher-dimensional space, where the separating hyperplane is no longer a line but a plane (or a higher-dimensional surface).
• Since we have a non-linear
classification problem, the kernel
used here is the Radial Basis
Function (rbf).
• It helps in segregating data that are
not linearly separable.
• Building SVM Model
• Testing SVM Model
• Confusion Matrix
• AUC Score
• Plotting ROC Curve
• ROC Curve For SVM
Using the SVM algorithm, we obtained an accuracy of 79% and a
ROC AUC score of 0.77.
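To illustrate why an RBF kernel is chosen for data that is not linearly separable, the sketch below fits a linear and an RBF support vector classifier with scikit-learn's `SVC` on a synthetic two-moons dataset. The dataset and default hyperparameters are assumptions made for illustration and do not come from the original analysis.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: a synthetic dataset that a straight line
# cannot separate well.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

linear_accuracy = linear_svm.score(X_test, y_test)
rbf_accuracy = rbf_svm.score(X_test, y_test)
# Training points that define the margin of the fitted RBF classifier
n_support_vectors = rbf_svm.support_vectors_.shape[0]

print(f"linear: {linear_accuracy:.2f}, rbf: {rbf_accuracy:.2f}")
print(f"support vectors (rbf): {n_support_vectors}")
```

Comparing the two printed accuracies shows how much the curved decision boundary of the RBF kernel helps on this kind of data.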
4.4 XGBOOST
• XGBoost is a decision-tree-based ensemble machine learning algorithm
that uses a gradient boosting framework.
• It belongs to a family of boosting algorithms that convert weak
learners into strong learners.
• Boosting is a sequential process: trees are grown one after the other, each using
information from the previously grown trees, so that the errors of the
previous model are corrected by the next predictor.
• Advantages of XGBoost:
• Regularization
• Parallel Processing
• High Flexibility
• Handling Missing Values
• Tree Pruning
• Built-in Cross-Validation
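The sequential error-correcting idea behind boosting can be sketched in a few lines. Note that this is plain gradient boosting for squared error on a synthetic regression problem, built from scikit-learn decision trees; it is not XGBoost itself, which adds regularization, pruning and other refinements, and all values below are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                         # next weak learner fits those errors
    prediction += learning_rate * tree.predict(X)  # small corrective step
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

Each shallow tree is a weak learner on its own; adding them one after another, each trained on what the previous ones got wrong, steadily lowers the training error.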
• Building XGBoost Model
• Testing the Model
• Confusion Matrix
• AUC Score
• Plotting ROC Curve
• ROC Curve For XGBoost Model
Using the XGBoost algorithm, we obtained an accuracy of 82% and a
ROC AUC score of 0.81.
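4.5 COMPARISON OF MODELS
Model                     Accuracy   ROC AUC
Logistic Regression       79%        0.77
Random Forest             79%        0.76
Support Vector Machine    79%        0.77
XGBoost                   82%        0.81
• The table shows that XGBoost outperforms all the other models on both accuracy and ROC AUC.
• Based on these results, we conclude that XGBoost is the best
of the four models for predicting future employee attrition for this company.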
KEY FINDINGS
• The dataset contains no missing values and no redundant
features.
• The features with the strongest positive correlations with the target variable are
distance from home, job satisfaction, marital status, overtime and
business travel.
• The features with the strongest negative correlations with the target variable are
performance rating and training times last year.
RECOMMENDATIONS
• Provide transportation for employees living in the same
area, or else a transportation allowance.
• Plan and allocate projects so as to avoid
overtime.
• Identify employees who hit their two-year anniversary
as potentially having a higher risk of leaving.
• Gather information on industry benchmarks to determine whether the
company is providing competitive wages.
THANK YOU

Predicting Employee Attrition

  • 2.
  • 3. 1.1 OBJECTIVE AND SCOPE OF THE STUDY  The objective of this project is to predict the attrition rate for each employee, to find out who’s more likely to leave the organization.  It will help organizations to find ways to prevent attrition or to plan in advance the hiring of new candidate.  Attrition proves to be a costly and time consuming problem for the organization and it also leads to loss of productivity.  The scope of the project extends to companies in all industries.
  • 4. 1.2 ANALYTICS APPROACH  Check for missing values in the data, and if any, will process the data accordingly.  Understand how the features are related with our target variable - attrition  Convert target variable into numeric form  Apply feature selection and feature engineering to make it model ready  Apply various algorithms to check which one is the most suitable  Draw out recommendations based on our analysis.
  • 5. 1.3 DATA SOURCES  For this project, an HR dataset named ‘IBM HR Analytics Employee Attrition & Performance’, has been picked, which is available on IBM website.  The data contains records of 1,470 employees.  It has information about employee’s current employment status, the total number of companies worked for in the past, Total number of years at the current company and the current roles, Their education level, distance from home, monthly income, etc.
  • 6. 1.4 TOOLS AND TECHNIQUES  We have selected Python as our analytics tool.  Python includes many packages such as Pandas, NumPy, Matplotlib, Seaborn etc.  Algorithms such as Logistic Regression, Random Forest, Support Vector Machine and XGBoost have been used for prediction.
  • 7.
  • 8.  Importing Libraries 2.1 IMPORTING LIBRARY AND DATA EXTRACTION
  • 9.  Importing Packages  Data Extraction
  • 10. 2.2 EXPLORATORY DATA ANALYSIS  Refers to the process of performing initial investigations on the data in order to discover patterns, spot inconsistencies, test hypotheses and check assumptions with the help of graphical representations.  Displaying First 5 Rows
  • 11.  Displaying rows and columns
  • 13.  Count of “Yes” and “No” values of Attrition
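A class count of this kind might be produced as in the sketch below. The six-row frame is a hypothetical stand-in for the real data, and the name `attrition_df` is borrowed from the deck's later drop statement.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the HR dataset.
attrition_df = pd.DataFrame({"Attrition": ["No", "Yes", "No", "No", "Yes", "No"]})

# Absolute count and relative share of each class.
counts = attrition_df["Attrition"].value_counts()
shares = attrition_df["Attrition"].value_counts(normalize=True)
print(counts)
print(shares.round(2))
```

Checking the shares as well as the counts shows early on whether the target is imbalanced, which affects how much weight plain accuracy deserves later.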
  • 14. 2.3 VISUALIZATION(EDA) -  Attrition V/s “Age”
  • 15.  Attrition V/s “Distance from Home”
  • 16.  Attrition V/s “Job Satisfaction”
  • 17.  Attrition V/s “Performance Rating”
  • 18.  Attrition V/s “Training Times Last Year”
  • 19.  Attrition V/s “Work Life Balance”
  • 20.  Attrition V/s “Years At Company”
  • 21.  Attrition V/s “Years in Current Role”
  • 22.  Attrition V/s “Years Since Last Promotion”
  • 23.  Attrition V/s Categorical Variables
  • 24. Attrition V/s “Gender, Marital status and Overtime”
  • 25. Attrition V/s “Department, Job Role, and Business Travel”
  • 26.
  • 27. Data Pre-Processing  Refers to a data mining technique that transforms raw data into an understandable format.  Useful in making the data ready for analysis. Steps involved:  Taking care of missing data and dropping non-relevant features  Feature extraction  Converting categorical features into numeric form  Binarization of the converted categorical features  Feature scaling  Understanding the correlation of features with each other  Splitting data into training and test data sets
  • 28. 3.1 FEATURE SELECTION  The process of selecting those features which contribute most to the prediction variable or output. Benefits of feature selection:  Improves performance  Improves accuracy  Provides a better understanding of the data
  • 29. Dropping non-relevant variables
        # dropping all fixed and non-relevant variables
        attrition_df.drop(['DailyRate', 'EmployeeCount', 'EmployeeNumber', 'HourlyRate', 'MonthlyRate', 'Over18', 'PerformanceRating', 'StandardHours', 'StockOptionLevel', 'TrainingTimesLastYear'], axis=1, inplace=True)
        Check number of rows and columns
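One way to find the "fixed" variables before dropping them is to look for columns with a single distinct value. The sketch below assumes a tiny hypothetical frame whose column names are taken from the drop list above; it is an illustration, not the deck's own procedure.

```python
import pandas as pd

# Hypothetical three-row sample using column names from the drop list.
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# A column with one distinct value carries no information for a model.
fixed_columns = [col for col in df.columns if df[col].nunique() == 1]
reduced = df.drop(columns=fixed_columns)

print(fixed_columns)
print(reduced.shape)  # rows and columns after dropping
```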
  • 31. Label Encoding  Label Encoding refers to converting categorical variables into numeric, machine-readable form.  It is an important pre-processing step for structured datasets in supervised learning.  Fit and transform the required columns of the data, and then replace the existing text data with the new encoded data.
  • 32. Convert categorical variables into numeric variables
  • 33.  One Hot Encoder  It is used to perform “binarization” of the categorical features so that they can be included as features to train the model.  It takes a column of categorical data that has been label encoded and splits it into multiple columns, one per category.  The numbers are replaced by 1s and 0s: each row has a 1 in the column for its category and 0s in the others.
  • 34. Applying “One Hot Encoder” on Label Encoded features
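The two encoding steps can be sketched as below, assuming a hypothetical four-row sample. `LabelEncoder` handles the binary column, and `pandas.get_dummies` is used here as a convenient stand-in for a one-hot encoder on the multi-category column; the deck's own slide may have used scikit-learn's `OneHotEncoder` instead.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with one binary and one multi-category feature.
df = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                       "Non-Travel", "Travel_Rarely"],
})

# Label encoding: classes are sorted alphabetically, so No -> 0, Yes -> 1.
encoder = LabelEncoder()
df["OverTime"] = encoder.fit_transform(df["OverTime"])

# One-hot encoding: one 0/1 column per category of BusinessTravel.
encoded = pd.get_dummies(df, columns=["BusinessTravel"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order between categories, which a single integer code per category would do.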
  • 35. Feature Scaling  Feature scaling is a method used to standardize the range of the independent variables or features of the data.  It is also known as data normalization.  It rescales the features to a common range, for example centred on zero with unit variance, so that no feature dominates only because of its units.  The two most popular methods of feature scaling are standardization and normalization.
  • 37. Correlation Matrix • Correlation is a statistical technique which determines how one variable moves or changes in relation to another variable. • It is a bivariate measure which describes the association between two variables. Usefulness of the correlation matrix:  If two variables are closely correlated, then we can predict one variable from the other.  Correlation plays a vital role in locating the important variables on which other variables depend.  It is used as the foundation for various modeling techniques.  Proper correlation analysis leads to a better understanding of the data.
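The difference between the two methods named above can be seen in a small sketch. The numbers are made up (an income-like column and a small ordinal column), and the choice of scikit-learn's `StandardScaler` and `MinMaxScaler` is an assumption, since the deck does not show which scaler it used.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two hypothetical features on very different scales.
X = np.array([[1000.0, 1.0],
              [3000.0, 5.0],
              [5000.0, 9.0]])

# Standardization: each column gets mean 0 and unit variance.
standardized = StandardScaler().fit_transform(X)

# Normalization (min-max): each column is mapped onto [0, 1].
normalized = MinMaxScaler().fit_transform(X)

print(standardized)
print(normalized)
```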
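A correlation matrix, and the correlation of each feature with the target, might be computed as in this sketch. The five rows are invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical numeric sample; Attrition is already encoded as 0/1.
df = pd.DataFrame({
    "YearsAtCompany": [1, 3, 5, 7, 9],
    "YearsInCurrentRole": [0, 2, 4, 6, 8],
    "Attrition": [1, 1, 0, 0, 0],
})

# Pairwise Pearson correlations between all columns.
corr = df.corr()

# Correlation of each feature with the target, most negative first.
target_corr = corr["Attrition"].drop("Attrition").sort_values()
print(corr.round(2))
print(target_corr.round(2))
```

Two features that correlate almost perfectly with each other, as the two tenure columns do here, carry nearly the same information, so one of them is a candidate for removal.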
  • 40. Splitting data into train and test
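A split of this kind is commonly done with scikit-learn's `train_test_split`. The sketch below is illustrative only: the 80/20 ratio, the stratification on the target and the random seed are assumptions, as the deck does not state the settings it used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 5 positives (imbalanced).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.sum(), y_test.sum())  # positives in each part
```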
  • 41.
  • 42. Model building  The process of modeling means training a machine learning algorithm to predict the labels from the features, tuning it for the business need, and validating it on holdout data.  Models used for employee attrition:  Logistic Regression  Random Forest  Support Vector Machine  XGBoost
  • 43. 4.1 LOGISTIC REGRESSION  Logistic Regression is one of the most basic and widely used machine learning algorithms for solving a classification problem.  It is a method used to predict a dependent variable (Y) from one or more independent variables (X) when the dependent variable is categorical.
  • 44.  Linear Regression equation: Y = β0 + β1X + ε  Y stands for the dependent variable that needs to be predicted.  β0 is the Y-intercept, the point at which the line crosses the y-axis.  β1 is the slope of the line (the slope can be negative or positive depending on the relationship between the dependent variable and the independent variable).  X represents the independent variable that is used to predict the dependent value.  ε denotes the error term.
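For reference, logistic regression passes the same linear combination through the logistic (sigmoid) function, so that the output is a probability between 0 and 1 rather than an unbounded value. The notation below reuses the slide's symbols; the intermediate name z is introduced here for convenience.

```latex
\begin{aligned}
z &= \beta_0 + \beta_1 X \\
P(Y = 1 \mid X) &= \frac{1}{1 + e^{-z}}
\end{aligned}
```

A predicted probability above a chosen threshold (0.5 is a common default) is then assigned to the positive class.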
  • 46.  Building Logistic Regression Model
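A model of this kind might be built as in the sketch below. It runs on synthetic data from `make_classification`, so the printed accuracy has no relation to the deck's reported scores; the class weights, split ratio and `max_iter` value are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the prepared attrition features.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training part, predict on the held-out part.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
```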
  • 48.  Confusion Matrix  The confusion matrix is one of the most widely used tools for evaluating classification models.  It avoids "confusion" by setting the actual values against the predicted values in a tabular format. In the table, Positive class = 1 and Negative class = 0. Standard layout of a confusion matrix:
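The four cells of the table can be read off programmatically, as in this sketch on eight made-up labels. With scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes, so for binary labels `ravel()` returns the cells in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows = actual class, columns = predicted class.
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = (int(v) for v in matrix.ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(matrix)
print(tn, fp, fn, tp, accuracy)
```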
  • 49.  Creating confusion matrix  AUC score
  • 50.  Receiver Operating Characteristic (ROC)  The ROC curve plots the true positive rate against the false positive rate of a classification model across all threshold values.  The model's performance is summarized using the Area Under the Curve (AUC).  The AUC, also referred to as the index of accuracy (A) or concordance index, condenses the ROC curve into a single number. The higher the area, the better the model.
  • 52.  ROC Curve For Logistic Regression  Using the Logistic Regression algorithm, we obtained an accuracy score of 79% and a roc_auc score of 0.77.
  • 53. 4.2 RANDOM FOREST • Random Forest is a supervised learning algorithm. • It builds a forest of classification trees, each trained on a random bootstrap sample of the data (bagging), and aggregates their predictions. • In Random Forest, only a random subset of the features is taken into consideration by the algorithm for splitting a node.
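The AUC can be computed directly from predicted scores, as in this small sketch with four made-up observations. Three of the four positive/negative pairs are ranked correctly, which is why the score comes out at 0.75.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC = share of (positive, negative) pairs ranked in the right order.
auc = roc_auc_score(y_true, y_score)

# Points of the ROC curve: one (fpr, tpr) pair per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc)
print(list(zip(fpr, tpr)))
```

Plotting `fpr` against `tpr` with Matplotlib would give the curve itself.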
  • 54.  Building Random Forest Model
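A random forest along these lines might be set up as below. The data is synthetic and the hyperparameters (100 trees, `max_features="sqrt"`) are assumptions; `max_features` is the setting that limits each split to a random subset of the features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared attrition features.
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=4, random_state=0)

# Each tree sees a bootstrap sample; each split considers sqrt(10) features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)

# Importance of each feature, averaged over all trees (sums to 1).
importances = forest.feature_importances_
print(importances.round(3))
```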
  • 55.  Testing the Model  Confusion Matrix
  • 56.  AUC score  Plotting ROC curve
  • 57.  ROC Curve For Random Forest  Using the Random Forest algorithm, we obtained an accuracy score of 79% and a roc_auc score of 0.76.
  • 58. 4.3 SUPPORT VECTOR MACHINE  SVM is a supervised machine learning algorithm used for both regression and classification problems.  The objective is to find a hyperplane in an N-dimensional space (N being the number of features) that separates the data points of the two classes with the largest possible margin.  Hyperplanes  Hyperplanes are decision boundaries that help segregate the data points.  The dimension of the hyperplane depends upon the number of features.
  • 59.  Support Vectors  These are data points that are closest to the hyperplane and influence the position and orientation of the hyperplane.  Used to maximize the margin of the classifier.  Considered as critical elements of a dataset
  • 60.  Kernel Technique  Used when a non-linear decision boundary is needed.  The kernel implicitly maps the data into a higher-dimensional space in which a separating hyperplane can be found.  Since we have a non-linear classification problem, the kernel used here is the Radial Basis Function (rbf).  Helps in segregating data that are not linearly separable.
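The effect of the kernel can be illustrated on data that no straight line can separate. The sketch below uses scikit-learn's `make_circles` (two concentric rings) as a hypothetical example; it is not the attrition data, and the scores are training accuracies used only to contrast the two kernels.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Same classifier, two kernels.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear kernel: {linear_acc:.2f}")
print(f"rbf kernel:    {rbf_acc:.2f}")
```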
  • 62.  Testing SVM Model  Confusion Matrix
  • 63.  AUC Score  Plotting ROC Curve
  • 64.  ROC Curve For SVM  Using the SVM algorithm, we obtained an accuracy score of 79% and a roc_auc score of 0.77.
  • 65. 4.4 XG BOOST  XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework.  XGBoost belongs to a family of boosting algorithms that convert weak learners into strong learners.  It is a sequential process: trees are grown one after the other, each using information from the previously grown trees, so that the errors of the previous model are corrected by the next predictor.  Advantages of XGBoost:  Regularization  Parallel Processing  High Flexibility  Handling Missing Values  Tree Pruning  Built-in Cross-Validation
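The sequential error correction described above can be observed stage by stage. To keep the sketch free of extra dependencies it uses scikit-learn's `GradientBoostingClassifier`, which follows the same gradient boosting idea but is not XGBoost itself; the data is synthetic and the hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the prepared attrition features.
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=4, random_state=0)

# 100 shallow trees, each fitted to the errors left by the previous ones.
booster = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=3, random_state=0)
booster.fit(X, y)

# Training accuracy after each boosting stage (1 tree, 2 trees, ...).
staged_accuracy = [accuracy_score(y, y_stage)
                   for y_stage in booster.staged_predict(X)]
print(staged_accuracy[0], staged_accuracy[-1])
```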
  • 67.  Testing the Model  Confusion Matrix
  • 68.  AUC Score  Plotting ROC Curve
  • 69.  ROC Curve For XGBoost Model  Using the XGBoost algorithm, we obtained an accuracy score of 82% and a roc_auc score of 0.81.
  • 70. 4.5 COMPARISON OF MODELS  The table shows that XGBoost outperforms all the other models on both accuracy and roc_auc.  Hence, based on these results, we conclude that XGBoost is the best of the four models for predicting future employee attrition for this company.
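The scores reported on the four ROC slides can be collected into one table, as in the sketch below. The figures are taken from this deck; the DataFrame layout and the column names are illustrative choices.

```python
import pandas as pd

# Accuracy and roc_auc as reported on the four ROC curve slides.
results = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "SVM", "XGBoost"],
    "Accuracy": [0.79, 0.79, 0.79, 0.82],
    "ROC AUC": [0.77, 0.76, 0.77, 0.81],
}).set_index("Model")

# Model with the highest score on each metric.
best_by_auc = results["ROC AUC"].idxmax()
best_by_accuracy = results["Accuracy"].idxmax()
print(results)
print(best_by_auc, best_by_accuracy)
```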
  • 71.
  • 72. KEY FINDINGS  The dataset does not contain any missing values or any redundant features.  The strongest positive correlations with the target variable are: distance from home, job satisfaction, marital status, overtime and business travel.  The strongest negative correlations with the target variable are: performance rating and training times last year.
  • 73.
  • 74. RECOMMENDATIONS  Transportation should be provided to employees living in the same area, or else a transportation allowance should be provided.  Plan and allocate projects in such a way as to avoid the use of overtime.  Employees who reach their two-year anniversary should be identified as potentially having a higher risk of leaving.  Gather information on industry benchmarks to determine whether the company is providing competitive wages.