SlideShare a Scribd company logo
1 of 24
Download to read offline
Project -5 (Machine Learning)
Employee mode of commuting
By Saleesh Satheeshchandran
PGP-BABI 2019-20 (G-6)
Objective
This project requires you to understand what mode of transport employees prefers to
commute to their office. The attached data 'Cars.csv' includes employee information about
their mode of transport as well as their personal and professional details like age, salary, work
exp. We need to predict whether or not an employee will use Car as a mode of transport.
Also, which variables are a significant predictor behind this decision?
Assumptions
There are no particular assumptions.
Tool used for the analysis
RStudio Version 1.2.1335
R Version 3.6.0
Input Data
The data is available in a spreadsheet format with .csv file extension.
The same is uploaded into the R Studio with the help of the function “read.csv.
Steps involved in the analysis
1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers and
missing values and check the summary of the dataset
1.2 EDA - Illustrate the insights based on EDA
1.3 EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.
2. Data Preparation (SMOTE)
3.1 Applying Logistic Regression & Interpret results
3.2 Applying KNN Model & Interpret results
3.3 Applying Naïve Bayes Model & Interpret results (is it applicable here? comment and if it is
not applicable, how can you build an NB model in this case?)
3.4 Confusion matrix interpretation
3.5 Remarks on Model validation exercise <Which model performed the best>
3.6 Bagging
3.7 Boosting
4. Actionable Insights and Recommendations
Loaded Libraries
Exploratory Data Analysis
- There are a total of 9 variables.
- There are 444 observations
- The data has a mix of continuous and categoric variables.
- Continuous variables – Age, Salary, Work Experience, Distance
- Category variables – Gender, Engineer, MBA, License, Transport
- The target variable, “Transport” is a categoric variable with 3 classes.
- Engineer, MBA, License variables are integers and need to be converted to factors.
Identifying missing values and treating the data
- There is one missing value in the dataset
- The raw containing the missing value is omitted.
Treating the class of variables
- The target variable has 3 classes. But as per the problem statement it needs only 2
values that shows “Car” or “Not car”.
- The target variable is changed to a new column named “Caruse” by converting the
variable “transport” to a binary data.
- The variables Engineer, MBA, license and Caruse are converted to factors.
Checking for outliers
- The data is showing outliers in all the continuous variables
- The plot of “Age”shows 4 outliers. These are significant in the model.
- The plot of “Work Exp”shows 8 outliers. These are significant in the model.
- The plot of “Work Exp”shows numerous outliers. These are significant in the model.
The plot of “Distance ”shows 9 outliers. These are significant in the model
Univariate Analysis
The histogram of Age shows that the variable is skewed towards left.
The histogram of Work.Exp shows that the variable is skewed towards left.
The histogram of Salary shows that the variable is skewed towards left.
The histogram of Distance shows that the variable is skewed towards left.
The plot of Gender shows that there are twice as many males as to females.
The plot of Engineer shows that there are thrice as many Engineers as to Non-Engineers.
The plot of Gender shows that there are one-third as many MBAs as to non-MBA’s.
The plot of Caruse shows that there are one-fifth as many car users as to non-car users.
Bi-variate Analysis
Age vs Work.Exp plot shows a high correlation. (0.932251)
Age vs Salary plot shows a high correlation.( 0.8607652
Age vs Distance plot shows a low correlation.( 0.3530563)
Work.Exp vs Salary plot shows a high correlation.( 0.9320081)
Work.Exp vs Distance plot shows a low correlation.( 0.3727857)
Work.Exp vs Distance plot shows a low correlation.( 0.4422379)
Checking Multicollinearity and treating it
The below process is followed to check multicollinearity.
1. Create a logistic regression model with all variables.
2. Check the variable inflation factor.
3. Check the summary and identify variables showing considerable inflation.
The following variables show high contribution to multicollinearity.
- Age
- Work Exp
- Salary
Principle Component Analysis is done on continuous variables to eliminate multi-collinearity
Chi-Square test is done to determine the multicollinearity of categoric variables.
There is no significant multicollinearity between categoric variables.
The variable “Salary” is eliminated due to multi-collinearity.
Treating the data for Analysis
- A seed is set
- Data is split into Test and Train.
- Train has 310 observations
- Test has 133 Observations
- Proportion of the target variables in test and train are confirmed to be similar.
Logistic Regression
One of the best models for predictions when the variable to be determined is binary in
nature is the logistic regression model.
Like all regression analyses, the logistic regression is a predictive analysis. Logistic
regression is used to describe data and to explain the relationship between one dependent
binary variable and one or more nominal, ordinal, interval or ratio-level independent
variables.
A logistic regression model is run with all the significant variables that are available.
4 of the variables as per below are identified to be non-significant in the model.
- Gender
- Engineer
- MBA
- Salary
The mentioned variables are removed and the logistic regression model is run again.
Testing the performance
SMOTE
- The given data is under sampled.
- Smote is performed to boost positive observations of target variable.
- After Smote the positive occurrences are boosted to 366 numbers.
-The negative occurrences are boosted to 610 numbers .
Logistic Regression after Smote
- The data after SMOTE is used to perform Logistic Regression.
Measuring the model performance.
The results show that SMOTE caused below changes to the model
- Decreased Sensitivity
- Increased Specificity
- Increased Precision
- Decreased Accuracy
Naïve Bayes (Can be used in this model)
Naïve Bayes is a classification model which was initially established with a condition that all
the variables are supposed to be categorical and not continuous.
But in practical situations such a condition is not possible as we would be always getting a
mix of continuous and categorical variables.
We can try building a Naïve Bayes model data using the library (e1071) in R.
Model Performance
Naïve Bayes after SMOTE
The Naïve Bayes model is run on the data that was amplified with SMOTE
Model Performance
The results show that SMOTE caused below changes to the model
- Increased Sensitivity
- Increased Specificity
- Increased Precision
- Decreased Accuracy
K Nearest Neighbour (KNN)
k nearest neighbors is a simple algorithm that stores all available cases and classifies new
cases by a majority vote of its k neighbors. This algorithms segregates unlabeled data points
into well defined groups.
The kNN algorithm is applied to the training data set and the results are verified on the test
data set.
KNN Model does not have a formula like regression models.
It just classifies
Multiple values of K was used and experimented with the model.
When the value of K was 3, it seemed to predict the values much closer to the initial % of the
Car usage rate.
Model Performance
KNN with SMOTE Data
Measuring Performance
Bagging
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm
designed to improve the stability and accuracy of machine learning algorithms used in
statistical classification and regression. It also reduces variance and helps to avoid overfitting.
Bagging performance
Boosting
Actionable Insights
From the various analysis and modelling done on this project, Naïve Bayes with Smoting is
identified to be the best model.
The model can predict the Car usage probability of a customer with about 97% accuracy.
It is quite clear from the models tried that 3 of the variables are very critical in deciding the
Car Usage rate.
1. Age
2. Distance
3. License
This shows that if the company focuses to improve these 4 variables, they can significantly
improve the prediction of Car usage pattern of employees.
SMOTE has been identified as a very good tool to boost the data in this case.
Bagging too has been found to be very fruitful. The results came just second behind Naïve
Bayes with Smoting with about 96% Accuracy.
Both the methods are also helpful in increasing sensitivity and specificity.

More Related Content

What's hot

Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660syaidatulamirah
 
How to run a perfect Net Promoter Score campaign [Webinar]
How to run a perfect Net Promoter Score campaign [Webinar]How to run a perfect Net Promoter Score campaign [Webinar]
How to run a perfect Net Promoter Score campaign [Webinar]Retently
 
NOC to SOC: A Next-Gen Approach to Understanding and Proactively Assuring Cus...
NOC to SOC: A Next-Gen Approach to Understanding and Proactively Assuring Cus...NOC to SOC: A Next-Gen Approach to Understanding and Proactively Assuring Cus...
NOC to SOC: A Next-Gen Approach to Understanding and Proactively Assuring Cus...Alex Johnson
 
Voice of the Customer_CX-WMCLARKE
Voice of the Customer_CX-WMCLARKEVoice of the Customer_CX-WMCLARKE
Voice of the Customer_CX-WMCLARKEWilliam Clarke
 
Ngen oss bss - architecture evolution
Ngen oss bss - architecture evolution Ngen oss bss - architecture evolution
Ngen oss bss - architecture evolution Grazio Panico
 
The IT Chargeback Journey
The IT Chargeback JourneyThe IT Chargeback Journey
The IT Chargeback JourneyPete Hidalgo
 
Legal and ethical issues in web design
Legal and ethical issues in web designLegal and ethical issues in web design
Legal and ethical issues in web designMike Cummins
 

What's hot (13)

Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660
 
How to run a perfect Net Promoter Score campaign [Webinar]
How to run a perfect Net Promoter Score campaign [Webinar]How to run a perfect Net Promoter Score campaign [Webinar]
How to run a perfect Net Promoter Score campaign [Webinar]
 
NOC to SOC: A Next-Gen Approach to Understanding and Proactively Assuring Cus...
NOC to SOC: A Next-Gen Approach to Understanding and Proactively Assuring Cus...NOC to SOC: A Next-Gen Approach to Understanding and Proactively Assuring Cus...
NOC to SOC: A Next-Gen Approach to Understanding and Proactively Assuring Cus...
 
Session 1 - Introduction to Customer Experience Management.ppt
Session 1 - Introduction to Customer Experience Management.pptSession 1 - Introduction to Customer Experience Management.ppt
Session 1 - Introduction to Customer Experience Management.ppt
 
Cost & Reliability
Cost & ReliabilityCost & Reliability
Cost & Reliability
 
Voice of the Customer_CX-WMCLARKE
Voice of the Customer_CX-WMCLARKEVoice of the Customer_CX-WMCLARKE
Voice of the Customer_CX-WMCLARKE
 
Ngen oss bss - architecture evolution
Ngen oss bss - architecture evolution Ngen oss bss - architecture evolution
Ngen oss bss - architecture evolution
 
Ecrm Presentation
Ecrm PresentationEcrm Presentation
Ecrm Presentation
 
E ticketing
E ticketing E ticketing
E ticketing
 
The IT Chargeback Journey
The IT Chargeback JourneyThe IT Chargeback Journey
The IT Chargeback Journey
 
Churn Predictive Modelling
Churn Predictive ModellingChurn Predictive Modelling
Churn Predictive Modelling
 
Network Operation Center Best Practices
Network Operation Center Best PracticesNetwork Operation Center Best Practices
Network Operation Center Best Practices
 
Legal and ethical issues in web design
Legal and ethical issues in web designLegal and ethical issues in web design
Legal and ethical issues in web design
 

Similar to Employee Commute Mode Prediction Using Machine Learning

Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
Post Graduate Admission Prediction System
Post Graduate Admission Prediction SystemPost Graduate Admission Prediction System
Post Graduate Admission Prediction SystemIRJET Journal
 
Predicting Employee Attrition
Predicting Employee AttritionPredicting Employee Attrition
Predicting Employee AttritionMohamad Sahil
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientistMatthew Evans
 
A Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data MiningA Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data MiningIRJET Journal
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Simplilearn
 
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESCASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESIRJET Journal
 
Prediction of customer propensity to churn - Telecom Industry
Prediction of customer propensity to churn - Telecom IndustryPrediction of customer propensity to churn - Telecom Industry
Prediction of customer propensity to churn - Telecom IndustryPranov Mishra
 
HRUG - Linear regression with R
HRUG - Linear regression with RHRUG - Linear regression with R
HRUG - Linear regression with Regoodwintx
 
6©iStockphotoThinkstockModels and ForecastingLear.docx
6©iStockphotoThinkstockModels and ForecastingLear.docx6©iStockphotoThinkstockModels and ForecastingLear.docx
6©iStockphotoThinkstockModels and ForecastingLear.docxsodhi3
 
6©iStockphotoThinkstockModels and ForecastingLear.docx
6©iStockphotoThinkstockModels and ForecastingLear.docx6©iStockphotoThinkstockModels and ForecastingLear.docx
6©iStockphotoThinkstockModels and ForecastingLear.docxblondellchancy
 
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov DisplayIRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov DisplayIRJET Journal
 
Modelling the expected loss of bodily injury claims using gradient boosting
Modelling the expected loss of bodily injury claims using gradient boostingModelling the expected loss of bodily injury claims using gradient boosting
Modelling the expected loss of bodily injury claims using gradient boostingGregg Barrett
 
Multiple Linear Regression Applications Automobile Pricing
Multiple Linear Regression Applications Automobile PricingMultiple Linear Regression Applications Automobile Pricing
Multiple Linear Regression Applications Automobile Pricinginventionjournals
 
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...ESEM 2014
 

Similar to Employee Commute Mode Prediction Using Machine Learning (20)

Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Post Graduate Admission Prediction System
Post Graduate Admission Prediction SystemPost Graduate Admission Prediction System
Post Graduate Admission Prediction System
 
Predicting Employee Attrition
Predicting Employee AttritionPredicting Employee Attrition
Predicting Employee Attrition
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientist
 
A Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data MiningA Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data Mining
 
Machine learning project
Machine learning project Machine learning project
Machine learning project
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
 
Factors affecting customer satisfaction
Factors affecting customer satisfactionFactors affecting customer satisfaction
Factors affecting customer satisfaction
 
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESCASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
 
Prediction of customer propensity to churn - Telecom Industry
Prediction of customer propensity to churn - Telecom IndustryPrediction of customer propensity to churn - Telecom Industry
Prediction of customer propensity to churn - Telecom Industry
 
HRUG - Linear regression with R
HRUG - Linear regression with RHRUG - Linear regression with R
HRUG - Linear regression with R
 
6©iStockphotoThinkstockModels and ForecastingLear.docx
6©iStockphotoThinkstockModels and ForecastingLear.docx6©iStockphotoThinkstockModels and ForecastingLear.docx
6©iStockphotoThinkstockModels and ForecastingLear.docx
 
6©iStockphotoThinkstockModels and ForecastingLear.docx
6©iStockphotoThinkstockModels and ForecastingLear.docx6©iStockphotoThinkstockModels and ForecastingLear.docx
6©iStockphotoThinkstockModels and ForecastingLear.docx
 
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov DisplayIRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
 
Predictive modeling
Predictive modelingPredictive modeling
Predictive modeling
 
Telecom Churn Analysis
Telecom Churn AnalysisTelecom Churn Analysis
Telecom Churn Analysis
 
Modelling the expected loss of bodily injury claims using gradient boosting
Modelling the expected loss of bodily injury claims using gradient boostingModelling the expected loss of bodily injury claims using gradient boosting
Modelling the expected loss of bodily injury claims using gradient boosting
 
Multiple Linear Regression Applications Automobile Pricing
Multiple Linear Regression Applications Automobile PricingMultiple Linear Regression Applications Automobile Pricing
Multiple Linear Regression Applications Automobile Pricing
 
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc...
 

More from Saleesh Satheeshchandran (19)

Medilogistics - Ranbaxy
Medilogistics - RanbaxyMedilogistics - Ranbaxy
Medilogistics - Ranbaxy
 
Ceat
CeatCeat
Ceat
 
Retail logistics solutions for Aditya Birla Retail
Retail logistics solutions for Aditya Birla RetailRetail logistics solutions for Aditya Birla Retail
Retail logistics solutions for Aditya Birla Retail
 
Aibono ops framework compressed
Aibono ops framework compressedAibono ops framework compressed
Aibono ops framework compressed
 
Voltas compressed
Voltas compressedVoltas compressed
Voltas compressed
 
Presentation
PresentationPresentation
Presentation
 
Whirlpool rdc &amp; cfa compressed
Whirlpool rdc &amp; cfa compressedWhirlpool rdc &amp; cfa compressed
Whirlpool rdc &amp; cfa compressed
 
Aquastar amway sakinaka v 2.0
Aquastar   amway sakinaka v 2.0Aquastar   amway sakinaka v 2.0
Aquastar amway sakinaka v 2.0
 
Aquastar corporate presentation ver 8[1].0 draft
Aquastar corporate presentation ver 8[1].0 draftAquastar corporate presentation ver 8[1].0 draft
Aquastar corporate presentation ver 8[1].0 draft
 
Dabur distrb final
Dabur distrb finalDabur distrb final
Dabur distrb final
 
Tsg introduction rev
Tsg introduction revTsg introduction rev
Tsg introduction rev
 
Mahindra retail cases
Mahindra retail casesMahindra retail cases
Mahindra retail cases
 
Mahindra logistics presentation
Mahindra logistics presentationMahindra logistics presentation
Mahindra logistics presentation
 
Dell project client presentation
Dell project client presentationDell project client presentation
Dell project client presentation
 
Credit risk - loan default model
Credit risk - loan default modelCredit risk - loan default model
Credit risk - loan default model
 
Coffee Cafe
Coffee CafeCoffee Cafe
Coffee Cafe
 
Time series forecasting
Time series forecastingTime series forecasting
Time series forecasting
 
Bank loan purchase modeling
Bank loan purchase modelingBank loan purchase modeling
Bank loan purchase modeling
 
Cold storage case study
Cold storage case studyCold storage case study
Cold storage case study
 

Recently uploaded

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 

Recently uploaded (20)

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 

Employee Commute Mode Prediction Using Machine Learning

  • 1. Project -5 (Machine Learning) Employee mode of commuting By Saleesh Satheeshchandran PGP-BABI 2019-20 (G-6)
  • 2. Objective This project requires you to understand what mode of transport employees prefers to commute to their office. The attached data 'Cars.csv' includes employee information about their mode of transport as well as their personal and professional details like age, salary, work exp. We need to predict whether or not an employee will use Car as a mode of transport. Also, which variables are a significant predictor behind this decision? Assumptions There are no particular assumptions. Tool used for the analysis RStudio Version 1.2.1335 R Version 3.6.0 Input Data The data is available in a spreadsheet format with .csv file extension. The same is uploaded into the R Studio with the help of the function “read.csv. Steps involved in the analysis 1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers and missing values and check the summary of the dataset 1.2 EDA - Illustrate the insights based on EDA 1.3 EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it. 2. Data Preparation (SMOTE) 3.1 Applying Logistic Regression & Interpret results 3.2 Applying KNN Model & Interpret results 3.3 Applying Naïve Bayes Model & Interpret results (is it applicable here? comment and if it is not applicable, how can you build an NB model in this case?) 3.4 Confusion matrix interpretation
  • 3. 3.5 Remarks on Model validation exercise <Which model performed the best> 3.6 Bagging 3.7 Boosting 4. Actionable Insights and Recommendations Loaded Libraries
  • 4. Exploratory Data Analysis - There are a total of 9 variables. - There are 444 observations - The data has a mix of continuous and categoric variables. - Continuous variables – Age, Salary, Work Experience, Distance - Category variables – Gender, Engineer, MBA, License, Transport - The target variable, “Transport” is a categoric variable with 3 classes. - Engineer, MBA, License variables are integers and need to be converted to factors.
  • 5. Identifying missing values and treating the data - There is one missing value in the dataset - The raw containing the missing value is omitted. Treating the class of variables - The target variable has 3 classes. But as per the problem statement it needs only 2 values that shows “Car” or “Not car”. - The target variable is changed to a new column named “Caruse” by converting the variable “transport” to a binary data. - The variables Engineer, MBA, license and Caruse are converted to factors. Checking for outliers - The data is showing outliers in all the continuous variables
  • 6. - The plot of “Age”shows 4 outliers. These are significant in the model. - The plot of “Work Exp”shows 8 outliers. These are significant in the model. - The plot of “Work Exp”shows numerous outliers. These are significant in the model.
  • 7. The plot of “Distance ”shows 9 outliers. These are significant in the model Univariate Analysis The histogram of Age shows that the variable is skewed towards left.
  • 8. The histogram of Work.Exp shows that the variable is skewed towards left. The histogram of Salary shows that the variable is skewed towards left. The histogram of Distance shows that the variable is skewed towards left.
  • 9. The plot of Gender shows that there are twice as many males as to females. The plot of Engineer shows that there are thrice as many Engineers as to Non-Engineers. The plot of Gender shows that there are one-third as many MBAs as to non-MBA’s.
  • 10. The plot of Caruse shows that there are one-fifth as many car users as to non-car users. Bi-variate Analysis Age vs Work.Exp plot shows a high correlation. (0.932251)
  • 11. Age vs Salary plot shows a high correlation.( 0.8607652 Age vs Distance plot shows a low correlation.( 0.3530563)
  • 12. Work.Exp vs Salary plot shows a high correlation.( 0.9320081) Work.Exp vs Distance plot shows a low correlation.( 0.3727857) Work.Exp vs Distance plot shows a low correlation.( 0.4422379)
  • 13. Checking Multicollinearity and treating it The below process is followed to check multicollinearity. 1. Create a logistic regression model with all variables. 2. Check the variable inflation factor. 3. Check the summary and identify variables showing considerable inflation. The following variables show high contribution to multicollinearity. - Age - Work Exp - Salary
  • 14. Principle Component Analysis is done on continuous variables to eliminate multi-collinearity
  • 15. Chi-Square test is done to determine the multicollinearity of categoric variables. There is no significant multicollinearity between categoric variables. The variable “Salary” is eliminated due to multi-collinearity. Treating the data for Analysis - A seed is set - Data is split into Test and Train. - Train has 310 observations - Test has 133 Observations - Proportion of the target variables in test and train are confirmed to be similar.
  • 16. Logistic Regression One of the best models for predictions when the variable to be determined is binary in nature is the logistic regression model. Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. A logistic regression model is run with all the significant variables that are available. 4 of the variables as per below are identified to be non-significant in the model. - Gender - Engineer - MBA - Salary
  • 17. The mentioned variables are removed and the logistic regression model is run again. Testing the performance
  • 18. SMOTE - The given data is under sampled. - Smote is performed to boost positive observations of target variable. - After Smote the positive occurrences are boosted to 366 numbers. -The negative occurrences are boosted to 610 numbers . Logistic Regression after Smote - The data after SMOTE is used to perform Logistic Regression.
  • 19. Measuring the model performance. The results show that SMOTE caused below changes to the model - Decreased Sensitivity - Increased Specificity - Increased Precision - Decreased Accuracy Naïve Bayes (Can be used in this model) Naïve Bayes is a classification model which was initially established with a condition that all the variables are supposed to be categorical and not continuous. But in practical situations such a condition is not possible as we would be always getting a mix of continuous and categorical variables. We can try building a Naïve Bayes model data using the library (e1071) in R. Model Performance
  • 20. Naïve Bayes after SMOTE The Naïve Bayes model is run on the data that was amplified with SMOTE Model Performance The results show that SMOTE caused below changes to the model - Increased Sensitivity - Increased Specificity - Increased Precision - Decreased Accuracy
  • 21. K Nearest Neighbour (KNN) k nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. This algorithms segregates unlabeled data points into well defined groups. The kNN algorithm is applied to the training data set and the results are verified on the test data set. KNN Model does not have a formula like regression models. It just classifies Multiple values of K was used and experimented with the model. When the value of K was 3, it seemed to predict the values much closer to the initial % of the Car usage rate. Model Performance
  • 22. KNN with SMOTE Data Measuring Performance
  • 23. Bagging Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Bagging performance Boosting
  • 24. Actionable Insights From the various analysis and modelling done on this project, Naïve Bayes with Smoting is identified to be the best model. The model can predict the Car usage probability of a customer with about 97% accuracy. It is quite clear from the models tried that 3 of the variables are very critical in deciding the Car Usage rate. 1. Age 2. Distance 3. License This shows that if the company focuses to improve these 4 variables, they can significantly improve the prediction of Car usage pattern of employees. SMOTE has been identified as a very good tool to boost the data in this case. Bagging too has been found to be very fruitful. The results came just second behind Naïve Bayes with Smoting with about 96% Accuracy. Both the methods are also helpful in increasing sensitivity and specificity.