“US - Falcon Airline Passenger Satisfaction”
Submitted in Partial Fulfillment of the Requirements for the Award of the Certificate of
Post Graduate Program in Business Analytics
Capstone Project Report
Submitted to
Submitted by Group 7
Pratiti Ray Chaudhury
Bajantry Harshadeep
Mithilesh Kumar Yadav
Sravan Kolluru
Under the guidance of
(Ms. Chaitali Dixit)
Batch: PGPBA.H.Jan’19
Year of Completion: July 2019
CERTIFICATE
This is to certify that the participants Pratiti Ray Chaudhury, Bajantry Harshadeep, Mithilesh Kumar Yadav and
Sravan Kolluru, students of Great Lakes Institute of Management, have successfully completed their
project on “US - Falcon Airline Passenger Satisfaction”.
This project is the record of authentic work carried out by them during the academic year 2018-2019.
Mentor’s Name and Sign Program Director
Name :
Date :
Place: Hyderabad
CONTENTS
INTRODUCTION
PROBLEM STATEMENT
DATA ABSTRACT
DATA PREPARATION
EXPLORATORY ANALYSIS
DATA MODELLING
Data Processing and Normalization
Training and Test Datasets
Models
1. Logistic Regression
2. Linear Discriminant Analysis
3. Decision Tree – CART Algorithm
4. Naïve Bayes
MODEL COMPARISON
1. LDA over Logistic Regression Model
2. Decision Tree over LDA and Naïve Bayes
CONCLUSION
APPENDIX – R SOURCE CODE
INTRODUCTION
In the aviation industry, high customer satisfaction is a key factor in running the business: the market is
highly competitive and customer loyalty shifts with even small changes in service. Companies therefore
need to understand their customers' needs and deliver unparalleled experiences in order to retain them.

Using customer satisfaction data obtained from Falcon Airlines, we attempt to understand why customers
are satisfied or not. Based on that, the airline can make improvements to provide better service. As part
of the analysis, we also identify several factors that improve the level of customer satisfaction.

In this era of social networking, customers turn to Twitter, Facebook, blogs or Google Reviews to review
or share their experience on the internet. A single miserable customer experience, once shared, can go
viral and seriously damage the brand or goodwill of the company.

To arrest such unwanted viral trends and to avoid heavy compensation or penalties, companies in the
service industry measure customer satisfaction through rating cards or surveys after the service is
delivered, and compensate dissatisfied customers when the fault lies with the company's service. Such
reward or compensation programs have effectively reduced customer complaints.
PROBLEM STATEMENT
In our scenario, the airline company wants to identify a customer's satisfaction level based on his or her
ratings of various aspects of the airline experience. Hence, we primarily build a model to classify the
customer satisfaction level: specifically, we classify whether a customer is satisfied or not, and
accordingly try to find the factors associated with high satisfaction.
DATA ABSTRACT
The data we obtained from Kaggle represent US airline satisfaction data consisting of 129,880
observations and 23 attributes. The satisfaction level, which is our dependent variable, is
represented as a factor ("satisfied" and "neutral or dissatisfied").
Variable | Description | Values
Satisfaction (dependent variable) | Airline satisfaction level | satisfied, neutral or dissatisfied
Age | Actual age of the passenger |
Gender | Gender of the passenger | Female, Male
Type of Travel | Purpose of the passenger's flight | Personal Travel, Business Travel
Class | Travel class in the plane | Business, Eco, Eco Plus
Customer Type | The customer type | Loyal customer, disloyal customer
Flight Distance | Flight distance of the journey |
Inflight WIFI Service | Satisfaction level of the inflight wifi service | Rating: 0 (least) - 5 (highest)
Ease of Online Booking | Satisfaction level of online booking | Rating: 0 (least) - 5 (highest)
Online Boarding | Satisfaction level of online boarding | Rating: 0 (least) - 5 (highest)
Inflight Entertainment | Satisfaction level of inflight entertainment | Rating: 0 (least) - 5 (highest)
Food and Drink | Satisfaction level of food and drink | Rating: 0 (least) - 5 (highest)
Seat Comfort | Satisfaction level of seat comfort | Rating: 0 (least) - 5 (highest)
On-board Service | Satisfaction level of on-board service | Rating: 0 (least) - 5 (highest)
Leg Room Service | Satisfaction level of leg room service | Rating: 0 (least) - 5 (highest)
Departure/Arrival Time Convenient | Satisfaction level of departure/arrival time convenience | Rating: 0 (least) - 5 (highest)
Baggage Handling | Satisfaction level of baggage handling | Rating: 0 (least) - 5 (highest)
Gate Location | Satisfaction level of gate location | Rating: 0 (least) - 5 (highest)
Cleanliness | Satisfaction level of cleanliness | Rating: 0 (least) - 5 (highest)
Check-in Service | Satisfaction level of check-in service | Rating: 0 (least) - 5 (highest)
Departure Delay in Minutes | Minutes delayed at departure |
Arrival Delay in Minutes | Minutes delayed at arrival |
Online Support | Satisfaction level of online support | Rating: 0 (least) - 5 (highest)
DATA PREPARATION
We import the dataset into the R environment and observe the structure of the data loaded.
## 'data.frame': 90917 obs. of 25 variables:
##  $ ID                               : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Gender                           : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 2 2 1 1 ...
##  $ CustomerType                     : Factor w/ 3 levels "","disloyal Customer",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Age                              : int 65 15 60 70 30 66 10 22 58 34 ...
##  $ TypeTravel                       : Factor w/ 3 levels "","Business travel",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ Class                            : Factor w/ 3 levels "Business","Eco",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Flight_Distance                  : int 265 2138 623 354 1894 227 1812 1556 104 3633 ...
##  $ DepartureDelayin_Mins            : int 0 0 0 0 0 17 0 30 47 0 ...
##  $ ArrivalDelayin_Mins              : int 0 0 0 0 0 15 0 26 48 0 ...
##  $ ID                               : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Satisfaction                     : Factor w/ 2 levels "neutral or dissatisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Seat_comfort                     : Factor w/ 6 levels "acceptable","excellent",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Departure.Arrival.time_convenient: Factor w/ 7 levels "","acceptable",..: 4 4 1 4 4 4 4 1 4 4 ...
##  $ Food_drink                       : Factor w/ 7 levels "","acceptable",..: 4 4 4 4 4 1 1 4 4 4 ...
##  $ Gate_location                    : Factor w/ 6 levels "Convinient","Inconvinient",..: 4 3 3 3 3 3 3 3 3 1 ...
##  $ Inflightwifi_service             : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 4 5 5 5 5 1 5 ...
##  $ Inflight_entertainment           : Factor w/ 6 levels "acceptable","excellent",..: 4 3 4 1 3 2 3 3 1 3 ...
##  $ Online_support                   : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 4 5 2 5 5 1 5 ...
##  $ Ease_of_Onlinebooking            : Factor w/ 6 levels "acceptable","excellent",..: 1 5 6 5 5 2 5 5 1 5 ...
##  $ Onboard_service                  : Factor w/ 7 levels "","acceptable",..: 2 1 7 6 3 3 2 6 2 2 ...
##  $ Leg_room_service                 : Factor w/ 6 levels "acceptable","excellent",..: 3 1 3 3 4 3 1 4 3 5 ...
##  $ Baggage_handling                 : Factor w/ 5 levels "acceptable","excellent",..: 1 3 5 4 2 2 3 2 5 2 ...
##  $ Checkin_service                  : Factor w/ 6 levels "acceptable","excellent",..: 2 4 4 4 2 2 2 1 5 5 ...
##  $ Cleanliness                      : Factor w/ 6 levels "acceptable","excellent",..: 1 4 6 5 4 2 4 4 1 2 ...
##  $ Online_boarding                  : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 2 5 1 5 5 2 5 ...
We observe that the loaded dataset has 90,917 observations and 24 attributes (the ID column appears twice
in the combined data, giving the 25 variables shown in the structure output above). The attribute
"Satisfaction" represents the satisfaction level of the customer at two levels: "neutral or dissatisfied" and
"satisfied". We drop the ID attribute, as it is not required in our analysis or model building.
Hence, we have 1 dependent variable and 22 independent variables in our dataset. We check the dataset
for any NA values, which must be dealt with before we proceed with building the model.
## ArrivalDelayin_Min
## 284
We observe 284 NA values in the attribute "ArrivalDelayin_Mins". Let us analyze the dataset
further to check whether any other variables contain missing values apart from this one. To do so, we
recode the blank entries as NA and count the NA values in each attribute of the dataset, as sketched below.
The following attributes of the original dataset turn out to contain missing values:

CustomerType                         9099
TypeTravel                           9088
Departure.Arrival.time_convenient    8244
Food_drink                           8181
Onboard_service                      7179
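A minimal sketch of this step, mirroring the appendix code (flightsurvey is the combined flight and survey data frame):

# Recode blank factor entries as NA, then count the missing values per column
library(dplyr)
empty_as_na <- function(x){
  if("factor" %in% class(x)) x <- as.character(x)   # ifelse does not work directly on factors
  ifelse(as.character(x) != "", x, NA)
}
flightsurvey <- flightsurvey %>% mutate_each(list(empty_as_na))   # across() in newer dplyr versions
sapply(flightsurvey, function(x) sum(is.na(x)))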
Having recoded the blank entries of these attributes as NA, we confirm that several attributes contain
missing values. We can impute these NA values (for example with the most frequent value of each
attribute, as done in the appendix code) or omit the affected rows; however, before proceeding any
further, let us analyze the correlation of "ArrivalDelayin_Mins" with the other variables.
Correlation with ArrivalDelayin_Mins:

Gender                     NA
CustomerType               NA
Age                    -0.010
TypeTravel                 NA
Class                      NA
Flight_Distance         0.108
DepartureDelayin_Mins   0.966

[Correlation plot of all attributes omitted; scale from -1 to +1.]
We observe a very high correlation (0.966) between DepartureDelayin_Mins and ArrivalDelayin_Mins.
Fortunately, instead of replacing or omitting the NA values in "ArrivalDelayin_Mins", we can exclude the
variable entirely, as a closely related variable with no NA values is already available in our dataset to
proceed with our modeling.
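A sketch of this correlation check, following the appendix code (data is the combined dataset with the ID column dropped, so ArrivalDelayin_Mins is column 8):

# Correlate the first seven attributes with ArrivalDelayin_Mins on complete cases only
data.non <- data %>% filter(!is.na(ArrivalDelayin_Mins))
for(i in 1:ncol(data.non)){                     # coerce factor columns to integer codes
  if(class(data.non[,i]) != "integer") data.non[,i] <- as.integer(data.non[,i])
}
data.cor <- round(cor(data.non[,1:7], data.non[,8]), 3)
colnames(data.cor) <- "ArrivalDelayin_Mins"
data.cor
data <- data[-8]   # drop ArrivalDelayin_Mins: r = 0.966 with DepartureDelayin_Mins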
EXPLORATORY ANALYSIS
After observing the impact of several independent variables on the satisfaction level, we
present below a few visualizations showing the level of satisfaction at different factor levels of
the variables "Gender", "Class", "Seat Comfort", "Inflight Entertainment", "Online Support" and
"Baggage Handling".

From an inferential perspective, these variables are considered the most significant in terms of
airline customer satisfaction, which the following visualizations support.
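As an illustration, the following sketch (taken from the appendix code) produces the Travel Class vs Satisfaction plot; the remaining plots are built the same way with a different grouping variable:

library(ggplot2)
ggplot(data, aes(Satisfaction, group = Class)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count") +     # proportions within each class
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..),
            stat = "count", vjust = -0.5, cex = 3) +                      # percentage labels on the bars
  labs(x = "Class of Travel", y = "Percentage",
       title = "Travel Class vs Satisfaction", fill = "Satisfaction") +
  scale_y_continuous(labels = scales::percent) +
  facet_grid(~Class)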
We observe that customers traveling in Business Class are more satisfied than customers traveling in the
Economy classes.

We observe that female customers are comparatively more satisfied than male customers: about 65% of
female customers are satisfied against 34% who are not.

We observe that Seat Comfort has a significant effect on the customer satisfaction level: customers who
rate seat comfort 5 are 99 percent satisfied.
We observe that Baggage Handling has a significant effect on the customer satisfaction level: customers
who rate baggage handling 5 are 73 percent satisfied.

We observe that Inflight Entertainment has a significant effect on the customer satisfaction level:
customers who rate inflight entertainment 5 are 95 percent satisfied.

We observe that Online Support has a significant effect on the customer satisfaction level: customers
who rate online support 0 are 100 percent dissatisfied.
DATA MODELLING
We build models to predict the satisfaction level of a customer based on the independent
variables in our dataset. This will help the airline company understand the satisfaction level
of its customers and take the necessary actions to improve customer satisfaction.
We build four models to predict the satisfaction level, namely
• Logistic regression
• Linear discriminant analysis
• Decision tree (CART) and
• Naïve Bayes.
DATA PROCESSING AND NORMALIZATION
Initially, we check for class bias to ensure our models are not skewed towards one class.
We observe there is no class bias, as the two classes are approximately equally distributed in
the dataset (see the distribution below).

Before we begin modeling, we convert the factor variables in the dataset (except for our dependent
variable) into numeric type, as required for model building and further analysis. We also observe that a
few of the variables are on different scales, which could mislead our modeling techniques; hence, we
normalize the data (a sketch of this step follows the class distribution below).
## neutral or dissatisfied               satisfied
##                    45.3                    54.7
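A minimal sketch of the conversion and min-max normalization step, assuming all predictors have already been converted to numeric (the appendix code applies the same function to columns 2 to 22 of the data):

# Min-max normalization: rescale every predictor to the 0-1 range
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))
}
data_n <- as.data.frame(lapply(data[ , setdiff(names(data), "Satisfaction")], normalize))
data_n$Satisfaction <- data$Satisfaction   # keep the response as an unscaled factor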
TRAINING AND TEST DATASETS
Now that we have the normalized dataset, we divide it into training and testing datasets in a
70:30 ratio. This will help us assess the performance of our models.
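The split is done with caret's createDataPartition(), which stratifies on the response (a sketch of the appendix code):

library(caret)
set.seed(5)
partition  <- createDataPartition(y = data_n$Satisfaction, p = 0.7, list = FALSE)
data_train <- data_n[partition, ]    # 70% of observations, stratified by Satisfaction
data_test  <- data_n[-partition, ]   # remaining 30% held out for evaluation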
MODELS
1. LOGISTIC REGRESSION
To build a classification model to predict satisfaction, we start with the go-to classification
model: logistic regression. Here, we use R's glm() function with the binomial family to predict
"Satisfaction" from the set of independent variables available.
Call:
glm(formula = Satisfaction ~ ., family = "binomial", data = data_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8657 -0.5916 0.2056 0.5355 3.5814
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.54608 0.08558 -88.171 < 2e-16 ***
Gender -0.91631 0.02315 -39.576 < 2e-16 ***
CustomerType 1.75680 0.03491 50.323 < 2e-16 ***
Age -0.40833 0.06229 -6.556 5.54e-11 ***
TypeTravel -0.58477 0.03103 -18.847 < 2e-16 ***
Class 0.86129 0.02858 30.136 < 2e-16 ***
Flight_Distance -0.65794 0.08321 -7.907 2.64e-15 ***
DepartureDelayin_Mins -7.54371 0.49612 -15.205 < 2e-16 ***
Seat_comfort 1.34953 0.06315 21.371 < 2e-16 ***
Departure.Arrival.time_convenient -0.82836 0.04684 -17.686 < 2e-16 ***
Food_drink -1.18942 0.06470 -18.385 < 2e-16 ***
Gate_location 0.46966 0.05299 8.862 < 2e-16 ***
Inflightwifi_service -0.44060 0.06243 -7.058 1.69e-12 ***
Inflight_entertainment 3.60675 0.05847 61.687 < 2e-16 ***
Online_support 0.43647 0.06368 6.854 7.18e-12 ***
Ease_of_Onlinebooking 1.38364 0.08170 16.935 < 2e-16 ***
Onboard_service 1.36192 0.05790 23.521 < 2e-16 ***
Leg_room_service 1.15122 0.04932 23.341 < 2e-16 ***
Baggage_handling 0.38854 0.05233 7.424 1.13e-13 ***
Checkin_service 1.44105 0.04868 29.601 < 2e-16 ***
Cleanliness 0.37903 0.06767 5.602 2.12e-08 ***
Online_boarding 0.68344 0.06994 9.772 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 87657 on 63642 degrees of freedom
Residual deviance: 50030 on 63621 degrees of freedom
AIC: 50074
Number of Fisher Scoring iterations: 5
From the above summary statistics of the fitted model, we observe the coefficient estimates and the
importance of each variable from its p-value. We observe that all the independent variables are
significant in predicting "Satisfaction".

We then predict on the test dataset using the model we built. Note that the predictions are obtained as
probabilities; hence, by applying a cutoff of 0.5 (the commonly used default value), we classify the
predicted satisfaction as "satisfied" or "neutral or dissatisfied". Let us build the confusion
matrix from the predicted and observed values to assess the performance of the model.
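These steps are sketched below (equivalent to the appendix code, with the classification loop replaced by a vectorised ifelse()):

log_model <- glm(Satisfaction ~ ., data = data_train, family = "binomial")
log_preds <- predict(log_model, data_test, type = "response")                  # predicted probabilities
log_class <- ifelse(log_preds > 0.5, "satisfied", "neutral or dissatisfied")   # apply the 0.5 cutoff
confusionMatrix(as.factor(log_class), data_test$Satisfaction, positive = "satisfied")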
Confusion Matrix and Statistics
Reference
Prediction neutral or dissatisfied satisfied
neutral or dissatisfied 10020 2267
satisfied 2326 12661
Accuracy : 0.8316
95% CI : (0.8271, 0.836)
No Information Rate : 0.5473
P-Value [Acc > NIR] : <2e-16
Kappa : 0.66
Mcnemar's Test P-Value : 0.3921
Sensitivity : 0.8481
Specificity : 0.8116
Pos Pred Value : 0.8448
Neg Pred Value : 0.8155
Prevalence : 0.5473
Detection Rate : 0.4642
Detection Prevalence : 0.5495
Balanced Accuracy : 0.8299
'Positive' Class : satisfied
From the above results, we can observe that:
• An overall accuracy of 83.2% and a Kappa value of 0.66. This means our model performs better than a
random prediction of the dependent variable.
• A sensitivity of 0.848, indicating that the model correctly identifies 84.8% of the satisfied
customers.
• A specificity of 0.812, indicating that the model correctly identifies 81.2% of the “neutral or
dissatisfied” customers.
2. LINEAR DISCRIMINANT ANALYSIS
One significant difference between logistic regression and LDA is that when the groups in the
dataset are clearly separated, the parameter estimates of logistic regression become unstable, and the
LDA model can therefore provide a better fit. Let us fit the model and observe the predicted class separability.
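A sketch of the LDA fit, mirroring the appendix code (MASS provides lda()):

library(MASS)
lda_model <- lda(Satisfaction ~ ., data = data_train)   # fit the linear discriminant
lda_model                                               # prints the coefficients shown below
lda_preds <- predict(lda_model, data_test)              # predicted classes and posterior probabilities
confusionMatrix(data_test$Satisfaction, lda_preds$class, positive = "satisfied")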
Coefficients of linear discriminants:
LD1
Gender -0.5345605
CustomerType 1.0790306
Age -0.2755955
TypeTravel -0.3446182
Class 0.5061304
Flight_Distance -0.3634092
DepartureDelayin_Mins -3.6944958
Seat_comfort 0.9148241
Departure.Arrival.time_convenient -0.5197594
Food_drink -0.7096648
Gate_location 0.2262170
Inflightwifi_service -0.2303117
Inflight_entertainment 2.1898107
Online_support 0.2880001
Ease_of_Onlinebooking 0.8540892
Onboard_service 0.7833862
Leg_room_service 0.6433999
Baggage_handling 0.2037793
Checkin_service 0.8032809
Cleanliness 0.2146604
Online_boarding 0.3705444
As the “satisfied” and “neutral or dissatisfied” classes appear largely separable, we do not expect much
gain from the LDA model over the logistic model. However, we build the LDA model for comparison.
Confusion Matrix and Statistics
Reference
Prediction neutral or dissatisfied satisfied
neutral or dissatisfied 10024 2322
satisfied 2243 12685
Accuracy : 0.8326
95% CI : (0.8281, 0.837)
No Information Rate : 0.5502
P-Value [Acc > NIR] : <2e-16
Kappa : 0.662
Mcnemar's Test P-Value : 0.2483
Sensitivity : 0.8453
Specificity : 0.8172
Pos Pred Value : 0.8497
Neg Pred Value : 0.8119
Prevalence : 0.5502
Detection Rate : 0.4651
Detection Prevalence : 0.5473
Balanced Accuracy : 0.8312
'Positive' Class : satisfied
From the above results, we can observe that:
• An overall accuracy of 83.3% and a Kappa value of 0.66.
• A sensitivity of 0.845, indicating that the model correctly identifies 84.5% of the satisfied
customers.
• A specificity of 0.817, indicating that the model correctly identifies 81.7% of the “neutral or
dissatisfied” customers.
3. DECISION TREE – CART ALGORITHM
We use a decision tree algorithm, CART (Classification and Regression Trees), to classify the
dependent variable based on the set of independent variables. The algorithm repeatedly
partitions the data into subsets so that the leaf nodes are pure, or as homogeneous as possible.
Here we ensure that our categorical variables are in factor form so that the CART algorithm
builds a classification tree rather than a regression tree. The fitted tree is visualized next.
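A sketch of the tree construction, following the appendix code (data_train_new and data_test_new are copies of the train and test sets with the rating variables converted back to factors):

library(rpart)
library(rpart.plot)
cart_model <- rpart(Satisfaction ~ ., data = data_train_new, method = "class")  # classification tree
rpart.plot(cart_model, digits = 2, type = 4, extra = 106)                       # the tree plot described below
cart_preds <- predict(cart_model, data_test_new, type = "class")
confusionMatrix(cart_preds, data_test$Satisfaction, positive = "satisfied")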
The algorithm builds a classification tree. The first, most important split is on "Inflight
entertainment": passengers rating it 1-3 go to the left node, classified as "neutral or
dissatisfied", which is further split on the rating given to "Seat comfort". If seat comfort is also
rated 1-3, they are finally classified as "neutral or dissatisfied", otherwise as "satisfied". The tree
progresses similarly and finally ends in 7 leaf nodes. Note that the factor levels shown in the plot are
normalized (normalized value (rating value): 0 (0), 0.2 (1), 0.4 (2), 0.6 (3), 0.8 (4), 1 (5)).

We use this model to predict the test dataset and observe the confusion matrix to
understand the model's performance.
Confusion Matrix and Statistics
Reference
Prediction neutral or dissatisfied satisfied
neutral or dissatisfied 10638 1997
satisfied 1708 12931
Accuracy : 0.8642
95% CI : (0.86, 0.8682)
No Information Rate : 0.5473
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7264
Mcnemar's Test P-Value : 2.229e-06
Sensitivity : 0.8662
Specificity : 0.8617
Pos Pred Value : 0.8833
Neg Pred Value : 0.8419
Prevalence : 0.5473
Detection Rate : 0.4741
Detection Prevalence : 0.5367
Balanced Accuracy : 0.8639
'Positive' Class : satisfied
From the above results, we can observe that:
• An overall accuracy of 86.4% and a Kappa value of 0.73. This means our model performs better than a
random prediction of the dependent variable.
• A sensitivity of 0.866, indicating that the model correctly identifies 86.6% of the satisfied
customers.
• A specificity of 0.862, indicating that the model correctly identifies 86.2% of the “neutral or
dissatisfied” customers.
4. NAÏVE BAYES
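The Naïve Bayes classifier is fitted with the e1071 package on the separately prepared dataset described in the appendix, in which the ratings are kept as factors, Age, Flight_Distance and DepartureDelayin_Mins are binned, and Satisfaction is recoded as 1 (satisfied) and 0 (neutral or dissatisfied). A minimal sketch of that step:

library(e1071)
library(caret)
model.bayes <- naiveBayes(Satisfaction ~ ., data = training)   # fit on the 70% training sample
pc <- predict(model.bayes, testing, type = "class")            # predicted class labels (1 / 0)
confusionMatrix(table(pc, testing$Satisfaction))               # confusion matrix shown below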
Confusion Matrix and Statistics
pc      1     0
  1 13071  1949
  0  1792 10329
Accuracy : 0.8622
95% CI : (0.858, 0.8662)
No Information Rate : 0.5476
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.7215
Mcnemar's Test P-Value : 0.01076
Sensitivity : 0.8794
Specificity : 0.8413
Pos Pred Value : 0.8702
Neg Pred Value : 0.8522
Prevalence : 0.5476
Detection Rate : 0.4816
Detection Prevalence : 0.5534
Balanced Accuracy : 0.8603
'Positive' Class : 1
From the above results, we can observe that:
• An overall accuracy of 86.2% and a Kappa value of 0.72. This means our model performs better than a
random prediction of the dependent variable.
• A sensitivity of 0.879, indicating that the model correctly identifies 87.9% of the satisfied
customers, the highest of all the models.
• A specificity of 0.841, indicating that the model correctly identifies 84.1% of the “neutral or
dissatisfied” customers, which is lower than the CART model.
• This model performs well in predicting the satisfied customers but not as well for the neutral or
dissatisfied customers.
MODEL COMPARISON
In order to compare the performance of the four models we built on common metrics,
we extract the Accuracy, Kappa, Sensitivity and Specificity from each confusionMatrix()
output.
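A sketch of the helper used for this, based on the comp_cfm() function in the appendix; it pulls Accuracy, Kappa, Sensitivity and Specificity out of each confusionMatrix object (the Naïve Bayes column is added from its own confusionMatrix output):

comp_cfm <- function(x1, x2, x3, n1, n2, n3){
  b <- as.matrix(c(x1$overall[1:2], x1$byClass[1:2]))   # Accuracy, Kappa, Sensitivity, Specificity
  c <- as.matrix(c(x2$overall[1:2], x2$byClass[1:2]))
  d <- as.matrix(c(x3$overall[1:2], x3$byClass[1:2]))
  colnames(b) <- n1; colnames(c) <- n2; colnames(d) <- n3
  as.data.frame(cbind(b, c, d))
}
comp_cfm(mr1, mr2, mr3, "LogisticRegression", "LinearDiscriminantAnalysis", "DecisionTree")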
              Logistic Regression   Linear Discriminant Analysis   Decision Tree   Naïve Bayes
Accuracy                   0.8316                          0.8326          0.8642        0.8622
Kappa                      0.6600                          0.6620          0.7264        0.7215
Sensitivity                0.8481                          0.8453          0.8662        0.8794
Specificity                0.8116                          0.8172          0.8617        0.8413
We observe from the above summary that:
• The logistic regression and LDA models provide very similar performance metrics on our dataset.
• The decision tree provides better performance metrics than the other models, although the difference
between Naïve Bayes and the decision tree is small.
Below we compare the models one against the other.
1. LDA OVER LOGISTIC REGRESSION MODEL
As we expected, we do not see a significant improvement in model performance, as our predicted
classes were already fairly separable and we have sufficient predictors. We only see a slight
improvement of about half a percentage point in the specificity of the model, which represents
correctly identifying "neutral or dissatisfied" customers. For our business model, it is critical to
identify this negative class correctly, as the airline company aims to identify unsatisfied customers
and reward them to improve their satisfaction.
2. DECISION TREE OVER LDA AND NAÏVE BAYES
We obtained a decent 3 percentage-point improvement in model accuracy using the decision tree
algorithm over logistic regression and LDA, but no significant difference was observed when
comparing it with the Naïve Bayes model. We also obtained roughly a 5 percentage-point improvement
in specificity over the logistic and LDA models, and about 2 percentage points over Naïve Bayes, which
is a critical metric for our business model.
CONCLUSION
We conclude that the decision tree model built using the CART algorithm is statistically the best suited
to our dataset and business model, providing an accuracy of 86 percent along with 86 percent sensitivity
(correctly identifying "satisfied" customers) and 86 percent specificity (correctly identifying "neutral or
dissatisfied" customers).

We also observed, from both the exploratory analysis and the decision tree, that the inflight
entertainment, seat comfort and online service levels significantly affect the customer experience, along
with several of the other variables considered. Airline service companies must ensure a high quality of
service on these parameters to ensure a high level of customer satisfaction.

The models we built do not perfectly classify the customer satisfaction level. The prediction accuracy
could be further improved using advanced classification algorithms or deep learning techniques.
APPENDIX – R SOURCE CODE
# Loading Libraries
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(e1071)
library(MASS)
library(corrplot)
library(ggplot2)
library(tidyr)
library(sparklyr)
# Setting Working Directory
setwd("D:/PGP BA Classes/Capstone Projects/3. Airplane Passenger Satisfaction/Final Submission")
#Loading data
flightdata<-read.csv("Flight data.csv", header = T)
surveydata<-read.csv("Surveydata.csv", header = T)
flightsurvey<-cbind(flightdata,surveydata[,-1])
#Filling blanks with NA
empty_as_na <- function(x){
if("factor" %in% class(x)) x <- as.character(x)
ifelse(as.character(x)!="", x, NA)
}
## transform all columns
flightsurvey<-flightsurvey %>% mutate_each(list(empty_as_na))
str(flightsurvey)
#Dealing with NA values
#NA's Count
sapply(flightsurvey, function(x) sum(is.na(x)))
# Missing value imputation with mode
# For CustomerType
ct <- unique(flightsurvey$CustomerType[!is.na(flightsurvey$CustomerType)])
CustomerType <- ct[which.max(tabulate(match(flightsurvey$CustomerType, ct)))]
CustomerType
flightsurvey$CustomerType[is.na(flightsurvey$CustomerType)]<- CustomerType
# For TypeTravel
tt <- unique(flightsurvey$TypeTravel[!is.na(flightsurvey$TypeTravel)])
TypeTravel <- tt[which.max(tabulate(match(flightsurvey$TypeTravel, tt)))]
TypeTravel
flightsurvey$TypeTravel[is.na(flightsurvey$TypeTravel)]<- TypeTravel
# For Departure.Arrival.time_convenient
datc <- unique(flightsurvey$Departure.Arrival.time_convenient[!is.na(flightsurvey$Departure.Arrival.time_convenient)])
Departure.Arrival.time_convenient <-
datc[which.max(tabulate(match(flightsurvey$Departure.Arrival.time_convenient, datc)))]
Departure.Arrival.time_convenient
flightsurvey$Departure.Arrival.time_convenient[is.na(flightsurvey$Departure.Arrival.time_convenient)]<-
Departure.Arrival.time_convenient
# For Food_drink
fd <- unique(flightsurvey$Food_drink[!is.na(flightsurvey$Food_drink)])
Food_drink <- fd[which.max(tabulate(match(flightsurvey$Food_drink, fd)))]
Food_drink
flightsurvey$Food_drink[is.na(flightsurvey$Food_drink)]<- Food_drink
# For Onboard_service
os <- unique(flightsurvey$Onboard_service[!is.na(flightsurvey$Onboard_service)])
Onboard_service <- os[which.max(tabulate(match(flightsurvey$Onboard_service, os)))]
Onboard_service
flightsurvey$Onboard_service[is.na(flightsurvey$Onboard_service)]<- Onboard_service
# Converting Cate into int
flightsurvey$Seat_comfort<-factor(flightsurvey$Seat_comfort,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Seat_comfort <- as.integer(factor(flightsurvey$Seat_comfort))
flightsurvey$Departure.Arrival.time_convenient<-
factor(flightsurvey$Departure.Arrival.time_convenient,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Departure.Arrival.time_convenient <-
as.integer(factor(flightsurvey$Departure.Arrival.time_convenient))
flightsurvey$Food_drink<-factor(flightsurvey$Food_drink,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Food_drink <- as.integer(factor(flightsurvey$Food_drink))
flightsurvey$Gate_location<-factor(flightsurvey$Gate_location,levels=c("very
inconvinient","Inconvinient","need improvement","manageable","Convinient","very
convinient"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Gate_location <- as.integer(factor(flightsurvey$Gate_location))
flightsurvey$Inflightwifi_service<-factor(flightsurvey$Inflightwifi_service,levels=c("extremely
poor","poor","need improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Inflightwifi_service <- as.integer(factor(flightsurvey$Inflightwifi_service))
flightsurvey$Inflight_entertainment<-factor(flightsurvey$Inflight_entertainment,levels=c("extremely
poor","poor","need improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Inflight_entertainment <- as.integer(factor(flightsurvey$Inflight_entertainment))
flightsurvey$Online_support<-factor(flightsurvey$Online_support,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Online_support <- as.integer(factor(flightsurvey$Online_support))
flightsurvey$Ease_of_Onlinebooking<-factor(flightsurvey$Ease_of_Onlinebooking,levels=c("extremely
poor","poor","need improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Ease_of_Onlinebooking <- as.integer(factor(flightsurvey$Ease_of_Onlinebooking))
flightsurvey$Onboard_service<-factor(flightsurvey$Onboard_service,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Onboard_service <- as.integer(factor(flightsurvey$Onboard_service))
flightsurvey$Leg_room_service<-factor(flightsurvey$Leg_room_service,levels=c("extremely
poor","poor","need improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Leg_room_service <- as.integer(factor(flightsurvey$Leg_room_service))
flightsurvey$Baggage_handling<-factor(flightsurvey$Baggage_handling,levels=c("poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4),ordered = T)
flightsurvey$Baggage_handling <- as.integer(factor(flightsurvey$Baggage_handling))
flightsurvey$Checkin_service<-factor(flightsurvey$Checkin_service,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Checkin_service <- as.integer(factor(flightsurvey$Checkin_service))
flightsurvey$Cleanliness<-factor(flightsurvey$Cleanliness,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Cleanliness <- as.integer(factor(flightsurvey$Cleanliness))
flightsurvey$Online_boarding<-factor(flightsurvey$Online_boarding,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Online_boarding <- as.integer(factor(flightsurvey$Online_boarding))
str(flightsurvey)
#drop ID attribute
data <- flightsurvey[-1]
# Finding correlation
data.non <- data %>% filter(!is.na(data$ArrivalDelayin_Mins))
for(i in 1:ncol(data.non)){
if(class(data[,i])!="integer"){
data.non[,i] <- as.integer(data.non[,i])
}
}
data.cor <- round(cor(data.non[,1:7], data.non[,8]),3)
colnames(data.cor) <- "ArrivalDelayin_Mins"
data.cor
# Removing High Correlated atribute "Arrival.Delay.in.Minutes"
data <- data[-8]
#Exploratory Analysis
ggplot(data, aes(Satisfaction, group = Gender)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Satisfaction Level",y = "Percentage",title="Gender vs Satisfaction", fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Gender)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
ggplot(data, aes(Satisfaction, group = Class)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Class of Travel",y = "Percentage",title="Travel Class vs Satisfaction", fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) +
facet_grid(~Class)+ theme(axis.text.x = element_text(size=8,angle = 90), plot.title = element_text(hjust =
0.5))
ggplot(data, aes(Satisfaction, group = Seat_comfort)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Seat Comfort Level",y = "Percentage",title="Seat Comfort vs Satisfaction", fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Seat_comfort)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
ggplot(data, aes(Satisfaction, group = Inflight_entertainment)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Inflight entertainment Level",y = "Percentage",title= "Inflight Entertainment vs Satisfaction",
fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Inflight_entertainment)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
ggplot(data, aes(Satisfaction, group = Online_support)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Online support Level",y = "Percentage",title="Online Support vs Satisfaction", fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Online_support)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
ggplot(data, aes(Satisfaction, group = Baggage_handling)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count")+
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Baggage Handling Level",y = "Percentage",title="Bagga ge Handling vs Satisfaction",
fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Baggage_handling)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
# Data Modelling
round(prop.table(table(data$Satisfaction))*100,digit=1)
for(i in 2:ncol(data)){
if(class(data[,i])=="factor"){
data[,i] <- as.integer(data[,i])
}
}
normalize <- function(x){
return ((x-min(x))/(max(x)-min(x)))
}
#write.csv(data, file = "Data.csv")
data <- read.csv("Data.csv",header = T)
data <- data[,-1]
str(data)
# Converting categorical to int
# Gender
levels(data$Gender)
data$Gender<-factor(data$Gender,levels=c("Female","Male"),labels=c(0,1))
data$Gender <- as.integer(factor(data$Gender))
# CustomerType
levels(data$CustomerType)
data$CustomerType<-factor(data$CustomerType,levels=c("disloyal Customer","Loyal
Customer"),labels=c(0,1),ordered = T)
data$CustomerType <- as.integer(factor(data$CustomerType))
#TypeTravel
levels(data$TypeTravel)
data$TypeTravel<-factor(data$TypeTravel,levels=c("Business travel","Personal Travel"),labels=c(0,1))
data$TypeTravel <- as.integer(factor(data$TypeTravel))
#Class
levels(data$Class)
data$Class<-factor(data$Class,levels=c("Eco","Eco Plus","Business"),labels=c(0,1,2),ordered = T)
data$Class <- as.integer(factor(data$Class))
data_n <- as.data.frame(lapply(data[2:22], normalize))
data_n <- cbind(data_n, Satisfaction=data$Satisfaction)
set.seed(5)
partition <- createDataPartition(y = data_n$Satisfaction, p=0.7, list= F)
data_train <- data_n[partition,]
data_test <- data_n[-partition,]
# write.csv(data_train, file = "data_train.csv")
# write.csv(data_test, file = "data_test.csv")
# Logistic regression
log_model <- glm(Satisfaction~., data=data_train, family = "binomial")
summary(log_model)
log_preds <- predict(log_model, data_test[,1:22], type = "response")
head(log_preds)
log_class <- array(c(99))
for (i in 1:length(log_preds)){
if(log_preds[i]>0.5){
log_class[i]<-"satisfied"
}else{
log_class[i]<-"neutral or dissatisfied"
}
}
#Creating a new dataframe containing the actual and predicted values.
log_result <- data.frame(Actual = data_test$Satisfaction, Prediction = log_class)
mr1 <- confusionMatrix(as.factor(log_class),data_test$Satisfaction, positive = "satisfied")
mr1
# Linear Discriminant Analysis
pairs(data_train[,1:5], main="Predict ", pch=22, bg=c("red", "blue")[unclass(data_train$Satisfaction)])
lda_model <- lda(Satisfaction ~ ., data_train)
lda_model
#Predict the model
lda_preds <- predict(lda_model, data_test)
mr2 <- confusionMatrix(data_test$Satisfaction, lda_preds$class, positive = "satisfied")
mr2
# Decision tree - CART algorithm
data_train_new <- data_train
for(i in c(1,2,4,5,8,9,10,11,12,13,14,15,16,17,18,19,20,21)){
data_train_new[,i] <- as.factor(data_train_new[,i])
}
data_test_new <- data_test
for(i in c(1,2,4,5,8,9,10,11,12,13,14,15,16,17,18,19,20,21)){
data_test_new[,i] <- as.factor(data_test_new[,i])
}
str(data_train_new)
str(data_test_new)
cart_model <- rpart(Satisfaction~ ., data_train_new, method="class")
cart_model
rpart.plot(cart_model,digits = 2, type=4, extra=106)
cm1 <- predict(cart_model, data_test_new, type = "class")
mr3 <- confusionMatrix(cm1, data_test$Satisfaction, positive = "satisfied")
mr3
# Model Comparision
comp_cfm <- function(x1,x2,x3,n1,n2,n3){
a <- data.frame()
b <- as.matrix(c(x1[[3]][1:2],x1[[4]][1:2]))
c <- as.matrix(c(x2[[3]][1:2],x2[[4]][1:2]))
d <- as.matrix(c(x3[[3]][1:2],x3[[4]][1:2]))
colnames(b) <- n1
colnames(c) <- n2
colnames(d) <- n3
a <- as.data.frame(cbind(b,c,d))
return(a)
}
r <- comp_cfm(mr1,mr2,mr3, "LogisticRegression", "LinearDiscriminantAna lysis", "DecisionTree")
r
##Capstone Project Data And Naive Bayes Model
setwd("D:/PGP BA Classes/Capstone Projects/3. Airplane Passenger Satisfaction/Final Submission")
survey<-read.csv("Surveydata.csv")
flight<- read.csv("Flight data.csv")
## Combine Data set
FSdata<-cbind(flight[-1],survey[-1])
str(FSdata)
colnames(FSdata)
library(dplyr)
## filling "Blanks" with "NA"
empty_as_na <- function(x)
{
if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
ifelse(as.character(x)!="", x, NA)
}
## transform all columns
FSdata<-FSdata %>% mutate_each(list(empty_as_na))
FSdata
##Missing value imputation for CustomerType
val1 <- unique(FSdata$CustomerType[!is.na(FSdata$CustomerType)])
mode1 <- val1[which.max(tabulate(match(FSdata$CustomerType, val1)))]
mode1
FSdata$CustomerType[is.na(FSdata$CustomerType)]<- mode1
##Missing value imputation for Typetravel
val2 <- unique(FSdata$TypeTravel[!is.na(FSdata$TypeTravel)])
mode2 <- val2[which.max(tabulate(match(FSdata$TypeTravel, val2)))]
mode2
FSdata$TypeTravel[is.na(FSdata$TypeTravel)]<- mode2
##Missing value imputation for Departure.Arrival.time_convenient
val3 <-unique(FSdata$Departure.Arrival.time_convenient[!is.na(FSdata$Departure.Arrival.time_convenient)])
mode3 <- val3[which.max(tabulate(match(FSdata$Departure.Arrival.time_convenient, val3)))]
mode3
FSdata$Departure.Arrival.time_convenient[is.na(FSdata$Departure.Arrival.time_convenient)]<-mode3
##Missing value imputation for Food_drink
val4 <- unique(FSdata$Food_drink[!is.na(FSdata$Food_drink)])
mode4 <- val4[which.max(tabulate(match(FSdata$Food_drink, val4)))]
mode4
FSdata$Food_drink[is.na(FSdata$Food_drink)]<- mode4
##Missing value imputation for Onboard_service
val5 <- unique(FSdata$Onboard_service[!is.na(FSdata$Onboard_service)])
mode5 <- val5[which.max(tabulate(match(FSdata$Onboard_service, val5)))]
mode5
FSdata$Onboard_service[is.na(FSdata$Onboard_service)]<- mode5
## deleting the variable "Arrival Delay in Mins" due to collinearity
FSdata<-FSdata[-8]
colnames(FSdata)
write.csv(FSdata,file = "FSdata.csv")
sum(is.na(FSdata))
## Binning
library(rbin)
FSdata$Age <- cut(FSdata$Age,
c(0,20,35,55,90))
table(FSdata$Age)
prop.table(table(FSdata$Age))
levels(FSdata$Age)
##Binning Flight_Distance
FSdata$Flight_Distance <- cut(FSdata$Flight_Distance,
c(0,200,400,600,800,1000,1200,1400,1600,1800,2000,2200,2400,2600,2800,3000,3200,3400,
3600,3800,4000,4200,4400,4600,4800,5000,5200,5400,5600,5800,6000))
table(FSdata$Flight_Distance)
prop.table(table(FSdata$Flight_Distance))
levels(FSdata$Flight_Distance)
##Binning DepartureDelayin_Mins (First 2 hrs every 10 mins, next 3 hrs every 15 mins,
#next 5 hrs every 30 mins, next 6 hrs every 60 mins, and beyond 6 hrs in the last bucket)
FSdata$DepartureDelayin_Mins <- cut(FSdata$DepartureDelayin_Mins,
c(0,10,20,30,40,50,60,70,80,90,100,110,120,135,150,
165,180,195,210,225,240,255,270,285,300,330,360,390,
420,450,480,510,540,570,600,660,720,780,840,900,960,1620))
table(FSdata$DepartureDelayin_Mins)
prop.table(table(FSdata$DepartureDelayin_Mins))
##Converting 0 as string
levels<-levels(FSdata$DepartureDelayin_Mins)
levels[length(levels)+1]<-"0"
FSdata$DepartureDelayin_Mins<-factor(FSdata$DepartureDelayin_Mins,levels =levels)
FSdata$DepartureDelayin_Mins[is.na(FSdata$DepartureDelayin_Mins)]<- "0"
str(FSdata)
## Converting Character variables to factor
FSdata$Gender<- factor(FSdata$Gender,c("Female","Male"),labels = c(0,1))
FSdata$CustomerType <- factor(FSdata$CustomerType,
c("Loyal Customer","disloyal Customer"),labels = c(1,0))
FSdata$TypeTravel<- factor(FSdata$TypeTravel,
c("Personal Travel", "Business travel"),labels = c(0,1))
FSdata$Class <- factor(FSdata$Class,c("Eco","Eco Plus","Business"),labels = c(1:3))
FSdata$Food_drink <- factor(FSdata$Food_drink,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Seat_comfort <- factor(FSdata$Seat_comfort,
c( "extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Departure.Arrival.time_convenient <- factor(FSdata$Departure.Arrival.time_convenient,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Gate_location <- factor(FSdata$Gate_location,
c("very inconvinient","Inconvinient","need improvement","manageable",
"Convinient" , "very convinient"),labels = c(0:5))
FSdata$Inflightwifi_service<- factor(FSdata$Inflightwifi_service,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Inflight_entertainment <- factor(FSdata$Inflight_entertainment,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Online_support <- factor(FSdata$Online_support,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Ease_of_Onlinebooking <- factor(FSdata$Ease_of_Onlinebooking,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Onboard_service <- factor(FSdata$Onboard_service,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Leg_room_service<- factor(FSdata$Leg_room_service,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Baggage_handling<- factor(FSdata$Baggage_handling,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Checkin_service <- factor(FSdata$Checkin_service,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Cleanliness<- factor(FSdata$Cleanliness,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Online_boarding <- factor(FSdata$Online_boarding,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Satisfaction<-factor(FSdata$Satisfaction,
c("satisfied","neutral or dissatisfied"),labels=c(1,0))
levels(FSdata$Cleanliness)
str(FSdata)
##Split the data in train & test sample (70:30)
library(caret)
set.seed(123)
id<-sample(2,nrow(FSdata),prob = c(.7,.3),replace = T)
training<-FSdata[id==1,]
testing<-FSdata[id==2,]
c(nrow(training), nrow(testing))
str(testing)
## Rows and Columns
dim(training)
dim(testing)
## Columns name
colnames(training)
colnames(testing)
## Naive Baise Model
library(e1071)
library(caret)
library(AUC)
model.bayes<- naiveBayes(Satisfaction~.,data = training)
model.bayes
## test
pc<- NULL
pc<- predict(model.bayes,testing,type = "class")
summary(pc)
confusionMatrix(table(pc,testing$Satisfaction))
## Lift Chart
pb<- NULL
pb<- predict(model.bayes,testing,type = "raw")
pb<-as.data.frame(pb)
pred.bayes<-data.frame(testing$Satisfaction,pb$`1`)
colnames(pred.bayes)<-c("target","score")
lift.bayes<- lift(target~score,data=pred.bayes,cuts = 10,class = "1")
xyplot(lift.bayes,main="Bayesian Classifier -Lift Chart",
type=c("l","g"),lwd=2,
scales=list(x=list(alternating=FALSE,tick.number=10),
y=list(alternating=FALSE,tick.number=10)))

 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

Us falcon airline passenger satisfaction

  • 1. “US - Falcon Airline Passenger Satisfaction” Submitted in Partial Fulfillment of requirements for the Award of certificate of Post Graduate Program in Business Analytics Capstone Project Report Submitted to Submitted by Group 7 Pratiti Ray Chaudhury Bajantry Harshadeep Mithilesh Kumar Yadav Sravan Kolluru Under the guidance of (Ms. Chaitali Dixit) Batch – (PGPBA.H.Jan’19) Year Of Completion (July’ 2019)
  • 2. CERTIFICATE This is to certify that the participants Pratiti Ray Chaudhury, Bajantry Harshadeep, Mithilesh Kumar Yadav, Sravan Kolluru who are the students of Great Lakes Institute of Management, has successfully completed their project on “US - Falcon Airline Passenger Satisfaction” This project is the record of authentic work carried out by them during the academic year 2018- 2019. Mentor’s Name and Sign Program Director Name : Date : Place: Hyderabad
  • 3. CONTENTS INTRODUCTION................................................................................................................... 4 PROBLEM STATEMENT..................................................................................................................5 DATA ABSTRACT .................................................................................................................. 5 DATAPREPARATION....................................................................................................................................7 EXPLORATORYANALYSIS .........................................................................................................................10 DATAMODELLING.....................................................................................................................................13 Dataprocessingandnormalization...........................................................................................13 TrainingandTestDatasets..........................................................................................................14 Models.........................................................................................................................................................14 1. LogisticRegression................................................................................................................14 2. LinearDiscriminantAnalysis................................................................................................16 3. DecisionTree –CARTAlgorithm.........................................................................................18 4. NaïveBayes...........................................................................................................................20 MODELCOMPARISON..............................................................................................................................22 1. LDAover Logisticregressionmodel....................................................................................23 2. Decision tree over LDA.................................................................................... 23 CONCLUSION..................................................................................................................... 24 APPENDIX – R SOURCE CODE............................................................................................ 25
Therefore, companies need to understand customers' needs in order to deliver unparalleled experiences and retain them. Using the customer satisfaction data obtained from Falcon Airlines, we attempt to understand why a customer's experience is satisfactory or not; based on these findings, the airline can improve its service. The analysis also identifies the factors that most improve the customer satisfaction level.

In this era of social networking, customers turn to Twitter, Facebook, blogs or Google Reviews to share their experience. A poor experience, once shared, can go viral and seriously damage the brand and goodwill of the company. To prevent such unwanted trends and to avoid heavy legal compensation or penalties, companies in the service industry identify a customer's satisfaction through rating cards or surveys after the service is delivered, and compensate dissatisfied customers when the fault lies with the company's service. Such reward and compensation programs have effectively reduced the number of complaints raised by customers.
PROBLEM STATEMENT

In our scenario, the airline company wants to identify a customer's satisfaction level based on their ratings of various aspects of the airline experience. We therefore build a model to classify the customer satisfaction level: precisely, we classify customers as satisfied or not, and then identify the factors associated with high satisfaction.

DATA ABSTRACT

The data, obtained from Kaggle, represent US airline satisfaction data consisting of 129,880 observations and 23 attributes. The satisfaction level, our dependent variable, is represented as a factor ("Satisfied" and "Neutral or Dissatisfied").

Variable – Description – Values
Satisfaction (dependent variable) – Airline satisfaction level – Satisfied, neutral or dissatisfied
Age – The actual age of the passenger – Numeric
Gender – Gender of the passenger – Female, Male
Type of Travel – Purpose of the passenger's flight – Personal Travel, Business Travel
Class – Travel class in the plane – Business, Eco, Eco Plus
Customer Type – The customer type – Loyal customer, Disloyal customer
Flight Distance – The flight distance of the journey – Numeric
Inflight WIFI Service – Satisfaction level of the inflight wifi service – Rating: 0 (least) - 5 (highest)
Ease of Online Booking – Satisfaction level of online booking – Rating: 0 - 5
Online Boarding – Satisfaction level of online boarding – Rating: 0 - 5
Inflight Entertainment – Satisfaction level of inflight entertainment – Rating: 0 - 5
Food and Drink – Satisfaction level of food and drink – Rating: 0 - 5
Seat Comfort – Satisfaction level of seat comfort – Rating: 0 - 5
On-board Service – Satisfaction level of on-board service – Rating: 0 - 5
Leg Room Service – Satisfaction level of leg room service – Rating: 0 - 5
Departure/Arrival Time Convenient – Satisfaction level of departure/arrival time convenience – Rating: 0 - 5
Baggage Handling – Satisfaction level of baggage handling – Rating: 0 - 5
Gate Location – Satisfaction level of gate location – Rating: 0 - 5
Cleanliness – Satisfaction level of cleanliness – Rating: 0 - 5
Check-in Service – Satisfaction level of check-in service – Rating: 0 - 5
Departure Delay in Minutes – Minutes delayed at departure – Numeric
Arrival Delay in Minutes – Minutes delayed at arrival – Numeric
Online Support – Satisfaction level of online support – Rating: 0 - 5
DATA PREPARATION

We import the dataset into the R environment and observe the structure of the loaded data.

## 'data.frame': 90917 obs. of 25 variables:
##  $ ID                               : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Gender                           : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 2 2 1 1 ...
##  $ CustomerType                     : Factor w/ 3 levels "","disloyal Customer",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Age                              : int 65 15 60 70 30 66 10 22 58 34 ...
##  $ TypeTravel                       : Factor w/ 3 levels "","Business travel",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ Class                            : Factor w/ 3 levels "Business","Eco",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Flight_Distance                  : int 265 2138 623 354 1894 227 1812 1556 1043 633 ...
##  $ DepartureDelayin_Mins            : int 0 0 0 0 0 17 0 30 47 0 ...
##  $ ArrivalDelayin_Mins              : int 0 0 0 0 0 15 0 26 48 0 ...
##  $ ID                               : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Satisfaction                     : Factor w/ 2 levels "neutral or dissatisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Seat_comfort                     : Factor w/ 6 levels "acceptable","excellent",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Departure.Arrival.time_convenient: Factor w/ 7 levels "","acceptable",..: 4 4 1 4 4 4 4 1 4 4 ...
##  $ Food_drink                       : Factor w/ 7 levels "","acceptable",..: 4 4 4 4 4 1 1 4 4 4 ...
##  $ Gate_location                    : Factor w/ 6 levels "Convinient","Inconvinient",..: 4 3 3 3 3 3 3 3 3 1 ...
##  $ Inflightwifi_service             : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 4 5 5 5 5 1 5 ...
##  $ Inflight_entertainment           : Factor w/ 6 levels "acceptable","excellent",..: 4 3 4 1 3 2 3 3 1 3 ...
##  $ Online_support                   : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 4 5 2 5 5 1 5 ...
##  $ Ease_of_Onlinebooking            : Factor w/ 6 levels "acceptable","excellent",..: 1 5 6 5 5 2 5 5 1 5 ...
##  $ Onboard_service                  : Factor w/ 7 levels "","acceptable",..: 2 1 7 6 3 3 2 6 2 2 ...
##  $ Leg_room_service                 : Factor w/ 6 levels "acceptable","excellent",..: 3 1 3 3 4 3 1 4 3 5 ...
##  $ Baggage_handling                 : Factor w/ 5 levels "acceptable","excellent",..: 1 3 5 4 2 2 3 2 5 2 ...
##  $ Checkin_service                  : Factor w/ 6 levels "acceptable","excellent",..: 2 4 4 4 2 2 2 1 5 5 ...
##  $ Cleanliness                      : Factor w/ 6 levels "acceptable","excellent",..: 1 4 6 5 4 2 4 4 1 2 ...
##  $ Online_boarding                  : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 2 5 1 5 5 2 5 ...

We observe that the dataset has 90,917 observations and 24 attributes (the ID column appears twice in the combined extract). The attribute "Satisfaction" represents the satisfaction level of the customer on two levels: "neutral or dissatisfied" and "satisfied". We drop the ID attribute, as it is not required in our analysis or model building. Hence, we have 1 dependent variable and 22 independent variables in our dataset.

We check the dataset for NA values, which must be dealt with before we proceed with model building.

## ArrivalDelayin_Mins
##                 284

We observe 284 NA values in the attribute "ArrivalDelayin_Mins". To check whether any other attributes contain missing values, we first replace blank entries with NA and then count the NA values per attribute. The attributes with missing values are:

CustomerType  TypeTravel  Departure.Arrival.time_convenient  Food_drink  Onboard_service
        9099        9088                               8244        8181             7179
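The blank-to-NA conversion, the per-attribute NA count and the mode imputation described above can be reproduced with a short R sketch. It is condensed from the appendix code and assumes the two raw CSV files named there ("Flight data.csv" and "Surveydata.csv"); the helper name empty_as_na follows the appendix.

# Load and combine the flight and survey files (file names as in the appendix code)
library(dplyr)
flightdata   <- read.csv("Flight data.csv", header = TRUE)
surveydata   <- read.csv("Surveydata.csv", header = TRUE)
flightsurvey <- cbind(flightdata, surveydata[, -1])

# Replace blank entries with NA so they are counted as missing
empty_as_na <- function(x){
  if ("factor" %in% class(x)) x <- as.character(x)
  ifelse(as.character(x) != "", x, NA)
}
flightsurvey <- flightsurvey %>% mutate(across(everything(), empty_as_na))

# Count NA values per attribute
sapply(flightsurvey, function(x) sum(is.na(x)))

# Impute a categorical attribute with its mode, e.g. CustomerType
vals    <- unique(flightsurvey$CustomerType[!is.na(flightsurvey$CustomerType)])
mode_ct <- vals[which.max(tabulate(match(flightsurvey$CustomerType, vals)))]
flightsurvey$CustomerType[is.na(flightsurvey$CustomerType)] <- mode_ct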
By imputing the blank values of the different attributes with NA, we see that several attributes contain missing values. We could replace these NA values with an attribute-level summary (mean or mode) or drop the affected rows; however, before proceeding, let us analyze the correlation of "ArrivalDelayin_Mins" with the other variables.

(Correlation plot of all attributes omitted.)

Correlation with ArrivalDelayin_Mins:
Gender                      NA
CustomerType                NA
Age                     -0.010
TypeTravel                  NA
Class                       NA
Flight_Distance          0.108
DepartureDelayin_Mins    0.966

We observe a very high correlation (0.966) between DepartureDelayin_Mins and ArrivalDelayin_Mins. Therefore, instead of replacing or omitting the NA values in "ArrivalDelayin_Mins", we can exclude the variable entirely, since a similar variable with no NA values is already available for modeling.
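The correlation check and the removal of the redundant delay column can be written as below. This is a condensed sketch of the appendix code, assuming data is the prepared data frame with the column names used in this report and the remaining attributes coerced to integer.

# Correlation of ArrivalDelayin_Mins with the other numeric attributes, complete cases only
data_complete <- data[!is.na(data$ArrivalDelayin_Mins), ]
round(cor(data_complete[, c("Age", "Flight_Distance", "DepartureDelayin_Mins")],
          data_complete$ArrivalDelayin_Mins), 3)

# DepartureDelayin_Mins correlates at ~0.97 with ArrivalDelayin_Mins,
# so the arrival-delay column (and its 284 NAs) is dropped
data <- data[, names(data) != "ArrivalDelayin_Mins"]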
EXPLORATORY ANALYSIS

After examining the impact of several independent variables on the satisfaction level, we present visualizations showing the proportion of satisfied customers at the different factor levels of "Gender", "Class", "Seat Comfort", "Inflight Entertainment", "Online Support" and "Baggage Handling". From an inferential perspective, these variables appear to be among the most significant drivers of airline customer satisfaction, as the following observations confirm:

• Class: customers traveling in Business Class are more satisfied than those traveling in the Economy classes.
• Gender: female customers are comparatively more satisfied than male customers; around 65% of female customers are satisfied against 34% dissatisfied.
• Seat Comfort has a significant effect on the satisfaction level: customers rating seat comfort 5 are 99 percent satisfied.
• Baggage Handling has a significant effect on the satisfaction level: customers rating baggage handling 5 are 73 percent satisfied.
• Inflight Entertainment has a significant effect on the satisfaction level: customers rating inflight entertainment 5 are 95 percent satisfied.
• Online Support has a significant effect on the satisfaction level: customers rating online support 0 are 100 percent dissatisfied.

The R code used to produce one of these charts is shown after this list; the full plotting code is given in the appendix.
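As an example, the Seat Comfort chart is a proportion bar chart faceted by rating level. The snippet below is taken, lightly reformatted, from the appendix code; data is the prepared data frame.

library(ggplot2)
# Proportion of satisfied vs. dissatisfied customers at each Seat_comfort rating
ggplot(data, aes(Satisfaction, group = Seat_comfort)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..),
            stat = "count", vjust = -0.5, cex = 3) +
  labs(x = "Seat Comfort Level", y = "Percentage",
       title = "Seat Comfort vs Satisfaction", fill = "Satisfaction") +
  scale_y_continuous(labels = scales::percent) +
  facet_grid(~Seat_comfort) +
  theme(axis.text.x = element_text(size = 8, angle = 90),
        plot.title = element_text(hjust = 0.5))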
DATA MODELLING

We build models to predict the satisfaction level of a customer from the independent variables in our dataset. This will help the airline company understand the satisfaction level of its customers and take the necessary actions to improve it. We build four models to predict the satisfaction level:

• Logistic regression
• Linear discriminant analysis (LDA)
• Decision tree (CART)
• Naïve Bayes

DATA PROCESSING AND NORMALIZATION

We first check for class bias. There is no strong class bias, as both classes are approximately equally represented in the dataset:

##                satisfied  neutral or dissatisfied
##                     54.7                     45.3

Before modeling, we convert the factor variables into numeric type (except for the dependent variable), as required for model building and further analysis. Since several variables are on different scales, which can mislead some modeling techniques, we also normalize the data (see the sketch below).
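A minimal sketch of the class-balance check and the min-max normalization, following the appendix code; data is assumed to be the prepared data frame with Satisfaction as the factor target and all other columns already numeric.

# Class balance of the target (in percent)
round(prop.table(table(data$Satisfaction)) * 100, digits = 1)

# Min-max normalization of the predictors, keeping the factor target unchanged
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))
}
num_cols <- setdiff(names(data), "Satisfaction")
data_n   <- as.data.frame(lapply(data[num_cols], normalize))
data_n$Satisfaction <- data$Satisfaction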
TRAINING AND TEST DATASETS

Now that we have the normalized dataset, we split it into training and testing datasets in the ratio 70:30. This allows us to assess the performance of each model on unseen data.

MODELS

1. LOGISTIC REGRESSION

To build a classification model for satisfaction, we start with the go-to classifier: logistic regression. We use the glm() function in R (family = "binomial") to predict "Satisfaction" from the set of available independent variables.

Call:
glm(formula = Satisfaction ~ ., family = "binomial", data = data_train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8657  -0.5916   0.2056   0.5355   3.5814

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)
(Intercept)                        -7.54608    0.08558 -88.171  < 2e-16 ***
Gender                             -0.91631    0.02315 -39.576  < 2e-16 ***
CustomerType                        1.75680    0.03491  50.323  < 2e-16 ***
Age                                -0.40833    0.06229  -6.556 5.54e-11 ***
TypeTravel                         -0.58477    0.03103 -18.847  < 2e-16 ***
Class                               0.86129    0.02858  30.136  < 2e-16 ***
Flight_Distance                    -0.65794    0.08321  -7.907 2.64e-15 ***
DepartureDelayin_Mins              -7.54371    0.49612 -15.205  < 2e-16 ***
Seat_comfort                        1.34953    0.06315  21.371  < 2e-16 ***
Departure.Arrival.time_convenient  -0.82836    0.04684 -17.686  < 2e-16 ***
Food_drink                         -1.18942    0.06470 -18.385  < 2e-16 ***
Gate_location                       0.46966    0.05299   8.862  < 2e-16 ***
Inflightwifi_service               -0.44060    0.06243  -7.058 1.69e-12 ***
Inflight_entertainment              3.60675    0.05847  61.687  < 2e-16 ***
Online_support                      0.43647    0.06368   6.854 7.18e-12 ***
Ease_of_Onlinebooking               1.38364    0.08170  16.935  < 2e-16 ***
Onboard_service                     1.36192    0.05790  23.521  < 2e-16 ***
Leg_room_service                    1.15122    0.04932  23.341  < 2e-16 ***
Baggage_handling                    0.38854    0.05233   7.424 1.13e-13 ***
Checkin_service                     1.44105    0.04868  29.601  < 2e-16 ***
Cleanliness                         0.37903    0.06767   5.602 2.12e-08 ***
Online_boarding                     0.68344    0.06994   9.772  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 87657  on 63642  degrees of freedom
Residual deviance: 50030  on 63621  degrees of freedom
AIC: 50074

Number of Fisher Scoring iterations: 5

From the summary statistics of the fitted model we can read the coefficient weights, and the p-values indicate the importance of each variable. All the independent variables are significant in predicting "Satisfaction".

We then predict on the test dataset. The predictions are returned as probabilities, so, using a cutoff of 0.5 (the commonly used default), we classify each predicted observation as "satisfied" or "neutral or dissatisfied". The confusion matrix of predicted versus observed values is used to assess the performance of the model.

Confusion Matrix and Statistics

                          Reference
Prediction                 neutral or dissatisfied satisfied
  neutral or dissatisfied                    10020      2267
  satisfied                                   2326     12661

               Accuracy : 0.8316
                 95% CI : (0.8271, 0.836)
    No Information Rate : 0.5473
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.66
 Mcnemar's Test P-Value : 0.3921
            Sensitivity : 0.8481
            Specificity : 0.8116
         Pos Pred Value : 0.8448
         Neg Pred Value : 0.8155
             Prevalence : 0.5473
         Detection Rate : 0.4642
   Detection Prevalence : 0.5495
      Balanced Accuracy : 0.8299
       'Positive' Class : satisfied

From these results we observe:
• An overall accuracy of 83.2% and a Kappa value of 0.66, meaning the model performs considerably better than a random prediction of the dependent variable.
• A sensitivity of 0.848: the model correctly identifies about 85 percent of the satisfied customers.
• A specificity of 0.812: the model correctly identifies about 81 percent of the "neutral or dissatisfied" customers.
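The split-and-fit workflow just described can be condensed as below. The sketch follows the appendix code (caret::createDataPartition for the 70:30 split, glm with a 0.5 probability cutoff), with data_n being the normalized data frame.

library(caret)

# 70:30 stratified split on the target
set.seed(5)
idx <- createDataPartition(data_n$Satisfaction, p = 0.7, list = FALSE)
data_train <- data_n[idx, ]
data_test  <- data_n[-idx, ]

# Logistic regression on all predictors
log_model <- glm(Satisfaction ~ ., data = data_train, family = "binomial")

# Predicted probabilities on the test set, classified with a 0.5 cutoff
log_probs <- predict(log_model, data_test, type = "response")
log_class <- ifelse(log_probs > 0.5, "satisfied", "neutral or dissatisfied")

# Confusion matrix with "satisfied" as the positive class
confusionMatrix(factor(log_class, levels = levels(data_test$Satisfaction)),
                data_test$Satisfaction, positive = "satisfied")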
2. LINEAR DISCRIMINANT ANALYSIS

One notable difference between logistic regression and LDA is that, when the groups in the dataset are clearly separated, logistic regression becomes unstable; in that situation LDA provides the better model. Let us observe the class separability and the fitted discriminant. (Pairs plot of the leading predictors, coloured by class, omitted.)

Coefficients of linear discriminants:
                                          LD1
Gender                             -0.5345605
CustomerType                        1.0790306
Age                                -0.2755955
TypeTravel                         -0.3446182
Class                               0.5061304
Flight_Distance                    -0.3634092
DepartureDelayin_Mins              -3.6944958
Seat_comfort                        0.9148241
Departure.Arrival.time_convenient  -0.5197594
Food_drink                         -0.7096648
Gate_location                       0.2262170
Inflightwifi_service               -0.2303117
Inflight_entertainment              2.1898107
Online_support                      0.2880001
Ease_of_Onlinebooking               0.8540892
Onboard_service                     0.7833862
Leg_room_service                    0.6433999
Baggage_handling                    0.2037793
Checkin_service                     0.8032809
Cleanliness                         0.2146604
Online_boarding                     0.3705444

Since the "satisfied" and "neutral or dissatisfied" classes are already reasonably separable, we do not expect much improvement from the LDA model over the logistic model. We nevertheless fit an LDA model and evaluate it on the test set.

Confusion Matrix and Statistics

                          Reference
Prediction                 neutral or dissatisfied satisfied
  neutral or dissatisfied                    10024      2322
  satisfied                                   2243     12685

               Accuracy : 0.8326
                 95% CI : (0.8281, 0.837)
    No Information Rate : 0.5502
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.662
 Mcnemar's Test P-Value : 0.2483
            Sensitivity : 0.8453
            Specificity : 0.8172
         Pos Pred Value : 0.8497
         Neg Pred Value : 0.8119
             Prevalence : 0.5502
         Detection Rate : 0.4651
   Detection Prevalence : 0.5473
      Balanced Accuracy : 0.8312
       'Positive' Class : satisfied

From these results we observe:
• An overall accuracy of 83.3% and a Kappa value of 0.66.
• A sensitivity of 0.845: the model correctly identifies about 85 percent of the satisfied customers.
• A specificity of 0.817: the model correctly identifies about 82 percent of the "neutral or dissatisfied" customers.
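A minimal LDA sketch following the appendix code (MASS::lda on the same normalized training data); the confusionMatrix call below uses the conventional predicted-then-reference argument order.

library(MASS)
library(caret)

# Fit the linear discriminant and predict the test-set classes
lda_model <- lda(Satisfaction ~ ., data = data_train)
lda_preds <- predict(lda_model, data_test)

# Evaluate with "satisfied" as the positive class
confusionMatrix(lda_preds$class, data_test$Satisfaction, positive = "satisfied")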
3. DECISION TREE – CART ALGORITHM

We use a decision tree algorithm, CART (Classification and Regression Trees), to classify the dependent variable from the set of independent variables. The algorithm repeatedly partitions the data into subsets so that the leaf nodes become as pure (homogeneous) as possible. We make sure that our categorical variables are in factor form, so that CART builds a classification tree rather than a regression tree. (The plotted tree is omitted here.)

The algorithm built a classification tree. The most valuable split is on "Inflight entertainment": observations rated 1-3 go to the left node, which leans towards the "neutral or dissatisfied" class and is split further on "Seat comfort". If seat comfort is also rated 1-3, the observation is finally classified as "neutral or dissatisfied"; otherwise it is classified as "satisfied". The tree continues in this way and ends in 7 leaf nodes. Note that the factor levels shown in the tree are normalized (normalized value (rating value): 0 (0), 0.2 (1), 0.4 (2), 0.6 (3), 0.8 (4), 1 (5)).

We use this model to predict the test dataset and inspect the confusion matrix to understand the model performance.

Confusion Matrix and Statistics

                          Reference
Prediction                 neutral or dissatisfied satisfied
  neutral or dissatisfied                    10638      1997
  satisfied                                   1708     12931

               Accuracy : 0.8642
                 95% CI : (0.86, 0.8682)
    No Information Rate : 0.5473
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.7264
 Mcnemar's Test P-Value : 2.229e-06
            Sensitivity : 0.8662
            Specificity : 0.8617
         Pos Pred Value : 0.8833
         Neg Pred Value : 0.8419
             Prevalence : 0.5473
         Detection Rate : 0.4741
   Detection Prevalence : 0.5367
      Balanced Accuracy : 0.8639
       'Positive' Class : satisfied

From these results we observe:
• An overall accuracy of 86.4% and a Kappa value of 0.73, so the model performs considerably better than a random prediction of the dependent variable.
• A sensitivity of 0.866: the model correctly identifies about 87 percent of the satisfied customers.
• A specificity of 0.862: the model correctly identifies about 86 percent of the "neutral or dissatisfied" customers.
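A compact CART sketch based on the appendix code; data_train_new and data_test_new are the train/test copies in which the rating columns have been re-coerced to factors so that rpart grows a classification tree.

library(rpart)
library(rpart.plot)
library(caret)

# Grow the classification tree and plot it
cart_model <- rpart(Satisfaction ~ ., data = data_train_new, method = "class")
rpart.plot(cart_model, digits = 2, type = 4, extra = 106)

# Predict the test set and evaluate
cart_preds <- predict(cart_model, data_test_new, type = "class")
confusionMatrix(cart_preds, data_test_new$Satisfaction, positive = "satisfied")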
4. NAÏVE BAYES

Finally, we fit a Naïve Bayes classifier. Here the satisfied class is coded 1 and the "neutral or dissatisfied" class 0.

Confusion Matrix and Statistics

        Reference
pc           1     0
   1     13071  1949
   0      1792 10329

               Accuracy : 0.8622
                 95% CI : (0.858, 0.8662)
    No Information Rate : 0.5476
    P-Value [Acc > NIR] : < 2e-16
                  Kappa : 0.7215
 Mcnemar's Test P-Value : 0.01076
            Sensitivity : 0.8794
            Specificity : 0.8413
         Pos Pred Value : 0.8702
         Neg Pred Value : 0.8522
             Prevalence : 0.5476
         Detection Rate : 0.4816
   Detection Prevalence : 0.5534
      Balanced Accuracy : 0.8603
       'Positive' Class : 1

From these results we observe:
• An overall accuracy of 86.2% and a Kappa value of 0.72, so the model performs considerably better than a random prediction of the dependent variable.
• A sensitivity of 0.879: the model correctly identifies about 88 percent of the satisfied customers, the highest among all the models.
• A specificity of 0.841: the model correctly identifies about 84 percent of the "neutral or dissatisfied" customers, which is lower than the CART model.
• The model performs well at predicting satisfied customers, but not quite as well for the neutral or dissatisfied customers.

MODEL COMPARISON

To compare the performance of the four models on common metrics, we extract the accuracy, Kappa statistic, sensitivity and specificity from each confusionMatrix() output.

              Logistic Regression   Linear Discriminant Analysis   Decision Tree   Naïve Bayes
Accuracy                0.8315979                       0.8326245       0.8641563        0.8622
Kappa                   0.6600094                       0.6620347       0.7264096        0.7215
Sensitivity             0.8481377                       0.8452722       0.8662245        0.8794
Specificity             0.8115989                       0.8171517       0.8616556        0.8413

From this summary we observe that:
• The logistic regression and LDA models provide similar performance metrics for our dataset.
• The decision tree provides better performance metrics than the other models, although there is little difference between Naïve Bayes and the decision tree.

Below we compare the models against one another.

1. LDA OVER THE LOGISTIC REGRESSION MODEL

As expected, we do not see a significant improvement in model performance, since the predicted classes are already reasonably separable and we have sufficient predictors. We only see a slight improvement (under one percentage point) in specificity, which measures how well the "neutral or dissatisfied" customers are identified. For our business problem it is critical to identify this negative class correctly, as the airline wants to find unsatisfied customers and reward them to improve their satisfaction.

2. DECISION TREE OVER LDA AND NAÏVE BAYES

We obtained a decent 3 percentage-point improvement in accuracy using the decision tree over logistic regression and LDA, but no significant difference was observed against the Naïve Bayes model. We also obtained improvements of roughly 5 and 2 percentage points in specificity over LDA and Naïve Bayes respectively, which is a critical metric for our business problem.
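The comparison table can be assembled programmatically from the caret confusionMatrix objects. The sketch below is a compact variant of the comp_cfm() helper in the appendix; it assumes the four confusion-matrix objects are named mr1 (logistic), mr2 (LDA), mr3 (CART) and mr4 (Naïve Bayes). The name mr4 is hypothetical, since the appendix evaluates Naïve Bayes in a separate script.

# Pull Accuracy, Kappa, Sensitivity and Specificity from each confusionMatrix object
extract_metrics <- function(cm){
  c(cm$overall[c("Accuracy", "Kappa")], cm$byClass[c("Sensitivity", "Specificity")])
}

round(sapply(list(LogisticRegression = mr1,
                  LinearDiscriminantAnalysis = mr2,
                  DecisionTree = mr3,
                  NaiveBayes = mr4),
             extract_metrics), 4)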
CONCLUSION

We conclude that the decision tree model built with the CART algorithm is the best-suited model for our dataset and business problem, providing an accuracy of 86 percent, with 86 percent of "satisfied" customers identified correctly (sensitivity) and 86 percent of "neutral or dissatisfied" customers identified correctly (specificity). We also observed, from both the exploratory analysis and the decision tree, that the inflight entertainment experience, seat comfort and online service levels significantly affect the customer experience, along with several of the other variables considered. Airline service companies must ensure a high quality of service on these parameters to achieve a high level of customer satisfaction.

None of the models classifies the customer satisfaction level perfectly; the prediction accuracy can be improved further using more advanced classification algorithms or deep learning approaches.
APPENDIX – R SOURCE CODE

# Loading libraries
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(e1071)
library(MASS)
library(corrplot)
library(ggplot2)
library(tidyr)
library(sparklyr)

# Setting working directory
setwd("D:/PGP BA Classes/Capstone Projects/3. Airplane Passenger Satisfaction/Final Submission")

# Loading data
flightdata <- read.csv("Flight data.csv", header = T)
surveydata <- read.csv("Surveydata.csv", header = T)
flightsurvey <- cbind(flightdata, surveydata[,-1])

# Filling blanks with NA
empty_as_na <- function(x){
  if("factor" %in% class(x)) x <- as.character(x)
  ifelse(as.character(x)!="", x, NA)
}

## transform all columns
flightsurvey <- flightsurvey %>% mutate_each(list(empty_as_na))
str(flightsurvey)

# Dealing with NA values
# NA's count
sapply(flightsurvey, function(x) sum(is.na(x)))

# Missing value imputation with mode
# For CustomerType
ct <- unique(flightsurvey$CustomerType[!is.na(flightsurvey$CustomerType)])
CustomerType <- ct[which.max(tabulate(match(flightsurvey$CustomerType, ct)))]
CustomerType
flightsurvey$CustomerType[is.na(flightsurvey$CustomerType)] <- CustomerType

# For TypeTravel
tt <- unique(flightsurvey$TypeTravel[!is.na(flightsurvey$TypeTravel)])
TypeTravel <- tt[which.max(tabulate(match(flightsurvey$TypeTravel, tt)))]
TypeTravel
flightsurvey$TypeTravel[is.na(flightsurvey$TypeTravel)] <- TypeTravel

# For Departure.Arrival.time_convenient
datc <- unique(flightsurvey$Departure.Arrival.time_convenient[!is.na(flightsurvey$Departure.Arrival.time_convenient)])
Departure.Arrival.time_convenient <- datc[which.max(tabulate(match(flightsurvey$Departure.Arrival.time_convenient, datc)))]
Departure.Arrival.time_convenient
flightsurvey$Departure.Arrival.time_convenient[is.na(flightsurvey$Departure.Arrival.time_convenient)] <- Departure.Arrival.time_convenient

# For Food_drink
fd <- unique(flightsurvey$Food_drink[!is.na(flightsurvey$Food_drink)])
Food_drink <- fd[which.max(tabulate(match(flightsurvey$Food_drink, fd)))]
Food_drink
flightsurvey$Food_drink[is.na(flightsurvey$Food_drink)] <- Food_drink

# For Onboard_service
os <- unique(flightsurvey$Onboard_service[!is.na(flightsurvey$Onboard_service)])
Onboard_service <- os[which.max(tabulate(match(flightsurvey$Onboard_service, os)))]
Onboard_service
flightsurvey$Onboard_service[is.na(flightsurvey$Onboard_service)] <- Onboard_service

# Converting categorical ratings into integers
flightsurvey$Seat_comfort <- factor(flightsurvey$Seat_comfort, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Seat_comfort <- as.integer(factor(flightsurvey$Seat_comfort))

flightsurvey$Departure.Arrival.time_convenient <- factor(flightsurvey$Departure.Arrival.time_convenient, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Departure.Arrival.time_convenient <- as.integer(factor(flightsurvey$Departure.Arrival.time_convenient))

flightsurvey$Food_drink <- factor(flightsurvey$Food_drink, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Food_drink <- as.integer(factor(flightsurvey$Food_drink))

flightsurvey$Gate_location <- factor(flightsurvey$Gate_location, levels=c("very inconvinient","Inconvinient","need improvement","manageable","Convinient","very convinient"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Gate_location <- as.integer(factor(flightsurvey$Gate_location))

flightsurvey$Inflightwifi_service <- factor(flightsurvey$Inflightwifi_service, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Inflightwifi_service <- as.integer(factor(flightsurvey$Inflightwifi_service))

flightsurvey$Inflight_entertainment <- factor(flightsurvey$Inflight_entertainment, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Inflight_entertainment <- as.integer(factor(flightsurvey$Inflight_entertainment))

flightsurvey$Online_support <- factor(flightsurvey$Online_support, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Online_support <- as.integer(factor(flightsurvey$Online_support))

flightsurvey$Ease_of_Onlinebooking <- factor(flightsurvey$Ease_of_Onlinebooking, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Ease_of_Onlinebooking <- as.integer(factor(flightsurvey$Ease_of_Onlinebooking))

flightsurvey$Onboard_service <- factor(flightsurvey$Onboard_service, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Onboard_service <- as.integer(factor(flightsurvey$Onboard_service))

flightsurvey$Leg_room_service <- factor(flightsurvey$Leg_room_service, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Leg_room_service <- as.integer(factor(flightsurvey$Leg_room_service))

flightsurvey$Baggage_handling <- factor(flightsurvey$Baggage_handling, levels=c("poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4), ordered = T)
flightsurvey$Baggage_handling <- as.integer(factor(flightsurvey$Baggage_handling))

flightsurvey$Checkin_service <- factor(flightsurvey$Checkin_service, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Checkin_service <- as.integer(factor(flightsurvey$Checkin_service))

flightsurvey$Cleanliness <- factor(flightsurvey$Cleanliness, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Cleanliness <- as.integer(factor(flightsurvey$Cleanliness))

flightsurvey$Online_boarding <- factor(flightsurvey$Online_boarding, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Online_boarding <- as.integer(factor(flightsurvey$Online_boarding))

str(flightsurvey)

# Drop ID attribute
data <- flightsurvey[-1]

# Finding correlation
data.non <- data %>% filter(!is.na(data$ArrivalDelayin_Mins))
for(i in 1:ncol(data.non)){
  if(class(data[,i])!="integer"){
    data.non[,i] <- as.integer(data.non[,i])
  }
}
data.cor <- round(cor(data.non[,1:7], data.non[,8]), 3)
colnames(data.cor) <- "ArrivalDelayin_Mins"
data.cor

# Removing highly correlated attribute "Arrival.Delay.in.Minutes"
data <- data[-8]

# Exploratory analysis
ggplot(data, aes(Satisfaction, group = Gender)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Satisfaction Level", y = "Percentage", title="Gender vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Gender) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Class)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Class of Travel", y = "Percentage", title="Travel Class vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Class) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Seat_comfort)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Seat Comfort Level", y = "Percentage", title="Seat Comfort vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Seat_comfort) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Inflight_entertainment)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Inflight entertainment Level", y = "Percentage", title="Inflight Entertainment vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Inflight_entertainment) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Online_support)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Online support Level", y = "Percentage", title="Online Support vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Online_support) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Baggage_handling)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Baggage Handling Level", y = "Percentage", title="Baggage Handling vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Baggage_handling) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

# Data modelling
round(prop.table(table(data$Satisfaction))*100, digit=1)

for(i in 2:ncol(data)){
  if(class(data[,i])=="factor"){
    data[,i] <- as.integer(data[,i])
  }
}
normalize <- function(x){
  return ((x-min(x))/(max(x)-min(x)))
}

#write.csv(data, file = "Data.csv")
data <- read.csv("Data.csv", header = T)
data <- data[,-1]
str(data)

# Converting categorical to int
# Gender
levels(data$Gender)
data$Gender <- factor(data$Gender, levels=c("Female","Male"), labels=c(0,1))
data$Gender <- as.integer(factor(data$Gender))

# CustomerType
levels(data$CustomerType)
data$CustomerType <- factor(data$CustomerType, levels=c("disloyal Customer","Loyal Customer"), labels=c(0,1), ordered = T)
data$CustomerType <- as.integer(factor(data$CustomerType))

# TypeTravel
levels(data$TypeTravel)
data$TypeTravel <- factor(data$TypeTravel, levels=c("Business travel","Personal Travel"), labels=c(0,1))
data$TypeTravel <- as.integer(factor(data$TypeTravel))

# Class
levels(data$Class)
data$Class <- factor(data$Class, levels=c("Eco","Eco Plus","Business"), labels=c(0,1,2), ordered = T)
data$Class <- as.integer(factor(data$Class))

data_n <- as.data.frame(lapply(data[2:22], normalize))
data_n <- cbind(data_n, Satisfaction=data$Satisfaction)

set.seed(5)
partition <- createDataPartition(y = data_n$Satisfaction, p=0.7, list= F)
data_train <- data_n[partition,]
data_test <- data_n[-partition,]
# write.csv(data_train, file = "data_train.csv")
# write.csv(data_test, file = "data_test.csv")

# Logistic regression
log_model <- glm(Satisfaction~., data=data_train, family = "binomial")
summary(log_model)

log_preds <- predict(log_model, data_test[,1:22], type = "response")
head(log_preds)

log_class <- array(c(99))
for (i in 1:length(log_preds)){
  if(log_preds[i]>0.5){
    log_class[i] <- "satisfied"
  }else{
    log_class[i] <- "neutral or dissatisfied"
  }
}
# Creating a new dataframe containing the actual and predicted values
log_result <- data.frame(Actual = data_test$Satisfaction, Prediction = log_class)
mr1 <- confusionMatrix(as.factor(log_class), data_test$Satisfaction, positive = "satisfied")
mr1

# Linear discriminant analysis
pairs(data_train[,1:5], main="Predict ", pch=22, bg=c("red", "blue")[unclass(data_train$Satisfaction)])

lda_model <- lda(Satisfaction ~ ., data_train)
lda_model

# Predict with the model
lda_preds <- predict(lda_model, data_test)
mr2 <- confusionMatrix(data_test$Satisfaction, lda_preds$class, positive = "satisfied")
mr2

# Decision tree - CART algorithm
data_train_new <- data_train
for(i in c(1,2,4,5,8,9,10,11,12,13,14,15,16,17,18,19,20,21)){
  data_train_new[,i] <- as.factor(data_train_new[,i])
}

data_test_new <- data_test
for(i in c(1,2,4,5,8,9,10,11,12,13,14,15,16,17,18,19,20,21)){
  data_test_new[,i] <- as.factor(data_test_new[,i])
}
str(data_train_new)
str(data_test_new)

cart_model <- rpart(Satisfaction~ ., data_train_new, method="class")
cart_model
rpart.plot(cart_model, digits = 2, type=4, extra=106)

cm1 <- predict(cart_model, data_test_new, type = "class")
mr3 <- confusionMatrix(cm1, data_test$Satisfaction, positive = "satisfied")
mr3

# Model comparison
comp_cfm <- function(x1,x2,x3,n1,n2,n3){
  a <- data.frame()
  b <- as.matrix(c(x1[[3]][1:2],x1[[4]][1:2]))
  c <- as.matrix(c(x2[[3]][1:2],x2[[4]][1:2]))
  d <- as.matrix(c(x3[[3]][1:2],x3[[4]][1:2]))
  colnames(b) <- n1
  colnames(c) <- n2
  colnames(d) <- n3
  a <- as.data.frame(cbind(b,c,d))
  return(a)
}
r <- comp_cfm(mr1, mr2, mr3, "LogisticRegression", "LinearDiscriminantAnalysis", "DecisionTree")
r

## Capstone project data and Naive Bayes model
setwd("D:/PGP BA Classes/Capstone Projects/3. Airplane Passenger Satisfaction/Final Submission")
survey <- read.csv("Surveydata.csv")
flight <- read.csv("Flight data.csv")

## Combine data set
FSdata <- cbind(flight[-1], survey[-1])
str(FSdata)
colnames(FSdata)

library(dplyr)
## filling "Blanks" with "NA"
empty_as_na <- function(x) {
  if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
  ifelse(as.character(x)!="", x, NA)
}
## transform all columns
FSdata <- FSdata %>% mutate_each(list(empty_as_na))
FSdata

## Missing value imputation for CustomerType
val1 <- unique(FSdata$CustomerType[!is.na(FSdata$CustomerType)])
mode1 <- val1[which.max(tabulate(match(FSdata$CustomerType, val1)))]
mode1
FSdata$CustomerType[is.na(FSdata$CustomerType)] <- mode1

## Missing value imputation for TypeTravel
val2 <- unique(FSdata$TypeTravel[!is.na(FSdata$TypeTravel)])
mode2 <- val2[which.max(tabulate(match(FSdata$TypeTravel, val2)))]
mode2
FSdata$TypeTravel[is.na(FSdata$TypeTravel)] <- mode2

## Missing value imputation for Departure.Arrival.time_convenient
val3 <- unique(FSdata$Departure.Arrival.time_convenient[!is.na(FSdata$Departure.Arrival.time_convenient)])
mode3 <- val3[which.max(tabulate(match(FSdata$Departure.Arrival.time_convenient, val3)))]
mode3
FSdata$Departure.Arrival.time_convenient[is.na(FSdata$Departure.Arrival.time_convenient)] <- mode3

## Missing value imputation for Food_drink
val4 <- unique(FSdata$Food_drink[!is.na(FSdata$Food_drink)])
mode4 <- val4[which.max(tabulate(match(FSdata$Food_drink, val4)))]
mode4
FSdata$Food_drink[is.na(FSdata$Food_drink)] <- mode4

## Missing value imputation for Onboard_service
val5 <- unique(FSdata$Onboard_service[!is.na(FSdata$Onboard_service)])
mode5 <- val5[which.max(tabulate(match(FSdata$Onboard_service, val5)))]
mode5
FSdata$Onboard_service[is.na(FSdata$Onboard_service)] <- mode5

## deleting the variable "Arrival Delay in Mins" due to collinearity
FSdata <- FSdata[-8]
colnames(FSdata)
write.csv(FSdata, file = "FSdata.csv")
sum(is.na(FSdata))

## Binning
library(rbin)
FSdata$Age <- cut(FSdata$Age, c(0,20,35,55,90))
table(FSdata$Age)
prop.table(table(FSdata$Age))
levels(FSdata$Age)

## Binning Flight_Distance
FSdata$Flight_Distance <- cut(FSdata$Flight_Distance,
  c(0,200,400,600,800,1000,1200,1400,1600,1800,2000,2200,2400,2600,2800,3000,3200,3400,3600,3800,4000,4200,4400,4600,4800,5000,5200,5400,5600,5800,6000))
table(FSdata$Flight_Distance)
prop.table(table(FSdata$Flight_Distance))
levels(FSdata$Flight_Distance)

## Binning DepartureDelayin_Mins (first 2 hrs every 10 mins, next 3 hrs every 15 mins,
## next 5 hrs every 30 mins, next 6 hrs every 60 mins, and beyond 6 hrs in the last bucket)
FSdata$DepartureDelayin_Mins <- cut(FSdata$DepartureDelayin_Mins,
  c(0,10,20,30,40,50,60,70,80,90,100,110,120,135,150,165,180,195,210,225,240,255,270,285,300,330,360,390,420,450,480,510,540,570,600,660,720,780,840,900,960,1620))
table(FSdata$DepartureDelayin_Mins)
prop.table(table(FSdata$DepartureDelayin_Mins))

## Converting 0 as string
levels <- levels(FSdata$DepartureDelayin_Mins)
levels[length(levels)+1] <- "0"
FSdata$DepartureDelayin_Mins <- factor(FSdata$DepartureDelayin_Mins, levels = levels)
FSdata$DepartureDelayin_Mins[is.na(FSdata$DepartureDelayin_Mins)] <- "0"
str(FSdata)

## Converting character variables to factor
FSdata$Gender <- factor(FSdata$Gender, c("Female","Male"), labels = c(0,1))
FSdata$CustomerType <- factor(FSdata$CustomerType, c("Loyal Customer","disloyal Customer"), labels = c(1,0))
FSdata$TypeTravel <- factor(FSdata$TypeTravel, c("Personal Travel", "Business travel"), labels = c(0,1))
FSdata$Class <- factor(FSdata$Class, c("Eco","Eco Plus","Business"), labels = c(1:3))
FSdata$Food_drink <- factor(FSdata$Food_drink, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Seat_comfort <- factor(FSdata$Seat_comfort, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Departure.Arrival.time_convenient <- factor(FSdata$Departure.Arrival.time_convenient, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Gate_location <- factor(FSdata$Gate_location, c("very inconvinient","Inconvinient","need improvement","manageable","Convinient","very convinient"), labels = c(0:5))
FSdata$Inflightwifi_service <- factor(FSdata$Inflightwifi_service, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Inflight_entertainment <- factor(FSdata$Inflight_entertainment, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Online_support <- factor(FSdata$Online_support, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Ease_of_Onlinebooking <- factor(FSdata$Ease_of_Onlinebooking, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Onboard_service <- factor(FSdata$Onboard_service, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Leg_room_service <- factor(FSdata$Leg_room_service, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Baggage_handling <- factor(FSdata$Baggage_handling, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Checkin_service <- factor(FSdata$Checkin_service, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Cleanliness <- factor(FSdata$Cleanliness, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Online_boarding <- factor(FSdata$Online_boarding, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Satisfaction <- factor(FSdata$Satisfaction, c("satisfied","neutral or dissatisfied"), labels=c(1,0))

levels(FSdata$Cleanliness)
str(FSdata)
## Split the data in train & test sample (70:30)
library(caret)
set.seed(123)
id <- sample(2, nrow(FSdata), prob = c(.7,.3), replace = T)
training <- FSdata[id==1,]
testing <- FSdata[id==2,]
c(nrow(training), nrow(testing))
str(testing)

## Rows and columns
dim(training)
dim(testing)

## Column names
colnames(training)
colnames(testing)

## Naive Bayes model
library(e1071)
library(caret)
library(AUC)
model.bayes <- naiveBayes(Satisfaction~., data = training)
model.bayes

## test
pc <- NULL
pc <- predict(model.bayes, testing, type = "class")
summary(pc)
confusionMatrix(table(pc, testing$Satisfaction))

## Lift chart
pb <- NULL
pb <- predict(model.bayes, testing, type = "raw")
pb <- as.data.frame(pb)
pred.bayes <- data.frame(testing$Satisfaction, pb$`1`)
colnames(pred.bayes) <- c("target","score")
lift.bayes <- lift(target~score, data=pred.bayes, cuts = 10, class = "1")
xyplot(lift.bayes, main="Bayesian Classifier - Lift Chart", type=c("1","g"), lwd=2,
       x.scales=list(x=list(alternative=FALSE, tick.number=10),
                     y=list(alternating=FALSE, tick.number=10)))