“US - Falcon Airline Passenger Satisfaction”
Submitted in Partial Fulfillment of the Requirements for the Award of the Certificate of
Post Graduate Program in Business Analytics
Capstone Project Report
Submitted to
Submitted by Group 7
Pratiti Ray Chaudhury
Bajantry Harshadeep
Mithilesh Kumar Yadav
Sravan Kolluru
Under the guidance of
(Ms. Chaitali Dixit)
Batch: PGPBA.H.Jan’19
Year of Completion: July 2019
CERTIFICATE
This is to certify that the participants Pratiti Ray Chaudhury, Bajantry Harshadeep, Mithilesh Kumar Yadav and
Sravan Kolluru, students of Great Lakes Institute of Management, have successfully completed their
project on “US - Falcon Airline Passenger Satisfaction”.
This project is the record of authentic work carried out by them during the academic year 2018-2019.
Mentor’s Name and Sign Program Director
Name :
Date :
Place: Hyderabad
CONTENTS
INTRODUCTION
PROBLEM STATEMENT
DATA ABSTRACT
DATA PREPARATION
EXPLORATORY ANALYSIS
DATA MODELLING
Data Processing and Normalization
Training and Test Datasets
Models
1. Logistic Regression
2. Linear Discriminant Analysis
3. Decision Tree – CART Algorithm
4. Naïve Bayes
MODEL COMPARISON
1. LDA over Logistic Regression Model
2. Decision Tree over LDA and Naïve Bayes
CONCLUSION
APPENDIX – R SOURCE CODE
INTRODUCTION
In the aviation industry, high customer satisfaction is a key factor in running the business: the market is
highly competitive and customer loyalty shifts with even small changes in service. Companies therefore
need to understand their customers' needs and deliver unparalleled experiences in order to retain them.

Using customer satisfaction data obtained from Falcon Airlines, we attempt to understand why customers
are satisfied or not. Based on that, the airline can make improvements to provide better service. As part
of the analysis, we also identify several factors that improve the level of customer satisfaction.

In this era of social networking, customers turn to Twitter, Facebook, blogs or Google Reviews to review
or share their experience on the internet. A single miserable customer experience, once shared, can go
viral and seriously damage the brand or goodwill of the company.

To arrest such unwanted viral trends and to avoid heavy compensation or penalties, companies in the
service industry measure customer satisfaction through rating cards or surveys after the service is
delivered, and compensate dissatisfied customers when the fault lies with the company's service. Such
reward or compensation programs have effectively reduced customer complaints.
PROBLEM STATEMENT
In our scenario, the airline company wants to identify a customer's satisfaction level based on his or her
ratings of various aspects of the airline experience. Hence, we primarily build a model to classify the
customer satisfaction level: specifically, we classify whether a customer is satisfied or not, and
accordingly try to find the factors associated with high satisfaction.
DATA ABSTRACT
The data we obtained from Kaggle represent US airline satisfaction data consisting of 129,880
observations and 23 attributes. The satisfaction level, which is our dependent variable, is
represented as a factor ("satisfied" and "neutral or dissatisfied").
Variable | Description | Values
Satisfaction (dependent variable) | Airline satisfaction level | satisfied, neutral or dissatisfied
Age | Actual age of the passenger |
Gender | Gender of the passenger | Female, Male
Type of Travel | Purpose of the passenger's flight | Personal Travel, Business Travel
Class | Travel class in the plane | Business, Eco, Eco Plus
Customer Type | The customer type | Loyal customer, disloyal customer
Flight Distance | Flight distance of the journey |
Inflight WIFI Service | Satisfaction level of the inflight wifi service | Rating: 0 (least) - 5 (highest)
Ease of Online Booking | Satisfaction level of online booking | Rating: 0 (least) - 5 (highest)
Online Boarding | Satisfaction level of online boarding | Rating: 0 (least) - 5 (highest)
Inflight Entertainment | Satisfaction level of inflight entertainment | Rating: 0 (least) - 5 (highest)
Food and Drink | Satisfaction level of food and drink | Rating: 0 (least) - 5 (highest)
Seat Comfort | Satisfaction level of seat comfort | Rating: 0 (least) - 5 (highest)
On-board Service | Satisfaction level of on-board service | Rating: 0 (least) - 5 (highest)
Leg Room Service | Satisfaction level of leg room service | Rating: 0 (least) - 5 (highest)
Departure/Arrival Time Convenient | Satisfaction level of departure/arrival time convenience | Rating: 0 (least) - 5 (highest)
Baggage Handling | Satisfaction level of baggage handling | Rating: 0 (least) - 5 (highest)
Gate Location | Satisfaction level of gate location | Rating: 0 (least) - 5 (highest)
Cleanliness | Satisfaction level of cleanliness | Rating: 0 (least) - 5 (highest)
Check-in Service | Satisfaction level of check-in service | Rating: 0 (least) - 5 (highest)
Departure Delay in Minutes | Minutes delayed at departure |
Arrival Delay in Minutes | Minutes delayed at arrival |
Online Support | Satisfaction level of online support | Rating: 0 (least) - 5 (highest)
DATA PREPARATION
We import the dataset into the R environment and observe the structure of the data loaded.
## 'data.frame': 90917 obs. of 25 variables:
##  $ ID                               : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Gender                           : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 2 2 1 1 ...
##  $ CustomerType                     : Factor w/ 3 levels "","disloyal Customer",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Age                              : int 65 15 60 70 30 66 10 22 58 34 ...
##  $ TypeTravel                       : Factor w/ 3 levels "","Business travel",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ Class                            : Factor w/ 3 levels "Business","Eco",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Flight_Distance                  : int 265 2138 623 354 1894 227 1812 1556 104 3633 ...
##  $ DepartureDelayin_Mins            : int 0 0 0 0 0 17 0 30 47 0 ...
##  $ ArrivalDelayin_Mins              : int 0 0 0 0 0 15 0 26 48 0 ...
##  $ ID                               : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Satisfaction                     : Factor w/ 2 levels "neutral or dissatisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Seat_comfort                     : Factor w/ 6 levels "acceptable","excellent",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Departure.Arrival.time_convenient: Factor w/ 7 levels "","acceptable",..: 4 4 1 4 4 4 4 1 4 4 ...
##  $ Food_drink                       : Factor w/ 7 levels "","acceptable",..: 4 4 4 4 4 1 1 4 4 4 ...
##  $ Gate_location                    : Factor w/ 6 levels "Convinient","Inconvinient",..: 4 3 3 3 3 3 3 3 3 1 ...
##  $ Inflightwifi_service             : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 4 5 5 5 5 1 5 ...
##  $ Inflight_entertainment           : Factor w/ 6 levels "acceptable","excellent",..: 4 3 4 1 3 2 3 3 1 3 ...
##  $ Online_support                   : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 4 5 2 5 5 1 5 ...
##  $ Ease_of_Onlinebooking            : Factor w/ 6 levels "acceptable","excellent",..: 1 5 6 5 5 2 5 5 1 5 ...
##  $ Onboard_service                  : Factor w/ 7 levels "","acceptable",..: 2 1 7 6 3 3 2 6 2 2 ...
##  $ Leg_room_service                 : Factor w/ 6 levels "acceptable","excellent",..: 3 1 3 3 4 3 1 4 3 5 ...
##  $ Baggage_handling                 : Factor w/ 5 levels "acceptable","excellent",..: 1 3 5 4 2 2 3 2 5 2 ...
##  $ Checkin_service                  : Factor w/ 6 levels "acceptable","excellent",..: 2 4 4 4 2 2 2 1 5 5 ...
##  $ Cleanliness                      : Factor w/ 6 levels "acceptable","excellent",..: 1 4 6 5 4 2 4 4 1 2 ...
##  $ Online_boarding                  : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 2 5 1 5 5 2 5 ...
We observe that the loaded dataset has 90,917 observations and 24 attributes (the ID column appears twice
in the combined data, giving the 25 variables shown in the structure output above). The attribute
"Satisfaction" represents the satisfaction level of the customer at two levels: "neutral or dissatisfied" and
"satisfied". We drop the ID attribute, as it is not required in our analysis or model building.
Hence, we have 1 dependent variable and 22 independent variables in our dataset. We check the dataset
for any NA values, which must be dealt with before we proceed with building the model.
## ArrivalDelayin_Min
## 284
We observe 284 NA values in the attribute "ArrivalDelayin_Mins". Let us analyze the dataset
further to check whether any other variables contain missing values apart from this one. To do so, we
recode the blank entries as NA and count the NA values in each attribute of the dataset, as sketched below.
The following attributes of the original dataset turn out to contain missing values:

CustomerType                         9099
TypeTravel                           9088
Departure.Arrival.time_convenient    8244
Food_drink                           8181
Onboard_service                      7179
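A minimal sketch of this step, mirroring the appendix code (flightsurvey is the combined flight and survey data frame):

# Recode blank factor entries as NA, then count the missing values per column
library(dplyr)
empty_as_na <- function(x){
  if("factor" %in% class(x)) x <- as.character(x)   # ifelse does not work directly on factors
  ifelse(as.character(x) != "", x, NA)
}
flightsurvey <- flightsurvey %>% mutate_each(list(empty_as_na))   # across() in newer dplyr versions
sapply(flightsurvey, function(x) sum(is.na(x)))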
Having recoded the blank entries of these attributes as NA, we confirm that several attributes contain
missing values. We can impute these NA values (for example with the most frequent value of each
attribute, as done in the appendix code) or omit the affected rows; however, before proceeding any
further, let us analyze the correlation of "ArrivalDelayin_Mins" with the other variables.
Correlation with ArrivalDelayin_Mins:

Gender                     NA
CustomerType               NA
Age                    -0.010
TypeTravel                 NA
Class                      NA
Flight_Distance         0.108
DepartureDelayin_Mins   0.966

[Correlation plot of all attributes omitted; scale from -1 to +1.]
We observe a very high correlation (0.966) between DepartureDelayin_Mins and ArrivalDelayin_Mins.
Fortunately, instead of replacing or omitting the NA values in "ArrivalDelayin_Mins", we can exclude the
variable entirely, as a closely related variable with no NA values is already available in our dataset to
proceed with our modeling.
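A sketch of this correlation check, following the appendix code (data is the combined dataset with the ID column dropped, so ArrivalDelayin_Mins is column 8):

# Correlate the first seven attributes with ArrivalDelayin_Mins on complete cases only
data.non <- data %>% filter(!is.na(ArrivalDelayin_Mins))
for(i in 1:ncol(data.non)){                     # coerce factor columns to integer codes
  if(class(data.non[,i]) != "integer") data.non[,i] <- as.integer(data.non[,i])
}
data.cor <- round(cor(data.non[,1:7], data.non[,8]), 3)
colnames(data.cor) <- "ArrivalDelayin_Mins"
data.cor
data <- data[-8]   # drop ArrivalDelayin_Mins: r = 0.966 with DepartureDelayin_Mins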
EXPLORATORY ANALYSIS
After observing the impact of several independent variables on the satisfaction level, we
present below a few visualizations showing the level of satisfaction at different factor levels of
the variables "Gender", "Class", "Seat Comfort", "Inflight Entertainment", "Online Support" and
"Baggage Handling".

From an inferential perspective, these variables are considered the most significant in terms of
airline customer satisfaction, which the following visualizations support.
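As an illustration, the following sketch (taken from the appendix code) produces the Travel Class vs Satisfaction plot; the remaining plots are built the same way with a different grouping variable:

library(ggplot2)
ggplot(data, aes(Satisfaction, group = Class)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count") +     # proportions within each class
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..),
            stat = "count", vjust = -0.5, cex = 3) +                      # percentage labels on the bars
  labs(x = "Class of Travel", y = "Percentage",
       title = "Travel Class vs Satisfaction", fill = "Satisfaction") +
  scale_y_continuous(labels = scales::percent) +
  facet_grid(~Class)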
We observe that customers traveling in Business Class are more satisfied than customers traveling in the
Economy classes.

We observe that female customers are comparatively more satisfied than male customers: about 65% of
female customers are satisfied against 34% who are not.

We observe that Seat Comfort has a significant effect on the customer satisfaction level: customers who
rate seat comfort 5 are 99 percent satisfied.
We observe that Baggage Handling has a significant effect on the customer satisfaction level: customers
who rate baggage handling 5 are 73 percent satisfied.

We observe that Inflight Entertainment has a significant effect on the customer satisfaction level:
customers who rate inflight entertainment 5 are 95 percent satisfied.

We observe that Online Support has a significant effect on the customer satisfaction level: customers
who rate online support 0 are 100 percent dissatisfied.
DATA MODELLING
We build models to predict the satisfaction level of a customer based on the independent
variables in our dataset. This will help the airline company understand the satisfaction level
of its customers and take the necessary actions to improve customer satisfaction.
We build four models to predict the satisfaction level, namely
• Logistic regression
• Linear discriminant analysis
• Decision tree (CART) and
• Naïve Bayes.
DATA PROCESSING AND NORMALIZATION
Initially, we check for class bias to ensure our models are not skewed towards one class.
We observe there is no class bias, as the two classes are approximately equally distributed in
the dataset (see the distribution below).

Before we begin modeling, we convert the factor variables in the dataset (except for our dependent
variable) into numeric type, as required for model building and further analysis. We also observe that a
few of the variables are on different scales, which could mislead our modeling techniques; hence, we
normalize the data (a sketch of this step follows the class distribution below).
## neutral or dissatisfied               satisfied
##                    45.3                    54.7
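A minimal sketch of the conversion and min-max normalization step, assuming all predictors have already been converted to numeric (the appendix code applies the same function to columns 2 to 22 of the data):

# Min-max normalization: rescale every predictor to the 0-1 range
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))
}
data_n <- as.data.frame(lapply(data[ , setdiff(names(data), "Satisfaction")], normalize))
data_n$Satisfaction <- data$Satisfaction   # keep the response as an unscaled factor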
TRAINING AND TEST DATASETS
Now that we have the normalized dataset, we divide it into training and testing datasets in a
70:30 ratio. This will help us assess the performance of our models.
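The split is done with caret's createDataPartition(), which stratifies on the response (a sketch of the appendix code):

library(caret)
set.seed(5)
partition  <- createDataPartition(y = data_n$Satisfaction, p = 0.7, list = FALSE)
data_train <- data_n[partition, ]    # 70% of observations, stratified by Satisfaction
data_test  <- data_n[-partition, ]   # remaining 30% held out for evaluation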
MODELS
1. LOGISTIC REGRESSION
To build a classification model to predict satisfaction, we start with the go-to classification
model: logistic regression. Here, we use R's glm() function with the binomial family to predict
"Satisfaction" from the set of independent variables available.
Call:
glm(formula = Satisfaction ~ ., family = "binomial", data = data_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8657 -0.5916 0.2056 0.5355 3.5814
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.54608 0.08558 -88.171 < 2e-16 ***
Gender -0.91631 0.02315 -39.576 < 2e-16 ***
CustomerType 1.75680 0.03491 50.323 < 2e-16 ***
Age -0.40833 0.06229 -6.556 5.54e-11 ***
TypeTravel -0.58477 0.03103 -18.847 < 2e-16 ***
Class 0.86129 0.02858 30.136 < 2e-16 ***
Flight_Distance -0.65794 0.08321 -7.907 2.64e-15 ***
DepartureDelayin_Mins -7.54371 0.49612 -15.205 < 2e-16 ***
Seat_comfort 1.34953 0.06315 21.371 < 2e-16 ***
Departure.Arrival.time_convenient -0.82836 0.04684 -17.686 < 2e-16 ***
Food_drink -1.18942 0.06470 -18.385 < 2e-16 ***
Gate_location 0.46966 0.05299 8.862 < 2e-16 ***
Inflightwifi_service -0.44060 0.06243 -7.058 1.69e-12 ***
Inflight_entertainment 3.60675 0.05847 61.687 < 2e-16 ***
Online_support 0.43647 0.06368 6.854 7.18e-12 ***
Ease_of_Onlinebooking 1.38364 0.08170 16.935 < 2e-16 ***
Onboard_service 1.36192 0.05790 23.521 < 2e-16 ***
Leg_room_service 1.15122 0.04932 23.341 < 2e-16 ***
Baggage_handling 0.38854 0.05233 7.424 1.13e-13 ***
Checkin_service 1.44105 0.04868 29.601 < 2e-16 ***
Cleanliness 0.37903 0.06767 5.602 2.12e-08 ***
Online_boarding 0.68344 0.06994 9.772 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 87657 on 63642 degrees of freedom
Residual deviance: 50030 on 63621 degrees of freedom
AIC: 50074
Number of Fisher Scoring iterations: 5
From the above summary statistics of the fitted model, we observe the coefficient estimates and the
importance of each variable from its p-value. We observe that all the independent variables are
significant in predicting "Satisfaction".

We then predict on the test dataset using the model we built. Note that the predictions are obtained as
probabilities; hence, by applying a cutoff of 0.5 (the commonly used default value), we classify the
predicted satisfaction as "satisfied" or "neutral or dissatisfied". Let us build the confusion
matrix from the predicted and observed values to assess the performance of the model.
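These steps are sketched below (equivalent to the appendix code, with the classification loop replaced by a vectorised ifelse()):

log_model <- glm(Satisfaction ~ ., data = data_train, family = "binomial")
log_preds <- predict(log_model, data_test, type = "response")                  # predicted probabilities
log_class <- ifelse(log_preds > 0.5, "satisfied", "neutral or dissatisfied")   # apply the 0.5 cutoff
confusionMatrix(as.factor(log_class), data_test$Satisfaction, positive = "satisfied")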
Confusion Matrix and Statistics
Reference
Prediction neutral or dissatisfied satisfied
neutral or dissatisfied 10020 2267
satisfied 2326 12661
Accuracy : 0.8316
95% CI : (0.8271, 0.836)
No Information Rate : 0.5473
P-Value [Acc > NIR] : <2e-16
Kappa : 0.66
Mcnemar's Test P-Value : 0.3921
Sensitivity : 0.8481
Specificity : 0.8116
Pos Pred Value : 0.8448
Neg Pred Value : 0.8155
Prevalence : 0.5473
Detection Rate : 0.4642
Detection Prevalence : 0.5495
Balanced Accuracy : 0.8299
'Positive' Class : satisfied
From the above results, we can observe that:
• An overall accuracy of 83.2% and a Kappa value of 0.66. This means our model performs better than a
random prediction of the dependent variable.
• A sensitivity of 0.848, indicating that the model correctly identifies 84.8% of the satisfied
customers.
• A specificity of 0.812, indicating that the model correctly identifies 81.2% of the “neutral or
dissatisfied” customers.
2. LINEAR DISCRIMINANT ANALYSIS
One significant difference between logistic regression and LDA is that when the groups in the
dataset are clearly separated, the parameter estimates of logistic regression become unstable, and the
LDA model can therefore provide a better fit. Let us fit the model and observe the predicted class separability.
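A sketch of the LDA fit, mirroring the appendix code (MASS provides lda()):

library(MASS)
lda_model <- lda(Satisfaction ~ ., data = data_train)   # fit the linear discriminant
lda_model                                               # prints the coefficients shown below
lda_preds <- predict(lda_model, data_test)              # predicted classes and posterior probabilities
confusionMatrix(data_test$Satisfaction, lda_preds$class, positive = "satisfied")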
Coefficients of linear discriminants:
LD1
Gender -0.5345605
CustomerType 1.0790306
Age -0.2755955
TypeTravel -0.3446182
Class 0.5061304
Flight_Distance -0.3634092
DepartureDelayin_Mins -3.6944958
Seat_comfort 0.9148241
Departure.Arrival.time_convenient -0.5197594
Food_drink -0.7096648
Gate_location 0.2262170
Inflightwifi_service -0.2303117
Inflight_entertainment 2.1898107
Online_support 0.2880001
Ease_of_Onlinebooking 0.8540892
Onboard_service 0.7833862
Leg_room_service 0.6433999
Baggage_handling 0.2037793
Checkin_service 0.8032809
Cleanliness 0.2146604
Online_boarding 0.3705444
As the “satisfied” and “neutral or dissatisfied” classes appear largely separable, we do not expect much
gain from the LDA model over the logistic model. However, we build the LDA model for comparison.
Confusion Matrix and Statistics
Reference
Prediction neutral or dissatisfied satisfied
neutral or dissatisfied 10024 2322
satisfied 2243 12685
Accuracy : 0.8326
95% CI : (0.8281, 0.837)
No Information Rate : 0.5502
P-Value [Acc > NIR] : <2e-16
Kappa : 0.662
Mcnemar's Test P-Value : 0.2483
Sensitivity : 0.8453
Specificity : 0.8172
Pos Pred Value : 0.8497
Neg Pred Value : 0.8119
Prevalence : 0.5502
Detection Rate : 0.4651
Detection Prevalence : 0.5473
Balanced Accuracy : 0.8312
'Positive' Class : satisfied
From the above results, we can observe that:
• An overall accuracy of 83.3% and a Kappa value of 0.66.
• A sensitivity of 0.845, indicating that the model correctly identifies 84.5% of the satisfied
customers.
• A specificity of 0.817, indicating that the model correctly identifies 81.7% of the “neutral or
dissatisfied” customers.
3. DECISION TREE – CART ALGORITHM
We use a decision tree algorithm, CART (Classification and Regression Trees), to classify the
dependent variable based on the set of independent variables. The algorithm repeatedly
partitions the data into subsets so that the leaf nodes are pure, or as homogeneous as possible.
Here we ensure that our categorical variables are in factor form so that the CART algorithm
builds a classification tree rather than a regression tree. The fitted tree is visualized next.
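A sketch of the tree construction, following the appendix code (data_train_new and data_test_new are copies of the train and test sets with the rating variables converted back to factors):

library(rpart)
library(rpart.plot)
cart_model <- rpart(Satisfaction ~ ., data = data_train_new, method = "class")  # classification tree
rpart.plot(cart_model, digits = 2, type = 4, extra = 106)                       # the tree plot described below
cart_preds <- predict(cart_model, data_test_new, type = "class")
confusionMatrix(cart_preds, data_test$Satisfaction, positive = "satisfied")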
The algorithm builds a classification tree. The first, most important split is on "Inflight
entertainment": passengers rating it 1-3 go to the left node, classified as "neutral or
dissatisfied", which is further split on the rating given to "Seat comfort". If seat comfort is also
rated 1-3, they are finally classified as "neutral or dissatisfied", otherwise as "satisfied". The tree
progresses similarly and finally ends in 7 leaf nodes. Note that the factor levels shown in the plot are
normalized (normalized value (rating value): 0 (0), 0.2 (1), 0.4 (2), 0.6 (3), 0.8 (4), 1 (5)).

We use this model to predict the test dataset and observe the confusion matrix to
understand the model's performance.
Confusion Matrix and Statistics
Reference
Prediction neutral or dissatisfied satisfied
neutral or dissatisfied 10638 1997
satisfied 1708 12931
Accuracy : 0.8642
95% CI : (0.86, 0.8682)
No Information Rate : 0.5473
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7264
Mcnemar's Test P-Value : 2.229e-06
Sensitivity : 0.8662
Specificity : 0.8617
Pos Pred Value : 0.8833
Neg Pred Value : 0.8419
Prevalence : 0.5473
Detection Rate : 0.4741
Detection Prevalence : 0.5367
Balanced Accuracy : 0.8639
'Positive' Class : satisfied
From the above results, we can observe that:
• An overall accuracy of 86.4% and a Kappa value of 0.73. This means our model performs better than a
random prediction of the dependent variable.
• A sensitivity of 0.866, indicating that the model correctly identifies 86.6% of the satisfied
customers.
• A specificity of 0.862, indicating that the model correctly identifies 86.2% of the “neutral or
dissatisfied” customers.
4. NAÏVE BAYES
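The Naïve Bayes classifier is fitted with the e1071 package on the separately prepared dataset described in the appendix, in which the ratings are kept as factors, Age, Flight_Distance and DepartureDelayin_Mins are binned, and Satisfaction is recoded as 1 (satisfied) and 0 (neutral or dissatisfied). A minimal sketch of that step:

library(e1071)
library(caret)
model.bayes <- naiveBayes(Satisfaction ~ ., data = training)   # fit on the 70% training sample
pc <- predict(model.bayes, testing, type = "class")            # predicted class labels (1 / 0)
confusionMatrix(table(pc, testing$Satisfaction))               # confusion matrix shown below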
Confusion Matrix and Statistics
pc      1     0
  1 13071  1949
  0  1792 10329
Accuracy : 0.8622
95% CI : (0.858, 0.8662)
No Information Rate : 0.5476
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.7215
Mcnemar's Test P-Value : 0.01076
Sensitivity : 0.8794
Specificity : 0.8413
Pos Pred Value : 0.8702
Neg Pred Value : 0.8522
Prevalence : 0.5476
Detection Rate : 0.4816
Detection Prevalence : 0.5534
Balanced Accuracy : 0.8603
'Positive' Class : 1
From the above results, we can observe that:
• An overall accuracy of 86.2% and a Kappa value of 0.72. This means our model performs better than a
random prediction of the dependent variable.
• A sensitivity of 0.879, indicating that the model correctly identifies 87.9% of the satisfied
customers, the highest of all the models.
• A specificity of 0.841, indicating that the model correctly identifies 84.1% of the “neutral or
dissatisfied” customers, which is lower than the CART model.
• This model performs well in predicting the satisfied customers but not as well for the neutral or
dissatisfied customers.
MODEL COMPARISON
In order to compare the performance of the four models we built on common metrics,
we extract the Accuracy, Kappa, Sensitivity and Specificity from each confusionMatrix()
output.
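A sketch of the helper used for this, based on the comp_cfm() function in the appendix; it pulls Accuracy, Kappa, Sensitivity and Specificity out of each confusionMatrix object (the Naïve Bayes column is added from its own confusionMatrix output):

comp_cfm <- function(x1, x2, x3, n1, n2, n3){
  b <- as.matrix(c(x1$overall[1:2], x1$byClass[1:2]))   # Accuracy, Kappa, Sensitivity, Specificity
  c <- as.matrix(c(x2$overall[1:2], x2$byClass[1:2]))
  d <- as.matrix(c(x3$overall[1:2], x3$byClass[1:2]))
  colnames(b) <- n1; colnames(c) <- n2; colnames(d) <- n3
  as.data.frame(cbind(b, c, d))
}
comp_cfm(mr1, mr2, mr3, "LogisticRegression", "LinearDiscriminantAnalysis", "DecisionTree")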
              Logistic Regression   Linear Discriminant Analysis   Decision Tree   Naïve Bayes
Accuracy                   0.8316                          0.8326          0.8642        0.8622
Kappa                      0.6600                          0.6620          0.7264        0.7215
Sensitivity                0.8481                          0.8453          0.8662        0.8794
Specificity                0.8116                          0.8172          0.8617        0.8413
We observe from the above summary that:
• The logistic regression and LDA models provide very similar performance metrics on our dataset.
• The decision tree provides better performance metrics than the other models, although the difference
between Naïve Bayes and the decision tree is small.
Below we compare the models one against the other.
1. LDA OVER LOGISTIC REGRESSION MODEL
As we expected, we do not see a significant improvement in model performance, as our predicted
classes were already fairly separable and we have sufficient predictors. We only see a slight
improvement of about half a percentage point in the specificity of the model, which represents
correctly identifying "neutral or dissatisfied" customers. For our business model, it is critical to
identify this negative class correctly, as the airline company aims to identify unsatisfied customers
and reward them to improve their satisfaction.
2. DECISION TREE OVER LDA AND NAÏVE BAYES
We obtained a decent 3 percentage-point improvement in model accuracy using the decision tree
algorithm over logistic regression and LDA, but no significant difference was observed when
comparing it with the Naïve Bayes model. We also obtained roughly a 5 percentage-point improvement
in specificity over the logistic and LDA models, and about 2 percentage points over Naïve Bayes, which
is a critical metric for our business model.
CONCLUSION
We conclude that the decision tree model built using the CART algorithm is statistically the best suited
to our dataset and business model, providing an accuracy of 86 percent along with 86 percent sensitivity
(correctly identifying "satisfied" customers) and 86 percent specificity (correctly identifying "neutral or
dissatisfied" customers).

We also observed, from both the exploratory analysis and the decision tree, that the inflight
entertainment, seat comfort and online service levels significantly affect the customer experience, along
with several of the other variables considered. Airline service companies must ensure a high quality of
service on these parameters to ensure a high level of customer satisfaction.

The models we built do not perfectly classify the customer satisfaction level. The prediction accuracy
could be further improved using advanced classification algorithms or deep learning techniques.
APPENDIX – R SOURCE CODE
# Loading Libraries
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(e1071)
library(MASS)
library(corrplot)
library(ggplot2)
library(tidyr)
library(sparklyr)
# Setting Working Directory
setwd("D:/PGP BA Classes/Capstone Projects/3. Airplane Passenger Satisfaction/Final Submission")
#Loading data
flightdata<-read.csv("Flight data.csv", header = T)
surveydata<-read.csv("Surveydata.csv", header = T)
flightsurvey<-cbind(flightdata,surveydata[,-1])
#Filling blanks with NA
empty_as_na <- function(x){
if("factor" %in% class(x)) x <- as.character(x)
ifelse(as.character(x)!="", x, NA)
}
## transform all columns
flightsurvey<-flightsurvey %>% mutate_each(list(empty_as_na))
str(flightsurvey)
#Dealing with NA values
#NA's Count
sapply(flightsurvey, function(x) sum(is.na(x)))
# Missing value imputation with mode
# For CustomerType
ct <- unique(flightsurvey$CustomerType[!is.na(flightsurvey$CustomerType)])
CustomerType <- ct[which.max(tabulate(match(flightsurvey$CustomerType, ct)))]
CustomerType
flightsurvey$CustomerType[is.na(flightsurvey$CustomerType)]<- CustomerType
# For TypeTravel
tt <- unique(flightsurvey$TypeTravel[!is.na(flightsurvey$TypeTravel)])
TypeTravel <- tt[which.max(tabulate(match(flightsurvey$TypeTravel, tt)))]
TypeTravel
flightsurvey$TypeTravel[is.na(flightsurvey$TypeTravel)]<- TypeTravel
# For Departure.Arrival.time_convenient
datc <- unique(flightsurvey$Departure.Arrival.time_convenient[!is.na(flightsurvey$Departure.Arrival.time_convenient)])
Departure.Arrival.time_convenient <-
datc[which.max(tabulate(match(flightsurvey$Departure.Arrival.time_convenient, datc)))]
Departure.Arrival.time_convenient
flightsurvey$Departure.Arrival.time_convenient[is.na(flightsurvey$Departure.Arrival.time_convenient)]<-
Departure.Arrival.time_convenient
# For Food_drink
fd <- unique(flightsurvey$Food_drink[!is.na(flightsurvey$Food_drink)])
Food_drink <- fd[which.max(tabulate(match(flightsurvey$Food_drink, fd)))]
Food_drink
flightsurvey$Food_drink[is.na(flightsurvey$Food_drink)]<- Food_drink
# For Onboard_service
os <- unique(flightsurvey$Onboard_service[!is.na(flightsurvey$Onboard_service)])
Onboard_service <- os[which.max(tabulate(match(flightsurvey$Onboard_service, os)))]
Onboard_service
flightsurvey$Onboard_service[is.na(flightsurvey$Onboard_service)]<- Onboard_service
# Converting Cate into int
flightsurvey$Seat_comfort<-factor(flightsurvey$Seat_comfort,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Seat_comfort <- as.integer(factor(flightsurvey$Seat_comfort))
flightsurvey$Departure.Arrival.time_convenient<-
factor(flightsurvey$Departure.Arrival.time_convenient,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Departure.Arrival.time_convenient <-
as.integer(factor(flightsurvey$Departure.Arrival.time_convenient))
flightsurvey$Food_drink<-factor(flightsurvey$Food_drink,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Food_drink <- as.integer(factor(flightsurvey$Food_drink))
flightsurvey$Gate_location<-factor(flightsurvey$Gate_location,levels=c("very
inconvinient","Inconvinient","need improvement","manageable","Convinient","very
convinient"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Gate_location <- as.integer(factor(flightsurvey$Gate_location))
flightsurvey$Inflightwifi_service<-factor(flightsurvey$Inflightwifi_service,levels=c("extremely
poor","poor","need improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Inflightwifi_service <- as.integer(factor(flightsurvey$Inflightwifi_service))
flightsurvey$Inflight_entertainment<-factor(flightsurvey$Inflight_entertainment,levels=c("extremely
poor","poor","need improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Inflight_entertainment <- as.integer(factor(flightsurvey$Inflight_entertainment))
flightsurvey$Online_support<-factor(flightsurvey$Online_support,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Online_support <- as.integer(factor(flightsurvey$Online_support))
flightsurvey$Ease_of_Onlinebooking<-factor(flightsurvey$Ease_of_Onlinebooking,levels=c("extremely
poor","poor","need improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Ease_of_Onlinebooking <- as.integer(factor(flightsurvey$Ease_of_Onlinebooking))
flightsurvey$Onboard_service<-factor(flightsurvey$Onboard_service,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Onboard_service <- as.integer(factor(flightsurvey$Onboard_service))
flightsurvey$Leg_room_service<-factor(flightsurvey$Leg_room_service,levels=c("extremely
poor","poor","need improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Leg_room_service <- as.integer(factor(flightsurvey$Leg_room_service))
flightsurvey$Baggage_handling<-factor(flightsurvey$Baggage_handling,levels=c("poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4),ordered = T)
flightsurvey$Baggage_handling <- as.integer(factor(flightsurvey$Baggage_handling))
flightsurvey$Checkin_service<-factor(flightsurvey$Checkin_service,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Checkin_service <- as.integer(factor(flightsurvey$Checkin_service))
flightsurvey$Cleanliness<-factor(flightsurvey$Cleanliness,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Cleanliness <- as.integer(factor(flightsurvey$Cleanliness))
flightsurvey$Online_boarding<-factor(flightsurvey$Online_boarding,levels=c("extremely poor","poor","need
improvement","acceptable","good","excellent"),labels=c(0,1,2,3,4,5),ordered = T)
flightsurvey$Online_boarding <- as.integer(factor(flightsurvey$Online_boarding))
str(flightsurvey)
#drop ID attribute
data <- flightsurvey[-1]
# Finding correlation
data.non <- data %>% filter(!is.na(data$ArrivalDelayin_Mins))
for(i in 1:ncol(data.non)){
if(class(data[,i])!="integer"){
data.non[,i] <- as.integer(data.non[,i])
}
}
data.cor <- round(cor(data.non[,1:7], data.non[,8]),3)
colnames(data.cor) <- "ArrivalDelayin_Mins"
data.cor
# Removing High Correlated atribute "Arrival.Delay.in.Minutes"
data <- data[-8]
#Exploratory Analysis
ggplot(data, aes(Satisfaction, group = Gender)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Satisfaction Level",y = "Percentage",title="Gender vs Satisfaction", fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Gender)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
ggplot(data, aes(Satisfaction, group = Class)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Class of Travel",y = "Percentage",title="Travel Class vs Satisfaction", fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) +
facet_grid(~Class)+ theme(axis.text.x = element_text(size=8,angle = 90), plot.title = element_text(hjust =
0.5))
ggplot(data, aes(Satisfaction, group = Seat_comfort)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Seat Comfort Level",y = "Percentage",title="Seat Comfort vs Satisfaction", fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Seat_comfort)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
ggplot(data, aes(Satisfaction, group = Inflight_entertainment)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Inflight entertainment Level",y = "Percentage",title= "Inflight Entertainment vs Satisfaction",
fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Inflight_entertainment)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
ggplot(data, aes(Satisfaction, group = Online_support)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count") +
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Online support Level",y = "Percentage",title="Online Support vs Satisfaction", fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Online_support)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
ggplot(data, aes(Satisfaction, group = Baggage_handling)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count")+
geom_text(aes( label = scales::percent(..prop..),y= ..prop..), stat= "count", vjust = -.5, cex=3) +
labs(x="Baggage Handling Level",y = "Percentage",title="Bagga ge Handling vs Satisfaction",
fill="Satisfaction") +
scale_y_continuous(labels=scales::percent) + facet_grid(~Baggage_handling)+ theme(axis.text.x =
element_text(size=8,angle = 90), plot.title = element_text(hjust = 0.5))
# Data Modelling
round(prop.table(table(data$Satisfaction))*100,digit=1)
for(i in 2:ncol(data)){
if(class(data[,i])=="factor"){
data[,i] <- as.integer(data[,i])
}
}
normalize <- function(x){
return ((x-min(x))/(max(x)-min(x)))
}
#write.csv(data, file = "Data.csv")
data <- read.csv("Data.csv",header = T)
data <- data[,-1]
str(data)
# Converting categorical to int
# Gender
levels(data$Gender)
data$Gender<-factor(data$Gender,levels=c("Female","Male"),labels=c(0,1))
data$Gender <- as.integer(factor(data$Gender))
# CustomerType
levels(data$CustomerType)
data$CustomerType<-factor(data$CustomerType,levels=c("disloyal Customer","Loyal
Customer"),labels=c(0,1),ordered = T)
data$CustomerType <- as.integer(factor(data$CustomerType))
#TypeTravel
levels(data$TypeTravel)
data$TypeTravel<-factor(data$TypeTravel,levels=c("Business travel","Personal Travel"),labels=c(0,1))
data$TypeTravel <- as.integer(factor(data$TypeTravel))
#Class
levels(data$Class)
data$Class<-factor(data$Class,levels=c("Eco","Eco Plus","Business"),labels=c(0,1,2),ordered = T)
data$Class <- as.integer(factor(data$Class))
data_n <- as.data.frame(lapply(data[2:22], normalize))
data_n <- cbind(data_n, Satisfaction=data$Satisfaction)
set.seed(5)
partition <- createDataPartition(y = data_n$Satisfaction, p=0.7, list= F)
data_train <- data_n[partition,]
data_test <- data_n[-partition,]
# write.csv(data_train, file = "data_train.csv")
# write.csv(data_test, file = "data_test.csv")
# Logistic regression
log_model <- glm(Satisfaction~., data=data_train, family = "binomial")
summary(log_model)
log_preds <- predict(log_model, data_test[,1:22], type = "response")
head(log_preds)
log_class <- array(c(99))
for (i in 1:length(log_preds)){
if(log_preds[i]>0.5){
log_class[i]<-"satisfied"
}else{
log_class[i]<-"neutral or dissatisfied"
}
}
#Creating a new dataframe containing the actual and predicted values.
log_result <- data.frame(Actual = data_test$Satisfaction, Prediction = log_class)
mr1 <- confusionMatrix(as.factor(log_class),data_test$Satisfaction, positive = "satisfied")
mr1
# Linear Discriminant Analysis
pairs(data_train[,1:5], main="Predict ", pch=22, bg=c("red", "blue")[unclass(data_train$Satisfaction)])
lda_model <- lda(Satisfaction ~ ., data_train)
lda_model
#Predict the model
lda_preds <- predict(lda_model, data_test)
mr2 <- confusionMatrix(data_test$Satisfaction, lda_preds$class, positive = "satisfied")
mr2
# Decision tree - CART algorithm
data_train_new <- data_train
for(i in c(1,2,4,5,8,9,10,11,12,13,14,15,16,17,18,19,20,21)){
data_train_new[,i] <- as.factor(data_train_new[,i])
}
data_test_new <- data_test
for(i in c(1,2,4,5,8,9,10,11,12,13,14,15,16,17,18,19,20,21)){
data_test_new[,i] <- as.factor(data_test_new[,i])
}
str(data_train_new)
str(data_test_new)
cart_model <- rpart(Satisfaction~ ., data_train_new, method="class")
cart_model
rpart.plot(cart_model,digits = 2, type=4, extra=106)
cm1 <- predict(cart_model, data_test_new, type = "class")
mr3 <- confusionMatrix(cm1, data_test$Satisfaction, positive = "satisfied")
mr3
# Model Comparision
comp_cfm <- function(x1,x2,x3,n1,n2,n3){
a <- data.frame()
b <- as.matrix(c(x1[[3]][1:2],x1[[4]][1:2]))
c <- as.matrix(c(x2[[3]][1:2],x2[[4]][1:2]))
d <- as.matrix(c(x3[[3]][1:2],x3[[4]][1:2]))
colnames(b) <- n1
colnames(c) <- n2
colnames(d) <- n3
a <- as.data.frame(cbind(b,c,d))
return(a)
}
r <- comp_cfm(mr1,mr2,mr3, "LogisticRegression", "LinearDiscriminantAna lysis", "DecisionTree")
r
##Capstone Project Data And Naive Bayes Model
setwd("D:/PGP BA Classes/Capstone Projects/3. Airplane Passenger Satisfaction/Final Submission")
survey<-read.csv("Surveydata.csv")
flight<- read.csv("Flight data.csv")
## Combine Data set
FSdata<-cbind(flight[-1],survey[-1])
str(FSdata)
colnames(FSdata)
library(dplyr)
## filling "Blanks" with "NA"
empty_as_na <- function(x)
{
if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
ifelse(as.character(x)!="", x, NA)
}
## transform all columns
FSdata<-FSdata %>% mutate_each(list(empty_as_na))
FSdata
##Missing value imputation for CustomerType
val1 <- unique(FSdata$CustomerType[!is.na(FSdata$CustomerType)])
mode1 <- val1[which.max(tabulate(match(FSdata$CustomerType, val1)))]
mode1
FSdata$CustomerType[is.na(FSdata$CustomerType)]<- mode1
##Missing value imputation for Typetravel
val2 <- unique(FSdata$TypeTravel[!is.na(FSdata$TypeTravel)])
mode2 <- val2[which.max(tabulate(match(FSdata$TypeTravel, val2)))]
mode2
FSdata$TypeTravel[is.na(FSdata$TypeTravel)]<- mode2
##Missing value imputation for Departure.Arrival.time_convenient
val3 <-unique(FSdata$Departure.Arrival.time_convenient[!is.na(FSdata$Departure.Arrival.time_convenient)])
mode3 <- val3[which.max(tabulate(match(FSdata$Departure.Arrival.time_convenient, val3)))]
mode3
FSdata$Departure.Arrival.time_convenient[is.na(FSdata$Departure.Arrival.time_convenient)]<-mode3
##Missing value imputation for Food_drink
val4 <- unique(FSdata$Food_drink[!is.na(FSdata$Food_drink)])
mode4 <- val4[which.max(tabulate(match(FSdata$Food_drink, val4)))]
mode4
FSdata$Food_drink[is.na(FSdata$Food_drink)]<- mode4
##Missing value imputation for Onboard_service
val5 <- unique(FSdata$Onboard_service[!is.na(FSdata$Onboard_service)])
mode5 <- val5[which.max(tabulate(match(FSdata$Onboard_service, val5)))]
mode5
FSdata$Onboard_service[is.na(FSdata$Onboard_service)]<- mode5
## deleting the variable "Arrival Delay in Mins" due to collinearity
FSdata<-FSdata[-8]
colnames(FSdata)
write.csv(FSdata,file = "FSdata.csv")
sum(is.na(FSdata))
## Binning
library(rbin)
FSdata$Age <- cut(FSdata$Age,
c(0,20,35,55,90))
table(FSdata$Age)
prop.table(table(FSdata$Age))
levels(FSdata$Age)
##Binning Flight_Distance
FSdata$Flight_Distance <- cut(FSdata$Flight_Distance,
c(0,200,400,600,800,1000,1200,1400,1600,1800,2000,2200,2400,2600,2800,3000,3200,3400,
3600,3800,4000,4200,4400,4600,4800,5000,5200,5400,5600,5800,6000))
table(FSdata$Flight_Distance)
prop.table(table(FSdata$Flight_Distance))
levels(FSdata$Flight_Distance)
##Binning DepartureDelayin_Mins (First 2 hrs every 10 mins, next 3 hrs every 15 mins,
#next 5 hrs every 30 mins, next 6 hrs every 60 mins, and beyond 6 hrs in the last bucket)
FSdata$DepartureDelayin_Mins <- cut(FSdata$DepartureDelayin_Mins,
c(0,10,20,30,40,50,60,70,80,90,100,110,120,135,150,
165,180,195,210,225,240,255,270,285,300,330,360,390,
420,450,480,510,540,570,600,660,720,780,840,900,960,1620))
table(FSdata$DepartureDelayin_Mins)
prop.table(table(FSdata$DepartureDelayin_Mins))
##Converting 0 as string
levels<-levels(FSdata$DepartureDelayin_Mins)
levels[length(levels)+1]<-"0"
FSdata$DepartureDelayin_Mins<-factor(FSdata$DepartureDelayin_Mins,levels =levels)
FSdata$DepartureDelayin_Mins[is.na(FSdata$DepartureDelayin_Mins)]<- "0"
str(FSdata)
## Converting Character variables to factor
FSdata$Gender<- factor(FSdata$Gender,c("Female","Male"),labels = c(0,1))
FSdata$CustomerType <- factor(FSdata$CustomerType,
c("Loyal Customer","disloyal Customer"),labels = c(1,0))
FSdata$TypeTravel<- factor(FSdata$TypeTravel,
c("Personal Travel", "Business travel"),labels = c(0,1))
FSdata$Class <- factor(FSdata$Class,c("Eco","Eco Plus","Business"),labels = c(1:3))
FSdata$Food_drink <- factor(FSdata$Food_drink,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Seat_comfort <- factor(FSdata$Seat_comfort,
c( "extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Departure.Arrival.time_convenient <- factor(FSdata$Departure.Arrival.time_convenient,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Gate_location <- factor(FSdata$Gate_location,
c("very inconvinient","Inconvinient","need improvement","manageable",
"Convinient" , "very convinient"),labels = c(0:5))
FSdata$Inflightwifi_service<- factor(FSdata$Inflightwifi_service,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Inflight_entertainment <- factor(FSdata$Inflight_entertainment,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Online_support <- factor(FSdata$Online_support,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Ease_of_Onlinebooking <- factor(FSdata$Ease_of_Onlinebooking,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Onboard_service <- factor(FSdata$Onboard_service,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Leg_room_service<- factor(FSdata$Leg_room_service,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Baggage_handling<- factor(FSdata$Baggage_handling,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Checkin_service <- factor(FSdata$Checkin_service,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Cleanliness<- factor(FSdata$Cleanliness,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Online_boarding <- factor(FSdata$Online_boarding,
c("extremely poor","poor","need improvement","acceptable",
"good" , "excellent"),labels = c(0:5))
FSdata$Satisfaction<-factor(FSdata$Satisfaction,
c("satisfied","neutral or dissatisfied"),labels=c(1,0))
levels(FSdata$Cleanliness)
str(FSdata)
##Split the data in train & test sample (70:30)
library(caret)
set.seed(123)
id<-sample(2,nrow(FSdata),prob = c(.7,.3),replace = T)
training<-FSdata[id==1,]
testing<-FSdata[id==2,]
c(nrow(training), nrow(testing))
str(testing)
## Rows and Columns
dim(training)
dim(testing)
## Columns name
colnames(training)
colnames(testing)
## Naive Baise Model
library(e1071)
library(caret)
library(AUC)
model.bayes<- naiveBayes(Satisfaction~.,data = training)
model.bayes
## test
pc<- NULL
pc<- predict(model.bayes,testing,type = "class")
summary(pc)
confusionMatrix(table(pc,testing$Satisfaction))
## Lift Chart
pb<- NULL
pb<- predict(model.bayes,testing,type = "raw")
pb<-as.data.frame(pb)
pred.bayes<-data.frame(testing$Satisfaction,pb$`1`)
colnames(pred.bayes)<-c("target","score")
lift.bayes<- lift(target~score,data=pred.bayes,cuts = 10,class = "1")
xyplot(lift.bayes,main="Bayesian Classifier -Lift Chart",
type=c("l","g"),lwd=2,
scales=list(x=list(alternating=FALSE,tick.number=10),
y=list(alternating=FALSE,tick.number=10)))

 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

Us falcon airline passenger satisfaction

  • 1. “US - Falcon Airline Passenger Satisfaction” Submitted in Partial Fulfillment of requirements for the Award of certificate of Post Graduate Program in Business Analytics Capstone Project Report Submitted to Submitted by Group 7 Pratiti Ray Chaudhury Bajantry Harshadeep Mithilesh Kumar Yadav Sravan Kolluru Under the guidance of (Ms. Chaitali Dixit) Batch – (PGPBA.H.Jan’19) Year Of Completion (July’ 2019)
  • 2. CERTIFICATE This is to certify that the participants Pratiti Ray Chaudhury, Bajantry Harshadeep, Mithilesh Kumar Yadav, Sravan Kolluru who are the students of Great Lakes Institute of Management, has successfully completed their project on “US - Falcon Airline Passenger Satisfaction” This project is the record of authentic work carried out by them during the academic year 2018- 2019. Mentor’s Name and Sign Program Director Name : Date : Place: Hyderabad
  • 3. CONTENTS INTRODUCTION................................................................................................................... 4 PROBLEM STATEMENT..................................................................................................................5 DATA ABSTRACT .................................................................................................................. 5 DATAPREPARATION....................................................................................................................................7 EXPLORATORYANALYSIS .........................................................................................................................10 DATAMODELLING.....................................................................................................................................13 Dataprocessingandnormalization...........................................................................................13 TrainingandTestDatasets..........................................................................................................14 Models.........................................................................................................................................................14 1. LogisticRegression................................................................................................................14 2. LinearDiscriminantAnalysis................................................................................................16 3. DecisionTree –CARTAlgorithm.........................................................................................18 4. NaïveBayes...........................................................................................................................20 MODELCOMPARISON..............................................................................................................................22 1. LDAover Logisticregressionmodel....................................................................................23 2. Decision tree over LDA.................................................................................... 23 CONCLUSION..................................................................................................................... 24 APPENDIX – R SOURCE CODE............................................................................................ 25
Therefore, companies need to understand customers' needs in order to deliver unparalleled experiences and retain them. Using the customer satisfaction data obtained from Falcon Airlines, we attempt to understand why a customer's experience is satisfactory or not; based on these findings, the airline can improve its service. The analysis also identifies the factors that most improve the customer satisfaction level.

In this era of social networking, customers turn to Twitter, Facebook, blogs or Google Reviews to share their experience. A poor experience, once shared, can go viral and seriously damage the brand and goodwill of the company. To prevent such unwanted trends and to avoid heavy legal compensation or penalties, companies in the service industry identify a customer's satisfaction through rating cards or surveys after the service is delivered, and compensate dissatisfied customers when the fault lies with the company's service. Such reward and compensation programs have effectively reduced the number of complaints raised by customers.
PROBLEM STATEMENT

In our scenario, the airline company wants to identify a customer's satisfaction level based on their ratings of various aspects of the airline experience. We therefore build a model to classify the customer satisfaction level: precisely, we classify customers as satisfied or not, and then identify the factors associated with high satisfaction.

DATA ABSTRACT

The data, obtained from Kaggle, represent US airline satisfaction data consisting of 129,880 observations and 23 attributes. The satisfaction level, our dependent variable, is represented as a factor ("Satisfied" and "Neutral or Dissatisfied").

Variable – Description – Values
Satisfaction (dependent variable) – Airline satisfaction level – Satisfied, neutral or dissatisfied
Age – The actual age of the passenger – Numeric
Gender – Gender of the passenger – Female, Male
Type of Travel – Purpose of the passenger's flight – Personal Travel, Business Travel
Class – Travel class in the plane – Business, Eco, Eco Plus
Customer Type – The customer type – Loyal customer, Disloyal customer
Flight Distance – The flight distance of the journey – Numeric
Inflight WIFI Service – Satisfaction level of the inflight wifi service – Rating: 0 (least) - 5 (highest)
Ease of Online Booking – Satisfaction level of online booking – Rating: 0 - 5
Online Boarding – Satisfaction level of online boarding – Rating: 0 - 5
Inflight Entertainment – Satisfaction level of inflight entertainment – Rating: 0 - 5
Food and Drink – Satisfaction level of food and drink – Rating: 0 - 5
Seat Comfort – Satisfaction level of seat comfort – Rating: 0 - 5
On-board Service – Satisfaction level of on-board service – Rating: 0 - 5
Leg Room Service – Satisfaction level of leg room service – Rating: 0 - 5
Departure/Arrival Time Convenient – Satisfaction level of departure/arrival time convenience – Rating: 0 - 5
Baggage Handling – Satisfaction level of baggage handling – Rating: 0 - 5
Gate Location – Satisfaction level of gate location – Rating: 0 - 5
Cleanliness – Satisfaction level of cleanliness – Rating: 0 - 5
Check-in Service – Satisfaction level of check-in service – Rating: 0 - 5
Departure Delay in Minutes – Minutes delayed at departure – Numeric
Arrival Delay in Minutes – Minutes delayed at arrival – Numeric
Online Support – Satisfaction level of online support – Rating: 0 - 5
DATA PREPARATION

We import the dataset into the R environment and observe the structure of the loaded data.

## 'data.frame': 90917 obs. of 25 variables:
##  $ ID                               : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Gender                           : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 2 2 1 1 ...
##  $ CustomerType                     : Factor w/ 3 levels "","disloyal Customer",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Age                              : int 65 15 60 70 30 66 10 22 58 34 ...
##  $ TypeTravel                       : Factor w/ 3 levels "","Business travel",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ Class                            : Factor w/ 3 levels "Business","Eco",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Flight_Distance                  : int 265 2138 623 354 1894 227 1812 1556 1043 633 ...
##  $ DepartureDelayin_Mins            : int 0 0 0 0 0 17 0 30 47 0 ...
##  $ ArrivalDelayin_Mins              : int 0 0 0 0 0 15 0 26 48 0 ...
##  $ ID                               : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Satisfaction                     : Factor w/ 2 levels "neutral or dissatisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Seat_comfort                     : Factor w/ 6 levels "acceptable","excellent",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Departure.Arrival.time_convenient: Factor w/ 7 levels "","acceptable",..: 4 4 1 4 4 4 4 1 4 4 ...
##  $ Food_drink                       : Factor w/ 7 levels "","acceptable",..: 4 4 4 4 4 1 1 4 4 4 ...
##  $ Gate_location                    : Factor w/ 6 levels "Convinient","Inconvinient",..: 4 3 3 3 3 3 3 3 3 1 ...
##  $ Inflightwifi_service             : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 4 5 5 5 5 1 5 ...
##  $ Inflight_entertainment           : Factor w/ 6 levels "acceptable","excellent",..: 4 3 4 1 3 2 3 3 1 3 ...
##  $ Online_support                   : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 4 5 2 5 5 1 5 ...
##  $ Ease_of_Onlinebooking            : Factor w/ 6 levels "acceptable","excellent",..: 1 5 6 5 5 2 5 5 1 5 ...
##  $ Onboard_service                  : Factor w/ 7 levels "","acceptable",..: 2 1 7 6 3 3 2 6 2 2 ...
##  $ Leg_room_service                 : Factor w/ 6 levels "acceptable","excellent",..: 3 1 3 3 4 3 1 4 3 5 ...
##  $ Baggage_handling                 : Factor w/ 5 levels "acceptable","excellent",..: 1 3 5 4 2 2 3 2 5 2 ...
##  $ Checkin_service                  : Factor w/ 6 levels "acceptable","excellent",..: 2 4 4 4 2 2 2 1 5 5 ...
##  $ Cleanliness                      : Factor w/ 6 levels "acceptable","excellent",..: 1 4 6 5 4 2 4 4 1 2 ...
##  $ Online_boarding                  : Factor w/ 6 levels "acceptable","excellent",..: 5 5 1 2 5 1 5 5 2 5 ...

We observe that the dataset has 90,917 observations and 24 attributes (the ID column appears twice in the combined extract). The attribute "Satisfaction" represents the satisfaction level of the customer on two levels: "neutral or dissatisfied" and "satisfied". We drop the ID attribute, as it is not required in our analysis or model building. Hence, we have 1 dependent variable and 22 independent variables in our dataset.

We check the dataset for NA values, which must be dealt with before we proceed with model building.

## ArrivalDelayin_Mins
##                 284

We observe 284 NA values in the attribute "ArrivalDelayin_Mins". To check whether any other attributes contain missing values, we first replace blank entries with NA and then count the NA values per attribute. The attributes with missing values are:

CustomerType  TypeTravel  Departure.Arrival.time_convenient  Food_drink  Onboard_service
        9099        9088                               8244        8181             7179
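The blank-to-NA conversion, the per-attribute NA count and the mode imputation described above can be reproduced with a short R sketch. It is condensed from the appendix code and assumes the two raw CSV files named there ("Flight data.csv" and "Surveydata.csv"); the helper name empty_as_na follows the appendix.

# Load and combine the flight and survey files (file names as in the appendix code)
library(dplyr)
flightdata   <- read.csv("Flight data.csv", header = TRUE)
surveydata   <- read.csv("Surveydata.csv", header = TRUE)
flightsurvey <- cbind(flightdata, surveydata[, -1])

# Replace blank entries with NA so they are counted as missing
empty_as_na <- function(x){
  if ("factor" %in% class(x)) x <- as.character(x)
  ifelse(as.character(x) != "", x, NA)
}
flightsurvey <- flightsurvey %>% mutate(across(everything(), empty_as_na))

# Count NA values per attribute
sapply(flightsurvey, function(x) sum(is.na(x)))

# Impute a categorical attribute with its mode, e.g. CustomerType
vals    <- unique(flightsurvey$CustomerType[!is.na(flightsurvey$CustomerType)])
mode_ct <- vals[which.max(tabulate(match(flightsurvey$CustomerType, vals)))]
flightsurvey$CustomerType[is.na(flightsurvey$CustomerType)] <- mode_ct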
By imputing the blank values of the different attributes with NA, we see that several attributes contain missing values. We could replace these NA values with an attribute-level summary (mean or mode) or drop the affected rows; however, before proceeding, let us analyze the correlation of "ArrivalDelayin_Mins" with the other variables.

(Correlation plot of all attributes omitted.)

Correlation with ArrivalDelayin_Mins:
Gender                      NA
CustomerType                NA
Age                     -0.010
TypeTravel                  NA
Class                       NA
Flight_Distance          0.108
DepartureDelayin_Mins    0.966

We observe a very high correlation (0.966) between DepartureDelayin_Mins and ArrivalDelayin_Mins. Therefore, instead of replacing or omitting the NA values in "ArrivalDelayin_Mins", we can exclude the variable entirely, since a similar variable with no NA values is already available for modeling.
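The correlation check and the removal of the redundant delay column can be written as below. This is a condensed sketch of the appendix code, assuming data is the prepared data frame with the column names used in this report and the remaining attributes coerced to integer.

# Correlation of ArrivalDelayin_Mins with the other numeric attributes, complete cases only
data_complete <- data[!is.na(data$ArrivalDelayin_Mins), ]
round(cor(data_complete[, c("Age", "Flight_Distance", "DepartureDelayin_Mins")],
          data_complete$ArrivalDelayin_Mins), 3)

# DepartureDelayin_Mins correlates at ~0.97 with ArrivalDelayin_Mins,
# so the arrival-delay column (and its 284 NAs) is dropped
data <- data[, names(data) != "ArrivalDelayin_Mins"]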
EXPLORATORY ANALYSIS

After examining the impact of several independent variables on the satisfaction level, we present visualizations showing the proportion of satisfied customers at the different factor levels of "Gender", "Class", "Seat Comfort", "Inflight Entertainment", "Online Support" and "Baggage Handling". From an inferential perspective, these variables appear to be among the most significant drivers of airline customer satisfaction, as the following observations confirm:

• Class: customers traveling in Business Class are more satisfied than those traveling in the Economy classes.
• Gender: female customers are comparatively more satisfied than male customers; around 65% of female customers are satisfied against 34% dissatisfied.
• Seat Comfort has a significant effect on the satisfaction level: customers rating seat comfort 5 are 99 percent satisfied.
• Baggage Handling has a significant effect on the satisfaction level: customers rating baggage handling 5 are 73 percent satisfied.
• Inflight Entertainment has a significant effect on the satisfaction level: customers rating inflight entertainment 5 are 95 percent satisfied.
• Online Support has a significant effect on the satisfaction level: customers rating online support 0 are 100 percent dissatisfied.

The R code used to produce one of these charts is shown after this list; the full plotting code is given in the appendix.
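As an example, the Seat Comfort chart is a proportion bar chart faceted by rating level. The snippet below is taken, lightly reformatted, from the appendix code; data is the prepared data frame.

library(ggplot2)
# Proportion of satisfied vs. dissatisfied customers at each Seat_comfort rating
ggplot(data, aes(Satisfaction, group = Seat_comfort)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..),
            stat = "count", vjust = -0.5, cex = 3) +
  labs(x = "Seat Comfort Level", y = "Percentage",
       title = "Seat Comfort vs Satisfaction", fill = "Satisfaction") +
  scale_y_continuous(labels = scales::percent) +
  facet_grid(~Seat_comfort) +
  theme(axis.text.x = element_text(size = 8, angle = 90),
        plot.title = element_text(hjust = 0.5))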
DATA MODELLING

We build models to predict the satisfaction level of a customer from the independent variables in our dataset. This will help the airline company understand the satisfaction level of its customers and take the necessary actions to improve it. We build four models to predict the satisfaction level:

• Logistic regression
• Linear discriminant analysis (LDA)
• Decision tree (CART)
• Naïve Bayes

DATA PROCESSING AND NORMALIZATION

We first check for class bias. There is no strong class bias, as both classes are approximately equally represented in the dataset:

##                satisfied  neutral or dissatisfied
##                     54.7                     45.3

Before modeling, we convert the factor variables into numeric type (except for the dependent variable), as required for model building and further analysis. Since several variables are on different scales, which can mislead some modeling techniques, we also normalize the data (see the sketch below).
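A minimal sketch of the class-balance check and the min-max normalization, following the appendix code; data is assumed to be the prepared data frame with Satisfaction as the factor target and all other columns already numeric.

# Class balance of the target (in percent)
round(prop.table(table(data$Satisfaction)) * 100, digits = 1)

# Min-max normalization of the predictors, keeping the factor target unchanged
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))
}
num_cols <- setdiff(names(data), "Satisfaction")
data_n   <- as.data.frame(lapply(data[num_cols], normalize))
data_n$Satisfaction <- data$Satisfaction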
TRAINING AND TEST DATASETS

Now that we have the normalized dataset, we split it into training and testing datasets in the ratio 70:30. This allows us to assess the performance of each model on unseen data.

MODELS

1. LOGISTIC REGRESSION

To build a classification model for satisfaction, we start with the go-to classifier: logistic regression. We use the glm() function in R (family = "binomial") to predict "Satisfaction" from the set of available independent variables.

Call:
glm(formula = Satisfaction ~ ., family = "binomial", data = data_train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8657  -0.5916   0.2056   0.5355   3.5814

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)
(Intercept)                        -7.54608    0.08558 -88.171  < 2e-16 ***
Gender                             -0.91631    0.02315 -39.576  < 2e-16 ***
CustomerType                        1.75680    0.03491  50.323  < 2e-16 ***
Age                                -0.40833    0.06229  -6.556 5.54e-11 ***
TypeTravel                         -0.58477    0.03103 -18.847  < 2e-16 ***
Class                               0.86129    0.02858  30.136  < 2e-16 ***
Flight_Distance                    -0.65794    0.08321  -7.907 2.64e-15 ***
DepartureDelayin_Mins              -7.54371    0.49612 -15.205  < 2e-16 ***
Seat_comfort                        1.34953    0.06315  21.371  < 2e-16 ***
Departure.Arrival.time_convenient  -0.82836    0.04684 -17.686  < 2e-16 ***
Food_drink                         -1.18942    0.06470 -18.385  < 2e-16 ***
Gate_location                       0.46966    0.05299   8.862  < 2e-16 ***
Inflightwifi_service               -0.44060    0.06243  -7.058 1.69e-12 ***
Inflight_entertainment              3.60675    0.05847  61.687  < 2e-16 ***
Online_support                      0.43647    0.06368   6.854 7.18e-12 ***
Ease_of_Onlinebooking               1.38364    0.08170  16.935  < 2e-16 ***
Onboard_service                     1.36192    0.05790  23.521  < 2e-16 ***
Leg_room_service                    1.15122    0.04932  23.341  < 2e-16 ***
Baggage_handling                    0.38854    0.05233   7.424 1.13e-13 ***
Checkin_service                     1.44105    0.04868  29.601  < 2e-16 ***
Cleanliness                         0.37903    0.06767   5.602 2.12e-08 ***
Online_boarding                     0.68344    0.06994   9.772  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 87657  on 63642  degrees of freedom
Residual deviance: 50030  on 63621  degrees of freedom
AIC: 50074

Number of Fisher Scoring iterations: 5

From the summary statistics of the fitted model we can read the coefficient weights, and the p-values indicate the importance of each variable. All the independent variables are significant in predicting "Satisfaction".

We then predict on the test dataset. The predictions are returned as probabilities, so, using a cutoff of 0.5 (the commonly used default), we classify each predicted observation as "satisfied" or "neutral or dissatisfied". The confusion matrix of predicted versus observed values is used to assess the performance of the model.

Confusion Matrix and Statistics

                          Reference
Prediction                 neutral or dissatisfied satisfied
  neutral or dissatisfied                    10020      2267
  satisfied                                   2326     12661

               Accuracy : 0.8316
                 95% CI : (0.8271, 0.836)
    No Information Rate : 0.5473
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.66
 Mcnemar's Test P-Value : 0.3921
            Sensitivity : 0.8481
            Specificity : 0.8116
         Pos Pred Value : 0.8448
         Neg Pred Value : 0.8155
             Prevalence : 0.5473
         Detection Rate : 0.4642
   Detection Prevalence : 0.5495
      Balanced Accuracy : 0.8299
       'Positive' Class : satisfied

From these results we observe:
• An overall accuracy of 83.2% and a Kappa value of 0.66, meaning the model performs considerably better than a random prediction of the dependent variable.
• A sensitivity of 0.848: the model correctly identifies about 85 percent of the satisfied customers.
• A specificity of 0.812: the model correctly identifies about 81 percent of the "neutral or dissatisfied" customers.
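The split-and-fit workflow just described can be condensed as below. The sketch follows the appendix code (caret::createDataPartition for the 70:30 split, glm with a 0.5 probability cutoff), with data_n being the normalized data frame.

library(caret)

# 70:30 stratified split on the target
set.seed(5)
idx <- createDataPartition(data_n$Satisfaction, p = 0.7, list = FALSE)
data_train <- data_n[idx, ]
data_test  <- data_n[-idx, ]

# Logistic regression on all predictors
log_model <- glm(Satisfaction ~ ., data = data_train, family = "binomial")

# Predicted probabilities on the test set, classified with a 0.5 cutoff
log_probs <- predict(log_model, data_test, type = "response")
log_class <- ifelse(log_probs > 0.5, "satisfied", "neutral or dissatisfied")

# Confusion matrix with "satisfied" as the positive class
confusionMatrix(factor(log_class, levels = levels(data_test$Satisfaction)),
                data_test$Satisfaction, positive = "satisfied")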
2. LINEAR DISCRIMINANT ANALYSIS

One notable difference between logistic regression and LDA is that, when the groups in the dataset are clearly separated, logistic regression becomes unstable; in that situation LDA provides the better model. Let us observe the class separability and the fitted discriminant. (Pairs plot of the leading predictors, coloured by class, omitted.)

Coefficients of linear discriminants:
                                          LD1
Gender                             -0.5345605
CustomerType                        1.0790306
Age                                -0.2755955
TypeTravel                         -0.3446182
Class                               0.5061304
Flight_Distance                    -0.3634092
DepartureDelayin_Mins              -3.6944958
Seat_comfort                        0.9148241
Departure.Arrival.time_convenient  -0.5197594
Food_drink                         -0.7096648
Gate_location                       0.2262170
Inflightwifi_service               -0.2303117
Inflight_entertainment              2.1898107
Online_support                      0.2880001
Ease_of_Onlinebooking               0.8540892
Onboard_service                     0.7833862
Leg_room_service                    0.6433999
Baggage_handling                    0.2037793
Checkin_service                     0.8032809
Cleanliness                         0.2146604
Online_boarding                     0.3705444

Since the "satisfied" and "neutral or dissatisfied" classes are already reasonably separable, we do not expect much improvement from the LDA model over the logistic model. We nevertheless fit an LDA model and evaluate it on the test set.

Confusion Matrix and Statistics

                          Reference
Prediction                 neutral or dissatisfied satisfied
  neutral or dissatisfied                    10024      2322
  satisfied                                   2243     12685

               Accuracy : 0.8326
                 95% CI : (0.8281, 0.837)
    No Information Rate : 0.5502
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.662
 Mcnemar's Test P-Value : 0.2483
            Sensitivity : 0.8453
            Specificity : 0.8172
         Pos Pred Value : 0.8497
         Neg Pred Value : 0.8119
             Prevalence : 0.5502
         Detection Rate : 0.4651
   Detection Prevalence : 0.5473
      Balanced Accuracy : 0.8312
       'Positive' Class : satisfied

From these results we observe:
• An overall accuracy of 83.3% and a Kappa value of 0.66.
• A sensitivity of 0.845: the model correctly identifies about 85 percent of the satisfied customers.
• A specificity of 0.817: the model correctly identifies about 82 percent of the "neutral or dissatisfied" customers.
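A minimal LDA sketch following the appendix code (MASS::lda on the same normalized training data); the confusionMatrix call below uses the conventional predicted-then-reference argument order.

library(MASS)
library(caret)

# Fit the linear discriminant and predict the test-set classes
lda_model <- lda(Satisfaction ~ ., data = data_train)
lda_preds <- predict(lda_model, data_test)

# Evaluate with "satisfied" as the positive class
confusionMatrix(lda_preds$class, data_test$Satisfaction, positive = "satisfied")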
3. DECISION TREE – CART ALGORITHM

We use a decision tree algorithm, CART (Classification and Regression Trees), to classify the dependent variable from the set of independent variables. The algorithm repeatedly partitions the data into subsets so that the leaf nodes become as pure (homogeneous) as possible. We make sure that our categorical variables are in factor form, so that CART builds a classification tree rather than a regression tree. (The plotted tree is omitted here.)

The algorithm built a classification tree. The most valuable split is on "Inflight entertainment": observations rated 1-3 go to the left node, which leans towards the "neutral or dissatisfied" class and is split further on "Seat comfort". If seat comfort is also rated 1-3, the observation is finally classified as "neutral or dissatisfied"; otherwise it is classified as "satisfied". The tree continues in this way and ends in 7 leaf nodes. Note that the factor levels shown in the tree are normalized (normalized value (rating value): 0 (0), 0.2 (1), 0.4 (2), 0.6 (3), 0.8 (4), 1 (5)).

We use this model to predict the test dataset and inspect the confusion matrix to understand the model performance.

Confusion Matrix and Statistics

                          Reference
Prediction                 neutral or dissatisfied satisfied
  neutral or dissatisfied                    10638      1997
  satisfied                                   1708     12931

               Accuracy : 0.8642
                 95% CI : (0.86, 0.8682)
    No Information Rate : 0.5473
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.7264
 Mcnemar's Test P-Value : 2.229e-06
            Sensitivity : 0.8662
            Specificity : 0.8617
         Pos Pred Value : 0.8833
         Neg Pred Value : 0.8419
             Prevalence : 0.5473
         Detection Rate : 0.4741
   Detection Prevalence : 0.5367
      Balanced Accuracy : 0.8639
       'Positive' Class : satisfied

From these results we observe:
• An overall accuracy of 86.4% and a Kappa value of 0.73, so the model performs considerably better than a random prediction of the dependent variable.
• A sensitivity of 0.866: the model correctly identifies about 87 percent of the satisfied customers.
• A specificity of 0.862: the model correctly identifies about 86 percent of the "neutral or dissatisfied" customers.
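A compact CART sketch based on the appendix code; data_train_new and data_test_new are the train/test copies in which the rating columns have been re-coerced to factors so that rpart grows a classification tree.

library(rpart)
library(rpart.plot)
library(caret)

# Grow the classification tree and plot it
cart_model <- rpart(Satisfaction ~ ., data = data_train_new, method = "class")
rpart.plot(cart_model, digits = 2, type = 4, extra = 106)

# Predict the test set and evaluate
cart_preds <- predict(cart_model, data_test_new, type = "class")
confusionMatrix(cart_preds, data_test_new$Satisfaction, positive = "satisfied")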
4. NAÏVE BAYES

Finally, we fit a Naïve Bayes classifier. Here the satisfied class is coded 1 and the "neutral or dissatisfied" class 0.

Confusion Matrix and Statistics

        Reference
pc           1     0
   1     13071  1949
   0      1792 10329

               Accuracy : 0.8622
                 95% CI : (0.858, 0.8662)
    No Information Rate : 0.5476
    P-Value [Acc > NIR] : < 2e-16
                  Kappa : 0.7215
 Mcnemar's Test P-Value : 0.01076
            Sensitivity : 0.8794
            Specificity : 0.8413
         Pos Pred Value : 0.8702
         Neg Pred Value : 0.8522
             Prevalence : 0.5476
         Detection Rate : 0.4816
   Detection Prevalence : 0.5534
      Balanced Accuracy : 0.8603
       'Positive' Class : 1

From these results we observe:
• An overall accuracy of 86.2% and a Kappa value of 0.72, so the model performs considerably better than a random prediction of the dependent variable.
• A sensitivity of 0.879: the model correctly identifies about 88 percent of the satisfied customers, the highest among all the models.
• A specificity of 0.841: the model correctly identifies about 84 percent of the "neutral or dissatisfied" customers, which is lower than the CART model.
• The model performs well at predicting satisfied customers, but not quite as well for the neutral or dissatisfied customers.

MODEL COMPARISON

To compare the performance of the four models on common metrics, we extract the accuracy, Kappa statistic, sensitivity and specificity from each confusionMatrix() output.

              Logistic Regression   Linear Discriminant Analysis   Decision Tree   Naïve Bayes
Accuracy                0.8315979                       0.8326245       0.8641563        0.8622
Kappa                   0.6600094                       0.6620347       0.7264096        0.7215
Sensitivity             0.8481377                       0.8452722       0.8662245        0.8794
Specificity             0.8115989                       0.8171517       0.8616556        0.8413

From this summary we observe that:
• The logistic regression and LDA models provide similar performance metrics for our dataset.
• The decision tree provides better performance metrics than the other models, although there is little difference between Naïve Bayes and the decision tree.

Below we compare the models against one another.

1. LDA OVER THE LOGISTIC REGRESSION MODEL

As expected, we do not see a significant improvement in model performance, since the predicted classes are already reasonably separable and we have sufficient predictors. We only see a slight improvement (under one percentage point) in specificity, which measures how well the "neutral or dissatisfied" customers are identified. For our business problem it is critical to identify this negative class correctly, as the airline wants to find unsatisfied customers and reward them to improve their satisfaction.

2. DECISION TREE OVER LDA AND NAÏVE BAYES

We obtained a decent 3 percentage-point improvement in accuracy using the decision tree over logistic regression and LDA, but no significant difference was observed against the Naïve Bayes model. We also obtained improvements of roughly 5 and 2 percentage points in specificity over LDA and Naïve Bayes respectively, which is a critical metric for our business problem.
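The comparison table can be assembled programmatically from the caret confusionMatrix objects. The sketch below is a compact variant of the comp_cfm() helper in the appendix; it assumes the four confusion-matrix objects are named mr1 (logistic), mr2 (LDA), mr3 (CART) and mr4 (Naïve Bayes). The name mr4 is hypothetical, since the appendix evaluates Naïve Bayes in a separate script.

# Pull Accuracy, Kappa, Sensitivity and Specificity from each confusionMatrix object
extract_metrics <- function(cm){
  c(cm$overall[c("Accuracy", "Kappa")], cm$byClass[c("Sensitivity", "Specificity")])
}

round(sapply(list(LogisticRegression = mr1,
                  LinearDiscriminantAnalysis = mr2,
                  DecisionTree = mr3,
                  NaiveBayes = mr4),
             extract_metrics), 4)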
CONCLUSION

We conclude that the decision tree model built with the CART algorithm is the best-suited model for our dataset and business problem, providing an accuracy of 86 percent, with 86 percent of "satisfied" customers identified correctly (sensitivity) and 86 percent of "neutral or dissatisfied" customers identified correctly (specificity). We also observed, from both the exploratory analysis and the decision tree, that the inflight entertainment experience, seat comfort and online service levels significantly affect the customer experience, along with several of the other variables considered. Airline service companies must ensure a high quality of service on these parameters to achieve a high level of customer satisfaction.

None of the models classifies the customer satisfaction level perfectly; the prediction accuracy can be improved further using more advanced classification algorithms or deep learning approaches.
APPENDIX – R SOURCE CODE

# Loading libraries
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(e1071)
library(MASS)
library(corrplot)
library(ggplot2)
library(tidyr)
library(sparklyr)

# Setting working directory
setwd("D:/PGP BA Classes/Capstone Projects/3. Airplane Passenger Satisfaction/Final Submission")

# Loading data
flightdata <- read.csv("Flight data.csv", header = T)
surveydata <- read.csv("Surveydata.csv", header = T)
flightsurvey <- cbind(flightdata, surveydata[,-1])

# Filling blanks with NA
empty_as_na <- function(x){
  if("factor" %in% class(x)) x <- as.character(x)
  ifelse(as.character(x)!="", x, NA)
}

## transform all columns
flightsurvey <- flightsurvey %>% mutate_each(list(empty_as_na))
str(flightsurvey)

# Dealing with NA values
# NA's count
sapply(flightsurvey, function(x) sum(is.na(x)))

# Missing value imputation with mode
# For CustomerType
ct <- unique(flightsurvey$CustomerType[!is.na(flightsurvey$CustomerType)])
CustomerType <- ct[which.max(tabulate(match(flightsurvey$CustomerType, ct)))]
CustomerType
flightsurvey$CustomerType[is.na(flightsurvey$CustomerType)] <- CustomerType

# For TypeTravel
tt <- unique(flightsurvey$TypeTravel[!is.na(flightsurvey$TypeTravel)])
TypeTravel <- tt[which.max(tabulate(match(flightsurvey$TypeTravel, tt)))]
TypeTravel
flightsurvey$TypeTravel[is.na(flightsurvey$TypeTravel)] <- TypeTravel

# For Departure.Arrival.time_convenient
datc <- unique(flightsurvey$Departure.Arrival.time_convenient[!is.na(flightsurvey$Departure.Arrival.time_convenient)])
Departure.Arrival.time_convenient <- datc[which.max(tabulate(match(flightsurvey$Departure.Arrival.time_convenient, datc)))]
Departure.Arrival.time_convenient
flightsurvey$Departure.Arrival.time_convenient[is.na(flightsurvey$Departure.Arrival.time_convenient)] <- Departure.Arrival.time_convenient

# For Food_drink
fd <- unique(flightsurvey$Food_drink[!is.na(flightsurvey$Food_drink)])
Food_drink <- fd[which.max(tabulate(match(flightsurvey$Food_drink, fd)))]
Food_drink
flightsurvey$Food_drink[is.na(flightsurvey$Food_drink)] <- Food_drink

# For Onboard_service
os <- unique(flightsurvey$Onboard_service[!is.na(flightsurvey$Onboard_service)])
Onboard_service <- os[which.max(tabulate(match(flightsurvey$Onboard_service, os)))]
Onboard_service
flightsurvey$Onboard_service[is.na(flightsurvey$Onboard_service)] <- Onboard_service

# Converting categorical ratings into integers
flightsurvey$Seat_comfort <- factor(flightsurvey$Seat_comfort, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Seat_comfort <- as.integer(factor(flightsurvey$Seat_comfort))

flightsurvey$Departure.Arrival.time_convenient <- factor(flightsurvey$Departure.Arrival.time_convenient, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Departure.Arrival.time_convenient <- as.integer(factor(flightsurvey$Departure.Arrival.time_convenient))

flightsurvey$Food_drink <- factor(flightsurvey$Food_drink, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Food_drink <- as.integer(factor(flightsurvey$Food_drink))

flightsurvey$Gate_location <- factor(flightsurvey$Gate_location, levels=c("very inconvinient","Inconvinient","need improvement","manageable","Convinient","very convinient"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Gate_location <- as.integer(factor(flightsurvey$Gate_location))

flightsurvey$Inflightwifi_service <- factor(flightsurvey$Inflightwifi_service, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Inflightwifi_service <- as.integer(factor(flightsurvey$Inflightwifi_service))

flightsurvey$Inflight_entertainment <- factor(flightsurvey$Inflight_entertainment, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Inflight_entertainment <- as.integer(factor(flightsurvey$Inflight_entertainment))

flightsurvey$Online_support <- factor(flightsurvey$Online_support, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Online_support <- as.integer(factor(flightsurvey$Online_support))

flightsurvey$Ease_of_Onlinebooking <- factor(flightsurvey$Ease_of_Onlinebooking, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Ease_of_Onlinebooking <- as.integer(factor(flightsurvey$Ease_of_Onlinebooking))

flightsurvey$Onboard_service <- factor(flightsurvey$Onboard_service, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Onboard_service <- as.integer(factor(flightsurvey$Onboard_service))

flightsurvey$Leg_room_service <- factor(flightsurvey$Leg_room_service, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Leg_room_service <- as.integer(factor(flightsurvey$Leg_room_service))

flightsurvey$Baggage_handling <- factor(flightsurvey$Baggage_handling, levels=c("poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4), ordered = T)
flightsurvey$Baggage_handling <- as.integer(factor(flightsurvey$Baggage_handling))

flightsurvey$Checkin_service <- factor(flightsurvey$Checkin_service, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Checkin_service <- as.integer(factor(flightsurvey$Checkin_service))

flightsurvey$Cleanliness <- factor(flightsurvey$Cleanliness, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Cleanliness <- as.integer(factor(flightsurvey$Cleanliness))

flightsurvey$Online_boarding <- factor(flightsurvey$Online_boarding, levels=c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels=c(0,1,2,3,4,5), ordered = T)
flightsurvey$Online_boarding <- as.integer(factor(flightsurvey$Online_boarding))

str(flightsurvey)

# Drop ID attribute
data <- flightsurvey[-1]

# Finding correlation
data.non <- data %>% filter(!is.na(data$ArrivalDelayin_Mins))
for(i in 1:ncol(data.non)){
  if(class(data[,i])!="integer"){
    data.non[,i] <- as.integer(data.non[,i])
  }
}
data.cor <- round(cor(data.non[,1:7], data.non[,8]), 3)
colnames(data.cor) <- "ArrivalDelayin_Mins"
data.cor

# Removing highly correlated attribute "Arrival.Delay.in.Minutes"
data <- data[-8]

# Exploratory analysis
ggplot(data, aes(Satisfaction, group = Gender)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Satisfaction Level", y = "Percentage", title="Gender vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Gender) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Class)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Class of Travel", y = "Percentage", title="Travel Class vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Class) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Seat_comfort)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Seat Comfort Level", y = "Percentage", title="Seat Comfort vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Seat_comfort) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Inflight_entertainment)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Inflight entertainment Level", y = "Percentage", title="Inflight Entertainment vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Inflight_entertainment) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Online_support)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Online support Level", y = "Percentage", title="Online Support vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Online_support) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

ggplot(data, aes(Satisfaction, group = Baggage_handling)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), stat = "count", vjust = -.5, cex = 3) +
  labs(x="Baggage Handling Level", y = "Percentage", title="Baggage Handling vs Satisfaction", fill="Satisfaction") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~Baggage_handling) +
  theme(axis.text.x = element_text(size=8, angle = 90), plot.title = element_text(hjust = 0.5))

# Data modelling
round(prop.table(table(data$Satisfaction))*100, digit=1)

for(i in 2:ncol(data)){
  if(class(data[,i])=="factor"){
    data[,i] <- as.integer(data[,i])
  }
}
normalize <- function(x){
  return ((x-min(x))/(max(x)-min(x)))
}

#write.csv(data, file = "Data.csv")
data <- read.csv("Data.csv", header = T)
data <- data[,-1]
str(data)

# Converting categorical to int
# Gender
levels(data$Gender)
data$Gender <- factor(data$Gender, levels=c("Female","Male"), labels=c(0,1))
data$Gender <- as.integer(factor(data$Gender))

# CustomerType
levels(data$CustomerType)
data$CustomerType <- factor(data$CustomerType, levels=c("disloyal Customer","Loyal Customer"), labels=c(0,1), ordered = T)
data$CustomerType <- as.integer(factor(data$CustomerType))

# TypeTravel
levels(data$TypeTravel)
data$TypeTravel <- factor(data$TypeTravel, levels=c("Business travel","Personal Travel"), labels=c(0,1))
data$TypeTravel <- as.integer(factor(data$TypeTravel))

# Class
levels(data$Class)
data$Class <- factor(data$Class, levels=c("Eco","Eco Plus","Business"), labels=c(0,1,2), ordered = T)
data$Class <- as.integer(factor(data$Class))

data_n <- as.data.frame(lapply(data[2:22], normalize))
data_n <- cbind(data_n, Satisfaction=data$Satisfaction)

set.seed(5)
partition <- createDataPartition(y = data_n$Satisfaction, p=0.7, list= F)
data_train <- data_n[partition,]
data_test <- data_n[-partition,]
# write.csv(data_train, file = "data_train.csv")
# write.csv(data_test, file = "data_test.csv")

# Logistic regression
log_model <- glm(Satisfaction~., data=data_train, family = "binomial")
summary(log_model)

log_preds <- predict(log_model, data_test[,1:22], type = "response")
head(log_preds)

log_class <- array(c(99))
for (i in 1:length(log_preds)){
  if(log_preds[i]>0.5){
    log_class[i] <- "satisfied"
  }else{
    log_class[i] <- "neutral or dissatisfied"
  }
}
# Creating a new dataframe containing the actual and predicted values
log_result <- data.frame(Actual = data_test$Satisfaction, Prediction = log_class)
mr1 <- confusionMatrix(as.factor(log_class), data_test$Satisfaction, positive = "satisfied")
mr1

# Linear discriminant analysis
pairs(data_train[,1:5], main="Predict ", pch=22, bg=c("red", "blue")[unclass(data_train$Satisfaction)])

lda_model <- lda(Satisfaction ~ ., data_train)
lda_model

# Predict with the model
lda_preds <- predict(lda_model, data_test)
mr2 <- confusionMatrix(data_test$Satisfaction, lda_preds$class, positive = "satisfied")
mr2

# Decision tree - CART algorithm
data_train_new <- data_train
for(i in c(1,2,4,5,8,9,10,11,12,13,14,15,16,17,18,19,20,21)){
  data_train_new[,i] <- as.factor(data_train_new[,i])
}

data_test_new <- data_test
for(i in c(1,2,4,5,8,9,10,11,12,13,14,15,16,17,18,19,20,21)){
  data_test_new[,i] <- as.factor(data_test_new[,i])
}
str(data_train_new)
str(data_test_new)

cart_model <- rpart(Satisfaction~ ., data_train_new, method="class")
cart_model
rpart.plot(cart_model, digits = 2, type=4, extra=106)

cm1 <- predict(cart_model, data_test_new, type = "class")
mr3 <- confusionMatrix(cm1, data_test$Satisfaction, positive = "satisfied")
mr3

# Model comparison
comp_cfm <- function(x1,x2,x3,n1,n2,n3){
  a <- data.frame()
  b <- as.matrix(c(x1[[3]][1:2],x1[[4]][1:2]))
  c <- as.matrix(c(x2[[3]][1:2],x2[[4]][1:2]))
  d <- as.matrix(c(x3[[3]][1:2],x3[[4]][1:2]))
  colnames(b) <- n1
  colnames(c) <- n2
  colnames(d) <- n3
  a <- as.data.frame(cbind(b,c,d))
  return(a)
}
r <- comp_cfm(mr1, mr2, mr3, "LogisticRegression", "LinearDiscriminantAnalysis", "DecisionTree")
r

## Capstone project data and Naive Bayes model
setwd("D:/PGP BA Classes/Capstone Projects/3. Airplane Passenger Satisfaction/Final Submission")
survey <- read.csv("Surveydata.csv")
flight <- read.csv("Flight data.csv")

## Combine data set
FSdata <- cbind(flight[-1], survey[-1])
str(FSdata)
colnames(FSdata)

library(dplyr)
## filling "Blanks" with "NA"
empty_as_na <- function(x) {
  if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
  ifelse(as.character(x)!="", x, NA)
}
## transform all columns
FSdata <- FSdata %>% mutate_each(list(empty_as_na))
FSdata

## Missing value imputation for CustomerType
val1 <- unique(FSdata$CustomerType[!is.na(FSdata$CustomerType)])
mode1 <- val1[which.max(tabulate(match(FSdata$CustomerType, val1)))]
mode1
FSdata$CustomerType[is.na(FSdata$CustomerType)] <- mode1

## Missing value imputation for TypeTravel
val2 <- unique(FSdata$TypeTravel[!is.na(FSdata$TypeTravel)])
mode2 <- val2[which.max(tabulate(match(FSdata$TypeTravel, val2)))]
mode2
FSdata$TypeTravel[is.na(FSdata$TypeTravel)] <- mode2

## Missing value imputation for Departure.Arrival.time_convenient
val3 <- unique(FSdata$Departure.Arrival.time_convenient[!is.na(FSdata$Departure.Arrival.time_convenient)])
mode3 <- val3[which.max(tabulate(match(FSdata$Departure.Arrival.time_convenient, val3)))]
mode3
FSdata$Departure.Arrival.time_convenient[is.na(FSdata$Departure.Arrival.time_convenient)] <- mode3

## Missing value imputation for Food_drink
val4 <- unique(FSdata$Food_drink[!is.na(FSdata$Food_drink)])
mode4 <- val4[which.max(tabulate(match(FSdata$Food_drink, val4)))]
mode4
FSdata$Food_drink[is.na(FSdata$Food_drink)] <- mode4

## Missing value imputation for Onboard_service
val5 <- unique(FSdata$Onboard_service[!is.na(FSdata$Onboard_service)])
mode5 <- val5[which.max(tabulate(match(FSdata$Onboard_service, val5)))]
mode5
FSdata$Onboard_service[is.na(FSdata$Onboard_service)] <- mode5

## deleting the variable "Arrival Delay in Mins" due to collinearity
FSdata <- FSdata[-8]
colnames(FSdata)
write.csv(FSdata, file = "FSdata.csv")
sum(is.na(FSdata))

## Binning
library(rbin)
FSdata$Age <- cut(FSdata$Age, c(0,20,35,55,90))
table(FSdata$Age)
prop.table(table(FSdata$Age))
levels(FSdata$Age)

## Binning Flight_Distance
FSdata$Flight_Distance <- cut(FSdata$Flight_Distance,
  c(0,200,400,600,800,1000,1200,1400,1600,1800,2000,2200,2400,2600,2800,3000,3200,3400,3600,3800,4000,4200,4400,4600,4800,5000,5200,5400,5600,5800,6000))
table(FSdata$Flight_Distance)
prop.table(table(FSdata$Flight_Distance))
levels(FSdata$Flight_Distance)

## Binning DepartureDelayin_Mins (first 2 hrs every 10 mins, next 3 hrs every 15 mins,
## next 5 hrs every 30 mins, next 6 hrs every 60 mins, and beyond 6 hrs in the last bucket)
FSdata$DepartureDelayin_Mins <- cut(FSdata$DepartureDelayin_Mins,
  c(0,10,20,30,40,50,60,70,80,90,100,110,120,135,150,165,180,195,210,225,240,255,270,285,300,330,360,390,420,450,480,510,540,570,600,660,720,780,840,900,960,1620))
table(FSdata$DepartureDelayin_Mins)
prop.table(table(FSdata$DepartureDelayin_Mins))

## Converting 0 as string
levels <- levels(FSdata$DepartureDelayin_Mins)
levels[length(levels)+1] <- "0"
FSdata$DepartureDelayin_Mins <- factor(FSdata$DepartureDelayin_Mins, levels = levels)
FSdata$DepartureDelayin_Mins[is.na(FSdata$DepartureDelayin_Mins)] <- "0"
str(FSdata)

## Converting character variables to factor
FSdata$Gender <- factor(FSdata$Gender, c("Female","Male"), labels = c(0,1))
FSdata$CustomerType <- factor(FSdata$CustomerType, c("Loyal Customer","disloyal Customer"), labels = c(1,0))
FSdata$TypeTravel <- factor(FSdata$TypeTravel, c("Personal Travel", "Business travel"), labels = c(0,1))
FSdata$Class <- factor(FSdata$Class, c("Eco","Eco Plus","Business"), labels = c(1:3))
FSdata$Food_drink <- factor(FSdata$Food_drink, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Seat_comfort <- factor(FSdata$Seat_comfort, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Departure.Arrival.time_convenient <- factor(FSdata$Departure.Arrival.time_convenient, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Gate_location <- factor(FSdata$Gate_location, c("very inconvinient","Inconvinient","need improvement","manageable","Convinient","very convinient"), labels = c(0:5))
FSdata$Inflightwifi_service <- factor(FSdata$Inflightwifi_service, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Inflight_entertainment <- factor(FSdata$Inflight_entertainment, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Online_support <- factor(FSdata$Online_support, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Ease_of_Onlinebooking <- factor(FSdata$Ease_of_Onlinebooking, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Onboard_service <- factor(FSdata$Onboard_service, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Leg_room_service <- factor(FSdata$Leg_room_service, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Baggage_handling <- factor(FSdata$Baggage_handling, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Checkin_service <- factor(FSdata$Checkin_service, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Cleanliness <- factor(FSdata$Cleanliness, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Online_boarding <- factor(FSdata$Online_boarding, c("extremely poor","poor","need improvement","acceptable","good","excellent"), labels = c(0:5))
FSdata$Satisfaction <- factor(FSdata$Satisfaction, c("satisfied","neutral or dissatisfied"), labels=c(1,0))

levels(FSdata$Cleanliness)
str(FSdata)
## Split the data in train & test sample (70:30)
library(caret)
set.seed(123)
id <- sample(2, nrow(FSdata), prob = c(.7,.3), replace = T)
training <- FSdata[id==1,]
testing <- FSdata[id==2,]
c(nrow(training), nrow(testing))
str(testing)

## Rows and columns
dim(training)
dim(testing)

## Column names
colnames(training)
colnames(testing)

## Naive Bayes model
library(e1071)
library(caret)
library(AUC)
model.bayes <- naiveBayes(Satisfaction~., data = training)
model.bayes

## test
pc <- NULL
pc <- predict(model.bayes, testing, type = "class")
summary(pc)
confusionMatrix(table(pc, testing$Satisfaction))

## Lift chart
pb <- NULL
pb <- predict(model.bayes, testing, type = "raw")
pb <- as.data.frame(pb)
pred.bayes <- data.frame(testing$Satisfaction, pb$`1`)
colnames(pred.bayes) <- c("target","score")
lift.bayes <- lift(target~score, data=pred.bayes, cuts = 10, class = "1")
xyplot(lift.bayes, main="Bayesian Classifier - Lift Chart", type=c("1","g"), lwd=2,
       x.scales=list(x=list(alternative=FALSE, tick.number=10),
                     y=list(alternating=FALSE, tick.number=10)))