FLIGHT LANDING PROJECT
(Logistic Regression)
STATISTICAL MODELING, MS BANA
13th February, 2018
Preethi Jayaram Jayaraman
MS BANA, Class of 2017
M12420360
OBJECTIVE OF THE STUDY:
This study models the risk of landing overrun for commercial flights. Landing
data for 950 commercial flights (Airbus and Boeing) are available, with
variables including Aircraft, Duration, Number of Passengers, Ground Speed,
Air Speed, Height and Pitch, along with the two derived responses Long
Landing and Risky Landing. The study evaluates which factors affect Long
Landing, which indicates that the landing distance was greater than 2500 m,
and Risky Landing, which indicates that the landing distance was greater
than 3000 m. The results can inform decisions about a landing based on the
estimated risk of a long or risky landing.
BACKGROUND:
As a start to the project, the datasets with details about the flights, FAA1
and FAA2 were merged and cleaned into the dataset, FAA_clean. FAA_clean is a
dataset with 831 observations and 8 variables. The structure and summary
statistics of FAA_clean can be found below.
# Step 0 – Structure, Summary Statistics of FAA_clean
str(FAA_clean)
## Classes 'tbl_df', 'tbl' and 'data.frame': 831 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Variable      Type         Missing values %  Min    Max      Mean    Median
Aircraft      Categorical  -                 -      -        -       -
Duration      Numerical    5.8%              14.7   305.6    154.0   153.9
No_pasg       Numerical    -                 29.0   87.0     60.1    60.0
Speed_ground  Numerical    -                 27.7   141.2    79.4    79.6
Speed_air     Numerical    75.5%             90.0   141.7    103.8   101.1
Height        Numerical    -                 -3.5   59.9     30.1    30.01
Pitch         Numerical    -                 2.2    5.9      4.0     4.0
Distance      Numerical    -                 34.1   6533.05  1526.0  1258.1
SECTION 1: CREATE BINARY RESPONSES
# Step 1 – Create Binary responses – Long Landing, Risky Landing
FAA_clean$long.landing <- ifelse(FAA_clean$distance > 2500,1,0)
FAA_clean$risky.landing <- ifelse(FAA_clean$distance > 3000,1,0)
FAA_clean <- FAA_clean[,-8]
Observations and Conclusion:
The variable Distance from the original dataset was used to derive two
binary variables, Long Landing and Risky Landing. Long Landing is 1 for all
flights where Distance is greater than 2500 m and 0 otherwise; Risky Landing
is 1 where Distance is greater than 3000 m and 0 otherwise. The continuous
variable Distance was then dropped, and Long Landing and Risky Landing are
the response variables from here on.
SECTION 2: IDENTIFYING IMPORTANT FACTORS FOR RESPONSE, ‘LONG LANDING’
# Step 2 – Distribution of Long Landing
hist(FAA_clean$long.landing)
pct <-
round(table(FAA_clean$long.landing)/length(FAA_clean$long.landing)*100,1)
labs <- c("Not Long landing (<= 2500 m):", "Long landing (> 2500 m):")
labs <- paste(labs,pct)
labs <- paste(labs,"%",sep = "" )
pie(table(FAA_clean$long.landing),labels = labs,col = rainbow(length(labs)),
main = "Pie chart of Long Landing")
Observations and Conclusion:
The distribution of the variable Long Landing can be seen in the above
figures. 87.6% of the landings were recorded as not long, while 12.4% were
recorded as long (> 2500 m).
# Step 3 - Single Factor Regression
var_corr <- names(FAA_clean[1:7])
pvalue <- vapply(FAA_clean[1:7],
function(x) { summary(glm(FAA_clean$long.landing ~ x,
family = binomial))$coefficients[8] },
FUN.VALUE = numeric(1) )
reg_coef <- vapply(FAA_clean[1:7],
function(x) { summary(glm(FAA_clean$long.landing ~ x,
family = binomial))$coefficients[2] },
FUN.VALUE = numeric(1) )
sign_eqn <- vapply(reg_coef,
function(x) { ifelse(x >= 0, "Positive", "Negative")},
FUN.VALUE = character(1))
odds_ratio <- vapply(reg_coef, function(x) { exp(x) },
FUN.VALUE = numeric(1))
# Regression Table - Table 1
table1 <- data.frame(var_corr, abs(reg_coef), odds_ratio, sign_eqn, pvalue)
table1 <- table1[order(abs(pvalue)),]
names(table1) <- c("Variable", "Size of Regression Coefficient", "Odds Ratio",
"Direction of Regression Coefficient", "Size of the p-value")
Observations and Conclusion:
Table 1 gives the significance (p-value) of the relationship between the
response variable, Long Landing, and each predictor variable, X, fitted one
at a time. The table is sorted by increasing p-value, so the variables
appear in order of decreasing strength of evidence for an association. The
relationships with speed_ground, speed_air, aircraft and pitch are found to
be significant.
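The single factor table above picks values out of the coefficient matrix by position ([2] and [8]). A minimal sketch of the same idea with extraction by row and column name is shown below. The data frame `toy`, its columns and the helper `single_factor()` are hypothetical stand-ins, not part of the original analysis.

```r
# Hypothetical stand-in data; the real analysis uses FAA_clean.
set.seed(1)
n <- 200
toy <- data.frame(
  speed  = rnorm(n, mean = 100, sd = 10),
  height = rnorm(n, mean = 30, sd = 10)
)
# Binary response driven by speed only
toy$long <- rbinom(n, size = 1, prob = plogis(-20 + 0.2 * toy$speed))

# Fit one logistic regression per predictor and pull the slope row by name
single_factor <- function(response, predictors, data) {
  rows <- lapply(predictors, function(v) {
    fit <- glm(reformulate(v, response = response), data = data,
               family = binomial)
    slope <- coef(summary(fit))[2, ]   # row 1 is the intercept
    data.frame(
      variable   = v,
      estimate   = unname(slope["Estimate"]),
      odds_ratio = exp(unname(slope["Estimate"])),
      p_value    = unname(slope["Pr(>|z|)"])
    )
  })
  out <- do.call(rbind, rows)
  out[order(out$p_value), ]            # smallest p-value first
}

tab <- single_factor("long", c("speed", "height"), toy)
print(tab)
```

Naming the column ("Pr(>|z|)") makes it harder to pick the wrong cell by accident than a bare index such as [8].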
# Step 4 - Visualizing the association b/w long-landing and the significant
# variables
library(ggplot2)
attach(FAA_clean)
# 4.1. Long-landing vs speed_ground
plot(jitter(long.landing,0.1)~jitter(speed_ground),FAA_clean,
xlab = "Speed Ground", ylab = "Long Landing",pch = ".",
main = "Long Landing vs Speed Ground")
# fill must be a factor so that the two outcomes are drawn as separate groups
ggplot(FAA_clean,aes(x = speed_ground, fill = factor(long.landing))) +
geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) +
ggtitle("Long Landing vs Speed Ground")
cor(long.landing, speed_ground)
## [1] 0.6214409
Observations and Conclusion:
There is a clear pattern between Long Landing and speed_ground. Most
landings that were considered long have speed_ground greater than 100, and
the scatter plot shows no long landings when speed_ground is less than 100.
The histogram suggests that speed_ground is approximately normally
distributed.
# 4.2. Long-landing vs speed_air
plot(jitter(long.landing,0.1)~jitter(speed_air),FAA_clean,xlab = "Speed Air",
ylab = "Long Landing",pch = ".", main = "Long Landing vs Speed Air")
ggplot(FAA_clean,aes(x = speed_air,fill = factor(long.landing))) +
geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) +
ggtitle("Long Landing vs Speed Air")
cor(long.landing, speed_air, use = "complete.obs")
## [1] 0.7329355
Observations and Conclusion:
There is a clear pattern between Long Landing and speed_air as well. The
lowest recorded value of speed_air is 90 mph. Most landings that were
considered long have speed_air greater than 95, and every landing with
speed_air greater than 110 was long. Landings with speed_air between 90 and
100 mph may or may not be long. The speed_air histogram is right-skewed. The
correlation between Long Landing and the predictor speed_air is 0.73.
# 4.3. Long-landing vs Pitch
plot(jitter(long.landing,0.1)~jitter(pitch),FAA_clean,xlab = "pitch",
ylab = "Long Landing",pch = ".", main = "Long Landing vs Pitch")
ggplot(FAA_clean,aes(x = pitch,fill = factor(long.landing))) +
geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) +
ggtitle("Long Landing vs Pitch")
cor(long.landing, pitch)
## [1] 0.06919407
Observations and Conclusion:
From the scatter plot and the histogram, most values of pitch in the dataset
are close to 4, and a few landings with pitch near 4 were considered long.
The correlation between Long Landing and the predictor pitch is very weak
(0.07).
# 4.4. Long-landing vs Aircraft
ggplot(FAA_clean,aes(x = long.landing, fill = aircraft)) +
geom_bar(position = "dodge", width = 0.5) +
facet_grid((~ aircraft))
Observations and Conclusion:
From the bar plot chart, it’s clear that around 70 Boeing flight landings and
around 35 Airbus flight landings were considered long, while the rest of the
flight landings were not considered long.
# Step 5 – Identify collinearity in the predictor variables and group
plot(speed_ground, speed_air)
cor.test(speed_ground, speed_air, use = "complete.obs")
## Pearson's product-moment correlation
##
## data: speed_ground and speed_air
## t = 90.453, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9841163 0.9908449
## sample estimates:
## cor
## 0.9879383
Observations and Conclusion:
To check for correlation between the predictor variables speed_air and
speed_ground, the two were plotted against each other and the cor.test
function was called to measure their correlation. The high correlation of
0.99 between speed_air and speed_ground indicates that only one of these
variables should be used to build the final model.
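One way to see why a correlation this high matters is the variance inflation factor (VIF). The sketch below is not part of the original analysis; it assumes a model containing only these two predictors, in which case the VIF follows directly from the reported correlation.

```r
# Pairwise correlation reported by cor.test() for speed_ground and speed_air
r <- 0.9879383

# With only two predictors, R^2 from regressing one on the other equals r^2,
# so the variance inflation factor is 1 / (1 - r^2)
vif <- 1 / (1 - r^2)
round(vif, 1)
```

A VIF far above the common rule-of-thumb cutoffs of 5 or 10 means the two slopes would be estimated very imprecisely if both predictors were kept.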
cor(data.frame(speed_ground, speed_air, pitch, height, no_pasg, duration),
use = "complete.obs")
## speed_ground speed_air pitch height
## speed_ground 1.000000000 9.883475e-01 -0.06316127 -0.095483596
## speed_air 0.988347471 1.000000e+00 -0.04826810 -0.086729286
## pitch -0.063161271 -4.826810e-02 1.00000000 -0.033217630
## height -0.095483596 -8.672929e-02 -0.03321763 1.000000000
## no_pasg 0.003570599 2.242971e-05 -0.03766471 -0.006625455
## duration 0.023885892 4.454351e-02 -0.05627519 0.073775491
## no_pasg duration
## speed_ground 3.570599e-03 0.02388589
## speed_air 2.242971e-05 0.04454351
## pitch -3.766471e-02 -0.05627519
## height -6.625455e-03 0.07377549
## no_pasg 1.000000e+00 -0.06917843
## duration -6.917843e-02 1.00000000
Observations and Conclusion:
Further, the correlation matrix shows no other correlation this high. Based
on these results, speed_air was chosen as the representative variable. The
choice was made because speed_air has a higher correlation with the response
variable, Long Landing, and is more relevant for measuring it.
# Step 5 - Initiate a full model - after grouping
full.FAA_clean <- na.omit(FAA_clean)
full.model <- glm(data = full.FAA_clean, long.landing ~ aircraft + height +
pitch + speed_air + no_pasg + duration, family = binomial)
Observations and Conclusion:
Choosing speed_air as the representative variable between speed_air and
speed_ground, the full logistic model was built. The results of the full
logistic model show that variables, aircraft, height and speed_air are
significant with an AIC of 47.264.
# Step 6 - Stepwise variable selection using AIC criterion
# (step() with no scope argument searches backward from the full model)
full.model1 <- glm(data = full.FAA_clean, long.landing ~ aircraft + height +
pitch + speed_air + no_pasg + duration, family = binomial)
model.AIC <- step(full.model1,trace = 0)
Observations and Conclusion:
Running stepwise variable selection with the AIC criterion gave a final
model with aircraft, height, pitch and speed_air as the final variables. The
AIC of the final model was 44.278. Because step() is called on the full
model without a scope, the search runs backward, dropping terms from the
full model.
The results are consistent with the earlier steps: only aircraft, height,
pitch and speed_air are retained.
# Step 7 - Stepwise variable selection using BIC criterion (k = log(n))
model.BIC <- step(full.model1,k = log(195), trace = 0)
Observations and Conclusion:
Running stepwise variable selection with the BIC criterion gave a final
model with aircraft, height and speed_air as the final variables. The AIC of
this model was 44.798.
Compared with the model selected using AIC, the BIC-selected model drops
the insignificant variable, pitch. This reflects the heavier penalty that
BIC places on each additional parameter (log(195), about 5.27, versus 2 for
AIC), which favours a simpler model over the slightly better-fitting one
chosen by AIC.
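The two selection runs differ only in the penalty `k` passed to `step()`. The following is a minimal sketch on simulated data, assuming hypothetical predictors `x1`, `x2`, `x3`; it is not the FAA data and will not reproduce the AIC values quoted above.

```r
# Hypothetical stand-in data; the real analysis uses full.FAA_clean.
set.seed(42)
n <- 300
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)   # pure noise, unrelated to the response
)
toy$y <- rbinom(n, size = 1,
                prob = plogis(0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

full <- glm(y ~ x1 + x2 + x3, data = toy, family = binomial)

# No scope is supplied, so the search can only drop terms (backward).
# k = 2 gives AIC; k = log(n) gives BIC.
model_aic <- step(full, direction = "backward", trace = 0)
model_bic <- step(full, direction = "backward", k = log(nobs(full)),
                  trace = 0)

formula(model_aic)
formula(model_bic)
```

Using `log(nobs(full))` instead of a hard-coded sample size keeps the BIC penalty correct if the number of complete cases changes.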
# Step 8 – Risk factors for ‘Long Landing’
# Summary of Findings:
i) 12.4% of all the flight landings recorded in the data set are long
landings, which are of concern in this study; most of these are Boeing
flights
ii) The very high correlation between the predictors speed_air and
speed_ground shows high potential for multicollinearity. Hence, only one of
them should be used in a final model
iii) The high correlation of 0.73 between the response variable, Long
Landing, and the predictor, speed_air, indicates that the predictor may
explain the response very well
iv) The output of the stepwise selection using the AIC criterion was
selected as final, as it retains all the variables that contribute to the
risk of a long landing. The risk factors are captured in the variables
aircraft, height, pitch and speed_air. The table below summarizes the final
model.
Final Model
Model Criterion     Stepwise (backward) selection using AIC
Final Parameters    aircraft, height, pitch, speed_air
AIC                 44.278
DF                  194
v) The single factor regressions and the stepwise selection give similar
results, as displayed in the table above.
SECTION 3: IDENTIFYING IMPORTANT FACTORS FOR RESPONSE, ‘RISKY LANDING’
# Step 9 - Repeat Steps 1- 7 for 'Risky Landing'
# 9.2. Histogram of Risky landing
hist(FAA_clean$risky.landing)
pct1 <-
round(table(FAA_clean$risky.landing)/length(FAA_clean$risky.landing)*100,1)
labs <- c("Not Risky landing (<= 3000 m):", "Risky landing (> 3000 m):")
labs <- paste(labs,pct1)
labs <- paste(labs,"%",sep = "" )
pie(table(FAA_clean$risky.landing),labels = labs,col = rainbow(length(labs)),
main = "Pie chart of Risky Landing")
Observations and Conclusion:
The distribution of the variable Risky Landing can be seen in the above
figures. 92.7% of the landings were recorded as not risky, while 7.3% were
recorded as risky (> 3000 m).
# Step 9.3 - Single Factor Regression
var_corr <- names(FAA_clean[1:7])
pvalue <- vapply(FAA_clean[1:7],
function(x) { summary(glm(FAA_clean$risky.landing ~ x,
family = binomial))$coefficients[8] },
FUN.VALUE = numeric(1) )
reg_coef <- vapply(FAA_clean[1:7],
function(x) { summary(glm(FAA_clean$risky.landing ~ x,
family = binomial))$coefficients[2] },
FUN.VALUE = numeric(1) )
sign_eqn <- vapply(reg_coef,
function(x) { ifelse(x >= 0, "Positive", "Negative")},
FUN.VALUE = character(1))
odds_ratio <- vapply(reg_coef, function(x) { exp(x) },
FUN.VALUE = numeric(1))
# Regression Table - Table 2
table2 <- data.frame(var_corr, abs(reg_coef), odds_ratio, sign_eqn, pvalue)
table2 <- table2[order(abs(pvalue)),]
names(table2) <- c("Variable", "Size of Regression Coefficient", "Odds Ratio",
"Direction of Regression Coefficient", "Size of the p-value")
Observations and Conclusion:
Table 2 gives the significance (p-value) of the relationship between the
response variable, Risky Landing, and each predictor variable, X, fitted one
at a time. The table is sorted by increasing p-value, so the variables
appear in order of decreasing strength of evidence for an association. The
relationships with speed_ground, speed_air and aircraft are found to be
significant.
# Step 9.4 - Visualizing the association b/w Risky-landing and the
# significant variables
attach(FAA_clean)
# 9.4.1. Risky-landing vs speed_ground
plot(jitter(risky.landing,0.1)~jitter(speed_ground),FAA_clean,
xlab = "Speed Ground", ylab = "Risky Landing",pch = ".",
main = "Risky Landing vs Speed Ground")
# fill must be a factor so that the two outcomes are drawn as separate groups
ggplot(FAA_clean,aes(x = speed_ground, fill = factor(risky.landing))) +
geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) +
ggtitle("Risky Landing vs Speed Ground")
cor(risky.landing, speed_ground)
## [1] 0.5413304
Observations and Conclusion:
There is a clear pattern between Risky Landing and speed_ground. Most
landings that were considered risky have speed_ground greater than 100, and
the scatter plot shows no risky landings when speed_ground is less than 100.
The histogram suggests that speed_ground is approximately normally
distributed.
# 9.4.2. Risky-landing vs speed_air
plot(jitter(risky.landing,0.1)~jitter(speed_air),FAA_clean,
xlab = "Speed Air", ylab = "Risky Landing",pch = ".",
main = "Risky Landing vs Speed Air")
ggplot(FAA_clean,aes(x = speed_air, fill = factor(risky.landing))) +
geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) +
ggtitle("Risky Landing vs Speed Air")
cor(risky.landing, speed_air, use = "complete.obs")
## [1] 0.8129461
Observations and Conclusion:
There is a clear pattern between Risky Landing and speed_air as well. The
lowest recorded value of speed_air is 90 mph. Most landings that were
considered risky have speed_air greater than 105, and every landing with
speed_air greater than 110 was risky. The speed_air histogram is
right-skewed. There is also a high correlation of 0.81 between Risky Landing
and the predictor speed_air.
# 9.4.3. Risky-landing vs Aircraft
ggplot(FAA_clean,aes(x = risky.landing, fill = aircraft)) +
geom_bar(position = "dodge", width = 0.5) +
facet_grid((~ aircraft))
Observations and Conclusion:
From the bar plot chart, it’s clear that almost 50 Boeing flight landings and
around 20 Airbus flight landings were considered risky, while the rest of the
flight landings were not considered risky.
# Step 9.5 – Identify collinearity in the predictor variables and group
plot(speed_ground, speed_air)
cor.test(speed_ground, speed_air, use = "complete.obs")
## Pearson's product-moment correlation
##
## data: speed_ground and speed_air
## t = 90.453, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9841163 0.9908449
## sample estimates:
## cor
## 0.9879383
Observations and Conclusion:
To check for correlation between the predictor variables speed_air and
speed_ground, the two were plotted against each other and the cor.test
function was called to measure their correlation. The high correlation of
0.99 between speed_air and speed_ground indicates that only one of these
variables should be used to build the final model.
cor(data.frame(speed_ground, speed_air, pitch, height, no_pasg, duration),
use = "complete.obs")
## speed_ground speed_air pitch height
## speed_ground 1.000000000 9.883475e-01 -0.06316127 -0.095483596
## speed_air 0.988347471 1.000000e+00 -0.04826810 -0.086729286
## pitch -0.063161271 -4.826810e-02 1.00000000 -0.033217630
## height -0.095483596 -8.672929e-02 -0.03321763 1.000000000
## no_pasg 0.003570599 2.242971e-05 -0.03766471 -0.006625455
## duration 0.023885892 4.454351e-02 -0.05627519 0.073775491
## no_pasg duration
## speed_ground 3.570599e-03 0.02388589
## speed_air 2.242971e-05 0.04454351
## pitch -3.766471e-02 -0.05627519
## height -6.625455e-03 0.07377549
## no_pasg 1.000000e+00 -0.06917843
## duration -6.917843e-02 1.00000000
Observations and Conclusion:
Further, the correlation matrix shows no other correlation this high. Based
on these results, speed_air was chosen as the representative variable. The
choice was made because speed_air has a higher correlation with the response
variable, Risky Landing, and is more relevant for measuring it.
# Step 9.5 - Initiate a full model - after grouping
full.FAA_clean <- na.omit(FAA_clean)
risky.full.model <- glm(data = full.FAA_clean, risky.landing ~ aircraft +
height + pitch + speed_air + no_pasg + duration, family = binomial)
Observations and Conclusion:
Choosing speed_air as the representative variable between speed_air and
speed_ground, the full logistic regression model was built. The results of
the full logistic model show that only variables, aircraft and speed_air are
significant with an AIC of 36.257.
# Step 9.6 - Stepwise variable selection using AIC criterion
# (step() with no scope argument searches backward from the full model)
risky.full.model1 <- glm(data = full.FAA_clean, risky.landing ~ aircraft +
height + pitch + speed_air + no_pasg + duration, family = binomial)
risky.model.AIC <- step(risky.full.model1,trace = 0)
Observations and Conclusion:
Running stepwise variable selection with the AIC criterion (step() called on
the full model without a scope, so the search runs backward) gave a final
model with aircraft and speed_air as the final variables. The AIC of the
final model was 32.281.
Compared with the single factor regressions table, the results are
consistent: only aircraft and speed_air are retained. This is also
consistent with the results obtained from the full model.
# Step 9.7 - Stepwise variable selection using BIC criterion (k = log(n))
risky.model.BIC <- step(risky.full.model1,k = log(195), trace = 0)
Observations and Conclusion:
Running stepwise variable selection with the BIC criterion gave a final
model with aircraft and speed_air as the final variables. The AIC of this
model was 32.281.
The BIC-selected model contains exactly the same predictor variables as the
model selected using AIC.
# Step 10 – Risk factors for ‘Risky Landing’
# Summary of Findings:
i) 7.3% of all the flight landings recorded in the data set are risky
landings, which are of concern in this study; most of these are Boeing
flights
ii) The very high correlation between the predictors speed_air and
speed_ground shows high potential for multicollinearity. Hence, only one of
them should be used in a final model
iii) The high correlation of 0.81 between the response variable, Risky
Landing, and the predictor, speed_air, indicates that the predictor may
explain the response very well
iv) The output of the stepwise selection using the AIC criterion was
selected as final, as in this case the AIC and BIC criteria yielded the same
model. The risk factors are captured in the variables aircraft and
speed_air. The table below summarizes the final model.
Final Model
Model Criterion     Stepwise (backward) selection using AIC
Final Parameters    aircraft, speed_air
AIC                 32.281
DF                  194
v) The single factor regressions and the stepwise selection give similar
results, as displayed in the table above.
SECTION 4: COMPARE THE MODELS BUILT FOR ‘LONG LANDING’ & ‘RISKY LANDING’
# Step 11 – Summarize the difference b/w the two models
i) The final model chosen for ‘Long landing’ has 4 final parameters:
aircraft, height, pitch and speed_air, while the one for ‘Risky Landing’ has
only 2 parameters: speed_air and aircraft
ii) The AIC of the Long landing model is 44.278, while the AIC of the Risky
landing model is 32.281. This suggests that aircraft and speed_air describe
Risky Landing better than the larger model describes Long Landing, although
AIC values for different response variables are only loosely comparable
# Step 12 - ROC curve
model.long <- glm(long.landing ~ aircraft + height + speed_air + pitch,
data = full.FAA_clean, family = binomial)
### Linear predictor
linpred <- predict(model.long)
### Predicted probabilities
predprob <- predict(model.long, type = "response")
### Predicted outcomes using 0.5 as the threshold
predout <- ifelse(predprob < 0.5,"no","yes")
longm <- data.frame(full.FAA_clean,predprob,predout)
thresh <- seq(0.01,0.5,0.01)
sensitivity <- specificity <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
pp <- ifelse(longm$predprob < thresh[j], "no", "yes")
xx <- xtabs(~long.landing+pp, longm)
specificity[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
sensitivity[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
plot(1 - specificity,sensitivity,type = "l", lty = 2, col = "blue")
model.risky <- glm(risky.landing ~ aircraft + speed_air,
data = full.FAA_clean, family = binomial)
### Linear predictor
linpred <- predict(model.risky)
### Predicted probabilities
predprob <- predict(model.risky, type = "response")
### Predicted outcomes using 0.5 as the threshold
predout <- ifelse(predprob < 0.5,"no","yes")
riskym <- data.frame(full.FAA_clean,predprob,predout)
thresh <- seq(0.01,0.5,0.01)
sensitivity1 <- specificity1 <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
pp <- ifelse(riskym$predprob < thresh[j], "no", "yes")
xx <- xtabs(~risky.landing+pp, riskym)
specificity1[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
sensitivity1[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
lines(1 - specificity1,sensitivity1,type = "l", lty = 2, col = "green")
ROC Curve: (Blue – Long Landing, Green – Risky Landing)
Observations and Conclusion:
i) The area under the curve (AUC) of the Risky landing model is larger than
that of the Long landing model
ii) Speed_air seems to be a very strong predictor of Risky landing, as shown
by the extremely high AUC for Risky landing
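The loops in Step 12 scan thresholds from 0.01 to 0.5 only, so they trace part of each curve, and the AUC comparison is read off the plot. A minimal sketch of scanning the full range and computing the AUC with the trapezoid rule follows. The function names `roc_points()` and `auc_trapezoid()` and the toy vectors are illustrative assumptions, not part of the original analysis.

```r
# ROC points over the full threshold range.
# y: 0/1 outcome, p: fitted probability. As in the original loops, a case is
# classified "yes" when p >= threshold.
roc_points <- function(y, p, thresh = seq(0, 1, by = 0.01)) {
  sens <- vapply(thresh, function(t) mean(p[y == 1] >= t), numeric(1))
  spec <- vapply(thresh, function(t) mean(p[y == 0] <  t), numeric(1))
  data.frame(thresh = thresh, fpr = 1 - spec, tpr = sens)
}

# Area under the ROC curve by the trapezoid rule
auc_trapezoid <- function(roc) {
  # Anchor the curve at (0,0) and (1,1), then order by false-positive rate
  fpr <- c(0, roc$fpr, 1)
  tpr <- c(0, roc$tpr, 1)
  o <- order(fpr, tpr)
  fpr <- fpr[o]
  tpr <- tpr[o]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Toy check: perfectly separated scores
y <- c(0, 0, 0, 1, 1, 1)
p <- c(0.05, 0.10, 0.20, 0.80, 0.90, 0.95)
roc <- roc_points(y, p)
auc_perfect <- auc_trapezoid(roc)

# Same scores with the labels flipped: every ranking is wrong
auc_reversed <- auc_trapezoid(roc_points(1 - y, p))
c(perfect = auc_perfect, reversed = auc_reversed)
```

Computing rates with `mean()` on each outcome group avoids indexing a 2x2 table, which fails when a threshold classifies every case the same way.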
# Step 13 - Predict probabilities of Long landing, Risky landing for new data
new.val <- data.frame(aircraft = "boeing", duration = 200, no_pasg = 80,
speed_ground = 115, speed_air = 120, height = 40, pitch = 4)
# long.landing - Linear predictor (eta)
predict(model.long,newdata = new.val,type = "link", se = T)
## $fit
## 1
## 37.91026
##
## $se.fit
## [1] 10.28447
##
## $residual.scale
## [1] 1
# Confidence interval - using linear predictor
# (ilogit() comes from the faraway package; base R's plogis() is equivalent)
round(ilogit(c(37.91026 - 1.96*10.28447, 37.91026 + 1.96*10.28447)),3)
## [1] 1 1
# long.landing – Predicted probability
predict(model.long, newdata = new.val, type = "response", se = T)
## $fit
## 1
## 1
##
## $se.fit
## 1
## 2.283611e-15
##
## $residual.scale
## [1] 1
# Confidence interval - using probability
round(c(1 - 1.96*2.283611e-15, 1 + 1.96*2.283611e-15),3)
## [1] 1 1
Observations and Conclusion:
Because speed_air separates long and non-long landings almost completely,
the fitted probability for this flight is essentially 1, and the confidence
intervals from both the linear predictor (eta) method and the probability
method round to [1,1].
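The two interval methods differ in where the normal approximation is applied. An interval built on the probability scale (1 ± 1.96 × se) can extend past 1, whereas an interval built on the linear predictor and mapped back through the inverse link always stays inside [0, 1]. A minimal sketch of the link-scale version, using a hypothetical helper `link_ci()` and the fit and standard error reported above for the Long Landing model:

```r
# Wald interval on the link scale, mapped back to a probability.
# plogis() is base R's inverse logit (equivalent to faraway::ilogit()).
link_ci <- function(eta, se, level = 0.95, inv_link = plogis) {
  z <- qnorm(1 - (1 - level) / 2)
  inv_link(c(lower = eta - z * se, upper = eta + z * se))
}

# Values reported by predict(..., type = "link") for the Long Landing model
ci <- link_ci(eta = 37.91026, se = 10.28447)
round(ci, 3)
```

Passing `inv_link = pnorm` would give the corresponding interval for a probit model.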
# Risky landing - Linear predictor
predict(model.risky,newdata = new.val,type = "link", se = T)
## $fit
## 1
## 17.30626
##
## $se.fit
## [1] 4.423414
##
## $residual.scale
## [1] 1
# Confidence interval - using linear predictor
round(ilogit(c(17.30626 - 1.96*4.423414, 17.30626 + 1.96*4.423414)),3)
## [1] 1 1
# Risky.landing – Predicted probability
predict(model.risky, newdata = new.val, type = "response", se = T)
## $fit
## 1
## 1
##
## $se.fit
## 1
## 1.348172e-07
##
## $residual.scale
## [1] 1
# Confidence interval - using probability
round(c(1 - 1.96*1.348172e-07, 1 + 1.96*1.348172e-07),3)
## [1] 1 1
Observations and Conclusion:
Because speed_air separates risky and non-risky landings almost completely,
the fitted probability for this flight is essentially 1, and the confidence
intervals from both the linear predictor (eta) method and the probability
method round to [1,1].
SECTION 5: COMPARE MODELS WITH NEW LINK FUNCTIONS
# Step 14 – Fit Probit and cloglog link functions for ‘Risky Landing’
model.risky <- glm(risky.landing ~ aircraft + speed_air,
data = full.FAA_clean, family = binomial)
model.riskyprobit <- glm(risky.landing ~ aircraft + speed_air,
data = full.FAA_clean, family = binomial(link = probit))
model.riskycloglog <- glm(risky.landing ~ aircraft + speed_air,
data = full.FAA_clean, family = binomial(link = cloglog))
                    Logit Model    Probit Model    Cloglog Model
Link                Link = logit   Link = Probit   Link = cloglog
Aircraft estimate   4.55           2.64            3.24
Speed_air estimate  1.22           0.67            0.93
AIC                 32.281         32.133          30.333
Observations and Conclusion:
The logistic regression model (logit link) was compared against models built
with the probit and cloglog link functions. The AIC was lowest for the model
built using cloglog as the link function. Comparing the models using ROC
curves gives a better idea of the best performing model for the risky
landing variable.
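The coefficient estimates in the table are on different scales because each link maps the linear predictor to a probability differently, so they should not be compared directly across columns. A small sketch of the three inverse links evaluated at a few illustrative values of the linear predictor (the `eta` grid is an arbitrary choice):

```r
# Illustrative values of the linear predictor
eta <- c(-3, -1, 0, 1, 3)

# Inverse link functions taken from the binomial family objects
inv <- data.frame(
  eta     = eta,
  logit   = binomial(link = "logit")$linkinv(eta),
  probit  = binomial(link = "probit")$linkinv(eta),
  cloglog = binomial(link = "cloglog")$linkinv(eta)
)
round(inv, 4)
```

The logit and probit curves are symmetric about eta = 0, while cloglog is not: it approaches 1 faster than it approaches 0, which is one reason the three models can rank extreme cases differently.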
# Step 15 - ROC curves of all 3 models together
# Logit Model
linpred <- predict(model.risky)
predprob <- predict(model.risky, type = "response")
predout <- ifelse(predprob < 0.5,"no","yes")
riskym <- data.frame(full.FAA_clean,predprob,predout)
thresh <- seq(0.01,0.5,0.01)
sensitivity1 <- specificity1 <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
pp <- ifelse(riskym$predprob < thresh[j], "no", "yes")
xx <- xtabs(~risky.landing+pp, riskym)
specificity1[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
sensitivity1[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
plot(1 - specificity1,sensitivity1,type = "l", lty = 2, col = "blue")
# Probit Model
linpred <- predict(model.riskyprobit)
predprob <- predict(model.riskyprobit, type = "response")
predout <- ifelse(predprob < 0.5,"no","yes")
riskym <- data.frame(full.FAA_clean,predprob,predout)
thresh <- seq(0.01,0.5,0.01)
sensitivity2 <- specificity2 <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
pp <- ifelse(riskym$predprob < thresh[j], "no", "yes")
xx <- xtabs(~risky.landing+pp, riskym)
specificity2[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
sensitivity2[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
lines(1 - specificity2,sensitivity2, type = "l", lty = 2, col = "red")
# Cloglog Model
linpred <- predict(model.riskycloglog)
predprob <- predict(model.riskycloglog, type = "response")
predout <- ifelse(predprob < 0.5,"no","yes")
riskym <- data.frame(full.FAA_clean,predprob,predout)
thresh <- seq(0.01,0.5,0.01)
sensitivity3 <- specificity3 <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
pp <- ifelse(riskym$predprob < thresh[j], "no", "yes")
xx <- xtabs(~risky.landing+pp, riskym)
specificity3[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
sensitivity3[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
lines(1 - specificity3, sensitivity3, type = "l", lty = 2, col = "green")
ROC Curve: (Blue – Logit link, Red – Probit link, Green – Cloglog link)
Observations and Conclusion:
Comparing the models using the ROC curves gives a better idea of the best
performing model for the variable risky_landing. The AUC is largest for the
probit link, followed by the logit link and then the cloglog link. This
could reflect how the models behave at the extreme right, for probabilities
near 1.
# Step 16 – Top 5 Risky Landings by all three models
top.risky <- as.data.frame(sapply(list(model.risky, model.riskyprobit,
model.riskycloglog), fitted))
colnames(top.risky) <- c("RiskyLogit", "RiskyProbit", "RiskyCloglog")
risky.full.data <- cbind(full.FAA_clean, top.risky)
logit.order <- head(risky.full.data[with(risky.full.data,
order(-RiskyLogit)),])
probit.order <- head(risky.full.data[with(risky.full.data,
order(-RiskyProbit)),])
cloglog.order <- head(risky.full.data[with(risky.full.data,
order(-RiskyCloglog)),])
Logit   Probit   Cloglog
115     6        6
20      10       10
57      11       11
125     17       17
93      20       20
118     40       28
Observations and Conclusion:
Ranking the flights by fitted probability under each model gave the flight
indices above (head() returns six rows by default). The Probit and Cloglog
models list almost the same flights as the most risky, while the Logit model
gives largely different results, sharing only flight 20 with the other two.
The similarity between the Probit and Cloglog models could be because the
models behave similarly in the extreme tail.
# Step 17 - Predict values for new data point defined
# Risky landing Probit model - Linear predictor
predict(model.riskyprobit,newdata = new.val,type = "link", se = T)
## $fit
## 1
## 9.690832
##
## $se.fit
## [1] 2.241033
##
## $residual.scale
## [1] 1
# Confidence interval - using linear predictor
pnorm(c(9.690832 - 1.96*2.241033, 9.690832 + 1.96*2.241033))
## [1] 0.999 1.000
# Risky landing Probit model – Probability
predict(model.riskyprobit, newdata = new.val, type = "response", se = T)
## $fit
## 1
## 1
##
## $se.fit
## 1
## 4.976094e-16
##
## $residual.scale
## [1] 1
# Confidence interval - using probability
round(c(1 - 1.96*4.976094e-16, 1 + 1.96*4.976094e-16),3)
## [1] 1 1
                       Logit Model    Probit Model    Cloglog Model
Link                   Link = logit   Link = Probit   Link = cloglog
Predicted Probability  1              1               1
CI (eta)               [1,1]          [0.999, 1]      [1,1]
CI (prob)              [1,1]          [1,1]           [1,1]
Observations and Conclusion:
The table above summarizes the predicted probability and the confidence
intervals generated using the logit, probit and cloglog (hazard) models. All
three models predict a probability of 1 for the new flight. Only the probit
interval computed from the linear predictor differs slightly from [1,1],
with a lower bound of 0.999.

Flight Landing Risk Assessment Project

  • 1. FLIGHT LANDING PROJECT (Logistic Regression) STATISTICAL MODELING, MS BANA 13th February, 2018 Preethi Jayaram Jayaraman MS BANA, Class of 2017 M12420360
  • 2. OBJECTIVE OF THE STUDY: The motivation of the study is to model the risk of landing overrun of commercial flights. Landing Data of 950 commercial flights (Airbus and Boeing) are available including variables such as Aircraft, Duration, Number of Passengers, Ground Speed, Air Speed, Height, Pitch, Long Landing and Risky Landing. This study evaluates which factors impact the variable long landing, that indicates if the landing distance was greater than 2500m and the variable, risky landing, that indicates if the landing distance was greater than 3000m. This study can further be used to make decisions about the landing based on the risk of long and risky landing overrun. BACKGROUND: As a start to the project, the datasets with details about the flights, FAA1 and FAA2 were merged and cleaned into the dataset, FAA_clean. FAA_clean is a dataset with 831 observations and 8 variables. The structure and summary statistics of FAA_clean can be found below. # Step 0 – Structure, Summary Statistics of FAA_clean str(FAA_clean) ## Classes 'tbl_df', 'tbl' and 'data.frame': 831 obs. of 8 variables: ## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ... ## $ duration : num 98.5 125.7 112 196.8 90.1 ... ## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ... ## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ... ## $ speed_air : num 109 103 NA NA NA ... ## $ height : num 27.4 27.8 18.6 30.7 32.4 ... ## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ... ## $ distance : num 3370 2988 1145 1664 1050 ... Variable Type Missing values % Min Max Mean Median Aircraft Categorical - - - - - Duration Numerical 5.8% 14.7 305.6 154.0 153.9 No_pasg Numerical - 29.0 87.0 60.1 60.0 Speed_grou nd Numerical - 27.7 141.2 79.4 79.6 Speed_air Numerical 75.5% 90.0 141.7 103.8 101.1 Height Numerical - -3.5 59.9 30.1 30.01 Pitch Numerical - 2.2 5.9 4.0 4.0 Distance Numerical - 34.1 6533.05 1526.0 1258.1
  • 3. SECTION 1: CREATE BINARY RESPONSES # Step 1 – Create Binary responses – Long Landing, Risky Landing FAA_clean$long.landing <- ifelse(FAA_clean$distance > 2500,1,0) FAA_clean$risky.landing <- ifelse(FAA_clean$distance > 3000,1,0) FAA_clean <- FAA_clean[,-8] Observations and Conclusion: The variable, Distance, from the original dataset was modified to make two binary variables, Long Landing and Risky Landing. Long Landing is defined as 1 for all flights where Distance is greater than 2500m and Risky Landing for Distance greater than 3000m respectively. The continuous variable, Distance, was discarded and Long Landing, Risky Landing will be considered as the response variables of concern henceforth. SECTION 2: IDENTIFYING IMPORTANT FACTORS FOR RESPONSE, ‘LONG LANDING’ # Step 2 – Distribution of Long Landing hist(FAA_clean$long.landing) pct <- round(table(FAA_clean$long.landing)/length(FAA_clean$long.landing)*100,1) labs <- c("Not Long landing (<2500 m),", "Long landing (>2500 m,),") labs <- paste(labs,pct) labs <- paste(labs,"%",sep = "" ) pie(table(FAA_clean$long.landing),labels = labs,col = rainbow(length(labs)), main = "Pie chart of Long Landing")
  • 4. Observations and Conclusion: The distribution of the variable, Long Landing, can be seen in the above figures. Clearly, 87.6% of the observations recorded Not Long Landing, while 12.4% of the flights’ landing was recorded as long (> 2500 m). # Step 3 - Single Factor Regression var_corr <- names(FAA_clean[1:7]) pvalue <- vapply(FAA_clean[1:7], function(x) { summary(glm(FAA_clean$long.landing ~ x, family = binomial))$coefficients[8] }, FUN.VALUE = numeric(1) ) reg_coef <- vapply(FAA_clean[1:7], function(x) { summary(glm(FAA_clean$long.landing ~ x, family = binomial))$coefficients[2] }, FUN.VALUE = numeric(1) ) sign_eqn <- vapply(reg_coef, function(x) { ifelse(x >= 0, "Positive", "Negative")}, FUN.VALUE = character(1)) odds_ratio <- vapply(reg_coef, function(x) { exp(x) }, FUN.VALUE = numeric(1)) # Regression Table - Table 1 table1 <- data.frame(var_corr, abs(reg_coef), odds_ratio, sign_eqn, pvalue) table1 <- table1[order(abs(pvalue)),] names(table1) <- c("Variable", "Size of Regression Coefficient", "Odds Ratio", "Direction of Regression Coefficient", "Size of the p-value")
  • 5. Observations and Conclusion: Table 1 gives the significance of the relationship (p value) between the response variable, Long Landing and each predictor variable, X. The table is ranked based on the increasing p values. Based on Table 1, Long Landing, is most correlated with the variables in the order shown and the relationship with variables, speed_ground, speed_air, aircraft, pitch, are found to be significant. # Step 4 - Visualizing the association b/w long-landing and the significant variables attach(FAA_clean) # 4.1. Long-landing vs speed_ground plot(jitter(long.landing,0.1)~jitter(speed_ground),FAA_clean,xlab = "Speed Ground", ylab = "Long Landing",pch = ".", main = "Long Landing vs Speed Ground") ggplot(FAA_clean,aes(x = speed_ground, fill = long.landing)) + geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) + ggtitle("Long Landing vs Speed Ground") cor(long.landing, speed_ground) ## [1] 0.6214409
  • 6. Observations and Conclusion: There seems to be a clear pattern between Long Landing and speed_ground. Clearly, most flight landings that were considered long have speed_ground greater than 100. From the scatter plot, it’s also clear that there are no values where the landing was considered long when speed_ground is lesser than 100. From the histogram plot, it’s clear that the distribution of Long Landing vs Speed_ground is normal. # 4.2. Long-landing vs speed_air plot(jitter(long.landing,0.1)~jitter(speed_air),FAA_clean,xlab = "Speed Air", ylab = "Long Landing",pch = ".", main = "Long Landing vs Speed Air") ggplot(FAA_clean,aes(x = speed_air,fill = long.landing)) + geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) + ggtitle("Long Landing vs Speed Air") cor(long.landing, speed_air, use = "complete.obs") ## [1] 0.7329355 Observations and Conclusion: There seems to be a clear pattern between Long Landing and speed_air as well. Firstly, the lowest value of speed_air is 90 mph. Clearly, most flight landings that were considered long have speed_air greater than 95. If the speed_air is greater than 110, then the landing was definitely long. There’s a chance that the landing would not be long if speed_air’s value is between 90 and 100 mph. From the histogram plot, a right-skew of the Long Landing variable can be observed. There’s also a 0.74 correlation between the long landing and the predictor variable, speed_air.
  • 7. # 4.3. Long-landing vs Pitch plot(jitter(long.landing,0.1)~jitter(pitch),FAA_clean,xlab = "pitch", ylab = "Long Landing",pch = ".", main = "Long Landing vs Pitch") ggplot(FAA_clean,aes(x = pitch,fill = long.landing)) + geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) + ggtitle("Long Landing vs Pitch") cor(long.landing, pitch) ## [1] 0.06919407 Observations and Conclusion: From the histogram and the bar plot, it’s evident that most values of pitch in the dataset are 4. Also, there seem to be a few landings that were considered long when the pitch is 4. There’s a very slight correlation of 0.07 between the long landing and the predictor variable, pitch. # 4.4. Long-landing vs Aircraft ggplot(FAA_clean,aes(x = long.landing, fill = aircraft)) + geom_bar(position = "dodge", width = 0.5) + facet_grid((~ aircraft))
  • 8. Observations and Conclusion: From the bar plot chart, it’s clear that around 70 Boeing flight landings and around 35 Airbus flight landings were considered long, while the rest of the flight landings were not considered long. # Step 5 – Identify collinearity in the predictor variables and group plot(speed_ground, speed_air) cor.test(speed_ground, speed_air, use = "complete.obs") ## Pearson's product-moment correlation ## ## data: speed_ground and speed_air ## t = 90.453, df = 201, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.9841163 0.9908449 ## sample estimates: ## cor ## 0.9879383 Observations and Conclusion: To identify the correlations between the predictor variables, speed_air and speed_ground, a plot between the variables was drawn. Furthermore, the cor.test function was called to measure the correlation between the variables. A high correlation of 0.98, observed between speed_air and speed_ground indicates that only one of these variables should be picked to build the final model.
  • 9. cor(data.frame(speed_ground, speed_air, pitch, height, no_pasg, duration), use = "complete.obs") ## speed_ground speed_air pitch height ## speed_ground 1.000000000 9.883475e-01 -0.06316127 -0.095483596 ## speed_air 0.988347471 1.000000e+00 -0.04826810 -0.086729286 ## pitch -0.063161271 -4.826810e-02 1.00000000 -0.033217630 ## height -0.095483596 -8.672929e-02 -0.03321763 1.000000000 ## no_pasg 0.003570599 2.242971e-05 -0.03766471 -0.006625455 ## duration 0.023885892 4.454351e-02 -0.05627519 0.073775491 ## no_pasg duration ## speed_ground 3.570599e-03 0.02388589 ## speed_air 2.242971e-05 0.04454351 ## pitch -3.766471e-02 -0.05627519 ## height -6.625455e-03 0.07377549 ## no_pasg 1.000000e+00 -0.06917843 ## duration -6.917843e-02 1.00000000 Observations and Conclusion: Further, from the results of the correlation matrix, no other such high correlation was recorded. Based on the results of the correlation test, speed_air was chosen as the final representative variable. The choice was made as speed_air has a higher correlation with the response variable, Long Landing and is more relevant to measure the response, Long Landing. # Step 5 - Initiate a full model - after grouping full.FAA_clean <- na.omit(FAA_clean) full.model <- glm(data = full.FAA_clean, long.landing ~ aircraft + height + pitch + speed_air + no_pasg + duration, family = binomial) Observations and Conclusion: Choosing speed_air as the representative variable between speed_air and speed_ground, the full logistic model was built. The results of the full logistic model show that variables, aircraft, height and speed_air are significant with an AIC of 47.264. 
# Step 6 – Forward Step variable selection using AIC criterion full.model1 <- glm(data = full.FAA_clean, long.landing ~ aircraft + height + pitch + speed_air + no_pasg + duration, family = binomial) model.AIC <- step(full.model1,trace = 0) Observations and Conclusion: Running a forward variable selection model using AIC criterion, a final model was identified with aircraft, height, pitch and speed_air as the final variables. The AIC of the final model was 44.278.
  • 10. Comparing the results of the forward step function, consistent results are obtained indicating that only aircraft, height, pitch and speed_air are significant. # Step 7 - Forward Step variable selection using BIC criterion model.BIC <- step(full.model1,k = log(195), trace = 0) Observations and Conclusion: Running a forward variable selection model using BIC criterion, a final model was identified with aircraft, height and speed_air as the final variables. The AIC of the final model found after step variable selection using BIC was 44.798. Comparing the results of the step function with BIC criterion with the one built using AIC as the criterion, the insignificant variable, pitch was dropped. Clearly this is a function of the BIC criterion choosing a simpler model over a more accurate model chosen by the AIC criterion- step function. # Step 8 – Risk factors for ‘Long Landing’ # Summary of Findings: i) 12.4% of all the flight landings recorded in the data set are long landings, which are of concern in this study, of which most flights are Boeing flights ii) Very high correlation between the predictors, speed_air and speed_ground show high potential for multi-collinearity. Hence, only one of them should be used in a final model
  • 11. iii) High correlation between response variable, Long Landing and predictor, Speed_air of 0.74 indicates that the predictor variable may explain the response variable very well iv) The output of the forward selection step function using AIC criterion was selected as final as it provides all the variables that contribute to the risk factors for Long landing. The risk factors are captured in variables, aircraft, height, pitch and speed_air. The below table summarizes the output of the final model. Final Model Model Criterion Forward selection using AIC Final Parameters aircraft, height, pitch, speed_air AIC 44.278 DF 194 v) Comparing the results of the single factor regression and forward selection method, similar results as displayed in the above table were observed. SECTION 3: IDENTIFYING IMPORTANT FACTORS FOR RESPONSE, ‘RISKY LANDING’ # Step 9 - Repeat Steps 1- 7 for 'Risky Landing' # 9.2. Histogram of Risky landing hist(FAA_clean$risky.landing) pct1 <- round(table(FAA_clean$risky.landing)/length(FAA_clean$risky.landing)*100,1) labs <- c("Not Risky landing (<3000 m,)","Risky landing (>3000 m),") labs <- paste(labs,pct1) labs <- paste(labs,"%",sep = "" )
  • 12. pie(table(FAA_clean$risky.landing),labels = labs,col = rainbow(length(labs)), main = "Pie chart of Risky Landing") Observations and Conclusion: The distribution of the variable, Risky Landing, can be seen in the above figures. Clearly, 92.7% of the observations recorded Not Risky Landing, while 7.3% of the flights’ landing was recorded as risky (> 3000 m). # Step 9.3 - Single Factor Regression var_corr <- names(FAA_clean[1:7]) pvalue <- vapply(FAA_clean[1:7], function(x) { summary(glm(FAA_clean$risky.landing ~ x, family = binomial))$coefficients[8] }, FUN.VALUE = numeric(1) ) reg_coef <- vapply(FAA_clean[1:7], function(x) { summary(glm(FAA_clean$risky.landing ~ x, family = binomial))$coefficients[2] }, FUN.VALUE = numeric(1) ) sign_eqn <- vapply(reg_coef, function(x) { ifelse(x >= 0, "Positive", "Negative")}, FUN.VALUE = character(1)) odds_ratio <- vapply(reg_coef, function(x) { exp(x) }, FUN.VALUE = numeric(1)) # Regression Table - Table 2 table2 <- data.frame(var_corr, abs(reg_coef), odds_ratio, sign_eqn, pvalue) table2 <- table2[order(abs(pvalue)),]
  • 13. names(table2) <- c("Variable", "Size of Regression Coefficient", "Odds Ratio", "Direction of Regression Coefficient", "Size of the p-value") Observations and Conclusion: Table 2 gives the significance of the relationship (p value) between the response variable, Risky Landing and each predictor variable, X. The table is ranked based on the increasing p values. Based on Table 2, Risky Landing, is most correlated with the variables in the order shown and the relationship with variables, speed_ground, speed_air, aircraft are found to be significant. # Step 9.4 - Visualizing the association b/w Risky-landing and the significant variables attach(FAA_clean) # 9.4.1. Risky-landing vs speed_ground plot(jitter(risky.landing,0.1)~jitter(speed_ground),FAA_clean,xlab = "Speed Ground", ylab = "Risky Landing",pch = ".", main = "Risky Landing vs Speed Ground") ggplot(FAA_clean,aes(x = speed_ground, fill = risky.landing)) + geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) + ggtitle("Risky Landing vs Speed Ground")
  • 14. cor(risky.landing, speed_ground) ## [1] 0.5413304 Observations and Conclusion: There seems to be a clear pattern between Risky Landing and speed_ground. Clearly, most flight landings that were considered risky have speed_ground greater than 100. From the scatter plot, it’s also clear that there are no values where the landing was considered risky when speed_ground is lesser than 10. From the histogram plot, it’s clear that the distribution of Long Landing vs Speed_ground is almost normal. # 9.4.2. Risky-landing vs speed_air plot(jitter(risky.landing,0.1)~jitter(speed_air),FAA_clean,xlab = "Speed Air", ylab = "Risky Landing",pch = ".", main = "Risky Landing vs Speed Air") ggplot(FAA_clean,aes(x = speed_air, fill = risky.landing)) + geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) + ggtitle("Risky Landing vs Speed Air")
  • 15. cor(risky.landing, speed_air, use = "complete.obs") ## [1] 0.8129461 Observations and Conclusion: There seems to be a clear pattern between Risky Landing and speed_air as well. Firstly, the lowest value of speed_air is 90 mph. Clearly, most flight landings that were considered risky have speed_air greater than 105. If the speed_air is greater than 110, then the landing was definitely risky. From the histogram plot, a right-skew of the Long Landing variable can be observed. There’s also a high correlation of 0.81 between the risky landing and the predictor variable, speed_air. # 3. Risky-landing vs Aircraft ggplot(FAA_clean,aes(x = risky.landing, fill = aircraft)) + geom_bar(position = "dodge", width = 0.5) + facet_grid((~ aircraft))
  • 16. Observations and Conclusion: From the bar plot chart, it’s clear that almost 50 Boeing flight landings and around 20 Airbus flight landings were considered risky, while the rest of the flight landings were not considered risky. # Step 9.5 – Identify collinearity in the predictor variables and group plot(speed_ground, speed_air) cor.test(speed_ground, speed_air, use = "complete.obs") ## Pearson's product-moment correlation ## ## data: speed_ground and speed_air ## t = 90.453, df = 201, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.9841163 0.9908449 ## sample estimates: ## cor ## 0.9879383 Observations and Conclusion: To identify the correlations between the predictor variables, speed_air and speed_ground, a plot between the variables was drawn. Furthermore, the cor.test function was called to measure the correlation between the variables. A high correlation of 0.98, observed between speed_air and speed_ground indicates that only one of these variables should be picked to build the final model. cor(data.frame(speed_ground, speed_air, pitch, height, no_pasg, duration), use = "complete.obs")
##              speed_ground     speed_air       pitch       height
## speed_ground  1.000000000  9.883475e-01 -0.06316127 -0.095483596
## speed_air     0.988347471  1.000000e+00 -0.04826810 -0.086729286
## pitch        -0.063161271 -4.826810e-02  1.00000000 -0.033217630
## height       -0.095483596 -8.672929e-02 -0.03321763  1.000000000
## no_pasg       0.003570599  2.242971e-05 -0.03766471 -0.006625455
## duration      0.023885892  4.454351e-02 -0.05627519  0.073775491
##                    no_pasg    duration
## speed_ground  3.570599e-03  0.02388589
## speed_air     2.242971e-05  0.04454351
## pitch        -3.766471e-02 -0.05627519
## height       -6.625455e-03  0.07377549
## no_pasg       1.000000e+00 -0.06917843
## duration     -6.917843e-02  1.00000000
Observations and Conclusion: The correlation matrix shows no other correlation of comparable size. Based on the results of the correlation test, speed_air was chosen as the representative variable. The choice was made because speed_air has a higher correlation with the response variable, Risky Landing (0.81 against 0.54 for speed_ground), and is more relevant for measuring that response.
# Step 9.5 - Initiate a full model - after grouping
full.FAA_clean <- na.omit(FAA_clean)
risky.full.model <- glm(data = full.FAA_clean, risky.landing ~ aircraft + height + pitch + speed_air + no_pasg + duration, family = binomial)
Observations and Conclusion: With speed_air chosen as the representative of speed_air and speed_ground, the full logistic regression model was built. The results of the full logistic model show that only the variables aircraft and speed_air are significant, with an AIC of 36.257.
# Step 9.6 - Step variable selection using AIC criterion
risky.full.model1 <- glm(data = full.FAA_clean, risky.landing ~ aircraft + height + pitch + speed_air + no_pasg + duration, family = binomial)
risky.model.AIC <- step(risky.full.model1, trace = 0)
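A note on the direction of the search: when step() is given a fitted full model and no scope argument, it defaults to backward elimination. The sketch below contrasts that call with an explicit forward search started from an intercept-only model, and shows the k = log(n) penalty used for a BIC-style search. It runs on simulated stand-in data (the data frame d and the variables x1, x2, x3 and y are hypothetical and are not the FAA data), so the terms it selects say nothing about the landing models.

```r
# Simulated stand-in data (hypothetical; NOT the FAA data)
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * d$x1))

full <- glm(y ~ x1 + x2 + x3, data = d, family = binomial)
null <- glm(y ~ 1, data = d, family = binomial)

# No scope supplied: step() can only drop terms (backward elimination)
back_aic <- step(full, trace = 0)

# Same search with a BIC-style penalty, k = log(number of rows in the fit)
back_bic <- step(full, k = log(nrow(d)), trace = 0)

# Forward selection needs a small starting model and an explicit upper scope
fwd_aic <- step(null,
                scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                direction = "forward", trace = 0)

formula(back_aic)
formula(fwd_aic)
```

Backward and forward searches often agree on small problems, but they are not guaranteed to, so it is worth stating which one was run.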
Observations and Conclusion: Running step() on the full model with the AIC criterion (with no scope supplied, step() performs backward elimination rather than forward selection), a final model was identified with aircraft and speed_air as the final variables. The AIC of the final model was 32.281. The result is consistent with the single-factor regressions table and with the full model, both of which indicate that only aircraft and speed_air are significant.
# Step 9.7 - Step variable selection using BIC criterion
risky.model.BIC <- step(risky.full.model1, k = log(195), trace = 0)
Observations and Conclusion: Running the same step selection with the BIC criterion, a final model was identified with aircraft and speed_air as the final variables. The AIC of the final model found with the BIC criterion was 32.281. The BIC-based and AIC-based selections returned exactly the same predictor variables.
# Step 10 – Risk factors for ‘Risky Landing’
# Summary of Findings:
i) 7.3% of all the flight landings recorded in the data set are risky landings, which are the concern of this study; most of these are Boeing flights
ii) The very high correlation between the predictors speed_air and speed_ground shows high potential for multicollinearity. Hence, only one of them should be used in a final model
iii) The high correlation of 0.81 between the response variable, Risky Landing, and the predictor speed_air indicates that the predictor may explain the response very well
iv) The output of the step() selection using the AIC criterion was chosen as the final model; in this case the AIC and BIC criteria yielded the same result. The risk factors are captured in the variables aircraft and speed_air. The table below summarizes the output of the final model.
Final Model
Model Criterion     Stepwise selection (step) using AIC
Final Parameters    aircraft, speed_air
AIC                 32.281
DF                  194
v) The single-factor regressions and the stepwise selection gave similar results, consistent with the table above.
SECTION 4: COMPARE THE MODELS BUILT FOR ‘LONG LANDING’ & ‘RISKY LANDING’
# Step 11 – Summarize the difference b/w the two models
i) The final model chosen for ‘Long Landing’ has 4 parameters (aircraft, height, pitch and speed_air), while the one for ‘Risky Landing’ has only 2 parameters (speed_air and aircraft)
ii) The AIC of the Long Landing model is 44.2, while the AIC of the Risky Landing model is 32.281. This suggests that aircraft and speed_air fit Risky Landing more closely than the four predictors fit Long Landing, although AIC values computed for different response variables are not strictly comparable
# Step 12 - ROC curve
model.long <- glm(long.landing ~ aircraft + height + speed_air + pitch, data = full.FAA_clean, family = binomial)
### Linear predictor
linpred <- predict(model.long)
### Predicted probabilities
predprob <- predict(model.long, type = "response")
### Predicted outcomes using 0.5 as the threshold
predout <- ifelse(predprob < 0.5, "no", "yes")
longm <- data.frame(full.FAA_clean, predprob, predout)
thresh <- seq(0.01, 0.5, 0.01)
sensitivity <- specificity <- rep(NA, length(thresh))
for (j in seq(along = thresh)) {
  pp <- ifelse(longm$predprob < thresh[j], "no", "yes")
  xx <- xtabs(~long.landing + pp, longm)
  specificity[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
plot(1 - specificity, sensitivity, type = "l", lty = 2, col = "blue")
model.risky <- glm(risky.landing ~ aircraft + speed_air, data = full.FAA_clean, family = binomial)
### Linear predictor
linpred <- predict(model.risky)
### Predicted probabilities
predprob <- predict(model.risky, type = "response")
### Predicted outcomes using 0.5 as the threshold
predout <- ifelse(predprob < 0.5, "no", "yes")
riskym <- data.frame(full.FAA_clean, predprob, predout)
thresh <- seq(0.01, 0.5, 0.01)
sensitivity1 <- specificity1 <- rep(NA, length(thresh))
for (j in seq(along = thresh)) {
  pp <- ifelse(riskym$predprob < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing + pp, riskym)
  specificity1[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity1[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
lines(1 - specificity1, sensitivity1, type = "l", lty = 2, col = "green")
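The area under each ROC curve can be put on a numeric footing without extra packages. The following is a minimal sketch using the rank (Mann-Whitney) identity for AUC. The function name auc_rank and the toy vectors are illustrative assumptions; the function expects a response coded 0/1 together with fitted probabilities or any other score.

```r
# AUC via the rank (Mann-Whitney) identity.
# y: response coded 0/1; p: predicted probability or score (hypothetical inputs)
auc_rank <- function(y, p) {
  stopifnot(length(y) == length(p), all(y %in% c(0, 1)))
  r  <- rank(p)                 # average ranks, so ties count as one half
  n1 <- sum(y == 1)             # number of positives
  n0 <- sum(y == 0)             # number of negatives
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
y <- c(0, 0, 1, 1)
p <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(y, p)   # 0.75
```

Applied to the models above, a call such as auc_rank(full.FAA_clean$risky.landing, fitted(model.risky)) would give the area for the Risky Landing model, assuming the response is stored as a 0/1 column of that data frame.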
ROC Curve: (Blue – Long Landing, Green – Risky Landing)
Observations and Conclusion:
i) The area under the curve of the Risky Landing model is larger than that of the Long Landing model
ii) Speed_air appears to be a strong predictor of Risky Landing, as seen in the extremely high AUC for the Risky Landing model
# Step 13 - Predict probabilities of Long landing, Risky landing for new data
new.val <- data.frame(aircraft = "boeing", duration = 200, no_pasg = 80, speed_ground = 115, speed_air = 120, height = 40, pitch = 4)
# long.landing - Linear predictor (eta)
predict(model.long, newdata = new.val, type = "link", se = T)
## $fit
##        1
## 37.91026
##
## $se.fit
## [1] 10.28447
##
## $residual.scale
## [1] 1
# Confidence interval - using linear predictor
round(ilogit(c(37.91026 - 1.96*10.28447, 37.91026 + 1.96*10.28447)), 3)
## [1] 1 1
# long.landing – Predicted probability
predict(model.long, newdata = new.val, type = "response", se = T)
## $fit
## 1
## 1
##
## $se.fit
##            1
## 2.283611e-15
##
## $residual.scale
## [1] 1
# Confidence interval - using probability
round(c(1 - 1.96*2.283611e-15, 1 + 1.96*2.283611e-15), 3)
## [1] 1 1
Observations and Conclusion: Because speed_air separates the long_landing outcomes almost completely, both the linear predictor (eta) method and the probability method give a confidence interval of [1, 1].
# Risky landing - Linear predictor
predict(model.risky, newdata = new.val, type = "link", se = T)
## $fit
##        1
## 17.30626
##
## $se.fit
## [1] 4.423414
##
## $residual.scale
## [1] 1
# Confidence interval - using linear predictor
round(ilogit(c(17.30626 - 1.96*4.423414, 17.30626 + 1.96*4.423414)), 3)
## [1] 1 1
# Risky.landing – Predicted probability
predict(model.risky, newdata = new.val, type = "response", se = T)
## $fit
## 1
## 1
##
## $se.fit
##            1
## 1.348172e-07
##
## $residual.scale
## [1] 1
# Confidence interval - using probability
round(c(1 - 1.96*1.348172e-07, 1 + 1.96*1.348172e-07), 3)
## [1] 1 1
Observations and Conclusion: Because speed_air separates the risky_landing outcomes almost completely, both the linear predictor (eta) method and the probability method give a confidence interval of [1, 1].
SECTION 5: COMPARE MODELS WITH NEW LINK FUNCTIONS
# Step 14 – Fit Probit and cloglog link functions for ‘Risky Landing’
model.risky <- glm(risky.landing ~ aircraft + speed_air, data = full.FAA_clean, family = binomial)
model.riskyprobit <- glm(risky.landing ~ aircraft + speed_air, data = full.FAA_clean, family = binomial(link = probit))
model.riskycloglog <- glm(risky.landing ~ aircraft + speed_air, data = full.FAA_clean, family = binomial(link = cloglog))
                      Logit Model    Probit Model    Cloglog Model
Link                  logit          probit          cloglog
Aircraft estimate     4.55           2.64            3.24
Speed_air estimate    1.22           0.67            0.93
AIC                   32.281         32.133          30.333
Observations and Conclusion: The logistic regression model (logit link) was compared against models built with the probit and cloglog link functions. The AIC was lowest for the model built with the cloglog link. Comparing the models using ROC curves will give a better idea of the best performing model for the risky landing variable.
# Step 15 - ROC curves of all 3 models together
# Logit Model
linpred <- predict(model.risky)
predprob <- predict(model.risky, type = "response")
predout <- ifelse(predprob < 0.5, "no", "yes")
riskym <- data.frame(full.FAA_clean, predprob, predout)
thresh <- seq(0.01, 0.5, 0.01)
sensitivity1 <- specificity1 <- rep(NA, length(thresh))
for (j in seq(along = thresh)) {
  pp <- ifelse(riskym$predprob < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing + pp, riskym)
  specificity1[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity1[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
plot(1 - specificity1, sensitivity1, type = "l", lty = 2, col = "blue")
# Probit Model
linpred <- predict(model.riskyprobit)
predprob <- predict(model.riskyprobit, type = "response")
predout <- ifelse(predprob < 0.5, "no", "yes")
riskym <- data.frame(full.FAA_clean, predprob, predout)
thresh <- seq(0.01, 0.5, 0.01)
sensitivity2 <- specificity2 <- rep(NA, length(thresh))
for (j in seq(along = thresh)) {
  pp <- ifelse(riskym$predprob < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing + pp, riskym)
  specificity2[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity2[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
lines(1 - specificity2, sensitivity2, type = "l", lty = 2, col = "red")
# Cloglog Model
linpred <- predict(model.riskycloglog)
predprob <- predict(model.riskycloglog, type = "response")
predout <- ifelse(predprob < 0.5, "no", "yes")
riskym <- data.frame(full.FAA_clean, predprob, predout)
thresh <- seq(0.01, 0.5, 0.01)
sensitivity3 <- specificity3 <- rep(NA, length(thresh))
for (j in seq(along = thresh)) {
  pp <- ifelse(riskym$predprob < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing + pp, riskym)
  specificity3[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity3[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
lines(1 - specificity3, sensitivity3, type = "l", lty = 2, col = "green")
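The three loops above repeat the same steps, so they could be folded into one helper. The sketch below uses the hypothetical name roc_points. It counts true and false positives directly, which avoids indexing a contingency table that loses a column when every prediction falls in one class, and it sweeps thresholds over the full 0 to 1 range rather than stopping at 0.5. The toy vectors are illustrative, and the response is assumed to be coded 0/1.

```r
# ROC points for a 0/1 response y and predicted probabilities p.
# roc_points is a hypothetical helper, not part of the original analysis.
roc_points <- function(y, p, thresh = seq(0, 1, by = 0.01)) {
  sens <- spec <- numeric(length(thresh))
  for (j in seq_along(thresh)) {
    pred <- as.integer(p >= thresh[j])   # same rule as ifelse(p < t, "no", "yes")
    sens[j] <- sum(pred == 1 & y == 1) / sum(y == 1)   # true positive rate
    spec[j] <- sum(pred == 0 & y == 0) / sum(y == 0)   # true negative rate
  }
  data.frame(thresh = thresh, fpr = 1 - spec, tpr = sens)
}

# Toy example with three thresholds
y <- c(0, 0, 1, 1)
p <- c(0.1, 0.4, 0.35, 0.8)
roc <- roc_points(y, p, thresh = c(0, 0.5, 1))
roc

# Possible use with the fitted models (assumes a 0/1 response column):
# r1 <- roc_points(full.FAA_clean$risky.landing, fitted(model.risky))
# plot(r1$fpr, r1$tpr, type = "l", lty = 2, col = "blue")
```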
ROC Curve: (Blue – Logit link, Red – Probit link, Green – Cloglog link)
Observations and Conclusion: Comparing the models using the ROC curves gives a better idea of the best performing model for the variable risky_landing. The AUC is largest for the probit link, followed by the logit link and then the cloglog link. This could reflect the performance of the models at the right-extreme values near 1.
# Step 16 – Top 5 Risky Landings by all three models
top.risky <- as.data.frame(sapply(list(model.risky, model.riskyprobit, model.riskycloglog), fitted))
colnames(top.risky) <- c("RiskyLogit", "RiskyProbit", "RiskyCloglog")
risky.full.data <- cbind(full.FAA_clean, top.risky)
logit.order <- head(risky.full.data[with(risky.full.data, order(-RiskyLogit)),])
probit.order <- head(risky.full.data[with(risky.full.data, order(-RiskyProbit)),])
cloglog.order <- head(risky.full.data[with(risky.full.data, order(-RiskyCloglog)),])
Logit    Probit    Cloglog
115      6         6
20       10        10
57       11        11
125      17        17
93       20        20
118      40        28
Observations and Conclusion: Determining the top risky flights for each model gave the above flight indices (head() returns six rows by default). The probit and cloglog models list almost the same flights as the most risky, while the logit model gave largely different results. The similarity between the probit and cloglog models could be because the models behave similarly in the extreme tail.
# Step 17 - Predict values for new data point defined
# Risky landing Probit model - Linear predictor
predict(model.riskyprobit, newdata = new.val, type = "link", se = T)
## $fit
##        1
## 9.690832
##
## $se.fit
## [1] 2.241033
##
## $residual.scale
## [1] 1
# Confidence interval - using linear predictor
pnorm(c(9.690832 - 1.96*2.241033, 9.690832 + 1.96*2.241033))
## [1] 0.999 1.000
# Risky landing Probit model – Probability
predict(model.riskyprobit, newdata = new.val, type = "response", se = T)
## $fit
## 1
## 1
##
## $se.fit
##            1
## 4.976094e-16
##
## $residual.scale
## [1] 1
# Confidence interval - using probability
round(c(1 - 1.96*4.976094e-16, 1 + 1.96*4.976094e-16), 3)
## [1] 1 1
                         Logit Model    Probit Model    Cloglog Model
Link                     logit          probit          cloglog
Predicted Probability    1              1               1
CI (eta)                 [1, 1]         [0.999, 1]      [1, 1]
CI (prob)                [1, 1]         [1, 1]          [1, 1]
Observations and Conclusion: The table above summarizes the predicted probability and the confidence intervals generated by the logit, probit and hazard (cloglog) models. The probit and hazard models reflect some separation between the 0 and 1 values in the probabilities computed from the linear predictor, as shown above.
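The linear predictor (eta) intervals in the table are built on the link scale and then mapped back through the inverse link, which keeps them inside [0, 1]. The sketch below wraps that calculation in a hypothetical helper, link_ci, that reads the inverse link from the fitted model so the same code serves the logit, probit and cloglog fits. It runs on simulated stand-in data (d, x and y are assumptions, not the FAA data), so its numbers are not comparable with the table.

```r
# Simulated stand-in data (hypothetical; NOT the FAA data)
set.seed(42)
n <- 300
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + 0.8 * d$x))

# Interval on the link scale, back-transformed with the model's own inverse link
link_ci <- function(model, newdata, level = 0.95) {
  z   <- qnorm(1 - (1 - level) / 2)            # about 1.96 for a 95% interval
  pr  <- predict(model, newdata = newdata, type = "link", se.fit = TRUE)
  inv <- family(model)$linkinv                 # plogis, pnorm or cloglog inverse
  data.frame(fit   = inv(pr$fit),
             lower = inv(pr$fit - z * pr$se.fit),
             upper = inv(pr$fit + z * pr$se.fit))
}

m_logit   <- glm(y ~ x, data = d, family = binomial(link = "logit"))
m_probit  <- glm(y ~ x, data = d, family = binomial(link = "probit"))
m_cloglog <- glm(y ~ x, data = d, family = binomial(link = "cloglog"))

new_pt <- data.frame(x = 1.5)
ci <- rbind(logit   = link_ci(m_logit, new_pt),
            probit  = link_ci(m_probit, new_pt),
            cloglog = link_ci(m_cloglog, new_pt))
ci
```

An interval formed directly on the probability scale, as fit ± 1.96 × se, can run past 0 or 1 before rounding, which is one reason to prefer the link-scale version when the fitted probability sits near a boundary.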