Flight Landing Risk Assessment Project

FLIGHT LANDING PROJECT
(Logistic Regression)
STATISTICAL MODELING, MS BANA
13th February, 2018
Preethi Jayaram Jayaraman
MS BANA, Class of 2017
M12420360

OBJECTIVE OF THE STUDY:
This study models the risk of landing overrun for commercial flights. Landing
data for 950 commercial flights (Airbus and Boeing) are available, with
variables including Aircraft, Duration, Number of Passengers, Ground Speed,
Air Speed, Height, Pitch, Long Landing and Risky Landing. The study evaluates
which factors affect two binary responses: Long Landing, which indicates that
the landing distance was greater than 2500m, and Risky Landing, which
indicates that the landing distance was greater than 3000m. The results can
be used to support decisions about a landing based on its estimated risk of a
long or risky overrun.
BACKGROUND:
As a start to the project, the datasets with details about the flights, FAA1
and FAA2 were merged and cleaned into the dataset, FAA_clean. FAA_clean is a
dataset with 831 observations and 8 variables. The structure and summary
statistics of FAA_clean can be found below.
# Step 0 – Structure, Summary Statistics of FAA_clean
str(FAA_clean)
## Classes 'tbl_df', 'tbl' and 'data.frame': 831 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Variable      Type         Missing values %  Min    Max      Mean    Median
Aircraft      Categorical  -                 -      -        -       -
Duration      Numerical    5.8%              14.7   305.6    154.0   153.9
No_pasg       Numerical    -                 29.0   87.0     60.1    60.0
Speed_ground  Numerical    -                 27.7   141.2    79.4    79.6
Speed_air     Numerical    75.5%             90.0   141.7    103.8   101.1
Height        Numerical    -                 -3.5   59.9     30.1    30.01
Pitch         Numerical    -                 2.2    5.9      4.0     4.0
Distance      Numerical    -                 34.1   6533.05  1526.0  1258.1

SECTION 1: CREATE BINARY RESPONSES
# Step 1 – Create Binary responses – Long Landing, Risky Landing
FAA_clean$long.landing <- ifelse(FAA_clean$distance > 2500,1,0)
FAA_clean$risky.landing <- ifelse(FAA_clean$distance > 3000,1,0)
FAA_clean <- FAA_clean[,-8]
Observations and Conclusion:
The variable, Distance, from the original dataset was used to create two
binary variables, Long Landing and Risky Landing. Long Landing is 1 for all
flights where Distance is greater than 2500m and 0 otherwise; Risky Landing
is 1 for all flights where Distance is greater than 3000m and 0 otherwise.
The continuous variable, Distance, was then discarded, and Long Landing and
Risky Landing are treated as the response variables from here on.
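As a small illustration of the thresholding step, the sketch below applies the
same ifelse() rule to a made-up vector of distances. The data frame `toy` and
its values are hypothetical and are not taken from FAA_clean.

```r
# Hypothetical landing distances, for illustration only
toy <- data.frame(distance = c(1145, 2500, 2988, 3370))

# Same rule as in Step 1: strictly greater than the cutoff counts as 1
toy$long.landing  <- ifelse(toy$distance > 2500, 1, 0)
toy$risky.landing <- ifelse(toy$distance > 3000, 1, 0)

print(toy)
```

Because the comparison is strict, a distance of exactly 2500 is not flagged as
long, and every risky landing is by construction also a long landing.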
SECTION 2: IDENTIFYING IMPORTANT FACTORS FOR RESPONSE, ‘LONG LANDING’
# Step 2 – Distribution of Long Landing
hist(FAA_clean$long.landing)
pct <-
round(table(FAA_clean$long.landing)/length(FAA_clean$long.landing)*100,1)
labs <- c("Not Long landing (<2500 m),", "Long landing (>2500 m),")
labs <- paste(labs,pct)
labs <- paste(labs,"%",sep = "" )
pie(table(FAA_clean$long.landing),labels = labs,col = rainbow(length(labs)),
main = "Pie chart of Long Landing")

The distribution of the variable, Long Landing, can be seen in the above
figures. Clearly, 87.6% of the observations recorded Not Long Landing, while
12.4% of the flights’ landing was recorded as long (> 2500 m).
# Step 3 - Single Factor Regression
var_corr <- names(FAA_clean[1:7])
# coefficients is a 2 x 4 matrix here: element [8] is the slope's p-value
pvalue <- vapply(FAA_clean[1:7],
  function(x) { summary(glm(FAA_clean$long.landing ~ x,
    family = binomial))$coefficients[8] },
  FUN.VALUE = numeric(1) )
# element [2] of the same matrix is the slope estimate
reg_coef <- vapply(FAA_clean[1:7],
  function(x) { summary(glm(FAA_clean$long.landing ~ x,
    family = binomial))$coefficients[2] },
  FUN.VALUE = numeric(1) )
sign_eqn <- vapply(reg_coef,
  function(x) { ifelse(x >= 0, "Positive", "Negative")},
  FUN.VALUE = character(1))
odds_ratio <- vapply(reg_coef, function(x) { exp(x) },
  FUN.VALUE = numeric(1))
# Regression Table - Table 1
table1 <- data.frame(var_corr, abs(reg_coef), odds_ratio, sign_eqn, pvalue)
table1 <- table1[order(abs(pvalue)),]
names(table1) <- c("Variable", "Size of Regression Coefficient", "Odds Ratio",
  "Direction of Regression Coefficient", "Size of the p-value")

Table 1 gives the significance of the relationship (p value) between the
response variable, Long Landing, and each predictor variable, X, fitted one
at a time. The table is sorted by increasing p value, so the variables at the
top have the strongest single-factor association with Long Landing. The
relationships with the variables speed_ground, speed_air, aircraft and pitch
are found to be significant.
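The single-factor screen can be sketched on simulated data. The block below is
a minimal illustration rather than the project's data: the data frame `sim`,
the response `y` and the helper `one_fit` are hypothetical names, and only
`x1` is constructed to influence the response. It reads the slope and p value
by name instead of by position in the coefficient matrix.

```r
set.seed(1)  # arbitrary seed so the simulated example is reproducible
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y   <- rbinom(n, 1, plogis(2 * sim$x1))  # only x1 affects the response

# Fit y ~ x for one predictor and return its slope and p-value by name
one_fit <- function(x) {
  cf <- coef(summary(glm(y ~ x, family = binomial)))
  c(coef = cf["x", "Estimate"], pvalue = cf["x", "Pr(>|z|)"])
}

res <- t(vapply(sim, one_fit, FUN.VALUE = numeric(2)))
screen <- data.frame(variable   = rownames(res),
                     odds_ratio = exp(res[, "coef"]),
                     pvalue     = res[, "pvalue"])
screen <- screen[order(screen$pvalue), ]  # smallest p-value first
print(screen)
```

Indexing by row and column name gives the same numbers as positions [2] and
[8], but is less likely to break if the model gains a second coefficient.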
# Step 4 - Visualizing the association b/w long-landing and the significant variables
library(ggplot2)
attach(FAA_clean)
# 4.1. Long-landing vs speed_ground
plot(jitter(long.landing,0.1)~jitter(speed_ground),FAA_clean,
  xlab = "Speed Ground", ylab = "Long Landing",pch = ".",
  main = "Long Landing vs Speed Ground")
# factor() makes the fill discrete so that the two groups are dodged
ggplot(FAA_clean,aes(x = speed_ground, fill = factor(long.landing))) +
  geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) +
  ggtitle("Long Landing vs Speed Ground")
cor(long.landing, speed_ground)
## [1] 0.6214409

There is a clear pattern between Long Landing and speed_ground. Most flight
landings that were considered long have speed_ground greater than 100, and
the scatter plot shows no long landings when speed_ground is less than 100.
From the histogram, the distribution of speed_ground looks approximately
normal.
# 4.2. Long-landing vs speed_air
plot(jitter(long.landing,0.1)~jitter(speed_air),FAA_clean,xlab = "Speed Air",
ylab = "Long Landing",pch = ".", main = "Long Landing vs Speed Air")
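# Histogram layer restored following Step 4.1 (it is missing from the listing);
# binwidth is left at the default because the original value is not shown
ggplot(FAA_clean,aes(x = speed_air,fill = factor(long.landing))) +
  geom_histogram(position = "dodge", aes(y = ..density..)) +
  ggtitle("Long Landing vs Speed Air")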
cor(long.landing, speed_air, use = "complete.obs")
## [1] 0.7329355
There is a clear pattern between Long Landing and speed_air as well. The
lowest recorded value of speed_air is 90 mph. Most flight landings that were
considered long have speed_air greater than 95, and every landing with
speed_air greater than 110 was long. When speed_air is between 90 and 100
mph, some landings were not long. The histogram shows a right skew in
speed_air. The correlation between Long Landing and the predictor variable,
speed_air, is 0.73.

# 4.3. Long-landing vs Pitch
plot(jitter(long.landing,0.1)~jitter(pitch),FAA_clean,xlab = "pitch",
ylab = "Long Landing",pch = ".", main = "Long Landing vs Pitch")
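# Histogram layer restored following Step 4.1 (it is missing from the listing);
# binwidth is left at the default because the original value is not shown
ggplot(FAA_clean,aes(x = pitch,fill = factor(long.landing))) +
  geom_histogram(position = "dodge", aes(y = ..density..)) +
  ggtitle("Long Landing vs Pitch")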
cor(long.landing, pitch)
## [1] 0.06919407
From the scatter plot and the histogram, most values of pitch in the dataset
are close to 4, and a few landings with a pitch near 4 were considered long.
The correlation between Long Landing and the predictor variable, pitch, is
very slight at 0.07.
# 4.4. Long-landing vs Aircraft
ggplot(FAA_clean,aes(x = long.landing, fill = aircraft)) +
  geom_bar(position = "dodge", width = 0.5) +
  facet_grid(~ aircraft)

From the bar plot chart, it’s clear that around 70 Boeing flight landings and
around 35 Airbus flight landings were considered long, while the rest of the
flight landings were not considered long.
# Step 5 – Identify collinearity in the predictor variables and group
plot(speed_ground, speed_air)
# cor.test() has no 'use' argument; it drops incomplete pairs on its own
cor.test(speed_ground, speed_air)
## Pearson's product-moment correlation
##
## data: speed_ground and speed_air
## t = 90.453, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9841163 0.9908449
## sample estimates:
## cor
## 0.9879383
To identify the correlation between the predictor variables, speed_air and
speed_ground, the two variables were plotted against each other. Furthermore,
the cor.test function was called to measure the correlation between them. The
high correlation of 0.99 observed between speed_air and speed_ground
indicates that only one of these variables should be used to build the final
model.

cor(data.frame(speed_ground, speed_air, pitch, height, no_pasg, duration),
use = "complete.obs")
## speed_ground speed_air pitch height
## speed_ground 1.000000000 9.883475e-01 -0.06316127 -0.095483596
## speed_air 0.988347471 1.000000e+00 -0.04826810 -0.086729286
## pitch -0.063161271 -4.826810e-02 1.00000000 -0.033217630
## height -0.095483596 -8.672929e-02 -0.03321763 1.000000000
## no_pasg 0.003570599 2.242971e-05 -0.03766471 -0.006625455
## duration 0.023885892 4.454351e-02 -0.05627519 0.073775491
## no_pasg duration
## speed_ground 3.570599e-03 0.02388589
## speed_air 2.242971e-05 0.04454351
## pitch -3.766471e-02 -0.05627519
## height -6.625455e-03 0.07377549
## no_pasg 1.000000e+00 -0.06917843
## duration -6.917843e-02 1.00000000
Further, from the results of the correlation matrix, no other such high
correlation was recorded. Based on these results, speed_air was chosen as the
representative variable, as it has a higher correlation with the response
variable, Long Landing (0.73 versus 0.62 for speed_ground), and is more
relevant to measure the response. Because speed_air is missing for 75.5% of
the flights, models that use it are fitted on the complete cases only.
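To illustrate why only one of two highly correlated predictors is kept, the
sketch below simulates a near-duplicate predictor and compares the standard
error of the same coefficient with and without it. The variables `x1`, `x2`
and `y` are simulated and hypothetical; they are not the flight data.

```r
set.seed(2)  # arbitrary seed so the simulated example is reproducible
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
y  <- rbinom(n, 1, plogis(x1))

r <- cor(x1, x2)

# Standard error of the x1 slope, alone and alongside its near-duplicate
se_alone <- coef(summary(glm(y ~ x1, family = binomial)))["x1", "Std. Error"]
se_both  <- coef(summary(glm(y ~ x1 + x2, family = binomial)))["x1", "Std. Error"]

print(c(correlation = r, se_alone = se_alone, se_both = se_both))
```

The inflated standard error in the two-predictor fit is the practical symptom
of multi-collinearity: the data cannot tell the two predictors apart.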
# Step 5 - Initiate a full model - after grouping
full.FAA_clean <- na.omit(FAA_clean)
full.model <- glm(data = full.FAA_clean, long.landing ~ aircraft + height +
pitch + speed_air + no_pasg + duration, family = binomial)
Choosing speed_air as the representative variable between speed_air and
speed_ground, the full logistic model was built. The results of the full
logistic model show that variables, aircraft, height and speed_air are
significant with an AIC of 47.264.
# Step 6 - Stepwise variable selection using AIC criterion
full.model1 <- glm(data = full.FAA_clean, long.landing ~ aircraft + height +
  pitch + speed_air + no_pasg + duration, family = binomial)
# With no scope given, step() defaults to backward elimination from the full model
model.AIC <- step(full.model1,trace = 0)
Running stepwise variable selection with the AIC criterion, starting from the
full model, a final model was identified with aircraft, height, pitch and
speed_air as the final variables. The AIC of the final model was 44.278.

Comparing these results with the full model, consistent results are obtained:
the variables that are significant in the full model (aircraft, height and
speed_air) are all retained, and pitch is additionally kept by the AIC
criterion.
# Step 7 - Stepwise variable selection using BIC criterion
# k = log(n) gives the BIC penalty; n = 195 complete observations
model.BIC <- step(full.model1,k = log(195), trace = 0)
Running the same stepwise variable selection with the BIC criterion, a final
model was identified with aircraft, height and speed_air as the final
variables. The AIC of the final model found after step variable selection
using BIC was 44.798.
Compared with the model selected using AIC, the insignificant variable,
pitch, was dropped. BIC penalizes each additional parameter by log(195), or
about 5.27, rather than 2, so it favors the simpler model, while AIC retains
pitch for a slightly lower AIC (44.278 versus 44.798).
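The two criteria differ only in the penalty per parameter. As a hedged
sketch, the block below computes both by hand from the log-likelihood of a
logistic model fitted to simulated data and checks them against R's AIC() and
BIC(). The data and the names `fit`, `aic_manual` and `bic_manual` are
hypothetical; only the sample size of 195 is borrowed from the analysis.

```r
set.seed(3)  # arbitrary seed so the simulated example is reproducible
n  <- 195    # same sample size as the complete-case data
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.8 * x1))
fit <- glm(y ~ x1 + x2, family = binomial)

ll <- logLik(fit)
k  <- attr(ll, "df")   # number of estimated parameters
aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(nobs(fit)) * k

print(c(AIC = aic_manual, BIC = bic_manual, penalty_per_param = log(n)))
```

Passing `k = log(n)` to step() swaps the AIC penalty of 2 for the BIC penalty,
which is why the same call can serve both selection runs.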
# Step 8 – Risk factors for ‘Long Landing’
# Summary of Findings:
i) 12.4% of all the flight landings recorded in the data set are long
landings, which are the concern of this study; most of these are Boeing
flights
ii) The very high correlation between the predictors, speed_air and
speed_ground, shows high potential for multi-collinearity. Hence, only one of
them should be used in a final model
iii) The high correlation of 0.73 between the response variable, Long
Landing, and the predictor, Speed_air, indicates that the predictor variable
may explain the response variable very well
iv) The output of the stepwise selection (step function) using AIC criterion
was selected as final as it provides all the variables that contribute to the
risk factors for Long landing. The risk factors are captured in the
variables aircraft, height, pitch and speed_air. The below table summarizes
the output of the final model.
Final Model
Model Criterion     Stepwise selection (step function) using AIC
Final Parameters    aircraft, height, pitch, speed_air
AIC                 44.278
DF                  194
v) The single factor regressions and the stepwise selection give broadly
similar results: speed_air, aircraft and pitch are flagged by both, while
height is retained only by the stepwise model.
SECTION 3: IDENTIFYING IMPORTANT FACTORS FOR RESPONSE, ‘RISKY LANDING’
# Step 9 - Repeat Steps 1- 7 for 'Risky Landing'
# 9.2. Histogram of Risky landing
hist(FAA_clean$risky.landing)
pct1 <-
round(table(FAA_clean$risky.landing)/length(FAA_clean$risky.landing)*100,1)
labs <- c("Not Risky landing (<3000 m),","Risky landing (>3000 m),")
labs <- paste(labs,pct1)
labs <- paste(labs,"%",sep = "" )

pie(table(FAA_clean$risky.landing),labels = labs,col = rainbow(length(labs)),
main = "Pie chart of Risky Landing")
The distribution of the variable, Risky Landing, can be seen in the above
figures. Clearly, 92.7% of the observations recorded Not Risky Landing, while
7.3% of the flights’ landing was recorded as risky (> 3000 m).
# Step 9.3 - Single Factor Regression
var_corr <- names(FAA_clean[1:7])
# coefficients is a 2 x 4 matrix here: element [8] is the slope's p-value
pvalue <- vapply(FAA_clean[1:7],
  function(x) { summary(glm(FAA_clean$risky.landing ~ x,
    family = binomial))$coefficients[8] },
  FUN.VALUE = numeric(1) )
# element [2] of the same matrix is the slope estimate
reg_coef <- vapply(FAA_clean[1:7],
  function(x) { summary(glm(FAA_clean$risky.landing ~ x,
    family = binomial))$coefficients[2] },
  FUN.VALUE = numeric(1) )
sign_eqn <- vapply(reg_coef,
  function(x) { ifelse(x >= 0, "Positive", "Negative")},
  FUN.VALUE = character(1))
odds_ratio <- vapply(reg_coef, function(x) { exp(x) },
  FUN.VALUE = numeric(1))
# Regression Table - Table 2
table2 <- data.frame(var_corr, abs(reg_coef), odds_ratio, sign_eqn, pvalue)
table2 <- table2[order(abs(pvalue)),]
names(table2) <- c("Variable", "Size of Regression Coefficient", "Odds Ratio",
  "Direction of Regression Coefficient", "Size of the p-value")
Table 2 gives the significance of the relationship (p value) between the
response variable, Risky Landing, and each predictor variable, X, fitted one
at a time. The table is sorted by increasing p value, so the variables at the
top have the strongest single-factor association with Risky Landing. The
relationships with the variables speed_ground, speed_air and aircraft are
found to be significant.
# Step 9.4 - Visualizing the association b/w Risky-landing and the significant variables
attach(FAA_clean)
# 9.4.1. Risky-landing vs speed_ground
plot(jitter(risky.landing,0.1)~jitter(speed_ground),FAA_clean,
  xlab = "Speed Ground", ylab = "Risky Landing",pch = ".",
  main = "Risky Landing vs Speed Ground")
# Histogram layer restored following Step 4.1 (it is missing from the listing)
ggplot(FAA_clean,aes(x = speed_ground, fill = factor(risky.landing))) +
  geom_histogram(position = "dodge", binwidth = 2, aes(y = ..density..)) +
  ggtitle("Risky Landing vs Speed Ground")

cor(risky.landing, speed_ground)
## [1] 0.5413304
There is a clear pattern between Risky Landing and speed_ground. Most flight
landings that were considered risky have speed_ground greater than 100, and
the scatter plot shows no risky landings when speed_ground is less than 100.
From the histogram, the distribution of speed_ground looks almost normal.
# 9.4.2. Risky-landing vs speed_air
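plot(jitter(risky.landing,0.1)~jitter(speed_air),FAA_clean,
  xlab = "Speed Air", ylab = "Risky Landing",pch = ".",
  main = "Risky Landing vs Speed Air")
# Histogram layer restored following Step 4.1 (it is missing from the listing);
# binwidth is left at the default because the original value is not shown
ggplot(FAA_clean,aes(x = speed_air, fill = factor(risky.landing))) +
  geom_histogram(position = "dodge", aes(y = ..density..)) +
  ggtitle("Risky Landing vs Speed Air")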

cor(risky.landing, speed_air, use = "complete.obs")
## [1] 0.8129461
There is a clear pattern between Risky Landing and speed_air as well. The
lowest recorded value of speed_air is 90 mph. Most flight landings that were
considered risky have speed_air greater than 105, and every landing with
speed_air greater than 110 was risky. The histogram shows a right skew in
speed_air. There is also a high correlation of 0.81 between Risky Landing
and the predictor variable, speed_air.
# 9.4.3. Risky-landing vs Aircraft
ggplot(FAA_clean,aes(x = risky.landing, fill = aircraft)) +
  geom_bar(position = "dodge", width = 0.5) +
  facet_grid(~ aircraft)

From the bar plot chart, it’s clear that almost 50 Boeing flight landings and
around 20 Airbus flight landings were considered risky, while the rest of the
flight landings were not considered risky.
# Step 9.5 – Identify collinearity in the predictor variables and group
plot(speed_ground, speed_air)
# cor.test() has no 'use' argument; it drops incomplete pairs on its own
cor.test(speed_ground, speed_air)
## Pearson's product-moment correlation
##
## data: speed_ground and speed_air
## t = 90.453, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9841163 0.9908449
## sample estimates:
## cor
## 0.9879383
To identify the correlation between the predictor variables, speed_air and
speed_ground, the two variables were plotted against each other. Furthermore,
the cor.test function was called to measure the correlation between them. The
high correlation of 0.99 observed between speed_air and speed_ground
indicates that only one of these variables should be used to build the final
model.
cor(data.frame(speed_ground, speed_air, pitch, height, no_pasg, duration),
use = "complete.obs")

## speed_ground speed_air pitch height
## speed_ground 1.000000000 9.883475e-01 -0.06316127 -0.095483596
## speed_air 0.988347471 1.000000e+00 -0.04826810 -0.086729286
## pitch -0.063161271 -4.826810e-02 1.00000000 -0.033217630
## height -0.095483596 -8.672929e-02 -0.03321763 1.000000000
## no_pasg 0.003570599 2.242971e-05 -0.03766471 -0.006625455
## duration 0.023885892 4.454351e-02 -0.05627519 0.073775491
## no_pasg duration
## speed_ground 3.570599e-03 0.02388589
## speed_air 2.242971e-05 0.04454351
## pitch -3.766471e-02 -0.05627519
## height -6.625455e-03 0.07377549
## no_pasg 1.000000e+00 -0.06917843
## duration -6.917843e-02 1.00000000
Further, from the results of the correlation matrix, no other such high
correlation was recorded. Based on these results, speed_air was chosen as the
representative variable, as it has a higher correlation with the response
variable, Risky Landing (0.81 versus 0.54 for speed_ground), and is more
relevant to measure the response.
# Step 9.5 - Initiate a full model - after grouping
full.FAA_clean <- na.omit(FAA_clean)
risky.full.model <- glm(data = full.FAA_clean, risky.landing ~ aircraft +
height + pitch + speed_air + no_pasg + duration, family = binomial)
Choosing speed_air as the representative variable between speed_air and
speed_ground, the full logistic regression model was built. The results of
the full logistic model show that only variables, aircraft and speed_air are
significant with an AIC of 36.257.
# Step 9.6 - Stepwise variable selection using AIC criterion
risky.full.model1 <- glm(data = full.FAA_clean, risky.landing ~ aircraft +
  height + pitch + speed_air + no_pasg + duration, family = binomial)
# With no scope given, step() defaults to backward elimination from the full model
risky.model.AIC <- step(risky.full.model1,trace = 0)

Running stepwise variable selection with the AIC criterion, starting from the
full model, a final model was identified with aircraft and speed_air as the
final variables. The AIC of the final model was 32.281.
Comparing the results of the step function with the single factor
regressions table, consistent results are obtained indicating that only
aircraft and speed_air are significant. This result is also consistent with
the results obtained from the full model.
# Step 9.7 - Stepwise variable selection using BIC criterion
# k = log(n) gives the BIC penalty; n = 195 complete observations
risky.model.BIC <- step(risky.full.model1,k = log(195), trace = 0)
Running the same stepwise variable selection with the BIC criterion, a final
model was identified with aircraft and speed_air as the final variables. The
AIC of the final model found after step variable selection using BIC was
32.281.
The step function with the BIC criterion selects exactly the same predictor
variables as the one using the AIC criterion.
# Step 10 – Risk factors for ‘Risky Landing’
# Summary of Findings:
i) 7.3% of all the flight landings recorded in the data set are risky
landings, which are the concern of this study; most of these are Boeing
flights
ii) The very high correlation between the predictors, speed_air and
speed_ground, shows high potential for multi-collinearity. Hence, only one of
them should be used in a final model
iii) The high correlation of 0.81 between the response variable, Risky
Landing, and the predictor, Speed_air, indicates that the predictor variable
may explain the response variable very well
iv) The output of the stepwise selection (step function) using AIC criterion
was selected as final; in this case the models using the AIC and BIC criteria
yielded the same results. The risk factors are captured in the variables
aircraft and speed_air. The below table summarizes the output of the final
model.
Final Model
Model Criterion     Stepwise selection (step function) using AIC
Final Parameters    aircraft, speed_air
AIC                 32.281
DF                  194
v) The single factor regressions and the stepwise selection give similar
results: aircraft and speed_air are significant in both.
SECTION 4: COMPARE THE MODELS BUILT FOR ‘LONG LANDING’ & ‘RISKY LANDING’
# Step 11 – Summarize the difference b/w the two models
i) The final model chosen for ‘Long landing’ has 4 final parameters –
aircraft, height, pitch and speed_air, while the one for ‘Risky Landing’ has
only 2 parameters – speed_air and aircraft
ii) The AIC of the Long landing model is 44.278, while the AIC of the Risky
landing model is 32.281. This suggests that aircraft and speed_air predict
Risky Landing better than the selected variables predict Long Landing,
although AIC values computed for different response variables are not
strictly comparable

# Step 12 - ROC curve
model.long <- glm(long.landing ~ aircraft + height + speed_air + pitch,
data = full.FAA_clean, family = binomial)
### Linear predictor
linpred <- predict(model.long)
### Predicted probabilities
predprob <- predict(model.long, type = "response")
### Predicted outcomes using 0.5 as the threshold
predout <- ifelse(predprob < 0.5,"no","yes")
longm <- data.frame(full.FAA_clean,predprob,predout)
thresh <- seq(0.01,0.5,0.01)
sensitivity <- specificity <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
pp <- ifelse(longm$predprob < thresh[j], "no", "yes")
xx <- xtabs(~long.landing+pp, longm)
specificity[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
sensitivity[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
plot(1 - specificity,sensitivity,type = "l", lty = 2, col = "blue")
model.risky <- glm(risky.landing ~ aircraft + speed_air,
  data = full.FAA_clean, family = binomial)
### Linear predictor
linpred <- predict(model.risky)
### Predicted probabilities
predprob <- predict(model.risky, type = "response")
### Predicted outcomes using 0.5 as the threshold
predout <- ifelse(predprob < 0.5,"no","yes")
riskym <- data.frame(full.FAA_clean,predprob,predout)
thresh <- seq(0.01,0.5,0.01)
sensitivity1 <- specificity1 <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
  pp <- ifelse(riskym$predprob < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing+pp, riskym)
  specificity1[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity1[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
lines(1 - specificity1,sensitivity1,type = "l", lty = 2, col = "green")

ROC Curve: (Blue – Long Landing, Green – Risky Landing)
i) The area under the curve of the Risky landing model is larger than that
of the Long landing model
ii) Speed_air appears to be a very strong predictor of Risky landing, as
reflected in the extremely high AUC for the Risky landing model
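The AUC comparison here is read from the plotted curves, which cover
thresholds from 0.01 to 0.5 only. As a hedged sketch, one way to put a number
on it is the rank-based (Mann-Whitney) formula, which needs only the
predicted scores and the 0/1 labels. The helper `auc_rank` and the toy
vectors below are hypothetical and are not the project's data.

```r
# AUC as the probability that a random positive outscores a random negative
auc_rank <- function(score, label) {
  r  <- rank(score)            # average ranks, so ties count as one half
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 6 of the 9 positive/negative pairs are ordered correctly
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)
print(auc_rank(scores, labels))
```

For the fitted models, the same function could be called with the fitted
probabilities as `score` and the binary response as `label`.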
# Step 13 - Predict probabilities of Long landing, Risky landing for new data
new.val <- data.frame(aircraft = "boeing", duration = 200, no_pasg = 80,
speed_ground = 115, speed_air = 120, height = 40, pitch = 4)
# long.landing - Linear predictor (eta)
predict(model.long,newdata = new.val,type = "link", se = T)
## $fit
## 1
## 37.91026
##
## $se.fit
## [1] 10.28447
##
## $residual.scale
## [1] 1
# Confidence interval - using linear predictor
# ilogit() comes from the faraway package; base R's plogis() is equivalent
library(faraway)
round(ilogit(c(37.91026 - 1.96*10.28447, 37.91026 + 1.96*10.28447)),3)
## [1] 1 1
# long.landing – Predicted probability
predict(model.long, newdata = new.val, type = "response", se = T)

## $fit
## 1
## 1
##
## $se.fit
## 1
## 2.283611e-15
##
## $residual.scale
## [1] 1
# Confidence interval - using probability
round(c(1 - 1.96*2.283611e-15, 1 + 1.96*2.283611e-15),3)
## [1] 1 1
Because speed_air separates long and not-long landings almost completely, the
fitted probability for this flight is essentially 1, and the confidence
intervals from both the eta (linear predictor) method and the probability
method are [1,1] after rounding.
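The eta method can be reproduced with base R alone. The sketch below plugs in
the linear predictor and standard error reported for the long landing model
and uses plogis() in place of ilogit(); the variable names are hypothetical.

```r
eta <- 37.91026   # linear predictor reported for the new flight
se  <- 10.28447   # its standard error

ci_eta  <- eta + c(-1, 1) * 1.96 * se   # interval on the link scale
ci_prob <- plogis(ci_eta)               # mapped to the probability scale

# Shortfall from 1, computed in the upper tail to avoid rounding to zero
shortfall <- plogis(ci_eta, lower.tail = FALSE)

print(round(ci_prob, 3))
print(shortfall)
```

An interval built on the link scale and then transformed always stays inside
[0, 1], whereas the probability method (fit plus or minus 1.96 standard
errors) can cross 1 before rounding.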
# Risky landing - Linear predictor
predict(model.risky,newdata = new.val,type = "link", se = T)
## $fit
## 1
## 17.30626
##
## $se.fit
## [1] 4.423414
##
## $residual.scale
## [1] 1
round(ilogit(c(17.30626 - 1.96*4.423414, 17.30626 + 1.96*4.423414)),3)
## [1] 1 1
# Risky.landing – Predicted probability
predict(model.risky, newdata = new.val, type = "response", se = T)
## $fit
## 1
## 1
##
## $se.fit
## 1
## 1.348172e-07
##

## $residual.scale
## [1] 1
round(c(1 - 1.96*1.348172e-07, 1 + 1.96*1.348172e-07),3)
## [1] 1 1
Because speed_air separates risky and not-risky landings almost completely,
the fitted probability for this flight is essentially 1, and the confidence
intervals from both the eta (linear predictor) method and the probability
method are [1,1] after rounding.
SECTION 5: COMPARE MODELS WITH NEW LINK FUNCTIONS
# Step 14 – Fit Probit and cloglog link functions for ‘Risky Landing’
model.risky <- glm(risky.landing ~ aircraft + speed_air,
  data = full.FAA_clean, family = binomial)
model.riskyprobit <- glm(risky.landing ~ aircraft + speed_air,
  data = full.FAA_clean, family = binomial(link = probit))
model.riskycloglog <- glm(risky.landing ~ aircraft + speed_air,
  data = full.FAA_clean, family = binomial(link = cloglog))
                    Logit Model    Probit Model    Cloglog Model
Link                Link = logit   Link = Probit   Link = cloglog
Aircraft estimate   4.55           2.64            3.24
Speed_air estimate  1.22           0.67            0.93
AIC                 32.281         32.133          30.333
The logit model was compared against models built with the probit and cloglog
link functions. The AIC is lowest for the model built with the cloglog link
(30.333, against 32.133 for probit and 32.281 for logit). Comparing the
models using ROC curves gives a better idea of the best-performing model for
the risky landing variable.
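The three links differ in how they map the linear predictor to a probability.
As a brief sketch, the block below evaluates the three inverse links at a few
arbitrary values of eta and computes the ratio of the logit to the probit
estimates reported for the risky landing models; the names `inv_links` and
`ratio` are hypothetical.

```r
eta <- c(-2, 0, 2)  # a few values of the linear predictor

inv_links <- data.frame(
  eta     = eta,
  logit   = plogis(eta),          # 1 / (1 + exp(-eta))
  probit  = pnorm(eta),           # standard normal CDF
  cloglog = 1 - exp(-exp(eta))    # complementary log-log
)
print(round(inv_links, 3))

# Ratio of the logit to the probit estimates reported for the two models
ratio <- c(aircraft = 4.55 / 2.64, speed_air = 1.22 / 0.67)
print(round(ratio, 2))
```

The logit and probit curves are symmetric around 0.5, while the cloglog curve
is not and approaches 1 faster. Because the links are on different scales,
the coefficient estimates are not directly comparable across models; logit
estimates that are somewhat under twice the probit ones, as here, are a
commonly seen pattern.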
# Step 15 - ROC curves of all 3 models together
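# The threshold loops are missing from the listing; restored following Step 12
# Logit Model
linpred <- predict(model.risky)
predprob <- predict(model.risky, type = "response")
thresh <- seq(0.01,0.5,0.01)
sensitivity1 <- specificity1 <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
  pp <- ifelse(predprob < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing+pp, full.FAA_clean)
  specificity1[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity1[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
plot(1 - specificity1,sensitivity1,type = "l", lty = 2, col = "blue")
# Probit Model
linpred <- predict(model.riskyprobit)
predprob <- predict(model.riskyprobit, type = "response")
thresh <- seq(0.01,0.5,0.01)
sensitivity2 <- specificity2 <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
  pp <- ifelse(predprob < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing+pp, full.FAA_clean)
  specificity2[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity2[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
lines(1 - specificity2,sensitivity2, type = "l", lty = 2, col = "red")
# Cloglog Model
linpred <- predict(model.riskycloglog)
predprob <- predict(model.riskycloglog, type = "response")
thresh <- seq(0.01,0.5,0.01)
sensitivity3 <- specificity3 <- rep(NA,length(thresh))
for (j in seq(along = thresh)) {
  pp <- ifelse(predprob < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing+pp, full.FAA_clean)
  specificity3[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity3[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
lines(1 - specificity3, sensitivity3, type = "l", lty = 2, col = "green")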

ROC Curve: (Blue – Logit link, Red – Probit link, Green – Cloglog link)
Comparing the models using the ROC curves gives a better idea of the
best-performing model for the variable, risky_landing. The area under the
curve is largest for the probit link, followed by the logit link and then the
cloglog link. This could reflect how the models behave at the right extreme,
for predicted probabilities near 1.
# Step 16 – Top 5 Risky Landings by all three models
top.risky <- as.data.frame(sapply(list(model.risky, model.riskyprobit,
model.riskycloglog), fitted))
colnames(top.risky) <- c("RiskyLogit", "RiskyProbit", "RiskyCloglog")
risky.full.data <- cbind(full.FAA_clean, top.risky)
logit.order <- head(risky.full.data[with(risky.full.data,
order(-RiskyLogit)),])
probit.order <- head(risky.full.data[with(risky.full.data,
order(-RiskyProbit)),])
cloglog.order <- head(risky.full.data[with(risky.full.data,
order(-RiskyCloglog)),])
Logit   Probit  Cloglog
115     6       6
20      10      10
57      11      11
125     17      17
93      20      20
118     40      28

Listing the flights with the highest fitted risk under each model (head()
returns six rows by default) gave the above flight indices. The Probit and
Cloglog models list almost the same flights as the most risky, while the
Logit model gave largely different results, with only index 20 appearing in
all three lists. The similarity between the Probit and Cloglog models could
be because the two models behave similarly in the extreme tail.
# Step 17 - Predict values for new data point defined
# Risky landing Probit model - Linear predictor
predict(model.riskyprobit,newdata = new.val,type = "link", se = T)
## $fit
## 1
## 9.690832
##
## $se.fit
## [1] 2.241033
##
## $residual.scale
## [1] 1
pnorm(c(9.690832 - 1.96*2.241033, 9.690832 + 1.96*2.241033))
## [1] 0.999 1.000
# Risky landing Probit model – Probability
predict(model.riskyprobit, newdata = new.val, type = "response", se = T)
## $fit
## 1
## 1
##
## $se.fit
## 1
## 4.976094e-16
##
## $residual.scale
## [1] 1
round(c(1 - 1.96*4.976094e-16, 1 + 1.96*4.976094e-16),3)
## [1] 1 1

                       Logit Model    Probit Model    Cloglog Model
Link                   Link = logit   Link = Probit   Link = cloglog
Predicted Probability  1              1               1
CI (eta)               [1,1]          [0.999, 1]      [1,1]
CI (prob)              [1,1]          [1,1]           [1,1]
The table above summarizes the predicted probability and the confidence
intervals generated using the logit, probit and cloglog (hazard) models. All
three models predict a probability of essentially 1 for the new flight. Only
the probit interval computed from the linear predictor shows any departure
from 1, with a lower bound of 0.999.
