Flights Landing Overrun Project

FLIGHT LANDING PROJECT
STATISTICAL MODELING, MS BANA
20th January, 2018
Preethi Jayaram Jayaraman
BANA, Class of 2017
M12420360

OBJECTIVE OF THE STUDY:
The motivation of this study is to reduce the risk of landing overrun for
commercial flights. Landing data for 950 commercial flights (Airbus and
Boeing) are available, with the variables Aircraft, Duration, Number of
Passengers, Ground Speed, Air Speed, Height, Pitch and Distance. The study
evaluates which factors affect the landing distance of a commercial flight
and the magnitude of each effect. The results can then be used to support
landing decisions based on the risk of landing overrun.
SECTION 1: INITIAL EXPLORATION OF DATA
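# Step 1 – Reading files into R
library(readxl)  # read_excel()
library(MASS)    # stepAIC()
library(dplyr)   # bind_rows(), filter(), anti_join(), %>%
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")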
# Step 2 – Structure of the Dataset
str(FAA1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
str(FAA2)
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Observations and Conclusion:
The data sets FAA1 and FAA2 have 800 and 150 observations respectively. FAA1
has 8 variables, while FAA2 has only 7: FAA1 has a variable named Duration
that FAA2 does not have. This could be due to a difference in the data
collection methods used for FAA1 and FAA2.
FAA1 and FAA2 have 7 and 6 numerical variables respectively, plus 1
categorical variable, the aircraft make. Both data sets are imported as data
frames, which makes them easy to analyse.

# Step 3 – Data Merging and Checking Duplicates
FAA_final <- bind_rows(FAA1, FAA2)
sum(duplicated(FAA_final[,-2]))
## [1] 100
FAA_final <- FAA_final[!duplicated(FAA_final[,-2]),]
The data sets FAA1 and FAA2 are merged into one data set, FAA_final. When
checking for duplicates, the Duration variable was excluded because only FAA1
has it. After merging, 100 duplicates were found in the merged data set. As
keeping duplicates would skew the analysis, they were removed and the result
was saved back into FAA_final.
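The duplicate check described above can be sketched on a small toy data frame. This is only an illustration in base R, assuming made-up rows and hypothetical object names (first, second, merged); it is not the FAA data. Rows are compared on every column except duration, since that column is missing for the second source.

```r
# Toy data (hypothetical values): 'duration' exists only in the first source
first  <- data.frame(aircraft = c("boeing", "airbus"),
                     duration = c(98.5, 125.7),
                     distance = c(3370, 2988))
second <- data.frame(aircraft = c("boeing", "airbus"),
                     duration = NA_real_,
                     distance = c(3370, 1145))
merged <- rbind(first, second)

# Compare rows on every column except 'duration'
key_cols <- setdiff(names(merged), "duration")
is_dup   <- duplicated(merged[, key_cols])

n_dup   <- sum(is_dup)        # number of repeated rows
deduped <- merged[!is_dup, ]  # keep the first occurrence of each row
```

Selecting the comparison columns by name rather than by position keeps the check correct even if the column order changes.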
# Step 4 - Structure of the Dataset
str(FAA_final)
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
The data set FAA_final has 850 unique observations and 8 variables (7
numerical and 1 categorical). The summary command provides the summary
statistics of each variable. Below is a consolidated form of the summary
statistics of each variable.
Variable      Type         Missing values %  Min    Max      Mean    Median
Aircraft      Categorical  -                 -      -        -       -
Duration      Numerical    5.8%              14.7   305.6    154.0   153.9
No_pasg       Numerical    -                 29.0   87.0     60.1    60.0
Speed_ground  Numerical    -                 27.7   141.2    79.4    79.6
Speed_air     Numerical    75.5%             90.0   141.7    103.8   101.1
Height        Numerical    -                 -3.5   59.9     30.1    30.01
Pitch         Numerical    -                 2.2    5.9      4.0     4.0
Distance      Numerical    -                 34.1   6533.05  1526.0  1258.1

# Step 5 – Summary of Findings
i) There are 850 observations in the merged data set
ii) There are 8 variables, of which Landing Distance is the response
variable and the other 7 variables are the predictors
iii) There are two makes of aircraft in the data set, Boeing and Airbus
iv) There were 100 duplicates in the data set, which were removed
v) About 6% of the values are missing in the variable Duration and about 75%
in the variable Speed_air.
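The missing-value percentages quoted above can be obtained with a one-line column summary. A minimal sketch, assuming a toy data frame with made-up values (the name toy and its contents are hypothetical, not the FAA data):

```r
# Toy data frame with some missing entries (hypothetical values)
toy <- data.frame(duration  = c(98.5, NA, 112.0, 196.8),
                  speed_air = c(109, NA, NA, NA),
                  height    = c(27.4, 27.8, 18.6, 30.7))

# is.na() gives a logical matrix; the column mean of TRUE/FALSE is the
# share of missing values, which is then expressed as a percentage
pct_missing <- round(colMeans(is.na(toy)) * 100, 1)
pct_missing
```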
SECTION 2: DATA CLEANING AND FURTHER EXPLORATION:
# Step 6 – Abnormal values
# `|` is the element-wise OR, so every row is tested (`||` is not vectorised)
FAA_final %>% filter(duration < 40) %>% nrow()
FAA_final %>% filter(speed_ground < 30 | speed_ground > 140) %>% nrow()
FAA_final %>% filter(speed_air < 30 | speed_air > 140) %>% nrow()
FAA_final %>% filter(height < 6) %>% nrow()
FAA_final %>% filter(pitch < 0) %>% nrow()
FAA_final %>% filter(distance > 6000) %>% nrow()
FAA_abnormal <- FAA_final %>%
  filter(duration < 40 | no_pasg < 0 |
         (speed_ground < 30 | speed_ground > 140) |
         (speed_air < 30 | speed_air > 140) | height < 6 |
         pitch < 0 | distance < 0 | distance > 6000)
# Keep every row of FAA_final that does not appear in FAA_abnormal
FAA_clean <- anti_join(FAA_final, FAA_abnormal)
The abnormal values in the data set were defined using the guidelines
provided in the data dictionary. The table below shows the number of abnormal
values in each variable. The abnormal observations were moved into a new data
set, FAA_abnormal, and FAA_clean (with 831 observations) was created by
removing the abnormal rows from FAA_final.
Variable Abnormal Values
Aircraft -
Duration 5
No_pasg -
Speed_ground 0
Speed_air 0
Height 10
Pitch 0
Distance 2
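When several range checks are combined into one row filter, two details matter: the OR has to be element-wise so that every row is tested, and rows with missing values need an explicit decision. The sketch below uses base R on made-up rows (toy, flag and abnormal are hypothetical names, and base subsetting is used instead of dplyr purely to keep the example dependency-free). The thresholds are the ones used in Step 6.

```r
# Toy rows (hypothetical values); speed_air is often missing
toy <- data.frame(speed_ground = c(27.7, 80.0, 141.2, 95.0, 95.0),
                  height       = c(30.0, 30.0,  25.0,  5.0, 40.0),
                  speed_air    = c(NA,   NA,   141.7, 101.0, 101.0))

# Element-wise OR: one TRUE / FALSE / NA per row
flag <- toy$speed_ground < 30 | toy$speed_ground > 140 |
        toy$height < 6 |
        toy$speed_air < 30 | toy$speed_air > 140

# NA means no rule fired but a value was missing; treat it as "not abnormal"
abnormal <- !is.na(flag) & flag

toy_abnormal <- toy[abnormal, ]
toy_clean    <- toy[!abnormal, ]
```

Here row 2 has a missing speed_air and no other violation, so its flag is NA and it stays in the clean set. This matches how dplyr's filter() treats NA conditions.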

# Step 7
str(FAA_clean)
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
The abnormal values in the data set FAA_final were removed, and the
consolidated summary statistics of each variable are provided below.
Variable      Type         Missing values %  Min    Max      Mean    Median
Aircraft      Categorical  -                 -      -        -       -
Duration      Numerical    5.8%              41.9   305.6    154.0   154.2
No_pasg       Numerical    -                 29.0   87.0     60.1    60.0
Speed_ground  Numerical    -                 33.5   131.7    79.5    79.8
Speed_air     Numerical    75.5%             90.0   132.9    103.4   101.1
Height        Numerical    -                 6.2    59.9     30.4    30.1
Pitch         Numerical    -                 2.2    5.9      4.0     4.0
Distance      Numerical    -                 41.7   5381.9   1522.4  1262.1
# Step 8 – Histograms of all variables
barplot(table(FAA_clean$aircraft), main = "Number of Aircrafts by type")

hist(FAA_clean$duration, main = "Histogram of Duration")
hist(FAA_clean$no_pasg, main = "Histogram of Number of Passengers")
hist(FAA_clean$speed_ground, main = "Histogram of Speed Ground")
hist(FAA_clean$speed_air, main = "Histogram of Speed Air")
hist(FAA_clean$height, main = "Histogram of Height")
hist(FAA_clean$pitch, main = "Histogram of Pitch")
hist(FAA_clean$distance, main = "Histogram of Distance")

# Step 9 - Summary of Findings
i) There are more Airbus flights than Boeing flights in the data set
ii) The variables Duration, No_pasg and Height appear to be approximately
normally distributed, with means close to their medians
iii) The histograms show that Speed_air and Distance are heavily right-skewed,
which is acceptable for this analysis
iv) Speed_air ranges from 90 mph to about 140 mph. No values below 90 mph are
available in the data set

SECTION 3: INITIAL ANALYSIS FOR IDENTIFYING FACTORS AFFECTING THE RESPONSE
VARIABLE, ‘LANDING DISTANCE’:
# Step 10 - Correlation Table
# Encode the aircraft factor as numeric (levels in alphabetical order) for the
# correlation calculation
FAA_clean$aircraft <- as.numeric(factor(FAA_clean$aircraft))
dist_corr <- vapply(FAA_clean[1:7],
                    function(x) { cor(FAA_clean$distance, x, use = "complete.obs") },
                    FUN.VALUE = numeric(1))
sign_corr <- vapply(dist_corr,
                    function(x) { ifelse(x >= 0, "Positive", "Negative") },
                    FUN.VALUE = character(1))
var_corr <- names(FAA_clean[1:7])
# Correlation table - Table 1
table1 <- data.frame(var_corr, abs(dist_corr), sign_corr)
table1 <- table1[order(-abs(dist_corr)),]
names(table1) <- c("Variable", "Size of the Correlation",
                   "Direction of Correlation")
Table 1 gives the pair-wise correlation between the landing distance and each
variable, X. The table is ranked by the size (absolute value) of the
correlation, so the variables most strongly correlated with Landing Distance
appear first.

# Step 11 – X-Y Scatter plots
pairs(FAA_clean, main = "Pairwise Correlation plots")
The pair-wise scatter plots between the landing distance and each variable, X,
show the strength of the relationship between each predictor and the response
variable. As found in Table 1, there is a strong positive correlation between
Distance and both Speed_ground and Speed_air.
SECTION 3.1: REGRESSION USING A SINGLE FACTOR EACH TIME
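# Step 13 - p value
# From each single-predictor fit: slope estimate (element 2 of the coefficient
# matrix) and the p-value of the slope (element 8)
reg_coef <- vapply(FAA_clean[1:7],
                   function(x) { summary(lm(FAA_clean$distance ~ x))$coefficients[2] },
                   FUN.VALUE = numeric(1))
reg_eqn <- vapply(FAA_clean[1:7],
                  function(x) { summary(lm(FAA_clean$distance ~ x))$coefficients[8] },
                  FUN.VALUE = numeric(1))
# The direction comes from the sign of the slope; a p-value is never negative
sign_eqn <- vapply(reg_coef,
                   function(x) { ifelse(x >= 0, "Positive", "Negative") },
                   FUN.VALUE = character(1))
# Regression Table - Table 2
table2 <- data.frame(var_corr, abs(reg_eqn), sign_eqn)
table2 <- table2[order(abs(reg_eqn)),]
names(table2) <- c("Variable", "Size of the p-value",
                   "Direction of Regression Coefficient")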
Table 2 gives the significance of the relationship (p-value) between the
landing distance and each variable, X. The table is ranked by increasing
p-value, so the variables most significantly associated with Landing Distance
appear first.
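The p-value of a slope sits in the coefficient matrix returned by summary(lm(...)). A minimal sketch on simulated data (x, y and the simulated relationship are assumptions made for illustration) shows that selecting it by row and column name gives the same number as the positional index 8, while being easier to read:

```r
set.seed(1)
x <- 1:30
y <- 5 + 2 * x + rnorm(30)        # simulated response (hypothetical)

fit        <- lm(y ~ x)
coef_table <- coef(summary(fit))  # 2 x 4 matrix: Estimate, Std. Error, t value, Pr(>|t|)

p_by_name  <- coef_table["x", "Pr(>|t|)"]
p_by_index <- coef_table[8]       # column-major order: row 2, column 4
slope      <- coef_table["x", "Estimate"]
```

The sign of slope gives the direction of the relationship; the p-value itself is always non-negative and carries no direction.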
# Step 14 – Standardized Regression Coefficient
FAA_std <- data.frame(vapply(FAA_clean[1:8],
                             function(x) { (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE) },
                             FUN.VALUE = numeric(nrow(FAA_clean))))
std_eqn <- vapply(FAA_std[1:7],
                  function(x) { summary(lm(FAA_std$distance ~ x))$coefficients[2] },
                  FUN.VALUE = numeric(1))
std_sign_eqn <- vapply(std_eqn,
                       function(x) { ifelse(x >= 0, "Positive", "Negative") },
                       FUN.VALUE = character(1))
# Regression Table - Table 3
table3 <- data.frame(var_corr, abs(std_eqn), std_sign_eqn)
table3 <- table3[order(-abs(std_eqn)),]
names(table3) <- c("Variable", "Size of the Regression Coefficient",
                   "Direction of Regression Coefficient")
Table 3 gives the size of the standardized regression coefficient between the
landing distance and each variable, X. The table is ranked by decreasing
absolute value of the coefficient, so the variables with the largest effect on
Landing Distance appear first.
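For a regression with a single predictor, the standardized slope is equal to the Pearson correlation between the two variables when both are standardized on the same set of rows. The sketch below checks this on simulated, complete data (the variable names and the simulated coefficients are assumptions, not estimates from the FAA data):

```r
set.seed(42)
x <- rnorm(100, mean = 80, sd = 19)               # hypothetical predictor
y <- -1700 + 40 * x + rnorm(100, sd = 400)        # hypothetical response

# z-score helper: subtract the mean and divide by the standard deviation
z <- function(v) (v - mean(v)) / sd(v)

std_slope <- unname(coef(lm(z(y) ~ z(x)))[2])     # standardized slope
r_xy      <- cor(x, y)                            # Pearson correlation
```

With complete data the two values agree up to floating-point error, so a ranking by standardized slope matches a ranking by correlation. Small differences can appear when a predictor has missing values, because the standardization and the fit may then use different rows.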

# Step 15 – Comparison of Tables 1, 2, 3
table0 <- data.frame(var_corr, abs(dist_corr), abs(reg_eqn), abs(std_eqn))
table0 <- table0[order(abs(reg_eqn)),]
names(table0) <- c("Variable", "Size of the Correlation", "Size of the p-value",
                   "Size of the Regression Coefficient")
Table 0 was created by consolidating Tables 1, 2 and 3. It gives the size of
the correlation, the p-value of the association and the standardized
regression coefficient between the landing distance and each variable, X. The
relative importance of the variables follows the order in which they appear in
Table 0.
SECTION 3.2: CHECK COLLINEARITY
# Step 16 – Compare regression models
model1 <- lm(FAA_clean$distance ~ FAA_clean$speed_ground)
model2 <- lm(FAA_clean$distance ~ FAA_clean$speed_air)
model3 <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$speed_air)
               Model 1       Model 2     Model 3
X1             speed_ground  speed_air   speed_ground
X2             -             -           speed_air
R-squared      0.7504        0.8875      0.8883
Adj R-squared  0.7501        0.887       0.8871
p-value        <2e-16***     <2e-16***   0.258, 6.9e-12***
Model MSE      448.1         276.3       276.1
N considered   831           203         203
cor.test(FAA_clean$speed_ground, FAA_clean$speed_air)
##
## Pearson's product-moment correlation
##
## data: FAA_clean$speed_ground and FAA_clean$speed_air
## t = 90.453, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9841163 0.9908449
## sample estimates:
## cor
## 0.9879383
From the table above, a linear model built on speed_ground has an R-squared of
0.75, while one built on speed_air gives 0.8875, so speed_air appears to be
the better predictor of Landing Distance. However, because speed_air has a
large number of missing values, only 203 observations were used for Models 2
and 3. Note also that in Model 3, where both variables are included,
speed_ground becomes insignificant (p = 0.258). As speed_ground and speed_air
are highly correlated (0.99), only one of them should be kept for the later
models. As speed_ground has fewer missing values, it is the more reliable
predictor of Landing Distance even though its R-squared is lower. Hence,
speed_air was dropped from further analysis.
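One common way to express how severe this collinearity is, not used in the original analysis, is the variance inflation factor (VIF). With only two predictors, the VIF of each is 1 / (1 - r^2), where r is their pairwise correlation. A short worked calculation, taking r from the cor.test output above:

```r
# Correlation between speed_ground and speed_air reported by cor.test above
r <- 0.9879383

# With two predictors, regressing one on the other gives R-squared = r^2,
# so the variance inflation factor is 1 / (1 - r^2)
vif <- 1 / (1 - r^2)
vif
```

The result is above 40, far beyond the rule-of-thumb thresholds of 5 or 10 that are often quoted, which supports keeping only one of the two speed variables.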
SECTION 3.3: VARIABLE SELECTION
# Step 17, 18, 19
model.a <- lm(FAA_clean$distance ~ FAA_clean$speed_ground)
model.b <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft)
model.c <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft +
              FAA_clean$height)
model.d <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft +
              FAA_clean$height + FAA_clean$pitch)
model.e <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft +
              FAA_clean$height + FAA_clean$pitch + FAA_clean$duration)
model.f <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft +
              FAA_clean$height + FAA_clean$pitch + FAA_clean$duration +
              FAA_clean$no_pasg)
rsq <- 0
rsq[1] <- summary(model.a)$r.squared
rsq[2] <- summary(model.b)$r.squared
rsq[3] <- summary(model.c)$r.squared
rsq[4] <- summary(model.d)$r.squared
rsq[5] <- summary(model.e)$r.squared
rsq[6] <- summary(model.f)$r.squared
rsq
[1] 0.7503784 0.8251319 0.8488989 0.8493717 0.8504184 0.8506023
adj.rsq <- 0
adj.rsq[1] <- summary(model.a)$adj.r.squared
adj.rsq[2] <- summary(model.b)$adj.r.squared
adj.rsq[3] <- summary(model.c)$adj.r.squared
adj.rsq[4] <- summary(model.d)$adj.r.squared
adj.rsq[5] <- summary(model.e)$adj.r.squared
adj.rsq[6] <- summary(model.f)$adj.r.squared
adj.rsq
[1] 0.7500773 0.8247095 0.8483508 0.8486423 0.8494534 0.8494442
aic <- 0
aic[1] <- AIC(model.a)
aic[2] <- AIC(model.b)
aic[3] <- AIC(model.c)
aic[4] <- AIC(model.d)
aic[5] <- AIC(model.e)
aic[6] <- AIC(model.f)
aic
[1] 12508.81 12215.05 12095.65 12095.05 11378.84 11379.88
plot(c(1:6), rsq)
plot(c(1:6), adj.rsq)
plot(c(1:6), aic)

Model    Predictors                                                R-squared  Adj R-squared  AIC       N used
Model 1  Speed_ground                                              0.7504     0.7501         12508.81  831
Model 2  Speed_ground, Aircraft                                    0.8251     0.8247         12215.05  831
Model 3  Speed_ground, Aircraft, Height                            0.8489     0.8484         12095.65  831
Model 4  Speed_ground, Aircraft, Height, Pitch                     0.8494     0.8486         12095.05  831
Model 5  Speed_ground, Aircraft, Height, Pitch, Duration           0.8504     0.8495         11378.84  781
Model 6  Speed_ground, Aircraft, Height, Pitch, Duration, No_pasg  0.8506     0.8494         11379.88  781

# Step 20 – Variables chosen for the predictive model
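The table above shows the R-squared, adjusted R-squared and AIC values of all
6 linear models. Model 5 and Model 6 have the highest adjusted R-squared
values (0.8495 and 0.8494) and the lowest AIC values (11378.84 and 11379.88).
Note that these two models are fitted on 781 rows rather than 831 because
Duration has missing values, so part of the drop in AIC between Model 4 and
Model 5 reflects the smaller sample; AIC values are directly comparable only
between models fitted to the same rows. As Model 5 and Model 6 are otherwise
very close, Model 5 is preferred because it has fewer predictors and is the
simpler model. To predict the Landing Distance, the predictors Speed_ground,
Aircraft, Height, Pitch and Duration can be chosen.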
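AIC depends on the number of observations, so models are best compared after fitting them to the same rows. The sketch below, on simulated data with hypothetical names (toy, cc, m_small, m_large) and made-up coefficients, restricts the data to complete cases first so that both fits use an identical sample:

```r
set.seed(7)
n   <- 200
toy <- data.frame(speed_ground = rnorm(n, 80, 19),
                  height       = rnorm(n, 30, 10),
                  duration     = rnorm(n, 154, 48))
toy$distance <- -2500 + 42 * toy$speed_ground + 14 * toy$height +
                rnorm(n, sd = 400)
toy$duration[sample(n, 12)] <- NA      # introduce missing values in one predictor

# Keep only rows with no missing values so that every model sees the same data
cc <- toy[complete.cases(toy), ]

m_small <- lm(distance ~ speed_ground + height, data = cc)
m_large <- lm(distance ~ speed_ground + height + duration, data = cc)

n_used        <- c(nobs(m_small), nobs(m_large))   # identical by construction
aic_same_rows <- c(AIC(m_small), AIC(m_large))     # now directly comparable
```

Passing the variables through a data argument, rather than as FAA_clean$... terms, also makes it straightforward to refit every candidate model on the same subset.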
SECTION 3.4: VARIABLE SELECTION BASED ON AUTOMATED ALGORITHM
# Step 21 - Forward Step AIC
# stepAIC() is provided by the MASS package
fit1 <- lm(data = FAA_clean, distance ~ 1)
fit2 <- lm(data = FAA_clean, distance ~ speed_ground + aircraft + height +
           pitch + duration + no_pasg)
stepAIC(fit1, scope = list(upper = fit2, lower = fit1), direction = "forward")
## Start: AIC=11299.8
## distance ~ 1
## Warning in add1.lm(object, scope = scope, scale = scale): using the 781/831 rows from a combined fit
## Df Sum of Sq RSS AIC
## + speed_ground 1 480561690 157699570 10104
## + aircraft 1 33759132 604502127 11220
## + height 1 6866417 631394842 11256
## + pitch 1 3010731 635250529 11262
## + duration 1 1685114 636576145 11263
## <none> 638261260 11263
## + no_pasg 1 181284 638079976 11265
##
## Step: AIC=10148.53
## distance ~ speed_ground
## + aircraft 1 47102191 110597379 9810.8
## + height 1 14123617 143575953 10027.6
## + pitch 1 8246571 149453000 10061.0
## <none> 157699570 10103.6
## + no_pasg 1 154554 157545016 10104.8
## + duration 1 50570 157649000 10105.4
##

## Step: AIC=9854.77
## distance ~ speed_ground + aircraft
## + height 1 15048298 95549081 9691.2
## <none> 110597379 9810.8
## + pitch 1 182007 110415372 9811.4
## + no_pasg 1 41575 110555804 9812.5
## + duration 1 9394 110587985 9812.7
##
## Step: AIC=9735.37
## distance ~ speed_ground + aircraft + height
## <none> 95549081 9691.2
## + no_pasg 1 120379 95428702 9692.2
## + pitch 1 71174 95477907 9692.6
## + duration 1 4446 95544635 9693.2
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height, data = FAA_clean)
##
## Coefficients:
## (Intercept) speed_ground aircraft height
## -3008.29 42.40 496.05 14.15
Using the automated stepAIC function, forward selection was run with the
intercept-only model as the lower limit and the model with all predictors as
the upper limit. The stepAIC function selects the best variables between these
two limits. The final model, distance ~ speed_ground + aircraft + height, is
selected as the best model with AIC = 9735.37. The selection takes 4 steps,
starting from the model with no predictors.
Comparing the result with Step 19, we end up with a different model. The model
selected by stepAIC is simpler and reports a lower AIC (9735.37), although
stepAIC computes AIC up to an additive constant, so its values are not on the
same scale as those returned by AIC() in Step 19.
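Forward selection can be sketched on simulated data with step() from base R's stats package, which takes the same scope and direction arguments as MASS::stepAIC. Everything here is an assumption made for illustration: the toy data, the simulated coefficients and the object names. Using complete data avoids the "combined fit" warning seen above, which arises when a candidate predictor has missing values.

```r
set.seed(11)
n   <- 150
toy <- data.frame(speed_ground = rnorm(n, 80, 19),
                  height       = rnorm(n, 30, 10),
                  no_pasg      = round(rnorm(n, 60, 7)))
# Hypothetical truth: distance depends on speed_ground and height only
toy$distance <- -2500 + 42 * toy$speed_ground + 14 * toy$height +
                rnorm(n, sd = 300)

# Start from the intercept-only model and let step() add predictors one at a
# time for as long as the AIC keeps decreasing
null_fit <- lm(distance ~ 1, data = toy)
selected <- step(null_fit,
                 scope = list(lower = ~ 1,
                              upper = ~ speed_ground + height + no_pasg),
                 direction = "forward", trace = 0)

chosen <- attr(terms(selected), "term.labels")   # predictors in the final model
chosen
```

Setting trace = 0 suppresses the step-by-step printout; leaving it at the default prints a table per step similar to the output shown above.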
