FLIGHT LANDING PROJECT
STATISTICAL MODELING, MS BANA
20th January, 2018
Preethi Jayaram Jayaraman
BANA, Class of 2017
M12420360
OBJECTIVE OF THE STUDY:
The motivation of the study is to reduce the risk of landing overrun of
commercial flights. Landing data for 950 commercial flights (Airbus and
Boeing) are available, including the variables Aircraft, Duration, Number of
Passengers, Ground Speed, Air Speed, Height, Pitch and Distance. This study
evaluates which factors affect the landing distance of a commercial flight and
the magnitude of each effect. The results can further be used to make landing
decisions based on the risk of landing overrun.
SECTION 1: INITIAL EXPLORATION OF DATA
# Step 1 – Reading files into R
library(readxl)  # read_excel()
library(dplyr)   # bind_rows(), filter(), anti_join(), %>%
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")
# Step 2 – Structure of the Dataset
str(FAA1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
str(FAA2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 7 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Observations and Conclusion:
The data sets FAA1 and FAA2 have 800 and 150 observations respectively. FAA1
has 8 variables, while FAA2 has only 7: FAA1 contains a variable, Duration,
that FAA2 lacks. This difference may stem from the data collection methods
used for the two files.
FAA1 and FAA2 have 7 and 6 numerical variables respectively, and each has 1
categorical variable, the aircraft make. Both data sets are imported as data
frames, which simplifies the analysis.
# Step 3 – Data Merging and Checking Duplicates
FAA_final <- bind_rows(FAA1, FAA2)
sum(duplicated(FAA_final[,-2]))
## [1] 100
FAA_final <- FAA_final[!duplicated(FAA_final[,-2]),]
Observations and Conclusion:
The data sets FAA1 and FAA2 were merged into one data set, FAA_final. When
checking for duplicates, the Duration variable was excluded because only FAA1
has it. After merging, 100 duplicates were found. Because duplicated rows
would skew the analysis, they were removed and the result was saved back into
FAA_final.
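The merge-and-deduplicate idea can be illustrated on toy data. The sketch below uses base R only; the data frames `a` and `b` and their values are hypothetical stand-ins for FAA1 and FAA2, and the point is only to show how a column that exists in one source can be left out of the duplicate check.

```r
# Toy stand-ins for FAA1 (has duration) and FAA2 (no duration); values are made up
a <- data.frame(aircraft = c("boeing", "airbus"),
                duration = c(98.5, 125.7),
                distance = c(3370, 2988))
b <- data.frame(aircraft = c("boeing", "airbus"),
                distance = c(3370, 1145))

# Add the missing column as NA so the two frames can be stacked with rbind()
b$duration <- NA_real_
merged <- rbind(a, b[names(a)])

# Compare rows on every column except duration, which only one source provides
key_cols <- setdiff(names(merged), "duration")
is_dup <- duplicated(merged[key_cols])
deduped <- merged[!is_dup, ]

print(sum(is_dup))  # number of duplicated rows found
print(deduped)
```

Because `duplicated()` flags the later copy, the row that carries a real duration value is the one that survives when the first source is stacked first.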
# Step 4 - Structure of the Dataset
str(FAA_final)
## Classes 'tbl_df', 'tbl' and 'data.frame': 850 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Observations and Conclusion:
The data set FAA_final has 850 unique observations and 8 variables (7
numerical and 1 categorical). The summary command provides the summary
statistics of each variable; they are consolidated in the table below.
Variable      Type         Missing values %  Min    Max      Mean    Median
Aircraft      Categorical  -                 -      -        -       -
Duration      Numerical    5.8%              14.7   305.6    154.0   153.9
No_pasg       Numerical    -                 29.0   87.0     60.1    60.0
Speed_ground  Numerical    -                 27.7   141.2    79.4    79.6
Speed_air     Numerical    75.5%             90.0   141.7    103.8   101.1
Height        Numerical    -                 -3.5   59.9     30.1    30.01
Pitch         Numerical    -                 2.2    5.9      4.0     4.0
Distance      Numerical    -                 34.1   6533.05  1526.0  1258.1
# Step 5 – Summary of Findings
Observations and Conclusion:
i) There are 850 observations in the merged data set
ii) There are 8 variables: Landing Distance is the response variable and the
other 7 variables are the predictors
iii) There are two makes of aircraft in the data set, Boeing and Airbus
iv) There were 100 duplicates in the data set, which were removed
v) About 6% of the values of Duration and about 75% of the values of
Speed_air are missing
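The missing-value percentages reported above can be computed in one line. The following is a minimal sketch on a hypothetical data frame `flights` (the values are illustrative, not taken from the FAA files).

```r
# Hypothetical data frame with missing values; numbers are illustrative only
flights <- data.frame(duration  = c(98.5, NA, 112.0, 196.8),
                      speed_air = c(109, NA, NA, NA),
                      height    = c(27.4, 27.8, 18.6, 30.7))

# is.na() returns a logical matrix; its column means are the missing fractions
missing_pct <- colMeans(is.na(flights)) * 100
print(round(missing_pct, 1))
```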
SECTION 2: DATA CLEANING AND FURTHER EXPLORATION:
# Step 6 – Abnormal values
# Use the vectorized `|` inside filter(); the scalar `||` does not compare row by row
FAA_final %>% filter(duration < 40) %>% nrow()
FAA_final %>% filter(speed_ground < 30 | speed_ground > 140) %>% nrow()
FAA_final %>% filter(speed_air < 30 | speed_air > 140) %>% nrow()
FAA_final %>% filter(height < 6) %>% nrow()
FAA_final %>% filter(pitch < 0) %>% nrow()
FAA_final %>% filter(distance > 6000) %>% nrow()
FAA_abnormal <- FAA_final %>%
  filter(duration < 40 | no_pasg < 0 | (speed_ground < 30 | speed_ground > 140)
         | (speed_air < 30 | speed_air > 140) | height < 6
         | pitch < 0 | distance < 0 | distance > 6000)
FAA_clean <- anti_join(FAA_final, FAA_abnormal)
Observations and Conclusion:
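Abnormal values were defined using the guidelines in the data dictionary. The
table below shows the number of abnormal values in each variable. The abnormal
observations were collected in a new data set, FAA_abnormal, and FAA_clean
(with 831 observations) was created by removing those rows from FAA_final.
The counts of 0 for Speed_ground and Speed_air are likely understated: the
summary statistics in Section 1 show Speed_ground ranging from 27.7 to 141.2
and Speed_air reaching 141.7, outside the 30 to 140 limits, and counting such
rows requires the vectorized `|` operator rather than `||`.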
Variable Abnormal Values
Aircraft -
Duration 5
No_pasg -
Speed_ground 0
Speed_air 0
Height 10
Pitch 0
Distance 2
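Counting out-of-range values row by row relies on the vectorized `|` operator; recent versions of R reject `||` when an operand has more than one element. The sketch below uses a hypothetical `speed` vector to show the count and how missing values behave in the comparison. Treating a missing value as "not abnormal" is an assumption made here for illustration.

```r
# Hypothetical speeds, including one missing value and two out-of-range values
speed <- c(27.7, 85.8, NA, 141.2, 101.7)

# Vectorized comparison: one TRUE / FALSE / NA per observation
out_of_range <- speed < 30 | speed > 140

# NA comparisons stay NA, so count only the rows that are definitely abnormal
n_abnormal <- sum(out_of_range, na.rm = TRUE)
print(n_abnormal)

# Rows to keep: in range, or missing (missing is not treated as abnormal here)
keep <- is.na(out_of_range) | !out_of_range
print(speed[keep])
```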
# Step 7
str(FAA_clean)
## Classes 'tbl_df', 'tbl' and 'data.frame': 831 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Observations and Conclusion:
The abnormal values in the data set FAA_final were removed, and the
consolidated summary statistics of each variable are provided below.
Variable      Type         Missing values %  Min    Max      Mean    Median
Aircraft      Categorical  -                 -      -        -       -
Duration      Numerical    5.8%              41.9   305.6    154.0   154.2
No_pasg       Numerical    -                 29.0   87.0     60.1    60.0
Speed_ground  Numerical    -                 33.5   131.7    79.5    79.8
Speed_air     Numerical    75.5%             90.0   132.9    103.4   101.1
Height        Numerical    -                 6.2    59.9     30.4    30.1
Pitch         Numerical    -                 2.2    5.9      4.0     4.0
Distance      Numerical    -                 41.7   5381.9   1522.4  1262.1
# Step 8 – Histograms of all variables
barplot(table(FAA_clean$aircraft), main = "Number of Aircrafts by type")
hist(FAA_clean$duration, main = "Histogram of Duration")
hist(FAA_clean$no_pasg, main = "Histogram of Number of Passengers")
hist(FAA_clean$speed_ground, main = "Histogram of Speed Ground")
hist(FAA_clean$speed_air, main = "Histogram of Speed Air")
hist(FAA_clean$height, main = "Histogram of Height")
hist(FAA_clean$pitch, main = "Histogram of Pitch")
hist(FAA_clean$distance, main = "Histogram of Distance")
# Step 9 - Summary of Findings
Observations and Conclusion:
i) There are more Airbus flights than Boeing flights in the data set
ii) Duration, No_pasg and Height seem to be normally distributed, with their
means close to their medians
iii) From the histograms, Speed_air and Distance appear to be heavily
right-skewed, which is acceptable for this analysis
iv) Speed_air ranges from 90 mph to 140 mph; values below 90 mph do not appear
to be available in the data set
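The visual judgement of symmetry versus right skew can be backed by a simple number. The sketch below defines a hypothetical helper, `skewness()`, using the moment-based formula, and applies it to simulated data; neither the helper nor the simulated values come from the original analysis.

```r
# Moment-based sample skewness: positive values indicate a longer right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
symmetric    <- rnorm(500, mean = 60, sd = 7)   # hypothetical, roughly normal
right_skewed <- rexp(500, rate = 1 / 1500)      # hypothetical, long right tail

print(c(mean = mean(symmetric), median = median(symmetric),
        skew = skewness(symmetric)))
print(c(mean = mean(right_skewed), median = median(right_skewed),
        skew = skewness(right_skewed)))
```

A mean well above the median together with a clearly positive skewness is the numerical counterpart of the right-skewed histogram shape.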
SECTION 3: INITIAL ANALYSIS FOR IDENTIFYING FACTORS AFFECTING THE RESPONSE VARIABLE, ‘LANDING DISTANCE’:
# Step 10 - Correlation Table
# Binary code factor into numeric for correlation calculation
FAA_clean$aircraft <- as.numeric(factor(FAA_clean$aircraft))
dist_corr <- vapply(FAA_clean[1:7], function(x) {
  cor(FAA_clean$distance, x, use = "complete.obs")
}, FUN.VALUE = numeric(1))
sign_corr <- vapply(dist_corr, function(x) {
  ifelse(x >= 0, "Positive", "Negative")
}, FUN.VALUE = character(1))
var_corr <- names(FAA_clean[1:7])
# Correlation table - Table 1
table1 <- data.frame(var_corr, abs(dist_corr), sign_corr)
table1 <- table1[order(-abs(dist_corr)),]
names(table1) <- c("Variable", "Size of the Correlation", "Direction of Correlation")
Observations and Conclusion:
Table 1 gives the pairwise correlation between the landing distance and each
variable X. The table is ranked by the size (absolute value) of the
correlation, so the variables appear in decreasing order of the strength of
their correlation with Landing Distance.
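The ranking logic behind Table 1 can be tried on simulated data. In the sketch below the data frame `toy`, its column names and the coefficients used to generate `distance` are all hypothetical; only the ranking procedure mirrors the step above.

```r
set.seed(42)
n <- 200
# Hypothetical predictors and a response that depends mostly on speed
toy <- data.frame(speed  = rnorm(n, 80, 15),
                  height = rnorm(n, 30, 8),
                  pitch  = rnorm(n, 4, 0.5))
toy$distance <- 40 * toy$speed + 10 * toy$height + rnorm(n, sd = 200)

predictors <- setdiff(names(toy), "distance")
r <- unname(vapply(toy[predictors],
                   function(x) cor(toy$distance, x, use = "complete.obs"),
                   FUN.VALUE = numeric(1)))

corr_table <- data.frame(Variable  = predictors,
                         Size      = abs(r),
                         Direction = ifelse(r >= 0, "Positive", "Negative"))
# Rank by absolute correlation, largest first
corr_table <- corr_table[order(-corr_table$Size), ]
print(corr_table)
```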
# Step 11 – X-Y Scatter plots
pairs(FAA_clean, main = "Pairwise Correlation plots")
Observations and Conclusion:
The pairwise scatter plots between the landing distance and each variable X
show the strength of the relationship between each predictor and the response.
As found in Table 1, Distance has a strong positive correlation with both
Speed_ground and Speed_air.
SECTION 3.1: REGRESSION USING A SINGLE FACTOR EACH TIME
# Step 13 - p value
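# Column 4 of the coefficient matrix holds the p-value of the slope, column 1 the slope
reg_eqn <- vapply(FAA_clean[1:7], function(x) {
  summary(lm(FAA_clean$distance ~ x))$coefficients[2, 4] }, FUN.VALUE = numeric(1))
reg_coef <- vapply(FAA_clean[1:7], function(x) {
  coef(lm(FAA_clean$distance ~ x))[[2]] }, FUN.VALUE = numeric(1))
# The direction comes from the sign of the slope; a p-value is never negative
sign_eqn <- vapply(reg_coef, function(x) { ifelse(x >= 0, "Positive", "Negative") },
                   FUN.VALUE = character(1))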
# Regression Table - Table 2
table2 <- data.frame(var_corr, abs(reg_eqn), sign_eqn)
table2 <- table2[order(abs(reg_eqn)),]
names(table2) <- c("Variable", "Size of the p-value", "Direction of Regression Coefficient")
Observations and Conclusion:
Table 2 gives the significance (p-value) of the relationship between the
landing distance and each variable X. The table is ranked by increasing
p-value, so the variables appear in decreasing order of the significance of
their association with Landing Distance.
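The slope and its p-value can also be read from the coefficient matrix by name, which makes it harder to confuse the two. The sketch below uses simulated `x` and `y` (hypothetical data with a negative slope) and shows that linear index 8 of the 2 x 4 matrix is the p-value cell.

```r
set.seed(7)
x <- rnorm(100)
y <- -2 * x + rnorm(100)            # hypothetical data with a negative slope

fit <- lm(y ~ x)
coefs <- summary(fit)$coefficients  # matrix: rows = terms, columns = statistics

slope   <- coefs["x", "Estimate"]
p_value <- coefs["x", "Pr(>|t|)"]
direction <- if (slope >= 0) "Positive" else "Negative"

# Linear index 8 of the 2 x 4 matrix is the same cell as ["x", "Pr(>|t|)"]
print(c(slope = slope, p_value = p_value, index_8 = coefs[8]))
print(direction)
```

The direction is taken from the slope, since a p-value is non-negative by construction and carries no sign information.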
# Step 14 – Standardized Regression Coefficient
FAA_std <- data.frame(vapply(FAA_clean[1:8], function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}, FUN.VALUE = numeric(831)))
std_eqn <- vapply(FAA_std[1:7], function(x) {
  summary(lm(FAA_std$distance ~ x))$coefficients[2]
}, FUN.VALUE = numeric(1))
std_sign_eqn <- vapply(std_eqn, function(x) {
  ifelse(x >= 0, "Positive", "Negative")
}, FUN.VALUE = character(1))
# Regression Table - Table 3
table3 <- data.frame(var_corr, abs(std_eqn), std_sign_eqn)
table3 <- table3[order(-abs(std_eqn)),]
names(table3) <- c("Variable", "Size of the Regression Coefficient", "Direction of Regression Coefficient")
Observations and Conclusion:
Table 3 gives the size of the standardized regression coefficient between the
landing distance and each variable X. The table is ranked by decreasing
absolute value of the coefficient, so the variables appear in decreasing order
of the strength of their linear association with Landing Distance.
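In a regression with a single predictor and no missing values, the standardized slope equals the Pearson correlation, which is why Tables 1 and 3 rank variables similarly. The sketch below checks this on simulated data; `x` and `y` are hypothetical. When a variable has missing values, standardizing on all available values and fitting on complete pairs can make the two numbers differ slightly.

```r
set.seed(3)
x <- rnorm(150, 80, 15)              # hypothetical predictor
y <- 40 * x + rnorm(150, sd = 400)   # hypothetical response

# scale() centers to mean 0 and divides by the standard deviation
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))

std_slope <- unname(coef(lm(zy ~ zx))[2])
print(std_slope)
print(cor(x, y))
```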
# Step 15 – Comparison of Tables 1, 2, 3
table0 <- data.frame(var_corr, abs(dist_corr), abs(reg_eqn), abs(std_eqn))
table0 <- table0[order(abs(reg_eqn)),]
names(table0) <- c("Variable", "Size of the Correlation", "Size of the p-value", "Size of the Regression Coefficient")
Observations and Conclusion:
Table 0 consolidates Tables 1, 2 and 3. It gives the size of the correlation,
the p-value of the association and the size of the standardized regression
coefficient between the landing distance and each variable X. The relative
importance of the variables follows the order in which they appear in Table 0.
SECTION 3.2: CHECK COLLINEARITY
# Step 16 – Compare regression models
model1 <- lm(FAA_clean$distance ~ FAA_clean$speed_ground)
model2 <- lm(FAA_clean$distance ~ FAA_clean$speed_air)
model3 <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$speed_air)
                Model 1        Model 2      Model 3
X1              speed_ground   speed_air    speed_ground
X2              -              -            speed_air
R-squared       0.7504         0.8875       0.8883
Adj R-squared   0.7501         0.887        0.8871
p-value         <2e-16***      <2e-16***    0.258, 6.9e-12***
Model MSE       448.1          276.3        276.1
N considered    831            203          203
cor.test(FAA_clean$speed_ground, FAA_clean$speed_air)
##
## Pearson's product-moment correlation
##
## data: FAA_clean$speed_ground and FAA_clean$speed_air
## t = 90.453, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9841163 0.9908449
## sample estimates:
## cor
## 0.9879383
Observations and Conclusion:
From the table above, the linear model built on speed_ground has an R-squared
of 0.75, while the one built on speed_air has 0.8875. Speed_air therefore
appears to be the better predictor of Landing Distance. However, because of
the large number of missing values in speed_air, only 203 observations were
used for that model. Note also that in Model 3, where both variables are
included, speed_ground becomes insignificant (p = 0.258). Since speed_ground
and speed_air are highly correlated (0.988), only one of them should be kept
in subsequent models. As speed_ground has fewer missing values, it is the more
practical predictor of Landing Distance, even though its R-squared is lower.
Hence, speed_air was dropped from further analysis.
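Collinearity of this kind can also be quantified with a variance inflation factor (VIF), a diagnostic that the analysis above does not use and that is shown here only as a complementary check. The sketch below computes it by hand in base R on simulated data; the helper `vif()` and all values are hypothetical.

```r
set.seed(11)
n <- 300
speed_ground <- rnorm(n, 80, 15)                 # hypothetical values
speed_air    <- speed_ground + rnorm(n, sd = 2)  # nearly a copy, so highly correlated
height       <- rnorm(n, 30, 8)

# VIF of a predictor = 1 / (1 - R^2) from regressing it on the other predictors
vif <- function(target, others) {
  r2 <- summary(lm(target ~ as.matrix(others)))$r.squared
  1 / (1 - r2)
}

vif_ground <- vif(speed_ground, data.frame(speed_air, height))
vif_height <- vif(height, data.frame(speed_ground, speed_air))
print(c(speed_ground = vif_ground, height = vif_height))
```

A VIF far above 1 for a predictor means its coefficient is estimated imprecisely when its near-duplicate is in the model, which matches speed_ground losing significance in Model 3.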
SECTION 3.3: VARIABLE SELECTION
# Step 17, 18, 19
model.a <- lm(FAA_clean$distance ~ FAA_clean$speed_ground)
model.b <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft)
model.c <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft +
              FAA_clean$height)
model.d <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft +
              FAA_clean$height + FAA_clean$pitch)
model.e <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft +
              FAA_clean$height + FAA_clean$pitch + FAA_clean$duration)
model.f <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft +
              FAA_clean$height + FAA_clean$pitch + FAA_clean$duration +
              FAA_clean$no_pasg)
rsq <- 0
rsq[1] <- summary(model.a)$r.squared
rsq[2] <- summary(model.b)$r.squared
rsq[3] <- summary(model.c)$r.squared
rsq[4] <- summary(model.d)$r.squared
rsq[5] <- summary(model.e)$r.squared
rsq[6] <- summary(model.f)$r.squared
rsq
## [1] 0.7503784 0.8251319 0.8488989 0.8493717 0.8504184 0.8506023
adj.rsq <- 0
adj.rsq[1] <- summary(model.a)$adj.r.squared
adj.rsq[2] <- summary(model.b)$adj.r.squared
adj.rsq[3] <- summary(model.c)$adj.r.squared
adj.rsq[4] <- summary(model.d)$adj.r.squared
adj.rsq[5] <- summary(model.e)$adj.r.squared
adj.rsq[6] <- summary(model.f)$adj.r.squared
adj.rsq
## [1] 0.7500773 0.8247095 0.8483508 0.8486423 0.8494534 0.8494442
aic <- 0
aic[1] <- AIC(model.a)
aic[2] <- AIC(model.b)
aic[3] <- AIC(model.c)
aic[4] <- AIC(model.d)
aic[5] <- AIC(model.e)
aic[6] <- AIC(model.f)
aic
## [1] 12508.81 12215.05 12095.65 12095.05 11378.84 11379.88
plot(c(1:6), rsq)
plot(c(1:6), adj.rsq)
plot(c(1:6), aic)
               Model 1    Model 2    Model 3    Model 4    Model 5    Model 6
R-squared      0.7504     0.8251     0.8489     0.8494     0.8504     0.8506
Adj R-squared  0.7501     0.8247     0.8484     0.8486     0.8495     0.8494
AIC            12508.81   12215.05   12095.65   12095.05   11378.84   11379.88
N used         831        831        831        831        781        781

Predictors:
Model 1: Speed_ground
Model 2: Speed_ground, Aircraft
Model 3: Speed_ground, Aircraft, Height
Model 4: Speed_ground, Aircraft, Height, Pitch
Model 5: Speed_ground, Aircraft, Height, Pitch, Duration
Model 6: Speed_ground, Aircraft, Height, Pitch, Duration, No_pasg
# Step 20 – Variables chosen for the predictive model
Observations and Conclusion:
The table above shows the R-squared, adjusted R-squared and AIC values of all
6 linear models. Model 5 and Model 6 have the best adjusted R-squared values
(0.8495 and 0.8494) and the lowest AIC values (11378.84 and 11379.88). Because
these values are nearly identical, Model 5 is preferred over Model 6: it has
fewer predictors and is the simpler model. Note, however, that Models 5 and 6
are fitted on 781 observations because Duration has missing values, so their
AIC values are not directly comparable with those of Models 1 to 4, which use
831 observations. To predict Landing Distance, the predictors Speed_ground,
Aircraft, Height, Pitch and Duration can be chosen.
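AIC values are comparable only between models fitted to the same rows. One way to get such a comparison, sketched below on simulated data, is to refit the smaller model on the complete cases used by the larger one. The data frame `toy`, its columns and the pattern of missing values are hypothetical.

```r
set.seed(5)
n <- 120
toy <- data.frame(speed = rnorm(n, 80, 15), duration = rnorm(n, 150, 40))
toy$distance <- 40 * toy$speed + rnorm(n, sd = 300)
toy$duration[1:20] <- NA   # hypothetical missing values

# lm() drops rows with NA in any model variable (default na.action)
m_small_all <- lm(distance ~ speed, data = toy)             # uses 120 rows
m_big       <- lm(distance ~ speed + duration, data = toy)  # uses 100 rows

# Refit the smaller model on the rows the larger model can use
complete   <- toy[complete.cases(toy), ]
m_small_cc <- lm(distance ~ speed, data = complete)

print(c(n_all = nobs(m_small_all), n_big = nobs(m_big), n_cc = nobs(m_small_cc)))
print(c(AIC_small = AIC(m_small_cc), AIC_big = AIC(m_big)))  # same rows, comparable
```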
SECTION 3.4: VARIABLE SELECTION BASED ON AUTOMATED ALGORITHM
# Step 21 - Forward Step AIC
library(MASS)  # stepAIC()
fit1 <- lm(data = FAA_clean, distance ~ 1)
fit2 <- lm(data = FAA_clean, distance ~ speed_ground + aircraft + height + pitch + duration + no_pasg)
stepAIC(fit1, scope = list(upper = fit2, lower = fit1), direction = "forward")
## Start: AIC=11299.8
## distance ~ 1
## Warning in add1.lm(object, scope = scope, scale = scale): using the 781/831 rows from a combined fit
## Df Sum of Sq RSS AIC
## + speed_ground 1 480561690 157699570 10104
## + aircraft 1 33759132 604502127 11220
## + height 1 6866417 631394842 11256
## + pitch 1 3010731 635250529 11262
## + duration 1 1685114 636576145 11263
## <none> 638261260 11263
## + no_pasg 1 181284 638079976 11265
##
## Step: AIC=10148.53
## distance ~ speed_ground
## Warning in add1.lm(object, scope = scope, scale = scale): using the 781/831 rows from a combined fit
## Df Sum of Sq RSS AIC
## + aircraft 1 47102191 110597379 9810.8
## + height 1 14123617 143575953 10027.6
## + pitch 1 8246571 149453000 10061.0
## <none> 157699570 10103.6
## + no_pasg 1 154554 157545016 10104.8
## + duration 1 50570 157649000 10105.4
##
## Step: AIC=9854.77
## distance ~ speed_ground + aircraft
## Warning in add1.lm(object, scope = scope, scale = scale): using the 781/831 rows from a combined fit
## Df Sum of Sq RSS AIC
## + height 1 15048298 95549081 9691.2
## <none> 110597379 9810.8
## + pitch 1 182007 110415372 9811.4
## + no_pasg 1 41575 110555804 9812.5
## + duration 1 9394 110587985 9812.7
##
## Step: AIC=9735.37
## distance ~ speed_ground + aircraft + height
## Warning in add1.lm(object, scope = scope, scale = scale): using the 781/831 rows from a combined fit
## Df Sum of Sq RSS AIC
## <none> 95549081 9691.2
## + no_pasg 1 120379 95428702 9692.2
## + pitch 1 71174 95477907 9692.6
## + duration 1 4446 95544635 9693.2
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height, data = FAA_clean)
##
## Coefficients:
## (Intercept) speed_ground aircraft height
## -3008.29 42.40 496.05 14.15
Observations and Conclusion:
Using the automated stepAIC function, forward selection was run with the
intercept-only model as the lower bound and the model containing all
predictors as the upper bound. stepAIC searches for the best set of variables
between these two limits. The model distance ~ speed_ground + aircraft +
height is selected as the best model, with AIC = 9735.37, and it is reached in
4 steps starting from the intercept-only model.
This differs from the model chosen in Step 20: the model selected by stepAIC
has the same predictors as Model 3 and is simpler than Model 5. Its AIC of
9735.37 is not directly comparable with the values in Step 19, because stepAIC
computes AIC with extractAIC, which omits an additive constant that AIC()
includes.
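The AIC printed during stepwise selection and the value returned by `AIC()` are on different scales for a linear model. The sketch below illustrates the gap on R's built-in `mtcars` data, which is used only because it is always available; the model formula is arbitrary.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data set, for illustration only
n <- nobs(fit)

aic_full <- AIC(fit)            # based on the full Gaussian log-likelihood
aic_step <- extractAIC(fit)[2]  # the quantity that stepwise selection reports

# For lm fits the two differ by a constant that depends only on n
offset <- n * (1 + log(2 * pi)) + 2
print(c(AIC = aic_full, extractAIC = aic_step, difference = aic_full - aic_step))
print(offset)
```

Because the gap is the same for every model fitted to the same rows, either scale ranks models identically; the numbers just cannot be mixed across the two functions.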

More Related Content

Similar to Flights Landing Overrun Project

Power of call symput data
Power of call symput dataPower of call symput data
Power of call symput data
Yash Sharma
 
2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria
Paulo Faria
 
1) Create a contingency table (this is very easy in Statgraphics .docx
1) Create a contingency table (this is very easy in Statgraphics .docx1) Create a contingency table (this is very easy in Statgraphics .docx
1) Create a contingency table (this is very easy in Statgraphics .docx
monicafrancis71118
 

Similar to Flights Landing Overrun Project (20)

Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyr
 
Statistical computing project
Statistical computing projectStatistical computing project
Statistical computing project
 
Flight Data Analysis
Flight Data AnalysisFlight Data Analysis
Flight Data Analysis
 
224-2009
224-2009224-2009
224-2009
 
Power of call symput data
Power of call symput dataPower of call symput data
Power of call symput data
 
How fast ist it really? Benchmarking in practice
How fast ist it really? Benchmarking in practiceHow fast ist it really? Benchmarking in practice
How fast ist it really? Benchmarking in practice
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
FAA Flight Landing Distance Forecasting and Analysis
FAA Flight Landing Distance Forecasting and AnalysisFAA Flight Landing Distance Forecasting and Analysis
FAA Flight Landing Distance Forecasting and Analysis
 
2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency Analysis
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency Analysis
 
Model Selection and Multi-model Inference
Model Selection and Multi-model InferenceModel Selection and Multi-model Inference
Model Selection and Multi-model Inference
 
6
66
6
 
Modeling and Prediction using SAS
Modeling and Prediction using SASModeling and Prediction using SAS
Modeling and Prediction using SAS
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
 
Shell Script Disk Usage Report and E-Mail Current Threshold Status
Shell Script  Disk Usage Report and E-Mail Current Threshold StatusShell Script  Disk Usage Report and E-Mail Current Threshold Status
Shell Script Disk Usage Report and E-Mail Current Threshold Status
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
 
Stata cheatsheet programming
Stata cheatsheet programmingStata cheatsheet programming
Stata cheatsheet programming
 
Flight Landing Distance Study Using SAS
Flight Landing Distance Study Using SASFlight Landing Distance Study Using SAS
Flight Landing Distance Study Using SAS
 
1) Create a contingency table (this is very easy in Statgraphics .docx
1) Create a contingency table (this is very easy in Statgraphics .docx1) Create a contingency table (this is very easy in Statgraphics .docx
1) Create a contingency table (this is very easy in Statgraphics .docx
 

Recently uploaded

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Introduction-to-Cybersecurit57hhfcbbcxxx
Introduction-to-Cybersecurit57hhfcbbcxxxIntroduction-to-Cybersecurit57hhfcbbcxxx
Introduction-to-Cybersecurit57hhfcbbcxxx
zahraomer517
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
MAQIB18
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 

Recently uploaded (20)

Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Introduction-to-Cybersecurit57hhfcbbcxxx
Introduction-to-Cybersecurit57hhfcbbcxxxIntroduction-to-Cybersecurit57hhfcbbcxxx
Introduction-to-Cybersecurit57hhfcbbcxxx
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Uber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportUber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis Report
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 

Flights Landing Overrun Project

  • 1. FLIGHT LANDING PROJECT STATISTICAL MODELING, MS BANA 20th January, 2018 Preethi Jayaram Jayaraman BANA, Class of 2017 M12420360
  • 2. OBJECTIVE OF THE STUDY: The motivation of the study is to reduce the risk of landing overrun of commercial flights. Landing Data of 950 commercial flights (Airbus and Boeing ) are available including variables such as Aircraft, Duration, Number of Pas sengers, Ground Speed, Air Speed, Height, Pitch and Distance. This study eval uates which factors impact the landing distance of a commercial distance and the magnitude of the impact. The study can further be used to make decisions about landing based on the risk of landing overrun. SECTION 1: INITIAL EXPLORATION OF DATA # Step 1 – Reading files into R FAA1 <- read_excel("FAA1.xls") FAA2 <- read_excel("FAA2.xls") # Step 2 – Structure of the Dataset str(FAA1) ## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables: ## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ... ## $ duration : num 98.5 125.7 112 196.8 90.1 ... ## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ... ## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ... ## $ speed_air : num 109 103 NA NA NA ... ## $ height : num 27.4 27.8 18.6 30.7 32.4 ... ## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ... ## $ distance : num 3370 2988 1145 1664 1050 ... str(FAA2) ## Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 7 variables: ## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ... ## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ... ## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ... ## $ speed_air : num 109 103 NA NA NA ... ## $ height : num 27.4 27.8 18.6 30.7 32.4 ... ## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ... ## $ distance : num 3370 2988 1145 1664 1050 ... Observations and Conclusion: The data sets, FAA1 and FAA2, have 800 and 150 observations respectively. FAA 1 has 8 variables, while FAA2 has only 7 variables. FAA1 has a variable, name d Duration, that FAA2 doesn’t have. This could be a difference due to the dat a collection methods employed while FAA1 and FAA2 were collected. 
FAA1 and FAA2, have 7 and 6 numerical variables and 1 categorical variable, a ircraft make. Both data sets are imported as data frames which will help in e asy analysis.
  • 3. # Step 3 – Data Merging and Checking Duplicates FAA_final <- bind_rows(FAA1, FAA2) sum(duplicated(FAA_final[,-2])) ## [1] 100 FAA_final <- FAA_final[!duplicated(FAA_final[,-2]),] Observations and Conclusion: The data sets, FAA1 and FAA2, are merged into one data set, FAA_final. While checking for duplicates, the Duration, variable was excluded as only FAA1 has it. After merging, 100 duplicates in the merged dataset were found. As keepin g duplicates in the dataset will skew the analysis, they were removed and sav ed back into FAA_final. # Step 4 - Structure of the Dataset str(FAA_final) ## Classes 'tbl_df', 'tbl' and 'data.frame': 850 obs. of 8 variables: ## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ... ## $ duration : num 98.5 125.7 112 196.8 90.1 ... ## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ... ## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ... ## $ speed_air : num 109 103 NA NA NA ... ## $ height : num 27.4 27.8 18.6 30.7 32.4 ... ## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ... ## $ distance : num 3370 2988 1145 1664 1050 ... Observations and Conclusion: The data sets, FAA_final, has 850 unique observations and 8 variables (7 nume rical and 1 categorical). The summary command provides the summary statistics of each variable. Below is a consolidated form of the summary statistics of each variable. Variable Type Missing values % Min Max Mean Median Aircraft Categorical - - - - - Duration Numerical 5.8% 14.7 305.6 154.0 153.9 No_pasg Numerical - 29.0 87.0 60.1 60.0 Speed_grou nd Numerical - 27.7 141.2 79.4 79.6 Speed_air Numerical 75.5% 90.0 141.7 103.8 101.1 Height Numerical - -3.5 59.9 30.1 30.01 Pitch Numerical - 2.2 5.9 4.0 4.0 Distance Numerical - 34.1 6533.05 1526.0 1258.1
  • 4. # Step 5 – Summary of Findings Observations and Conclusion: i) There are 850 observations in the merged dataset ii) There are 8 variables, of which the variable, Landing Distance, is the Response variable and the other 7 variables are the predictors iii) There are two makes of Aircrafts, Boeing and Airbus in the Dataset iv) There were 100 duplicates in the data set that were removed v) There are 6% missing values in variable, Duration and 75% missing value s in variable, Speed_air. SECTION 2: DATA CLEANING AND FURTHER EXPLORATION: # Step 6 – Abnormal values FAA_final %>% filter(duration < 40) %>% nrow() FAA_final %>% filter(speed_ground < 30 || speed_ground > 140) %>% nrow() FAA_final %>% filter(speed_air < 30 || speed_air > 140) %>% nrow() FAA_final %>% filter(height < 6) %>% nrow() FAA_final %>% filter(pitch < 0) %>% nrow() FAA_final %>% filter(distance > 6000) %>% nrow() FAA_abnormal <- FAA_final %>% filter(duration < 40 | no_pasg < 0 | (speed_gro und < 30 | speed_ground > 140) | (speed_air < 30 | speed_air > 140) | height < 6 | pitch < 0 | distance < 0 | distance > 6000) FAA_clean <- anti_join(FAA_final, FAA_abnormal) Observations and Conclusion: The abnormal values in the data set were defined based on the guidelines prov ided by the data dictionary. The below table shows the number of abnormal val ues in each variable. The abnormal observations were removed into a new datas et, FAA_abnormal and FAA_clean (with 831 observations) was created after remo ving the abnormal rows. Variable Abnormal Values Aircraft - Duration 5 No_pasg - Speed_ground 0 Speed_air 0 Height 10 Pitch 0 Distance 2
  • 5. # Step 7
str(FAA_clean)
## Classes 'tbl_df', 'tbl' and 'data.frame': 831 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Observations and Conclusion:
The abnormal values in FAA_final were removed, and the consolidated summary statistics of each variable in FAA_clean are given below.
Variable      Type         Missing values %  Min    Max     Mean    Median
Aircraft      Categorical  -                 -      -       -       -
Duration      Numerical    5.8%              41.9   305.6   154.0   154.2
No_pasg       Numerical    -                 29.0   87.0    60.1    60.0
Speed_ground  Numerical    -                 33.5   131.7   79.5    79.8
Speed_air     Numerical    75.5%             90.0   132.9   103.4   101.1
Height        Numerical    -                 6.2    59.9    30.4    30.1
Pitch         Numerical    -                 2.2    5.9     4.0     4.0
Distance      Numerical    -                 41.7   5381.9  1522.4  1262.1
# Step 8 – Histograms of all variables
barplot(table(FAA_clean$aircraft), main = "Number of Aircrafts by type")
  • 6. hist(FAA_clean$duration, main = "Histogram of Duration")
hist(FAA_clean$no_pasg, main = "Histogram of Number of Passengers")
hist(FAA_clean$speed_ground, main = "Histogram of Speed Ground")
hist(FAA_clean$speed_air, main = "Histogram of Speed Air")
hist(FAA_clean$height, main = "Histogram of Height")
hist(FAA_clean$pitch, main = "Histogram of Pitch")
hist(FAA_clean$distance, main = "Histogram of Distance")
  • 7. # Step 9 - Summary of Findings
Observations and Conclusion:
i) There are more Airbus flights than Boeing flights in the data set.
ii) Duration, No_pasg and Height appear to be approximately normally distributed, with means close to their medians.
iii) The histograms show that Speed_air and Distance are heavily right skewed (for Distance, the mean of 1522.4 is well above the median of 1262.1); this is acceptable for the present analysis.
iv) Speed_air ranges from about 90 mph to below 140 mph. Values below 90 mph do not appear to be available in the data set, which is consistent with the variable being missing for 75.5% of the observations.
  • 8. SECTION 3: INITIAL ANALYSIS FOR IDENTIFYING FACTORS AFFECTING THE RESPONSE VARIABLE, ‘LANDING DISTANCE’:
# Step 10 - Correlation Table
# Encode the aircraft factor as numeric (airbus = 1, boeing = 2) for the correlation calculation
FAA_clean$aircraft <- as.numeric(factor(FAA_clean$aircraft))
# Correlation of distance with each of the 7 predictors, using complete pairs only
dist_corr <- vapply(FAA_clean[1:7], function(x) {
  cor(FAA_clean$distance, x, use = "complete.obs") }, FUN.VALUE = numeric(1))
sign_corr <- vapply(dist_corr, function(x) {
  ifelse(x >= 0, "Positive", "Negative") }, FUN.VALUE = character(1))
var_corr <- names(FAA_clean[1:7])
# Correlation table - Table 1
table1 <- data.frame(var_corr, abs(dist_corr), sign_corr)
table1 <- table1[order(-abs(dist_corr)),]
names(table1) <- c("Variable", "Size of the Correlation", "Direction of Correlation")
Observations and Conclusion:
Table 1 gives the pair-wise correlation between the landing distance and each variable, X. The table is ranked by the size (absolute value) of the correlation, so the variables appear in decreasing order of the strength of their correlation with Landing Distance.
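The construction of Table 1 can be tried without the FAA files. The sketch below is an illustration under stated assumptions: it uses R's built-in `mtcars` data with `mpg` as a stand-in response, and the column names of the result are chosen here, not taken from the report.

```r
# Built-in mtcars stands in for FAA_clean; mpg plays the role of distance
response   <- "mpg"
predictors <- setdiff(names(mtcars), response)

# Pearson correlation of the response with each predictor
r <- vapply(mtcars[predictors],
            function(x) cor(mtcars[[response]], x, use = "complete.obs"),
            FUN.VALUE = numeric(1))

# Size and direction are kept in separate columns, as in Table 1
corr_table <- data.frame(Variable  = predictors,
                         Size      = abs(r),
                         Direction = ifelse(r >= 0, "Positive", "Negative"),
                         row.names = NULL)

# Rank by absolute correlation, strongest first
corr_table <- corr_table[order(-corr_table$Size), ]
print(corr_table)
```

Because `ifelse` is vectorized, the direction column can be built in one call rather than with a second `vapply`.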
  • 9. # Step 11 – X-Y Scatter plots
pairs(FAA_clean, main = "Pairwise Correlation plots")
Observations and Conclusion:
The pair-wise scatter plots between the landing distance and each variable, X, show the strength of the relationship between each predictor and the response. As found in Table 1, Distance has a strong positive correlation with both Speed_ground and Speed_air.
SECTION 3.1: REGRESSION USING A SINGLE FACTOR EACH TIME
# Step 13 - p value
# p-value of the slope: row 2, column 4 of the coefficient matrix
reg_eqn <- vapply(FAA_clean[1:7], function(x) {
  summary(lm(FAA_clean$distance ~ x))$coefficients[2, 4] }, FUN.VALUE = numeric(1))
# The direction must come from the slope estimate; a p-value is never negative
reg_slope <- vapply(FAA_clean[1:7], function(x) {
  coef(lm(FAA_clean$distance ~ x))[[2]] }, FUN.VALUE = numeric(1))
sign_eqn <- vapply(reg_slope, function(x) {
  ifelse(x >= 0, "Positive", "Negative") }, FUN.VALUE = character(1))
# Regression Table - Table 2
table2 <- data.frame(var_corr, abs(reg_eqn), sign_eqn)
table2 <- table2[order(abs(reg_eqn)),]
  • 10. names(table2) <- c("Variable", "Size of the p-value", "Direction of Regression Coefficient")
Observations and Conclusion:
Table 2 gives the significance (p-value) of the relationship between the landing distance and each variable, X. The table is ranked by increasing p-value, so the variables appear in decreasing order of the significance of their association with Landing Distance.
# Step 14 – Standardized Regression Coefficient
# Standardize every column to mean 0 and standard deviation 1
FAA_std <- data.frame(vapply(FAA_clean[1:8], function(x) {
  (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE) }, FUN.VALUE = numeric(831)))
# Slope of the standardized response on each standardized predictor
std_eqn <- vapply(FAA_std[1:7], function(x) {
  summary(lm(FAA_std$distance ~ x))$coefficients[2] }, FUN.VALUE = numeric(1))
std_sign_eqn <- vapply(std_eqn, function(x) {
  ifelse(x >= 0, "Positive", "Negative") }, FUN.VALUE = character(1))
# Regression Table - Table 3
table3 <- data.frame(var_corr, abs(std_eqn), std_sign_eqn)
table3 <- table3[order(-abs(std_eqn)),]
names(table3) <- c("Variable", "Size of the Regression Coefficient", "Direction of Regression Coefficient")
Observations and Conclusion:
Table 3 gives the size of the standardized regression coefficient between the landing distance and each variable, X. The table is ranked by decreasing absolute value of the coefficient, so the variables appear in decreasing order of the strength of their association with Landing Distance.
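Tables 1 and 3 are closely related: in a regression with a single predictor and no missing values, the standardized slope equals the Pearson correlation. The sketch below checks this on the built-in `mtcars` data (an assumption made for illustration; `z` is a hypothetical helper, not a function from the report).

```r
# Hypothetical helper: standardize to mean 0 and standard deviation 1
z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Slope of the standardized response on the standardized predictor
std_slope <- unname(coef(lm(z(mpg) ~ z(wt), data = mtcars))[2])

# Pearson correlation between the same two variables
r <- cor(mtcars$mpg, mtcars$wt)

# The two agree up to floating point error
same <- isTRUE(all.equal(std_slope, r))
print(c(std_slope = std_slope, correlation = r))
print(same)
```

When a predictor has missing values, as speed_air does, the two quantities can differ slightly, because the standardization uses all available values of each variable while the regression uses only the rows where both are present.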
  • 11. # Step 15 – Comparison of Tables 1, 2, 3
table0 <- data.frame(var_corr, abs(dist_corr), abs(reg_eqn), abs(std_eqn))
table0 <- table0[order(abs(reg_eqn)),]
names(table0) <- c("Variable", "Size of the Correlation", "Size of the p-value", "Size of the Regression Coefficient")
Observations and Conclusion:
Table 0 consolidates Tables 1, 2 and 3. It gives the size of the correlation, the p-value of the association and the standardized regression coefficient between the landing distance and each variable, X. Table 0 is ordered by increasing p-value, and the order of its rows indicates the relative importance of the variables.
SECTION 3.2: CHECK COLLINEARITY
# Step 16 – Compare regression models
model1 <- lm(FAA_clean$distance ~ FAA_clean$speed_ground)
model2 <- lm(FAA_clean$distance ~ FAA_clean$speed_air)
  • 12. model3 <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$speed_air)
                     Model 1       Model 2    Model 3
X1                   speed_ground  speed_air  speed_ground
X2                   -             -          speed_air
R-squared            0.7504        0.8875     0.8883
Adj R-squared        0.7501        0.887      0.8871
p-value              <2e-16***     <2e-16***  0.258, 6.9e-12***
Residual std. error  448.1         276.3      276.1
N considered         831           203        203
cor.test(FAA_clean$speed_ground, FAA_clean$speed_air)
##
## Pearson's product-moment correlation
##
## data: FAA_clean$speed_ground and FAA_clean$speed_air
## t = 90.453, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9841163 0.9908449
## sample estimates:
## cor
## 0.9879383
Observations and Conclusion:
From the table above, the linear model built on speed_ground has an R-squared of 0.7504, while the one built on speed_air has 0.8875. On the rows where it is available, speed_air is therefore the better predictor of Landing Distance. However, because of the large number of missing values in speed_air, only 203 observations were used for that model. Also note that in Model 3, where both variables are included, speed_ground becomes insignificant (p = 0.258). As speed_ground and speed_air are highly correlated (0.988), only one of them should be used in later models. Because speed_ground has no missing values, it is the more practical predictor of Landing Distance even though its R-squared is lower. Hence, speed_air was dropped from further analysis.
SECTION 3.3: VARIABLE SELECTION
# Step 17, 18, 19
model.a <- lm(FAA_clean$distance ~ FAA_clean$speed_ground)
model.b <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft)
model.c <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft + FAA_clean$height)
model.d <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft + FAA_clean$height + FAA_clean$pitch)
  • 13. model.e <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft + FAA_clean$height + FAA_clean$pitch + FAA_clean$duration)
model.f <- lm(FAA_clean$distance ~ FAA_clean$speed_ground + FAA_clean$aircraft + FAA_clean$height + FAA_clean$pitch + FAA_clean$duration + FAA_clean$no_pasg)
# R-squared of each model
rsq <- 0
rsq[1] <- summary(model.a)$r.squared
rsq[2] <- summary(model.b)$r.squared
rsq[3] <- summary(model.c)$r.squared
rsq[4] <- summary(model.d)$r.squared
rsq[5] <- summary(model.e)$r.squared
rsq[6] <- summary(model.f)$r.squared
rsq
## [1] 0.7503784 0.8251319 0.8488989 0.8493717 0.8504184 0.8506023
# Adjusted R-squared of each model
adj.rsq <- 0
adj.rsq[1] <- summary(model.a)$adj.r.squared
adj.rsq[2] <- summary(model.b)$adj.r.squared
adj.rsq[3] <- summary(model.c)$adj.r.squared
adj.rsq[4] <- summary(model.d)$adj.r.squared
adj.rsq[5] <- summary(model.e)$adj.r.squared
adj.rsq[6] <- summary(model.f)$adj.r.squared
adj.rsq
## [1] 0.7500773 0.8247095 0.8483508 0.8486423 0.8494534 0.8494442
# AIC of each model
aic <- 0
aic[1] <- AIC(model.a)
aic[2] <- AIC(model.b)
aic[3] <- AIC(model.c)
aic[4] <- AIC(model.d)
aic[5] <- AIC(model.e)
aic[6] <- AIC(model.f)
aic
## [1] 12508.81 12215.05 12095.65 12095.05 11378.84 11379.88
plot(c(1:6), rsq)
plot(c(1:6), adj.rsq)
plot(c(1:6), aic)
  • 14.
Model    Predictors                                               R-squared  Adj R-squared  AIC       N used
Model 1  Speed_ground                                             0.7504     0.7501         12508.81  831
Model 2  Speed_ground, Aircraft                                   0.8251     0.8247         12215.05  831
Model 3  Speed_ground, Aircraft, Height                           0.8489     0.8484         12095.65  831
Model 4  Speed_ground, Aircraft, Height, Pitch                    0.8494     0.8486         12095.05  831
Model 5  Speed_ground, Aircraft, Height, Pitch, Duration          0.8504     0.8495         11378.84  781
Model 6  Speed_ground, Aircraft, Height, Pitch, Duration, No_pasg 0.8506     0.8494         11379.88  781
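The "N used" row matters when reading the AIC column, because `lm` silently drops rows with missing values in any variable of the formula, and AIC values are only comparable between models fit to the same observations. The sketch below illustrates this with the built-in `airquality` data (an assumption for illustration; the variable and object names are not from the report): the nested models are refit on the rows that are complete for the largest model before their AICs are compared.

```r
# airquality has missing values in Ozone, much as FAA_clean has in duration
m_small <- lm(Temp ~ Wind, data = airquality)          # uses every row
m_large <- lm(Temp ~ Wind + Ozone, data = airquality)  # drops rows with NA Ozone

n_small <- nobs(m_small)
n_large <- nobs(m_large)
print(c(n_small = n_small, n_large = n_large))  # different sample sizes

# Refit both models on the rows that are complete for the largest model
vars <- c("Temp", "Wind", "Ozone")
cc   <- airquality[complete.cases(airquality[vars]), ]

m_small_cc <- lm(Temp ~ Wind, data = cc)
m_large_cc <- lm(Temp ~ Wind + Ozone, data = cc)

# These AIC values are now based on the same observations
aic_cc <- c(small = AIC(m_small_cc), large = AIC(m_large_cc))
print(aic_cc)
```

Applied to the report's data, the same idea would mean fitting all six models on the 781 rows with a non-missing duration before comparing their AIC values.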
  • 15. # Step 20 – Variables chosen for the predictive model
Observations and Conclusion:
The table above shows the R-squared, adjusted R-squared and AIC values of all 6 linear models. Model 5 and Model 6 have the best adjusted R-squared values (0.8495 and 0.8494) and the lowest AIC values (11378.84 and 11379.88). Note, however, that Models 5 and 6 are fit on 781 observations because rows with missing Duration are dropped, whereas Models 1 to 4 use all 831. AIC values computed on different sets of observations are not directly comparable, so much of the drop in AIC from Model 4 to Model 5 may reflect the smaller sample rather than a better fit. Between Models 5 and 6, whose values are very close, Model 5 is preferred because it has fewer predictors and is the simpler model. On this basis, the predictors Speed_ground, Aircraft, Height, Pitch and Duration can be chosen to predict Landing Distance.
SECTION 3.4: VARIABLE SELECTION BASED ON AUTOMATED ALGORITHM
# Step 21 - Forward Step AIC
fit1 <- lm(data = FAA_clean, distance ~ 1)
fit2 <- lm(data = FAA_clean, distance ~ speed_ground + aircraft + height + pitch + duration + no_pasg)
stepAIC(fit1, scope = list(upper = fit2, lower = fit1), direction = "forward")
## Start: AIC=11299.8
## distance ~ 1
## Warning in add1.lm(object, scope = scope, scale = scale): using the 781/831 rows from a combined fit
##                Df Sum of Sq       RSS   AIC
## + speed_ground  1 480561690 157699570 10104
## + aircraft      1  33759132 604502127 11220
## + height        1   6866417 631394842 11256
## + pitch         1   3010731 635250529 11262
## + duration      1   1685114 636576145 11263
## <none>                      638261260 11263
## + no_pasg       1    181284 638079976 11265
##
## Step: AIC=10148.53
## distance ~ speed_ground
## Warning in add1.lm(object, scope = scope, scale = scale): using the 781/831 rows from a combined fit
##            Df Sum of Sq       RSS     AIC
## + aircraft  1  47102191 110597379  9810.8
## + height    1  14123617 143575953 10027.6
## + pitch     1   8246571 149453000 10061.0
## <none>                  157699570 10103.6
## + no_pasg   1    154554 157545016 10104.8
## + duration  1     50570 157649000 10105.4
##
  • 16. ## Step: AIC=9854.77
## distance ~ speed_ground + aircraft
## Warning in add1.lm(object, scope = scope, scale = scale): using the 781/831 rows from a combined fit
##            Df Sum of Sq       RSS    AIC
## + height    1  15048298  95549081 9691.2
## <none>                  110597379 9810.8
## + pitch     1    182007 110415372 9811.4
## + no_pasg   1     41575 110555804 9812.5
## + duration  1      9394 110587985 9812.7
##
## Step: AIC=9735.37
## distance ~ speed_ground + aircraft + height
## Warning in add1.lm(object, scope = scope, scale = scale): using the 781/831 rows from a combined fit
##            Df Sum of Sq      RSS    AIC
## <none>                  95549081 9691.2
## + no_pasg   1    120379 95428702 9692.2
## + pitch     1     71174 95477907 9692.6
## + duration  1      4446 95544635 9693.2
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height, data = FAA_clean)
##
## Coefficients:
## (Intercept) speed_ground     aircraft       height
##    -3008.29        42.40       496.05        14.15
Observations and Conclusion:
Forward selection was run with the automated stepAIC function, with a model using no predictors as the lower limit and the model using all predictors as the upper limit. The function chooses the best variables to add within these two limits. It reaches the final model in 4 steps, starting from the intercept-only model, and selects distance ~ speed_ground + aircraft + height as the best model, with a reported AIC of 9735.37. This differs from the model chosen in Step 19: stepAIC arrives at a simpler model with three predictors. The AIC values printed by stepAIC are computed by extractAIC, which leaves out an additive constant that AIC() includes, so they are on a different scale from the values in Step 19 and should not be compared with them directly. The warnings also show that the candidate models were evaluated on 781 of the 831 rows because of missing values in duration.
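The "using the 781/831 rows from a combined fit" warning arises because the candidate models do not all use the same rows. One way to avoid it is to restrict the data to complete cases of every candidate variable before calling stepAIC. The sketch below is a minimal illustration on the built-in `airquality` data, assuming the MASS package is installed; the object names are chosen here and are not from the report. It also shows the constant offset between the AIC reported by stepAIC (via `extractAIC`) and the value returned by `AIC()`.

```r
library(MASS)  # provides stepAIC

# Keep only rows that are complete for every candidate variable,
# so that all models in the search are fit to the same observations
vars <- c("Temp", "Wind", "Ozone", "Solar.R")
cc   <- na.omit(airquality[vars])

null_fit <- lm(Temp ~ 1, data = cc)

# Forward selection between the intercept-only model and the full model
sel <- stepAIC(null_fit,
               scope = list(lower = ~ 1, upper = ~ Wind + Ozone + Solar.R),
               direction = "forward",
               trace = FALSE)
print(formula(sel))

# stepAIC ranks models with extractAIC(), which omits a constant that
# AIC() includes; for a linear model the gap is n * (1 + log(2 * pi)) + 2
n      <- nobs(sel)
gap    <- AIC(sel) - extractAIC(sel)[2]
expect <- n * (1 + log(2 * pi)) + 2
print(c(gap = gap, expected = expect))
```

Because the gap depends only on the number of observations, it does not change the ranking of models fit to the same rows, but it does mean the two kinds of AIC cannot be mixed in one comparison.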