1/24/2018 Project - Part 1
Statistical Modeling
Rashmi Subrahmanya (M12383010)
UNIVERSITY OF CINCINNATI
Contents
Tables
Executive Summary
Introduction
  Variable Dictionary
Chapter 1: Initial Data Exploration
  R Code
  R Output
  Observations
  Conclusion
Chapter 2: Data Cleaning and Exploration
  R Code
  R Output
  Observations
  Conclusion
Chapter 3: Data Analysis
  R Code
  R Output
  Observations
  Conclusion
Chapter 4: Regression Analysis
  R Code
  R Output
  Observations
  Conclusion
Chapter 5: Check for collinearity
  R Code
  R Output
  Observation
  Conclusion
Chapter 6: Variable Selection
  R Code
  R Output
  Observations
  Conclusion
Chapter 7: Variable Selection based on an automated algorithm
  R Code
  R Output
  Observation
  Conclusion
Tables
Table 1: Pairwise correlation between distance and each factor
Table 2: Ranking factors based on p-value
Table 3: Ranking factors after standardization of variables
Table 4: Ranking of variables
Table 5: Checking for collinearity
Table 6: Comparing different models
Executive Summary
This project investigates which factors influence the landing distance of commercial flights, with
the aim of minimizing the risk of overrun. The variables used in the project are summarized in the
Introduction. In Chapter 1, two data sets are imported, examined and merged into one data set.
The merged data set contained 100 duplicate rows, which were removed. Missing values were
also observed in the ‘duration’ and ‘speed_air’ columns. Summary statistics of each variable are
provided in that chapter. Chapter 2 checks for abnormal values, as defined by the variable
dictionary. 17 rows were found to contain abnormal values and were removed. A histogram of
each variable is plotted to understand its distribution.
In Chapter 3, a correlation matrix is calculated and scatter plots are used to see which factors
are correlated with landing distance, and with what strength and direction. The aircraft variable is
also recoded to numeric values. The predictor variables are ranked according to strength of
correlation. In Chapter 4, landing distance is regressed on each predictor variable, one at a time,
and the p-values of the resulting linear regression models are noted. The variables are then
standardized and the process is repeated. The ranking of the predictor variables, in terms of
influence on landing distance, is the same in all three approaches: the correlation matrix, and the
regression models before and after standardization.
Chapter 5 checks for collinearity between predictor variables. Speed_air and speed_ground are
found to be highly correlated, so speed_ground is dropped from further analysis. In Chapter 6,
linear regression models are built by adding one variable at a time. The R-squared, adjusted
R-squared and AIC values of the models are plotted against the number of variables. Based on
these, all variables except speed_ground are used to build a predictive model for landing distance.
However, based on the p-values of the models, only speed_air, aircraft and height are significant.
Note also that the model is built using only 195 observations, because of missing values in the
speed_air and duration columns. In the final chapter, the stepAIC function in R is used to perform
forward variable selection. Based on the results, speed_air, height and aircraft are used to build
the predictive model.
Introduction
The goal of the project is to study which factors impact the landing distance of a commercial
flight, and how, in order to reduce the risk of landing overrun. The input data are landing records
from 950 commercial flights.
Variable Dictionary
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal
flight should always be greater than 40min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the
threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing
would be considered as abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of
the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be
considered as abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway.
The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance
between the threshold of the runway and the point where the aircraft can be fully stopped. The
length of the airport runway is typically less than 6000 feet.
Chapter 1: Initial Data Exploration
In this chapter, we import two data sets, FAA1 and FAA2, into R and look at their structure. Then,
we combine both data sets to get a new data set named ‘FAA’. We check for duplicate rows in
the new data set and look at its structure. We also obtain summary statistics of each variable
in the FAA data set.
R Code
#Importing FAA1 and FAA2 excel files
FAA1 <- readxl::read_xls('FAA1.xls', col_names = TRUE)
FAA2 <- readxl::read_xls('FAA2.xls', col_names = TRUE)
#A look at first few rows of FAA1 and FAA2 data set
head(FAA1)
head(FAA2)
#Checking structure of the data sets
str(FAA1)
str(FAA2)
#Merging FAA1 and FAA2 into a single data set - FAA
FAA <- plyr::rbind.fill(FAA1,FAA2)
head(FAA)
#A look at structure of new data set
str(FAA)
#Checking for duplicates
FAA_dup <- FAA[duplicated(FAA$speed_ground), ]
nrow(FAA_dup)
FAA <- FAA[!duplicated(FAA$speed_ground), ]
#Summary of each variable in FAA
summary(FAA)
R Output
#Structure of FAA1
> str(FAA1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 800 obs. of 8 variables:
$ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
$ duration : num 98.5 125.7 112 196.8 90.1 ...
$ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
#Structure of FAA2
> str(FAA2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 7 variables:
$ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
$ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
#Structure of FAA (combined data set)
> str(FAA)
'data.frame': 850 obs. of 8 variables:
$ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
$ duration : num 98.5 125.7 112 196.8 90.1 ...
$ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
#Summary of each variable
Observations
• The FAA1 data set has 800 observations and 8 variables, while the FAA2 data set has 150
observations and 7 variables. The variable ‘duration’ is missing from FAA2. Both FAA1
and FAA2 are data frames. In both data sets, ‘aircraft’ is of character type, while the other
variables are numeric.
• The ‘duration’ column has 50 missing values, while speed_air has 642 missing values. There
are no missing values in the other columns.
• The minimum values of duration and distance are too low, and the minimum height is
negative. These are abnormal values, as defined by the variable dictionary.
• There are 100 duplicate rows in the FAA data set, which are excluded from further analysis.
After removing duplicates, FAA has 850 observations and 8 variables. ‘aircraft’ is of
character type, while the other variables are numeric.
Conclusion
• Duplicate rows are removed from further analysis as they do not provide any meaningful
information and they can affect results.
• Both speed_air and duration columns are retained for now, even though they have missing
values. They can be dropped later, if required.
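The merge, duplicate check and missing-value count described in this chapter can be sketched on a small made-up example. This is only an illustration under assumptions: the data frames `faa1` and `faa2` and their values are hypothetical stand-ins, base R `rbind` is used in place of `plyr::rbind.fill`, and duplicates are flagged on all columns the two sources share rather than on speed_ground alone.

```r
# Toy stand-ins for FAA1 and FAA2 (values are made up for illustration)
faa1 <- data.frame(aircraft = c("boeing", "airbus", "boeing"),
                   duration = c(98.5, 125.7, NA),
                   speed_ground = c(107.9, 101.7, 71.1),
                   distance = c(3370, 2988, 1145))
faa2 <- data.frame(aircraft = c("boeing", "airbus"),
                   speed_ground = c(107.9, 85.8),
                   distance = c(3370, 1664))

# Add the column that faa2 lacks, then stack the two data frames
faa2$duration <- NA_real_
faa <- rbind(faa1, faa2[, names(faa1)])

# Flag duplicates using every column the two sources share
shared_cols <- c("aircraft", "speed_ground", "distance")
n_dup <- sum(duplicated(faa[, shared_cols]))
faa <- faa[!duplicated(faa[, shared_cols]), ]

# Count missing values per column
na_counts <- colSums(is.na(faa))
n_dup
na_counts
```

Comparing on the shared columns avoids treating two different flights as duplicates merely because they have the same ground speed, while still ignoring ‘duration’, which one source does not record.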
Chapter 2: Data Cleaning and Exploration
In this chapter, we check for abnormal values, as defined in the variable dictionary. If there are
any rows with abnormal values, we remove them. We then plot a histogram of each variable to
understand its distribution.
R Code
#Removing abnormal values from data set (rows with missing duration or speed_air are kept)
FAA <- FAA[(FAA$duration > 40 | is.na(FAA$duration)), ]
FAA <- FAA[(FAA$height >= 6), ]
#A speed is normal only if it is both >= 30 and <= 140, so the two bounds are combined with &
FAA <- FAA[((FAA$speed_air >= 30 & FAA$speed_air <= 140) | is.na(FAA$speed_air)), ]
FAA <- FAA[(FAA$speed_ground >= 30 & FAA$speed_ground <= 140), ]
FAA <- FAA[(FAA$distance < 6000), ]
dim(FAA)
#summary of cleaned data set
summary(FAA)
#Plotting histogram of each variable
hist(FAA$duration, breaks = 30, main = 'Histogram of duration variable', xlab = 'Duration')
hist(FAA$no_pasg, breaks = 30, main = 'Histogram of no_pasg', xlab = 'Number of Passengers')
hist(FAA$speed_ground, breaks = 30, main = 'Histogram of speed_ground', xlab = 'Speed Ground')
hist(FAA$speed_air, breaks = 30, main = 'Histogram of speed_air', xlab = 'Speed Air')
hist(FAA$height, breaks = 30, main = 'Histogram of height', xlab = 'Height')
hist(FAA$pitch, breaks = 30, main = 'Histogram of pitch', xlab = 'Pitch')
hist(log(FAA$distance), breaks = 30, main = 'Histogram of log(distance)', xlab = 'Landing Distance')
R Output
#Structure of FAA after removing abnormal values
> str(FAA)
'data.frame': 831 obs. of 8 variables:
$ aircraft : chr "boeing" "boeing" NA NA ...
$ duration : num 98.5 125.7 NA NA NA ...
$ no_pasg : num 53 69 NA NA NA NA NA NA NA NA ...
$ speed_ground: num 108 102 NA NA NA ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 NA NA NA ...
$ pitch : num 4.04 4.12 NA NA NA ...
$ distance : num 3370 2988 NA NA NA ...
#summary of cleaned data set
#Histogram of each variable
Figure 1: Histogram of duration
Figure 2: Histogram of no_pasg
Figure 3: Histogram of speed_ground
Figure 4: Histogram of speed_air
Figure 5: Histogram of height
Figure 6: Histogram of pitch
Figure 7: Histogram of distance
Observations
• There are abnormal values in the data set, as defined by the variable dictionary. There are
5 such values in duration, 3 in speed_ground, 1 in speed_air, 10 in height and 2 in distance
column.
• 17 rows/observations with abnormal values were removed.
• Distribution of speed_air shows that it is right-skewed.
Conclusion
• Final data set has 833 observations and 8 columns.
• The observations with abnormal values are deleted, since the number of such observations
is very low.
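The cleaning rules from the variable dictionary can be sketched as a single logical filter. The toy data below is made up so that each rule is triggered at least once, and the helper `in_range_or_na` is a hypothetical name introduced only for this illustration. The sketch assumes that a missing value should not count as abnormal.

```r
# Toy data; values are made up to trigger the rules in the variable dictionary
faa <- data.frame(duration = c(98.5, 35, NA, 120, 110),
                  speed_ground = c(107.9, 90, 85, 150, 60),
                  speed_air = c(109, NA, NA, 151, NA),
                  height = c(27.4, 30, 5, 25, 32),
                  distance = c(3370, 1500, 1200, 6500, 1050))

# TRUE when x lies in [lo, hi]; a missing value is treated as "not abnormal"
in_range_or_na <- function(x, lo, hi) is.na(x) | (x >= lo & x <= hi)

# A row is kept only if it passes every rule
keep <- (is.na(faa$duration) | faa$duration > 40) &
  in_range_or_na(faa$speed_ground, 30, 140) &
  in_range_or_na(faa$speed_air, 30, 140) &
  faa$height >= 6 &
  faa$distance < 6000

faa_clean <- faa[keep, ]
n_removed <- sum(!keep)
n_removed
```

Building one `keep` vector makes it easy to count the removed rows once, even when a row breaks several rules at the same time.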
Chapter 3: Data Analysis
This chapter comprises the initial data analysis, in which we try to identify the factors that impact
the response variable, landing distance.
R Code
#Recoding aircraft to numeric values: Boeing - 0, Airbus - 1
FAA$aircraft <- ifelse(FAA$aircraft == 'boeing', 0, 1)
#Computing pairwise correlation and storing it so that it can be plotted
FAACor <- round(cor(FAA, use = 'pairwise.complete.obs'), 4)
FAACor
corrplot::corrplot(FAACor, method = "ellipse")
#Scatter plots
par(mfrow = c(4, 2))
plot(FAA$aircraft, FAA$distance)
plot(FAA$duration, FAA$distance)
plot(FAA$no_pasg, FAA$distance)
plot(FAA$speed_ground, FAA$distance)
plot(FAA$speed_air, FAA$distance)
plot(FAA$height, FAA$distance)
plot(FAA$pitch, FAA$distance)
par(mfrow = c(1,1))
R Output
#Correlation Matrix
Figure 8: Correlation plot of variables
Table 1: Pairwise correlation between distance and each factor
Variables      Strength of correlation   Direction
Speed_air      0.9421                    Positive
Speed_ground   0.8608                    Positive
Aircraft       0.2369                    Negative
Height         0.0999                    Positive
Pitch          0.0863                    Positive
Duration       0.0520                    Negative
No_pasg        0.0173                    Negative
#Scatter plots
Figure 9: Scatter plot
Observations
• From the correlation matrix and scatter plots, it is evident that speed_ground and speed_air
are the factors most strongly related to landing distance. Both have a strong positive
correlation with it.
• Aircraft make is also related to landing distance, but the correlation is weak and
negative.
• The other factors are only weakly correlated with landing distance and appear to have little
impact on it.
Conclusion
• Speed_ground, speed_air and aircraft are important factors which impact landing distance.
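The ranking in Table 1 orders predictors by the absolute value of their correlation with the response and reports the sign separately. A minimal sketch of that bookkeeping, on simulated data (the variables `speed`, `height` and `noise_var` and their coefficients are assumptions made for the illustration, not the FAA data):

```r
set.seed(1)
n <- 200
speed <- rnorm(n, mean = 80, sd = 15)
height <- rnorm(n, mean = 30, sd = 10)
noise_var <- rnorm(n)
distance <- 40 * speed + 10 * height + rnorm(n, sd = 100)
toy <- data.frame(distance, speed, height, noise_var)

# Correlation of every predictor with the response, using pairwise complete rows
cors <- cor(toy[, -1], toy$distance, use = "pairwise.complete.obs")[, 1]

# Rank by strength (absolute value) and keep the sign as the direction
ranking <- data.frame(variable = names(cors),
                      strength = round(abs(cors), 4),
                      direction = ifelse(cors < 0, "Negative", "Positive"))
ranking <- ranking[order(-ranking$strength), ]
ranking
```

Sorting on the absolute value matters for a variable such as aircraft, whose correlation is negative but still stronger than that of height or pitch.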
Chapter 4: Regression Analysis
In this chapter, we regress landing distance on each of the factors, one at a time, and observe the
p-value of each model. Then, we standardize each variable using the following formula:
X' = (X - mean(X)) / sd(X)
We regress landing distance on each of the standardized variables and note down the p-values.
Then, we compare the results from the correlation matrix and the regression analysis to see if they
are consistent.
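In a regression with a single predictor, standardizing that predictor rescales the slope by sd(X) but leaves the t-test of the slope unchanged, which is why Tables 2 and 3 report identical p-values. The sketch below checks this on simulated data; the variables `x` and `y` and their parameters are made up for the illustration.

```r
set.seed(42)
x <- rnorm(100, mean = 80, sd = 15)
y <- 40 * x + rnorm(100, sd = 300)

# Standardize x: subtract the mean, divide by the standard deviation
x_std <- (x - mean(x)) / sd(x)

fit_raw <- lm(y ~ x)
fit_std <- lm(y ~ x_std)

# p-value of the slope in each fit (row 2, column "Pr(>|t|)")
p_raw <- summary(fit_raw)$coefficients[2, "Pr(>|t|)"]
p_std <- summary(fit_std)$coefficients[2, "Pr(>|t|)"]

# The slopes differ by a factor of sd(x), but the test of the slope does not change
all.equal(p_raw, p_std)
all.equal(unname(coef(fit_std)[2]), unname(coef(fit_raw)[2]) * sd(x))
```

What standardization does change is the scale of the coefficients: after it, each slope is the change in distance per one standard deviation of the predictor, which makes slopes comparable across predictors.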
R Code
#Regression using single factor each time
model1 <- lm(distance ~ aircraft, data = FAA)
summary(model1)
model2 <- lm(distance ~ duration, data = FAA)
summary(model2)
model3 <- lm(distance ~ no_pasg, data = FAA)
summary(model3)
model4 <- lm(distance ~ speed_ground, data = FAA)
summary(model4)
model5 <- lm(distance ~ speed_air, data = FAA)
summary(model5)
model6 <- lm(distance ~ height, data = FAA)
summary(model6)
model7 <- lm(distance ~ pitch, data = FAA)
summary(model7)
#Standardizing and creating new variables
FAA$aircraft.std <- (FAA$aircraft - mean(FAA$aircraft))/sd(FAA$aircraft)
FAA$duration.std <- (FAA$duration - mean(FAA$duration, na.rm = TRUE))/sd(FAA$duration, na.rm = TRUE)
FAA$no_pasg.std <- (FAA$no_pasg - mean(FAA$no_pasg))/sd(FAA$no_pasg)
FAA$speed_ground.std <- (FAA$speed_ground - mean(FAA$speed_ground))/sd(FAA$speed_ground)
FAA$speed_air.std <- (FAA$speed_air - mean(FAA$speed_air, na.rm = TRUE))/sd(FAA$speed_air, na.rm = TRUE)
FAA$height.std <- (FAA$height - mean(FAA$height))/sd(FAA$height)
FAA$pitch.std <- (FAA$pitch - mean(FAA$pitch))/sd(FAA$pitch)
#Regression using standardized variables
model8 <- lm(distance ~ aircraft.std, data = FAA)
summary(model8)
model9 <- lm(distance ~ duration.std, data = FAA)
summary(model9)
model10 <- lm(distance ~ no_pasg.std, data = FAA)
summary(model10)
model11 <- lm(distance ~ speed_ground.std, data = FAA)
summary(model11)
model12 <- lm(distance ~ speed_air.std, data = FAA)
summary(model12)
model13 <- lm(distance ~ height.std, data = FAA)
summary(model13)
model14 <- lm(distance ~ pitch.std, data = FAA)
summary(model14)
R Output
#p-value of different models
Table 2: Ranking factors based on p-value
Variables      p-value    Direction of regression coefficient
Speed_air      <0.0001    Positive
Speed_ground   <0.0001    Positive
Aircraft       <0.0001    Negative
Height         0.00389    Positive
Pitch          0.0127     Positive
Duration       0.146      Negative
No_pasg        0.618      Negative
#p-value after standardizing the variables
Table 3: Ranking factors after standardization of variables
Variables      p-value    Direction of regression coefficient
Speed_air      <0.0001    Positive
Speed_ground   <0.0001    Positive
Aircraft       <0.0001    Negative
Height         0.00389    Positive
Pitch          0.0127     Positive
Duration       0.146      Negative
No_pasg        0.618      Negative
Observations
• Comparing the results from Tables 1, 2 and 3, we observe that they are consistent. In Table 4
below, the factors are ranked by their relative importance in determining the landing
distance.
Table 4: Ranking of variables
Rank   Variable
1      Speed_air
2      Speed_ground
3      Aircraft
4      Height
5      Pitch
6      Duration
7      No_pasg
• Speed_air, speed_ground, height and pitch are positively correlated with landing distance,
while aircraft is negatively correlated with it.
Conclusion
At a significance level of 0.05, speed_air, speed_ground, aircraft, height and pitch are the most
important factors influencing landing distance.
Chapter 5: Check for collinearity
Speed_air and speed_ground carry largely the same information. In this chapter, we check for
correlation between speed_air and speed_ground. If the two variables are highly correlated, we
retain only one of them.
R Code
#Checking for collinearity between speed_ground and speed_air
model1 <- lm (distance ~ speed_ground, data = FAA)
summary(model1)
model2 <- lm (distance ~ speed_air, data = FAA)
summary(model2)
model3 <- lm (distance ~ speed_ground + speed_air, data = FAA)
summary(model3)
#Correlation between speed_ground and speed_air
cor(FAA$speed_air, FAA$speed_ground, use = "pairwise.complete.obs")
R Output
#Table showing regression coefficients
Table 5: Checking for collinearity
Model Number   Model                           Variable       Regression Coefficient   p-value
1              LD ~ speed_ground               Speed_ground   40.8252                  <0.0001
2              LD ~ speed_air                  Speed_air      79.532                   <0.0001
3              LD ~ speed_ground + speed_air   Speed_ground   -14.37                   0.258
                                               Speed_air      93.96                    <0.0001
Observation
• Models 1 and 2 show that speed_ground and speed_air are each significant factors in
determining landing distance when used alone. In model 3, however, only speed_air is a
significant factor (p-value < 0.0001).
• The regression coefficient of speed_ground changes sign between model 1 and model 3, and
its p-value in model 3 is greater than 0.05. Taken alone, model 3 would suggest that
speed_ground is not a significant factor, which contradicts model 1.
• We can say that collinearity exists, that is, speed_ground and speed_air are correlated with
each other. In fact, their correlation is strong, with a value of 0.9879. It is better to drop
one of them, since including both might result in an unstable model.
• Speed_air can be considered as speed_ground plus wind speed.
Conclusion
Speed_air is retained, even though it has many missing values, for the following reasons:
• It is an important factor, from domain knowledge.
• The scatter plot shows that speed_air has a nearly linear relation with landing distance,
which is not the case for speed_ground. This makes it possible to fit a linear regression
model for speed_air.
• The speed_air column has the observations needed for predicting landing overrun, while for
speed_ground a large portion of the observations is below 90 mph, which is not very
useful for predicting landing overrun.
• It is easier to get values of speed_air.
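One way to put a number on the instability mentioned above is the variance inflation factor (VIF), which is not part of the original analysis and is added here only as an illustration. With exactly two predictors, the R-squared from regressing one on the other equals the squared correlation, so the VIF follows directly from the reported value of 0.9879.

```r
# Correlation between speed_ground and speed_air reported in this chapter
r <- 0.9879

# With exactly two predictors, the R-squared from regressing one on the other is r^2
vif <- 1 / (1 - r^2)
vif
```

The result is about 41.6, meaning the variance of each speed coefficient in model 3 is inflated roughly 41-fold compared with uncorrelated predictors. That is consistent with the sign change and loss of significance seen for speed_ground in Table 5.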
Chapter 6: Variable Selection
We fit models based on the variable ranking in Table 4, adding one variable at a time. We obtain
the R-squared, adjusted R-squared and AIC values for each model.
R Code
#Fitting models by adding one variable at a time, in the order of Table 4
#(speed_ground is excluded, see Chapter 5)
model1 <- lm(distance ~ speed_air, data = FAA)
model2 <- lm(distance ~ speed_air + aircraft, data = FAA)
model3 <- lm(distance ~ speed_air + aircraft + height, data = FAA)
model4 <- lm(distance ~ speed_air + aircraft + height + pitch, data = FAA)
model5 <- lm(distance ~ speed_air + aircraft + height + pitch + duration, data = FAA)
model6 <- lm(distance ~ speed_air + aircraft + height + pitch + duration + no_pasg, data = FAA)
models <- list(model1, model2, model3, model4, model5, model6)
#Plotting R squared values against number of parameters
r.squared <- sapply(models, function(m) summary(m)$r.squared)
plot(1:6, r.squared, type = "b", ylab = "R squared", xlab = "Number of predictors")
#Plotting Adjusted R squared values against number of parameters
r.adj.squared <- sapply(models, function(m) summary(m)$adj.r.squared)
plot(1:6, r.adj.squared, type = "b", ylab = "Adjusted R squared", xlab = "Number of predictors")
#Plotting AIC values against number of parameters
r.AIC <- sapply(models, AIC)
plot(1:6, r.AIC, type = "b", ylab = "AIC", xlab = "Number of predictors")
R Output
Table 6: Comparing different models
Model Number   Model                                                              R-squared value   Adjusted R-squared value   AIC value
1              LD ~ speed_air                                                     0.8875            0.8870                     2862.423
2              LD ~ speed_air + aircraft                                          0.9493            0.9488                     2702.784
3              LD ~ speed_air + aircraft + height                                 0.9737            0.9733                     2571.310
4              LD ~ speed_air + aircraft + height + pitch                         0.9737            0.9732                     2573.300
5              LD ~ speed_air + aircraft + height + pitch + duration              0.9744            0.9737                     2473.168
6              LD ~ speed_air + aircraft + height + pitch + duration + no_pasg    0.9747            0.9739                     2473.010
Figure 10: Plot of R squared values of different models Vs Number of predictors
Figure 11: Plot of Adjusted R Squared of different models Vs Number of predictors
Figure 12: Plot of AIC of different models Vs Number of predictors
Observations
• Model 6 has the highest adjusted R-squared value and the lowest AIC value. When comparing
models, we prefer the one with the higher adjusted R-squared value, that is, the one whose
predictor variables better explain the variation in the dependent variable. We also
prefer the model with the lower AIC value.
• However, if we look at the p-values of the models, only speed_air, aircraft and height are
significant.
• It should also be noted that, in the final data set, speed_air has 630 missing values. While
modeling, only 195 observations are taken into consideration.
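By default, lm() silently drops every row that has a missing value in any variable used by the model, so models with different predictors can end up fitted on different numbers of rows. AIC values are only comparable between models fitted to the same observations. The sketch below shows the effect on simulated data (the variables `x1`, `x2`, `y` and the pattern of missing values are made up) and one way to put the models on the same rows with na.omit().

```r
set.seed(7)
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 * toy$x1 + rnorm(n)
toy$x2[1:40] <- NA   # make one predictor partly missing

small <- lm(y ~ x1, data = toy)
large <- lm(y ~ x1 + x2, data = toy)
c(nobs(small), nobs(large))   # the two fits use different numbers of rows

# Refit both on the rows that are complete for every variable involved
toy_cc <- na.omit(toy)
small_cc <- lm(y ~ x1, data = toy_cc)
large_cc <- lm(y ~ x1 + x2, data = toy_cc)
c(AIC(small_cc), AIC(large_cc))
```

Checking nobs() for each model in Table 6 would show whether the models that include duration were fitted on fewer rows than models 1 to 4. If so, their AIC values should be compared only after refitting all six models on the same complete cases.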
Conclusion
• Based on adjusted R-squared and AIC values, I would choose speed_air, aircraft, height,
pitch, duration and no_pasg to build a predictive model for landing distance.
• Based on the p-values of the models, I would choose speed_air, aircraft and height.
• The final model is as follows:
Distance = -5796.9430 + (81.9833 * speed_air) - (437.8295 * aircraft) + (13.71 * height)
• Among the influential factors, height has the least impact on landing distance.
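The final equation can be turned into a small helper for worked predictions. The coefficients are copied from the model above and aircraft uses the coding from Chapter 3 (Boeing = 0, Airbus = 1). The function name `predict_distance` and the example landing (100 mph air speed, 30 m height) are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the final model in this chapter
predict_distance <- function(speed_air, aircraft, height) {
  -5796.9430 + 81.9833 * speed_air - 437.8295 * aircraft + 13.71 * height
}

# Hypothetical landing: 100 mph air speed at 30 m height, for each aircraft make
d_boeing <- predict_distance(100, aircraft = 0, height = 30)
d_airbus <- predict_distance(100, aircraft = 1, height = 30)

c(boeing = d_boeing, airbus = d_airbus)
d_boeing - d_airbus   # the aircraft coefficient: 437.8295 feet
```

For this landing the equation gives about 2812.7 feet for a Boeing and about 2374.9 feet for an Airbus, both well below the 6000 feet mentioned in the variable dictionary. Because the model was fitted only on rows where speed_air is recorded, predictions for air speeds outside that range are extrapolations.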
Chapter 7: Variable Selection based on an automated algorithm
In this chapter, stepAIC function in R is used to perform forward variable selection. The results so
obtained are compared with results obtained in previous chapter to see if they are consistent.
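Forward selection starts from an intercept-only model and adds one term at a time for as long as the AIC improves, so it needs both a starting model and an upper scope of candidate terms. The sketch below illustrates this on simulated data with base R's step() function, used here in place of MASS::stepAIC so that it runs without extra packages; stepAIC takes the same `scope` and `direction` arguments. The variables `x1`, `x2`, `x3` and `y` are made up for the illustration.

```r
set.seed(123)
n <- 150
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 5 * toy$x1 + 2 * toy$x2 + rnorm(n)

# Forward selection starts from the intercept-only model
# and may add terms up to the formula given in `upper`
null_model <- lm(y ~ 1, data = toy)
fwd <- step(null_model,
            scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
            direction = "forward", trace = 0)

formula(fwd)
names(coef(fwd))
```

Two practical points apply when carrying this over to the FAA data. Every candidate model must be fitted on the same rows, so rows with missing speed_air or duration have to be removed first (for example with na.omit()). A formula such as `distance ~ .` uses every other column in the data frame, including the standardized copies created in Chapter 4 and speed_ground, so the candidate columns should be selected explicitly.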
R Code
library(MASS) #provides the stepAIC function
model <- lm(distance ~ ., data = FAA)
step <- stepAIC(model, direction = "forward")
summary(step)
R Output
#Summary of stepAIC function
Observation
• Using the stepAIC function to perform forward variable selection, I would select three
variables to build a predictive model for landing distance. From the output above, it can be
seen that the p-values for aircraft, speed_air and height are significant.
Conclusion
• The final model after using stepAIC function is as follows:
Distance = -5791.6573 – (437.9428 * aircraft) + (85.5469 * speed_air) + (13.6756 * height)
• Based on results from chapter 6 and 7, I would choose speed_air, aircraft and height in the
final model.

Goman, Khramtsovsky, Shapiro (2001) – Aerodynamics Modeling and Dynamics Simu...
 
CFD Analysis of conceptual Aircraft body
CFD Analysis of conceptual Aircraft bodyCFD Analysis of conceptual Aircraft body
CFD Analysis of conceptual Aircraft body
 
CONTAINER TRAFFIC PROJECTIONS USING AHP MODEL IN SELECTING REGIONAL TRANSHIPM...
CONTAINER TRAFFIC PROJECTIONS USING AHP MODEL IN SELECTING REGIONAL TRANSHIPM...CONTAINER TRAFFIC PROJECTIONS USING AHP MODEL IN SELECTING REGIONAL TRANSHIPM...
CONTAINER TRAFFIC PROJECTIONS USING AHP MODEL IN SELECTING REGIONAL TRANSHIPM...
 
Airline Fleet Assignment And Schedule Design Integrated Models And Algorithms
Airline Fleet Assignment And Schedule Design  Integrated Models And AlgorithmsAirline Fleet Assignment And Schedule Design  Integrated Models And Algorithms
Airline Fleet Assignment And Schedule Design Integrated Models And Algorithms
 
AIAA Design Build & Fly Design Report
AIAA Design Build & Fly Design ReportAIAA Design Build & Fly Design Report
AIAA Design Build & Fly Design Report
 
6 prediccion velocidad cr2c - 99171
6   prediccion velocidad cr2c - 991716   prediccion velocidad cr2c - 99171
6 prediccion velocidad cr2c - 99171
 
FYP
FYPFYP
FYP
 

Recently uploaded

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 

Recently uploaded (20)

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 

Statistical Modeling Project - Part 1 Analysis

Project - Part 1: Statistical Modeling
Rashmi Subrahmanya (M12383010), University of Cincinnati
1/24/2018
Contents

Executive Summary
Introduction
    Variable Dictionary
Chapter 1: Initial Data Exploration
Chapter 2: Data Cleaning and Exploration
Chapter 3: Data Analysis
Chapter 4: Regression Analysis
Chapter 5: Check for collinearity
Chapter 6: Variable Selection
Chapter 7: Variable Selection based on automated algorithm

Each chapter contains the sections R Code, R Output, Observations and Conclusion.

Tables

Table 1: Pairwise correlation between distance and each factor
Table 2: Ranking factors based on p-value
Table 3: Ranking factors after standardization of variables
Table 4: Ranking of variables
Table 5: Checking for collinearity
Table 6: Comparing different models
Executive Summary

This project aims to understand which factors influence the landing distance of flights, in order to minimize the risk of overrun. A summary of the variables used in the project is provided in the Introduction.

In Chapter 1, two data sets are imported, examined, and merged into one. The merged data set contained 100 duplicate rows, which were removed. Missing values were observed in the ‘duration’ and ‘speed_air’ columns. Summary statistics for each variable are provided in this chapter.

Chapter 2 checks for abnormal values as defined by the variable dictionary. 17 rows were found to contain abnormal values and were removed. A histogram of each variable is plotted to understand its distribution.

In Chapter 3, the correlation matrix is calculated and scatter plots are used to see which factors are correlated with landing distance, along with the strength and direction of each correlation. The aircraft variable is also recoded. The predictor variables are ranked by strength of correlation.

In Chapter 4, landing distance is regressed on each predictor variable, one at a time, and the p-values of the resulting linear regression models are noted. The variables are then standardized and the process is repeated. The ranking of the predictor variables by influence on landing distance is the same under all three approaches: the correlation matrix, and the regression models before and after standardization.

Chapter 5 checks for collinearity between predictor variables. Speed_air and speed_ground are found to be highly correlated, and speed_ground is dropped from further analysis.

In Chapter 6, linear regression models are built by adding one variable at a time. The R-squared, adjusted R-squared and AIC values of the models are plotted against the number of variables. Based on this, all variables except speed_ground are used to build a predictive model for landing distance. However, based on the p-values of the model, only speed_air, aircraft and height are significant. Note also that the model is built using only 195 observations, because of missing values in the speed_air and duration columns.

In the final chapter, the stepAIC function in R is used to perform forward variable selection. Based on the results, speed_air, height and aircraft are used to build the predictive model.
Introduction

The goal of this project is to study which factors affect the landing distance of a commercial flight, and how, in order to reduce the risk of landing overrun. The input data consists of landing data from 950 commercial flights.

Variable Dictionary

• Aircraft: The make of an aircraft (Boeing or Airbus).
• Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40 minutes.
• No_pasg: The number of passengers in a flight.
• Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
• Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
• Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
• Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
• Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
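The rules in the variable dictionary can be expressed as a single row-level check. The following is a minimal sketch in base R, not the code used in this project: the helper name `flag_abnormal` and the toy data frame are hypothetical, and it assumes that a missing value does not by itself make a row abnormal and that a distance of 6000 feet or more counts as abnormal.

```r
# Hypothetical helper: TRUE for rows that break at least one rule in the
# variable dictionary. Missing values are assumed NOT to count as abnormal.
flag_abnormal <- function(df) {
  # Turn NA comparisons into FALSE so that missing values never flag a row
  viol <- function(x) !is.na(x) & x
  viol(df$duration <= 40) |
    viol(df$speed_ground < 30 | df$speed_ground > 140) |
    viol(df$speed_air < 30 | df$speed_air > 140) |
    viol(df$height < 6) |
    viol(df$distance >= 6000)
}

# Toy data (made up for illustration, not taken from the FAA files)
toy <- data.frame(
  duration     = c(120, 35, NA, 150),
  speed_ground = c(80, 90, 145, 100),
  speed_air    = c(NA, 95, NA, 104),
  height       = c(30, 25, 20, 4),
  distance     = c(1500, 2000, 5500, 3000)
)

# Row 1 passes; rows 2-4 are flagged (duration, speed_ground, height)
flag_abnormal(toy)
```

Keeping the rules in one function makes it easy to count the flagged rows with `sum(flag_abnormal(toy))` before deciding whether to drop them.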
Chapter 1: Initial Data Exploration

In this chapter, we import the two data sets, FAA1 and FAA2, into R and examine their structure. We then combine them into a new data set named ‘FAA’, check it for duplicate rows, and examine its structure. Finally, we obtain summary statistics for each variable in the FAA data set.

R Code

#Importing FAA1 and FAA2 excel files
FAA1 <- readxl::read_xls('FAA1.xls', col_names = TRUE)
FAA2 <- readxl::read_xls('FAA2.xls', col_names = TRUE)

#A look at first few rows of FAA1 and FAA2 data set
head(FAA1)
head(FAA2)

#Checking structure of the data sets
str(FAA1)
str(FAA2)

#Merging FAA1 and FAA2 into a single data set - FAA
#rbind.fill fills columns missing from one data set (here, duration) with NA
FAA <- plyr::rbind.fill(FAA1, FAA2)
head(FAA)

#A look at structure of new data set
str(FAA)

#Checking for duplicates (rows are matched on their speed_ground value)
FAA_dup <- FAA[duplicated(FAA$speed_ground), ]
nrow(FAA_dup)
FAA <- FAA[!duplicated(FAA$speed_ground), ]

#Summary of each variable in FAA
summary(FAA)

R Output

#Structure of FAA1
> str(FAA1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 800 obs. of 8 variables:
 $ aircraft    : chr "boeing" "boeing" "boeing" "boeing" ...
 $ duration    : num 98.5 125.7 112 196.8 90.1 ...
 $ no_pasg     : num 53 69 61 56 70 55 54 57 61 56 ...
 $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
 $ speed_air   : num 109 103 NA NA NA ...
 $ height      : num 27.4 27.8 18.6 30.7 32.4 ...
 $ pitch       : num 4.04 4.12 4.43 3.88 4.03 ...
 $ distance    : num 3370 2988 1145 1664 1050 ...

#Structure of FAA2
> str(FAA2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 7 variables:
 $ aircraft    : chr "boeing" "boeing" "boeing" "boeing" ...
 $ no_pasg     : num 53 69 61 56 70 55 54 57 61 56 ...
 $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
 $ speed_air   : num 109 103 NA NA NA ...
 $ height      : num 27.4 27.8 18.6 30.7 32.4 ...
 $ pitch       : num 4.04 4.12 4.43 3.88 4.03 ...
 $ distance    : num 3370 2988 1145 1664 1050 ...

#Structure of FAA (combined data set)
> str(FAA)
'data.frame': 850 obs. of 8 variables:
 $ aircraft    : chr "boeing" "boeing" "boeing" "boeing" ...
 $ duration    : num 98.5 125.7 112 196.8 90.1 ...
 $ no_pasg     : num 53 69 61 56 70 55 54 57 61 56 ...
 $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
 $ speed_air   : num 109 103 NA NA NA ...
 $ height      : num 27.4 27.8 18.6 30.7 32.4 ...
 $ pitch       : num 4.04 4.12 4.43 3.88 4.03 ...
 $ distance    : num 3370 2988 1145 1664 1050 ...

#Summary of each variable

Observations

• The FAA1 data set has 800 observations and 8 variables, while the FAA2 data set has 150 observations and 7 variables. The variable ‘duration’ is missing from the FAA2 data set. Both FAA1 and FAA2 are data frames. In both data sets, the variable ‘aircraft’ is of character data type, while the other variables are numeric.
• The ‘duration’ column has 50 missing values, while speed_air has 642 missing values. There are no missing values in the other columns.
• The minimum values of duration and distance are too low, and the minimum height is negative. These are abnormal values, as defined by the variable dictionary.
• There are 100 duplicate rows in the FAA data set, which are excluded from further analysis. After removing duplicates, FAA has 850 observations and 8 variables. ‘aircraft’ is of character data type, while the other variables are numeric.

Conclusion

• Duplicate rows are removed from further analysis because they provide no additional information and can affect the results.
• Both the speed_air and duration columns are retained for now, even though they have missing values. They can be dropped later, if required.

Chapter 2: Data Cleaning and Exploration

In this chapter, we check for abnormal values, as defined in the variable dictionary, and remove any rows that contain them. We then plot a histogram of each variable to understand its distribution.

R Code

#Removing abnormal values from data set (rows with missing values are kept)
FAA <- FAA[(FAA$duration > 40 | is.na(FAA$duration)), ]
FAA <- FAA[(FAA$height >= 6), ]
#A speed is normal only if it lies within both bounds, hence & rather than |
FAA <- FAA[((FAA$speed_air >= 30 & FAA$speed_air <= 140) | is.na(FAA$speed_air)), ]
FAA <- FAA[(FAA$speed_ground >= 30 & FAA$speed_ground <= 140), ]
FAA <- FAA[(FAA$distance < 6000), ]
dim(FAA)

#summary of cleaned data set
summary(FAA)

#Plotting histogram of each variable
hist(FAA$duration, breaks = 30, main = 'Histogram of duration variable', xlab = 'Duration')
hist(FAA$no_pasg, breaks = 30, main = 'Histogram of no_pasg', xlab = 'Number of Passengers')
hist(FAA$speed_ground, breaks = 30, main = 'Histogram of speed_ground', xlab = 'Speed Ground')
hist(FAA$speed_air, breaks = 30, main = 'Histogram of speed_air', xlab = 'Speed Air')
hist(FAA$height, breaks = 30, main = 'Histogram of height', xlab = 'Height')
hist(FAA$pitch, breaks = 30, main = 'Histogram of pitch', xlab = 'Pitch')
hist(log(FAA$distance), breaks = 30, main = 'Histogram of log(distance)', xlab = 'Landing Distance')

R Output

#Structure of FAA after removing abnormal values
> str(FAA)
'data.frame': 831 obs. of 8 variables:
 $ aircraft    : chr "boeing" "boeing" NA NA ...
 $ duration    : num 98.5 125.7 NA NA NA ...
 $ no_pasg     : num 53 69 NA NA NA NA NA NA NA NA ...
 $ speed_ground: num 108 102 NA NA NA ...
 $ speed_air   : num 109 103 NA NA NA ...
 $ height      : num 27.4 27.8 NA NA NA ...
 $ pitch       : num 4.04 4.12 NA NA NA ...
 $ distance    : num 3370 2988 NA NA NA ...

#summary of cleaned data set

#Histogram of each variable
Figure 1: Histogram of duration
Figure 2: Histogram of no_pasg
Figure 3: Histogram of speed_ground
Figure 4: Histogram of speed_air
Figure 5: Histogram of height
Figure 6: Histogram of pitch
Figure 7: Histogram of distance

Observations

• There are abnormal values in the data set, as defined by the variable dictionary: 5 in duration, 3 in speed_ground, 1 in speed_air, 10 in height and 2 in the distance column.
• 17 rows/observations with abnormal values were removed.
• The distribution of speed_air is right-skewed.

Conclusion

• The final data set has 833 observations and 8 columns.
• The observations with abnormal values are deleted, since the number of such observations is very low.
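A range rule such as "between 30 and 140 MPH" holds only when both bounds are satisfied at once, so the two comparisons have to be joined with `&`; joined with `|`, every non-missing value passes. The sketch below illustrates this on a made-up toy vector (the names `speed`, `keep_or` and `keep_and` are illustrative only), keeping missing values as this chapter does.

```r
# Toy speeds: one too low, one normal, one too high, one missing
speed <- c(20, 95, 150, NA)

# Bounds joined with |: each value satisfies at least one comparison,
# so nothing is filtered out
keep_or <- speed >= 30 | speed <= 140 | is.na(speed)

# Bounds joined with &: only values inside [30, 140] (or missing) are kept
keep_and <- (speed >= 30 & speed <= 140) | is.na(speed)

keep_or   # TRUE for all four values
keep_and  # FALSE for 20 and 150, TRUE for 95 and NA

# Number of values that the range rule would remove
sum(!keep_and)
```

Counting `sum(!keep_and)` per column before subsetting is one way to obtain per-variable counts of abnormal values like those listed in the observations.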
Chapter 3: Data Analysis

This chapter comprises the initial data analysis, in which we try to identify the factors that impact the response variable, landing distance.

R Code

library(corrplot)

#Recoding aircraft to numeric values: Boeing - 0, Airbus - 1
FAA$aircraft <- ifelse(FAA$aircraft == 'boeing', 0, 1)

#Computing pairwise correlation and storing it for the correlation plot
FAACor <- round(cor(FAA, use = 'pairwise.complete.obs'), 4)
FAACor
corrplot(FAACor, method = "ellipse")

#Scatter plots of landing distance against each factor
par(mfrow = c(4, 2))
plot(FAA$aircraft, FAA$distance)
plot(FAA$duration, FAA$distance)
plot(FAA$no_pasg, FAA$distance)
plot(FAA$speed_ground, FAA$distance)
plot(FAA$speed_air, FAA$distance)
plot(FAA$height, FAA$distance)
plot(FAA$pitch, FAA$distance)
par(mfrow = c(1, 1))

R Output

#Correlation Matrix
Figure 8: Correlation plot of variables

Table 1: Pairwise correlation between distance and each factor

Variable        Strength of correlation    Direction
Speed_air       0.9421                     Positive
Speed_ground    0.8608                     Positive
Aircraft        0.2369                     Negative
Height          0.0999                     Positive
Pitch           0.0863                     Positive
Duration        0.0520                     Negative
No_pasg         0.0173                     Negative

#Scatter plots
Figure 9: Scatter plot

Observations

• From the correlation matrix and scatter plots, it is evident that speed_ground and speed_air are the most important factors affecting landing distance; both have a strong positive correlation with it.
• Aircraft make also affects landing distance, but the correlation is weak and negative.
• The other factors are only weakly correlated with landing distance and have little impact on it.

Conclusion

• Speed_ground, speed_air and aircraft are the important factors that impact landing distance.
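A ranking of the kind shown in Table 1 can be derived directly from the correlation matrix. The following is a minimal sketch on simulated data, not the FAA data: the data frame `toy`, its coefficients and the object name `ranking` are assumptions made purely for illustration.

```r
# Simulated toy data (hypothetical; not the FAA data set)
set.seed(1)
n <- 200
toy <- data.frame(
  speed_air = rnorm(n, mean = 100, sd = 10),
  height    = rnorm(n, mean = 30, sd = 10),
  no_pasg   = rnorm(n, mean = 60, sd = 7)
)
toy$distance <- 40 * toy$speed_air + 10 * toy$height + rnorm(n, sd = 200)

# Correlation of every variable with distance, dropping distance itself
r <- cor(toy, use = "pairwise.complete.obs")[, "distance"]
r <- r[names(r) != "distance"]

# Strength is the absolute correlation; direction is its sign
ranking <- data.frame(
  variable  = names(r),
  strength  = unname(round(abs(r), 4)),
  direction = unname(ifelse(r >= 0, "Positive", "Negative"))
)

# Strongest correlation first
ranking <- ranking[order(-ranking$strength), ]
ranking
```

Sorting on the absolute value keeps a strong negative correlation, such as the one for aircraft make, from being ranked below weak positive ones.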
Chapter 4: Regression Analysis

In this chapter, we regress landing distance on each factor separately and record the p-value of each model. We then standardize each variable using the formula

    X' = (X - mean(X)) / sd(X)

regress landing distance on each standardized variable, and again record the p-values. Finally, we compare the results of the correlation matrix with those of the regression analysis to check that they are consistent.

R Code

    # Regression using a single factor each time
    model1 <- lm(distance ~ aircraft, data = FAA)
    summary(model1)
    model2 <- lm(distance ~ duration, data = FAA)
    summary(model2)
    model3 <- lm(distance ~ no_pasg, data = FAA)
    summary(model3)
    model4 <- lm(distance ~ speed_ground, data = FAA)
    summary(model4)
    model5 <- lm(distance ~ speed_air, data = FAA)
    summary(model5)
    model6 <- lm(distance ~ height, data = FAA)
    summary(model6)
    model7 <- lm(distance ~ pitch, data = FAA)
    summary(model7)

    # Standardizing and creating new variables
    # (duration and speed_air contain missing values, hence na.rm = TRUE)
    FAA$aircraft.std <- (FAA$aircraft - mean(FAA$aircraft)) / sd(FAA$aircraft)
    FAA$duration.std <- (FAA$duration - mean(FAA$duration, na.rm = TRUE)) / sd(FAA$duration, na.rm = TRUE)
    FAA$no_pasg.std <- (FAA$no_pasg - mean(FAA$no_pasg)) / sd(FAA$no_pasg)
    FAA$speed_ground.std <- (FAA$speed_ground - mean(FAA$speed_ground)) / sd(FAA$speed_ground)
    FAA$speed_air.std <- (FAA$speed_air - mean(FAA$speed_air, na.rm = TRUE)) / sd(FAA$speed_air, na.rm = TRUE)
    FAA$height.std <- (FAA$height - mean(FAA$height)) / sd(FAA$height)
    FAA$pitch.std <- (FAA$pitch - mean(FAA$pitch)) / sd(FAA$pitch)

    # Regression using standardized variables
    model8 <- lm(distance ~ aircraft.std, data = FAA)
    summary(model8)
    model9 <- lm(distance ~ duration.std, data = FAA)
    summary(model9)
    model10 <- lm(distance ~ no_pasg.std, data = FAA)
    summary(model10)
    model11 <- lm(distance ~ speed_ground.std, data = FAA)
    summary(model11)
    model12 <- lm(distance ~ speed_air.std, data = FAA)
    summary(model12)
    model13 <- lm(distance ~ height.std, data = FAA)
    summary(model13)
    model14 <- lm(distance ~ pitch.std, data = FAA)
    summary(model14)

R Output

#p-value of different models

Table 2: Ranking factors based on p-value

Variable       p-value    Direction of regression coefficient
Speed_air      <0.0001    Positive
Speed_ground   <0.0001    Positive
Aircraft       <0.0001    Negative
Height         0.00389    Positive
Pitch          0.0127     Positive
Duration       0.146      Negative
No_pasg        0.618      Negative

#p-value after standardizing the variables

Table 3: Ranking factors after standardization of variables

Variable       p-value    Direction of regression coefficient
Speed_air      <0.0001    Positive
Speed_ground   <0.0001    Positive
Aircraft       <0.0001    Negative
Height         0.00389    Positive
Pitch          0.0127     Positive
Duration       0.146      Negative
No_pasg        0.618      Negative

Observations
• Comparing Tables 1, 2 and 3, we observe that the results are consistent. Tables 2 and 3 are identical, as expected: standardizing a predictor is a linear rescaling, which changes the size of the slope but not its sign, t-statistic or p-value. In Table 4 below, the factors are ranked by their relative importance in determining landing distance.
Table 4: Ranking of variables

Rank   Variable
1      Speed_air
2      Speed_ground
3      Aircraft
4      Height
5      Pitch
6      Duration
7      No_pasg

• Speed_air, speed_ground, height and pitch are positively correlated with landing distance, while aircraft is negatively correlated with it.

Conclusion
At a significance level of 0.05, speed_air, speed_ground, aircraft, height and pitch are the most important factors influencing landing distance.

Chapter 5: Check for collinearity

Speed_air and speed_ground provide much the same information. In this chapter, we check the correlation between speed_air and speed_ground. If the two variables are highly correlated, we retain only one of them.

R Code

    # Checking for collinearity between speed_ground and speed_air
    model1 <- lm(distance ~ speed_ground, data = FAA)
    summary(model1)
    model2 <- lm(distance ~ speed_air, data = FAA)
    summary(model2)
    model3 <- lm(distance ~ speed_ground + speed_air, data = FAA)
    summary(model3)

    # Correlation between speed_ground and speed_air
    cor(FAA$speed_air, FAA$speed_ground, use = "pairwise.complete.obs")

R Output

#Table showing regression coefficients
Table 5: Checking for collinearity

Model   Formula                          Variable       Regression coefficient   p-value
1       LD ~ speed_ground                Speed_ground   40.8252                  <0.0001
2       LD ~ speed_air                   Speed_air      79.532                   <0.0001
3       LD ~ speed_ground + speed_air    Speed_ground   -14.37                   0.258
                                         Speed_air      93.96                    <0.0001

Observation
• Models 1 and 2 show that speed_ground and speed_air are each significant in determining landing distance when used alone. In model 3, however, only speed_air is significant (p-value < 0.0001).
• In model 3, the regression coefficient of speed_ground changes sign and its p-value rises above 0.05. Taken at face value, this would suggest that speed_ground is not a significant factor, which contradicts model 1.
• We conclude that collinearity exists: speed_ground and speed_air are strongly correlated with each other, with a correlation of 0.9879. It is better to drop one of them, since including both can make the model unstable.
• Speed_air can be considered as speed_ground plus wind speed.

Conclusion
Speed_air is retained, even though it has many missing values, for the following reasons:
• It is an important factor, from domain knowledge.
• The scatter plot shows that speed_air has a nearly linear relation with landing distance, which is not the case for speed_ground. This makes it possible to fit a linear regression model on speed_air.
• The speed_air column contains the observations needed to predict landing overrun, whereas a large portion of the speed_ground observations are below 90 mph and are not very useful for that purpose.
• It is easier to get values of speed_air.
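One common way to quantify the collinearity discussed in this chapter is the variance inflation factor (VIF), which the report itself does not compute. The sketch below is an illustration only: it uses the reported correlation of 0.9879 and the two-predictor identity VIF = 1 / (1 - r^2), then shows the effect on a slope's standard error using synthetic variables `x1`, `x2` and `y`, which are assumptions made for the example and not the FAA data.

```r
# Correlation between speed_air and speed_ground reported in this chapter.
r <- 0.9879

# With exactly two predictors, each has VIF = 1 / (1 - r^2).
vif <- 1 / (1 - r^2)
print(round(vif, 1))

# Synthetic illustration (not the FAA data): two predictors with
# population correlation r, where only x2 drives the response.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)
y  <- 2 * x2 + rnorm(n)

# Standard error of the x2 slope, alone and alongside its near-duplicate.
se_alone <- coef(summary(lm(y ~ x2)))["x2", "Std. Error"]
se_joint <- coef(summary(lm(y ~ x1 + x2)))["x2", "Std. Error"]
print(c(alone = se_alone, joint = se_joint))
```

The larger standard error in the joint fit is what makes coefficients unstable enough to change sign, as seen for speed_ground in model 3.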
Chapter 6: Variable Selection

We fit models based on the variable ranking in Table 4 (excluding speed_ground, which was dropped in Chapter 5), adding one variable at a time. We obtain the R-squared, adjusted R-squared and AIC values for each model.

R Code

    # Nested models as listed in Table 6 (these overwrite the models of earlier chapters)
    model1 <- lm(distance ~ speed_air, data = FAA)
    model2 <- lm(distance ~ speed_air + aircraft, data = FAA)
    model3 <- lm(distance ~ speed_air + aircraft + height, data = FAA)
    model4 <- lm(distance ~ speed_air + aircraft + height + pitch, data = FAA)
    model5 <- lm(distance ~ speed_air + aircraft + height + pitch + duration, data = FAA)
    model6 <- lm(distance ~ speed_air + aircraft + height + pitch + duration + no_pasg, data = FAA)

    # Plotting R squared values against number of predictors
    r.squared.1 <- summary(model1)$r.squared
    r.squared.2 <- summary(model2)$r.squared
    r.squared.3 <- summary(model3)$r.squared
    r.squared.4 <- summary(model4)$r.squared
    r.squared.5 <- summary(model5)$r.squared
    r.squared.6 <- summary(model6)$r.squared
    plot(1:6, c(r.squared.1, r.squared.2, r.squared.3, r.squared.4, r.squared.5, r.squared.6),
         type = "b", ylab = "R squared", xlab = "Number of predictors")

    # Plotting adjusted R squared values against number of predictors
    r.adj.squared.1 <- summary(model1)$adj.r.squared
    r.adj.squared.2 <- summary(model2)$adj.r.squared
    r.adj.squared.3 <- summary(model3)$adj.r.squared
    r.adj.squared.4 <- summary(model4)$adj.r.squared
    r.adj.squared.5 <- summary(model5)$adj.r.squared
    r.adj.squared.6 <- summary(model6)$adj.r.squared
    plot(1:6, c(r.adj.squared.1, r.adj.squared.2, r.adj.squared.3, r.adj.squared.4, r.adj.squared.5, r.adj.squared.6),
         type = "b", ylab = "Adjusted R squared", xlab = "Number of predictors")

    # Plotting AIC values against number of predictors
    r.AIC.1 <- AIC(model1)
    r.AIC.2 <- AIC(model2)
    r.AIC.3 <- AIC(model3)
    r.AIC.4 <- AIC(model4)
    r.AIC.5 <- AIC(model5)
    r.AIC.6 <- AIC(model6)
    plot(1:6, c(r.AIC.1, r.AIC.2, r.AIC.3, r.AIC.4, r.AIC.5, r.AIC.6),
         type = "b", ylab = "AIC", xlab = "Number of predictors")

R Output

Table 6: Comparing different models

Model   Formula                                                            R-squared   Adjusted R-squared   AIC
1       LD ~ speed_air                                                     0.8875      0.8870               2862.423
2       LD ~ speed_air + aircraft                                          0.9493      0.9488               2702.784
3       LD ~ speed_air + aircraft + height                                 0.9737      0.9733               2571.310
4       LD ~ speed_air + aircraft + height + pitch                         0.9737      0.9732               2573.300
5       LD ~ speed_air + aircraft + height + pitch + duration              0.9744      0.9737               2473.168
6       LD ~ speed_air + aircraft + height + pitch + duration + no_pasg    0.9747      0.9739               2473.010

Figure 10: Plot of R squared of different models vs. number of predictors
Figure 11: Plot of adjusted R squared of different models vs. number of predictors
Figure 12: Plot of AIC of different models vs. number of predictors
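AIC values are only comparable between models fitted to the same observations. Because `lm` silently drops rows with missing values in any variable it uses, adding a predictor that has missing values (such as duration) can shrink the sample and lower the AIC for that reason alone; the drop of about 100 between models 4 and 5 in Table 6 may partly reflect this. The sketch below shows one way to guard against it by fitting every candidate model on the complete cases. The data frame `toy` and its coefficients are synthetic assumptions made for the example, not the FAA data.

```r
# Synthetic stand-in for the FAA data (illustrative values only).
set.seed(7)
n <- 150
toy <- data.frame(
  speed_air = rnorm(n, mean = 104, sd = 10),
  aircraft  = rbinom(n, size = 1, prob = 0.5),
  height    = rnorm(n, mean = 30, sd = 10),
  duration  = rnorm(n, mean = 150, sd = 45)
)
toy$distance <- -5800 + 82 * toy$speed_air - 440 * toy$aircraft +
  14 * toy$height + rnorm(n, sd = 130)
toy$duration[sample(n, 20)] <- NA  # some durations are missing

# Keep only rows that are complete for every candidate predictor,
# so that all models see exactly the same observations.
complete <- na.omit(toy)

# Nested models, adding one predictor at a time.
formulas <- list(
  distance ~ speed_air,
  distance ~ speed_air + aircraft,
  distance ~ speed_air + aircraft + height,
  distance ~ speed_air + aircraft + height + duration
)
fits <- lapply(formulas, lm, data = complete)

comparison <- data.frame(
  predictors = seq_along(fits),
  n_obs      = sapply(fits, nobs),
  r_squared  = sapply(fits, function(m) summary(m)$r.squared),
  adj_r_sq   = sapply(fits, function(m) summary(m)$adj.r.squared),
  aic        = sapply(fits, AIC)
)
print(comparison)
```

The `n_obs` column makes it easy to confirm that every row of the comparison is based on the same number of observations.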
Observations
• Model 6 has the highest adjusted R-squared value and the lowest AIC value. When comparing models, we prefer the one with the higher adjusted R-squared, that is, the one whose predictors explain more of the variation in the dependent variable, and the one with the lower AIC.
• However, looking at the p-values within the models, only speed_air, aircraft and height are significant.
• It should also be noted that in the final data set speed_air has 630 missing values, so only 195 observations are used when fitting the models.

Conclusion
• Based on adjusted R-squared and AIC values, I would choose speed_air, aircraft, height, pitch, duration and no_pasg (model 6) to build a predictive model for landing distance.
• Based on the p-values of the models, I would choose speed_air, aircraft and height.
• The final model is as follows:

    Distance = -5796.9430 + (81.9833 * speed_air) - (437.8295 * aircraft) + (13.71 * height)

• Among the influential factors, height has the least impact on landing distance.

Chapter 7: Variable Selection based on automated algorithm

In this chapter, the stepAIC function from the MASS package in R is used to perform forward variable selection. The results are compared with those of the previous chapter to see whether they are consistent.

R Code

    library(MASS)  # provides stepAIC
    model <- lm(distance ~ ., data = FAA)
    step <- stepAIC(model, direction = "forward")
    summary(step)

R Output

#Summary of stepAIC function
Observation
• Using the stepAIC function to perform forward variable selection, I would select three variables to build a predictive model for landing distance. The output above shows that the p-values for aircraft, speed_air and height are significant.

Conclusion
• The final model after using the stepAIC function is as follows:

    Distance = -5791.6573 - (437.9428 * aircraft) + (85.5469 * speed_air) + (13.6756 * height)

• Based on the results of Chapters 6 and 7, I would choose speed_air, aircraft and height for the final model.
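As a point of comparison, forward selection with stepAIC is usually started from an intercept-only model, with a `scope` argument naming the largest model to consider. As far as the MASS documentation describes it, when `scope` is omitted the starting model is also the upper limit, so a forward search started from the full model has no terms left to add. The sketch below is a minimal illustration under that reading; the data frame `toy` and its coefficients are synthetic assumptions made for the example, not the FAA data.

```r
library(MASS)  # provides stepAIC

# Synthetic stand-in for the FAA data (illustrative values only).
set.seed(11)
n <- 150
toy <- data.frame(
  speed_air = rnorm(n, mean = 104, sd = 10),
  aircraft  = rbinom(n, size = 1, prob = 0.5),
  height    = rnorm(n, mean = 30, sd = 10),
  no_pasg   = round(rnorm(n, mean = 60, sd = 7))
)
toy$distance <- -5800 + 82 * toy$speed_air - 440 * toy$aircraft +
  14 * toy$height + rnorm(n, sd = 130)

# Forward selection starts from the intercept-only model and may add
# terms up to the full model given in `scope`.
null_model <- lm(distance ~ 1, data = toy)
full_model <- lm(distance ~ ., data = toy)

forward <- stepAIC(
  null_model,
  scope     = list(lower = formula(null_model), upper = formula(full_model)),
  direction = "forward",
  trace     = FALSE
)

selected <- attr(terms(forward), "term.labels")
print(selected)
summary(forward)
```

On real data with missing values, the rows would need to be restricted to complete cases first (for example with `na.omit`), since the step-wise comparison requires every candidate model to use the same observations.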