FAA Flight Landing Distance Forecasting and Analysis

Quynh Tran
M11853850
FLIGHT LANDING DISTANCE FORECASTING PROJECT
BANA 5143

EXECUTIVE SUMMARY
- The overall goal of this project is to build a model that forecasts landing distance from the
variables given in the dataset. To arrive at a model that fits the dataset well, we need to
explore, clean, visualize, and analyze the values in the dataset.
- After removing the missing records from FAA2, combining the two datasets, removing duplicate
records from the combined dataset, and removing abnormal values, I obtained a cleaned dataset
for fitting the linear regression model that forecasts landing distance.
- This is the final linear model I’ve got for this project:
Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air +
13.63*height – 4.05*pitch.
(Note: ‘aircraft01’ = 1 if ‘aircraft’ = ‘airbus’ and ‘aircraft01’ = 0 otherwise)
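To make the equation concrete, here is a minimal Python sketch (not the SAS code used in this project) that plugs values into the final model. The coefficients are copied from the equation above; the function name and the example flight values are made up for illustration, and the lowercase spelling of 'airbus' is an assumption about how the data is coded.

```python
# Coefficients copied from the final linear model stated above.
COEFFICIENTS = {
    "intercept": -5945.70,
    "aircraft01": -428.48,
    "speed_ground": -6.07,
    "speed_air": 88.23,
    "height": 13.63,
    "pitch": -4.05,
}


def predict_distance(aircraft, speed_ground, speed_air, height, pitch):
    """Return the fitted landing distance for one flight (hypothetical helper)."""
    # Dummy coding from the note above: 1 for airbus, 0 otherwise.
    aircraft01 = 1 if aircraft.strip().lower() == "airbus" else 0
    c = COEFFICIENTS
    return (c["intercept"]
            + c["aircraft01"] * aircraft01
            + c["speed_ground"] * speed_ground
            + c["speed_air"] * speed_air
            + c["height"] * height
            + c["pitch"] * pitch)


# Made-up flight values, identical except for the aircraft make.
airbus = predict_distance("airbus", 100.0, 105.0, 30.0, 4.0)
boeing = predict_distance("boeing", 100.0, 105.0, 30.0, 4.0)
print(round(airbus, 2), round(boeing, 2), round(airbus - boeing, 2))
```

With all other inputs equal, the two predictions differ by exactly the aircraft01 coefficient.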

FLIGHT LANDING PROJECT
Chapter 1: DATA UNDERSTANDING AND DATA EXPLORATION
Goal:
- The main purpose of chapter 1 is data understanding and data exploration. This should be
considered the most important step in this project. The motivation for this step, and for the
whole project, is to reduce the risk of overrun by finding the factors that affect the landing
distance of a commercial flight. So where should we start? I decided to start by importing the
two datasets, FAA1 and FAA2, then removing the missing records from FAA2, combining the two
datasets, producing a statistics summary for the combined dataset, removing duplicates,
removing abnormal values, and finally creating a descriptive statistics summary for the
cleaned dataset.
SAS codes, outputs and observations:
1. Importing FAA1 and FAA2 datasets

2. Removing missing values in FAA2
- After importing the two datasets separately, I saw that there were 50 missing records in the FAA2
dataset, so I used an IF…THEN statement to remove those records from FAA2 before combining the
two datasets into one.
3. Combining 2 datasets

4. Creating statistics summary for the Combined dataset
- After I finished combining two datasets FAA1 and FAA2, I have 950 records left in the dataset. In
this dataset duration and speed_air have the highest number of missing values. Duration has
150 missing values and speed_air has 711 missing values.
5. Removing all duplicates in the Combined dataset

- After removing all duplicate values in the Combined dataset, we have 850 records left and name
it “DUPLICATES_REMOVED” as a new datafile that will be used for further steps.
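Steps 2, 3 and 5 can be sketched in Python as follows. This is only an illustration of the logic, not the SAS code used in this project. It assumes that FAA2 has no duration column and that two records are duplicates when all of their shared columns match; the function names and the sample rows are made up.

```python
# Columns assumed to be shared by FAA1 and FAA2 (FAA2 is assumed to lack duration).
SHARED_FIELDS = ["aircraft", "no_pasg", "speed_ground", "speed_air",
                 "height", "pitch", "distance"]


def drop_blank_rows(rows):
    """Keep only rows that have at least one non-missing (non-None) value."""
    return [row for row in rows if any(v is not None for v in row.values())]


def combine(faa1, faa2):
    """Stack both datasets; a column absent from a row is filled with None."""
    columns = ["duration"] + SHARED_FIELDS
    return [{c: row.get(c) for c in columns} for row in faa1 + faa2]


def drop_duplicates(rows, key_fields=SHARED_FIELDS):
    """Keep the first occurrence of each combination of key_fields."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Tiny made-up example: one FAA2 row repeats an FAA1 flight, one is blank.
flight_a = {"aircraft": "boeing", "no_pasg": 53, "speed_ground": 107.9,
            "speed_air": 109.3, "height": 27.4, "pitch": 4.0, "distance": 3369.8}
flight_b = dict(flight_a, aircraft="airbus", distance=1500.0)
faa1 = [dict(flight_a, duration=98.5)]
faa2 = [dict(flight_a), flight_b, {f: None for f in SHARED_FIELDS}]

combined = combine(faa1, drop_blank_rows(faa2))
duplicates_removed = drop_duplicates(combined)
print(len(combined), len(duplicates_removed))
```

Because the FAA1 rows come first, the copy that is kept is the one that still carries a duration value.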
6. Cleaning abnormal values
- We use the criteria given in the Project instruction to identify and remove abnormal values
from the “DUPLICATES_REMOVED” dataset. In this step, we do not remove rows with missing values
of duration or speed_air, because those values are not missing at random and a missing value is
not the same as an abnormal value. For speed_air in particular, more than 700 cells are blank,
so the blanks most likely carry meaning. I therefore created a new column named
“value_condition”, set it to “missing” for every row with a missing value, and left it blank
for rows with complete values. After that, rows whose values are abnormal according to the
following criteria are deleted from the dataset:
+ Duration: The duration of a normal flight should always be greater than 40 min.
+ Speed_ground: If its value is less than 30MPH or greater than 140MPH, then the landing would
be considered as abnormal.
+ Speed_air: If its value is less than 30MPH or greater than 140MPH, then the landing would be
considered as abnormal.
+ Height: The landing aircraft is required to be at least 6 meters high at the threshold of the
runway.
+ Distance: The length of the airport runway is typically less than 6000 feet.

- After cleaning abnormal data, I have 831 records left and put those in another dataset named
“Cleaned_final_data”.
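The cleaning rule can be sketched in Python as below. This is an illustration under assumptions, not the SAS code used in this project: rows are dictionaries, a missing value is None, and the exact treatment of boundary values (for example a duration of exactly 40) is my own choice. The function names and sample rows are made up.

```python
def is_abnormal(row):
    """Return True if any non-missing value violates the criteria listed above."""
    duration = row.get("duration")
    ground = row.get("speed_ground")
    air = row.get("speed_air")
    height = row.get("height")
    distance = row.get("distance")
    # A missing value (None) never makes a row abnormal.
    return any([
        duration is not None and duration <= 40,         # must exceed 40 min
        ground is not None and not (30 <= ground <= 140),
        air is not None and not (30 <= air <= 140),
        height is not None and height < 6,               # at least 6 meters
        distance is not None and distance >= 6000,       # less than 6000 feet
    ])


def clean(rows):
    """Flag rows with missing duration or speed_air, then drop abnormal rows."""
    cleaned = []
    for row in rows:
        if is_abnormal(row):
            continue
        missing = row.get("duration") is None or row.get("speed_air") is None
        cleaned.append(dict(row, value_condition="missing" if missing else ""))
    return cleaned


# Made-up rows: a normal flight, one with missing speed_air, one that is too fast.
rows = [
    {"duration": 98.5, "speed_ground": 107.9, "speed_air": 109.3,
     "height": 27.4, "distance": 3369.8},
    {"duration": 120.0, "speed_ground": 80.0, "speed_air": None,
     "height": 30.0, "distance": 1200.0},
    {"duration": 150.0, "speed_ground": 141.2, "speed_air": 141.7,
     "height": 23.6, "distance": 6533.0},
]
cleaned_final_data = clean(rows)
print([r["value_condition"] for r in cleaned_final_data])
```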
7. Creating statistics summary after cleaning abnormal data

- This step gives an overview of how the distributions of the numeric variables in our dataset
changed after we finished cleaning abnormal data. The table reports the following statistics:
sample size (N), number of missing records (N Miss), Minimum, Maximum, Mean, Median and
Standard Deviation. Compared to the summary for the dataset before cleaning abnormal data,
there are only very slight changes in these statistics, since only a small number of records
were removed (850 – 831 = 19 records removed).
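For reference, the statistics in this kind of summary table can be computed for one variable as in the Python sketch below. It is not the SAS procedure used in this project; the function name and the sample values are made up, and a missing value is represented by None.

```python
import statistics


def summarize(values):
    """Compute N, N Miss, Minimum, Maximum, Mean, Median and Std Dev."""
    present = [v for v in values if v is not None]
    return {
        "N": len(present),
        "N Miss": len(values) - len(present),
        "Minimum": min(present),
        "Maximum": max(present),
        "Mean": statistics.mean(present),
        "Median": statistics.median(present),
        "Std Dev": statistics.stdev(present),  # sample standard deviation
    }


# Made-up values with one missing entry.
summary = summarize([1.0, 2.0, None, 3.0])
print(summary)
```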
8. Creating descriptive statistics and distribution histograms for all variables
- Since we already created statistics summary for all variables in the table posted above, I will not
include their tables here. For this section, I’m going to include all histograms that portray
distribution of all variables we have here, so we can see which variable has a normal
distribution, which one does not.

- Duration is almost normally distributed.
- No_pasg appears to be almost normally distributed.
- Speed_ground appears to be normally distributed.

- Speed_air’s distribution is right-skewed.
- Height appears to be normally distributed.

- Pitch is normally distributed.
- Distance is strongly right-skewed.
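The readings above come from looking at the histograms. As a numeric cross-check, the sample skewness could be computed as in the Python sketch below; this was not part of the original analysis, and the function name and sample values are made up. A value near 0 suggests a symmetric distribution and a positive value suggests right skew.

```python
def skewness(values):
    """Moment-based sample skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = skewness([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = skewness([1.0, 1.0, 1.0, 2.0, 10.0])
print(symmetric, right_skewed)
```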
Conclusions on chapter 1:

- Chapter 1 is the starting point of our project. It is basic and simple, but also crucial. In
this first stage, I went through the data exploration and data understanding process. After
importing the two datasets into SAS, I found some missing records in the FAA2 dataset. Since
these missing records could affect the dataset and the model fitting process later on, I
removed them from FAA2. I then combined the two datasets into one dataset called “Combined”
and moved on to removing duplicates, which helps us avoid drawing biased conclusions from the
dataset. Next, I cleaned abnormal values based on the criteria from the Project instruction.
Like removing duplicates, removing abnormal values helps us develop a forecasting model that
fits the dataset better and with less bias. Now that the data is cleaned and the data
exploration steps are done, we can move on to an important step before fitting the model:
Chapter 2, which covers the correlation matrix and X-Y plots.
Chapter 2: DATA VISUALIZATION AND DESCRIPTIVE STUDY
Goal:
- In this step we visualize the distribution of distance (our outcome variable) and the other
independent variables, as well as the scatter plots and correlation matrix between them. These
visualization tools in SAS give us an overview of the relationships between distance and the
other variables. This is also when we decide what to do with our categorical variable
(aircraft), and whether to keep all variables as predictors or drop those that are not
statistically significant for the distance forecasting model.
SAS codes, outputs and observations:
1. Creating boxplot showing distribution of distance of each aircraft

- This shows how the distribution of distance differs between airbus and boeing. The summary
statistics shown in the boxplot for airbus are generally lower than those for boeing.
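The numbers behind a box plot like this one are the five-number summary of distance for each aircraft make. The Python sketch below shows one way to compute them; it is not the SAS code used in this project, the function names and sample rows are made up, and the quartile definition used by Python's statistics module may differ slightly from the one SAS uses.

```python
import statistics


def five_number_summary(values):
    """Minimum, quartiles and maximum, the values a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def summarize_by_aircraft(rows):
    """Group distances by aircraft make and summarize each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["aircraft"], []).append(row["distance"])
    return {make: five_number_summary(d) for make, d in groups.items()}


# Made-up distances for illustration only.
rows = ([{"aircraft": "airbus", "distance": d} for d in (1, 2, 3, 4, 5, 6, 7)]
        + [{"aircraft": "boeing", "distance": d} for d in (2, 4, 6, 8, 10, 12, 14)])
by_make = summarize_by_aircraft(rows)
print(by_make["airbus"]["median"], by_make["boeing"]["median"])
```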
2. Creating dummy variable for aircraft

- Based on the box plot we just created, the aircraft make (airbus or boeing) may carry
information for forecasting the landing distance, so I created a dummy variable called
‘aircraft01’ for the ‘aircraft’ categorical variable. If the aircraft is airbus, its
‘aircraft01’ is 1; otherwise it is 0. After creating the dummy variable and saving the result
in another dataset, we use this dataset for all of our regression analysis to develop our
model.
3. Creating scatterplot matrix showing linear correlation between distance (outcome variable) and
other variables

- By visualizing the relationship between distance and the other variables, we can identify
which variables are strongly correlated with distance and which are not. A variable that is
strongly correlated with the outcome variable is a candidate predictor, while a variable that
is not is a candidate to leave out of the forecasting model. The column I have framed here
contains the scatterplots of distance against the other variables. Based on the matrix, we can
see that aircraft01, speed_ground, speed_air, height, and pitch have moderate correlation
(height and pitch) to strong correlation (aircraft01, speed_ground and speed_air) with
distance, while duration and no_pasg have no or only very weak correlation with distance. This
scatterplot matrix gives us the initial idea of not using duration and no_pasg as predictors
for the model. To confirm this reading of duration and no_pasg, we will create a correlation
matrix showing the correlation coefficient and p-value of each variable with distance.
4. Creating correlation matrix showing how distance is correlated with other variables

- To have a more insightful comparison for signs (positive/negative) of correlations between
predicting variables and distance in Marginal Analysis (from correlation matrix) and Joint
Analysis (from the linear regression model), I will create a comparison table based on what
we’ve got above:
Variable       Correlation Coeff.   Direction   p-value   Significant?
Aircraft01     -0.23814             -           <0.0001   Yes
Duration       -0.05138             -           0.1514    No
No_pasg        -0.01776             -           0.6093    No
Speed_ground    0.86624             +           <0.0001   Yes
Speed_air       0.94210             +           <0.0001   Yes
Height          0.09941             +           0.0041    Yes
Pitch           0.08703             +           0.0121    Yes?
- Comments on comparison table: Pitch is marked “Yes?” because its p-value of 0.0121 alone does
not settle whether it is significant. This p-value is lower than 0.05 but higher than 0.01, so
whether Pitch counts as significant depends on the significance level chosen.
- While the scatterplot matrix is a great tool for visualizing the relationships among variables,
the correlation matrix gives us the correlation coefficient for each pair of variables and the
p-value for testing whether that correlation differs from zero.

- How to read this matrix:
+ The first line of each box is the correlation coefficient between the two variables for that
box.
+ The second line is the p-value for the test of the null hypothesis that the correlation
between the two variables is zero. A p-value smaller than 0.05 signals that we can reject the
null hypothesis; otherwise we fail to reject it.
+ The third line is the sample size (number of observations).
- Observations for the framed row:
+ For correlation coefficients, duration and no_pasg have relatively low correlation coefficients
with distance, which are -0.051 and -0.018, respectively.
+ For p-values, aircraft01, speed_ground, speed_air, height, and pitch have relatively low
p-values, which means they are significantly correlated with distance. On the other hand,
duration and no_pasg have noticeably high p-values (0.1514 and 0.6093, respectively), which
means there is no significant evidence that duration and no_pasg are correlated with distance
in this scenario.
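To show where such p-values come from, the Python sketch below converts a correlation coefficient r and a sample size n into a two-sided p-value using t = r * sqrt((n - 2) / (1 - r^2)). It is only an approximation of what SAS reports: it uses the normal distribution in place of the t distribution (reasonable for large n), and it assumes n = 831 for pitch and no_pasg, which is not read from the matrix itself. The function name is made up.

```python
import math
from statistics import NormalDist


def correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: correlation = 0 (large-n normal approx.)."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Coefficients taken from the comparison table; n = 831 is an assumption.
p_pitch = correlation_p_value(0.08703, 831)
p_no_pasg = correlation_p_value(-0.01776, 831)
print(round(p_pitch, 4), round(p_no_pasg, 4))
```

Under these assumptions the results land close to the 0.0121 and 0.6093 shown in the table.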
- After seeing this result of correlation coefficients and p-values, I decided not to use duration and
no_pasg as predictors to fit the linear regression model forecasting distance.
- As mentioned previously, our goal in this chapter is to visualize and assess the correlations
between distance, our outcome variable, and the other independent variables. The scatterplot
matrix gives a first impression of which variables are strongly correlated with distance and
which are not. Those that are strongly correlated with distance are kept as predictors for the
linear regression model of landing distance. Those that are weakly correlated with distance,
on the other hand, are removed from the model. Here duration and no_pasg have weak
correlations with distance.
- We cannot rely only on the scatterplots to decide whether to keep or drop certain variables.
This is where the correlation matrix comes in. According to this matrix, besides having low
correlation coefficients with distance, duration and no_pasg are also the two variables with
relatively high p-values. A small p-value (typically smaller than 0.05) indicates strong
evidence against the null hypothesis, so we reject the null when the p-value is small. A large
p-value indicates weak evidence against the null hypothesis, so we fail to reject the null
when the p-value is large. In this scenario, the null hypothesis for each independent variable
is “The variable has no linear correlation with landing distance”. Duration and no_pasg have
large p-values, which indicates that their correlations with distance are not statistically
significant; therefore, I decided not to use these two variables as predictors for the model
forecasting distance.
- I will still keep Pitch as a predictor in my linear model. We can check whether Pitch is
actually significant later by looking at its p-value in the fitted regression model.

- The variables that will be used as predictors are: aircraft01, speed_ground, speed_air, height
and pitch.
- The box plot of distance for each type of aircraft tells us that distance differs depending on
whether the aircraft is airbus or boeing. Therefore, it is important to keep the dummy variable
‘aircraft01’ as one of our predictors.
Chapter 3: STATISTICAL REGRESSION MODELLING
Goal:
- The goal of this chapter is to develop a linear regression model that predicts value of landing
distance of an aircraft depending on values of these predictors: aircraft01 (aircraft01 = 1 if our
aircraft is airbus and aircraft01 = 0 otherwise), speed_ground, speed_air, height and pitch.
- Another goal in this stage is to check how well the model fits the dataset by evaluating the
measures reported with the Analysis of Variance table, such as R squared, Adjusted R squared,
Root MSE, etc.
SAS codes, outputs, and observations:
1. Building a linear regression model to predict distance

- According to our Parameter Estimates, our linear regression model is given by this equation:
Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air +
13.63*height – 4.05*pitch
- The R squared and Adj. R squared of this model are relatively high (R squared = 0.9738 and Adj.
R squared = 0.9732).

- The fit diagnostics plots here suggest that the residuals from the linear regression model are
approximately normally distributed.

- Observations from the linear regression model (Joint Analysis) and the comparison table for
correlation matrix (Marginal Analysis):
+ aircraft01, speed_air, and height have the same signs for their parameter estimates in the
regression model as for their correlation coefficients in the correlation matrix.
+ aircraft01, speed_air, and height have relatively low p-values in the regression model, which
indicates they make a significant contribution to the model.
+ speed_ground and pitch have different signs for their parameter estimates in the regression
model compared to their correlation coefficients in the correlation matrix.
+ speed_ground and pitch have relatively high p-values in the regression model, which indicates
they do not make a significant contribution to the model.
➔ The sign change for speed_ground is most likely due to its collinearity with speed_air: once
speed_air is in the model, speed_ground adds little independent information. For pitch, the
marginal correlation was already weak and its significance questionable in the Marginal
Analysis, so its sign is not stable either.
➔ Therefore, to test if the new model without speed_ground and pitch would make a better
model, we will try to run new codes for linear model:
+ Codes for a new linear model without speed_ground and pitch:

- The new model has this equation:
Distance = -5962.93 – 427.442*aircraft01 + 82.15*speed_air + 13.70*height

- Looking at the output from the linear regression model with speed_ground and pitch
removed, we have Adj. R squared = 0.9733, which is 0.0001 larger than the Adj. R squared of
the original model. The p-values of all variables are also <0.0001, which indicates that they
all make a significant contribution to the model.
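Adjusted R squared is used for this comparison because it penalizes the number of predictors: Adj. R squared = 1 - (1 - R squared) * (n - 1) / (n - p - 1), where n is the number of observations used in the fit and p is the number of predictors. The Python sketch below illustrates the formula with made-up numbers (they are not taken from the SAS output) to show that a smaller model can have a higher adjusted value even when its R squared is slightly lower.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R squared for a linear model with an intercept."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical values: 200 observations, 5 predictors vs. 3 predictors.
full_model = adjusted_r_squared(0.9000, 200, 5)
reduced_model = adjusted_r_squared(0.8995, 200, 3)
print(round(full_model, 4), round(reduced_model, 4))
```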
- In this chapter we have built the linear regression model for forecasting the landing distance of
an aircraft using these predictors: aircraft01, speed_ground, speed_air, height and pitch.
- Based on the R squared and Adj. R squared values of this model, we have a good model: it
explains 97.38% of the variance in the distance variable (97.32% after adjusting for the
number of predictors).
- Because of the different signs of parameter est. we have for speed_ground and pitch in the
regression model compared to theirs in the correlation matrix, and also their high p-values in the
regression model, I decided to try building a new linear model without using speed_ground and
pitch.
- The result is that we have 0.0001 higher Adj. R squared for the new model compared to the
original one.
- An even better forecasting model I would expect for this dataset is an exponential model that
would be able to capture the exponential curve characteristics of the predictors with distance.

- But from what we have learned about this dataset so far, I think it would still be reasonable
to use either my original linear model or the one with speed_ground and pitch removed to
forecast landing distance. Since speed_ground has far more available data than
speed_air, I would still prefer to use the original model for forecasting.
SUMMARY
- The linear regression model I decided to use for this project is:
Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air +
13.63*height – 4.05*pitch
- Adjusted R squared of this model is 0.9732, which indicates this is a good model that
explains about 97.32% of the variance in landing distance in the dataset.
- Speed_ground and pitch can be removed from the predictor group because their signs in the
Joint Analysis differ from those in the Marginal Analysis and their p-values in the Joint
Analysis are high.
- Summarizing questions:
1. How many observations (flights) do you use to fit your final model? If not all 950 flights,
why?
- I used 831 observations to fit my final model. The number of observations is not 950 because I
removed duplicate records and abnormal values before fitting the model. These steps give us a
cleaner and less biased dataset, which produces a model that fits the dataset well.
2. What factors and how they impact the landing distance of a flight?
- Factors that impact the landing distance of a flight are: aircraft01 (aircraft01 = 1 if aircraft =
airbus and = 0 otherwise), speed_ground, speed_air, height and pitch.
- Factors that do not impact the landing distance of a flight are: duration and no_pasg. Based on
the low correlation coefficients and large p-values of these variables with distance, we can
figure out that duration and no_pasg are not statistically significant to the fitting model of
distance. Therefore, I did not use these two variables as predictors for my final model.
- Since speed_ground and speed_air are strongly correlated with each other, one of them should
have been dropped. But since I did not know which variable to drop before fitting the model, I
decided to keep both. And because speed_air has only a small amount of available data, this
did not affect our result tremendously.
- After fitting the model, I realized that speed_ground and pitch may not make a good
contribution to the model.
➔ Factors that impact the landing distance determined after fitting the model are: aircraft01,
speed_air and height.
3. Is there any difference between the two makes Boeing and Airbus?

- Based on the box plot I created to show the distribution of landing distance for airbus
and boeing, we can see that landing distance for airbus is slightly lower than landing
distance for boeing.
- Based on the final linear regression model I’ve come up with:
➔ The coefficient of the dummy variable ‘aircraft01’ tells us that, with the other predictors
held fixed, the predicted landing distance is 428.48 feet shorter for airbus than for boeing.
