Quynh Tran
M11853850
FLIGHT LANDING DISTANCE FORECASTING PROJECT
BANA 5143
EXECUTIVE SUMMARY
- The overall goal of this project is to build a model that forecasts landing distance from the
variables given in the dataset. To arrive at a model that fits the dataset well, we need to
explore, clean, visualize, and analyze the values in the dataset.
- After removing missing values of FAA2, combining 2 datasets, removing duplicate
values of the combined dataset, removing abnormal values, I’ve come up with a
cleaned dataset to fit the linear regression forecasting model for landing distance.
- This is the final linear model I’ve got for this project:
Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air +
13.63*height – 4.05*pitch.
(Note: ‘aircraft01’ = 1 if ‘aircraft’ = ‘airbus’ and ‘aircraft01’ = 0 otherwise)
FLIGHT LANDING PROJECT
Chapter 1: DATA UNDERSTANDING AND DATA EXPLORATION
Goal:
- The main purpose of chapter 1 is data understanding and data exploration. This should be
considered the most important step in this project. The motivation for this step, and for the whole
project, is to reduce the risk of overrun by finding relevant factors that could impact the landing
distance of a commercial flight. So where should we start? I decided to start the project by
importing two datasets, FAA1 and FAA2, then removing missing records from FAA2, combining the two
datasets, producing a statistics summary for the combined dataset, removing duplicates, removing
abnormal values in the combined dataset, and finally creating a descriptive statistics summary for
the cleaned dataset.
SAS codes, outputs and observations:
1. Importing FAA1 and FAA2 datasets
2. Removing missing values in FAA2
- After I imported the two datasets separately, I saw there were 50 missing records in the FAA2
dataset, so I used an IF…THEN statement to remove those missing records from FAA2 before combining
the two datasets into one.
3. Combining 2 datasets
4. Creating statistics summary for the Combined dataset
- After combining the two datasets FAA1 and FAA2, the combined dataset has 950 records. In this
dataset, duration and speed_air have the highest number of missing values: duration has 150
missing values and speed_air has 711 missing values.
5. Removing all duplicates in the Combined dataset
- After removing all duplicate records from the Combined dataset, we have 850 records left. We name
this new data file “DUPLICATES_REMOVED” and use it for the further steps.
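The SAS code for steps 2 to 5 did not survive in this copy of the report. As a rough illustration
only, the same sequence (drop empty records, stack the two datasets, remove duplicates) might be
sketched in Python as below. The toy rows, the helper names, and the choice to compare duplicates
on every column except duration are assumptions made for the example, not the code used in the
project.

```python
# Hypothetical sketch of steps 2-5; not the project's SAS code.

def drop_empty_records(rows):
    """Keep only records that have at least one non-missing value."""
    return [r for r in rows if any(v is not None for v in r.values())]

def combine(first, second, columns):
    """Stack two datasets; columns absent from a record become missing (None)."""
    return [{c: r.get(c) for c in columns} for r in first + second]

def drop_duplicates(rows, key_columns):
    """Keep the first record for each distinct combination of key_columns."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(r[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Toy data (made up): the second dataset has no duration column.
faa1 = [
    {"aircraft": "boeing", "duration": 98.5, "speed_ground": 107.9, "distance": 3369.8},
    {"aircraft": "airbus", "duration": 125.7, "speed_ground": 101.7, "distance": 2987.8},
]
faa2 = [
    {"aircraft": "boeing", "speed_ground": 107.9, "distance": 3369.8},  # repeats a faa1 flight
    {"aircraft": None, "speed_ground": None, "distance": None},          # empty record
    {"aircraft": "airbus", "speed_ground": 71.1, "distance": 1257.0},
]

columns = ["aircraft", "duration", "speed_ground", "distance"]
key_columns = ["aircraft", "speed_ground", "distance"]  # assumption: ignore duration

faa2_clean = drop_empty_records(faa2)
combined = combine(faa1, faa2_clean, columns)
duplicates_removed = drop_duplicates(combined, key_columns)

print(len(faa2_clean), len(combined), len(duplicates_removed))  # 2 4 3
```

Keeping the first occurrence means the copy that still carries a duration value is the one that
survives in this toy example.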
6. Cleaning abnormal values
- We use the criteria given in the Project instruction to identify and remove abnormal values
from the “DUPLICATES_REMOVED” dataset. In this step, we will not remove records with missing
values for the duration and speed_air variables, because those values are not missing at random,
and a missing value is not the same as an abnormal value. Speed_air in particular has more than
700 blank cells. Since these blanks presumably carry some meaning, I created a new column named
“value_condition” and set it to “missing” for records with missing values. Records with complete
values have a blank “value_condition”. After that, records whose values are considered abnormal
are deleted from the dataset based on these criteria:
+ Duration: The duration of a normal flight should always be greater than 40 min.
+ Speed_ground: If its value is less than 30MPH or greater than 140MPH, then the landing would
be considered as abnormal.
+ Speed_air: If its value is less than 30MPH or greater than 140MPH, then the landing would be
considered as abnormal.
+ Height: The landing aircraft is required to be at least 6 meters high at the threshold of the
runway.
+ Distance: The length of the airport runway is typically less than 6000 feet.
- After cleaning abnormal data, I have 831 records left and put those in another dataset named
“Cleaned_final_data”.
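As an illustration of how the criteria above can be applied, here is a minimal Python sketch. It is
not the SAS code from the project. The function names and toy records are hypothetical, and the
handling of the boundary values (for example, treating a distance of exactly 6000 feet as abnormal)
is an assumption. Missing values are represented as None and are never treated as abnormal.

```python
# Hypothetical sketch of the abnormal-value rules listed above.

def is_abnormal(row):
    """Return True if any non-missing value violates a cleaning criterion."""
    duration = row.get("duration")
    speed_ground = row.get("speed_ground")
    speed_air = row.get("speed_air")
    height = row.get("height")
    distance = row.get("distance")

    if duration is not None and duration <= 40:          # minutes
        return True
    if speed_ground is not None and not (30 <= speed_ground <= 140):  # MPH
        return True
    if speed_air is not None and not (30 <= speed_air <= 140):        # MPH
        return True
    if height is not None and height < 6:                # meters
        return True
    if distance is not None and distance >= 6000:        # feet
        return True
    return False

def add_value_condition(row):
    """Flag records with a missing duration or speed_air instead of deleting them."""
    missing = row.get("duration") is None or row.get("speed_air") is None
    return {**row, "value_condition": "missing" if missing else ""}

# Toy records (made up for the example).
records = [
    {"duration": 98.5, "speed_ground": 107.9, "speed_air": 109.3, "height": 27.4, "distance": 3369.8},
    {"duration": 35.0, "speed_ground": 80.0, "speed_air": None, "height": 30.0, "distance": 1500.0},
    {"duration": None, "speed_ground": 75.0, "speed_air": None, "height": 25.0, "distance": 1200.0},
    {"duration": 120.0, "speed_ground": 141.2, "speed_air": 141.7, "height": 23.6, "distance": 6533.0},
]

flagged = [add_value_condition(r) for r in records]
cleaned = [r for r in flagged if not is_abnormal(r)]

print(len(cleaned))                             # 2
print([r["value_condition"] for r in cleaned])  # ['', 'missing']
```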
7. Creating statistics summary after cleaning abnormal data
- This step gives us an overview of how the distributions of the numeric variables in our dataset
changed after we finished cleaning abnormal data. The table reports these statistics: sample size
(N), number of missing records (N Miss), Minimum, Maximum, Mean, Median and Standard Deviation.
Compared to the summary for the dataset before cleaning abnormal data, there are only very slight
changes in these statistics, since only a small number of records were removed (850 – 831 = 19
records removed).
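For readers without the SAS output, the statistics named above can be reproduced for one variable
with a short Python sketch. The function name and the toy values are hypothetical, and the sample
standard deviation (n - 1 in the denominator) is an assumption about how the table was computed.

```python
import statistics

def summarize(values):
    """Summary statistics for one variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "N": len(present),
        "N Miss": len(values) - len(present),
        "Minimum": min(present),
        "Maximum": max(present),
        "Mean": statistics.mean(present),
        "Median": statistics.median(present),
        "Std Dev": statistics.stdev(present),  # sample standard deviation
    }

# Toy speed_air values (made up), two of them missing.
speed_air = [100.0, None, 104.0, 98.0, None, 102.0]
summary = summarize(speed_air)
for name, value in summary.items():
    print(f"{name}: {value}")
```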
8. Creating descriptive statistics and distribution histograms for all variables
- Since the statistics summary for all variables is already given in the table above, I will not
repeat those tables here. In this section I include the histograms of all variables, so we can
see which variables have a normal distribution and which do not.
- Duration is almost normally distributed.
- No_pasg appears to be almost normally distributed.
- Speed_ground appears to be normally distributed.
- Speed_air’s distribution is right-skewed.
- Height appears to be normally distributed.
- Pitch is normally distributed.
- Distance is strongly right-skewed.
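The readings above come from the histograms. A numeric check that could complement them is the
sample skewness, which is close to 0 for a symmetric distribution and positive for a right-skewed
one. The sketch below is an illustration with made-up values and a hypothetical function name. It
was not part of the original analysis.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

symmetric = [1, 2, 3, 4, 5]          # toy data, symmetric around 3
right_skewed = [1, 1, 2, 2, 3, 10]   # toy data with a long right tail

print(sample_skewness(symmetric))     # approximately 0
print(sample_skewness(right_skewed))  # positive
```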
Conclusions on chapter 1:
- Chapter 1 is the starting point of our project. It is basic and simple, but also crucial. In
this first stage, I went through the data exploration and data understanding process. After
importing the two datasets into SAS, I found some missing records in the FAA2 dataset. Since
these missing records could affect the dataset and the later model fitting, I removed them from
FAA2. I then combined the two datasets into one dataset called “Combined” and moved on to
removing duplicates, which helps us avoid biased conclusions about the dataset. Then I cleaned
abnormal values based on the criteria from the Project instruction. Like removing duplicates,
removing abnormal values helps us develop a forecasting model that fits the dataset better and
with less bias. Now that the data is cleaned and the data exploration steps are done, we can move
on to the step that precedes model fitting: Chapter 2, the correlation matrix and X-Y plots.
Chapter 2: DATA VISUALIZATION AND DESCRIPTIVE STUDY
Goal:
- In this step we visualize the distribution of distance (our outcome variable) and the other
independent variables, together with scatter plots and a correlation matrix between them. These
visualization tools in SAS give us an overview of the relationships between distance and the
other variables. This is also where we decide what to do with our categorical variables, and
whether to keep all variables as predictors or drop those that are not statistically significant
for the distance forecasting model.
SAS codes, outputs and observations:
1. Creating boxplot showing distribution of distance of each aircraft
- This shows how the distribution of distance differs between airbus and boeing. The summary
statistics shown in the boxplot for airbus are lower than those for boeing.
2. Creating dummy variable for aircraft
- Based on the box plot we just created, whether an aircraft is airbus or boeing may carry
information for forecasting the landing distance, so I decided to create a dummy variable called
‘aircraft01’ for the ‘aircraft’ categorical variable. If the aircraft is airbus, its ‘aircraft01’
is 1, and it is 0 otherwise. The dataset with this dummy variable added is saved as a new
dataset, which we use for all of the regression analysis to develop our model.
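The dummy coding described above can be written in a few lines. The sketch below is a Python
illustration rather than the project's SAS code, and the function name is hypothetical.

```python
def aircraft_dummy(aircraft):
    """Return 1 for airbus and 0 for anything else (boeing in this dataset)."""
    return 1 if aircraft.strip().lower() == "airbus" else 0

rows = [{"aircraft": "airbus"}, {"aircraft": "boeing"}]
with_dummy = [{**r, "aircraft01": aircraft_dummy(r["aircraft"])} for r in rows]

print([r["aircraft01"] for r in with_dummy])  # [1, 0]
```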
3. Creating scatterplot matrix showing linear correlation between distance (outcome variable) and
other variables
- By visualizing the correlation between distance and the other variables, we can identify which
variables are strongly correlated with distance and which are not. A variable that has a strong
correlation with the outcome variable is a candidate predictor. Variables that do not have a
strong correlation with the outcome should not be picked as predictors for the forecasting model.
The column I have framed here contains the scatterplots of distance against the other variables.
Based on the matrix, aircraft01, speed_ground, speed_air, height, and pitch show moderate (height
and pitch) to strong (aircraft01, speed_ground and speed_air) correlation with distance, while
duration and no_pasg show no or very weak correlation with distance. From this scatterplot
matrix, we get the initial idea of not using duration and no_pasg as predictors for the model.
To confirm this reading of duration and no_pasg, we create a correlation matrix showing the
correlation coefficient and p-value of each variable with distance.
4. Creating correlation matrix showing how distance is correlated with other variables
- To have a more insightful comparison for signs (positive/negative) of correlations between
predicting variables and distance in Marginal Analysis (from correlation matrix) and Joint
Analysis (from the linear regression model), I will create a comparison table based on what
we’ve got above:
Variable        Correlation Coeff.   Direction   p-value    Significant?
Aircraft01      -0.23814             -           <0.0001    Yes
Duration        -0.05138             -           0.1514     No
No_pasg         -0.01776             -           0.6093     No
Speed_ground     0.86624             +           <0.0001    Yes
Speed_air        0.94210             +           <0.0001    Yes
Height           0.09941             +           0.0041     Yes
Pitch            0.08703             +           0.0121     Yes?
- Comment on the comparison table: Pitch is marked “Yes?” because its p-value of 0.0121 is below
0.05 but above 0.01. It is significant at the 5% level but not at the 1% level, so we cannot call
it clearly significant from this p-value alone.
- While the scatterplot matrix is a great tool for visualizing the relationships among variables,
the correlation matrix gives us the correlation coefficient for every pair of variables, together
with the p-value for each pair.
- How to read this matrix:
+ The first line of each box is the correlation coefficient between the two variables that define
that box.
+ The second line is the p-value for that correlation. A p-value smaller than 0.05 means we can
reject the null hypothesis; otherwise we fail to reject it. The null hypothesis here is that the
two variables are not linearly correlated, i.e. one has no statistically significant linear
relationship with the other.
+ The third line represents sample size (or number of observations).
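To make the three lines of each box concrete, here is a small Python sketch that computes a
Pearson correlation coefficient, an approximate two-sided p-value, and the number of observations
used. It is an illustration, not the SAS procedure from the project. The function names are
hypothetical, and the p-value uses a normal approximation to the t statistic, which is an
assumption that is only reasonable for large samples. The sample size of 831 used for pitch is
also an assumption (it presumes pitch has no missing values in the cleaned dataset).

```python
import math
from statistics import NormalDist

def pearson_r(xs, ys):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n

def correlation_p_value(r, n):
    """Two-sided p-value for H0: no linear correlation (normal approximation)."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Toy data (made up); the pair with a missing value is skipped.
xs = [1.0, 2.0, 3.0, None, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
r, n = pearson_r(xs, ys)
print(round(r, 4), n)

# Pitch vs. distance, using the coefficient from the comparison table.
p_pitch = correlation_p_value(0.08703, 831)
print(round(p_pitch, 4))
```

Under these assumptions the last line gives a p-value of roughly 0.012, in line with the 0.0121
reported for pitch in the comparison table.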
- Observations for the framed row:
+ For correlation coefficients, duration and no_pasg have very low correlation coefficients with
distance: -0.051 and -0.018, respectively.
+ For p-values, aircraft01, speed_ground, speed_air, height, and pitch have low p-values, which
suggests they would contribute significantly to a model forecasting distance. Duration and
no_pasg, on the other hand, have noticeably high p-values (0.1514 and 0.6093, respectively), which
means they make little or no significant contribution to a model forecasting distance in this
scenario.
- After seeing this result of correlation coefficients and p-values, I decided not to use duration and
no_pasg as predictors to fit the linear regression model forecasting distance.
Conclusions on chapter 2:
- As mentioned previously, our goal in this chapter is to visualize and assess the correlations
between distance, our outcome variable, and the other independent variables. The scatterplot
matrix gives a first impression of which variables are strongly correlated with distance and
which are not. Variables that are strongly correlated with distance are kept as predictors for
the linear regression model of landing distance. Variables that are weakly correlated with
distance are removed from the model. Here duration and no_pasg have weak correlations with
distance.
- We cannot rely only on the scatterplots to decide whether to keep or drop a variable. This is
where we need the correlation matrix. According to this matrix, duration and no_pasg not only
have low correlation coefficients with distance but are also the two variables with relatively
high p-values. A small p-value (typically smaller than 0.05) indicates strong evidence against
the null hypothesis, so we can reject the null when the p-value is low. A large p-value indicates
weak evidence against the null hypothesis, so we fail to reject the null when the p-value is
large. In this scenario, the null hypothesis for each independent variable is “The variable does
not make any significant contribution to the linear regression model to predict landing
distance”. Duration and no_pasg have large p-values, which indicates that these two variables are
not statistically significant for the model. Therefore, I decided not to use them as predictors
for the model forecasting distance.
- I will still keep Pitch as a predictor in my linear model. We can check whether Pitch is
actually significant in the linear model later by looking at its p-value in the regression output.
- The variables that will be used as predictors are: aircraft01, speed_ground, speed_air, height
and pitch.
- The box plot showing the distribution of distance for each type of aircraft tells us that
distance differs depending on whether the aircraft is airbus or boeing. Therefore, it is
important to keep the dummy variable ‘aircraft01’ as one of our predictors.
Chapter 3: STATISTICAL REGRESSION MODELLING
Goal:
- The goal of this chapter is to develop a linear regression model that predicts the landing
distance of an aircraft from these predictors: aircraft01 (aircraft01 = 1 if the aircraft is
airbus and aircraft01 = 0 otherwise), speed_ground, speed_air, height and pitch.
- Another goal in this stage is to check how well the model fits the dataset by evaluating
measures such as R squared, Adjusted R squared, Root MSE, etc.
SAS codes, outputs, and observations:
1. Building a linear regression model to predict distance
- According to our Parameter Estimates, our linear regression model is given by this equation:
Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air +
13.63*height – 4.05*pitch.
(Note: ‘aircraft01’ = 1 if ‘aircraft’ = ‘airbus’ and ‘aircraft01’ = 0 otherwise)
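To show how the equation is used, the sketch below plugs a made-up landing into the fitted
coefficients. The coefficients are the ones reported above. The function name and the input
values are hypothetical, and the units are assumed to follow the cleaning criteria (speeds in
MPH, height in meters, distance in feet).

```python
# Coefficients copied from the fitted equation above.
COEFFICIENTS = {
    "intercept": -5945.70,
    "aircraft01": -428.48,
    "speed_ground": -6.07,
    "speed_air": 88.23,
    "height": 13.63,
    "pitch": -4.05,
}

def predict_distance(aircraft01, speed_ground, speed_air, height, pitch):
    """Predicted landing distance (feet) from the linear model."""
    c = COEFFICIENTS
    return (
        c["intercept"]
        + c["aircraft01"] * aircraft01
        + c["speed_ground"] * speed_ground
        + c["speed_air"] * speed_air
        + c["height"] * height
        + c["pitch"] * pitch
    )

# Hypothetical landing: the same flight conditions for each make.
airbus = predict_distance(1, speed_ground=100, speed_air=105, height=30, pitch=4)
boeing = predict_distance(0, speed_ground=100, speed_air=105, height=30, pitch=4)

print(round(airbus, 2))           # 2675.67
print(round(boeing, 2))           # 3104.15
print(round(airbus - boeing, 2))  # -428.48
```

With every other input held fixed, the two predictions differ by exactly the aircraft01
coefficient.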
- The R squared and Adj. R squared of this model are high (R squared = 0.9738 and Adj.
R squared = 0.9732).
- The fit diagnostics plots and graphs here show that the residuals from the linear regression
model are approximately normally distributed.
- Observations from the linear regression model (Joint Analysis) and the comparison table for the
correlation matrix (Marginal Analysis):
+ aircraft01, speed_air, and height have parameter estimates in the regression model with the
same signs as their correlation coefficients in the correlation matrix.
+ aircraft01, speed_air, and height have low p-values in the regression model, which indicates
they contribute well to the model.
+ speed_ground and pitch have parameter estimates in the regression model whose signs differ
from their correlation coefficients in the correlation matrix.
+ speed_ground and pitch have relatively high p-values in the regression model, which indicates
they do not contribute well to the model.
➔ This is likely due to the collinearity between speed_ground and speed_air, and to the
borderline p-value of pitch in the Marginal Analysis, which together explain why speed_ground
and pitch change sign.
➔ Therefore, to test whether a model without speed_ground and pitch would be better, we fit a
new linear model:
+ Codes for a new linear model without speed_ground and pitch:
- The new model has this equation:
Distance = -5962.93 – 427.442*aircraft01 + 82.15*speed_air + 13.70*height
- Looking at the output of the linear regression model with speed_ground and pitch removed, we
have Adj. R squared = 0.9733, which is 0.0001 larger than the Adj. R squared of the original
model. The p-values of all variables are also <0.0001, which indicates they are all good
predictors that contribute well to the model.
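Adjusted R squared is the natural measure for this comparison because it penalizes extra
predictors, so dropping two weak predictors can raise it even though plain R squared cannot
increase. A minimal sketch of the standard formula follows. The function name is hypothetical,
and the sample size n = 200 is a made-up value for illustration, since the number of complete
observations used in the fit is not stated here.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R squared for a model with p predictors fitted on n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# R squared of the original model (5 predictors), with a hypothetical n.
adj = adjusted_r_squared(0.9738, n=200, p=5)
print(round(adj, 4))

# With the same R squared, fewer predictors give a higher adjusted value.
print(adjusted_r_squared(0.9738, n=200, p=3) > adj)  # True
```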
Conclusions on chapter 3:
- In this chapter we built a linear regression model to forecast the landing distance of an
aircraft using these predictors: aircraft01, speed_ground, speed_air, height and pitch.
- Based on the R squared and Adj. R squared values, this is a good model: it explains about
97.38% of the variation in distance (R squared = 0.9738), or 97.32% after adjusting for the
number of predictors (Adj. R squared = 0.9732).
- Because the parameter estimates for speed_ground and pitch in the regression model have
different signs from their coefficients in the correlation matrix, and because of their high
p-values in the regression model, I decided to try building a new linear model without
speed_ground and pitch.
- The result is that we have 0.0001 higher Adj. R squared for the new model compared to the
original one.
- A potentially better forecasting model for this dataset would be an exponential model that can
capture the curved relationship between the predictors and distance.
- But from what we have learned about this dataset so far, I think it is still reasonable to use
either my original linear model or the one with speed_ground and pitch removed to forecast
landing distance. Since speed_ground has far more available data than speed_air, I still prefer
the original model for forecasting.
SUMMARY
- The linear regression model I decided to use for this project is:
Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air +
13.63*height – 4.05*pitch.
(Note: ‘aircraft01’ = 1 if ‘aircraft’ = ‘airbus’ and ‘aircraft01’ = 0 otherwise)
- The Adjusted R squared of this model is 0.9732, which indicates this is a good model that
explains about 97.32% of the variation in landing distance in the dataset.
- Speed_ground and pitch can be removed from the predictor group because their signs in the Joint
Analysis differ from those in the Marginal Analysis and their p-values in the Joint Analysis are
high.
- Summarizing questions:
1. How many observations (flights) do you use to fit your final model? If not all 950 flights,
why?
- I used 831 observations to fit my final model. The number is not 950 because I removed
duplicate records and abnormal values from the dataset before fitting the model. These steps
give us a cleaner, less biased dataset and a model that fits it well.
2. What factors and how they impact the landing distance of a flight?
- Factors that impact the landing distance of a flight are: aircraft01 (aircraft01 = 1 if aircraft =
airbus and = 0 otherwise), speed_ground, speed_air, height and pitch.
- Factors that do not impact the landing distance of a flight are: duration and no_pasg. Based on
the low correlation coefficients and large p-values of these variables with distance, we can
figure out that duration and no_pasg are not statistically significant to the fitting model of
distance. Therefore, I did not use these two variables as predictors for my final model.
- Since speed_ground and speed_air are strongly correlated with each other, we should have
dropped one of them. But since I did not know which one to drop before fitting the model, I kept
both. And because speed_air has a very small amount of available data, it did not affect our
result tremendously.
- After fitting the model, I realized that speed_ground and pitch may not make a good
contribution to the model.
➔ Factors that impact the landing distance determined after fitting the model are: aircraft01,
speed_air and height.
3. Is there any difference between the two makes Boeing and Airbus?
- Based on the box plot I’ve created to show the distribution of landing distance based on
whether the aircraft is airbus or boeing, we can visualize that landing distance for airbus is
slightly lower than landing distance for boeing.
- Based on the final linear regression model I’ve come up with:
Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air +
13.63*height – 4.05*pitch.
(Note: ‘aircraft01’ = 1 if ‘aircraft’ = ‘airbus’ and ‘aircraft01’ = 0 otherwise)
➔ Thanks to the dummy variable ‘aircraft01’ we can see that, with the other predictors held
fixed, the predicted landing distance is 428.48 feet shorter for an airbus than for a boeing.

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 

FAA Flight Landing Distance Forecasting and Analysis

  • 1. Quynh Tran M11853850 FLIGHT LANDING DISTANCE FORECASTING PROJECT BANA 5143
  • 2. EXECUTIVE SUMMARY - The overall goal of this project is to build a model that forecasts landing distance from the variables given in the dataset. To arrive at a model that fits the dataset well, we need to go through several steps to explore, clean, visualize, and analyze the values in the dataset. - After removing the missing records of FAA2, combining the 2 datasets, and removing duplicate and abnormal values from the combined dataset, I ended up with a cleaned dataset for fitting the linear regression model that forecasts landing distance. - This is the final linear model for this project: Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air + 13.63*height – 4.05*pitch. (Note: ‘aircraft01’ = 1 if ‘aircraft’ = ‘airbus’ and ‘aircraft01’ = 0 otherwise)
  • 3. FLIGHT LANDING PROJECT Chapter 1: DATA UNDERSTANDING AND DATA EXPLORATION Goal: - The main purpose of chapter 1 is data understanding and data exploration. This should be considered the most important step in this project. The overall goal of this step, and of the whole project, is to reduce the risk of overrun by finding the factors that could impact the landing distance of a commercial flight. So where should we start? I decided to start by importing the two datasets, FAA1 and FAA2, then removing missing records from FAA2, combining the two datasets, producing a statistics summary for the combined dataset, removing duplicates, removing abnormal values, and finally creating a descriptive statistics summary for the cleaned dataset. SAS codes, outputs and observations: 1. Importing FAA1 and FAA2 datasets
  • 4. 2. Removing missing values in FAA2 - After importing the two datasets separately, I saw there were 50 missing records in the FAA2 dataset, so I used an IF…THEN statement to remove those records from FAA2 before combining the two datasets into one. 3. Combining 2 datasets
  • 5. 4. Creating statistics summary for the Combined dataset - After combining the two datasets FAA1 and FAA2, I have 950 records in the combined dataset. In this dataset, duration and speed_air have the most missing values: duration has 150 and speed_air has 711. 5. Removing all duplicates in the Combined dataset
  • 6. - After removing all duplicates from the Combined dataset, we have 850 records left. We name the result “DUPLICATES_REMOVED”, a new data file that will be used for further steps. 6. Cleaning abnormal values - We use the criteria posted in the Project instruction to identify and remove abnormal values from the “DUPLICATES_REMOVED” dataset. In this step, we will not remove missing values for the duration and speed_air variables, because those missing values did not arise at random and a missing value is not the same as an abnormal value. For speed_air in particular, more than 700 cells are blank, so the blanks most likely carry meaning. I therefore created a new column named “value_condition” and set it to “missing” for records with missing values; records with complete values have a blank “value_condition”. After that, records whose values are considered abnormal are deleted from the dataset based on these criteria:
       + Duration: The duration of a normal flight should always be greater than 40 min.
       + Speed_ground: If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
       + Speed_air: If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
       + Height: The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
       + Distance: The length of the airport runway is typically less than 6000 feet.
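The criteria above can be sketched in Python. This is only an illustration, not the SAS code used in this project: the names `is_abnormal` and `clean` and the sample records are hypothetical, missing values are represented as `None` and kept (as described above), and the handling of values exactly on a boundary is an assumption.

```python
# Thresholds taken from the criteria listed above.
MIN_DURATION_MIN = 40
MIN_SPEED_MPH, MAX_SPEED_MPH = 30, 140
MIN_HEIGHT_M = 6
MAX_DISTANCE_FT = 6000


def is_abnormal(record):
    """Return True if any non-missing value violates a criterion.

    Missing values (None) never make a record abnormal.
    Boundary handling (<= vs <) is an assumption of this sketch.
    """
    duration = record.get("duration")
    if duration is not None and duration <= MIN_DURATION_MIN:
        return True
    for key in ("speed_ground", "speed_air"):
        speed = record.get(key)
        if speed is not None and not (MIN_SPEED_MPH <= speed <= MAX_SPEED_MPH):
            return True
    height = record.get("height")
    if height is not None and height < MIN_HEIGHT_M:
        return True
    distance = record.get("distance")
    if distance is not None and distance > MAX_DISTANCE_FT:
        return True
    return False


def clean(records):
    """Keep only the records that pass every criterion."""
    return [r for r in records if not is_abnormal(r)]


# Hypothetical records for illustration only (not taken from FAA1 or FAA2).
sample = [
    {"duration": 120, "speed_ground": 80, "speed_air": None, "height": 30, "distance": 1500},
    {"duration": 35, "speed_ground": 80, "speed_air": None, "height": 30, "distance": 1500},
    {"duration": None, "speed_ground": 145, "speed_air": 146, "height": 30, "distance": 5500},
    {"duration": 150, "speed_ground": 90, "speed_air": 95, "height": 3, "distance": 2000},
]
cleaned = clean(sample)
```

Only the first sample record survives: the second fails the duration rule, the third the speed rules, and the fourth the height rule. The missing speed_air in the first record does not cause its removal.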
  • 7. - After cleaning abnormal data, I have 831 records left and put those in another dataset named “Cleaned_final_data”. 7. Creating statistics summary after cleaning abnormal data
  • 8. - This step gives us an overview of how the distribution of the numeric variables in our dataset has changed after cleaning abnormal data. The table reports these statistics: sample size (N), number of missing records (N Miss), Minimum, Maximum, Mean, Median and Standard Deviation. Compared to the summary for the dataset before cleaning abnormal data, there are only very slight changes, since a very small number of records were removed (850 – 831 = 19 records removed). 8. Creating descriptive statistics and distribution histograms for all variables - Since the statistics summary for all variables is already given in the table above, I will not repeat those tables here. For this section, I include histograms that portray the distribution of each variable, so we can see which variables are normally distributed and which are not.
  • 9. - Duration is almost normally distributed. - No_pasg appears to be almost normally distributed. - Speed_ground appears to be normally distributed.
  • 10. - Speed_air’s distribution is right-skewed. - Height appears to be normally distributed.
  • 11. - Pitch is normally distributed. - Distance is strongly right-skewed. Conclusions on chapter 1:
  • 12. - Chapter 1 is the starting point of our project. It is basic and simple, but also crucial. In this first stage, I went through the data exploration and data understanding process. After importing the two datasets into SAS, I realized there were some missing records in the FAA2 dataset. Since these missing records could affect the dataset and the model fitting later on, I removed them from FAA2. I then combined the two datasets into one dataset called “Combined” and moved on to removing duplicates, which helps us avoid biased conclusions about the dataset. Then I cleaned abnormal values based on the criteria from the Project instruction. Like removing duplicates, removing abnormal values helps us develop a forecasting model that fits the dataset better and with less bias. Now that the data is cleaned and the data exploration steps are done, we can move on to an important step before fitting the model, Chapter 2: the correlation matrix and X-Y plots. Chapter 2: DATA VISUALIZATION AND DESCRIPTIVE STUDY Goal: - In this step we visualize the distribution of distance (our outcome variable) and the other independent variables, along with scatter plots and a correlation matrix between them. These visualization tools in SAS give us an overview of the relationships between distance and the other variables. This is also when we decide what to do with our categorical variable, and whether to keep all variables as predictors or drop those that are not statistically significant for the distance forecasting model. SAS codes, outputs and observations: 1. Creating boxplot showing distribution of distance of each aircraft
  • 13. - This shows how distance is distributed differently for airbus compared to boeing. The statistics shown in the boxplot for airbus are generally lower than those for boeing. 2. Creating dummy variable for aircraft
  • 14. - Based on the box plot we just created, whether an aircraft is airbus or boeing may carry information for forecasting the landing distance, so I created a dummy variable called ‘aircraft01’ for the ‘aircraft’ categorical variable. If the aircraft is airbus, its ‘aircraft01’ is 1; otherwise it is 0. The dummy variable is stored in a new dataset, which we will use for all of the regression analysis to develop our model. 3. Creating scatterplot matrix showing linear correlation between distance (outcome variable) and other variables
  • 15. - By visualizing the correlation between distance and the other variables, we can identify which variables are strongly correlated with distance and which are not. A variable that has a strong correlation with the outcome variable is a good candidate predictor, while a variable with little or no correlation with the outcome is a weak candidate for the forecasting model. The column I have framed here contains the scatterplots of distance against the other variables. Based on the matrix, aircraft01, speed_ground, speed_air, height, and pitch have moderate (height and pitch) to strong (aircraft01, speed_ground and speed_air) correlation with distance, while duration and no_pasg have no or very weak correlation with distance. From this scatterplot matrix, we get the initial idea of not using duration and no_pasg as predictors for the model. To confirm this reading for duration and no_pasg, we will create a correlation matrix showing the correlation coefficient and p-value of each variable with distance. 4. Creating correlation matrix showing how distance is correlated with other variables
  • 16. -
  • 17. - To compare the signs (positive/negative) of the relationships between the predicting variables and distance in the Marginal Analysis (from the correlation matrix) and the Joint Analysis (from the linear regression model), I created a comparison table based on the results above:
       Variable       Correlation Coeff.   Direction   p-value   Significant?
       Aircraft01     -0.23814             -           <0.0001   Yes
       Duration       -0.05138             -           0.1514    No
       No_pasg        -0.01776             -           0.6093    No
       Speed_ground    0.86624             +           <0.0001   Yes
       Speed_air       0.94210             +           <0.0001   Yes
       Height          0.09941             +           0.0041    Yes
       Pitch           0.08703             +           0.0121    Yes?
- Comments on comparison table: Pitch is marked “Yes?” because its p-value of 0.0121, although lower than 0.05, is much higher than those of the other significant variables, so I hesitate to call it clearly significant based on the p-value alone.
- While the scatterplot matrix is a great tool for visualizing the relationships among variables, the correlation matrix gives us the correlation coefficient for every pair of variables and the p-value for testing whether that correlation differs from zero.
  • 18. - How to read this matrix:
       + The first line of each box is the correlation coefficient between the two variables for that box.
       + The second line is the p-value for testing whether that correlation differs from zero. The null hypothesis is that the two variables are not correlated. A p-value smaller than 0.05 means we can reject the null hypothesis; otherwise we fail to reject it.
       + The third line is the sample size (number of observations).
- Observations for the framed row:
       + For correlation coefficients, duration and no_pasg have very low correlation coefficients with distance: -0.051 and -0.018, respectively.
       + For p-values, aircraft01, speed_ground, speed_air, height, and pitch have low p-values, which means their correlations with distance are statistically significant. Duration and no_pasg, on the other hand, have noticeably high p-values (0.1514 and 0.6093, respectively), which means we have no evidence that they are correlated with distance in this dataset.
- After seeing these correlation coefficients and p-values, I decided not to use duration and no_pasg as predictors in the linear regression model forecasting distance. Conclusions on chapter 2: - As mentioned previously, our goal in this chapter is to visualize and assess the correlations between distance, our outcome variable, and the other independent variables. The scatterplot matrix gives us a first impression of which variables are strongly correlated with distance and which are not. Those that are strongly correlated with distance are kept as predictors for the linear regression model that predicts landing distance. Those that are weakly correlated with distance are removed from the model. Here duration and no_pasg have weak correlations with distance.
- We cannot rely only on the scatterplots to decide whether to keep or drop a variable for the predicting model. This is where the correlation matrix comes in. According to this matrix, besides having low correlation coefficients with distance, duration and no_pasg are also the two variables with high p-values for their correlation with distance. A small p-value (typically smaller than 0.05) indicates strong evidence against the null hypothesis, so we reject the null when the p-value is small. A large p-value indicates weak evidence against the null hypothesis, so we fail to reject the null when the p-value is large. In this scenario, the null hypothesis for each independent variable is “The variable is not correlated with landing distance”. Duration and no_pasg have large p-values, which indicates that their correlations with distance are not statistically significant; therefore, I decided not to use these two variables as predictors for the model forecasting distance. - I will still keep Pitch as a predictor in my linear model. We can check later whether Pitch is actually significant by looking at its p-value in the fitted regression model.
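As a rough cross-check of the p-values in the correlation matrix, the test statistic for a Pearson correlation r computed from n observations is t = r * sqrt((n - 2) / (1 - r^2)). The Python sketch below is not the SAS procedure used in this project. It assumes n = 831 non-missing pairs for pitch and approximates the t distribution by the standard normal, which is reasonable only for large n; both are assumptions, and the function name `correlation_p_value` is hypothetical.

```python
import math


def correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: the true correlation is zero.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and a standard normal
    approximation to the t distribution (assumes large n).
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


# r comes from the comparison table; n = 831 is an assumption.
p_pitch = correlation_p_value(0.08703, 831)
print(round(p_pitch, 4))
```

Under these assumptions the result is about 0.012, close to the 0.0121 reported for pitch, which is consistent with the matrix testing whether the correlation is zero.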
  • 19. - The variables that will be used as predictors are: aircraft01, speed_ground, speed_air, height and pitch. - The box plot showing the distribution of distance for each type of aircraft tells us that distance differs depending on whether the aircraft is airbus or boeing. Therefore, it is important to keep the dummy variable ‘aircraft01’ as one of our predictors. Chapter 3: STATISTICAL REGRESSION MODELLING Goal: - The goal of this chapter is to develop a linear regression model that predicts the landing distance of an aircraft from these predictors: aircraft01 (aircraft01 = 1 if our aircraft is airbus and aircraft01 = 0 otherwise), speed_ground, speed_air, height and pitch. - Another goal in this stage is to check how well the model fits the dataset by evaluating Analysis of Variance measures such as R squared, Adjusted R squared, Root MSE, etc. SAS codes, outputs, and observations: 1. Building a linear regression model to predict distance
  • 20. - According to our Parameter Estimates, our linear regression model is given by this equation: Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air + 13.63*height – 4.05*pitch. (Note: ‘aircraft01’ = 1 if ‘aircraft’ = ‘airbus’ and ‘aircraft01’ = 0 otherwise) - The R squared and Adj. R squared of this model are high (R squared = 0.9738 and Adj. R squared = 0.9732).
  • 21. - The fit diagnostics plots and graphs here show that the residuals from the linear regression model are approximately normally distributed.
  • 22. - Observations from the linear regression model (Joint Analysis) and the comparison table for the correlation matrix (Marginal Analysis):
       + aircraft01, speed_air, and height have the same signs for their parameter estimates in the regression model as for their correlation coefficients in the correlation matrix.
       + aircraft01, speed_air, and height have low p-values in the regression model, which indicates they make a significant contribution to the model.
       + speed_ground and pitch have different signs for their parameter estimates in the regression model compared to their correlation coefficients in the correlation matrix.
       + speed_ground and pitch have relatively high p-values in the regression model, which indicates they do not make a significant contribution to the model.
       ➔ This is likely due to the collinearity between speed_ground and speed_air, and to the borderline p-value for pitch in the Marginal Analysis.
       ➔ Therefore, to test whether a model without speed_ground and pitch would be better, we run new code for the linear model:
       + Codes for a new linear model without speed_ground and pitch:
  • 23. - The new model has this equation: Distance = -5962.93 – 427.442*aircraft01 + 82.15*speed_air + 13.70*height
  • 24. - Looking at the output from the linear regression model with speed_ground and pitch removed, we have Adj. R squared = 0.9733, which is 0.0001 larger than the Adj. R squared of the original model. The p-values of all variables are also <0.0001, which indicates that all of them make a significant contribution to the model. Conclusions on chapter 3: - In this chapter we fitted a linear regression model to forecast the landing distance of an aircraft using these predictors: aircraft01, speed_ground, speed_air, height and pitch. - Based on the R squared and Adj. R squared values of this model (0.9738 and 0.9732), the model fits well: it explains about 97.4% of the variance in landing distance in the fitted data. - Because the signs of the parameter estimates for speed_ground and pitch in the regression model differ from their signs in the correlation matrix, and because of their high p-values in the regression model, I decided to try building a new linear model without speed_ground and pitch. - The result is an Adj. R squared that is 0.0001 higher for the new model than for the original one. - I would expect an even better forecasting model for this dataset to be an exponential model, which could capture the curved relationship between the predictors and distance.
  • 25. - But from what we have learned and discovered so far about this dataset, I think it would still be fine to use either my original linear model or the one with speed_ground and pitch removed to forecast landing distance. Since speed_ground has far more available data than speed_air, I would still prefer to use the original model for forecasting. SUMMARY - The linear regression model I decided to use for this project is: Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air + 13.63*height – 4.05*pitch. (Note: ‘aircraft01’ = 1 if ‘aircraft’ = ‘airbus’ and ‘aircraft01’ = 0 otherwise) - The Adjusted R squared of this model is 0.9732, which indicates a good fit: after adjusting for the number of predictors, the model explains about 97.3% of the variance in landing distance. - Speed_ground and pitch can be removed from the predictor group because their signs in the Joint Analysis differ from those in the Marginal Analysis and their p-values in the Joint Analysis are high. - Summarizing questions: 1. How many observations (flights) do you use to fit your final model? If not all 950 flights, why? - I used 831 observations to fit my final model. The number is not 950 because I removed duplicate and abnormal values from the dataset before fitting the model. These steps give us a cleaner, less biased dataset and a model that fits it well. 2. What factors and how they impact the landing distance of a flight? - Factors that impact the landing distance of a flight are: aircraft01 (aircraft01 = 1 if aircraft = airbus and = 0 otherwise), speed_ground, speed_air, height and pitch. - Factors that do not impact the landing distance of a flight are: duration and no_pasg. Based on their low correlation coefficients and large p-values with distance, duration and no_pasg are not statistically significant for the model of distance. Therefore, I did not use these two variables as predictors for my final model.
- Since speed_ground and speed_air are strongly correlated with each other, we should have dropped one of them. But since I did not know which one to drop before fitting the model, I decided to keep both. And because speed_air has only a small amount of available data, it did not greatly affect our result. - After fitting the model, I realized that speed_ground and pitch may not make a significant contribution to the model. ➔ Factors that impact the landing distance, as determined after fitting the model, are: aircraft01, speed_air and height. 3. Is there any difference between the two makes Boeing and Airbus?
  • 26. - Based on the box plot showing the distribution of landing distance by aircraft make, landing distance for airbus is slightly lower than landing distance for boeing. - Based on the final linear regression model: Distance = -5945.70 – 428.48*aircraft01 – 6.07*speed_ground + 88.23*speed_air + 13.63*height – 4.05*pitch. (Note: ‘aircraft01’ = 1 if ‘aircraft’ = ‘airbus’ and ‘aircraft01’ = 0 otherwise) ➔ Thanks to the dummy variable ‘aircraft01’, we can see that, holding the other predictors fixed, the predicted landing distance is 428.48 feet shorter for an airbus than for a boeing.
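The final equation can be written as a small Python function to make the role of the dummy variable concrete. This is a sketch, not the SAS output itself: the function name `predict_distance` and the input values are hypothetical, and only the coefficients come from the model above. Because the model has no interaction terms, switching aircraft01 from 0 to 1 while holding the other predictors fixed changes the prediction by exactly the aircraft01 coefficient.

```python
# Coefficients of the final linear model reported above.
COEFFICIENTS = {
    "intercept": -5945.70,
    "aircraft01": -428.48,
    "speed_ground": -6.07,
    "speed_air": 88.23,
    "height": 13.63,
    "pitch": -4.05,
}


def predict_distance(aircraft, speed_ground, speed_air, height, pitch):
    """Predicted landing distance from the fitted linear model."""
    # Dummy coding used in the report: 1 for airbus, 0 otherwise.
    aircraft01 = 1 if aircraft == "airbus" else 0
    c = COEFFICIENTS
    return (
        c["intercept"]
        + c["aircraft01"] * aircraft01
        + c["speed_ground"] * speed_ground
        + c["speed_air"] * speed_air
        + c["height"] * height
        + c["pitch"] * pitch
    )


# Hypothetical landing, for illustration only (not a record from the dataset).
landing = dict(speed_ground=100, speed_air=105, height=30, pitch=4)
boeing = predict_distance("boeing", **landing)
airbus = predict_distance("airbus", **landing)
difference = airbus - boeing  # equals the aircraft01 coefficient
```

For this hypothetical landing, the gap between the two makes is the same as for any other set of inputs, which is what the dummy-variable interpretation above relies on.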