Flight Landing Distance Study Using SAS

10/4/2017
Flight Landing Distance Study
Identify the factors affecting landing distance using SAS
BANA 6043 Project Report
Sarita Maharia
UCID – M12340569

Contents
Summary
Chapter 1 – Data exploration and data cleaning
Chapter 2 – Descriptive Study
Chapter 3 – Statistical Modeling
Chapter 4 – Model Validation
Chapter 5 – Remodeling and model validation
Questions from project pdf
Appendix

Summary
This project report details the steps taken to fit a linear model that predicts flight landing distance from the
input data. The combined input data contains 950 observations of 8 variables, which reduces to 850 observations
once duplicates are removed. The variable dictionary is provided in the appendix. Below is a summary of the steps
and the corresponding observations:
1. Data Cleaning –
a. Duplicates across the 2 input datasets are removed.
b. Observations with negative values of the height variable are deleted, as these are likely recording errors.
2. Descriptive Study – Analyze plots and correlation coefficients
a. Distance has a strong positive correlation with both speed_air and speed_ground, but the
plots show a slight curve, so transformations might be required.
b. The independent variables speed_air and speed_ground are strongly correlated with each other and
hence can distort the model if both are included.
3. Statistical Modeling – Fit model with all variables and cleaned data from step 1
a. Some regression coefficients change sign between the regressions of distance on individual
variables and the regression of distance on all variables together.
b. Speed_ground is removed to solve the issue in step a.
c. The significant variables are aircraft type, height and speed_air.
d. MAPE (Mean Absolute Percentage Error) is approx. 4% for this base model.
4. Model Validation – Validate model created in step 3
a. Residuals show a curved pattern and are not symmetric.
b. Residuals fail the normality test, although the t-test does not reject a zero mean.
c. This means the independent variables as is do not have a linear relationship with distance and
a transformation is required.
5. Remodeling and re-validation –
a. An alternate model is fit with transformed speed_air (squared).
b. The alternate model has a better adjusted R square and passes the residual validation criteria.
c. The alternate model also explains variability better than the base model and has a lower
MAPE.
d. The final model is listed below:
landing distance = -2031.73 + 405.21 * aircraft_type + 0.38 * speed_air**2 + 13.94 * height

Chapter 1 – Data exploration and data cleaning
Goals
Merge given datasets after understanding variables and eliminate duplicates.
Identify outliers and variables with missing values and treat the variables.
Check whether the minimum and maximum observations for a variable are logical.
SAS Code and Output
The two given datasets are imported and concatenated, and the result is saved in the combined_flights dataset. The
combined dataset has 950 rows in total.
A backup copy of the combined dataset is also taken.
/* import first dataset */
FILENAME REFFILE '/folders/myfolders/sasuser.v94/FAA1.xls';
PROC IMPORT DATAFILE=REFFILE DBMS=XLS OUT=project.data1;
GETNAMES=YES;
RUN;
/* remove extra rows that might be created because of spreadsheet import */
data project.data1_required;
set project.data1;
if not(cmiss(of aircraft distance duration height no_pasg pitch speed_air
speed_ground) eq 8);
run;
/* import FAA2 sheet */
FILENAME REFFILE '/folders/myfolders/sasuser.v94/FAA2.xls';
PROC IMPORT DATAFILE=REFFILE DBMS=XLS OUT=project.data2;
GETNAMES=YES;
RUN;
/* remove extra rows that might be created because of spreadsheet import */
data project.data2_required;
set project.data2;
if not(cmiss(of aircraft distance height no_pasg pitch speed_air speed_ground) eq 7);
run;
/* concatenate both datasets */
data project.combined_flights;
set project.data1_required project.data2_required;
run;
/* create copy of main dataset */
data project.combined_flights_copy;
set project.combined_flights;
run;

Check for duplicates in the combined dataset and remove duplicates if there are any
/* build frequency table to check for duplicates */
proc freq data=project.combined_flights;
tables aircraft*distance*height*no_pasg*pitch*speed_air*speed_ground / noprint
out=keylist;
run;
/* print duplicate rows from the frequency table created above */
proc print data=keylist;
where count ge 2;
run;
/* sort data on all variables so that duplicates can be deleted */
proc sort data=project.combined_flights out=project.combined_flights_sort;
by aircraft descending duration distance height no_pasg pitch speed_air speed_ground;
run;
/* dataset with unique values */
proc sort data=project.combined_flights_sort out=project.combined_flights_unique
nodupkey;
by aircraft distance height no_pasg pitch speed_air speed_ground;
run;
The frequency table has 850 rows while the combined dataset has 950, hence there are duplicates in the dataset.
Below are sample duplicate rows from the frequency table:
Code to find number of missing, mean, min and max of all variables:
proc means data=project.combined_flights_unique n nmiss min max mean;
run;
Almost 75% of the values of the speed_air variable are missing. The variable is retained because it is an
important predictor. Similarly, the duration variable is retained even though it has missing values.
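The share of missing values can also be computed directly rather than read off the PROC MEANS output. The following is a minimal sketch, assuming the project.combined_flights_unique dataset created above; the column aliases (n_rows, pct_missing_speed_air, pct_missing_duration) are illustrative names and not part of the original project.

```sas
/* sketch: percentage of missing values for the two variables with nulls */
proc sql;
    select count(*)                          as n_rows,
           nmiss(speed_air) / count(*) * 100 as pct_missing_speed_air format=6.1,
           nmiss(duration)  / count(*) * 100 as pct_missing_duration  format=6.1
    from project.combined_flights_unique;
quit;
```

NMISS counts the missing values of a column and COUNT(*) counts all rows, so each ratio is the missing share of that variable in percent.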

Code to find negative height values:
/* find rows of data that have negative heights. This looks to be wrong recording of data */
proc print data=project.combined_flights_unique;
where height < 0;
run;
Output:
Code to delete negative height values observations:
data project.combined_flights_updated;
set project.combined_flights_unique;
if height < 0 then delete;
run;
Now there are 845 observations:
Code to find levels of categorical variables:
/* find types of aircraft and their count in combined datatset */
proc freq data = project.combined_flights_updated nlevels ;
table aircraft;
run;
Output:
Find outliers and the distribution of all variables using the code below. Outliers are retained in the data as they
represent extreme conditions.
/* plot for outliers; the cleaned dataset from the previous step is used as input */
proc univariate data=project.combined_flights_updated plot;
run;

Observations
1. The input datasets had 950 observations. After data cleaning, the output dataset has 845 observations. The
following observations were removed:
a. 100 duplicate observations across the 2 datasets.
b. 5 rows with negative height values. These were probably recorded incorrectly.
2. Missing values in the output dataset are retained as is:
a. About 75% of the values of the speed_air variable are missing.
b. 50 observations have the duration variable missing.
3. There are many outliers for the distance variable, but these are retained as they represent possible overruns.
4. The two values of the categorical variable (aircraft) have almost the same number of rows.
Conclusion
1. 845 observations are present in the cleaned dataset after deleting duplicate observations and rows with
negative height.
2. Missing values and outliers in the variables are retained.

Chapter 2 – Descriptive Study
Goals
Understand correlation between different variables and analyze plots
SAS Code and Output
First create copy of input dataset and sort it. Also code the aircraft type so that it can be used in regression.
Code:
/* create copy of input dataset */
data project.flights_input;
set project.combined_flights_updated;
run;
/* sort input dataset by aircraft type */
Proc sort data=project.flights_input;
by aircraft;
run;
/* code aircraft type to dummy variables. airbus=0 and boeing=1 */
data project.flight_coded;
set project.flights_input;
if aircraft = "airbus" then aircraft_type=0;
else aircraft_type=1;
drop aircraft;
run;
Output: 845 observations in output dataset with aircraft coded as 0 for airbus and 1 for boeing.
Generate plots for all variables with distance variable to understand direction and shape of relation
proc plot data=project.flight_coded;
plot distance*duration;
run;
plot distance*height;
run;
plot distance*no_pasg;
run;
plot distance*pitch;
run;
plot distance*speed_air;
run;
plot distance*speed_ground;
run;

There is no recognizable pattern between distance and the other variables, except for the following:
1. Distance and speed_air have a positive relation with a slight curve. Also, there are no values below 90 for
speed_air, which means that the data is truncated.
2. Distance and speed_ground have a positive relation with a curve.
Find strength of correlation using below code:
proc corr data=project.flight_coded;
var _all_;
run;

Output: strong correlations are highlighted in yellow:
Since the plots of distance against speed_air and speed_ground show a slight curve, these variables are transformed
to obtain a linear relation and a higher correlation coefficient. Out of all transformations, the cubes of speed_air
and speed_ground give the maximum correlation coefficient.
Code:
/*possible transformations*/
data project.flights_coded_t;
set project.flight_coded;
speed_air2=speed_air**2;
speed_air3=speed_air**3;
speed_air12=sqrt(speed_air);
speed_airlog=log(speed_air);
speed_ground2=speed_ground**2;
speed_ground3=speed_ground**3;
speed_ground12=sqrt(speed_ground);
speed_groundlog=log(speed_ground);
run;
/*find correlations in transformed data*/
proc corr data=project.flights_coded_t;
var distance speed_air speed_air2 speed_air3 speed_air12 speed_airlog
speed_ground speed_ground2 speed_ground3 speed_ground12 speed_groundlog;
run;
/*verify plots for transformed data */
proc plot data=project.flights_coded_t;
plot distance*speed_air3;
run;
plot distance*speed_ground3;
run;

plot speed_ground3*speed_air3;
run;
Increased correlation coefficients are highlighted in yellow
Plots after doing transformations look linear:

Square of speed_ground and speed_air also have linear relationship

Observations
Distance has a strong positive correlation with speed_air and speed_ground, but the plots show a slight curve. The
relation looks linear after applying the square transformation.
The correlation coefficients of distance with speed_air and speed_ground increase after both variables are
squared. The squared variables also show a strong positive linear relation with each other.
Conclusion
The variables as is might not fit a linear model, and transformed speed variables might be needed to validate the
model, because the plots of speed_ground and speed_air against distance are curved.
Speed_air and speed_ground are highly collinear, which could distort the linear model.
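The collinearity noted above can also be quantified with variance inflation factors. The following is a minimal sketch, assuming the project.flight_coded dataset from this chapter; the original report does not include this step, and VIF and TOL are standard MODEL statement options of PROC REG.

```sas
/* sketch: variance inflation factor and tolerance for each predictor */
proc reg data=project.flight_coded;
    model distance = aircraft_type duration height no_pasg pitch
                     speed_air speed_ground / vif tol;
run;
quit;
```

A common rule of thumb treats a VIF well above 10 as a sign of problematic collinearity; if speed_air and speed_ground are as strongly correlated as the correlation output suggests, both would be expected to show large values here.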

Chapter 3 – Statistical Modeling
Goals
Fit a linear model to predict landing distance
SAS Code and Output
First try to identify parameters for regression between distance and individual independent variables using below
code:
proc reg data=project.flight_coded;
model distance=aircraft_type;
run;
model distance=duration;
run;
model distance=height;
run;
model distance=no_pasg;
run;
model distance=pitch;
run;
model distance=speed_air;
run;
model distance=speed_ground;
run;
Now identify parameters for regression between distance and all other variables
model distance=aircraft_type duration height no_pasg pitch speed_air speed_ground;
run;

Below is the summary output from the correlation and regression models run above:
The regression coefficients of duration, pitch and speed_ground (highlighted in yellow in the original output)
change sign when all variables are considered together. From the Chapter 2 conclusion, we know that there is a strong
correlation between speed_air and speed_ground. So, we need to remove the effect of collinearity among the
independent variables to fit the model properly.
One of speed_air and speed_ground has to be removed from the model. Speed_air has truncated data, which means
that observations with low speed_air are missing. The main purpose of this project is to identify scenarios for
overrun. Since there is a strong positive relation between speed_air and distance, the chances of overrun are higher
in high-speed scenarios, which the available speed_air data covers. Speed_air is therefore too important a variable
to drop. Hence, we keep the speed_air variable and drop speed_ground from the model.
Model without speed_ground:
model distance=aircraft_type duration height no_pasg pitch speed_air;
run;
Now insignificant variables (with p-value > 0.05) are removed from the model one by one, leaving the final
variables below:
model distance=aircraft_type height speed_air;
run;
Summary of correlation and regression output (distance vs each individual variable, and distance vs all variables):
Independent variable | Direction in plot             | Correlation coefficient | P-value corr coeff | Regression coeff (individual var) | P-value reg coeff (individual var) | Regression coeff (all var) | P-value reg coeff (all var)
aircraft type        | -                             | 0.238                   | <.0001             | 442.765                           | <.0001                             | 440.47015                  | <.0001
duration             | no visible relation           | -0.06197                | 0.0808             | -1.17686                          | 0.0808                             | 0.09881                    | 0.6258
height               | no visible relation           | 0.12306                 | 0.0003             | 11.40984                          | 0.0003                             | 13.93222                   | <.0001
no_pasg              | no visible relation           | -0.02778                | 0.42               | -3.4422                           | 0.42                               | -2.05743                   | 0.1545
pitch                | no visible relation           | 0.10294                 | 0.0027             | 180.88083                         | 0.0027                             | -3.60074                   | 0.8528
speed_air            | strong positive, little curve | 0.94728                 | <.0001             | 82.17473                          | <.0001                             | 87.61587                   | <.0001
speed_ground         | strong positive, little curve | 0.862                   | <.0001             | 41.96801                          | <.0001                             | -3.96633                   | 0.5562
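The one-by-one removal of insignificant variables described above was done manually in this report. As a cross-check, PROC REG can automate the same idea with backward elimination. The following is a minimal sketch, assuming the project.flight_coded dataset and the 0.05 threshold used above; it is not the procedure the report actually ran, and its selected variables may differ from the manual result.

```sas
/* sketch: backward elimination, dropping variables whose p-value exceeds 0.05 */
proc reg data=project.flight_coded;
    model distance = aircraft_type duration height no_pasg pitch speed_air
          / selection=backward slstay=0.05;
run;
quit;
```

SLSTAY sets the significance level a variable needs in order to stay in the model, which mirrors the p-value > 0.05 removal rule.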

Output: all variables and the model are significant. Almost 97% of the variability in the data is explained by the model.
Fit diagnostics show that the residuals are not random; they show a pattern.

Observations
The signs of some regression parameters change when the regression is run with all independent variables together.
Speed_ground is removed from the model as its collinearity with speed_air was affecting the regression
parameters of other variables.
Duration, no_pasg and pitch are not significant, hence they are removed from the model.
Fit diagnostics show that the residuals are not symmetric.
Conclusion
The linear model fits the data after removing non-significant variables, but the residual plots show a curved pattern.
Model has R square 95%. We need to run diagnostics to understand the residual behavior.

Chapter 4 – Model Validation
Goals
Analyze residual plot to check if it’s random and check if mean of residuals is zero
SAS Code and Output
Copy the residuals of the base model to a separate dataset using the code below:
proc reg data=project.flight_coded;
model distance=aircraft_type height speed_air / r;
output out=project.model1_residuals r=residual;
run;
Code to check the distribution of the residuals and to test the hypothesis that their mean is 0:
/* distribution is not normal as per Shapiro-Wilk test */
proc univariate data=project.model1_residuals normal plot;
var residual;
run;
/* null hypothesis of mean=0 is not rejected as p value is 1 */
proc ttest data=project.model1_residuals;
var residual;
run;
Distribution is not normal for residuals:
Residuals also fail normality test as highlighted p-value is less than 0.05

MAPE (Mean Absolute Percentage Error) is calculated using the code below; the value is 4.22%.
data project.model1_mape;
set project.model1_residuals;
err=abs(residual)/distance;
keep err;
run;
proc sql;
create table project.model1_mape_t as
select avg(err) as mape from project.model1_mape;
quit;
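For reference, the quantity computed by the code above can be written out as follows. The symbols are introduced here for illustration: y_i is the observed landing distance, \hat{y}_i the fitted value (so y_i - \hat{y}_i is the residual), and n the number of observations with a non-missing residual.

```latex
\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}
```

The code stores this average as a fraction; multiplying by 100 gives the percentage quoted in the text. The absolute value is taken of the residual only, which is safe here because landing distance is positive.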
Observations
Residuals are not symmetric and also fail the normality test, so the linear model is not a good fit. MAPE is 4.22%.
Conclusion
The linear model generated in the previous chapter is not a good fit. Transformations of the data are required, as
the residuals show a curved pattern.

Chapter 5 – Remodeling and model validation
Goals
Transform independent variables so that residuals are random and have normal distribution
Create alternative models and compare against base model to find best fit.
SAS Code and Output
Taking the model created in Chapter 3 as the base model, we now create an alternative model using the transformed
speed_air variable.
As per the Chapter 2 observations, the squared speed_air variable has a linear plot against distance. The squared
speed_air and speed_ground variables also have a strong positive linear relation with each other. Below is the code
to fit the linear model using the transformed variable:
proc reg data=project.flights_coded_t;
model distance=aircraft_type height speed_air2;
output out=project.model2_residuals r=residual;
run;
The model has an adjusted R square of 98.24%, which is slightly better than the base model. Below is the output:

Fit diagnostics show that residuals look random:
Here the residuals pass normality test too as highlighted in below figure:
Residuals have zero mean based on below hypothesis test. p-value is 1, so we can’t reject null hypothesis that
mean of residuals is 0.
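The report shows the diagnostic output for the alternate model but not the code behind it. The following is a minimal sketch of how the residual checks could be reproduced, mirroring the Chapter 4 steps; the dataset name project.model2_residuals is taken from the MAPE code in this chapter, and the rest is an assumption rather than the code the report actually ran.

```sas
/* sketch: save residuals of the alternate model */
proc reg data=project.flights_coded_t;
    model distance = aircraft_type height speed_air2;
    output out=project.model2_residuals r=residual;
run;
quit;

/* normality tests (including Shapiro-Wilk) and distribution plots */
proc univariate data=project.model2_residuals normal plot;
    var residual;
run;

/* one-sample t-test of the null hypothesis that the mean residual is 0 */
proc ttest data=project.model2_residuals h0=0;
    var residual;
run;
```

Note that residuals from a least-squares fit with an intercept average to zero by construction, so the t-test p-value of 1 is expected; the normality test and the residual plot are the more informative checks.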

MAPE is calculated as below and comes to 3.65%. It is a good estimate of the prediction error and is lower than
that of the base model.
data project.model2_mape;
set project.model2_residuals;
err=abs(residual)/distance;
keep err;
run;
proc sql;
create table project.model2_mape_t as
select avg(err) as mape from project.model2_mape;
quit;
Conclusion
The base model, created without any transformation, is not a good model based on the model diagnostics and fit tests.
The alternate model, created using the square of speed_air, gives the best fit in terms of R square, MAPE, zero
mean of residuals and normal distribution of residuals.
The significant variables for the alternate model are aircraft_type, speed_air**2 and height. Approx. 98% of the
variability in the data is explained by this model. It has a MAPE of 3.65%.
Questions from project pdf
How many observations (flights) do you use to fit your final model? If not all 950 flights, why?
831 observations are used after removing the rows below:
• 100 duplicate rows
• 5 rows with negative values of height
• 14 rows with abnormal observations for the variables defined for the project
However, the final model is fit using the speed_air variable, which has missing values. So, the model finally used
208 observations.
What factors and how they impact the landing distance of a flight?
Aircraft_type, speed_air (squared) and height affect landing distance as per the equation below:
landing distance = -2031.73 + 405.21 * aircraft_type + 0.38 * speed_air**2 + 13.94 * height
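To illustrate how the equation is applied, the following sketch evaluates it for one flight. The input values (speed_air of 100 and height of 30) are hypothetical and chosen only for illustration; aircraft_type = 1 denotes Boeing as per the coding in Chapter 2.

```sas
/* sketch: evaluate the final model for one hypothetical flight */
data _null_;
    aircraft_type = 1;    /* 1 = boeing, 0 = airbus */
    speed_air     = 100;  /* hypothetical value */
    height        = 30;   /* hypothetical value */
    distance = -2031.73 + 405.21*aircraft_type
               + 0.38*speed_air**2 + 13.94*height;
    put distance=;        /* writes the predicted distance to the log */
run;
```

For these inputs the terms are -2031.73 + 405.21 + 3800 + 418.2, so the predicted landing distance is 2591.68 in the units of the distance variable. With all else equal, the same flight on an Airbus (aircraft_type = 0) would be predicted to land 405.21 shorter.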
Is there any difference between the two makes Boeing and Airbus?
The mean landing distance of Boeing is more than the mean landing distance of Airbus as per the TTEST results below.
The 2 scenarios of overrun are from Boeing only, with speed_ground more than 135mph. There is a difference in the
mean landing distance of the two aircraft types even when the mean of speed_ground is the same.
Lower tail TTEST for landing distance: the null hypothesis that the mean landing distance of Airbus is greater than or
equal to that of Boeing is rejected. Hence, Boeing has the greater mean landing distance at the 5% significance level.
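The TTEST code is not included in the report. The following is a minimal sketch of a lower-tail two-sample test, assuming the cleaned dataset project.combined_flights_updated from Chapter 1 and assuming the aircraft values sort as airbus before boeing, so that the tested difference is Airbus minus Boeing.

```sas
/* sketch: lower-tail two-sample t-test of landing distance by aircraft make */
proc ttest data=project.combined_flights_updated sides=L alpha=0.05;
    class aircraft;   /* two levels: airbus and boeing */
    var distance;
run;
```

With SIDES=L the alternative hypothesis is that the difference (first class level minus second) is less than 0, that is, the mean Airbus distance is lower than the mean Boeing distance.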

Appendix
