10/4/2017
Flight Landing
Distance Study
Identify the factors affecting landing
distance using SAS
BANA 6043 Project Report
Sarita Maharia
UCID – M12340569
Flight Landing Distance Study by Sarita Maharia
Index
Contents
Summary
Chapter 1 – Data exploration and data cleaning
Chapter 2 – Descriptive Study
Chapter 3 – Statistical Modeling
Chapter 4 – Model Validation
Chapter 5 – Remodeling and model validation
Questions from project pdf
Appendix
Summary
This project report details the steps taken to fit a linear model that predicts flight landing distance from the
input data. The combined input data contains 950 observations of 8 variables, of which 850 remain after duplicates
are removed. The variable dictionary is provided in the appendix. Below is the summary of steps and corresponding
observations:
1. Data Cleaning –
a. Duplicates across the 2 input datasets are removed, and
b. Observations with negative values of the height variable are deleted, as these could be wrong recordings.
2. Descriptive Study – Analyze plots and correlation coefficients
a. Distance has a strong positive correlation with both speed_air and speed_ground, but the
plots show a slight curve, so transformations might be required.
b. The independent variables speed_air and speed_ground are strongly correlated with each other,
and hence can distort the model if both are included.
3. Statistical Modeling – Fit model with all variables and cleaned data from step 1
a. Some regression coefficients change sign between the regressions of distance on individual
variables and the regression of distance on all variables together.
b. Speed_ground is removed to resolve the issue in step a.
c. The significant variables are – aircraft type, height and speed_air.
d. MAPE (Mean Absolute Percentage Error) is approx. 4% for this base model.
4. Model Validation – Validate model created in step 3
a. Residuals show a curved pattern and are not symmetric.
b. Residuals fail the normality test, although the hypothesis that their mean is zero is not rejected.
c. This means the independent variables as they are do not have a linear relationship with distance,
and a transformation is required.
5. Remodeling and re-validation –
a. An alternate model is fit with transformed speed_air.
b. The alternate model has a better Adjusted R square and passes the residual validation criteria.
c. The alternate model also explains variability better than the base model and has a lower
MAPE.
d. The final model is listed below:
$$\text{landing distance} = -2031.73 + 405.21 \cdot \text{aircraft\_type} + 0.38 \cdot \text{speed\_air}^2 + 13.94 \cdot \text{height}$$
Chapter 1 – Data exploration and data cleaning
Goals
Merge given datasets after understanding variables and eliminate duplicates.
Identify outliers and variables with missing values and treat the variables.
Check whether the minimum and maximum observations for a variable are logical.
SAS Code and Output
Both given datasets are concatenated and the result is saved in the combined_flights dataset. The combined dataset
has 950 rows in total.
A backup of the dataset is taken.
/* import first dataset */
FILENAME REFFILE '/folders/myfolders/sasuser.v94/FAA1.xls';
PROC IMPORT DATAFILE=REFFILE DBMS=XLS OUT=project.data1;
GETNAMES=YES;
RUN;
/* remove extra rows that might be created because of spreadsheet import */
data project.data1_required;
set project.data1;
if not(cmiss(of aircraft distance duration height no_pasg pitch speed_air
speed_ground) eq 8);
run;
/* import FAA2 sheet */
FILENAME REFFILE '/folders/myfolders/sasuser.v94/FAA2.xls';
PROC IMPORT DATAFILE=REFFILE DBMS=XLS OUT=project.data2;
GETNAMES=YES;
RUN;
/* remove extra rows that might be created because of spreadsheet import */
data project.data2_required;
set project.data2;
if not(cmiss(of aircraft distance height no_pasg pitch speed_air speed_ground) eq 7);
run;
/* concatenate both datasets */
data project.combined_flights;
set project.data1_required project.data2_required;
run;
/* create copy of main dataset */
data project.combined_flights_copy;
set project.combined_flights;
run;
Check for duplicates in the combined dataset and remove duplicates if there are any
/* build frequency table to check for duplicates */
proc freq data=project.combined_flights;
tables aircraft*distance*height*no_pasg*pitch*speed_air*speed_ground / noprint
out=keylist;
run;
/* print duplicate rows */
proc print data=keylist;
where count ge 2;
run;
/* sort data on all variables so that duplicates can be deleted */
proc sort data=project.combined_flights out=project.combined_flights_sort;
by aircraft descending duration distance height no_pasg pitch speed_air speed_ground;
run;
/* dataset with unique values */
proc sort data=project.combined_flights_sort out=project.combined_flights_unique
nodupkey;
by aircraft distance height no_pasg pitch speed_air speed_ground;
run;
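As a quick cross-check on the duplicate count, the total number of rows can be compared with the number of distinct key combinations. This is only a sketch, assuming the same seven key variables used in the sort above; the column aliases total_rows and distinct_rows are illustrative names, not part of the original project.

```sas
/* sketch: compare total rows with distinct key combinations */
proc sql;
  /* total rows in the concatenated data */
  select count(*) as total_rows
    from project.combined_flights;
  /* rows that remain when the key variables are de-duplicated */
  select count(*) as distinct_rows
    from (select distinct aircraft, distance, height, no_pasg,
                 pitch, speed_air, speed_ground
            from project.combined_flights);
quit;
```

The difference between the two counts is the number of duplicate rows that the NODUPKEY sort should drop.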
The frequency table has 850 rows against 950 rows in the combined dataset, hence there are duplicates in the
dataset. Below are sample duplicate rows from the frequency table:
Code to find number of missing, mean, min and max of all variables:
proc means data=project.combined_flights_unique n nmiss min max mean;
run;
The speed_air variable has almost 75% missing values. It is retained because it is an important variable.
Similarly, the duration variable is retained even though it has missing values.
Code to find negative height values:
/* find rows of data that have negative heights. This looks to be wrong recording of data */
proc print data=project.combined_flights_unique;
where height < 0;
run;
Output:
Code to delete negative height values observations:
data project.combined_flights_updated;
set project.combined_flights_unique;
if height < 0 then delete;
run;
Now there are 845 observations:
Code to find levels of categorical variables:
/* find types of aircraft and their count in combined dataset */
proc freq data = project.combined_flights_updated nlevels ;
table aircraft;
run;
Output:
Find outliers and distribution for all variables using below code. Outliers are maintained in data as they represent
extreme conditions
/* plot for outliers */
proc univariate data=project.combined_flights_updated plot;
run;
Observations
1. The input datasets had 950 observations in total. After data cleaning, the output dataset has 845 observations.
The removed observations are:
a. 100 duplicate observations across the 2 datasets
b. 5 rows deleted because height is negative. These might have been recorded incorrectly.
2. Missing values in output dataset – these are retained as is
a. About 75% of the values are missing for the speed_air variable
b. 50 observations have the duration variable missing.
3. There are many outliers for the distance variable, but these are retained as they represent potential overruns.
4. Both aircraft types (the only categorical variable) have almost the same number of rows.
Conclusion
1. 845 observations are present in the cleaned dataset after deleting duplicate observations and negative
height rows.
2. Null values and the outliers in variables are retained.
Chapter 2 – Descriptive Study
Goals
Understand correlation between different variables and analyze plots
SAS Code and Output
First create copy of input dataset and sort it. Also code the aircraft type so that it can be used in regression.
Code:
/* create copy of input dataset */
data project.flights_input;
set project.combined_flights_updated;
run;
/* sort input dataset by aircraft type */
Proc sort data=project.flights_input;
by aircraft;
run;
/* code aircraft type to dummy variables. airbus=0 and boeing=1 */
data project.flight_coded;
set project.flights_input;
if aircraft = "airbus" then aircraft_type=0;
else aircraft_type=1;
drop aircraft;
run;
Output: 845 observations in output dataset with aircraft coded as 0 for airbus and 1 for boeing.
Generate plots for all variables with distance variable to understand direction and shape of relation
proc plot data=project.flight_coded;
plot distance*duration;
run;
proc plot data=project.flight_coded;
plot distance*height;
run;
proc plot data=project.flight_coded;
plot distance*no_pasg;
run;
proc plot data=project.flight_coded;
plot distance*pitch;
run;
proc plot data=project.flight_coded;
plot distance*speed_air;
run;
proc plot data=project.flight_coded;
plot distance*speed_ground;
run;
There is no recognizable pattern between distance and the other variables, except for the following:
1. Distance and speed_air have a positive relation with a slight curve. Also, there are no values below 90 for
speed_air, which means that the data is truncated.
2. Distance and speed_ground have a positive relation with a curve.
Find strength of correlation using below code:
proc corr data=project.flight_coded;
var _all_;
run;
Output: Strong correlations are highlighted in yellow:
Since the plots of distance against speed_air and speed_ground show a slight curve, these variables are transformed
to obtain a linear relation and a higher correlation coefficient. Of all the transformations tried, the cubes of
speed_air and speed_ground give the highest correlation coefficients.
Code:
/*possible transformations*/
data project.flights_coded_t;
set project.flight_coded;
speed_air2=speed_air**2;
speed_air3=speed_air**3;
speed_air12=sqrt(speed_air);
speed_airlog=log(speed_air);
speed_ground2=speed_ground**2;
speed_ground3=speed_ground**3;
speed_ground12=sqrt(speed_ground);
speed_groundlog=log(speed_ground);
run;
/*find correlations in transformed data*/
proc corr data=project.flights_coded_t;
var distance speed_air speed_air2 speed_air3 speed_air12 speed_airlog
speed_ground speed_ground2 speed_ground3 speed_ground12 speed_groundlog;
run;
/*verify plots for transformed data */
proc plot data=project.flights_coded_t;
plot distance*speed_air3;
run;
proc plot data=project.flights_coded_t;
plot distance*speed_ground3;
run;
proc plot data=project.flights_coded_t;
plot speed_ground3*speed_air3;
run;
Increased correlation coefficients are highlighted in yellow
Plots after doing transformations look linear:
Square of speed_ground and speed_air also have linear relationship
Observations
Distance has a strong positive correlation with speed_air and speed_ground, but the plots show a slight curve. The
relation looks linear after applying the square transformation.
The correlation coefficients of distance with speed_air and speed_ground increase after both variables are
squared. The squared variables also show a strong positive linear relation with each other.
Conclusion
The variables as they are might not fit a linear model, and we might need to use transformed speed variables,
because the plots of speed_ground and speed_air against distance are curved.
Speed_air and speed_ground are highly collinear, which could distort the linear model.
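One way to quantify the collinearity noted above is to request variance inflation factors from PROC REG. The sketch below assumes the coded dataset project.flight_coded from this chapter; the VIF and TOL options are standard MODEL statement options, but this check was not part of the original analysis.

```sas
/* sketch: variance inflation factors for the full set of predictors */
proc reg data=project.flight_coded;
  model distance = aircraft_type duration height no_pasg pitch
                   speed_air speed_ground / vif tol;
run;
quit;
```

A common rule of thumb treats a VIF well above 10 as a sign of problematic collinearity; large values for speed_air and speed_ground would support dropping one of the two.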
Chapter 3 – Statistical Modeling
Goals
Fit a linear model to predict landing distance
SAS Code and Output
First try to identify parameters for regression between distance and individual independent variables using below
code:
proc reg data=project.flight_coded;
model distance=aircraft_type;
run;
proc reg data=project.flight_coded;
model distance=duration;
run;
proc reg data=project.flight_coded;
model distance=height;
run;
proc reg data=project.flight_coded;
model distance=no_pasg;
run;
proc reg data=project.flight_coded;
model distance=pitch;
run;
proc reg data=project.flight_coded;
model distance=speed_air;
run;
proc reg data=project.flight_coded;
model distance=speed_ground;
run;
Now identify parameters for regression between distance and all other variables
proc reg data=project.flight_coded;
model distance=aircraft_type duration height no_pasg pitch speed_air speed_ground;
run;
Below is a summary of the output from the correlation and regression runs above:
The regression coefficients highlighted in yellow (duration, pitch and speed_ground) change sign when all variables
are considered together. From the Chapter 2 conclusion, we know that there is a strong correlation between speed_air
and speed_ground. So, we need to remove the effect of collinearity among the independent variables to fit the model
properly.
We need to select one of speed_air and speed_ground to remove from the model. Speed_air has truncated
data, which means that observations with low speed_air are missing. The main purpose of this project is to identify
scenarios for overrun. Since there is a strong positive relation between speed_air and distance, overrun is more
likely in high-speed scenarios, so speed_air is too important a variable to drop. Hence, we keep the speed_air
variable and drop speed_ground from our model.
Model without speed_ground:
proc reg data=project.flight_coded;
model distance=aircraft_type duration height no_pasg pitch speed_air;
run;
Now insignificant variables (with p-value > 0.05) are removed from the model one by one and below are the final
variables:
proc reg data=project.flight_coded;
model distance=aircraft_type height speed_air;
run;
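The one-by-one removal of insignificant variables described above can also be approximated with PROC REG's built-in backward elimination. This is a sketch of an alternative, not the procedure the report actually used; SELECTION=BACKWARD and SLSTAY= are standard MODEL statement options, and the 0.05 threshold mirrors the p-value cutoff stated above.

```sas
/* sketch: automated backward elimination at the 0.05 level */
proc reg data=project.flight_coded;
  model distance = aircraft_type duration height no_pasg pitch speed_air
        / selection=backward slstay=0.05;
run;
quit;
```

Note that PROC REG drops any observation with a missing value in a MODEL variable before selection starts, so the coefficients may differ slightly from a manual refit that regains rows once duration is removed.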
Independent variable | Direction (from plots) | Correlation coefficient | P-value (corr coeff) | Regression coeff (distance vs individual var) | P-value (distance vs individual var) | Regression coeff (distance vs all var) | P-value (distance vs all var)
aircraft type | | 0.238 | <.0001 | 442.765 | <.0001 | 440.47015 | <.0001
duration | no visible relation | -0.06197 | 0.0808 | -1.17686 | 0.0808 | 0.09881 | 0.6258
height | no visible relation | 0.12306 | 0.0003 | 11.40984 | 0.0003 | 13.93222 | <.0001
no_pasg | no visible relation | -0.02778 | 0.42 | -3.4422 | 0.42 | -2.05743 | 0.1545
pitch | no visible relation | 0.10294 | 0.0027 | 180.88083 | 0.0027 | -3.60074 | 0.8528
speed_air | strong positive, little curve | 0.94728 | <.0001 | 82.17473 | <.0001 | 87.61587 | <.0001
speed_ground | strong positive, little curve | 0.862 | <.0001 | 41.96801 | <.0001 | -3.96633 | 0.5562
Output: all variables and the model are significant. Almost 97% of the variability in the data is explained by the model.
Fit diagnostics show that the residuals are not random; they show a pattern.
Observations
The signs of some regression parameters change when the regression is run with all independent variables together.
Speed_ground is removed from the model because its collinearity with speed_air was affecting the regression
parameters of other variables.
No_pasg, duration and pitch are not significant, hence removed from the model.
Fit diagnostics show that residuals are not symmetric.
Conclusion
The linear model fits the data after removing non-significant variables, but the residual plots show a curved pattern.
Model has R square 95%. We need to run diagnostics to understand the residual behavior.
Chapter 4 – Model Validation
Goals
Analyze residual plot to check if it’s random and check if mean of residuals is zero
SAS Code and Output
Copy residuals in a separate dataset using below code
proc reg data=project.flight_coded;
model distance=aircraft_type height speed_air / r;
output out=project.model1_residuals r=residual;
run;
Code to check distribution and hypothesis for mean=0 of residuals
/* distribution is not normal as per Shapiro-Wilk test */
proc univariate data=project.model1_residuals normal plot;
var residual;
run;
/* null hypothesis of mean=0 is not rejected as p value is 1 */
proc ttest data=project.model1_residuals;
var residual;
run;
Distribution is not normal for residuals:
Residuals also fail normality test as highlighted p-value is less than 0.05
MAPE (Mean Absolute Percentage Error) is calculated using the code below; the value is 4.22%
data project.model1_mape;
set project.model1_residuals;
err=abs(residual)/distance;
keep err;
run;
proc sql;
create table project.model1_mape_t as
select avg(err) as mape from project.model1_mape;
quit;
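For reference, the quantity computed by the code above corresponds to the usual definition of MAPE, written here with $y_i$ for the observed landing distance, $\hat{y}_i$ for the fitted value and $n$ for the number of observations used in the fit (these symbols are introduced here for illustration only):

$$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}$$

The data step stores $\lvert y_i - \hat{y}_i \rvert / y_i$ in err, and avg(err) returns the mean as a fraction, so it has to be multiplied by 100 to be read as a percentage; for example, an average of 0.0422 corresponds to a MAPE of 4.22%.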
Observations
Residuals are not symmetric and also fail the normality test. So, the linear model is not a good fit. MAPE is 4.22%.
Conclusion
Linear model generated in previous chapter is not a good fit. Transformations are required on data as residuals
have pattern in form of curve.
Chapter 5 – Remodeling and model validation
Goals
Transform independent variables so that residuals are random and have normal distribution
Create alternative models and compare against base model to find best fit.
SAS Code and Output
Considering model created in Chapter 3 as base model, we will now create alternative model using transformed
speed_air variable.
As per Chapter 2 observations, transformed speed_air variable (after applying square) has linear plot with
distance. Transformed speed_air and speed_ground variables (after applying square) also have strong positive
linear relation. Below is the code to fit linear model using transformed variable:
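proc reg data=project.flights_coded_t;
model distance=aircraft_type height speed_air2 / r;
/* save residuals for the validation and MAPE steps */
output out=project.model2_residuals r=residual;
run;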
The model has an Adjusted R square of 98.24%, which is slightly better than the base model. Below is the output:
Fit diagnostics show that residuals look random:
Here the residuals pass normality test too as highlighted in below figure:
Residuals have zero mean based on below hypothesis test. p-value is 1, so we can’t reject null hypothesis that
mean of residuals is 0.
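The normality and zero-mean checks described above can be sketched in the same way as for the base model. The code below assumes the alternate model's residuals were written to project.model2_residuals by an OUTPUT statement in PROC REG (the dataset name that the MAPE code for this model reads); it mirrors the Chapter 4 checks rather than reproducing the report's exact code.

```sas
/* sketch: normality tests (including Shapiro-Wilk) on alternate-model residuals */
proc univariate data=project.model2_residuals normal plot;
  var residual;
run;

/* sketch: one-sample t-test of H0: mean residual = 0 */
proc ttest data=project.model2_residuals h0=0;
  var residual;
run;
```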
MAPE is calculated as below and comes to 3.65%. It is a useful measure of the model's prediction error and is lower
than that of the base model.
data project.model2_mape;
set project.model2_residuals;
err=abs(residual)/distance;
keep err;
run;
proc sql;
create table project.model2_mape_t as
select avg(err) as mape from project.model2_mape;
quit;
Conclusion
Base model created without any transformation is not a good model based on model diagnostics and fit test.
Alternate model created using speed_air square transformation gives best fit in terms of R square, MAPE, zero
mean of residuals and normal distributions of residuals.
The significant variables for the alternate model are - Aircraft_type, speed_air**2 and height. Approx 98% of
variability in data is explained using this model. It has MAPE of 3.65%.
Questions from project pdf
How many observations (flights) do you use to fit your final model? If not all 950 flights, why?
831 observations used after removing below rows:
• Duplicate 100 rows
• Rows with negative values of height – 5
• Rows with abnormal observations for each variable defined for the project – 14
However, the final model is fit using the speed_air variable, which has missing values. So the model finally used 208 observations.
Which factors impact the landing distance of a flight, and how?
Aircraft_type, speed_air and height affect landing distance as per the equation below:
$$\text{landing distance} = -2031.73 + 405.21 \cdot \text{aircraft\_type} + 0.38 \cdot \text{speed\_air}^2 + 13.94 \cdot \text{height}$$
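To illustrate how the equation is applied, the sketch below evaluates it for one hypothetical landing. The input values (a Boeing, coded 1, with speed_air of 100 and height of 30) are made up for illustration and are not taken from the dataset; distance_hat is an illustrative variable name.

```sas
/* sketch: evaluate the final model for one hypothetical landing */
data _null_;
  aircraft_type = 1;    /* hypothetical input: boeing is coded as 1 */
  speed_air     = 100;  /* hypothetical input */
  height        = 30;   /* hypothetical input */
  distance_hat = -2031.73 + 405.21*aircraft_type
                 + 0.38*speed_air**2 + 13.94*height;
  put distance_hat=;    /* writes the predicted distance to the log */
run;
```

By hand, the terms are -2031.73 + 405.21 + 3800 + 418.2, which gives a predicted landing distance of 2591.68. Setting aircraft_type to 0 (Airbus) with the same speed and height lowers the prediction by 405.21.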
Is there any difference between the two makes Boeing and Airbus?
The mean landing distance of Boeing is greater than the mean landing distance of Airbus as per the TTEST results
below. The 2 overrun scenarios are both from Boeing, with speed_ground above 135 mph. The mean landing distance
differs between the aircraft types even when the mean of speed_ground is the same.
Lower-tail TTEST for landing distance: the null hypothesis that the mean landing distance of Airbus is greater than
or equal to that of Boeing is rejected. Hence, Boeing has the greater mean landing distance at the 5% significance level.
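A lower-tail two-sample test of this kind can be sketched as follows. This assumes the cleaned dataset project.combined_flights_updated with aircraft taking the values airbus and boeing, so that PROC TTEST forms the difference as airbus minus boeing; SIDES=L is the standard option for a lower one-sided test. The report's exact call is not shown in the source, so this is only an approximation of it.

```sas
/* sketch: lower-tail test of H0: mean(airbus) - mean(boeing) >= 0 */
proc ttest data=project.combined_flights_updated sides=L alpha=0.05;
  class aircraft;   /* two groups, assumed to sort as airbus, boeing */
  var distance;
run;
```

A small one-sided p-value here rejects the null hypothesis in favor of Airbus having the lower mean landing distance.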
Appendix

Global Situational Awareness of A.I. and where its headed
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 

Flight Landing Distance Study Using SAS

  • 1. 10/4/2017 Flight Landing Distance Study Identify the factors affecting landing distance using SAS BANA 6043 Project Report Sarita Maharia UCID – M12340569
  • 2. Flight Landing Distance Study by Sarita Maharia. Index: Summary; Chapter 1 – Data exploration and data cleaning; Chapter 2 – Descriptive Study; Chapter 3 – Statistical Modeling; Chapter 4 – Model Validation; Chapter 5 – Remodeling and model validation; Questions from project pdf; Appendix
  • 3. Summary
    This project report details the steps taken to fit a linear model that predicts flight landing distance from the input data. The dataset contains 850 observations of 8 variables. The variable dictionary is provided in the appendix. Below is a summary of the steps and the corresponding observations:
    1. Data Cleaning
       a. Duplicates across the 2 input datasets are removed.
       b. Observations with negative height are deleted, as these are likely recording errors.
    2. Descriptive Study – analyze plots and correlation coefficients
       a. Distance has a strong positive correlation with both speed_air and speed_ground, but the plots show a slight curve, so transformations might be required.
       b. The independent variables speed_air and speed_ground are strongly correlated with each other, so including both can distort the model.
    3. Statistical Modeling – fit a model with all variables on the cleaned data from step 1
       a. Some regression coefficients change sign between the single-variable regressions and the regression of distance on all variables together.
       b. Speed_ground is removed to resolve the issue in step a.
       c. The significant variables are aircraft type, height and speed_air.
       d. MAPE (mean absolute percentage error) is approximately 4% for this base model.
    4. Model Validation – validate the model created in step 3
       a. Residuals show a curved pattern and are not symmetric.
       b. Residuals fail the normality test.
       c. This means the untransformed independent variables do not have a linear relationship with distance, and a transformation is required.
    5. Remodeling and re-validation
       a. An alternate model is fitted with transformed speed_air.
       b. The alternate model has a better adjusted R-square and passes the residual validation criteria.
       c. The alternate model also explains variability better than the base model and has a lower MAPE.
       d. The final model is:
          $\text{landing distance} = -2031.73 + 405.21 \cdot \text{aircraft\_type} + 0.38 \cdot \text{speed\_air}^2 + 13.94 \cdot \text{height}$
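As a rough illustration of how the final model in the summary could be applied, the DATA step below scores one flight. The input values (a Boeing with speed_air of 100 and height of 30) and the dataset name work.predicted are hypothetical and are not taken from the report's data; only the coefficients come from the report.

```sas
/* Sketch: apply the report's final equation to one hypothetical flight. */
data work.predicted;
  aircraft_type = 1;    /* coding from Chapter 2: 0 = airbus, 1 = boeing */
  speed_air     = 100;  /* hypothetical value */
  height        = 30;   /* hypothetical value */

  /* coefficients as listed in the summary */
  pred_distance = -2031.73
                  + 405.21 * aircraft_type
                  + 0.38   * speed_air**2
                  + 13.94  * height;

  put pred_distance=;   /* writes pred_distance=2591.68 to the log */
run;
```

Because speed_air enters as a square, the predicted distance grows faster at higher speeds, which is consistent with the curved plots described in Chapter 2.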
  • 4. Chapter 1 – Data exploration and data cleaning
    Goals
    Merge the given datasets after understanding the variables, and eliminate duplicates.
    Identify outliers and variables with missing values, and treat the variables.
    Check whether the minimum and maximum observations for each variable are logical.
    SAS Code and Output
    Both given datasets are concatenated and the result is saved in the combined_flights dataset. The combined dataset has 950 rows in total. A backup copy of the dataset is taken.
      /* import first dataset */
      FILENAME REFFILE '/folders/myfolders/sasuser.v94/FAA1.xls';
      PROC IMPORT DATAFILE=REFFILE DBMS=XLS OUT=project.data1;
        GETNAMES=YES;
      RUN;
      /* remove extra rows that might be created because of spreadsheet import */
      data project.data1_required;
        set project.data1;
        if not(cmiss(of aircraft distance duration height no_pasg pitch speed_air speed_ground) eq 8);
      run;
      /* import FAA2 sheet */
      FILENAME REFFILE '/folders/myfolders/sasuser.v94/FAA2.xls';
      PROC IMPORT DATAFILE=REFFILE DBMS=XLS OUT=project.data2;
        GETNAMES=YES;
      RUN;
      /* remove extra rows that might be created because of spreadsheet import */
      data project.data2_required;
        set project.data2;
        if not(cmiss(of aircraft distance height no_pasg pitch speed_air speed_ground) eq 7);
      run;
      /* concatenate both datasets */
      data project.combined_flights;
        set project.data1_required project.data2_required;
      run;
      /* create copy of main dataset */
      data project.combined_flights_copy;
        set project.combined_flights;
      run;
  • 5. Check for duplicates in the combined dataset and remove any that are found.
      /* build frequency table to check for duplicates */
      proc freq data=project.combined_flights;
        tables aircraft*distance*height*no_pasg*pitch*speed_air*speed_ground / noprint out=keylist;
      run;
      /* print duplicate rows (keylist is the frequency table created above) */
      proc print data=keylist;
        where count ge 2;
      run;
      /* sort data on all variables so that duplicates can be deleted */
      proc sort data=project.combined_flights out=project.combined_flights_sort;
        by aircraft descending duration distance height no_pasg pitch speed_air speed_ground;
      run;
      /* dataset with unique values */
      proc sort data=project.combined_flights_sort out=project.combined_flights_unique nodupkey;
        by aircraft distance height no_pasg pitch speed_air speed_ground;
      run;
    The frequency table has 850 rows against 950 rows in the combined dataset, so duplicates are present. Sample duplicate rows from the frequency table: [table shown in the original slide]
    Code to find the number of non-missing and missing values, minimum, maximum and mean of all variables:
      proc means data=project.combined_flights_unique n nmiss min max mean;
      run;
    The speed_air variable has almost 75% missing values. It is retained because it is an important variable. Similarly, the duration variable is retained even though it has missing values.
  • 6. Code to find negative height values:
      /* find rows of data that have negative heights. This looks to be wrong recording of data */
      proc print data=project.combined_flights_unique;
        where height < 0;
      run;
    Output: [listing of rows with negative height, shown in the original slide]
    Code to delete observations with negative height:
      data project.combined_flights_updated;
        set project.combined_flights_unique;
        if height < 0 then delete;
      run;
    Now there are 845 observations.
    Code to find the levels of the categorical variable:
      /* find types of aircraft and their count in combined dataset */
      proc freq data=project.combined_flights_updated nlevels;
        table aircraft;
      run;
    Output: [frequency table of aircraft, shown in the original slide]
    Find outliers and the distribution of all variables using the code below. Outliers are kept in the data because they represent extreme conditions.
      /* plot for outliers */
      proc univariate data=project.combined_flights_t2 plot;
      run;
  • 7. Observations
    1. The input datasets had 950 observations. After data cleaning, the output dataset has 845 observations. The removed observations are:
       a. 100 duplicate observations across the 2 datasets.
       b. 5 rows deleted because height is negative. These were probably recorded incorrectly.
    2. Missing values in the output dataset are retained as is:
       a. About 75% of the values of speed_air are missing.
       b. 50 observations have a missing duration.
    3. There are many outliers for the distance variable, but these are retained because they represent possible overruns.
    4. The two levels of the categorical variable (aircraft) have almost the same number of rows.
    Conclusion
    1. 845 observations remain in the cleaned dataset after deleting duplicate observations and rows with negative height.
    2. Missing values and outliers are retained.
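The duplicate count reported above can be cross-checked without sorting. The sketch below is one possible way to do it, not the report's own method; it assumes the libref project and the dataset combined_flights from Chapter 1 already exist.

```sas
/* Sketch: count rows before and after de-duplication with PROC SQL. */
proc sql;
  /* total rows in the concatenated dataset */
  select count(*) as n_rows_total
  from project.combined_flights;

  /* DISTINCT over the same seven variables used in the NODUPKEY BY list */
  select count(*) as n_rows_unique
  from (select distinct aircraft, distance, height, no_pasg, pitch,
               speed_air, speed_ground
        from project.combined_flights);
quit;
```

If the inputs match those used in the report, the two counts should agree with the 950 and 850 rows reported in this chapter. Duration is left out of the key on purpose, mirroring the report's BY list, because the second input file is filtered without that variable.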
  • 8. Chapter 2 – Descriptive Study
    Goals
    Understand the correlation between the variables and analyze the plots.
    SAS Code and Output
    First create a copy of the input dataset and sort it. Also code the aircraft type as a numeric variable so that it can be used in regression.
      /* create copy of input dataset */
      data project.flights_input;
        set project.combined_flights_updated;
      run;
      /* sort input dataset by aircraft type */
      proc sort data=project.flights_input;
        by aircraft;
      run;
      /* code aircraft type as a dummy variable: airbus=0 and boeing=1 */
      data project.flight_coded;
        set project.flights_input;
        if aircraft = "airbus" then aircraft_type=0;
        else aircraft_type=1;
        drop aircraft;
      run;
    Output: 845 observations in the output dataset, with aircraft coded as 0 for airbus and 1 for boeing.
    Generate plots of distance against every other variable to understand the direction and shape of each relation:
      proc plot data=project.flight_coded; plot distance*duration; run;
      proc plot data=project.flight_coded; plot distance*height; run;
      proc plot data=project.flight_coded; plot distance*no_pasg; run;
      proc plot data=project.flight_coded; plot distance*pitch; run;
      proc plot data=project.flight_coded; plot distance*speed_air; run;
      proc plot data=project.flight_coded; plot distance*speed_ground; run;
  • 9. There is no recognizable pattern between distance and the other variables, except for the following:
    1. Distance and speed_air have a positive relation with a slight curve. Also, there are no speed_air values below 90, which means the data is truncated.
    2. Distance and speed_ground have a positive relation with a curve.
    Find the strength of the correlations using the code below:
      proc corr data=project.flight_coded;
        var _all_;
      run;
  • 10. Output: [correlation matrix; strong correlations are highlighted in yellow in the original slide]
    Since the plots of distance against speed_air and speed_ground are slightly curved, these variables are transformed to obtain a linear relation and a higher correlation coefficient. Of all the transformations tried, the cubes of speed_air and speed_ground give the highest correlation coefficients.
      /* possible transformations */
      data project.flights_coded_t;
        set project.flight_coded;
        speed_air2=speed_air**2;
        speed_air3=speed_air**3;
        speed_air12=sqrt(speed_air);
        speed_airlog=log(speed_air);
        speed_ground2=speed_ground**2;
        speed_ground3=speed_ground**3;
        speed_ground12=sqrt(speed_ground);
        speed_groundlog=log(speed_ground);
      run;
      /* find correlations in transformed data */
      proc corr data=project.flights_coded_t;
        var distance speed_air speed_air2 speed_air3 speed_air12 speed_airlog
            speed_ground speed_ground2 speed_ground3 speed_ground12 speed_groundlog;
      run;
      /* verify plots for transformed data */
      proc plot data=project.flights_coded_t; plot distance*speed_air3; run;
      proc plot data=project.flights_coded_t; plot distance*speed_ground3; run;
      proc plot data=project.flights_coded_t; plot speed_ground3*speed_air3; run;
  • 11. Output: [correlation table; increased correlation coefficients are highlighted in yellow in the original slide]
    Plots after the transformations look linear: [plots shown in the original slide]
  • 12. The squares of speed_ground and speed_air also have a linear relationship: [plot shown in the original slide]
  • 13. Observations
    Distance has a strong positive correlation with speed_air and speed_ground, but the plots are slightly curved. The relation looks linear after applying the square transformation.
    The correlation coefficients of distance with speed_air and speed_ground increase after both variables are squared. The two squared variables also show a strong positive linear relation with each other.
    Conclusion
    The untransformed variables might not fit a linear model, and transformed speed variables may be needed, because the plots of speed_ground and speed_air against distance are curved.
    Speed_air and speed_ground are highly collinear, which could affect the linear model.
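The collinearity noted in the conclusion above could also be quantified with variance inflation factors. The sketch below is an optional diagnostic that the report itself does not run; it assumes the dataset project.flight_coded created earlier in this chapter and uses the VIF and TOL options of the MODEL statement in PROC REG.

```sas
/* Sketch: variance inflation factors for all candidate predictors. */
proc reg data=project.flight_coded;
  model distance = aircraft_type duration height no_pasg pitch
                   speed_air speed_ground / vif tol;
run;
quit;
```

A common rule of thumb treats a VIF above about 10 as a sign of problematic collinearity; given the strong correlation reported between speed_air and speed_ground, both would be expected to stand out. Note that PROC REG drops observations with any missing predictor, so this fit only uses rows where speed_air is recorded.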
  • 14. Chapter 3 – Statistical Modeling
    Goals
    Fit a linear model to predict landing distance.
    SAS Code and Output
    First estimate the regression parameters of distance on each independent variable individually:
      proc reg data=project.flight_coded; model distance=aircraft_type; run;
      proc reg data=project.flight_coded; model distance=duration; run;
      proc reg data=project.flight_coded; model distance=height; run;
      proc reg data=project.flight_coded; model distance=no_pasg; run;
      proc reg data=project.flight_coded; model distance=pitch; run;
      proc reg data=project.flight_coded; model distance=speed_air; run;
      proc reg data=project.flight_coded; model distance=speed_ground; run;
    Now estimate the regression parameters of distance on all variables together:
      proc reg data=project.flight_coded;
        model distance=aircraft_type duration height no_pasg pitch speed_air speed_ground;
      run;
  • 15. Below is the summary output from the correlation and regression models run above:
    Variable | Direction | Corr. coeff. | p-value (corr.) | Reg. coeff., distance vs individual variable | p-value, individual | Reg. coeff., distance vs all variables | p-value, all
    aircraft type | - | 0.238 | <.0001 | 442.765 | <.0001 | 440.47015 | <.0001
    duration | no visible relation | -0.06197 | 0.0808 | -1.17686 | 0.0808 | 0.09881 | 0.6258
    height | no visible relation | 0.12306 | 0.0003 | 11.40984 | 0.0003 | 13.93222 | <.0001
    no_pasg | no visible relation | -0.02778 | 0.42 | -3.4422 | 0.42 | -2.05743 | 0.1545
    pitch | no visible relation | 0.10294 | 0.0027 | 180.88083 | 0.0027 | -3.60074 | 0.8528
    speed_air | strong positive, slight curve | 0.94728 | <.0001 | 82.17473 | <.0001 | 87.61587 | <.0001
    speed_ground | strong positive, slight curve | 0.862 | <.0001 | 41.96801 | <.0001 | -3.96633 | 0.5562
    The coefficients of duration, pitch and speed_ground (highlighted in yellow in the original slide) change sign when all variables are considered together. From the Chapter 2 conclusion, speed_air and speed_ground are strongly correlated, so the effect of collinearity among the independent variables must be removed to fit the model properly.
    One of speed_air and speed_ground has to be removed from the model. Speed_air has truncated data, which means observations with low speed_air are missing. The main purpose of this project is to identify overrun scenarios. Since there is a strong positive relation between speed_air and distance, overruns are more likely in high-speed scenarios, and speed_air is too important a variable to drop. Hence, speed_air is kept and speed_ground is dropped from the model.
    Model without speed_ground:
      proc reg data=project.flight_coded;
        model distance=aircraft_type duration height no_pasg pitch speed_air;
      run;
    Insignificant variables (p-value > 0.05) are then removed from the model one at a time, leaving the final variables below:
      proc reg data=project.flight_coded;
        model distance=aircraft_type height speed_air;
      run;
  • 16. Output: all variables and the overall model are significant. Almost 97% of the variability in the data is explained by the model. [regression output shown in the original slide]
    Fit diagnostics show that the residuals are not random and follow a pattern. [diagnostic plots shown in the original slide]
  • 17. Observations
    The signs of some regression parameters change when the regression is run with all independent variables together.
    Of speed_ground and speed_air, speed_ground is removed from the model because its collinearity with speed_air was affecting the regression parameters of the other variables.
    No_pasg and duration are not significant and are therefore removed from the model.
    Fit diagnostics show that the residuals are not symmetric.
    Conclusion
    The linear model fits the data after removing non-significant variables, but the residual plots show a curved pattern. The model has an R-square of 95%. Diagnostics are needed to understand the residual behavior.
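The one-at-a-time removal of insignificant variables described in this chapter resembles backward elimination, which PROC REG can automate. The sketch below is an alternative to the manual procedure, not the report's own code; it assumes project.flight_coded exists and uses a 0.05 stay threshold to match the p-value cutoff stated in the report.

```sas
/* Sketch: automated backward elimination, starting from the model
   without speed_ground. */
proc reg data=project.flight_coded;
  model distance = aircraft_type duration height no_pasg pitch speed_air
        / selection=backward slstay=0.05;
run;
quit;
```

Automated selection is not guaranteed to reproduce a manual sequence exactly, so the retained variables should be compared against the final model reported above (aircraft_type, height and speed_air).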
  • 18. Chapter 4 – Model Validation
    Goals
    Analyze the residual plot to check whether the residuals are random, and check whether the mean of the residuals is zero.
    SAS Code and Output
    Copy the residuals of the base model from Chapter 3 into a separate dataset using the code below:
      proc reg data=project.flight_coded;
        model distance=aircraft_type height speed_air / r;
        output out=project.model1_residuals r=residual;
      run;
    Code to check the distribution of the residuals and to test the hypothesis that their mean is 0:
      /* distribution is not normal as per Shapiro-Wilk test */
      proc univariate data=project.model1_residuals normal plot;
        var residual;
      run;
      /* null hypothesis of mean=0 is not rejected as p value is 1 */
      proc ttest data=project.model1_residuals;
        var residual;
      run;
    The distribution of the residuals is not normal: [distribution plots shown in the original slide]
    The residuals also fail the normality test, as the highlighted p-value is less than 0.05. [test output shown in the original slide]
  • 19. MAPE (mean absolute percentage error) is calculated using the code below, and its value is 4.22%.
      data project.model1_mape;
        set project.model1_residuals;
        err=abs(residual)/distance;
        keep err;
      run;
      proc sql;
        create table project.model1_mape_t as
        select avg(err) as mape
        from project.model1_mape;
      quit;
    Observations
    The residuals are not symmetric and also fail the normality test, so the linear model is not a good fit. MAPE is 4.22%.
    Conclusion
    The linear model generated in the previous chapter is not a good fit. Transformations of the data are required because the residuals follow a curved pattern.
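For reference, the MAPE used here is the average of |residual| / distance over the fitted observations, expressed as a percentage. The sketch below computes it in a single PROC SQL step as an alternative to the two-step approach above; it assumes the residual dataset project.model1_residuals exists and that distance is always positive.

```sas
/* Sketch: MAPE in one step, reported directly as a percentage. */
proc sql;
  select mean(abs(residual) / distance) * 100 as mape_pct format=8.2
  from project.model1_residuals
  where residual is not missing;   /* rows dropped by PROC REG have no residual */
quit;
```

The WHERE clause only makes the handling of missing residuals explicit; the mean would skip them in any case.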
  • 20. Chapter 5 – Remodeling and model validation
    Goals
    Transform the independent variables so that the residuals are random and normally distributed.
    Create alternative models and compare them against the base model to find the best fit.
    SAS Code and Output
    Taking the model created in Chapter 3 as the base model, an alternative model is now created using the transformed speed_air variable. As per the Chapter 2 observations, squared speed_air has a linear plot against distance. Squared speed_air and squared speed_ground also have a strong positive linear relation with each other.
    Below is the code to fit a linear model using the transformed variable:
      proc reg data=project.flights_coded_t;
        model distance=aircraft_type height speed_air2;
      run;
    The model has an adjusted R-square of 98.24%, which is slightly better than the base model. Below is the output: [regression output shown in the original slide]
  • 21. Fit diagnostics show that the residuals look random: [diagnostic plots shown in the original slide]
    The residuals also pass the normality test, as highlighted in the original figure. [test output shown in the original slide]
    The residuals have zero mean based on the hypothesis test: the p-value is 1, so we cannot reject the null hypothesis that the mean of the residuals is 0. [test output shown in the original slide]
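The code behind these residual checks is not shown for the alternate model. The sketch below mirrors the Chapter 4 steps and is only one way the checks could have been run. It assumes project.flights_coded_t from Chapter 2, and it writes the residuals to project.model2_residuals, the dataset name read by the MAPE step in the next slide.

```sas
/* Sketch: save residuals of the alternate model, then test them. */
proc reg data=project.flights_coded_t;
  model distance = aircraft_type height speed_air2;
  output out=project.model2_residuals r=residual;
run;
quit;

/* normality tests (including Shapiro-Wilk) and distribution plots */
proc univariate data=project.model2_residuals normal plot;
  var residual;
run;

/* one-sample t-test of H0: mean residual = 0 */
proc ttest data=project.model2_residuals h0=0;
  var residual;
run;
```

For a least-squares fit with an intercept the residual mean is zero by construction, so the t-test is mainly a sanity check; the normality test and the residual plots carry the real diagnostic information.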
  • 22. MAPE is calculated as below and comes to 3.65%. It is a good estimate of the model's error and is lower than that of the base model.
      data project.model2_mape;
        set project.model2_residuals;
        err=abs(residual)/distance;
        keep err;
      run;
      proc sql;
        create table project.model2_mape_t as
        select avg(err) as mape
        from project.model2_mape;
      quit;
    Conclusion
    The base model, created without any transformation, is not a good model based on the model diagnostics and fit tests. The alternate model, created using the square of speed_air, gives the best fit in terms of R-square, MAPE, zero mean of residuals and normal distribution of residuals.
    The significant variables for the alternate model are aircraft_type, speed_air**2 and height. Approximately 98% of the variability in the data is explained by this model. It has a MAPE of 3.65%.
    Questions from project pdf
    How many observations (flights) do you use to fit your final model? If not all 950 flights, why?
    831 observations are used after removing the rows below:
    • Duplicate rows – 100
    • Rows with negative values of height – 5
    • Rows with abnormal observations for each variable defined for the project – 14
    However, the final model is fitted using the speed_air variable, which has missing values. So the model finally used 208 observations.
    What factors and how they impact the landing distance of a flight?
    Aircraft_type, speed_air (squared) and height affect landing distance as per the equation below:
    $\text{landing distance} = -2031.73 + 405.21 \cdot \text{aircraft\_type} + 0.38 \cdot \text{speed\_air}^2 + 13.94 \cdot \text{height}$
    Is there any difference between the two makes Boeing and Airbus?
    The mean landing distance of Boeing is greater than the mean landing distance of Airbus as per the TTEST results below. The 2 overrun scenarios are both from Boeing, with speed_ground above 135 mph. The mean landing distance differs between the aircraft types when the mean speed_ground is the same.
    Lower-tail TTEST for landing distance: the null hypothesis that the mean landing distance of Airbus is greater than or equal to that of Boeing is rejected. Hence, Boeing has the greater mean landing distance at the 95% confidence level.
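The code for the lower-tail test is not included in the transcript. The sketch below shows one plausible form; it assumes the dataset project.combined_flights_updated from Chapter 1 and that the aircraft variable takes the two values used in the Chapter 2 coding step ("airbus" and "boeing").

```sas
/* Sketch: two-sample lower-tail t-test of landing distance by aircraft make.
   With a CLASS variable, PROC TTEST reports the difference as the first
   level minus the second in sorted order, here airbus minus boeing. */
proc ttest data=project.combined_flights_updated sides=L alpha=0.05;
  class aircraft;
  var distance;
run;
```

With SIDES=L the alternative hypothesis is that the difference (airbus minus boeing) is below zero, so a small p-value supports the statement that Boeing has the greater mean landing distance.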
  • 24. Appendix