This document describes a statistical analysis project to determine the factors that affect commercial flight landing distance. The author analyzes 950 flight observations to build a linear regression model with landing distance as the target variable. The key factors found to impact landing distance are aircraft type, ground speed, and height of the aircraft. The final regression equation is: Distance = -2554.47 + 501.57(Aircraft_Cat) + 42.79(Ground Speed) + 14.52(Height), where Aircraft_Cat is a dummy variable for aircraft type. 832 observations were used to fit the final model after removing duplicates and abnormal flights.
Flight landing Project
STATISTICAL COMPUTING PROJECT – BANA 6043
NAME : POORVI DESHPANDE
UCID: M12388313
ABSTRACT
In this project, we are trying to determine what factors affect the landing distance of a commercial
flight and how they would impact the landing distance. We have been given two data sets
consisting of 950 flight observations combined.
To identify the factors and the magnitude by which they affect landing distance, we have used a
linear regression model wherein the target variable is landing distance (distance) and the rest serve
as predictor variables.
We follow several steps to reach an equation that describes our model: data preparation, data
exploration and data modelling. The correlations between the target and the explanatory
variables are calculated, and the variables that have a significant impact on landing distance
are chosen.
Landing distance depends on the type of aircraft, the ground speed and the height of the aircraft.
Distance = -2554.47 + 501.57(aircraft_cat) + 42.79(speed_ground) + 14.52(height)
*aircraft_cat is 1 for Boeing and 0 for Airbus
CHAPTER 1 : DATA PREPARATION
Data preparation is done to obtain a clean data set for further analysis and accurate statistics.
The data needs to be checked and filtered according to the acceptable conditions defined in the
problem statement.
STEPS
1. Combining data sets
We had two data sets. It is better to combine the two and draw common inferences from the
combined data set, as all the columns were common save one.
PROC IMPORT
DATAFILE="/home/deshpapi0/Landing/FAA1.xls"
OUT=FAA1
DBMS=xls REPLACE;
RUN;
PROC IMPORT
DATAFILE="/home/deshpapi0/Landing/FAA2.xls"
OUT=FAA2
DBMS=xls REPLACE;
RUN;
DATA COMBINED;
SET FAA1 FAA2;
RUN;
2. Fetching basic details about the combined data set.
PROC MEANS DATA=combined;
RUN;
PROC UNIVARIATE DATA=COMBINED;
VAR speed_air;
HISTOGRAM speed_air;
RUN;
PROC UNIVARIATE DATA=COMBINED;
VAR height;
HISTOGRAM height;
RUN;
PROC UNIVARIATE DATA=COMBINED;
VAR pitch;
HISTOGRAM pitch;
RUN;
PROC UNIVARIATE DATA=COMBINED;
VAR distance;
HISTOGRAM distance;
RUN;
PROC UNIVARIATE DATA=COMBINED;
VAR duration;
HISTOGRAM duration;
RUN;
HISTOGRAMS OF VARIABLES
We observe that, apart from speed_air and distance, the variables have a normal (or close to
normal) distribution.
3. Check for duplicate values
PROC SORT data=COMBINED NODUPKEY;
BY aircraft speed_ground no_pasg speed_air height pitch distance;
RUN;
We had 100 duplicate rows. These duplicate rows are deleted.
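One way to confirm the number of duplicate rows is to run the sort with the DUPOUT= option, which writes the discarded rows to a separate data set. The sketch below is an alternative to the NODUPKEY sort above and assumes it is run on the combined data before any duplicates are removed. The data set names COMBINED_NODUP and DUPS are assumptions made for illustration.

```sas
/* Sketch: run in place of the in-place NODUPKEY sort, on the combined data
   before de-duplication. COMBINED_NODUP and DUPS are hypothetical names. */
PROC SORT DATA=COMBINED OUT=COMBINED_NODUP NODUPKEY DUPOUT=DUPS;
BY aircraft speed_ground no_pasg speed_air height pitch distance;
RUN;

/* The number of rows in DUPS is the number of duplicate rows dropped */
PROC SQL;
SELECT COUNT(*) AS n_duplicates FROM DUPS;
QUIT;
```

Writing to a separate output data set keeps the original COMBINED data intact, so the duplicates can be inspected before they are discarded.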
4. Checking for missing values and treating them
proc means data=COMBINED NMISS N;
run;
We find that there are 50 missing values for the variable ‘duration’ and 642 missing values for
‘speed_air’. At this stage we cannot simply delete the rows with missing values, because we do
not yet know how significantly these variables affect the target variable. These rows could also
contain outliers that would change the statistics of the data considerably.
5. Categorizing data
A flight is marked as normal or abnormal based on a number of criteria.
Another dataset has been created on which I have applied transformations. Since we limit our
model to the normal observations only, we can delete the abnormal observations.
According to the conditions given,
1. Duration: The duration of a normal flight should always be greater than 40 minutes.
Deleting all flights with a flight duration of less than 40.
2. Speed_ground: If its value is less than 30 MPH or greater than 140 MPH, then the landing
is considered abnormal. Deleting all rows with abnormal speed_ground.
3. Height: The landing aircraft is required to be at least 6 meters high at the threshold of the
runway, so height < 6 meters is abnormal. Deleting all rows with height < 6.
4. Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of
the runway. If its value is less than 30 MPH or greater than 140 MPH, then the landing is
considered abnormal.
NOTE: Missing values are counted as Normal for now.
/*deleting abnormal flights*/
DATA FLIGHT_DATA;
SET COMBINED;
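/* missing duration and speed_air are treated as normal, so they are excluded from the checks */
IF (duration<40 AND NOT MISSING(duration)) OR Speed_ground<30 OR Speed_ground>140
OR (speed_air<30 AND NOT MISSING(speed_air)) OR speed_air>140 OR height<6 THEN
DELETE;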
RUN;
proc print data= flight_data;
run;
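Before deleting rows, it can help to see how many flights each rule set removes. The sketch below flags abnormal flights instead of deleting them and then counts them. It applies the same conditions as the DATA step above. The names FLIGHT_FLAGS and abnormal are assumptions made for illustration.

```sas
/* Sketch: flag abnormal flights instead of deleting them, then count them.
   FLIGHT_FLAGS and abnormal are hypothetical names. */
DATA FLIGHT_FLAGS;
SET COMBINED;
abnormal = 0;
/* missing duration and speed_air are counted as normal, as in the note above */
IF (NOT MISSING(duration) AND duration < 40)
   OR speed_ground < 30 OR speed_ground > 140
   OR (NOT MISSING(speed_air) AND speed_air < 30) OR speed_air > 140
   OR height < 6 THEN abnormal = 1;
RUN;

/* Frequency table of normal (0) versus abnormal (1) flights */
PROC FREQ DATA=FLIGHT_FLAGS;
TABLES abnormal;
RUN;
```

If the counts in this report hold (950 rows, 100 duplicates removed, 832 rows in the final model), the abnormal group should contain roughly 18 flights.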
6. Fetching statistics about the clean data set
PROC MEANS DATA=FLIGHT_DATA;
RUN;
HISTOGRAMS OF VARIABLES IN THE CLEANED DATA SET : FLIGHT_DATA
CHAPTER 2 : DATA EXPLORATION
Data exploration is done to statistically analyze the clean data before regression modelling.
This step encompasses visualizing the spread of the data, checking for linearity and checking
whether there is a correlation between the target and predictor variables. The variables which do
not have any effect on the target variable can be eliminated.
Steps:
1. It is advisable to plot the data before modelling, as the plots indicate whether there is a
linear correlation between variables. If there is a linear correlation, the points fall on a
straight line (or close to a straight line). Otherwise we see a scattered plot in which no
linear relationship can be determined.
PROC PLOT DATA= FLIGHT_DATA;
PLOT distance * (duration no_pasg speed_ground speed_air height pitch);
RUN;
PLOTS:
distance * duration
distance * pitch
We observe that speed_ground and speed_air are linearly correlated with distance. But by how
much? To find the magnitude of the correlation, we calculate the coefficients of correlation.
2. Finding correlation coefficients
Before finding coefficient of correlation, we need to transform ‘aircraft’ which is a categorical
variable into a numerical one. We do this by creating dummy variables.
/*dummy variables for aircraft */
DATA FLIGHT_DATA;
SET flight_data;
IF (aircraft= "boeing") then aircraft_cat = 1;
else aircraft_cat = 0;
RUN;
This creates another column, aircraft_cat, and populates it with 1 for boeing and 0 for airbus.
This does not change the data in any other way, but lets us take the make of aircraft into
consideration.
Now, a correlation matrix is created to determine the magnitude of correlation between the
variables. Since this also gives us the correlations between the independent variables, we can
also determine whether any predictor variables are correlated with each other.
proc corr data=flight_data;
var distance aircraft_cat duration no_pasg speed_ground speed_air height pitch;
title "Pairwise correlation coefficients";
run;
Conclusions:
a) The variable distance is highly correlated with speed_ground and speed_air. Distance is
not significantly correlated with duration and no_pasg, as their p-values are greater
than 0.05.
b) Therefore, these two variables, no_pasg and duration, play no role in determining the
regression model for the given data.
c) We observe that speed_air and speed_ground are highly correlated with each other.
Thus, we can drop one of the two variables. Since speed_air has a significant
number of missing values, it makes sense to drop that column altogether.
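The overlap between speed_air and speed_ground can also be checked with variance inflation factors. This is a complementary diagnostic, not the method used in this report. The sketch below assumes it is run before speed_air is dropped. Note that PROC REG excludes rows with a missing speed_air, so it only uses the flights where both speeds are recorded.

```sas
/* Sketch: variance inflation factors for the two speed variables.
   Run before speed_air is dropped. Rows with missing speed_air are
   excluded by PROC REG, so only complete rows are used. */
PROC REG DATA=flight_data;
MODEL distance = speed_ground speed_air / VIF;
RUN;
QUIT;
```

A large variance inflation factor for both variables would support keeping only one of them in the model.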
/* drop column speed_air */
data FLIGHT_DATA (drop=speed_air);
set FLIGHT_DATA;
RUN;
PROC PRINT DATA=FLIGHT_DATA;
RUN;
CHAPTER 3 : DATA MODELLING
Data modelling is done to obtain an equation that explains the dependence of the target variable
on the independent variables chosen through data exploration. In this model, we focus on
how the other factors affect landing distance through regression.
A simple linear equation can be defined as 𝑦 = 𝛼 + 𝛽𝑥 + 𝜀
Where 𝛼 is the intercept, 𝛽 is the parameter estimate for variable x and 𝜀 is error.
/* regression */
PROC REG data=flight_data;
MODEL distance = aircraft_cat duration speed_ground height pitch / r spec;
output out= FLIGHT_FINAL r= residual; run;
Since we cannot reject the null hypothesis (that the variable's coefficient is zero) for duration
(p-value = 0.9097) and pitch (p-value = 0.4561), we can remove these variables from the
equation.
Now we run a regression on the remaining variables: speed_ground, aircraft_cat and height,
together with the intercept.
PROC REG data=flight_data;
MODEL distance = aircraft_cat speed_ground height / r spec;
output out= FLIGHT_FINAL r= residual;
run;
After this second regression, all of the remaining variables are significant, so none of them
can be removed based on p-value. Therefore, this is our final regression model.
α = -2554.46892
β1 = 501.57254 for x1 = aircraft_cat
β2 = 42.78669 for x2 = speed_ground
β3 = 14.52014 for x3 = height
Distance = -2554.47 + 501.57(aircraft_cat) + 42.79(speed_ground) + 14.52(height)
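The fitted equation can be checked against the data by storing the predicted landing distance next to the observed one. The sketch below refits the same three-variable model and writes fitted values and residuals to a new data set. The names FLIGHT_PRED and predicted are assumptions made for illustration.

```sas
/* Sketch: refit the final model and keep fitted values alongside residuals.
   FLIGHT_PRED and predicted are hypothetical names. */
PROC REG DATA=flight_data;
MODEL distance = aircraft_cat speed_ground height;
OUTPUT OUT=FLIGHT_PRED P=predicted R=residual;
RUN;
QUIT;

/* Compare observed and predicted distance for the first few flights */
PROC PRINT DATA=FLIGHT_PRED (OBS=5);
VAR distance predicted residual aircraft_cat speed_ground height;
RUN;
```

Each predicted value is the intercept plus the three coefficients multiplied by that flight's aircraft_cat, speed_ground and height, so large residuals point to flights the model describes poorly.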
Questions:
1. How many observations (flights) do you use to fit your final model? If not all 950 flights, why?
In this model, 832 observations have been used after removing all the duplicates and abnormal
values. The removed rows are either duplicates or abnormal values that do not comply with the
conditions given for acceptable flight landings.
2. What factors and how they impact the landing distance of a flight?
Landing distance depends on the type of aircraft, the ground speed and the height of the
aircraft.
Distance = -2554.47 + 501.57(aircraft_cat) + 42.79(speed_ground) + 14.52(height)
Aircraft_cat is a dummy variable created to distinguish the makes of aircraft, namely 0 for
airbus and 1 for boeing. It does not change anything in the results.
3. Is there any difference between the two makes Boeing and Airbus?
Yes. Aircraft category is a significant variable in the model, so we observe different landing
distances for the two makes: with ground speed and height held fixed, the predicted distance
for a boeing (aircraft_cat = 1) is greater than that for an airbus by the coefficient 501.57.
In this data set, we have 444 observations with aircraft type as airbus and 388 observations with
aircraft type as boeing. T