Regression Analysis on Flights data

BANA 6043
PROJECT 
STAT COMPUTING
Name: Mansi Verma
UCID: M 10632087

PROBLEM STATEMENT:
To study the factors that impact the landing distance of a commercial
flight, using the given data on 950 flights with the variables below:
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing.
The duration of a normal flight should always be greater than 40min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when
passing over the threshold of the runway. If its value is less than 30MPH
or greater than 140MPH, then the landing would be considered as
abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing
over the threshold of the runway. If its value is less than 30MPH or
greater than 140MPH, then the landing would be considered as
abnormal.
Height (in meters): The height of an aircraft when it is passing over the
threshold of the runway. The landing aircraft is required to be at least 6
meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the
threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically,
it refers to the distance between the threshold of the runway and the
point where the aircraft can be fully stopped. The length of the airport
runway is typically less than 6000 feet.

SUMMARY
The factors that impact the landing distance of a commercial flight in
the given data of 950 flights were studied. After eliminating duplicate
records and the observations that did not meet the constraints of the
aviation industry, we analyzed the remaining 831 records to come up with
an equation that explains the landing distance.
Distance = -1049 + 454.45(aircraft_name) + 0.27(speed_ground1) +
14(height) + 21(pitch)
*Aircraft name is ‘0’ for Boeing and ‘1’ for Airbus
*Speed_ground1 = Speed_ground^2 (the square of Speed_ground)
All the variables provided in the data were tested for their impact on
the landing distance. We found that the number of passengers and the
duration of the flight do not affect the landing distance, which is
reasonable in practice.
The landing distance depends on the pitch, the height and, above all, on
the square of the variable speed_ground. These results appear in the
regression modelling later in the report. The distance also depends on
the make of the aircraft, so the make is part of the equation as well.
We estimate the coefficient of each explanatory variable through linear
regression modelling and check the assumptions under which the model
can be applied.

CHAPTER ONE: DATA EXPLORATION AND
DATA CLEANING
GOAL:
The objective of exploration and data cleaning is to prepare the data
for further analysis, as we need to visualize and model the data to
arrive at results and insights. Validity checks need to be performed on
the data, as a few conditions must be met in the airline business.
Steps:
1.
The data is provided in two separate files, so we import the Excel files
and append them using SAS. The import picks up blank records, which we
need to delete. We use the SET statement to stack one data set below the
other, as the data fields in both files are almost the same.

The code stacks the data files one below the other. We notice that the
second file has one variable fewer than the first: it lacks duration.
There are also rows with the same values in FAA1 and FAA2, differing
only in the missing duration.
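The stacking step can be illustrated outside SAS as well. The sketch below is plain Python, not the SAS code used in the project; the rows, the `stack` helper and the `drop_blank_rows` helper are hypothetical and only mimic the behaviour described above (append one table below the other, fill the absent duration with a missing value, drop blank records).

```python
# Hypothetical miniature versions of the two files; the real FAA1/FAA2
# data are not reproduced here.
faa1 = [
    {"aircraft": "boeing", "duration": 98.5, "no_pasg": 53, "speed_ground": 107.9},
    {"aircraft": "airbus", "duration": 125.7, "no_pasg": 69, "speed_ground": 101.7},
]
faa2 = [
    # Same flight as the first FAA1 row, but without a duration field.
    {"aircraft": "boeing", "no_pasg": 53, "speed_ground": 107.9},
    # Blank record picked up by the import.
    {"aircraft": None, "no_pasg": None, "speed_ground": None},
]

def stack(*tables):
    """Append tables one below the other, filling absent columns with None."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [{name: row.get(name) for name in columns}
            for table in tables for row in table]

def drop_blank_rows(rows):
    """Remove rows in which every field is missing."""
    return [row for row in rows if any(v is not None for v in row.values())]

combined = drop_blank_rows(stack(faa1, faa2))
print(len(combined))            # number of rows left after dropping blanks
print(combined[2]["duration"])  # duration of the row that came from faa2
```

Rows that came from the second file carry a missing duration after stacking, which is why duration has to be left out of the duplicate check later on.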
2.
We need to analyze the data to learn more about the variables and the
data points.

This shows that the data needs to be cleaned: the minimum and maximum
values reveal discrepancies with the norms the data must follow. The
Nmiss column also shows that 711 values are missing for the variable
speed_air, whereas only 239 values are present. We might want to delete
this field, as it cannot give a true insight when it is missing for
about 75% of the data rows (711 of 950). The PROC MEANS table helps us
think through the other steps we will take to ready our data for
modelling.
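For readers without SAS, the kind of table PROC MEANS produces here (N, Nmiss, minimum, maximum, mean) can be approximated with a few lines of Python. The `summarize` function and the four rows are illustrative assumptions, not the project's data.

```python
def summarize(rows, name):
    """Return count, missing count, min, max and mean for one variable."""
    values = [row[name] for row in rows if row.get(name) is not None]
    return {
        "n": len(values),
        "nmiss": len(rows) - len(values),
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "mean": sum(values) / len(values) if values else None,
    }

# Hypothetical rows, chosen to show a mostly missing variable and an
# out-of-range minimum.
rows = [
    {"speed_air": None, "height": 27.4},
    {"speed_air": 109.3, "height": -3.5},   # negative height breaks the 6 m rule
    {"speed_air": None, "height": 30.1},
    {"speed_air": None, "height": 44.0},
]

speed_air_stats = summarize(rows, "speed_air")
height_stats = summarize(rows, "height")
missing_share = speed_air_stats["nmiss"] / len(rows)
print(speed_air_stats, height_stats, missing_share)
```

A minimum below the allowed range or a large missing share is exactly the kind of signal the summary table is read for.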
3.
Now we need to remove duplicate rows from the data. As the FAA2 data
file did not have values for duration, we exclude this variable from
consideration while looking for duplicates.

For this we sort the data and then use the NODUPKEY option to remove
duplicates by all variables except the duration field. This identifies
the overlap between the two data sets FAA1 and FAA2, which is captured
in a separate data set named ‘removed’.
This leaves us with 850 data rows and puts the 100 duplicated rows in a
different data set, kept only for reference.
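The duplicate check can be sketched in Python as follows. This is an illustration of the idea (a key built from every variable except duration), not the SAS sort used in the project; `split_duplicates` and the sample rows are hypothetical, and the sketch keeps the first row seen rather than the first row in sorted order.

```python
def split_duplicates(rows, ignore=("duration",)):
    """Keep the first row for each key; send later matches to `removed`."""
    seen = set()
    kept, removed = [], []
    for row in rows:
        # The key uses every field except the ignored ones.
        key = tuple((name, row[name]) for name in sorted(row) if name not in ignore)
        if key in seen:
            removed.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, removed

# Hypothetical rows: the third row repeats the first, minus its duration.
rows = [
    {"aircraft": "boeing", "duration": 98.5, "no_pasg": 53, "speed_ground": 107.9},
    {"aircraft": "airbus", "duration": 125.7, "no_pasg": 69, "speed_ground": 101.7},
    {"aircraft": "boeing", "duration": None, "no_pasg": 53, "speed_ground": 107.9},
]

kept, removed = split_duplicates(rows)
print(len(kept), len(removed))
```

Because the rows from the first file come first in the stacked data, the copy that is kept is the one that still has its duration.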

4.
As the variable speed_air has about 75% of its values missing, we can
remove it now or keep it and drop it later. We might also predict it or
impute it with the mean value, depending on our model, but for the time
being we keep it to study it further.
5.
Now we perform a sanity check of the data by validating each variable,
one by one, against the norms of the airline industry. We start with
duration: this removes 5 records with abnormal flight duration. We are
left with 845 records, and the 5 rows with abnormal flight duration can
be put in another data set for reference.
(Output below:)

6.
Now we check the variable height and delete the data rows with heights
of less than 6 meters, as these are unacceptable.

This removes 10 rows with unacceptable heights, and we are left with 835
records.
7.
Now we check the variable speed_ground. As per the constraints, its
values should lie between 30 MPH and 140 MPH, so we delete the data rows
with unacceptable speed_ground.

This deletes another 3 data rows, and we are left with 832 rows.

8.
Now we check the variable distance. As the length of the runway is
typically less than 6000 feet, no landing distance should exceed this
value. This removes one record, and we are now left with 831 records.

9.
Now we check the variable speed_air. As per the constraints, its values
should lie between 30 MPH and 140 MPH, so we delete the data rows with
unacceptable speed_air.
No rows with unacceptable speed_air remain in the data, so we are still
left with 831 rows.
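The validity checks in steps 5 to 9 can be summarised as one sequential filter. The Python sketch below is an assumption-laden illustration, not the project's SAS code: the rule thresholds come from the variable descriptions in the problem statement, while the helper names, the sample rows, and the choice to let missing values pass a rule are hypothetical.

```python
def in_range(value, low=None, high=None):
    """True when value is missing or lies within [low, high]."""
    if value is None:
        return True   # assumption: a missing value is not treated as a violation
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True

# Rules applied in the same order as the steps in the text.
RULES = [
    ("duration", lambda r: r.get("duration") is None or r["duration"] > 40),
    ("height", lambda r: in_range(r.get("height"), low=6)),
    ("speed_ground", lambda r: in_range(r.get("speed_ground"), 30, 140)),
    ("distance", lambda r: in_range(r.get("distance"), high=6000)),
    ("speed_air", lambda r: in_range(r.get("speed_air"), 30, 140)),
]

def apply_rules(rows):
    """Apply each rule in turn and count how many rows it removes."""
    removed_counts = {}
    for name, is_valid in RULES:
        kept = [r for r in rows if is_valid(r)]
        removed_counts[name] = len(rows) - len(kept)
        rows = kept
    return rows, removed_counts

# Hypothetical rows covering a valid landing and three kinds of violation.
rows = [
    {"duration": 98.5, "height": 27.4, "speed_ground": 107.9, "speed_air": 109.3, "distance": 3370.0},
    {"duration": 31.0, "height": 30.0, "speed_ground": 80.0, "speed_air": None, "distance": 1200.0},
    {"duration": None, "height": 2.5, "speed_ground": 75.0, "speed_air": None, "distance": 900.0},
    {"duration": 150.0, "height": 35.0, "speed_ground": 141.2, "speed_air": 141.7, "distance": 6533.0},
    {"duration": 120.0, "height": 40.0, "speed_ground": 60.0, "speed_air": None, "distance": 800.0},
]

clean, removed_counts = apply_rules(rows)
print(len(clean), removed_counts)
```

Applying the rules in sequence explains why the later checks remove few or no rows: a landing that is too fast on the ground is usually also too fast in the air and too long, and it has already been dropped by the earlier rule.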

10.
Now we look at the distribution of each variable, as the modelling
techniques rely on distributional assumptions. We capture the moments,
hypothesis tests and quantiles of each variable, and we inspect the
histogram of each variable to judge its distribution.
1. Duration

Duration is approximately normal, with a slight skew.
2. Number of passengers

Appears normally distributed but slightly left skewed.
3. Speed_Ground

Appears normally distributed.
4. Speed_Air

The air speed is not at all normal.
5. Height

Normally distributed with a slight right skew.
6. Pitch

Pitch looks normally distributed.
7. Distance

This doesn’t appear to be normally distributed.
Data Preparation Questions:
1. How should we treat a variable with about 75% of its values
missing?
2. How can we fill in values for variables with only a few values
missing, for the sake of completeness?
3. How can we impute values for a variable with the majority of its
data missing?
4. How can we substitute a value for an unacceptable data point rather
than delete the entire data row?

CHAPTER TWO: DATA EXPLORATION
GOAL:
The objective of data exploration is to study the cleaned data and
prepare it for the regression model. This includes visualizing the data,
checking for linearity, and examining the correlations between the
variables to eliminate those that do not affect our response variable.
Steps:
1. Before beginning the modeling, we plot our data. By examining
these initial plots, we can quickly assess whether the data have
linear relationships or whether interactions are present.
A variable that has a linear relationship with the response variable
produces a plot that resembles a straight line (as speed_air does). The
other plots are scattered.
We can consider transforming the other variables in our modeling to
increase the linearity.

Distance Vs No_pasg
Distance Vs Speed_ground

Distance Vs Speed_air
Distance Vs Height

Distance Vs Pitch
2. A transformation of speed_ground can increase the linearity.
Squaring speed_ground makes the plot much more linear than the previous
graph.

Distance Vs Speed_ground^2
3. As aircraft is a categorical variable, we assign it dummy
numerical values so that its correlation with the response variable
can be computed.
This codes Boeing aircraft as 0 and Airbus as 1. Any two distinct
numerical values could be used in place of ‘0’ and ‘1’: the fit and
the predictions stay the same, and only the intercept and the scale
of the aircraft coefficient change.
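Steps 2 and 3 both derive a new column from an existing one. A minimal Python sketch of the two derivations is given below; the column names speed_ground1 and aircraft_name follow the report, while the function name and the sample row are hypothetical.

```python
# Coding stated in the report: 0 for Boeing, 1 for Airbus.
AIRCRAFT_CODES = {"boeing": 0, "airbus": 1}

def add_model_columns(row):
    """Return a copy of the row with the squared speed and the dummy variable."""
    out = dict(row)
    out["speed_ground1"] = row["speed_ground"] ** 2
    out["aircraft_name"] = AIRCRAFT_CODES[row["aircraft"].strip().lower()]
    return out

# Hypothetical row for illustration.
example = add_model_columns({"aircraft": "Airbus", "speed_ground": 80.0})
print(example["speed_ground1"], example["aircraft_name"])
```

The squared term keeps the model linear in its coefficients while letting distance grow faster than linearly with ground speed.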

4. Now we create a correlation matrix of all the variables in the
data to analyze the dependence of the response variable on the
other variables. The correlation matrix also helps us identify
dependence between two independent variables; in such a case we
are likely to eliminate one of them from our model. Checking the
correlations among the independent variables in this way helps
prevent multicollinearity problems later.
Results:
• The variable distance is significantly correlated with every
variable except aircraft_name and no_pasg, whose p-values are
greater than 0.05.
• On pairwise correlation alone, no_pasg and aircraft_name therefore
do not appear to play a significant role in explaining the response
variable. Aircraft_name is nevertheless kept for the regression,
where it enters the final equation.
• As speed_ground and speed_air are also highly correlated, we can
drop one of them. We choose to drop speed_air, as it has numerous
missing values.

We can now drop the variable speed_air.
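The correlation matrix from step 4 rests on the Pearson coefficient computed for every pair of variables. The sketch below shows that computation in plain Python, skipping pairs with a missing value; the functions and the four-row sample are hypothetical, and no p-values are computed because that would need a t distribution, which the standard library does not provide.

```python
import math

def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of equal-length columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical columns; speed_air is missing for half of the rows.
columns = {
    "speed_ground": [60.0, 80.0, 100.0, 120.0],
    "speed_air": [None, None, 101.0, 121.5],
    "distance": [900.0, 1400.0, 2600.0, 4500.0],
}

matrix = correlation_matrix(columns)
print(matrix["distance"]["speed_ground"])
```

A pair of predictors with a correlation close to 1, as speed_ground and speed_air have in the report, carries almost the same information, which is the reason for keeping only one of them.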
CHAPTER THREE: MODELLING
GOAL:
The objective of modelling is to build an equation for the response
variable to understand its dependence on the independent variables
chosen. We are concerned with finding a model that describes the
relationship between distance and several predictor (explanatory)
variables by regression.
Introduction:
A linear model has the form Y = b0 + b1X + ε. The constant b0 is called
the intercept and the coefficient b1 is the parameter estimate for the
variable X. The ε is the error term. ε is the residual that cannot be
explained by the variables in the model.
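To make the roles of b0, b1 and ε concrete, the sketch below fits the one-variable model by ordinary least squares in plain Python. It is an illustration of the general technique with made-up numbers, not the multi-variable PROC REG fit used in this report; `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = b0 + b1*X; returns b0, b1 and residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx                 # slope: parameter estimate for X
    b0 = mean_y - b1 * mean_x      # intercept
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return b0, b1, residuals

# Made-up data scattered around a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, residuals = fit_line(xs, ys)
print(b0, b1, sum(residuals))
```

The residuals are the estimated error terms; with an intercept in the model they sum to zero up to rounding, which is the "mean 0" property checked later.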

The F value is as high as 2683.94 and R-square is 0.9288, which shows
that the independent variables explain most of the variation in our
response variable distance, and thus we are in a position to obtain our
equation.

As our model has several variables, we also look at the adjusted
R-square, which penalizes additional predictors; it is also high. We can
thus take the model fit to be adequate.
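The adjusted R-square follows from R-square, the number of observations n and the number of predictors p through the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The Python sketch below applies it; R-square and n are taken from the report, while p = 4 is an assumption based on the four predictors in the final equation, so the result is illustrative and is not the value printed by SAS.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-square for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# r2 and n come from the report; p = 4 is an assumption (see lead-in).
value = adjusted_r2(0.9288, 831, 4)
print(round(value, 4))
```

With several hundred observations and only a handful of predictors, the penalty is small, so the adjusted value stays close to R-square.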
As the p-value of the variable duration is more than 0.05, we drop this
variable: the response variable does not depend significantly on the
duration, so it should not be part of the equation. All the remaining
variables have p-values less than 0.05, so they make up our equation.
BUILDING THE EQUATION
Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + ε
Y = distance
b0 = -1049
b1 = 454.45
X1 = aircraft_name
b2 = 0.27
X2 = speed_ground1
b3 = 14
X3 = height
b4 = 21
X4 = pitch
Distance = -1049 + 454.45(aircraft_name) + 0.27(speed_ground1) +
14(height) + 21(pitch)
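As a worked example, the equation can be evaluated for a single landing. The coefficients below are the ones stated above; the function name and the input values (ground speed 80 MPH, height 30 m, pitch 4 degrees) are made up for illustration.

```python
def predict_distance(aircraft_name, speed_ground, height, pitch):
    """Landing distance in feet from the fitted equation.

    aircraft_name is 0 for Boeing and 1 for Airbus; speed_ground is in MPH
    and is squared inside the function to form speed_ground1.
    """
    speed_ground1 = speed_ground ** 2
    return (-1049
            + 454.45 * aircraft_name
            + 0.27 * speed_ground1
            + 14 * height
            + 21 * pitch)

boeing = predict_distance(0, 80, 30, 4)
airbus = predict_distance(1, 80, 30, 4)
print(round(boeing, 2), round(airbus, 2))
```

For these inputs the terms are -1049 + 0.27(6400) + 14(30) + 21(4) = 1183 feet for the aircraft coded 0, and 454.45 feet more for the aircraft coded 1. Because speed enters squared, the same 10 MPH increase adds more distance at high speeds than at low ones.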

The mean of the residuals is zero. The overall fit of the model can be
checked by looking at the F value and its corresponding p-value (Prob
>F) for the total model under the Analysis of Variance portion of the
PROC REG printout. Generally, we want a Prob>F value of less than 0.05.

CHAPTER FOUR: MODEL CHECKING
GOAL:
The objective of model checking is to check the assumptions for the
noise terms.
They are assumed to be:
1. Independent
2. Normally distributed.  
3. Mean 0
4. Constant Variance
We will validate that the residuals are independent, as this is an
assumption of linear regression, by examining the residuals of our final
model. Specifically, we will use diagnostic statistics from PROC REG and
create an output data set of residual values for PROC UNIVARIATE to
test.
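Two of the listed assumptions can be checked with simple arithmetic on the residuals. The Python sketch below computes the residual mean and the Durbin-Watson statistic, one common diagnostic for first-order dependence between consecutive residuals; it is offered as an illustration and is not necessarily the diagnostic used in this project. The function name and the residual values are made up.

```python
def residual_checks(residuals):
    """Mean of the residuals and the Durbin-Watson statistic."""
    n = len(residuals)
    mean = sum(residuals) / n
    sse = sum(e ** 2 for e in residuals)
    # Sum of squared differences between consecutive residuals.
    diffs = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, n))
    return {"mean": mean, "durbin_watson": diffs / sse}

# Made-up residuals: one alternating series, one that changes sign only once.
alternating = residual_checks([1.0, -1.0, 1.0, -1.0])
clustered = residual_checks([1.0, 1.0, -1.0, -1.0])
print(alternating, clustered)
```

The statistic lies between 0 and 4. Values near 2 are consistent with independent residuals, values well below 2 point to positive correlation between neighbours, and values well above 2 point to negative correlation. It is only meaningful when the rows have a natural order.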

The p-value of the chi-square test is less than 0.05.
The distribution of the residuals.

Write your short answers to these questions: 
1. How many observations (flights) do you use to fit your
final model? If not all 950 flights, why?
I used 831 observations to fit the final model. The other 119
observations were removed in the data preparation chapter, after the
blank records were deleted and all the validity checks were applied.
First, I identified the overlap between the two data sets FAA1 and FAA2
and removed the 100 duplicate rows, which left 850 observations. Five
records with abnormal flight duration were then removed, leaving 845. I
also deleted the 10 rows with heights of less than 6 meters, as these
are unacceptable, so 835 records remained. The values of speed_ground
should lie between 30 MPH and 140 MPH; this check deleted another 3
rows, leaving 832. Finally, the landing distance should not exceed the
runway length of 6000 feet; one record exceeded this value, so 831
records were left.

2. What factors impact the landing distance of a flight, and
how?
From my modelling and results, four variables impact the landing
distance: speed_ground, height, pitch and the aircraft make.
I eliminated no_pasg, duration and speed_air for different reasons.
No_pasg: it was not significantly correlated with the response variable.
Duration: the regression result showed that its effect on the distance
was not significant.
Speed_air: this variable has too few values to be used in the analysis.
The major reason to eliminate it from our equation, however, was its
very strong correlation with speed_ground, which made it redundant.
All four remaining variables have significant parameter estimates in the
regression, and these estimates are the coefficients of the equation
given above. Distance increases with each of them: by 14 feet per meter
of height, by 21 feet per degree of pitch, and by 0.27 feet per unit of
squared ground speed, and it is 454.45 feet longer for the aircraft
coded 1.
The speed term in the equation is the square of speed_ground, as the
squared variable has a more linear relationship with the distance.

3. Is there any difference between the two makes Boeing and Airbus?
Yes, there is a difference between Boeing and Airbus: our equation
contains the variable aircraft_name, which is based on the make of the
aircraft, and its coefficient is 454.45.

The equality-of-variances test shows an F value greater than 1. Since
this F statistic compares the variances of the two makes, the
significance of the difference should be read from its p-value together
with the t-test.
The GLM and t-test were run to identify the differences between the two
groups, and their results show the difference in their means and
variances and in their impact on the distance.
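The quantities behind such a two-group comparison can be written out directly. The Python sketch below computes the difference in means, the folded F ratio of the two sample variances (larger over smaller, so it is never below 1) and a Welch t statistic. The function name and the distances are made up, and no p-values are computed because the standard library has no F or t distribution.

```python
import math
from statistics import mean, variance

def compare_groups(a, b):
    """Mean difference, folded F ratio and Welch t statistic for two samples."""
    var_a, var_b = variance(a), variance(b)
    folded_f = max(var_a, var_b) / min(var_a, var_b)
    # Welch standard error: does not assume equal variances.
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    mean_diff = mean(a) - mean(b)
    return {"mean_diff": mean_diff, "folded_f": folded_f, "welch_t": mean_diff / se}

# Made-up landing distances (feet) for two makes.
group_a = [1000.0, 1200.0, 1400.0]
group_b = [1500.0, 1900.0, 2300.0]

result = compare_groups(group_a, group_b)
print(result)
```

The folded F ratio speaks only to whether the spreads differ; whether the average distances differ is judged from the t statistic and its p-value.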
