Stats computing project_final

_
BANA 6043 Project
NAME: AYANK GUPTA UCID:M12388639
Background: Flight landing.
Motivation: To reduce the risk of landing overrun.
Goal: To study which factors affect the landing distance of a commercial
flight, and how.
Data: Landing data (landing distance and other parameters) from 950
commercial flights (not a real data set; simulated from statistical models). See
the two Excel files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ (150 flights).

_
Chapter 1: Data Preparation
1. Combining the data sets from different sources
Output of both imports
/**FAA1**/

_
/**FAA2**/
/** Combining both the data sets **/

_
/* Checking for duplicates and removing them from the combined data set */
Note: We found 100 duplicate entries in the combined data set and removed
them, leaving 850 flights.
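The combine-and-deduplicate step can be sketched as follows. This is a minimal illustration, not the code used in the project: the function name `combine_and_dedupe`, the toy rows, and the choice of key columns are all assumptions. In particular, it assumes duplicates are matched on columns that both files share, since the FAA-2 rows carry no duration.

```python
def combine_and_dedupe(faa1, faa2, key_fields):
    """Stack two lists of row dicts and keep the first row seen for each key."""
    seen = set()
    combined = []
    for row in faa1 + faa2:
        # Build the comparison key from the shared columns only.
        key = tuple(row.get(field) for field in key_fields)
        if key in seen:
            continue  # same flight already kept, skip the repeat
        seen.add(key)
        combined.append(row)
    return combined


# Toy rows for illustration only; these are not values from the FAA files.
faa1 = [
    {"aircraft": "boeing", "duration": 98.5, "speed_ground": 107.9, "distance": 3369.8},
    {"aircraft": "airbus", "duration": 125.7, "speed_ground": 101.7, "distance": 2987.8},
]
faa2 = [
    {"aircraft": "boeing", "speed_ground": 107.9, "distance": 3369.8},  # repeat of a FAA-1 row
    {"aircraft": "airbus", "speed_ground": 63.3, "distance": 1144.9},
]

KEY_FIELDS = ["aircraft", "speed_ground", "distance"]  # hypothetical choice of key
flights = combine_and_dedupe(faa1, faa2, KEY_FIELDS)
print(len(flights))
```

Because FAA-1 rows come first, the copy that is kept is the one that still has its duration value.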

_
2. Performing the completeness check of each variable: examine whether
missing values are present
Variable       N     Missing Values   % Missing Values
Duration       800   50               5.9%
no_pasg        850   0                0%
speed_ground   850   0                0%
speed_air      208   642              75.5%
Height         850   0                0%
Pitch          850   0                0%
Distance       850   0                0%
Note:
1. About 5.9% of the values of the DURATION variable are missing (50 of 850), because
duration is missing for 50 rows from the FAA-2 data set
2. About 75% of the values of speed_air are missing, so we need to examine this column
further during data cleaning
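The counts in the table above can be reproduced with a small helper like the one below. It is a sketch under assumptions: the name `completeness` and the toy rows are hypothetical, and a missing value is represented as `None`.

```python
def completeness(rows, variables):
    """Return {variable: (non_missing, missing, percent_missing)}."""
    total = len(rows)
    report = {}
    for var in variables:
        missing = sum(1 for row in rows if row.get(var) is None)
        percent = round(100.0 * missing / total, 1)
        report[var] = (total - missing, missing, percent)
    return report


# Toy rows for illustration only.
rows = [
    {"duration": 100.0, "speed_air": None},
    {"duration": None, "speed_air": None},
    {"duration": 120.0, "speed_air": 101.2},
    {"duration": 90.0, "speed_air": None},
]
print(completeness(rows, ["duration", "speed_air"]))

# The same arithmetic applied to the counts reported in the table.
pct_duration = round(100.0 * 50 / 850, 1)
pct_speed_air = round(100.0 * 642 / 850, 1)
print(pct_duration, pct_speed_air)
```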
3. Performing the validity check of each variable: examine whether abnormal values are present

_
NOTE: A few values of height are negative; we need to flag them and exclude them from the
subsequent analysis.
Next, we check every variable against the business rule given for that variable.

_
/* Checking for outliers in height */
Note: This step identifies the observations with negative height.

_
4. Cleaning the data based on the results of Steps 2 and 3
Note: We removed 18 observations that were flagged as abnormal.
1. For now we are not removing the rows with missing values, because doing so could bias the data
a. I am planning to impute the missing values,
b. or to fill them with a simple approximation such as the mean.
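A rule-based cleaning step of this kind can be sketched as below. Only the negative-height rule comes from this report; the helper name `split_valid` and the toy rows are assumptions, and the remaining business rules would have to be added to `rules` from the project brief. Rows with a missing value are kept, in line with the decision above not to drop them yet.

```python
def split_valid(rows, rules):
    """Return (kept, flagged); a row is flagged if any rule fails on a present value."""
    kept, flagged = [], []
    for row in rows:
        ok = all(
            check(row[var])
            for var, check in rules.items()
            if row.get(var) is not None  # missing values are not judged here
        )
        (kept if ok else flagged).append(row)
    return kept, flagged


# Only the height rule is taken from the report; other rules are left out on purpose.
rules = {"height": lambda h: h >= 0}

# Toy rows for illustration only.
rows = [
    {"height": 30.1, "duration": 120.0},
    {"height": -3.5, "duration": 95.0},
    {"height": 27.4, "duration": None},
]
kept, flagged = split_valid(rows, rules)
print(len(kept), len(flagged))
```

Keeping the flagged rows in a separate list, rather than discarding them, makes it easy to report how many observations each rule removes.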

_
5. Summarizing the distribution of each variable
We examined the distribution of every variable to see which are approximately normally
distributed and which are skewed.
Variable      Label         N    Mean       Std Dev   Minimum  Maximum    Skewness
duration      duration      782  154.731    48.335    41.949   305.622    0.192089
no_pasg       no_pasg       832  60.060     7.488     29.000   87.000     -0.015304
speed_ground  speed_ground  832  79.611     18.829    33.574   136.659    0.110191
speed_air     speed_air     204  103.646    9.982     90.003   136.423    0.9447
height        height        832  30.474     9.791     6.228    59.946     0.125057
pitch         pitch         832  4.005      0.526     2.284    5.927      0.016221
distance      distance      832  1,528.240  911.045   41.722   6,309.950  1.560395
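For reference, the skewness column can be computed with the adjusted Fisher-Pearson estimator sketched below. Whether the software used for this report applies exactly this formula is an assumption, and the function name `sample_skewness` is hypothetical. A value near 0 indicates a roughly symmetric distribution, while a clearly positive value, as for distance, indicates a long right tail.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness, ignoring missing (None) values."""
    xs = [v for v in values if v is not None]
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 non-missing values")
    mean = sum(xs) / n
    # Sample standard deviation with the n - 1 denominator.
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    if s == 0:
        return 0.0  # all values equal, no asymmetry to measure
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)


# Toy samples for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10, None]
print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```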
DURATION

_
CHAPTER 2: Descriptive Study (XY plots and correlation studies)
Distance Vs Duration
Distance Vs NO_PASG

_
Distance Vs Speed Ground
Distance Vs Air Speed

_
Distance Vs Height
Distance Vs Pitch

_
My interpretation of the XY plots
1. Distance Vs Duration: The points are scattered and the relationship
does not appear to be linear
2. Distance Vs No_Pasg: The relationship is not linear
3. Distance Vs Speed_Ground: The relationship is monotonic and
approximately linear
4. Distance Vs Speed_Air: The relationship is fairly linear, but speed_air has
many missing values, so the relationship cannot be considered reliable
5. Distance Vs Height and Distance Vs Pitch: The points are somewhat scattered

_
Correlation matrix between the variables and its interpretation:
Interpretation of the correlation between the independent variables
➢ We need to check the pairwise correlations between the independent variables, because
multicollinearity between them can distort the coefficient estimates of our linear
regression models
➢ We observe a strong correlation between speed_air and speed_ground, so we need to be
extra careful if both variables enter the regression
➢ Apart from that, all the other variables are fairly uncorrelated with each
other, which is a good sign for our regression model
Note: Argument against using the speed_air variable:
About 75% of the speed_air values are missing. If we imputed them, whether by a simple rule
or by predictive imputation, we would be predicting about 75% of the values from the
remaining 25%, which is not a sensible decision.
In addition, since ground speed and air speed are strongly correlated, we can use ground
speed alone in our regression model.
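The correlation check can be sketched as a Pearson coefficient computed on the rows where both variables are present. This is an illustration only: the function name `pearson`, the pairwise handling of missing values, and the toy numbers are assumptions. With speed_air mostly missing, its correlations rest on far fewer rows than those of the other variables.

```python
import math


def pearson(xs, ys):
    """Pearson correlation using only the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least 2 complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Toy values for illustration only; None marks a missing speed_air reading.
speed_ground = [60.0, 80.0, 95.0, 100.0, 120.0]
speed_air = [None, None, 96.0, 100.5, 121.0]
r = pearson(speed_ground, speed_air)
print(round(r, 3))
```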

_
Chapter 3: Statistical modelling
R-squared is the value we use to compare regression models with one another and to judge
how well each model fits.
Our aim in improving the model is to obtain a higher R-squared while taking care not to
overfit.
Note: For the next iteration of the model we will consider only the variables speed_ground, height
and pitch
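To make the R-squared comparison concrete, here is a minimal ordinary least squares sketch that returns both R-squared and adjusted R-squared. It is not the procedure used for this report: the helper names `solve` and `fit_ols` and the toy data are assumptions. Adjusted R-squared is included because it penalizes extra predictors, which speaks to the overfitting caution above.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(X, y):
    """Return (coefficients with the intercept first, R-squared, adjusted R-squared)."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    n = len(y)
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    return beta, r2, adj_r2


# Toy data generated from y = 2 + 3*x1 - x2, for illustration only.
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [3, 7, 6, 11, 9]
beta, r2, adj_r2 = fit_ols(X, y)
print([round(b, 3) for b in beta], round(r2, 3))
```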

_
Now we need to decide which variables to keep in our regression analysis.
Variables with a p-value greater than 0.1 will be dropped from the analysis.
Variables that are only marginally significant should be selected carefully, since including them
may overfit the model, which would hurt its performance on a test set.

_
Note:
We observe that the residuals appear to be normally distributed.
Since the R-squared value does not change, we finalize the regression model with the significant
variables. The R-squared value also indicates a good fit.
We still need to validate the model.
One option is to test its accuracy on a test data set.
Since we do not have a test data set at the moment, we perform a basic validation
through model checking.

_
Model checking
Observations
1. The residuals are normally distributed
2. The mean of the residuals is 0
3. The residuals have constant variance
Hence, we conclude that the model is validated through model checking
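Two of these checks can also be given a rough numeric form, as in the sketch below. This is a companion to the residual plots, not a replacement for them and not a formal test. The function name `residual_checks`, the split of the residuals into a low-fitted and a high-fitted half, and the toy numbers are assumptions. A residual mean near 0 and a variance ratio near 1 are consistent with observations 2 and 3; normality is not assessed here.

```python
def residual_checks(y, fitted):
    """Return (mean of residuals, variance ratio of high-fitted to low-fitted half)."""
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    n = len(residuals)
    mean_res = sum(residuals) / n

    def variance(indices):
        vals = [residuals[i] for i in indices]
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

    # Order observations by fitted value, then compare the spread in each half.
    order = sorted(range(n), key=lambda i: fitted[i])
    half = n // 2
    ratio = variance(order[half:]) / variance(order[:half])
    return mean_res, ratio


# Toy values for illustration only.
y = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean_res, ratio = residual_checks(y, fitted)
print(round(mean_res, 6), round(ratio, 3))
```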

_
Chapter 4: Project Summary
Summary of the Project
Background: Flight landing.
Motivation: To reduce the risk of landing overrun.
Goal: To study which factors affect the landing distance of a commercial flight, and how.
Data: Landing data (landing distance and other parameters) from 950 commercial flights (not a real
data set; simulated from statistical models)
1. Data Preparation
a. Combined both data sets.
b. Removed duplicates from the combined data set
c. Removed the abnormal observations from the data set
d. Checked the distribution of each variable in the datasets.
2. Descriptive Study (XY plots and correlation studies)
a. Studying the X-Y plot between the different variables.
i. We observed that the relationship between distance and ground speed is highly
linear
ii. The relationships of distance with height and with pitch are slightly
linear
iii. The relationships of distance with duration and with no_pasg are clearly not
linear
b. Studying the correlation between the independent variables
i. Only ground speed and air speed are strongly collinear, but since
speed_air is mostly missing we can drop it from our regression model, and
then we do not need to worry about multicollinearity.
ii. All the other variables are essentially non-collinear.
3. Statistical modelling: linear regression
a. To study the factors that affect landing distance we fitted a linear
regression.
i. The R2 of the model was roughly 0.84.
ii. It showed ground speed, height and aircraft make as significant variables, each
with a p-value less than .0001
b. Correction to the model: To obtain a better model we kept only the significant
variables and rechecked the R2, which increased slightly.
i. Our dependent variable, distance, now depends on the independent
variables ground speed, height and aircraft make.
Our regression model:
Distance = 42.7*(Ground Speed) + 14.5*(Height) - 501*(air_craft_flag) - 2052
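As a worked example, the fitted equation can be wrapped in a small function. The coefficients are the rounded values reported above; the function name `predict_distance` and the input values are hypothetical and chosen only for illustration. The flag is 1 for Airbus and 0 for Boeing, as defined in this report.

```python
def predict_distance(ground_speed, height, is_airbus):
    """Evaluate the reported regression equation with its rounded coefficients."""
    air_craft_flag = 1 if is_airbus else 0  # 1 = Airbus, 0 = Boeing
    return 42.7 * ground_speed + 14.5 * height - 501 * air_craft_flag - 2052


# Hypothetical inputs: ground speed 80 and height 30, for each make.
boeing = predict_distance(80, 30, is_airbus=False)
airbus = predict_distance(80, 30, is_airbus=True)
print(round(boeing), round(airbus), round(boeing - airbus))
```

For these inputs the equation gives about 1799 for Boeing and about 1298 for Airbus. The gap of 501 is the coefficient on the aircraft flag, so at the same ground speed and height the model predicts a shorter landing distance for Airbus.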

_
Answering the Questions
1. How many observations (flights) do you use to fit your final model? If not all 950 flights,
why?
I used 832 observations to fit the linear regression model.
a. We removed 100 observations because they were duplicates.
b. We removed a further 18 observations because they had abnormal values.
c. We could have removed the 50 observations for which duration was missing, but we did
not, because duration was not a significant variable in the regression.
2. What factors impact the landing distance of a flight, and how?
The factors that affect the landing distance are as follows:
a. Ground speed: With an increase in ground speed the landing distance increases.
b. Height: With an increase in height the landing distance increases.
c. air_craft_flag: 1 stands for Airbus and 0 stands for Boeing. The two makes of
aircraft behave differently in terms of landing distance.
3. Is there any difference between the two makes, Boeing and Airbus?

_
For Airbus N=444
For Boeing N=388
When we fit a separate regression model for each aircraft make, we observe that pitch is
insignificant for Boeing, whereas for Airbus it is quite significant.
