Statistical Computing Final Project
Pranil Deone, MSBANA, M12412774
SUMMARY
The goal of this project is to identify the factors that affect the landing distance of a
commercial flight and to study their effects. The motivation behind the study is to reduce the
risk of landing overrun. We perform our analysis on two data sets, FAA1 and FAA2, and follow
the usual steps of data analysis: data cleaning, data exploration, data visualization, modeling
and model checking. During the data cleaning stage we remove blank records, duplicate
observations and abnormal values. After studying the distribution of the variables distance,
duration, speed air, speed ground, height, pitch and number of passengers, we run a correlation
analysis on all the variables to determine the relationships between them. We observe that
‘Distance’ is highly correlated with ‘speed air’ and ‘speed ground’. The regression analysis
shows that speed ground, height and pitch significantly affect landing distance. We also
determine which factors play a significant role when each make of aircraft is considered
separately, and conclude that for Boeing the predictor ‘pitch’ does not play a significant
role. A final model for the dependent variable ‘Distance’ is obtained in terms of the
predictors ‘speed ground’, ‘height’ and ‘pitch’. Finally, we run model diagnostics on this
final model.
Performing the completeness check of each variable – examine if missing values are present:
From the output below we can see that the variable ‘duration’ has 50 null observations and the
variable ‘speed_air’ has 642 null observations. The variables ‘duration’ and ‘speed_air’ are
crucial for the analysis as they directly impact the final goal of our study. So, at the data
cleaning stage we do not delete these variables or the observations with missing values; we
preserve them for later study and analysis.
CODE:
OUTPUT:
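The completeness check described above can be sketched in Python. This is only an illustration, not the report's own code (which does not appear in this copy and seems to rely on a statistical package's procedures). The sample rows are made up, and the helper name `count_missing` and the use of `None` to mark a missing value are assumptions.

```python
from collections import Counter


def count_missing(rows):
    """Count missing (None) values per variable across a list of dict rows."""
    missing = Counter()
    for row in rows:
        for name, value in row.items():
            if value is None:
                missing[name] += 1
    return dict(missing)


# Made-up sample rows, not taken from the FAA data sets.
flights = [
    {"duration": 98.5, "speed_ground": 107.9, "speed_air": 109.3},
    {"duration": None, "speed_ground": 80.2, "speed_air": None},
    {"duration": 125.7, "speed_ground": 66.0, "speed_air": None},
]

missing_counts = count_missing(flights)
print(missing_counts)
```

Variables with no missing values simply do not appear in the result, which keeps the report of incomplete variables short.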
Performing the validity check of each variable – examine if abnormal values are present:
CODE:
OUTPUT:
From the above output, it can be seen that there are 19 observations with abnormal values. Thus,
abnormal values constitute only 2.24% of the complete data set. As this percentage is very low,
we can separate these observations into another data set and delete them from the main data set.
Separating the abnormal values into another data set:
In this step, we create a data set ‘Abnormal’ containing all observations with abnormal
values, which could be used later in the analysis or in testing the model.
CODE:
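A minimal Python sketch of the split into a main data set and an ‘Abnormal’ data set follows. The report does not state which criteria define an abnormal value, so the limits in `LIMITS` are placeholders chosen purely for illustration, and the function names and sample rows are likewise hypothetical. The last lines redo the percentage arithmetic, assuming the 19 abnormal observations and the 831 retained observations together make up the data set at this stage.

```python
# Hypothetical validity limits as (lower bound, upper bound); None = no bound.
# These numbers are illustrative assumptions, NOT the report's criteria.
LIMITS = {
    "speed_ground": (30.0, 140.0),
    "height": (6.0, None),
}


def is_abnormal(row, limits=LIMITS):
    """Return True if any non-missing value falls outside its limits."""
    for name, (low, high) in limits.items():
        value = row.get(name)
        if value is None:
            continue  # missing values are kept for later analysis
        if low is not None and value < low:
            return True
        if high is not None and value > high:
            return True
    return False


def split_abnormal(rows, limits=LIMITS):
    """Split rows into (clean, abnormal) lists."""
    clean, abnormal = [], []
    for row in rows:
        (abnormal if is_abnormal(row, limits) else clean).append(row)
    return clean, abnormal


# Made-up sample rows.
flights = [
    {"speed_ground": 95.0, "height": 30.0},
    {"speed_ground": 150.0, "height": 28.0},  # above the assumed upper limit
    {"speed_ground": 80.0, "height": 2.0},    # below the assumed lower limit
    {"speed_ground": None, "height": 35.0},   # missing value, not flagged
]
clean, abnormal = split_abnormal(flights)

# Share of abnormal rows using the counts quoted in the report.
reported_share = round(100 * 19 / (19 + 831), 2)
print(len(clean), len(abnormal), reported_share)
```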
Summarizing the distribution of each variable:
We will use the univariate procedure to summarize the distribution of each variable.
Descriptive measures like mean, median, mode, standard deviation, variance, skewness, kurtosis,
range and interquartile range will help us to understand the distribution of the data.
A histogram has also been plotted for each variable to summarize and visualize its distribution.
From the values of skewness and kurtosis we can infer the following:
The variables ‘duration’, ‘no_pasg’, ‘speed_ground’, ‘height’ and ‘pitch’ are almost
symmetrically distributed and approximately follow a normal distribution. The variable ‘height’
is slightly skewed towards the right. The variable ‘pitch’ has thicker tails.
The variable ‘speed_air’ is skewed towards the right.
The variable ‘distance’ is heavily skewed towards the right.
CODE:
OUTPUT:
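The descriptive measures listed above can be sketched in Python with the standard library. This is an illustration under assumptions: the function name `describe` and the sample values are made up, and skewness and kurtosis are computed here from population moments, so they may differ slightly from the sample-adjusted values a statistical package reports.

```python
import statistics


def describe(values):
    """Return basic descriptive measures for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),
        "variance": statistics.variance(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,          # > 0 means a longer right tail
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # > 0 means thicker tails
    }


# Made-up samples: one symmetric, one with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]

summary_symmetric = describe(symmetric)
summary_skewed = describe(right_skewed)
print(summary_symmetric["skewness"], summary_skewed["skewness"])
```

A skewness near zero corresponds to the "almost symmetrically distributed" reading in the text, while a clearly positive value corresponds to "skewed towards the right".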
Establishing the relation of Speed_Air with other variables:
The variable speed_air has 75.53% missing data. As this variable has a significant impact on the
landing distance, we will try to predict the missing values from other variables like
speed_ground, height, pitch and duration.
CODE:
OUTPUT:
INTERPRETATION:
From the above table we can see that speed_air has a high positive correlation with
speed_ground. But there is no correlation with the other variables. Hence, we can try to predict
the value of speed_air from speed_ground.
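To make the comparison concrete, here is a small Python sketch of a Pearson correlation that skips rows where either value is missing, which matters for speed_air because most of its values are absent. The function name `pearson` and the sample values are assumptions made for illustration; they are not the report's data or code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Made-up values; None marks a missing speed_air reading.
speed_air = [None, 96.0, 101.0, 111.0, 121.0]
speed_ground = [60.0, 95.0, 100.0, 110.0, 120.0]
height = [30.0, 25.0, 40.0, 20.0, 35.0]

r_ground = pearson(speed_air, speed_ground)
r_height = pearson(speed_air, height)
print(round(r_ground, 3), round(r_height, 3))
```

In this toy sample the first coefficient is close to 1 and the second is close to 0, mirroring the pattern the interpretation describes.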
Predicting the value of Speed_Air:
We will run a simple linear regression with speed_air as the dependent variable and speed_ground
as the independent variable.
CODE:
OUTPUT:
INTERPRETATION:
The p-value from the analysis of variance shows that the independent variable ‘speed_ground’ can
reliably predict the dependent variable ‘speed_air’. The R-square value indicates that about 98%
of the variance in speed_air can be explained by speed_ground. The p-value from the parameter
estimates suggests that the parameter estimate for speed_ground is significantly different from
zero. The model for Speed_Air is given as:
Speed_Air = 0.9754*(Speed_Ground) + 2.64036
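A least-squares fit of this kind can be sketched in a few lines of Python. The function name `fit_line` is hypothetical, and the sample points are synthetic: they are placed exactly on the line reported above, so the fit should simply recover the reported slope and intercept with an R-square of 1. Real observations would scatter around the line and give an R-square below 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Synthetic points lying exactly on the reported line (not real data).
speed_ground = [90.0, 100.0, 110.0, 120.0]
speed_air = [0.9754 * x + 2.64036 for x in speed_ground]

slope, intercept, r_squared = fit_line(speed_ground, speed_air)
print(round(slope, 4), round(intercept, 5), round(r_squared, 4))
```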
Imputing the value of Speed_Air:
CODE:
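The imputation step can be sketched in Python using the fitted equation Speed_Air = 0.9754*(Speed_Ground) + 2.64036 from the regression above. The function name `impute_speed_air`, the use of `None` for missing values and the sample rows are assumptions for illustration. Observed speed_air values are left untouched and only missing ones are filled.

```python
# Coefficients of the fitted line reported in the text.
SLOPE = 0.9754
INTERCEPT = 2.64036


def impute_speed_air(rows):
    """Return copies of the rows with missing speed_air filled from speed_ground."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        if new_row.get("speed_air") is None and new_row.get("speed_ground") is not None:
            new_row["speed_air"] = SLOPE * new_row["speed_ground"] + INTERCEPT
        filled.append(new_row)
    return filled


# Made-up sample rows.
flights = [
    {"speed_ground": 100.0, "speed_air": None},
    {"speed_ground": 110.0, "speed_air": 109.5},
]
imputed = impute_speed_air(flights)
print(imputed)
```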
Establishing relation of Distance with other variables:
The entire purpose of this study is to model the dependent variable ‘Distance’ in terms of the
independent variables duration, speed_air, speed_ground, pitch, height and no_pasg. We will
calculate the correlation matrix of all the above variables to determine the inter-relationship
between the variables.
INTERPRETATION:
From the correlation matrix we can observe that the coefficient of correlation between
‘distance’ and ‘speed_air’ and between ‘distance’ and ‘speed_ground’ is significantly high. The
correlation of distance with the other variables is extremely small, which indicates that there
is no independent linear relationship of these variables with distance. The same is evident from
the X-Y plots shown above. Also, the coefficient of correlation between the other pairs of
variables is extremely low, and so we can conclude that there is no inter-relationship between
them.
Creating a Model for Distance:
CODE:
OUTPUT:
INTERPRETATION:
From the above regression analysis we can see that 79% of the variance in Distance can be
explained by the independent variables.
However, the p-values of speed_ground, duration and pitch indicate that the parameter estimates
for these variables are not significant in this model and do not appear to influence the
dependent variable ‘Distance’.
The p-value for speed_air is significant; but since about 75% of the values of speed_air were
predicted from speed_ground, this significance is not meaningful on its own. Because speed_air
is modelled from speed_ground, the two variables carry largely the same information, so we
conclude that speed_ground has a significant impact on distance, and we will use this variable
(and not speed_air) along with height and pitch for modelling.
Revised Model for Distance: (Considering both Aircraft makes)
CODE:
OUTPUT:
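A multiple linear regression of distance on speed_ground, height and pitch can be sketched in plain Python by solving the normal equations. This is a teaching sketch under assumptions, not the procedure used in the report: the function names are hypothetical, and the sample data are synthetic, generated exactly from y = 1 + 2*x1 + 3*x2 + 4*x3 so that the recovered coefficients can be checked. It reports coefficients and R-square only, not p-values.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(predictors, response):
    """Fit y = b0 + b1*x1 + ... ; returns ([b0, b1, ...], r_squared)."""
    design = [[1.0] + list(row) for row in predictors]  # add intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, response)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in design]
    mean_y = sum(response) / len(response)
    ss_res = sum((y - f) ** 2 for y, f in zip(response, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in response)
    return coef, 1 - ss_res / ss_tot


# Synthetic rows of (speed_ground, height, pitch); not real flight data.
X = [[60, 20, 3.5], [80, 35, 4.0], [100, 25, 4.5],
     [120, 40, 3.0], [90, 30, 5.0], [70, 45, 4.2]]
y = [1 + 2 * a + 3 * b + 4 * c for a, b, c in X]

coef, r_squared = fit_ols(X, y)
print([round(c, 4) for c in coef], round(r_squared, 4))
```

Solving the normal equations directly is fine for a handful of predictors, but a statistical package is the better choice in practice because it also provides standard errors and significance tests.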
INTERPRETATION:
The p-value from the analysis of variance shows that the independent variables can reliably
predict the dependent variable ‘Distance’. The R-square value indicates that about 79% of the
variance in ‘Distance’ can be explained by the selected predictors. The p-values from the
parameter estimates suggest that the parameter estimates for all predictors are significantly
different from zero.
The model for Distance can be given as:
Distance = (-3039.75) + (42.06925)*(speed_ground) + (13.49852)*(height) + (200.93948)*(pitch)
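As a worked example, the fitted equation can be evaluated for a given landing. The Python sketch below uses the coefficients reported above. The function name `predict_distance` and the input values (speed_ground 100, height 30, pitch 4) are made up for illustration, and the units are whatever the data sets use.

```python
# Coefficients of the reported model for Distance.
INTERCEPT = -3039.75
B_SPEED_GROUND = 42.06925
B_HEIGHT = 13.49852
B_PITCH = 200.93948


def predict_distance(speed_ground, height, pitch):
    """Evaluate the reported linear model for landing distance."""
    return (INTERCEPT
            + B_SPEED_GROUND * speed_ground
            + B_HEIGHT * height
            + B_PITCH * pitch)


# Illustrative inputs, not an observation from the data.
example = predict_distance(speed_ground=100.0, height=30.0, pitch=4.0)
print(round(example, 2))
```

Because all three coefficients are positive, raising any one predictor while holding the others fixed raises the predicted distance. For instance, one extra unit of speed_ground adds about 42.07 to the prediction.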
Revised Regression analysis for Distance: (Separately for Airbus and Boeing)
CODE:
OUTPUT:
How many observations do you use to fit your final model? If not all 950 flights, why?
We use 831 observations to fit our final model. After deleting the blank rows, duplicate
observations and observations containing abnormal values from the complete data set, we are
left with 831 observations.
The variables speed_air and duration have missing values. But since they do not significantly
affect the distance in our model, we do not include them in the final model. Also, the effect of
speed_air is captured by speed_ground.
Which factors impact the landing distance of the flight, and how?
The factors speed_ground, height and pitch impact the landing distance. All three predictors
have a positive effect on distance.
Is there any difference between the two makes, Boeing and Airbus?
When we consider each aircraft make separately, the factors affecting the landing distance
remain the same except for one change. For Boeing, the factor ‘pitch’ becomes insignificant and
does not affect the landing distance.