The document describes a study that analyzed factors affecting aircraft landing distance using simulated data of 950 commercial flight landing performances. The data was first processed to remove missing or abnormal values. Bivariate analysis found that aircraft speed and type had a strong positive effect on landing distance. A regression model was built using the available variables and improved based on diagnostic plots. The regression analysis found that the speed, type, pitch, and height of the aircraft significantly affected landing distance. The initial model had an R-Squared of 0.85 and a MAPE of 22.5%; in the final model the R-Squared rose to 0.97 and the MAPE fell to 10.8%.
Predicting aircraft landing distances using linear regression
University of Cincinnati, Carl H. Lindner College of Business
MS BANA 2017-18
Statistical Computing Project
Study of factors affecting aircraft landing distance
Samrudh Keshava Kumar
M12420395
(003)
The aim of this project is to study simulated data on 950 commercial flight landing performances
and understand the factors affecting landing distance. Initially, the data was processed to remove any
missing or abnormal values before proceeding with the analysis. Bivariate analysis was performed
between the variables; the speed and type of aircraft had a high positive impact on the landing
distance. A regression model was built with all the available variables, and the model was improved
based on the diagnostic plots of the regression model. The speed, type, pitch and height of the
aircraft were found to have a significant effect on the landing distance through the regression analysis.
The initial model had an R-Squared of 0.85 and a MAPE of 22.5%; the R-Squared was increased to 0.97
and the MAPE reduced to 10.8% in the final model.
Chapter 1
Data exploration and data cleaning
Aim: To verify data quality and correct any problems before proceeding with the analysis.
Loading the datasets into the SAS environment
PROC IMPORT DATAFILE='/home/samrudhkumar0/Project/FAA1.csv'
DBMS=CSV
OUT=FAA1;
GETNAMES=YES;
RUN;
PROC IMPORT DATAFILE='/home/samrudhkumar0/Project/FAA2.csv'
DBMS=CSV
REPLACE
OUT=FAA2;
GETNAMES=YES;
RUN;
/*Print the top 10 rows of data of each dataset*/
PROC PRINT DATA=faa1(obs=10);
RUN;
PROC PRINT DATA=faa2(obs=10);
RUN;
The datasets have been loaded into the SAS environment as FAA1 and FAA2. The summary of the
data is obtained using PROC MEANS.
PROC MEANS DATA = FAA1 n nmiss max min mean median var;
Title "Basic Summary of FAA1";
RUN;
PROC MEANS DATA = FAA2 n nmiss max min mean median var;
Title "Basic Summary of FAA2";
RUN;
/*Observed that FAA2 has a few empty rows; they are removed in the next step*/
DATA NO_DEADROWS;
SET FAA2;
IF MISSING(AIRCRAFT) THEN DELETE;
RUN;
/*The 50 empty rows have been removed; the dataset now contains 150
observations*/
The empty rows of data have been removed. The missing values under speed_air will be dealt with
later.
Combining data sets from different sources
Before merging the datasets together, SAS requires that both datasets be sorted in the same
fashion. The aircraft name and speed_ground together identify each landing and are the variables
by which the two datasets can be merged.
/*Sorting the datasets before merging*/
PROC SORT DATA = FAA1;
BY aircraft speed_ground;
RUN;
PROC SORT DATA = NO_DEADROWS;
BY aircraft speed_ground;
RUN;
DATA MERGED;
MERGE FAA1 NO_DEADROWS;
BY aircraft speed_ground;
/*Merging by speed_ground since there is
repetition in the data; speed_ground has unique values, so it works as a primary key*/
RUN;
/*850 OBSERVATIONS AFTER MERGE*/
The combined dataset should have had 800 + 150 = 950 observations, but it contains 850
observations. This shows that 100 observations were not unique. A summary of the
merged data is shown in the table below.
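The effect of merging on a shared key can be sketched outside SAS. The following is a minimal Python illustration, not the report's method: the function name `merge_by_key` and the sample rows are hypothetical, and it only approximates the SAS match-merge by keeping one row per (aircraft, speed_ground) pair.

```python
def merge_by_key(first, second, key_fields=("aircraft", "speed_ground")):
    """Combine two lists of row dicts, keeping one row per key."""
    merged = {}
    for row in list(first) + list(second):
        key = tuple(row[field] for field in key_fields)
        # Rows sharing a key collapse into one; later rows add their fields.
        merged.setdefault(key, {}).update(row)
    return list(merged.values())

# Hypothetical rows standing in for FAA1 and the cleaned FAA2.
faa1 = [
    {"aircraft": "boeing", "speed_ground": 98.5, "duration": 120.0},
    {"aircraft": "airbus", "speed_ground": 75.2, "duration": 95.0},
]
faa2 = [
    {"aircraft": "boeing", "speed_ground": 98.5},  # same landing as in faa1
    {"aircraft": "airbus", "speed_ground": 60.1},
]

combined = merge_by_key(faa1, faa2)
print(len(faa1) + len(faa2), "rows in,", len(combined), "rows out")
# prints: 4 rows in, 3 rows out
```

The gap between rows in and rows out is the number of repeated keys, which is how the 100 non-unique observations show up in the merged dataset.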
Performing the completeness check of each variable
The MEANS procedure is used with the options N and NMISS to display the number of observations and
the number of missing values in each variable.
PROC MEANS DATA = MERGED N NMISS;
RUN;
/*Treating missing values - duration, speed_air*/
642 and 50 values are missing from the variables speed_air and duration respectively.
Performing the validity check of each variable
The UNIVARIATE procedure is run to determine the quantile ranges; any values above the 99% level or
below the 1% level can be treated as abnormal.
PROC UNIVARIATE DATA=MERGED PLOT;
RUN;
[Figures: distribution plots for No. of passengers, Speed_ground, Height, Pitch, Distance and Speed_air]
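The 1%/99% screening rule can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the SAS computation: `percentile` uses simple linear interpolation, which may differ slightly from the percentile definition PROC UNIVARIATE applies, and the `heights` values are made up.

```python
def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def flag_extremes(values, low_pct=1, high_pct=99):
    """Return the values lying outside the [low_pct, high_pct] percentile band."""
    low = percentile(values, low_pct)
    high = percentile(values, high_pct)
    return [v for v in values if v < low or v > high]

# Hypothetical heights: 1..99 plus two implausible readings.
heights = list(range(1, 100)) + [-3, 250]
print(flag_extremes(heights))  # prints: [-3, 250]
```

Values flagged this way are only candidates for removal; the thresholds actually applied in the cleaning step come from the variable dictionary.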
Data cleaning
Based on the understanding of the data from the previous steps, abnormal values of speed_ground,
height, duration and distance are deleted from the analytical dataset and moved to a new dataset
containing only outliers. For the variable ‘duration’, out of 781 observations, 50 (~6%) were missing; the
missing values can be approximated with the average value. For the variable ‘speed_air’, which has 203
missing out of 628 (~32%), the missing values are not replaced, since doing so would lead to approximation
errors.
DATA TREATED_DATA;
SET MERGED;
IF SPEED_GROUND < 30 THEN DELETE;
IF SPEED_GROUND > 140 THEN DELETE;
IF HEIGHT < 6 THEN DELETE;
IF (DURATION < 40 AND DURATION > 0) THEN DELETE;
IF MISSING(DURATION) THEN DURATION = 154.0065385;
IF DISTANCE > 6000 THEN DELETE;
RUN;
/*831 OBSERVATIONS REMAINING*/
PROC MEANS DATA = TREATED_DATA N NMISS;
RUN;
The treated dataset contains 831 observations and 0 missing values for all variables except
‘speed_air’.
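The cleaning rules can be restated as a small Python sketch for readers who want to check the logic independently of SAS. It is an approximation under assumptions: the function name `clean` and the sample rows are hypothetical, missing durations are represented by `None`, and the imputed mean is computed from the rows that survive the filters rather than hard-coded as in the DATA step.

```python
def clean(rows):
    """Apply the report's thresholds, then mean-impute missing duration."""
    kept = []
    for row in rows:
        if not 30 <= row["speed_ground"] <= 140:
            continue  # abnormal ground speed
        if row["height"] < 6:
            continue  # below the required threshold height
        if row["distance"] > 6000:
            continue  # longer than a typical runway
        duration = row["duration"]
        if duration is not None and 0 < duration < 40:
            continue  # abnormally short flight; missing values are kept
        kept.append(dict(row))
    known = [r["duration"] for r in kept if r["duration"] is not None]
    mean_duration = sum(known) / len(known)
    for r in kept:
        if r["duration"] is None:
            r["duration"] = mean_duration
    return kept

# Hypothetical landings: three valid rows (one with missing duration), three abnormal.
rows = [
    {"speed_ground": 80.0, "height": 30.0, "distance": 1500.0, "duration": 100.0},
    {"speed_ground": 25.0, "height": 30.0, "distance": 900.0, "duration": 120.0},
    {"speed_ground": 90.0, "height": 4.0, "distance": 1700.0, "duration": 130.0},
    {"speed_ground": 100.0, "height": 25.0, "distance": 2500.0, "duration": None},
    {"speed_ground": 95.0, "height": 20.0, "distance": 2000.0, "duration": 200.0},
    {"speed_ground": 70.0, "height": 20.0, "distance": 1000.0, "duration": 30.0},
]
cleaned = clean(rows)
print(len(cleaned), cleaned[1]["duration"])  # prints: 3 150.0
```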
PROC SORT DATA = TREATED_DATA;
BY AIRCRAFT SPEED_GROUND;
RUN;
PROC SORT DATA = MERGED;
BY AIRCRAFT SPEED_GROUND;
RUN;
DATA COMPLEMENT;
MERGE TREATED_DATA (IN = X) MERGED (IN = Y);
BY AIRCRAFT SPEED_GROUND;
/*Keep rows present in only one of the two datasets, i.e. the removed observations*/
IF (X = 1 AND Y = 0) OR (X = 0 AND Y = 1);
DROP DURATION SPEED_AIR;
RUN;
PROC PRINT DATA = COMPLEMENT;
RUN;
The above statements generate the table of observations that were removed from the main dataset.
It contains 19 observations as expected.
Summarizing the distribution
To summarize the distribution of each variable, it is sufficient to look at the mean and
median values of each.
PROC MEANS DATA=TREATED_DATA N MEAN MEDIAN;
TITLE "MEAN AND MEDIAN OF TREATED DATA";
RUN;
The mean and median values of all the variables except distance are close to each other, suggesting
that they are roughly symmetric and approximately normally distributed.
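The mean-versus-median comparison can be illustrated with two made-up samples. The sketch below is only an illustration (the numbers and the helper `relative_gap` are hypothetical): a gap near zero points to a symmetric distribution, which is consistent with normality but does not prove it, while a large positive gap points to right skew.

```python
from statistics import mean, median

def relative_gap(values):
    """(mean - median) / median: near 0 for symmetric data, positive for right skew."""
    return (mean(values) - median(values)) / median(values)

symmetric = [148, 150, 152, 154, 156]        # e.g. flight durations in minutes
right_skewed = [700, 800, 900, 1000, 5000]   # e.g. landing distances in feet

print(relative_gap(symmetric))                # prints a value equal to 0
print(round(relative_gap(right_skewed), 2))   # prints: 0.87
```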
Using the UNIVARIATE procedure, the distance variable is plotted to understand its distribution.
PROC UNIVARIATE DATA=TREATED_DATA PLOT;
VAR DISTANCE;
RUN;
The distance variable follows a skewed pattern, and most observations fall between 600 and
1000 feet.
It was observed that 100 observations were duplicates, and they were removed. The variable speed_air
had 628 observations missing; the missing values will be treated during the data analysis steps.
Chapter 2
Data Visualization
Aim: To understand how the independent variables/factors affect the dependent variable (distance)
being modelled.
Since the data is being modelled using linear regression, it is assumed that the independent variables
have a linear relationship with the predicted variable. The slope of the plots will indicate the impact
the independent variables have on the dependent variable (the variable being predicted), and the
shape will indicate the type of relationship (linear, quadratic, etc.) and the spread/variability of the
data.
/*Chapter 2 visualization*/
/*Plotting distance of landing with other variables to understand the relationships*/
proc plot data = treated_data;
plot distance*pitch;
plot distance*height;
plot distance*speed_air;
plot distance*speed_ground;
plot distance*no_pasg;
plot distance*duration;
plot distance*aircraft;
run;
The plot indicates that the pitch of the aircraft does not have much of an impact on the landing
distance; the data is concentrated in the centre of the plot and has high variability.
Height of the aircraft above the threshold of the runway has a slight positive impact on the
landing distance.
The variable speed_air has a minimum value of 90 MPH, below which values have not
been captured in the data. The variable speed_air shows a high positive correlation with the
landing distance, and the spread of the data points looks minimal. In the regression
analysis carried out later, this variable should have a higher significance.
The speed_ground variable has a quadratic relationship with the landing distance: below 70 MPH
the impact is almost negligible, but above 70 MPH there seems to be a high positive correlation
similar to what is observed for the speed_air variable.
The no_pasg (No. of passengers) variable does not seem to have an impact on the landing distance.
Duration of the flight seems to have a slight negative impact on the landing distance.
The type of aircraft seems to affect the landing distance; Airbus seems to exhibit shorter
landing distances compared to Boeing.
Further, to understand the strength of the relationships, the correlation between the variables is
calculated using the CORR procedure in SAS.
In the previous plot for speed_ground, the curve seems to be flat below 70 MPH. To test this, a subset
of the data below 70 MPH is taken and the correlation is calculated between speed_ground and
distance.
data ground_speed_low;
set treated_data;
if speed_ground > 70 then delete;
keep speed_ground distance;
run;
proc corr data = ground_speed_low;
run;
The correlation between the two variables is 0.11 below 70 MPH, meaning the speed of the aircraft has minimal
impact on the landing distance in that range; it is 0.39 for speeds below 80 MPH and 0.65 for speeds
below 90 MPH. For speed_air, the missing values could be approximated to be equal to the
speed_ground values.
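The subset correlations quoted above can be reproduced in spirit with a few lines of Python. This is a sketch on made-up data, not the FAA data: `pearson` and `correlation_below` are hypothetical helper names, and the pairs are chosen only so that distance is flat at low speeds and rises steeply at high speeds.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_below(pairs, cutoff):
    """Correlation of (speed, distance) pairs restricted to speed <= cutoff."""
    subset = [(s, d) for s, d in pairs if s <= cutoff]
    return pearson([s for s, _ in subset], [d for _, d in subset])

# Hypothetical (speed_ground, distance) pairs: flat below 70, steep above.
pairs = [(40, 900), (50, 880), (60, 910), (70, 895),
         (80, 1200), (90, 1800), (100, 2600)]

for cutoff in (70, 80, 90, 100):
    print(cutoff, round(correlation_below(pairs, cutoff), 2))
```

Raising the cutoff pulls in the steep part of the curve, so the correlation grows, which mirrors the pattern reported for the 70, 80 and 90 MPH subsets.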
/*Calculating the correlation between the variables*/
proc corr data = treated_data;
run;
The highlighted values indicate variables which are highly correlated. The variables speed_air and
speed_ground are highly correlated with each other and are correlated with the predicted variable
(distance). One of the two should be eliminated to prevent multicollinearity.
The variables speed_ground, speed_air and aircraft type seem to have an impact on the landing
distance. ‘speed_ground’ and ‘speed_air’ have the highest correlation coefficients with distance and
are correlated with each other. The missing values of speed_air could be imputed with the values
from speed_ground, and the speed_ground variable could then be eliminated altogether.
Chapter 3
Statistical Modelling
Aim: Understand the variables significantly affecting the landing distance and fit a linear model to
predict the landing distance of the aircraft.
SAS codes and outputs:
From the previous chapter, the variables speed_air, speed_ground and aircraft have a significant impact
on the landing distance. To include aircraft as a variable in the linear model, a dummy variable
called aircraft_type is created with values 0 and 1 for Airbus and Boeing respectively.
/*Run t-test on speed_ground and speed_air*/
data speeds_df;
set treated_data;
if missing(speed_air) then delete;
keep speed_air speed_ground;
run;
proc ttest data = speeds_df;
paired speed_air*speed_ground;
run;
The null hypothesis being tested is that the difference between the means of the two variables is
zero. The null hypothesis cannot be rejected because p > 0.05; therefore, we could say that the two
variables are similar. The mean difference between the two is 0.0739 MPH and the correlation is
0.987. Given this evidence, speed_ground is very similar to speed_air. The missing values of
speed_air can be imputed with values from speed_ground.
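The paired comparison can be sketched in Python to show what the test statistic measures. This is a minimal illustration under assumptions: `paired_t` is a hypothetical helper, the five speed pairs are invented, and instead of computing a p-value the statistic is compared with 2.776, the two-sided 5% critical value of the t distribution with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
    return mean(diffs), t_stat, n - 1

# Hypothetical paired readings in MPH for five landings.
speed_air = [101.2, 95.4, 110.8, 99.1, 104.6]
speed_ground = [100.9, 96.0, 110.1, 99.5, 104.2]

mean_diff, t_stat, df = paired_t(speed_air, speed_ground)
critical = 2.776  # two-sided 5% critical value for df = 4
print(round(mean_diff, 3), round(t_stat, 3), df)
print("reject H0" if abs(t_stat) > critical else "cannot reject H0")
```

A statistic well inside the critical value means the mean difference is indistinguishable from zero at that level, which is the sense in which the two speed variables are treated as interchangeable for imputation.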
A new dataset is created with the above-mentioned changes.
/*Creating a dummy variable for aircraft type to include aircraft type as a
*variable in the linear model
*/
data final_model_data;
set treated_data;
if aircraft = 'airbus' then aircraft_type = 0;
else aircraft_type = 1;
if missing(speed_air) then speed_air = speed_ground;
drop aircraft speed_ground;
run;
proc means data = final_model_data N Nmiss;
run;
/*Generate correlation matrix*/
proc corr data = final_model_data;
run;
Variables with high correlation with distance are highlighted. None of the independent variables are
correlated with each other.
The final dataset has no missing values and 831 observations. The variables speed_ground and aircraft
have been eliminated; further analysis is performed on this dataset.
A regression model is fitted on the dataset.
/*Fitting a regression model*/
proc reg data = final_model_data;
model distance = speed_air aircraft_type no_pasg pitch height duration;
run;
Below is the summary of the correlation and the regression analysis of the independent variables.

Independent     Direction             Correlation   P-value of    Regression         P-value of
variable                              coefficient   correlation   coefficient        regression coeff.
                                                    coefficient   (Distance ~ All)   (Distance ~ All)
speed_air       Strong positive       0.8675        <.0001        42.45547           <.0001
aircraft_type                         0.2381        <.0001        481.22446          <.0001
no_pasg         No visible relation   -0.0177       0.6093        -2.15925           0.1806
pitch           No visible relation   0.08703       0.0121        34.84949           0.1552
height          No visible relation   0.09941       0.5082        14.07733           <.0001
duration        Slight negative       -0.04995      0.1503        0.00415            0.9871
The next step is to eliminate, one by one, the variables whose regression coefficients are not significant (p-value > 0.05).
The resultant model uses speed_air, aircraft_type and height as independent variables. The R-Squared
is 0.85.
Chapter 4
Model Validation
Aim: Diagnose the model performance by analysing the plot of the residuals, the R-Squared and the
MAPE of the predicted values.
/*Model validation: check if the residuals are normally distributed*/
proc reg data = final_model_data;
model distance = speed_air aircraft_type height;
run;
The fit diagnostics for the predicted variable show that the residuals are not random. The
non-random pattern shows that the linear model is inappropriate and the data needs some
transformations. The model is underestimating the relationship in the extreme ranges of landing
distance.
Calculation of MAPE
proc reg data = final_model_data;
model distance = speed_air aircraft_type height;
output out=predicted_values predicted=py;
run;
data predicted_values;
set predicted_values;
error_abs = abs(distance - py)/distance;
keep distance py error_abs;
run;
proc means data = predicted_values N mean;
var error_abs;
run;
/*MAPE is 22.575%*/
The model's prediction accuracy is poor; the predictions could be improved by transforming
a few predictor variables.
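The MAPE calculation performed by the DATA step and PROC MEANS above amounts to averaging |actual - predicted| / actual over all observations. A minimal Python sketch of the same formula follows; the function name `mape` and the three sample landings are hypothetical and are not taken from the report's data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical landing distances in feet and model predictions.
actual = [1000.0, 2000.0, 4000.0]
predicted = [1100.0, 1800.0, 3000.0]

print(f"MAPE = {mape(actual, predicted):.2f}%")  # prints: MAPE = 15.00%
```

Because each error is divided by the actual distance, a miss of the same size in feet weighs more heavily on short landings than on long ones.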
Chapter 5
Remodelling and model Validation
Aim: Transform predictor variables and ensure the residual plot is random. Compare the new models
with the base model.
SAS codes:
data remodelling_data;
set final_model_data;
speed_air_4 = speed_air**4;
speed_air_3 = speed_air**3;
speed_air_2 = speed_air**2;
height_pitch = height*pitch;
run;
proc means data = remodelling_data N Nmiss Min max median;
run;
proc corr data = remodelling_data;
run;
From the correlation matrix, speed_air_4 gives the highest correlation with distance. height_pitch,
which is the product of height and pitch, has a higher correlation than the individual
variables; both have been selected for the final model's independent variable list.
/*Speed_air has no missing values*/
proc plot data = remodelling_data;
plot distance*speed_air;
plot distance*speed_air_2;
plot distance*speed_air_3;
plot distance*speed_air_4;
plot distance*height_pitch;
run;
The transformed speed_air variable (speed_air^4) shows a linear relationship with the landing
distance. The speed_air variable will be replaced with speed_air_4.
/*Fitting a regression model*/
proc reg data = remodelling_data;
model distance = speed_air_4 aircraft_type height_pitch;
run;
The model has a better residual plot, though it is under-predicting the longer
landing distances; this is acceptable given the lack of data points covering these scenarios. The
R-Squared has improved from 0.85 to 0.97, indicating higher prediction accuracy.
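Why a power transform can raise the R-Squared of a linear fit can be seen on synthetic data. The sketch below is an illustration under assumptions, not a re-analysis: `r_squared` fits a one-predictor least-squares line, and the distances are generated from a hypothetical fourth-power curve, so the transformed predictor fits almost exactly by construction.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical speeds (MPH) and a curved response built from speed**4.
speed_air = [90, 100, 110, 120, 130, 140]
distance = [500 + 1e-5 * s ** 4 for s in speed_air]
speed_air_4 = [s ** 4 for s in speed_air]

print("raw speed:   ", round(r_squared(speed_air, distance), 4))
print("speed ** 4:  ", round(r_squared(speed_air_4, distance), 4))
```

On real data the gain is smaller than in this constructed case, but the mechanism is the same: the transformed predictor follows the curvature that a straight line in the raw speed cannot.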
proc reg data = remodelling_data;
model distance = speed_air_4 aircraft_type height_pitch;
output out=predicted_values predicted=py;
run;
data predicted_values;
set predicted_values;
error_abs = abs(distance - py)/distance;
keep distance py error_abs;
run;
proc means data = predicted_values N mean;
var error_abs;
run;
The MAPE (Mean Absolute Percentage Error) of the improved model is 10.88%.
The MAPE has reduced from 22.58% to 10.88%; the transformation of the data improved the
accuracy of the predictions.
The model can be further improved with more data points, especially in the scenarios where the
landing distances are greater than 4000 feet, since these are the cases to be predicted. More variables,
such as gross weight of the aircraft, aircraft model number and wind direction, could further
improve this model.
Appendix
Variable dictionary:
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal
flight should always be greater than 40 min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold
of the runway. If its value is less than 30 MPH or greater than 140 MPH, then the landing would be
considered as abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the
runway. If its value is less than 30 MPH or greater than 140 MPH, then the landing would be
considered as abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The
landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance
between the threshold of the runway and the point where the aircraft can be fully stopped. The
length of the airport runway is typically less than 6000 feet.