BANA 6043
PROJECT

STAT COMPUTING
Name: Mansi Verma
UCID: M 10632087
PROBLEM STATEMENT:
To study the factors that impact the landing distance of a commercial
flight in the given data of 950 flights with the below data variables:
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing.
The duration of a normal flight should always be greater than 40min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when
passing over the threshold of the runway. If its value is less than 30MPH
or greater than 140MPH, then the landing would be considered as
abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing
over the threshold of the runway. If its value is less than 30MPH or
greater than 140MPH, then the landing would be considered as
abnormal.
Height (in meters): The height of an aircraft when it is passing over the
threshold of the runway. The landing aircraft is required to be at least 6
meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the
threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically,
it refers to the distance between the threshold of the runway and the
point where the aircraft can be fully stopped. The length of the airport
runway is typically less than 6000 feet.
SUMMARY
The factors that impact the landing distance of a commercial flight in
the given data of 950 flights were studied. After eliminating duplicate
records and the observations that did not meet the constraints of the
aviation industry, we analyzed the remaining 831 records to come up with
an equation that explains the landing distance.
Distance = -1049 + 454.45(aircraft_name) + 0.27(speed_ground1) +
14(height) + 21(pitch)
*Aircraft name is ‘0’ for Boeing and ‘1’ for Airbus
*Speed_ground1 = Speed_ground^2 (the square of speed_ground)
The variables provided in the data were all put to test to check the
impact of each one of them on the landing distance. We found that the
number of passengers and the duration of the flight do not affect the
landing distance which is practically reasonable.
The landing distance depends on the pitch, the height and, above all,
the square of the variable speed_ground. These results appear in the
regression modelling later in the report. The distance also depends on
the make of the aircraft, which is therefore part of the equation as
well.
We calculate the parametric coefficients for each explanatory variable
through linear regression modelling and check the assumptions under
which we can apply it.
CHAPTER ONE: DATA EXPLORATION AND
DATA CLEANING
GOAL:
The objective of exploration and data cleaning is to prepare the data
for further analysis, since we need to visualize and model the data to
arrive at results and insights. Validity checks must be performed
because the data has to meet a few conditions that apply in the airline
business.
Steps:
1.
The data is provided in two separate files, so we import the Excel files
and append them using SAS. The import also picks up blank records, which
we need to delete. We use the SET statement to stack the data sets one
below the other, as the data fields in both files are almost the same.
After stacking the files, we notice that the second data set lacks the
variable duration, which the first data file has. There are also data
rows that appear in both FAA1 and FAA2 with the same values, differing
only in the missing duration.
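The stacking step can be sketched as follows. This is a minimal Python illustration, not the SAS code used in the report; the rows, the helper names `stack` and `drop_blank`, and the use of None for a missing value are all assumptions made for the example.

```python
# Hypothetical miniature versions of the two files; FAA2 has no duration field.
faa1 = [
    {"aircraft": "boeing", "duration": 98.5, "no_pasg": 53, "distance": 3369.8},
    {"aircraft": "airbus", "duration": 125.7, "no_pasg": 69, "distance": 1145.5},
]
faa2 = [
    {"aircraft": "airbus", "no_pasg": 69, "distance": 1145.5},
    {"aircraft": None, "no_pasg": None, "distance": None},  # blank row from the import
]


def stack(*tables):
    """Append tables one below the other, filling absent fields with None."""
    fields = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in fields:
                    fields.append(name)
    return [{name: row.get(name) for name in fields}
            for table in tables for row in table]


def drop_blank(rows):
    """Discard rows in which every field is missing."""
    return [row for row in rows if any(v is not None for v in row.values())]


combined = drop_blank(stack(faa1, faa2))
```

In this toy data the FAA2 row survives with a missing duration, which is the situation the later duplicate check has to handle.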
2.
We need to learn more about the variables and data points before
analyzing the data.
The PROC MEANS output shows that the data needs to be cleaned: the min
and max values reveal discrepancies with the norms the data needs to
follow. The Nmiss values also show that 711 values are missing for the
variable speed_air, whereas only 239 values of this variable are
present. We might want to delete this field, as it is not captured for
roughly 75% of the data rows and so will not give true insight into the
result. The PROC MEANS table will help us think through the other steps
needed to ready our data for modelling.
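A rough Python equivalent of the N, Nmiss, min and max columns is sketched below. It is only an illustration under assumed data: the four sample rows and the function name `summarize` are hypothetical, and the report itself used PROC MEANS.

```python
def summarize(rows, variables):
    """Return {variable: (n, nmiss, minimum, maximum)}, ignoring missing values."""
    summary = {}
    for name in variables:
        present = [row[name] for row in rows if row.get(name) is not None]
        nmiss = len(rows) - len(present)
        low = min(present) if present else None
        high = max(present) if present else None
        summary[name] = (len(present), nmiss, low, high)
    return summary


# Hypothetical rows: speed_air is mostly missing and one height is negative.
sample = [
    {"speed_air": None, "height": 27.4},
    {"speed_air": 109.3, "height": -3.5},
    {"speed_air": None, "height": 30.1},
    {"speed_air": None, "height": 5.8},
]
stats = summarize(sample, ["speed_air", "height"])
share_missing = stats["speed_air"][1] / len(sample)
```

A minimum below an allowed bound (here the negative height) is the kind of discrepancy the min and max columns expose.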
3.
Now we need to remove rows that are duplicated with the same values.
Since the FAA2 data file has no values for duration, we must exclude
this variable from consideration while looking for duplicates.
For this we sort the data and then use the NODUPKEY option to remove
duplicates by all variables except the duration field. This identifies
the overlap between the two data sets FAA1 and FAA2, and the removed
rows are captured in the data set 'removed'.
This leaves us with 850 data rows and puts the 100 duplicated rows in a
separate data set, which is not needed further but is kept for
inspection.
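The idea of de-duplicating on every field except duration can be sketched in Python as below. This is an assumption-laden stand-in for the sort with NODUPKEY: the function `dedupe`, the sample rows and the rule of keeping the first occurrence are all hypothetical choices for illustration.

```python
def dedupe(rows, ignore=("duration",)):
    """Keep the first row for each key; the key is every field not in `ignore`."""
    kept, removed, seen = [], [], set()
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        if key in seen:
            removed.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, removed


# Hypothetical stacked data: the third row repeats the second without duration.
rows = [
    {"aircraft": "boeing", "duration": 98.5, "no_pasg": 53, "distance": 3369.8},
    {"aircraft": "airbus", "duration": 125.7, "no_pasg": 69, "distance": 1145.5},
    {"aircraft": "airbus", "duration": None, "no_pasg": 69, "distance": 1145.5},
]
kept, removed = dedupe(rows)
```

Because the first file's rows come first in the stack, the copy that is kept is the one that still carries a duration.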
4.
As the variable speed_air has roughly 75% of its values missing, we can
either remove it now or keep it and drop it later. We might predict it
or impute it with the mean value, depending on our model, but for the
time being we keep it in order to study it further.
5.
Now we perform a sanity check of the data, validating each variable one
by one against the norms of the airline industry. We start with
duration, which should be greater than 40 minutes for a normal flight.
This removes 5 records with abnormal flight duration. We are left
with 845 records, and the 5 data rows with abnormal flight duration can
be put in another data set for inspection.
(Output below:)
6.
Now we check the variable height and delete the data rows with heights
of less than 6 meters, as these are unacceptable.
We get 10 rows with unacceptable heights and are left with 835 records.
7.
Now we check the variable speed_ground. As per the constraints, its
values should lie between 30 MPH and 140 MPH, so we delete the data rows
with unacceptable speed_ground.
This deletes another 3 data rows, and we are left with 832 rows.
8.
Now we check the variable distance. As the length of the runway is 6000
feet, no data value may exceed this. One record does, so we delete it.
We are now left with 831 records.
9.
Now we check the variable speed_air. As per the constraints, its values
should lie between 30 MPH and 140 MPH, so we delete the data rows with
unacceptable speed_air.
No rows with unacceptable speed_air remain in the data, so we are still
left with 831 rows.
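The validity checks of steps 5 to 9 can be sketched as one ordered rule list. The thresholds come from the problem statement; everything else is an assumption for illustration, including the names `RULES`, `ok` and `apply_rules`, the sample flights, and the choice to let a missing value pass a check (so that rows without duration or speed_air are not discarded).

```python
def ok(value, test):
    """A missing value passes; only recorded values are tested (assumption)."""
    return value is None or test(value)


# Thresholds taken from the problem statement, applied in the report's order.
RULES = [
    ("duration", lambda v: v > 40),
    ("height", lambda v: v >= 6),
    ("speed_ground", lambda v: 30 <= v <= 140),
    ("distance", lambda v: v <= 6000),
    ("speed_air", lambda v: 30 <= v <= 140),
]


def apply_rules(rows):
    """Filter rows rule by rule and log how many rows each rule removes."""
    log = []
    for name, test in RULES:
        before = len(rows)
        rows = [r for r in rows if ok(r.get(name), test)]
        log.append((name, before - len(rows)))
    return rows, log


# Hypothetical flights, three of which break one rule each.
flights = [
    {"duration": 98.5, "height": 27.4, "speed_ground": 107.9, "distance": 3369.8, "speed_air": 109.3},
    {"duration": 31.4, "height": 30.0, "speed_ground": 80.0, "distance": 1500.0, "speed_air": None},
    {"duration": None, "height": 2.1, "speed_ground": 75.0, "distance": 1200.0, "speed_air": None},
    {"duration": 120.0, "height": 25.0, "speed_ground": 141.2, "distance": 6533.0, "speed_air": 141.7},
    {"duration": 150.0, "height": 35.0, "speed_ground": 65.0, "distance": 900.0, "speed_air": None},
]
clean, log = apply_rules(flights)
```

Logging the count per rule mirrors the way the report tracks how many rows each check removes.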
10.
Now we see the distributions of each of the variables as we require
distribution assumptions for applying modelling techniques. We capture
the moments, null hypotheses, quantiles of each variable. We look at the
histograms of each variable to notice their distributions.
1. Duration
The duration is almost normal, with a slight skew.
2. Number of passengers
Appears normally distributed but slightly left skewed.
3. Speed_Ground
Appears normally distributed.
4. Speed_air
The air speed is not at all normal.
5. Height
Normally distributed with a slight right skew.
6. Pitch
Pitch looks normally distributed.
7. Distance
This doesn’t appear to be normally distributed.
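The skew that the histograms suggest can also be put into a number. The sketch below computes the simple moment-based skewness in Python; it is an assumption that this definition is close enough for illustration, since statistical packages often report a bias-adjusted version, and the sample values are hypothetical.

```python
from statistics import fmean


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = skewness([1, 2, 3, 4, 5])        # balanced around the mean
right_skewed = skewness([1, 1, 1, 2, 10])    # long tail to the right
```

A value near zero is consistent with a symmetric shape, a positive value with a right skew and a negative value with a left skew.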
Data Preparation Questions:
1. How should we treat a variable with about 75% of its values
missing?
2. How can we fill in values for variables with only a few missing data
values, for the sake of completeness?
3. How can we impute values for a variable with the majority of its data
missing?
4. How can we substitute a value for an unacceptable data point rather
than delete the entire data row?
CHAPTER TWO: DATA EXPLORATION
GOAL:
The objective of data exploration is to study the cleaned data and
prepare it for the regression model. This includes visualizing the data,
checking for linearity, and examining the correlations between
variables so that we can eliminate variables that do not influence our
response variable.
Steps:
1. Before beginning the modeling, we plot our data. By examining
these initial plots, we can quickly assess whether the data have
linear relationships or interactions are present.
A variable that has a linear relationship with the response variable
produces a plot that resembles a straight line (here, speed_air). The
other plots are scattered.
We can consider transforming other variables in our modeling to
increase the linearity.
Distance Vs No_pasg
Distance Vs Speed_ground
Distance Vs Speed_air
Distance Vs Height
Distance Vs Pitch
2. A transformation of speed_ground can increase the linearity.
Squaring speed_ground makes the plot much more linear than the previous
graph.
Distance Vs Speed_ground^2
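One way to see the effect of the square transformation numerically is to compare the correlation of distance with speed_ground and with its square. The Python sketch below does this on synthetic data that is exactly quadratic in speed; the data, the coefficient used to generate it and the function name `pearson` are assumptions for illustration, not values from the flight records.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Synthetic example: distance grows with the square of the ground speed.
speeds = list(range(40, 131, 10))
distance = [0.27 * s * s for s in speeds]
squared = [s * s for s in speeds]

r_linear = pearson(speeds, distance)
r_squared_term = pearson(squared, distance)
```

On such data the squared term correlates perfectly with distance while the raw speed falls slightly short, which is the pattern the two scatter plots show visually.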
3. As aircraft is a categorical variable, we assign it dummy
numerical values so that its correlation with the response variable
can be computed.
This codes Boeing aircraft as 0 and Airbus as 1.
Any two distinct numerical values could be used in place of ‘0’ and
‘1’; the fit of the model does not change, although the coefficient
and intercept are rescaled accordingly.
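The remark about the choice of dummy values can be checked on a toy example. The Python sketch below fits a one-predictor line with two different codings and compares the fitted values; the five flights, the distances and the helper names `fit_line` and `fitted` are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


# Hypothetical flights and landing distances in feet.
makes = ["boeing", "airbus", "boeing", "airbus", "airbus"]
distance = [1200.0, 1700.0, 1300.0, 1600.0, 1800.0]


def fitted(codes):
    """Fit distance on the coded make; return fitted values and the slope."""
    x = [codes[m] for m in makes]
    b0, b1 = fit_line(x, distance)
    return [b0 + b1 * v for v in x], b1


fit_01, slope_01 = fitted({"boeing": 0, "airbus": 1})
fit_alt, slope_alt = fitted({"boeing": 10, "airbus": 20})
```

With the 0/1 coding the slope is the difference between the two group means; with the 10/20 coding the slope is ten times smaller, yet the fitted distances are identical.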
4. Now we create a correlation matrix of each variable in the data to
analyze the dependence of the response variable on these different
variables. The correlation matrix also helps us identify the
dependence between two independent variables. In such a case we
are likely to eliminate one of these variables from our model.
Additionally, running correlations among the independent
variables is helpful. These correlations will help prevent
multicollinearity problems later.
Results:
• The variable distance is correlated with all the variables except
aircraft_name and no_pasg, which have p-values greater than 0.05.
• Thus we infer that no_pasg and aircraft_name do not play a
significant role in explaining our response variable.
• As speed_ground and speed_air are also highly correlated, we
can drop one of these variables. We choose to drop speed_air, as
it has numerous observations with missing values.
We can now drop the variable speed_air.
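Screening predictor pairs for high correlation can be sketched as follows. This Python illustration uses only the rows where both values are present for each pair; the rows, the 0.9 threshold and the function names are assumptions for the example and not the output of the correlation procedure used in the report.

```python
from itertools import combinations
from math import sqrt


def pearson_pairwise(rows, a, b):
    """Pearson correlation using only rows where both fields are present."""
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    n = len(pairs)
    ma = sum(x for x, _ in pairs) / n
    mb = sum(y for _, y in pairs) / n
    sab = sum((x - ma) * (y - mb) for x, y in pairs)
    saa = sum((x - ma) ** 2 for x, _ in pairs)
    sbb = sum((y - mb) ** 2 for _, y in pairs)
    return sab / sqrt(saa * sbb)


def collinear_pairs(rows, predictors, threshold=0.9):
    """List predictor pairs whose absolute correlation reaches the threshold."""
    return [(a, b) for a, b in combinations(predictors, 2)
            if abs(pearson_pairwise(rows, a, b)) >= threshold]


# Hypothetical rows; speed_air is missing for some flights.
rows = [
    {"speed_ground": 95.0, "speed_air": 96.1, "height": 30.0},
    {"speed_ground": 60.0, "speed_air": None, "height": 22.0},
    {"speed_ground": 110.0, "speed_air": 109.2, "height": 41.0},
    {"speed_ground": 101.0, "speed_air": 102.5, "height": 18.0},
    {"speed_ground": 75.0, "speed_air": None, "height": 35.0},
    {"speed_ground": 120.0, "speed_air": 121.0, "height": 27.0},
]
flagged = collinear_pairs(rows, ["speed_ground", "speed_air", "height"])
```

Only one variable of a flagged pair would normally be kept; in the report that choice falls on speed_ground because speed_air is mostly missing.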
CHAPTER THREE: MODELLING
GOAL:
The objective of modelling is to build an equation for the response
variable in order to understand its dependence on the chosen independent
variables. We are concerned with finding a model that describes the
relationship between distance and several predictor (explanatory)
variables by regression.
Introduction:
A linear model has the form Y = b0 + b1X + ε. The constant b0 is called
the intercept and the coefficient b1 is the parameter estimate for the
variable X. The ε is the error term. ε is the residual that cannot be
explained by the variables in the model.
The F value is as high as 2683.94 and R-square is 0.9288, which shows
that the independent variables explain our response variable distance
very well, so we are in a position to obtain our equation.
Because our model has several variables, we also look at the adjusted
R-square, which is high as well. We can thus assume that our model is
adequate.
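How R-square and adjusted R-square come out of a least-squares fit can be sketched in plain Python. The example solves the normal equations on a small synthetic data set; the data and the function names `solve`, `ols` and `r_squared` are assumptions for illustration, and the numbers are unrelated to the F value and R-square reported above.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(xs, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in xs]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(xs, y, beta):
    """Return (R-square, adjusted R-square) for the fitted coefficients."""
    fitted = [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in xs]
    mean_y = sum(y) / len(y)
    sse = sum((t - f) ** 2 for t, f in zip(y, fitted))
    sst = sum((t - mean_y) ** 2 for t in y)
    r2 = 1 - sse / sst
    n, p = len(y), len(beta) - 1
    return r2, 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Synthetic data: y = 2 + 3*x1 - x2 plus a small disturbance.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
noise = [0.1, -0.2, 0.1, 0.1, -0.2, 0.1]
y = [2 + 3 * a - b + e for (a, b), e in zip(xs, noise)]

beta = ols(xs, y)
r2, adj_r2 = r_squared(xs, y, beta)
```

The adjusted value penalizes the number of predictors, which is why it is the one to look at when the model has several variables.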
As the p-value of the variable duration is more than 0.05, we drop this
variable: our response variable is not significantly dependent on the
duration, so it should not be part of the equation. All the remaining
variables have p-values less than 0.05, so they make up our equation.
BUILDING THE EQUATION
Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + ε
Y = distance
b0 = -1049
b1 = 454.45
X1 = aircraft_name
b2 = 0.27
X2 = speed_ground1
b3 = 14
X3 = height
b4 = 21
X4 = pitch
Distance = -1049 + 454.45(aircraft_name) + 0.27(speed_ground1) +
14(height) + 21(pitch)
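The equation can be evaluated for a single flight as in the Python sketch below. The coefficients are the rounded ones from the equation above; the function name `predict_distance` and the example flight are hypothetical.

```python
def predict_distance(aircraft_name, speed_ground, height, pitch):
    """Landing distance in feet from the fitted equation.

    aircraft_name is 0 for Boeing and 1 for Airbus; the speed term enters
    as speed_ground1, the square of speed_ground.
    """
    speed_ground1 = speed_ground ** 2
    return (-1049
            + 454.45 * aircraft_name
            + 0.27 * speed_ground1
            + 14 * height
            + 21 * pitch)


# Hypothetical flight: 100 MPH ground speed, 30 m height, 4 degrees pitch.
coded_0 = predict_distance(0, 100, 30, 4)
coded_1 = predict_distance(1, 100, 30, 4)
```

For this example the prediction is about 2155 feet with aircraft_name = 0, and the two codings differ by exactly the aircraft_name coefficient.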
The mean of residuals is zero. The overall fit of the model can be
checked by looking at the F-Value and its corresponding p-value (Prob
>F) for the total model under the Analysis of Variance portion of the
REG print out. Generally, we want a Prob>F value less than 0.05.
CHAPTER FOUR: MODEL CHECKING
GOAL:
The objective of model checking is to check the assumptions for the
noise terms.
They are assumed to be:
1. Independent
2. Normally distributed
3. Mean 0
4. Constant Variance
We will validate that the residuals are independent, as this is an
assumption of linear regression, by examining the residuals of our final
model. Specifically, we will use diagnostic statistics from REG and
create an output data set of residual values for PROC UNIVARIATE to
test.
The p-value of the chi-square test is less than 0.05.
The distribution of the residuals.
The mean is 0.
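Two of the residual checks can be sketched numerically in Python. The report does not name the statistic it used for independence, so the Durbin-Watson statistic here is an assumption chosen for illustration, and it is only meaningful when the rows have a natural order; the residual values are hypothetical.

```python
def residual_mean(residuals):
    """Average residual; least squares with an intercept makes this zero."""
    return sum(residuals) / len(residuals)


def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals from a fitted model, in observation order.
residuals = [12.0, -8.0, 5.0, -9.0, 3.0, -3.0]
mean_e = residual_mean(residuals)
dw = durbin_watson(residuals)
```

The statistic ranges from 0 to 4, and values near 2 indicate little first-order autocorrelation in the residuals.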
Write your short answers to these questions:

1. How many observations (flights) do you use to fit your
final model? If not all 950 flights, why?
I fit the final model on 831 observations. The other 119 observations
were removed in the data preparation chapter, after deleting the blank
rows and applying all the validity checks. I identified the overlap
between the two data sets FAA1 and FAA2 and removed the 100 duplicate
rows, which left 850 observations. Removing the 5 flights with abnormal
duration left 845. I also deleted the 10 data rows with heights of less
than 6 meters, as these are unacceptable, so 835 records remained. As
per the constraints, the values of speed_ground should lie between
30 MPH and 140 MPH; this check deleted another 3 data rows, and 832 rows
remained. The variable distance can be at most the runway length of 6000
feet; one record exceeded this value, so 831 records were left finally.
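The record counts can be retraced with a short tally. The removal counts below are the ones stated in the cleaning steps of the report; the Python variable names are only illustrative.

```python
# Rows removed at each cleaning step, as reported in the data cleaning chapter.
steps = [
    ("duplicates across FAA1 and FAA2", 100),
    ("abnormal duration", 5),
    ("height below 6 meters", 10),
    ("speed_ground outside 30-140 MPH", 3),
    ("distance above 6000 feet", 1),
    ("speed_air outside 30-140 MPH", 0),
]

remaining = 950
trail = []
for reason, removed in steps:
    remaining -= removed
    trail.append((reason, remaining))

total_removed = sum(count for _, count in steps)
```

The running totals are 850, 845, 835, 832, 831 and 831, and the removals add up to the 119 observations excluded from the model.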
2. What factors impact the landing distance of a flight, and how do
they impact it?
From my modelling and results, four variables impact the landing
distance: speed_ground, height, pitch and the aircraft.
I eliminated no_pasg, duration and speed_air for different reasons.
No_pasg: It was not correlated with the response variable.
Duration: The regression result showed a very low impact of the
variable on the distance.
Speed_air: This variable has too few values to use in the analysis. The
major reason to eliminate it from our equation, however, was its very
strong correlation with speed_ground, which made it redundant in the
equation.
All four remaining variables are highly correlated with our response
variable, and we could obtain a parameter estimate for each of these
explanatory variables, as given in the equation.
The speed_ground term in the equation is actually the square of
speed_ground, as the square has a more linear relationship with the
distance.
3. Is there any difference between the two makes Boeing and Airbus?
Yes, there is a difference between the Boeing and Airbus makes: our
equation includes the variable aircraft_name, which is based on the make
of the aircraft.
The equality-of-variances test shows an F value greater than 1, from
which we infer that there is a significant difference between the two
makes.
The GLM and t-test were run to identify the differences between the two
groups, and their results clearly show the difference in their means and
variances and in their impact on the distance.

Regression Analysis on Flights data

  • 1. BANA 6043 PROJECT
 STAT COMPUTING Name: Mansi Verma UCID: M 10632087
  • 2. PROBLEM STATEMEMT: To study the factors that impact the landing distance of a commercial flight in the given data of 950 flights with the below data variables: Aircraft: The make of an aircraft (Boeing or Airbus). Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min. No_pasg: The number of passengers in a flight. Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal. Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal. Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway. Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway. Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
  • 3. SUMMARY The factors that impact the landing distance of a commercial flight in then given data of 950 flights were studied. After eliminating the observations that did not meet the constraints of the aviation industry, we analyzed the remaining 831 records to come up with an equation that explains the landing distance. Distance = -1049 + 454.45(aircraft_name) + 0.27(speed_ground1) + 14(height) + 21(pitch) *Aircraft name is ‘0’ for Boeing and ‘1’ for Airbus *Speed_ground1 = 𝑆𝑝𝑒𝑒𝑑_𝑔𝑟𝑜𝑢𝑛𝑑2 The variables provided in the data were all put to test to check the impact of each one of them on the landing distance. We found that the number of passengers and the duration of the flight do not affect the landing distance which is practically reasonable. The landing distance depend on the pitch, height and majorly on square of the variable speed_ground. We can see these results in regression modelling done with the data further in the report. The distance also depend on the make of the aircraft and thus it is also a part of our equation obtained. We calculate the parametric coefficients for each explanatory variable through linear regression modelling and check the assumptions under which we can apply it.
  • 4. CHAPTER ONE: DATA EXPLORATION AND DATA CLEANING GOAL: The objective of exploration and data cleaning is to prepare the data for further analysis, since we need to visualize and model the data to arrive at results and insights. Validity checks must be performed because the data has to meet a few conditions that apply in the airline business. Steps: 1. The data is provided in two separate files, so we import the Excel files and append them using SAS. The import picks up blank records, which we need to delete. We use the SET statement to stack one data set below the other, as the fields in both files are almost the same.
  • 5. The code stacks the data files one below the other. We notice that the second data set has one variable fewer than the first: it lacks duration. There are also rows with the same values in FAA1 and FAA2, differing only in the missing duration. 2. We need to analyze the data by learning more about the variables and data points.
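The stacking step can be illustrated outside SAS. The following is a minimal Python sketch, not the code used in the report; the function name `stack_datasets` and the sample records are hypothetical. It appends one list of records below another and fills any column that a file lacks (here, duration) with a missing value.

```python
def stack_datasets(first, second):
    """Append `second` below `first`, filling absent columns with None."""
    # Collect every column name in first-seen order.
    columns = []
    for row in first + second:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Re-emit each row with the full column set; absent fields become None.
    return [{name: row.get(name) for name in columns} for row in first + second]


# Hypothetical records: the second file carries no "duration" field.
faa1 = [{"aircraft": "boeing", "duration": 98.5, "height": 27.4}]
faa2 = [{"aircraft": "airbus", "height": 30.1}]

combined = stack_datasets(faa1, faa2)
print(combined[1]["duration"])  # None: duration is missing for the FAA2 row
```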
  • 6. This shows that the data needs to be cleaned: the min and max values reveal discrepancies with the norms the data must follow. The N Miss column also shows that 711 values are missing for speed_air, whereas only 239 values are present. We might want to delete this field, as it cannot give true insight when it is not captured for about 75% of the rows. The PROC MEANS table will help us think through the other steps we take to get the data ready for modelling. 3. Now we need to remove duplicate rows with the same values. As we know that the FAA2 data file has no values for duration, we exclude this variable when looking for duplicates.
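A summary of this kind (count, missing count, minimum, maximum) can be sketched in a few lines of Python. This is only an illustration of what the summary table reports, not the SAS procedure itself; `summarize` and the sample rows are hypothetical. For the figures quoted above, the missing share is 711 / 950 ≈ 0.748.

```python
def summarize(rows, variable):
    """Return N, N Miss, min and max for one variable, skipping missing values."""
    present = [row[variable] for row in rows if row.get(variable) is not None]
    return {
        "n": len(present),
        "nmiss": len(rows) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Hypothetical rows: speed_air is mostly missing, one height is negative.
rows = [
    {"speed_air": 101.7, "height": 27.4},
    {"speed_air": None, "height": -3.5},
    {"speed_air": None, "height": 30.1},
    {"speed_air": None, "height": 59.9},
]

stats = summarize(rows, "speed_air")
missing_share = stats["nmiss"] / len(rows)
print(stats, missing_share)
```

A negative minimum for height, as in the sample rows, is the kind of discrepancy the min and max columns are meant to expose.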
  • 7. For this we sort the data and use the NODUPKEY option to remove rows that are duplicated on all variables except duration. This identifies the overlap between the two data sets FAA1 and FAA2, which is captured in the data set 'removed' by the code below. This leaves us with 850 rows and puts the 100 duplicated rows in a separate data set, kept only for inspection.
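The idea behind the duplicate removal can be sketched in Python as follows. This is an assumption-laden illustration rather than the report's SAS code: `drop_duplicates` is a hypothetical helper, and it keeps the first occurrence in input order, whereas the SAS approach keeps the first row after sorting.

```python
def drop_duplicates(rows, ignore=("duration",)):
    """Split rows into (kept, removed), comparing all fields except `ignore`."""
    kept, removed, seen = [], [], set()
    for row in rows:
        # Build a hashable key from every field that takes part in the comparison.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        if key in seen:
            removed.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, removed


# Hypothetical rows: the second row repeats the first, minus its duration.
rows = [
    {"aircraft": "boeing", "duration": 98.5, "height": 27.4},
    {"aircraft": "boeing", "duration": None, "height": 27.4},
    {"aircraft": "airbus", "duration": None, "height": 30.1},
]

kept, removed = drop_duplicates(rows)
print(len(kept), len(removed))
```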
  • 8. 4. As about 75% of the values of speed_air are missing, we can remove this variable now or keep it and drop it later. We might forecast it or impute it with the mean value based on our model, but for the time being we keep it to study it further. 5. Now we perform a sanity check by validating each variable, one by one, against the norms of the airline industry. This removes 5 records with abnormal flight duration. We are left with 845 records; the 5 rows with abnormal flight duration can be put in another data set for inspection. (Output below:)
  • 9. 6. Now we check the variable height and delete the rows with heights less than 6 meters, as these are unacceptable.
  • 10. We are left with 835 records after removing 10 rows with unacceptable heights. 7. Now we check the variable speed_ground: as per the constraints, its values should lie between 30 MPH and 140 MPH, so we delete the rows with unacceptable speed_ground.
  • 11. This deletes another 3 rows, and we are left with 832 rows.
  • 12. 8. Now we check the variable distance. As the runway is at most 6000 feet long, no value can exceed this. One record is removed, and we are left with 831 records.
  • 13. 9. Now we check the variable speed_air: as per the constraints, its values should lie between 30 MPH and 140 MPH, so we delete the rows with unacceptable speed_air. No rows with unacceptable values remain, so we are still left with 831 rows.
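The validity checks in steps 5 to 9 can be collected into one function. The sketch below is in Python rather than SAS and rests on two assumptions that the report does not state explicitly: rows with a missing duration or speed_air are kept, and the rules are applied in the order listed. The name `first_violation` and the sample flights are hypothetical.

```python
def first_violation(row):
    """Return the name of the first rule a flight breaks, or None if it passes."""
    duration = row.get("duration")
    if duration is not None and duration <= 40:      # must exceed 40 minutes
        return "duration"
    if row["height"] < 6:                            # at least 6 meters
        return "height"
    if not 30 <= row["speed_ground"] <= 140:         # 30 to 140 MPH
        return "speed_ground"
    if row["distance"] > 6000:                       # runway limit in feet
        return "distance"
    speed_air = row.get("speed_air")
    if speed_air is not None and not 30 <= speed_air <= 140:
        return "speed_air"
    return None


# Hypothetical flights: the second one comes in too low over the threshold.
flights = [
    {"duration": 98.5, "height": 27.4, "speed_ground": 107.9, "distance": 3369.8},
    {"duration": 125.7, "height": 2.1, "speed_ground": 101.6, "distance": 2987.8},
]
clean = [f for f in flights if first_violation(f) is None]
print(len(clean))
```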
  • 14. 10. Now we look at the distribution of each variable, as the modelling techniques rely on distributional assumptions. We capture the moments, hypothesis tests and quantiles of each variable, and we inspect the histograms to see the shape of each distribution. 1. Duration
  • 15. Duration is almost normal, with a slight skew. 2. Number of passengers
  • 16. Appears normally distributed but slightly left skewed. 3. Speed_Ground
  • 18. The air speed is not at all normal. 5. Height
  • 19. Normally distributed with a slight right skew. 6. Pitch
  • 20. Pitch looks normally distributed. 7. Distance
  • 21. This doesn't appear to be normally distributed. Data Preparation Questions: 1. How should we treat a variable with about 75% of its values missing? 2. How should we fill in values for variables with only a few missing values, for the sake of completeness? 3. How should we impute values for a variable with the majority of its data missing? 4. How can we substitute a value for an unacceptable data point rather than delete the entire row?
  • 22. CHAPTER TWO: DATA EXPLORATION GOAL: The objective of data exploration is to study the cleaned data and prepare it for the regression model. This includes visualizing the data, checking for linearity, and examining the correlations between the variables in order to eliminate variables that do not affect our response variable. Steps: 1. Before beginning the modelling, we plot our data. By examining these initial plots, we can quickly assess whether the data have linear relationships or whether interactions are present. A variable that has a linear relationship with the response variable produces a plot that resembles a straight line (speed_air). The other plots are scattered. We can consider transforming other variables in our modelling to increase the linearity.
  • 23. Distance vs No_pasg; Distance vs Speed_ground
  • 25. Distance vs Pitch 2. A transformation of speed_ground can increase the linearity. Squaring speed_ground makes the plot much more linear than the previous graph.
  • 26. Distance vs Speed_ground² 3. As aircraft is a categorical variable, we assign it dummy numerical values so that its correlation with the response variable can be computed. This assigns 0 to Boeing aircraft and 1 to Airbus. Any two distinct numerical values could be used in place of '0' and '1'; the fit of the model would not change, only the scale of the aircraft coefficient and the intercept.
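The two derived variables described in steps 2 and 3 can be sketched in Python as below. The names `speed_ground1` and `aircraft_name` follow the report; the helper `prepare` and the sample row are hypothetical, and the sketch assumes that any make other than Boeing is Airbus.

```python
def prepare(row):
    """Return a copy of the row with the squared speed and the make dummy added."""
    prepared = dict(row)
    # Squared ground speed, which the report finds more linear in distance.
    prepared["speed_ground1"] = row["speed_ground"] ** 2
    # Dummy coding: 0 for Boeing, 1 for Airbus.
    is_boeing = row["aircraft"].strip().lower() == "boeing"
    prepared["aircraft_name"] = 0 if is_boeing else 1
    return prepared


# Hypothetical flight record.
flight = prepare({"aircraft": "Airbus", "speed_ground": 100.0})
print(flight["speed_ground1"], flight["aircraft_name"])
```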
  • 27. 4. Now we create a correlation matrix of the variables in the data to analyze how the response variable depends on each of them. The correlation matrix also helps us identify dependence between two independent variables; in such a case we are likely to eliminate one of them from our model. Checking correlations among the independent variables in this way helps prevent multicollinearity problems later. Results: (i) The variable distance is correlated with all the variables except aircraft_name and no_pasg, which have p-values greater than 0.05. (ii) Thus we infer that no_pasg and aircraft_name do not play a significant role in explaining our response variable. (iii) As speed_ground and speed_air are also highly correlated, we can drop one of them. We choose to drop speed_air, as it has numerous observations with missing values.
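Each entry of a correlation matrix is a Pearson coefficient between two variables. A minimal Python sketch of that computation follows; it is not the SAS procedure used in the report, `pearson` is a hypothetical helper, the data are made up, and the choice to drop pairs with a missing value is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, skipping missing pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values: speed_air is missing for the slowest flight.
speed_ground = [60.0, 95.0, 105.0, 120.0]
speed_air = [None, 96.0, 104.5, 121.0]
r = pearson(speed_ground, speed_air)
print(round(r, 3))
```

A coefficient close to 1 between two predictors, as in this made-up example, is the situation in which one of the two would be dropped.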
  • 28.
  • 29. We can now drop the variable speed_air. CHAPTER THREE: MODELLING GOAL: The objective of modelling is to build an equation for the response variable in order to understand its dependence on the chosen independent variables. We are concerned with finding a regression model that describes the relationship between distance and several predictor (explanatory) variables. Introduction: A linear model has the form Y = b0 + b1X + ε. The constant b0 is called the intercept, and the coefficient b1 is the parameter estimate for the variable X. The error term ε is the residual that cannot be explained by the variables in the model.
  • 30. The F value is as high as 2683.94 and R-square is 0.9288, which shows that the independent variables explain our response variable distance very well, so we are in a position to obtain our equation.
  • 31. As our model has several variables, we also look at the adjusted R-square, which is high as well. We can thus assume that our model is adequate. As the p-value of the variable duration is more than 0.05, we drop this variable: our response variable clearly does not depend on duration, so it should not be part of the equation. All the remaining variables have p-values less than 0.05, so they form our equation. BUILDING THE EQUATION Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + ε, where Y = distance; b0 = -1049; b1 = 454.45, X1 = aircraft_name; b2 = 0.27, X2 = speed_ground1; b3 = 14, X3 = height; b4 = 21, X4 = pitch. Distance = -1049 + 454.45(aircraft_name) + 0.27(speed_ground1) + 14(height) + 21(pitch)
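For a single predictor, the least-squares estimates of b0 and b1 have a closed form, which the following Python sketch computes. It only illustrates the form Y = b0 + b1X + ε; the report's model has several predictors and was fitted in SAS. The helper `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0, slope b1 and residuals for Y = b0 + b1*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residuals are the part of Y that the fitted line does not explain.
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return b0, b1, residuals


# Hypothetical data points.
b0, b1, residuals = fit_line([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
print(round(b0, 2), round(b1, 2))  # 1.15 1.94
```

With an intercept in the model, the residuals of a least-squares fit sum to zero up to rounding.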
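To see how the fitted equation is used, the sketch below evaluates it in Python for one flight. The coefficients are the rounded values from the equation above; the function name `predicted_distance` and the input values are hypothetical and chosen only for illustration.

```python
def predicted_distance(aircraft_name, speed_ground, height, pitch):
    """Landing distance in feet from the report's fitted equation.

    aircraft_name: 0 for Boeing, 1 for Airbus
    speed_ground:  ground speed in MPH (squared inside, as speed_ground1)
    height:        height over the threshold in meters
    pitch:         pitch angle in degrees
    """
    speed_ground1 = speed_ground ** 2
    return (-1049
            + 454.45 * aircraft_name
            + 0.27 * speed_ground1
            + 14 * height
            + 21 * pitch)


# Hypothetical flight: 100 MPH ground speed, 10 m height, 4 degrees pitch.
boeing = predicted_distance(0, 100, 10, 4)
airbus = predicted_distance(1, 100, 10, 4)
print(round(boeing, 2), round(airbus, 2))  # 1875.0 2329.45
```

With everything else equal, the two makes differ by the aircraft_name coefficient, 454.45 feet.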
  • 32. The mean of residuals is zero. The overall fit of the model can be checked by looking at the F-Value and its corresponding p-value (Prob >F) for the total model under the Analysis of Variance portion of the REG print out. Generally, we want a Prob>F value less than 0.05.
  • 33. CHAPTER FOUR: MODEL CHECKING GOAL: The objective of model checking is to check the assumptions for the noise terms. They are assumed to be: 1. Independent 2. Normally distributed 3. Mean 0 4. Constant variance. We will validate that the residuals are independent, as this is an assumption of linear regression, by examining the residuals of our final model. Specifically, we will use diagnostic statistics from REG and create an output data set of residual values for PROC UNIVARIATE to test.
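One common diagnostic for independence of residuals is the Durbin-Watson statistic; the report does not say which diagnostics it used, so its appearance here is an assumption. The Python sketch below computes it together with the residual mean; `durbin_watson` and the residual values are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests no first-order autocorrelation."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that alternate in sign (negative autocorrelation).
residuals = [1.0, -1.0, 1.0, -1.0]
mean_residual = sum(residuals) / len(residuals)
print(mean_residual, durbin_watson(residuals))  # 0.0 3.0
```

The statistic ranges from 0 to 4; values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.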
  • 34. The p-value of the chi-square test is less than 0.05. Distribution of the residuals.
  • 36. Write your short answers to these questions:
 1. How many observations (flights) do you use to fit your final model? If not all 950 flights, why? I fit the final model on 831 observations and removed the other 119 through the steps in the data preparation chapter, after deleting blank records and applying all the validity checks. Removing the duplicate rows that overlap between the two data sets FAA1 and FAA2 left 850 observations. Removing the 5 rows with abnormal flight duration left 845. I also deleted the 10 rows with heights less than 6 meters, as these are unacceptable, which left 835 records. Speed_ground must lie between 30 MPH and 140 MPH as per the constraints; this removed another 3 rows and left 832. Distance can be at most the runway length of 6000 feet; one record exceeded this value, so 831 records finally remained.
  • 37. 2. What factors impact the landing distance of a flight, and how? From my modelling and results, four variables impact the landing distance: speed_ground, height, pitch and the aircraft make. I eliminated no_pasg, duration and speed_air for different reasons. No_pasg: It was not correlated with the response variable. Duration: The regression result showed a very low impact of the variable on the distance. Speed_air: This variable has too few values to be useful for analysis, and the major reason to eliminate it from the equation was its very strong correlation with speed_ground, which made it redundant. The parameter estimates for all 4 explanatory variables are given in the equation. The speed term in the equation is actually the square of speed_ground, as the square has a more linear relationship with the distance.
  • 38. 3. Is there any difference between the two makes Boeing and Airbus? Yes, there is a difference between the Boeing and Airbus makes, as our equation includes the variable aircraft_name, which is based on the make of the aircraft.
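The row counts reported in the data preparation chapter can be tallied step by step. The Python sketch below only reproduces that arithmetic from the counts given in the report; the variable names are illustrative.

```python
# (reason, rows removed) in the order the report applies the checks.
steps = [
    ("duplicates across FAA1 and FAA2", 100),
    ("abnormal flight duration", 5),
    ("height below 6 meters", 10),
    ("speed_ground outside 30-140 MPH", 3),
    ("distance above 6000 feet", 1),
    ("speed_air outside 30-140 MPH", 0),
]

remaining = 950
for reason, removed in steps:
    remaining -= removed
    print(f"{reason}: -{removed}, {remaining} rows left")

total_removed = sum(removed for _, removed in steps)
print(remaining, total_removed)  # 831 119
```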
  • 39.
  • 40. The equality-of-variances test shows an F value greater than 1, and from it we infer that there is a significant difference between the two makes. The GLM and t-test procedures are run to identify the differences between the two groups, and their results clearly show the difference in their means and variances and in their impact on the distance.
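The two statistics behind such a comparison can be sketched in Python. This is an illustration under assumptions, not the report's SAS output: it computes a folded F ratio (larger sample variance over smaller) and an unequal-variance (Welch) t statistic, and it leaves out the p-values. The helper `compare_groups` and the distances are hypothetical.

```python
import math
import statistics


def compare_groups(a, b):
    """Return (folded F ratio, Welch t statistic) for two samples."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    # Folded F: larger variance over smaller, so it is never below 1.
    folded_f = max(var_a, var_b) / min(var_a, var_b)
    # Welch t: difference in means over its unequal-variance standard error.
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    welch_t = (statistics.mean(a) - statistics.mean(b)) / standard_error
    return folded_f, welch_t


# Hypothetical landing distances (feet) for the two makes.
boeing = [1900.0, 2100.0, 2300.0, 2500.0, 2700.0]
airbus = [1500.0, 1600.0, 1700.0, 1800.0, 1900.0]
folded_f, welch_t = compare_groups(boeing, airbus)
print(round(folded_f, 2), round(welch_t, 2))
```

Because a folded F ratio is at least 1 by construction, the accompanying p-value, not the ratio alone, indicates whether the variances differ significantly.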