Random Forest Ensemble learning algorithm for Engineering Analytics Project
1. Engineering Analytics Saurabh Kale
P a g e 1 | 14
FLIGHT DELAY PREDICTION AND VISUALIZATION
ENGINEERING ANALYTICS COURSE PROJECT
FINAL RESEARCH PAPER
Project By- Saurabh Kale
Project Advisor- Dr. Ying Lin
ABSTRACT
Air travel is very common in the USA, and travelers want flights that are both cheap and reliable. Flying
is usually the fastest way to get from A to B. There are risks associated with it, but aircraft
manufacturers and airlines do a great job of ensuring passenger safety. Delay is an important
consideration for travelers, especially for business travelers and professionals. It is possible to
predict, with a certain level of confidence, whether and by how much a flight will be delayed, using
factors that are discussed later in the paper.
The data for this study was downloaded from the Kaggle website; the original source of the data is the
Department of Transportation’s Bureau of Transportation Statistics website. The data has more than 5
million rows and 31 columns. Some of the columns are not required for this study and were removed
during data preprocessing. Cancelled flights have no meaningful delay value and would act as outliers,
so all rows for cancelled flights were removed from the downloaded dataset. Some columns are strongly
correlated with one another; correlation plots are included later in the report.
METHODOLOGY
The scope of the project is to fit a model that predicts whether a flight is delayed, based on features
available in the Flights dataset published by the Bureau of Transportation Statistics. The intended
application of such a study is an online dashboard backed by a prediction model: the user (flyer) enters
attributes such as Origin Airport, Destination Airport, Airline, Time of Day, Day of Week and Day of
Month, and the dashboard predicts how long a given flight will be delayed. Such a tool could help decide
which airline to fly or what time of day has the fewest delays.
Because of high model complexity, the project scope was cut down. Instead of predicting the duration of
the delay, the model predicts a binary response: whether a given flight is delayed or not.
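The binary response can be derived from a delay duration column. The following is a minimal sketch, not the project's actual code: the column names ARRIVAL_DELAY and CANCELLED and the rule "more than 0 minutes late counts as delayed" are assumptions made for illustration.

```r
# Toy stand-in for the flights table; column names and values are assumptions.
flights <- data.frame(
  ARRIVAL_DELAY = c(-5, 12, NA, 0, 47),  # minutes; NA for the cancelled flight
  CANCELLED     = c(0, 0, 1, 0, 0)
)

# Drop cancelled flights, which carry no usable delay value.
flights <- flights[flights$CANCELLED == 0, ]

# Assumed rule: a flight is "delayed" when it arrives more than 0 minutes late.
delay_threshold <- 0
flights$Delay <- factor(as.integer(flights$ARRIVAL_DELAY > delay_threshold),
                        levels = c(0, 1))

# Class counts of the binary response.
table(flights$Delay)
```

Storing Delay as a factor matters later: R's randomForest fits a classification forest for a factor response and a regression forest for a numeric one.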
The airport variables were found to have ~325 unique factor values, which made it difficult or even
impossible to fit a CART model to the data. To solve this issue, the airport factor variables were
converted to numeric latitude and longitude variables, which made it easy to include them in the
model. Intuitively, the Origin and Destination airport features may be the most important features in
this model, because the average number of passengers per flight would be highest at big cities such as
New York, Chicago, Houston and Los Angeles. Another reason Origin and Destination airports may be
important is that there may be weather-related delays at certain locations. Flight Duration and
Distance are also potentially important features, but they are correlated with Origin and Destination;
they have been dropped from the study and may be reconsidered after evaluation.
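The conversion from airport codes to coordinates can be sketched as a lookup join. This is an illustration under assumptions: the table and column names are hypothetical and the coordinates are rounded, illustrative values. As background, R's randomForest rejects categorical predictors with more than 53 levels, which is one concrete reason a ~325-level factor cannot be used directly.

```r
# Hypothetical airport lookup table; coordinates are rounded, illustrative values.
airports <- data.frame(
  IATA_CODE = c("JFK", "ORD", "LAX"),
  LATITUDE  = c(40.64, 41.98, 33.94),
  LONGITUDE = c(-73.78, -87.90, -118.41)
)

# Toy flights table with assumed column names.
flights <- data.frame(
  ORIGIN_AIRPORT      = c("JFK", "LAX", "ORD"),
  DESTINATION_AIRPORT = c("LAX", "ORD", "JFK")
)

# match() gives the lookup row for each code and preserves the flight order.
o <- match(flights$ORIGIN_AIRPORT, airports$IATA_CODE)
d <- match(flights$DESTINATION_AIRPORT, airports$IATA_CODE)

flights$ORIGIN_LAT <- airports$LATITUDE[o]
flights$ORIGIN_LON <- airports$LONGITUDE[o]
flights$DEST_LAT   <- airports$LATITUDE[d]
flights$DEST_LON   <- airports$LONGITUDE[d]

# Drop the high-cardinality code columns once the coordinates are attached.
flights$ORIGIN_AIRPORT <- NULL
flights$DESTINATION_AIRPORT <- NULL
```

Each airport factor becomes two numeric columns, so the four coordinate features replace two factors with hundreds of levels.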
The factor values of the Airline variable have been one-hot encoded: each airline is added as its own
column holding ‘1’ if that airline was the air carrier in a given row and ‘0’ otherwise.
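One way to produce such 0/1 airline columns in R is model.matrix. The sketch below uses made-up carrier codes and an assumed column name AIRLINE; it is not necessarily how the encoding was done in the project.

```r
# Toy data; the carrier codes and the column name AIRLINE are assumptions.
flights <- data.frame(AIRLINE = factor(c("AA", "DL", "UA", "DL")))

# "- 1" drops the intercept so that every airline gets its own 0/1 column.
airline_dummies <- model.matrix(~ AIRLINE - 1, data = flights)

# Tidy the generated names, e.g. "AIRLINEAA" becomes "AIRLINE_AA".
colnames(airline_dummies) <- sub("^AIRLINE", "AIRLINE_", colnames(airline_dummies))

# Attach the indicator columns to the original rows.
encoded <- cbind(flights, airline_dummies)
```

Every row has exactly one indicator set to 1, which is the property described above.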
To sum up, the features considered in the model are Month, Day of Month, Weekday, Airline, Origin (as
coordinates), Destination (as coordinates) and Scheduled Departure Time. Delay is the binary response
being predicted.
Features Considered in the Model
1. Day of Month
2. Day of Week
3. Month
4. Airline
5. Origin Latitude
6. Origin Longitude
7. Destination Latitude
8. Destination Longitude
9. Time of Flight Departure (Scheduled Departure Time)
ISSUES WITH DATA
Data downloaded from the internet often has anomalies (missing values, outliers etc.). In this case, the
airport data mixed numeric and alphabetic codes, so one kind of code had to be mapped onto the other.
Lookup tables were created in Excel, the data was then imported into RStudio, and the numeric codes
were replaced with 3-letter alphabetic codes.
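The replacement step can be sketched in R as follows. The lookup table here is hypothetical, and the numeric IDs are placeholders rather than verified airport identifiers.

```r
# Hypothetical lookup table: the numeric IDs are placeholders, not real codes.
code_lookup <- data.frame(
  NUMERIC_ID = c("10001", "10002", "10003"),
  IATA_CODE  = c("JFK", "ORD", "LAX"),
  stringsAsFactors = FALSE
)

# Airport column that mixes 3-letter codes and numeric codes.
airport_raw <- c("JFK", "10002", "LAX", "10001")

# Flag the entries that consist only of digits.
is_numeric_code <- grepl("^[0-9]+$", airport_raw)

# Replace only those entries with the matching 3-letter code.
airport_clean <- airport_raw
airport_clean[is_numeric_code] <-
  code_lookup$IATA_CODE[match(airport_raw[is_numeric_code], code_lookup$NUMERIC_ID)]
```

A numeric code missing from the lookup table would come out as NA, which makes unmapped airports easy to spot before modelling.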
Model Proposed
• Random forests or random decision forests are an ensemble learning method for classification
and regression that operate by constructing a multitude of decision trees at training time and
outputting the class that is the mode of the classes (classification) or the mean prediction
(regression) of the individual trees.
• Parameters in the Random Forest function
- nodesize – Minimum size of terminal nodes. Setting it too high produces small trees.
- ntree – Number of trees to be grown. Adding more trees does not lead to overfitting, which is
an advantage of Random Forests.
- mtry – Number of features randomly sampled as candidates at each split (mtry ≤ p, where p is
the number of features in the model).
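Two of the ideas above, voting across trees and sampling mtry features, can be illustrated in base R without fitting a forest. The votes below are toy values, and the choice of p = 9 and mtry = 3 is only an example.

```r
# Each column holds one tree's predicted class for five flights (toy values).
tree_votes <- cbind(
  tree1 = c(0, 1, 1, 0, 1),
  tree2 = c(0, 1, 0, 0, 1),
  tree3 = c(1, 1, 0, 0, 0)
)

# Share of trees voting "delayed" (class 1) for each flight.
vote_share <- rowMeans(tree_votes)

# Majority vote: the forest predicts the mode of the individual tree classes.
forest_class <- as.integer(vote_share > 0.5)

# At each split, a tree considers only mtry of the p features, drawn at random.
set.seed(1)
p <- 9
mtry <- 3
candidate_features <- sample(seq_len(p), mtry)
```

The vote share also acts as a predicted probability of delay, which is what a cutoff other than 0.5 is applied to.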
VARIABLE SELECTION
The following plots help to validate the selection of these variables-
Image 1
The gap between the maximum and minimum delay values across days of the week indicates that Day of
Week can help explain whether a flight will be delayed.
Image 2
The gap between the maximum and minimum delay values across days of the month indicates that Day of
Month can help explain whether a flight will be delayed.
Image 3
The gap between the maximum and minimum delay values across months indicates that Month can help
explain whether a flight will be delayed.
Image 4
The following 3 plots explain the selection of Airline as a feature.
Clearly, Airline, as a feature, can help explain whether a flight will be delayed.
Image 5
Image 6
As is clear from the above plot, delay increases between 0000 hours and 0500 hours. Scheduled
departure time was not in the first set of features but was later included in the model.
Image 7
Flight Time is the air time of a flight. This feature is correlated with Origin and Destination in the
sense that the farther apart they are, the longer the flight takes. This feature has not been
considered in the model.
Image 8
This is an informative plot of delay by state. It is important because four features in our model are
coordinates of airport locations, and the plot shows why those four features matter.
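The group-wise summaries behind plots of this kind can be sketched in R. The data below is a toy example with assumed column names; the same pattern applies to Day of Month, Month, Airline or state.

```r
# Toy data; the column names and values are assumptions for illustration.
flights <- data.frame(
  DAY_OF_WEEK = c(1, 1, 2, 2, 3, 3),
  Delay       = c(1, 0, 0, 0, 1, 1)
)

# Share of delayed flights within each day of the week.
delay_rate <- tapply(flights$Delay, flights$DAY_OF_WEEK, mean)

# A wide gap between the lowest and highest group rate suggests the
# grouping variable carries information about delay.
rate_range <- range(delay_rate)
```

A variable whose group rates are nearly identical would add little to the model, which is the visual check the images above are used for.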
CORRELATIONS
The following plots show correlations between variables and form the basis for the decision not to
include these variables in the model.
Plot 1: X = Sch_Air_Time vs. Y = Elapsed_Time
Plot 2: X = Sch_Air_Time vs. Y = Distance
Plot 3: X = Sch_Air_Time vs. Y = Actual_Air_Time
There are more correlation plots for other variables such as Scheduled Departure Time, Actual Departure
Time and Wheels Off Time. These have not been attached to the report because they are very similar to
the plots above.
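The same screening can be done numerically by flagging pairs of variables whose correlation exceeds a threshold. The sketch below uses simulated data, and the 0.95 threshold is an assumption chosen for illustration, not a value taken from this study.

```r
# Simulated data: scheduled time is built to depend on distance, day is unrelated.
set.seed(42)
n <- 200
distance <- runif(n, 100, 2500)
timing <- data.frame(
  DISTANCE       = distance,
  SCHEDULED_TIME = distance / 8 + 30 + rnorm(n, sd = 5),
  DAY_OF_MONTH   = sample(1:31, n, replace = TRUE)
)

# Pairwise Pearson correlations between all columns.
cor_mat <- cor(timing, use = "pairwise.complete.obs", method = "pearson")

# Keep each pair once (upper triangle) and flag |r| above the assumed threshold.
threshold <- 0.95
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
```

For each flagged pair, one of the two variables is a candidate for removal, which mirrors the reasoning used to drop the time and distance variables here.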
MODEL ASSESSMENT AND EVALUATION
Based on raw data, the binary delay classification is as follows-
Delay    Frequency    Percentage
0        3607308      62.92%
1        2125618      37.08%
Total    5732926      100.00%
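The percentages follow directly from the frequencies and can be rechecked in R. The vector names below are labels chosen for this sketch.

```r
# Class frequencies from the table above.
counts <- c(not_delayed = 3607308, delayed = 2125618)

# Percentage of each class, rounded to two decimals.
share <- round(100 * prop.table(counts), 2)

# Accuracy (in %) of a trivial model that always predicts "not delayed".
baseline_accuracy <- max(share)
```

The majority-class share of 62.92% is a useful baseline: a classifier has to beat it to add value over always predicting "not delayed".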
The model fitted to this data was a Random Forest, using the randomForest function in R.
The call to the function is as follows-
RFGeoSpa1 <- randomForest(Delay ~ ., data = Train1, ntree = 400, mtry = 15, nodesize = 1)
The ROC Curve is as follows-
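For a cutoff level of 0.45, the TPR was found to be 76.09% and the TNR was found to be 51.61%. These
rates are the column-wise ratios of the confusion matrix below, with class 0 (not delayed) treated as
the positive class: 18141 / (18141 + 5699) = 76.09% and 7197 / (6748 + 7197) = 51.61%.
         0        1
0    18141     6748
1     5699     7197
The accuracy of the model is (18141 + 7197) / 37785 = 67.06%.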
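The reported rates can be recomputed from the four cells of the confusion matrix. In this sketch the matrix is entered by hand; reading the rates column-wise is what reproduces the reported figures.

```r
# Confusion matrix from the report, filled column by column.
cm <- matrix(c(18141, 5699, 6748, 7197), nrow = 2,
             dimnames = list(row = c("0", "1"), col = c("0", "1")))

# Overall accuracy: correctly classified flights over all flights.
accuracy <- sum(diag(cm)) / sum(cm)

# Column-wise rates: diagonal cell divided by its column total.
col_rates <- diag(cm) / colSums(cm)

# Express everything in percent with two decimals.
round(100 * c(accuracy = accuracy, col_rates), 2)
```

Rounded to two decimals this gives 67.06 for accuracy, 76.09 for class 0 and 51.61 for class 1.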
LESSONS LEARNT
Although Random Forests give very good predictions when the data is reasonably free of noise, a single
model should not be relied upon. Multiple models should be built and compared with each other to
validate each model’s accuracy. This not only validates the models that are built, but also prompts the
question of why one model could not perform better than another.
The fitted Random Forest model in this study may not be performing at its optimum with the data
provided to it. Additional variables such as Type of Aircraft, Number of Passengers and Number of
Support Staff may be required, and adding them may lead to better model performance.
ERRORS IN PRESENTATION-
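The variable importance shown in the presentation was read as it would be for a regression forest. The
“importance” function in R reports different measures for the two tasks (mean decrease in accuracy and
in Gini impurity for classification; increase in MSE and in node purity for regression), so the
regression-style reading does not carry over to the classification model fitted here. This was an
error in the presentation and this is an attempt to correct the mistake.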
USE OF SOFTWARE
1. R-Package
2. Excel and Excel PowerMap
3. Tableau
REFERENCES
1. http://kellyjclifton.com/Research/EconImpactsofBicycling/OTRECReport-ConsBehavTravelChoices_Nov2012.pdf
2. An Introduction to Statistical Learning with Applications in R
3. Class Notes- Engineering Analytics – Dr. Lin
4. https://stackoverflow.com/
5. https://stackexchange.com/
6. https://www.bts.gov/
7. https://www.kaggle.com/datasets
8. https://www.rdocumentation.org/
APPENDIX-
Correlations-
<- FlightDF[sample(nrow(FlightDF), 5000000 , replace = FALSE, prob =NULL),]
> cor(FlightsDataSample1$WEATHER_DELAY, FlightsDataSample1$LATE_AIRCRAFT_DELAY, use = "na.or.complete", method = c("pearson","kendall","spearman"))
[1] -0.02135492
> cor(FlightsDataSample1$WEATHER_DELAY, FlightsDataSample1$AIRLINE_DELAY, use = "na.or.complete", method = c("pearson","kendall","spearman"))
[1] -0.05103192
> cor(FlightsDataSample1$WEATHER_DELAY, FlightsDataSample1$SECURITY_DELAY, use = "na.or.complete", method = c("pearson","kendall","spearman"))
[1] -0.004781347
> cor(FlightsDataSample1$WEATHER_DELAY, FlightsDataSample1$AIR_SYSTEM_DELAY, use = "na.or.complete", method = c("pearson","kendall","spearman"))
[1] -0.0005082514
> cor(FlightsDataSample1$SCHEDULED_TIME, FlightsDataSample1$ELAPSED_TIME, use = "na.or.complete", method = c("pearson","kendall","spearman"))
[1] 0.9852726
> cor(FlightsDataSample1$SCHEDULED_TIME, FlightsDataSample1$AIR_TIME, use = "na.or.complete", method = c("pearson","kendall","spearman"))
[1] 0.9907503
> cor(FlightsDataSample1$ELAPSED_TIME, FlightsDataSample1$ELAPSED_TIME, use = "na.or.complete", method = c("pearson","kendall","spearman"))
[1] 1
> cor(FlightsDataSample1$DISTANCE, FlightsDataSample1$AIR_TIME, use = "na.or.complete", method = c("pearson","kendall","spearman"))
[1] 0.9856394
> cor(FlightsDataSample1$DISTANCE, FlightsDataSample1$SCHEDULED_TIME, use = "na.or.complete", method = c("pearson","kendall","spearman"))
[1] 0.9843424
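A note on the method argument used in the calls above: when cor() is given the full vector c("pearson","kendall","spearman"), it falls back to the first entry, so the values reported are Pearson correlations. The toy vectors below are made up to show this; they are not taken from the flights data.

```r
# Toy vectors with one missing value each (illustrative values only).
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, NA)

# Passing the whole vector of choices, as in the appendix calls.
r_vector <- cor(x, y, use = "na.or.complete",
                method = c("pearson", "kendall", "spearman"))

# Naming a single method explicitly.
r_pearson  <- cor(x, y, use = "na.or.complete", method = "pearson")
r_spearman <- cor(x, y, use = "na.or.complete", method = "spearman")

# TRUE: the vector form resolves to Pearson.
same_as_pearson <- identical(r_vector, r_pearson)
```

Writing method = "pearson" in the calls would give the same numbers and make the choice explicit.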
DATA FOR MODEL
The dataset is ~550 MB in size and will be presented to course instructor on request.
The code can be sent to instructor on request.