Flight Delay Prediction using Data Mining
Abstract— The airline industry is growing fast, as
air travel has become the preferred mode of transport
for many people, who find it affordable and faster
than other modes. However, like other modes, it has
its disadvantages. Growing air traffic, among other
causes, leads to flight delays that inconvenience
customers. These delays have cost millions of dollars
in the United States in recent years and have also
affected many transportation companies. To address
this problem, it is necessary to find the factors that
cause flight delays. Classification has proved
effective in many fields for solving different
problems. In this paper, several classification
algorithms are applied: K-Nearest Neighbors, Decision
Tree C50 and Artificial Neural Networks. Their
performance is compared, and Decision Tree C50 turns
out to be the best algorithm, with an overall accuracy
of 85%.
Keywords— Classification, Data Mining, K-Nearest
Neighbors, C5.0, Artificial Neural Networks.
I. INTRODUCTION
With the increase in population, the use of
vehicles and transportation is also increasing.
As a result, road traffic has grown, which causes many
problems and wastes the time of many people. People
travelling to the office or to an important meeting
struggle to arrive on time. To save time, people
nowadays prefer to travel by air. However, air traffic
has also increased, which results in flight delays.
These delays have a number of causes and, in turn,
affect many other things.
The Bureau of Transportation Statistics (BTS) states
that there are five major causes of flight delays:
late aircraft, weather, carrier, NAS and security [1].
Around 18% of flights were delayed and 1.5% of
flights were cancelled in the United States in 2015.
Delays cost the airline industry in the United States
billions of dollars each year (around 40 billion
dollars). Not only the industry but also its
customers are affected. Around 800 million of a
total of 7 billion people travel in the United States
each year [2].
Many people may have an emergency and need
to reach a place as early as possible. It is very
inconvenient for such a traveller to look for an
alternative at that moment, because they need to
arrive quickly. The transportation industry is also
badly affected and suffers great losses. It is
therefore necessary to reduce either the number of
flight delays or the effects that they have. To do so,
we first need to find the factors that cause the
delays. To study the problem, a dataset containing
information about flights in the United States has
been used. This motivates our research question and a
possible way of answering it. The research question is
‘What are the factors that cause flight arrival delays
in the United States?’ and our objective is to find
these factors using data mining techniques. Several
data mining classification techniques have been used:
K-Nearest Neighbors (KNN), Decision Tree C50 and
Artificial Neural Networks (ANN).
These techniques were applied to two datasets
containing details of flights in the US, one for
January 2017 and the other for January 2018. The first
dataset was taken from ‘Data World’ and the second
from ‘Transtats’ [25][26]. The two were combined, and
the combined dataset contained around a million
flights and around 60 features describing the flights
and their delays. These features fall into several
categories: details of the flights, details of the
source and destination, schedule of the flights,
reasons for delays, amount of delay, etc.
• Details of the flights like Carrier, Tail
Number, Airline.
• Details of the source and destination like
source airport, source city, source state,
destination airport, destination city,
destination state.
• Schedule of the flights like the year, month,
day of the week, day of the month, departure
time, arrival time.
• Reasons for delays like weather delay, NAS
delay, carrier delay, security delay, late
aircraft delay.
• Amount of delays in minutes for both,
departure and arrival delays.
• Many more details like wheels-on time,
wheels-off time, taxi-in time, taxi-out time,
whether the flight was cancelled or diverted,
distance travelled, etc.
However, there were many more attributes which
were not taken into consideration for solving the
research question. The main focus was on the arrival
delay so the arrival delay in minutes was the target
variable.
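The combination of the two monthly extracts and the
choice of target variable can be sketched as follows.
This is a minimal illustration, not the code used in
the project: the column names (CARRIER, ORIGIN,
ARR_DELAY, TAIL_NUM) and the toy rows are assumptions,
and the real files use their own headers.

```python
# Toy stand-ins for the two monthly extracts (hypothetical columns and values).
jan_2017 = [
    {"CARRIER": "AA", "ORIGIN": "ATL", "ARR_DELAY": "12.0"},
    {"CARRIER": "DL", "ORIGIN": "ORD", "ARR_DELAY": "-4.0"},
]
jan_2018 = [
    {"CARRIER": "UA", "ORIGIN": "DEN", "ARR_DELAY": "47.0", "TAIL_NUM": "N123"},
]


def combine(*datasets):
    """Stack the extracts row-wise, keeping only columns present in all of them."""
    shared = set.intersection(*(set(rows[0]) for rows in datasets))
    return [
        {column: row[column] for column in sorted(shared)}
        for rows in datasets
        for row in rows
    ]


flights = combine(jan_2017, jan_2018)

# Target variable: arrival delay in minutes (negative means an early arrival).
target = [float(row["ARR_DELAY"]) for row in flights]
```

Keeping only the shared columns is one simple way to
reconcile two sources whose exports do not have
identical fields.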
II. RELATED WORK
Over the past twenty years, air travel has gained
popularity because it is the most convenient and least
time-consuming mode of transportation. However, the
growing number of flights also creates air traffic,
which results in flight delays. When classification
machine learning algorithms were applied to this
problem, departure delay, taxi-out time and origin of
the flight received the highest importance scores [3].
Flight delays severely affect the airline industry
because they cost the industry itself, its customers
and the country’s economy millions of dollars per
month. The reasons for delays range from the
macroscopic level to the microscopic level. Costing
and supervised machine learning algorithms have been
applied to build a cost-sensitive classifier and to
predict flight delays, with performance evaluated on
the basis of the cost ratio [4].
The Bureau of Transportation Statistics provides
airline data for the United States. It gives detailed
information on flight routes, timings, carriers, types
of delays, etc. Using regression analysis with
regularization, one study predicts the flight delay in
minutes. It also gives a statistical description of
the individual airlines and shows which hours are the
busiest [5]. Research on this dataset is still ongoing,
with the aim of advising customers about flight delays.
To analyze flight delays, every aspect that may
contribute to the problem needs to be checked, such as
the airport, the route of the flight and the airline.
One of these parameters, or a combination of them,
should be taken into consideration when analyzing the
delays. To improve prediction and performance
evaluation, the approaches are grouped into five
parts: statistical models, probabilistic models,
network representation, operations research and
machine learning, all of which are used to forecast
flight delays more accurately [6].
To find out whether the busyness of an airport matters,
we need to check which airports have the largest
number of departing and arriving flights. For this, an
SQL business intelligence tool was used. The tool also
presents visuals and gives statistical answers to
questions such as whether a flight was delayed when it
departed. This study shows that there is a correlation
between day of the month, month and departure delay
[7]. However, this model presents visuals by running a
clustering algorithm on the area of interest, and the
tool cannot be used to calculate an accuracy
percentage.
Two airports are connected by certain routes, which
can cause problems for on-time arrival. If one flight
is delayed on a route, the successive flights
scheduled on that route are delayed as well, and a
chain reaction follows [8]. A Bayesian network can
help to identify which factors influence such flight
delays [9].
When other important parameters are taken into
consideration, weather alone is not responsible for
delays. Some research assumes that weather conditions
and en-route flights are the most important factors in
analyzing flight delays [10], but the other parameters
are also fairly closely related to weather conditions.
Delays due to weather account for 40% of total delays
[11]. In one study, historical weather data was added
to improve performance, and naïve Bayes and C4.5 were
applied to distinguish two classes, non-delayed and
delayed by more than 30 minutes. Naïve Bayes was found
to perform better than C4.5 [12].
Other parameters, such as time of the day, day of the
week, type of the hour and season, might also
influence flight delays. Whether the day falls on a
weekend or a weekday indicates how busy the airport
is. To classify and predict delays, several models
were applied: Artificial Neural Network (ANN),
Classification and Regression Tree (CART) and Markov
Jump Linear System (MJLS). The consistency of delays
and the correlated network were analyzed to determine
the delays at an airport. The three models gave
different accuracies. ANN performed best for
classification of the origin-destination pair. In
contrast, regression on the origin-destination pair
was best fitted by the Markov Jump Linear System. This
study can help to manage air traffic [2].
Another approach works in two stages: binary
classification, followed by prediction by regression.
Among the major machine learning algorithms tried, the
gradient boosting classifier and the gradient boosting
regressor gave the best results. The model is built so
that it can easily be connected to a user interface,
which gives passengers advance knowledge of the delay
to the flight they are about to board [1].
Air travel is becoming the preferred mode of
transportation day by day. Almost all cities are
interconnected by flights, which creates air traffic
congestion. Controlling this air traffic is a complex
task because congestion plays a large part in flight
delays. To study this problem, the New York metroplex
was chosen. New York City’s airports have served more
than 100 million passengers. Multi-layer clustering
is applied to identify the spatial patterns in air
traffic, and a multi-way classifier is built using
random forest [13].
III. METHODOLOGY
Several methodologies can be used for carrying out
data mining, such as CRISP-DM (CRoss Industry Standard
Process for Data Mining), KDD (Knowledge Discovery in
Databases) and SEMMA (Sample, Explore, Modify, Model
and Assess).
CRISP-DM is a six-phase sequential process model
that is hierarchical and iterative and provides an
extendable framework [14]. SEMMA is also an
iterative model, in which the internal procedures are
run repeatedly until the goal is achieved [15]. The two
models are broadly similar but differ slightly with
respect to tasks, activities, phases, etc. [16].
However, the method used in our project is KDD,
because it is easy, complete and more accurate. As the
name suggests, ‘Knowledge Discovery in Databases’ is a
process of extracting important and useful hidden
knowledge from databases or other available data.
A simple diagram describing the KDD process is
given below.
Fig. 1. KDD Process [17]
KDD is a nine-step model.
1. Understanding the domain, i.e., identifying
the target or what is to be achieved. In this
project, the target is to identify the factors
that cause flight delays, as stated in the
research question. This requires background
knowledge of the problem in order to decide
which resources can be used to solve it.
2. Selecting the subset of variables, i.e., the
resources to be used for answering the question.
For this, a dataset containing flight details
and the reasons for flight delays was used, as
mentioned above. Discovery is performed on this
data to identify the target we want to achieve.
3. Pre-processing of data, i.e., dealing with the
dirty data. Dirty data is harmful to work with
because it does not give accurate results; it
degrades the quality of the results and leads
to wrong conclusions. Pre-processing includes
removing noisy data, replacing missing values,
etc. In this project, rows containing dirty
data were removed and the missing values
were replaced with zeros.
4. Reducing the data, i.e., considering only
those attributes that can contribute to the
target variable. Including irrelevant features
can degrade the performance of the model or
produce misleading results. In this project,
attributes such as the distance group, the date
of the flight and the airline ID were not
considered because they had nothing to do with
the target variable, i.e., the amount of time
for which flights are delayed, or with the
factors affecting the delays.
Another part of this step is the
transformation of the data, i.e., converting
the data into an appropriate format.
Categorizing the data is one example of such a
transformation. Many of the attributes were
categorized: arrival delay in minutes and
departure delay in minutes were categorized as
early, on-time, late and very late. Airports
were categorized by the frequency of flights as
less busy, medium busy and high busy. Distance
travelled by the flight was categorized into
short distance, medium distance and long
distance. The week was categorized into
weekdays and weekends. The month was
categorized into first half and second half.
Some of the data was removed. For
example, flights that were cancelled or
diverted were not taken into consideration,
because for such flights there is no question
of arriving on time or being delayed. Flights
that departed less than 5 minutes late were
assumed to have departed on time. Only the
top 4 origin airports were considered, because
categorizing all the origin airports was not
possible, and further data was removed as well.
The top 4 origins were found using the
‘Tableau’ visualization tool. The result is
shown below.
Fig. 2. Origin airports
5. Selecting the data mining procedures, i.e.,
the type of model that is to be constructed or
developed. These can be of different types,
such as classification, regression, analysis
and clustering, depending on the goal of the
domain. In this project, classification was
used to identify the factors that most affect
or cause flight delays.
6. Data mining algorithm, i.e., the technique
that is applied to get the results, which in
this case is the classification algorithm.
K-Nearest Neighbours (KNN), Artificial Neural
Networks (ANN) and Decision Tree C50 were used.
The accuracy of each model is calculated and
the results are compared to decide the best
algorithm. A detailed explanation of the
algorithms is given in the next section.
7. Searching for patterns, i.e., extracting the
hidden patterns present in the output of the
classification, such as the factors displayed,
the trees designed or the network created.
The results also need to be interpreted from
the graphs obtained.
8. Interpreting the results, i.e., understanding
the patterns and extracting important
information from the graphs, as mentioned
above. These are the final results that were
targeted in the first step of the process. Some
of the above steps can be repeated iteratively
to get better results or to understand them
better. In this project, the results obtained
are compared to identify the best algorithm.
9. Consolidating the knowledge, i.e.,
strengthening the output and results by
applying them in the right place or forwarding
them to the relevant area. In this project, the
results can be applied in real-world scenarios
in the airline industry to reduce the problem
faced by passengers [18].
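The pre-processing and reduction steps (3 and 4) can
be sketched as below. This is only an illustration
under stated assumptions: the field names and toy
records are hypothetical, the 60-minute ‘very late’
cut-off and the weekday numbering are our own choices
for the example, and only the 5-minute on-time margin
and the zero-filling of missing values come from the
text.

```python
from collections import Counter

# Toy records; field names are hypothetical and None marks a missing value.
flights = [
    {"origin": "ATL", "cancelled": 0, "diverted": 0, "dep_delay": 3,
     "arr_delay": -8, "weather_delay": None, "day_of_week": 6},
    {"origin": "ATL", "cancelled": 0, "diverted": 0, "dep_delay": 20,
     "arr_delay": 25, "weather_delay": 0, "day_of_week": 2},
    {"origin": "ORD", "cancelled": 1, "diverted": 0, "dep_delay": None,
     "arr_delay": None, "weather_delay": None, "day_of_week": 3},
    {"origin": "ORD", "cancelled": 0, "diverted": 0, "dep_delay": 90,
     "arr_delay": 75, "weather_delay": 60, "day_of_week": 7},
    {"origin": "DEN", "cancelled": 0, "diverted": 1, "dep_delay": 10,
     "arr_delay": None, "weather_delay": None, "day_of_week": 1},
]


def clean(rows):
    """Drop cancelled or diverted flights and replace missing values with zero."""
    kept = []
    for row in rows:
        if row["cancelled"] or row["diverted"]:
            continue
        kept.append({key: (0 if value is None else value)
                     for key, value in row.items()})
    return kept


def delay_class(minutes, on_time=5, very_late=60):
    """Bin a delay in minutes; the 60-minute cut-off is an assumption."""
    if minutes < 0:
        return "early"
    if minutes < on_time:
        return "on-time"
    if minutes < very_late:
        return "late"
    return "very late"


def day_type(day_of_week):
    """Assumes 1 = Monday through 7 = Sunday."""
    return "weekend" if day_of_week >= 6 else "weekday"


def top_origins(rows, n=4):
    """Return the n origin airports with the most flights."""
    counts = Counter(row["origin"] for row in rows)
    return [airport for airport, _ in counts.most_common(n)]


cleaned = clean(flights)
busiest = top_origins(cleaned, n=4)
prepared = [
    dict(row,
         arr_class=delay_class(row["arr_delay"]),
         day_type=day_type(row["day_of_week"]))
    for row in cleaned
    if row["origin"] in busiest
]
```

Filtering before categorizing keeps the cancelled and
diverted flights, which have no meaningful arrival
delay, out of the delay classes.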
IV. EVALUATION AND RESULTS
Various classification algorithms were applied on the
dataset as mentioned above.
A. K-Nearest Neighbours
KNN is a simple supervised non-parametric
model in which an input sample is assigned to the
class that is most common amongst its nearest
neighbors. Nearness is decided by calculating the
distance between the sample and the other points, and
the ‘K’ closest points are taken as its neighbors
[19]. The value of ‘K’ should be appropriate for the
model and give the minimum error. A small ‘K’ makes
the estimate sensitive to noise, whereas a larger ‘K’
gives a smoother estimate. In this project, various
‘K’ values were tried, and K=31 was found to fit our
model best [20]. KNN has been used because it
indicates the factors that lie closest to the target
variable, i.e., the variables that affect or decide
it most.
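The voting rule and the search for a suitable ‘K’ can
be sketched as follows. This is a minimal sketch, not
the implementation used in the project: it assumes
Euclidean distance and numeric features, and the toy
points, labels and candidate values of ‘K’ are
invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Label `query` with the majority class among its k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def error_rate(train_X, train_y, test_X, test_y, k):
    """Fraction of test points that receive the wrong label."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != y
                for x, y in zip(test_X, test_y))
    return wrong / len(test_X)


def best_k(train_X, train_y, val_X, val_y, candidates):
    """Candidate k with the lowest validation error (ties go to the smaller k)."""
    return min(candidates,
               key=lambda k: (error_rate(train_X, train_y, val_X, val_y, k), k))


# Toy data: two well-separated groups of points (hypothetical features).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["on-time"] * 3 + ["late"] * 3
val_X = [(0.2, 0.2), (5.2, 5.2)]
val_y = ["on-time", "late"]

prediction = knn_predict(train_X, train_y, (0.5, 0.5), k=3)
chosen_k = best_k(train_X, train_y, val_X, val_y, candidates=[1, 3, 5])
```

Odd values of ‘K’ are used here so that a two-class
vote cannot end in a tie.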
Fig. 3. K-values
From the above image, it can be seen that the
error for K=31 was minimum and so it was finalized.
Then, the confusion matrix was generated to
calculate the accuracy of the model.
Fig. 4. Confusion Matrix of KNN
From the above image, it can be seen that 29513
flights that arrived early were correctly predicted,
whereas only 82 late flights and 118 very late
flights were correctly predicted. We also obtained
dimensions for the input variables. They are shown
below.
Fig. 5. Dimensions of input variables
After performing these steps, the overall
accuracy and other performance measures were
calculated. The results are shown below.
Fig. 6. Performance measures of KNN
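The overall accuracy was found to be 80%,
calculated as the number of correctly classified
flights divided by the total number of flights. A
Kappa value of 0.0531 was obtained, which is low:
a value this close to zero means that the agreement
between predicted and actual classes is only slightly
better than chance.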
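The two measures can be computed from a confusion
matrix as sketched below. The counts are a toy
example, not the matrix of Fig. 4; they are chosen to
show how an accuracy of 80% can coexist with a Kappa
of zero when one class dominates and the model always
predicts it.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Correctly classified samples (the diagonal) divided by all samples."""
    total = sum(map(sum, matrix))
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Agreement beyond what the row and column totals would give by chance."""
    total = sum(map(sum, matrix))
    observed = accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Toy labels: 8 early and 2 late flights, all predicted as early.
labels = ["early", "late"]
actual = ["early"] * 8 + ["late"] * 2
predicted = ["early"] * 10

matrix = confusion_matrix(actual, predicted, labels)
acc = accuracy(matrix)
kappa = cohen_kappa(matrix)
```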
B. Decision Tree C50
C50 is a type of decision tree classifier in
which each split is chosen to give the maximum
information gain [21]. Information gain is the
reduction in entropy achieved by a split, where
entropy is the negative sum, over the classes, of the
probability of the class multiplied by the log of
that probability [22]. C50 has been used because it
helps in identifying the factors that affect the
target class, together with their usage or
contribution. The root or parent node has more
influence than its child nodes. One advantage of the
C50 algorithm is that it can be applied to many kinds
of data and saves a lot of memory. Another is that it
can handle numeric as well as categorical data. In
this project, we categorized some of the factors and
then applied the C50 algorithm to the dataset. Three
attributes contributed to generating the decision
tree. The usage of these three attributes is shown
below.
Fig. 7. Attribute usage in C50
As you can see, the usage of ‘NAS delay’ was
100%, and that of ‘Weather delay’ and ‘Taxi out’
was 95.93% and 95.92% respectively. A decision
tree was formed consisting of these three attributes.
The decision tree is shown below.
Fig. 8. Decision tree C50
After this, predictions were made and the error of
the model was calculated. An error rate of 15% was
obtained, i.e., the model was 85% accurate. This
result is shown in the image below.
Fig. 9. Error rate of C50
The classification of the data is also shown in
the above image, and it can be observed that C50
performed better than the KNN algorithm.
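The information gain criterion described for C50 can
be illustrated with a short calculation. This sketch
only shows the quantity itself, assuming a base-2
logarithm; it is not the C50 implementation, and the
class labels and splits are toy values.

```python
import math
from collections import Counter


def entropy(labels):
    """Negative sum over classes of p * log2(p)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(labels)
    remainder = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(labels) - remainder


# Toy parent node with an even mix of two classes.
parent = ["late"] * 4 + ["on-time"] * 4

# A split that separates the classes perfectly.
perfect_split = [["late"] * 4, ["on-time"] * 4]

# A split that leaves each child as mixed as the parent.
useless_split = [["late", "late", "on-time", "on-time"],
                 ["late", "late", "on-time", "on-time"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
```

A split with higher gain leaves purer child nodes,
which is why an attribute that separates the delay
classes well is placed nearer the root of the tree.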
C. Artificial Neural Networks
Artificial Neural Networks process information in
a way similar to the human brain [23]. They need not
be programmed manually but learn from past experience
[24]. An ANN consists of several neurons arranged in
three kinds of layers: input layer, hidden layer and
output layer. Each connection between neurons is
assigned a weight; a neuron adds up its weighted
inputs and the result is passed on to the next layer.
In this project, a total of 12 inputs, 5 hidden
layers and 1 output layer were selected, and the
accuracy was found to be 79%. The accuracy was also
checked with 2 hidden layers, which gave 77%, and so
5 hidden layers were selected. The network and
accuracy for 5 hidden layers are shown below.
Fig. 10. Neural Network with 5 hidden layers
Fig. 11. Accuracy of ANN
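The weighted-sum computation performed by each neuron
can be sketched as a single forward pass. This is an
illustration under assumptions, not the network of
Fig. 10: it uses 3 inputs and 2 hidden neurons instead
of the 12 inputs of the project, a sigmoid activation,
and hand-picked weights, and it does not show how the
weights are learned.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Each neuron sums its weighted inputs, adds a bias and applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> single output neuron."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)[0]


# Hypothetical scaled inputs and weights, for illustration only.
inputs = [0.5, 1.0, 0.0]
hidden_w = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
hidden_b = [0.0, 0.1]
out_w = [[1.5, -2.0]]
out_b = [0.05]

score = forward(inputs, hidden_w, hidden_b, out_w, out_b)
label = "delayed" if score >= 0.5 else "on-time"
```

The 0.5 threshold on the output is likewise an
assumption made for this two-class example.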
V. CONCLUSION AND FUTURE WORK
In this project, different classification algorithms,
namely KNN, C50 and ANN, were implemented to predict
flight delays. The results of these algorithms were
compared, and C50 was found to be the best, with an
accuracy of 85%. Many factors were found to cause
flight delays. The C50 algorithm showed that NAS
delay, Weather delay and Taxi-out were the features
causing flight delay. These models can be applied in
real-world scenarios to bring about improvements in
the airline industry. In future, we can try to improve
the prediction model to gain higher accuracy. Further
analysis could identify the airlines in which delays
occur most often. The time of year in which delays
occur could also be identified by combining the data
with a weather-related dataset.
REFERENCES
[1] R. J. Hansman, “Identification, Characterization,
and Prediction of Traffic Flow Patterns in Multi-
Airport Systems,” pp. 1–14, 2018.
[2] M. Baluch and T. Bergstra, “Complex Analysis of
United States Flight Data Using a Data Mining
Approach,” pp. 1–6, 2017.
[3] F. Bus, “Application of Machine Learning
Algorithms to Predict Flight Arrival Delays,” vol.
00, pp. 3992–3997, 2015.
[4] N. E. Md Isa, A. Amir, M. Z. Ilyas, and M. S.
Razalli, “The Performance Analysis of K-Nearest
Neighbors (K-NN) Algorithm for Motor Imagery
Classification Based on EEG Signal,” MATEC
Web Conf., vol. 140, p. 01024, 2017.
[5] M. S. B. Maind, “Research Paper on Basic of
Artificial Neural Network,” Int. J. Recent Innov.
Trends Comput. Commun., vol. 2, no. 1, pp. 96–
100, 2014.
[6] S. Choi, Y. J. Kim, S. Briceno, and D. Mavris,
“Prediction of weather-induced airline delays
based on machine learning algorithms,”
AIAA/IEEE Digit. Avion. Syst. Conf. - Proc., vol.
2016–December, pp. 1–6, 2016.
[7] Y. Ding, “Predicting flight delay based on
multiple linear regression,” 2017.
[8] M. Balamurugan and S. Kannan, “Performance
Analysis of Cart and C5.0 using Sampling
Techniques,” 2016 IEEE Int. Conf. Adv. Comput.
Appl., pp. 72–75, 2016.
[9] G. Costagliola, V. Fuccella, M. Giordano, and G.
Polese, “Monitoring online tests through data
visualization,” IEEE Trans. Knowl. Data Eng.,
vol. 21, no. 6, pp. 773–784, 2009.
[10] A. Guerra-hern, “Explorations of the BDI Multi-
Agent support for the Knowledge Discovery in
Databases Process,” no. January, 2008.
[11] O. Niakšu, “CRISP Data Mining Methodology
Extension for Medical Domain,” Balt. J. Mod.
Comput., vol. 3, no. 2, pp. 92–109, 2015.
[12] S. Choi, Y. J. Kim, S. Briceno, and D. Mavris,
“Cost-sensitive prediction of airline delays using
machine learning,” AIAA/IEEE Digit. Avion. Syst.
Conf. - Proc., vol. 2017–September, 2017.
[13] P. N. Patil, R. Lathi, and V. Chitre, “Comparison
of C5.0 & CART Classification algorithms using
pruning technique,” Int. J. Eng. Res. Technol.,
vol. 1, no. 4, pp. 1–5, 2012.
[14] N. Kuhn and N. Jamadagni, “Application of
Machine Learning Algorithms to Predict Flight
Arrival Delays,” pp. 1–6, 2017.
[15] S. B. Imandoust and M. Bolandraftar,
“Application of K-Nearest Neighbor (KNN)
Approach for Predicting Economic Events:
Theoretical Background,” Int. J. Eng. Res. Appl.,
vol. 3, no. 5, pp. 605–610, 2013.
[16] Q. Li, W. Lei, F. Rong, W. Bin, and X. Hei, “An
analysis method for flight delays based on
Bayesian network,” Proc. 2015 27th Chinese
Control Decis. Conf. CCDC 2015, pp. 2561–
2565, 2015.
[17] P. Chandraa, N. Prabakaran, and R. Kannadasan,
“Airline delay predictions using supervised
machine learning,” Int. J. Pure Appl. Math., vol.
119, no. Special Issue 7A, 2018.
[18] A. Sternberg, J. Soares, D. Carvalho, and E.
Ogasawara, “A Review on Flight Delay
Prediction,” pp. 1–21, 2017.
[19] B. Thiagarajan, L. Srinivasan, A. V. Sharma, D.
Sreekanthan, and V. Vijayaraghavan, “A machine
learning approach for prediction of on-time
performance of flights,” AIAA/IEEE Digit. Avion.
Syst. Conf. - Proc., vol. 2017–September, 2017.
[20] Y. J. Kim, S. Choi, S. Briceno, and D. Mavris, “A
deep learning approach to flight delay prediction,”
AIAA/IEEE Digit. Avion. Syst. Conf. - Proc., vol.
2016–December, pp. 1–6, 2016.
[21] V. Sharma, S. Rai, and A. Dev, “A
Comprehensive Study of Artificial Neural
Networks,” Int. J. Adv. Res. Comput. Sci. Softw.
Eng., vol. 2, no. 10, pp. 278–284, 2012.
[22] U. Shafique and H. Qaiser, “A Comparative Study
of Data Mining Process Models (KDD, CRISP-
DM and SEMMA),” Int. J. Innov. Sci. Res., vol.
12, no. 1, pp. 217–222, 2014.
[23] H. Jair et al., “A comparative between CRISP-
DM and SEMMA through the construction of a
MODIS repository for studies of land use and
cover change,” Adv. Sci. Technol. Eng. Syst. J.,
vol. 2, no. 3, pp. 598–604, 2017.
[24] K. Gopalakrishnan and H. Balakrishnan, “A
Comparative Analysis of Models for Predicting
Delays in Air Traffic Networks,” Eur. Air Traffic
Manag. Res. Dev. Semin., 2017.
[25] Transtats.bts.gov. (2018). OST_R | BTS | Transtats.
[online] Available at:
https://www.transtats.bts.gov/DL_SelectFields.asp
?Table_ID=236 [Accessed 3 Aug. 2018].
[26] Data.world. (2018). data.world. [online] Available
at: https://data.world/hoytick/2017-jan-
ontimeflightdata-usa [Accessed 3 Aug. 2018].