PRESENTATION ON CHALLENGE lab_084627 (1).pptx

PRESENTATION ON CHALLENGE lab ML
Presented by:
Musa Idris
(roll no 40923)
Title of the Minor Project Report on:
Predicting Airplane Delays

Introduction
You work for a travel booking website that wants to improve the
customer experience for flights that are delayed. The company
wants to create a feature to let customers know if the flight will
be delayed because of weather when they book a flight to or
from the busiest airports for domestic travel in the US.
• Flight delays create major problems for the current aviation
system and for the scheduling of airport operations. The
unreliability of flight arrivals is a serious challenge.
• Punctuality is an issue for all major carriers, with some
struggling more than others.

Motivation of Work
Flight delays are a significant concern in the aviation
industry, leading to revenue loss, wasted fuel, and
customer dissatisfaction. They also worry passengers
taking a connecting flight, since a delay on the first
flight could cause them to miss the subsequent one.
This scenario motivates the study. With a reliable
method to predict flight delays, such missed connections
could either be prevented or better managed.

Problem Statement
The ability to predict a flight delay can be helpful for
all parties, including airlines and passengers. This study
explores a method of predicting flight delay by
classifying a specific flight as either delayed or not delayed.
An initial review shows that the flight delay dataset is
skewed. This is expected, since most airlines have
more non-delayed flights than delayed ones. Hence, this
study compares different methods of dealing with an
imbalanced dataset when training a flight delay prediction
model.

Objectives
The objectives of this study are:
1. To identify the attributes that affect flight delay.
2. To develop machine learning models that classify
flight outcomes (either delayed or not delayed)
with selected features.
3. To evaluate the performance of different machine
learning models.

Data source
The data was obtained from the "Airline Delay and
Cancellation Data, 2009 – 2018" page on Kaggle. The
dataset, which contains flight information for the United
States from 2009 to 2018, originates from the U.S.
Department of Transportation's Bureau of
Transportation Statistics. This study used only the data
from the year 2018, which consisted of 27
attributes and 7,213,446 data points.

Data Preprocessing
To facilitate the modeling process, only flight data from the
busiest airports was included, since these airports had the
largest number of scheduled arrival flights in the U.S. Data
cleansing was performed on the names of the flight carrier,
origin airport and destination airport, as these were recorded
as IATA code abbreviations. Attributes with more than 50%
missing values, which did not provide helpful information to
this analysis, were dropped. Since our main objective was to
predict flight delay, unrelated attributes that recorded the
outcome of canceled and diverted flights were also removed.

Data Preprocessing
Instances with missing values were removed, as they made up less
than 1% of the data, which was relatively small.
For classification purposes, a binary attribute, namely "flight delay,"
was added to record the status of the flight. The durations between
departure and wheels off the ground, and between wheels on the
ground and arrival, were derived, as these provide information
about the actual duration of these activities. The month, day, and
day of the week were extracted from the actual
flight date. Before modeling, all categorical attributes such as
destination airports, day of the week, flight carrier, and flight delay
factors were converted to numerical variables via one-hot encoding.
One dummy variable is created for every category of
the categorical variable. If the category is present, the value
is one. Otherwise, the value is zero.
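
The derivation and encoding steps above can be sketched in plain Python as follows. This is only an illustration, not the project's code: the field names (FL_DATE, OP_CARRIER, ARR_DELAY), the three sample rows, and the 15-minute delay threshold are all assumptions made for the example, since the slides do not state them.

```python
from datetime import date

# Hypothetical sample records; field names and values are illustrative only.
flights = [
    {"FL_DATE": "2018-01-05", "OP_CARRIER": "AA", "ARR_DELAY": 32.0},
    {"FL_DATE": "2018-07-14", "OP_CARRIER": "DL", "ARR_DELAY": -4.0},
    {"FL_DATE": "2018-12-24", "OP_CARRIER": "UA", "ARR_DELAY": 7.0},
]

DELAY_THRESHOLD_MIN = 15  # assumed cut-off; the slides do not state one


def add_derived_fields(row):
    """Return a copy of row with month, day, weekday and a binary delay flag."""
    d = date.fromisoformat(row["FL_DATE"])
    out = dict(row)
    out["MONTH"] = d.month
    out["DAY"] = d.day
    out["DAY_OF_WEEK"] = d.isoweekday()  # 1 = Monday ... 7 = Sunday
    out["FLIGHT_DELAY"] = 1 if row["ARR_DELAY"] >= DELAY_THRESHOLD_MIN else 0
    return out


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 dummy column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(new)
    return encoded


prepared = one_hot([add_derived_fields(r) for r in flights], "OP_CARRIER")
for row in prepared:
    print(row)
```

Each output row carries one dummy column per carrier seen in the data, so a dataset with many carriers or airports grows wide quickly.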

Feature Selection
Constant variables were removed as they did not provide
helpful information to the model. Attributes highly
correlated with each other were examined, and only the most
predictive one was kept, to avoid the effect of
multicollinearity on the model. Planned elapsed time, airtime,
distance, and actual elapsed time were correlated with each
other at more than 0.8. To select which
attributes in this group to remove, a random forest algorithm
was used to determine their feature importance. The
actual elapsed time was kept as it had the greatest
importance compared to the other attributes (shown in the
table below).
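
A minimal sketch of this correlation filter is given here, assuming toy column values and placeholder importance scores. The numbers are invented for illustration and are not the values from the table. In the study the scores came from a random forest, whereas here they are simply supplied as a dictionary.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy columns (hypothetical values, in minutes or miles).
columns = {
    "planned_elapsed": [60, 90, 120, 180, 240],
    "actual_elapsed": [65, 88, 130, 175, 250],
    "distance": [300, 500, 750, 1200, 1700],
    "taxi_out": [20, 12, 25, 10, 18],
}

# Placeholder importance scores standing in for random forest output.
importance = {
    "planned_elapsed": 0.10,
    "actual_elapsed": 0.30,
    "distance": 0.05,
    "taxi_out": 0.20,
}


def drop_correlated(columns, importance, threshold=0.8):
    """For each pair correlated above the threshold, drop the less important one."""
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(columns[a], columns[b])) > threshold:
            dropped.add(a if importance[a] < importance[b] else b)
    return [c for c in columns if c not in dropped]


kept = drop_correlated(columns, importance)
print(kept)
```

With these toy inputs, the three strongly correlated duration and distance columns collapse to the single most important one, while the weakly correlated column is left alone.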

The figure below shows the features the random forest classifier reported along
with their importance score, arranged in descending order. It is interesting to
note that scheduled arrival day, month, and destination airport did not
contribute much to a flight's arrival delay. Attributes with low importance
scores were eliminated as keeping all of them did not yield better results for
training models. Thus, only the first nine attributes were used to train the
remaining models.

Modelling and Performance Evaluation
• Delayed flights are the minority class in this study. Because the data
distribution is skewed, a model trained on it puts little emphasis on
predicting this class. Resampling helped considerably to put more emphasis
on the minority class.
• Using SMOTE with k = 5 nearest neighbors, about four synthetic
observations were created, giving a new ratio of 1:0.88 for the number of
instances of on-time flights to delayed flights. With oversampling
techniques, the risk of overfitting increases when many synthetic examples
are created.
• With undersampling techniques, potentially vital information may be lost as
existing observations are eliminated from the dataset.
• The two resampling methods employed were SMOTE and random
undersampling. After employing SMOTE, a clear increase in the recall metric
was observed on the test data.
• A similar result was obtained after performing random undersampling,
whereby the majority class was reduced to a similar number of
instances as the minority class.
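
The two resampling ideas can be sketched in plain Python as below. This is a simplified illustration of the core steps, not the implementation used in the study: the function names, the two-dimensional toy points, and the class sizes (50 on-time, 8 delayed) are assumptions chosen so that oversampling ends at a 1:0.88 ratio.

```python
import random
from math import dist


def random_undersample(majority, minority, rng):
    """Randomly keep only as many majority rows as there are minority rows."""
    return rng.sample(majority, len(minority)), list(minority)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


rng = random.Random(0)
# Hypothetical two-feature data: 50 on-time flights, 8 delayed flights.
on_time = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(50)]
delayed = [(rng.uniform(20, 30), rng.uniform(20, 30)) for _ in range(8)]

maj_u, min_u = random_undersample(on_time, delayed, rng)
new_points = smote_like(delayed, n_new=36, k=5, rng=rng)
delayed_over = delayed + new_points

print("undersampled:", len(maj_u), "on-time vs", len(min_u), "delayed")
print("oversampled: ", len(on_time), "on-time vs", len(delayed_over), "delayed")
```

Because every synthetic point lies on a segment between two real minority points, oversampling adds no information from outside the region the minority class already covers, which is one reason the overfitting risk grows as more points are generated.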

Data Analysis
• Various attributes are analyzed to determine which
are relevant to the prediction of delays and
which can be discarded as irrelevant.
• We explore how delays are distributed across
different variables. This step extracts the
importance of each variable in affecting the patterns in
flight departures and delays.

Data Analysis
There are 8 attributes, each of which is studied
separately:
1. Day of Month
2. Day Of Week
3. Unique Carrier
4. Origin
5. Departure Time
6. Distance Group
7. Arrival Time
8. Destination
Four of these attributes are used in all types of prediction.

Test Sets Preparation
• In all, 3 test cases are created based on the ratio of
the number of delayed flights to the number of on-time
flights:
1. Ratio is 1:1
2. Ratio is 1:3
3. Ratio is 3:1
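
One way such test sets could be drawn is sketched below. The helper name build_test_set, the pool of 400 flights per class, and the unit size of 100 flights per ratio part are assumptions for illustration, as the slides do not give the actual test set sizes.

```python
import random


def build_test_set(delayed, on_time, ratio, size_unit, rng):
    """Sample a test set with delayed:on-time counts in the given ratio.

    ratio is a (delayed_parts, on_time_parts) tuple and size_unit is the
    number of flights drawn per ratio part.
    """
    n_delayed = ratio[0] * size_unit
    n_on_time = ratio[1] * size_unit
    sample = rng.sample(delayed, n_delayed) + rng.sample(on_time, n_on_time)
    rng.shuffle(sample)
    return sample


rng = random.Random(42)
# Hypothetical pools of labelled flights.
delayed = [("delayed", i) for i in range(400)]
on_time = [("on_time", i) for i in range(400)]

ratios = [(1, 1), (1, 3), (3, 1)]
test_sets = {r: build_test_set(delayed, on_time, r, 100, rng) for r in ratios}

for ratio, rows in test_sets.items():
    n_delayed = sum(1 for label, _ in rows if label == "delayed")
    print(ratio, "delayed:", n_delayed, "on-time:", len(rows) - n_delayed)
```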

Types Of Prediction
After exploring these relationships, we now build 7
prediction models on the basis of 4 parameters. The models
are:
• Day
• Date
• Time
• Day and date
• Date and time
• Day and Time
• Day, Date and Time
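
The seven models listed match every non-empty combination of the three parameters named in the list (Day, Date, Time). The short sketch below enumerates them. Treating each combination as the input feature set of one model is an assumption about how the models were built, not something the slides confirm.

```python
from itertools import combinations

# The three parameters that appear in the list of models.
parameters = ["Day", "Date", "Time"]

# Every non-empty subset, from single parameters up to all three together.
model_feature_sets = [
    combo
    for size in range(1, len(parameters) + 1)
    for combo in combinations(parameters, size)
]

for features in model_feature_sets:
    print(" and ".join(features))
```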

Conclusion
In conclusion, all three objectives were achieved in this project.
Valuable attributes for modeling were discovered, such as
Departure Delay, Wheels On/Off Elapse, Taxi In/Out, Distance,
and many more. These had high coefficient values compared with
the other attributes in the Bureau of Transportation Statistics
dataset, so they were kept while the other attributes were dropped.
Four base algorithms were modeled initially: Naive Bayes (N.B.),
Logistic Regression (L.R.), Decision Tree (D.T.), and Random
Forest (R.F.). Further approaches (bagging, boosting, and
over/undersampling) were then applied to address the imbalance
between the two classes. Comparing the F1 scores of all models
showed that AdaBoost with Decision Tree performed best, as it
took the imbalanced nature of the data into account and obtained
the highest score of all the algorithms.
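
To illustrate why the F1 score is a reasonable yardstick for imbalanced data, the sketch below compares it with accuracy on a hypothetical set of 100 flights, 10 of them delayed. The labels and both sets of predictions are invented for the example and are not results from the project.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive (delayed) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical ground truth: 10 delayed (1) and 90 on-time (0) flights.
y_true = [1] * 10 + [0] * 90

# A model that never predicts a delay.
always_on_time = [0] * 100

# A model that catches 6 of the 10 delays but raises 9 false alarms.
some_model = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 81

for name, y_pred in [("always on time", always_on_time), ("some model", some_model)]:
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f} "
          f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The model that never predicts a delay has the higher accuracy, yet its F1 score is zero because it finds none of the delayed flights.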

Future Work
The work can be extended by training the model with
a neural network algorithm. To handle imbalanced
data, there are further oversampling techniques
such as Adaptive Synthetic (ADASYN), which prevents
the overlapping of synthetic observations, and
undersampling techniques that apply data cleaning
concepts, such as Tomek links (T.L.) and Condensed
Nearest Neighbour (CNN). Other than the resampling
techniques, we can also apply Cost-Sensitive Learning,
which considers misclassification costs by applying
penalties to wrongly classified results. We can also
employ a hybrid method such as SMOTEBoost to
handle the imbalanced data.
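
The idea behind cost-sensitive evaluation can be sketched as follows. The penalty values (5 for a missed delay, 1 for a false alarm) and the two sets of predictions are assumptions chosen for illustration. A real project would set the costs from business requirements.

```python
def total_cost(y_true, y_pred, cost_missed_delay, cost_false_alarm):
    """Sum the penalties for every wrongly classified flight (1 = delayed)."""
    cost = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            cost += cost_missed_delay  # delay that the model failed to flag
        elif truth == 0 and pred == 1:
            cost += cost_false_alarm   # on-time flight flagged as delayed
    return cost


# Hypothetical data: 10 delayed and 90 on-time flights.
y_true = [1] * 10 + [0] * 90
never_delayed = [0] * 100                            # misses all 10 delays
balanced_model = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 81  # 4 misses, 9 false alarms

# Assumed penalties: a missed delay is five times as costly as a false alarm.
for name, y_pred in [("never delayed", never_delayed), ("balanced model", balanced_model)]:
    print(name, total_cost(y_true, y_pred, cost_missed_delay=5, cost_false_alarm=1))
```

Under these assumed penalties the model that makes more mistakes overall is still the cheaper one, because it makes fewer of the expensive kind.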
