PRESENTATION ON CHALLENGE lab ML
Presented by:
Musa Idris
(roll no 40923)
Title of the Minor Project Report on:
Predicting Airplane Delays
Introduction
You work for a travel booking website that wants to improve the
customer experience for flights that were delayed. The company
wants to create a feature to let customers know if the flight will
be delayed because of weather when they book a flight to or
from the busiest airports for domestic travel in the US.
• Flight delays create major problems for the current aviation
system and for the scheduling of airport operations; the
unreliability of flight arrivals is a serious challenge.
• Punctuality is an issue for all major carriers, with some
struggling more than others.
Motivation of Work
Flight delays are a significant concern in the aviation
industry, leading to revenue loss, fuel loss, and
customer dissatisfaction. They also worry passengers
taking connecting flights, because a delay on the first
flight can cause them to miss the subsequent flight. This
scenario motivates the study: with a reliable method for
predicting flight delays, such missed connections could
either be prevented or better managed.
Problem Statement
The ability to predict a flight delay is helpful for
all parties, including airlines and passengers. This study
explores predicting flight delay by classifying a specific
flight as either delayed or not delayed.
An initial review shows that the flight delay dataset is
skewed. This is expected, since most airlines have
more non-delayed flights than delayed ones. Hence, this
study compares different methods of dealing with an
imbalanced dataset when training a flight delay prediction
model.
Objectives
The objectives of this study are:
1. To identify the attributes that affect flight delay.
2. To develop machine learning models that classify
flight outcomes (either delayed or not delayed)
with selected features.
3. To evaluate the performance of different machine
learning models.
Data source
The data was obtained from the "Airline Delay and
Cancellation Data, 2009 – 2018" dataset on Kaggle. The
dataset covers flights in the United States from 2009 to
2018 and originates from the U.S. Department of
Transportation's Bureau of Transportation Statistics.
Only the data for the year 2018 was used in this study.
It consists of 27 attributes and 7,213,446 data points.
Data Preprocessing
To facilitate the modeling process, only flight data from
the busiest airports was included, since these airports had
the largest number of scheduled arrival flights in the U.S.
Data cleansing was performed on the names of the flight
carrier, origin airport, and destination airport, which were
recorded as abbreviated IATA codes. Attributes with more
than 50% missing values, which did not provide helpful
information for this analysis, were dropped. Because the
main objective was to predict flight delay, unrelated
attributes that recorded the outcome of canceled and
diverted flights were also removed.
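The two filtering steps above can be sketched in plain Python. This is a minimal illustration, not the code used in the project: the column names (`ORIGIN`, `DEST`, `DEP_DELAY`, `CANCELLATION_CODE`), the airport shortlist, and the toy records are all assumptions, and filtering on the destination airport is one possible reading of "busiest airports".

```python
# Toy flight records; None marks a missing value. All values are invented.
flights = [
    {"ORIGIN": "ATL", "DEST": "ORD", "DEP_DELAY": 5.0, "CANCELLATION_CODE": None},
    {"ORIGIN": "ATL", "DEST": "LAX", "DEP_DELAY": None, "CANCELLATION_CODE": None},
    {"ORIGIN": "BOI", "DEST": "GEG", "DEP_DELAY": 0.0, "CANCELLATION_CODE": "B"},
    {"ORIGIN": "DFW", "DEST": "ATL", "DEP_DELAY": 42.0, "CANCELLATION_CODE": None},
]

# Hypothetical shortlist of busy airports (an assumption, not from the slides).
BUSIEST_AIRPORTS = {"ATL", "ORD", "LAX", "DFW"}


def keep_busiest(rows, airports):
    """Keep flights arriving at one of the selected airports."""
    return [row for row in rows if row["DEST"] in airports]


def drop_sparse_columns(rows, max_missing=0.5):
    """Drop every column whose share of missing values exceeds max_missing."""
    kept = [
        col for col in rows[0]
        if sum(row[col] is None for row in rows) / len(rows) <= max_missing
    ]
    return [{col: row[col] for col in kept} for row in rows]


busiest_rows = keep_busiest(flights, BUSIEST_AIRPORTS)
cleaned = drop_sparse_columns(busiest_rows)
print(len(busiest_rows), sorted(cleaned[0]))
```

On a dataset with millions of rows, a dataframe library would normally replace these list comprehensions; the logic stays the same.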
Data Preprocessing
Instances with missing values were removed, as they made up less
than 1% of the data, which was relatively small.
For classification purposes, a binary attribute named "flight delay"
was added to record the status of each flight. The durations between a
flight's take-off and its wheels-off time, and between its landing and
its wheels-on time, were derived, as these give the actual duration of
those activities. The month, day, and day of the week were extracted
from the actual flight date. Before modeling, all categorical
attributes, such as destination airport, day of the week, flight
carrier, and flight delay factors, were converted to numerical
variables via one-hot encoding. One dummy variable is created for
every category of the categorical variable. If the category is present,
the value is one. Otherwise, the value is zero.
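As a rough sketch of these derivations, the snippet below adds the binary label, extracts the date parts, and one-hot encodes one categorical column. The column names (`FL_DATE`, `OP_CARRIER`, `ARR_DELAY`) and the 15-minute delay threshold are assumptions made for illustration; the slides do not state which threshold was used.

```python
from datetime import date

# Assumed cut-off: a flight counts as delayed if it arrives more than
# 15 minutes late. The slides do not give the threshold actually used.
DELAY_THRESHOLD_MIN = 15


def add_derived(row):
    """Add the binary delay label and the date parts to one record."""
    flight_date = date.fromisoformat(row["FL_DATE"])
    return {
        **row,
        "FLIGHT_DELAY": int(row["ARR_DELAY"] > DELAY_THRESHOLD_MIN),
        "MONTH": flight_date.month,
        "DAY": flight_date.day,
        "DAY_OF_WEEK": flight_date.isoweekday(),  # 1 = Monday ... 7 = Sunday
    }


def one_hot(rows, column):
    """Replace `column` with one 0/1 dummy column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded_rows = []
    for row in rows:
        encoded = {key: value for key, value in row.items() if key != column}
        for category in categories:
            encoded[f"{column}_{category}"] = int(row[column] == category)
        encoded_rows.append(encoded)
    return encoded_rows


# Invented example records.
records = [
    {"FL_DATE": "2018-01-15", "OP_CARRIER": "DL", "ARR_DELAY": 32.0},
    {"FL_DATE": "2018-07-04", "OP_CARRIER": "AA", "ARR_DELAY": -3.0},
]
derived = [add_derived(row) for row in records]
encoded = one_hot(derived, "OP_CARRIER")
print(encoded[0])
```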
Feature Selection
The constant variable was removed as it did not provide
helpful information to the model. Attributes highly
correlated with each other were examined, and only the most
predictive one was kept, to avoid multicollinearity in the
model. Planned elapsed time, air time, distance, and
actual elapsed time had pairwise correlations higher than 0.8.
To select which of these attributes to remove, a random
forest algorithm was used to determine their feature
importance. Actual elapsed time was kept, as it had the
greatest importance among these attributes (shown in the
table below).
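The correlation screen can be sketched as follows, assuming Pearson correlation and the 0.8 threshold mentioned above. The column values are invented, and the random forest ranking used to choose among the correlated attributes is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.8):
    """Return the column pairs whose absolute correlation exceeds threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Invented values: distance and air time move together, taxi-out does not.
columns = {
    "DISTANCE": [200, 500, 1000, 1500, 2500],
    "AIR_TIME": [40, 75, 140, 200, 310],
    "TAXI_OUT": [12, 30, 9, 22, 15],
}
print(correlated_pairs(columns))
```

Each reported pair is a candidate for pruning; which member of the pair to keep is then decided by a separate criterion such as feature importance.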
The figure below shows the features reported by the random forest classifier along
with their importance scores, arranged in descending order. It is interesting to
note that scheduled arrival day, month, and destination airport did not
contribute much to a flight's arrival delay. Attributes with low importance
scores were eliminated, as keeping all of them did not yield better results when
training models. Thus, only the top nine attributes were used to train the
remaining models.
Modelling and Performance Evaluation
• Delayed flights are the minority class in this study. The data
distribution is skewed, so a model trained on it puts little emphasis on
predicting this class. Resampling helped considerably to put more emphasis
on the minority class.
• Using SMOTE with the k-nearest neighbor of k = 5, about four synthetic
observations were created with a new ratio of 1:0.88 for the number of
instances of on-time flight to delayed flight. With oversampling techniques,
the risk of overfitting increases when many synthetic examples are
created.
• With undersampling techniques, potentially vital information may be lost as
existing observations are eliminated from the dataset.
• The two resampling methods employed were SMOTE and random
undersampling. After applying SMOTE, a clear increase in recall
was observed on the test data.
• A similar result was obtained with random undersampling,
whereby the majority class was reduced to a similar number of
instances as the minority class.
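To make the two resampling ideas concrete, here is a simplified, standard-library sketch. The `smote_like` function follows the core SMOTE idea (interpolating between a minority point and one of its k nearest minority neighbours) but is not a full implementation, and the points, counts, and function names are illustrative assumptions.

```python
import random
from math import dist


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority points and neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class, without replacement."""
    return random.Random(seed).sample(majority, n_keep)


# Invented two-feature points: 6 delayed (minority), 20 on-time (majority).
delayed = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.3, 1.0)]
on_time = [(float(i), float(i % 4)) for i in range(20)]

extra_delayed = smote_like(delayed, n_new=4, k=5)
fewer_on_time = random_undersample(on_time, n_keep=len(delayed))
print(len(delayed) + len(extra_delayed), len(fewer_on_time))
```

Resampling should be applied to the training split only, so that the test data keeps its original class distribution.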
Data Analysis
• Various attributes are analyzed to determine which
are relevant to the prediction of delays and
which can be discarded as irrelevant.
• The distribution of delays across different variables
is explored. This step shows how important each
variable is in shaping the patterns of flight
departures and delays.
Data Analysis
• There are 8 attributes, each of which is studied
separately:
1. Day of Month
2. Day Of Week
3. Unique Carrier
4. Origin
5. Departure Time
6. Distance Group
7. Arrival Time
8. Destination
Four of these attributes are used in all types of prediction.
Test Sets Preparation
• In all, 3 test cases are created based on the ratio of
number of delayed flights to number of on-time
flights:
1. Ratio is 1:1
2. Ratio is 1:3
3. Ratio is 3:1
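One way to build test sets with a fixed delayed:on-time ratio is to sample each class separately. The sketch below assumes flights are identified by integer ids and that `unit` is the number of flights represented by one part of the ratio; both are illustrative choices, not details from the slides.

```python
import random


def build_test_set(delayed, on_time, ratio, unit, seed=0):
    """Sample a test set whose delayed:on-time ratio equals `ratio`.

    `unit` is the number of flights that one part of the ratio stands for.
    """
    rng = random.Random(seed)
    n_delayed, n_on_time = ratio[0] * unit, ratio[1] * unit
    return rng.sample(delayed, n_delayed), rng.sample(on_time, n_on_time)


# Invented pools of flight ids.
delayed_ids = list(range(0, 100))
on_time_ids = list(range(100, 400))

test_sets = {
    ratio: build_test_set(delayed_ids, on_time_ids, ratio, unit=20)
    for ratio in [(1, 1), (1, 3), (3, 1)]
}
for ratio, (d, o) in test_sets.items():
    print(ratio, len(d), len(o))
```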
Types Of Prediction
• After exploring these relationships, we now build 7
prediction models on the basis of 4 parameters. The models
are:
• Day
• Date
• Time
• Day and date
• Date and time
• Day and Time
• Day, Date and Time
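The seven models listed above correspond to the non-empty combinations of the inputs Day, Date, and Time. A short sketch that enumerates them (the variable names are illustrative):

```python
from itertools import combinations

parameters = ["Day", "Date", "Time"]

# Every non-empty subset of the parameters defines one model's inputs.
feature_sets = [
    subset
    for size in range(1, len(parameters) + 1)
    for subset in combinations(parameters, size)
]

for subset in feature_sets:
    print(" and ".join(subset))
```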
Conclusion
In conclusion, all three objectives were achieved in this project.
Valuable attributes for modeling were identified, such as
Departure Delay, Wheels On/Off Elapse, Taxi In/Out, Distance,
and many more. These had high coefficient values compared to
the other attributes in the dataset from the Bureau of Transportation,
so they were kept while other attributes were dropped. Four base
algorithms were initially modeled: Naive Bayes (N.B.), Logistic
Regression (L.R.), Decision Tree (D.T.), and Random Forest (R.F.).
Then, further approaches (bagging, boosting, and over/undersampling)
were applied to address the imbalance between the two classes. Based
on the F1 scores of all models, AdaBoost with a Decision Tree
performed best: it accounted for the imbalanced nature of the data
and obtained the highest score of all the algorithms.
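For reference, the F1 score is the harmonic mean of precision and recall on the positive (delayed) class. The sketch below, with invented labels, shows why it is more informative than accuracy on imbalanced data: a model that never predicts a delay reaches 80% accuracy here but an F1 score of zero.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented labels: 1 = delayed (minority class), 0 = on time.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
always_on_time = [0] * len(y_true)
some_delays = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]

print(accuracy(y_true, always_on_time), f1_score(y_true, always_on_time))
print(accuracy(y_true, some_delays), f1_score(y_true, some_delays))
```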
Future Work
The work can be extended by training the model with
a neural network. To handle imbalanced data, there are
further oversampling techniques, such as Adaptive
Synthetic sampling (ADASYN), which prevents
the overlapping of synthetic observations, and
undersampling techniques that clean the data
using Tomek links (T.L.) and Condensed
Nearest Neighbour (CNN). Beyond resampling,
we can also apply cost-sensitive learning,
which accounts for misclassification costs by
penalizing wrongly classified results. A hybrid
method such as SMOTEBoost could also be used to
handle the imbalanced data.
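One simple form of cost-sensitive learning adjusts the decision threshold instead of the data. If a missed delay (false negative) costs `cost_fn` and a false alarm (false positive) costs `cost_fp`, predicting "delayed" minimizes expected cost whenever the predicted delay probability exceeds `cost_fp / (cost_fp + cost_fn)`. The cost values below are invented for illustration.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'delayed' has lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def predict_delay(probability, cost_fp, cost_fn):
    """Return 1 (delayed) if that choice minimizes expected cost, else 0."""
    return int(probability > cost_sensitive_threshold(cost_fp, cost_fn))


# Assumed costs: a missed delay is four times as costly as a false alarm.
print(cost_sensitive_threshold(cost_fp=1, cost_fn=4))
print(predict_delay(0.3, cost_fp=1, cost_fn=4))
print(predict_delay(0.3, cost_fp=1, cost_fn=1))
```

With equal costs the threshold is the usual 0.5; raising the cost of missed delays lowers it, so more flights are flagged as delayed.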
Thanks for listening
