Ever happened to you?
Delayed flight?
Introduction
- In the United States, the Federal Aviation Administration estimates that flight delays cost airlines $22 billion yearly.
- Airlines are forced to pay federal authorities when they hold planes on the tarmac for more than three hours for domestic flights or more than four hours for international flights.
- Flight delays are an inconvenience to passengers as well. A delayed flight can end up making them late for personal scheduled events.
- But what if a flight delay can be predicted?
Objective
- Identify and analyze the factors that cause flight delay
- Predict flights that will get delayed
Dataset
- Title: Airlines Delay
- Source: www.kaggle.com/datasets
- Description: The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in this data.
- Number of rows: 1936758
  For simplicity's sake, a 0.1% sample is used for modeling in this project.
- Number of variables: 30
Variable Description
Name                Description
Year                2008
Month               1-12
DayofMonth          1-31
DayOfWeek           1 (Monday) - 7 (Sunday)
DepTime             Actual departure time (local, hhmm)
CRSDepTime          Scheduled departure time (local, hhmm)
ArrTime             Actual arrival time (local, hhmm)
CRSArrTime          Scheduled arrival time (local, hhmm)
UniqueCarrier       Unique carrier code
FlightNum           Flight number
TailNum             Plane tail number
ActualElapsedTime   In minutes
CRSElapsedTime      In minutes
AirTime             In minutes
ArrDelay            Arrival delay, in minutes
DepDelay            Departure delay, in minutes
Origin              Origin IATA airport code
Dest                Destination IATA airport code
Distance            In miles
TaxiIn              Taxi-in time, in minutes
TaxiOut             Taxi-out time, in minutes
Cancelled           Was the flight cancelled?
CancellationCode    Reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted            1 = yes, 0 = no
CarrierDelay        In minutes
WeatherDelay        In minutes
NASDelay            In minutes
SecurityDelay       In minutes
LateAircraftDelay   In minutes
Status              Delayed, Not Delayed
Approach
Step               Tools/Techniques used
Data preparation   R
Data analysis      R, Rattle, Tableau
Data redundancy    R
Model building     Random Forest, Regression using R
Model evaluation   R
Data preprocessing
- Missing values
  - Missing values in the delay-type columns are replaced with ‘0’, indicating that the cause does not apply.
  - If ArrTime/DepTime is missing, it is replaced with its CRS (scheduled) equivalent.
- Target leakage
  Some columns are directly related to the flight status column (Status):
  - CarrierDelay
  - WeatherDelay
  - NASDelay
  - SecurityDelay
  - LateAircraftDelay
  - ArrDelay
  - DepDelay
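The two preprocessing rules above can be sketched in a few lines. This is only an illustration in plain Python, not the R code used in the project: the row layout (one dict per flight, with None marking a missing value) and the example values are assumptions, while the column names come from the variable description.

```python
# Column names follow the variable description; the dict-per-flight layout
# and the use of None for "missing" are assumptions made for this sketch.
DELAY_CAUSE_COLUMNS = [
    "CarrierDelay", "WeatherDelay", "NASDelay",
    "SecurityDelay", "LateAircraftDelay",
]
LEAKAGE_COLUMNS = DELAY_CAUSE_COLUMNS + ["ArrDelay", "DepDelay"]
SCHEDULED_EQUIVALENT = {"ArrTime": "CRSArrTime", "DepTime": "CRSDepTime"}


def treat_missing_values(row):
    """Return a copy of the row with the two missing-value rules applied."""
    cleaned = dict(row)
    # Rule 1: a missing delay cause becomes 0 (the cause does not apply).
    for column in DELAY_CAUSE_COLUMNS:
        if cleaned.get(column) is None:
            cleaned[column] = 0
    # Rule 2: a missing actual time falls back to the scheduled (CRS) time.
    for actual, scheduled in SCHEDULED_EQUIVALENT.items():
        if cleaned.get(actual) is None:
            cleaned[actual] = cleaned.get(scheduled)
    return cleaned


def drop_leakage_columns(row):
    """Return a copy of the row without the columns tied to the target."""
    return {name: value for name, value in row.items()
            if name not in LEAKAGE_COLUMNS}


# Made-up example row, for illustration only.
example = {
    "CRSDepTime": 1300, "DepTime": 1312, "CRSArrTime": 1510, "ArrTime": None,
    "Distance": 810, "CarrierDelay": 12, "WeatherDelay": None,
    "NASDelay": None, "SecurityDelay": None, "LateAircraftDelay": None,
    "ArrDelay": None, "DepDelay": 12, "Status": "Delayed",
}
treated = treat_missing_values(example)
model_row = drop_leakage_columns(treated)
```

Keeping the two steps as separate functions makes it easy to run the model with or without each treatment, as the four cases later in the deck do.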
Data preprocessing (contd.)
- Other columns eliminated/created
  - DepTime, ArrTime: Eliminated since each of these is an absolute time value.
  - FlightNum, TailNum: Eliminated since these are unique to each journey.
  - CRSArrTime and CRSDepTime have been replaced with ArrivalBucket and DepartureBucket according to the table below:

CRSArrTime/CRSDepTime   Bucket
00:00 to 06:00          Morning
06:00 to 12:00          Afternoon
12:00 to 18:00          Evening
18:00 to 00:00          Night
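The bucketing table can be written as a small function. This is a minimal sketch rather than the project's R code; the function name is hypothetical, and treating each interval as half-open (06:00 itself falls in the 06:00 to 12:00 bucket) is an assumption, since the table does not say which bucket owns a boundary.

```python
def time_bucket(hhmm):
    """Map a scheduled time in hhmm form (e.g. 1435) to its bucket label.

    Assumption: intervals are half-open, so 0600 belongs to the
    06:00 to 12:00 bucket, and a value of 2400 is treated as 00:00.
    """
    hour = (hhmm // 100) % 24
    if hour < 6:
        return "Morning"
    if hour < 12:
        return "Afternoon"
    if hour < 18:
        return "Evening"
    return "Night"


# ArrivalBucket and DepartureBucket would be derived from the CRS columns.
arrival_bucket = time_bucket(1435)
departure_bucket = time_bucket(530)
```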
Data exploration
[Chart: how many flights are delayed]
The split between delayed and non-delayed flights indicates a possible case of class imbalance.
[Chart: how many flights are delayed, by time of arrival in the day]
More delays are observed during the evening and night, mainly because more flights are scheduled to arrive at these times (air traffic).
Data exploration (contd.)
Long-distance flights seem to be less affected by weather delay.
Data exploration (contd.)
This bar chart shows how many flights are delayed per carrier.
Data exploration (contd.)
From the above charts, it is observed that flight delay may depend on the month and the day of the week.
Modeling methodology
1. Split the data into training and testing data (75% training, 25% testing).
2. Check for class imbalance and treat it using the SMOTE technique.
3. Run a basic random forest model and analyze the confusion matrix and AUC.
4. Treat the missing values and re-run the model. Analyze the confusion matrix and AUC.
5. In case of a high AUC, check for target leakage variables and eliminate them.
6. Re-run the model with the new set of variables. Analyze the confusion matrix and AUC.
7. Calculate various parameters of the random forest.
8. Try other modelling methods to see if improved accuracy is achieved.
9. Finalize the model to be used for the data.

Random Forest Model
- Number of trees: 50
- Number of variables: 4
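The first two steps of the methodology, the 75/25 split and the class-imbalance check, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the project's R code: the function names, the fixed seed and the toy rows are all made up, and the SMOTE resampling and the random forest itself are not shown.

```python
import random
from collections import Counter


def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle the rows and split them into training and testing lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def class_shares(rows, target="Status"):
    """Return the share of each class label, to spot class imbalance."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy data, made up for illustration: 90 delayed and 10 not delayed flights.
rows = ([{"Status": "Delayed"} for _ in range(90)]
        + [{"Status": "Not Delayed"} for _ in range(10)])
train, test = train_test_split(rows)
shares = class_shares(train)
```

If one class dominates the training shares, a resampling technique such as SMOTE would be applied to the training data only, so that the test set still reflects the original class mix.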
Before beginning with the model
- Why did we choose a 0.1% sample rather than restricting the data to a few carriers?
  - Initially we considered building the model with 5 unique carriers. However, we realized that this data was not representative of the entire dataset and that the results were biased towards one carrier. It would also not help us recognize whether this was a clear case of class imbalance.
- Should ‘DepDelay’ be considered while building the model?
  - DepDelay is the column that records how long a flight is delayed at its departure airport.
  - Although this is a cause of delay for flights, there are various instances where a flight has arrived on time at the arrival airport despite a delay at the departure airport.
  - So should ‘DepDelay’ be considered while building the model? If it is directly related to the target variable, does it count as target leakage?
  - To investigate this issue, we considered 4 cases for analyzing the model.
Case 1: Missing values present, ‘DepDelay’ present
Confusion Matrix (rows: prediction, columns: reference)
Prediction     Delayed   Not Delayed
Delayed        3281      1004
Not Delayed    92        444
[Figure: ROC curve]
Performance parameters
$$\text{Accuracy} = \frac{3281 + 444}{3281 + 1004 + 92 + 444} = 0.7727$$
$$\text{Recall} = \frac{3281}{3281 + 92} = 0.9727$$
$$\text{Precision} = \frac{3281}{3281 + 1004} = 0.7657$$
AUC = 0.8052
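The performance parameters on these slides all follow from the four cells of the confusion matrix. A minimal sketch of that arithmetic, using the Case 1 counts and treating "Delayed" as the positive class (the function name and the tp/fp/fn/tn labels are introduced here for illustration):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision from confusion-matrix counts.

    tp: predicted Delayed, reference Delayed
    fp: predicted Delayed, reference Not Delayed
    fn: predicted Not Delayed, reference Delayed
    tn: predicted Not Delayed, reference Not Delayed
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Counts taken from the Case 1 confusion matrix.
case1 = confusion_metrics(tp=3281, fp=1004, fn=92, tn=444)
for name, value in case1.items():
    print(f"{name}: {value:.4f}")
```

The same function can be applied to the other cases by substituting their counts. AUC cannot be recovered this way, since it depends on the predicted probabilities rather than on a single confusion matrix.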
Case 2: Missing values present, ‘DepDelay’ eliminated
Confusion Matrix (rows: prediction, columns: reference)
Prediction     Delayed   Not Delayed
Delayed        2991      1294
Not Delayed    276       260
[Figure: ROC curve]
Performance parameters
$$\text{Accuracy} = \frac{2991 + 260}{2991 + 1294 + 276 + 260} = 0.6743$$
$$\text{Recall} = \frac{2991}{2991 + 276} = 0.9155$$
$$\text{Precision} = \frac{2991}{2991 + 1294} = 0.698$$
AUC = 0.592
Case 3: Missing values treated, ‘DepDelay’ present
Confusion Matrix (rows: prediction, columns: reference)
Prediction     Delayed   Not Delayed
Delayed        3466      819
Not Delayed    89        447
[Figure: ROC curve]
Performance parameters
$$\text{Accuracy} = \frac{3466 + 447}{3466 + 819 + 89 + 447} = 0.8117$$
$$\text{Recall} = \frac{3466}{3466 + 89} = 0.975$$
$$\text{Precision} = \frac{3466}{3466 + 819} = 0.8089$$
AUC = 0.82
Case 4: Missing values treated, ‘DepDelay’ eliminated
Confusion Matrix (rows: prediction, columns: reference)
Prediction     Delayed   Not Delayed
Delayed        2966      1319
Not Delayed    214       322
[Figure: ROC curve]
Performance parameters
$$\text{Accuracy} = \frac{2966 + 322}{2966 + 1319 + 214 + 322} = 0.6820$$
$$\text{Recall} = \frac{2966}{2966 + 214} = 0.9327$$
$$\text{Precision} = \frac{2966}{2966 + 1319} = 0.6922$$
AUC = 0.646
Conclusion
- Case 4 is the best option for the model we accept.
  - In this option, missing values are treated and the variables are chosen so as to avoid target leakage.
- What are the possible reasons for not getting a high AUC?
  - Additional delay causes not captured in the data
  - Air traffic
Additional analysis
To understand the importance of the variables, we ran an initial logistic regression and examined their significance.
As per the table below, the distance of the flight and the time of its departure from the airport play a significant role in estimating the delay.

                         Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)              -2.50355    0.244443     -10.242   < 2e-16    ***
Month                    0.012227    0.015952     0.766     0.4434
DayofMonth               -0.01018    0.006322     -1.611    0.1072
DayOfWeek                -0.00827    0.028181     -0.293    0.7692
Distance                 0.000644    8.7E-05      7.407     1.30E-13   ***
Cancelled                16.52819    882.7434     0.019     0.9851
Diverted                 16.67214    262.7191     0.063     0.9494
ArrivalBucketEvening     0.315216    0.203532     1.549     0.1214
ArrivalBucketMorning     0.197135    0.450726     0.437     0.6618
ArrivalBucketNight       0.36954     0.252511     1.463     0.1433
DepartureBucketEvening   -0.35955    0.1678       -2.143    0.0321     *
DepartureBucketMorning   -0.3445     0.629283     -0.547    0.5841
DepartureBucketNight     -0.60119    0.232124     -2.59     0.0096     **
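The regression table can be read with two small calculations: the z value is the estimate divided by its standard error, and exponentiating a logistic-regression coefficient gives an odds ratio. The sketch below applies both to the three significant rows. The helper names are hypothetical, and the slide does not state which outcome the regression models, so the odds ratios are described only as "odds of the modeled outcome" (presumably Delayed).

```python
import math

# (estimate, standard error) pairs copied from the regression table.
coefficients = {
    "Distance": (0.000644, 8.7e-05),
    "DepartureBucketEvening": (-0.35955, 0.1678),
    "DepartureBucketNight": (-0.60119, 0.232124),
}


def z_value(estimate, std_error):
    """Wald z statistic: the estimate divided by its standard error."""
    return estimate / std_error


def odds_ratio(estimate, units=1.0):
    """Multiplicative change in the odds for a change of `units`."""
    return math.exp(estimate * units)


for name, (estimate, std_error) in coefficients.items():
    print(f"{name}: z = {z_value(estimate, std_error):.2f}, "
          f"odds ratio = {odds_ratio(estimate):.4f}")

# Distance is in miles, so a per-100-mile odds ratio is easier to read.
distance_per_100_miles = odds_ratio(coefficients["Distance"][0], units=100)
```

The bucket coefficients are relative to the bucket left out of the table, so an odds ratio below 1 for DepartureBucketNight means lower odds of the modeled outcome than in that reference bucket, all else equal.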
Thank you

Airline delay prediction

  • 1. Ever happened to you ? Delayed flight ? 1
  • 2. Introduction  In the United States, the Federal Aviation Administration estimates that flight delays cost airlines $22 billion yearly.  Airlines are forced to pay federal authorities when they hold planes on the tarmac for more than three hours for domestic flights or more than four hours for international flights.  Flight delays are an inconvenience to passengers as well. A delayed flight can end up making them late for personal scheduled events.  But what if a flight delay can be predicted? 2
  • 3. Objective  Identify and analyze the factors that cause flight delay  Predict flights that will get delayed 3
  • 4. Dataset  Title : Airlines Delay  Source : www.kaggle.com/datasets  Description : The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in this data.  Number of rows : 1936758 For simplicity sake , in this project a sample of 0.1%is used for modeling  Number of variables : 30 4
  • 5. Variable Description Name Description Year 2008 Month 1-12 DayofMonth 1-31 DayOfWeek 1 (Monday) - 7 (Sunday) DepTime Actual departure time (local, hhmm) CRSDepTime Scheduled departure time (local, hhmm) ArrTime Actual arrival time (local, hhmm) CRSArrTime Scheduled departure time (local, hhmm) UniqueCarrier Unique carrier code FlightNum Flight number TailNum Plane tail number ActualElapsedTime In minutes CRSElapsedTime In minutes AirTime In minutes ArrDelay Arrival delay, in minutes Name Description DepDelay Departure delay, in minutes Origin Origin IATA airport code Dest Destination IATA airport code Distance In miles TaxiIn Taxi in time, in minutes TaxiOut Taxi out time in minutes Cancelled Was the flight cancelled? CancellationCode Reason for cancellation (A = carrier, B = weather, C = NAS, D = security) Diverted 1 = yes, 0 = no CarrierDelay In minutes WeatherDelay In minutes NASDelay In minutes SecurityDelay In minutes LateAircraftDelay In minutes Status Delayed, Non Delayed 5
  • 6. Approach
      Step                 Tools/Techniques used
      Data preparation     R
      Data analysis        R, Rattle, Tableau
      Data redundancy      R
      Model building       Random Forest, Regression using R
      Model evaluation     R
  • 7. Data preprocessing
      - Missing values
          - Missing values in the delay-type columns are replaced with 0, indicating that the cause does not apply.
          - If ArrTime or DepTime is missing, it is replaced with its scheduled (CRS) equivalent.
      - Target leakage
          - Some columns are directly related to the flight Status column: CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay, ArrDelay, and DepDelay.
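The two missing-value rules and the leakage list above can be sketched in a few lines. The slides name R as the tool but show no code, so this Python version is only an illustration: the dict-per-flight record layout and the function names `fill_missing` and `drop_leakage` are assumptions, while the column names come from the variable table.

```python
# Column names follow the variable table; everything else here is hypothetical.
DELAY_CAUSE_COLUMNS = [
    "CarrierDelay", "WeatherDelay", "NASDelay",
    "SecurityDelay", "LateAircraftDelay",
]
LEAKAGE_COLUMNS = DELAY_CAUSE_COLUMNS + ["ArrDelay", "DepDelay"]
SCHEDULED_EQUIVALENT = {"ArrTime": "CRSArrTime", "DepTime": "CRSDepTime"}


def fill_missing(flight):
    """Return a copy of one flight record with missing values treated."""
    cleaned = dict(flight)
    # Rule 1: a missing delay cause becomes 0 (the cause does not apply).
    for column in DELAY_CAUSE_COLUMNS:
        if cleaned.get(column) is None:
            cleaned[column] = 0
    # Rule 2: a missing actual time falls back to the scheduled (CRS) time.
    for actual, scheduled in SCHEDULED_EQUIVALENT.items():
        if cleaned.get(actual) is None:
            cleaned[actual] = cleaned.get(scheduled)
    return cleaned


def drop_leakage(flight):
    """Return a copy of the record without the columns listed as target leakage."""
    return {k: v for k, v in flight.items() if k not in LEAKAGE_COLUMNS}


# Made-up record, for illustration only.
example = {
    "DepTime": None, "CRSDepTime": 1430, "ArrTime": 1655, "CRSArrTime": 1640,
    "WeatherDelay": None, "CarrierDelay": 15, "ArrDelay": 15, "DepDelay": 0,
    "Distance": 810,
}
treated = fill_missing(example)
features = drop_leakage(treated)
```

Whether DepDelay belongs in the leakage list is the question the slides examine in Cases 1 to 4; the sketch simply drops every column on the list.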
  • 8. Data preprocessing (contd.)
      - Other columns eliminated or created
          - DepTime, ArrTime: eliminated, since each of these is an absolute time value.
          - FlightNum, TailNum: eliminated, since these are unique to each journey.
          - CRSArrTime and CRSDepTime are replaced with ArrivalBucket and DepartureBucket according to the table below:
      CRSArrTime/CRSDepTime    Bucket
      00:00 to 06:00           Morning
      06:00 to 12:00           Afternoon
      12:00 to 18:00           Evening
      18:00 to 00:00           Night
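The bucket table can be read as a simple mapping from an hhmm value to a label. The sketch below is one possible reading, not the authors' code: the function name `time_bucket` is hypothetical, each interval is assumed to include its start and exclude its end, and the last label is spelled "Night" as in the regression output on slide 20.

```python
def time_bucket(hhmm):
    """Map a scheduled time in hhmm form (e.g. 1435) to a bucket label.

    Assumes half-open intervals [start, end) and treats 2400 as 00:00.
    """
    hour = (hhmm // 100) % 24
    if hour < 6:
        return "Morning"
    if hour < 12:
        return "Afternoon"
    if hour < 18:
        return "Evening"
    return "Night"


# A flight scheduled to depart at 14:35 and arrive at 18:05.
departure_bucket = time_bucket(1435)  # "Evening"
arrival_bucket = time_bucket(1805)    # "Night"
```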
  • 9. Data exploration
      - [Chart: how many flights are delayed] The split between delayed and non-delayed flights indicates a possible case of class imbalance.
      - [Chart: how many flights are delayed, by time of arrival in the day] More delays are observed during the evening and night, mainly because more flights are scheduled to arrive at these times (air traffic).
  • 10. Data exploration (contd.)
      - Long-distance flights seem to be less affected by weather delays.
  • 11. Data exploration (contd.)
      - [Bar chart] This chart shows how many flights are delayed per carrier.
  • 12. Data exploration (contd.)
      - The charts on this slide suggest that flight delay may depend on the month and the day of the week.
  • 13. Modeling methodology
      - Split the data into training and testing sets (75% training, 25% testing).
      - Check for class imbalance and treat it using the SMOTE technique.
      - Run a basic random forest model and analyze the confusion matrix and AUC.
      - Treat the missing values and re-run the model. Analyze the confusion matrix and AUC.
      - If the AUC is high, check for target leakage variables and eliminate them. Re-run the model with the new set of variables and analyze the confusion matrix and AUC.
      - Calculate various parameters of the random forest.
      - Try other modeling methods to see whether accuracy improves.
      - Finalize the model to be used for the data.
      - Random forest model: number of trees: 50; number of variables: 4
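The first two steps of the methodology (the 75/25 split and the class-imbalance check with SMOTE) can be illustrated as follows. This is a minimal sketch rather than the authors' R workflow. Stratifying the split by class is an assumption, since the slides only give the 75/25 ratio. `smote_like` is a simplified, hypothetical stand-in for SMOTE's core idea (interpolating between a minority point and one of its nearest minority neighbours) that works on numeric features only and needs at least two minority points.

```python
import random
from collections import Counter


def stratified_split(labels, train_fraction=0.75, seed=42):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def class_shares(labels):
    """Return the share of each class; a lopsided result signals imbalance."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def smote_like(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority points by interpolating towards a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority points by squared distance to the base point.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Made-up labels: 80 delayed flights and 20 not delayed.
labels = ["Delayed"] * 80 + ["Not Delayed"] * 20
train_idx, test_idx = stratified_split(labels)
train_shares = class_shares([labels[i] for i in train_idx])

# Made-up numeric features (distance, month) for the minority class.
minority_points = [(300.0, 1.0), (450.0, 2.0), (520.0, 5.0), (610.0, 3.0)]
new_points = smote_like(minority_points, n_synthetic=3)
```

In practice, oversampling is usually applied to the training set only, so that the test set keeps the original class proportions. The slides do not say which order was used.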
  • 14. Before beginning with the model
      - Why did we choose a 0.1% sample rather than restrict the data to a few carriers?
          - Initially we considered building the model with 5 unique carriers. However, we realized that this subset was not representative of the entire data and that the results were biased towards one carrier. It also would not help us recognize whether this was a clear case of class imbalance.
      - Should ‘DepDelay’ be considered while building the model?
          - DepDelay is the column that records how long a flight is delayed at its departure airport.
          - Although a departure delay is a cause of delay, there are various instances where a flight delayed at the departure airport has still arrived on time at the arrival airport.
          - So should ‘DepDelay’ be considered while building the model? If it is directly related to the target variable, does it count as target leakage?
          - To investigate this issue, we analyzed the model under 4 cases.
  • 15. Case 1: Missing values present, ‘DepDelay’ present
      Confusion matrix
                       Reference
      Prediction       Delayed    Not Delayed
      Delayed          3281       1004
      Not Delayed      92         444
      Performance parameters
      - $\text{Accuracy} = \frac{3281 + 444}{3281 + 1004 + 92 + 444} = 0.7727$
      - $\text{Recall} = \frac{3281}{3281 + 92} = 0.9727$
      - $\text{Precision} = \frac{3281}{3281 + 1004} = 0.7657$
      - $\text{AUC} = 0.8052$
      [Figure: ROC curve]
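The performance parameters follow directly from the four cells of the confusion matrix, with "Delayed" as the positive class. The short sketch below recomputes them for Case 1. The function name `confusion_metrics` and its argument names are illustrative and not taken from the slides. AUC is not included because it cannot be derived from a single confusion matrix.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision with 'Delayed' as the positive class.

    tp: predicted Delayed, actually Delayed
    fp: predicted Delayed, actually Not Delayed
    fn: predicted Not Delayed, actually Delayed
    tn: predicted Not Delayed, actually Not Delayed
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Case 1 counts, read from the confusion matrix on slide 15.
case_1 = confusion_metrics(tp=3281, fp=1004, fn=92, tn=444)
for name, value in case_1.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.773
# recall: 0.973
# precision: 0.766
```

The same function applies to Cases 2 to 4 by substituting the counts from their confusion matrices.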
  • 16. Case 2: Missing values present, ‘DepDelay’ eliminated
      Confusion matrix
                       Reference
      Prediction       Delayed    Not Delayed
      Delayed          2991       1294
      Not Delayed      276        260
      Performance parameters
      - $\text{Accuracy} = \frac{2991 + 260}{2991 + 1294 + 276 + 260} = 0.6743$
      - $\text{Recall} = \frac{2991}{2991 + 276} = 0.9155$
      - $\text{Precision} = \frac{2991}{2991 + 1294} = 0.6980$
      - $\text{AUC} = 0.592$
      [Figure: ROC curve]
  • 17. Case 3: Missing values treated, ‘DepDelay’ present
      Confusion matrix
                       Reference
      Prediction       Delayed    Not Delayed
      Delayed          3466       819
      Not Delayed      89         447
      Performance parameters
      - $\text{Accuracy} = \frac{3466 + 447}{3466 + 819 + 89 + 447} = 0.8117$
      - $\text{Recall} = \frac{3466}{3466 + 89} = 0.9750$
      - $\text{Precision} = \frac{3466}{3466 + 819} = 0.8089$
      - $\text{AUC} = 0.82$
      [Figure: ROC curve]
  • 18. Case 4: Missing values treated, ‘DepDelay’ eliminated
      Confusion matrix
                       Reference
      Prediction       Delayed    Not Delayed
      Delayed          2966       1319
      Not Delayed      214        322
      Performance parameters
      - $\text{Accuracy} = \frac{2966 + 322}{2966 + 1319 + 214 + 322} = 0.6820$
      - $\text{Recall} = \frac{2966}{2966 + 214} = 0.9327$
      - $\text{Precision} = \frac{2966}{2966 + 1319} = 0.6922$
      - $\text{AUC} = 0.646$
      [Figure: ROC curve]
  • 19. Conclusion
      - Case 4 is our best option for an acceptable model.
      - In this case, missing values are treated and the variables are chosen so as to avoid target leakage.
      - What are possible reasons for not getting a high AUC?
          - Additional delay causes not captured in the data
          - Air traffic
  • 20. Additional analysis
      To understand the importance of the various variables, we ran an initial logistic regression to see their significance. As the table below shows, the distance of the flight and the time of its departure from the airport play a significant role in estimating the delay.
                                Estimate    Std. Error   z value    Pr(>|z|)
      (Intercept)               -2.50355    0.244443     -10.242    < 2e-16     ***
      Month                     0.012227    0.015952     0.766      0.4434
      DayofMonth                -0.01018    0.006322     -1.611     0.1072
      DayOfWeek                 -0.00827    0.028181     -0.293     0.7692
      Distance                  0.000644    8.7E-05      7.407      1.30E-13    ***
      Cancelled                 16.52819    882.7434     0.019      0.9851
      Diverted                  16.67214    262.7191     0.063      0.9494
      ArrivalBucketEvening      0.315216    0.203532     1.549      0.1214
      ArrivalBucketMorning      0.197135    0.450726     0.437      0.6618
      ArrivalBucketNight        0.36954     0.252511     1.463      0.1433
      DepartureBucketEvening    -0.35955    0.1678       -2.143     0.0321      *
      DepartureBucketMorning    -0.3445     0.629283     -0.547     0.5841
      DepartureBucketNight      -0.60119    0.232124     -2.59      0.0096      **
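To make the regression table easier to read, the sketch below shows how two of its columns relate and how a coefficient translates into odds. It uses only the Distance row from the table. The slides do not state which Status level the model treats as the outcome, so the odds ratio describes the modeled outcome, whichever level that is. The variable names are illustrative.

```python
import math

# Values copied from the Distance row of the regression table.
DISTANCE_ESTIMATE = 0.000644   # change in log-odds per additional mile
DISTANCE_STD_ERROR = 8.7e-05

# The z value is the estimate divided by its standard error. Using the
# rounded figures printed in the table gives about 7.40, close to the
# reported 7.407.
z_value = DISTANCE_ESTIMATE / DISTANCE_STD_ERROR

# Exponentiating a coefficient turns a log-odds change into an odds ratio.
# For 1000 additional miles, with the other variables held fixed, the odds
# of the modeled outcome are multiplied by roughly 1.9.
odds_ratio_per_1000_miles = math.exp(DISTANCE_ESTIMATE * 1000)

print(f"z value: {z_value:.2f}")
print(f"odds ratio per 1000 miles: {odds_ratio_per_1000_miles:.2f}")
```

The per-mile coefficient looks tiny only because of its unit. Over typical flight distances its effect is large, which is consistent with the small p-value in the table.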