SlideShare a Scribd company logo
Predicting the Likelihood of
Flight Cancellations
Aashish Jain, PhD
Data Science Intensive Capstone Project, May 29th 2017 Cohort
Thanks to Springboard mentors
Hassan Kingravi, Pindrop Labs Srdjan Santic, RCC, University of Belgrade
0
The Problem
Total flights in two years (2015
and 2016) in top 20 airports:
1.14% (≅32,600 flights)
2.85+ millions
Cancellation rate:
Even though 1.14% is a small
number, one rare event causes
lot of troubles to passengers
What factors affect flight cancellation rate?
Can we predict the likelihood of a flight getting cancelled? 1
Who might care?
Mobile Apps
…and many more…
Travel Agencies Airlines
2
What factors might affect flight cancellations?
• Flight schedules:
departure and arrival
dates and times
• Airline carriers • Airport locations:
both origin and
destination
• Flight length:
distance, flight
times
• Weather conditions:
both origin and
destination
• Temperature,
humidity, wind
speed, pressure,
etc..: both origin
and destination
DataSource
FlightDataWeatherData
Bureau of Transportation
Statistics
Wundergound.com API
3
Data Information
Data acquired for the period: January 2015 – December 2016
Flight and weather data for: Top 20 airports in the US
Number of records: 2,857,139
Number of fields: 90
Flight Data Specifics
• Source: Bureau of Transportation Statistics
• Data can be downloaded for one month at a time
• File format : csv
• Concatenate files from all months into one
• Each record: a unique flight
Weather Data Specifics
• Source: Wunderground.com API
• File format : XML (hourly data available)
• One API call: One day data for a chosen airport
• Concatenate files from days into one, for an airport
• Each record: weather details for an airport at a given
hour
Merge them together using following keys:
• Flight date at origin and destination
• Origin and destination airport names
• Scheduled departure and arrival times
Data Acquisition and Merging
https://github.com/aajains/springboard-datascience-intensive/tree/master/capstone_project/DataAcquisitionMerging
4
Engineering Flight Historical Performance Features
From Flight Data
Steps for each flight:
1. Store the date of the flight in question: "thisDate”
2. Get Carrier, Origin, Destination, and Departure
time
3. Filter the dataset based on 4 values found in step
2
4. Filter the filtered data further to get a subset of
the data with only those dates that are "ndays"
prior to "thisDate". Here, "ndays" is the number of
days that we want to calculate the history for.
5. Use the ndays filtered dataset and perform
aggregations such as count of cancelled flights,
median of departure delays etc.. We used ndays
= 10, 20 and 30.
6. These aggregated results are then added as new
fields/features to original dataset.
https://github.com/aajains/springboard-datascience-intensive/blob/master/capstone_project/DataAcquisitionMerging/history_calc.ipynb
5
Data Explorationhttps://github.com/aajains/springboard-datascience-intensive/blob/master/capstone_project/EDA/ExploratoryDataAnalysis_v1.ipynb
Flight schedules
Airports
Airlines
Flight distance
Weather factors
Flight historical performances
6
Flight Cancellation Rate: Definition
Here, a given scenario
corresponds to a date. A scenario
could be a weather condition, or a
specific airline, or an airport, or a
day of week, etc..
Number of flights and
cancellations on a daily basis
Cancellation rates on a daily
basis
7
Flight schedules
Airports
Airlines
Flight distance
8
Flight Schedules
Monthly cancellation rates
Weekly cancellation rates
Hourly cancellation rates
9
MQ - Envoy Air
EV - ExpressJet Airlines
US - US Airways
NK - Spirit Airlines
AA - American Airlines
B6 - Jetblue Airways
UA - United Airlines
OO - SkyWest Airlines
WN - Southwest Airlines
VX - Virgin America
DL - Delta Air Lines
F9 - Frontier Airlines
AS - Alaska Airlines
Airlines
Airports and Airlines
ATL Atlanta, GA
BOS Boston, MA
BWI Baltimore, MD
CLT Charlotte, NC
DEN Denver, CO
DFW Dallas/Fort Worth, TX
DTW Detroit, MI
EWR Newark, NJ
IAH Houston, TX
JFK New York, NY
LAS Las Vegas, NV
LAX Los Angeles, CA
LGA New York, NY
MCO Orlando, FL
MSP Minneapolis, MN
ORD Chicago, IL
PHX Phoenix, AZ
SEA Seattle, WA
SFO San Francisco, CA
SLC Salt Lake City, UT
Airports
10
Flight Distance
11
Weather factors
12
Temperature, Humidity and Wind Speed
Temperature
Wind speedHumidity
13
Wind Direction
1400 : North, 900: East, 1800: South, 2700: West
Weather Conditions
15
Flight historical performances
16
Cancellation and Diversion History
17
Some Historical Indicators
18
Temporary flight: A flight
which has history of no
flights in lasy ndays
All cancelled: A flight
which has history of
100% cancellations
All diverted: A flight
which has history of
100% diversions
Departure and Arrival Delay History
Departure Delay
Arrival Delay
19
Machine Learning Modelinghttps://github.com/aajains/springboard-datascience-intensive/tree/master/capstone_project/Modeling
20
Modeling Overview
Type: Supervised learning
Tools: Python’s scikit learn and imblearn
Binary classification: 1 for cancelled and 0 for non-
cancelled flights
Highly imbalanced data: 1.14% data tagged with
class 1
Needs
special
attention
21
Modeling Steps
Data pre-processing
steps:
1. Label encoding
2. Data splitting into
training and test sets
(50%-50%)
3. Resampling or
weighting the training
data to take care of
imbalanced problem
4. Scaling
Cross validation (CV)
for hyperparameter
tuning:
• 5 fold cv
• Using scikit-learn’s grid
search method
• Evaluation metric: Area
under precision recall
curve
Classifier training using
optimal parameters and
50% of the whole data
Performance
evaluation using
holdout dataset
(50% of the whole
data)
These steps piped using imblearn’s pipeline
class
Testing using the same
pipe
(excluding cross validation)
22
Resampling/Weighting Techniques Used:
Resampling techniques (imblearn):
• Random under-sampling (RUS)
• Synthetic Minority Oversampling Technique (SMOTE)
Weighting technique (sklearn):
• Using class_weight (=‘balanced’) parameter in several scikit-learn’s classifier
implementations
Classification Algorithms Used:
1. Logistic Regression – RUS works best
2. Gaussian Naïve Bayes – RUS works best
3. Random Forest – class_weight = ‘balanced’ works best
4. Gradient Boosting – SMOTE works best
5. Extremely Randomized Trees – class_weight = ‘balanced’ works best
23
Model Comparisons
LR: Logistic Regression, GNB: Gaussian Naïve Bayes, RF: Random Forest, GB: Gradient Boosting, ET: Extremely Randomized Trees
24
Model Comparisons
Logistic Regression is the worst and Extremely Randomized Trees is the best
25
Some Details on the Best Model (ET)
Figure27: ROC and PR curvesfor thegradient boosting classifier.
using a threshold. In ET, however, thresholds are selected at random for each
feature. In ET, wedo not to useany featurescaling. Thisalgorithm isnot that slow
so wework with the whole data. Table 5 shows the results for various metrics for
all different resampling/weighting techniques that we tried. For ET, we find that
Table 5: ET classifier - choosing thebest resampling/weighting technique
Resampling/weighting technique Class Values PR AUC ROC AUC CPU time(min)
Precision Recall F1-score
RandomUnderSampler 0 1.00 0.83 0.91 0.26 0.86 0.25
1 0.05 0.73 0.09
SMOTE 0 0.99 0.99 0.99 0.30 0.85 18
1 0.40 0.32 0.35
• Before optimizing number of
trees and set of features.
• We used 50 trees and all 67
features (note that we removed
some features after merging
various datasets)
• After optimizing number of trees
and set of features.
• Optimum number of trees: 50
• Optimum number of features: 31
• Used one-hot encoding (OHE),
leading to 295 features
How did we optimize number of trees and set of
features? 26
Some Details on the Best Model (ET)
Optimizing number of trees
try to study theeffect of the number of estimators in Fig. 28 to see if 50 trees isa
good enough number. Clearly, thePR AUC improvesby increasing thenumber of
Figure28: PR AUC (green) and CPU time(red) asafunction of thenumber of trees. Thevertical
blue line indicates 50 trees.
trees but at the cost of computational time. The PR AUC seems to beleveling off
asymptotically after around 40-50 trees. In other words, for morethan 50 trees, the
gain in PR AUC is not as significant as the increase in computational cost is. So,
we stick with 50 trees. With 50 trees, and all other optimized hyper-parameters,
21
Optimizing set of
features
we can now study the feature importance. Figure 29 shows the top 20 features i
termsof their featureimportances. Out of thesetop20features, 8haveinformatio
about the weather, 3 are historical performance features (which we engineered
6 are calendar variables and remaining three have information about the fligh
(Carrier, Distance and Origin). The calendar variables do not make much sens
herebecauseit isquitepossiblethat DayOfWeek and DayOfWeek_Dest havequi
the same information. Similarly, DayOfMonth and DayOfMonth_Dest are ver
much thesamein termsof information. In order to find optimal choiceof feature
that maximize PR AUC, we use different values of cutoffs on feature importanc
and obtain PR AUC. Figure30 suggeststhat thePR AUC isalmost constant whe
the cutoff is less than 0.02. To minimize computational cost by maintaining th
PR AUC, wechoosethecutoff to be0.015. Wecan usethereduced set of feature
Top 20
features
27
Testing on Under-sampled Test Data
LR: Logistic Regression, GNB: Gaussian Naïve Bayes, RF: Random Forest, GB: Gradient Boosting, ET: Extremely Randomized Trees
28
Testing on Under-sampled Test Data
Logistic Regression is the worst and Extremely Randomized Trees is the best
29
Using the Model
• Select the top 31 features from
the dataset.
• Note that Ncan_10, Ncan_20
and Ncan_30 are not originally
present in any dataset. One
needs to calculate (or engineer)
them beforehand
• Perform one-hot encoding on
categorical variables
• Use model pipeline on the new
data and predict probabilities for
flight cancellation 30
An Example of Model Usage: Possible
Recommendations
31
Assumptions, Limitations and Disclaimers
 We assume that all flights are independent, though there
might be some time-correlations
 Used only 20 airports and two years of data
 The model will behave poorly if we try to predict
cancellations of flights that are scheduled too far in
future
 The flight data does not contain all airlines data (e.g.
Sun Country Airlines data is missing in the current flight
dataset)
32
More Ideas to Improve the Model in Future
 Engineer more features related with airports such as
number of runways, airport capacity, airport infrastructure,
etc.
 Extract more information about airlines such as their
ratings, stock market performance etc., to get more
features
 Similar to flight historical performances, generate features
with airport historical performances
 Use social network and news media to extract sentiments
about airlines and airport to get more features
33
Conclusions
 All sources of datasets contributed to the predictive power of the
model.
 Out of 5 supervised classification models, the Extremely Randomized
Trees provided the best results.
 Out of 67 features, we used only 31 features for the best model with
12 from the flight data, 16 from the weather data and 3 from the flight
historical performances data (which we engineered).
 With 50%-50% splitting, the test data set gave ROC AUC = 0.89.
 With more ideas, the model can be improved in the future.
34
Thank you!
Aashish Jain, PhD
Email: aajains@gmail.com
https://www.linkedin.com/in/aashishjain/
https://github.com/aajains/
Project report: https://github.com/aajains/springboard-datascience-intensive/blob/master/capstone_project/Report/CapstoneReport.pdf
35

More Related Content

Similar to Predicting flight cancellation likelihood

Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance
Mingxuan Li
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
IRJET Journal
 
Random Forest Ensemble learning algorithm for Engineering Analytics Project
Random Forest Ensemble learning algorithm for Engineering Analytics ProjectRandom Forest Ensemble learning algorithm for Engineering Analytics Project
Random Forest Ensemble learning algorithm for Engineering Analytics Project
Saurabh Kale
 
Flight Delay Prediction
Flight Delay PredictionFlight Delay Prediction
Flight Delay Prediction
Vyshak Srishylappa
 
FlightDelayAnalysis
FlightDelayAnalysisFlightDelayAnalysis
FlightDelayAnalysis
Sonali Chaudhari
 
DOC245-20240219-WA0000_240219_090212.pdf
DOC245-20240219-WA0000_240219_090212.pdfDOC245-20240219-WA0000_240219_090212.pdf
DOC245-20240219-WA0000_240219_090212.pdf
ShaizaanKhan
 
Hard landing predection
Hard landing predectionHard landing predection
Hard landing predection
RAJUPADHYAY44
 
IRJET - Airplane Crash Analysis and Prediction using Machine Learning
IRJET - Airplane Crash Analysis and Prediction using Machine LearningIRJET - Airplane Crash Analysis and Prediction using Machine Learning
IRJET - Airplane Crash Analysis and Prediction using Machine Learning
IRJET Journal
 
IRJET - Comparative Study of Flight Delay Prediction using Back Propagati...
IRJET -  	  Comparative Study of Flight Delay Prediction using Back Propagati...IRJET -  	  Comparative Study of Flight Delay Prediction using Back Propagati...
IRJET - Comparative Study of Flight Delay Prediction using Back Propagati...
IRJET Journal
 
Cmis 102 hands on/tutorialoutlet
Cmis 102 hands on/tutorialoutletCmis 102 hands on/tutorialoutlet
Cmis 102 hands on/tutorialoutlet
Poppinss
 
6
66
Time series project
Time series projectTime series project
Time series project
Rupantar Rana
 
Se notes
Se notesSe notes
Winter Simulation Conference 2021 - Process Wind Tunnel Talk
Winter Simulation Conference 2021 - Process Wind Tunnel TalkWinter Simulation Conference 2021 - Process Wind Tunnel Talk
Winter Simulation Conference 2021 - Process Wind Tunnel Talk
Sudhendu Rai
 
Analyzing Air Quality Measurements in Macedonia with Apache Drill
Analyzing Air Quality Measurements in Macedonia with Apache DrillAnalyzing Air Quality Measurements in Macedonia with Apache Drill
Analyzing Air Quality Measurements in Macedonia with Apache Drill
Marjan Sterjev
 
Air Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and PredictionsAir Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and Predictions
Carlo Carandang
 
Software estimation techniques
Software estimation techniquesSoftware estimation techniques
Software estimation techniquesTan Tran
 
IRJET- Study of Prediction Algorithms on Aviation Accident Dataset using Rapi...
IRJET- Study of Prediction Algorithms on Aviation Accident Dataset using Rapi...IRJET- Study of Prediction Algorithms on Aviation Accident Dataset using Rapi...
IRJET- Study of Prediction Algorithms on Aviation Accident Dataset using Rapi...
IRJET Journal
 
Flight delay detection data mining project
Flight delay detection data mining projectFlight delay detection data mining project
Flight delay detection data mining project
Akshay Kumar Bhushan
 

Similar to Predicting flight cancellation likelihood (20)

Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
 
Random Forest Ensemble learning algorithm for Engineering Analytics Project
Random Forest Ensemble learning algorithm for Engineering Analytics ProjectRandom Forest Ensemble learning algorithm for Engineering Analytics Project
Random Forest Ensemble learning algorithm for Engineering Analytics Project
 
Flight Delay Prediction
Flight Delay PredictionFlight Delay Prediction
Flight Delay Prediction
 
FlightDelayAnalysis
FlightDelayAnalysisFlightDelayAnalysis
FlightDelayAnalysis
 
DOC245-20240219-WA0000_240219_090212.pdf
DOC245-20240219-WA0000_240219_090212.pdfDOC245-20240219-WA0000_240219_090212.pdf
DOC245-20240219-WA0000_240219_090212.pdf
 
Hard landing predection
Hard landing predectionHard landing predection
Hard landing predection
 
IRJET - Airplane Crash Analysis and Prediction using Machine Learning
IRJET - Airplane Crash Analysis and Prediction using Machine LearningIRJET - Airplane Crash Analysis and Prediction using Machine Learning
IRJET - Airplane Crash Analysis and Prediction using Machine Learning
 
IRJET - Comparative Study of Flight Delay Prediction using Back Propagati...
IRJET -  	  Comparative Study of Flight Delay Prediction using Back Propagati...IRJET -  	  Comparative Study of Flight Delay Prediction using Back Propagati...
IRJET - Comparative Study of Flight Delay Prediction using Back Propagati...
 
Cmis 102 hands on/tutorialoutlet
Cmis 102 hands on/tutorialoutletCmis 102 hands on/tutorialoutlet
Cmis 102 hands on/tutorialoutlet
 
6
66
6
 
Time series project
Time series projectTime series project
Time series project
 
Se notes
Se notesSe notes
Se notes
 
Winter Simulation Conference 2021 - Process Wind Tunnel Talk
Winter Simulation Conference 2021 - Process Wind Tunnel TalkWinter Simulation Conference 2021 - Process Wind Tunnel Talk
Winter Simulation Conference 2021 - Process Wind Tunnel Talk
 
Analyzing Air Quality Measurements in Macedonia with Apache Drill
Analyzing Air Quality Measurements in Macedonia with Apache DrillAnalyzing Air Quality Measurements in Macedonia with Apache Drill
Analyzing Air Quality Measurements in Macedonia with Apache Drill
 
Air Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and PredictionsAir Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and Predictions
 
Benchmarking_ML_Tools
Benchmarking_ML_ToolsBenchmarking_ML_Tools
Benchmarking_ML_Tools
 
Software estimation techniques
Software estimation techniquesSoftware estimation techniques
Software estimation techniques
 
IRJET- Study of Prediction Algorithms on Aviation Accident Dataset using Rapi...
IRJET- Study of Prediction Algorithms on Aviation Accident Dataset using Rapi...IRJET- Study of Prediction Algorithms on Aviation Accident Dataset using Rapi...
IRJET- Study of Prediction Algorithms on Aviation Accident Dataset using Rapi...
 
Flight delay detection data mining project
Flight delay detection data mining projectFlight delay detection data mining project
Flight delay detection data mining project
 

Recently uploaded

【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 

Recently uploaded (20)

【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 

Predicting flight cancellation likelihood

  • 1. Predicting the Likelihood of Flight Cancellations Aashish Jain, PhD Data Science Intensive Capstone Project, May 29th 2017 Cohort Thanks to Springboard mentors Hassan Kingravi, Pindrop Labs Srdjan Santic, RCC, University of Belgrade 0
  • 2. The Problem Total flights in two years (2015 and 2016) in top 20 airports: 1.14% (≅32,600 flights) 2.85+ millions Cancellation rate: Even though 1.14% is a small number, one rare event causes lot of troubles to passengers What factors affect flight cancellation rate? Can we predict the likelihood of a flight getting cancelled? 1
  • 3. Who might care? Mobile Apps …and many more… Travel Agencies Airlines 2
  • 4. What factors might affect flight cancellations? • Flight schedules: departure and arrival dates and times • Airline carriers • Airport locations: both origin and destination • Flight length: distance, flight times • Weather conditions: both origin and destination • Temperature, humidity, wind speed, pressure, etc..: both origin and destination DataSource FlightDataWeatherData Bureau of Transportation Statistics Wundergound.com API 3
  • 5. Data Information Data acquired for the period: January 2015 – December 2016 Flight and weather data for: Top 20 airports in the US Number of records: 2,857,139 Number of fields: 90 Flight Data Specifics • Source: Bureau of Transportation Statistics • Data can be downloaded for one month at a time • File format : csv • Concatenate files from all months into one • Each record: a unique flight Weather Data Specifics • Source: Wunderground.com API • File format : XML (hourly data available) • One API call: One day data for a chosen airport • Concatenate files from days into one, for an airport • Each record: weather details for an airport at a given hour Merge them together using following keys: • Flight date at origin and destination • Origin and destination airport names • Scheduled departure and arrival times Data Acquisition and Merging https://github.com/aajains/springboard-datascience-intensive/tree/master/capstone_project/DataAcquisitionMerging 4
  • 6. Engineering Flight Historical Performance Features From Flight Data Steps for each flight: 1. Store the date of the flight in question: "thisDate” 2. Get Carrier, Origin, Destination, and Departure time 3. Filter the dataset based on 4 values found in step 2 4. Filter the filtered data further to get a subset of the data with only those dates that are "ndays" prior to "thisDate". Here, "ndays" is the number of days that we want to calculate the history for. 5. Use the ndays filtered dataset and perform aggregations such as count of cancelled flights, median of departure delays etc.. We used ndays = 10, 20 and 30. 6. These aggregated results are then added as new fields/features to original dataset. https://github.com/aajains/springboard-datascience-intensive/blob/master/capstone_project/DataAcquisitionMerging/history_calc.ipynb 5
  • 8. Flight Cancellation Rate: Definition Here, a given scenario corresponds to a date. A scenario could be a weather condition, or a specific airline, or an airport, or a day of week, etc.. Number of flights and cancellations on a daily basis Cancellation rates on a daily basis 7
  • 10. Flight Schedules Monthly cancellation rates Weekly cancellation rates Hourly cancellation rates 9
  • 11. MQ - Envoy Air EV - ExpressJet Airlines US - US Airways NK - Spirit Airlines AA - American Airlines B6 - Jetblue Airways UA - United Airlines OO - SkyWest Airlines WN - Southwest Airlines VX - Virgin America DL - Delta Air Lines F9 - Frontier Airlines AS - Alaska Airlines Airlines Airports and Airlines ATL Atlanta, GA BOS Boston, MA BWI Baltimore, MD CLT Charlotte, NC DEN Denver, CO DFW Dallas/Fort Worth, TX DTW Detroit, MI EWR Newark, NJ IAH Houston, TX JFK New York, NY LAS Las Vegas, NV LAX Los Angeles, CA LGA New York, NY MCO Orlando, FL MSP Minneapolis, MN ORD Chicago, IL PHX Phoenix, AZ SEA Seattle, WA SFO San Francisco, CA SLC Salt Lake City, UT Airports 10
  • 14. Temperature, Humidity and Wind Speed Temperature Wind speedHumidity 13
  • 15. Wind Direction 1400 : North, 900: East, 1800: South, 2700: West
  • 19. Some Historical Indicators 18 Temporary flight: A flight which has history of no flights in lasy ndays All cancelled: A flight which has history of 100% cancellations All diverted: A flight which has history of 100% diversions
  • 20. Departure and Arrival Delay History Departure Delay Arrival Delay 19
  • 22. Modeling Overview Type: Supervised learning Tools: Python’s scikit learn and imblearn Binary classification: 1 for cancelled and 0 for non- cancelled flights Highly imbalanced data: 1.14% data tagged with class 1 Needs special attention 21
  • 23. Modeling Steps Data pre-processing steps: 1. Label encoding 2. Data splitting into training and test sets (50%-50%) 3. Resampling or weighting the training data to take care of imbalanced problem 4. Scaling Cross validation (CV) for hyperparameter tuning: • 5 fold cv • Using scikit-learn’s grid search method • Evaluation metric: Area under precision recall curve Classifier training using optimal parameters and 50% of the whole data Performance evaluation using holdout dataset (50% of the whole data) These steps piped using imblearn’s pipeline class Testing using the same pipe (excluding cross validation) 22
  • 24. Resampling/Weighting Techniques Used: Resampling techniques (imblearn): • Random under-sampling (RUS) • Synthetic Minority Oversampling Technique (SMOTE) Weighting technique (sklearn): • Using class_weight (=‘balanced’) parameter in several scikit-learn’s classifier implementations Classification Algorithms Used: 1. Logistic Regression – RUS works best 2. Gaussian Naïve Bayes – RUS works best 3. Random Forest – class_weight = ‘balanced’ works best 4. Gradient Boosting – SMOTE works best 5. Extremely Randomized Trees – class_weight = ‘balanced’ works best 23
  • 25. Model Comparisons LR: Logistic Regression, GNB: Gaussian Naïve Bayes, RF: Random Forest, GB: Gradient Boosting, ET: Extremely Randomized Trees 24
  • 26. Model Comparisons Logistic Regression is the worst and Extremely Randomized Trees is the best 25
  • 27. Some Details on the Best Model (ET) Figure27: ROC and PR curvesfor thegradient boosting classifier. using a threshold. In ET, however, thresholds are selected at random for each feature. In ET, wedo not to useany featurescaling. Thisalgorithm isnot that slow so wework with the whole data. Table 5 shows the results for various metrics for all different resampling/weighting techniques that we tried. For ET, we find that Table 5: ET classifier - choosing thebest resampling/weighting technique Resampling/weighting technique Class Values PR AUC ROC AUC CPU time(min) Precision Recall F1-score RandomUnderSampler 0 1.00 0.83 0.91 0.26 0.86 0.25 1 0.05 0.73 0.09 SMOTE 0 0.99 0.99 0.99 0.30 0.85 18 1 0.40 0.32 0.35 • Before optimizing number of trees and set of features. • We used 50 trees and all 67 features (note that we removed some features after merging various datasets) • After optimizing number of trees and set of features. • Optimum number of trees: 50 • Optimum number of features: 31 • Used one-hot encoding (OHE), leading to 295 features How did we optimize number of trees and set of features? 26
  • 28. Some Details on the Best Model (ET) Optimizing number of trees try to study theeffect of the number of estimators in Fig. 28 to see if 50 trees isa good enough number. Clearly, thePR AUC improvesby increasing thenumber of Figure28: PR AUC (green) and CPU time(red) asafunction of thenumber of trees. Thevertical blue line indicates 50 trees. trees but at the cost of computational time. The PR AUC seems to beleveling off asymptotically after around 40-50 trees. In other words, for morethan 50 trees, the gain in PR AUC is not as significant as the increase in computational cost is. So, we stick with 50 trees. With 50 trees, and all other optimized hyper-parameters, 21 Optimizing set of features we can now study the feature importance. Figure 29 shows the top 20 features i termsof their featureimportances. Out of thesetop20features, 8haveinformatio about the weather, 3 are historical performance features (which we engineered 6 are calendar variables and remaining three have information about the fligh (Carrier, Distance and Origin). The calendar variables do not make much sens herebecauseit isquitepossiblethat DayOfWeek and DayOfWeek_Dest havequi the same information. Similarly, DayOfMonth and DayOfMonth_Dest are ver much thesamein termsof information. In order to find optimal choiceof feature that maximize PR AUC, we use different values of cutoffs on feature importanc and obtain PR AUC. Figure30 suggeststhat thePR AUC isalmost constant whe the cutoff is less than 0.02. To minimize computational cost by maintaining th PR AUC, wechoosethecutoff to be0.015. Wecan usethereduced set of feature Top 20 features 27
  • 29. Testing on Under-sampled Test Data LR: Logistic Regression, GNB: Gaussian Naïve Bayes, RF: Random Forest, GB: Gradient Boosting, ET: Extremely Randomized Trees 28
  • 30. Testing on Under-sampled Test Data Logistic Regression is the worst and Extremely Randomized Trees is the best 29
  • 31. Using the Model • Select the top 31 features from the dataset. • Note that Ncan_10, Ncan_20 and Ncan_30 are not originally present in any dataset. One needs to calculate (or engineer) them beforehand • Perform one-hot encoding on categorical variables • Use model pipeline on the new data and predict probabilities for flight cancellation 30
  • 32. An Example of Model Usage: Possible Recommendations 31
  • 33. Assumptions, Limitations and Disclaimers  We assume that all flights are independent, though there might be some time-correlations  Used only 20 airports and two years of data  The model will behave poorly if we try to predict cancellations of flights that are scheduled too far in future  The flight data does not contain all airlines data (e.g. Sun Country Airlines data is missing in the current flight dataset) 32
  • 34. More Ideas to Improve the Model in Future  Engineer more features related with airports such as number of runways, airport capacity, airport infrastructure, etc.  Extract more information about airlines such as their ratings, stock market performance etc., to get more features  Similar to flight historical performances, generate features with airport historical performances  Use social network and news media to extract sentiments about airlines and airport to get more features 33
  • 35. Conclusions  All sources of datasets contributed to the predictive power of the model.  Out of 5 supervised classification models, the Extremely Randomized Trees provided the best results.  Out of 67 features, we used only 31 features for the best model with 12 from the flight data, 16 from the weather data and 3 from the flight historical performances data (which we engineered).  With 50%-50% splitting, the test data set gave ROC AUC = 0.89.  With more ideas, the model can be improved in the future. 34
  • 36. Thank you! Aashish Jain, PhD Email: aajains@gmail.com https://www.linkedin.com/in/aashishjain/ https://github.com/aajains/ Project report: https://github.com/aajains/springboard-datascience-intensive/blob/master/capstone_project/Report/CapstoneReport.pdf 35