Extremely Randomized Trees classification model achieved the best performance in predicting flight cancellations. The model used 31 features from flight, weather, and historical flight data to predict cancellations for over 2.8 million flights. Cross-validation determined the optimal number of trees and features. Testing on a holdout dataset produced an ROC AUC of 0.89, accurately predicting cancellations for the imbalanced dataset where cancellations were only 1.14% of flights. The model provides probabilistic cancellation predictions and could help travelers and airlines.
The aim of the project is to track the on-time performance of major domestic carriers in the US. The complete information on air travel report including raw data and summary statistics is available which enables to make predictions about possible delays in flights
The objective of the project is to predict whether a flight will be delayed or not by studying the various features of the flight and calculating the probability for delay using Bayesian Classification.
The aim of the project is to track the on-time performance of major domestic carriers in the US. The complete information on air travel report including raw data and summary statistics is available which enables to make predictions about possible delays in flights
The objective of the project is to predict whether a flight will be delayed or not by studying the various features of the flight and calculating the probability for delay using Bayesian Classification.
ADS Team 3 Final Presentation.
Discusses different models to perform delay prediction, flight cancellation, flight recommendation and flight fare prediction.
FOR MORE CLASSES VISIT
tutorialoutletdotcom
CMIS 102 Hands-On Lab
Week 8
Overview
This hands-on lab allows you to follow and experiment with the critical steps of developing a program
including the program description, analysis, test plan, and implementation with C code. The example
provided uses sequential, repetition, selection statements, functions, strings and arrays.
Air traffic forecast serves as an important quantitative basis for airport planning - in particular for capacity planning CAPEX ,as well as for aeronautical and non-aeronautical revenue planning. High level decisions and planning in airports relies heavilly on future airport activity.
Winter Simulation Conference 2021 - Process Wind Tunnel TalkSudhendu Rai
The talk associated with this presentation can be accessed at:
https://youtu.be/VXEVuXW9knU
Abstract
In this talk, we will introduce a simulation-based process improvement framework and methodology called the Process Wind Tunnel. We will describe this framework and introduce the underlying technologies namely process mapping and data collection, data wrangling, exploratory data analysis and visualization, process mining, discrete-event simulation optimization and solution implementation. We will discuss how Process Wind Tunnel framework was utilized to improve a critical business process namely, the post-execution trade settlement process. The work builds upon and generalizes the Lean Document Production solution (2008 Edelman finalist) for optimizing printshops to more general and complex business processes found within the insurance and financial services industry.
Analyzing Air Quality Measurements in Macedonia with Apache DrillMarjan Sterjev
The article provides an example for JSON data analysis with Apache Drill. The "toy" model is based on the publicly available air quality measurement data.
Air Pollution in Nova Scotia: Analysis and PredictionsCarlo Carandang
"Air Pollution in Nova Scotia: Analysis and Predictions"
Halifax, Nova Scotia, Canada; May 22, 2018
Presentation to the Department of Environment, Government of Nova Scotia.
Analysis of air fine particulate matter (PM 2.5) open datasets in Nova Scotia, showing both business intelligence and predictive analytics.
ADS Team 3 Final Presentation.
Discusses different models to perform delay prediction, flight cancellation, flight recommendation and flight fare prediction.
FOR MORE CLASSES VISIT
tutorialoutletdotcom
CMIS 102 Hands-On Lab
Week 8
Overview
This hands-on lab allows you to follow and experiment with the critical steps of developing a program
including the program description, analysis, test plan, and implementation with C code. The example
provided uses sequential, repetition, selection statements, functions, strings and arrays.
Air traffic forecast serves as an important quantitative basis for airport planning - in particular for capacity planning CAPEX ,as well as for aeronautical and non-aeronautical revenue planning. High level decisions and planning in airports relies heavilly on future airport activity.
Winter Simulation Conference 2021 - Process Wind Tunnel TalkSudhendu Rai
The talk associated with this presentation can be accessed at:
https://youtu.be/VXEVuXW9knU
Abstract
In this talk, we will introduce a simulation-based process improvement framework and methodology called the Process Wind Tunnel. We will describe this framework and introduce the underlying technologies namely process mapping and data collection, data wrangling, exploratory data analysis and visualization, process mining, discrete-event simulation optimization and solution implementation. We will discuss how Process Wind Tunnel framework was utilized to improve a critical business process namely, the post-execution trade settlement process. The work builds upon and generalizes the Lean Document Production solution (2008 Edelman finalist) for optimizing printshops to more general and complex business processes found within the insurance and financial services industry.
Analyzing Air Quality Measurements in Macedonia with Apache DrillMarjan Sterjev
The article provides an example for JSON data analysis with Apache Drill. The "toy" model is based on the publicly available air quality measurement data.
Air Pollution in Nova Scotia: Analysis and PredictionsCarlo Carandang
"Air Pollution in Nova Scotia: Analysis and Predictions"
Halifax, Nova Scotia, Canada; May 22, 2018
Presentation to the Department of Environment, Government of Nova Scotia.
Analysis of air fine particulate matter (PM 2.5) open datasets in Nova Scotia, showing both business intelligence and predictive analytics.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
1. Predicting the Likelihood of
Flight Cancellations
Aashish Jain, PhD
Data Science Intensive Capstone Project, May 29th 2017 Cohort
Thanks to Springboard mentors
Hassan Kingravi, Pindrop Labs Srdjan Santic, RCC, University of Belgrade
0
2. The Problem
Total flights in two years (2015
and 2016) in top 20 airports:
1.14% (≅32,600 flights)
2.85+ millions
Cancellation rate:
Even though 1.14% is a small
number, one rare event causes
lot of troubles to passengers
What factors affect flight cancellation rate?
Can we predict the likelihood of a flight getting cancelled? 1
4. What factors might affect flight cancellations?
• Flight schedules:
departure and arrival
dates and times
• Airline carriers • Airport locations:
both origin and
destination
• Flight length:
distance, flight
times
• Weather conditions:
both origin and
destination
• Temperature,
humidity, wind
speed, pressure,
etc..: both origin
and destination
DataSource
FlightDataWeatherData
Bureau of Transportation
Statistics
Wundergound.com API
3
5. Data Information
Data acquired for the period: January 2015 – December 2016
Flight and weather data for: Top 20 airports in the US
Number of records: 2,857,139
Number of fields: 90
Flight Data Specifics
• Source: Bureau of Transportation Statistics
• Data can be downloaded for one month at a time
• File format : csv
• Concatenate files from all months into one
• Each record: a unique flight
Weather Data Specifics
• Source: Wunderground.com API
• File format : XML (hourly data available)
• One API call: One day data for a chosen airport
• Concatenate files from days into one, for an airport
• Each record: weather details for an airport at a given
hour
Merge them together using following keys:
• Flight date at origin and destination
• Origin and destination airport names
• Scheduled departure and arrival times
Data Acquisition and Merging
https://github.com/aajains/springboard-datascience-intensive/tree/master/capstone_project/DataAcquisitionMerging
4
6. Engineering Flight Historical Performance Features
From Flight Data
Steps for each flight:
1. Store the date of the flight in question: "thisDate”
2. Get Carrier, Origin, Destination, and Departure
time
3. Filter the dataset based on 4 values found in step
2
4. Filter the filtered data further to get a subset of
the data with only those dates that are "ndays"
prior to "thisDate". Here, "ndays" is the number of
days that we want to calculate the history for.
5. Use the ndays filtered dataset and perform
aggregations such as count of cancelled flights,
median of departure delays etc.. We used ndays
= 10, 20 and 30.
6. These aggregated results are then added as new
fields/features to original dataset.
https://github.com/aajains/springboard-datascience-intensive/blob/master/capstone_project/DataAcquisitionMerging/history_calc.ipynb
5
8. Flight Cancellation Rate: Definition
Here, a given scenario
corresponds to a date. A scenario
could be a weather condition, or a
specific airline, or an airport, or a
day of week, etc..
Number of flights and
cancellations on a daily basis
Cancellation rates on a daily
basis
7
11. MQ - Envoy Air
EV - ExpressJet Airlines
US - US Airways
NK - Spirit Airlines
AA - American Airlines
B6 - Jetblue Airways
UA - United Airlines
OO - SkyWest Airlines
WN - Southwest Airlines
VX - Virgin America
DL - Delta Air Lines
F9 - Frontier Airlines
AS - Alaska Airlines
Airlines
Airports and Airlines
ATL Atlanta, GA
BOS Boston, MA
BWI Baltimore, MD
CLT Charlotte, NC
DEN Denver, CO
DFW Dallas/Fort Worth, TX
DTW Detroit, MI
EWR Newark, NJ
IAH Houston, TX
JFK New York, NY
LAS Las Vegas, NV
LAX Los Angeles, CA
LGA New York, NY
MCO Orlando, FL
MSP Minneapolis, MN
ORD Chicago, IL
PHX Phoenix, AZ
SEA Seattle, WA
SFO San Francisco, CA
SLC Salt Lake City, UT
Airports
10
19. Some Historical Indicators
18
Temporary flight: A flight
which has history of no
flights in lasy ndays
All cancelled: A flight
which has history of
100% cancellations
All diverted: A flight
which has history of
100% diversions
22. Modeling Overview
Type: Supervised learning
Tools: Python’s scikit learn and imblearn
Binary classification: 1 for cancelled and 0 for non-
cancelled flights
Highly imbalanced data: 1.14% data tagged with
class 1
Needs
special
attention
21
23. Modeling Steps
Data pre-processing
steps:
1. Label encoding
2. Data splitting into
training and test sets
(50%-50%)
3. Resampling or
weighting the training
data to take care of
imbalanced problem
4. Scaling
Cross validation (CV)
for hyperparameter
tuning:
• 5 fold cv
• Using scikit-learn’s grid
search method
• Evaluation metric: Area
under precision recall
curve
Classifier training using
optimal parameters and
50% of the whole data
Performance
evaluation using
holdout dataset
(50% of the whole
data)
These steps piped using imblearn’s pipeline
class
Testing using the same
pipe
(excluding cross validation)
22
24. Resampling/Weighting Techniques Used:
Resampling techniques (imblearn):
• Random under-sampling (RUS)
• Synthetic Minority Oversampling Technique (SMOTE)
Weighting technique (sklearn):
• Using class_weight (=‘balanced’) parameter in several scikit-learn’s classifier
implementations
Classification Algorithms Used:
1. Logistic Regression – RUS works best
2. Gaussian Naïve Bayes – RUS works best
3. Random Forest – class_weight = ‘balanced’ works best
4. Gradient Boosting – SMOTE works best
5. Extremely Randomized Trees – class_weight = ‘balanced’ works best
23
25. Model Comparisons
LR: Logistic Regression, GNB: Gaussian Naïve Bayes, RF: Random Forest, GB: Gradient Boosting, ET: Extremely Randomized Trees
24
27. Some Details on the Best Model (ET)
Figure27: ROC and PR curvesfor thegradient boosting classifier.
using a threshold. In ET, however, thresholds are selected at random for each
feature. In ET, wedo not to useany featurescaling. Thisalgorithm isnot that slow
so wework with the whole data. Table 5 shows the results for various metrics for
all different resampling/weighting techniques that we tried. For ET, we find that
Table 5: ET classifier - choosing thebest resampling/weighting technique
Resampling/weighting technique Class Values PR AUC ROC AUC CPU time(min)
Precision Recall F1-score
RandomUnderSampler 0 1.00 0.83 0.91 0.26 0.86 0.25
1 0.05 0.73 0.09
SMOTE 0 0.99 0.99 0.99 0.30 0.85 18
1 0.40 0.32 0.35
• Before optimizing number of
trees and set of features.
• We used 50 trees and all 67
features (note that we removed
some features after merging
various datasets)
• After optimizing number of trees
and set of features.
• Optimum number of trees: 50
• Optimum number of features: 31
• Used one-hot encoding (OHE),
leading to 295 features
How did we optimize number of trees and set of
features? 26
28. Some Details on the Best Model (ET)
Optimizing number of trees
try to study theeffect of the number of estimators in Fig. 28 to see if 50 trees isa
good enough number. Clearly, thePR AUC improvesby increasing thenumber of
Figure28: PR AUC (green) and CPU time(red) asafunction of thenumber of trees. Thevertical
blue line indicates 50 trees.
trees but at the cost of computational time. The PR AUC seems to beleveling off
asymptotically after around 40-50 trees. In other words, for morethan 50 trees, the
gain in PR AUC is not as significant as the increase in computational cost is. So,
we stick with 50 trees. With 50 trees, and all other optimized hyper-parameters,
21
Optimizing set of
features
we can now study the feature importance. Figure 29 shows the top 20 features i
termsof their featureimportances. Out of thesetop20features, 8haveinformatio
about the weather, 3 are historical performance features (which we engineered
6 are calendar variables and remaining three have information about the fligh
(Carrier, Distance and Origin). The calendar variables do not make much sens
herebecauseit isquitepossiblethat DayOfWeek and DayOfWeek_Dest havequi
the same information. Similarly, DayOfMonth and DayOfMonth_Dest are ver
much thesamein termsof information. In order to find optimal choiceof feature
that maximize PR AUC, we use different values of cutoffs on feature importanc
and obtain PR AUC. Figure30 suggeststhat thePR AUC isalmost constant whe
the cutoff is less than 0.02. To minimize computational cost by maintaining th
PR AUC, wechoosethecutoff to be0.015. Wecan usethereduced set of feature
Top 20
features
27
29. Testing on Under-sampled Test Data
LR: Logistic Regression, GNB: Gaussian Naïve Bayes, RF: Random Forest, GB: Gradient Boosting, ET: Extremely Randomized Trees
28
30. Testing on Under-sampled Test Data
Logistic Regression is the worst and Extremely Randomized Trees is the best
29
31. Using the Model
• Select the top 31 features from
the dataset.
• Note that Ncan_10, Ncan_20
and Ncan_30 are not originally
present in any dataset. One
needs to calculate (or engineer)
them beforehand
• Perform one-hot encoding on
categorical variables
• Use model pipeline on the new
data and predict probabilities for
flight cancellation 30
32. An Example of Model Usage: Possible
Recommendations
31
33. Assumptions, Limitations and Disclaimers
We assume that all flights are independent, though there
might be some time-correlations
Used only 20 airports and two years of data
The model will behave poorly if we try to predict
cancellations of flights that are scheduled too far in
future
The flight data does not contain all airlines data (e.g.
Sun Country Airlines data is missing in the current flight
dataset)
32
34. More Ideas to Improve the Model in Future
Engineer more features related with airports such as
number of runways, airport capacity, airport infrastructure,
etc.
Extract more information about airlines such as their
ratings, stock market performance etc., to get more
features
Similar to flight historical performances, generate features
with airport historical performances
Use social network and news media to extract sentiments
about airlines and airport to get more features
33
35. Conclusions
All sources of datasets contributed to the predictive power of the
model.
Out of 5 supervised classification models, the Extremely Randomized
Trees provided the best results.
Out of 67 features, we used only 31 features for the best model with
12 from the flight data, 16 from the weather data and 3 from the flight
historical performances data (which we engineered).
With 50%-50% splitting, the test data set gave ROC AUC = 0.89.
With more ideas, the model can be improved in the future.
34