The aim of this project is to track the on-time performance of major domestic carriers in the US. Complete air travel information, including raw data and summary statistics, is available, which makes it possible to predict likely flight delays.
2. Introduction
In the United States, the Federal Aviation Administration
estimates that flight delays cost airlines $22 billion yearly.
Airlines are forced to pay federal authorities when they hold
planes on the tarmac for more than three hours for domestic
flights or more than four hours for international flights.
Flight delays are an inconvenience to passengers as well. A
delayed flight can make them late for scheduled personal
events.
But what if a flight delay could be predicted?
3. Objective
Identify and analyze the factors that cause flight delays
Predict which flights will be delayed
4. Dataset
Title : Airlines Delay
Source : www.kaggle.com/datasets
Description : The U.S. Department of Transportation's (DOT)
Bureau of Transportation Statistics (BTS) tracks the on-time
performance of domestic flights operated by large air carriers.
Summary information on the number of on-time, delayed,
canceled and diverted flights appears in this data.
Number of rows : 1936758
For simplicity's sake, a 0.1% sample is used for modeling in this
project
Number of variables : 30
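The sample size implied by these figures can be checked with a short sketch. The deck does not say how the 0.1% sample was drawn, so simple random sampling without replacement is an assumption here, as are the helper name sample_row_indices and the fixed seed.

```python
import random

TOTAL_ROWS = 1_936_758   # row count stated on the Dataset slide
SAMPLE_FRACTION = 0.001  # 0.1%


def sample_row_indices(total_rows, fraction, seed=42):
    """Return sorted row indices of a simple random sample (assumed method)."""
    sample_size = round(total_rows * fraction)
    rng = random.Random(seed)  # fixed seed (assumption) keeps the sample reproducible
    return sorted(rng.sample(range(total_rows), sample_size))


indices = sample_row_indices(TOTAL_ROWS, SAMPLE_FRACTION)
print(len(indices))  # 1937 rows end up in the modeling sample
```

With roughly 1,900 rows, rare carriers and airports may appear only a handful of times, which is worth keeping in mind when reading the later results.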
5. Variable Description
Name Description
Year 2008
Month 1-12
DayofMonth 1-31
DayOfWeek 1 (Monday) - 7 (Sunday)
DepTime Actual departure time (local, hhmm)
CRSDepTime Scheduled departure time (local, hhmm)
ArrTime Actual arrival time (local, hhmm)
CRSArrTime Scheduled arrival time (local, hhmm)
UniqueCarrier Unique carrier code
FlightNum Flight number
TailNum Plane tail number
ActualElapsedTime In minutes
CRSElapsedTime In minutes
AirTime In minutes
ArrDelay Arrival delay, in minutes
DepDelay Departure delay, in minutes
Origin Origin IATA airport code
Dest Destination IATA airport code
Distance In miles
TaxiIn Taxi in time, in minutes
TaxiOut Taxi out time in minutes
Cancelled Was the flight cancelled?
CancellationCode Reason for cancellation (A = carrier, B =
weather, C = NAS, D = security)
Diverted 1 = yes, 0 = no
CarrierDelay In minutes
WeatherDelay In minutes
NASDelay In minutes
SecurityDelay In minutes
LateAircraftDelay In minutes
Status Delayed, Non Delayed
7. Data preprocessing
Missing values
Missing values in the delay-type columns are replaced with ‘0’, indicating that
the cause does not apply.
If ArrTime/DepTime is missing, it is replaced with its scheduled (CRS)
equivalent.
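The two imputation rules above can be sketched as follows. This is a minimal illustration that assumes each flight is a dict and that a missing value is represented by None; the helper name impute_missing is hypothetical.

```python
# Delay-cause columns from the variable description
DELAY_CAUSE_COLUMNS = ["CarrierDelay", "WeatherDelay", "NASDelay",
                       "SecurityDelay", "LateAircraftDelay"]
# Actual-time column -> scheduled (CRS) column used as its fallback
ACTUAL_TO_SCHEDULED = {"DepTime": "CRSDepTime", "ArrTime": "CRSArrTime"}


def impute_missing(row):
    """Return a copy of `row` with missing values filled per the two rules."""
    filled = dict(row)
    for col in DELAY_CAUSE_COLUMNS:
        if filled.get(col) is None:
            filled[col] = 0  # the delay cause does not apply
    for actual, scheduled in ACTUAL_TO_SCHEDULED.items():
        if filled.get(actual) is None:
            filled[actual] = filled.get(scheduled)  # fall back to the schedule
    return filled


example = {"DepTime": None, "CRSDepTime": 1430,
           "ArrTime": 1612, "CRSArrTime": 1600,
           "CarrierDelay": None, "WeatherDelay": 12, "NASDelay": None,
           "SecurityDelay": None, "LateAircraftDelay": None}
cleaned = impute_missing(example)
print(cleaned["DepTime"], cleaned["CarrierDelay"])  # 1430 0
```

Values that are already present, such as ArrTime and WeatherDelay in the example, are left unchanged.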
Target leakage
Some columns are directly related to the flight status column and can leak the
target:
CarrierDelay
WeatherDelay
NASDelay
SecurityDelay
LateAircraftDelay
ArrDelay
DepDelay
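Removing the leakage columns before training can be sketched as below. Rows are assumed to be dicts; the helper name drop_leakage_columns and its keep parameter are hypothetical. The keep parameter exists because the deck later asks whether DepDelay should count as leakage, so a variant that retains it is easy to build for comparison.

```python
LEAKAGE_COLUMNS = {"CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
                   "LateAircraftDelay", "ArrDelay", "DepDelay"}


def drop_leakage_columns(row, keep=()):
    """Return a copy of `row` without leakage columns, except those in `keep`."""
    to_drop = LEAKAGE_COLUMNS - set(keep)
    return {name: value for name, value in row.items() if name not in to_drop}


flight = {"Month": 3, "Distance": 810, "DepDelay": 25, "ArrDelay": 31,
          "WeatherDelay": 0, "Status": "Delayed"}
print(drop_leakage_columns(flight))
print(drop_leakage_columns(flight, keep=["DepDelay"]))
```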
8. Data preprocessing (contd.)
Other columns eliminated/created
DepTime, ArrTime: Eliminated since each of these is an absolute
time value.
FlightNum, TailNum: Eliminated since these are unique to each
journey.
CRSArrTime and CRSDepTime have been replaced with
ArrivalBucket and DepartureBucket according to the table below:
CRSArrTime/CRSDepTime Bucket
00:00 to 06:00 Morning
06:00 to 12:00 Afternoon
12:00 to 18:00 Evening
18:00 to 00:00 Night
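The bucketing table can be expressed as a small function. The sketch follows the table literally and makes three assumptions not stated in the deck: times are integers in hhmm form, each range includes its start and excludes its end, and 2400 is treated as 00:00. The last label is spelled Night, as it appears in the regression output later in the deck. The function name time_bucket is hypothetical.

```python
def time_bucket(hhmm):
    """Map a scheduled time in hhmm form (e.g. 1435) to its bucket label."""
    hour = (hhmm // 100) % 24  # assumption: 2400 wraps around to 00:00
    if hour < 6:
        return "Morning"     # 00:00 to 06:00
    if hour < 12:
        return "Afternoon"   # 06:00 to 12:00
    if hour < 18:
        return "Evening"     # 12:00 to 18:00
    return "Night"           # 18:00 to 00:00


for t in (530, 600, 1435, 2359):
    print(t, time_bucket(t))
```

Applying this function to CRSArrTime gives ArrivalBucket, and applying it to CRSDepTime gives DepartureBucket.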
9. Data exploration
Figure: how many flights are delayed
The distribution indicates a possible case of class imbalance.
Figure: how many flights are delayed, by the time of arrival in the day
More delays are observed during the evening and night, mainly
because more flights are scheduled to arrive at these times
(air traffic).
13. Modeling methodology
Split the data into training and testing data (75% training, 25% testing)
Check for class imbalance and treat it using the SMOTE technique
Run a basic random forest model and analyze the confusion matrix and AUC
Treat the missing values and re-run the model. Analyze the confusion matrix and AUC
In case of a high AUC, check for target leakage variables and eliminate them
Re-run the model with the new set of variables. Analyze the confusion matrix and AUC
Calculate various parameters of the random forest
Try other modeling methods to see if improved accuracy is achieved
Finalize the model to be used for the data
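The first two steps of the methodology (the 75/25 split and SMOTE oversampling) can be sketched in plain Python. This is a simplified illustration of the SMOTE idea, not the implementation used in the project: it assumes purely numeric feature tuples, at least two minority points, and Euclidean distance. The function names, the choice of k, and the seeds are hypothetical.

```python
import math
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points.

    Each new point lies on the segment between a randomly chosen minority
    point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 75 25

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=4, k=2)
print(len(new_points))  # 4
```

Oversampling is applied to the training portion only, so that the test set keeps the original class proportions.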
Random Forest Model
Number of trees : 50
Number of variables : 4
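The confusion matrix and AUC used to assess each random forest run can be computed as in the sketch below. It assumes labels coded as 1 (Delayed) and 0 (Non Delayed) and a predicted score per flight; the function names, the 0.5 threshold, and the example values are made up for illustration. The AUC is computed as the share of (delayed, non-delayed) pairs in which the delayed flight receives the higher score, with ties counting as half.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]               # illustrative scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold

print(confusion_matrix(y_true, y_pred))
print(auc(y_true, scores))
```

The pairwise form is quadratic in the number of flights, which is acceptable for a sample of this size; a rank-based formula would scale better on the full data.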
14. Before beginning with the model..
Why did we choose a 0.1% sample rather than restricting the data to a few carriers?
Initially we considered building the model with 5 unique carriers.
However, we realized that this subset was not representative of the entire data
and that the results were biased towards one carrier. It also would not help us
recognize whether this was a clear case of class imbalance.
Should ‘DepDelay’ be considered while building the model?
DepDelay records how long a flight is delayed at its departure airport.
Although a departure delay is one cause of a late arrival, there are many
instances where a flight that left late still arrived on time at the arrival
airport.
So should ‘DepDelay’ be considered while building the model? If it is
directly related to the target variable, does it constitute target
leakage?
While investigating this issue, we considered four cases to analyze the
model.
19. Conclusion
Case 4 is the best option for the final model.
In this case, missing values are treated and the variables are
chosen so as to avoid target leakage.
What are possible reasons for not getting a high AUC?
Additional delay causes not captured in the data
Air traffic
20. Additional analysis
                        Estimate   Std. Error  z value  Pr(>|z|)
(Intercept)             -2.50355   0.244443    -10.242  < 2e-16   ***
Month                    0.012227  0.015952      0.766  0.4434
DayofMonth              -0.01018   0.006322     -1.611  0.1072
DayOfWeek               -0.00827   0.028181     -0.293  0.7692
Distance                 0.000644  8.7E-05       7.407  1.30E-13  ***
Cancelled               16.52819   882.7434      0.019  0.9851
Diverted                16.67214   262.7191      0.063  0.9494
ArrivalBucketEvening     0.315216  0.203532      1.549  0.1214
ArrivalBucketMorning     0.197135  0.450726      0.437  0.6618
ArrivalBucketNight       0.36954   0.252511      1.463  0.1433
DepartureBucketEvening  -0.35955   0.1678       -2.143  0.0321    *
DepartureBucketMorning  -0.3445    0.629283     -0.547  0.5841
DepartureBucketNight    -0.60119   0.232124     -2.59   0.0096    **
To understand the importance of the variables, we ran an initial logistic regression and
examined the significance of each one.
As per the table above, the distance of the flight and the time of its departure from the
airport play a significant role in estimating the delay.
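One way to read the significant coefficients is to convert them to odds ratios, since a logistic regression coefficient is a change in the log-odds of a delay. The sketch below uses the estimates from the table; the helper names and the 1,000-mile comparison are illustrative choices. The bucket coefficients are relative to the omitted reference bucket, which appears to be Afternoon because it is the only bucket without its own row.

```python
import math


def odds_ratio(coefficient, change=1.0):
    """Multiplicative change in the odds for a given change in the variable."""
    return math.exp(coefficient * change)


def z_value(estimate, std_error):
    """Wald z statistic, as reported in the regression table."""
    return estimate / std_error


# Distance: effect of flying 1,000 miles further
print(round(odds_ratio(0.000644, change=1000), 2))   # 1.9

# DepartureBucketNight versus the reference departure bucket
print(round(odds_ratio(-0.60119), 2))                # 0.55

# The z value is reproduced from estimate and standard error
print(round(z_value(-0.60119, 0.232124), 2))         # -2.59
```

Under this reading, an extra 1,000 miles roughly doubles the odds of a delay, while a departure in the Night bucket is associated with about 45% lower odds than the reference bucket. Recomputing the Distance z value from the printed figures gives about 7.40 rather than 7.407, because the table shows rounded estimates.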