US Commercial Flights
Francesca Pappalardo
29 January 2019
US commercial flight analysis
Introduction
This report describes the analyses performed on a data set provided by the site http://stat-computing.org/. The data come from the
Research and Innovative Technology Administration (RITA) and cover 22 years, from 1987 to 2008, with a total of about 123 million
observations and 29 variables. The main variables used in this analysis are described below.
Data Description (used in this analysis)
Year 1987-2008
Month 1-12
ArrTime actual arrival time (local, hhmm)
UniqueCarrier unique carrier code
FlightNum flight number
TailNum plane tail number
AirTime in minutes
ArrDelay arrival delay, in minutes
DepDelay departure delay, in minutes
Distance in miles
Cancelled was the flight cancelled?
CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Because the dataset is about 12 GB in size, it was loaded into an SQLite database, and a connection to that database is used to query the
data for the analysis.
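library(DBI)
library(dplyr) # tbl() on a database connection also needs the dbplyr package installed
path_db <- "C:/ProjectInferential/ontimefly.sqlite3"
con <- dbConnect(RSQLite::SQLite(), dbname = path_db)
# Helper: run a SQL query on the open connection and return a data frame
from_db <- function(sql) {
dbGetQuery(con, sql)
}
# Lazy reference to the "ontimefly" table (no rows are loaded yet)
ontime <- tbl(con, "ontimefly")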
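The benefit of the database connection is that aggregation can run inside SQLite, so only a small summary is brought into R. The following is a minimal sketch of that idea, assuming the DBI and RSQLite packages are installed; it uses a hypothetical in-memory toy table (`toy`, `con_demo`) rather than the real 12 GB file, and only the column names are taken from the data description.

```r
library(DBI)

# Toy stand-in for the real table (hypothetical values)
toy <- data.frame(
  Month = c(1L, 1L, 2L, 2L),
  Cancelled = c(0L, 1L, 0L, 1L),
  CancellationCode = c("", "A", "", "B"),
  stringsAsFactors = FALSE
)

# In-memory database, so nothing is written to disk
con_demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con_demo, "ontimefly", toy)

# SQLite does the grouping; R receives one row per month
cancelled_by_month <- dbGetQuery(
  con_demo,
  "SELECT Month, SUM(Cancelled) AS n_cancelled
     FROM ontimefly
    GROUP BY Month
    ORDER BY Month"
)
dbDisconnect(con_demo)

print(cancelled_by_month)
```

The same query string could be passed to the `from_db()` helper defined above when working against the real database.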
EDA (Exploratory Data Analysis)
The analyses performed in this report focus on flight cancellations, delays and performance. Before
extracting the specific data, it is useful to run some exploratory analyses on the entire dataset.
1. What is the main reason why flights are canceled?
To avoid inconsistencies, it is advisable to remove all null values.
Looking only at canceled flights, i.e. those with a value of 1 in the Cancelled variable, the values of the CancellationCode variable have been
recoded as follows:
Unknown: NA or empty (spelled 'Uknown' in the code and output)
Carrier: A
Weather: B
NAS: C
Security: D
library(ggplot2)
library(scales) # provides percent()
# Keep only canceled flights
cancellation <- flights[flights$Cancelled == 1,]
# Missing values, the string 'NA' and empty strings all become 'Uknown'
cancellation$CancellationCode[is.na(cancellation$CancellationCode) | cancellation$CancellationCode == 'NA' | cancellation$CancellationCode == ''] <- 'Uknown'
cancellation$CancellationCode[cancellation$CancellationCode == 'A'] <- 'Carrier'
cancellation$CancellationCode[cancellation$CancellationCode == 'B'] <- 'Weather'
cancellation$CancellationCode[cancellation$CancellationCode == 'C'] <- 'NAS'
cancellation$CancellationCode[cancellation$CancellationCode == 'D'] <- 'Security'
# Bar heights are shares of all canceled flights
plot_cancellation <- ggplot(data = cancellation, aes(x = CancellationCode)) +
geom_bar(aes(y = (..count..)/sum(..count..), fill = CancellationCode)) +
scale_y_continuous(labels = percent) +
ggtitle("Cancellation Causes") +
ylab("% Cancellation")
plot_cancellation
Cause of flight cancellation
The most frequent cancellation cause is Unknown, at about 80%, followed by Carrier at
around 15%.
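The recoding of CancellationCode described above can also be expressed as a single lookup instead of one assignment per code. This is a minimal base R sketch under stated assumptions: `raw_codes`, `code_labels` and `recoded` are hypothetical names, the sample values are invented, and the label is spelled "Unknown" here rather than as in the report's code.

```r
# Hypothetical sample of raw codes, including the three "unknown" forms
raw_codes <- c("A", "B", "", NA, "C", "D", "NA")

# Named vector used as a lookup table: names are codes, values are labels
code_labels <- c(A = "Carrier", B = "Weather", C = "NAS", D = "Security")

# Indexing by name returns NA for every code that is not in the table
recoded <- unname(code_labels[raw_codes])

# Anything that did not match a known code is treated as unknown
recoded[is.na(recoded)] <- "Unknown"

print(recoded)
```

A lookup keeps the mapping in one place, so adding or renaming a label means editing one vector rather than several assignment lines.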
2. Carrier distribution
Specific analyses are also carried out by carrier, so it is important to know how flights are distributed among carriers.
carrier <- flights%>%
filter(UniqueCarrier != "NA")
# Group every carrier outside the five main ones under "Other"
carrier$UniqueCarrier[!(carrier$UniqueCarrier %in% c("NW", "DL", "US", "AA", "UA"))] <- "Other"
carrier <- carrier %>%
group_by(UniqueCarrier) %>%
dplyr::summarize(Num = n())
The dataset contains 29 different carriers, but I analyze only the most important ones, which emerge from the analyses that follow.
Description: Carrier
American Airlines Inc.: AA
Delta Air Lines Inc.: DL
US Airways Inc.: US
Northwest Airlines Inc.: NW
United Air Lines Inc.: UA
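# Share of flights per carrier; rows are in the (alphabetical) order produced by group_by()
x <- carrier$Num / sum(carrier$Num)
# Build the labels from the summarized rows so that each slice is matched to its own carrier code
etichette <- paste0(carrier$UniqueCarrier, " (", round(x * 100, 1), "%)")
p <- pie(x, labels = etichette)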
Specific Analysis
This report carries out several analyses to answer the following questions:
1. In which month do the most cancellations occur?
2. Which season has the fewest delays?
3. Which aircraft manufacturer delivers the best performance?
4. Which of the two carriers with the most flights is faster?
1. In which month do the most cancellations occur?
Dataframe: flights_cancelled
## # A tibble: 123,534,969 x 4
## Month FlightNum Cancelled CancellationCode
## <int> <int> <int> <chr>
## 1 1 335 0 ""
## 2 1 3231 0 ""
## 3 1 448 0 ""
## 4 1 1746 0 ""
## 5 1 3920 0 ""
## 6 1 378 0 ""
## 7 1 509 0 ""
## 8 1 535 0 ""
## 9 1 11 0 ""
## 10 1 810 0 ""
## # ... with 123,534,959 more rows
To make the data quicker and clearer to read, flights that were not canceled (Cancelled equal to 0) are labeled 'No', and canceled flights
(Cancelled equal to 1) are labeled 'Yes'.
In addition, the reason for a cancellation is given by the CancellationCode variable, with the following values:
CancellationCode | Description
A | Carrier
B | Weather
C | NAS
D | Security
NA or "" | Unknown (spelled 'Uknown' in the code and output)
cancelled_analysis <- flights_cancelled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 0] <- 'No' # assign 'No' if the flight was not canceled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 1] <- 'Yes' # assign 'Yes' if the flight was canceled
cancelled_analysis$CancellationCode[ cancelled_analysis$CancellationCode == 'NA' | cancelled_analysis$CancellationCode == ''] <- 'Uknown'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'A'] <- 'Carrier'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'B'] <- 'Weather'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'C'] <- 'NAS'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'D'] <- 'Security'
cancelled_analysis$Month[cancelled_analysis$Month == 1] <- 'Juanuary'
cancelled_analysis$Month[cancelled_analysis$Month == 2] <- 'February'
cancelled_analysis$Month[cancelled_analysis$Month == 3] <- 'March'
cancelled_analysis$Month[cancelled_analysis$Month == 4] <- 'April'
cancelled_analysis$Month[cancelled_analysis$Month == 5] <- 'May'
cancelled_analysis$Month[cancelled_analysis$Month == 6] <- 'June'
cancelled_analysis$Month[cancelled_analysis$Month == 7] <- 'July'
cancelled_analysis$Month[cancelled_analysis$Month == 8] <- 'August'
cancelled_analysis$Month[cancelled_analysis$Month == 9] <- 'September'
cancelled_analysis$Month[cancelled_analysis$Month == 10] <- 'October'
cancelled_analysis$Month[cancelled_analysis$Month == 11] <- 'November'
cancelled_analysis$Month[cancelled_analysis$Month == 12] <- 'December'
na.omit(cancelled_analysis)
## # A tibble: 123,534,969 x 4
## Month FlightNum Cancelled CancellationCode
## <chr> <int> <chr> <chr>
## 1 Juanuary 335 No Uknown
## 2 Juanuary 3231 No Uknown
## 3 Juanuary 448 No Uknown
## 4 Juanuary 1746 No Uknown
## 5 Juanuary 3920 No Uknown
## 6 Juanuary 378 No Uknown
## 7 Juanuary 509 No Uknown
## 8 Juanuary 535 No Uknown
## 9 Juanuary 11 No Uknown
## 10 Juanuary 810 No Uknown
## # ... with 123,534,959 more rows
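The twelve month assignments above can be condensed with R's built-in `month.name` constant, which holds the English month names in calendar order. The sketch below is a minimal illustration with hypothetical month numbers (`month_numbers`, `month_labels` and `month_factor` are illustrative names, not part of the original analysis).

```r
# Hypothetical month numbers as they appear in the Month column
month_numbers <- c(1L, 2L, 12L, 7L)

# month.name is a base R constant: "January", "February", ..., "December"
month_labels <- month.name[month_numbers]

# A factor with calendar-ordered levels keeps months in calendar order
# on a plot axis, instead of the alphabetical order used for plain strings
month_factor <- factor(month_labels, levels = month.name)

print(month_labels)
print(levels(month_factor)[1:3])
```

Using a factor matters mainly for plots grouped by month, where character labels would otherwise be sorted alphabetically.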
To get an overview of flight cancellations, I show the number of flights for each value of CancellationCode (flights that were not canceled fall under Uknown).
cancelled_analysis %>%
group_by(CancellationCode) %>%
tally %>%
arrange(desc(n))
## # A tibble: 5 x 2
## CancellationCode n
## <chr> <int>
## 1 Uknown 122800263
## 2 Carrier 317972
## 3 Weather 267054
## 4 NAS 149079
## 5 Security 601
Results: causes of cancellation
Carrier: 317972 flights
Weather: 267054 flights
NAS: 149079 flights
Security: 601 flights
I count the number of canceled and non-canceled flights.
cancelled_analysis %>%
group_by(Cancelled) %>%
tally %>%
arrange(desc(n))
## # A tibble: 2 x 2
## Cancelled n
## <chr> <int>
## 1 No 121231645
## 2 Yes 2303324
Result:
Number of canceled flights: 2303324
Number of flights not canceled: 121231645
percent(2303324/121231645)
## [1] "1.90%"
1.90% is the ratio of canceled flights to flights that were not canceled. As a share of all 123,534,969 flights, the cancellation rate is about 1.86%.
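Two different quantities can be formed from the counts above, and they are easy to confuse: the ratio of canceled to non-canceled flights, and the share of canceled flights among all flights. The short sketch below computes both from the counts reported in the tally output; the variable names are illustrative.

```r
# Counts taken from the tally output above
cancelled <- 2303324
not_cancelled <- 121231645
total <- cancelled + not_cancelled

# Ratio of canceled to non-canceled flights (what the call above computes)
ratio_to_not_cancelled <- cancelled / not_cancelled

# Share of canceled flights among all flights (the empirical cancellation rate)
share_of_all <- cancelled / total

print(round(100 * c(ratio = ratio_to_not_cancelled, share = share_of_all), 2))
```

With so few cancellations the two numbers are close, but only the second one is a proportion of all flights.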
Analysis of canceled flights only
Dataframe: cancelled_Flights (Contains only canceled flights)
cancelled_Flights <- cancelled_analysis%>%
filter(Cancelled == 'Yes') %>%
group_by(FlightNum, Month, CancellationCode, Cancelled) %>% as_data_frame()
na.omit(cancelled_Flights)
## # A tibble: 2,303,324 x 4
## Month FlightNum Cancelled CancellationCode
## <chr> <int> <chr> <chr>
## 1 Juanuary 126 Yes Carrier
## 2 Juanuary 1146 Yes Carrier
## 3 Juanuary 469 Yes Carrier
## 4 Juanuary 618 Yes NAS
## 5 Juanuary 2528 Yes Carrier
## 6 Juanuary 437 Yes Carrier
## 7 Juanuary 934 Yes Carrier
## 8 Juanuary 3326 Yes Carrier
## 9 Juanuary 1402 Yes Carrier
## 10 Juanuary 2205 Yes Carrier
## # ... with 2,303,314 more rows
cancelledplot <- ggplot(cancelled_Flights, aes( Month, fill=CancellationCode)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
ylab("Percentages")+
xlab("Month")
cancelledplot
Results: Most cancellations have an "Unknown" cause, and these are most frequent in
January, followed by September.
2. Which season has the fewest delays?
Data preparation
This analysis identifies the best season to travel, i.e. the one with the fewest departure delays.
Since the analysis concerns the delays in the four seasons, the Month variable is recoded according to the following legend:
1: winter (months 12, 1, 2)
2: spring (months 3, 4, 5)
3: summer (months 6, 7, 8)
4: fall (months 9, 10, 11)
flights_effective$Month [flights_effective$Month == 1] <- 1
flights_effective$Month [flights_effective$Month == 2] <- 1
flights_effective$Month [flights_effective$Month == 12] <- 1
flights_effective$Month [flights_effective$Month == 3] <- 2
flights_effective$Month [flights_effective$Month == 4] <- 2
flights_effective$Month [flights_effective$Month == 5] <- 2
flights_effective$Month [flights_effective$Month == 6] <- 3
flights_effective$Month [flights_effective$Month == 7] <- 3
flights_effective$Month [flights_effective$Month == 8] <- 3
flights_effective$Month [flights_effective$Month == 9] <- 4
flights_effective$Month [flights_effective$Month == 10] <- 4
flights_effective$Month [flights_effective$Month == 11] <- 4
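The month-to-season recoding above works because each comparison runs before its target code is assigned, which makes it sensitive to the order of the lines. A lookup vector avoids that dependency. The following is a minimal sketch with hypothetical month values; `season_of_month`, `months`, `season_code` and `season` are illustrative names.

```r
# Position i holds the season code of month i (1 = winter, 2 = spring,
# 3 = summer, 4 = fall), following the legend above
season_of_month <- c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 1L)

# Hypothetical month numbers
months <- c(12L, 1L, 4L, 7L, 10L)

# One vectorized lookup replaces the twelve assignments
season_code <- season_of_month[months]

# Optional: a labelled factor, with levels in the order of the codes
season <- factor(season_code, levels = 1:4,
                 labels = c("Winter", "Spring", "Summer", "Fall"))

print(season)
```

Storing the result in a separate column, rather than overwriting Month, also keeps the original month available for later steps.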
To provide a detailed analysis, we compute the mean, standard error, confidence interval and t-test for the DepDelay variable.
meanDepDelay <-mean(flights_effective$DepDelay)
standardDepDelay <- sd(flights_effective$DepDelay)/sqrt(length(flights_effective$DepDelay))
# CI() comes from the Rmisc package; it returns a 95% confidence interval for the mean DepDelay
ci <- CI(flights_effective$DepDelay)
# One-sample t-test against the default null hypothesis mu = 0
t.test(flights_effective$DepDelay, alternative="two.sided", conf.level = .95)
##
## One Sample t-test
##
## data: flights_effective$DepDelay
## t = 3155.6, df = 121230000, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8.165247 8.175396
## sample estimates:
## mean of x
## 8.170322
meanDepDelay
## [1] 8.170322
standardDepDelay # how accurate is this point estimate?
## [1] 0.002589136
ci
## upper mean lower
## 8.175396 8.170322 8.165247
Results: The sample mean of the DepDelay variable is 8.170322 minutes. How accurate is this point estimate? The standard error answers the question: it is
about 0.0026 minutes. A more readable result is the 95% confidence interval (CI):
upper: 8.175396
mean: 8.170322
lower: 8.165247
Testing the mean DepDelay of a flight: the t-test also produces the p-value, which is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. The p-value is compared with the significance level of the test. Here
p < 2.2e-16, which suggests that the null hypothesis (a mean delay of 0) is very unlikely to be true. The smaller the p-value, the more confidently we can reject the null hypothesis.
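The standard error, confidence interval and t statistic reported above are linked by simple formulas. The sketch below reproduces them by hand on a small hypothetical sample of delays (the values in `x` are invented for illustration) and compares the result with what `t.test()` returns.

```r
# Hypothetical departure delays in minutes
x <- c(12, -3, 0, 45, 7, 5, -1, 20, 3, 9)

n <- length(x)
m <- mean(x)

# Standard error of the mean: sample standard deviation over sqrt(n)
se <- sd(x) / sqrt(n)

# 95% confidence interval: mean +/- t quantile * standard error
t_crit <- qt(0.975, df = n - 1)
ci_manual <- c(lower = m - t_crit * se, upper = m + t_crit * se)

# One-sample t-test against mu = 0; its statistic equals mean / standard error
tt <- t.test(x, conf.level = 0.95)

print(ci_manual)
print(tt$conf.int)
```

With more than 121 million observations the standard error becomes tiny, which is why the interval reported above is so narrow and the p-value so small.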
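# Boxes are drawn in the order of the season codes 1-4 (winter, spring, summer, fall),
# so the names and border colours are listed in that order
seasonplot <- boxplot(formula = DepDelay ~ Month,
data = flights_effective,
main = 'Departure delays depending on the season',
xlab = 'Season',
ylab = 'Departure delay (minutes)',
border = c('skyblue', 'springgreen', 'yellow', 'orange'),
names = c('Winter', 'Spring', 'Summer', 'Fall'))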
Result: The plot shows the four seasons on the x-axis and the departure delay in minutes on the y-axis, ranging from about -1000 to 2000. The longest delay occurs in
Winter, with a delay of more than 2000 minutes, while Spring shows the minimum departure delay, with a negative value.
3. Which aircraft manufacturer delivers the best performance?
It is important to evaluate which manufacturer's planes perform best.
This analysis needs additional information, contained in the CSV file "plane-data.csv".
plane_data <- read_csv("C:/ProjectInferential/plane-data.csv")
plane_data <- na.omit(plane_data)
To get a clear view of the manufacturers, I create a legend that identifies them quickly. In particular, the manufacturers that occur most often are highlighted.
Legend:
Embraer: E
Boeing: B
Airbus Industrie: A
Other: O
plane_performance <- na.omit(plane_performance)
For more information, we count the number of planes for each manufacturer.
plane_performance %>%
group_by(manufacturer) %>%
tally %>%
arrange(desc(n))
## # A tibble: 4 x 2
## manufacturer n
## <chr> <int>
## 1 B 2061
## 2 O 1397
## 3 E 588
## 4 A 434
Results:
Manufacturer B: 2061
Manufacturer O: 1397
Manufacturer E: 588
Manufacturer A: 434
I create a single data frame containing flight and plane information by joining the two data frames on the TailNum variable, which should be unique for each plane.
For better modeling and interpretation, I plot the density for clear and efficient reading of the data. Kernel density plots are usually a much more effective way to view the
distribution of a variable.
mdensity <- ggplot(plane_performance, aes(x=airtime))
mdensity + geom_density(aes(colour=manufacturer, fill=manufacturer), alpha=0.3)+
theme_gray(base_size=14)
## Warning: Removed 3 rows containing non-finite values (stat_density).
To explore AirTime by manufacturer, I compute the mean, standard deviation and median and show them in plots.
Meanplot <- ggplot(airtimeMean, aes(Manufacturer, x, fill=Manufacturer))+
geom_bar(stat="identity", position="dodge") +
xlab("Manufacturer")+
ylab("Hourly Mean AirTime")+
theme_gray(base_size = 14)
StandardDeviationplot <-ggplot(airtimeSD, aes(Manufacturer, x, fill=Manufacturer))+
geom_bar(stat="identity", position="dodge") +
xlab("Manufacturer")+
ylab("Hourly AirTime SD")+
theme_gray(base_size = 14)
Medianplot <- ggplot(airtimemedian, aes(Manufacturer, x, fill=Manufacturer))+
geom_bar(stat="identity", position="dodge") +
xlab("Manufacturer")+
ylab("Hourly Median AirTime")+
theme_gray(base_size = 14)
# ggarrange() comes from the ggpubr package
ggarrange(Meanplot, StandardDeviationplot, Medianplot, ncol = 2, nrow = 2)
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1 rows containing missing values (geom_bar).
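To evaluate flight performance, performance_index is computed as the sum of DepDelay and ArrDelay divided by the flight time AirTime, i.e. (DepDelay + ArrDelay) / AirTime. It therefore measures minutes of delay per minute of air time.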
## # A tibble: 4,480 x 6
## # Groups: tailnum, manufacturer [4,480]
## tailnum manufacturer arrdelay depdelay airtime performance_index
## <chr> <chr> <int> <int> <int> <dbl>
## 1 N10156 E 34 35 160 0.431
## 2 N102UW A 23 20 308 0.140
## 3 N10323 B 88 70 208 0.760
## 4 N103US A 6 11 251 0.0677
## 5 N104UA B 193 185 240 1.58
## 6 N104UW A 14 23 288 0.128
## 7 N10575 E 80 85 55 3
## 8 N105UA B 3 27 206 0.146
## 9 N105UW A 45 24 256 0.270
## 10 N106US A 7 5 65 0.185
## # ... with 4,470 more rows
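The index in the table above can be checked by hand. The sketch below recomputes performance_index for three of the rows shown (N10156, N102UW and N10575), assuming the definition (DepDelay + ArrDelay) / AirTime; `sample_planes` is an illustrative name.

```r
# Three rows copied from the output above
sample_planes <- data.frame(
  tailnum  = c("N10156", "N102UW", "N10575"),
  arrdelay = c(34, 23, 80),
  depdelay = c(35, 20, 85),
  airtime  = c(160, 308, 55),
  stringsAsFactors = FALSE
)

# Minutes of delay (departure plus arrival) per minute of air time
sample_planes$performance_index <-
  (sample_planes$depdelay + sample_planes$arrdelay) / sample_planes$airtime

print(round(sample_planes$performance_index, 3))
```

The rounded values agree with the performance_index column printed in the table.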
plane_performance_index <- plane_performance %>%
group_by(tailnum)%>%
dplyr::summarise(avg_performance_index = mean(performance_index, na.rm=FALSE)) %>% as_data_frame()
plane_performance_index <- na.omit(plane_performance_index)
Result: The highest performance index belongs to TailNum N581SW, with a value of 50.000000. The lowest performance index belongs to TailNum N5ETAA, with a value of 0.005524862.
To attach the manufacturer to each plane, the two data frames are combined through the TailNum key.
pos1 <- match(plane_performance_index$tailnum, plane_data$tailnum)
plane_performance_index$manufacturer <- plane_data$manufacturer[pos1]
plot1 <- ggplot(f, aes(x = factor(manufacturer) , y =p))+
geom_bar(colour ="blue", stat = "identity")+
ggtitle("Performance Index based on the manufacturer of plane")+
guides(fill=FALSE)+
xlab("Manufacturer") +
ylab("Performance Index") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot1 <- ggplotly(plot1)
plot1
Result: As the plot shows, the highest performance index belongs to the manufacturer FEDERICK CHRISK and the lowest to the manufacturer BAUMAN RANDY.
4. Which of the two carriers with the most flights is faster?
In this analysis, we first determine the two carriers that operate the most flights, and then compare them to find out which is faster.
data_carrier_air <- select(ontime, Distance, AirTime, UniqueCarrier, DepDelay, ArrDelay) %>%
filter((ArrDelay > 0) & (DepDelay > 0) & (AirTime > 0)) %>%
group_by(UniqueCarrier, DepDelay, ArrDelay) %>% as_data_frame()
data_carrier_air <- na.omit(data_carrier_air)
Covariance between AirTime and Distance
cov(data_carrier_air$Distance, data_carrier_air$AirTime)
## [1] 25409.49
Covariance result: the positive value returned by the cov (covariance) function indicates that AirTime and Distance vary in the same direction: as Distance increases, AirTime
tends to increase as well. Covariance depends on the units of the two variables, so it shows the direction of the linear relationship but not its strength.
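Covariance and correlation are closely related: Pearson's correlation is the covariance divided by the product of the two standard deviations, which removes the dependence on units. The sketch below illustrates this on a small hypothetical sample (the distances and air times are invented for illustration).

```r
# Hypothetical flights: distance in miles, air time in minutes
distance_miles <- c(200, 450, 800, 1200, 2500)
airtime_min    <- c(40, 75, 120, 170, 320)

# Covariance changes when the unit of measurement changes
cov_miles <- cov(distance_miles, airtime_min)
cov_km    <- cov(distance_miles * 1.609344, airtime_min)

# Correlation is unit-free: covariance scaled by both standard deviations
r          <- cor(distance_miles, airtime_min)
r_from_cov <- cov_miles / (sd(distance_miles) * sd(airtime_min))
r_km       <- cor(distance_miles * 1.609344, airtime_min)

print(c(cov_miles = cov_miles, cov_km = cov_km))
print(c(r = r, r_from_cov = r_from_cov, r_km = r_km))
```

This is why the strength of the relationship is better judged from a correlation coefficient than from the covariance alone.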
The frequencies and relative frequencies of each carrier are then computed to find the carriers that operate the most flights.
Calculation of frequencies and relative frequencies of carriers
flight_to_carrier <- cbind (Frequency = table(data_carrier_air$UniqueCarrier), RelFreq = prop.table(table(data_carrier_air$UniqueCarrier)) )
flight_to_carrier
## Frequency RelFreq
## 9E 134604 0.0033550012
## AA 4505617 0.1123023864
## AQ 41702 0.0010394213
## AS 928903 0.0231528831
## B6 261440 0.0065163852
## CO 2348172 0.0585281260
## DH 199362 0.0049690927
## DL 6239652 0.1555231636
## EA 215526 0.0053719799
## EV 588954 0.0146796631
## F9 117590 0.0029309277
## FL 394566 0.0098345473
## HA 38710 0.0009648457
## HP 1176193 0.0293165799
## ML (1) 14393 0.0003587452
## MQ 1299845 0.0323986028
## NW 2702993 0.0673720301
## OH 413364 0.0103030869
## OO 893097 0.0222604195
## PA (1) 61635 0.0015362508
## PI 466058 0.0116164835
## PS 35534 0.0008856840
## TW 1092832 0.0272388091
## TZ 47440 0.0011824408
## UA 4669724 0.1163927491
## US 5217884 0.1300556228
## WN 5083808 0.1267137820
## XE 664729 0.0165683530
## YV 266076 0.0066319374
Within this filtered data set (flights with positive departure delay, arrival delay and air time), the largest number of flights is operated by DL (Delta Air Lines Inc.), with 15.55%, followed by US (US Airways Inc.), with 13.0%.
Pearson’s Correlation Test
Correlation between AirTime and Distance
cor.test(data_carrier_air$AirTime, data_carrier_air$Distance)
##
## Pearson's product-moment correlation
##
## data: data_carrier_air$AirTime and data_carrier_air$Distance
## t = 5112.5, df = 40120000, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6278910 0.6282657
## sample estimates:
## cor
## 0.6280784
Correlation: 0.6280784
Regression Analysis
airtime.lm.DL <- lm(formula = AirTime ~ Distance,
data = data_carrier_air,
subset = UniqueCarrier == "DL" )
summary (airtime.lm.DL)
##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
## "DL")
##
## Residuals:
## Min 1Q Median 3Q Max
## -381.96 -37.16 20.30 39.22 481.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.595e-01 3.822e-02 14.64 <2e-16 ***
## Distance 8.472e-02 4.236e-05 1999.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.28 on 6239650 degrees of freedom
## Multiple R-squared: 0.3906, Adjusted R-squared: 0.3906
## F-statistic: 4e+06 on 1 and 6239650 DF, p-value: < 2.2e-16
airtime.lm.US <- lm(formula = AirTime ~ Distance,
data = data_carrier_air,
subset = UniqueCarrier == "US" )
summary (airtime.lm.US)
##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
## "US")
##
## Residuals:
## Min 1Q Median 3Q Max
## -256.40 -28.59 15.26 35.60 339.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.590e-01 3.515e-02 -21.59 <2e-16 ***
## Distance 8.632e-02 4.552e-05 1896.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.6 on 5217882 degrees of freedom
## Multiple R-squared: 0.4081, Adjusted R-squared: 0.4081
## F-statistic: 3.597e+06 on 1 and 5217882 DF, p-value: < 2.2e-16
For US, Multiple R-squared is 0.4081 (0.3906 for DL), meaning that Distance explains about 41% (39% for DL) of the variance in AirTime.
Multiple R-squared never decreases when more independent variables are added, whereas Adjusted R-squared can decrease if an added independent variable does not help the model.
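The difference between the two measures can be seen on simulated data. The sketch below is only an illustration under stated assumptions: the data are randomly generated (not taken from the flight data set), and `noise` is a predictor that is unrelated to the response by construction.

```r
set.seed(1)
n <- 50

# Simulated data: air time depends on distance plus random error
distance <- runif(n, 100, 2500)
airtime  <- 10 + 0.12 * distance + rnorm(n, sd = 20)
noise    <- rnorm(n)  # unrelated to airtime by construction

fit_small <- lm(airtime ~ distance)
fit_big   <- lm(airtime ~ distance + noise)

r2_small <- summary(fit_small)$r.squared
r2_big   <- summary(fit_big)$r.squared

# Adjusted R-squared penalizes the number of predictors p
p <- 2
adj_manual <- 1 - (1 - r2_big) * (n - 1) / (n - p - 1)

print(c(r2_small = r2_small, r2_big = r2_big))
print(c(adj_manual = adj_manual, adj_from_summary = summary(fit_big)$adj.r.squared))
```

With a single predictor, as in the two models above, the two measures are practically identical, which matches the summaries shown for DL and US.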
# plot() on an lm object draws the diagnostic plots and returns NULL, so its result is not stored
plot(airtime.lm.DL)
plot(airtime.lm.US)
Residuals vs Fitted: a scatterplot with residuals on the y-axis and fitted values on the x-axis. It is used to detect non-linearity by checking the red line for questionable patterns.
The characteristics of a well-behaved residuals vs fitted plot are:
The residuals bounce randomly around the 0 line.
The residuals roughly form a horizontal band around the 0 line.
No residual stands out.
In this plot the residuals lie randomly around the 0 line, which suggests that the assumption of a linear relationship is reasonable.
Scale-Location: this plot shows whether the residuals are spread equally along the range of the predictor.
Normal Q-Q: this scatterplot shows whether the residuals are normally distributed. The closer the points fall to the diagonal line, the more reasonably the residuals can be regarded as normally
distributed.
DL <- subset(data_carrier_air, UniqueCarrier == 'DL')
US <- subset(data_carrier_air, UniqueCarrier == 'US')
finalplot <- plot(x=DL$Distance,
y=DL$AirTime,
xlab = 'Distance',
ylab = 'Air Time',
main = 'Air time based on distance by carrier',
pch=20,
col='dodgerblue1'
)
points (x = US$Distance,
y = US$AirTime,
pch=20,
col='forestgreen'
)
abline (airtime.lm.DL , col = 'slateblue1')
abline (airtime.lm.US, col = 'springgreen1')
legend ('topleft',
legend = c('Delta Air Lines Inc.', 'US Airways Inc.'),
col = c('dodgerblue1', 'forestgreen'),
pch = 20)
finalplot
## NULL
The plot shows the relationship between Distance and AirTime for the carriers DL and US (flight times are all positive integers). Who is faster? The regression line for DL is drawn in slateblue1
and the one for US in springgreen1. It is best to travel with Delta Air Lines Inc. for distances of less than 1500 miles and with US Airways Inc. for distances greater than 1500 miles.
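As a cross-check on the reading taken from the plot, the distance at which the two fitted lines meet can be computed directly from the coefficients printed in the two summary() outputs above. The sketch below hard-codes those rounded coefficients; the variable and function names are illustrative and not part of the original analysis. With these values the lines cross at roughly 820 miles, with US giving the slightly lower predicted air time below that distance and DL above it, so thresholds read from the plot are best treated as approximate.

```r
# Rounded coefficients copied from the regression summaries above
b0_dl <- 0.5595;  b1_dl <- 0.08472   # DL: intercept, slope
b0_us <- -0.7590; b1_us <- 0.08632   # US: intercept, slope

# The lines cross where b0_dl + b1_dl * d == b0_us + b1_us * d
crossing_miles <- (b0_dl - b0_us) / (b1_us - b1_dl)

# Predicted air time (minutes) for a given distance (miles)
predict_dl <- function(d) b0_dl + b1_dl * d
predict_us <- function(d) b0_us + b1_us * d

print(round(crossing_miles))
print(c(dl_500 = predict_dl(500), us_500 = predict_us(500)))
print(c(dl_2000 = predict_dl(2000), us_2000 = predict_us(2000)))
```

The predicted difference between the two carriers is small at every distance (about two minutes at 2000 miles), so the comparison is sensitive to the rounding of the coefficients.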
Conclusion
The analyses carried out over all the years (1987-2008) show that the worst month to depart is January, which has the largest number of cancellations with an unknown reason. The average
departure delay is 8.17 minutes. The season with the longest departure delays is Winter, although negative DepDelay values also occur in Winter. The plane with the highest performance index is tail number N581SW, built by the
manufacturer FEDERICK CHRISK; the plane with the lowest is tail number N5ETAA, built by the manufacturer BAUMAN RANDY. The two companies that operate the most flights are DL (Delta
Air Lines), with about 15%, and US (US Airways), with 13%. In addition, DL is faster on flights with a distance of less than 1,500 miles.

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 

Report Statistical Analysis

  • 1. US Commercial Flights
Francesca Pappalardo
29 January 2019
US commercial flight analysis
Introduction
This report describes the analyses performed on a data set provided by the site http://stat-computing.org/. The data come from the Research and Innovative Technology Administration (RITA). They cover 22 years, from 1987 to 2008, with a total of about 123 million observations and 29 variables. The main variables used are listed below with their descriptions.
Data Description (used in this analysis)
Year 1987-2008
Month 1-12
ArrTime actual arrival time (local, hhmm)
UniqueCarrier unique carrier code
FlightNum flight number
TailNum plane tail number
AirTime in minutes
ArrDelay arrival delay, in minutes
DepDelay departure delay, in minutes
Distance in miles
Cancelled was the flight cancelled?
CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
The compressed data set is 12 GB, so it was appropriate to load it into an SQLite database and open a connection to it for the analysis.
library(DBI)
library(dplyr)
library(ggplot2)
library(scales)  # percent()
path_db <- "C:/ProjectInferential/ontimefly.sqlite3"
con <- dbConnect(RSQLite::SQLite(), dbname = path_db)
# Helper that runs a SQL query on the open connection
from_db <- function(sql) {
  dbGetQuery(con, sql)
}
ontime <- tbl(con, "ontimefly")
EDA (Exploratory Data Analysis)
The analyses in this report focus on the cancellations, delays and performance of flights. Before extracting the specific data, it is useful to run exploratory analyses on the entire data set.
1. What is the main reason why flights are canceled?
To avoid errors and inconsistencies, it is advisable to eliminate all null values.
Taking all the canceled flights, that is, those with a value of 1 in the Cancelled variable, the values of the CancellationCode variable have been recoded as follows:
Uknown: NA or empty
Carrier: A
Weather: B
NAS: C
Security: D
cancellation <- flights[flights$Cancelled == 1,]
# Missing, 'NA' and empty codes are all grouped under 'Uknown'
cancellation$CancellationCode[is.na(cancellation$CancellationCode) | cancellation$CancellationCode == 'NA' | cancellation$CancellationCode == ''] <- 'Uknown'
cancellation$CancellationCode[cancellation$CancellationCode == 'A'] <- 'Carrier'
cancellation$CancellationCode[cancellation$CancellationCode == 'B'] <- 'Weather'
cancellation$CancellationCode[cancellation$CancellationCode == 'C'] <- 'NAS'
cancellation$CancellationCode[cancellation$CancellationCode == 'D'] <- 'Security'
# Bar chart of the share of canceled flights per cause
plot_cancellation <- ggplot(data = cancellation, aes(x = CancellationCode)) +
  geom_bar(aes(y = (..count..)/sum(..count..), fill = CancellationCode)) +
  scale_y_continuous(labels = percent) +
  ggtitle("Cancellation Causes") +
  ylab("% Cancellation")
plot_cancellation
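The recoding of CancellationCode can also be written with a named lookup vector instead of one assignment per code. The following is only a minimal sketch on a hypothetical toy vector; the names `codes`, `code_labels` and `recode_cancellation` are illustrative assumptions and do not appear in the report.

```r
# Toy stand-in for the CancellationCode column (hypothetical values).
codes <- c("A", "B", "", NA, "C", "D", "A")

# Named lookup vector: raw code -> label used in the report.
code_labels <- c(A = "Carrier", B = "Weather", C = "NAS", D = "Security")

recode_cancellation <- function(x) {
  out <- unname(code_labels[x])   # codes with no match ("" or NA) come back as NA
  out[is.na(out)] <- "Uknown"     # same label spelling as in the report
  out
}

labels <- recode_cancellation(codes)
table(labels)
```

A lookup vector keeps the code-to-label mapping in one place, which makes it harder for two recoding steps to drift apart.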
  • 2. Cause of Cancellation of flight
The most frequent cause of cancellation is Unknown (label Uknown), at about 80%, followed by Carrier at around 15%.
2. Carrier Distribution
Specific analyses have also been carried out by carrier, so it is important to know how flights are distributed across carriers.
carrier <- flights %>% filter(UniqueCarrier != "NA")
# Group every carrier outside the five main ones under "Other"
carrier$UniqueCarrier[!(carrier$UniqueCarrier %in% c("NW", "DL", "US", "AA", "UA"))] <- "Other"
carrier <- carrier %>% group_by(UniqueCarrier) %>% dplyr::summarize(Num = n())
The dataset contains 29 different carriers, but I analyze only the most important ones, as identified by the subsequent analyses.
Carrier description
American Airlines Inc.: AA
Delta Air Lines Inc.: DL
US Airways Inc.: US
Northwest Airlines Inc.: NW
United Air Lines Inc.: UA
x <- carrier$Num / sum(carrier$Num)
# Labels are built from the summarized table so that each slice matches its carrier
etichette <- paste0(carrier$UniqueCarrier, " (", round(x * 100, 1), "%)")
p <- pie(x, labels = etichette)
  • 3. Specific Analysis
This report performs several analyses that answer the following questions:
1. In which month did the most cancellations occur?
2. Which season has the fewest delays?
3. Which aircraft manufacturer gives the best performance?
4. Which of the two carriers that operate the most flights is faster?
1. In which month did the most cancellations occur?
Dataframe: flights_cancelled
## # A tibble: 123,534,969 x 4
##    Month FlightNum Cancelled CancellationCode
##    <int>     <int>     <int> <chr>
##  1     1       335         0 ""
##  2     1      3231         0 ""
##  3     1       448         0 ""
##  4     1      1746         0 ""
##  5     1      3920         0 ""
##  6     1       378         0 ""
##  7     1       509         0 ""
##  8     1       535         0 ""
##  9     1        11         0 ""
## 10     1       810         0 ""
## # ... with 123,534,959 more rows
To make the data quick and clear to read, I label flights that were not canceled (Cancelled equal to 0) as 'No' and canceled flights (Cancelled equal to 1) as 'Yes'. The reason for a cancellation is given by the CancellationCode variable, with the following values:
CancellationCode | Description
A | Carrier
B | Weather
C | NAS
D | Security
NA or "" | Uknown
  • 4. cancelled_analysis <- flights_cancelled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 0] <- 'No'   # assign 'No' if the flight was not canceled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 1] <- 'Yes'  # assign 'Yes' if the flight was canceled
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'NA' | cancelled_analysis$CancellationCode == ''] <- 'Uknown'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'A'] <- 'Carrier'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'B'] <- 'Weather'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'C'] <- 'NAS'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'D'] <- 'Security'
cancelled_analysis$Month[cancelled_analysis$Month == 1] <- 'Juanuary'
cancelled_analysis$Month[cancelled_analysis$Month == 2] <- 'February'
cancelled_analysis$Month[cancelled_analysis$Month == 3] <- 'March'
cancelled_analysis$Month[cancelled_analysis$Month == 4] <- 'April'
cancelled_analysis$Month[cancelled_analysis$Month == 5] <- 'May'
cancelled_analysis$Month[cancelled_analysis$Month == 6] <- 'June'
cancelled_analysis$Month[cancelled_analysis$Month == 7] <- 'July'
cancelled_analysis$Month[cancelled_analysis$Month == 8] <- 'August'
cancelled_analysis$Month[cancelled_analysis$Month == 9] <- 'September'
cancelled_analysis$Month[cancelled_analysis$Month == 10] <- 'October'
cancelled_analysis$Month[cancelled_analysis$Month == 11] <- 'November'
cancelled_analysis$Month[cancelled_analysis$Month == 12] <- 'December'
na.omit(cancelled_analysis)
## # A tibble: 123,534,969 x 4
##    Month    FlightNum Cancelled CancellationCode
##    <chr>        <int> <chr>     <chr>
##  1 Juanuary       335 No        Uknown
##  2 Juanuary      3231 No        Uknown
##  3 Juanuary       448 No        Uknown
##  4 Juanuary      1746 No        Uknown
##  5 Juanuary      3920 No        Uknown
##  6 Juanuary       378 No        Uknown
##  7 Juanuary       509 No        Uknown
##  8 Juanuary       535 No        Uknown
##  9 Juanuary        11 No        Uknown
## 10 Juanuary       810 No        Uknown
## # ... with 123,534,959 more rows
To get an overview of flight cancellations, I show the number of flights for each value of CancellationCode.
cancelled_analysis %>% group_by(CancellationCode) %>% tally %>% arrange(desc(n))
## # A tibble: 5 x 2
##   CancellationCode         n
##   <chr>                <int>
## 1 Uknown           122800263
## 2 Carrier             317972
## 3 Weather             267054
## 4 NAS                 149079
## 5 Security               601
Results: causes of cancellations
Carrier: 317972 flights
Weather: 267054 flights
NAS: 149079 flights
Security: 601 flights
The Uknown group is larger than the number of canceled flights because it also contains the flights that were not canceled, whose code is empty.
I count the numbers of flights canceled and not canceled.
cancelled_analysis %>% group_by(Cancelled) %>% tally %>% arrange(desc(n))
## # A tibble: 2 x 2
##   Cancelled         n
##   <chr>         <int>
## 1 No        121231645
## 2 Yes         2303324
Result:
The number of canceled flights: 2303324
The number of flights not canceled: 121231645
percent(2303324/121231645)
## [1] "1.90%"
This 1.90% is the ratio of canceled to non-canceled flights. As a share of all 123,534,969 flights, cancellations amount to about 1.86%.
Analysis of canceled flights only
Dataframe: cancelled_Flights (contains only canceled flights)
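The two ways of expressing the cancellation figure can be checked directly from the counts reported above. This is a small arithmetic sketch; the variable names are illustrative and not taken from the report.

```r
# Counts copied from the tally of the Cancelled variable.
cancelled     <- 2303324
not_cancelled <- 121231645
total         <- cancelled + not_cancelled

# Ratio of canceled to non-canceled flights (what percent(2303324/121231645) computes).
ratio_to_not_cancelled <- cancelled / not_cancelled

# Share of canceled flights among all flights, i.e. the empirical cancellation rate.
share_of_all_flights <- cancelled / total

ratio_pct <- round(100 * ratio_to_not_cancelled, 2)
share_pct <- round(100 * share_of_all_flights, 2)
c(ratio = ratio_pct, share = share_pct)
```

With so few cancellations the two figures are close (1.90 versus 1.86), but only the second one is a proportion of all flights.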
  • 5. cancelled_Flights <- cancelled_analysis %>%
  filter(Cancelled == 'Yes') %>%
  group_by(FlightNum, Month, CancellationCode, Cancelled) %>%
  as_data_frame()
na.omit(cancelled_Flights)
## # A tibble: 2,303,324 x 4
##    Month    FlightNum Cancelled CancellationCode
##    <chr>        <int> <chr>     <chr>
##  1 Juanuary       126 Yes       Carrier
##  2 Juanuary      1146 Yes       Carrier
##  3 Juanuary       469 Yes       Carrier
##  4 Juanuary       618 Yes       NAS
##  5 Juanuary      2528 Yes       Carrier
##  6 Juanuary       437 Yes       Carrier
##  7 Juanuary       934 Yes       Carrier
##  8 Juanuary      3326 Yes       Carrier
##  9 Juanuary      1402 Yes       Carrier
## 10 Juanuary      2205 Yes       Carrier
## # ... with 2,303,314 more rows
# Share of canceled flights per month, colored by cancellation cause
cancelledplot <- ggplot(cancelled_Flights, aes(Month, fill = CancellationCode)) +
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  ylab("Percentages") +
  xlab("Month")
cancelledplot
Results: Most cancellations fall under the "Unknown" cause, and they are most frequent in January, followed by September.
2. Which season has the fewest delays?
Data Preparation
In this analysis, we look for the best season to travel, that is, the one with the fewest departure delays. Since the analysis concerns the delays that occurred in the four seasons, the following legend is used for clarity:
Legend:
1: winter (Month: 1, 2, 12)
2: spring (Month: 3, 4, 5)
3: summer (Month: 6, 7, 8)
4: fall (Month: 9, 10, 11)
  • 6. flights_effective$Month [flights_effective$Month == 1] <- 1
flights_effective$Month [flights_effective$Month == 2] <- 1
flights_effective$Month [flights_effective$Month == 12] <- 1
flights_effective$Month [flights_effective$Month == 3] <- 2
flights_effective$Month [flights_effective$Month == 4] <- 2
flights_effective$Month [flights_effective$Month == 5] <- 2
flights_effective$Month [flights_effective$Month == 6] <- 3
flights_effective$Month [flights_effective$Month == 7] <- 3
flights_effective$Month [flights_effective$Month == 8] <- 3
flights_effective$Month [flights_effective$Month == 9] <- 4
flights_effective$Month [flights_effective$Month == 10] <- 4
flights_effective$Month [flights_effective$Month == 11] <- 4
To provide a detailed analysis, we calculate the mean, the standard error, the confidence interval and a t-test for the DepDelay variable.
meanDepDelay <- mean(flights_effective$DepDelay)
standardDepDelay <- sd(flights_effective$DepDelay)/sqrt(length(flights_effective$DepDelay))
ci <- CI(flights_effective$DepDelay)
# a 95% confidence interval for the mean DepDelay is given by
t.test(flights_effective$DepDelay, alternative="two.sided", conf.level = .95) # mu=12
##
## One Sample t-test
##
## data: flights_effective$DepDelay
## t = 3155.6, df = 121230000, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8.165247 8.175396
## sample estimates:
## mean of x
## 8.170322
meanDepDelay
## [1] 8.170322
standardDepDelay # How accurate is this point estimate?
## [1] 0.002589136
ci
##    upper     mean    lower
## 8.175396 8.170322 8.165247
Results:
The sample mean of the DepDelay variable is 8.170322 minutes.
How accurate is this point estimate (the mean)? The standard error answers the question: it is about 0.0026 minutes.
A more readable result is obtained with the confidence interval CI:
upper: 8.175396
mean: 8.170322
lower: 8.165247
Testing the mean DepDelay of a flight: the t-test also produces the p-value, which is the probability of observing a result at least as extreme as the one obtained if the null hypothesis (a true mean equal to 0) were true. The p-value is compared with the significance level of the test. Here p < 2.2e-16, which suggests that the null hypothesis is unlikely to be true. The smaller the p-value, the more confidently we can reject the null hypothesis.
# Groups are drawn in the order of the season codes: 1 = winter, 2 = spring, 3 = summer, 4 = fall
seasonplot <- boxplot(formula = DepDelay ~ Month,
  data = flights_effective,
  main = 'Departures delays depending on the season',
  xlab = 'Season',
  ylab = 'Departure delay',
  border = c('skyblue', 'springgreen', 'yellow', 'orange'),
  names = c('Winter', 'Spring', 'Summer', 'Fall'))
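Because boxplot() draws the groups in the order of the numeric season codes, the labels have to follow the legend (1 = winter, 2 = spring, 3 = summer, 4 = fall). One way to keep codes and labels together is to store the season as a factor. The sketch below uses a hypothetical toy data frame (`toy`, `season_code` and the delay values are assumptions made for illustration) and also shows how the standard error and an approximate 95% interval are obtained.

```r
# Hypothetical toy data; the real table has about 121 million rows.
toy <- data.frame(
  Month    = c(1, 2, 3, 6, 7, 9, 11, 12),
  DepDelay = c(12, 30, -3, 5, 8, 0, 15, 45)
)

# Lookup indexed by month number: 1 = winter, 2 = spring, 3 = summer, 4 = fall.
season_code <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1)
toy$Season <- factor(season_code[toy$Month],
                     levels = 1:4,
                     labels = c("Winter", "Spring", "Summer", "Fall"))

# Standard error of the mean and a normal-approximation 95% interval.
n  <- length(toy$DepDelay)
m  <- mean(toy$DepDelay)
se <- sd(toy$DepDelay) / sqrt(n)
ci <- c(lower = m - qnorm(0.975) * se, upper = m + qnorm(0.975) * se)
```

With the factor in place, a call such as `boxplot(DepDelay ~ Season, data = toy)` takes its axis labels from the factor levels, so they cannot be listed in the wrong order.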
  • 7. Result: The plot shows the four seasons on the x-axis and the departure delay of a flight, in minutes, on the y-axis, over a range from -1000 to 2000. The longest delay occurs in Winter, with a delay of more than 2000 minutes, while the minimum departure delay, a negative value, occurs in Spring.
3. Which aircraft manufacturer gives the best performance?
It is important to evaluate which manufacturer gives the best performance of the plane. This analysis needs additional information, contained in the csv "plane-data.csv".
plane_data <- read_csv("C:/ProjectInferential/plane-data.csv")
plane_data <- na.omit(plane_data)
To get a clear view of the manufacturers, I create a legend that identifies them quickly. In particular, the manufacturers that occur most often are highlighted.
Legend:
Embraer: E
Boeing: B
AirBus Industrie: A
Other: O
plane_performance <- na.omit(plane_performance)
For more information, we count the planes of each manufacturer.
plane_performance %>% group_by(manufacturer) %>% tally %>% arrange(desc(n))
## # A tibble: 4 x 2
##   manufacturer     n
##   <chr>        <int>
## 1 B             2061
## 2 O             1397
## 3 E              588
## 4 A              434
Results
Manufacturer B: 2061
Manufacturer O: 1397
Manufacturer E: 588
Manufacturer A: 434
I create a single dataframe containing flight and plane information by joining the two data frames on the TailNum variable, which identifies each aircraft.
For better modeling and interpretation, I plot the density of the data. Kernel density plots are usually a much more effective way to view the distribution of a variable.
mdensity <- ggplot(plane_performance, aes(x=airtime))
mdensity + geom_density(aes(colour=manufacturer, fill=manufacturer), alpha=0.3) +
  theme_gray(base_size=14)
## Warning: Removed 3 rows containing non-finite values (stat_density).
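The report describes joining flights and planes on the tail number and collapsing manufacturers into the codes E, B, A and O, but does not show that step. A minimal sketch of how it could look is given below. The toy data frames, the raw manufacturer spellings ("EMBRAER", "BOEING", "AIRBUS INDUSTRIE", "CESSNA") and all variable names are assumptions made for illustration, not the author's code.

```r
# Hypothetical toy tables standing in for the flight data and plane-data.csv.
flights_toy <- data.frame(
  TailNum = c("N10156", "N102UW", "N10323", "N999ZZ", "NOMATCH"),
  AirTime = c(160, 308, 208, 90, 75),
  stringsAsFactors = FALSE
)
planes_toy <- data.frame(
  tailnum      = c("N10156", "N102UW", "N10323", "N999ZZ"),
  manufacturer = c("EMBRAER", "AIRBUS INDUSTRIE", "BOEING", "CESSNA"),
  stringsAsFactors = FALSE
)

# Inner join on the tail number; flights without plane information are dropped.
joined <- merge(flights_toy, planes_toy, by.x = "TailNum", by.y = "tailnum")

# Collapse manufacturers into the legend codes; anything unlisted becomes "O".
short <- c("EMBRAER" = "E", "BOEING" = "B", "AIRBUS INDUSTRIE" = "A")
joined$manufacturer_code <- unname(short[joined$manufacturer])
joined$manufacturer_code[is.na(joined$manufacturer_code)] <- "O"
joined
```

An inner join is used here because a flight with no matching plane record cannot be assigned a manufacturer; a left join would keep such flights with a missing manufacturer instead.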
  • 8. To explore AirTime by manufacturer, I calculate the mean, standard deviation and median and show each of them in a plot.
Meanplot <- ggplot(airtimeMean, aes(Manufacturer, x, fill=Manufacturer)) +
  geom_bar(stat="identity", position="dodge") +
  xlab("Manufacturer") +
  ylab("Hourly Mean AirTime") +
  theme_gray(base_size = 14)
StandardDeviationplot <- ggplot(airtimeSD, aes(Manufacturer, x, fill=Manufacturer)) +
  geom_bar(stat="identity", position="dodge") +
  xlab("Manufacturer") +
  ylab("Hourly AirTime SD") +
  theme_gray(base_size = 14)
Medianplot <- ggplot(airtimemedian, aes(Manufacturer, x, fill=Manufacturer)) +
  geom_bar(stat="identity", position="dodge") +
  xlab("Manufacturer") +
  ylab("Hourly Median AirTime") +
  theme_gray(base_size = 14)
# Arrange the three plots on a 2 x 2 grid (ggarrange comes from the ggpubr package)
ggarrange(Meanplot, StandardDeviationplot, Medianplot, ncol = 2, nrow = 2)
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1 rows containing missing values (geom_bar).
  • 9. To evaluate flight performance, I calculate performance_index by adding DepDelay and ArrDelay and dividing the sum by the flight time AirTime.
## # A tibble: 4,480 x 6
## # Groups: tailnum, manufacturer [4,480]
##    tailnum manufacturer arrdelay depdelay airtime performance_index
##    <chr>   <chr>           <int>    <int>   <int>             <dbl>
##  1 N10156  E                  34       35     160            0.431
##  2 N102UW  A                  23       20     308            0.140
##  3 N10323  B                  88       70     208            0.760
##  4 N103US  A                   6       11     251            0.0677
##  5 N104UA  B                 193      185     240            1.58
##  6 N104UW  A                  14       23     288            0.128
##  7 N10575  E                  80       85      55            3
##  8 N105UA  B                   3       27     206            0.146
##  9 N105UW  A                  45       24     256            0.270
## 10 N106US  A                   7        5      65            0.185
## # ... with 4,470 more rows
plane_performance_index <- plane_performance %>%
  group_by(tailnum) %>%
  dplyr::summarise(avg_performance_index = mean(performance_index, na.rm=FALSE)) %>%
  as_data_frame()
plane_performance_index <- na.omit(plane_performance_index)
Result:
The highest performance index is given by TailNum N581SW, with value 50.000000.
The lowest performance index is given by TailNum N5ETAA, with value 0.005524862.
To bring together the information on planes and flights, I combined the two data frames using the TailNum index.
# Look up the manufacturer of each tail number in plane_data
pos1 <- match(plane_performance_index$tailnum, plane_data$tailnum)
plane_performance_index$manufacturer <- plane_data$manufacturer[pos1]
plot1 <- ggplot(f, aes(x = factor(manufacturer), y = p)) +
  geom_bar(colour = "blue", stat = "identity") +
  ggtitle("Performance Index based on the manufacturer of plane") +
  guides(fill = "none") +
  xlab("Manufacturer") +
  ylab("Performance Index") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot1 <- ggplotly(plot1)
plot1
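The definition of performance_index can be checked against the first rows of the table above. The sketch below recomputes it for three tail numbers; the data frame `sample_rows` is an illustrative name, while the values are copied from the printed tibble.

```r
# Three rows copied from the printed table (rows 1, 2 and 7).
sample_rows <- data.frame(
  tailnum  = c("N10156", "N102UW", "N10575"),
  arrdelay = c(34, 23, 80),
  depdelay = c(35, 20, 85),
  airtime  = c(160, 308, 55),
  stringsAsFactors = FALSE
)

# performance_index = (ArrDelay + DepDelay) / AirTime
sample_rows$performance_index <-
  (sample_rows$arrdelay + sample_rows$depdelay) / sample_rows$airtime

round(sample_rows$performance_index, 3)
```

The recomputed values match the table (0.431, 0.140 and 3). Under this definition the index measures minutes of delay per minute of air time, so a larger value corresponds to more delay relative to the time flown.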
  • 10. Result
As the plot shows, the best performance is given by the manufacturer FEDERICK CHRISK; conversely, the worst performance is given by the manufacturer BAUMAN RANDY.
4. Which of the two carriers that operate the most flights is faster?
In this analysis, we first determine the carriers that operate the most flights, then we analyze the two companies and find out which is faster.
data_carrier_air <- select(ontime, Distance, AirTime, UniqueCarrier, DepDelay, ArrDelay) %>%
  filter((ArrDelay > 0) & (DepDelay > 0) & (AirTime > 0)) %>%
  group_by(UniqueCarrier, DepDelay, ArrDelay) %>%
  as_data_frame()
data_carrier_air <- na.omit(data_carrier_air)
Covariance between AirTime and Distance
cov(data_carrier_air$Distance, data_carrier_air$AirTime)
## [1] 25409.49
Covariance Result
The positive value returned by the cov (covariance) function indicates that AirTime and Distance vary in the same direction: as AirTime increases, Distance tends to increase too. The size of a covariance depends on the units of the two variables, so the strength of this linear relationship is measured with the correlation test further on.
The frequencies and relative frequencies of each carrier are calculated to find the carriers that operate the most flights.
Calculation of frequencies and relative frequencies of carriers
flight_to_carrier <- cbind(
  Frequency = table(data_carrier_air$UniqueCarrier),
  RelFreq = prop.table(table(data_carrier_air$UniqueCarrier)))
flight_to_carrier
  • 11. ##        Frequency      RelFreq
## 9E        134604 0.0033550012
## AA       4505617 0.1123023864
## AQ         41702 0.0010394213
## AS        928903 0.0231528831
## B6        261440 0.0065163852
## CO       2348172 0.0585281260
## DH        199362 0.0049690927
## DL       6239652 0.1555231636
## EA        215526 0.0053719799
## EV        588954 0.0146796631
## F9        117590 0.0029309277
## FL        394566 0.0098345473
## HA         38710 0.0009648457
## HP       1176193 0.0293165799
## ML (1)     14393 0.0003587452
## MQ       1299845 0.0323986028
## NW       2702993 0.0673720301
## OH        413364 0.0103030869
## OO        893097 0.0222604195
## PA (1)     61635 0.0015362508
## PI        466058 0.0116164835
## PS         35534 0.0008856840
## TW       1092832 0.0272388091
## TZ         47440 0.0011824408
## UA       4669724 0.1163927491
## US       5217884 0.1300556228
## WN       5083808 0.1267137820
## XE        664729 0.0165683530
## YV        266076 0.0066319374
The largest number of flights is made by DL (Delta Air Lines Inc.), with 15.55%, followed by US (US Airways Inc.), with 13%.
Pearson's Correlation Test
Correlation between AirTime and Distance
cor.test(data_carrier_air$AirTime, data_carrier_air$Distance)
##
## Pearson's product-moment correlation
##
## data: data_carrier_air$AirTime and data_carrier_air$Distance
## t = 5112.5, df = 40120000, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6278910 0.6282657
## sample estimates:
## cor
## 0.6280784
Correlation: 0.6280784
Regression Analysis
airtime.lm.DL <- lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier == "DL")
summary(airtime.lm.DL)
##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
## "DL")
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -381.96  -37.16   20.30   39.22  481.85
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.595e-01  3.822e-02   14.64   <2e-16 ***
## Distance    8.472e-02  4.236e-05 1999.98   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.28 on 6239650 degrees of freedom
## Multiple R-squared: 0.3906, Adjusted R-squared: 0.3906
## F-statistic: 4e+06 on 1 and 6239650 DF, p-value: < 2.2e-16
airtime.lm.US <- lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier == "US")
summary(airtime.lm.US)
  • 12. ##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
## "US")
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -256.40  -28.59   15.26   35.60  339.54
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.590e-01  3.515e-02  -21.59   <2e-16 ***
## Distance     8.632e-02  4.552e-05 1896.59   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.6 on 5217882 degrees of freedom
## Multiple R-squared: 0.4081, Adjusted R-squared: 0.4081
## F-statistic: 3.597e+06 on 1 and 5217882 DF, p-value: < 2.2e-16
Multiple R-squared is 0.4081, which is the R-squared value. Multiple R-squared never decreases when more independent variables are added, whereas Adjusted R-squared decreases if an added independent variable does not help the model.
plotDL <- plot(airtime.lm.DL)
  • 13.
  • 14. plotDL
## NULL
plotUS <- plot(airtime.lm.US)
  • 15.
  • 16. plotUS
## NULL
Residual vs Fitted
This is a scatterplot with the residuals on the y axis and the fitted values on the x axis. It is used to detect non-linearity by looking for a questionable pattern in the red line. The characteristics of a well-behaved residual vs fitted plot are:
The residuals bounce randomly around the 0 line.
The residuals roughly form a horizontal band around the 0 line.
No residual stands out.
  • 17. In this plot the residuals lie randomly around the 0 line, which suggests that the assumption of a linear relationship is reasonable.
Scale Location
This plot tells us whether the residuals are spread equally along the range of the predictor.
Normal QQ
This scatterplot shows whether the residuals are normally distributed. The closer the points fall to the diagonal line, the more we can interpret the residuals as normally distributed.
DL <- subset(data_carrier_air, UniqueCarrier == 'DL')
US <- subset(data_carrier_air, UniqueCarrier == 'US')
# Scatterplot of both carriers with their fitted regression lines
finalplot <- plot(x = DL$Distance, y = DL$AirTime,
  xlab = 'Distance',
  ylab = 'Air Time',
  main = 'Air time based on distance by carrier',
  pch = 20,
  col = 'dodgerblue1')
points(x = US$Distance, y = US$AirTime, pch = 20, col = 'forestgreen')
abline(airtime.lm.DL, col = 'slateblue1')
abline(airtime.lm.US, col = 'springgreen1')
legend('topleft',
  legend = c('Delta Air Lines Inc.', 'US Airways Inc.'),
  col = c('dodgerblue1', 'forestgreen'),
  pch = 20)
finalplot
## NULL
The plot shows the relationship between Distance and AirTime for the DL and US companies. (Note: flight times are all positive integer values.)
Who is faster?
The DL regression line is drawn in slateblue1.
The US regression line is drawn in springgreen1.
It is best to travel with Delta Air Lines Inc. for distances of less than 1500 miles and with US Airways Inc. for distances greater than 1500 miles.
Conclusion
The analyses carried out over all the years (1987-2008) confirm that the worst month to depart is January, as it has the greatest number of cancellations with an unknown reason.
The average departure delay is 8.17 minutes.
The worst season for departure delays is Winter, while Spring shows the minimum departure delay, a negative DepDelay.
The best performing aircraft is tail number N581SW, built by the manufacturer FEDERICK CHRISK.
The worst performing aircraft is tail number N5ETAA, built by the manufacturer BAUMAN RANDY.
The two companies that operate the most flights are DL (Delta Air Lines), with 15%, and US (US Airways), with 13%. In addition, DL is faster on flights with a distance of less than 1,500 miles.
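As a cross-check of the carrier comparison, the distance at which the two fitted regression lines predict the same air time can be computed from the coefficients printed in the two summary() outputs. This is only a sketch based on those rounded coefficients; the names `dl`, `us`, `predict_airtime`, `crossover` and `gap` are illustrative and not part of the report.

```r
# Rounded coefficients copied from the two summary() outputs.
dl <- c(intercept = 0.5595, slope = 0.08472)   # Delta Air Lines (DL)
us <- c(intercept = -0.7590, slope = 0.08632)  # US Airways (US)

# Predicted air time in minutes for a given distance in miles.
predict_airtime <- function(coef, distance) {
  unname(coef["intercept"] + coef["slope"] * distance)
}

# Distance at which both lines give the same predicted air time.
crossover <- unname((dl["intercept"] - us["intercept"]) /
                    (us["slope"] - dl["slope"]))

# Predicted DL minus US air time (minutes) at a few distances.
distances <- c(500, 1500, 2500)
gap <- predict_airtime(dl, distances) - predict_airtime(us, distances)
round(crossover)
round(gap, 2)
```

With these rounded coefficients the lines cross at roughly 824 miles, and the predicted difference between the two carriers stays within a few minutes over the distances shown. That is small compared with the residual standard errors of 52.6 and 58.28 minutes, so a ranking read off the plot should be treated with caution.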