Bidm assignment airtrip data
BIDM Assignment – AIRLINE DATA
Assignment by Group A (PGPMX 2015-17 Batch)
Jayant Shenoy (2015PGPMX007)
Lata Srijit (2015PGPMX013)
Mandar Risbud (2015PGPMX014)
Parineeta Katgaonkar (2015PGPMX018)
Samir Shah (2015PGPMX023)
Sanmeet Dhokay (2015PGPMX025)
Contents
1. Problem Statement
2. Preparation of Data
3. Identify Top 4 Airports by Traffic Volume
4. Identify Top 5 Airlines
5. Seasonality in terms of Delay and Operating Performance
6. Conclusion
1. Problem Statement
The dataset for this problem consists of arrival and departure data of flights across all the airports
within the US for 25 years. Your task is to identify the top 4 airports in terms of traffic volume. Also
identify the top 5 airline carriers that operate on those routes (in terms of number of flights). Select
only those flights of the carriers that flew in or out or between any of these top 4 airports. Please try
to answer the following questions using this dataset.
- Is there any seasonality in terms of flight delay? (Hint: Use at least 10 years of data to find any
seasonal effect)
- Excluding systemic delays (i.e. when all flights in/out of an airport are delayed due to some
common problem like weather or computer glitches), compare the operating performance of
the top 5 airline players
We have identified the following objectives for the problem.
Objectives of the problem:
- Top 4 airports in terms of traffic volume
- Top 5 airline carriers on those airports in terms of number of flights
- Seasonality in terms of flight delay
- Compare the operating performance of the top 5 airline players
2. Preparation of Data
The overall data size was 31.7 GB. We sampled the 303 individual datasets (.csv files), i.e. we took
a 1% sample from each of the files and then merged all the data into one file.
We used Linux batch commands to take the 1% sample and merge all the records into 1 file.
- In all, there were 303 files (1 file per month, data for around 25 years)
- Each file had around 80+ MB of data, having around 500,000 data records
- The approach was to take 1% of data records from each file and at the end, merge all the
records from 303 files into a single data file
- For extracting 1% data from each individual file and then merging all extracted data, we used
Linux commands, as they come as handy utilities for data movement and merging
- Since the overall data volume was very large, we used the MS Azure Cloud environment for moving,
storing and handling data
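The sample-and-merge step described above can be sketched as follows. The work itself was done with Linux commands, so this Python version is only an illustrative equivalent, not the commands that were actually run. The function name `sample_and_merge`, the fixed random seed, and the assumption that every file starts with the same single header row are all assumptions made for the sketch.

```python
import csv
import random


def sample_and_merge(paths, out_path, fraction=0.01, seed=0):
    """Write a random sample of the data rows of each CSV into one file.

    Assumes every input file starts with the same single header row.
    Returns the number of data rows written.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    written = 0
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # empty file, nothing to sample
                if not header_written:
                    writer.writerow(header)  # keep the header only once
                    header_written = True
                for row in reader:
                    # Keep each row independently with probability `fraction`
                    if rng.random() < fraction:
                        writer.writerow(row)
                        written += 1
    return written
```

With 303 files of around 500,000 records each, a 1% sample would leave roughly 1.5 million records in the merged file. Rows are read one at a time, so no input file has to fit in memory.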
3. Identify Top 4 Airports by Traffic Volume
Steps followed:
1) Find the frequency (count) of each of the airport codes in the Origin column
2) Find the frequency (count) of each of the airport codes in the Destination column
3) Add the counts of both Origin and Destination
4) Sort the data from the highest count to the lowest count
5) The top 4 rows give the top 4 busiest airports
Find the Frequency (Count)
Select the ORIGIN/DEST Column
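The five steps above can be sketched in Python as below. This is an illustration only, since the original analysis was done with a frequency tool rather than code. The function name `top_airports` is hypothetical, and the rows are assumed to be dicts keyed by the `ORIGIN` and `DEST` column names (for example, as produced by `csv.DictReader`).

```python
from collections import Counter


def top_airports(flights, n=4):
    """Rank airports by ORIGIN count + DEST count.

    `flights` is an iterable of dicts with "ORIGIN" and "DEST" keys.
    Returns a list of (code, origin_count, dest_count, total) tuples,
    busiest airport first.
    """
    origin = Counter()
    dest = Counter()
    for row in flights:
        origin[row["ORIGIN"]] += 1  # step 1: frequency in the Origin column
        dest[row["DEST"]] += 1      # step 2: frequency in the Destination column
    total = origin + dest           # step 3: add both counts per airport
    # steps 4 and 5: sort by total count and keep the first n airports
    return [(code, origin[code], dest[code], count)
            for code, count in total.most_common(n)]
```

Keeping the origin and destination counts alongside the total makes it easy to spot airports whose two counts are very different.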
Top 4 Busiest Airports:
Airport Code   Airport Name           ORIGIN   DEST     Total
LAX            Los Angeles            155745   140072   295817
ABQ            Albuquerque            10439    229381   239820
DFW            Dallas Fort Worth      131597   99047    230644
ORD            O'Hare International   105448   95167    200615
The extracted data and calculations are present in the attached Excel sheet below:
TOPAIRPORTOUTPUT.xlsx
4. Identify Top 5 Airlines
Now that we have the top 4 busiest airports, we filtered the data to get the 4 busiest airports in the
ORIGIN and DEST columns.
We used frequency analysis on the UNIQUE_CARRIER column to find the top 5 airlines operating on the 4
busiest airports.
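A minimal sketch of this filter-then-count step is given below, assuming the rows are dicts keyed by the `ORIGIN`, `DEST` and `UNIQUE_CARRIER` column names. The function name `top_carriers` is hypothetical. A row is kept when either endpoint is one of the busiest airports, which matches the "in or out or between" wording of the problem statement.

```python
from collections import Counter


def top_carriers(flights, airports, n=5):
    """Count flights per UNIQUE_CARRIER among rows touching `airports`.

    A row is kept when its ORIGIN or its DEST is one of `airports`.
    Returns (filtered_rows, [(carrier, flight_count), ...]).
    """
    airports = set(airports)
    kept = [row for row in flights
            if row["ORIGIN"] in airports or row["DEST"] in airports]
    counts = Counter(row["UNIQUE_CARRIER"] for row in kept)
    return kept, counts.most_common(n)
```

The filtered rows are returned as well, because the seasonality analysis reuses the same filtered dataset.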
5. Seasonality in terms of Delay and Operating Performance
We have used the filtered dataset for the top 4 busiest airports, since the seasonality in terms of delay
has to be plotted for those airports.
Steps followed:
1) Filtered the data for the top 4 busiest airports
Below is the screenshot of the script to filter the data for the top 5 airlines and create a new dataset
(.sav) file:
2) Used DEP_DELAY_GROUP and ARR_DELAY_GROUP as the parameters for delay, as they contain simple
integer categorical data mapped against minutes of delay. For example, a delay of 15 to 29 minutes
maps to 1, a 37-minute delay maps to 2, a 105-minute delay maps to 7, and so on.
3) Created a new column named TOTAL_DELAY which contains the sum of the DEP_DELAY_GROUP and
ARR_DELAY_GROUP parameters
4) Used a clustered scatterplot as follows
TOTAL_DELAY is plotted on the Y axis, MONTH on the X axis, and the plot is split by UNIQUE_CARRIER to
compare the performance of the top airline players and also check the seasonality of delay
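Steps 3 and 4 above can be approximated numerically as below. Note that this sketch replaces the clustered scatterplot with a simpler summary, the mean TOTAL_DELAY per carrier and month. The function name `mean_total_delay` is hypothetical, the rows are assumed to be dicts keyed by the column names used above, and skipping rows with a missing delay group is an assumption rather than something stated in the report.

```python
from collections import defaultdict


def mean_total_delay(flights):
    """Average TOTAL_DELAY per (UNIQUE_CARRIER, MONTH).

    TOTAL_DELAY = DEP_DELAY_GROUP + ARR_DELAY_GROUP, as in step 3.
    Rows where either group is missing (empty string or None) are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in flights:
        dep = row.get("DEP_DELAY_GROUP")
        arr = row.get("ARR_DELAY_GROUP")
        if dep in ("", None) or arr in ("", None):
            continue  # no delay group recorded for this flight
        key = (row["UNIQUE_CARRIER"], int(row["MONTH"]))
        sums[key] += float(dep) + float(arr)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

A lower mean for a carrier in a given month indicates better on-time performance. Comparing the same carrier across months gives a rough view of seasonality.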
The performance of WN Airline is the best in terms of on-time arrival and departure, and that of UA and
AA is the worst.
As we can observe from the charts, during the second and third quarters of the year we see a lot of
delays as compared to the first and the last quarter.
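The quarter-level comparison can be made explicit by grouping monthly delay values into quarters. The sketch below assumes a mapping from month number (1 to 12) to some delay measure, such as a mean TOTAL_DELAY. The function name `mean_by_quarter` is hypothetical and no values from the assignment data are reproduced here.

```python
def mean_by_quarter(monthly_values):
    """Average a {month: value} mapping (months 1..12) into {quarter: mean}."""
    buckets = {1: [], 2: [], 3: [], 4: []}
    for month, value in monthly_values.items():
        quarter = (int(month) - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
        buckets[quarter].append(value)
    # Quarters with no months present are left out of the result
    return {q: sum(v) / len(v) for q, v in buckets.items() if v}
```

If the means for quarters 2 and 3 are clearly higher than those for quarters 1 and 4, that supports the seasonal pattern read off the charts.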
6. Conclusion
We have noted down a few observations and findings after completing the assignment, which are as
follows:
- Whenever we have a very large dataset to deal with, sampling is an effective approach, as it
converts the data into a practical sample to work upon.
- Understanding the sample data well is an integral part of the Business Intelligence and Data
Mining process.
- The top 4 busiest airports from the assignment data are LAX, ABQ, DFW and ORD.
- The top 5 airline carriers operating on those 4 busiest airports are AA, NW, PS, UA and WN.
- WN Airline is the best in terms of on-time arrival and departure. Airlines UA and AA are the
worst in terms of delay.
- During the second and third quarters of the year we see a lot of delays as compared to the first
and the last quarter.
This assignment has helped us to gain a detailed insight into the application of Business Intelligence and
Data Mining.