This document discusses analyzing a large airline data set from 2005-2014 using Apache Hadoop and Hive on Microsoft Azure HDInsight. It describes the characteristics of the data set, challenges of traditional computing for big data, and how Hadoop provides data intensive distributed computing. Experimental results using Hive on Azure show trends in cancelled/diverted flights, departure delays by distance, month and year. The summary concludes that interesting patterns were found, cloud infrastructure enables scalable big data analysis at low cost, and Hive provides an easy query interface for the complex Hadoop ecosystem.
Big Data Analysis of Airline Data Set on Cloud Computing
1. Jongwook Woo
HiPIC
CSULA
JIIBR SYMPOSIUM 2015
Cal State LA, CA
October 9 2015
Nillohit Bhattacharya, nbhatta2@calstatela.edu
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
2. High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Airline Data Set
Hadoop: Data Intensive Computing
Hadoop on Cloud Computing
Hive and its Architecture on Azure
Experimental Results
Conclusions
Characteristics of the Airline Data Set
Data taken from the US Department of Transportation
Consists of the arrival and departure records of domestic airlines
Time period: January 2005 – December 2014 (10 years)
Total number of files: 120
File format: CSV (comma-separated values)
Total file size: 13.1 GB
Total number of records: 66 million
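The figures on this slide imply a few simple averages. The following is a rough back-of-the-envelope sketch in Python; it assumes decimal units (1 GB = 10^9 bytes), which the slide does not state, and the function name is only illustrative.

```python
# Figures quoted on the slide.
YEARS = 10
TOTAL_FILES = 120
TOTAL_SIZE_GB = 13.1
TOTAL_RECORDS = 66_000_000

# Assumption: decimal units (1 GB = 1e9 bytes); the slide does not say which.
BYTES_PER_GB = 1e9


def dataset_averages():
    """Return rough per-year, per-file and per-record averages."""
    total_bytes = TOTAL_SIZE_GB * BYTES_PER_GB
    return {
        "files_per_year": TOTAL_FILES / YEARS,
        "records_per_file": TOTAL_RECORDS / TOTAL_FILES,
        "mb_per_file": total_bytes / TOTAL_FILES / 1e6,
        "bytes_per_record": total_bytes / TOTAL_RECORDS,
    }


if __name__ == "__main__":
    for name, value in dataset_averages().items():
        print(f"{name}: {value:,.1f}")
```

Twelve files per year is consistent with one file per month, and under these assumptions a record averages roughly 200 bytes.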
Traditional Computing Challenges
It is not easy for a single computer to store and process all the data by itself.
The problem needs a different approach
Traditional parallel computer
– Processor-intensive computing
• Increases the processing speed and power of the computer
As the data grows exponentially,
– The processing power of a single computer becomes a bottleneck
– This approach mostly does not work for large-scale data because of the latency of data transfer over the network and disk I/O
A New Approach (Hadoop)
Many inexpensive commodity computers working together
Data-intensive computing
– Break the data into smaller chunks and process each chunk locally where it is stored
– Data locality
• Computation occurs where the data resides
All the computers process the data in parallel.
Provides the ability to harness the power of multiple computers simultaneously.
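The idea of processing each chunk where it is stored and moving only small results can be sketched in a few lines of Python. This is a toy illustration, not Hadoop itself: threads stand in for machines, and the node names and records are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: each "node" already stores its own chunk of records.
# Each record is (year, cancelled_flag); the values are invented for illustration.
CHUNKS_BY_NODE = {
    "node-1": [(2005, 0), (2005, 1), (2006, 0)],
    "node-2": [(2006, 1), (2007, 1)],
    "node-3": [(2007, 0), (2007, 1), (2008, 0)],
}


def summarize_local_chunk(records):
    """Runs where the chunk is stored; returns only a tiny summary."""
    total = len(records)
    cancelled = sum(flag for _, flag in records)
    return total, cancelled


def summarize_cluster(chunks_by_node):
    """Process every chunk in parallel, then combine the small summaries."""
    with ThreadPoolExecutor(max_workers=len(chunks_by_node)) as pool:
        partials = list(pool.map(summarize_local_chunk, chunks_by_node.values()))
    total = sum(t for t, _ in partials)
    cancelled = sum(c for _, c in partials)
    return total, cancelled


if __name__ == "__main__":
    print(summarize_cluster(CHUNKS_BY_NODE))
```

Only the per-chunk `(total, cancelled)` pairs are combined centrally; the raw records never leave their node, which is the point of data locality.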
Hadoop on Cloud
Create Hadoop clusters with minimal investment.
No overhead of maintaining the cluster.
Delete the cluster when it is no longer needed.
Increase or decrease resources on demand.
Deleting the cluster does not result in loss of data.
Apache Hive
SQL-like language
Developed at Facebook
HQL (Hive Query Language) resembles SQL but is not identical to it
Runs MapReduce jobs under the hood
Batch processing
Queries have a high latency
Read-based
Not appropriate for transaction processing
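To illustrate what "runs MapReduce jobs under the hood" means, here is a conceptual Python sketch of how a grouped count could be split into map, shuffle and reduce phases. It is not Hive's actual execution plan; the query in the comment, the table name `flights` and the column names are assumptions made for the example, and the rows are invented.

```python
from collections import defaultdict

# Hypothetical HiveQL this sketch imitates (table and column names are assumed):
#   SELECT year, month, COUNT(*) FROM flights
#   WHERE cancelled = 1 GROUP BY year, month;

# Made-up rows: (year, month, cancelled)
ROWS = [
    (2005, 1, 1),
    (2005, 1, 0),
    (2005, 1, 1),
    (2005, 2, 1),
    (2006, 1, 0),
]


def map_phase(rows):
    """Emit ((year, month), 1) for every row that passes the WHERE filter."""
    for year, month, cancelled in rows:
        if cancelled == 1:
            yield (year, month), 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply COUNT(*) to each group by summing the emitted ones."""
    return {key: sum(values) for key, values in sorted(grouped.items())}


def run_query(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(ROWS))
```

Each phase runs as a batch over the whole input, which is one reason such queries have high latency and suit read-heavy analysis rather than transactions.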
Microsoft Azure HDInsight
Deploys and provisions Hadoop clusters in the cloud
HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution
HDInsight cluster configuration
– Number of data nodes: 4
– CPU: 4 cores
– Memory: 7 GB
– Operating system: Windows Server 2012 R2 Datacenter
Hadoop clusters can be launched using
– Linux operating system
– Windows Server operating system
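The configuration above can be turned into rough aggregate figures. The sketch below assumes that the CPU and memory values are per data node and that the 13.1 GB data set is spread evenly across the nodes; the slides state neither, so treat the results as estimates only.

```python
# Figures quoted on the slides.
DATA_NODES = 4
CORES = 4
MEMORY_GB = 7
DATASET_GB = 13.1  # total file size from the data set slide


def cluster_totals(nodes=DATA_NODES, cores=CORES, memory_gb=MEMORY_GB,
                   dataset_gb=DATASET_GB):
    """Aggregate capacity, assuming the CPU and memory figures are per node."""
    return {
        "total_cores": nodes * cores,
        "total_memory_gb": nodes * memory_gb,
        # Assumption: data is spread evenly across the data nodes.
        "data_per_node_gb": dataset_gb / nodes,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions the cluster offers 16 cores and 28 GB of memory in total, with a little over 3 GB of data per node.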
Experimental Results
Total number of flights cancelled each month for the period 2005-2014
Time taken: 210.862 seconds, Fetched: 120 row(s)
Total number of flights diverted each month for the period 2005-2014
Time taken: 216.704 seconds, Fetched: 120 row(s)
Cancelled and Diverted flights by month
[Figure: Cancelled/Diverted vs Time. Y-axis: number of cancelled/diverted flights (0 to 35,000); series: Cancelled, Diverted]
Experimental Results
Total number of flights cancelled every year for the period 2005-2014
Time taken: 302.465 seconds, Fetched: 10 row(s)
Total number of flights diverted every year for the period 2005-2014
Time taken: 461.433 seconds, Fetched: 10 row(s)
Cancelled and Diverted flights by year
[Figure: Number of cancelled/diverted flights vs Year. X-axis: year (2005 to 2014); y-axis: number of cancelled flights (0 to 180,000); series: Cancelled, Diverted]
Experimental Results
Effect of flight distance on flight diversions
Time taken: 675.725 seconds, Fetched: 1500 row(s)
Diverted Flights vs Distance
[Figure: Number of diverted flights vs Distance. X-axis: flight distance in miles (0 to 6,000); y-axis: number of diverted flights (0 to 1,000); series: Diverted (Count)]
Experimental Results
Effect of flight distance on flight cancellations
Time taken: 576.925 seconds, Fetched: 1500 row(s)
Cancelled Flights vs Distance
[Figure: Number of cancelled flights vs Distance. X-axis: flight distance in miles (0 to 6,000); y-axis: number of cancelled flights (0 to 14,000); series: Cancellation (Count)]
Experimental Results
Effect of flight distance on average departure delay
Time taken: 992.911 seconds, Fetched: 1500 row(s)
Average Departure Delay vs Flight Distance
[Figure: Average Departure Delay vs Flight Distance. X-axis: flight distance in miles (0 to 6,000); y-axis: average departure delay in minutes (0 to 250); series: Avg Dep Delay]
Experimental Results
Monthly average departure delay for the period 2005-2014
Time taken: 973.695 seconds, Fetched: 13 row(s)
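An average over data that is split across many machines cannot be computed by averaging per-machine averages; each partition has to report a sum and a count that are merged before dividing. The Python sketch below illustrates this for average departure delay by month. The partitions and delay values are invented, and this is a conceptual illustration rather than the query used in the experiment.

```python
from collections import defaultdict

# Made-up partitions of (month, departure_delay_minutes) records.
PARTITIONS = [
    [(1, 10.0), (1, 20.0), (6, 30.0)],
    [(1, 0.0), (6, 10.0), (12, 5.0)],
]


def partial_sums(partition):
    """Per-partition step: accumulate [sum, count] for each month."""
    acc = defaultdict(lambda: [0.0, 0])
    for month, delay in partition:
        acc[month][0] += delay
        acc[month][1] += 1
    return acc


def average_delay_by_month(partitions):
    """Merge the partial [sum, count] pairs, then divide once at the end."""
    merged = defaultdict(lambda: [0.0, 0])
    for partition in partitions:
        for month, (delay_sum, count) in partial_sums(partition).items():
            merged[month][0] += delay_sum
            merged[month][1] += count
    return {month: s / c for month, (s, c) in sorted(merged.items())}


if __name__ == "__main__":
    print(average_delay_by_month(PARTITIONS))
```

For month 1 in this toy data, the merged result is 30 / 3 = 10 minutes, whereas averaging the two per-partition averages (15 and 0) would give 7.5, which is wrong.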
Average Departure Delay by month
[Figure: Average Departure Delay vs Month. X-axis: month (Jan to Dec); y-axis: average departure delay in minutes (0 to 14); series: Avg Dep Delay]
Experimental Results
Yearly average departure delay for the period 2005-2014
Time taken: 623.694 seconds, Fetched: 11 row(s)
Average Departure Delay by year
[Figure: Average Departure Delay vs Year. X-axis: year (2005 to 2014); y-axis: average departure delay in minutes (0 to 12); series: Avg Dep Delay]
Conclusion
Interesting trends and patterns exist in large data sets
– Average departure delay peaks in the middle and at the end of the year, i.e. during June, July and December
– Within the period 2005-2014, the highest number of flights was cancelled in 2007
Cloud infrastructure has enabled the use of Hadoop for big data systems with minimal investment and cost of ownership
Hive provides an easy way to query the data without worrying about the underlying complex structure of the system
Big data systems built in the cloud can be decommissioned without losing the data
Any large-scale data set in business can be analyzed
– Marketing, Finance, Economics, Management
Contact Prof. Jongwook Woo (jwoo5@calstatela.edu) if you are interested in collaborating
References
Airline Data Set, United States Department of Transportation, http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
What is Hive?, http://www-01.ibm.com/software/data/infosphere/hadoop/hive/
Introduction to Windows Azure Blob Storage, https://www.simple-talk.com/cloud/cloud-data/an-introduction-to-windows-azure-blob-storage-/
Introduction to Hadoop in HDInsight: Big-data analysis and processing in the cloud, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/
Explorer for Microsoft Azure Storage: Freeware Client, http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx
Upload data for Hadoop jobs in HDInsight, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/
“Market Basket Analysis Algorithms with MapReduce”, Jongwook Woo, DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795