Big Data Analysis of Airline Data Set on Cloud Computing

Jongwook Woo
HiPIC
CSULA
JIIBR SYMPOSIUM 2015
Cal State LA, CA
October 9 2015
Nillohit Bhattacharya, nbhatta2@calstatela.edu
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles

Contents
 Airline Data Set
 Hadoop: Data Intensive Computing
 Hadoop on Cloud Computing
 Hive and its Architecture on Azure
 Experimental Results
 Conclusions

Characteristics of the Airline Data Set
Data taken from the US Department of Transportation
Consists of the arrival and departure records of domestic airlines
Time period: January 2005 – December 2014 (10 years)
Total number of files: 120
File Format: csv (comma separated values)
Total file size: 13.1 GB
Total Number of records: 66 million
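
The figures above can be cross-checked with a little arithmetic. The sketch below is only illustrative: it assumes one CSV file per calendar month and a decimal gigabyte (10**9 bytes), neither of which the slide states.

```python
# Figures taken from the slide above.
YEARS = 10
MONTHS_PER_YEAR = 12
TOTAL_FILES = 120
TOTAL_SIZE_GB = 13.1
TOTAL_RECORDS = 66_000_000

# Assumption: one CSV file per calendar month.
assert YEARS * MONTHS_PER_YEAR == TOTAL_FILES

# Assumption: 1 GB = 10**9 bytes; the slide does not say which convention it uses.
BYTES_PER_GB = 10**9

records_per_file = TOTAL_RECORDS / TOTAL_FILES
mb_per_file = TOTAL_SIZE_GB * 1000 / TOTAL_FILES
bytes_per_record = TOTAL_SIZE_GB * BYTES_PER_GB / TOTAL_RECORDS

print(f"records per file: {records_per_file:,.0f}")
print(f"size per file:    {mb_per_file:.0f} MB")
print(f"bytes per record: {bytes_per_record:.0f}")
```

Under these assumptions each file holds roughly 550,000 records in about 109 MB, or just under 200 bytes per record.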

Traditional Computing Challenges
It is not easy for a single computer to store and
process all the data by itself.
The traditional approach to the problem:
 Traditional Parallel Computer
– Processor Intensive Computing
• increase the processing speed and power of the
computer
As the data grows exponentially,
– The processing power of a single computer
becomes a bottleneck
– Scaling up mostly does not work for large-scale data
because of the latency of data transfer over the network
and disk I/O

A New Approach (Hadoop)
Many inexpensive commodity computers
all working together
 Data Intensive Computing
– break the data into smaller chunks and process the data
locally where it is stored
– Data Locality
• Computation occurs where the data resides
All the computers process the data in parallel.
Provides the ability to harness the power of
multiple computers simultaneously.
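
The idea of splitting the data into chunks, processing each chunk independently, and merging the partial results can be sketched in a few lines of Python. This is a single-machine illustration only: threads stand in for cluster nodes, and the function names and sample records are hypothetical, not part of Hadoop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Break the data set into smaller chunks (stand-ins for HDFS blocks)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Work done 'locally' on one chunk: count records per carrier."""
    return Counter(carrier for carrier, _distance in chunk)


def process_in_parallel(records, chunk_size=2, workers=4):
    """Process every chunk independently, then merge the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(process_chunk, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical sample records: (carrier code, distance in miles).
SAMPLE = [("AA", 2475), ("DL", 760), ("AA", 1235), ("UA", 337), ("DL", 2139)]

if __name__ == "__main__":
    print(process_in_parallel(SAMPLE))
```

Because each chunk is processed without reference to any other chunk, the same work could be moved to whichever machine already stores that chunk, which is the point of data locality.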

Hadoop on Cloud
Create Hadoop clusters with minimal
investment.
No overhead of maintaining the cluster.
Delete the cluster when no longer needed.
Increase/Decrease resources on demand.
Deleting the cluster does not result in loss
of data.

Apache Hive
SQL-like language
Developed at Facebook
HQL (Hive Query Language) differs from SQL
Runs MapReduce jobs under the hood
Batch processing
Queries have a high latency
Read-based
Not appropriate for transaction processing
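
To make "runs MapReduce jobs under the hood" concrete, the sketch below shows how a grouped count could be expressed as map, shuffle, and reduce steps. It is a conceptual illustration in plain Python, not Hive's actual execution plan; the query in the comment and its table and column names are hypothetical.

```python
from collections import defaultdict

# Conceptually similar to a query such as (hypothetical names):
#   SELECT year, month, COUNT(*) FROM flights
#   WHERE cancelled = 1 GROUP BY year, month


def map_phase(rows):
    """Emit a ((year, month), 1) pair for every cancelled flight."""
    for year, month, cancelled in rows:
        if cancelled == 1:
            yield (year, month), 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, which corresponds to COUNT(*) per group."""
    return {key: sum(values) for key, values in sorted(grouped.items())}


# Hypothetical rows: (year, month, cancelled flag).
ROWS = [(2007, 1, 1), (2007, 1, 0), (2007, 1, 1), (2007, 2, 1), (2008, 1, 0)]

result = reduce_phase(shuffle_phase(map_phase(ROWS)))
print(result)
```

Each query pays the cost of launching such a job over the whole data set, which is one reason Hive queries are batch-oriented and have high latency.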

Microsoft Azure HDInsight
Deploys and provisions Hadoop clusters in
the cloud
HDInsight uses Hortonworks Data Platform
(HDP) Hadoop Distribution
HDInsight cluster configuration
 Number of data nodes: 4
 CPU: 4 Cores
 Memory: 7 GB
 Operating System: Windows Server 2012 R2 Datacenter
Hadoop clusters can be launched using
 Linux Operating System
 Windows Server Operating System

System Architecture

Experimental Results
Total number of flights cancelled each
month for the period 2005-2014
 Time taken: 210.862 seconds, Fetched: 120 row(s)
Total number of flights diverted each month
for the period 2005-2014
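
The slides do not show the queries themselves. As a rough illustration of the aggregation they describe, the sketch below counts cancelled and diverted flights per month from CSV input. The column names (Year, Month, Cancelled, Diverted) and the 1.00/0.00 flag encoding are assumptions about the file layout, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed layout; the real files are far larger.
SAMPLE_CSV = """Year,Month,Cancelled,Diverted
2005,1,1.00,0.00
2005,1,0.00,1.00
2005,1,1.00,0.00
2005,2,0.00,0.00
2014,12,0.00,1.00
"""


def monthly_counts(csv_file):
    """Return {(year, month): [cancelled, diverted]} for every month seen."""
    counts = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(csv_file):
        key = (int(row["Year"]), int(row["Month"]))
        counts[key][0] += int(float(row["Cancelled"]))
        counts[key][1] += int(float(row["Diverted"]))
    return dict(sorted(counts.items()))


counts = monthly_counts(io.StringIO(SAMPLE_CSV))
for (year, month), (cancelled, diverted) in counts.items():
    print(f"{year}-{month:02d}: cancelled={cancelled}, diverted={diverted}")
```

Grouping by year and month over 10 years gives 10 × 12 = 120 groups, which is consistent with the 120 rows fetched by the query reported above.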

Cancelled and Diverted flights by month
[Figure: Cancelled/Diverted vs. Time. Y-axis: number of cancelled/diverted flights; series: Cancelled, Diverted.]

Total number of flights cancelled every year
Total number of flights diverted every year

Cancelled and Diverted flights by year
[Figure: Number of cancelled/diverted flights vs. Year. X-axis: year (2005 to 2014); y-axis: number of cancelled flights; series: Cancelled, Diverted.]

Effect of flight distance on flight
diversions
Time taken: 675.725 seconds, Fetched: 1500 row(s)

Diverted Flights Vs Distance
[Figure: Number of diverted flights vs. Distance. X-axis: flight distance (in miles); y-axis: number of diverted flights (count); series: Diverted (Count).]

Effect of flight distance on flight
cancellations

Cancelled Flights Vs Distance
[Figure: Number of cancelled flights vs. Distance. Y-axis: number of cancelled flights (count); series: Cancellation (Count).]

Effect of flight distance on average
departure delay
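
The query behind this result is not shown in the slides. A minimal sketch of the underlying computation, averaging departure delay per flight distance, might look as follows. The record layout, the sample values, and the choice to skip records without a delay are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records: (distance in miles, departure delay in minutes or None).
FLIGHTS = [(337, 5.0), (337, 15.0), (2475, 30.0), (2475, None), (760, -2.0)]


def average_delay_by_distance(flights):
    """Return {distance: mean departure delay}, skipping missing delays."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance, delay in flights:
        if delay is None:  # assumed: some records carry no departure delay
            continue
        sums[distance] += delay
        counts[distance] += 1
    return {d: sums[d] / counts[d] for d in sorted(sums)}


averages = average_delay_by_distance(FLIGHTS)
print(averages)
```

Keeping a running sum and count per distance means the data only needs to be read once, and partial sums and counts from different chunks can be added together before the final division.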

Average Departure Delay vs Flight Distance
[Figure: Average Departure Delay vs. Flight Distance. Y-axis: average departure delay (in minutes); series: Avg Dep Delay.]

Monthly average departure delay for
the period 2005-2014

Average Departure Delay by month
[Figure: Average Departure Delay vs. Month. X-axis: month (Jan to Dec); series: Avg Dep Delay.]

Yearly average departure delay for the
period 2005-2014

Average Departure Delay by year
[Figure: Average Departure Delay vs. Year. X-axis: year (2005 to 2014); series: Avg Dep Delay.]

Conclusion
 Interesting trends and patterns exist in large data sets
 Average departure delay peaks in the middle and at the end of the year, i.e., in June, July and December
 Within the period 2005-2014, the highest number of flights was cancelled in 2007
 Cloud infrastructure has enabled the use of Hadoop for big data systems with minimal investment and cost of ownership
 Hive provides an easy way to query the data without worrying about the underlying complex structure of the system
 Big Data systems built in the cloud can be decommissioned without losing the data
 Any large-scale data set in business can be analyzed
 Marketing, Finance, Economics, Management
 Contact Prof. Jongwook Woo (jwoo5@calstatela.edu) if you are interested in a collaboration

Questions?

References
 Airline Data Set, United States Department of Transportation, http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
 What is Hive?, http://www-01.ibm.com/software/data/infosphere/hadoop/hive/
 Introduction to Windows Azure Blob Storage, https://www.simple-talk.com/cloud/cloud-data/an-introduction-to-windows-azure-blob-storage-/
 Introduction to Hadoop in HDInsight: Big-data analysis and processing in the cloud, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/
 Explorer for Microsoft Azure Storage: Freeware Client, http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx
 Upload data for Hadoop jobs in HDInsight, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/
 "Market Basket Analysis Algorithms with MapReduce", Jongwook Woo, DMKD-00150, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
