Jongwook Woo
HiPIC
CSULA
Big Data Analysis of Airline Data Set on Cloud Computing
JIIBR SYMPOSIUM 2015
Cal State LA, CA
October 9 2015
Nillohit Bhattacharya, nbhatta2@calstatela.edu
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
• Airline Data Set
• Hadoop: Data Intensive Computing
• Hadoop on Cloud Computing
• Hive and its Architecture on Azure
• Experimental Results
• Conclusions
Characteristics of the Airline Data Set
Data has been taken from the US
Department of Transportation
Consists of the arrival and departure records
of domestic airlines
Time period January 2005 – December 2014
(10 Years)
Total number of files: 120
File Format: csv (comma separated values)
Total file size: 13.1 GB
Total Number of records: 66 million
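The figures above imply a few averages that are useful when sizing a cluster. The following is a minimal sketch of that arithmetic, assuming decimal gigabytes and an even spread of records across files; neither assumption is stated on the slide.

```python
# Back-of-the-envelope averages derived from the slide's totals.
YEARS = 10              # January 2005 through December 2014
FILES = 120             # total number of CSV files
TOTAL_BYTES = 13.1e9    # 13.1 GB, assuming decimal gigabytes
TOTAL_RECORDS = 66e6    # about 66 million records

files_per_year = FILES / YEARS                   # one file per month
records_per_file = TOTAL_RECORDS / FILES         # average records per monthly file
bytes_per_record = TOTAL_BYTES / TOTAL_RECORDS   # average size of one CSV row
mb_per_file = TOTAL_BYTES / FILES / 1e6          # average size of one monthly file

print(f"files per year:   {files_per_year:.0f}")
print(f"records per file: {records_per_file:,.0f}")
print(f"bytes per record: {bytes_per_record:.1f}")
print(f"MB per file:      {mb_per_file:.1f}")
```

Under these assumptions each monthly file holds about 550,000 records, and one record takes roughly 198 bytes.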
Traditional Computing Challenges
Not easy for a single computer to store and
process all the data by itself.
Approached the problem in a different way
• Traditional Parallel Computer
– Processor Intensive Computing
• by increasing the processing speed and power of the
computer
As the data grows exponentially,
– The processing power of a single computer
becomes a bottleneck
– This approach mostly does not work for large-scale data
because of the latency of data transfer over the network
and disk I/O
A New Approach (Hadoop)
Many inexpensive commodity computers
all working together
• Data Intensive Computing
– break the data in smaller chunks and process the data
locally where it is stored
– Data Locality
• Computation occurs where data resides
All the computers process the data in parallel.
Provides the ability to harness the power of
multiple computers simultaneously.
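The idea of breaking the data into chunks and processing each chunk independently can be sketched in a few lines. This is an illustration only: threads stand in for cluster nodes, the `(year, month)` records are hypothetical, and the function names are not part of Hadoop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Break the data set into smaller chunks, similar in spirit to file blocks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Work done on one chunk in isolation: count records per year."""
    return Counter(year for year, _month in chunk)


def process_in_parallel(records, chunk_size=2, workers=4):
    """Process all chunks concurrently, then merge the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Hypothetical (year, month) flight records.
sample = [(2005, 1), (2005, 2), (2006, 1), (2007, 7), (2007, 8), (2007, 12)]
totals = process_in_parallel(sample)
print(dict(totals))
```

Each worker only sees its own chunk, so the only data that has to move is the small partial result, which is the point of data locality.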
Hadoop on Cloud
Create Hadoop clusters with minimal
investment.
No overhead of maintaining the cluster.
Delete the cluster when no longer needed.
Increase/Decrease resources on demand.
Deleting the cluster does not result in loss
of data.
Apache Hive
SQL-like language
Developed at Facebook
HQL (Hive Query Language)
differs from SQL
Runs MapReduce jobs under the hood.
Batch Process
Queries have a high latency
Read based
Not appropriate for transaction processing
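To make "runs MapReduce jobs under the hood" concrete, the sketch below mimics how a grouped aggregation (for example, summing a cancelled flag per year and month) could be split into map, shuffle, and reduce steps. It is a conceptual illustration with hypothetical field names, not Hive's actual query plan, which may add combiners and several stages.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one (key, value) pair per row; the key is the GROUP BY columns."""
    for row in rows:
        yield (row["year"], row["month"]), row["cancelled"]


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate the values of each key; here a sum, like SUM(cancelled)."""
    return {key: sum(values) for key, values in sorted(groups.items())}


# Hypothetical rows; cancelled is 1 for a cancelled flight, else 0.
rows = [
    {"year": 2005, "month": 1, "cancelled": 1},
    {"year": 2005, "month": 1, "cancelled": 0},
    {"year": 2005, "month": 1, "cancelled": 1},
    {"year": 2005, "month": 2, "cancelled": 1},
]
result = reduce_phase(shuffle_phase(map_phase(rows)))
print(result)
```

Because every query has to pass through these batch phases, even a small aggregation carries the job start-up and shuffle cost, which is consistent with the high latency noted above.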
Microsoft Azure HDInsight
Deploys and provisions Hadoop clusters in
the cloud
HDInsight uses Hortonworks Data Platform
(HDP) Hadoop Distribution
HDInsight cluster configuration
• Number of data nodes: 4
• CPU: 4 Cores
• Memory: 7 GB
• Operating System: Windows Server 2012 R2 Datacenter
Hadoop clusters can be launched using
• Linux Operating System
• Windows Server Operating System
System Architecture
Experimental Results
Total number of flights cancelled each
month for the period 2005-2014
• Time taken: 210.862 seconds, Fetched: 120 row(s)
Total number of flights diverted each month
for the period 2005-2014
• Time taken: 216.704 seconds, Fetched: 120 row(s)
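The slides do not show the queries themselves. As a rough local equivalent, the sketch below totals cancelled and diverted flights per (year, month) from CSV input. The column names `Year`, `Month`, `Cancelled`, and `Diverted`, and the `1.00`/`0.00` flag encoding, are assumptions about the file layout. Ten years of twelve months give 120 groups, which is consistent with the 120 fetched rows.

```python
import csv
import io
from collections import defaultdict


def monthly_totals(csv_file):
    """Return {(year, month): [cancelled, diverted]} for an open CSV file."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(csv_file):
        key = (int(row["Year"]), int(row["Month"]))
        # Flags are assumed to be stored as "1.00" / "0.00" strings.
        totals[key][0] += int(float(row["Cancelled"]))
        totals[key][1] += int(float(row["Diverted"]))
    return dict(totals)


# Tiny hypothetical sample standing in for the real monthly files.
SAMPLE = """Year,Month,Cancelled,Diverted
2005,1,1.00,0.00
2005,1,0.00,1.00
2005,1,1.00,0.00
2005,2,0.00,0.00
"""

totals = monthly_totals(io.StringIO(SAMPLE))
for (year, month), (cancelled, diverted) in sorted(totals.items()):
    print(year, month, cancelled, diverted)
```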
Cancelled and Diverted flights by month
[Figure: "Cancelled/Diverted vs. Time". Y-axis: number of cancelled/diverted flights (0 to 35,000). Series: Cancelled, Diverted.]
Experimental Results
Total number of flights cancelled every year
for the period 2005-2014
• Time taken: 302.465 seconds, Fetched: 10 row(s)
Total number of flights diverted every year
for the period 2005-2014
• Time taken: 461.433 seconds, Fetched: 10 row(s)
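The reported times for the monthly and yearly queries can be turned into an approximate scan rate. This sketch assumes that every query reads the full 13.1 GB (decimal) and all 66 million records, which the slides do not state, so the numbers are only indicative.

```python
TOTAL_BYTES = 13.1e9    # full data set, assuming decimal gigabytes
TOTAL_RECORDS = 66e6    # about 66 million records

# Query times in seconds, as reported on the slides.
timings = {
    "cancelled per month": 210.862,
    "diverted per month": 216.704,
    "cancelled per year": 302.465,
    "diverted per year": 461.433,
}


def scan_rate(seconds):
    """Return (MB per second, records per second) for a full scan."""
    return TOTAL_BYTES / seconds / 1e6, TOTAL_RECORDS / seconds


rates = {name: scan_rate(seconds) for name, seconds in timings.items()}
for name, (mb_per_s, rec_per_s) in rates.items():
    print(f"{name:20s} {mb_per_s:6.1f} MB/s {rec_per_s:10,.0f} records/s")
```

Under that assumption the fastest of the four queries processes roughly 62 MB/s across the cluster and the slowest roughly 28 MB/s.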
Cancelled and Diverted flights by year
[Figure: "Number of cancelled/diverted flights vs. Year". X-axis: year (2005 to 2014). Y-axis: number of cancelled flights (0 to 180,000). Series: Cancelled, Diverted.]
Experimental Results
Effect of flight distance on flight
diversions
Time taken: 675.725 seconds, Fetched: 1500 row(s)
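One way to relate distance to diversions is to count flagged flights per distinct distance value. The sketch below does this for any 0/1 flag column, so it works for diversions and cancellations alike. The column names `Distance`, `Diverted`, and `Cancelled` and the sample rows are hypothetical, and grouping by exact distance is an assumption about how the analysis was done.

```python
from collections import Counter


def count_by_distance(rows, flag_column):
    """Count rows whose flag is set, grouped by flight distance in miles."""
    counts = Counter()
    for row in rows:
        if int(float(row[flag_column])) == 1:
            counts[int(float(row["Distance"]))] += 1
    return dict(sorted(counts.items()))


# Hypothetical rows; values are strings, as they would be when read from CSV.
rows = [
    {"Distance": "337.00", "Diverted": "1.00", "Cancelled": "0.00"},
    {"Distance": "337.00", "Diverted": "1.00", "Cancelled": "0.00"},
    {"Distance": "2475.00", "Diverted": "0.00", "Cancelled": "1.00"},
    {"Distance": "2475.00", "Diverted": "1.00", "Cancelled": "0.00"},
]

diverted = count_by_distance(rows, "Diverted")
cancelled = count_by_distance(rows, "Cancelled")
print(diverted)
print(cancelled)
```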
Diverted Flights Vs Distance
[Figure: "Number of diverted flights vs. Distance". X-axis: flight distance in miles (0 to 6,000). Y-axis: number of diverted flights, count (0 to 1,000). Series: Diverted (Count).]
Experimental Results
Effect of flight distance on flight
cancellations
Time taken: 576.925 seconds, Fetched: 1500 row(s)
Cancelled Flights Vs Distance
[Figure: "Number of cancelled flights vs. Distance". X-axis: flight distance in miles (0 to 6,000). Y-axis: number of cancelled flights, count (0 to 14,000). Series: Cancellation (Count).]
Experimental Results
Effect of flight distance on average
departure delay
Time taken: 992.911 seconds, Fetched: 1500 row(s)
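An average per group is usually computed by carrying a running sum and a count for each key, which also lets partial results from different chunks be merged. A minimal sketch follows, assuming hypothetical `Distance` and `DepDelay` columns and assuming that rows with an empty delay (for example, cancelled flights) are skipped.

```python
from collections import defaultdict


def average_delay_by_distance(rows):
    """Return {distance: mean departure delay in minutes}."""
    sums = defaultdict(lambda: [0.0, 0])  # distance -> [delay sum, row count]
    for row in rows:
        delay = row.get("DepDelay", "")
        if delay == "":
            continue  # no departure delay recorded for this flight
        entry = sums[int(float(row["Distance"]))]
        entry[0] += float(delay)
        entry[1] += 1
    return {dist: total / n for dist, (total, n) in sorted(sums.items())}


# Hypothetical rows; negative delays mean an early departure.
rows = [
    {"Distance": "337.00", "DepDelay": "10.00"},
    {"Distance": "337.00", "DepDelay": "-4.00"},
    {"Distance": "337.00", "DepDelay": ""},
    {"Distance": "2475.00", "DepDelay": "30.00"},
]

averages = average_delay_by_distance(rows)
print(averages)
```

The same function gives a monthly or yearly average if the grouping key is changed from the distance to the month or year.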
Average Departure Delay vs Flight Distance
[Figure: "Average Departure Delay vs. Flight Distance". X-axis: flight distance in miles (0 to 6,000). Y-axis: average departure delay in minutes (0 to 250). Series: Avg Dep Delay.]
Experimental Results
Monthly average departure delay for
the period 2005-2014
Time taken: 973.695 seconds, Fetched: 13 row(s)
Average Departure Delay by month
[Figure: "Average Departure Delay vs. Month". X-axis: month (Jan to Dec). Y-axis: average departure delay in minutes (0 to 14). Series: Avg Dep Delay.]
Experimental Results
Yearly average departure delay for the
period 2005-2014
• Time taken: 623.694 seconds, Fetched: 11 row(s)
Average Departure Delay by year
[Figure: "Average Departure Delay vs. Year". X-axis: year (2005 to 2014). Y-axis: average departure delay in minutes (0 to 12). Series: Avg Dep Delay.]
Conclusion
• Interesting trends and patterns exist in large data
sets
  – Average departure delay peaks in the middle and at the end of the year, i.e.
  during the months of June, July and December
  – The highest number of flights was cancelled in 2007,
  within the period 2005-2014
• Cloud infrastructure has enabled the use of Hadoop for big
data systems with minimal investment and cost of ownership
• Hive provides an easy way to query the data without worrying
about the underlying complex structure of the system
• Big Data systems built in the cloud can be decommissioned
without losing the data
• Any large-scale data set in business can be analyzed
  – Marketing, Finance, Economics, Management
• Contact Prof Jongwook Woo (jwoo5@calstatela.edu) if you
are interested in a collaboration
Questions?

More Related Content

What's hot

Earley Executive Roundtable Summary - Data Analytics
Earley Executive Roundtable Summary - Data AnalyticsEarley Executive Roundtable Summary - Data Analytics
Earley Executive Roundtable Summary - Data AnalyticsEarley Information Science
 
Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?Edureka!
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataIMC Institute
 
Memory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective ViewMemory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective Viewijtsrd
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCESURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCEAM Publications,India
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111NavNeet KuMar
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Jonathan Seidman
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangDatabricks
 
Big Data & Data Science
Big Data & Data ScienceBig Data & Data Science
Big Data & Data ScienceBrijeshGoyani
 
High Performance Data Analysis (HPDA): HPC - Big Data Convergence
High Performance Data Analysis (HPDA): HPC - Big Data ConvergenceHigh Performance Data Analysis (HPDA): HPC - Big Data Convergence
High Performance Data Analysis (HPDA): HPC - Big Data Convergenceinside-BigData.com
 
Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...Fujitsu Middle East
 
Aginity "Big Data" Research Lab
Aginity "Big Data" Research LabAginity "Big Data" Research Lab
Aginity "Big Data" Research Labkevinflorian
 
Analysis of big data in pandemic case
Analysis of big data in pandemic case Analysis of big data in pandemic case
Analysis of big data in pandemic case Muh Saleh
 
Practical architecture - in-memory data systems patterns - WD-40 between data...
Practical architecture - in-memory data systems patterns - WD-40 between data...Practical architecture - in-memory data systems patterns - WD-40 between data...
Practical architecture - in-memory data systems patterns - WD-40 between data...emmanuelbernard
 
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",..."From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...Dataconomy Media
 
Big Data Landscape 2016
Big Data Landscape 2016 Big Data Landscape 2016
Big Data Landscape 2016 Matt Turck
 
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageGeospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageSteven Ramage
 
Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Cambridge Semantics
 
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...Cambridge Semantics
 

What's hot (20)

Earley Executive Roundtable Summary - Data Analytics
Earley Executive Roundtable Summary - Data AnalyticsEarley Executive Roundtable Summary - Data Analytics
Earley Executive Roundtable Summary - Data Analytics
 
Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
BIg Data Trends in 2016
BIg Data Trends in 2016BIg Data Trends in 2016
BIg Data Trends in 2016
 
Memory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective ViewMemory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective View
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCESURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan Wang
 
Big Data & Data Science
Big Data & Data ScienceBig Data & Data Science
Big Data & Data Science
 
High Performance Data Analysis (HPDA): HPC - Big Data Convergence
High Performance Data Analysis (HPDA): HPC - Big Data ConvergenceHigh Performance Data Analysis (HPDA): HPC - Big Data Convergence
High Performance Data Analysis (HPDA): HPC - Big Data Convergence
 
Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...
 
Aginity "Big Data" Research Lab
Aginity "Big Data" Research LabAginity "Big Data" Research Lab
Aginity "Big Data" Research Lab
 
Analysis of big data in pandemic case
Analysis of big data in pandemic case Analysis of big data in pandemic case
Analysis of big data in pandemic case
 
Practical architecture - in-memory data systems patterns - WD-40 between data...
Practical architecture - in-memory data systems patterns - WD-40 between data...Practical architecture - in-memory data systems patterns - WD-40 between data...
Practical architecture - in-memory data systems patterns - WD-40 between data...
 
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",..."From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
 
Big Data Landscape 2016
Big Data Landscape 2016 Big Data Landscape 2016
Big Data Landscape 2016
 
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageGeospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
 
Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?
 
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
 

Similar to Big Data Analysis of Airline Data Set on Cloud Computing

Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Jongwook Woo
 
Big Data Analysis and Industrial Approach using Spark
Big Data Analysis and Industrial Approach using SparkBig Data Analysis and Industrial Approach using Spark
Big Data Analysis and Industrial Approach using SparkJongwook Woo
 
Chek mate geolocation analyzer
Chek mate geolocation analyzerChek mate geolocation analyzer
Chek mate geolocation analyzerpriyal mistry
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Big Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingBig Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingJongwook Woo
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sri Ambati
 
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the EcosystemsIntroduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the EcosystemsJongwook Woo
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldRob Gillen
 
Cloud computing shim
Cloud computing shimCloud computing shim
Cloud computing shimtistrue
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingJongwook Woo
 
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesWindows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesComunidade NetPonto
 
TUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsTUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsHong-Linh Truong
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data CenterGilles Fedak
 
Big Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesBig Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesJongwook Woo
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017Jongwook Woo
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataJongwook Woo
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Alluxio, Inc.
 

Similar to Big Data Analysis of Airline Data Set on Cloud Computing (20)

Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data
 
Spark ukc2015v1.1
Spark ukc2015v1.1Spark ukc2015v1.1
Spark ukc2015v1.1
 
Big Data Analysis and Industrial Approach using Spark
Big Data Analysis and Industrial Approach using SparkBig Data Analysis and Industrial Approach using Spark
Big Data Analysis and Industrial Approach using Spark
 
Chek mate geolocation analyzer
Chek mate geolocation analyzerChek mate geolocation analyzer
Chek mate geolocation analyzer
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Big Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingBig Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive Computing
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014
 
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the EcosystemsIntroduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
Data Science on Google Cloud Platform
Data Science on Google Cloud PlatformData Science on Google Cloud Platform
Data Science on Google Cloud Platform
 
Cloud computing shim
Cloud computing shimCloud computing shim
Cloud computing shim
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and Training
 
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesWindows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
 
TUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsTUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflows
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data Center
 
Big Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesBig Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use Cases
 
DWBI Testing Profile
DWBI Testing ProfileDWBI Testing Profile
DWBI Testing Profile
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big Data
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 

Recently uploaded

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 

Big Data Analysis of Airline Data Set on Cloud Computing

  • 1. Big Data Analysis of Airline Data Set on Cloud Computing. JIIBR Symposium 2015, Cal State LA, CA, October 9 2015. Nillohit Bhattacharya, nbhatta2@calstatela.edu; Jongwook Woo, PhD, jwoo5@calstatela.edu. High-Performance Information Computing Center (HiPIC), Cloudera Academic Partner and Grants Awardee of Amazon AWS, California State University Los Angeles (CSULA).
  • 2. High Performance Information Computing Center, Jongwook Woo, CSULA. Contents: Airline Data Set; Hadoop: Data Intensive Computing; Hadoop on Cloud Computing; Hive and its Architecture on Azure; Experimental Results; Conclusions.
  • 3. Characteristics of the Airline Data Set: The data was taken from the US Department of Transportation. It consists of the arrival and departure records of domestic airlines. Time period: January 2005 to December 2014 (10 years). Total number of files: 120. File format: CSV (comma-separated values). Total file size: 13.1 GB. Total number of records: 66 million.
  • 4. Traditional Computing Challenges: It is not easy for a single computer to store and process all the data by itself. Traditional parallel computers approached the problem through processor-intensive computing, that is, by increasing the processing speed and power of the computer. As the data grows exponentially, the processing power of a single computer becomes a bottleneck, and this approach mostly does not work for large-scale data because of the latency of data transfer over the network and disk I/O.
  • 5. A New Approach (Hadoop): Many inexpensive commodity computers all working together. Data-intensive computing: break the data into smaller chunks and process each chunk locally where it is stored. Data locality: computation occurs where the data resides. All the computers process the data in parallel, which provides the ability to harness the power of multiple computers simultaneously.
  • 6. Hadoop on Cloud: Create Hadoop clusters with minimal investment. No overhead of maintaining the cluster. Delete the cluster when it is no longer needed. Increase or decrease resources on demand. Deleting the cluster does not result in loss of data.
  • 7. Apache Hive: SQL-like language. Developed at Facebook. HQL (Hive Query Language) is different from SQL. Runs MapReduce jobs under the hood. Batch processing: queries have a high latency. Read-based: not appropriate for transaction processing.
  • 8. Microsoft Azure HDInsight: Deploys and provisions Hadoop clusters in the cloud. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution. HDInsight cluster configuration: number of data nodes: 4; CPU: 4 cores; memory: 7 GB; operating system: Windows Server 2012 R2 Datacenter. Hadoop clusters can be launched using either the Linux or the Windows Server operating system.
  • 9. System Architecture
  • 10. Experimental Results: Total number of flights cancelled each month for the period 2005-2014 (Time taken: 210.862 seconds, Fetched: 120 row(s)). Total number of flights diverted each month for the period 2005-2014 (Time taken: 216.704 seconds, Fetched: 120 row(s)).
  • 11. Cancelled and Diverted flights by month. [Figure: Cancelled/Diverted vs. Time; y-axis: number of cancelled/diverted flights; series: Cancelled, Diverted]
  • 12. Experimental Results: Total number of flights cancelled every year for the period 2005-2014 (Time taken: 302.465 seconds, Fetched: 10 row(s)). Total number of flights diverted every year for the period 2005-2014 (Time taken: 461.433 seconds, Fetched: 10 row(s)).
  • 13. Cancelled and Diverted flights by year. [Figure: Number of cancelled/diverted flights vs. Year; x-axis: year (2005 to 2014); y-axis: number of cancelled flights; series: Cancelled, Diverted]
  • 14. Experimental Results: Effect of flight distance on flight diversions (Time taken: 675.725 seconds, Fetched: 1500 row(s)).
  • 15. Diverted Flights vs. Distance. [Figure: Number of diverted flights vs. Distance; x-axis: flight distance (in miles); y-axis: number of diverted flights (count); series: Diverted (Count)]
  • 16. Experimental Results: Effect of flight distance on flight cancellations (Time taken: 576.925 seconds, Fetched: 1500 row(s)).
  • 17. Cancelled Flights vs. Distance. [Figure: Number of cancelled flights vs. Distance; x-axis: flight distance (in miles); y-axis: number of cancelled flights (count); series: Cancellation (Count)]
  • 18. Experimental Results: Effect of flight distance on average departure delay (Time taken: 992.911 seconds, Fetched: 1500 row(s)).
  • 19. Average Departure Delay vs. Flight Distance. [Figure: x-axis: flight distance (in miles); y-axis: average departure delay (in minutes); series: Avg Dep Delay]
  • 20. Experimental Results: Monthly average departure delay for the period 2005-2014 (Time taken: 973.695 seconds, Fetched: 13 row(s)).
  • 21. Average Departure Delay by month. [Figure: Average Departure Delay vs. Month; x-axis: month (Jan to Dec); y-axis: average departure delay (in minutes); series: Avg Dep Delay]
  • 22. Experimental Results: Yearly average departure delay for the period 2005-2014 (Time taken: 623.694 seconds, Fetched: 11 row(s)).
  • 23. Average Departure Delay by year. [Figure: Average Departure Delay vs. Year; x-axis: year (2005 to 2014); y-axis: average departure delay (in minutes); series: Avg Dep Delay]
  • 24. Conclusion: Interesting trends and patterns exist in large data sets. Average departure delay peaks in the middle and at the end of the year, i.e. in June, July and December. Within the period 2005-2014, the highest number of flights was cancelled in 2007. Cloud infrastructure has enabled the use of Hadoop for big data systems with minimal investment and cost of ownership. Hive provides an easy way to query the data without worrying about the underlying complex structure of the system. Big data systems built in the cloud can be decommissioned without losing the data. Any large-scale business data set can be analyzed, for example in marketing, finance, economics and management. Contact Prof. Jongwook Woo (jwoo5@calstatela.edu) if you need a collaboration.
  • 25. Question?
  • 26. References: Airline Data Set, United States Department of Transportation, http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236; What is Hive?, http://www-01.ibm.com/software/data/infosphere/hadoop/hive/; Introduction to Windows Azure Blob Storage, https://www.simple-talk.com/cloud/cloud-data/an-introduction-to-windows-azure-blob-storage-/; Introduction to Hadoop in HDInsight: Big-data analysis and processing in the cloud, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/; Explorer for Microsoft Azure Storage: Freeware Client, http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx; Upload data for Hadoop jobs in HDInsight, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/; "Market Basket Analysis Algorithms with MapReduce", Jongwook Woo, DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795.
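
The slides report Hive query timings but do not show the queries themselves. As a rough illustration of the kind of grouping behind the monthly results (cancelled and diverted counts, average departure delay), here is a minimal Python sketch. It is not the authors' Hive code. The column names (YEAR, MONTH, CANCELLED, DIVERTED, DEP_DELAY, DISTANCE) and the sample rows are assumptions made for this example, and the function name monthly_summary is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows for illustration only; the column names are assumed,
# not taken from the slides. An empty DEP_DELAY marks a flight with no
# recorded departure delay (for example, a cancelled flight).
SAMPLE_CSV = """YEAR,MONTH,CANCELLED,DIVERTED,DEP_DELAY,DISTANCE
2005,1,1,0,,300
2005,1,0,0,10,300
2005,1,0,1,20,1200
2005,2,0,0,-5,500
2005,2,1,0,,500
"""


def monthly_summary(csv_text):
    """Aggregate flights per (year, month), similar in spirit to a GROUP BY."""
    cancelled = defaultdict(int)
    diverted = defaultdict(int)
    delay_sum = defaultdict(float)
    delay_count = defaultdict(int)

    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["YEAR"]), int(row["MONTH"]))
        # int(float(...)) tolerates flags written as "1" or "1.00".
        cancelled[key] += int(float(row["CANCELLED"]))
        diverted[key] += int(float(row["DIVERTED"]))
        if row["DEP_DELAY"]:
            delay_sum[key] += float(row["DEP_DELAY"])
            delay_count[key] += 1

    summary = {}
    for key in sorted(cancelled):
        count = delay_count[key]
        summary[key] = {
            "cancelled": cancelled[key],
            "diverted": diverted[key],
            # Average only over flights that have a recorded delay.
            "avg_dep_delay": delay_sum[key] / count if count else None,
        }
    return summary


summary = monthly_summary(SAMPLE_CSV)
for key, stats in summary.items():
    print(key, stats)
```

Grouping by (year, month) over ten years yields 120 groups, which matches the 120 rows fetched for the monthly cancelled and diverted queries. On the cluster, Hive would run this kind of aggregation as MapReduce jobs over the full data set rather than in a single process.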