Big Data Airline Project at UAEU

Published in: Data & Analytics


  1. Big Data Airlines Project, by Ziyad Saleh
  2. What is Big Data? Big data is a broad term for data sets so large or complex that they are difficult to process using traditional data processing applications. Big data is measured in terabytes (1 TB = 1024 GB), and terabytes of new data are generated daily, so the speed of analyzing this flow of data is itself a challenge. Big data can be described by the 4 Vs: Volume, Velocity, Variety, and Veracity.
  3. Small Data vs. Big Data
  4. Map Reduce
  5. Map Reduce model
  6. Project Scope
  7. The scope is limited to:
     1. Installing and configuring the Hadoop Map/Reduce platform.
     2. Analyzing a big data sample of U.S. domestic flight performance and delays over 5 years in order to:
        1. Identify the top carriers experiencing delays.
        2. Identify the top airports and states with departure delays.
        3. Plot state delays on a thematic map of the USA.
  8. Source of Data for the Project: Datasets will be collected from the U.S. Department of Transportation (DOT) – Statistical Computing
  9. Size of Data: The dataset will be between 500 GB and 1 TB in size and will cover 5 years of flight statistics.
  10. Table 1: Airline Dataset Dictionary.

      Field Name          Description
      Year                Year of the scheduled flight
      Month               Month of the scheduled flight (1–12)
      Day                 Day of the month (1–31)
      DepTime             Actual departure time of the flight
      CRSDepTime          Scheduled departure time
      ArrTime             Actual arrival time in HH/MM format
      CRSArrTime          Scheduled arrival time
      FlightNum           Flight number
      ArrDelay            Arrival delay
      DepDelay            Departure delay, in minutes
      CarrierDelay        Delay (in minutes) caused by factors within the control of the carrier
      WeatherDelay        Delay (in minutes) caused by extreme weather conditions
      NASDelay            Delay (in minutes) within the control of the National Airspace System (NAS)
      SecurityDelay       Delay (in minutes) caused by security reasons
      LateAircraftDelay   Delay (in minutes) due to the same aircraft arriving late at a previous airport
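As an illustration of how the fields in Table 1 might be read, here is a minimal sketch in plain Java. It assumes the data arrives as comma-separated text with a header line naming the fields, and that missing values are either empty or written as `NA`. Neither assumption is stated in the slides. The class name `FlightRecordParser`, its methods, and the sample row are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class FlightRecordParser {
    /** Maps each column name in the header line to its position. */
    static Map<String, Integer> indexHeader(String headerLine) {
        Map<String, Integer> index = new HashMap<>();
        String[] names = headerLine.split(",", -1);
        for (int i = 0; i < names.length; i++) {
            index.put(names[i].trim(), i);
        }
        return index;
    }

    /**
     * Returns the named field as whole minutes. Unknown columns, empty
     * values, and "NA" (assumed missing-value marker) are treated as 0.
     */
    static int minutes(String[] fields, Map<String, Integer> index, String name) {
        Integer pos = index.get(name);
        if (pos == null || pos >= fields.length) {
            return 0;
        }
        String raw = fields[pos].trim();
        if (raw.isEmpty() || raw.equals("NA")) {
            return 0;
        }
        return Integer.parseInt(raw);
    }

    public static void main(String[] args) {
        // Invented sample data using a subset of the Table 1 field names.
        String header = "Year,Month,Day,DepTime,ArrDelay,DepDelay,WeatherDelay";
        String row = "2008,1,3,2003,-14,8,NA";
        Map<String, Integer> index = indexHeader(header);
        String[] fields = row.split(",", -1);
        System.out.println("ArrDelay = " + minutes(fields, index, "ArrDelay"));
        System.out.println("DepDelay = " + minutes(fields, index, "DepDelay"));
        System.out.println("WeatherDelay = " + minutes(fields, index, "WeatherDelay"));
    }
}
```

Looking fields up by header name avoids hard-coding a column order, which the slides do not specify. Treating missing values as 0 is a simplification; an analysis of averages would likely need to skip them instead.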
  11. Data Pre-Processing, Processing, and Analytics
      Data pre-processing: Data will be cleansed, and some artifacts will be filtered out as necessary. Many fields in the airline data set will be discarded because they are irrelevant to the subject of delay that we are concerned with.
      Data processing and analytics: Data will be processed with Java programs on Map/Reduce to reduce the size of the data and produce smaller, organized datasets. Next, the resulting datasets will be analyzed using additional tools such as R.
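The Map/Reduce processing step described above can be sketched in plain Java without a Hadoop cluster. This is not the project's actual mapper or reducer, which the transcript does not include. It is an in-memory imitation of the map, shuffle, and reduce phases, assuming the input has already been reduced to `carrier,arrivalDelay` lines and that only positive delays are summed. The class name `CarrierDelayMapReduce` and the sample lines are invented for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CarrierDelayMapReduce {
    /** Map phase: turns one "carrier,delay" line into a (carrier, delay) pair. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String[] parts = line.split(",", -1);
        if (parts.length != 2) {
            return out; // malformed line: emit nothing
        }
        try {
            int delay = Integer.parseInt(parts[1].trim());
            if (delay > 0) { // assumption: early or on-time flights are ignored
                out.add(new SimpleEntry<>(parts[0].trim(), delay));
            }
        } catch (NumberFormatException e) {
            // non-numeric delay such as "NA": emit nothing
        }
        return out;
    }

    /** Reduce phase: sums all delay minutes collected for one carrier. */
    static long reduce(String carrier, List<Integer> delays) {
        long total = 0;
        for (int d : delays) {
            total += d;
        }
        return total;
    }

    /** Runs map, shuffle (group by carrier), and reduce in memory. */
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            totals.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Invented sample lines, not real flight data.
        List<String> sample = Arrays.asList("WN,12", "AA,30", "WN,-5", "AA,NA", "WN,8");
        System.out.println(run(sample)); // prints {AA=30, WN=20}
    }
}
```

On a real cluster the grouping step is performed by the framework between the mapper and the reducer; here it is a `TreeMap` so the whole flow fits in one file.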
  12. Data Storage: Data will be stored in HDFS across multiple storage nodes, with a total size between 500 GB and 1 TB. [Diagram: Airlines Big Data stored in HDFS]
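To give a feel for what 500 GB to 1 TB means inside HDFS, the following sketch estimates the number of blocks and the raw disk space consumed. The 128 MB block size and replication factor of 3 are assumed values (common Hadoop 2 defaults), not figures from the slides. The class `HdfsSizing` is hypothetical.

```java
public class HdfsSizing {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    /** Number of blocks needed for a file, rounding the last partial block up. */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Raw bytes stored across the cluster once every block is replicated. */
    static long rawBytes(long dataBytes, int replication) {
        return dataBytes * replication;
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB; // assumed block size
        int replication = 3;       // assumed replication factor
        long[] sizes = {500 * GB, 1024 * GB}; // 500 GB and 1 TB
        for (long size : sizes) {
            System.out.println((size / GB) + " GB -> "
                    + blockCount(size, blockSize) + " blocks, "
                    + (rawBytes(size, replication) / GB) + " GB raw");
        }
        // prints:
        // 500 GB -> 4000 blocks, 1500 GB raw
        // 1024 GB -> 8192 blocks, 3072 GB raw
    }
}
```

The estimate treats the dataset as one file. With many smaller files, each file's last block is only partly filled, so the real block count would be somewhat higher.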
  13. Target Analysis: Across 5 years of flight information for all US domestic airlines:
      1. Which carriers have the most aggregated delay in their flights?
      2. Which states have the most delays?
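Both target questions come down to ranking aggregated totals. A minimal sketch of that ranking step in Java follows, assuming the per-carrier (or per-state) delay totals are already available as a map. The class `TopDelays` and the figures in `main` are invented for illustration and are not results from the project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopDelays {
    /** Returns the n entries with the largest totals, largest first. */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        // Sort by total delay, descending; ties keep their original order.
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        int limit = Math.max(0, Math.min(n, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        // Invented totals in minutes, keyed by carrier code.
        Map<String, Long> totals = new LinkedHashMap<>();
        totals.put("WN", 20L);
        totals.put("AA", 30L);
        totals.put("DL", 5L);
        System.out.println(topN(totals, 2)); // prints [AA=30, WN=20]
    }
}
```

The same method works for states by keying the map on a state code instead of a carrier code.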
  14. Design
  15. Airlines Project Workflow and Design [Diagram labels: Master Node (Name Node, Job Tracker); Node 1 to Node 4; Airlines Big Data; HDFS; Task; Java Code; Mapper; Reducer; Reducer Node; Top Airlines]
  16. Implementation
  17. Software and Tools
      1. CentOS Linux operating system
      2. Apache Hadoop
      3. Cloudera CDH 5.3 virtual machine
      4. Oracle VM VirtualBox Manager
      5. Eclipse IDE
      6. Java (Oracle JDK)
      7. Maven
      8. Microsoft Excel and Access 2010
      9. The R statistical tool
  18. Mapper:
  19. Reducer:
  20. R:
  21. Findings
  22. US Airlines Delay (Per Carrier) [Chart: vertical axis from 0 to 1.2; carriers: WN, AA, OO, MQ, US, DL, UA, XE, NW, CO, EV, 9E, FL, YV, OH, B6, AS, F9, HA, AQ, PI, HP, EA, PS, TW; series: ArrivalOnTime, ArrivalDelays, DepartureOnTime, DepartureDelays, Cancellations, Diversions]
  23. Thematic Map of US Airlines Delay (Per State)
  24. Conclusion
  25. Conclusion:
      • Big data is the large amount of continuously generated data that cannot be processed and analyzed using traditional data management tools.
      • Big data is a rapidly rising topic that is reshaping the future; demand for big data scientists is high and is expected to continue in the coming period.
      • Hadoop is an open source framework for storing and processing large datasets using clusters of commodity hardware.
      • Big data analytics is attracting both businesses and policy makers, who want to leverage it for better-informed decisions and planning for the future.
      • Big data now, normal data tomorrow.
  26. Big Data Tutorials
  27. Online Big Data Tutorials:
      1. Udemy: https://www.udemy.com/course/subscribe/?courseId=336982&dtcode=lGCe31035ujY
      2. Udacity: https://www.udacity.com/courses#!/data-science
      3. EMC: https://education.emc.com/guest/campaign/data_science.aspx
      4. Coursera: https://www.coursera.org/course/datasci
      5. Caltech, Learning from Data: http://work.caltech.edu/telecourse.html
      6. MIT OpenCourseWare: http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/index.htm
      7. Stanford OpenClassroom: http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning
      8. Big Data University: https://bigdatauniversity.com/curriculum-map/
  28. Thank You. Ziyad Saleh. "O Allah, teach us what benefits us, benefit us through what You have taught us, and increase us in knowledge."
