What is Big Data?
Big data is a broad term for very large or complex data sets that are
difficult to process using traditional data processing applications.
Big Data means terabytes of data (1 TB = 1,024 GB) to be processed and
analyzed. Terabytes of new data are generated daily, so the speed of
analyzing this huge flow of data is itself a challenge.
Big data can be described by the 4 Vs: Volume, Velocity, Variety and
Veracity.
The scope is limited to:
1. Installing and configuring Hadoop Map/Reduce.
2. Analyzing a big data sample of U.S. domestic flight performance
   and delays over 5 years, to identify:
   1. Top carriers experiencing delays.
   2. Top airports and states with departure delays.
   3. Plotting state delays on a thematic map of the USA.
Source of Data for the project
Datasets will be collected from the U.S. Department of Transportation
(DOT).
The dataset size will be between 500 GB and 1 TB, covering 5 years of
flight statistics.
Size of Data
Field Name          Description
Year                Year of the scheduled flight.
Month               Month of the scheduled flight (1–12).
Day                 Day of the month (1–31).
DepTime             Actual departure time of the flight.
CRSDepTime          Scheduled departure time.
ArrTime             Actual arrival time, in HH/MM format.
CRSArrTime          Scheduled arrival time.
FlightNum           Flight number.
ArrDelay            Arrival delay, in minutes.
DepDelay            Departure delay, in minutes.
CarrierDelay        Delay (in minutes) caused by factors within the control of the carrier.
WeatherDelay        Delay (in minutes) caused by extreme weather conditions.
NASDelay            Delay (in minutes) within the control of the National Airspace System (NAS).
SecurityDelay       Delay (in minutes) caused by security reasons.
LateAircraftDelay   Delay (in minutes) due to the same aircraft arriving late at a previous airport.
Table 1: Airline Dataset Dictionary.
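The dictionary in Table 1 can be turned into a small lookup helper. The sketch below is only an illustration: it assumes each flight is one comma-separated row, that the columns appear in the same order as Table 1, and that missing values are written as NA. None of these is stated in the slides, and the class name AirlineDictionary is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

class AirlineDictionary {
    // Field names copied from Table 1. The column ORDER is an assumption.
    static final List<String> FIELDS = Arrays.asList(
        "Year", "Month", "Day", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
        "FlightNum", "ArrDelay", "DepDelay", "CarrierDelay", "WeatherDelay",
        "NASDelay", "SecurityDelay", "LateAircraftDelay");

    // Returns the raw text of a named field from one comma-separated row.
    static String field(String row, String name) {
        int index = FIELDS.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown field: " + name);
        }
        String[] cells = row.split(",", -1); // -1 keeps trailing empty cells
        return index < cells.length ? cells[index].trim() : "";
    }

    // Parses a delay column in minutes. Empty, "NA" (assumed marker) or
    // malformed cells are reported as missing rather than as zero.
    static OptionalInt minutes(String row, String name) {
        String raw = field(row, name);
        if (raw.isEmpty() || raw.equals("NA")) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}
```

Reporting a missing delay as empty instead of 0 keeps "no data" from being counted as "on time" in later aggregation.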
Data Pre-Processing, Processing and Analytics
Data will be cleansed and some artifacts will be filtered out as
necessary. Many fields in the airline data set will be discarded because
they are irrelevant to the subject we are concerned with: flight delays.
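The cleansing step described above could look roughly like the following. This is a minimal sketch under assumptions: rows are comma-separated, the columns to keep are given by index, and a row is dropped when its delay cell is not a whole number of minutes. The class DelayCleanser and the example column positions are hypothetical, not taken from the project.

```java
import java.util.ArrayList;
import java.util.List;

class DelayCleanser {
    // Keeps only the cells at the given column indices, in that order.
    static String project(String row, int[] keep) {
        String[] cells = row.split(",", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < keep.length; i++) {
            if (i > 0) {
                out.append(',');
            }
            out.append(keep[i] < cells.length ? cells[keep[i]].trim() : "");
        }
        return out.toString();
    }

    // True when the cell holds a whole number of minutes (possibly negative).
    static boolean isMinutes(String cell) {
        return cell.matches("-?\\d+");
    }

    // Projects every row, then drops rows whose delay cell is missing or
    // malformed. delayPos is the delay's position in the PROJECTED row.
    static List<String> cleanse(List<String> rows, int[] keep, int delayPos) {
        List<String> cleaned = new ArrayList<>();
        for (String row : rows) {
            String slim = project(row, keep);
            String[] cells = slim.split(",", -1);
            if (delayPos < cells.length && isMinutes(cells[delayPos])) {
                cleaned.add(slim);
            }
        }
        return cleaned;
    }
}
```

Because the same check rejects header lines and NA values, one pass both shrinks the rows and removes records that cannot contribute to a delay total.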
Data Processing and Analytics:
Data will be processed with Java programs on Map/Reduce to reduce the
size of the data and produce a smaller, organized dataset.
Next, the resulting datasets will be analyzed using additional tools.
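To make the Map/Reduce idea concrete, here is a plain-Java sketch of the map, shuffle and reduce steps for summing delay minutes per carrier. It deliberately does not use the Hadoop API, so it runs on a bare JDK. It assumes cleansed input rows of the form carrier,delayMinutes and counts only positive delays. Both choices, and the class name CarrierDelayJob, are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CarrierDelayJob {
    // Map step: turn one "carrier,delayMinutes" row into a (carrier, delay)
    // pair. Malformed rows and flights that were not late emit nothing.
    static List<Map.Entry<String, Integer>> map(String row) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String[] cells = row.split(",", -1);
        if (cells.length < 2) {
            return out;
        }
        try {
            int delay = Integer.parseInt(cells[1].trim());
            if (delay > 0) {
                out.add(new SimpleEntry<>(cells[0].trim(), delay));
            }
        } catch (NumberFormatException e) {
            // Delay is not a number (for example "NA"): skip the row.
        }
        return out;
    }

    // Reduce step: sum all delay values collected for one carrier.
    static long reduce(String carrier, List<Integer> delays) {
        long total = 0;
        for (int d : delays) {
            total += d;
        }
        return total;
    }

    // Driver: run map on every row, group pairs by key (the "shuffle"),
    // then run reduce once per carrier.
    static Map<String, Long> run(List<String> rows) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String row : rows) {
            for (Map.Entry<String, Integer> pair : map(row)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            totals.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return totals;
    }
}
```

On a real cluster the framework performs the grouping between the two steps and runs many map and reduce calls in parallel. The output has one line per carrier, which is what makes the result so much smaller than the input.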
Data will be stored in HDFS across multiple
storage nodes, with a total size between
500 GB and 1 TB.
Across the 5 years of all U.S. domestic airline flights:
1. Which carriers have the most aggregated
delay in their flights?
2. Which states have the most delays?
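Both questions reduce to ranking aggregated totals. The sketch below assumes the totals are already available as a map from a key (a carrier code or a state) to delay minutes. The class name TopDelays and the alphabetical tie-break are illustrative choices, not part of the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TopDelays {
    // Returns the n keys with the largest totals, largest first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> top(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byTotal = Long.compare(b.getValue(), a.getValue()); // descending
            return byTotal != 0 ? byTotal : a.getKey().compareTo(b.getKey());
        });
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            keys.add(entries.get(i).getKey());
        }
        return keys;
    }
}
```

The same method serves the per-state question once each flight's delay has been attributed to a state, which needs an airport-to-state lookup that is not covered here.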
[Chart: US Airlines Delay (Per Carrier). Carriers on the axis, in chart order: WN, AA, OO, MQ, US, DL, UA, XE, NW, CO, EV, 9E, FL, YV, OH, B6, AS, F9, HA, AQ, PI, HP, EA, PS, TW]
Big Data is the large amount of continuously generated data that cannot be processed and
analyzed using traditional data management tools.
Big data is a new, rapidly rising topic that is reshaping the future. Demand for big data
scientists is high and will continue in the coming period.
Hadoop is an open source framework for storing and processing large datasets using clusters
of commodity hardware.
Big Data analytics is attracting both businesses and policy makers, who want to leverage this
new phenomenon for more informed decisions and planning for the future.
Big Data now, normal data tomorrow.
Online Big Data Tutorials:
1. Udemy: https://www.udemy.com/course/subscribe/?courseId=336982&dtcode=lGCe31035ujY
2. Udacity: https://www.udacity.com/courses#!/data-science
3. EMC: https://education.emc.com/guest/campaign/data_science.aspx
4. Coursera: https://www.coursera.org/course/datasci
5. Caltech, Learning from Data: http://work.caltech.edu/telecourse.html
6. MIT OpenCourseWare: http://ocw.mit.edu/courses/sloan-school-of-management/15-062-
7. Stanford OpenClassroom
8. Big Data University: https://bigdatauniversity.com/curriculum-map/