2. Contents
Big Data
Problem with Big Data
From where this much of data is generated
Data Statistics
Hadoop
MapReduce
Work Flow of MapReduce
Work Flow Example
Conclusion
Rishish Mohan Bhatnagar
How Hadoop is Useful for Solving Problems of Big Data
3. Big Data
Big data is first and foremost about data volume, namely large data sets
measured in tens or hundreds of terabytes, or even in petabytes.
Big data can also be a combination of -
Structured data (relational data)
Unstructured data (.doc files, images)
Semi-structured data (JSON files, CSV files)
Big data is an extremely large set of data that may be processed to reveal
patterns, trends and associations, especially relating to human behaviour, on a particular topic.
4. Problems with Big Data
The problems with Big Data are characterized by three V's –
Volume of the data
Variety of the data
Velocity of the data
By 2004, humanity had produced about 5 billion gigabytes of data in total.
In 2011, the same amount of data was produced every two days.
By 2013, the same amount was produced every 10 minutes.
5. Where Is This Much Data Generated From?
Social media platforms such as Facebook and Twitter are responsible for huge
amounts of data.
The black boxes of aircraft and helicopters record large amounts of
unstructured data.
The Sensex, Nifty and other stock exchanges across the world generate lots
of data.
Various types of sensors generate large data volumes.
6. Monthly Active Users (Jan 2017)
Figures in millions:
Facebook: 1,871
WhatsApp: 1,000
Instagram: 600
LinkedIn: 106
Twitter: 317
Snapchat: 300
8. Hadoop
Google proposed a solution to the processing problems of Big Data:
divide a task into small parts, assign those parts to many computers
connected over a network, and then collect the results to form the final
data set.
Doug Cutting and Mike Cafarella took up the solution published by Google
and in 2005 started a project called "Hadoop", named after a toy elephant
belonging to Cutting's son.
Hadoop is now a registered trademark of the Apache Software Foundation.
9. Hadoop (Contd.)
Written in Java.
Allows distributed processing of large data sets across clusters of
computers.
Designed to scale from a single server to thousands of machines.
Can be used with commodity hardware.
The Hadoop library itself is designed to detect and handle failures at the
application layer.
10. Hadoop (Contd.)
Since 2012, "Hadoop" refers not only to the base modules but also to
related projects such as Apache Pig, Apache Hive, Apache HBase and Apache
Spark, which run on top of base Hadoop.
Compatible with all platforms, since it is Java based.
Basically, Hadoop is built on the following two components –
MapReduce
HDFS
11. MapReduce
Based on the paradigm – 'send the computation to where the data resides'.
It is a Java-based programming model.
There are two important tasks –
Map takes a set of data and converts it into another set of data, where
individual elements are broken down into key/value pairs (tuples).
Reduce takes the output of a map task as input and combines those data
tuples into a smaller set of tuples.
The reduce task is always performed after the map job.
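The two tasks can be sketched as plain Java functions, independent of the actual Hadoop Mapper/Reducer API (a minimal single-machine illustration; the class and method names here are illustrative, not Hadoop's):

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // Map: break a line of text into (word, 1) key/value pairs
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                     .map(word -> Map.entry(word, 1));
    }

    // Reduce: combine all values seen for one key into a single, smaller tuple
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        return Map.entry(key, values.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped =
            map("big data big hadoop").collect(Collectors.toList());

        // Group values by key, as the framework would between map and reduce
        Map<String, List<Integer>> grouped = mapped.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce runs only after all map output is grouped
        grouped.forEach((k, v) -> System.out.println(reduce(k, v)));
        // e.g. big=2, data=1, hadoop=1 (iteration order unspecified)
    }
}
```

In real Hadoop, map and reduce run as distributed tasks and the grouping step is performed by the framework's shuffle, but the data flow is the same.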
12. Work Flow of MapReduce
The work flow of MapReduce consists of five steps –
1. Splitting – The splitting parameter can be anything, such as splitting by space,
comma, semicolon or new line
2. Mapping – Takes a set of data and converts it into another set of data, where
individual elements are broken down into key – value pairs
3. Intermediate Splitting – The entire process runs in parallel on different systems;
in order to group the data in the "Reduce phase", records with the same key
must end up on the same system
4. Reduce – Takes the output of intermediate splitting and combines those data
tuples into a smaller set of tuples
5. Combining – The last phase, where all the individual results are combined to
form the final result
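The five steps above can be traced with a self-contained word count in plain Java (a single-machine simulation of the flow, not the distributed Hadoop runtime; the class name and sample input are illustrative):

```java
import java.util.*;

public class WorkFlowDemo {
    // Runs the five MapReduce phases over the input text on one machine.
    static Map<String, Integer> wordCount(String input) {
        // 1. Splitting: here the splitting parameter is the new line
        String[] splits = input.split("\n");

        // 2. Mapping: emit a (word, 1) key/value pair for every word
        List<String[]> pairs = new ArrayList<>();
        for (String split : splits)
            for (String word : split.split(" "))
                pairs.add(new String[]{word, "1"});

        // 3. Intermediate splitting (shuffle): group the pairs so that
        //    all values belonging to the same key end up together
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : pairs)
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));

        // 4. Reduce: collapse each key's list of values into a single count
        // 5. Combining: collect the per-key results into the final data set
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, ones) ->
            result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("deer bear river\ncar car river\ndeer car bear"));
        // prints {bear=2, car=3, deer=2, river=2}
    }
}
```

On a real cluster, each split would be mapped on a different node and the shuffle would move same-key records to the same reducer, but the per-phase logic is exactly what this sketch shows.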
14. Conclusion
Hadoop's early development was driven largely by Yahoo! engineers,
building on Google's published systems such as "Big Table"
Hadoop is used for distributed processing of all types of data
To learn Hadoop, one should first know core Java well; without core Java,
learning Hadoop is a hard nut to crack
There were around 4.4 million Hadoop-related jobs in 2015, but only one
third of those jobs were filled