2. Contents
Big Data
Problem with Big Data
From where this much of data is generated
Data Statistics
Hadoop
MapReduce
Work Flow of MapReduce
Work Flow Example
Conclusion
Rishish Mohan Bhatnagar
How Hadoop is Useful for Solving Problems of Big Data
3. Big Data
Big data is first and foremost about data volume, namely large data sets
measured in tens or hundreds of terabytes, or even in petabytes.
Big data can also be a combination of -
Structured data (relational data)
Unstructured data (.doc files, images)
Semi-structured data (JSON files, CSV files)
Big data is an extremely large set of data that may be processed to reveal
patterns, trends and associations, especially relating to human behaviour, on a particular topic.
4. Problems with Big Data
The problems with Big Data are characterized by three V's –
Volume of the data
Variety of the data
Velocity of the data
By 2004, humanity had produced about 5 billion gigabytes of data in total.
In 2011, the same amount of data was produced every two days.
By 2013, the same amount was produced every 10 minutes.
5. Where Is This Much Data Generated From?
Social media platforms such as Facebook and Twitter are responsible for huge
amounts of data.
The black boxes of aircraft and helicopters record large amounts of
unstructured data.
The Sensex, Nifty and other stock exchanges across the world generate lots
of data.
Various types of sensors generate large data volumes.
6. Monthly Active Users (Jan 2017)
Figures in millions:
Facebook: 1,871
WhatsApp: 1,000
Instagram: 600
LinkedIn: 106
Twitter: 317
Snapchat: 300
8. Hadoop
Google proposed a solution to the processing problems of Big Data:
divide a task into small parts, assign those parts to many computers
connected over a network, and then collect the results to form the final
data set.
Doug Cutting and Mike Cafarella took up the solution published by Google
and in 2005 started a project called "Hadoop", named after a toy elephant
belonging to Cutting's son.
Hadoop is now a registered trademark of the Apache Software Foundation.
9. Hadoop (Contd.)
Written in Java.
Allows distributed processing of large data sets across clusters of
computers.
Designed to scale from a single server to thousands of machines.
Can be used with commodity hardware.
The Hadoop library itself is designed to detect and handle failures at the
application layer.
10. Hadoop (Contd.)
Since 2012, "Hadoop" refers not only to the base modules but also to
related projects such as Apache Pig, Apache Hive, Apache HBase and Apache
Spark, which run on top of base Hadoop.
Compatible with all platforms, since it is Java based.
Basically, Hadoop is built on the following two components –
MapReduce
HDFS
11. MapReduce
Based on the paradigm – 'send the computation to where the data resides'.
It is a Java-based programming model.
There are two important tasks –
Map takes a set of data and converts it into another set of data, where
individual elements are broken down into key/value pairs (tuples).
Reduce takes the output of a map task as input and combines those data
tuples into a smaller set of tuples.
The reduce task is always performed after the map job.
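The two tasks can be sketched as plain Java functions, independent of the actual Hadoop Mapper/Reducer API (a minimal single-machine illustration; the class and method names here are illustrative, not Hadoop's):

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // Map: break a line of text into (word, 1) key/value pairs
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                     .map(word -> Map.entry(word, 1));
    }

    // Reduce: combine all values seen for one key into a single, smaller tuple
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        return Map.entry(key, values.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped =
            map("big data big hadoop").collect(Collectors.toList());

        // Group values by key, as the framework would between map and reduce
        Map<String, List<Integer>> grouped = mapped.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce runs only after all map output is grouped
        grouped.forEach((k, v) -> System.out.println(reduce(k, v)));
        // e.g. big=2, data=1, hadoop=1 (iteration order unspecified)
    }
}
```

In real Hadoop, map and reduce run as distributed tasks and the grouping step is performed by the framework's shuffle, but the data flow is the same.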
12. Work Flow of MapReduce
The work flow of MapReduce consists of five steps –
1. Splitting – The splitting parameter can be anything, such as splitting by space,
comma, semicolon or new line
2. Mapping – Takes a set of data and converts it into another set of data, where
individual elements are broken down into key – value pairs
3. Intermediate Splitting – The entire process runs in parallel on different systems;
in order to group the data in the "Reduce phase", records with the same key
must end up on the same system
4. Reduce – Takes the output of intermediate splitting and combines those data
tuples into a smaller set of tuples
5. Combining – The last phase, where all the individual results are combined to
form the final result
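The five steps above can be traced with a self-contained word count in plain Java (a single-machine simulation of the flow, not the distributed Hadoop runtime; the class name and sample input are illustrative):

```java
import java.util.*;

public class WorkFlowDemo {
    // Runs the five MapReduce phases over the input text on one machine.
    static Map<String, Integer> wordCount(String input) {
        // 1. Splitting: here the splitting parameter is the new line
        String[] splits = input.split("\n");

        // 2. Mapping: emit a (word, 1) key/value pair for every word
        List<String[]> pairs = new ArrayList<>();
        for (String split : splits)
            for (String word : split.split(" "))
                pairs.add(new String[]{word, "1"});

        // 3. Intermediate splitting (shuffle): group the pairs so that
        //    all values belonging to the same key end up together
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : pairs)
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));

        // 4. Reduce: collapse each key's list of values into a single count
        // 5. Combining: collect the per-key results into the final data set
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, ones) ->
            result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("deer bear river\ncar car river\ndeer car bear"));
        // prints {bear=2, car=3, deer=2, river=2}
    }
}
```

On a real cluster, each split would be mapped on a different node and the shuffle would move same-key records to the same reducer, but the per-phase logic is exactly what this sketch shows.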
14. Conclusion
Hadoop's early development was driven largely by Yahoo! engineers,
building on Google's published systems such as "Big Table"
Hadoop is used for distributed processing of all types of data
To learn Hadoop, one should first know core Java well; without core Java,
learning Hadoop is a hard nut to crack
There were around 4.4 million Hadoop-related jobs in 2015, but only one
third of those jobs were filled