Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS (the Hadoop Distributed File System) for storage and MapReduce as its programming model for distributed computing. HDFS stores data as blocks replicated across the machines of a cluster, achieving high fault tolerance. MapReduce processes large datasets in parallel by dividing the work into independent map and reduce tasks. Hadoop has seen widespread adoption for applications involving massive datasets and is used by companies such as Yahoo!, Facebook, and Amazon.
4. Hadoop
• It is an open-source software framework
• Licensed under the Apache License 2.0
• Created by Doug Cutting and Mike Cafarella in 2005
• Doug, who was working at Yahoo! at the time, named it after his son's toy elephant
• Derived from the Google MapReduce and Google File System papers
• Written in the Java programming language
5. Why Hadoop?
[Bar chart: Amount of Stored Data By Sector (in Petabytes, 2009) — sector totals of 966, 848, 715, 619, 434, 364, 269, and 227 PB]
• 1 zettabyte = 1 million petabytes = 1 billion terabytes = 1 trillion gigabytes
• If you like analogies… 35 ZB is enough data to fill a stack of DVDs reaching halfway to Mars
Sources: "Big Data: The Next Frontier for Innovation, Competition and Productivity," US Bureau of Labor Statistics | McKinsey Global Institute Analysis
6. Why Hadoop?
• Need to process 100 TB datasets
• On 1 node:
– scanning @ 50 MB/s = 23 days
• On a 1000-node cluster:
– scanning @ 50 MB/s = 33 min
• Need an efficient, reliable, and usable framework
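The arithmetic behind those scan times can be checked with a quick sketch (assuming 100 TB = 100 × 10¹² bytes and a sustained 50 MB/s per node):

```python
# Back-of-the-envelope scan times from the slide above.
DATASET_BYTES = 100 * 10**12   # 100 TB
SCAN_RATE = 50 * 10**6         # 50 MB/s per node, in bytes/second

def scan_time_days(nodes: int) -> float:
    """Time to scan the dataset when reads are spread evenly over `nodes`."""
    seconds = DATASET_BYTES / (SCAN_RATE * nodes)
    return seconds / 86400     # seconds per day

print(f"1 node:     {scan_time_days(1):.1f} days")                   # ~23 days
print(f"1000 nodes: {scan_time_days(1000) * 24 * 60:.0f} minutes")   # ~33 minutes
```

The thousand-fold speedup is exactly linear here because scanning is embarrassingly parallel; the rest of the deck is about making that parallelism reliable and usable.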
7. Distributed File System (DFS)
• Classical model of a file system distributed across multiple machines
• Allows access to files located on a remote host as though working on the local host
• Lets multiple users on multiple machines share files and storage resources
8. Distributed File System (DFS)
• One or more central servers store files that can be accessed,
– with proper authorization rights, by any number of remote clients in the network
• Provides facilities for transparent replication and fault tolerance
• DFS operations should be fast to keep system performance high
– Operations: open, close, read, write file, send and receive file/object
9. Hadoop Distributed File System (HDFS)
• A type of distributed file system
• Originally built as infrastructure for the Apache Nutch web search engine project
• Significant differences from other distributed file systems:
– High fault tolerance
– High throughput
– Easy deployment on low-cost hardware
• Suitable for applications that process massive data
10. HDFS Architecture
• Master-slave architecture
• HDFS master: "NameNode"
– Manages all file system metadata (e.g., which blocks live on which DataNodes)
• Transactions are logged and merged at startup
– Controls read/write access to files
– Maps blocks to DataNodes
– Manages block replication
• HDFS slaves: "DataNodes"
– Notify the NameNode about the block IDs they hold
– Serve read/write requests from clients
– Perform block creation and replication tasks on instruction from the NameNode
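The NameNode's core metadata job can be pictured as a map from block IDs to the DataNodes holding replicas. The following is a toy in-memory sketch of that bookkeeping (names and structure are illustrative, not the real HDFS implementation):

```python
import random

REPLICATION = 3  # HDFS's default replication factor

class NameNode:
    """Toy NameNode: tracks which DataNodes hold which blocks."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}  # block_id -> set of DataNode names

    def allocate_block(self, block_id):
        """Pick REPLICATION distinct DataNodes to hold a new block.
        (Real HDFS placement is rack-aware; this just samples randomly.)"""
        targets = random.sample(self.datanodes, REPLICATION)
        self.block_map[block_id] = set(targets)
        return targets

    def locations(self, block_id):
        """What a client asks the NameNode for before reading a block."""
        return self.block_map[block_id]

nn = NameNode([f"dn{i}" for i in range(5)])
nn.allocate_block("blk_001")
print(nn.locations("blk_001"))  # a set of 3 distinct DataNodes
```

Note that only metadata lives on the NameNode; the block bytes themselves never pass through it.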
13. Getting Files From HDFS
[Diagram: an HDFS client requests a giant file from the NameNode (backed by a BackupNode); the NameNode returns the locations of the file's blocks; the client then streams the blocks directly from the DataNodes]
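The read path in the diagram can be sketched in a few lines: the client gets block locations from a (simulated) NameNode, then streams each block from one of the DataNodes holding a replica. All names and data here are illustrative, not the real HDFS API:

```python
# What the (simulated) NameNode returns to the client for one file.
block_locations = {
    "blk_0": ["dn1", "dn3", "dn4"],
    "blk_1": ["dn2", "dn3", "dn5"],
}

# Block contents held by each (simulated) DataNode.
datanode_storage = {
    "dn1": {"blk_0": b"1100101010"},
    "dn2": {"blk_1": b"0101001010"},
    "dn3": {"blk_0": b"1100101010", "blk_1": b"0101001010"},
    "dn4": {"blk_0": b"1100101010"},
    "dn5": {"blk_1": b"0101001010"},
}

def read_file(block_ids):
    """Reassemble a file by streaming each block from one replica holder."""
    data = b""
    for blk in block_ids:
        replica = block_locations[blk][0]  # pick the first (e.g. closest) replica
        data += datanode_storage[replica][blk]
    return data

print(read_file(["blk_0", "blk_1"]))
```

The key design point visible here is that the NameNode answers only the "where" question; the bulk data transfer happens client-to-DataNode, keeping the NameNode off the data path.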
14. Failures, Failures, Failures
• HDFS was designed with the expectation that failures (both hardware and software) would occur frequently
• Failure types:
– Disk errors and failures
– DataNode failures
– Switch/rack failures
– NameNode failures
– Datacenter failures
15. Fault Tolerance (DataNode Failure)
[Diagram: NameNode and BackupNode above a row of DataNodes, one of which fails]
• The NameNode detects DataNode loss
• Blocks are auto-replicated on the remaining nodes to satisfy the replication factor
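The re-replication step can be sketched as follows: when a DataNode is lost, every block whose replica count drops below the replication factor is copied to another surviving node. This is a simplified illustration, not the real HDFS logic:

```python
REPLICATION = 3

def handle_datanode_loss(block_map, live_nodes, dead_node):
    """block_map: block_id -> set of DataNode names holding a replica."""
    for blk, holders in block_map.items():
        holders.discard(dead_node)           # drop replicas on the dead node
        while len(holders) < REPLICATION:
            candidates = [n for n in live_nodes if n not in holders]
            if not candidates:
                break                        # cluster too small to re-replicate
            holders.add(candidates[0])       # real HDFS picks targets rack-aware

block_map = {"blk_0": {"dn1", "dn2", "dn3"}, "blk_1": {"dn2", "dn4", "dn5"}}
live = ["dn1", "dn3", "dn4", "dn5"]          # dn2 has failed
handle_datanode_loss(block_map, live, "dn2")
print(block_map)  # every block is back at 3 replicas, none on dn2
```

Because any surviving replica can serve as the copy source, recovery work is spread across the cluster rather than bottlenecked on one machine.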
16. Fault Tolerance (NameNode Failure)
[Diagram: NameNode and BackupNode above a row of DataNodes]
• Not an epic failure, because you have the BackupNode
• NameNode loss requires manual intervention
• Automatic failover is in the works
17. Live Horizontal Scaling and Rebalancing
[Diagram: NameNode and BackupNode above a row of DataNodes, with a new DataNode joining]
• The NameNode detects that a new DataNode has been added to the cluster
• Blocks are re-balanced and re-distributed across the nodes
18. HDFS Summary
• Highly scalable
– 1000s of nodes and massive (100s of TB) files
– Large block sizes to maximize sequential I/O performance
• No use of mirroring or RAID. Why?
– Reduces cost
– Uses one mechanism (triply replicated blocks) to deal with a wide variety of failure types, rather than multiple different mechanisms
19. Hadoop MapReduce (MR)
• Programming framework (library and runtime) for analyzing data sets stored in HDFS
• MapReduce jobs are composed of two functions:
– map(): sub-divide & conquer
– reduce(): combine & reduce cardinality
• The user only writes the map and reduce functions
20. Essentially, it's…
1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output from all sub-problems
[Diagram: many parallel DoWork() tasks (the MAP step) feeding into a single combined Output (the REDUCE step)]
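The three steps above can be shown with the classic word-count example, expressed as map and reduce functions. This is a minimal single-process sketch of the model, with the shuffle done by hand; no Hadoop runtime is involved:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle phase: group the intermediate (word, count) pairs by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        grouped[word].append(count)

result = dict(reduce_fn(w, c) for w, c in grouped.items())
print(result["the"], result["fox"])  # 3 2
```

In real Hadoop, the framework runs map_fn on many nodes in parallel, performs the shuffle over the network, and partitions the grouped keys among reducers; the user writes only the two functions.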
22. Job Submission
[Diagram: MR client, JobTracker, and a row of TaskTrackers]
1. The client submits jobs to the JobTracker, where they are queued
2. map() tasks are assigned to TaskTrackers (HDFS DataNode locality aware)
3. Mappers are spawned in separate JVMs and execute
4. Mappers store temporary results to the local file system
5. The reduce phase begins, and reducers run on the TaskTrackers
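The locality-aware assignment in step 2 can be sketched as a simple preference rule: the JobTracker tries to run a map task on a TaskTracker whose node already stores the task's input block, and only falls back to a remote node when no local slot is free. This is an illustrative toy, not the real Hadoop scheduler:

```python
def assign_map_task(input_block_nodes, free_tasktrackers):
    """Return the TaskTracker chosen for one map task.

    input_block_nodes: set of node names holding the input block's replicas
    free_tasktrackers: list of node names with a free map slot
    """
    for node in free_tasktrackers:
        if node in input_block_nodes:
            return node              # data-local: no network copy of the input
    return free_tasktrackers[0]      # fallback: remote read over the network

print(assign_map_task({"dn2", "dn4"}, ["dn1", "dn2", "dn3"]))  # dn2
print(assign_map_task({"dn7"}, ["dn1", "dn2"]))                # dn1
```

Moving computation to the data, rather than data to the computation, is what makes the 1000-node scan numbers from slide 6 achievable in practice.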
25. Dealing With Failures
• Like HDFS, the MapReduce framework is designed to be highly fault tolerant
• Worker (map or reduce) failures
– Detected by periodic master pings
– Map or reduce tasks that fail are reset and then given to a different node
– If a node fails after its map tasks have completed, those tasks are redone and all reducers are notified
• Master failure
– If the master fails for any reason, the entire computation is redone
26. Applications
• Hadoop is used in a wide range of applications. Some examples:
– Search (Yahoo!, Amazon, Zvents)
– Log processing (Facebook, Yahoo!, ContextWeb, Joost, Last.fm)
– Recommendation systems (Facebook)
– Data warehousing (Facebook, AOL)
– Video and image analysis (New York Times, Eyealike)
28. References
• "Meet Hadoop! Open Source Grid Computing" by Devaraj Das
• "The Hadoop Distributed File System: Architecture and Design" by Dhruba Borthakur
• Hadoop Design and Architecture: http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf
• "An Introduction to the Hadoop Distributed File System": http://www.ibm.com/developerworks/library/wa-introhdfs/
• "Big Data: What's the Big Deal?" by David J. DeWitt and Rimma Nehme, Microsoft Jim Gray Systems Lab