Presentation by:
Asst. Prof. Amresh Kumar
Department of Computer Science & Engineering
GH Raisoni College of Engineering, Nagpur
12/05/14
• Introduction
• Hadoop Architecture
• MapReduce Program Model
• References
• Hadoop is:
o An open-source software framework.
o A scalable, fault-tolerant, simple, and accessible system that
supports distributed applications.
• Hadoop is built on (and provides) two main parts:
o Shared storage (HDFS) and a processing framework (MapReduce, MR).
• The Hadoop project includes HDFS, MR, and YARN.
• Other Hadoop-related projects are hosted at Apache.
• Hadoop was created by Doug Cutting.
• MapReduce (MR) is a software framework.
• The name derives from the application of map() and reduce() functions.
• It splits the input data set into independent chunks, which are processed
first by map() tasks and then by reduce() tasks.
• Typically, the compute nodes and the storage nodes are the same machines
(MapReduce framework, MRF, plus distributed file system, DFS).
• HDFS:
o Shared storage.
o Stores data on the compute nodes (after processing).
o HDFS has a master (NameNode)/slave (DataNodes)
architecture.
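The map() and reduce() model described above can be sketched on a single machine as a word count. This is only an illustration in plain Python, not Hadoop code: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, and the round-robin chunking stands in for the input splits a real cluster would use.

```python
from collections import defaultdict


def map_fn(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """reduce(): combine all values collected for one key."""
    return word, sum(counts)


def run_mapreduce(lines, num_chunks=2):
    """Simulate the model locally (hypothetical helper, not a Hadoop API)."""
    # Split the input into independent chunks (round-robin, for illustration).
    chunks = [lines[i::num_chunks] for i in range(num_chunks)]

    # Map phase: each chunk is processed independently of the others.
    intermediate = []
    for chunk in chunks:
        for line in chunk:
            intermediate.extend(map_fn(line))

    # Shuffle/sort: group the intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one reduce() call per distinct key, in sorted key order.
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_mapreduce(["big data big cluster", "data node"]))
```

Because every chunk is mapped without reference to any other chunk, the map calls could run on different nodes; only the grouping step needs to see all intermediate pairs for a key.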
[Figure: Hadoop architecture. The master node runs the NameNode (DFS layer) and the JobTracker (MRF layer); each of the three slave nodes runs a DataNode (DFS layer) and a TaskTracker (MRF layer).]
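[Figure: MapReduce data flow. Map Phase: mappers read the input dataset, partition their output, and merge it on disk; some partitions go to other reducers. Copy Phase: the reducer fetches its partitions, including those from other mappers. Sort Phase: the fetched data is merged. Reduce Phase: the reducer writes the output.]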
Working on Hadoop
[Figure: cluster with one NameNode and several DataNodes]
• Start NameNode
• Start JobTracker
• Start DataNode
• Start TaskTracker
Confirm: Cluster Working
• http://localhost:50070 (NameNode web interface)
• http://localhost:50030 (JobTracker web interface)
Running Algorithm on Cluster
• Put the input on HDFS
• Run the algorithm
• http://localhost:50070
• http://localhost:50030
Analyzing Result
• Map and Reduce running
• Reliability: replication of data
• http://localhost:50070
• http://localhost:50030
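The point about reliability through replication of data can be illustrated with a small simulation. This is a sketch under stated assumptions, not HDFS behaviour: the functions place_blocks and readable_blocks and the node names are hypothetical, and the round-robin placement is a simplification of the real placement policy. The default of three replicas matches the HDFS default replication factor.

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes.

    Round-robin placement is an assumption made for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    ]


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
    placement = place_blocks(4, nodes, replication=3)
    # With three replicas, every block survives the loss of two nodes.
    print(readable_blocks(placement, failed={"dn1", "dn2"}))
```

With a replication factor of 1 in the same setup, losing dn1 alone would already make block 0 unreadable, which is why replication is the basis of the cluster's fault tolerance.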
[1] http://ieeexplore.ieee.org
[2] http://hadoop.apache.org/
[3] https://wiki.cloudera.com
[4] http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html
[5] Books:
• Data-Intensive Text Processing with MapReduce, Jimmy Lin and
Chris Dyer, University of Maryland, College Park.
• Hadoop: The Definitive Guide, Tom White.
Thank You

Hadoop and MapReduce

Editor's Notes

  • #5 1 PB (petabyte) = 1,000,000,000,000,000 B = 1000^5 B = 10^15 B = 1 million gigabytes = 1 thousand terabytes. Hadoop runs in different modes. Local (standalone) mode: the default mode for Hadoop. Pseudo-distributed mode: runs Hadoop as a “cluster of one”, with all daemons running on a single machine. Fully distributed mode: the mode in which an actual Hadoop cluster runs. Scalable: enables applications to work with thousands of computationally independent computers and petabytes of data. Fault-tolerant: both MapReduce and HDFS are designed so that node failures are handled automatically by the framework. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
  • #6 MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers. The model is inspired by the map and reduce functions commonly used in functional programming. MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware).
  • #7 The Hadoop architecture shown here was drawn on the basis of some IEEE papers and some books.
  • #8 Try to explain the terms Combiner and Partitioner; for more details, read the book and see the last slide.
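
The two terms in note #8 can be made concrete with a short sketch. A combiner pre-aggregates a mapper's output locally before it is sent over the network, and a partitioner decides which reducer receives each key. The functions combine and partition below are hypothetical names, not Hadoop APIs, and CRC32 is used only as a stand-in for a stable hash, because Python's built-in hash() of strings varies between processes.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: sum one mapper's (word, count) pairs locally."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: pick a reducer index for a key using a stable hash.

    CRC32 is an assumption made for this sketch; any deterministic hash
    that spreads keys evenly would serve the same purpose.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    mapper_output = [("data", 1), ("node", 1), ("data", 1)]
    combined = combine(mapper_output)  # fewer pairs to copy to the reducers
    for word, count in combined:
        print(word, count, "-> reducer", partition(word, num_reducers=3))
```

Since the same key always maps to the same reducer index, all counts for one word end up at a single reducer, no matter which mapper produced them.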