HADOOP MAPREDUCE
Darwade Sandip
MNIT Jaipur
December 25, 2013
Outline
What is HADOOP
What is MapReduce
Components of Hadoop
Architecture
Implementation
Bibliography
What is Hadoop?
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
Hadoop is best known for MapReduce, its distributed
filesystem (HDFS), and large-scale data processing.
MapReduce
Programming model for data processing
Hadoop can run MapReduce programs written in various
languages, such as Java and Python
Parallel processing makes MapReduce well suited to very
large-scale data analysis
The mapper produces intermediate results
The reducer aggregates those results (a minimal Java sketch follows)
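In the Java API these two roles correspond to Mapper and Reducer subclasses. The skeleton below is a minimal sketch against the standard org.apache.hadoop.mapreduce API; the class names SketchMapper and SketchReducer are invented for illustration, and the method bodies are placeholders rather than a specific algorithm:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The generic parameters are the input and output key/value types.
public class SketchMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Derive intermediate (key, value) pairs from one input record
    // and emit each with context.write(key, value).
  }
}

class SketchReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Aggregate all values seen for this key and emit the result.
  }
}
```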
Components of Hadoop
Two Main Components of Hadoop
HDFS
MAPREDUCE
HDFS
Files stored in HDFS are divided into blocks, which are then
copied to multiple DataNodes
A Hadoop cluster contains only one NameNode and many
DataNodes
Data blocks are replicated for high availability and fast access
HDFS
NameNode
Runs on a separate machine
Manages the filesystem namespace
and controls access by external clients
Stores filesystem metadata in memory:
file information, the blocks that make up each file,
and the DataNode location of every block
DataNode
Runs on a separate machine and is the basic unit of file storage
Periodically reports all of its existing blocks to the NameNode
Serves read and write requests from clients, and carries out
block create, delete, and replicate commands
from the NameNode
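The division of labor shows up in the client API: a client asks the NameNode (through the FileSystem abstraction) for a file's metadata and block locations, then streams the bytes from DataNodes. A minimal read sketch, assuming a hypothetical file path already in HDFS:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml, hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // metadata operations go to the NameNode
    Path file = new Path("/user/demo/input.txt");  // hypothetical path
    // open() resolves block locations via the NameNode; the bytes
    // themselves are then streamed directly from DataNodes.
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```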
MapReduce
Files are split into fixed-size blocks (default 64 MB)
and stored on DataNodes
MapReduce programs process these blocks in parallel across the cluster
The input data is a set of key/value pairs, and the output is
also a set of key/value pairs
There are two main phases: Map and Reduce
MapReduce (continued)
Figure: MapReduce Process Architecture
MapReduce (continued)
Map
Processes each block separately, in parallel
Generates a set of intermediate key/value pairs
Results from these logical blocks are later reassembled
Reduce
Accepts an intermediate key and its related set of values
Processes that key and its values
Forms a smaller, merged set of output values (traced below)
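To make the hand-off between the two phases concrete, here is a hedged trace using word counting as the running example (the input data and counts are invented for illustration):

```java
// Illustrative data flow between map and reduce (values invented):
//
// map over block 1 emits:  ("hadoop", 1), ("mapreduce", 1), ("hadoop", 1)
// map over block 2 emits:  ("hadoop", 1), ("hdfs", 1)
//
// The framework shuffles and groups the intermediate pairs by key,
// so reduce is invoked once per distinct key:
//
//   reduce("hadoop",    [1, 1, 1]) -> emits ("hadoop", 3)
//   reduce("hdfs",      [1])       -> emits ("hdfs", 1)
//   reduce("mapreduce", [1])       -> emits ("mapreduce", 1)
```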
How Hadoop runs a MapReduce job
The client, which submits the MapReduce job.
The jobtracker, which coordinates the job.
The tasktrackers, which run the tasks that the job has
been split into.
Tasktrackers are Java applications whose main class is
TaskTracker.
The distributed filesystem, which is used for sharing job
files between the other entities.
How Hadoop runs a MapReduce job
Job Submission
Job Initialization
Task Assignment
Task Execution
Job Completion
How Hadoop runs a MapReduce job
Figure: How Hadoop runs a MapReduce job using the classic framework
How Hadoop runs a MapReduce job
Job Submission
The submit() method creates an internal JobSubmitter and calls
submitJobInternal() on it
waitForCompletion() polls the job's progress once per
second
JobSubmitter then:
Asks the jobtracker for a new job ID (by calling getNewJobId() on
JobTracker)
Checks the output specification of the job
Computes the input splits for the job
Copies the job resources (JAR, configuration, input splits) to the
shared filesystem
Tells the jobtracker that the job is ready for execution by calling
submitJob()
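A hedged sketch of a driver that triggers this submission path follows; WordCountMapper and WordCountReducer are placeholder classes (versions of them appear with the word count figure later), and the input/output paths come from the command line. waitForCompletion(true) both submits the job and polls its progress as described above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "driver sketch");
    job.setJarByClass(DriverSketch.class);        // locates the job JAR to ship
    job.setMapperClass(WordCountMapper.class);    // placeholder mapper
    job.setReducerClass(WordCountReducer.class);  // placeholder reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Internally this goes through JobSubmitter/submitJobInternal(),
    // then polls the job's progress once per second.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```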
How Hadoop runs a MapReduce job
Job Initialization
When the JobTracker receives a call to submitJob(), it places the job
on an internal queue
It retrieves the input splits computed by the client from the shared
filesystem
Task Assignment
Tasktrackers periodically send heartbeats to the jobtracker
As part of the heartbeat response, the jobtracker assigns tasks to the
tasktracker
How Hadoop runs a MapReduce job
Task Execution
The tasktracker's next step is to run the task
It localizes the job JAR by copying it from the shared filesystem to
the tasktracker's local disk
It creates an instance of TaskRunner to run the task
Job Completion
When the jobtracker receives a notification that the last task for a
job is complete, it changes the status of the job to “successful”
The client learns of this when it returns from the waitForCompletion()
method
The jobtracker then cleans up its working state
Implementation
Figure: Minimum Temperature
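The figure itself is not reproduced here; in its place is a minimal sketch in the same spirit, assuming each input line holds a year and an integer temperature separated by whitespace (the real record format depends on the dataset used):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (year, temperature) for each record.
public class MinTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().trim().split("\\s+"); // assumed "year temp" layout
    context.write(new Text(fields[0]),
                  new IntWritable(Integer.parseInt(fields[1])));
  }
}

// Reducer: keep the minimum temperature seen for each year.
class MinTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
      throws IOException, InterruptedException {
    int min = Integer.MAX_VALUE;
    for (IntWritable t : temps) {
      min = Math.min(min, t.get());
    }
    context.write(year, new IntWritable(min));
  }
}
```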
Implementation
Figure: Maximum Temperature
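Again standing in for the figure: relative to the minimum-temperature sketch above, only the aggregation step changes, so a reducer is enough here (the mapper is identical):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: keep the maximum temperature seen for each year.
public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
      throws IOException, InterruptedException {
    int max = Integer.MIN_VALUE;
    for (IntWritable t : temps) {
      max = Math.max(max, t.get());
    }
    context.write(year, new IntWritable(max));
  }
}
```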
Implementation (continued)
Figure: Word Count
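In place of the figure, a hedged version of the classic word count pair (the driver shown with the job-submission slide would tie these together):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every token in the input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: sum the ones emitted for each word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}
```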
Implementation (continued)
Figure: Word Count (continued)
Bibliography I
G. Yang, “The application of MapReduce in the cloud computing,” Intelligence
Information Processing and Trusted Computing (IPTC 2011), vol. 9,
pp. 154–156, Oct. 2011.
X. Zhang, G. Wang, Z. Yang, and Y. Ding, “A two-phase execution engine of
reduce tasks in Hadoop MapReduce,” 2012 International Conference on Systems
and Informatics (ICSAI 2012), pp. 858–864, May 2012.
T. White, Hadoop: The Definitive Guide, Third Edition. Sebastopol, CA:
O'Reilly Media, Inc., 2012.
J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large
clusters,” Operating Systems Design and Implementation (OSDI 2004), vol. 6,
pp. 137–150, 2004.
X. Lin, Z. Meng, C. Xu, and M. Wang, “A practical performance model for
Hadoop MapReduce,” 2012 IEEE International Conference on Cluster
Computing Workshops, pp. 231–239, Sept. 2012.
Bibliography II
Z. Guo, M. Pierce, G. Fox, and M. Zhou, “Automatic task re-organization in
MapReduce,” 2011 IEEE International Conference on Cluster Computing,
pp. 335–343, May 2011.
K. Wang, X. Lin, and W. Tang, “An experience guided configuration optimizer
for Hadoop MapReduce,” Cloud Computing Technology and Science
(CloudCom), pp. 419–426, Dec. 2012.
