Introduction to MapReduce – S. Jency Jayastina, II M.Sc. Computer Science, Bon Secours College for Women, Thanjavur
1. SUBMITTED BY
Name : S.JENCY JAYASTINA
Class : II-MSC CS
Batch : 2017 – 2019
Incharge Staff : Ms. M. Florence Dayana
2. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
When coupled with HDFS, MapReduce can be used to handle big data, and it has extensive capability to handle unstructured data as well.
3. MapReduce is a programming model that Google has used successfully in processing its “big-data” sets (~20 petabytes per day).
• Users specify the computation in terms of a map
and a reduce function,
• Underlying runtime system automatically
parallelizes the computation across large-scale
clusters of machines, and
• Underlying system also handles machine failures,
efficient communications, and performance issues.
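In the original model, the user-supplied functions have the following type signatures, where the map output key/value types (k2, v2) may differ from the input types (k1, v1):

    map    (k1, v1)        -> list(k2, v2)
    reduce (k2, list(v2))  -> list(v2)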
4. MapReduce is the processing engine of Apache Hadoop and was directly derived from Google's MapReduce.
MapReduce applications are typically written in Java. They conveniently process huge amounts of data by applying mapping and reducing steps to arrive at a solution to the required problem.
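As a concrete sketch, the mapping step of the classic word-count example can be written in Java against the Hadoop MapReduce API (the class name is illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapping step: break each input line into (word, 1) key/value pairs.
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }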
5. The mapping step takes a set of data and converts it into another set of data by breaking the individual elements into key/value pairs called tuples.
The second step, reducing, takes the output derived from the mapping process and combines those data tuples into a smaller set of tuples.
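Continuing the word-count sketch above, a matching reducer combines the many (word, 1) tuples emitted by the mapper into one (word, count) tuple per word:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reducing step: combine the mapper's tuples into a smaller set, one per key.
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();           // add up the 1s emitted for this word
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }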
7. MapReduce is mainly used for parallel processing of large sets of data stored in a Hadoop cluster.
It is a paradigm specially designed by Google to provide parallelism, data distribution and fault tolerance.
MapReduce processes data in the form of key-value pairs. A key-value (KV) pair is a mapping between two linked data items: a key and its value.
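For example, in a word count the word is the key and its count is the value (the counts below are illustrative):

    ("hadoop", 1)     an intermediate pair emitted by a mapper
    ("hadoop", 42)    the final pair for that key after reduction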
8. The entire MapReduce process is a massively parallel processing setup where the computation is moved to the place of the data instead of moving the data to the place of the computation.
The entire computation is broken down into the mapping, shuffling and reducing stages.
A MapReduce program executes in three stages, namely:
map stage
shuffle stage
reduce stage
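A small word-count trace (the input lines are illustrative) shows what each stage produces:

    Input:    "deer bear river"  and  "car car river"
    Map:      (deer,1) (bear,1) (river,1)  and  (car,1) (car,1) (river,1)
    Shuffle:  (bear,[1]) (car,[1,1]) (deer,[1]) (river,[1,1])
    Reduce:   (bear,1) (car,2) (deer,1) (river,2)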
9. Viewed more coarsely, there are two stages (the shuffle is folded into the reducing stage):
Mapping Stage
Reducing Stage
Mapping Stage: This is the first step of MapReduce; it includes reading the input data from the Hadoop Distributed File System (HDFS).
Reducing Stage: The reducer phase can consist of multiple processes. In the shuffling process, the data is transferred from the mappers to the reducers.
10. MasterNode – Place where JobTracker runs and which
accepts job requests from clients
SlaveNode – It is the place where the mapping and
reducing programs are run
JobTracker – It is the entity that schedules the jobs and tracks the assigned jobs using the TaskTracker
TaskTracker – It is the entity that actually tracks the tasks and reports status to the JobTracker
Job – A MapReduce job is the execution of the Mapper &
Reducer program across a dataset
Task – The execution of the Mapper & Reducer program on a specific data section
TaskAttempt – A particular task execution attempt on a
SlaveNode
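As a sketch, a job such as the word-count example is assembled and submitted by a small driver program; it reuses the mapper and reducer sketched earlier, and the class name, job name and argument handling are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");            // the MapReduce job
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);                // mapper sketched earlier
            job.setReducerClass(IntSumReducer.class);                 // reducer sketched earlier
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output path
            System.exit(job.waitForCompletion(true) ? 0 : 1);         // submit and wait
        }
    }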
11. At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and Yahoo!, working from the published GFS design, built an open-source counterpart called the Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce and other related entities is called the Hadoop project, or simply Hadoop.
It is open source and distributed by Apache.