This document provides an overview of MapReduce in Hadoop. It defines MapReduce as a distributed data-processing paradigm designed for batch processing of large datasets in parallel. The anatomy of MapReduce is explained, including the roles of mappers and reducers, the framework-managed shuffle/sort phase that sits between them, and how a MapReduce job runs from submission to completion. MapReduce is well suited to batch processing and long-running applications; it is a poor fit for iterative algorithms, ad-hoc queries, and algorithms that depend on previously computed values or shared global state.
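To make the mapper/shuffle/reducer anatomy concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (`org.apache.hadoop.mapreduce`). This example is not from the original document: the class names (`WordCount`, `TokenizerMapper`, `IntSumReducer`) follow the well-known Hadoop tutorial example, and the input/output paths are taken from command-line arguments for illustration.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs in parallel over input splits and emits (word, 1)
  // for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: after the shuffle/sort phase groups all values by key,
  // receives (word, [1, 1, ...]) and emits (word, total count).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job, then blocks until completion,
  // mirroring the submission-to-completion lifecycle described above.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Optional combiner: a map-side local reduce that shrinks shuffle traffic.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir (illustrative)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir (illustrative)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note how the shuffle/sort step never appears in user code: the framework itself partitions, sorts, and groups the mappers' intermediate (key, value) pairs by key before handing them to the reducers, which is why per-key aggregation is the natural unit of work in this paradigm.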