Map reduce

MapReduce
Presented by – Somesh maliye

Content
•Motivation for MapReduce.
•What is MapReduce.
•Map() & Reduce() functions.
•MapReduce - Example.
•Dataflow.
•MapReduce Job.
•Job Tracker & Task Tracker.
•Characteristics of MapReduce.
•Real-time uses.
•Failure in MapReduce.
•Conclusion.

Motivation For MapReduce
•Large scale data processing.
◦ Want to use 1000s of CPUs
•MapReduce Architecture provides
◦ Automatic parallelization & distribution
◦ Fault tolerance
◦ I/O scheduling
◦ Monitoring & status updates

What is MapReduce
•MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed algorithm
on a cluster.

Map() function
• Reads in input pair <Key, Value>
• Outputs a pair <K’, V’>
• Let’s count the occurrences of each word in user queries (or tweets/blogs)
• The input to map() will be <queryID, queryText>:
• <Q1, “The teacher went to the store. The store was closed; the store opens in
the morning. The store opens at 9am.”>
• The output, with words lower-cased, would be:
<the, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1> <the, 1> <store, 1> <was, 1>
<closed, 1> <the, 1> <store, 1> <opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
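
A minimal sketch of such a map() for the word-count example, in Python. The function name map_words, the lower-casing, and the regular expression used to split words are assumptions made for illustration; the slides do not prescribe a tokenizer.

```python
import re
from typing import Iterator, Tuple

def map_words(query_id: str, query_text: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in the query text."""
    # Assumption: words are runs of letters/digits, compared case-insensitively.
    for word in re.findall(r"[A-Za-z0-9]+", query_text.lower()):
        yield (word, 1)

text = ("The teacher went to the store. The store was closed; "
        "the store opens in the morning. The store opens at 9am.")
pairs = list(map_words("Q1", text))
print(len(pairs))   # prints 21, one pair per word
print(pairs[:3])    # prints [('the', 1), ('teacher', 1), ('went', 1)]
```

The input key (the query ID) is not needed for counting, so this sketch ignores it.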

Reduce() function
•Accepts the Map() output and aggregates the values for each key
•For our example, the reducer input would be:
• <the, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1> <the, 1> <store, 1> <was, 1> <closed, 1>
<the, 1> <store, 1> <opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1> <opens, 1> <at, 1>
<9am, 1>
• The output would be:
• <the, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1> <closed, 1> <opens, 2> <in, 1> <morning, 1>
<at, 1> <9am, 1>
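
The aggregation step can be sketched in Python as follows. The helper names group_by_key and reduce_counts are hypothetical, and the grouping is done in memory here; a real framework would group the pairs while shuffling them between machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect all values emitted for the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Aggregate the values of one key into a single total."""
    return (word, sum(counts))

# Map output for the example sentence, with words already lower-cased.
words = ("the teacher went to the store the store was closed the store "
         "opens in the morning the store opens at 9am").split()
map_output = [(w, 1) for w in words]

result = dict(reduce_counts(k, v) for k, v in group_by_key(map_output).items())
print(result["the"], result["store"], result["opens"])  # prints 6 4 2
```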

Dataflow
The dataflow is determined by the following components:
• an input reader
• a Map function
• a partition function
• a comparison function
• a Reduce function
• an output writer

Dataflow (Cont.)
•Input reader
The input reader divides the input into appropriately sized 'splits' (in practice typically 64 MB to 128 MB)
and the framework assigns one split to each Map function. The input reader reads data from stable
storage (typically a distributed file system) and generates key/value pairs.
•Map function
The Map function takes a series of key/value pairs, processes each, and generates zero or more output
key/value pairs.
• Partition function
Each Map function output is allocated to a particular reducer by the application's partition function for
sharding purposes. The partition function is given the key and the number of reducers and returns the
index of the desired reducer.
•Comparison function
The input for each Reduce is pulled from the machine where the Map ran and sorted using the
application's comparison function.
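
The partition and comparison steps described above might look like this in Python. This is a sketch under assumptions: hashing the key with CRC32 modulo the number of reducers is one common choice rather than the only one, and the sample pairs and reducer count are made up for illustration.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Return the index of the reducer responsible for this key."""
    # crc32 is used instead of the built-in hash() because string hashes
    # are randomized per Python process, and routing must be stable.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

map_output = [("store", 1), ("the", 1), ("opens", 1), ("the", 1), ("store", 1)]
num_reducers = 3

# Route every pair to the bucket of its reducer.
buckets = [[] for _ in range(num_reducers)]
for key, value in map_output:
    buckets[partition(key, num_reducers)].append((key, value))

# Comparison step: each reducer sorts its input by key before reducing.
for bucket in buckets:
    bucket.sort(key=lambda pair: pair[0])
```

Because the partition function depends only on the key, every pair with the same key reaches the same reducer.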

Dataflow (Cont.)
•Reduce function
The framework calls the application's Reduce function once for each unique key in the sorted order. The
Reduce can iterate through the values that are associated with that key and produce zero or more
outputs.
•Output writer
The Output Writer writes the output of the Reduce to the stable storage.
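
To tie the six components together, here is a single-process Python sketch of the whole dataflow for word counting. All names (input_reader, map_fn, partition_fn, reduce_fn, run_job) are hypothetical, splits are a few lines instead of 64 MB blocks, and a dict stands in for stable storage.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def input_reader(lines, split_size=2):
    """Cut the input into splits of (line_number, line) key/value pairs."""
    records = list(enumerate(lines))
    for start in range(0, len(records), split_size):
        yield records[start:start + split_size]

def map_fn(key, value):
    """Emit (word, 1) for each word; the line number key is unused."""
    for word in value.lower().split():
        yield word, 1

def partition_fn(key, num_reducers):
    """Pick a reducer index from the key (simple byte-sum hash)."""
    return sum(key.encode("utf-8")) % num_reducers

def reduce_fn(key, values):
    """Called once per unique key; emits the total count."""
    yield key, sum(values)

def run_job(lines, num_reducers=2):
    reducer_inputs = defaultdict(list)
    for split in input_reader(lines):                 # input reader
        for k, v in split:
            for out_k, out_v in map_fn(k, v):         # Map function
                index = partition_fn(out_k, num_reducers)  # partition function
                reducer_inputs[index].append((out_k, out_v))
    output = {}
    for index in range(num_reducers):
        # Comparison step: sort each reducer's input by key.
        pairs = sorted(reducer_inputs[index], key=itemgetter(0))
        for key, group in groupby(pairs, key=itemgetter(0)):
            for out_k, out_v in reduce_fn(key, (v for _, v in group)):
                output[out_k] = out_v                 # output writer
    return output

print(run_job(["the store opens", "the store was closed", "the morning"]))
```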

MapReduce Job
A job is a full MapReduce program, which typically causes multiple Map
and Reduce functions to run in parallel over the life of the program. A task is a
map or reduce function executed on a subset of the data.
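
The job/task distinction can be illustrated with a small Python sketch in which threads stand in for cluster machines. This is an assumption for illustration only: run_job, map_task and reduce_task are hypothetical names, and a single reduce task is used for brevity.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_task(split):
    """One map task: emit (word, 1) for every word in its split of lines."""
    return [(word, 1) for line in split for word in line.lower().split()]

def reduce_task(pairs):
    """One reduce task: sum the values of each key it receives."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def run_job(splits, max_workers=4):
    """The job: run one map task per split in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        map_outputs = list(pool.map(map_task, splits))
    all_pairs = [pair for output in map_outputs for pair in output]
    return reduce_task(all_pairs)

splits = [["the store opens"], ["the store was closed"], ["the morning"]]
print(run_job(splits))
```

Here run_job is the job, and each call to map_task or reduce_task is a task working on a subset of the data.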

Failures in MapReduce
• Failures are the norm on commodity hardware
• Worker failure
• Detect failure via periodic heartbeats
• Re-execute in-progress map/reduce tasks
• Master failure
• Single point of failure; resume from the execution log
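
Worker-failure handling can be sketched in Python as below. The timeout value, worker and task names, and the round-robin reassignment policy are all assumptions for illustration; the slides only state that failures are detected via heartbeats and in-progress tasks are re-executed.

```python
def find_failed_workers(last_heartbeat, now, timeout):
    """Return workers whose last heartbeat is older than the timeout."""
    return sorted(w for w, t in last_heartbeat.items() if now - t > timeout)

def reschedule(assignments, failed, healthy):
    """Reassign in-progress tasks of failed workers to healthy workers."""
    new_assignments = dict(assignments)
    orphaned = sorted(t for t, w in assignments.items() if w in failed)
    for i, task in enumerate(orphaned):
        # Round-robin over the healthy workers (hypothetical policy).
        new_assignments[task] = healthy[i % len(healthy)]
    return new_assignments

# Seconds since some reference point at which each worker last reported in.
last_heartbeat = {"w1": 100.0, "w2": 91.0, "w3": 99.5}
failed = find_failed_workers(last_heartbeat, now=101.0, timeout=5.0)

assignments = {"map-0": "w1", "map-1": "w2", "reduce-0": "w2"}
healthy = [w for w in sorted(last_heartbeat) if w not in failed]
print(failed)                                      # prints ['w2']
print(reschedule(assignments, failed, healthy))
```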

Conclusion
•Simplifies large-scale computations that fit this model
•Lets users focus on the problem without worrying about the details of parallelization, distribution, and fault tolerance
