MapReduce
Presentation – Advanced Distributed Systems
MapReduce
 MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
 A Map() procedure that performs filtering and sorting.
 A Reduce() procedure that performs a summary operation (such as statistical aggregation).
HADOOP
 Hadoop is a free, Java-based programming framework that supports the
processing of large data sets in a distributed computing environment. It is
part of the Apache project sponsored by the Apache Software Foundation.
MapReduce - Orchestrating
 The MapReduce system orchestrates the processing by:
1. marshalling (assembling and coordinating) the distributed servers
2. running the various tasks in parallel
3. managing all communications and data transfers between the various parts of
the system
4. providing for redundancy and fault tolerance.
MapReduce contributions
 The key contributions (the main value added by the system) are
scalability (coordinating a large number of nodes) and fault tolerance,
achieved for a variety of applications by optimizing the execution engine
once.
MapReduce main steps (3-phase view)
 Steps of MapReduce :
1. "Map" step: Each worker node applies the "map()" function to the local data,
and writes the output to a temporary storage. A master node orchestrates
that for redundant copies of input data, only one is processed.
2. "Shuffle" step: Worker nodes redistribute data based on the output keys
(produced by the "map()" function), all the data belonging to one key are
located on the same worker node.
3. "Reduce" step: Worker nodes now process each group of output data, per key,
in parallel.
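A minimal, single-process Python sketch of the three phases (the function names here are illustrative, not part of any real framework):

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # "Map" phase: apply map_fn to each input record, collecting the
    # intermediate (key, value) pairs in temporary storage.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # "Shuffle" phase: group all values belonging to the same key, as
    # if redistributing them onto the same worker node.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # "Reduce" phase: process each per-key group independently.
    return {key: reduce_fn(key, values) for key, values in groups.items()}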
MapReduce main steps (5-phase view)
 Another way to look at MapReduce is as a 5-step parallel and distributed
computation:
1. Prepare the Map() input – the system designates the Map processors and
assigns each one the K1 key values it will work on.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key
value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – all Map-generated data
associated with the same K2 key value is assigned to the same Reduce processor.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each
K2 key value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce
output, and sorts it by K2 to produce the final outcome.
Working in parallel
 Each mapping operation is independent of the others, so the maps can run in parallel.
 Parallelism is limited by:
1. the number of independent data sources, and
2. the number of CPUs near each source.
 Likewise, a set of 'reducers' can perform the reduction phase in parallel,
provided that all outputs of the map operation that share the same key are
presented to the same reducer at the same time.
 A key advantage of working in parallel is the possibility of recovering from
partial failure of servers or storage during the operation: if one mapper or reducer
fails, the work can be rescheduled – assuming the input data is still available.
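Because each map call is independent, the map phase of the sketch above could issue its calls concurrently. A rough illustration with Python's standard concurrent.futures (threads stand in for worker machines; a real framework distributes tasks across a cluster):

from concurrent.futures import ThreadPoolExecutor

def parallel_map_phase(inputs, map_fn, workers=4):
    # Run map_fn over the (key, value) inputs concurrently; since the
    # calls are independent, the order of execution does not matter.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda kv: list(map_fn(kv[0], kv[1])), inputs)
    intermediate = []
    for pairs in results:
        intermediate.extend(pairs)
    return intermediate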
Logical work
 The Map and Reduce functions of MapReduce are both defined with respect to
data structured in (key, value) pairs. Map takes one pair of data with a type in one
data domain and returns a list of pairs in a different domain:
Map(k1,v1) → list(k2,v2)
 The Map function is applied in parallel to every pair in the input dataset. This
produces a list of pairs for each call. After that, the MapReduce framework
collects all pairs with the same key from all lists and groups them together,
creating one group for each key.
 The Reduce function is then applied in parallel to each group, which in turn
produces a collection of values in the same domain:
Reduce(k2, list (v2)) → list(v3)
 Each Reduce call typically produces either one value v3 or an empty return,
though one call is allowed to return more than one value. The returns of all calls
are collected as the desired result list.
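These signatures can be written down directly with Python type hints (a minimal sketch; the type variables K1, V1, K2, V2, and V3 mirror the notation above):

from typing import Callable, Iterable, TypeVar

K1, V1 = TypeVar("K1"), TypeVar("V1")
K2, V2 = TypeVar("K2"), TypeVar("V2")
V3 = TypeVar("V3")

# Map(k1, v1) -> list((k2, v2))
MapFn = Callable[[K1, V1], list[tuple[K2, V2]]]

# Reduce(k2, list(v2)) -> list(v3)
ReduceFn = Callable[[K2, Iterable[V2]], list[V3]]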
Implementation
 Distributed implementations of MapReduce require a means of connecting
the processes performing the Map and Reduce phases
Implementation (cont)
 The canonical example counts the appearance of each word in a set of documents.
The original pseudocode is shown here as runnable Python, with generators standing in for emit:

def map_fn(name, document):
    # name: document name
    # document: document contents
    for word in document.split():
        yield (word, 1)              # emit (w, 1)

def reduce_fn(word, partial_counts):
    # word: a word
    # partial_counts: a list of aggregated partial counts
    total = 0
    for pc in partial_counts:
        total += int(pc)             # ParseInt(pc)
    yield (word, total)              # emit (word, sum)
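A short run on two tiny documents, reusing map_fn and reduce_fn above with the shuffle done inline:

from collections import defaultdict

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]

groups = defaultdict(list)           # shuffle: group counts by word
for name, text in docs:
    for word, count in map_fn(name, text):
        groups[word].append(count)

for word in sorted(groups):          # reduce each group, in key order
    for w, total in reduce_fn(word, groups[word]):
        print(w, total)              # e.g. "brown 1" ... "the 2"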
Dataflow
 The hot spots in the dataflow (the parts that depend on the application) are:
 an input reader
 a Map function
 a partition function
 a compare function
 a Reduce function
 an output writer
Dataflow (cont)
 Input reader
 The input reader divides the input into appropriately sized splits, and the
framework assigns one split to each Map function. The input reader reads data
from stable storage (typically a distributed file system) and generates key/value
pairs.
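A toy input reader might split a text file into fixed-size groups of lines and emit (line number, line) pairs. A sketch (real readers split at block boundaries of a distributed file system):

def read_input(path, lines_per_split=1000):
    # Read from stable storage and generate key/value pairs, grouped
    # into splits; the framework hands one split to each Map task.
    split, splits = [], []
    with open(path) as f:
        for lineno, line in enumerate(f):
            split.append((lineno, line.rstrip("\n")))
            if len(split) == lines_per_split:
                splits.append(split)
                split = []
    if split:
        splits.append(split)
    return splits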
Dataflow (cont)
 Map function
 The Map function takes a series of key/value pairs, processes each, and
generates zero or more output key/value pairs. The input and output types of
the map are often different from each other.
 For example, if the application is doing a word count, the map function
would break the line into words and output a key/value pair for each word.
Each output pair would contain the word as the key and the number of
instances of that word in the line as the value.
Dataflow (cont)
 Partition function
 Each Map function output is allocated to a particular reducer by the application's
partition function for sharding purposes. The partition function is given the key and
the number of reducers and returns the index of the desired reducer.
 It is important to pick a partition function that gives an approximately uniform
distribution of data per shard for load-balancing purposes; otherwise the
MapReduce operation can be held up waiting for slow reducers (reducers assigned
more than their share of data) to finish.
 Between the map and reduce stages, the data is shuffled (that is, parallel-sorted
or exchanged between nodes) in order to move it from the map node that
produced it to the shard in which it will be reduced.
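Many implementations default to a hash partitioner, which is a one-liner in spirit (a sketch; Hadoop's HashPartitioner works the same way, using Java's hashCode):

import zlib

def partition(key, num_reducers):
    # Given the key and the number of reducers, return the index of the
    # reducer that should receive this key. A stable hash (CRC32 here)
    # gives an approximately uniform distribution for typical key sets;
    # Python's built-in hash() is salted per process, so we avoid it.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers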
Dataflow (cont)
 Comparison function
 The input for each Reduce is pulled from the machine where the Map ran and
sorted using the application's comparison function.
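In the sketches above, this corresponds to sorting each reducer's pulled input by key before grouping (Python's sorted with a key function stands in for an application-supplied comparator):

def sort_reduce_input(pairs, sort_key=lambda kv: kv[0]):
    # pairs: (key, value) pairs pulled from the machines where the
    # Maps ran. Sorting brings equal keys together, so the framework
    # can call Reduce once per unique key, in sorted order.
    return sorted(pairs, key=sort_key)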
Dataflow (cont)
 Reduce function
 The framework calls the application's Reduce function once for each unique
key in the sorted order. The Reduce can iterate through the values that are
associated with that key and produce zero or more outputs.
 In the word count example, the Reduce function takes the input values, sums
them and generates a single output of the word and the final sum.
Dataflow (cont)
 Output writer
 The Output Writer writes the output of the Reduce to stable storage,
usually a distributed file system.
Performance
 MapReduce programs are not guaranteed to be fast.
 The partition function and the amount of data written by the Map function
can have a large impact on the performance.
 Additional modules such as the Combiner function can help to reduce the
amount of data written to disk and transmitted over the network (a minimal
sketch follows this list).
 Communication cost often dominates the computation cost, and many
MapReduce implementations are designed to write all communication to
distributed storage for crash recovery.
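A combiner is essentially a reducer run locally on each map node's output before the shuffle. For the word count example (a minimal sketch, assuming the map output is a stream of (word, 1) pairs):

from collections import defaultdict

def combine(map_output):
    # Pre-aggregate (word, 1) pairs on the map node, so that only one
    # (word, partial_count) pair per word is written to disk and sent
    # across the network.
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    return list(partial.items())

This is only safe because word-count reduction is associative and commutative; applying it early does not change the final sums.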
Distribution and reliability
 MapReduce achieves reliability by parceling out a number of operations on
the set of data to each node in the network (load distribution).
 Each node is expected to report back periodically with completed work and
status updates.
 If a node falls silent for longer than the expected reporting interval, the
master node records the node as dead and sends the node's assigned work out
to other nodes.
 Individual operations use atomic operations for naming file outputs, as a check
to ensure that parallel conflicting threads are not running.
 Reduce operations operate in much the same way; scheduling them near the
data they consume conserves bandwidth across the backbone network of the
datacenter.
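The master's failure detection can be sketched as a simple heartbeat timeout check (illustrative names and timeout value; a real master also tracks task state and re-execution counts):

import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; an assumed value for illustration

def find_dead_workers(last_heartbeat, now=None):
    # last_heartbeat: dict mapping worker id -> time of last report.
    # A worker silent for longer than the timeout is recorded as dead;
    # its assigned work is then rescheduled onto other workers.
    now = time.time() if now is None else now
    return [w for w, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT]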
Uses
 MapReduce is useful in distributed pattern-based searching, distributed sorting,
web link-graph reversal, Singular Value Decomposition, web access log statistics,
inverted index construction, document clustering, machine learning, and statistical
machine translation.
 It has been adapted to several computing environments, including multi-core and
many-core systems, desktop grids, volunteer computing environments, dynamic
cloud environments, and mobile environments.
Criticism
 Lack of novelty
 Restricted programming framework