MapReduce
 Ahmed Elmorsy
What is MapReduce?
● MapReduce is a programming model for
  processing and generating large data sets.

● Inspired by the map and reduce primitives
  present in Lisp and many other functional
  languages.

● Use of a functional model with user-specified
  map and reduce operations allows us to
  parallelize large computations easily.
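For intuition, here is the same map-then-reduce shape expressed with Java
streams. This is only an illustrative analogy to the functional primitives,
not the MapReduce API:

import java.util.List;

public class MapReduceShape {
    public static void main(String[] args) {
        List<String> words = List.of("to", "be", "or", "not", "to", "be");
        // map: each word -> 1; reduce: sum the ones.
        int total = words.stream()
                         .map(w -> 1)
                         .reduce(0, Integer::sum);
        System.out.println(total);  // 6
    }
}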
Map function
Map, written by the user, takes an input pair
and produces a set of intermediate key/value
pairs. The MapReduce library groups together
all intermediate values associated with the
same intermediate key I and passes them to the
Reduce function.


   map (k1,v1) → list(k2,v2)
Reduce function
The Reduce function, also written by the user,
accepts an intermediate key I and a set of
values for that key. It merges together these
values to form a possibly smaller set of values.
Typically just zero or one output value is
produced per Reduce invocation.


reduce (k2,list(v2)) → list(v2)
Example (Word Count)
          Problem

    Counting the number of
 occurrences of each word in a
 large collection of documents
Example (Word Count)
Map function:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
Example (Word Count)
Reduce function:

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
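The same word count as a runnable Java sketch, written against the old-style
Hadoop API used in the template program at the end of these slides (class
names are illustrative):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Emit (w, 1) for every word w in the input line.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum the counts emitted for one word.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}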
Execution Overview
● The Map invocations are distributed across
  multiple machines by automatically
  partitioning the input data into a set of M splits.
● Reduce invocations are distributed by
  partitioning the intermediate key space into R
  pieces using a partitioning function.
● The number of partitions (R) and the
  partitioning function are specified by the user.
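The paper's default partitioning function simply hashes the key, i.e.
hash(key) mod R. A minimal Java sketch of that scheme:

public class Partitioning {
    // An intermediate key goes to reduce partition hash(key) mod R.
    // Masking with Integer.MAX_VALUE keeps the hash non-negative.
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("apple", 4));  // deterministic partition
    }
}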
How Does the Master Work?
● The master picks idle workers and assigns
  each one a map task or a reduce task.

● For each map task and reduce task, it stores
  the state (idle, in-progress, or
  completed), and the identity of the worker
  machine (for non-idle tasks).
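A minimal Java sketch of that per-task bookkeeping (illustrative only; these
types are not from the paper or from Hadoop):

// Illustrative bookkeeping: the master's state for a single task.
enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

class TaskInfo {
    TaskState state = TaskState.IDLE;
    String workerId;   // identity of the worker; null while the task is idle
}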
Fault Tolerance

● Worker Failure

● Master Failure
Worker Failure
● The master pings every worker periodically.

● If no response is received from a worker,
  the master marks the worker as failed.

● Any map task or reduce task in progress on a
  failed worker is also reset to idle and
  becomes eligible for rescheduling.
Worker Failure
● Any map tasks completed by the worker are
  reset back to their initial idle state, and
  therefore become eligible for scheduling on
  other workers (WHY?!).

● Completed reduce tasks do not need to be re-
  executed (WHY?!).
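Both answers come down to where the output lives: a completed map task's
output sits on the failed machine's local disk and becomes unreadable, while
completed reduce output is already stored in the global file system. A
failure-handling sketch, reusing the hypothetical TaskState/TaskInfo
bookkeeping from the earlier slide:

import java.util.List;

public class FailureHandler {
    // Map tasks are reset even if completed: their output lived on the
    // failed machine's local disk. Completed reduce tasks keep their state:
    // their output is already in the global file system.
    static void onWorkerFailed(String worker,
                               List<TaskInfo> mapTasks,
                               List<TaskInfo> reduceTasks) {
        for (TaskInfo t : mapTasks) {
            if (worker.equals(t.workerId)) {
                t.state = TaskState.IDLE;
                t.workerId = null;
            }
        }
        for (TaskInfo t : reduceTasks) {
            if (worker.equals(t.workerId) && t.state == TaskState.IN_PROGRESS) {
                t.state = TaskState.IDLE;
                t.workerId = null;
            }
        }
    }
}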
Master Failure
There are two options:
1. Make the master write periodic checkpoints
   of the master data structures described
   above. If the master task dies, a new copy
   can be started from the last checkpointed
   state.

2. Abort the MapReduce computation if the
   master fails.
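A minimal sketch of option 1, assuming the master's data structures are
Java-serializable (names illustrative):

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

public class Checkpointer {
    // Periodically serialize the master's data structures so a replacement
    // master can resume from the last snapshot instead of starting over.
    static void checkpoint(Serializable masterState, Path file)
            throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(masterState);
        }
    }
}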
Backup Tasks
● One of the common causes that lengthen the
  total time taken for a MapReduce operation is a
  “straggler”: a machine that takes an unusually
  long time to complete one of the last few tasks.
● When a MapReduce operation is close to
  completion, the master schedules backup
  executions of the remaining in-progress
  tasks.
● The task is marked as completed whenever
  either the primary or the backup execution
  completes.
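A sketch of backup scheduling, again reusing the hypothetical TaskInfo
bookkeeping (the queue stands in for whatever dispatch mechanism the master
uses):

import java.util.List;
import java.util.Queue;

public class BackupScheduler {
    // Near job completion, enqueue a second copy of every in-progress task
    // so an idle worker can race the straggler; whichever copy finishes
    // first marks the task completed.
    static void scheduleBackups(List<TaskInfo> tasks, Queue<TaskInfo> runQueue) {
        for (TaskInfo t : tasks) {
            if (t.state == TaskState.IN_PROGRESS) {
                runQueue.add(t);
            }
        }
    }
}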
Refinements
1.   Partitioning Function
2.   Combiner Function (see the sketch after this list)
3.   Input and Output Types
4.   Skipping Bad Records
5.   Status Information
6.   Counters
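One refinement worth illustrating: a combiner function performs a partial,
local reduce on each map worker's output before it is shipped across the
network. For word count, partial sums are safe to compute locally because
addition is associative and commutative, so the Reduce class can double as
the combiner. In the old Hadoop API of the template program shown later,
this is one extra line in run() (a hedged sketch, assuming that template):

    // Run the reducer locally on each map worker's output as well.
    job.setCombinerClass(Reduce.class);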
More Examples
● Distributed Grep
● Count of URL Access Frequency
● Reverse Web-Link Graph
● Inverted Index
● Distributed Sort
Apache Hadoop
An open-source implementation of MapReduce
Hadoop Modules
● Hadoop Common

● Hadoop Distributed File System (HDFS™)

● Hadoop YARN

● Hadoop MapReduce
Projects based on Hadoop
● Apache Hive
Developed by Facebook and used by Netflix.

● Apache Pig
Developed at Yahoo! and used by Twitter.

● Apache Cassandra
Developed by Facebook.
Template Hadoop Program
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Map function goes here.
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Reduce function goes here.
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MyJob.class);

        // Input and output paths come from the command line.
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        // Each input line is split into a (key, value) pair at the comma.
        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ",");

        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}



To run it, package the compiled classes into a JAR
file, then use the command:

bin/hadoop jar playground/MyJob.jar MyJob
input/cite75_99.txt output
Readings

         Chapter 4 in
(Chuck Lam. Hadoop in Action.
Manning Publications Co., 2010.)
References
[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce:
Simplified Data Processing on Large Clusters.
In OSDI, pages 137–150, 2004.

[2] Chuck Lam. Hadoop in Action. Manning
Publications Co., 2010.

[3] http://hadoop.apache.org/
