MapReduce Presentation Transcript

  • MapReduce Ahmed Elmorsy
  • What is MapReduce?
    ● MapReduce is a programming model for processing and generating large data sets.
    ● It is inspired by the map and reduce primitives present in Lisp and many other functional languages.
    ● Using a functional model with user-specified map and reduce operations makes it easy to parallelize large computations.
  • Map function
    Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
    map (k1,v1) → list(k2,v2)
  • Reduce function
    The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation.
    reduce (k2,list(v2)) → list(v2)
  • Example (Word Count)
    Problem: counting the number of occurrences of each word in a large collection of documents.
  • Example (Word Count)
    Map function:
        map(String key, String value):
            // key: document name
            // value: document contents
            for each word w in value:
                EmitIntermediate(w, "1");
  • Example (Word Count)
    Reduce function:
        reduce(String key, Iterator values):
            // key: a word
            // values: a list of counts
            int result = 0;
            for each v in values:
                result += ParseInt(v);
            Emit(AsString(result));
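The word-count pseudocode above can be tried out without a cluster. The following is a minimal single-process sketch, assuming a plain in-memory loop stands in for the MapReduce library's grouping step; the class name `WordCountSketch` and the helper `run` are hypothetical and are not part of any MapReduce or Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a MapReduce run; not a Hadoop API.
class WordCountSketch {

    // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the document.
    static List<Map.Entry<String, Integer>> map(String docName, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : contents.split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // reduce(k2, list(v2)) -> v2: sum all counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int result = 0;
        for (int c : counts) {
            result += c;
        }
        return result;
    }

    // Assumed driver: run map on every document, group values by key, then reduce.
    static Map<String, Integer> run(Map<String, String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, Integer> kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("a.txt", "the quick fox");
        docs.put("b.txt", "the lazy dog and the fox");
        // prints {and=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(run(docs));
    }
}
```

Unlike the pseudocode, the sketch passes counts as integers rather than strings, which removes the ParseInt/AsString steps.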
  • Execution Overview
    ● The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
    ● Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function.
    ● The number of partitions (R) and the partitioning function are specified by the user.
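As an illustration of the partitioning step, here is a minimal sketch of a hash-based partitioning function of the form hash(key) mod R. The class `PartitionSketch` and its methods are hypothetical names chosen for this example, and a user could supply any other function that maps a key to a value in [0, R).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a hash-based partitioning function; not a library API.
class PartitionSketch {

    // Map an intermediate key to one of r reduce partitions.
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partition(String key, int r) {
        return (key.hashCode() & Integer.MAX_VALUE) % r;
    }

    // Distribute intermediate keys over r buckets, one bucket per reduce task.
    static List<List<String>> assign(List<String> keys, int r) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < r; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String k : keys) {
            buckets.get(partition(k, r)).add(k);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Every occurrence of the same key lands in the same bucket.
        System.out.println(assign(Arrays.asList("a", "b", "c", "a"), 3));
    }
}
```

The property that matters is determinism: all intermediate values for a given key must reach the same reduce task, whichever map task produced them.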
  • How does the master work?
    ● The master picks idle workers and assigns each one a map task or a reduce task.
    ● For each map task and reduce task, it stores the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).
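The master's bookkeeping described above might look roughly like the following sketch. It assumes a single in-memory task table and a simplistic first-idle-task policy; `MasterSketch`, `Task`, `assign`, and `complete` are hypothetical names, and a real master also tracks intermediate file locations and takes data locality into account.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the master's task table; not taken from any real implementation.
class MasterSketch {
    enum State { IDLE, IN_PROGRESS, COMPLETED }
    enum Kind { MAP, REDUCE }

    static class Task {
        final Kind kind;
        final int id;
        State state = State.IDLE;
        String worker; // identity of the worker machine; null while the task is idle

        Task(Kind kind, int id) {
            this.kind = kind;
            this.id = id;
        }
    }

    final List<Task> tasks = new ArrayList<>();

    // Create M map tasks and R reduce tasks, all initially idle.
    MasterSketch(int m, int r) {
        for (int i = 0; i < m; i++) tasks.add(new Task(Kind.MAP, i));
        for (int i = 0; i < r; i++) tasks.add(new Task(Kind.REDUCE, i));
    }

    // Hand the first idle task to an idle worker; returns null when no task is idle.
    Task assign(String worker) {
        for (Task t : tasks) {
            if (t.state == State.IDLE) {
                t.state = State.IN_PROGRESS;
                t.worker = worker;
                return t;
            }
        }
        return null;
    }

    // Record that a worker finished its task.
    void complete(Task t) {
        t.state = State.COMPLETED;
    }

    public static void main(String[] args) {
        MasterSketch master = new MasterSketch(2, 1);
        Task t = master.assign("worker-1");
        System.out.println(t.kind + " " + t.id + " " + t.state + " on " + t.worker);
    }
}
```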
  • Fault Tolerance
    ● Worker Failure
    ● Master Failure
  • Worker Failure
    ● The master pings every worker periodically.
    ● If no response is received from a worker, the master marks the worker as failed.
    ● Any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
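  • Worker Failure
    ● Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Why? Their output is stored on the local disk of the failed machine and is therefore inaccessible.
    ● Completed reduce tasks do not need to be re-executed. Why? Their output is stored in a global file system.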
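The rescheduling rules on the two Worker Failure slides can be summarized in a few lines of code. This is a sketch under the assumption that the master keeps a simple list of task records; `WorkerFailureSketch`, `Task`, and `onWorkerFailed` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of how a master might react to a failed worker.
class WorkerFailureSketch {
    enum State { IDLE, IN_PROGRESS, COMPLETED }

    static class Task {
        final boolean isMap;
        State state;
        String worker; // null while the task is idle

        Task(boolean isMap, State state, String worker) {
            this.isMap = isMap;
            this.state = state;
            this.worker = worker;
        }
    }

    // Reset every task that must be re-run after the given worker failed.
    // Returns the number of tasks made eligible for rescheduling.
    static int onWorkerFailed(String failedWorker, List<Task> tasks) {
        int reset = 0;
        for (Task t : tasks) {
            if (!failedWorker.equals(t.worker)) {
                continue; // task belongs to another worker or is already idle
            }
            boolean inProgress = t.state == State.IN_PROGRESS;
            // Completed map output is lost with the worker; completed reduce output is not.
            boolean completedMap = t.isMap && t.state == State.COMPLETED;
            if (inProgress || completedMap) {
                t.state = State.IDLE;
                t.worker = null;
                reset++;
            }
        }
        return reset;
    }

    public static void main(String[] args) {
        List<Task> tasks = new ArrayList<>();
        tasks.add(new Task(true, State.COMPLETED, "w1"));    // reset
        tasks.add(new Task(false, State.COMPLETED, "w1"));   // kept
        tasks.add(new Task(false, State.IN_PROGRESS, "w1")); // reset
        System.out.println(onWorkerFailed("w1", tasks));     // prints 2
    }
}
```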
  • Master Failure
    There are two options:
    1. Make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state.
    2. Abort the MapReduce computation if the master fails.
  • Backup Tasks
    ● A common cause of a longer total running time for a MapReduce operation is a “straggler”: a machine that takes an unusually long time to complete one of the last few map or reduce tasks.
    ● When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.
    ● A task is marked as completed whenever either the primary or the backup execution completes.
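The "first execution to finish wins" rule for backup tasks can be imitated on a single machine with threads. The sketch below assumes two threads stand in for the primary and backup workers and that sleeping stands in for doing the work; `BackupTaskSketch`, `attempt`, and `runWithBackup` are hypothetical names. It relies on the standard `ExecutorService.invokeAny`, which returns the result of the first task that completes successfully and cancels the rest.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical single-machine imitation of primary and backup task executions.
class BackupTaskSketch {

    // One execution attempt that "works" for the given time and then reports its name.
    static Callable<String> attempt(String name, long millis) {
        return () -> {
            Thread.sleep(millis);
            return name;
        };
    }

    // Run the primary and the backup execution; the task is done when either finishes.
    static String runWithBackup(long primaryMillis, long backupMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return pool.invokeAny(Arrays.asList(
                    attempt("primary", primaryMillis),
                    attempt("backup", backupMillis)));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("no execution completed", e);
        } finally {
            pool.shutdownNow(); // interrupt whichever execution is still running
        }
    }

    public static void main(String[] args) {
        // The primary acts as a straggler here, so the backup finishes first.
        System.out.println(runWithBackup(2000, 50)); // prints backup
    }
}
```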
  • Refinements
    1. Partitioning Function
    2. Combiner Function
    3. Input and Output Types
    4. Skipping Bad Records
    5. Status Information
    6. Counters
  • More Examples
    ● Distributed Grep
    ● Count of URL Access Frequency
    ● Reverse Web-Link Graph
    ● Inverted Index
    ● Distributed Sort
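To make one of the listed examples concrete, here is a minimal in-memory sketch of the inverted index: map emits a (word, document ID) pair for each word, and reduce sorts the document IDs collected for a word. The class `InvertedIndexSketch` and its `build` driver are hypothetical, and removing duplicate document IDs in reduce is an assumption of this sketch.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory sketch of the inverted index example; not a Hadoop API.
class InvertedIndexSketch {

    // map: emit (word, docId) for every word in the document.
    static List<Map.Entry<String, String>> map(String docId, String contents) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String w : contents.split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, docId));
            }
        }
        return out;
    }

    // reduce: sort the document IDs for one word (duplicates removed, by assumption).
    static List<String> reduce(String word, List<String> docIds) {
        return new ArrayList<>(new TreeSet<>(docIds));
    }

    // Assumed driver: run map on every document, group by word, then reduce.
    static Map<String, List<String>> build(Map<String, String> documents) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, String> kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            index.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("d1", "map reduce");
        docs.put("d2", "reduce reduce sort");
        // prints {map=[d1], reduce=[d1, d2], sort=[d2]}
        System.out.println(build(docs));
    }
}
```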
  • Apache Hadoop
    Open Source Implementation of MapReduce
  • Hadoop Modules
    ● Hadoop Common
    ● Hadoop Distributed File System (HDFS™)
    ● Hadoop YARN
    ● Hadoop MapReduce
  • Projects related to Hadoop
    ● Apache Hive: developed by Facebook and used by Netflix.
    ● Apache Pig: developed at Yahoo! and used by Twitter.
    ● Apache Cassandra: developed by Facebook.
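  • Template Hadoop Program
        import java.io.IOException;
        import java.util.Iterator;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.KeyValueTextInputFormat;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reducer;
        import org.apache.hadoop.mapred.Reporter;
        import org.apache.hadoop.mapred.TextOutputFormat;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;

        public class MyJob extends Configured implements Tool {

            public static class MapClass extends MapReduceBase
                    implements Mapper<Text, Text, Text, Text> {
                public void map(Text key, Text value,
                                OutputCollector<Text, Text> output,
                                Reporter reporter) throws IOException {
                    // Map function
                }
            }

            public static class Reduce extends MapReduceBase
                    implements Reducer<Text, Text, Text, Text> {
                public void reduce(Text key, Iterator<Text> values,
                                   OutputCollector<Text, Text> output,
                                   Reporter reporter) throws IOException {
                    // Reduce function
                }
            }

            // Configures the job and submits it, blocking until it finishes.
            public int run(String[] args) throws Exception {
                Configuration conf = getConf();
                JobConf job = new JobConf(conf, MyJob.class);

                Path in = new Path(args[0]);
                Path out = new Path(args[1]);
                FileInputFormat.setInputPaths(job, in);
                FileOutputFormat.setOutputPath(job, out);

                job.setJobName("MyJob");
                job.setMapperClass(MapClass.class);
                job.setReducerClass(Reduce.class);
                job.setInputFormat(KeyValueTextInputFormat.class);
                job.setOutputFormat(TextOutputFormat.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);
                job.set("", ",");

                JobClient.runJob(job);
                return 0;
            }

            public static void main(String[] args) throws Exception {
                // ToolRunner parses generic Hadoop options before calling run().
                int res = ToolRunner.run(new Configuration(), new MyJob(), args);
                System.exit(res);
            }
        }
    To run it, generate the JAR file, then use the command:
        bin/hadoop jar playground/MyJob.jar MyJob input/cite75_99.txt output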
  • Readings
    Chapter 4 in Lam, Chuck. Hadoop in Action. Manning Publications Co., 2010.
  • References
    [1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
    [2] Lam, Chuck. Hadoop in Action. Manning Publications Co., 2010.
    [3]