MapReduce
Transcript

  • 1. MapReduce Ahmed Elmorsy
  • 2. What is MapReduce?
    ● MapReduce is a programming model for processing and generating large data sets.
    ● Inspired by the map and reduce primitives present in Lisp and many other functional languages.
    ● Use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily.
  • 3. Map function
    Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
    map (k1,v1) → list(k2,v2)
  • 4. Reduce function
    The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation.
    reduce (k2,list(v2)) → list(v2)
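    A minimal sketch (my own illustration, not from the slides) of the two user-supplied operations as plain Java interfaces, mirroring the signatures map(k1,v1) → list(k2,v2) and reduce(k2, list(v2)) → list(v2); the names MapFn and ReduceFn are made up for illustration.

    import java.util.List;
    import java.util.function.BiConsumer;

    interface MapFn<K1, V1, K2, V2> {
        // Emit zero or more intermediate (k2, v2) pairs for one input (k1, v1) pair.
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    interface ReduceFn<K2, V2> {
        // Merge all values that share one intermediate key into a (possibly smaller) list.
        List<V2> reduce(K2 key, List<V2> values);
    }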
  • 5. Example (Word Count)
    Problem: counting the number of occurrences of each word in a large collection of documents.
  • 6. Example (Word Count)
    Map function:
    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");
  • 7. Example (Word Count)
    Reduce function:
    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
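    To make the pseudocode above concrete, here is a toy single-machine Java sketch (my own illustration, not the MapReduce library): the map step emits (word, 1) pairs, the grouping that the library would normally do between the phases is folded into a HashMap, and the reduce step sums each word's list of counts.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WordCountSketch {
        public static void main(String[] args) {
            // Stand-in for a "large collection of documents".
            Map<String, String> documents = new HashMap<>();
            documents.put("doc1", "the quick brown fox");
            documents.put("doc2", "the lazy dog and the fox");

            // Map phase: EmitIntermediate(w, 1); grouping by intermediate key happens here too.
            Map<String, List<Integer>> intermediate = new HashMap<>();
            for (String contents : documents.values()) {
                for (String word : contents.split("\\s+")) {
                    intermediate.computeIfAbsent(word, w -> new ArrayList<>()).add(1);
                }
            }

            // Reduce phase: sum the list of counts for each word and emit the result.
            for (Map.Entry<String, List<Integer>> entry : intermediate.entrySet()) {
                int result = 0;
                for (int v : entry.getValue()) {
                    result += v;
                }
                System.out.println(entry.getKey() + "\t" + result);
            }
        }
    }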
  • 8. Execution Overview
    ● The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
    ● Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function.
    ● The number of partitions (R) and the partitioning function are specified by the user.
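    The paper's default partitioning function is hash(key) mod R. Below is a sketch of supplying such a function yourself through Hadoop's old (org.apache.hadoop.mapred) API, the same API the template near the end of this deck uses; treat it as an illustration rather than the deck's own code.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class HashKeyPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) {
            // Nothing to configure for a plain hash partitioner.
        }

        public int getPartition(Text key, Text value, int numReduceTasks) {
            // Mask off the sign bit so the result lands in [0, numReduceTasks).
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

    // Wiring it in: job.setPartitionerClass(HashKeyPartitioner.class); job.setNumReduceTasks(R);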
  • 9. How the Master works
    ● The master picks idle workers and assigns each one a map task or a reduce task.
    ● For each map task and reduce task, it stores the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).
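    A toy Java sketch (my own illustration, not Google's or Hadoop's code) of the bookkeeping just described: one record per task holding its state and, for non-idle tasks, the worker it was assigned to.

    import java.util.HashMap;
    import java.util.Map;

    public class MasterState {
        enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

        static class TaskInfo {
            TaskState state = TaskState.IDLE;
            boolean isMapTask;
            String workerId;    // set only while the task is not idle
        }

        // One entry per map task and per reduce task.
        final Map<String, TaskInfo> tasks = new HashMap<>();

        void assignToIdleWorker(String taskId, String workerId) {
            TaskInfo info = tasks.computeIfAbsent(taskId, id -> new TaskInfo());
            info.state = TaskState.IN_PROGRESS;
            info.workerId = workerId;
        }
    }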
  • 10. Fault Tolerance
    ● Worker Failure
    ● Master Failure
  • 11. Worker Failure
    ● The master pings every worker periodically.
    ● If no response is received from a worker, the master marks the worker as failed.
    ● Any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
  • 12. Worker Failure
    ● Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers (WHY?!).
    ● Completed reduce tasks do not need to be re-executed (WHY?!).
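    A toy Java sketch (my own illustration) of the rescheduling rule on the two slides above; the tiny Task type is repeated here so the snippet stands on its own.

    import java.util.Collection;

    public class WorkerFailureHandling {
        enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

        static class Task {
            TaskState state;
            String workerId;
            boolean isMapTask;
        }

        // Reset the tasks affected by a failed worker so the master can reschedule them:
        // anything in progress on that worker, plus map tasks it had already completed.
        static void onWorkerFailed(String failedWorker, Collection<Task> tasks) {
            for (Task t : tasks) {
                if (!failedWorker.equals(t.workerId)) {
                    continue;
                }
                boolean inProgress = t.state == TaskState.IN_PROGRESS;
                boolean completedMap = t.state == TaskState.COMPLETED && t.isMapTask;
                if (inProgress || completedMap) {
                    t.state = TaskState.IDLE;     // eligible for rescheduling on another worker
                    t.workerId = null;
                }
            }
        }
    }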
  • 13. Master Failure
    There are two options:
    1. Make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state.
    2. Abort the MapReduce computation if the master fails.
  • 14. Backup Tasks
    ● One of the common causes that lengthens the total time taken for a MapReduce operation is a “straggler”.
    ● When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.
    ● The task is marked as completed whenever either the primary or the backup execution completes.
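    Hadoop's counterpart to backup tasks is called speculative execution. A minimal sketch using the old JobConf API from the template later in the deck; the property names below are the Hadoop 1.x ones and may differ in other releases, so treat them as an assumption to verify.

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeExecutionSetup {
        static void enableBackupTasks(JobConf job) {
            // Allow the framework to launch backup ("speculative") attempts of slow tasks.
            job.setBoolean("mapred.map.tasks.speculative.execution", true);
            job.setBoolean("mapred.reduce.tasks.speculative.execution", true);
        }
    }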
  • 15. Refinements
    1. Partitioning Function
    2. Combiner Function (see the sketch after this list)
    3. Input and Output Types
    4. Skipping Bad Records
    5. Status Information
    6. Counters
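    A sketch of refinement 2, the combiner function, in the old Hadoop API (my own illustration): a combiner pre-aggregates map output on the map machine before it crosses the network, and for an associative, commutative merge such as summing word counts the reducer class itself can typically be reused as the combiner.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Reducer;

    public class CombinerSetup {
        static void useReducerAsCombiner(JobConf job, Class<? extends Reducer> reducerClass) {
            // Run the same merge logic locally on each map machine's output.
            job.setCombinerClass(reducerClass);
        }
    }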
  • 16. More Examples
    ● Distributed Grep
    ● Count of URL Access Frequency
    ● Reverse Web-Link Graph
    ● Inverted Index
    ● Distributed Sort
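    A sketch of the first example, distributed grep, in the old Hadoop API (my own illustration; the pattern and class name are made up): the map emits a line if it matches the pattern, and the reduce is the identity, so an identity reducer (or zero reduce tasks) finishes the job.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class GrepMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, Text> {
        private static final String PATTERN = "error";   // hypothetical search string

        public void map(LongWritable offset, Text line,
                        OutputCollector<LongWritable, Text> output, Reporter reporter)
                throws IOException {
            // With the default TextInputFormat, the key is the line's byte offset
            // and the value is the line itself.
            if (line.toString().contains(PATTERN)) {
                output.collect(offset, line);   // emit only the matching lines
            }
        }
    }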
  • 17. Apache Hadoop
    Open Source Implementation of MapReduce
  • 18. Hadoop Modules
    ● Hadoop Common
    ● Hadoop Distributed File System (HDFS™)
    ● Hadoop YARN
    ● Hadoop MapReduce
  • 19. Projects based on Hadoop
    ● Apache Hive: developed by Facebook and used by Netflix.
    ● Apache Pig: developed at Yahoo! and used by Twitter.
    ● Apache Cassandra: developed by Facebook.
  • 20. Template Hadoop Program
    public class MyJob extends Configured implements Tool {

      public static class MapClass extends MapReduceBase
          implements Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
          //Map Function
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
        }
      }
  • 21.
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MyJob.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ",");

        JobClient.runJob(job);
        return 0;
      }
  • 22.
      public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
      }
    }

    To run it, you have to generate the JAR file; then you can use the command:
    bin/hadoop jar playground/MyJob.jar MyJob input/cite75_99.txt output
  • 23. Readings
    Chapter 4 in Lam, Chuck. Hadoop in Action. Manning Publications Co., 2010.
  • 24. References
    [1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
    [2] Lam, Chuck. Hadoop in Action. Manning Publications Co., 2010.
    [3] http://hadoop.apache.org/