Local aggregation          Network and disk latencies are expensive compared to other          operations          Decreas...
Counting and Summing     1    Problem                A number of documents with a set of terms                Need to calc...
Collating     1    Problem                A number of documents with a set of terms and some function of one              ...
Filtering, Parsing, and Validation     1    Problem                A set of records                Need to collect all rec...
Distributed Task Execution     1    Problem                Large computational problem                Need to divide it in...
Sorting     1    Problem                A set of records                Need to sort records in some order     2    Soluti...
References     1    Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with          MapReduce. Morgan and Cla...
Topic 6: MapReduce Applications

Cloud Computing Workshop 2013, ITU

6: MapReduce Applications
Zubair Nabi, zubair.nabi@itu.edu.pk
April 18, 2013

Outline
  1. The Anatomy of a MapReduce Application
  2. MapReduce Design Patterns
  3. Common MapReduce Application Types

MapReduce job phases

A MapReduce job can be divided into 4 phases:
  1. Input split: The input dataset is sliced into M splits, one per map task
  2. Map logic: The user-supplied map function is invoked
       • In tandem, a sort phase ensures that map output is locally sorted by key
       • In addition, the key space is partitioned amongst the reducers
  3. Shuffle: Map output is relayed to the reduce tasks, each of which receives its own partition of the key space
  4. Reduce logic: The user-provided reduce function is invoked
       • Before the reduce function is applied, the fetched map outputs are merged so that each reduce task sees its key/value pairs sorted by key
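
The four phases above can be traced in a few lines of single-process Python. This is only a sketch under stated assumptions, not how any real framework is implemented: `run_job`, `byte_sum_partition`, `wc_map`, and `wc_reduce` are hypothetical names, the "splits" are slices of an in-memory list, and the partition function is a deliberately simple deterministic stand-in.

```python
from collections import defaultdict
from heapq import merge
from itertools import groupby
from operator import itemgetter


def byte_sum_partition(key, num_reduces):
    """Hypothetical deterministic partition function (not a real framework's default)."""
    return sum(key.encode("utf-8")) % num_reduces


def run_job(lines, map_fn, reduce_fn, num_maps=2, num_reduces=2):
    """Toy walk through the four phases in one process."""
    # Phase 1, input split: slice the input into at most num_maps contiguous splits.
    size = max(1, -(-len(lines) // num_maps))  # ceiling division
    splits = [lines[i:i + size] for i in range(0, len(lines), size)]

    # Phase 2, map logic: invoke map_fn per record, partition the keys amongst
    # the reducers, and keep every partition locally sorted by key.
    map_outputs = []
    for split in splits:
        partitions = defaultdict(list)
        for record_id, line in enumerate(split):
            for key, value in map_fn(record_id, line):
                partitions[byte_sum_partition(key, num_reduces)].append((key, value))
        map_outputs.append({p: sorted(kv) for p, kv in partitions.items()})

    results = {}
    for r in range(num_reduces):
        # Phase 3, shuffle: reduce task r fetches partition r from every map task.
        fetched = [output.get(r, []) for output in map_outputs]
        # Phase 4, reduce logic: merge the sorted runs, then reduce once per key.
        for key, group in groupby(merge(*fetched), key=itemgetter(0)):
            results[key] = reduce_fn(key, (value for _, value in group))
    return results


def wc_map(_, line):
    for word in line.split():
        yield word, 1


def wc_reduce(_, values):
    return sum(values)


counts = run_job(["a b a", "b c", "a"], wc_map, wc_reduce)
```

Here `counts` maps each word to its number of occurrences; the sort, partition, shuffle, and merge steps are the parts a real framework performs on the programmer's behalf.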

Of mappers and reducers
  • In the common case, programmers only need to write a map and a reduce function
  • The user-provided map function is invoked for every line of the input file (the record format can be changed) and is passed the line's position in the file as key and the line's contents as value; with Hadoop's default TextInputFormat that position is the line's byte offset rather than a line number
  • The user-provided reduce function is invoked once for each key output by the map phase and is passed the values associated with that key as an iterable

Wordcount: High-level view
  • Input: A text corpus such as a Wikipedia dump, books from Project Gutenberg, etc.
  • The map function is invoked once for each line of text
  • Map output: Words as keys and 1 as values
  • Reduce input: Key/value pairs of words and their values (1s)
  • The reduce function is invoked once for each word with a list of 1s
  • Reduce output: Words and their final counts

Wordcount: Low-level view
  • A new process, called MapRunner, is created for each map task
  • MapRunner has a RecordReader instance that is used to read the input file
  • RecordReader reads the input file in chunks and parses the chunks into lines
  • MapRunner also has a Mapper instance with a map function, WordCountMapper in this case
  • For each line parsed by RecordReader, MapRunner calls WordCountMapper.map() and passes it the line
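
The chunk-then-parse behaviour described above might look roughly like the following. It is a minimal sketch, assuming a binary input stream and newline-terminated UTF-8 lines; `read_records` and `run_map_task` are hypothetical stand-ins for RecordReader and MapRunner, not their actual code.

```python
import io


def read_records(stream, chunk_size=16):
    """Yield (byte_offset, line) pairs from a binary stream read in fixed-size chunks."""
    pending = b""  # bytes of a line that has not been terminated yet
    offset = 0     # byte offset in the stream at which `pending` starts
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Emit every complete line currently sitting in the buffer.
        while (newline := pending.find(b"\n")) != -1:
            yield offset, pending[:newline].decode("utf-8")
            offset += newline + 1
            pending = pending[newline + 1:]
    if pending:    # final line without a trailing newline
        yield offset, pending.decode("utf-8")


def run_map_task(stream, map_fn):
    """Hypothetical MapRunner stand-in: call map_fn once per parsed line."""
    for key, line in read_records(stream):
        map_fn(key, line)


seen = []
run_map_task(io.BytesIO(b"hello world\nfoo bar\n"), lambda k, v: seen.append((k, v)))
```

A line may straddle two chunks, which is why the reader keeps the unterminated tail in `pending` instead of treating each chunk as a set of whole lines.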

Wordcount: Low-level view (2)
  • WordCountMapper has an OutputCollector instance which maintains an in-memory buffer for each output partition (one partition per reduce task)
  • Each time WordCountMapper.map() is invoked, it tokenizes the line into words
  • For each word, it writes the word as key and 1 as value to OutputCollector
  • OutputCollector uses the Partitioner instance to select a partition buffer for each key
  • Whenever the size of a partition buffer exceeds a configurable threshold, its contents are first sorted by key and then flushed to disk
  • This process is repeated until the map logic has been applied to all lines within the input file
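
The buffering scheme above can be sketched as a small class. Everything here is an assumption made for illustration: the class only borrows the OutputCollector name, `first_letter_partitioner` is a made-up partition function, the threshold counts records rather than bytes, and "disk" is simulated by a list of sorted runs.

```python
class OutputCollector:
    """Hypothetical collector: one in-memory buffer per reduce partition."""

    def __init__(self, num_partitions, partitioner, spill_threshold=3):
        self._buffers = [[] for _ in range(num_partitions)]
        self._partitioner = partitioner
        self._threshold = spill_threshold
        # Stand-in for spill files on disk: a list of sorted runs per partition.
        self.spills = [[] for _ in range(num_partitions)]

    def collect(self, key, value):
        # The partitioner selects the buffer that this key belongs to.
        partition = self._partitioner(key, len(self._buffers))
        self._buffers[partition].append((key, value))
        if len(self._buffers[partition]) > self._threshold:
            self._spill(partition)

    def _spill(self, partition):
        # Sort by key, then "flush": here the run is only moved to self.spills.
        run = sorted(self._buffers[partition], key=lambda kv: kv[0])
        self.spills[partition].append(run)
        self._buffers[partition] = []

    def close(self):
        """Flush whatever is left once the map logic has seen every line."""
        for partition, buffer in enumerate(self._buffers):
            if buffer:
                self._spill(partition)


def first_letter_partitioner(key, num_partitions):
    return ord(key[0]) % num_partitions


collector = OutputCollector(2, first_letter_partitioner, spill_threshold=2)
for word in "b a b a c a".split():
    collector.collect(word, 1)
collector.close()
```

Each partition ends up with one or more runs that are individually sorted by key, which is what makes the cheap merge on the reduce side possible.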

Wordcount: Low-level view (3)
  • Once all maps have completed their execution, the reduce phase is started
  • For each reduce task, a ReduceRunner process is created
  • Each reduce task fetches its input partitions from the machines on which map tasks were run
  • All input partitions are then merged to get a globally sorted partition of key/value pairs
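
Because every fetched partition is already sorted by key, the merge step is a k-way merge rather than a full sort. A minimal sketch, assuming map outputs are held in memory as dictionaries from reduce id to sorted run (`fetch_partition` and `merged_input` are hypothetical helpers):

```python
from heapq import merge


def fetch_partition(map_outputs, reduce_id):
    """Collect the sorted run destined for reduce_id from every map task."""
    return [output[reduce_id] for output in map_outputs if reduce_id in output]


def merged_input(map_outputs, reduce_id):
    """Merge the fetched, already-sorted runs into one list sorted by key."""
    runs = fetch_partition(map_outputs, reduce_id)
    return list(merge(*runs, key=lambda kv: kv[0]))


# Output of two map tasks: reduce id -> run sorted by key.
map_outputs = [
    {0: [("apple", 1), ("cat", 1)], 1: [("bird", 1)]},
    {0: [("ant", 1), ("apple", 1)]},
]
reduce_0_input = merged_input(map_outputs, 0)
```

After the merge, all pairs for a given word are adjacent in `reduce_0_input`, so the reduce function can be handed one word and its values at a time.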

Wordcount: Low-level view (4)
  • ReduceRunner contains a Reducer instance with a reduce function, WordCountReducer in this case
  • For each word, ReduceRunner invokes WordCountReducer.reduce() and passes it the word and a list of its values (1s)
  • WordCountReducer also has an OutputCollector instance with an in-memory buffer
  • WordCountReducer.reduce() sums the list of values it is passed and writes the word and its final count to the OutputCollector
  • This process is repeated until the reduce logic has been applied to all key/value pairs
  • At the end of the entire job, each reduce task produces an output file with words and their number of occurrences
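
The per-word invocation of reduce() can be sketched by grouping the merged, sorted input by key. This is an illustration under assumptions: `run_reduce_task` is a hypothetical stand-in for ReduceRunner, and a plain Python list plays the role of the OutputCollector buffer.

```python
from itertools import groupby
from operator import itemgetter


class WordCountReducer:
    def __init__(self, output_buffer):
        # A plain list stands in for the OutputCollector's in-memory buffer.
        self._output_buffer = output_buffer

    def reduce(self, key, values):
        # Sum the 1s for this word and record the final count.
        self._output_buffer.append((key, sum(values)))


def run_reduce_task(sorted_pairs, reducer):
    """Call reducer.reduce() once per distinct key of the sorted input."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        reducer.reduce(key, (value for _, value in group))


output = []
run_reduce_task(
    [("ant", 1), ("apple", 1), ("apple", 1), ("cat", 1)],
    WordCountReducer(output),
)
```

Grouping adjacent pairs only works because the input is sorted by key, which is exactly what the merge in the previous step guarantees.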
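
Wordcount map in Java

    // Inside a Mapper<Object, Text, Text, IntWritable> subclass; needs java.io.IOException,
    // java.util.StringTokenizer, and org.apache.hadoop.io.{IntWritable, Text}.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every whitespace-separated token in the line.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }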

Wordcount reduce in Java

    // Inside a Reducer<Text, IntWritable, Text, IntWritable> subclass.
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }

Wordcount map in Python

    def map(self, key, value):
        # Emit (word, 1) for each whitespace-separated word in the line.
        for word in value.split():
            self._output_collector.collect(word, 1)

Wordcount reduce in Python

    def reduce(self, key, values):
        # Sum the counts collected for this word and emit the total.
        total = 0
        for value in values:
            total += value
        self._output_collector.collect(key, total)

Bird’s-eye view
  • The MapReduce paradigm is amenable to divide-and-conquer algorithms
  • One way to look at MapReduce is that it is just a large-scale sorting platform
  • User logic is only involved at specific hook points
  • Algorithms must be expressed in terms of a small number of specific components that fit together in preset ways
  • It is like putting together a jigsaw puzzle in which all the other pieces have already been assembled and you only need to add two pieces: the map and the reduce pieces
  • Fortunately, a large number of algorithms easily fit this rigid pattern

Programmer control

The programmer has no control over
  1. The location of a map or reduce task in terms of nodes in the cluster
  2. The start and end time of a map or a reduce task
  3. The input key/value pairs processed by a specific map task
  4. The intermediate key/value pairs processed by a specific reduce task
  60. Programmer control (2): The programmer does have control over (1) the data structures to be used as keys and values; (2) initialization code at the beginning of map/reduce tasks and termination code at the end; (3) preservation of state across multiple invocations of the map/reduce function within a task; (4) the sort order of intermediate keys and, in turn, the order in which a reducer encounters keys; (5) the partitioning of the key space and, in turn, the set of keys that a particular reducer encounters.
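Control points (4) and (5) can be illustrated with a minimal in-memory sketch. It is not any framework's API: `shuffle`, `partition_by_first_letter`, and the sample keys are hypothetical names chosen for illustration, and the sketch only mimics what a real shuffle does across machines.

```python
from collections import defaultdict

def partition_by_first_letter(key, num_reducers):
    """Hypothetical custom partitioner: route a key by its first letter."""
    return (ord(key[0].lower()) - ord("a")) % num_reducers

def shuffle(pairs, num_reducers, partitioner, sort_key):
    """Group intermediate pairs per reducer, then order each reducer's keys."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partitioner(key, num_reducers)][key].append(value)
    # The sort key decides the order in which a reducer encounters its keys.
    return [sorted(b.items(), key=lambda kv: sort_key(kv[0])) for b in buckets]

pairs = [("banana", 1), ("apple", 1), ("cherry", 1), ("avocado", 1), ("apple", 1)]
# Assumed sort order for this example: shorter keys first, then alphabetical.
reducer_inputs = shuffle(pairs, 2, partition_by_first_letter,
                         sort_key=lambda k: (len(k), k))
for reducer_id, grouped in enumerate(reducer_inputs):
    print(reducer_id, grouped)
```

Swapping in a different partitioner or sort key changes which reducer sees which keys, and in what order, without touching the map or reduce logic.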
  64. Multi-job algorithms: Many algorithms cannot easily be expressed as a single MapReduce job. Complex algorithms need to be decomposed into a sequence of jobs, in which the output of one job becomes the input to the next. Most iterative algorithms need to be run by an external driver program that performs the convergence check.
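A driver loop of this kind might look like the following sketch. The algorithm (propagating the minimum label through a small graph to find connected components), the `EDGES` data, and the `run_job` helper are all assumptions made for illustration; `run_job` is a toy in-memory stand-in for submitting a job to a cluster.

```python
from collections import defaultdict

# Hypothetical undirected graph as adjacency lists: two components.
EDGES = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}

def run_job(records, mapper, reducer):
    """Toy stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [pair for key in sorted(groups) for pair in reducer(key, groups[key])]

def propagate_map(node, label):
    yield node, label                 # keep the node's own label
    for neighbour in EDGES[node]:
        yield neighbour, label        # offer it to every neighbour

def min_reduce(node, labels):
    yield node, min(labels)           # adopt the smallest label seen

def driver(labels, max_iterations=10):
    """External driver: chain jobs until the output stops changing."""
    for iteration in range(1, max_iterations + 1):
        new_labels = run_job(labels, propagate_map, min_reduce)
        if new_labels == labels:      # convergence check
            return new_labels, iteration
        labels = new_labels           # output of one job feeds the next
    return labels, max_iterations

final_labels, iterations = driver([(n, n) for n in sorted(EDGES)])
print(final_labels, iterations)
```

The convergence check lives outside the jobs themselves, which is why a separate driver is needed: no single map or reduce task sees the whole output.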
  68. Local aggregation: Network and disk latencies are expensive compared to other operations, so decreasing the amount of data transferred over the network during the shuffle phase improves efficiency. Aggressive use of combiners for algorithms whose reduce operation is commutative and associative can greatly reduce intermediate data. Another strategy, dubbed “in-mapper combining”, can decrease not only the amount of intermediate data but also the number of key/value pairs emitted by the map tasks.
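The difference between the two strategies can be sketched for word count as follows. The class and function names are hypothetical, and the `setup`/`map`/`cleanup` methods only imitate the task lifecycle hooks a framework would call; this is a single-process illustration, not a framework implementation.

```python
from collections import Counter

def naive_map(document):
    """Plain mapper: one (word, 1) pair per token."""
    return [(word, 1) for word in document.split()]

def combine(pairs):
    """Combiner: sum the map output locally before it is shuffled."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

class InMapperCombiningMapper:
    """In-mapper combining: keep state across map() calls, emit at the end."""
    def setup(self):
        self.counts = Counter()

    def map(self, document):
        self.counts.update(document.split())   # nothing is emitted here

    def cleanup(self):
        return sorted(self.counts.items())     # one pair per distinct word

split = ["a b a", "b a c"]                      # documents in one input split

naive = [pair for doc in split for pair in naive_map(doc)]
combined = combine(naive)

mapper = InMapperCombiningMapper()
mapper.setup()
for doc in split:
    mapper.map(doc)
in_mapper = mapper.cleanup()

print(len(naive), len(combined), len(in_mapper))
```

With a combiner the mapper still materializes every `(word, 1)` pair before they are summed; with in-mapper combining those pairs are never created, at the cost of holding the counts in memory for the lifetime of the task.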
  69. Outline: 1. The Anatomy of a MapReduce Application; 2. MapReduce Design Patterns; 3. Common MapReduce Application Types.
  73. Counting and Summing. Problem: A number of documents, each with a set of terms. We need to calculate the number of occurrences of each term (word count) or some arbitrary function over the terms (for example, average response time in log files). Solution: Map: for each term, emit the term and “1”. Reduce: take the sum (or apply any other operation) over the values of each term.
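As a rough sketch of this pattern, the following uses a toy in-memory `run_job` helper (an assumption, not a framework API) to run both variants mentioned above: a word count and a per-URL mean response time. The log line format `"<url> <milliseconds>"` is invented for the example.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Toy stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def wordcount_map(document):
    for term in document.split():
        yield term, 1                      # emit the term and "1"

def sum_reduce(term, counts):
    return sum(counts)

def latency_map(log_line):
    url, millis = log_line.split()         # assumed format: "<url> <ms>"
    yield url, float(millis)

def mean_reduce(url, latencies):
    return sum(latencies) / len(latencies)

word_counts = run_job(["to be or not to be", "to do"], wordcount_map, sum_reduce)
mean_latency = run_job(["/home 100", "/home 300", "/about 50"],
                       latency_map, mean_reduce)
print(word_counts)
print(mean_latency)
```

Note that a sum can be reused directly as a combiner, whereas a mean cannot: averaging partial averages gives the wrong answer unless counts are carried along with the sums.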
  78. Collating. Problem: A set of items (for example, the terms in a number of documents) and some function of one item. We need to group all items that have the same function value, either to store them together or to perform some computation over them. Solution: Map: for each item, compute the given function and emit the function value as the key and the item as the value. Reduce: either save all grouped items or perform further computation on them. Example: an inverted index, in which the map emits each word as the key and the ID of the document containing it as the value, so that the reducer receives every document ID for a given word.
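A minimal sketch of collating, assuming the usual inverted-index layout in which the word is the grouping key and the document ID is the grouped value. The `run_job` helper and the sample documents are hypothetical and run in a single process.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Toy stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def index_map(document):
    doc_id, text = document
    for word in set(text.split()):     # one posting per word per document
        yield word, doc_id

def postings_reduce(word, doc_ids):
    return sorted(doc_ids)             # store the grouped items together

documents = [("d1", "big data"), ("d2", "big cluster"), ("d3", "data cluster data")]
inverted_index = run_job(documents, index_map, postings_reduce)
print(inverted_index)
```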
  83. Filtering, Parsing, and Validation. Problem: A set of records. We need to collect all records that meet some condition, or transform each record into another representation. Solution: Map: for each record, emit it if it passes the condition, or emit its transformed version. Reduce: identity. Example: text parsing, or a transformation such as word capitalization.
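The pattern can be sketched as below, with validation, parsing, and a capitalization transform all done in the mapper. The `"name, age"` record format, the helper names, and the toy `run_job` runner are assumptions for illustration only.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Toy stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key in sorted(groups) for pair in reducer(key, groups[key])]

def validate_map(record):
    """Emit only well-formed "name, age" records, parsed and capitalized."""
    parts = record.split(",")
    if len(parts) == 2 and parts[1].strip().isdigit():
        yield parts[0].strip().upper(), int(parts[1])
    # Malformed records are silently dropped (the filter).

def identity_reduce(key, values):
    for value in values:
        yield key, value               # pass everything through unchanged

raw_records = ["alice, 30", "bob, n/a", "carol,41", "broken"]
clean_records = run_job(raw_records, validate_map, identity_reduce)
print(clean_records)
```

All the work happens in the map function; the reduce step contributes nothing beyond the grouping and ordering the framework performs anyway.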
  88. Distributed Task Execution. Problem: A large computational problem. We need to divide it into multiple parts and combine the results from all parts to obtain a final result. Solution: Map: perform the corresponding computation on each part. Reduce: combine all emitted results into a final one. Example: RGB histogram calculation for bitmap images.
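The RGB histogram example might be sketched like this, with each image treated as one part of the problem. Representing an image as a list of `(r, g, b)` tuples is an assumption made to keep the example self-contained; a real job would decode bitmap files.

```python
from collections import Counter

def histogram_map(image):
    """Map: compute a partial histogram for one image."""
    partial = Counter()
    for r, g, b in image:
        partial[("R", r)] += 1
        partial[("G", g)] += 1
        partial[("B", b)] += 1
    return partial

def histogram_reduce(partials):
    """Reduce: merge the partial histograms into the final one."""
    total = Counter()
    for partial in partials:
        total.update(partial)          # Counter.update adds counts
    return total

# Two tiny hypothetical images of two pixels each.
images = [
    [(255, 0, 0), (255, 0, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
histogram = histogram_reduce(histogram_map(image) for image in images)
print(histogram[("R", 255)], histogram[("G", 0)], histogram[("B", 255)])
```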
  93. Sorting. Problem: A set of records. We need to sort the records in some order. Solution: Map: identity. Reduce: identity. The framework's sort of intermediate keys does the actual work. It is also possible to sort by value, by performing a secondary sort, which is typically implemented with a value-to-key conversion.
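Both variants can be sketched with a toy runner in which a plain in-memory sort stands in for the framework's shuffle-and-sort phase. The helper names and sample records are hypothetical; moving the value into a composite key is the technique Lin and Dyer call value-to-key conversion.

```python
def run_sort_job(records, mapper):
    """Toy job: map, let the 'framework' sort by key, identity reduce."""
    intermediate = [pair for record in records for pair in mapper(record)]
    intermediate.sort(key=lambda kv: kv[0])    # stands in for shuffle/sort
    return intermediate                        # identity reduce

def identity_map(record):
    yield record                               # (key, value) unchanged

def value_to_key_map(record):
    key, value = record
    yield (key, value), None                   # composite key carries the value

records = [("s1", 3), ("s2", 1), ("s1", 1), ("s1", 2)]

by_key = run_sort_job(records, identity_map)
by_key_then_value = [k for k, _ in run_sort_job(records, value_to_key_map)]
print(by_key)
print(by_key_then_value)
```

With the identity mapper, values arrive in arbitrary order within each key; with the composite key, the sort itself orders them. In a real cluster the partitioner must then route on the original key alone so that all values for a key still reach the same reducer.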
  94. References: [1] Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers. [2] MapReduce Patterns, Algorithms, and Use Cases: http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/