Topic 6: MapReduce Applications

Published at: Cloud Computing Workshop 2013, ITU

6: MapReduce Applications
Zubair Nabi (zubair.nabi@itu.edu.pk)
April 18, 2013

Outline

1 The Anatomy of a MapReduce Application
2 MapReduce Design Patterns
3 Common MapReduce Application Types

MapReduce job phases

A MapReduce job can be divided into four phases:
1 Input split: the input dataset is sliced into M splits, one per map task
2 Map logic: the user-supplied map function is invoked
  - In tandem, a sort phase ensures that map output is locally sorted by key
  - In addition, the key space is partitioned amongst the reducers
3 Shuffle: map output is relayed to all reduce tasks
4 Reduce logic: the user-provided reduce function is invoked
  - Before the reduce function is applied, the input keys are merged to obtain globally sorted key/value pairs

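The four phases above can be sketched as a toy, single-process simulation. This is purely illustrative (the names `run_job`, `map_fn`, and `reduce_fn` are ours, not Hadoop's API), but it shows how split, map, partition, shuffle/sort, and reduce fit together:

```python
from collections import defaultdict

def run_job(splits, map_fn, reduce_fn, num_reducers=2):
    """Toy single-process sketch of the four MapReduce phases:
    input split, map (with partitioning), shuffle/sort, and reduce."""
    # Phase 2: apply the map function to every record of every split,
    # routing each output pair to a partition (one per reducer)
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in split:
            for k, v in map_fn(key, value):
                partitions[hash(k) % num_reducers][k].append(v)
    # Phases 3-4: each "reducer" receives its partition, sorts the keys,
    # and applies the reduce function to each key's list of values
    output = {}
    for part in partitions:
        for k in sorted(part):
            output[k] = reduce_fn(k, part[k])
    return output
```

With a word-count map (`lambda k, v: [(w, 1) for w in v.split()]`) and a summing reduce, this reproduces the word-count result end to end.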
Of mappers and reducers

- In the common case, programmers only need to write a map and a reduce function
- The user-provided map function is invoked for every line (this granularity can be modified) in the input file and is passed the line number as key and the line contents as value
- The user-provided reduce function is invoked for each key output by the map phase and is passed the set of associated values as an iterable

Wordcount: High-level view

- Input: a text corpus such as a Wikipedia dump, books from Project Gutenberg, etc.
- The map function is invoked once for each text line
- Map output: words as keys and 1 as values
- Reduce input: key/value pairs of words and their values (1s)
- The reduce function is invoked once for each word with a list of 1s
- Reduce output: words and their final counts

Wordcount: Low-level view

- A new process, called MapRunner, is created for each map task
- MapRunner has a RecordReader instance that is used to read the input file
- RecordReader reads the input file in chunks and parses the chunks into lines
- MapRunner also has a Mapper instance with a map function, WordCountMapper in this case
- For each line parsed by RecordReader, MapRunner calls WordCountMapper.map() and passes it the line

Wordcount: Low-level view (2)

- WordCountMapper has an OutputCollector instance which maintains an in-memory buffer for each output partition (one partition per reduce task)
- Each time WordCountMapper.map() is invoked, it tokenizes the line into words
- For each word, it writes the word as key and 1 as value to the OutputCollector
- The OutputCollector uses the Partitioner instance to select a partition buffer for each key
- Whenever the size of a partition buffer exceeds a configurable threshold, its contents are first sorted by key and then flushed to disk
- This process is repeated until the map logic has been applied to all lines within the input file

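The partition-selection step above is typically just a hash of the key modulo the number of reducers. A minimal Python sketch (the class and method names are illustrative, not Hadoop's actual `Partitioner` API):

```python
class HashPartitioner:
    """Assigns each key to one of num_reducers partition buffers by
    hashing the key, mirroring the common default partitioning scheme."""
    def __init__(self, num_reducers):
        self.num_reducers = num_reducers

    def get_partition(self, key):
        # Same key always lands in the same bucket in [0, num_reducers)
        return hash(key) % self.num_reducers
```

Because every occurrence of a key hashes to the same bucket, all values for one word end up at a single reducer.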
Wordcount: Low-level view (3)

- Once all maps have completed their execution, the reduce phase is started
- For each reduce task, a ReduceRunner process is created
- Each reduce task fetches its input partitions from the machines on which the map tasks were run
- All input partitions are then merged to obtain a globally sorted partition of key/value pairs

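Because each fetched partition is already locally sorted, the merge step is a k-way merge of sorted runs. A sketch using Python's standard library (`merge_partitions` is our name for the idea, not a framework function):

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_partitions(partitions):
    """Merge locally sorted runs of (key, value) pairs into one globally
    sorted stream, then group the values by key for the reduce function."""
    merged = heapq.merge(*partitions, key=itemgetter(0))  # k-way merge
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [v for _, v in group]
```

The reducer can then iterate over this stream, seeing each key exactly once with all of its values.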
Wordcount: Low-level view (4)

- ReduceRunner contains a Reducer instance with a reduce function, WordCountReducer in this case
- For each word, ReduceRunner invokes WordCountReducer.reduce() and passes it the word and a list of its values (1s)
- WordCountReducer also has an OutputCollector instance with an in-memory buffer
- WordCountReducer.reduce() sums the list of values it is passed and writes the word and its final count to the OutputCollector
- This process is repeated until the reduce logic has been applied to all key/value pairs
- At the end of the job, each reduce task produces an output file with words and their number of occurrences

Wordcount map in Java

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Tokenize the line and emit (word, 1) for every token
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }

Wordcount reduce in Java

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      // Sum the 1s for this word and emit the final count
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }

Wordcount map in Python

    def map(self, key, value):
        # Emit (word, 1) for every word in the line
        for word in value.split(' '):
            self._output_collector.collect(word, 1)

Wordcount reduce in Python

    def reduce(self, key, values):
        # Sum the 1s for this word and emit the final count
        total = 0
        for value in values:
            total += value
        self._output_collector.collect(key, total)

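The Python map and reduce methods on the last two slides assume a surrounding framework that supplies `self._output_collector` and drives the calls. A minimal stand-in (the `OutputCollector`, `WordCount`, and `run_wordcount` names here are hypothetical, not the actual framework API) shows how the pieces fit together:

```python
from collections import defaultdict

class OutputCollector:
    """Stand-in for the framework collector: just buffers (key, value) pairs."""
    def __init__(self):
        self.pairs = []
    def collect(self, key, value):
        self.pairs.append((key, value))

class WordCount:
    """Hosts map and reduce methods in the style shown on the slides."""
    def __init__(self):
        self._output_collector = OutputCollector()
    def map(self, key, value):
        for word in value.split(' '):
            self._output_collector.collect(word, 1)
    def reduce(self, key, values):
        self._output_collector.collect(key, sum(values))

def run_wordcount(lines):
    # Map phase: one call per line, keyed by line number
    mapper = WordCount()
    for lineno, line in enumerate(lines):
        mapper.map(lineno, line)
    # Group map output by key (the shuffle, in miniature)
    grouped = defaultdict(list)
    for word, one in mapper._output_collector.pairs:
        grouped[word].append(one)
    # Reduce phase: one call per distinct word, in sorted key order
    reducer = WordCount()
    for word in sorted(grouped):
        reducer.reduce(word, grouped[word])
    return dict(reducer._output_collector.pairs)
```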
Outline

1 The Anatomy of a MapReduce Application
2 MapReduce Design Patterns
3 Common MapReduce Application Types

Bird's-eye view

- The MapReduce paradigm is amenable to divide-and-conquer algorithms
- One way to look at MapReduce is as a large-scale sorting platform
- User logic is only involved at specific hook points
- Algorithms must be expressed in terms of a small number of specific components that fit together in preset ways
- It is like a jigsaw puzzle in which all the other pieces have already been assembled and you only need to add two: the map piece and the reduce piece
- Fortunately, a large number of algorithms fit this rigid pattern easily

Programmer control

The programmer has no control over:
1 The location of a map or reduce task in terms of nodes in the cluster
2 The start and end time of a map or reduce task
3 The input key/value pairs processed by a specific map task
4 The intermediate key/value pairs processed by a specific reduce task

Programmer control (2)

The programmer does have control over:
1 The data structures to be used as keys and values
2 Initialization code at the beginning of map/reduce tasks and termination code at the end
3 Preservation of state across multiple invocations of map/reduce tasks
4 The sort order of intermediate keys and, in turn, the order in which a reducer encounters keys
5 The partitioning of the key space and, in turn, the set of keys that a particular reducer encounters

Multi-job algorithms

- Many algorithms cannot easily be expressed as a single MapReduce job
- Complex algorithms need to be decomposed into a sequence of jobs
- The output of one job becomes the input to the next
- Most iterative algorithms need to be run by an external driver program that performs the convergence check

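Such an external driver is usually a simple loop: submit a job, compare its output to the previous iteration, stop when the change is small enough. A sketch (the function names and the convergence criterion are illustrative):

```python
def iterate_until_converged(initial, run_job, converged, max_iters=50):
    """External driver for iterative MapReduce algorithms: each job's
    output becomes the next job's input, with a convergence check
    between jobs. run_job and converged are caller-supplied callables."""
    state = initial
    for i in range(max_iters):
        new_state = run_job(state)        # one full MapReduce job
        if converged(state, new_state):   # e.g. ranks changed < epsilon
            return new_state, i + 1
        state = new_state
    return state, max_iters
```

PageRank is the classic example: each iteration is one job, and the driver stops when the rank vector changes by less than a threshold.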
Local aggregation

- Network and disk latencies are expensive compared to other operations
- Decreasing the amount of data transferred over the network during the shuffle phase improves efficiency
- Aggressive use of combiners for commutative and associative algorithms can greatly reduce intermediate data
- Another strategy, dubbed "in-mapper combining", can decrease not only the amount of intermediate data but also the number of key/value pairs emitted by the map tasks

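In-mapper combining can be sketched for word count as follows: instead of emitting one (word, 1) pair per token, the mapper accumulates counts in a task-local dictionary and emits each distinct word once when the task finishes (the class and `close()` hook names here are illustrative):

```python
from collections import Counter

class InMapperCombiningMapper:
    """Word-count mapper with in-mapper combining: counts accumulate in
    a per-task dictionary instead of one (word, 1) pair per token."""
    def __init__(self):
        self.counts = Counter()

    def map(self, key, value):
        self.counts.update(value.split())   # no pairs emitted per token

    def close(self):
        # Emit one (word, partial_count) pair per distinct word seen
        return list(self.counts.items())
```

This trades a little mapper memory for far fewer emitted pairs, which is exactly the shuffle-volume reduction the slide describes; it works because addition is commutative and associative.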
Outline

1 The Anatomy of a MapReduce Application
2 MapReduce Design Patterns
3 Common MapReduce Application Types

Counting and Summing

1 Problem
  - A number of documents, each with a set of terms
  - Need to calculate the number of occurrences of each term (word count) or some arbitrary function over the terms (e.g. average response time in log files)
2 Solution
  - Map: for each term, emit the term and "1"
  - Reduce: take the sum (or apply any other operation) over each term's values

Collating

1 Problem
  - A set of items and some function of one item
  - Need to group all items that have the same function value, to either store them together or perform some computation over them
2 Solution
  - Map: for each item, compute the given function and emit the function value as key and the item as value
  - Reduce: either save all grouped items or perform further computation
Example: inverted index, where items are words and the function is the document ID

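The inverted-index example above can be sketched directly: the map emits (word, doc_id) pairs, and the reduce collects each word's posting list (the function names and in-memory grouping are ours, standing in for the framework's shuffle):

```python
from collections import defaultdict

def invert_map(doc_id, text):
    """Map: emit (word, doc_id) for every word in the document."""
    return [(word, doc_id) for word in text.split()]

def invert_reduce(word, doc_ids):
    """Reduce: produce the sorted, de-duplicated posting list for a word."""
    return word, sorted(set(doc_ids))

def build_inverted_index(docs):
    # Group map output by word (the shuffle, in miniature)
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in invert_map(doc_id, text):
            grouped[word].append(d)
    return dict(invert_reduce(w, ds) for w, ds in grouped.items())
```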
Filtering, Parsing, and Validation

1 Problem
  - A set of records
  - Need to collect all records that meet some condition, or to transform each record into another representation
2 Solution
  - Map: for each record, emit it if it passes the condition, or emit its transformed version
  - Reduce: identity
Example: text parsing or transformation, such as word capitalization

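Since the reduce is the identity here, the whole pattern is effectively map-only; a one-function sketch (names and the example predicate/transform are illustrative):

```python
def filter_and_transform(records, predicate, transform):
    """Map-only filtering/transformation: emit each record that passes
    predicate, after applying transform; the reduce is the identity."""
    return [transform(r) for r in records if predicate(r)]
```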
Distributed Task Execution

1 Problem
  - A large computational problem
  - Need to divide it into multiple parts and combine the results from all parts to obtain a final result
2 Solution
  - Map: perform the corresponding computation on each part
  - Reduce: combine all emitted results into a final one
Example: RGB histogram calculation over bitmap images

Sorting

1 Problem
  - A set of records
  - Need to sort the records in some order
2 Solution
  - Map: identity
  - Reduce: identity
It is also possible to sort by value: either perform a secondary sort or a key-to-value conversion

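The key-to-value conversion mentioned above can be sketched in a few lines: the map swaps each pair so the value becomes the key, the framework's sort phase orders by that new key, and the reduce swaps back (the plain `list.sort()` here stands in for the framework's sort):

```python
def sort_by_value(records):
    """Sort (key, value) records by value via key-to-value conversion:
    map swaps each pair, the sort phase orders by the new key, and the
    reduce swaps the pairs back."""
    swapped = [(v, k) for k, v in records]   # map: swap key and value
    swapped.sort()                           # stands in for the sort phase
    return [(k, v) for v, k in swapped]      # reduce: swap back
```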
References

1 Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers.
2 MapReduce Patterns, Algorithms, and Use Cases: http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/