Mastering Hadoop Map Reduce was a presentation I gave to Orlando Data Science on April 23, 2015. The presentation gives an overview of how Hadoop MapReduce works, then moves on to two more advanced topics: optimizing runtime performance and implementing custom data types.
The examples are written in Python and Java, and the presentation walks through building a MapReduce program that counts n-grams using a custom data type.
You can get the full source code for the examples on my GitHub: http://www.github.com/scottcrespo/ngrams
9. Generating N-Grams
N-Gram: A contiguous sequence of n elements taken from a longer sequence.
Trigrams (n = 3) of “The quick brown fox jumps over the lazy dog”:
(the quick brown), (quick brown fox), (brown fox jumps),
(fox jumps over), (jumps over the), (over the lazy), (the lazy dog)
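As a quick illustration of the definition, here is a minimal Python sketch that slides a window of n words across a sentence. The function name `ngrams` and the choice to lower-case the text are assumptions made for this example; they are not taken from the presentation's source code.

```python
def ngrams(text, n=3):
    """Return every run of n consecutive words in text, in order."""
    words = text.lower().split()
    # zip over n staggered views of the word list; zip stops at the shortest view,
    # so inputs with fewer than n words produce an empty list
    return list(zip(*(words[i:] for i in range(n))))


sentence = "The quick brown fox jumps over the lazy dog"
trigrams = ngrams(sentence)
for gram in trigrams:
    print(" ".join(gram))
```

A sentence of 9 words yields 9 - 3 + 1 = 7 trigrams, and a sentence shorter than n words yields none.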
10. Solution Design
NGramCounter {
    NGramMapper {
        map() {
            // Tokenize and sanitize inputs
            // Create NGram
            // Output (NGram ngram, Int count)
        }
    }
    NGramCombiner {
        combine() {
            // Sum local NGram counts that share the same key
            // Output (NGram ngram, Int count)
        }
    }
    NGramReducer {
        reduce() {
            // Sum NGram counts that share the same key
            // Output (NGram ngram, Int count)
        }
    }
}
Note: NGram is a custom type!
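To make the mapper, combiner, and reducer roles concrete, here is a rough single-process simulation in Python. It is only a sketch under stated assumptions: the two input splits are made up, plain tuples stand in for the NGram type, and the names `map_split` and `sum_counts` are hypothetical. It does not use Hadoop at all.

```python
from collections import Counter


def map_split(lines, n=3):
    """Mapper: emit one (ngram, 1) pair for each n-gram found in each line."""
    pairs = []
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - n + 1):
            pairs.append((tuple(words[i:i + n]), 1))
    return pairs


def sum_counts(pairs):
    """Combiner and reducer share this logic: sum the counts for each key."""
    totals = Counter()
    for ngram, count in pairs:
        totals[ngram] += count
    return list(totals.items())


# Two made-up input splits, each handled by its own mapper.
splits = [
    ["the quick brown fox", "the quick brown dog"],
    ["the quick brown cat"],
]

mapped = [map_split(split) for split in splits]
combined = [sum_counts(pairs) for pairs in mapped]        # local, one per mapper
shuffled = [pair for part in combined for pair in part]   # what crosses the network
result = dict(sum_counts(shuffled))                       # reducer

print(sum(len(pairs) for pairs in mapped), "records before combining")
print(len(shuffled), "records after combining")
print(result[("the", "quick", "brown")])
```

The combiner does not change the final counts. It only shrinks the number of records that have to be sent to the reducer, which is why summing is a good fit: it gives the same answer whether it is applied once or in stages.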
13. Prototype
import sys

def test_mapper():
    lines = ["the quick brown fox jumped over the lazy dog", "the quick brown"]
    for line in lines:
        words = line.split()
        length = len(words)
        sys.stdout.write("\nLength of %d \n-------------------\n" % length)
        i = 0
        # slide a three-word window across the line
        while i + 2 < length:
            first = words[i]
            second = words[i + 1]
            third = words[i + 2]
            trigram = "%s %s %s \n" % (first, second, third)
            sys.stdout.write(trigram)
            i += 1
14. Output
Length of 9
-------------------
the quick brown
quick brown fox
brown fox jumped
fox jumped over
jumped over the
over the lazy
the lazy dog
Length of 3
-------------------
the quick brown
16. Custom KeyTypes
Must implement Hadoop's WritableComparable interface
Writable: The key can be serialized and transmitted across a network
Comparable: The key can be compared to other keys and combined/sorted for the reduce phase
Methods to implement or override: write(), readFields(), compareTo(), hashCode(), toString(), equals()
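The contract behind those methods can be illustrated outside Hadoop. The Python class below is only an analogy and not the Hadoop API: the length-prefixed byte layout is an assumption chosen for this sketch, and `read_fields` is written as a class method, whereas Hadoop's readFields() fills in an existing object.

```python
import io
import struct
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Trigram:
    # frozen=True generates __hash__; order=True generates field-by-field comparison
    first: str
    second: str
    third: str

    def write(self, out):
        """Serialize as three length-prefixed UTF-8 strings."""
        for word in (self.first, self.second, self.third):
            data = word.encode("utf-8")
            out.write(struct.pack(">I", len(data)))
            out.write(data)

    @classmethod
    def read_fields(cls, src):
        """Rebuild a Trigram from bytes produced by write()."""
        words = []
        for _ in range(3):
            (size,) = struct.unpack(">I", src.read(4))
            words.append(src.read(size).decode("utf-8"))
        return cls(*words)

    def __str__(self):
        return "%s %s %s" % (self.first, self.second, self.third)


original = Trigram("the", "quick", "brown")
buffer = io.BytesIO()
original.write(buffer)
buffer.seek(0)
copy = Trigram.read_fields(buffer)
print(copy, copy == original, hash(copy) == hash(original))
```

The point of the round trip is that a key which has been serialized, shipped, and deserialized must still compare equal to, hash the same as, and sort the same as the original. Otherwise counts for the same trigram could end up in different groups.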
17. Trigram.java
public class Trigram implements WritableComparable<Trigram> {
    …
    public int compareTo(Trigram other) {
        int compared = first.compareTo(other.first);
        if (compared != 0) {
            return compared;
        }
        compared = second.compareTo(other.second);
        if (compared != 0) {
            return compared;
        }
        return third.compareTo(other.third);
    }
    public int hashCode() {
        return first.hashCode()*163 + second.hashCode() + third.hashCode();
    }
}
19. TrigramMapper
public static class TrigramMapper
        extends Mapper<Object, Text, Trigram, IntWritable> {
    …
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().toLowerCase();  // lower-case the input line
        line = line.replaceAll("[^a-z\\s]", "");       // strip everything except letters and whitespace
        String[] words = line.trim().split("\\s+");    // split on runs of whitespace
        int len = words.length;                        // need the length for our loop condition
        // the loop condition already skips lines with fewer than three words
        for (int i = 0; i + 2 < len; i++) {
            first.set(words[i]);
            second.set(words[i+1]);
            third.set(words[i+2]);
            trigram.set(first, second, third);
            context.write(trigram, one);
        }
    }
}
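The sanitizing step is easy to get wrong, so it helps to try it on a messy line before running a job. Here is a small Python sketch of the same lower-case, strip, and split logic. The function names `tokenize` and `map_line` and the sample sentence are made up for this example.

```python
import re


def tokenize(line):
    """Lowercase a line, drop everything but a-z and whitespace, split into words."""
    cleaned = re.sub(r"[^a-z\s]", "", line.lower())
    # split() with no argument also collapses repeated whitespace,
    # so stripped punctuation does not leave empty tokens behind
    return cleaned.split()


def map_line(line):
    """Yield (trigram, 1) for every three consecutive words in the line."""
    words = tokenize(line)
    for i in range(len(words) - 2):
        yield (words[i], words[i + 1], words[i + 2]), 1


pairs = list(map_line("The quick, brown fox -- jumps!"))
for trigram, count in pairs:
    print(" ".join(trigram), count)
```

Lines with fewer than three words fall out naturally, because the range is empty and nothing is emitted.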
20. TrigramReducer
public static class TrigramReducer
        extends Reducer<Trigram, IntWritable, Trigram, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Trigram key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // add up every count that arrived for this trigram
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
…
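The reducer receives each key together with all of its values because the framework sorts and groups the mapper output first. That step can be imitated in a few lines of Python. This is a sketch of the idea only, with made-up input pairs and hypothetical function names, and it says nothing about how Hadoop implements the shuffle internally.

```python
from itertools import groupby
from operator import itemgetter


def reduce_counts(key, values):
    """Reducer: sum all counts that arrived for one key."""
    return key, sum(values)


def shuffle_and_reduce(pairs):
    """Sort pairs by key, group equal keys, and call the reducer once per group."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield reduce_counts(key, (count for _, count in group))


pairs = [
    (("the", "quick", "brown"), 2),
    (("quick", "brown", "fox"), 1),
    (("the", "quick", "brown"), 1),
]
totals = list(shuffle_and_reduce(pairs))
for trigram, total in totals:
    print(" ".join(trigram), total)
```

Sorting is what makes grouping possible here, which mirrors why a custom key type needs a working compareTo().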