MapReduce入門

MapReduce入門

2011-04-08 社内勉強会

MapReduceとは

並列分散処理用のフレームワークです。mapとreduceという処理を
組み合わせて処理を行う点が特徴です。

map処理

入力ファイルの各行からKeyとValueの組み合わせを作る処理です。
例えば、ファイルの中にある単語数を数えるような処理(wordcount)
の場合、各行にある単語毎にKeyとValueの組み合わせを作ること
になります。

reduce処理

map処理で作られたKeyとValueの組み合わせから別のKeyとValue
の組み合わせを作る処理です。なお、reduceの入力は自動的に
MapReduceによって自動的にKey毎にValueがまとめられた状態に
なっています。

wordcount：map処理

wordcountであるため、keyが単語、valueが「1」となります。

wordcount：reduce処理

reduceの入力時にkey（単語）ごとにvalue「1」がまとめられます。そ
してreduceにおいて「1」を足して出現回数が求められます。

wordcountのソース(1) : map処理

public static class TokenizerMapper extends
Mapper<Object, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}

wordcountのソース(2) : reduce処理

public static class IntSumReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

wordcountのソース(3) : main処理

public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: wordcount <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}

wordcountのソース(4) : Driver

public class ExampleDriver {
public static void main(String argv[]){
int exitCode = -1;
ProgramDriver pgd = new ProgramDriver();
try {
pgd.addClass("wordcount", WordCount.class,
"A map/reduce program that counts the words in the input files.");
pgd.driver(argv);
// Success
exitCode = 0;
}
catch(Throwable e){
e.printStackTrace();
}
System.exit(exitCode);
}
}

wordcount（1）

【内容】
ファイル中の単語数をカウントするMapReduceジョブです。Hadoopに
付属しているサンプルプログラムです。以下のようにして実行しま
す。

【コマンド】
$ hadoop jar /usr/src/hadoop-0.20.1+133/hadoop-0.20.1+133-
examples.jar wordcount hdfs_readme wordcount

【構文】
$ hadoop jar <jarファイルのpath> <実行するジョブ>
<入力ファイル...> <出力ディレクトリ>

wordcount（2）

【内容】
wordcountの処理結果の確認をします。ホームディレクトリに
wordcountというディレクトリが作成されていることが分かります。

【コマンド】
$ hadoop fs -ls
【結果】
Found 2 items
-rw-r--r-- 1 training supergroup 538 2010-12-13 09:09 /user/training/hdfs_readme
drwxr-xr-x - training supergroup 0 2010-12-15 06:16 /user/training/wordcount

wordcount（3）

【内容】
wordcountディレクトリの中に処理結果のファイル（part-r-00000）が
格納されていることを確認します。

【コマンド】
$ hadoop fs -ls wordcount
【結果】
Found 2 items
drwxr-xr-x - training supergroup 0 2010-12-15 06:15 /user/training/wordcount/_logs
-rw-r--r-- 1 training supergroup 582 2010-12-15 06:15 /user/training/wordcount/part-r-
00000

wordcount（4）

【内容】
処理結果のファイル（part-r-00000）の中身を見てみます。

【コマンド】
$ hadoop fs -cat wordcount/p* | less
【結果】
To 2
You 1
a 1
access 1
all 1
and 3

MapReduceがやってくれること

分散処理の制御
複数台のコンピューターの制御（タスクの割り当て）
タスクを割り当てたコンピューターに障害が発生した場合に
別のコンピューターに割り当てて再実行
入力ファイルの分割
各mapに処理対象となる入力ファイルを割り当てる
mapで処理した結果をreduceに渡す
その際にmapの出力結果についてkey単位でvalueをまとめる

その他の機能

不良レコードのスキップ
カウンター
ジョブスケジューラー
Hadoopストリーミング
スクリプト言語でmapおよびreduce処理を実装できる。
Hadoop Pipes
C++でmapおよびreduce処理を実装できる。

MapReduce入門

More Related Content

What's hot

Similar to MapReduce入門

More from Satoshi Noto

MapReduce入門