Scala on Hadoop

Hadoop Conference: Scala on Hadoop
Shinji Tanaka (Hatena)
stanaka@hatena.ne.jp
http://d.hatena.ne.jp/stanaka/
http://twitter.com/stanaka/

Agenda

Self-introduction

Hadoop at Hatena #1

Hadoop at Hatena #2

Hadoop system at Hatena (current state)
Diagram labels: Hadoop MapReduce, HDFS, Reverse Proxy, job submission, Hatena Fotolife, Hatena Graph
Logs are accumulated hourly: /logs/$service/$year/$month/$date/$host_access-$hour.log
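The hourly log layout above can be sketched as a small path builder. This is only an illustration: the service name, host name, and the zero-padding of month, date, and hour are assumptions, not something the slide specifies.

```scala
object LogPath {
  // Builds the HDFS path of one hourly access log, following the layout
  // /logs/$service/$year/$month/$date/$host_access-$hour.log from the slide.
  // Zero-padding of the numeric fields is an assumption.
  def hourlyLog(service: String, host: String,
                year: Int, month: Int, date: Int, hour: Int): String =
    f"/logs/$service/$year%04d/$month%02d/$date%02d/${host}_access-$hour%02d.log"

  def main(args: Array[String]): Unit =
    // "fotolife" and "proxy01" are hypothetical example values.
    println(hourlyLog("fotolife", "proxy01", 2009, 11, 13, 9))
}
```

A layout like this lets a job select its input by service and date simply by choosing a directory.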

Hadoop

Hadoop Streaming

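map.pl
#!/usr/bin/env perl
use strict;
use warnings;

while (<>) {
    chomp;
    # Split the log line on whitespace; field 8 is used as the key.
    my @segments = split /\s+/;
    # Emit "key<TAB>1", the line format Hadoop Streaming expects.
    printf "%s\t%s\n", $segments[8], 1;
}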

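reduce.pl
#!/usr/bin/env perl
use strict;
use warnings;

my %count;
while (<>) {
    chomp;
    # Each input line is "key<TAB>value" as written by map.pl.
    my ($key, $value) = split /\t/;
    # Sum the emitted values per key.
    $count{$key} += $value;
}
while (my ($key, $value) = each %count) {
    printf "%s\t%s\n", $key, $value;
}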

Running
% hadoop jar $HADOOP_DIR/contrib/hadoop-*-streaming.jar \
    -input httpd_logs \
    -output analog_out \
    -mapper /home/user/work/analog/map.pl \
    -reducer /home/user/work/analog/reduce.pl
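What the streaming job computes can be reproduced locally in a few lines of Scala. The sketch below mirrors map.pl and reduce.pl on in-memory data; the sample log lines are hypothetical (the real log format is not shown on the slides), and skipping short lines is an added assumption.

```scala
object StreamingSketch {
  // Hypothetical access-log lines, used only for illustration.
  val sampleLines: Seq[String] = Seq(
    """192.0.2.1 - - [13/Nov/2009:09:00:00 +0900] "GET / HTTP/1.1" 200 1234""",
    """192.0.2.2 - - [13/Nov/2009:09:00:01 +0900] "GET /a HTTP/1.1" 200 512""",
    """192.0.2.3 - - [13/Nov/2009:09:00:02 +0900] "GET /b HTTP/1.1" 404 0"""
  )

  // Map step: emit (field 8, 1) for each whitespace-separated line, as map.pl does.
  // Lines with fewer than 9 fields are skipped (an assumption).
  def mapLine(line: String): Option[(String, Int)] = {
    val segments = line.trim.split("\\s+")
    if (segments.length > 8) Some((segments(8), 1)) else None
  }

  // Reduce step: sum the values per key, as reduce.pl does.
  def reduce(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2).sum }

  // Full pipeline: map every line, then reduce the emitted pairs.
  def run(lines: Seq[String]): Map[String, Int] =
    reduce(lines.flatMap(line => mapLine(line).toList))

  def main(args: Array[String]): Unit =
    run(sampleLines).toSeq.sorted
      .foreach { case (key, count) => println(s"$key\t$count") }
}
```

With the sample lines, field 8 is the HTTP status, so the result counts two requests with status 200 and one with 404.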

Job definition
- name: latency
  mapper:
    class: LogAnalyzer::Mapper
    options:
      filters:
        isbot: 0
      conditions:
        - key: Top
          filters:
            uri: '^$'
          value: $response
  reducer:
    class: Reducer::Distribution
  input:
    class: LogAnalyzer::Input
    options:
      service: ugomemo
      period: 1
  output:
    class: Output::Gnuplot
    options:
      title: "Ugomemo Latency $date"
      xlabel: "Response time (msec)"
      ylabel: "Rates of requests (%)"
      fotolife_folder: ugomemo

Limits of Hadoop Streaming

Scala
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

Quick sort in Scala
def qsort[T <% Ordered[T]](list: List[T]): List[T] = list match {
  case Nil => Nil
  case pivot :: tail =>
    qsort(tail.filter(_ < pivot)) ::: pivot :: qsort(tail.filter(_ >= pivot))
}

scala> qsort(List(2,1,3))
res1: List[Int] = List(1, 2, 3)

WordCount by Java
public class WordCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    …

WordCount by Scala
// Note: this code relies on implicit conversions that the slide does not show
// (Int <-> IntWritable, String <-> Text, java.util.Iterator -> Iterator).
object WordCount {
  class MyMap extends Mapper[LongWritable, Text, Text, IntWritable] {
    val one = 1

    override def map(ky: LongWritable, value: Text,
                     output: Mapper[LongWritable, Text, Text, IntWritable]#Context) = {
      (value split " ") foreach (output write (_, one))
    }
  }

  class MyReduce extends Reducer[Text, IntWritable, Text, IntWritable] {
    override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                        output: Reducer[Text, IntWritable, Text, IntWritable]#Context) = {
      val iter: Iterator[IntWritable] = values.iterator()
      val sum = iter reduceLeft ((a: Int, b: Int) => a + b)
      output write (key, sum)
    }
  }

  def main(args: Array[String]) = {
    …

Java vs Scala

Java:
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);
  }
}

Scala:
override def map(ky: LongWritable, value: Text,
                 output: Mapper[LongWritable, Text, Text, IntWritable]#Context) = {
  (value split " ") foreach (output write (_, one))
}

Scala on Hadoop

mapper
class MyMap extends Mapper[LongWritable, Text, Text, IntWritable] {
  val one = 1

  override def map(ky: LongWritable, value: Text,
                   output: Mapper[LongWritable, Text, Text, IntWritable]#Context) = {
    (value split " ") foreach (output write (_, one))
  }
}

reducer
class MyReduce extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      output: Reducer[Text, IntWritable, Text, IntWritable]#Context) = {
    val iter: Iterator[IntWritable] = values.iterator()
    val sum = iter reduceLeft ((a: Int, b: Int) => a + b)
    output write (key, sum)
  }
}

main
def main(args: Array[String]) = {
  val conf = new Configuration()
  val otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs()
  val job = new Job(conf, "word count")
  job setJarByClass(WordCount getClass())
  job setMapperClass(classOf[WordCount.MyMap])
  job setCombinerClass(classOf[WordCount.MyReduce])
  job setReducerClass(classOf[WordCount.MyReduce])
  job setMapOutputKeyClass(classOf[Text])
  job setMapOutputValueClass(classOf[IntWritable])
  job setOutputKeyClass(classOf[Text])
  job setOutputValueClass(classOf[IntWritable])
  FileInputFormat addInputPath(job, new Path(otherArgs(0)))
  FileOutputFormat setOutputPath(job, new Path(otherArgs(1)))
  System exit(job waitForCompletion(true) match { case true => 0 case false => 1 })
}
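To see what the mapper and reducer compute without a cluster, the same logic can be written against plain Scala collections. This is a minimal sketch, not the Hadoop job itself: `LocalWordCount` and its method names are hypothetical, the grouping step stands in for Hadoop's shuffle, and dropping empty tokens is an added assumption.

```scala
object LocalWordCount {
  // Map phase: one (word, 1) pair per space-separated token, as in MyMap.
  // Empty tokens are dropped here (an assumption, not in the original).
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split(" ").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Reduce phase: sum the counts of one key with reduceLeft, as in MyReduce.
  def reducePhase(key: String, values: Seq[Int]): (String, Int) =
    (key, values.reduceLeft(_ + _))

  // Shuffle stand-in: group the mapped pairs by key, then reduce each group.
  def run(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(mapPhase)
      .groupBy(_._1)
      .map { case (key, pairs) => reducePhase(key, pairs.map(_._2)) }

  def main(args: Array[String]): Unit =
    run(Seq("a b a", "b c")).toSeq.sorted.foreach(println)
}
```

Because summation is associative, the same reduce function can serve as the combiner, which is why the job above registers MyReduce for both roles.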

HDFS operations
import java.net.URI
import org.apache.hadoop.fs._
import org.apache.hadoop.hdfs._
import org.apache.hadoop.conf.Configuration

object Hdfs {
  def main(args: Array[String]) = {
    val conf = new Configuration()
    val uri = new URI("hdfs://hadoop01:9000/")
    val fs = new DistributedFileSystem
    fs.initialize(uri, conf)
    // Print the modification time of the path given as the first argument.
    val status = fs.getFileStatus(new Path(args(0)))
    println(status.getModificationTime)
  }
}

Build method
mvn org.apache.maven.plugins:maven-archetype-plugin:2.0-alpha-4:create \
    -DarchetypeGroupId=org.scala-tools.archetypes \
    -DarchetypeArtifactId=scala-archetype-simple \
    -DarchetypeVersion=1.2 \
    -DremoteRepositories=http://scala-tools.org/repo-releases \
    -DgroupId=com.hatena.hadoop \
    -DartifactId=hadoop

Describing dependencies
<dependency>
  <groupId>commons-logging</groupId>
  <artifactId>commons-logging</artifactId>
  <version>1.0.4</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>commons-cli</groupId>
  <artifactId>commons-cli</artifactId>
  <version>1.0</version>
  <scope>provided</scope>
</dependency>

mvn install:install-file \
    -DgroupId=org.apache.hadoop \
    -DartifactId=hadoop-core \
    -Dversion=0.20.1 \
    -Dpackaging=jar \
    -Dfile=/opt/hadoop/hadoop-0.20.1-core.jar

Build, packaging, and execution
$HADOOP_HOME/bin/hadoop jar ../maven/hadoop/target/hadoop-1.0-SNAPSHOT.jar \
    com.hatena.hadoop.Hadoop \
    -D mapred.job.tracker=local \
    -D fs.default.name=file:/// \
    input output

mvn scala:compile
mvn package
mvn clean

Measuring response time #1

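Measuring response time #2
Mapper: filter by conditions such as URL; record the response time
Reducer: compute the distribution of response times
Post-processing: plot the graph (gnuplot); upload to Fotolife (AtomAPI)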
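The mapper and reducer roles described above can be sketched in plain Scala. This is an illustration under stated assumptions, not the production job: the `Request` record, the bucket width, and the function names are hypothetical, and the percentages correspond to the "Rates of requests (%)" axis of the job definition.

```scala
object LatencyDistribution {
  // Hypothetical record: a request URI and its response time in msec.
  final case class Request(uri: String, responseMsec: Int)

  // Mapper side: keep requests whose URI fully matches the pattern and emit
  // the response time rounded down to a bucket (the bucket width is an assumption).
  def mapRequest(req: Request, uriPattern: String, bucketMsec: Int): Option[Int] =
    if (req.uri.matches(uriPattern)) Some(req.responseMsec / bucketMsec * bucketMsec)
    else None

  // Reducer side: convert bucket hits into the percentage of requests per bucket,
  // sorted by bucket so the result can be plotted directly.
  def distribution(buckets: Seq[Int]): Seq[(Int, Double)] = {
    val total = buckets.size.toDouble
    buckets.groupBy(identity).toSeq
      .map { case (bucket, hits) => (bucket, hits.size * 100.0 / total) }
      .sortBy(_._1)
  }

  // Full pipeline over in-memory requests.
  def run(requests: Seq[Request], uriPattern: String, bucketMsec: Int): Seq[(Int, Double)] =
    distribution(requests.flatMap(r => mapRequest(r, uriPattern, bucketMsec).toList))
}
```

For example, filtering on the pattern `/` with 100 msec buckets keeps only top-page requests and reports what share of them fell into each 100 msec band.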

Response time distribution graph

Example of good response times

Effect of caching

Summary
