JRubyKaigi2010 Hadoop Papyrus



  1. 1. MapReduce with JRuby and a DSL: Hadoop Papyrus. 2010/8/28, JRubyKaigi 2010. 藤川幸一 (FUJIKAWA Koichi), @fujibee
  2. 2. What's Hadoop? • A framework for parallel distributed processing of big data • Open-source clone of Google MapReduce • For processing data at terabyte scale and beyond – Reading 400 TB (web-scale data) from a single standard HDD at 50 MB/s takes over 2,000 hours – So we need a distributed file system and a parallel processing framework!
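The read-time figure on slide 2 can be checked with a short calculation. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and, for the parallel case, an idealized 100 disks with no coordination overhead; the disk count is an illustrative assumption, not a number from the slides.

```ruby
# Back-of-the-envelope read time, using decimal units (an assumption).
TB = 10**12  # bytes per terabyte
MB = 10**6   # bytes per megabyte

# Hours needed to read total_bytes sequentially at bytes_per_second.
def read_hours(total_bytes, bytes_per_second)
  total_bytes.to_f / bytes_per_second / 3600
end

single_disk_hours = read_hours(400 * TB, 50 * MB)

# Idealized: 100 disks (hypothetical count) each read 1/100 of the data in parallel.
hundred_disk_hours = single_disk_hours / 100

puts format("1 disk:    %.0f hours", single_disk_hours)
puts format("100 disks: %.1f hours", hundred_disk_hours)
```

A single disk needs roughly 2,222 hours (about three months), which is the motivation for spreading both the data and the reads across many machines.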
  3. 3. Hadoop Papyrus • My own OSS project – Hosted on GitHub: http://github.com/fujibee/hadoop-papyrus • A framework for running Hadoop jobs described in a (J)Ruby DSL – Hadoop jobs are normally written in Java – A few lines of Ruby do the same work as a very complex procedure in Java! • Supported by the IPA MITOH 2009 project (government support) • Can be run from a Hudson (CI tool) plug-in
  4. 4. Step 1: Not Java. We can write in Ruby!
  5. 5. Step 2: Simple description with a DSL in Ruby • Map Reduce Job Description • Log Analysis DSL
  6. 6. Step 3: Set up the Hadoop server environment easily with Hudson
  7. 7. In Java: 70 lines are needed. Hadoop Papyrus needs only 10 lines!

     Java:

         package org.apache.hadoop.examples;

         import java.io.IOException;
         import java.util.StringTokenizer;

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.Path;
         import org.apache.hadoop.io.IntWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapreduce.Job;
         import org.apache.hadoop.mapreduce.Mapper;
         import org.apache.hadoop.mapreduce.Reducer;
         import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
         import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
         import org.apache.hadoop.util.GenericOptionsParser;

         public class WordCount {

           public static class TokenizerMapper extends
               Mapper<Object, Text, Text, IntWritable> {

             private final static IntWritable one = new IntWritable(1);
             private Text word = new Text();

             public void map(Object key, Text value, Context context)
                 throws IOException, InterruptedException {
               StringTokenizer itr = new StringTokenizer(value.toString());
               while (itr.hasMoreTokens()) {
                 word.set(itr.nextToken());
                 context.write(word, one);
               }
             }
           }

           public static class IntSumReducer extends
               Reducer<Text, IntWritable, Text, IntWritable> {
             private IntWritable result = new IntWritable();

             public void reduce(Text key, Iterable<IntWritable> values, Context context)
                 throws IOException, InterruptedException {
               int sum = 0;
               for (IntWritable val : values) {
                 sum += val.get();
               }
               result.set(sum);
               context.write(key, result);
             }
           }

           public static void main(String[] args) throws Exception {
             Configuration conf = new Configuration();
             String[] otherArgs = new GenericOptionsParser(conf, args)
                 .getRemainingArgs();
             if (otherArgs.length != 2) {
               System.err.println("Usage: wordcount <in> <out>");
               System.exit(2);
             }
             Job job = new Job(conf, "word count");
             job.setJarByClass(WordCount.class);
             job.setMapperClass(TokenizerMapper.class);
             job.setCombinerClass(IntSumReducer.class);
             job.setReducerClass(IntSumReducer.class);
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
             FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
             System.exit(job.waitForCompletion(true) ? 0 : 1);
           }
         }

     Hadoop Papyrus:

         dsl 'LogAnalysis'

         from 'test/in'
         to 'test/out'

         pattern /\[\[([^|\]:]+)[^\]:]*\]\]/
         column_name :link

         topic "link num", :label => 'n' do
           count_uniq column[:link]
         end
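To make the log-analysis DSL on slide 7 more concrete, the sketch below emulates in plain Ruby what such a job might compute. It is not Papyrus code: the helper names `map_line` and `reduce_counts` and the sample lines are hypothetical, and it assumes that `count_uniq` tallies occurrences per distinct value, which the slides do not spell out. Only the regular expression is taken from the slide.

```ruby
# Pattern from the slide: captures the target of a [[wiki link]], stopping at
# "|" (display text); links containing ":" (e.g. namespaces) do not match.
LINK_PATTERN = /\[\[([^|\]:]+)[^\]:]*\]\]/

# "Map" step (hypothetical helper): emit [link, 1] for every match in a line.
def map_line(line)
  line.scan(LINK_PATTERN).map { |(link)| [link, 1] }
end

# "Reduce" step (hypothetical helper): sum the emitted counts per link.
def reduce_counts(pairs)
  pairs.each_with_object(Hash.new(0)) { |(link, n), totals| totals[link] += n }
end

# Made-up sample input standing in for the files under test/in.
lines = [
  "See [[Hadoop]] and [[MapReduce|the MapReduce model]].",
  "[[Hadoop]] runs on Java; [[Category:Software]] is skipped."
]

link_counts = reduce_counts(lines.flat_map { |line| map_line(line) })
p link_counts
```

With this sample input, `Hadoop` is counted twice, `MapReduce` once, and the `Category:Software` link is ignored because of the colon.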
  8. 8. Hadoop Papyrus Details • The Map/Reduce processes running on Java invoke the Ruby script through JRuby
  9. 9. Hadoop Papyrus Details (cont'd) • In addition, we write what we want to process (log analysis, etc.) as a DSL script. Papyrus selects different processing for each phase (job initialization, Map, or Reduce), so a single script is all we need.
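The idea of one script serving every phase can be illustrated with a toy evaluator. This is a minimal sketch under stated assumptions, not how Papyrus is implemented: `PhaseRunner` and the `count_words` keyword are hypothetical, and only `from` and `to` echo keywords seen on the slides. The same script text is evaluated once per phase, and each keyword acts only in the phase it belongs to.

```ruby
# Hypothetical mini-DSL evaluator (not the Papyrus implementation).
class PhaseRunner
  attr_reader :config, :emitted

  def initialize(phase, input = nil)
    @phase = phase    # :setup, :map or :reduce
    @input = input    # a text line for :map, [key, values] for :reduce
    @config = {}      # job settings collected during :setup
    @emitted = []     # key/value pairs produced during :map or :reduce
  end

  # Input path; only meaningful during job initialization.
  def from(path)
    @config[:input] = path if @phase == :setup
  end

  # Output path; only meaningful during job initialization.
  def to(path)
    @config[:output] = path if @phase == :setup
  end

  # Hypothetical keyword: emits [word, 1] when mapping, sums when reducing.
  def count_words
    case @phase
    when :map
      @input.split.each { |word| @emitted << [word, 1] }
    when :reduce
      key, values = @input
      @emitted << [key, values.sum]
    end
  end

  # Evaluate the DSL text with this object as the receiver of its keywords.
  def run(script)
    instance_eval(script)
    self
  end
end

SCRIPT = <<~DSL
  from 'test/in'
  to 'test/out'
  count_words
DSL

setup   = PhaseRunner.new(:setup).run(SCRIPT)
mapped  = PhaseRunner.new(:map, "to be or not to be").run(SCRIPT)
reduced = PhaseRunner.new(:reduce, ["to", [1, 1]]).run(SCRIPT)

p setup.config
p mapped.emitted
p reduced.emitted
```

In the setup phase only the paths are recorded, in the map phase only word pairs are emitted, and in the reduce phase only the sum is produced, even though all three runs evaluate the identical script.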
  10. 10. Thank you! Twitter ID: @fujibee