MapReduce by JRuby and DSL
Hadoop Papyrus

2010/8/28
JRubyKaigi 2010
藤川幸一 FUJIKAWA Koichi @fujibee
What’s Hadoop?
• A framework for parallel distributed processing of BIG data
• An OSS clone of Google MapReduce
• For over-terabyte-scale data processing
  – Reading 400 TB (web-scale data) from a standard HDD at 50 MB/s would
    take over 2,000 hours (400 TB ÷ 50 MB/s ≈ 8,000,000 s ≈ 2,200 hours)
  – A distributed file system and a parallel processing framework are needed!
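Conceptually, MapReduce splits a job into a map phase that emits key/value pairs, a shuffle that groups the pairs by key, and a reduce phase that aggregates each group. A minimal single-process sketch of that model in plain Ruby (an illustration of the model only, not Hadoop itself):

lines = ["foo bar", "bar baz", "foo"]

# Map phase: emit a (word, 1) pair for every token.
pairs = lines.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle phase: group the pairs by key.
groups = pairs.group_by { |word, _| word }

# Reduce phase: sum the counts for each key.
counts = groups.map { |word, kvs| [word, kvs.map { |_, n| n }.reduce(:+)] }

counts.each { |word, n| puts "#{word}\t#{n}" }   # foo 2, bar 2, baz 1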
Hadoop Papyrus
• My own OSS project
  – Hosted on GitHub: http://github.com/fujibee/hadoop-papyrus
• A framework for running Hadoop jobs from a (J)Ruby DSL description
  – Hadoop jobs are originally written in Java
  – Just a few lines of Ruby do the same as the very complex procedure required in Java!
• Supported by the IPA MITOH 2009 project (government funding)
• Can run from a Hudson (CI tool) plug-in
Step.1
Not Java, but we can write it in Ruby!
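Because JRuby runs on the JVM, Hadoop’s Java classes are directly usable from Ruby. A minimal sketch of a word-count map step written in Ruby (the imported class names are Hadoop’s real API; the wiring that hands `context` to this code from a running Hadoop task is omitted here):

require 'java'

# Hadoop's Java classes import directly into JRuby.
java_import 'org.apache.hadoop.io.Text'
java_import 'org.apache.hadoop.io.IntWritable'

# A word-count map step in Ruby: emit (token, 1) for every token.
# `context` stands for the Mapper.Context a Hadoop task would pass in.
def map(key, value, context)
  value.to_s.split.each do |token|
    context.write(Text.new(token), IntWritable.new(1))
  end
end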
Step.2
Simple description by DSL in Ruby

[Diagram: Map / Reduce / Job Description layers of the Log Analysis DSL]
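How can a plain Ruby script read like a job description? The standard technique (a generic sketch of Ruby internal DSLs, not Papyrus’s actual implementation) is to instance_eval the script against a description object whose methods record the settings:

# Each DSL keyword is just a method that records state on the object
# the job script is evaluated against.
class JobDescription
  attr_reader :input, :output, :topics

  def initialize(&block)
    @topics = []
    instance_eval(&block)
  end

  def from(path); @input  = path; end
  def to(path);   @output = path; end

  def topic(name, &body)
    @topics << name   # a real framework would also keep the body block
  end
end

job = JobDescription.new do
  from 'test/in'
  to   'test/out'
  topic 'link num'
end

puts job.input            # => test/in
puts job.topics.inspect   # => ["link num"]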
Step.3
Set up the Hadoop server environment easily with Hudson
About 70 lines are needed in Java; Hadoop Papyrus needs only 10!

The Java version:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper extends
      Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends
      Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The Hadoop Papyrus version (the entire job script):

dsl 'LogAnalysis'

from 'test/in'
to 'test/out'

pattern /\[\[([^|\]:]+)[^\]:]*\]\]/

column_name :link

topic "link num", :label => 'n' do
  count_uniq column[:link]
end
Hadoop Papyrus Details
• Invokes the Ruby script via JRuby inside the Java process that runs each
  Map/Reduce task
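In other words, the Java-side Hadoop task embeds a JRuby runtime, evaluates the user’s Ruby code once, and then calls back into it for every record. A sketch of the shape such a task-side Ruby script could take (the method names and the emit callback are hypothetical illustrations, not Papyrus’s actual API):

# Hypothetical shape of a script that a Java-hosted JRuby runtime
# evaluates once per task and then calls per record (illustration only).

def map(key, value, emit)
  # Called by the Java mapper for each input record.
  value.to_s.split.each { |token| emit.call(token, 1) }
end

def reduce(key, values, emit)
  # Called by the Java reducer for each key group.
  emit.call(key, values.inject(0) { |sum, v| sum + v })
end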
Hadoop Papyrus Details (cont’d)
• In addition, you write only the DSL script for the processing you want
  (log analysis, etc.). Papyrus chooses the right processing for each phase
  (map, reduce, or job initialization), so a single script is all you need.
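One way such per-phase dispatch can work (a generic sketch under my own naming, not Papyrus’s actual implementation): the framework evaluates the same script in every phase, and each DSL method checks which phase is currently running.

# The same script is evaluated in each phase; each DSL method decides
# what to do based on the current phase (illustration only).
class Runner
  def initialize(phase)
    @phase = phase    # :initialize, :map, or :reduce
    @emitted = []
  end

  def count_uniq(value)
    case @phase
    when :map    then @emitted << [value, 1]     # map side: emit (value, 1)
    when :reduce then @emitted << [value, :sum]  # reduce side: mark for summing
    end                                          # :initialize: nothing to emit
  end

  def run(script)
    instance_eval(script)
    @emitted
  end
end

script = "count_uniq 'link'"
p Runner.new(:map).run(script)     # => [["link", 1]]
p Runner.new(:reduce).run(script)  # => [["link", :sum]]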
Thank you very much! (ありがとうございました!)

Twitter ID: @fujibee