JRubyKaigi2010 Hadoop Papyrus

Transcript

  • 1. Hadoop Papyrus: MapReduce by JRuby and DSL. 2010/8/28, JRubyKaigi 2010. FUJIKAWA Koichi (藤川幸一), @fujibee
  • 2. What's Hadoop?
    • A framework for parallel, distributed processing of big data
    • An OSS clone of Google MapReduce
    • Built for data processing beyond the terabyte scale
      – Reading 400 TB of web-scale data from a single standard HDD at 50 MB/s would take over 2,000 hours (400 TB ÷ 50 MB/s ≈ 8,000,000 s)
      – So a distributed file system and a parallel processing framework are needed! (A plain-Ruby sketch of the MapReduce model follows.)
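To make the MapReduce model concrete, here is a minimal plain-Ruby sketch of word counting with explicit map, shuffle, and reduce phases. It is purely illustrative and uses no Hadoop API:

    # Plain-Ruby illustration of the MapReduce model (no Hadoop API).
    lines = ['hello hadoop', 'hello jruby']

    # Map phase: each input line emits (word, 1) pairs.
    pairs = lines.flat_map { |line| line.split.map { |word| [word, 1] } }

    # Shuffle phase: group the pairs by word.
    grouped = pairs.group_by { |word, _| word }

    # Reduce phase: sum the counts for each word.
    counts = grouped.map do |word, ps|
      [word, ps.inject(0) { |sum, (_, n)| sum + n }]
    end

    counts.each { |word, n| puts "#{word}\t#{n}" }
    # => hello 2, hadoop 1, jruby 1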
  • 3. Hadoop Papyrus
    • My own OSS project
      – Hosted on GitHub: http://github.com/fujibee/hadoop-papyrus
    • A framework for writing Hadoop jobs as (J)Ruby DSL descriptions
      – Hadoop jobs are normally written in Java
      – A procedure that is very complex in Java takes just a few lines in Ruby!
    • Supported by the IPA MITOH 2009 project (Japanese government support)
    • Can be run through a Hudson (CI tool) plug-in
  • 4. Step 1: Not Java; we can write Hadoop jobs in Ruby! (A JRuby sketch follows.)
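Because JRuby runs on the JVM, a mapper can be written directly against Hadoop's Java API. The following is a minimal sketch assuming the Hadoop jars are on the JRuby classpath; the class names are Hadoop's, but the wiring shown here is illustrative and not taken from the slides:

    # JRuby sketch: a Hadoop mapper written directly against the Java API.
    # Assumes the Hadoop jars are on the JRuby classpath.
    require 'java'

    java_import 'org.apache.hadoop.io.IntWritable'
    java_import 'org.apache.hadoop.io.Text'
    java_import 'org.apache.hadoop.mapreduce.Mapper'

    class TokenizerMapper < Mapper
      ONE = IntWritable.new(1)

      # Same signature as the Java map(Object, Text, Context) method.
      def map(key, value, context)
        word = Text.new
        value.to_s.split.each do |token|
          word.set(token)
          context.write(word, ONE)
        end
      end
    end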
  • 5. Step 2: Simple descriptions via a Ruby DSL: a MapReduce job description and a log-analysis DSL (the full log-analysis example appears on slide 7; a word-count flavor of the DSL is sketched below).
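For flavor, here is a hypothetical word count in a Papyrus-like style. The dsl, from, and to constructs mirror the real example on slide 7, but map, reduce, and emit as written here are illustrative guesses, not the verified Papyrus API:

    # Hypothetical Papyrus-style word count (illustrative only; the
    # map/reduce/emit names are not verified against the real API).
    dsl 'MapRed'

    from 'test/in'
    to   'test/out'

    map do |key, value|
      value.split.each { |word| emit(word => 1) }
    end

    reduce do |key, values|
      emit(key => values.inject(0) { |sum, v| sum + v })
    end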
  • 6. Step 3: Set up the Hadoop server environment easily with Hudson.
  • 7. In Java, about 70 lines are needed; Hadoop Papyrus needs only about 10!

    Java:

    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {

      public static class TokenizerMapper extends
          Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      public static class IntSumReducer extends
          Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

    Hadoop Papyrus:

    dsl 'LogAnalysis'

    from 'test/in'
    to 'test/out'

    pattern /\[\[([^|\]:]+)[^\]:]*\]\]/
    column_name :link

    topic 'link num', :label => 'n' do
      count_uniq column[:link]
    end
  • 8. Hadoop Papyrus details • The Ruby script is invoked via JRuby inside the Map/Reduce processes running on Java.
  • 9. Hadoop Papyrus details (cont'd) • In addition, you can write a DSL script for whatever processing you want (log analysis, etc.). Papyrus selects a different part of the script for each phase (map, reduce, or job initialization), so only one script is needed (a dispatch sketch follows).
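A minimal sketch of how such single-script phase dispatch could work. The PhaseContext class and block_for helper are hypothetical illustrations of the idea, not the actual Papyrus internals:

    # Hypothetical sketch of single-script phase dispatch (not actual
    # Papyrus internals): the same DSL script is evaluated against a
    # per-phase context, so only the parts for the current phase apply.
    class PhaseContext
      attr_reader :result

      def initialize(phase)
        @phase = phase
        @result = nil
      end

      # Setup-only directives: no-ops outside job initialization.
      def from(path) @input  = path if @phase == :setup end
      def to(path)   @output = path if @phase == :setup end

      # Phase blocks: captured only when the current phase matches.
      def map(&block)    @result = block if @phase == :map    end
      def reduce(&block) @result = block if @phase == :reduce end
    end

    # Evaluate the same script text once per phase; each evaluation
    # only picks up the parts relevant to that phase.
    def block_for(script, phase)
      ctx = PhaseContext.new(phase)
      ctx.instance_eval(script)
      ctx.result
    end

    script = <<-DSL
      from 'test/in'
      to 'test/out'
      map    { |k, v| v.split }
      reduce { |k, vs| vs.size }
    DSL

    map_block = block_for(script, :map)  # only the map block is captured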
  • 10. Thank you very much! Twitter ID: @fujibee