Solving real world problems with Hadoop



  1. Solving Real World Problems with Hadoop (SQL -> Hadoop). Masahji Stewart <masahji@synctree.com>. April 5, 2011.
  2. Solving Real World Problems with Hadoop
  3. Word Count (Input)

     MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster ...
  4. Word Count (Output)

     The same input as the previous slide, reduced to per-word counts (word, count):

       (nodes),      1    MapReduce     1    a             3
       as            1    certain       1    cluster       1
       collectively  1    computers     1    datasets      1
       distributable 1    for           1    framework     1
       huge          1    is            1    kinds         1
       large         1    number        1    of            2
       on            1    problems      1    processing    1
       referred      1    to            1    using         1
  5. Word Count (Mapper)

     public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

       private final static IntWritable one = new IntWritable(1);
       private Text word = new Text();

       public void map(Object key, Text value, Context context)
           throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) {
           word.set(itr.nextToken());
           context.write(word, one);
         }
       }
     }
  6. Word Count (Mapper), callout (Extract): the tokenizer walks each line, setting word = "MapReduce", word = "is", word = "a", ...
  7. Word Count (Mapper), callout (Emit): context.write(word, one) emits ("MapReduce", 1), ("is", 1), ("a", 1), ...
  8. Word Count (Reducer)

     public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

       private IntWritable result = new IntWritable();

       public void reduce(Text key, Iterable<IntWritable> values, Context context)
           throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
           sum += val.get();
         }
         result.set(sum);
         context.write(key, result);
       }
     }
  9. Word Count (Reducer), callout (Sum): for key = "of" the loop accumulates sum = 2.
  10. Word Count (Reducer), callout (Emit): context.write(key, result) emits ("of", 2).
  11. Word Count (Running)

     $ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount \
         -D mapred.reduce.tasks=3 input_file out
     11/04/03 21:21:27 INFO mapred.JobClient: Default number of map tasks: 2
     11/04/03 21:21:27 INFO mapred.JobClient: Default number of reduce tasks: 3
     11/04/03 21:21:28 INFO input.FileInputFormat: Total input paths to process : 1
     11/04/03 21:21:29 INFO mapred.JobClient: Running job: job_201103252110_0659
     11/04/03 21:21:30 INFO mapred.JobClient: map 0% reduce 0%
     11/04/03 21:21:37 INFO mapred.JobClient: map 100% reduce 0%
     11/04/03 21:21:49 INFO mapred.JobClient: map 100% reduce 33%
     11/04/03 21:21:52 INFO mapred.JobClient: map 100% reduce 66%
     11/04/03 21:22:05 INFO mapred.JobClient: map 100% reduce 100%
     11/04/03 21:22:08 INFO mapred.JobClient: Job complete: job_201103252110_0659
     11/04/03 21:22:08 INFO mapred.JobClient: Counters: 17
     ...
     11/04/03 21:22:08 INFO mapred.JobClient: Map output bytes=286
     11/04/03 21:22:08 INFO mapred.JobClient: Combine input records=27
     11/04/03 21:22:08 INFO mapred.JobClient: Map output records=27
     11/04/03 21:22:08 INFO mapred.JobClient: Reduce input records=24
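     Worth noticing in the counters: the maps emitted 27 records but the reducers received only 24. The example job registers IntSumReducer as a combiner, so repeated words ("a" appearing three times, "of" twice) are pre-summed on the map side before the shuffle.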
  12. Word Count (Output): a file per reducer

     hadoop@ip-10-245-210-191:~$ hadoop fs -ls out
     Found 3 items
     -rw-r--r-- 2 hadoop supergroup 90 2011-04-03 21:21 /user/hadoop/out/part-r-00000
     -rw-r--r-- 2 hadoop supergroup 80 2011-04-03 21:21 /user/hadoop/out/part-r-00001
     -rw-r--r-- 2 hadoop supergroup 49 2011-04-03 21:21 /user/hadoop/out/part-r-00002
     hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00000
     as             1
     certain        1
     collectively   1
     datasets       1
     framework      1
     ...
     hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00001
     MapReduce      1
     cluster        1
     computers      1
     distributable  1
     for            1
     ...
     hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00002
     (nodes),       1
     a              3
     is             1
     large          1
     processing     1
     using          1
  13. Word Count (Output): the same listing as the previous slide, minus the callout.
  14. Word Count (pipeline diagram): Input -> Split -> Map -> Shuffle/Sort -> Reduce -> Output. The input text is split across map tasks; each map emits (word, 1) pairs; the shuffle/sort groups the pairs by word; the reducers sum each group and write the final counts. The driver that ties these stages together is sketched below.
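     The deck runs wordcount straight from the examples JAR and never shows its driver. For completeness, a sketch of what that main looks like, modeled on the LogAggregator driver shown later on slide 20 (treat the class and job names as assumptions):

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
     import org.apache.hadoop.util.GenericOptionsParser;

     public class WordCount {
       // TokenizerMapper and IntSumReducer from the previous slides go here.

       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         String[] otherArgs =
             new GenericOptionsParser(conf, args).getRemainingArgs();
         if (otherArgs.length != 2) {
           System.err.println("Usage: wordcount <in> <out>");
           System.exit(2);
         }
         Job job = new Job(conf, "word count");
         job.setJarByClass(WordCount.class);
         job.setMapperClass(TokenizerMapper.class);
         // The reducer doubles as a combiner: counts are pre-summed on the
         // map side, which is what the Combine counters on slide 11 showed.
         job.setCombinerClass(IntSumReducer.class);
         job.setReducerClass(IntSumReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
         FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
     }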
  15. Log Processing (Date IP COUNT): Input

     67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
     189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
     90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
     90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
     ...
  16. Log Processing (Date IP COUNT): the same input, plus the desired output

     18/Jul/2010    189.186.9.181    1
     18/Jul/2010    201.201.16.82    3
     18/Jul/2010    66.195.114.59    1
     18/Jul/2010    67.195.114.59    1
     18/Jul/2010    90.221.175.16    1
     19/Jul/2010    90.221.75.196    1
     ...
  17. Log Processing (Mapper)

     public static final Pattern LOG_PATTERN = Pattern.compile(
         "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+-]\\d{4})\\] "
         + "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

     public static class ExtractDateAndIpMapper
         extends Mapper<Object, Text, Text, IntWritable> {

       private final static IntWritable one = new IntWritable(1);
       private Text ip = new Text();

       public void map(Object key, Text value, Context context)
           throws IOException {
         String text = value.toString();
         Matcher matcher = LOG_PATTERN.matcher(text);
         while (matcher.find()) {
           try {
             // key = "<date>\t<ip>", e.g. "18/Jul/2010\t189.186.9.181"
             ip.set(matcher.group(5) + "\t" + matcher.group(1));
             context.write(ip, one);
           } catch (InterruptedException ex) {
             throw new IOException(ex);
           }
         }
       }
     }
  18. Log Processing (Mapper), callout (Extract): each matched line yields ip = "189.186.9.181", ip = "201.201.16.82", ip = "66.249.67.57", ...
  19. Log Processing (Mapper), callout (Emit): context.write emits ("18/Jul/2010\t189.186.9.181", 1), ...
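     LOG_PATTERN is dense enough to deserve a standalone check. A small sketch that runs it against one of the sample lines from slide 15 and prints the date and IP the mapper keys on (the class name here is ours):

     import java.util.regex.Matcher;
     import java.util.regex.Pattern;

     public class LogPatternCheck {
       public static void main(String[] args) {
         Pattern logPattern = Pattern.compile(
             "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+-]\\d{4})\\] "
             + "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");
         String line = "90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "
             + "\"GET /music HTTP/1.1\" 200 30151 "
             + "\"http://www.wearelistening.org\" \"Mozilla/5.0\"";
         Matcher matcher = logPattern.matcher(line);
         if (matcher.find()) {
           // group(5) is the date, group(1) the client IP:
           // prints "18/Jul/2010<TAB>90.221.175.16"
           System.out.println(matcher.group(5) + "\t" + matcher.group(1));
         }
       }
     }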
  20. Log Processing (main)

     public class LogAggregator {
       ...
       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         String[] otherArgs =
             new GenericOptionsParser(conf, args).getRemainingArgs();
         if (otherArgs.length != 2) {
           System.err.println("Usage: LogAggregator <in> <out>");
           System.exit(2);
         }
         Job job = new Job(conf, "LogAggregator");
         job.setJarByClass(LogAggregator.class);
         job.setMapperClass(ExtractDateAndIpMapper.class);
         job.setCombinerClass(WordCount.IntSumReducer.class);
         job.setReducerClass(WordCount.IntSumReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
         FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
     }
  21. Log Processing (main), callout (Mapper): job.setMapperClass(ExtractDateAndIpMapper.class).
  22. Log Processing (main), callout (Reducer): job.setCombinerClass and job.setReducerClass both reuse WordCount.IntSumReducer.
  23. Log Processing (main), callout (Input/Output settings): the output key/value classes and the FileInputFormat/FileOutputFormat paths.
  24. Log Processing (main), callout (Run it!): job.waitForCompletion(true) submits the job and blocks until it finishes.
  25. Log Processing (Running)

     $ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.LogAggregator \
         -libjars hadoop-examples.jar data/access.log log_results
     11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
     11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1
     11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop-examples.jar in /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes-work--8125788655475885988 with rwxr-xr-x
     11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes/hadoop-examples.jar
     ...
     11/04/04 00:51:32 INFO mapred.JobClient: map 100% reduce 100%
  26. Log Processing (Running), callout: the -libjars JAR is placed into the Distributed Cache (the filecache lines above).
  27. Log Processing (Output)

     $ hadoop fs -ls log_results
     Found 2 items
     -rwxrwxrwx 1 masahji staff   0 2011-04-04 00:51 log_results/_SUCCESS
     -rwxrwxrwx 1 masahji staff 168 2011-04-04 00:51 log_results/part-r-00000
     $ hadoop fs -cat log_results/part-r-00000
     18/Jul/2010    189.186.9.181    1
     18/Jul/2010    201.201.16.82    3
     18/Jul/2010    66.195.114.59    1
     18/Jul/2010    67.195.114.59    1
     18/Jul/2010    90.221.175.16    1
     19/Jul/2010    90.221.75.196    1
     ...
  28. Hadoop Streaming (diagram): the Task Tracker forks the mapper/reducer script as a child process, writing input records to its STDIN and reading key/value output from its STDOUT.
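     Nothing about that contract is script-specific: any executable that reads lines from STDIN and writes tab-separated key/value lines to STDOUT can serve as a streaming mapper or reducer. A minimal word-count mapper in Java, as a sketch of the protocol (the class name is ours):

     import java.io.BufferedReader;
     import java.io.InputStreamReader;

     public class StreamingWordCountMapper {
       public static void main(String[] args) throws Exception {
         BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
         String line;
         while ((line = in.readLine()) != null) {
           // One "key<TAB>value" pair per token, mirroring TokenizerMapper.
           for (String word : line.split("\\s+")) {
             if (!word.isEmpty()) {
               System.out.println(word + "\t1");
             }
           }
         }
       }
     }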
  29. Basic grep: Input

     ...
     [sou1 suo3] /to search/.../internet search/database search/
     [ji2 ri4] /propitious day/lucky day/
     [ji2 xiang2] /lucky/auspicious/propitious/
     [duo1 duo1] /to cluck ones tongue/tut-tut/
     鹊 [xi3 que4] /black-billed magpie, legendary bringer of good luck/
     ...
  30. Basic grep: the same input, plus the expected output

     ...
     汇 [hui4 chu1] /to export data (e.g. from a database)/
     [sou1 suo3] /to search/.../internet search/database search/
     库 [shu4 ju4 ku4] /database/
     库软 [shu4 ju4 ku4 ruan3 jian4] /database software/
     资 库 [zi1 liao4 ku4] /database/
     ...
  31. Basic grep

     $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
         -input data/cedict.txt.gz \
         -output streaming/grep_database_mandarin \
         -mapper 'grep database' \
         -reducer org.apache.hadoop.mapred.lib.IdentityReducer
     ...
     11/04/04 05:27:58 INFO streaming.StreamJob: map 100% reduce 100%
     11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001
     11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin
  32. Basic grep, callout (Scripts or Java Classes): -mapper takes any executable command, while -reducer here names a Java class.
  33. Basic grep (Output)

     $ hadoop fs -cat streaming/grep_database_mandarin/part-00000
     汇 [hui4 chu1] /to remit (money)//to export data (e.g. from a database)/
     [sou1 suo3] /to search/to look for sth/internet search/database search/
     库 [shu4 ju4 ku4] /database/
     库软 [shu4 ju4 ku4 ruan3 jian4] /database software/
     资 库 [zi1 liao4 ku4] /database/
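     Because org.apache.hadoop.mapred.lib.IdentityReducer simply passes its input through, the reduce phase here contributes only the shuffle's sort and merge; it is a convenient way to collect the grep matches from all mappers into ordered part files.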
  34. Ruby Example (ignore ip list): Input

     67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
     192.168.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 96 "-" "Mozilla/4.0"
     189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
     90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
     10.1.10.12 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 51 "-" "Mozilla/5.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     10.1.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 94 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     10.1.10.14 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 24 "-" "Mozilla/4.0"
     66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
     90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
     ...

     Output (the internal 192.168.* and 10.* healthcheck traffic is gone):

     189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
     66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
     67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
     90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
     90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
     ...
  35. Ruby Example (ignore ip list)

     #!/usr/bin/env ruby

     ignore = %w(127.0.0.1 192.168 10)
     log_regex = /^([\d.]+)\s/

     # Read STDIN, write STDOUT
     while (line = STDIN.gets)
       next unless line =~ log_regex
       ip = $1
       # Print the line only if its IP matches none of the ignored prefixes.
       print line if ignore.reject { |ignore_ip|
         ip !~ /^#{ignore_ip}(\.|$)/
       }.empty?
     end
  36. Ruby Example (ignore ip list): the same script, shown without the callouts.
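     A handy property of streaming scripts is that they can be tested without Hadoop at all: a local pipe such as cat data/access.log | ./script/filter_ips exercises exactly the STDIN/STDOUT contract the task tracker will use.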
  37. Ruby Example (ignore ip list)

     $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
         -input data/access.log \
         -output out/streaming/filter_ips \
         -mapper ./script/filter_ips \
         -reducer org.apache.hadoop.mapred.lib.IdentityReducer
     11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
     11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not be found.
     11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1
     11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/...]
     11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001
     11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
     ...
  38. Ruby Example (ignore ip list): Output

     $ hadoop fs -cat out/streaming/filter_ips/part-00000
     ...
     189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
     66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
     67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
     90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
     90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
  39. SQL -> Hadoop
  40. Simple Query

     SELECT first_name, last_name
     FROM people
     WHERE first_name = 'John'
        OR favorite_movie_id = 2
  41. Simple Query: Input (people)

     id   first_name   last_name   favorite_movie_id
     1    John         Mulligan    3
     2    Samir        Ahmed       5
     3    Royce        Rollins     2
     4    John         Smith       2
  42. Simple Query: Output

     first_name   last_name
     John         Mulligan
     Royce        Rollins
     John         Smith

     The mapper and reducer below key on a TextArrayWritable; a sketch of that helper follows.
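     Both SQL-style jobs in this section use TextArrayWritable, a helper class the slides reference but never show. A minimal sketch of what it could look like, assuming it wraps Text values (stock Hadoop only ships ArrayWritable, and anything used as a map output key must also be comparable and hashable, hence compareTo and hashCode):

     import org.apache.hadoop.io.ArrayWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.io.Writable;
     import org.apache.hadoop.io.WritableComparable;

     public class TextArrayWritable extends ArrayWritable
         implements WritableComparable<TextArrayWritable> {

       public TextArrayWritable() { super(Text.class); }

       // Convenience setter used on the slides: wrap plain strings as Text.
       public void set(String[] strings) {
         Text[] texts = new Text[strings.length];
         for (int i = 0; i < strings.length; i++) texts[i] = new Text(strings[i]);
         set(texts);
       }

       // Accessor used by the join reducer on the later slides.
       public Text getTextAt(int index) {
         return (Text) get()[index];
       }

       // Keys are compared element by element, shorter arrays first.
       public int compareTo(TextArrayWritable other) {
         Writable[] a = get(), b = other.get();
         for (int i = 0; i < Math.min(a.length, b.length); i++) {
           int c = ((Text) a[i]).compareTo((Text) b[i]);
           if (c != 0) return c;
         }
         return a.length - b.length;
       }

       // HashPartitioner uses hashCode to route keys to reducers.
       @Override
       public int hashCode() {
         int h = 0;
         for (Writable w : get()) h = h * 31 + w.hashCode();
         return h;
       }
     }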
  43. Simple Query (Mapper)

     public class SimpleQuery {
       ...
       public static class SelectAndFilterMapper
           extends Mapper<Object, Text, TextArrayWritable, Text> {
         ...
         public void map(Object key, Text value, Context context)
             throws IOException {
           String[] row = value.toString().split(DELIMITER);
           try {
             if (row[FIRST_NAME_COLUMN].equals("John") ||
                 row[FAVORITE_MOVIE_ID_COLUMN].equals("2")) {
               columns.set(new String[] {
                 row[FIRST_NAME_COLUMN], row[LAST_NAME_COLUMN]
               });
               context.write(columns, blank);
             }
           } catch (InterruptedException ex) {
             throw new IOException(ex);
           }
         }
       }
       ...
     }
  44. Simple Query (Mapper), callout (Extract): split each input line into columns on DELIMITER.
  45. Simple Query (Mapper), callout (WHERE): the if condition implements WHERE first_name = 'John' OR favorite_movie_id = 2.
  46. Simple Query (Mapper), callout (SELECT): the two-element columns array implements SELECT first_name, last_name.
  47. Simple Query (Mapper), callout (Emit): context.write(columns, blank) ships the projected row to the reduce phase.
  48. Simple Query (Running)

     $ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.SimpleQuery \
         data/people.tsv out/simple_query
     ...
     11/04/04 09:19:15 INFO mapred.JobClient: map 100% reduce 100%
     11/04/04 09:19:15 INFO mapred.JobClient: Job complete: job_local_0001
     11/04/04 09:19:15 INFO mapred.JobClient: Counters: 13
     11/04/04 09:19:15 INFO mapred.JobClient:   FileSystemCounters
     11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_READ=306296
     11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=398676
     11/04/04 09:19:15 INFO mapred.JobClient:   Map-Reduce Framework
     11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input groups=3
     11/04/04 09:19:15 INFO mapred.JobClient:     Combine output records=0
     11/04/04 09:19:15 INFO mapred.JobClient:     Map input records=4
     11/04/04 09:19:15 INFO mapred.JobClient:     Reduce shuffle bytes=0
     11/04/04 09:19:15 INFO mapred.JobClient:     Reduce output records=3
     11/04/04 09:19:15 INFO mapred.JobClient:     Spilled Records=6
     11/04/04 09:19:15 INFO mapred.JobClient:     Map output bytes=54
     11/04/04 09:19:15 INFO mapred.JobClient:     Combine input records=0
     11/04/04 09:19:15 INFO mapred.JobClient:     Map output records=3
     11/04/04 09:19:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
     11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input records=3
     ...
  49. Simple Query (Output)

     $ hadoop fs -cat out/simple_query/part-r-00000
     John     Mulligan
     John     Smith
     Royce    Rollins
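     A side effect worth noting: because the projected columns themselves are the map output key, identical rows would fall into a single reduce group and come out once, so the job behaves like SELECT DISTINCT first_name, last_name (the Reduce input groups=3 counter on the previous slide is that grouping at work).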
  50. Join Query

     SELECT first_name, last_name, movies.name name, movies.image
     FROM people
     JOIN movies ON (people.favorite_movie_id = movies.id)
  51. Join Query: Input

     people:
     id   first_name   last_name   favorite_movie_id
     1    John         Mulligan    3
     2    Samir        Ahmed       5
     3    Royce        Rollins     2
     4    John         Smith       2

     movies:
     id   name         image
     2    The Matrix   http://bit.ly/matrix.jpg
     3    Gatacca      http://bit.ly/g.jpg
     4    AI           http://bit.ly/ai.jpg
     5    Avatar       http://bit.ly/avatar.jpg
  52. Join Query: Output (people joined to movies on favorite_movie_id = movies.id)

     first_name   last_name   name         image
     John         Mulligan    Gatacca      http://bit.ly/g.jpg
     Samir        Ahmed       Avatar       http://bit.ly/avatar.jpg
     Royce        Rollins     The Matrix   http://bit.ly/matrix.jpg
     John         Smith       The Matrix   http://bit.ly/matrix.jpg
  53. Join Query (Mapper)

     public static class SelectAndFilterMapper
         extends Mapper<Object, Text, Text, TextArrayWritable> {
       ...
       public void map(Object key, Text value, Context context)
           throws IOException {
         String[] row = value.toString().split(DELIMITER);
         String fileName =
             ((FileSplit) context.getInputSplit()).getPath().getName();
         try {
           if (fileName.startsWith("people")) {
             columns.set(new String[] {
               "people",
               row[PEOPLE_FIRST_NAME_COLUMN],
               row[PEOPLE_LAST_NAME_COLUMN]
             });
             joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
           } else if (fileName.startsWith("movies")) {
             columns.set(new String[] {
               "movies",
               row[MOVIES_NAME_COLUMN],
               row[MOVIES_IMAGE_COLUMN]
             });
             joinKey.set(row[MOVIES_ID_COLUMN]);
           }
           context.write(joinKey, columns);
         } catch (InterruptedException ex) {
           throw new IOException(ex);
         }
       }
       ...
  54. Join Query (Mapper), callout (Parse): split the line into columns and read the source file name off the input split.
  55. Join Query (Mapper), callout (Classify): tag each record "people" or "movies" and key it by its side of the join condition (favorite_movie_id or id).
  56. Join Query (Mapper), callout (Emit): context.write(joinKey, columns) sends the tagged record to the reducers.
  57. Join Query (Reducer)

     public static class CombineMapsReducer
         extends Reducer<Text, TextArrayWritable, Text, TextArrayWritable> {
       ...
       public void reduce(Text key, Iterable<TextArrayWritable> values,
                          Context context)
           throws IOException, InterruptedException {
         LinkedList<String[]> people = new LinkedList<String[]>();
         LinkedList<String[]> movies = new LinkedList<String[]>();
         for (TextArrayWritable val : values) {
           String dataset = val.getTextAt(0).toString();
           if (dataset.equals("people")) {
             people.add(new String[] {
               val.getTextAt(1).toString(),
               val.getTextAt(2).toString(),
             });
           }
           if (dataset.equals("movies")) {
             movies.add(new String[] {
               val.getTextAt(1).toString(),
               val.getTextAt(2).toString(),
             });
           }
         }
         for (String[] person : people) {
           for (String[] movie : movies) {
             columns.set(new String[] {
               person[0], person[1], movie[0], movie[1]
             });
             context.write(BLANK, columns);
           }
         }
         ...
  58. Join Query (Reducer), callout (Extract): buffer the tagged values for this key into the people and movies lists.
  59. Join Query (Reducer), callout (people X movies): the nested loops form the cross product of the two lists for this join key.
  60. Join Query (Reducer), callout (SELECT): the emitted columns are first_name, last_name, movies.name, movies.image.
  61. Join Query (Reducer), callout (Emit): context.write(BLANK, columns) writes one joined row per (person, movie) pair.
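     The slides omit the join job's driver. A sketch of one, modeled on the LogAggregator main from slide 20; the three-argument usage and the map-output class wiring are assumptions:

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
     import org.apache.hadoop.util.GenericOptionsParser;

     public class JoinQuery {
       // SelectAndFilterMapper and CombineMapsReducer from the slides above.

       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         String[] otherArgs =
             new GenericOptionsParser(conf, args).getRemainingArgs();
         if (otherArgs.length != 3) {
           System.err.println("Usage: JoinQuery <people> <movies> <out>");
           System.exit(2);
         }
         Job job = new Job(conf, "JoinQuery");
         job.setJarByClass(JoinQuery.class);
         job.setMapperClass(SelectAndFilterMapper.class);
         // No combiner: the reducer needs every tagged record for a join key
         // in order to build the people X movies cross product.
         job.setReducerClass(CombineMapsReducer.class);
         job.setMapOutputKeyClass(Text.class);
         job.setMapOutputValueClass(TextArrayWritable.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(TextArrayWritable.class);
         // Both tables go in as input paths; the mapper tells them apart
         // by file name.
         FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
         FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
         FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
     }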
  62. Hive
  63. What is Hive?

     "Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language."
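     To make "SQL-like" concrete, the date/IP aggregation from the log-processing section might read like this in QL, assuming the log has already been parsed into a two-column table (the table and column names are inventions for illustration, not from the deck):

     CREATE TABLE access_log (dt STRING, ip STRING)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

     SELECT dt, ip, COUNT(1)
     FROM access_log
     GROUP BY dt, ip;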
  64. Hive Features

     SerDe
     MetaStore
     Query Processor / Compiler
     Functions: UDFs, UDAFs, UDTFs
  65. Hive Demo
  66. Links

     http://hadoop.apache.org/
     https://github.com/synctree/hadoop-recipes
     http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
     http://developer.yahoo.com/blogs/hadoop/
     http://wiki.apache.org/hadoop/Hive
  67. Questions?
  68. Thanks
