Solving real world problems with Hadoop
Presentation Transcript

  • 1. Solving Real World Problems with Hadoop (and SQL -> Hadoop). Masahji Stewart <masahji@synctree.com>. Tuesday, April 5, 2011
  • 2. Solving Real World Problems with Hadoop
  • 3. Word Count

    Input:
    MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster ...
  • 4. Word Count: the same input sentence as slide 3, with the expected output (word, count):

    as 1          MapReduce 1      (nodes), 1       certain 1
    cluster 1     a 3              collectively 1   computers 1
    is 1          datasets 1       distributable 1  large 1
    framework 1   for 1            processing 1     huge 1
    kinds 1       using 1          number 1         of 2
    on 1          problems 1       referred 1       to 1
  • 5. Word Count (Mapper)

    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }
  • 6. Word Count (Mapper), Extract step: the StringTokenizer splits each input line into words (word = "MapReduce", word = "is", word = "a", ...).
  • 7. Word Count (Mapper), Emit step: context.write(word, one) emits one (word, 1) pair per token: ("MapReduce", 1), ("is", 1), ("a", 1), ...
  • 8. Word Count (Reducer)

    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }
  • 9. Word Count (Reducer), Sum step: for key = "of", the loop adds the grouped values to get sum = 2.
  • 10. Word Count (Reducer), Emit step: context.write(key, result) emits ("of", 2).
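    The deck runs word count from the bundled examples jar and never shows its driver. For completeness, a minimal sketch of the job setup, modeled on the LogAggregator main that appears later in the transcript (the class layout here is an assumption, not the deck's code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // TokenizerMapper and IntSumReducer as defined on slides 5 and 8.

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // 0.20-era constructor, matching the API used in the deck
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // summing is associative, so the reducer doubles as a combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }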
  • 11. Word Count (Running)

    $ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount \
        -D mapred.reduce.tasks=3 input_file out
    11/04/03 21:21:27 INFO mapred.JobClient: Default number of map tasks: 2
    11/04/03 21:21:27 INFO mapred.JobClient: Default number of reduce tasks: 3
    11/04/03 21:21:28 INFO input.FileInputFormat: Total input paths to process : 1
    11/04/03 21:21:29 INFO mapred.JobClient: Running job: job_201103252110_0659
    11/04/03 21:21:30 INFO mapred.JobClient:  map 0% reduce 0%
    11/04/03 21:21:37 INFO mapred.JobClient:  map 100% reduce 0%
    11/04/03 21:21:49 INFO mapred.JobClient:  map 100% reduce 33%
    11/04/03 21:21:52 INFO mapred.JobClient:  map 100% reduce 66%
    11/04/03 21:22:05 INFO mapred.JobClient:  map 100% reduce 100%
    11/04/03 21:22:08 INFO mapred.JobClient: Job complete: job_201103252110_0659
    11/04/03 21:22:08 INFO mapred.JobClient: Counters: 17
    ...
    11/04/03 21:22:08 INFO mapred.JobClient:   Map output bytes=286
    11/04/03 21:22:08 INFO mapred.JobClient:   Combine input records=27
    11/04/03 21:22:08 INFO mapred.JobClient:   Map output records=27
    11/04/03 21:22:08 INFO mapred.JobClient:   Reduce input records=24
  • 12. Word Count (Output): a file per reducer.

    hadoop@ip-10-245-210-191:~$ hadoop fs -ls out
    Found 3 items
    -rw-r--r-- 2 hadoop supergroup 90 2011-04-03 21:21 /user/hadoop/out/part-r-00000
    -rw-r--r-- 2 hadoop supergroup 80 2011-04-03 21:21 /user/hadoop/out/part-r-00001
    -rw-r--r-- 2 hadoop supergroup 49 2011-04-03 21:21 /user/hadoop/out/part-r-00002

    hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00000
    as 1
    certain 1
    collectively 1
    datasets 1
    framework 1
    ...
    hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00001
    MapReduce 1
    cluster 1
    computers 1
    distributable 1
    for 1
    ...
    hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00002
    (nodes), 1
    a 3
    is 1
    large 1
    processing 1
    using 1
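    Which part-r-NNNNN file a given word lands in is decided by the partitioner. By default Hadoop hashes the key modulo the reducer count; a minimal sketch of that default behavior (this restates HashPartitioner, it is not code from the deck):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Equivalent to Hadoop's default HashPartitioner: the key's hash,
    // masked to be non-negative, modulo the number of reduce tasks.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }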
  • 14. Word Count pipeline (diagram): Input -> Split -> Map -> Shuffle/Sort -> Reduce -> Output. The input sentence is split across several map tasks, the (word, 1) pairs are shuffled and sorted by key, and three reduce tasks write the final counts.
  • 15. Log Processing (Date IP COUNT)

    Input:
    67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
    189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
    90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
    201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
    201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
    201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
    66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
    90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
    ...
  • 16. Log Processing (Date IP COUNT): the same input, with the expected output (date, IP, count):

    18/Jul/2010  189.186.9.181  1
    18/Jul/2010  201.201.16.82  3
    18/Jul/2010  66.195.114.59  1
    18/Jul/2010  67.195.114.59  1
    18/Jul/2010  90.221.175.16  1
    19/Jul/2010  90.221.75.196  1
    ...
  • 17. Log Processing (Mapper)

    public static final Pattern LOG_PATTERN = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+-]\\d{4})\\] " +
        "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    public static class ExtractDateAndIpMapper
        extends Mapper<Object, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text ip = new Text();

      public void map(Object key, Text value, Context context)
          throws IOException {
        String text = value.toString();
        Matcher matcher = LOG_PATTERN.matcher(text);
        while (matcher.find()) {
          try {
            // key is "date<TAB>ip": group(5) is the date, group(1) the client IP
            ip.set(matcher.group(5) + "\t" + matcher.group(1));
            context.write(ip, one);
          } catch (InterruptedException ex) {
            throw new IOException(ex);
          }
        }
      }
    }
  • 18. Log Processing (Mapper), Extract step: each match pulls the client IP from group(1): ip = "189.186.9.181", ip = "201.201.16.82", ip = "66.249.67.57", ...
  • 19. Log Processing (Mapper), Emit step: the mapper emits ("18/Jul/2010\t189.186.9.181", 1) and so on, one count per request line.
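    Since LOG_PATTERN carries the whole parsing burden, it is worth sanity-checking against a sample line. A small, hedged test sketch (not from the deck; it assumes LOG_PATTERN lives on the LogAggregator class shown on slide 20):

    import java.util.regex.Matcher;

    public class LogPatternTest {
      public static void main(String[] args) {
        String line = "67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "
            + "\"GET /friends HTTP/1.0\" 200 9894 \"-\" \"Mozilla/5.0\"";
        Matcher m = LogAggregator.LOG_PATTERN.matcher(line);
        if (m.find()) {
          System.out.println(m.group(1)); // 67.195.114.59 (client IP)
          System.out.println(m.group(5)); // 18/Jul/2010 (date part of the timestamp)
          System.out.println(m.group(8)); // 200 (HTTP status)
        }
      }
    }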
  • 20. Log Processing (main)

    public class LogAggregator {
      ...
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs =
            new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: LogAggregator <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "LogAggregator");
        job.setJarByClass(LogAggregator.class);
        job.setMapperClass(ExtractDateAndIpMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
  • 21. Log Processing (main): setMapperClass wires in ExtractDateAndIpMapper.
  • 22. Log Processing (main): setCombinerClass and setReducerClass reuse WordCount.IntSumReducer, since counting (date, IP) keys is the same summing problem as word count.
  • 23. Log Processing (main): FileInputFormat and FileOutputFormat supply the input/output settings from the command-line arguments.
  • 24. Log Processing (main): job.waitForCompletion(true) runs it.
  • 25. Log Processing (Running)

    $ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.LogAggregator \
        -libjars hadoop-examples.jar data/access.log log_results
    11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
    11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1
    11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop-examples.jar in /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes-work--8125788655475885988 with rwxr-xr-x
    11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes/hadoop-examples.jar
    11/04/04 00:51:32 INFO mapred.JobClient: map 100% reduce 100%
  • 26. Log Processing (Running): note the filecache.TrackerDistributedCacheManager lines above; the -libjars JAR is placed into the Distributed Cache.
  • 27. Log Processing (Output)

    $ hadoop fs -ls log_results
    Found 2 items
    -rwxrwxrwx 1 masahji staff   0 2011-04-04 00:51 log_results/_SUCCESS
    -rwxrwxrwx 1 masahji staff 168 2011-04-04 00:51 log_results/part-r-00000

    $ hadoop fs -cat log_results/part-r-00000
    18/Jul/2010  189.186.9.181  1
    18/Jul/2010  201.201.16.82  3
    18/Jul/2010  66.195.114.59  1
    18/Jul/2010  67.195.114.59  1
    18/Jul/2010  90.221.175.16  1
    19/Jul/2010  90.221.75.196  1
    ...
  • 28. Hadoop Streaming (diagram): the Task Tracker forks the mapper or reducer as a child process (any script or executable), writing input records to its STDIN and reading output records from its STDOUT.
  • 29. Basic grep

    Input (entries from the CEDICT Chinese-English dictionary, data/cedict.txt.gz):
    ...
    [sou1 suo3] /to search/.../internet search/database search/
    [ji2 ri4] /propitious day/lucky day/
    [ji2 xiang2] /lucky/auspicious/propitious/
    [duo1 duo1] /to cluck ones tongue/tut-tut/
    鹊 [xi3 que4] /black-billed magpie, legendary bringer of good luck/
    ...
  • 30. Basic grep: the same input, filtered down to entries containing "database":

    Output:
    ...
    汇 [hui4 chu1] /to export data (e.g. from a database)/
    [sou1 suo3] /to search/.../internet search/database search/
    库 [shu4 ju4 ku4] /database/
    库软 [shu4 ju4 ku4 ruan3 jian4] /database software/
    资 库 [zi1 liao4 ku4] /database//
    ...
  • 31. Basic grep

    $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input data/cedict.txt.gz \
        -output streaming/grep_database_mandarin \
        -mapper 'grep database' \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer
    ...
    11/04/04 05:27:58 INFO streaming.StreamJob: map 100% reduce 100%
    11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001
    11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin
  • 32. Basic grep: -mapper and -reducer accept either scripts/commands or Java classes. Here the mapper is a shell command (grep) and the reducer a Java class (IdentityReducer).
  • 33. Basic grep (output)

    $ hadoop fs -cat streaming/grep_database_mandarin/part-00000
    汇 [hui4 chu1] /to remit (money)//to export data (e.g. from a database)/
    [sou1 suo3] /to search/to look for sth/internet search/database search/
    库 [shu4 ju4 ku4] /database/
    库软 [shu4 ju4 ku4 ruan3 jian4] /database software/
    资 库 [zi1 liao4 ku4] /database/
  • 34. Ruby Example (ignore ip list)

    Input:
    67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
    192.168.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 96 "-" "Mozilla/4.0"
    189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
    90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
    10.1.10.12 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 51 "-" "Mozilla/5.0"
    201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
    201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
    10.1.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 94 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
    201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
    10.1.10.14 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 24 "-" "Mozilla/4.0"
    66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
    90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
    ...

    Output (the internal 10.x, 192.168.x, and 127.0.0.1 traffic is filtered out):
    189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
    201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
    201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
    201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
    66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
    67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
    90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
    90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
    ...
  • 35. Ruby Example (ignore ip list)

    #!/usr/bin/env ruby

    ignore = %w(127.0.0.1 192.168 10)
    log_regex = /^([\d.]+)\s/

    # Read log lines from STDIN; write the kept ones to STDOUT.
    while (line = STDIN.gets)
      next unless line =~ log_regex
      ip = $1
      # Keep the line only if no ignore prefix matches the IP.
      print line if ignore.reject { |ignore_ip|
        ip !~ /^#{ignore_ip}(\.|$)/
      }.empty?
    end
  • 37. Ruby Example (ignore ip list)

    $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input data/access.log \
        -output out/streaming/filter_ips \
        -mapper ./script/filter_ips \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer
    11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
    11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not ...
    11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1
    11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/ ...
    11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001
    11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
    ...
  • 38. Ruby Example (ignore ip list)

    $ hadoop fs -cat out/streaming/filter_ips/part-00000

    The result matches the expected output on slide 34: the internal 10.x and 192.168.x healthcheck traffic is gone, and the remaining lines come out sorted by IP.
  • 39. SQL -> Hadoop
  • 40. Simple Query

    Query:
    SELECT first_name, last_name
    FROM people
    WHERE first_name = 'John'
       OR favorite_movie_id = 2
  • 41. Simple Query: the same query, with its input table.

    people:
    id  first_name  last_name  favorite_movie_id
    1   John        Mulligan   3
    2   Samir       Ahmed      5
    3   Royce       Rollins    2
    4   John        Smith      2
  • 42. Simple Query: the same query and input, with the expected output.

    first_name  last_name
    John        Mulligan
    John        Smith
    Royce       Rollins
  • 43. Simple Query (Mapper)

    public class SimpleQuery {
      ...
      public static class SelectAndFilterMapper
          extends Mapper<Object, Text, TextArrayWritable, Text> {
        ...
        public void map(Object key, Text value, Context context)
            throws IOException {
          String[] row = value.toString().split(DELIMITER);
          try {
            if (row[FIRST_NAME_COLUMN].equals("John") ||
                row[FAVORITE_MOVIE_ID_COLUMN].equals("2")) {
              columns.set(new String[] {
                  row[FIRST_NAME_COLUMN], row[LAST_NAME_COLUMN]
              });
              context.write(columns, blank);
            }
          } catch (InterruptedException ex) {
            throw new IOException(ex);
          }
        }
      }
      ...
    }
  • 44. Simple Query (Mapper), Extract step: split the delimited line into its columns.
  • 45. Simple Query (Mapper), WHERE step: the if condition implements WHERE first_name = 'John' OR favorite_movie_id = 2.
  • 46. Simple Query (Mapper), SELECT step: columns.set(...) keeps only first_name and last_name, implementing the SELECT list.
  • 47. Simple Query (Mapper), Emit step: context.write(columns, blank) emits the projected row.
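    The mapper's key type, TextArrayWritable, is referenced but never shown in the transcript. A plausible sketch, assuming it is the usual ArrayWritable-of-Text helper with the set(String[]) and getTextAt(int) conveniences the slides call (the deck's real implementation may differ, and a map output key would additionally need a WritableComparable implementation or a registered comparator, omitted here):

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.Text;

    public class TextArrayWritable extends ArrayWritable {
      public TextArrayWritable() {
        super(Text.class);  // fix the element type to Text
      }

      // Convenience used on slides 43 and 53: wrap plain strings.
      public void set(String[] strings) {
        Text[] texts = new Text[strings.length];
        for (int i = 0; i < strings.length; i++) {
          texts[i] = new Text(strings[i]);
        }
        set(texts);  // ArrayWritable.set(Writable[])
      }

      // Convenience used on slide 57.
      public Text getTextAt(int i) {
        return (Text) get()[i];
      }
    }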
  • 48. Simple Query (Running)

    $ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.SimpleQuery \
        data/people.tsv out/simple_query
    ...
    11/04/04 09:19:15 INFO mapred.JobClient: map 100% reduce 100%
    11/04/04 09:19:15 INFO mapred.JobClient: Job complete: job_local_0001
    11/04/04 09:19:15 INFO mapred.JobClient: Counters: 13
    11/04/04 09:19:15 INFO mapred.JobClient:   FileSystemCounters
    11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_READ=306296
    11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=398676
    11/04/04 09:19:15 INFO mapred.JobClient:   Map-Reduce Framework
    11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input groups=3
    11/04/04 09:19:15 INFO mapred.JobClient:     Combine output records=0
    11/04/04 09:19:15 INFO mapred.JobClient:     Map input records=4
    11/04/04 09:19:15 INFO mapred.JobClient:     Reduce shuffle bytes=0
    11/04/04 09:19:15 INFO mapred.JobClient:     Reduce output records=3
    11/04/04 09:19:15 INFO mapred.JobClient:     Spilled Records=6
    11/04/04 09:19:15 INFO mapred.JobClient:     Map output bytes=54
    11/04/04 09:19:15 INFO mapred.JobClient:     Combine input records=0
    11/04/04 09:19:15 INFO mapred.JobClient:     Map output records=3
    11/04/04 09:19:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
    11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input records=3
    ...
  • 49. Simple Query (Output)

    $ hadoop fs -cat out/simple_query/part-r-00000
    John   Mulligan
    John   Smith
    Royce  Rollins
  • 50. Join Query

    Query:
    SELECT first_name, last_name, movies.name name, movies.image
    FROM people
    JOIN movies ON (people.favorite_movie_id = movies.id)
  • 51. Join Query

    Input:
    people:
    id  first_name  last_name  favorite_movie_id
    1   John        Mulligan   3
    2   Samir       Ahmed      5
    3   Royce       Rollins    2
    4   John        Smith      2

    movies:
    id  name        image
    2   The Matrix  http://bit.ly/matrix.jpg
    3   Gatacca     http://bit.ly/g.jpg
    4   AI          http://bit.ly/ai.jpg
    5   Avatar      http://bit.ly/avatar.jpg
  • 52. Join Query: the same people and movies tables, with the joined output (each person paired with the movie whose id matches their favorite_movie_id):

    first_name  last_name  name        image
    John        Mulligan   Gatacca     http://bit.ly/g.jpg
    Samir       Ahmed      Avatar      http://bit.ly/avatar.jpg
    Royce       Rollins    The Matrix  http://bit.ly/matrix.jpg
    John        Smith      The Matrix  http://bit.ly/matrix.jpg
  • 53. Join Query (Mapper)

    public static class SelectAndFilterMapper
        extends Mapper<Object, Text, Text, TextArrayWritable> {
      ...
      public void map(Object key, Text value, Context context)
          throws IOException {
        String[] row = value.toString().split(DELIMITER);
        String fileName =
            ((FileSplit) context.getInputSplit()).getPath().getName();
        try {
          if (fileName.startsWith("people")) {
            columns.set(new String[] {
                "people",
                row[PEOPLE_FIRST_NAME_COLUMN],
                row[PEOPLE_LAST_NAME_COLUMN]
            });
            joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
          } else if (fileName.startsWith("movies")) {
            columns.set(new String[] {
                "movies",
                row[MOVIES_NAME_COLUMN],
                row[MOVIES_IMAGE_COLUMN]
            });
            joinKey.set(row[MOVIES_ID_COLUMN]);
          }
          context.write(joinKey, columns);
        } catch (InterruptedException ex) {
          throw new IOException(ex);
        }
        ...
  • 54. Join Query (Mapper), Parse step: split the row and read the source file name from the input split.
  • 55. Join Query (Mapper), Classify step: rows from files starting with "people" are tagged "people" and keyed by favorite_movie_id; rows from "movies" files are tagged "movies" and keyed by id.
  • 56. Join Query (Mapper), Emit step: context.write(joinKey, columns) sends both tables' rows to the reducers, keyed by the join column.
  • 57. Join Query (Reducer)

    public static class CombineMapsReducer
        extends Reducer<Text, TextArrayWritable, Text, TextArrayWritable> {
      ...
      public void reduce(Text key, Iterable<TextArrayWritable> values,
                         Context context)
          throws IOException, InterruptedException {
        LinkedList<String[]> people = new LinkedList<String[]>();
        LinkedList<String[]> movies = new LinkedList<String[]>();
        for (TextArrayWritable val : values) {
          String dataset = val.getTextAt(0).toString();
          if (dataset.equals("people")) {
            people.add(new String[] {
                val.getTextAt(1).toString(),
                val.getTextAt(2).toString(),
            });
          }
          if (dataset.equals("movies")) {
            movies.add(new String[] {
                val.getTextAt(1).toString(),
                val.getTextAt(2).toString(),
            });
          }
        }
        for (String[] person : people) {
          for (String[] movie : movies) {
            columns.set(new String[] {
                person[0], person[1], movie[0], movie[1]
            });
            context.write(BLANK, columns);
          }
        }
        ...
  • 58. Join Query (Reducer), Extract step: for each join key, buffer the tagged rows into separate people and movies lists.
  • 59. Join Query (Reducer), people X movies: the nested loops form the cross product of the two lists for this key, which is exactly the inner join for that movie id.
  • 60. Join Query (Reducer), SELECT step: columns.set(...) assembles first_name, last_name, movies.name, movies.image.
  • 61. Join Query (Reducer), Emit step: context.write(BLANK, columns) writes each joined row.
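    The transcript never shows the join job's driver. A hedged sketch of how it might be wired, modeled on the LogAggregator main; the enclosing class name JoinQuery and the movies.tsv/out paths are assumptions (data/people.tsv appears on slide 48):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JoinQuery {
      // SelectAndFilterMapper and CombineMapsReducer as on slides 53 and 57.

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "JoinQuery");
        job.setJarByClass(JoinQuery.class);
        job.setMapperClass(SelectAndFilterMapper.class);
        job.setReducerClass(CombineMapsReducer.class); // no combiner: a join cannot be safely pre-aggregated per map task
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TextArrayWritable.class);
        // Both tables go in as inputs; the mapper tells them apart by file name.
        FileInputFormat.addInputPath(job, new Path("data/people.tsv"));
        FileInputFormat.addInputPath(job, new Path("data/movies.tsv")); // assumed path
        FileOutputFormat.setOutputPath(job, new Path("out/join_query")); // assumed path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }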
  • 62. Hive
  • 63. What is Hive? "Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language."
  • 64. Hive Features: SerDe, MetaStore, Query Processor (Compiler, Processor), and Functions (UDFs, UDAFs, UDTFs).
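    Functions are the extension point the "What is Hive?" quote alludes to. As a hedged illustration (the canonical lowercasing example from the Hive documentation of this era, not code from the deck), a user-defined function is just a Java class exposing an evaluate method:

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A minimal Hive UDF. After ADD JAR and
    // CREATE TEMPORARY FUNCTION my_lower AS 'Lower';
    // it can be called as: SELECT my_lower(first_name) FROM people;
    public final class Lower extends UDF {
      public Text evaluate(final Text s) {
        if (s == null) {
          return null;
        }
        return new Text(s.toString().toLowerCase());
      }
    }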
  • 65. Hive Demo
  • 66. Links

    http://hadoop.apache.org/
    https://github.com/synctree/hadoop-recipes
    http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
    http://developer.yahoo.com/blogs/hadoop/
    http://wiki.apache.org/hadoop/Hive
  • 67. Questions?
  • 68. Thanks