# Solving real world problems with Hadoop

### Transcript

• 1. Solving Real World Problems with Hadoop and SQL -> Hadoop. Masahji Stewart <masahji@synctree.com>. Tuesday, April 5, 2011
• 2. Solving Real World Problems with Hadoop
• 3. Word Count Input:

```
MapReduce is a framework for processing huge datasets on certain
kinds of distributable problems using a large number of computers
(nodes), collectively referred to as a cluster ...
```
• 4. Word Count, the same input text as the previous slide, with the expected output (word, count) pairs:

```
as             1    collectively   1    kinds     1
MapReduce      1    computers      1    using     1
(nodes),       1    is             1    number    1
certain        1    datasets       1    of        2
cluster        1    distributable  1    on        1
a              3    large          1    problems  1
framework      1    for            1    referred  1
processing     1    huge           1    to        1
```
• 5. Word Count (Mapper):

```java
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
```
• 6. Word Count (Mapper), the same code with the Extract step highlighted: each token from the StringTokenizer becomes the output key (word = "MapReduce", word = "is", word = "a", ...).
• 7. Word Count (Mapper), the same code with the Emit step highlighted: context.write(word, one) emits ("MapReduce", 1), ("is", 1), ("a", 1), ...
• 8. Word Count (Reducer):

```java
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```
• 9. Word Count (Reducer), the same code with the Sum step highlighted: for key = "of", the loop over the values iterator accumulates sum = 2.
• 10. Word Count (Reducer), the same code with the Emit step highlighted: context.write(key, result) emits ("of", 2).
• 11. Word Count (Running):

```
$ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount \
    -D mapred.reduce.tasks=3 input_file out
11/04/03 21:21:27 INFO mapred.JobClient: Default number of map tasks: 2
11/04/03 21:21:27 INFO mapred.JobClient: Default number of reduce tasks: 3
11/04/03 21:21:28 INFO input.FileInputFormat: Total input paths to process : 1
11/04/03 21:21:29 INFO mapred.JobClient: Running job: job_201103252110_0659
11/04/03 21:21:30 INFO mapred.JobClient:  map 0% reduce 0%
11/04/03 21:21:37 INFO mapred.JobClient:  map 100% reduce 0%
11/04/03 21:21:49 INFO mapred.JobClient:  map 100% reduce 33%
11/04/03 21:21:52 INFO mapred.JobClient:  map 100% reduce 66%
11/04/03 21:22:05 INFO mapred.JobClient:  map 100% reduce 100%
11/04/03 21:22:08 INFO mapred.JobClient: Job complete: job_201103252110_0659
11/04/03 21:22:08 INFO mapred.JobClient: Counters: 17
...
11/04/03 21:22:08 INFO mapred.JobClient:   Map output bytes=286
11/04/03 21:22:08 INFO mapred.JobClient:   Combine input records=27
11/04/03 21:22:08 INFO mapred.JobClient:   Map output records=27
11/04/03 21:22:08 INFO mapred.JobClient:   Reduce input records=24
```
• 14. Word Count pipeline (diagram): Input -> Split -> Map -> Shuffle/Sort -> Reduce -> Output. The input text is split across map tasks, each map emits (word, 1) pairs, the shuffle/sort phase groups the pairs by word, and the reduce tasks emit the summed counts (e.g. of 2, a 3).
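The pipeline on this slide can be simulated in plain Java, with the shuffle/sort phase modeled as grouping values by key. This is a single-process sketch for illustration only; the class and method names here are invented, not part of the deck's code:

```java
import java.util.*;
import java.util.stream.*;

public class LocalWordCount {

    // map: one line of text -> (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // shuffle/sort: group the emitted values by key (sorted), as the
    // framework does between the map and reduce phases
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce: sum the grouped counts per word
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        String input = "MapReduce is a framework for processing huge datasets on a cluster";
        System.out.println(reduce(shuffle(map(input)))); // "a" counts twice here
    }
}
```

In the real job each phase runs on different nodes and the grouped values stream through the reducer rather than being materialized, but the data flow has exactly this map -> group-by-key -> reduce shape.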
• 15. Log Processing (Date IP COUNT) Input:

```
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
...
```
• 16. Log Processing (Date IP COUNT), the same input as the previous slide, with the expected output (date, IP, request count):

```
18/Jul/2010    189.186.9.181    1
18/Jul/2010    201.201.16.82    3
18/Jul/2010    66.195.114.59    1
18/Jul/2010    67.195.114.59    1
18/Jul/2010    90.221.175.16    1
19/Jul/2010    90.221.75.196    1
...
```
• 17. Log Processing (Mapper). The regex backslashes were lost in the slide extraction; restored here:

```java
public static final Pattern LOG_PATTERN = Pattern.compile(
    "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+-]\\d{4})\\] " +
    "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

public static class ExtractDateAndIpMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text ip = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException {
    String text = value.toString();
    Matcher matcher = LOG_PATTERN.matcher(text);
    while (matcher.find()) {
      try {
        // group(5) is the date, group(1) the client IP
        ip.set(matcher.group(5) + "\t" + matcher.group(1));
        context.write(ip, one);
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
}
```
• 18. Log Processing (Mapper), the same code with the Extract step highlighted: ip = "189.186.9.181", ip = "201.201.16.82", ip = "66.249.67.57", ...
• 19. Log Processing (Mapper), the same code with the Emit step highlighted: context.write(ip, one) emits keys like "18/Jul/2010\t189.186.9.181" with value 1.
• 20. Log Processing (main):

```java
public class LogAggregator {
  ...
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: LogAggregator <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "LogAggregator");
    job.setJarByClass(LogAggregator.class);
    job.setMapperClass(ExtractDateAndIpMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
• 21. Log Processing (main), the same code with the Mapper wiring highlighted: job.setMapperClass(ExtractDateAndIpMapper.class).
• 22. Log Processing (main), the same code with the Reducer wiring highlighted: job.setReducerClass(WordCount.IntSumReducer.class) reuses the word-count reducer, which also serves as the combiner.
• 23. Log Processing (main), the same code with the input/output settings highlighted: the output key/value classes plus FileInputFormat.addInputPath and FileOutputFormat.setOutputPath.
• 24. Log Processing (main), the same code: Run it! System.exit(job.waitForCompletion(true) ? 0 : 1).
• 27. Log Processing (Output):

```
$ hadoop fs -ls log_results
Found 2 items
-rwxrwxrwx 1 masahji staff   0 2011-04-04 00:51 log_results/_SUCCESS
-rwxrwxrwx 1 masahji staff 168 2011-04-04 00:51 log_results/part-r-00000
$ hadoop fs -cat log_results/part-r-00000
18/Jul/2010    189.186.9.181    1
18/Jul/2010    201.201.16.82    3
18/Jul/2010    66.195.114.59    1
18/Jul/2010    67.195.114.59    1
18/Jul/2010    90.221.175.16    1
19/Jul/2010    90.221.75.196    1
...
```
• 28. Hadoop Streaming (diagram): the Task Tracker forks the mapper/reducer script as a child process, pipes the task input through the script's STDIN, and reads its STDOUT as the task output.
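Any executable that reads records on STDIN and writes key/value lines on STDOUT can serve as a streaming mapper or reducer. A minimal word-count mapper in that style (a sketch with invented names, not code from the deck):

```java
import java.io.*;
import java.util.*;

// A streaming-style mapper: reads raw text lines on STDIN and writes
// "word<TAB>1" records on STDOUT, which is all Hadoop Streaming
// requires of a mapper executable.
public class StreamingWordMapper {

    // Turn one input line into the records a streaming mapper would emit.
    static List<String> mapLine(String line) {
        List<String> records = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) records.add(word + "\t1");
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            for (String record : mapLine(line)) System.out.println(record);
        }
    }
}
```

It could hypothetically be wired in with a -mapper argument, just as the grep and Ruby examples on the following slides are.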
• 29. Basic grep Input (some dictionary headwords were lost in the slide extraction):

```
...
[sou1 suo3] /to search/.../internet search/database search/
[ji2 ri4] /propitious day/lucky day/
[ji2 xiang2] /lucky/auspicious/propitious/
[duo1 duo1] /to cluck ones tongue/tut-tut/
鹊 [xi3 que4] /black-billed magpie, legendary bringer of good luck/
...
```
• 30. Basic grep, the same input as the previous slide, with the expected output (only entries containing "database"):

```
...
汇 [hui4 chu1] /to export data (e.g. from a database)/
[sou1 suo3] /to search/.../internet search/database search/
库 [shu4 ju4 ku4] /database/
库软 [shu4 ju4 ku4 ruan3 jian4] /database software/
资 库 [zi1 liao4 ku4] /database/
...
```
• 31. Basic grep:

```
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input data/cedict.txt.gz \
    -output streaming/grep_database_mandarin \
    -mapper 'grep database' \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
...
11/04/04 05:27:58 INFO streaming.StreamJob:  map 100%  reduce 100%
11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001
11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin
```
• 32. Basic grep, the same command annotated: -mapper and -reducer accept either scripts/shell commands (here grep database) or Java classes (here IdentityReducer).
• 33. Basic grep, the same command, followed by the results:

```
$ hadoop fs -cat streaming/grep_database_mandarin/part-00000
汇 [hui4 chu1] /to remit (money)//to export data (e.g. from a database)/
[sou1 suo3] /to search/to look for sth/internet search/database search/
库 [shu4 ju4 ku4] /database/
库软 [shu4 ju4 ku4 ruan3 jian4] /database software/
资 库 [zi1 liao4 ku4] /database/
```
• 34. Ruby Example (ignore ip list) Input:

```
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
192.168.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 96 "-" "Mozilla/4.0"
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
10.1.10.12 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 51 "-" "Mozilla/5.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
10.1.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 94 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
10.1.10.14 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 24 "-" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
...
```

Output (the internal 192.168.* and 10.* healthcheck traffic has been filtered out):

```
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
...
```
• 35. Ruby Example (ignore ip list). The regex backslashes were lost in the slide extraction; restored here:

```ruby
#!/usr/bin/env ruby

# IP prefixes to drop (loopback and internal ranges)
ignore = %w(127.0.0.1 192.168 10)
log_regex = /^([\d.]+)\s/

# Read STDIN, write STDOUT
while (line = STDIN.gets)
  next unless line =~ log_regex
  ip = $1
  # print only if no ignore prefix matches the client IP
  print line if ignore.reject { |ignore_ip| ip !~ /^#{ignore_ip}(\.|$)/ }.empty?
end
```
• 36. Ruby Example (ignore ip list), the same filter script shown without annotations.
• 37. Ruby Example (ignore ip list):

```
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input data/access.log \
    -output out/streaming/filter_ips \
    -mapper ./script/filter_ips \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not ...
11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/...
11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001
11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
...
```
• 38. Ruby Example (ignore ip list), the same job, followed by the filtered results:

```
$ hadoop fs -cat out/streaming/filter_ips/part-00000
...
189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
```
• 39. SQL -> Hadoop
• 40. Simple Query:

```sql
SELECT first_name, last_name
FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2
```
• 41. Simple Query, the same query with its input table:

```
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2
```
• 42. Simple Query, the same query and input with the expected output (Royce Rollins matches on favorite_movie_id = 2):

```
first_name  last_name
John        Mulligan
Royce       Rollins
John        Smith
```
• 43. Simple Query (Mapper):

```java
public class SimpleQuery {
  ...
  public static class SelectAndFilterMapper
      extends Mapper<Object, Text, TextArrayWritable, Text> {
    ...
    public void map(Object key, Text value, Context context)
        throws IOException {
      String[] row = value.toString().split(DELIMITER);
      try {
        if (row[FIRST_NAME_COLUMN].equals("John") ||
            row[FAVORITE_MOVIE_ID_COLUMN].equals("2")) {
          columns.set(new String[] {
            row[FIRST_NAME_COLUMN],
            row[LAST_NAME_COLUMN]
          });
          context.write(columns, blank);
        }
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
  ...
}
```
• 44. Simple Query (Mapper), the same code with the Extract step highlighted: the input line is split into columns on the delimiter.
• 45. Simple Query (Mapper), the same code with the WHERE clause highlighted: the if condition implements WHERE first_name = 'John' OR favorite_movie_id = 2.
• 46. Simple Query (Mapper), the same code with the SELECT highlighted: the columns array picks out first_name and last_name.
• 47. Simple Query (Mapper), the same code with the Emit step highlighted: context.write(columns, blank).
• 48. Simple Query (Running):

```
$ hadoop jar target/hadoop-recipes-1.0.jar \
    com.synctree.hadoop.recipes.SimpleQuery data/people.tsv out/simple_query
...
11/04/04 09:19:15 INFO mapred.JobClient:  map 100% reduce 100%
11/04/04 09:19:15 INFO mapred.JobClient: Job complete: job_local_0001
11/04/04 09:19:15 INFO mapred.JobClient: Counters: 13
11/04/04 09:19:15 INFO mapred.JobClient:   FileSystemCounters
11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_READ=306296
11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=398676
11/04/04 09:19:15 INFO mapred.JobClient:   Map-Reduce Framework
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input groups=3
11/04/04 09:19:15 INFO mapred.JobClient:     Combine output records=0
11/04/04 09:19:15 INFO mapred.JobClient:     Map input records=4
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce output records=3
11/04/04 09:19:15 INFO mapred.JobClient:     Spilled Records=6
11/04/04 09:19:15 INFO mapred.JobClient:     Map output bytes=54
11/04/04 09:19:15 INFO mapred.JobClient:     Combine input records=0
11/04/04 09:19:15 INFO mapred.JobClient:     Map output records=3
11/04/04 09:19:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input records=3
...
```
• 49. Simple Query (Running):

```
$ hadoop fs -cat out/simple_query/part-r-00000
John    Mulligan
John    Smith
Royce   Rollins
```
• 50. Join Query:

```sql
SELECT first_name, last_name, movies.name name, movies.image
FROM people
JOIN movies ON ( people.favorite_movie_id = movies.id )
```
• 51. Join Query Input, the people table:

```
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2
```

and the movies table:

```
id  name        image
2   The Matrix  http://bit.ly/matrix.jpg
3   Gatacca     http://bit.ly/g.jpg
4   AI          http://bit.ly/ai.jpg
5   Avatar      http://bit.ly/avatar.jpg
```
• 52. Join Query, the same people and movies tables with the joined output (each person matched to the movie whose id equals their favorite_movie_id):

```
first_name  last_name  name        image
John        Mulligan   Gatacca     http://bit.ly/g.jpg
Samir       Ahmed      Avatar      http://bit.ly/avatar.jpg
Royce       Rollins    The Matrix  http://bit.ly/matrix.jpg
John        Smith      The Matrix  http://bit.ly/matrix.jpg
```
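This is a reduce-side join: the mapper keys every row by the join column and tags it with its source table, and the reducer pairs up the buffered rows from both sides. The per-key pairing can be sketched as follows (LocalJoinSketch and joinOneKey are invented names for illustration, not the deck's code):

```java
import java.util.*;

public class LocalJoinSketch {

    // Simulate the reduce step for one join key: every buffered "people"
    // row is paired with every buffered "movies" row (a small cross product).
    static List<String[]> joinOneKey(List<String[]> people, List<String[]> movies) {
        List<String[]> rows = new ArrayList<>();
        for (String[] person : people)
            for (String[] movie : movies)
                rows.add(new String[] { person[0], person[1], movie[0], movie[1] });
        return rows;
    }

    public static void main(String[] args) {
        // Values a reducer would receive for join key "2" in the sample data.
        List<String[]> people = List.of(
            new String[] { "Royce", "Rollins" },
            new String[] { "John", "Smith" });
        List<String[]> movies = List.of(
            new String[] { "The Matrix", "http://bit.ly/matrix.jpg" });
        for (String[] row : joinOneKey(people, movies))
            System.out.println(String.join("\t", row));
    }
}
```

Because both sides arrive grouped under the same key, no table lookups are needed at reduce time; the cost is buffering one key's rows in memory.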
• 53. Join Query (Mapper):

```java
public static class SelectAndFilterMapper
    extends Mapper<Object, Text, Text, TextArrayWritable> {
  ...
  public void map(Object key, Text value, Context context)
      throws IOException {
    String[] row = value.toString().split(DELIMITER);
    String fileName =
        ((FileSplit) context.getInputSplit()).getPath().getName();
    try {
      if (fileName.startsWith("people")) {
        columns.set(new String[] {
          "people",
          row[PEOPLE_FIRST_NAME_COLUMN],
          row[PEOPLE_LAST_NAME_COLUMN]
        });
        joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
      } else if (fileName.startsWith("movies")) {
        columns.set(new String[] {
          "movies",
          row[MOVIES_NAME_COLUMN],
          row[MOVIES_IMAGE_COLUMN]
        });
        joinKey.set(row[MOVIES_ID_COLUMN]);
      }
      context.write(joinKey, columns);
    } catch (InterruptedException ex) {
      throw new IOException(ex);
    }
  ...
```
• 54. Join Query (Mapper), the same code with the Parse step highlighted: the row is split into columns and the source file name is read from the input split.
• 55. Join Query (Mapper) public static class SelectAndFilterMapper extends Mapper<Object, Text, Text, TextArrayWritable> { ... public void map(Object key, Text value, Context context) Parse throws IOException { String [] row = value.toString().split(DELIMITER); String fileName = ((FileSplit) context.getInputSplit()).getPath().getName(); try { if(fileName.startsWith("people")) { columns.set( new String [] { "people", row[PEOPLE_FIRST_NAME_COLUMN], row[PEOPLE_LAST_NAME_COLUMN] Classify }); joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]); } else if(fileName.startsWith("movies")) { columns.set( new String [] { "movies", row[MOVIES_NAME_COLUMN], row[MOVIES_IMAGE_COLUMN] }); joinKey.set(row[MOVIES_ID_COLUMN]); } context.write(joinKey, columns); } catch(InterruptedException ex) { throw new IOException(ex); } ...Tuesday, April 5, 2011
• 56. Join Query (Mapper) public static class SelectAndFilterMapper extends Mapper<Object, Text, Text, TextArrayWritable> { ... public void map(Object key, Text value, Context context) Parse throws IOException { String [] row = value.toString().split(DELIMITER); String fileName = ((FileSplit) context.getInputSplit()).getPath().getName(); try { if(fileName.startsWith("people")) { columns.set( new String [] { "people", row[PEOPLE_FIRST_NAME_COLUMN], row[PEOPLE_LAST_NAME_COLUMN] Classify }); joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]); } else if(fileName.startsWith("movies")) { columns.set( new String [] { "movies", row[MOVIES_NAME_COLUMN], row[MOVIES_IMAGE_COLUMN] }); joinKey.set(row[MOVIES_ID_COLUMN]); } context.write(joinKey, columns); Emit } catch(InterruptedException ex) { throw new IOException(ex); } ...Tuesday, April 5, 2011
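Stripped of the Hadoop types, the mapper does three things: parse the line, classify it by source file, and emit the join key with the row tagged by table. That decision is easy to test as a pure function; the sketch below is a hypothetical helper, with the column positions assumed from the deck's *_COLUMN constants rather than shown in it.

```java
public class TagByFile {
    // Returns {joinKey, datasetTag, col1, col2} for one delimited input line,
    // mirroring SelectAndFilterMapper: the file name tells us which table a
    // row came from, and therefore which column holds the join key.
    public static String[] tag(String fileName, String line, String delimiter) {
        String[] row = line.split(delimiter);
        if (fileName.startsWith("people")) {
            // assumed people layout: id, first_name, last_name, favorite_movie_id
            return new String[] { row[3], "people", row[1], row[2] };
        } else if (fileName.startsWith("movies")) {
            // assumed movies layout: id, name, image
            return new String[] { row[0], "movies", row[1], row[2] };
        }
        throw new IllegalArgumentException("unknown input file: " + fileName);
    }
}
```

Tagging each value with its source table ("people" / "movies") is what lets one reducer tell the two datasets apart after the shuffle merges them under the same key.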
• 57. Join Query (Reducer)

public static class CombineMapsReducer
    extends Reducer<Text, TextArrayWritable, Text, TextArrayWritable> {
  ...
  public void reduce(Text key, Iterable<TextArrayWritable> values, Context context)
      throws IOException, InterruptedException {
    LinkedList<String[]> people = new LinkedList<String[]>();
    LinkedList<String[]> movies = new LinkedList<String[]>();
    // Extract: bucket the tagged values for this join key by source table
    for (TextArrayWritable val : values) {
      String dataset = val.getTextAt(0).toString();
      if (dataset.equals("people")) {
        people.add(new String[] {
          val.getTextAt(1).toString(), val.getTextAt(2).toString(),
        });
      }
      if (dataset.equals("movies")) {
        movies.add(new String[] {
          val.getTextAt(1).toString(), val.getTextAt(2).toString(),
        });
      }
    }
    // people X movies: the cross product for this key implements the join,
    // and the columns picked match the SELECT list:
    //   SELECT first_name, last_name, movies.name name, movies.image
    for (String[] person : people) {
      for (String[] movie : movies) {
        columns.set(new String[] { person[0], person[1], movie[0], movie[1] });
        // Emit the joined row
        context.write(BLANK, columns);
      }
    }
    ...
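The reducer's core is a small nested loop: bucket the tagged values that arrive for one join key into a people list and a movies list, then emit their cross product, so N people sharing a favorite movie each get a row. Here is that core in isolation, as a plain-Java sketch (hypothetical class name, plain arrays instead of Writables):

```java
import java.util.*;

public class CrossJoin {
    // values: each entry is {datasetTag, col1, col2} for ONE join key, as
    // produced by the mapper. Output is the people x movies cross product.
    public static List<String[]> combine(List<String[]> values) {
        List<String[]> people = new ArrayList<String[]>();
        List<String[]> movies = new ArrayList<String[]>();
        for (String[] v : values) {              // Extract step
            if (v[0].equals("people")) people.add(new String[] { v[1], v[2] });
            if (v[0].equals("movies")) movies.add(new String[] { v[1], v[2] });
        }
        List<String[]> out = new ArrayList<String[]>();
        for (String[] p : people)                // people X movies step
            for (String[] m : movies)
                out.add(new String[] { p[0], p[1], m[0], m[1] });
        return out;
    }
}
```

Note that buffering one side in memory is the classic weak point of a reduce-side join: a hot key with millions of matching rows on both sides can exhaust the reducer's heap, which is one reason frameworks like Hive implement joins for you.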
• 62. Hive
• 63. What is Hive? “Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.”
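For the people/movies example from the earlier slides, the hand-written mapper and reducer collapse into a few lines of QL. This is an illustrative sketch, not taken from the deck: the table and column names come from the earlier slides, while the delimiter in the DDL is an assumption.

```sql
-- Declare the two delimited files as Hive tables
CREATE TABLE people (id INT, first_name STRING, last_name STRING,
                     favorite_movie_id INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE TABLE movies (id INT, name STRING, image STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Hive compiles this join into the same map/reduce shape built by hand above
SELECT p.first_name, p.last_name, m.name, m.image
FROM people p JOIN movies m ON (p.favorite_movie_id = m.id);
```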
• 64. Hive Features

- SerDe
- MetaStore
- Query Processor
- Compiler
- Processor
- Functions / UDFs, UDAFs, UDTFs
• 65. Hive Demo