Solving Real World Problems with Hadoop
and SQL -> Hadoop

Masahji Stewart <masahji@synctree.com>




Tuesday, April 5, 2011
Solving Real World Problems with Hadoop
Word Count
                         Input
                         MapReduce is a framework for processing huge datasets on
                         certain kinds of distributable problems using a large number
                         of computers (nodes), collectively referred to as a cluster
                         ...




Output
as            1     MapReduce      1     (nodes),    1
certain       1     cluster        1     a           3
collectively  1     computers      1     is          1
datasets      1     distributable  1     large       1
framework     1     for            1     processing  1
huge          1     kinds          1     using       1
number        1     of             2
on            1     problems       1
referred      1
to            1
Word Count (Mapper)

                   public static class TokenizerMapper
                       extends Mapper<Object, Text, Text, IntWritable> {

                     private final static IntWritable one = new IntWritable(1);
                     private Text word = new Text();

                     public void map(Object key, Text value, Context context
                                     ) throws IOException, InterruptedException {
                       StringTokenizer itr =
                         new StringTokenizer(value.toString());
                       while (itr.hasMoreTokens()) {
                         // Extract: word = "MapReduce", "is", "a", ...
                         word.set(itr.nextToken());
                         // Emit: ("MapReduce", 1), ("is", 1), ("a", 1), ...
                         context.write(word, one);
                       }
                     }
                   }
Word Count (Reducer)

                   public static class IntSumReducer
                       extends Reducer<Text,IntWritable,Text,IntWritable> {
                     private IntWritable result = new IntWritable();

                     public void reduce(Text key, Iterable<IntWritable> values,
                                        Context context
                                        ) throws IOException, InterruptedException {
                       // Sum: for key = "of", the grouped values (1, 1) sum to 2
                       int sum = 0;
                       for (IntWritable val : values) {
                         sum += val.get();
                       }
                       result.set(sum);
                       // Emit: ("of", 2)
                       context.write(key, result);
                     }
                   }
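Wiring the mapper and reducer into a job takes a short driver. A minimal sketch in the style of the stock 0.20 examples jar used below (the reducer doubles as a combiner, because per-word sums can safely be computed incrementally on the map side):

                   import org.apache.hadoop.conf.Configuration;
                   import org.apache.hadoop.fs.Path;
                   import org.apache.hadoop.io.IntWritable;
                   import org.apache.hadoop.io.Text;
                   import org.apache.hadoop.mapreduce.Job;
                   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
                   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
                   import org.apache.hadoop.util.GenericOptionsParser;

                   public class WordCount {
                     public static void main(String[] args) throws Exception {
                       Configuration conf = new Configuration();
                       String[] otherArgs =
                         new GenericOptionsParser(conf, args).getRemainingArgs();
                       if (otherArgs.length != 2) {
                         System.err.println("Usage: wordcount <in> <out>");
                         System.exit(2);
                       }
                       Job job = new Job(conf, "word count");
                       job.setJarByClass(WordCount.class);
                       job.setMapperClass(TokenizerMapper.class);
                       job.setCombinerClass(IntSumReducer.class); // pre-sums map output
                       job.setReducerClass(IntSumReducer.class);
                       job.setOutputKeyClass(Text.class);
                       job.setOutputValueClass(IntWritable.class);
                       FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
                       FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
                       System.exit(job.waitForCompletion(true) ? 0 : 1);
                     }
                   }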
Word Count (Running)
               $ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount \
                   -D mapred.reduce.tasks=3 \
                   input_file out

               11/04/03   21:21:27   INFO   mapred.JobClient: Default number of map tasks: 2
               11/04/03   21:21:27   INFO   mapred.JobClient: Default number of reduce tasks: 3
               11/04/03   21:21:28   INFO   input.FileInputFormat: Total input paths to process : 1
               11/04/03   21:21:29   INFO   mapred.JobClient: Running job: job_201103252110_0659
               11/04/03   21:21:30   INFO   mapred.JobClient: map 0% reduce 0%
               11/04/03   21:21:37   INFO   mapred.JobClient: map 100% reduce 0%
               11/04/03   21:21:49   INFO   mapred.JobClient: map 100% reduce 33%
               11/04/03   21:21:52   INFO   mapred.JobClient: map 100% reduce 66%
               11/04/03   21:22:05   INFO   mapred.JobClient: map 100% reduce 100%
               11/04/03   21:22:08   INFO   mapred.JobClient: Job complete: job_201103252110_0659
               11/04/03   21:22:08   INFO   mapred.JobClient: Counters: 17
               ...
               11/04/03   21:22:08   INFO   mapred.JobClient:     Map output bytes=286
               11/04/03   21:22:08   INFO   mapred.JobClient:     Combine input records=27
               11/04/03   21:22:08   INFO   mapred.JobClient:     Map output records=27
               11/04/03   21:22:08   INFO   mapred.JobClient:     Reduce input records=24




Word Count (Output)
               hadoop@ip-10-245-210-191:~$ hadoop fs -ls out
               Found 3 items
               -rw-r--r--   2 hadoop supergroup  90 2011-04-03 21:21 /user/hadoop/out/part-r-00000
               -rw-r--r--   2 hadoop supergroup  80 2011-04-03 21:21 /user/hadoop/out/part-r-00001
               -rw-r--r--   2 hadoop supergroup  49 2011-04-03 21:21 /user/hadoop/out/part-r-00002

               hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00000   # one file per reducer
               as            1
               certain       1
               collectively  1
               datasets      1
               framework     1
               ...
               hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00001
               MapReduce      1
               cluster        1
               computers      1
               distributable  1
               for            1
               ...
               hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00002
               (nodes),    1
               a           3
               is          1
               large       1
               processing  1
               using       1
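Which part-r-* file a word lands in is decided by the default HashPartitioner: partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A quick illustrative sketch (HashPartitionDemo is a throwaway name; note that Hadoop hashes the Text key's UTF-8 bytes rather than the Java String, so the exact assignments below differ from the real job):

               public class HashPartitionDemo {
                 // mirrors org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
                 static int partitionFor(String word, int numReduceTasks) {
                   return (word.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
                 }

                 public static void main(String[] args) {
                   // with 3 reduce tasks every word maps to exactly one partition,
                   // which is why each output file holds a disjoint set of words
                   for (String word : new String[] { "as", "MapReduce", "a", "of" }) {
                     System.out.println(word + " -> part-r-0000" + partitionFor(word, 3));
                   }
                 }
               }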
Word Count
[Pipeline diagram: Input -> Split -> Map -> Shuffle/Sort -> Reduce -> Output. The input text is split into chunks, each chunk is fed to a MAP task, the shuffle/sort phase groups identical words together, and three REDUCE tasks write the three output partitions listed above.]
Log Processing (Date IP COUNT)

    Input
     67.195.114.59   -   -   [18/Jul/2010:16:21:35   -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
     189.186.9.181   -   -   [18/Jul/2010:16:21:35   -0700] "-" 400 0 "-" "-"
     90.221.175.16   -   -   [18/Jul/2010:16:21:35   -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
     201.201.16.82   -   -   [18/Jul/2010:16:21:35   -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
     201.201.16.82   -   -   [18/Jul/2010:16:21:35   -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     201.201.16.82   -   -   [18/Jul/2010:16:21:35   -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     66.195.114.59   -   -   [18/Jul/2010:16:21:35   -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
     90.221.75.196   -   -   [19/Jul/2010:16:21:35   -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
                                                                                      ...




    Output
    18/Jul/2010   189.186.9.181   1
    18/Jul/2010   201.201.16.82   3
    18/Jul/2010   66.195.114.59   1
    18/Jul/2010   67.195.114.59   1
    18/Jul/2010   90.221.175.16   1
    19/Jul/2010   90.221.75.196   1
    ...
Log Processing (Mapper)

                 public static final Pattern LOG_PATTERN = Pattern.compile(
                     "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+-]\\d{4})\\] " +
                     "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

                 public static class ExtractDateAndIpMapper
                     extends Mapper<Object, Text, Text, IntWritable> {

                   private final static IntWritable one = new IntWritable(1);
                   private Text ip = new Text();

                   public void map(Object key, Text value, Context context)
                       throws IOException {

                     String text = value.toString();
                     Matcher matcher = LOG_PATTERN.matcher(text);
                     while (matcher.find()) {
                       try {
                         // Extract: group(5) is the date ("18/Jul/2010"),
                         //          group(1) is the client IP ("189.186.9.181")
                         ip.set(matcher.group(5) + "\t" + matcher.group(1));
                         // Emit: ("18/Jul/2010\t189.186.9.181", 1), ...
                         context.write(ip, one);
                       } catch (InterruptedException ex) {
                         throw new IOException(ex);
                       }
                     }
                   }
                 }
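To sanity-check the capture groups against a line from the input above, a throwaway harness like this works (LogPatternDemo is a hypothetical name; it assumes LOG_PATTERN is visible on LogAggregator):

                 import java.util.regex.Matcher;

                 public class LogPatternDemo {
                   public static void main(String[] args) {
                     String line = "67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "
                         + "\"GET /friends HTTP/1.0\" 200 9894 \"-\" \"Mozilla/5.0\"";
                     Matcher m = LogAggregator.LOG_PATTERN.matcher(line);
                     if (m.find()) {
                       System.out.println(m.group(5)); // 18/Jul/2010
                       System.out.println(m.group(1)); // 67.195.114.59
                     }
                   }
                 }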
Log Processing (main)

               public class LogAggregator {
               ...
                 public static void main(String[] args) throws Exception {
                   Configuration conf = new Configuration();
                   String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
                   if (otherArgs.length != 2) {
                     System.err.println("Usage: LogAggregator <in> <out>");
                     System.exit(2);
                   }
                   Job job = new Job(conf, "LogAggregator");
                   job.setJarByClass(LogAggregator.class);
                   // Mapper
                   job.setMapperClass(ExtractDateAndIpMapper.class);
                   // Reducer, reused from WordCount; it is also safe as a combiner,
                   // since partial counts can be summed on the map side
                   job.setCombinerClass(WordCount.IntSumReducer.class);
                   job.setReducerClass(WordCount.IntSumReducer.class);
                   job.setOutputKeyClass(Text.class);
                   job.setOutputValueClass(IntWritable.class);
                   // Input/Output settings
                   FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
                   FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
                   // Run it!
                   System.exit(job.waitForCompletion(true) ? 0 : 1);
                 }
               }
Log Processing (Running)

        $ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.LogAggregator \
            -libjars hadoop-examples.jar data/access.log log_results

        11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
        11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1
        11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop-
        examples.jar in /tmp/hadoop-masahji/mapred/local/
        archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/
        hadoop-recipes-work--8125788655475885988 with rwxr-xr-x
        11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///
        Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/
        mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/
        Development/hadoop-recipes/hadoop-examples.jar
        11/04/04 00:51:32 INFO mapred.JobClient: map 100% reduce 100%

        Note the filecache lines: the -libjars JAR is placed into the distributed cache.
Log Processing (Output)

               $ hadoop fs -ls log_results
               Found 2 items
               -rwxrwxrwx    1 masahji staff         0 2011-04-04 00:51 log_results/_SUCCESS
               -rwxrwxrwx    1 masahji staff       168 2011-04-04 00:51 log_results/part-r-00000

               $ hadoop fs -cat log_results/part-r-00000
               18/Jul/2010   189.186.9.181   1
               18/Jul/2010   201.201.16.82   3
               18/Jul/2010   66.195.114.59   1
               18/Jul/2010   67.195.114.59   1
               18/Jul/2010   90.221.175.16   1
               19/Jul/2010   90.221.75.196   1
               ...
Hadoop Streaming

[Diagram: the Task Tracker forks a child process for the mapper/reducer; the child runs the user-supplied script, feeding it input records on STDIN and collecting key/value output lines from STDOUT.]
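Any executable can play the mapper or reducer role: it just has to read records from STDIN and write result lines to STDOUT (text up to the first tab is treated as the key). A minimal sketch of a grep-like streaming mapper in Java (StreamGrep is a hypothetical name, standing in for the 'grep database' command used below):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;

        public class StreamGrep {
          public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
              // pass matching records through to STDOUT unchanged
              if (line.contains("database")) {
                System.out.println(line);
              }
            }
          }
        }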
Basic grep
                  Input
                  ...
                  搜索 [sou1 suo3] /to search/.../internet search/database search/
                  吉日 [ji2 ri4] /propitious day/lucky day/
                  吉祥 [ji2 xiang2] /lucky/auspicious/propitious/
                  咄咄 [duo1 duo1] /to cluck one's tongue/tut-tut/
                  喜鹊 [xi3 que4] /black-billed magpie, legendary bringer of good luck/
                  ...

                  Output
                  ...
                  汇出 [hui4 chu1] /to export data (e.g. from a database)/
                  搜索 [sou1 suo3] /to search/.../internet search/database search/
                  数据库 [shu4 ju4 ku4] /database/
                  数据库软件 [shu4 ju4 ku4 ruan3 jian4] /database software/
                  资料库 [zi1 liao4 ku4] /database/
                  ...
Basic grep
        $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
             -input    data/cedict.txt.gz \
             -output   streaming/grep_database_mandarin \
             -mapper   'grep database' \
             -reducer  org.apache.hadoop.mapred.lib.IdentityReducer
        ...
        11/04/04 05:27:58 INFO streaming.StreamJob: map 100% reduce 100%
        11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001
        11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin

        -mapper and -reducer each accept either a script/shell command or a Java class.

        $ hadoop fs -cat streaming/grep_database_mandarin/part-00000

        汇出 [hui4 chu1] /to remit (money)/to export data (e.g. from a database)/
        搜索 [sou1 suo3] /to search/to look for sth/internet search/database search/
        数据库 [shu4 ju4 ku4] /database/
        数据库软件 [shu4 ju4 ku4 ruan3 jian4] /database software/
        资料库 [zi1 liao4 ku4] /database/
Ruby Example (ignore ip list)
    Input
     67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
     192.168.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 96 "-" "Mozilla/4.0"
     189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
     90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
     10.1.10.12 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 51 "-" "Mozilla/5.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     10.1.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 94 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     10.1.10.14 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 24 "-" "Mozilla/4.0"
     66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
     90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
     ...




    Output
     189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
     201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
     66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
     67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
     90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
     90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"
     ...
Ruby Example (ignore ip list)

            #!/usr/bin/env ruby

            # streaming contract: read log lines from STDIN, write survivors to STDOUT
            ignore = %w(127.0.0.1 192.168 10)
            log_regex = /^([\d.]+)\s/

            while (line = STDIN.gets)
              next unless line =~ log_regex
              ip = $1

              # keep the line only if no prefix on the ignore list matches its IP
              print line if ignore.reject { |ignore_ip| ip !~ /^#{ignore_ip}(\.|$)/ }.empty?
            end
Ruby Example (ignore ip list)

        $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
              -input    data/access.log \
              -output   out/streaming/filter_ips \
              -mapper   './script/filter_ips' \
              -reducer  org.apache.hadoop.mapred.lib.IdentityReducer
        11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
        11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not be found.
        11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1
        11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/...]
        11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001
        11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
        ...

        $ hadoop fs -cat out/streaming/filter_ips/part-00000

        189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
        201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
        201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0"
        201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"
        66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
        67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
        90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"
        90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0"

        The healthcheck entries from 10.* and 192.168.* in the input are gone.
SQL -> Hadoop




Simple Query
    Query
     SELECT first_name, last_name FROM people
     WHERE first_name = 'John'
        OR favorite_movie_id = 2

     Input

        id   first_name   last_name   favorite_movie_id
        1    John         Mulligan    3
        2    Samir        Ahmed       5
        3    Royce        Rollins     2
        4    John         Smith       2

     Output

        first_name   last_name
        John         Mulligan
        John         Smith
        Royce        Rollins
Simple Query (Mapper)

               public class SimpleQuery {
               ...
                 public static class SelectAndFilterMapper
                   extends Mapper<Object, Text, TextArrayWritable, Text> {
                   ...
                   public void map(Object key, Text value, Context context)
                     throws IOException {

                     // Extract the columns of this row
                     String[] row = value.toString().split(DELIMITER);

                     try {
                       // WHERE first_name = 'John' OR favorite_movie_id = 2
                       if (row[FIRST_NAME_COLUMN].equals("John") ||
                           row[FAVORITE_MOVIE_ID_COLUMN].equals("2")) {

                         // SELECT first_name, last_name
                         columns.set(new String[] {
                           row[FIRST_NAME_COLUMN],
                           row[LAST_NAME_COLUMN]
                         });

                         // Emit the selected columns as the key, with a blank value
                         context.write(columns, blank);
                       }
                     } catch (InterruptedException ex) { throw new IOException(ex); }
                   }
                 }
               ...
               }
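The key type TextArrayWritable is not spelled out here. A plausible minimal implementation (an assumption, not necessarily the author's class) wraps ArrayWritable with Text elements, adds the set(String[]) convenience the mapper uses, and renders as tab-separated columns so TextOutputFormat prints rows; as a map-output key it must also be comparable:

               import org.apache.hadoop.io.ArrayWritable;
               import org.apache.hadoop.io.Text;
               import org.apache.hadoop.io.Writable;
               import org.apache.hadoop.io.WritableComparable;

               public class TextArrayWritable extends ArrayWritable
                   implements WritableComparable<TextArrayWritable> {

                 public TextArrayWritable() {
                   super(Text.class);
                 }

                 // convenience overload so callers can pass plain strings
                 public void set(String[] values) {
                   Writable[] texts = new Writable[values.length];
                   for (int i = 0; i < values.length; i++) {
                     texts[i] = new Text(values[i]);
                   }
                   set(texts);
                 }

                 // tab-separated rendering, so output rows look like "John<TAB>Mulligan"
                 @Override
                 public String toString() {
                   StringBuilder sb = new StringBuilder();
                   for (String s : toStrings()) {
                     if (sb.length() > 0) sb.append('\t');
                     sb.append(s);
                   }
                   return sb.toString();
                 }

                 @Override
                 public int compareTo(TextArrayWritable other) {
                   return toString().compareTo(other.toString());
                 }
               }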
Simple Query (Running)
               $ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.SimpleQuery \
                            data/people.tsv out/simple_query

               ...
               11/04/04   09:19:15   INFO   mapred.JobClient: map 100% reduce 100%
               11/04/04   09:19:15   INFO   mapred.JobClient: Job complete: job_local_0001
               11/04/04   09:19:15   INFO   mapred.JobClient: Counters: 13
               11/04/04   09:19:15   INFO   mapred.JobClient:   FileSystemCounters
               11/04/04   09:19:15   INFO   mapred.JobClient:     FILE_BYTES_READ=306296
               11/04/04   09:19:15   INFO   mapred.JobClient:     FILE_BYTES_WRITTEN=398676
               11/04/04   09:19:15   INFO   mapred.JobClient:   Map-Reduce Framework
               11/04/04   09:19:15   INFO   mapred.JobClient:     Reduce input groups=3
               11/04/04   09:19:15   INFO   mapred.JobClient:     Combine output records=0
               11/04/04   09:19:15   INFO   mapred.JobClient:     Map input records=4
               11/04/04   09:19:15   INFO   mapred.JobClient:     Reduce shuffle bytes=0
               11/04/04   09:19:15   INFO   mapred.JobClient:     Reduce output records=3
               11/04/04   09:19:15   INFO   mapred.JobClient:     Spilled Records=6
               11/04/04   09:19:15   INFO   mapred.JobClient:     Map output bytes=54
               11/04/04   09:19:15   INFO   mapred.JobClient:     Combine input records=0
               11/04/04   09:19:15   INFO   mapred.JobClient:     Map output records=3
               11/04/04   09:19:15   INFO   mapred.JobClient:     SPLIT_RAW_BYTES=127
               11/04/04   09:19:15   INFO   mapred.JobClient:     Reduce input records=3
               ...




Simple Query (Output)

               $ hadoop fs -cat out/simple_query/part-r-00000

               John    Mulligan
               John    Smith
               Royce   Rollins
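The reduce side is trivial: the selected columns travel as the key with a blank value, so writing each key once per group produces the rows above (and deduplicates repeated rows for free). A minimal sketch, assuming this grouping approach (DistinctReducer is a hypothetical name):

               public static class DistinctReducer
                   extends Reducer<TextArrayWritable, Text, TextArrayWritable, Text> {

                 private final Text blank = new Text("");

                 public void reduce(TextArrayWritable key, Iterable<Text> values,
                                    Context context)
                     throws IOException, InterruptedException {
                   // identical selected rows were grouped under one key; emit it once
                   context.write(key, blank);
                 }
               }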
Join Query
    Query
     SELECT first_name, last_name, movies.name name,
            movies.image
     FROM people JOIN movies ON (
       people.favorite_movie_id = movies.id
     )




Join Query
     Input

        people
        id   first_name   last_name   favorite_movie_id
        1    John         Mulligan    3
        2    Samir        Ahmed       5
        3    Royce        Rollins     2
        4    John         Smith       2

        movies
        id   name         image
        2    The Matrix   http://bit.ly/matrix.jpg
        3    Gatacca      http://bit.ly/g.jpg
        4    AI           http://bit.ly/ai.jpg
        5    Avatar       http://bit.ly/avatar.jpg

     Output

        first_name   last_name   name         image
        John         Mulligan    Gatacca      http://bit.ly/g.jpg
        Samir        Ahmed       Avatar       http://bit.ly/avatar.jpg
        Royce        Rollins     The Matrix   http://bit.ly/matrix.jpg
        John         Smith       The Matrix   http://bit.ly/matrix.jpg

     (Each person's favorite_movie_id is matched against movies.id.)
Join Query (Mapper)
                 public static class SelectAndFilterMapper
                     extends Mapper<Object, Text, Text, TextArrayWritable> {
               ...
                   public void map(Object key, Text value, Context context)
                       throws IOException {

                     // Parse the row and note which input file it came from
                     String[] row = value.toString().split(DELIMITER);
                     String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

                     try {
                       // Classify: tag each record with its source table
                       // and choose the join key (the movie id on both sides)
                       if (fileName.startsWith("people")) {
                         columns.set(new String[] {
                           "people",
                           row[PEOPLE_FIRST_NAME_COLUMN],
                           row[PEOPLE_LAST_NAME_COLUMN]
                         });
                         joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
                       }
                       else if (fileName.startsWith("movies")) {
                         columns.set(new String[] {
                           "movies",
                           row[MOVIES_NAME_COLUMN],
                           row[MOVIES_IMAGE_COLUMN]
                         });
                         joinKey.set(row[MOVIES_ID_COLUMN]);
                       }

                       // Emit: records from both tables meet in the same reduce call
                       context.write(joinKey, columns);

                     } catch (InterruptedException ex) {
                       throw new IOException(ex);
                     }
               ...
Join Query (Mapper)
                 public static class SelectAndFilterMapper
                   extends Mapper<Object, Text, Text, TextArrayWritable> {
               ...
                    public void map(Object key, Text value, Context context)
                      throws IOException {

                          // Parse: split the line into columns, note the source file
                          String [] row = value.toString().split(DELIMITER);
                         String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

                         try {
                           if(fileName.startsWith("people")) {
                             columns.set( new String [] {
                               "people",
                               row[PEOPLE_FIRST_NAME_COLUMN],
                               row[PEOPLE_LAST_NAME_COLUMN]
                             });
                             joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
                           }
                           else if(fileName.startsWith("movies")) {
                             columns.set( new String [] {
                               "movies",
                               row[MOVIES_NAME_COLUMN],
                               row[MOVIES_IMAGE_COLUMN]
                             });

                                joinKey.set(row[MOVIES_ID_COLUMN]);
                            }
                            else {
                              // skip rows from files that match neither table
                              return;
                            }

                            context.write(joinKey, columns);

                         } catch(InterruptedException ex) {
                           throw new IOException(ex);
                         }
               ...
Tuesday, April 5, 2011
Join Query (Mapper)
                 public static class SelectAndFilterMapper
                   extends Mapper<Object, Text, Text, TextArrayWritable> {
               ...
                    public void map(Object key, Text value, Context context)
                      throws IOException {

                          // Parse: split the line into columns, note the source file
                          String [] row = value.toString().split(DELIMITER);
                         String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

                         try {
                            // Classify: tag the row with its table and set the join key
                            if(fileName.startsWith("people")) {
                              columns.set( new String [] {
                                "people",
                                row[PEOPLE_FIRST_NAME_COLUMN],
                                row[PEOPLE_LAST_NAME_COLUMN]
                              });
                              joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
                            }
                           else if(fileName.startsWith("movies")) {
                             columns.set( new String [] {
                               "movies",
                               row[MOVIES_NAME_COLUMN],
                               row[MOVIES_IMAGE_COLUMN]
                             });

                                joinKey.set(row[MOVIES_ID_COLUMN]);
                            }
                            else {
                              // skip rows from files that match neither table
                              return;
                            }

                            context.write(joinKey, columns);

                         } catch(InterruptedException ex) {
                           throw new IOException(ex);
                         }
               ...
Tuesday, April 5, 2011
Join Query (Mapper)
                 public static class SelectAndFilterMapper
                   extends Mapper<Object, Text, Text, TextArrayWritable> {
               ...
                    public void map(Object key, Text value, Context context)
                      throws IOException {

                          // Parse: split the line into columns, note the source file
                          String [] row = value.toString().split(DELIMITER);
                         String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

                         try {
                            // Classify: tag the row with its table and set the join key
                            if(fileName.startsWith("people")) {
                              columns.set( new String [] {
                                "people",
                                row[PEOPLE_FIRST_NAME_COLUMN],
                                row[PEOPLE_LAST_NAME_COLUMN]
                              });
                              joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
                            }
                           else if(fileName.startsWith("movies")) {
                             columns.set( new String [] {
                               "movies",
                               row[MOVIES_NAME_COLUMN],
                               row[MOVIES_IMAGE_COLUMN]
                             });

                                joinKey.set(row[MOVIES_ID_COLUMN]);
                            }
                            else {
                              // skip rows from files that match neither table
                              return;
                            }

                            // Emit: (join key, tagged columns)
                            context.write(joinKey, columns);
                          } catch(InterruptedException ex) {
                            throw new IOException(ex);
                          }
               ...
Tuesday, April 5, 2011
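Both classes lean on a TextArrayWritable with set(String[]) and getTextAt(int)
helpers, which is not a stock Hadoop type. A minimal sketch of what the
hadoop-recipes project presumably defines (the method names come from the calls
above; the toString override is an assumption, added so the reducer's text output
comes out tab-separated):

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.Text;

    public class TextArrayWritable extends ArrayWritable {

      public TextArrayWritable() {
        super(Text.class);   // no-arg constructor required for deserialization
      }

      // Convenience setter used by the mapper and reducer
      public void set(String[] strings) {
        Text[] texts = new Text[strings.length];
        for (int i = 0; i < strings.length; i++) {
          texts[i] = new Text(strings[i]);
        }
        set(texts);
      }

      // Typed accessor used by the reducer
      public Text getTextAt(int index) {
        return (Text) get()[index];
      }

      // Assumption: render as tab-separated columns in text output
      @Override
      public String toString() {
        StringBuilder sb = new StringBuilder();
        for (String s : toStrings()) {
          if (sb.length() > 0) sb.append("\t");
          sb.append(s);
        }
        return sb.toString();
      }
    }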
Join Query (Reducer)
                 public static class CombineMapsReducer
                      extends Reducer<Text,TextArrayWritable,Text, TextArrayWritable> {
               ...
                   public void reduce(Text key, Iterable<TextArrayWritable> values,
                                      Context context
                                      ) throws IOException, InterruptedException {

                         LinkedList<String []> people = new LinkedList<String[]>();
                         LinkedList<String []> movies = new LinkedList<String[]>();

                         for (TextArrayWritable val : values) {
                           String dataset = val.getTextAt(0).toString();
                           if(dataset.equals("people")) {
                             people.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                             });
                           }
                           if(dataset.equals("movies")) {
                             movies.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                             });
                           }
                         }

                         for(String[] person : people) {
                           for(String[] movie : movies) {
                             columns.set(new String[] {
                               person[0], person[1],
                               movie[0], movie[1]
                             });
                             context.write(BLANK, columns);
                           }
                         }
               ...
Tuesday, April 5, 2011
Join Query (Reducer)
                 public static class CombineMapsReducer
                      extends Reducer<Text,TextArrayWritable,Text, TextArrayWritable> {
               ...
                   public void reduce(Text key, Iterable<TextArrayWritable> values,
                                      Context context
                                      ) throws IOException, InterruptedException {

                         LinkedList<String []> people = new LinkedList<String[]>();
                         LinkedList<String []> movies = new LinkedList<String[]>();



                          // Extract: partition the tagged values back into people and movies
                         for (TextArrayWritable val : values) {
                           String dataset = val.getTextAt(0).toString();
                           if(dataset.equals("people")) {
                             people.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                             });
                           }
                           if(dataset.equals("movies")) {
                             movies.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                             });
                           }
                         }

                         for(String[] person : people) {
                           for(String[] movie : movies) {
                             columns.set(new String[] {
                               person[0], person[1],
                               movie[0], movie[1]
                             });
                             context.write(BLANK, columns);
                           }
                         }
               ...
Tuesday, April 5, 2011
Join Query (Reducer)
                 public static class CombineMapsReducer
                      extends Reducer<Text,TextArrayWritable,Text, TextArrayWritable> {
               ...
                   public void reduce(Text key, Iterable<TextArrayWritable> values,
                                      Context context
                                      ) throws IOException, InterruptedException {

                         LinkedList<String []> people = new LinkedList<String[]>();
                         LinkedList<String []> movies = new LinkedList<String[]>();



                          // Extract: partition the tagged values back into people and movies
                         for (TextArrayWritable val : values) {
                           String dataset = val.getTextAt(0).toString();
                           if(dataset.equals("people")) {
                             people.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                             });
                           }
                           if(dataset.equals("movies")) {
                             movies.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                              });
                            }
                          }

                          // people X movies: cross product of the two groups for this key
                          for(String[] person : people) {
                           for(String[] movie : movies) {
                             columns.set(new String[] {
                               person[0], person[1],
                               movie[0], movie[1]
                             });
                             context.write(BLANK, columns);
                           }
                         }
               ...
Tuesday, April 5, 2011
Join Query (Reducer)
                 public static class CombineMapsReducer
                      extends Reducer<Text,TextArrayWritable,Text, TextArrayWritable> {
               ...
                   public void reduce(Text key, Iterable<TextArrayWritable> values,
                                      Context context
                                      ) throws IOException, InterruptedException {

                         LinkedList<String []> people = new LinkedList<String[]>();
                         LinkedList<String []> movies = new LinkedList<String[]>();



                          // Extract: partition the tagged values back into people and movies
                         for (TextArrayWritable val : values) {
                           String dataset = val.getTextAt(0).toString();
                           if(dataset.equals("people")) {
                             people.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                             });
                           }
                           if(dataset.equals("movies")) {
                             movies.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                              });
                            }
                          }

                          // people X movies: cross product of the two groups for this key
                          // SELECT first_name, last_name, movies.name name, movies.image
                          for(String[] person : people) {
                            for(String[] movie : movies) {
                              columns.set(new String[] {
                                person[0], person[1],
                                movie[0], movie[1]
                              });
                              context.write(BLANK, columns);
                           }
                         }
               ...
Tuesday, April 5, 2011
Join Query (Reducer)
                 public static class CombineMapsReducer
                      extends Reducer<Text,TextArrayWritable,Text, TextArrayWritable> {
               ...
                   public void reduce(Text key, Iterable<TextArrayWritable> values,
                                      Context context
                                      ) throws IOException, InterruptedException {

                         LinkedList<String []> people = new LinkedList<String[]>();
                         LinkedList<String []> movies = new LinkedList<String[]>();



                          // Extract: partition the tagged values back into people and movies
                         for (TextArrayWritable val : values) {
                           String dataset = val.getTextAt(0).toString();
                           if(dataset.equals("people")) {
                             people.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                             });
                           }
                           if(dataset.equals("movies")) {
                             movies.add(new String[] {
                               val.getTextAt(1).toString(),
                               val.getTextAt(2).toString(),
                              });
                            }
                          }

                          // people X movies: cross product of the two groups for this key
                          // SELECT first_name, last_name, movies.name name, movies.image
                          for(String[] person : people) {
                            for(String[] movie : movies) {
                              columns.set(new String[] {
                                person[0], person[1],
                                movie[0], movie[1]
                              });
                              // Emit: one joined output row
                              context.write(BLANK, columns);
                            }
                          }
               ...
Tuesday, April 5, 2011
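The deck elides the driver for this job. A minimal sketch in the style of the
word-count and log-processing main() methods shown earlier (the JoinQuery class
name and the three-argument usage are assumptions, not the original code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class JoinQuery {

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
          System.err.println("Usage: JoinQuery <people> <movies> <out>");
          System.exit(2);
        }

        Job job = new Job(conf, "JoinQuery");
        job.setJarByClass(JoinQuery.class);
        job.setMapperClass(SelectAndFilterMapper.class);
        job.setReducerClass(CombineMapsReducer.class);
        // No combiner: partial cross products cannot be merged safely

        // Both mapper and reducer emit (Text, TextArrayWritable)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TextArrayWritable.class);

        // Feed both tables in; the mapper tells them apart by file name
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }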
Hive




Tuesday, April 5, 2011
What is Hive?
                  “Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable
                  easy data ETL, a mechanism to put structures on the data, and the capability to querying
                  and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query
                  language, called QL, that enables users familiar with SQL to query the data. At the same
                  time, this language also allows programmers who are familiar with the MapReduce
                  framework to be able to plug in their custom mappers and reducers to perform more
                  sophisticated analysis that may not be supported by the built-in capabilities of the
                   language.” (Hive wiki: http://wiki.apache.org/hadoop/Hive)




Tuesday, April 5, 2011
Hive Features

                         SerDe

                         MetaStore

                         Query Processor

                            Compiler

                            Processor

                            Functions / UDFs, UDAFs, UDTFs




Tuesday, April 5, 2011
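To make the last bullet concrete: a UDF is just a Java class Hive calls once per
row, resolving evaluate() by reflection. A sketch of the classic plugin pattern
(illustrative, not code from this deck):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Lower-cases a string column, one row at a time
    public final class Lower extends UDF {
      public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
      }
    }

Once the jar is on Hive's classpath, CREATE TEMPORARY FUNCTION my_lower AS 'Lower'
registers it, and it can be used in QL like any built-in. UDAFs and UDTFs follow
the same idea for aggregates and table-generating functions.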
Hive Demo




Tuesday, April 5, 2011
Links

                         http://hadoop.apache.org/

                         https://github.com/synctree/hadoop-recipes

                         http://hadoop.apache.org/common/docs/r0.20.2/streaming.html

                         http://developer.yahoo.com/blogs/hadoop/

                         http://wiki.apache.org/hadoop/Hive




Tuesday, April 5, 2011
Questions?




Tuesday, April 5, 2011
Thanks




Tuesday, April 5, 2011

More Related Content

Similar to Solving real world problems with Hadoop

Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoopdatasalt
 
Functional Programming You Already Know
Functional Programming You Already KnowFunctional Programming You Already Know
Functional Programming You Already KnowKevlin Henney
 
Map reduce模型
Map reduce模型Map reduce模型
Map reduce模型dhlzj
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusKoichi Fujikawa
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInVitaly Gordon
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingEd Kohlwey
 
실시간 인벤트 처리
실시간 인벤트 처리실시간 인벤트 처리
실시간 인벤트 처리Byeongweon Moon
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Rohit Agrawal
 
Coding convention
Coding conventionCoding convention
Coding conventionKhoa Nguyen
 
Scoobi - Scala for Startups
Scoobi - Scala for StartupsScoobi - Scala for Startups
Scoobi - Scala for Startupsbmlever
 

Similar to Solving real world problems with Hadoop (20)

Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
 
Functional Programming You Already Know
Functional Programming You Already KnowFunctional Programming You Already Know
Functional Programming You Already Know
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Map reduce模型
Map reduce模型Map reduce模型
Map reduce模型
 
Hw09 Hadoop + Clojure
Hw09   Hadoop + ClojureHw09   Hadoop + Clojure
Hw09 Hadoop + Clojure
 
Hadoop
HadoopHadoop
Hadoop
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
CPPDS Slide.pdf
CPPDS Slide.pdfCPPDS Slide.pdf
CPPDS Slide.pdf
 
실시간 인벤트 처리
실시간 인벤트 처리실시간 인벤트 처리
실시간 인벤트 처리
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Java
JavaJava
Java
 
Interpreter Case Study - Design Patterns
Interpreter Case Study - Design PatternsInterpreter Case Study - Design Patterns
Interpreter Case Study - Design Patterns
 
Hadoop
HadoopHadoop
Hadoop
 
Coding convention
Coding conventionCoding convention
Coding convention
 
An introduction to scala
An introduction to scalaAn introduction to scala
An introduction to scala
 
Scoobi - Scala for Startups
Scoobi - Scala for StartupsScoobi - Scala for Startups
Scoobi - Scala for Startups
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Solving real world problems with Hadoop

  • 1. Solving Real World Problems with Hadoop and SQL -> Hadoop Masahji Stewart <masahji@synctree.com> Tuesday, April 5, 2011
  • 2. Solving Real World Problems with Hadoop Tuesday, April 5, 2011
  • 3. Word Count Input MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster ... Tuesday, April 5, 2011
  • 4. Word Count Input MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster ... Output as! ! ! ! 1 MapReduce!! 1 (nodes),! ! 1 certain! ! 1 cluster! ! 1 a! ! ! ! 3 collectively! 1 computers!! 1 is! ! ! ! 1 datasets! ! 1 distributable! 1 large! ! ! 1 framework!! 1 for!! ! ! 1 processing! 1 huge! ! ! 1 kinds! ! ! 1 using! ! ! 1 number!! ! 1 of! ! ! ! 2 on! ! ! ! 1 problems! ! 1 referred! ! 1 to! ! ! ! 1 Tuesday, April 5, 2011
  • 5. Word Count (Mapper) public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } Tuesday, April 5, 2011
  • 6. Word Count (Mapper) public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { Extract StringTokenizer itr = word = “MapReduce” new StringTokenizer(value.toString()); word = ”is” while (itr.hasMoreTokens()) { word = “a” word.set(itr.nextToken()); ... context.write(word, one); } } } Tuesday, April 5, 2011
  • 7. Word Count (Mapper) public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); Emit } “MapReduce”, 1 } } “is”, 1 “a”, 1 ... Tuesday, April 5, 2011
  • 8. Word Count (Reducer) public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } Tuesday, April 5, 2011
  • 9. Word Count (Reducer) public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { Sum private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context key=“of” ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum = 2 sum += val.get(); } result.set(sum); context.write(key, result); } } Tuesday, April 5, 2011
  • 10. Word Count (Reducer) public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); Emit } result.set(sum); context.write(key, result); } “of”, 2 } Tuesday, April 5, 2011
  • 11. Word Count (Running) $ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount -D mapred.reduce.tasks=3 input_file out 11/04/03 21:21:27 INFO mapred.JobClient: Default number of map tasks: 2 11/04/03 21:21:27 INFO mapred.JobClient: Default number of reduce tasks: 3 11/04/03 21:21:28 INFO input.FileInputFormat: Total input paths to process : 1 11/04/03 21:21:29 INFO mapred.JobClient: Running job: job_201103252110_0659 11/04/03 21:21:30 INFO mapred.JobClient: map 0% reduce 0% 11/04/03 21:21:37 INFO mapred.JobClient: map 100% reduce 0% 11/04/03 21:21:49 INFO mapred.JobClient: map 100% reduce 33% 11/04/03 21:21:52 INFO mapred.JobClient: map 100% reduce 66% 11/04/03 21:22:05 INFO mapred.JobClient: map 100% reduce 100% 11/04/03 21:22:08 INFO mapred.JobClient: Job complete: job_201103252110_0659 11/04/03 21:22:08 INFO mapred.JobClient: Counters: 17 ... 11/04/03 21:22:08 INFO mapred.JobClient: Map output bytes=286 11/04/03 21:22:08 INFO mapred.JobClient: Combine input records=27 11/04/03 21:22:08 INFO mapred.JobClient: Map output records=27 11/04/03 21:22:08 INFO mapred.JobClient: Reduce input records=24 Tuesday, April 5, 2011
  • 12. Word Count (Output) $ hadoop@ip-10-245-210-191:~$ hadoop fs -ls out Found 3 items -rw-r--r-- 2 hadoop supergroup 90 2011-04-03 21:21 /user/hadoop/out/part-r-00000 -rw-r--r-- 2 hadoop supergroup 80 2011-04-03 21:21 /user/hadoop/out/part-r-00001 -rw-r--r-- 2 hadoop supergroup 49 2011-04-03 21:21 /user/hadoop/out/part-r-00002 $ hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00000 as! 1 certain! 1 collectively! 1 datasets! 1 framework!1 ... $ hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00001 A file per reducer MapReduce!1 cluster! 1 computers!1 distributable!1 for!1 ... $ hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00002 (nodes),! 1 a! 3 is! 1 large! 1 processing! 1 using! 1 Tuesday, April 5, 2011
  • 13. Word Count (Output) $ hadoop@ip-10-245-210-191:~$ hadoop fs -ls out Found 3 items -rw-r--r-- 2 hadoop supergroup 90 2011-04-03 21:21 /user/hadoop/out/part-r-00000 -rw-r--r-- 2 hadoop supergroup 80 2011-04-03 21:21 /user/hadoop/out/part-r-00001 -rw-r--r-- 2 hadoop supergroup 49 2011-04-03 21:21 /user/hadoop/out/part-r-00002 $ hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00000 as! 1 certain! 1 collectively! 1 datasets! 1 framework!1 ... $ hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00001 MapReduce!1 cluster! 1 computers!1 distributable!1 for!1 ... $ hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00002 (nodes),! 1 a! 3 is! 1 large! 1 processing! 1 using! 1 Tuesday, April 5, 2011
  • 14. Word Count Input Split Map Shuffle/Sort Reduce Output as 1 certain 1 collectively 1 MAP datasets 1 MapReduce is a framework 1 MapReduce is a framework for huge 1 processsing number 1 framework for on 1 REDUCE processing referred 1 huge datasets on MAP to 1 huge datasets certain kinds of on certain kinds distributable MapReduce 1 of distributable cluster 1 problems using a computers 1 problems using distributable 1 large number of MAP REDUCE for 1 a large number computers kinds 1 o f c o m p u te r s of 2 problems 1 ( n o d e s ) , (nodes) collectively collectively referrered to as a MAP referred to as a REDUCE (nodes), 1 cluster a 3 cluster is 1 ... large 1 MAP processing 1 using 1 Tuesday, April 5, 2011
  • 15. Log Processing (Date IP COUNT) Input 67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0") 189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-" 90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0" 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0" 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0" 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0" 66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0" 90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0" ... Tuesday, April 5, 2011
  • 16. Log Processing (Date IP COUNT) Input 67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0") 189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-" 90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0" 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0" 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0" 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0" 66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0" 90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0" ... Output 18/Jul/2010! ! 189.186.9.181! 1 18/Jul/2010! ! 201.201.16.82! 3 18/Jul/2010! ! 66.195.114.59! 1 18/Jul/2010! ! 67.195.114.59! 1 18/Jul/2010! ! 90.221.175.16! 1 19/Jul/2010! ! 90.221.75.196! 1 ... Tuesday, April 5, 2011
  • 17. Log Processing (Mapper) public static final Pattern LOG_PATTERN = Pattern.compile("^ ([d.]+) (S+) (S+) [(([w/]+):([d:]+)s[+-]d{4}) ] "(.+?)" (d{3}) (d+) "([^"]+)" "([^"]+)""); public static class ExtractDateAndIpMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text ip = new Text(); public void map(Object key, Text value, Context context) throws IOException { String text = value.toString(); Matcher matcher = LOG_PATTERN.matcher(text); while (matcher.find()) { try { ip.set(matcher.group(5) + "t" + matcher.group(1)); context.write(ip, one); } catch(InterruptedException ex) { throw new IOException(ex); } } } } Tuesday, April 5, 2011
  • 18. Log Processing (Mapper) public static final Pattern LOG_PATTERN = Pattern.compile("^ ([d.]+) (S+) (S+) [(([w/]+):([d:]+)s[+-]d{4}) ] "(.+?)" (d{3}) (d+) "([^"]+)" "([^"]+)""); public static class ExtractDateAndIpMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text ip = new Text(); public void map(Object key, Text value, Context context) Extract throws IOException { ip = “189.186.9.181” ip = ”201.201.16.82” String text = value.toString(); ip = “66.249.67.57” Matcher matcher = LOG_PATTERN.matcher(text); ... while (matcher.find()) { try { ip.set(matcher.group(5) + "t" + matcher.group(1)); context.write(ip, one); } catch(InterruptedException ex) { throw new IOException(ex); } } } } Tuesday, April 5, 2011
  • 19. Log Processing (Mapper) public static final Pattern LOG_PATTERN = Pattern.compile("^ ([d.]+) (S+) (S+) [(([w/]+):([d:]+)s[+-]d{4}) ] "(.+?)" (d{3}) (d+) "([^"]+)" "([^"]+)""); public static class ExtractDateAndIpMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text ip = new Text(); public void map(Object key, Text value, Context context) throws IOException { String text = value.toString(); Matcher matcher = LOG_PATTERN.matcher(text); while (matcher.find()) { try { ip.set(matcher.group(5) + "t" + matcher.group(1)); Emit context.write(ip, one); } catch(InterruptedException ex) { throw new IOException(ex); “18/Jul/2010t189.186.9.181”, } ... } } } Tuesday, April 5, 2011
  • 20. Log Processing (main) public class LogAggregator { ... public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); if (otherArgs.length != 2) { System.err.println("Usage: LogAggregator <in> <out>"); System.exit(2); } Job job = new Job(conf, "LogAggregator"); job.setJarByClass(LogAggregator.class); job.setMapperClass(ExtractDateAndIpMapper.class); job.setCombinerClass(WordCount.IntSumReducer.class); job.setReducerClass(WordCount.IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Tuesday, April 5, 2011
  • 21. Log Processing (main) public class LogAggregator { ... public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); if (otherArgs.length != 2) { System.err.println("Usage: LogAggregator <in> <out>"); System.exit(2); } Job job = new Job(conf, "LogAggregator"); job.setJarByClass(LogAggregator.class); job.setMapperClass(ExtractDateAndIpMapper.class); job.setCombinerClass(WordCount.IntSumReducer.class); Mapper job.setReducerClass(WordCount.IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Tuesday, April 5, 2011
  • 22. Log Processing (main) public class LogAggregator { ... public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); if (otherArgs.length != 2) { System.err.println("Usage: LogAggregator <in> <out>"); System.exit(2); } Job job = new Job(conf, "LogAggregator"); job.setJarByClass(LogAggregator.class); job.setMapperClass(ExtractDateAndIpMapper.class); job.setCombinerClass(WordCount.IntSumReducer.class); job.setReducerClass(WordCount.IntSumReducer.class); Reducer job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Tuesday, April 5, 2011
  • 23. Log Processing (main) public class LogAggregator { ... public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); if (otherArgs.length != 2) { System.err.println("Usage: LogAggregator <in> <out>"); System.exit(2); } Job job = new Job(conf, "LogAggregator"); job.setJarByClass(LogAggregator.class); job.setMapperClass(ExtractDateAndIpMapper.class); job.setCombinerClass(WordCount.IntSumReducer.class); job.setReducerClass(WordCount.IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); } System.exit(job.waitForCompletion(true) ? 0 : 1); Input/ } Output Settings Tuesday, April 5, 2011
  • 24. Log Processing (main) public class LogAggregator { ... public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); if (otherArgs.length != 2) { System.err.println("Usage: LogAggregator <in> <out>"); System.exit(2); } Job job = new Job(conf, "LogAggregator"); job.setJarByClass(LogAggregator.class); job.setMapperClass(ExtractDateAndIpMapper.class); job.setCombinerClass(WordCount.IntSumReducer.class); job.setReducerClass(WordCount.IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Run it! Tuesday, April 5, 2011
  • 25. Log Processing (Running) $ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.LogAggregator -libjars hadoop-examples.jar data/access.log log_results 11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with 11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1 11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop- examples.jar in /tmp/hadoop-masahji/mapred/local/ archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/ hadoop-recipes-work--8125788655475885988 with rwxr-xr-x 11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:/// Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/ mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/ Development/hadoop-recipes/hadoop-examples.jar 11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:/// Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/ mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/ Development/hadoop-recipes/hadoop-examples.jar 11/04/04 00:51:32 INFO mapred.JobClient: map 100% reduce 100% Tuesday, April 5, 2011
  • 26. Log Processing (Running) $ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.LogAggregator -libjars hadoop-examples.jar data/access.log log_results 11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with 11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1 11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop- examples.jar in /tmp/hadoop-masahji/mapred/local/ archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/ hadoop-recipes-work--8125788655475885988 with rwxr-xr-x 11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:/// Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/ mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/ Development/hadoop-recipes/hadoop-examples.jar 11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:/// Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/ mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/ Development/hadoop-recipes/hadoop-examples.jar 11/04/04 00:51:32 INFO mapred.JobClient: map 100% reduce 100% JAR placed into Distributed Cache Tuesday, April 5, 2011
  • 27. Log Processing (Output) $ hadoop fs -ls log_results Found 2 items -rwxrwxrwx 1 masahji staff 0 2011-04-04 00:51 log_results/_SUCCESS -rwxrwxrwx 1 masahji staff 168 2011-04-04 00:51 log_results/part-r-00000 $ hadoop fs -cat log_results/part-r-00000 18/Jul/2010! 189.186.9.181!1 18/Jul/2010! 201.201.16.82!3 18/Jul/2010! 66.195.114.59!1 18/Jul/2010! 67.195.114.59!1 18/Jul/2010! 90.221.175.16!1 19/Jul/2010! 90.221.75.196!1 ... Tuesday, April 5, 2011
  • 28. Hadoop Streaming Fork Mapper / Task Tracker Reducer STDIN STDOUT script Tuesday, April 5, 2011
  • 29. Basic grep Input ... [sou1 suo3] /to search/.../internet search/database search/ [ji2 ri4] /propitious day/lucky day/ [ji2 xiang2] /lucky/auspicious/propitious/ [duo1 duo1] /to cluck one's tongue/tut-tut/ 鹊 [xi3 que4] /black-billed magpie, legendary bringer of good luck/ ... Tuesday, April 5, 2011
  • 30. Basic grep Input ... [sou1 suo3] /to search/.../internet search/database search/ [ji2 ri4] /propitious day/lucky day/ [ji2 xiang2] /lucky/auspicious/propitious/ [duo1 duo1] /to cluck one's tongue/tut-tut/ 鹊 [xi3 que4] /black-billed magpie, legendary bringer of good luck/ ... Output ... 汇 [hui4 chu1] /to export data (e.g. from a database)/! [sou1 suo3] /to search/.../internet search/database search/! 库 [shu4 ju4 ku4] /database/! 库软 [shu4 ju4 ku4 ruan3 jian4] /database software/! 资 库 [zi1 liao4 ku4] /database// ... Tuesday, April 5, 2011
  • 31. Basic grep $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input data/cedict.txt.gz -output streaming/grep_database_mandarin -mapper 'grep database' -reducer org.apache.hadoop.mapred.lib.IdentityReducer ... 11/04/04 05:27:58 INFO streaming.StreamJob: map 100% reduce 100% 11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001 11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin Tuesday, April 5, 2011
  • 32. Basic grep $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar Scripts or -input data/cedict.txt.gz -output streaming/grep_database_mandarin -mapper 'grep database' ... -reducer org.apache.hadoop.mapred.lib.IdentityReducer Java Classes 11/04/04 05:27:58 INFO streaming.StreamJob: map 100% reduce 100% 11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001 11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin Tuesday, April 5, 2011
  • 33. Basic grep $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input data/cedict.txt.gz -output streaming/grep_database_mandarin -mapper 'grep database' -reducer org.apache.hadoop.mapred.lib.IdentityReducer ... 11/04/04 05:27:58 INFO streaming.StreamJob: map 100% reduce 100% 11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001 11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin $ hadoop fs -cat streaming/grep_database_mandarin/part-00000 汇 [hui4 chu1] /to remit (money)//to export data (e.g. from a database)/! [sou1 suo3] /to search/to look for sth/internet search/database search/! 库 [shu4 ju4 ku4] /database/! 库软 [shu4 ju4 ku4 ruan3 jian4] /database software/! 资 库 [zi1 liao4 ku4] /database/ Tuesday, April 5, 2011
  • 34. Ruby Example (ignore ip list) Input 67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0") 192.168.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 96 "-" "Mozilla/4.0" 189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-" 90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0" 10.1.10.12 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 51 "-" "Mozilla/5.0" 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0" 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0" 10.1.10.4 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 94 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0") 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/4.0" 10.1.10.14 - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 24 "-" "Mozilla/4.0" 66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0" 90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0" ... Output 189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"! 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla 4.0"! 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/1450" "Mozilla/ 4.0"! 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/4.0"! 66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"! 67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")! 90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"! 90.221.75.196 - - [19/Jul/2010] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"! 90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0" ... Tuesday, April 5, 2011
  • 35. Ruby Example (ignore ip list) #!/usr/bin/env ruby ignore = %w(127.0.0.1 192.168 10) log_regex = /^([d.]+)s/ Read STDIN while(line = STDIN.gets) Write STDOUT next unless line =~ log_regex ip = $1 print line if ignore.reject { |ignore_ip| ip !~ /^#{ignore_ip}(.|$)/ }.empty? end Tuesday, April 5, 2011
  • 36. Ruby Example (ignore ip list) #!/usr/bin/env ruby ignore = %w(127.0.0.1 192.168 10) log_regex = /^([d.]+)s/ while(line = STDIN.gets) next unless line =~ log_regex ip = $1 print line if ignore.reject { |ignore_ip| ip !~ /^#{ignore_ip}(.|$)/ }.empty? end Tuesday, April 5, 2011
  • 37. Ruby Example (ignore ip list) $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input data/access.log -output out/streaming/filter_ips -mapper './script/filter_ips' -reducer org.apache.hadoop.mapred.lib.IdentityReducer 11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with 11/04/04 11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not 11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1 11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/ 11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001 11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop) ... Tuesday, April 5, 2011
  • 38. Ruby Example (ignore ip list) $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input data/access.log -output out/streaming/filter_ips -mapper './script/filter_ips' -reducer org.apache.hadoop.mapred.lib.IdentityReducer 11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with 11/04/04 11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not 11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1 11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/ 11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001 11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop) ... $ hadoop fs -cat out/streaming/filter_ips/part-00000 ...! 189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"! 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "http://www.fuel.tv/Gear/blogs/view/ 1450" "Mozilla/4.0"! 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "http://www.fuel.tv/Gear/blogs/view/ 1450" "Mozilla/4.0"! 201.201.16.82 - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "http://www.fuel.tv/Gear" "Mozilla/ 4.0"! 66.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"! 67.195.114.59 - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/ 3.0")! 90.221.175.16 - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "http://www.wearelistening.org" "Mozilla/5.0"! 90.221.75.196 - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "http://www.fuel.tv/music" "Mozilla/5.0" Tuesday, April 5, 2011
• 39. SQL -> Hadoop
• 40. Simple Query
Query
SELECT first_name, last_name
FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2
• 41. Simple Query
Query
SELECT first_name, last_name
FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2

Input
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2
• 42. Simple Query
Query
SELECT first_name, last_name
FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2

Input
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

Output
first_name  last_name
John        Mulligan
John        Smith
Royce       Rollins
• 43. Simple Query (Mapper)
public class SimpleQuery {
  ...
  public static class SelectAndFilterMapper
       extends Mapper<Object, Text, TextArrayWritable, Text> {
    ...
    public void map(Object key, Text value, Context context)
        throws IOException {
      String [] row = value.toString().split(DELIMITER);
      try {
        if( row[FIRST_NAME_COLUMN].equals("John") ||
            row[FAVORITE_MOVIE_ID_COLUMN].equals("2") ) {
          columns.set(
            new String[] {
              row[FIRST_NAME_COLUMN], row[LAST_NAME_COLUMN]
            });
          context.write(columns, blank);
        }
      } catch(InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
  ...
}
• 44-47. Simple Query (Mapper)
The next four builds of the previous slide add callouts; here they appear as comments at the lines they point to.
public class SimpleQuery {
  ...
  public static class SelectAndFilterMapper
       extends Mapper<Object, Text, TextArrayWritable, Text> {
    ...
    public void map(Object key, Text value, Context context)
        throws IOException {
      // Extract: split the TSV row into columns
      String [] row = value.toString().split(DELIMITER);
      try {
        // WHERE first_name = 'John' OR favorite_movie_id = 2
        if( row[FIRST_NAME_COLUMN].equals("John") ||
            row[FAVORITE_MOVIE_ID_COLUMN].equals("2") ) {
          // SELECT first_name, last_name
          columns.set(
            new String[] {
              row[FIRST_NAME_COLUMN], row[LAST_NAME_COLUMN]
            });
          // Emit the projected columns as the map output key
          context.write(columns, blank);
        }
      } catch(InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
  ...
}
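One housekeeping note before running the job: the TextArrayWritable used as the map output key is a helper from the companion hadoop-recipes project, not a class that ships with Hadoop, and its definition never appears in the deck. A minimal sketch of what it has to provide (an ArrayWritable fixed to Text, the set/getTextAt helpers used above and in the join reducer later, plus key ordering) might look like this; treat the details as assumptions, not the repo's actual code.

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

// Sketch only: the real class lives in
// https://github.com/synctree/hadoop-recipes and may differ in detail.
public class TextArrayWritable extends ArrayWritable
    implements WritableComparable<TextArrayWritable> {

  public TextArrayWritable() {
    super(Text.class); // element type is needed to deserialize values
  }

  // Convenience setter used by the mappers: wrap plain Strings as Text.
  public void set(String[] strings) {
    Text[] texts = new Text[strings.length];
    for (int i = 0; i < strings.length; i++) {
      texts[i] = new Text(strings[i]);
    }
    set(texts);
  }

  // Accessor used by the reducers in this deck.
  public Text getTextAt(int index) {
    return (Text) get()[index];
  }

  // Map output keys must be comparable; compare element by element.
  // (A full implementation would also override hashCode/equals so that
  // hash partitioning groups equal keys onto the same reducer.)
  public int compareTo(TextArrayWritable other) {
    Writable[] a = get(), b = other.get();
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = ((Text) a[i]).compareTo((Text) b[i]);
      if (cmp != 0) return cmp;
    }
    return a.length - b.length;
  }
}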
• 48. Simple Query (Running)
$ hadoop jar target/hadoop-recipes-1.0.jar \
    com.synctree.hadoop.recipes.SimpleQuery \
    data/people.tsv out/simple_query
...
11/04/04 09:19:15 INFO mapred.JobClient: map 100% reduce 100%
11/04/04 09:19:15 INFO mapred.JobClient: Job complete: job_local_0001
11/04/04 09:19:15 INFO mapred.JobClient: Counters: 13
11/04/04 09:19:15 INFO mapred.JobClient:   FileSystemCounters
11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_READ=306296
11/04/04 09:19:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=398676
11/04/04 09:19:15 INFO mapred.JobClient:   Map-Reduce Framework
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input groups=3
11/04/04 09:19:15 INFO mapred.JobClient:     Combine output records=0
11/04/04 09:19:15 INFO mapred.JobClient:     Map input records=4
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce output records=3
11/04/04 09:19:15 INFO mapred.JobClient:     Spilled Records=6
11/04/04 09:19:15 INFO mapred.JobClient:     Map output bytes=54
11/04/04 09:19:15 INFO mapred.JobClient:     Combine input records=0
11/04/04 09:19:15 INFO mapred.JobClient:     Map output records=3
11/04/04 09:19:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
11/04/04 09:19:15 INFO mapred.JobClient:     Reduce input records=3
...
The counters tell the story: 4 map input records (the rows of people.tsv), 3 map output records (Samir Ahmed fails the WHERE clause), and 3 reduce output records.
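The deck never shows SimpleQuery's reduce side, but the counters above (3 reduce input groups in, 3 records out) and the sorted, de-duplicated listing on the next slide suggest a reducer that writes each distinct key once, MapReduce's version of SELECT DISTINCT. A plausible sketch, with the class name invented here:

// Sketch only: SimpleQuery's actual reducer is not shown in the deck.
public static class DistinctProjectionReducer
    extends Reducer<TextArrayWritable, Text, TextArrayWritable, Text> {

  private static final Text BLANK = new Text("");

  public void reduce(TextArrayWritable key, Iterable<Text> values,
                     Context context)
      throws IOException, InterruptedException {
    // Duplicate (first_name, last_name) pairs all land in one group;
    // writing the key once per group makes the output distinct.
    context.write(key, BLANK);
  }
}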
• 49. Simple Query (Running)
$ hadoop fs -cat out/simple_query/part-r-00000
John    Mulligan
John    Smith
Royce   Rollins
• 50. Join Query
Query
SELECT first_name,
       last_name,
       movies.name name,
       movies.image
FROM people
JOIN movies ON ( people.favorite_movie_id = movies.id )
• 51. Join Query
Input
people
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

movies
id  name        image
2   The Matrix  http://bit.ly/matrix.jpg
3   Gatacca     http://bit.ly/g.jpg
4   AI          http://bit.ly/ai.jpg
5   Avatar      http://bit.ly/avatar.jpg
• 52. Join Query
Input
people
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

movies
id  name        image
2   The Matrix  http://bit.ly/matrix.jpg
3   Gatacca     http://bit.ly/g.jpg
4   AI          http://bit.ly/ai.jpg
5   Avatar      http://bit.ly/avatar.jpg

Output
first_name  last_name  name        image
John        Mulligan   Gatacca     http://bit.ly/g.jpg
Samir       Ahmed      Avatar      http://bit.ly/avatar.jpg
Royce       Rollins    The Matrix  http://bit.ly/matrix.jpg
John        Smith      The Matrix  http://bit.ly/matrix.jpg
• 53. Join Query (Mapper)
public static class SelectAndFilterMapper
     extends Mapper<Object, Text, Text, TextArrayWritable> {
  ...
  public void map(Object key, Text value, Context context)
      throws IOException {
    String [] row = value.toString().split(DELIMITER);
    String fileName =
      ((FileSplit) context.getInputSplit()).getPath().getName();
    try {
      if(fileName.startsWith("people")) {
        columns.set(new String [] {
          "people",
          row[PEOPLE_FIRST_NAME_COLUMN],
          row[PEOPLE_LAST_NAME_COLUMN]
        });
        joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
      } else if(fileName.startsWith("movies")) {
        columns.set(new String [] {
          "movies",
          row[MOVIES_NAME_COLUMN],
          row[MOVIES_IMAGE_COLUMN]
        });
        joinKey.set(row[MOVIES_ID_COLUMN]);
      }
      context.write(joinKey, columns);
    } catch(InterruptedException ex) {
      throw new IOException(ex);
    }
  ...
• 54-56. Join Query (Mapper)
The next three builds of the previous slide add callouts; here they appear as comments at the lines they point to.
public static class SelectAndFilterMapper
     extends Mapper<Object, Text, Text, TextArrayWritable> {
  ...
  public void map(Object key, Text value, Context context)
      throws IOException {
    // Parse: split the row and note which input file it came from
    String [] row = value.toString().split(DELIMITER);
    String fileName =
      ((FileSplit) context.getInputSplit()).getPath().getName();
    try {
      // Classify: tag each record with its source table and pick its join key
      if(fileName.startsWith("people")) {
        columns.set(new String [] {
          "people",
          row[PEOPLE_FIRST_NAME_COLUMN],
          row[PEOPLE_LAST_NAME_COLUMN]
        });
        joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
      } else if(fileName.startsWith("movies")) {
        columns.set(new String [] {
          "movies",
          row[MOVIES_NAME_COLUMN],
          row[MOVIES_IMAGE_COLUMN]
        });
        joinKey.set(row[MOVIES_ID_COLUMN]);
      }
      // Emit: keyed by movie id, so matching rows meet in one reduce group
      context.write(joinKey, columns);
    } catch(InterruptedException ex) {
      throw new IOException(ex);
    }
  ...
• 57. Join Query (Reducer)
public static class CombineMapsReducer
     extends Reducer<Text,TextArrayWritable,Text,TextArrayWritable> {
  ...
  public void reduce(Text key, Iterable<TextArrayWritable> values,
                     Context context)
      throws IOException, InterruptedException {
    LinkedList<String []> people = new LinkedList<String[]>();
    LinkedList<String []> movies = new LinkedList<String[]>();
    for (TextArrayWritable val : values) {
      String dataset = val.getTextAt(0).toString();
      if(dataset.equals("people")) {
        people.add(new String[] {
          val.getTextAt(1).toString(),
          val.getTextAt(2).toString(),
        });
      }
      if(dataset.equals("movies")) {
        movies.add(new String[] {
          val.getTextAt(1).toString(),
          val.getTextAt(2).toString(),
        });
      }
    }
    for(String[] person : people) {
      for(String[] movie : movies) {
        columns.set(new String[] {
          person[0], person[1], movie[0], movie[1]
        });
        context.write(BLANK, columns);
      }
    }
  ...
• 58-61. Join Query (Reducer)
The next four builds of the previous slide add callouts; here they appear as comments at the lines they point to.
public static class CombineMapsReducer
     extends Reducer<Text,TextArrayWritable,Text,TextArrayWritable> {
  ...
  public void reduce(Text key, Iterable<TextArrayWritable> values,
                     Context context)
      throws IOException, InterruptedException {
    LinkedList<String []> people = new LinkedList<String[]>();
    LinkedList<String []> movies = new LinkedList<String[]>();
    // Extract: sort each tagged value into its source-table bucket
    for (TextArrayWritable val : values) {
      String dataset = val.getTextAt(0).toString();
      if(dataset.equals("people")) {
        people.add(new String[] {
          val.getTextAt(1).toString(),
          val.getTextAt(2).toString(),
        });
      }
      if(dataset.equals("movies")) {
        movies.add(new String[] {
          val.getTextAt(1).toString(),
          val.getTextAt(2).toString(),
        });
      }
    }
    // people X movies: cross the two buckets that share this join key
    for(String[] person : people) {
      for(String[] movie : movies) {
        // SELECT first_name, last_name, movies.name name, movies.image
        columns.set(new String[] {
          person[0], person[1], movie[0], movie[1]
        });
        // Emit one joined output row
        context.write(BLANK, columns);
      }
    }
  ...
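The deck leaves out the driver that wires the join mapper and reducer together. For completeness, a minimal sketch under the usual new-API conventions; the class name, job wiring, and single shared input directory are assumptions, not the repo's actual code.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinQuery {
  // SelectAndFilterMapper and CombineMapsReducer from the previous
  // slides go here as static nested classes.

  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(JoinQuery.class);
    job.setJobName("join-query");

    job.setMapperClass(SelectAndFilterMapper.class);
    job.setReducerClass(CombineMapsReducer.class);

    // Map output: movie id -> tagged column array
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(TextArrayWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(TextArrayWritable.class);

    // Assumes both the people* and movies* files sit in one input
    // directory, so the mapper can tell them apart by file name.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}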
• 63. What is Hive?
“Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.”
• 64. Hive Features
- SerDe
- MetaStore
- Query Processor (Compiler, Processor)
- Functions: UDFs, UDAFs, UDTFs
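To tie the two halves of the talk together: once the TSV files from the join example are registered with the MetaStore, the entire hand-written join collapses into a few lines of QL. A sketch, assuming a data/movies.tsv laid out like the movies table on slide 51 (only people.tsv actually appears in the deck):

-- Describe the raw TSV files to Hive's MetaStore
-- (schemas assumed from the tables on slides 51-52).
CREATE TABLE people (
  id INT, first_name STRING, last_name STRING, favorite_movie_id INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE movies (id INT, name STRING, image STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'data/people.tsv' INTO TABLE people;
LOAD DATA LOCAL INPATH 'data/movies.tsv' INTO TABLE movies;

-- Hive compiles this into the same kind of map/shuffle/reduce join
-- hand-written on slides 53-61.
SELECT first_name, last_name, movies.name AS name, movies.image
FROM people JOIN movies ON (people.favorite_movie_id = movies.id);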
• 66. Links
http://hadoop.apache.org/
https://github.com/synctree/hadoop-recipes
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
http://developer.yahoo.com/blogs/hadoop/
http://wiki.apache.org/hadoop/Hive