Cascalog
                   Logic programming for Hadoop




                             Bertrand Dechoux   October 13, 2012




Saturday, October 13, 2012
MapReduce : and you?

 Python
      ▶   map(function, iterable, ...)
      ▶   reduce(function,iterable[, initializer])


 Perl
      ▶   map BLOCK LIST
      ▶   reduce BLOCK LIST


 Ruby
      ▶   map {|item| block} -> new_ary / collect {|item| block} -> new_ary
      ▶   reduce(initial,sym) -> obj / inject(initial,sym) -> obj


 Smalltalk
      ▶   collect:aBlock=TheArray
      ▶   inject: thisValue into: binaryBlock


 PHP
      ▶   array array_map ( callable $callback, array $arr1 [, array $...])
      ▶   mixed array_reduce (array $input, callable $function [, mixed $initial = NULL])
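
A minimal sketch of the two primitives listed above, using the Python signatures from this slide. The word list is made up for illustration, and in Python 3 `reduce` lives in `functools` rather than being a builtin.

```python
from functools import reduce

# Hypothetical input, chosen only for illustration
words = ["hadoop", "cascading", "cascalog"]

# map: apply a function to every element of the iterable
lengths = list(map(len, words))

# reduce: fold the mapped values into a single result, starting from 0
total = reduce(lambda acc, n: acc + n, lengths, 0)
```

The same shape (transform each element, then fold) is what Hadoop distributes across machines.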




Hadoop MapReduce : the theory




 Map
      ▶   Map(k1,v1) -> list(k2,v2)



 Reduce
      ▶   Reduce(k2, list (v2)) -> list(k3,v3)




Hadoop MapReduce : the theory




 Map
      ▶ Map(k1,v1) -> list(k2,v2)
      ▶ SortByKey(list(k2,v2)) -> list(k2,v2)



 Reduce
      ▶ MergeByKey(list,list,...) -> list(k2,list(v2))
      ▶ Reduce(k2, list (v2)) -> list(k3,v3)
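
The four steps above can be sketched in a single process as a word count. This is only an illustration of the data flow, not how Hadoop is implemented: in a real cluster the sort happens on each mapper and the merge happens across mappers on the reducer side. The names `map_fn`, `reduce_fn` and `run` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(offset, line):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word of the line
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(k3, v3): sum the counts of one word
    return [(word, sum(counts))]

def run(lines):
    # Map phase: the line offset plays the role of k1
    mapped = [pair for offset, line in enumerate(lines)
              for pair in map_fn(offset, line)]
    # SortByKey(list(k2, v2)) -> list(k2, v2)
    mapped.sort(key=itemgetter(0))
    # MergeByKey(...) -> list(k2, list(v2)): group consecutive equal keys
    grouped = [(key, [value for _, value in pairs])
               for key, pairs in groupby(mapped, key=itemgetter(0))]
    # Reduce phase
    return [out for key, values in grouped for out in reduce_fn(key, values)]

result = run(["a b a", "b c"])
```

Here `result` holds one `(word, count)` pair per distinct word, ordered by key because of the sort step.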




Hadoop MapReduce : in practice
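import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token of every input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts received for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");
        // Not on the original slide: lets the cluster locate the jar holding these classes.
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path, args[1] the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}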


Cascading : necessary abstractions




Cascading : necessary abstractions




Cascading : ‘field algebra’ ?!




Cascalog
                   logic programming for Hadoop




 (my-predicate ?var1 42 ?var3 :> ?var4 ?var5)




Cascalog : select ... from ...


 (?<- (stdout) [?person] (person ?person))




Cascalog : select ... from ...


 (?<- (stdout) [?person] (person ?person))


 (?<- (stdout) [?person ?age] (age ?person ?age))




Cascalog : select ... from ...


 (?<- (stdout) [?person] (person ?person))


 (?<- (stdout) [?person ?age] (age ?person ?age))


 (?<- (stdout) [?age] (age _ ?age))




Cascalog : select ... from ...


 (?<- (stdout) [?person] (person ?person))


 (?<- (stdout) [?person ?age] (age ?person ?age))


 (?<- (stdout) [?age] (age _ ?age))


 (?<- (stdout) [?person] (age ?person 42))


Cascalog : select ... from ... where




 (?<- (stdout) [?person ?age]
                             (age ?person ?age)
                             (< ?age 30))
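
As a rough reading aid, the query above behaves like a filter over tuples. The sketch below mimics it on an in-memory list; the `age` data is hypothetical and is not the dataset used in the talk.

```python
# Hypothetical (person, age) tuples standing in for the `age` dataset
age = [("alice", 28), ("bob", 33), ("chris", 40), ("david", 25)]

# (age ?person ?age) binds both variables for each tuple;
# (< ?age 30) keeps only the tuples for which the predicate holds
juniors = [(person, years) for person, years in age if years < 30]
```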




Cascalog : select ... as ... from ...




 (?<- (stdout) [?person ?junior]
                               (age ?person ?age)
                               (< ?age 30 :> ?junior))




Cascalog : select count(*) from ... group by ...




 (?<- (stdout) [?count]
                             (age _ _)
                             (c/count ?count))




Cascalog : select count(*) from ... group by ...




 (?<- (stdout) [?junior ?count]
                             (age _ ?age)
                             (< ?age 30 :> ?junior)
                             (c/count ?count))
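
A sketch of what this grouped count computes, assuming (as Cascalog does for non-aggregated output variables) that `?junior` acts as the implicit grouping key. The `age` tuples are hypothetical.

```python
from collections import Counter

# Hypothetical (person, age) tuples standing in for the `age` dataset
age = [("alice", 28), ("bob", 33), ("chris", 40), ("david", 25), ("emily", 25)]

# (< ?age 30 :> ?junior) yields a boolean per tuple;
# c/count then counts the tuples within each ?junior group
counts = Counter(years < 30 for _, years in age)

# One (junior, count) row per group, sorted for a stable order
result = sorted(counts.items())
```

No explicit `group by` clause is written in the query: the grouping follows from which output variables are not produced by an aggregator.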




Cascalog : select ... from ... join ...




 (?<- (stdout) [?person ?age ?gender]
                             (age ?person ?age)
                             (gender ?person ?gender))
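
The join above is implicit: using `?person` in both predicates is what ties the two datasets together. A minimal in-memory equivalent, with hypothetical data and the simplifying assumption that each person has at most one gender entry:

```python
# Hypothetical datasets; "gary" has no gender and "chris" has no age
age = [("alice", 28), ("bob", 33), ("gary", 28)]
gender = [("alice", "f"), ("bob", "m"), ("chris", "m")]

# Index one side by the shared variable ?person
gender_by_person = dict(gender)

# Inner join: keep only the persons present in both datasets
joined = [(person, years, gender_by_person[person])
          for person, years in age
          if person in gender_by_person]
```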




Cascalog : select ... from ... (select ...)


 (let [many-follows (<- [?person]
                        (follows ?person _)
                        (c/count ?count)
                        (> ?count 2))]
   (?<- (stdout) [?personA ?personB]
        (many-follows ?personA)
        (many-follows ?personB)
        (follows ?personA ?personB)))
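
To make the subquery easier to follow, here is a sketch of the same logic on an in-memory list: first find the people who follow more than two others, then keep the follow relationships where both ends belong to that set. The `follows` data is hypothetical.

```python
from collections import Counter

# Hypothetical (follower, followed) tuples standing in for the `follows` dataset
follows = [
    ("alice", "bob"), ("alice", "chris"), ("alice", "david"),
    ("bob", "alice"), ("bob", "chris"), ("bob", "david"),
    ("chris", "alice"),
]

# Subquery: count follows per person and keep those with more than 2
follow_counts = Counter(follower for follower, _ in follows)
many_follows = {person for person, n in follow_counts.items() if n > 2}

# Outer query: both ends of the relationship must be in many_follows
pairs = sorted((a, b) for a, b in follows
               if a in many_follows and b in many_follows)
```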




Cascalog : define your own functions




 (defn toUpperCase [person] (.toUpperCase person))

 (?<- (stdout) [?PERSON]
      (person ?person)
      (toUpperCase ?person :> ?PERSON))




A conclusion?



 ‘New’ datastores, ‘new’ kinds of querying
      ▶   Cascalog, RDF, Datomic, Neo4j ...


 An affinity between the functional paradigm
      ▶ and data processing?
      ▶ And you? Cascalog, but also...




                                                     ...
                             PIG

http://blog.xebia.fr/author/bdechoux/

                             @BertrandDechoux




                                  ?

OSDC.fr 2012 :: Cascalog : logic programming for Hadoop