OSDC.fr 2012 :: Cascalog : progammation logique pour Hadoop

Cascalog
Programmation logique pour Hadoop

Bertrand Dechoux 13 Octobre 2012

Saturday, October 13, 2012

MapReduce : et vous?

 Python
▶ map(function, iterable, ...)
▶ reduce(function,iterable[, initializer])

 Perl
▶ map BLOCK LIST
▶ reduce BLOCK LIST

 Ruby
▶ map {|item| block} -> new_ary / collect {|item| block} -> new_ary
▶ reduce(initial,sym) -> obj / inject(initial,sym) -> obj

 Smalltalk
▶ collect:aBlock=TheArray
▶ inject: thisValue into: binaryBlock

 PHP
▶ array array_map ( callable $callback, array $arr1 [, array $...])
▶ mixed array_reduce (array $input, callable $function [, mixed $initial = NULL])

2

Hadoop MapReduce : la théorie

 Map
▶ Map(k1,v1) -> list(k2,v2)

 Reduce
▶ Reduce(k2, list (v2)) -> list(k3,v3)

3

Hadoop MapReduce : la théorie

 Map
▶ Map(k1,v1) -> list(k2,v2)
▶ SortByKey(list(k2,v2)) -> list(k2,v2)

 Reduce
▶ MergeByKey(list,list,...) -> list(k2,list(v2))
▶ Reduce(k2, list (v2)) -> list(k3,v3)

4

Hadoop MapReduce : la pratique
public class WordCount {

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);

X
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();

Job job = new Job(conf, "wordcount");

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);
}

}

5

Cascading : des abstractions necessaires

6

Cascading : des abstractions necessaires

7

Cascading : ‘field algebra’ ?!

X
8

Cascalog
programmation logique pour Hadoop

 (my-predicate ?var1 42 ?var3 :> ?var4 ?var5)

9

Cascalog : select ... from ...

 (?<- (stdout) [?person] (person ?person))

10



 (?<- (stdout) [?person ?age] (age ?person ?age))

11




 (?<- (stdout) [?age] (age _ ?age))

12




 (?<- (stdout) [?age] (age _ ?age))

 (?<- (stdout) [?person] (age ?person 42))

13

Cascalog : select ... from ... where

 (?<- (stdout) [?person ?age]
(age ?person ?age)
(< ?age 30))

14

Cascalog : select ... as ... from ...

 (?<- (stdout) [?person ?junior]
(age ?person ?age)
(< ?age 30 :> ?junior))

15

Cascalog : select count(*) from ... group by ...

 (?<- (stdout) [?count]
(age _ _)
(c/count ?count))

16

Cascalog : select count(*) from ... group by ...

 (?<- (stdout) [?junior ?count]
(age _ ?age)
(< ?age 30 :> ?junior)
(c/count ?count))

17

Cascalog : select ... from ... join ...

 (?<- (stdout) [?person ?age ?gender]
(age ?person ?age)
(gender ?person ?gender))

18

Cascalog : select ... from ... (select ...)

 (let [many-follows
(<- [?person] (follows ?person _)
(c/count ?count)
(> ?count 2))]

(?<- (stdout) [?personA ?personB]
(many-follows ?personA)
(many-follows ?personB)
(follows ?personA ?personB))
)

19

Cascalog : définir vos fonctions

 (defn toUpperCase [person] (.toUpperCase person))
(?<- (stdout) [?PERSON]
(person ?person)
(toUpperCase ?person :> ?PERSON))

20

Une conclusion?

 ‘nouveaux’ datastores, ‘nouveaux’ types de requetage
▶ Cascalog, RDF, Datomic, Neo4j ...

 Affinitée entre le paradigme fonctionel
▶ Et les traitements de données?
▶ Et vous? Cascalog mais aussi...

...
PIG

21

http://blog.xebia.fr/author/bdechoux/

@BertrandDechoux

?
22

OSDC.fr 2012 :: Cascalog : progammation logique pour Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to OSDC.fr 2012 :: Cascalog : progammation logique pour Hadoop

Similar to OSDC.fr 2012 :: Cascalog : progammation logique pour Hadoop (20)

More from Publicis Sapient Engineering

More from Publicis Sapient Engineering (20)

Recently uploaded

Recently uploaded (20)

OSDC.fr 2012 :: Cascalog : progammation logique pour Hadoop