Scalding is a Scala library built on top of Cascading that simplifies the process of defining MapReduce programs. It uses a functional programming approach where data flows are represented as chained transformations on TypedPipes, similar to operations on Scala iterators. This avoids some limitations of the traditional Hadoop MapReduce model by allowing for more flexible multi-step jobs and features like joins. The Scalding TypeSafe API also provides compile-time type safety compared to Cascading's runtime type checking.
2. What is Scalding
Scalding is a Scala library written on top of Cascading that makes
it easy to define MapReduce programs
3. Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
7. Map and Reduce
At a high level, a MapReduce job is described by two functions
operating over lists of key/value pairs.
Map: a function from an input key/value pair to a list of
intermediate key/value pairs
map : (key_input, value_input) → list(key_map, value_map)
Reduce: a function from an intermediate key and its list of
values to a list of output key/value pairs
reduce : (key_map, list(value_map)) → list(key_reduce, value_reduce)
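The two functions above can be sketched with plain Scala collections (a local illustration only; all names here are made up, and a real job runs distributed):

```scala
object MapReduceSketch {
  // mapFn : (key_input, value_input) -> list(key_map, value_map)
  def mapFn(lineNo: Long, line: String): List[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toList

  // reduceFn : (key_map, list(value_map)) -> list(key_reduce, value_reduce)
  def reduceFn(word: String, counts: List[Int]): List[(String, Int)] =
    List((word, counts.sum))

  // The framework applies mapFn to every input pair, groups the
  // intermediate pairs by key (the shuffle), then applies reduceFn.
  def run(lines: Seq[String]): Map[String, Int] =
    lines.zipWithIndex
      .flatMap { case (l, i) => mapFn(i.toLong, l) } // map phase
      .groupBy(_._1)                                 // shuffle: group by key
      .flatMap { case (k, kvs) => reduceFn(k, kvs.map(_._2).toList) }
}
```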
10. Hadoop Programming Model
The Hadoop MapReduce programming model gives control over all
the components of the job workflow. The components are divided into two
phases:
The Map Phase:
[Diagram: Data Source → Reader (Ki, Vi) → Mapper (Km, Vm) → Combiner → Partitioner (P1, P2) → Sorter; e.g. combine(Vm1, Vm5) = Vm6]
The Reduce Phase:
[Diagram: Shuffle → Sorter → Grouper (G1, G2) → Reducer (Kr, Vr) → Writer → Data Dest]
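The map-phase stages in the diagram can be simulated locally with Scala collections (keys and partition counts here are illustrative, not from the deck):

```scala
// Mapper output: intermediate (key, value) pairs
val mapped = List(("km1", 1), ("km3", 3), ("km2", 2), ("km1", 5))

// Combiner: merge values for the same key early, e.g. combine(1, 5) = 6,
// so less data crosses the network in the shuffle
val combined = mapped.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// Partitioner: assign each key to a reducer partition
val numPartitions = 2
val partitioned = combined.groupBy { case (k, _) => math.abs(k.hashCode) % numPartitions }

// Sorter: each partition is sorted by key before reducers fetch it
val sorted = partitioned.map { case (p, kvs) => (p, kvs.toList.sortBy(_._1)) }
```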
11. Example: Word Count 1/2
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values)
      sum += val.get();
    context.write(key, new IntWritable(sum));
  }
}
13. Example: Word Count 2/2
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Sending the integer 1 for each instance of a word is very
inefficient (1 TB of input yields over 1 TB of intermediate data)
Hadoop does not know whether the reducer can be used as a combiner;
it must be set manually
14. Hadoop weaknesses
The reducer cannot always be used as a combiner; Hadoop
relies on an explicit combiner specification or on manual partial
aggregation inside the mapper instance life cycle (in-mapper
combining)
Combiners are limited to associative and commutative
functions (like sum). Partial aggregation is more general and
powerful
The programming model is limited to the map/reduce phases;
multi-job programs are often difficult and
counter-intuitive (think of iterative algorithms like
PageRank)
Joins can be difficult, and many techniques must be
implemented from scratch
More generally, MapReduce is indeed simple, but many
optimizations feel more like hacks than natural solutions
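The point about partial aggregation being more general than combiners can be sketched in Scala: the mean is not itself associative over raw values, but a (sum, count) pair can be merged associatively on each mapper and finalized once at the end (a standard technique; the names below are illustrative):

```scala
// Partial state that merges associatively, even though the final
// statistic (the mean) is not an associative function of raw values.
case class Partial(sum: Double, count: Long) {
  def merge(that: Partial): Partial = Partial(sum + that.sum, count + that.count)
  def result: Double = sum / count
}
def fromValue(v: Double): Partial = Partial(v, 1L)

// Each mapper pre-aggregates its own shard; the reducer merges the partials.
val shards = List(List(1.0, 2.0), List(3.0, 4.0, 5.0))
val partials = shards.map(_.map(fromValue).reduce(_ merge _))
val mean = partials.reduce(_ merge _).result // 3.0
```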
15. Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
16. Cascading
Open source project developed @Concurrent
It is a Java application framework on top of Hadoop, designed
to be extensible, providing:
Processing API: to develop complex data flows
Integration API: integration testing supported by the framework,
to avoid putting unstable software into production
Scheduling API: used to schedule units of work from any
third-party application
It replaces the MapReduce programming model with a more
generic, data-flow-oriented programming model
Cascading has a data flow optimizer that converts user data
flows into optimized data flows
17. Cascading Programming Model
A Cascading program is composed of flows
A flow is composed of a source tap, a sink tap and the pipes
that connect them
A pipe applies a particular transformation to its input data
flow
Pipes can be combined to create more complex programs
18. Example: Word Count
MapReduce word count concept:
[Diagram: Data Source → TextLine (Ki, Vi) → Map (tokenize text and emit 1 for each token) → Shuffle → Reduce (count values and emit the result) (Kr, Vr) → TextLine → Data Dest]
Cascading word count concept:
[Diagram: TextLine → tokenize each line → group by tokens → count values in every group → TextLine]
19. Example: Word Count
public class WordCount {
  public static void main( String[] args ) {
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

    RegexSplitGenerator s = new RegexSplitGenerator(
        new Fields( "token" ),
        "[ \\[\\]\\(\\),.]" );
    Pipe docPipe = new Each( "token", new Fields( "text" ), s,
        Fields.RESULTS ); // text -> token

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps and pipes to create a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

    new HadoopFlowConnector().connect( flowDef ).complete();
  }
}
20. Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
24. Scalding
Open source project developed @Twitter
Two APIs:
Field Based
Primary API: stable
Uses Cascading Fields: dynamic, with errors at runtime
Type Safe
Secondary API: experimental
Uses Scala types: static, with errors at compile time
The two APIs can work together using pipe.typed and
TypedPipe.from
This presentation is about the TypeSafe API
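A sketch of how the two APIs can interoperate, assuming the usual Scalding imports; the field names and sources here are illustrative, and exact signatures vary between Scalding versions:

```scala
import com.twitter.scalding._

class Interop(args: Args) extends Job(args) {
  // Field-based pipe with two named Cascading fields
  val fieldPipe = Tsv(args("input"), ('word, 'count)).read

  // Field-based -> Typed: declare the Scala types carried by the named fields
  val typed: TypedPipe[(String, Long)] =
    TypedPipe.from[(String, Long)](fieldPipe, ('word, 'count))

  // Typed -> Field-based: name the tuple positions again
  val backToFields = typed.toPipe('word, 'count)
}
```

The slide also mentions `pipe.typed`, an enrichment on field-based pipes that converts a selection of fields into a TypedPipe in a similar way.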
28. Why Scalding
The MapReduce high-level idea comes from LISP: it works with
functions (map/reduce) and function composition
Cascading works with objects representing functions and uses
constructors as the combinators between pipes:
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
    Fields.ALL );
Functional programming can naturally describe data flows:
every pipe can be seen as a function, and pipes can be
combined using function composition. The code above can
be written as:
docPipe.groupBy( new Fields( "token" ) )
  .every( Fields.ALL, new Count(), Fields.ALL )
29. Example: Word Count
class WordCount(args : Args) extends Job(args) {

  /* TextLine reads each line of the given file */
  val input = TypedPipe.from( TextLine( args( "input" ) ) )

  /* tokenize every line and flatten the result into a list of words */
  val words = input.flatMap{ tokenize(_) }

  /* group by word; size adds the group size to each key */
  val wordGroups = words.groupBy{ identity(_) }.size

  /* write each (word, count) pair as a tab-separated line */
  wordGroups.write( TypedTsv[(String, Long)]( args( "output" ) ) )

  /* Split a piece of text into individual words */
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
  }
}
32. Scalding TypeSafe API
Two main concepts:
TypedPipe[T]: a class whose instances are distributed
objects wrapping a Cascading Pipe object and holding the
transformations applied up to that point. Its interface is similar
to Scala's Iterator[T] (map, flatMap, groupBy,
filter, ...)
KeyedList[K,V]: a trait that represents a sharded list of
items. Two implementations:
Grouped[K,V]: represents a grouping on keys of type K
CoGrouped2[K,V,W,Result]: represents a cogroup over
two grouped pipes. Used for joins
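A typed join built on grouped pipes can be sketched as follows (the sources, field contents and argument names are illustrative; the grouping/join calls follow the TypedPipe API):

```scala
import com.twitter.scalding._

class JoinExample(args: Args) extends Job(args) {
  val users: TypedPipe[(Long, String)] =
    TypedPipe.from(TypedTsv[(Long, String)](args("users")))  // (userId, name)
  val clicks: TypedPipe[(Long, String)] =
    TypedPipe.from(TypedTsv[(Long, String)](args("clicks"))) // (userId, url)

  // group both pipes on the key, then cogroup them as an inner join;
  // the result pairs each key with the matched values from both sides
  val joined: TypedPipe[(Long, String, String)] =
    users.group.join(clicks.group)
      .toTypedPipe
      .map { case (id, (name, url)) => (id, name, url) }

  joined.write(TypedTsv[(Long, String, String)](args("output")))
}
```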
35. Conclusions
The MapReduce API is powerful but limited
The Cascading API is as simple as the MapReduce API but more
generic and powerful
Scalding combines Cascading and Scala to describe
distributed programs easily. Its major strengths are:
Functional programming to naturally describe data flows.
Scalding resembles the Scala standard library: if you know
Scala, you already know how to use Scalding
Statically typed (TypeSafe API): no type errors at runtime
Scala is standard and runs on the JVM, so
Scala libraries and tools can be used in production: IDEs,
debuggers, test frameworks, build systems and everything else.