Scalding is a Scala library built on top of Cascading that simplifies the process of defining MapReduce programs. It uses a functional programming approach where data flows are represented as chained transformations on TypedPipes, similar to operations on Scala iterators. This avoids some limitations of the traditional Hadoop MapReduce model by allowing for more flexible multi-step jobs and features like joins. The Scalding TypeSafe API also provides compile-time type safety compared to Cascading's runtime type checking.
2. What is Scalding
Scalding is a Scala library written on top of Cascading that makes
it easy to define MapReduce programs
3. Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
7. Map and Reduce
At a high level, a MapReduce job is described by two functions
operating over lists of key/value pairs.
Map: a function from an input key/value pair to a list of
intermediate key/value pairs
map : (key_input, value_input) → list(key_map, value_map)
Reduce: a function from an intermediate key and its list of
values to a list of output key/value pairs
reduce : (key_map, list(value_map)) → list(key_reduce, value_reduce)
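The two functions above can be sketched with plain Scala collections (a local illustration only; all names here are made up, and a real job runs distributed):

```scala
object MapReduceSketch {
  // mapFn : (key_input, value_input) -> list(key_map, value_map)
  def mapFn(lineNo: Long, line: String): List[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toList

  // reduceFn : (key_map, list(value_map)) -> list(key_reduce, value_reduce)
  def reduceFn(word: String, counts: List[Int]): List[(String, Int)] =
    List((word, counts.sum))

  // The framework applies mapFn to every input pair, groups the
  // intermediate pairs by key (the shuffle), then applies reduceFn.
  def run(lines: Seq[String]): Map[String, Int] =
    lines.zipWithIndex
      .flatMap { case (l, i) => mapFn(i.toLong, l) } // map phase
      .groupBy(_._1)                                 // shuffle: group by key
      .flatMap { case (k, kvs) => reduceFn(k, kvs.map(_._2).toList) }
}
```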
10. Hadoop Programming Model
The Hadoop MapReduce programming model gives control over all
the components of the job workflow. The components are divided into two
phases:
The Map Phase:
[Diagram: Data Source → Reader (Ki, Vi) → Mapper (Km, Vm) → Combiner → Partitioner (P1, P2) → Sorter; e.g. combine(Vm1, Vm5) = Vm6]
The Reduce Phase:
[Diagram: Shuffle → Sorter → Grouper (G1, G2) → Reducer (Kr, Vr) → Writer → Data Dest]
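The map-phase stages in the diagram can be simulated locally with Scala collections (keys and partition counts here are illustrative, not from the deck):

```scala
// Mapper output: intermediate (key, value) pairs
val mapped = List(("km1", 1), ("km3", 3), ("km2", 2), ("km1", 5))

// Combiner: merge values for the same key early, e.g. combine(1, 5) = 6,
// so less data crosses the network in the shuffle
val combined = mapped.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// Partitioner: assign each key to a reducer partition
val numPartitions = 2
val partitioned = combined.groupBy { case (k, _) => math.abs(k.hashCode) % numPartitions }

// Sorter: each partition is sorted by key before reducers fetch it
val sorted = partitioned.map { case (p, kvs) => (p, kvs.toList.sortBy(_._1)) }
```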
11. Example: Word Count 1/2
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values)
      sum += val.get();
    context.write(key, new IntWritable(sum));
  }
}
13. Example: Word Count 2/2
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Sending the integer 1 for each instance of a word is very
inefficient (1 TB of input yields over 1 TB of intermediate data)
Hadoop does not know whether the reducer can be used as a combiner;
it must be set manually
14. Hadoop weaknesses
The reducer cannot always be used as a combiner; Hadoop
relies on an explicit combiner specification or on manual partial
aggregation inside the mapper instance life cycle (in-mapper
combining)
Combiners are limited to associative and commutative
functions (like sum). Partial aggregation is more general and
powerful
The programming model is limited to the map/reduce phases;
multi-job programs are often difficult and
counter-intuitive (think of iterative algorithms like
PageRank)
Joins can be difficult, and many techniques must be
implemented from scratch
More generally, MapReduce is indeed simple, but many
optimizations feel more like hacks than natural solutions
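The point about partial aggregation being more general than combiners can be sketched in Scala: the mean is not itself associative over raw values, but a (sum, count) pair can be merged associatively on each mapper and finalized once at the end (a standard technique; the names below are illustrative):

```scala
// Partial state that merges associatively, even though the final
// statistic (the mean) is not an associative function of raw values.
case class Partial(sum: Double, count: Long) {
  def merge(that: Partial): Partial = Partial(sum + that.sum, count + that.count)
  def result: Double = sum / count
}
def fromValue(v: Double): Partial = Partial(v, 1L)

// Each mapper pre-aggregates its own shard; the reducer merges the partials.
val shards = List(List(1.0, 2.0), List(3.0, 4.0, 5.0))
val partials = shards.map(_.map(fromValue).reduce(_ merge _))
val mean = partials.reduce(_ merge _).result // 3.0
```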
15. Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
16. Cascading
Open source project developed @Concurrent
It is a Java application framework on top of Hadoop, designed
to be extensible, providing:
Processing API: to develop complex data flows
Integration API: integration testing supported by the framework,
to avoid putting unstable software into production
Scheduling API: used to schedule units of work from any
third-party application
It replaces the MapReduce programming model with a more
generic, data-flow-oriented programming model
Cascading has a data flow optimizer that converts user data
flows into optimized data flows
17. Cascading Programming Model
A Cascading program is composed of flows
A flow is composed of a source tap, a sink tap and the pipes
that connect them
A pipe applies a particular transformation to its input data
flow
Pipes can be combined to create more complex programs
18. Example: Word Count
MapReduce word count concept:
[Diagram: Data Source → TextLine (Ki, Vi) → Map (tokenize text and emit 1 for each token) → Shuffle → Reduce (count values and emit the result) (Kr, Vr) → TextLine → Data Dest]
Cascading word count concept:
[Diagram: TextLine → tokenize each line → group by tokens → count values in every group → TextLine]
19. Example: Word Count
public class WordCount {
  public static void main( String[] args ) {
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

    RegexSplitGenerator s = new RegexSplitGenerator(
        new Fields( "token" ),
        "[ \\[\\]\\(\\),.]" );
    Pipe docPipe = new Each( "token", new Fields( "text" ), s,
        Fields.RESULTS ); // text -> token

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps and pipes to create a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

    new HadoopFlowConnector().connect( flowDef ).complete();
  }
}
20. Summary
Hadoop MapReduce Programming Model
Cascading
Scalding
24. Scalding
Open source project developed @Twitter
Two APIs:
Field Based
Primary API: stable
Uses Cascading Fields: dynamic, with errors at runtime
Type Safe
Secondary API: experimental
Uses Scala types: static, with errors at compile time
The two APIs can work together using pipe.typed and
TypedPipe.from
This presentation is about the TypeSafe API
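A sketch of how the two APIs can interoperate, assuming the usual Scalding imports; the field names and sources here are illustrative, and exact signatures vary between Scalding versions:

```scala
import com.twitter.scalding._

class Interop(args: Args) extends Job(args) {
  // Field-based pipe with two named Cascading fields
  val fieldPipe = Tsv(args("input"), ('word, 'count)).read

  // Field-based -> Typed: declare the Scala types carried by the named fields
  val typed: TypedPipe[(String, Long)] =
    TypedPipe.from[(String, Long)](fieldPipe, ('word, 'count))

  // Typed -> Field-based: name the tuple positions again
  val backToFields = typed.toPipe('word, 'count)
}
```

The slide also mentions `pipe.typed`, an enrichment on field-based pipes that converts a selection of fields into a TypedPipe in a similar way.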
28. Why Scalding
The MapReduce high-level idea comes from LISP: it works with
functions (map/reduce) and function composition
Cascading works with objects representing functions and uses
constructors as the combinators between pipes:
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
    Fields.ALL );
Functional programming can naturally describe data flows:
every pipe can be seen as a function, and pipes can be
combined using function composition. The code above can
be written as:
docPipe.groupBy( new Fields( "token" ) )
  .every( Fields.ALL, new Count(), Fields.ALL )
29. Example: Word Count
class WordCount(args : Args) extends Job(args) {

  /* TextLine reads each line of the given file */
  val input = TypedPipe.from( TextLine( args( "input" ) ) )

  /* tokenize every line and flatten the result into a list of words */
  val words = input.flatMap{ tokenize(_) }

  /* group by word; size adds the group size to each key */
  val wordGroups = words.groupBy{ identity(_) }.size

  /* write each (word, count) pair as a tab-separated line */
  wordGroups.write( TypedTsv[(String, Long)]( args( "output" ) ) )

  /* Split a piece of text into individual words */
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
  }
}
32. Scalding TypeSafe API
Two main concepts:
TypedPipe[T]: a class whose instances are distributed
objects wrapping a Cascading Pipe object and holding the
transformations applied up to that point. Its interface is similar
to Scala's Iterator[T] (map, flatMap, groupBy,
filter, ...)
KeyedList[K,V]: a trait that represents a sharded list of
items. Two implementations:
Grouped[K,V]: represents a grouping on keys of type K
CoGrouped2[K,V,W,Result]: represents a cogroup over
two grouped pipes. Used for joins
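A typed join built on grouped pipes can be sketched as follows (the sources, field contents and argument names are illustrative; the grouping/join calls follow the TypedPipe API):

```scala
import com.twitter.scalding._

class JoinExample(args: Args) extends Job(args) {
  val users: TypedPipe[(Long, String)] =
    TypedPipe.from(TypedTsv[(Long, String)](args("users")))  // (userId, name)
  val clicks: TypedPipe[(Long, String)] =
    TypedPipe.from(TypedTsv[(Long, String)](args("clicks"))) // (userId, url)

  // group both pipes on the key, then cogroup them as an inner join;
  // the result pairs each key with the matched values from both sides
  val joined: TypedPipe[(Long, String, String)] =
    users.group.join(clicks.group)
      .toTypedPipe
      .map { case (id, (name, url)) => (id, name, url) }

  joined.write(TypedTsv[(Long, String, String)](args("output")))
}
```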
35. Conclusions
The MapReduce API is powerful but limited
The Cascading API is as simple as the MapReduce API but more
generic and powerful
Scalding combines Cascading and Scala to describe
distributed programs easily. Its major strengths are:
Functional programming to naturally describe data flows.
Scalding resembles the Scala standard library: if you know
Scala, you already know how to use Scalding
Statically typed (TypeSafe API): no type errors at runtime
Scala is standard and runs on the JVM, so
Scala libraries and tools can be used in production: IDEs,
debuggers, test frameworks, build systems and everything else.