Scalding
This presentation is about Scalding, with a focus on its programming model compared to Hadoop and Cascading. I gave this presentation to the group http://www.meetup.com/riviera-scala-clojure

Transcript

  • 1. Scalding. Mario Pastorelli (Mario.Pastorelli@eurecom.fr), EURECOM, September 27, 2012
  • 2. What is Scalding: Scalding is a Scala library written on top of Cascading that makes it easy to define MapReduce programs
  • 3. Summary: Hadoop MapReduce Programming Model, Cascading, Scalding
  • 4. Summary: Hadoop MapReduce Programming Model, Cascading, Scalding
  • 5. Map and Reduce: At a high level, a MapReduce job is described by two functions operating over lists of key/value pairs.
  • 6. Map and Reduce: At a high level, a MapReduce job is described by two functions operating over lists of key/value pairs. Map: a function from an input key/value pair to a list of intermediate key/value pairs: map : (key_input, value_input) → list(key_map, value_map)
  • 7. Map and Reduce: At a high level, a MapReduce job is described by two functions operating over lists of key/value pairs. Map: a function from an input key/value pair to a list of intermediate key/value pairs: map : (key_input, value_input) → list(key_map, value_map). Reduce: a function from an intermediate key with its list of values to a list of output key/value pairs: reduce : (key_map, list(value_map)) → list(key_reduce, value_reduce)
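To make the two signatures concrete, here is a minimal sketch in plain Scala (no Hadoop involved; the object name and the word-count instantiation are illustrative only) of a map and a reduce function over key/value pairs:

    object MapReduceModel {
      // map : (key_input, value_input) -> list(key_map, value_map)
      def map(offset: Long, line: String): List[(String, Int)] =
        line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toList

      // reduce : (key_map, list(value_map)) -> list(key_reduce, value_reduce)
      def reduce(word: String, counts: List[Int]): List[(String, Int)] =
        List((word, counts.sum))

      def main(args: Array[String]): Unit = {
        val input   = List((0L, "to be or not to be"))          // (offset, line) pairs
        val mapped  = input.flatMap { case (k, v) => map(k, v) } // apply map to every pair
        val grouped = mapped.groupBy(_._1).mapValues(_.map(_._2)) // group values by key
        val reduced = grouped.toList.flatMap { case (w, vs) => reduce(w, vs) }
        println(reduced) // e.g. List((be,2), (to,2), (or,1), (not,1)), in some order
      }
    }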
  • 8. Hadoop Programming Model: The Hadoop MapReduce programming model gives control over all the components of the job workflow. Job components are divided into two phases:
  • 9. Hadoop Programming Model: The Hadoop MapReduce programming model gives control over all the components of the job workflow. Job components are divided into two phases. The Map Phase: [diagram: Data Source → reader → Mapper → Combiner → Partitioner → Sorter; e.g. combine(Vm1, Vm5) = Vm6]
  • 10. Hadoop Programming Model: The Hadoop MapReduce programming model gives control over all the components of the job workflow. Job components are divided into two phases. The Map Phase: [diagram: Data Source → reader → Mapper → Combiner → Partitioner → Sorter; e.g. combine(Vm1, Vm5) = Vm6]. The Reduce Phase: [diagram: Shuffle → Sorter → Grouper → Reducer → Writer → Data Dest]
  • 11. Example: Word Count 1/2

      class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final Text word = new Text(); // reusable Text for the current token

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(new Text(word), new IntWritable(1));
          }
        }
      }

      class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values)
            sum += val.get();
          context.write(key, new IntWritable(sum));
        }
      }
  • 12. Example: Word Count 2/2

      public class WordCount {

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "word count");
          job.setMapperClass(TokenizerMapper.class);

          // (no combiner set here; see the next slide)

          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }
  • 13. Example: Word Count 2/2

      public class WordCount {

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "word count");
          job.setMapperClass(TokenizerMapper.class);

          job.setCombinerClass(IntSumReducer.class); // reuse the reducer as combiner

          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }

    Sending the integer 1 for each instance of a word is very inefficient (1 TB of input yields 1 TB+ of intermediate data). Hadoop does not know whether it can use the reducer as a combiner; it has to be set manually.
  • 14. Hadoop weaknesses:
    The reducer cannot always be used as a combiner; Hadoop relies on an explicit combiner specification or on manual partial aggregation inside the mapper instance life cycle (in-mapper combiner).
    Combiners are limited to associative and commutative functions (like sum); partial aggregation is more general and more powerful.
    The programming model is limited to the map/reduce phases, so multi-job programs are often difficult and counter-intuitive (think of iterative algorithms like PageRank).
    Joins can be difficult, and many techniques must be implemented from scratch.
    More generally, MapReduce is indeed simple, but many optimizations look more like hacks than natural solutions.
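As a small aside on the combiner restriction, here is a minimal plain-Scala sketch (no Hadoop involved; names are illustrative only) of why an associative and commutative combine function can be applied to arbitrary partial groups of values before the final reduce, which is what a combiner or in-mapper partial aggregation exploits:

    object PartialAggregation {
      // associative and commutative, so partial results can be merged in any order
      def combine(a: Int, b: Int): Int = a + b

      def main(args: Array[String]): Unit = {
        val counts  = List(1, 1, 1, 1, 1)         // the 1s a word-count mapper emits for one word
        val full    = counts.reduce(combine)      // aggregate everything at once
        val partial = counts.grouped(2)           // pretend each chunk lives on a different mapper
          .map(_.reduce(combine))                 // combiner: pre-aggregate locally
          .reduce(combine)                        // reducer: merge the partial sums
        assert(full == partial)                   // same result, far less data shuffled
        println(s"full=$full partial=$partial")
      }
    }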
  • 15. Summary: Hadoop MapReduce Programming Model, Cascading, Scalding
  • 16. Cascading:
    Open source project developed @Concurrent.
    It is a Java application framework on top of Hadoop, designed to be extensible by providing:
    Processing API: to develop complex data flows.
    Integration API: integration testing supported by the framework, to avoid putting unstable software into production.
    Scheduling API: used to schedule units of work from any third-party application.
    It changes the MapReduce programming model into a more generic, data-flow-oriented programming model.
    Cascading has a data flow optimizer that converts user data flows into optimized data flows.
  • 17. Cascading Programming Model: A Cascading program is composed of flows. A flow is composed of a source tap, a sink tap and pipes that connect them. A pipe holds a particular transformation over its input data flow. Pipes can be combined to create more complex programs.
  • 18. Example: Word Count. MapReduce word count concept: [diagram: TextLine Data Source → Map (tokenize text and emit 1 for each token) → Shuffle → Reduce (count values and emit the result) → TextLine Data Dest]. Cascading word count concept: [diagram: TextLine → tokenize each line → group by tokens → count values in every group → TextLine]
  • 19. Example: Word Count

      public class WordCount {
        public static void main( String[] args ) {
          Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
          Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

          RegexSplitGenerator s = new RegexSplitGenerator(
              new Fields( "token" ),
              "[ \\[\\](),.]" );
          Pipe docPipe = new Each( "token", new Fields( "text" ), s,
              Fields.RESULTS ); // text -> token

          Pipe wcPipe = new Pipe( "wc", docPipe );
          wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
          wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

          // connect the taps and pipes to create a flow definition
          FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );

          getFlowConnector().connect( flowDef ).complete();
        }
      }
  • 20. Summary: Hadoop MapReduce Programming Model, Cascading, Scalding
  • 21. Scalding: Open source project developed @Twitter
  • 22. Scalding: Open source project developed @Twitter. Two APIs: the Field Based API (primary, stable), which uses Cascading Fields and is dynamic, with errors at runtime; and the Type Safe API (secondary, experimental), which uses Scala types and is static, with errors at compile time.
  • 23. Scalding: Open source project developed @Twitter. Two APIs: the Field Based API (primary, stable), which uses Cascading Fields and is dynamic, with errors at runtime; and the Type Safe API (secondary, experimental), which uses Scala types and is static, with errors at compile time. The two APIs can work together using pipe.typed and TypedPipe.from.
  • 24. Scalding: Open source project developed @Twitter. Two APIs: the Field Based API (primary, stable), which uses Cascading Fields and is dynamic, with errors at runtime; and the Type Safe API (secondary, experimental), which uses Scala types and is static, with errors at compile time. The two APIs can work together using pipe.typed and TypedPipe.from. This presentation is about the TypeSafe API.
  • 25. Why Scalding
  • 26. Why Scalding: MapReduce's high-level idea comes from LISP and works on functions (map/reduce) and function composition.
  • 27. Why Scalding: MapReduce's high-level idea comes from LISP and works on functions (map/reduce) and function composition. Cascading works on objects representing functions and uses constructors to compose pipes:

      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

  • 28. Why Scalding: MapReduce's high-level idea comes from LISP and works on functions (map/reduce) and function composition. Cascading works on objects representing functions and uses constructors to compose pipes:

      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    Functional programming can naturally describe data flows: every pipe can be seen as a function, and pipes can be combined using functional composition. The code above can be written as:

      docPipe.groupBy( new Fields( "token" ) )
        .every( Fields.ALL, new Count(), Fields.ALL )
  • 29. Example: Word Count

      class WordCount(args : Args) extends Job(args) {

        /* TextLine reads each line of the given file */
        val input = TypedPipe.from( TextLine( args( "input" ) ) )

        /* tokenize every line and flatten the result into a list of words */
        val words = input.flatMap{ tokenize(_) }

        /* group by words and add a new field, size, that is the group size */
        val wordGroups = words.groupBy{ identity(_) }.size

        /* write each pair (word, count) as a line using TextLine */
        wordGroups.write((0, 1), TextLine( args( "output" ) ) )

        /* Split a piece of text into individual words */
        def tokenize(text : String) : Array[String] = {
          // Lowercase each word and remove punctuation.
          text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
            .split("\\s+")
        }
      }
  • 30. Scalding TypeSafe API: Two main concepts:
  • 31. Scalding TypeSafe API: Two main concepts: TypedPipe[T]: a class whose instances are distributed objects that wrap a Cascading Pipe object and hold the transformations done up to that point. Its interface is similar to Scala's Iterator[T] (map, flatMap, groupBy, filter, ...).
  • 32. Scalding TypeSafe API: Two main concepts: TypedPipe[T]: a class whose instances are distributed objects that wrap a Cascading Pipe object and hold the transformations done up to that point. Its interface is similar to Scala's Iterator[T] (map, flatMap, groupBy, filter, ...). KeyedList[K,V]: a trait that represents a sharded list of items, with two implementations: Grouped[K,V], which represents a grouping on keys of type K, and CoGrouped2[K,V,W,Result], which represents a cogroup over two grouped pipes, used for joins.
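To illustrate how TypedPipe[T] mirrors Iterator[T], here is a minimal sketch of a hypothetical job (the job name and the word-length statistic are made up for illustration; it reuses only the calls shown in the word-count example above, plus filter):

    import com.twitter.scalding._

    class WordLengthCount(args : Args) extends Job(args) {

      /* read lines, split them into words and drop empty tokens */
      val words = TypedPipe.from( TextLine( args( "input" ) ) )
        .flatMap( _.split("\\s+") )
        .filter( _.nonEmpty )

      /* group the words by their length and count each group */
      val countsByLength = words
        .map( _.length )
        .groupBy{ identity(_) }
        .size

      /* write each (length, count) pair, as in the word-count example */
      countsByLength.write((0, 1), TextLine( args( "output" ) ))
    }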
  • 33. Conclusions: The MapReduce API is powerful but limited.
  • 34. Conclusions: The MapReduce API is powerful but limited. The Cascading API is as simple as the MapReduce API but more generic and powerful.
  • 35. Conclusions: The MapReduce API is powerful but limited. The Cascading API is as simple as the MapReduce API but more generic and powerful. Scalding combines Cascading and Scala to describe distributed programs easily. Its major strengths are: functional programming that naturally describes data flows; an API similar to the Scala standard library, so if you know Scala you already know how to use Scalding; static typing (TypeSafe API), with no type errors at runtime; and the fact that Scala is standard and runs on the JVM, so Scala libraries and tools can be used in production: IDEs, debuggers, test frameworks, build systems and everything else.
  • 36. Thank you for listening