Why Hadoop MapReduce Needs Scala: An Introduction to Scoobi and Scalding

Published in: Technology, Education

  1. Why Hadoop MapReduce Needs Scala: A Look at Scoobi and Scalding, Scala DSLs for Hadoop (@agemooij)
  2. Obligatory “About Me” Slide
  3. [Hadoop] Rocks!
  4. But programming [Hadoop] kinda sucks!
  5. Hello World: Word Count using Hadoop MapReduce
  6. • Split lines into words
     • Turn each word into a Pair(word, 1)
     • Group by word (?)
     • For each word, sum the 1s to get the total
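The slide's Java code was an image and is not part of this transcript. As a rough illustration only, the four steps can be modelled in plain Scala with one function per MapReduce phase. The names `MapReducePhases`, `mapper`, `shuffle` and `reducer` are made up for this sketch and are not Hadoop APIs; real Hadoop code would use `Mapper` and `Reducer` classes plus job configuration.

```scala
object MapReducePhases {
  // Map phase: split one line into words and emit a (word, 1) pair per word.
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle phase: group the emitted 1s by word.
  // (Hadoop performs this step itself between map and reduce.)
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2) }

  // Reduce phase: sum the 1s for a single word to get its total.
  def reducer(word: String, ones: Seq[Int]): (String, Int) = (word, ones.sum)

  // Wire the phases together for an in-memory list of lines.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapper)).map { case (word, ones) => reducer(word, ones) }
}
```

Calling `MapReducePhases.wordCount(Seq("a b a", "b c"))` yields a map with `a -> 2`, `b -> 2` and `c -> 1`.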
  7. • Lots of small unintuitive Mapper and Reducer classes
     • Lots of Hadoop intrusiveness (Context, Writables, Exceptions, etc.)
     • Low-level glue code
     • Actually runs the code on the cluster
  8. This does not make me a happy Hadoop developer! Especially for things that are a little bit more complicated than counting words:
     • Unintuitive, invasive programming model
     • Hard to compose/chain jobs into real, more complicated programs
     • Lots of low-level boilerplate code
     • Branching, Joins, CoGroups, etc. hard to implement
  9. What Are the Alternatives?
  10. Counting Words using Apache Pig
      • Nice! Already a lot better, but anything more complex gets hard pretty fast.
      • Pig is hard to customize/extend
      • Handy for quick exploration of data!
      • And the same goes for Hive
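  11. Wordcount example in Cascading
      • Very powerful! Record model, pipes & filters, joins & CoGroups
      • Not very intuitive
      • Strange new abstraction
      • Lots of boilerplate code

          package cascadingtutorial.wordcount;

          // Imports were not on the slide; these assume Cascading 1.x package names.
          import java.util.Properties;

          import cascading.flow.Flow;
          import cascading.flow.FlowConnector;
          import cascading.operation.aggregator.Count;
          import cascading.operation.regex.RegexSplitGenerator;
          import cascading.pipe.Each;
          import cascading.pipe.Every;
          import cascading.pipe.GroupBy;
          import cascading.pipe.Pipe;
          import cascading.scheme.Scheme;
          import cascading.scheme.TextLine;
          import cascading.tap.Hfs;
          import cascading.tap.Lfs;
          import cascading.tap.Tap;
          import cascading.tuple.Fields;

          /**
           * Wordcount example in Cascading
           */
          public class Main {

            public static void main( String[] args ) {
              String inputPath = args[0];
              String outputPath = args[1];

              Scheme inputScheme = new TextLine(new Fields("offset", "line"));
              Scheme outputScheme = new TextLine();

              // Paths with a URI scheme go to HDFS, plain paths to the local file system
              Tap sourceTap = inputPath.matches("^[^:]+://.*")
                  ? new Hfs(inputScheme, inputPath)
                  : new Lfs(inputScheme, inputPath);
              Tap sinkTap = outputPath.matches("^[^:]+://.*")
                  ? new Hfs(outputScheme, outputPath)
                  : new Lfs(outputScheme, outputPath);

              // Split each line on whitespace into a "word" field
              Pipe wcPipe = new Each("wordcount",
                  new Fields("line"),
                  new RegexSplitGenerator(new Fields("word"), "\\s+"),
                  new Fields("word"));

              // Group by word and count the members of each group
              wcPipe = new GroupBy(wcPipe, new Fields("word"));
              wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

              Properties properties = new Properties();
              FlowConnector.setApplicationJarClass(properties, Main.class);

              Flow parsedLogFlow = new FlowConnector(properties)
                  .connect(sourceTap, sinkTap, wcPipe);
              parsedLogFlow.start();
              parsedLogFlow.complete();
            }
          }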
  12. Meh... I’m lazy. I want more power with less work!
  13. How would we count words in plain Scala? (My current language of choice)
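The code on this slide was an image and did not survive in the transcript. One plausible way to answer the question with nothing but the standard collections library is sketched below; the object and method names are chosen for this example and are not taken from the talk.

```scala
object PlainScalaWordCount {
  // Count word occurrences in an in-memory collection of lines.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // split lines into words
      .filter(_.nonEmpty)         // drop empty tokens from leading whitespace
      .groupBy(identity)          // group equal words together
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Everything here runs in a single JVM on an ordinary `Seq`, which is exactly why it reads so naturally and why it does not scale to a cluster by itself.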
  14. Nice! Familiar, intuitive. What if...?
  15. But that code doesn’t scale to my cluster! Or does it? Meanwhile at Google...
  16. Introducing Scoobi & Scalding: Scala DSLs for Hadoop MapReduce
      NOTE: My relative familiarity with either platform: Scalding 5%, Scoobi 95%
  17. http://github.com/nicta/scoobi
      A Scala library that implements a higher level programming model for Hadoop MapReduce
  18. Counting Words using Scoobi
      • Split lines into words
      • Turn each word into a Pair(word, 1)
      • Group by word
      • For each word, sum the 1s to get the total
      • Actually runs the code on the cluster
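The Scoobi code these annotations pointed at was an image and is missing from the transcript. The sketch below follows the style of the word count example in Scoobi's README from around the 0.4 release. It is an approximation rather than the slide's actual code: it needs the Scoobi library on the classpath, and names such as `ScoobiApp`, `fromTextFile`, `combine` and `toTextFile` may have different signatures in other versions.

```scala
import com.nicta.scoobi.Scoobi._

// Assumes Scoobi 0.4-era APIs; input and output paths come from the command line.
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val counts: DList[(String, Int)] =
      lines.flatMap(_.split(" "))      // split lines into words
           .map(word => (word, 1))     // turn each word into a pair (word, 1)
           .groupByKey                 // group by word
           .combine(_ + _)             // for each word, sum the 1s

    // Nothing runs until the result is persisted: this triggers the MapReduce job(s).
    persist(toTextFile(counts, args(1)))
  }
}
```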
  19. Scoobi is...
      • A distributed collections abstraction:
        • Distributed collection objects abstract data in HDFS
        • Methods on these objects abstract map/reduce operations
        • Programs manipulate distributed collections objects
        • Scoobi turns these manipulations into MapReduce jobs
        • Based on Google’s FlumeJava / Cascades
      • A source code generator (it generates Java code!)
      • A job plan optimizer
      • Open sourced by NICTA
      • Written in Scala (W00t!)
  20. DList[T]
      • Abstracts storage of data and files on HDFS
      • Calling methods on DList objects to transform and manipulate them abstracts the mapper, combiner, sort-and-shuffle, and reducer phases of MapReduce
      • Persisting a DList triggers compilation of the graph into one or more MR jobs and their execution
      • Very familiar: like standard Scala Lists
      • Strongly typed
      • Parameterized with rich types and Tuples
      • Easy list manipulation using typical higher order functions like map, flatMap, filter, etc.
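To make the "nothing runs until you persist" idea concrete, here is a deliberately tiny, hypothetical model in plain Scala. `ToyDList` is invented for this sketch and has nothing to do with Scoobi's real implementation, which compiles the recorded graph into MapReduce jobs instead of running closures in memory.

```scala
// Hypothetical toy model of a lazily evaluated, strongly typed collection.
final class ToyDList[A] private (private val compute: () => Seq[A], val plan: List[String]) {
  // Each transformation only records a new step; no data is touched yet.
  def map[B](f: A => B): ToyDList[B] =
    new ToyDList[B](() => compute().map(f), plan :+ "map")

  def flatMap[B](f: A => Iterable[B]): ToyDList[B] =
    new ToyDList[B](() => compute().flatMap(f), plan :+ "flatMap")

  def filter(p: A => Boolean): ToyDList[A] =
    new ToyDList[A](() => compute().filter(p), plan :+ "filter")

  // "Persisting" is the only operation that actually executes the recorded steps.
  def persist(): Seq[A] = compute()
}

object ToyDList {
  def fromSeq[A](data: Seq[A]): ToyDList[A] =
    new ToyDList[A](() => data, List("source"))
}
```

For example, `ToyDList.fromSeq(Seq("a b", "c")).flatMap(_.split(" ").toSeq)` builds a two-step plan without splitting anything; the split only happens once `persist()` is called.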
  21. DList[T]
  22. IO
      • Can read/write text files, Sequence files and Avro files
      • Can influence sorting (raw, secondary)
      Serialization
      • Serialization of custom types through Scala type classes and WireFormat[T]
      • Scoobi implements WireFormat[T] for primitive types, strings, tuples, Option[T], Either[A, B], Iterable[T], etc.
      • Out of the box support for serialization of Scala case classes
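Serialization through a type class can be illustrated without Scoobi itself. The trait below, `ToyWireFormat`, is a hypothetical stand-in written for this sketch; it only assumes the general shape of such a type class (write a value to a `DataOutput`, read it back from a `DataInput`) and is not Scoobi's own `WireFormat` definition.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInput, DataInputStream, DataOutput, DataOutputStream}

// Hypothetical stand-in for a WireFormat-style type class.
trait ToyWireFormat[A] {
  def toWire(value: A, out: DataOutput): Unit
  def fromWire(in: DataInput): A
}

object ToyWireFormat {
  implicit val intFormat: ToyWireFormat[Int] = new ToyWireFormat[Int] {
    def toWire(value: Int, out: DataOutput): Unit = out.writeInt(value)
    def fromWire(in: DataInput): Int = in.readInt()
  }

  implicit val stringFormat: ToyWireFormat[String] = new ToyWireFormat[String] {
    def toWire(value: String, out: DataOutput): Unit = out.writeUTF(value)
    def fromWire(in: DataInput): String = in.readUTF()
  }

  // A tuple's format is derived by the compiler from the formats of its parts.
  implicit def tuple2Format[A, B](implicit fa: ToyWireFormat[A], fb: ToyWireFormat[B]): ToyWireFormat[(A, B)] =
    new ToyWireFormat[(A, B)] {
      def toWire(value: (A, B), out: DataOutput): Unit = {
        fa.toWire(value._1, out)
        fb.toWire(value._2, out)
      }
      def fromWire(in: DataInput): (A, B) = {
        val a = fa.fromWire(in)
        val b = fb.fromWire(in)
        (a, b)
      }
    }

  // Serialize a value to bytes and read it back, using whichever format is in implicit scope.
  def roundTrip[A](value: A)(implicit format: ToyWireFormat[A]): A = {
    val bytes = new ByteArrayOutputStream()
    format.toWire(value, new DataOutputStream(bytes))
    format.fromWire(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
  }
}
```

The design point is that `ToyWireFormat.roundTrip(("hadoop", 3))` compiles without the caller naming a serializer: the compiler assembles the tuple format from the `String` and `Int` instances, and a type with no instance is rejected at compile time.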
  23. IO/Serialization I
  24. IO/Serialization II: For normal (i.e. non-case) classes
  25. Further Info
      Version 0.4 released today (!)
      • Avro, Sequence Files
      • Materialized DObjects
      • DList reduction methods (product, min, etc.)
      • Vastly improved testing support
      • Less overhead
      • Much more
      http://nicta.github.com/scoobi/
      scoobi-dev@googlegroups.com
      scoobi-users@googlegroups.com
  26. Scalding!
      http://github.com/twitter/scalding
      A Scala library that implements a higher level programming model for Cascading (on top of Hadoop MapReduce)
  27. Counting Words using Scalding
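The Scalding code on this slide was an image and is not in the transcript. The sketch below follows the word count example from Scalding's README of that period, using the fields-based API. It is an approximation, not the slide's exact code: it needs Scalding on the classpath, and the explicit `.read` call reflects early releases and may be unnecessary in later ones.

```scala
import com.twitter.scalding._

// Assumes the fields-based API of early Scalding releases (around 0.5.x).
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .read
    // Split each 'line into words, emitting one tuple per word in a new 'word field.
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    // Group by the named field 'word and count the size of each group.
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```

Fields are referred to by name (`'line`, `'word`) rather than through the element type of a collection, which is what the talk means by the Cascading record model and by "less strongly typed".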
  28. Scalding is...
      • A distributed collections abstraction
      • A wrapper around Cascading (i.e. no source code generation)
      • Based on the same record model (i.e. named fields)
      • Less strongly typed
      • Uses Kryo Serialization
      • Used by Twitter in production
      • Written in Scala (W00t!)
  29. Further Info
      Current version: 0.5.4
      http://github.com/twitter/scalding
      https://github.com/twitter/scalding/wiki
      @scalding
      cascading-user@googlegroups.com
      http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
  30. How do they compare?
      • Different approaches, similar power
      • Small feature differences, which will even out over time
      • Scoobi gets a little closer to idiomatic Scala
      • Twitter is definitely a bigger fish than NICTA, so Scalding gets all the attention
      • Both open sourced (last year)
      • Scoobi has better docs!
  31. Which one should I use? Ehm... I’m extremely prejudiced!
  32. Questions?
