Scoobi - Scala for Startups

My talk on Scoobi at the Scala for Startups meetup in San Francisco, June 2012.

    1. Scoobi – Ben Lever (@bmlever)
    2. Me: a Haskell DSL for developing computer vision algorithms targeting GPUs; machine learning, software systems, computer vision, optimisation, networks, control and signal processing; predictive analytics for the enterprise.
    3. Hadoop app development – wish list: quick dev cycles, expressive, reusability, type safety, reliability.
    4. Bridging the “tooling” gap: Scoobi sits between MapReduce pipelines and Hadoop MapReduce’s Java APIs, providing DList/DObject for implementation and ScalaCheck integration for testing.
    5. At a glance
        • Scoobi = Scala for Hadoop
        • Inspired by Google’s FlumeJava
        • Developed at NICTA
        • Open-sourced Oct 2011
        • Apache V2
    6. Hadoop MapReduce – word count
        [Diagram: lines of input text (“... hello ... cat ...”, “... cat ...”, “... hello ... fire ...”, “... fire ... cat ...”) flow through the three phases:]
        • Map, (k1, v1) -> [(k2, v2)]: each Mapper emits a (word, 1) pair per word, e.g. hello 1, cat 1, fire 1
        • Sort and shuffle, [(k2, v2)] -> [(k2, [v2])]: aggregate values by key, e.g. hello [1, 1], cat [1, 1, 1], fire [1, 1]
        • Reduce, (k2, [v2]) -> [(k3, v3)]: sum the counts per word, giving hello 2, cat 3, fire 2
    7. Java style

        public class WordCount {

          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }

          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
          }
        }

        Source: http://wiki.apache.org/hadoop/WordCount
    8. DList abstraction
        A Distributed List (DList) is an abstraction over data stored on HDFS that is manipulated by applying transformations.

        DList type                         Abstraction for
        DList[String]                      Lines of text files
        DList[(Int, String, Boolean)]      CSV files of the form “37,Joe,M”
        DList[(Float, Map[String, Int])]   Avro files with schema: {record {float, map}}
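        To make the table concrete, here is a minimal sketch of producing the first two DList shapes. fromTextFile appears on the next slide; the hdfs:// path, the field names, and the hand-rolled CSV parsing are illustrative assumptions, not from the deck.

        import com.nicta.scoobi.Scoobi._

        // Lines of text files: DList[String] (hypothetical path)
        val lines: DList[String] = fromTextFile("hdfs://data/people.csv")

        // CSV files of the form "37,Joe,M", parsed by hand into a typed DList
        val people: DList[(Int, String, Boolean)] =
          lines map { line =>
            val Array(age, name, gender) = line.split(",")
            (age.toInt, name, gender == "M")
          }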
    9. Scoobi style

        import com.nicta.scoobi.Scoobi._

        // Count the frequency of words from corpus of documents
        object WordCount extends ScoobiApp {
          def run() {
            val lines: DList[String] = fromTextFile(args(0))

            val freqs: DList[(String, Int)] =
              lines.flatMap(_.split(" "))   // DList[String]
                   .map(w => (w, 1))        // DList[(String, Int)]
                   .groupByKey              // DList[(String, Iterable[Int])]
                   .combine(_+_)            // DList[(String, Int)]

            persist(toTextFile(freqs, args(1)))
          }
        }
    10. DList trait

        trait DList[A] {
          /* Abstract methods */
          def parallelDo[B](dofn: DoFn[A, B]): DList[B]

          def ++(that: DList[A]): DList[A]

          def groupByKey[K, V]
            (implicit ev: A <:< (K, V)): DList[(K, Iterable[V])]

          def combine[K, V]
            (f: (V, V) => V)
            (implicit ev: A <:< (K, Iterable[V])): DList[(K, V)]

          /* All other methods are derived, e.g. ‘map’ */
        }
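        Since every other method is derived, it is worth seeing what such a derivation might look like. Below is a sketch of map written in terms of the abstract parallelDo; the DoFn members (setup/process/cleanup) and the Emitter type are assumptions based on Scoobi’s documentation of the period, not shown in the deck.

        // Sketch only: derive ‘map’ from ‘parallelDo’
        def mapDerived[A, B](list: DList[A])(f: A => B): DList[B] =
          list parallelDo new DoFn[A, B] {
            def setup() {}
            def process(input: A, emitter: Emitter[B]) { emitter.emit(f(input)) }
            def cleanup(emitter: Emitter[B]) {}
          }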
    11. Under the hood
        [Diagram: the word-count pipeline compiles to a single MapReduce job. fromTextFile creates a Load (LD) node reading lines from HDFS; flatMap and map become ParallelDo (PD) nodes producing words and word1; groupByKey becomes a GroupByKey (GBK) node producing wordG; combine becomes a Combiner (CV) node producing freq; persist writes the result back to HDFS.]
    12. Removing less than the average

        import com.nicta.scoobi.Scoobi._

        // Remove all integers that are less than the average integer
        object BetterThanAverage extends ScoobiApp {
          def run() {
            val ints: DList[Int] =
              fromTextFile(args(0)) collect { case AnInt(i) => i }

            val total: DObject[Int] = ints.sum
            val count: DObject[Int] = ints.size

            val average: DObject[Int] =
              (total, count) map { case (t, c) => t / c }

            val bigger: DList[Int] =
              (average join ints) filter { case (a, i) => i > a }

            persist(toTextFile(bigger, args(1)))
          }
        }
    13. Under the hood
        [Diagram: the program compiles to two MapReduce jobs. The first job computes total and count from ints (ParallelDo, GroupByKey, Combiner nodes); a client-side computation combines them into average, which is shipped to the second job via the distributed cache (DCache); the second job’s ParallelDo filters out the below-average integers and writes bigger back to HDFS.]
    14. DObject abstraction
        A DObject[A] is a single distributed value. DObjects are produced from DLists (and from tuples of DObjects via map), take part in client-side computations, and can be joined back against a DList via the distributed cache:

        trait DObject[A] {
          def map[B](f: A => B): DObject[B]
          def join[B](list: DList[B]): DList[(A, B)]
        }
    15. Mirroring the Scala Collection API

        DList => DList:    flatMap, map, filter, filterNot, groupBy, partition, flatten, distinct, ++, keys, values
        DList => DObject:  reduce, product, sum, length, size, count, max, maxBy, min, minBy
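        A short sketch exercising a few of the mirrored methods; AnInt is the extractor from slide 12, and the input path is a placeholder.

        val ints: DList[Int] =
          fromTextFile("hdfs://data/ints") collect { case AnInt(i) => i }

        val evens: DList[Int]     = ints filter (_ % 2 == 0)  // DList => DList
        val uniques: DList[Int]   = ints.distinct             // DList => DList
        val biggest: DObject[Int] = ints.max                  // DList => DObject
        val howMany: DObject[Int] = ints.size                 // DList => DObject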
    16. Building abstractions
        Functional programming – functions as procedures and functions as parameters – buys composability and reusability.
    17. Composing

        // Compute the average of a DList of “numbers”
        def average[A : Numeric](in: DList[A]): DObject[A] =
          (in.sum, in.size) map { case (sum, size) => sum / size }

        // Compute histogram
        def histogram[A](in: DList[A]): DList[(A, Int)] =
          in.map(x => (x, 1)).groupByKey.combine(_+_)

        // Throw away words with less-than-average frequency
        def betterThanAvgWords(lines: DList[String]): DList[String] = {
          val words = lines.flatMap(_.split(" "))
          val wordCnts = histogram(words)
          val avgFreq = average(wordCnts.values)

          (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
        }
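        Wiring the composed pieces into a runnable app follows the same skeleton as the WordCount on slide 9. A sketch – the BetterWords object name and the argument handling are mine, not from the deck:

        import com.nicta.scoobi.Scoobi._

        object BetterWords extends ScoobiApp {
          def run() {
            val lines = fromTextFile(args(0))
            persist(toTextFile(betterThanAvgWords(lines), args(1)))
          }
        }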
    18. Unit-testing ‘histogram’

        // Specification for histogram function (properties checked with ScalaCheck)
        class HistogramSpec extends HadoopSpecification {

          "Histogram from DList" >> {

            "Sum of bins must equal size of DList" >> { implicit c: SC =>
              Prop.forAll { list: List[Int] =>
                val hist = histogram(list.toDList)
                val binSum = persist(hist.values.sum)
                binSum == list.size
              }
            }

            "Number of bins must equal number of unique values" >> { implicit c: SC =>
              Prop.forAll { list: List[Int] =>
                val input = list.toDList
                val bins = histogram(input).keys.size
                val uniques = input.distinct.size
                val (b, u) = persist(bins, uniques)
                b == u
              }
            }
          }
        }
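        Further properties compose in the same style. A hypothetical third example – the materialize call (turning a DList into a DObject over its elements) is assumed from Scoobi’s API and does not appear in the deck:

        "All bin counts must be positive" >> { implicit c: SC =>
          Prop.forAll { list: List[Int] =>
            // materialize: DList[Int] => DObject[Iterable[Int]] (assumed API)
            val counts = persist(histogram(list.toDList).values.materialize)
            counts forall (_ >= 1)
          }
        }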
    19. sbt integration

        > test-only *Histogram* -- exclude cluster
        [info] HistogramSpec
        [info]
        [info] Histogram from DList
        [info] + Sum of bins must equal size of DList
        [info]   No cluster execution time
        [info] + Number of bins must equal number of unique values
        [info]   No cluster execution time
        [info]
        [info] Total for specification HistogramSpec
        [info] Finished in 12 seconds, 600 ms
        [info] 2 examples, 4 expectations, 0 failure, 0 error
        [info]
        [info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0

        Dependent JARs are copied (once) to a directory on the cluster (~/libjars by default).

        > test-only *Histogram*
        > test-only *Histogram* -- scoobi verbose
        > test-only *Histogram* -- scoobi verbose.warning
    20. Other features
        • Grouping:
          – API for controlling Hadoop’s sort-and-shuffle
          – Useful for implementing secondary sorting (see the sketch below)
        • Join and co-group helper methods
        • Matrix multiplication utilities
        • I/O:
          – Text, sequence, Avro
          – Roll your own
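        As an illustration of the Grouping bullet, a secondary sort is typically expressed by supplying a custom Grouping for a composite key: partition and group on the first component only, then sort within each group by the second. The member names and signatures below are assumptions based on Scoobi’s documentation of this era, not code from the deck:

        // Sketch only: group (String, Int) keys by the String, but order
        // each group’s keys by the Int – a classic secondary sort.
        implicit val secondarySort = new Grouping[(String, Int)] {
          def partition(key: (String, Int), howManyReducers: Int): Int =
            math.abs(key._1.hashCode) % howManyReducers    // route by first component
          def groupCompare(a: (String, Int), b: (String, Int)): Int =
            a._1 compareTo b._1                            // group by first component
          def sortCompare(a: (String, Int), b: (String, Int)): Int = {
            val c = a._1 compareTo b._1
            if (c != 0) c else a._2 compare b._2           // then order by second
          }
        }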
    21. Want to know more?
        • http://nicta.github.com/scoobi
        • Mailing lists:
          – http://groups.google.com/group/scoobi-users
          – http://groups.google.com/group/scoobi-dev
        • Twitter: @bmlever, @etorreborre
        • Meet me:
          – Will also be at Hadoop Summit (June 13-14)
          – Keen to get feedback
