Scoobi - Scala for Startups

My talk on Scoobi at the Scala for Startups meetup in San Francisco, June 2012.

  • 1. Scoobi. Ben Lever, @bmlever
  • 2. Me: Haskell DSL for development of computer vision algorithms targeting GPUs; machine learning, software systems, computer vision, optimisation, networks, control and signal processing; predictive analytics for the enterprise
  • 3. Hadoop app development – wish list: quick dev cycles, expressive, reusability, type safety, reliability
  • 4. Bridging the “tooling” gap. Diagram: Scoobi sits on top of the Java APIs of Hadoop MapReduce. Implementation: MapReduce pipelines via DList and DObject. Testing: ScalaCheck.
  • 5. At a glance
      • Scoobi = Scala for Hadoop
      • Inspired by Google’s FlumeJava
      • Developed at NICTA
      • Open-sourced Oct 2011
      • Apache V2
  • 6. Hadoop MapReduce – word count
      Input: lines of text (numbered 323 to 326 on the slide) containing words such as hello, cat and fire.
      Map, (k1, v1) → [(k2, v2)]: the mappers emit cat 1, hello 1, hello 1, fire 1, cat 1, fire 1, cat 1.
      Sort and shuffle, [(k2, v2)] → [(k2, [v2])]: values are aggregated by key, giving hello [1, 1], cat [1, 1, 1], fire [1, 1].
      Reduce, (k2, [v2]) → [(k3, v3)]: the reducers emit hello 2, cat 3, fire 2.
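The three phases on this slide can be modelled on ordinary in-memory collections. The sketch below is only an illustration of the type signatures, not Hadoop code: `MapReduceModel`, `mapReduce` and `wordCount` are names invented here, and the "shuffle" step only groups by key, whereas a real shuffle also sorts and moves data between machines.

```scala
object MapReduceModel {
  // Runs the three phases in memory; the type parameters follow the slide's signatures.
  def mapReduce[K1, V1, K2, V2, K3, V3](
      input: Seq[(K1, V1)],
      mapper: (K1, V1) => Seq[(K2, V2)],
      reducer: (K2, Seq[V2]) => Seq[(K3, V3)]): Seq[(K3, V3)] = {
    // Map: (k1, v1) => [(k2, v2)]
    val mapped: Seq[(K2, V2)] = input.flatMap { case (k, v) => mapper(k, v) }
    // Shuffle (grouping only): [(k2, v2)] => [(k2, [v2])]
    val shuffled: Map[K2, Seq[V2]] =
      mapped.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    // Reduce: (k2, [v2]) => [(k3, v3)]
    shuffled.toSeq.flatMap { case (k, vs) => reducer(k, vs) }
  }

  // Word count: input keys are line numbers, input values are lines of text.
  def wordCount(lines: Seq[(Long, String)]): Map[String, Int] =
    mapReduce[Long, String, String, Int, String, Int](
      lines,
      (_, line) => line.split("\\s+").filter(_.nonEmpty).toSeq.map(w => (w, 1)),
      (word, ones) => Seq((word, ones.sum))
    ).toMap
}
```

Calling `MapReduceModel.wordCount` on a few numbered lines returns a map from each word to its count, which is the same shape of result the reducers produce on the slide.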
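  • 7. Java style

      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

        // Emits (word, 1) for every token in the input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        // Sums the counts collected for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);

          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          job.waitForCompletion(true);
        }
      }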
  • 8. DList abstraction
      Data on HDFS → Transform → Distributed List (DList)

      DList type                          Abstraction for
      DList[String]                       Lines of text files
      DList[(Int, String, Boolean)]       CSV files of the form “37,Joe,M”
      DList[(Float, Map[String, Int])]    Avro files with schema: {record { int, map}}
  • 9. Scoobi style

      import com.nicta.scoobi.Scoobi._

      // Count the frequency of words from corpus of documents
      object WordCount extends ScoobiApp {
        def run() {
          val lines: DList[String] = fromTextFile(args(0))

          val freqs: DList[(String, Int)] =
            lines.flatMap(_.split(" "))  // DList[String]
                 .map(w => (w, 1))       // DList[(String, Int)]
                 .groupByKey             // DList[(String, Iterable[Int])]
                 .combine(_+_)           // DList[(String, Int)]

          persist(toTextFile(freqs, args(1)))
        }
      }
  • 10. DList trait

      trait DList[A] {
        /* Abstract methods */
        def parallelDo[B](dofn: DoFn[A, B]): DList[B]

        def ++(that: DList[A]): DList[A]

        def groupByKey[K, V](implicit ev: A <:< (K, V)): DList[(K, Iterable[V])]

        def combine[K, V](f: (V, V) => V)(implicit ev: A <:< (K, Iterable[V])): DList[(K, V)]

        /* All other methods are derived, e.g. 'map' */
      }
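To make the "all other methods are derived" remark concrete, here is a minimal in-memory sketch of how `flatMap`, `map` and `filter` could be built on a single `parallelDo` primitive. Both `LocalDList` and this simplified `DoFn` are hypothetical stand-ins written for this note; they are not Scoobi's classes and ignore distribution entirely.

```scala
// Hypothetical stand-in for a DoFn: process one element, emitting zero or more outputs.
trait DoFn[A, B] {
  def process(input: A, emit: B => Unit): Unit
}

// Hypothetical in-memory model of a DList, backed by an ordinary List.
final case class LocalDList[A](elems: List[A]) {
  // The one primitive: run the DoFn over every element and collect what it emits.
  def parallelDo[B](dofn: DoFn[A, B]): LocalDList[B] = {
    val out = List.newBuilder[B]
    elems.foreach(a => dofn.process(a, b => out += b))
    LocalDList(out.result())
  }

  // Derived: emit every element produced by f.
  def flatMap[B](f: A => Iterable[B]): LocalDList[B] =
    parallelDo(new DoFn[A, B] {
      def process(input: A, emit: B => Unit): Unit = f(input).foreach(emit)
    })

  // Derived: emit exactly one output per input.
  def map[B](f: A => B): LocalDList[B] = flatMap(a => List(f(a)))

  // Derived: emit the input only when the predicate holds.
  def filter(p: A => Boolean): LocalDList[A] = flatMap(a => if (p(a)) List(a) else Nil)
}
```

Keeping the abstract surface this small means an execution planner only has to understand a handful of node types, while users still get a familiar collection-style API.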
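  • 11. Under the hood (word count). Diagram, with each operation followed by its graph node and the value it produces: HDFS → fromTextFile [LD] → lines → flatMap [PD] → words → map [PD] → word1 → groupByKey [GBK] → wordG → combine [CV] → freq → persist → HDFS. The whole pipeline runs as one MapReduce job.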
  • 12. Removing less than the average

      import com.nicta.scoobi.Scoobi._

      // Remove all integers that are less than the average integer
      object BetterThanAverage extends ScoobiApp {
        def run() {
          val ints: DList[Int] =
            fromTextFile(args(0)) collect { case AnInt(i) => i }

          val total: DObject[Int] = ints.sum
          val count: DObject[Int] = ints.size

          val average: DObject[Int] =
            (total, count) map { case (t, c) => t / c }

          val bigger: DList[Int] =
            (average join ints) filter { case (a, i) => i > a }

          persist(toTextFile(bigger, args(1)))
        }
      }
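The same logic can be tried on a plain Scala `List` to see what the pipeline computes. This is a local sketch, not Scoobi code: the `AnInt` extractor below is a hypothetical re-creation of the one used on the slide, and the empty-input guard is an addition made here.

```scala
import scala.util.Try

object LocalBetterThanAverage {
  // Hypothetical stand-in for the slide's AnInt extractor: matches strings that parse as Int.
  object AnInt {
    def unapply(s: String): Option[Int] = Try(s.trim.toInt).toOption
  }

  def biggerThanAverage(lines: List[String]): List[Int] = {
    // Lines that do not parse as integers are silently dropped by collect.
    val ints: List[Int] = lines.collect { case AnInt(i) => i }
    if (ints.isEmpty) Nil
    else {
      val average = ints.sum / ints.size // integer division, as on the slide
      ints.filter(_ > average)           // strictly greater, so values equal to the average are dropped
    }
  }
}
```

Note that the filter keeps only values strictly greater than the average, so the result is slightly narrower than "not less than the average".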
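  • 13. Under the hood (removing less than the average). Diagram: a first MapReduce job loads ints from HDFS (LD) and computes total and count through PD, GBK, CV and M nodes; a client computation (OP) derives average; average reaches a second MapReduce job through the distributed cache (DCache), where PD nodes produce bigger, which is written to HDFS.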
  • 14. DObject abstraction
      Diagram: a DObject[A] is held in the distributed cache. map produces another DObject (client-side computations). join with a DList[B] on HDFS produces a DList[(A, B)] on HDFS.

      trait DObject[A] {
        def map[B](f: A => B): DObject[B]
        def join[B](list: DList[B]): DList[(A, B)]
      }
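A small in-memory model may help show what `map` and `join` mean for a single distributed value. `InMemDList` and `InMemDObject` are hypothetical names used only in this sketch; the real abstractions defer computation and move the value through the distributed cache, which this model leaves out.

```scala
// Hypothetical in-memory models, not Scoobi's classes.
final case class InMemDList[A](elems: List[A])

final case class InMemDObject[A](value: A) {
  // Transform the single value.
  def map[B](f: A => B): InMemDObject[B] = InMemDObject(f(value))

  // Pair the single value with every element of the list.
  def join[B](list: InMemDList[B]): InMemDList[(A, B)] =
    InMemDList(list.elems.map(b => (value, b)))
}
```

For example, `InMemDObject(10).map(_ / 2) join InMemDList(List(3, 7))` pairs the value 5 with each list element, which is how the average is made available to every integer in the previous example.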
  • 15. Mirroring the Scala Collection API
      • DList => DList: flatMap, map, filter, filterNot, groupBy, partition, flatten, distinct, ++, keys, values
      • DList => DObject: reduce, product, sum, length, size, count, max, maxBy, min, minBy
  • 16. Building abstractions. Functional programming: functions as procedures, functions as parameters. Composability + Reusability
  • 17. Composing

      // Compute the average of a DList of "numbers"
      def average[A : Numeric](in: DList[A]): DObject[A] =
        (in.sum, in.size) map { case (sum, size) => sum / size }

      // Compute histogram
      def histogram[A](in: DList[A]): DList[(A, Int)] =
        in.map(x => (x, 1)).groupByKey.combine(_+_)

      // Throw away words with less-than-average frequency
      def betterThanAvgWords(lines: DList[String]): DList[String] = {
        val words = lines.flatMap(_.split(" "))
        val wordCnts = histogram(words)
        val avgFreq = average(wordCnts.values)
        (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
      }
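The same composition can be exercised on local collections. The sketch below mirrors the three functions with `List` and `Map` in place of `DList` and `DObject`; the object name `LocalComposing` is invented here, and `Fractional` is used instead of `Numeric` as an assumption so that division is actually defined for the element type.

```scala
object LocalComposing {
  // Average of a list; Fractional (rather than Numeric) provides division.
  def average[A](in: List[A])(implicit num: Fractional[A]): A =
    num.div(in.sum(num), num.fromInt(in.size))

  // Count how many times each distinct value occurs.
  def histogram[A](in: List[A]): Map[A, Int] =
    in.groupBy(identity).map { case (x, xs) => (x, xs.size) }

  // Keep only the words whose frequency is strictly above the average frequency.
  def betterThanAvgWords(lines: List[String]): Set[String] = {
    val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)
    val wordCnts = histogram(words)
    val avgFreq = average(wordCnts.values.toList.map(_.toDouble))
    wordCnts.collect { case (w, f) if f > avgFreq => w }.toSet
  }
}
```

Because `histogram` and `average` know nothing about words, they can be reused on any element type, which is the composability point the slide is making.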
  • 18. Unit-testing ‘histogram’ (with ScalaCheck)

      // Specification for histogram function
      class HistogramSpec extends HadoopSpecification {

        "Histogram from DList" >> {

          "Sum of bins must equal size of DList" >> { implicit c: SC =>
            Prop.forAll { list: List[Int] =>
              val hist = histogram(list.toDList)
              val binSum = persist(hist.values.sum)
              binSum == list.size
            }
          }

          "Number of bins must equal number of unique values" >> { implicit c: SC =>
            Prop.forAll { list: List[Int] =>
              val input = list.toDList
              val bins = histogram(input).keys.size
              val uniques = input.distinct.size
              val (b, u) = persist(bins, uniques)
              b == u
            }
          }
        }
      }
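The two properties can also be checked against a local `histogram` without any cluster or test framework. In this sketch a hand-rolled loop over random lists stands in for ScalaCheck's `Prop.forAll`; `HistogramProperties` and `holdsForRandomLists` are hypothetical names, and the list sizes, value range and seed are arbitrary choices made here.

```scala
import scala.util.Random

object HistogramProperties {
  // Local histogram over an ordinary List.
  def histogram[A](in: List[A]): Map[A, Int] =
    in.groupBy(identity).map { case (x, xs) => (x, xs.size) }

  // Property 1: the bin counts add up to the number of inputs.
  def sumOfBinsEqualsSize(list: List[Int]): Boolean =
    histogram(list).values.sum == list.size

  // Property 2: there is exactly one bin per distinct value.
  def binsEqualUniques(list: List[Int]): Boolean =
    histogram(list).keys.size == list.distinct.size

  // Hand-rolled substitute for Prop.forAll: evaluate the property on random lists.
  def holdsForRandomLists(prop: List[Int] => Boolean,
                          trials: Int = 100,
                          seed: Long = 42L): Boolean = {
    val rng = new Random(seed)
    (1 to trials).forall { _ =>
      val list = List.fill(rng.nextInt(50))(rng.nextInt(10))
      prop(list)
    }
  }
}
```

Unlike ScalaCheck, this loop does no shrinking of failing inputs, so it is only a quick sanity check of the properties themselves.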
  • 19. sbt integration

      > test-only *Histogram* -- exclude cluster
      [info] HistogramSpec
      [info]
      [info] Histogram from DList
      [info] + Sum of bins must equal size of DList
      [info] No cluster execution time
      [info] + Number of bins must equal number of unique values
      [info] No cluster execution time
      [info]
      [info]
      [info] Total for specification BoundedFilterSpec
      [info] Finished in 12 seconds, 600 ms
      [info] 2 examples, 4 expectations, 0 failure, 0 error
      [info]
      [info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0
      >
      > test-only *Histogram*
      > test-only *Histogram* -- scoobi verbose
      > test-only *Histogram* -- scoobi verbose.warning

      Dependent JARs are copied (once) to a directory on the cluster (~/libjars by default).
  • 20. Other features
      • Grouping:
          – API for controlling Hadoop’s sort-and-shuffle
          – Useful for implementing secondary sorting
      • Join and Co-group helper methods
      • Matrix multiplication utilities
      • I/O:
          – Text, sequence, Avro
          – Roll your own
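As a rough picture of what secondary sorting delivers, the sketch below groups records by key and orders the values inside each group by a second field. It is a local model only: `SecondarySortModel` and `groupSorted` are invented for this note, the (user, (timestamp, event)) record shape is an assumed example, and it does not use Scoobi's Grouping API, which pushes this ordering into Hadoop's sort-and-shuffle instead of sorting in memory.

```scala
object SecondarySortModel {
  // Group events by user, then order each user's events by timestamp.
  def groupSorted(events: List[(String, (Long, String))]): Map[String, List[(Long, String)]] =
    events
      .groupBy(_._1)
      .map { case (user, kvs) => (user, kvs.map(_._2).sortBy(_._1)) }
}
```

Letting the shuffle do the ordering matters on a cluster because a reducer can then stream through sorted values without holding a whole group in memory.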
  • 21. Want to know more?
      • Mailing lists
      • Twitter:
          – @bmlever
          – @etorreborre
      • Meet me:
          – Will also be at Hadoop Summit (June 13-14)
          – Keen to get feedback