Scoobi - Scala for Startups


  1. Scoobi – Ben Lever (@bmlever)
  2. Me – Haskell DSL for development of computer vision algorithms targeting GPUs; machine learning, software systems, computer vision, optimisation, networks, control and signal processing; predictive analytics for the enterprise.
  3. Hadoop app development – wish list: quick dev cycles, expressive, reusability, type safety, reliability.
  4. Bridging the “tooling” gap – Scoobi sits between MapReduce pipelines and the Hadoop Java APIs: DList and DObject for implementation, ScalaCheck for testing, built on top of Hadoop MapReduce.
  5. At a glance:
     • Scoobi = Scala for Hadoop
     • Inspired by Google’s FlumeJava
     • Developed at NICTA
     • Open-sourced Oct 2011
     • Apache V2
  6. Hadoop MapReduce – word count. (Diagram: lines of text flow through mappers that emit (word, 1) pairs; the sort-and-shuffle phase aggregates values by key; reducers sum the counts per word, e.g. hello 2, cat 3, fire 2.) The phase signatures are:
     mapper: (k1, v1) → [(k2, v2)]
     sort and shuffle (aggregate values by key): [(k2, v2)] → [(k2, [v2])]
     reducer: (k2, [v2]) → [(k3, v3)]
  7. Java style (source: http://wiki.apache.org/hadoop/WordCount):

     public class WordCount {

       public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
         private final static IntWritable one = new IntWritable(1);
         private Text word = new Text();

         public void map(LongWritable key, Text value, Context context)
             throws IOException, InterruptedException {
           String line = value.toString();
           StringTokenizer tokenizer = new StringTokenizer(line);
           while (tokenizer.hasMoreTokens()) {
             word.set(tokenizer.nextToken());
             context.write(word, one);
           }
         }
       }

       public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
         public void reduce(Text key, Iterable<IntWritable> values, Context context)
             throws IOException, InterruptedException {
           int sum = 0;
           for (IntWritable val : values) {
             sum += val.get();
           }
           context.write(key, new IntWritable(sum));
         }
       }

       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "wordcount");

         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);

         job.setMapperClass(Map.class);
         job.setReducerClass(Reduce.class);

         job.setInputFormatClass(TextInputFormat.class);
         job.setOutputFormatClass(TextOutputFormat.class);

         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));

         job.waitForCompletion(true);
       }
     }
  8. DList abstraction – a Distributed List (DList) is a transformable abstraction over data on HDFS:

     DList type                          Abstraction for
     DList[String]                       Lines of text files
     DList[(Int, String, Boolean)]       CSV files of the form “37,Joe,M”
     DList[(Float, Map[String, Int])]    Avro files with schema: {record {int, map}}
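     A minimal sketch (not from the slides) of how such a typed DList could be produced, using only fromTextFile and map as shown later in the deck; the file name, field layout and gender-to-Boolean mapping are illustrative assumptions:

     import com.nicta.scoobi.Scoobi._

     // Parse "37,Joe,M"-style lines into a typed DList (hypothetical input file).
     val people: DList[(Int, String, Boolean)] =
       fromTextFile("people.csv") map { line =>
         val Array(age, name, gender) = line.split(",")
         (age.toInt, name, gender == "M")
       }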
  9. Scoobi style:

     import com.nicta.scoobi.Scoobi._

     // Count the frequency of words from a corpus of documents
     object WordCount extends ScoobiApp {
       def run() {
         val lines: DList[String] = fromTextFile(args(0))

         val freqs: DList[(String, Int)] =
           lines.flatMap(_.split(" "))  // DList[String]
                .map(w => (w, 1))       // DList[(String, Int)]
                .groupByKey             // DList[(String, Iterable[Int])]
                .combine(_+_)           // DList[(String, Int)]

         persist(toTextFile(freqs, args(1)))
       }
     }
  10. DList trait:

     trait DList[A] {

       /* Abstract methods */
       def parallelDo[B](dofn: DoFn[A, B]): DList[B]

       def ++(that: DList[A]): DList[A]

       def groupByKey[K, V](implicit ev: A <:< (K, V)): DList[(K, Iterable[V])]

       def combine[K, V](f: (V, V) => V)(implicit ev: A <:< (K, Iterable[V])): DList[(K, V)]

       /* All other methods are derived, e.g. ‘map’ */
     }
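     The slide notes that every other DList method is derived. As a hedged illustration of what such a derivation might look like, here is how map could be written on top of parallelDo; the DoFn/Emitter shapes below are assumptions in the FlumeJava style, not necessarily Scoobi’s actual signatures:

     // Sketch only, inside trait DList[A]: derive map from parallelDo
     // using an assumed DoFn callback shape.
     def map[B](f: A => B): DList[B] =
       parallelDo(new DoFn[A, B] {
         def setup() {}
         def process(input: A, emitter: Emitter[B]) { emitter.emit(f(input)) }
         def cleanup(emitter: Emitter[B]) {}
       })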
  11. Under the hood – the word-count program compiles to a dataflow graph executed as a single MapReduce job: fromTextFile (LD, load from HDFS) → lines → flatMap (PD, parallelDo) → words → map (PD) → word1 → groupByKey (GBK) → wordG → combine (CV) → freq → persist (write to HDFS).
  12. Removing less than the average:

     import com.nicta.scoobi.Scoobi._

     // Remove all integers that are less than the average integer
     object BetterThanAverage extends ScoobiApp {
       def run() {
         val ints: DList[Int] = fromTextFile(args(0)) collect { case AnInt(i) => i }

         val total: DObject[Int]   = ints.sum
         val count: DObject[Int]   = ints.size
         val average: DObject[Int] = (total, count) map { case (t, c) => t / c }

         val bigger: DList[Int] = (average join ints) filter { case (a, i) => i > a }

         persist(toTextFile(bigger, args(1)))
       }
     }
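     The code above pattern-matches input lines with AnInt. A minimal extractor with that shape would look like the following; this particular definition is an assumption for illustration, not necessarily the helper Scoobi ships:

     // Hypothetical extractor: parses a line into an Int when possible.
     object AnInt {
       def unapply(s: String): Option[Int] =
         try Some(s.trim.toInt) catch { case _: NumberFormatException => None }
     }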
  13. Under the hood – this program compiles to two MapReduce jobs plus a client-side step: the first job computes total and count from ints (PD → GBK → CV), materialising them to HDFS; the average is computed as a client computation and pushed to the distributed cache (DCache); a second MapReduce job (PD) then joins it against ints to produce bigger on HDFS.
  14. DObject abstraction – a DObject[A] is a value held in the distributed cache and usable in client-side computations; mapping a DList[B] on HDFS can produce DObjects, and joining a DObject with a DList[B] yields a DList[(A, B)]:

     trait DObject[A] {
       def map[B](f: A => B): DObject[B]
       def join[B](list: DList[B]): DList[(A, B)]
     }
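     A short sketch built only from operations shown in this deck (min, max, mapping over a pair of DObjects, and join); the input DList, file name and AnInt extractor are illustrative assumptions:

     val xs: DList[Int] = fromTextFile("ints.txt") collect { case AnInt(i) => i }

     val lo: DObject[Int] = xs.min
     val hi: DObject[Int] = xs.max

     // Two client-side DObjects can be tupled and mapped...
     val range: DObject[Int] = (lo, hi) map { case (l, h) => h - l }

     // ...and a DObject can be joined back against a DList.
     val withRange: DList[(Int, Int)] = range join xs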
  15. Mirroring the Scala Collection API:
      DList => DList: flatMap, map, filter, filterNot, groupBy, partition, flatten, distinct, ++, keys, values
      DList => DObject: reduce, product, sum, length, size, count, max, maxBy, min, minBy
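     A small sketch exercising a few of the mirrored operations above; the input file and AnInt extractor are illustrative assumptions:

     val nums: DList[Int] = fromTextFile("nums.txt") collect { case AnInt(i) => i }

     val evens: DList[Int]           = nums filter (_ % 2 == 0)  // DList => DList
     val distinctCount: DObject[Int] = nums.distinct.size        // DList => DObject
     val total: DObject[Int]         = nums.sum                  // DList => DObject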
  16. Building abstractions – functional programming: functions as procedures, functions as parameters; composability + reusability.
  17. Composing:

     // Compute the average of a DList of “numbers”
     def average[A : Numeric](in: DList[A]): DObject[A] =
       (in.sum, in.size) map { case (sum, size) => sum / size }

     // Compute histogram
     def histogram[A](in: DList[A]): DList[(A, Int)] =
       in.map(x => (x, 1)).groupByKey.combine(_+_)

     // Throw away words with less-than-average frequency
     def betterThanAvgWords(lines: DList[String]): DList[String] = {
       val words    = lines.flatMap(_.split(" "))
       val wordCnts = histogram(words)
       val avgFreq  = average(wordCnts.values)
       (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
     }
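     Because histogram and average are ordinary functions over DLists, they can be reused with any element type. A reuse sketch (the ratings input and AnInt extractor are hypothetical):

     val ratings: DList[Int] = fromTextFile("ratings.txt") collect { case AnInt(i) => i }

     val ratingHist: DList[(Int, Int)] = histogram(ratings)
     val avgRating: DObject[Int]       = average(ratings)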
  18. Unit-testing ‘histogram’ (with ScalaCheck properties):

     // Specification for histogram function
     class HistogramSpec extends HadoopSpecification {

       "Histogram from DList" >> {

         "Sum of bins must equal size of DList" >> { implicit c: SC =>
           Prop.forAll { list: List[Int] =>
             val hist   = histogram(list.toDList)
             val binSum = persist(hist.values.sum)
             binSum == list.size
           }
         }

         "Number of bins must equal number of unique values" >> { implicit c: SC =>
           Prop.forAll { list: List[Int] =>
             val input   = list.toDList
             val bins    = histogram(input).keys.size
             val uniques = input.distinct.size
             val (b, u)  = persist(bins, uniques)
             b == u
           }
         }
       }
     }
  19. sbt integration (dependent JARs are copied, once, to a directory on the cluster, ~/libjars by default):

     > test-only *Histogram* -- exclude cluster
     [info] HistogramSpec
     [info]
     [info] Histogram from DList
     [info]   + Sum of bins must equal size of DList
     [info]     No cluster execution time
     [info]   + Number of bins must equal number of unique values
     [info]     No cluster execution time
     [info]
     [info] Total for specification BoundedFilterSpec
     [info] Finished in 12 seconds, 600 ms
     [info] 2 examples, 4 expectations, 0 failure, 0 error
     [info]
     [info] Passed: Total 2, Failed 0, Errors 0, Passed 2, Skipped 0

     > test-only *Histogram*
     > test-only *Histogram* -- scoobi verbose
     > test-only *Histogram* -- scoobi verbose.warning
  20. Other features:
      • Grouping: API for controlling Hadoop’s sort-and-shuffle, useful for implementing secondary sorting
      • Join and co-group helper methods
      • Matrix multiplication utilities
      • I/O: text, sequence, Avro, or roll your own
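     The Grouping API itself is not shown in the deck. For contrast, here is a naive secondary sort written only with DList methods from earlier slides (group by key, then sort each group’s values in memory); the Grouping API exists to push this ordering into Hadoop’s sort-and-shuffle instead. The event type and function name are hypothetical:

     // Naive secondary sort: works, but sorts each group's values in memory.
     def naiveSecondarySort(events: DList[(String, Long)]): DList[(String, Seq[Long])] =
       events.groupByKey map { case (key, times) => (key, times.toSeq.sorted) }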
  21. Want to know more?
      • http://nicta.github.com/scoobi
      • Mailing lists: http://groups.google.com/group/scoobi-users and http://groups.google.com/group/scoobi-dev
      • Twitter: @bmlever, @etorreborre
      • Meet me: will also be at Hadoop Summit (June 13–14); keen to get feedback
