SHOULD I USE
SCALDING OR SCOOBI
OR SCRUNCH?
CHRIS SEVERS @ccsevers
eBay SEARCH SCIENCE
Hadoop Summit
June 26th, 2013
Obligatory Big Data Stuff
•Some fun facts about eBay from Hugh Williams’ blog:
–We have over 50 petabytes of data stored in
our Hadoop and Teradata clusters
–We have over 400 million items for sale
–We process more than 250 million user queries per day
–We serve over 100,000 pages per second
–Our users spend over 180 years in total every day looking
at items
–We have over 112 million active users
–We sold over US$75 billion in merchandise in 2012
•http://hughewilliams.com/2013/06/24/the-size-and-scale-of-ebay-2013-edition/
THE ANSWER
YES.
THANK YOU AND GOOD NIGHT
•Questions/comments?
•Thanks to Avi Bryant, @avibryant for settling this issue.
NO REALLY, WHICH ONE SHOULD I USE?!
•All three (Scalding, Scoobi, Scrunch) are amazing projects
•They seem to be converging to a common API
•There are small differences, but if you can use one you will
likely be productive with the others very quickly
•They are all much better than the alternatives.
•Scalding: https://github.com/twitter/scalding, @scalding
–Authors: @posco @argyris @avibryant
•Scoobi: https://github.com/NICTA/scoobi
–Authors: @bmlever @etorreborre
•Scrunch: http://crunch.apache.org/scrunch.html
–Author: @josh_wills
THE AGENDA
1. Quick survey of the current landscape outside
Scalding, Scoobi, and Scrunch
2. A light comparison of Scalding, Scoobi, and Scrunch.
3. Some code samples
4. The future
THE ALTERNATIVES
I promise this part will be quick
VANILLA MAPREDUCE
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
PIG
•Apache Pig is a really great tool for quick, ad-hoc data
analysis
•While we can do amazing things with it, I’m not sure we
should
•Anything complicated requires User Defined Functions
(UDFs)
•UDFs require a separate code base
•Now you have to maintain code in two separate languages for no good reason
APACHE HIVE
•On previous slide: s/Pig/Hive/g
STREAMING
•Can be concise
•Easy to test
–cat myfile.txt | ./mymapper.sh | sort | ./myreducer.sh
•Same problems as vanilla MR when it comes to multistage flows
CASCADING/CRUNCH
•Higher level abstractions are good
•Tell the API what you want to do, let it sort out the actual
series of MR jobs
•Very easy to do cogroup, join, multiple passes
•Still a bit too verbose
•Feels like shoehorning a fundamentally functional notion
into an imperative context
•If you can’t move away from Java, this is your best bet
LET’S COMPARE
This slide is bright yellow!
FEATURE COMPARISON
                        Scalding                   Scoobi                   Scrunch
Data model              Tuple or distributed       Distributed collection   Distributed collection
                        collection
Has Algebird baked in   Yes                        No                       No
Is Java-free            No                         Yes                      No
Backed by Cloudera      No                         No                       Yes
Free                    Yes                        Yes                      Yes
SOME SCALA CODE
val myLines = getStuff
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValues(x => x.size)
write(countedWords)
SOME SCALDING CODE
val myLines = TextLine(path)
val myWords = myLines.flatMap(w => w.split(" "))
  .groupBy(identity)
  .size
myWords.write(TypedTsv(output))
SOME SCOOBI CODE
val lines = fromTextFile("hdfs://in/...")
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .groupByKey
  .combine(_ + _)
persist(counts.toTextFile("hdfs://out/...", overwrite = true))
SOME SCRUNCH CODE
val pipeline = new Pipeline[WordCountExample]

def wordCount(fileName: String) = {
  pipeline.read(from.textFile(fileName))
    .flatMap(_.toLowerCase.split("\\W+"))
    .count
}
ADVANTAGES
•Type checking
–Find errors at compile time, not at job submission time (or even worse, 5 hours after job submission time); a tiny illustration follows this list
•Single language
–Scala is a fully functional programming language
•Productivity
–Since the code you write looks like collections code you can
use the Scala REPL to prototype
•Clarity
–Write code as a series of operations and let the job planner
smash it all together
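As a tiny illustration of the type-checking point, here is a plain Scala collections analogy (no Hadoop involved); the same kind of mistake in a typed Scalding/Scoobi/Scrunch pipeline is caught before anything is submitted to the cluster:

val words = List("hadoop", "summit")
val lengths = words.map(_.length)   // List[Int]
lengths.sum                         // fine: 12
// words.sum                        // does not compile: no implicit Numeric[String]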
BREAD AND BUTTER
•You can be effective within hours by just learning a few
simple ideas
–map
–flatMap
–filter
–groupBy
–reduce
–foldLeft
•Everything above takes a function as an argument.
map
•Does what you think it does
scala> val mylist = List(1,2,3)
mylist: List[Int] = List(1, 2, 3)
scala> mylist.map(x => x + 5)
res0: List[Int] = List(6, 7, 8)
flatMap
•Kind of like map
•Does a map then a flatten
scala> val mystrings = List("hello there", "hadoop summit")
mystrings: List[String] = List(hello there, hadoop summit)
scala> mystrings.map(x => x.split(" "))
res5: List[Array[String]] =
List(Array(hello, there), Array(hadoop, summit))
scala> mystrings.map(x => x.split(" ")).flatten
res6: List[String] = List(hello, there, hadoop, summit)
scala> mystrings.flatMap(x => x.split(" "))
res7: List[String] = List(hello, there, hadoop, summit)
filter
•Pretty obvious
scala> mystrings.filter(x => x.contains("hadoop"))
res8: List[String] = List(hadoop summit)
•Takes a predicate function
•Use this a lot
groupBy
•Puts items together using an arbitrary function
scala> mylist.groupBy(x => x % 2 == 0)
res9: scala.collection.immutable.Map[Boolean,List[Int]] =
Map(false -> List(1, 3), true -> List(2))
scala> mylist.groupBy(x => x % 2)
res10: scala.collection.immutable.Map[Int,List[Int]] =
Map(1 -> List(1, 3), 0 -> List(2))
scala> mystrings.groupBy(x => x.length)
res11: scala.collection.immutable.Map[Int,List[String]] =
Map(11 -> List(hello there), 13 -> List(hadoop summit))
reduce
•Not necessarily what you think
•Signature: (T,T) => T
scala> mylist.reduce( (l,r) => l + r )
res12: Int = 6
scala> mystrings.reduce( (l,r) => l + r )
res13: String = hello therehadoop summit
scala> mystrings.reduce( (l,r) => l + " " + r )
res14: String = hello there hadoop summit
•In the case of Scalding/Scoobi/Scrunch, this happens on
the values after a group operation.
foldLeft
•This is a fancy reduce
•Signature: (z: B)((B, T) => B): B
•The input z is called the accumulator
scala> mylist.foldLeft(Set[Int]())((s,x) => s + x)
res15: scala.collection.immutable.Set[Int] =
Set(1, 2, 3)
scala> mylist.foldLeft(List[Int]())((xs, x) => x :: xs)
res16: List[Int] = List(3, 2, 1)
•Like reduce, this happens on the values after a groupBy (see the sketch below)
•Called slightly different things in Scoobi/Scrunch
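A minimal sketch, in plain Scala collections, of what "foldLeft on the values after a groupBy" looks like; the key/value data here is made up for illustration:

val pairs = List(("a", 1), ("a", 2), ("b", 3))
// Group by key, then fold each group's values into a Set (distinct values per key)
pairs.groupBy(_._1).mapValues(_.map(_._2).foldLeft(Set.empty[Int])((s, x) => s + x))
// Map(a -> Set(1, 2), b -> Set(3))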
MONOIDS: WHY YOU SHOULD CARE ABOUT
MATH
•From Wikipedia:
–a monoid is an algebraic structure with a single associative binary operation and an identity element (a minimal sketch of these two laws follows this list)
•Almost everything you want to do is a monoid
–Standard addition of numeric types is the most common
–List/map/set/string concatenation
–Top k elements
–Bloom filter, count-min sketch, hyperloglog
–stochastic gradient descent
–moments of distributions
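To make the definition concrete, here is a minimal sketch in plain Scala (no Hadoop or Algebird) using list concatenation, one of the examples above, as the monoid:

val zero = List.empty[Int]                                 // identity element
def plus(l: List[Int], r: List[Int]): List[Int] = l ++ r   // associative operation

// Associativity: grouping doesn't matter, so partial results can be combined in any order
plus(plus(List(1), List(2)), List(3)) == plus(List(1), plus(List(2), List(3)))  // true
// Identity: combining with zero changes nothing, so empty partitions are harmless
plus(zero, List(1, 2)) == List(1, 2)                                            // true

Associativity is exactly what lets the framework apply the operation map-side (as a combiner) and again reduce-side without changing the answer.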
MORE MONOID STUFF
•If you are aggregating, you are probably using a monoid
•Scalding has Algebird and monoid support baked in
•Scoobi and Scrunch can use Algebird (or any other
monoid library) with almost no work
–combine { case (l,r) => monoid.plus(l,r) }
•Algebird handles tuples with ease
•Very easy to define monoids for your own types (see the sketch below)
•Algebird: https://github.com/twitter/algebird @algebird
–Authors: Lots!
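As a rough sketch of the last two points (not an exact Scoobi/Scrunch API reference), defining and using an Algebird monoid for your own type might look like this; the PageStats type and its fields are made up for illustration:

import com.twitter.algebird.Monoid

case class PageStats(views: Long, clicks: Long)

// A Monoid is just a zero plus an associative plus
implicit val pageStatsMonoid: Monoid[PageStats] = new Monoid[PageStats] {
  def zero = PageStats(0L, 0L)
  def plus(l: PageStats, r: PageStats) = PageStats(l.views + r.views, l.clicks + r.clicks)
}

// Algebird already ships monoids for tuples, maps, options, ...
Monoid.plus((1, 2L), (3, 4L))              // (4, 6L)
Monoid.plus(Map("a" -> 1), Map("a" -> 2))  // Map(a -> 3)

// The combine step from the bullet above can then be written as
// grouped.combine { case (l, r) => Monoid.plus(l, r) }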
THE FUTURE
It’s here now!
SPARK
•Spark is a system for executing general computation
graphs, not just MR
•The syntax looks very much like Scalding, Scoobi, and Scrunch (a hedged word-count sketch follows this list)
–It inspired the APIs of a couple of them
•Spark runs on YARN as of the latest release
•Can cache intermediate data
–Iterative problems become much easier
•Developed by the AMPLab at UC Berkeley
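For comparison, a hedged word-count sketch against Spark's Scala API; the master setting and the HDFS paths are placeholders:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // brings in reduceByKey and friends

val sc = new SparkContext("local", "WordCount")  // placeholder master and app name

val counts = sc.textFile("hdfs://in/...")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://out/...")

The shape is nearly identical to the Scoobi example earlier; the practical difference is that intermediate RDDs can be cached in memory, which is what makes iterative jobs cheap.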
GREAT, NOW I HAVE TO LEARN 4 THINGS
INSTEAD OF 3
•Scalding, Scoobi, and Scrunch seem to have all sprung
into being around the same time and independently of
each other
•Spark was around a little before that
•Do we really need 3 (or 4) very similar solutions?
•Wouldn’t it be nice if we could just pick one and all get
behind it?
•I was prepared to make a definitive statement about the
best one, but then I learned something new
HAVE CAKE AND EAT IT
•There is currently work being done on a common API that spans Scalding, Scoobi, Scrunch, and Spark
•Not much is implemented yet, but all 4 groups are talking
and working things out
•The main use case is already done
–After word count everything else is just academic
•Check it out here: https://github.com/jwills/driskill
•In the future* you'll be able to write against this common API and then decide which system should execute the job
–Think of it like choosing between a List, a Buffer, or a Vector
HOW CAN WE HELP?
•Get involved
•If something bothers you, fix it
•If you want a new feature, try and build it
•Everyone involved is actually quite friendly
•You can build jars to run on your cluster and no one will
know there is Scala Inside™
CONCLUSION
We’re almost done!
THINGS TO TAKE AWAY
•MapReduce is a functional problem; we should use functional tools
•You can increase productivity, safety, and maintainability
all at once with no down side
•Thinking of data flows in a functional way opens up
many new possibilities
•The community is awesome
THANKS! (FOR REAL THIS TIME)
•Questions/comments?