SHOULD I USE
SCALDING OR SCOOBI
OR SCRUNCH?
CHRIS SEVERS @ccsevers
eBay SEARCH SCIENCE
Hadoop Summit
June 26th, 2013
Obligatory Big Data Stuff
•Some fun facts about eBay from Hugh Williams’ blog:
–We have over 50 petabytes of data stored in
our Hadoop and Teradata clusters
–We have over 400 million items for sale
–We process more than 250 million user queries per day
–We serve over 100,000 pages per second
–Our users spend over 180 years in total every day looking
at items
–We have over 112 million active users
–We sold over US$75 billion in merchandise in 2012
•http://hughewilliams.com/2013/06/24/the-size-and-scale-
of-ebay-2013-edition/
THE ANSWER
YES.
THANK YOU AND GOOD NIGHT
•Questions/comments?
•Thanks to Avi Bryant, @avibryant for settling this issue.
NO REALLY, WHICH ONE SHOULD I USE?!
•All three (Scalding, Scoobi, Scrunch) are amazing projects
•They seem to be converging to a common API
•There are small differences, but if you can use one you will
likely be productive with the others very quickly
•They are all much better than the alternatives.
•Scalding: https://github.com/twitter/scalding, @scalding
–Authors: @posco @argyris @avibryant
•Scoobi: https://github.com/NICTA/scoobi
–Authors: @bmlever @etorreborre
•Scrunch: http://crunch.apache.org/scrunch.html
–Author: @josh_wills
THE AGENDA
1. Quick survey of the current landscape outside
Scalding, Scoobi, and Scrunch
2. A light comparison of Scalding, Scoobi, and Scrunch.
3. Some code samples
4. The future
THE ALTERNATIVES
I promise this part will be quick
VANILLA MAPREDUCE
package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires the mapper and reducer together; args are input and output paths.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
PIG
•Apache Pig is a really great tool for quick, ad-hoc data
analysis
•While we can do amazing things with it, I’m not sure we
should
•Anything complicated requires User Defined Functions
(UDFs)
•UDFs require a separate code base
•Now you have to maintain two separate languages for
no good reason
APACHE HIVE
•On previous slide: s/Pig/Hive/g
STREAMING
•Can be concise
•Easy to test
–cat myfile.txt | ./mymapper.sh | sort | ./myreducer.sh
•Same problems as vanilla MR when it comes to
multistage flows
CASCADING/CRUNCH
•Higher level abstractions are good
•Tell the API what you want to do, let it sort out the actual
series of MR jobs
•Very easy to do cogroup, join, multiple passes
•Still a bit too verbose
•Feels like shoehorning a fundamentally functional notion
into an imperative context
•If you can’t move away from Java, this is your best bet
LET’S COMPARE
This slide is bright yellow!
FEATURE COMPARISON
Feature                 Scalding                          Scoobi                   Scrunch
Data model              Tuple or distributed collection   Distributed collection   Distributed collection
Has Algebird baked in   Yes                               No                       No
Is Java-free            No                                Yes                      No
Backed by Cloudera      No                                No                       Yes
Free                    Yes                               Yes                      Yes
SOME SCALA CODE
val myLines = getStuff
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValues(x => x.size)
write(countedWords)
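The snippet above leaves getStuff and write undefined, so it will not run as is. A minimal runnable sketch of the same idea on an in-memory list follows; the object name WordCountSketch and the sample lines are made up for illustration.

```scala
object WordCountSketch {
  // Hypothetical stand-in for getStuff: a few lines held in memory.
  val myLines: List[String] = List("hello there", "hadoop summit", "hello hadoop")

  // Split each line on whitespace and flatten the pieces into one list of words.
  val myWords: List[String] = myLines.flatMap(line => line.split("\\s+"))

  // Group identical words together, then count the members of each group.
  val countedWords: Map[String, Int] =
    myWords.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }
}
```

Printing WordCountSketch.countedWords plays the role of write here.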
SOME SCALDING CODE
val myLines = TypedPipe.from(TextLine(path))
val myWords = myLines.flatMap(w => w.split(" "))
  .groupBy(identity)
  .size
myWords.write(TypedTsv[(String, Long)](output))
SOME SCOOBI CODE
val lines = fromTextFile("hdfs://in/...")
val counts = lines.flatMap(_.split(" "))
.map(word => (word, 1))
.groupByKey
.combine(_+_)
persist(counts.toTextFile("hdfs://out/...", overwrite=true))
SOME SCRUNCH CODE
val pipeline = new Pipeline[WordCountExample]
def wordCount(fileName: String) = {
pipeline.read(from.textFile(fileName))
.flatMap(_.toLowerCase.split("\\W+"))
.count
}
ADVANTAGES
•Type checking
–Find errors at compile time, not at job submission time (or
even worse, 5 hours after job submission time)
•Single language
–Scala is a fully functional programming language
•Productivity
–Since the code you write looks like collections code you can
use the Scala REPL to prototype
•Clarity
–Write code as a series of operations and let the job planner
smash it all together
BREAD AND BUTTER
•You can be effective within hours by just learning a few
simple ideas
–map
–flatMap
–filter
–groupBy
–reduce
–foldLeft
•Everything above takes a function as an argument.
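As a small sketch of what "takes a function as an argument" means in practice, the functions passed in can be anonymous, stored in a value, or ordinary methods. The names below (FunctionArguments, isEven, square) are made up for illustration.

```scala
object FunctionArguments {
  val numbers: List[Int] = List(1, 2, 3, 4)

  // A function stored in a value can be passed wherever a function is expected.
  val isEven: Int => Boolean = n => n % 2 == 0

  // A method works too; Scala converts it to a function when it is passed.
  def square(n: Int): Int = n * n

  val evens: List[Int] = numbers.filter(isEven)      // predicate as a value
  val squares: List[Int] = numbers.map(square)       // method as a function
  val total: Int = numbers.reduce((l, r) => l + r)   // anonymous function
}
```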
map
•Does what you think it does
scala> val mylist = List(1,2,3)
mylist: List[Int] = List(1, 2, 3)
scala> mylist.map(x => x + 5)
res0: List[Int] = List(6, 7, 8)
flatMap
•Kind of like map
•Does a map then a flatten
scala> val mystrings = List("hello there", "hadoop summit")
mystrings: List[String] = List(hello there, hadoop summit)
scala> mystrings.map(x => x.split(" "))
res5: List[Array[String]] =
List(Array(hello, there), Array(hadoop, summit))
scala> mystrings.map(x => x.split(" ")).flatten
res6: List[String] = List(hello, there, hadoop, summit)
scala> mystrings.flatMap(x => x.split(" "))
res7: List[String] = List(hello, there, hadoop, summit)
filter
•Pretty obvious
scala> mystrings.filter(x => x.contains("hadoop"))
res8: List[String] = List(hadoop summit)
•Takes a predicate function
•Use this a lot
groupBy
•Puts items together using an arbitrary function
scala> mylist.groupBy(x => x % 2 == 0)
res9: scala.collection.immutable.Map[Boolean,List[Int]] =
Map(false -> List(1, 3), true -> List(2))
scala> mylist.groupBy(x => x % 2)
res10: scala.collection.immutable.Map[Int,List[Int]] =
Map(1 -> List(1, 3), 0 -> List(2))
scala> mystrings.groupBy(x => x.length)
res11: scala.collection.immutable.Map[Int,List[String]] =
Map(11 -> List(hello there), 13 -> List(hadoop summit))
reduce
•Not necessarily what you think
•Signature: (T,T) => T
scala> mylist.reduce( (l,r) => l + r )
res12: Int = 6
scala> mystrings.reduce( (l,r) => l + r )
res13: String = hello therehadoop summit
scala> mystrings.reduce( (l,r) => l + " " + r )
res14: String = hello there hadoop summit
•In the case of Scalding/Scoobi/Scrunch, this happens on
the values after a group operation.
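A rough picture of "reduce on the values after a group operation", using plain Scala collections rather than any of the three libraries. The data and the name GroupThenReduce are invented for the example.

```scala
object GroupThenReduce {
  // Hypothetical (word, count) pairs, as a mapper might emit them.
  val pairs: List[(String, Int)] =
    List(("hadoop", 1), ("summit", 1), ("hadoop", 1), ("hello", 1), ("hadoop", 1))

  // Group by key, keep only the values in each group, then reduce those values.
  val totals: Map[String, Int] =
    pairs.groupBy { case (word, _) => word }
      .map { case (word, group) => (word, group.map(_._2).reduce((l, r) => l + r)) }
}
```

The reduce function only ever sees values that share a key, which is what lets the frameworks run it as a combiner and a reducer.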
foldLeft
•This is a fancy reduce
•Signature: (z: B)((B,T) => B): B
•The input z is the initial value of the accumulator
scala> mylist.foldLeft(Set[Int]())((s,x) => s + x)
res15: scala.collection.immutable.Set[Int] =
Set(1, 2, 3)
scala> mylist.foldLeft(List[Int]())((xs, x) => x :: xs)
res16: List[Int] = List(3, 2, 1)
•Like reduce, this happens on the values after a groupBy
•Called slightly different things in Scoobi/Scrunch
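A sketch of foldLeft applied to the values of each group, again with plain Scala collections and invented data. Unlike reduce, the result type (a Set) differs from the element type.

```scala
object GroupThenFold {
  // Hypothetical (user, item) view events.
  val views: List[(String, String)] =
    List(("ann", "lamp"), ("bob", "desk"), ("ann", "lamp"), ("ann", "chair"))

  // For each user, fold that user's events into a Set of items,
  // starting from the empty Set as the accumulator.
  val distinctItems: Map[String, Set[String]] =
    views.groupBy { case (user, _) => user }
      .map { case (user, events) =>
        (user, events.foldLeft(Set.empty[String])((seen, event) => seen + event._2))
      }
}
```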
MONOIDS: WHY YOU SHOULD CARE ABOUT
MATH
•From Wikipedia:
–a monoid is an algebraic structure with a single associative
binary operation and an identity element.
•Almost everything you want to do is a monoid
–Standard addition of numeric types is the most common
–List/map/set/string concatenation
–Top k elements
–Bloom filter, count-min sketch, hyperloglog
–stochastic gradient descent
–moments of distributions
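To make the definition concrete, here is a hand-rolled sketch of a monoid with two instances from the list above (addition and top k). The Monoid trait below is written for this example and is not Algebird's; only the zero/plus naming is borrowed.

```scala
object MonoidSketch {
  // A monoid: an identity element plus an associative binary operation.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Standard addition on Int.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(l: Int, r: Int): Int = l + r
  }

  // Top k as a monoid: merge two lists and keep the k largest values.
  def topK(k: Int): Monoid[List[Int]] = new Monoid[List[Int]] {
    def zero: List[Int] = Nil
    def plus(l: List[Int], r: List[Int]): List[Int] =
      (l ++ r).sorted(Ordering[Int].reverse).take(k)
  }

  // One generic aggregation that works for any monoid.
  def sumAll[T](items: List[T], m: Monoid[T]): T = items.foldLeft(m.zero)(m.plus)
}
```

Because plus is associative, partial results can be computed on separate machines and merged in any grouping, which is why monoids fit MapReduce so well.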
MORE MONOID STUFF
•If you are aggregating, you are probably using a monoid
•Scalding has Algebird and monoid support baked in
•Scoobi and Scrunch can use Algebird (or any other
monoid library) with almost no work
–combine { case (l,r) => monoid.plus(l,r) }
•Algebird handles tuples with ease
•Very easy to define monoids for your own types
•Algebird: https://github.com/twitter/algebird @algebird
–Authors: Lots!
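A sketch of the two claims above about tuples and user-defined types. It uses a hand-rolled Monoid trait rather than Algebird itself, so every name here (CustomMonoids, pair, Stats, statsMonoid, combine) is an assumption made for illustration.

```scala
object CustomMonoids {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  val longAddition: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // A pair of monoids is itself a monoid: combine component by component.
  def pair[A, B](a: Monoid[A], b: Monoid[B]): Monoid[(A, B)] = new Monoid[(A, B)] {
    def zero: (A, B) = (a.zero, b.zero)
    def plus(l: (A, B), r: (A, B)): (A, B) = (a.plus(l._1, r._1), b.plus(l._2, r._2))
  }

  // A user-defined type: a running count and sum, enough to recover a mean.
  case class Stats(count: Long, sum: Double) {
    def mean: Double = if (count == 0L) 0.0 else sum / count
  }

  val statsMonoid: Monoid[Stats] = new Monoid[Stats] {
    def zero: Stats = Stats(0L, 0.0)
    def plus(l: Stats, r: Stats): Stats = Stats(l.count + r.count, l.sum + r.sum)
  }

  // Same shape as the combine call on the slide, applied to an in-memory list.
  def combine[T](values: List[T], m: Monoid[T]): T =
    values.reduceOption((l, r) => m.plus(l, r)).getOrElse(m.zero)
}
```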
THE FUTURE
It’s here now!
SPARK
•Spark is a system for executing general computation
graphs, not just MR
•The syntax looks very much like Scalding, Scoobi,
and Scrunch
–It inspired the API on a couple of them
•Spark runs on YARN as of the latest release
•Can cache intermediate data
–Iterative problems become much easier
•Developed by the AMPLab at UC Berkeley
GREAT, NOW I HAVE TO LEARN 4 THINGS
INSTEAD OF 3
•Scalding, Scoobi, and Scrunch seem to have all sprung
into being around the same time and independently of
each other
•Spark was around a little before that
•Do we really need 3 (or 4) very similar solutions?
•Wouldn’t it be nice if we could just pick one and all get
behind it?
•I was prepared to make a definitive statement about the
best one, but then I learned something new
HAVE CAKE AND EAT IT
•There is currently work being done on a common API
that spans Scalding, Scoobi, Scrunch and Spark
•Not much is implemented yet, but all 4 groups are talking
and working things out
•The main use case is already done
–After word count everything else is just academic
•Check it out here: https://github.com/jwills/driskill
•In the future* you’ll be able to write against this common
API and then decide which system you want to execute
the job on
–Think of choosing a list, a buffer or a vector
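The list/buffer/vector analogy can be made concrete with plain Scala collections. This only illustrates the idea and is not the proposed common API; ChooseYourBackend and wordCount are made-up names.

```scala
object ChooseYourBackend {
  import scala.collection.mutable.ArrayBuffer

  // Written once against the general sequence interface.
  def wordCount(lines: scala.collection.Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  val input: List[String] = List("hello there", "hello hadoop")

  // The caller picks the concrete collection; the logic does not change.
  val fromList: Map[String, Int] = wordCount(input)
  val fromVector: Map[String, Int] = wordCount(input.toVector)
  val fromBuffer: Map[String, Int] = wordCount(ArrayBuffer.empty[String] ++= input)
}
```

In the same spirit, a job written against a common API would be handed to Scalding, Scoobi, Scrunch, or Spark without changing its body.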
HOW CAN WE HELP?
•Get involved
•If something bothers you, fix it
•If you want a new feature, try and build it
•Everyone involved is actually quite friendly
•You can build jars to run on your cluster and no one will
know there is Scala Inside™
CONCLUSION
We’re almost done!
THINGS TO TAKE AWAY
•MapReduce is a functional problem, so we should use
functional tools
•You can increase productivity, safety, and maintainability
all at once with no down side
•Thinking of data flows in a functional way opens up
many new possibilities
•The community is awesome
THANKS! (FOR REAL THIS TIME)
•Questions/comments?
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Should I Use Scalding or Scoobi or Scrunch?

  • 1. SHOULD I USE SCALDING OR SCOOBI OR SCRUNCH? CHRIS SEVERS @ccsevers eBay SEARCH SCIENCE Hadoop Summit June 26th, 2013
  • 2. Obligatory Big Data Stuff
      • Some fun facts about eBay from Hugh Williams’ blog:
        – We have over 50 petabytes of data stored in our Hadoop and Teradata clusters
        – We have over 400 million items for sale
        – We process more than 250 million user queries per day
        – We serve over 100,000 pages per second
        – Our users spend over 180 years in total every day looking at items
        – We have over 112 million active users
        – We sold over US$75 billion in merchandise in 2012
      • http://hughewilliams.com/2013/06/24/the-size-and-scale-of-ebay-2013-edition/
  • 4. YES.
  • 5. THANK YOU AND GOOD NIGHT
      • Questions/comments?
      • Thanks to Avi Bryant, @avibryant for settling this issue.
  • 6. NO REALLY, WHICH ONE SHOULD I USE?!
      • All three (Scalding, Scoobi, Scrunch) are amazing projects
      • They seem to be converging to a common API
      • There are small differences, but if you can use one you will likely be productive with the others very quickly
      • They are all much better than the alternatives.
      • Scalding: https://github.com/twitter/scalding, @scalding
        – Authors: @posco @argyris @avibryant
      • Scoobi: https://github.com/NICTA/scoobi
        – Authors: @bmlever @etorreborre
      • Scrunch: http://crunch.apache.org/scrunch.html
        – Author: @josh_wills
  • 7. THE AGENDA
      1. Quick survey of the current landscape outside Scalding, Scoobi, and Scrunch
      2. A light comparison of Scalding, Scoobi, and Scrunch
      3. Some code samples
      4. The future
  • 8. THE ALTERNATIVES
      I promise this part will be quick
  • 9. VANILLA MAPREDUCE
      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

        // Mapper: emits (word, 1) for every token of every input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        // Reducer: sums the counts emitted for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        // Driver: args[0] is the input path, args[1] the output path.
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);
          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.waitForCompletion(true);
        }
      }
  • 10. PIG
      • Apache Pig is a really great tool for quick, ad-hoc data analysis
      • While we can do amazing things with it, I’m not sure we should
      • Anything complicated requires User Defined Functions (UDFs)
      • UDFs require a separate code base
      • Now you have to maintain two separate languages for no good reason
  • 11. APACHE HIVE
      • On previous slide: s/Pig/Hive/g
  • 12. STREAMING
      • Can be concise
      • Easy to test
        – cat myfile.txt | ./mymapper.sh | sort | ./myreducer.sh
      • Same problems as vanilla MR when it comes to multistage flows
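The mapper | sort | reducer pipe on the streaming slide can be imitated in memory to see why it is easy to test. The following is only a sketch under assumptions: the object name `StreamingSketch` and the tab-separated `word<TAB>count` line format are made up for illustration and are not taken from the talk or from any real `mymapper.sh`.

```scala
// In-memory imitation of: cat file | mapper | sort | reducer
// All names and the "word\tcount" format are illustrative assumptions.
object StreamingSketch {
  // Mapper: emit one "word\t1" line per whitespace-separated token.
  def mapper(lines: Seq[String]): Seq[String] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(w => s"$w\t1")

  // Reducer: input is sorted, so equal keys are adjacent; sum each run of keys.
  def reducer(sorted: Seq[String]): Seq[String] =
    sorted
      .foldLeft(List.empty[(String, Int)]) { (acc, line) =>
        val parts = line.split("\t")
        val key = parts(0)
        val value = parts(1).toInt
        acc match {
          case (k, n) :: rest if k == key => (k, n + value) :: rest
          case _                          => (key, value) :: acc
        }
      }
      .reverse
      .map { case (k, n) => s"$k\t$n" }

  // The whole "pipe": map, sort the intermediate lines, reduce.
  def run(lines: Seq[String]): Seq[String] = reducer(mapper(lines).sorted)
}
```

Calling `StreamingSketch.run(Seq("b a", "a"))` walks the same three stages as the shell pipe. A second stage would need its own mapper and reducer plus the glue between them, which is the multistage problem the slide mentions.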
  • 13. CASCADING/CRUNCH
      • Higher level abstractions are good
      • Tell the API what you want to do, let it sort out the actual series of MR jobs
      • Very easy to do cogroup, join, multiple passes
      • Still a bit too verbose
      • Feels like shoehorning a fundamentally functional notion into an imperative context
      • If you can’t move away from Java, this is your best bet
  • 14. LET’S COMPARE
      This slide is bright yellow!
  • 15. FEATURE COMPARISON
                              Scalding                          Scoobi                   Scrunch
      Data model              Tuple or distributed collection   Distributed collection   Distributed collection
      Has Algebird baked in   Yes                               No                       No
      Is Java-free            No                                Yes                      No
      Backed by Cloudera      No                                No                       Yes
      Free                    Yes                               Yes                      Yes
  • 16. SOME SCALA CODE
      val myLines = getStuff
      val myWords = myLines.flatMap(w => w.split("\\s+"))
      val myWordsGrouped = myWords.groupBy(identity)
      val countedWords = myWordsGrouped.mapValues(x => x.size)
      write(countedWords)
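The slide's snippet leaves `getStuff` and `write` undefined. A minimal runnable sketch of the same pipeline on an ordinary `List` might look like this; `LocalWordCount`, `sampleLines` and `countWords` are hypothetical names, and the sample input is made up.

```scala
// Word count on an in-memory List, mirroring the slide's pipeline.
// `sampleLines` is invented input standing in for the slide's `getStuff`.
object LocalWordCount {
  val sampleLines: List[String] = List("hello there", "hello hadoop summit")

  def countWords(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+"))            // one element per word
      .filter(_.nonEmpty)                              // drop empty tokens caused by leading whitespace
      .groupBy(identity)                               // word -> all occurrences of that word
      .map { case (word, hits) => word -> hits.size } // word -> number of occurrences
}
```

The final step uses `map` with a pattern match rather than `mapValues`, because `mapValues` returns a lazy view in newer Scala versions; that choice is a portability assumption, not something the slide prescribes.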
  • 17. SOME SCALDING CODE
      val myLines = TypedPipe.from(TextLine(path))
      val myWords = myLines.flatMap(w => w.split(" "))
        .groupBy(identity)
        .size
      myWords.write(TypedTsv[(String, Long)](output))
  • 18. SOME SCOOBI CODE
      val lines = fromTextFile("hdfs://in/...")
      val counts = lines.flatMap(_.split(" "))
        .map(word => (word, 1))
        .groupByKey
        .combine(_+_)
      persist(counts.toTextFile("hdfs://out/...", overwrite=true))
  • 19. SOME SCRUNCH CODE
      val pipeline = new Pipeline[WordCountExample]
      def wordCount(fileName: String) = {
        pipeline.read(from.textFile(fileName))
          .flatMap(_.toLowerCase.split("\\W+"))
          .count
      }
  • 20. ADVANTAGES
      • Type checking
        – Find errors at compile time, not at job submission time (or even worse, 5 hours after job submission time)
      • Single language
        – Scala is a fully functional programming language
      • Productivity
        – Since the code you write looks like collections code you can use the Scala REPL to prototype
      • Clarity
        – Write code as a series of operations and let the job planner smash it all together
  • 21. BREAD AND BUTTER
      • You can be effective within hours by just learning a few simple ideas
        – map
        – flatMap
        – filter
        – groupBy
        – reduce
        – foldLeft
      • Everything above takes a function as an argument.
  • 22. map
      • Does what you think it does
      scala> val mylist = List(1,2,3)
      mylist: List[Int] = List(1, 2, 3)
      scala> mylist.map(x => x + 5)
      res0: List[Int] = List(6, 7, 8)
  • 23. flatMap
      • Kind of like map
      • Does a map then a flatten
      scala> val mystrings = List("hello there", "hadoop summit")
      mystrings: List[String] = List(hello there, hadoop summit)
      scala> mystrings.map(x => x.split(" "))
      res5: List[Array[String]] = List(Array(hello, there), Array(hadoop, summit))
      scala> mystrings.map(x => x.split(" ")).flatten
      res6: List[String] = List(hello, there, hadoop, summit)
      scala> mystrings.flatMap(x => x.split(" "))
      res7: List[String] = List(hello, there, hadoop, summit)
  • 24. filter
      • Pretty obvious
      scala> mystrings.filter(x => x.contains("hadoop"))
      res8: List[String] = List(hadoop summit)
      • Takes a predicate function
      • Use this a lot
  • 25. groupBy
      • Puts items together using an arbitrary function
      scala> mylist.groupBy(x => x % 2 == 0)
      res9: scala.collection.immutable.Map[Boolean,List[Int]] = Map(false -> List(1, 3), true -> List(2))
      scala> mylist.groupBy(x => x % 2)
      res10: scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(1, 3), 0 -> List(2))
      scala> mystrings.groupBy(x => x.length)
      res11: scala.collection.immutable.Map[Int,List[String]] = Map(11 -> List(hello there), 13 -> List(hadoop summit))
  • 26. reduce
      • Not necessarily what you think
      • Signature: (T,T) => T
      scala> mylist.reduce( (l,r) => l + r )
      res12: Int = 6
      scala> mystrings.reduce( (l,r) => l + r )
      res13: String = hello therehadoop summit
      scala> mystrings.reduce( (l,r) => l + " " + r )
      res14: String = hello there hadoop summit
      • In the case of Scalding/Scoobi/Scrunch, this happens on the values after a group operation.
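To make "this happens on the values after a group operation" concrete, here is a small sketch using plain Scala collections rather than any of the three libraries. The `(item, price)` records and the names `ReduceAfterGroup`, `sales` and `totalPerKey` are invented for illustration.

```scala
// Group records by key, then reduce the values inside each group.
// The sample data and all names here are illustrative assumptions.
object ReduceAfterGroup {
  val sales: List[(String, Int)] = List(("book", 12), ("lamp", 30), ("book", 8))

  def totalPerKey(records: List[(String, Int)]): Map[String, Int] =
    records
      .groupBy { case (key, _) => key }   // key -> all records with that key
      .map { case (key, group) =>
        // reduce has signature (T, T) => T and runs over the values of one key
        key -> group.map(_._2).reduce((l, r) => l + r)
      }
}
```

On a cluster the grouping step corresponds to the shuffle, and the function passed to `reduce` is what runs on each key's values.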
  • 27. foldLeft
      • This is a fancy reduce
      • Signature: (z: B)((B, T) => B): B
      • The input z is called the accumulator
      scala> mylist.foldLeft(Set[Int]())((s,x) => s + x)
      res15: scala.collection.immutable.Set[Int] = Set(1, 2, 3)
      scala> mylist.foldLeft(List[Int]())((xs, x) => x :: xs)
      res16: List[Int] = List(3, 2, 1)
      • Like reduce, this happens on the values after a groupBy
      • Called slightly different things in Scoobi/Scrunch
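One way to see why foldLeft is a "fancy reduce" is that its accumulator type B may differ from the element type T. The sketch below computes a per-key mean by folding into a `(sum, count)` pair, using plain Scala collections; the names `FoldAfterGroup`, `mean` and `meanPerKey`, and the choice to return 0.0 for an empty list, are assumptions made for this example.

```scala
// foldLeft with an accumulator type (Double, Int) that differs from the
// element type Double, something plain reduce cannot express directly.
object FoldAfterGroup {
  def mean(values: List[Double]): Double = {
    val (sum, count) =
      values.foldLeft((0.0, 0)) { case ((s, n), x) => (s + x, n + 1) }
    // Returning 0.0 for empty input is an arbitrary choice for this sketch.
    if (count == 0) 0.0 else sum / count
  }

  // As with reduce, the fold runs on the values of each group.
  def meanPerKey(records: List[(String, Double)]): Map[String, Double] =
    records
      .groupBy(_._1)
      .map { case (key, group) => key -> mean(group.map(_._2)) }
}
```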
  • 28. MONOIDS: WHY YOU SHOULD CARE ABOUT MATH
      • From Wikipedia:
        – a monoid is an algebraic structure with a single associative binary operation and an identity element.
      • Almost everything you want to do is a monoid
        – Standard addition of numeric types is the most common
        – List/map/set/string concatenation
        – Top k elements
        – Bloom filter, count-min sketch, HyperLogLog
        – Stochastic gradient descent
        – Moments of distributions
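The definition above (an associative binary operation plus an identity element) fits in a few lines of Scala. This is a hand-rolled sketch for illustration, not Algebird's actual `Monoid`; the method names `zero` and `plus` follow the `monoid.plus(l,r)` usage in the talk, and everything else (`MonoidSketch`, `intAddition`, `stringConcat`, `topK`, `sumAll`) is a hypothetical name.

```scala
// A minimal monoid abstraction with three of the instances listed on the slide.
object MonoidSketch {
  trait Monoid[T] {
    def zero: T               // identity element
    def plus(l: T, r: T): T   // associative binary operation
  }

  val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(l: Int, r: Int): Int = l + r
  }

  val stringConcat: Monoid[String] = new Monoid[String] {
    def zero: String = ""
    def plus(l: String, r: String): String = l + r
  }

  // Top-k: values are descending lists of at most k ints; Nil is the identity
  // on such lists, and merging then truncating is associative.
  def topK(k: Int): Monoid[List[Int]] = new Monoid[List[Int]] {
    def zero: List[Int] = Nil
    def plus(l: List[Int], r: List[Int]): List[Int] =
      (l ++ r).sorted(Ordering[Int].reverse).take(k)
  }

  // Any monoid can aggregate a sequence with the same fold.
  def sumAll[T](items: Seq[T])(m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)
}
```

Because the operation is associative, partial results can be combined in any grouping, which is what lets a framework pre-aggregate on the map side before the shuffle.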
  • 29. MORE MONOID STUFF
      • If you are aggregating, you are probably using a monoid
      • Scalding has Algebird and monoid support baked in
      • Scoobi and Scrunch can use Algebird (or any other monoid library) with almost no work
        – combine { case (l,r) => monoid.plus(l,r) }
      • Algebird handles tuples with ease
      • Very easy to define monoids for your own types
      • Algebird: https://github.com/twitter/algebird @algebird
        – Authors: Lots!
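The remark about tuples can be illustrated without Algebird: given monoids for A and B, a monoid for (A, B) follows mechanically. The sketch below is a hand-written stand-in, not Algebird's API; `TupleMonoidSketch`, `longSum`, `tuple2` and `countAndTotal` are hypothetical names. It computes a count and a total per key in a single pass.

```scala
// Building a tuple monoid from component monoids, then using it per key.
// The trait is hand-rolled for illustration; Algebird provides its own.
object TupleMonoidSketch {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Combine component-wise: associativity and identity carry over from ma and mb.
  def tuple2[A, B](ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      def zero: (A, B) = (ma.zero, mb.zero)
      def plus(l: (A, B), r: (A, B)): (A, B) =
        (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
    }

  // (count, total) for every key, aggregated with one monoid.
  def countAndTotal(records: List[(String, Long)]): Map[String, (Long, Long)] = {
    val m = tuple2(longSum, longSum)
    records
      .groupBy(_._1)
      .map { case (key, group) =>
        key -> group.map { case (_, v) => (1L, v) }.foldLeft(m.zero)(m.plus)
      }
  }
}
```

The `foldLeft(m.zero)(m.plus)` step plays the same role as the slide's `combine { case (l,r) => monoid.plus(l,r) }`.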
  • 31. SPARK
      • Spark is a system for executing general computation graphs, not just MR
      • The syntax looks very much like Scalding, Scoobi, and Scrunch
        – It inspired the API on a couple of them
      • Spark runs on YARN as of the latest release
      • Can cache intermediate data
        – Iterative problems become much easier
      • Developed by the AMPLab at UC Berkeley
  • 32. GREAT, NOW I HAVE TO LEARN 4 THINGS INSTEAD OF 3
      • Scalding, Scoobi, and Scrunch seem to have all sprung into being around the same time and independently of each other
      • Spark was around a little before that
      • Do we really need 3 (or 4) very similar solutions?
      • Wouldn’t it be nice if we could just pick one and all get behind it?
      • I was prepared to make a definitive statement about the best one, but then I learned something new
  • 33. HAVE CAKE AND EAT IT
      • There is currently work being done on a common API that spans Scalding, Scoobi, Scrunch and Spark
      • Not much is implemented yet, but all 4 groups are talking and working things out
      • The main use case is already done
        – After word count everything else is just academic
      • Check it out here: https://github.com/jwills/driskill
      • In the future* you’ll be able to write against this common API and then decide which system you want to execute the job
        – Think of choosing a list, a buffer or a vector
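The "list, buffer or vector" analogy can be shown with the standard library alone: write the logic once against a general collection interface and pick the concrete implementation at the call site. This sketch says nothing about how the common API itself is designed; `BackendAgnostic` and `wordCount` are hypothetical names used only for the analogy.

```scala
import scala.collection.mutable.ArrayBuffer

// Word count written once against Iterable; the caller chooses the "backend".
object BackendAgnostic {
  def wordCount(lines: Iterable[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, hits) => word -> hits.size }
      .toMap

  // Same logic, three different concrete collections.
  val fromList: Map[String, Int]   = wordCount(List("a b", "a"))
  val fromVector: Map[String, Int] = wordCount(Vector("a b", "a"))
  val fromBuffer: Map[String, Int] = wordCount(ArrayBuffer("a b", "a"))
}
```

All three calls return the same counts; only the container that holds the input differs, which is the property the slide hopes to get across execution engines.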
  • 34. HOW CAN WE HELP?
      • Get involved
      • If something bothers you, fix it
      • If you want a new feature, try and build it
      • Everyone involved is actually quite friendly
      • You can build jars to run on your cluster and no one will know there is Scala Inside™
  • 36. THINGS TO TAKE AWAY
      • MapReduce is a functional problem, so we should use functional tools
      • You can increase productivity, safety, and maintainability all at once with no downside
      • Thinking of data flows in a functional way opens up many new possibilities
      • The community is awesome
  • 37. THANKS! (FOR REAL THIS TIME)
      • Questions/comments?