Scoobi

Ben Lever
@bmlever
Me

• Machine learning, software systems, computer vision, optimisation, networks, control and signal processing
• Haskell DSL for development of computer vision algorithms targeting GPUs
• Predictive analytics for the enterprise
Hadoop app development – wish list

• Quick dev cycles
• Expressive
• Reusability
• Type safety
• Reliability
Bridging the “tooling” gap

Scoobi

• Implementation: DList and DObject abstractions for building MapReduce pipelines
• Testing: ScalaCheck integration
• Layered on the Hadoop MapReduce Java APIs
At a glance
•   Scoobi = Scala for Hadoop
•   Inspired by Google’s FlumeJava
•   Developed at NICTA
•   Open-sourced Oct 2011
•   Apache V2
Hadoop MapReduce – word count

Input splits (blocks 323–326): “... cat ...”, “... hello ... cat”,
“... hello ... fire ...”, “... fire ... cat ...”

(k1, v1) → [(k2, v2)]
    Map: each Mapper emits (word, 1) for every word in its split,
    e.g. (cat, 1); (hello, 1), (cat, 1); (hello, 1), (fire, 1); (fire, 1), (cat, 1)

[(k2, v2)] → [(k2, [v2])]
    Sort and shuffle: aggregate values by key, giving
    (hello, [1, 1]), (cat, [1, 1, 1]), (fire, [1, 1])

(k2, [v2]) → [(k3, v3)]
    Reduce: each Reducer sums the counts for its keys, producing
    (hello, 2), (cat, 3), (fire, 2)
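These phases can be traced end-to-end with ordinary Scala collections. The following is a minimal in-memory analogy of the diagram, not Hadoop code; the object name WordCountPhases is illustrative:

```scala
// In-memory analogy of the three MapReduce phases for word count.
object WordCountPhases {
  // One string per input split
  val splits = List("cat", "hello cat", "hello fire", "fire cat")

  // Map: (k1, v1) -> [(k2, v2)] -- emit (word, 1) for every word
  val mapped: List[(String, Int)] =
    splits.flatMap(_.split(" ").map(w => (w, 1)))

  // Sort and shuffle: [(k2, v2)] -> [(k2, [v2])] -- aggregate values by key
  val shuffled: Map[String, List[Int]] =
    mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }

  // Reduce: (k2, [v2]) -> [(k3, v3)] -- sum the counts for each key
  val reduced: Map[String, Int] =
    shuffled.map { case (w, counts) => (w, counts.sum) }
}
```

Running it yields (hello, 2), (cat, 3), (fire, 2), matching the diagram.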
Java style

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Source: http://wiki.apache.org/hadoop/WordCount
DList abstraction

Data on HDFS → Distributed List (DList) → Transform

DList type                            Abstraction for
DList[String]                         Lines of text files
DList[(Int, String, Boolean)]         CSV files of the form “37,Joe,M”
DList[(Float, Map[String, Int])]      Avro files with schema: {record {float, map}}
Scoobi style

import com.nicta.scoobi.Scoobi._

// Count the frequency of words from a corpus of documents
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val freqs: DList[(String, Int)] =
      lines.flatMap(_.split(" "))   // DList[String]
           .map(w => (w, 1))        // DList[(String, Int)]
           .groupByKey              // DList[(String, Iterable[Int])]
           .combine(_+_)            // DList[(String, Int)]

    persist(toTextFile(freqs, args(1)))
  }
}
DList trait

trait DList[A] {
  /* Abstract methods */
  def parallelDo[B](dofn: DoFn[A, B]): DList[B]

  def ++(that: DList[A]): DList[A]

  def groupByKey[K, V]
      (implicit ev: A <:< (K, V)): DList[(K, Iterable[V])]

  def combine[K, V]
      (f: (V, V) => V)
      (implicit ev: A <:< (K, Iterable[V])): DList[(K, V)]

  /* All other methods are derived, e.g. ‘map’ */
}
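To illustrate how the derived methods fall out of this abstract core, here is a toy in-memory model; LocalDList and this simplified DoFn are stand-ins invented for the sketch, not Scoobi's real implementation:

```scala
// Toy model: deriving `map` and `flatMap` from `parallelDo` alone.
// DoFn here is a simplified stand-in for Scoobi's actual interface.
trait DoFn[A, B] { def process(input: A, emit: B => Unit): Unit }

case class LocalDList[A](elems: List[A]) {
  def parallelDo[B](dofn: DoFn[A, B]): LocalDList[B] = {
    val out = scala.collection.mutable.ListBuffer[B]()
    elems.foreach(a => dofn.process(a, b => out += b))
    LocalDList(out.toList)
  }

  // Derived map: emit exactly one output per input
  def map[B](f: A => B): LocalDList[B] =
    parallelDo(new DoFn[A, B] {
      def process(input: A, emit: B => Unit) = emit(f(input))
    })

  // Derived flatMap: emit zero or more outputs per input
  def flatMap[B](f: A => Iterable[B]): LocalDList[B] =
    parallelDo(new DoFn[A, B] {
      def process(input: A, emit: B => Unit) = f(input).foreach(emit)
    })
}
```

The same pattern covers the other derived combinators: each is a parallelDo with a particular emit discipline.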
Under the hood

persist compiles the word-count pipeline into a graph of nodes:

  fromTextFile → LD  (lines, read from HDFS)
  flatMap      → PD  (words)
  map          → PD  (word1)
  groupByKey   → GBK (wordG)
  combine      → CV  (freq, written to HDFS)
  persist

The parallelDo (PD), groupByKey (GBK) and combine (CV) nodes are fused into a single MapReduce job.
Removing less than the average

import com.nicta.scoobi.Scoobi._

// Remove all integers that are less than the average integer
object BetterThanAverage extends ScoobiApp {
  def run() {
    val ints: DList[Int] =
      fromTextFile(args(0)) collect { case AnInt(i) => i }

    val total: DObject[Int] = ints.sum
    val count: DObject[Int] = ints.size

    val average: DObject[Int] =
      (total, count) map { case (t, c) => t / c }

    val bigger: DList[Int] =
      (average join ints) filter { case (a, i) => i > a }

    persist(toTextFile(bigger, args(1)))
  }
}
Under the hood

The BetterThanAverage plan compiles into multiple stages:

• MapReduce job: ints is loaded from HDFS (LD → PD); two branches
  (PD → GBK → CV → M) compute total and count
• Client computation: average is derived from (total, count) (OP) and
  shipped to the cluster via the Distributed Cache (DCache)
• MapReduce job: average is joined with ints and filtered (PD → PD),
  writing bigger to HDFS
DObject abstraction

• A DObject[A] is a single distributed value, stored on HDFS and
  shipped via the Distributed Cache
• map chains client-side computations over DObjects
• join pairs a DObject[A] with a DList[B], yielding a DList[(A, B)]
  (HDFS + Distributed Cache)

trait DObject[A] {
  def map[B](f: A => B): DObject[B]
  def join[B](list: DList[B]): DList[(A, B)]
}
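The semantics of these two methods can be modelled on in-memory data; LocalDObject is an invented stand-in for the sketch, with a plain List standing in for DList:

```scala
// Toy semantics for DObject: a single distributed value that can be
// transformed client-side (map) or paired with every element of a
// distributed list (join).
case class LocalDObject[A](value: A) {
  def map[B](f: A => B): LocalDObject[B] = LocalDObject(f(value))
  def join[B](list: List[B]): List[(A, B)] = list.map(b => (value, b))
}
```

For example, LocalDObject(10).join(List(4, 12)) pairs the value with each element, which is exactly the (average, element) shape the filter in BetterThanAverage consumes.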
Mirroring the Scala Collection API
     DList => DList   DList => DObject

       flatMap           reduce
         map             product
        filter             sum
      filterNot           length
      groupBy              size
      partition           count
       flatten             max
       distinct           maxBy
          ++               min
     keys, values         minBy
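Because the API mirrors the standard collections, a pipeline can be prototyped on a plain List first. The word-count chain from earlier looks almost identical, with groupBy and reduce standing in for groupByKey and combine (ListWordCount is an illustrative name):

```scala
// The Scoobi word-count chain, prototyped on an ordinary List.
object ListWordCount {
  val lines = List("hello cat", "hello fire", "fire cat")

  val freqs: Map[String, Int] =
    lines.flatMap(_.split(" "))   // List[String]
         .map(w => (w, 1))        // List[(String, Int)]
         .groupBy(_._1)           // Map[String, List[(String, Int)]]
         .map { case (w, ps) => (w, ps.map(_._2).reduce(_ + _)) }
}
```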
Building abstractions

Functional programming

Functions as procedures + functions as parameters

⇒ Composability + Reusability
Composing

// Compute the average of a DList of “numbers”
def average[A : Numeric](in: DList[A]): DObject[A] =
  (in.sum, in.size) map { case (sum, size) => sum / size }

// Compute histogram
def histogram[A](in: DList[A]): DList[(A, Int)] =
  in.map(x => (x, 1)).groupByKey.combine(_+_)

// Throw away words with less-than-average frequency
def betterThanAvgWords(lines: DList[String]): DList[String] = {
  val words = lines.flatMap(_.split(" "))
  val wordCnts = histogram(words)
  val avgFreq = average(wordCnts.values)
  (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
}
Unit-testing ‘histogram’

// Specification for histogram function (ScalaCheck properties)
class HistogramSpec extends HadoopSpecification {

  "Histogram from DList" >> {

    "Sum of bins must equal size of DList" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val hist = histogram(list.toDList)
        val binSum = persist(hist.values.sum)
        binSum == list.size
      }
    }

    "Number of bins must equal number of unique values" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val input = list.toDList
        val bins = histogram(input).keys.size
        val uniques = input.distinct.size
        val (b, u) = persist(bins, uniques)
        b == u
      }
    }
  }
}
sbt integration

> test-only *Histogram* -- exclude cluster
[info] HistogramSpec
[info]
[info] Histogram from DList
[info] + Sum of bins must equal size of DList
[info] No cluster execution time
[info] + Number of bins must equal number of unique values
[info] No cluster execution time
[info]
[info] Total for specification HistogramSpec
[info] Finished in 12 seconds, 600 ms
[info] 2 examples, 4 expectations, 0 failure, 0 error
[info]
[info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0
>
> test-only *Histogram*
> test-only *Histogram* -- scoobi verbose
> test-only *Histogram* -- scoobi verbose.warning

When running against a cluster, dependent JARs are copied (once) to a directory on the cluster (~/libjars by default).
Other features
• Grouping:
  – API for controlling Hadoop’s sort-and-shuffle
  – Useful for implementing secondary sorting
• Join and Co-group helper methods
• Matrix multiplication utilities
• I/O:
  – Text, sequence, Avro
  – Roll your own
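The effect of secondary sorting can be seen on in-memory data: records are grouped by key and, within each group, values arrive in sorted order. This sketches only the semantics; the names are illustrative, and Scoobi's actual Grouping API achieves the ordering during the shuffle rather than by sorting each group in memory:

```scala
// Secondary sort, modelled in memory: group events by user, then
// deliver each user's values sorted (here, by timestamp).
object SecondarySortSketch {
  val events = List(("alice", 3), ("bob", 1), ("alice", 1), ("bob", 2), ("alice", 2))

  val grouped: Map[String, List[Int]] =
    events.groupBy(_._1)
          .map { case (user, evs) => (user, evs.map(_._2).sorted) }
}
```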
Want to know more?
• http://nicta.github.com/scoobi
• Mailing lists:
  – http://groups.google.com/group/scoobi-users
  – http://groups.google.com/group/scoobi-dev
• Twitter:
  – @bmlever
  – @etorreborre
• Meet me:
  – Will also be at Hadoop Summit (June 13-14)
  – Keen to get feedback

DevOps and Testing slides at DASA Connect
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 

Scoobi - Scala for Startups

  • 2. Me: Haskell DSL for development of computer vision algorithms targeting GPUs; machine learning, software systems, computer vision, optimisation, networks, control and signal processing; predictive analytics for the enterprise
  • 3. Hadoop app development – wish list: quick dev cycles, expressiveness, reusability, type safety, reliability
  • 4. Bridging the “tooling” gap: Scoobi provides MapReduce-pipeline abstractions (DList and DObject) for implementation and ScalaCheck integration for testing, sitting on top of the Java APIs of Hadoop MapReduce
  • 5. At a glance • Scoobi = Scala for Hadoop • Inspired by Google’s FlumeJava • Developed at NICTA • Open-sourced Oct 2011 • Apache V2
  • 6. Hadoop MapReduce – word count. Input splits of text (e.g. “... hello ... cat ...”, “... fire ... cat ...”) feed Mappers ((k1, v1) → [(k2, v2)]), which emit (word, 1) pairs. Sort and shuffle aggregates values by key ([(k2, v2)] → [(k2, [v2])]): hello → [1, 1], cat → [1, 1, 1], fire → [1, 1]. Reducers ((k2, [v2]) → [(k3, v3)]) sum each list: hello 2, cat 3, fire 2.
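The dataflow above can be emulated with ordinary Scala collections. This is a local sketch only, not Hadoop code, and the input strings are invented for illustration:

```scala
// Local emulation of the word-count dataflow: map, shuffle, reduce.
object LocalWordCount {
  def wordCount(blocks: List[String]): Map[String, Int] = {
    // Map phase: (k1, v1) -> [(k2, v2)], emitting (word, 1) pairs
    val mapped: List[(String, Int)] =
      blocks.flatMap(_.split(" ")).map(w => (w, 1))

    // Sort and shuffle: aggregate values by key, [(k2, v2)] -> [(k2, [v2])]
    val shuffled: Map[String, List[Int]] =
      mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }

    // Reduce phase: (k2, [v2]) -> [(k3, v3)], summing each key's ones
    shuffled.map { case (w, ones) => (w, ones.sum) }
  }

  def main(args: Array[String]): Unit =
    println(wordCount(List("hello cat", "cat", "hello fire", "fire cat")))
}
```

On a cluster the shuffle is performed by the framework between the map and reduce phases; here `groupBy` stands in for it.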
  • 7. Java style

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Source: http://wiki.apache.org/hadoop/WordCount
  • 8. DList abstraction: a Distributed List (DList) is a transformable abstraction for data on HDFS.

DList[String] – lines of text files
DList[(Int, String, Boolean)] – CSV files of the form “37,Joe,M”
DList[(Float, Map[String, Int])] – Avro files with schema {record {int, map}}
  • 9. Scoobi style

import com.nicta.scoobi.Scoobi._

// Count the frequency of words from a corpus of documents
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val freqs: DList[(String, Int)] =
      lines.flatMap(_.split(" "))  // DList[String]
           .map(w => (w, 1))       // DList[(String, Int)]
           .groupByKey             // DList[(String, Iterable[Int])]
           .combine(_+_)           // DList[(String, Int)]

    persist(toTextFile(freqs, args(1)))
  }
}
  • 10. DList trait

trait DList[A] {

  /* Abstract methods */
  def parallelDo[B](dofn: DoFn[A, B]): DList[B]

  def ++(that: DList[A]): DList[A]

  def groupByKey[K, V](implicit ev: A <:< (K, V)): DList[(K, Iterable[V])]

  def combine[K, V](f: (V, V) => V)(implicit ev: A <:< (K, Iterable[V])): DList[(K, V)]

  /* All other methods are derived, e.g. ‘map’ */
}
  • 11. Under the hood: fromTextFile (LD) loads lines from HDFS; flatMap (PD) produces words; map (PD) produces word1; groupByKey (GBK) produces wordG; combine (CV) produces freq, which persist writes back to HDFS. The parallelDo, groupByKey and combine stages compile into a single MapReduce job.
  • 12. Removing less than the average

import com.nicta.scoobi.Scoobi._

// Remove all integers that are less than the average integer
object BetterThanAverage extends ScoobiApp {
  def run() {
    val ints: DList[Int] = fromTextFile(args(0)) collect { case AnInt(i) => i }

    val total: DObject[Int] = ints.sum
    val count: DObject[Int] = ints.size
    val average: DObject[Int] = (total, count) map { case (t, c) => t / c }

    val bigger: DList[Int] = (average join ints) filter { case (a, i) => i > a }

    persist(toTextFile(bigger, args(1)))
  }
}
  • 13. Under the hood: a first MapReduce job reads ints from HDFS and computes total and count (parallelDo, groupByKey and combine stages); a client-side computation derives average, which is placed in the Distributed Cache; a second MapReduce job joins average against ints to produce bigger, written back to HDFS.
  • 14. DObject abstraction: a DObject[A] lives in the Distributed Cache; it supports client-side computations, can be mapped over, and can be joined back against a DList[B] to yield a DList[(A, B)].

trait DObject[A] {
  def map[B](f: A => B): DObject[B]
  def join[B](list: DList[B]): DList[(A, B)]
}
  • 15. Mirroring the Scala Collection API

DList => DList: flatMap, map, filter, filterNot, groupBy, partition, flatten, distinct, ++, keys, values
DList => DObject: reduce, product, sum, length, size, count, max, maxBy, min, minBy
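Because the DList API mirrors the standard collections, a pipeline reads the same in both worlds. A sketch on a plain Scala List (the `stats` helper is invented for illustration; on a DList the chain would look identical, except that the collapsing operations return DObjects instead of plain values):

```scala
// Collection-style operations from the table, applied to a plain List.
object CollectionParity {
  def stats(xs: List[Int]): (List[Int], Int, Int) = {
    val evens = xs.filter(_ % 2 == 0).distinct // stays a List (DList => DList)
    val total = xs.sum                         // collapses    (DList => DObject)
    val top   = xs.max                         // collapses    (DList => DObject)
    (evens, total, top)
  }
}
```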
  • 16. Building abstractions: functional programming treats functions as procedures and functions as parameters, which gives composability + reusability.
  • 17. Composing

// Compute the average of a DList of “numbers”
def average[A : Numeric](in: DList[A]): DObject[A] =
  (in.sum, in.size) map { case (sum, size) => sum / size }

// Compute histogram
def histogram[A](in: DList[A]): DList[(A, Int)] =
  in.map(x => (x, 1)).groupByKey.combine(_+_)

// Throw away words with less-than-average frequency
def betterThanAvgWords(lines: DList[String]): DList[String] = {
  val words = lines.flatMap(_.split(" "))
  val wordCnts = histogram(words)
  val avgFreq = average(wordCnts.values)
  (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
}
  • 18. Unit-testing ‘histogram’ (ScalaCheck properties)

// Specification for histogram function
class HistogramSpec extends HadoopSpecification {

  "Histogram from DList" >> {

    "Sum of bins must equal size of DList" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val hist = histogram(list.toDList)
        val binSum = persist(hist.values.sum)
        binSum == list.size
      }
    }

    "Number of bins must equal number of unique values" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val input = list.toDList
        val bins = histogram(input).keys.size
        val uniques = input.distinct.size
        val (b, u) = persist(bins, uniques)
        b == u
      }
    }
  }
}
  • 19. sbt integration

> test-only *Histogram* -- exclude cluster
[info] HistogramSpec
[info]
[info] Histogram from DList
[info] + Sum of bins must equal size of DList
[info]   No cluster execution time
[info] + Number of bins must equal number of unique values
[info]   No cluster execution time
[info]
[info] Total for specification BoundedFilterSpec
[info] Finished in 12 seconds, 600 ms
[info] 2 examples, 4 expectations, 0 failure, 0 error
[info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0

Dependent JARs are copied (once) to a directory on the cluster (~/libjars by default).

> test-only *Histogram*
> test-only *Histogram* -- scoobi verbose
> test-only *Histogram* -- scoobi verbose.warning
  • 20. Other features
    • Grouping: API for controlling Hadoop’s sort-and-shuffle; useful for implementing secondary sorting
    • Join and Co-group helper methods
    • Matrix multiplication utilities
    • I/O: text, sequence, Avro – or roll your own
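The Grouping API itself is not shown on the slides; as an illustration of the result secondary sorting produces, here is the same outcome computed locally with ordinary collections (on Hadoop the per-key ordering comes from the shuffle itself, not an in-memory sort, and `groupedSorted` is a name invented here):

```scala
// What secondary sorting achieves: each key's values arrive at the
// reducer already ordered. Locally, groupBy + sorted gives the same result.
object SecondarySortSketch {
  def groupedSorted(events: List[(String, Int)]): Map[String, List[Int]] =
    events.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sorted) }
}
```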
  • 21. Want to know more? • http://nicta.github.com/scoobi • Mailing lists: – http://groups.google.com/group/scoobi-users – http://groups.google.com/group/scoobi-dev • Twitter: – @bmlever – @etorreborre • Meet me: – Will also be at Hadoop Summit (June 13-14) – Keen to get feedback
