Scalding
the not-so-basics
Konrad 'ktoso' Malawski

Scala Days 2014 @ Berlin

Konrad `@ktosopl` Malawski
typesafe.com
geecon.org
Java.pl / KrakowScala.pl
sckrk.com / meetup.com/Paper-Cup @ London
GDGKrakow.pl
meetup.com/Lambda-Lounge-Krakow
hAkker @
How old is this guy?

Google MapReduce, paper: 2004
Hadoop (Yahoo impl): 2005

http://hadoop.apache.org/
http://research.google.com/archive/mapreduce.html
the Big Landscape

Scalding is “on top of” Cascading,
which is “on top of” Hadoop.
https://github.com/twitter/scalding
http://www.cascading.org/

Summingbird is “on top of” Scalding or Storm,
which is “on top of” Cascading,
which is “on top of” Hadoop.
https://github.com/twitter/summingbird
http://storm.incubator.apache.org/

Spark is a bit “separate” currently:
HDFS yes, MapReduce no. Summingbird support? Possibly soon?! (-streams)
http://spark.apache.org/

this talk: Scalding
Why?

Stuff > Memory
Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..."   // in Memory

text
  .split(" ")                              // in Memory
  .map(a => (a, 1))                        // in Memory
  .groupBy(_._1)                           // in Memory
  .map(a => (a._1, a._2.map(_._2).sum))    // in Memory
Why Scalding?
Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
“Field API”

map

Scala:
val data = 1 :: 2 :: 3 :: Nil
val doubled = data map { _ * 2 } // Int => Int

Scalding:
IterableSource(data)
  .map('number -> 'doubled) { n: Int => n * 2 } // Int => Int

// 'number stays in the Pipe, 'doubled also becomes available in the Pipe
// must choose the type explicitly!
mapTo

Scala:
var data = 1 :: 2 :: 3 :: Nil
val doubled = data map { _ * 2 }
data = null // “release reference”

Scalding:
IterableSource(data)
  .mapTo('doubled) { n: Int => n * 2 }

// 'doubled stays in the Pipe, 'number is removed
flatMap

Scala:
val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]
val numbers = data flatMap { line =>      // String
  line.split(",")                         // Array[String]
} map { _.toInt }                         // List[Int]

Scalding:
TextLine(data)                                // like List[String]
  .flatMap('line -> 'word) { _.split(",") }   // like List[String]
  .map('word -> 'number) { _.toInt }          // like List[Int]

Or, fused into one step:

Scala:
val numbers = data flatMap { line => // String
  line.split(",").map(_.toInt)       // Array[Int]
}

Scalding:
TextLine(data)                                            // like List[String]
  .flatMap('line -> 'word) { _.split(",").map(_.toInt) }  // like List[Int]
groupBy

Scala:
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, List[Int]]

Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size('lessThanTenCounts) }
// groups all rows with == value

Summing within each group works the same way:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum[Long]('num -> 'total) }

// 'total = [3, 72]
Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {
  // args comes for free, from App
  ToolRunner.run(new Configuration, new scalding.Tool, args)
}
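Packaged as a fat jar, such a runner is typically invoked along these lines (the jar name and job class here are placeholders, not from the deck; scalding.Tool takes the Job class name as its first argument, plus an --hdfs or --local mode flag):

> hadoop jar my-jobs-assembly.jar com.example.WordCountJob --hdfs --input in --output out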
Word Count in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))

  def tokenize(text: String): Array[String] = implemented
}
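The slides leave tokenize as “implemented”. A minimal sketch of one possible implementation (lower-case, strip punctuation, split on whitespace; the exact regexes are our assumption, not from the deck):

// hypothetical tokenize - not from the original slides
def tokenize(text: String): Array[String] =
  text.toLowerCase
    .replaceAll("[^a-z0-9\\s]", "") // keep only letters, digits and whitespace
    .split("\\s+")
    .filter(_.nonEmpty)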
1 day in the life of a guy implementing Scalding jobs
“How much are my shops selling?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
1       107
2       144
3       16
…       …
“Which are the top selling shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16
…       …
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16

SLOW! Instead do sortWithTake!
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3)
  }
  .write(Tsv(output, writeHeader = true))

x
List((5,146), (2,142), (3,32))

WAT!? It emits a scala.collection.List[_] in a single field.
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) {
      (l: (Long, Long), r: (Long, Long)) =>
        l._2 < r._2 // compare the totals (the slide's "l._2 < l._2" was a typo)
    }
  }
  .flatMapTo('x -> ('shopId, 'totalSoldItems)) {
    x: List[(Long, Long)] => x
  }
  .write(Tsv(output, writeHeader = true))

Provide the Ordering explicitly, because an implicit Ordering
is not enough for Tuple2 here.

shopId  totalSoldItems
2       144
1       107
3       16

MUCH faster Job = Happier me.
Reduce, these Monoids

interface:
trait Monoid[T] {
  def zero: T
  def +(a: T, b: T): T
}

+ 3 laws:

Closure: (T, T) => T
∀ a,b ∈ T: a·b ∈ T

Associativity:
∀ a,b,c ∈ T: (a·b)·c = a·(b·c)
(a + b) + c == a + (b + c)

Identity element:
∃ z ∈ T: ∀ a ∈ T: z·a = a·z = a
z + a == a + z == a

Summing:
object IntSum extends Monoid[Int] {
  def zero = 0
  def +(a: Int, b: Int) = a + b
}
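A quick, runnable sanity check of the three laws for IntSum, in plain Scala (just a sketch; `plus` stands in for the `+` above, and the object names are ours):

object MonoidLawsCheck extends App {
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }
  object IntSum extends Monoid[Int] {
    def zero = 0
    def plus(a: Int, b: Int) = a + b
  }

  val (a, b, c) = (1, 2, 3)
  assert(IntSum.plus(IntSum.plus(a, b), c) == IntSum.plus(a, IntSum.plus(b, c))) // associativity
  assert(IntSum.plus(IntSum.zero, a) == a && IntSum.plus(a, IntSum.zero) == a)   // identity element
  // closure holds by construction: plus always returns an Int
  println("IntSum satisfies the Monoid laws on these samples")
}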
Monoid ops can start “Map-side”

bear, 2
car, 3
deer, 2
river, 2

Monoid ops can already start being computed map-side!

Examples: average(), sum(), sortWithTake(), histogram()
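This is exactly what associativity buys: each mapper can pre-aggregate its own shard, and the reducer only merges the (much smaller) partial results. A sketch with plain Scala collections standing in for mapper shards (the shard contents are made up to match the counts above):

// two "mappers", each holding a shard of (word, count) pairs
val shards = List(
  List("bear" -> 1, "bear" -> 1, "car" -> 3),
  List("deer" -> 2, "river" -> 1, "river" -> 1)
)

// map-side: partial sums, computed independently per shard
val partials: List[Map[String, Int]] =
  shards.map(_.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum })

// reduce-side: merge the partial maps - order doesn't matter,
// because Int addition is associative
val merged: Map[String, Int] = partials.reduce { (m1, m2) =>
  (m1.keySet ++ m2.keySet).map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))).toMap
}
// merged == Map(bear -> 2, car -> 3, deer -> 2, river -> 2)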
Obligatory: “Go check out Algebird, NOW!” slide
https://github.com/twitter/algebird

ALGE-birds

BloomFilterMonoid
https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL

val NUM_HASHES = 6
val WIDTH = 32
val SEED = 1
val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

val bf1 = bfMonoid.create("1", "2", "3", "4", "100")
val bf2 = bfMonoid.create("12", "45")
val bf = bf1 ++ bf2
// bf: com.twitter.algebird.BF = …

val approxBool = bf.contains("1")
// approxBool: com.twitter.algebird.ApproximateBoolean =
//   ApproximateBoolean(true,0.9290349745708529)

val res = approxBool.isTrue
// res: Boolean = true
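That ~0.93 confidence reflects the Bloom filter's false-positive probability. A back-of-the-envelope sketch of the textbook estimate p ≈ (1 − e^(−kn/m))^k, with k hashes, m bits and n inserted items (the function name is ours, and Algebird's own estimate differs slightly):

def bloomFalsePositiveProb(k: Int, m: Int, n: Int): Double =
  math.pow(1 - math.exp(-k.toDouble * n / m), k)

bloomFalsePositiveProb(k = 6, m = 32, n = 7) // a ballpark for the REPL example above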
BloomFilterMonoid

Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) {
      (bf: BF, itemName: String) => bf + itemName
    }
  }
  .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }
  .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }
  .discard('itemBloom)
  .write(Tsv(output, writeHeader = true))

shopId  hasSoldBeer  hasSoldWurst
1       false        true
2       false        true
3       false        true
4       true         false
5       true         false

Why not Set[String]? It would OutOfMemory.
(contains really returns e.g. ApproximateBoolean(true,0.9999580954658956))
Joins

that.joinWithLarger('id1 -> 'id2, other)
that.joinWithSmaller('id1 -> 'id2, other)

that.joinWithTiny('id1 -> 'id2, other)

joinWithTiny is appropriate when you know that # of rows
in the bigger pipe > mappers * # of rows in the smaller pipe,
where mappers is the number of mappers in the job.
The “usual” Joins

val people = IterableSource(
  (1, "hans") ::
  (2, "bob") ::
  (3, "hermut") ::
  (4, "heinz") ::
  (5, "klemens") :: … :: Nil,
  ('id, 'name))

val cars = IterableSource(
  (99, 1, "bmw") ::
  (123, 2, "mercedes") ::
  (240, 11, "other") :: Nil,
  ('carId, 'ownerId, 'carName))

import com.twitter.scalding.FunctionImplicits._

people.joinWithLarger('id -> 'ownerId, cars)
  .map(('name, 'carName) -> 'sentence) {
    (name: String, car: String) =>
      s"Hello $name, your $car is really nice"
  }
  .project('sentence)
  .write(output)

Hello hans, your bmw is really nice
Hello bob, your mercedes is really nice
“map-side” join

that.joinWithTiny('id1 -> 'id2, tinyPipe)

Choose this when:
Left > max(mappers, reducers) * Right
or: when the Left side is 3 orders of magnitude larger.
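A minimal sketch of such a map-side join; `users` and `countries` are hypothetical pipes, with `countries` assumed tiny enough to be replicated to every mapper, so the join needs no reduce phase:

users                                         // huge: ('userId, 'countryId)
  .joinWithTiny('countryId -> 'id, countries) // tiny lookup: ('id, 'countryName)
  .project('userId, 'countryName)
  .write(Tsv(output))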
Skew Joins

val sampleRate = 0.001
val reducers = 10
val replicationFactor = 1
val replicator = SkewReplicationA(replicationFactor)

val genders: RichPipe = …
val followers: RichPipe = …

followers
  .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)
  .project('x1, 'y1, 's1, 'x2, 'y2, 's2)
  .write(Tsv("output"))

1. Sample from the left and right pipes with some small probability,
   in order to determine approximately how often each join key appears in each pipe.
2. Use these estimated counts to replicate the join keys,
   according to the given replication strategy.
3. Join the replicated pipes together.
Where did my type-safety go?!

Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Caused by: cascading.flow.FlowException: local step failed
  at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219)
  at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
  at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
  ...
Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation
  at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81)
  at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
  ...
Caused by: java.lang.NumberFormatException: For input string: "bob"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50)

“oh, right… We changed that file to be user names, not ids…”
Trap it!

Tsv(in, ('userId1, 'userId2, 'rel))
  .addTrap(Tsv("errors")) // add a trap
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Solves “dirty data”, but is no help for maintenance.
Typed API

TypedAPI’s

Fields API:
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Typed:
import TDsl._

TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

// must give a Type to each Field

Get the arity wrong (Tuple arity: 2 vs. 3 fields) and it fails fast:

TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

Caused by: java.lang.IllegalArgumentException:
  num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2
  at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176)

a “planning-time” exception
TypedAPI’s

// … with Relationships {
import TDsl._

userRelationships(date)
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))
}

Easier to reuse schemas now:
not coupled by Field names, but still too magic for reuse… “_1”?

Better, with a case class:

// … with Relationships {
import TDsl._

userRelationships(date)
  .filter { p: Person => p.name == "bob" }
  .write(TypedTsv(out))
}

// a TypedPipe[Person]
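For comparison, the canonical Word Count on the Typed API looks roughly like this (a sketch following the Scalding README; the simplistic tokenize is our assumption):

import com.twitter.scalding._

class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))   // TypedPipe[String], one line each
    .flatMap { line => tokenize(line) }     // TypedPipe[String], one word each
    .groupBy { word => word }
    .size                                   // (word, count) pairs
    .write(TypedTsv[(String, Long)](args("output")))

  def tokenize(text: String): Array[String] =
    text.toLowerCase.split("\\s+")
}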
Typed Joins

case class UserName(id: Long, handle: String)
case class UserFavs(byUser: Long, favs: List[Long])
case class UserTweets(byUser: Long, tweets: List[Long])

def users: TypedSource[UserName]
def favs: TypedSource[UserFavs]
def tweets: TypedSource[UserTweets]

def output: TypedSink[(UserName, UserFavs, UserTweets)]

users.groupBy(_.id)
  .join(favs.groupBy(_.byUser))
  .join(tweets.groupBy(_.byUser))
  .map { case (uid, ((user, favs), tweets)) =>
    (user, favs, tweets)
  }
  .write(output)

A 3-way-merge, in 1 MR step.
Do the DOT

> run pl.project13.oculus.job.WordCountJob --local --tool.graph --input in --output out

writing DOT:
pl.project13.oculus.job.WordCountJob0.dot

writing Steps DOT:
pl.project13.oculus.job.WordCountJob0_steps.dot

> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot

(the rendered graph shows the flow split into its MAP and REDUCE phases)
<3 Testing

class WordCountJobTest extends FlatSpec
  with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        out.toList should contain ("kapi" -> 2)
      }
      .run
      .finish
  }
}

run || runHadoop
(swap .run for .runHadoop to exercise the same test in Hadoop mode)
“Parallelize all the batches!”

Feels much like Scala collections
Local Mode thanks to Cascading
Easy to add custom Taps
Type Safe, when you want to
Pure Scala
Testing friendly
Matrix API
Efficient columnar storage (Parquet)

Scalding Re-Cap

TextLine(inputFile)
  .flatMap('line -> 'word) { line: String => tokenize(line) }
  .groupBy('word) { _.size }
  .write(Tsv(outputFile))
Try it!

$ activator new activator-scalding

http://typesafe.com/activator/template/activator-scalding
Template by Dean Wampler
Loads Of Links

1. http://parleys.com/play/51c2e0f3e4b0ed877035684f/chapter0/about
2. https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/ReduceOperations.scala
3. http://www.slideshare.net/johnynek/scalding?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=4
4. http://www.slideshare.net/Hadoop_Summit/severs-june26-255pmroom210av2?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=3
5. http://www.slideshare.net/LivePersonDev/scalding-reaching-efficient-mapreduce?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=2
6. http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
7. http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
8. https://engineering.twitter.com/university/videos/why-scalding-is-important-for-data-science
9. https://github.com/parquet/parquet-format
10. http://www.slideshare.net/ktoso/scalding-hadoop-word-count-in-less-than-60-lines-of-code
11. https://github.com/scalaz/scalaz
12. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
Danke!
Dzięki!
Thanks!
Gracias!
ありがとう!
ktoso @ typesafe.com
t: ktosopl / g: ktoso
blog: project13.pl

Scalding - the not-so-basics @ ScalaDays 2014

  • 1.
    Scalding the not-so-basics Konrad 'ktoso'Malawski Scala Days 2014 @ Berlin
  • 2.
    Konrad `@ktosopl` Malawski typesafe.com geecon.org Java.pl/ KrakowScala.pl sckrk.com / meetup.com/Paper-Cup @ London GDGKrakow.pl meetup.com/Lambda-Lounge-Krakow hAkker @
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
    https://github.com/twitter/scalding Scalding is “ontop of” Cascading, which is “on top of” Hadoop http://www.cascading.org/
  • 9.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird
  • 10.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/
  • 11.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/
  • 12.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no
  • 13.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/
  • 14.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no
  • 15.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no Possibly soon?!
  • 16.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark has nothing to do with all this. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ -streams
  • 17.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ this talk
  • 18.
  • 19.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))!
  • 20.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory
  • 21.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory
  • 22.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory in Memory
  • 23.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory in Memory in Memory
  • 24.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory in Memory in Memory in Memory
  • 25.
    package org.myorg;! ! import org.apache.hadoop.fs.Path;! importorg.apache.hadoop.io.IntWritable;! import org.apache.hadoop.io.LongWritable;! import org.apache.hadoop.io.Text;! import org.apache.hadoop.mapred.*;! ! import java.io.IOException;! import java.util.Iterator;! import java.util.StringTokenizer;! ! public class WordCount {! ! public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {! private final static IntWritable one = new IntWritable(1);! private Text word = new Text();! ! public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) thro IOException {! String line = value.toString();! StringTokenizer tokenizer = new StringTokenizer(line);! while (tokenizer.hasMoreTokens()) {! word.set(tokenizer.nextToken());! output.collect(word, one);! Why Scalding? Word Count in Hadoop MR
  • 26.
    private final staticIntWritable one = new IntWritable(1);! private Text word = new Text();! ! public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) thro IOException {! String line = value.toString();! StringTokenizer tokenizer = new StringTokenizer(line);! while (tokenizer.hasMoreTokens()) {! word.set(tokenizer.nextToken());! output.collect(word, one);! }! }! }! ! public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {! public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {! int sum = 0;! while (values.hasNext()) {! sum += values.next().get();! }! output.collect(key, new IntWritable(sum));! }! }! ! public static void main(String[] args) throws Exception {! JobConf conf = new JobConf(WordCount.class);! conf.setJobName("wordcount");! ! conf.setOutputKeyClass(Text.class);! conf.setOutputValueClass(IntWritable.class);! ! conf.setMapperClass(Map.class);! conf.setCombinerClass(Reduce.class);! conf.setReducerClass(Reduce.class);! ! conf.setInputFormat(TextInputFormat.class);! conf.setOutputFormat(TextOutputFormat.class);! ! FileInputFormat.setInputPaths(conf, new Path(args[0]));! FileOutputFormat.setOutputPath(conf, new Path(args[1]));! ! JobClient.runJob(conf);! }! }! Why Scalding? Word Count in Hadoop MR
  • 35.
  • 36.
  • 37.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map Scala:
  • 38.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map Scala:
  • 39.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala:
  • 40.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala: available in Pipe
  • 41.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala: available in Pipestays in Pipe
  • 42.
    val data =1 :: 2 :: 3 :: Nil! ! val doubled = data map { _ * 2 }! ! // Int => Int map IterableSource(data)! .map('number -> 'doubled) { n: Int => n * 2 }! ! ! // Int => Int Scala: must choose type!
  • 43.
  • 44.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala:
  • 45.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala: “release reference”
  • 46.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala: “release reference”
  • 47.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: “release reference”
  • 48.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: doubled stays in Pipe “release reference”
  • 49.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: doubled stays in Pipenumber is removed “release reference”
  • 50.
  • 51.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap Scala:
  • 52.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap Scala:
  • 53.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap TextLine(data) // like List[String] .flatMap('line -> 'word) { _.split(",") } // like List[String] .map('word -> 'number) { _.toInt } // like List[Int] Scala:
  • 54.
  • 55.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",").map(_.toInt) // Array[Int] } flatMap Scala:
  • 56.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",").map(_.toInt) // Array[Int] } flatMap TextLine(data) // like List[String] .flatMap('line -> 'word) { _.split(",").map(_.toInt) } // like List[Int] Scala:
  • 57.
  • 58.
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy Scala:
  • 59.
  • 60.
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala:
  • 61.
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala: groups all with == value
  • 62.
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala: groups all with == value 'lessThanTen counts
  • 63.
  • 64.
  • 65.
groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 }
  • 66.
groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.sum('total) }
  • 67.
groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.sum('total) } 'total = [3, 72]
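A quick plain-Scala cross-check of those grouped sums (my own snippet, not from the deck):

    List(1, 2, 30, 42).groupBy(_ < 10).map { case (k, vs) => k -> vs.sum }
    // Map(true -> 3, false -> 72)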
  • 68.
import org.apache.hadoop.conf.Configuration! import org.apache.hadoop.util.ToolRunner! import com.twitter.scalding! ! object ScaldingJobRunner extends App {! ! ToolRunner.run(new Configuration, new scalding.Tool, args)! ! } Main Class - "Runner"
  • 69.
import org.apache.hadoop.conf.Configuration! import org.apache.hadoop.util.ToolRunner! import com.twitter.scalding! ! object ScaldingJobRunner extends App {! ! ToolRunner.run(new Configuration, new scalding.Tool, args)! ! } Main Class - "Runner" from App
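With such a runner on the classpath, scalding.Tool picks the Job class from the first program argument; a sketch of the submit command (the jar and class names here are placeholders, not from the slides):

    hadoop jar my-jobs-assembly.jar org.myorg.ScaldingJobRunner \
      com.example.WordCountJob --hdfs --input in.txt --output out.tsv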
  • 70.
class WordCountJob(args: Args) extends Job(args) {! ! ! ! ! ! ! ! ! ! ! } Word Count in Scalding
  • 71.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! ! ! ! ! ! ! } Word Count in Scalding
  • 72.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! ! ! ! ! ! } Word Count in Scalding
  • 73.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! ! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 74.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { group => group.size('count) }! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 75.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { group => group.size }! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 76.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 77.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 78.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding (4 lines!)
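The slides leave tokenize as "implemented"; a minimal sketch of one plausible implementation (my assumption, not the talk's actual code):

    def tokenize(text: String): Array[String] =
      text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")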
  • 79.
1 day in the life of a guy implementing Scalding jobs
  • 80.
“How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output))
  • 81.
“How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output)) 1!107! 2!144! 3!16! … …
  • 82.
“How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output, writeHeader = true))
  • 83.
“How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 1!! ! ! 107! 2!! ! ! 144! 3!! ! ! 16! …!! ! ! …
  • 84.
“Which are the top selling shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse }! .write(Tsv(output, writeHeader = true))
  • 85.
“Which are the top selling shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16! …!! ! ! …
  • 86.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }! .write(Tsv(output, writeHeader = true))
  • 87.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16
  • 88.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16 SLOW! Instead do sortWithTake!
  • 89.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true))
  • 90.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true)) x! List((5,146), (2,142), (3,32))!
  • 91.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true)) x! List((5,146), (2,142), (3,32))! WAT!?
  • 92.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true)) x! List((5,146), (2,142), (3,32))! WAT!? Emits scala.collection.List[_]
  • 93.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSoldItems)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true))
  • 94.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSoldItems)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) Provide Ordering explicitly because implicit Ordering is not enough for Tuple2 here
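The inline (l, r) => Boolean function plays the role of a less-than. If you prefer not to spell it out, one alternative (my sketch, not from the slides) is to summon an Ordering on the count and pass its lt function:

    val byCount = Ordering.by[(Long, Long), Long](_._2)  // order the (shopId, total) pairs by total
    // byCount.lt _ has the (l, r) => Boolean shape sortWithTake expects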
  • 95.
  • 96.
Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSoldItems)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) “What’s the top 3 shops?” shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16
  • 97.
  • 98.
Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSoldItems)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) “What’s the top 3 shops?” MUCH faster Job = Happier me.
  • 99.
  • 100.
  • 101.
trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } Reduce, these Monoids
  • 102.
Reduce, these Monoids trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } interface:
  • 103.
    Reduce, these Monoids +3 laws: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } interface:
  • 104.
    Reduce, these Monoids +3 laws: Closure: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } interface:
  • 105.
Reduce, these Monoids +3 laws: (T, T) => T Closure: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T interface:
  • 106.
Reduce, these Monoids +3 laws: (T, T) => T Closure: Associativity: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T interface:
  • 107.
Reduce, these Monoids +3 laws: (T, T) => T Closure: Associativity: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T ∀a,b,c∈T: (a·b)·c = a·(b·c) (a + b) + c! ==! a + (b + c) interface:
  • 108.
Reduce, these Monoids +3 laws: (T, T) => T Closure: Associativity: Identity element: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T ∀a,b,c∈T: (a·b)·c = a·(b·c) (a + b) + c! ==! a + (b + c) interface:
  • 109.
Reduce, these Monoids +3 laws: (T, T) => T Closure: Associativity: Identity element: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T ∀a,b,c∈T: (a·b)·c = a·(b·c) (a + b) + c! ==! a + (b + c) interface: ∃z∈T: ∀a∈T: z·a = a·z = a z + a == a + z == a
  • 110.
Reduce, these Monoids object IntSum extends Monoid[Int] {! def zero = 0! def +(a: Int, b: Int) = a + b! } Summing:
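A tiny plain-Scala sanity check of the three laws for IntSum (my own snippet, not from the deck):

    val m = IntSum
    val closed: Int = m.+(1, 2)                              // closure: + returns a T (here, Int)
    assert(m.+(m.+(1, 2), 3) == m.+(1, m.+(2, 3)))           // associativity
    assert(m.+(m.zero, 42) == 42 && m.+(42, m.zero) == 42)   // identity element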
  • 111.
Monoid ops can start “Map-side”: Monoid ops can already start being computed map-side! (word-count partials pictured: bear 2, car 3, deer 2, river 2)
  • 112.
Monoid ops can start “Map-side”. Examples: average(), sum(), sortWithTake(), histogram() (word-count partials pictured: bear 2, car 3, deer 2, river 2)
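Why associativity buys that map-side start: every mapper can pre-combine its own partition and the reducer only merges the partials. A plain-Scala simulation of the idea (my own snippet; the word counts are made up):

    val partitions = List(
      List(("bear", 2), ("car", 3)),   // what mapper 1 saw
      List(("bear", 1), ("river", 2))) // what mapper 2 saw
    // "map-side": each mapper combines its own records locally
    val partials = partitions.map(_.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum })
    // "reduce-side": merging the partials gives the same totals, because + is associative
    val totals = partials.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
    // totals == Map("bear" -> 3, "car" -> 3, "river" -> 2)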
  • 113.
Obligatory: “Go check out Algebird, NOW!” slide https://github.com/twitter/algebird ALGE-birds
  • 114.
BloomFilterMonoid https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL val NUM_HASHES = 6! val WIDTH = 32! val SEED = 1! val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)! ! val bf1 = bfMonoid.create("1", "2", "3", "4", "100")! val bf2 = bfMonoid.create("12", "45")! val bf = bf1 ++ bf2! // bf: com.twitter.algebird.BF =! ! val approxBool = bf.contains("1")! // approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)! ! val res = approxBool.isTrue! // res: Boolean = true
  • 115.
  • 116.
  • 117.
BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true))
  • 118.
BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true)) shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false!
  • 119.
BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true)) shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false! Why not Set[String]? It would OutOfMemory.
  • 120.
BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true)) shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false! ApproximateBoolean(true,0.9999580954658956) Why not Set[String]? It would OutOfMemory.
  • 121.
  • 122.
Joins that.joinWithLarger('id1 -> 'id2, other)! that.joinWithSmaller('id1 -> 'id2, other)! ! ! that.joinWithTiny('id1 -> 'id2, other)
  • 123.
Joins that.joinWithLarger('id1 -> 'id2, other)! that.joinWithSmaller('id1 -> 'id2, other)! ! ! that.joinWithTiny('id1 -> 'id2, other) joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job.
  • 124.
Joins that.joinWithLarger('id1 -> 'id2, other)! that.joinWithSmaller('id1 -> 'id2, other)! ! ! that.joinWithTiny('id1 -> 'id2, other) joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job. The “usual”
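A rough sketch of that choice in code (my own illustration; big, small, and lookup are hypothetical pipes):

    big.joinWithSmaller('id -> 'userId, small) // reduce-side join; put the smaller pipe on the right
    big.joinWithTiny('id -> 'userId, lookup)   // map-side replicated join; the tiny side must fit in memory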
  • 125.
Joins val people = IterableSource(! (1, "hans") ::! (2, "bob") ::! (3, "hermut") ::! (4, "heinz") ::! (5, "klemens") :: … :: Nil,! ('id, 'name)) val cars = IterableSource(! (99, 1, "bmw") :: ! (123, 2, "mercedes") ::! (240, 11, "other") :: Nil,! ('carId, 'ownerId, 'carName))!
  • 126.
Joins import com.twitter.scalding.FunctionImplicits._! ! people.joinWithLarger('id -> 'ownerId, cars)! .map(('name, 'carName) -> 'sentence) { ! (name: String, car: String) =>! s"Hello $name, your $car is really nice"! }! .project('sentence)! .write(output) val people = IterableSource(! (1, "hans") ::! (2, "bob") ::! (3, "hermut") ::! (4, "heinz") ::! (5, "klemens") :: … :: Nil,! ('id, 'name)) val cars = IterableSource(! (99, 1, "bmw") :: ! (123, 2, "mercedes") ::! (240, 11, "other") :: Nil,! ('carId, 'ownerId, 'carName))!
  • 127.
Joins import com.twitter.scalding.FunctionImplicits._! ! people.joinWithLarger('id -> 'ownerId, cars)! .map(('name, 'carName) -> 'sentence) { ! (name: String, car: String) =>! s"Hello $name, your $car is really nice"! }! .project('sentence)! .write(output) Hello hans, your bmw is really nice! Hello bob, your mercedes is really nice! val people = IterableSource(! (1, "hans") ::! (2, "bob") ::! (3, "hermut") ::! (4, "heinz") ::! (5, "klemens") :: … :: Nil,! ('id, 'name)) val cars = IterableSource(! (99, 1, "bmw") :: ! (123, 2, "mercedes") ::! (240, 11, "other") :: Nil,! ('carId, 'ownerId, 'carName))!
  • 128.
“map-side” join that.joinWithTiny('id1 -> 'id2, tinyPipe) Choose this when: Left > max(mappers, reducers) * Right! or: when the Left side is 3 orders of magnitude larger.
  • 129.
Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output"))
  • 130.
Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output")) 1. Sample from the left and right pipes with some small probability,
 in order to determine approximately how often each join key appears in each pipe.
  • 131.
Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output")) 1. Sample from the left and right pipes with some small probability,
 in order to determine approximately how often each join key appears in each pipe. 2. Use these estimated counts to replicate the join keys, 
 according to the given replication strategy.
  • 132.
Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output")) 1. Sample from the left and right pipes with some small probability,
 in order to determine approximately how often each join key appears in each pipe. 2. Use these estimated counts to replicate the join keys, 
 according to the given replication strategy. 3. Join the replicated pipes together.
  • 133.
Where did my type-safety go?!
  • 134.
Where did my type-safety go?! Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))!
  • 135.
Where did my type-safety go?! Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))! Caused by: cascading.flow.FlowException: local step failed at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81) at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:34) at cascading.flow.stream.SourceStage.map(SourceStage.java:102) at cascading.flow.stream.SourceStage.call(SourceStage.java:53) at cascading.flow.stream.SourceStage.call(SourceStage.java:38) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NumberFormatException: For input string: "bob" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29)
  • 136.
Where did my type-safety go?! Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))! Caused by: cascading.flow.FlowException: local step failed at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81) at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:34) at cascading.flow.stream.SourceStage.map(SourceStage.java:102) at cascading.flow.stream.SourceStage.call(SourceStage.java:53) at cascading.flow.stream.SourceStage.call(SourceStage.java:38) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NumberFormatException: For input string: "bob" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29) “oh, right… We changed that file to be user names, not ids…”
  • 137.
Trap it! Tsv(in, ('userId1, 'userId2, 'rel))! .addTrap(Tsv("errors")) // add a trap! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))
  • 138.
Trap it! Tsv(in, ('userId1, 'userId2, 'rel))! .addTrap(Tsv("errors")) // add a trap! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out)) solves “dirty data”, no help for maintenance
  • 139.
  • 140.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))!
  • 141.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))!
  • 142.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! Must give Type to each Field
  • 143.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))!
  • 144.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))! Tuple arity: 2 Tuple arity: 3
  • 145.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2 at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176) TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))! Tuple arity: 2 Tuple arity: 3
  • 146.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2 at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176) TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))! Tuple arity: 2 Tuple arity: 3 “planning-time” exception
  • 147.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date)! .filter { _._1 == "bob" }! .write(TypedTsv(out))! ! }
  • 148.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date)! .filter { _._1 == "bob" }! .write(TypedTsv(out))! ! } Easier to reuse schemas now
  • 149.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date)! .filter { _._1 == "bob" }! .write(TypedTsv(out))! ! } Easier to reuse schemas now Not coupled by Field names, but still too magic for reuse… “_1”?
  • 150.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date) ! .filter { p: Person => p.name == "bob" }! .write(TypedTsv(out))! ! }
  • 151.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date) ! .filter { p: Person => p.name == "bob" }! .write(TypedTsv(out))! ! } TypedPipe[Person]
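The slides never define Person or userRelationships; a minimal shape they might have, purely as an assumption for readability:

    case class Person(id: Long, name: String, rel: Int)              // hypothetical schema
    def userRelationships(date: String): TypedPipe[Person] = ???     // hypothetical typed source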
  • 152.
Typed Joins case class UserName(id: Long, handle: String)! case class UserFavs(byUser: Long, favs: List[Long])! case class UserTweets(byUser: Long, tweets: List[Long])! ! def users: TypedSource[UserName]! def favs: TypedSource[UserFavs]! def tweets: TypedSource[UserTweets]! ! def output: TypedSink[(UserName, UserFavs, UserTweets)]! ! users.groupBy(_.id)! .join(favs.groupBy(_.byUser))! .join(tweets.groupBy(_.byUser))! .map { case (uid, ((user, favs), tweets)) =>! (user, favs, tweets)! } ! .write(output)!
  • 153.
  • 154.
Typed Joins case class UserName(id: Long, handle: String)! case class UserFavs(byUser: Long, favs: List[Long])! case class UserTweets(byUser: Long, tweets: List[Long])! ! def users: TypedSource[UserName]! def favs: TypedSource[UserFavs]! def tweets: TypedSource[UserTweets]! ! def output: TypedSink[(UserName, UserFavs, UserTweets)]! ! users.groupBy(_.id)! .join(favs.groupBy(_.byUser))! .join(tweets.groupBy(_.byUser))! .map { case (uid, ((user, favs), tweets)) =>! (user, favs, tweets)! } ! .write(output)! 3-way-merge in 1 MR step
  • 155.
> run pl.project13.oculus.job.WordCountJob! --local --tool.graph --input in --output out! ! writing DOT: ! pl.project13.oculus.job.WordCountJob0.dot! ! writing Steps DOT: ! pl.project13.oculus.job.WordCountJob0_steps.dot Do the DOT
  • 156.
  • 157.
> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot! (rendered flow graph pictured) Do the DOT
  • 158.
> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot! (rendered flow graph pictured, MAP stage highlighted) Do the DOT
  • 159.
> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot! (rendered flow graph pictured, MAP and REDUCE stages highlighted) Do the DOT
  • 160.
  • 161.
  • 162.
class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 2)! }! .run! .finish! }! ! }! <3 Testing
  • 163.
  • 164.
  • 165.
  • 166.
  • 167.
  • 168.
  • 169.
  • 170.
class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 2)! }! .runHadoop! .finish! }! ! }! <3 Testing run || runHadoop
  • 172.
  • 173.
“Parallelize all the batches!” Feels much like Scala collections
  • 174.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading
  • 175.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps
  • 176.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to
  • 177.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala
  • 178.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly
  • 179.
  • 180.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly Matrix API
  • 181.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly Matrix API Efficient columnar storage (Parquet)
  • 182.
Scalding Re-Cap ! ! ! ! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! !
  • 183.
Scalding Re-Cap ! ! ! ! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! ! (4 lines!)
  • 184.
! ! ! ! ! $ activator new activator-scalding! ! Try it! http://typesafe.com/activator/template/activator-scalding Template by Dean Wampler
  • 185.
Loads Of Links
1. http://parleys.com/play/51c2e0f3e4b0ed877035684f/chapter0/about
2. https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/ReduceOperations.scala
3. http://www.slideshare.net/johnynek/scalding?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=4
4. http://www.slideshare.net/Hadoop_Summit/severs-june26-255pmroom210av2?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=3
5. http://www.slideshare.net/LivePersonDev/scalding-reaching-efficient-mapreduce?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=2
6. http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
7. http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
8. https://engineering.twitter.com/university/videos/why-scalding-is-important-for-data-science
9. https://github.com/parquet/parquet-format
10. http://www.slideshare.net/ktoso/scalding-hadoop-word-count-in-less-than-60-lines-of-code
11. https://github.com/scalaz/scalaz
12. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
  • 186.