Scalding
the not-so-basics
Konrad 'ktoso' Malawski

Scala Days 2014 @ Berlin

Konrad `@ktosopl` Malawski
typesafe.com
geecon.org
Java.pl / KrakowScala.pl
sckrk.com / meetup.com/Paper-Cup @ London
GDGKrakow.pl
meetup.com/Lambda-Lounge-Krakow
hAkker @
How old is this guy?

Google MapReduce, paper: 2004
Hadoop (Yahoo impl): 2005

http://hadoop.apache.org/
http://research.google.com/archive/mapreduce.html
the Big Landscape

Scalding is “on top of” Cascading,
which is “on top of” Hadoop.
https://github.com/twitter/scalding
http://www.cascading.org/

Summingbird is “on top of” Scalding or Storm,
which is “on top of” Cascading,
which is “on top of” Hadoop.
https://github.com/twitter/summingbird
http://storm.incubator.apache.org/

Spark is a bit “separate” currently:
HDFS yes, MapReduce no. Summingbird support? Possibly soon?! (-streams)
http://spark.apache.org/

this talk: Scalding
Why?

Stuff > Memory
Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..."   // in Memory

text
  .split(" ")                              // in Memory
  .map(a => (a, 1))                        // in Memory
  .groupBy(_._1)                           // in Memory
  .map(a => (a._1, a._2.map(_._2).sum))    // in Memory
Why Scalding?
Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
“Field API”

map

Scala:
val data = 1 :: 2 :: 3 :: Nil
val doubled = data map { _ * 2 } // Int => Int

Scalding:
IterableSource(data)
  .map('number -> 'doubled) { n: Int => n * 2 } // Int => Int

// 'number stays in the Pipe, 'doubled also becomes available in the Pipe
// must choose the type explicitly!
mapTo

Scala:
var data = 1 :: 2 :: 3 :: Nil
val doubled = data map { _ * 2 }
data = null // “release reference”

Scalding:
IterableSource(data)
  .mapTo('doubled) { n: Int => n * 2 }

// 'doubled stays in the Pipe, 'number is removed
flatMap

Scala:
val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]
val numbers = data flatMap { line =>      // String
  line.split(",")                         // Array[String]
} map { _.toInt }                         // List[Int]

Scalding:
TextLine(data)                                // like List[String]
  .flatMap('line -> 'word) { _.split(",") }   // like List[String]
  .map('word -> 'number) { _.toInt }          // like List[Int]

Or, fused into one step:

Scala:
val numbers = data flatMap { line => // String
  line.split(",").map(_.toInt)       // Array[Int]
}

Scalding:
TextLine(data)                                            // like List[String]
  .flatMap('line -> 'word) { _.split(",").map(_.toInt) }  // like List[Int]
groupBy

Scala:
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, List[Int]]

Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size('lessThanTenCounts) }
// groups all rows with == value

Summing within each group works the same way:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum[Long]('num -> 'total) }

// 'total = [3, 72]
Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {
  // args comes for free, from App
  ToolRunner.run(new Configuration, new scalding.Tool, args)
}
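Packaged as a fat jar, such a runner is typically invoked along these lines (the jar name and job class here are placeholders, not from the deck; scalding.Tool takes the Job class name as its first argument, plus an --hdfs or --local mode flag):

> hadoop jar my-jobs-assembly.jar com.example.WordCountJob --hdfs --input in --output out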
Word Count in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))

  def tokenize(text: String): Array[String] = implemented
}
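The slides leave tokenize as “implemented”. A minimal sketch of one possible implementation (lower-case, strip punctuation, split on whitespace; the exact regexes are our assumption, not from the deck):

// hypothetical tokenize - not from the original slides
def tokenize(text: String): Array[String] =
  text.toLowerCase
    .replaceAll("[^a-z0-9\\s]", "") // keep only letters, digits and whitespace
    .split("\\s+")
    .filter(_.nonEmpty)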
1 day in the life of a guy implementing Scalding jobs
“How much are my shops selling?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
1       107
2       144
3       16
…       …
“Which are the top selling shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16
…       …
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16

SLOW! Instead do sortWithTake!
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3)
  }
  .write(Tsv(output, writeHeader = true))

x
List((5,146), (2,142), (3,32))

WAT!? It emits a scala.collection.List[_] in a single field.
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) {
      (l: (Long, Long), r: (Long, Long)) =>
        l._2 < r._2 // compare the totals (the slide's "l._2 < l._2" was a typo)
    }
  }
  .flatMapTo('x -> ('shopId, 'totalSoldItems)) {
    x: List[(Long, Long)] => x
  }
  .write(Tsv(output, writeHeader = true))

Provide the Ordering explicitly, because an implicit Ordering
is not enough for Tuple2 here.

shopId  totalSoldItems
2       144
1       107
3       16

MUCH faster Job = Happier me.
Reduce, these Monoids

interface:
trait Monoid[T] {
  def zero: T
  def +(a: T, b: T): T
}

+ 3 laws:

Closure: (T, T) => T
∀ a,b ∈ T: a·b ∈ T

Associativity:
∀ a,b,c ∈ T: (a·b)·c = a·(b·c)
(a + b) + c == a + (b + c)

Identity element:
∃ z ∈ T: ∀ a ∈ T: z·a = a·z = a
z + a == a + z == a

Summing:
object IntSum extends Monoid[Int] {
  def zero = 0
  def +(a: Int, b: Int) = a + b
}
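A quick, runnable sanity check of the three laws for IntSum, in plain Scala (just a sketch; `plus` stands in for the `+` above, and the object names are ours):

object MonoidLawsCheck extends App {
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }
  object IntSum extends Monoid[Int] {
    def zero = 0
    def plus(a: Int, b: Int) = a + b
  }

  val (a, b, c) = (1, 2, 3)
  assert(IntSum.plus(IntSum.plus(a, b), c) == IntSum.plus(a, IntSum.plus(b, c))) // associativity
  assert(IntSum.plus(IntSum.zero, a) == a && IntSum.plus(a, IntSum.zero) == a)   // identity element
  // closure holds by construction: plus always returns an Int
  println("IntSum satisfies the Monoid laws on these samples")
}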
Monoid ops can start “Map-side”

bear, 2
car, 3
deer, 2
river, 2

Monoid ops can already start being computed map-side!

Examples: average(), sum(), sortWithTake(), histogram()
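This is exactly what associativity buys: each mapper can pre-aggregate its own shard, and the reducer only merges the (much smaller) partial results. A sketch with plain Scala collections standing in for mapper shards (the shard contents are made up to match the counts above):

// two "mappers", each holding a shard of (word, count) pairs
val shards = List(
  List("bear" -> 1, "bear" -> 1, "car" -> 3),
  List("deer" -> 2, "river" -> 1, "river" -> 1)
)

// map-side: partial sums, computed independently per shard
val partials: List[Map[String, Int]] =
  shards.map(_.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum })

// reduce-side: merge the partial maps - order doesn't matter,
// because Int addition is associative
val merged: Map[String, Int] = partials.reduce { (m1, m2) =>
  (m1.keySet ++ m2.keySet).map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))).toMap
}
// merged == Map(bear -> 2, car -> 3, deer -> 2, river -> 2)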
Obligatory: “Go check out Algebird, NOW!” slide
https://github.com/twitter/algebird

ALGE-birds

BloomFilterMonoid
https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL

val NUM_HASHES = 6
val WIDTH = 32
val SEED = 1
val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

val bf1 = bfMonoid.create("1", "2", "3", "4", "100")
val bf2 = bfMonoid.create("12", "45")
val bf = bf1 ++ bf2
// bf: com.twitter.algebird.BF = …

val approxBool = bf.contains("1")
// approxBool: com.twitter.algebird.ApproximateBoolean =
//   ApproximateBoolean(true,0.9290349745708529)

val res = approxBool.isTrue
// res: Boolean = true
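That ~0.93 confidence reflects the Bloom filter's false-positive probability. A back-of-the-envelope sketch of the textbook estimate p ≈ (1 − e^(−kn/m))^k, with k hashes, m bits and n inserted items (the function name is ours, and Algebird's own estimate differs slightly):

def bloomFalsePositiveProb(k: Int, m: Int, n: Int): Double =
  math.pow(1 - math.exp(-k.toDouble * n / m), k)

bloomFalsePositiveProb(k = 6, m = 32, n = 7) // a ballpark for the REPL example above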
BloomFilterMonoid

Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) {
      (bf: BF, itemName: String) => bf + itemName
    }
  }
  .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }
  .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }
  .discard('itemBloom)
  .write(Tsv(output, writeHeader = true))

shopId  hasSoldBeer  hasSoldWurst
1       false        true
2       false        true
3       false        true
4       true         false
5       true         false

Why not Set[String]? It would OutOfMemory.
(contains really returns e.g. ApproximateBoolean(true,0.9999580954658956))
Joins

that.joinWithLarger('id1 -> 'id2, other)
that.joinWithSmaller('id1 -> 'id2, other)

that.joinWithTiny('id1 -> 'id2, other)

joinWithTiny is appropriate when you know that # of rows
in the bigger pipe > mappers * # of rows in the smaller pipe,
where mappers is the number of mappers in the job.
The “usual” Joins

val people = IterableSource(
  (1, "hans") ::
  (2, "bob") ::
  (3, "hermut") ::
  (4, "heinz") ::
  (5, "klemens") :: … :: Nil,
  ('id, 'name))

val cars = IterableSource(
  (99, 1, "bmw") ::
  (123, 2, "mercedes") ::
  (240, 11, "other") :: Nil,
  ('carId, 'ownerId, 'carName))

import com.twitter.scalding.FunctionImplicits._

people.joinWithLarger('id -> 'ownerId, cars)
  .map(('name, 'carName) -> 'sentence) {
    (name: String, car: String) =>
      s"Hello $name, your $car is really nice"
  }
  .project('sentence)
  .write(output)

Hello hans, your bmw is really nice
Hello bob, your mercedes is really nice
“map-side” join

that.joinWithTiny('id1 -> 'id2, tinyPipe)

Choose this when:
Left > max(mappers, reducers) * Right
or: when the Left side is 3 orders of magnitude larger.
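A minimal sketch of such a map-side join; `users` and `countries` are hypothetical pipes, with `countries` assumed tiny enough to be replicated to every mapper, so the join needs no reduce phase:

users                                         // huge: ('userId, 'countryId)
  .joinWithTiny('countryId -> 'id, countries) // tiny lookup: ('id, 'countryName)
  .project('userId, 'countryName)
  .write(Tsv(output))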
Skew Joins

val sampleRate = 0.001
val reducers = 10
val replicationFactor = 1
val replicator = SkewReplicationA(replicationFactor)

val genders: RichPipe = …
val followers: RichPipe = …

followers
  .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)
  .project('x1, 'y1, 's1, 'x2, 'y2, 's2)
  .write(Tsv("output"))

1. Sample from the left and right pipes with some small probability,
   in order to determine approximately how often each join key appears in each pipe.
2. Use these estimated counts to replicate the join keys,
   according to the given replication strategy.
3. Join the replicated pipes together.
Where did my type-safety go?!

Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Caused by: cascading.flow.FlowException: local step failed
  at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219)
  at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
  at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
  ...
Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation
  at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81)
  at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
  ...
Caused by: java.lang.NumberFormatException: For input string: "bob"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50)

“oh, right… We changed that file to be user names, not ids…”
Trap it!

Tsv(in, ('userId1, 'userId2, 'rel))
  .addTrap(Tsv("errors")) // add a trap
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Solves “dirty data”, but is no help for maintenance.
Typed API

TypedAPI’s

Fields API:
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Typed:
import TDsl._

TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

// must give a Type to each Field

Get the arity wrong (Tuple arity: 2 vs. 3 fields) and it fails fast:

TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

Caused by: java.lang.IllegalArgumentException:
  num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2
  at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176)

a “planning-time” exception
TypedAPI’s

// … with Relationships {
import TDsl._

userRelationships(date)
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))
}

Easier to reuse schemas now:
not coupled by Field names, but still too magic for reuse… “_1”?

Better, with a case class:

// … with Relationships {
import TDsl._

userRelationships(date)
  .filter { p: Person => p.name == "bob" }
  .write(TypedTsv(out))
}

// a TypedPipe[Person]
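For comparison, the canonical Word Count on the Typed API looks roughly like this (a sketch following the Scalding README; the simplistic tokenize is our assumption):

import com.twitter.scalding._

class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))   // TypedPipe[String], one line each
    .flatMap { line => tokenize(line) }     // TypedPipe[String], one word each
    .groupBy { word => word }
    .size                                   // (word, count) pairs
    .write(TypedTsv[(String, Long)](args("output")))

  def tokenize(text: String): Array[String] =
    text.toLowerCase.split("\\s+")
}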
Typed Joins

case class UserName(id: Long, handle: String)
case class UserFavs(byUser: Long, favs: List[Long])
case class UserTweets(byUser: Long, tweets: List[Long])

def users: TypedSource[UserName]
def favs: TypedSource[UserFavs]
def tweets: TypedSource[UserTweets]

def output: TypedSink[(UserName, UserFavs, UserTweets)]

users.groupBy(_.id)
  .join(favs.groupBy(_.byUser))
  .join(tweets.groupBy(_.byUser))
  .map { case (uid, ((user, favs), tweets)) =>
    (user, favs, tweets)
  }
  .write(output)

A 3-way-merge, in 1 MR step.
Do the DOT

> run pl.project13.oculus.job.WordCountJob --local --tool.graph --input in --output out

writing DOT:
pl.project13.oculus.job.WordCountJob0.dot

writing Steps DOT:
pl.project13.oculus.job.WordCountJob0_steps.dot

> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot

(the rendered graph shows the flow split into its MAP and REDUCE phases)
<3 Testing

class WordCountJobTest extends FlatSpec
  with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        out.toList should contain ("kapi" -> 2)
      }
      .run
      .finish
  }
}

run || runHadoop
(swap .run for .runHadoop to exercise the same test in Hadoop mode)
“Parallelize all the batches!”

Feels much like Scala collections
Local Mode thanks to Cascading
Easy to add custom Taps
Type Safe, when you want to
Pure Scala
Testing friendly
Matrix API
Efficient columnar storage (Parquet)

Scalding Re-Cap

TextLine(inputFile)
  .flatMap('line -> 'word) { line: String => tokenize(line) }
  .groupBy('word) { _.size }
  .write(Tsv(outputFile))
Try it!

$ activator new activator-scalding

http://typesafe.com/activator/template/activator-scalding
Template by Dean Wampler
Loads Of Links

1. http://parleys.com/play/51c2e0f3e4b0ed877035684f/chapter0/about
2. https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/ReduceOperations.scala
3. http://www.slideshare.net/johnynek/scalding?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=4
4. http://www.slideshare.net/Hadoop_Summit/severs-june26-255pmroom210av2?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=3
5. http://www.slideshare.net/LivePersonDev/scalding-reaching-efficient-mapreduce?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=2
6. http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
7. http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
8. https://engineering.twitter.com/university/videos/why-scalding-is-important-for-data-science
9. https://github.com/parquet/parquet-format
10. http://www.slideshare.net/ktoso/scalding-hadoop-word-count-in-less-than-60-lines-of-code
11. https://github.com/scalaz/scalaz
12. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
Danke!
Dzięki!
Thanks!
Gracias!
ありがとう!
ktoso @ typesafe.com
t: ktosopl / g: ktoso
blog: project13.pl

Scalding - the not-so-basics @ ScalaDays 2014

  • 1.
    Scalding the not-so-basics Konrad 'ktoso'Malawski Scala Days 2014 @ Berlin
  • 2.
    Konrad `@ktosopl` Malawski typesafe.com geecon.org Java.pl/ KrakowScala.pl sckrk.com / meetup.com/Paper-Cup @ London GDGKrakow.pl meetup.com/Lambda-Lounge-Krakow hAkker @
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
    https://github.com/twitter/scalding Scalding is “ontop of” Cascading, which is “on top of” Hadoop http://www.cascading.org/
  • 9.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird
  • 10.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/
  • 11.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/
  • 12.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no
  • 13.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/
  • 14.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no
  • 15.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no Possibly soon?!
  • 16.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark has nothing to do with all this. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ -streams
  • 17.
    https://github.com/twitter/scalding Summingbird is “optop of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ this talk
  • 18.
  • 19.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))!
  • 20.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory
  • 21.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory
  • 22.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory in Memory
  • 23.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory in Memory in Memory
  • 24.
    Stuff > Memory Scalacollections... fun but, memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory in Memory in Memory in Memory
  • 25.
    package org.myorg;! ! import org.apache.hadoop.fs.Path;! importorg.apache.hadoop.io.IntWritable;! import org.apache.hadoop.io.LongWritable;! import org.apache.hadoop.io.Text;! import org.apache.hadoop.mapred.*;! ! import java.io.IOException;! import java.util.Iterator;! import java.util.StringTokenizer;! ! public class WordCount {! ! public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {! private final static IntWritable one = new IntWritable(1);! private Text word = new Text();! ! public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) thro IOException {! String line = value.toString();! StringTokenizer tokenizer = new StringTokenizer(line);! while (tokenizer.hasMoreTokens()) {! word.set(tokenizer.nextToken());! output.collect(word, one);! Why Scalding? Word Count in Hadoop MR
  • 26.
    private final staticIntWritable one = new IntWritable(1);! private Text word = new Text();! ! public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) thro IOException {! String line = value.toString();! StringTokenizer tokenizer = new StringTokenizer(line);! while (tokenizer.hasMoreTokens()) {! word.set(tokenizer.nextToken());! output.collect(word, one);! }! }! }! ! public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {! public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {! int sum = 0;! while (values.hasNext()) {! sum += values.next().get();! }! output.collect(key, new IntWritable(sum));! }! }! ! public static void main(String[] args) throws Exception {! JobConf conf = new JobConf(WordCount.class);! conf.setJobName("wordcount");! ! conf.setOutputKeyClass(Text.class);! conf.setOutputValueClass(IntWritable.class);! ! conf.setMapperClass(Map.class);! conf.setCombinerClass(Reduce.class);! conf.setReducerClass(Reduce.class);! ! conf.setInputFormat(TextInputFormat.class);! conf.setOutputFormat(TextOutputFormat.class);! ! FileInputFormat.setInputPaths(conf, new Path(args[0]));! FileOutputFormat.setOutputPath(conf, new Path(args[1]));! ! JobClient.runJob(conf);! }! }! Why Scalding? Word Count in Hadoop MR
  • 35.
  • 36.
  • 37.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map Scala:
  • 38.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map Scala:
  • 39.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala:
  • 40.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala: available in Pipe
  • 41.
    val data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala: available in Pipestays in Pipe
  • 42.
    val data =1 :: 2 :: 3 :: Nil! ! val doubled = data map { _ * 2 }! ! // Int => Int map IterableSource(data)! .map('number -> 'doubled) { n: Int => n * 2 }! ! ! // Int => Int Scala: must choose type!
  • 43.
  • 44.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala:
  • 45.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala: “release reference”
  • 46.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala: “release reference”
  • 47.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: “release reference”
  • 48.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: doubled stays in Pipe “release reference”
  • 49.
    var data =1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: doubled stays in Pipenumber is removed “release reference”
  • 50.
  • 51.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap Scala:
  • 52.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap Scala:
  • 53.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap TextLine(data) // like List[String] .flatMap('line -> 'word) { _.split(",") } // like List[String] .map('word -> 'number) { _.toInt } // like List[Int] Scala:
  • 54.
  • 55.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",").map(_.toInt) // Array[Int] } flatMap Scala:
  • 56.
    val data ="1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",").map(_.toInt) // Array[Int] } flatMap TextLine(data) // like List[String] .flatMap('line -> 'word) { _.split(",").map(_.toInt) } // like List[Int] Scala:
  • 57.
  • 58.
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy Scala:
  • 59.
  • 60.
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala:
  • 61.
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala: groups all with == value
  • 62.
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala: groups all with == value 'lessThanTen counts
  • 63.
  • 64.
  • 65.
groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 }
  • 66.
groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.sum('total) }
  • 67.
groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.sum('total) } 'total = [3, 72]
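A quick plain-Scala cross-check of those grouped sums (my own snippet, not from the deck):

    List(1, 2, 30, 42).groupBy(_ < 10).map { case (k, vs) => k -> vs.sum }
    // Map(true -> 3, false -> 72)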
  • 68.
import org.apache.hadoop.conf.Configuration! import org.apache.hadoop.util.ToolRunner! import com.twitter.scalding! ! object ScaldingJobRunner extends App {! ! ToolRunner.run(new Configuration, new scalding.Tool, args)! ! } Main Class - "Runner"
  • 69.
import org.apache.hadoop.conf.Configuration! import org.apache.hadoop.util.ToolRunner! import com.twitter.scalding! ! object ScaldingJobRunner extends App {! ! ToolRunner.run(new Configuration, new scalding.Tool, args)! ! } Main Class - "Runner" from App
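With such a runner on the classpath, scalding.Tool picks the Job class from the first program argument; a sketch of the submit command (the jar and class names here are placeholders, not from the slides):

    hadoop jar my-jobs-assembly.jar org.myorg.ScaldingJobRunner \
      com.example.WordCountJob --hdfs --input in.txt --output out.tsv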
  • 70.
class WordCountJob(args: Args) extends Job(args) {! ! ! ! ! ! ! ! ! ! ! } Word Count in Scalding
  • 71.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! ! ! ! ! ! ! } Word Count in Scalding
  • 72.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! ! ! ! ! ! } Word Count in Scalding
  • 73.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! ! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 74.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { group => group.size('count) }! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 75.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { group => group.size }! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 76.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 77.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  • 78.
class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding (4 lines!)
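The slides leave tokenize as "implemented"; a minimal sketch of one plausible implementation (my assumption, not the talk's actual code):

    def tokenize(text: String): Array[String] =
      text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")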
  • 79.
1 day in the life of a guy implementing Scalding jobs
  • 80.
“How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output))
  • 81.
“How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output)) 1!107! 2!144! 3!16! … …
  • 82.
“How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output, writeHeader = true))
  • 83.
“How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 1!! ! ! 107! 2!! ! ! 144! 3!! ! ! 16! …!! ! ! …
  • 84.
“Which are the top selling shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse }! .write(Tsv(output, writeHeader = true))
  • 85.
“Which are the top selling shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16! …!! ! ! …
  • 86.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }! .write(Tsv(output, writeHeader = true))
  • 87.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16
  • 88.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16 SLOW! Instead do sortWithTake!
  • 89.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true))
  • 90.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true)) x! List((5,146), (2,142), (3,32))!
  • 91.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true)) x! List((5,146), (2,142), (3,32))! WAT!?
  • 92.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true)) x! List((5,146), (2,142), (3,32))! WAT!? Emits scala.collection.List[_]
  • 93.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSoldItems)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true))
  • 94.
“What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSoldItems)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) Provide Ordering explicitly because implicit Ordering is not enough for Tuple2 here
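The inline (l, r) => Boolean function plays the role of a less-than. If you prefer not to spell it out, one alternative (my sketch, not from the slides) is to summon an Ordering on the count and pass its lt function:

    val byCount = Ordering.by[(Long, Long), Long](_._2)  // order the (shopId, total) pairs by total
    // byCount.lt _ has the (l, r) => Boolean shape sortWithTake expects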
  • 95.
  • 96.
Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSoldItems)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) “What’s the top 3 shops?” shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16
  • 97.
  • 98.
Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSoldItems)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) “What’s the top 3 shops?” MUCH faster Job = Happier me.
  • 99.
  • 100.
  • 101.
trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } Reduce, these Monoids
  • 102.
Reduce, these Monoids trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } interface:
  • 103.
    Reduce, these Monoids +3 laws: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } interface:
  • 104.
    Reduce, these Monoids +3 laws: Closure: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } interface:
  • 105.
Reduce, these Monoids +3 laws: (T, T) => T Closure: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T interface:
  • 106.
Reduce, these Monoids +3 laws: (T, T) => T Closure: Associativity: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T interface:
  • 107.
Reduce, these Monoids +3 laws: (T, T) => T Closure: Associativity: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T ∀a,b,c∈T: (a·b)·c = a·(b·c) (a + b) + c! ==! a + (b + c) interface:
  • 108.
Reduce, these Monoids +3 laws: (T, T) => T Closure: Associativity: Identity element: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T ∀a,b,c∈T: (a·b)·c = a·(b·c) (a + b) + c! ==! a + (b + c) interface:
  • 109.
Reduce, these Monoids +3 laws: (T, T) => T Closure: Associativity: Identity element: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T: a·b∈T ∀a,b,c∈T: (a·b)·c = a·(b·c) (a + b) + c! ==! a + (b + c) interface: ∃z∈T: ∀a∈T: z·a = a·z = a z + a == a + z == a
  • 110.
Reduce, these Monoids object IntSum extends Monoid[Int] {! def zero = 0! def +(a: Int, b: Int) = a + b! } Summing:
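A tiny plain-Scala sanity check of the three laws for IntSum (my own snippet, not from the deck):

    val m = IntSum
    val closed: Int = m.+(1, 2)                              // closure: + returns a T (here, Int)
    assert(m.+(m.+(1, 2), 3) == m.+(1, m.+(2, 3)))           // associativity
    assert(m.+(m.zero, 42) == 42 && m.+(42, m.zero) == 42)   // identity element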
  • 111.
Monoid ops can start “Map-side”: Monoid ops can already start being computed map-side! (word-count partials pictured: bear 2, car 3, deer 2, river 2)
  • 112.
Monoid ops can start “Map-side”. Examples: average(), sum(), sortWithTake(), histogram() (word-count partials pictured: bear 2, car 3, deer 2, river 2)
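Why associativity buys that map-side start: every mapper can pre-combine its own partition and the reducer only merges the partials. A plain-Scala simulation of the idea (my own snippet; the word counts are made up):

    val partitions = List(
      List(("bear", 2), ("car", 3)),   // what mapper 1 saw
      List(("bear", 1), ("river", 2))) // what mapper 2 saw
    // "map-side": each mapper combines its own records locally
    val partials = partitions.map(_.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum })
    // "reduce-side": merging the partials gives the same totals, because + is associative
    val totals = partials.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
    // totals == Map("bear" -> 3, "car" -> 3, "river" -> 2)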
  • 113.
Obligatory: “Go check out Algebird, NOW!” slide https://github.com/twitter/algebird ALGE-birds
  • 114.
BloomFilterMonoid https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL val NUM_HASHES = 6! val WIDTH = 32! val SEED = 1! val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)! ! val bf1 = bfMonoid.create("1", "2", "3", "4", "100")! val bf2 = bfMonoid.create("12", "45")! val bf = bf1 ++ bf2! // bf: com.twitter.algebird.BF =! ! val approxBool = bf.contains("1")! // approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)! ! val res = approxBool.isTrue! // res: Boolean = true
  • 115.
  • 116.
  • 117.
BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true))
  • 118.
BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true)) shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false!
  • 119.
BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true)) shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false! Why not Set[String]? It would OutOfMemory.
  • 120.
BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemId: String) => bf + itemId ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true)) shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false! ApproximateBoolean(true,0.9999580954658956) Why not Set[String]? It would OutOfMemory.
  • 121.
  • 122.
Joins that.joinWithLarger('id1 -> 'id2, other)! that.joinWithSmaller('id1 -> 'id2, other)! ! ! that.joinWithTiny('id1 -> 'id2, other)
  • 123.
Joins that.joinWithLarger('id1 -> 'id2, other)! that.joinWithSmaller('id1 -> 'id2, other)! ! ! that.joinWithTiny('id1 -> 'id2, other) joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job.
  • 124.
Joins that.joinWithLarger('id1 -> 'id2, other)! that.joinWithSmaller('id1 -> 'id2, other)! ! ! that.joinWithTiny('id1 -> 'id2, other) joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job. The “usual”
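A rough sketch of that choice in code (my own illustration; big, small, and lookup are hypothetical pipes):

    big.joinWithSmaller('id -> 'userId, small) // reduce-side join; put the smaller pipe on the right
    big.joinWithTiny('id -> 'userId, lookup)   // map-side replicated join; the tiny side must fit in memory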
  • 125.
Joins val people = IterableSource(! (1, "hans") ::! (2, "bob") ::! (3, "hermut") ::! (4, "heinz") ::! (5, "klemens") :: … :: Nil,! ('id, 'name)) val cars = IterableSource(! (99, 1, "bmw") :: ! (123, 2, "mercedes") ::! (240, 11, "other") :: Nil,! ('carId, 'ownerId, 'carName))!
  • 126.
Joins import com.twitter.scalding.FunctionImplicits._! ! people.joinWithLarger('id -> 'ownerId, cars)! .map(('name, 'carName) -> 'sentence) { ! (name: String, car: String) =>! s"Hello $name, your $car is really nice"! }! .project('sentence)! .write(output) val people = IterableSource(! (1, "hans") ::! (2, "bob") ::! (3, "hermut") ::! (4, "heinz") ::! (5, "klemens") :: … :: Nil,! ('id, 'name)) val cars = IterableSource(! (99, 1, "bmw") :: ! (123, 2, "mercedes") ::! (240, 11, "other") :: Nil,! ('carId, 'ownerId, 'carName))!
  • 127.
Joins import com.twitter.scalding.FunctionImplicits._! ! people.joinWithLarger('id -> 'ownerId, cars)! .map(('name, 'carName) -> 'sentence) { ! (name: String, car: String) =>! s"Hello $name, your $car is really nice"! }! .project('sentence)! .write(output) Hello hans, your bmw is really nice! Hello bob, your mercedes is really nice! val people = IterableSource(! (1, "hans") ::! (2, "bob") ::! (3, "hermut") ::! (4, "heinz") ::! (5, "klemens") :: … :: Nil,! ('id, 'name)) val cars = IterableSource(! (99, 1, "bmw") :: ! (123, 2, "mercedes") ::! (240, 11, "other") :: Nil,! ('carId, 'ownerId, 'carName))!
  • 128.
“map-side” join that.joinWithTiny('id1 -> 'id2, tinyPipe) Choose this when: Left > max(mappers, reducers) * Right! or: when the Left side is 3 orders of magnitude larger.
  • 129.
Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output"))
  • 130.
Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output")) 1. Sample from the left and right pipes with some small probability,
 in order to determine approximately how often each join key appears in each pipe.
  • 131.
Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output")) 1. Sample from the left and right pipes with some small probability,
 in order to determine approximately how often each join key appears in each pipe. 2. Use these estimated counts to replicate the join keys, 
 according to the given replication strategy.
  • 132.
Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output")) 1. Sample from the left and right pipes with some small probability,
 in order to determine approximately how often each join key appears in each pipe. 2. Use these estimated counts to replicate the join keys, 
 according to the given replication strategy. 3. Join the replicated pipes together.
  • 133.
Where did my type-safety go?!
  • 134.
Where did my type-safety go?! Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))!
  • 135.
Where did my type-safety go?! Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))! Caused by: cascading.flow.FlowException: local step failed at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81) at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:34) at cascading.flow.stream.SourceStage.map(SourceStage.java:102) at cascading.flow.stream.SourceStage.call(SourceStage.java:53) at cascading.flow.stream.SourceStage.call(SourceStage.java:38) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NumberFormatException: For input string: "bob" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29)
  • 136.
Where did my type-safety go?! Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))! Caused by: cascading.flow.FlowException: local step failed at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81) at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:34) at cascading.flow.stream.SourceStage.map(SourceStage.java:102) at cascading.flow.stream.SourceStage.call(SourceStage.java:53) at cascading.flow.stream.SourceStage.call(SourceStage.java:38) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NumberFormatException: For input string: "bob" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29) “oh, right… We changed that file to be user names, not ids…”
  • 137.
Trap it! Tsv(in, ('userId1, 'userId2, 'rel))! .addTrap(Tsv("errors")) // add a trap! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))
  • 138.
Trap it! Tsv(in, ('userId1, 'userId2, 'rel))! .addTrap(Tsv("errors")) // add a trap! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out)) solves “dirty data”, no help for maintenance
  • 139.
  • 140.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))!
  • 141.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))!
  • 142.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! Must give Type to each Field
  • 143.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))!
  • 144.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))! Tuple arity: 2 Tuple arity: 3
  • 145.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2 at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176) TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))! Tuple arity: 2 Tuple arity: 3
  • 146.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2 at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176) TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))! Tuple arity: 2 Tuple arity: 3 “planning-time” exception
  • 147.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date)! .filter { _._1 == "bob" }! .write(TypedTsv(out))! ! }
  • 148.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date)! .filter { _._1 == "bob" }! .write(TypedTsv(out))! ! } Easier to reuse schemas now
  • 149.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date)! .filter { _._1 == "bob" }! .write(TypedTsv(out))! ! } Easier to reuse schemas now Not coupled by Field names, but still too magic for reuse… “_1”?
  • 150.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date) ! .filter { p: Person => p.name == "bob" }! .write(TypedTsv(out))! ! }
  • 151.
Typed APIs Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date) ! .filter { p: Person => p.name == "bob" }! .write(TypedTsv(out))! ! } TypedPipe[Person]
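The slides never define Person or userRelationships; a minimal shape they might have, purely as an assumption for readability:

    case class Person(id: Long, name: String, rel: Int)              // hypothetical schema
    def userRelationships(date: String): TypedPipe[Person] = ???     // hypothetical typed source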
  • 152.
Typed Joins case class UserName(id: Long, handle: String)! case class UserFavs(byUser: Long, favs: List[Long])! case class UserTweets(byUser: Long, tweets: List[Long])! ! def users: TypedSource[UserName]! def favs: TypedSource[UserFavs]! def tweets: TypedSource[UserTweets]! ! def output: TypedSink[(UserName, UserFavs, UserTweets)]! ! users.groupBy(_.id)! .join(favs.groupBy(_.byUser))! .join(tweets.groupBy(_.byUser))! .map { case (uid, ((user, favs), tweets)) =>! (user, favs, tweets)! } ! .write(output)!
  • 153.
  • 154.
Typed Joins case class UserName(id: Long, handle: String)! case class UserFavs(byUser: Long, favs: List[Long])! case class UserTweets(byUser: Long, tweets: List[Long])! ! def users: TypedSource[UserName]! def favs: TypedSource[UserFavs]! def tweets: TypedSource[UserTweets]! ! def output: TypedSink[(UserName, UserFavs, UserTweets)]! ! users.groupBy(_.id)! .join(favs.groupBy(_.byUser))! .join(tweets.groupBy(_.byUser))! .map { case (uid, ((user, favs), tweets)) =>! (user, favs, tweets)! } ! .write(output)! 3-way-merge in 1 MR step
  • 155.
> run pl.project13.oculus.job.WordCountJob! --local --tool.graph --input in --output out! ! writing DOT: ! pl.project13.oculus.job.WordCountJob0.dot! ! writing Steps DOT: ! pl.project13.oculus.job.WordCountJob0_steps.dot Do the DOT
  • 156.
  • 157.
> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot! (rendered flow graph pictured) Do the DOT
  • 158.
> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot! (rendered flow graph pictured, MAP stage highlighted) Do the DOT
  • 159.
> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot! (rendered flow graph pictured, MAP and REDUCE stages highlighted) Do the DOT
  • 160.
  • 161.
  • 162.
class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 2)! }! .run! .finish! }! ! }! <3 Testing
  • 163.
  • 164.
  • 165.
  • 166.
  • 167.
  • 168.
  • 169.
  • 170.
class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 2)! }! .runHadoop! .finish! }! ! }! <3 Testing run || runHadoop
  • 172.
  • 173.
“Parallelize all the batches!” Feels much like Scala collections
  • 174.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading
  • 175.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps
  • 176.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to
  • 177.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala
  • 178.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly
  • 179.
  • 180.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly Matrix API
  • 181.
“Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly Matrix API Efficient columnar storage (Parquet)
  • 182.
Scalding Re-Cap ! ! ! ! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! !
  • 183.
Scalding Re-Cap ! ! ! ! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! ! (4 lines!)
  • 184.
! ! ! ! ! $ activator new activator-scalding! ! Try it! http://typesafe.com/activator/template/activator-scalding Template by Dean Wampler
  • 185.
Loads Of Links
1. http://parleys.com/play/51c2e0f3e4b0ed877035684f/chapter0/about
2. https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/ReduceOperations.scala
3. http://www.slideshare.net/johnynek/scalding?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=4
4. http://www.slideshare.net/Hadoop_Summit/severs-june26-255pmroom210av2?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=3
5. http://www.slideshare.net/LivePersonDev/scalding-reaching-efficient-mapreduce?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=2
6. http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
7. http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
8. https://engineering.twitter.com/university/videos/why-scalding-is-important-for-data-science
9. https://github.com/parquet/parquet-format
10. http://www.slideshare.net/ktoso/scalding-hadoop-word-count-in-less-than-60-lines-of-code
11. https://github.com/scalaz/scalaz
12. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
  • 186.